Variational Analysis in the Wasserstein Space

Project description

We study optimization problems in which the optimization variable is a probability measure. Since the space of probability measures is not a vector space, many classical and powerful optimization tools (e.g., gradients) are of little help. One thus typically resorts to the abstract machinery of infinite-dimensional analysis or to other ad hoc methodologies that are not tailored to the probability space and, moreover, involve projections or rely on convexity-type assumptions. We believe instead that these problems call for a comprehensive methodological framework for calculus in probability spaces. In this work, we combine ideas from optimal transport, variational analysis, and Wasserstein gradient flows to equip the Wasserstein space (i.e., the space of probability measures endowed with the Wasserstein distance) with a variational structure, both by combining and extending existing results and by introducing novel tools. Our theoretical analysis culminates in very general necessary conditions for optimality. Notably, our conditions (i) resemble their Euclidean counterparts, such as the Karush-Kuhn-Tucker and Lagrange conditions, (ii) are intuitive, informative, and easy to study, and (iii) yield either closed-form solutions or computationally attractive algorithms. We believe this framework lays the foundation for new algorithmic and theoretical advances in the study of optimization problems in probability spaces, which we exemplify with numerous case studies and applications to machine learning, drug discovery, and distributionally robust optimization.

Problem formulation

It is quite simple: This work considers optimization problems of the form

$$ \definecolor{bluetteLight}{RGB}{173,216,230} \definecolor{bluetteDark}{RGB}{0,105,148} \definecolor{redLight}{RGB}{240,128,128} \definecolor{redDark}{RGB}{128,0,0} \definecolor{orangeLight}{RGB}{255,200,0} \definecolor{orangeDark}{RGB}{255,140,0} \definecolor{ETHblue}{RGB}{33,92,175}% blue \definecolor{ETHpetrol}{RGB}{0,120,148} % petrol \definecolor{ETHgreen}{RGB}{98,115,19}% green \definecolor{ETHbronze}{RGB}{142,103,19}% bronze \definecolor{ETHred}{RGB}{183,53,45}% red \definecolor{ETHpurple}{RGB}{163,7,116} % purple \definecolor{ETHgray}{RGB}{111,111,111}% gray \newcommand{\varGeneric}{\mu} \newcommand{\feasibleSet}{\mathcal{C}} \newcommand{\reals}{\mathbb{R}} \newcommand{\realsBar}{\bar{\mathbb{R}}} \newcommand{\Pp}[2]{\mathcal{P}_{#1}(#2)} \newcommand{\st}{\;|\;} \newcommand{\norm}[1]{||#1||} \newcommand{\gradient}{\nabla} \newcommand{\innerProduct}[2]{\langle #1, #2 \rangle} \newcommand{\varOptimal}{\mu^\ast} \newcommand{\varReference}{\hat\mu} \newcommand{\localZero}[1]{0_{#1}} \newcommand{\subgradient}[1]{\partial{#1}} \newcommand{\normalCone}[2]{\mathrm{N}_{#1}(#2)} \newcommand{\wassersteinDistance}[3]{W_{#3}(#1,#2)} \newcommand{\closedWassersteinBall}[3]{\mathrm{B}_{\wassersteinDistance{}{}{#3}}(#1;#2)} \newcommand{\objective}{\mathcal{J}} \newcommand{\expectedValue}[2]{\mathbb{E}^{#1}\left[#2\right]} \inf_{\varGeneric \in \feasibleSet}\, \objective(\varGeneric), $$
where \(\feasibleSet \subseteq \Pp{}{\reals^d}\) is a set of admissible probability measures and \(\objective: \Pp{}{\reals^d} \to \realsBar\) is a functional to minimize. This abstract problem setting stems from the observation that numerous fields, including machine learning, robust optimization, and biology, tackle their own version of this problem, but with ad hoc methods that often cease to be effective as soon as the problem structure deviates from idealized assumptions.
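To make the setting concrete, here is a minimal sketch (ours, not code from the paper) of what these ingredients look like when \(\varGeneric\) is approximated by an empirical measure, i.e., a cloud of equally weighted particles: an expected-value objective becomes a sample average over the particles, and membership in a moment-type constraint set is a simple inequality check. The objective and constraint below are the ones used in the figures further down; the sample size and tolerance are illustrative.

```python
import numpy as np

# Illustrative sketch: represent mu as an empirical measure (equally weighted
# particles in R^2) and evaluate an expected-value objective and a moment constraint.
rng = np.random.default_rng(0)
particles = rng.normal(size=(500, 2))                # mu = (1/n) * sum_i delta_{x_i}

objective = particles.sum(axis=1).mean()             # J(mu) = E_mu[x_1 + x_2]
second_moment = (particles ** 2).sum(axis=1).mean()  # E_mu[x_1^2 + x_2^2]

eps = 2.0                                            # illustrative tolerance
is_feasible = second_moment <= eps ** 2              # mu in C = {mu : E_mu[||x||^2] <= eps^2}
print(objective, second_moment, is_feasible)
```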

Main result: Gradients are aligned with the constraints at optimality

Our main result is a set of general first-order optimality conditions for the optimization problem above:

If \(\varOptimal \in \Pp{}{\reals^d}\) is an optimal solution with finite second moment and a constraint qualification holds, then the "Wasserstein subgradients" are "aligned" with the constraints at optimality, i.e.,

$$ \localZero{\varOptimal} \in \subgradient{\objective}(\varOptimal) + \normalCone{\feasibleSet}{\varOptimal}, $$
where \(\subgradient{\objective}(\varOptimal)\) is the "Wasserstein subgradient" of \(\objective\) at \(\varOptimal\), \(\normalCone{\feasibleSet}{\varOptimal}\) is the "Wasserstein normal cone" of \(\feasibleSet\) at \(\varOptimal\), and \(\localZero{\varOptimal}\) is a "null Wasserstein tangent vector" at \(\varOptimal\).

To get some intuition, let's consider the following two pictures.

The figure on the left depicts two empirical candidate solutions \( \textcolor{blue}{\varGeneric_1}\) and \(\textcolor{orangeDark}{\varGeneric_2}\) for the generic optimization problem with \(\feasibleSet = \{\varGeneric\in\Pp{2}{\reals^2}\st\expectedValue{(x_1, x_2)\sim\varGeneric}{x_1^2+x_2^2}\leq \varepsilon^2\}\) (i.e., second moment bounded by \(\varepsilon^2\)) and \(\objective(\varGeneric) = \expectedValue{(x_1, x_2) \sim \varGeneric}{x_1+x_2}\), of which we show the contours and the gradient vector field. The solid (resp., dashed) black arrows represent the gradient of the constraint function \(\expectedValue{(x_1, x_2)\sim\varGeneric}{x_1^2+x_2^2}- \varepsilon^2\) at \(\textcolor{blue}{\varGeneric_1}\) (resp., \(\textcolor{orangeDark}{\varGeneric_2}\)). Here, \(\textcolor{blue}{\varGeneric_1}\) is indeed a candidate optimal solution: The gradient of the objective is aligned with the gradient of the constraint. For \(\textcolor{orangeDark}{\varGeneric_2}\), instead, these two are not aligned. Thus, \(\textcolor{orangeDark}{\varGeneric_2}\) cannot be optimal.
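To make the alignment test concrete, here is a small numerical sketch (our construction, not code from the paper) for the example in the left figure: at every particle, the gradient of the objective integrand, \((1, 1)\), must equal \(-\lambda\) times the gradient of the constraint integrand, \(2x\), for one multiplier \(\lambda \geq 0\) shared across the whole support. The function names and the two candidate measures are illustrative.

```python
import numpy as np

# Gradients of the integrands in the left figure (our naming, for illustration).
grad_objective = lambda x: np.ones_like(x)   # gradient of x1 + x2 at each particle
grad_constraint = lambda x: 2.0 * x          # gradient of x1^2 + x2^2 at each particle

def aligned(particles, tol=1e-8):
    """Check that grad_objective + lambda * grad_constraint = 0 at every particle,
    for a single multiplier lambda >= 0 shared across the whole support."""
    g, h = grad_objective(particles), grad_constraint(particles)
    lam = -g[:, 0] / h[:, 0]                 # candidate multiplier from the first coordinate
    residual = g + lam[:, None] * h
    return bool(np.all(lam >= 0)
                and np.allclose(residual, 0.0, atol=tol)
                and np.allclose(lam, lam[0], atol=tol))

eps = 1.0
# All mass at -eps * (1, 1) / sqrt(2): feasible, constraint active, gradients anti-parallel.
mu_1 = np.tile([-eps / np.sqrt(2), -eps / np.sqrt(2)], (5, 1))
# A generic particle cloud: gradients are (almost surely) not aligned.
mu_2 = np.random.default_rng(1).normal(size=(5, 2))

print(aligned(mu_1))   # True: mu_1 is a candidate optimal solution
print(aligned(mu_2))   # False: mu_2 cannot be optimal
```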

The figure on the right shows that \(\textcolor{blue}{\varOptimal}\) satisfies the optimality condition for the generic optimization problem with \(\feasibleSet = \closedWassersteinBall{\textcolor{red}{\varReference}}{\varepsilon}{2}=\{\varGeneric\in\Pp{2}{\reals^2}\st\wassersteinDistance{\varGeneric}{\textcolor{red}{\varReference}}{2}\leq\varepsilon\}\) (i.e., the Wasserstein ball centered at \(\textcolor{red}{\varReference}\) of radius \(\varepsilon\)) and \(\objective(\varGeneric) = \expectedValue{(x_1, x_2) \sim \varGeneric}{x_1 + x_2}\), of which the contours and the gradient vector field are shown. The black arrows connecting \(\textcolor{red}{\varReference}\) and \(\textcolor{blue}{\varOptimal}\) represent the gradient of the constraint function \(\wassersteinDistance{\varGeneric}{\textcolor{red}{\varReference}}{2}^2-\varepsilon^2\) at \(\textcolor{blue}{\varOptimal}\). Since \(\textcolor{blue}{\varOptimal}\) is optimal, the gradients of the objective and of the constraint are aligned at all the "particles" of \(\textcolor{blue}{\varOptimal}\).
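The same check can be reproduced numerically for the right figure. For two equally weighted empirical measures, the squared Wasserstein-2 distance reduces to an optimal assignment problem, and its Wasserstein gradient at a particle \(x\) of \(\varGeneric\) can be written as \(2(x - T(x))\), with \(T\) the optimal transport map from \(\varGeneric\) to \(\textcolor{red}{\varReference}\). The sketch below (ours, not code from the paper) builds a candidate \(\textcolor{blue}{\varOptimal}\) by translating every particle of \(\textcolor{red}{\varReference}\) by \(-\varepsilon\theta/\norm{\theta}\) with \(\theta = \begin{bmatrix}1 & 1\end{bmatrix}^\top\) and verifies that the alignment holds at every particle.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def ot_map(mu, mu_hat):
    """Optimal transport map between two equally weighted particle clouds,
    computed as an optimal assignment for the squared Euclidean cost."""
    cost = cdist(mu, mu_hat, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)
    T = np.empty_like(mu)
    T[rows] = mu_hat[cols]
    return T

theta = np.array([1.0, 1.0])                    # gradient of the objective x1 + x2
eps = 0.5
rng = np.random.default_rng(0)
mu_hat = rng.normal(size=(20, 2))               # center of the Wasserstein ball

# Candidate solution: push every particle of mu_hat by -eps * theta / ||theta||.
mu_star = mu_hat - eps * theta / np.linalg.norm(theta)

# Alignment check: theta + 2 * lam * (x - T(x)) = 0 at every particle of mu_star,
# with the multiplier lam = ||theta|| / (2 * eps), as in the Euclidean analogue.
lam = np.linalg.norm(theta) / (2.0 * eps)
residual = theta + 2.0 * lam * (mu_star - ot_map(mu_star, mu_hat))
print(np.allclose(residual, 0.0))               # True: aligned at all particles
```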

Are these conditions hard to use?

We illustrate our optimality conditions by informally studying a simple and accessible version of the general optimization problem. For \(\theta \neq 0\) and \(\varepsilon > 0\), consider the problem

$$ \inf_{\varGeneric \in \Pp{2}{\reals^d}}\; \expectedValue{x \sim \varGeneric}{\innerProduct{\theta}{x}} \qquad\text{subject to}\qquad \expectedValue{x \sim \varGeneric}{\norm{x}^2} \leq \varepsilon^2, $$
depicted in the figure above on the left for \(\theta = \begin{bmatrix} 1 & 1 \end{bmatrix}^\top\). To get some intuition, let us restrict our attention to Dirac deltas of the form \(\delta_x\) for \(x\in\reals^d\). Accordingly, the optimization problem reduces to \( \inf_{\norm{x}^2 \leq \varepsilon^2} \innerProduct{\theta}{x}, \) which can be studied through standard first-order optimality conditions in Euclidean spaces. Since the gradient of the objective \(\gradient_x\innerProduct{\theta}{x} = \theta\) never vanishes, the optimal solution (if it exists) lies on the boundary. We thus seek a Lagrange multiplier \(\lambda > 0\) such that
$$ 0 = \gradient_x\innerProduct{\theta}{x} + \lambda\gradient_x\norm{x}^2 = \theta + 2\lambda x \quad\text{and}\quad \varepsilon^2 = \norm{x}^2, $$
which yields \(x = -\varepsilon\frac{\theta}{\norm{\theta}}\) and \(\lambda = \frac{\norm{\theta}}{2\varepsilon}\). Now back to the problem in probability spaces: After basic algebraic manipulations, our main result (stated informally above) tells us that any solution \(\varOptimal \in \Pp{2}{\reals^d}\) satisfies the Lagrange-like condition
$$ 0 = \theta + 2\lambda x \quad\varOptimal\text{-a.e.} \quad\text{and}\quad \varepsilon^2 = \expectedValue{x \sim \varOptimal}{\norm{x}^2} = \frac{\norm{\theta}^2}{4\lambda^2}, $$
for some \(\lambda \geq 0\) constant across the support of \(\varOptimal\); cf. the figure above on the left for \(\theta = \begin{bmatrix} 1 & 1 \end{bmatrix}^\top\). We conclude that the mass of any candidate solution is necessarily located at \(-\varepsilon\frac{\theta}{\norm{\theta}}\). In particular, our optimality condition mirrors its counterpart on \(\reals^d\).
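As a quick numerical cross-check (ours, and only a heuristic), one can run a projected "particle gradient descent" on this example: repeatedly move every particle along \(-\theta\) and, whenever the second-moment constraint is violated, rescale the particle cloud to restore feasibility (a simple heuristic, not necessarily the exact Wasserstein projection). The particles indeed collapse onto \(-\varepsilon\frac{\theta}{\norm{\theta}}\), in agreement with the optimality condition above.

```python
import numpy as np

theta = np.array([1.0, 1.0])
eps = 1.0
rng = np.random.default_rng(0)
particles = rng.normal(size=(200, 2))        # initial empirical measure

step = 0.05
for _ in range(2000):
    particles = particles - step * theta     # gradient step on E_mu[<theta, x>]
    m2 = (particles ** 2).sum(axis=1).mean() # E_mu[||x||^2]
    if m2 > eps ** 2:                        # restore feasibility by rescaling
        particles *= eps / np.sqrt(m2)

x_star = -eps * theta / np.linalg.norm(theta)
print(np.allclose(particles, x_star, atol=1e-2))   # True: all mass collapses onto x*
```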

For more examples and applications, check out the paper!

BibTeX

@article{lanzetti2024variational,
    author = {Lanzetti, Nicolas and Terpin, Antonio and D\"orfler, Florian},
    title = {Variational Analysis in the Wasserstein Space},
    journal = {arXiv preprint arXiv:2406.10676},
    year = {2024}
}