Wasserstein Gradient Flows

Once $\Wass_2$ is viewed as a dynamic metric, one can run gradient descent directly on the space of measures. This chapter derives the formal Wasserstein gradient, explains the JKO minimizing-movement scheme, records the role of geodesic convexity in convergence, and then applies the same calculus to mean-field neural-network training.

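from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

# Locate the directory containing ot4ml_web.py by trying a few likely
# locations relative to the current working directory.
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        # Make the helper module importable from this notebook.
        sys.path.insert(0, str(myst_dir))
        break

if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

# Pre-rendered figure thumbnails live next to the myst/ directory.
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"

def show_book_figure(name, width=760):
    """Display the pre-rendered thumbnail `<name>.png` at the given pixel width."""
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))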

Minimizing Movements and Wasserstein Gradients

This first section explains how a variational implicit-Euler step on measures gives rise, in the small-step limit, to a continuity equation driven by the Wasserstein gradient of the energy.

We consider a functional $f(\alpha)$ on probability measures and seek an evolution $(\alpha_t)_t$ along which it decreases. The minimizing-movement strategy over a metric space builds a discrete-time evolution using an implicit Euler scheme:

\alpha_{t+\tau} \in \uargmin{\alpha\in\Pp_2(\RR^d)} \frac{1}{2\tau}\Wass_2(\alpha_t,\alpha)^2+f(\alpha).

(1)

Euclidean Gradient Flows

If (1) is restricted to finite dimensions with $\alpha_t=\delta_{x(t)}$ and $\alpha=\delta_x$ , it becomes the implicit Euler scheme

x(t+\tau) \in \uargmin{x} \frac{1}{2\tau}\norm{x-x(t)}^2+h(x), \qquad h(x)=f(\delta_x).

(2)

When the minimizer is unique and $h$ is differentiable, its optimality condition gives

x(t+\tau)=(\Id+\tau\nabla h)^{-1}(x(t)).

(3)

In contrast, explicit Euler uses

x(t+\tau)=(\Id-\tau\nabla h)(x(t))=x(t)-\tau\nabla h(x(t)).

(4)

For smooth $h$, both schemes converge as $\tau\to0$ to the classical gradient flow

\dot x(t)=-\nabla h(x(t)).

(5)
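As a quick numerical illustration of (3)-(5), the sketch below compares the two schemes on the quadratic $h(x)=\frac{\lambda}{2}x^2$, for which the implicit update has the closed form $x/(1+\tau\lambda)$. The choice of $h$, the function names and the parameter values are assumptions made for this example only.

```python
import math

def implicit_euler(x0, lam, tau, steps):
    """Iterate x <- (Id + tau * grad h)^{-1}(x) for h(x) = lam * x**2 / 2."""
    x = x0
    for _ in range(steps):
        x = x / (1.0 + tau * lam)
    return x

def explicit_euler(x0, lam, tau, steps):
    """Iterate x <- x - tau * grad h(x) for the same quadratic h."""
    x = x0
    for _ in range(steps):
        x = x - tau * lam * x
    return x

lam, T, x0 = 4.0, 1.0, 1.0
exact = x0 * math.exp(-lam * T)          # gradient-flow solution at time T
for steps in (10, 100, 1000):
    tau = T / steps
    print(steps, implicit_euler(x0, lam, tau, steps),
          explicit_euler(x0, lam, tau, steps), exact)

# With a large step the implicit scheme still contracts, while the explicit
# scheme diverges because |1 - tau * lam| > 1.
big_tau = 1.0
stable = implicit_euler(x0, lam, big_tau, 20)
unstable = explicit_euler(x0, lam, big_tau, 20)
print(stable, unstable)
```

Both schemes approach the exact value as the step shrinks, but only the implicit one remains stable for large steps, which is one practical reason the minimizing-movement construction is built on it.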

Wasserstein Gradient Formula

The implicit Euler scheme has the advantage that it does not require $h$ or $f$ to be smooth. For $f$, this is crucial because the evolving measures may have densities, atoms, or other singular parts.

As $\tau\to0$ , under suitable conditions on $f$ , (1) defines a continuous evolution $t\mapsto\alpha_t$ . As in the dynamic formulation, this evolution can be described by a Lagrangian evolution. We use the following first-variation convention: for any $\beta\in\Pp(\RR^d)$ and the signed zero-mass perturbation $\eta=\beta-\alpha$ ,

f((1-\tau)\alpha+\tau\beta) = f(\alpha+\tau\eta) = f(\alpha)+ \tau\int[\delta f(\alpha)](x)\d\eta(x) +o(\tau).

(6)

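The key infinitesimal object is the vector field that represents this differential in the Wasserstein metric. We denote it $\Wgrad f(\alpha)$; the proposition below identifies it, formally, with $\nabla\delta f(\alpha)$.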

The associated formal gradient flow is the continuity equation

\frac{\partial\alpha_t}{\partial t} +\operatorname{div}(-\Wgrad f(\alpha_t)\alpha_t)=0.

(8)

The following proposition explains why this vector field is the Riemannian gradient for the $L^2(\alpha)$ metric on velocities.

Proposition: Formal Wasserstein Gradient

Assume that $f$ admits a smooth first variation $\delta f(\alpha)$ , that $\alpha$ has a smooth positive density, and that $v$ is regular and compactly supported, or satisfies the corresponding no-flux boundary condition. For infinitesimal perturbations generated by $v$ through $(\Id+\tau v)_\sharp\alpha$ , the differential of $f$ is

\frac{\d}{\d\tau}_{|\tau=0} f\big((\Id+\tau v)_\sharp\alpha\big) = \int \dotp{\nabla\delta f(\alpha)(x)}{v(x)}\d\alpha(x).

(9)

Hence, for the Riemannian metric $\norm{v}_{L^2(\alpha)}^2=\int\norm{v}^2\d\alpha$ , the Wasserstein gradient is the vector field

\Wgrad f(\alpha)=\nabla\delta f(\alpha).

(10)

The Wasserstein gradient-flow viewpoint already appears in John D. Lafferty’s PhD work, published as “The Density Manifold and Configuration Space Quantization”, under the name “density manifold”. It was then systematically developed by Otto, who exposed the formal Riemannian structure of this space (Otto, 2001). Rigorous metric-space treatments and numerical JKO schemes can be found in (Ambrosio et al., 2006; Benamou et al., 2016; Peyré, 2015; Gallouët & Monsaingeon, 2017).

From the JKO Step to the Velocity Field

A first-order expansion of the JKO step explains why (8) uses the vector field $\Wgrad f(\alpha)$ . Write (1) as a minimization over displacement fields $v$ such that $\alpha=(\Id+\tau v)_\sharp\alpha_t$ :

\min_v \frac{1}{2\tau}\tau^2\norm{v}_{L^2(\alpha_t)}^2 + f((\Id+\tau v)_\sharp\alpha_t).

(14)

The push-forward and energy expansions are

(\Id+\tau v)_\sharp\alpha_t = \alpha_t-\tau\operatorname{div}(v\alpha_t)+o(\tau),

(15)

f((\Id+\tau v)_\sharp\alpha_t) = f(\alpha_t) -\tau\int\delta f(\alpha_t)\operatorname{div}(v\alpha_t)\d x +o(\tau)

(16)

and hence

f((\Id+\tau v)_\sharp\alpha_t) = f(\alpha_t) + \tau\int \dotp{\nabla_x\delta f(\alpha_t)(x)}{v(x)} \d\alpha_t(x) +o(\tau).

(17)

Thus the problem minimized in (1) has the first-order expansion

\min_v f(\alpha_t) + \tau\int \left[ \frac12\norm{v(x)}^2 + \dotp{\Wgrad f(\alpha_t)(x)}{v(x)} \right] \d\alpha_t(x) +o(\tau).

(18)

The pointwise minimizer is $v=-\Wgrad f(\alpha_t)$ , which gives the velocity in the continuity equation.
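The pointwise minimization in (18) can be made explicit by completing the square. The display below is a formal sketch under the smoothness assumptions of the proposition; the shorthand $g=\Wgrad f(\alpha_t)(x)$ is introduced only for this computation.

```latex
\frac12\norm{v}^2+\dotp{g}{v}
= \frac12\norm{v+g}^2-\frac12\norm{g}^2
\quad\Longrightarrow\quad
v^\star=-g,
\qquad
\min_v\,(18) = f(\alpha_t)-\frac{\tau}{2}\norm{\Wgrad f(\alpha_t)}_{L^2(\alpha_t)}^2+o(\tau),
\qquad
f(\alpha_{t+\tau}) = f(\alpha_t)-\tau\norm{\Wgrad f(\alpha_t)}_{L^2(\alpha_t)}^2+o(\tau).
```

The last identity follows by inserting $v^\star$ into (17): to first order, the energy decreases at rate $\norm{\Wgrad f(\alpha_t)}_{L^2(\alpha_t)}^2$, as in Euclidean gradient descent.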

Metric Derivative and Curves of Maximal Slope

The same descent principle admits a coordinate-free formulation, which is the right language for limits of JKO schemes. Let $(\mathcal X,d)$ be a metric space and let $x:(0,T)\to\mathcal X$ be an absolutely continuous curve. Its metric derivative is

|\dot x_t| \eqdef \lim_{h\to0}\frac{d(x_{t+h},x_t)}{|h|}

(19)

for a.e. $t$ . Equivalently, $|\dot x_t|$ is the smallest $g\in L^1_{\rm loc}(0,T)$ such that $d(x_s,x_t)\leq\int_s^t g(r)\d r$ for all $0<s<t<T$ . If $f:\mathcal X\to(-\infty,+\infty]$ is lower semicontinuous, its local metric slope at a point $x$ with $f(x)<+\infty$ is

|\partial f|(x) \eqdef \limsup_{\substack{y\to x\\ y\neq x}} \frac{(f(x)-f(y))_+}{d(x,y)} .

(20)

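A curve of maximal slope for $f$ is an absolutely continuous curve $(x_t)_t$ satisfying the energy-dissipation inequality

f(x_t) + \frac12\int_s^t|\dot x_r|^2\d r + \frac12\int_s^t|\partial f|^2(x_r)\d r \leq f(x_s), \qquad 0\leq s<t .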

(21)

This is the metric-space formulation of gradient descent (Ambrosio et al., 2006); the inequality is stable under weak limits, while in smooth Hilbert or Wasserstein settings it is usually saturated. In $\mathcal X=\Pp_2(\RR^d)$ with $d=\Wass_2$, an absolutely continuous curve admits a continuity-equation velocity $v_t$ and satisfies $|\dot\alpha_t|_{\Wass_2}\leq\norm{v_t}_{L^2(\alpha_t)}$, with equality for the minimal velocity. For smooth energies, $|\partial f|(\alpha_t)=\norm{\Wgrad f(\alpha_t)}_{L^2(\alpha_t)}$, so the maximal-slope condition reduces to the velocity-field equation $v_t=-\Wgrad f(\alpha_t)$ derived above. The energy consequences of this viewpoint are used below in Section Geodesic Convexity and Convergence.
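To see the inequality (21) saturate in the simplest smooth case, the sketch below evaluates both sides along the Euclidean gradient flow of $f(x)=x^2/2$, whose solution $x_t=x_0e^{-t}$ is known in closed form. The energy, the time window and the helper `trapezoid` are assumptions chosen for this illustration.

```python
import numpy as np

def trapezoid(y, t):
    """Composite trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def f(z):
    """Quadratic test energy f(x) = x**2 / 2."""
    return 0.5 * z**2

# Euclidean gradient flow of f: x_t = x0 * exp(-t), observed on [s, t_end].
x0, s, t_end = 2.0, 0.3, 1.5
t = np.linspace(s, t_end, 20001)
x = x0 * np.exp(-t)

speed = np.abs(-x)   # metric derivative |x'_t| (here x'_t = -x_t)
slope = np.abs(x)    # local slope |df|(x_t) = |f'(x_t)|

lhs = f(x[-1]) + 0.5 * trapezoid(speed**2, t) + 0.5 * trapezoid(slope**2, t)
rhs = f(x[0])
print(lhs, rhs)      # the two sides agree up to quadrature error
```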

We now detail examples of such Wasserstein gradient flows.

The figure below shows the same minimizing-movement construction both in density coordinates and through the motion of quantiles.

JKO minimizing movements for the entropy flow in one dimension. The left panel displays successive implicit-Euler minimizers for the heat equation, colored from red to blue. The right panel tracks inverse CDF values $Q_t(s)=F_t^{-1}(s)$ for selected probability levels $s$ , giving a Lagrangian view of the proximal movement in Wasserstein space.

The interactive demo uses the heat-flow representative of the entropy JKO scheme: changing the step size changes the spacing between implicit Euler iterates, while the quantile panel shows how the same movement is seen in Lagrangian coordinates.

Interactive panel. Use the step size and iteration controls to inspect the JKO scheme as successive implicit steps of the entropy gradient flow.

Discrete Evolutions

If $f(\alpha)$ can be evaluated on discrete distributions and the vector field $\Wgrad f(\alpha)$ is continuous in this case, the flow (8) preserves the number of Dirac masses:

\alpha_t=\frac1n\sum_i\delta_{x_i(t)}.

(22)

The particles $X(t)=(x_i(t))_i$ evolve according to the coupled ODE

\dot x_i(t)=-n\nabla_{x_i}F(X(t)),

(23)

where $F(X)=f\left(\frac1n\sum_i\delta_{x_i}\right)$ . The factor $n$ comes from the empirical Wasserstein metric $\frac1n\sum_i\norm{\dot x_i}^2$ .
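The factor $n$ in (23) can be checked numerically. The sketch below takes a linear energy $f(\alpha)=\int h\d\alpha$ with the test potential $h(x)=\norm{x}^4/4$ (an arbitrary smooth choice, not taken from the text) and compares $-n\nabla_{x_i}F$, computed by finite differences, with $-\nabla h(x_i)$. All names are hypothetical.

```python
import numpy as np

def grad_h(X):
    """Gradient of the test potential h(x) = ||x||^4 / 4, applied row by row."""
    return np.sum(X**2, axis=-1, keepdims=True) * X

def energy(X):
    """F(X) = f(empirical measure of X) for the linear energy f(alpha) = int h d(alpha)."""
    return np.mean(np.sum(X**2, axis=1) ** 2 / 4.0)

def finite_difference_grad(func, X, eps=1e-6):
    """Central finite differences of func with respect to every particle coordinate."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (func(X + E) - func(X - E)) / (2.0 * eps)
    return G

rng = np.random.default_rng(0)
n, d = 5, 2
X = rng.normal(size=(n, d))

velocity_from_F = -n * finite_difference_grad(energy, X)   # right-hand side of (23)
velocity_from_h = -grad_h(X)                               # -Wgrad f at the particles
gap = np.max(np.abs(velocity_from_F - velocity_from_h))
print(gap)   # small: the two velocities coincide up to finite-difference error
```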

Linear Functionals

The simplest example is a linear functional

f(\alpha)=\int h(x)\d\alpha(x).

(24)

Here $\delta f(\alpha)=h$ is independent of $\alpha$ . The flow (8) becomes

\frac{\partial\alpha_t}{\partial t} + \operatorname{div}(-\nabla h\,\alpha_t)=0.

(25)

Thus particles move independently according to the usual gradient flow (5).

Gradient Flows under Density Constraints

The JKO viewpoint also handles nonsmooth energies. A basic example is a linear-in-measure energy, generated by a potential, together with a hard upper bound on the density,

f_\kappa(\rho\d x) = \int_\Omega h(x)\rho(x)\d x + \iota_{\mathcal K_\kappa}(\rho\d x), \qquad \mathcal K_\kappa = \left\{ \rho\d x\in\Pp_2(\Omega)\;:\; 0\leq\rho\leq\kappa\ \text{a.e.} \right\},

(26)

where $\Omega\subset\RR^d$ is convex and $\mathcal K_\kappa$ is assumed nonempty; if $\Omega$ has finite volume, this requires $\kappa|\Omega|\geq1$ . The indicator term is not represented by an ordinary first variation; it contributes the normal cone to $\mathcal K_\kappa$ in the Wasserstein first-order condition. The JKO step becomes the constrained minimization

\alpha^{k+1} \in \underset{\alpha\in\mathcal K_\kappa}{\operatorname{argmin}} \frac{1}{2\tau}\Wass_2^2(\alpha^k,\alpha) + \int_\Omega h\,\d\alpha .

(27)

This is the mechanism behind macroscopic crowd-motion models with congestion, where the desired velocity $-\nabla h$ is projected onto velocities that do not increase the density beyond the maximal value $\kappa$ (Maury et al., 2010; Santambrogio, 2018).

The hard cap is not especially benign numerically: the projection in the JKO step and the pressure field below can be difficult to compute accurately. Its advantage is geometric. The proposition on geodesic convexity of density caps in Section Geodesic Convexity and Convergence shows that $\mathcal K_\kappa$ is geodesically convex, so the indicator of the constraint is harmless for the usual geodesic-convex convergence guarantees.

Formally, the constrained flow contains a pressure field $p_t$ , the Lagrange multiplier of the density cap. With the no-flux boundary condition $\rho_t\nabla(h+p_t)\cdot n=0$ on $\partial\Omega$ , where $n$ is the outward normal, this gives the complementarity system

\partial_t\rho_t = \operatorname{div}\left(\rho_t\nabla(h+p_t)\right), \qquad 0\leq\rho_t\leq\kappa,\qquad p_t\geq0,\qquad p_t(\kappa-\rho_t)=0.

(28)

Equivalently, $v_t=-\nabla h-\nabla p_t$ . The pressure vanishes away from the saturated set and pushes mass away from congested zones where $\rho_t=\kappa$ , keeping the density cap satisfied. For nonsmooth solutions these conditions are read in the variational sense associated with the normal cone to $\mathcal K_\kappa$ . Thus even when the driving potential $\int h\d\alpha$ is linear, the constraint couples distant particles through the pressure field.

The figure below shows how decreasing the cap converts free concentration into a saturated region whose width is fixed by mass conservation.

Density-constrained gradient flow for the attractive quadratic potential $h(x)=\|x\|^2/2$ in one dimension, computed in quantile variables by explicit Lagrangian descent followed by the isotonic projection enforcing $\partial_s q\geq1/\kappa$ . Each panel stacks five times vertically, colored from red to blue. The dashed gray line marks the maximal density $\kappa$ for the constrained runs. A small cap forces the mass to form a wide saturated block, a medium cap allows a narrower congested region, and the unconstrained flow concentrates freely near the minimizer of $h$ .
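The scheme described in the caption can be sketched in a few lines. The code below is one possible reading, not the implementation behind the figure: $n$ equal-mass particles sit at sorted positions $q_i$, the cap $\rho\leq\kappa$ is imposed as the gap constraint $q_{i+1}-q_i\geq 1/(n\kappa)$, and the projection subtracts a linear ramp so that the constraint becomes monotonicity, which a pool-adjacent-violators pass enforces. Function names, step size and particle count are assumptions.

```python
import numpy as np

def isotonic_projection(y):
    """Euclidean projection of y onto nondecreasing sequences (pool adjacent violators)."""
    blocks = []                                   # each block stores [sum, count]
    for value in y:
        blocks.append([float(value), 1])
        # Merge while the previous block mean exceeds the last block mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def project_density_cap(q, kappa):
    """Project sorted quantiles onto {q : q[i+1] - q[i] >= 1 / (n * kappa)}."""
    n = len(q)
    ramp = np.arange(n) / (n * kappa)             # minimal admissible spacing
    return isotonic_projection(q - ramp) + ramp

def capped_flow(q0, kappa, tau=0.05, steps=200):
    """Explicit Lagrangian step for h(x) = x**2 / 2, then projection onto the cap."""
    q = np.sort(q0)
    for _ in range(steps):
        q = q - tau * q                           # descent along -grad h
        q = project_density_cap(q, kappa)
    return q

n, kappa = 50, 1.0
q_final = capped_flow(np.linspace(-2.0, 2.0, n), kappa)
print(q_final[-1] - q_final[0], (n - 1) / (n * kappa))   # saturated block width
```

In this toy run the particles contract until every gap reaches the minimal spacing, so the final configuration is a saturated block whose width is fixed by mass conservation, in line with the behavior described in the caption.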

Interactive panel. Vary the density cap, attraction strength, and final time in a projected one-dimensional model of congestion.

Multi-Species Gradient Flows

Several applications involve several densities living on the same physical space: chemical species, color channels, cell populations or competing phases. Let $\Omega\subset\RR^d$ be the physical domain and assume that the component masses $\mathbf m=(m_1,\ldots,m_p)$ are positive and fixed. A convenient notation is

\Pp_{2,\mathbf m}(\Omega;\RR_+^p) = \left\{ \alpha=(\alpha_1,\ldots,\alpha_p)\;:\; \alpha_i\in\Mm_+(\Omega),\ \alpha_i(\Omega)=m_i,\ \int_\Omega\norm{x}^2\d\alpha_i(x)<+\infty \right\}.

(29)

Since the components are finite measures rather than necessarily probabilities, the Wasserstein distance below is understood after normalization and with the natural homogeneous weight:

\Wass_{2,\oplus}^2(\alpha,\eta) = \sum_{i=1}^p m_i\, \Wass_2^2\!\left(\frac{\alpha_i}{m_i},\frac{\eta_i}{m_i}\right).

(30)

This is the diagonal, or independent-channel, geometry: each species is transported by its own scalar Wasserstein metric, while cross-effects may still enter through the energy or through constraints. In the dynamic language used later for vector-valued measures, this is the special case of a diagonal mobility; see Proposition: Diagonal Positive Vector Benamou--Brenier. The corresponding JKO step for a functional $f$ is

\alpha^{k+1} \in \underset{\alpha\in\Pp_{2,\mathbf m}(\Omega;\RR_+^p)}{\operatorname{argmin}} \frac{1}{2\tau}\Wass_{2,\oplus}^2(\alpha^k,\alpha) + f(\alpha).

(31)

Assume formally that $\alpha_i=\rho_i\d x$ and that $f$ has smooth first variations $\phi_i=\delta f/\delta\rho_i$ . Then the product metric yields

\partial_t\rho_i = \operatorname{div}\left(\rho_i\nabla\phi_i\right), \qquad \phi_i=\frac{\delta f}{\delta\rho_i}, \qquad i=1,\ldots,p.

(32)

The equations are independent only if $f$ is separable. Otherwise the coupling enters through the first variations $\phi_i$, not through the metric tensor. Carlier, Chizat and Laborde (Carlier et al., 2024) use displacement smoothness of entropic OT to prove well-posedness of Wasserstein gradient flows that include multi-species systems. Congestion and shared-density variants are closely related to crowd and traffic models (Maury et al., 2010; Carlier et al., 2008; Santambrogio, 2018).

A second useful model imposes a pointwise composition constraint. Given a fixed reference density $\beta=b\d x$ with total mass $\sum_i m_i$ , set

\mathcal C_\beta = \left\{ \alpha\in\Pp_{2,\mathbf m}(\Omega;\RR_+^p)\;:\; \sum_{i=1}^p\alpha_i=\beta \right\}.

(33)

This describes, for instance, several chemical species whose total material density is prescribed, so that only the composition vector can change. The constrained JKO step is

\alpha^{k+1} \in \underset{\alpha\in\mathcal C_\beta}{\operatorname{argmin}} \frac{1}{2\tau}\Wass_{2,\oplus}^2(\alpha^k,\alpha) + f(\alpha).

(34)

In the smooth-density regime, the formal constrained flow has a common pressure $\lambda_t$ :

\partial_t\rho_i = \operatorname{div}\left(\rho_i\nabla(\phi_i-\lambda_t)\right), \qquad \operatorname{div}(b\nabla\lambda_t) = \operatorname{div}\left(\sum_{i=1}^p\rho_i\nabla\phi_i\right),

(35)

with no-flux boundary conditions $\rho_i\nabla(\phi_i-\lambda_t)\cdot n=0$ . The scalar field $\lambda_t$ is the Lagrange multiplier enforcing $\sum_i\rho_i=b$ at all times: summing the first equations and imposing $\partial_t\sum_i\rho_i=0$ gives the elliptic equation for $\lambda_t$ . For the separable Shannon entropy

f(\alpha) = \sum_{i=1}^p\int_\Omega\rho_i(x)\log\rho_i(x)\d x,

(36)

one has $\phi_i=\log\rho_i+1$ , and the constrained system becomes

\partial_t\rho_i = \Delta\rho_i-\operatorname{div}(\rho_i\nabla\lambda_t), \qquad \operatorname{div}(b\nabla\lambda_t)=\Delta b.

(37)

When $b$ is constant, for example on a periodic box, the pressure can be chosen constant and the system reduces to independent heat equations which nevertheless preserve the pointwise sum $\sum_i\rho_i=b$ . This product geometry should be contrasted with the vector-valued Benamou-Brenier distances of Vector and Matrix-Valued Measures, where the metric itself can couple the components through a non-diagonal mobility.
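The constant-$b$ case is easy to reproduce numerically. The following sketch, with an explicit finite-difference scheme and initial compositions chosen purely for illustration, evolves three species by independent periodic heat equations and checks that the pointwise sum stays equal to $b=1$.

```python
import numpy as np

def heat_step(rho, dt, dx):
    """One explicit finite-difference heat step on a periodic grid (rows = species)."""
    lap = (np.roll(rho, 1, axis=1) - 2.0 * rho + np.roll(rho, -1, axis=1)) / dx**2
    return rho + dt * lap

m, p = 128, 3
dx = 1.0 / m
x = np.arange(m) * dx

# Positive initial compositions, normalized so that the species sum to b = 1 pointwise.
raw = np.stack([1.5 + np.sin(2.0 * np.pi * (k + 1) * x + k) for k in range(p)])
rho = raw / raw.sum(axis=0, keepdims=True)
masses0 = rho.sum(axis=1) * dx

dt = 0.25 * dx**2            # within the explicit stability bound dt <= dx**2 / 2
for _ in range(2000):
    rho = heat_step(rho, dt, dx)

sum_error = np.max(np.abs(rho.sum(axis=0) - 1.0))
print(sum_error)             # the pointwise total stays at b = 1
print(rho.min())             # each density remains positive
```

Because the discrete Laplacian is linear and annihilates constants, the sum constraint is preserved without computing any pressure, which is the discrete counterpart of the statement above.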

The figure below visualizes this constant-total-density case for two, three, and five species.

Multi-species entropy flow under the shared-density constraint $\sum_i\rho_i=1$ on the periodic interval. Each row is a time snapshot of the stacked densities, with the species shown as colored bands and time increasing from top to bottom. For a constant total density, the pressure in (37) is constant, so each component follows the heat equation while the sum of the bands remains exactly flat.

Interactive panel. Change the number of species and diffusion strength while preserving the pointwise total density.

Shannon Neg-Entropy

A very different behavior is obtained by considering functionals that require $\alpha_t$ to have a density. The canonical example is Shannon neg-entropy

f(\alpha) = \int \log\left(\frac{\d\alpha}{\d x}(x)\right) \d\alpha(x).

(38)

Here $\delta f(\alpha)=\log(\frac{\d\alpha}{\d x})$ up to an additive constant, so $\Wgrad f(\alpha)=\nabla\alpha/\alpha$ , often called the score. The flow (8) becomes the heat equation

\partial_t\alpha_t=\Delta\alpha_t.

(39)
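A closed-form instance of the JKO scheme (1) for this energy is obtained by restricting the minimization to centered one-dimensional Gaussians $N(0,s^2)$, a simplifying assumption that makes each step explicit. In that family $\Wass_2^2=(s-s_k)^2$ and the neg-entropy equals $-\log s$ up to a constant, so the step solves $s(s-s_k)=\tau$. The sketch below iterates this update and compares the variance with the heat-flow value $s_0^2+2t$; the function name and parameters are hypothetical.

```python
import math

def jko_entropy_gaussian(s0, tau, steps):
    """JKO iterates of the Shannon neg-entropy within centered 1D Gaussians N(0, s^2).

    Each step minimizes (s - s_k)^2 / (2 tau) - log(s), whose positive
    critical point solves s * (s - s_k) = tau.
    """
    s = s0
    for _ in range(steps):
        s = 0.5 * (s + math.sqrt(s * s + 4.0 * tau))
    return s

s0, T = 0.5, 1.0
for steps in (10, 100, 1000):
    s = jko_entropy_gaussian(s0, T / steps, steps)
    # Heat flow started from N(0, s0^2) has variance s0^2 + 2 t at time t.
    print(steps, s**2, s0**2 + 2.0 * T)
```

The discrepancy with the heat-flow variance shrinks proportionally to the step size, as expected from a first-order implicit scheme.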

Other entropy functionals lead to nonlinear diffusion equations; finite-volume and particle discretizations are discussed in (Carrillo et al., 2015; Gianazza et al., 2009; Maas, 2011; Erbar, 2010).

For example, a generalized entropy

f(\alpha)=\int g\left(\frac{\d\alpha}{\d x}\right)\d x

(40)

for a scalar convex function $g$ leads, in the smooth-density regime, to

\frac{\partial\alpha_t}{\partial t} = \Delta(P(\alpha_t)),

(41)

where the pressure $P$ satisfies $P'(s)=s g''(s)$. For $g(s)=s\log s$, one has $P(s)=s$ and recovers the Shannon entropy (38) and the heat equation (39); for $g(s)=s^m/(m-1)$ with $m>1$, one obtains $P(s)=s^m$ up to an additive constant and the porous-medium equation.
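The relation $P'(s)=sg''(s)$ follows from a short formal computation, sketched here for a smooth positive density $\rho=\frac{\d\alpha}{\d x}$ using the Wasserstein gradient formula (10) and the flow (8):

```latex
\delta f(\alpha)=g'(\rho),
\qquad
\Wgrad f(\alpha)=\nabla g'(\rho)=g''(\rho)\nabla\rho,
\qquad
\partial_t\rho
=\operatorname{div}\big(\rho\,g''(\rho)\nabla\rho\big)
=\operatorname{div}\big(P'(\rho)\nabla\rho\big)
=\Delta P(\rho).
```

For $g(s)=s^m/(m-1)$ this gives $P'(s)=m\,s^{m-1}$, hence $P(s)=s^m$ up to a constant.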

A celebrated theorem of McCann (McCann, 1997) states that an internal energy of the form (40), for $g:\RR^+\to\RR\cup\{+\infty\}$ with $g(0)=0$, is geodesically convex on $\Pp(\RR^d)$ when $g$ is convex and the map $r\mapsto r^d g(r^{-d})$ is convex and nonincreasing on $(0,+\infty)$. Examples include $g(s)=s^q$ for $q>1$ and Shannon entropy $g(s)=s\log s$.

The figure below contrasts the spreading induced by the linear heat equation with the compactly supported, density-dependent propagation of two porous-medium flows.

Entropy-driven Wasserstein gradient flows from the same compact initial density. The heat flow is generated by Shannon entropy $g(\rho)=\rho\log\rho$ and instantly develops Gaussian tails. The porous-medium flows use the power entropy $g(\rho)=\rho^m/(m-1)$ , hence $\partial_t\rho=\Delta(\rho^m)$ : the middle panel has $m=2$ , while the right panel has the stronger nonlinearity $m=6$ , i.e. $\partial_t\rho=\Delta(\rho^6)$ . Larger powers diffuse mainly where the density is high, producing a flatter core and a sharper compact free boundary.

The interactive demo isolates the effect of the entropy exponent. The heat curve keeps Gaussian tails, while increasing $m$ keeps a compact front and spreads mass mainly from the high-density core.

Interactive panel. Use the diffusion exponent and time controls to compare linear heat flow with nonlinear porous-medium spreading.

Interaction Energies

To obtain nonlinear evolutions without requiring the measure to have a density, one can consider

f(\alpha) := \iint k(x,y)\d\alpha(x)\d\alpha(y).

(42)

For a symmetric kernel $k$ ,

\delta f(\alpha)(x) = 2\int k(x,y)\d\alpha(y), \qquad \Wgrad f(\alpha)(x) = 2\int\nabla_x k(x,y)\d\alpha(y).

(43)

For $\alpha_0=\frac1n\sum_i\delta_{x_i}$ , the flow (8) implies the particle system

\dot x_i(t) = -\frac2n\sum_j\nabla_x k(x_i(t),x_j(t)).

(44)
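The particle system (44) is straightforward to simulate. The sketch below uses a Gaussian kernel $k(x,y)=\exp(-\norm{x-y}^2/(2\sigma^2))$ and an explicit Euler discretization; the kernel choice, the `sign` switch, the `spread` diagnostic and all parameter values are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel_grad(X, sigma):
    """G[i, j] = grad_x k(x_i, x_j) for k(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    diff = X[:, None, :] - X[None, :, :]                       # shape (n, n, d)
    k = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * sigma**2))
    return -(diff / sigma**2) * k[:, :, None]

def interaction_flow(X0, sigma=0.5, tau=0.05, steps=400, sign=1.0):
    """Explicit Euler for x_i' = -(2/n) sum_j grad_x k(x_i, x_j); sign=-1 uses the kernel -k."""
    X = X0.copy()
    n = X.shape[0]
    for _ in range(steps):
        velocity = -(2.0 / n) * sign * gaussian_kernel_grad(X, sigma).sum(axis=1)
        X = X + tau * velocity
    return X

def spread(X):
    """Root-mean-square distance of the particles to their barycenter."""
    return float(np.sqrt(np.mean(np.sum((X - X.mean(axis=0))**2, axis=1))))

rng = np.random.default_rng(0)
X0 = 0.2 * rng.normal(size=(40, 2))
X_rep = interaction_flow(X0, sign=+1.0)    # positive kernel: particles repel
X_att = interaction_flow(X0, sign=-1.0)    # sign-flipped kernel: particles contract
print(spread(X0), spread(X_rep), spread(X_att))
```

Since $\nabla_x k(x,y)=-\nabla_y k(x,y)$ for this kernel, the pairwise forces cancel in the sum and the barycenter is preserved; only the spread of the cloud changes.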

If $k$ is positive definite, or more generally conditionally positive definite on signed measures of zero total mass as for the energy-distance kernel $k(x,y)=-\norm{x-y}$ , and one minimizes the squared kernel discrepancy to a teacher distribution $\beta$ , then

\norm{\alpha-\beta}_k^2 = \iint k\d\alpha\d\alpha -2\int\left(\int k(x,y)\d\beta(y)\right)\d\alpha(x) +\mathrm{constant}.

(45)

Thus MMD-type training energies are exactly an interaction energy plus a linear potential. The teacher distribution appears through the potential $x\mapsto-2\int k(x,y)\d\beta(y)$ , and the corresponding empirical Wasserstein gradient flow is

\dot x_i(t) = -\frac2n\sum_j\nabla_x k(x_i(t),x_j(t)) + 2\int\nabla_x k(x_i(t),y)\d\beta(y).

(46)

The first term is a kernelized self-interaction; the second is the attraction induced by the continuous teacher kernel mean. At the continuum level, characteristic positive-definite kernels, and the Euclidean energy-distance kernel on probability measures, have $\beta$ as the unique minimizer of $\norm{\alpha-\beta}_k^2$ . For finitely many particles, however, the flow can only form a kernelized quadrature of $\beta$ , and small particle systems may cover the target modes poorly. The particle-count figure below illustrates this finite-particle effect.
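A minimal simulation of (46) is sketched below. It departs from the text in two labeled ways: the teacher $\beta$ is replaced by a finite sample `Y`, so the integral becomes an average, and a Gaussian kernel is used instead of the energy-distance kernel. Names and parameters are hypothetical.

```python
import numpy as np

def kernel_grad_mean(X, Y, sigma):
    """For each x in X, the mean over y in Y of grad_x k(x, y) (Gaussian kernel)."""
    diff = X[:, None, :] - Y[None, :, :]
    k = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * sigma**2))
    return (-(diff / sigma**2) * k[:, :, None]).mean(axis=1)

def mmd2(X, Y, sigma):
    """Squared MMD between the empirical measures of X and Y."""
    def mean_k(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma**2)).mean()
    return mean_k(X, X) - 2.0 * mean_k(X, Y) + mean_k(Y, Y)

def mmd_flow(X0, Y, sigma=1.0, tau=0.1, steps=300):
    """Explicit Euler discretization of (46), with beta replaced by the samples Y."""
    X = X0.copy()
    for _ in range(steps):
        self_term = -2.0 * kernel_grad_mean(X, X, sigma)      # kernelized self-interaction
        teacher_term = 2.0 * kernel_grad_mean(X, Y, sigma)    # attraction to the teacher
        X = X + tau * (self_term + teacher_term)
    return X

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 1))                  # samples standing in for the teacher
X0 = 2.0 + 0.3 * rng.normal(size=(50, 1))      # shifted, concentrated initialization
X = mmd_flow(X0, Y)
print(mmd2(X0, Y, 1.0), mmd2(X, Y, 1.0))       # the discrepancy decreases along the flow
```

Rerunning with fewer particles in `X0` gives a coarser quadrature of the teacher sample, which is the finite-particle effect discussed above.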


Particle count in the deterministic Wasserstein gradient flow of the squared MMD-type discrepancy to a smooth two-Gaussian teacher distribution, using here the energy-distance kernel $k(x,y)=-\norm{x-y}$ . The teacher itself is shown only through true density contours, while red dots are a compact shifted Gaussian initialization placed away from the target, red-to-blue curves show a thinned subset of particle trajectories, and blue dots show the stabilized long-time particles. With too few particles, the empirical measure forms a sparse kernelized quadrature and may under-cover the target modes; increasing $n$ makes the particle cloud approximate the continuous target geometry more faithfully.

The interactive demo turns this finite-particle effect into a parameter: increasing the number of particles makes the same deterministic force field approximate the teacher geometry more faithfully.

Interactive panel. Use the particle count and kernel controls to see how MMD geometry drives a particle flow toward the target law.

The preceding particle flow minimizes a squared kernel discrepancy. The next figure contrasts this with flows generated by three discrepancies toward the same target. For the energy distance and $W_2$, the energy is the unsquared distance, so the Wasserstein velocity is normalized by the current discrepancy value away from the target. The KL case is different: it is a relative-entropy gradient flow and therefore follows the Fokker--Planck diffusion toward the target. The comparison keeps the endpoint fixed and makes the geometry visible through the transient densities.


Wasserstein gradient flows of three discrepancies toward the same one-dimensional Gaussian-mixture target, shown as a dashed black density. The source density is also a Gaussian mixture. Red-to-blue curves are snapshots of $\alpha_t$ . The energy-distance and $W_2$ panels use many equal-weight quantile particles and descend the unsquared distances: in one dimension the energy distance is obtained from $\mathrm{ED}^2=2\int |F_{\alpha}-F_{\beta}|^2\,dx$ , while $W_2$ uses monotone quantile matching. The relative-entropy panel solves the reversible Fokker--Planck equation $\partial_t\rho=\partial_x(\rho_\beta\,\partial_x(\rho/\rho_\beta))$ by an implicit finite-volume scheme, so that $\rho_\beta$ is the discrete stationary density.

The next figure complements this target-matching example by isolating three self-interaction regimes.

Interaction-energy particle flows for three choices of $k$ . A positive Gaussian kernel $k(x,y)=\exp(-\norm{x-y}^2/(2\sigma^2))$ produces short-range repulsion under Wasserstein descent; changing its sign produces attraction and collapse; adding a quadratic long-range attraction to the repulsive kernel yields a balanced attraction--repulsion dynamics. The curves use arclength-based red-to-blue coloring along a longer integration of the coupled particle ODE (23).

The interactive demo lets the sign and strength of the interaction change without editing the hidden particle solver. This is the quickest way to see how the same formal ODE can repel, collapse, or self-organize.

Interactive panel. Use the interaction strength and time controls to watch particles move under attraction, repulsion, and confinement.

The following figure then compares the trajectory geometries produced by several discrepancies on common source and target clouds.

Particle trajectories induced by different discrepancy geometries. The red particles and blue target cloud are the same in all panels. Straight OT displacement produces rays from an optimal matching; an MMD-type witness field gives smoother nonlocal forces; the Sinkhorn-divergence force is an entropic, debiased transport attraction; and the normalized drifting field combines attraction to data with self-repulsion. The figure is qualitative: it compares geometric behavior, not solver performance.

The interactive demo keeps the source and target fixed while switching the discrepancy geometry. The smoothing parameter controls how local or nonlocal the induced force appears.

Interactive panel. Use the smoothing and geometry controls to compare how different discrepancies reshape the same particle objective.

Stochastic Particles and McKean--Vlasov Limits

Deterministic particle flows have stochastic counterparts, where Brownian noise at the particle level becomes an entropy term at the measure level. If the drift does not depend on the empirical measure, each particle evolves independently according to

\d X_t=b(X_t)\d t+\sqrt2\,\sigma\d B_t,

(47)

and the one-particle law $\alpha_t=\rho_t\d x$ satisfies the linear Fokker--Planck equation

\partial_t\rho_t = -\operatorname{div}(b\rho_t)+\sigma^2\Delta\rho_t.

(48)

For example, if $b=-\nabla V$ , this is the $\Wass_2$ gradient flow of the free energy

\int V\rho\,\d x+\sigma^2\int\rho\log\rho\,\d x.

(49)
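The link between (47) and (49) can be checked on the quadratic potential $V(x)=x^2/2$, for which the minimizer of the free energy is the Gibbs density proportional to $\exp(-V/\sigma^2)$, a centered Gaussian of variance $\sigma^2$. The sketch below runs an Euler-Maruyama discretization of (47); the potential, the step size and the particle count are assumptions for this example.

```python
import numpy as np

def langevin_ou(n=20000, sigma=0.7, tau=0.01, steps=1000, seed=0):
    """Euler-Maruyama for dX = -V'(X) dt + sqrt(2) sigma dB with V(x) = x**2 / 2."""
    rng = np.random.default_rng(seed)
    X = np.full(n, 2.0)                        # all particles start at x = 2
    for _ in range(steps):
        noise = rng.normal(size=n)
        X = X - tau * X + np.sqrt(2.0 * tau) * sigma * noise
    return X

X = langevin_ou()
# Long-time statistics should be close to the Gibbs law: mean 0, variance sigma^2.
print(X.mean(), X.var(), 0.7**2)
```

The small residual bias in the variance comes from the time discretization and from the finite number of particles.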

The mean-field case is different: the drift is recomputed from the current empirical distribution of all particles,

\d X_i^n(t) = b(X_i^n(t),\alpha_t^n)\d t+\sqrt2\,\sigma\d B_i(t), \qquad \alpha_t^n=\frac1n\sum_{i=1}^n\delta_{X_i^n(t)}.

(50)

For finite $n$, the empirical law $\alpha_t^n$ is random. Under suitable Lipschitz, growth and chaotic-initialization assumptions, propagation of chaos states that any fixed finite number of particles become asymptotically independent as $n\to\infty$, all with the same deterministic law $\rho_t\d x$. Equivalently, $\alpha_t^n$ converges in probability to this law. The limiting density solves the nonlinear Fokker--Planck, or McKean--Vlasov, equation

\partial_t\rho_t = -\operatorname{div}\big(b(x,\rho_t)\rho_t\big) + \sigma^2\Delta\rho_t.

(51)

When the interaction drift has variational form

b(x,\rho) = -\nabla\frac{\delta\mathcal E}{\delta\rho}(x),

(52)

this PDE is the Wasserstein gradient flow of the entropy-regularized energy

\mathcal E(\rho)+\sigma^2\int\rho\log\rho\,\d x.

(53)

For a prescribed target $\beta=\rho_\beta\d x$ , choosing $b=\sigma^2\nabla\log\rho_\beta$ gives the equivalent Fokker--Planck and deterministic score-transport descriptions

\partial_t\rho_t = \sigma^2\operatorname{div}\!\left( \rho_t\nabla\log\frac{\rho_t}{\rho_\beta} \right), \qquad v_t = \sigma^2\bigl(\nabla\log\rho_\beta-\nabla\log\rho_t\bigr).

(54)

The figure below compares numerical realizations of these descriptions: stochastic Langevin particles, deterministic score particles, and the Fokker--Planck density.

Three numerical representations of the same entropy-regularized Wasserstein gradient flow of $\KL(\rho|\beta)$ , where $\beta$ is a two-Gaussian target shifted to the right of an initially isotropic Gaussian density. The first row simulates independent Langevin particles and displays a thinned set of trajectories in the left panel. The second row evolves many deterministic particles with the score velocity in (54), estimating $\nabla\log\rho_t$ by a kernel-density estimator; only representative trajectories and particle subsets are displayed. The third row solves the corresponding Fokker--Planck equation on a grid, starting from the initial density in the left panel. The remaining columns use front-loaded times, so that the onset of the flow and the later deformation toward a bimodal law are both visible.
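The deterministic score-transport velocity in (54) can be tried on a toy case. The sketch below takes the standard Gaussian target $\rho_\beta=N(0,1)$ in one dimension and, unlike the kernel-density estimator mentioned in the caption, approximates $\nabla\log\rho_t$ by the score of the Gaussian fitted to the current particles. This Gaussian closure is an assumption made to keep the example short; it is exact only when $\rho_t$ is itself Gaussian.

```python
import numpy as np

def score_transport_gaussian(X0, sigma=1.0, tau=0.01, steps=1000):
    """Deterministic particles moved by v = sigma^2 (grad log rho_beta - grad log rho_t).

    Target: rho_beta = N(0, 1), so grad log rho_beta(x) = -x.
    Assumption: grad log rho_t is replaced by the score -(x - m) / s^2 of the
    Gaussian fitted to the current particles (mean m, variance s^2).
    """
    X = X0.copy()
    for _ in range(steps):
        m, s2 = X.mean(), X.var()
        velocity = sigma**2 * (-X + (X - m) / s2)
        X = X + tau * velocity
    return X

rng = np.random.default_rng(0)
X0 = 2.0 + 0.3 * rng.normal(size=2000)     # concentrated cloud away from the target
X = score_transport_gaussian(X0)
print(X.mean(), X.std())                   # approaches the target mean 0 and std 1
```

No noise is injected: the spreading usually produced by Brownian motion is carried entirely by the $-\sigma^2\nabla\log\rho_t$ term of the velocity.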

The interactive demo compares three views of the same entropy-regularized relaxation: stochastic Langevin particles, deterministic score particles, and a smoothed grid density. The noise slider controls the entropy strength.

Interactive panel. Use the drift and noise controls to compare trajectories, particles, and density evolution for the same Fokker-Planck dynamics.

Example: Entropy generates heat and porous-medium flows

The canonical density-dependent functional is the Shannon neg-entropy

f(\alpha) = \int \log\left(\frac{\d \alpha}{\d x}(x)\right) \d \alpha(x).

(57)

Here, $\delta f(\alpha) = \log(\d\alpha/\d x)$ up to an additive constant, so $\Wgrad f(\alpha) = \nabla \alpha/\alpha$ (often called the score). The flow (8) becomes the heat equation

\partial_t \alpha_t = \Delta \alpha_t.

(58)

Other entropy functionals lead to nonlinear diffusion equations; finite-volume and particle discretizations are discussed in (Carrillo et al., 2015; Gianazza et al., 2009; Maas, 2011; Erbar, 2010).

For a generalized entropy

f(\alpha) = \int g\left(\frac{\d \alpha}{\d x}\right) \d x,

(59)

with a scalar convex function $g$ , one obtains nonlinear diffusions in the smooth-density regime:

\frac{\partial \alpha_t}{\partial t} = \Delta(P(\alpha_t)),

(60)

where the pressure $P$ satisfies $P'(s)=s g''(s)$. For example, $g(s) = s \log(s)$ gives $P(s)=s$ and recovers the Shannon entropy (57) and the heat equation (58), while $g(s) = s^m/(m-1)$, $m > 1$, gives $P(s)=s^m$ up to an additive constant and yields the porous-medium equation.

Remark: Two gradient-flow interpretations of the heat equation

The heat equation already illustrates that a PDE does not determine a unique gradient-flow structure. Write $\alpha=\rho\d x$ on a flat domain, with periodic or no-flux boundary conditions. In the Hilbert geometry of densities, the squared distance and the Dirichlet energy are

d_{L^2}^2(\alpha,\beta)=\int |\rho_\alpha(x)-\rho_\beta(x)|^2\d x, \qquad \mathcal D(\alpha)=\frac12\int \norm{\nabla\rho(x)}^2\d x,

(61)

so $\delta\mathcal D/\delta\rho=-\Delta\rho$ and the $L^2$ gradient flow $\partial_t\rho_t=-\delta\mathcal D/\delta\rho(\rho_t)$ gives $\partial_t\rho_t=\Delta\rho_t$ . This viewpoint says that heat flow decreases oscillations and regularizes the density. In Wasserstein geometry, the same equation is instead the $\Wass_2$ gradient flow of the Shannon entropy $\int \rho\log\rho\d x$ , so the driving mechanism is entropic spreading. The example is a useful warning: explaining a dynamics as a gradient flow requires specifying both an energy and a metric, and different pairs $(f,d)$ may produce the same PDE.
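Both interpretations predict a quantity that decreases along the heat flow, and this can be observed on a single simulation. The sketch below, using an explicit periodic finite-difference scheme and an initial density chosen for illustration, records the discrete Dirichlet energy and the discrete Shannon entropy at every step.

```python
import numpy as np

m = 200
dx = 1.0 / m
x = np.arange(m) * dx
# Positive initial density with unit mass on the periodic interval [0, 1).
rho = 1.0 + 0.8 * np.sin(2.0 * np.pi * x) + 0.1 * np.cos(6.0 * np.pi * x)
dt = 0.25 * dx**2                       # explicit stability requires dt <= dx**2 / 2

def dirichlet(rho):
    """Discrete Dirichlet energy (1/2) * sum |grad rho|^2 * dx with forward differences."""
    grad = (np.roll(rho, -1) - rho) / dx
    return 0.5 * np.sum(grad**2) * dx

def entropy(rho):
    """Discrete Shannon neg-entropy sum rho * log(rho) * dx."""
    return np.sum(rho * np.log(rho)) * dx

D, H = [dirichlet(rho)], [entropy(rho)]
for _ in range(5000):
    rho = rho + dt * (np.roll(rho, 1) - 2.0 * rho + np.roll(rho, -1)) / dx**2
    D.append(dirichlet(rho))
    H.append(entropy(rho))

print(D[0], D[-1])   # Dirichlet energy decreases (L^2 gradient-flow viewpoint)
print(H[0], H[-1])   # entropy decreases (Wasserstein gradient-flow viewpoint)
```

The same trajectory thus descends two different energies, each paired with its own metric.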

Example: Application to single-cell gradient-flow models

Instead of only coupling successive snapshots, one may posit that a latent population law evolves by

\partial_t\alpha_t+\operatorname{div}(\alpha_t v_t)=0, \qquad v_t=-\Wgrad f(\alpha_t)=-\nabla\delta f(\alpha_t).

(63)

Here $\delta f(\alpha_t)$ is the first variation, with the convention introduced above. For example, if $\alpha=\rho\,\d x$, the energy $f(\alpha)=\int V\d\alpha+\sigma^2\int\rho\log\rho\,\d x$ gives a drift toward a potential landscape $V$ together with diffusion of strength $\sigma^2$. In single-cell modeling, $V$, interaction terms or a neural approximation of $v_t$ can be fitted from unpaired population snapshots. This links Waddington-landscape intuition to the mathematical language of dynamic OT, Fokker--Planck equations and flow matching (Tong et al., 2020; Lavenant et al., 2021; Klein et al., 2024).

Geodesic Convexity and Convergence¶

Geodesic convexity is the convexity notion adapted to Wasserstein geometry. It is the condition that turns the formal gradient-flow calculus into a convergence theory.

Geodesics and Convexity¶

A constant-speed $\Wass_2$ geodesic between $\alpha_0$ and $\alpha_1$ is obtained, as in the McCann interpolation, from any optimal coupling $\pi^\star\in\Couplings(\alpha_0,\alpha_1)$ by

\alpha_t=((1-t)P_0+tP_1)_\sharp\pi^\star, \qquad t\in[0,1],

(64)

where $P_0(x,y)=x$ and $P_1(x,y)=y$ . If the optimal plan is induced by a Brenier map $T$ , this reduces to $((1-t)\Id+tT)_\sharp\alpha_0$ . The coupling formula matters because geodesics exist even when no Monge map exists, for instance when a Dirac mass must split.
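The splitting case can be made concrete with a small sketch. The measures $\alpha_0=\delta_0$ and $\alpha_1=\frac12(\delta_{-1}+\delta_1)$ and the helper names below are illustrative choices, not taken from the text. No Monge map exists from $\alpha_0$ to $\alpha_1$ , yet the coupling formula still produces a constant-speed geodesic.

```python
import numpy as np

def w2_uniform_1d(x, y):
    """W_2 between two uniform empirical measures on the line with equally many atoms."""
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

# alpha_0 = delta_0 and alpha_1 = (delta_{-1} + delta_{1}) / 2.
# The only coupling splits the Dirac mass: pairs (0, -1) and (0, 1), each with weight 1/2.
pairs = np.array([[0.0, -1.0], [0.0, 1.0]])

def geodesic_atoms(t):
    """Atoms of alpha_t = ((1 - t) P_0 + t P_1)_# pi, each carrying weight 1/2."""
    return (1.0 - t) * pairs[:, 0] + t * pairs[:, 1]

times = np.linspace(0.0, 1.0, 6)
total = w2_uniform_1d(geodesic_atoms(0.0), geodesic_atoms(1.0))

# Constant speed: W_2(alpha_s, alpha_t) should equal |t - s| * W_2(alpha_0, alpha_1).
speed_errors = [
    abs(w2_uniform_1d(geodesic_atoms(s), geodesic_atoms(t)) - abs(t - s) * total)
    for s in times for t in times
]
```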

Proposition: Geodesic Convexity of Linear and Quadratic Energies

The following formal statements hold on $\Pp_2(\RR^d)$ .

If $h$ is convex, then the linear energy $\alpha\mapsto\int h\d\alpha$ is geodesically convex; if $h$ is $\lambda$ -strongly convex, it is $\lambda$ -geodesically convex. Conversely, geodesic convexity for all Dirac endpoints forces $h$ to be convex.
More generally, if $H_k:(\RR^d)^k\to\RR\cup\{+\infty\}$ is convex, then the polynomial energy $\alpha\mapsto \int H_k(x_1,\ldots,x_k)\d\alpha(x_1)\cdots\d\alpha(x_k)$ is geodesically convex whenever the integral is well defined. In particular, $\alpha\mapsto\frac12\iint W(x,y)\d\alpha(x)\d\alpha(y)$ is geodesically convex when $W$ is convex on $\RR^d\times\RR^d$ .

The polynomial criterion is useful but restrictive for kernel losses. The RKHS/MMD kernels introduced in Section Dual RKHS Norms and Maximum Mean Discrepancies are usually chosen for positive definiteness and statistical smoothing, not for convexity as functions of $(x,y)$ . Gaussian and Laplace kernels, for instance, are not convex on $\RR^d\times\RR^d$ , so positive definiteness of $k$ does not imply that $\alpha\mapsto\MMD_k^2(\alpha,\beta)$ is geodesically convex. In fact, such MMD losses are not geodesically convex in general. One-dimensional monotone transport gives a sharper exception for distance-type kernels, because ordered quantiles keep the sign of pairwise differences fixed along geodesics; this is the one-dimensional displacement-convexity mechanism emphasized, for instance, in Santambrogio, 2015.

Proposition: One-Dimensional Interactions and Energy-Distance MMD

Let $\varphi:\RR_+\to\RR$ be continuous with at most quadratic growth, and define on $\Pp_2(\RR)$

\mathcal I_\varphi(\alpha) \eqdef \frac12\iint \varphi(|x-y|)\d\alpha(x)\d\alpha(y).

(68)

Then $\mathcal I_\varphi$ is geodesically convex for $\Wass_2$ on $\Pp_2(\RR)$ if and only if $\varphi$ is convex on $\RR_+$ .

Fix also $\beta\in\Pp_2(\RR)$ . For the conditionally positive kernel $k(x,y)=-|x-y|$ from Definition Definition: Positive and Conditionally Positive Kernels, the squared MMD, equivalently the squared energy distance,

\mathcal E_\beta(\alpha) \eqdef \MMD_k^2(\alpha,\beta) = \iint -|x-y|\,\d(\alpha-\beta)(x)\d(\alpha-\beta)(y),

(69)

is geodesically convex as a function of $\alpha$ .

Let $Q_0,Q_1$ be the quantile functions of $\alpha_0,\alpha_1$ , and let $Q_t=(1-t)Q_0+tQ_1$ be the quantile function of the one-dimensional $\Wass_2$ geodesic $\alpha_t$ . For $r>s$ , monotonicity gives $Q_i(r)-Q_i(s)\geq0$ for $i=0,1$ , and hence

Q_t(r)-Q_t(s) = (1-t)(Q_0(r)-Q_0(s))+t(Q_1(r)-Q_1(s)).

(70)

Using the symmetry of the double integral and the quantile parametrization, one can write

\mathcal I_\varphi(\alpha_t) = \int_0^1\int_0^r \varphi(Q_t(r)-Q_t(s))\,\d s\,\d r.

(71)

Convexity of $\varphi$ on $\RR_+$ gives the desired convexity after integration.

Conversely, test geodesic convexity on two-point measures $\alpha_i=\frac12(\delta_0+\delta_{a_i})$ , with $a_i\geq0$ . Their monotone geodesic is $\alpha_t=\frac12(\delta_0+\delta_{(1-t)a_0+t a_1})$ . Since

\mathcal I_\varphi(\alpha_t) = \frac14\varphi(0)+\frac14\varphi((1-t)a_0+t a_1),

(72)

geodesic convexity of $\mathcal I_\varphi$ gives the ordinary convexity inequality for $\varphi$ on $\RR_+$ , after the identical $\frac14\varphi(0)$ terms cancel.

For the energy distance, the function $\varphi(r)=-r$ is affine on $\RR_+$ , so the self-interaction $-\iint |x-y|\d\alpha(x)\d\alpha(y)$ is affine along one-dimensional Wasserstein geodesics even though $z\mapsto-|z|$ is not convex on $\RR$ . Write $Q_\beta$ for the quantile function of $\beta$ . Expanding the MMD and collecting the term that depends only on $\beta$ into a constant gives

\mathcal E_\beta(\alpha) = 2\int_0^1\!\!\int_0^1 |Q_\alpha(r)-Q_\beta(s)|\,\d r\,\d s - \int_0^1\!\!\int_0^1 |Q_\alpha(r)-Q_\alpha(s)|\,\d r\,\d s +\mathrm{constant}.

(73)

Here the constant is independent of $\alpha$ . The positive cross-distance term is convex in $Q_\alpha$ , since the absolute value is convex and $Q_t$ is affine in $t$ . The self-distance term is affine along quantile geodesics because

\int_0^1\!\!\int_0^1 |Q_t(r)-Q_t(s)|\,\d r\,\d s = 2\int_0^1\int_0^r (Q_t(r)-Q_t(s))\,\d s\,\d r.

(74)

Thus $\alpha\mapsto\mathcal E_\beta(\alpha)$ is geodesically convex.
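The statement can be checked on empirical measures, for which sorted atoms play the role of quantile functions. The sketch below is only a numerical sanity check under assumptions of this example: the sample sizes, the Gaussian samples and the function name `energy_distance` are hypothetical. It evaluates the energy distance along the quantile interpolation and inspects discrete second differences in $t$ .

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60

def energy_distance(x, y):
    """Squared MMD for k(x, y) = -|x - y| between uniform empirical measures."""
    cross = np.abs(x[:, None] - y[None, :]).mean()
    self_x = np.abs(x[:, None] - x[None, :]).mean()
    self_y = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * cross - self_x - self_y

# Sorted atoms are the discrete quantile functions Q_0 and Q_1.
q0 = np.sort(rng.normal(-2.0, 0.5, size=n))
q1 = np.sort(rng.normal(3.0, 2.0, size=n))
target = rng.normal(0.0, 1.0, size=n)   # atoms of the fixed measure beta

ts = np.linspace(0.0, 1.0, 41)
# Q_t = (1 - t) Q_0 + t Q_1 describes the one-dimensional W_2 geodesic.
values = np.array([energy_distance((1 - t) * q0 + t * q1, target) for t in ts])

# A convex function of t has nonnegative discrete second differences.
second_diff = values[:-2] - 2.0 * values[1:-1] + values[2:]
```

Nonnegative entries of `second_diff`, up to rounding, are what the proposition predicts; the check is specific to dimension one and to the distance kernel.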

This one-dimensional result is the mechanism behind the favorable behavior of energy-distance particle flows discussed earlier in this chapter and in recent analyses of distance-kernel MMD flows Duong et al., 2024. It should not be read as a general MMD statement: for a typical positive definite kernel $k$ , the decomposition

\MMD_k^2(\alpha,\beta) = \iint k\,\d\alpha\d\alpha -2\iint k\,\d\alpha\d\beta +\mathrm{constant}

(75)

contains neither a convex self-interaction nor a convex cross term along Wasserstein geodesics.

Internal energies require a different mechanism: convexity is hidden in the Jacobian determinant of the transport map, not in a pointwise convex integrand along particles. The precise criterion is McCann’s displacement-convexity theorem.

This theorem applies to $g(s)=s\log s$ and to porous-medium powers $g(s)=s^m/(m-1)$ , $m>1$ , and more generally to the fast-diffusion range $m\geq1-1/d$ when the energy is well defined, with $m=1$ understood as the entropy limit. The same criterion also clarifies which Csiszár divergences inherit displacement convexity from an internal-energy structure.

Proposition: Geodesic Convexity of Power Divergences

For $m>0$ , define

\phi_m(r)= \begin{cases} \dfrac{r^m-mr+m-1}{m(m-1)}, & m\neq1,\\[2mm] r\log r-r+1, & m=1. \end{cases}

(80)

Then the following geodesic-convexity statements hold.

If $\Omega\subset\RR^d$ is convex, $\beta=b\,\mathbf 1_\Omega\,\d x$ , and $m\geq 1-1/d$ , then $\alpha\mapsto D_{\phi_m}(\alpha|\beta)$ is geodesically convex on $\Pp_2(\Omega)$ .
If $\beta=Z^{-1}e^{-V}\d x$ on $\RR^d$ , $V$ is convex, and $m\geq1$ , then $\alpha\mapsto D_{\phi_m}(\alpha|\beta)$ is geodesically convex on $(\Pp_2(\RR^d),\Wass_2)$ .
In the KL case $m=1$ , if $V$ is $\lambda$ -strongly convex, then $\alpha\mapsto \KL(\alpha|\beta)=D_{\phi_1}(\alpha|\beta)$ is $\lambda$ -geodesically convex.

All statements use the convention that the divergence is $+\infty$ when $\alpha$ is not absolutely continuous with respect to $\beta$ .

For the flat reference case, write $\alpha=\rho\d x$ . Up to affine terms in $\rho$ , which only add constants on probability measures, the integrand $b\,\phi_m(\rho/b)$ has non-affine part $b^{1-m}\rho^m/(m(m-1))$ for $m\neq1$ , and $\rho\log\rho$ for $m=1$ . For $m=1$ , McCann’s criterion gives $\psi(r)=-d\log r$ . For $m\neq1$ , the transformed non-affine part is, up to the irrelevant factor $b^{1-m}$ ,

\psi_m(r)=\frac{r^{d(1-m)}}{m(m-1)}.

(81)

If $m>1$ , this function is convex and nonincreasing. If $0<m<1$ , it is convex and nonincreasing exactly when $d(1-m)\leq1$ , i.e. $m\geq1-1/d$ . This proves the first claim by Theorem Theorem: McCann Displacement Convexity for Internal Energies.

Assume next that $m>1$ and $V$ is convex. Up to constants,

D_{\phi_m}(\alpha|\beta) = \frac{Z^{m-1}}{m(m-1)} \int \rho(x)^m e^{(m-1)V(x)}\,\d x .

(82)

Let $T=\nabla\varphi$ be the Brenier map from $\alpha_0=\rho_0\d x$ to $\alpha_1$ , set $T_t=(1-t)\Id+tT$ , and write $J_t=\det((1-t)I+t\nabla T)$ . Along the geodesic $\alpha_t=(T_t)_\sharp\alpha_0$ ,

\int \rho_t^m e^{(m-1)V}\d x = \int \rho_0(x)^m \exp\!\left((m-1)\big(V(T_t(x))-\log J_t(x)\big)\right)\d x .

(83)

For each $x$ , the function $t\mapsto V(T_t(x))$ is convex, and $t\mapsto-\log J_t(x)$ is convex because $\log\det$ is concave on positive definite matrices. Since $m-1>0$ , the exponential of their sum is convex in $t$ . Integrating gives geodesic convexity. The nonsmooth case follows by approximation.

Finally, for $m=1$ ,

\KL(\alpha|\beta) = \int\rho\log\rho\,\d x+\int V\d\alpha+\log Z.

(84)

The entropy term is 0-geodesically convex by Theorem Theorem: McCann Displacement Convexity for Internal Energies. The linear potential term is $\lambda$ -geodesically convex when $V$ is $\lambda$ -strongly convex, by Proposition Proposition: Geodesic Convexity of Linear and Quadratic Energies. Taking $\lambda=0$ also covers the $m=1$ case in the second claim.

Remark: Flat references and general $\phi$ -divergences

The Csiszár $\phi$ -divergences defined in Equation (28) behave like internal energies only in special reference geometries. If $\beta=b\,\mathbf 1_\Omega\,\d x$ is proportional to Lebesgue measure on a bounded convex domain, then

D_\phi(\alpha|\beta) = \int_\Omega b\,\phi(\rho(x)/b)\,\d x, \qquad \alpha=\rho\,\d x,

(85)

so McCann’s theorem applies whenever the integrand $g(s)=b\,\phi(s/b)$ satisfies the displacement-convexity condition. For a nonuniform target $\beta$ , this reduction is no longer available. Proposition Proposition: Geodesic Convexity of Power Divergences gives a useful structured extension for the power family: log-concave targets preserve geodesic convexity for $m\geq1$ , while the KL endpoint is special in the stronger sense that $\lambda$ -strong log-concavity gives $\lambda$ -geodesic convexity. Outside such structured families, a generic $\phi$ -divergence is not expected to be geodesically convex.

The Hellinger integrand $\phi_H(r)=(\sqrt r-1)^2$ illustrates why the qualifier “flat reference” matters. Since $\phi_H=\phi_{1/2}/2$ , Proposition Proposition: Geodesic Convexity of Power Divergences shows that the corresponding flat-reference internal energy is geodesically convex in dimension $d=1$ . The same $\phi$ -divergence can nevertheless fail to be geodesically convex once the reference is nonuniform.

A concrete Gaussian test shows the obstruction. Let $\beta=\mathcal N(0,1)$ on $\mathbb R$ and $\alpha_m=\mathcal N(m,1)$ . The family $m\mapsto\alpha_m$ is a $\Wass_2$ -geodesic family, since the geodesic between $\alpha_{m_0}$ and $\alpha_{m_1}$ is $\alpha_{(1-t)m_0+t m_1}$ . Writing

r_m(x)=\frac{\d\alpha_m}{\d\beta}(x)=\exp(mx-m^2/2),

(86)

one obtains

D_{\phi_H}(\alpha_m|\beta) = \int(\sqrt{r_m}-1)^2\,\d\beta = 2\left(1-e^{-m^2/8}\right).

(87)

Its second derivative is

\frac{\d^2}{\d m^2}D_{\phi_H}(\alpha_m|\beta) = \frac12 e^{-m^2/8}\left(1-\frac{m^2}{4}\right),

(88)

which is negative for $|m|>2$ . Thus the Hellinger $\phi$ -divergence to a Gaussian target is not even 0-geodesically convex. More generally, Gaussian translations reduce the question for any $\phi$ to ordinary convexity of

m\mapsto \mathbb E_{X\sim\mathcal N(0,1)} \left[\phi\left(e^{mX-m^2/2}\right)\right],

(89)

which need not hold. The KL integrand is special because this function is $m^2/2$ , up to the usual normalization.
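The Gaussian computation is easy to confirm by quadrature. The sketch below assumes a uniform grid on $[-15,15]$ and a central finite difference with step `h = 1e-3`; these numerical choices and the function names are not from the text. It compares the Hellinger divergence with the closed form $2(1-e^{-m^2/8})$ and evaluates the sign of the curvature on both sides of $|m|=2$ .

```python
import numpy as np

# Quadrature grid for expectations under beta = N(0, 1); an assumption of this sketch.
x = np.linspace(-15.0, 15.0, 6001)
dx = x[1] - x[0]
gauss = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def hellinger_numeric(m):
    """int (sqrt(r_m) - 1)^2 d(beta) with r_m(x) = exp(m x - m^2 / 2)."""
    sqrt_ratio = np.exp(0.5 * m * x - 0.25 * m**2)
    return np.sum((sqrt_ratio - 1.0) ** 2 * gauss) * dx

def hellinger_closed(m):
    """Closed form 2 (1 - exp(-m^2 / 8))."""
    return 2.0 * (1.0 - np.exp(-m**2 / 8.0))

def second_derivative(m, h=1e-3):
    """Central finite difference of the numerical divergence in the shift m."""
    return (hellinger_numeric(m + h) - 2.0 * hellinger_numeric(m)
            + hellinger_numeric(m - h)) / h**2

shifts = [0.5, 1.0, 2.0, 3.0, 4.0]
max_gap = max(abs(hellinger_numeric(m) - hellinger_closed(m)) for m in shifts)
curvature_inside = second_derivative(1.0)   # |m| < 2
curvature_outside = second_derivative(3.0)  # |m| > 2
```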

Geodesically Convex Constraints¶

Geodesic convexity is also useful for hard constraints. If a feasible set is geodesically convex, then adding its indicator to an energy does not destroy the variational structure used by minimizing movements and convergence arguments. The density cap from (26) is the model example: it is numerically delicate, because one must compute a projection or a pressure, but it is geometrically compatible with Wasserstein descent.

Example: Squared Wasserstein distance need not be geodesically convex

The squared distance to a fixed measure is not geodesically convex in general. This is a sharp contrast with Hilbert spaces, where $x\mapsto\norm{x-y}^2$ is convex along line segments. In $\Pp_2(\RR^d)$ , $d\geq2$ , take

\beta=\frac12\de_{(-1/2,-1/2)}+\frac12\de_{(1/2,1/2)}, \quad \alpha_0=\frac12\de_{(-1,0)}+\frac12\de_{(1,0)}, \quad \alpha_1=\frac12\de_{(0,-1)}+\frac12\de_{(0,1)} .

(93)

One optimal coupling between $\alpha_0$ and $\alpha_1$ matches $(-1,0)$ with $(0,1)$ and $(1,0)$ with $(0,-1)$ . The associated geodesic has midpoint

\alpha_{1/2}=\frac12\de_{(-1/2,1/2)}+\frac12\de_{(1/2,-1/2)} .

(94)

Then

\Wass_2^2(\alpha_0,\beta) = \Wass_2^2(\alpha_1,\beta) = \frac12, \qquad \Wass_2^2(\alpha_{1/2},\beta) = 1,

(95)

which violates geodesic convexity. Thus objectives built from squared Wasserstein distances can be non-convex in the Wasserstein geometry itself. This is one reason why optimal quantization, which minimizes $\Wass_2^2(\alpha,\nu)$ over $m$ -atomic measures $\nu$ , leads to non-convex algorithms such as Lloyd’s method; see Section Optimal Quantization and Algorithm Algorithm: Lloyd quantization.
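The three distances in this example can be recomputed by brute force. The sketch below relies on the fact that, for uniform measures with the same number of atoms, an optimal coupling can be taken to be a permutation; the function name `w2_squared` is a hypothetical helper.

```python
import itertools
import numpy as np

def w2_squared(x, y):
    """Squared W_2 between uniform measures on the rows of x and y (same number of atoms)."""
    n = len(x)
    costs = [
        np.mean(np.sum((x - y[list(perm)]) ** 2, axis=1))
        for perm in itertools.permutations(range(n))
    ]
    return min(costs)

beta = np.array([[-0.5, -0.5], [0.5, 0.5]])
alpha0 = np.array([[-1.0, 0.0], [1.0, 0.0]])
# Rows of alpha1 are ordered to encode the coupling (-1,0) -> (0,1), (1,0) -> (0,-1).
alpha1 = np.array([[0.0, 1.0], [0.0, -1.0]])

midpoint = 0.5 * alpha0 + 0.5 * alpha1   # atoms (-1/2, 1/2) and (1/2, -1/2)

d0 = w2_squared(alpha0, beta)
d1 = w2_squared(alpha1, beta)
d_mid = w2_squared(midpoint, beta)
chord = 0.5 * d0 + 0.5 * d1              # value allowed by geodesic convexity
coupling_cost = np.mean(np.sum((alpha0 - alpha1) ** 2, axis=1))
```

Here `d_mid` exceeds `chord`, and `coupling_cost` equals `w2_squared(alpha0, alpha1)`, which confirms that the chosen matching is optimal.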

Remark: Target discrepancies are usually semiconcave

The preceding example is a useful warning. Losses of the form $\alpha\mapsto D(\alpha,\beta)$ , where $\beta$ is a fixed target distribution, are rarely geodesically convex in the transport geometry. This is unfortunate for generative modeling: training a distribution to match data by decreasing $\Wass_2^2(\alpha,\beta)$ , $\SW_2^2(\alpha,\beta)$ , or similar discrepancies should not be expected to inherit the convex behavior of Euclidean least squares.

The squared Wasserstein loss has the opposite one-sided curvature: it is 2-geodesically semiconcave along $\Wass_2$ geodesics. More precisely, if $(\alpha_t)_{t\in[0,1]}$ is a constant-speed $\Wass_2$ geodesic, then

\Wass_2^2(\alpha_t,\beta) \geq (1-t)\Wass_2^2(\alpha_0,\beta) + t\Wass_2^2(\alpha_1,\beta) - t(1-t)\Wass_2^2(\alpha_0,\alpha_1).

(96)

This is the standard 2-geodesic semiconcavity convention for squared distances, often informally called “2-concavity”: the graph may fall below the chord, but only by the quadratic correction $t(1-t)\Wass_2^2(\alpha_0,\alpha_1)$ . Equivalently, $-\Wass_2^2(\cdot,\beta)$ is $(-2)$ -geodesically convex in the convention of Definition Definition: Geodesic Convexity.

To prove the estimate, write $\alpha_t=((1-t)P_0+tP_1)_\sharp\pi_{01}$ , where $\pi_{01}$ is optimal between $\alpha_0$ and $\alpha_1$ . Fix $t$ , put $Z=(1-t)X_0+tX_1$ , take an optimal coupling between $\alpha_t$ and $\beta$ , and glue it with a disintegration of $\pi_{01}$ over $Z$ . This gives random variables $(X_0,X_1,Y)$ such that $(Z,Y)$ is optimal between $\alpha_t$ and $\beta$ , while $(X_0,X_1)$ is optimal between $\alpha_0$ and $\alpha_1$ . Since $(X_i,Y)$ are admissible couplings between $\alpha_i$ and $\beta$ ,

\Wass_2^2(\alpha_i,\beta)\leq \EE\norm{X_i-Y}^2, \qquad i=0,1.

(97)

The Euclidean identity

(1-t)\norm{X_0-Y}^2+t\norm{X_1-Y}^2 = \norm{Z-Y}^2+t(1-t)\norm{X_0-X_1}^2

(98)

then gives the claim after taking expectations.
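The semiconcavity inequality can also be tested on small random point clouds. This is a sanity check only, with assumptions specific to the example: uniform measures on five atoms in the plane, a brute-force search over permutations, and the hypothetical helper `optimal_matching`.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def optimal_matching(x, y):
    """Best permutation and squared W_2 between uniform measures on rows of x and y."""
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(len(x))):
        cost = np.mean(np.sum((x - y[list(perm)]) ** 2, axis=1))
        if cost < best_cost:
            best_perm, best_cost = list(perm), cost
    return best_perm, best_cost

n, d = 5, 2
x0, x1, y = (rng.normal(size=(n, d)) for _ in range(3))

perm01, w01 = optimal_matching(x0, x1)   # optimal coupling between alpha_0 and alpha_1
_, w0 = optimal_matching(x0, y)
_, w1 = optimal_matching(x1, y)

# Left-hand side minus right-hand side of the semiconcavity inequality, on a time grid.
margins = []
for t in np.linspace(0.0, 1.0, 21):
    xt = (1 - t) * x0 + t * x1[perm01]   # atoms of the geodesic alpha_t
    _, wt = optimal_matching(xt, y)
    margins.append(wt - ((1 - t) * w0 + t * w1 - t * (1 - t) * w01))
min_margin = min(margins)
```

A nonnegative `min_margin`, up to rounding, is the discrete counterpart of the estimate proved above.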

The sliced loss has the same flavor, but one must keep track of which geometry is used. The projection of a $\Wass_2$ geodesic need not be the monotone one-dimensional geodesic between the projected endpoint measures. The preceding proof nevertheless applies after projection, using the projected coupling induced by $\pi_{01}$ . With the normalization of Definition Definition: Sliced Wasserstein Distance, this gives, along the same $\Wass_2$ geodesic,

\SW_2^2(\alpha_t,\beta) \geq (1-t)\SW_2^2(\alpha_0,\beta) + t\SW_2^2(\alpha_1,\beta) - \frac{t(1-t)}{d}\Wass_2^2(\alpha_0,\alpha_1),

(99)

because the projection cost satisfies $\int_{\Sphere^{d-1}}\abs{\dotp{\theta}{x_0-x_1}}^2\d\sigma(\theta)=\norm{x_0-x_1}^2/d$ and $\pi_{01}$ is $\Wass_2$ -optimal. Thus this is a semiconcavity estimate for the sliced discrepancy along $\Wass_2$ geodesics, with a $\Wass_2$ -controlled correction. If one instead works in the idealized Hilbert embedding by projected quantile functions, straight sliced chords satisfy the exact identity with the last term $\SW_2^2(\alpha_0,\alpha_1)$ . The caveat, discussed in Intrinsic Sliced Length, is that such straight chords need not be realized by actual measures on $\RR^d$ .
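The factor $1/d$ comes from the spherical average of squared projections, which can be verified directly. The sketch assumes Monte Carlo sampling of uniform directions through normalized Gaussian vectors in dimension five, plus an exact average over equispaced angles in the plane; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_squared_projection(v, n_dirs=200_000):
    """Monte Carlo estimate of int |<theta, v>|^2 d sigma(theta) on the unit sphere."""
    theta = rng.normal(size=(n_dirs, v.size))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # uniform directions
    return np.mean((theta @ v) ** 2)

d = 5
v = rng.normal(size=d)
estimate = mean_squared_projection(v)
exact = np.dot(v, v) / d                 # predicted value |v|^2 / d

# In dimension 2 the average over equispaced angles is exact.
angles = 2.0 * np.pi * np.arange(64) / 64
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
w = np.array([0.3, -1.7])
planar_average = np.mean((circle @ w) ** 2)
```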

Dissipation and Finite Metric Length¶

Before asking for convergence rates, one should first extract the basic information already contained in energy dissipation. For the classical gradient flow of a smooth function $h:\RR^d\to\RR$ , $\dot x_t=-\nabla h(x_t)$ , the chain rule gives

\frac{\d}{\d t}h(x_t) = -\norm{\nabla h(x_t)}^2 = -\norm{\dot x_t}^2.

(102)

Thus, for $0\leq s<t$ , the Cauchy-Schwarz inequality gives

\int_s^t \norm{\dot x_r}\,\d r \leq \sqrt{t-s}\left(\int_s^t\norm{\dot x_r}^2\d r\right)^{1/2} = \sqrt{t-s}\,\bigl(h(x_s)-h(x_t)\bigr)^{1/2}.

(103)
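The Euclidean length estimate can be illustrated on a quadratic, where the flow is explicit. The curvatures `a`, the starting point and the polygonal discretization below are assumptions of this sketch. Since a polygonal length never exceeds the true length, the comparison with the bound is one-sided and safe.

```python
import numpy as np

a = np.array([0.5, 2.0, 5.0])        # assumed curvatures of h(x) = 0.5 * sum(a_i x_i^2)
x_start = np.array([1.0, -2.0, 0.5])

def flow(t):
    """Exact gradient flow of the quadratic h: each coordinate decays as exp(-a_i t)."""
    return x_start * np.exp(-a * t)

def h(x):
    """Quadratic energy h(x) = 0.5 * sum(a_i x_i^2)."""
    return 0.5 * np.sum(a * x**2)

def polygonal_length(s, t, n_segments=20_000):
    """Length of a fine polygonal approximation of the trajectory on [s, t]."""
    times = np.linspace(s, t, n_segments + 1)
    points = x_start[None, :] * np.exp(-a[None, :] * times[:, None])
    return np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1))

intervals = [(0.0, 0.1), (0.0, 1.0), (0.5, 3.0), (1.0, 10.0)]
lengths = [polygonal_length(s, t) for s, t in intervals]
# Right-hand side of the estimate: sqrt(t - s) * sqrt(h(x_s) - h(x_t)).
bounds = [np.sqrt(t - s) * np.sqrt(h(flow(s)) - h(flow(t))) for s, t in intervals]
```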

The Wasserstein analogue uses the metric derivative $|\dot\alpha_t|_{\Wass_2}$ of an absolutely continuous curve. The following estimate isolates the only ingredient needed at this stage: energy dissipation controls the metric length of the path.

Proposition: Finite Metric Length from Dissipation

Let $(\alpha_r)_{r\in[0,T]}$ be an absolutely continuous curve in $(\Pp_2(\RR^d),\Wass_2)$ . Assume that, for all $0\leq s<t\leq T$ , it satisfies the exact energy-dissipation identity

f(\alpha_s)-f(\alpha_t) = \int_s^t |\dot\alpha_r|_{\Wass_2}^2\d r .

(104)

Then

\operatorname{Length}_{\Wass_2}\bigl((\alpha_r)_{r\in[s,t]}\bigr) = \int_s^t |\dot\alpha_r|_{\Wass_2}\,\d r \leq \sqrt{t-s}\,\bigl(f(\alpha_s)-f(\alpha_t)\bigr)^{1/2}.

(105)

In the smooth Wasserstein gradient-flow setting with velocity $v_r=-\Wgrad f(\alpha_r)$ , the dissipation identity reads equivalently

f(\alpha_s)-f(\alpha_t) = \int_s^t|\dot\alpha_r|_{\Wass_2}^2\d r = \int_s^t\int \norm{v_r(x)}^2\d\alpha_r(x)\d r.

(106)

If one assumes only the metric energy-dissipation inequality for a curve of maximal slope, then the same length estimate holds with the right-hand side multiplied by $\sqrt2$ .

Proposition Proposition: Finite Metric Length from Dissipation is weaker than a convergence rate: it does not identify a minimizer and it gives no decay law. It does, however, rule out escape to infinity in finite time whenever the energy is bounded below on finite time intervals. Indeed, the trajectory segment remains in the $\Wass_2$ -ball centered at $\alpha_s$ with radius given by the right-hand side of (105). It also gives a concrete modulus of continuity. If $m_T\leq f(\alpha_r)$ for $0\leq r\leq T$ , then for $0\leq s<t\leq T$ ,

\Wass_2(\alpha_s,\alpha_t) \leq \operatorname{Length}_{\Wass_2}\bigl((\alpha_r)_{r\in[s,t]}\bigr) \leq \sqrt{f(\alpha_0)-m_T}\,\sqrt{t-s},

(108)

with an additional factor $\sqrt2$ under the energy-dissipation inequality. Thus an energy-dissipating trajectory is $1/2$ -Hölder continuous as a curve in Wasserstein space on every time interval where the energy remains bounded from below. Non-coercivity of the objective means that sublevel sets need not be compact, but an energy-dissipating curve still cannot travel an infinite $\Wass_2$ -distance in finite time without an infinite energy drop. The convex estimate below adds a first quantitative conclusion under geodesic convexity; the PL viewpoint then gives sharper coercive rates toward minimizers.

Proposition: Energy Decay for Convex Wasserstein Flows

Assume formally that $f$ is geodesically convex, admits a smooth first variation, and has a minimizer $\alpha^\star$ . Let $(\alpha_t)_t$ be a smooth solution of the Wasserstein gradient flow

\partial_t\alpha_t+\operatorname{div}(\alpha_t v_t)=0, \qquad v_t=-\Wgrad f(\alpha_t).

(109)

Then

\frac{\d}{\d t}f(\alpha_t) = -\int\norm{\Wgrad f(\alpha_t)(x)}^2\d\alpha_t(x) \leq0.

(110)

If $T_t$ is the optimal map from $\alpha_t$ to $\alpha^\star$ , then

f(\alpha_t)-f(\alpha^\star) \leq -\frac{\d}{\d t}\frac12\Wass_2^2(\alpha_t,\alpha^\star),

(111)

and consequently

f(\alpha_t)-f(\alpha^\star) \leq \frac{\Wass_2^2(\alpha_0,\alpha^\star)}{2t}.

(112)
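The $O(1/t)$ bound can be observed on a convex potential energy that is not strongly convex. The sketch below assumes $f(\alpha)=\int V\d\alpha$ with the illustrative choice $V(x)=x^4/4$ on the line, for which the particle flow $\dot x=-x^3$ has the explicit solution $x(t)=x_0/\sqrt{1+2x_0^2t}$ and the minimizer is $\delta_0$ ; the particle positions and names are hypothetical.

```python
import numpy as np

# Particles representing alpha_0; V(x) = x**4 / 4 is convex but not strongly convex.
x_init = np.linspace(-2.0, 2.0, 101)

def flow(t):
    """Exact solution of dx/dt = -V'(x) = -x**3 for every particle."""
    return x_init / np.sqrt(1.0 + 2.0 * x_init**2 * t)

def energy(x):
    """f(alpha) = int V d(alpha) for the uniform measure on the particles."""
    return np.mean(x**4 / 4.0)

# The minimizer is delta_0 with f = 0, and W_2^2(alpha_0, delta_0) = int x^2 d(alpha_0).
w2_squared_init = np.mean(x_init**2)

times = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
gaps = np.array([energy(flow(t)) for t in times])   # f(alpha_t) - f(alpha_star)
bounds = w2_squared_init / (2.0 * times)            # W_2^2(alpha_0, alpha_star) / (2 t)
```

In this example the gap actually decays like $t^{-2}$ , so the convex bound holds with room to spare; it is a worst-case rate over geodesically convex energies.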

Convergence: The Wasserstein-PL Viewpoint¶

In general, analyzing (8) is delicate. Geodesic convexity gives the familiar convex-gradient-flow picture, but rates are driven more directly by a first-order coercivity inequality. The relevant quantity is the squared Wasserstein slope

|\partial f|^2(\alpha) \eqdef \int\norm{\Wgrad f(\alpha)(x)}^2\,\d\alpha(x)

(118)

in the smooth formal setting. Along a smooth gradient flow, this is both the squared metric speed and the energy-dissipation term.

The terminology mirrors the finite-dimensional Polyak-Łojasiewicz condition introduced by Polyak Polyak, 1963 and now standard in nonconvex optimization Karimi et al., 2016. It is also part of the broader family of Łojasiewicz and Kurdyka-Łojasiewicz gradient inequalities used to prove convergence of dissipative dynamics Attouch et al., 2010; Hauer & Mazón, 2019; Dello Schiavo et al., 2024. The Wasserstein version simply replaces the Euclidean gradient norm by the metric slope associated with $\Wass_2$ .

Theorem: Wasserstein-PL Convergence

Assume that $f_\star>-\infty$ , that $f$ satisfies the Wasserstein-PL inequality (119) with constant $\kappa>0$ , and that $(\alpha_t)_{t\geq0}$ is a global Wasserstein gradient flow satisfying the full energy-dissipation identity

\frac{\d}{\d t}f(\alpha_t) = -|\dot\alpha_t|^2 = -|\partial f|^2(\alpha_t) \qquad\text{for a.e. }t>0 .

(121)

Set $E(t)\eqdef f(\alpha_t)-f_\star$ . Then, for $0\leq s\leq t$ ,

E(t)\leq e^{-2\kappa(t-s)}E(s),

(122)

and the length of the flow segment satisfies

\operatorname{Length}_{\Wass_2}\bigl((\alpha_r)_{r\in[s,t]}\bigr) \leq \sqrt{\frac{2}{\kappa}}\bigl(\sqrt{E(s)}-\sqrt{E(t)}\bigr).

(123)

If, in addition, $f$ is lower semicontinuous for $\Wass_2$ -convergence, then $\alpha_t$ converges in $\Wass_2$ to a minimizer $\alpha_\infty\in\operatorname{Argmin} f$ , and

\Wass_2(\alpha_t,\alpha_\infty) \leq \sqrt{\frac{2}{\kappa}}\sqrt{E(t)} \leq \sqrt{\frac{2}{\kappa}}\sqrt{E(0)}\,e^{-\kappa t}.

(124)

In particular,

\operatorname{dist}_{\Wass_2}(\alpha_t,\operatorname{Argmin}f) \leq \sqrt{\frac{2}{\kappa}}\sqrt{E(0)}\,e^{-\kappa t}.

(125)

The energy estimate is immediate from the full energy-dissipation identity and the Wasserstein-PL inequality:

E'(t) = -|\partial f|^2(\alpha_t) \leq -2\kappa E(t)

(126)

for almost every $t$ . Gronwall’s lemma gives $E(t)\leq e^{-2\kappa(t-s)}E(s)$ .

The distance estimate uses the same identities differently. On the set where $E(r)>0$ , PL gives

|\partial f|(\alpha_r)\geq \sqrt{2\kappa E(r)}.

(127)

Using the full dissipation identity and $|\dot\alpha_r|=|\partial f|(\alpha_r)$ ,

\begin{aligned} \operatorname{Length}_{\Wass_2}\bigl((\alpha_r)_{r\in[s,t]}\bigr) &= \int_s^t|\dot\alpha_r|\,\d r = \int_s^t|\partial f|(\alpha_r)\,\d r \\ &= \int_s^t \frac{|\partial f|^2(\alpha_r)}{|\partial f|(\alpha_r)}\,\d r \leq \frac{1}{\sqrt{2\kappa}} \int_s^t \frac{|\partial f|^2(\alpha_r)}{\sqrt{E(r)}}\,\d r \\ &= \frac{1}{\sqrt{2\kappa}} \int_s^t \frac{-E'(r)}{\sqrt{E(r)}}\,\d r = \sqrt{\frac{2}{\kappa}}\bigl(\sqrt{E(s)}-\sqrt{E(t)}\bigr). \end{aligned}

(128)

If $E$ vanishes on a subinterval, then the flow is stationary there, and the same estimate follows by applying the argument on the part where $E>0$ . Since Wasserstein distance is bounded by metric length, the same upper bound controls $\Wass_2(\alpha_s,\alpha_t)$ .

Letting $t\to+\infty$ , the energy estimate gives $E(t)\to0$ , and the length estimate shows that the tail of $(\alpha_t)_t$ is Cauchy. Since $(\Pp_2(\RR^d),\Wass_2)$ is complete, there is a limit $\alpha_\infty$ . Lower semicontinuity yields

f(\alpha_\infty) \leq \liminf_{t\to+\infty}f(\alpha_t) = f_\star,

(129)

hence $\alpha_\infty\in\operatorname{Argmin}f$ . Fixing the starting time of the length estimate at $t$ and sending its endpoint to $+\infty$ gives

\Wass_2(\alpha_t,\alpha_\infty) \leq \sqrt{\frac{2}{\kappa}}\sqrt{E(t)}.

(130)

Combining this with the exponential energy decay gives the stated distance rate. The distance-to-the-set bound follows because $\alpha_\infty$ is one minimizer.
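The three conclusions of the theorem can be checked on a quadratic potential energy, where everything is explicit. The sketch assumes $f(\alpha)=\int V\d\alpha$ with $V(x)=\frac12\sum_i a_ix_i^2$ , so that the Wasserstein-PL inequality $|\partial f|^2\geq2\kappa(f-f_\star)$ holds with $\kappa=\min_i a_i$ , the flow moves each particle by $x\mapsto e^{-a t}x$ , and the limit is $\delta_0$ . The curvatures, sample and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
a = np.array([1.0, 3.0])             # assumed curvatures of V(x) = 0.5 * sum(a_i x_i^2)
kappa = a.min()                      # PL constant of f(alpha) = int V d(alpha)
x_init = rng.normal(loc=[2.0, -1.0], scale=1.0, size=(500, 2))

def particles(t):
    """Exact Wasserstein gradient flow: each particle follows dx/dt = -grad V(x)."""
    return x_init * np.exp(-a * t)

def energy_gap(x):
    """E = f(alpha) - f_star, with f_star = 0 attained at delta_0."""
    return np.mean(0.5 * np.sum(a * x**2, axis=1))

def squared_slope(x):
    """|partial f|^2(alpha) = int |grad V|^2 d(alpha)."""
    return np.mean(np.sum((a * x) ** 2, axis=1))

def dist_to_minimizer(x):
    """W_2(alpha, delta_0) for the uniform measure on the particles."""
    return np.sqrt(np.mean(np.sum(x**2, axis=1)))

times = np.linspace(0.0, 3.0, 13)
E = np.array([energy_gap(particles(t)) for t in times])
pl_margin = np.array([squared_slope(particles(t)) - 2 * kappa * energy_gap(particles(t))
                      for t in times])
energy_bound = np.exp(-2 * kappa * times) * E[0]
dist = np.array([dist_to_minimizer(particles(t)) for t in times])
dist_bound = np.sqrt(2 / kappa) * np.sqrt(E)
```

When all curvatures coincide, the distance bound becomes an equality, so the constant $\sqrt{2/\kappa}$ cannot be improved in this family.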

The exponential decay of the energy gap is the direct metric-gradient-flow analogue of the classical PL argument. The tail-length estimate used in the theorem above is a convenient way to turn this energy decay into a $\Wass_2$ -distance-to- $\operatorname{Argmin}f$ bound; this exact formulation was communicated to us by Raphaël Barboni. More refined accounts of KL/PL-type inequalities, global convergence and rates for gradient flows and proximal point sequences in general metric spaces are given by Hauer and Mazón Hauer & Mazón, 2019 and by Dello Schiavo, Maas and Pedrotti Dello Schiavo et al., 2024.

Geodesic Strong Convexity¶

Strong geodesic convexity is the cleanest geometric assumption behind the Wasserstein-PL inequality: it turns displacement convexity into a quantitative lower bound on the metric slope, so that the abstract convergence theorem above applies automatically.

Together, Proposition Proposition: Strong Geodesic Convexity Implies Wasserstein-PL and Theorem Theorem: Wasserstein-PL Convergence recover the exponential energy rate $e^{-2\lambda t}$ for Wasserstein gradient flows of $\lambda$ -geodesically convex energies with $\lambda>0$ , and also give the distance rate $e^{-\lambda t}$ toward the minimizer selected by the flow. The PL formulation is useful because it separates the first-order energy-dissipation mechanism from the stronger curvature requirement of geodesic convexity.

Remark: Convergence rates for classical examples

The examples needed to apply this implication have already been isolated above. Proposition Proposition: Geodesic Convexity of Linear and Quadratic Energies gives $\lambda$ -geodesic convexity for linear energies generated by $\lambda$ -strongly convex potentials, and for the polynomial and pairwise interaction energies covered there when the corresponding finite-dimensional integrand is strongly convex. Theorem Theorem: McCann Displacement Convexity for Internal Energies gives displacement convexity of internal energies, including entropy-type examples, and Proposition Proposition: Geodesic Convexity of Power Divergences adds the important nonflat case $\KL(\cdot|\beta)$ when $\d\beta=Z^{-1}e^{-V}\d x$ and $V$ is strongly convex. Whenever these criteria yield a positive convexity constant, Proposition Proposition: Strong Geodesic Convexity Implies Wasserstein-PL turns it into a Wasserstein-PL inequality, and Theorem Theorem: Wasserstein-PL Convergence gives the corresponding linear convergence rates.

Wasserstein Kurdyka-Lojasiewicz Inequalities¶

The PL condition is the quadratic case of a broader slope--energy principle. The classical starting point is the gradient inequality of Łojasiewicz for real-analytic functions Łojasiewicz, 1963, later extended by Kurdyka to functions definable in o-minimal structures Kurdyka, 1998. Modern nonsmooth optimization uses the same idea, under the name Kurdyka-Łojasiewicz property, to prove convergence of descent algorithms and subgradient flows; see, for instance, Bolte, Daniilidis, Ley and Mazet Bolte et al., 2010 and Attouch, Bolte, Redont and Soubeyran Attouch et al., 2010. For displacement-convex functionals, Bolte and Blanchet Bolte & Blanchet, 2016 develop a family of Łojasiewicz-type functional inequalities in Wasserstein space and relate them to convergence of the associated gradient dynamics. In the metric-space setting relevant here, the Euclidean gradient norm is replaced by the metric slope, as in the general theory of curves of maximal slope and recent KL/PL convergence results for metric gradient flows Hauer & Mazón, 2019; Dello Schiavo et al., 2024.

The Kurdyka-Lojasiewicz viewpoint replaces the linear relation between the energy gap and the squared slope by a power law. It is useful for degenerate convex landscapes, homogeneous energies, and transport discrepancies whose minimizers form a flat set; in those cases the same energy-dissipation computation still gives convergence, but typically with polynomial rather than exponential rates. We use “KL” in this paragraph for Kurdyka-Lojasiewicz, not for Kullback-Leibler divergence.

Definition: Wasserstein Kurdyka-Lojasiewicz Inequality

Let $f_\star=\inf f$ , let $E(\alpha)=f(\alpha)-f_\star$ , and fix $c>0$ , $\theta\in[0,1)$ . We say that $f$ satisfies a power Wasserstein-KL inequality on a set $\mathcal U\subset \Pp_2(\RR^d)$ if

|\partial f|(\alpha) \geq c\,E(\alpha)^\theta, \qquad \alpha\in\mathcal U, \quad 0<E(\alpha)<+\infty .

(136)

Equivalently, the desingularizing function

\psi(s)=\frac{s^{1-\theta}}{c(1-\theta)}

(137)

satisfies $\psi'(E(\alpha))|\partial f|(\alpha)\geq 1$ . The Wasserstein-PL inequality is the special case $\theta=1/2$ with $c=\sqrt{2\kappa}$ . The regime $\theta>1/2$ gives sublinear rates, while $\theta=1/2$ is exactly the exponential PL regime.

Theorem: Sublinear Convergence Under Wasserstein-KL

Let $(\alpha_t)_{t\geq0}$ be a Wasserstein gradient flow of $f$ satisfying the full energy-dissipation identity. Assume that $f_\star> -\infty$ , that $E(t)=f(\alpha_t)-f_\star$ is finite, and that the trajectory stays in a region where the Wasserstein-KL inequality of Definition Definition: Wasserstein Kurdyka-Lojasiewicz Inequality holds with $\theta\in(1/2,1)$ . If $E(0)>0$ , then

E(t) \leq \left(E(0)^{1-2\theta}+c^2(2\theta-1)t\right)^{-1/(2\theta-1)} .

(138)

Moreover, for every $t\geq0$ ,

\operatorname{Length}_{\Wass_2}\bigl((\alpha_s)_{s\geq t}\bigr) \leq \frac{E(t)^{1-\theta}}{c(1-\theta)}.

(139)

If $f$ is lower semicontinuous on $(\Pp_2(\RR^d),\Wass_2)$ , then $\alpha_t$ converges to some $\alpha_\infty\in\operatorname{Argmin}f$ and

\Wass_2(\alpha_t,\alpha_\infty) \leq \frac{E(t)^{1-\theta}}{c(1-\theta)} = O\!\left(t^{-(1-\theta)/(2\theta-1)}\right).

(140)

Theorem Theorem: Sublinear Convergence Under Wasserstein-KL is the Wasserstein analogue of the standard KL-rate argument for curves of maximal slope in metric spaces; see Attouch, Bolte, Redont and Soubeyran Attouch et al., 2010, Hauer and Mazón Hauer & Mazón, 2019, and Dello Schiavo, Maas and Pedrotti Dello Schiavo et al., 2024. The endpoint $\theta=1/2$ is the PL theorem above. Exponents $\theta>1/2$ correspond to flatter landscapes and slower polynomial rates; formally, $\theta<1/2$ would instead lead to finite-time convergence in the scalar comparison inequality.

Proposition: Powers of PL Energies Are KL

Let $g\geq0$ satisfy $\inf g=0$ and the Wasserstein-PL inequality

|\partial g|^2(\alpha) \geq 2\kappa g(\alpha)

(145)

on a region where $g>0$ . Fix $r\geq1$ and set $f(\alpha)=g(\alpha)^r$ . Whenever the metric-slope chain rule $|\partial(g^r)|(\alpha)=r g(\alpha)^{r-1}|\partial g|(\alpha)$ holds, $f$ satisfies the Wasserstein-KL inequality

|\partial f|(\alpha) \geq r\sqrt{2\kappa}\, f(\alpha)^{1-1/(2r)}.

(146)

Thus $f$ has KL exponent $\theta=1-1/(2r)$ . For $r=1$ this is PL; for $r>1$ , Theorem Theorem: Sublinear Convergence Under Wasserstein-KL gives $f(\alpha_t)=O(t^{-r/(r-1)})$ along the $f$ -gradient flow, under its hypotheses.

Example: Homogeneous Wasserstein-KL Examples

Power laws first appear for potential energies. Let $V$ be a smooth potential bounded below, write $V_\star=\inf V$ , and set

f_V(\alpha)=\int V(x)\,\d\alpha(x), \qquad f_\star=V_\star .

(148)

At the formal smooth level, its Wasserstein slope is

|\partial f_V|^2(\alpha) = \int |\nabla V(x)|^2\,\d\alpha(x).

(149)

Hence, for exponents $\theta\in[1/2,1)$ , the pointwise KL inequality

|\nabla V(x)| \geq c\,(V(x)-V_\star)^\theta

(150)

is equivalent to the Wasserstein-KL inequality

|\partial f_V|(\alpha) \geq c\,(f_V(\alpha)-f_\star)^\theta

(151)

holding for all probability measures $\alpha$ . The implication from $V$ to $f_V$ follows from Jensen applied to $(V-V_\star)^{2\theta}$ , while the converse follows by testing on Dirac masses $\alpha=\delta_x$ .

For $q\geq2$ , the choice $V(x)=|x|^q$ gives the moment energy

f_q(\alpha)=\int |x|^q\,\d\alpha(x)

(152)

with minimizer $\delta_0$ . At the formal smooth level, its Wasserstein gradient is $\nabla |x|^q=q|x|^{q-2}x$ , hence

|\partial f_q|^2(\alpha) = q^2\int |x|^{2q-2}\,\d\alpha(x) \geq q^2 f_q(\alpha)^{2(q-1)/q},

(153)

where the last inequality is Jensen's inequality for the function $s\mapsto s^{2(q-1)/q}$, convex since $q\geq2$, applied to $s=|x|^q$. Thus $f_q$ satisfies a Wasserstein-KL inequality with $c=q$ and $\theta=(q-1)/q$. The case $q=2$ is PL and produces exponential contraction, while $q>2$ gives the polynomial rate $f_q(\alpha_t)=O(t^{-q/(q-2)})$.
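The polynomial rate for the moment energy can be observed on particles. The sketch below is only an illustration under stated assumptions: the measure is an empirical measure of hypothetical particle radii, the flow $\dot x=-q|x|^{q-2}x$ preserves directions so only the radius $r=|x|$ is tracked, and the closed form used for $r(t)$ solves $\dot r=-qr^{q-1}$ for $q>2$. The names `radii_at_time`, `moment_energy` and `squared_slope` are introduced here for illustration.

```python
import numpy as np

def radii_at_time(r0, q, t):
    """Closed-form radius at time t for dr/dt = -q r^(q-1), valid for q > 2."""
    return (r0 ** (-(q - 2)) + q * (q - 2) * t) ** (-1.0 / (q - 2))

def moment_energy(r, q):
    """f_q evaluated on the empirical measure with particle radii r."""
    return np.mean(r ** q)

def squared_slope(r, q):
    """Formal squared Wasserstein slope q^2 * mean(r^(2q-2))."""
    return q ** 2 * np.mean(r ** (2 * q - 2))

q = 4.0
rng = np.random.default_rng(0)
r0 = rng.uniform(0.5, 2.0, size=1000)   # illustrative initial radii

# Static check of the KL inequality with c = q and theta = (q - 1) / q.
kl_gap = squared_slope(r0, q) - q ** 2 * moment_energy(r0, q) ** (2 * (q - 1) / q)

# Dynamic check: log-log slope of the energy between two late times.
t1, t2 = 1e3, 1e4
f1 = moment_energy(radii_at_time(r0, q, t1), q)
f2 = moment_energy(radii_at_time(r0, q, t2), q)
observed_rate = np.log(f2 / f1) / np.log(t2 / t1)
predicted_rate = -q / (q - 2)
print(kl_gap >= 0, observed_rate, predicted_rate)
```

The observed log-log slope should approach $-q/(q-2)$ as the two sampling times grow, since the initial radii are eventually forgotten.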

The same homogeneity appears for powers of distances. On a geodesic metric space, the function $x\mapsto d(x,x_\star)^q$ has metric slope $q d(x,x_\star)^{q-1}$ away from the minimizer. Consequently, in the matching $\Wass_q$ geometry, the functional $\alpha\mapsto\Wass_q^q(\alpha,\beta)$ has formal metric slope $q\Wass_q(\alpha,\beta)^{q-1}=qf(\alpha)^{(q-1)/q}$ . In the present $\Wass_2$ geometry, this applies directly to powers $\Wass_2^q(\alpha,\beta)$ ; the notation $\Wass_q^q$ should be read as the analogous statement for the $q$ -Wasserstein metric.

Homogeneous interaction energies exhibit the same scaling. Define

f_q^{\rm int}(\alpha) = \frac12\iint |x-y|^q\,\d\alpha(x)\d\alpha(y), \qquad q\geq2 .

(154)

This energy is minimized exactly by Dirac masses, so the minimizer is not unique. Its first variation is

\frac{\delta f_q^{\rm int}}{\delta\alpha}(x) = \int |x-y|^q\,\d\alpha(y),

(155)

and its Wasserstein gradient is

G_\alpha(x) = q\int |x-y|^{q-2}(x-y)\,\d\alpha(y).

(156)

Writing $\bar x_\alpha=\int x\,\d\alpha(x)$ , one has $\int G_\alpha\,\d\alpha=0$ and, by symmetrization,

\int G_\alpha(x)\cdot(x-\bar x_\alpha)\,\d\alpha(x) = q f_q^{\rm int}(\alpha).

(157)

Moreover, Jensen gives

\int |x-\bar x_\alpha|^q\,\d\alpha(x) \leq \iint |x-y|^q\,\d\alpha(x)\d\alpha(y) = 2 f_q^{\rm int}(\alpha).

(158)

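Applying the Cauchy-Schwarz inequality to (157), then Hölder’s inequality in the form $\|\cdot\|_{L^2(\alpha)}\leq\|\cdot\|_{L^q(\alpha)}$ for $q\geq2$ together with (158), yields the non-sharp but useful bound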

|\partial f_q^{\rm int}|(\alpha) = \left(\int |G_\alpha|^2\,\d\alpha\right)^{1/2} \geq q\,2^{-1/q} f_q^{\rm int}(\alpha)^{(q-1)/q}.

(159)

Thus $f_q^{\rm int}$ satisfies a Wasserstein-KL inequality with exponent $\theta=(q-1)/q$ relative to the manifold of Dirac minimizers. For $q=2$ this is a PL inequality; for $q>2$ it gives the same polynomial KL rate as the moment energy.
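The symmetrization identity (157) and the slope bound (159) hold for empirical measures as well, which makes them easy to test. The following sketch assumes a hypothetical Gaussian point cloud and an arbitrary exponent $q=3$; the function name `interaction_quantities` is introduced only for this illustration.

```python
import numpy as np

def interaction_quantities(x, q):
    """Energy, Wasserstein gradient and slope of f_q^int at the empirical measure of x."""
    diff = x[:, None, :] - x[None, :, :]        # diff[i, j] = x_i - x_j
    dist = np.linalg.norm(diff, axis=-1)        # pairwise distances |x_i - x_j|
    energy = 0.5 * np.mean(dist ** q)           # (1/2) double integral of |x - y|^q
    grad = q * np.mean(dist[..., None] ** (q - 2) * diff, axis=1)   # G_alpha(x_i)
    slope = np.sqrt(np.mean(np.sum(grad ** 2, axis=1)))             # L^2(alpha) norm of G_alpha
    return energy, grad, slope

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))                   # illustrative particle cloud
q = 3.0
energy, grad, slope = interaction_quantities(x, q)

centered = x - x.mean(axis=0)
pairing = np.mean(np.sum(grad * centered, axis=1))       # left-hand side of (157)
lower_bound = q * 2 ** (-1 / q) * energy ** ((q - 1) / q)  # right-hand side of (159)
print(pairing, q * energy, slope, lower_bound)
```

The pairing matches $q f_q^{\rm int}(\alpha)$ up to roundoff, whereas the slope bound is typically far from tight, in line with its description as non-sharp.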

Convexity and Curvature¶

The same language is not restricted to subsets of $\RR^d$ . If $(\X,\dist,\mathfrak m)$ is a geodesic metric-measure space, $\Wass_2$ geodesics can be defined by transporting each pair of endpoints along metric geodesics, or more intrinsically by dynamical optimal plans on path space. Given a reference measure $\mathfrak m$ , the entropy relative to $\mathfrak m$ is

\mathrm{Ent}_{\mathfrak m}(\alpha) \eqdef \begin{cases} \displaystyle\int_\X\rho\log\rho\,\d\mathfrak m, &\text{if }\alpha=\rho\,\mathfrak m,\\ +\infty, &\text{otherwise.} \end{cases}

(160)

On a smooth Riemannian manifold $(M,g)$, the Ricci curvature tensor $\mathrm{Ric}_g$ is the trace of the Riemann curvature tensor. The lower bound $\mathrm{Ric}_g\geq\lambda g$ means that $\mathrm{Ric}_g(v,v)\geq\lambda |v|_g^2$ for every tangent vector $v$. The fundamental link between curvature and optimal transport is that this tensor lower bound is exactly encoded by $\lambda$-geodesic convexity, along $\Wass_2$ geodesics, of the entropy relative to the Riemannian volume measure.

This equivalence was developed in the smooth Riemannian setting by Cordero-Erausquin, McCann and Schmuckenschläger and by von Renesse and Sturm (Cordero-Erausquin et al., 2001; Renesse & Sturm, 2005); it is a central theme of the optimal-transport approach to curvature in Villani’s monograph (Villani, 2009). Lott-Villani and Sturm then used the same entropy-convexity principle to define synthetic lower Ricci curvature bounds on metric-measure spaces (Lott & Villani, 2009; Sturm, 2006; Sturm, 2006). Outside this convex, curvature-controlled regime, such as in the mean-field neural-network example below, the flow may still be informative but its convergence analysis requires problem-specific arguments.

Functional Inequalities via Optimal Transport¶

Optimal transport proves inequalities by turning comparison into geometry. One transports a density by a monotone or Brenier map, interpolates along the resulting rays, and converts concavity of a Jacobian determinant or convexity of an energy into an integral estimate. The first two examples illustrate this geometric mechanism. The remainder concerns entropy, where comparing an energy gap with its squared Wasserstein slope gives a Wasserstein-PL inequality and hence convergence rates through Theorem (Wasserstein-PL Convergence). We work in the smooth Euclidean setting so the main calculation remains visible; standard approximation arguments recover the usual Borel and Sobolev formulations. Systematic accounts include McCann’s displacement-convexity principle, the Riemannian interpolation inequalities of Cordero-Erausquin, McCann and Schmuckenschläger, and Villani’s treatment of concentration and functional inequalities (McCann, 1997; Cordero-Erausquin et al., 2001; Villani, 2009).

Brunn-Minkowski and Isoperimetry¶

The most geometric use of OT is to prove that transporting two uniform measures and looking at the interpolated support increases volume at least concavely. The perimeter inequality is then obtained by differentiating this volume estimate after adding a small ball.

Proposition: Brunn-Minkowski and Euclidean Isoperimetry

Let $A,B\subset\RR^d$ be bounded Borel sets with positive Lebesgue measure. For $t\in[0,1]$ , set

(1-t)A+tB\eqdef\{(1-t)x+ty:x\in A,\ y\in B\}.

(161)

Then

\operatorname{Vol}((1-t)A+tB)^{1/d} \geq (1-t)\operatorname{Vol}(A)^{1/d} + t\,\operatorname{Vol}(B)^{1/d}.

(162)

Consequently, if $A$ is smooth and bounded and $\omega_d=\operatorname{Vol}(B_1)$ , then

\operatorname{Per}(A) \geq d\,\omega_d^{1/d}\operatorname{Vol}(A)^{(d-1)/d}.

(163)

The constant is sharp, with equality for Euclidean balls.

We first give the argument for regular sets, the general Borel case following by approximation from inside and outside. Let $\alpha$ and $\beta$ be the uniform probability measures on $A$ and $B$ , and let $T=\nabla\phi$ be the Brenier map with $T_\sharp\alpha=\beta$ . The interpolating map $T_t=(1-t)\operatorname{Id}+tT$ sends $\alpha$ to a measure $\alpha_t$ supported in $(1-t)A+tB$ . At points where $T$ is differentiable,

\det DT_t = \det((1-t)\operatorname{Id}+tDT).

(164)

Since $DT$ is symmetric positive semidefinite and $\det^{1/d}$ is concave on the positive semidefinite cone,

\det(DT_t)^{1/d} \geq (1-t)+t\det(DT)^{1/d}.

(165)

The change-of-variables identity for the uniform measures gives $\det(DT)=\operatorname{Vol}(B)/\operatorname{Vol}(A)$ almost everywhere. If $\rho_t$ denotes the density of $\alpha_t$ , a second application of the same formula gives

\rho_t(T_t(x))^{-1/d} = \operatorname{Vol}(A)^{1/d}\det(DT_t(x))^{1/d} \geq c_t,

(166)

where $c_t=(1-t)\operatorname{Vol}(A)^{1/d}+t\operatorname{Vol}(B)^{1/d}$. Hence $\rho_t\leq c_t^{-d}$ almost everywhere, while $\alpha_t$ has unit mass and is supported in $(1-t)A+tB$. Therefore $1=\int\rho_t\,\d x\leq c_t^{-d}\operatorname{Vol}((1-t)A+tB)$, that is, $\operatorname{Vol}((1-t)A+tB)\geq c_t^d$, which is the displayed Brunn-Minkowski inequality.

For isoperimetry, use the homogeneity of Brunn-Minkowski and apply it to $A$ and $\epsilon B_1$ . Writing $A+\epsilon B_1$ for the parallel set gives

\operatorname{Vol}(A+\epsilon B_1)^{1/d} \geq \operatorname{Vol}(A)^{1/d}+\epsilon\omega_d^{1/d}.

(167)

Using $\operatorname{Vol}(A+\epsilon B_1) =\operatorname{Vol}(A)+\epsilon\,\operatorname{Per}(A)+o(\epsilon)$ and differentiating at $\epsilon=0$ yields the claimed perimeter bound. Balls attain equality, so the constant is optimal.
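The determinant-concavity step (165) is the only matrix inequality used in the transport proof, and it can be probed numerically. The sketch below is a sanity check under stated assumptions, not a proof: it samples hypothetical random symmetric positive semidefinite matrices playing the role of $DT$ and evaluates the gap between the two sides of (165) on a grid of times. The helper names `det_root` and `concavity_gap` are introduced here.

```python
import numpy as np

def det_root(M):
    """det(M)^(1/d) for a d x d matrix with nonnegative determinant."""
    d = M.shape[0]
    return np.linalg.det(M) ** (1.0 / d)

def concavity_gap(M, t):
    """det((1-t) Id + t M)^(1/d) - [(1-t) + t det(M)^(1/d)] for symmetric PSD M."""
    d = M.shape[0]
    return det_root((1 - t) * np.eye(d) + t * M) - ((1 - t) + t * det_root(M))

rng = np.random.default_rng(2)
gaps = []
for _ in range(200):
    B = rng.normal(size=(3, 3))
    M = B @ B.T                       # random symmetric positive semidefinite matrix
    for t in np.linspace(0.0, 1.0, 11):
        gaps.append(concavity_gap(M, t))
min_gap = min(gaps)
print(min_gap)   # nonnegative up to roundoff if the concavity inequality holds
```

The gap vanishes at $t=0$ and $t=1$ and is nonnegative in between, which is the pointwise statement that the proof integrates.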

The figure below isolates the determinant-concavity step in the special case where the optimal map is affine.

Interactive panel. Move the affine stretch and interpolation time to see determinant concavity behind the Brunn-Minkowski transport proof.

Brunn-Minkowski through an affine optimal-transport interpolation between two ellipses. For ellipses, the Brenier map is affine, so the transported support $T_t(A)$ remains an ellipse and its area is computed exactly from the determinant of $(1-t)\operatorname{Id}+tDT$ . The right panel shows that $|T_t(A)|^{1/2}$ lies above the linear interpolation of the endpoint square-root areas, which is the two-dimensional determinant concavity behind the proof of Brunn-Minkowski.


Prékopa-Leindler¶

The functional analogue of Brunn-Minkowski replaces sets by densities. The transport proof is the same argument with the Jacobian no longer constant.

Proposition: Prékopa-Leindler Inequality

Let $u,v,w:\RR^d\to[0,+\infty)$ be integrable, and let $t\in[0,1]$. Assume that

w((1-t)x+ty)\geq u(x)^{1-t}v(y)^t \qquad \text{for all }x,y\in\RR^d.

Then

\int_{\RR^d} w \geq \left(\int_{\RR^d}u\right)^{1-t} \left(\int_{\RR^d}v\right)^t .

Logarithmic Sobolev Inequalities¶

The logarithmic Sobolev inequality, introduced by Gross (Gross, 1975), controls relative entropy by its dissipation. Let $\beta=\rho_\beta\d x$ be a reference probability measure. For $\alpha=h\beta$, define

\mathcal I(\alpha|\beta) \eqdef \int \norm{\nabla\log h}^2\,\d\alpha = \int \frac{\norm{\nabla h}^2}{h}\,\d\beta,

(172)

with the lower-semicontinuous convention $\mathcal I(\alpha|\beta)=+\infty$ outside its natural domain. The measure $\beta$ satisfies a logarithmic Sobolev inequality with constant $\lambda>0$ if

\KL(\alpha|\beta) \leq \frac{1}{2\lambda}\mathcal I(\alpha|\beta) \qquad\text{for every probability measure }\alpha.

(173)

This is a coercivity property of the reference measure, not a universal inequality. Uniform convexity of the potential $V=-\log\rho_\beta$, namely $\nabla^2V\succeq\lambda\operatorname{Id}$, is a standard sufficient condition.

Gaussian Logarithmic Sobolev Inequality¶

Transport also gives a short proof of the sharp Gaussian logarithmic Sobolev inequality. In Cordero-Erausquin’s argument (Cordero-Erausquin, 2002), the Jacobian equation for the Brenier map is bounded by $\log\det A\leq\operatorname{tr}(A-\operatorname{Id})$, and Gaussian integration by parts converts the trace term into Fisher information.

For the entropy functional $f(\alpha)=\KL(\alpha|\gamma_d)$ , the squared Wasserstein slope is the Fisher information,

|\partial f|^2(\alpha) = \int \norm{\nabla\log\frac{\d\alpha}{\d\gamma_d}}^2\,\d\alpha.

(181)

Thus the Gaussian logarithmic Sobolev inequality is exactly the Wasserstein-PL inequality for $f$ with constant 1. This interpretation turns the static estimate into exponential convergence of the Ornstein-Uhlenbeck flow.

Poincaré as a Linearized Wasserstein-PL Inequality¶

Many quadratic functional inequalities arise by zooming in on a nonlinear slope inequality near equilibrium. Assume that $\beta$ minimizes an energy $f$ , and let $\xi$ be a smooth bounded density modulation with $\int \xi\,\d\beta=0$ . For $|\epsilon|$ small enough, set

\beta_\epsilon=(1+\epsilon\xi)\beta, \qquad \beta_0=\beta .

(182)

Define the second-order energy and dissipation forms of $f$ at $\beta$ in the direction $\xi$ by

\mathsf H^f_\beta(\xi) \eqdef \lim_{\epsilon\to0} \frac{2(f(\beta_\epsilon)-f(\beta))}{\epsilon^2}, \qquad \mathsf D^f_\beta(\xi) \eqdef \lim_{\epsilon\to0} \frac{|\partial f|^2(\beta_\epsilon)}{\epsilon^2},

(183)

whenever these finite limits exist. The first is the quadratic part of the energy gap and the second is the quadratic part of the squared Wasserstein slope. Their comparison is the linearized Wasserstein-PL inequality.

Example: Linearized Forms for Standard Discrepancies

The two forms can be evaluated explicitly under the indicated smoothness and integrability assumptions.

Suppose that $\beta$ has a smooth positive density and that $\phi$ is twice continuously differentiable near 1, with $\phi(1)=0$ and $\phi''(1)>0$ . For the Csiszár divergence $f(\alpha)=\Divergm_\phi(\alpha|\beta)$ , one has

\mathsf H^f_\beta(\xi) = \phi''(1)\int \xi^2\,\d\beta, \qquad \mathsf D^f_\beta(\xi) = \phi''(1)^2\int \norm{\nabla\xi}^2\,\d\beta .

(187)

The affine part of $\phi$ is immaterial on probability measures, so only the curvature $\phi''(1)$ enters the linearization. For KL, $\phi(r)=r\log r-r+1$ , hence $\phi''(1)=1$ , and the linearized PL inequality becomes the Poincaré inequality.

For $f(\alpha)=\MMD_k^2(\alpha,\beta)$ , using the convention $\MMD_k^2(\alpha,\beta)=\iint k\,\d(\alpha-\beta)\d(\alpha-\beta)$ , assume that $k$ is symmetric, differentiable in its first variable, and that the displayed integrals are finite. Then

\mathsf H^f_\beta(\xi) = 2\iint k(x,y)\xi(x)\xi(y)\,\d\beta(x)\d\beta(y),

(188)

and

\mathsf D^f_\beta(\xi) = 4\int \norm{\int \nabla_x k(x,y)\xi(y)\,\d\beta(y)}^2 \d\beta(x).

(189)

Thus the energy Hessian is the kernel quadratic form, whereas the slope form differentiates the kernel witness spatially.

For $f(\alpha)=\Wass_2^2(\alpha,\beta)$ , the calculation is naturally written in the tangent space of $\Wass_2$ . If $\beta=\rho_\beta\d x$ is smooth and positive and the weighted Poisson problem

-\nabla\cdot(\rho_\beta\nabla u_\xi)=\xi\rho_\beta,

(190)

has a finite-energy solution $u_\xi$ , then

\mathsf H^f_\beta(\xi) = 2\int\norm{\nabla u_\xi}^2\,\d\beta, \qquad \mathsf D^f_\beta(\xi) = 4\int\norm{\nabla u_\xi}^2\,\d\beta .

(191)

The quadratic form $\int\norm{\nabla u_\xi}^2\d\beta$ is the negative Sobolev norm induced by the linearized Wasserstein metric.

For $f(\alpha)=\KL(\alpha|\beta)$, Proposition (Linearizing a Wasserstein-PL Inequality) says precisely that logarithmic Sobolev linearizes to Poincaré. Expanding $h_\epsilon=1+\epsilon\xi$ gives

\KL(h_\epsilon\beta|\beta) = \frac{\epsilon^2}{2}\int\xi^2\,\d\beta+o(\epsilon^2), \qquad \mathcal I(h_\epsilon\beta|\beta) = \epsilon^2\int\norm{\nabla\xi}^2\,\d\beta+o(\epsilon^2).

(192)
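The second-order expansions (192) can be checked by quadrature. The sketch below assumes a one-dimensional standard Gaussian reference measure and the bounded, mean-zero modulation $\xi(x)=\sin x$; both are illustrative choices, as is the grid. A plain rectangle rule is used because the integrands are smooth and decay rapidly.

```python
import numpy as np

# Uniform grid with standard Gaussian quadrature weights for d(beta).
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
w = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi) * dx

xi = np.sin(x)     # bounded modulation, mean zero under beta by symmetry
dxi = np.cos(x)    # its derivative

def entropy_and_fisher(eps):
    """KL(h beta | beta) and I(h beta | beta) for h = 1 + eps * xi."""
    h = 1.0 + eps * xi
    kl = np.sum(h * np.log(h) * w)
    fisher = np.sum((eps * dxi) ** 2 / h * w)
    return kl, fisher

eps = 1e-2
kl, fisher = entropy_and_fisher(eps)
H_form = 2 * kl / eps ** 2       # approximates the energy form, int xi^2 d(beta)
D_form = fisher / eps ** 2       # approximates the dissipation form, int |xi'|^2 d(beta)
xi_sq = np.sum(xi ** 2 * w)
dxi_sq = np.sum(dxi ** 2 * w)
print(H_form, xi_sq, D_form, dxi_sq)
```

For this choice the two limits are $(1-e^{-2})/2$ and $(1+e^{-2})/2$, so the Gaussian Poincaré inequality with $\lambda=1$ holds with room to spare.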

Proposition: Poincaré from Logarithmic Sobolev

Let $\beta$ be a smooth probability measure on $\RR^d$ . Assume that $\beta$ satisfies the logarithmic Sobolev inequality with constant $\lambda>0$ ,

\KL(\alpha|\beta) \leq \frac{1}{2\lambda}\,\mathcal I(\alpha|\beta), \qquad \mathcal I(\alpha|\beta) \eqdef \int\norm{\nabla\log\frac{\d\alpha}{\d\beta}}^2\,\d\alpha .

(193)

Then, for every smooth $\xi$ with $\int\xi\,\d\beta=0$ ,

\lambda\int\xi^2\,\d\beta \leq \int\norm{\nabla\xi}^2\,\d\beta .

(194)

In particular, if $\d\beta=\rho_\beta\d x=Z^{-1}e^{-V}\d x$ and $\nabla^2V\succeq\lambda\operatorname{Id}$, so that $\beta$ is $\lambda$-strongly log-concave, then the conclusion holds by the Bakry-Émery criterion (Bakry & Émery, 1985; Bakry et al., 2014).

If $\d\beta=\rho_\beta\d x=Z^{-1}e^{-V}\d x$ , the Poincaré inequality is equivalently a spectral gap for the $\beta$ -weighted Laplacian

L_\beta\xi \eqdef \Delta\xi-\dotp{\nabla V}{\nabla\xi} = \rho_\beta^{-1}\nabla\cdot(\rho_\beta\nabla\xi).

(197)

Indeed $L_\beta$ is symmetric in $L^2(\beta)$ and

\int\norm{\nabla\xi}^2\,\d\beta = \langle \xi,-L_\beta\xi\rangle_{L^2(\beta)}.

(198)

Thus Poincaré says that the bottom of the spectrum of $-L_\beta$ on mean-zero functions is at least $\lambda$. When the resolvent is compact, this is the first nonzero eigenvalue; on a noncompact space the spectral gap need not be attained by an eigenfunction. This interpretation, classical for reversible Markov diffusion generators (Ledoux, 2001; Bakry et al., 2014), is the infinitesimal shadow of the nonlinear Wasserstein-PL/log-Sobolev inequality.
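The spectral-gap reading can be made concrete in one dimension. The sketch below is a minimal finite-difference discretization under stated assumptions: the domain is truncated to a hypothetical interval $[-6,6]$ with natural boundary conditions, the mass matrix is lumped, and the function name `weighted_laplacian_spectrum` is introduced here. For $V(x)=x^2/2$ the operator $-L_\beta$ is the Ornstein-Uhlenbeck operator, whose spectrum is $0,1,2,\dots$.

```python
import numpy as np

def weighted_laplacian_spectrum(V, lo=-6.0, hi=6.0, n=400, k=3):
    """Smallest k eigenvalues of a finite-difference discretization of -L_beta in 1D."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    rho = np.exp(-V(x))                              # unnormalized density at the nodes
    rho_mid = np.exp(-V(0.5 * (x[:-1] + x[1:])))     # density at the edge midpoints
    mass = rho * dx                                  # lumped L^2(beta) mass
    # Stiffness matrix of the Dirichlet form: sum over edges of
    # rho_mid * (xi_{i+1} - xi_i)^2 / dx.
    K = np.zeros((n, n))
    idx = np.arange(n - 1)
    K[idx, idx] += rho_mid / dx
    K[idx + 1, idx + 1] += rho_mid / dx
    K[idx, idx + 1] -= rho_mid / dx
    K[idx + 1, idx] -= rho_mid / dx
    # Symmetric reduction of the generalized eigenproblem K v = mu * M v.
    scale = 1.0 / np.sqrt(mass)
    A = scale[:, None] * K * scale[None, :]
    return np.linalg.eigvalsh(A)[:k]

eigs = weighted_laplacian_spectrum(lambda x: 0.5 * x ** 2)
print(eigs)   # expected to be close to 0, 1, 2 for the Gaussian potential
```

The first eigenvalue corresponds to constants; the second is the discrete Poincaré constant, and the normalization $Z$ plays no role because it cancels between the stiffness and mass matrices.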

This interpretation becomes exact after freezing the Wasserstein mobility at equilibrium. For a mean-zero density perturbation $\dot h$ , define

\norm{\dot h}_{-1,\beta}^2 \eqdef \inf_v\left\{ \int\norm v^2\,\d\beta :\ \dot h=-\rho_\beta^{-1}\nabla\cdot(\rho_\beta v) \right\}.

(199)

For $E_2(h)=\frac12\chi^2(h\beta|\beta)=\frac12\int(h-1)^2\d\beta$ , integration by parts gives $|\partial E_2|_{-1,\beta}^2=\int\norm{\nabla h}^2\d\beta$ . Thus Poincaré is exactly PL for $E_2$ in the linearized Wasserstein geometry. It is not the full $\Wass_2$ -PL inequality for the same energy: the nonlinear mobility is $h\beta$ , and the corresponding squared slope is $\int h\norm{\nabla h}^2\d\beta$ .

Talagrand’s Transport-Entropy Inequality¶

The distance estimate in Theorem (Wasserstein-PL Convergence) reveals the general principle: logarithmic Sobolev with constant $\lambda$ yields $\Wass_2^2(\alpha,\beta)\leq2\KL(\alpha|\beta)/\lambda$. This is the Otto-Villani implication from logarithmic Sobolev to Talagrand’s $T_2$ inequality (Otto & Villani, 2000). For the standard Gaussian, the same Jacobian estimate gives a direct sharp proof of Talagrand’s original result (Talagrand, 1996; Cordero-Erausquin, 2002; Villani, 2009).

The HWI Bridge¶

The previous two Gaussian inequalities are not isolated facts. The HWI inequality of Otto and Villani (Otto & Villani, 2000) compares three quantities at once: the entropy $H$, the Wasserstein distance $W$, and the Fisher information $I$. It is one of the cleanest examples where displacement convexity turns into a functional inequality.

Proposition: Otto-Villani HWI Inequality

Let $\beta=\rho_\beta\d x=e^{-V}\d x/Z$ be a smooth probability measure in $\Pp_2(\RR^d)$ , and assume that $\nabla^2 V\succeq \lambda \operatorname{Id}$ for some $\lambda\in\RR$ . For an absolutely continuous probability measure $\alpha=\rho_\alpha\d x$ in $\Pp_2(\RR^d)$ , define the relative Fisher information

\mathcal I(\alpha|\beta) \eqdef \int \norm{\nabla\log\frac{\rho_\alpha}{\rho_\beta}}^2\,\d\alpha,

(205)

with the convention $\mathcal I(\alpha|\beta)=+\infty$ if $\alpha$ is not absolutely continuous or if the expression is not defined. Then

\KL(\alpha|\beta) \leq \Wass_2(\alpha,\beta)\sqrt{\mathcal I(\alpha|\beta)} -\frac{\lambda}{2}\Wass_2^2(\alpha,\beta).

(206)

When $\lambda>0$ , the HWI inequality is a bridge from curvature to PL. It is not itself a PL inequality because it still contains the distance term $\Wass_2(\alpha,\beta)$ . Optimizing the right-hand side, or simply using Young’s inequality $rs-\lambda r^2/2\leq s^2/(2\lambda)$ , gives the logarithmic Sobolev estimate

\KL(\alpha|\beta) \leq \frac{1}{2\lambda}\mathcal I(\alpha|\beta).

(210)

Since $\mathcal I(\cdot|\beta)=|\partial\KL(\cdot|\beta)|^2$ , this is the Wasserstein-PL inequality with constant $\lambda$ . This is precisely the curvature-to-logarithmic-Sobolev implication used above; the entropy-decay proposition below then converts it into exponential Fokker-Planck convergence.
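For one-dimensional Gaussians all three quantities have closed forms, which gives a quick consistency check of the HWI, logarithmic Sobolev and Talagrand inequalities against the standard Gaussian ($\lambda=1$). The sketch below assumes $\alpha=\mathcal N(m,s^2)$ on a hypothetical parameter grid; the function name `gaussian_functionals` is introduced here for illustration.

```python
import numpy as np

def gaussian_functionals(m, s):
    """Closed forms for alpha = N(m, s^2) relative to the standard Gaussian in 1D."""
    H = 0.5 * (s ** 2 + m ** 2 - 1.0) - np.log(s)    # KL(alpha | gamma_1)
    W = np.sqrt(m ** 2 + (s - 1.0) ** 2)             # Wass_2(alpha, gamma_1)
    I = (s ** 2 - 1.0) ** 2 / s ** 2 + m ** 2        # relative Fisher information
    return H, W, I

# Illustrative grid of means and standard deviations.
means, stds = np.meshgrid(np.linspace(-3, 3, 61), np.linspace(0.2, 4.0, 77))
H, W, I = gaussian_functionals(means, stds)

hwi_gap = W * np.sqrt(I) - 0.5 * W ** 2 - H   # HWI with lambda = 1
lsi_gap = 0.5 * I - H                         # logarithmic Sobolev
talagrand_gap = H - 0.5 * W ** 2              # Talagrand T_2
print(hwi_gap.min(), lsi_gap.min(), talagrand_gap.min())
```

All three gaps are nonnegative up to roundoff. The HWI gap vanishes for pure translations ($s=1$), which shows that the inequality cannot be improved in general.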

The figure below shows these inequalities on an exact one-dimensional Ornstein-Uhlenbeck relaxation, where all quantities can be evaluated accurately by grid quadrature and quantile inversion.

Interactive panel. Change the final time and mixture skew to recompute entropy, Fisher information, Wasserstein distance, and the HWI/log-Sobolev guides along an OU flow.

Functional inequalities along the Ornstein-Uhlenbeck flow from a one-dimensional Gaussian mixture to the standard Gaussian. The left panel shows the density relaxation, with the Gaussian target as a dashed curve. The middle panel compares $H=\KL(\alpha_t|\gamma_1)$ with the HWI upper bound $W\sqrt I-W^2/2$ , the logarithmic-Sobolev upper bound $I/2$ , and the Talagrand lower bound $W^2/2$ , where $W=\Wass_2(\alpha_t,\gamma_1)$ and $I=\mathcal I(\alpha_t|\gamma_1)$ . The right panel displays the dynamic consequence $H(t)\leq H(0)e^{-2t}$ .


Entropy Decay Along the Fokker-Planck Flow¶

Functional inequalities become convergence rates once they are combined with the energy-dissipation identity of a Wasserstein gradient flow. In the notation of Definition (Wasserstein-Polyak-Lojasiewicz Inequality), log-Sobolev is simply the PL inequality for relative entropy, and Theorem (Wasserstein-PL Convergence) gives both energy decay and convergence in $\Wass_2$. This is why the previous inequalities are not merely static estimates: they quantify relaxation of the Fokker-Planck equation.

The logarithmic Sobolev assumption below is the curvature-controlled mechanism introduced above. With the convention used here, the standard Gaussian $\gamma_d=\mathcal N(0,\Id)$ satisfies it with $\lambda=1$, exactly as in the Gaussian logarithmic Sobolev inequality above. More generally, if $\beta=Z^{-1}e^{-V}\d x$ and $\nabla^2V\succeq\lambda\Id$, then the Bakry-Émery criterion gives the logarithmic Sobolev inequality with constant $\lambda$. Thus the hypothesis covers strongly log-concave targets, and it is the functional-inequality counterpart of the $\lambda$-geodesic convexity of $\KL(\cdot|\beta)$ discussed in Section Geodesic Convexity and Convergence.

Proposition: Log-Sobolev Inequality Implies Entropy Decay

Let $\beta=e^{-V}\d x/Z$ be a smooth probability measure on $\RR^d$. Assume that $\beta$ satisfies the logarithmic Sobolev inequality with constant $\lambda>0$,

\KL(\alpha|\beta) \leq \frac{1}{2\lambda} \int \norm{\nabla\log\frac{\d\alpha}{\d\beta}}^2\,\d\alpha .

Let $\alpha_t$ solve the Wasserstein gradient flow of $f(\alpha)=\KL(\alpha|\beta)$ with finite initial entropy, namely

\partial_t\rho_t = \nabla\cdot(\rho_t\nabla V)+\Delta\rho_t, \qquad \alpha_t=\rho_t\d x .

Then

\KL(\alpha_t|\beta) \leq e^{-2\lambda t}\KL(\alpha_0|\beta).

Moreover, if $\alpha_0,\beta\in\Pp_2(\RR^d)$, then

\Wass_2(\alpha_t,\beta) \leq \sqrt{\frac{2}{\lambda}\KL(\alpha_0|\beta)}\,e^{-\lambda t}.

Remark: Logarithmic Sobolev as Wasserstein-PL

For $f(\alpha)=\KL(\alpha|\beta)$ , the Wasserstein gradient is

\Wgrad f(\alpha) = \nabla\log\frac{\d\alpha}{\d\beta},

(213)

and its squared Wasserstein slope is precisely the relative Fisher information,

|\partial f|^2(\alpha)=\mathcal I(\alpha|\beta).

(214)

Since the unique minimizer is $\beta$ and $f(\beta)=0$, the logarithmic Sobolev inequality with constant $\lambda$ reads exactly

|\partial f|^2(\alpha) \geq 2\lambda\bigl(f(\alpha)-f(\beta)\bigr),

(215)

that is, the Wasserstein-PL inequality of Definition (Wasserstein-Polyak-Lojasiewicz Inequality). Thus logarithmic Sobolev is the functional-inequality incarnation of the convergence mechanism studied in Section Geodesic Convexity and Convergence, and Theorem (Wasserstein-PL Convergence) converts it into exponential decay of both entropy and Wasserstein distance to equilibrium.
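The exponential decay of entropy and Wasserstein distance can be followed exactly on Gaussian data. The sketch below assumes $V(x)=x^2/2$ in dimension one, so that the Fokker-Planck flow is the Ornstein-Uhlenbeck flow and $\lambda=1$, and a hypothetical Gaussian initial condition $\mathcal N(m_0,s_0^2)$, which stays Gaussian along the flow. All function names are introduced here for illustration.

```python
import numpy as np

def ou_gaussian(m0, s0, t):
    """Mean and standard deviation at time t of the OU flow started at N(m0, s0^2)."""
    mean = m0 * np.exp(-t)
    var = 1.0 + (s0 ** 2 - 1.0) * np.exp(-2.0 * t)
    return mean, np.sqrt(var)

def kl_to_standard(m, s):
    """KL(N(m, s^2) | N(0, 1))."""
    return 0.5 * (s ** 2 + m ** 2 - 1.0) - np.log(s)

def w2_to_standard(m, s):
    """Wass_2(N(m, s^2), N(0, 1))."""
    return np.sqrt(m ** 2 + (s - 1.0) ** 2)

m0, s0 = 2.0, 0.3                      # illustrative initial Gaussian
t = np.linspace(0.0, 5.0, 101)
m_t, s_t = ou_gaussian(m0, s0, t)

entropy = kl_to_standard(m_t, s_t)
distance = w2_to_standard(m_t, s_t)
entropy_bound = np.exp(-2.0 * t) * kl_to_standard(m0, s0)
distance_bound = np.sqrt(2.0 * kl_to_standard(m0, s0)) * np.exp(-t)
print(np.all(entropy <= entropy_bound + 1e-12),
      np.all(distance <= distance_bound + 1e-12))
```

The entropy bound is attained at $t=0$ by construction, while the distance bound is strict there because Talagrand's inequality is not an equality for this initial condition.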

Training Two-Layer MLPs as Wasserstein Flows¶

Mean-field limits recast the training of wide neural networks as transport of a distribution of neurons. This section shows how the particle ODE of gradient descent becomes a Wasserstein flow in parameter space. This viewpoint, often summarized as treating parameters as interacting particles, was developed in closely related forms by Mei, Montanari and Nguyen (Mei et al., 2018), Rotskoff and Vanden-Eijnden (Rotskoff & Vanden-Eijnden, 2022), and Chizat and Bach (Chizat & Bach, 2018). These works pass from the finite-width empirical neuron dynamics to a limiting nonlinear PDE for the neuron law.

Wasserstein training of two-layer MLPs. The key modeling step is to forget the ordering of the hidden neurons and to regard the network weights as a probability measure on parameter space. A finite-width network is then an empirical measure, and the population loss becomes a functional of this measure. Gradient descent on the particle positions is precisely the finite-dimensional discretization of a Wasserstein gradient flow for this functional, with the Wasserstein metric acting on neuron parameters rather than on data samples. This is the sense in which training a two-layer mean-field MLP becomes a transport problem for the law of its neurons.

We use $z\in\RR^d$ for the input data and $y\in\RR^{d'}$ for the label. A neuron is a particle

x=(u,v)\in\RR^d\times\RR^{d'},

(216)

where $u$ is the inner weight and $v$ is the outer vector weight. For a scalar nonlinearity $\sigma$ , define the vector-valued feature

\psi(x,z)=v\,\sigma(\dotp{u}{z})\in\RR^{d'}.

(217)

The width- $n$ network and its mean-field version are

G_X(z)=\frac1n\sum_{i=1}^n\psi(x_i,z), \qquad G_\alpha(z)=\int\psi(x,z)\d\alpha(x), \qquad \alpha=\frac1n\sum_i\delta_{x_i}.

(218)

This formulation removes the artificial ordering of neurons and allows $\alpha$ to be a continuous distribution of infinitely many neurons.

Let $\zeta$ be a probability distribution on data-label pairs $(z,y)\in\RR^d\times\RR^{d'}$ . The population risk is

f(\alpha)=\int\ell(G_\alpha(z),y)\d\zeta(z,y),

(219)

and the empirical risk is the special case $\zeta=\zeta_N\eqdef N^{-1}\sum_{k=1}^N\delta_{(z_k,y_k)}$ . Since $\alpha\mapsto G_\alpha$ is linear, $f$ is convex as a function of $\alpha$ whenever $\ell(\cdot,y)$ is convex. For the empirical neuron law $\alpha_X=n^{-1}\sum_i\delta_{x_i}$ , the Wasserstein metric induces on particles the rescaled metric $n^{-1}\sum_i\norm{\dot x_i}^2$ . The corresponding particle flow is

\dot x_i=-n\nabla_{x_i}F(X), \qquad F(X)=f\!\left(\frac1n\sum_i\delta_{x_i}\right).

(220)

This is the gradient flow of $F(X)=f(\alpha_X)$ for the Wasserstein particle metric, equivalently Euclidean gradient descent with time scale multiplied by $n$ . It gives a particle discretization of (8).

Assume that $\ell$ is differentiable in its first variable. The first variation is

\delta f(\alpha)(x) = \int \dotp{\nabla_1\ell(G_\alpha(z),y)}{\psi(x,z)} \d\zeta(z,y),

(221)

and the Wasserstein gradient in parameter space is

\Wgrad f(\alpha)(x) = \nabla_x\delta f(\alpha)(x) = \int [D_x\psi(x,z)]^\top\nabla_1\ell(G_\alpha(z),y) \d\zeta(z,y).

(222)

For the squared Euclidean loss $\ell(s,y)=\frac12\norm{s-y}^2$ , the energy is the sum of a quadratic interaction and a linear potential:

f(\alpha) = \frac12\iint k(x,x')\d\alpha(x)\d\alpha(x') + \int g(x)\d\alpha(x) + \frac12\int\norm{y}^2\d\zeta(z,y),

(223)

with

k(x,x') = \int\dotp{\psi(x,z)}{\psi(x',z)}\d\zeta(z,y), \qquad g(x) = -\int\dotp{y}{\psi(x,z)}\d\zeta(z,y).

(224)

Thus

\delta f(\alpha)(x) = \int k(x,x')\d\alpha(x')+g(x), \qquad \Wgrad f(\alpha)(x) = \int\nabla_x k(x,x')\d\alpha(x')+\nabla_x g(x).

(225)

These kernels are generally not convex in the particle variable, so the geodesic-convex convergence theory above does not apply directly.
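The identification of the particle flow (220) with the Wasserstein gradient (222) can be verified on a small network. The sketch below is an illustration under stated assumptions: the nonlinearity is $\sigma=\tanh$, the loss is the squared Euclidean loss, and the data, teacher labels, sizes and step size are all hypothetical. It compares the Wasserstein gradient at one neuron with $n\nabla_{x_i}F$ obtained by finite differences, then takes one explicit Euler step.

```python
import numpy as np

def predict(U, V, Z):
    """Width-n network G_X(z) = (1/n) sum_i v_i tanh(<u_i, z>), evaluated on the rows of Z."""
    return np.tanh(Z @ U.T) @ V / U.shape[0]

def risk(U, V, Z, Y):
    """Empirical risk for the squared loss (1/2) |G_X(z) - y|^2."""
    R = predict(U, V, Z) - Y
    return 0.5 * np.mean(np.sum(R ** 2, axis=1))

def wasserstein_gradient(U, V, Z, Y):
    """Wasserstein gradient of the risk at each neuron x_i = (u_i, v_i)."""
    n, N = U.shape[0], Z.shape[0]
    A = Z @ U.T                       # preactivations <u_i, z_k>, shape (N, n)
    T = np.tanh(A)
    R = T @ V / n - Y                 # residuals, equal to nabla_1 ell for the squared loss
    S = 1.0 - T ** 2                  # derivative of tanh
    RV = R @ V.T                      # <r_k, v_i>, shape (N, n)
    grad_U = (S * RV).T @ Z / N       # gradient in the inner weights
    grad_V = T.T @ R / N              # gradient in the outer weights
    return grad_U, grad_V

rng = np.random.default_rng(3)
n, d, dp, N = 50, 3, 2, 200
Z = rng.normal(size=(N, d))
Y = np.tanh(Z @ rng.normal(size=(d, dp)))     # synthetic teacher labels
U = rng.normal(size=(n, d))
V = rng.normal(size=(n, dp))
grad_U, grad_V = wasserstein_gradient(U, V, Z, Y)

# Finite-difference value of n * dF/dU[0, 0].
h = 1e-5
E = np.zeros_like(U)
E[0, 0] = h
fd = n * (risk(U + E, V, Z, Y) - risk(U - E, V, Z, Y)) / (2 * h)

# One explicit Euler step of the particle flow with a small step size.
tau = 1e-3
risk_before = risk(U, V, Z, Y)
risk_after = risk(U - tau * grad_U, V - tau * grad_V, Z, Y)
print(fd, grad_U[0, 0], risk_before, risk_after)
```

The factor $n$ in the finite-difference value is the time rescaling between Euclidean gradient descent on the particles and the Wasserstein particle flow.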

The figure below illustrates the resulting transport of neurons for a homogeneous ReLU model.

Mean-field training of a homogeneous two-layer model as transport in neuron space. The left panel shows the Wasserstein particle gradient flow in the reduced homogeneous coordinates $(|u|v_1,|u|v_2)$ , with black dashed rays marking the teacher directions. The right panel shows the weighted angular density along a front-loaded sequence of times, colored from red to blue, so that the early concentration of neuron directions is visible. The display follows the rendering of the auxiliary MLP experiment but keeps only the $W_2$ flow, not the spectral-flow comparison.

The interactive demo gives a lightweight version of the same phenomenon: particles move in reduced neuron coordinates, while their angles concentrate around the teacher directions.

Interactive panel. Use the width, homogeneity, and time controls to see the mean-field movement of ReLU neurons and the induced angular density.

Classical Convexity and Stationarity¶

Before using the specific homogeneity mechanism of Chizat and Bach, it is useful to isolate a simpler convex-analytic principle behind many mean-field arguments. Consider an energy

f(\alpha) = \frac12\iint k(x,x')\d\alpha(x)\d\alpha(x') + \int V(x)\d\alpha(x) +C

(226)

on probability measures over a parameter domain. Assume that the quadratic part is convex in the classical affine structure of measures:

Q((1-s)\alpha+s\beta) \leq (1-s)Q(\alpha)+sQ(\beta), \qquad Q(\alpha)=\frac12\iint k\d\alpha\d\alpha.

(227)

This is ordinary convexity of the functional on the convex set of measures, not displacement convexity along $\Wass_2$ geodesics.

Proposition: Affine Convexity and Stationary Positive Densities

Let $\Omega\subset\RR^d$ be connected, let $f=Q+\int V\d\alpha+C$ be as above on $\Pp(\Omega)$ , and assume that $Q$ is classically convex. Suppose that a Wasserstein gradient flow $\alpha_t$ of $f$ converges to $\alpha_\infty=\rho_\infty\d x$ , where $\rho_\infty>0$ almost everywhere, and that $f(\alpha_t)$ is bounded below. Assume that $\delta f(\alpha_\infty)\in C^1(\Omega)$ and that the squared Wasserstein slope is lower semicontinuous along the convergence:

\int_\Omega\norm{\nabla\delta f(\alpha_\infty)}^2\d\alpha_\infty \leq \liminf_{n\to\infty} \int_\Omega\norm{\nabla\delta f(\alpha_{t_n})}^2\d\alpha_{t_n}

(228)

for every sequence $t_n\to\infty$ . Then $\alpha_\infty$ is a global minimizer of $f$ .

The convergence theory then separates two regimes. Chizat and Bach prove global convergence for the unregularized, noiseless Wasserstein flow of positively homogeneous models (Chizat & Bach, 2018). Mei, Montanari and Nguyen prove convergence for the noisy, regularized dynamics (Mei et al., 2018): at the PDE level the noise adds a diffusion, or Laplacian, term, so an initially singular neuron law is immediately smoothed and acquires a density. Rotskoff and Vanden-Eijnden emphasize the parameters-as-particles interpretation, long-time convergence and asymptotic error scaling with the network width (Rotskoff & Vanden-Eijnden, 2022). The following formal statement isolates the noiseless Chizat-Bach mechanism and ignores the technical issues due to ReLU non-smoothness, support propagation and compactness.

Proposition: Formal Global Optimality for Two-Homogeneous Mean-Field Flows

Assume that the feature is positively two-homogeneous in the neuron variable,

\psi(\lambda x,z)=\lambda^2\psi(x,z) \qquad(\lambda>0),

(231)

and that $f(\alpha)=J(G_\alpha)$ with $J$ convex and differentiable as a functional of the predictor. Let $\alpha$ be a smooth stationary point of the Wasserstein flow, so that $\nabla_x\delta f(\alpha)(x)=0$ on $\operatorname{supp}(\alpha)$ . Assume also full directional support: for every unit direction $\omega$ , the support of $\alpha$ intersects the ray $\{\lambda\omega:\lambda>0\}$ . Then $\alpha$ is a global minimizer of $f$ over the mean-field model class.

Choose the first-variation representative

h_\alpha(x) = \delta f(\alpha)(x) = \left\langle\nabla J(G_\alpha),\psi(x,\cdot)\right\rangle_\zeta.

(232)

Additive constants in first variations do not affect the Wasserstein gradient or first-order inequalities between probability measures; this normalization is useful because it inherits the homogeneity of $\psi$ . By two-homogeneity of $\psi$ , $h_\alpha(\lambda x)=\lambda^2h_\alpha(x)$ for $\lambda>0$ . Taking $x=0$ and any $\lambda\neq1$ also gives $h_\alpha(0)=0$ . We first record that stationarity forces $h_\alpha$ to vanish on $\operatorname{supp}(\alpha)$ . Indeed, if $x=r\omega\in\operatorname{supp}(\alpha)$ with $r>0$ and $\|\omega\|=1$ , then

0 = \langle\nabla h_\alpha(r\omega),\omega\rangle = \frac{\d}{\d s}h_\alpha(s\omega)\bigg|_{s=r} = 2r h_\alpha(\omega).

(233)

Thus $h_\alpha(\omega)=0$ , and hence $h_\alpha(x)=r^2h_\alpha(\omega)=0$ . The case $x=0$ has already been covered, so $h_\alpha=0$ on $\operatorname{supp}(\alpha)$ and $\int h_\alpha\,\d\alpha=0$ .

We now argue by contradiction. Suppose that a competitor $\beta$ satisfies $f(\beta)<f(\alpha)$ . Since

G_\beta-G_\alpha = \int\psi(x,\cdot)\d(\beta-\alpha)(x),

(234)

convexity of $J$ gives

f(\beta)-f(\alpha) \geq \int h_\alpha(x)\d(\beta-\alpha)(x).

(235)

The left-hand side is negative, while $\int h_\alpha\,\d\alpha=0$ , so $\int h_\alpha\,\d\beta<0$ . Since $h_\alpha(0)=0$ , this strict negativity must occur away from the origin: there exists a nonzero point $x=r\omega$ with $r>0$ , $\|\omega\|=1$ , and $h_\alpha(x)<0$ . Equivalently $h_\alpha(\omega)=h_\alpha(x)/r^2<0$ . By full directional support, there is $r_\omega>0$ such that $r_\omega\omega\in\operatorname{supp}(\alpha)$ . At this support point,

\langle\nabla h_\alpha(r_\omega\omega),\omega\rangle = 2r_\omega h_\alpha(\omega) <0,

(236)

which contradicts the stationarity condition $\nabla h_\alpha=0$ on $\operatorname{supp}(\alpha)$ . Thus no competitor has smaller risk.

The rigorous Chizat--Bach theorem replaces the full directional support assumption by propagation and overparameterization hypotheses ensuring that any negative first-variation direction is represented by the evolving support. The same radial-derivative contradiction, applied after this support-propagation step, then rules out non-optimal stationary limits.
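The radial-derivative identity used in (233) and (236) can be checked numerically. The sketch below is only an illustration: the two-homogeneous ReLU feature, the data points `Z` and the weights `c` (standing in for $\nabla J(G_\alpha)$ evaluated on the data) are hypothetical choices, not objects defined in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 5                      # input dimension and number of data points (illustrative)
Z = rng.normal(size=(m, d))      # hypothetical data points z_j
c = rng.normal(size=m)           # hypothetical stand-in for grad J(G_alpha) on the data

def h(x):
    """Two-homogeneous potential h(x) = sum_j c_j * b * relu(<a, z_j>), with x = (a, b)."""
    a, b = x[:d], x[d]
    return float(c @ (b * np.maximum(Z @ a, 0.0)))

def grad_h(x):
    """Gradient of h with respect to x = (a, b), valid away from the ReLU kinks."""
    a, b = x[:d], x[d]
    pre = Z @ a
    grad_a = b * (Z.T @ (c * (pre > 0)))
    grad_b = c @ np.maximum(pre, 0.0)
    return np.append(grad_a, grad_b)

omega = rng.normal(size=d + 1)
omega /= np.linalg.norm(omega)   # unit direction
r = 1.7

radial = grad_h(r * omega) @ omega   # <grad h(r omega), omega>
euler = 2.0 * r * h(omega)           # right-hand side of (233)
print(radial, euler)
```

The two printed numbers agree, so a nonzero value of $h_\alpha(\omega)$ forces a nonzero radial derivative at every point of the ray through $\omega$.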

Generalized Dynamic Wasserstein Flows¶

Generalized Wasserstein Flows¶

The JKO construction is not tied to the quadratic Wasserstein distance. Once a space of probability measures is equipped with a metric or extended metric $d$ , one can define a minimizing movement by the implicit Euler step

\alpha_{t+\tau} \in \uargmin{\alpha} \frac{1}{2\tau}d^2(\alpha_t,\alpha)+f(\alpha).

(237)

This is the metric-gradient-flow framework of Ambrosio, Gigli and Savaré Ambrosio et al., 2006; Ambrosio & Gigli, 2013: under compactness, coercivity and lower-semicontinuity assumptions adapted to $d$ , the piecewise-constant interpolants of (237) can converge, as $\tau\to0$ , to a curve of maximal slope. A nonmetric discrepancy can still be inserted in the same update, but the metric theory does not then apply automatically. In the local mass-preserving metrics considered below, a smooth limiting curve is again represented by a continuity equation of the type introduced in (3) and used for Wasserstein flows in (8),

\partial_t\alpha_t+\operatorname{div}(\alpha_t v_{\alpha_t})=0,

(238)

but the velocity $v_\alpha$ is selected by the local steepest-descent rule associated with the geometry induced by $d$ , not necessarily by the classical $\Wass_2$ rule. Unbalanced variants replace this continuity equation by a balance law, while nonlocal variants replace vector fields by jump fluxes.

The metric ingredient needed below is the squared-speed action $\mathbb A(\alpha,w)$ introduced in Generalized Dynamic Wasserstein Distances. Its length distance is already defined in (38); the present section only uses the local cost in the infinitesimal variational step defined below. Some actions are pointwise integrals, such as ordinary $\Wass_2$ with $\mathbb A(\alpha,w)=\int\norm w^2\d\alpha$ . Others are built from a homogeneous local density but then squared at the metric level: for $\Wass_p$ , the local densities are $A_p(a,w)=a\norm w^p$ and $J_p(a,m)=\norm m^p/a^{p-1}$ , while the action paired with the JKO penalty is the squared Finsler speed

\mathbb A_p(\alpha,w)=\left(\int\norm w^p\d\alpha\right)^{2/p}.

(239)

Spectral geometries use the nonlocal covariance action $\mathbb A_\gamma$ from (62).

In the $\Wass_2$ case, all objects reduce to the familiar Benamou--Brenier ones:

A_2(a,w)=a\norm w^2, \qquad J_2(a,m)=\frac{\norm m^2}{a}, \qquad \mathsf D_{\mathbb A}=\Wass_2.

(240)

Definition 1 (Penalized minimization oracle)

Let $\mathbb A(\alpha,\cdot)$ be the local squared-speed action at $\alpha$ . For a cotangent field $u$ for which the pairing below is finite, usually $u=\Wgrad f(\alpha)$ , the penalized minimization oracle associated with this action is

\operatorname{PMO}_{\mathbb A,\alpha}(u) \in \uargmin{w} \left\{ \int \dotp{u(x)}{w(x)}\d\alpha(x) + \frac12\mathbb A(\alpha,w) \right\},

(241)

where the minimization is over admissible finite-action velocity representatives for the chosen geometry. In Hilbertian examples $u\in L^2(\alpha;\RR^d)$ ; for $\Wass_p$ , the natural dual space is $L^q(\alpha;\RR^d)$ , with $q=p/(p-1)$ .

For an energy $f$ , the formal small-step expansion of (237), when the distance is generated by the squared-speed action $\mathbb A$ and the JKO displacement admits an admissible finite-action tangent representative, selects the PMO direction

v_{\alpha} = \operatorname{PMO}_{\mathbb A,\alpha}(\Wgrad f(\alpha)).

(242)

Equivalently, the associated formal gradient-flow PDE is

\partial_t\alpha_t + \operatorname{div}\!\left( \alpha_t\,\operatorname{PMO}_{\mathbb A,\alpha_t}(\Wgrad f(\alpha_t)) \right)=0.

(243)

If $\mathbb A(\alpha,\cdot)$ has zero-action directions, the minimization in the PMO is understood on the quotient tangent space, or after projecting onto finite-action directions modulo the kernel of $\mathbb A(\alpha,\cdot)$ . Along smooth solutions of (243), whenever the minimizer is attained and $\mathbb A(\alpha,\cdot)$ is 2-homogeneous, one obtains

\frac{\d}{\d t}f(\alpha_t) = \left\langle \Wgrad f(\alpha_t),v_t\right\rangle_{\alpha_t} = -\mathbb A(\alpha_t,v_t), \qquad v_t=\operatorname{PMO}_{\mathbb A,\alpha_t}(\Wgrad f(\alpha_t)).

(244)

For $\Wass_2$ , $\mathbb A(\alpha,w)=\int\norm w^2\d\alpha$ and $\operatorname{PMO}_{\mathbb A,\alpha}(u)=-u$ , so the PDE reduces to the usual Wasserstein gradient-flow equation.

In the restricted Riemannian case of (40), the PMO becomes a linear preconditioning rule:

\operatorname{PMO}_{\mathbb A,\alpha}(u) = -Q_\alpha^{-1}u, \qquad v_\alpha=-Q_\alpha^{-1}\Wgrad f(\alpha),

(245)

with the inverse understood on the range of $Q_\alpha$ , or as a Moore--Penrose inverse after projecting onto admissible directions when constraints create null directions. For ordinary $\Wass_2$ , $Q_\alpha=\Id$ on $T_\alpha$ and one recovers $v_\alpha=-\Wgrad f(\alpha)$ . The linear inverse formula is special to Hilbertian actions. A basic non-Hilbertian example is $\Wass_p$ : for $1<p<+\infty$ , the squared Finsler tangent action associated with the distance is

\mathbb A_p(\alpha,w)=\left(\int\norm{w}^p\d\alpha\right)^{2/p}.

(246)

Writing $q=p/(p-1)$ and $U=\norm{u}_{L^q(\alpha)}$ , its PMO is the duality-map formula

\operatorname{PMO}_{\mathbb A_p,\alpha}(u)(x) = -\,U^{2-q}\norm{u(x)}^{q-2}u(x),

(247)

with the conventions that the right-hand side is 0 when $U=0$ , and that $\norm{u(x)}^{q-2}u(x)=0$ when $u(x)=0$ . This is nonlinear in $u$ , and it reduces to $-u$ only for $p=2$ . Endpoint cases such as $p=1$ are interpreted through subgradients of the dual norm.
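As a sanity check of the duality-map formula (247), the following sketch evaluates the PMO on an empirical measure and compares the objective of (241) at the candidate minimizer with its value at random perturbations. The sample `u`, the uniform `weights` and the helper names `pmo_wp` and `objective` are illustrative assumptions.

```python
import numpy as np

def pmo_wp(u, weights, p):
    """Duality-map formula (247) for an empirical measure with atoms weighted by `weights`."""
    q = p / (p - 1.0)
    norms = np.linalg.norm(u, axis=1)
    U = (weights @ norms ** q) ** (1.0 / q)          # L^q(alpha) norm of u
    if U == 0.0:
        return np.zeros_like(u)
    scale = np.zeros_like(norms)
    nonzero = norms > 0
    scale[nonzero] = norms[nonzero] ** (q - 2.0)     # convention: 0 where u(x) = 0
    return -(U ** (2.0 - q)) * scale[:, None] * u

def objective(u, w, weights, p):
    """PMO objective (241): pairing plus half the squared Finsler speed (246)."""
    pairing = weights @ np.sum(u * w, axis=1)
    action = (weights @ np.linalg.norm(w, axis=1) ** p) ** (2.0 / p)
    return pairing + 0.5 * action

rng = np.random.default_rng(0)
n, p = 40, 3.0
u = rng.normal(size=(n, 3))                          # hypothetical cotangent field samples
weights = np.full(n, 1.0 / n)

v = pmo_wp(u, weights, p)
q = p / (p - 1.0)
U = (weights @ np.linalg.norm(u, axis=1) ** q) ** (1.0 / q)
print(objective(u, v, weights, p), -0.5 * U ** 2)    # the two values coincide
```

The optimal value $-\tfrac12U^2$ is consistent with the dissipation identity (244), since the pairing equals $-U^2$ and the action equals $U^2$ at the minimizer.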

The examples below are organized by the same dictionary. Concave-mobility flows use the action $\mathbb A_{\theta,\lambda}$ from Homogeneous Momentum Actions; spectral flows use the action $\mathbb A_\gamma$ from the spectral dynamic-distance construction; kernelized Benamou--Brenier flows use $\mathbb A_k(\alpha,v)=\norm{v}_{\RKHS_k^d}^2$ with a restricted RKHS tangent space. Nonlocal logarithmic-mean geometries also have quadratic actions, but their tangent variables are edge potentials or jump velocities rather than vector fields in $\RR^d$ ; their flows are treated separately in the nonlocal Wasserstein-flow section below. WFR, studied separately in Dynamic Unbalanced OT and WFR Flows, is another quadratic geometry, but its tangent variable is a pair $(v,g)$ combining displacement and mass modulation. The notation $Q_\alpha$ is reserved for Hilbertian vector-field subcases where a linear preconditioner represents the tangent geometry.

The generalized distances of Generalized Dynamic Wasserstein Distances are useful precisely because they change what descent means. The energy functional can be the same, while the local metric tensor, Finsler norm, mobility, spectral gauge or RKHS constraint changes the PDE selected by the minimizing movement. The rest of this section focuses on local or vector-field mass-preserving geometries. Nonlocal pair-space geometries are separated in the nonlocal Wasserstein-flow section below, and the transport--reaction case is separated in Dynamic Unbalanced OT and WFR Flows, because their tangent variables are structurally different.

General Mobility Flows¶

The concave-mobility distances of Homogeneous Momentum Actions replace the scalar linear mobility $a$ by a concave mobility $\theta(a)$ . For a density field $a=\rho_t(x)$ , this changes the Onsager operator of the gradient flow while keeping a continuity equation.

Let $\lambda$ be Lebesgue measure, write $\alpha=\rho\lambda$ , and let $u=\nabla(\delta f/\delta\rho)$ . For the velocity action $\mathbb A_{\theta,\lambda}$ , the PMO introduced in Definition 1 is the pointwise minimizer of

\int \dotp{u}{w}\,\rho\,\d x + \frac12\int \frac{\rho}{\theta(\rho)}\norm{w}^2\rho\,\d x,

(248)

so, wherever $\rho>0$ and $\theta(\rho)>0$ ,

\operatorname{PMO}_{\mathbb A_{\theta,\lambda},\rho\lambda}(u) = -\frac{\theta(\rho)}{\rho}u.

(249)

Thus the flux $\rho w$ is $-\theta(\rho)\nabla(\delta f/\delta\rho)$ . For a smooth density $\rho_t$ and an energy $f$ , the formal gradient flow associated with the pointwise momentum action $J_\theta(a,m)=|m|^2/\theta(a)$ is

\partial_t\rho_t = \nabla\!\cdot\!\left(\theta(\rho_t)\nabla\frac{\delta f}{\delta\rho}(\rho_t)\right).

(250)

If $f(\rho)=\int U(\rho(x))\,\d x+\int V(x)\rho(x)\,\d x$ , this becomes

\partial_t\rho = \nabla\!\cdot\!\left(\theta(\rho)\big(U''(\rho)\nabla\rho+\nabla V\big)\right).

(251)

Thus $\theta(a)=a$ and $U(a)=a\log a$ give the usual heat or Fokker--Planck equation, while $\theta(a)=a(1-a/M)$ gives a volume-filling drift--diffusion whose mobility vanishes at saturation. Power internal energies and power mobilities tune nonlinear diffusion; this viewpoint is useful for nonlinear diffusions, exclusion models, volume-filling models and finite-volume discretizations Dolbeault et al., 2009.
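A minimal one-dimensional finite-volume sketch of (250) makes the structure concrete. It assumes no-flux walls, an explicit Euler time step, arithmetic averaging of the density at cell faces, and a hypothetical quadratic potential; none of these choices come from the text, and the function name `mobility_flow_step` is illustrative.

```python
import numpy as np

def mobility_flow_step(rho, V, dx, dt, theta, dU):
    """One explicit step of rho_t = div(theta(rho) grad(U'(rho) + V)) with no-flux walls."""
    mu = dU(rho) + V                                  # discrete first variation
    theta_face = theta(0.5 * (rho[1:] + rho[:-1]))    # mobility at interior faces
    flux = -theta_face * (mu[1:] - mu[:-1]) / dx      # flux = -theta * grad(mu)
    flux = np.concatenate(([0.0], flux, [0.0]))       # zero flux through both walls
    return rho - dt * (flux[1:] - flux[:-1]) / dx

n, M = 64, 1.0
dx = 1.0 / n
x = (np.arange(n) + 0.5) * dx                         # cell centers on [0, 1]
V = 2.0 * (x - 0.5) ** 2                              # hypothetical confining potential
rho0 = 0.3 + 0.2 * np.cos(2 * np.pi * x)              # initial density with values in (0, M)

def theta(a):
    return a * (1.0 - a / M)                          # volume-filling mobility

def dU(a):
    return np.log(a) + 1.0                            # U(a) = a log a

def energy(rho):
    return np.sum(rho * np.log(rho) + V * rho) * dx

rho = rho0.copy()
dt = 0.2 * dx ** 2                                    # small step for explicit stability
for _ in range(2000):
    rho = mobility_flow_step(rho, V, dx, dt, theta, dU)

print(np.sum(rho0) * dx, np.sum(rho) * dx)            # mass is conserved
print(energy(rho0), energy(rho))                      # energy decreases
```

Because the update is written in flux form, mass conservation holds exactly at the discrete level, and the semi-discrete scheme dissipates the discrete energy at rate $\sum\theta\,(\Delta\mu)^2/\Delta x$ over the faces.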

Dynamic Spectral Wasserstein Flows¶

The spectral tangent action (62), and the dynamic distance (63) it generates, normalize the whole velocity covariance through a monotone spectral gauge. Using this geometry for gradient descent replaces the pointwise Wasserstein direction by a globally preconditioned direction. This is close to many large-scale optimizers: normalized gradient methods solve norm-constrained linear minimization problems Pethick et al., 2025; related ideas appear in normalized SGD Murray et al., 2019; Cutkosky & Mehta, 2020, tensor-aware optimizers such as Shampoo Gupta et al., 2018, and Muon-type spectral normalizations Jordan et al., 2024; Liu et al., 2025. We now describe the mean-field version developed in Peyré, 2026. The trace gauge gives the classical $\Wass_2$ flow, while the operator gauge $\gamma(M)=\lambda_{\max}(M)$ gives the idealized Muon geometry.

Specializing Definition 1 to the spectral action $\mathbb A_\gamma$ , the descent direction associated with a field $g\in L^2(\alpha;\RR^d)$ is simply $\operatorname{PMO}_{\mathbb A_\gamma,\alpha}(g)$ . The field $g$ should be read as the Wasserstein gradient $g=\Wgrad f(\alpha)=\nabla\delta f(\alpha)$ . Since $v\mapsto\mathbb A_\gamma(\alpha,v)$ is 2-homogeneous, fixed-speed normalized algorithms keep the same descent ray as this PMO and only change the scalar normalization; equivalently, the rescaled field minimizes $\int\dotp{g}{w}\d\alpha$ over $\mathbb A_\gamma(\alpha,w)\leq1$ . In the trace case, $\mathbb A_{\tr}(\alpha,v)=\int\norm{v}^2\d\alpha$ and $\operatorname{PMO}_{\mathbb A_{\tr},\alpha}(g)=-g$ . For other gauges, the velocity is globally preconditioned by the covariance of $g$ under $\alpha$ .

Proposition: Polar Formula for the Spectral Direction

Let $S_\alpha(g)=\int g(x)g(x)^\top\d\alpha(x)$ . Assume that the inverse-trace problem admits a positive definite minimizer

A_\alpha^\star \in \uargmin{A\in\mathcal B_\gamma\cap\mathbb S_{++}^d} \tr\!\left(A^{-1}S_\alpha(g)\right).

(252)

Then

\operatorname{PMO}_{\mathbb A_\gamma,\alpha}(g)(x) = -(A_\alpha^\star)^{-1}g(x)

(253)

is the spectral PMO direction. Boundary minimizers are understood by positive-definite approximation, or by using the inverse on the range of $S_\alpha(g)$ .

For the trace gauge, the minimizer is $A_\alpha^\star=\Id$ . For the operator gauge $\gamma(M)=\lambda_{\max}(M)$ , one has $\mathcal B_\gamma=\{A\succeq0:\tr(A)\leq1\}$ ; if $S_\alpha(g)\succ0$ , then

A_\alpha^\star = \frac{S_\alpha(g)^{1/2}}{\tr(S_\alpha(g)^{1/2})}, \qquad \operatorname{PMO}_{\mathbb A_\gamma,\alpha}(g)(x) = -\tr(S_\alpha(g)^{1/2})\,S_\alpha(g)^{-1/2}g(x).

(255)

This is the continuum covariance-normalization formula that becomes the Muon polar factor for empirical measures.
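The operator-gauge formula (255) can be illustrated on an empirical measure. In the sketch below the gradient samples `G` are synthetic and anisotropic by construction, which is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
# Hypothetical gradient samples g(x_i), stored as rows, with anisotropic scales.
G = rng.normal(size=(n, d)) @ np.diag([3.0, 1.0, 0.5, 0.1])

S = G.T @ G / n                                   # covariance S_alpha(g)
evals, evecs = np.linalg.eigh(S)
S_half = (evecs * np.sqrt(evals)) @ evecs.T       # S^{1/2}
S_inv_half = (evecs / np.sqrt(evals)) @ evecs.T   # S^{-1/2}
trace_half = np.trace(S_half)

A_star = S_half / trace_half                      # minimizer in (255), trace one
V = -trace_half * G @ S_inv_half                  # rows are the PMO velocities v(x_i)

cov_V = V.T @ V / n                               # velocity covariance
pairing = np.sum(G * V) / n                       # integral of <g, v> d alpha
action = np.linalg.eigvalsh(cov_V).max()          # operator-gauge action
print(np.round(cov_V, 6))
print(pairing, -action)
```

The velocity covariance is a multiple of the identity, which is the whitening effect of the operator gauge, and the pairing equals minus the action, as in (244).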

The static/dynamic equality identifies $\Wass_\gamma$ with the length distance generated by $\mathbb A_\gamma$ . The infinitesimal descent direction is therefore defined directly by the PMO in (241). Static/dynamic equality alone does not give a tangent expansion for every prescribed smooth field, because that field need not be the minimal-action representative of the induced perturbation.

Proposition: Formal Normalized Spectral Gradient Flow

Assume that $f$ has a smooth first variation and that the spectral PMO is attained along a smooth curve. The formal gradient flow generated by $\mathbb A_\gamma$ solves

\partial_t\alpha_t + \diverg\!\left( \alpha_t\,\operatorname{PMO}_{\mathbb A_\gamma,\alpha_t}(\Wgrad f(\alpha_t)) \right)=0.

(256)

Equivalently, if $A_t^\star$ solves the inverse-trace problem for $S_t=\int \Wgrad f(\alpha_t)(x)\Wgrad f(\alpha_t)(x)^\top\d\alpha_t(x)$ , then the velocity is $v_t(x)=-(A_t^\star)^{-1}\Wgrad f(\alpha_t)(x)$ . Along smooth solutions,

\frac{\d}{\d t}f(\alpha_t) = -\mathbb A_\gamma(\alpha_t,v_t).

(257)

Remark: Empirical and Muon limits

For empirical measures $\alpha_X=n^{-1}\sum_i\delta_{x_i}$ , stack the velocities $v(x_i)$ into a matrix $V\in\RR^{n\times d}$ and the gradients $g(x_i)$ into a matrix $G\in\RR^{n\times d}$ . By 1-homogeneity of $\gamma$ ,

\mathbb A_\gamma(\alpha_X,v) = \gamma\!\left(\frac1n V^\top V\right) = \frac1n\gamma(V^\top V), \qquad \int\dotp{g}{v}\d\alpha_X = \frac1n\tr(G^\top V).

(258)

Thus the empirical specialization of the PMO problem (241) is, up to the harmless factor $1/n$ , the finite-dimensional matrix descent problem. Writing $(G_X(t))_i=\Wgrad f(\alpha_{X(t)})(x_i(t))$ , the empirical flow is

\dot X(t) \in \uargmin{V\in\RR^{n\times d}} \left\{ \tr(G_X(t)^\top V)+\frac12\gamma(V^\top V) \right\}.

(259)

Here $G_X$ is the gradient for the empirical Wasserstein metric $\norm{\dot X}_n^2=n^{-1}\sum_i\norm{\dot x_i}^2$ , not the ordinary Euclidean gradient of $f(\alpha_X)$ . If one rewrites the same rule with the unweighted matrix convention, it corresponds to the lift $X\mapsto n f(\alpha_X)$ ; using $X\mapsto f(\alpha_X)$ only rescales time by the factor $n$ . If $G=U\diag(\sigma_i)W^\top$ and $\gamma(M)=\lambda_{\max}(M)$ , then

\uargmin{V\in\RR^{n\times d}} \left\{ \tr(G^\top V)+\frac12\lambda_{\max}(V^\top V) \right\} \ni -\norm{G}_{S_1}UW^\top = -\tr\!\left((G^\top G)^{1/2}\right)G(G^\top G)^{\dagger/2}.

(260)

The ray of this direction is the polar factor $-UW^\top$ , and the scalar factor $\norm{G}_{S_1}$ can be absorbed into a time or step-size normalization. This is the exact-polar, full-batch, continuous-time idealization of Muon Jordan et al., 2024; Peyré, 2026. Practical Muon implementations replace the polar factor by a few Newton--Schulz iterations for GPU efficiency Liu et al., 2025; this polynomial approximation fits the same spectral-normalization philosophy, although adaptive rescaling and finite momentum make the practical algorithm more than a position-only metric gradient flow.
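The following sketch checks the closed form (260) on a random matrix and compares the SVD polar factor with a polynomial approximation. The iteration used here is the classical cubic Newton--Schulz recursion, chosen for simplicity; it is not claimed to be the tuned polynomial used in practical Muon implementations, and all function names are illustrative.

```python
import numpy as np

def polar_svd(G):
    """Exact polar factor U W^T from a thin SVD."""
    U, _, Wt = np.linalg.svd(G, full_matrices=False)
    return U @ Wt

def polar_newton_schulz(G, iters=30):
    """Classical cubic Newton-Schulz iteration (assumes G has full column rank)."""
    X = G / np.linalg.norm(G)          # Frobenius scaling puts singular values in (0, 1]
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

def matrix_objective(G, V):
    """Objective of (260): tr(G^T V) + (1/2) lambda_max(V^T V)."""
    return np.trace(G.T @ V) + 0.5 * np.linalg.eigvalsh(V.T @ V).max()

rng = np.random.default_rng(1)
G = rng.normal(size=(50, 4))                       # hypothetical stacked gradients
nuclear = np.linalg.svd(G, compute_uv=False).sum() # Schatten-1 norm of G
V_star = -nuclear * polar_svd(G)                   # closed-form minimizer in (260)

print(matrix_objective(G, V_star), -0.5 * nuclear ** 2)
print(np.abs(polar_newton_schulz(G) - polar_svd(G)).max())
```

Each Newton--Schulz step maps a singular value $\sigma$ to $1.5\sigma-0.5\sigma^3$, so all singular values are pushed toward one while the singular vectors are left unchanged.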

The effect of this spectral normalization is visible already in the homogeneous two-layer ReLU model discussed earlier in this chapter. The figure below reproduces, in the style of the book figures, the numerical comparison from Peyré, 2026: the same teacher network, initialization and empirical first variation are evolved either by the $W_2$ particle flow or by the operator-gauge Muon direction.

Wasserstein versus Muon mean-field training of a homogeneous ReLU model. The first two panels display trajectories in reduced homogeneous coordinates; black dashed rays are the teacher directions. The right panel compares empirical square-loss decay, with a solid curve for the $W_2$ flow and a dashed curve for the operator-gauge Muon flow. Spectral normalization reaches the low-risk regime earlier in normalized time.


(sec-nonlocal-wasserstein-flows)=
## Nonlocal Wasserstein Flows

The nonlocal distances of {ref}`sec-nonlocal-wasserstein-distances` replace
the local vector-field tangent model by pairwise exchanges. This separate
section records the corresponding gradient-flow interpretation in the same
order as the distance constructions: first continuum nonlocal Wasserstein flows
and fractional PDEs, then discrete Wasserstein flows on Markov chains. In the
continuum jump-kernel case, entropy descent gives nonlocal diffusion and, for
power-law kernels, fractional PDEs. On a finite state space, the same
logarithmic-mean edge calculus makes a reversible Markov chain exactly the
entropy gradient flow of the discrete Wasserstein distance. Both examples are
mass-preserving, but they are not local Benamou--Brenier flows: their tangent
variables live on pairs or graph edges rather than at individual points.

### Nonlocal Wasserstein Flows and Fractional PDEs

The continuum nonlocal distance $\mathcal W_K$ of
{ref}`sec-nonlocal-wasserstein-distances` replaces local velocities by
antisymmetric fluxes across jumps selected by a reversible kernel $K$. Its
logarithmic-mean mobility is chosen so that entropy dissipation produces the
Markov generator. This turns jump processes, fractional heat equations and some
heavy-tailed stochastic models into gradient flows for nonlocal transport
geometries.

(prop-nonlocal-entropy-gradient-flow)=
:::{admonition} Proposition: Entropy Gradient Flow for Reversible Jump Kernels
:class: important
Under the regularity and irreducibility assumptions of
{cite:t}`Erbar2012JumpEntropy`, $\mathcal W_K$ is an extended distance on the
set of probability measures absolutely continuous with respect to
$\mathfrak m$, finite-action pairs are connected by constant-speed geodesics,
and the Markov semigroup generated by the closure of

```{math}
L\psi(x)
=
\int\bigl(\psi(y)-\psi(x)\bigr)K(x,\d y)
```

is the gradient flow of the relative entropy

```{math}
\operatorname{Ent}_{\mathfrak m}(\alpha)
=
\int\rho\log\rho\,\d\mathfrak m,
\qquad
\alpha=\rho\mathfrak m,
```

for the distance $\mathcal W_K$. In weak form, the entropy flow is

```{math}
:label: eq-nonlocal-entropy-flow
\partial_t\rho_t=L\rho_t .
```

When $K$ is singular, the integral defining $L$ is understood in the
principal-value sense.
:::

The construction becomes especially transparent for translation-invariant kernels. A power-law jump kernel turns the entropy flow into a fractional heat equation.

On $\mathcal X=\RR^d$ , let

K_s(x,\d y) = c_{d,s}\frac{\d y}{|x-y|^{d+s}}, \qquad 0<s<2.

(264)

The associated generator is, up to the normalizing constant,

L_s\psi(x) = \operatorname{p.v.}\int_{\RR^d} \bigl(\psi(y)-\psi(x)\bigr) \frac{c_{d,s}}{|x-y|^{d+s}}\,\d y = -(-\Delta)^{s/2}\psi(x),

(265)

where $(-\Delta)^{s/2}$ is the fractional Laplacian Samko et al., 1993. Hence the $\mathcal W_{K_s}$ -gradient flow of the entropy is

\partial_t\rho_t = -(-\Delta)^{s/2}\rho_t .

(266)

The figure below illustrates the qualitative effect of lowering $s$ . Classical heat diffusion, $s=2$ , smooths local discontinuities by nearby averaging. Fractional diffusion keeps a sharper memory of localized peaks but immediately creates algebraic long-range tails, because mass can jump over macroscopic distances.

Classical and fractional heat flows from two localized indicator bumps. Each panel evolves the same normalized mixture of two intervals by the Fourier multiplier $e^{-t|\xi|^s}$ , with time colored from red to blue. The classical heat flow quickly rounds the discontinuities and spreads by local Gaussian averaging. As $s$ decreases, the diffusion becomes increasingly nonlocal: peaks remain more localized near the bumps, while heavier tails appear across the whole displayed window.
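The Fourier-multiplier evolution described in the caption can be sketched in a few lines. The periodic domain, its length, the grid size and the final time below are assumptions made for illustration; the helper name `fractional_heat` is hypothetical.

```python
import numpy as np

def fractional_heat(rho0, length, t, s):
    """Evolve rho0 on a periodic grid by the Fourier multiplier exp(-t |xi|^s)."""
    n = rho0.size
    xi = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)    # angular frequencies
    multiplier = np.exp(-t * np.abs(xi) ** s)
    return np.real(np.fft.ifft(multiplier * np.fft.fft(rho0)))

n, length = 1024, 20.0
dx = length / n
x = -length / 2 + dx * np.arange(n)
# Normalized mixture of two indicator bumps centered at -1 and +1.
rho0 = ((np.abs(x + 1.0) < 0.5) | (np.abs(x - 1.0) < 0.5)).astype(float)
rho0 /= rho0.sum() * dx

rho_heat = fractional_heat(rho0, length, t=0.1, s=2.0)    # classical heat flow
rho_frac = fractional_heat(rho0, length, t=0.1, s=1.0)    # fractional flow, s = 1

far = np.argmin(np.abs(x + 6.0))                          # a point far from both bumps
print(rho_heat[far], rho_frac[far])
```

At the far point the classical solution is numerically zero, while the $s=1$ solution already carries an algebraic tail, in line with the qualitative description above. The zero-frequency multiplier equals one, so both evolutions preserve total mass.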

More generally, when a target-dependent jump kernel $K_\beta$ satisfies detailed balance with a desired invariant measure $\beta=\rho_\beta\mathfrak m$ , the same construction can be applied to $r_t=\d\alpha_t/\d\beta$ . The gradient flow of $\KL(\alpha|\beta)$ is then $\partial_t r_t=L_\beta r_t$ . This is the nonlocal analogue of the Fokker--Planck gradient flow of relative entropy, and it is one of the motivations behind more recent nonlocal diffusion gradient-flow frameworks Warren, 2024.

Remark: Lévy SDE viewpoint

The same operators appear probabilistically as generators of jump processes. This gives a complementary interpretation of nonlocal PDEs as laws of stochastic processes with heavy-tailed jumps.

For the kernel $K_s$ , the process with generator $L_s$ is the symmetric $s$ -stable Lévy process $X_t=X_0+L_t^{(s)}$ , whose density solves (266) Applebaum, 2009. Adding a smooth confining drift gives the fractional Fokker--Planck equation

\partial_t\rho_t = \nabla\cdot(\rho_t\nabla V) - \sigma^s(-\Delta)^{s/2}\rho_t,

(267)

which is the forward equation of

\d X_t=-\nabla V(X_t)\,\d t+\sigma\,\d L_t^{(s)}.

(268)

Equation (267) should be distinguished from the exactly reversible entropy flow above unless the jump kernel is chosen to satisfy detailed balance with the target measure. Both viewpoints, however, share the same nonlocal mechanism: relaxation is driven not only by local drift and Brownian diffusion but also by rare long jumps.
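A particle simulation of (268) gives a rough feel for the heavy-tailed regime. The sketch below is an assumption-laden illustration: it uses an Euler scheme with stable increments scaled by $\d t^{1/s}$, the Chambers--Mallows--Stuck sampler normalized so that the characteristic function is $e^{-|\xi|^s}$, and a hypothetical quadratic potential $V(x)=x^2/2$.

```python
import numpy as np

def symmetric_stable(s, size, rng):
    """Chambers-Mallows-Stuck sampler with characteristic function exp(-|xi|^s), 0 < s <= 2."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(s * u) / np.cos(u) ** (1.0 / s)
            * (np.cos((1.0 - s) * u) / w) ** ((1.0 - s) / s))

def simulate(s, sigma=1.0, n_particles=20000, dt=0.01, n_steps=500, seed=0):
    """Euler scheme for dX = -V'(X) dt + sigma dL^(s), with V(x) = x^2 / 2."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_particles)
    for _ in range(n_steps):
        jumps = symmetric_stable(s, n_particles, rng)
        x = x - x * dt + sigma * dt ** (1.0 / s) * jumps
    return x

x_gauss = simulate(2.0)     # s = 2: Brownian-type noise
x_stable = simulate(1.5)    # s = 1.5: heavy-tailed jumps

tail_gauss = np.mean(np.abs(x_gauss) > 5.0)
tail_stable = np.mean(np.abs(x_stable) > 5.0)
print(tail_gauss, tail_stable)
```

With the same drift and noise scale, the fraction of particles far from the origin is orders of magnitude larger for $s=1.5$ than for $s=2$, which is the rare-long-jump mechanism described above.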

Remark: Connection with stochastic gradient dynamics

This nonlocal point of view is useful in machine learning because stochastic optimization noise need not be well approximated by Brownian noise. Classical diffusion approximations model small-step SGD by a Brownian SDE near stationarity Mandt et al., 2017. In deep-network regimes, empirical and theoretical work has shown that gradient noise and even stationary iterates can be heavy-tailed Şimşekli et al., 2019; Gürbüzbalaban et al., 2021. A coarse continuous-time model for parameters $\Theta_t$ is then

\d\Theta_t = -\nabla \mathcal L(\Theta_t)\,\d t + \sigma\,\d L_t^{(s)},

(269)

whose law solves a fractional Fokker--Planck equation of the form (267). The jumps represent occasional large stochastic-gradient fluctuations, so they can move the dynamics between basins in a way that Brownian approximations tend to understate. This makes nonlocal Wasserstein geometries a useful language for linking heavy-tailed training noise, fractional PDE models, and measure-valued gradient-flow ideas. The correspondence remains a modeling approximation: the effective tail index, anisotropy, minibatch correlations and learning-rate schedule all influence whether a Lévy limit is a faithful description of a concrete training run.

Discrete Wasserstein Flows on Markov Chains¶

The finite-state distance $\mathcal W_K$ in Discrete Wasserstein Distances on Markov Chains is designed so that the reversible Markov chain itself becomes an entropy gradient flow. This is the discrete counterpart of the fact that the heat equation is the Wasserstein gradient flow of Shannon entropy.

For masses $a_i=\pi_i\rho_i$ , or equivalently relative densities $\rho_i=a_i/\pi_i$ with respect to the invariant law $\pi$ , the entropy relative to $\pi$ is

\operatorname{Ent}_\pi(\rho)\eqdef\sum_i\pi_i\rho_i\log\rho_i.

(270)

Applying the general minimizing movement (237) with $d=\mathcal W_K$ and $f=\operatorname{Ent}_\pi$ , the small- $\tau$ first-order optimality condition gives

\frac{\rho^{k+1}-\rho^k}{\tau} =-\mathcal K_{\rho^{k+1}}\log\rho^{k+1}+o(1) =K\rho^{k+1}+o(1),

(271)

where $(K\rho)_i=\sum_jK_{ij}(\rho_j-\rho_i)$ . Under smoothness and strict positivity, $\rho^{k+1}=\rho^k+O(\tau)$ , so the last expression also equals $K\rho^k+O(\tau)$ . Thus the discrete Wasserstein geometry is engineered so that the entropy gradient flow is the Markov semigroup; the proposition below makes this identity explicit.
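The limiting dynamics $\dot\rho=K\rho$ can be simulated directly on a small chain. The sketch below builds a random reversible chain from symmetric edge weights, which is an illustrative construction, and uses explicit Euler steps rather than the implicit minimizing movement (237).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
pi = rng.uniform(0.5, 1.5, n)
pi /= pi.sum()                                 # invariant law
W = rng.uniform(0.0, 1.0, (n, n))
W = 0.5 * (W + W.T)                            # symmetric edge weights
np.fill_diagonal(W, 0.0)
K = W / pi[:, None]                            # detailed balance: pi_i K_ij = W_ij = pi_j K_ji

def generator(rho):
    """(K rho)_i = sum_j K_ij (rho_j - rho_i)."""
    return K @ rho - K.sum(axis=1) * rho

def entropy(rho):
    """Relative entropy (270) of the density rho with respect to pi."""
    return float(np.sum(pi * rho * np.log(rho)))

rho = rng.uniform(0.2, 2.0, n)
rho /= pi @ rho                                # normalize so that sum_i pi_i rho_i = 1
tau = 0.5 / K.sum(axis=1).max()                # keeps each Euler step a Markov kernel
entropies = [entropy(rho)]
for _ in range(200):
    rho = rho + tau * generator(rho)
    entropies.append(entropy(rho))

print(entropies[0], entropies[-1], pi @ rho)
```

With this step size each update is a $\pi$-reversible stochastic matrix applied to $\rho$, so the entropy is nonincreasing by Jensen's inequality, the total mass $\sum_i\pi_i\rho_i$ is preserved, and $\rho$ relaxes to the constant density.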

Dynamic Unbalanced OT and WFR Flows¶

The generalized dynamic Wasserstein flows of Generalized Dynamic Wasserstein Flows are primarily mass-preserving: their tangent vectors are generated by continuity equations and are therefore velocity fields advecting the measure. Unbalanced transport changes the state space from probabilities to positive finite measures, where descent may both move and create or remove mass. The tangent variable is then a pair $(v,g)$ : the vector field $v$ displaces mass, while the scalar field $g$ modulates its local intensity. This transport--reaction geometry is the dynamic counterpart of the unbalanced distances of Dynamic Unbalanced Wasserstein Distances, especially the balance-equation formula (109), and it is best treated separately from the purely advective geometries above.

Formally, a smooth curve $\alpha_t=\rho_t\d x$ has tangent vectors of the form

\partial_t\rho_t+\operatorname{div}(\rho_t v_t)=g_t\rho_t,

(275)

with squared norm $\int(\norm{v_t}^2+\kappa^2g_t^2)\rho_t\d x$ . The vector field $v_t$ accounts for displacement, while $g_t$ is a Fisher--Rao growth rate. This is the infinitesimal version of the balance-equation action in the dynamic OT chapter.

The reaction term makes additive constants in the first variation meaningful: adding a constant to $\phi$ changes the global growth rate. This is not a defect but the infinitesimal signature of optimizing over positive finite measures, rather than on the probability simplex. Rigorous constructions of Kantorovich--Fisher--Rao and WFR gradient flows are developed in Gallouët & Monsaingeon, 2017; Chizat et al., 2018; Chizat et al., 2018. Chizat’s measure-optimization framework relates weight-changing gradient methods to mirror- and Bregman-type descents on positive measures Chizat, 2022; Chizat, 2022. In machine learning, birth--death dynamics use the same reaction intuition to let a particle population reweight, remove and create neurons during training Rotskoff et al., 2019.

Example: Relative entropy and birth--death relaxation

Let $\beta=b\d x$ and use the finite-measure relative entropy

f(\alpha)=\KL(\alpha|\beta) = \int \left(\rho\log\frac{\rho}{b}-\rho+b\right)\d x, \qquad \alpha=\rho\d x.

(280)

This is the Csiszár divergence on positive measures; it is nonnegative and is minimized, without imposing a mass constraint, at $\alpha=\beta$ . Its first variation is $\delta f(\alpha)=\log(\rho/b)$ . The balanced $\Wass_2$ flow, when the total mass is fixed, is the Fokker--Planck equation

\partial_t\rho_t = \diverg\left(\rho_t\nabla\log\frac{\rho_t}{b}\right) = \Delta\rho_t-\diverg(\rho_t\nabla\log b),

(281)

whereas the $\WFR_\kappa$ flow is

\partial_t\rho_t = \Delta\rho_t-\diverg(\rho_t\nabla\log b) - \frac{1}{\kappa^2}\rho_t\log\frac{\rho_t}{b}.

(282)

The first two terms transport and diffuse mass, while the last term kills mass where $\rho_t>b$ and creates it where $\rho_t<b$ .
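The reaction term of (282) can be isolated to see the birth--death mechanism on its own. Dropping the transport and diffusion terms is an assumption made only for this illustration; the resulting pointwise ODE $\partial_t\rho=-\kappa^{-2}\rho\log(\rho/b)$ is linear in $\log(\rho/b)$, which gives the closed form $\rho_t=b\,(\rho_0/b)^{e^{-t/\kappa^2}}$. The sketch compares this formula with an explicit Euler integration on hypothetical densities.

```python
import numpy as np

def reaction_closed_form(rho0, b, t, kappa):
    """Solution of rho' = -(1/kappa^2) rho log(rho/b): log(rho/b) decays like exp(-t/kappa^2)."""
    return b * (rho0 / b) ** np.exp(-t / kappa ** 2)

def reaction_euler(rho0, b, t, kappa, n_steps=20000):
    """Explicit Euler integration of the same pointwise ODE."""
    rho = rho0.copy()
    dt = t / n_steps
    for _ in range(n_steps):
        rho = rho - dt / kappa ** 2 * rho * np.log(rho / b)
    return rho

x = np.linspace(-4.0, 4.0, 81)
# Hypothetical bimodal target density b and unimodal initial density rho0 (both positive).
b = 0.5 * np.exp(-(x - 1.5) ** 2) + 0.5 * np.exp(-(x + 1.5) ** 2) + 1e-3
rho0 = np.exp(-x ** 2 / 0.5) + 1e-3

exact = reaction_closed_form(rho0, b, t=1.0, kappa=1.0)
numeric = reaction_euler(rho0, b, t=1.0, kappa=1.0)
print(np.abs(exact - numeric).max())
```

The closed form shows that the reaction part relaxes $\log(\rho_t/b)$ to zero at the uniform exponential rate $\kappa^{-2}$, independently of how far apart the modes are, whereas transport must carry mass across the low-density region.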

The figure below contrasts this reaction-assisted relaxation with its mass-preserving Wasserstein counterpart.

Balanced and unbalanced gradient flows of $\KL(\rho|\beta)$ for one-dimensional Gaussian mixtures. In each panel the colored density stacks correspond to five increasing times, from red to blue, and the faint gray curve repeated on each row is the target mixture $\beta$ . The balanced Wasserstein flow is the conservative Fokker--Planck equation, so mass must move through the low-density region between modes. The $\WFR_\kappa$ flow adds the birth--death reaction term in (282); it attenuates overrepresented regions and creates missing mass near target modes, reaching the target shape much faster.

Conditional Wasserstein Training of Infinite ResNets¶

Residual networks learn small residual updates around the identity He et al., 2016. This makes depth behave like time: very deep residual architectures can be interpreted as discretizations of differential equations, a viewpoint developed in stable architectures, neural ODEs and mean-field optimal-control models of deep learning Haber & Ruthotto, 2017; Lu et al., 2018; Chen et al., 2018; E et al., 2019. The conditional-OT formulation of Barboni, Peyré and Vialard Barboni et al., 2024 adds the width mean-field limit to this continuous-depth picture. This should be read alongside the broader conditional-transport literature reviewed in Section Conditional Wasserstein Distances, especially fibered gradient flows Peszek & Poyato, 2023, function-space conditional OT Hosseini et al., 2025, and conditional OT flow-matching constructions Chemseddine et al., 2025; Kerrigan et al., 2024. From the viewpoint of Generalized Dynamic Wasserstein Flows, it is a concrete instance of a generalized dynamic Wasserstein flow: the conditional Wasserstein distance has a geodesic dynamic structure obtained by integrating the usual Benamou--Brenier action over the conditioning variable. The point of the construction is to respect two structural rules at once: neurons can be relabelled inside each layer, but mass should not move from one depth to another. Conditional Wasserstein geometry implements exactly this pair of rules. It transports neurons fiber by fiber in parameter space, while the depth variable remains fixed.

We follow the three limiting levels of the model. A finite ResNet is first a labelled particle system arranged by layers. Sending the width to infinity turns each layer into a probability law of neurons. Sending the depth step to zero then turns the sequence of layer laws into a conditional measure indexed by depth, on which training becomes a Wasserstein gradient flow.

Finite-Depth Finite-Width ResNets¶

The starting point is the ordinary architecture: depth is discrete, width is finite, and each layer carries a labelled collection of neurons. Fix a parameter space $\Theta\subset\RR^q$ , a residual feature $\psi:\Theta\times\RR^d\to\RR^d$ , a data law $\zeta$ on input-label pairs $(z,y)$ , and a loss $\ell$ . With $L$ layers, step size $\tau=1/L$ , widths $n_r$ , and parameters $\theta_{r,i}\in\Theta$ , a finite ResNet is the composition of the residual updates

z_{r+1} = z_r+\tau G_{r,n_r}(z_r), \qquad G_{r,n_r}(z) \eqdef \frac1{n_r}\sum_{i=1}^{n_r}\psi(\theta_{r,i},z), \qquad z_0=z,\qquad r=0,\ldots,L-1.

(283)

The associated population loss, with $n=(n_0,\ldots,n_{L-1})$ , is

f_{L,n}((\theta_{r,i})_{r,i}) = \int \ell(z_L^{\theta}(z),y)\d\zeta(z,y).

(284)

Equivalently, each layer carries an empirical neuron law

\alpha_r = \frac1{n_r}\sum_{i=1}^{n_r}\delta_{\theta_{r,i}}, \qquad G_{r,n_r}(z) = \int_\Theta\psi(\theta,z)\d\alpha_r(\theta).

(285)

Thus a finite ResNet is already a discrete conditional measure on depth and neuron space,

\alpha^{L,n}(\d s,\d\theta) = \frac1L\sum_{r=0}^{L-1}\delta_{s_r}(\d s)\alpha_r(\d\theta), \qquad s_r=r/L,

(286)

whose marginal on the depth variable is the discrete measure $\lambda_L=L^{-1}\sum_{r=0}^{L-1}\delta_{s_r}$ .
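The residual recursion (283) and its dependence on the layer laws only, as in (285), can be sketched as follows. The feature $\psi$ used here (an output vector times a ReLU of an inner product) and the layer widths are hypothetical choices for illustration.

```python
import numpy as np

def psi(theta, z):
    """Hypothetical residual feature: theta = (w_out, w_in), psi = w_out * relu(<w_in, z>)."""
    d = z.shape[-1]
    w_out, w_in = theta[..., :d], theta[..., d:]
    return w_out * np.maximum(w_in @ z, 0.0)[..., None]

def resnet_forward(layers, z):
    """Residual updates (283): z <- z + tau * mean_i psi(theta_{r,i}, z), with tau = 1/L."""
    tau = 1.0 / len(layers)
    for theta in layers:                      # theta has shape (n_r, 2 d)
        z = z + tau * psi(theta, z).mean(axis=0)
    return z

rng = np.random.default_rng(0)
d = 3
widths = (5, 8, 3, 6)                         # layer widths n_r may differ
layers = [rng.normal(size=(n_r, 2 * d)) for n_r in widths]
z0 = rng.normal(size=d)

# Relabelling neurons inside each layer leaves the empirical laws alpha_r unchanged.
permuted = [theta[rng.permutation(len(theta))] for theta in layers]
print(resnet_forward(layers, z0))
print(resnet_forward(permuted, z0))
```

The two outputs coincide because each block depends on its neurons only through the average in (285), which is the relabelling symmetry that the conditional measure (286) encodes.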

Finite-Depth Mean-Field Limit¶

The first mean-field limit removes neuron labels inside each fixed layer. The residual depth grid stays discrete, but the widths tend to infinity. Each empirical law $\alpha_r$ is replaced by an arbitrary probability measure in $\Pp_2(\Theta)$ , and each residual block becomes a mean-field residual block

z_{r+1} = z_r+\tau G_{\alpha_r}(z_r), \qquad z_0=z, \qquad G_{\alpha_r}(z) \eqdef \int_\Theta\psi(\theta,z)\d\alpha_r(\theta).

(287)

The loss becomes a functional of the finite family of layer measures,

f_L(\alpha_0,\ldots,\alpha_{L-1}) = \int \ell(z_L^\alpha(z),y)\d\zeta(z,y).

(288)

This is the finite-depth conditional Wasserstein setting with condition marginal $\lambda_L$ : each fiber $s_r$ carries a Wasserstein geometry on neuron laws.

Continuous-Depth Conditional Geometry¶

The second limit turns the layer index into a conditioning variable. The model is no longer described by one neuron law, nor even by a finite list of them, but by a family of neuron laws indexed by depth. The state is therefore a conditional probability law

\alpha(\d s,\d\theta)=\alpha_s(\d\theta)\d s,

(289)

or, equivalently, a probability measure on $[0,1]\times\Theta$ whose first marginal is Lebesgue measure. The general conditional Wasserstein distance is defined in Section Conditional Wasserstein Distances. Here the condition is continuous depth, $s\in[0,1]$, the condition law is $\lambda(\d s)=\d s$, and the fiber space is the neuron parameter space $\Theta$. For two conditional neuron laws $\alpha(\d s,\d\theta)=\alpha_s(\d\theta)\d s$ and $\beta(\d s,\d\theta)=\beta_s(\d\theta)\d s$, the relevant distance is the depthwise quadratic Wasserstein metric

\Wass_{2,\lambda}^2(\alpha,\beta) = \int_0^1 \Wass_2^2(\alpha_s,\beta_s)\d s.

(290)

It compares neurons only at the same depth: transport in $\theta$ is allowed fiber by fiber, while mass is not transported between layers.

Infinite-Depth Mean-Field ResNet¶

Once depth is continuous, the forward pass is a controlled ODE and the control is the conditional neuron law. A mean-field infinite-depth and infinite-width ResNet is described by a measurable family $s\mapsto\alpha_s\in\Pp_2(\Theta)$. Each slice is the neuron law of one infinitesimal residual block, and the feature $\psi(\theta,z)$ now acts as a vector field on the state variable. The forward pass is

\partial_s z_s = G_{\alpha_s}(z_s) \eqdef \int_\Theta \psi(\theta,z_s)\d\alpha_s(\theta), \qquad z_0=z.

(291)

The population risk is

f_{\mathrm{Res}}(\alpha) = \int \ell(z_1^\alpha(z),y)\d\zeta(z,y).

(292)

Under the usual regularity assumptions ensuring well-posedness of the ODE, this defines a functional on the conditional Wasserstein space.
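A worked scalar example makes the passage from the residual recursion to the ODE (291) concrete. The sketch below assumes the illustrative linear feature $\psi(\theta,z)=\theta z$ and a depth-independent law $\alpha_s=\delta_\theta$, neither of which comes from the text. The ODE is then $\partial_s z_s=\theta z_s$ with $z_1=e^{\theta}z$, while the $L$-layer ResNet returns $(1+\theta/L)^L z$.

```python
import math

def resnet_output(theta, z, L):
    """Residual recursion z <- z + tau * theta * z with tau = 1/L (explicit Euler)."""
    tau = 1.0 / L
    for _ in range(L):
        z = z + tau * theta * z
    return z

theta, z0 = 0.7, 1.5
exact = math.exp(theta) * z0        # solution at s = 1 of dz/ds = theta * z, z(0) = z0
errors = [abs(resnet_output(theta, z0, L) - exact) for L in (10, 100, 1000)]
```

The entries of `errors` shrink roughly like $1/L$, the first-order accuracy of explicit Euler.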

Conditional Wasserstein Gradient Flow¶

Training is now steepest descent for an objective on conditional laws. More generally, let $(S,\lambda)$ be a probability space and let $f$ be a differentiable functional on measures $\alpha(\d s,\d\theta)=\alpha_s(\d\theta)\lambda(\d s)$ with fixed first marginal $\lambda$. For signed perturbations $\xi(\d s,\d\theta)=\xi_s(\d\theta)\lambda(\d s)$ satisfying $\int_\Theta\d\xi_s=0$ for $\lambda$-a.e. $s$, we use the fiberwise first-variation convention

\frac{\d}{\d\varepsilon}f(\alpha+\varepsilon\xi)\bigg|_{\varepsilon=0} = \int_S\int_\Theta \frac{\delta f}{\delta\alpha_s}(\alpha)(\theta)\d\xi_s(\theta)\d\lambda(s).

(293)

The formal conditional Wasserstein gradient flow is the family of continuity equations

\partial_t\alpha_{t,s} +\diverg_\theta(\alpha_{t,s}v_{t,s})=0, \qquad v_{t,s}(\theta) = -\nabla_\theta\frac{\delta f}{\delta\alpha_s}(\alpha_t)(\theta),

(294)

for $\lambda$-a.e. condition $s$, whenever the first variation is differentiable in the parameter variable. In the ResNet case one takes $f=f_{\mathrm{Res}}$ and $S=[0,1]$. The equation is fiberwise in its transport geometry, but not decoupled as a training dynamics: changing $\alpha_s$ affects the terminal loss through the forward ODE and the corresponding adjoint equation. Thus each depth performs a Wasserstein steepest descent in its own neuron space, while all depths interact through the state trajectory.
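The coupling between depths can be seen in a toy particle version of (294). This is a sketch under strong assumptions, not a general training routine. It uses the illustrative scalar feature $\psi(\theta,z)=\theta z$, a squared loss, equal widths $n$, and the discrete depth marginal, so that the particle velocity is minus the Euclidean gradient rescaled by $Ln$. With this feature the gradient is available in closed form; a generic $\psi$ would require the adjoint equation or automatic differentiation.

```python
import numpy as np

def forward(theta, z):
    """Residual recursion for the illustrative scalar feature psi(theta, z) = theta * z."""
    L = theta.shape[0]
    for r in range(L):
        z = z + (1.0 / L) * theta[r].mean() * z
    return z

def loss(theta, z, y):
    return 0.5 * np.mean((forward(theta, z) - y) ** 2)

def conditional_velocity(theta, z, y):
    """Minus the Euclidean gradient times L * n, i.e. the gradient for the metric
    (1/L) sum_r (1/n) sum_i |dtheta_{r,i}|^2."""
    L, n = theta.shape
    tau = 1.0 / L
    m = theta.mean(axis=1)                       # layer means, shape (L,)
    zL = forward(theta, z)
    # d z_L / d theta[r, i] = z_L * tau / (n * (1 + tau * m_r)); rescaled by L * n.
    sensitivity = np.mean((zL - y) * zL)         # depends on every layer through z_L
    v_layer = -sensitivity / (1.0 + tau * m)
    return np.repeat(v_layer[:, None], n, axis=1)

rng = np.random.default_rng(0)
theta = 0.1 * rng.normal(size=(8, 16))           # depth L = 8, width n = 16
z = np.linspace(0.5, 1.5, 5)
y = 2.0 * z                                      # target map z -> 2 z
initial = loss(theta, z, y)
for _ in range(200):
    theta = theta + 0.05 * conditional_velocity(theta, z, y)   # explicit Euler in time
final = loss(theta, z, y)
```

Each layer moves only its own neurons, yet the common factor `sensitivity` involves the terminal state and therefore every other layer.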

Particle Discretization and Layerwise Matching¶

The finite network is recovered by discretizing both the condition variable and the fiber measures. With the discrete depth marginal $\lambda_L$, finite-depth finite-width ResNets are particle approximations of the conditional flow above. If two networks have the same depth grid and the same widths, and if their empirical layer laws are

\alpha_r=\frac1{n_r}\sum_{i=1}^{n_r}\delta_{\theta_{r,i}}, \qquad \beta_r=\frac1{n_r}\sum_{i=1}^{n_r}\delta_{\theta'_{r,i}},

(295)

then their squared conditional distance is

\Wass_{2,\lambda_L}^2(\alpha^{L,n},\beta^{L,n}) = \frac1L\sum_{r=0}^{L-1} \Wass_2^2(\alpha_r,\beta_r).

(296)

Keeping neuron labels fixed gives the labelled parameter distance

\frac1L\sum_{r=0}^{L-1}\frac1{n_r} \sum_{i=1}^{n_r}\norm{\theta_{r,i}-\theta'_{r,i}}^2,

(297)

which corresponds to a particular feasible coupling and therefore upper bounds the squared conditional Wasserstein distance. Optimizing over permutations inside each equal-weight layer gives the Wasserstein matching term above. In this sense, the shallow mean-field flow is the one-fiber case, whereas infinitely deep ResNets require a continuum of Wasserstein flows indexed by depth and coupled through the network state.
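The comparison between (296) and (297) can be checked on small networks. The sketch below computes the squared Wasserstein distance between equal-weight clouds by brute force over permutations, which is only feasible for a handful of neurons; the helper names are hypothetical, and a practical implementation would use an assignment solver instead.

```python
import itertools
import numpy as np

def labelled_sq_dist(A, B):
    """(1/n) sum_i |a_i - b_i|^2 with neuron labels kept fixed."""
    return np.mean(np.sum((A - B) ** 2, axis=1))

def matched_sq_dist(A, B):
    """Squared W2 between equal-weight clouds: minimum over all permutations."""
    n = len(A)
    return min(labelled_sq_dist(A, B[list(p)])
               for p in itertools.permutations(range(n)))

def conditional_sq_dist(layers_a, layers_b, dist):
    """(1/L) sum_r dist(alpha_r, beta_r), as in (296) and (297)."""
    return np.mean([dist(A, B) for A, B in zip(layers_a, layers_b)])

rng = np.random.default_rng(0)
L, n, q = 3, 5, 2
net_a = [rng.normal(size=(n, q)) for _ in range(L)]
net_b = [rng.normal(size=(n, q)) for _ in range(L)]
labelled = conditional_sq_dist(net_a, net_b, labelled_sq_dist)
matched = conditional_sq_dist(net_a, net_b, matched_sq_dist)
```

Here `matched` never exceeds `labelled`, and relabelling the neurons of one network changes `labelled` but not `matched`.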

Second-Order Momentum Flows¶

The minimizing-movement viewpoint is intrinsically first order: a JKO step produces the next measure by trading off proximity to the previous measure and decrease of the energy. Momentum adds a memory of velocity. The state is therefore no longer only a measure, but a measure together with a tangent, or equivalently phase-space, variable. This makes a direct implicit JKO construction less natural than for first-order flows. We instead use an explicit construction, in the same spirit as optimization and machine-learning practice: momentum reduces zig-zagging across stiff directions, helps iterates keep a coherent direction through shallow valleys, and can accelerate convergence when the geometry is favorable (Qian, 1999; Sutskever et al., 2013).

Finite-Dimensional Momentum¶

Let $F:\RR^N\to\RR$ be a smooth finite-dimensional objective. Polyak’s heavy-ball method (Polyak, 1964) augments gradient descent with a velocity variable. With step size $h>0$ and damping $\gamma>0$, one convenient velocity form is

S^{k+1}=(1-\gamma h)S^k-h\nabla F(X^k), \qquad X^{k+1}=X^k+hS^{k+1}.

(298)

If $S^k$ approximates $\dot X(kh)$, this is a semi-implicit, or symplectic-Euler, discretization: the velocity update is explicit, while the position uses the newly updated velocity. Its limiting system is

\dot X(t)=S(t), \qquad \dot S(t)=-\gamma S(t)-\nabla F(X(t)),

(299)

or equivalently the damped second-order equation

\ddot X(t)+\gamma\dot X(t)+\nabla F(X(t))=0.

(300)
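The velocity form (298) translates directly into code. The following is a minimal sketch on an ill-conditioned quadratic; the objective, step size and damping are illustrative choices rather than values taken from the text.

```python
import numpy as np

def heavy_ball(grad_F, X0, h, gamma, steps):
    """Iterate S <- (1 - gamma*h) S - h grad F(X), then X <- X + h S (updated velocity)."""
    X = np.array(X0, dtype=float)
    S = np.zeros_like(X)
    for _ in range(steps):
        S = (1.0 - gamma * h) * S - h * grad_F(X)
        X = X + h * S
    return X

A = np.diag([1.0, 10.0])                  # F(X) = X^T A X / 2, minimized at the origin
X_final = heavy_ball(lambda X: A @ X, X0=[1.0, 1.0], h=0.1, gamma=1.0, steps=500)
```

For these values both eigen-directions are underdamped, and the iterates spiral toward the minimizer with contraction factor $\sqrt{1-\gamma h}$ per step.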

Nesterov acceleration (Nesterov, 1983) uses a related look-ahead point,

Y^k=X^k+\theta_k(X^k-X^{k-1}), \qquad X^{k+1}=Y^k-h\nabla F(Y^k),

(301)

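and, for momentum coefficients $\theta_k=1-r/k+o(1/k)$ (the classical choice $\theta_k=(k-1)/(k+2)$ corresponds to $r=3$) and the time scaling $t=k\sqrt h$, its leading continuous-time limit is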

\ddot X(t)+\frac{r}{t}\dot X(t)+\nabla F(X(t))=0.

(302)

This ODE interpretation is due to Su, Boyd and Candès (Su et al., 2016); see also the variational perspective of Wibisono, Wilson and Jordan (Wibisono et al., 2016), the inertial dynamics with Hessian damping of Attouch, Peypouquet and Redont (Attouch et al., 2016), and high-resolution ODE limits (Shi et al., 2018).
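For comparison with the heavy-ball iteration, here is a minimal sketch of the look-ahead scheme (301). It assumes the classical coefficient $\theta_k=(k-1)/(k+2)$, set to zero at $k=0$, and an illustrative quadratic objective.

```python
import numpy as np

def nesterov(grad_F, X0, h, steps):
    """Look-ahead iteration Y = X + theta_k (X - X_prev), X_next = Y - h grad F(Y)."""
    X_prev = np.array(X0, dtype=float)
    X = X_prev.copy()
    for k in range(steps):
        theta_k = max(k - 1, 0) / (k + 2)      # classical choice (an assumption here)
        Y = X + theta_k * (X - X_prev)
        X_prev, X = X, Y - h * grad_F(Y)
    return X

A = np.diag([1.0, 10.0])
F = lambda X: 0.5 * X @ A @ X
X0 = np.array([1.0, 1.0])
X_final = nesterov(lambda X: A @ X, X0, h=0.1, steps=300)   # h = 1 / (largest eigenvalue)
```

The gradient is evaluated at the extrapolated point `Y`, not at the current iterate; this is the only structural difference from the heavy-ball update.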

Empirical Wasserstein Lift¶

We now take a functional $f$ on measures and lift it to particle configurations. For $X=(x_1,\ldots,x_n)\in(\RR^d)^n$ , write

\al_X \eqdef \frac1n\sum_{i=1}^n \delta_{x_i}, \qquad F(X) \eqdef f(\al_X).

(303)

This is the same empirical lift as in (23), and is the particle viewpoint underlying mean-field analyses of wide neural networks and Wasserstein gradient methods (Chizat & Bach, 2018; Mei et al., 2018; Rotskoff & Vanden-Eijnden, 2022). The configuration space is used with the empirical metric

\norm{\dot X}_n^2 \eqdef \frac1n\sum_{i=1}^n\norm{\dot x_i}^2,

(304)

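which is the metric induced by $\Wass_2$ on equally weighted empirical measures with fixed labels. The gradient $\nabla^{(n)}$ with respect to this metric is $n$ times the Euclidean gradient in each block $x_i$. Hence the Wasserstein gradient of $f$ is lifted to the particle gradient through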

(\nabla^{(n)}F(X))_i = n\nabla_{x_i}F(X) = \nabla\delta f(\al_X)(x_i) = \Wgrad f(\al_X)(x_i).

(305)
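The identity (305) can be verified numerically. The sketch below assumes an illustrative energy $f(\al)=\frac12\iint k\,\d\al\d\al+\int V\d\al$ with $k(x,y)=\norm{x-y}^2/2$ and $V(x)=\norm{x}^2/2$, for which $\nabla\delta f(\al)(x)=x-\int y\,\d\al(y)+x$, and compares it with $n$ times a finite-difference Euclidean gradient of the lift.

```python
import numpy as np

def f_lifted(X):
    """F(X) = f(alpha_X) for the illustrative energy described above."""
    diff = X[:, None, :] - X[None, :, :]
    interaction = 0.5 * np.mean(0.5 * np.sum(diff ** 2, axis=-1))
    potential = np.mean(0.5 * np.sum(X ** 2, axis=-1))
    return interaction + potential

def wasserstein_gradient(X):
    """grad(delta f)(alpha_X)(x_i) = (x_i - mean) + x_i at every particle."""
    return (X - X.mean(axis=0)) + X

def euclidean_gradient(F, X, eps=1e-6):
    """Central finite differences of F with respect to each coordinate of X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (F(X + E) - F(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
n = X.shape[0]
lifted = n * euclidean_gradient(f_lifted, X)    # left-hand side of (305)
```

Without the factor `n`, the particle gradient would vanish as the number of particles grows; the empirical metric is what keeps the velocity of each particle of order one.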

The first-order Wasserstein particle descent is $\dot x_i=v_{\al_X}(x_i)$ with

v_\al(x)\eqdef -\Wgrad f(\al)(x)=-\nabla\delta f(\al)(x).

(306)

Applying the heavy-ball equation to the empirical metric gives the second-order Wasserstein momentum system

\dot x_i(t)=s_i(t), \qquad \dot s_i(t)=-\gamma s_i(t)+v_{\alpha_t}(x_i(t)), \qquad \alpha_t=\frac1n\sum_{i=1}^n\delta_{x_i(t)} .

(307)

Equivalently, $\ddot x_i(t)+\gamma\dot x_i(t)=-\Wgrad f(\alpha_t)(x_i(t))$. This is an explicit inertial particle method for the measure energy $f$: the Wasserstein steepest-descent field acts as an acceleration rather than as an instantaneous velocity.
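A minimal particle sketch of (307) follows, discretized with the same semi-implicit update as (298). The energy is an illustrative assumption, with $k(x,y)=\norm{x-y}^2/2$ and $V(x)=\norm{x}^2/2$, whose unique minimizer is the Dirac mass at the origin; the step size and damping are arbitrary.

```python
import numpy as np

def velocity_field(X):
    """v_alpha(x_i) = -grad(delta f)(alpha_X)(x_i) for the illustrative energy above."""
    return -((X - X.mean(axis=0)) + X)

def momentum_flow(X0, h, gamma, steps):
    """Semi-implicit Euler for x' = s, s' = -gamma s + v_alpha(x)."""
    X = np.array(X0, dtype=float)
    S = np.zeros_like(X)                        # zero initial velocities
    for _ in range(steps):
        S = (1.0 - gamma * h) * S + h * velocity_field(X)
        X = X + h * S
    return X, S

rng = np.random.default_rng(1)
X0 = rng.normal(loc=2.0, size=(50, 2))
X_final, S_final = momentum_flow(X0, h=0.05, gamma=1.0, steps=1000)
```

The field is recomputed from the current cloud at every step, so the particles interact through $\alpha_t$ even though each update looks like a single-particle heavy-ball step.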

Phase-Space Formulation¶

Because momentum is part of the state, the mean-field object is not merely a measure on positions. It is a law on phase space $(x,s)$, whose spatial marginal is the current distribution of particles. Here $s$ is a Lagrangian particle velocity; it should not be confused with the Eulerian momentum measure $m=\rho v$ used in the Benamou--Brenier convex formulation (28). This deterministic kinetic viewpoint is the analogue of the phase-space formulation used for interacting particle systems; in learning, heavy-ball mean-field limits have recently been used to analyze wide networks trained with momentum (Wu et al., 2022).

Proposition: Phase-Space Liouville Formulation

Let $v_\al$ be smooth enough that (307) has a classical solution. Define the empirical phase-space measure

\eta_t^n \eqdef \frac1n\sum_{i=1}^n\delta_{(x_i(t),s_i(t))} \in \Pp(\RR^d_x\times\RR^d_s), \qquad \alpha_t^n=(\pi_x)_\sharp\eta_t^n .

(312)

Then $\eta_t^n$ solves, in the sense of distributions,

\partial_t\eta_t^n + \nabla_x\cdot(s\,\eta_t^n) + \nabla_s\cdot\bigl((-\gamma s+v_{\alpha_t^n}(x))\,\eta_t^n\bigr) =0 .

(313)

If, as $n\to\infty$, the empirical phase-space laws converge to a smooth density $\eta_t(x,s)$ and the nonlinear fields converge accordingly, the limiting kinetic equation is

\partial_t\eta_t + \nabla_x\cdot(s\,\eta_t) + \nabla_s\cdot\bigl((-\gamma s+v_{\alpha_t}(x))\,\eta_t\bigr) =0, \qquad \alpha_t=(\pi_x)_\sharp\eta_t .

(314)

Example: Quadratic energies and gravitational attraction

The quadratic-plus-linear energies

f(\al)=\frac12\iint k(x,y)\,\d\al(x)\d\al(y)+\int V(x)\,\d\al(x), \qquad k(x,y)=k(y,x),

(316)

are the interaction energies already discussed around (42), up to the harmless convention of inserting the factor $1/2$ . The force computed there becomes, in the present inertial setting, an acceleration:

v_\al(x) = -\int \nabla_x k(x,y)\,\d\al(y)-\nabla V(x).

(317)

For the attractive Newtonian kernel in three dimensions, $k(x,y)=-G/\norm{x-y}$, understood with a short-distance regularization if needed, this gives

v_\al(x) = G\int \frac{y-x}{\norm{x-y}^3}\,\d\al(y)-\nabla V(x),

(318)

the usual mean-field gravitational acceleration. The phase-space equation (314) is then a damped Vlasov-type equation. This kinetic viewpoint is classical in statistical physics: Boltzmann introduced the description of gases through phase-space distributions, the collisionless mean-field limit gives Liouville--Vlasov equations, and adding collision operators leads to the Boltzmann equation (Cercignani et al., 1994; Villani, 2002; Braun & Hepp, 1977).
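For an empirical measure, the acceleration (318) becomes an average over particles. The sketch below takes $V=0$ and assumes one particular short-distance regularization, the softened kernel $k(x,y)=-G/\sqrt{\norm{x-y}^2+\delta^2}$; the text does not specify which regularization is meant.

```python
import numpy as np

def gravitational_acceleration(X, G=1.0, delta=1e-2):
    """Softened mean-field acceleration
    G * mean_j (x_j - x_i) / (|x_i - x_j|^2 + delta^2)^(3/2), with V = 0."""
    diff = X[None, :, :] - X[:, None, :]              # diff[i, j] = x_j - x_i
    dist2 = np.sum(diff ** 2, axis=-1) + delta ** 2   # softening keeps j = i finite
    return G * np.mean(diff / dist2[..., None] ** 1.5, axis=1)

X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
acc = gravitational_acceleration(X)                   # the two particles attract
```

Since the pairwise forces are antisymmetric, the accelerations sum to zero and the interaction conserves total momentum, up to damping.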

For numerical stability, the illustration below uses the smoothed distance $\sqrt{\norm{x-y}^2+\delta^2}$ with a small $\delta>0$ in the energy-distance kernel $k(x,y)=-\norm{x-y}$. It compares the first-order flow with the undamped Newton lift $\ddot x=v_{\alpha_t}(x)=-\Wgrad f(\alpha_t)(x)$, initialized with zero velocity. The initial measure is a single Gaussian located to the left of a two-Gaussian target mixture. Both the transported measure and the target are discretized by large empirical clouds, and the smooth density views are KDE renderings of these particles. The trajectory panels display only a representative subset of initial particles selected by farthest-point sampling.
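The first-order part of this experiment can be sketched on a much smaller scale. The code below is not the code behind the figure: the cloud sizes, positions, $\delta$, step size and the explicit Euler time stepping are all illustrative assumptions. It uses $f(\al)=\frac12\iint k\,\d(\al-\beta)\d(\al-\beta)$ with the smoothed energy-distance kernel.

```python
import numpy as np

def smoothed_dist(X, Y, delta):
    diff = X[:, None, :] - Y[None, :, :]              # diff[i, j] = x_i - y_j
    return np.sqrt(np.sum(diff ** 2, axis=-1) + delta ** 2), diff

def half_mmd2(X, Y, delta=0.1):
    """1/2 iint k d(alpha - beta) d(alpha - beta) with k = -(smoothed distance)."""
    dxy = smoothed_dist(X, Y, delta)[0].mean()
    dxx = smoothed_dist(X, X, delta)[0].mean()
    dyy = smoothed_dist(Y, Y, delta)[0].mean()
    return dxy - 0.5 * dxx - 0.5 * dyy

def velocity(X, Y, delta=0.1):
    """v = -grad(delta f): repulsion inside the cloud X, attraction toward the target Y."""
    def mean_unit(A, B):
        dist, diff = smoothed_dist(A, B, delta)
        return np.mean(diff / dist[..., None], axis=1)
    return mean_unit(X, X) - mean_unit(X, Y)

rng = np.random.default_rng(0)
X = rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(60, 2))       # source cloud
centers = np.array([[3.0, 2.0], [3.0, -2.0]])
Y = centers[rng.integers(0, 2, size=80)] + 0.5 * rng.normal(size=(80, 2))
initial = half_mmd2(X, Y)
for _ in range(400):
    X = X + 0.05 * velocity(X, Y)                     # explicit Euler, first-order flow
final = half_mmd2(X, Y)
```

Feeding the same `velocity` into the update (298) with $\gamma=0$ and zero initial velocity gives a discretization of the Newton lift.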


First-order and Newton Wasserstein particle flows for the squared MMD with the energy-distance kernel $k(x,y)=-\norm{x-y}$. The first row follows the overdamped Wasserstein gradient flow. The second row follows the undamped Newton dynamics $\ddot x=-\Wgrad f(\alpha_t)(x)$, so the same Wasserstein force is read as an acceleration rather than a velocity. Both rows use the same large empirical source cloud and the same empirical two-Gaussian target cloud. Dashed gray contours show a KDE of the target particles, colored panels show KDEs of the evolving transported particles, and the left panels show farthest-point-subsampled trajectories.

Interactive panel. Compare the same energy-distance force interpreted as an overdamped Wasserstein velocity or as a Newton acceleration. The browser simulation keeps the source cloud, target mixture, representative trajectories and $-|x-y|$ interaction visible with a smaller empirical discretization than the publication figure.

Example: Entropy-driven inertial flow

Let

f(\al)= \begin{cases} \displaystyle \int_{\RR^d} \rho(x)\log\rho(x)\,\d x, & \text{if } \al=\rho\d x,\\[.4em] +\infty, & \text{otherwise.} \end{cases}

(319)

Thus $f(\al_X)=+\infty$ for an empirical measure $\al_X=n^{-1}\sum_i\delta_{x_i}$, and the finite-dimensional lift $F(X)=f(\al_X)$ is not meaningful. For $\al=\rho\d x$ with a smooth positive density, however,

\delta f(\al)(x)=\log\rho(x)+1, \qquad v_\al(x)=-\nabla\log\rho(x).

(320)

The first-order Wasserstein gradient flow is therefore the heat equation: inserting $v_\al=-\nabla\log\rho$ into the continuity equation gives $\partial_t\rho_t=\diverg(\rho_t\nabla\log\rho_t)=\Delta\rho_t$. If the spatial marginal remains smooth and positive, the corresponding formal second-order phase-space equation is

\partial_t\eta_t + \nabla_x\cdot(s\eta_t) + \nabla_s\cdot\bigl((-\gamma s-\nabla\log\rho_t(x))\eta_t\bigr) =0, \qquad \rho_t(x)=\int_{\RR^d}\eta_t(x,s)\,\d s .

(321)

This is a kinetic, inertial version of entropy diffusion: the entropy score acts as an acceleration rather than as an instantaneous velocity.

For the numerical illustration, we add the quadratic confinement $V(x)=\norm{x}^2/2$. In one dimension the overdamped equation is the Ornstein--Uhlenbeck Fokker--Planck equation $\partial_t\rho_t=\partial_{xx}\rho_t+\partial_x(x\rho_t)$, whose stationary law is the centered Gaussian $\mathcal N(0,1)$. The undamped ($\gamma=0$) Newton lift is discretized directly in phase space:

\partial_t\eta_t+\partial_x(s\eta_t) +\partial_s\bigl((-\partial_x\log\rho_t(x)-x)\eta_t\bigr)=0.

(322)

Both equations are solved by finite differences on fixed grids, with a mild grid smoothing only to evaluate the logarithmic derivative in the Newton row. The first row of the figure below shows only the spatial density of the Wasserstein gradient flow. The second and third rows show, for the Newton equation, respectively the spatial marginal and the full phase-space density; the latter uses white for zero mass and black for high mass.
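As a rough illustration of the overdamped equation only, the following sketch solves $\partial_t\rho=\partial_x(\partial_x\rho+x\rho)$ with an explicit conservative finite-difference scheme. It is not the solver used for the figure; the grid, time step, no-flux boundaries and initial mixture are assumptions chosen for simplicity.

```python
import numpy as np

def solve_ou_fokker_planck(rho0, x, dt, steps):
    """Explicit conservative scheme for d_t rho = d_x (d_x rho + x rho)."""
    dx = x[1] - x[0]
    rho = rho0.copy()
    x_mid = 0.5 * (x[:-1] + x[1:])                    # cell interfaces
    for _ in range(steps):
        # Interface values of d_x rho + x rho (centered differences and averages).
        flux = (rho[1:] - rho[:-1]) / dx + x_mid * 0.5 * (rho[1:] + rho[:-1])
        flux = np.concatenate(([0.0], flux, [0.0]))   # no-flux boundary conditions
        rho = rho + dt * (flux[1:] - flux[:-1]) / dx
    return rho

x = np.linspace(-6.0, 6.0, 121)                       # dx = 0.1
dx = x[1] - x[0]
rho0 = np.exp(-0.5 * ((x - 1.5) / 0.4) ** 2) + np.exp(-0.5 * ((x + 2.0) / 0.7) ** 2)
rho0 /= rho0.sum() * dx                               # normalize to unit mass
rho = solve_ou_fokker_planck(rho0, x, dt=0.002, steps=3000)    # final time T = 6
gaussian = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
```

The time step respects the diffusive restriction $\d t\le \d x^2/2$ of explicit schemes, and the flux form conserves the discrete mass up to rounding.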

Finite-difference entropy and inertial entropy flows in one dimension. The initial spatial density is a three-Gaussian mixture with different widths and overlapping components. The top row solves the confined entropy Wasserstein gradient flow, i.e. the OU/Fokker--Planck equation, whose stationary density is the dashed centered Gaussian. The two lower rows solve the undamped Newton lift on a grid in $(x,s)$, initialized with a narrow centered speed distribution: its spatial marginal is shown in color and the full phase-space density below in grayscale.

Interactive panel. Vary damping and time in a one-dimensional qualitative companion to the finite-difference entropy/Newton comparison.

References¶

Otto, F. (2001). The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1–2), 101–174.
Ambrosio, L., Gigli, N., & Savaré, G. (2006). Gradient Flows in Metric Spaces and in the Space of Probability Measures. Springer.
Benamou, J.-D., Carlier, G., Mérigot, Q., & Oudet, E. (2016). Discretization of functionals involving the Monge–Ampère operator. Numerische Mathematik, 134(3), 611–636.
Peyré, G. (2015). Entropic approximation of Wasserstein gradient flows. SIAM Journal on Imaging Sciences, 8(4), 2323–2351.
Gallouët, T. O., & Monsaingeon, L. (2017). A JKO splitting scheme for Kantorovich–Fisher–Rao gradient flows. SIAM Journal on Mathematical Analysis, 49(2), 1100–1130.
Maury, B., Roudneff-Chupin, A., & Santambrogio, F. (2010). A macroscopic crowd motion model of gradient flow type. Mathematical Models and Methods in Applied Sciences, 20(10), 1787–1821.
Santambrogio, F. (2018). Crowd motion and population dynamics under density constraints. GMT Preprint 3728.
Carlier, G., Chizat, L., & Laborde, M. (2024). Displacement smoothness of entropic optimal transport. ESAIM: Control, Optimisation and Calculus of Variations, 30, 25. 10.1051/cocv/2024013
Carlier, G., Jimenez, C., & Santambrogio, F. (2008). Optimal transportation with traffic congestion and Wardrop equilibria. SIAM Journal on Control and Optimization, 47(3), 1330–1350.
Carrillo, J. A., Chertock, A., & Huang, Y. (2015). A finite-volume method for nonlinear nonlocal equations with a gradient flow structure. Communications in Computational Physics, 17(01), 233–258.
Gianazza, U., Savaré, G., & Toscani, G. (2009). The Wasserstein gradient flow of the Fisher information and the quantum drift-diffusion equation. Archive for Rational Mechanics and Analysis, 194(1), 133–220.
Maas, J. (2011). Gradient flows of the entropy for finite Markov chains. Journal of Functional Analysis, 261(8), 2250–2292.
Erbar, M. (2010). The heat equation on manifolds as a gradient flow in the Wasserstein space. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 46(1), 1–23.
McCann, R. J. (1997). A convexity principle for interacting gases. Advances in Mathematics, 128(1), 153–179.
Tong, A., Huang, J., Wolf, G., van Dijk, D., & Krishnaswamy, S. (2020). TrajectoryNet: A Dynamic Optimal Transport Network for Modeling Cellular Dynamics. Proceedings of the 37th International Conference on Machine Learning, 119, 9526–9536. https://proceedings.mlr.press/v119/tong20a.html