Wasserstein Gradient Flows
Once the Wasserstein distance $W_2$ is understood as a dynamic metric, one can run gradient descent directly on the space of measures. This chapter derives the formal Wasserstein gradient, explains the JKO minimizing-movement scheme, records the role of geodesic convexity in convergence, and then applies the same calculus to mean-field neural-network training.
Minimizing Movements and Wasserstein Gradients
This first section explains how a variational implicit-Euler step on measures gives rise, in the small-step limit, to a continuity equation driven by the Wasserstein gradient of the energy.
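We consider a function $f : X \to \mathbb{R}$ on a metric space $(X, \mathrm{dist})$ and seek a minimizing evolution $t \mapsto x(t) \in X$. The minimizing-movement strategy builds a discrete-time evolution $(x_k)_k$ with step size $\tau > 0$ using an implicit Euler scheme:
$$x_{k+1} \in \operatorname*{argmin}_{x \in X} \; \frac{1}{2\tau}\,\mathrm{dist}(x_k, x)^2 + f(x). \tag{1}$$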
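Euclidean Gradient Flows
If (1) is restricted to finite dimensions with $X = \mathbb{R}^d$ and $\mathrm{dist}(x, y) = \|x - y\|$, it becomes the implicit Euler scheme
$$x_{k+1} \in \operatorname*{argmin}_{x \in \mathbb{R}^d} \; \frac{1}{2\tau}\,\|x - x_k\|^2 + f(x). \tag{2}$$
Its solution is formally
$$x_{k+1} = (\mathrm{Id} + \tau \nabla f)^{-1}(x_k), \quad \text{i.e.} \quad x_{k+1} = x_k - \tau \nabla f(x_{k+1}). \tag{3}$$
In contrast, explicit Euler uses
$$x_{k+1} = x_k - \tau \nabla f(x_k). \tag{4}$$
Both schemes converge as $\tau \to 0$ to the classical gradient flow
$$\dot{x}(t) = -\nabla f(x(t)). \tag{5}$$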
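To make the comparison concrete, here is a minimal sketch on the one-dimensional quadratic $f(x) = \frac{\lambda}{2}x^2$, for which the implicit step has a closed form. The function names and parameter values are illustrative assumptions, not part of the chapter.

```python
import math


def explicit_euler(x0, lam, tau, n_steps):
    """Explicit Euler for f(x) = lam/2 * x**2: x_{k+1} = x_k - tau * f'(x_k)."""
    x = x0
    for _ in range(n_steps):
        x = x - tau * lam * x
    return x


def implicit_euler(x0, lam, tau, n_steps):
    """Implicit Euler: x_{k+1} minimizes (x - x_k)**2 / (2*tau) + f(x).

    For the quadratic f, the optimality condition x - x_k + tau*lam*x = 0
    has the closed-form solution x = x_k / (1 + tau*lam).
    """
    x = x0
    for _ in range(n_steps):
        x = x / (1.0 + tau * lam)
    return x


def exact_flow(x0, lam, t):
    """Solution of the gradient flow x'(t) = -lam * x(t)."""
    return x0 * math.exp(-lam * t)


lam, x0, T = 2.0, 1.0, 1.0
errors = []  # entries: (tau, explicit error, implicit error) at time T
for n_steps in (10, 100, 1000):
    tau = T / n_steps
    target = exact_flow(x0, lam, T)
    errors.append((tau,
                   abs(explicit_euler(x0, lam, tau, n_steps) - target),
                   abs(implicit_euler(x0, lam, tau, n_steps) - target)))

for tau, e_exp, e_imp in errors:
    print(f"tau={tau:.3f}  explicit error={e_exp:.2e}  implicit error={e_imp:.2e}")
```

Both errors shrink with the step size. For a large step such as $\tau = 2$ with $\lambda = 2$, the explicit iterates blow up while the implicit ones still contract, which is one reason the implicit form is the natural one to generalize.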
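Wasserstein Gradient Formula
The implicit Euler scheme has the advantage that it does not require $f$ or the metric $\mathrm{dist}$ to be smooth. For $\mathrm{dist} = W_2$, this is crucial because evolutions over measures may have densities, atoms or other singular parts.
As $\tau \to 0$, under suitable conditions on $f$, (1) defines a continuous evolution $t \mapsto \alpha_t$. As in the dynamic formulation, this evolution can be described by a Lagrangian evolution. We use the following first-variation convention: for any probability measures $\alpha, \beta$ and the signed zero-mass perturbation $\xi = \beta - \alpha$,
$$f(\alpha + \varepsilon \xi) = f(\alpha) + \varepsilon \int \delta f(\alpha)\,\mathrm{d}\xi + o(\varepsilon). \tag{6}$$
The key infinitesimal object is the vector field that represents this differential in the Wasserstein metric:
$$\nabla_W f(\alpha) := \nabla\big(\delta f(\alpha)\big). \tag{7}$$
The associated formal gradient flow is the continuity equation
$$\partial_t \alpha_t + \operatorname{div}(\alpha_t v_t) = 0, \qquad v_t = -\nabla\big(\delta f(\alpha_t)\big). \tag{8}$$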
The following proposition explains why this vector field is the Riemannian gradient for the $L^2(\alpha)$ metric on velocities.
Proof
The Wasserstein gradient-flow viewpoint already appears in John D. Lafferty’s PhD work, published as “The Density Manifold and Configuration Space Quantization”, under the name “density manifold”. It was then systematically developed by Otto, who exposed the formal Riemannian structure of this space (Otto, 2001). Rigorous metric-space treatments and numerical JKO schemes can be found in (Ambrosio et al., 2006; Benamou et al., 2016; Peyré, 2015; Gallouët & Monsaingeon, 2017).
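From the JKO Step to the Velocity Field
A first-order expansion of the JKO step explains why (8) uses the vector field $-\nabla(\delta f(\alpha))$. Write (1), for the Wasserstein metric $W_2$, as a minimization over displacement fields $v$ such that $\alpha_{k+1} = (\mathrm{Id} + \tau v)_\sharp \alpha_k$:
$$\min_{v} \; \frac{1}{2\tau} W_2^2\big(\alpha_k, (\mathrm{Id} + \tau v)_\sharp \alpha_k\big) + f\big((\mathrm{Id} + \tau v)_\sharp \alpha_k\big).$$
The push-forward and energy expansions are
$$(\mathrm{Id} + \tau v)_\sharp \alpha_k = \alpha_k - \tau \operatorname{div}(\alpha_k v) + o(\tau), \qquad f(\alpha_k + \tau \xi) = f(\alpha_k) + \tau \int \delta f(\alpha_k)\,\mathrm{d}\xi + o(\tau),$$
and hence, after integrating by parts,
$$f\big((\mathrm{Id} + \tau v)_\sharp \alpha_k\big) = f(\alpha_k) + \tau \int \big\langle \nabla \delta f(\alpha_k)(x), v(x) \big\rangle\,\mathrm{d}\alpha_k(x) + o(\tau).$$
Since $W_2^2\big(\alpha_k, (\mathrm{Id} + \tau v)_\sharp \alpha_k\big)$ equals $\tau^2 \int \|v\|^2\,\mathrm{d}\alpha_k$ to leading order, the problem minimized in (1) has the first-order expansion
$$f(\alpha_k) + \tau \int \Big( \tfrac{1}{2}\|v(x)\|^2 + \big\langle \nabla \delta f(\alpha_k)(x), v(x) \big\rangle \Big)\,\mathrm{d}\alpha_k(x) + o(\tau).$$
The pointwise minimizer is $v = -\nabla(\delta f(\alpha_k))$, which gives the velocity in the continuity equation. We now detail examples of such Wasserstein gradient flows.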

JKO minimizing movements for the entropy flow in one dimension. The left panel displays successive implicit-Euler minimizers for the heat equation, colored from red to blue. The right panel tracks inverse CDF values at selected probability levels, giving a Lagrangian view of the proximal movement in Wasserstein space.
The interactive demo uses the heat-flow representative of the entropy JKO scheme: changing the step size changes the spacing between implicit Euler iterates, while the quantile panel shows how the same movement is seen in Lagrangian coordinates.
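Discrete Evolutions
If $f$ can be evaluated on discrete distributions and $\nabla(\delta f(\alpha))$ is continuous in this case, the flow (8) maintains the number of Dirac masses:
$$\alpha_t = \frac{1}{n} \sum_{i=1}^n \delta_{x_i(t)}. \tag{19}$$
The particles evolve according to the coupled ODE
$$\dot{x}_i(t) = -\nabla\big(\delta f(\alpha_t)\big)(x_i(t)) = -n\,\nabla_{x_i} F\big(x_1(t), \ldots, x_n(t)\big), \tag{20}$$
where $F(x_1, \ldots, x_n) := f\big(\frac{1}{n}\sum_{i} \delta_{x_i}\big)$. The factor $n$ comes from the empirical Wasserstein metric $\frac{1}{n}\sum_i \|\dot{x}_i\|^2$.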
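Linear Functionals
The simplest example is a linear functional
$$f(\alpha) = \int V(x)\,\mathrm{d}\alpha(x). \tag{21}$$
Here $\delta f(\alpha) = V$ is independent of $\alpha$. The flow (8) becomes
$$\partial_t \alpha_t = \operatorname{div}(\alpha_t \nabla V), \qquad \dot{x}_i(t) = -\nabla V(x_i(t)). \tag{22}$$
Thus particles move independently according to the usual gradient flow (5).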
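Shannon Neg-Entropy
A very different behavior is obtained by considering functionals that require $\alpha$ to have a density $\rho$ with respect to the Lebesgue measure. The canonical example is Shannon neg-entropy
$$f(\alpha) = \int \rho(x) \log \rho(x)\,\mathrm{d}x, \qquad \alpha = \rho\,\mathrm{d}x. \tag{23}$$
Here $\delta f(\alpha) = \log \rho$ up to an additive constant, so $\nabla \delta f(\alpha) = \nabla \log \rho = \nabla \rho / \rho$, often called the score. The flow (8) becomes the heat equation
$$\partial_t \rho_t = \operatorname{div}(\rho_t \nabla \log \rho_t) = \Delta \rho_t. \tag{24}$$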
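As a numerical sanity check of the statement that heat flow is a descent for Shannon neg-entropy, the following sketch runs an explicit finite-difference heat equation on a periodic one-dimensional grid and records $\int \rho \log \rho$ along the way. The grid size, initial density, and step ratio are arbitrary illustrative choices. This is a plain PDE discretization, not a JKO solver.

```python
import math


def heat_step(rho, r):
    """One explicit finite-difference step of d/dt rho = Laplacian(rho), periodic grid.

    r = dt / dx**2 must satisfy r <= 1/2 for stability.
    """
    n = len(rho)
    # rho[i - 1] wraps around to the last cell when i == 0 (periodic boundary).
    return [rho[i] + r * (rho[(i + 1) % n] - 2.0 * rho[i] + rho[i - 1])
            for i in range(n)]


def neg_entropy(rho, dx):
    """Riemann-sum approximation of the integral of rho * log(rho)."""
    return sum(v * math.log(v) for v in rho) * dx


n = 64
dx = 1.0 / n
# Positive initial density of unit mass on the periodic interval [0, 1).
rho = [1.0 + 0.5 * math.cos(2.0 * math.pi * i * dx) for i in range(n)]

r = 0.25
entropies = [neg_entropy(rho, dx)]
for _ in range(400):
    rho = heat_step(rho, r)
    entropies.append(neg_entropy(rho, dx))

mass = sum(rho) * dx
print(f"mass = {mass:.6f}")
print(f"neg-entropy: start = {entropies[0]:.5f}, end = {entropies[-1]:.5f}")
```

Each step replaces every cell by a convex combination of its neighbors, so the discrete mass is conserved and the discrete neg-entropy cannot increase, mirroring the continuous dissipation.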
Other entropy functionals lead to nonlinear diffusion equations; finite-volume and particle discretizations are discussed in (Carrillo et al., 2015; Gianazza et al., 2009; Maas, 2011; Erbar, 2010).
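For example, a generalized entropy
$$f(\alpha) = \int \phi(\rho(x))\,\mathrm{d}x, \qquad \alpha = \rho\,\mathrm{d}x, \tag{25}$$
for a scalar convex function $\phi$ leads, in the smooth-density regime, to
$$\partial_t \rho_t = \operatorname{div}\big(\rho_t \nabla \phi'(\rho_t)\big) = \Delta p(\rho_t), \tag{26}$$
where the pressure $p$ satisfies $p'(s) = s\,\phi''(s)$. For $\phi(s) = s \log s$, one has $p(s) = s$ and recovers the Shannon case (23), whose flow is the heat equation; for $\phi(s) = s^m/(m-1)$ with $m > 1$, one obtains $p(s) = s^m$ up to an additive constant and the porous-medium equation $\partial_t \rho_t = \Delta(\rho_t^m)$.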
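A celebrated theorem by McCann (McCann, 1997) states that an internal energy of the form (25), for $\alpha = \rho\,\mathrm{d}x$ with $\phi(0) = 0$, is geodesically convex on $\mathcal{P}_2(\mathbb{R}^d)$ when $\phi$ is convex and the map $s \mapsto s^d \phi(s^{-d})$ is convex and nonincreasing on $(0, \infty)$. Examples include $\phi(s) = s^m$ for $m > 1$ and Shannon entropy $\phi(s) = s \log s$. By contrast, $\phi(s) = -\log s$, associated with the reverse KL divergence, does not satisfy this displacement-convexity criterion.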

Entropy-driven Wasserstein gradient flows from the same compact initial density. The heat flow is generated by Shannon entropy and instantly develops Gaussian tails. The porous-medium flows use the power entropy $\phi(s) = s^m/(m-1)$, hence $\partial_t \rho_t = \Delta(\rho_t^m)$; the right panel uses a larger exponent $m$, and therefore a stronger nonlinearity, than the middle panel. Larger powers diffuse mainly where the density is high, producing a flatter core and a sharper compact free boundary.
The interactive demo isolates the effect of the entropy exponent. The heat curve keeps Gaussian tails, while increasing $m$ keeps a compact front and spreads mass mainly from the high-density core.
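Interaction Energies
To obtain nonlinear evolutions without requiring the measure to have a density, one can consider
$$f(\alpha) = \frac{1}{2} \iint k(x, y)\,\mathrm{d}\alpha(x)\,\mathrm{d}\alpha(y). \tag{27}$$
For a symmetric kernel $k$,
$$\delta f(\alpha)(x) = \int k(x, y)\,\mathrm{d}\alpha(y), \qquad \nabla \delta f(\alpha)(x) = \int \nabla_1 k(x, y)\,\mathrm{d}\alpha(y). \tag{28}$$
For $\alpha_t = \frac{1}{n}\sum_{i} \delta_{x_i(t)}$, the flow (8) implies the particle system
$$\dot{x}_i(t) = -\frac{1}{n} \sum_{j=1}^n \nabla_1 k\big(x_i(t), x_j(t)\big). \tag{29}$$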
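The particle system for an interaction energy can be sketched in a few lines. The example below assumes a one-dimensional Gaussian kernel $k(x, y) = \exp(-(x-y)^2/(2\sigma^2))$ and an explicit Euler time discretization; the bandwidth, step size, and initial positions are hypothetical choices made only for illustration.

```python
import math


def kernel(x, y, sigma):
    """Gaussian kernel k(x, y) = exp(-(x - y)**2 / (2 * sigma**2)) in 1-D."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def grad1_kernel(x, y, sigma):
    """Derivative of k with respect to its first argument."""
    return -(x - y) / sigma ** 2 * kernel(x, y, sigma)


def energy(xs, sigma):
    """Interaction energy (1 / (2 n^2)) * sum_{i,j} k(x_i, x_j)."""
    n = len(xs)
    return sum(kernel(a, b, sigma) for a in xs for b in xs) / (2.0 * n * n)


def velocity(xs, sigma):
    """Particle velocities -(1/n) * sum_j grad_1 k(x_i, x_j)."""
    n = len(xs)
    return [-sum(grad1_kernel(a, b, sigma) for b in xs) / n for a in xs]


sigma, dt = 0.5, 0.05
xs = [-0.2, -0.1, 0.0, 0.1, 0.2]
initial_spread = max(xs) - min(xs)
energies = [energy(xs, sigma)]
for _ in range(200):
    v = velocity(xs, sigma)
    xs = [x + dt * vi for x, vi in zip(xs, v)]  # explicit Euler in time
    energies.append(energy(xs, sigma))

final_spread = max(xs) - min(xs)
print(f"spread: {initial_spread:.3f} -> {final_spread:.3f}")
print(f"energy: {energies[0]:.4f} -> {energies[-1]:.4f}")
```

With a positive Gaussian kernel the particles repel each other, so the cloud spreads and the interaction energy decreases. Flipping the sign of the kernel would instead produce attraction.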
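If $k$ is positive definite, or more generally conditionally positive definite on signed measures of zero total mass as for the energy-distance kernel $k(x, y) = -\|x - y\|$, and one minimizes the squared kernel discrepancy to a teacher distribution $\beta$, then
$$f(\alpha) = \frac{1}{2} \iint k\,\mathrm{d}(\alpha - \beta)\,\mathrm{d}(\alpha - \beta) = \frac{1}{2} \iint k\,\mathrm{d}\alpha\,\mathrm{d}\alpha + \int V_\beta\,\mathrm{d}\alpha + \text{const}, \qquad V_\beta(x) := -\int k(x, y)\,\mathrm{d}\beta(y).$$
Thus MMD-type training energies are exactly an interaction energy plus a linear potential, up to an additive constant. The teacher distribution appears through the potential $V_\beta$, and the corresponding empirical Wasserstein gradient flow is
$$\dot{x}_i(t) = -\frac{1}{n} \sum_{j=1}^n \nabla_1 k\big(x_i(t), x_j(t)\big) + \int \nabla_1 k\big(x_i(t), y\big)\,\mathrm{d}\beta(y).$$
The first term is a kernelized self-interaction; the second is the attraction induced by the continuous teacher kernel mean. At the continuum level, characteristic positive-definite kernels, and the Euclidean energy-distance kernel on probability measures, have $\beta$ as the unique minimizer of $f$. For finitely many particles, however, the flow can only form a kernelized quadrature of $\beta$, and small particle systems may cover the target modes poorly. The particle-count figure below illustrates this finite-particle effect.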

Particle count in the deterministic Wasserstein gradient flow of the squared MMD-type discrepancy to a smooth two-Gaussian teacher distribution, using here the energy-distance kernel $k(x, y) = -\|x - y\|$. The teacher itself is shown only through true density contours, while red dots are a compact shifted Gaussian initialization placed away from the target, red-to-blue curves show a thinned subset of particle trajectories, and blue dots show the stabilized long-time particles. With too few particles, the empirical measure forms a sparse kernelized quadrature and may under-cover the target modes; increasing the number of particles $n$ makes the particle cloud approximate the continuous target geometry more faithfully.
The interactive demo turns this finite-particle effect into a parameter: increasing the number of particles makes the same deterministic force field approximate the teacher geometry more faithfully.

Interaction-energy particle flows for three choices of the kernel $k$. A positive Gaussian kernel produces short-range repulsion under Wasserstein descent; changing its sign produces attraction and collapse; adding a quadratic long-range attraction to the repulsive kernel yields a balanced attraction–repulsion dynamics. The curves use arclength-based red-to-blue coloring along a longer integration of the coupled particle ODE (20).
The interactive demo lets the sign and strength of the interaction change without editing the hidden particle solver. This is the quickest way to see how the same formal ODE can repel, collapse, or self-organize.

Particle trajectories induced by different discrepancy geometries. The red particles and blue target cloud are the same in all panels. Straight OT displacement produces rays from an optimal matching; an MMD-type witness field gives smoother nonlocal forces; the Sinkhorn-divergence force is an entropic, debiased transport attraction; and the normalized drifting field combines attraction to data with self-repulsion. The figure is qualitative: it compares geometric behavior, not solver performance.
The interactive demo keeps the source and target fixed while switching the discrepancy geometry. The smoothing parameter controls how local or nonlocal the induced force appears.
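Stochastic Particles and McKean–Vlasov Limits
Deterministic particle flows have stochastic counterparts, where Brownian noise at the particle level becomes an entropy term at the measure level. If the drift $b$ does not depend on the empirical measure, each particle evolves independently according to
$$\mathrm{d}X_t = b(X_t)\,\mathrm{d}t + \sqrt{2\varepsilon}\,\mathrm{d}B_t,$$
where $\varepsilon > 0$ is the noise level and $B_t$ is a Brownian motion, and the one-particle law $\rho_t$ satisfies the linear Fokker–Planck equation
$$\partial_t \rho_t + \operatorname{div}(\rho_t b) = \varepsilon \Delta \rho_t.$$
For example, if $b = -\nabla V$, this is the gradient flow of the free energy
$$f(\alpha) = \int V\,\mathrm{d}\alpha + \varepsilon \int \rho \log \rho\,\mathrm{d}x, \qquad \alpha = \rho\,\mathrm{d}x.$$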
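A small simulation can illustrate the link between particle noise and the entropy term. The sketch below assumes the quadratic potential $V(x) = x^2/2$ and the SDE $\mathrm{d}X = -V'(X)\,\mathrm{d}t + \sqrt{2\,\texttt{noise}}\,\mathrm{d}B$, discretized with Euler–Maruyama; the function name `langevin_ou`, the seed, and all numerical values are hypothetical. For this potential the minimizer of the free energy is the Gaussian with density proportional to $e^{-V/\texttt{noise}}$, whose variance equals `noise`.

```python
import math
import random


def langevin_ou(n_particles, noise, dt, n_steps, x_init, seed=0):
    """Euler-Maruyama for dX = -V'(X) dt + sqrt(2 * noise) dB with V(x) = x**2 / 2."""
    rng = random.Random(seed)
    xs = [x_init] * n_particles
    scale = math.sqrt(2.0 * noise * dt)
    for _ in range(n_steps):
        # Drift step -V'(x) = -x, plus an independent Gaussian increment per particle.
        xs = [x - dt * x + scale * rng.gauss(0.0, 1.0) for x in xs]
    return xs


noise = 0.5
xs = langevin_ou(n_particles=2000, noise=noise, dt=0.01, n_steps=500, x_init=2.0)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(f"empirical mean = {mean:.3f}, empirical variance = {var:.3f}, noise = {noise}")
```

All particles start from the same point, yet the empirical variance settles near the noise level, which is the finite-particle trace of the entropy term in the free energy.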
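The mean-field case is different: the drift is recomputed from the current empirical distribution of all particles,
$$\mathrm{d}X_t^i = b[\hat{\alpha}_t^n](X_t^i)\,\mathrm{d}t + \sqrt{2\varepsilon}\,\mathrm{d}B_t^i, \qquad \hat{\alpha}_t^n := \frac{1}{n} \sum_{j=1}^n \delta_{X_t^j},$$
where $\varepsilon > 0$ is the noise level. For finite $n$, the empirical law $\hat{\alpha}_t^n$ is random. Under suitable Lipschitz, growth and chaotic-initialization assumptions, propagation of chaos states that finitely many particles become asymptotically independent as $n \to \infty$, all with the same deterministic law $\alpha_t$. Equivalently, $\hat{\alpha}_t^n$ converges in probability to this law. The limiting density $\rho_t$ solves the nonlinear Fokker–Planck, or McKean–Vlasov, equation
$$\partial_t \rho_t + \operatorname{div}\big(\rho_t\, b[\rho_t]\big) = \varepsilon \Delta \rho_t.$$
When the interaction drift has variational form
$$b[\alpha] = -\nabla\big(\delta g(\alpha)\big)$$
for some energy $g$, this PDE is the Wasserstein gradient flow of the entropy-regularized energy
$$f(\alpha) = g(\alpha) + \varepsilon \int \rho \log \rho\,\mathrm{d}x, \qquad \alpha = \rho\,\mathrm{d}x.$$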

Three numerical representations of the same entropy-regularized Wasserstein gradient flow toward a two-Gaussian target shifted to the right of an initially isotropic Gaussian density. The first row simulates independent Langevin particles and displays a thinned set of trajectories in the left panel. The second row evolves many deterministic particles whose velocity combines the drift with a score term, estimating the score of the current density by a sharper kernel-density estimate; only representative trajectories and particle subsets are displayed. The third row solves the corresponding Fokker–Planck equation on a grid, starting from the initial density in the left panel. The remaining columns use front-loaded times, so that the onset of the flow and the later deformation toward a bimodal law are both visible.
The interactive demo compares three views of the same entropy-regularized relaxation: stochastic Langevin particles, deterministic score particles, and a smoothed grid density. The noise slider controls the entropy strength.
Geodesic Convexity and Convergence
Geodesic convexity is the convexity notion adapted to Wasserstein geometry. It is the condition that turns the formal gradient-flow calculus into a convergence theory.
Geodesics and Convexity
A constant-speed geodesic between $\alpha_0$ and $\alpha_1$ is obtained, as in the McCann interpolation, from any optimal coupling $\pi$ by
$$\alpha_t = (P_t)_\sharp \pi,$$
where $P_t(x, y) = (1 - t)x + ty$ and $t \in [0, 1]$. If the optimal plan is induced by a Brenier map $T$, this reduces to $\alpha_t = \big((1 - t)\,\mathrm{Id} + t\,T\big)_\sharp \alpha_0$. The coupling formula matters because geodesics exist even when no Monge map exists, for instance when a Dirac mass must split.
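In one dimension with uniform empirical measures of equal size, the optimal coupling is the monotone (sorted) matching, so the McCann interpolation can be sketched directly. The sample points below are arbitrary, and the helper names are illustrative assumptions.

```python
import math


def w2_1d(xs, ys):
    """W_2 between two uniform empirical measures on the line (same number of atoms)."""
    n = len(xs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / n)


def mccann_interpolation(xs, ys, t):
    """Atoms of alpha_t = ((1 - t) x + t y)_# pi for the monotone (sorted) coupling pi."""
    return [(1.0 - t) * a + t * b for a, b in zip(sorted(xs), sorted(ys))]


xs = [0.3, -1.0, 2.0, 0.9]
ys = [4.0, 1.5, 3.2, 5.1]
total = w2_1d(xs, ys)
for t in (0.0, 0.25, 0.5, 1.0):
    alpha_t = mccann_interpolation(xs, ys, t)
    # Constant speed: the distance from alpha_0 to alpha_t is t times the total distance.
    print(f"t={t:.2f}  W2(alpha_0, alpha_t)={w2_1d(xs, alpha_t):.6f}  t*W2={t * total:.6f}")
```

The printed distances agree, which is the constant-speed property of the geodesic in this simple setting.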
Proof
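Along a Monge geodesic $\alpha_t = \big((1 - t)\,\mathrm{Id} + t\,T\big)_\sharp \alpha_0$, convexity of the potential $V$ gives $V\big((1 - t)x + tT(x)\big) \le (1 - t)V(x) + tV(T(x))$, and strong convexity gives the additional quadratic term; integrating against $\alpha_0$ proves the first claim.
The interaction claim follows similarly by applying convexity of the interaction profile $W$, with $k(x, y) = W(x - y)$, to pairwise differences and integrating over two independent copies. The entropy claim is McCann’s displacement-convexity theorem; at the density level it follows from the concavity of the Jacobian determinant under the interpolation of optimal maps. Finally, for a reference measure $\beta$ with density proportional to $e^{-V}$,
$$\mathrm{KL}(\alpha | \beta) = \int \rho \log \rho\,\mathrm{d}x + \int V\,\mathrm{d}\alpha + \text{const}, \qquad \alpha = \rho\,\mathrm{d}x,$$
so it is the sum of displacement-convex entropy and a $\lambda$-geodesically convex linear potential.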
Convergence of the Flow
In general, analyzing (8) is delicate. The cleanest case is when $f$ is geodesically convex. This condition is the Wasserstein analogue of convexity in Euclidean gradient descent.
Proof
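Let $\alpha^\star$ be a minimizer of $f$, let $v_t = -\nabla(\delta f(\alpha_t))$ be the velocity of the flow, and let $T_t$ be the optimal map from $\alpha_t$ to $\alpha^\star$. The chain rule and the formal Wasserstein-gradient proposition give
$$\frac{\mathrm{d}}{\mathrm{d}t} f(\alpha_t) = -\int \big\|\nabla \delta f(\alpha_t)\big\|^2\,\mathrm{d}\alpha_t.$$
Geodesic convexity along the geodesic from $\alpha_t$ to $\alpha^\star$ gives
$$f(\alpha^\star) \ge f(\alpha_t) + \int \big\langle \nabla \delta f(\alpha_t)(x), T_t(x) - x \big\rangle\,\mathrm{d}\alpha_t(x).$$
Since $v_t = -\nabla(\delta f(\alpha_t))$,
$$\int \big\langle v_t(x), T_t(x) - x \big\rangle\,\mathrm{d}\alpha_t(x) \ge f(\alpha_t) - f(\alpha^\star).$$
The first-variation formula for the squared Wasserstein distance gives
$$\frac{\mathrm{d}}{\mathrm{d}t}\,\frac{1}{2} W_2^2(\alpha_t, \alpha^\star) = -\int \big\langle v_t(x), T_t(x) - x \big\rangle\,\mathrm{d}\alpha_t(x) \le -\big(f(\alpha_t) - f(\alpha^\star)\big),$$
which proves the differential inequality. Integrating it from 0 to $t$ and using monotonicity of $s \mapsto f(\alpha_s)$ gives
$$f(\alpha_t) - f(\alpha^\star) \le \frac{W_2^2(\alpha_0, \alpha^\star)}{2t}.$$
If $f$ is $\lambda$-geodesically convex with $\lambda > 0$, the Wasserstein analogue of strong convexity gives the slope inequality
$$\int \big\|\nabla \delta f(\alpha_t)\big\|^2\,\mathrm{d}\alpha_t \ge 2\lambda \big(f(\alpha_t) - f(\alpha^\star)\big).$$
Combining it with the energy dissipation identity yields
$$\frac{\mathrm{d}}{\mathrm{d}t} \big(f(\alpha_t) - f(\alpha^\star)\big) \le -2\lambda \big(f(\alpha_t) - f(\alpha^\star)\big),$$
and Gronwall’s lemma gives the exponential rate $f(\alpha_t) - f(\alpha^\star) \le e^{-2\lambda t}\big(f(\alpha_0) - f(\alpha^\star)\big)$.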
Proof
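Let $\alpha_t$ be the McCann interpolation between $\alpha_0$ and $\alpha_1$, written with an optimal coupling $\pi$ as $\alpha_t = \big((1 - t)x + ty\big)_\sharp \pi$. For a linear energy with potential $V$, Jensen’s inequality gives
$$V\big((1 - t)x + ty\big) \le (1 - t)V(x) + tV(y),$$
and the strong convexity version gives the additional term $-\frac{\lambda}{2}\,t(1 - t)\,\|x - y\|^2$. Integrating over the optimal coupling proves geodesic convexity and $\lambda$-geodesic convexity.
For interaction energies with $k(x, y) = W(x - y)$, use two independent copies $(x, y)$ and $(x', y')$ of the optimal coupling. The pairwise displacement evolves as
$$\big((1 - t)x + ty\big) - \big((1 - t)x' + ty'\big) = (1 - t)(x - x') + t(y - y').$$
Convexity of $W$ gives the convexity inequality after integration over the product coupling. Evenness of $W$ ensures that the interaction is symmetric in the two particles and matches the usual factor $\frac{1}{2}$ in (27).
The entropy claim is McCann’s displacement-convexity theorem. For smooth positive densities and Brenier maps, it follows from the change-of-variables formula and the concavity of the determinant along positive matrices; the general statement is obtained by approximation. Finally, for a reference measure $\beta$ with density proportional to $e^{-V}$,
$$\mathrm{KL}(\alpha | \beta) = \int \rho \log \rho\,\mathrm{d}x + \int V\,\mathrm{d}\alpha + \text{const}, \qquad \alpha = \rho\,\mathrm{d}x,$$
so it is the sum of the displacement-convex entropy and the $\lambda$-geodesically convex linear potential generated by $V$. The energy-decay proposition then applies to all four cases.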
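Convexity and Curvature
The same language is not restricted to subsets of $\mathbb{R}^d$. If $(X, \mathrm{dist}, \mu)$ is a geodesic metric-measure space, geodesics between measures can be defined by transporting each pair of endpoints along metric geodesics, or more intrinsically by dynamical optimal plans on path space. Given a reference measure $\mu$, the entropy relative to $\mu$ is
$$\mathrm{Ent}_\mu(\alpha) = \int \rho \log \rho\,\mathrm{d}\mu \quad \text{if } \alpha = \rho\,\mu, \qquad \mathrm{Ent}_\mu(\alpha) = +\infty \text{ otherwise.}$$
On a smooth Riemannian manifold $(M, g)$, the Ricci curvature tensor $\mathrm{Ric}$ is the trace of the Riemann curvature tensor. The lower bound $\mathrm{Ric} \ge \kappa\,g$ means that $\mathrm{Ric}_x(v, v) \ge \kappa\,g_x(v, v)$ for every tangent vector $v \in T_x M$. The fundamental link between curvature and optimal transport is that this tensor lower bound is exactly encoded by geodesic convexity of entropy.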
This equivalence was developed in the smooth Riemannian setting by Cordero-Erausquin, McCann and Schmuckenschläger and by von Renesse and Sturm (Cordero-Erausquin et al., 2001; von Renesse & Sturm, 2005); it is a central theme of the optimal-transport approach to curvature in Villani’s monograph (Villani, 2009). Lott–Villani and Sturm then used the same entropy-convexity principle to define synthetic lower Ricci curvature bounds on metric-measure spaces (Lott & Villani, 2009; Sturm, 2006). Outside this convex, curvature-controlled regime, such as in the mean-field neural-network example below, the flow may still be informative but its convergence analysis requires problem-specific arguments.
Training Two-Layer MLPs as Wasserstein Flows
Mean-field limits recast the training of wide neural networks as transport of a distribution of neurons. This section shows how the particle ODE of gradient descent becomes a Wasserstein flow in parameter space.
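We use $x \in \mathbb{R}^p$ for the input data and $y \in \mathbb{R}^q$ for the label. A neuron is a particle
$$z = (w, a) \in \mathbb{R}^p \times \mathbb{R}^q,$$
where $w$ is the inner weight and $a$ is the outer vector weight. For a scalar nonlinearity $\sigma$, define the vector-valued feature
$$\psi_z(x) := a\,\sigma(\langle w, x \rangle).$$
The width-$n$ network and its mean-field version are
$$F_n(x) = \frac{1}{n} \sum_{i=1}^n \psi_{z_i}(x), \qquad F_\alpha(x) = \int \psi_z(x)\,\mathrm{d}\alpha(z).$$
This formulation removes the artificial ordering of neurons and allows $\alpha$ to be a continuous distribution of infinitely many neurons.
Let $\mu$ be a probability distribution on data-label pairs $(x, y)$. For a loss $\ell$, the population risk is
$$f(\alpha) = \int \ell\big(F_\alpha(x), y\big)\,\mathrm{d}\mu(x, y),$$
and the empirical risk is the special case $\mu = \frac{1}{m}\sum_{k=1}^m \delta_{(x_k, y_k)}$. Since $\alpha \mapsto F_\alpha$ is linear, $f$ is convex as a function of $\alpha$ whenever $\ell(\cdot, y)$ is convex. For the empirical neuron law $\hat{\alpha}_n = \frac{1}{n}\sum_i \delta_{z_i}$, the Wasserstein metric induces on particles the rescaled metric $\frac{1}{n}\sum_i \|\dot{z}_i\|^2$. The corresponding particle flow is
$$\dot{z}_i(t) = -n\,\nabla_{z_i} f(\hat{\alpha}_n) = -\nabla\big(\delta f(\hat{\alpha}_n)\big)(z_i).$$
This is the gradient flow of $f(\hat{\alpha}_n)$ for the Wasserstein particle metric, equivalently Euclidean gradient descent with time scale multiplied by $n$. It gives a particle discretization of (8).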
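The particle flow can be sketched for a toy problem. The example below assumes scalar inputs and outputs, neurons $z_i = (w_i, a_i)$ with feature $a_i \tanh(w_i x)$, the squared loss, and labels generated by a hypothetical one-neuron teacher $y = \tanh(2x)$. The explicit Euler step size and all data are illustrative assumptions, and `tanh` is used instead of ReLU only to keep the gradients smooth.

```python
import math
import random

rng = random.Random(0)

# Hypothetical toy data: scalar inputs, labels from a one-neuron tanh teacher.
xs = [-1.0 + 2.0 * k / 19 for k in range(20)]
ys = [math.tanh(2.0 * x) for x in xs]

n = 20  # number of neurons (particles z_i = (w_i, a_i))
w = [rng.gauss(0.0, 1.0) for _ in range(n)]
a = [rng.gauss(0.0, 1.0) for _ in range(n)]


def predict(x, w, a):
    """Mean-field network F(x) = (1/n) * sum_i a_i * tanh(w_i * x)."""
    return sum(ai * math.tanh(wi * x) for wi, ai in zip(w, a)) / len(w)


def risk(w, a):
    """Empirical squared-loss risk (1 / (2 m)) * sum_k (F(x_k) - y_k)**2."""
    return sum((predict(x, w, a) - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))


step = 0.1
losses = [risk(w, a)]
for _ in range(300):
    residuals = [predict(x, w, a) - y for x, y in zip(xs, ys)]
    new_w, new_a = [], []
    for wi, ai in zip(w, a):
        # Gradient of the first variation at z_i = (w_i, a_i); this equals
        # n times the Euclidean gradient of the risk with respect to z_i.
        g_w = sum(r * ai * (1.0 - math.tanh(wi * x) ** 2) * x
                  for r, x in zip(residuals, xs)) / len(xs)
        g_a = sum(r * math.tanh(wi * x) for r, x in zip(residuals, xs)) / len(xs)
        new_w.append(wi - step * g_w)
        new_a.append(ai - step * g_a)
    w, a = new_w, new_a
    losses.append(risk(w, a))

print(f"risk: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because the update uses the gradient of the first variation rather than the raw Euclidean gradient, the effective step per neuron does not shrink as the width `n` grows, which is the time rescaling by $n$ described above.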
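Assume that the loss $\ell$ is differentiable in its first variable, and write $F_\alpha(x) = \int \psi_z(x)\,\mathrm{d}\alpha(z)$ for the mean-field network and $\mu$ for the data distribution. The first variation is
$$\delta f(\alpha)(z) = \int \big\langle \partial_1 \ell\big(F_\alpha(x), y\big), \psi_z(x) \big\rangle\,\mathrm{d}\mu(x, y),$$
and the Wasserstein gradient in parameter space is
$$\nabla_z \delta f(\alpha)(z) = \int \big[\partial_z \psi_z(x)\big]^\top \partial_1 \ell\big(F_\alpha(x), y\big)\,\mathrm{d}\mu(x, y).$$
For the squared Euclidean loss $\ell(u, y) = \frac{1}{2}\|u - y\|^2$, the energy is the sum of a quadratic interaction and a linear potential:
$$f(\alpha) = \frac{1}{2} \iint K(z, z')\,\mathrm{d}\alpha(z)\,\mathrm{d}\alpha(z') + \int V(z)\,\mathrm{d}\alpha(z) + \frac{1}{2}\int \|y\|^2\,\mathrm{d}\mu(x, y), \tag{73}$$
with
$$K(z, z') := \int \big\langle \psi_z(x), \psi_{z'}(x) \big\rangle\,\mathrm{d}\mu(x, y), \qquad V(z) := -\int \big\langle \psi_z(x), y \big\rangle\,\mathrm{d}\mu(x, y).$$
Thus
$$\nabla_z \delta f(\alpha)(z) = \int \nabla_1 K(z, z')\,\mathrm{d}\alpha(z') + \nabla V(z).$$
These kernels are generally not convex in the particle variable, so the geodesic-convex convergence theory above does not apply directly.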

Mean-field training of a homogeneous two-layer model as transport in neuron space. The left panel shows the Wasserstein particle gradient flow in reduced homogeneous coordinates, with black dashed rays marking the teacher directions. The right panel shows the weighted angular density along a front-loaded sequence of times, colored from red to blue, so that the early concentration of neuron directions is visible. The display follows the rendering of the auxiliary MLP experiment but keeps only the flow, not the spectral-flow comparison.
The interactive demo gives a lightweight version of the same phenomenon: particles move in reduced neuron coordinates, while their angles concentrate around the teacher directions.
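Classical Convexity and Stationarity
Before using the specific homogeneity mechanism of Chizat and Bach, it is useful to isolate a simpler convex-analytic principle behind many mean-field arguments. Consider an energy
$$f(\alpha) = \frac{1}{2} \iint K(z, z')\,\mathrm{d}\alpha(z)\,\mathrm{d}\alpha(z') + \int V(z)\,\mathrm{d}\alpha(z)$$
on probability measures over a parameter domain. Assume that the quadratic part is convex in the classical affine structure of measures:
$$\iint K(z, z')\,\mathrm{d}\xi(z)\,\mathrm{d}\xi(z') \ge 0 \quad \text{for every signed measure } \xi \text{ of zero total mass.}$$
This is ordinary convexity of the functional on the convex set of measures, not displacement convexity along geodesics.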
Proof Sketch
The dissipation identity for the gradient flow gives stationarity of the limit $\alpha_\infty$: formally, after passing to the limit,
$$\int \big\|\nabla \delta f(\alpha_\infty)\big\|^2\,\mathrm{d}\alpha_\infty = 0.$$
Without support and positivity assumptions, this identity only controls the first variation on the region explored by the limit. The density hypothesis allows one to test against sufficiently many signed density perturbations of total mass zero. By approximation and the assumed regularity, this yields the first-order variational inequality $\int \delta f(\alpha_\infty)\,\mathrm{d}(\beta - \alpha_\infty) \ge 0$ for arbitrary competitors $\beta$. Classical convexity of $f$ in the affine variable then gives the usual subgradient inequality
$$f(\beta) \ge f(\alpha_\infty) + \int \delta f(\alpha_\infty)\,\mathrm{d}(\beta - \alpha_\infty) \ge f(\alpha_\infty).$$
Thus no competitor has smaller energy. For square-loss two-layer mean-field models, (73) is exactly of this quadratic-plus-linear form, and positive semidefiniteness of the induced kernel is the classical convexity assumption.
The mean-field description of two-layer training was developed in several works, including (Chizat & Bach, 2018; Mei et al., 2018). The distinctive contribution of Chizat and Bach is a global-convergence analysis for positively homogeneous networks without adding an explicit regularizer or relying on noisy SGD to create a Laplacian term. The following formal statement isolates the core mechanism and ignores the technical issues due to ReLU non-smoothness, support propagation and compactness.
Proof
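Write
$$h(z) := \delta f(\alpha)(z)$$
for the first variation of the risk $f$ at the stationary measure $\alpha$. By two-homogeneity of $z \mapsto \psi_z$, $h(rz) = r^2 h(z)$ for every $r > 0$. Normalize a nonzero direction $u = z/\|z\|$ and choose $r > 0$ with $ru$ in the support of $\alpha$. Stationarity gives a zero radial derivative at this point:
$$0 = \big\langle \nabla h(ru), u \big\rangle = \frac{\mathrm{d}}{\mathrm{d}r}\big(r^2 h(u)\big) = 2r\,h(u).$$
Hence $h(u) = 0$ for every direction $u$, and by homogeneity $h(z) = 0$ for every $z$.
For any competitor $\beta$, convexity of $f$ gives
$$f(\beta) \ge f(\alpha) + \int h\,\mathrm{d}(\beta - \alpha) = f(\alpha).$$
Thus no competitor has smaller risk. The rigorous theorem replaces the full directional support assumption by propagation and overparameterization hypotheses ensuring that a negative descent direction would be present in the support and would contradict stationarity.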
- Otto, F. (2001). The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1–2), 101–174.
- Ambrosio, L., Gigli, N., & Savaré, G. (2006). Gradient Flows in Metric Spaces and in the Space of Probability Measures. Springer.
- Benamou, J.-D., Carlier, G., Mérigot, Q., & Oudet, E. (2016). Discretization of functionals involving the Monge–Ampère operator. Numerische Mathematik, 134(3), 611–636.
- Peyré, G. (2015). Entropic approximation of Wasserstein gradient flows. SIAM Journal on Imaging Sciences, 8(4), 2323–2351.
- Gallouët, T. O., & Monsaingeon, L. (2017). A JKO splitting scheme for Kantorovich–Fisher–Rao gradient flows. SIAM Journal on Mathematical Analysis, 49(2), 1100–1130.
- Carrillo, J. A., Chertock, A., & Huang, Y. (2015). A finite-volume method for nonlinear nonlocal equations with a gradient flow structure. Communications in Computational Physics, 17(01), 233–258.
- Gianazza, U., Savaré, G., & Toscani, G. (2009). The Wasserstein gradient flow of the Fisher information and the quantum drift-diffusion equation. Archive for Rational Mechanics and Analysis, 194(1), 133–220.
- Maas, J. (2011). Gradient flows of the entropy for finite Markov chains. Journal of Functional Analysis, 261(8), 2250–2292.
- Erbar, M. (2010). The heat equation on manifolds as a gradient flow in the Wasserstein space. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 46(1), 1–23.
- McCann, R. J. (1997). A convexity principle for interacting gases. Advances in Mathematics, 128(1), 153–179.
- Cordero-Erausquin, D., McCann, R. J., & Schmuckenschläger, M. (2001). A Riemannian interpolation inequality à la Borell, Brascamp and Lieb. Inventiones Mathematicae, 146(2), 219–257.
- von Renesse, M.-K., & Sturm, K.-T. (2005). Transport inequalities, gradient estimates, entropy and Ricci curvature. Communications on Pure and Applied Mathematics, 58(7), 923–940.
- Villani, C. (2009). Optimal Transport: Old and New (Vol. 338). Springer.
- Lott, J., & Villani, C. (2009). Ricci curvature for metric-measure spaces via optimal transport. Annals of Mathematics, 169(3), 903–991.
- Sturm, K.-T. (2006). On the geometry of metric measure spaces. I. Acta Mathematica, 196(1), 65–131.