Generative Models via Transportation

The preceding gradient-flow calculus is variational. Modern machine-learning models often use the same transportation language more broadly: one may prescribe an interpolation and regress its velocity, fit a one-step generator to a descent field, or view network depth as a continuous transport of token measures. The examples below separate what is genuinely a Wasserstein gradient flow from what is a transportation dynamics with a useful geometric interpretation.

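```python
# Notebook setup: locate the book's helper directory and define a figure helper.
from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

# Search the working directory and nearby folders for myst/ot4ml_web.py.
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        sys.path.insert(0, str(myst_dir))  # make the helper module importable
        break

if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

# Figure thumbnails live next to the myst/ directory, at the repository root.
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"

def show_book_figure(name, width=760):
    """Display the PNG thumbnail called `name` at the given pixel width."""
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))
```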

Generative Models via Flow Matching

Flow matching constructs a generative map by learning the velocity field of an interpolation. The key computational insight is that a constrained continuity-equation problem can be trained by an unconstrained regression.

Generative models aim to build a transportation map $T$ between a reference distribution $\alpha$ (typically an isotropic Gaussian) and the target data distribution $\beta$. Since such reference measures are non-atomic, a measurable map with $T_\sharp\alpha=\beta$ exists on standard Borel spaces, for instance by identifying both probability spaces with the unit interval and using a quantile-type rearrangement. This abstract existence statement is much weaker than having an explicit and numerically stable construction of $T$. Optimal transport is one approach to achieving this, but it is computationally expensive and raises the question of how to estimate it from samples. A different route is to prescribe an interpolation between noise and data, learn its velocity, and obtain $T$ by integrating a time-dependent vector field $v_t$.

This point of view sits at the meeting point of two literatures, surveyed from a transport perspective in Peyré, 2025. The diffusion branch builds on score matching Hyvärinen, 2005, denoising score matching Vincent, 2011, nonequilibrium noising chains Sohl-Dickstein et al., 2015, denoising diffusion probabilistic models Ho et al., 2020, score-based generative modeling Song & Ermon, 2019, and the continuous-time score-SDE/probability-flow formulation Song et al., 2021. The deterministic regression branch was introduced, essentially in parallel, under three closely related names: flow matching Lipman et al., 2023, rectified flow Liu et al., 2023, and stochastic interpolants Albergo et al., 2025. In all three cases, the computational object is a velocity field whose regression loss avoids simulating the learned ODE during training.

This vector field $v_t$ is obtained by constructing an interpolation $\alpha_t$ and then finding $v_t$ using the least-squares formula of the dynamic chapter. As we will explain, for a specific class of interpolations (obtained by a parametric push-forward), this $v_t$ can be obtained without explicitly inverting a Laplacian, by computing a simple conditional expectation instead. This conditional expectation can itself be estimated by solving another least-squares problem, but this time unconstrained, making the estimation feasible from finite samples of $\alpha$ and $\beta$.

Stochastic interpolant.

The word “stochastic” can hide two different levels of randomness. We first use the simpler one: after drawing a latent variable $U\sim\pi$, the path $t\mapsto P_t(U)$ is deterministic and differentiable. The randomness only comes from the initial draw of $U$; after taking the push-forward law, $\alpha_t=(P_t)_\sharp\pi$ is a deterministic curve of measures and obeys an ordinary continuity equation. This is the setting behind the stochastic-interpolant construction recalled in the remark “Static-noise stochastic interpolants” below, and behind the flow-matching and rectified-flow regressions below. Genuine temporal noise, where the path itself has Brownian fluctuations, is different and is discussed in the remark “Brownian realizations of interpolant marginals”.

We assume first that $\alpha_t$ is obtained by pushing a latent distribution $\pi \in \Pp(\RR^{d'})$ through a time-dependent map $P_t : \RR^{d'} \to \RR^d$ ; the latent dimension $d'$ may be larger than the data dimension $d$ :

\forall t \in [0,1], \quad \alpha_t := (P_t)_\sharp \pi.

(1)

The basic two-endpoint construction already covers most flow-matching paths used in practice.

Example: Linear two-endpoint deterministic interpolants

Set $d'=2d$ , write $(x,y)\in\RR^d\times\RR^d$ , and choose $P_0(x,y)=x$ and $P_1(x,y)=y$ . If $\pi$ has marginals $(\alpha_0,\alpha_1)$ , then $\alpha_t=(P_t)_\sharp\pi$ interpolates between the two endpoint laws. The simplest choices are the independent, or trivial, coupling $\pi=\alpha_0\otimes\alpha_1$ and the straight path

P_t(x,y)=(1-t)x+ty.

(2)

With this linear path and an arbitrary coupling $\pi$, the regression below is the common core of flow matching and rectified flow: Lipman et al. emphasize conditional probability paths and simulation-free training of continuous normalizing flows, while rectified flow emphasizes straight couplings, reflow, and the possibility of reducing transport costs and discretization error Lipman et al., 2023; Liu et al., 2023.

More complex constructions are possible when sampling from $\pi$ remains simple. Static auxiliary randomness is still handled by enlarging the latent variable, while Brownian noise leads to the diffusion correction described below; this is the broader stochastic-interpolant viewpoint connecting deterministic flows, probability-flow ODEs and diffusion SDEs Albergo et al., 2025.

If $\pi = \alpha \otimes \beta$ and $\alpha = \frac{1}{n} \sum_i \delta_{x_i}$ , $\beta = \frac{1}{m} \sum_j \delta_{y_j}$ , then $\alpha_t$ consists of $n \times m$ Dirac masses

\alpha_t = \frac{1}{nm} \sum_{i,j} \delta_{P_t(x_i,y_j)}.

(3)

If $\pi = (\Id, T)_\sharp \alpha$ is a Brenier-type coupling, then $\alpha_t = ((1-t)\Id + tT)_\sharp \alpha$ is the so-called McCann OT interpolation.
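The two empirical constructions above can be sketched numerically. The snippet below is a minimal illustration, not part of the original text: the helper names `product_interpolation` and `mccann_interpolation_1d` are hypothetical, and the second helper assumes dimension one with $n=m$, where the sorted pairing is the quadratic optimal coupling.

```python
import numpy as np

def product_interpolation(x, y, t):
    """Support of alpha_t for the product coupling: all n*m points (1-t) x_i + t y_j."""
    # x has shape (n, d), y has shape (m, d); each returned atom carries mass 1/(n*m).
    pts = (1.0 - t) * x[:, None, :] + t * y[None, :, :]
    return pts.reshape(-1, x.shape[1])

def mccann_interpolation_1d(x, y, t):
    """Displacement interpolation in 1D with n = m, using the monotone (sorted) pairing."""
    return (1.0 - t) * np.sort(x) + t * np.sort(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))          # n = 4 source atoms in R^2
y = rng.normal(size=(3, 2)) + 3.0    # m = 3 target atoms in R^2
mid = product_interpolation(x, y, 0.5)
print(mid.shape)                     # n * m = 12 atoms in R^2
```

The product coupling multiplies the number of atoms, whereas the Brenier-type coupling keeps one atom per source point.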

The figure below separates the effect of the endpoint coupling from the effect of the path joining each paired endpoint.

Flow matching interpolants between the same empirical source and target measures. A product-style random pairing produces crossing paths, an OT pairing gives direct displacement rays, and a curved bridge changes the path geometry while keeping the same endpoints. Gray arrows mark representative midpoint velocities $\partial_tP_t$ .


Flow matching formula.

This interpolation is not directly useful for sampling from $\beta$ , but it can be used to define a flow field $v_t$ so that the continuity equation, in Eulerian form, holds. This flow field is computed by solving an unconstrained least-squares problem, or equivalently, it is a conditional expectation.

Proposition: Flow matching vector field

For each fixed $t$ , assume $\partial_tP_t\in L^2(\pi;\RR^d)$ . Consider the flow-matching problem over measurable fields $v_t:\RR^d\to\RR^d$

\min_{v_t} \int_{\RR^{d'}} \norm{v_t(P_t(u)) - [\partial_t P_t](u)}^2 \, \d\pi(u).

(4)

Its minimizer is characterized $\alpha_t$-almost everywhere by the conditional expectation

v_t(z) = \EE_{u \sim \pi} \big( [\partial_t P_t](u) \, \big| \, z = P_t(u) \big).

(5)

Then the pair $(\alpha_t,v_t)$ satisfies the continuity equation $\partial_t\alpha_t+\diverg(\alpha_t v_t)=0$, in the weak sense made precise in (10).
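To make the conditional expectation (5) concrete, here is a small sketch for the linear path with a uniform product coupling of two atoms on each side. The function name `empirical_velocity` and the numerical values are illustrative assumptions. At time $t=1/2$ two trajectories cross at $z=1.5$ with latent velocities $3$ and $-1$, and the Eulerian field averages them.

```python
from collections import defaultdict

def empirical_velocity(x, y, t, decimals=9):
    """Average the latent velocities y_j - x_i over all pairs landing on the same point z."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for xi in x:
        for yj in y:
            z = round((1.0 - t) * xi + t * yj, decimals)  # position P_t(x_i, y_j)
            sums[z] += yj - xi                              # latent velocity of this pair
            counts[z] += 1
    return {z: sums[z] / counts[z] for z in sums}

v = empirical_velocity([0.0, 2.0], [1.0, 3.0], t=0.5)
print(v)
```

In this toy case the averaged field equals $1$ at every atom of $\alpha_{1/2}$, which is the velocity of a plain translation, even though no individual trajectory crossing at $z=1.5$ moves with that velocity.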

We first recall the two equivalent ways of writing the interpolated measure. Formally, one may write

\alpha_t(z)=\int_{\RR^{d'}}\delta(z-P_t(u))\,\d\pi(u),

(6)

while the rigorous meaning is that, for every smooth test function $\varphi$ ,

\int_{\RR^d}\varphi(z)\,\d\alpha_t(z) = \int_{\RR^{d'}}\varphi(P_t(u))\,\d\pi(u).

(7)

The minimizer in (4) is the orthogonal projection in $L^2(\pi;\RR^d)$ of the latent velocity $\partial_tP_t(u)$ onto the closed subspace of functions that depend on $u$ only through $P_t(u)$ . This projection is the conditional expectation (5). Formally, this can be read as

v_t(z)=\frac{1}{\alpha_t(z)} \int_{\RR^{d'}}\delta(z-P_t(u))[\partial_tP_t](u)\,\d\pi(u),

(8)

and rigorously it means that, for every smooth test vector field $m$ ,

\int \dotp{m(z)}{v_t(z)} \, \d\alpha_t(z) = \int \dotp{m(P_t(u))}{[\partial_t P_t](u)} \, \d\pi(u).

(9)

We now prove that this field transports the curve $(\alpha_t)_t$ . The weak form of $\partial_t\alpha_t+\diverg(\alpha_t v_t)=0$ is that, for every smooth scalar test function $\varphi$ ,

\frac{\d}{\d t}\int\varphi(z)\,\d\alpha_t(z) - \int\dotp{v_t(z)}{\nabla\varphi(z)}\,\d\alpha_t(z) =0.

(10)

Using (7) and differentiating under the integral sign gives

\frac{\d}{\d t}\int \varphi(z)\d\alpha_t(z) = \int \dotp{\nabla\varphi(P_t(u))}{[\partial_t P_t](u)}\d\pi(u).

(11)

On the other hand, applying (9) with $m=\nabla\varphi$ gives

\int\dotp{v_t(z)}{\nabla\varphi(z)}\,\d\alpha_t(z) = \int \dotp{\nabla\varphi(P_t(u))}{[\partial_t P_t](u)}\d\pi(u).

(12)

Comparing (11) and (12) yields (10), which is the desired continuity equation.
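Identity (11) can be checked numerically on an empirical latent law. The sketch below assumes the linear path, a Gaussian product coupling and the test function $\varphi=\sin$, all chosen only for illustration, and compares a centered finite difference in time with the right-hand side of (11) evaluated on the same samples.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5000)            # first latent coordinate
y = rng.normal(size=5000) + 2.0      # second latent coordinate

def P(t):
    return (1.0 - t) * x + t * y     # linear path P_t(x, y)

phi, dphi = np.sin, np.cos           # scalar test function and its derivative
t, h = 0.3, 1e-4

# Left-hand side of (11): time derivative of the integral of phi against alpha_t.
lhs = (phi(P(t + h)).mean() - phi(P(t - h)).mean()) / (2 * h)
# Right-hand side of (11): pairing of grad(phi) with the latent velocity y - x.
rhs = (dphi(P(t)) * (y - x)).mean()
print(lhs, rhs)
```

Both sides use the same empirical measure, so they agree up to the finite-difference error and not merely up to Monte Carlo noise.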

The conditional expectation in (5) has a simple measure-theoretic meaning. Let $\alpha_t=(P_t)_\sharp\pi$ and define the vector-valued flux measure $\omega_t$ on $\RR^d$ by

\int_{\RR^d}\dotp{\psi(z)}{\d\omega_t(z)} \eqdef \int_{\RR^{d'}}\dotp{\psi(P_t(u))}{[\partial_tP_t](u)}\d\pi(u)

(13)

for every bounded continuous vector field $\psi$ . Since $\alpha_t(A)=0$ implies $\pi(P_t^{-1}(A))=0$ , one has $\omega_t\ll\alpha_t$ . The Radon--Nikodym decomposition of $\omega_t$ with respect to $\alpha_t$ is therefore

\d\omega_t(z)=v_t(z)\d\alpha_t(z), \qquad v_t=\frac{\d\omega_t}{\d\alpha_t}.

(14)

In the language of Lebesgue decomposition, $\omega_t$ has only an absolutely continuous part with respect to $\alpha_t$ and no singular part; the conditional expectation is precisely its density. This agrees with the flux notation used in the dynamic formulation. Equivalently, disintegrating $\pi$ with respect to the map $P_t$ gives $\pi(\d u)=\pi_{t,z}(\d u)\alpha_t(\d z)$ , where $\pi_{t,z}$ is supported on the fiber $\{u\,:\,P_t(u)=z\}$ , and

v_t(z)=\int_{\{P_t(u)=z\}}[\partial_tP_t](u)\d\pi_{t,z}(u).

(15)

Thus the solution of (4) is the conditional expectation of the velocities $\partial_t P_t$: intuitively, $v_t(z)$ is the average velocity of all trajectories passing through $z$. Numerically, $(x,t) \mapsto v_t(x)$ can be parameterized by a neural network (e.g., a U-Net for vision tasks) and estimated using stochastic gradient descent on the objective in (4). For the exact field $v_t$, integrating the ODE $\dot{x}=v_t(x)$ defines a transport map $T_t$. If $v_t$ is regular enough, or more generally if the continuity equation has a unique solution for this velocity, then $(T_t)_\sharp\alpha_0=\alpha_t$. Thus the same interpolation as (1) is represented by a deterministic flow rather than by the original coupling. The sampling procedure consists in first drawing $X_0 \sim \alpha_0 = \alpha$ and then integrating the ODE $\dot{X}_t = v_t(X_t)$ from the initial condition $X_0$. In the ideal exact-field limit, the resulting $X_1$ is distributed according to $\alpha_1 = \beta$.
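The train-then-integrate procedure can be sketched in one dimension. The example below is a simplified stand-in under stated assumptions: both endpoints are Gaussian, so the exact field is affine in $z$ and an affine least-squares fit (via `np.polyfit`) replaces the neural network; the endpoint laws, sample sizes and the explicit Euler scheme are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 20000, 100
x = rng.normal(0.0, 1.0, size=n)     # samples of alpha_0 = N(0, 1)
y = rng.normal(3.0, 0.5, size=n)     # samples of alpha_1 = N(3, 0.25), independent of x

def fit_affine_velocity(t):
    """Least-squares fit of y - x against z = (1-t) x + t y, i.e. objective (4) at time t."""
    z = (1.0 - t) * x + t * y
    slope, intercept = np.polyfit(z, y - x, deg=1)
    return slope, intercept

dt = 1.0 / steps
X = rng.normal(0.0, 1.0, size=n)     # fresh draws X_0 ~ alpha_0 used for sampling
for k in range(steps):
    slope, intercept = fit_affine_velocity((k + 0.5) * dt)  # field at the step midpoint
    X = X + dt * (slope * X + intercept)                     # explicit Euler step

print(X.mean(), X.std())             # should be close to 3 and 0.5
```

No ODE is simulated while fitting the field; integration only happens at sampling time.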

Remark: Static-noise stochastic interpolants

In the terminology of Albergo--Boffi--Vanden-Eijnden Albergo et al., 2025, a stochastic interpolant is not first defined as an SDE. It is an explicit random bridge

X_t = I_t(X_0,X_1,Z), \qquad X_0\sim\alpha_0,\quad X_1\sim\alpha_1,

(16)

where $Z$ is an auxiliary random variable, usually Gaussian and independent of the endpoints, and where

I_0(x_0,x_1,z)=x_0, \qquad I_1(x_0,x_1,z)=x_1.

(17)

A typical spatially linear example is

X_t=a(t)X_0+b(t)X_1+\gamma(t)Z, \qquad \gamma(0)=\gamma(1)=0.

(18)

The noise $Z$ is static: conditionally on $(X_0,X_1,Z)$ , the path $t\mapsto X_t$ is differentiable. Thus this construction is exactly the previous push-forward framework with $u=(X_0,X_1,Z)$ , $\pi=\operatorname{Law}(X_0,X_1,Z)$ , and $P_t=I_t$ . Its Eulerian velocity is therefore

v_t(x)=\EE\bigl[\partial_t I_t(X_0,X_1,Z)\mid X_t=x\bigr],

(19)

and the interpolant density satisfies the continuity equation. The associated SDEs in the stochastic-interpolant framework are alternative sampling dynamics having the same one-time marginals; they are not the definition of the interpolant itself.
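A minimal sketch of the spatially linear bridge (18) follows. The specific choices $a(t)=1-t$, $b(t)=t$ and $\gamma(t)=t(1-t)$ are assumptions made for illustration; any $\gamma$ vanishing at both endpoints would do. The function names are hypothetical.

```python
import numpy as np

def interpolant(t, x0, x1, z):
    """X_t = a(t) x0 + b(t) x1 + gamma(t) z with a = 1 - t, b = t, gamma = t (1 - t)."""
    return (1.0 - t) * x0 + t * x1 + t * (1.0 - t) * z

def interpolant_velocity(t, x0, x1, z):
    """Time derivative of the path for a frozen latent draw (x0, x1, z)."""
    return x1 - x0 + (1.0 - 2.0 * t) * z

rng = np.random.default_rng(0)
x0, x1, z = rng.normal(size=(3, 1000, 2))   # three independent Gaussian samples in R^2

# The path is differentiable in t once (x0, x1, z) is drawn: check by finite differences.
t, h = 0.4, 1e-4
fd = (interpolant(t + h, x0, x1, z) - interpolant(t - h, x0, x1, z)) / (2 * h)
print(np.abs(fd - interpolant_velocity(t, x0, x1, z)).max())
```

Regressing `interpolant_velocity` on `interpolant` at each time would then estimate the Eulerian velocity (19), exactly as in the push-forward framework.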

Remark: Brownian realizations of interpolant marginals

One can also represent an interpolating marginal curve by Brownian-in-time dynamics. This is a different construction from the static-noise bridge of the remark “Static-noise stochastic interpolants”. Let $Z_t$ solve the Itô equation

\d Z_t = r_t(U,Z_t)\d t + \Sigma_t(U,Z_t)\d B_t, \qquad \alpha_t=\operatorname{Law}(Z_t),

(20)

where $U\sim\pi$ is static and $B_t$ is Brownian motion. Define the Eulerian drift and diffusion matrix by conditioning on the observed state,

v_t(z)=\EE\bigl[r_t(U,Z_t)\mid Z_t=z\bigr], \qquad D_t(z)=\EE\bigl[\Sigma_t(U,Z_t)\Sigma_t(U,Z_t)^\top\mid Z_t=z\bigr].

(21)

Then, for smooth test functions $\varphi$ ,

\frac{\d}{\d t}\int \varphi\d\alpha_t = \int \dotp{\nabla\varphi}{v_t}\d\alpha_t + \frac12\int \Tr\bigl(D_t\nabla^2\varphi\bigr)\d\alpha_t,

(22)

or, in distributional form,

\partial_t\alpha_t + \diverg(\alpha_t v_t) = \frac12\sum_{i,j}\partial_{ij}^2\bigl((D_t)_{ij}\alpha_t\bigr).

(23)

Thus the natural noisy analogue of (4) regresses the instantaneous drift,

\min_w \EE\bigl[\norm{w_t(Z_t)-r_t(U,Z_t)}^2\bigr],

(24)

and learns the drift term of a Fokker--Planck equation, not a pure continuity equation unless the diffusion tensor vanishes. When $\alpha_t=\rho_t\d x$ has a smooth positive density, the same marginal curve can be represented, at least formally, by a probability-flow ODE

\partial_t\rho_t+\diverg(\rho_t\bar v_t)=0, \qquad \bar v_t = v_t - \frac{1}{2\rho_t}\diverg(\rho_t D_t),

(25)

where the divergence of the matrix field $\rho_tD_t$ is taken row-wise. In the scalar spatially homogeneous case $D_t=\sigma_t^2\Id$ , this reduces to

\bar v_t = v_t - \frac{\sigma_t^2}{2}\nabla\log\rho_t,

(26)

which is the familiar score correction relating diffusion SDEs to probability-flow ODEs.
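The score correction (26) can be illustrated on a case where everything is explicit. The sketch below assumes a one-dimensional Ornstein--Uhlenbeck dynamics $\d Z_t=-Z_t\d t+\sqrt2\,\d B_t$ started from $\Gaussian(0,4)$, so that $v_t(z)=-z$, $D_t=2$, and the marginals stay Gaussian with a known variance. It integrates the SDE by Euler--Maruyama and the probability-flow ODE by explicit Euler; these numerical schemes and all constants are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
s0_sq = 4.0                                  # variance of the initial law N(0, s0_sq)
n, steps, T = 20000, 1000, 1.0
dt = T / steps

def var_t(t):
    """Variance of the OU marginal at time t for dZ = -Z dt + sqrt(2) dB."""
    return 1.0 + (s0_sq - 1.0) * np.exp(-2.0 * t)

z_sde = rng.normal(0.0, np.sqrt(s0_sq), size=n)
z_ode = z_sde.copy()
for k in range(steps):
    t = k * dt
    # Euler--Maruyama step: drift r = -z and diffusion coefficient sqrt(2), so D = 2.
    z_sde = z_sde - z_sde * dt + np.sqrt(2.0 * dt) * rng.normal(size=n)
    # Probability-flow velocity (26): v - (sigma^2 / 2) * score, with sigma^2 = 2
    # and Gaussian score -z / var_t(t).
    z_ode = z_ode + dt * (-z_ode + z_ode / var_t(t))

print(z_sde.var(), z_ode.var(), var_t(T))
```

The noisy and deterministic particles follow different trajectories but share the same one-time marginals, here visible through the matching variances.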

Connection with diffusion models.

In the special case where $P_t(x,y)=(1-t)x+ty$ is a linear interpolation and $\pi = \alpha \otimes \beta$ , the curve $\alpha_t$ is a convolution of rescaled versions of $\alpha_0$ and $\alpha_1$ . The flow-matching problem (4) becomes

\min_{(v_t)_t} \int_{\RR^{d} \times \RR^d} \norm{v_t( (1-t)x+t y ) - (y-x) }^2 \, \d\alpha_0(x) \d\alpha_1(y).

(27)

When one endpoint is an isotropic Gaussian, this construction is closely related to the probability-flow formulation of diffusion models, up to the usual change of time parametrization Song et al., 2021. This is why flow matching can be viewed both as a deterministic alternative to diffusion training and as a common language for diffusion paths, OT-inspired paths, and rectified paths Lipman et al., 2023; Liu et al., 2023; Albergo et al., 2025. The next two propositions are written in the noising direction, from a data law $\alpha$ to a Gaussian; reversing time gives the corresponding sampling flow. They also give an explicit closed form for $v_t$ and show that it is a gradient field. In this setting, $v_t$ is also the solution of the constrained least-squares problem from the dynamic chapter. The regression (4) is computationally simpler because the continuity equation has already been enforced by the chosen interpolant. To prove this, we rely on Tweedie’s formula Efron, 2011, which expresses the optimal Gaussian denoiser through the score, i.e. the gradient of the log-density.

Proposition: Tweedie identity

Let $W$ be a random vector in $\RR^{d}$ with law $\beta\in\Pp_1(\RR^d)$ . For $\sigma>0$ , observe

Z \;=\; W + \sigma\,\varepsilon, \quad\text{where } \varepsilon \sim \Gaussian(0,\Id) \text{ is independent of } W.

(28)

Denote by $\rho_\sigma$ the smooth positive density of

\beta_\sigma \;=\; \beta * \Gaussian\bigl(0,\sigma^{2}\Id\bigr),

(29)

which is the law of $Z$ . Then the conditional mean admits the following everywhere-defined version:

\EE\bigl[\,W \mid Z=z\bigr] \;=\; z \;+\;\sigma^{2}\,\nabla \log \rho_\sigma(z) \qquad\text{for all } z \in \RR^{d}.

(30)
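The identity (30) is easy to test when $\beta$ is discrete. The sketch below assumes the two-point law $\beta=\tfrac12(\delta_{-1}+\delta_{1})$ in dimension one and a noise level $\sigma=0.7$, both illustrative. It computes the posterior mean directly by Bayes' rule and compares it with $z+\sigma^2\nabla\log\rho_\sigma(z)$, where the score is obtained by finite differences.

```python
import numpy as np

sigma = 0.7
atoms = np.array([-1.0, 1.0])        # beta = (delta_{-1} + delta_{+1}) / 2
weights = np.array([0.5, 0.5])

def log_density(z):
    """log rho_sigma(z), up to an additive constant, for beta * N(0, sigma^2)."""
    comps = -0.5 * ((z[:, None] - atoms[None, :]) / sigma) ** 2
    return np.log((weights * np.exp(comps)).sum(axis=1))

def posterior_mean(z):
    """E[W | Z = z] computed directly from Bayes' rule."""
    comps = weights * np.exp(-0.5 * ((z[:, None] - atoms[None, :]) / sigma) ** 2)
    resp = comps / comps.sum(axis=1, keepdims=True)   # posterior weight of each atom
    return resp @ atoms

z = np.linspace(-3.0, 3.0, 61)
h = 1e-5
score = (log_density(z + h) - log_density(z - h)) / (2 * h)   # finite-difference score
tweedie = z + sigma**2 * score
print(np.abs(tweedie - posterior_mean(z)).max())
```

For this symmetric two-point law both expressions reduce to $\tanh(z/\sigma^2)$, a denoiser that pushes $z$ toward the nearest atom.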

Proposition: Gaussian-endpoint flow-matching field

Let $X\sim\alpha$ and $Y\sim\Gaussian(0,\Id)$ be independent. For $t\in(0,1)$ set

Z_t \;=\; (1-t)\,X + t\,Y, \qquad \alpha_t =\operatorname{Law}(Z_t)=\rho_t\,\d x.

(33)

The regression minimizer $v^\star:\RR^d\times(0,1)\to\RR^d$ of

\min_{v}\;\int_{0}^{1}\! \iint_{\RR^{d}\times\RR^{d}} \bigl|y-x-v\bigl((1-t)x+t y,t\bigr)\bigr|^{2}\, \d\alpha(x)\,\d\Gaussian(y)\,\d t

(34)

is given by

v^\star(x,t) = -\frac{1}{1-t}\,x \;-\; \frac{t}{1-t}\,\nabla\log\rho_t(x) \qquad (x\in\RR^{d},\;t\in(0,1)).

(35)

In particular, for each $t\in(0,1)$ this field is a gradient field,

v^\star(\cdot,t)=-\nabla \left( \frac{\norm{\cdot}^2}{2(1-t)} +\frac{t}{1-t}\log\rho_t \right).

(36)
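As a sanity check on (35), one can work out the case of Gaussian data. The computation below assumes $\alpha=\Gaussian(0,\kappa\Id)$ with a scalar $\kappa>0$; the symbols $\kappa$ and $\kappa_t$ are introduced here only for this illustration.

```latex
\begin{align*}
\rho_t &= \Gaussian(0,\kappa_t\Id), \qquad \kappa_t = (1-t)^2\kappa + t^2, \qquad \nabla\log\rho_t(x) = -\frac{x}{\kappa_t}, \\
v^\star(x,t) &= -\frac{x}{1-t} + \frac{t}{1-t}\,\frac{x}{\kappa_t}
 = \frac{t-\kappa_t}{(1-t)\,\kappa_t}\,x
 = \frac{t-(1-t)\kappa}{\kappa_t}\,x, \\
\text{since}\quad t-\kappa_t &= t(1-t) - (1-t)^2\kappa = (1-t)\bigl(t-(1-t)\kappa\bigr).
\end{align*}
```

The coefficient $(t-(1-t)\kappa)/\kappa_t$ is the coordinate-wise regression slope $\operatorname{Cov}(Y-X,Z_t)/\operatorname{Var}(Z_t)$, as expected from the conditional-expectation formula (5), and the apparent singularity $1/(1-t)$ in (35) cancels in this case.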

The figure below shows both directions of this construction: the prescribed interpolation noising the data and the same probability-flow ODE integrated backward for sampling.

One-dimensional diffusion bridge for a Gaussian-mixture data law. The forward path $Z_t=(1-t)X+tY$ smooths the red data density toward a blue Gaussian endpoint. Reversing the probability-flow ODE transports a denser set of blue noise samples back toward the data modes, making the splitting of trajectories across mixture components visible.


The same probability-flow intuition is visible in two dimensions. For a discrete data law, or more generally for a Gaussian mixture, the noising density is a Gaussian mixture whose score can be evaluated explicitly. This makes it possible to draw backward trajectories without training a neural network. In the plots below, the Gaussian endpoint has covariance $\sigma^2\Id$ to keep the geometry visible at the scale of the three atoms. For a scalar noising schedule $Z_t=a_tX+b_tY$ , the intermediate law has component centers $a_t c_j$ and covariance $(b_t\sigma)^2\Id$ . For the linear bridge, $p_t(z)=\sum_j w_j\Gaussian((1-t)c_j,(t\sigma)^2\Id)$ , with $s_t=\nabla\log p_t$ , and the scaled Gaussian-endpoint field gives $v_t(z)=-(z+t\sigma^2s_t(z))/(1-t)$ .
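The explicit score makes the backward trajectories computable in a few lines. The sketch below assumes three equally weighted atoms on the unit circle, $\sigma=0.5$, and an explicit Euler scheme integrated from $t=0.999$ down to $t=0.02$; these values and the function names are illustrative assumptions, not the settings used for the figures.

```python
import numpy as np

centers = np.array([[0.0, 1.0], [-np.sqrt(3) / 2, -0.5], [np.sqrt(3) / 2, -0.5]])  # atoms c_j
weights = np.full(3, 1.0 / 3.0)
sigma = 0.5

def score(z, t):
    """s_t = grad log p_t for p_t = sum_j w_j N((1-t) c_j, (t sigma)^2 Id)."""
    var = (t * sigma) ** 2
    diff = (1.0 - t) * centers[None, :, :] - z[:, None, :]       # shape (n, 3, 2)
    logits = np.log(weights)[None, :] - 0.5 * (diff**2).sum(axis=2) / var
    logits -= logits.max(axis=1, keepdims=True)                   # stabilize the softmax
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)                       # posterior component weights
    return (resp[:, :, None] * diff).sum(axis=1) / var

def velocity(z, t):
    """Scaled Gaussian-endpoint field v_t(z) = -(z + t sigma^2 s_t(z)) / (1 - t)."""
    return -(z + t * sigma**2 * score(z, t)) / (1.0 - t)

rng = np.random.default_rng(0)
z = sigma * rng.normal(size=(300, 2))           # noise samples at the Gaussian endpoint
times = np.linspace(0.999, 0.02, 2001)          # stay away from t = 1 and t = 0
for t_now, t_next in zip(times[:-1], times[1:]):
    z = z + (t_next - t_now) * velocity(z, t_now)   # explicit Euler step, backward in time

dist = np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
print(dist.mean())                              # small: samples sit near the atoms
```

The integration is stopped at a small positive time because the score of a Dirac mixture blows up as $t\to0$.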


Two-dimensional noising paths from three Dirac masses to a single Gaussian. The linear interpolation $Z_t=(1-t)X+tY$ moves component centers linearly toward the origin and grows covariance like $(t\sigma)^2\Id$ . The variance-preserving Ornstein--Uhlenbeck bridge has the same endpoints but a different speed of contraction and noising.


When is the induced map optimal?

Integrating the learned velocity gives a deterministic map from $\alpha_0$ to $\alpha_1$ , but this map is not automatically the Brenier optimal map. It is optimal only in special cases where the accumulated flow remains the gradient of a convex potential. The Gaussian product-coupling case already shows the precise obstruction: the interpolated covariances are simple, the velocity is affine, but the terminal map can contain a hidden rotational part. This phenomenon, and its extensions to rectified flows and mixtures, is analyzed in depth in Hertrich et al., 2025.

Proposition: Gaussian flow matching and optimality

Let $\Sigma_0,\Sigma_1\succ0$ and let $X_0\sim\Gaussian(0,\Sigma_0)$ and $X_1\sim\Gaussian(0,\Sigma_1)$ be independent. Consider the linear flow-matching interpolation

Z_t=(1-t)X_0+tX_1, \qquad \alpha_t=\operatorname{Law}(Z_t)=\Gaussian(0,\Sigma_t),

(37)

where

\Sigma_t=(1-t)^2\Sigma_0+t^2\Sigma_1.

(38)

Then the exact flow-matching velocity is affine, $v_t(z)=A_tz$ , with

A_t=\bigl(t\Sigma_1-(1-t)\Sigma_0\bigr)\Sigma_t^{-1}.

(39)

The induced flow map $T_t^{\rm FM}$ from $\alpha_0$ to $\alpha_t$ is

T_t^{\rm FM} = \Sigma_0^{1/2} \Bigl((1-t)^2\Id+t^2\Sigma_0^{-1/2}\Sigma_1\Sigma_0^{-1/2}\Bigr)^{1/2} \Sigma_0^{-1/2}.

(40)

In particular,

T_1^{\rm FM} = \Sigma_0^{1/2} \bigl(\Sigma_0^{-1/2}\Sigma_1\Sigma_0^{-1/2}\bigr)^{1/2} \Sigma_0^{-1/2}.

(41)

This terminal map coincides with the quadratic optimal transport map

T^{\rm OT} = \Sigma_0^{-1/2} \bigl(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}\bigr)^{1/2} \Sigma_0^{-1/2}

(42)

if and only if $\Sigma_0\Sigma_1=\Sigma_1\Sigma_0$ .

The conditional-expectation formula gives

v_t(z)=\EE[X_1-X_0\mid Z_t=z].

(43)

Since all variables are jointly Gaussian, this conditional expectation is linear and

v_t(z) = \operatorname{Cov}(X_1-X_0,Z_t)\operatorname{Cov}(Z_t)^{-1}z = \bigl(t\Sigma_1-(1-t)\Sigma_0\bigr)\Sigma_t^{-1}z,

(44)

which proves (39). To solve the characteristic equation, whiten the source by setting

C=\Sigma_0^{-1/2}\Sigma_1\Sigma_0^{-1/2}, \qquad \widetilde Z_t=\Sigma_0^{-1/2}Z_t.

(45)

In these coordinates the source covariance is $\Id$ and

\widetilde\Sigma_t=(1-t)^2\Id+t^2C.

(46)

Because $\Id$ and $C$ commute, the affine flow map in whitened coordinates is simply $\widetilde T_t=\widetilde\Sigma_t^{1/2}$ . Indeed,

\frac{\d}{\d t}\widetilde\Sigma_t^{1/2} = \bigl(tC-(1-t)\Id\bigr)\widetilde\Sigma_t^{-1/2},

(47)

which is exactly the equation $\dot{\widetilde T}_t=\widetilde A_t\widetilde T_t$ with $\widetilde T_0=\Id$ . Returning to the original coordinates gives (40), and $t=1$ gives (41).

Both $T_1^{\rm FM}$ and $T^{\rm OT}$ push $\Gaussian(0,\Sigma_0)$ to $\Gaussian(0,\Sigma_1)$ . The Brenier map between nondegenerate Gaussians is the unique symmetric positive definite linear map with this property. Hence $T_1^{\rm FM}=T^{\rm OT}$ if and only if $T_1^{\rm FM}$ is symmetric positive definite. The map $T_1^{\rm FM}$ is similar to $C^{1/2}$ , so if it is symmetric then it is automatically positive definite. It remains to characterize symmetry. Since $C^{1/2}$ is symmetric positive definite,

(T_1^{\rm FM})^\top = \Sigma_0^{-1/2}C^{1/2}\Sigma_0^{1/2}.

(48)

Thus symmetry of $T_1^{\rm FM}$ is equivalent to $\Sigma_0 C^{1/2}=C^{1/2}\Sigma_0$ , hence to $\Sigma_0 C=C\Sigma_0$ by functional calculus. Multiplying this identity on the left and right by $\Sigma_0^{1/2}$ gives $\Sigma_0\Sigma_1=\Sigma_1\Sigma_0$ . Conversely, if $\Sigma_0$ and $\Sigma_1$ commute, they are orthogonally co-diagonalizable, and both (41) and (42) reduce in that basis to the diagonal map with entries $\sqrt{\lambda_{1,k}/\lambda_{0,k}}$ . This proves the equivalence.
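The equivalence can be observed numerically. The sketch below evaluates formulas (41) and (42) for one commuting and one non-commuting pair of covariances; the matrices are arbitrary illustrative choices and the helper names are hypothetical. Matrix square roots are computed through a symmetric eigendecomposition.

```python
import numpy as np

def sqrtm_spd(M):
    """Symmetric positive definite square root via an eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def fm_terminal_map(S0, S1):
    """Terminal flow-matching map, formula (41)."""
    R = sqrtm_spd(S0)
    Rinv = np.linalg.inv(R)
    return R @ sqrtm_spd(Rinv @ S1 @ Rinv) @ Rinv

def ot_map(S0, S1):
    """Quadratic optimal transport map between centered Gaussians, formula (42)."""
    R = sqrtm_spd(S0)
    Rinv = np.linalg.inv(R)
    return Rinv @ sqrtm_spd(R @ S1 @ R) @ Rinv

S0 = np.diag([1.0, 4.0])
S1_commuting = np.diag([9.0, 1.0])
S1_generic = np.array([[2.0, 1.0], [1.0, 3.0]])   # does not commute with S0

for S1 in (S1_commuting, S1_generic):
    T_fm, T_ot = fm_terminal_map(S0, S1), ot_map(S0, S1)
    print(np.allclose(T_fm, T_ot), np.allclose(T_fm, T_fm.T))
```

Both maps transport the source covariance to the target covariance in each case; only the symmetry of the flow-matching map is lost when the covariances do not commute.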

The Gaussian optimality proposition explains why the statement “flow matching gives an optimal map” is fragile. The same terminal map (41) is obtained for any scalar schedule $Z_t=a_tX_0+b_tX_1$ with the same endpoints, because after whitening the covariance path remains $a_t^2\Id+b_t^2C$ . Thus changing the speed of a scalar Gaussian bridge, for instance by using an OU schedule, cannot repair the non-optimality created by non-commuting covariances. Commuting covariances reduce the terminal map to independent one-dimensional scalings, whereas non-commuting covariances create a non-symmetric affine map, hence a transport with a rotational or shearing component. More generally, mixture-like paths can create the same obstruction even when every instantaneous velocity looks natural. This distinction is closely related to counterexamples showing that flow maps associated with Fokker--Planck or diffusion-type evolutions do not in general provide optimal transport maps Lavenant & Santambrogio, 2022. In particular, starting from an isotropic Gaussian does not by itself guarantee optimality once the target distribution is non-Gaussian; additional structural assumptions on the path or on the coupling are needed.

Variations on the interpolant.

The geometry of the generated trajectories depends on the chosen interpolant, not only on the two endpoint laws. There is first a harmless ambiguity: a monotone reparametrization $Z_t=(1-\lambda(t))X+\lambda(t)Y$ of the linear bridge only changes the speed of the flow,

v_t(z)=\lambda'(t)\,v^{\rm lin}_{\lambda(t)}(z), \qquad v^{\rm lin}_{r}(z)=\EE[Y-X\mid (1-r)X+rY=z].

(49)

It therefore leaves the spatial integral curves unchanged. Diffusion models use a genuinely different family of noising paths. If

Z_t=a_tX+b_tY,\qquad Y\sim\Gaussian(0,\sigma^2\Id),

(50)

then both the mixture centers and the component variances are changed. Writing $p_t$ for the density of $Z_t$ and $s_t=\nabla\log p_t$ , Tweedie’s formula gives, away from times where $a_t=0$ ,

v_t(z)=a'_t\,\EE[X\mid Z_t=z]+b'_t\,\EE[Y\mid Z_t=z] =\frac{a'_t}{a_t}z+ \left(\frac{a'_tb_t^2}{a_t}-b'_tb_t\right)\sigma^2s_t(z).

(51)

For the linear bridge, $a_t=1-t$ and $b_t=t$ , this recovers the formula above. For the variance-preserving Ornstein--Uhlenbeck noising used in diffusion models,

a_\tau=e^{-\tau},\qquad b_\tau=\sqrt{1-e^{-2\tau}},

(52)

one obtains the forward probability-flow velocity $v_\tau(z)=-z-\sigma^2\nabla\log p_\tau(z)$ . Sampling follows the reverse field $z+\sigma^2\nabla\log p_\tau(z)$ as $\tau$ decreases. This is the noising law used in the diffusion trajectory panel below; the trajectories are more curved than for the linear bridge because the centers and variances evolve according to the OU/Fokker--Planck scaling rather than by affine interpolation. Numerically, the integration is stopped at a small positive time before the Dirac endpoint, where the score becomes singular.

The finite-time coefficients $a_t=\cos(\pi t/2)$ and $b_t=\sin(\pi t/2)$ are not a new spatial interpolant: they are exactly the OU coefficients after the time change $\tau=-\log\cos(\pi t/2)$ . The schedule comparison below therefore places the OU bridge next to a genuinely different scalar bridge,

a_t=(1-t)(1-2t), \qquad b_t=t,

(53)

whose data coefficient changes sign before vanishing. This overshooting bridge is mainly a diagnostic example: it keeps the same endpoints, but its intermediate mixture reflects through the origin and produces visibly different reverse trajectories.
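The specializations of (51) and the time change can be verified with a few array operations. In the sketch below, `field_coefficients` is a hypothetical helper returning the two scalar coefficients of (51), and the time grid is an arbitrary choice.

```python
import numpy as np

def field_coefficients(a, da, b, db):
    """Coefficients (c_z, c_s) in v_t(z) = c_z * z + c_s * sigma^2 * s_t(z), from (51)."""
    return da / a, da * b**2 / a - db * b

t = np.linspace(0.1, 0.9, 9)

# Linear bridge: a = 1 - t, b = t.
cz_lin, cs_lin = field_coefficients(1 - t, -np.ones_like(t), t, np.ones_like(t))

# Variance-preserving OU bridge (52), evaluated at tau = -log cos(pi t / 2).
tau = -np.log(np.cos(np.pi * t / 2))
a_ou, b_ou = np.exp(-tau), np.sqrt(1 - np.exp(-2 * tau))
cz_ou, cs_ou = field_coefficients(a_ou, -a_ou, b_ou, np.exp(-2 * tau) / b_ou)

# Overshooting bridge (53): the data coefficient is negative on (1/2, 1).
a_over = (1 - t) * (1 - 2 * t)
print(cz_ou, cs_ou, a_over)
```

The OU coefficients are both equal to $-1$, which is the field $-z-\sigma^2\nabla\log p_\tau(z)$, and after the time change they coincide with the trigonometric schedule.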


Diffusion-style sampling trajectories compared with OT rays in the three-Dirac setting. Red particles are sampled from the centered Gaussian endpoint and transported toward the three blue atoms. The diffusion panel integrates a reverse probability-flow field based on a Gaussian-mixture score, while the OT panel uses straight displacement rays selected by a quadratic matching.



Effect of the interpolant on the exact reverse flow for the same three-Dirac target and the same Gaussian endpoint. The linear bridge $a_t=1-t$ , $b_t=t$ produces almost radial curves. The variance-preserving OU bridge $a_\tau=e^{-\tau}$ , $b_\tau=\sqrt{1-e^{-2\tau}}$ changes the relative speed of contraction and noising. The overshooting bridge $a_t=(1-t)(1-2t)$ , $b_t=t$ is not a time reparameterization of either one and produces a more pronounced bending of the reverse trajectories.


Remark: Changing the bridge speed does not restore optimality

The same terminal map (41) is obtained for every continuously differentiable scalar schedule $Z_t=a_tX_0+b_tX_1$ satisfying $(a_0,b_0)=(1,0)$ , $(a_1,b_1)=(0,1)$ , and $a_t^2\Id+b_t^2C\succ0$ for all $t$ . Indeed, after whitening, the covariance path is $a_t^2\Id+b_t^2C$ , and the flow map is its positive square root. Thus changing the speed of a nondegenerate scalar Gaussian bridge, for instance by using an OU schedule, cannot repair the non-optimality created by non-commuting covariances.
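The schedule invariance can be checked by integrating the affine characteristic equation directly. The sketch below assumes the field $A_t=\operatorname{Cov}(\dot Z_t,Z_t)\operatorname{Cov}(Z_t)^{-1}$ with $\operatorname{Cov}(\dot Z_t,Z_t)=a_ta'_t\Sigma_0+b_tb'_t\Sigma_1$, and uses a classical fourth-order Runge--Kutta scheme, which is a numerical choice made here and not part of the text. The covariances and helper names are illustrative.

```python
import numpy as np

def sqrtm_spd(M):
    """Symmetric positive definite square root via an eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def terminal_map(S0, S1, schedule, steps=400):
    """Integrate dT/dt = A_t T with T_0 = Id by RK4 and return T_1."""
    def A(t):
        a, da, b, db = schedule(t)
        return (a * da * S0 + b * db * S1) @ np.linalg.inv(a**2 * S0 + b**2 * S1)
    T, h = np.eye(S0.shape[0]), 1.0 / steps
    for k in range(steps):
        t = k * h
        k1 = A(t) @ T
        k2 = A(t + h / 2) @ (T + h / 2 * k1)
        k3 = A(t + h / 2) @ (T + h / 2 * k2)
        k4 = A(t + h) @ (T + h * k3)
        T = T + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return T

def linear(t):
    return 1 - t, -1.0, t, 1.0

def trig(t):
    c, s = np.cos(np.pi * t / 2), np.sin(np.pi * t / 2)
    return c, -np.pi / 2 * s, s, np.pi / 2 * c

S0 = np.diag([1.0, 4.0])
S1 = np.array([[2.0, 1.0], [1.0, 3.0]])
R = sqrtm_spd(S0)
Rinv = np.linalg.inv(R)
closed_form = R @ sqrtm_spd(Rinv @ S1 @ Rinv) @ Rinv      # formula (41)
print(terminal_map(S0, S1, linear), terminal_map(S0, S1, trig), closed_form, sep="\n")
```

Both schedules reach the same non-symmetric terminal map, so the rotational part is a property of the coupling and the covariances, not of the speed of the bridge.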


One-Step Generative Models

One-step generative models try to keep the geometric training principle of flows while removing the expensive multi-step integration at sampling time. The idea is to evolve the model distribution during training, but to store the final evolution in a single generator evaluation.

One-Step Models via Parameter-Domain Discrepancy Flows

The most direct one-step strategy is to train the generator parameters themselves by descending a discrepancy between generated and data distributions. Let $\zeta$ be a simple latent distribution, let $f_\theta:\mathcal Z\to\X$ be a neural generator, and let $\beta$ be the data distribution. The objective is to find $\theta$ such that

(f_\theta)_\sharp\zeta=\beta,

(54)

or, in practice, such that the generated law is close to $\beta$ . Given a discrepancy $\mathcal D$ on probability measures, define

\mathcal E_\beta(\gamma)\eqdef\mathcal D(\gamma,\beta), \qquad H_\beta(\theta) \eqdef \mathcal E_\beta\big((f_\theta)_\sharp\zeta\big) = \mathcal D\big((f_\theta)_\sharp\zeta,\beta\big).

(55)

This viewpoint includes Wasserstein GANs, MMD-GANs and Sinkhorn generative models Arjovsky et al., 2017; Dziugaite et al., 2015; Genevay et al., 2018. When $\mathcal D$ is written through a dual potential or discriminator, it connects directly with the adversarial formulation in “GANs via Duality”. The resulting training dynamics is the ordinary Euclidean gradient flow in parameter space,

\dot\theta_t=-\nabla_\theta H_\beta(\theta_t).

(56)

This parameter flow induces a flow of generated measures, but this induced flow is not intrinsic in general. Set

\alpha_t\eqdef(f_{\theta_t})_\sharp\zeta, \qquad X_t=f_{\theta_t}(Z), \qquad Z\sim\zeta .

(57)

If $t\mapsto\theta_t$ is smooth and $f_\theta$ is differentiable with respect to $\theta$ , then

\dot X_t = \partial_\theta f_{\theta_t}(Z)\dot\theta_t .

(58)

For $\alpha_t$ -a.e. $x$ , let $\eta_{t,x}$ be the disintegration of the latent law $\zeta$ with respect to the map $f_{\theta_t}$ . Thus $\eta_{t,x}$ is supported on the fiber $\{z:f_{\theta_t}(z)=x\}$ , and for every bounded measurable $\psi$ ,

\int \psi(z)\,\d\zeta(z) = \int_\X\left(\int \psi(z)\,\d\eta_{t,x}(z)\right)\d\alpha_t(x).

(59)

The Eulerian velocity of the generated measure is therefore the conditional average

v_t(x) = \int \partial_\theta f_{\theta_t}(z)\dot\theta_t\,\d\eta_{t,x}(z) = -\int \partial_\theta f_{\theta_t}(z)\nabla_\theta H_\beta(\theta_t)\,\d\eta_{t,x}(z).

(60)

Indeed, for every smooth test function $\varphi$ ,

\frac{\d}{\d t}\int \varphi(x)\,\d\alpha_t(x) = \int \langle\nabla\varphi(x),v_t(x)\rangle\,\d\alpha_t(x),

(61)

which is the weak form of the continuity equation

\partial_t\alpha_t+\operatorname{div}(\alpha_t v_t)=0.

(62)

The velocity (60) depends on the parameter value $\theta_t$ and on the latent disintegration, not only on the measure $\alpha_t$ . If the parametrization is non-identifiable, two different parameters can generate the same measure while inducing different velocities. Thus (56) is a Euclidean gradient flow on parameter space, and only after push-forward a model-dependent flow on measures. It coincides with an intrinsic Wasserstein gradient flow only in special cases where $v_t$ agrees with the Wasserstein velocity

-\nabla\delta_\gamma\mathcal E_\beta(\gamma)\big|_{\gamma=\alpha_t}.

(63)

More generally, a projected Wasserstein flow on the model manifold would require projecting this intrinsic velocity onto the infinitesimal velocities attainable by variations of $\theta$ , using the $L^2(\alpha_t)$ metric. The ordinary Euclidean parameter gradient flow (56) does not perform this projection in general.

In machine learning, however, $\beta$ is accessed through minibatches. If

\hat\beta_n=\frac{1}{n}\sum_{i=1}^n\delta_{Y_i}, \qquad Y_i\overset{\mathrm{i.i.d.}}{\sim}\beta,

(64)

then the implemented update often replaces $H_\beta$ by the random objective $H_{\hat\beta_n}$ . For nonlinear discrepancies this is not, in general, an unbiased gradient estimator:

\mathbb E\left[\nabla_\theta H_{\hat\beta_n}(\theta)\right] \neq \nabla_\theta H_\beta(\theta).

(65)

The expectation is over the data minibatch. This bias is a finite-sample effect of inserting the empirical measure inside a nonlinear transport or adversarial optimization before differentiating; it is distinct from the optimization noise of stochastic gradient descent. MMD losses admit unbiased $U$ -statistic variants, but OT and entropic OT objectives generally exhibit the bias--variance phenomena discussed in Bias and Variance of OT.
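The bias in (65) can be observed on a one-dimensional scale model. The sketch below is only an illustration, under assumptions not made in the text: the generated law is $\gamma_s=\mathcal N(0,s^2)$ , the data law is $\beta=\mathcal N(0,1)$ , and $\mathcal D=\Wass_2^2$ , so that $H_\beta(s)=(s-1)^2$ has zero gradient at $s=1$ . For an empirical $\hat\beta_n$ , the one-dimensional quantile formula gives $\partial_s H_{\hat\beta_n}(s)=2\big(s-\sum_i c_iY_{(i)}\big)$ , where $Y_{(i)}$ are the sorted samples and $c_i$ is the integral of the standard normal quantile over $((i-1)/n,i/n)$ . The names `minibatch_scale_gradient`, `n` and `trials` are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def minibatch_scale_gradient(s, batch):
    """d/ds of W_2^2(N(0, s^2), empirical measure of `batch`) in one dimension."""
    n = len(batch)
    normal = NormalDist()
    # Bin edges z_i = Phi^{-1}(i/n); the normal pdf vanishes at the outer edges -inf, +inf.
    pdf_edges = np.array(
        [0.0] + [normal.pdf(normal.inv_cdf(i / n)) for i in range(1, n)] + [0.0]
    )
    # c_i = integral of the normal quantile over ((i-1)/n, i/n) = pdf(z_{i-1}) - pdf(z_i).
    c = pdf_edges[:-1] - pdf_edges[1:]
    return 2.0 * (s - float(np.dot(c, np.sort(batch))))


rng = np.random.default_rng(0)
n, trials = 8, 4000  # assumed minibatch size and number of Monte Carlo repetitions
grads = [minibatch_scale_gradient(1.0, rng.standard_normal(n)) for _ in range(trials)]
mean_minibatch_grad = float(np.mean(grads))
population_grad = 2.0 * (1.0 - 1.0)  # d/ds (s - 1)^2 evaluated at s = 1

print(f"population gradient at s=1: {population_grad:.3f}")
print(f"mean minibatch gradient (n={n}): {mean_minibatch_grad:.3f}")
```

In this toy model the averaged minibatch gradient at the population optimum stays strictly positive for small $n$ , so descending the plug-in objective shrinks the model scale below the true value. Increasing `n` should reduce the effect, which is consistent with reading it as a finite-sample bias rather than as optimization noise.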

Example: Application to perturbation-response prediction

In perturbation biology, one observes an unperturbed control law $\al$ and a perturbed law $\be$ , but not paired cells. The goal is to learn how a new control cell would respond. Neural OT parameterizes a Monge map, a conditional Monge map or a stochastic semi-coupling and trains it from unpaired samples so that the pushed or transported control population matches $\be$ . The distinction between a coupling, a deterministic map and an out-of-sample extrapolator is essential here: the learned object should act on cells not seen during training, not merely explain one empirical coupling Bunne et al., 2022; Lübeck et al., 2022; Chen et al., 2024; Klein et al., 2024. Conditional Wasserstein flows in Section Conditional Wasserstein Training of Infinite ResNets give a related geometric language when the conditioning variable is depth, context or treatment.

One-Step Model Using Wasserstein Flow of Discrepancy¶

This construction keeps the discrepancy-minimization viewpoint of the previous paragraph but changes the dynamics. Instead of following the Euclidean gradient of the discrepancy in parameter space, one prescribes the Wasserstein gradient flow of the generated law and then learns a generator update realizing this motion. This distribution-space, natural-gradient-type dynamics is less tied to a particular parametrization and can have better convergence behavior, in particular by reducing parameter-space traps when the discrepancy has a favorable geometry.

Let $\zeta$ be a simple latent distribution and let $\alpha_\theta=(G_\theta)_\sharp\zeta$ be the model distribution. Assume that the target data distribution is $\beta$ . The Wasserstein-flow construction chooses a discrepancy

\mathcal E_\beta(\alpha),

(66)

for instance a smoothed $\KL(\alpha|\beta)$ , an MMD/IPM loss, or the debiased Sinkhorn divergence $\bar\MK_c^\epsilon(\alpha,\beta)$ introduced in the Sinkhorn divergence section. The associated formal descent is

\partial_t\alpha_t+\operatorname{div}(\alpha_t w_t)=0, \qquad w_t(x)=-\nabla\delta_\alpha \mathcal E_\beta(\alpha_t)(x).

(67)

Instead of integrating (67) at inference time, one fits, at each training time $t$ , a parametric residual field $U_{\eta_t}$ along the current model distribution:

\min_\eta \int \norm{U_\eta(x)-w_t(x)}^2 \,\d\alpha_t(x).

(68)

In a particle or generator implementation, the learned residual is then used to update the current generator by

\alpha_{\theta}^{+} = (\Id+\tau U_{\eta_t})_\sharp \alpha_\theta, \qquad\text{or equivalently}\qquad G_\theta^{+}(z)=G_\theta(z)+\tau U_{\eta_t}(G_\theta(z)).

(69)

This is an ideal function-space update. A genuine one-step implementation fits the updated outputs back into a fixed generator architecture, or distills the accumulated transport into one network. After many training updates, that fixed-architecture or distilled generator is evaluated once at test time. This is the organizing principle behind recent one-step methods based on Wasserstein gradient flows: W-Flow uses such a construction with the Sinkhorn divergence as a tractable global discrepancy Han et al., 2026, while drifting methods evolve the generated distribution during training through a fitted vector field and also admit one-step inference Deng et al., 2026. The gradient-flow interpretation of drifting models, and its relation to KL, MMD, sliced-Wasserstein and Sinkhorn-type discrepancies, is analyzed in Gretton et al., 2026; He et al., 2026. These ideas are also connected to the Sinkhorn-type normalization dynamics used to model attention in Sinkformers Sander et al., 2022.

Sliced-Wasserstein Flow¶

A particularly transparent instance uses the sliced objective introduced for imaging and barycenters by Rabin, Peyré, Delon and Bernot Rabin et al., 2011. Take

\mathcal E_\beta(\alpha)=\frac12\SW_2^2(\alpha,\beta),

(70)

with $\SW_2$ defined in Sliced Wasserstein Distances. With the continuity-equation convention $\partial_t\alpha_t+\operatorname{div}(\alpha_t w_t)=0$ , the descent velocity points from the current projected law toward the target projected law. More precisely, assume that $\alpha$ has a density and sufficient moments so that the projected monotone maps are uniquely defined for almost every direction. If $T_\theta$ denotes the one-dimensional monotone transport from $(P_\theta)_\sharp\alpha$ to $(P_\theta)_\sharp\beta$ , then the formal $\Wass_2$ -gradient-flow velocity is

w_{\SW}[\alpha,\beta](x) = \int_{\Sphere^{d-1}} \big(T_\theta(P_\theta x)-P_\theta x\big)\,\theta\,\d\sigma(\theta).

(71)

The sign follows from the one-dimensional identity: the first variation of $\frac12\Wass_2^2(\rho,\nu)$ has spatial derivative $s-T(s)$ , so the descent velocity is $T(s)-s$ , and composing with $P_\theta$ lifts it in the direction $\theta$ .

Thus empirical implementations only require one-dimensional sorting along sampled directions, followed by averaging the lifted projected displacements. This is the sliced-Wasserstein flow studied by Cozzi and Santambrogio Cozzi & Santambrogio, 2024, who prove long-time convergence under their hypotheses when the target is Gaussian and also show that the limiting characteristic map is not, in general, the optimal transport map. When both the evolving law and the target are Gaussian, the averaged velocity is affine and the flow closes on means and covariances; this finite-dimensional closure is revisited in Flows over the Gaussian Manifold, where the sliced objective appears in Proposition: Gaussian closure catalogue.
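The sorting-and-averaging recipe can be sketched for two empirical measures with the same number of points. This is a minimal illustration, not the implementation behind the figures: the function name `sliced_velocity`, the number of directions `n_dirs`, the step `tau` and the Gaussian test clouds are all assumptions. The integral over the sphere in (71) is replaced by a Monte Carlo average over random unit directions.

```python
import numpy as np


def sliced_velocity(x, y, n_dirs=64, rng=None):
    """Monte Carlo estimate of (71) for two clouds x, y of shape (n, d) with equal n."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = x.shape
    dirs = rng.standard_normal((n_dirs, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # uniform directions on the sphere
    velocity = np.zeros_like(x)
    for theta in dirs:
        px, py = x @ theta, y @ theta
        target = np.empty(n)
        # Monotone 1D transport: the k-th smallest projection of x goes to the k-th smallest of y.
        target[np.argsort(px)] = np.sort(py)
        velocity += np.outer(target - px, theta)  # lift the 1D displacement along theta
    return velocity / n_dirs


rng = np.random.default_rng(0)
x = rng.standard_normal((200, 2))
y = rng.standard_normal((200, 2)) + np.array([3.0, 0.0])
tau = 1.0  # assumed explicit Euler step
for _ in range(20):
    x = x + tau * sliced_velocity(x, y, rng=rng)
mean_gap = float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))
print(f"distance between means after the flow: {mean_gap:.2e}")
```

Because the average of $\theta\theta^\top$ over the sphere is $\Id/d$ , the sliced velocity is roughly $1/d$ times a full displacement, which is why a step of order one remains stable in this low-dimensional sketch.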

The difference between the full $\Wass_2$ descent and its sliced analogue is already visible on a simple shape example. The figure below compares the exact empirical flow of $\frac12\Wass_2^2(\cdot,\beta)$ , where one fixed optimal assignment gives straight relaxation curves, with the $\Wass_2$ -gradient flow of $\frac12\SW_2^2(\cdot,\beta)$ , where the velocity is recomputed from projected monotone rearrangements.

Full Wasserstein and sliced-Wasserstein flows from a cat-shaped empirical law to a heart-shaped target. The left panel in each row shows representative particle trajectories colored from red to blue. The five panels on the right render kernel-density estimates of all particles at common normalized times. The $\Wass_2$ flow follows the fixed optimal assignment and therefore straight relaxation rays, whereas the sliced flow averages one-dimensional sorted rearrangements over projection directions, producing curved trajectories and a different transient density evolution.

Interactive panel. Change the slicing angle and samples to compare a global quadratic assignment flow with the coordinatewise flow induced by one projected sliced direction.

Stein Variational Gradient Descent¶

Stein variational gradient descent (SVGD) is another deterministic particle flow that fits naturally in this one-step viewpoint Liu & Wang, 2016. Its original motivation is Bayesian sampling: given a target probability $\beta=\rho_\beta\,\d x$ known through its score $\nabla\log\rho_\beta=-\nabla V$ , but not necessarily through its normalizing constant, drive a particle cloud toward $\beta$ without estimating the score of the current empirical law. Geometrically, this replaces the Wasserstein gradient flow of $\KL(\alpha|\beta)$ , whose tangent norm is $L^2(\alpha)$ , by the kernelized Benamou--Brenier geometry of Kernelized Benamou--Brenier Distances, whose velocities live in a vector-valued RKHS.

For the formal density-level calculation, assume $\alpha=\rho_\alpha\,\d x$ and $\beta=\rho_\beta\,\d x$ have smooth positive densities, and let $v$ be a smooth compactly supported vector field. For the perturbation $\alpha_\epsilon=(\Id+\epsilon v)_\sharp\alpha$ , integration by parts gives

\left.\frac{\d}{\d\epsilon}\KL(\alpha_\epsilon|\beta)\right|_{\epsilon=0} = -\int \big(\dotp{\nabla\log\rho_\beta(x)}{v(x)}+\operatorname{div} v(x)\big) \d\alpha(x).

(72)

The bracket is the Langevin--Stein operator applied to $v$ and averaged under $\alpha$ ; because it only evaluates $v$ , $\operatorname{div}v$ , and the target score at sample locations, it remains meaningful when $\alpha$ is empirical. Optimizing this linear functional over the unit ball of $\RKHS_k^d$ and using the reproducing property gives the RKHS steepest-descent direction,

v_{\alpha}^{\mathrm{SVGD}}(x) = \int \Big(k(y,x)\nabla\log\rho_\beta(y)+\nabla_y k(y,x)\Big) \d\alpha(y).

(73)

The associated mean-field equation is

\partial_t\alpha_t+\operatorname{div}\bigl(\alpha_t v_{\alpha_t}^{\mathrm{SVGD}}\bigr)=0.

(74)

For particles $\alpha^\ell_n=n^{-1}\sum_i\delta_{X_i^\ell}$ , this becomes

X_i^{\ell+1} = X_i^\ell + \tau\,\frac{1}{n}\sum_{j=1}^n \Big(k(X_j^\ell,X_i^\ell)\nabla\log\rho_\beta(X_j^\ell) + \nabla_{X_j}k(X_j^\ell,X_i^\ell)\Big).

(75)

The first term attracts particles toward increasing target log-density; the second term is a kernel repulsion that prevents immediate collapse. The gradient-flow interpretation and its many-particle limits are studied in Liu, 2017; Duncan et al., 2023; Nüsken & Renger, 2023. In generative modeling terms, SVGD transports a simple empirical latent law by repeated smooth particle updates. It is therefore close in spirit to the drifting fields above, but its velocity is not learned by regression: it is the closed-form RKHS steepest-descent direction of the KL functional.
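The particle update (75) takes only a few lines for an RBF kernel. The sketch below is a hedged illustration: the kernel $k(a,b)=\exp(-\norm{a-b}^2/(2h))$ with fixed bandwidth `h`, the step `tau`, the one-dimensional Gaussian target and the function names are assumptions chosen for the example, not the settings used in the figures.

```python
import numpy as np


def svgd_step(x, score, tau=0.1, h=1.0):
    """One update (75) for particles x of shape (n, d), RBF kernel exp(-|a-b|^2 / (2h))."""
    diff = x[:, None, :] - x[None, :, :]                # diff[j, i] = x_j - x_i
    k = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h))   # k[j, i] = k(x_j, x_i)
    attraction = k.T @ score(x)                         # sum_j k(x_j, x_i) score(x_j)
    # grad_{x_j} k(x_j, x_i) = -(x_j - x_i) k(x_j, x_i) / h, which pushes x_i away from x_j.
    repulsion = -np.einsum("ji,jid->id", k, diff) / h
    return x + tau * (attraction + repulsion) / x.shape[0]


mu = np.array([2.0])  # assumed target N(mu, 1)


def target_score(x):
    """Score of the assumed Gaussian target; no normalizing constant is needed."""
    return -(x - mu)


rng = np.random.default_rng(0)
particles = 0.1 * rng.standard_normal((50, 1))  # concentrated start near the origin
for _ in range(500):
    particles = svgd_step(particles, target_score)
final_mean, final_std = float(particles.mean()), float(particles.std())
print(f"particle mean {final_mean:.2f}, particle std {final_std:.2f}")
```

Only the target score enters the update, and the score of the current empirical law is never estimated. Removing the `repulsion` term would leave pure attraction, for which the particles are expected to concentrate near the mode.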

The figure below contrasts this RKHS flow with a particle closure of the Wasserstein gradient flow of $\KL(\alpha\mid\beta)$ . The latter evaluates the current score $\nabla\log\rho_\alpha$ by a Gaussian KDE, while SVGD avoids this density estimate and uses the target score together with a kernel repulsion.

Particle trajectories for two deterministic descents of relative entropy toward the same three-Gaussian target, whose density is shown by gray contours. The left panel approximates the $\Wass_2$ gradient flow of $\KL(\alpha\mid\beta)$ by the KDE velocity $\nabla\log\rho_\beta-\nabla\log\rho_{\alpha,h}$ . The right panel uses the RBF-kernel SVGD velocity (73), where target-score attraction is coupled with RKHS repulsion. Both panels use the same initial particles and color trajectory time from red to blue.

Interactive panel. Change the number of particles and integration time to compare a KDE Wasserstein closure with an RKHS Stein-type particle flow.

Self-corrected drifting fields.¶

Drifting methods need not start from an exact Wasserstein gradient. They often prescribe an attraction-minus-repulsion field and then regress this field in $L^2(\alpha_t)$ . A simple continuous version uses a strictly positive kernel $K_\epsilon(x,y)$ for which the following integrals are finite, and defines, for any probability measure $\nu$ ,

B_\epsilon[\nu](x) \eqdef \frac{\int (y-x)K_\epsilon(x,y)\,\d\nu(y)} {\int K_\epsilon(x,y)\,\d\nu(y)}.

(76)

For the Gaussian kernel $K_\epsilon(x,y)=\exp(-\norm{x-y}^2/(2\epsilon))$ , this normalized field is a score of a smoothed density:

B_\epsilon[\nu](x) = \epsilon\nabla\log\!\left(\int K_\epsilon(x,y)\,\d\nu(y)\right).

(77)

The drifting velocity is then

u_t(x)=B_\epsilon[\beta](x)-B_\epsilon[\alpha_t](x) = \epsilon\nabla\log \frac{\int K_\epsilon(x,y)\,\d\beta(y)} {\int K_\epsilon(x,y)\,\d\alpha_t(y)}.

(78)

The first term pulls samples toward data, while the second term corrects self-attraction and prevents all particles from collapsing onto the same high-density region. For a fixed reference measure, $B_\epsilon[\nu]$ is precisely the Gaussian mean-shift displacement in (112): it moves $x$ toward the local kernel barycenter of $\nu$ . Hence self-corrected drifting can be read as the difference between a target mean-shift field and the current model’s own mean-shift field. Sinkhorn drifting replaces these one-sided kernel normalizations by two-sided entropic OT couplings, so that the cross and self terms are normalized by Sinkhorn scaling rather than by a single denominator He et al., 2026.
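For empirical measures, the normalized field (76) is a softmax-weighted barycenter minus the query point, and the drifting velocity (78) is a difference of two such fields. The following sketch assumes the Gaussian kernel and uniform weights on the atoms; the function names `kernel_barycentric_shift` and `drifting_velocity` are hypothetical.

```python
import numpy as np


def kernel_barycentric_shift(x, atoms, eps):
    """B_eps[nu](x) from (76) at the rows of x, for nu uniform on the rows of `atoms`."""
    sq_dist = np.sum((x[:, None, :] - atoms[None, :, :]) ** 2, axis=-1)
    log_w = -sq_dist / (2.0 * eps)
    log_w -= log_w.max(axis=1, keepdims=True)   # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)           # normalisation by the denominator of (76)
    return w @ atoms - x


def drifting_velocity(x, data, eps):
    """u(x) = B_eps[beta](x) - B_eps[alpha](x), with alpha the empirical law of x itself."""
    return kernel_barycentric_shift(x, data, eps) - kernel_barycentric_shift(x, x, eps)


rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.3, (100, 2)), rng.normal(2.0, 0.3, (100, 2))])

# When the model cloud coincides with the data cloud, attraction and self-correction cancel.
v_at_data = drifting_velocity(data, data, eps=1.0)

# A lone particle has no self term, so it moves straight to the kernel barycenter of the data.
v_single = drifting_velocity(np.array([[0.0, 0.0]]), np.array([[1.0, 0.0], [1.0, 0.0]]), eps=1.0)
print(np.abs(v_at_data).max(), v_single)
```

The first check reflects that $u_t$ vanishes when $\alpha_t=\beta$ , which is the fixed-point property one expects from an attraction-minus-repulsion field.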

The figure below illustrates why the self term matters: it corrects the collapse created by target attraction alone and improves coverage of separated modes.

Drifting trajectories for a small particle generator. The raw kernel drift has weak long-range attraction and can leave particles away from the data modes. The self-corrected field uses the difference $B_\epsilon[\beta]-B_\epsilon[\alpha_t]$ , so a longer integration brings particles to the blue modes while repelling them from their own current concentration.

Interactive panel. Use the drift and time controls to inspect a learned-looking velocity field and its induced particle trajectories.

Proposition: Instantaneous gradient representation of drifting

Let $\al_t=\rho_t\,\d x$ be a smooth curve of probability measures with positive densities, and let $u_t=\nabla\phi_t$ be a smooth time-dependent gradient field. Define the semi-relaxed functional

\mathcal R_t(\alpha|\al_t) \eqdef -\int \phi_t(x)\,\d\alpha(x) +\int \phi_t(x)\,\d\al_t(x).

(79)

Here $\al_t$ and $\phi_t$ are frozen when taking the first variation with respect to the first argument $\alpha$ . Then the continuity equation

\partial_t\al_t+\diverg(\al_t u_t)=0

(80)

is the formal Wasserstein gradient descent of the frozen time-dependent functional $\alpha\mapsto\mathcal R_t(\alpha|\al_t)$ .

Example: Kernel drifting as a frozen surrogate

For the Gaussian-kernel drift (78), set

\phi_t(x)= \epsilon\log \frac{\int K_\epsilon(x,y)\,\d\beta(y)} {\int K_\epsilon(x,y)\,\d\al_t(y)}.

(83)

Then $u_t=\nabla\phi_t$ , so the proposition on the instantaneous gradient representation of drifting shows that kernel drifting is the Wasserstein gradient descent of

\mathcal R_t^{\mathrm{drift}}(\alpha|\al_t) = \epsilon \int \log \frac{\int K_\epsilon(x,y)\,\d\al_t(y)} {\int K_\epsilon(x,y)\,\d\beta(y)} \,\d\alpha(x) +\mathrm{constant}.

(84)

The surrogate vanishes at $\alpha=\al_t$ , but it need not be nonnegative and is therefore not a divergence. It is “semi-relaxed” because the current model $\al_t$ is used to build the potential, but it is not varied inside the denominator when computing the first variation in $\alpha$ .

Remark: General fields and projection onto gradients

A general regressed field $b_t$ is not necessarily the minimal Wasserstein tangent representative. Such representatives belong to the $L^2(\al_t)$ closure of gradient fields, and fields producing the same continuity-equation variation can differ by a weighted divergence-free component. The gradient component is obtained by the weighted projection

\nabla\phi_t = \uargmin{\nabla\phi} \int \norm{\nabla\phi(x)-b_t(x)}^2\,\d\al_t(x).

(85)

One may first normalize $b_t$ pointwise, for instance by $b_t/(\norm{b_t}+\eta)$ , or globally by $\norm{b_t}_{L^2(\al_t)}$ , before this projection. The proposition on the instantaneous gradient representation of drifting then applies to the projected field. Non-gradient components can still be useful in a parametric model, but they are not descent directions of a scalar functional for the $\Wass_2$ Riemannian metric.

Moment Measures¶

Moment measures give another way to make a whole distribution from one convex potential. Instead of first fixing a simple source law and then learning a transport map, one asks for a convex function whose own log-concave density is pushed forward by its gradient. This couples sampling and mapping in a rigid way: the same potential defines both the source density and the Brenier map. The reward is a hidden convex structure: after a suitable optimal-transport reformulation, a nonlinear equation on convex functions becomes a convex minimization problem for probability densities. This is one of the cleanest places where optimal transport, Prékopa-type inequalities and convex geometry meet.

The normalization removes additive constants in $u$ . Translations of the argument are another invariance: if $u_a(x)=u(x-a)$ , then $\eta_{u_a}$ is the translate of $\eta_u$ , while $\nabla u_a(x)=\nabla u(x-a)$ , hence $\mathfrak M(u_a)=\mathfrak M(u)$ . A first obstruction is immediate. Formally, if $u$ is smooth and $e^{-u}$ decays fast enough for the boundary term to vanish, then

\int y\,\d\mathfrak M(u)(y) = Z_u^{-1}\int \nabla u(x)e^{-u(x)}\,\d x = -Z_u^{-1}\int \nabla(e^{-u(x)})\,\d x =0.

(89)

Thus moment measures are necessarily centered. The nonsmooth theory uses essentially continuous convex functions: lower-semicontinuous convex functions whose set of discontinuity points has zero $\mathcal H^{d-1}$ measure. Since a convex function is continuous in the interior of its effective domain, this condition controls only its boundary behavior.
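The centering identity (89) is easy to check by quadrature in one dimension. The sketch below uses an assumed, deliberately asymmetric convex potential $u(x)=x^4/4+x$ and a plain Riemann sum on a truncated grid; both the potential and the grid are illustrative choices.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 200001)   # truncation: exp(-u) is negligible at the endpoints
dx = x[1] - x[0]

u = 0.25 * x**4 + x                  # assumed convex potential, not symmetric
du = x**3 + 1.0                      # its derivative u'

density = np.exp(-u)
density /= density.sum() * dx        # log-concave source eta_u on the grid

source_mean = float(np.sum(x * density) * dx)    # barycenter of eta_u
barycenter = float(np.sum(du * density) * dx)    # barycenter of the moment measure (u')_# eta_u

print(f"mean of the source eta_u:      {source_mean:.4f}")
print(f"mean of the moment measure:    {barycenter:.2e}")
```

The source $\eta_u$ is visibly off-center for this potential, while its push-forward by $u'$ has zero mean up to quadrature error, as (89) predicts.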

The figure below shows the forward construction in one dimension. The map $u'$ is implicit in the push-forward, but the display focuses on the two visible measures: the log-concave source $\eta_u=Z_u^{-1}e^{-u}\d x$ and the resulting moment measure $\mathfrak M(u)$ .

Forward moment-measure construction in one dimension. Each column shows a convex potential $u$ chosen so that the moment measure has a prescribed shape: a skewed unimodal density, two bumps, and three bumps with different widths and heights. The top row overlays $u$ (gray, vertically rescaled) with the density of the log-concave source $\eta_u=Z_u^{-1}e^{-u}\d x$ (red), while the bottom row shows $\mathfrak M(u)=(u')_{\#}\eta_u$ (blue); the dashed vertical line marks the zero barycenter.

Interactive panel. Change the convex potential coefficients and watch the same $u$ define both the log-concave source measure $\eta_u$ and the moment measure obtained by pushing it through the monotone map $u'$ .

This theorem is due to Cordero--Erausquin and Klartag Cordero-Erausquin & Klartag, 2015. It is a functional analogue of a Minkowski-type problem: the target measure prescribes how the gradient image of a log-concave density should be distributed. The hyperplane condition is the natural non-degeneracy assumption; otherwise the prescribed gradient image lives in a lower-dimensional subspace and no coercive full-dimensional convex potential can be recovered.

Optimal-transport variational formulation.¶

Santambrogio Santambrogio, 2016 reformulates the moment-measure problem as a minimization over absolutely continuous probability measures. For a centered $\al\in\Pp_1(\RR^d)$ , define the maximal-correlation transport functional, with values in $\RR\cup\{+\infty\}$ ,

\mathcal C_\al(\eta) \eqdef \sup_{\pi\in\Couplings(\eta,\al)} \int_{\RR^d\times\RR^d} x\cdot y\,\d\pi(x,y).

(93)

By Kantorovich duality for the scalar-product cost,

\mathcal C_\al(\eta) = \inf_{v\ \mathrm{convex}} \left\{ \int v(x)\,\d\eta(x)+\int v^*(y)\,\d\al(y) \right\},

(94)

where the infimum is over convex functions for which both integrals are well defined and $v^*$ is the Legendre transform. If $\eta,\al\in\Pp_2(\RR^d)$ , then

\mathcal C_\al(\eta) = \frac12\int \norm{x}^2\,\d\eta(x) + \frac12\int \norm{y}^2\,\d\al(y) - \frac12\Wass_2^2(\eta,\al).

(95)
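Identity (95) can be verified by brute force on small uniform empirical measures in one dimension. The sketch relies on the standard fact that, for two uniform empirical measures with the same number of atoms, the extremal couplings can be taken to be permutations; the sample size and the random data are assumptions made for the example.

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(6)            # atoms of eta
y = rng.standard_normal(6) + 0.5      # atoms of alpha

perms = [np.array(p) for p in itertools.permutations(range(6))]

# Maximal correlation (93) and squared Wasserstein distance, both over permutation couplings.
max_corr = max(float(np.mean(x * y[p])) for p in perms)
w2_sq = min(float(np.mean((x - y[p]) ** 2)) for p in perms)

rhs = 0.5 * float(np.mean(x**2)) + 0.5 * float(np.mean(y**2)) - 0.5 * w2_sq

# In one dimension the optimal coupling is the monotone (sorted) matching.
sorted_corr = float(np.mean(np.sort(x) * np.sort(y)))

print(max_corr, rhs, sorted_corr)
```

The same permutation maximizes the correlation and minimizes the quadratic cost, which is the discrete counterpart of the fact that both problems share the Brenier-type optimal coupling.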

The variational problem attached to a centered target $\al$ is

\min_{\eta\in\Pp_1(\RR^d)} \left\{ \mathcal H(\eta)+\mathcal C_\al(\eta) \right\}, \qquad \mathcal H(r\,\d x)\eqdef \int r(x)\log r(x)\,\d x,

(96)

with $\mathcal H(\eta)=+\infty$ when $\eta$ is not absolutely continuous. The centering of $\al$ makes this functional invariant under translations of $\eta$ , since translating $\eta$ by a vector $a$ changes $\int x\cdot y\,\d\pi$ by $a\cdot\int y\,\d\al(y)=0$ .

Proposition: Variational characterization of moment measures

Let $\al\in\Pp_1(\RR^d)$ be centered and not supported on a hyperplane. Then the minimization problem (96) admits a solution, unique up to translations. Every minimizer is a probability measure with log-concave density of the form

\eta=Z_u^{-1}e^{-u}\,\d x,

(97)

where $u$ is convex, essentially continuous, and satisfies the moment-measure equation

\al=(\nabla u)_\sharp \eta.

(98)

Conversely, any essentially continuous convex $u$ satisfying this equation yields a global minimizer. If $\al\in\Pp_2(\RR^d)$ , the objective $\eta\mapsto\mathcal H(\eta)+\mathcal C_\al(\eta)$ is displacement convex along $\Wass_2$ geodesics of absolutely continuous measures in $\Pp_2(\RR^d)$ whenever the terms are finite. When merely $\al\in\Pp_1(\RR^d)$ , the maximal-correlation term remains convex along such finite-second-moment geodesics by approximation; existence and uniqueness on the full $\Pp_1$ domain follow from the lower-semicontinuity argument of Santambrogio.

The existence proof has two ingredients. First, since $\al$ is centered, translating $\eta$ does not change either $\mathcal H(\eta)$ or $\mathcal C_\al(\eta)$ , so one can center a minimizing sequence. Second, the assumption that $\al$ is not supported on a hyperplane gives a coercive lower bound of the form $\mathcal C_\al(\eta)\geq c_\al\int\norm{x}\,\d\eta(x)$ for centered absolutely continuous $\eta$ , with $c_\al>0$ . Together with the lower-semicontinuity estimates for entropy and maximal correlation, this yields a minimizer Santambrogio, 2016.

Let $\eta=r\,\d x$ be a minimizer and let $u$ be a convex optimizer in the dual formula (94). Keeping $u$ fixed and varying $\eta$ in (96) gives the Euler equation

\log r(x)+1+u(x)=\mathrm{constant} \qquad\text{on }\{r>0\},

(99)

so $\eta=Z_u^{-1}e^{-u}\d x$ . The optimality condition for the scalar-product transport problem says that an optimal coupling is supported on $\{(x,y):y\in\partial u(x)\}$ . Since $\eta$ is absolutely continuous and $u$ is convex, $\partial u(x)=\{\nabla u(x)\}$ for $\eta$ -almost every $x$ , hence $\al=(\nabla u)_\sharp\eta$ .

Conversely, assume $\eta=Z_u^{-1}e^{-u}\d x$ and $\al=(\nabla u)_\sharp\eta$ . Let $\nu$ be a smooth compactly supported competitor, and let $T$ be the Brenier map from $\eta$ to $\nu$ . Along the geodesic $\eta_t=((1-t)\Id+tT)_\sharp\eta$ , the right derivative of the entropy at $t=0$ is

\frac{\d}{\d t}\mathcal H(\eta_t)\Big|_{t=0^+} = -\int \dotp{T(x)-x}{\nabla u(x)}\,\d\eta(x),

(100)

where the identity follows by differentiating the Jacobian formula and integrating by parts. The dual optimizer $u$ gives the upper directional bound

\frac{\d}{\d t}\mathcal C_\al(\eta_t)\Big|_{t=0^+} \leq \int \dotp{T(x)-x}{\nabla u(x)}\,\d\eta(x).

(101)

In the other direction, Santambrogio’s derivative estimate gives

\frac{\d}{\d t}\mathcal C_\al(\eta_t)\Big|_{t=0^+} \geq \int \dotp{T(x)-x}{\nabla u(x)}\,\d\eta(x),

(102)

because $(\Id,\nabla u)_\sharp\eta$ is optimal for the scalar-product problem. The two bounds coincide, so the first-order terms cancel. Hence the one-sided derivative of $\mathcal H+\mathcal C_\al$ at $\eta$ in every such direction is zero. Displacement convexity implies global minimality, and approximation removes the smooth compact-support restriction. Strict displacement convexity of entropy gives uniqueness, except for translations; translations do not change $\mathcal C_\al$ because $\al$ is centered.

It remains to justify the convexity assertion. The entropy term is displacement convex by McCann’s theorem, recalled in Theorem: McCann Displacement Convexity for Internal Energies. If $\al$ has finite second moment, identity (95) writes $\mathcal C_\al$ , up to the constant $\frac12\int\norm{y}^2\,\d\al(y)$ , as the sum of the 1-convex moment term $\eta\mapsto\frac12\int\norm{x}^2\,\d\eta$ and the $(-1)$ -convex term $\eta\mapsto-\frac12\Wass_2^2(\eta,\al)$ , hence $\mathcal C_\al$ is displacement convex. For a target with only a finite first moment, Santambrogio obtains the same convexity along $\Pp_2$ geodesics by approximation and proves the full variational characterization by lower semicontinuity.

Remark: Where the convexity is hidden

If one eliminates $\eta$ first, the problem becomes the convex-potential functional

u\mapsto \int u^*(y)\,\d\al(y) - \log\!\left(\int e^{-u(x)}\,\d x\right).

(103)

This is a Toland-type duality: the functional is not visibly convex as a function of $u$ , because the first term is convex while the logarithmic partition term is concave in this parametrization. Cordero--Erausquin and Klartag make the hidden convexity visible by changing variables to the dual potential $\varphi=u^*$ . Since $u=\varphi^*$ for closed convex potentials, the same functional becomes

\varphi\mapsto \int \varphi(y)\,\d\al(y) - \log\!\left(\int e^{-\varphi^*(x)}\,\d x\right).

(104)

The first term is now affine in $\varphi$ . The core Prékopa--Leindler input is that $\varphi\mapsto \log\int e^{-\varphi^*}$ is concave along convex combinations of convex functions; equivalently, the negative log-partition in the display is convex. Santambrogio’s formulation reveals the same mechanism in transport language: the difficult convexity becomes the displacement convexity of an entropy-plus-maximal-correlation functional in the measure variable $\eta$ .

Conjugate moment measures for generation.¶

The moment-measure factorization suggests a generative recipe: sample $X$ from the log-concave law $Z_u^{-1}e^{-u}$ and output $\nabla u(X)$ . This ties sampling and mapping through the same convex potential. Vesseron, Béthune and Cuturi Vesseron et al., 2025 argue that this direct factorization can be poorly adapted to practical generative modeling, and propose instead the conjugate factorization

\beta = (\nabla w^*)_\sharp \left(Z_w^{-1}e^{-w(z)}\,\d z\right).

(105)

Here $\nabla w^*$ is the Brenier map from the learned log-concave source to the target distribution $\beta$ . This keeps the single-convex-potential philosophy, but places the transport map on the conjugate side; it can be parameterized by input-convex neural networks and trained using OT solvers. From the viewpoint of this chapter, moment measures are therefore a rigorous convex-analytic prototype for one-step generators based on gradients of convex potentials.

Evolution in Depth of Transformers¶

Deep residual architectures can be read as time discretizations of ODEs or PDEs. For transformers, the transported objects are token measures and the velocity is induced by attention.

Transformers were introduced as sequence-to-sequence architectures driven by self-attention Vaswani et al., 2017 and have since become a central architecture for language and vision models Brown et al., 2020; Dosovitskiy et al., 2021. Their distinctive feature is that each token is updated by a data-dependent average of all other tokens. This makes an attention layer permutation-equivariant before positional encoding, context dependent after conditioning on the input sequence, and naturally compatible with a measure viewpoint in which a prompt is regarded as an empirical distribution of tokens.

The mathematical limit used below concerns depth rather than model scale: one lets the number of residual attention layers grow while each layer makes a small update, as in continuous-depth neural networks Chen et al., 2018. For attention, the resulting velocity is nonlinear in the current token law because it is normalized by the whole context. This measure-theoretic view appears in the analysis of attention as a Lipschitz or interacting-particle operator Vuckovic et al., 2020; Geshkovski et al., 2025, in the Sinkhorn-normalized dynamics of Sinkformers Sander et al., 2022, and in recent well-posedness and mean-field-limit results for several attention mechanisms Castin et al., 2025. It also separates the infinite-depth limit studied here from the token-limit question, where one controls how a finite empirical context approximates its limiting attention operator Bohbot et al., 2025.

We now consider very deep transformers, focusing on a single-head attention mechanism for simplicity while ignoring MLP layers, layer normalization, causality, and masking. This stripped-down framework is best suited to modeling encoders and vision transformers; the references above indicate which parts of this simplified picture extend to richer attention mechanisms.

Attention as a context-dependent velocity.¶

After tokenization, embedding, and positional encoding, each input is represented as a point cloud $(x_i)_{i=1}^n$ of $n$ vectorized tokens. An attention layer with a skip connection and residual scale $1/T$ , where $T$ is the depth, transforms the tokens according to

x_i \mapsto x_i + \frac{1}{T} \sum_j \frac{e^{\langle Q x_i, K x_j \rangle} V x_j}{\sum_{\ell} e^{\langle Q x_i, K x_\ell \rangle}},

(106)

where $\theta=(K,Q,V)$ denotes the three parameter matrices. The conventional factor $1/\sqrt r$ , with $r$ the query/key dimension, can be absorbed into $Q$ or $K$ and is omitted here.

Token measure evolution.¶

To handle an arbitrary number of tokens, define $\alpha = \frac{1}{n} \sum_i \delta_{x_i}$ as the empirical measure of tokens and rewrite the transformer mapping as

x_i \mapsto x_i + \frac{1}{T} \Gamma_\theta[\alpha](x_i),

(107)

where

\Gamma_\theta[\alpha](x) := \frac{\int e^{\langle Q x, K y \rangle} V y \, \d \alpha(y)} {\int e^{\langle Q x, K z \rangle} \, \d \alpha(z)}.

(108)

At the level of the token distribution, the layer uses the context-dependent velocity $\Gamma_{\theta_t}[\alpha]$ and pushes $\alpha$ forward by $\Id+\tau\Gamma_{\theta_t}[\alpha]$ . This map depends on the whole context $\alpha$ and on the depth-dependent parameters $\theta_t$ . Denoting normalized depth by $t\in[0,1]$ and setting $\tau=1/T$ gives

\alpha_{t+\tau} = (\Id + \tau \Gamma_{\theta_t}[\alpha_t])_\sharp \alpha_t.

(109)

As $\tau \to 0$ , this converges formally to the continuity equation

\partial_t \alpha_t + \operatorname{div}(\alpha_t \Gamma_{\theta_t}[\alpha_t]) = 0.

(110)
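The discrete scheme (109) can be sketched directly on a token cloud. The code below is a minimal illustration of the stripped-down model (single head, no MLP, no normalization layers); the dimensions, the random parameter matrices and the function names `attention_velocity` and `deep_attention` are assumptions. It also checks two consequences of the measure viewpoint: permutation equivariance, and invariance of the per-token output when every token is duplicated, since duplication leaves the empirical measure unchanged.

```python
import numpy as np


def attention_velocity(x, Q, K, V):
    """Gamma_theta[alpha] from (108) at the rows of x, alpha = empirical measure of those rows."""
    scores = (x @ Q.T) @ (x @ K.T).T                       # scores[i, j] = <Q x_i, K x_j>
    scores = scores - scores.max(axis=1, keepdims=True)    # stable softmax over j
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ (x @ V.T)                                   # sum_j w[i, j] V x_j


def deep_attention(x, layers):
    """Apply the push-forward (109) once per layer, with residual scale tau = 1 / depth."""
    tau = 1.0 / len(layers)
    for Q, K, V in layers:
        x = x + tau * attention_velocity(x, Q, K, V)
    return x


rng = np.random.default_rng(0)
d, r, depth, n = 4, 3, 8, 10   # token dim, query/key dim, depth T, number of tokens (assumed)
layers = [
    (0.3 * rng.standard_normal((r, d)),
     0.3 * rng.standard_normal((r, d)),
     0.3 * rng.standard_normal((d, d)))
    for _ in range(depth)
]
tokens = rng.standard_normal((n, d))

out = deep_attention(tokens, layers)
perm = rng.permutation(n)
out_perm = deep_attention(tokens[perm], layers)                  # permuted context
out_dup = deep_attention(np.concatenate([tokens, tokens]), layers)  # duplicated context
print(np.abs(out_perm - out[perm]).max(), np.abs(out_dup[:n] - out).max())
```

Because the velocity only depends on the tokens through $\alpha$ , the same function handles contexts of any length without changing the parameters.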

$L^2$ attention and mean shift.¶

A particularly geometric variant replaces the dot-product score $\langle Qx,Ky\rangle$ by a negative squared Euclidean score $s_\epsilon(x,y)=-\norm{x-y}^2/(2\epsilon)$ . Take, for simplicity, the same token space for queries, keys and values, and set

K_\epsilon(x,y)=\exp(-\norm{x-y}^2/(2\epsilon)), \qquad \rho_\epsilon[\alpha](x)=\int K_\epsilon(x,y)\d\alpha(y), \qquad m_\epsilon[\alpha](x) = \frac{\int yK_\epsilon(x,y)\d\alpha(y)} {\rho_\epsilon[\alpha](x)}.

(111)

The map $x\mapsto m_\epsilon[\alpha](x)$ is exactly Gaussian-kernel attention, i.e. normalized kernel regression over tokens; such $L^2$ or Gaussian-kernel scores are used explicitly in transformer variants motivated by Lipschitz control and projection-free attention (Kim et al., 2020; Kundu et al., 2026). Classical mean shift, however, uses the displacement from the current point to this local barycenter. This gives

M_\epsilon[\alpha](x) \eqdef m_\epsilon[\alpha](x)-x = \frac{\int (y-x)K_\epsilon(x,y)\d\alpha(y)} {\rho_\epsilon[\alpha](x)} = \epsilon\nabla\log\rho_\epsilon[\alpha](x)

(112)

and, when $\alpha$ is empirical, $\rho_\epsilon[\alpha]$ is a Gaussian kernel density estimate up to normalization. Thus $M_\epsilon[\alpha]$ is the classical Gaussian mean-shift vector (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002). If the data measure $\alpha$ is frozen, the update $x\leftarrow m_\epsilon[\alpha](x)$ is the usual mode-seeking mean-shift iteration. If instead all support points move and the empirical measure is recomputed after every step, one obtains the self-consistent, or blurring, mean-shift process (Chen, 2015). Its damped update is

x_i^{k+1} = (1-\tau)x_i^k+\tau m_\epsilon[\alpha_k](x_i^k) = x_i^k+\tau M_\epsilon[\alpha_k](x_i^k)

(113)

which is an explicit Euler step of the continuous-time mean-shift equation

\partial_t\alpha_t+\operatorname{div}\bigl(\alpha_tM_\epsilon[\alpha_t]\bigr)=0.

(114)

For $\tau=1$ this is discrete blurring mean shift; for small residual steps it becomes a transport PDE that moves each token uphill along the log of the smoothed token density. This distinction between the raw barycentric attention output $m_\epsilon$ and the velocity $M_\epsilon=m_\epsilon-\Id$ is important: adding $m_\epsilon$ directly as a residual would produce a different drift. The mean-shift form isolates a purely metric attention mechanism from the learned bilinear geometry of $\dotp{Qx}{Ky}$ .
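The identity $M_\epsilon[\alpha]=\epsilon\nabla\log\rho_\epsilon[\alpha]$ of (112) and the damped update (113) can be checked numerically. The following sketch works in dimension one with a uniform empirical measure; the sample points, bandwidth and finite-difference step are arbitrary illustrative choices, and the function names are introduced here.

```python
import math

def kernel_weights(x, points, eps):
    """Gaussian weights K_eps(x, y) for every support point y (1-D tokens)."""
    return [math.exp(-(x - y) ** 2 / (2 * eps)) for y in points]

def density(x, points, eps):
    """rho_eps[alpha](x) for the uniform empirical measure on `points`."""
    return sum(kernel_weights(x, points, eps)) / len(points)

def barycenter(x, points, eps):
    """Raw Gaussian-attention output m_eps[alpha](x)."""
    w = kernel_weights(x, points, eps)
    return sum(wi * y for wi, y in zip(w, points)) / sum(w)

def mean_shift(x, points, eps):
    """Velocity M_eps[alpha](x) = m_eps[alpha](x) - x of (112)."""
    return barycenter(x, points, eps) - x

def blurring_step(points, eps, tau):
    """Damped update (113): all points move, then the measure is recomputed."""
    return [x + tau * mean_shift(x, points, eps) for x in points]

points = [-1.0, -0.8, 0.1, 0.9, 1.3]
eps, x, h = 0.3, 0.4, 1e-5
# Central finite difference of eps * log(rho_eps) at x.
score = eps * (math.log(density(x + h, points, eps))
               - math.log(density(x - h, points, eps))) / (2 * h)
gap = abs(score - mean_shift(x, points, eps))  # finite-difference error only
```

Using `barycenter` as the residual instead of `mean_shift` would add $m_\epsilon$ rather than $m_\epsilon-\Id$, which is the different drift mentioned above.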

Consensus and Markov averaging.

When the averaging measure evolves together with the particles, mean shift becomes a consensus model. The Hegselmann--Krause model updates each opinion by averaging the opinions inside a confidence neighborhood Hegselmann & Krause, 2002; its finite-agent convergence was developed further in Blondel et al., 2009, and a measure-valued Eulerian formulation appears in Canuto et al., 2012. The row-normalized average below is also characteristic of Motsch--Tadmor dynamics Motsch & Tadmor, 2014.

For a positive kernel $K$ , define

M_K[\alpha](x) \eqdef \frac{\int (y-x)K(x,y)\,\d\alpha(y)} {\int K(x,y)\,\d\alpha(y)}.

(115)

For an empirical law $\alpha=\sum_{i=1}^n a_i\delta_{x_i}$ , let $X\in\RR^{n\times d}$ have rows $x_i^\top$ , write $a=(a_i)_i$ , and set

(K_X)_{ij}=K(x_i,x_j), \qquad \mathsf P_X \eqdef \operatorname{diag}(K_Xa)^{-1}K_X\operatorname{diag}(a).

(116)

The positive matrix $\mathsf P_X$ is row-stochastic. Its $i$ th row is the probability law used to average the cloud from $x_i$ . Figure Div visualizes the corresponding loss of memory for a fixed Markov matrix; mean shift applies the same averaging mechanism to every spatial coordinate, but with a matrix that changes with the cloud. The particle system is

\dot x_i = \frac{\sum_j a_jK(x_i,x_j)(x_j-x_i)} {\sum_j a_jK(x_i,x_j)}, \qquad\text{equivalently}\qquad \dot X=(\mathsf P_X-I_n)X.

(117)

Thus blurring mean shift is a state-dependent Markov averaging process.
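As a small sanity check of (116) and (117), the sketch below builds $\mathsf P_X$ for a weighted cloud and compares the matrix form $(\mathsf P_X-I_n)X$ with the particle form of the velocity. The Gaussian kernel, its bandwidth and the four sample points are illustrative assumptions; any positive kernel could be passed instead.

```python
import math

def gaussian(x, y, eps=0.5):
    """Positive kernel K(x, y); the Gaussian choice and eps = 0.5 are illustrative."""
    return math.exp(-sum((u - v) ** 2 for u, v in zip(x, y)) / (2 * eps))

def markov_matrix(X, a, kernel=gaussian):
    """P_X = diag(K_X a)^{-1} K_X diag(a) from (116), built row by row."""
    P = []
    for xi in X:
        row = [aj * kernel(xi, xj) for xj, aj in zip(X, a)]
        norm = sum(row)
        P.append([r / norm for r in row])
    return P

def matrix_velocity(X, P):
    """Rows of (P_X - I_n) X, the matrix form of (117)."""
    n, d = len(X), len(X[0])
    return [[sum(P[i][j] * X[j][k] for j in range(n)) - X[i][k] for k in range(d)]
            for i in range(n)]

def particle_velocity(X, a, kernel=gaussian):
    """Kernel-weighted mean displacement, the particle form of (117)."""
    out = []
    for xi in X:
        w = [aj * kernel(xi, xj) for xj, aj in zip(X, a)]
        out.append([sum(wj * (xj[k] - xi[k]) for wj, xj in zip(w, X)) / sum(w)
                    for k in range(len(xi))])
    return out

X = [[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0], [2.0, 2.0]]
a = [0.1, 0.2, 0.3, 0.4]
P = markov_matrix(X, a)
```

Each row of `P` sums to one, and the two velocity functions agree up to rounding, which is the equivalence stated in (117).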

Dobrushin contraction from Hilbert geometry.

The Sinkhorn analysis already contains the two geometries needed here. Section Sinkhorn Convergence: Monotone Point of View introduced the variation seminorm (Definition: Variation Seminorm) on functions modulo additive constants, while Section Sinkhorn Convergence: Linear Hilbert Metric Rate introduced Hilbert’s metric (Definition: Hilbert Metric) on positive vectors modulo multiplication by positive constants. Logarithmic coordinates identify these quotients exactly:

\Hilbert(e^z,e^{z'})=\norm{z-z'}_V.

(118)

A row-stochastic matrix $\mathsf P$ is order preserving and satisfies $\mathsf P(z+s\mathbf1_n)=\mathsf Pz+s\mathbf1_n$. It is therefore a linear topical map, so Proposition: Topical Maps are Variation-Nonexpansive gives nonexpansiveness on the additive quotient. Strict positivity yields more: Birkhoff contraction can be linearized at the constant ray to obtain a strict variation contraction.

Proposition: Dobrushin contraction is the tangent form of Birkhoff contraction

Let $\mathsf P\in\RR_+^{n\times n}$ be row-stochastic. Its Dobrushin coefficient is

\delta(\mathsf P) \eqdef \frac12\max_{i,\ell}\sum_j|\mathsf P_{ij}-\mathsf P_{\ell j}| = 1-\min_{i,\ell}\sum_j\min\{\mathsf P_{ij},\mathsf P_{\ell j}\}.

(119)

It is the exact operator norm induced by $\norm{\cdot}_V$ on $\RR^n/\operatorname{Span}(\mathbf1_n)$ :

\delta(\mathsf P) = \sup_{z\notin\operatorname{Span}(\mathbf1_n)} \frac{\norm{\mathsf Pz}_V}{\norm z_V}.

(120)

Consequently,

\norm{\mathsf Pz}_V \leq \delta(\mathsf P)\norm z_V \qquad (z\in\RR^n).

(121)

If $\mathsf P>0$, then this exact tangent factor is bounded by the Birkhoff coefficient of the Birkhoff Contraction Theorem:

\delta(\mathsf P)\leq\lambda(\mathsf P)<1.

(122)

By duality, the same coefficient is the exact contraction factor of the adjoint Markov evolution on probability vectors: for $p,p'\in\simplex_n$ ,

\norm{\mathsf P^\top p-\mathsf P^\top p'}_{\ell^1} \leq \delta(\mathsf P)\norm{p-p'}_{\ell^1}.

(125)

Thus (121) contracts observables modulo constants, while the adjoint inequality contracts probability laws in total variation. The proposition also separates two useful constants. The Dobrushin coefficient is sharp but depends on the normalized rows. The Birkhoff coefficient can be looser, but its cross-ratio formula is explicit and invariant under positive diagonal scalings, precisely the invariance used in the Sinkhorn proof of Theorem: Projective Linear Convergence of Sinkhorn.
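The quantities in the proposition are easy to evaluate on a small matrix. The sketch below computes both expressions of the Dobrushin coefficient in (119) and a Birkhoff coefficient written, by assumption, with the matrix analogue of the cross-ratio formula (129); the $3\times3$ matrix and the test vectors are arbitrary illustrative data.

```python
def dobrushin(P):
    """Dobrushin coefficient: half the largest l1 distance between two rows."""
    return 0.5 * max(sum(abs(p - q) for p, q in zip(ri, rl))
                     for ri in P for rl in P)

def dobrushin_overlap(P):
    """Equivalent form: one minus the smallest common mass of two rows."""
    return 1.0 - min(sum(min(p, q) for p, q in zip(ri, rl))
                     for ri in P for rl in P)

def birkhoff(P):
    """(sqrt(eta) - 1) / (sqrt(eta) + 1) with eta the largest cross-ratio; needs P > 0."""
    n, m = len(P), len(P[0])
    eta = max(P[i][j] * P[l][r] / (P[l][j] * P[i][r])
              for i in range(n) for l in range(n)
              for j in range(m) for r in range(m))
    return (eta ** 0.5 - 1) / (eta ** 0.5 + 1)

def variation(z):
    """Variation seminorm max(z) - min(z)."""
    return max(z) - min(z)

def apply(P, z):
    """Action P z on observables."""
    return [sum(p * zj for p, zj in zip(row, z)) for row in P]

def apply_adjoint(P, p):
    """Action P^T p on probability vectors."""
    return [sum(P[i][j] * p[i] for i in range(len(P))) for j in range(len(P[0]))]

P = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]]
z = [1.0, -2.0, 0.5]
p, q = [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]
delta, lam = dobrushin(P), birkhoff(P)
l1_after = sum(abs(u - v) for u, v in zip(apply_adjoint(P, p), apply_adjoint(P, q)))
l1_before = sum(abs(u - v) for u, v in zip(p, q))
```

On this example `delta` is strictly smaller than `lam`, which illustrates that the Birkhoff coefficient is an upper bound rather than the sharp factor.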

For a compact $C\subset\RR^d$ and $\alpha\in\Pp(C)$ , the corresponding Markov operator is

(\mathsf P_\alpha h)(x) \eqdef \int_C h(y)p_{\alpha,x}(y)\,\d\alpha(y), \qquad p_{\alpha,x}(y) \eqdef \frac{K(x,y)}{Z_\alpha(x)}, \qquad Z_\alpha(x) \eqdef \int_C K(x,z)\,\d\alpha(z).

(126)

Its Dobrushin coefficient is

\delta(\mathsf P_\alpha) \eqdef \frac12\sup_{x,x'\in\operatorname{supp}(\alpha)} \int_C|p_{\alpha,x}(y)-p_{\alpha,x'}(y)|\,\d\alpha(y).

(127)

Writing $\norm{h}_{V,S}=\sup_S h-\inf_S h$ , the same common-mass proof gives

\norm{\mathsf P_\alpha h}_{V,\operatorname{supp}(\alpha)} \leq \delta(\mathsf P_\alpha) \norm h_{V,\operatorname{supp}(\alpha)}.

(128)

Define the kernel cross-ratio and its Birkhoff factor by

\eta_K(C) \eqdef \sup_{x,x',y,y'\in C} \frac{K(x,y)K(x',y')}{K(x',y)K(x,y')}, \qquad \lambda_K(C) \eqdef \frac{\sqrt{\eta_K(C)}-1}{\sqrt{\eta_K(C)}+1}.

(129)

Row normalization does not change cross-ratios. For the empirical matrix,

\frac{(\mathsf P_X)_{ij}(\mathsf P_X)_{\ell r}} {(\mathsf P_X)_{ir}(\mathsf P_X)_{\ell j}} = \frac{K(x_i,x_j)K(x_\ell,x_r)} {K(x_i,x_r)K(x_\ell,x_j)}.

(130)

Consequently, for every weighted configuration supported in $C$ ,

\delta(\mathsf P_X) \leq \lambda(\mathsf P_X) = \lambda(K_X) \leq \lambda_K(C).

(131)

The equality in the middle is the same diagonal-scaling invariance used by Sinkhorn: the normalization $\operatorname{diag}(K_Xa)^{-1}$ and the weight matrix $\operatorname{diag}(a)$ disappear from every projective cross-ratio.

The integral Birkhoff--Hopf theorem and the same linearization therefore give

\bar\delta_K(C) \eqdef \sup_{\alpha\in\Pp(C)}\delta(\mathsf P_\alpha) \leq \lambda_K(C)<1.

(132)

If $0<k_-\leq K\leq k_+$ on $C\times C$ , then in particular

\lambda_K(C) \leq \frac{k_+-k_-}{k_++k_-}.

(133)

Theorem: Dobrushin consensus for positive mean shift

Let $\alpha_0\in\Pp(\RR^d)$ have compact support, and set

C_0=\operatorname{conv}(\operatorname{supp}\alpha_0), \qquad D_0=\operatorname{diam}(\operatorname{supp}\alpha_0), \qquad \bar\delta=\bar\delta_K(C_0)<1.

(134)

Assume $K:C_0\times C_0\to(0,+\infty)$ is Lipschitz.

For an empirical initial measure $\alpha_0=\sum_i a_i\delta_{x_i^0}$ , let $X^k$ follow the damped blurring iteration

X^{k+1} = \bigl((1-\tau)I_n+\tau\mathsf P_{X^k}\bigr)X^k, \qquad 0<\tau\leq1,

(135)

define

\delta_k=\delta(\mathsf P_{X^k}), \qquad q_k=1-\tau(1-\delta_k), \qquad q_\tau=1-\tau(1-\bar\delta)<1.

(136)

Then

D(X^k) \eqdef \max_{i,j}\norm{x_i^k-x_j^k} \leq D_0\prod_{r=0}^{k-1}q_r \leq q_\tau^kD_0.

(137)

All particles converge to a common $x_\infty^{\rm d}\in C_0$ , with

\max_i\norm{x_i^k-x_\infty^{\rm d}} \leq \frac{D_0}{1-\bar\delta}q_\tau^k.

(138)

The characteristic solution of

\partial_t\alpha_t +\operatorname{div}\bigl(\alpha_tM_K[\alpha_t]\bigr)=0, \qquad \alpha_{t=0}=\alpha_0,

(139)

is globally defined and has nested convex hulls,

\operatorname{conv}(\operatorname{supp}\alpha_t) \subset \operatorname{conv}(\operatorname{supp}\alpha_s) \qquad (t\geq s\geq0).

(140)

Writing $D(t)=\operatorname{diam}(\operatorname{supp}\alpha_t)$ , one has

D(t) \leq D(s)\exp\!\left( -\int_s^t[1-\delta(\mathsf P_{\alpha_r})]\,\mathrm dr \right) \leq e^{-(1-\bar\delta)(t-s)}D(s).

(141)

Consequently there exists $x_\infty^{\rm c}\in C_0$ such that

\Wass_\infty(\alpha_t,\delta_{x_\infty^{\rm c}}) \leq \frac{D_0}{1-\bar\delta}e^{-(1-\bar\delta)t}.

(142)

Positivity and Lipschitz regularity on the compact set $C_0\times C_0$ make the characteristic field uniformly Lipschitz in $x$ and Lipschitz in $\alpha$ for $\Wass_1$ . Moreover, $x+M_K[\alpha](x)=\mathsf P_\alpha\operatorname{Id}(x)$ belongs to $\operatorname{conv}(\operatorname{supp}\alpha)$ . Hence no characteristic crosses an outward supporting hyperplane, the convex hulls are nested, and the flow is globally well posed.

For every compact $S\subset\RR^d$ ,

\operatorname{diam}(S) = \sup_{\theta\in\mathbb S^{d-1}} \left( \sup_{x\in S}\langle\theta,x\rangle -\inf_{x\in S}\langle\theta,x\rangle \right).

(143)

For a finite configuration, set $z_\theta=X\theta$ . The proposition gives

\begin{aligned} \norm{z_\theta^{k+1}}_V &\leq (1-\tau)\norm{z_\theta^k}_V +\tau\norm{\mathsf P_{X^k}z_\theta^k}_V\\ &\leq [1-\tau(1-\delta(\mathsf P_{X^k}))]\norm{z_\theta^k}_V \leq q_\tau\norm{z_\theta^k}_V. \end{aligned}

(144)

Taking the supremum over directions gives the one-step factor $q_k$ ; multiplying these factors and using $q_k\leq q_\tau$ proves the discrete diameter estimate. Since $\norm{x_i^{k+1}-x_i^k}\leq\tau D(X^k)$ , every path is Cauchy; summing the uniform geometric tail gives the pointwise estimate.

For the measure flow, let $f_\theta(x)=\langle\theta,x\rangle$ and let $w_\theta(t)$ be the width of the support in direction $\theta$ . Its upper Dini derivative satisfies

D^+w_\theta(t) \leq \norm{\mathsf P_{\alpha_t}f_\theta}_{V,\operatorname{supp}(\alpha_t)} -w_\theta(t) \leq -[1-\delta(\mathsf P_{\alpha_t})]w_\theta(t).

(145)

Gronwall’s lemma and the directional diameter identity give the adaptive estimate. Finally, every characteristic has speed at most $D(t)$ , hence is Cauchy. All limits coincide because $D(t)\to0$ , and integrating the exponential tail gives the $\Wass_\infty$ estimate.

For the Gaussian kernel $K_\epsilon(x,y)=e^{-\norm{x-y}^2/(2\epsilon)}$ ,

\log\frac{K_\epsilon(x,y)K_\epsilon(x',y')} {K_\epsilon(x',y)K_\epsilon(x,y')} = \frac{\langle x-x',y-y'\rangle}{\epsilon}.

(146)

Thus, on a compact convex set of diameter $D$ ,

\eta_{K_\epsilon}=e^{D^2/\epsilon}, \qquad \lambda_{K_\epsilon}=\tanh\!\left(\frac{D^2}{4\epsilon}\right).

(147)

Combining this identity with (131) gives the projectively certified adaptive estimate for the full discrete update,

D(X^{k+1}) \leq \tanh\!\left(\frac{D(X^k)^2}{4\epsilon}\right)D(X^k) \leq \frac{D(X^k)^3}{4\epsilon},

(148)

and the continuous flow obeys

D(t) \leq D(s)\exp\!\left( -\int_s^t \left[1-\tanh\!\left(\frac{D(r)^2}{4\epsilon}\right)\right]\,\mathrm dr \right).

(149)

Strict positivity therefore gives a global exponential rate, while contraction becomes much stronger once the cloud is narrow relative to the bandwidth. These are Birkhoff bounds; the exact Dobrushin factors can be smaller.
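The adaptive estimate (148) can be observed on a toy cloud. The sketch below runs the full update $\tau=1$ with uniform weights and records the diameters; the six planar points and the bandwidth $\epsilon=2$ are illustrative assumptions, chosen so that the three clusters merge within a few steps.

```python
import math

def blurring_step(X, eps, tau=1.0):
    """Damped blurring mean-shift update (135) with the Gaussian kernel K_eps."""
    new = []
    for x in X:
        w = [math.exp(-math.dist(x, y) ** 2 / (2 * eps)) for y in X]
        total = sum(w)
        bary = [sum(wi * y[k] for wi, y in zip(w, X)) / total for k in range(len(x))]
        new.append([(1 - tau) * xk + tau * bk for xk, bk in zip(x, bary)])
    return new

def diameter(X):
    """Largest pairwise Euclidean distance D(X)."""
    return max(math.dist(x, y) for x in X for y in X)

def run(X, eps, steps):
    """Return the diameters D(X^0), ..., D(X^steps) for the full update tau = 1."""
    diameters = [diameter(X)]
    for _ in range(steps):
        X = blurring_step(X, eps)
        diameters.append(diameter(X))
    return diameters

# Three loose clusters in the plane; uniform weights a_i = 1/n are assumed.
cloud = [[-2.0, 0.0], [-1.8, 0.3], [0.0, 1.5], [0.2, 1.7], [2.0, 0.0], [1.9, -0.2]]
eps = 2.0
diameters = run(cloud, eps, steps=20)
# Right-hand side of (148) evaluated at each step.
bounds = [math.tanh(D ** 2 / (4 * eps)) * D for D in diameters[:-1]]
```

Comparing `diameters[1:]` with `bounds` shows the certified factor approaching one for a wide cloud and turning cubic once the cloud is narrow. With a smaller bandwidth the same code exhibits the long multimodal transient instead.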

Remark: Sinkhorn and mean shift share one projective mechanism

The link with Sinkhorn is exact at the level of positive operators, but the state variables differ. In Theorem: Projective Linear Convergence of Sinkhorn, entrywise multiplication and inversion are Hilbert isometries, while the two multiplications by $K$ and $K^\top$ each contribute one Birkhoff factor; hence a full scaling cycle contracts rays by $\lambda(K)^2$. Mean shift performs one row-normalized multiplication by the state-dependent matrix $K_X$. Diagonal normalization preserves its Birkhoff factor, and row stochasticity fixes the constant ray; the tangent dynamics on $\RR^n/\operatorname{Span}(\mathbf1_n)$ therefore contracts by the exact factor $\delta(\mathsf P_X)\leq\lambda(K_X)$.

The analogy has precise limits. Sinkhorn compares two scaling iterates for a fixed kernel in the multiplicative projective metric. Mean shift contracts coordinate widths inside one evolving cloud in the additive tangent norm; because $X\mapsto\mathsf P_XX$ is state dependent, the theorem does not assert contraction between two different clouds. Nevertheless, both slow regimes have the same origin: when the positive kernel has a large projective diameter, its Birkhoff factor approaches one. This occurs for small entropic temperature in Sinkhorn and for a narrow Gaussian bandwidth relative to the cloud diameter in mean shift.

Scope of the consensus result.

The projective argument also clarifies what is not conserved and when clustering replaces consensus. Because row normalization makes the coefficients asymmetric, the consensus point need not equal the initial barycenter; that barycenter is conserved for the constant kernel, but not in general. The self-consistent flow should not be confused with classical mode seeking, where the data measure is frozen and different queries may converge to different modes. Finally, one-point consensus can fail for the hard confidence kernel $K_R(x,y)=\mathbf1_{\{\norm{x-y}\leq R\}}$ : once the interaction graph disconnects, its Dobrushin and Birkhoff coefficients can equal one. The discrete Hegselmann--Krause dynamics then converges to opinion clusters Blondel et al., 2009; under the Eulerian hypotheses, the limit is a finite combination of Dirac masses separated by at least $R$ Canuto et al., 2012. For the Gaussian kernel, global positivity eventually forces one Dirac, although $1-\tanh(D_0^2/(4\epsilon))$ can be very small and the multimodal transient can be long.

The figure below visualizes this transient contraction before the eventual consensus guaranteed by Theorem: Dobrushin consensus for positive mean shift.

Continuous-time mean shift for a densely sampled three-Gaussian mixture. Left: initial density level sets, in red, and representative particle paths of $\dot x=M_\epsilon[\alpha_t](x)$, colored from red to blue. Right: four later kernel-density renderings of $\alpha_t$ at increasing times, with the same red-to-blue time palette; the initial density is omitted because it is shown on the left. The snapshots show the long multimodal transient before globally positive Gaussian interactions drive the cloud toward the one-point consensus of Theorem: Dobrushin consensus for positive mean shift.

Gradient structure and limitations.

When the token space has dimension $d$ and the query/key space has dimension $r$ , take $Q,K\in\RR^{r\times d}$ and $V\in\RR^{d\times d}$ . If $V=Q^\top K$ , the field $\Gamma_\theta[\alpha]$ is a gradient vector field in the token variable. Indeed, define the log-partition potential

\Phi_\alpha(x) = \int \exp(\dotp{Qx}{Ky})\d\alpha(y), \qquad U_\alpha(x)=\log\Phi_\alpha(x).

(150)

Then

\nabla_x U_\alpha(x) = \frac{\int Q^\top K y\,\exp(\dotp{Qx}{Ky})\d\alpha(y)} {\int \exp(\dotp{Qx}{Kz})\d\alpha(z)} = \Gamma_\theta[\alpha](x).

(151)

This proves only that the velocity is an instantaneous gradient in $x$ ; it does not by itself identify a Wasserstein energy. Indeed, the natural scalar candidate

f_{\rm att}(\alpha)=\int U_\alpha(x)\,\d\alpha(x)

(152)

has first variation

\delta f_{\rm att}(\alpha)(z)=U_\alpha(z)+\int\frac{\exp(\dotp{Qx}{Kz})}{\Phi_\alpha(x)}\,\d\alpha(x),

(153)

up to an additive constant. The second term is the response of every query normalization to a perturbation of the key distribution, and its spatial gradient is absent from $\Gamma_\theta[\alpha]$. Thus, without additional symmetry or integrability conditions, the attention PDE is a transportation dynamics rather than the Wasserstein gradient flow of this fixed scalar functional. Special variants recover additional structure: Sinkhorn attention can be interpreted through doubly stochastic normalization and Wasserstein-type gradient flows (Sander et al., 2022; Castin et al., 2025), while layer normalization leads naturally to dynamics on the sphere and to modified metrics. The key open difficulty for the present viewpoint is training: after the architecture has been rewritten as a controlled transport equation, learning corresponds to optimizing the time-dependent parameters $(\theta_t)_t$ rather than merely analyzing the forward PDE for fixed parameters.
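The gradient identity (151) can be verified by finite differences. The sketch below uses hypothetical matrices $Q,K\in\RR^{2\times3}$, a four-token cloud with uniform weights, and a central difference of step $10^{-5}$; only the relation $V=Q^\top K$ comes from the text, every numerical value is an illustrative assumption.

```python
import math

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def qt_times_k(Q, K):
    """The d x d matrix Q^T K, the value matrix that makes Gamma a gradient."""
    r, d = len(Q), len(Q[0])
    return [[sum(Q[s][i] * K[s][j] for s in range(r)) for j in range(d)]
            for i in range(d)]

def log_partition(x, cloud, Q, K):
    """U_alpha(x): log of the mean of exp(<Qx, Ky>) over the cloud, as in (150)."""
    qx = matvec(Q, x)
    return math.log(sum(math.exp(dot(qx, matvec(K, y))) for y in cloud) / len(cloud))

def gamma(x, cloud, Q, K, V):
    """Attention velocity (108) for the uniform empirical measure on `cloud`."""
    qx = matvec(Q, x)
    w = [math.exp(dot(qx, matvec(K, y))) for y in cloud]
    vals = [matvec(V, y) for y in cloud]
    return [sum(wi * v[k] for wi, v in zip(w, vals)) / sum(w) for k in range(len(x))]

def numerical_gradient(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = []
    for k in range(len(x)):
        up = x[:k] + [x[k] + h] + x[k + 1:]
        down = x[:k] + [x[k] - h] + x[k + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

Q = [[0.7, -0.2, 0.1], [0.3, 0.5, -0.4]]   # r = 2, d = 3 (hypothetical values)
K = [[0.2, 0.6, -0.3], [-0.5, 0.1, 0.4]]
cloud = [[1.0, 0.0, -1.0], [0.5, 1.5, 0.2], [-1.0, 0.3, 0.8], [0.0, -0.7, 1.1]]
x = [0.3, -0.4, 0.6]
grad_u = numerical_gradient(lambda p: log_partition(p, cloud, Q, K), x)
field = gamma(x, cloud, Q, K, qt_times_k(Q, K))
```

The two vectors `grad_u` and `field` agree up to the finite-difference error. This check concerns the gradient in $x$ only and says nothing about the measure-level first variation (153).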

Flows over the Gaussian Manifold

Gaussian measures provide a useful testing ground for the preceding dynamics. They are not invariant under a general Wasserstein gradient flow: a nonlinear velocity usually creates non-Gaussian densities immediately. The useful substitute is to either identify affine velocities, which exactly preserve Gaussianity, or to project the dynamics onto the Gaussian manifold. In both cases the measure PDE reduces to matrix ODEs for the mean and covariance. This viewpoint is emphasized in the survey Peyré, 2025 and is useful for comparing diffusion paths, Wasserstein gradient flows, drifting fields and transformer-type dynamics.

For constrained gradient flows on this family, the covariance equation is the finite-dimensional Bures--Wasserstein gradient flow on positive definite matrices. Thus Gaussian closure is not just a computational shortcut: it is the restriction of Wasserstein geometry to the Gaussian submanifold, where affine gradient fields encode tangent vectors. The following figure first compares three bridge-type Gaussian closures from a source $\alpha_0$ to a target $\gamma$ ; the exact gradient-flow closures for specified energies $f(\alpha)$ are catalogued afterwards in Proposition: Gaussian closure catalogue.

Gaussian closures from a red source $\alpha_0$ to a blue target $\gamma=\mathcal N(\bar m,\bar\Sigma)$ . The left panel is the constant-speed $W_2$ geodesic, equivalently the displacement interpolation minimizing the Benamou--Brenier action between $\alpha_0$ and $\gamma$ . The middle panel is an entropic Sinkhorn/Schrödinger bridge-style closure for the quadratic cost $|x-y|^2$ and regularization strength $\epsilon>0$ ; it is a bridge toward $\gamma$ , not the gradient flow of a fixed energy $f(\alpha)$ , and the entropic noise inflates intermediate covariances. The right panel is a prescribed non-variational drifting flow, governed by a continuity equation with an affine Gaussian-preserving velocity, chosen so that the mean follows a curved path while the covariance is moment-matched to the same endpoint $\gamma$ .

Gaussianity preservation.

The first question is invariance: one wants a simple criterion ensuring that the continuity equation does not leave the finite-dimensional Gaussian family.

Proposition: Affine velocities preserve Gaussianity

Let $\alpha_0=\Gaussian(m_0,\Sigma_0)$ , with $\Sigma_0\succ0$ . Let $b_t\in\RR^d$ and $A_t\in\RR^{d\times d}$ be locally integrable on a time interval, and let $(m_t,\Sigma_t)$ solve

\dot m_t=b_t, \qquad \dot\Sigma_t=A_t\Sigma_t+\Sigma_tA_t^\top, \qquad (m_{t=0},\Sigma_{t=0})=(m_0,\Sigma_0).

(154)

Then, as long as this matrix ODE is defined, $\Sigma_t\succ0$ and

\alpha_t=\Gaussian(m_t,\Sigma_t)

(155)

is the solution of the continuity equation

\partial_t\alpha_t+\diverg(\alpha_t v_t)=0, \qquad v_t(x)=b_t+A_t(x-m_t).

(156)

In particular, if a smooth functional $f$ has a Wasserstein gradient on Gaussian measures of the affine form

\Wgrad f(\Gaussian(m,\Sigma))(x) = b_f(m,\Sigma)+A_f(m,\Sigma)(x-m),

(157)

with $A_f(m,\Sigma)$ symmetric, then the Wasserstein gradient flow of $f$ , initialized from any non-degenerate Gaussian, stays Gaussian and satisfies

\dot m_t=h_f(m_t,\Sigma_t), \qquad \dot\Sigma_t=H_f(m_t,\Sigma_t),

(158)

where

h_f(m,\Sigma)=-b_f(m,\Sigma), \qquad H_f(m,\Sigma) = -\bigl(A_f(m,\Sigma)\Sigma+\Sigma A_f(m,\Sigma)\bigr).

(159)

Conversely, fix a non-degenerate Gaussian $\Gaussian(m,\Sigma)$ . Suppose that a finite-energy Wasserstein tangent field $V_{m,\Sigma}$ is represented by a gradient in $L^2(\Gaussian(m,\Sigma))$ and is tangent to the Gaussian manifold at this Gaussian. Then there exist $b(m,\Sigma)$ and a symmetric matrix $A(m,\Sigma)$ such that

V_{m,\Sigma}(x)=b(m,\Sigma)+A(m,\Sigma)(x-m) \qquad \Gaussian(m,\Sigma)\text{-a.e.}

(160)

Consequently, under the same smoothness assumptions, if the Wasserstein gradient flow of a functional $f$ preserves the Gaussian family for all non-degenerate Gaussian initial data, then $\Wgrad f(\Gaussian(m,\Sigma))$ is affine on each Gaussian. Without the Wasserstein gradient representative, this converse is false because one may add velocity fields with zero $\Gaussian(m,\Sigma)$-weighted divergence.

Finally, any smooth Gaussian curve with positive definite covariance can be generated by an affine velocity. If one wants the velocity to be a Wasserstein tangent gradient, one chooses the unique symmetric solution of the Lyapunov equation

A_t\Sigma_t+\Sigma_t A_t=\dot\Sigma_t.

(161)

Let $X_t$ follow the characteristic ODE $\dot X_t=b_t+A_t(X_t-m_t)$ with $X_0\sim\Gaussian(m_0,\Sigma_0)$ . Since $\dot m_t=b_t$ , the centered variable $\tilde X_t=X_t-m_t$ solves the homogeneous linear ODE $\dot{\tilde X}_t=A_t\tilde X_t$ . If $\Phi_t$ is the fundamental matrix $\dot\Phi_t=A_t\Phi_t$ , $\Phi_{t=0}=\Id$ , then

X_t=m_t+\Phi_t(X_0-m_0), \qquad \Sigma_t=\Phi_t\Sigma_0\Phi_t^\top.

(162)

Hence $X_t$ is Gaussian and $\Sigma_t\succ0$ , and

\dot\Sigma_t = \frac{\d}{\d t}\EE\bigl(\tilde X_t\tilde X_t^\top\bigr) = A_t\Sigma_t+\Sigma_t A_t^\top.

(163)

This proves Gaussian preservation and the moment ODE. The Wasserstein gradient-flow statement follows by inserting the descent velocity $v_t=-\Wgrad f(\alpha_t)$ .

For the converse, fix $\alpha=\Gaussian(m,\Sigma)$ and denote its density by $\rho$ . Tangency to the Gaussian manifold means that the density variation $-\diverg(\rho V)$ is generated by some moment variation $(\dot m,\dot\Sigma)$ , with $\dot\Sigma$ symmetric. Set $b=\dot m$ , and let $A=A^\top$ be the unique solution of

A\Sigma+\Sigma A=\dot\Sigma.

(164)

By the first part of the proposition, the affine gradient field $V_0(x)=b+A(x-m)$ generates exactly the same infinitesimal Gaussian variation. Hence

\diverg\bigl(\rho(V-V_0)\bigr)=0

(165)

in the distributional sense. Both $V$ and $V_0$ belong to the $L^2(\alpha)$ closure of gradient fields. The weighted Helmholtz decomposition therefore makes $V-V_0$ simultaneously a tangent gradient and orthogonal to every tangent gradient, so $V=V_0$ in $L^2(\alpha)$ . Equivalently, for a smooth representative $V-V_0=\nabla\psi$ , integration by parts gives $\int\norm{\nabla\psi}^2\d\alpha=0$ . This proves that the Wasserstein tangent representative is affine. The qualification is essential: without selecting the gradient representative, one can add a nonzero field with zero $\alpha$ -weighted divergence without changing the Gaussian curve.

For a prescribed smooth Gaussian curve, set $b_t=\dot m_t$ and choose any matrix $A_t$ satisfying $A_t\Sigma_t+\Sigma_tA_t^\top=\dot\Sigma_t$ . Since $\Sigma_t$ is positive definite, the Lyapunov map $A\mapsto A\Sigma_t+\Sigma_t A$ is invertible on symmetric matrices, which gives the unique symmetric choice when a gradient velocity is required. In that case $v_t$ is the gradient of the quadratic potential $x\mapsto \dotp{b_t}{x}+\dotp{A_t(x-m_t)}{x-m_t}/2$ .
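The moment relations (154) and (162) can be illustrated by pushing samples along the characteristics. The sketch below uses explicit Euler steps, a constant non-symmetric matrix $A$ and a constant drift $b$, all of which are illustrative assumptions. It only compares second moments: Gaussianity itself follows because $X_t$ is an affine image of $X_0$.

```python
import random

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def covariance(points, mean):
    n, d = len(points), len(mean)
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
             for j in range(d)] for i in range(d)]

def simulate(points, A, b, dt, steps):
    """Euler steps of the characteristics, of the moment ODE (154) and of Phi_t."""
    d = len(b)
    mean = [sum(p[k] for p in points) / len(points) for k in range(d)]
    sigma0 = covariance(points, mean)
    sigma = [row[:] for row in sigma0]
    phi = [[float(i == j) for j in range(d)] for i in range(d)]
    for _ in range(steps):
        # Characteristic ODE: dx/dt = b + A (x - m_t).
        points = [[p[i] + dt * (b[i] + sum(A[i][j] * (p[j] - mean[j]) for j in range(d)))
                   for i in range(d)] for p in points]
        # Moment ODE: dSigma/dt = A Sigma + Sigma A^T.
        AS, SAt = matmul(A, sigma), matmul(sigma, transpose(A))
        sigma = [[sigma[i][j] + dt * (AS[i][j] + SAt[i][j]) for j in range(d)]
                 for i in range(d)]
        # Fundamental matrix: dPhi/dt = A Phi.
        APhi = matmul(A, phi)
        phi = [[phi[i][j] + dt * APhi[i][j] for j in range(d)] for i in range(d)]
        mean = [mean[i] + dt * b[i] for i in range(d)]
    pushed = matmul(matmul(phi, sigma0), transpose(phi))
    return points, mean, sigma, pushed

random.seed(0)
samples = [[random.gauss(1.0, 1.5), random.gauss(-0.5, 0.7)] for _ in range(200)]
A = [[-0.5, 1.0], [-1.0, -0.2]]     # not symmetric: a general affine velocity
b = [0.3, -0.1]
points, mean, sigma_ode, sigma_pushed = simulate(samples, A, b, dt=1e-3, steps=1000)
sigma_particles = covariance(points, mean)
```

The empirical covariance of the pushed samples matches $\Phi_t\Sigma_0\Phi_t^\top$ up to rounding, and both agree with the Euler solution of the moment ODE up to the $O(\mathrm dt)$ discretization error.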

Gaussian-preserving gradient flows.

We now instantiate the affine-gradient viewpoint by tracking functionals whose full Wasserstein gradient is affine on Gaussian inputs. The catalogue below contains exact Gaussian-preserving Wasserstein flows, not projected or constrained flows; the separate constrained construction is discussed only afterwards.

Proposition: Gaussian closure catalogue

Let $\gamma=\Gaussian(\bar m,\bar\Sigma)$ on $\RR^d$ , with $\bar\Sigma\succ0$ , and let the initial condition be $\alpha_0=\Gaussian(m_0,\Sigma_0)$ , with $\Sigma_0\succ0$ . For each functional displayed below, let $\alpha_t$ be the usual Wasserstein gradient flow on $\Pp_2(\RR^d)$ , initialized at $\alpha_0$ . Then the Gaussian family is invariant for these dynamics: as long as the solution exists and $\Sigma_t\succ0$ ,

\alpha_t=\Gaussian(m_t,\Sigma_t),

(166)

and the mean and covariance satisfy

\dot{m}_t=h(m_t,\Sigma_t), \qquad \dot{\Sigma}_t=H(m_t,\Sigma_t).

(167)

Write $\delta_m=m-\bar m$ , $A=\bar\Sigma^{-1}$ , and

M_{\Sigma,\bar\Sigma} \eqdef \Sigma^{-1/2}\bigl(\Sigma^{1/2}\bar\Sigma\Sigma^{1/2}\bigr)^{1/2}\Sigma^{-1/2}.

(168)

With the normalizations displayed in the first column, the mean vector field $h$ and covariance vector field $H$ are listed in the following table. Gradients with respect to $\Sigma$ are symmetric Frobenius gradients on the cone of covariance matrices.

\begin{center}
\begingroup
\small
\renewcommand{\arraystretch}{1.55}
Gaussian-preserving Wasserstein gradient flows.\par\smallskip
\begin{tabularx}{\linewidth}{>{\raggedright\arraybackslash}p{.42\linewidth}>{\centering\arraybackslash}p{.18\linewidth}>{\raggedright\arraybackslash}X}
\hline
Functional $f(\alpha)$ & $h(m,\Sigma)$ & $H(m,\Sigma)$ \\
\hline
$g(m_\alpha,\Sigma_\alpha)$ & $-\nabla_m g$ & $-2(\Sigma\nabla_\Sigma g+\nabla_\Sigma g\,\Sigma)$ \\
$\displaystyle \int\Bigl(\frac12x^\top Bx+\dotp{\ell}{x}\Bigr)\d\alpha(x)$, $B=B^\top$ & $-(B m+\ell)$ & $-(\Sigma B+B\Sigma)$ \\
$\displaystyle \frac14\iint (x-y)^\top G(x-y)\d\alpha(x)\d\alpha(y)$, $G=G^\top$ & $0$ & $-(\Sigma G+G\Sigma)$ \\
$\KL(\alpha|\gamma)$ & $-A\delta_m$ & $2\Id-\Sigma A-A\Sigma$ \\
$\mathcal I(\alpha|\gamma)$ & $-2A^2\delta_m$ & $4\Sigma^{-1}-2\Sigma A^2-2A^2\Sigma$ \\
$\Wass_2^2(\alpha,\gamma)$ & $-2\delta_m$ & $2(M_{\Sigma,\bar\Sigma}\Sigma+\Sigma M_{\Sigma,\bar\Sigma}-2\Sigma)$ \\
$\MMD_k^2(\alpha,\gamma)$, $k(x,y)=\dotp{x}{y}^2$ & $-4R m$ & $-4(\Sigma R+R\Sigma)$ \\
$\displaystyle \bar\MK_{\norm{\cdot-\cdot}^2}^{\epsilon}(\alpha,\gamma)$ & $-2\delta_m$ & $-2(\Sigma G_\epsilon+G_\epsilon\Sigma)$ \\
$\SW_2^2(\alpha,\gamma)$ & $\displaystyle -\frac{2}{d}\delta_m$ & $-2(\Sigma G_{\mathrm{sw}}+G_{\mathrm{sw}}\Sigma)$ \\
\hline
\end{tabularx}
\endgroup
\end{center}

Here, in the MMD row,

R=\Sigma+m\,m^\top-\bar\Sigma-\bar m\,\bar m^\top.

(169)

The debiased Sinkhorn row uses the notation of Corollary: Gaussian Sinkhorn Divergence and Smoothed Bures Term: for Gaussian inputs,

\bar\MK_{\norm{\cdot-\cdot}^2}^{\epsilon}(\alpha,\gamma) = \norm{\delta_m}^2+\Bb_\epsilon(\Sigma,\bar\Sigma)^2,

(170)

with the closed-form covariance gradient

G_\epsilon(\Sigma,\bar\Sigma) = \tau_\epsilon(\Sigma) - \bar\Sigma^{1/2} \tau_\epsilon\bigl(B_{\Sigma,\bar\Sigma}^{1/2}\bigr) B_{\Sigma,\bar\Sigma}^{-1/2} \bar\Sigma^{1/2}, \qquad B_{\Sigma,\bar\Sigma}=\bar\Sigma^{1/2}\Sigma\bar\Sigma^{1/2}.

(171)

Here $\tau_\epsilon$ is the scalar function

\tau_\epsilon(r) \eqdef \frac{\sqrt{\epsilon^2+16r^2}-\epsilon}{4r}, \qquad r>0,

(172)

applied to positive matrices by spectral calculus. Equivalently, for $M\succ0$ ,

\tau_\epsilon(M) = \bigl(\sqrt{\epsilon^2 I+16M^2}-\epsilon I\bigr)(4M)^{-1}.

(173)

With this convention, $G_\epsilon=\nabla_\Sigma \Bb_\epsilon(\Sigma,\bar\Sigma)^2$ . In the sliced row, $\sigma$ is the normalized spherical measure on $\Sphere^{d-1}$ , and

G_{\mathrm{sw}}(\Sigma,\bar\Sigma) = \int_{\Sphere^{d-1}} \left( 1-\sqrt{\frac{\theta^\top\bar\Sigma\theta}{\theta^\top\Sigma\theta}} \right) \theta\theta^\top\,\d\sigma(\theta).

(174)

Here

\mathcal I(\alpha|\gamma) = \int \left|\nabla\log\frac{\rho(x)}{\rho_\gamma(x)}\right|^2\rho(x)\,\d x = \int \left|\nabla\log\rho(x)+A(x-\bar m)\right|^2\rho(x)\,\d x \qquad(\alpha=\rho\,\d x),

(175)

where $\rho_\gamma$ is the density of $\gamma$ .
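As an illustration of how a row of the catalogue is used, the sketch below integrates the $\KL(\alpha|\gamma)$ row, $\dot m=-A\delta_m$ and $\dot\Sigma=2\Id-\Sigma A-A\Sigma$ with $A=\bar\Sigma^{-1}$, by explicit Euler steps in dimension two. The target, the initial Gaussian, the step size and the horizon are arbitrary illustrative choices, and the $2\times2$ helpers are introduced here to stay within the standard library.

```python
def matmul2(A, B):
    return [[A[i][0] * B[0][j] + A[i][1] * B[1][j] for j in range(2)] for i in range(2)]

def inverse2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

def kl_fields(m, sigma, m_bar, precision):
    """Mean and covariance fields (h, H) of the KL row, with A = precision."""
    delta = [m[0] - m_bar[0], m[1] - m_bar[1]]
    h = [-(precision[i][0] * delta[0] + precision[i][1] * delta[1]) for i in range(2)]
    SA, AS = matmul2(sigma, precision), matmul2(precision, sigma)
    H = [[2.0 * (i == j) - SA[i][j] - AS[i][j] for j in range(2)] for i in range(2)]
    return h, H

def kl_flow(m, sigma, m_bar, sigma_bar, dt=1e-3, steps=30000):
    """Explicit Euler integration of the moment ODE (167) for the KL row."""
    precision = inverse2(sigma_bar)
    for _ in range(steps):
        h, H = kl_fields(m, sigma, m_bar, precision)
        m = [m[i] + dt * h[i] for i in range(2)]
        sigma = [[sigma[i][j] + dt * H[i][j] for j in range(2)] for i in range(2)]
    return m, sigma

m_bar, sigma_bar = [0.0, 1.0], [[2.0, 0.5], [0.5, 1.0]]
m0, sigma0 = [3.0, -2.0], [[0.5, 0.0], [0.0, 4.0]]
m_end, sigma_end = kl_flow(m0, sigma0, m_bar, sigma_bar)
```

The pair $(\bar m,\bar\Sigma)$ is a stationary point of these fields, and the integrated moments approach it, which is the Gaussian shadow of the convergence of the Fokker--Planck flow toward $\gamma$.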

Not every PDE preserves Gaussianity exactly. Wasserstein flows of generic higher-order regularizers usually create higher moments immediately and require a Gaussian projection to close on $(m,\Sigma)$ . Such projected closures are still useful: they expose the finite-dimensional dynamics predicted by a variational model and make it easy to compare variational flows with non-variational affine dynamics such as drifting fields or the Gaussian transformer closure below.

Example: Linear mean-field networks as cross-covariance flows

Consider the two-layer mean-field model of Section Training Two-Layer MLPs as Wasserstein Flows, and take the linear activation $\sigma(s)=s$ , so that

\psi((u,v),z)=v\,\dotp{u}{z}.

(194)

We restrict this example to centered neuron laws,

\int (u,v)\d\alpha(u,v)=0, \qquad \Sigma_\alpha= \begin{pmatrix} \Sigma_{uu}(\alpha) & \Sigma_{uv}(\alpha)\\ \Sigma_{vu}(\alpha) & \Sigma_{vv}(\alpha) \end{pmatrix},

(195)

and use the lower-left cross-covariance block

\Sigma_{vu}(\alpha)=\int v u^\top\d\alpha(u,v)\in\RR^{d'\times d}.

(196)

The predictor is therefore the linear map

G_{\alpha_t}(z)=\Sigma_{vu}(\alpha_t)z.

(197)

For the squared Euclidean loss, set

S=\int zz^\top\d\rho(z,y), \qquad R=\int y z^\top\d\rho(z,y).

(198)

The learning energy of (223) is then the covariance functional

f(\alpha)=g(\Sigma_\alpha), \qquad g(\Sigma) = \frac12\tr\!\big(\Sigma_{vu}S\Sigma_{uv}\big) -\tr\!\big(R\Sigma_{uv}\big) +\frac12\int\norm{y}^2\d\rho(z,y).

(199)

This puts the model exactly in the centered moment-functional row of Proposition: Gaussian closure catalogue. To see that it recovers the usual particle equation, write

E_\alpha\eqdef \Sigma_{vu}(\alpha)S-R.

(200)

The first variation is

\delta f(\alpha)(u,v)=\dotp{E_\alpha}{v u^\top}=v^\top E_\alpha u.

(201)

Hence the particle velocity in parameter space is linear:

-\nabla_{(u,v)}\delta f(\alpha)(u,v) = -\begin{pmatrix} 0 & E_\alpha^\top \\ E_\alpha & 0 \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix}.

(202)

Equivalently, at the level of $g$ ,

\nabla_\Sigma g= \frac12 \begin{pmatrix} 0 & E_\alpha^\top\\ E_\alpha & 0 \end{pmatrix}.

(203)

The factor $1/2$ in the covariance gradient comes from the symmetry of $\Sigma$: the upper-right block is the transpose of the lower-left block. Substituting this gradient in Proposition: Gaussian closure catalogue gives

\dot m_t=0, \qquad \dot\Sigma_t=-(\Sigma_t L_t+L_t\Sigma_t), \qquad L_t= \begin{pmatrix} 0 & E_t^\top\\ E_t & 0 \end{pmatrix}, \qquad E_t=\Sigma_{vu}(\alpha_t)S-R.

(204)

Thus a centered Gaussian law of neurons remains centered Gaussian, and the dynamics is driven by the cross-covariance block alone. This exact closure is special to the linear activation; for nonlinear activations, Gaussian closures are usually projections rather than invariant families.
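In the scalar case $d=d'=1$ the particle velocity (202) reads $\dot u=-E v$, $\dot v=-E u$ with $E=\Sigma_{vu}S-R$, and the energy (199) is a quadratic function of $c=\Sigma_{vu}$. The sketch below simulates a small centered cloud of neurons with explicit Euler steps; the data moments `S`, `R`, `y2` and the four neurons are hypothetical values chosen for illustration.

```python
def cross_covariance(particles):
    """Scalar Sigma_vu = mean of v * u for a centered cloud of neurons (u, v)."""
    return sum(u * v for u, v in particles) / len(particles)

def loss(c, S, R, y2):
    """Quadratic energy g of (199) for d = d' = 1, as a function of c = Sigma_vu."""
    return 0.5 * S * c * c - R * c + 0.5 * y2

def train(particles, S, R, dt=0.01, steps=2000):
    """Euler steps of the particle velocity (202): du/dt = -E v, dv/dt = -E u."""
    history = []
    for _ in range(steps):
        c = cross_covariance(particles)
        history.append(c)
        E = c * S - R
        particles = [(u - dt * E * v, v - dt * E * u) for u, v in particles]
    history.append(cross_covariance(particles))
    return particles, history

# Centered cloud: every neuron (u, v) is paired with (-u, -v).
neurons = [(1.0, 0.2), (-1.0, -0.2), (0.3, -0.5), (-0.3, 0.5)]
S, R, y2 = 2.0, 1.0, 1.0       # hypothetical data moments E[z^2], E[y z], E[y^2]
trained, history = train(neurons, S, R)
losses = [loss(c, S, R, y2) for c in history]
```

The cross-covariance converges to the least-squares slope $R/S$ while the energy decreases, and the cloud stays centered because the velocity is linear in $(u,v)$.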

Constrained evolution on the Gaussian manifold.¶

The preceding affine-gradient examples have a limited scope: most Wasserstein gradient flows of a functional $f(\alpha)$ are not closed on the Gaussian manifold. A nonlinear ambient velocity creates higher-order moments immediately, so the exact Gaussian closure usually fails. The constrained viewpoint deliberately replaces the full evolution by its projection onto $\mathcal G$, forcing the curve to remain Gaussian while keeping the Wasserstein tangent geometry.

Let

\mathcal G=\{\Gaussian(m,\Sigma):m\in\RR^d,\ \Sigma\succ0\}

(205)

be the Gaussian submanifold of $\Pp_2(\RR^d)$. The Wasserstein gradient of a functional constrained to a smooth submanifold $\mathcal M\subset\Pp_2$ is defined as the Riesz representative of the differential restricted to tangent velocities of $\mathcal M$. Equivalently, it is the small-step limit of the constrained JKO scheme

\alpha^{k+1}\in \argmin_{\alpha\in\mathcal M} \frac{1}{2\tau}\Wass_2^2(\alpha,\alpha^k)+f(\alpha).

(206)

For $\mathcal M=\mathcal G$, tangent velocities are affine gradient fields $v(x)=b+A(x-m)$ with $A=A^\top$. The constrained gradient is therefore the $L^2(\Gaussian(m,\Sigma))$ projection of the ambient Wasserstein gradient onto this finite-dimensional affine space, whenever the ambient gradient exists.

Proposition: Gaussian-constrained Wasserstein gradients

Let $f$ be a smooth functional and assume that its restriction to nondegenerate Gaussian measures can be written as

f(\Gaussian(m,\Sigma))=F(m,\Sigma).

(207)

Then the Wasserstein gradient constrained to the Gaussian family is the affine vector field

v_F(x) = \nabla_m F(m,\Sigma) + 2\nabla_\Sigma F(m,\Sigma)(x-m),

(208)

where $\nabla_\Sigma F$ denotes the symmetric matrix derivative. Equivalently, $v_F$ is the $L^2(\Gaussian(m,\Sigma))$ projection of the ambient Wasserstein gradient onto affine gradient fields, whenever the ambient gradient exists. Hence the gradient descent flow constrained to Gaussian measures satisfies

\dot m_t=-\nabla_m F(m_t,\Sigma_t), \qquad \dot\Sigma_t=-2\bigl(\Sigma_t\nabla_\Sigma F(m_t,\Sigma_t)+\nabla_\Sigma F(m_t,\Sigma_t)\Sigma_t\bigr),

(209)

and the descent velocity is affine.
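The ODE (209) can be integrated once $\nabla_mF$ and $\nabla_\Sigma F$ are available. The sketch below assumes, purely for illustration, the energy $F(m,\Sigma)=\frac12\norm{m}^2+\frac12\tr\Sigma-\frac12\log\det\Sigma$, which is the relative entropy with respect to $\Gaussian(0,\Id)$ up to an additive constant. For this choice $\nabla_\Sigma F=\frac12(\Id-\Sigma^{-1})$, so (209) reduces to $\dot m_t=-m_t$ and $\dot\Sigma_t=2(\Id-\Sigma_t)$, whose closed-form solution serves as a reference. The function names and the explicit Euler scheme are assumptions of this example.

```python
import numpy as np

def grad_m(m, Sigma):
    """Gradient in m of the illustrative energy F."""
    return m

def grad_Sigma(m, Sigma):
    """Symmetric matrix derivative of F with respect to Sigma."""
    return 0.5 * (np.eye(len(m)) - np.linalg.inv(Sigma))

def constrained_flow(m, Sigma, grad_m, grad_Sigma, dt, n_steps):
    """Explicit Euler discretization of the Gaussian-constrained flow (209)."""
    for _ in range(n_steps):
        G = grad_Sigma(m, Sigma)
        m, Sigma = (m - dt * grad_m(m, Sigma),
                    Sigma - 2.0 * dt * (Sigma @ G + G @ Sigma))
    return m, Sigma

m0 = np.array([1.0, -2.0])
Sigma0 = np.array([[2.0, 0.5], [0.5, 1.0]])
T, dt = 2.0, 1e-3
m_T, Sigma_T = constrained_flow(m0, Sigma0, grad_m, grad_Sigma, dt, int(T / dt))

# Closed-form solution of dot m = -m, dot Sigma = 2 (Id - Sigma).
m_exact = np.exp(-T) * m0
Sigma_exact = np.eye(2) + np.exp(-2.0 * T) * (Sigma0 - np.eye(2))
print("mean error:      ", np.abs(m_T - m_exact).max())
print("covariance error:", np.abs(Sigma_T - Sigma_exact).max())
```

Swapping in other `grad_m` and `grad_Sigma` callables gives the constrained flow of any energy already reduced to a function of $(m,\Sigma)$.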

This proposition is the organizing rule for constrained Gaussian closures: once the scalar energy has been reduced to a function of $(m,\Sigma)$, its constrained Wasserstein gradient is automatically affine and the covariance follows the Bures-type ODE (209). When the first variation of $f$ is quadratic, this constrained gradient coincides with the full Wasserstein gradient.

Non-variational Gaussian-preserving flows.¶

The last examples are not ordinary gradient flows of a fixed scalar energy on the full Wasserstein space. They preserve Gaussianity because the prescribed velocity field is affine when evaluated on Gaussian measures.

Example: Flow matching and diffusion paths between Gaussians

Consider a prescribed Gaussian interpolation $\alpha_t=\Gaussian(m_t,\Sigma_t)$. Proposition (Affine velocities preserve Gaussianity) shows that an exact flow-matching velocity can be taken affine:

v_t(x)=\dot m_t+A_t(x-m_t), \qquad A_t\Sigma_t+\Sigma_t A_t=\dot\Sigma_t.

(214)

In the isotropic case $\Sigma_t=s_t^2\Id$, this reduces to the transparent formula

v_t(x)=\dot m_t+\frac{\dot s_t}{s_t}(x-m_t).

(215)

For instance, the diffusion noising path

X_t=a_tX_0+\sigma_t Z,\qquad Z\sim\Gaussian(0,\Id),

(216)

has $m_t=a_tm_0$ and $\Sigma_t=a_t^2\Sigma_0+\sigma_t^2\Id$ whenever $X_0\sim\Gaussian(m_0,\Sigma_0)$ is independent of $Z$. Thus, in the Gaussian case, diffusion paths and flow-matching paths reduce to the same mean-covariance bookkeeping, although the corresponding training objectives are different.
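In the anisotropic case, the symmetric matrix $A_t$ in (214) solves a Lyapunov equation, which diagonalizes in the eigenbasis of $\Sigma_t$. The sketch below assumes the illustrative schedule $a_t=1-t$, $\sigma_t=t$ for the noising path (216); the helper names `affine_velocity_matrix` and `diffusion_path` are hypothetical. It checks the residual of the Lyapunov equation and recovers the isotropic formula (215) as a special case.

```python
import numpy as np

def affine_velocity_matrix(Sigma, Sigma_dot):
    """Symmetric A solving A Sigma + Sigma A = Sigma_dot, for Sigma positive definite."""
    lam, U = np.linalg.eigh(Sigma)               # Sigma = U diag(lam) U^T
    C = U.T @ Sigma_dot @ U                      # right-hand side in the eigenbasis
    A_eig = C / (lam[:, None] + lam[None, :])    # entrywise: A_ij (lam_i + lam_j) = C_ij
    return U @ A_eig @ U.T

def diffusion_path(t, Sigma0):
    """Covariance and its time derivative for a_t = 1 - t, sigma_t = t (assumed schedule)."""
    a, da, sig, dsig = 1.0 - t, -1.0, t, 1.0
    I = np.eye(len(Sigma0))
    Sigma = a**2 * Sigma0 + sig**2 * I
    Sigma_dot = 2 * a * da * Sigma0 + 2 * sig * dsig * I
    return Sigma, Sigma_dot

t = 0.3
Sigma0 = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_t, Sigma_dot_t = diffusion_path(t, Sigma0)
A_t = affine_velocity_matrix(Sigma_t, Sigma_dot_t)
lyapunov_residual = np.abs(A_t @ Sigma_t + Sigma_t @ A_t - Sigma_dot_t).max()

# Isotropic special case: Sigma_0 = s0^2 Id gives A_t = (dot s_t / s_t) Id.
s0_sq = 4.0
Sigma_iso, Sigma_dot_iso = diffusion_path(t, s0_sq * np.eye(2))
A_iso = affine_velocity_matrix(Sigma_iso, Sigma_dot_iso)
s_sq = (1 - t)**2 * s0_sq + t**2                 # s_t^2
s_ds = -(1 - t) * s0_sq + t                      # s_t * dot s_t
print("Lyapunov residual:", lyapunov_residual)
print("isotropic ratio dot s / s:", s_ds / s_sq, "vs", A_iso[0, 0])
```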

Example: Gaussian kernel drifting

Let the target be $\gamma=\Gaussian(\bar m,\bar\Sigma)$ and assume $\al_t=\Gaussian(m_t,\Sigma_t)$. For the Gaussian kernel

K_\epsilon(x,y)=\exp(-\norm{x-y}^2/(2\epsilon)),

(217)

the normalized field (76) satisfies

B_\epsilon[\al_t](x) = -\epsilon(\Sigma_t+\epsilon\Id)^{-1}(x-m_t).

(218)

Indeed, the smoothed density $x\mapsto\int K_\epsilon(x,y)\d\al_t(y)$ is proportional to the Gaussian density with mean $m_t$ and covariance $\Sigma_t+\epsilon\Id$. Thus $B_\epsilon[\al_t]$ is the mean-shift vector of a Gaussian density: it points linearly toward the Gaussian mode, with strength set by the bandwidth. The drifting velocity (78) is therefore the difference of two affine mean-shift fields; it is affine and preserves Gaussianity. With

A_t=(\Sigma_t+\epsilon\Id)^{-1}, \qquad \bar A=(\bar\Sigma+\epsilon\Id)^{-1},

(219)

the ODE is

\dot m_t=\epsilon\bar A(\bar m-m_t), \qquad \dot\Sigma_t=\epsilon\bigl((A_t-\bar A)\Sigma_t+\Sigma_t(A_t-\bar A)\bigr).

(220)

This finite-dimensional model explains the stabilizing role of the self-normalized repulsion term in drifting: without it, the covariance equation loses the $A_t\Sigma_t+\Sigma_tA_t$ contribution.
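The role of the repulsion term can be observed by integrating (220) directly. The sketch below is a minimal explicit Euler integration with illustrative values for $\epsilon$, the target and the initial law; the flag `self_term` is a hypothetical switch that drops the $A_t\Sigma_t+\Sigma_tA_t$ contribution. With the term, the covariance approaches $\bar\Sigma$ in this run; without it, the covariance collapses toward zero.

```python
import numpy as np

def drifting_rhs(m, Sigma, m_bar, Sigma_bar, eps, self_term=True):
    """Right-hand side of (220); self_term=False removes the A_t contribution."""
    I = np.eye(len(m))
    A_bar = np.linalg.inv(Sigma_bar + eps * I)
    A = np.linalg.inv(Sigma + eps * I) if self_term else np.zeros_like(A_bar)
    M = A - A_bar
    return eps * A_bar @ (m_bar - m), eps * (M @ Sigma + Sigma @ M)

def integrate(m, Sigma, m_bar, Sigma_bar, eps, dt, n_steps, self_term=True):
    """Explicit Euler integration of the mean-covariance ODE."""
    for _ in range(n_steps):
        dm, dSigma = drifting_rhs(m, Sigma, m_bar, Sigma_bar, eps, self_term)
        m, Sigma = m + dt * dm, Sigma + dt * dSigma
    return m, Sigma

eps = 1.0
m_bar = np.array([1.0, -1.0])
Sigma_bar = np.array([[2.0, 0.5], [0.5, 1.0]])
m0, Sigma0 = np.zeros(2), 0.5 * np.eye(2)

m_T, Sigma_T = integrate(m0, Sigma0, m_bar, Sigma_bar, eps, dt=0.01, n_steps=5000)
_, Sigma_collapsed = integrate(m0, Sigma0, m_bar, Sigma_bar, eps,
                               dt=0.01, n_steps=5000, self_term=False)

print("mean error:            ", np.abs(m_T - m_bar).max())
print("covariance error:      ", np.abs(Sigma_T - Sigma_bar).max())
print("covariance w/o A_t term:", np.abs(Sigma_collapsed).max())
```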

Example: Gaussian closure of attention dynamics

For the transformer PDE, assume $\alpha=\Gaussian(m,\Sigma)$. Since exponential tilting preserves Gaussianity,

\frac{\int e^{\dotp{Qx}{Ky}}\,y\,\d\alpha(y)} {\int e^{\dotp{Qx}{Kz}}\,\d\alpha(z)} = m+\Sigma K^\top Qx.

(221)

Therefore

\Gamma_\theta[\alpha](x)=Vm+V\Sigma K^\top Qx

(222)

is affine. The Gaussian token law is preserved and satisfies

\dot m_t=(V_t+V_t\Sigma_tK_t^\top Q_t)m_t, \qquad \dot\Sigma_t=B_t\Sigma_t+\Sigma_tB_t^\top, \qquad B_t=V_t\Sigma_tK_t^\top Q_t.

(223)

When $V_t=Q_t^\top K_t$, the matrix $B_t=Q_t^\top K_t\Sigma_tK_t^\top Q_t$ is symmetric positive semidefinite, matching the gradient-field case mentioned above. This closure is not a convergence theorem for trained transformers. It is instead a tractable model of how attention can shear, amplify or contract a cloud of tokens through its covariance.
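The tilting identity (221) can be checked by self-normalized Monte Carlo, which is exactly what softmax attention computes over a finite sample of tokens. The sketch below uses small illustrative matrices $Q$, $K$, a query point $x$ and a fixed random seed, all of which are assumptions of this example. It also verifies that $B=V\Sigma K^\top Q$ is symmetric positive semidefinite for the choice $V=Q^\top K$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = np.array([0.5, -0.2])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
Q = 0.4 * np.array([[1.0, 0.5], [0.0, 1.0]])
K = 0.5 * np.array([[1.0, 0.0], [-0.5, 1.0]])
x = np.array([1.0, -0.5])

# Softmax-weighted average of tokens y ~ N(m, Sigma) with logits <Qx, Ky>.
y = rng.multivariate_normal(m, Sigma, size=200_000)
logits = y @ (K.T @ (Q @ x))
w = np.exp(logits - logits.max())            # shift for numerical stability
mc_mean = (w[:, None] * y).sum(axis=0) / w.sum()

# Closed form (221): the tilted Gaussian has mean m + Sigma K^T Q x.
closed_form = m + Sigma @ K.T @ Q @ x
print("Monte Carlo:", mc_mean)
print("closed form:", closed_form)

# Gradient-field case V = Q^T K: B is symmetric positive semidefinite.
V = Q.T @ K
B = V @ Sigma @ K.T @ Q
min_eig = np.linalg.eigvalsh(0.5 * (B + B.T)).min()
print("B symmetric:", np.allclose(B, B.T), " min eigenvalue:", min_eig)
```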

Contractive Gaussian projection.¶

The preceding examples show when Gaussianity is preserved or imposed by projection. Gelbrich’s inequality Gelbrich, 1990 gives a useful variational explanation: replacing a measure by the Gaussian with the same first two moments cannot increase its Wasserstein distance to another similarly projected measure.
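In one dimension this contraction can be checked directly on samples. The squared distance $\Wass_2^2$ between two empirical measures with the same number of atoms is obtained by sorting, and between two one-dimensional Gaussians it equals $(m_1-m_2)^2+(s_1-s_2)^2$. The sketch below uses two arbitrary non-Gaussian samples (an exponential law and a two-component mixture, both chosen only for illustration) and compares the two quantities.

```python
import numpy as np

def w2_squared_empirical_1d(x, y):
    """Squared W2 between two empirical measures with equally many atoms (1D)."""
    return np.mean((np.sort(x) - np.sort(y)) ** 2)

def w2_squared_projected_1d(x, y):
    """Squared W2 between the Gaussians matching the empirical mean and variance."""
    return (x.mean() - y.mean()) ** 2 + (x.std() - y.std()) ** 2

rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=5000)                       # skewed law
y = np.concatenate([rng.normal(-2.0, 0.5, size=2500),     # bimodal law
                    rng.normal(3.0, 1.0, size=2500)])

full = w2_squared_empirical_1d(x, y)
projected = w2_squared_projected_1d(x, y)
print("W2^2 between samples:            ", full)
print("W2^2 between Gaussian projections:", projected)
```

The inequality `projected <= full` holds exactly for the empirical measures here, since the sorted coupling has covariance at most the product of the two standard deviations.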

Proposition (Nondegenerate Gaussian inputs remain Gaussian) is an earlier application of this contraction: because every Gaussian input is fixed by $\mathcal R$, projecting any competitor cannot increase the quadratic Wasserstein barycenter objective.

The following preservation criterion is a direct consequence of Gelbrich’s theorem and was explained to us by Hugo Lavenant. It says that a functional which does not increase under moment-matched Gaussian projection admits Gaussian minimizing movements from Gaussian initial data.

Theorem: Gaussian-preservation criterion (Hugo Lavenant)

Let $f:\Pp_2(\RR^d)\to(-\infty,+\infty]$ satisfy

f(\mathcal R\al)\leq f(\al) \qquad\forall\al\in\Pp_2(\RR^d),

(228)

with $\mathcal R$ defined in Theorem (Gelbrich theorem). If $\gamma$ is Gaussian and $\nu$ minimizes the JKO step

\eta\mapsto f(\eta)+\frac1{2\tau}\Wass_2^2(\gamma,\eta),

(229)

then $\mathcal R\nu$ is also a minimizer. If this JKO minimizer is unique, it is Gaussian. Consequently, if every step from Gaussian data has a unique minimizer and the resulting minimizing movements converge in $\Wass_2$, each discrete iterate is Gaussian and every limit curve is Gaussian as well, possibly with a singular limiting covariance.

Moment closure beyond Gaussianity.¶

Gaussian preservation concerns the full law, whereas moment closure asks only whether selected statistics obey an autonomous system. The first row of Proposition (Gaussian closure catalogue) in fact gives a distribution-free closure. If $f(\alpha)=g(m_\alpha,\Sigma_\alpha)$ and $G=\nabla_\Sigma g(m_\alpha,\Sigma_\alpha)$ denotes the symmetric Frobenius gradient, then the Wasserstein descent velocity is $v_\alpha(x)=-\nabla_m g(m_\alpha,\Sigma_\alpha)-2G(x-m_\alpha)$. It is affine in $x$ for every $\alpha$, not only for Gaussian laws. Consequently, every sufficiently regular flow starting from an arbitrary $\alpha_0\in\Pp_2(\RR^d)$ satisfies

\dot m_t=-\nabla_m g(m_t,\Sigma_t), \qquad \dot\Sigma_t = -2\bigl(\Sigma_t\nabla_\Sigma g(m_t,\Sigma_t) +\nabla_\Sigma g(m_t,\Sigma_t)\Sigma_t\bigr).

(231)

For Gaussian initial data, this affine velocity also preserves Gaussianity; for general initial data, the law need not become Gaussian, but its mean and covariance still follow the same closed vector field $(h,H)$ listed in that proposition.
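The distribution-free closure can be observed on particles. The sketch below assumes the illustrative energy $g(m,\Sigma)=\frac12\norm{m}^2+\frac14\tr(\Sigma^2)$, for which $\nabla_\Sigma g=\frac12\Sigma$, the descent velocity is $v_\alpha(x)=-m_\alpha-\Sigma_\alpha(x-m_\alpha)$, and (231) reads $\dot m_t=-m_t$, $\dot\Sigma_t=-2\Sigma_t^2$ with solution $\Sigma_t=\Sigma_0(\Id+2t\Sigma_0)^{-1}$. The bimodal initial cloud, the particle count and the Euler step are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Clearly non-Gaussian initial law: a two-component mixture in the plane.
x = np.concatenate([rng.normal([-1.5, 0.0], 0.3, size=(n // 2, 2)),
                    rng.normal([1.0, 1.0], 0.5, size=(n // 2, 2))])
m0 = x.mean(axis=0)
Sigma0 = np.cov(x, rowvar=False, bias=True)

T, dt = 1.0, 1e-3
for _ in range(int(T / dt)):
    m = x.mean(axis=0)
    Sigma = np.cov(x, rowvar=False, bias=True)
    # Affine descent velocity v(x) = -m - Sigma (x - m), applied to every particle.
    x = x + dt * (-m - (x - m) @ Sigma)

m_particles = x.mean(axis=0)
Sigma_particles = np.cov(x, rowvar=False, bias=True)

# Closed moment system: dot m = -m, dot Sigma = -2 Sigma^2.
m_closed = np.exp(-T) * m0
Sigma_closed = Sigma0 @ np.linalg.inv(np.eye(2) + 2.0 * T * Sigma0)
print("mean gap:      ", np.abs(m_particles - m_closed).max())
print("covariance gap:", np.abs(Sigma_particles - Sigma_closed).max())
```

The particle cloud stays bimodal, yet its first two moments follow the closed system up to the time-discretization error.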

This suggests asking which linear statistics $\alpha\mapsto\int\varphi\,\d\alpha$ admit an exact closure. For scalar statistics, the answer is completely characterized by an eikonal equation.

Proposition: Scalar moment closure and eikonal characterization

Let $\Omega\subset\RR^d$ be connected and open, let $\varphi\in C^2(\Omega)$, and set $m_\varphi(\alpha)\eqdef\int_\Omega\varphi(x)\d\alpha(x)$ for probability measures such that $\varphi$ and $\norm{\nabla\varphi}^2$ are integrable. For every smooth $g:\RR\to\RR$, the Wasserstein gradient flow of $f_g(\alpha)=g(m_\varphi(\alpha))$ satisfies, whenever the differentiation is justified,

\dot m_t = -g'(m_t)\int_\Omega\norm{\nabla\varphi(x)}^2\d\alpha_t(x), \qquad m_t=m_\varphi(\alpha_t).

(232)

This equation is autonomous for every $g$, meaning that its right-hand side depends on $\alpha_t$ only through $m_t$, if and only if there are constants $a,b\in\RR$ such that

\norm{\nabla\varphi(x)}^2=a+b\varphi(x) \qquad (x\in\Omega).

(233)

In that case, writing $m_t=m_\varphi(\alpha_t)$, one has $\dot m_t=-(a+b m_t)g'(m_t)$.

Moreover, on every connected open set $U\subset\{x:\nabla\varphi(x)\neq0\}$, the restriction of (233) to $U$ is equivalent to the existence of an eikonal coordinate $r\in C^2(U)$ and constants $c,\kappa,\lambda\in\RR$ such that

\norm{\nabla r}=1, \qquad \varphi=c+\kappa r+\frac{\lambda}{2}r^2.

(234)

Under (233), one may take $\lambda=b/2$.

Example: A closed signed-distance moment

Let $\mathcal S$ be a $C^2$ hypersurface and let $r_{\mathcal S}$ be its signed distance on a sufficiently small tubular neighborhood $U$, where $\norm{\nabla r_{\mathcal S}}=1$. For $\varphi(x)=r_{\mathcal S}(x)+\lambda r_{\mathcal S}(x)^2/2$, one has $\nabla\varphi=(1+\lambda r_{\mathcal S})\nabla r_{\mathcal S}$, hence $\norm{\nabla\varphi}^2=(1+\lambda r_{\mathcal S})^2=1+2\lambda\varphi$. Consequently, for any solution whose support remains in $U$, the scalar moment obeys $\dot m_t=-(1+2\lambda m_t)g'(m_t)$ under the objective $f_g(\alpha)=g(\int\varphi\,\d\alpha)$.

For $\lambda=0$ and a genuinely curved $\mathcal S$, the observable is generally non-polynomial and the descent velocity is proportional to the non-affine normal field $\nabla r_{\mathcal S}$. This closure therefore does not arise from the mean-covariance mechanism above.
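The identity $\norm{\nabla\varphi}^2=1+2\lambda\varphi$ can be verified numerically for a concrete hypersurface. The sketch below assumes $\mathcal S$ is the circle of radius $R$ in $\RR^2$, with signed distance $r_{\mathcal S}(x)=\norm{x}-R$ taken positive outside, and an arbitrary value of $\lambda$. The gradient is computed by central finite differences so that the check does not rely on the closed-form gradient.

```python
import numpy as np

R, lam = 1.0, 0.7        # circle radius and quadratic coefficient (illustrative)

def phi(x):
    """phi = r + lam r^2 / 2 with r the signed distance to the circle of radius R."""
    r = np.linalg.norm(x, axis=-1) - R
    return r + 0.5 * lam * r**2

def grad_phi_fd(x, h=1e-5):
    """Central finite-difference gradient of phi at each row of x."""
    g = np.zeros_like(x)
    for k in range(x.shape[-1]):
        e = np.zeros(x.shape[-1])
        e[k] = h
        g[..., k] = (phi(x + e) - phi(x - e)) / (2.0 * h)
    return g

rng = np.random.default_rng(0)
# Sample points in the tubular neighborhood 0.5 < |x| < 1.5 of the circle.
angles = rng.uniform(0.0, 2.0 * np.pi, size=1000)
radii = rng.uniform(0.5, 1.5, size=1000)
x = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)

lhs = np.sum(grad_phi_fd(x) ** 2, axis=1)     # |grad phi|^2
rhs = 1.0 + 2.0 * lam * phi(x)                # a + b phi with a = 1, b = 2 lam
max_error = np.max(np.abs(lhs - rhs))
print("max |  |grad phi|^2 - (1 + 2 lam phi)  | =", max_error)
```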

References¶

Peyré, G. (2025). Optimal and Diffusion Transports in Machine Learning. arXiv Preprint arXiv:2512.06797.
Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6, 695–709.
Vincent, P. (2011). A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7), 1661–1674.
Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. Proceedings of the 32nd International Conference on Machine Learning, 37, 2256–2265.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. Advances in Neural Information Processing Systems, 32.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. International Conference on Learning Representations.
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. International Conference on Learning Representations.
Liu, X., Gong, C., & Liu, Q. (2023). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. International Conference on Learning Representations.
Albergo, M. S., Boffi, N. M., & Vanden-Eijnden, E. (2025). Stochastic Interpolants: A Unifying Framework for Flows and Diffusions. Journal of Machine Learning Research, 26(209), 1–80.
Efron, B. (2011). Tweedie’s Formula and Selection Bias. Journal of the American Statistical Association, 106(496), 1602–1614. 10.1198/jasa.2011.tm11181
Hertrich, J., Chambolle, A., & Delon, J. (2025). On the Relation between Rectified Flows and Optimal Transport. Advances in Neural Information Processing Systems.
Lavenant, H., & Santambrogio, F. (2022). The Flow Map of the Fokker–Planck Equation Does Not Provide Optimal Transport. Applied Mathematics Letters, 133, 108225.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 214–223). PMLR. https://proceedings.mlr.press/v70/arjovsky17a.html
Dziugaite, G. K., Roy, D. M., & Ghahramani, Z. (2015). Training generative neural networks via maximum mean discrepancy optimization. Uncertainty in Artificial Intelligence-Proceedings of the 31st Conference, UAI 2015, 258–267.
