Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Generative Models via Transportation

The preceding gradient-flow calculus is variational. Modern machine-learning models often use the same transportation language more broadly: one may prescribe an interpolation and regress its velocity, fit a one-step generator to a descent field, or view network depth as a continuous transport of token measures. The examples below separate what is genuinely a Wasserstein gradient flow from what is a transportation dynamics with a useful geometric interpretation.

Generative Models via Flow Matching

Flow matching constructs a generative map by learning the velocity field of an interpolation. The key computational insight is that a constrained continuity-equation problem can be trained by an unconstrained regression.

Generative models aim to build a transportation map TT between a reference distribution α\alpha (typically an isotropic Gaussian) and the target data distribution β\beta. Since such reference measures are non-atomic, a measurable map with Tα=βT_\sharp\alpha=\beta exists on standard Borel spaces, for instance by identifying both probability spaces with the unit interval and using a quantile-type rearrangement. This abstract existence statement is much weaker than having an explicit and numerically stable construction of TT. Optimal transport is one approach to achieving this, but it is computationally expensive and raises questions about how to estimate it from samples. A different route is to prescribe an interpolation between noise and data, learn its velocity, and obtain TT by integrating a time-dependent vector field vtv_t. This point of view sits at the meeting point of two literatures, surveyed from a transport perspective in Peyré, 2025. The diffusion branch builds on score matching Hyvärinen, 2005, denoising score matching Vincent, 2011, nonequilibrium noising chains Sohl-Dickstein et al., 2015, denoising diffusion probabilistic models Ho et al., 2020, score-based generative modeling Song & Ermon, 2019, and the continuous-time score-SDE/probability-flow formulation Song et al., 2021. The deterministic regression branch was introduced, essentially in parallel, under three closely related names: flow matching Lipman et al., 2023, rectified flow Liu et al., 2023, and stochastic interpolants Albergo et al., 2025. In all three cases, the computational object is a velocity field whose regression loss avoids simulating the learned ODE during training. This vector field vtv_t is obtained by constructing an interpolation αt\alpha_t and then finding vtv_t using the least-squares formula of the dynamic chapter. As we will explain, for a specific class of interpolation (obtained by a parametric push-forward), this vtv_t can be obtained by avoiding explicitly inverting a Laplacian and instead computing a simple conditional expectation. This conditional expectation can itself be estimated by solving another least-squares problem, but this time unconstrained, making the estimation feasible from finite samples of α\alpha and β\beta.

Stochastic interpolant.

We assume that αt\alpha_t is defined via a “projection” (in a loose sense) of a latent distribution πP(Rd)\pi \in \Pp(\RR^{d'}), using an operator Pt:RdRdP_t : \RR^{d'} \to \RR^d where ddd' \gg d, i.e.

t[0,1],αt:=(Pt)π.\forall t \in [0,1], \quad \alpha_t := (P_t)_\sharp \pi.

A common case is d=2dd'=2d. We write (x,y)Rd=Rd×Rd(x,y) \in \RR^{d'} = \RR^d \times \RR^d and assume P0(x,y)=xP_0(x,y)=x and P1(x,y)=yP_1(x,y)=y, so that π\pi is a probabilistic coupling between α0\alpha_0 and α1\alpha_1, i.e. π\pi has marginals (α0,α1)(\alpha_0, \alpha_1). For instance, one can use π=α0α1\pi = \alpha_0 \otimes \alpha_1, the trivial coupling. An even more special case is the linear interpolation Pt(x,y)=(1t)x+tyP_t(x,y)=(1-t)x+ty. With this linear path and an arbitrary coupling π\pi, the regression below is the common core of flow matching and rectified flow: Lipman et al. emphasize conditional probability paths and simulation-free training of continuous normalizing flows, while rectified flow emphasizes straight couplings, reflow, and the possibility of reducing transport costs and discretization error Lipman et al., 2023Liu et al., 2023. More complex constructions are also possible, provided that sampling from π\pi remains simple; stochastic interpolants allow additional latent variables and stochastic perturbations, thereby unifying deterministic flows, probability-flow ODEs and diffusion SDEs Albergo et al., 2025.

If π=αβ\pi = \alpha \otimes \beta and α=1niδxi\alpha = \frac{1}{n} \sum_i \delta_{x_i}, β=1mjδyj\beta = \frac{1}{m} \sum_j \delta_{y_j}, then αt\alpha_t consists of n×mn \times m Dirac masses

αt=1nmi,jδPt(xi,yj).\alpha_t = \frac{1}{nm} \sum_{i,j} \delta_{P_t(x_i,y_j)}.

If π=(Id,T)α\pi = (\Id, T)_\sharp \alpha is a Brenier-type coupling, then αt=((1t)Id+tT)α\alpha_t = ((1-t)\Id + tT)_\sharp \alpha is the so-called McCann OT interpolation.

<IPython.core.display.Image object>

Flow matching interpolants between the same empirical source and target measures. A product-style random pairing produces crossing paths, an OT pairing gives direct displacement rays, and a curved bridge changes the path geometry while keeping the same endpoints. Gray arrows mark representative midpoint velocities tPt\partial_tP_t.

Interactive panel. Use the interpolation and noise controls to compare flow-matching paths between source noise and target structure.

Flow matching formula.

This interpolation is not directly useful for sampling from β\beta, but it can be used to define a flow field vtv_t so that the continuity equation, in Eulerian form, holds. This flow field is computed by solving an unconstrained least-squares problem, or equivalently, it is a conditional expectation.

Proof

We first recall the two equivalent ways of writing the interpolated measure. Formally, one may write

αt(z)=Rdδ(zPt(u))dπ(u),\alpha_t(z)=\int_{\RR^{d'}}\delta(z-P_t(u))\,\d\pi(u),

while the rigorous meaning is that, for every smooth test function φ\varphi,

Rdφ(z)dαt(z)=Rdφ(Pt(u))dπ(u).\int_{\RR^d}\varphi(z)\,\d\alpha_t(z) = \int_{\RR^{d'}}\varphi(P_t(u))\,\d\pi(u).

The minimizer in (3) is the orthogonal projection in L2(π;Rd)L^2(\pi;\RR^d) of the latent velocity tPt(u)\partial_tP_t(u) onto the closed subspace of functions that depend on uu only through Pt(u)P_t(u). This projection is the conditional expectation (4). Formally, this can be read as

vt(z)=1αt(z)Rdδ(zPt(u))[tPt](u)dπ(u),v_t(z)=\frac{1}{\alpha_t(z)} \int_{\RR^{d'}}\delta(z-P_t(u))[\partial_tP_t](u)\,\d\pi(u),

and rigorously it means that, for every smooth test vector field mm,

m(z),vt(z)dαt(z)=m(Pt(u)),[tPt](u)dπ(u).\int \dotp{m(z)}{v_t(z)} \, \d\alpha_t(z) = \int \dotp{m(P_t(u))}{[\partial_t P_t](u)} \, \d\pi(u).

We now prove that this field transports the curve (αt)t(\alpha_t)_t. The weak form of tαt+div(αtvt)=0\partial_t\alpha_t+\operatorname{div}(\alpha_t v_t)=0 is that, for every smooth scalar test function φ\varphi,

ddtφ(z)dαt(z)vt(z),φ(z)dαt(z)=0.\frac{\d}{\d t}\int\varphi(z)\,\d\alpha_t(z) - \int\dotp{v_t(z)}{\nabla\varphi(z)}\,\d\alpha_t(z) =0.

Using (6) and differentiating under the integral sign gives

ddtφ(z)dαt(z)=φ(Pt(u)),[tPt](u)dπ(u).\frac{\d}{\d t}\int \varphi(z)\d\alpha_t(z) = \int \dotp{\nabla\varphi(P_t(u))}{[\partial_t P_t](u)}\d\pi(u).

On the other hand, applying (8) with m=φm=\nabla\varphi gives

vt(z),φ(z)dαt(z)=φ(Pt(u)),[tPt](u)dπ(u).\int\dotp{v_t(z)}{\nabla\varphi(z)}\,\d\alpha_t(z) = \int \dotp{\nabla\varphi(P_t(u))}{[\partial_t P_t](u)}\d\pi(u).

Comparing (10) and (11) yields (9), which is the desired continuity equation.

The conditional expectation in (4) has a simple measure-theoretic meaning. Let αt=(Pt)π\alpha_t=(P_t)_\sharp\pi and define the vector-valued measure mtm_t on Rd\RR^d by

Rdψ(z),dmt(z):=Rdψ(Pt(u)),[tPt](u)dπ(u)\int_{\RR^d}\dotp{\psi(z)}{\d m_t(z)} \eqdef \int_{\RR^{d'}}\dotp{\psi(P_t(u))}{[\partial_tP_t](u)}\d\pi(u)

for every bounded continuous vector field ψ\psi. Since αt(A)=0\alpha_t(A)=0 implies π(Pt1(A))=0\pi(P_t^{-1}(A))=0, one has mtαtm_t\ll\alpha_t. The Radon--Nikodym decomposition of mtm_t with respect to αt\alpha_t is therefore

dmt(z)=vt(z)dαt(z),vt=dmtdαt.\d m_t(z)=v_t(z)\d\alpha_t(z), \qquad v_t=\frac{\d m_t}{\d\alpha_t}.

In the language of Lebesgue decomposition, the flux measure mtm_t has only an absolutely continuous part with respect to αt\alpha_t and no singular part; the conditional expectation is precisely this density. Equivalently, disintegrating π\pi with respect to the map PtP_t gives π(du)=πt,z(du)αt(dz)\pi(\d u)=\pi_{t,z}(\d u)\alpha_t(\d z), where πt,z\pi_{t,z} is supported on the fiber {u:Pt(u)=z}\{u\,:\,P_t(u)=z\}, and

vt(z)={Pt(u)=z}[tPt](u)dπt,z(u).v_t(z)=\int_{\{P_t(u)=z\}}[\partial_tP_t](u)\d\pi_{t,z}(u).

Thus the solution of (3) is the conditional expectation of the velocities tPt\partial_t P_t: intuitively, vt(z)v_t(z) is the average velocity of all trajectories passing through zz. Numerically, (x,t)vt(x)(x,t) \to v_t(x) can be parameterized by a neural network (e.g., a U-Net for vision tasks) and estimated using stochastic gradient descent on the objective in (3). For the exact field vtv_t, integrating the ODE x˙=vt(x)\dot{x}=v_t(x) defines a transport map TtT_t. If vtv_t is regular enough, or more generally if the continuity equation has a unique solution for this velocity, then (Tt)α0=αt(T_t)_\sharp\alpha_0=\alpha_t. Thus the same interpolation as (1) is represented by a deterministic flow rather than by the original coupling. The sampling procedure consists in first drawing X0αX_0 \sim \alpha, and then integrating the ODE X˙t=vt(Xt)\dot{X}_t = v_t(X_t) starting with Xt=0=X0X_{t=0} = X_0. In the ideal exact-field limit, the resulting Xt=1X_{t=1} is distributed according to α1=β\alpha_1 = \beta.

Connection with diffusion models.

In the special case where Pt(x,y)=(1t)x+tyP_t(x,y)=(1-t)x+ty is a linear interpolation and π=αβ\pi = \alpha \otimes \beta, the curve αt\alpha_t is a convolution of rescaled versions of α0\alpha_0 and α1\alpha_1. The flow-matching problem (3) becomes

min(vt)tRd×Rdvt((1t)x+ty)(yx)2dα0(x)dα1(y).\min_{(v_t)_t} \int_{\RR^{d} \times \RR^d} \norm{v_t( (1-t)x+t y ) - (y-x) }^2 \, \d\alpha_0(x) \d\alpha_1(y).

When one endpoint is an isotropic Gaussian, this construction is closely related to the probability-flow formulation of diffusion models, up to the usual change of time parametrization Song et al., 2021. This is why flow matching can be viewed both as a deterministic alternative to diffusion training and as a common language for diffusion paths, OT-inspired paths, and rectified paths Lipman et al., 2023Liu et al., 2023Albergo et al., 2025. The next two propositions are written in the noising direction, from a data law α\alpha to a Gaussian; reversing time gives the corresponding sampling flow. They also give an explicit closed form for vtv_t and show that it is a gradient field. In this setting, vtv_t is also the solution of the constrained least-squares problem from the dynamic chapter. The regression (3) is computationally simpler because the continuity equation has already been enforced by the chosen interpolant. To prove this, we rely on Tweedie’s formula, which expresses the optimal Gaussian denoiser through the score, i.e. the gradient of the log-density.

Proof

Bayes’ rule gives the conditional density pWZ(wz)=β(w)φσ(zw)βσ(z)p_{W|Z}(w\mid z) = \dfrac{\beta(w)\,\varphi_\sigma(z-w)}{\beta_\sigma(z)} with φσ\varphi_\sigma the N(0,σ2Id)\Gaussian(0,\sigma^{2}I_{d}) density. Hence

E[WZ=z]=1βσ(z)Rdwβ(w)φσ(zw)dw.\EE[W\mid Z=z] = \frac{1}{\beta_\sigma(z)} \int_{\RR^{d}} w\, \beta(w)\,\varphi_\sigma(z-w)\,\d w .

Differentiating the Gaussian convolution under the integral sign and using zφσ(zw)=σ2(zw)φσ(zw)\nabla_z\varphi_\sigma(z-w) = -\sigma^{-2}(z-w)\,\varphi_\sigma(z-w) yields

zβσ(z)=β(w)zφσ(zw)dw=σ2(zE[WZ=z])βσ(z).\nabla_z\beta_\sigma(z) = \int \beta(w)\,\nabla_z\varphi_\sigma(z-w)\,\d w = -\sigma^{-2}\Bigl(z-\EE[W\mid Z=z]\Bigr)\,\beta_\sigma(z).

Rearranging finishes the proof.

Proof

Fix t(0,1)t\in(0,1) and write W=(1t)XW=(1-t)X, σ=t\sigma=t, so that Zt=W+σYZ_t = W + \sigma\,Y matches the setting of Tweedie’s identity. Conditional expectations satisfy v(z,t)=E[YXZt=z]=1tE[ZtWZt=z]11tE[WZt=z].v^\star(z,t) = \EE[Y-X\mid Z_t=z] = \frac{1}{t}\,\EE[Z_t-W\mid Z_t=z] -\,\frac{1}{1-t}\,\EE[W\mid Z_t=z]. Applying Tweedie’s identity to E[WZt=z]\EE[W\mid Z_t=z] and noting E[YZt=z]=tlogαt(z)\EE[Y\mid Z_t=z] = -\,t\,\nabla\log\alpha_t(z) gives the claimed formula.

<IPython.core.display.Image object>

One-dimensional diffusion bridge for a Gaussian-mixture data law. The forward path Zt=(1t)X+tYZ_t=(1-t)X+tY smooths the red data density toward a blue Gaussian endpoint. Reversing the probability-flow ODE transports a denser set of blue noise samples back toward the data modes, making the splitting of trajectories across mixture components visible.

Interactive panel. Use the noising time and schedule controls to see the one-dimensional forward and reverse diffusion bridge.

The same probability-flow intuition is visible in two dimensions. For a discrete data law, or more generally for a Gaussian mixture, the noising density is a Gaussian mixture whose score can be evaluated explicitly. This makes it possible to draw backward trajectories without training a neural network. In the plots below, the Gaussian endpoint has covariance σ2Id\sigma^2\Id to keep the geometry visible at the scale of the three atoms. For a scalar noising schedule Zt=atX+btYZ_t=a_tX+b_tY, the intermediate law has component centers atcja_t c_j and covariance (btσ)2Id(b_t\sigma)^2\Id. For the linear bridge, pt(z)=jwjN((1t)cj,(tσ)2Id)p_t(z)=\sum_j w_j\Gaussian((1-t)c_j,(t\sigma)^2\Id), with st=logpts_t=\nabla\log p_t, and the scaled Gaussian-endpoint field gives vt(z)=(z+tσ2st(z))/(1t)v_t(z)=-(z+t\sigma^2s_t(z))/(1-t).

<IPython.core.display.Image object>

Two-dimensional noising paths from three Dirac masses to a single Gaussian. The linear interpolation Zt=(1t)X+tYZ_t=(1-t)X+tY moves component centers linearly toward the origin and grows covariance like (tσ)2Id(t\sigma)^2\Id. The variance-preserving Ornstein--Uhlenbeck bridge has the same endpoints but a different speed of contraction and noising.

Interactive panel. Use the schedule and time controls to watch two-dimensional samples blur forward and concentrate backward.

When is the induced map optimal?

Integrating the learned velocity gives a deterministic map from α0\alpha_0 to α1\alpha_1, but this map is not automatically the Brenier optimal map. It is optimal only in special cases where the accumulated flow remains the gradient of a convex potential. The Gaussian product-coupling case already shows the precise obstruction: the interpolated covariances are simple, the velocity is affine, but the terminal map can contain a hidden rotational part. This phenomenon, and its extensions to rectified flows and mixtures, is analyzed in depth in Hertrich et al., 2025.

Proof

The conditional-expectation formula gives

vt(z)=E[X1X0Zt=z].v_t(z)=\EE[X_1-X_0\mid Z_t=z].

Since all variables are jointly Gaussian, this conditional expectation is linear and

vt(z)=Cov(X1X0,Zt)Cov(Zt)1z=(tΣ1(1t)Σ0)Σt1z,v_t(z) = \operatorname{Cov}(X_1-X_0,Z_t)\operatorname{Cov}(Z_t)^{-1}z = \bigl(t\Sigma_1-(1-t)\Sigma_0\bigr)\Sigma_t^{-1}z,

which proves (27). To solve the characteristic equation, whiten the source by setting

C=Σ01/2Σ1Σ01/2,Z~t=Σ01/2Zt.C=\Sigma_0^{-1/2}\Sigma_1\Sigma_0^{-1/2}, \qquad \widetilde Z_t=\Sigma_0^{-1/2}Z_t.

In these coordinates the source covariance is Id\Id and

Σ~t=(1t)2Id+t2C.\widetilde\Sigma_t=(1-t)^2\Id+t^2C.

Because Id\Id and CC commute, the affine flow map in whitened coordinates is simply T~t=Σ~t1/2\widetilde T_t=\widetilde\Sigma_t^{1/2}. Indeed,

ddtΣ~t1/2=(tC(1t)Id)Σ~t1/2,\frac{\d}{\d t}\widetilde\Sigma_t^{1/2} = \bigl(tC-(1-t)\Id\bigr)\widetilde\Sigma_t^{-1/2},

which is exactly the equation T~˙t=A~tT~t\dot{\widetilde T}_t=\widetilde A_t\widetilde T_t with T~0=Id\widetilde T_0=\Id. Returning to the original coordinates gives (28), and t=1t=1 gives (29).

Both T1FMT_1^{\rm FM} and TOTT^{\rm OT} push N(0,Σ0)\Gaussian(0,\Sigma_0) to N(0,Σ1)\Gaussian(0,\Sigma_1). The Brenier map between nondegenerate Gaussians is the unique symmetric positive definite linear map with this property. Hence T1FM=TOTT_1^{\rm FM}=T^{\rm OT} if and only if T1FMT_1^{\rm FM} is symmetric positive definite. The map T1FMT_1^{\rm FM} is similar to C1/2C^{1/2}, so if it is symmetric then it is automatically positive definite. It remains to characterize symmetry. Since C1/2C^{1/2} is symmetric positive definite,

(T1FM)=Σ01/2C1/2Σ01/2.(T_1^{\rm FM})^\top = \Sigma_0^{-1/2}C^{1/2}\Sigma_0^{1/2}.

Thus symmetry of T1FMT_1^{\rm FM} is equivalent to Σ0C1/2=C1/2Σ0\Sigma_0 C^{1/2}=C^{1/2}\Sigma_0, hence to Σ0C=CΣ0\Sigma_0 C=C\Sigma_0 by functional calculus. Multiplying this identity on the left and right by Σ01/2\Sigma_0^{1/2} gives Σ0Σ1=Σ1Σ0\Sigma_0\Sigma_1=\Sigma_1\Sigma_0. Conversely, if Σ0\Sigma_0 and Σ1\Sigma_1 commute, they are orthogonally co-diagonalizable, and both (29) and (30) reduce in that basis to the diagonal map with entries λ1,k/λ0,k\sqrt{\lambda_{1,k}/\lambda_{0,k}}. This proves the equivalence.

The Gaussian optimality proposition explains why the statement “flow matching gives an optimal map” is fragile. The same terminal map (29) is obtained for any scalar schedule Zt=atX0+btX1Z_t=a_tX_0+b_tX_1 with the same endpoints, because after whitening the covariance path remains at2Id+bt2Ca_t^2\Id+b_t^2C. Thus changing the speed of a scalar Gaussian bridge, for instance by using an OU schedule, cannot repair the non-optimality created by non-commuting covariances. Commuting covariances reduce the terminal map to independent one-dimensional scalings, whereas non-commuting covariances create a non-symmetric affine map, hence a transport with a rotational or shearing component. More generally, mixture-like paths can create the same obstruction even when every instantaneous velocity looks natural. This distinction is closely related to counterexamples showing that flow maps associated with Fokker--Planck or diffusion-type evolutions do not in general provide optimal transport maps Lavenant & Santambrogio, 2022. In particular, starting from an isotropic Gaussian does not by itself guarantee optimality once the target distribution is non-Gaussian; additional structural assumptions on the path or on the coupling are needed.

Variations on the interpolant.

The geometry of the generated trajectories depends on the chosen interpolant, not only on the two endpoint laws. There is first a harmless ambiguity: a monotone reparametrization Zt=(1λ(t))X+λ(t)YZ_t=(1-\lambda(t))X+\lambda(t)Y of the linear bridge only changes the speed of the flow,

vt(z)=λ(t)vλ(t)lin(z),vrlin(z)=E[YX(1r)X+rY=z].v_t(z)=\lambda'(t)\,v^{\rm lin}_{\lambda(t)}(z), \qquad v^{\rm lin}_{r}(z)=\EE[Y-X\mid (1-r)X+rY=z].

It therefore leaves the spatial integral curves unchanged. Diffusion models use a genuinely different family of noising paths. If

Zt=atX+btY,YN(0,σ2Id),Z_t=a_tX+b_tY,\qquad Y\sim\Gaussian(0,\sigma^2\Id),

then both the mixture centers and the component variances are changed. Writing ptp_t for the density of ZtZ_t and st=logpts_t=\nabla\log p_t, Tweedie’s formula gives, away from times where at=0a_t=0,

vt(z)=atE[XZt=z]+btE[YZt=z]=atatz+(atbt2atbtbt)σ2st(z).v_t(z)=a'_t\,\EE[X\mid Z_t=z]+b'_t\,\EE[Y\mid Z_t=z] =\frac{a'_t}{a_t}z+ \left(\frac{a'_tb_t^2}{a_t}-b'_tb_t\right)\sigma^2s_t(z).

For the linear bridge, at=1ta_t=1-t and bt=tb_t=t, this recovers the formula above. For the variance-preserving Ornstein--Uhlenbeck noising used in diffusion models,

aτ=eτ,bτ=1e2τ,a_\tau=e^{-\tau},\qquad b_\tau=\sqrt{1-e^{-2\tau}},

one obtains the forward probability-flow velocity vτ(z)=zσ2logpτ(z)v_\tau(z)=-z-\sigma^2\nabla\log p_\tau(z). Sampling follows the reverse field z+σ2logpτ(z)z+\sigma^2\nabla\log p_\tau(z) as τ\tau decreases. This is the noising law used in the diffusion trajectory panel below; the trajectories are more curved than for the linear bridge because the centers and variances evolve according to the OU/Fokker--Planck scaling rather than by affine interpolation. Numerically, the integration is stopped at a small positive time before the Dirac endpoint, where the score becomes singular.

The finite-time coefficients at=cos(πt/2)a_t=\cos(\pi t/2) and bt=sin(πt/2)b_t=\sin(\pi t/2) are not a new spatial interpolant: they are exactly the OU coefficients after the time change τ=logcos(πt/2)\tau=-\log\cos(\pi t/2). The schedule comparison below therefore places the OU bridge next to a genuinely different scalar bridge,

at=(1t)(12t),bt=t,a_t=(1-t)(1-2t), \qquad b_t=t,

whose data coefficient changes sign before vanishing. This overshooting bridge is mainly a diagnostic example: it keeps the same endpoints, but its intermediate mixture reflects through the origin and produces visibly different reverse trajectories.

<IPython.core.display.Image object>

Diffusion-style sampling trajectories compared with OT rays in the three-Dirac setting. Red particles are sampled from the centered Gaussian endpoint and transported toward the three blue atoms. The diffusion panel integrates a reverse probability-flow field based on a Gaussian-mixture score, while the OT panel uses straight displacement rays selected by a quadratic matching.

Interactive panel. Use the trajectory and schedule controls to compare curved diffusion sampling paths with straight optimal-transport rays.

<IPython.core.display.Image object>

Effect of the interpolant on the exact reverse flow for the same three-Dirac target and the same Gaussian endpoint. The linear bridge at=1ta_t=1-t, bt=tb_t=t produces almost radial curves. The variance-preserving OU bridge aτ=eτa_\tau=e^{-\tau}, bτ=1e2τb_\tau=\sqrt{1-e^{-2\tau}} changes the relative speed of contraction and noising. The overshooting bridge at=(1t)(12t)a_t=(1-t)(1-2t), bt=tb_t=t is not a time reparameterization of either one and produces a more pronounced bending of the reverse trajectories.

Interactive panel. Use the schedule controls to compare how different noising laws allocate motion over time.

One-Step Generative Models

One-step generative models try to keep the geometric training principle of flows while removing the expensive multi-step integration at sampling time. The idea is to evolve the model distribution during training, but to store the final evolution in a single generator evaluation.

Training a one-step flow.

Let ζ\zeta be a simple latent distribution and let αθ=(Gθ)ζ\alpha_\theta=(G_\theta)_\sharp\zeta be the model distribution. Assume that the target data distribution is β\beta. A Wasserstein-flow construction chooses a discrepancy

Eβ(α),\mathcal E_\beta(\alpha),

for instance a smoothed KL(αβ)\KL(\alpha|\beta), an MMD/IPM loss, or the debiased Sinkhorn divergence Lˉcϵ(α,β)\bar\MK_c^\epsilon(\alpha,\beta) introduced in the Sinkhorn divergence section. The associated formal descent is

tμt+div(μtwt)=0,wt(x)=δαEβ(μt)(x).\partial_t\mu_t+\operatorname{div}(\mu_t w_t)=0, \qquad w_t(x)=-\nabla\delta_\alpha \mathcal E_\beta(\mu_t)(x).

Instead of integrating (44) at inference time, one fits a parametric residual field UηU_\eta along the current model distribution:

minη01 ⁣Uη(t,x)wt(x)2dμt(x)dt.\min_\eta \int_0^1\!\int \norm{U_\eta(t,x)-w_t(x)}^2 \,\d\mu_t(x)\,\d t.

In a particle or generator implementation, the learned residual is then used to update the current generator by

αθ+=(Id+τUη)αθ,or equivalentlyGθ+(z)=Gθ(z)+τUη(Gθ(z)).\alpha_{\theta}^{+} = (\Id+\tau U_\eta)_\sharp \alpha_\theta, \qquad\text{or equivalently}\qquad G_\theta^{+}(z)=G_\theta(z)+\tau U_\eta(G_\theta(z)).

After many training updates, the accumulated generator is evaluated once at test time. This is the organizing principle behind recent one-step methods based on Wasserstein gradient flows: W-Flow uses such a construction with the Sinkhorn divergence as a tractable global discrepancy Han et al., 2026, while drifting methods evolve the generated distribution during training through a fitted vector field and also admit one-step inference Deng et al., 2026. The gradient-flow interpretation of drifting models, and its relation to KL, MMD, sliced-Wasserstein and Sinkhorn-type discrepancies, is analyzed in Gretton et al., 2026He et al., 2026. These ideas are also connected to the Sinkhorn-type normalization dynamics used to model attention in Sinkformers Sander et al., 2022.

Self-corrected drifting fields.

Drifting methods need not start from an exact Wasserstein gradient. They often prescribe an attraction-minus-repulsion field and then regress this field in L2(μt)L^2(\mu_t). A simple continuous version uses a positive kernel Kϵ(x,y)K_\epsilon(x,y) and defines, for any measure ν\nu,

Bϵ[ν](x):=(yx)Kϵ(x,y)dν(y)Kϵ(x,y)dν(y).B_\epsilon[\nu](x) \eqdef \frac{\int (y-x)K_\epsilon(x,y)\,\d\nu(y)} {\int K_\epsilon(x,y)\,\d\nu(y)}.

For the Gaussian kernel Kϵ(x,y)=exp(xy2/(2ϵ))K_\epsilon(x,y)=\exp(-\norm{x-y}^2/(2\epsilon)), this normalized field is a score of a smoothed density:

Bϵ[ν](x)=ϵlog ⁣(Kϵ(x,y)dν(y)).B_\epsilon[\nu](x) = \epsilon\nabla\log\!\left(\int K_\epsilon(x,y)\,\d\nu(y)\right).

The drifting velocity is then

ut(x)=Bϵ[β](x)Bϵ[μt](x)=ϵlogKϵ(x,y)dβ(y)Kϵ(x,y)dμt(y).u_t(x)=B_\epsilon[\beta](x)-B_\epsilon[\mu_t](x) = \epsilon\nabla\log \frac{\int K_\epsilon(x,y)\,\d\beta(y)} {\int K_\epsilon(x,y)\,\d\mu_t(y)}.

The first term pulls samples toward data, while the second term corrects self-attraction and prevents all particles from collapsing onto the same high-density region. Sinkhorn drifting replaces these one-sided kernel normalizations by two-sided entropic OT couplings, so that the cross and self terms are normalized by Sinkhorn scaling rather than by a single denominator He et al., 2026.

<IPython.core.display.Image object>

Drifting trajectories for a small particle generator. The raw kernel drift has weak long-range attraction and can leave particles away from the data modes. The self-corrected field uses the difference Bϵ[β]Bϵ[μt]B_\epsilon[\beta]-B_\epsilon[\mu_t], so a longer integration brings particles to the blue modes while repelling them from their own current concentration.

Interactive panel. Use the drift and time controls to inspect a learned-looking velocity field and its induced particle trajectories.

Proof

Since μt\mu_t and ϕt\phi_t are fixed in the variation with respect to α\alpha, the first variation is

δαRt(αμt)(x)=ϕt(x).\delta_\alpha \mathcal R_t(\alpha|\mu_t)(x)=-\phi_t(x).

By the Wasserstein-gradient formula,

 ⁣WRt(αμt)=δαRt(αμt)=ϕt=ut.\Wgrad \mathcal R_t(\alpha|\mu_t) = \nabla\delta_\alpha \mathcal R_t(\alpha|\mu_t) = -\nabla\phi_t = -u_t.

The Wasserstein gradient-descent velocity is the negative of this gradient, namely utu_t. Substituting this velocity in the continuity equation gives the claimed flow.

Evolution in Depth of Transformers

Deep residual architectures can be read as time discretizations of ODEs or PDEs. For transformers, the transported objects are token measures and the velocity is induced by attention.

Transformers were introduced as sequence-to-sequence architectures driven by self-attention Vaswani et al., 2017 and have since become a central architecture for language and vision models Brown et al., 2020Dosovitskiy et al., 2021. Their distinctive feature is that each token is updated by a data-dependent average of all other tokens. This makes an attention layer permutation-equivariant before positional encoding, context dependent after conditioning on the input sequence, and naturally compatible with a measure viewpoint in which a prompt is regarded as an empirical distribution of tokens.

The mathematical limit used below concerns depth rather than model scale: one lets the number of residual attention layers grow while each layer makes a small update, as in continuous-depth neural networks Chen et al., 2018. For attention, the resulting velocity is nonlinear in the current token law because it is normalized by the whole context. This measure-theoretic view appears in the analysis of attention as a Lipschitz or interacting-particle operator Vuckovic et al., 2020Geshkovski et al., 2023, in the Sinkhorn-normalized dynamics of Sinkformers Sander et al., 2022, and in recent well-posedness and mean-field-limit results for several attention mechanisms Castin et al., 2025. It also separates the infinite-depth limit studied here from the token-limit question, where one controls how a finite empirical context approximates its limiting attention operator Bohbot et al., 2025.

We now consider very deep transformers, focusing on a single-head attention mechanism for simplicity while ignoring MLP layers, layer normalization, causality, and masking. This stripped-down framework is best suited to modeling encoders and vision transformers; the references above indicate which parts of this simplified picture extend to richer attention mechanisms.

Attention as a context-dependent velocity.

After tokenization, embedding, and positional encoding, each input (from a set of tokens) is represented as a point cloud (xi)i=1n(x_i)_{i=1}^n of nn points in the space of vectorized tokens. An attention layer with skip connection and rescaling by 1/T1/T (where TT is the depth) defines a transformation of the tokens:

xixi+1TjeQxi,KxjVxjeQxi,Kx,x_i \mapsto x_i + \frac{1}{T} \sum_j \frac{e^{\langle Q x_i, K x_j \rangle} V x_j}{\sum_{\ell} e^{\langle Q x_i, K x_\ell \rangle}},

where θ=(K,Q,V)\theta = (K, Q, V) are the parameters of the attention layer, represented by three matrices.

Token measure evolution.

To handle an arbitrary number of tokens, we define α=1niδxi\alpha = \frac{1}{n} \sum_i \delta_{x_i} as the empirical measure of tokens and rewrite the transformer mapping as:

xixi+1TΓθ[α](xi),x_i \mapsto x_i + \frac{1}{T} \Gamma_\theta[\alpha](x_i),

where

Γθ[α](x):=eQx,KyVydα(y)eQx,Kzdα(z).\Gamma_\theta[\alpha](x) := \frac{\int e^{\langle Q x, K y \rangle} V y \, \d \alpha(y)} {\int e^{\langle Q x, K z \rangle} \, \d \alpha(z)}.

At the level of the token distribution, the layer pushes α\alpha forward by the “in-context” mapping Γθt[α]\Gamma_{\theta_t}[\alpha], which depends on the context α\alpha, the tokens, and the depth-dependent parameters θt\theta_t. Denoting t[0,1]t \in [0, 1] as the depth and τ=1/T\tau = 1/T as the step size, this gives:

αt+τ=(Id+τΓθt[αt])αt.\alpha_{t+\tau} = (\Id + \tau \Gamma_{\theta_t}[\alpha_t])_\sharp \alpha_t.

As τ0\tau \to 0, this converges formally to the conservation equation

tαt+div(αtΓθt[αt])=0.\partial_t \alpha_t + \operatorname{div}(\alpha_t \Gamma_{\theta_t}[\alpha_t]) = 0.

Gradient structure and limitations.

When the token space has dimension dd and the query/key space has dimension rr, take Q,KRr×dQ,K\in\RR^{r\times d} and VRd×dV\in\RR^{d\times d}. If V=QKV=Q^\top K, the field Γθ[α]\Gamma_\theta[\alpha] is a gradient vector field in the token variable. Indeed, define the log-partition potential

Φα(x)=exp(Qx,Ky)dα(y),Uα(x)=logΦα(x).\Phi_\alpha(x) = \int \exp(\dotp{Qx}{Ky})\d\alpha(y), \qquad U_\alpha(x)=\log\Phi_\alpha(x).

Then

xUα(x)=QKyexp(Qx,Ky)dα(y)exp(Qx,Kz)dα(z)=Γθ[α](x).\nabla_x U_\alpha(x) = \frac{\int Q^\top K y\,\exp(\dotp{Qx}{Ky})\d\alpha(y)} {\int \exp(\dotp{Qx}{Kz})\d\alpha(z)} = \Gamma_\theta[\alpha](x).

This is an instantaneous gradient in xx. It is not, however, the gradient of the first variation of a fixed functional of α\alpha, because the potential UαU_\alpha itself depends on the current measure through the same attention normalization. Thus the PDE is generally a transportation dynamics, not a Wasserstein gradient flow. Special variants recover additional structure: Sinkhorn attention can be interpreted through doubly stochastic normalization and Wasserstein-type gradient flows Sander et al., 2022Castin et al., 2025, while layer normalization leads naturally to dynamics on the sphere and to modified metrics. The key open difficulty for the present viewpoint is training: after the architecture has been rewritten as a controlled transport equation, learning corresponds to optimizing the time-dependent parameters (θt)t(\theta_t)_t rather than merely analyzing the forward PDE for fixed parameters.

Flows over the Gaussian Manifold

Gaussian measures provide a useful testing ground for the preceding dynamics. They are not invariant under a general Wasserstein gradient flow: a nonlinear velocity usually creates non-Gaussian densities immediately. The useful substitute is to either identify affine velocities, which exactly preserve Gaussianity, or to project the dynamics onto the Gaussian manifold. In both cases the measure PDE reduces to matrix ODEs for the mean and covariance. This viewpoint is emphasized in the survey Peyré, 2025 and is useful for comparing diffusion paths, Wasserstein gradient flows, drifting fields and transformer-type dynamics.

For constrained gradient flows on this family, the covariance equation is the finite-dimensional Bures--Wasserstein gradient flow on positive definite matrices. Thus Gaussian closure is not just a computational shortcut: it is the restriction of Wasserstein geometry to the Gaussian submanifold, where affine gradient fields encode tangent vectors.

<IPython.core.display.Image object>

Gaussian closures of transport dynamics between two overlapping anisotropic Gaussians. The left panel is the exact W2\Wass_2 Gaussian geodesic. The middle panel shows a regularized Sinkhorn-style closure with inflated intermediate covariances. The right panel shows a drifting-style closure with a curved mean path and moment-matched covariance ellipses.

Interactive panel. Use the anisotropy, angle, regularization, and drift controls to compare Gaussian closures of Wasserstein, Sinkhorn, and drifting dynamics.

Gaussianity preservation.

The first question is invariance: one wants a simple criterion ensuring that the continuity equation does not leave the finite-dimensional Gaussian family.

Proof

Let XtX_t follow the characteristic ODE X˙t=bt+At(Xtmt)\dot X_t=b_t+A_t(X_t-\mean_t). This linear ODE maps Gaussian random variables to Gaussian random variables. Taking expectation gives m˙t=bt\dot\mean_t=b_t. Writing X~t=Xtmt\tilde X_t=X_t-\mean_t, one has X~˙t=AtX~t\dot{\tilde X}_t=A_t\tilde X_t, hence

Σ˙t=ddtE(X~tX~t)=AtΣt+ΣtAt.\dot\cov_t = \frac{\d}{\d t}\EE(\tilde X_t\tilde X_t^\top) = A_t\cov_t+\cov_t A_t^\top .

For the converse, set bt=m˙tb_t=\dot\mean_t and choose any matrix AtA_t satisfying the covariance equation. Since Σt\cov_t is positive definite, the Lyapunov map AAΣt+ΣtAA\mapsto A\cov_t+\cov_t A is invertible on symmetric matrices, which gives the unique symmetric choice when a gradient velocity is required. In that case vtv_t is the gradient of the quadratic potential xbt,x+At(xmt),xmt/2x\mapsto \dotp{b_t}{x}+\dotp{A_t(x-\mean_t)}{x-\mean_t}/2.

Constrained evolution on the Gaussian manifold.

For non-affine velocities, the finite-dimensional substitute is to project the Wasserstein dynamics onto the Gaussian manifold.

Let

G={N(m,Σ):mRd, Σ0}\mathcal G=\{\Gaussian(\mean,\cov):\mean\in\RR^d,\ \cov\succ0\}

be the Gaussian submanifold of P2(Rd)\Pp_2(\RR^d). The Wasserstein gradient of a functional constrained to a smooth submanifold MP2\mathcal M\subset\Pp_2 is defined as the Riesz representative of the differential restricted to tangent velocities of M\mathcal M. Equivalently, it is the small-step limit of the constrained JKO scheme

αk+1arg minαM12τW22(α,αk)+f(α).\alpha^{k+1}\in \argmin_{\alpha\in\mathcal M} \frac{1}{2\tau}\Wass_2^2(\alpha,\alpha^k)+f(\alpha).

For M=G\mathcal M=\mathcal G, tangent velocities are affine gradient fields v(x)=b+A(xm)v(x)=b+A(x-\mean) with A=AA=A^\top. The constrained gradient is therefore the L2(N(m,Σ))L^2(\Gaussian(\mean,\cov)) projection of the ambient Wasserstein gradient onto this finite-dimensional affine space, whenever the ambient gradient exists.

Proof

Test the functional along a Gaussian tangent vector, represented by an affine gradient field

v(x)=b+A(xm)v(x)=b+A(x-\mean)

with AA symmetric. The induced first-order variations are m˙=b\dot\mean=b and Σ˙=AΣ+ΣA\dot\cov=A\cov+\cov A. Therefore

dF(m,Σ)[b,AΣ+ΣA]=mF,b+tr ⁣(ΣF(AΣ+ΣA)).\d F(\mean,\cov)[b,A\cov+\cov A] = \dotp{\nabla_\mean F}{b} + \mathrm{tr}\!\left(\nabla_\cov F(A\cov+\cov A)\right).

Since AA, Σ\cov and ΣF\nabla_\cov F are symmetric, the second term equals

2tr(ΣFAΣ)=2ΣF(xm),A(xm)dN(m,Σ)(x).2\,\mathrm{tr}(\nabla_\cov F\,A\cov) = \int \dotp{2\nabla_\cov F(x-\mean)}{A(x-\mean)}\d\Gaussian(\mean,\cov)(x).

Together with the mean term, this gives

dF(m,Σ)[m˙,Σ˙]=vF(x),v(x)dN(m,Σ)(x)\d F(\mean,\cov)[\dot\mean,\dot\cov] = \int \dotp{v_F(x)}{v(x)}\d\Gaussian(\mean,\cov)(x)

for all affine gradient fields vv. This identifies the constrained Wasserstein gradient in the induced L2(α)L^2(\alpha) metric, or equivalently the projection of the ambient gradient when it exists. Substituting the descent velocity vF-v_F in the affine-velocity moment equations gives (72).

This proposition should be read as the organizing rule for Gaussian closures: once the scalar energy has been reduced to a function of (m,Σ)(\mean,\cov), its constrained Wasserstein gradient is automatically affine and the covariance follows the Bures-type ODE (72). When the first variation of ff is quadratic, this constrained gradient coincides with the full Wasserstein gradient.

Gaussian-preserving gradient flows.

The next examples show that many familiar energies already have affine Wasserstein gradients on Gaussian inputs, so their full flow remains inside the Gaussian family.

Proof

Each row is obtained by identifying the affine descent velocity v(x)=MCxv(x)=M_Cx generated by the corresponding Gaussian-constrained calculation and then applying the affine covariance equation, which gives C˙=MCC+CMC\dot C=M_CC+CM_C^\top. For KL(γ)\KL(\cdot|\gamma), the Fokker--Planck velocity is (C1Id)x(C^{-1}-\Id)x, hence C˙=2(IdC)\dot C=2(\Id-C). For the Fisher row, the restriction of 12I\frac12\mathcal I to centered Gaussians is

12(tr(C)+tr(C1)2d).\frac12\left(\tr(C)+\tr(C^{-1})-2d\right).

Using the Gaussian-constrained gradient formula gives the descent velocity (C2Id)x(C^{-2}-\Id)x, hence C˙=2(C1C)\dot C=2(C^{-1}-C). This row should be read as a Gaussian projected closure of the fourth-order Fisher flow.

For W22(,γ)\Wass_2^2(\cdot,\gamma), the Brenier map from N(0,C)\Gaussian(0,C) to γ\gamma is C1/2xC^{-1/2}x, so the descent velocity for the unhalved squared distance is 2(C1/2Id)x2(C^{-1/2}-\Id)x, giving 4(C1/2C)4(C^{1/2}-C). For the polynomial MMD row, centered Gaussians satisfy MMDk2(μ,γ)=CIdF2\MMD_k^2(\mu,\gamma)=\norm{C-\Id}_{\mathrm{F}}^2; the first variation is quadratic and its descent velocity is 4(IdC)x4(\Id-C)x, giving 8(CC2)8(C-C^2).

Gaussian Sinkhorn dual potentials are quadratic, so the velocity is again linear; differentiating the closed Gaussian formula yields the displayed spectral expression. The square roots are spectral functions of CC, hence commute with CC, which is why the covariance ODE closes as a matrix function of CC alone. For sliced Wasserstein, each one-dimensional projection is a Gaussian transport with velocity 2((θCθ)1/21)θ,xθ2((\theta^\top C\theta)^{-1/2}-1)\dotp{\theta}{x}\theta; averaging these velocities over Sd1\Sphere^{d-1} gives v(x)=V(C)xv(x)=V(C)x and thus C˙=V(C)C+CV(C)\dot C=V(C)C+CV(C).

Non-variational Gaussian-preserving flows.

The last examples are not ordinary gradient flows of a fixed scalar energy on the full Wasserstein space. They preserve Gaussianity because the prescribed velocity field is affine when evaluated on Gaussian measures.

Contractive Gaussian projection.

The preceding examples show when Gaussianity is preserved or imposed by projection. Gelbrich’s inequality Gelbrich, 1990 gives a useful variational explanation: replacing a measure by the Gaussian with the same first two moments cannot increase its Wasserstein distance to another similarly projected measure.

Proof

Take any coupling (X,Y)(X,Y) of μ\mu and ν\nu, center the variables, and write C=E[(Xmμ)(Ymν)]C=\EE[(X-\mean_\mu)(Y-\mean_\nu)^\top]. In the positive definite case, positivity of the block covariance matrix implies the factorization C=Σμ1/2KΣν1/2C=\cov_\mu^{1/2}K\cov_\nu^{1/2} with Kop1\norm{K}_{\mathrm{op}}\leq1, and therefore, by operator/nuclear norm duality,

trCtr((Σμ1/2ΣνΣμ1/2)1/2).\tr C\leq \tr\left((\cov_\mu^{1/2}\cov_\nu\cov_\mu^{1/2})^{1/2}\right).

The semidefinite case follows by adding ηId\eta\Id to both covariance matrices and letting η0\eta\downarrow0. Expanding EXY2\EE\norm{X-Y}^2 gives the lower bound

EXY2mμmν2+B2(Σμ,Σν).\EE\norm{X-Y}^2 \geq \norm{\mean_\mu-\mean_\nu}^2+\Bb^2(\cov_\mu,\cov_\nu).

Taking the infimum over couplings proves the inequality, while equality for Gaussian laws is exactly the Gaussian Bures formula.

The following preservation criterion is a direct consequence of Gelbrich’s theorem and was explained to us by Hugo Lavenant. It says that a functional which does not increase under moment-matched Gaussian projection admits Gaussian minimizing movements from Gaussian initial data.

Proof

For the JKO claim, Rγ=γ\mathcal R\gamma=\gamma because γ\gamma is Gaussian. Hence, for any competitor η\eta,

F(Rη)+12τW22(γ,Rη)F(η)+12τW22(γ,η).F(\mathcal R\eta)+\frac1{2\tau}\Wass_2^2(\gamma,\mathcal R\eta) \leq F(\eta)+\frac1{2\tau}\Wass_2^2(\gamma,\eta).

Applying this to a minimizer η=ν\eta=\nu shows that Rν\mathcal R\nu is again a minimizer. Uniqueness forces ν=Rν\nu=\mathcal R\nu.

References
  1. Peyré, G. (2025). Optimal and Diffusion Transports in Machine Learning. arXiv Preprint arXiv:2512.06797.
  2. Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6, 695–709.
  3. Vincent, P. (2011). A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7), 1661–1674.
  4. Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. Proceedings of the 32nd International Conference on Machine Learning, 37, 2256–2265.
  5. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems, 33, 6840–6851.
  6. Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. Advances in Neural Information Processing Systems, 32.
  7. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. International Conference on Learning Representations.
  8. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. International Conference on Learning Representations.
  9. Liu, X., Gong, C., & Liu, Q. (2023). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. International Conference on Learning Representations.
  10. Albergo, M. S., Boffi, N. M., & Vanden-Eijnden, E. (2025). Stochastic Interpolants: A Unifying Framework for Flows and Diffusions. Journal of Machine Learning Research, 26(209), 1–80.
  11. Hertrich, J., Chambolle, A., & Delon, J. (2025). On the Relation between Rectified Flows and Optimal Transport. Advances in Neural Information Processing Systems.
  12. Lavenant, H., & Santambrogio, F. (2022). The Flow Map of the Fokker–Planck Equation Does Not Provide Optimal Transport. Applied Mathematics Letters, 133, 108225.
  13. Han, J., Li, P., Guo, Q., Xu, R., Ermon, S., & Candès, E. J. (2026). One-Step Generative Modeling via Wasserstein Gradient Flows. arXiv Preprint arXiv:2605.11755.
  14. Deng, M., Li, H., Li, T., Du, Y., & He, K. (2026). Generative Modeling via Drifting. arXiv Preprint arXiv:2602.04770.
  15. Gretton, A., Li, K. W., Galashov, A., Thornton, J., De Bortoli, V., & Doucet, A. (2026). On the Wasserstein Gradient Flow Interpretation of Drifting Models. arXiv Preprint arXiv:2605.05118.