Dynamic Optimal Transport

Optimal transport becomes especially powerful once distances between measures are seen as actions of moving mass. This chapter first develops the dynamic language: continuity equations describe admissible measure evolutions, while the Benamou--Brenier formula identifies $\Wass_2$ with a least-action principle. These ideas prepare the gradient-flow and generative-model chapters that follow.

from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

# Locate the myst/ directory (the one containing ot4ml_web.py) from the usual
# notebook launch locations, and make it importable.
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        sys.path.insert(0, str(myst_dir))
        break

if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

# Pre-rendered figure thumbnails live next to the myst/ directory.
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"

def show_book_figure(name, width=760):
    """Display the thumbnail `<name>.png` at the given pixel width."""
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))

Evolutions Over the Space of Measures

We start with the continuity equation because it is the common language for particles, densities and weak measure evolutions. It also makes precise which velocity fields actually move mass.

Lagrangian and Eulerian Descriptions

Consider an evolution $t\mapsto\alpha_t\in\mathcal P(\RR^d)$. It can be described in a Lagrangian way as the advection of particles along a time-dependent vector field $v_t(x)$:

\frac{\d x(t)}{\d t}=v_t(x(t)).

(1)

Writing $T_t$ for the associated flow map, so that $T_t(x(0))=x(t)$, the advected measure is

\alpha_t=(T_t)_\sharp\alpha_0.

(2)

For empirical measures $\alpha_t=n^{-1}\sum_{i=1}^n\delta_{x_i(t)}$, each particle solves (1).

In the Eulerian description, the same motion is written directly on the evolving measure:

\frac{\partial\alpha_t}{\partial t} +\operatorname{div}(v_t\alpha_t)=0.

(3)

This PDE is often called the advection equation, the continuity equation, or Liouville’s equation when it acts on phase space. It is a classical PDE only when $\alpha_t$ has a smooth density. For general measures, and in particular for empirical measures, it is understood in the distributional sense: for every $\varphi\in C_c^1((0,1)\times\RR^d)$,

\int_0^1\!\int_{\RR^d} \left( \partial_t\varphi(t,x) +\dotp{v_t(x)}{\nabla_x\varphi(t,x)} \right) \d\alpha_t(x)\d t =0.

(4)

This weak equation is obtained from (3) by integration by parts. For smooth positive densities, the classical and weak formulations are equivalent; for particle clouds, the weak form remains meaningful.

Proposition: Lagrangian Flows Solve the Continuity Equation

Let $(T_t)_{t\in[0,1]}$ be a $C^1$ family of diffeomorphisms of $\RR^d$ and define $\alpha_t=(T_t)_\sharp\alpha_0$ . Assume that the derivatives below are integrable with respect to $\alpha_0$ , and define the Eulerian velocity field by

v_t(T_t(y))=\partial_t T_t(y).

(5)

Then $(\alpha_t,v_t)$ solves the continuity equation in the weak sense (4). In particular, if $\alpha_0=n^{-1}\sum_i\delta_{x_i(0)}$ is empirical, then $\alpha_t=n^{-1}\sum_i\delta_{x_i(t)}$ is empirical as well, with particle velocities $\dot x_i(t)=v_t(x_i(t))$ .
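As a quick numerical sanity check of this proposition, the sketch below uses a hypothetical dilation flow $T_t(y)=(1+t)y$ (an assumption made for illustration, not an example from the text), whose Eulerian velocity from (5) is $v_t(x)=x/(1+t)$. For a test function that does not depend on time, the weak formulation amounts to $\frac{\d}{\d t}\int\varphi\,\d\alpha_t=\int\dotp{v_t}{\nabla\varphi}\,\d\alpha_t$, and the code compares both sides on an empirical measure. The test function `phi`, the sample size and the finite-difference step are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(500, 2))            # particles sampling alpha_0

def flow(t, y):
    """Hypothetical flow map T_t(y) = (1 + t) y (a dilation)."""
    return (1.0 + t) * y

def velocity(t, x):
    """Eulerian field from (5): v_t(T_t(y)) = d/dt T_t(y) = y = x / (1 + t)."""
    return x / (1.0 + t)

def phi(x):
    """Smooth test function (arbitrary choice)."""
    return np.sin(x[:, 0]) + x[:, 1] ** 2

def grad_phi(x):
    return np.stack([np.cos(x[:, 0]), 2.0 * x[:, 1]], axis=1)

t, h = 0.3, 1e-5
# Left side: centered finite difference of t -> int phi d(alpha_t).
lhs = (phi(flow(t + h, y)).mean() - phi(flow(t - h, y)).mean()) / (2.0 * h)
# Right side: int <v_t, grad phi> d(alpha_t), evaluated on the advected particles.
x_t = flow(t, y)
rhs = np.sum(velocity(t, x_t) * grad_phi(x_t), axis=1).mean()
print(abs(lhs - rhs))
```

The two sides agree up to finite-difference error, for any particle cloud, which is the point of the weak formulation: no density is needed.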

From Measure Evolutions to Vector Fields

For a given evolution $(\alpha_t)_t$ , there are typically infinitely many velocity fields $v_t$ satisfying

\partial_t\alpha_t+\operatorname{div}(\alpha_t v_t)=0.

(9)

This non-uniqueness comes from the kernel of the weighted divergence. The linear space of vector fields that leave a measure $\alpha$ invariant is

\mathcal H_\alpha = \{v\in L^2(\alpha;\RR^d):\operatorname{div}(\alpha v)=0 \text{ in distributions}\}.

(10)

It is usually non-trivial: if $\alpha$ is an isotropic Gaussian, $\mathcal H_\alpha$ contains rotational vector fields generated by anti-symmetric matrices.
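The Gaussian example can be checked pointwise. For a centered isotropic Gaussian density $\rho(x)\propto e^{-\norm{x}^2/2}$ and $v(x)=Ax$, the product rule gives $\operatorname{div}(\rho v)=\rho\,(\operatorname{tr}(A)-\dotp{x}{Ax})$, and both terms vanish when $A$ is anti-symmetric. The following sketch, with an arbitrary random $A$ and arbitrary evaluation points, verifies this numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
B = rng.normal(size=(d, d))
A = B - B.T                               # anti-symmetric generator
x = rng.normal(size=(1000, d))            # evaluation points

# For rho(x) proportional to exp(-|x|^2 / 2) and v(x) = A x:
#   div(rho v) = rho * (tr(A) - <x, A x>).
bracket = np.trace(A) - np.einsum("ni,ij,nj->n", x, A, x)
max_abs = np.max(np.abs(bracket))
print(max_abs)
```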

Dacorogna--Moser Inversion

Reconstructing particles from an observed density evolution is therefore ill-posed. For a smooth positive density $\alpha_t=\rho_t\,\d x$, a simple choice, introduced by Dacorogna and Moser (Dacorogna & Moser, 1990), imposes that the flux $\rho_t v_t$ is a gradient field. With a fixed convention for the inverse Laplacian,

v_t = -\frac{1}{\rho_t} \nabla\Delta^{-1}(\partial_t\rho_t),

(11)

with suitable boundary conditions, for instance vanishing at infinity. This formula is useful conceptually but delicate when $\rho_t$ vanishes, and it does not generally produce a gradient velocity field.

The classical Dacorogna--Moser construction uses the linear density path. If $\alpha_i=\rho_i\,\d x$ are smooth positive densities with the same total mass on a bounded domain $\Omega$ , set

\alpha_t=(1-t)\alpha_0+t\alpha_1=\rho_t\,\d x, \qquad \rho_t=(1-t)\rho_0+t\rho_1.

(12)

Choose a time-independent flux $w$ satisfying

\operatorname{div} w=\rho_0-\rho_1, \qquad w\cdot n=0\quad\hbox{on }\partial\Omega,

(13)

for instance $w=-\nabla\phi$ with $\Delta\phi=\rho_1-\rho_0$ and Neumann boundary condition. Then

v_t=\frac{w}{\rho_t}

(14)

satisfies $\partial_t\rho_t+\operatorname{div}(\rho_t v_t)=0$ . The flow $\partial_t T_t=v_t\circ T_t$ , $T_0=\operatorname{Id}$ , therefore transports $\rho_0\d x$ onto $\rho_t\d x$ , and $T_1$ solves the prescribed-Jacobian problem $\rho_1(T_1(x))\det(\nabla T_1(x))=\rho_0(x)$ .
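In one dimension the construction (12)--(14) is explicit, since $w(x)=\int_0^x(\rho_0-\rho_1)$ satisfies (13) on $\Omega=[0,1]$ as soon as both densities have the same mass. The sketch below is a minimal illustration under that assumption; the two cosine densities, the grid size and the time $t$ are hypothetical choices. It builds $w$ by the trapezoid rule and checks the continuity equation by finite differences.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]
rho0 = 1.0 + 0.5 * np.cos(2.0 * np.pi * x)   # positive, total mass 1
rho1 = 1.0 - 0.5 * np.cos(2.0 * np.pi * x)   # positive, total mass 1

# Time-independent flux w with w' = rho0 - rho1 and w(0) = 0 (trapezoid rule).
f = rho0 - rho1
w = np.concatenate([[0.0], np.cumsum(0.5 * (f[1:] + f[:-1]) * dx)])

def density(t):
    """Linear density path (12)."""
    return (1.0 - t) * rho0 + t * rho1

def velocity(t):
    """Dacorogna--Moser velocity (14)."""
    return w / density(t)

t = 0.4
flux = density(t) * velocity(t)
# Continuity residual: d_t rho_t + d_x (rho_t v_t), with d_t rho_t = rho1 - rho0.
residual = (rho1 - rho0) + np.gradient(flux, dx)
print(abs(w[-1]), np.max(np.abs(residual)))
```

The boundary value `w[-1]` vanishes because the masses agree, which is the one-dimensional form of the no-flux condition in (13).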

Least-Square Inversion and Gradient Structure

A more robust choice, used implicitly in flow matching, optimal transport and Wasserstein gradient flows, is to select among all admissible velocities the one with smallest kinetic energy:

\min_v \frac12\int_0^1\!\int_{\RR^d}\norm{v_t(x)}^2\,\d\alpha_t(x)\d t \quad \text{subject to} \quad \partial_t\alpha_t+\operatorname{div}(\alpha_t v_t)=0.

(15)

Proposition: Least-Square Velocities Are Gradients

Assume that $\alpha_t=\rho_t\,\d x$ is a smooth positive density curve, that $\partial_t\rho_t$ has zero integral, and that boundary terms vanish. The minimizer of (15), if it exists, is a gradient field

v_t=\nabla\phi_t,

(16)

where $\phi_t$ , unique up to an additive constant on each connected component, solves the weighted Poisson equation

-\operatorname{div}(\rho_t\nabla\phi_t)=\partial_t\rho_t, \qquad v_t=-\nabla\Delta_{\alpha_t}^{-1}(\partial_t\alpha_t), \qquad \Delta_{\alpha_t}\phi=\operatorname{div}(\alpha_t\nabla\phi).

(17)

In general this inversion is still computationally demanding, but special choices of $(\alpha_t)_t$ lead to simpler formulas; this is the mechanism exploited later by flow matching in Section Generative Models via Flow Matching.
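The role of the gradient structure can be illustrated on the isotropic Gaussian, where the rotational fields $h(x)=Ax$ belong to $\mathcal H_\alpha$. Adding such an $h$ to the gradient field $v(x)=x=\nabla(\norm{x}^2/2)$ does not change $\operatorname{div}(\alpha v)$, but it increases the kinetic energy, because $\dotp{x}{Ax}=0$ pointwise. The sketch below checks this Pythagorean identity on samples; the sample size and the specific rotation generator are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(2000, 2))            # samples of an isotropic Gaussian alpha
A = np.array([[0.0, -1.0], [1.0, 0.0]])   # anti-symmetric, so x -> A x lies in H_alpha

v_grad = x                                # gradient field, v = grad(|x|^2 / 2)
h = x @ A.T                               # kernel component h(x) = A x
energy_grad = np.mean(np.sum(v_grad ** 2, axis=1))
energy_kernel = np.mean(np.sum(h ** 2, axis=1))
energy_sum = np.mean(np.sum((v_grad + h) ** 2, axis=1))
print(energy_grad, energy_sum)
```

Both fields induce the same instantaneous variation of $\alpha$, and the gradient one is the cheaper representative.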

Benamou--Brenier Dynamic Formulation of OT

The dynamic formulation identifies $\Wass_2^2$ with the kinetic energy of the cheapest continuity-equation path. It is the point where OT becomes a least-action principle.

Benamou--Brenier Formulation

Instead of assuming that a whole curve $(\alpha_t)_{t\in[0,1]}$ is prescribed, one fixes only the endpoints $\alpha_0$ and $\alpha_1$ and minimizes the least-square energy (15) jointly over curves and velocities, here normalized without the factor $\frac12$. The theorem of Benamou and Brenier (Benamou & Brenier, 2000) states that this geodesic energy is exactly the squared Wasserstein distance.

Theorem: Benamou--Brenier

For probability measures $\alpha_0,\alpha_1\in\mathcal P_2(\RR^d)$ ,

\Wass_2^2(\alpha_0,\alpha_1) = \inf_{(\alpha_t,v_t)} \int_0^1\!\int_{\RR^d}\norm{v_t(x)}^2\,\d\alpha_t(x)\d t,

(20)

where the infimum is over $(\alpha_t,v_t)$ solving $\partial_t\alpha_t+\nabla\!\cdot(\alpha_t v_t)=0$ with $\alpha_{t=0}=\alpha_0$ and $\alpha_{t=1}=\alpha_1$ . If $\alpha_0$ has a density and $T$ is the optimal Monge map $T_\sharp\alpha_0=\alpha_1$ , a minimizing curve is

\alpha_t=((1-t)\Id+tT)_\sharp\alpha_0, \qquad v_t((1-t)x+tT(x))=T(x)-x \quad\text{for $\alpha_0$-a.e. $x$ and a.e. $t\in(0,1)$}.

(21)
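In one dimension the optimal map for the quadratic cost is the monotone rearrangement, so the theorem can be checked directly on sorted samples. The sketch below is an illustration under that assumption: particle $i$ follows $x_i+s(t)(y_i-x_i)$ for a time profile $s$ with $s(0)=0$ and $s(1)=1$, and the resulting kinetic energy is $\int_0^1 s'(t)^2\,\d t$ times the mean squared displacement. The sample distributions and the warped profile $s(t)=t^2$ are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = np.sort(rng.normal(size=n))              # samples of alpha_0
y = np.sort(2.0 + 0.5 * rng.normal(size=n))  # samples of alpha_1
w2_sq = np.mean((y - x) ** 2)                # monotone matching is optimal in 1D

def action(time_profile, steps=2000):
    """Kinetic energy of the particle paths x_i + s(t) (y_i - x_i)."""
    t = np.linspace(0.0, 1.0, steps + 1)
    s = time_profile(t)
    dt = np.diff(t)
    speed = np.diff(s) / dt                  # piecewise-constant s'(t)
    return np.sum(speed ** 2 * dt) * np.mean((y - x) ** 2)

straight = action(lambda t: t)               # constant speed, as in (21)
warped = action(lambda t: t ** 2)            # same curve, non-constant speed
print(w2_sq, straight, warped)
```

The constant-speed path attains $\Wass_2^2$, while the reparametrized path visits the same measures at a cost larger by the factor $\int_0^1(2t)^2\,\d t=4/3$.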

Convex Moment-Based Reformulation

Although (20) is not jointly convex in $(\alpha_t,v_t)$ , it becomes convex after replacing velocities by momenta. Given $v\in L^2(\alpha;\RR^d)$ , define the momentum

\omega\eqdef \alpha v, \qquad \omega(B)=\int_B v(x)\,\d\alpha(x),

(24)

which is a finite $\RR^d$ -valued measure. The nonlinear relation $\omega=\alpha v$ is eliminated by the quadratic perspective

J(a,m) \eqdef \begin{cases} \norm{m}^2/a, & a>0,\\ 0, & a=0\ \text{and}\ m=0,\\ +\infty, & a=0\ \text{and}\ m\neq0, \end{cases} \qquad (a,m)\in[0,+\infty)\times\RR^d.

(25)

This lower-semicontinuous convex function is positively 1-homogeneous: $J(\eta a,\eta m)=\eta J(a,m)$ for $\eta\geq0$ . If $\lambda$ is any positive measure dominating both $\alpha$ and the total variation $|\omega|$ , set

\mathbb J(\alpha,\omega) \eqdef \int J\left( \frac{\d\alpha}{\d\lambda}(x), \frac{\d\omega}{\d\lambda}(x) \right)\d\lambda(x).

(26)

The value is independent of the dominating measure: both Radon--Nikodym densities change by the same factor, and the 1-homogeneity of $J$ cancels the change of reference measure. This is the integral functional associated with a convex normal integrand in the measure-valued relaxation of dynamic OT (Ambrosio et al., 2006); see also the perspective construction in (Rockafellar, 2015). Moreover,

\mathbb J(\alpha,\omega)<+\infty \quad\Longleftrightarrow\quad \omega=v\alpha\ \text{with}\ v\in L^2(\alpha;\RR^d), \qquad \mathbb J(\alpha,\omega)=\int\norm{v}^2\,\d\alpha.

(27)

The Benamou--Brenier problem therefore has the convex measure formulation

\Wass_2^2(\alpha_0,\alpha_1) = \inf_{\substack{\partial_t\alpha_t+\operatorname{div}\omega_t=0\\ \alpha_{t=0}=\alpha_0,\ \alpha_{t=1}=\alpha_1}} \int_0^1\mathbb J(\alpha_t,\omega_t)\,\d t.

(28)

In the absolutely continuous case $\alpha_t=\rho_t\,\d x$ and $\omega_t=m_t\,\d x$ , this reduces to the familiar integral of $J(\rho_t,m_t)=\norm{m_t}^2/\rho_t$ , with the zero-density conventions already encoded in (25). This convex reformulation enables geodesic interpolation by convex optimization after discretization.
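The two structural properties of the quadratic perspective (25), positive 1-homogeneity and joint convexity, are easy to probe numerically. The sketch below implements $J$ pointwise, including the zero-density conventions, and tests homogeneity and midpoint convexity on random pairs; the function name `perspective` and the sampling ranges are assumptions made for this illustration.

```python
import numpy as np

def perspective(a, m):
    """Quadratic perspective J(a, m) of (25) for a scalar a >= 0 and a vector m."""
    m = np.asarray(m, dtype=float)
    if a < 0:
        raise ValueError("density value must be nonnegative")
    if a > 0:
        return float(m @ m) / a
    return 0.0 if not m.any() else float("inf")

rng = np.random.default_rng(0)
homogeneity_gap, convexity_gap = 0.0, -np.inf
for _ in range(200):
    a0, a1 = rng.uniform(0.1, 2.0, size=2)
    m0, m1 = rng.normal(size=(2, 3))
    eta = rng.uniform(0.0, 5.0)
    # J(eta a, eta m) = eta J(a, m).
    homogeneity_gap = max(
        homogeneity_gap,
        abs(perspective(eta * a0, eta * m0) - eta * perspective(a0, m0)),
    )
    # J at the midpoint is at most the average of the endpoint values.
    mid = perspective(0.5 * (a0 + a1), 0.5 * (m0 + m1))
    avg = 0.5 * (perspective(a0, m0) + perspective(a1, m1))
    convexity_gap = max(convexity_gap, mid - avg)
print(homogeneity_gap, convexity_gap)
```

A random test of this kind does not prove convexity, but it is a cheap guard when implementing the discretized action.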

Dual Hamilton--Jacobi Formulation

The momentum formulation also has a useful dual. It turns the least-action problem into a Hamilton--Jacobi subsolution inequality for a scalar potential, with equality on the part of space-time actually visited by the optimal curve. Since (28) carries no factor $1/2$, the constants are as follows.

Proposition: Dual Benamou--Brenier Problem

Assume, for simplicity, that the densities are smooth, compactly supported, and that boundary terms vanish. Then the convex dynamic value has the dual formulation

\Wass_2^2(\alpha_0,\alpha_1) = \sup_{\phi} \left\{ \int_{\RR^d}\phi_1\,\d\alpha_1 - \int_{\RR^d}\phi_0\,\d\alpha_0 \;:\; \partial_t\phi_t+\frac14\norm{\nabla\phi_t}^2\leq 0 \right\}.

(29)

If $(\rho,m)$ and $\phi$ are smooth primal and dual optimizers, then

m_t=\frac{\rho_t}{2}\nabla\phi_t, \qquad \partial_t\phi_t+\frac14\norm{\nabla\phi_t}^2=0 \quad\text{on }\{\rho_t>0\}.

(30)

Equivalently, the optimal Eulerian velocity is $v_t=m_t/\rho_t=\nabla\phi_t/2$ .

This also recovers the static Kantorovich inequality from a dynamic principle. If $\gamma$ is any smooth curve with $\gamma(0)=x$ and $\gamma(1)=y$ , then

\frac{\d}{\d t}\phi_t(\gamma(t)) = \partial_t\phi_t(\gamma(t))+\dotp{\nabla\phi_t(\gamma(t))}{\dot\gamma(t)} \leq \norm{\dot\gamma(t)}^2.

(33)

After integration and minimization over curves,

\phi_1(y)-\phi_0(x)\leq \norm{x-y}^2.

(34)

Thus $(-\phi_0,\phi_1)$ is a feasible static Kantorovich dual pair for the quadratic cost. At optimality the inequality is saturated on the endpoint pairs connected by the primal characteristics.
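A translation gives a fully explicit instance of these relations. If $\alpha_1$ is $\alpha_0$ shifted by a vector $b$, the optimal velocity is the constant $b$, and the potential $\phi_t(x)=2\dotp{b}{x}-\norm{b}^2t$ satisfies $\nabla\phi_t/2=b$ and the Hamilton--Jacobi equation with equality. The sketch below, where the vector `b` and the samples are hypothetical, checks that the dual value in (29) equals $\Wass_2^2=\norm{b}^2$ and that the static inequality (34) holds on arbitrary pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
b = np.array([1.0, -0.5])                 # hypothetical translation vector
x0 = rng.normal(size=(300, 2))            # samples of alpha_0
x1 = x0 + b                               # alpha_1 is alpha_0 shifted by b

def phi(t, x):
    """Candidate potential phi_t(x) = 2 <b, x> - |b|^2 t, so grad phi / 2 = b."""
    return 2.0 * x @ b - (b @ b) * t

# Hamilton--Jacobi residual d_t phi + |grad phi|^2 / 4 = -|b|^2 + |2 b|^2 / 4.
hj_residual = -(b @ b) + (2.0 * b) @ (2.0 * b) / 4.0
dual_value = phi(1.0, x1).mean() - phi(0.0, x0).mean()
w2_sq = b @ b                             # translations are optimal for the quadratic cost

# Static inequality (34) on arbitrary pairs (x, y), not only matched ones.
y = rng.normal(size=(300, 2))
slack = np.sum((x0 - y) ** 2, axis=1) - (phi(1.0, y) - phi(0.0, x0))
print(dual_value, w2_sq, slack.min())
```

Here the slack equals $\norm{y-x-b}^2$, so it vanishes exactly on the matched pairs $y=x+b$, in agreement with the saturation statement.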

The figure below displays these primal--dual relations for a one-dimensional mixture transport, including the Hamilton--Jacobi contact identity along the active mass.

One-dimensional Benamou--Brenier primal and dual solutions. The endpoints are Gaussian mixtures and the solution is computed from monotone quantile interpolation. The panels show the primal density, the momentum $m_t=\rho_t v_t$ , and the dual Hamilton--Jacobi potential. Along the active transported mass, the notebook checks $m_t=\rho_t\partial_x\phi_t/2$ and $\partial_t\phi_t+|\partial_x\phi_t|^2/4=0$ .

Proximal Splitting

The convex momentum formulation also explains the original Benamou--Brenier solver. After discretization, the ALG2 scheme can be read as a Douglas--Rachford splitting, equivalently ADMM on the Fenchel--Rockafellar dual (Papadakis et al., 2014). Suppressing discretization indices, write $U=(\rho,m)$, let $\mathcal F(U)$ be the integral of the perspective action, and let $\mathcal G=\iota_{\mathcal C}$ be the indicator of the affine continuity constraint with prescribed endpoints. The problem is $\min_U \mathcal F(U)+\mathcal G(U)$.

The two proximal operators separate the nonlinear and linear parts: the prox of $\mathcal F$ is local in $(t,x)$ and amounts to the perspective proximal operator, whereas the prox of $\mathcal G$ is the orthogonal projection onto the divergence equation and endpoint constraints. Douglas--Rachford alternates these two simple operations.

Algorithm: Douglas--Rachford for dynamic Benamou--Brenier

Input: Functionals $\mathcal F,\mathcal G=\iota_{\mathcal C}$ , proximal parameter $\tau>0$ , initial field $Z^0$ , tolerance $\mathrm{tol}>0$ , and maximum iteration count $K\geq1$ .

Output: Discrete density-momentum field $U^\star$ .

For $k=0,\ldots,K-1$ do:

    Proximal step: $U^{k+1}=\prox_{\tau\mathcal F}(Z^k).$
    Project reflected point: $\widetilde U^{k+1} = \prox_{\tau\mathcal G}(2U^{k+1}-Z^k) = \Proj_{\mathcal C}(2U^{k+1}-Z^k).$
    Update $Z^{k+1}=Z^k+\widetilde U^{k+1}-U^{k+1}.$
    If $\norm{U^{k+1}-\widetilde U^{k+1}}\leq\mathrm{tol}$ then:
        Return $U^{k+1}$.

Return $U^K$.
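The iteration can be exercised on a toy problem before plugging in the dynamic OT operators. The sketch below is not the Benamou--Brenier solver: as a stated assumption, it replaces $\mathcal F$ by the quadratic $\frac12\norm{u-c}^2$ (closed-form prox) and $\mathcal C$ by the affine set $\{u:\sum_i u_i=1\}$ (closed-form projection), so that the minimizer is known to be the projection of $c$ onto $\mathcal C$. The function name `douglas_rachford` and all parameter values are hypothetical.

```python
import numpy as np

def douglas_rachford(prox_f, proj_c, z0, tau=1.0, tol=1e-10, max_iter=500):
    """Douglas--Rachford iteration with the same three steps as the algorithm above."""
    z = np.array(z0, dtype=float)
    u = z.copy()
    for _ in range(max_iter):
        u = prox_f(z, tau)                    # U^{k+1} = prox_{tau F}(Z^k)
        u_tilde = proj_c(2.0 * u - z)         # projection of the reflected point
        z = z + u_tilde - u                   # Z^{k+1}
        if np.linalg.norm(u - u_tilde) <= tol:
            break
    return u

# Toy stand-ins (assumptions): F(u) = 0.5 |u - c|^2 and C = {u : sum(u) = 1}.
c = np.array([0.2, 1.5, -0.3, 0.9])

def prox_f(z, tau):
    """Minimizer of tau * 0.5 |u - c|^2 + 0.5 |u - z|^2."""
    return (z + tau * c) / (1.0 + tau)

def proj_c(u):
    """Orthogonal projection onto the hyperplane sum(u) = 1."""
    return u - (u.sum() - 1.0) / u.size

u_star = douglas_rachford(prox_f, proj_c, z0=np.zeros(4))
expected = c - (c.sum() - 1.0) / c.size       # projection of c onto C
print(u_star, expected)
```

For dynamic OT, `prox_f` would become the pointwise perspective prox and `proj_c` the projection onto the discretized continuity constraint; only those two callables change.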

The next figure complements the Eulerian optimization viewpoint with the Lagrangian picture: matched particles travel along the straight characteristics of the minimizing curve.

Benamou--Brenier geodesic between two sampled silhouettes. A discrete quadratic OT plan between finely subsampled cat and two-disks point clouds induces the McCann interpolation $Z_t=(1-t)X+tY$ , which is the Lagrangian realization of the least-action solution. The left panel renders local color images of the smaller-bandwidth kernel-smoothed densities with enough padding to include the full silhouettes. The right panel overlays shortened velocity arrows centered at evenly subsampled midpoint particles $Z_{1/2}$ ; each displayed arrow runs in data coordinates from a source-side tail to a target-side head along the matched characteristic direction $Y-X$ , but is not drawn as the full endpoint segment from $X$ to $Y$ .

The interactive demo keeps the same Lagrangian picture: particles are matched once, then move along straight characteristics. The time and velocity scale controls separate the path $\alpha_t$ from the underlying displacement field.

Interactive panel. Use the time and velocity-scale controls to follow the Benamou--Brenier geodesic as a moving density with an Eulerian velocity field.

Path-Space Formulation

Let $\Ss=C([0,1];\RR^d)$ be the space of continuous paths endowed with the uniform topology. For $t\in[0,1]$ define the evaluation map

e_t:\Ss\to\RR^d, \qquad e_t(\gamma)=\gamma(t).

(35)

The Benamou--Brenier cost admits the equivalent formulation

\Wass_2^2(\alpha_0,\alpha_1) = \inf_{M\in\Pp(\Ss)} \enscond{ \int_{\Ss}\!\int_0^1\norm{\dot\gamma(t)}^2\d t\,\d M(\gamma) }{ (e_0)_\sharp M=\alpha_0,\ (e_1)_\sharp M=\alpha_1 }.

(36)

The inner energy is understood as $+\infty$ outside the absolutely continuous paths. If $\alpha_0$ has a density, the minimizer $M^*$ is unique. Its time marginals reproduce the optimal curve: $\alpha_t=(e_t)_\sharp M^*$ for all $t$ . Furthermore, for a.e. $t$ , the conditional law of the path velocity is deterministic:

(e_t,\dot e_t)_\sharp M^*(\d x,\d q) = \alpha_t(\d x)\delta_{v_t^*(x)}(\d q),

(37)

where $v_t^*$ is the optimal velocity field in the Benamou--Brenier formulation. Hence $M^*$ concentrates on straight-line geodesics and, for a.e. $t$ , assigns exactly one direction at $\alpha_t$ -a.e. spatial point.

Generalized Dynamic Wasserstein Distances

The quadratic Benamou--Brenier formula is only one instance of a broader fixed-mass dynamic language. The goal of this section is to define a large family of geodesic-like distances on spaces of probability measures by modifying the action minimized in the Benamou--Brenier formula. The objects introduced here are metric: they specify admissible curves, tangent variables and path energies. All descent constructions are postponed to Generalized Dynamic Wasserstein Flows, where these distances are used to generate gradient-flow PDE models.

Path Actions

The common construction replaces the quadratic kinetic energy in the Benamou--Brenier formula by an instantaneous action while retaining the continuity equation and endpoint constraints.

In the mass-preserving Euclidean setting, the basic input is an instantaneous action $\mathbb A(\alpha,w)$ , where $\alpha$ is the current measure and $w$ is an admissible velocity representative. When this action is normalized as a squared infinitesimal speed, it generates the length-space value

\mathsf D_{\mathbb A}^2(\alpha_0,\alpha_1) = \inf_{\alpha_t,v_t} \left\{ \int_0^1 \mathbb A(\alpha_t,v_t)\,\d t : \partial_t\alpha_t+\operatorname{div}(\alpha_t v_t)=0, \ \alpha_{t=0}=\alpha_0, \ \alpha_{t=1}=\alpha_1 \right\}.

(38)

Equivalently, one may quotient by velocity fields that induce the same first-order variation of the measure. The formula above should be read as a dynamic definition of the distance, not as a property automatically satisfied by an arbitrary discrepancy. Some standard distances, such as $\Wass_p$ , are first written with a $p$ -homogeneous action and then squared by taking a constant-speed parametrization; this normalization is made explicit below. Different choices of $\mathbb A$ change the resulting geometry; Generalized Dynamic Wasserstein Flows later reuses these choices when dynamics are introduced.

Quadratic, or Riemannian, Tangent Actions

A particularly transparent case occurs when $w\mapsto\mathbb A(\alpha,w)$ is quadratic. For simplicity, take admissible velocities in $L^2(\alpha;\RR^d)$ ; in some applications this Hilbert space is replaced by a closed subspace encoding additional constraints. Suppose the polarization of $\mathbb A$ is represented by a positive self-adjoint operator $Q_\alpha:L^2(\alpha;\RR^d)\to L^2(\alpha;\RR^d)$ ,

\mathbb A(\alpha;w,z) = \left\langle Q_\alpha w,z\right\rangle_{L^2(\alpha)}, \qquad \mathbb A(\alpha,w)=\left\langle Q_\alpha w,w\right\rangle_{L^2(\alpha)}.

(39)

To obtain a genuine tangent norm, this quadratic form must be nondegenerate after quotienting velocity fields that induce the same measure variation.

The least-action distance generated by this tensor is

\mathsf D_Q^2(\alpha_0,\alpha_1) = \mathsf D_{\mathbb A}^2(\alpha_0,\alpha_1) = \inf_{\substack{\partial_t\alpha_t+\operatorname{div}(\alpha_t v_t)=0\\ \alpha_{t=0}=\alpha_0,\ \alpha_{t=1}=\alpha_1}} \int_0^1 \left\langle Q_{\alpha_t}v_t,v_t\right\rangle_{L^2(\alpha_t)} \d t .

(40)

The usual $\Wass_2$ geometry corresponds to $Q_\alpha=\Id$ in this simplified notation. Thus $Q_\alpha$ records how the chosen geometry deforms the Euclidean $L^2(\alpha)$ tangent norm: no deformation for $\Wass_2$ , and a nontrivial tensor for generalized Riemannian geometries. Generalized Dynamic Wasserstein Flows later reuses the same tensor as a preconditioner for metric descent.

Local Velocity Actions

Many dynamic distances are local with respect to a reference measure $\lambda$ . Write $\alpha=a\lambda$ . A velocity action is specified by a pointwise integrand

A:[0,+\infty)\times\RR^d\to[0,+\infty], \qquad (a,w)\mapsto A(a,w),

(41)

where $a\in\RR_+$ is a density value and $w\in\RR^d$ is a velocity value, and defines

\mathbb A(\alpha,w) = \int A\left(\frac{\d\alpha}{\d\lambda}(x),w(x)\right)\d\lambda(x).

(42)

For a fixed reference $\lambda$ , this covers density-dependent mobilities and congestion constraints. If, in addition, $A$ is positively 1-homogeneous in its first variable, $A(\eta a,w)=\eta A(a,w)$ for $\eta\geq0$ , then the same formula is intrinsic: replacing $\lambda$ by another dominating measure gives the same value. The usual Benamou--Brenier action is the model case

A_2(a,w)=a\norm{w}^2, \qquad \mathbb A(\alpha,w)=\int\norm{w}^2\d\alpha .

(43)

Homogeneous Momentum Actions

The same action can be written in momentum variables, and this is the form in which convexity and metric properties are easiest to read. Set $\omega=\alpha w$ , so that $\omega$ is a vector-valued measure. When the local description is written with the same reference $\lambda$ , so that $\alpha=a\lambda$ and $\omega=m\lambda$ , the pointwise momentum perspective is

J_A(a,m) \eqdef \begin{cases} A(a,m/a), & a>0,\\ 0, & a=0\ \text{and}\ m=0,\\ +\infty, & a=0\ \text{and}\ m\neq0, \end{cases}

(44)

and the measure action relative to $\lambda$ is

\mathbb J_{A,\lambda}(\alpha,\omega) \eqdef \int J_A\!\left( \frac{\d\alpha}{\d\lambda}, \frac{\d\omega}{\d\lambda} \right)\d\lambda,

(45)

with value $+\infty$ if $\alpha$ or the total variation $|\omega|$ is not absolutely continuous with respect to $\lambda$ . This zero-density convention is the lower-semicontinuous one for the superlinear actions used below; other growths use the corresponding recession extension. If $A$ is positively 1-homogeneous in $a$ , then $J_A$ is jointly 1-homogeneous: $J_A(\eta a,\eta m)=\eta J_A(a,m)$ . In that intrinsic case the value of $\mathbb J_{A,\lambda}$ is independent of the dominating reference measure, and we write simply $\mathbb J_A$ . For $A_2(a,w)=a\norm{w}^2$ , one recovers the quadratic perspective

J_2(a,m)=\frac{\norm m^2}{a},

(46)

which is the integrand used in the convex Benamou--Brenier formulation.

Proposition: Concave Mobilities Give Convex Momentum Actions

Let $I\subset[0,+\infty)$ be a convex interval, let $\theta:I\to[0,+\infty)$ be concave, and let $L:\RR^d\to[0,+\infty]$ be convex with $L(0)=0$ . Define, on the set where $\theta(a)>0$ ,

J_{\theta,L}(a,m) \eqdef \theta(a)L\!\left(\frac{m}{\theta(a)}\right), \qquad A_{\theta,L}(a,w) \eqdef J_{\theta,L}(a,aw) = \theta(a)L\!\left(\frac{aw}{\theta(a)}\right).

(47)

Extend $J_{\theta,L}$ to the boundary by lower semicontinuity. Then $J_{\theta,L}$ is convex in $(a,m)$ . This single construction contains the standard action $A(a,w)=aL(w)$ by taking $\theta(a)=a$ , and the concave-mobility quadratic action $A(a,w)=a^2\norm w^2/\theta(a)$ by taking $L(u)=\norm u^2$ .

The next proposition isolates the assumptions under which the momentum formulation generated by $A$ defines a path metric rather than only a variational principle.

Proposition: Homogeneous Dynamic Actions Define Distances

Fix a reference measure $\lambda$ , omitted from the notation only in the intrinsic case where $\mathbb J_{A,\lambda}$ does not depend on $\lambda$ . Assume that the momentum perspective $J_A$ defined in (44) is lower semicontinuous, convex in $(a,m)$ , even in $m$ , and satisfies $J_A(a,0)=0$ . Assume moreover that for some $r>1$

J_A(a,\xi m)=|\xi|^rJ_A(a,m) \qquad \text{for all admissible }a,\ \text{all }m,\ \text{and all }\xi\in\RR,

(50)

and that $J_A(a,m)=0$ if and only if $m=0$ . Equivalently, the evenness, homogeneity and nondegeneracy assumptions translate, away from zero density, into

A(a,-w)=A(a,w), \qquad A(a,\xi w)=|\xi|^rA(a,w), \qquad A(a,w)=0 \Longleftrightarrow w=0 \qquad (a>0).

(51)

Define, on every fixed-mass class,

\mathsf D_{A,\lambda}(\alpha_0,\alpha_1) \eqdef \inf_{\substack{\partial_t\alpha_t+\diverg\omega_t=0\\ \alpha_{t=0}=\alpha_0,\ \alpha_{t=1}=\alpha_1}} \left( \int_0^1\mathbb J_{A,\lambda}(\alpha_t,\omega_t)\,\d t \right)^{1/r}.

(52)

Assume finally that the relaxed dynamic problem is sequentially closed and attains its infimum whenever the value is finite. Then $\mathsf D_{A,\lambda}$ is an extended distance on each finite-action component: it is symmetric, satisfies the triangle inequality, and $\mathsf D_{A,\lambda}(\alpha_0,\alpha_1)=0$ only when $\alpha_0=\alpha_1$ . In the intrinsic case, we write $\mathsf D_A$ .

Without homogeneity or nondegeneracy, the same momentum action remains useful as a variational principle, but its $r$ -th root need not define a distance.

Example: Wasserstein-$p$ action

The usual $\Wass_p$ distances correspond to changing only the homogeneity of the Benamou--Brenier action. The specific objects to insert in the general framework are

A_p(a,w)=a\norm{w}^p, \qquad J_p(a,m)= \begin{cases} \norm{m}^p/a^{p-1}, & a>0,\\ 0, & (a,m)=(0,0),\\ +\infty, & a=0,\ m\neq0, \end{cases}

(53)

With $A=A_p$ and $r=p$, the proposition "Homogeneous Dynamic Actions Define Distances" gives the usual identity $\mathsf D_{A_p}=\Wass_p$. The corresponding squared length-space normalization is

\mathbb A_p(\alpha,w) = \left(\int\norm{w}^p\d\alpha\right)^{2/p}.

(54)

This is the squared version of the $p$-homogeneous action: minimizing $\int_0^1\mathbb A_p(\alpha_t,v_t)\d t$ gives $\Wass_p^2$ after constant-speed reparametrization. Thus $A_p,J_p$ denote the local $p$-homogeneous velocity and momentum densities, whereas $\mathbb A_p$ denotes the squared tangent action used in the length formulation. The endpoint $p=1$ can be treated separately: $J_1(a,m)=\norm m$, and the dynamic problem collapses to Beckmann’s formulation of $\Wass_1$ (Beckmann, 1952).
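The homogeneity bookkeeping of this example can be checked on the pointwise density $J_p$ of (53). The sketch below verifies the $p$-homogeneity in the momentum required by (50), the joint 1-homogeneity that makes the action intrinsic, and the collapse $J_1(a,m)=\norm m$; the function name `momentum_action_p` and the numerical values are illustrative assumptions.

```python
import numpy as np

def momentum_action_p(a, m, p):
    """Pointwise J_p(a, m) of (53) for a scalar a >= 0 and a vector m."""
    norm_m = float(np.linalg.norm(m))
    if a > 0:
        return norm_m ** p / a ** (p - 1)
    return 0.0 if norm_m == 0.0 else float("inf")

a, m, p = 0.7, np.array([0.3, -1.2]), 3.0
xi, eta = -2.5, 4.0
base = momentum_action_p(a, m, p)
scaled_momentum = momentum_action_p(a, xi * m, p)        # compare with |xi|^p * base
scaled_jointly = momentum_action_p(eta * a, eta * m, p)  # compare with eta * base
beckmann = momentum_action_p(a, m, 1.0)                  # p = 1: independent of a
print(base, scaled_momentum, scaled_jointly, beckmann)
```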

Concave-Mobility Actions

One can instead keep a quadratic momentum action and change the mobility. Dolbeault, Nazaret and Savaré introduced this construction as a class of generalized transport distances adapted to nonlinear diffusion (Dolbeault et al., 2009). Let $I\subset[0,+\infty)$ be a convex interval and let $\theta:I\to[0,+\infty)$ be concave. Define

J_\theta(a,m) \eqdef \begin{cases} \norm{m}^2/\theta(a), & \theta(a)>0,\\ 0, & \theta(a)=0 \text{ and } m=0,\\ +\infty, & \theta(a)=0 \text{ and } m\neq0. \end{cases}

(55)

The corresponding velocity action is

A_\theta(a,w)=J_\theta(a,aw)=\frac{a^2\norm{w}^2}{\theta(a)}.

(56)

The convexity of $J_\theta$ is the special case $L(u)=\norm u^2$ of the proposition "Concave Mobilities Give Convex Momentum Actions". This is why concavity of the mobility, rather than convexity, is the structural condition that makes the continuity-equation formulation convex.

Fix now a reference measure $\lambda$ . If $\alpha=a\lambda$ , the induced squared tangent action is

\mathbb A_{\theta,\lambda}(\alpha,w) = \int A_\theta(a(x),w(x))\,\d\lambda(x) = \int \frac{a(x)^2}{\theta(a(x))}\norm{w(x)}^2\,\d\lambda(x),

(57)

and it is set to $+\infty$ when $\alpha\not\ll\lambda$ . Equivalently, on the set where $\theta(a(x))>0$ ,

\mathbb A_{\theta,\lambda}(\alpha,w) = \int \frac{a(x)}{\theta(a(x))}\norm{w(x)}^2\,\d\alpha(x).

(58)

Hence this is a local Riemannian case whenever the multiplier $a(x)/\theta(a(x))$ is finite and positive, in the sense of (40). The associated tensor is the multiplication operator

\big(Q_{\theta,\lambda,\alpha}w\big)(x) = \frac{a(x)}{\theta(a(x))}w(x), \qquad \alpha=a\lambda,

(59)

defined $\alpha$ -a.e. on the set where $\theta(a)>0$ . Except for linear mobilities $\theta(a)=ca$ , and in particular the normalized case $\theta(a)=a$ which recovers $\Wass_2$ , the pointwise velocity action $A_\theta(a,w)$ is not positively 1-homogeneous in the density variable $a$ . Consequently the construction is not intrinsic under a change of $\lambda$ : the resulting distance depends on the chosen reference measure and is finite only between endpoints that can be joined by a finite-action curve with $\alpha_t\ll\lambda$ .
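The dependence on the reference measure is already visible on a measure with three atoms. The sketch below evaluates a discrete version of the local action (42), $\sum_i A(\alpha_i/\lambda_i,w_i)\lambda_i$, for two reference measures that differ by a factor of 2. As assumptions for this illustration, it uses scalar velocities, the linear mobility $\theta(a)=a$, and the exclusion mobility $\theta(a)=a(1-a)$, for which $A_\theta(a,w)=a\,w^2/(1-a)$.

```python
import numpy as np

alpha = np.array([0.2, 0.3, 0.5])         # masses of a discrete measure
w = np.array([1.0, -2.0, 0.5])            # scalar velocities at the atoms

def action(A, lam):
    """Discrete version of (42): sum_i A(alpha_i / lam_i, w_i) lam_i."""
    a = alpha / lam
    return float(np.sum(A(a, w) * lam))

def quadratic(a, w):
    """A_theta for theta(a) = a, i.e. the Benamou--Brenier density a |w|^2."""
    return a * w ** 2

def exclusion(a, w):
    """A_theta for theta(a) = a (1 - a); requires density values a < 1."""
    return a * w ** 2 / (1.0 - a)

lam_1 = np.ones(3)
lam_2 = 2.0 * np.ones(3)
print(action(quadratic, lam_1), action(quadratic, lam_2))
print(action(exclusion, lam_1), action(exclusion, lam_2))
```

The quadratic action is unchanged by the rescaling of $\lambda$, whereas the exclusion action is not, which is why the subscript $\lambda$ must be kept in the notation.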

The associated value is therefore written

\mathsf W_{\theta,\lambda}(\alpha_0,\alpha_1) \eqdef \mathsf D_{A_\theta,\lambda}(\alpha_0,\alpha_1),

(60)

where the subscript $\lambda$ recalls that the action is measured through the density $a=\d\alpha/\d\lambda$ . Equivalently, because this action is quadratic in the momentum, $\mathsf W_{\theta,\lambda}^2$ is the path value (38) with $\mathbb A=\mathbb A_{\theta,\lambda}$ .

The choice $\theta(a)=a$ recovers $\Wass_2$ . Other choices encode different geometry: $\theta(a)=a^\gamma$ with $0<\gamma\leq1$ changes the cost of moving dilute mass, while $\theta(a)=a(1-a/M)$ on $[0,M]$ models a volume-filling or exclusion effect. The distance is comparable with $\Wass_2$ on classes where $\theta(a)$ is bounded above and below by positive multiples of $a$ ; otherwise zero-mobility barriers can make some pairs infinitely far apart.
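For the volume-filling mobility, the joint convexity of $J_\theta$ in (55) can be probed by a midpoint test on random pairs of density and momentum values. The sketch below is a numerical check only, restricted to scalar momenta and to density values away from the zero-mobility barriers $a=0$ and $a=M$; the value $M=1$ and the sampling ranges are assumptions.

```python
import numpy as np

M = 1.0

def theta(a):
    """Volume-filling mobility theta(a) = a (1 - a / M), concave on [0, M]."""
    return a * (1.0 - a / M)

def momentum_action(a, m):
    """J_theta(a, m) = |m|^2 / theta(a) for scalar inputs with theta(a) > 0."""
    return m ** 2 / theta(a)

rng = np.random.default_rng(0)
a0, a1 = rng.uniform(0.05, 0.95, size=(2, 1000)) * M
m0, m1 = rng.normal(size=(2, 1000))
midpoint = momentum_action(0.5 * (a0 + a1), 0.5 * (m0 + m1))
average = 0.5 * (momentum_action(a0, m0) + momentum_action(a1, m1))
worst_gap = np.max(midpoint - average)    # nonpositive up to rounding if J_theta is convex
print(worst_gap)
```

The action blows up as $a$ approaches $0$ or $M$, which is the pointwise trace of the barriers mentioned above.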

Dynamic Spectral Wasserstein Distances

The static spectral distances of Spectral and Robust Wasserstein Distances penalize a coupling through the covariance of its displacement. A dynamic version keeps the continuity equation but replaces the pointwise kinetic energy by a gauge of the whole velocity covariance. The resulting action is nonlocal in space: velocity directions are charged globally through their covariance, rather than independently at each point.

Let $\gamma$ be a monotone spectral gauge on $\mathbb S_+^d$ . For a probability measure $\alpha$ and a velocity field $v\in L^2(\alpha;\RR^d)$ , define the spectral tangent action

\mathbb A_\gamma(\alpha,v) \eqdef \gamma\!\left(\int v(x)v(x)^\top\d\alpha(x)\right).

(62)

The trace gauge gives the usual Wasserstein tangent action, while the operator gauge $\gamma(M)=\lambda_{\max}(M)$ charges only the largest directional velocity variance. With the length-distance notation introduced in (38), the associated dynamic action distance is

\mathsf W_{\gamma,\mathrm{dyn}}^2 \eqdef \mathsf D_{\mathbb A_\gamma}^2 .

(63)

In density--momentum variables, this corresponds to the measure action

\mathbb J_\gamma(\alpha,\omega) = \gamma\!\left(\int \left(\frac{\d\omega}{\d\alpha}\right) \left(\frac{\d\omega}{\d\alpha}\right)^\top \d\alpha\right),

(64)

or, when $\alpha=\rho\,\d x$ and $\omega=m\,\d x$ ,

\mathbb J_\gamma(\rho,m) = \gamma\!\left(\int \frac{m(x)m(x)^\top}{\rho(x)}\,\d x\right).

(65)

This functional is convex in the density--momentum fields $(\rho,m)$ by the matrix perspective, together with the monotonicity and convexity of $\gamma$ . It is nevertheless not, in general, obtained by integrating a pointwise action density, because the covariance is computed globally before applying $\gamma$ . It becomes local only for linear spectral gauges. For instance, if $\gamma(M)=\operatorname{tr}(GM)$ with $G\succeq0$ , then the velocity and momentum densities are

A_{\mathrm{lin}}(a,w)=a\,w^\top G w, \qquad J_{\mathrm{lin}}(a,m)=\frac{m^\top G m}{a},

(66)

and the trace gauge, $G=\Id$ in $\gamma(M)=\operatorname{tr}(GM)$ , recovers the usual Benamou--Brenier action.
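On samples, the spectral tangent action (62) reduces to a gauge of the empirical second-moment matrix of the velocities. The sketch below compares the trace gauge with the operator gauge $\lambda_{\max}$ for an anisotropic velocity field; the anisotropy scales and the sample size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Velocities at samples of alpha, deliberately anisotropic.
v = rng.normal(size=(1000, 3)) * np.array([3.0, 1.0, 0.2])

second_moment = v.T @ v / len(v)                         # empirical int v v^T d(alpha)
trace_action = np.trace(second_moment)                   # trace gauge
operator_action = np.linalg.eigvalsh(second_moment)[-1]  # gauge lambda_max
kinetic = np.mean(np.sum(v ** 2, axis=1))                # int |v|^2 d(alpha)
print(trace_action, operator_action)
```

The trace gauge reproduces the Benamou--Brenier kinetic energy, while the operator gauge lies between `trace_action / d` and `trace_action` and only sees the dominant direction.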

The following result, in the form used for normalized spectral flows in (Peyré, 2026), shows that this dynamic construction is not merely infinitesimal: it exactly recovers the static displacement-covariance formulation.

First let $\pi\in\Couplings(\alpha_0,\alpha_1)$ , let $(X,Y)\sim\pi$ , set $Z_t=(1-t)X+tY$ and $\alpha_t=(Z_t)_\sharp\pi$ , and define the Eulerian velocity as the conditional mean $v_t(z)=\mathbb E[Y-X\mid Z_t=z]$ . Then $(\alpha_t,v_t)$ solves the continuity equation. If $M_\pi=\int(x-y)(x-y)^\top\d\pi(x,y)$ and $C_t=\int v_t(z)v_t(z)^\top\d\alpha_t(z)$ , conditional Jensen gives $C_t\preceq M_\pi$ : for every $u\in\RR^d$ ,

u^\top C_tu = \mathbb E\!\left[\mathbb E[\langle u,Y-X\rangle\mid Z_t]^2\right] \leq \mathbb E[\langle u,Y-X\rangle^2] = u^\top M_\pi u .

(68)

Since $\gamma$ is monotone for the Loewner order, $\int_0^1\gamma(C_t)\d t\leq\gamma(M_\pi)$ . Infimizing over $\pi$ gives $\mathsf W_{\gamma,\mathrm{dyn}}^2\leq\Wass_\gamma^2$ .

Conversely, let $(\alpha_t,v_t)$ be a finite-action competitor. Since $\gamma$ is a finite positive gauge on the finite-dimensional cone $\mathbb S_+^d$ , it is equivalent to the trace on this cone; hence finite spectral action gives finite kinetic energy. By the superposition principle, the competitor is represented by a probability law $\eta$ on absolutely continuous paths satisfying $\dot\omega_t=v_t(\omega_t)$ . For the endpoint coupling $\pi=(e_0,e_1)_\sharp\eta$ , Jensen along each path gives

(\omega_1-\omega_0)(\omega_1-\omega_0)^\top \preceq \int_0^1\dot\omega_t\dot\omega_t^\top\d t .

(69)

After integration over the path law, $M_\pi\preceq\int_0^1 C_t\d t$ . Therefore monotonicity and convexity of $\gamma$ imply

\gamma(M_\pi) \leq \gamma\!\left(\int_0^1 C_t\d t\right) \leq \int_0^1\gamma(C_t)\d t .

(70)

The static value is thus no larger than any dynamic action. The crucial hypothesis is the monotonicity of $\gamma$ : the proof only produces Loewner-order comparisons of covariance matrices, and these comparisons control the action only for monotone gauges.
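
The chain of inequalities (69)--(70) can be checked numerically on a toy case. The sketch below is only illustrative: it assumes a single deterministic path (so the path law is a Dirac mass), discretizes it by a piecewise-linear curve, and takes the largest eigenvalue as the gauge $\gamma$, which is one example of a monotone convex gauge and not a choice made in the text. The path and all function names are hypothetical.

```python
import numpy as np

def spectral_gauge(M):
    """Largest eigenvalue of a symmetric PSD matrix (assumed example of a monotone convex gauge)."""
    return float(np.linalg.eigvalsh(M)[-1])

# One smooth path in R^2 sampled on a uniform time grid (illustrative choice).
n = 2000
t = np.linspace(0.0, 1.0, n + 1)
omega = np.stack([np.sin(2.0 * t), t**2 + 0.5 * t], axis=1)

# Velocities of the piecewise-linear interpolant on each cell of width dt.
dt = 1.0 / n
vel = np.diff(omega, axis=0) / dt

disp = omega[-1] - omega[0]
M_static = np.outer(disp, disp)            # left-hand side of (69)
C = np.einsum("ti,tj->tij", vel, vel)      # velocity outer products, one per cell
M_dynamic = C.sum(axis=0) * dt             # right-hand side of (69)

# Loewner comparison: M_dynamic - M_static should be positive semidefinite.
gap_min_eig = float(np.linalg.eigvalsh(M_dynamic - M_static)[0])

# The three quantities compared in (70).
gamma_static = spectral_gauge(M_static)
gamma_of_mean = spectral_gauge(M_dynamic)
mean_of_gamma = float(sum(spectral_gauge(Ct) for Ct in C) * dt)

print(gap_min_eig, gamma_static, gamma_of_mean, mean_of_gamma)
```

For the piecewise-linear path the displacement is exactly the time average of the cell velocities, so the Loewner gap is a covariance matrix of velocities and the comparison holds up to rounding.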

The use of this geometry for normalized flows, including the operator-gauge connection with Muon-type normalization, is developed in Dynamic Spectral Wasserstein Flows.

Kernelized Benamou--Brenier Distances

A different way to deform the Benamou--Brenier geometry is to keep the local continuity equation but to measure velocities in a reproducing-kernel Hilbert space rather than in $L^2(\alpha)$ . This construction is motivated by Stein variational gradient descent, studied later in Stein Variational Gradient Descent: the kernel makes the velocity field smooth and computable from particles, at the price of defining a much more restrictive transport geometry.

Let $k$ be a positive definite kernel on $\RR^d$ with scalar RKHS $\RKHS_k$ . The vector-valued RKHS is

\RKHS_k^d\eqdef \RKHS_k\times\cdots\times\RKHS_k, \qquad \norm{v}_{\RKHS_k^d}^2 \eqdef \sum_{\ell=1}^d\norm{v_\ell}_{\RKHS_k}^2.

(71)

This is the vector-valued analogue of the scalar RKHS norm used for MMDs in Dual RKHS Norms and Maximum Mean Discrepancies. The specific kernelized tangent action is

\mathbb A_k(\alpha,v) \eqdef \norm{v}_{\RKHS_k^d}^2, \qquad \mathcal W_k^2 \eqdef \mathsf D_{\mathbb A_k}^2,

(72)

where the general distance formula (38) is understood with the restricted admissible tangent class $v_t\in\RKHS_k^d$ . The action itself is independent of $\alpha$ ; the measure only enters through the continuity equation $\partial_t\alpha_t+\diverg(\alpha_t v_t)=0$ , which says how the common smooth velocity field moves all particles. This type of Stein geometry was introduced in the analysis of SVGD (Liu & Wang, 2016; Liu, 2017) and later developed geometrically in (Duncan et al., 2023; Nüsken & Renger, 2023). The important caveat is that the admissible tangent space is the smooth RKHS class, not the whole Wasserstein tangent space.

One should read $\mathcal W_k$ as an extended distance on finite-action components, not as a replacement for $\Wass_2$ on all of $\Pp_2(\RR^d)$ . A useful sufficient condition for finiteness is that the endpoints lie on the same RKHS-flow orbit: if there exists $v\in L^2([0,1];\RKHS_k^d)$ whose flow map $\Phi_t$ solves $\dot\Phi_t=v_t\circ\Phi_t$ and satisfies $\alpha_1=(\Phi_1)_\sharp\alpha_0$ , then $\mathcal W_k^2(\alpha_0,\alpha_1)\leq\int_0^1\norm{v_t}_{\RKHS_k^d}^2\d t$ . In particular, for strictly positive definite kernels, two discrete measures with the same weights and distinct moving support points are at finite distance whenever their atoms can be connected by noncolliding smooth paths, because RKHS interpolation constructs vector fields realizing the prescribed atom velocities along the paths.
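
The RKHS interpolation step can be made concrete. The following is a minimal sketch, assuming a Gaussian kernel (one example of a strictly positive definite kernel) and distinct atoms: it builds the minimal-norm field $v=\sum_j k(\cdot,x_j)c_j$ matching prescribed atom velocities $u_i$ by solving the Gram system, and evaluates $\norm{v}_{\RKHS_k^d}^2=\sum_\ell u_\ell^\top G^{-1}u_\ell$. The function names, bandwidth, and test points are hypothetical.

```python
import numpy as np

def gaussian_gram(X, Y, bandwidth=1.0):
    """Gram matrix k(x_i, y_j) of a Gaussian kernel (assumed kernel choice)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth**2))

def min_norm_velocity(X, U, bandwidth=1.0):
    """Coefficients C of v(x) = sum_j k(x, x_j) C[j] with v(x_i) = U[i], and its squared RKHS norm."""
    G = gaussian_gram(X, X, bandwidth)
    C = np.linalg.solve(G, U)          # one linear system per velocity component
    sq_norm = float(np.sum(U * C))     # sum over components of u_l^T G^{-1} u_l
    return C, sq_norm

def evaluate_velocity(Xq, X, C, bandwidth=1.0):
    """Evaluate the interpolating field at query points Xq."""
    return gaussian_gram(Xq, X, bandwidth) @ C

# Three distinct atoms in the plane with prescribed velocities.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.5]])
U = np.array([[1.0, 0.0], [0.0, -1.0], [0.5, 0.5]])

C, sq_norm = min_norm_velocity(X, U)
V_at_atoms = evaluate_velocity(X, X, C)
print(V_at_atoms, sq_norm)
```

The Gram matrix degenerates as two atoms approach each other, which is the numerical counterpart of the noncollision requirement on the atom paths.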

The same condition also explains the limitation. If $k$ is smooth enough that $\RKHS_k^d$ embeds into Lipschitz vector fields, finite-action curves are induced by regular flows. Atomic measures remain atomic with the same number of atoms, so a Dirac mass cannot be transported at finite kernelized action to a measure with a density. This lack of splitting is precisely what makes the geometry useful for deterministic particle methods, and also what makes it a nontrivial extended object rather than a full probability metric.

Nonlocal Wasserstein Distances

The local dynamic distances above transport mass through a vector field on the base space and a classical continuity equation. Nonlocal geometries use a different tangent model: the elementary motion is an exchange across an edge or a jump from $x$ to $y$ . The common data are a reversible kernel $K$ , a symmetric edge or jump measure $\mathsf J$ , a pairwise increment $\bar\nabla$ , and a pair-space action $\mathbb A_K$ . The tangent variable is therefore attached to pairs of points, not to a single point $x$ , so these constructions are not simply obtained by choosing another pointwise local action-density $A(\rho(x),m(x))$ .

There are two complementary versions. On a finite state space, the goal is to put a Wasserstein-like geometry on the probability simplex so that the entropy gradient flow is exactly a prescribed reversible Markov chain (Maas, 2011; Mielke, 2013; Chow et al., 2012). On a continuum space, the same edge calculus becomes a jump calculus over $\mathcal X\times\mathcal X$ , which models nonlocal motion, heavy-tailed jumps, and fractional-type diffusion; this is the construction of Erbar (2014), building on nonlinear mobilities (Dolbeault et al., 2009), with subsequent metric and asymptotic refinements (Slepčev & Warren, 2022).

In both settings the canonical mobility for entropy is the logarithmic mean

\theta(a,b)\eqdef \begin{cases} \displaystyle\frac{a-b}{\log a-\log b}, & a\neq b,\\[.4em] a, & a=b, \end{cases}

(75)

with the usual lower-semicontinuous extension at $a=0$ or $b=0$ . It appears because $\theta(a,b)(\log a-\log b)=a-b$ , which is the edge-wise chain rule identifying entropy-driven flows with the underlying reversible Markov or jump dynamics.
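
As a small illustration, the logarithmic mean and its chain-rule identity can be coded directly. This is a sketch under two assumptions that are not spelled out in the text: the extension at a zero argument is taken to be the limiting value $0$, and the `log1p` rewriting $\log a-\log b=\log(1+(a-b)/b)$ is used only to reduce cancellation when $a\approx b$.

```python
import math

def log_mean(a, b):
    """Logarithmic mean theta(a, b) for nonnegative scalars."""
    a, b = float(a), float(b)
    if a < 0.0 or b < 0.0:
        raise ValueError("arguments must be nonnegative")
    if a == b:
        return a                      # diagonal value, includes theta(0, 0) = 0
    if a == 0.0 or b == 0.0:
        return 0.0                    # assumed extension: the limit as one argument vanishes
    # log(a) - log(b) = log(1 + (a - b) / b), evaluated stably for a close to b
    return (a - b) / math.log1p((a - b) / b)

a, b = 3.0, 0.5
theta = log_mean(a, b)
chain_rule_residual = theta * (math.log(a) - math.log(b)) - (a - b)
print(theta, chain_rule_residual)
```

The value always lies between the geometric and arithmetic means of its arguments, which gives a quick sanity check on any implementation.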

Continuum Jump Kernels

The continuum version replaces graph edges by a symmetric measure on pairs. Its action is still quadratic, but the tangent variable is an antisymmetric jump velocity $v(x,y)$ , and the mobility depends simultaneously on the two endpoint densities $\rho(x)$ and $\rho(y)$ . It is best viewed as a convex action on the pair space $\mathcal X\times\mathcal X$ , rather than as an integral of independent costs attached to single base points $x$ .

Let $(\mathcal X,\mathfrak m)$ be a reference measure space, and let $K(x,\cdot)$ be a nonnegative measure on $\mathcal X$ for each $x\in\mathcal X$ , possibly of infinite total mass. We write this kernel as $K(x,\d y)$ to emphasize that the integration variable is $y$ . The pair measure $\mathsf J$ on $\mathcal X\times\mathcal X$ is defined by testing against nonnegative measurable functions $\Phi$ :

\int_{\mathcal X\times\mathcal X}\Phi(x,y)\,\mathsf J(\d x,\d y) \eqdef \int_{\mathcal X}\left(\int_{\mathcal X}\Phi(x,y)\,K(x,\d y)\right)\mathfrak m(\d x).

(76)

The reversibility assumption is precisely that this measure $\mathsf J$ is symmetric, i.e. invariant under $(x,y)\mapsto(y,x)$ . For a density $\rho=\d\alpha/\d\mathfrak m$ , write

\bar\nabla \varphi(x,y)\eqdef \varphi(y)-\varphi(x)

(77)

for the nonlocal gradient, and use the logarithmic mean $\theta$ defined in (75). A curve $\alpha_t=\rho_t\mathfrak m$ driven by an antisymmetric velocity $v_t(x,y)=-v_t(y,x)$ satisfies the nonlocal continuity equation if, for all test functions $\varphi$ ,

\frac{\d}{\d t}\int \varphi\,\d\alpha_t = \frac12 \iint \bar\nabla\varphi(x,y)\, v_t(x,y)\, \theta(\rho_t(x),\rho_t(y))\, \mathsf J(\d x,\d y).

(78)

The corresponding pair-space tangent action is

\mathbb A_K(\alpha,v) \eqdef \frac12 \iint |v(x,y)|^2 \theta(\rho(x),\rho(y))\, \mathsf J(\d x,\d y),

(79)

for $\alpha=\rho\mathfrak m$ . This is the nonlocal analogue of a tangent action; here $v$ is not a vector field on $\mathcal X$ but an antisymmetric velocity on pairs $(x,y)$ .

The nonlocal transport distance is

\mathcal W_K^2(\alpha_0,\alpha_1) \eqdef \inf_{\rho_t,v_t} \int_0^1 \mathbb A_K(\alpha_t,v_t)\,\d t,

(80)

where the infimum is over curves solving (78) with endpoints $\alpha_0,\alpha_1$ .

We use the analytic compactness and lower-semicontinuity theorem of Erbar (2014) for the logarithmic-mean action. Namely, action-bounded sequences of admissible curves are compact for the narrow topology, the weak nonlocal continuity equation is closed under this convergence, and the action is lower semicontinuous.

Nonnegativity is immediate from the definition of $\mathbb A_K(\alpha,v)$ . If $\alpha_0=\alpha_1$ , the constant curve $\rho_t=\rho_0$ , $v_t=0$ , is admissible and has zero action.

Symmetry follows by time reversal. If $(\rho_t,v_t)$ transports $\alpha_0$ to $\alpha_1$ , set $\tilde\rho_t=\rho_{1-t}$ and $\tilde v_t=-v_{1-t}$ . The weak continuity equation is preserved by this change of time, and the quadratic action is unchanged. Thus $\mathcal W_K(\alpha_0,\alpha_1)=\mathcal W_K(\alpha_1,\alpha_0)$ .

For the triangle inequality, let $(\rho^0_t,v^0_t)$ connect $\alpha_0$ to $\alpha_1$ with action $A_0$ , and let $(\rho^1_t,v^1_t)$ connect $\alpha_1$ to $\alpha_2$ with action $A_1$ . For $0<\zeta<1$ , concatenate the two curves by

(\rho_t,v_t)= \begin{cases} \bigl(\rho^0_{t/\zeta},\,\zeta^{-1}v^0_{t/\zeta}\bigr), &0\leq t\leq\zeta,\\[.35em] \bigl(\rho^1_{(t-\zeta)/(1-\zeta)},\,(1-\zeta)^{-1}v^1_{(t-\zeta)/(1-\zeta)}\bigr), &\zeta<t\leq1. \end{cases}

(81)

The velocity factors are exactly those required by the weak continuity equation after time rescaling. Since $v\mapsto\mathbb A_K(\alpha,v)$ is quadratic, the concatenated action is

\frac{A_0}{\zeta}+\frac{A_1}{1-\zeta}.

(82)

Optimizing in $\zeta$ , for instance taking $\zeta=\sqrt{A_0}/(\sqrt{A_0}+\sqrt{A_1})$ when both actions are positive, gives the action $(\sqrt{A_0}+\sqrt{A_1})^2$ . Taking infima over the two curves proves the triangle inequality.
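
The optimization in $\zeta$ is elementary, and a quick numerical check may help. The sketch below uses the arbitrary illustrative values $A_0=4$ and $A_1=9$ and compares the stated minimizer with a brute-force grid search over $\zeta$.

```python
import numpy as np

def concatenated_action(A0, A1, zeta):
    """Action of the concatenated curve from (82) for a switching time zeta in (0, 1)."""
    return A0 / zeta + A1 / (1.0 - zeta)

A0, A1 = 4.0, 9.0                                        # illustrative actions
zeta_star = np.sqrt(A0) / (np.sqrt(A0) + np.sqrt(A1))    # stated minimizer
best = concatenated_action(A0, A1, zeta_star)
target = (np.sqrt(A0) + np.sqrt(A1)) ** 2

# Brute-force comparison on a grid of switching times.
grid = np.linspace(0.01, 0.99, 981)
grid_min = float(concatenated_action(A0, A1, grid).min())
print(zeta_star, best, target, grid_min)
```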

If $\mathcal W_K(\alpha_0,\alpha_1)=0$ , choose admissible curves with actions tending to zero. Compactness and lower semicontinuity give a limiting admissible curve of zero action. Hence $v_t=0$ for $\theta(\rho_t(x),\rho_t(y))\mathsf J(\d x,\d y)\d t$ -a.e. $(t,x,y)$ , and the weak continuity equation gives

\frac{\d}{\d t}\int\varphi\,\d\alpha_t=0

(83)

for every admissible test function $\varphi$ . The irreducibility/separation assumption in Erbar (2014) ensures that these test functions determine the measure, so $\alpha_t$ is constant and $\alpha_0=\alpha_1$ .

Finally, if $\mathcal W_K(\alpha_0,\alpha_1)<+\infty$ , the same direct-method compactness applied to a minimizing sequence gives a minimizer. Reparametrizing this minimizing curve by metric arclength gives a constant-speed curve; after this parametrization,

\mathcal W_K(\alpha_s,\alpha_t)=(t-s)\mathcal W_K(\alpha_0,\alpha_1), \qquad 0\leq s<t\leq1.

(84)

The consequences for entropy dynamics and fractional PDE examples are developed in the nonlocal Wasserstein-flow section below.

For a fixed jump kernel this geometry is genuinely nonlocal and does not coincide with ordinary $\Wass_2$ . The local metric is nevertheless recovered in a small-jump limit. More explicitly, on $\mathcal X=\mathbb R^d$ , let $\eta(z)=\bar\eta(\lVert z\rVert)$ be a nonnegative radial profile with

M_2(\eta):=\int_{\mathbb R^d}\lVert z\rVert^2\eta(z)\,\mathrm dz \in(0,+\infty),

(85)

and define

K_\varepsilon(x,\mathrm dy) :=\eta_\varepsilon(y-x)\,\mathrm dy, \qquad \eta_\varepsilon(z):=\varepsilon^{-d}\eta(z/\varepsilon).

(86)

Radiality implies symmetry and isotropy, and a change of variables gives

\int\lVert y-x\rVert^2K_\varepsilon(x,\mathrm dy) =\varepsilon^2M_2(\eta), \qquad \int (y-x)(y-x)^\top K_\varepsilon(x,\mathrm dy) =\varepsilon^2\frac{M_2(\eta)}{d}\operatorname{Id}.

(87)

Equation (87) is the precise sense in which the second-moment jump scale is $\varepsilon$ . To obtain a nontrivial local limit, one simultaneously accelerates the jump rate by $\varepsilon^{-2}$ and sets

\widehat K_\varepsilon :=\frac{2d}{\varepsilon^2M_2(\eta)}K_\varepsilon, \qquad \int (y-x)(y-x)^\top\widehat K_\varepsilon(x,\mathrm dy) =2\operatorname{Id}.

(88)

Multiplying a jump kernel by $c>0$ divides the associated distance by $\sqrt c$ . Hence, under the regularity and irreducibility hypotheses of Slepčev & Warren (2022), for endpoints supported in a fixed compact set,

\mathcal W_{\widehat K_\varepsilon} =\varepsilon\sqrt{\frac{M_2(\eta)}{2d}}\, \mathcal W_{K_\varepsilon} \longrightarrow\Wass_2 \qquad(\varepsilon\to0).

(89)

This makes precise how sharply concentrated isotropic jumps recover the local Benamou--Brenier geometry. Without isotropy, the covariance matrix in (87) need not be proportional to the identity, and the limit is instead an anisotropic Wasserstein geometry.
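
The moment identities (87) and the normalization (88) can be verified by quadrature. The sketch below assumes a specific profile that the text does not single out, the standard Gaussian density in dimension $d=2$, for which $M_2(\eta)=2$. The grid parameters are arbitrary.

```python
import numpy as np

def jump_covariance(eps, half_width=8.0, n=401):
    """Grid quadrature of int z z^T eta_eps(z) dz for a standard Gaussian profile in d = 2."""
    s = np.linspace(-half_width * eps, half_width * eps, n)
    h = s[1] - s[0]
    Z1, Z2 = np.meshgrid(s, s, indexing="ij")
    # eta(z) = exp(-|z|^2 / 2) / (2 pi), and eta_eps(z) = eps^{-2} eta(z / eps)
    eta_eps = np.exp(-(Z1**2 + Z2**2) / (2.0 * eps**2)) / (2.0 * np.pi * eps**2)
    w = eta_eps * h * h
    return np.array([
        [np.sum(Z1 * Z1 * w), np.sum(Z1 * Z2 * w)],
        [np.sum(Z2 * Z1 * w), np.sum(Z2 * Z2 * w)],
    ])

d = 2
M2 = 2.0      # second moment of the assumed Gaussian profile in d = 2
eps = 0.1

cov = jump_covariance(eps)                       # expected: eps^2 * (M2 / d) * Id
cov_hat = (2.0 * d / (eps**2 * M2)) * cov        # rescaled kernel of (88), expected: 2 * Id
print(cov, cov_hat)
```

Replacing the radial profile by an anisotropic one in this sketch produces a covariance that is no longer proportional to the identity, in line with the closing remark above.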

Discrete Wasserstein Distances on Markov Chains

The finite-state version keeps the same pair-space philosophy, but with a finite graph of admissible exchanges. It is not the naive Euclidean metric on the simplex. The key idea, introduced by Maas and independently developed in related forms by Mielke and by Chow--Huang--Li--Zhou, is to use the transition graph of a reversible Markov chain to define both the admissible directions and the mobility of the mass (Maas, 2011; Mielke, 2013; Chow et al., 2012). The entropy gradient-flow interpretation is stated later in Proposition: Entropy Gradient Flow of a Reversible Markov Chain.

Let $\mathcal X=\{1,\ldots,n\}$ and let $K=(K_{ij})$ denote the off-diagonal transition rates of an irreducible continuous-time Markov chain reversible with respect to a probability vector $\pi$ , so that $\pi_iK_{ij}=\pi_jK_{ji}$ for $i\neq j$ . Write

\mathsf J_{ij}\eqdef \pi_iK_{ij}=\pi_jK_{ji}

(90)

for the symmetric edge measure, the finite counterpart of the jump measure used above. The transported object is a mass histogram

\Sigma_n\eqdef\left\{a\in\RR_+^n:\sum_i a_i=1\right\}.

(91)

Relative densities only enter as auxiliary variables with respect to the invariant law:

\rho_i(a)\eqdef \frac{a_i}{\pi_i}, \qquad a_i=\pi_i\rho_i(a).

(92)

The logarithmic mean $\theta$ defined in (75) is the mobility selected so that the entropy calculus later recovers exactly the Markov evolution. The identity $\theta(a,b)(\log a-\log b)=a-b$ converts entropy gradients into density differences along graph edges. For a potential $\psi\in\RR^n$ , set

\bar\nabla\psi(i,j)\eqdef\psi_j-\psi_i.

(93)

The finite nonlocal divergence is encoded by the density Onsager operator and by its mass form

(\mathcal K_\rho\psi)_i \eqdef \sum_j K_{ij}\theta(\rho_i,\rho_j)(\psi_i-\psi_j), \qquad (\mathcal L_a\psi)_i \eqdef \pi_i(\mathcal K_{\rho(a)}\psi)_i,

(94)

with tangent action

\mathbb A_K(a,\psi) \eqdef \frac12\sum_{i,j}\mathsf J_{ij}\theta(\rho_i(a),\rho_j(a)) (\bar\nabla\psi(i,j))^2.

(95)

This is the finite-state squared tangent action; the tangent variable is the potential $\psi$ , or equivalently the induced edge flux, rather than an ambient Euclidean vector field.

The discrete transport distance is

\mathcal W_K^2(a_0,a_1) \eqdef \inf_{a_t,\psi_t} \int_0^1\mathbb A_K(a_t,\psi_t)\,\d t, \qquad \dot a_t+\mathcal L_{a_t}\psi_t=0,

(96)

with endpoint conditions $a_{t=0}=a_0$ and $a_{t=1}=a_1$ . Equivalently, one can write the same formula in edge-flux variables: flux is only allowed along edges where $K_{ij}>0$ , and the denominator in the kinetic energy is the logarithmic mean of the two relative endpoint densities $\rho_i(a)=a_i/\pi_i$ .
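
The operators in (94) and the action (95) are small enough to code directly. The following sketch assumes a strictly positive histogram and uses a hypothetical three-state birth--death chain, reversible with respect to $\pi=(1/4,1/2,1/4)$. It checks two structural facts: $\mathcal L_a\psi$ sums to zero, so mass is conserved, and $\langle\psi,\mathcal L_a\psi\rangle=\mathbb A_K(a,\psi)$.

```python
import numpy as np

def log_mean(x, y):
    """Logarithmic mean of strictly positive arrays, with theta(a, a) = a."""
    x, y = np.broadcast_arrays(np.asarray(x, dtype=float), np.asarray(y, dtype=float))
    out = x.copy()                       # correct value wherever x == y
    off = x != y
    out[off] = (x[off] - y[off]) / (np.log(x[off]) - np.log(y[off]))
    return out

def edge_weights(a, K, pi):
    """Symmetric weights J_ij * theta(rho_i, rho_j), with rho = a / pi."""
    rho = a / pi
    J = pi[:, None] * K
    return J * log_mean(rho[:, None], rho[None, :])

def mass_operator(a, psi, K, pi):
    """(L_a psi)_i = sum_j J_ij theta(rho_i, rho_j) (psi_i - psi_j)."""
    W = edge_weights(a, K, pi)
    return W.sum(axis=1) * psi - W @ psi

def tangent_action(a, psi, K, pi):
    """A_K(a, psi) = 1/2 sum_ij J_ij theta(rho_i, rho_j) (psi_j - psi_i)^2."""
    W = edge_weights(a, K, pi)
    diff = psi[None, :] - psi[:, None]
    return 0.5 * float(np.sum(W * diff**2))

# Hypothetical reversible chain: off-diagonal rates K and invariant law pi.
K = np.array([[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]])
pi = np.array([0.25, 0.5, 0.25])
J = pi[:, None] * K

a = np.array([0.2, 0.5, 0.3])
psi = np.array([0.0, 1.0, -1.0])
L_psi = mass_operator(a, psi, K, pi)
action = tangent_action(a, psi, K, pi)
print(L_psi, action)
```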

The first nontrivial finite Markov geometries already show how the logarithmic mean bends the simplex. In both examples below, take the uniform random walk on the complete neighbor graph, so $\pi_i=1/n$ and $K_{ij}=1/(n-1)$ for $i\neq j$ .

Example: Three-point complete graph

On $\Sigma_3$ , the complete-neighbor graph is a triangle. For $a\in\operatorname{int}(\Sigma_3)$ , set

\Theta_{ij}(a)\eqdef\frac12\theta(a_i,a_j), \qquad 1\leq i<j\leq3.

(98)

For a fixed $a$ , write $\Theta_{ij}=\Theta_{ij}(a)$ .

For a tangent vector $h\in\RR^3$ with $h_1+h_2+h_3=0$ , orient the edges as $1\to2$ , $1\to3$ , $2\to3$ . The squared norm induced by the discrete Wasserstein metric is

\|h\|_a^2 = \min_{m_{12},m_{13},m_{23}} \left\{ \frac{m_{12}^2}{\Theta_{12}}+ \frac{m_{13}^2}{\Theta_{13}}+ \frac{m_{23}^2}{\Theta_{23}} \right\},

(99)

subject to

h_1+m_{12}+m_{13}=0, \qquad h_2-m_{12}+m_{23}=0, \qquad h_3-m_{13}-m_{23}=0.

(100)

The constraints (100) leave a single free flux: expressing $m_{13}$ and $m_{23}$ through $m_{12}$ and minimizing over $m_{12}$ gives an explicit formula. With $D=\Theta_{12}^{-1}+\Theta_{13}^{-1}+\Theta_{23}^{-1}$ ,

m_{12}^*=\frac{h_2/\Theta_{23}-h_1/\Theta_{13}}{D}, \qquad m_{13}^*=-h_1-m_{12}^*, \qquad m_{23}^*=m_{12}^*-h_2,

(101)

and $\|h\|_a^2$ is obtained by inserting these values in (99). Therefore

\mathcal W_K^2(a_0,a_1) = \inf_{a_t\in\operatorname{int}(\Sigma_3)} \int_0^1\|\dot a_t\|_{a_t}^2\,\d t, \qquad a_{t=0}=a_0, \quad a_{t=1}=a_1.

(102)

Thus the three-state distance is an explicit two-dimensional Riemannian geodesic problem on the open triangle. The formula is simple enough to compute directly, but it already shows the main difference with Euclidean geometry on the simplex: the local metric depends nonlinearly on the current density through logarithmic edge mobilities.
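
The closed-form local norm is easy to evaluate. The sketch below implements (99)--(101) for an interior point of $\Sigma_3$ and cross-checks the value against an independent computation that is not taken from the text: the minimal flux energy equals $\langle h,\psi\rangle$, where $\psi$ solves the weighted graph-Laplacian system $L\psi=h$ with the third potential fixed to zero. The sample point and tangent vector are arbitrary.

```python
import math
import numpy as np

def log_mean(a, b):
    """Logarithmic mean of two strictly positive scalars."""
    return a if a == b else (a - b) / (math.log(a) - math.log(b))

def conductances(a):
    """Edge mobilities Theta_ij(a) = theta(a_i, a_j) / 2 on the triangle."""
    return {(i, j): 0.5 * log_mean(a[i], a[j]) for (i, j) in [(0, 1), (0, 2), (1, 2)]}

def norm_sq_closed_form(a, h):
    """Squared local norm from the explicit optimal fluxes (101)."""
    T = conductances(a)
    T12, T13, T23 = T[(0, 1)], T[(0, 2)], T[(1, 2)]
    D = 1.0 / T12 + 1.0 / T13 + 1.0 / T23
    m12 = (h[1] / T23 - h[0] / T13) / D
    m13 = -h[0] - m12
    m23 = m12 - h[1]
    return m12**2 / T12 + m13**2 / T13 + m23**2 / T23

def norm_sq_laplacian(a, h):
    """Same quantity from a reduced weighted-Laplacian solve (independent cross-check)."""
    L = np.zeros((3, 3))
    for (i, j), t in conductances(a).items():
        L[i, i] += t
        L[j, j] += t
        L[i, j] -= t
        L[j, i] -= t
    psi = np.linalg.solve(L[:2, :2], h[:2])   # third potential grounded at zero
    return float(h[:2] @ psi)

a = np.array([0.5, 0.3, 0.2])       # interior point of the simplex
h = np.array([0.1, -0.3, 0.2])      # tangent vector with zero sum
closed = norm_sq_closed_form(a, h)
solved = norm_sq_laplacian(a, h)
print(closed, solved)
```

Evaluating this norm along a discretized curve and summing gives a direct numerical approximation of the action in (102).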

The next figure visualizes these small-dimensional geometries and compares them with the ordinary Wasserstein distance associated with the $0/1$ ground metric, for which $\Wass_2^2$ is exactly the total variation distance.

Discrete Wasserstein distances on small Markov-chain simplices. The left panel shows the closed-form profiles $r\mapsto \mathcal W_K(a_r,a_{r_0})$ , with $a_r=(r,1-r)$ , for several anchors $r_0$ on $\Sigma_2$ . The middle panel shows numerical level sets of $\mathcal W_K(a,\bar a)$ on $\Sigma_3$ , where $\bar a=(1/3,1/3,1/3)$ , using the local Riemannian norm induced by the complete-neighbor Markov chain. The right panel shows the corresponding level sets for the ordinary $W_2$ distance with $d(i,j)=1$ for $i\neq j$ , so that $W_2^2(a,\bar a)=\norm{a-\bar a}_{\mathrm{TV}}$ .

Dynamic Unbalanced Wasserstein Distances

Balanced dynamic distances keep the total mass fixed: their tangent vectors are transport velocities or fluxes satisfying a continuity equation. Unbalanced distances use a different tangent model. A tangent vector now has both a spatial component and a reaction component, so mass can move, disappear, and reappear. This section isolates this reaction--transport geometry before its use for gradient flows in Dynamic Unbalanced OT and WFR Flows.

Balance Equation and Tangent Variables

Unbalanced dynamic transport is obtained by allowing mass to be created and destroyed along the path. At the density level, the continuity equation becomes a balance equation and an admissible tangent direction is a pair $(m,s)$ : the flux density $m$ transports mass, while the source density $s$ changes its amount locally. This formulation underlies the Hellinger--Kantorovich and Wasserstein--Fisher--Rao metrics (Liero et al., 2016; Chizat et al., 2018); its equivalence with static entropy-transport and cone formulations is developed in (Liero et al., 2018; Chizat et al., 2018).

A representative balance equation and the associated quadratic action are

\partial_t\rho_t+\nabla\!\cdot m_t=s_t, \qquad \int_0^1\!\int \left(\frac{|m_t|^2}{\rho_t}+\kappa^2\frac{s_t^2}{\rho_t}\right)\d x\,\d t,

(103)

with the usual perspective convention: zero flux and zero source through zero density cost nothing, whereas nonzero flux or source through zero density has infinite cost. At the measure level these densities become the vector-valued flux measure $\omega_t=m_t\,\d x$ and the signed source measure $\sigma_t=s_t\,\d x$ .

Reaction--Transport Action

The action attaches one price to displacement and another to local growth. For a density $a\geq0$ , a velocity $w\in\RR^d$ , and a relative growth rate $g\in\RR$ , define

A_\kappa(a,w,g) \eqdef a\bigl(\norm{w}^2+\kappa^2 g^2\bigr).

(104)

Thus, writing $m_t=\rho_t v_t$ and $s_t=\rho_t g_t$ , the smooth action is $\int_0^1\int A_\kappa(\rho_t,v_t,g_t)\,\d x\,\d t$ under $\partial_t\rho_t+\nabla\!\cdot(\rho_t v_t)=g_t\rho_t$ . The parameter $\kappa$ fixes the relative cost of reaction and transport.

For the convex measure formulation, set $m=aw$ and $r=ag$ . The corresponding three-variable perspective is

J_\kappa(a,m,r) \eqdef \begin{cases} \displaystyle\frac{\norm{m}^2+\kappa^2 r^2}{a}, & a>0,\\ 0, & a=0,\ m=0,\ r=0,\\ +\infty, & a=0,\ (m,r)\neq(0,0). \end{cases}

(105)

For measure-valued triples, $\alpha$ denotes the transported measure, $\omega$ the vector-valued flux measure, and $\sigma$ the signed source measure. If $\lambda$ dominates $\alpha$ , $|\omega|$ and $|\sigma|$ , define

\mathbb J_\kappa(\alpha,\omega,\sigma) \eqdef \int J_\kappa\!\left( \frac{\d\alpha}{\d\lambda}, \frac{\d \omega}{\d\lambda}, \frac{\d \sigma}{\d\lambda} \right)\d\lambda .

(106)

The one-homogeneity of $J_\kappa$ makes this definition independent of the chosen dominating measure. Finite action forces both the flux and source to be absolutely continuous with respect to the transported mass.
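
The relation between the velocity form (104) and the perspective form (105) can be sketched pointwise. The code below is a minimal illustration with arbitrary sample values; it checks that $J_\kappa(a,aw,ag)=A_\kappa(a,w,g)$ and that $J_\kappa$ is positively one-homogeneous, which is the property behind the independence of (106) from the dominating measure. Function names are hypothetical.

```python
import math
import numpy as np

def tangent_cost(a, w, g, kappa):
    """A_kappa(a, w, g) = a (|w|^2 + kappa^2 g^2) from (104)."""
    w = np.asarray(w, dtype=float)
    return a * (float(w @ w) + kappa**2 * g**2)

def perspective_cost(a, m, r, kappa):
    """J_kappa(a, m, r) from (105), with the zero-density conventions."""
    m = np.asarray(m, dtype=float)
    if a < 0:
        raise ValueError("density must be nonnegative")
    if a > 0:
        return (float(m @ m) + kappa**2 * r**2) / a
    if not m.any() and r == 0:
        return 0.0                 # nothing moves or reacts through zero density
    return math.inf                # flux or source through zero density

a, kappa, g = 0.8, 1.5, 0.5
w = np.array([1.0, -2.0])
velocity_form = tangent_cost(a, w, g, kappa)
momentum_form = perspective_cost(a, a * w, a * g, kappa)
scaled = perspective_cost(3.0 * a, 3.0 * a * w, 3.0 * a * g, kappa)
print(velocity_form, momentum_form, scaled)
```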

Static and Dynamic Viewpoints

The balance-equation formula is the least-action representation of the same cone distance used in static unbalanced OT. To make the normalization explicit, define on the cone $\mathfrak C[\RR^d]$ the squared cost

\Delta_\kappa\big((x,r),(y,s)\big)^2 \eqdef 4\kappa^2 \left[ r^2+s^2-2rs \cos\!\left( \frac{\norm{x-y}}{2\kappa}\wedge\frac{\pi}{2} \right) \right].

(107)

The radii encode masses through the weighted projection $\mathsf P_2$ defined in Unbalanced OT. Accordingly, set

\CW_\kappa(\alpha_0,\alpha_1) \eqdef \inf_{\substack{\gamma\in\Mm_+(\mathfrak C[\RR^d]^2)\\ \mathsf P_2\gamma_1=\alpha_0,\ \mathsf P_2\gamma_2=\alpha_1}} \int \Delta_\kappa\big((x,r),(y,s)\big)^2 \,\d\gamma(x,r,y,s).

(108)

For $\kappa=1/2$ , this is exactly the normalization of the Hellinger--Kantorovich cone cost used in Unbalanced OT.
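
To get a feel for the cone cost (107), the sketch below evaluates it in three regimes with arbitrary sample values. At equal base points it reduces to $4\kappa^2(r-s)^2$, a pure reaction cost. Beyond the cut-off distance $\pi\kappa$ it saturates at $4\kappa^2(r^2+s^2)$. For unit radii and nearby points it is close to $\norm{x-y}^2$. The function name is hypothetical.

```python
import math

def cone_cost_sq(x, r, y, s, kappa):
    """Squared cone cost Delta_kappa((x, r), (y, s))^2 from (107)."""
    angle = min(math.dist(x, y) / (2.0 * kappa), math.pi / 2.0)
    return 4.0 * kappa**2 * (r * r + s * s - 2.0 * r * s * math.cos(angle))

kappa = 0.7
same_point = cone_cost_sq((0.0, 0.0), 2.0, (0.0, 0.0), 0.5, kappa)   # pure reaction
far_apart = cone_cost_sq((0.0, 0.0), 2.0, (10.0, 0.0), 0.5, kappa)   # beyond the cut-off pi * kappa
nearby = cone_cost_sq((0.0, 0.0), 1.0, (1e-3, 0.0), 1.0, kappa)      # pure transport, short range
print(same_point, far_apart, nearby)
```

The saturation regime reflects that, past the cut-off, destroying mass at one location and recreating it at the other is as cheap as any transport between the two points.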

Proposition: Static/Dynamic Equivalence for Unbalanced OT

For nonnegative finite measures $\alpha_0,\alpha_1$ on $\RR^d$ , the dynamic value

\WFR_\kappa^2(\alpha_0,\alpha_1) \eqdef \inf_{\substack{\partial_t\alpha_t+\nabla\cdot \omega_t=\sigma_t\\ \alpha_{t=0}=\alpha_0,\ \alpha_{t=1}=\alpha_1}} \int_0^1 \mathbb J_\kappa(\alpha_t,\omega_t,\sigma_t)\,\d t

(109)

equals the static cone value (108). Hence $\WFR_\kappa=\CW_\kappa^{1/2}$ is the geodesic distance generated by the balance-equation least-action problem.

Balanced Versus Unbalanced Interpolations

The distinction is visible for mixtures with mismatched modal masses. Balanced transport must physically move excess mass, whereas unbalanced transport can trade transport against reaction. The next figure uses entropic balanced and KL-relaxed barycenters as a qualitative numerical surrogate: the unbalanced row illustrates the mechanism but is not asserted to be an exact $\WFR_\kappa$ geodesic.

Balanced and unbalanced Sinkhorn-barycenter interpolations between two one-dimensional Gaussian mixtures with swapped modal masses. The balanced row conserves total mass, so excess mass from the dominant left mode must move along the line toward the dominant right target mode, producing transient mass in the middle. The unbalanced row uses KL-relaxed marginal constraints; mass can be attenuated near overrepresented modes and recreated near underrepresented modes, giving a reaction--transport interpolation closer to the Wasserstein--Fisher--Rao intuition.

Together, the local, spectral, kernelized, jump, graph, and unbalanced examples show that modifying the Benamou--Brenier action changes both the admissible motion and the topology of the measure space. The next chapter turns these geometries into gradient-flow equations.

References

Dacorogna, B., & Moser, J. (1990). On a Partial Differential Equation Involving the Jacobian Determinant. Annales de l’Institut Henri Poincaré C, Analyse Non Linéaire, 7(1), 1–26.
Benamou, J.-D., & Brenier, Y. (2000). A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3), 375–393.
Ambrosio, L., Gigli, N., & Savaré, G. (2006). Gradient Flows in Metric Spaces and in the Space of Probability Measures. Springer.
Rockafellar, R. T. (2015). Convex Analysis. Princeton University Press.
Papadakis, N., Peyré, G., & Oudet, E. (2014). Optimal transport with proximal splitting. SIAM Journal on Imaging Sciences, 7(1), 212–238.
Beckmann, M. (1952). A continuous model of transportation. Econometrica, 20, 643–660.
Dolbeault, J., Nazaret, B., & Savaré, G. (2009). A new class of transport distances between measures. Calculus of Variations and Partial Differential Equations, 34(2), 193–231.
Peyré, G. (2026). Muon Dynamics as a Spectral Wasserstein Flow. arXiv Preprint arXiv:2604.04891.
Liu, Q., & Wang, D. (2016). Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. Advances in Neural Information Processing Systems, 29. https://arxiv.org/abs/1608.04471
Liu, Q. (2017). Stein Variational Gradient Descent as Gradient Flow. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/hash/17ed8abedc255908be746d245e50263a-Abstract.html
Duncan, A. B., Nüsken, N., & Szpruch, L. (2023). On the Geometry of Stein Variational Gradient Descent. Journal of Machine Learning Research, 24(56), 1–39. https://www.jmlr.org/papers/v24/20-602.html
Nüsken, N., & Renger, D. R. M. (2023). Stein Variational Gradient Descent: Many-Particle and Long-Time Asymptotics. Foundations of Data Science, 5(3), 286–320. https://doi.org/10.3934/fods.2022023
Maas, J. (2011). Gradient flows of the entropy for finite Markov chains. Journal of Functional Analysis, 261(8), 2250–2292.
Mielke, A. (2013). Geodesic convexity of the relative entropy in reversible Markov chains. Calculus of Variations and Partial Differential Equations, 48(1–2), 1–31.
Chow, S.-N., Huang, W., Li, Y., & Zhou, H. (2012). Fokker-Planck equations for a free energy functional or Markov process on a graph. Archive for Rational Mechanics and Analysis, 203(3), 969–1008.