Generalized Wasserstein Distances

The first family of extensions keeps the idea of a distance between measures, but changes the geometry used to compare them. The variants in this chapter relax mass conservation, reduce high-dimensional transport to one-dimensional projections, or replace the trace quadratic cost by spectral gauges and robust projected viewpoints.

These constructions are useful when the standard distance $\Wass_p$ is too rigid or too expensive. They preserve much of the metric intuition of optimal transport, but expose new controls: how expensive it is to delete mass, which projections should be trusted, and which directions of displacement should be penalized.

Unbalanced OT

Unbalanced OT allows mass creation and destruction by penalizing marginal mismatch. It is essential when histograms are not normalized, when observations contain outliers, or when only part of the source should match the target (Liero et al., 2018; Chizat et al., 2018; Chizat et al., 2018).

Relaxed Formulation

For nonnegative measures $(\alpha,\beta)\in\mathcal M_+(\X)\times\mathcal M_+(\Y)$, a generic relaxed formulation is

$$
\mathsf{UW}_c(\alpha,\beta) = \inf_{\pi\in\mathcal M_+(\X\times\Y)} \int_{\X\times\Y} c(x,y)\d\pi(x,y) + \mathcal D_{\psi_1}(\pi_1\mid\alpha) + \mathcal D_{\psi_2}(\pi_2\mid\beta),
$$

where $\psi_1,\psi_2$ are convex entropy functions. Exact conservation $(\pi_1,\pi_2)=(\alpha,\beta)$ is replaced by a cost for changing the marginals. Writing $\psi_s=\tau\bar\psi_s$ exposes the relaxation scale:

$$
\mathsf{UW}_{c,\tau}(\alpha,\beta) = \inf_{\pi\geq0} \int c\d\pi + \tau\mathcal D_{\bar\psi_1}(\pi_1\mid\alpha) + \tau\mathcal D_{\bar\psi_2}(\pi_2\mid\beta).
$$

Large $\tau$ makes marginal mismatch expensive and approaches balanced OT when the total masses are compatible. Small $\tau$ makes creation and destruction cheap; after rescaling by $\tau$, the zero-transport part reveals the pure divergence geometry.

Proof

For the upper bound, restrict to diagonal plans $\pi=(\Id,\Id)_\sharp\rho$, whose transport cost is zero and whose two marginals are both $\rho$. This gives the desired upper bound after optimizing over $\rho$.

For the lower bound, let $\tau_n\downarrow0$ and let $\pi_n$ be almost minimizing plans with bounded scaled values $\tau_n^{-1}\mathsf{UW}_{c,\tau_n}(\alpha,\beta)$. Since the divergences are nonnegative, $\int c\d\pi_n=O(\tau_n)$, hence $\int c\d\pi_n\to0$. The bounded scaled values also put the two marginals in compact divergence sublevel sets. Since a coupling has the same total mass as each marginal, the couplings are tight on $\X\times\X$. Up to subsequences, $\pi_n\rightharpoonup\pi_0$.

Lower semicontinuity of the transport cost yields $\int c\d\pi_0=0$, so $\pi_0$ is concentrated on the diagonal. Its two marginals are therefore equal to a common measure $\rho$. Lower semicontinuity of the marginal divergences gives

$$
\liminf_n \frac{1}{\tau_n} \mathsf{UW}_{c,\tau_n}(\alpha,\beta) \geq \mathcal D_{\bar\psi_1}(\rho\mid\alpha) + \mathcal D_{\bar\psi_2}(\rho\mid\beta),
$$

and optimizing over $\rho$ gives the lower bound.

In the dominated case, the minimization over $\rho=r\lambda$ decouples into the scalar envelope $\mathfrak m_{\bar\psi_1,\bar\psi_2}$. For KL, no singular part is admissible when $\alpha$ and $\beta$ are dominated by $\lambda$. The pointwise objective is $r\log(r/a)-r+a+r\log(r/b)-r+b$. Its optimality condition is $\log(r/a)+\log(r/b)=0$, hence $r=\sqrt{ab}$, and the minimum is $a+b-2\sqrt{ab}=(\sqrt a-\sqrt b)^2$.
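
The closed form for the KL envelope can be checked numerically. The sketch below is only an illustration: the helper names `kl_pointwise` and `golden_section_min` and the test values of `a` and `b` are hypothetical, and the scalar minimization uses a generic golden-section search rather than anything specific to the text.

```python
import math

def kl_pointwise(r, a, b):
    """Pointwise objective r*log(r/a) - r + a + r*log(r/b) - r + b, for r > 0."""
    return r * math.log(r / a) - r + a + r * math.log(r / b) - r + b

def golden_section_min(fun, lo, hi, iters=200):
    """Minimize a convex scalar function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = fun(x1), fun(x2)
    for _ in range(iters):
        if f1 < f2:
            # The minimizer lies in [lo, x2]: shrink from the right.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = fun(x1)
        else:
            # The minimizer lies in [x1, hi]: shrink from the left.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = fun(x2)
    return 0.5 * (lo + hi)

# Illustrative densities at one point (arbitrary values).
a, b = 0.5, 2.0
r_star = golden_section_min(lambda r: kl_pointwise(r, a, b), 1e-9, 10.0)
closed_form = (math.sqrt(a) - math.sqrt(b)) ** 2

print("numerical minimizer:", r_star, " sqrt(a*b):", math.sqrt(a * b))
print("numerical minimum:  ", kl_pointwise(r_star, a, b), " closed form:", closed_form)
```

The numerical minimizer should agree with $\sqrt{ab}$ and the minimum with the squared Hellinger-type expression $(\sqrt a-\sqrt b)^2$, up to the tolerance of the search.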

Proof

Use the variational formula for the dual of a divergence and introduce the marginal variables through continuous potentials:

$$
\inf_{\pi\geq0}\sup_{f,g} \int c\d\pi - \int f\d\pi_1 - \int g\d\pi_2 - \mathcal D_{\psi_1}^*(-f\mid\alpha) - \mathcal D_{\psi_2}^*(-g\mid\beta).
$$

Exchanging the infimum and supremum gives

$$
\sup_{f,g} - \mathcal D_{\psi_1}^*(-f\mid\alpha) - \mathcal D_{\psi_2}^*(-g\mid\beta) + \inf_{\pi\geq0} \int \big(c-(f\oplus g)\big)\d\pi .
$$

The last infimum is $0$ when $f\oplus g\leq c$ and $-\infty$ otherwise.

Reverse and Homogeneous Formulations

The Liero–Mielke–Savaré formulation rewrites marginal penalties as a local transport cost and then homogenizes it. Assuming first that the reference measures and transported marginals have mutually absolutely continuous parts, one can factor the objective as

$$
\begin{aligned}
&\int c(x,y)\d\pi(x,y) + \mathcal D_{\psi_1}(\pi_1\mid\alpha) + \mathcal D_{\psi_2}(\pi_2\mid\beta) \\
&\quad = \int \left( c(x,y) + \psi_1\!\left(\frac{\d\pi_1}{\d\alpha}(x)\right) \frac{\d\alpha}{\d\pi_1}(x) + \psi_2\!\left(\frac{\d\pi_2}{\d\beta}(y)\right) \frac{\d\beta}{\d\pi_2}(y) \right) \d\pi(x,y).
\end{aligned}
$$

This motivates the local reverse cost

$$
L_c(r,s) \eqdef c+r\psi_1(1/r)+s\psi_2(1/s),
$$

with the usual recession convention at $r=0$ or $s=0$. If $\alpha=F\pi_1+\alpha^\perp$ and $\beta=G\pi_2+\beta^\perp$ are the Lebesgue decompositions of the reference marginals with respect to the transported marginals, then

$$
\mathsf{UW}_c(\alpha,\beta) = \inf_{\pi\geq0} \int L_{c(x,y)}(F(x),G(y))\d\pi(x,y) + \psi_1(0)\alpha^\perp(\X) + \psi_2(0)\beta^\perp(\Y).
$$

The homogeneous formulation is obtained by taking the perspective transform of $L_c$,

$$
H_c(r,s) \eqdef \inf_{\theta>0} \theta L_c(r/\theta,s/\theta),
$$

which is positively 1-homogeneous. It defines

$$
\mathsf{HW}_c(\alpha,\beta) = \inf_{\pi\geq0} \int H_{c(x,y)}(F(x),G(y))\d\pi(x,y) + \psi_1(0)\alpha^\perp(\X) + \psi_2(0)\beta^\perp(\Y).
$$

Proof

The inequality $\mathsf{HW}\leq\mathsf{UW}$ follows from $H_c\leq L_c$ by taking $\theta=1$. Conversely, take a feasible measure $\pi$ in the homogeneous formulation. By definition of the perspective transform, for every $(x,y)$ and every $\eta>0$ there exists a scale $\theta(x,y)>0$ such that

$$
H_{c(x,y)}(F(x),G(y))+\eta \geq \theta(x,y) L_{c(x,y)} \big(F(x)/\theta(x,y),G(y)/\theta(x,y)\big).
$$

Replacing $\pi$ by the rescaled measure $\tilde\pi=\theta\pi$ and the densities by $F/\theta$ and $G/\theta$ gives an admissible competitor for the reverse formulation with cost no larger than the homogeneous cost plus $\eta\pi(\X\times\Y)$. Letting $\eta\to0$ yields $\mathsf{UW}\leq\mathsf{HW}$. The singular terms are unchanged because the same rescaling is performed before taking the Lebesgue decomposition of the marginals.

Conic Lifting

Assume now that $\X=\Y$ and $\psi_1=\psi_2=\psi$. The homogeneous formulation lifts the problem to the cone space $\mathfrak C[\X]\eqdef(\X\times\RR_+)/\sim$, where all points $(x,0)$ are identified at the apex. For an exponent $p\geq1$, define

$$
\mathsf D((x,r),(y,s)) \eqdef H_{c(x,y)}(r^p,s^p)^{1/p}.
$$

Several classical unbalanced geometries are obtained by choosing $\psi$, $c$ and $p$ so that $\mathsf D$ is a distance on the cone:

$$
\mathsf D((x,r),(y,s))^2 = r^2+s^2-2rs\cos\big(d(x,y)\wedge(\pi/2)\big),
$$

$$
\mathsf D((x,r),(y,s))^2 = r^2+s^2-2rs\, e^{-d(x,y)^2/2},
$$

$$
\mathsf D((x,r),(y,s)) = r+s-(r\wedge s)\,(2-d(x,y))_+.
$$
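
As a worked instance of the perspective transform, here is a sketch of the computation for the Kullback–Leibler entropy. It assumes $\psi_1=\psi_2=\psi$ with $\psi(t)=t\log t-t+1$; this explicit form is an assumption, chosen to match the pointwise KL objective $r\log(r/a)-r+a$ used earlier in the chapter.

$$
\begin{aligned}
r\,\psi(1/r) &= r-1-\log r, \\
L_c(r,s) &= c+r+s-2-\log(rs), \\
\theta L_c(r/\theta,s/\theta) &= r+s+\theta\big(c-2-\log(rs)\big)+2\theta\log\theta, \\
\partial_\theta\big[\theta L_c(r/\theta,s/\theta)\big] = 0 &\iff c-\log(rs)+2\log\theta=0 \iff \theta=\sqrt{rs}\,e^{-c/2}, \\
H_c(r,s) &= r+s-2\sqrt{rs}\,e^{-c/2}.
\end{aligned}
$$

The function of $\theta$ on the third line is convex, so the critical point is the minimizer. With $p=2$ this gives $\mathsf D((x,r),(y,s))^2=H_{c(x,y)}(r^2,s^2)=r^2+s^2-2rs\,e^{-c(x,y)/2}$. Taking $c=d^2$ yields the second formula of the list, and taking $c=-\log\cos^2\big(d\wedge(\pi/2)\big)$ yields the first. The third formula is not covered by this KL computation.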

The corresponding cone value is

$$
\mathsf{CW}(\alpha,\beta) = \inf_{\gamma\in\mathcal M_+(\mathfrak C[\X]^2)} \left\{ \int \mathsf D((x,r),(y,s))^p\d\gamma \; ; \; \int r^p\d\gamma_1(\cdot,r)=\alpha,\quad \int s^p\d\gamma_2(\cdot,s)=\beta \right\}.
$$

Proof

The equality $\mathsf{UW}=\mathsf{HW}$ is the homogenization proposition. To prove $\mathsf{HW}=\mathsf{CW}$, disintegrate an admissible cone coupling $\gamma$ with respect to its spatial variables $(x,y)$ and radii $(r,s)$. The cone marginal constraints say precisely that the spatial marginals are recovered after weighting by $r^p$ and $s^p$. Since $\mathsf D((x,r),(y,s))^p=H_{c(x,y)}(r^p,s^p)$, integrating the cone cost gives the homogeneous objective. Conversely, any homogeneous competitor can be lifted to the cone by placing, over each $(x,y)$, radii whose $p$-th powers are the two density factors appearing in $H_c$.

If $\mathsf D$ is a distance on the cone, then $\mathsf{CW}^{1/p}$ is the usual $p$-Wasserstein distance between lifted measures under the linear cone-marginal constraints. Symmetry and the triangle inequality follow from the corresponding Wasserstein properties and the gluing lemma on the cone. If the distance is zero, an optimal cone coupling is concentrated on the diagonal of the cone, so the weighted projections agree and therefore $\alpha=\beta$.

Entropic KL Relaxation

A generic entropic regularization of unbalanced OT reads

$$
\inf_{\pi\in\mathcal M_+(\X\times\Y)} \int c\d\pi + \mathcal D_{\psi_1}(\pi_1\mid\alpha) + \mathcal D_{\psi_2}(\pi_2\mid\beta) + \epsilon\mathcal D_\phi(\pi\mid\alpha\otimes\beta).
$$

Its dual is

$$
\sup_{f,g} - \mathcal D_{\psi_1}^*(-f\mid\alpha) - \mathcal D_{\psi_2}^*(-g\mid\beta) - \epsilon\mathcal D_\phi^* \left(\frac{f\oplus g-c}{\epsilon}\,\middle|\,\alpha\otimes\beta\right).
$$

For $\mathcal D_\phi=\operatorname{KL}$, the primal-dual relation is $\d\pi=e^{(f\oplus g-c)/\epsilon}\d\alpha\d\beta$. If in addition $\mathcal D_{\psi_1}=\mathcal D_{\psi_2}=\tau\operatorname{KL}$, coordinate maximization gives the damped soft transforms

$$
\begin{aligned}
f(x) &= - \frac{\tau\epsilon}{\tau+\epsilon} \log\int_\Y \exp\left(\frac{g(y)-c(x,y)}{\epsilon}\right)\d\beta(y),\\
g(y) &= - \frac{\tau\epsilon}{\tau+\epsilon} \log\int_\X \exp\left(\frac{f(x)-c(x,y)}{\epsilon}\right)\d\alpha(x).
\end{aligned}
$$

In the discrete case, with $K_{i,j}=e^{-C_{i,j}/\epsilon}a_i b_j$ and $\rho=\tau/(\tau+\epsilon)$, this gives the generalized Sinkhorn scaling

$$
u_i\leftarrow \left(\frac{a_i}{(Kv)_i}\right)^\rho, \qquad v_j\leftarrow \left(\frac{b_j}{(K^\top u)_j}\right)^\rho, \qquad P=\diag(u)K\diag(v).
$$

The exponent $\rho<1$ is the visible difference with balanced Sinkhorn: marginal corrections are damped because violating the marginals is allowed.
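
The scaling iteration above can be sketched in a few lines. The version below is a minimal illustration in plain Python, not a reference implementation: the function name `unbalanced_sinkhorn`, the fixed iteration count, and the toy point clouds are all assumptions made for the example, and the scalings are kept in the primal domain, so a very small $\epsilon$ would underflow.

```python
import math

def unbalanced_sinkhorn(a, b, C, eps, tau, iters=2000):
    """Generalized Sinkhorn scaling for KL-relaxed entropic OT (discrete case).

    Uses K[i][j] = exp(-C[i][j]/eps) * a[i] * b[j] and rho = tau/(tau+eps),
    following the formulas in the text. Returns P = diag(u) K diag(v).
    """
    n, m = len(a), len(b)
    rho = tau / (tau + eps)
    K = [[math.exp(-C[i][j] / eps) * a[i] * b[j] for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        Kv = [sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        u = [(a[i] / Kv[i]) ** rho for i in range(n)]       # damped row update
        Ktu = [sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        v = [(b[j] / Ktu[j]) ** rho for j in range(m)]      # damped column update
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def marginal_error(P, a, b):
    """L1 gap between the marginals of P and the prescribed weights (a, b)."""
    rows = [sum(row) for row in P]
    cols = [sum(P[i][j] for i in range(len(P))) for j in range(len(P[0]))]
    return (sum(abs(r - ai) for r, ai in zip(rows, a))
            + sum(abs(c - bj) for c, bj in zip(cols, b)))

# Toy 1D example with squared-distance cost; the last target atom is far away.
x = [0.0, 0.5, 1.0]
y = [0.25, 0.75, 2.0]
a = [0.2, 0.5, 0.3]
b = [0.4, 0.4, 0.2]
C = [[(xi - yj) ** 2 for yj in y] for xi in x]

P_loose = unbalanced_sinkhorn(a, b, C, eps=0.5, tau=0.1)    # cheap mass variation
P_tight = unbalanced_sinkhorn(a, b, C, eps=0.5, tau=100.0)  # close to balanced OT
print("marginal gap, tau=0.1:", marginal_error(P_loose, a, b))
print("marginal gap, tau=100:", marginal_error(P_tight, a, b))
```

With a large $\tau$ the exponent $\rho$ is close to 1 and the marginals of $P$ stay close to $(a,b)$; with a small $\tau$ the updates are strongly damped and the gap is larger. For small $\epsilon$ one would normally iterate on the potentials $f=\epsilon\log u$, $g=\epsilon\log v$ with log-sum-exp instead.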

Figure: KL unbalanced OT on one-dimensional Gaussian-mixture densities. The central matrix is the transported coupling. The side curves compare the prescribed marginals with the transported marginals; increasing $\tau$ makes marginal mismatch more expensive, so more mass is moved rather than created or destroyed.

The interactive demo below exposes the two most important regularization scales. Increasing $\tau$ pushes the transported marginals closer to the prescribed ones; increasing $\epsilon$ spreads the coupling itself.

Interactive panel. Use the deletion cost and regularization controls to see when unbalanced transport prefers moving mass, creating mass, or removing it.

The entropy used in the marginal relaxation also changes the qualitative behavior. A KL penalty leads to smooth multiplicative rescaling. The reverse-KL, or Burg, penalty blows up when a transported marginal vanishes where the prescribed marginal is positive, so it discourages complete deletion of small modes. Total variation has a linear kink and behaves closer to partial transport: mass is either kept active or created and destroyed at nearly constant marginal price.

Figure: Effect of the marginal divergence in unbalanced entropic OT. The geometric cost, entropic plan regularization $\epsilon$, and relaxation strength $\tau$ are fixed; only the marginal penalty changes. KL allows smooth mass variation, Burg keeps transported marginals from vanishing on prescribed modes, and total variation gives a sharper active-mass selection.

Sliced Wasserstein Distances

Sliced Wasserstein trades exact high-dimensional geometry for many one-dimensional projections. It is cheap, differentiable after sorting, and often effective in imaging and learning. For measures on $\RR^d$ and $\theta\in\mathbb S^{d-1}$, let $P_\theta(x)=\dotp{\theta}{x}$ be the projection on direction $\theta$.

This construction is closely related to the Radon transform and is much cheaper to approximate numerically than high-dimensional OT, since each projected problem can be solved by sorting or quantiles (Rabin et al., 2011; Bonneel et al., 2015; Kolouri et al., 2016). It metrizes the same weak-plus-moment topology as $\Wass_p$, but its geometry is not bi-Lipschitz equivalent to $\Wass_p$ in high dimension (Nadjahi et al., 2019).
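
To make the sorting argument concrete, here is a minimal sketch for equal-size planar point clouds with uniform weights. It assumes the usual definition $\operatorname{SW}_p(\alpha,\beta)^p=\int_{\mathbb S^{d-1}}\Wass_p((P_\theta)_\sharp\alpha,(P_\theta)_\sharp\beta)^p\d\sigma(\theta)$, which is not written out in this section, and replaces the integral by an average over equispaced angles. The function names and sample points are hypothetical.

```python
import math

def wasserstein_1d(xs, ys, p=2):
    """W_p^p between two equal-size, equal-weight 1D samples, obtained by sorting."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(s - t) ** p for s, t in zip(xs, ys)) / len(xs)

def sliced_wasserstein_2d(X, Y, n_dirs=64, p=2):
    """Approximate SW_p between planar point clouds with equispaced directions."""
    total = 0.0
    for k in range(n_dirs):
        # theta and -theta define the same slice, so angles in [0, pi) suffice.
        angle = math.pi * k / n_dirs
        t0, t1 = math.cos(angle), math.sin(angle)
        proj_x = [t0 * x0 + t1 * x1 for x0, x1 in X]
        proj_y = [t0 * y0 + t1 * y1 for y0, y1 in Y]
        total += wasserstein_1d(proj_x, proj_y, p)
    return (total / n_dirs) ** (1.0 / p)

# Sanity check on a pure translation by h = (3, 4), for which W_2 = |h| = 5.
X = [(0.0, 0.0), (1.0, 0.5), (2.0, -1.0), (0.5, 2.0)]
shift = (3.0, 4.0)
Y = [(x0 + shift[0], x1 + shift[1]) for x0, x1 in X]

sw2 = sliced_wasserstein_2d(X, Y)
print("sliced W2:", sw2, " |h|/sqrt(2):", 5.0 / math.sqrt(2.0))
```

For a translation every slice sees the shift $\dotp{h}{\theta}$, so in the plane the sliced value should equal $\norm{h}/\sqrt2$, strictly smaller than $\Wass_2=\norm{h}$. In higher dimension one would typically sample random directions on the sphere instead of using a regular grid of angles.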

Figure: Sliced Wasserstein projections between two planar densities. Fixed directions are drawn on both densities, and the middle panels show smoothed one-dimensional density estimates of the projected measures. Sliced OT averages one-dimensional Wasserstein discrepancies over many such directions.

The interactive demo separates two uses of a slice: comparing projected measures and lifting the sorted one-dimensional matching back to the plane. The lifted plan is always feasible in the original space, but it need not be the quadratic optimal plan.

Interactive panel. Use the projection angle and number of directions to see how sliced Wasserstein distances reduce high-dimensional transport to one-dimensional matchings.

Proof

Non-negativity and symmetry follow from the one-dimensional Wasserstein distance. For the triangle inequality, apply the triangle inequality of $\Wass_p$ for each direction $\theta$ and then Minkowski's inequality in $L^p(\mathbb S^{d-1})$.

If $\operatorname{SW}_p(\alpha,\beta)=0$, then $(P_\theta)_\sharp\alpha=(P_\theta)_\sharp\beta$ for almost every direction. By continuity of characteristic functions this holds for all directions, and the Cramér–Wold theorem implies $\alpha=\beta$.

The bound $\operatorname{SW}_p\leq\Wass_p$ follows because $P_\theta$ is 1-Lipschitz. For $p=2$, using any coupling $\pi$ between $\alpha$ and $\beta$,

$$
\int_{\mathbb S^{d-1}}\int |\dotp{x-y}{\theta}|^2\d\pi(x,y)\d\sigma(\theta) = \frac{1}{d}\int \norm{x-y}^2\d\pi(x,y).
$$

The left-hand side is at least $\operatorname{SW}_2(\alpha,\beta)^2$, since the projected coupling is admissible in each slice. Optimizing over $\pi$ therefore gives the sharper inequality $\operatorname{SW}_2(\alpha,\beta)^2\leq\frac1d\Wass_2(\alpha,\beta)^2$. The weak-convergence statement follows from the same Cramér–Wold mechanism plus the moment condition.
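
The factor $1/d$ in the identity above comes from the second moment of the uniform measure on the sphere. A short derivation, assuming (as the notation suggests) that $\sigma$ denotes the uniform probability measure on $\mathbb S^{d-1}$:

$$
\begin{aligned}
\Sigma &\eqdef \int_{\mathbb S^{d-1}} \theta\theta^\top \d\sigma(\theta), \qquad R\Sigma R^\top=\Sigma \ \text{for every rotation } R \ \Longrightarrow\ \Sigma=\kappa I, \\
\tr(\Sigma) &= \int_{\mathbb S^{d-1}} \norm{\theta}^2 \d\sigma(\theta) = 1 \ \Longrightarrow\ \kappa=\frac1d, \\
\int_{\mathbb S^{d-1}} |\dotp{x-y}{\theta}|^2 \d\sigma(\theta) &= (x-y)^\top\Sigma\,(x-y) = \frac1d\norm{x-y}^2 .
\end{aligned}
$$

Integrating the last line against $\pi$ and exchanging the two integrals gives the displayed identity. In the plane the constant is $1/2$.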

Subspace-Sliced Variants

One-dimensional slices are extremely cheap, but they may discard too much geometry in high dimension. A natural compromise is to project onto $k$-dimensional subspaces: the projected OT problems remain lower dimensional, while each projection retains correlations inside a small block of coordinates.

Proof

The first inequality in each line follows because an $L^p$ average over a probability space is bounded by the corresponding supremum. The second inequality follows because orthogonal projections are 1-Lipschitz: pushing any admissible coupling between $\alpha$ and $\beta$ through a projection gives an admissible coupling for the projected measures with no larger transport cost. Optimizing over couplings and then averaging or maximizing over the projection gives the result.

Min-Sliced Lifted Transport Plans

The preceding constructions define distances between projected measures. A different use of slicing is to use a projection only as a device for building a feasible high-dimensional transport plan. For equal-weight empirical measures $\alpha=n^{-1}\sum_i\delta_{x_i}$ and $\beta=n^{-1}\sum_i\delta_{y_i}$, sort the projected samples $\dotp{x_i}{\theta}$ and $\dotp{y_j}{\theta}$, and let $\sigma_\theta$ be the monotone matching induced by this sorting. The lifted plan

$$
\pi_\theta = \frac{1}{n}\sum_{i=1}^n \delta_{(x_i,y_{\sigma_\theta(i)})}
$$

is a genuine coupling between $\alpha$ and $\beta$ in the original space. Min-SWGG-type methods then choose the projection whose lifted plan has the smallest full-dimensional quadratic cost:

$$
\operatorname{MSWGG}_2(\alpha,\beta)^2 \eqdef \min_{\theta\in\mathbb S^{d-1}} \int\norm{x-y}^2\d\pi_\theta(x,y).
$$

This quantity is not a projected distance; it is a cheap feasible-plan construction. Because every lifted plan $\pi_\theta$ is admissible for the Kantorovich problem, its cost is an upper bound on the optimal one:

$$
\Wass_2^2(\alpha,\beta) \leq \int\norm{x-y}^2\d\pi_\theta(x,y), \qquad \Wass_2^2(\alpha,\beta) \leq \operatorname{MSWGG}_2(\alpha,\beta)^2.
$$
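
The lifted-plan construction and the resulting upper bound can be illustrated on a tiny planar example. This is a sketch under stated assumptions: the direction is chosen by a coarse sweep over equispaced angles (so the result only upper-bounds the true minimum over the sphere), the exact $\Wass_2^2$ is computed by brute force over permutations, which is viable only for very small $n$, and all function names and sample points are hypothetical.

```python
import itertools
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def lifted_cost(X, Y, theta):
    """Quadratic cost of the plan obtained by sorting projections on theta."""
    def proj(z):
        return sum(t * zi for t, zi in zip(theta, z))
    order_x = sorted(range(len(X)), key=lambda i: proj(X[i]))
    order_y = sorted(range(len(Y)), key=lambda j: proj(Y[j]))
    # The k-th smallest projected x is matched to the k-th smallest projected y.
    return sum(sq_dist(X[i], Y[j]) for i, j in zip(order_x, order_y)) / len(X)

def min_swgg_sweep(X, Y, n_dirs=90):
    """Smallest lifted cost over equispaced planar directions (heuristic sweep)."""
    best = math.inf
    for k in range(n_dirs):
        angle = math.pi * k / n_dirs
        best = min(best, lifted_cost(X, Y, (math.cos(angle), math.sin(angle))))
    return best

def exact_w2_sq(X, Y):
    """Brute-force W_2^2 over all permutations; only for tiny point clouds."""
    n = len(X)
    return min(sum(sq_dist(X[i], Y[s[i]]) for i in range(n)) / n
               for s in itertools.permutations(range(n)))

X = [(0.0, 0.0), (1.0, 0.2), (0.3, 1.5), (2.0, 1.0), (1.2, -0.7)]
Y = [(3.0, 0.5), (2.5, 2.0), (4.0, -0.5), (3.3, 1.1), (2.2, 0.1)]

upper = min_swgg_sweep(X, Y)
exact = exact_w2_sq(X, Y)
print("best lifted cost:", upper, " exact W2^2:", exact)
```

Each lifted plan is a permutation matching, so its cost can never fall below the brute-force optimum; the gap between the two printed values measures how much is lost by restricting to sorted matchings along the swept directions.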

Figure: Lifted min-sliced plan. A one-dimensional direction is selected by a small deterministic sweep, then red and blue atoms are sorted after projection and matched in that order. The middle panel lifts this one-dimensional matching back to the plane; it is a feasible coupling but not the same object as the quadratic $\Wass_2$ matching shown on the right.

Vector Quantiles and Linear Optimal Transport

Linear OT starts from the multivariate analogue of quantile coordinates. The one-dimensional quantile function represents a probability measure by the monotone map sending a fixed reference law to it; in dimension $d>1$, Brenier's theorem gives the corresponding construction after choosing an absolutely continuous reference probability $\rho$, typically the uniform law on a convex body or a standard Gaussian.

Vector Quantiles

Assume that $\rho$ is absolutely continuous. For a target law $\mu$ with finite second moment, its vector quantile relative to $\rho$ is the Brenier map

$$
T_\mu=\nabla\phi_\mu, \qquad (T_\mu)_\sharp\rho=\mu,
$$

or equivalently the solution of

$$
\min_{T_\sharp\rho=\mu} \int\norm{x-T(x)}^2\d\rho(x).
$$

This construction is canonical only after fixing $\rho$: changing the reference law changes the coordinates used to represent $\mu$. Vector quantile regression uses the same idea conditionally, replacing scalar conditional quantiles by conditional Brenier maps and thereby encoding multivariate ranks and depths (Carlier et al., 2017).

Linearized Wasserstein Coordinates

Linear OT replaces a nonlinear transport distance by a Hilbert norm between reference maps. It is useful when one reference measure is fixed and many nearby distributions must be compared cheaply. Let $T_\alpha$ be the Brenier map pushing $\rho$ to $\alpha$, understood as an element of $L^2(\rho;\RR^d)$ and hence defined only $\rho$-almost everywhere. The linear OT embedding is

$$
\alpha\mapsto T_\alpha-\Id\in L^2(\rho;\RR^d), \qquad \operatorname{LOT}_\rho(\alpha,\beta) = \norm{T_\alpha-T_\beta}_{L^2(\rho)}.
$$

If one of the two targets equals the reference, the linearized distance is exact: for instance, $\operatorname{LOT}_\rho(\rho,\alpha)=\norm{T_\alpha-\Id}_{L^2(\rho)}=\Wass_2(\rho,\alpha)$. For two arbitrary targets, the coupling $(T_\alpha,T_\beta)_\sharp\rho$ is admissible but not generally optimal, so $\operatorname{LOT}_\rho$ is a tangent-space approximation of the Wasserstein geometry (Wang et al., 2013).

For a family $(\alpha_s)_s$ with weights $(\lambda_s)_s$, the linearized barycenter is obtained by averaging maps,

$$
\bar T=\sum_s\lambda_s T_{\alpha_s}, \qquad \bar\alpha_{\operatorname{LOT}}=\bar T_\sharp\rho.
$$

This is exact in one dimension, where quantile functions linearize $\Wass_2$, and it is especially useful when many barycenters with changing weights must be evaluated quickly.
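
In the one-dimensional empirical setting the construction reduces to sorting. The sketch below assumes equal-weight samples of the same size $n$ and a reference made of $n$ equal-weight atoms, so that the monotone map sends the $i$-th reference atom to the $i$-th smallest sample value; the function names and data are hypothetical.

```python
import math

def quantile_map(samples):
    """Values taken by the monotone map on the sorted reference atoms."""
    return sorted(samples)

def lot_distance(A, B):
    """L2(rho) distance between the two quantile maps (equals W_2 in 1D)."""
    Ta, Tb = quantile_map(A), quantile_map(B)
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(Ta, Tb)) / len(Ta))

def lot_barycenter(samples_list, weights):
    """Push the reference through the weighted average of the quantile maps."""
    maps = [quantile_map(S) for S in samples_list]
    n = len(maps[0])
    return [sum(w * T[i] for w, T in zip(weights, maps)) for i in range(n)]

A = [3.0, 1.0, 2.0]
B = [10.0, 30.0, 20.0]

bary = lot_barycenter([A, B], [0.5, 0.5])
print("barycenter atoms:", bary)
print("LOT(A, B):", lot_distance(A, B))
print("LOT(bary, A):", lot_distance(bary, A))
```

Once the maps are stored, changing the weights only requires recomputing a weighted average, which is the practical appeal mentioned above. With equal weights the barycenter sits halfway between the two inputs, so its distance to `A` should be half of the distance between `A` and `B`.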

Figure: Linear OT coordinates. Fixing a reference measure $\rho$ turns each target into a map $T_\alpha$ from $\rho$ to $\alpha$, or equivalently into the displacement field $T_\alpha-\Id$. In one dimension this is exactly the quantile parametrization of $\Wass_2$. In two dimensions, averaging the maps gives the linearized barycenter, which is compared with the genuine McCann midpoint.

The next control keeps the exact one-dimensional setting. The reference density defines the coordinate system, the target maps are quantile maps from that reference, and the barycenter is obtained by averaging those maps before pushing the reference forward.

Interactive panel. Use the reference and deformation controls to inspect how linear optimal transport embeds measures through maps from a fixed template.

Proof

The first inequality is immediate: $(T_\alpha,T_\beta)_\sharp\rho$ is a feasible coupling between $\alpha$ and $\beta$. The reverse local estimate is a standard stability statement for the Monge–Ampère equation under the stated regularity assumptions: changes in the target measure control changes in the Brenier potential in Hölder norms, hence control $T_\alpha-T_\beta$ in $L^2(\rho)$.

In one-dimensional settings, quantile functions make this exact with $\eta=1$. In several dimensions one should not read the statement as a global Lipschitz estimate in $\Wass_2$. Quantitative stability results for semi-discrete and Monge–Ampère maps give Hölder exponents depending on the dimension, density bounds, support geometry and regularity (Mérigot et al., 2020).

Spectral and Robust Wasserstein Distances

Spectral OT changes the scalar quadratic cost by measuring the whole displacement covariance through a matrix gauge. The same object admits a robust projected formulation: instead of fixing one projection, one maximizes over the polar set of the gauge. Subspace robust OT is the important non-convex rank-constrained version of this idea (Paty & Cuturi, 2019); spectral gauges provide its convex minimax counterpart and connect to recent spectral-gradient viewpoints such as Muon dynamics (Peyré, 2026).

The monotonicity condition means that increasing the displacement covariance in Loewner order cannot decrease the transport penalty.

The special case $\gamma(M)=\tr(M)$ gives the usual quadratic Wasserstein distance $\Wass_2$. The spectral gauge $\gamma(M)=\lambda_{\max}(M)$ instead measures the worst transported variance direction. For $A\succeq0$, define the quadratic projected transport cost

$$
\Wass_{2,A}(\alpha,\beta)^2 \eqdef \inf_{\pi\in\Couplings(\alpha,\beta)} \int (x-y)^\top A(x-y)\d\pi(x,y) = \Wass_2((A^{1/2})_\sharp\alpha,(A^{1/2})_\sharp\beta)^2.
$$

The polar set of the gauge is

$$
\mathcal B_\gamma \eqdef \{A\succeq0: \tr(AM)\leq\gamma(M)\ \text{for all } M\succeq0\},
$$

so that, for a closed gauge, $\gamma(M)=\sup_{A\in\mathcal B_\gamma}\tr(AM)$.

Proof

Using the polar representation of $\gamma$,

$$
\Wass_\gamma(\alpha,\beta)^2 = \inf_{\pi\in\Couplings(\alpha,\beta)} \sup_{A\in\mathcal B_\gamma} \tr(AM_\pi).
$$

The coupling set is convex and compact for weak convergence under compact support. The polar set $\mathcal B_\gamma$ is convex and compact, and the map $(\pi,A)\mapsto\tr(AM_\pi)$ is affine in each variable and continuous. Sion's minimax theorem gives

$$
\inf_\pi\sup_{A\in\mathcal B_\gamma}\tr(AM_\pi) = \sup_{A\in\mathcal B_\gamma}\inf_\pi\tr(AM_\pi) = \sup_{A\in\mathcal B_\gamma}\Wass_{2,A}(\alpha,\beta)^2.
$$

For fixed $A\succeq0$, $\Wass_{2,A}$ is the Wasserstein pseudodistance associated with the seminorm $x\mapsto\norm{A^{1/2}x}$. A supremum of pseudodistances is symmetric and satisfies the triangle inequality. If $aI\in\mathcal B_\gamma$ and $A\preceq bI$ for all $A\in\mathcal B_\gamma$, then

$$
a\Wass_2(\alpha,\beta)^2 \leq \Wass_\gamma(\alpha,\beta)^2 \leq b\Wass_2(\alpha,\beta)^2,
$$

which proves definiteness and equivalence with $\Wass_2$.

For the Ky Fan gauge

$$
\gamma_k(M)=\sum_{\ell=1}^k\lambda_\ell(M),
$$

where the eigenvalues are sorted in decreasing order, the polar set is

$$
\mathcal B_{\gamma_k} = \{A:0\preceq A\preceq I,\ \tr(A)\leq k\}.
$$

Thus $k=d$ gives $\gamma_d(M)=\tr(M)$ and recovers $\Wass_2$. The convex hull of rank-$k$ projectors is

$$
\{A:0\preceq A\preceq I,\ \tr(A)=k\},
$$

and, since $M\succeq0$, the associated support function is the same Ky Fan gauge. Thus $\Wass_{\gamma_k}$ is the convexified spectral counterpart of $\operatorname{SRW}_{2,k}$, while $\operatorname{SRW}_{2,k}$ keeps the original non-convex rank constraint. For $k=1$, $\gamma_1(M)=\lambda_{\max}(M)$ and $\mathcal B_{\gamma_1}=\{A\succeq0:\tr(A)\leq1\}$.
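
The gauges can be evaluated by hand on a planar example. The sketch below assumes that $M_\pi$ denotes the second-moment matrix of the displacement, $M_\pi=\int (x-y)(x-y)^\top\d\pi(x,y)$, which is how the text describes it but is not defined explicitly in this section. It compares the Ky Fan gauges computed from eigenvalues with the polar representation for $k=1$, approximated by rank-one matrices $A=\theta\theta^\top$ on a grid of angles. All names and data are hypothetical.

```python
import math

def displacement_covariance(pairs):
    """Entries (m11, m12, m22) of the average of (x - y)(x - y)^T over matched pairs."""
    m11 = m12 = m22 = 0.0
    for x, y in pairs:
        dx, dy = x[0] - y[0], x[1] - y[1]
        m11 += dx * dx
        m12 += dx * dy
        m22 += dy * dy
    n = len(pairs)
    return m11 / n, m12 / n, m22 / n

def eigenvalues_2x2(m11, m12, m22):
    """Eigenvalues of a symmetric 2x2 matrix, in decreasing order."""
    mean = 0.5 * (m11 + m22)
    radius = math.hypot(0.5 * (m11 - m22), m12)
    return mean + radius, mean - radius

def ky_fan(M, k):
    """Sum of the k largest eigenvalues of the symmetric 2x2 matrix M."""
    return sum(eigenvalues_2x2(*M)[:k])

def lambda_max_by_directions(M, n_dirs=3600):
    """Approximate sup of tr(A M) over A = theta theta^T with unit theta."""
    m11, m12, m22 = M
    best = 0.0
    for i in range(n_dirs):
        angle = math.pi * i / n_dirs
        c, s = math.cos(angle), math.sin(angle)
        best = max(best, c * c * m11 + 2.0 * c * s * m12 + s * s * m22)
    return best

# Two matched pairs with displacements (2, 1) and (1, 2).
pairs = [((2.0, 1.0), (0.0, 0.0)), ((1.0, 2.0), (0.0, 0.0))]
M = displacement_covariance(pairs)

print("trace gauge   (k=2):", ky_fan(M, 2))
print("lambda_max    (k=1):", ky_fan(M, 1))
print("sup over directions:", lambda_max_by_directions(M))
```

For a fixed plan the trace gauge sums both eigenvalues, whereas the top-eigenvalue gauge only sees the worst direction, here the diagonal along which both displacements roughly point. Rank-one matrices with unit trace are the extreme points of $\mathcal B_{\gamma_1}$, which is why a sweep over directions is enough for $k=1$.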

Figure: Trace and spectral gauges for displacement covariances. The trace gauge minimizes the average squared displacement and gives the usual quadratic transport plan. The $\lambda_{\max}$ gauge penalizes the worst projected displacement variance; the displayed plan is obtained by approximating the robust formulation with finitely many directions.

The interactive demo turns the displacement covariance into a visible object. The trace gauge sums both covariance eigenvalues, while the top-eigenvalue gauge cares only about the worst transported direction.

Interactive panel. Use the spectral weights and deformation controls to see how the gauge changes the geometry used to compare measures.

References
  1. Liero, M., Mielke, A., & Savaré, G. (2018). Optimal entropy-transport problems and a new Hellinger–Kantorovich distance between positive measures. Inventiones Mathematicae, 211(3), 969–1117.
  2. Chizat, L., Peyré, G., Schmitzer, B., & Vialard, F.-X. (2018). Unbalanced optimal transport: dynamic and Kantorovich formulation. Journal of Functional Analysis, 274(11), 3090–3123.
  3. Chizat, L., Schmitzer, B., Peyré, G., & Vialard, F.-X. (2018). An interpolating distance between optimal transport and Fisher–Rao metrics. Foundations of Computational Mathematics, 18(1), 1–44.
  4. Rabin, J., Peyré, G., Delon, J., & Bernot, M. (2011). Wasserstein barycenter and its application to texture mixing. International Conference on Scale Space and Variational Methods in Computer Vision, 435–446.
  5. Bonneel, N., Rabin, J., Peyré, G., & Pfister, H. (2015). Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1), 22–45.
  6. Kolouri, S., Zou, Y., & Rohde, G. K. (2016). Sliced Wasserstein kernels for probability distributions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5258–5267.
  7. Nadjahi, K., Durmus, A., Simsekli, U., & Badeau, R. (2019). Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance. Advances in Neural Information Processing Systems.
  8. Bonnotte, N. (2013). Unidimensional and evolution methods for optimal transportation [PhD thesis]. Université Paris-Sud.
  9. Carlier, G., Chernozhukov, V., & Galichon, A. (2017). Vector quantile regression beyond the specified case. Journal of Multivariate Analysis, 161, 96–102. doi:10.1016/j.jmva.2017.07.003
  10. Wang, W., Slepčev, D., Basu, S., Ozolek, J. A., & Rohde, G. K. (2013). A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International Journal of Computer Vision, 101(2), 254–269.
  11. Mérigot, Q., Delalande, A., & Chazal, F. (2020). Quantitative Stability of Optimal Transport Maps and Linearization of the 2-Wasserstein Space. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 108, 3186–3196.
  12. Paty, F.-P., & Cuturi, M. (2019). Subspace Robust Wasserstein Distances. Proceedings of the 36th International Conference on Machine Learning, 97, 5072–5081.
  13. Peyré, G. (2026). Muon Dynamics as a Spectral Wasserstein Flow. arXiv Preprint arXiv:2604.04891.