Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Kantorovich Relaxation

Kantorovich’s relaxation is the decisive move that turns transport into convex optimization. Deterministic maps are replaced by couplings, infeasibility and asymmetry disappear, and the Wasserstein distances emerge. Historically, this linear-programming viewpoint grew from Kantorovich’s economic planning work Kantorovich, 1942 and is now the standard foundation of optimal transport Villani, 2003Villani, 2009Rachev & Rüschendorf, 1998.

Discrete Relaxation

The discrete relaxation is the cleanest place to see mass splitting. It replaces permutations by a transportation polytope and reveals the linear-programming structure that algorithms exploit.

Monge’s discrete matching problem cannot be applied when the two clouds have different cardinalities or unequal weights. The continuous Monge problem has the same obstruction: there may be no map TT such that Tα=βT_\sharp\al=\be, for instance when one Dirac mass must be sent to several Dirac masses. It is also asymmetric: two Dirac masses can be mapped to one, but one Dirac mass cannot be split into two by a deterministic map.

Kantorovich’s idea is to relax deterministic transportation. Instead of sending each source point xix_i to exactly one target, the mass at xix_i may be dispatched across several targets. The relaxation is encoded by a coupling matrix PR+n×mP\in\RR_+^{n\times m} for two discrete measures

α=iaiδxi,β=jbjδyj.\al=\sum_i a_i\delta_{x_i}, \qquad \be=\sum_j b_j\delta_{y_j}.

The first consequence is feasibility. There is always at least one admissible plan.

The feasible set is a bounded intersection of an affine space with the nonnegative orthant, hence a convex polytope. In one dimension, the coupling can be read as a matrix: rows index source bins, columns index target bins, and the marginal constraints appear as prescribed row and column sums.

Proof

The reverse implication is immediate. Conversely, assume that PP^\otimes is optimal and let QU(a,b)Q\in\CouplingsD(a,b) be arbitrary. Since all entries of PP^\otimes are positive, there exists t>0t>0 small enough that

R:=(1+t)PtQR\eqdef(1+t)P^\otimes-tQ

is nonnegative. It still has row sums aa and column sums bb, so RU(a,b)R\in\CouplingsD(a,b). Also

P=11+tR+t1+tQ.P^\otimes=\frac{1}{1+t}R+\frac{t}{1+t}Q.

Taking scalar products with CC, the optimality of PP^\otimes forces both RR and QQ to have the same cost as PP^\otimes. Since QQ was arbitrary, all couplings are optimal.

Thus the product plan is mainly a feasibility witness. Except when the linear cost is constant on the whole transportation polytope, it is not expected to solve optimal transport.

<IPython.core.display.Image object>

Discrete couplings represented as straight transport segments. The deterministic graph is a feasible Monge-type plan, the product plan spreads every source mass over all targets, and the optimal Kantorovich plan minimizes the quadratic transport cost. Line width and opacity encode transported mass.

The interactive demo below separates the main feasible-plan archetypes: deterministic graphs, independent product couplings, sparse splitting plans, and entropic approximations.

Interactive panel. Use the point and mass sliders to see how a Kantorovich plan can split mass into several weighted links rather than choosing one destination per source.

<IPython.core.display.Image object>

Coupling matrices with their prescribed marginals. The central grayscale image displays PijP_{ij}; the red curve on the left is the source marginal aa, and the blue curve on top is the target marginal bb. The independent product plan is diffuse, whereas the one-dimensional optimal plan concentrates near the monotone quantile correspondence.

The companion control varies the bin count and the endpoint laws, making the transition from diffuse independence to monotone transport visually explicit.

Interactive panel. Use the problem-size and mass-shape controls to compare the coupling matrix with its red and blue marginal sums.

The Kantorovich feasible set is symmetric: PU(a,b)P\in\CouplingsD(a,b) if and only if PU(b,a)P^\top\in\CouplingsD(b,a). With a unit transport cost matrix CijC_{ij}, the discrete Kantorovich problem reads

LC(a,b):=minPU(a,b)C,P=minPU(a,b)i,jCijPij.\mathcal{L}_C(a,b) \eqdef \min_{P\in\CouplingsD(a,b)} \langle C,P\rangle = \min_{P\in\CouplingsD(a,b)} \sum_{i,j} C_{ij}P_{ij}.

This is a linear program, and its solutions need not be unique.

<IPython.core.display.Image object>

From permutation matrices to splitting couplings. When the two empirical measures have the same number of atoms and uniform weights, an optimal plan can be a permutation matrix. Once target masses are nonuniform, one source can send mass to several targets and several sources can merge into the same target.

The interactive demo keeps the same source and target sites while changing the target mass imbalance, so the moment where permutation structure breaks becomes visible.

Interactive panel. Use the split-mass and geometry controls to contrast deterministic permutation-like transport with plans that divide source mass across targets.

Proof

The transportation polytope is compact, so a linear objective attains its minimum at an extreme point. Let PP be an extreme point and let E={(i,j):Pij>0}E=\{(i,j):P_{ij}>0\} be its support graph on the bipartite vertex set of source and target indices. If this graph contains a cycle, put alternating signs +1,1+1,-1 on the cycle, obtaining a nonzero matrix HH supported on EE with zero row and column sums. For small t>0t>0, both P+tHP+tH and PtHP-tH are nonnegative couplings and PP is their midpoint, contradicting extremality. Thus the support graph is a forest, which has at most n+m1n+m-1 edges.

Proof

All assignments are nonnegative. At each step, the mass placed in entry (i,j)(i,j) is subtracted from exactly one current row residual and one current column residual, so no row or column can receive more mass than prescribed. Conversely, an index is advanced only when its residual has been fully filled. When the algorithm stops, the total assigned mass is iai=jbj\sum_i a_i=\sum_j b_j, hence all row and column sums are exactly aa and bb.

Each positive assignment exhausts at least one current row or one current column. Before the final assignment, at most n1n-1 row advances and m1m-1 column advances can occur without terminating the construction. Hence the number of positive entries is at most (n1)+(m1)+1=n+m1(n-1)+(m-1)+1=n+m-1. For acyclicity, view the positive support as a bipartite graph. Once a row or column index is advanced, it never appears again, so each new positive edge either starts a new component or attaches at least one new vertex to the component currently being swept. No edge is ever added between two old vertices of the same component, so no cycle can be created.

The north-west corner rule, summarized in Algorithm Algorithm: North-west corner coupling, does not use the cost matrix and is therefore not meant to solve the discrete Kantorovich problem. Its role is algorithmic: an acyclic support corresponds to linearly independent marginal constraints. When the support has fewer than n+m1n+m-1 positive entries, transportation simplex implementations complete it with zero-mass basic variables to obtain a degenerate basic feasible solution. This gives a cheap initialization for the pivoting methods discussed in Section Linear-Programming Algorithms.

One-Dimensional Cases

In one dimension, the transportation polytope has a canonical monotone optimizer. This is the weighted version of the sorting rule from the matching chapter.

Proof

The north-west plan is monotone: it cannot put positive mass on both (i,j)(i,j) and (i,j)(i',j') when i<ii<i' and j>jj>j'. Conversely, any feasible plan with such a crossing pair can be improved by moving a small mass from (i,j)(i,j) and (i,j)(i',j') to (i,j)(i,j') and (i,j)(i',j). The two marginals are unchanged, and convexity of hh gives

h(xiyj)+h(xiyj)h(xiyj)+h(xiyj)h(x_i-y_j)+h(x_{i'}-y_{j'}) \geq h(x_i-y_{j'})+h(x_{i'}-y_j)

for i<ii<i' and j<jj'<j, with strict inequality for strictly convex hh and distinct points. Repeating this uncrossing procedure until no crossing remains yields a monotone optimal plan. There is only one monotone feasible plan with the prescribed sorted marginals, namely the sweep plan: it pairs the leftmost remaining source mass with the leftmost remaining target mass at every step. Sorting costs O(nlogn+mlogm)O(n\log n+m\log m) and the sweep uses at most n+m1n+m-1 assignments.

Permutation Matrices As Couplings

Now assume n=mn=m and uniform weights a=b=1n/na=b=\mathbf{1}_n/n. In this case, a matching can be encoded as a matrix with exactly one active entry per row and per column.

The corresponding probability coupling is Pσ/nP_\sigma/n. If the matching cost matrix is CC, then

C,Pσ/n=1ni=1nCi,σ(i).\langle C,P_\sigma/n\rangle = \frac1n\sum_{i=1}^n C_{i,\sigma(i)}.

Thus the assignment problem is the minimization of a linear function over the discrete, non-convex set of permutation matrices. The convex relaxation replaces this finite set by all bistochastic matrices.

Proof

Among all nonempty faces of C\mathcal C, choose one of minimal affine dimension. If this face contained two distinct points, maximizing a linear functional that is not constant on the face would produce a nonempty proper exposed subface, contradicting minimality. Hence the minimal face is a singleton, and its point is extreme.

Proof

The set S=arg minxC(x)S=\argmin_{x\in\mathcal C}\ell(x) is nonempty, compact and convex. By Proposition Proposition: Existence of Extreme Points, it has an extreme point xx. If x=(y+z)/2x=(y+z)/2 with y,zCy,z\in\mathcal C, then by linearity and optimality of xx, both yy and zz also minimize \ell on C\mathcal C, hence y,zSy,z\in S. Since xx is extreme in SS, y=z=xy=z=x. Thus xx is extreme in C\mathcal C.

Proof

We first prove that permutation matrices are extreme. Let PσPnpermP_\sigma\in\mathcal P_n^{\mathrm{perm}} and assume that

Pσ=Q+R2withQ,RBn.P_\sigma=\frac{Q+R}{2} \qquad\text{with}\qquad Q,R\in\mathcal B_n .

Every bistochastic matrix has entries in [0,1][0,1]. Since the only extreme points of [0,1][0,1] are 0 and 1, each entry of PσP_\sigma fixes the corresponding entries of QQ and RR: if (Pσ)ij=0(P_\sigma)_{ij}=0, then Qij=Rij=0Q_{ij}=R_{ij}=0, while if (Pσ)ij=1(P_\sigma)_{ij}=1, then Qij=Rij=1Q_{ij}=R_{ij}=1. Hence Q=R=PσQ=R=P_\sigma, so PσP_\sigma is extreme.

We now prove the converse by contrapositive. Pick PBnPnpermP\in\mathcal B_n\setminus\mathcal P_n^{\mathrm{perm}}. Since an integral bistochastic matrix is necessarily a permutation matrix, PP has at least one fractional entry. We shall split P=(Q+R)/2P=(Q+R)/2 with Q,RBnQ,R\in\mathcal B_n and QRQ\neq R, proving that PP is not extreme.

Associate with PP the bipartite graph whose left vertices are the rows, whose right vertices are the columns, and whose edges are the fractional entries 0<Pij<10<P_{ij}<1. An entry equal to 1 uses the whole mass of its row and column, so it is isolated in the positive support and does not appear in this fractional graph. If a left vertex is incident to one fractional edge, then it must be incident to at least one other fractional edge: after the first fractional contribution, the row still has positive remaining mass, and that remainder cannot be carried by an entry equal to 1. The same argument applies to columns. Thus every non-isolated vertex of the fractional graph has degree at least two.

Starting from any fractional edge, one may therefore walk through adjacent fractional edges without immediately backtracking and without getting stuck. Since the graph is finite, some vertex is eventually visited twice; the portion of the walk between the two visits contains a cycle. Choose a shortest such cycle and write it in alternating form

(i1,j1,i2,j2,,ip,jp),ip+1=i1,(i_1,j_1,i_2,j_2,\ldots,i_p,j_p), \qquad i_{p+1}=i_1,

where both (is,js)(i_s,j_s) and (is+1,js)(i_{s+1},j_s) are fractional for every ss. Define

ϵ:=min1sp{Pis,js,Pis+1,js,1Pis,js,1Pis+1,js}>0,\epsilon \eqdef \min_{1\leq s\leq p} \{ P_{i_s,j_s}, P_{i_{s+1},j_s}, 1-P_{i_s,j_s}, 1-P_{i_{s+1},j_s} \}>0,

and split the cycle edges into the alternating families

A={(is,js)}s=1p,B={(is+1,js)}s=1p.A=\{(i_s,j_s)\}_{s=1}^p, \qquad B=\{(i_{s+1},j_s)\}_{s=1}^p .

Set Q=PQ=P and R=PR=P outside ABA\cup B; on AA, set Qij=Pij+ϵ/2Q_{ij}=P_{ij}+\epsilon/2 and Rij=Pijϵ/2R_{ij}=P_{ij}-\epsilon/2; on BB, set Qij=Pijϵ/2Q_{ij}=P_{ij}-\epsilon/2 and Rij=Pij+ϵ/2R_{ij}=P_{ij}+\epsilon/2. By the definition of ϵ\epsilon, all modified entries stay in [0,1][0,1]. Each row and column of the cycle sees one +ϵ/2+\epsilon/2 and one ϵ/2-\epsilon/2, so the row and column sums remain one. Thus Q,RBnQ,R\in\mathcal B_n, QRQ\neq R, and P=(Q+R)/2P=(Q+R)/2. Hence PP is not extreme. Consequently every extreme point of Bn\mathcal B_n is integral, and every integral bistochastic matrix is a permutation matrix.

The same combinatorial idea gives the constructive decomposition used to express a bistochastic matrix as a convex combination of permutations.

Proof

The feasible set is Bn/n\mathcal B_n/n. By Proposition Proposition: Linear Programs Have Extreme Minimizers, the linear objective has an optimal extreme point. Since scaling preserves extreme points and Theorem Theorem: Birkhoff--von Neumann identifies the extreme points of Bn\mathcal B_n, this optimizer is Pσ/nP_\sigma/n for some permutation σ\sigma. Its cost is exactly n1iCi,σ(i)n^{-1}\sum_i C_{i,\sigma(i)}, so σ\sigma is an optimal assignment.

Equivalently, for uniform empirical measures, one can always choose a permutation matrix among the minimizers of the relaxed Kantorovich problem: the relaxation is tight for assignment problems.

Linear-Programming Algorithms

The discrete Kantorovich problem is a linear program with much more structure than a generic dense LP. Its variables are arcs of a complete bipartite network, its equality constraints are flow-conservation constraints, and its extreme points are sparse tree-like couplings.

Transportation Simplex And Network Simplex

The transportation simplex goes back to Dantzig’s formulation of the transportation problem Dantzig, 1951. It works on basic feasible couplings, whose support is completed into a spanning tree of the bipartite supply-demand graph. Reduced costs identify whether an unused arc can decrease the objective. Adding such an arc creates a unique cycle; one then pushes as much mass as possible around that cycle and removes the exhausted arc.

The network simplex is the corresponding pivoting method for general minimum-cost-flow problems Bertsekas & Eckstein, 1988. It keeps node potentials, reduced costs and a spanning-tree basis. Its worst-case number of pivots can be exponential, but the per-pivot operations exploit graph sparsity. Polynomial guarantees can be obtained from strongly polynomial minimum-cost-flow algorithms such as Orlin’s algorithm Orlin, 1997.

Interior-Point Methods

Generic interior-point methods approach the LP through a smooth central path. For the transport polytope, the logarithmic-barrier version is

Pϵ:=arg minP1m=a,P1n=bPij>0C,Pϵi,jlogPij.P_\epsilon \eqdef \argmin_{\substack{P\mathbf{1}_m=a,\;P^\top\mathbf{1}_n=b\\P_{ij}>0}} \langle C,P\rangle - \epsilon\sum_{i,j}\log P_{ij}.

The barrier is singular at the boundary, so each iterate stays strictly inside the transportation polytope. As ϵ0\epsilon\downarrow0, the central path approaches the set of LP minimizers.

<IPython.core.display.Image object>

Logarithmic-barrier central path for a triangular slice of a linear program. Large ϵ\epsilon selects a central interior point; decreasing ϵ\epsilon moves the minimizer toward the optimal vertex while never touching the boundary. This differs from entropic OT, where the entropy temperature is part of the regularized objective itself.

The interactive view exposes the barrier parameter directly: lowering ϵ\epsilon slides the minimizer from the center of the feasible triangle toward the LP vertex.

Interactive panel. Use the barrier and angle controls to move along the interior central path of the transport polytope.

Both interior-point methods and Sinkhorn keep iterates positive, but they use positivity differently. Interior-point algorithms solve the original LP by decreasing a barrier parameter. Sinkhorn fixes an entropic temperature and solves a different, KL-regularized OT problem by alternating diagonal scalings.

Relaxation For Arbitrary Measures

This section lifts the finite-dimensional coupling matrix to a joint probability measure. The payoff is that existence, duality and metric properties can be stated for arbitrary laws, including discrete, singular and continuous distributions.

Continuous Couplings

Unlike the Monge constraint, the coupling constraint is never empty. The continuous feasibility witness is the tensor product coupling.

Proof Sketch

If all couplings are optimal, the product coupling is optimal. Conversely, assume the product is optimal. If cross differences failed to vanish on the product support, there would be points x0,x1,y0,y1x_0,x_1,y_0,y_1 such that exchanging the two target neighborhoods decreases cost. Replacing a small amount of product mass on the two diagonal rectangles by mass on the crossed rectangles keeps the same marginals and lowers the cost, a contradiction. Vanishing cross differences imply c(x,y)=c(x,y)+c(x,y)c(x,y)c(x,y)=c(x,y_\star)+c(x_\star,y)-c(x_\star,y_\star) on the support, so the cost of any coupling depends only on its marginals.

If there exists a map T:XYT:\Xx\to\Yy with Tα=βT_\sharp\al=\be, then the Monge map induces the graph coupling π=(Id,T)αΠ(α,β)\pi=(\Id,T)_\sharp\al\in\Couplings(\al,\be), characterized by

h(x,y)dπ(x,y)=h(x,T(x))dα(x).\int h(x,y)\d\pi(x,y) = \int h(x,T(x))\d\al(x).

Graph couplings are precisely the Kantorovich representation of deterministic Monge maps.

Continuous Kantorovich Problem

The discrete Kantorovich problem becomes, for arbitrary measures,

Lc(α,β):=infπΠ(α,β)X×Yc(x,y)dπ(x,y).\mathcal{L}_c(\al,\be) \eqdef \inf_{\pi\in\Couplings(\al,\be)} \int_{\Xx\times\Yy} c(x,y)\d\pi(x,y).

This is an infinite-dimensional linear program over a space of measures.

Proof

The constraint set is nonempty because it contains αβ\al\otimes\be. It is closed for weak convergence because the marginal constraints are preserved. Since X×Y\Xx\times\Yy is compact, probability measures on it are compact for the weak topology, and Π(α,β)\Couplings(\al,\be) is compact. Finally, πcdπ\pi\mapsto\int c\d\pi is weakly continuous because cc is continuous and bounded.

On non-compact domains, one needs coercivity and moment conditions. For the Wasserstein cost c(x,y)=d(x,y)pc(x,y)=d(x,y)^p on a Polish metric space, the natural domain is

Pp(X):={μM+1(X):d(x,x0)pdμ(x)<+},\mathcal P_p(\Xx) \eqdef \left\{ \mu\in\Mm_+^1(\Xx): \int d(x,x_0)^p\d\mu(x)<+\infty \right\},

for one, and hence every, reference point x0x_0.

Monge--Kantorovich Equivalence

The proof of Brenier’s theorem relies on Kantorovich relaxation and duality. Under Brenier’s hypotheses, the relaxation is tight: it has the same cost as the Monge problem and the optimal coupling is induced by a map.

Proof

The proof of Brenier’s theorem shows that the support of any optimal Kantorovich plan lies in the subdifferential ϕ\partial\phi of a convex function. Since α\al has a density, ϕ\phi is differentiable α\al-almost everywhere, so ϕ(x)={ϕ(x)}\partial\phi(x)=\{\nabla\phi(x)\} for α\al-almost every xx. Every optimal coupling is therefore concentrated on the graph of T=ϕT=\nabla\phi and equals (Id,T)α(\Id,T)_\sharp\al.

If α\al does not have a density, non-smooth points of ϕ\phi can be charged by α\al and mass splitting can occur. For instance, moving δ0\delta_0 to (δ1+δ1)/2(\delta_{-1}+\delta_1)/2 can be represented by a plan concentrated on the set-valued subdifferential of ϕ(x)=x\phi(x)=|x|, but not by a deterministic map.

Cyclical Monotonicity

Cyclical monotonicity is the local geometric fingerprint of optimality for a cost cc. It converts a global minimization problem into finite exchange inequalities and is the bridge from Kantorovich plans to convex potentials.

It is enough to check cyclic permutations:

i=1kc(xi,yi)i=1kc(xi,yi+1),yk+1=y1.\sum_{i=1}^k c(x_i,y_i) \leq \sum_{i=1}^k c(x_i,y_{i+1}), \qquad y_{k+1}=y_1.
Proof Sketch

Suppose a finite family in supp(π)\supp(\pi) violates the exchange inequality. By continuity, the same strict inequality holds in small neighborhoods Ui×ViU_i\times V_i around the chosen pairs. Remove a small equal amount of mass from each rectangle, keep its first and second marginal pieces, and reinsert the mass after permuting the second marginal pieces. The new measure has the same marginals but strictly smaller cost, contradicting optimality.

If the optimal plan is induced by a map TT, cyclical monotonicity reads

i=1kc(xi,T(xi))i=1kc(xi,T(xi+1)).\sum_{i=1}^k c(x_i,T(x_i)) \leq \sum_{i=1}^k c(x_i,T(x_{i+1})).

For c(x,y)=12xy2c(x,y)=\frac12\|x-y\|^2, the two-point case gives

T(x)T(y),xy0,\langle T(x)-T(y),x-y\rangle\geq0,

so TT is a monotone vector field. In one dimension, for c(x,y)=xypc(x,y)=|x-y|^p, this reduces to the classical monotone rearrangement.

Metric Properties: Wasserstein Distances

OT costs become genuine distances when the ground cost comes from a metric. The proof relies on a gluing lemma.

Proof

If bj>0b_j>0, summing SijkS_{ijk} over kk gives Pijbj/bj=PijP_{ij}b_j/b_j=P_{ij}; if bj=0b_j=0, the corresponding column of PP and row of QQ are zero. The other prescribed marginal is checked in the same way. Summing over the intermediate index jj gives RR. Its row and column sums are aa and cc.

<IPython.core.display.Image object>

Discrete gluing lemma in matrix form. The first two panels are optimal one-dimensional couplings through an intermediate marginal. The third panel shows the induced marginal R=Pdiag(1/b)QR=P\diag(1/b)Q; it is feasible and is the coupling used in the triangle-inequality proof.

The interactive version changes the resolution of the intermediate marginal, which controls how mediated the glued source-target plan becomes.

Interactive panel. Use the mediation slider to inspect how two couplings through an intermediate marginal glue into a source-target plan.

Proof

Symmetry follows by transposing couplings. Positivity follows because a zero cost plan must be supported on the diagonal. For the triangle inequality, take optimal couplings PP from aa to bb and QQ from bb to cc, glue them into SS, and use the feasible marginal RR from aa to cc. Then Minkowski’s inequality and the ground triangle inequality give

Wp(a,c)(i,j,k(Dij+Djk)pSijk)1/pWp(a,b)+Wp(b,c).W_p(a,c) \leq \left(\sum_{i,j,k}(D_{ij}+D_{jk})^pS_{ijk}\right)^{1/p} \leq W_p(a,b)+W_p(b,c).
Proof Sketch

Disintegrate π\pi and ξ\xi against their common marginal β\be, obtaining conditional laws πy\pi_y on X\Xx and ξy\xi_y on Z\Zz. Define σ\sigma by the conditional-product formula

g(x,y,z)dσ=Y(XZg(x,y,z)dπy(x)dξy(z))dβ(y).\int g(x,y,z)\d\sigma = \int_\Yy \left(\int_\Xx\int_\Zz g(x,y,z)\d\pi_y(x)\d\xi_y(z)\right) \d\be(y).

This is the measure-theoretic version of the discrete formula above.

Proof

Symmetry is obtained by swapping the coordinates of a coupling. If the value is zero, an optimal coupling is supported on the diagonal and therefore the two marginals coincide. For the triangle inequality, glue optimal couplings πΠ(α,β)\pi\in\Couplings(\al,\be) and ξΠ(β,γ)\xi\in\Couplings(\be,\ga) into σ\sigma, project it to a coupling ρ\rho between α\al and γ\ga, and apply the ground triangle inequality plus Minkowski:

Wp(α,γ)((d(x,y)+d(y,z))pdσ(x,y,z))1/pWp(α,β)+Wp(β,γ).\Wass_p(\al,\ga) \leq \left(\int (d(x,y)+d(y,z))^p\d\sigma(x,y,z)\right)^{1/p} \leq \Wass_p(\al,\be)+\Wass_p(\be,\ga).

Interpolation Induced By A Plan

The quadratic Wasserstein distance does not only compare two endpoint measures. An optimal plan also says how to move mass between them: each active pair (x,y)(x,y) travels along the segment joining xx to yy. This turns an optimal coupling into a curve of measures.

In the discrete case, each mass PijP_{ij} moves from xix_i to yjy_j along its own segment. When the optimal plan is not induced by a map, one source atom can split into several moving atoms. If the optimal plan is not unique, different optimal plans may also induce different W2\Wass_2 geodesics.

Proof

Push the optimal plan π\pi^\star forward by (es,et)(e_s,e_t). This gives a coupling γs,tΠ(αs,αt)\gamma_{s,t}\in\Couplings(\al_s,\al_t), and

zz2dγs,t(z,z)=et(x,y)es(x,y)2dπ(x,y)=(ts)2W22(α0,α1).\int \norm{z-z'}^2\d\gamma_{s,t}(z,z') = \int \norm{e_t(x,y)-e_s(x,y)}^2\d\pi^\star(x,y) = (t-s)^2\Wass_2^2(\al_0,\al_1).

Hence W2(αs,αt)(ts)W2(α0,α1)\Wass_2(\al_s,\al_t)\leq(t-s)\Wass_2(\al_0,\al_1). Applying this upper bound to the three pairs (0,s)(0,s), (s,t)(s,t) and (t,1)(t,1), and using the triangle inequality of Proposition Proposition: Metric Property Of The Wasserstein Distance, gives

W2(α0,α1)W2(α0,αs)+W2(αs,αt)+W2(αt,α1)W2(α0,α1).\Wass_2(\al_0,\al_1) \leq \Wass_2(\al_0,\al_s)+\Wass_2(\al_s,\al_t)+\Wass_2(\al_t,\al_1) \leq \Wass_2(\al_0,\al_1).

All inequalities are therefore equalities, in particular the middle segment has the claimed length.

<IPython.core.display.Image object>

McCann interpolation induced by a non-deterministic optimal transport plan. In every panel, the red and blue endpoint measures are shown with low opacity, thin gray segments display the support Pij>tolP_{ij}>\mathrm{tol} of the coupling, and the moving atoms are colored from red to blue along the interpolation.

The companion panel lets the same coupling be inspected along time tt, with an entropy slider to contrast sparse and diffuse plans.

Interactive panel. Use the interpolation time and plan controls to see how a fixed coupling induces a cloud of displacement paths between endpoint measures.

General Geodesic Spaces

For Dirac masses in Euclidean space, the W2\Wass_2 geodesic from δx\delta_x to δy\delta_y is tδ(1t)x+tyt\mapsto\delta_{(1-t)x+t y}. The same idea extends to any geodesic metric space (X,d)(\X,d), meaning that each pair of points can be joined by a constant-speed metric geodesic. For each pair (x,y)(x,y), one replaces the Euclidean segment by a curve γx,y:[0,1]X\gamma^{x,y}:[0,1]\to\X such that γ0x,y=x\gamma^{x,y}_0=x, γ1x,y=y\gamma^{x,y}_1=y, and

d(γsx,y,γtx,y)=tsd(x,y).d(\gamma^{x,y}_s,\gamma^{x,y}_t)=|t-s|d(x,y).

If this geodesic is unique and depends measurably on (x,y)(x,y), one defines et(x,y)=γtx,ye_t(x,y)=\gamma^{x,y}_t and sets αt=(et)π\al_t=(e_t)_\sharp\pi^\star for an optimal coupling π\pi^\star. When geodesics are not unique, there is no canonical interpolation of a pair of Diracs unless a choice is made: one may select a particular geodesic between xx and yy, or randomize among several such geodesics. The intrinsic formulation is to choose a probability measure η\eta on the path space of constant-speed geodesics, called a dynamical optimal plan, such that (e0,e1)η(e_0,e_1)_\sharp\eta is an optimal coupling, and to set αt=(et)η\al_t=(e_t)_\sharp\eta. Different measurable choices, or different conditional distributions over geodesics with the same endpoints, can give different W2\Wass_2 geodesics; the constant-speed identity remains the same. This path-space viewpoint is standard in the general theory of Wasserstein spaces Ambrosio et al., 2006Villani, 2009Santambrogio, 2015.

Comparison With Monge

The distance Wp\Wass_p defined through the Kantorovich problem (42) should be contrasted with the directed distance W~\widetilde{\Wass} obtained using Monge’s problem. The Kantorovich feasible set is never empty, since it contains the product coupling, although the pp-cost may still be infinite without moment assumptions on non-compact spaces. By contrast, Monge’s constraint set {T:Tα=β}\{T:T_\sharp\al=\be\} can be empty. When an optimal Monge map exists, Kantorovich gives the same value by choosing the graph coupling (Id,T)α(\Id,T)_\sharp\alpha; in this sense the Kantorovich problem is the convex relaxation of Monge’s problem, with much better stability properties.

Metric Properties: Topology And Applications

Wasserstein distances metrize weak convergence under moment control, sit between weak and strong topologies, and provide quantitative estimates in probability and robust optimization.

Proof

The left inequality is Jensen’s inequality applied to rrq/pr\mapsto r^{q/p}. The right inequality follows from d(x,y)qdiam(X)qpd(x,y)pd(x,y)^q\leq\diam(\Xx)^{q-p}d(x,y)^p.

Proof Sketch

For discrete measures on a common support, set ci=min(ai,bi)c_i=\min(a_i,b_i). Any coupling has diagonal mass at most ici\sum_i c_i, so its off-diagonal cost is at least 1ici1-\sum_i c_i. This bound is achieved by putting cic_i on the diagonal and coupling the leftover positive and negative parts by a product plan. Since

1ici=12iaibi,1-\sum_i c_i = \frac12\sum_i |a_i-b_i|,

the formula follows.

For Dirac masses,

δxnδxTV=2,Wp(δxn,δx)=d(xn,x).\|\delta_{x_n}-\delta_x\|_{\TV}=2, \qquad \Wass_p(\delta_{x_n},\delta_x)=d(x_n,x).

Thus the strong topology never sees Diracs converge unless they are eventually equal, while the Wasserstein topology captures their spatial convergence.

Proof Sketch

For p=1p=1, this is the Kantorovich--Rubinstein metrization theorem: by duality, W1\Wass_1 is the supremum over 1-Lipschitz test functions, and on a compact metric space this class is compact modulo constants by Arzela--Ascoli. The comparison between Wp\Wass_p distances on compact spaces then gives the result for all p1p\geq1.

On non-compact spaces, one must also impose convergence of pp-th moments: Wp(αk,α)0\Wass_p(\alpha_k,\alpha)\to0 if and only if αkα\alpha_k\rightharpoonup\alpha and

d(x,x0)pdαk(x)d(x,x0)pdα(x).\int d(x,x_0)^p\d\alpha_k(x) \to \int d(x,x_0)^p\d\alpha(x).

On a discrete space, weak and strong topologies coincide, and

dmin2αβTVW1(α,β)dmax2αβTV.\frac{d_{\min}}{2}\|\al-\be\|_{\TV} \leq \Wass_1(\al,\be) \leq \frac{d_{\max}}{2}\|\al-\be\|_{\TV}.

Wasserstein Over Wasserstein

The construction can be iterated. Once (X,d)(\X,d) is a metric space, the set of probability measures on X\X becomes a metric space through Wp\Wass_p. It can therefore serve as a new ground space. This is useful whenever the objects to compare are themselves random probability measures, or mixtures whose components are meaningful objects rather than only a collapsed density.

Proof Sketch

Completeness follows by representing a Wp\Wass_p-Cauchy sequence through almost optimally glued couplings, which gives a Cauchy random sequence whose law is the desired limit. Separability follows by approximating measures with finitely supported measures on a countable dense subset and rational weights. If X\X is compact, Prokhorov compactness and Wasserstein metrization of weak convergence give compactness.

Elements of P2(P2(X))\Pp_2(\Pp_2(\X)) are probability laws over probability measures, or random probability measures. A parametric example is

A=(ζαζ)γ.\mathfrak A=(\zeta\mapsto\alpha_\zeta)_\sharp\gamma.

The Wasserstein distance on the Wasserstein space is

W22(A,B):=infΠΠ(A,B)P2(X)×P2(X)W22(α,β)dΠ(α,β).\mathbb W_2^2(\mathfrak A,\mathfrak B) \eqdef \inf_{\Pi\in\Couplings(\mathfrak A,\mathfrak B)} \int_{\Pp_2(\X)\times\Pp_2(\X)} \Wass_2^2(\alpha,\beta)\d\Pi(\alpha,\beta).

For Gaussian mixtures, this separates two geometries. A mixture can be viewed as a collapsed density on X\X, or as a component law over Gaussian atoms in the Bures--Wasserstein space. For two component laws

A=iaiδN(mi,Σi),B=jbjδN(nj,Λj),\mathfrak A=\sum_i a_i\delta_{\Gaussian(m_i,\Sigma_i)}, \qquad \mathfrak B=\sum_j b_j\delta_{\Gaussian(n_j,\Lambda_j)},

the component-level problem uses the cost

Cij=minj2+B(Σi,Λj)2.C_{ij}=\norm{m_i-n_j}^2+\Bb(\Sigma_i,\Lambda_j)^2.

If Π\Pi^\star is an optimal coupling between the weights aa and bb, and if AijA_{ij} is the Brenier linear part from Σi\Sigma_i to Λj\Lambda_j, each active pair follows the Gaussian geodesic

mij,t=(1t)mi+tnj,Σij,t=((1t)Id+tAij)Σi((1t)Id+tAij).m_{ij,t}=(1-t)m_i+t n_j, \qquad \Sigma_{ij,t} = \big((1-t)\Id+tA_{ij}\big)\Sigma_i \big((1-t)\Id+tA_{ij}\big).

Collapsing these component geodesics gives

αˉt=i,jΠijN(mij,t,Σij,t).\bar\alpha_t= \sum_{i,j}\Pi^\star_{ij}\Gaussian(m_{ij,t},\Sigma_{ij,t}).

This component-level interpolation generally differs from the true W2\Wass_2 interpolation between the collapsed mixture densities.

<IPython.core.display.Image object>

Two interpolations between the same three-component one-dimensional Gaussian mixtures. On the left, components are matched using the Bures--Wasserstein distance between Gaussians. On the right, the mixtures are collapsed into ordinary one-dimensional densities and interpolated by the true quantile formula for W2\Wass_2.

The interactive comparison keeps both geometries side by side: component-level transport moves Gaussian atoms, while collapsed transport rearranges the full density.

Interactive panel. Use the mixture and blur controls to compare transport between ordinary measures with transport between distributions of measures.

Proof

Fix a coupling Π\Pi between A\mathfrak A and B\mathfrak B. For each (α,β)(\alpha,\beta) choose an almost optimal coupling πα,βΠ(α,β)\pi_{\alpha,\beta}\in\Couplings(\alpha,\beta). Integrating this Markov kernel against Π\Pi gives a coupling between the collapsed mixtures whose cost is bounded by the Π\Pi-average of W22(α,β)\Wass_2^2(\alpha,\beta). Taking infima proves the claim.

This viewpoint also clarifies lower bounds for Gromov--Wasserstein distances: a metric-measure space can be mapped to a law of local distance profiles, and these laws can be compared by Wasserstein-over-Wasserstein.

Distributional Robustness And Wasserstein Infinity

Wasserstein distances define ambiguity sets around empirical laws. Given samples ziz_i and α^n=1niδzi\hat\alpha_n=\frac1n\sum_i\delta_{z_i}, a distributionally robust optimization problem replaces empirical risk by

supβ:Wp(β,α^n)ρθ(z)dβ(z).\sup_{\beta:\Wass_p(\beta,\hat\alpha_n)\leq\rho} \int \ell_\theta(z)\d\beta(z).

Under standard upper-semicontinuity and growth assumptions on the loss, one has the dual reformulation

supβ:Wp(β,α^n)pρpθdβ=infλ0λρp+1ni=1nsupz{θ(z)λd(z,zi)p}.\sup_{\beta:\Wass_p(\beta,\hat\alpha_n)^p\leq\rho^p} \int \ell_\theta\d\beta = \inf_{\lambda\geq0} \lambda\rho^p + \frac1n\sum_{i=1}^n \sup_z\{\ell_\theta(z)-\lambda d(z,z_i)^p\}.

The robust risk is therefore an empirical risk in which each sample is replaced by its worst penalized perturbation. For p=1p=1 and an LθL_\theta-Lipschitz loss,

supβ:W1(β,α^n)ρθdβ1niθ(zi)+ρLθ.\sup_{\beta:\Wass_1(\beta,\hat\alpha_n)\leq\rho} \int \ell_\theta\d\beta \leq \frac1n\sum_i\ell_\theta(z_i)+\rho L_\theta.
Proof

Let π0\pi_0 and π1\pi_1 be nearly optimal couplings for (α0,β0)(\alpha_0,\beta_0) and (α1,β1)(\alpha_1,\beta_1). Then (1t)π0+tπ1(1-t)\pi_0+t\pi_1 is a coupling between the convex combinations of the marginals, and its cost is the corresponding convex combination. Taking infima proves joint convexity. For p>1p>1, the root can destroy convexity; on the line, Wp((1t)δ0+tδ1,δ0)=t1/p\Wass_p((1-t)\delta_0+t\delta_1,\delta_0)=t^{1/p} is concave.

The limiting distance

W(α,β):=infπΠ(α,β)ess sup(x,y)πd(x,y)\Wass_\infty(\alpha,\beta) \eqdef \inf_{\pi\in\Couplings(\alpha,\beta)} \esssup_{(x,y)\sim\pi} d(x,y)

minimizes the worst displacement rather than an average displacement.

Proof

Any feasible coupling between α^\hat\alpha and β\beta disintegrates as iaiδziνi\sum_i a_i\delta_{z_i}\otimes\nu_i, with each νi\nu_i supported in B(zi,ρ)\overline B(z_i,\rho). Hence the robust expectation is bounded above by the displayed sum. The reverse inequality follows by choosing, for each ii, a maximizer ziB(zi,ρ)z_i^\star\in\overline B(z_i,\rho) and taking β=iaiδzi\beta=\sum_i a_i\delta_{z_i^\star}.

Theoretical Application: Quantitative Central Limit Theorems

Weak topology says whether laws converge; Wasserstein distances also quantify how fast. The central limit theorem becomes a rate estimate in W1\Wass_1, which controls the error of all 1-Lipschitz observables of the normalized sum.

<IPython.core.display.Image object>

Central-limit theorem for normalized Bernoulli sums. Starting from α0=12(δ1+δ1)\alpha_0=\frac12(\delta_{-1}+\delta_1), the law of Zn=n1/2iXiZ_n=n^{-1/2}\sum_i X_i remains discrete, but its rescaled atom heights approach the standard Gaussian density shown in gray.

The interactive demo varies the number of summands and the Bernoulli skew, making the Wasserstein view of weak convergence visible even while the law remains discrete.

Interactive panel. Use the dimension and sample-size controls to watch the Wasserstein CLT scaling predicted by Lipschitz test functions.

Proof Sketch

By Kantorovich--Rubinstein duality,

W1(αn,γ)=supLip(h)1Eh(Sn)Eh(G).\Wass_1(\alpha_n,\gamma) = \sup_{\Lip(h)\leq1} \left|\mathbb{E}h(S_n)-\mathbb{E}h(G)\right|.

For each such hh, solve Stein’s equation fh(x)xfh(x)=h(x)Eh(G)f_h'(x)-xf_h(x)=h(x)-\mathbb{E}h(G). The solution has uniform derivative bounds depending only on the Lipschitz constant of hh. Expanding by replacing summands one at a time, the first- and second-order terms cancel because EXi=0\mathbb{E}X_i=0 and EXi2=1\mathbb{E}X_i^2=1. The Taylor remainder is bounded by CiEXi/n3C\sum_i\mathbb{E}|X_i/\sqrt n|^3, giving the n1/2n^{-1/2} rate Berry, 1941Esseen, 1942Chen et al., 2011Bobkov, 2018Rio, 2011.

References
  1. Kantorovich, L. (1942). On the transfer of masses (in Russian). Doklady Akademii Nauk, 37(2), 227–229.
  2. Villani, C. (2003). Topics in Optimal Transportation (Vol. 58). American Mathematical Society.
  3. Villani, C. (2009). Optimal Transport: Old and New (Vol. 338). Springer.
  4. Rachev, S. T., & Rüschendorf, L. (1998). Mass Transportation Problems: Volume II: Applications. Springer.
  5. Dantzig, G. B. (1951). Application of the simplex method to a transportation problem. Activity Analysis of Production and Allocation, 13, 359–373.
  6. Bertsekas, D. P., & Eckstein, J. (1988). Dual coordinate step methods for linear network flow problems. Mathematical Programming, 42(1), 203–243.
  7. Orlin, J. B. (1997). A polynomial time primal network simplex algorithm for minimum cost flows. Mathematical Programming, 78(2), 109–129.
  8. Ambrosio, L., Gigli, N., & Savaré, G. (2006). Gradient Flows in Metric Spaces and in the Space of Probability Measures. Springer.
  9. Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Birkhäuser.
  10. Berry, A. C. (1941). The Accuracy of the Gaussian Approximation to the Sum of Independent Variates. Transactions of the American Mathematical Society, 49(1), 122–136.
  11. Esseen, C.-G. (1942). On the Liapunoff Limit of Error in the Theory of Probability. Arkiv for Matematik, Astronomi Och Fysik, 28A(9), 1–19.
  12. Chen, L. H. Y., Goldstein, L., & Shao, Q.-M. (2011). Normal Approximation by Stein’s Method. Springer.
  13. Bobkov, S. G. (2018). Berry–Esseen bounds and Edgeworth expansions in the central limit theorem for transport distances. Probability Theory and Related Fields, 170, 229–262.
  14. Rio, E. (2011). Asymptotic constants for minimal distance in the central limit theorem. Electronic Communications in Probability, 16, 96–103.