Kantorovich Relaxation

Kantorovich’s relaxation is the decisive move that turns transport into convex optimization. Deterministic maps are replaced by couplings, infeasibility and asymmetry disappear, and the Wasserstein distances emerge. Historically, this linear-programming viewpoint grew from Kantorovich’s economic planning work Kantorovich, 1942 and is now the standard foundation of optimal transport Villani, 2003; Villani, 2009; Rachev & Rüschendorf, 1998.

from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

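# Search the working directory and a few nearby folders for the helper module.
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        # Make `ot4ml_web` importable from this notebook.
        sys.path.insert(0, str(myst_dir))
        break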

if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"

def show_book_figure(name, width=760):
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))

from ot4ml_web import plot_regularization_sweep

Discrete Relaxation

The discrete relaxation is the cleanest place to see mass splitting. It replaces permutations by a transportation polytope and reveals the linear-programming structure that algorithms exploit.

Monge’s discrete matching problem cannot be applied when the two clouds have different cardinalities or unequal weights. The continuous Monge problem has the same obstruction: there may be no map $T$ such that $T_\sharp\al=\be$, for instance when one Dirac mass must be sent to several Dirac masses. It is also asymmetric: two Dirac masses can be mapped to one, but one Dirac mass cannot be split into two by a deterministic map.

Kantorovich’s idea is to relax deterministic transportation. Instead of sending each source point $x_i$ to exactly one target, the mass at $x_i$ may be dispatched across several targets. The relaxation is encoded by a coupling matrix $\P\in\RR_+^{n\times m}$ for two discrete measures

\al=\sum_i a_i\delta_{x_i}, \qquad \be=\sum_j b_j\delta_{y_j}.

(1)

Remark: Small transportation polytopes

The definition is already informative in the smallest dimensions. If $n=m=1$, mass conservation fixes the only entry, so the feasible set is a singleton. The same happens for $(n,m)=(2,1)$, and by symmetry for $(n,m)=(1,2)$: the unique coupling is forced by its only column, or by its only row.

The first nontrivial case is $(n,m)=(2,2)$. Let $\a=(p,1-p)$ and $\b=(q,1-q)$ with $p,q\in[0,1]$. Once $s\eqdef \P_{1,1}$ is chosen, the marginal constraints force all other entries, hence every coupling has the form

\P(s)= \begin{pmatrix} s & p-s \\ q-s & 1-p-q+s \end{pmatrix}.

(4)

The nonnegativity constraints are exactly

s\in\big[\max(0,p+q-1),\min(p,q)\big],

(5)

so $\CouplingsD(\a,\b)$ is a segment, which reduces to a single point exactly when $p$ or $q$ belongs to $\{0,1\}$. In general, when all marginal entries are positive, the transportation polytope has affine dimension $(n-1)(m-1)$ before the nonnegativity inequalities cut out its faces.
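The parametrization (4) and the interval (5) can be checked numerically. The following is a minimal sketch using exact rational arithmetic; the helper names `s_range`, `coupling_2x2` and `is_coupling` are illustrative and do not come from the book's code.

```python
from fractions import Fraction

def s_range(p, q):
    """Admissible interval for s = P[0][0], as in equation (5)."""
    return max(0, p + q - 1), min(p, q)

def coupling_2x2(p, q, s):
    """Coupling matrix P(s) of equation (4)."""
    return [[s, p - s], [q - s, 1 - p - q + s]]

def is_coupling(P, a, b):
    """Check nonnegativity and the prescribed row and column sums."""
    rows_ok = all(sum(row) == ai for row, ai in zip(P, a))
    cols_ok = all(sum(col) == bj for col, bj in zip(zip(*P), b))
    return rows_ok and cols_ok and all(x >= 0 for row in P for x in row)

p, q = Fraction(3, 5), Fraction(7, 10)
lo, hi = s_range(p, q)                       # endpoints of the segment
a, b = [p, 1 - p], [q, 1 - q]
inside = [is_coupling(coupling_2x2(p, q, s), a, b) for s in (lo, (lo + hi) / 2, hi)]
# Slightly beyond the right endpoint, the entry p - s becomes negative.
outside = is_coupling(coupling_2x2(p, q, hi + Fraction(1, 100)), a, b)
```

The marginal sums hold for every real $s$; only the sign constraints depend on the interval, which is why `outside` fails while every entry of `inside` passes.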

The first consequence is feasibility. There is always at least one admissible plan.

The feasible set is a bounded intersection of an affine space with the nonnegative orthant, hence a convex polytope. In one dimension, the coupling can be read as a matrix: rows index source bins, columns index target bins, and the marginal constraints appear as prescribed row and column sums.

Thus the product plan is mainly a feasibility witness. Except when the linear cost is constant on the whole transportation polytope, it is not expected to solve optimal transport.
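To illustrate the gap between feasibility and optimality, here is a small sketch, assuming the product plan has entries $a_ib_j$. It checks the marginals for clouds of different sizes and compares the product cost with a diagonal plan on a two-point example. The function names and the toy cost $|i-j|$ are assumptions chosen for illustration.

```python
from fractions import Fraction

def product_plan(a, b):
    """Independent coupling with entries a_i * b_j."""
    return [[ai * bj for bj in b] for ai in a]

def marginals(P):
    """Row sums and column sums of a coupling matrix."""
    return [sum(row) for row in P], [sum(col) for col in zip(*P)]

def transport_cost(C, P):
    """Linear cost <C, P>."""
    return sum(c * p for c_row, p_row in zip(C, P) for c, p in zip(c_row, p_row))

# Feasibility holds even when the two clouds have different sizes.
a = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
b = [Fraction(1, 4), Fraction(3, 4)]
P = product_plan(a, b)

# Two uniform atoms with cost |i - j|: the product plan is feasible but not optimal.
half = Fraction(1, 2)
C = [[0, 1], [1, 0]]
product_cost = transport_cost(C, product_plan([half, half], [half, half]))
diagonal_cost = transport_cost(C, [[half, 0], [0, half]])
```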

The following figure contrasts deterministic, product and optimal couplings through weighted transport segments.

Discrete couplings represented as straight transport segments. The deterministic graph is a feasible Monge-type plan, the product plan spreads every source mass over all targets, and the optimal Kantorovich plan minimizes the quadratic transport cost. Line width and opacity encode transported mass.

The interactive demo below separates the main feasible-plan archetypes: deterministic graphs, independent product couplings, sparse splitting plans, and entropic approximations.

Interactive panel. Use the point and mass sliders to see how a Kantorovich plan can split mass into several weighted links rather than choosing one destination per source.

The following figure gives the complementary matrix view and displays the prescribed marginals next to each coupling.

Coupling matrices with their prescribed marginals. The central grayscale image displays $\P_{ij}$; the red curve on the left is the source marginal $a$, and the blue curve on top is the target marginal $b$. The independent product plan is diffuse, whereas the one-dimensional optimal plan concentrates near the monotone quantile correspondence.

The companion control varies the bin count and the endpoint laws, making the transition from diffuse independence to monotone transport visually explicit.

Interactive panel. Use the problem-size and mass-shape controls to compare the coupling matrix with its red and blue marginal sums.

The Kantorovich feasible set is symmetric: $\P\in\CouplingsD(a,b)$ if and only if $\P^\top\in\CouplingsD(b,a)$. Given a cost matrix $\C$ whose entry $\C_{ij}$ is the cost of moving one unit of mass from $x_i$ to $y_j$, the discrete Kantorovich problem reads

\mathcal{L}_\C(a,b) \eqdef \min_{\P\in\CouplingsD(a,b)} \langle \C,\P\rangle = \min_{\P\in\CouplingsD(a,b)} \sum_{i,j} \C_{ij}\P_{ij}.

(9)

This is a linear program, and its solutions need not be unique.
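Non-uniqueness is already visible in the $2\times2$ case, where the objective is affine in the single parameter $s=\P_{1,1}$ of a coupling with marginals $(p,1-p)$ and $(q,1-q)$. The sketch below is a hand-rolled illustration rather than a general LP solver; `optimal_s_interval` is a hypothetical helper name. It returns the set of optimal values of $s$ as an interval.

```python
from fractions import Fraction

def optimal_s_interval(C, p, q):
    """Optimal values of s = P[0][0] for marginals (p, 1-p) and (q, 1-q).

    <C, P(s)> is affine in s with slope C00 - C01 - C10 + C11, so the
    minimum over the feasible segment sits at an endpoint, or on the
    whole segment when the slope vanishes.
    """
    lo, hi = max(0, p + q - 1), min(p, q)
    slope = C[0][0] - C[0][1] - C[1][0] + C[1][1]
    if slope > 0:
        return lo, lo
    if slope < 0:
        return hi, hi
    return lo, hi

half = Fraction(1, 2)
unique = optimal_s_interval([[0, 1], [1, 0]], half, half)   # single optimal vertex
flat = optimal_s_interval([[1, 2], [3, 4]], half, half)     # additive cost: every coupling is optimal
```

The second cost has the additive form $\C_{ij}=u_i+v_j$, so the objective is constant on the polytope and every feasible $s$ is optimal.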

The following figure contrasts permutation plans for uniform empirical measures with the splitting couplings needed for nonuniform marginals.

From permutation matrices to splitting couplings. When the two empirical measures have the same number of atoms and uniform weights, an optimal plan can be a permutation matrix. Once target masses are nonuniform, one source can send mass to several targets and several sources can merge into the same target.

The interactive demo keeps the same source and target sites while changing the target mass imbalance, so the moment where permutation structure breaks becomes visible.

Interactive panel. Use the split-mass and geometry controls to contrast deterministic permutation-like transport with plans that divide source mass across targets.

Sparsity here is not peculiar to transport. For nonnegative variables, the relevant quantity is the rank of the constraint operator, not the raw number of listed constraints, which may contain redundancies. The following standard linear-programming principle makes this precise.

For transportation polytopes, the same kernel argument has a graph interpretation: a cycle in the bipartite support carries an alternating perturbation with zero row and column sums. Thus every extreme coupling has a forest support.

The north-west corner rule, summarized in Algorithm: North-west corner coupling, does not use the cost matrix and is therefore not meant to solve the discrete Kantorovich problem. Its role is algorithmic: an acyclic support corresponds to linearly independent marginal constraints. When the support has fewer than $n+m-1$ positive entries, transportation simplex implementations complete it with zero-mass basic variables to obtain a degenerate basic feasible solution. This gives a cheap initialization for the pivoting methods discussed in Section Linear-Programming Algorithms.
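A minimal sketch of the usual north-west corner rule is given below. It assumes that the weights are exact rationals with equal total mass, so that comparisons with zero are reliable; with floating-point weights a tolerance would be needed. The function name is an illustrative choice.

```python
from fractions import Fraction

def north_west_corner(a, b):
    """Greedy coupling that fills the matrix from the top-left corner."""
    supply, demand = list(a), list(b)        # remaining row and column mass
    n, m = len(supply), len(demand)
    P = [[Fraction(0)] * m for _ in range(n)]
    i = j = 0
    while i < n and j < m:
        t = min(supply[i], demand[j])        # ship as much as both sides allow
        P[i][j] += t
        supply[i] -= t
        demand[j] -= t
        if supply[i] == 0:
            i += 1                           # row exhausted: move down
        else:
            j += 1                           # column exhausted: move right
    return P

a = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
b = [Fraction(1, 3), Fraction(2, 3)]
P = north_west_corner(a, b)
support_size = sum(1 for row in P for x in row if x > 0)
```

The walk only moves down or right, so it visits at most $n+m-1$ cells and the resulting support is a staircase, hence acyclic.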

One-Dimensional Cases

In one dimension, the transportation polytope has a canonical monotone optimizer. This is the weighted version of the sorting rule from the matching chapter.

Permutation Matrices As Couplings

Now assume $n=m$ and uniform weights $a=b=\mathbf{1}_n/n$. In this case, a matching can be encoded as a matrix with exactly one active entry per row and per column.

The corresponding probability coupling is $\P_\sigma/n$. If the matching cost matrix is $\C$, then

\langle \C,\P_\sigma/n\rangle = \frac1n\sum_{i=1}^n \C_{i,\sigma(i)}.

(18)

Thus the assignment problem is the minimization of a linear function over the discrete, non-convex set of permutation matrices. The convex relaxation replaces this finite set by all bistochastic matrices.
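The identity (18) and the finite nature of the assignment problem can be illustrated by brute force on a tiny instance. This sketch enumerates all permutations, which is only viable for very small $n$; the cost matrix and function names are made up for the example.

```python
from fractions import Fraction
from itertools import permutations

def permutation_coupling(sigma):
    """Coupling P_sigma / n, with mass 1/n on each entry (i, sigma(i))."""
    n = len(sigma)
    return [[Fraction(int(sigma[i] == j), n) for j in range(n)] for i in range(n)]

def inner(C, P):
    """Linear cost <C, P>."""
    return sum(C[i][j] * P[i][j] for i in range(len(C)) for j in range(len(C[0])))

def best_assignment(C):
    """Exhaustive search over the n! permutation matrices."""
    n = len(C)
    return min(permutations(range(n)), key=lambda s: sum(C[i][s[i]] for i in range(n)))

C = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]
sigma = best_assignment(C)
lhs = inner(C, permutation_coupling(sigma))                       # <C, P_sigma / n>
rhs = Fraction(sum(C[i][sigma[i]] for i in range(3)), 3)          # (1/n) sum_i C[i][sigma(i)]
```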

The following figure shows the non-extreme mechanism used in the proof below. The displayed matrix is bistochastic but not a permutation matrix: the unit entries already behave like isolated matching edges, while the fractional support contains a minimal alternating cycle.

Cycle certificate in the Birkhoff--von Neumann proof. The left panel is a $7\times7$ bistochastic matrix which is not a permutation matrix. The right panel shows its bipartite positive-support graph, with the column nodes sorted as $j_1,\ldots,j_7$ from top to bottom to match the matrix order: red nodes are rows, blue nodes are columns, thin purple edges correspond to $0<\P_{ij}<1$, and bold black edges correspond to isolated entries $\P_{ij}=1$. The orange halo marks the longer alternating fractional cycle along which one can add and subtract mass while preserving all row and column sums.

Interactive panel. Move mass around the alternating cycle and observe that all row and column sums remain unchanged.

We first prove that permutation matrices are extreme. Let $\P_\sigma\in\mathcal P_n^{\mathrm{perm}}$ and assume that

\P_\sigma=\frac{\Q+\R}{2} \qquad\text{with}\qquad \Q,\R\in\mathcal B_n .

(22)

Every bistochastic matrix has entries in $[0,1]$. Since the only extreme points of $[0,1]$ are 0 and 1, each entry of $\P_\sigma$ fixes the corresponding entries of $\Q$ and $\R$: if $(\P_\sigma)_{ij}=0$, then $\Q_{ij}=\R_{ij}=0$, while if $(\P_\sigma)_{ij}=1$, then $\Q_{ij}=\R_{ij}=1$. Hence $\Q=\R=\P_\sigma$, so $\P_\sigma$ is extreme.

We now prove the converse by contrapositive. Pick $\P\in\mathcal B_n\setminus\mathcal P_n^{\mathrm{perm}}$. Since an integral bistochastic matrix is necessarily a permutation matrix, $\P$ has at least one fractional entry. We shall split $\P=(\Q+\R)/2$ with $\Q,\R\in\mathcal B_n$ and $\Q\neq \R$, proving that $\P$ is not extreme.

Associate with $\P$ the bipartite graph whose left vertices are the rows, whose right vertices are the columns, and whose edges are the fractional entries $0<\P_{ij}<1$. An entry equal to 1 uses the whole mass of its row and column, so it is isolated in the positive support and does not appear in this fractional graph. If a left vertex is incident to one fractional edge, then it must be incident to at least one other fractional edge: after the first fractional contribution, the row still has positive remaining mass, and that remainder cannot be carried by an entry equal to 1. The same argument applies to columns. Thus every non-isolated vertex of the fractional graph has degree at least two.

Starting from any fractional edge, one may therefore walk through adjacent fractional edges without immediately backtracking and without getting stuck. Since the graph is finite, some vertex is eventually visited twice; the portion of the walk between the two visits contains a cycle. Choose a shortest such cycle and write it in alternating form

(i_1,j_1,i_2,j_2,\ldots,i_p,j_p), \qquad i_{p+1}=i_1,

(23)

where both $(i_s,j_s)$ and $(i_{s+1},j_s)$ are fractional for every $s$ . Define

\epsilon \eqdef \min_{1\leq s\leq p} \{ \P_{i_s,j_s}, \P_{i_{s+1},j_s}, 1-\P_{i_s,j_s}, 1-\P_{i_{s+1},j_s} \}>0,

(24)

and split the cycle edges into the alternating families

A=\{(i_s,j_s)\}_{s=1}^p, \qquad B=\{(i_{s+1},j_s)\}_{s=1}^p .

(25)

Set $\Q=\P$ and $\R=\P$ outside $A\cup B$; on $A$, set $\Q_{ij}=\P_{ij}+\epsilon/2$ and $\R_{ij}=\P_{ij}-\epsilon/2$; on $B$, set $\Q_{ij}=\P_{ij}-\epsilon/2$ and $\R_{ij}=\P_{ij}+\epsilon/2$. By the definition of $\epsilon$, all modified entries stay in $[0,1]$. Each row and column of the cycle sees one $+\epsilon/2$ and one $-\epsilon/2$, so the row and column sums remain one. Thus $\Q,\R\in\mathcal B_n$, $\Q\neq \R$, and $\P=(\Q+\R)/2$. Hence $\P$ is not extreme. Consequently every extreme point of $\mathcal B_n$ is integral, and every integral bistochastic matrix is a permutation matrix.
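The splitting step of the proof can be written out directly. The sketch below takes a fractional alternating cycle as input, given by its row indices $(i_1,\ldots,i_p)$ and column indices $(j_1,\ldots,j_p)$ as in (23), and builds $\Q$ and $\R$ following (24) and (25). It does not search for the cycle; the $3\times3$ matrix and the function name are illustrative assumptions.

```python
from fractions import Fraction

def split_along_cycle(P, rows, cols):
    """Return (Q, R) with P = (Q + R) / 2, perturbing P along an alternating cycle."""
    p = len(rows)
    A = [(rows[s], cols[s]) for s in range(p)]                 # entries (i_s, j_s)
    B = [(rows[(s + 1) % p], cols[s]) for s in range(p)]       # entries (i_{s+1}, j_s)
    eps = min(min(P[i][j], 1 - P[i][j]) for (i, j) in A + B)   # equation (24)
    Q = [row[:] for row in P]
    R = [row[:] for row in P]
    for (i, j) in A:
        Q[i][j] += eps / 2
        R[i][j] -= eps / 2
    for (i, j) in B:
        Q[i][j] -= eps / 2
        R[i][j] += eps / 2
    return Q, R

F = Fraction
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(1, 4), F(1, 4)],
     [F(0),    F(1, 4), F(3, 4)]]
Q, R = split_along_cycle(P, rows=(0, 1), cols=(0, 1))

def is_bistochastic(M):
    rows_ok = all(sum(row) == 1 for row in M)
    cols_ok = all(sum(col) == 1 for col in zip(*M))
    return rows_ok and cols_ok and all(0 <= x <= 1 for row in M for x in row)
```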

The same combinatorial idea gives the constructive decomposition used to express a bistochastic matrix as a convex combination of permutations.

Algorithm: Birkhoff--von Neumann decomposition

Input: Bistochastic matrix $\P\in\mathcal B_n$.

Output: Decomposition $\P=\sum_r\lambda_r\P_{\sigma_r}$.

Initialize: Set $\R=\P$, $s=1$, and $\mathcal L=\emptyset$.

While $s>0$ do:

Build bipartite graph $G_\R=\{(i,j):\R_{ij}>0\}$.
Set $\sigma$ to the lexicographically first perfect matching of $G_\R$.
Set $\lambda=\min_i \R_{i,\sigma(i)}$.
Append $(\lambda,\sigma)$ to $\mathcal L$.
Update $\R\leftarrow \R-\lambda \P_\sigma$ and $s\leftarrow s-\lambda$.

Return $\P=\sum_{(\lambda_r,\sigma_r)\in\mathcal L}\lambda_r\P_{\sigma_r}, \qquad \sum_r\lambda_r=1.$

The perfect matching required at each iteration exists by Hall’s theorem. Indeed, while the common row and column sum of the residual matrix is $s>0$, any set $I$ of row vertices and its neighborhood $N(I)$ satisfy

s|I| = \sum_{i\in I}\sum_{j\in N(I)}\R_{ij} \leq \sum_{j\in N(I)}\sum_i\R_{ij} =s|N(I)|.

(26)

The equality holds because all the positive entries of a row $i\in I$ lie in columns of $N(I)$. Thus $|N(I)|\geq|I|$, which is Hall’s condition. Subtracting $\lambda \P_\sigma$ preserves a common row and column sum $s-\lambda$ and removes at least one positive entry. The algorithm therefore terminates after finitely many steps with $\R=0$; summing the updates yields the announced convex decomposition and $\sum_r\lambda_r=1$.
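The decomposition loop can be sketched in a few lines of Python. Two choices here are assumptions of this sketch rather than part of the algorithm box: the perfect matching is found by a standard augmenting-path search, so it is some perfect matching of the support and not necessarily the lexicographically first one, and entries are exact `Fraction` values so that the test $s>0$ is reliable.

```python
from fractions import Fraction

def perfect_matching(support):
    """Augmenting-path matching; returns sigma with sigma[i] = column of row i, or None."""
    n = len(support)
    match_col = [None] * n                   # match_col[j] = row currently using column j

    def try_row(i, seen):
        for j in range(n):
            if support[i][j] and j not in seen:
                seen.add(j)
                if match_col[j] is None or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, set()):
            return None
    sigma = [None] * n
    for j, i in enumerate(match_col):
        sigma[i] = j
    return sigma

def birkhoff_decomposition(P):
    """List of (lambda_r, sigma_r) with P = sum_r lambda_r * P_{sigma_r}."""
    n = len(P)
    R = [row[:] for row in P]                # residual matrix
    s = Fraction(1)                          # common row and column sum of R
    terms = []
    while s > 0:
        support = [[R[i][j] > 0 for j in range(n)] for i in range(n)]
        sigma = perfect_matching(support)    # exists by Hall's condition
        if sigma is None:
            raise ValueError("input is not bistochastic")
        lam = min(R[i][sigma[i]] for i in range(n))
        terms.append((lam, sigma))
        for i in range(n):
            R[i][sigma[i]] -= lam
        s -= lam
    return terms

F = Fraction
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(1, 4), F(1, 4)],
     [F(0),    F(1, 4), F(3, 4)]]
terms = birkhoff_decomposition(P)
rebuilt = [[sum(lam for lam, sig in terms if sig[i] == j) for j in range(3)] for i in range(3)]
```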

Equivalently, for uniform empirical measures, one can always choose a permutation matrix among the minimizers of the relaxed Kantorovich problem: the relaxation is tight for assignment problems.

Rational Weights

The strict assignment model is tied to equal cardinalities and equal weights, whereas the coupling set of Definition: Discrete Couplings And Mass Conservation accommodates arbitrary discrete masses and different support sizes. The figure below contrasts these regimes. For rational weights, the relaxed problem can in fact be reduced back to a larger uniform assignment problem.

From assignments to transport plans, using the same disk-to-annulus geometry. In the balanced equal-weight case, each source atom is matched to one target atom. With a target cloud that has half as many atoms, or with strongly nonuniform target weights, the coupling matrix can merge or split mass; segment thickness and opacity encode its nonzero entries, and blue marker areas encode the prescribed target masses.

The interactive panel below exposes the target resolution, target weights, and regularization level. The first displayed plan is sparse, while positive regularization values show the entropic smoothing used later in the Sinkhorn chapter.

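# n_source, n_target, epsilons, weight_mode and weight_strength are not defined
# in this excerpt; they are assumed to be set by the interactive panel's controls.
fig = plot_regularization_sweep(
    n_source=n_source,
    n_target=n_target,
    source_shape="disk",
    target_shape="annulus",
    cost_power=2,
    epsilons=epsilons,
    weight_mode=weight_mode,
    weight_strength=weight_strength,
    seed=2031,
)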

Interactive panel. Use the source and target sizes, weight pattern, and regularization sliders to see how unequal masses and finite resolution change the matching picture.

Proposition: Rational Weights as Duplicated Uniform Matching

Let

\al=\sum_{i=1}^n \frac{k_i}{N}\delta_{x_i}, \qquad \be=\sum_{j=1}^m \frac{\ell_j}{N}\delta_{y_j}, \qquad \sum_i k_i=\sum_j\ell_j=N,

(27)

with positive integers $k_i,\ell_j$. Replace each $x_i$ by $k_i$ identical copies and each $y_j$ by $\ell_j$ identical copies, producing two uniform $N$-point clouds. The duplicated assignment problem and the discrete Kantorovich problem between $\al$ and $\be$ have the same optimal value. Moreover, an optimal coupling exists of the form

\P_{ij}=\frac{n_{ij}}N, \qquad n_{ij}\in\NN, \qquad \sum_j n_{ij}=k_i, \quad \sum_i n_{ij}=\ell_j.

(28)

Couplings whose scaled entries $N\P_{ij}$ are integral are exactly the collapsed assignments between the duplicated clouds. Fractional optimal couplings may nevertheless coexist when the optimum is degenerate.
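The duplication argument can be made concrete on a toy instance. The sketch below copies each atom according to its multiplicity, solves the duplicated assignment by exhaustive search (only practical for very small $N$), and collapses the result into the integer count matrix $n_{ij}$ of (28). The point locations, multiplicities and function name are assumptions made for the example.

```python
from itertools import permutations

def duplicated_assignment(C, k, l):
    """Count matrix n_ij obtained by solving the assignment between duplicated clouds."""
    src = [i for i, ki in enumerate(k) for _ in range(ki)]     # k_i copies of source i
    tgt = [j for j, lj in enumerate(l) for _ in range(lj)]     # l_j copies of target j
    N = len(src)
    best = min(permutations(range(N)),
               key=lambda s: sum(C[src[r]][tgt[s[r]]] for r in range(N)))
    counts = [[0] * len(l) for _ in k]
    for r in range(N):
        counts[src[r]][tgt[best[r]]] += 1                      # regroup identical copies
    return counts

x, y = [0, 1], [0, 1, 2]
k, l = [3, 1], [1, 2, 1]                                       # weights k_i / 4 and l_j / 4
C = [[(xi - yj) ** 2 for yj in y] for xi in x]
counts = duplicated_assignment(C, k, l)
N = sum(k)
total_cost_times_N = sum(C[i][j] * counts[i][j] for i in range(2) for j in range(3))
```

Dividing `counts` by $N$ gives a coupling of the form (28); here the source atom of multiplicity 3 splits its mass between two targets.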

This network-flow integrality mechanism is the rational-weight counterpart of the Birkhoff--von Neumann theorem above (Theorem: Birkhoff--von Neumann): in both cases, a linear transport relaxation has an optimizer represented by integer edge flows after scaling. Equal unit margins specialize these flows to permutation matrices, whereas a degenerate optimal face may also contain fractional couplings.

The following figure makes this reduction explicit by replacing each rational mass with identical unit copies and then regrouping the resulting uniform assignment.

Rational weights as duplicated uniform matchings, using the same disk-to-annulus geometry with fewer displayed atoms. The red and blue locations are kept fixed, while disk areas encode the integer multiplicities $k_i$ and $\ell_j$. Solving the assignment problem after duplicating particles produces several collapsed segments attached to high-multiplicity atoms; this is the integer count matrix of the proposition.

Interactive panel. Use the site and multiplicity sliders to see how rational weights can be represented by duplicated unit masses before solving an ordinary matching problem.

Applications of discrete transport

Discrete Kantorovich transport applies whenever weighted data points or histograms must be compared through a geometry-aware correspondence. The following examples illustrate information transfer between datasets, comparison of visual distributions, and inference of temporal relations between cell populations.

Example: Application to domain adaptation

In unsupervised domain adaptation, labeled source samples $(x_i^s,y_i^s)_i$ and unlabeled target samples $(x_j^t)_j$ define empirical laws $\al_s=\sum_i a_i\de_{x_i^s}$ and $\al_t=\sum_j b_j\de_{x_j^t}$. Writing $\a=(a_i)_i$ and $\b=(b_j)_j$, a Kantorovich coupling $\P\in\CouplingsD(\a,\b)$ gives soft correspondences between the two clouds. For $b_j>0$, disintegrating $\P$ with respect to its target marginal and averaging the source labels gives the backward barycentric rule $\widetilde y_j=b_j^{-1}\sum_i\P_{i,j}y_i^s$, for either real-valued labels or one-hot class vectors. This is the label-space analogue of the barycentric projection introduced later in Definition: Barycentric Projection of a Coupling. Alternatively, transport can be optimized jointly with a classifier by adding label-prediction terms to the feature cost. This is the mechanism behind OT domain adaptation and JDOT: the plan is not only a distance certificate, but an explicit cross-domain alignment Courty et al., 2017; Courty et al., 2017. Learning or adapting the ground cost is the inverse viewpoint developed later in Section Metric Learning and Inverse OT.
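The backward barycentric rule $\widetilde y_j=b_j^{-1}\sum_i\P_{i,j}y_i^s$ is a column-normalized average of the source labels. A minimal sketch with one-hot labels follows; the coupling and labels are made-up toy values, and the function name is hypothetical.

```python
from fractions import Fraction

def backward_barycentric_labels(P, b, Y):
    """Soft label of each target j: (1 / b_j) * sum_i P[i][j] * Y[i].

    Y[i] is the label vector of source i (one-hot or real-valued).
    Targets with zero mass receive None, since the rule requires b_j > 0.
    """
    n, d = len(P), len(Y[0])
    labels = []
    for j, bj in enumerate(b):
        if bj > 0:
            labels.append([sum(P[i][j] * Y[i][c] for i in range(n)) / bj for c in range(d)])
        else:
            labels.append(None)
    return labels

F = Fraction
P = [[F(1, 4), F(1, 4)],      # source 0 (class 0) splits its mass between both targets
     [F(0),    F(1, 2)]]      # source 1 (class 1) sends everything to target 1
b = [F(1, 4), F(3, 4)]        # target marginal (column sums of P)
Y = [[1, 0], [0, 1]]          # one-hot source labels
soft = backward_barycentric_labels(P, b, Y)
```

Each soft label is a probability vector over classes when the source labels are one-hot, because the column of $\P$ divided by $b_j$ sums to one.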

Example: Application to single-cell population dynamics

Single-cell sequencing observes different cells at each collection time. If $x_i^{[k]}\in\Xx$ denotes the state of cell $i$ at time $t_k$, write

\al_{t_k}=\sum_{i=1}^{n_k}a_i^{[k]}\de_{x_i^{[k]}} \qquad\text{and}\qquad \a^{[k]}=(a_i^{[k]})_{i=1}^{n_k}\in\simplex_{n_k}.

(30)

Thus a cell is one atom of an empirical population measure, rather than itself a measure over genes as in Example: Application to gene-expression distances between two cells. A balanced transport between two consecutive snapshots is represented by a matrix $\P^{[k]}\in\RR_+^{n_k\times n_{k+1}}$ satisfying

\P^{[k]}\ones_{n_{k+1}}=\a^{[k]} \qquad\text{and}\qquad (\P^{[k]})^\top\ones_{n_k}=\a^{[k+1]}.

(31)

Its entry $\P^{[k]}_{i,j}$ is a soft ancestor--descendant mass, not evidence that the same cell was observed twice. For every row with positive sum, the normalization

\K^{[k]}_{i,j} \coloneqq \frac{\P^{[k]}_{i,j}} {\sum_{j'=1}^{n_{k+1}}\P^{[k]}_{i,j'}}

(32)

defines a conditional transition probability; multiplying successive matrices $\K^{[k]}$ yields population-level lineage probabilities across several collection times.
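The row normalization (32) and the composition of successive kernels can be sketched as follows. The two couplings are small made-up examples whose marginals chain correctly, and rows with zero mass are left as zero rows, which is a convention of this sketch since (32) is only stated for rows with positive sum.

```python
from fractions import Fraction

def transition_kernel(P):
    """Row-normalize a coupling into conditional probabilities K[i][j]."""
    K = []
    for row in P:
        total = sum(row)
        K.append([x / total for x in row] if total > 0 else [Fraction(0)] * len(row))
    return K

def compose(K1, K2):
    """Matrix product K1 @ K2: two-step ancestor-to-descendant probabilities."""
    return [[sum(K1[i][l] * K2[l][j] for l in range(len(K2))) for j in range(len(K2[0]))]
            for i in range(len(K1))]

F = Fraction
P0 = [[F(1, 4), F(1, 4), F(0)],      # snapshot 0 (2 cells) to snapshot 1 (3 cells)
      [F(0),    F(1, 4), F(1, 4)]]
P1 = [[F(1, 4), F(0)],               # snapshot 1 (3 cells) to snapshot 2 (2 cells)
      [F(1, 4), F(1, 4)],
      [F(0),    F(1, 4)]]
K0, K1 = transition_kernel(P0), transition_kernel(P1)
K02 = compose(K0, K1)
```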

Waddington-OT modifies this balanced baseline by incorporating growth estimates and relaxing the marginal constraints through unbalanced transport, as discussed in Example: Application to proliferating and dying cell populations Schiebinger et al., 2019. Related continuous and entropic constructions include trajectory inference and generative transport models Tong et al., 2020; Lavenant et al., 2021; Klein et al., 2024. The figure below displays three observed population snapshots in the common force-layout embedding used by Waddington-OT; this visualization need not coincide with the ground geometry used to compute $\P^{[k]}$.

Three snapshots of the Waddington-OT single-cell reprogramming time course. The first panel aggregates a representative sample from all collection times and colors cells from red to blue according to time. The remaining panels highlight the populations observed at days 0, 9, and 18, while cells from other times remain in gray. Every panel uses the same official force-layout embedding and viewport; no transport coupling or interpolated trajectory is displayed.

Linear-Programming Algorithms

The discrete Kantorovich problem is a linear program with much more structure than a generic dense LP. Its variables are arcs of a complete bipartite network, its equality constraints are flow-conservation constraints, and its extreme points are sparse tree-like couplings.

Transportation Simplex And Network Simplex

The transportation simplex goes back to Dantzig’s formulation of the transportation problem Dantzig, 1951. It works on basic feasible couplings, whose support is completed into a spanning tree of the bipartite supply-demand graph. Reduced costs identify whether an unused arc can decrease the objective. Adding such an arc creates a unique cycle; one then pushes as much mass as possible around that cycle and removes the exhausted arc.

The network simplex is the corresponding pivoting method for general minimum-cost-flow problems Bertsekas & Eckstein, 1988. It keeps node potentials, reduced costs and a spanning-tree basis. Its worst-case number of pivots can be exponential, but the per-pivot operations exploit graph sparsity. Polynomial guarantees can be obtained from strongly polynomial minimum-cost-flow algorithms such as Orlin’s algorithm Orlin, 1997.

Interior-Point Methods

Generic interior-point methods approach the LP through a smooth central path. Assume here that all entries of $a$ and $b$ are positive; zero-mass rows and columns must first be removed. The logarithmic-barrier problem on the resulting transport polytope is

\P_\epsilon \eqdef \argmin_{\substack{\P\mathbf{1}_m=a,\;\P^\top\mathbf{1}_n=b\\ \P_{ij}>0}} \langle \C,\P\rangle - \epsilon\sum_{i,j}\log \P_{ij}.

(33)

The barrier is singular at the boundary, so each iterate stays strictly inside the transportation polytope. As $\epsilon\downarrow0$, the central path approaches the set of LP minimizers.
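On the $2\times2$ transportation polytope with marginals $(p,1-p)$ and $(q,1-q)$, the barrier problem (33) reduces to a strictly convex problem in the single parameter $s=\P_{1,1}$, so the central path can be traced by bisection on the derivative. This is a toy sketch under that reduction, not an interior-point solver; the function name, the numerical values and the fixed iteration count are illustrative assumptions.

```python
def barrier_minimizer_2x2(C, p, q, eps, iters=60):
    """Minimize <C, P(s)> - eps * sum(log P_ij(s)) over the open feasible segment."""
    lo, hi = max(0.0, p + q - 1.0), min(p, q)
    slope = C[0][0] - C[0][1] - C[1][0] + C[1][1]     # <C, P(s)> = const + slope * s

    def derivative(s):
        # Entries of P(s): s, p - s, q - s, 1 - p - q + s.
        barrier = 1.0 / s - 1.0 / (p - s) - 1.0 / (q - s) + 1.0 / (1.0 - p - q + s)
        return slope - eps * barrier

    left, right = lo, hi
    for _ in range(iters):
        mid = 0.5 * (left + right)
        if derivative(mid) < 0:       # objective still decreasing: minimizer is to the right
            left = mid
        else:
            right = mid
    return 0.5 * (left + right)

C = [[0.0, 1.0], [1.0, 0.0]]          # the LP optimum is the vertex s = min(p, q)
p, q = 0.6, 0.7
path = [barrier_minimizer_2x2(C, p, q, eps) for eps in (1.0, 0.1, 0.01, 0.001)]
```

With these values the minimizers increase toward the vertex $s=0.6$ as $\epsilon$ decreases while remaining strictly inside the segment.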

The following figure isolates this mechanism on a two-dimensional polytope: decreasing the barrier parameter moves the minimizer along the central path toward the optimal face.

Logarithmic-barrier central path for a triangular slice of a linear program. Large $\epsilon$ selects a central interior point; decreasing $\epsilon$ moves the minimizer toward the optimal vertex while never touching the boundary. This differs from entropic OT, where the entropy temperature is part of the regularized objective itself.

The interactive view exposes the barrier parameter directly: lowering $\epsilon$ slides the minimizer from the center of the feasible triangle toward the LP vertex.

Interactive panel. Use the barrier and angle controls to move along the interior central path of the transport polytope.

Both interior-point methods and Sinkhorn keep iterates positive, but they use positivity differently. Interior-point algorithms solve the original LP by decreasing a barrier parameter. Sinkhorn fixes an entropic temperature and solves a different, KL-regularized OT problem by alternating diagonal scalings.

Relaxation For Arbitrary Measures

This section lifts the finite-dimensional coupling matrix to a joint probability measure. The payoff is that existence, duality and metric properties can be stated for arbitrary laws, including discrete, singular and continuous distributions.

Continuous Couplings

Remark: Probabilistic interpretation of couplings

If $X\sim\al$ and $Y\sim\be$, then $\pi\in\Couplings(\al,\be)$ means that $\pi$ is the law of a pair $(X,Y)$ whose coordinates have laws $\al$ and $\be$. The coupling encodes the dependence between $X$ and $Y$. The tensor product $\al\otimes\be$ corresponds to independence, whereas a graph coupling $(\Id,T)_\sharp\al$ corresponds to the deterministic relation $Y=T(X)$.

In the discrete case, when $\al=\sum_i \a_i\de_{x_i}$ and $\be=\sum_j \b_j\de_{y_j}$, the constraint $\pi_1=\al$ and $\pi_2=\be$ forces every coupling to have the form $\pi=\sum_{i,j}\P_{ij}\de_{(x_i,y_j)}$ with $\P\in\CouplingsD(\a,\b)$. The discrete formulation is therefore a special case of the continuous one, not merely an approximation.

Unlike the Monge constraint, the coupling constraint is never empty. The continuous feasibility witness is the tensor product coupling.

The next result echoes Proposition: Discrete Product Optimality Is Degenerate in the continuous setting. In both cases, the independent coupling is optimal precisely when the objective is flat over the whole admissible set; for continuous costs, this flatness is equivalently the additive form $c(x,y)=u(x)+v(y)$ on the product support.

The tensor product is therefore a trivial feasible coupling, not a typical optimizer. The continuity assumption matters: changing a cost on an $\al\otimes\be$-negligible set can change the cost of singular couplings while leaving the product cost unchanged.

If there exists a map $T:\Xx\to\Yy$ with $T_\sharp\al=\be$, then the Monge map induces the graph coupling $\pi=(\Id,T)_\sharp\al\in\Couplings(\al,\be)$, characterized by

\int h(x,y)\d\pi(x,y) = \int h(x,T(x))\d\al(x).

(39)

Graph couplings are precisely the Kantorovich representation of deterministic Monge maps.

A last important class consists of semi-discrete problems, where $\al$ has a density and $\be$ is discrete. Every coupling is supported on the union of the slices $\Xx\times\{y_j\}$. When an optimal coupling is induced by a map, these slices are selected by a partition of $\Xx$ into transport cells, as developed in Chapter Paragraph.

Continuous Kantorovich Problem

For a nonnegative Borel cost $c:\Xx\times\Yy\to[0,+\infty]$, the discrete Kantorovich problem becomes, for arbitrary measures,

\mathcal{L}_c(\al,\be) \eqdef \inf_{\pi\in\Couplings(\al,\be)} \int_{\Xx\times\Yy} c(x,y)\d\pi(x,y).

(40)

This is an infinite-dimensional linear program over a space of measures.

The linear formulation gives the Kantorovich value opposite curvature properties in its two kinds of arguments: it is convex in the marginals but concave in the ground cost.

Proposition: Convexity in the Marginals and Concavity in the Cost

Fix a nonnegative Borel cost $c$. The extended-valued map

(\alpha,\beta)\longmapsto\mathcal L_c(\alpha,\beta)

(41)

is jointly convex on pairs of probability measures. In contrast, for fixed marginals $(\alpha,\beta)$, the map $c\mapsto\mathcal L_c(\alpha,\beta)$ is concave on the cone of nonnegative Borel costs: for any such $c_0,c_1$ and $t\in[0,1]$,

\mathcal L_{(1-t)c_0+tc_1}(\alpha,\beta) \geq (1-t)\mathcal L_{c_0}(\alpha,\beta) +t\mathcal L_{c_1}(\alpha,\beta).

(42)

The inequality is understood in $[0,+\infty]$; in particular, the map is concave on every convex family of costs on which it is finite.

For the Wasserstein cost $c(x,y)=d(x,y)^p$ on a Polish metric space, the natural finite-valued domain is

\mathcal P_p(\Xx) \eqdef \left\{ \al\in\Mm_+^1(\Xx): \int d(x,x_0)^p\d\al(x)<+\infty \right\},

(46)

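for one, and hence every, reference point $x_0$. If $\al,\be\in\mathcal P_p(\Xx)$ with $p\geq1$, the product coupling has finite $p$-cost, because the triangle inequality and the convexity of $r\mapsto r^p$ give $d(x,y)^p\leq 2^{p-1}\big(d(x,x_0)^p+d(x_0,y)^p\big)$, and the proposition supplies an optimal coupling.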

Monge--Kantorovich Equivalence

The proof of Brenier’s theorem relies on Kantorovich relaxation and duality. Under Brenier’s hypotheses, the relaxation is tight: it has the same cost as the Monge problem and the optimal coupling is induced by a map.

If $\al$ does not have a density, non-smooth points of $\phi$ can be charged by $\al$ and mass splitting can occur. For instance, moving $\delta_0$ to $(\delta_{-1}+\delta_1)/2$ can be represented by a plan concentrated on the set-valued subdifferential of $\phi(x)=|x|$ , but not by a deterministic map.

Remark: Book-shifting as a flat Kantorovich face

The Monge book-shifting example in Example: Book-shifting in the Monge problem has a transparent coupling interpretation. Let $\al$ be uniform on $[0,2]$ and $\be$ uniform on $[1,3]$. For every $\pi\in\Couplings(\al,\be)$,

\int |y-x|\d\pi(x,y) \geq \int (y-x)\d\pi(x,y) = \int y\d\be(y)-\int x\d\al(x) =1.

(48)

Equality holds exactly for couplings concentrated on the half-plane $\{(x,y):y\geq x\}$, where $|y-x|=y-x$. Hence the optimal set is a whole face of the coupling polytope, not a single graph. The translation and book-shifting maps give two graph couplings inside this face, but many non-deterministic couplings are optimal as well.
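The flat face can be observed numerically. The sketch below estimates the cost $\int|T(x)-x|\d\al(x)$ of the two maps by a midpoint rule on $[0,2]$; it assumes that the translation is $T(x)=x+1$ and that book-shifting moves $[0,1)$ by two units while leaving $[1,2]$ in place, which is the usual form of this example. The grid size and function names are arbitrary choices.

```python
def cost_of_map(T, n_cells=2000):
    """Midpoint-rule estimate of the integral of |T(x) - x| against the uniform law on [0, 2]."""
    h = 2.0 / n_cells
    midpoints = [(k + 0.5) * h for k in range(n_cells)]
    return sum(abs(T(x) - x) * (h / 2.0) for x in midpoints)   # density of alpha is 1/2

def translation(x):
    return x + 1.0                          # shift every point by one unit

def book_shift(x):
    return x + 2.0 if x < 1.0 else x        # move [0, 1) to [2, 3), keep [1, 2] in place

cost_translation = cost_of_map(translation)
cost_book_shift = cost_of_map(book_shift)
# Any mixture of the two graph couplings is again a coupling, with the averaged cost.
cost_mixture = 0.5 * cost_translation + 0.5 * cost_book_shift
```

All three costs equal the lower bound 1 of (48) up to rounding, and the mixture is an optimal coupling that is not induced by a map.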

Kantorovich solution in 1D

The atomless assumption in the Monge statement of Section One-Dimensional Transport And Quantiles is a limitation of maps, not of one-dimensional optimality. Once couplings are allowed, atoms can be split by assigning subintervals of quantile levels to different target points. The common quantile parameter therefore defines an optimal relaxed coupling for arbitrary probability measures.

Theorem: One-dimensional Kantorovich solution

Let $\al,\be\in\Mm_+^1(\RR)$ and let $h:\RR\to[0,+\infty)$ be convex. Consider the cost $c(x,y)=h(x-y)$. Write $q_\al=\cumul{\al}^{-1}$ and $q_\be=\cumul{\be}^{-1}$, and assume that $h(q_\al-q_\be)\in L^1(0,1)$. Then the quantile coupling

\pi^\star=(q_\al,q_\be)_\sharp\mathrm{Leb}_{[0,1]} = (\cumul{\al}^{-1},\cumul{\be}^{-1})_\sharp\mathrm{Leb}_{[0,1]}

(49)

minimizes $\int h(x-y)\d\pi(x,y)$ over $\pi\in\Couplings(\al,\be)$. In particular, $h(t)=|t|^p$ gives the usual one-dimensional optimal coupling for every $p\geq1$.

The push-forward statement $\pi^\star\in\Couplings(\al,\be)$ follows from the quantile push-forward proposition. It remains to prove optimality.

The key point is the one-dimensional uncrossing inequality. If $x<x'$ and $y>y'$, set $a=x-y$, $\delta=x'-x>0$ and $\eta=y-y'>0$. Convexity of $h$ implies that increments are monotone, hence

h(a+\eta)-h(a)\leq h(a+\delta+\eta)-h(a+\delta),

(50)

which is exactly

h(x-y)+h(x'-y')\geq h(x-y')+h(x'-y).

(51)

Thus removing a crossing never increases the cost. For a finite transport matrix on two ordered grids, if $i<i'$ and $j>j'$ carry crossed masses $\P_{ij}$ and $\P_{i'j'}$, move $\theta=\min(\P_{ij},\P_{i'j'})$ units from the crossed entries $(i,j)$, $(i',j')$ to the uncrossed entries $(i,j')$, $(i',j)$. The marginals are unchanged, and the cost does not increase. Repeating this elementary step yields an ordered plan; on an ordered uniform quantile grid, this is the diagonal plan.
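The elementary uncrossing step can be run on a small example. The sketch below starts from the product plan of two uniform three-point laws on the ordered grid $\{0,1,2\}$ with $h(t)=t^2$ and removes crossings one at a time while recording the cost. Entries are exact rationals with a common denominator, so every step moves at least one ninth of the mass and the loop stops after finitely many steps. Names and values are illustrative.

```python
from fractions import Fraction

def transport_cost(P, x, y, h):
    """Cost of the plan P for the ground cost h(x_i - y_j)."""
    return sum(P[i][j] * h(x[i] - y[j]) for i in range(len(x)) for j in range(len(y)))

def uncross_once(P):
    """Remove one crossing in place; return False when the plan is already ordered."""
    n, m = len(P), len(P[0])
    for i in range(n):
        for j in range(1, m):
            if P[i][j] == 0:
                continue
            for i2 in range(i + 1, n):           # i < i2
                for j2 in range(j):              # j2 < j: entries (i, j), (i2, j2) cross
                    if P[i2][j2] > 0:
                        theta = min(P[i][j], P[i2][j2])
                        P[i][j] -= theta
                        P[i2][j2] -= theta
                        P[i][j2] += theta        # uncrossed entries receive the mass
                        P[i2][j] += theta
                        return True
    return False

x = y = [0, 1, 2]                                # ordered grids
h = lambda t: t * t                              # convex cost profile
P = [[Fraction(1, 9)] * 3 for _ in range(3)]     # product plan of two uniform laws
costs = [transport_cost(P, x, y, h)]
while uncross_once(P):
    costs.append(transport_cost(P, x, y, h))
```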

For general measures, lift any coupling to quantile coordinates. Let $\pi\in\Couplings(\al,\be)$. Using regular conditional laws of a uniform quantile variable given its image under $q_\al$ and $q_\be$, construct a coupling $\gamma$ of two uniform variables such that $\pi=(q_\al,q_\be)_\sharp\gamma$.

To justify the approximation, let $\kappa_M(r)=\max(-M,\min(r,M))$ and set $q_{\al,M}=\kappa_M\circ q_\al$ and $q_{\be,M}=\kappa_M\circ q_\be$. Approximate these bounded nondecreasing functions almost everywhere by nondecreasing step functions, constant on the uniform intervals $I_k=((k-1)/N,k/N]$. The matrix $G^N_{k\ell}=\gamma(I_k\times I_\ell)$ couples two uniform histograms. Proposition: One-Dimensional Weighted Sweep applied to the ordered step values therefore yields the desired comparison for the step functions. For fixed $M$, continuity of $h$ on $[-2M,2M]$ allows passage to the limit as $N\to\infty$.

Finally, $\kappa_M$ is nondecreasing and 1-Lipschitz, so for every $x,y$ there is $t_M(x,y)\in[0,1]$ such that $\kappa_M(x)-\kappa_M(y)=t_M(x,y)(x-y)$. Convexity and nonnegativity give

h(t_M(x,y)(x-y)) \leq (1-t_M(x,y))h(0)+t_M(x,y)h(x-y) \leq h(0)+h(x-y).

(52)

The assumed integrability controls the diagonal term; for a competitor of finite cost the same bound controls the other term, while an infinite-cost competitor is irrelevant. Dominated convergence as $M\to\infty$ gives

\int_0^1 h(q_\al(r)-q_\be(r))\d r \leq \int h(x-y)\d\pi(x,y)

(53)

for every $\pi\in\Couplings(\al,\be)$ .

This result is strictly more flexible than the Monge formula. If $\al$ has an atom, a map can only send that whole atom to one target point, whereas the quantile interval associated with the atom can be coupled with a nontrivial portion of $\be$ . The one-dimensional Kantorovich solution therefore handles mass splitting without changing the monotone geometry.
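For discrete measures on the line, the quantile coupling can be computed by a north-west corner sweep over the sorted atoms. The sketch below assumes positive weights summing to one on both sides; the function name `monotone_coupling` and the tolerance `tol` are illustrative choices.

```python
def monotone_coupling(x, a, y, b, tol=1e-12):
    """Quantile (north-west corner) coupling of two discrete measures on R.

    x, y: atom positions; a, b: positive weights with equal total mass.
    Returns a list of (x_i, y_j, mass) triples in increasing order.
    """
    src = sorted(zip(x, a))
    tgt = sorted(zip(y, b))
    i = j = 0
    ra, rb = src[0][1], tgt[0][1]  # mass still available at the current atoms
    plan = []
    while i < len(src) and j < len(tgt):
        m = min(ra, rb)
        if m > tol:
            plan.append((src[i][0], tgt[j][0], m))
        ra -= m
        rb -= m
        if ra <= tol:  # source atom exhausted: move to the next one
            i += 1
            ra = src[i][1] if i < len(src) else 0.0
        if rb <= tol:  # target atom exhausted: move to the next one
            j += 1
            rb = tgt[j][1] if j < len(tgt) else 0.0
    return plan


def coupling_cost(plan, p):
    """Cost sum of mass * |x - y|^p over the active pairs."""
    return sum(m * abs(u - v) ** p for u, v, m in plan)


# A single source atom must be split between two target atoms.
plan = monotone_coupling([0.0], [1.0], [-1.0, 1.0], [0.5, 0.5])
print(plan, coupling_cost(plan, 2))
```

No transport map exists for this pair of measures, yet the monotone plan is well defined: the atom at 0 sends half of its mass to each target.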

Cyclical Monotonicity¶

Cyclical monotonicity is the local geometric fingerprint of optimality for a cost $c$ . It converts a global minimization problem into finite exchange inequalities and is the bridge from Kantorovich plans to convex potentials.

Support and $c$ -Cyclical Monotonicity¶

The support of a coupling is the topological support introduced in Definition (Support Of A Measure), now applied to a Radon measure on $\Xx\times\Yy$. Thus $(x,y)\in\supp(\pi)$ exactly when every open neighborhood of $(x,y)$ has positive $\pi$-mass.

It is enough to check cyclic permutations:

\sum_{i=1}^k c(x_i,y_i) \leq \sum_{i=1}^k c(x_i,y_{i+1}), \qquad y_{k+1}=y_1.

(55)
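On a small finite set of pairs, the cyclic inequalities can be checked by brute force. The following sketch enumerates every ordered cycle, so it is only meant for a handful of points; the function name `is_cyclically_monotone` and the tolerance are illustrative assumptions.

```python
from itertools import permutations


def is_cyclically_monotone(pairs, c, tol=1e-12):
    """Check sum c(x_i, y_i) <= sum c(x_i, y_{i+1}) over all cycles.

    pairs: list of (x, y) tuples; c: cost function c(x, y).
    The cost of the enumeration grows factorially with len(pairs).
    """
    for k in range(2, len(pairs) + 1):
        for cycle in permutations(pairs, k):
            direct = sum(c(x, y) for x, y in cycle)
            # Shift the targets cyclically: x_i is paired with y_{i+1}.
            shifted = sum(c(cycle[i][0], cycle[(i + 1) % k][1])
                          for i in range(k))
            if direct > shifted + tol:
                return False
    return True


def quadratic(x, y):
    return (x - y) ** 2


ordered = [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0)]   # monotone pairing
crossed = [(0.0, 1.0), (1.0, 0.0)]               # crossing pairing
print(is_cyclically_monotone(ordered, quadratic),
      is_cyclically_monotone(crossed, quadratic))
```

For the crossed pairing, exchanging the two targets lowers the quadratic cost from 2 to 0, which is exactly a violated two-point inequality.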

Optimal Matching to Optimal Transport¶

For uniform marginals on the same number of atoms, Corollary (Kantorovich For Matching) gives an optimal permutation plan. Its support must be $c$-cyclically monotone: otherwise exchanging finitely many targets along a violating cycle would lower the matching cost. The next theorem says that the same finite-exchange certificate holds for arbitrary optimal plans.

Monotonicity¶

If the optimal plan is induced by a map $T$ , there is a set $G$ of full $\al$ -measure such that $(x,T(x))\in\supp(\pi)$ for every $x\in G$ . For $x_1,\ldots,x_k\in G$ , cyclical monotonicity reads

\sum_{i=1}^k c(x_i,T(x_i)) \leq \sum_{i=1}^k c(x_i,T(x_{i+1})), \qquad x_{k+1}=x_1.

(56)

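For $c(x,y)=\frac12\|x-y\|^2$, the two-point case reads $\frac12\|x-T(x)\|^2+\frac12\|y-T(y)\|^2\leq\frac12\|x-T(y)\|^2+\frac12\|y-T(x)\|^2$ for $x,y\in G$. Expanding the squares, the squared norms cancel and only the cross terms remain, which gives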

\langle T(x)-T(y),x-y\rangle\geq0,

(57)

so the optimal representative of $T$ is monotone on $G$ .

One Dimension¶

In one dimension, for $c(x,y)=|x-y|^p$, the two-point inequality has a strict uncrossing consequence when $p>1$: if $x<y$, every optimal map satisfies $T(x)\leq T(y)$ on its full-measure transport set. For $p=1$, uncrossing is not strict. The monotone rearrangement remains optimal, but nonmonotone maps and nondeterministic plans can also be optimal, as in Remark (Book-shifting as a flat Kantorovich face).

Metric Properties: Wasserstein Distances¶

OT costs become genuine distances when the ground cost comes from a metric. The proof relies on a gluing lemma.

OT Defines a Distance¶

The discrete gluing lemma is the finite-dimensional mechanism behind the triangle inequality.

The figure below displays this construction in matrix form.

Discrete gluing lemma in matrix form. The first two panels are optimal one-dimensional couplings through an intermediate marginal. The third panel shows the glued coupling $R=P\diag(1/b)Q$; it is feasible for the source and target marginals and is the coupling used in the triangle-inequality proof.

The interactive version changes the resolution of the intermediate marginal, which controls how mediated the glued source-target plan becomes.

Interactive panel. Use the mediation slider to inspect how two couplings through an intermediate marginal glue into a source-target plan.
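The matrix formula $R=P\diag(1/b)Q$ from the caption can be checked directly. This is a minimal sketch with plans stored as lists of rows; the function names `glue` and `marginals` are illustrative, and entries of the intermediate marginal with zero mass are skipped, which is an implementation assumption.

```python
def glue(P, Q, b):
    """Glued plan R[i][k] = sum_j P[i][j] * Q[j][k] / b[j].

    P couples (a, b), Q couples (b, c); b is the shared marginal.
    Intermediate atoms with b[j] == 0 carry no mass and are skipped.
    """
    n, m, l = len(P), len(b), len(Q[0])
    return [[sum(P[i][j] * Q[j][k] / b[j] for j in range(m) if b[j] > 0)
             for k in range(l)] for i in range(n)]


def marginals(R):
    """Row sums and column sums of a plan."""
    rows = [sum(row) for row in R]
    cols = [sum(row[k] for row in R) for k in range(len(R[0]))]
    return rows, cols


a, b, c = [0.5, 0.5], [0.25, 0.75], [0.5, 0.5]
P = [[0.25, 0.25], [0.0, 0.5]]   # couples a and b
Q = [[0.25, 0.0], [0.25, 0.5]]   # couples b and c
R = glue(P, Q, b)
print(R, marginals(R))
```

The row sums of `R` reproduce `a` and its column sums reproduce `c`, so the glued plan is a feasible coupling of the two outer marginals.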

Continuous Gluing¶

The same construction extends to probability measures by disintegrating both couplings with respect to their common marginal.

Interpolation Induced By A Plan¶

The quadratic Wasserstein distance does not only compare two endpoint measures. An optimal plan also says how to move mass between them: each active pair $(x,y)$ travels along the segment joining $x$ to $y$ . This turns an optimal coupling into a curve of measures.

In the discrete case, each mass $P_{ij}$ moves from $x_i$ to $y_j$ along its own segment. When the optimal plan is not induced by a map, one source atom can split into several moving atoms. If the optimal plan is not unique, different optimal plans may also induce different $\Wass_2$ geodesics.
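The discrete interpolation can be written in a few lines. The sketch below assumes points stored as tuples of coordinates and a plan stored as a list of rows; the name `interpolate` and the threshold `tol` are illustrative.

```python
def interpolate(P, x, y, t, tol=1e-12):
    """Atoms of the interpolated measure at time t.

    Each active pair (i, j) contributes mass P[i][j] located at
    (1 - t) * x[i] + t * y[j]. Returns a list of (position, mass).
    """
    atoms = []
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            if P[i][j] > tol:
                pos = tuple((1 - t) * u + t * v for u, v in zip(xi, yj))
                atoms.append((pos, P[i][j]))
    return atoms


# One source atom coupled with two target atoms: the only coupling
# (hence the optimal one) splits the source mass in two.
x = [(0.0, 0.0)]
y = [(1.0, 1.0), (1.0, -1.0)]
P = [[0.5, 0.5]]
for t in (0.0, 0.5, 1.0):
    print(t, interpolate(P, x, y, t))
```

At $t=0$ the two moving atoms coincide at the source point, and for $t>0$ they separate along their own segments, which is the splitting behavior described above.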

The figure below visualizes the construction when the optimal plan splits mass, so that the intermediate measure is obtained by moving every coupled pair along its Euclidean segment.

McCann interpolation induced by a non-deterministic optimal transport plan. In every panel, the red and blue endpoint measures are shown with low opacity, thin gray segments display the support $P_{ij}>\mathrm{tol}$ of the coupling, and the moving atoms are colored from red to blue along the interpolation.

The companion panel lets the same coupling be inspected along time $t$ , with an entropy slider to contrast sparse and diffuse plans.

Interactive panel. Use the interpolation time and plan controls to see how a fixed coupling induces a cloud of displacement paths between endpoint measures.

Remark: Interpolation on a general geodesic space

For Dirac masses in Euclidean space, the $\Wass_2$ geodesic from $\delta_x$ to $\delta_y$ is $t\mapsto\delta_{(1-t)x+t y}$ . The same idea extends to any geodesic metric space $(\X,d)$ , meaning that each pair of points can be joined by a constant-speed metric geodesic. For each pair $(x,y)$ , one replaces the Euclidean segment by a curve $\gamma^{x,y}:[0,1]\to\X$ such that $\gamma^{x,y}_0=x$ , $\gamma^{x,y}_1=y$ , and

d(\gamma^{x,y}_s,\gamma^{x,y}_t)=|t-s|d(x,y).

(71)

If this geodesic is unique and depends measurably on $(x,y)$, one defines $e_t(x,y)=\gamma^{x,y}_t$ and sets $\al_t=(e_t)_\sharp\pi^\star$ for an optimal coupling $\pi^\star$. When geodesics are not unique, there is no canonical interpolation of a pair of Diracs unless a choice is made: one may select a particular geodesic between $x$ and $y$, or randomize among several such geodesics. The intrinsic formulation is to choose a probability measure $\eta$ on the path space of constant-speed geodesics, called a dynamical optimal plan, such that $(e_0,e_1)_\sharp\eta$ is an optimal coupling, and to set $\al_t=(e_t)_\sharp\eta$. Different measurable choices, or different conditional distributions over geodesics with the same endpoints, can give different $\Wass_2$ geodesics; the constant-speed identity remains the same. This path-space viewpoint is standard in the general theory of Wasserstein spaces Ambrosio et al., 2006; Villani, 2009; Santambrogio, 2015.

Comparison With Monge¶

The distance $\Wass_p$ defined through the Kantorovich problem (64) should be contrasted with the directed distance $\tilde\Wass_p$ obtained using Monge’s problem. The Kantorovich feasible set is never empty, since it contains the product coupling, although the $p$-cost may still be infinite without moment assumptions on non-compact spaces. By contrast, Monge’s constraint set $\{T:T_\sharp\al=\be\}$ can be empty. Every admissible map $T$ induces the graph coupling $(\Id,T)_\sharp\al$ with the same cost, so $\Wass_p\leq\tilde\Wass_p$; the two values coincide whenever some graph coupling is optimal among all couplings.

The next proposition makes precise one important sense in which Kantorovich is the relaxation of Monge. The cleanest statement is first made in the lifted plan variable $\pi$ : deterministic graph couplings are dense among all couplings when the source can be split at arbitrarily fine scales. Thus the Kantorovich functional is the weak lower-semicontinuous envelope of the Monge graph functional.

Proposition: Kantorovich As The Plan-Space Relaxation Of Monge

Let $(\Xx,d)$ be a compact metric space, let $p\geq1$ , and let $\al,\be\in\Pp(\Xx)$ with $\al$ atomless. Define

\mathcal G(\al,\be) \eqdef \{(\Id,T)_\sharp\al:T_\sharp\al=\be\} \subset \Couplings(\al,\be),

(72)

and set $F_p(\pi)\eqdef\int_{\Xx\times\Xx}d(x,y)^p\d\pi(x,y)$ . For every $\pi\in\Couplings(\al,\be)$ , there are measurable maps $T_k$ such that $(T_k)_\sharp\al=\be$ , $(\Id,T_k)_\sharp\al\rightharpoonup\pi$ , and $F_p((\Id,T_k)_\sharp\al)\to F_p(\pi)$ .

Consequently $F_p$ is the weak lower-semicontinuous envelope on $\Couplings(\al,\be)$ of the functional that equals $F_p$ on graph couplings and $+\infty$ outside them. In particular,

\tilde\Wass_p(\al,\be)=\Wass_p(\al,\be).

(73)

This is an equality of infimal values: the Kantorovich minimum is attained, whereas the infimum defining $\tilde\Wass_p$ need not be attained by a transport map.

Since $F_p$ is affine in the plan variable and $\Couplings(\al,\be)$ is convex, this envelope is also the closed convex relaxation of the Monge graph problem in the space of transport plans.

At the level of endpoint measures, this gives a literal lower-semicontinuous-envelope interpretation for the Monge $p$ -cost whenever source measures can be regularized into atomless ones.

The extra density assumption in the corollary is essential. If $\al$ has atoms, the graph-density statement can fail dramatically: a single source Dirac mass cannot be mapped to two target Dirac masses. On finite spaces, the topology is discrete and this obstruction cannot be removed by closure. In such cases the Kantorovich formulation is not merely a closure of existing maps with the same marginals; it genuinely adds the possibility of splitting atomic mass.

Applications of Wasserstein distance¶

Representing structured data as probability measures turns the Wasserstein distance into a geometry-aware comparison tool. The following are two representative application domains: single-cell biology and natural-language processing.

Example: Application to gene-expression distances between two cells

A single cell can be encoded as a measure over genes,

\al_{\mathrm{cell}}=\sum_g e_g\de_{\varphi(g)},

(75)

where $e_g\geq0$, $\sum_g e_g=1$, and $\varphi(g)$ is a gene embedding or annotation vector. Wasserstein distances then compare cells by moving expression mass between genes. The choice of $\dist(\varphi(g),\varphi(g'))$ is biologically meaningful: it can come from annotations, pathways or a learned ground metric. This is the idea behind the Gene Mover’s Distance and related metric-learning approaches for single-cell data Bellazzi et al., 2021; Huizing et al., 2022; it is the single-cell analogue of the cost-learning question revisited in Section Metric Learning and Inverse OT.

Example: Application to word embeddings and documents

A document can similarly be viewed as a probability measure on a word-embedding space,

\al_{\mathrm{doc}}=\sum_{w\in\mathrm{doc}} a_w\de_{e_w}.

(76)

Here $a_w$ are normalized word frequencies and $e_w$ are word embeddings. The Word Mover’s Distance is the Wasserstein distance between such document measures Kusner et al., 2015. It compares bags of words through the geometry learned by the embedding, so that replacing a word by a nearby synonym is less costly than replacing it by an unrelated word. When two embedding spaces are not already aligned, Gromov--Wasserstein variants compare their intrinsic neighborhood geometry instead of relying on a common coordinate system Alvarez-Melis et al., 2019; this is the intrinsic-space viewpoint developed in Section Gromov--Wasserstein.

Metric Properties: Topology And Applications¶

Wasserstein distances metrize weak convergence under moment control, sit between weak and strong topologies, and provide quantitative estimates in probability and robust optimization.

On compact spaces this is also the weak-* topology inherited from the duality between continuous functions and finite measures. On noncompact spaces, “narrow convergence” avoids conflating this probability topology with other weak-* topologies.

Remark: Weak convergence for discrete measures

In the special case of a single Dirac, $\de_{x^{(n)}} \rightharpoonup \de_x$ is equivalent to $\int f \d\de_{x^{(n)}} = f(x^{(n)}) \rightarrow \int f \d\de_{x} = f(x)$ for any continuous $f$ . This in turn is equivalent to $x^{(n)} \rightarrow x$ . For a fixed number of atoms, if $\al_n=\sum_{i=1}^N a_i^{(n)}\de_{x_i^{(n)}}$ and, after extracting a subsequence and relabeling, $a_i^{(n)}\to a_i$ and $x_i^{(n)}\to x_i$ , then $\al_n$ converges weakly to $\sum_i a_i\de_{x_i}$ , with atoms at identical limits merged. Without a uniform bound on the number of atoms, weak limits of discrete measures can be non-discrete; empirical measures are the standard example.

Strong Versus Weak Topology¶

The total variation norm induces the strong topology on measures. For a signed difference, $\norm{\al-\be}_{\TV}=|\al-\be|(\Xx)$ ; this is an $L^1$ norm for densities and an $\ell^1$ norm for discrete weights. The following proposition shows that total variation is itself a transport cost for the degenerate $0/1$ ground metric.

Probabilistic Interpretation¶

Probability theory gives weak convergence its most direct interpretation: it compares distributions rather than samplewise realizations. In terms of random vectors, if $X_n\sim\al_n$ and $X\sim\al$ (not necessarily defined on the same probability space), then $\al_n\rightharpoonup\al$ means precisely that $X_n$ converges in law to $X$ .

Convergence in law should be distinguished from stronger notions that compare random variables on a common probability space. In that setting, $X_n\to X$ almost surely means pointwise convergence outside a null set, while convergence in probability means

\foralls \epsilon>0,\qquad \PP(\norm{X_n-X}>\epsilon)\to0.

(84)

Almost-sure convergence implies convergence in probability, which in turn implies convergence in law. The last notion depends only on the marginal laws, since it is exactly the weak convergence $(X_n)_\sharp\PP\rightharpoonup X_\sharp\PP$ , and therefore does not require a common probability space.

Convergence in law should also be distinguished from strong convergence of measures. Total variation convergence controls the mass assigned to every measurable set, not only averages against continuous test functions, and therefore implies weak convergence. The converse fails, notably for empirical approximations of continuous laws.

Central Limit Theorem and OT¶

Limit theorems produce canonical convergent sequences of probability measures, and Wasserstein distances turn their qualitative conclusions into metric error bounds. The central limit theorem concerns sums of independent random variables; because sums of independent variables correspond to convolutions of their laws, the following statement introduces convolution before expressing the theorem in transport language.

Remark: Central limit theorem

The central limit theorem states that if $(X_i)_{i\geq1}$ are i.i.d. random vectors with finite second moments, $\EE(X_i)=0$ , and $\EE(X_i X_i^\top)=\Id$ , then the normalized sum

Z_n \eqdef \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i

(85)

converges in law toward the standard Gaussian $\Gaussian(0,\Id)$. In the terminology recalled above, this means that the measure $\al_n$ representing the law of $Z_n$ converges weakly toward the centered Gaussian measure $\Gaussian(0,\Id)$.

Equivalently, this is a statement about rescaled convolutions of measures. If $\al$ and $\be$ are probability measures on $\RR^d$ , their convolution is

\al*\be \eqdef \operatorname{add}_\sharp(\al\otimes\be), \qquad \int \varphi\,\d(\al*\be) = \iint \varphi(x+y)\,\d\al(x)\d\be(y)

(86)

for every bounded continuous $\varphi$ , where $\operatorname{add}(x,y)=x+y$ . Thus $\al*\be$ is the law of $X+Y$ when $X$ and $Y$ are independent with laws $\al$ and $\be$ . When $\al$ and $\be$ have densities $f$ and $g$ , the convolution has density

(f*g)(z)=\int_{\RR^d} f(x)g(z-x)\d x.

(87)

If $\al$ is the common law of the variables $X_i$ , writing $\al^{*n}$ for the $n$ -fold convolution of $\al$ with itself, and denoting by $D_a(x)=a x$ the dilation map, the law of $Z_n$ is

\al_n=(D_{1/\sqrt n})_\sharp\al^{*n}.

(88)

The CLT therefore says that the normalized $n$ -fold convolution $(D_{1/\sqrt n})_\sharp\al^{*n}$ converges weakly to the Gaussian $\Gaussian(0,\Id)$ .
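The formula $\al_n=(D_{1/\sqrt n})_\sharp\al^{*n}$ can be evaluated exactly for a discrete law. The sketch below represents a discrete measure as a dictionary from atoms to masses and assumes integer atoms so that sums of atoms merge exactly; the names `convolve`, `law_of_Zn`, `cdf` and `gaussian_cdf` are illustrative.

```python
from collections import defaultdict
from math import erf, sqrt


def convolve(mu, nu):
    """Law of X + Y for independent X ~ mu and Y ~ nu (dict atom -> mass)."""
    out = defaultdict(float)
    for x, p in mu.items():
        for y, q in nu.items():
            out[x + y] += p * q
    return dict(out)


def law_of_Zn(alpha, n):
    """n-fold convolution of alpha, pushed forward by the dilation 1/sqrt(n)."""
    conv = dict(alpha)
    for _ in range(n - 1):
        conv = convolve(conv, alpha)
    s = 1.0 / sqrt(n)
    return {s * x: p for x, p in conv.items()}


def cdf(mu, t):
    """Cumulative distribution function of a discrete law at t."""
    return sum(p for x, p in mu.items() if x <= t)


def gaussian_cdf(t):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


alpha = {-1: 0.5, 1: 0.5}          # centered law with unit variance
for n in (4, 25, 100):
    Zn = law_of_Zn(alpha, n)
    print(n, len(Zn), abs(cdf(Zn, 0.5) - gaussian_cdf(0.5)))
```

Every `Zn` remains a discrete law with $n+1$ atoms, while its distribution function approaches the Gaussian one at continuity points, which is what weak convergence requires.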

The figure below makes this qualitative weak convergence visible in the elementary Bernoulli case: every finite-$n$ law is discrete, yet the normalized atom heights approach the Gaussian density.

Central-limit theorem for normalized Bernoulli sums. Starting from $\alpha_0=\frac12(\delta_{-1}+\delta_1)$, the law of $Z_n=n^{-1/2}\sum_i X_i$ remains discrete, but its rescaled atom heights approach the standard Gaussian density shown in gray. Proposition (Berry--Esseen bound in $\Wass_1$) later quantifies this weak convergence.

The interactive version varies the number of summands and the Bernoulli skew, making weak convergence visible even while every displayed law remains discrete.

Interactive panel. Use the number-of-summands and Bernoulli-skew controls to watch the Wasserstein CLT scaling predicted by Lipschitz test functions.

Wasserstein Metrizes Weak Convergence¶

Dirac masses, whose weak convergence is described in Remark (Weak convergence for discrete measures), separate the two topologies: whenever $x_n\neq x$,

\|\delta_{x_n}-\delta_x\|_{\TV}=2, \qquad \Wass_p(\delta_{x_n},\delta_x)=d(x_n,x).

(89)

Thus the strong topology never sees Diracs converge unless they are eventually equal, while the Wasserstein topology captures their spatial convergence. On an unbounded space, weak convergence alone does not prevent a vanishing amount of mass from escaping to infinity; Wasserstein convergence also controls the corresponding moment tail.

Proposition: Wasserstein Convergence On Polish Spaces

Let $(\Xx,d)$ be Polish, let $1\leq p<+\infty$ , and let $\al_k,\al\in\Pp_p(\Xx)$ . For any $x_0\in\Xx$ , the following are equivalent:

(i) $\Wass_p(\al_k,\al)\to0$;
(ii) $\al_k\rightharpoonup\al$ and
$\int d(x,x_0)^p\d\al_k(x) \longrightarrow \int d(x,x_0)^p\d\al(x);$
(90)
(iii) $\al_k\rightharpoonup\al$ and
$\lim_{R\to+\infty}\sup_k \int_{\{d(x,x_0)>R\}}d(x,x_0)^p\d\al_k(x)=0.$
(91)

These conditions do not depend on the choice of $x_0$ .

On compact spaces the moment function is bounded and continuous, so the moment condition is automatic and $\Wass_p$ metrizes weak convergence.

On a finite metric space, weak and strong topologies coincide. If $d_{\min}=\min_{x\neq y}d(x,y)$ and $d_{\max}=\max_{x,y}d(x,y)$ , then

\frac{d_{\min}}{2}\|\al-\be\|_{\TV} \leq \Wass_1(\al,\be) \leq \frac{d_{\max}}{2}\|\al-\be\|_{\TV}.

(93)
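The two-sided bound can be checked numerically on a finite subset of the real line, where $\Wass_1$ is the integral of the absolute difference of the cumulative distribution functions. This is a sketch under that one-dimensional assumption; the names `tv_norm` and `w1_on_line` are illustrative.

```python
def tv_norm(a, b):
    """Total variation |alpha - beta|(X) for weights on a common finite set."""
    return sum(abs(u - v) for u, v in zip(a, b))


def w1_on_line(points, a, b):
    """W_1 between two laws supported on the same finite subset of R.

    Uses W_1 = integral of |F_alpha - F_beta|, a sum over consecutive gaps.
    """
    order = sorted(range(len(points)), key=lambda k: points[k])
    total, gap = 0.0, 0.0
    for k, k_next in zip(order, order[1:]):
        gap += a[k] - b[k]   # F_alpha - F_beta on [x_k, x_next)
        total += abs(gap) * (points[k_next] - points[k])
    return total


points = [0.0, 1.0, 3.0]
a = [0.5, 0.5, 0.0]
b = [0.0, 0.5, 0.5]
d_min = min(abs(p - q) for p in points for q in points if p != q)
d_max = max(abs(p - q) for p in points for q in points)
tv, w1 = tv_norm(a, b), w1_on_line(points, a, b)
print(0.5 * d_min * tv, w1, 0.5 * d_max * tv)
```

In this example the upper bound is attained, because all the displaced mass travels between the two farthest points.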

Measure-to-Measure Maps on Wasserstein Space¶

Many constructions in modern machine learning act directly on probability laws. This section isolates this viewpoint and records two useful principles: some transformations move particles without splitting them, while others are intrinsically diffusive.

Maps on Wasserstein space.¶

Once Wasserstein distances provide a topology on probability measures, it is natural to study transformations of probability measures as maps

\Phi:\Pp_p(\Xx)\longrightarrow \Pp_p(\Xx)

(94)

on Wasserstein space. Later chapters use such maps repeatedly: flow matching and diffusion models evolve laws during sampling, one-step transportation methods learn maps between latent and data distributions, and transformers update the empirical law of their tokens; see Chapter Paragraph and Section Evolution in Depth of Transformers. Two questions are especially useful. The structural question asks whether $\Phi$ preserves a discrete particle representation. The metric question asks whether $\Phi$ is stable, for instance Lipschitz, for $\Wass_p$ .

Particle-preserving transport representations.¶

The deterministic case is obtained by pushing each input measure through a map that may itself depend on that input measure:

\Phi(\al) = \Gamma[\al]_\sharp \al, \qquad \Gamma[\al] : \Xx \to \Xx .

(95)

Then, for every discrete measure,

\al=\sum_{i=1}^n a_i\de_{x_i} \quad\Longrightarrow\quad \Phi(\al)=\sum_{i=1}^n a_i\de_{\Gamma[\al](x_i)} .

(96)

Thus the weights and the number of particles are preserved, up to possible collisions between images. This is the natural structure behind deterministic particle methods: particles move, but they do not split. Lavenant and Savaré Lavenant & Savaré, 2026 study when transformations of measures admit transport representatives of the form (95), the obstructions to choosing representatives continuously, and the additional regularity available when $\Phi$ is Wasserstein-Lipschitz.
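A simple measure-dependent map illustrates the transport representation. The sketch below uses the centering map $\Gamma[\al](x)=x-\int z\,\d\al(z)$ on the real line, chosen here only as an illustration; the names `push_forward`, `centering` and `Phi` are hypothetical.

```python
def push_forward(atoms, gamma):
    """Push a discrete measure [(x_i, a_i)] through the map gamma."""
    return [(gamma(x), a) for x, a in atoms]


def centering(atoms):
    """Return the map Gamma[alpha](x) = x - mean(alpha)."""
    m = sum(a * x for x, a in atoms)
    return lambda x: x - m


def Phi(atoms):
    """Measure map Phi(alpha) = Gamma[alpha]_# alpha."""
    return push_forward(atoms, centering(atoms))


alpha = [(0.0, 0.25), (2.0, 0.75)]   # mean 1.5
print(Phi(alpha))
```

The output has the same weights and the same number of atoms as the input: the particles move, by an amount that depends on the whole law, but none of them splits.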

Mass-splitting Markov maps.¶

The opposite case is a stochastic transformation, where one input particle can generate a full output distribution. Let $K$ be a Markov kernel on $\Xx$ , so that $K(y,\cdot)$ is a probability measure for each $y$ . To obtain a map on $\Pp_p(\Xx)$ , assume a finite-moment bound

\int_{\Xx} d(x,x_0)^p\,K(y,\d x) \leq C\bigl(1+d(y,x_0)^p\bigr)

(97)

for some $x_0\in\Xx$ . The associated linear map

\int_{\Xx} f(x)\,\d\Psi(\al)(x) = \int_{\Xx}\int_{\Xx} f(x)\,K(y,\d x)\,\d\al(y)

(98)

is a measure-to-measure map from $\Pp_p(\Xx)$ to itself: integrating the moment bound against $\al$ proves that $\Psi(\al)$ has finite $p$ th moment. If $\Xx=\RR^d$ and $K(y,\d x)=\kappa(x-y)\d x$ for a probability density $\kappa$ with finite $p$ th moment, then $\Psi(\al)=\al*\kappa$ is convolution. Unless $K(y,\cdot)$ is a Dirac mass, a single atom is sent to a diffuse probability distribution. Heat flows, noising steps in diffusion models, and other smoothing mechanisms therefore belong to this mass-splitting class.

If in addition $\Xx$ is Polish and $\Wass_p(K(y,\cdot),K(y',\cdot))\leq Ld(y,y')$ , then $\Psi$ is $L$ -Lipschitz for $\Wass_p$ . Glue an input coupling of $(y,y')$ with measurable optimal couplings between $K(y,\cdot)$ and $K(y',\cdot)$ , then integrate the resulting kernel coupling. Its expected $p$ th-power cost is at most $L^p$ times the input coupling cost.
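A discrete kernel makes the mass-splitting behavior concrete. The sketch below uses a lazy random-walk kernel on the integers, an illustrative choice that is also a convolution kernel; the names `lazy_walk` and `apply_kernel` are hypothetical.

```python
from collections import defaultdict


def lazy_walk(y):
    """Markov kernel K(y, .): stay with mass 1/2, step left or right with 1/4."""
    return {y - 1: 0.25, y: 0.5, y + 1: 0.25}


def apply_kernel(alpha, kernel):
    """Psi(alpha)(x) = sum_y alpha(y) K(y, x) for a discrete law alpha."""
    out = defaultdict(float)
    for y, mass in alpha.items():
        for x, p in kernel(y).items():
            out[x] += mass * p
    return dict(out)


delta0 = {0: 1.0}
once = apply_kernel(delta0, lazy_walk)
twice = apply_kernel(once, lazy_walk)
print(once)
print(twice)
```

A single Dirac mass becomes three atoms after one step and five after two. Since $K(y,\cdot)$ is a translate of $K(y',\cdot)$ by $y-y'$, this kernel satisfies the Lipschitz assumption with $L=1$.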

Wasserstein stability.¶

Regularity of $\Phi$ is a stability requirement: small perturbations of the input law should not create large changes of the output law. For transport representations, the following elementary estimate separates the spatial Lipschitz constant of the map from its sensitivity to the input measure.

Remark: $\Wass_2$-Lipschitz functionals and bounded gradients

The same word “Lipschitz” is also used for scalar functionals $f:\Pp_2(\RR^d)\to\RR$. In a geodesic metric space, an $L$-Lipschitz functional has descending metric slope at most $L$. Conversely, if the slope is a strong upper gradient and is uniformly bounded by $L$, then $f$ is $L$-Lipschitz along curves, hence for $\Wass_2$ on $\Pp_2(\RR^d)$. In the smooth Otto calculus this slope is the $L^2(\al)$-norm of the Wasserstein gradient introduced in Proposition (Formal Wasserstein Gradient):

|\partial f|_{\Wass_2}(\al) = \norm{\Wgrad f(\al)}_{L^2(\al)} .

(99)

Thus, under the usual chain-rule assumptions, $\Wass_2$ -Lipschitz regularity of $f$ is the metric analogue of imposing a uniform gradient bound,

\sup_{\al}\norm{\Wgrad f(\al)}_{L^2(\al)}\leq L .

(100)

This first-order boundedness should not be confused with $L$ -smoothness in optimization, which would instead control how the gradient itself varies.

Proposition: Wasserstein stability of transport representations

Let $(\Xx,d)$ be Polish, let $p\geq1$ , let $E\subset\Xx$ , and assume that, for all $\al,\be\in\Pp_p(\Xx)$ supported in $E$ , the maps $T[\al]:E\to\Xx$ satisfy

d(T[\al](x),T[\al](y)) \leq L_x d(x,y) \quad\text{for all }x,y\in E,

(101)

and

\sup_{y\in E} d(T[\al](y),T[\be](y)) \leq L_{\rm law} \Wass_p(\al,\be).

(102)

Then $\Phi(\al)=T[\al]_\sharp\al$ is $(L_x+L_{\rm law})$ -Lipschitz on probability measures supported in $E$ :

\Wass_p(\Phi(\al),\Phi(\be)) \leq (L_x+L_{\rm law})\Wass_p(\al,\be).

(103)

The fixed-map case is the basic corollary, obtained with $E=\Xx$ and $L_{\rm law}=0$ . If $T:\Xx\to\Xx$ is $L$ -Lipschitz, then

\Wass_p(T_\sharp\al,T_\sharp\be)\leq L\Wass_p(\al,\be).

(106)

This is the metric counterpart of the elementary push-forward operation introduced in Definition (Push-Forward).
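The fixed-map estimate can be observed on empirical measures on the real line, where $\Wass_p$ between two uniform clouds of equal size is obtained by sorting. The sketch below assumes this one-dimensional uniform setting and uses $T=\sin$, which is 1-Lipschitz; the helper name `wp_uniform` is illustrative.

```python
from math import sin


def wp_uniform(xs, ys, p):
    """W_p between two uniform empirical laws on R with the same size.

    In one dimension the sorted (monotone) matching is optimal for p >= 1.
    """
    xs, ys = sorted(xs), sorted(ys)
    return (sum(abs(u - v) ** p for u, v in zip(xs, ys)) / len(xs)) ** (1 / p)


xs = [0.0, 1.0, 2.5, 4.0]
ys = [0.5, 1.5, 3.0, 3.5]
p = 2
before = wp_uniform(xs, ys, p)
after = wp_uniform([sin(x) for x in xs], [sin(y) for y in ys], p)
print(after, before)
```

Since `sin` is not monotone, the images are re-sorted inside `wp_uniform`; the distance after the push-forward stays below the distance before, as the estimate with $L=1$ predicts.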

Mean-field attention.¶

Self-attention is a central example because the number of tokens can be large and variable. A token cloud is represented by the empirical law $\al=n^{-1}\sum_i\de_{x_i}$ , and a single-head mean-field attention layer is naturally a transport representation. With the notation of Section Evolution in Depth of Transformers, define

\Gamma_\theta[\al](x) = \frac{\int e^{\dotp{Qx}{Kz}}Vz\,\d\al(z)} {\int e^{\dotp{Qx}{Kz}}\,\d\al(z)} , \qquad \theta=(Q,K,V),

(107)

where, for simplicity, the value map $V$ takes values in the same Euclidean feature space.

The corresponding measure map is

\operatorname{Att}_\theta(\al)=(\Gamma_\theta[\al])_\sharp\al ,

(108)

and a residual transformer layer uses the closely related map $(\Id+\tau\Gamma_\theta[\al])_\sharp\al$ . Lipschitz estimates for $\operatorname{Att}_\theta$ are therefore stability estimates for attention in the many-token regime.
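For an empirical token law, the map $\Gamma_\theta[\al]$ is an ordinary softmax-weighted average. The following sketch uses plain lists for vectors and matrices and subtracts the maximal score before exponentiating, a numerical-stability choice that leaves the ratio unchanged; the names `attention_map` and `attention_layer` are illustrative.

```python
from math import exp


def matvec(M, v):
    """Matrix-vector product with M stored as a list of rows."""
    return [sum(m * u for m, u in zip(row, v)) for row in M]


def dot(u, v):
    return sum(s * t for s, t in zip(u, v))


def attention_map(tokens, weights, Q, K, V):
    """Return Gamma[alpha] for alpha = sum_i weights[i] * delta_{tokens[i]}."""
    keys = [matvec(K, z) for z in tokens]
    values = [matvec(V, z) for z in tokens]

    def gamma(x):
        q = matvec(Q, x)
        scores = [dot(q, k) for k in keys]
        top = max(scores)   # shift scores; numerator and denominator rescale equally
        w = [a * exp(s - top) for a, s in zip(weights, scores)]
        total = sum(w)
        return [sum(wi * v[c] for wi, v in zip(w, values)) / total
                for c in range(len(values[0]))]

    return gamma


def attention_layer(tokens, weights, Q, K, V):
    """Att(alpha) = Gamma[alpha]_# alpha: move every token, keep every weight."""
    gamma = attention_map(tokens, weights, Q, K, V)
    return [gamma(x) for x in tokens], list(weights)


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = [1 / 3, 1 / 3, 1 / 3]
new_tokens, new_weights = attention_layer(tokens, weights, identity, identity, identity)
print(new_tokens)
```

The layer returns as many tokens as it receives, with unchanged weights, which is the particle-preserving structure of a transport representation. With $Q=0$ all scores vanish and every token is sent to the $V$-image of the mean of the cloud.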

Proposition: Compact-support attention stability

Assume $\Xx=\RR^d$ . Fix $R>0$ and let $\mathcal P_R$ be the set of probability measures supported in the Euclidean ball $B(0,R)$ . For every $p\geq1$ , there is a constant $C_{\theta,p}(R)$ such that

\Wass_p(\operatorname{Att}_\theta(\al),\operatorname{Att}_\theta(\be)) \leq C_{\theta,p}(R)\Wass_p(\al,\be), \qquad \al,\be\in\mathcal P_R.

(109)

Writing $A_R=\norm{Q}_{\rm op}\norm{K}_{\rm op}R^2$ , one can take a constant bounded, up to polynomial factors in $R$ and the operator norms of $Q,K,V$ , by

C_{\theta,p}(R)\lesssim e^{2A_R}.

(110)

This is the exponential-in-score-radius behavior, equivalently $e^{O(R^2)}$ for dot-product scores on $B(0,R)$ , refined in Castin et al., 2024.

Distributional Robustness And Wasserstein Infinity¶

Wasserstein distances define ambiguity sets around empirical laws. Given samples $z_i$ and $\widehat{\al}_n=\frac1n\sum_i\delta_{z_i}$ , a distributionally robust optimization problem replaces empirical risk by

\sup_{\be:\Wass_p(\be,\widehat{\al}_n)\leq\rho} \int \ell_\theta(z)\d\be(z).

(114)

Under standard upper-semicontinuity and growth assumptions on the loss, one has the dual reformulation

\sup_{\be:\Wass_p(\be,\widehat{\al}_n)^p\leq\rho^p} \int \ell_\theta\d\be = \inf_{\lambda\geq0} \lambda\rho^p + \frac1n\sum_{i=1}^n \sup_z\{\ell_\theta(z)-\lambda d(z,z_i)^p\}.

(115)

The robust risk is therefore an empirical risk in which each sample is replaced by its worst penalized perturbation. For $p=1$ and an $L_\theta$ -Lipschitz loss,

\sup_{\be:\Wass_1(\be,\widehat{\al}_n)\leq\rho} \int \ell_\theta\d\be \leq \frac1n\sum_i\ell_\theta(z_i)+\rho L_\theta.

(116)
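The Lipschitz bound can be compared with an explicit adversary in one dimension. The sketch below takes the loss $\ell(z)=|z|$, which is 1-Lipschitz, and moves every sample by $\rho$ away from the origin; pairing each sample with its shifted copy costs exactly $\rho$, so this perturbation is feasible. The names `empirical_risk` and `shifted_adversary` are illustrative.

```python
def empirical_risk(samples, loss):
    """Average loss over the samples."""
    return sum(loss(z) for z in samples) / len(samples)


def shifted_adversary(samples, rho):
    """Move every sample by rho away from the origin.

    The coupling pairing z_i with its shifted copy has average
    displacement rho, so the perturbed law is within W_1 distance rho.
    """
    return [z + rho if z >= 0 else z - rho for z in samples]


samples = [-1.0, 0.5, 2.0]
rho, lipschitz = 0.3, 1.0
adversarial = empirical_risk(shifted_adversary(samples, rho), abs)
bound = empirical_risk(samples, abs) + rho * lipschitz
print(adversarial, bound)
```

For this loss the feasible adversary already reaches the upper bound, so the inequality is an equality here; for a general Lipschitz loss it is only an upper bound.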

The figure below shows this robustification for a genuinely nonlinear classification problem. The red and blue samples form two noisy interlocking crescents whose opposing tips overlap locally. The Wasserstein adversary transports samples toward high-loss regions under a global root-mean-square displacement budget.

Wasserstein robustness reshapes the separator between two noisy interlocking moons. The black curve is the learned zero-score boundary. Filled dots are observed samples; hollow dots and violet segments show a deterministic approximation of the adversarial transport for increasing quadratic Wasserstein radii.

Interactive panel. Increase the Wasserstein radius to move the two moons toward high-loss regions and observe how their winding nonlinear separator reorganizes.

Applied to $c=d^p$, Proposition (Convexity in the Marginals and Concavity in the Cost) shows that $(\alpha,\beta)\mapsto\Wass_p(\alpha,\beta)^p$ is jointly convex. Thus $\Wass_1$ itself is jointly convex, but $\Wass_p$ need not be convex for $p>1$: on the real line,

\Wass_p\big((1-t)\delta_0+t\delta_1,\delta_0\big)=t^{1/p},

(117)

which is strictly concave for $p>1$ .

The limiting distance

\Wass_\infty(\al,\be) \eqdef \inf_{\pi\in\Couplings(\al,\be)} \esssup_{(x,y)\sim\pi} d(x,y)

(118)

minimizes the worst displacement rather than an average displacement. It is the limit of $\Wass_p$ as $p\to\infty$ on bounded spaces, but not the limit of the linear objectives defining $\Wass_p^p$ . Although the essential-supremum objective is not linear, each sublevel is a convex support-constrained feasibility problem:

\Wass_\infty(\al,\be)\leq r \quad\Longleftrightarrow\quad \exists\pi\in\Couplings(\al,\be) \text{ supported on }\{d\leq r\}.

(119)

Thus one can compute it by threshold search over feasible coupling problems.
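For two uniform clouds with the same number of points, the feasibility test reduces to the existence of a perfect matching using only pairs at distance at most $r$, since a bistochastic matrix is a convex combination of permutation matrices with the same support constraint. The sketch below relies on that uniform, equal-size assumption and searches over the finitely many pairwise distances; the names `has_perfect_matching` and `w_infinity_uniform` are illustrative.

```python
def has_perfect_matching(allowed):
    """Augmenting-path bipartite matching on an n x n boolean matrix."""
    n = len(allowed)
    match_of_target = [-1] * n

    def try_assign(i, seen):
        for j in range(n):
            if allowed[i][j] and not seen[j]:
                seen[j] = True
                # Take a free target, or re-route its current source.
                if match_of_target[j] == -1 or try_assign(match_of_target[j], seen):
                    match_of_target[j] = i
                    return True
        return False

    return all(try_assign(i, [False] * n) for i in range(n))


def w_infinity_uniform(xs, ys, dist):
    """W_infinity between uniform clouds of equal size, by threshold search."""
    D = [[dist(x, y) for y in ys] for x in xs]
    thresholds = sorted({d for row in D for d in row})
    lo, hi = 0, len(thresholds) - 1   # the largest threshold is always feasible
    while lo < hi:
        mid = (lo + hi) // 2
        r = thresholds[mid]
        if has_perfect_matching([[d <= r for d in row] for row in D]):
            hi = mid       # feasible: try a smaller radius
        else:
            lo = mid + 1   # infeasible: the radius must grow
    return thresholds[lo]


xs = [0.0, 1.0, 5.0]
ys = [0.5, 1.5, 9.0]
print(w_infinity_uniform(xs, ys, lambda u, v: abs(u - v)))
```

Feasibility is monotone in the radius, which is what makes the bisection valid. For weighted measures the matching test would have to be replaced by a flow or linear-programming feasibility problem.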

Proposition: $\Wass_\infty$ Robust Envelope Around An Empirical Law

Let $(\Zz,d)$ be a Polish metric space. Let $\widehat{\al}=\sum_{i=1}^n a_i\delta_{z_i}$ with distinct $z_i$ (repeated atoms may first be merged), $a_i>0$, and $\sum_i a_i=1$, and assume the closed balls $\overline B(z_i,\rho)$ are compact. Then, for any real-valued upper-semicontinuous loss $\ell$,

\sup_{\be:\Wass_\infty(\be,\widehat{\al})\leq\rho} \int \ell(z)\d\be(z) = \sum_{i=1}^n a_i \sup_{z\in\overline B(z_i,\rho)}\ell(z).

(120)
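The right-hand side of this identity decouples across atoms, so it can be approximated one ball at a time. The sketch below works on the real line and replaces each supremum by a maximum over a regular grid, which is an approximation; the name `robust_envelope` and the parameter `grid_step` are illustrative.

```python
def robust_envelope(atoms, weights, loss, rho, grid_step=1e-3):
    """Approximate sum_i a_i * sup_{|z - z_i| <= rho} loss(z) on a 1-D grid."""
    steps = int(round(rho / grid_step))
    total = 0.0
    for z_i, a_i in zip(atoms, weights):
        # Grid over the closed ball [z_i - rho, z_i + rho], endpoints included.
        candidates = [z_i + k * grid_step for k in range(-steps, steps + 1)]
        total += a_i * max(loss(z) for z in candidates)
    return total


atoms, weights = [-1.0, 0.5], [0.5, 0.5]
value = robust_envelope(atoms, weights, lambda z: z * z, rho=0.5)
print(value)
```

For the quadratic loss each atom moves to the endpoint of its ball farthest from the origin, giving $0.5\cdot 2.25+0.5\cdot 1=1.625$. Each sample is perturbed independently within its own ball, in contrast with the averaged budget of $\Wass_p$ for finite $p$.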

The coupling viewpoint developed in this chapter provides feasibility, existence, geometry, and stability. The next chapter adds the complementary dual description, in which marginal constraints are represented by Kantorovich potentials.

References¶

Kantorovich, L. (1942). On the transfer of masses (in Russian). Doklady Akademii Nauk, 37(2), 227–229.
Villani, C. (2003). Topics in Optimal Transportation (Vol. 58). American Mathematical Society.
Villani, C. (2009). Optimal Transport: Old and New (Vol. 338). Springer.
Rachev, S. T., & Rüschendorf, L. (1998). Mass Transportation Problems: Volume II: Applications. Springer.
Courty, N., Flamary, R., Tuia, D., & Rakotomamonjy, A. (2017). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9), 1853–1865.
Courty, N., Flamary, R., Habrard, A., & Rakotomamonjy, A. (2017). Joint distribution optimal transportation for domain adaptation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 30 (pp. 3730–3739).
Rubner, Y., Tomasi, C., & Guibas, L. J. (2000). The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2), 99–121.
Xia, G.-S., Ferradans, S., Peyré, G., & Aujol, J.-F. (2014). Synthesizing and mixing stationary Gaussian texture models. SIAM Journal on Imaging Sciences, 7(1), 476–508.
Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., & Guibas, L. (2015). Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4), 66:1-66:11.
Bonneel, N., Rabin, J., Peyré, G., & Pfister, H. (2015). Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1), 22–45.
Bonneel, N., & Digne, J. (2023). A Survey of Optimal Transport for Computer Graphics and Computer Vision. Computer Graphics Forum, 42(2), 439–460. 10.1111/cgf.14778
Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Gould, J., Solomon, A., Liu, S., Lin, S., Berube, P., Lee, L., Chen, J., Brumbaugh, J., Rigollet, P., Hochedlinger, K., Jaenisch, R., Regev, A., & Lander, E. S. (2019). Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell, 176(4), 928-943.e22. 10.1016/j.cell.2019.01.006
Tong, A., Huang, J., Wolf, G., van Dijk, D., & Krishnaswamy, S. (2020). TrajectoryNet: A Dynamic Optimal Transport Network for Modeling Cellular Dynamics. Proceedings of the 37th International Conference on Machine Learning, 119, 9526–9536. https://proceedings.mlr.press/v119/tong20a.html
Lavenant, H., Zhang, S., Kim, Y.-H., & Schiebinger, G. (2021). Towards a Mathematical Theory of Trajectory Inference. arXiv Preprint arXiv:2102.09204. https://arxiv.org/abs/2102.09204
Klein, D., Uscidda, T., Theis, F., & Cuturi, M. (2024). GENOT: Entropic (Gromov) Wasserstein Flow Matching with Applications to Single-Cell Genomics. Advances in Neural Information Processing Systems, 37. 10.52202/079017-3301