Kantorovich Relaxation
Kantorovich’s relaxation is the decisive move that turns transport into convex optimization. Deterministic maps are replaced by couplings, infeasibility and asymmetry disappear, and the Wasserstein distances emerge. Historically, this linear-programming viewpoint grew from Kantorovich’s economic planning work Kantorovich, 1942 and is now the standard foundation of optimal transport Villani, 2003Villani, 2009Rachev & Rüschendorf, 1998.
Discrete Relaxation¶
The discrete relaxation is the cleanest place to see mass splitting. It replaces permutations by a transportation polytope and reveals the linear-programming structure that algorithms exploit.
Monge’s discrete matching problem cannot be applied when the two clouds have different cardinalities or unequal weights. The continuous Monge problem has the same obstruction: there may be no map such that , for instance when one Dirac mass must be sent to several Dirac masses. It is also asymmetric: two Dirac masses can be mapped to one, but one Dirac mass cannot be split into two by a deterministic map.
Kantorovich’s idea is to relax deterministic transportation. Instead of sending each source point to exactly one target, the mass at may be dispatched across several targets. The relaxation is encoded by a coupling matrix for two discrete measures
The first consequence is feasibility. There is always at least one admissible plan.
The feasible set is a bounded intersection of an affine space with the nonnegative orthant, hence a convex polytope. In one dimension, the coupling can be read as a matrix: rows index source bins, columns index target bins, and the marginal constraints appear as prescribed row and column sums.
Proof
The reverse implication is immediate. Conversely, assume that is optimal and let be arbitrary. Since all entries of are positive, there exists small enough that
is nonnegative. It still has row sums and column sums , so . Also
Taking scalar products with , the optimality of forces both and to have the same cost as . Since was arbitrary, all couplings are optimal.
Thus the product plan is mainly a feasibility witness. Except when the linear cost is constant on the whole transportation polytope, it is not expected to solve optimal transport.

Discrete couplings represented as straight transport segments. The deterministic graph is a feasible Monge-type plan, the product plan spreads every source mass over all targets, and the optimal Kantorovich plan minimizes the quadratic transport cost. Line width and opacity encode transported mass.
The interactive demo below separates the main feasible-plan archetypes: deterministic graphs, independent product couplings, sparse splitting plans, and entropic approximations.
Interactive panel. Use the point and mass sliders to see how a Kantorovich plan can split mass into several weighted links rather than choosing one destination per source.

Coupling matrices with their prescribed marginals. The central grayscale image displays ; the red curve on the left is the source marginal , and the blue curve on top is the target marginal . The independent product plan is diffuse, whereas the one-dimensional optimal plan concentrates near the monotone quantile correspondence.
The companion control varies the bin count and the endpoint laws, making the transition from diffuse independence to monotone transport visually explicit.
Interactive panel. Use the problem-size and mass-shape controls to compare the coupling matrix with its red and blue marginal sums.
The Kantorovich feasible set is symmetric: if and only if . With a unit transport cost matrix , the discrete Kantorovich problem reads
This is a linear program, and its solutions need not be unique.

From permutation matrices to splitting couplings. When the two empirical measures have the same number of atoms and uniform weights, an optimal plan can be a permutation matrix. Once target masses are nonuniform, one source can send mass to several targets and several sources can merge into the same target.
The interactive demo keeps the same source and target sites while changing the target mass imbalance, so the moment where permutation structure breaks becomes visible.
Interactive panel. Use the split-mass and geometry controls to contrast deterministic permutation-like transport with plans that divide source mass across targets.
Proof
The transportation polytope is compact, so a linear objective attains its minimum at an extreme point. Let be an extreme point and let be its support graph on the bipartite vertex set of source and target indices. If this graph contains a cycle, put alternating signs on the cycle, obtaining a nonzero matrix supported on with zero row and column sums. For small , both and are nonnegative couplings and is their midpoint, contradicting extremality. Thus the support graph is a forest, which has at most edges.
Proof
All assignments are nonnegative. At each step, the mass placed in entry is subtracted from exactly one current row residual and one current column residual, so no row or column can receive more mass than prescribed. Conversely, an index is advanced only when its residual has been fully filled. When the algorithm stops, the total assigned mass is , hence all row and column sums are exactly and .
Each positive assignment exhausts at least one current row or one current column. Before the final assignment, at most row advances and column advances can occur without terminating the construction. Hence the number of positive entries is at most . For acyclicity, view the positive support as a bipartite graph. Once a row or column index is advanced, it never appears again, so each new positive edge either starts a new component or attaches at least one new vertex to the component currently being swept. No edge is ever added between two old vertices of the same component, so no cycle can be created.
The north-west corner rule, summarized in Algorithm Algorithm: North-west corner coupling, does not use the cost matrix and is therefore not meant to solve the discrete Kantorovich problem. Its role is algorithmic: an acyclic support corresponds to linearly independent marginal constraints. When the support has fewer than positive entries, transportation simplex implementations complete it with zero-mass basic variables to obtain a degenerate basic feasible solution. This gives a cheap initialization for the pivoting methods discussed in Section Linear-Programming Algorithms.
One-Dimensional Cases¶
In one dimension, the transportation polytope has a canonical monotone optimizer. This is the weighted version of the sorting rule from the matching chapter.
Proof
The north-west plan is monotone: it cannot put positive mass on both and when and . Conversely, any feasible plan with such a crossing pair can be improved by moving a small mass from and to and . The two marginals are unchanged, and convexity of gives
for and , with strict inequality for strictly convex and distinct points. Repeating this uncrossing procedure until no crossing remains yields a monotone optimal plan. There is only one monotone feasible plan with the prescribed sorted marginals, namely the sweep plan: it pairs the leftmost remaining source mass with the leftmost remaining target mass at every step. Sorting costs and the sweep uses at most assignments.
Permutation Matrices As Couplings¶
Now assume and uniform weights . In this case, a matching can be encoded as a matrix with exactly one active entry per row and per column.
The corresponding probability coupling is . If the matching cost matrix is , then
Thus the assignment problem is the minimization of a linear function over the discrete, non-convex set of permutation matrices. The convex relaxation replaces this finite set by all bistochastic matrices.
Proof
Among all nonempty faces of , choose one of minimal affine dimension. If this face contained two distinct points, maximizing a linear functional that is not constant on the face would produce a nonempty proper exposed subface, contradicting minimality. Hence the minimal face is a singleton, and its point is extreme.
Proof
The set is nonempty, compact and convex. By Proposition Proposition: Existence of Extreme Points, it has an extreme point . If with , then by linearity and optimality of , both and also minimize on , hence . Since is extreme in , . Thus is extreme in .
Proof
We first prove that permutation matrices are extreme. Let and assume that
Every bistochastic matrix has entries in . Since the only extreme points of are 0 and 1, each entry of fixes the corresponding entries of and : if , then , while if , then . Hence , so is extreme.
We now prove the converse by contrapositive. Pick . Since an integral bistochastic matrix is necessarily a permutation matrix, has at least one fractional entry. We shall split with and , proving that is not extreme.
Associate with the bipartite graph whose left vertices are the rows, whose right vertices are the columns, and whose edges are the fractional entries . An entry equal to 1 uses the whole mass of its row and column, so it is isolated in the positive support and does not appear in this fractional graph. If a left vertex is incident to one fractional edge, then it must be incident to at least one other fractional edge: after the first fractional contribution, the row still has positive remaining mass, and that remainder cannot be carried by an entry equal to 1. The same argument applies to columns. Thus every non-isolated vertex of the fractional graph has degree at least two.
Starting from any fractional edge, one may therefore walk through adjacent fractional edges without immediately backtracking and without getting stuck. Since the graph is finite, some vertex is eventually visited twice; the portion of the walk between the two visits contains a cycle. Choose a shortest such cycle and write it in alternating form
where both and are fractional for every . Define
and split the cycle edges into the alternating families
Set and outside ; on , set and ; on , set and . By the definition of , all modified entries stay in . Each row and column of the cycle sees one and one , so the row and column sums remain one. Thus , , and . Hence is not extreme. Consequently every extreme point of is integral, and every integral bistochastic matrix is a permutation matrix.
The same combinatorial idea gives the constructive decomposition used to express a bistochastic matrix as a convex combination of permutations.
Proof
The feasible set is . By Proposition Proposition: Linear Programs Have Extreme Minimizers, the linear objective has an optimal extreme point. Since scaling preserves extreme points and Theorem Theorem: Birkhoff--von Neumann identifies the extreme points of , this optimizer is for some permutation . Its cost is exactly , so is an optimal assignment.
Equivalently, for uniform empirical measures, one can always choose a permutation matrix among the minimizers of the relaxed Kantorovich problem: the relaxation is tight for assignment problems.
Linear-Programming Algorithms¶
The discrete Kantorovich problem is a linear program with much more structure than a generic dense LP. Its variables are arcs of a complete bipartite network, its equality constraints are flow-conservation constraints, and its extreme points are sparse tree-like couplings.
Transportation Simplex And Network Simplex¶
The transportation simplex goes back to Dantzig’s formulation of the transportation problem Dantzig, 1951. It works on basic feasible couplings, whose support is completed into a spanning tree of the bipartite supply-demand graph. Reduced costs identify whether an unused arc can decrease the objective. Adding such an arc creates a unique cycle; one then pushes as much mass as possible around that cycle and removes the exhausted arc.
The network simplex is the corresponding pivoting method for general minimum-cost-flow problems Bertsekas & Eckstein, 1988. It keeps node potentials, reduced costs and a spanning-tree basis. Its worst-case number of pivots can be exponential, but the per-pivot operations exploit graph sparsity. Polynomial guarantees can be obtained from strongly polynomial minimum-cost-flow algorithms such as Orlin’s algorithm Orlin, 1997.
Interior-Point Methods¶
Generic interior-point methods approach the LP through a smooth central path. For the transport polytope, the logarithmic-barrier version is
The barrier is singular at the boundary, so each iterate stays strictly inside the transportation polytope. As , the central path approaches the set of LP minimizers.

Logarithmic-barrier central path for a triangular slice of a linear program. Large selects a central interior point; decreasing moves the minimizer toward the optimal vertex while never touching the boundary. This differs from entropic OT, where the entropy temperature is part of the regularized objective itself.
The interactive view exposes the barrier parameter directly: lowering slides the minimizer from the center of the feasible triangle toward the LP vertex.
Interactive panel. Use the barrier and angle controls to move along the interior central path of the transport polytope.
Both interior-point methods and Sinkhorn keep iterates positive, but they use positivity differently. Interior-point algorithms solve the original LP by decreasing a barrier parameter. Sinkhorn fixes an entropic temperature and solves a different, KL-regularized OT problem by alternating diagonal scalings.
Relaxation For Arbitrary Measures¶
This section lifts the finite-dimensional coupling matrix to a joint probability measure. The payoff is that existence, duality and metric properties can be stated for arbitrary laws, including discrete, singular and continuous distributions.
Continuous Couplings¶
Unlike the Monge constraint, the coupling constraint is never empty. The continuous feasibility witness is the tensor product coupling.
Proof Sketch
If all couplings are optimal, the product coupling is optimal. Conversely, assume the product is optimal. If cross differences failed to vanish on the product support, there would be points such that exchanging the two target neighborhoods decreases cost. Replacing a small amount of product mass on the two diagonal rectangles by mass on the crossed rectangles keeps the same marginals and lowers the cost, a contradiction. Vanishing cross differences imply on the support, so the cost of any coupling depends only on its marginals.
If there exists a map with , then the Monge map induces the graph coupling , characterized by
Graph couplings are precisely the Kantorovich representation of deterministic Monge maps.
Continuous Kantorovich Problem¶
The discrete Kantorovich problem becomes, for arbitrary measures,
This is an infinite-dimensional linear program over a space of measures.
Proof
The constraint set is nonempty because it contains . It is closed for weak convergence because the marginal constraints are preserved. Since is compact, probability measures on it are compact for the weak topology, and is compact. Finally, is weakly continuous because is continuous and bounded.
On non-compact domains, one needs coercivity and moment conditions. For the Wasserstein cost on a Polish metric space, the natural domain is
for one, and hence every, reference point .
Monge--Kantorovich Equivalence¶
The proof of Brenier’s theorem relies on Kantorovich relaxation and duality. Under Brenier’s hypotheses, the relaxation is tight: it has the same cost as the Monge problem and the optimal coupling is induced by a map.
Proof
The proof of Brenier’s theorem shows that the support of any optimal Kantorovich plan lies in the subdifferential of a convex function. Since has a density, is differentiable -almost everywhere, so for -almost every . Every optimal coupling is therefore concentrated on the graph of and equals .
If does not have a density, non-smooth points of can be charged by and mass splitting can occur. For instance, moving to can be represented by a plan concentrated on the set-valued subdifferential of , but not by a deterministic map.
Cyclical Monotonicity¶
Cyclical monotonicity is the local geometric fingerprint of optimality for a cost . It converts a global minimization problem into finite exchange inequalities and is the bridge from Kantorovich plans to convex potentials.
It is enough to check cyclic permutations:
Proof Sketch
Suppose a finite family in violates the exchange inequality. By continuity, the same strict inequality holds in small neighborhoods around the chosen pairs. Remove a small equal amount of mass from each rectangle, keep its first and second marginal pieces, and reinsert the mass after permuting the second marginal pieces. The new measure has the same marginals but strictly smaller cost, contradicting optimality.
If the optimal plan is induced by a map , cyclical monotonicity reads
For , the two-point case gives
so is a monotone vector field. In one dimension, for , this reduces to the classical monotone rearrangement.
Metric Properties: Wasserstein Distances¶
OT costs become genuine distances when the ground cost comes from a metric. The proof relies on a gluing lemma.
Proof
If , summing over gives ; if , the corresponding column of and row of are zero. The other prescribed marginal is checked in the same way. Summing over the intermediate index gives . Its row and column sums are and .

Discrete gluing lemma in matrix form. The first two panels are optimal one-dimensional couplings through an intermediate marginal. The third panel shows the induced marginal ; it is feasible and is the coupling used in the triangle-inequality proof.
The interactive version changes the resolution of the intermediate marginal, which controls how mediated the glued source-target plan becomes.
Interactive panel. Use the mediation slider to inspect how two couplings through an intermediate marginal glue into a source-target plan.
Proof
Symmetry follows by transposing couplings. Positivity follows because a zero cost plan must be supported on the diagonal. For the triangle inequality, take optimal couplings from to and from to , glue them into , and use the feasible marginal from to . Then Minkowski’s inequality and the ground triangle inequality give
Proof Sketch
Disintegrate and against their common marginal , obtaining conditional laws on and on . Define by the conditional-product formula
This is the measure-theoretic version of the discrete formula above.
Proof
Symmetry is obtained by swapping the coordinates of a coupling. If the value is zero, an optimal coupling is supported on the diagonal and therefore the two marginals coincide. For the triangle inequality, glue optimal couplings and into , project it to a coupling between and , and apply the ground triangle inequality plus Minkowski:
Interpolation Induced By A Plan¶
The quadratic Wasserstein distance does not only compare two endpoint measures. An optimal plan also says how to move mass between them: each active pair travels along the segment joining to . This turns an optimal coupling into a curve of measures.
In the discrete case, each mass moves from to along its own segment. When the optimal plan is not induced by a map, one source atom can split into several moving atoms. If the optimal plan is not unique, different optimal plans may also induce different geodesics.
Proof
Push the optimal plan forward by . This gives a coupling , and
Hence . Applying this upper bound to the three pairs , and , and using the triangle inequality of Proposition Proposition: Metric Property Of The Wasserstein Distance, gives
All inequalities are therefore equalities, in particular the middle segment has the claimed length.

McCann interpolation induced by a non-deterministic optimal transport plan. In every panel, the red and blue endpoint measures are shown with low opacity, thin gray segments display the support of the coupling, and the moving atoms are colored from red to blue along the interpolation.
The companion panel lets the same coupling be inspected along time , with an entropy slider to contrast sparse and diffuse plans.
Interactive panel. Use the interpolation time and plan controls to see how a fixed coupling induces a cloud of displacement paths between endpoint measures.
General Geodesic Spaces¶
For Dirac masses in Euclidean space, the geodesic from to is . The same idea extends to any geodesic metric space , meaning that each pair of points can be joined by a constant-speed metric geodesic. For each pair , one replaces the Euclidean segment by a curve such that , , and
If this geodesic is unique and depends measurably on , one defines and sets for an optimal coupling . When geodesics are not unique, there is no canonical interpolation of a pair of Diracs unless a choice is made: one may select a particular geodesic between and , or randomize among several such geodesics. The intrinsic formulation is to choose a probability measure on the path space of constant-speed geodesics, called a dynamical optimal plan, such that is an optimal coupling, and to set . Different measurable choices, or different conditional distributions over geodesics with the same endpoints, can give different geodesics; the constant-speed identity remains the same. This path-space viewpoint is standard in the general theory of Wasserstein spaces Ambrosio et al., 2006Villani, 2009Santambrogio, 2015.
Comparison With Monge¶
The distance defined through the Kantorovich problem (42) should be contrasted with the directed distance obtained using Monge’s problem. The Kantorovich feasible set is never empty, since it contains the product coupling, although the -cost may still be infinite without moment assumptions on non-compact spaces. By contrast, Monge’s constraint set can be empty. When an optimal Monge map exists, Kantorovich gives the same value by choosing the graph coupling ; in this sense the Kantorovich problem is the convex relaxation of Monge’s problem, with much better stability properties.
Metric Properties: Topology And Applications¶
Wasserstein distances metrize weak convergence under moment control, sit between weak and strong topologies, and provide quantitative estimates in probability and robust optimization.
Proof
The left inequality is Jensen’s inequality applied to . The right inequality follows from .
Proof Sketch
For discrete measures on a common support, set . Any coupling has diagonal mass at most , so its off-diagonal cost is at least . This bound is achieved by putting on the diagonal and coupling the leftover positive and negative parts by a product plan. Since
the formula follows.
For Dirac masses,
Thus the strong topology never sees Diracs converge unless they are eventually equal, while the Wasserstein topology captures their spatial convergence.
Proof Sketch
For , this is the Kantorovich--Rubinstein metrization theorem: by duality, is the supremum over 1-Lipschitz test functions, and on a compact metric space this class is compact modulo constants by Arzela--Ascoli. The comparison between distances on compact spaces then gives the result for all .
On non-compact spaces, one must also impose convergence of -th moments: if and only if and
On a discrete space, weak and strong topologies coincide, and
Wasserstein Over Wasserstein¶
The construction can be iterated. Once is a metric space, the set of probability measures on becomes a metric space through . It can therefore serve as a new ground space. This is useful whenever the objects to compare are themselves random probability measures, or mixtures whose components are meaningful objects rather than only a collapsed density.
Proof Sketch
Completeness follows by representing a -Cauchy sequence through almost optimally glued couplings, which gives a Cauchy random sequence whose law is the desired limit. Separability follows by approximating measures with finitely supported measures on a countable dense subset and rational weights. If is compact, Prokhorov compactness and Wasserstein metrization of weak convergence give compactness.
Elements of are probability laws over probability measures, or random probability measures. A parametric example is
The Wasserstein distance on the Wasserstein space is
For Gaussian mixtures, this separates two geometries. A mixture can be viewed as a collapsed density on , or as a component law over Gaussian atoms in the Bures--Wasserstein space. For two component laws
the component-level problem uses the cost
If is an optimal coupling between the weights and , and if is the Brenier linear part from to , each active pair follows the Gaussian geodesic
Collapsing these component geodesics gives
This component-level interpolation generally differs from the true interpolation between the collapsed mixture densities.

Two interpolations between the same three-component one-dimensional Gaussian mixtures. On the left, components are matched using the Bures--Wasserstein distance between Gaussians. On the right, the mixtures are collapsed into ordinary one-dimensional densities and interpolated by the true quantile formula for .
The interactive comparison keeps both geometries side by side: component-level transport moves Gaussian atoms, while collapsed transport rearranges the full density.
Interactive panel. Use the mixture and blur controls to compare transport between ordinary measures with transport between distributions of measures.
Proof
Fix a coupling between and . For each choose an almost optimal coupling . Integrating this Markov kernel against gives a coupling between the collapsed mixtures whose cost is bounded by the -average of . Taking infima proves the claim.
This viewpoint also clarifies lower bounds for Gromov--Wasserstein distances: a metric-measure space can be mapped to a law of local distance profiles, and these laws can be compared by Wasserstein-over-Wasserstein.
Distributional Robustness And Wasserstein Infinity¶
Wasserstein distances define ambiguity sets around empirical laws. Given samples and , a distributionally robust optimization problem replaces empirical risk by
Under standard upper-semicontinuity and growth assumptions on the loss, one has the dual reformulation
The robust risk is therefore an empirical risk in which each sample is replaced by its worst penalized perturbation. For and an -Lipschitz loss,
Proof
Let and be nearly optimal couplings for and . Then is a coupling between the convex combinations of the marginals, and its cost is the corresponding convex combination. Taking infima proves joint convexity. For , the root can destroy convexity; on the line, is concave.
The limiting distance
minimizes the worst displacement rather than an average displacement.
Proof
Any feasible coupling between and disintegrates as , with each supported in . Hence the robust expectation is bounded above by the displayed sum. The reverse inequality follows by choosing, for each , a maximizer and taking .
Theoretical Application: Quantitative Central Limit Theorems¶
Weak topology says whether laws converge; Wasserstein distances also quantify how fast. The central limit theorem becomes a rate estimate in , which controls the error of all 1-Lipschitz observables of the normalized sum.

Central-limit theorem for normalized Bernoulli sums. Starting from , the law of remains discrete, but its rescaled atom heights approach the standard Gaussian density shown in gray.
The interactive demo varies the number of summands and the Bernoulli skew, making the Wasserstein view of weak convergence visible even while the law remains discrete.
Interactive panel. Use the dimension and sample-size controls to watch the Wasserstein CLT scaling predicted by Lipschitz test functions.
Proof Sketch
By Kantorovich--Rubinstein duality,
For each such , solve Stein’s equation . The solution has uniform derivative bounds depending only on the Lipschitz constant of . Expanding by replacing summands one at a time, the first- and second-order terms cancel because and . The Taylor remainder is bounded by , giving the rate Berry, 1941Esseen, 1942Chen et al., 2011Bobkov, 2018Rio, 2011.
- Kantorovich, L. (1942). On the transfer of masses (in Russian). Doklady Akademii Nauk, 37(2), 227–229.
- Villani, C. (2003). Topics in Optimal Transportation (Vol. 58). American Mathematical Society.
- Villani, C. (2009). Optimal Transport: Old and New (Vol. 338). Springer.
- Rachev, S. T., & Rüschendorf, L. (1998). Mass Transportation Problems: Volume II: Applications. Springer.
- Dantzig, G. B. (1951). Application of the simplex method to a transportation problem. Activity Analysis of Production and Allocation, 13, 359–373.
- Bertsekas, D. P., & Eckstein, J. (1988). Dual coordinate step methods for linear network flow problems. Mathematical Programming, 42(1), 203–243.
- Orlin, J. B. (1997). A polynomial time primal network simplex algorithm for minimum cost flows. Mathematical Programming, 78(2), 109–129.
- Ambrosio, L., Gigli, N., & Savaré, G. (2006). Gradient Flows in Metric Spaces and in the Space of Probability Measures. Springer.
- Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Birkhäuser.
- Berry, A. C. (1941). The Accuracy of the Gaussian Approximation to the Sum of Independent Variates. Transactions of the American Mathematical Society, 49(1), 122–136.
- Esseen, C.-G. (1942). On the Liapunoff Limit of Error in the Theory of Probability. Arkiv for Matematik, Astronomi Och Fysik, 28A(9), 1–19.
- Chen, L. H. Y., Goldstein, L., & Shao, Q.-M. (2011). Normal Approximation by Stein’s Method. Springer.
- Bobkov, S. G. (2018). Berry–Esseen bounds and Edgeworth expansions in the central limit theorem for transport distances. Probability Theory and Related Fields, 170, 229–262.
- Rio, E. (2011). Asymptotic constants for minimal distance in the central limit theorem. Electronic Communications in Probability, 16, 96–103.