This chapter focuses on two computationally useful degeneracies of the dual
problem. Semi-discrete optimal transport turns a continuous-to-discrete map
into finite-dimensional geometry, while W1 replaces convex potentials by
Lipschitz functions and flow fields. The material connects computational
geometry Aurenhammer et al., 1998Mérigot, 2011Mérigot, 2013 with the
Kantorovich--Rubinstein and Beckmann formulations
Kantorovich & Rubinstein, 1958Beckmann, 1952.
from pathlib import Path
import sys
from IPython.display import Image as DisplayImage
from IPython.display import display
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
if (candidate / "ot4ml_web.py").exists():
myst_dir = candidate.resolve()
sys.path.insert(0, str(myst_dir))
break
if myst_dir is None:
raise RuntimeError("Could not locate myst/ot4ml_web.py")
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"
def show_book_figure(name, width=760):
display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))
Partial maximization of a concave problem preserves concavity, so
E is still concave. The advantage is that the explicit
inequality constraint has disappeared, which allows simpler optimization
algorithms.
The semi-discrete case is the setting where dual potentials become weights of
Laguerre cells. This gives both geometry and algorithms for quantization and
density fitting.
is discrete. The same construction applies if α is discrete, after
exchanging the roles of α and β. For a dual vector
g∈Rm, the discrete cˉ-transform is
This maps a vector g to a continuous function under the same regularity
assumptions on c as in the continuous setting. It coincides with the usual
c-transform when the target space is restricted to the support of
β. Using this transform when β is discrete yields the
finite-dimensional semi-dual
The geometric object encoded by the dual weights is a weighted
nearest-neighbor diagram: each source point is assigned to the target atom that
realizes the discrete cˉ-transform.
For quadratic costs, varying the dual weights moves the walls between adjacent
cells while keeping them parallel. This is the geometric mechanism by which
the cell masses are adjusted.
show_book_figure("semidiscrete-laguerre-cells")
Laguerre cells for semi-discrete quadratic transport. The red contours show a
continuous source density α given by a three-component Gaussian mixture
on the right. The twenty-one colored circular sites are the atoms of the
discrete target β, sampled from a compact cloud on the left; each site
color matches its Laguerre cell. Starting from ordinary Voronoi cells,
semi-dual weight updates deform the cells so that the α-mass captured
by each cell approaches the prescribed target mass.
The interactive demo exposes the dual-weight mechanism directly. Increase the number
of weight updates to watch cells with too little mass expand and cells with too
much mass shrink.
Interactive panel. Use the weight and seed controls to deform Laguerre cells and watch how their areas respond to semi-discrete masses.
At optimality, the transport map is piecewise constant: it sends
x∈Lj(g) to yj. For the quadratic cost, uniqueness follows
from Brenier’s theorem when α has a density.
Quadratic power diagrams have polyhedral cells and can be computed efficiently
using computational-geometry algorithms Aurenhammer, 1987Aurenhammer et al., 1998Mérigot, 2011.
One classical construction lifts sites to
(yj,∥yj∥2−gj)∈Rd+1 and obtains the power diagram by
projecting the lower envelope of their convex hull. In dimensions two and
three, Chan’s output-sensitive convex-hull algorithm has complexity
O(mlogQ) for m sites and Q hull vertices Chan, 1996.
an unbiased estimator of ∇E(g) when cell boundaries have
α-measure zero. One can therefore maximize the semi-dual without first
discretizing α: the measure is used as a black box from which
independent samples are drawn, a natural setup in high-dimensional statistics
and machine learning.
Starting from g(0)=0, stochastic gradient ascent draws
xℓ∼α and performs
where g⋆ is a maximizer and the expectation is over the i.i.d. samples.
This stochastic viewpoint is one of the main algorithmic advantages of the
semi-discrete formulation Mérigot, 2011Genevay et al., 2016.
Optimal quantization asks for the best discrete approximation of a measure by
m codepoints. It is the geometric core of vector quantization, compression
and k-means clustering.
This problem is classical in approximation theory and information theory
Graf & Luschgy, 2000Lloyd, 1982. The optimal-transport
formulation emphasizes that one optimizes both the support locations Y and,
unless prescribed, the masses b.
Proof
For the upper bound, partition Ω into m cells of diameter at most
Cm−1/d, up to boundary effects, and place one codepoint in each nonempty
cell. Sending each point to the codepoint in its cell gives a transport
distance bounded by Cm−1/d.
For the lower bound, fix any set Y of m codepoints and write
dY(x)=minj∥x−yj∥. Since the density is bounded above, the mass of
the t-neighborhood of Y is at most Cmtd. Choosing
t0≃m−1/d small enough gives
α({dY>t})≥c for 0<t<t0. Hence
Taking the p-th root and minimizing over Y proves the lower bound.
This deterministic rate mirrors the empirical optimal-transport
sample-complexity rate: both are governed by the spacing m−1/d of points
in dimension d. Quantization is best-case and deterministic, while empirical
OT is random, but both display the same curse of dimensionality.
For fixed codepoints Y, the problem is convex with respect to the weights
b. The dependence on Y is nonconvex and is generally computationally hard.
In one dimension, monotonicity fixes the ordering of cells, reducing the
problem to interval endpoints and centroids; for the uniform law with the
quadratic cost, the optimal centroids are equally spaced.
Proof
For any coupling between α and a measure supported on Y, the
conditional destination of a point x belongs to Y, so its conditional cost
is at least minjc(x,yj). Integrating gives the lower bound. Conversely,
choose a measurable nearest-codepoint map
TY(x)∈argminjc(x,yj), breaking ties measurably, and
set bj=α(TY−1(yj)). Then
(TY)♯α=∑jbjδyj and the induced transport reaches
the displayed lower bound.
Consequently, the quantization energy can be written in nearest-centroid form:
Lloyd’s algorithm, also known as the k-means algorithm, iterates this fixed
point: assign points to nearest sites, then replace each site by the centroid
of its cell Lloyd, 1982. With standard tie-breaking, the objective
decreases at each step. Since the problem is nonconvex in Y, the iterates
generally converge only to a local minimum. Good seeding matters; for squared
Euclidean costs, k-means++ gives a logarithmic approximation guarantee in
expectation Arthur & Vassilvitskii, 2007.
Lloyd quantization for the same continuous density and twenty-one initial
sites as the Laguerre-cell figure. The red contours show the density
α, while the colored disks are the current codepoints and have the same
colors as their Voronoi cells. The iterations move the initially left-located
sites toward the high-density region and reshape the cells according to
centroidal Voronoi geometry.
The interactive demo separates the nonconvex geometry from the fixed-point update:
increase the iteration counter and watch sites migrate toward the density
before settling into a local centroidal configuration.
Interactive panel. Use the iteration and site controls to compare Lloyd-style quantization steps with the semi-discrete geometry.
The W1 distance has an especially transparent dual: the admissible
potentials are exactly 1-Lipschitz test functions. This makes W1 the
meeting point between transport, PDE formulations and weak norms on signed
measures.
where the last inequality is the reverse triangle inequality. Thus
Lip(f)≤1.
If Lip(f)≤1, then f(x)≤f(y)+d(x,y), so
d(x,y)−f(x)≥−f(y) for all x, hence fc(y)≥−f(y). Taking x=y
gives fc(y)≤−f(y). Therefore fc=−f. Applying the same property to
−f gives (−f)c=f, so every 1-Lipschitz function is c-concave.
Using the alternating c-transform scheme from the dual chapter, one can
replace the dual pair by (f,−f) with Lip(f)≤1. The Kantorovich dual
therefore becomes the Kantorovich--Rubinstein formula
This expression depends only on the signed measure α−β. It
therefore extends to finite signed measures of total mass zero and defines the
Kantorovich--Rubinstein norm on that space Kantorovich & Rubinstein, 1958.
For a discrete signed measure
α−β=∑krkδzk with ∑krk=0,
W1(α,β)=(fk)kmax{k∑fkrk:∣fk−fℓ∣≤d(zk,zℓ)for all k,ℓ}.
This finite-dimensional linear program can be solved by generic interior-point
or first-order methods; structured graph versions admit the flow formulations
described below.
When d(x,y)=∣x−y∣ on R, ordering the support points
z1≤z2≤⋯ reduces the constraints to neighboring pairs:
W1(α,β)=(fk)kmax{k∑fkrk:∣fk+1−fk∣≤zk+1−zkfor all k}.
For X=Y=Rd with c(x,y)=∥x−y∥, the global Lipschitz constraint in
the Kantorovich--Rubinstein formula can be made local as a uniform bound on
the gradient:
often called the Beckmann formulation Beckmann, 1952. The vector field
m(x) describes local movement of mass. Outside the support of the two input
measures, div(m)=0, which is conservation of mass.
Once discretized with finite elements, the dual Lipschitz problem and the
Beckmann problem become nonsmooth convex optimization problems. The same
formulation extends to Riemannian manifolds by replacing the Euclidean
distance by geodesic distance and interpreting gradient and divergence as
differential operators on the manifold.
This graph distance turns W1 into a finite-dimensional flow problem.
Proof
The edge constraint ∣fi−fj∣≤ℓe implies, by summing along paths, that
∣fi−fj∣≤dG(i,j) for all vertices. Conversely, any 1-Lipschitz
function for dG satisfies the edge constraints because each edge is a path
of length ℓe. The first equality is therefore the
Kantorovich--Rubinstein formula on the metric space (V,dG).
For the second equality, write the graph Beckmann problem and dualize its
equality constraint with a potential f:
Using divG=−∇G∗, the coupling term is
∑eme(∇Gf)e. The minimization over each scalar flow me is
finite exactly when ∣(∇Gf)e∣≤ℓe, and is then equal to zero.
The dual problem is the graph Lipschitz dual above. Strong duality holds
because this is a finite-dimensional linear program with a nonempty feasible
set: connectedness and ∑iri=0 allow the signed surplus to be routed
along paths.
show_book_figure("w1-graph-transport-flow")
Graph Beckmann formulation of W1 on a Delaunay graph. Red and blue
disks encode the positive and negative parts of r=α−β. Violet
arrows display the signed edge flow m: orientation gives the sign, width is
proportional to ∣me∣, and the flow satisfies the conservation
constraint divGm=r.
The interactive graph view lets the source and sink clusters move and changes the
graph resolution. It makes the transshipment interpretation of W1
visible: signed mass is routed through local edges rather than matched only by
straight source-to-target segments.
Interactive panel. Use the graph and demand controls to inspect how Wasserstein-1 transport becomes a flow problem on edges.
This graph formulation is the transshipment version of W1. It is the
natural discrete analogue of the Beckmann formulation: gradients are edge
differences, divergences are incidence-matrix balances, and geodesic distance
is shortest-path length. It can be solved by min-cost flow methods on sparse
graphs, while entropic or KL-projection variants lead to flow-Sinkhorn
algorithms for graph W1Beckmann, 1952Peyré, 2026.
Aurenhammer, F., Hoffmann, F., & Aronov, B. (1998). Minkowski-type theorems and least-squares clustering. Algorithmica, 20(1), 61–76.
Mérigot, Q. (2011). A multiscale approach to optimal transport. Computer Graphics Forum, 30(5), 1583–1592.
Mérigot, Q. (2013). A comparison of two dual methods for discrete optimal transport. In Geometric science of information (pp. 389–396). Springer.
Kantorovich, L., & Rubinstein, G. S. (1958). On a space of totally additive functions. Vestn Leningrad Universitet, 13, 52–59.
Beckmann, M. (1952). A continuous model of transportation. Econometrica, 20, 643–660.
Aurenhammer, F. (1987). Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1), 78–96.
Chan, T. M. (1996). Optimal output-sensitive convex hull algorithms in two and three dimensions. Discrete & Computational Geometry, 16(4), 361–368.
Genevay, A., Cuturi, M., Peyré, G., & Bach, F. (2016). Stochastic optimization for large-scale optimal transport. Advances in Neural Information Processing Systems, 3440–3448.
Graf, S., & Luschgy, H. (2000). Foundations of Quantization for Probability Distributions (Vol. 1730). Springer.
Lloyd, S. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
Arthur, D., & Vassilvitskii, S. (2007). k-means++: The Advantages of Careful Seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035.
Bertsekas, D. P., & Eckstein, J. (1988). Dual coordinate step methods for linear network flow problems. Mathematical Programming, 42(1), 203–243.
Orlin, J. B. (1997). A polynomial time primal network simplex algorithm for minimum cost flows. Mathematical Programming, 78(2), 109–129.
Peyré, G. (2026). Robust Sublinear Convergence Rates for Iterative Bregman Projections. arXiv Preprint arXiv:2602.01372.