Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Semi-discrete and Wasserstein-1

This chapter focuses on two computationally useful degeneracies of the dual problem. Semi-discrete optimal transport turns a continuous-to-discrete map into finite-dimensional geometry, while W1\Wass_1 replaces convex potentials by Lipschitz functions and flow fields. The material connects computational geometry Aurenhammer et al., 1998Mérigot, 2011Mérigot, 2013 with the Kantorovich--Rubinstein and Beckmann formulations Kantorovich & Rubinstein, 1958Beckmann, 1952.

Semi-dual

The semi-dual eliminates one potential by an exact cc-transform. It keeps concavity while removing explicit inequality constraints.

Write the dual problem as

supf,gC(X)×C(Y)E(f,g),\sup_{f,g\in\Cc(\X)\times\Cc(\Y)} \mathcal{E}(f,g),

where E(f,g)\mathcal{E}(f,g) is the dual objective, with value -\infty when the feasibility constraint fails. Optimizing out gg exactly gives the semi-dual problem

supfC(X)E~(f):=E(f,fc)=supgE(f,g)=Xfdα+Yfcdβ.\sup_{f\in\Cc(\X)} \widetilde{\mathcal{E}}(f) \eqdef \mathcal{E}(f,f^c) = \sup_g \mathcal{E}(f,g) = \int_\X f\,\d\alpha + \int_\Y f^c\,\d\beta .

Partial maximization of a concave problem preserves concavity, so E~\widetilde{\mathcal{E}} is still concave. The advantage is that the explicit inequality constraint has disappeared, which allows simpler optimization algorithms.

Semi-discrete

The semi-discrete case is the setting where dual potentials become weights of Laguerre cells. This gives both geometry and algorithms for quantization and density fitting.

Discrete Targets and Laguerre Cells

Consider the case where

β=j=1mbjδyj\beta=\sum_{j=1}^m b_j\delta_{y_j}

is discrete. The same construction applies if α\alpha is discrete, after exchanging the roles of α\alpha and β\beta. For a dual vector gRmg\in\RR^m, the discrete cˉ\bar c-transform is

gcˉ(x):=min1jmc(x,yj)gj.g^{\bar c}(x) \eqdef \min_{1\le j\le m} c(x,y_j)-g_j.

This maps a vector gg to a continuous function under the same regularity assumptions on cc as in the continuous setting. It coincides with the usual cc-transform when the target space is restricted to the support of β\beta. Using this transform when β\beta is discrete yields the finite-dimensional semi-dual

Lc(α,β)=maxgRmE(g):=Xgcˉ(x)dα(x)+j=1mgjbj.\mathcal{L}_c(\alpha,\beta) = \max_{g\in\RR^m} \mathcal{E}(g) \eqdef \int_\X g^{\bar c}(x)\,\d\alpha(x) + \sum_{j=1}^m g_j b_j .

The geometric object encoded by the dual weights is a weighted nearest-neighbor diagram: each source point is assigned to the target atom that realizes the discrete cˉ\bar c-transform.

For quadratic costs, varying the dual weights moves the walls between adjacent cells while keeping them parallel. This is the geometric mechanism by which the cell masses are adjusted.

<IPython.core.display.Image object>

Laguerre cells for semi-discrete quadratic transport. The red contours show a continuous source density α\alpha given by a three-component Gaussian mixture on the right. The twenty-one colored circular sites are the atoms of the discrete target β\beta, sampled from a compact cloud on the left; each site color matches its Laguerre cell. Starting from ordinary Voronoi cells, semi-dual weight updates deform the cells so that the α\alpha-mass captured by each cell approaches the prescribed target mass.

The interactive demo exposes the dual-weight mechanism directly. Increase the number of weight updates to watch cells with too little mass expand and cells with too much mass shrink.

Interactive panel. Use the weight and seed controls to deform Laguerre cells and watch how their areas respond to semi-discrete masses.

Mass Balance

The semi-dual energy can be rewritten as

E(g)=j=1mLj(g)(c(x,yj)gj)dα(x)+g,b.\mathcal{E}(g) = \sum_{j=1}^m \int_{\mathcal{L}_j(g)} \left(c(x,y_j)-g_j\right)\,\d\alpha(x) + \langle g,b\rangle .
Proof

For α\alpha-almost every xx, the minimizing index in minjc(x,yj)gj\min_j c(x,y_j)-g_j is unique. If this index is j(x)j(x), then the directional derivative in direction hRmh\in\RR^m is

ddtt=0minj(c(x,yj)gjthj)=hj(x).\left.\frac{\d}{\d t}\right|_{t=0} \min_j\left(c(x,y_j)-g_j-t h_j\right) = -h_{j(x)}.

Dominated convergence gives

dE(g)[h]=jhjLj(g)dα+jhjbj,\d\mathcal{E}(g)[h] = -\sum_j h_j\int_{\mathcal{L}_j(g)}\d\alpha + \sum_j h_j b_j,

which is the announced gradient formula.

The first-order optimality condition says that solving the semi-discrete dual amounts to choosing weights gg so that

Lj(g)dα=bjfor every j.\int_{\mathcal{L}_j(g)}\d\alpha=b_j \qquad\text{for every }j.

At optimality, the transport map is piecewise constant: it sends xLj(g)x\in\mathcal{L}_j(g) to yjy_j. For the quadratic cost, uniqueness follows from Brenier’s theorem when α\alpha has a density.

Quadratic power diagrams have polyhedral cells and can be computed efficiently using computational-geometry algorithms Aurenhammer, 1987Aurenhammer et al., 1998Mérigot, 2011. One classical construction lifts sites to (yj,yj2gj)Rd+1(y_j,\norm{y_j}^2-g_j)\in\RR^{d+1} and obtains the power diagram by projecting the lower envelope of their convex hull. In dimensions two and three, Chan’s output-sensitive convex-hull algorithm has complexity O(mlogQ)O(m\log Q) for mm sites and QQ hull vertices Chan, 1996.

Stochastic Optimization

The semi-discrete formulation is useful because the objective is an expectation with respect to α\alpha:

E(g)=XE(g,x)dα(x)=EX(E(g,X)),E(g,x):=gcˉ(x)+g,b.\mathcal{E}(g) = \int_\X E(g,x)\,\d\alpha(x) = \EE_X(E(g,X)), \qquad E(g,x)\eqdef g^{\bar c}(x)+\langle g,b\rangle .

Away from cell boundaries, the stochastic gradient of the integrand is

gE(g,x)=(bj1Lj(g)(x))j=1m,\nabla_g E(g,x) = \left(b_j-\mathbf{1}_{\mathcal{L}_j(g)}(x)\right)_{j=1}^m,

an unbiased estimator of E(g)\nabla\mathcal{E}(g) when cell boundaries have α\alpha-measure zero. One can therefore maximize the semi-dual without first discretizing α\alpha: the measure is used as a black box from which independent samples are drawn, a natural setup in high-dimensional statistics and machine learning.

Starting from g(0)=0g^{(0)}=0, stochastic gradient ascent draws xαx_\ell\sim\alpha and performs

g(+1):=g()+τgE(g(),x).g^{(\ell+1)} \eqdef g^{(\ell)} + \tau_\ell\nabla_g E(g^{(\ell)},x_\ell).

The step size must decay so that sampling noise averages out. A typical schedule is

τ:=τ01+/0.\tau_\ell\eqdef\frac{\tau_0}{1+\ell/\ell_0}.

Under standard stochastic-approximation assumptions,

E(g)E(E(g()))=O(1/2),\mathcal{E}(g^\star) - \EE\left(\mathcal{E}(g^{(\ell)})\right) = O(\ell^{-1/2}),

where gg^\star is a maximizer and the expectation is over the i.i.d. samples. This stochastic viewpoint is one of the main algorithmic advantages of the semi-discrete formulation Mérigot, 2011Genevay et al., 2016.

Optimal Quantization

Optimal quantization asks for the best discrete approximation of a measure by mm codepoints. It is the geometric core of vector quantization, compression and kk-means clustering.

For a measure α\alpha, define

Qm(α):=minY=(yj)j=1m,  bΔmWp(α,j=1mbjδyj).\mathcal{Q}_m(\alpha) \eqdef \min_{Y=(y_j)_{j=1}^m,\;b\in\simplex_m} \Wass_p\left(\alpha,\sum_{j=1}^m b_j\delta_{y_j}\right).

This problem is classical in approximation theory and information theory Graf & Luschgy, 2000Lloyd, 1982. The optimal-transport formulation emphasizes that one optimizes both the support locations YY and, unless prescribed, the masses bb.

Proof

For the upper bound, partition Ω\Omega into mm cells of diameter at most Cm1/dC m^{-1/d}, up to boundary effects, and place one codepoint in each nonempty cell. Sending each point to the codepoint in its cell gives a transport distance bounded by Cm1/dC m^{-1/d}.

For the lower bound, fix any set YY of mm codepoints and write dY(x)=minjxyjd_Y(x)=\min_j\norm{x-y_j}. Since the density is bounded above, the mass of the tt-neighborhood of YY is at most CmtdCmt^d. Choosing t0m1/dt_0\simeq m^{-1/d} small enough gives α({dY>t})c\alpha(\{d_Y>t\})\ge c for 0<t<t00<t<t_0. Hence

dY(x)pdα(x)=0+ptp1α({dY>t})dtct0pcmp/d.\int d_Y(x)^p\,\d\alpha(x) = \int_0^{+\infty} p t^{p-1}\alpha(\{d_Y>t\})\,\d t \ge c t_0^p \simeq c m^{-p/d}.

Taking the pp-th root and minimizing over YY proves the lower bound.

This deterministic rate mirrors the empirical optimal-transport sample-complexity rate: both are governed by the spacing m1/dm^{-1/d} of points in dimension dd. Quantization is best-case and deterministic, while empirical OT is random, but both display the same curse of dimensionality.

For fixed codepoints YY, the problem is convex with respect to the weights bb. The dependence on YY is nonconvex and is generally computationally hard. In one dimension, monotonicity fixes the ordering of cells, reducing the problem to interval endpoints and centroids; for the uniform law with the quadratic cost, the optimal centroids are equally spaced.

Proof

For any coupling between α\alpha and a measure supported on YY, the conditional destination of a point xx belongs to YY, so its conditional cost is at least minjc(x,yj)\min_j c(x,y_j). Integrating gives the lower bound. Conversely, choose a measurable nearest-codepoint map TY(x)arg minjc(x,yj)T_Y(x)\in\operatorname*{arg\,min}_j c(x,y_j), breaking ties measurably, and set bj=α(TY1(yj))b_j=\alpha(T_Y^{-1}(y_j)). Then (TY)α=jbjδyj(T_Y)_\sharp\alpha=\sum_j b_j\delta_{y_j} and the induced transport reaches the displayed lower bound.

Consequently, the quantization energy can be written in nearest-centroid form:

Qm(α)p=minYF(Y),F(Y):=Xmin1jmc(x,yj)dα(x).\mathcal{Q}_m(\alpha)^p = \min_Y \mathcal{F}(Y), \qquad \mathcal{F}(Y) \eqdef \int_\X \min_{1\le j\le m} c(x,y_j)\,\d\alpha(x).

At a differentiability point of this energy, each local minimizer satisfies the centroid condition

yjarg minyVj(Y)c(x,y)dα(x).y_j \in \operatorname*{arg\,min}_{y} \int_{\mathcal{V}_j(Y)} c(x,y)\,\d\alpha(x).

For the squared Euclidean cost, this becomes

yj=Vj(Y)xdα(x)Vj(Y)dα.y_j = \frac{\int_{\mathcal{V}_j(Y)} x\,\d\alpha(x)} {\int_{\mathcal{V}_j(Y)} \d\alpha}.

Lloyd’s algorithm, also known as the kk-means algorithm, iterates this fixed point: assign points to nearest sites, then replace each site by the centroid of its cell Lloyd, 1982. With standard tie-breaking, the objective decreases at each step. Since the problem is nonconvex in YY, the iterates generally converge only to a local minimum. Good seeding matters; for squared Euclidean costs, kk-means++ gives a logarithmic approximation guarantee in expectation Arthur & Vassilvitskii, 2007.

<IPython.core.display.Image object>

Lloyd quantization for the same continuous density and twenty-one initial sites as the Laguerre-cell figure. The red contours show the density α\alpha, while the colored disks are the current codepoints and have the same colors as their Voronoi cells. The iterations move the initially left-located sites toward the high-density region and reshape the cells according to centroidal Voronoi geometry.

The interactive demo separates the nonconvex geometry from the fixed-point update: increase the iteration counter and watch sites migrate toward the density before settling into a local centroidal configuration.

Interactive panel. Use the iteration and site controls to compare Lloyd-style quantization steps with the semi-discrete geometry.

Wasserstein-1 Norm

The W1\Wass_1 distance has an especially transparent dual: the admissible potentials are exactly 1-Lipschitz test functions. This makes W1\Wass_1 the meeting point between transport, PDE formulations and weak norms on signed measures.

c-Transform for Wasserstein-1

Assume that dd is a distance on X=Y\X=\Y and take the ground cost c(x,y)=d(x,y)c(x,y)=d(x,y).

Proof

First suppose f=gcf=g^c for some gg. For x,yXx,y\in\X,

f(x)f(y)=infz[d(x,z)g(z)]infz[d(y,z)g(z)]supzd(x,z)d(y,z)d(x,y),|f(x)-f(y)| = \left| \inf_z [d(x,z)-g(z)] - \inf_z [d(y,z)-g(z)] \right| \le \sup_z |d(x,z)-d(y,z)| \le d(x,y),

where the last inequality is the reverse triangle inequality. Thus Lip(f)1\Lip(f)\le1.

If Lip(f)1\Lip(f)\le1, then f(x)f(y)+d(x,y)f(x)\le f(y)+d(x,y), so d(x,y)f(x)f(y)d(x,y)-f(x)\ge -f(y) for all xx, hence fc(y)f(y)f^c(y)\ge -f(y). Taking x=yx=y gives fc(y)f(y)f^c(y)\le -f(y). Therefore fc=ff^c=-f. Applying the same property to f-f gives (f)c=f(-f)^c=f, so every 1-Lipschitz function is cc-concave.

Using the alternating cc-transform scheme from the dual chapter, one can replace the dual pair by (f,f)(f,-f) with Lip(f)1\Lip(f)\le1. The Kantorovich dual therefore becomes the Kantorovich--Rubinstein formula

W1(α,β)=maxf{Xfd(αβ):Lip(f)1}.\Wass_1(\alpha,\beta) = \max_f \left\{ \int_\X f\,\d(\alpha-\beta) : \Lip(f)\le1 \right\}.

This expression depends only on the signed measure αβ\alpha-\beta. It therefore extends to finite signed measures of total mass zero and defines the Kantorovich--Rubinstein norm on that space Kantorovich & Rubinstein, 1958.

For a discrete signed measure αβ=krkδzk\alpha-\beta=\sum_k r_k\delta_{z_k} with krk=0\sum_k r_k=0,

W1(α,β)=max(fk)k{kfkrk:fkfd(zk,z)for all k,}.\Wass_1(\alpha,\beta) = \max_{(f_k)_k} \left\{ \sum_k f_k r_k : |f_k-f_\ell|\le d(z_k,z_\ell) \quad\text{for all }k,\ell \right\}.

This finite-dimensional linear program can be solved by generic interior-point or first-order methods; structured graph versions admit the flow formulations described below.

When d(x,y)=xyd(x,y)=|x-y| on R\RR, ordering the support points z1z2z_1\le z_2\le\cdots reduces the constraints to neighboring pairs:

W1(α,β)=max(fk)k{kfkrk:fk+1fkzk+1zkfor all k}.\Wass_1(\alpha,\beta) = \max_{(f_k)_k} \left\{ \sum_k f_k r_k : |f_{k+1}-f_k|\le z_{k+1}-z_k \quad\text{for all }k \right\}.

In one dimension this is equivalent to the closed-form cumulative formula introduced earlier.

Wasserstein-1 on Euclidean Spaces

For X=Y=Rd\X=\Y=\RR^d with c(x,y)=xyc(x,y)=\norm{x-y}, the global Lipschitz constraint in the Kantorovich--Rubinstein formula can be made local as a uniform bound on the gradient:

W1(α,β)=supf{Rdf(dαdβ):f1}.\Wass_1(\alpha,\beta) = \sup_f \left\{ \int_{\RR^d} f\,(\d\alpha-\d\beta) : \norm{\nabla f}_\infty\le1 \right\}.

Dualizing this expression with respect to vector fields m:RdRdm:\RR^d\to\RR^d gives the fixed-divergence problem

W1(α,β)=infm{Rdm(x)dx:div(m)=αβ},\Wass_1(\alpha,\beta) = \inf_m \left\{ \int_{\RR^d}\norm{m(x)}\,\d x : \operatorname{div}(m)=\alpha-\beta \right\},

often called the Beckmann formulation Beckmann, 1952. The vector field m(x)m(x) describes local movement of mass. Outside the support of the two input measures, div(m)=0\operatorname{div}(m)=0, which is conservation of mass.

Once discretized with finite elements, the dual Lipschitz problem and the Beckmann problem become nonsmooth convex optimization problems. The same formulation extends to Riemannian manifolds by replacing the Euclidean distance by geodesic distance and interpreting gradient and divergence as differential operators on the manifold.

This graph distance turns W1\Wass_1 into a finite-dimensional flow problem.

Proof

The edge constraint fifje|f_i-f_j|\le\ell_e implies, by summing along paths, that fifjdG(i,j)|f_i-f_j|\le d_G(i,j) for all vertices. Conversely, any 1-Lipschitz function for dGd_G satisfies the edge constraints because each edge is a path of length e\ell_e. The first equality is therefore the Kantorovich--Rubinstein formula on the metric space (V,dG)(V,d_G).

For the second equality, write the graph Beckmann problem and dualize its equality constraint with a potential ff:

infmeeme+supfifi(ri(divGm)i).\inf_m \sum_e \ell_e|m_e| + \sup_f \sum_i f_i \left(r_i-(\operatorname{div}_G m)_i\right).

Using divG=G\operatorname{div}_G=-\nabla_G^*, the coupling term is eme(Gf)e\sum_e m_e(\nabla_G f)_e. The minimization over each scalar flow mem_e is finite exactly when (Gf)ee|(\nabla_G f)_e|\le\ell_e, and is then equal to zero. The dual problem is the graph Lipschitz dual above. Strong duality holds because this is a finite-dimensional linear program with a nonempty feasible set: connectedness and iri=0\sum_i r_i=0 allow the signed surplus to be routed along paths.

<IPython.core.display.Image object>

Graph Beckmann formulation of W1\Wass_1 on a Delaunay graph. Red and blue disks encode the positive and negative parts of r=αβr=\alpha-\beta. Violet arrows display the signed edge flow mm: orientation gives the sign, width is proportional to me\sqrt{|m_e|}, and the flow satisfies the conservation constraint divGm=r\operatorname{div}_G m=r.

The interactive graph view lets the source and sink clusters move and changes the graph resolution. It makes the transshipment interpretation of W1\Wass_1 visible: signed mass is routed through local edges rather than matched only by straight source-to-target segments.

Interactive panel. Use the graph and demand controls to inspect how Wasserstein-1 transport becomes a flow problem on edges.

This graph formulation is the transshipment version of W1\Wass_1. It is the natural discrete analogue of the Beckmann formulation: gradients are edge differences, divergences are incidence-matrix balances, and geodesic distance is shortest-path length. It can be solved by min-cost flow methods on sparse graphs, while entropic or KL-projection variants lead to flow-Sinkhorn algorithms for graph W1\Wass_1 Beckmann, 1952Peyré, 2026.

References
  1. Aurenhammer, F., Hoffmann, F., & Aronov, B. (1998). Minkowski-type theorems and least-squares clustering. Algorithmica, 20(1), 61–76.
  2. Mérigot, Q. (2011). A multiscale approach to optimal transport. Computer Graphics Forum, 30(5), 1583–1592.
  3. Mérigot, Q. (2013). A comparison of two dual methods for discrete optimal transport. In Geometric science of information (pp. 389–396). Springer.
  4. Kantorovich, L., & Rubinstein, G. S. (1958). On a space of totally additive functions. Vestn Leningrad Universitet, 13, 52–59.
  5. Beckmann, M. (1952). A continuous model of transportation. Econometrica, 20, 643–660.
  6. Aurenhammer, F. (1987). Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1), 78–96.
  7. Chan, T. M. (1996). Optimal output-sensitive convex hull algorithms in two and three dimensions. Discrete & Computational Geometry, 16(4), 361–368.
  8. Genevay, A., Cuturi, M., Peyré, G., & Bach, F. (2016). Stochastic optimization for large-scale optimal transport. Advances in Neural Information Processing Systems, 3440–3448.
  9. Graf, S., & Luschgy, H. (2000). Foundations of Quantization for Probability Distributions (Vol. 1730). Springer.
  10. Lloyd, S. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
  11. Arthur, D., & Vassilvitskii, S. (2007). k-means++: The Advantages of Careful Seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035.
  12. Bertsekas, D. P., & Eckstein, J. (1988). Dual coordinate step methods for linear network flow problems. Mathematical Programming, 42(1), 203–243.
  13. Orlin, J. B. (1997). A polynomial time primal network simplex algorithm for minimum cost flows. Mathematical Programming, 78(2), 109–129.
  14. Peyré, G. (2026). Robust Sublinear Convergence Rates for Iterative Bregman Projections. arXiv Preprint arXiv:2602.01372.