The goal of this chapter is to pass from finite matching to transport between
arbitrary probability laws. The central stakes are to define measures,
push-forwards and Monge maps carefully enough that the discrete picture
survives, while exposing why deterministic maps can fail to exist. Monge’s
original formulation Monge, 1781 and modern treatments
Villani, 2003Villani, 2009Santambrogio, 2015Rachev & Rüschendorf, 1998 are the
conceptual background for this transition.
The previous chapter handled two sets with the same number of points. To relax
this to a more general setting, one needs probability distributions, so that
points may carry unequal masses and continuous densities can be treated in the
same language as finite clouds.
from pathlib import Path
import sys
from IPython.display import Image as DisplayImage
from IPython.display import display
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
if (candidate / "ot4ml_web.py").exists():
myst_dir = candidate.resolve()
sys.path.insert(0, str(myst_dir))
break
if myst_dir is None:
raise RuntimeError("Could not locate myst/ot4ml_web.py")
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"
def show_book_figure(name, width=760):
display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))
Measures are the language that lets point clouds, densities and singular
objects be handled uniformly. We only recall the facts needed later:
integration, total variation, densities and probabilistic laws.
In applications, it is useful to manipulate both the positions xi and the
weights ai. Moving the positions is a Lagrangian discretization; changing
the weights is an Eulerian one. The Lagrangian view is often more adaptive, but
it tends to break convexity.
We consider Borel measures α∈M(X) on a metric space (X,d). This
means that α(A) is defined for every Borel set A, obtained from open sets
by countable unions, intersections and complements. Unless otherwise stated,
the measures are finite.
A Dirac measure is defined by δx(A)=1 if x∈A and 0 otherwise.
For the discrete measure above,
Many measure-theoretic statements used later require a mild regularity
assumption on the underlying space. The point is not to restrict applications,
since Euclidean spaces, complete separable manifolds and separable Hilbert
spaces are covered, but to exclude pathological measurable spaces where
disintegration, tightness or weak convergence can fail to behave properly.
Polish spaces are the natural ambient category for probability measures. Borel
probability measures on them are regular, tightness gives compactness
criteria, regular conditional probabilities and disintegrations exist under
standard assumptions, and Wasserstein spaces remain Polish; see Proposition
Proposition: Wasserstein Spaces As Ground Spaces.
Integration against a finite measure on a compact space defines a continuous
linear form on the Banach space (C(X),∥⋅∥∞), since
∣∫fdα∣≤∥f∥∞∣α∣(X). Conversely, the
Riesz--Markov--Kakutani representation theorem identifies every continuous
linear form on C(X) with integration against a finite signed Radon
measure Rudin, 1987Bogachev, 2007. This is the
duality M(X)=C(X)∗ that later supports convex duality.
where the supremum is over finite or countable measurable partitions of A.
If α=∑iaiδxi with distinct atoms, then
∣α∣=∑i∣ai∣δxi. If
dα(x)=ρ(x)dλ(x), then
d∣α∣(x)=∣ρ(x)∣dλ(x).
For the reverse inequality, write the Jordan decomposition
α=α+−α−. The measurable sign
s=dα/d∣α∣ takes values in {−1,1} outside a null set and satisfies
dα=sd∣α∣. By regularity of Radon measures on compact spaces, s can
be approximated in L1(∣α∣) by continuous functions fk with
∥fk∥∞≤1. Hence
∫fkdα→∫sdα=∣α∣(X).
For absolutely continuous measures
dα=ραdλ and dβ=ρβdλ,
Radon probability measures represent laws of random variables. A random
variable with values in X is a measurable map
X:Ω→X from an abstract probability space (Ω,P). Its law is
the Radon probability measure α defined by
Monge’s problem asks for a deterministic map transporting one law onto another
while minimizing a prescribed cost. It is geometrically direct, because every
source point is assigned one destination, but analytically fragile: the
feasible set is non-convex, it can be empty, and a map cannot split mass. These
limitations motivate Kantorovich’s relaxation in the next chapter.
This proves the counting statement. If every target atom has mass 1/n, each
target receives exactly one source atom, hence a permutation. The converse and
the cost identity follow by direct substitution.
Proof
A standard measure-isomorphism theorem identifies the atomless probability
space generated by α with Lebesgue measure on [0,1], modulo null sets
Bogachev, 2007. It is therefore enough to construct a map from
[0,1] to the target law β. Choose a Borel isomorphism from the support of
β onto a Borel subset of [0,1], push β through it, and use the
generalized inverse of the corresponding cumulative distribution function.
Composing back gives a measurable transport map from α to β.
The next figure shows a finite-dimensional instance of this deterministic
viewpoint. The source and target measures are empirical color clouds in RGB
space, and the map transports colors while leaving pixel positions fixed.
Grayscale equalization is one-dimensional, but full palette transfer requires
transporting empirical measures in a three-dimensional color space. Early
methods used affine statistics or iterated one-dimensional projections
Reinhard et al., 2001Pitié et al., 2005; replacing these projections by a
three-dimensional OT map gives a more intrinsic palette match
Rabin et al., 2011.
show_book_figure("monge-color-transfer-rgb")
Color transfer as a Monge map in RGB space, from a beach photograph to a
flower photograph. The top row applies the palette map to the source image; the
bottom row shows the empirical color clouds in the RGB cube. Only colors are
transported here, not pixel locations.
Interactive panel. Use the interpolation, resolution, target palette, and contrast controls to replay the RGB color transport while keeping pixel locations fixed.
The directed value Wp is useful conceptually, but it is too
rigid to be the main distance between measures: it can be infinite and
asymmetric. Kantorovich’s formulation remedies both issues by replacing maps
with couplings.
This section records the main regimes where Monge’s deterministic formulation
becomes well posed. Brenier’s theorem is the central result: for the squared
Euclidean cost, absolute continuity of the source restores existence,
uniqueness and convex-potential structure.
Brenier’s theorem Brenier, 1987Brenier, 1991 ensures that in Rd, for the
quadratic cost, absolute continuity of the source is enough for Monge’s problem
to have a unique solution. It also gives the decisive structural description:
the optimal map is the gradient of a convex potential.
Proof Sketch
The proof uses Kantorovich relaxation and duality, developed later in the book.
For the quadratic cost, dual optimality gives potentials whose equality set is
equivalent, after subtracting quadratic terms, to the Fenchel equality
ϕ(x)+ϕ∗(y)=⟨x,y⟩ for a convex function ϕ. Therefore
the support of every optimal plan lies in the graph of ∂ϕ. Since
α has a density and convex functions are differentiable Lebesgue-almost
everywhere, ∂ϕ(x) is a singleton for α-almost every x. The
relaxed optimizer is therefore induced by the map T=∇ϕ, and
uniqueness follows from concentration on the same graph.
Brenier’s theorem should be read through the analogy between convex gradients
and increasing functions. The gradient of a convex function is a monotone
field:
Brenier’s theorem provides a canonical way to extract the monotone part of an
arbitrary map. Suppose one starts from a square-integrable deformation
u:Ω→Rd. Its law ν=u♯λ records where the mass ends
up, but forgets how the points of Ω were labelled. Brenier’s polar
factorization Brenier, 1987Brenier, 1991 separates these effects: a
measure-preserving rearrangement s changes labels without changing mass, then
the unique convex-gradient map ∇ϕ sends the uniform source to the
output law.
Proof
By Brenier’s theorem there is a unique gradient of a convex function
T=∇ϕ transporting λ to ν. The maps u and T have the
same image law. The rearrangement theorem for non-atomic probability spaces
gives a measure-preserving map s such that u=T∘s. Uniqueness of the
Brenier factor follows from Brenier’s theorem.
For linear maps under a Gaussian reference, this reduces to the usual matrix
polar decomposition. If X∼N(0,Id) and u(x)=Ax, then
u♯N(0,Id)=N(0,AA⊤). The Brenier map from
N(0,Id) to this Gaussian is x↦Sx, where
S=(AA⊤)1/2 is symmetric positive semidefinite. Hence
An optimal map does not only match two endpoint measures; it tells how to draw
a path between them. Each particle keeps its identity and travels at constant
speed from its initial position to its image.
Proof
For s<t, define Ss,t=Tt∘Ts−1 along transported particles.
Then (Ss,t)♯αs=αt and
The reverse inequality follows by applying the triangle inequality to the
three legs α→αs→αt→β. For a Brenier map
T=∇ϕ, Tt is the gradient of
(1−t)∥x∥2/2+tϕ(x), which is strongly convex for every t<1 and hence
injective on the differentiability set of ϕ.
McCann displacement interpolation between a cat silhouette and a heart
silhouette. The first row displays a small farthest-point subset of transported
particles along Tt(x)=(1−t)x+tT(x). The second row renders kernel-smoothed
densities from a denser transported cloud as color images: white means zero
density, while high density saturates in the red-to-blue interpolation color of
the corresponding time.
Interactive panel. Use the interpolation and particle controls to compare the particle motion with the evolving density during McCann displacement interpolation.
The previous results identify the optimal map. Regularity theory asks when this
map is a classical smooth deformation rather than only an almost-everywhere
gradient. For quadratic costs this becomes the regularity theory of the
Monge--Ampere equation.
in the Alexandrov sense, with second boundary condition
∇ϕ(Ω)=Λ. Density bounds and convexity of the domains give
strict convexity and localization of sections. Caffarelli’s interior theory
then yields the Cloc2,α estimates
Caffarelli, 2003Villani, 2009.
For smooth densities, the change-of-variables formula gives the
Monge--Ampere equation
With suitable boundary conditions, this characterizes the Brenier potential up
to an additive constant among convex solutions. The following proposition
records the infinitesimal form.
In one dimension, optimal transport is completely explicit. The cumulative
distribution function orders the mass, and the optimal coupling is obtained by
matching equal quantile levels. This case is both a computational tool and the
template for several linearized constructions used later.
Proof
Assume first that α has a strictly positive density, so that
Fα is strictly increasing and continuous. Let
γ=(Fα−1)♯Leb[0,1]. For every x,
General measures follow from the same argument with generalized inverses and
right-continuity. If α has no atoms, the probability integral transform
gives (Fα)♯α=Leb[0,1].
satisfies T♯α=β. For the cost c(x,y)=∣x−y∣2, this map is
nondecreasing and therefore the derivative of a convex function in dimension
one.
Proof
The displayed measure is a coupling by the quantile push-forward proposition.
Its support is monotone: equal quantile levels cannot create crossing pairs.
If a coupling had two crossed pairs x<x′ and y>y′ with positive mass,
exchanging the targets decreases the cost for strictly convex powers and does
not increase it for p=1, by the two-point inequality from the matching
chapter. Eliminating crossings gives the quantile coupling. If α has no
atoms, the map formula follows from
(Fα)♯α=Leb[0,1].
Proof
The first formula follows because the optimal coupling is obtained by taking
the same quantile level r for both measures. For p=1, use the layer-cake
identity. If qα and qβ are the quantile functions, then
and the measure of the set inside the integral is
∣Fα(x)−Fβ(x)∣ for almost every x.
The quantile formula above means that
α↦Fα−1 embeds one-dimensional Wasserstein geometry
isometrically into a linear Lp space. For p=2, Wasserstein geometry on
probability measures over the real line is Hilbertian.
show_book_figure("monge-1d-quantile-geodesic")
One-dimensional transport through quantiles. The same two smooth laws are
shown as densities, cumulative functions and quantile functions. The last
panel displays the displacement interpolation obtained by the linear quantile
path Qt=(1−t)Qα+tQβ, which is the explicit one-dimensional
W2 geodesic.
Interactive panel. Use the time and endpoint controls to follow the one-dimensional Wasserstein geodesic through quantiles, CDFs, and densities.
In quantile coordinates, the interpolating measure is characterized by
For p=1, the cumulative formula above shows that W1 is a norm on signed
measures with zero total mass once they are identified with their cumulative
primitives.
There is another canonical way to build transport maps in several dimensions:
transport one coordinate at a time by conditional one-dimensional quantiles.
This construction is not usually cost-optimal, but it gives a deterministic
rearrangement under weak assumptions.
Proof
The construction is recursive. For k=1, let T1 be the monotone
rearrangement between the first marginals of α and β. Suppose
T1,…,Tk−1 have been constructed. Write
x<k=(x1,…,xk−1) and
T<k=(T1,…,Tk−1). Let αx<kk and
βy<kk be regular conditional laws of the k-th coordinate given
the previous coordinates. Define Tk(x<k,⋅) as the one-dimensional
monotone rearrangement from αx<kk to
βT<k(x<k)k. The chain rule for disintegrations shows that after
step k the first k coordinates of T♯α match those of β.
Triangular rearrangement between the same cat and heart densities as in the
McCann interpolation figure. The panels are computed directly on image
histograms. The first three transitions move mass horizontally by the monotone
rearrangement between the x-marginals; the pivot has the target horizontal
marginal. The last three transitions keep each column fixed and move mass
vertically by one-dimensional monotone rearrangements between conditional
laws.
Interactive panel. Use the horizontal and vertical interpolation sliders to inspect the Knothe triangular rearrangement one coordinate update at a time.
This construction transports successively along coordinate axes and is often
called axis-wise transport. It depends on the chosen ordering of coordinates
and is not generally optimal for the quadratic cost. It is nevertheless a
useful limiting object: Brenier maps for increasingly anisotropic quadratic
costs converge to triangular rearrangements under suitable assumptions
Carlier et al., 2010.
Proof Sketch
Let πϵ=(Id,Tϵ)♯α. Compactness gives a weakly
convergent subsequence. Optimality for
first forces any weak limit to minimize the one-dimensional quadratic cost in
the first coordinate, hence to realize the first monotone rearrangement.
Subtract that common minimum, divide by ϵ, and let
ϵ→0 to identify the second coordinate as a conditional monotone
rearrangement. Repeating gives the triangular graph coupling. Since the graph
coupling is unique, all weak limits agree. A Lusin and Portmanteau argument
then upgrades convergence in law to convergence in L2(α) because the
maps take values in a common compact set.
Gaussian measures form the most important finite-dimensional family preserved
by quadratic optimal transport. The mean moves linearly, while the covariance
follows the Bures--Wasserstein geometry of positive semidefinite matrices.
The positive square root gives the displayed formula for A. This map pushes
α to β and is a gradient of a convex quadratic potential, hence is
optimal by Brenier. If X∼α,
The covariance term B is the Bures--Wasserstein metric on positive
semidefinite matrices Bures, 1969Gelbrich, 1990Bhatia et al., 2019.
It separates Euclidean displacement of the mean from the intrinsic transport
geometry of covariance ellipsoids.
show_book_figure("monge-gaussian-w2-geodesic")
One- and two-dimensional Gaussian W2 geodesics. In one dimension, the
coordinates (m,σ) turn geodesics into Euclidean segments in the upper
half-plane. In two dimensions, means move linearly while covariance ellipses
follow the Bures--Wasserstein interpolation.
The two-dimensional Gaussian panels in the boxed figure show covariance
ellipses evolving along the Bures--Wasserstein interpolation.
Interactive panel. Use the target mean, variance, and angle controls to see how the Gaussian Wasserstein geodesic moves means and covariance ellipses.
Symmetry, positivity and separation follow immediately. The triangle
inequality follows by choosing two almost optimal orthogonal matrices and
applying the usual triangle inequality for the Frobenius norm.
Then UtUt⊤=(1−t)Σ0+tΣ1 and
VtVt⊤=(1−t)Λ0+tΛ1, while the squared Frobenius distance
is the same convex combination. Taking the infimum proves joint convexity.
Monge, G. (1781). Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale Des Sciences, 666–704.
Villani, C. (2003). Topics in Optimal Transportation (Vol. 58). American Mathematical Society.
Villani, C. (2009). Optimal Transport: Old and New (Vol. 338). Springer.
Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Birkhäuser.
Rachev, S. T., & Rüschendorf, L. (1998). Mass Transportation Problems: Volume I: Theory. Springer.
Rudin, W. (1987). Real and Complex Analysis (Third). McGraw–Hill.
Bogachev, V. I. (2007). Measure Theory. Springer.
Reinhard, E., Adhikhmin, M., Gooch, B., & Shirley, P. (2001). Color Transfer between Images. IEEE Computer Graphics and Applications, 21(5), 34–41.
Pitié, F., Kokaram, A. C., & Dahyot, R. (2005). N-dimensional Probability Density Function Transfer and Its Application to Color Transfer. IEEE International Conference on Computer Vision, 1434–1439.
Rabin, J., Peyré, G., Delon, J., & Bernot, M. (2011). Wasserstein barycenter and its application to texture mixing. International Conference on Scale Space and Variational Methods in Computer Vision, 435–446.
Brenier, Y. (1987). Décomposition polaire et réarrangement monotone des champs de vecteurs. C. R. Acad. Sci. Paris Sér. I Math., 305(19), 805–808.
Brenier, Y. (1991). Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4), 375–417.
Gangbo, W., & McCann, R. J. (1996). The geometry of optimal transportation. Acta Mathematica, 177(2), 113–161.
McCann, R. J. (1997). A convexity principle for interacting gases. Advances in Mathematics, 128(1), 153–179.
Caffarelli, L. (2003). The Monge-Ampere equation and optimal transportation, an elementary review. Lecture Notes in Mathematics, Springer-Verlag, 1–10.