Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Monge Problem between Measures

The goal of this chapter is to pass from finite matching to transport between arbitrary probability laws. The central stakes are to define measures, push-forwards and Monge maps carefully enough that the discrete picture survives, while exposing why deterministic maps can fail to exist. Monge’s original formulation Monge, 1781 and modern treatments Villani, 2003Villani, 2009Santambrogio, 2015Rachev & Rüschendorf, 1998 are the conceptual background for this transition.

The previous chapter handled two sets with the same number of points. To relax this to a more general setting, one needs probability distributions, so that points may carry unequal masses and continuous densities can be treated in the same language as finite clouds.

Measures

Measures are the language that lets point clouds, densities and singular objects be handled uniformly. We only recall the facts needed later: integration, total variation, densities and probabilistic laws.

Histograms

Discrete And Empirical Measures

The Dirac mass should be thought of as a unit of mass infinitely concentrated at one location.

An empirical probability distribution is uniform on a point cloud,

α=1ni=1nδxi.\al=\frac1n\sum_{i=1}^n\delta_{x_i}.

In applications, it is useful to manipulate both the positions xix_i and the weights aia_i. Moving the positions is a Lagrangian discretization; changing the weights is an Eulerian one. The Lagrangian view is often more adaptive, but it tends to break convexity.

General Measures

We consider Borel measures αM(X)\al\in\Mm(\X) on a metric space (X,d)(\Xx,d). This means that α(A)\al(A) is defined for every Borel set AA, obtained from open sets by countable unions, intersections and complements. Unless otherwise stated, the measures are finite.

A Dirac measure is defined by δx(A)=1\delta_x(A)=1 if xAx\in A and 0 otherwise. For the discrete measure above,

α(A)=xiAai.\al(A)=\sum_{x_i\in A} a_i .

We denote by M+(X)\Mm_+(\X) the set of positive finite measures on X\X and by M+1(X)\Mm_+^1(\X) the set of probability measures, i.e. positive measures of total mass one.

Polish Metric Spaces

Many measure-theoretic statements used later require a mild regularity assumption on the underlying space. The point is not to restrict applications, since Euclidean spaces, complete separable manifolds and separable Hilbert spaces are covered, but to exclude pathological measurable spaces where disintegration, tightness or weak convergence can fail to behave properly.

Polish spaces are the natural ambient category for probability measures. Borel probability measures on them are regular, tightness gives compactness criteria, regular conditional probabilities and disintegrations exist under standard assumptions, and Wasserstein spaces remain Polish; see Proposition Proposition: Wasserstein Spaces As Ground Spaces.

Radon Measures

Using Lebesgue integration, a Borel measure computes integrals of measurable functions. We write the duality pairing as

f,α:=f(x)dα(x).\langle f,\al\rangle \eqdef \int f(x)\d\al(x).

For a discrete measure this becomes

Xf(x)dα(x)=i=1naif(xi).\int_\X f(x)\d\al(x)=\sum_{i=1}^n a_i f(x_i).

Integration against a finite measure on a compact space defines a continuous linear form on the Banach space (C(X),)(\Cc(\Xx),\|\cdot\|_\infty), since fdαfα(X)|\int f\d\al|\leq \|f\|_\infty |\al|(\Xx). Conversely, the Riesz--Markov--Kakutani representation theorem identifies every continuous linear form on C(X)\Cc(\Xx) with integration against a finite signed Radon measure Rudin, 1987Bogachev, 2007. This is the duality M(X)=C(X)\Mm(\Xx)=\Cc(\Xx)^* that later supports convex duality.

Relative Densities

On Rd\RR^d the reference λ\lambda is often Lebesgue measure dx\d x.

Total Variation

The norm inherited from the duality M(X)=C(X)\Mm(\Xx)=\Cc(\Xx)^* is the total variation norm.

The absolute value of a signed measure is

α(A):=supA=iBiiα(Bi),|\al|(A) \eqdef \sup_{A=\cup_i B_i}\sum_i |\al(B_i)|,

where the supremum is over finite or countable measurable partitions of AA. If α=iaiδxi\al=\sum_i a_i\delta_{x_i} with distinct atoms, then α=iaiδxi|\al|=\sum_i |a_i|\delta_{x_i}. If dα(x)=ρ(x)dλ(x)\d\al(x)=\rho(x)\d\lambda(x), then dα(x)=ρ(x)dλ(x)\d|\al|(x)=|\rho(x)|\d\lambda(x).

Proof

The inequality αTVα(X)\|\al\|_{\TV}\leq|\al|(\Xx) follows from

fdαfdαfα(X).\left|\int f\d\al\right| \leq \int |f|\d|\al| \leq \|f\|_\infty |\al|(\Xx).

For the reverse inequality, write the Jordan decomposition α=α+α\al=\al^+-\al^-. The measurable sign s=dα/dαs=\d\al/\d|\al| takes values in {1,1}\{-1,1\} outside a null set and satisfies dα=sdα\d\al=s\d|\al|. By regularity of Radon measures on compact spaces, ss can be approximated in L1(α)L^1(|\al|) by continuous functions fkf_k with fk1\|f_k\|_\infty\leq1. Hence fkdαsdα=α(X)\int f_k\d\al\to\int s\d\al=|\al|(\Xx).

For absolutely continuous measures dα=ραdλ\d\al=\rho_\al\d\lambda and dβ=ρβdλ\d\be=\rho_\be\d\lambda,

αβTV=Xρα(x)ρβ(x)dλ(x).\|\al-\be\|_{\TV} = \int_\Xx |\rho_\al(x)-\rho_\be(x)|\d\lambda(x).

For two discrete measures written on the same union of supports, α=kakδzk\al=\sum_k a_k\delta_{z_k} and β=kbkδzk\be=\sum_k b_k\delta_{z_k}, with missing coefficients set to zero,

αβTV=kakbk.\|\al-\be\|_{\TV}=\sum_k |a_k-b_k|.

Probabilistic Interpretation

Radon probability measures represent laws of random variables. A random variable with values in X\X is a measurable map X:ΩXX:\Omega\to\X from an abstract probability space (Ω,P)(\Omega,\PP). Its law is the Radon probability measure α\al defined by

α(A)=P{ωΩ:X(ω)A}.\al(A)=\PP\{\omega\in\Omega : X(\omega)\in A\}.

Integrals with respect to the law are expectations:

Xf(x)dα(x)=E[f(X)].\int_\X f(x)\d\al(x)=\mathbb{E}[f(X)].

Push Forward

Push-forwards encode how maps move mass. This short section is the bridge between deterministic maps and linear operations on measures.

For a continuous map T:XY\T:\X\to\Y, the push-forward operator T:M(X)M(Y)\T_\sharp:\Mm(\X)\to\Mm(\Y) sends a Dirac mass to Tδx=δT(x)\T_\sharp\delta_x=\delta_{\T(x)}. For discrete measures,

Tα=iaiδT(xi).\T_\sharp\al=\sum_i a_i\delta_{\T(x_i)}.

Moving from T\T to T\T_\sharp linearizes the action of a map at the price of moving from the space X\Xx to the measure space M(X)\Mm(\Xx).

Proof

Using the change of variables y=T(x)y=\T(x) in the push-forward identity gives, for all hC(Y)h\in\Cc(\Yy),

Yh(y)ρβ(y)dy=Xh(T(x))ρα(x)dx=Yh(y)ρα(T1y)dydet(T(T1y)).\begin{aligned} \int_\Yy h(y)\rho_\be(y)\d y &= \int_\Xx h(\T(x))\rho_\al(x)\d x \\ &= \int_\Yy h(y)\rho_\al(\T^{-1}y) \frac{\d y}{|\det(\T'(\T^{-1}y))|}. \end{aligned}

Since this holds for all test functions, the densities agree. Rewriting with y=T(x)y=\T(x) gives the displayed formula.

Monge’s Formulation

Monge’s problem asks for a deterministic map transporting one law onto another while minimizing a prescribed cost. It is geometrically direct, because every source point is assigned one destination, but analytically fragile: the feasible set is non-convex, it can be empty, and a map cannot split mass. These limitations motivate Kantorovich’s relaxation in the next chapter.

Monge Problem

Given αM+1(X)\al\in\Mm_+^1(\Xx), βM+1(Y)\be\in\Mm_+^1(\Yy) and a cost c:X×YR+c:\Xx\times\Yy\to\RR_+, the Monge problem is

Mc(α,β):=infT:XY{Xc(x,T(x))dα(x)  :  Tα=β}.\Monge_c(\al,\be) \eqdef \inf_{\T:\Xx\to\Yy} \left\{ \int_\Xx c(x,\T(x))\d\al(x) \;:\; \T_\sharp\al=\be \right\}.

The constraint Tα=β\T_\sharp\al=\be means that T\T pushes the mass of α\al onto β\be.

Proof

The push-forward condition implies that all images T(xi)\T(x_i) belong to the support of β\be. For any target atom zz,

β({z})=α(T1({z}))=1n#{i:T(xi)=z}.\be(\{z\}) = \al(\T^{-1}(\{z\})) = \frac1n\#\{i:\T(x_i)=z\}.

This proves the counting statement. If every target atom has mass 1/n1/n, each target receives exactly one source atom, hence a permutation. The converse and the cost identity follow by direct substitution.

Proof

A standard measure-isomorphism theorem identifies the atomless probability space generated by α\al with Lebesgue measure on [0,1][0,1], modulo null sets Bogachev, 2007. It is therefore enough to construct a map from [0,1][0,1] to the target law β\be. Choose a Borel isomorphism from the support of β\be onto a Borel subset of [0,1][0,1], push β\be through it, and use the generalized inverse of the corresponding cumulative distribution function. Composing back gives a measurable transport map from α\al to β\be.

The next figure shows a finite-dimensional instance of this deterministic viewpoint. The source and target measures are empirical color clouds in RGB space, and the map transports colors while leaving pixel positions fixed. Grayscale equalization is one-dimensional, but full palette transfer requires transporting empirical measures in a three-dimensional color space. Early methods used affine statistics or iterated one-dimensional projections Reinhard et al., 2001Pitié et al., 2005; replacing these projections by a three-dimensional OT map gives a more intrinsic palette match Rabin et al., 2011.

<IPython.core.display.Image object>

Color transfer as a Monge map in RGB space, from a beach photograph to a flower photograph. The top row applies the palette map to the source image; the bottom row shows the empirical color clouds in the RGB cube. Only colors are transported here, not pixel locations.

Interactive panel. Use the interpolation, resolution, target palette, and contrast controls to replay the RGB color transport while keeping pixel locations fixed.

Monge Distance

When X=Y\Xx=\Yy and c(x,y)=d(x,y)pc(x,y)=d(x,y)^p for a metric dd, set

Eα(T):=Xd(x,T(x))pdα(x).\mathcal{E}_\al(\T) \eqdef \int_\Xx d(x,\T(x))^p\d\al(x).

The Monge value defines the directed quantity

W~p(α,β)p:=infT:XX{Eα(T)  :  Tα=β}.\widetilde{\Wass}_p(\al,\be)^p \eqdef \inf_{\T:\Xx\to\Xx} \left\{ \mathcal{E}_\al(\T) \;:\; \T_\sharp\al=\be \right\}.

If the constraint set is empty, then W~p(α,β)=+\widetilde{\Wass}_p(\al,\be)=+\infty.

Proof

Nonnegativity is immediate. If W~p(α,β)=0\widetilde{\Wass}_p(\al,\be)=0, choose feasible maps Tk\T_k with d(x,Tk(x))pdα(x)0\int d(x,\T_k(x))^p\d\al(x)\to0. For every bounded 1-Lipschitz function hh,

hdβhdα=h(Tk(x))h(x)dα(x)(d(x,Tk(x))pdα(x))1/p0.\left|\int h\d\be-\int h\d\al\right| = \left|\int h(\T_k(x))-h(x)\d\al(x)\right| \leq \left(\int d(x,\T_k(x))^p\d\al(x)\right)^{1/p} \to0.

Since bounded Lipschitz functions separate probability measures on metric spaces, α=β\al=\be.

For the triangle inequality, take ϵ\epsilon-minimizers Sα=γS_\sharp\al=\ga and Tγ=βT_\sharp\ga=\be. The composition TST\circ S is feasible from α\al to β\be, and Minkowski’s inequality gives

W~p(α,β)Eα(S)1/p+Eγ(T)1/pW~p(α,γ)+W~p(γ,β)+2ϵ.\widetilde{\Wass}_p(\al,\be) \leq \mathcal{E}_\al(S)^{1/p} + \mathcal{E}_\ga(T)^{1/p} \leq \widetilde{\Wass}_p(\al,\ga) + \widetilde{\Wass}_p(\ga,\be) + 2\epsilon .

Letting ϵ0\epsilon\to0 proves the result.

The directed value W~p\widetilde{\Wass}_p is useful conceptually, but it is too rigid to be the main distance between measures: it can be infinite and asymmetric. Kantorovich’s formulation remedies both issues by replacing maps with couplings.

Existence And Uniqueness Of The Monge Map

This section records the main regimes where Monge’s deterministic formulation becomes well posed. Brenier’s theorem is the central result: for the squared Euclidean cost, absolute continuity of the source restores existence, uniqueness and convex-potential structure.

Brenier’s Theorem

Brenier’s theorem Brenier, 1987Brenier, 1991 ensures that in Rd\RR^d, for the quadratic cost, absolute continuity of the source is enough for Monge’s problem to have a unique solution. It also gives the decisive structural description: the optimal map is the gradient of a convex potential.

Proof Sketch

The proof uses Kantorovich relaxation and duality, developed later in the book. For the quadratic cost, dual optimality gives potentials whose equality set is equivalent, after subtracting quadratic terms, to the Fenchel equality ϕ(x)+ϕ(y)=x,y\phi(x)+\phi^*(y)=\langle x,y\rangle for a convex function ϕ\phi. Therefore the support of every optimal plan lies in the graph of ϕ\partial\phi. Since α\al has a density and convex functions are differentiable Lebesgue-almost everywhere, ϕ(x)\partial\phi(x) is a singleton for α\al-almost every xx. The relaxed optimizer is therefore induced by the map T=ϕ\T=\nabla\phi, and uniqueness follows from concentration on the same graph.

Brenier’s theorem should be read through the analogy between convex gradients and increasing functions. The gradient of a convex function is a monotone field:

ϕ(x)ϕ(x),xx0.\langle \nabla\phi(x)-\nabla\phi(x'),x-x'\rangle\geq0.

Polar Factorization

Brenier’s theorem provides a canonical way to extract the monotone part of an arbitrary map. Suppose one starts from a square-integrable deformation u:ΩRdu:\Omega\to\RR^d. Its law ν=uλ\nu=u_\sharp\lambda records where the mass ends up, but forgets how the points of Ω\Omega were labelled. Brenier’s polar factorization Brenier, 1987Brenier, 1991 separates these effects: a measure-preserving rearrangement ss changes labels without changing mass, then the unique convex-gradient map ϕ\nabla\phi sends the uniform source to the output law.

Proof

By Brenier’s theorem there is a unique gradient of a convex function T=ϕ\T=\nabla\phi transporting λ\lambda to ν\nu. The maps uu and T\T have the same image law. The rearrangement theorem for non-atomic probability spaces gives a measure-preserving map ss such that u=Tsu=\T\circ s. Uniqueness of the Brenier factor follows from Brenier’s theorem.

For linear maps under a Gaussian reference, this reduces to the usual matrix polar decomposition. If XN(0,Id)X\sim\Gaussian(0,\Id) and u(x)=Axu(x)=Ax, then uN(0,Id)=N(0,AA)u_\sharp\Gaussian(0,\Id)=\Gaussian(0,AA^\top). The Brenier map from N(0,Id)\Gaussian(0,\Id) to this Gaussian is xSxx\mapsto Sx, where S=(AA)1/2S=(AA^\top)^{1/2} is symmetric positive semidefinite. Hence

A=SO,O=SA.A=SO, \qquad O=S^\dagger A.

The factor SxSx is the convex-gradient transport part, while OxOx preserves the standard Gaussian law.

Displacement Interpolation

An optimal map does not only match two endpoint measures; it tells how to draw a path between them. Each particle keeps its identity and travels at constant speed from its initial position to its image.

Proof

For s<ts<t, define Ss,t=TtTs1S_{s,t}=\T_t\circ\T_s^{-1} along transported particles. Then (Ss,t)αs=αt(S_{s,t})_\sharp\al_s=\al_t and

W~p(αs,αt)pTt(x)Ts(x)pdα(x)=(ts)pW~p(α,β)p.\widetilde{\Wass}_p(\al_s,\al_t)^p \leq \int \|\T_t(x)-\T_s(x)\|^p\d\al(x) = (t-s)^p\widetilde{\Wass}_p(\al,\be)^p.

The reverse inequality follows by applying the triangle inequality to the three legs ααsαtβ\al\to\al_s\to\al_t\to\be. For a Brenier map T=ϕ\T=\nabla\phi, Tt\T_t is the gradient of (1t)x2/2+tϕ(x)(1-t)\|x\|^2/2+t\phi(x), which is strongly convex for every t<1t<1 and hence injective on the differentiability set of ϕ\phi.

<IPython.core.display.Image object>

McCann displacement interpolation between a cat silhouette and a heart silhouette. The first row displays a small farthest-point subset of transported particles along Tt(x)=(1t)x+tT(x)T_t(x)=(1-t)x+tT(x). The second row renders kernel-smoothed densities from a denser transported cloud as color images: white means zero density, while high density saturates in the red-to-blue interpolation color of the corresponding time.

Interactive panel. Use the interpolation and particle controls to compare the particle motion with the evolving density during McCann displacement interpolation.

Regularity And The Monge-Ampere Equation

The previous results identify the optimal map. Regularity theory asks when this map is a classical smooth deformation rather than only an almost-everywhere gradient. For quadratic costs this becomes the regularity theory of the Monge--Ampere equation.

Proof Sketch

The potential solves the Monge--Ampere equation

det(2ϕ(x))=ρ(x)η(ϕ(x))\det(\nabla^2\phi(x)) = \frac{\rho(x)}{\eta(\nabla\phi(x))}

in the Alexandrov sense, with second boundary condition ϕ(Ω)=Λ\nabla\phi(\Omega)=\Lambda. Density bounds and convexity of the domains give strict convexity and localization of sections. Caffarelli’s interior theory then yields the Cloc2,αC^{2,\alpha}_{\mathrm{loc}} estimates Caffarelli, 2003Villani, 2009.

For smooth densities, the change-of-variables formula gives the Monge--Ampere equation

det(2ϕ(x))ρβ(ϕ(x))=ρα(x).\det(\nabla^2\phi(x))\density{\be}(\nabla\phi(x)) = \density{\al}(x).

With suitable boundary conditions, this characterizes the Brenier potential up to an additive constant among convex solutions. The following proposition records the infinitesimal form.

Proof

The change-of-variables equation for Tϵ\T_\epsilon is

ρ0(x)=ρϵ(Tϵ(x))det(Tϵ(x)).\rho_0(x) = \rho_\epsilon(\T_\epsilon(x))\det(\nabla\T_\epsilon(x)).

Expanding ρϵ(x+ϵu)=ρ0(x)+ϵr(x)+ϵρ0,u+o(ϵ)\rho_\epsilon(x+\epsilon\nabla u)=\rho_0(x)+\epsilon r(x)+ \epsilon\langle\nabla\rho_0,\nabla u\rangle+o(\epsilon) and

det(I+ϵ2u)=1+ϵΔu+o(ϵ)\det(I+\epsilon\nabla^2u)=1+\epsilon\Delta u+o(\epsilon)

gives

ρ0=ρ0+ϵ(r+(ρ0u))+o(ϵ).\rho_0 = \rho_0 + \epsilon\left(r+\nabla\cdot(\rho_0\nabla u)\right) + o(\epsilon).

The first-order term must vanish.

One-Dimensional Transport And Quantiles

In one dimension, optimal transport is completely explicit. The cumulative distribution function orders the mass, and the optimal coupling is obtained by matching equal quantile levels. This case is both a computational tool and the template for several linearized constructions used later.

Proof

Assume first that α\al has a strictly positive density, so that Fα\cumul{\al} is strictly increasing and continuous. Let γ=(Fα1)Leb[0,1]\ga=(\cumul{\al}^{-1})_\sharp\mathrm{Leb}_{[0,1]}. For every xx,

Fγ(x)=011(,x](Fα1(z))dz=011[0,Fα(x)](z)dz=Fα(x).\cumul{\ga}(x) = \int_0^1 \mathbf{1}_{(-\infty,x]}(\cumul{\al}^{-1}(z))\d z = \int_0^1 \mathbf{1}_{[0,\cumul{\al}(x)]}(z)\d z = \cumul{\al}(x).

General measures follow from the same argument with generalized inverses and right-continuity. If α\al has no atoms, the probability integral transform gives (Fα)α=Leb[0,1](\cumul{\al})_\sharp\al=\mathrm{Leb}_{[0,1]}.

If α\al has no atoms, the map

T=Fβ1Fα\T=\cumul{\be}^{-1}\circ\cumul{\al}

satisfies Tα=β\T_\sharp\al=\be. For the cost c(x,y)=xy2c(x,y)=|x-y|^2, this map is nondecreasing and therefore the derivative of a convex function in dimension one.

Proof

The displayed measure is a coupling by the quantile push-forward proposition. Its support is monotone: equal quantile levels cannot create crossing pairs. If a coupling had two crossed pairs x<xx<x' and y>yy>y' with positive mass, exchanging the targets decreases the cost for strictly convex powers and does not increase it for p=1p=1, by the two-point inequality from the matching chapter. Eliminating crossings gives the quantile coupling. If α\al has no atoms, the map formula follows from (Fα)α=Leb[0,1](\cumul{\al})_\sharp\al=\mathrm{Leb}_{[0,1]}.

Proof

The first formula follows because the optimal coupling is obtained by taking the same quantile level rr for both measures. For p=1p=1, use the layer-cake identity. If qαq_\al and qβq_\be are the quantile functions, then

01qα(r)qβ(r)dr=Rλ{r:qα(r)x<qβ(r) or qβ(r)x<qα(r)}dx,\int_0^1 |q_\al(r)-q_\be(r)|\d r = \int_\RR \lambda\{r:q_\al(r)\leq x<q_\be(r)\ \text{or}\ q_\be(r)\leq x<q_\al(r)\} \d x,

and the measure of the set inside the integral is Fα(x)Fβ(x)|\cumul{\al}(x)-\cumul{\be}(x)| for almost every xx.

The quantile formula above means that αFα1\al\mapsto\cumul{\al}^{-1} embeds one-dimensional Wasserstein geometry isometrically into a linear LpL^p space. For p=2p=2, Wasserstein geometry on probability measures over the real line is Hilbertian.

<IPython.core.display.Image object>

One-dimensional transport through quantiles. The same two smooth laws are shown as densities, cumulative functions and quantile functions. The last panel displays the displacement interpolation obtained by the linear quantile path Qt=(1t)Qα+tQβQ_t=(1-t)Q_\alpha+tQ_\beta, which is the explicit one-dimensional W2\Wass_2 geodesic.

Interactive panel. Use the time and endpoint controls to follow the one-dimensional Wasserstein geodesic through quantiles, CDFs, and densities.

In quantile coordinates, the interpolating measure is characterized by

Fαt1(r)=(1t)Fα1(r)+tFβ1(r),r(0,1).\cumul{\al_t}^{-1}(r) = (1-t)\cumul{\al}^{-1}(r)+t\cumul{\be}^{-1}(r), \qquad r\in(0,1).

For p=1p=1, the cumulative formula above shows that W1\Wass_1 is a norm on signed measures with zero total mass once they are identified with their cumulative primitives.

Triangular Rearrangements

There is another canonical way to build transport maps in several dimensions: transport one coordinate at a time by conditional one-dimensional quantiles. This construction is not usually cost-optimal, but it gives a deterministic rearrangement under weak assumptions.

Proof

The construction is recursive. For k=1k=1, let T1\T_1 be the monotone rearrangement between the first marginals of α\al and β\be. Suppose T1,,Tk1\T_1,\ldots,\T_{k-1} have been constructed. Write x<k=(x1,,xk1)x_{<k}=(x_1,\ldots,x_{k-1}) and T<k=(T1,,Tk1)\T_{<k}=(\T_1,\ldots,\T_{k-1}). Let αx<kk\al^k_{x_{<k}} and βy<kk\be^k_{y_{<k}} be regular conditional laws of the kk-th coordinate given the previous coordinates. Define Tk(x<k,)\T_k(x_{<k},\cdot) as the one-dimensional monotone rearrangement from αx<kk\al^k_{x_{<k}} to βT<k(x<k)k\be^k_{\T_{<k}(x_{<k})}. The chain rule for disintegrations shows that after step kk the first kk coordinates of Tα\T_\sharp\al match those of β\be.

<IPython.core.display.Image object>

Triangular rearrangement between the same cat and heart densities as in the McCann interpolation figure. The panels are computed directly on image histograms. The first three transitions move mass horizontally by the monotone rearrangement between the xx-marginals; the pivot has the target horizontal marginal. The last three transitions keep each column fixed and move mass vertically by one-dimensional monotone rearrangements between conditional laws.

Interactive panel. Use the horizontal and vertical interpolation sliders to inspect the Knothe triangular rearrangement one coordinate update at a time.

This construction transports successively along coordinate axes and is often called axis-wise transport. It depends on the chosen ordering of coordinates and is not generally optimal for the quadratic cost. It is nevertheless a useful limiting object: Brenier maps for increasingly anisotropic quadratic costs converge to triangular rearrangements under suitable assumptions Carlier et al., 2010.

Proof Sketch

Let πϵ=(Id,Tϵ)α\pi_\epsilon=(\Id,T_\epsilon)_\sharp\alpha. Compactness gives a weakly convergent subsequence. Optimality for

Fϵ(γ)=k=1dϵk1xkyk2dγ(x,y)F_\epsilon(\gamma) = \sum_{k=1}^d \epsilon^{k-1} \int |x_k-y_k|^2\d\gamma(x,y)

first forces any weak limit to minimize the one-dimensional quadratic cost in the first coordinate, hence to realize the first monotone rearrangement. Subtract that common minimum, divide by ϵ\epsilon, and let ϵ0\epsilon\to0 to identify the second coordinate as a conditional monotone rearrangement. Repeating gives the triangular graph coupling. Since the graph coupling is unique, all weak limits agree. A Lusin and Portmanteau argument then upgrades convergence in law to convergence in L2(α)L^2(\alpha) because the maps take values in a common compact set.

Gaussian Measures And The Bures Metric

Gaussian measures form the most important finite-dimensional family preserved by quadratic optimal transport. The mean moves linearly, while the covariance follows the Bures--Wasserstein geometry of positive semidefinite matrices.

One-Dimensional Gaussians

Let α=N(mα,σα2)\al=\Gaussian(m_\al,\sigma_\al^2) and β=N(mβ,σβ2)\be=\Gaussian(m_\be,\sigma_\be^2) be nondegenerate Gaussians on R\RR. Then

T(x)=σβσα(xmα)+mβ\T(x)=\frac{\sigma_\be}{\sigma_\al}(x-m_\al)+m_\be

satisfies Tα=β\T_\sharp\al=\be. It is the derivative of the convex function

ϕ(x)=σβ2σα(xmα)2+mβx,\phi(x)=\frac{\sigma_\be}{2\sigma_\al}(x-m_\al)^2+m_\be x,

so Brenier’s theorem shows that it is the optimal quadratic transport. The distance is

W2(α,β)2=(mαmβ)2+(σασβ)2.\Wass_2(\al,\be)^2 = (m_\al-m_\be)^2+(\sigma_\al-\sigma_\be)^2.

Thus the OT geometry of one-dimensional Gaussians is the Euclidean geometry of the half-plane (m,σ)R×R+(m,\sigma)\in\RR\times\RR_+.

Multivariate Gaussians

If

α=N(mα,Σα),β=N(mβ,Σβ),T(x)=mβ+A(xmα),\al=\Gaussian(\mean_\al,\cov_\al), \qquad \be=\Gaussian(\mean_\be,\cov_\be), \qquad \T(x)=\mean_\be+A(x-\mean_\al),

then T\T is the gradient of a convex quadratic potential if and only if AA is symmetric positive semidefinite.

Proof

An affine function maps a Gaussian to a Gaussian, so the law of T(X)\T(X) is determined by mean and covariance. If XαX\sim\al and Y=T(X)Y=\T(X), then

E(Y)=mβ,E((Ymβ)(Ymβ))=AΣαA.\mathbb{E}(Y)=\mean_\be, \qquad \mathbb{E}((Y-\mean_\be)(Y-\mean_\be)^\top) = A\cov_\al A^\top.
Proof

Multiplying AΣαA=ΣβA\cov_\al A=\cov_\be on the left and right by Σα1/2\cov_\al^{1/2} gives

(Σα1/2AΣα1/2)2=Σα1/2ΣβΣα1/2.(\cov_\al^{1/2}A\cov_\al^{1/2})^2 = \cov_\al^{1/2}\cov_\be\cov_\al^{1/2}.

The positive square root gives the displayed formula for AA. This map pushes α\al to β\be and is a gradient of a convex quadratic potential, hence is optimal by Brenier. If XαX\sim\al,

EXT(X)2=mαmβ2+tr((IA)Σα(IA))=mαmβ2+tr(Σα)+tr(Σβ)2tr((Σα1/2ΣβΣα1/2)1/2).\begin{aligned} \mathbb{E}\|X-\T(X)\|^2 &= \|\mean_\al-\mean_\be\|^2 + \tr((I-A)\cov_\al(I-A)^\top) \\ &= \|\mean_\al-\mean_\be\|^2 + \tr(\cov_\al)+\tr(\cov_\be) -2\tr((\cov_\al^{1/2}\cov_\be\cov_\al^{1/2})^{1/2}). \end{aligned}

The covariance term B\Bb is the Bures--Wasserstein metric on positive semidefinite matrices Bures, 1969Gelbrich, 1990Bhatia et al., 2019. It separates Euclidean displacement of the mean from the intrinsic transport geometry of covariance ellipsoids.

<IPython.core.display.Image object>

One- and two-dimensional Gaussian W2\Wass_2 geodesics. In one dimension, the coordinates (m,σ)(m,\sigma) turn geodesics into Euclidean segments in the upper half-plane. In two dimensions, means move linearly while covariance ellipses follow the Bures--Wasserstein interpolation.

The two-dimensional Gaussian panels in the boxed figure show covariance ellipses evolving along the Bures--Wasserstein interpolation.

Interactive panel. Use the target mean, variance, and angle controls to see how the Gaussian Wasserstein geodesic moves means and covariance ellipses.

Proof

The key identity is the Procrustes representation

B2(Σ,Λ)=minQQ=IΣ1/2Λ1/2QF2.\Bb^2(\Sigma,\Lambda) = \min_{Q^\top Q=I}\|\Sigma^{1/2}-\Lambda^{1/2}Q\|_F^2.

Symmetry, positivity and separation follow immediately. The triangle inequality follows by choosing two almost optimal orthogonal matrices and applying the usual triangle inequality for the Frobenius norm.

For convexity, use the factor formulation

B2(Σ,Λ)=minUU=Σ,  VV=ΛUVF2.\Bb^2(\Sigma,\Lambda) = \min_{UU^\top=\Sigma,\;VV^\top=\Lambda}\|U-V\|_F^2.

Choose nearly optimal factors (U0,V0)(U_0,V_0) and (U1,V1)(U_1,V_1), and set

Ut=[1tU0,tU1],Vt=[1tV0,tV1].U_t=[\sqrt{1-t}\,U_0,\sqrt t\,U_1], \qquad V_t=[\sqrt{1-t}\,V_0,\sqrt t\,V_1].

Then UtUt=(1t)Σ0+tΣ1U_tU_t^\top=(1-t)\Sigma_0+t\Sigma_1 and VtVt=(1t)Λ0+tΛ1V_tV_t^\top=(1-t)\Lambda_0+t\Lambda_1, while the squared Frobenius distance is the same convex combination. Taking the infimum proves joint convexity.

References
  1. Monge, G. (1781). Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale Des Sciences, 666–704.
  2. Villani, C. (2003). Topics in Optimal Transportation (Vol. 58). American Mathematical Society.
  3. Villani, C. (2009). Optimal Transport: Old and New (Vol. 338). Springer.
  4. Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Birkhäuser.
  5. Rachev, S. T., & Rüschendorf, L. (1998). Mass Transportation Problems: Volume I: Theory. Springer.
  6. Rudin, W. (1987). Real and Complex Analysis (Third). McGraw–Hill.
  7. Bogachev, V. I. (2007). Measure Theory. Springer.
  8. Reinhard, E., Adhikhmin, M., Gooch, B., & Shirley, P. (2001). Color Transfer between Images. IEEE Computer Graphics and Applications, 21(5), 34–41.
  9. Pitié, F., Kokaram, A. C., & Dahyot, R. (2005). N-dimensional Probability Density Function Transfer and Its Application to Color Transfer. IEEE International Conference on Computer Vision, 1434–1439.
  10. Rabin, J., Peyré, G., Delon, J., & Bernot, M. (2011). Wasserstein barycenter and its application to texture mixing. International Conference on Scale Space and Variational Methods in Computer Vision, 435–446.
  11. Brenier, Y. (1987). Décomposition polaire et réarrangement monotone des champs de vecteurs. C. R. Acad. Sci. Paris Sér. I Math., 305(19), 805–808.
  12. Brenier, Y. (1991). Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4), 375–417.
  13. Gangbo, W., & McCann, R. J. (1996). The geometry of optimal transportation. Acta Mathematica, 177(2), 113–161.
  14. McCann, R. J. (1997). A convexity principle for interacting gases. Advances in Mathematics, 128(1), 153–179.
  15. Caffarelli, L. (2003). The Monge-Ampere equation and optimal transportation, an elementary review. Lecture Notes in Mathematics, Springer-Verlag, 1–10.