Monge Problem between Measures

The goal of this chapter is to pass from finite matching to transport between arbitrary probability laws. The aim is to define measures, push-forwards and Monge maps carefully enough that the discrete picture survives, while exposing why deterministic maps can fail to exist. Monge’s original formulation (Monge, 1781) and modern treatments (Villani, 2003; Villani, 2009; Santambrogio, 2015; Rachev & Rüschendorf, 1998) are the conceptual background for this transition.

The previous chapter handled two sets with the same number of points. To relax this to a more general setting, one needs probability distributions, so that points may carry unequal masses and continuous densities can be treated in the same language as finite clouds.

from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

# Locate the myst/ directory that contains the book's helper module.
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        # Make ot4ml_web importable from this notebook.
        sys.path.insert(0, str(myst_dir))
        break

if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

# Figure thumbnails are expected under <repo_root>/notebooks-figures/thumbnails.
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"

def show_book_figure(name, width=760):
    """Display the thumbnail <name>.png at the given pixel width."""
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))

Measures¶

Measures are the language that lets point clouds, densities and singular objects be handled uniformly. We only recall the facts needed later: integration, total variation, densities and probabilistic laws.

Histograms¶

Discrete And Empirical Measures¶

The Dirac mass should be thought of as a unit of mass infinitely concentrated at one location.

An empirical probability distribution is uniform on a point cloud,

\al=\frac1n\sum_{i=1}^n\delta_{x_i}.

(3)

In applications, it is useful to manipulate both the positions $x_i$ and the weights $a_i$ of a general discrete measure $\al=\sum_i a_i\delta_{x_i}$. Moving the positions is a Lagrangian discretization; changing the weights is an Eulerian one. The Lagrangian view is often more adaptive, but it tends to break convexity.

General Measures¶

We write $\Mm(\X)$ for the finite signed Borel measures on a metric space $(\Xx,d)$ . The Borel sets form the smallest $\sigma$ -algebra containing the open subsets of $\Xx$ , obtained by closing the open sets under complements and countable unions. Unless otherwise stated, all measures are finite.

A Dirac measure is defined by $\delta_x(A)=1$ if $x\in A$ and 0 otherwise. For the discrete measure above,

\al(A)=\sum_{x_i\in A} a_i .

(4)

We denote by $\Mm_+(\X)$ the set of positive finite measures on $\X$ and by $\Mm_+^1(\X)$ the set of probability measures, i.e. positive measures of total mass one.

Polish Metric Spaces¶

Many measure-theoretic statements used later require a mild regularity assumption on the underlying space. The point is not to restrict applications, since Euclidean spaces, complete separable manifolds and separable Hilbert spaces are covered, but to exclude pathological measurable spaces where disintegration, tightness or weak convergence can fail to behave properly.

This assumption already includes genuinely infinite-dimensional spaces. In the Euclidean dynamical formulations used later, the path space

\Ss=C([0,1];\RR^d),\qquad d_\infty(\gamma,\eta) \eqdef \sup_{t\in[0,1]}\|\gamma(t)-\eta(t)\|

(5)

is Polish. Completeness follows because a uniformly Cauchy sequence of continuous paths converges uniformly to a continuous path, and separability follows by approximating paths uniformly by piecewise-linear paths with rational breakpoints and rational values. This example is used later as the state space for laws of trajectories: the endpoint evaluation maps are continuous on $\Ss$, which makes endpoint constraints well behaved for dynamical optimal plans and Schrödinger bridges; see the sections Path-Space Formulation and Path-Space Schrodinger Problem.

Polish spaces are the natural ambient category for probability measures. Borel probability measures on them are regular, tightness gives compactness criteria, regular conditional probabilities and disintegrations exist under standard assumptions, and Wasserstein spaces remain Polish; see Proposition: Wasserstein Spaces As Ground Spaces.

Radon Measures¶

A positive Borel measure is Radon if it is inner regular, meaning that the mass of every Borel set is the supremum of the masses of its compact subsets. Every finite Borel measure on a Polish space is Radon. Such a measure integrates measurable functions, and we write the pairing as

\langle f,\al\rangle \eqdef \int f(x)\d\al(x).

(7)

For a discrete measure this becomes

\int_\X f(x)\d\al(x)=\sum_{i=1}^n a_i f(x_i).

(8)
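As a minimal sketch of formulas (4) and (8), a discrete measure can be stored as a list of positions and a list of weights. The helper names `integrate` and `measure_of` are hypothetical and chosen only for this illustration.

```python
# Hypothetical helpers: a discrete measure is stored as (points, weights).

def integrate(f, points, weights):
    """Return sum_i a_i f(x_i), the integral of f against sum_i a_i delta_{x_i}."""
    return sum(a * f(x) for x, a in zip(points, weights))


def measure_of(indicator, points, weights):
    """Return alpha(A), where the set A is described by its indicator function."""
    return sum(a for x, a in zip(points, weights) if indicator(x))


points = [0.0, 1.0, 3.0]
weights = [0.5, 0.25, 0.25]

# Mean of the measure: 0*0.5 + 1*0.25 + 3*0.25.
mean = integrate(lambda x: x, points, weights)
# Mass of the interval [0, 1]: atoms at 0 and 1 contribute.
mass_of_unit_interval = measure_of(lambda x: 0.0 <= x <= 1.0, points, weights)
```

Taking `f` to be an indicator function in `integrate` recovers `measure_of`, which is the sense in which (4) is a special case of (8).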

Integration against a finite measure on a compact space defines a continuous linear form on the Banach space $(\Cc(\Xx),\|\cdot\|_\infty)$, since $|\int f\d\al|\leq \|f\|_\infty |\al|(\Xx)$. Conversely, the Riesz--Markov--Kakutani representation theorem identifies every continuous linear form on $\Cc(\Xx)$ with integration against a finite signed Radon measure (Rudin, 1987; Bogachev, 2007). This is the duality $\Mm(\Xx)=\Cc(\Xx)^*$ that later supports convex duality.

Relative Densities¶

On $\RR^d$ the reference $\lambda$ is often Lebesgue measure $\d x$ .

Total Variation¶

The norm inherited from the duality $\Mm(\Xx)=\Cc(\Xx)^*$ is the total variation norm $\|\al\|_{\TV}\eqdef|\al|(\Xx)$, the total mass of the absolute value $|\al|$.

The absolute value of a signed measure is

|\al|(A) \eqdef \sup_{A=\cup_i B_i}\sum_i |\al(B_i)|,

(12)

where the supremum is over finite or countable measurable partitions of $A$ . If $\al=\sum_i a_i\delta_{x_i}$ with distinct atoms, then $|\al|=\sum_i |a_i|\delta_{x_i}$ . If $\d\al(x)=\rho(x)\d\lambda(x)$ , then $\d|\al|(x)=|\rho(x)|\d\lambda(x)$ .

For absolutely continuous measures $\d\al=\rho_\al\d\lambda$ and $\d\be=\rho_\be\d\lambda$ ,

\|\al-\be\|_{\TV} = \int_\Xx |\rho_\al(x)-\rho_\be(x)|\d\lambda(x).

(15)

For two discrete measures, one first recasts both measures as weight vectors on the same finite support. If

\al=\sum_{i=1}^n \tilde a_i\delta_{x_i}, \qquad \be=\sum_{j=1}^m \tilde b_j\delta_{y_j},

(16)

let $z_1,\ldots,z_r$ be the distinct points in the union $\{x_i\}_i\cup\{y_j\}_j$ and set

a_k=\sum_{i:x_i=z_k}\tilde a_i, \qquad b_k=\sum_{j:y_j=z_k}\tilde b_j.

(17)

Then $\al=\sum_{k=1}^r a_k\delta_{z_k}$ and $\be=\sum_{k=1}^r b_k\delta_{z_k}$ . This convention merges masses located at the same point before comparing the two measures. If the two input supports are already enumerated without repetitions and are disjoint, this simply amounts to taking $(z_k)_k=(x_1,\ldots,x_n,y_1,\ldots,y_m)$ and padding the weights as $a=(\tilde a,0_m)$ and $b=(0_n,\tilde b)$ . With this common-support notation,

\|\al-\be\|_{\TV}=\sum_{k=1}^r |a_k-b_k|.

(18)
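The merging convention (17) and the formula (18) can be sketched as follows. The function names are hypothetical, and the sketch assumes that coincident atoms are detected by exact equality of their coordinates, which is only reasonable for exactly representable positions.

```python
from collections import defaultdict


def merge_atoms(points, weights):
    """Sum the weights of atoms located at the same point, as in (17)."""
    merged = defaultdict(float)
    for x, w in zip(points, weights):
        merged[x] += w
    return dict(merged)


def tv_norm_discrete(x, a_tilde, y, b_tilde):
    """Total variation norm of alpha - beta on the common support, as in (18)."""
    a = merge_atoms(x, a_tilde)
    b = merge_atoms(y, b_tilde)
    support = set(a) | set(b)
    return sum(abs(a.get(z, 0.0) - b.get(z, 0.0)) for z in support)


# The source has a repeated atom at 1, which is merged before comparison.
tv_overlap = tv_norm_discrete([0, 1, 1], [0.5, 0.25, 0.25], [1, 2], [0.5, 0.5])
# Disjoint supports: the two probability measures are at maximal distance 2.
tv_disjoint = tv_norm_discrete([0, 1], [0.5, 0.5], [2, 3], [0.5, 0.5])
```

The disjoint case illustrates why total variation ignores geometry: the value is 2 however close the two supports are.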

Probabilistic Interpretation¶

Probability measures represent laws of random variables. Let $(\Omega,\Ff,\PP)$ be an abstract probability space and let $\X$ be Polish. A random variable with values in $\X$ is a measurable map $X:(\Omega,\Ff)\to(\X,\Bb(\X))$ , or simply $X:\Omega\to\X$ once the measurable structures are understood. Its law is the Radon probability measure $\al=X_\sharp\PP$ defined by

\al(A)=\PP\{\omega\in\Omega : X(\omega)\in A\}.

(19)

For every integrable $f$ , integration with respect to the law is expectation:

\int_\X f(x)\d\al(x)=\mathbb{E}[f(X)].

(20)

Push Forward¶

Push-forwards encode how maps move mass. This short section is the bridge between deterministic maps and linear operations on measures.

For a measurable map $\T:\X\to\Y$ , the push-forward operator $\T_\sharp:\Mm(\X)\to\Mm(\Y)$ records the distribution of image points. It sends a Dirac mass to $\T_\sharp\delta_x=\delta_{\T(x)}$ . For discrete measures,

\T_\sharp\al=\sum_i a_i\delta_{\T(x_i)}.

(21)

The operation is linear in the input measure, although its definition for a general measure uses inverse images rather than a decomposition into Dirac masses. Moving from $\T$ to $\T_\sharp$ thus linearizes the action of a map at the price of moving from $\Xx$ to the measure space $\Mm(\Xx)$ .
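For discrete measures, formula (21) can be sketched in a few lines. The name `push_forward` is hypothetical; the sketch merges atoms whose images coincide, and it checks numerically that integrating a test function against the pushed measure equals integrating its composition with the map against the source.

```python
from collections import defaultdict


def push_forward(T, points, weights):
    """Return T_# alpha as a dict {image point: mass}, merging coincident images."""
    image = defaultdict(float)
    for x, a in zip(points, weights):
        image[T(x)] += a
    return dict(image)


alpha_points = [-1.0, 0.0, 1.0]
alpha_weights = [0.25, 0.5, 0.25]
T = lambda x: x * x  # not injective: the atoms at -1 and 1 are merged
beta = push_forward(T, alpha_points, alpha_weights)

# Integral of g against T_# alpha versus integral of g o T against alpha.
g = lambda y: y + 1.0
lhs = sum(b * g(y) for y, b in beta.items())
rhs = sum(a * g(T(x)) for x, a in zip(alpha_points, alpha_weights))
```

The example uses a non-injective map on purpose: a push-forward may merge mass, but it never splits an atom.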

Remark: Pullback and push-forward

If $\T:\X\to\Y$ is continuous, the pullback by $\T$ is the linear operator

\T^\sharp:\Cc(\Y)\to\Cc(\X), \qquad \T^\sharp g=g\circ\T .

(24)

The definition of the push-forward is exactly the dual relation between this pullback on functions and the action of $\T_\sharp$ on measures:

\int_\X \T^\sharp g(x)\d\al(x) = \int_\Y g(y)\d(\T_\sharp\al)(y).

(25)

In pairing notation,

\left\langle \T^\sharp g,\al\right\rangle_{\Cc(\X),\Mm(\X)} = \left\langle g,\T_\sharp\al\right\rangle_{\Cc(\Y),\Mm(\Y)}.

(26)

Thus push-forward is the adjoint operation to pullback, with the direction reversed. The two arrows should not be confused: $\T_\sharp$ transports mass from $\X$ to $\Y$ , whereas $\T^\sharp$ transports test functions from $\Y$ back to $\X$ .

The figure below makes the determinant factor visible. For a diffeomorphism $\T$, the pushed density is $\rho_\be(y)=\rho_\al(\T^{-1}y)\,|\det\T'(\T^{-1}y)|^{-1}$. Even when the source density is uniform, nonlinear diffeomorphisms can compress area around one or several centers and expand it elsewhere; the target density is therefore not obtained by the naive composition $\rho_\al\circ\T^{-1}$ alone. The high-density bumps appear exactly where the deformed grid cells have small area.

Jacobian determinant in the density push-forward formula. The three panels show smooth diffeomorphisms that strongly compress a uniform source grid around one, three, and five centers. The deformed grid is drawn as a dense red mesh, and the pushed density is shown through black level sets of $\density{\be}(y)=|\det\T'(\T^{-1}y)|^{-1}$ . The tighter crop focuses on the bump region, where small grid cells coincide with large values of the pushed density.

Interactive panel. Change the compression strength, width, and residual rotation to see how the determinant controls the density amplification.

Remark: Probabilistic interpretation

Let $X:\Omega\to\X$ be a random variable defined on an abstract probability space $(\Omega,\Ff,\PP)$ . Its law, or probability distribution, is the push-forward of $\PP$ by $X$ , namely $\al=X_\sharp\PP$ .

Applying another push-forward $\be = \T_\sharp\al$ for $\T : \X \rightarrow \Y$ , following (23), is equivalent to defining another random variable $Y=\T(X)$ , namely $\omega \in \Om \mapsto \T(X(\omega)) \in \Y$ . Indeed

\be=\T_\sharp\al=\T_\sharp(X_\sharp\PP)=(\T\circ X)_\sharp\PP,

(30)

so $\be$ is the law of $Y$ .

Drawing a random sample $y$ from $Y$ is thus simply achieved by computing $y=\T(x)$ where $x$ is drawn from $X$ .

Monge’s Formulation¶

Monge’s problem asks for a deterministic map transporting one law onto another while minimizing a prescribed cost. It is geometrically direct, because every source point is assigned one destination, but analytically fragile: the feasible set is non-convex, it can be empty, and a map cannot split mass. These limitations motivate Kantorovich’s relaxation in the next chapter.

Monge Problem¶

Given $\al\in\Mm_+^1(\Xx)$ , $\be\in\Mm_+^1(\Yy)$ and a nonnegative measurable cost $c:\Xx\times\Yy\to[0,+\infty]$ , the Monge problem is

\Monge_c(\al,\be) \eqdef \inf_{\T:\Xx\to\Yy\,\text{ measurable}} \left\{ \int_\Xx c(x,\T(x))\d\al(x) \;:\; \T_\sharp\al=\be \right\}.

(31)

The constraint $\T_\sharp\al=\be$ means that $\T$ pushes the mass of $\al$ onto $\be$ . By convention, the infimum is $+\infty$ when no feasible map exists.

Proposition: Empirical Monge Maps And Matchings

Assume that the source atoms $x_1,\ldots,x_n$ are distinct and

\al=\frac1n\sum_{i=1}^n\delta_{x_i}, \qquad \be=\frac1n\sum_{j=1}^n\delta_{y_j}.

(32)

If $\T_\sharp\al=\be$ , then for each distinct target value $z$ in the support of $\be$ , exactly $n\be(\{z\})$ source atoms are mapped to $z$ . In particular, if the $y_j$ are distinct, then there is a permutation $\sigma\in\Perm(n)$ such that $\T(x_i)=y_{\sigma(i)}$ .

Conversely, every assignment of source atoms to target atoms with the correct masses defines a feasible Monge map on the support of $\al$ , and in the distinct-target case

\int_\Xx c(x,\T(x))\d\al(x) = \frac1n\sum_{i=1}^n c(x_i,y_{\sigma(i)}).

(33)

The values of a feasible map away from $\{x_1,\ldots,x_n\}$ may be chosen arbitrarily, provided the extension is measurable. If source locations are repeated, they should first be merged into atoms with larger masses; such atoms cannot be split by a Monge map.
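In the distinct-target case, the proposition reduces the empirical Monge problem to a search over permutations. The following sketch enumerates all permutations by brute force, which is only feasible for very small $n$; the function names and the sample points are hypothetical.

```python
from itertools import permutations


def monge_cost(xs, ys, sigma, cost):
    """Cost (33) of the map x_i -> y_sigma(i) for uniform weights 1/n."""
    n = len(xs)
    return sum(cost(xs[i], ys[sigma[i]]) for i in range(n)) / n


def best_permutation(xs, ys, cost):
    """Exhaustive search over the n! feasible Monge maps (tiny n only)."""
    n = len(xs)
    return min(permutations(range(n)), key=lambda s: monge_cost(xs, ys, s, cost))


xs = [0.0, 1.0, 2.0]
ys = [2.5, 0.5, 1.5]
squared = lambda x, y: (x - y) ** 2

sigma = best_permutation(xs, ys, squared)   # index of the target assigned to each x_i
value = monge_cost(xs, ys, sigma, squared)
```

On this one-dimensional example the search returns the monotone matching, which sends the smallest source point to the smallest target point and so on.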

Example: A splitting obstruction

Let $\al$ be uniform on $\{0\}\times[0,1]$ and let $\be$ be the average of the uniform laws on $\{-1\}\times[0,1]$ and $\{1\}\times[0,1]$ . For squared Euclidean cost, the relaxed plan sending $(0,t)$ equally to $(-1,t)$ and $(1,t)$ has cost 1, the smallest possible value because every target point has horizontal distance 1 from the source segment.

No deterministic map attains this value. Equality would force the vertical coordinate to remain equal to $t$ almost everywhere. If $A\subset[0,1]$ were the levels sent to the right segment, the target marginal would require $|A\cap B|=|B|/2$ for every Borel set $B$ , which is impossible for $B=A$ . Feasible maps nevertheless approach cost 1: partition $[0,1]$ into $n$ equal cells and map each half-cell affinely onto the corresponding full cell on one side. The vertical displacement is at most $1/(2n)$ , so the costs converge to 1. Thus the Monge infimum equals the relaxed value but is not attained Santambrogio, 2015.
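The approximating maps of this example can be checked numerically. The sketch below implements one possible choice of the $n$-cell map (lower half-cells go left, upper half-cells go right; this orientation is an assumption of the illustration) and evaluates its squared-Euclidean cost with a midpoint rule. Under this choice the cost should equal $1+1/(12n^2)$.

```python
import math


def cell_map(t, n):
    """Image of the source point (0, t) under the n-cell map."""
    k = min(int(math.floor(n * t)), n - 1)   # index of the cell containing t
    s = t - k / n                            # position inside the cell
    half = 1.0 / (2 * n)
    if s < half:
        # Lower half-cell stretched affinely onto the full cell on the left.
        return (-1.0, k / n + 2.0 * s)
    # Upper half-cell stretched affinely onto the full cell on the right.
    return (1.0, k / n + 2.0 * (s - half))


def monge_cost(n, samples_per_half_cell=200):
    """Midpoint-rule approximation of the transport cost of the n-cell map."""
    m = 2 * n * samples_per_half_cell
    total = 0.0
    for j in range(m):
        t = (j + 0.5) / m
        x, y = cell_map(t, n)
        total += x ** 2 + (y - t) ** 2   # the source point is (0, t)
    return total / m


costs = {n: monge_cost(n) for n in (1, 2, 4, 8)}
```

The computed values decrease toward 1 without reaching it, in line with the statement that the infimum is not attained.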

Example: Semi-discrete Monge maps

The Monge formulation is not symmetric in $\al$ and $\be$ . It makes sense, for instance, when $\al$ has a density with respect to Lebesgue measure and $\be$ is discrete. On $\Xx=\Yy=\RR^d$ , let $\be=\sum_j b_j\delta_{y_j}$ be supported on $\{y_1,\ldots,y_m\}$ . A map $\T$ such that $\T_\sharp\al=\be$ defines a segmentation of the space into cells

C_j\eqdef \T^{-1}(y_j), \qquad \al(C_j)=b_j.

(35)

This is the semi-discrete setting. A separate chapter explains how the cells become Laguerre cells for prescribed masses and ordinary Voronoi cells when the masses are free. The figure below shows two such equal-mass piecewise-constant maps: the color of each target atom matches the Laguerre cell that is sent to it. If one exchanges the roles of $\al$ and $\be$ so that $\al$ is discrete, then no valid $\T$ exists in general: the push-forward of a discrete measure is again discrete, so it cannot have a density.

Semi-discrete Monge maps. The red contours show the continuous source density $\al$ . The colored regions extend the numerical Laguerre cells over the whole displayed domain, while their masses are computed with respect to $\al$ . The circular atoms form the discrete target $\be$ . Their colors are tied to horizontal position, with a small random perturbation, and are reused for the cells. Faint segments connect cell barycenters to their images under the piecewise-constant Monge map.

Interactive panel. Vary the target masses, source density and number of dual-weight updates to see how ordinary Voronoi cells deform into Laguerre cells with the prescribed semi-discrete masses.

The next figure shows a finite-dimensional instance of this deterministic viewpoint. The source and target measures are empirical color clouds in RGB space, and the map transports colors while leaving pixel positions fixed. Grayscale equalization is one-dimensional, but full palette transfer requires transporting empirical measures in a three-dimensional color space. Early methods used affine statistics or iterated one-dimensional projections (Reinhard et al., 2001; Pitié et al., 2005); replacing these projections by a three-dimensional OT map gives a more intrinsic palette match (Rabin et al., 2011).


Color transfer as a Monge map in RGB space, from a beach photograph to a flower photograph. The top row applies the palette map to the source image; the bottom row shows the empirical color clouds in the RGB cube. Only colors are transported here, not pixel locations.

Interactive panel. Use the interpolation, resolution, target palette, and contrast controls to replay the RGB color transport while keeping pixel locations fixed.

Monge Distance¶

When $\Xx=\Yy$ , $1\leq p<+\infty$ and $c(x,y)=d(x,y)^p$ for a metric $d$ , set

\mathcal{E}_\al(\T) \eqdef \int_\Xx d(x,\T(x))^p\d\al(x).

(36)

The Monge value defines the directed quantity

\widetilde{\Wass}_p(\al,\be)^p \eqdef \inf_{\T:\Xx\to\Xx\,\text{ measurable}} \left\{ \mathcal{E}_\al(\T) \;:\; \T_\sharp\al=\be \right\}.

(37)

If the constraint set is empty, then $\widetilde{\Wass}_p(\al,\be)=+\infty$ .

Example: Book-shifting in the Monge problem

Let

\al=\frac12\mathds{1}_{[0,2]}(x)\d x, \qquad \be=\frac12\mathds{1}_{[1,3]}(y)\d y.

(38)

The monotone translation $T_{\rm tr}(x)=x+1$ pushes $\al$ to $\be$ . For the cost $|x-y|$ , it is not the only optimal Monge map. The book-shifting map

T_{\rm book}(x)= \begin{cases} x+2, & 0\leq x\leq1,\\ x, & 1<x\leq2, \end{cases}

(39)

also satisfies $(T_{\rm book})_\sharp\al=\be$: it keeps the overlapping interval $[1,2]$ fixed and sends the left interval $[0,1]$ to the uncovered interval $[2,3]$. For any admissible map $T$, that is, any measurable $T$ with $T_\sharp\al=\be$,

\int |T(x)-x|\d\al(x) \geq \int (T(x)-x)\d\al(x) = \int y\d\be(y)-\int x\d\al(x) =1.

(40)

Both $T_{\rm tr}$ and $T_{\rm book}$ satisfy $T(x)\geq x$ for $\al$-almost every $x$, hence both saturate this lower bound and are optimal for the Monge value $\widetilde{\Wass}_1$. More generally, one may replace the map $x\mapsto x+2$ on $[0,1]$ by any measure-preserving map from $[0,1]$ to $[2,3]$, while keeping the identity on $[1,2]$. The flatness is specific to the linear cost: for $p>1$, strict convexity selects the monotone translation.
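A short numerical check of this example, written as a sketch with a midpoint rule: both maps should have cost 1 for $p=1$, while for $p=2$ the translation should cost 1 and the book-shifting map 2. The function name `transport_cost` is hypothetical.

```python
def transport_cost(T, p, samples=20000):
    """Midpoint rule for the integral of |T(x) - x|^p against alpha = (1/2) 1_[0,2] dx."""
    total = 0.0
    for j in range(samples):
        x = 2.0 * (j + 0.5) / samples
        total += abs(T(x) - x) ** p
    # Density 1/2 times interval length 2 gives a plain average over the samples.
    return total / samples


T_tr = lambda x: x + 1.0                          # monotone translation
T_book = lambda x: x + 2.0 if x <= 1.0 else x     # book-shifting map (39)

costs = {
    (name, p): transport_cost(T, p)
    for name, T in [("tr", T_tr), ("book", T_book)]
    for p in (1, 2)
}
```

The tie at $p=1$ and the gap at $p=2$ are the numerical counterpart of the remark that strict convexity of the cost breaks the degeneracy.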

The directed value $\widetilde{\Wass}_p$ is useful conceptually, but it is too rigid to be the main distance between measures: it can be infinite and asymmetric. Kantorovich’s formulation remedies both issues by replacing maps with couplings.

Existence And Uniqueness Of The Monge Map¶

This section records the main regimes where Monge’s deterministic formulation becomes well posed. Brenier’s theorem is the central result: for the squared Euclidean cost, absolute continuity of the source restores existence, uniqueness and convex-potential structure.

Brenier’s Theorem¶

Brenier’s theorem (Brenier, 1987; Brenier, 1991) ensures that in $\RR^d$, for the quadratic cost, absolute continuity of the source is enough for Monge’s problem to have a unique solution. It also gives the decisive structural description: the optimal map is the gradient of a convex potential.

Brenier’s theorem is the higher-dimensional analogue of the one-dimensional monotone rearrangement theorem. In one dimension, the derivative of a convex function is an increasing map; in several dimensions, the corresponding object is the gradient of a convex function. Such gradients are monotone fields:

\langle \nabla\phi(x)-\nabla\phi(x'),x-x'\rangle\geq0.

(44)

Remark: Monotone fields need not be gradients

In dimensions larger than one, not all monotone fields are gradients of convex functions. Consider in $\RR^2$ the rotation matrix

R_\theta=\begin{pmatrix} \cos\theta & -\sin\theta\\ \sin\theta & \cos\theta \end{pmatrix}.

(45)

The linear map $x\mapsto R_\theta x$ is monotone as soon as $|\theta|\leq\pi/2$ , because

\dotp{R_\theta x-R_\theta x'}{x-x'} = \dotp{R_\theta(x-x')}{x-x'} = \cos(\theta)\norm{x-x'}^2\geq0.

(46)

However, for $\theta\neq0$ , $R_\theta$ is not symmetric and therefore cannot be the gradient of a scalar potential. Indeed, if a linear field $Ax$ equals $\nabla\phi(x)$ , then its Jacobian $A$ must be symmetric; equivalently, a quadratic potential $\phi(x)=\dotp{Bx}{x}/2$ has gradient $((B+B^\top)/2)x$ . Thus monotonicity is weaker than Brenier optimality in dimension $d\geq2$ .
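The two claims of this remark can be verified numerically. The sketch below, which assumes NumPy is available, checks identity (46) on a random pair of points and confirms that $R_\theta$ is not symmetric, while its symmetric part is $\cos(\theta)\Id$.

```python
import numpy as np


def rotation(theta):
    """2x2 rotation matrix R_theta as in (45)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


theta = np.pi / 3
R = rotation(theta)

rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=2), rng.normal(size=2)
d = x - x_prime

# Monotonicity pairing <R d, d>, expected to equal cos(theta) * ||d||^2 >= 0.
inner = float(d @ (R @ d))
expected = float(np.cos(theta) * (d @ d))

# A linear gradient field must have a symmetric matrix; R_theta does not.
is_symmetric = bool(np.allclose(R, R.T))
# Only the symmetric part is seen by the quadratic form x -> <R x, x> / 2.
symmetric_part = (R + R.T) / 2
```

The symmetric part is the gradient of $\cos(\theta)\norm{x}^2/2$; the antisymmetric remainder is the rotational component that no potential can produce.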

Radial Measures¶

Radial symmetry gives a useful higher-dimensional case where the Brenier map reduces to a one-dimensional monotone rearrangement. A measure $\al$ on $\RR^d$ is radial if $Q_\sharp\al=\al$ for every orthogonal map $Q$ . Such a measure is determined by the law of the radius $\norm{x}$ , and the optimal map transports this radius while keeping the angular direction fixed.

Proposition: Optimal Transport Between Radial Measures

Let $\al,\be\in\Mm_+^1(\RR^d)$ be radial probability measures with finite second moments, and assume that $\al$ is absolutely continuous with respect to Lebesgue measure. Define their radial distribution functions

F_\al(r)\eqdef \al(\{x:\norm{x}\leq r\}), \qquad F_\be(r)\eqdef \be(\{y:\norm{y}\leq r\}), \qquad r\geq0,

(47)

and let $F_\be^{-1}(u)\eqdef\inf\{r\geq0:F_\be(r)\geq u\}$ be the generalized inverse. Set $\tau(r)\eqdef F_\be^{-1}(F_\al(r))$ . Then the quadratic Brenier map from $\al$ to $\be$ is the radial map

\T(0)=0, \qquad \T(x)=\frac{\tau(\norm{x})}{\norm{x}}x \quad\text{for }x\neq0.

(48)

Moreover, if $\al_R=(x\mapsto\norm{x})_\sharp\al$ is the law of the source radius, then

\Wass_2^2(\al,\be) = \int_0^\infty |r-\tau(r)|^2\,\d\al_R(r).

(49)

If $\al$ is not absolutely continuous, the same radial idea is still useful but has to be interpreted at the level of couplings of the radial variables: atoms on spheres may have to be split, so a Monge map need not exist.
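As an illustration of the proposition, take $\al$ uniform on the unit ball and $\be$ uniform on the ball of radius 2 in $\RR^d$; this pair is a hypothetical example chosen because its radial distribution functions are explicit, $F(r)=(r/R)^d$. The sketch computes $\tau$ and evaluates (49) with a midpoint rule, using the radial density $d\,r^{d-1}$ of $\al_R$ on $[0,1]$.

```python
def radial_cdf_uniform_ball(r, radius, d):
    """F(r) for the uniform law on the ball of the given radius in R^d."""
    return min(max(r / radius, 0.0), 1.0) ** d


def radial_quantile_uniform_ball(u, radius, d):
    """Inverse of the radial distribution function above, for u in [0, 1]."""
    return radius * u ** (1.0 / d)


def tau(r, d, source_radius=1.0, target_radius=2.0):
    """Radial rearrangement tau = F_beta^{-1} o F_alpha."""
    u = radial_cdf_uniform_ball(r, source_radius, d)
    return radial_quantile_uniform_ball(u, target_radius, d)


def w2_squared(d, samples=20000):
    """Midpoint rule for (49), with alpha_R of density d * r^(d-1) on [0, 1]."""
    total = 0.0
    for j in range(samples):
        r = (j + 0.5) / samples
        total += (r - tau(r, d)) ** 2 * d * r ** (d - 1)
    return total / samples


value_2d = w2_squared(2)
value_3d = w2_squared(3)
```

Here $\tau(r)=2r$, so the Brenier map is the dilation $x\mapsto 2x$ and the squared distance is the second moment $d/(d+2)$ of the source.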

Polar Factorization¶

Brenier’s theorem provides a canonical way to extract the monotone part of a nondegenerate square-integrable map $u:\Omega\to\RR^d$. Its law $\be=u_\sharp\lambda$ records where the mass ends up, but forgets how the points of $\Omega$ were labelled. Brenier’s polar factorization (Brenier, 1987; Brenier, 1991) separates these effects: a measure-preserving rearrangement changes labels, then the unique convex-gradient map sends the uniform source to the output law.

Absolute continuity is a sufficient form of Brenier’s nondegeneracy hypothesis. Without such a hypothesis, deterministic polar factorization may fail or be nonunique.

The figure below separates these two factors on a colored grid. Its three panels display $x$, $s(x)$, and $u(x)=(\nabla\phi\circ s)(x)$. The map $s$ swirls the square grid while preserving area, whereas $\nabla\phi$ is a symmetric positive definite affine map and hence the gradient of a convex quadratic potential.

Polar factorization as a relabeling followed by a Brenier map. From left to right, the panels show $x$ , $s(x)$ with $s_\sharp\lambda=\lambda$ , and $u(x)=\nabla\phi(s(x))$ . Here $s$ is generated by the area-preserving flow of a divergence-free Hamiltonian vector field, while the Brenier factor is the symmetric positive definite affine map $\nabla\phi(z)=Bz$ . Faint arrows indicate the successive maps.

Interactive panel. Vary the area-preserving swirl and the SPD stretch to separate relabeling from the Brenier factor.

For linear maps under a Gaussian reference, this reduces to the usual matrix polar decomposition. If $X\sim\Gaussian(0,\Id)$ and $u(x)=Ax$ , then $u_\sharp\Gaussian(0,\Id)=\Gaussian(0,AA^\top)$ . The Brenier map from $\Gaussian(0,\Id)$ to this Gaussian is $x\mapsto Sx$ , where $S=(AA^\top)^{1/2}$ is symmetric positive semidefinite. Hence

A=SO.

(52)

When $A$ is invertible, $O=S^{-1}A$ is orthogonal. In the singular square case, $S^\dagger A$ is only a partial isometry and must be extended on its kernel to an orthogonal matrix $O$ before $Ox$ preserves the full standard Gaussian law. The factor $Sx$ is the convex-gradient transport part, whereas $Ox$ is the measure-preserving relabeling.
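In the invertible case, the factorization $A=SO$ can be computed directly. The sketch below assumes NumPy and an invertible matrix $A$; the function name `gaussian_polar_factors` and the sample matrix are hypothetical. The square root $S=(AA^\top)^{1/2}$ is obtained from a symmetric eigendecomposition.

```python
import numpy as np


def gaussian_polar_factors(A):
    """Return (S, O) with S = (A A^T)^{1/2} and O = S^{-1} A, assuming A is invertible."""
    eigvals, eigvecs = np.linalg.eigh(A @ A.T)       # A A^T is symmetric positive definite
    S = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T
    O = np.linalg.solve(S, A)                        # solves S O = A without forming S^{-1}
    return S, O


A = np.array([[2.0, 1.0],
              [0.0, 1.0]])
S, O = gaussian_polar_factors(A)

# S is the convex-gradient (Brenier) factor, O the measure-preserving relabeling.
reconstruction_error = float(np.abs(S @ O - A).max())
orthogonality_error = float(np.abs(O @ O.T - np.eye(2)).max())
```

For a singular $A$ the linear solve fails or is ill conditioned, which matches the caveat above that $O$ must then be completed by hand on the kernel.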

Displacement Interpolation¶

An optimal map does not only match two endpoint measures; it tells how to draw a path between them. Each particle keeps its identity and travels at constant speed from its initial position to its image.

Proposition: Directed Monge Displacement Geodesics

Let $1\leq p<+\infty$ , let $\al,\be\in\Mm_+^1(\RR^d)$ have finite $p$ -th moments, and let $\T$ be an optimal map for $\widetilde{\Wass}_p(\al,\be)$ . Set $\al_t=(\T_t)_\sharp\al$ with $\T_t=(1-t)\Id+t\T$ . Assume that, for every $t<1$ , $\T_t$ is one-to-one on a full $\al$ -measure Borel set. Then, for $0\leq s\leq t\leq1$ ,

\widetilde{\Wass}_p(\al_s,\al_t) = (t-s)\widetilde{\Wass}_p(\al,\be).

(54)

Thus $t\mapsto\al_t$ is an oriented constant-speed geodesic for the directed Monge distance. For $p=2$ , this applies to the Brenier map under the hypotheses of Brenier’s theorem.
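The constant-speed property (54) can be checked on empirical measures on the real line with $p=2$. The sketch relies on the one-dimensional fact that, for equal-size uniform point clouds and a convex cost, matching sorted points is optimal; this fact is used here as an assumption rather than proved. Function names and sample points are hypothetical.

```python
def w2_sorted(xs, ys):
    """W_2 between two equal-size uniform empirical measures on the line (sorted matching)."""
    xs, ys = sorted(xs), sorted(ys)
    return (sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def interpolate(xs, ys, t):
    """Atoms of alpha_t = ((1 - t) Id + t T)_# alpha for the monotone map T."""
    return [(1 - t) * x + t * y for x, y in zip(sorted(xs), sorted(ys))]


xs = [0.0, 0.3, 1.0, 1.2]
ys = [2.0, 5.0, 2.5, 3.1]

total = w2_sorted(xs, ys)
s, t = 0.25, 0.75
partial = w2_sorted(interpolate(xs, ys, s), interpolate(xs, ys, t))
```

Because the monotone map preserves the ordering of the atoms at every intermediate time, the distance between $\al_s$ and $\al_t$ is exactly $(t-s)$ times the total distance.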

The figure below illustrates this displacement geodesic on two non-convex silhouettes, both through representative particle paths and through the evolving transported density.

McCann displacement interpolation between a cat silhouette and a heart silhouette. The first row displays a small farthest-point subset of transported particles along $T_t(x)=(1-t)x+tT(x)$ . The second row renders kernel-smoothed densities from a denser transported cloud as color images: white means zero density, while high density saturates in the red-to-blue interpolation color of the corresponding time.

Interactive panel. Use the interpolation and particle controls to compare the particle motion with the evolving density during McCann displacement interpolation.

Regularity And The Monge-Ampere Equation¶

The previous results identify the optimal map. Regularity theory asks when this map is a classical smooth deformation rather than only an almost-everywhere gradient. For quadratic costs this becomes the regularity theory of the Monge--Ampere equation.

Proposition: Caffarelli Regularity

Let $\Omega,\Lambda\subset\RR^d$ be bounded uniformly convex domains with $C^2$ boundaries. Let $\al=\rho(x)\d x$ be supported on $\Omega$ and $\be=\eta(y)\d y$ be supported on $\Lambda$ , with $0<m\leq\rho,\eta\leq M<+\infty$ . If $\rho\in C^\alpha(\overline\Omega)$ and $\eta\in C^\alpha(\overline\Lambda)$ for some $\alpha\in(0,1)$ , then the Brenier potential $\phi$ is strictly convex in $\Omega$ and $\phi\in C^{2,\alpha}_{\mathrm{loc}}(\Omega)$ ; in particular, $\nabla\phi\in C^{1,\alpha}_{\mathrm{loc}}(\Omega)$ . Standard stronger boundary assumptions give the corresponding global result.

The figure below illustrates the geometric role of the convexity assumption through a finite-sample stress test rather than a counterexample. A quadratic assignment transports a dense farthest-point sample of the disk to a dense farthest-point sample of a connected non-convex target made of two disks joined by a thin rectangle. The panels show the corresponding empirical McCann interpolation, so the loss of convex target geometry is visible directly along the transported particles.

Empirical quadratic OT interpolation from a disk to a connected non-convex two-disk domain. A 5200-point farthest-point sample of the disk is matched to a 5200-point farthest-point sample of two disks connected by a thin rectangle. The panels display $(1-t)x_i+tT_N(x_i)$ for the empirical optimal assignment $T_N$ . Colors are inherited from the horizontal coordinate of $x_i$ in the initial disk, making the transported material regions visible throughout the interpolation.

Interactive panel. Change the neck strength and number of displayed rings to see how a smooth source foliation bends when the target develops a non-convex throat.

Remark: Regularity, weak maps, and splitting

Caffarelli’s theorem should be read as a warning as well as a theorem. Brenier’s theorem gives existence and uniqueness under mild assumptions, but smoothness of the map requires density bounds, Hölder-regular densities and convex geometry that are rarely satisfied by empirical, manifold-supported or neural generative distributions. In such applications, the exact OT map is often only weakly defined, possibly unstable, and better represented by a coupling, an entropic approximation or a learned parametric surrogate.

Even without smoothness, the convex potential is locally Lipschitz on the interior of its domain, so $\nabla\phi$ is defined Lebesgue-almost everywhere. If the source measure does not satisfy the non-splitting hypotheses of Brenier’s theorem, the correct relaxed object is instead an optimal Kantorovich plan concentrated on the graph of the set-valued map $\partial\phi$ . At points where $\partial\phi(x)$ contains several target locations, the plan may split the mass starting from $x$ . Thus the subdifferential still describes the geometry of optimality, but the transport object is a coupling rather than a single-valued map.

For smooth densities, the change-of-variables formula gives the Monge--Ampere equation

\det(\nabla^2\phi(x))\density{\be}(\nabla\phi(x)) = \density{\al}(x).

(57)

With suitable boundary conditions, this characterizes the Brenier potential up to an additive constant among convex solutions. The convexity constraint forces $\det(\nabla^2\phi(x))\geq0$ and is necessary for this fully nonlinear elliptic equation to be well posed.

Remark: Numerical Monge--Ampère solvers

The Monge--Ampère operator should be viewed as a fully nonlinear, degenerate elliptic analogue of the Laplacian. Proposition: Linearization Of The Monge-Ampere Equation makes this analogy literal at first order, where the equation becomes a weighted Poisson equation. The nonlinear operator $\phi\mapsto\det(\nabla^2\phi)$, however, is elliptic only on the convex branch, together with the second boundary condition $\nabla\phi(\Omega)=\Lambda$. This is the numerical difficulty: a scheme is not reliable merely because it approximates the determinant of the Hessian. It must also select the convex Alexandrov/viscosity solution and discretize the boundary condition consistently.

Several complementary approaches have been developed. Geometric discretizations go back to Oliker--Prussner and reflector-design methods; PDE-based solvers include Newton methods, monotone wide-stencil finite differences and variational discretizations; semi-discrete and power-diagram methods exploit the convex-cell structure of the transport map. Representative references include Oliker & Prussner, 1989; Caffarelli et al., 1999; Loeper & Rapetti, 2005; Benamou et al., 2014; Froese & Oberman, 2011; Benamou et al., 2016; Mirebeau, 2015; Sulman et al., 2011. We do not describe these schemes here; the point is that numerical Monge--Ampère transport is primarily a problem of consistent discretization of this nonlinear Laplacian-like operator.

The following proposition records the infinitesimal form.

Proposition: Linearization Of The Monge-Ampere Equation

Let $\rho_\epsilon=\rho_0+\epsilon r+o(\epsilon)$ be a smooth perturbation of a positive reference density $\rho_0$ on a smooth bounded domain $\Omega$ , with $\int_\Omega r\d x=0$ . Assume that the expansions hold strongly enough to differentiate the change-of-variables identity. If $\T_\epsilon(x)=x+\epsilon\nabla u(x)+o(\epsilon)$ transports $\rho_0\d x$ to $\rho_\epsilon\d x$ , then, to first order,

-\nabla\cdot(\rho_0\nabla u)=r.

(58)

When $\rho_0$ is constant, the linearized equation is $-\Delta u=r/\rho_0$ . On a fixed domain with no boundary flux, the associated condition is $\rho_0\partial_n u=0$ on $\partial\Omega$ .
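The linearized equation (58) can be tested in one dimension, where the change-of-variables identity reads $\rho_\epsilon(\T_\epsilon(x))\,\T_\epsilon'(x)=\rho_0(x)$. The sketch below uses a hypothetical reference density and potential on $[0,1]$ with $u'(0)=u'(1)=0$, and compares the finite-difference perturbation $(\rho_\epsilon-\rho_0)/\epsilon$ with $-(\rho_0u')'$.

```python
import math

# Hypothetical reference density and potential derivatives on [0, 1].
rho0 = lambda x: 1.0 + 0.5 * math.cos(math.pi * x)
drho0 = lambda x: -0.5 * math.pi * math.sin(math.pi * x)
du = lambda x: -math.sin(2 * math.pi * x) / (2 * math.pi)   # u', zero at 0 and 1
d2u = lambda x: -math.cos(2 * math.pi * x)                  # u''


def r_linearized(x):
    """Right-hand side predicted by (58): -(rho0 u')'."""
    return -(drho0(x) * du(x) + rho0(x) * d2u(x))


def r_finite_difference(x, eps):
    """Return y = T_eps(x) and (rho_eps(y) - rho0(y)) / eps via change of variables."""
    y = x + eps * du(x)
    rho_eps_at_y = rho0(x) / (1.0 + eps * d2u(x))
    return y, (rho_eps_at_y - rho0(y)) / eps


eps = 1e-5
errors = []
for x in [0.1, 0.35, 0.6, 0.9]:
    y, fd = r_finite_difference(x, eps)
    errors.append(abs(fd - r_linearized(y)))
max_error = max(errors)
```

The discrepancy is of order $\epsilon$, as expected from a first-order expansion; using a nonconstant $\rho_0$ makes the test sensitive to the divergence form of the operator rather than to the Laplacian alone.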

Beyond The Quadratic Euclidean Cost¶

The quadratic Euclidean cost is the model case, but optimal-map theory also covers many non-quadratic costs. The key point is to separate three roles: a convexity-like structure gives potentials, a twist condition prevents splitting, and curvature-type conditions give regularity.

$W_p$ Costs¶

The quadratic cost is special because it identifies the optimal map with the Euclidean gradient of an ordinary convex potential. For the Wasserstein cost, normalized as $c(x,y)=\norm{x-y}^p/p$ with $p>1$ , the same Monge-map picture survives, but the potential is adapted to the cost. More generally, for $c(x,y)=h(x-y)$ with $h$ smooth and strictly convex, absolute continuity of the source again rules out splitting and yields a unique optimal map. This map is characterized by a $c$ -convex potential $f$ : at differentiability points of $f$ ,

\nabla f(x)=\nabla_x c(x,\T(x)).

(62)

For $c(x,y)=\norm{x-y}^p/p$ , this gives the explicit relation

\T(x)=x-\norm{\nabla f(x)}^{q-2}\nabla f(x), \qquad \frac1p+\frac1q=1.

(63)

Thus the Brenier formula $\T=\nabla\phi$ should be read as the quadratic representative of a broader $c$ -convex theory. The Euclidean theory for strictly convex displacement costs is developed in Gangbo--McCann’s work on the geometry of optimal transportation Gangbo & McCann, 1996; see also the general treatment in Villani, 2009.
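The inversion leading from (62) to (63) can be checked numerically. The sketch below uses only the standard library; the helper names `grad_x_cost` and `target_from_gradient` are illustrative and not taken from the text. It relies on the identity $(p-1)(q-1)=1$ for conjugate exponents.

```python
import math

def grad_x_cost(x, y, p):
    """Gradient in x of c(x, y) = ||x - y||^p / p, assuming x != y."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = math.sqrt(sum(di * di for di in d))
    return [n ** (p - 2) * di for di in d]

def target_from_gradient(x, g, p):
    """Invert g = grad_x c(x, y): y = x - ||g||^(q-2) g with 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    n = math.sqrt(sum(gi * gi for gi in g))
    if n == 0.0:  # a zero gradient means the point does not move
        return list(x)
    return [xi - n ** (q - 2) * gi for xi, gi in zip(x, g)]

x, y, p = [0.3, -1.2, 2.0], [1.0, 0.5, -0.7], 3.0
g = grad_x_cost(x, y, p)
y_rec = target_from_gradient(x, g, p)
err = max(abs(a - b) for a, b in zip(y, y_rec))
print(err)  # recovery error, at the level of floating-point rounding
```

For $p=2$ the exponent $q-2$ vanishes and the formula reduces to $y=x-\nabla f(x)$.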

Squared Geodesic Distance¶

On a Riemannian manifold, the natural analogue of the quadratic Euclidean cost is the squared geodesic distance $c(x,y)=d_M(x,y)^2/2$ . The optimal map is no longer written with a vector-space subtraction, but with the exponential map

\T(x)=\exp_x(-\nabla\phi(x)),

(64)

where $\phi$ is $c$-convex. This is the intrinsic version of the formula $\T(x)=x-\nabla\phi(x)$, read in normal coordinates. The main additional issues are the cut locus, possible non-uniqueness of minimizing geodesics, and regularity of the exponential map, which is why the Euclidean statement is usually presented first. The Riemannian polar-factorization theorem of McCann McCann, 2001 gives the corresponding optimal-map framework, and these ideas feed into displacement convexity and the general manifold theory of optimal transport McCann, 1997; Villani, 2009.

The figure below makes this intrinsic interpolation explicit on the closed upper hemisphere $M=\mathbb S^2_+=\{x\in\mathbb R^3:\|x\|=1,\ x_3\geq0\}$. Consider two equal-weight empirical measures $\alpha=n^{-1}\sum_i\delta_{x_i}$ and $\beta=n^{-1}\sum_j\delta_{y_j}$ and the intrinsic cost

C_{ij}=\frac12d_{\mathbb S^2}(x_i,y_j)^2 =\frac12\arccos\!\bigl(\langle x_i,y_j\rangle\bigr)^2.

(65)

The Birkhoff--von Neumann theorem gives an optimal permutation coupling $P^\star_{i,\sigma(i)}=1/n$ . For a non-antipodal matched pair, let

\vartheta_i=\arccos\!\bigl(\langle x_i,y_{\sigma(i)}\rangle\bigr), \qquad \gamma_i(t) =\frac{\sin((1-t)\vartheta_i)}{\sin\vartheta_i}x_i +\frac{\sin(t\vartheta_i)}{\sin\vartheta_i}y_{\sigma(i)}.

(66)

Each $\gamma_i$ is a constant-speed minimizing spherical geodesic, and $\alpha_t=n^{-1}\sum_i\delta_{\gamma_i(t)}$ is the intrinsic McCann interpolation. The paths stay in the hemisphere because both sine coefficients are nonnegative on $[0,1]$ .

Intrinsic McCann interpolation on the upper hemisphere. The thin violet great-circle arcs show the fixed optimal permutation coupling between the hollow red and blue endpoint atoms. The filled atoms follow these arcs at $t=0,1/4,1/2,3/4,1$ , with color interpolated from red to blue.
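The spherical formula (66) can be sketched in a few lines. The example below is a minimal illustration for a single matched pair, with hypothetical helper names (`slerp`, `sphere_dist`, `unit`); it does not compute the optimal permutation $\sigma$, which is assumed given.

```python
import math

def unit(v):
    """Normalize a vector of R^3 to the unit sphere."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def sphere_dist(x, y):
    """Geodesic distance arccos(<x, y>), with the dot product clamped to [-1, 1]."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, y))))
    return math.acos(dot)

def slerp(x, y, t):
    """Point at time t on the minimizing great-circle arc from x to y (non-antipodal)."""
    theta = sphere_dist(x, y)
    if theta < 1e-12:  # coincident points: the path is constant
        return list(x)
    s = math.sin(theta)
    a = math.sin((1.0 - t) * theta) / s
    b = math.sin(t * theta) / s
    return [a * xi + b * yi for xi, yi in zip(x, y)]

x = unit([1.0, 0.2, 0.1])    # both endpoints lie in the upper hemisphere
y = unit([-0.5, 0.8, 0.3])
theta = sphere_dist(x, y)
path = [slerp(x, y, k / 4) for k in range(5)]  # t = 0, 1/4, 1/2, 3/4, 1
```

Each point of `path` has unit norm, a nonnegative third coordinate, and lies at distance $t\vartheta$ from the source, which is the constant-speed property.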

Poincare Disk¶

Negative curvature gives a complementary intrinsic picture. The Poincare disk is

\mathbb D^2=\{x\in\mathbb R^2:\|x\|<1\}, \qquad g_x=\frac{4}{(1-\|x\|^2)^2}I_2,

(67)

a complete Riemannian manifold of constant curvature $-1$. Its geodesic distance is

d_{\mathbb D}(x,y) =\operatorname{arcosh}\!\left( 1+\frac{2\|x-y\|^2}{(1-\|x\|^2)(1-\|y\|^2)} \right).

(68)

For equal-weight point clouds, the cost $C_{ij}=d_{\mathbb D}(x_i,y_j)^2/2$ again admits an optimal permutation coupling, denoted by $\sigma$ . Introduce the hyperboloid lift and Lorentz product

\iota(x)=\frac{(1+\|x\|^2,2x)}{1-\|x\|^2}, \qquad \langle X,Y\rangle_L=-X_0Y_0+\langle X_{1:2},Y_{1:2}\rangle.

(69)

Put $X_i=\iota(x_i)$ , $Y_i=\iota(y_{\sigma(i)})$ , and $\eta_i=d_{\mathbb D}(x_i,y_{\sigma(i)})$ . The constant-speed geodesic is

Z_i(t) =\frac{\sinh((1-t)\eta_i)}{\sinh\eta_i}X_i +\frac{\sinh(t\eta_i)}{\sinh\eta_i}Y_i, \qquad \gamma_i(t)=\frac{Z_{i,\mathrm{sp}}(t)}{Z_{i,0}(t)+1}.

(70)

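Here $Z_{i,0}$ and $Z_{i,\mathrm{sp}}$ denote the first coordinate and the last two coordinates of $Z_i$, so the second formula inverts the lift $\iota$. In disk coordinates, the trace of $\gamma_i$ is an arc of a Euclidean circle orthogonal to the unit circle, with diameters as the limiting case. Thus $\alpha_t=n^{-1}\sum_i\delta_{\gamma_i(t)}$ is the hyperbolic McCann interpolation shown below. The $\sinh$ coefficients are the negative-curvature counterparts of the $\sin$ coefficients in the spherical formula.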

Intrinsic McCann interpolation in the Poincare disk. The black circle is the ideal boundary, the faint grid gives hyperbolic radial and distance coordinates, and the thin violet arcs are the fixed optimal geodesic coupling. Filled atoms move from the hollow red source to the hollow blue target at $t=0,1/4,1/2,3/4,1$ .
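Formulas (68) to (70) translate directly into code. The sketch below handles one matched pair of disk points with illustrative helper names (`lift`, `project`, `disk_dist`, `geodesic`); the optimal permutation is again assumed to be known.

```python
import math

def lift(x):
    """Hyperboloid lift iota(x) = (1 + |x|^2, 2x) / (1 - |x|^2) of a disk point."""
    r2 = x[0] ** 2 + x[1] ** 2
    return [(1 + r2) / (1 - r2), 2 * x[0] / (1 - r2), 2 * x[1] / (1 - r2)]

def project(Z):
    """Inverse of the lift: spatial part divided by (first coordinate + 1)."""
    return [Z[1] / (Z[0] + 1), Z[2] / (Z[0] + 1)]

def disk_dist(x, y):
    """Hyperbolic distance (68) between two points of the open unit disk."""
    num = 2 * ((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2)
    den = (1 - x[0] ** 2 - x[1] ** 2) * (1 - y[0] ** 2 - y[1] ** 2)
    return math.acosh(1 + num / den)

def geodesic(x, y, t):
    """Constant-speed geodesic (70), computed on the hyperboloid then projected."""
    eta = disk_dist(x, y)
    if eta < 1e-12:  # coincident points: the path is constant
        return list(x)
    X, Y = lift(x), lift(y)
    a = math.sinh((1 - t) * eta) / math.sinh(eta)
    b = math.sinh(t * eta) / math.sinh(eta)
    return project([a * Xc + b * Yc for Xc, Yc in zip(X, Y)])

x, y = [0.6, 0.1], [-0.3, 0.7]
eta = disk_dist(x, y)
path = [geodesic(x, y, k / 4) for k in range(5)]  # t = 0, 1/4, 1/2, 3/4, 1
```

The points of `path` stay inside the disk and satisfy $d_{\mathbb D}(x,\gamma(t))=t\,\eta$ up to rounding.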

Twist Condition¶

The first non-degeneracy condition asks that the first-order information at a source point identifies at most one target point. This is the structural hypothesis that turns an optimal relation into a map.

The quadratic cost satisfies twist since $\nabla_x\norm{x-y}^2=2(x-y)$ is injective in $y$; the bilinear cost $c(x,y)=-\dotp{x}{y}$ satisfies twist since $\nabla_x c(x,y)=-y$; more generally $c(x,y)=h(x-y)$ is twisted when $h$ is smooth and strictly convex, because $\nabla h$ is then injective. On a Riemannian manifold, the squared geodesic cost is twisted locally away from the cut locus. By contrast, a separated cost $a(x)+b(y)$ is never twisted, since $\nabla_x c$ does not see $y$.

Ma--Trudinger--Wang Curvature¶

Twist gives a map, but it does not by itself make this map continuous or smooth. For a general smooth cost, the relevant structural hypothesis is the Ma--Trudinger--Wang (MTW) condition, introduced for a priori estimates of the generated Jacobian equation associated with optimal transport Ma et al., 2005; Trudinger & Wang, 2001.

Definition: Weak MTW Condition

Assume that $c\in C^4(X\times Y)$ , that $X,Y\subset\RR^d$ , that $c$ is twisted, and that the mixed Hessian $\nabla^2_{xy}c(x,y)$ is invertible. For $(x,p)$ in the image of the change of variables $(x,y)\mapsto(x,-\nabla_x c(x,y))$ , let $Y(x,p)$ be defined by

-\nabla_x c(x,Y(x,p))=p,

(73)

and set

A_{ij}(x,p)\eqdef -\partial^2_{x_i x_j}c(x,Y(x,p)).

(74)

The cost satisfies the weak MTW condition if

\sum_{i,j,k,l}\partial^2_{p_kp_l} A_{ij}(x,p)\,\xi_i\xi_j\eta_k\eta_l\geq0 \qquad\text{whenever }\dotp{\xi}{\eta}=0.

(75)

A strong MTW condition asks for a positive lower bound, proportional to $\norm{\xi}^2\norm{\eta}^2$ , on the same orthogonal directions.

The tensor above is often called the cost-sectional curvature. With this sign convention, it measures how the negative $x$ -Hessian of the cost bends when the target point is varied through the dual momentum $p$ .

The flat quadratic and bilinear costs have zero MTW curvature, hence satisfy the weak condition. For the squared Riemannian distance, the MTW tensor is a refined curvature condition on the cost: near the diagonal it recovers sectional curvature, negative sectional curvature gives an obstruction, and global regularity also depends on cut-locus and domain-convexity issues. Many smooth strictly convex costs fail MTW, which is why twist is enough for existence of a map but not for regularity.

One-Dimensional Transport And Quantiles¶

Cumulative and quantile functions¶

In one dimension, optimal transport is completely explicit. The cumulative distribution function orders the mass, and the optimal coupling is obtained by matching equal quantile levels. This case is both a computational tool and the template for several linearized constructions used later.

The figure below follows four smooth laws through three representations: density, cumulative function and quantile function. Each density mode produces a change of slope in the cumulative function, while intervals of low density become steep portions of the quantile function. The aligned quarter-mass guides make the inversion between the last two panels explicit.

Densities, cumulative functions and quantiles for four Gaussian mixtures. Red-to-blue colors identify mixtures with one through four components across all three panels. The cumulative functions integrate the modes into successive increases of mass, while inversion exchanges slowly varying cumulative regions with steep quantile regions. Faint guides mark the same quarter-mass levels in the last two panels.

1D Monge solutions¶

The quantile construction becomes a deterministic Monge map only when the source has no atoms, because the cumulative distribution function can then be used as a genuine change of variable. This gives an explicit one-dimensional Monge solution in the atomless case. If $\al$ has atoms, a deterministic map may fail to realize the same quantile matching because an atom cannot be split; the corresponding relaxed statement is treated in Section Relaxation For Arbitrary Measures, paragraph Kantorovich solution in 1D.

Theorem: One-dimensional Monge solution

Let $\al,\be\in\Mm_+^1(\RR)$ , assume that $\al$ has no atoms, and let $h:\RR\to\RR$ be convex. Consider the cost $c(x,y)=h(x-y)$ . Write $q_\al=\cumul{\al}^{-1}$ and $q_\be=\cumul{\be}^{-1}$ , and assume that $h(q_\al-q_\be)\in L^1(0,1)$ . Then

\T=q_\be\circ\cumul{\al}=\cumul{\be}^{-1}\circ\cumul{\al}

(79)

satisfies $\T_\sharp\al=\be$ and minimizes $\int h(x-S(x))\d\al(x)$ over all maps $S$ such that $S_\sharp\al=\be$.

In particular, $h(t)=|t|^p$ gives the usual one-dimensional Monge map for every $1\leq p<+\infty$ whenever the source has no atoms.
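As an illustration of formula (79), the sketch below builds $\T=q_\be\circ\cumul{\al}$ for an atomless source with an explicit cumulative function. The choice of an Exponential(1) source and a Gaussian target is an assumption made for the example; `statistics.NormalDist` from the Python standard library provides the target quantile function.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

target = NormalDist(mu=1.0, sigma=2.0)

def cdf_source(x):
    """Cumulative function of the Exponential(1) law, which has no atoms."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def monge_map(x):
    """T = q_beta o F_alpha; levels are clipped away from 0 and 1 for inv_cdf."""
    r = min(max(cdf_source(x), 1e-12), 1.0 - 1e-12)
    return target.inv_cdf(r)

rng = random.Random(0)
xs = [rng.expovariate(1.0) for _ in range(20000)]  # samples of alpha
ys = [monge_map(x) for x in xs]                    # samples of T_# alpha
print(fmean(ys), pstdev(ys))  # should be close to the target mean and scale
```

The map is nondecreasing by construction, and the mapped samples approximately reproduce the target mean and standard deviation, up to sampling error.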

For $p=2$ , Bobkov and Ledoux Bobkov & Ledoux, 2019 gave a complementary representation involving only cumulative distribution functions. It gives the same exact value as the quantile formula, but replaces the inverse-CDF step by an integral over CDF values. This is useful when cumulative functions are easier to estimate or aggregate than quantiles. This cumulative formula has recently been used to design data-parallel estimators of sliced Wasserstein distances Vauthier et al., 2026.

The last panel of the figure below is the one-dimensional specialization of the displacement interpolation introduced above.

One-dimensional transport through quantiles. The same two smooth laws are shown as densities, cumulative functions and quantile functions. The last panel displays the displacement interpolation obtained by the linear quantile path $Q_t=(1-t)Q_\alpha+tQ_\beta$ , which is the explicit one-dimensional $\Wass_2$ geodesic.

Interactive panel. Use the time and endpoint controls to follow the one-dimensional Wasserstein geodesic through quantiles, CDFs, and densities.

In quantile coordinates, the interpolating measure is characterized by

\cumul{\al_t}^{-1}(r) = (1-t)\cumul{\al}^{-1}(r)+t\cumul{\be}^{-1}(r), \qquad r\in(0,1).

(89)
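For two uniform empirical measures with the same number of atoms, quantile matching reduces to sorting, and (89) becomes a linear interpolation of sorted samples. The sketch below assumes this equal-size discrete setting (where the sorted matching is an optimal permutation); the function names are illustrative.

```python
def w2_squared_sorted(xs, ys):
    """Squared W2 between uniform empirical measures with the same number of atoms."""
    assert len(xs) == len(ys)
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def quantile_interpolation(xs, ys, t):
    """Atoms of alpha_t: linear interpolation of the sorted samples, as in (89)."""
    return [(1 - t) * a + t * b for a, b in zip(sorted(xs), sorted(ys))]

xs = [0.9, -1.0, 0.2, 2.5]
ys = [3.0, 1.5, 4.0, 2.0]
mid = quantile_interpolation(xs, ys, 0.5)
total = w2_squared_sorted(xs, ys)
```

Because the interpolated atoms remain sorted, the squared distance from the source to `mid` is one quarter of `total`, which is the constant-speed geodesic property at $t=1/2$.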

OT on trees¶

The line is the simplest tree. On a general tree, the order is no longer total, but each edge still defines a cut and hence a cumulative imbalance. This gives an exact formula for the $\Wass_1$ cost associated with the tree geodesic distance. The formula is classical in fast earth-mover computations on tree metrics Ling & Okada, 2007 and is also the mechanism behind tree-sliced Wasserstein distances Le et al., 2019.

Proposition: Cumulative Formula on a Tree

Let $\mathsf{T}=(V,E,\ell)$ be a finite weighted tree, where each edge $e$ has length $\ell_e>0$ , and let $d_{\mathsf{T}}$ be its geodesic distance. Choose a root $o$ . For an edge $e=(u,v)$ oriented away from $o$ , let $V_e$ be the set of vertices in the connected component containing $v$ after removing $e$ . If $\al,\be\in\Pp(V)$ , then

\Wass_{1,d_{\mathsf{T}}}(\al,\be) = \min_{\pi\in\Couplings(\al,\be)} \sum_{x,y\in V} d_{\mathsf{T}}(x,y)\pi_{xy} = \sum_{e\in E}\ell_e \left|\al(V_e)-\be(V_e)\right|.

(90)

The value is computable in $O(|V|)$ arithmetic operations once the signed masses $\al-\be$ are stored on the vertices, by one postorder traversal of the rooted tree.
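The traversal can be sketched as follows. The tree encoding (a `parent` array and the length of each edge stored at its child vertex) is an assumption made for this example, not a convention from the text.

```python
def tree_w1(parent, length, alpha, beta):
    """W1 for the tree geodesic distance, via the subtree imbalances of formula (90).

    parent[v] is the parent of vertex v (None for the root) and length[v]
    is the length of the edge (parent[v], v). Vertices are 0, ..., n-1.
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    root = None
    for v, p in enumerate(parent):
        if p is None:
            root = v
        else:
            children[p].append(v)
    # Depth-first preorder; reversing it visits every child before its parent.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    # imbalance[v] accumulates alpha(V_e) - beta(V_e) for the edge e above v.
    imbalance = [a - b for a, b in zip(alpha, beta)]
    cost = 0.0
    for v in reversed(order):
        if parent[v] is not None:
            cost += length[v] * abs(imbalance[v])
            imbalance[parent[v]] += imbalance[v]
    return cost

# Chain 0-1-2-3 with unit edges: moving a Dirac from vertex 0 to vertex 3.
chain = tree_w1([None, 0, 1, 2], [0.0, 1.0, 1.0, 1.0], [1, 0, 0, 0], [0, 0, 0, 1])
# Star with leaves 1, 2, 3 at distances 1, 2, 3 from the root 0.
star = tree_w1([None, 0, 0, 0], [0.0, 1.0, 2.0, 3.0], [0, 1, 0, 0], [0, 0, 0.5, 0.5])
print(chain, star)
```

On the star, half of the mass travels from leaf 1 to leaf 2 (distance 3) and half to leaf 3 (distance 4), which agrees with the edgewise sum $1\cdot1+2\cdot0.5+3\cdot0.5$.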

For a chain rooted at one end, the sets $V_e$ are rays, so the tree formula is the discrete version of the one-dimensional identity (84). This edgewise cumulative decoupling is specific to the geodesic-distance cost $d_{\mathsf{T}}$, i.e. to $\Wass_1$. For powers $d_{\mathsf{T}}^p$ with $p>1$, the term $\big(\sum_{e\in[x,y]}\ell_e\big)^p$ couples all edges along a path, so the simple sum of absolute subtree imbalances no longer gives the optimal value. The tree structure remains algorithmically useful in a different sense: tree metrics give fast surrogates and embeddings for EMD-type computations on histograms Indyk & Thaper, 2003; Andoni et al., 2008; Ling & Okada, 2007, and randomized or data-adapted trees lead to tree-sliced Wasserstein distances that trade the exact Euclidean ground metric for much faster one-dimensional/tree computations Le et al., 2019.

The same tree can nevertheless be used as a genuine geodesic space. If $\pi$ is an optimal coupling for the squared tree distance $d_{\mathsf{T}}^2$, the associated displacement interpolation moves each packet of mass $\pi_{ij}$ at constant speed along the unique path from $x_i$ to $x_j$. The figure below shows this tree analogue of McCann interpolation. The intermediate measures are not necessarily supported on the original vertices: mass may sit inside edges while it travels through the branching structure.

McCann interpolation on a finite tree. A quadratic optimal plan is computed between two non-uniform vertex histograms for the squared geodesic distance on the tree. Each transported packet then moves along the unique tree path connecting its source and target vertices. Circle areas encode transported masses, colors interpolate from the source measure $\al$ in red to the target measure $\be$ in blue, and faint colored corridors mark the tree branches carrying transport.

Triangular Rearrangements¶

There is another canonical way to build transport maps in several dimensions: transport one coordinate at a time by conditional one-dimensional quantiles. This construction is not usually cost-optimal, but it gives a deterministic rearrangement under weak assumptions.

Proposition: Knothe--Rosenblatt Triangular Rearrangement

Let $\al,\be\in\Mm_+^1(\RR^d)$ . Assume that the first marginal of $\al$ and, recursively, the one-dimensional conditional source laws used below are atomless almost everywhere. Fix measurable versions of all regular conditional distributions. Then there is a triangular map

\T(x_1,\ldots,x_d) = (\T_1(x_1),\T_2(x_1,x_2),\ldots,\T_d(x_1,\ldots,x_d))

(92)

such that $\T_\sharp\al=\be$ and, for each $k$ , the function $x_k\mapsto\T_k(x_{<k},x_k)$ is nondecreasing for $\al_{<k}$ -almost every $x_{<k}=(x_1,\ldots,x_{k-1})$ .

The figure below shows the two-dimensional mechanism on image histograms.

Triangular rearrangement between the same cat and heart densities as in the McCann interpolation figure. The panels are computed directly on image histograms. The first three transitions move mass horizontally by the monotone rearrangement between the $x$ -marginals; the pivot has the target horizontal marginal. The last three transitions keep each column fixed and move mass vertically by one-dimensional monotone rearrangements between conditional laws.

Interactive panel. Use the horizontal and vertical interpolation sliders to inspect the Knothe triangular rearrangement one coordinate update at a time.

This construction transports successively along coordinate axes and is often called axis-wise transport. It depends on the chosen ordering of coordinates and is not generally optimal for the quadratic cost. It is nevertheless a useful limiting object: Brenier maps for increasingly anisotropic quadratic costs converge to triangular rearrangements under suitable assumptions Carlier et al., 2010.

Proposition: Anisotropic Brenier Maps Converge To Knothe--Rosenblatt

Let $\al,\be$ be compactly supported probability measures on $\RR^d$ with densities bounded above and below on rectangular supports. Their source and target conditional laws are therefore atomless almost everywhere. For $\epsilon>0$ , set

c_\epsilon(x,y) \eqdef \sum_{k=1}^d \epsilon^{k-1}|x_k-y_k|^2 .

(93)

Let $T_\epsilon$ be the Monge map from $\al$ to $\be$ for $c_\epsilon$ , and let $T_{\mathrm{KR}}$ be the triangular Knothe--Rosenblatt rearrangement with the same coordinate order. Then

T_\epsilon \longrightarrow T_{\mathrm{KR}} \qquad\text{in }L^2(\al;\RR^d) \qquad\text{as }\epsilon\to0 .

(94)

Example: Linear obstruction to composing Brenier maps

In higher dimension, Brenier maps for the quadratic cost are gradients of convex functions, and such maps do not generally remain gradients after composition. The simplest obstruction is linear. If $\T_1(x)=A_1x$ and $\T_2(x)=A_2x$ with $A_1,A_2$ symmetric positive definite, then $\T_2\circ\T_1$ has matrix $A_2A_1$ . It is a gradient field only when this product is symmetric, equivalently $A_1A_2=A_2A_1$ . Gaussian transport gives a concrete instance: between nondegenerate Gaussian laws, the Brenier map is affine with symmetric positive definite linear part. Compositions of Gaussian optimal maps are therefore optimal only in special commuting situations, for instance when all covariance matrices are simultaneously diagonalizable. Otherwise the composition contains a rotational or shearing component and is not the Brenier map between the initial and final Gaussians.
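The linear obstruction is easy to observe numerically. The two matrices below are arbitrary illustrative choices, and NumPy is assumed to be available.

```python
import numpy as np

A1 = np.array([[2.0, 0.0], [0.0, 1.0]])  # symmetric positive definite
A2 = np.array([[2.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
M = A2 @ A1                              # matrix of the composition T2 o T1

commutator = A1 @ A2 - A2 @ A1
asymmetry = M - M.T                      # equals minus the commutator
print(np.linalg.norm(commutator), np.linalg.norm(asymmetry))
```

Here `M` is not symmetric, so $\T_2\circ\T_1$ is not the gradient of a convex function, even though both factors are.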

Algorithm: Knothe--Rosenblatt triangular rearrangement

Input: Probability measures $\al,\be$ on $\RR^d$ with conditional laws.

Output: Knothe--Rosenblatt triangular map $\T$ .

Compute first-coordinate rearrangement: $\T_1=(F_{\be_1})^{-1}\circ F_{\al_1}.$

For $k=2,\ldots,d$ do:

Set $x_{<k}=(x_1,\ldots,x_{k-1})$ .
Compute conditional laws $\al^k_{x_{<k}}$ and $\be^k_{\T_{<k}(x_{<k})}$ .
Set $\T_k(x_{<k},x_k) = \bigl(F_{\be^k_{\T_{<k}(x_{<k})}}\bigr)^{-1} \circ F_{\al^k_{x_{<k}}}(x_k).$

Return $\T(x)=(\T_1(x_1),\T_2(x_1,x_2),\ldots,\T_d(x_1,\ldots,x_d)).$
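For centered nondegenerate Gaussians the algorithm has a closed form: the triangular map is linear, with matrix $L_\be L_\al^{-1}$ built from the lower-triangular Cholesky factors of the covariances. The sketch below uses this standard fact and compares it with the Brenier matrix for the anisotropic cost $c_\epsilon$, obtained by rescaling coordinate $k$ by $\epsilon^{(k-1)/2}$. Gaussians are not compactly supported, so this is an illustration of the limit rather than an instance of the proposition; all function names are hypothetical and NumPy is assumed.

```python
import numpy as np

def sqrtm_psd(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def knothe_matrix(cov_a, cov_b):
    """Lower-triangular matrix L_b L_a^{-1} built from Cholesky factors."""
    La, Lb = np.linalg.cholesky(cov_a), np.linalg.cholesky(cov_b)
    return Lb @ np.linalg.inv(La)

def brenier_matrix(cov_a, cov_b):
    """Symmetric positive definite solution A of A cov_a A = cov_b."""
    Ra = sqrtm_psd(cov_a)
    Ra_inv = np.linalg.inv(Ra)
    return Ra_inv @ sqrtm_psd(Ra @ cov_b @ Ra) @ Ra_inv

def anisotropic_brenier_matrix(cov_a, cov_b, eps):
    """Linear Monge map for c_eps: rescale by D, solve the quadratic case, undo D."""
    d = cov_a.shape[0]
    D = np.diag(eps ** (np.arange(d) / 2.0))
    A = brenier_matrix(D @ cov_a @ D, D @ cov_b @ D)
    return np.linalg.inv(D) @ A @ D

cov_a = np.array([[2.0, 0.5], [0.5, 1.0]])
cov_b = np.array([[1.0, -0.3], [-0.3, 1.5]])
K = knothe_matrix(cov_a, cov_b)                          # triangular
B = brenier_matrix(cov_a, cov_b)                         # symmetric
K_eps = anisotropic_brenier_matrix(cov_a, cov_b, 1e-4)   # close to K
```

Both `K` and `B` push the source covariance to the target covariance, but only `B` is symmetric; `K_eps` approaches `K` as `eps` decreases.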

Gaussian Measures And The Bures Metric¶

Gaussian measures form the most important finite-dimensional family preserved by quadratic optimal transport. The mean moves linearly, while the covariance follows the Bures--Wasserstein geometry of positive semidefinite matrices.

One-Dimensional Gaussians¶

Let $\al=\Gaussian(m_\al,\sigma_\al^2)$ and $\be=\Gaussian(m_\be,\sigma_\be^2)$ be nondegenerate Gaussians on $\RR$ . Then

\T(x)=\frac{\sigma_\be}{\sigma_\al}(x-m_\al)+m_\be

(97)

satisfies $\T_\sharp\al=\be$ . It is the derivative of the convex function

\phi(x)=\frac{\sigma_\be}{2\sigma_\al}(x-m_\al)^2+m_\be x,

(98)

so Brenier’s theorem shows that it is the optimal quadratic transport. The distance is

\Wass_2(\al,\be)^2 = (m_\al-m_\be)^2+(\sigma_\al-\sigma_\be)^2.

(99)

Thus the OT geometry of one-dimensional Gaussians is the Euclidean geometry of the closed half-plane $(m,\sigma)\in\RR\times\RR_+$ . By contrast, the Fisher--Rao boundary $\sigma=0$ is infinitely far from every nondegenerate Gaussian, and the KL divergence from a nondegenerate Gaussian to a singular one is infinite.
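Formula (99) can be compared with a direct quadrature of the quantile expression $\int_0^1(q_\al(r)-q_\be(r))^2\d r$. The sketch below uses a midpoint rule, an arbitrary illustrative choice, and `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

def w2_squared_closed_form(m_a, s_a, m_b, s_b):
    """Formula (99): squared Euclidean distance in the (m, sigma) half-plane."""
    return (m_a - m_b) ** 2 + (s_a - s_b) ** 2

def w2_squared_quantile(m_a, s_a, m_b, s_b, n=20000):
    """Midpoint rule for the integral of (q_alpha(r) - q_beta(r))^2 over (0, 1)."""
    a, b = NormalDist(m_a, s_a), NormalDist(m_b, s_b)
    levels = ((k + 0.5) / n for k in range(n))
    return sum((a.inv_cdf(r) - b.inv_cdf(r)) ** 2 for r in levels) / n

exact = w2_squared_closed_form(0.0, 1.0, 2.0, 3.0)
approx = w2_squared_quantile(0.0, 1.0, 2.0, 3.0)
print(exact, approx)
```

The two values agree up to the quadrature error, which comes mostly from the unbounded quantiles near levels 0 and 1.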

Multivariate Gaussians¶

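Consider two Gaussian laws and an affine map between them,

\al=\Gaussian(\mean_\al,\cov_\al), \qquad \be=\Gaussian(\mean_\be,\cov_\be), \qquad \T(x)=\mean_\be+A(x-\mean_\al).

(100)

The map satisfies $\T_\sharp\al=\be$ exactly when $A\cov_\al A^\top=\cov_\be$, and $\T$ is the gradient of a convex quadratic potential if and only if $A$ is symmetric positive semidefinite.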

Proposition: Gaussian $\Wass_2$ Formula And Bures Covariance Term

Assume that $\cov_\al$ and $\cov_\be$ are positive definite. The unique symmetric positive-definite solution of $A\cov_\al A=\cov_\be$ is

A= \cov_\al^{-1/2} \left(\cov_\al^{1/2}\cov_\be\cov_\al^{1/2}\right)^{1/2} \cov_\al^{-1/2}.

(104)

The affine map $\T(x)=\mean_\be+A(x-\mean_\al)$ is the optimal quadratic-cost transport from $\Gaussian(\mean_\al,\cov_\al)$ to $\Gaussian(\mean_\be,\cov_\be)$ , and

\Wass_2(\al,\be)^2 = \|\mean_\al-\mean_\be\|^2+\Bb(\cov_\al,\cov_\be)^2,

(105)

where $\Bb$ is the Bures metric (see Definition: Bures Metric).
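Formulas (104) and (105) can be sketched as follows. The code assumes the standard expression $\Bb(\cov_\al,\cov_\be)^2=\tr\cov_\al+\tr\cov_\be-2\tr\bigl((\cov_\al^{1/2}\cov_\be\cov_\al^{1/2})^{1/2}\bigr)$ for the Bures term, computes matrix square roots by eigendecomposition, and relies on NumPy; the function names are illustrative.

```python
import numpy as np

def sqrtm_psd(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_ot(mean_a, cov_a, mean_b, cov_b):
    """Return the matrix A of (104) and W2^2 of (105) for nondegenerate Gaussians."""
    Ra = sqrtm_psd(cov_a)
    Ra_inv = np.linalg.inv(Ra)
    cross = sqrtm_psd(Ra @ cov_b @ Ra)
    A = Ra_inv @ cross @ Ra_inv
    bures_sq = np.trace(cov_a) + np.trace(cov_b) - 2.0 * np.trace(cross)
    return A, float(np.sum((mean_a - mean_b) ** 2) + bures_sq)

mean_a, cov_a = np.array([0.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mean_b, cov_b = np.array([1.0, -1.0]), np.array([[1.0, -0.3], [-0.3, 1.5]])
A, w2_sq = gaussian_ot(mean_a, cov_a, mean_b, cov_b)

# Direct transport cost of the affine map: |m_a - m_b|^2 + tr((A - I) cov_a (A - I)).
I = np.eye(2)
direct = float(np.sum((mean_a - mean_b) ** 2) + np.trace((A - I) @ cov_a @ (A - I)))
print(w2_sq, direct)
```

The agreement between `w2_sq` and `direct` reflects that the Bures term is exactly the cost of moving the centered source by the linear map $A$.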

Remark: Common-generator elliptical laws

The Gaussian formula has a useful, but deliberately limited, extension beyond Gaussian tails. Fix an absolutely continuous, centered, radial probability measure $\rho$ on $\RR^d$ with covariance $\Id$ . For positive definite matrices $\cov_\al,\cov_\be$ , define two elliptically contoured laws with this same radial generator by

\al=(z\mapsto \mean_\al+\cov_\al^{1/2}z)_\sharp\rho, \qquad \be=(z\mapsto \mean_\be+\cov_\be^{1/2}z)_\sharp\rho.

(116)

For this pair, the same Brenier matrix (104) gives the optimal map $\T(x)=\mean_\be+A(x-\mean_\al)$, and the same formula (105) holds. Indeed, write $X=\mean_\al+\cov_\al^{1/2}Z$ with $Z\sim\rho$, so that $\T(X)=\mean_\be+A\cov_\al^{1/2}Z$. The identity $A\cov_\al A=\cov_\be$ says that $(A\cov_\al^{1/2})(A\cov_\al^{1/2})^\top=\cov_\be$, hence $A\cov_\al^{1/2}=\cov_\be^{1/2}Q$ for some orthogonal matrix $Q$. Since $\rho$ is radial, $QZ$ has the same law as $Z$, so $\T_\sharp\al=\be$. Since $A$ is symmetric positive definite, $\T$ is the gradient of a convex quadratic potential and Brenier’s theorem gives optimality. Finally, because $\rho$ is centered with covariance $\Id$, the transport cost depends only on the means and covariance matrices and gives the Bures expression. The common-generator assumption is the point: for two elliptical laws with different radial profiles, one must also transport the radial variable, and the covariance/Bures term is no longer the full transport cost.

The covariance term $\Bb$ is the Bures--Wasserstein metric on positive semidefinite matrices Bures, 1969; Gelbrich, 1990; Bhatia et al., 2019. It separates Euclidean displacement of the mean from the intrinsic transport geometry of covariance ellipsoids. For $2\times2$ covariance matrices, this geometry can be seen inside a familiar cone. The useful coordinates separate trace, anisotropy and correlation. Writing

\Sigma=\begin{pmatrix} a & c \\ c & b \end{pmatrix}, \qquad t=\frac{a+b}{\sqrt2},\qquad u=\frac{a-b}{\sqrt2},\qquad v=\sqrt2\,c,

(117)

identifies the vector space of symmetric $2\times2$ matrices with $\RR^3$ through an orthonormal change of coordinates for the Frobenius inner product; the inverse map is

a=\frac{t+u}{\sqrt2},\qquad b=\frac{t-u}{\sqrt2},\qquad c=\frac{v}{\sqrt2}.

(118)

Moreover,

t^2-u^2-v^2=2(ab-c^2)=2\det(\Sigma).

(119)

The condition $\Sigma\succeq0$ is therefore equivalent to

t\geq \sqrt{u^2+v^2}.

(120)

Thus the cone of $2\times2$ covariance matrices is the Lorentz, or ice-cream, cone. The Bures distance is not the ambient Euclidean distance in these coordinates: on the positive definite interior it is the geodesic distance of a smooth nonlinear geometry, whose geodesics are the covariance parts of Gaussian optimal transport rather than the straight chords inherited from $\RR^3$ . The cone panel below illustrates this distinction.

The figure below displays the Euclidean half-plane geometry of one-dimensional Gaussians and the corresponding displacement interpolation of the Gaussian densities, together with the two-dimensional covariance paths.

One- and two-dimensional Gaussian $\Wass_2$ geodesics. In one dimension, the coordinates $(m,\sigma)$ turn geodesics into Euclidean segments in the upper half-plane. Both paths share red and blue endpoint colors; the first interpolates directly between them, whereas the second passes through green at mid-time. In two dimensions, means move linearly while covariance ellipses follow the Bures--Wasserstein interpolation. The cone panel displays the same two covariance paths inside the $2\times2$ positive-semidefinite cone, with $u$ and $v$ horizontal, $t$ vertical, and faint gray chords showing the ambient Euclidean segments for comparison.

Interactive panel. Use the target mean, variance, and angle controls to see how the Gaussian Wasserstein geodesic moves means and covariance ellipses.

Remark: Comparison with the Fisher--Rao metric with mean variation

Formula (99) shows that the Wasserstein geometry of one-dimensional Gaussians is Euclidean in $(m,\sigma)$. The Fisher--Rao geometry obtained from the local expansion of the Kullback--Leibler divergence is different as soon as the mean is allowed to vary Costa et al., 2015. For $\sigma,\sigma'>0$,

\KL\big(\Gaussian(m,\sigma^2)\mid\Gaussian(m',(\sigma')^2)\big) = \log\frac{\sigma'}{\sigma} + \frac{\sigma^2+(m-m')^2}{2(\sigma')^2} - \frac12 .

(121)

Expanding the first argument around $(m,\sigma)$ gives, for increments $(h,s)$ ,

\KL\big(\Gaussian(m+\varepsilon h,(\sigma+\varepsilon s)^2) \mid \Gaussian(m,\sigma^2)\big) = \frac{\varepsilon^2}{2} \left(\frac{h^2}{\sigma^2}+2\frac{s^2}{\sigma^2}\right) +o(\varepsilon^2).

(122)

Thus the Fisher--Rao metric on the Gaussian half-plane is

g_{(m,\sigma)}^{\mathrm{FR}}((h,s),(h',s')) = \frac{hh'+2ss'}{\sigma^2}.

(123)

Setting $z=m/\sqrt2+i\sigma$ identifies this metric with twice the usual hyperbolic metric on the upper half-plane. Its geodesics are therefore vertical lines or Euclidean semicircles orthogonal to the boundary $\sigma=0$ after this horizontal rescaling. Consequently

d_{\mathrm{FR}}\big(\Gaussian(m,\sigma^2),\Gaussian(m',(\sigma')^2)\big) = \sqrt2\,\operatorname{arcosh}\left( 1+\frac{(m-m')^2+2(\sigma-\sigma')^2}{4\sigma\sigma'} \right).

(124)

In contrast, $\Wass_2^2=(m-m')^2+(\sigma-\sigma')^2$. Wasserstein geodesics are straight segments in the $(m,\sigma)$ half-plane and reach $\sigma=0$ at finite distance, whereas Fisher--Rao geodesics bend away from the boundary and the boundary $\sigma=0$ is infinitely far away. The figure below displays the resulting difference in both parameter space and density space.

Wasserstein and Fisher--Rao geodesics in the one-dimensional Gaussian family. Both interpolations share red and blue endpoint colors. The Wasserstein curves interpolate directly between them, whereas the Fisher--Rao curves pass through green at mid-time; the same palettes identify both geodesics in the left parameter-space panel. The two density panels use the same endpoint Gaussians and time samples, but the Fisher--Rao path expands the standard deviation along its hyperbolic arc before returning to the target scale.

Interactive panel. Move the endpoint and time controls to compare the straight Wasserstein path with the Fisher--Rao hyperbolic path in the Gaussian half-plane.
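The formulas of this remark can be cross-checked numerically. The sketch below implements (121), (124) and the Wasserstein distance with the standard library only, then compares the Kullback--Leibler divergence of a small perturbation with half the squared Fisher--Rao distance; the helper names and test values are illustrative.

```python
import math

def kl_gauss(m, s, m2, s2):
    """KL(N(m, s^2) | N(m2, s2^2)), formula (121)."""
    return math.log(s2 / s) + (s * s + (m - m2) ** 2) / (2 * s2 * s2) - 0.5

def fisher_rao(m, s, m2, s2):
    """Fisher--Rao distance (124) on the Gaussian half-plane."""
    ratio = ((m - m2) ** 2 + 2 * (s - s2) ** 2) / (4 * s * s2)
    return math.sqrt(2) * math.acosh(1 + ratio)

def wasserstein(m, s, m2, s2):
    """W2 distance between one-dimensional Gaussians."""
    return math.hypot(m - m2, s - s2)

# Local agreement: KL of a small perturbation is about half the squared FR distance.
eps, h, s_inc, m, s = 1e-3, 0.7, -0.4, 0.5, 1.3
kl = kl_gauss(m + eps * h, s + eps * s_inc, m, s)
d = fisher_rao(m + eps * h, s + eps * s_inc, m, s)
print(kl, 0.5 * d * d)

# Boundary behaviour: Fisher--Rao diverges as sigma' -> 0, Wasserstein does not.
for s2 in (1e-1, 1e-3, 1e-6):
    print(s2, fisher_rao(0.0, 1.0, 0.0, s2), wasserstein(0.0, 1.0, 0.0, s2))
```

For equal means the Fisher--Rao distance reduces to $\sqrt2\,|\log(\sigma'/\sigma)|$, which makes the logarithmic divergence at the boundary explicit.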

The cone panel of the Gaussian geodesic figure above illustrates the difference between Bures geodesics and ambient Euclidean chords.

The two-dimensional Gaussian panels of that figure show covariance ellipses evolving along the Bures--Wasserstein interpolation, together with the same covariance paths drawn in cone coordinates. The first path uses a direct red-to-blue palette, whereas the second shares these endpoints but passes through green. The accompanying interactive panel varies the same Gaussian ingredients in real time.

Remark: Comparison with the Fisher--Rao metric for zero-mean Gaussians

The Bures metric is the covariance geometry induced by quadratic optimal transport. A different, information-geometric metric is obtained by expanding the Kullback--Leibler divergence between centered Gaussians. For $\Sigma,\Sigma'\in\mathbb S_{++}^d$ , write

\KL(\Sigma\mid\Sigma') \eqdef \KL\big(\Gaussian(0,\Sigma)\mid\Gaussian(0,\Sigma')\big) = \frac12\left( \tr((\Sigma')^{-1}\Sigma)-d-\log\det((\Sigma')^{-1}\Sigma) \right).

(125)

If $D=D^\top$ and $\Sigma+\varepsilon D$ remains positive definite, then

\KL(\Sigma+\varepsilon D\mid\Sigma) = \frac{\varepsilon^2}{2}\langle Q_\Sigma(D),D\rangle_F+o(\varepsilon^2), \qquad Q_\Sigma(D)=\frac12\Sigma^{-1}D\Sigma^{-1},

(126)

where $\langle A,B\rangle_F=\tr(A^\top B)$ is the Frobenius pairing. Thus the local quadratic form is

g_\Sigma(D,E)=\frac12\tr(\Sigma^{-1}D\Sigma^{-1}E).

(127)

With this normalization, the Fisher--Rao, or affine-invariant, geodesic distance on $\mathbb S_{++}^d$ is

d_{\mathrm{FR}}(\Sigma,\Sigma')^2 = \frac12 \norm{\log(\Sigma^{-1/2}\Sigma'\Sigma^{-1/2})}_F^2,

(128)

where $\log$ denotes the principal matrix logarithm, and the corresponding geodesic is

\Sigma_t^{\mathrm{FR}} = \Sigma^{1/2} \left(\Sigma^{-1/2}\Sigma'\Sigma^{-1/2}\right)^t \Sigma^{1/2}.

(129)

The contrast with Bures is especially visible near the boundary of the covariance cone. For example, $\Bb^2(\Sigma,0)=\tr(\Sigma)$, and more generally the Bures formula extends continuously to $\mathbb S_+^d$; rank-deficient covariance matrices are therefore at finite distance and the closed cone is part of the metric completion. Fisher--Rao is different. If an eigenvalue of $\Sigma^{-1/2}\Sigma'\Sigma^{-1/2}$ tends to zero, then the corresponding logarithm diverges in $d_{\mathrm{FR}}$. Hence the boundary, made of degenerate covariance matrices and representing infinitely anisotropic Gaussian limits on fixed-trace sections, is at infinite distance. The figure below shows this in the $2\times2$ ice-cream cone coordinates: the Bures path reaches a rank-one covariance, while the Fisher--Rao path is drawn only to a small positive-definite regularization of the same rank-one limit.

Bures--Wasserstein and Fisher--Rao covariance geodesics in the $2\times2$ positive-semidefinite cone. The three trajectories use distinct red-to-blue, orange-to-violet, and teal-to-gold palettes, repeated identically in both panels, while faint gray chords show the ambient Euclidean segments. The Bures paths reach rank-one covariances on the cone boundary. The Fisher--Rao paths use positive-definite regularizations with the same dominant directions because the limiting rank-one covariances are not at finite Fisher--Rao distance.

Interactive panel. Move the rank-one limiting direction and the positive-definite Fisher--Rao floor. The Bures path is allowed to touch the closed covariance cone, while the Fisher--Rao path stays inside the open cone.
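The covariance formulas (128) and (129) and the boundary contrast with Bures can be sketched as follows. The code assumes NumPy, computes matrix functions by eigendecomposition, and uses the standard trace expression for the squared Bures distance; all names and test matrices are illustrative.

```python
import numpy as np

def sym_fun(S, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T

def psd_sqrt(w):
    """Square root of eigenvalues, clipping tiny negative rounding errors."""
    return np.sqrt(np.clip(w, 0.0, None))

def fisher_rao_geodesic(S0, S1, t):
    """Affine-invariant geodesic (129) between positive definite matrices."""
    R = sym_fun(S0, np.sqrt)
    R_inv = np.linalg.inv(R)
    return R @ sym_fun(R_inv @ S1 @ R_inv, lambda w: w ** t) @ R

def fisher_rao_dist(S0, S1):
    """Distance (128), with the factor 1/2 of the normalization used in the text."""
    R_inv = np.linalg.inv(sym_fun(S0, np.sqrt))
    w = np.linalg.eigvalsh(R_inv @ S1 @ R_inv)
    return float(np.sqrt(0.5 * np.sum(np.log(w) ** 2)))

def bures_sq(S0, S1):
    """Squared Bures distance; S1 may be rank-deficient."""
    R = sym_fun(S0, psd_sqrt)
    cross = sym_fun(R @ S1 @ R, psd_sqrt)
    return float(np.trace(S0) + np.trace(S1) - 2.0 * np.trace(cross))

S0 = np.array([[2.0, 0.5], [0.5, 1.0]])
S1 = np.array([[1.0, -0.3], [-0.3, 1.5]])
mid = fisher_rao_geodesic(S0, S1, 0.5)

rank_one = np.array([[1.0, 0.0], [0.0, 0.0]])
for delta in (1e-1, 1e-3, 1e-6):  # positive-definite regularizations of rank_one
    S_reg = rank_one + delta * np.eye(2)
    print(delta, fisher_rao_dist(S0, S_reg), bures_sq(S0, S_reg))
```

As `delta` decreases, the Fisher--Rao distance grows without bound while the squared Bures distance converges to its finite value at the rank-one limit.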

References¶

Monge, G. (1781). Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale Des Sciences, 666–704.
Villani, C. (2003). Topics in Optimal Transportation (Vol. 58). American Mathematical Society.
Villani, C. (2009). Optimal Transport: Old and New (Vol. 338). Springer.
Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Birkhäuser.
Rachev, S. T., & Rüschendorf, L. (1998). Mass Transportation Problems: Volume I: Theory. Springer.
Rudin, W. (1987). Real and Complex Analysis (3rd ed.). McGraw–Hill.
Bogachev, V. I. (2007). Measure Theory. Springer.
Reinhard, E., Adhikhmin, M., Gooch, B., & Shirley, P. (2001). Color Transfer between Images. IEEE Computer Graphics and Applications, 21(5), 34–41.
Pitié, F., Kokaram, A. C., & Dahyot, R. (2005). N-dimensional Probability Density Function Transfer and Its Application to Color Transfer. IEEE International Conference on Computer Vision, 1434–1439.
Rabin, J., Peyré, G., Delon, J., & Bernot, M. (2011). Wasserstein barycenter and its application to texture mixing. International Conference on Scale Space and Variational Methods in Computer Vision, 435–446.
Brenier, Y. (1987). Décomposition polaire et réarrangement monotone des champs de vecteurs. C. R. Acad. Sci. Paris Sér. I Math., 305(19), 805–808.
Brenier, Y. (1991). Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4), 375–417.
Gangbo, W., & McCann, R. J. (1996). The geometry of optimal transportation. Acta Mathematica, 177(2), 113–161.
McCann, R. J. (1997). A convexity principle for interacting gases. Advances in Mathematics, 128(1), 153–179.
Caffarelli, L. (2003). The Monge-Ampere equation and optimal transportation, an elementary review. Lecture Notes in Mathematics, Springer-Verlag, 1–10.
