Generalized OT Problems

This chapter changes the optimization problem rather than only the ground distance. Barycenters average several measures, multi-marginal OT couples many measures at once, low-rank and capacity constraints restrict the admissible plans, inverse OT learns the cost from observed transport, and weak or martingale OT acts on conditional laws. These models remain close to Kantorovich optimization, but the unknown can now be a family of couplings, a factored plan, a learned cost, or a coupling subject to nonlinear conditional constraints.

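from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

# Locate the directory containing ot4ml_web.py by probing a few likely
# locations relative to the current working directory.
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        # Make the helper module importable from this notebook.
        sys.path.insert(0, str(myst_dir))
        break

if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

# Figure thumbnails are stored under the parent of the located directory.
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"

def show_book_figure(name, width=760):
    """Display the PNG thumbnail called `name` at the given pixel width."""
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))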

OT Barycenters

Barycenters ask how to average probability measures rather than points. This section explains the variational definition, the special closed forms in one dimension and for Gaussians, and the entropic algorithms used in practice.

Frechet Means

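The natural formulation is a Frechet-mean problem on the space of probability measures: the unknown is the barycenter measure itself, and its support is not prescribed. Given input measures $(\beta_s)_{s=1}^S$ and nonnegative weights $(\lambda_s)_{s=1}^S$, it uses the continuous Kantorovich value $\mathcal L_c$ defined in (40) and reads

\min_{\alpha\in\Pp(\Xx)} \sum_{s=1}^S \lambda_s\mathcal L_c(\alpha,\beta_s).

(1)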

Unlike a coupling, the barycenter is a new probability measure on $\Xx$. Since the weights $\lambda_s$ are nonnegative, problem (1) is convex in $\alpha$: Proposition "Convexity in the Marginals and Concavity in the Cost" shows that the continuous Kantorovich value is jointly convex in its two marginals. Agueh and Carlier introduced this problem, following earlier ideas of Carlier and Ekeland (Agueh & Carlier, 2011; Carlier & Ekeland, 2010). For the quadratic cost on $\Xx=\RR^d$, a barycenter exists under the finite-second-moment assumption. It is unique if at least one positive-weight input is absolutely continuous; more general criteria ensure uniqueness through an essentially unique multi-marginal barycentric map. Discrete existence, consistency, and fixed-point constructions are studied in Anderes et al., 2016; Esteban et al., 2016; Le Gouic & Loubes, 2016.

Fixed-support discrete barycenters

For computation, one often turns the preceding infinite-dimensional problem into a finite one by prescribing possible barycenter locations. Assume the inputs are discrete,

\beta_s=\sum_{j=1}^{n_s} b_{s,j}\delta_{x_{s,j}}, \qquad b_s=(b_{s,j})_{j=1}^{n_s}\in\simplex_{n_s}.

(2)

Choose candidate barycenter sites $(y_i)_{i=1}^n$ and restrict the unknown to $\alpha=\sum_i a_i\delta_{y_i}$ . For each input $s$ , the cost $\mathcal L_c(\alpha,\beta_s)$ then becomes a finite Kantorovich problem with cost matrix

(C_s)_{ij}=c(y_i,x_{s,j})\in\RR^{n\times n_s}.

(3)

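Thus its value is $\mathcal L_{C_s}(a,b_s)$, in the notation of the discrete Kantorovich problem (9), and the fixed-support barycenter problem reads

\min_{a\in\simplex_n} \sum_{s=1}^S \lambda_s\mathcal L_{C_s}(a,b_s).

(4)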

This construction is a finite-dimensional restriction, not an exact discrete reduction of the general barycenter problem. In the ordinary two-marginal Kantorovich problem, once both marginals are discrete, the two supports are known and the whole problem is exactly the matrix optimization (9) on their product support. For barycenters, the input supports do not determine the support of the unknown barycenter: a minimizer may place mass outside the chosen sites $(y_i)_i$ and outside the union of the input supports. Once the candidate support is fixed, however, the nonnegative weights $\lambda_s$ and Proposition "Joint Convexity of Discrete OT" show that problem (4) is convex in $a$: the discrete Kantorovich value is jointly convex in its two histograms.
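The effect of the candidate support can be checked by hand on a toy case. The sketch below assumes the quadratic cost $c(x,y)=|x-y|^2$ on the line and two hypothetical Dirac inputs; all names (`input_atoms`, `site_cost`, `fixed_support_value`) are illustrative. For Dirac inputs the only coupling between $\alpha=\sum_i a_i\delta_{y_i}$ and $\delta_{x_s}$ is the product, so the fixed-support objective is linear in $a$ and its minimum over the simplex is reached at the cheapest site.

```python
# Hypothetical toy data: two Dirac inputs on the line with equal weights.
input_atoms = [0.0, 1.0]   # beta_1 = delta_0, beta_2 = delta_1
weights = [0.5, 0.5]       # lambda_s, nonnegative and summing to one


def site_cost(y):
    """Objective value of alpha = delta_y: sum_s lambda_s * |y - x_s|^2."""
    return sum(lam * (y - x) ** 2 for lam, x in zip(weights, input_atoms))


def fixed_support_value(sites):
    """Minimum of the fixed-support objective over histograms a on `sites`.

    For Dirac inputs the objective equals sum_i a_i * site_cost(y_i), which
    is linear in a, so the minimum over the simplex is the cheapest site.
    """
    return min(site_cost(y) for y in sites)


# Candidate sites restricted to the union of the input supports.
restricted_value = fixed_support_value(input_atoms)

# Same sites plus the weighted average of the atoms, which lies outside
# the union of the input supports.
free_site = sum(lam * x for lam, x in zip(weights, input_atoms))
enlarged_value = fixed_support_value(input_atoms + [free_site])

print(restricted_value, enlarged_value)
```

In this example the value drops from 0.5 to 0.25 once the midpoint is allowed as a site, which is the gap between the restricted problem and the free-support one.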

For quadratic costs, the multi-marginal formulation of Section Multimarginal OT shows that, for discrete inputs, one may choose a barycenter supported on weighted averages $\sum_s\lambda_s x_{s,i_s}$ of one support point from each input. This exact candidate set can contain $\prod_s n_s$ points, but Corollary "Sparse Discrete Barycenters" shows that there exists a barycenter for which at most $\sum_s n_s-S+1$ of them carry positive mass. Prescribing the support $(y_i)_i$ before solving (4) is nevertheless a numerical approximation, because the active weighted averages are not known in advance.

The figure below moves beyond this degenerate case and compares barycenter grids obtained from one-dimensional quantile averaging and two-dimensional entropic transport under the same bilinear corner weights.

Wasserstein barycenter grids for four corner measures. The left panel uses the one-dimensional formula $Q_{u,v}=\sum_{i,j}\lambda_{ij}(u,v)Q_{ij}$ for one Gaussian law and three asymmetric two-Gaussian mixtures, and displays densities reconstructed from the averaged quantiles. The right panel computes entropic Wasserstein barycenters on a common pixel grid for the cat, two-disk, cross and clover silhouettes, using the normalized squared ground cost, $\epsilon=4\cdot10^{-4}$ and a Sinkhorn tolerance of $5\cdot10^{-8}$ . The barycenters are rendered as density images with values clamped at their $95\%$ quantile rather than by threshold contours. Colors interpolate between the four corners and encode the same bilinear weights in both panels.

The interactive demo below keeps the exact one-dimensional formula visible: the two coordinates set bilinear weights on the four corner laws, the middle panel averages their quantile functions, and the right panel reconstructs the resulting barycenter density.

Interactive panel. Use the barycentric coordinate controls to move through the four input laws and compare quantile and entropic barycenter constructions.

Remark: Mean of a quadratic barycenter

For $c(x,y)=\norm{x-y}^2$ , the mean of the barycenter $\al^\star$ is necessarily the barycenter of the means,

\int_\Xx x \d\al^\star(x) = \sum_s \la_s \int_\Xx x \d\be_s(x).

(5)

Write $m_\alpha=\int x\,\d\alpha(x)$ and let $\alpha^0=(x\mapsto x-m_\alpha)_\sharp\alpha$ be the centered translate. Then

\Wass_2^2(\alpha,\beta) = \norm{m_\alpha-m_\beta}^2+\Wass_2^2(\alpha^0,\beta^0).

(6)

The cross term vanishes under every coupling of the centered measures. The barycenter objective therefore separates into a centered term and the strictly convex Euclidean function $m\mapsto\sum_s\lambda_s\norm{m-m_{\beta_s}}^2$, whose minimizer is $\sum_s\lambda_s m_{\beta_s}$. If the inputs have compact support, Proposition "Multi-Marginal Formula for Quadratic Barycenters" also gives a barycenter supported in the convex hull of their supports.
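The decomposition (6) can be verified numerically in the simplest setting. The sketch below is restricted to the real line and to uniform empirical measures with equally many atoms, where $\Wass_2^2$ is obtained by matching sorted samples; the sample laws and the helper names (`w2_squared_1d`, `mean`) are assumptions made for illustration.

```python
import random


def w2_squared_1d(x, y):
    """Squared W2 between two uniform empirical measures on the line.

    Both samples must have the same number of atoms; the optimal coupling
    then matches sorted values.
    """
    xs, ys = sorted(x), sorted(y)
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)
x = [rng.gauss(1.0, 0.5) for _ in range(200)]   # samples of alpha
y = [rng.expovariate(1.0) for _ in range(200)]  # samples of beta

m_x, m_y = mean(x), mean(y)
x_centered = [a - m_x for a in x]
y_centered = [b - m_y for b in y]

lhs = w2_squared_1d(x, y)
rhs = (m_x - m_y) ** 2 + w2_squared_1d(x_centered, y_centered)
print(lhs, rhs)
```

The two printed numbers agree up to floating-point rounding, because the cross term is proportional to the difference of the centered means, which is zero.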

One-Dimensional Case

On the line, barycenters become linear after the quantile change of variables. This gives the rare case where the barycenter is explicit rather than the solution of a high-dimensional optimization problem.
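As a minimal sketch of this quantile averaging, assume every input is a uniform empirical measure with the same number $n$ of atoms. The $k$-th sorted atom then plays the role of the quantile at level $(k+1/2)/n$, and the barycenter atoms are the weighted averages of the sorted inputs. The function name `quantile_barycenter` and the data are hypothetical.

```python
def quantile_barycenter(samples, weights):
    """Atoms of the 1-D quadratic barycenter of uniform empirical measures.

    `samples` is a list of lists with equal lengths; `weights` are the
    nonnegative barycentric weights lambda_s, assumed to sum to one.
    """
    sorted_samples = [sorted(s) for s in samples]
    n = len(sorted_samples[0])
    if any(len(s) != n for s in sorted_samples):
        raise ValueError("this sketch assumes equally many atoms per input")
    return [
        sum(w * s[k] for w, s in zip(weights, sorted_samples))
        for k in range(n)
    ]


# Two small inputs given in arbitrary order.
barycenter_atoms = quantile_barycenter(
    [[2.0, 0.0, 1.0], [12.0, 14.0, 10.0]],
    [0.5, 0.5],
)
print(barycenter_atoms)
```

Inputs with different numbers of atoms or non-uniform weights would require interpolating the quantile functions on a common grid of levels, which this sketch does not attempt.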

Gaussian Case

Gaussian barycenters show that the same separation as in the Gaussian Wasserstein formula persists: means average linearly, while covariances average according to the Bures--Wasserstein geometry.

Proposition: Nondegenerate Gaussian inputs remain Gaussian

Let the positive-weight inputs be $\beta_s=\Gaussian(\mean_s,\cov_s)$ , with $\cov_s\succeq0$ , and assume that at least one $\cov_s$ is positive definite. The quadratic Wasserstein barycenter is unique and has the form $\alpha^\star=\Gaussian(\mean,\cov)$ , where

\mean=\sum_s\lambda_s\mean_s.

(9)

Its positive-definite covariance $\cov$ is the unique minimizer of the Bures objective

\cov \mapsto \sum_s \la_s \Bb(\cov,\cov_s)^2.

(10)

Equivalently, defining the map on the positive-definite cone

\Psi_{\lambda}(X) := \sum_s \la_s \pa{X^{1/2}\cov_sX^{1/2}}^{1/2},

(11)

its covariance is the unique solution of the fixed-point equation

\Psi_{\lambda}(\cov)=\cov.

(12)

In dimension one, if $\sigma_s=\sqrt{\cov_s}$ denotes the input standard deviation, then the barycenter standard deviation is $\sigma=\sum_s\lambda_s\sigma_s$ .

If all input covariances are singular, the same contraction argument still gives a Gaussian barycenter, but the uniqueness step can fail and non-Gaussian barycenters may coexist. Thus nondegeneracy is essential when asserting that every barycenter is Gaussian.

Remark: Forward KL barycenter of Gaussian laws

The contrast with Wasserstein averaging is particularly sharp for the forward KL barycenter. Assume now that every $\cov_s\succ0$ , and define

\alpha_{\mathrm{KL}} := \argmin_{\alpha\in\Pp(\RR^d)} \sum_s\lambda_s\KL(\alpha\mid\beta_s).

(16)

Then $\alpha_{\mathrm{KL}}=\Gaussian(\mean_{\mathrm{KL}},\cov_{\mathrm{KL}})$ , with

\cov_{\mathrm{KL}} = \left(\sum_s\lambda_s\cov_s^{-1}\right)^{-1}, \qquad \mean_{\mathrm{KL}} = \cov_{\mathrm{KL}} \sum_s\lambda_s\cov_s^{-1}\mean_s.

(17)

Thus $\cov_{\mathrm{KL}}$ is the weighted harmonic mean of the covariance matrices. Unlike the Wasserstein barycenter in Proposition "Nondegenerate Gaussian inputs remain Gaussian", its mean is precision-weighted and therefore depends on the input covariances. Normalizing the geometric mean $\prod_s(\d\beta_s/\d x)^{\lambda_s}$ gives this Gaussian, and the objective differs from $\KL(\alpha\mid\alpha_{\mathrm{KL}})$ only by a constant. This is directional: minimizing $\sum_s\lambda_s\KL(\beta_s\mid\alpha)$ over arbitrary $\alpha$ gives the mixture $\sum_s\lambda_s\beta_s$, generally not a Gaussian; see Section Phi-Divergences.
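A scalar example makes the contrast concrete. The sketch below evaluates formula (17) and the one-dimensional Wasserstein formulas (mean average, standard-deviation average) for two hypothetical Gaussian inputs; the numerical values are chosen only for illustration.

```python
# Hypothetical inputs: N(0, 1) and N(4, 2^2) with equal weights.
means = [0.0, 4.0]
stds = [1.0, 2.0]
weights = [0.5, 0.5]

# Quadratic Wasserstein barycenter in dimension one:
# means and standard deviations both average linearly.
w_mean = sum(lam * m for lam, m in zip(weights, means))
w_std = sum(lam * s for lam, s in zip(weights, stds))

# KL barycenter (17): precisions average linearly, and the mean is
# weighted by the input precisions.
precision = sum(lam / s**2 for lam, s in zip(weights, stds))
kl_var = 1.0 / precision
kl_mean = kl_var * sum(lam * m / s**2 for lam, m, s in zip(weights, means, stds))

print("Wasserstein:", w_mean, w_std**2)
print("KL:         ", kl_mean, kl_var)
```

Here the Wasserstein barycenter is $\Gaussian(2, 2.25)$ while the KL barycenter is $\Gaussian(0.8, 1.6)$: the KL mean is pulled toward the input with the smaller variance.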

Remark: Raw fixed-point iteration

The raw Picard map $\Psi_{\lambda}$ is not a global Banach contraction in the Frobenius norm, as the scalar case already shows. With $c=\sum_s\lambda_s\sigma_s$ and covariance fixed point $\cov_\star=c^2$ , one has

\Psi_{\lambda}(r)=c\sqrt r, \qquad \Psi_{\lambda}^k(r)=\cov_\star^{1-2^{-k}}r^{2^{-k}}.

(18)

Every compact interval $[m,M]\subset(0,+\infty)$ containing $\cov_\star$ is invariant under $\Psi_{\lambda}$ , and the exact Lipschitz constant of $\Psi_{\lambda}^k$ on this interval is

q_k = 2^{-k}\pa{\frac{\cov_\star}{m}}^{1-2^{-k}},

(19)

which tends to zero. Thus sufficiently high iterates are contractions on every such fixed interval. The raw iteration $X_{\ell+1}=\Psi_{\lambda}(X_\ell)$ often converges numerically, but no global contraction theorem explains this behavior (Rüschendorf & Uckelmann, 2002; Esteban et al., 2016). A normalized update with the same fixed point, which converges globally under the proposition’s assumptions, is (Esteban et al., 2016; Bhatia et al., 2019)

X_{\ell+1} = X_\ell^{-1/2}\Psi_{\lambda}(X_\ell)^2X_\ell^{-1/2}.

(20)
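The normalized update (20) is short to implement. The following NumPy sketch is one possible version, assuming all input covariances are positive definite and using an eigendecomposition for the symmetric square root; the helper names (`spd_sqrt`, `psi`, `bures_barycenter`), the stopping rule and the test matrices are illustrative choices rather than a reference implementation.

```python
import numpy as np


def spd_sqrt(A):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def psi(X, covs, weights):
    """Map (11): sum_s lambda_s (X^{1/2} Sigma_s X^{1/2})^{1/2}."""
    R = spd_sqrt(X)
    return sum(w * spd_sqrt(R @ C @ R) for w, C in zip(weights, covs))


def bures_barycenter(covs, weights, n_iter=500, tol=1e-12):
    """Iterate the normalized update (20) starting from the identity."""
    X = np.eye(covs[0].shape[0])
    for _ in range(n_iter):
        R_inv = np.linalg.inv(spd_sqrt(X))
        P = psi(X, covs, weights)
        X_new = R_inv @ P @ P @ R_inv
        X_new = 0.5 * (X_new + X_new.T)  # remove rounding asymmetry
        converged = np.linalg.norm(X_new - X) < tol
        X = X_new
        if converged:
            break
    return X


# Commuting (diagonal) inputs: standard deviations average linearly.
diag_covs = [np.diag([1.0, 4.0]), np.diag([9.0, 16.0])]
X_diag = bures_barycenter(diag_covs, [0.5, 0.5])

# Rotated anisotropic inputs: check the fixed-point equation (12).
t = np.pi / 3
rot = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
covs = [np.diag([4.0, 1.0]), rot @ np.diag([4.0, 1.0]) @ rot.T, np.diag([1.0, 2.0])]
lam = [0.3, 0.3, 0.4]
X_star = bures_barycenter(covs, lam)
fixed_point_residual = np.linalg.norm(psi(X_star, covs, lam) - X_star)
print(X_diag, fixed_point_residual)
```

In the commuting case the update reaches the fixed point in one step from the identity, consistent with the scalar computation in (18).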

The figure below illustrates the nonlinear covariance interpolation characterized above; increasing anisotropy makes the simultaneous rotation and rescaling of the Bures--Wasserstein barycenter especially visible.

Bures--Wasserstein barycenters of centered Gaussian covariance matrices. Each panel shows a $5\times5$ grid of barycenter ellipses for four corner covariances, without separate input panels: the corner ellipses are the four input covariances themselves. The right grid uses more anisotropic inputs, making the nonlinear rotation and scaling of covariance barycenters more visible.

The interactive Gaussian demo compares the Bures covariance barycenter with a plain Euclidean covariance average under the same weights. The difference is most visible for rotated, anisotropic covariances: the Euclidean average blends matrix entries, whereas the Bures barycenter follows the geometry induced by quadratic Gaussian transport.

Interactive panel. Use the corner-covariance and interpolation controls to see how Gaussian barycenter ellipses interpolate covariance geometry.

Sliced and Radon Barycenters

Slicing gives a scalable surrogate for high-dimensional barycenters by applying the one-dimensional quantile formula in every projection direction. For measures on $\RR^d$, one replaces $\Wass_2$ by the sliced distance $\SW_2$ introduced in Definition "Sliced Wasserstein Distance" and interpreted through the Radon transform in Section Sliced Wasserstein Distances:

\min_{\alpha\in\Pp_2(\RR^d)} \sum_s \lambda_s \SW_2^2(\alpha,\beta_s).

(21)

The constraint that all projected measures come from the same $\alpha$ is the nontrivial part. A cheaper Radon-domain approximation drops this consistency constraint and minimizes directly over one-dimensional projected laws $(\gamma_\theta)_\theta$ :

\min_{(\gamma_\theta)} \int_{\Sphere^{d-1}} \sum_s \lambda_s \Wass_2^2(\gamma_\theta,(P_\theta)_\sharp\beta_s) \d\sigma(\theta).

(22)

For each $\theta$ , this is a one-dimensional barycenter, hence its quantile is the weighted average of the projected quantiles. For two inputs $\beta_0$ and $\beta_1$ , define

Q_i(\theta,r) = F^{-1}_{(P_\theta)_\sharp\beta_i}(r), \qquad Q_t(\theta,r) = (1-t)Q_0(\theta,r)+tQ_1(\theta,r), \qquad \gamma_{t,\theta} = \bigl(Q_t(\theta,\cdot)\bigr)_\sharp\mathrm{Leb}_{[0,1]}.

(23)

Thus $Q_t$ is the directionwise quantile field, and when $\gamma_{t,\theta}$ has a density we denote it by $h_t(\theta,\cdot)$ . The relaxed value is a lower bound on the sliced-barycenter value. If the minimizing family is Radon-consistent, meaning that $\gamma_\theta=(P_\theta)_\sharp\bar\alpha$ for a common probability measure $\bar\alpha$ and almost every $\theta$ , then $\bar\alpha$ is an exact sliced barycenter. In general, independently computed one-dimensional barycenters do not satisfy the range conditions of the Radon transform. One therefore reconstructs a density in a least-squares sense, usually through a regularized Radon pseudoinverse.
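The directionwise quantile field (23) is easy to compute for point clouds. The sketch below assumes two planar clouds with equally many uniformly weighted points and a hypothetical list of angles; `projected_quantiles` and `quantile_field` are illustrative names. The second cloud is a translate of the first, a case in which the averaged slices are Radon-consistent: they coincide with the projections of the partially translated cloud.

```python
import math


def projected_quantiles(points, theta):
    """Sorted projections <x, theta>, i.e. empirical quantile values of the
    projected measure for the direction (cos theta, sin theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return sorted(c * px + s * py for px, py in points)


def quantile_field(points0, points1, t, thetas):
    """Directionwise quantile average Q_t(theta, .) from (23)."""
    field = []
    for theta in thetas:
        q0 = projected_quantiles(points0, theta)
        q1 = projected_quantiles(points1, theta)
        field.append([(1 - t) * a + t * b for a, b in zip(q0, q1)])
    return field


# Hypothetical data: beta_1 is beta_0 translated by `shift`.
points0 = [(0.0, 0.0), (1.0, 0.0), (0.3, 0.7)]
shift = (2.0, -1.0)
points1 = [(px + shift[0], py + shift[1]) for px, py in points0]

t = 0.5
thetas = [0.0, math.pi / 4, math.pi / 2]
q_t = quantile_field(points0, points1, t, thetas)

# Projections of the cloud translated by t * shift, for comparison.
midway = [(px + t * shift[0], py + t * shift[1]) for px, py in points0]
q_mid = [projected_quantiles(midway, theta) for theta in thetas]
print(q_t[0], q_mid[0])
```

For inputs that are not related by such a simple map, the averaged slices generally fail to be the projections of any single point cloud, which is the inconsistency discussed above.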

Let $h(\theta,t)$ denote a density of $\gamma_\theta$ . We use the one-dimensional Fourier transform in $t$ given by

\widehat h(\theta,\omega) = \int_{\RR}e^{-\imath\omega t}h(\theta,t)\d t, \qquad h(\theta,t) = \frac1{2\pi}\int_{\RR}e^{\imath\omega t}\widehat h(\theta,\omega)\d\omega,

(24)

whenever Fourier inversion is valid.

Proposition: Radon Least-Squares Pseudoinverse

Let $d\geq2$, let $\sigma$ be the uniform probability measure on $\Sphere^{d-1}$, and let $R$ be the density Radon transform from Remark "Radon viewpoint". Write

\mathcal D(R) = \left\{\rho\in L^2(\RR^d): R\rho\in L^2(\Sphere^{d-1}\times\RR,\d\sigma\,\d t)\right\}.

(25)

Let $h\in L^2(\Sphere^{d-1}\times\RR,\d\sigma\,\d t)$ and assume that the Fourier expressions below define an element of $\mathcal D(R)$ . Then the unique solution of

\min_{\rho\in\mathcal D(R)} \int_{\Sphere^{d-1}}\int_{\RR} \abs{R\rho(\theta,t)-h(\theta,t)}^2 \d t\,\d\sigma(\theta)

(26)

is the density $\rho^\dagger=R^\dagger h$ , where

R^\dagger h(x) = \frac{\abs{\Sphere^{d-1}}}{2(2\pi)^d} \int_{\Sphere^{d-1}}\int_{\RR} e^{\imath\omega\dotp{\theta}{x}} |\omega|^{d-1}\widehat h(\theta,\omega) \d\omega\,\d\sigma(\theta).

(27)

Thus $R^\dagger h$ is the density of the pseudoinverse reconstruction, which is a priori a signed measure. If $h=R\rho$ is Radon-consistent and $\rho$ is sufficiently regular, then $R^\dagger R\rho=\rho$ .

Formula (27) is the filtered back-projection representation of the Radon pseudoinverse used in tomography Herman, 1980; $|\omega|^{d-1}$ is its ramp multiplier. Since this multiplier amplifies high frequencies, one typically chooses a bandwidth $\Omega>0$ and an even low-pass window $\chi$ with $\chi(0)=1$ , and replaces the ramp by

m_\Omega(\omega) = |\omega|^{d-1}\chi(\omega/\Omega), \qquad R_\Omega^\dagger h(x) = \frac{\abs{\Sphere^{d-1}}}{2(2\pi)^d} \int_{\Sphere^{d-1}}\int_{\RR} e^{\imath\omega\dotp{\theta}{x}} m_\Omega(\omega)\widehat h(\theta,\omega) \d\omega\,\d\sigma(\theta).

(31)

The numerical reconstruction below uses the super-Gaussian window $\chi(s)=e^{-|s|^4}$ . Choose $\eta_t\geq0$ so that the positive part below has nonzero mass, and define the nonnegative, unit-mass reconstruction

A_t(x) = \frac{\bigl((R_\Omega^\dagger h_t)(x)-\eta_t\bigr)_+} {\displaystyle\int_{\RR^d} \bigl((R_\Omega^\dagger h_t)(z)-\eta_t\bigr)_+\d z}.

(32)

For the endpoints, set $A_i=\rho_i$ when $\beta_i=\rho_i\d x$, $i\in\{0,1\}$. In the figure, the small threshold $\eta_t$ only suppresses finite-angle inversion ghosts. The resulting regularized density is generally only a least-squares approximation to the independently averaged slices. This fast construction was introduced for sliced and Radon Wasserstein barycenters in Bonneel et al., 2015, but it is not the exact constrained sliced barycenter. With all directions, the Radon transform is injective by the Cramér--Wold theorem (Cramér & Wold, 1936), so inconsistency does not come from a lack of injectivity: it comes from failure of Radon range conditions such as antipodal symmetry and moment consistency. With finitely many directions, the sampled Radon operator is in addition non-injective.

The figure below follows the resulting density, projected-density and quantile fields through a cat-to-heart interpolation.

Radon-domain sliced barycentric interpolation between the cat and heart densities. The columns correspond to $t=0,0.2,\ldots,1$ . The first row shows the endpoint densities and the intermediate reconstructions $A_t$ defined in (32) from the windowed pseudoinverse (31). The second row shows the projected-density fields $h_t$ (labeled $R_t$ in the figure), obtained by converting the directionwise quantile barycenters back into one-dimensional densities. The third row shows the quantile fields $Q_t$ defined in (23).

Interactive panel. Move the interpolation time and projection angle to compare image-space densities, Radon profiles, and quantile interpolation in the sliced barycenter construction.

Sinkhorn for Barycenters

A key difference with the regularized two-marginal OT problem is that there is no canonical reference measure $\alpha\otimes\beta$ , because the barycenter $\alpha$ is unknown. To reduce complexity, one usually fixes a candidate support for the barycenter and solves the discrete problem (4); this introduces a discretization error but keeps the number of unknowns manageable.

One can then use the entropy-only convention of (2) and approximate (4) by

\min_{a\in\simplex_n} \sum_{s=1}^S \lambda_s\mathcal L_{\C_s}^{\epsilon}(a,b_s)

(33)

for some $\epsilon>0$ . This is a smooth convex minimization problem, which can be tackled using gradient descent Cuturi & Doucet, 2014. An alternative is to use a descent method, typically quasi-Newton, on the semi-dual Cuturi & Peyré, 2016; this is useful when adding extra regularization on the barycenter, for instance to impose smoothness.

A simple but effective approach developed in Benamou et al., 2015 observes that (33) has the same minimizers as the weighted KL projection problem

\min_{(\P_s)_s} \epsilon\sum_s\lambda_s \operatorname{KL}(\P_s\mid K_s)

(34)

subject to

\P_s^\top\mathbf 1_n=b_s \quad\text{for all }s, \qquad \P_1\mathbf 1_{n_1} = \cdots = \P_S\mathbf 1_{n_S}.

(35)

Here $K_s\eqdef e^{-\C_s/\epsilon}$ . The barycenter $a$ is implicitly encoded in the common row marginal

a=\P_1\mathbf 1_{n_1}=\cdots=\P_S\mathbf 1_{n_S}.

(36)

The two objectives differ only by constants depending on $(\C_s,\epsilon)_s$ , not on the couplings or barycenter. Assume below that every $\C_s$ is finite and every $b_s$ is positive; zero-weight target atoms can be deleted before the iteration. The optimal couplings then have scaling form

\P_s=\diag(u_s)K_s\diag(v_s),

(37)

and the generalized Sinkhorn iterations are

v_s\leftarrow\frac{b_s}{K_s^\top u_s}, \qquad a\leftarrow\prod_s(K_s v_s)^{\lambda_s}, \qquad u_s\leftarrow\frac{a}{K_s v_s}.

(38)

The geometric mean enforces the fact that all couplings share the same barycenter marginal.

The scaling cycle has an exact dual-optimization interpretation; it is not merely a sequence of marginal normalizations. Write $u_s=e^{f_s/\epsilon}$ and $v_s=e^{g_s/\epsilon}$ , with all operations understood componentwise. The proposition below shows that these potentials maximize the concave dual (41). With $(f_s)_s$ fixed, the objective separates over $s$ , and its exact maximizer in each $g_s$ is

g_s^+ = \epsilon\log\!\left( \frac{b_s}{K_s^\top e^{f_s/\epsilon}} \right),

(39)

which is precisely the $v_s$ -update. Conversely, with $(g_s^+)_s$ fixed, exact maximization over the coupled block $(f_s)_s$ under $\sum_s\lambda_s f_s=0$ gives

q_s \eqdef K_s e^{g_s^+/\epsilon}, \qquad a^+ = \prod_s q_s^{\lambda_s}, \qquad f_s^+ = \epsilon\log\!\left(\frac{a^+}{q_s}\right).

(40)

Thus a complete generalized Sinkhorn cycle is exact two-block coordinate ascent on the dual, or equivalently alternating minimization of its negative. Indeed, the constraint follows from $\log a^+=\sum_s\lambda_s\log q_s$ , while the first-order conditions require $e^{f_s^+/\epsilon}q_s=a^+$ for every $s$ . In the primal formulation (34), the same cycle alternates weighted KL projections onto the target-column constraints and the common-row-marginal constraint Benamou et al., 2015.

Classical applications include two-dimensional image interpolation, three-dimensional shape interpolation, and barycenters on surfaces where the ground cost is the square of the geodesic distance Solomon et al., 2015.

Algorithm: Entropic barycenter Sinkhorn

Input: Finite costs $\C_s$ , positive target histograms $\b_s$ , barycenter weights $\lambda\in\operatorname{int}(\simplex_S)$ , regularization $\epsilon>0$ , tolerance $\mathrm{tol}$ .

Output: Barycenter weights $\a$ and couplings $\P_s$ .

Initialize: Set $\K_s=e^{-\C_s/\epsilon}$ , $\uD_s^{(0)}=\ones_n$ for all $s$ , $r_0=+\infty$ , and $k=0$ .

While $r_k>\mathrm{tol}$ do:

Set $k\leftarrow k+1$.

For each marginal $s$, set $\vD_s^{(k)} = \frac{\b_s}{\transp{\K_s}\uD_s^{(k-1)}}.$

Compute the barycenter marginal: $\a^{(k)} = \prod_s \bigl(\K_s\vD_s^{(k)}\bigr)^{\lambda_s}.$

For each marginal $s$, set $\uD_s^{(k)} = \frac{\a^{(k)}}{\K_s\vD_s^{(k)}}.$

Set $\P_s^{(k)}=\diag(\uD_s^{(k)})\K_s\diag(\vD_s^{(k)})$ for all $s$.

Set $r_k=\max_s \max\{\norm{\P_s^{(k)}\ones-\a^{(k)}}_1,\norm{(\P_s^{(k)})^\top\ones-\b_s}_1\}$.

End while.

Return $\a^{(k)}$ and $\P_s^{(k)}$.
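The algorithm above translates almost line by line into NumPy. The sketch below is one possible transcription, not a tuned implementation: it works with the kernels directly, so it assumes $\epsilon$ is large enough that $e^{-\C_s/\epsilon}$ does not underflow, and it adds an iteration cap `max_iter` that is not part of the stated algorithm. Since the $u$-update makes the row marginals equal to $a$ up to rounding, the residual only monitors the column marginals. The grid, the two bump-shaped histograms and all function names are hypothetical.

```python
import numpy as np


def sinkhorn_barycenter(costs, targets, weights, eps, tol=1e-9, max_iter=20000):
    """Generalized Sinkhorn iterations (38) on a fixed barycenter support.

    costs[s] has shape (n, n_s), targets[s] is a positive histogram of
    size n_s, and weights are positive and sum to one.
    """
    kernels = [np.exp(-C / eps) for C in costs]
    u = [np.ones(C.shape[0]) for C in costs]
    for _ in range(max_iter):
        # Column scalings: enforce P_s^T 1 = b_s.
        v = [b / (K.T @ us) for K, us, b in zip(kernels, u, targets)]
        q = [K @ vs for K, vs in zip(kernels, v)]
        # Weighted geometric mean gives the common row marginal.
        a = np.exp(sum(lam * np.log(qs) for lam, qs in zip(weights, q)))
        u = [a / qs for qs in q]
        # Row marginals now equal a, so only the columns are checked.
        residual = max(
            np.abs(vs * (K.T @ us) - b).sum()
            for K, us, vs, b in zip(kernels, u, v, targets)
        )
        if residual <= tol:
            break
    plans = [us[:, None] * K * vs[None, :] for K, us, vs in zip(kernels, u, v)]
    return a, plans, residual


# Hypothetical example: two mirrored bumps on a grid of [0, 1].
x = np.linspace(0.0, 1.0, 50)
C = (x[:, None] - x[None, :]) ** 2
b1 = np.exp(-((x - 0.25) ** 2) / (2 * 0.05**2)) + 1e-8
b1 /= b1.sum()
b2 = b1[::-1].copy()

a, plans, residual = sinkhorn_barycenter([C, C], [b1, b2], [0.5, 0.5], eps=0.05)
bary_mean = float((x * a).sum() / a.sum())
print(residual, a.sum(), bary_mean)
```

For small $\epsilon$ the same updates are usually carried out on the potentials $f_s,g_s$ with log-sum-exp reductions to avoid underflow; that variant is not shown here.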

Wasserstein-Over-Wasserstein and Barycenters

The barycenter formula does not require a finite list of inputs. The Wasserstein-over-Wasserstein viewpoint of Section Wasserstein Over Wasserstein allows one to replace the discrete family $(\beta_s,\lambda_s)_s$ by a law $\mathfrak A\in\mathcal P(\mathcal P(\Omega))$ over probability measures. Such population Wasserstein barycenters were studied for random probability measures by Bigot and Klein, in general geodesic settings by Le Gouic and Loubes, and on Riemannian manifolds by Kim and Pass (Bigot & Klein, 2018; Le Gouic & Loubes, 2016; Kim & Pass, 2017). Assume, for instance, that $c$ is lower semicontinuous and that there exists at least one $\beta_0\in\mathcal P(\Omega)$ with

\int_{\mathcal P(\Omega)} \mathcal L_c(\beta_0,\alpha)\,\mathrm d\mathfrak A(\alpha)<+\infty,

(46)

together with the usual compactness or coercivity hypotheses ensuring existence of minimizers. For instance, these assumptions are automatic when $\Omega$ is compact and $c$ is continuous. The barycenter correspondence is then

\mathcal B_c(\mathfrak A) \eqdef \operatorname*{Argmin}_{\beta\in\mathcal P(\Omega)} \int_{\mathcal P(\Omega)} \mathcal L_c(\beta,\alpha)\,\mathrm d\mathfrak A(\alpha).

(47)

When this set is a singleton, we denote its element by

\widetilde\alpha_{\mathfrak A} \eqdef \operatorname*{argmin}_{\beta\in\mathcal P(\Omega)} \int_{\mathcal P(\Omega)} \mathcal L_c(\beta,\alpha)\,\mathrm d\mathfrak A(\alpha),

(48)

which defines a nonlinear flattening map $\mathfrak A\mapsto\widetilde\alpha_{\mathfrak A}$ from laws over measures back to measures on $\Omega$. When $\mathfrak A=\sum_s\lambda_s\delta_{\beta_s}$, this is exactly the finite barycenter problem above. This map should be contrasted with the linear collapsed, or barycentric, mixture $\bar\alpha_{\mathfrak A}$ of Definition "Collapsed, Or Barycentric, Mixture", which simply averages the input measures themselves. The two operations agree in degenerate linear situations, but in general $\widetilde\alpha_{\mathfrak A}$ is a geometric average in transport space, whereas $\bar\alpha_{\mathfrak A}$ is an ordinary mixture in the ambient linear space of measures.

The next result records the corresponding law of large numbers. It is useful when a dataset is itself made of probability measures, for instance populations of histograms, posterior distributions or shapes. We state it in the compact setting, where no moment or tightness side conditions are needed; non-compact extensions require the usual integrability assumptions. Consistency of Wasserstein barycenters and related statistical constructions is developed in Boissard et al., 2015; Le Gouic & Loubes, 2016; Zemel & Panaretos, 2019. Streaming and large-scale uses of many input measures appear for instance in Staib et al., 2017; Srivastava et al., 2015; Srivastava et al., 2018.

Proposition: Law of Large Numbers for Barycenters Over Measures

Let $\Omega$ be a compact metric space, let $c:\Omega\times\Omega\to\RR$ be continuous, and let $\mathfrak A\in\mathcal P(\mathcal P(\Omega))$ . Let $\alpha_1,\alpha_2,\ldots$ be independent random probability measures with common law $\mathfrak A$ , and set

\widehat{\mathfrak A}_p \eqdef \frac1p\sum_{i=1}^p\delta_{\alpha_i} \in\mathcal P(\mathcal P(\Omega)).

(49)

Then, almost surely,

\widehat{\mathfrak A}_p \rightharpoonup \mathfrak A \quad\text{in }\mathcal P(\mathcal P(\Omega)).

(50)

\bar\alpha_{\widehat{\mathfrak A}_p} \rightharpoonup \bar\alpha_{\mathfrak A} \quad\text{in }\mathcal P(\Omega).

(51)

Moreover, if $\mathcal B_c(\mathfrak A)=\{\widetilde\alpha_{\mathfrak A}\}$ is a singleton and if $\widetilde\alpha_{\widehat{\mathfrak A}_p}\in\mathcal B_c(\widehat{\mathfrak A}_p)$ is any empirical barycenter, then, almost surely,

\widetilde\alpha_{\widehat{\mathfrak A}_p} \rightharpoonup \widetilde\alpha_{\mathfrak A} \quad\text{in }\mathcal P(\Omega).

(52)

Set $K=\mathcal P(\Omega)$ . Since $\Omega$ is compact metric, $K$ is compact metric for weak convergence, and so is $\mathcal P(K)$ . The space $C(K)$ is separable for the uniform norm. Applying the scalar strong law to a countable dense family of test functions and then using uniform approximation gives, almost surely, for every $\Phi\in C(K)$ ,

\int \Phi(\alpha)\,\mathrm d\widehat{\mathfrak A}_p(\alpha) = \frac1p\sum_{i=1}^p\Phi(\alpha_i) \longrightarrow \int \Phi(\alpha)\,\mathrm d\mathfrak A(\alpha)

(53)

This is exactly (50).

For the collapsed mixtures, take $f\in C(\Omega)$ and define $\Phi_f(\alpha)=\int_\Omega f\,\mathrm d\alpha$ . This function is continuous on $\mathcal P(\Omega)$ . Hence

\int_\Omega f\,\mathrm d\bar\alpha_{\widehat{\mathfrak A}_p} = \int_{\mathcal P(\Omega)} \Phi_f(\alpha)\,\mathrm d\widehat{\mathfrak A}_p(\alpha) \longrightarrow \int_{\mathcal P(\Omega)} \Phi_f(\alpha)\,\mathrm d\mathfrak A(\alpha) = \int_\Omega f\,\mathrm d\bar\alpha_{\mathfrak A},

(54)

which proves (51).

It remains to prove the nonlinear barycenter consistency. The map $(\beta,\alpha)\mapsto\mathcal L_c(\beta,\alpha)$ is continuous on $K^2$ . Therefore the map $\beta\mapsto h_\beta$ , where $h_\beta(\alpha)=\mathcal L_c(\beta,\alpha)$ , is continuous from the compact space $K$ to $C(K)$ . Its image

\mathcal H \eqdef \{h_\beta:\beta\in K\}

(55)

is compact in $C(K)$ . The convergence of $\widehat{\mathfrak A}_p$ to $\mathfrak A$ is uniform over $\mathcal H$ : given $\eta>0$ , cover $\mathcal H$ by finitely many $\eta$ -balls in $\|\cdot\|_\infty$ , use weak convergence for the centers, and bound the error on the balls by the total variation of $\widehat{\mathfrak A}_p-\mathfrak A$ . Hence the empirical objectives

F_p(\beta)\eqdef \int\mathcal L_c(\beta,\alpha)\, \mathrm d\widehat{\mathfrak A}_p(\alpha)

(56)

converge uniformly on $K$ to

F(\beta)\eqdef \int\mathcal L_c(\beta,\alpha)\, \mathrm d\mathfrak A(\alpha).

(57)

Let $\beta_p\in\mathcal B_c(\widehat{\mathfrak A}_p)$ . Compactness gives a subsequence $\beta_{p_k}\rightharpoonup\beta$ . Uniform convergence, continuity of $F$ , and optimality of $\beta_{p_k}$ give, for any $\gamma\in K$ ,

F(\beta) = \lim_k F_{p_k}(\beta_{p_k}) \leq \lim_k F_{p_k}(\gamma) = F(\gamma),

(58)

so $\beta\in\mathcal B_c(\mathfrak A)$ . If this set is the singleton $\{\widetilde\alpha_{\mathfrak A}\}$ , every converging subsequence has the same limit, and therefore the whole sequence $\widetilde\alpha_{\widehat{\mathfrak A}_p}$ converges to $\widetilde\alpha_{\mathfrak A}$ , proving (52).

Thus, (50) is the classical law of large numbers on Wasserstein space, and (51) is its linear image under the collapse map. By contrast, (52) is nonlinear: it recomputes a Wasserstein barycenter from the empirical law over measures. The number $p$ of input measures should not be confused with the number $n$ of samples used to approximate each input measure, studied in Section Sample Complexity. In applications one often observes $p$ empirical measures, each made of roughly $n$ atoms, hence about $np$ points in total. Balancing the error due to finitely many input laws against the error due to finitely sampled input laws is a separate statistical and computational tradeoff.
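A small simulation illustrates the nonlinear consistency (52) in a case where everything is explicit. The sketch below steps outside the compact setting of the proposition and assumes a hypothetical law over one-dimensional Gaussians $\Gaussian(m,\sigma^2)$ with $m$ uniform on $[-1,1]$ and $\sigma$ uniform on $[0.5,1.5]$. By the one-dimensional quantile formula, barycenters average means and standard deviations, so the population barycenter is $\Gaussian(0,1)$ and $\Wass_2$ between one-dimensional Gaussians is the Euclidean distance between (mean, standard deviation) pairs.

```python
import math
import random


def empirical_gaussian_barycenter(params):
    """Quadratic barycenter of p one-dimensional Gaussians with equal weights.

    `params` is a list of (mean, std) pairs; means and standard deviations
    average linearly in dimension one.
    """
    p = len(params)
    return (sum(m for m, _ in params) / p, sum(s for _, s in params) / p)


def w2_gaussian_1d(g0, g1):
    """W2 distance between N(m0, s0^2) and N(m1, s1^2) on the line."""
    return math.hypot(g0[0] - g1[0], g0[1] - g1[1])


rng = random.Random(0)
population_barycenter = (0.0, 1.0)  # (E[m], E[sigma]) for the assumed law

errors = {}
for p in (10, 1000, 100000):
    params = [(rng.uniform(-1.0, 1.0), rng.uniform(0.5, 1.5)) for _ in range(p)]
    errors[p] = w2_gaussian_1d(
        empirical_gaussian_barycenter(params), population_barycenter
    )
    print(p, errors[p])
```

In this explicit family the error is driven by two scalar sample means, so it decays at the usual $p^{-1/2}$ rate; the effect of sampling each input measure with $n$ atoms is not simulated here.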

Toward Central Limit Theorems on Wasserstein Space

The same hierarchy suggests a central-limit refinement of the preceding law of large numbers, but the nonlinear geometry makes this substantially more delicate. For the linear collapse $\bar\alpha_{\widehat{\mathfrak A}_p}$ , testing against a fixed $f\in C(\Omega)$ reduces the question to the classical scalar central limit theorem for the random variable $\alpha\mapsto\int f\,\mathrm d\alpha$ . For the nonlinear barycenter $\widetilde\alpha_{\widehat{\mathfrak A}_p}$ , however, there is no canonical vector difference $\widetilde\alpha_{\widehat{\mathfrak A}_p}-\widetilde\alpha_{\mathfrak A}$ inside $\mathcal P(\Omega)$ . One has to choose a local linearization. When the population barycenter is sufficiently regular so that the optimal map $T_p$ from $\widetilde\alpha_{\mathfrak A}$ to $\widetilde\alpha_{\widehat{\mathfrak A}_p}$ exists, this amounts to asking whether $\sqrt p\,(T_p-\mathrm{Id})$ converges in a Hilbert space such as $L^2(\widetilde\alpha_{\mathfrak A})$ . In nonsmooth settings one must instead work with optimal-plan or logarithmic-map coordinates. Even after such a linearization, an infinite-dimensional CLT requires tightness of the rescaled tangent variables and a genuine Radon Gaussian random element; in a Hilbert space, the associated covariance must be trace class. A cylindrical Gaussian limit alone is therefore not a probability law on the tangent space. This obstruction explains why Wasserstein-space CLTs are more rigid than the weak laws above.

There are nevertheless important settings where such results can be proved. In one dimension, the quantile representation linearizes $\mathcal W_2$, so barycenter fluctuations can be studied through empirical averages of quantile functions. A finite-dimensional case is the family of non-degenerate Gaussian measures in fixed dimension, where $\mathcal W_2$ reduces to the Bures geometry of means and covariance matrices. Agueh and Carlier (Agueh & Carlier, 2017) formulate this Wasserstein-barycenter CLT precisely in tangent coordinates and prove it in a few special cases, including the one-dimensional non-atomic setting and finite laws supported on non-degenerate Gaussian measures. Entropic barycenters give a smoother variant for which central-limit theorems for empirical barycenters are also available (Carlier et al., 2021). These results should be read as nonlinear analogues of the statistical limits discussed in Chapter Paragraph, not as a generic Hilbert-space CLT valid on all of $\mathcal P(\Omega)$.

Example: Application to fair score repair

Let $S$ be a protected attribute and $Y=f(X)$ a score. A basic demographic-parity constraint asks that the conditional laws $\al_s=\mathcal L(Y\mid S=s)$ do not depend on $s$ . OT post-processing chooses a common fair law, often a Wasserstein barycenter

\bar\al\in\uargmin{\zeta}\sum_s p_s\Wass_2^2(\al_s,\zeta),

(59)

and transports each group toward it. In one dimension this uses monotone rearrangements; in the quadratic absolutely continuous case it uses Brenier maps $(T_s)_\sharp\al_s=\bar\al$ ; otherwise one can use the barycentric projection of an optimal plan. The repaired score is $\widetilde Y=T_S(Y)$ when a map is used. The barycenter is the compromise distribution, while the OT maps give minimal geometric changes to the original scores Gordaliza et al., 2019Chzhen et al., 2020Buyl & De Bie, 2022Hu et al., 2023. Thus the barycenter is not only an averaging tool: it can define a target distribution used to repair several observed laws simultaneously.
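In one dimension, the repair step can be sketched with empirical quantile functions. The example below is a minimal illustration under stated assumptions: the two groups, their synthetic Gaussian scores, the grid of 501 quantile levels, and the helper `repair` are hypothetical choices, not part of the text. It averages the group quantile functions with weights $p_s$ and applies the monotone map $T_s=Q_{\bar\al}\circ F_{\al_s}$ .

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores for two groups (synthetic data, for illustration only).
scores = {0: rng.normal(0.4, 0.10, size=2000), 1: rng.normal(0.6, 0.15, size=1000)}
n_total = sum(len(y) for y in scores.values())
weights = {s: len(y) / n_total for s, y in scores.items()}  # group proportions p_s

# In 1D the W2 barycenter has quantile function Q_bar = sum_s p_s Q_s.
levels = np.linspace(0.0, 1.0, 501)
quantiles = {s: np.quantile(y, levels) for s, y in scores.items()}
q_bar = sum(weights[s] * quantiles[s] for s in scores)

def repair(y, s):
    """Apply the monotone map T_s = Q_bar o F_s to the scores y of group s."""
    u = np.interp(y, quantiles[s], levels)   # empirical CDF level within group s
    return np.interp(u, levels, q_bar)       # barycenter quantile at that level

repaired = {s: repair(y, s) for s, y in scores.items()}
gap_before = abs(np.median(scores[0]) - np.median(scores[1]))
gap_after = abs(np.median(repaired[0]) - np.median(repaired[1]))
print(f"median gap before: {gap_before:.3f}, after: {gap_after:.4f}")
```

After the repair, both groups share (up to interpolation error) the same score distribution, while each score keeps its rank inside its own group.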

Multimarginal OT¶

Multi-marginal OT couples more than two measures at once. It is the natural language for barycenters, matching for teams, and many-body costs, but the size of the coupling tensor is its main computational obstacle.

Definition and Basic Structure¶

The multi-marginal formulation replaces a coupling between two measures by a joint distribution with several prescribed marginals. Given measures $(\alpha_s)_{s=1}^S$ on spaces $(\Xx_s)_{s=1}^S$ and a lower-semicontinuous cost $c:\Xx_1\times\cdots\times\Xx_S\to\RR\cup\{+\infty\}$ bounded from below, the problem reads

\inf_{\pi\in\Couplings(\alpha_1,\ldots,\alpha_S)} \int_{\Xx_1\times\cdots\times\Xx_S} c(x_1,\ldots,x_S)\d\pi(x_1,\ldots,x_S),

(60)

where $\Couplings(\alpha_1,\ldots,\alpha_S)$ is the set of probability measures whose $s$ -th marginal is $\alpha_s$ . This is still a linear program in the discrete setting, but its ambient tensor has size $\prod_s n_s$ .

Monge Structure and Splitting-Set Twist¶

As in the two-marginal case, one would like to know when the optimal joint law is induced by deterministic maps from one marginal. The relevant non-degeneracy assumption is stronger than pairwise twist, because the other $S-1$ variables have to be recovered simultaneously. The condition below is the standard multi-marginal analogue used in the Monge-structure theory of Gangbo--Swiech and Pass Gangbo & Swiech, 1998Pass, 2011Pass, 2012Pass, 2015.

Definition: Twist on Splitting Sets

Fix $x_1\in\Xx_1$ . A set $M\subset\Xx_2\times\cdots\times\Xx_S$ is a $c$ -splitting set at $x_1$ if there exist functions $u_s:\Xx_s\to\RR\cup\{-\infty\}$ , for $s=2,\ldots,S$ , such that

\sum_{s=2}^S u_s(x_s)\leq c(x_1,x_2,\ldots,x_S)

(61)

for all $(x_2,\ldots,x_S)$ , with equality on $M$ . Assume $c$ is differentiable in $x_1$ . The cost is twisted on splitting sets if, for every $x_1$ and every $c$ -splitting set $M$ at $x_1$ , the map

(x_2,\ldots,x_S) \longmapsto \nabla_{x_1}c(x_1,x_2,\ldots,x_S)

(62)

is injective on $M$ .

Proposition: Multi-Marginal Monge Structure

Assume that $\Xx_s\subset\RR^d$ , that $c$ is continuous and differentiable with respect to $x_1$ , and that $c$ is twisted on splitting sets. Suppose that Kantorovich dual maximizers $(\varphi_s)_{s=1}^S$ exist, that $\alpha_1$ is absolutely continuous, and that $\varphi_1$ is differentiable $\alpha_1$ -a.e. Then every optimal plan $\pi^\star\in\Couplings(\alpha_1,\ldots,\alpha_S)$ is concentrated on the graph of maps $\T_s:\Xx_1\to\Xx_s$ , $s=2,\ldots,S$ , that is,

\pi^\star=(\Id,\T_2,\ldots,\T_S)_\sharp\alpha_1, \qquad (\T_s)_\sharp\alpha_1=\alpha_s.

(63)

In particular, under these hypotheses the optimizer is unique.

Remark: Recovery of the Two-Marginal Theory

When $S=2$ , twist on splitting sets is exactly the usual twist condition of Definition Definition: Twist Condition. Indeed, for fixed $x$ , the whole target space is a splitting set by taking $u_2(y)=c(x,y)$ ; hence the condition requires $y\mapsto\nabla_x c(x,y)$ to be injective. At a dual contact point,

\nabla\varphi_1(x)=\nabla_x c(x,y),

(67)

so this injectivity selects a unique $y=\T(x)$ . Provided the Kantorovich problem admits an optimizer, the proposition therefore makes that optimizer equal to $(\Id,\T)_\sharp\alpha_1$ : the relaxation is tight and $\T$ is an optimal Monge map. This is precisely the two-marginal mechanism isolated in Proposition Proposition: Twist Prevents Splitting.

For the quadratic cost $c(x,y)=\norm{x-y}^2$ , one has $\nabla_x c(x,y)=2(x-y)$ and therefore

\T(x)=x-\frac12\nabla\varphi_1(x) =\nabla\left(\frac12\norm{x}^2-\frac12\varphi_1(x)\right).

(68)

Choosing the usual $c$ -concave representative of the quadratic dual potential makes the potential in parentheses convex; it is differentiable $\alpha_1$ -almost everywhere because $\alpha_1$ is absolutely continuous. Thus the case $S=2$ recovers the convex-gradient map, uniqueness, and tightness conclusions of Brenier’s theorem Theorem: Brenier and Corollary Corollary: Monge--Kantorovich Equivalence Under Brenier. For $S>2$ , the splitting-set twist is the stronger requirement that the same first-order identity recover the entire tuple $(x_2,\ldots,x_S)$ at once.

Coulomb Cost and Density-Functional Theory¶

A second canonical example, besides barycenters, comes from electronic structure. For $N$ electrons in $\RR^3$ , the repulsive Coulomb interaction is the multi-body cost

c_{\mathrm{Coul}}(x_1,\ldots,x_N) \eqdef \sum_{1\leq i<j\leq N}\frac{1}{\norm{x_i-x_j}},

(69)

with the value $+\infty$ on the collision set. Proposition Proposition: Multi-Marginal Monge Structure therefore does not apply verbatim: the Coulomb cost is neither finite nor differentiable on the whole product space. Any finite-energy plan gives zero mass to exact collisions, so the cost is smooth at almost every point charged by the plan, but this removes only the singularity; one must still establish differentiability of the dual potential and twist on the relevant splitting sets. Away from collisions,

\nabla_{x_1}c_{\mathrm{Coul}}(x_1,\ldots,x_N) = -\sum_{j=2}^N\frac{x_1-x_j}{\norm{x_1-x_j}^3}.

(70)

For $N=2$ , the map from $x_2$ to this vector is injective, so the ordinary two-marginal twist argument can be recovered under the required existence, duality and differentiability hypotheses. For $N\geq3$ , however, the displayed total force does not by itself determine the entire tuple $(x_2,\ldots,x_N)$ ; twist on splitting sets, and hence a Monge representation, is not automatic. The previous proposition thus supplies a mechanism to verify in special Coulomb models, not a general existence theorem for co-motion maps.
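The loss of injectivity for $N\geq3$ can be seen on the simplest possible example: exchanging $x_2$ and $x_3$ changes the tuple but not the total force on $x_1$ . The sketch below uses three arbitrary illustrative positions (an assumption of this example) and also checks formula (70) against central finite differences. It only illustrates non-injectivity on the full product space; it says nothing about specific splitting sets.

```python
import numpy as np

def coulomb_cost(points):
    """Sum of 1/|x_i - x_j| over pairs i < j, for an (N, 3) array of positions."""
    N = len(points)
    return sum(1.0 / np.linalg.norm(points[i] - points[j])
               for i in range(N) for j in range(i + 1, N))

def grad_x1(points):
    """Closed-form gradient (70) with respect to the first position."""
    diffs = points[0] - points[1:]
    dist = np.linalg.norm(diffs, axis=1)
    return -(diffs / dist[:, None] ** 3).sum(axis=0)

# Illustrative positions (hypothetical values).
x1 = np.array([0.0, 0.0, 0.0])
x2 = np.array([1.0, 0.0, 0.0])
x3 = np.array([0.0, 2.0, 0.5])
config = np.stack([x1, x2, x3])
swapped = np.stack([x1, x3, x2])   # same x1, the other two electrons exchanged

# Central finite differences in x1 as an independent check of (70).
h = 1e-6
numeric = np.zeros(3)
for k in range(3):
    plus, minus = config.copy(), config.copy()
    plus[0, k] += h
    minus[0, k] -= h
    numeric[k] = (coulomb_cost(plus) - coulomb_cost(minus)) / (2 * h)

force = grad_x1(config)
force_swapped = grad_x1(swapped)
print(force, force_swapped)
```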

If $\rho$ is an electron density with $\int_{\RR^3}\rho(x)\d x=N$ and $\al=\rho/N$ is the associated probability density, the strictly-correlated-electrons relaxation of density-functional theory is the equal-marginal problem

V_{\mathrm{ee}}^{\mathrm{SCE}}[\rho] \eqdef \inf_{\pi\in\Couplings(\al,\ldots,\al)} \int_{(\RR^3)^N} c_{\mathrm{Coul}}(x_1,\ldots,x_N) \d\pi(x_1,\ldots,x_N).

(71)

Since the cost and constraints are permutation invariant, symmetrizing any admissible plan does not change its value, so one may equivalently minimize over symmetric plans. This functional gives the smallest possible electron--electron repulsion compatible with the prescribed one-particle density; it appears as the strong-interaction limit in density-functional theory and was connected to optimal transport in Gori-Giorgi et al., 2009Buttazzo et al., 2012Cotar et al., 2013Di Marino et al., 2015. The deterministic ansatz writes a plan through co-motion maps

\pi=(\Id,\T_2,\ldots,\T_N)_\sharp\al, \qquad (\T_i)_\sharp\al=\al,

(72)

so that the position of one electron determines the positions of the others. The following cyclic version is the most common structural form of this ansatz in the strictly-correlated-electrons literature.

Proposition: Cyclic Co-Motion Plans

Let $\al\in\Pp(\RR^3)$ and let $\T:\RR^3\to\RR^3$ be measurable with $\T_\sharp\al=\al$ and $\T^N=\Id$ $\al$ -a.e. Set $\T^0=\Id$ and

\pi_\T=(\T^0,\T^1,\ldots,\T^{N-1})_\sharp\al .

(73)

Then $\pi_\T\in\Couplings(\al,\ldots,\al)$ . If $R_\sigma(x_1,\ldots,x_N)=(x_{\sigma(1)},\ldots,x_{\sigma(N)})$ for $\sigma\in\Perm(N)$ , then

\bar\pi_\T \eqdef \frac1{N!}\sum_{\sigma\in\Perm(N)}(R_\sigma)_\sharp\pi_\T

(74)

is a symmetric admissible plan and

\int c_{\mathrm{Coul}}\d\bar\pi_\T = \int c_{\mathrm{Coul}}\d\pi_\T = \int_{\RR^3} \sum_{0\leq i<j\leq N-1} \frac{1}{\norm{\T^i(x)-\T^j(x)}}\d\al(x).

(75)

The Monge-structure proposition above explains the general mechanism that can force graph solutions, while the cyclic co-motion proposition records the additional equal-marginal symmetry used by co-motion maps. For the Coulomb cost, however, the singular repulsion and permutation symmetry make the structure delicate: co-motion maps are optimal in special geometries, but they are not universally optimal, and counterexamples are known Colombo & Stra, 2015Bindini et al., 2020. Thus the DFT problem is both a central application and a warning that multi-marginal OT is richer than a naive deterministic matching problem.
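A discrete stand-in for the cyclic co-motion construction may help fix ideas. The sketch below assumes a uniform measure on the six vertices of a regular hexagon in the plane $z=0$ and takes $\T$ to be the rotation by two vertices, so that $\T^3=\Id$ ; these choices are illustrative and not from the text. It builds $\pi_\T$ as a tensor, checks its marginals, and compares the tensor cost with the right-hand side of (75).

```python
import numpy as np

N, n = 3, 6  # three electrons, six atoms (hypothetical toy configuration)
angles = 2 * np.pi * np.arange(n) / n
pts = np.stack([np.cos(angles), np.sin(angles), np.zeros(n)], axis=1)

shift = n // N
T = (np.arange(n) + shift) % n  # index map: preserves uniform weights, T^N = Id

# Orbit (i, T(i), ..., T^{N-1}(i)) of every atom.
orbits = np.empty((n, N), dtype=int)
orbits[:, 0] = np.arange(n)
for k in range(1, N):
    orbits[:, k] = T[orbits[:, k - 1]]

# Plan pi_T = (T^0, ..., T^{N-1})_# alpha as an n x n x n tensor.
plan = np.zeros((n,) * N)
for i in range(n):
    plan[tuple(orbits[i])] += 1.0 / n

def coulomb(idx):
    """Coulomb cost of the configuration (pts[idx[0]], ..., pts[idx[N-1]])."""
    return sum(1.0 / np.linalg.norm(pts[idx[p]] - pts[idx[q]])
               for p in range(N) for q in range(p + 1, N))

marginals = [plan.sum(axis=tuple(r for r in range(N) if r != s)) for s in range(N)]
cost_formula = sum(coulomb(orbits[i]) for i in range(n)) / n
cost_tensor = sum(plan[idx] * coulomb(idx)
                  for idx in np.ndindex(plan.shape) if plan[idx] > 0)
print(cost_formula, cost_tensor)
```

Each orbit is an equilateral triangle with side $\sqrt3$ , so both costs equal $3/\sqrt3=\sqrt3$ in this toy geometry.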

Figure Div shows the same phenomenon in a deliberately small one-dimensional model.

Entropic three-marginal Coulomb transport in one dimension. The three marginals are equal and the pairwise cost is a softened Coulomb repulsion. Each panel shows the $(X_1,X_2)$ marginal of the tensor Sinkhorn solution: small regularization pushes mass away from the collision diagonal, while larger regularization blurs the repulsive structure toward the independent reference.

Interactive panel. Adjust the entropic temperature and repulsion strength to see the pairwise marginals of a three-marginal Coulomb plan move away from the diagonals.

Multi-Marginal Formulation of Barycenters¶

Wasserstein barycenters are the central example. For the squared Euclidean cost, one can introduce a latent barycenter point and eliminate it explicitly, leading to the multi-marginal cost

c_{\mathrm{bar}}(x_1,\ldots,x_S) = \min_{x\in\RR^d} \sum_{s=1}^S\lambda_s\norm{x-x_s}^2.

(76)

Proposition: Multi-Marginal Formula for Quadratic Barycenters

Let $\beta_1,\ldots,\beta_S\in\mathcal P_2(\RR^d)$ and $\lambda\in\simplex_S$ . Define

B(x_1,\ldots,x_S)=\sum_{s=1}^S\lambda_s x_s, \qquad c_{\mathrm{bar}}(x_1,\ldots,x_S) = \min_x \sum_s\lambda_s\norm{x-x_s}^2.

(77)

If $\pi^\star$ solves the multi-marginal OT problem with marginals $(\beta_s)_s$ and cost $c_{\mathrm{bar}}$ , then $\alpha^\star=B_\sharp\pi^\star$ is a Wasserstein barycenter. Conversely, every barycenter is obtained this way from an optimal multi-marginal plan.
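The inner minimization defining $c_{\mathrm{bar}}$ is explicit: the minimizer is the weighted mean $B(x_1,\ldots,x_S)$ , and expanding the squares gives $c_{\mathrm{bar}}=\sum_s\lambda_s\norm{x_s-B}^2=\frac12\sum_{s,t}\lambda_s\lambda_t\norm{x_s-x_t}^2$ . The following numerical check is a small sketch on random points (the dimensions and random data are arbitrary assumptions of the example).

```python
import numpy as np

rng = np.random.default_rng(1)
S, d = 4, 3
lam = rng.random(S)
lam /= lam.sum()                 # weights in the simplex
x = rng.normal(size=(S, d))      # one point per marginal

def objective(z):
    """sum_s lam_s |z - x_s|^2 for a candidate barycenter point z."""
    return float(np.sum(lam * np.sum((z - x) ** 2, axis=1)))

B = lam @ x                      # weighted mean B(x_1, ..., x_S)
c_bar = objective(B)
pairwise = 0.5 * sum(lam[s] * lam[t] * np.sum((x[s] - x[t]) ** 2)
                     for s in range(S) for t in range(S))

# Random perturbations of B never decrease the objective.
perturbed = min(objective(B + 0.1 * rng.normal(size=d)) for _ in range(100))
print(c_bar, pairwise, perturbed)
```

The pairwise form shows that the barycentric cost is a sum of two-body quadratic interactions, even though it was defined through a latent point.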

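For discrete inputs $\beta_s$ with $n_s$ atoms each, there exists a barycenter supported on at most $\sum_s n_s-S+1$ points. This linear support bound, rather than the cardinality of the full product grid, is the standard sparsity estimate for discrete Wasserstein barycenters Anderes et al., 2016.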

Entropic Regularization of Multi-Marginal OT¶

As in the two-marginal case, adding an entropic penalty with respect to the product measure $\alpha_1\otimes\cdots\otimes\alpha_S$ leads to scaling algorithms:

\inf_{\pi\in\Couplings(\alpha_1,\ldots,\alpha_S)} \int c\d\pi + \epsilon\operatorname{KL} (\pi\mid\alpha_1\otimes\cdots\otimes\alpha_S).

(84)

The optimizer has the generalized Gibbs form

\d\pi^\star(x_1,\ldots,x_S) = \exp\!\left( \frac{\sum_s f_s(x_s)-c(x_1,\ldots,x_S)}{\epsilon} \right) \prod_s\d\alpha_s(x_s),

(85)

and generalized Sinkhorn iterations cyclically update one potential $f_s$ at a time so that the $s$ -th marginal is correct. This formula is direct but, without additional structure, it is mostly a conceptual baseline. In the discrete case, even storing the Gibbs tensor or the coupling requires $\prod_s n_s$ entries, and the unstructured multi-marginal Sinkhorn complexity inherits this exponential dependence on the number of marginals Lin et al., 2019.

Treewidth and Graphical Structure¶

The important exception is when the cost factors over a sparse interaction graph. Let $G=(V,E)$ be a finite undirected graph with $V=\{1,\ldots,S\}$ and suppose, for simplicity, that

c(x_1,\ldots,x_S) = \sum_{(r,s)\in E}c_{r,s}(x_r,x_s).

(86)

The relevant complexity parameter is not merely the number of edges, but the largest intermediate interaction created when variables are summed out.

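Recall that a tree decomposition of $G$ is a tree $\mathcal T=(Q,F)$ whose nodes $q\in Q$ carry bags $B_q\subseteq V$ such that every vertex of $G$ belongs to some bag, every edge of $G$ is contained in some bag, and, for each vertex, the bags containing it form a connected subtree of $\mathcal T$ ; its width is $\max_q|B_q|-1$ . The last condition is the running-intersection property. A tree decomposition equipped with factors assigned to its bags is also called a junction tree. Equivalently, choose an order in which to eliminate the vertices of $G$ . Just before eliminating a vertex, connect all of its remaining neighbors, thereby adding fill-in edges. The induced width of the order is the largest number of remaining neighbors encountered. Treewidth is the minimum induced width over all elimination orders, and the corresponding bags consist of each eliminated vertex together with those neighbors. This equivalent viewpoint explains why treewidth controls exact summation.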

The same construction applies to higher-order factors $c_A(x_A)$ indexed by subsets $A\subseteq V$ : form the primal interaction graph by connecting every pair of variables occurring in a common factor, then compute the treewidth of that graph.
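The elimination-order viewpoint is easy to simulate. The sketch below computes the induced width of a given order; the function name `induced_width` and the three test graphs are illustrative choices. The induced width of one order is only an upper bound on the treewidth, although for the chain, cycle, and complete graph used here the natural order attains it.

```python
def induced_width(edges, order):
    """Largest number of remaining neighbors met while eliminating `order`."""
    nbrs = {v: set() for v in order}
    for r, s in edges:
        nbrs[r].add(s)
        nbrs[s].add(r)
    width = 0
    for v in order:
        remaining = nbrs.pop(v)              # neighbors not yet eliminated
        width = max(width, len(remaining))
        for u in remaining:
            nbrs[u].discard(v)
            nbrs[u] |= remaining - {u}       # fill-in edges among the neighbors
    return width

S = 6
chain = [(s, s + 1) for s in range(1, S)]
cycle = chain + [(S, 1)]
complete = [(r, s) for r in range(1, S + 1) for s in range(r + 1, S + 1)]
order = list(range(1, S + 1))

w_chain = induced_width(chain, order)
w_cycle = induced_width(cycle, order)
w_complete = induced_width(complete, order)
print(w_chain, w_cycle, w_complete)
```

The three widths correspond to contractions over pairs, triples, and the full tensor of all $S$ variables, respectively.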

Junction-Tree Contractions Inside Sinkhorn¶

The treewidth reduction replaces each full tensor contraction in Sinkhorn by exact sum-product messages. In the discrete setting, define the edge Gibbs matrices and current unary factors

K^{r,s}_{i_r,i_s} = \exp\!\left(-\frac{\C^{r,s}_{i_r,i_s}}{\epsilon}\right), \qquad h_s(i_s) = (a_s)_{i_s}(u_s)_{i_s}.

(87)

Under (86), the current scaled coupling is represented implicitly as

\P_{i_1,\ldots,i_S} = \prod_{s\in V}h_s(i_s) \prod_{(r,s)\in E}K^{r,s}_{i_r,i_s}.

(88)

When $G$ is a tree, let $N_G(r)$ denote the neighbors of $r$ . The directed sum-product messages satisfy

m_{r\to s}(i_s) = \sum_{i_r=1}^{n_r} K^{r,s}_{i_r,i_s}h_r(i_r) \prod_{q\in N_G(r)\setminus\{s\}} m_{q\to r}(i_r).

(89)

Cutting the edge $(r,s)$ separates the tree into two components: $m_{r\to s}(i_s)$ is the total contribution of the component containing $r$ , conditional on the boundary state $i_s$ . Conditioning first on $i_r$ gives the message recursion. A leaf-to-root pass followed by a root-to-leaf pass computes every directed message and hence every current marginal,

(\widehat a_s)_{i_s} = h_s(i_s) \prod_{r\in N_G(s)}m_{r\to s}(i_s).

(90)

For the block $s$ currently selected by cyclic scaling, the exact coordinate update is

u_s \leftarrow u_s\odot\frac{a_s}{\widehat a_s},

(91)

with componentwise products and quotients. This changes only the unary factor $h_s$ . Updating all blocks at once would instead define a Jacobi scheme. Thus the expensive denominator of one generalized Sinkhorn block update is evaluated by messages rather than by enumerating all multi-indices.
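For a chain $1-2-3$ , the messages (89), the marginals (90), and the update (91) fit in a few lines. The sketch below uses random costs, random marginals, and arbitrary positive scalings (all hypothetical data) and compares the message-based marginals with brute-force enumeration of the full tensor (88).

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 5, 0.5
a = [rng.dirichlet(np.ones(n)) for _ in range(3)]
K12 = np.exp(-rng.random((n, n)) / eps)   # edge (1,2), indexed [i_1, i_2]
K23 = np.exp(-rng.random((n, n)) / eps)   # edge (2,3), indexed [i_2, i_3]
u = [rng.random(n) + 0.5 for _ in range(3)]

def marginals_by_messages(u):
    h = [a[s] * u[s] for s in range(3)]   # unary factors h_s
    m12 = K12.T @ h[0]                    # m_{1->2}(i_2)
    m32 = K23 @ h[2]                      # m_{3->2}(i_2)
    m21 = K12 @ (h[1] * m32)              # m_{2->1}(i_1)
    m23 = K23.T @ (h[1] * m12)            # m_{2->3}(i_3)
    return [h[0] * m21, h[1] * m12 * m32, h[2] * m23]

def marginals_by_enumeration(u):
    h = [a[s] * u[s] for s in range(3)]
    P = np.einsum("i,j,k,ij,jk->ijk", h[0], h[1], h[2], K12, K23)
    return [P.sum(axis=(1, 2)), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))]

msg = marginals_by_messages(u)
ref = marginals_by_enumeration(u)

# Block update (91) for s = 2: afterwards the second marginal is exact.
u[1] = u[1] * a[1] / msg[1]
updated = marginals_by_messages(u)
print(np.abs(updated[1] - a[1]).max())
```

The message version uses only matrix-vector products of size $n\times n$ , whereas the enumeration builds an $n^3$ tensor.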

For a general tree decomposition, choose one host bag for each unary factor, assign each edge factor in (88) to a bag containing both endpoints, and denote the product assigned to bag $B_q$ by $\psi_q(i_{B_q})$ . For adjacent bags $q,q'\in Q$ , set $\mathcal S_{q,q'}=B_q\cap B_{q'}$ . Junction-tree messages take the form

M_{q\to q'}(i_{\mathcal S_{q,q'}}) = \sum_{i_{B_q\setminus\mathcal S_{q,q'}}} \psi_q(i_{B_q}) \prod_{\ell\in N_{\mathcal T}(q)\setminus\{q'\}} M_{\ell\to q}(i_{\mathcal S_{\ell,q}}).

(92)

After a collect-distribute pass, the calibrated bag belief is

\mathfrak b_q(i_{B_q}) = \psi_q(i_{B_q}) \prod_{\ell\in N_{\mathcal T}(q)} M_{\ell\to q}(i_{\mathcal S_{\ell,q}}).

(93)

For any $s\in B_q$ , summing $\mathfrak b_q$ over $B_q\setminus\{s\}$ gives $\widehat a_s$ ; the running-intersection property ensures that every bag containing $s$ gives the same result.

Redundant adjacent bags may be contracted, so one can assume that the decomposition is reduced. If its width is $w$ , every bag contains at most $w+1$ variables and every separator contains at most $w$ . Consequently, if $n_s\leq n$ , one collect-distribute inference pass for fixed scalings costs

O\!\left(|Q|n^{w+1}\right)\quad\text{time}, \qquad O\!\left(|Q|n^w\right)\quad\text{message memory}.

(94)

More precisely, the time is bounded by

O\!\left( \sum_{q\in Q}(1+\deg_{\mathcal T}(q)) \prod_{s\in B_q}n_s \right),

(95)

and the stored messages occupy

O\!\left( \sum_{(q,q')\in F} \prod_{s\in\mathcal S_{q,q'}}n_s \right)

(96)

beyond the original factors and scaling vectors. The message-memory bound assumes streamed bag contractions; materializing every dense belief (93) instead requires $O(|Q|n^{w+1})$ working memory.

These are fixed-scaling inference bounds, not the cost of an entire Sinkhorn solve. They also yield a sharper block implementation. Assign each constrained marginal $s$ to a host bag $q(s)$ . After updating $u_s$ , only the messages along the unique path from $q(s)$ to the host bag of the next updated marginal must be refreshed. A path of length $\ell$ costs $O(\ell n^{w+1})$ , hence at most $O(\operatorname{diam}(\mathcal T)n^{w+1})$ per block update. This is the junction-tree iterative-scaling mechanism analyzed in Haasler et al., 2020Fan et al., 2021. For a tree interaction graph, direct edge messages give $O(\sum_{(r,s)\in E}n_rn_s)$ for a full pass, or $O(Sn^2)$ with equal support sizes. For a cycle, a full junction-tree pass costs $O(Sn^3)$ , while for the complete graph it still costs $O(n^S)$ in time.

The structured implementation never forms $K$ or $\P$ as full tensors. It stores the original local factors and separator messages, evaluates bag products on the fly, replaces the explicit contraction in Algorithm: Multi-marginal Sinkhorn by (90) or (92), and returns the coupling through its factors and scalings. Materializing every entry of $\P$ would itself require $\prod_s n_s$ operations, so the saving applies when only marginals, costs, samples, or other low-order statistics are needed. For small $\epsilon$ , the same recursions are evaluated in the log domain with log-sum-exp operations. This connects entropic multi-marginal OT with exact inference in probabilistic graphical models and Schrödinger bridge computations Haasler et al., 2021Haasler et al., 2020Fan et al., 2021Altschuler & Boix-Adsera, 2023.

A representative fluid-mechanics example is the time discretization of Brenier’s generalized incompressible Euler problem: kinetic action couples neighboring time slices and periodic incompressibility closes the chain into a cycle, hence a graph of treewidth two. Entropic Bregman/Sinkhorn schemes exploit this circular structure through low-order contractions Benamou et al., 2015Benamou et al., 2019.

Practical barycenter solvers therefore exploit separability of the cost, low-rank structure, convolutional kernels, or a fixed barycenter support.

Algorithm: Multi-marginal Sinkhorn

Input: Positive marginals $\a_s\in\simplex_{n_s}$ , finite tensor cost $\C$ , regularization $\epsilon>0$ , tolerance $\mathrm{tol}$ .

Output: Multi-marginal entropic coupling tensor $\P$ .

Build $K_{i_1,\ldots,i_S} = \exp\!\left(-\frac{\C_{i_1,\ldots,i_S}}{\epsilon}\right) \prod_{s=1}^S(\a_s)_{i_s}.$

Initialize: Set $u_s=\ones_{n_s}$ for all $s$ and residual $r=+\infty$ .

While $r>\mathrm{tol}$ do:

For $s=1,\ldots,S$ do:

$(u_s)_i \leftarrow \frac{(\a_s)_i} { \sum_{i_1,\ldots,i_{s-1},i_{s+1},\ldots,i_S} K_{i_1,\ldots,i_{s-1},i,i_{s+1},\ldots,i_S} \prod_{r\neq s}(u_r)_{i_r}}.$

Set $\P_{i_1,\ldots,i_S}=K_{i_1,\ldots,i_S}\prod_s (u_s)_{i_s}$ .
Set $r=\max_s\norm{(\mathrm{proj}_s)_\sharp \P-\a_s}_1$ .

Return $\P$ .
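The unstructured algorithm above can be sketched directly with dense NumPy tensors, which is only practical for a handful of small marginals. In this sketch the function name `multimarginal_sinkhorn`, the iteration cap `max_iter`, and the pairwise quadratic test cost are assumptions added for illustration.

```python
import numpy as np

def _expand(vec, s, S):
    """Reshape a vector so that it broadcasts along axis s of an S-way tensor."""
    shape = [1] * S
    shape[s] = -1
    return vec.reshape(shape)

def multimarginal_sinkhorn(a_list, C, eps, tol=1e-9, max_iter=5000):
    S = len(a_list)
    K = np.exp(-C / eps)
    for s, a in enumerate(a_list):
        K = K * _expand(a, s, S)              # Gibbs tensor including the marginals
    u = [np.ones_like(a) for a in a_list]

    def scaled():
        P = K
        for s in range(S):
            P = P * _expand(u[s], s, S)
        return P

    def marginal(P, s):
        return P.sum(axis=tuple(r for r in range(S) if r != s))

    P = scaled()
    for _ in range(max_iter):
        for s in range(S):
            # Equivalent to u_s <- a_s / (contraction of K with the other scalings).
            u[s] = u[s] * a_list[s] / marginal(scaled(), s)
        P = scaled()
        residual = max(np.abs(marginal(P, s) - a_list[s]).sum() for s in range(S))
        if residual <= tol:
            break
    return P

rng = np.random.default_rng(4)
n = 5
x = np.linspace(0.0, 1.0, n)
a_list = [rng.dirichlet(5 * np.ones(n)) for _ in range(3)]
D = (x[:, None] - x[None, :]) ** 2
C = D[:, :, None] + D[None, :, :] + D[:, None, :]   # C[i,j,k] = D[i,j] + D[j,k] + D[i,k]
P = multimarginal_sinkhorn(a_list, C, eps=0.2)
print(P.shape, P.sum())
```

Every block update contracts the full tensor, so memory and time grow like $n^S$ ; this is the baseline that structured costs are meant to avoid.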

Low-Rank Optimal Transport¶

Low-rank OT reduces the size of a transport plan by forcing the coupling itself to pass through a small latent measure. This is useful when the mass exchange is expected to be organized by a few hidden clusters or prototypes, and it is distinct from approximating the Sinkhorn kernel by a low-rank matrix. The idea was introduced statistically through factored couplings by Forrow, Hütter, Nitzan, Rigollet, Schiebinger and Weed Forrow et al., 2019, and developed algorithmically for arbitrary costs by Scetbon, Cuturi and Peyré Scetbon et al., 2021.

Definition: Low-Rank Factored Couplings

Let $a\in\simplex_n$ , $b\in\simplex_m$ and let $r\geq1$ . A rank- $r$ factored coupling is a triple $(\Q,\R,g)$ such that

\Q\in\RR_+^{n\times r},\qquad \R\in\RR_+^{m\times r}, \qquad g\in\simplex_r,

(97)

with

\Q\ones_r=a, \qquad \R\ones_r=b, \qquad \Q^\top\ones_n=\R^\top\ones_m=g.

(98)

It induces the coupling

\P(\Q,\R,g) = \Q\operatorname{diag}(g)^{-1}\R^\top, \qquad \P_{i,j}=\sum_{k=1}^r \frac{\Q_{i,k}\R_{j,k}}{g_k},

(99)

where columns with $g_k=0$ are discarded before applying the formula.

The latent interpretation is immediate. The vector $g$ is the law of an intermediate variable $Z\in\{1,\ldots,r\}$ ; $\Q$ is a coupling of the source index $X$ with $Z$ ; $\R$ is a coupling of the target index $Y$ with the same $Z$ . Formula (99) is the law of $(X,Y)$ obtained by sampling $Z\sim g$ , then sampling $X$ and $Y$ conditionally independently given $Z$ . Equivalently, OT is replaced by a succession of two transports through an abstract intermediate measure $\eta=\sum_{k=1}^r g_k\delta_{z_k}$ on an $r$ -point space. The locations $z_k$ only label the latent atoms and do not enter the original cost.
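The definition can be checked numerically by building two sub-couplings and composing them. In the sketch below, the helper `ipfp` (iterative proportional fitting of a random positive matrix) is only a convenient, hypothetical way to produce feasible factors $\Q$ and $\R$ ; any couplings with the right marginals would do.

```python
import numpy as np

def ipfp(M, row, col, n_iter=500):
    """Scale a positive matrix M to row sums `row` and column sums `col`."""
    M = M.copy()
    for _ in range(n_iter):
        M *= (row / M.sum(axis=1))[:, None]
        M *= (col / M.sum(axis=0))[None, :]
    return M

rng = np.random.default_rng(3)
n, m, r = 7, 5, 2
a = rng.dirichlet(np.ones(n))
b = rng.dirichlet(np.ones(m))
g = rng.dirichlet(np.ones(r))

Q = ipfp(rng.random((n, r)) + 0.1, a, g)   # coupling of (a, g)
R = ipfp(rng.random((m, r)) + 0.1, b, g)   # coupling of (b, g)
P = (Q / g) @ R.T                          # Q diag(g)^{-1} R^T, formula (99)

print(np.abs(P.sum(axis=1) - a).max(), np.abs(P.sum(axis=0) - b).max())
print(np.linalg.matrix_rank(P))
```

Storing $(\Q,\R,g)$ costs $(n+m+1)r$ numbers instead of $nm$ , and applying $\P$ to a vector costs $O((n+m)r)$ operations.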

For a cost matrix $\C\in\RR^{n\times m}$ , the low-rank constrained OT value is

\min_{(\Q,\R,g)} \sum_{k=1}^r \frac{\Q_{:,k}^\top \C \R_{:,k}}{g_k}.

(102)

The minimization is over triples satisfying Definition Definition: Low-Rank Factored Couplings.

This problem is non-convex. Scetbon, Cuturi and Peyré regularize the joint variables $(\Q,\R,g)$ by the sum of their entropies and optimize them by constrained mirror descent Scetbon et al., 2021. To isolate the simpler block mechanism used in Figure Div, fix a positive latent law $g\in\simplex_r$ and optimize only the two sub-couplings:

\min_{\substack{\Q\ones_r=a,\;\Q^\top\ones_n=g\\ \R\ones_r=b,\;\R^\top\ones_m=g}} \sum_{k=1}^r \frac{\Q_{:,k}^\top \C \R_{:,k}}{g_k} + \epsilon\KLD(\Q|a\otimes g) + \epsilon\KLD(\R|b\otimes g).

(103)

For fixed $g$ , this differs only by constants from adding the negative entropies of $\Q$ and $\R$ to the factorized transport cost. Each block subproblem is a strictly convex entropic OT problem with an effective cost, although the joint objective remains non-convex because of its bilinear $\Q$ -- $\R$ term.

Algorithm: Alternating Low-Rank Sinkhorn with Fixed Latent Mass

Input: positive marginals $a\in\operatorname{int}(\simplex_n)$ and $b\in\operatorname{int}(\simplex_m)$ , cost $\C\in\RR^{n\times m}$ , rank $r$ , positive latent mass $g\in\operatorname{int}(\simplex_r)$ , regularization $\epsilon>0$ , maximum number of iterations $L\geq1$ , and tolerance $\mathrm{tol}$ .

Output: factored coupling $\P=\Q\operatorname{diag}(g)^{-1}\R^\top$ .

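Initialize $\Q^{(0)}=a\otimes g$ , choose a positive $\R^{(0)}$ with $\R^{(0)}\ones_r=b$ and $(\R^{(0)})^\top\ones_m=g$ that differs from the product $b\otimes g$ (for instance a randomly perturbed and rescaled copy), and set $J^{(0)}=+\infty$ . The product $\R^{(0)}=b\otimes g$ must be avoided: it makes every column of $A^{(0)}$ equal to $\C b$ , so both updates return the product factors and the iteration never leaves $a\otimes b$ .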

For $\ell=0,\ldots,L-1$ :

Set $A^{(\ell)}=\C \R^{(\ell)}\operatorname{diag}(g)^{-1}$ .
Update $\Q^{(\ell+1)}$ by entropic OT: $\Q^{(\ell+1)}=\arg\min_{\Q\ones_r=a,\,\Q^\top\ones_n=g} \langle A^{(\ell)},\Q\rangle+\epsilon\KLD(\Q|a\otimes g)$ .
Set $B^{(\ell+1)}=\C^\top \Q^{(\ell+1)}\operatorname{diag}(g)^{-1}$ .
Update $\R^{(\ell+1)}$ by entropic OT: $\R^{(\ell+1)}=\arg\min_{\R\ones_r=b,\,\R^\top\ones_m=g} \langle B^{(\ell+1)},\R\rangle+\epsilon\KLD(\R|b\otimes g)$ .
Set $\P^{(\ell+1)}=\Q^{(\ell+1)}\operatorname{diag}(g)^{-1}(\R^{(\ell+1)})^\top$ .
Set $J^{(\ell+1)}$ to the value of (103) at $(\Q^{(\ell+1)},\R^{(\ell+1)})$ .
If $\ell\geq1$ and $|J^{(\ell+1)}-J^{(\ell)}|\leq \mathrm{tol}\max\{1,|J^{(\ell)}|\}$ , return $\P^{(\ell+1)}$ .

Return $\P^{(L)}$ if the loop reaches $L$ iterations.

Each update exactly minimizes one factor block while keeping the other fixed, so the objective decreases monotonically. Positivity makes each block minimizer unique and keeps it in the relative interior of its transport polytope. Compactness and exact cyclic block-coordinate descent imply that every accumulation point is a coordinatewise minimizer, hence a stationary point of the constrained problem. Because the joint problem is non-convex, such a point need not be a globally optimal rank- $r$ coupling.
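The alternating scheme can be sketched as follows. Several choices here are assumptions of the sketch rather than prescriptions of the text: the inner solver `sinkhorn_block` runs a fixed number of Sinkhorn iterations, the test problem is a small one-dimensional quadratic cost with uniform weights, and the starting factor `R0` is a random coupling of $(b,g)$ . A non-product start matters: with $\R^{(0)}=b\otimes g$ all columns of the effective cost coincide and the updates stay at the independent coupling.

```python
import numpy as np

def sinkhorn_block(cost, row, col, eps, n_iter=2000):
    """Entropic OT: min <cost, Q> + eps KL(Q | row x col) over couplings of (row, col)."""
    K = np.exp(-(cost - cost.min()) / eps)
    u, v = np.ones_like(row), np.ones_like(col)
    for _ in range(n_iter):
        u = 1.0 / (K @ (col * v))
        v = 1.0 / (K.T @ (row * u))
    return (row * u)[:, None] * K * (col * v)[None, :]

def kl(Q, ref):
    return float(np.sum(Q * np.log(Q / ref) - Q + ref))

def objective(Q, R, C, a, b, g, eps):
    transport = float(np.sum((Q / g) * (C @ R)))   # sum_k Q_k^T C R_k / g_k
    return transport + eps * kl(Q, np.outer(a, g)) + eps * kl(R, np.outer(b, g))

def lowrank_sinkhorn(a, b, C, g, R0, eps, L=50, tol=1e-9):
    R, J_prev, history = R0, np.inf, []
    for ell in range(L):
        Q = sinkhorn_block((C @ R) / g, a, g, eps)     # Q-update with cost A
        R = sinkhorn_block((C.T @ Q) / g, b, g, eps)   # R-update with cost B
        J = objective(Q, R, C, a, b, g, eps)
        history.append(J)
        if ell >= 1 and abs(J - J_prev) <= tol * max(1.0, abs(J_prev)):
            break
        J_prev = J
    return (Q / g) @ R.T, history

rng = np.random.default_rng(0)
n, m, r, eps = 8, 6, 3, 0.05
x, y = np.linspace(0, 1, n), np.linspace(0, 1, m)
C = (x[:, None] - y[None, :]) ** 2
a, b, g = np.full(n, 1 / n), np.full(m, 1 / m), np.full(r, 1 / r)

# Symmetry-breaking start (assumption of this sketch): a random coupling of (b, g).
R0 = sinkhorn_block(rng.random((m, r)), b, g, eps=1.0)
P, history = lowrank_sinkhorn(a, b, C, g, R0, eps)

cost_lowrank = float(np.sum(C * P))
cost_product = float(a @ C @ b)
print(cost_lowrank, cost_product, len(history))
```

The recorded objective values should be non-increasing up to the accuracy of the inner Sinkhorn solves, and the returned coupling depends on the random start because the problem is non-convex.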

Figure Div visualizes both the intermediate latent measure and the improvement of the factored coupling as the prescribed rank increases.

Low-rank entropic OT on a one-dimensional Gaussian-mixture example. The first view shows factorization through four latent atoms; the matrix panels compare the full entropic coupling with fixed-latent-mass low-rank couplings of increasing rank. This is deliberately not a favorable example for low rank: with a small entropic parameter, one-dimensional quadratic OT is close to a sparse Monge graph rather than to a genuinely low-rank matrix.

Interactive panel. Vary the latent rank and entropic scale to see the same one-dimensional problem as a two-stage transport through a small intermediate measure. The right matrix should approach the full entropic plan as the rank increases.

Capacity-Constrained Optimal Transport¶

Classical Kantorovich transport only fixes the marginals: if a pair $(x,y)$ is cheap, the optimizer may concentrate as much mass as the marginal constraints allow on this pair. Capacity-constrained OT adds a local congestion rule on the coupling itself. It is useful when edges, facilities or matchings have limited throughput, and it also gives a clean mathematical way to interpolate between a sparse OT plan and the independent product coupling. The systematic study of the continuous problem, including existence and the geometry of active saturated regions, was developed by Korman and McCann Korman & McCann, 2015.

Let $\alpha\in\Mm_+^1(\Xx)$ , $\beta\in\Mm_+^1(\Yy)$ and let $\kappa:\Xx\times\Yy\to[0,+\infty)$ be a finite-valued measurable capacity. The capacity-constrained transport value is

\MK_c^\kappa(\alpha,\beta) = \inf_{\pi\in\Couplings(\alpha,\beta)} \left\{ \int_{\Xx\times\Yy} c(x,y)\,d\pi(x,y) :\; \pi\ll\alpha\otimes\beta,\quad \frac{d\pi}{d(\alpha\otimes\beta)}(x,y)\leq\kappa(x,y) \right\}.

(104)

The product coupling $\alpha\otimes\beta$ is feasible whenever $\kappa\geq1$ . Thus the constraint is not meant to promote the product plan; rather, it prevents the optimizer from using any pair more than the prescribed density ratio. Separately, we adopt the convention $\MK_c^{+\infty}(\alpha,\beta)=\MK_c(\alpha,\beta)$ : the symbol $\kappa\equiv+\infty$ removes both the density bound and the absolute-continuity requirement, and hence recovers the full Kantorovich problem. At the opposite extreme, $\kappa\equiv1$ forces the independent coupling itself, because a density bounded by one and integrating to one must equal one almost everywhere. Values close to one therefore enforce diffuse plans close to this reference.

For discrete measures $\alpha=\sum_i a_i\delta_{x_i}$ and $\beta=\sum_j b_j\delta_{y_j}$ , a capacity is an upper matrix $U\in\RR_+^{n\times m}$ . The density-ratio discretization of (104) is $U_{i,j}=\kappa_{i,j}a_i b_j$ , and the finite-dimensional problem is the linear program

\min_{\P\in\CouplingsD(a,b)} \langle \C,\P\rangle \quad\text{subject to}\quad 0\leq \P_{i,j}\leq U_{i,j}\quad\forall(i,j).

(105)

Feasibility is now a genuine issue: the upper matrix must contain enough mass in every row-column cut to support the prescribed marginals. The usual transport polytope is recovered when $U_{i,j}=+\infty$ , while small capacities select a smaller capped transportation polytope. For index sets $I$ and $J$ , write $a(I)=\sum_{i\in I}a_i$ , $b(J)=\sum_{j\in J}b_j$ , and $U(I,J)=\sum_{i\in I,j\in J}U_{ij}$ . By the max-flow min-cut theorem, the capped polytope is nonempty if and only if the cut condition $a(I)+b(J)-1\leq U(I,J)$ holds for all $I\subseteq\{1,\ldots,n\}$ and $J\subseteq\{1,\ldots,m\}$ .

Entropic smoothing gives a direct Sinkhorn-like algorithm. Assume the cut condition above. With $K_{i,j}=a_i b_j e^{-\C_{i,j}/\epsilon}$ , the regularized problem is

\min_{\P\in\CouplingsD(a,b),\,0\leq \P\leq U} \langle \C,\P\rangle + \epsilon\KLD(\P|a\otimes b).

(109)

Equivalently, up to additive constants, the objective is $\epsilon\KLD(\P|K)$ . The problem is therefore the KL projection of $K$ onto the intersection of three convex sets: the row constraints, the column constraints and the box $\P\leq U$ . Alternating KL projections with Dykstra correction factors Dykstra, 1985Bauschke & Lewis, 2000, in the same spirit as the Bregman projection formulation of Sinkhorn Benamou et al., 2015, gives a simple capacity-constrained scaling scheme.

Algorithm: Capacity-Constrained Sinkhorn by KL-Dykstra

Input: positive marginals $a\in\simplex_n$ and $b\in\simplex_m$ , finite cost $\C$ , feasible capacity $U$ , regularization $\epsilon>0$ , and tolerance $\mathrm{tol}$ .

Output: entropic capped coupling $\P$ .

Restrict all arrays to the active edge set $E=\{(i,j):U_{ij}>0\}$ ; entries outside $E$ remain zero. All products and divisions below are entrywise on $E$ .

Set $K_{i,j}=a_i b_j e^{-\C_{i,j}/\epsilon}$ for $(i,j)\in E$ .

Initialize $\P=K$ , correction matrices $\R_1=\R_2=\R_3=\ones_{n\times m}$ and $r=+\infty$ .

While $r>\mathrm{tol}$ :

Set $Z=\P\odot \R_1$ .
Project rows: $\P=\operatorname{diag}(a/(Z\ones_m))Z$ .
Update $\R_1=Z\oslash \P$ .
Set $Z=\P\odot \R_2$ .
Project columns: $\P=Z\operatorname{diag}(b/(Z^\top\ones_n))$ .
Update $\R_2=Z\oslash \P$ .
Set $Z=\P\odot \R_3$ .
Project capacity: $\P=\min\{Z,U\}$ entrywise.
Update $\R_3=Z\oslash \P$ .
Set $r=\max\{\|\P\ones_m-a\|_\infty,\|\P^\top\ones_n-b\|_\infty,\|(\P-U)_+\|_\infty\}$ .

Return $\P$ .

On the active support, all iterates and correction factors are positive, so every division is defined. Finite-dimensional KL-Dykstra convergence shows that, whenever the capped polytope is nonempty, the iterates converge to the unique regularized minimizer. Deleting zero-capacity edges is essential: otherwise the correction step creates undefined products of zero and infinite factors.
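A direct transcription of the KL-Dykstra loop is sketched below. It assumes that every capacity is strictly positive, so no restriction to an active edge set is needed; the iteration cap `max_iter`, the function name, and the test problem (uniform marginals on a six-point grid with cap $U_{ij}=2a_ib_j$ ) are illustrative assumptions.

```python
import numpy as np

def capacity_sinkhorn(a, b, C, U, eps, tol=1e-8, max_iter=20000):
    """KL-Dykstra sketch for entropic OT with entrywise cap U (assumes U > 0 everywhere)."""
    P = np.outer(a, b) * np.exp(-C / eps)          # Gibbs kernel K
    R1, R2, R3 = np.ones_like(P), np.ones_like(P), np.ones_like(P)
    res = np.inf
    for _ in range(max_iter):
        Z = P * R1
        P = Z * (a / Z.sum(axis=1))[:, None]       # KL projection on the row constraint
        R1 = Z / P
        Z = P * R2
        P = Z * (b / Z.sum(axis=0))[None, :]       # KL projection on the column constraint
        R2 = Z / P
        Z = P * R3
        P = np.minimum(Z, U)                       # KL projection on the box P <= U
        R3 = Z / P
        res = max(np.abs(P.sum(axis=1) - a).max(), np.abs(P.sum(axis=0) - b).max())
        if res <= tol:
            break
    return P, res

n = 6
x = np.linspace(0.0, 1.0, n)
C = (x[:, None] - x[None, :]) ** 2
a = b = np.full(n, 1.0 / n)
U = 2.0 * np.outer(a, b)        # density-ratio cap kappa = 2 (feasible since kappa >= 1)

P, res = capacity_sinkhorn(a, b, C, U, eps=0.05)
print(res, int(np.isclose(P, U).sum()), "saturated entries")
```

Since the capacity projection is applied last, the returned plan satisfies $\P\leq U$ exactly and the residual only measures the marginal violation.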

Figure Div shows how lowering the entrywise cap progressively spreads a one-dimensional coupling while preserving both prescribed marginals.

Capacity-constrained entropic OT between two one-dimensional Gaussian-mixture histograms. The same source and target marginals are coupled with a density-ratio cap $U_{ij}=\kappa a_i b_j$ . Large capacity leaves a nearly Monge-like graph, whereas small capacity saturates many entries and forces the coupling to spread.

Interactive panel. Vary the density-ratio cap and the entropic regularization to see how the upper bound turns a graph-like one-dimensional coupling into a saturated spread-out plan.

For the empirical self-coupling in Figure Div, the cap is chosen to prescribe a minimum number of outgoing connections per source point. With uniform weights $a_i=1/n$ , imposing $\P_{ij}\leq1/(qn)$ is equivalent to the conditional bound $\P_{ij}/a_i\leq1/q$ . Since each row has total mass $1/n$ , this forces each source row to use at least $q$ target atoms, up to the small extra spreading introduced by entropic smoothing.

Capacity-constrained local self-couplings on a two-dimensional empirical Gaussian mixture. The source and target are the same semi-regular uniform empirical measure, but the diagonal is removed to avoid the trivial identity plan. The three panels use off-diagonal caps $U_{ij}=1/(qn)$ with $q=1,3,5$ , equivalently $\P_{ij}/a_i\leq1/q$ because $a_i=1/n$ . They therefore impose at least one, three and five outgoing connections per source atom.

Interactive panel. Change the admissible number of outgoing connections to see how the capacity bound turns a dense self-coupling into a local transport graph.

Metric Learning and Inverse OT¶

Metric learning differentiates a forward transport loss through a parameterized cost, whereas inverse OT starts from an observed plan and asks which cost makes it optimal. The first viewpoint is often bilevel; the second admits a direct convex formulation for affine cost families, provided the intrinsic cost invariances are removed.

Differentiating OT Losses¶

Inverse OT and metric learning repeatedly differentiate a forward OT value with respect to the input law and to the ground cost. The two resulting objects are precisely the two certificates of optimality: a Kantorovich potential for perturbations of the marginal and an optimal coupling for perturbations of the cost. The main caveat is non-uniqueness. In the unregularized case, the correct objects are one-sided directional derivatives, or equivalently subgradients in the measure variable and supergradients in the cost variable. Entropic regularization selects a unique plan and, for positive finite histograms, gives ordinary derivatives on the simplex interiors.

Proposition: First variations of unregularized OT

Let $\alpha\in\Pp(\Xx)$ and $\beta\in\Pp(\Yy)$ , where $\Xx,\Yy$ are compact metric spaces, and let $c\in\Cc(\Xx\times\Yy)$ . Define

\mathcal V_c(\alpha,\beta) \eqdef \inf_{\pi\in\Couplings(\alpha,\beta)} \int_{\Xx\times\Yy}c(x,y)\d\pi(x,y).

(110)

Let $\mathcal O_c(\alpha,\beta)$ be the set of optimal couplings and let $\mathcal D_c(\alpha,\beta)$ be the set of optimal dual potentials $(f,g)$ with $f(x)+g(y)\leq c(x,y)$ . If $\chi$ is a signed measure with $\chi(\Xx)=0$ and $\alpha_t=\alpha+t\chi$ is a probability measure for $0\leq t\leq t_0$ , then

\left.\frac{\d}{\d t}\right|_{t=0^+} \mathcal V_c(\alpha_t,\beta) = \sup_{(f,g)\in\mathcal D_c(\alpha,\beta)} \int_{\Xx} f(x)\d\chi(x).

(111)

If $h\in\Cc(\Xx\times\Yy)$ and $c_t=c+th$ , then

\left.\frac{\d}{\d t}\right|_{t=0^+} \mathcal V_{c_t}(\alpha,\beta) = \inf_{\pi\in\mathcal O_c(\alpha,\beta)} \int_{\Xx\times\Yy} h(x,y)\d\pi(x,y).

(112)

In particular, if the normalized optimal potential $f^\star$ and the optimal plan $\pi^\star$ are unique, then

\frac{\delta \mathcal V_c}{\delta\alpha}=f^\star, \qquad \frac{\delta \mathcal V_c}{\delta c}=\pi^\star .

(113)

The second identity means that the first variation with respect to the function $c$ is represented by the optimal measure $\pi^\star$ on $\Xx\times\Yy$ .

In the discrete case, this proposition says that any optimal dual vector $f^\star$ is a subgradient with respect to the source weights $a$ , while any optimal plan $P^\star$ is a supergradient with respect to the cost matrix $C$ , because the value is concave in $C$ :

f^\star\in\partial_a\mathcal L_C(a,b), \qquad P^\star\in\partial_C^{\mathrm{sup}}\mathcal L_C(a,b).

(116)

Here $\partial_C^{\mathrm{sup}}$ denotes the superdifferential of the concave map $C\mapsto\mathcal L_C(a,b)$ . When the corresponding objects are unique, these inclusions become the gradients $\nabla_a\mathcal L_C(a,b)=f^\star$ on the tangent space $\{\dotp{\ones}{\chi}=0\}$ and $\nabla_C\mathcal L_C(a,b)=P^\star$ . Without uniqueness, the exact directional derivative with respect to $C$ in a direction $\Delta C$ is the minimum of $\dotp{\Delta C}{P}$ over all optimal plans.
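The two inclusions can be tested numerically on a small instance. The following sketch assumes SciPy's `linprog` (HiGHS) and solves the primal and dual linear programs separately; the helper names `ot_primal` and `ot_dual` and the random data are illustrative choices. It then evaluates the supergradient inequality in $C$ and the subgradient inequality in $a$ at randomly drawn test points.

```python
import numpy as np
from scipy.optimize import linprog

def ot_primal(C, a, b):
    """Kantorovich LP: returns the optimal value and one optimal plan."""
    n, m = C.shape
    A_eq = np.vstack([np.kron(np.eye(n), np.ones((1, m))),   # row sums = a
                      np.kron(np.ones((1, n)), np.eye(m))])  # column sums = b
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)

def ot_dual(C, a, b):
    """Dual LP: maximize <f,a> + <g,b> subject to f_i + g_j <= C_ij, with gauge f_0 = 0."""
    n, m = C.shape
    A_ub = np.hstack([np.kron(np.eye(n), np.ones((m, 1))),   # selects f_i in row (i, j)
                      np.kron(np.ones((n, 1)), np.eye(m))])  # selects g_j in row (i, j)
    bounds = [(0, 0)] + [(None, None)] * (n + m - 1)
    res = linprog(-np.concatenate([a, b]), A_ub=A_ub, b_ub=C.ravel(),
                  bounds=bounds, method="highs")
    return res.x[:n], res.x[n:]

rng = np.random.default_rng(0)
n, m = 4, 5
C = rng.random((n, m))
a = rng.dirichlet(np.ones(n))
b = rng.dirichlet(np.ones(m))

value, P = ot_primal(C, a, b)
f, g = ot_dual(C, a, b)

# Supergradient in C: L_{C'}(a,b) <= L_C(a,b) + <C' - C, P> for any cost C'.
C_new = rng.random((n, m))
super_gap = value + np.sum((C_new - C) * P) - ot_primal(C_new, a, b)[0]

# Subgradient in a: L_C(a',b) >= L_C(a,b) + <f, a' - a> for any a' in the simplex.
a_new = rng.dirichlet(np.ones(n))
sub_gap = ot_primal(C, a_new, b)[0] - value - f @ (a_new - a)
```

Both `super_gap` and `sub_gap` are nonnegative up to solver tolerance. The first holds because $P^\star$ stays feasible for the new cost, the second because $(f^\star,g^\star)$ stays dual-feasible for the new marginal.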

Proposition: First variations of entropic OT

Under the compact-space and continuous-cost assumptions of the proposition on first variations of unregularized OT, let $\epsilon>0$ and define the KL-normalized entropic value

\mathcal V_{c,\epsilon}(\alpha,\beta) \eqdef \inf_{\pi\in\Couplings(\alpha,\beta)} \int c\d\pi+\epsilon\operatorname{KL}(\pi\mid\alpha\otimes\beta).

(117)

Choose optimal entropic potentials $(f_\epsilon,g_\epsilon)$ in soft-transform form, so that the following density has marginals $\alpha$ and $\beta$ :

\d\pi_\epsilon(x,y) = \exp\!\left(\frac{f_\epsilon(x)+g_\epsilon(y)-c(x,y)}{\epsilon}\right) \d\alpha(x)\d\beta(y)

(118)

This defines the unique optimal coupling. For the same perturbations $\alpha_t=\alpha+t\chi$ and $c_t=c+th$ as above,

\left.\frac{\d}{\d t}\right|_{t=0^+} \mathcal V_{c,\epsilon}(\alpha_t,\beta) = \int f_\epsilon\d\chi, \qquad \left.\frac{\d}{\d t}\right|_{t=0^+} \mathcal V_{c_t,\epsilon}(\alpha,\beta) = \int h\d\pi_\epsilon .

(119)

Equivalently,

\frac{\delta \mathcal V_{c,\epsilon}}{\delta\alpha}=f_\epsilon, \qquad \frac{\delta \mathcal V_{c,\epsilon}}{\delta c}=\pi_\epsilon .

(120)

In finite dimension, for positive histograms, these are ordinary derivatives on the relative interior of the simplices.
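A finite-difference check makes the two entropic derivatives concrete. The sketch below uses a plain log-domain Sinkhorn loop written for this illustration, with arbitrary sizes, $\epsilon=0.5$ and a fixed iteration count as assumptions. It compares central differences of the KL-normalized value with $\dotp{H}{P_\epsilon}$ for a cost direction $H$ and with $\dotp{f_\epsilon}{\chi}$ for a zero-sum mass direction $\chi$.

```python
import numpy as np

def logsumexp(M, axis):
    mx = M.max(axis=axis, keepdims=True)
    return np.squeeze(mx + np.log(np.exp(M - mx).sum(axis=axis, keepdims=True)), axis=axis)

def sinkhorn(C, a, b, eps, n_iter=500):
    """Log-domain Sinkhorn for min <C,P> + eps * KL(P | a x b)."""
    f, g = np.zeros_like(a), np.zeros_like(b)
    for _ in range(n_iter):
        # Soft c-transforms: each update enforces one marginal exactly.
        f = -eps * logsumexp((g[None, :] - C) / eps + np.log(b)[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps + np.log(a)[:, None], axis=0)
    P = a[:, None] * b[None, :] * np.exp((f[:, None] + g[None, :] - C) / eps)
    # P has total mass 1 after the g-update, so the dual value is <f,a> + <g,b>.
    return f @ a + g @ b, P, f, g

rng = np.random.default_rng(0)
n, m, eps = 4, 5, 0.5
C = rng.random((n, m))
a = 1.0 + rng.random(n); a /= a.sum()     # positive histograms
b = 1.0 + rng.random(m); b /= b.sum()
value, P, f, g = sinkhorn(C, a, b, eps)

t = 1e-5
H = rng.normal(size=(n, m))               # perturbation of the cost
fd_cost = (sinkhorn(C + t * H, a, b, eps)[0] - sinkhorn(C - t * H, a, b, eps)[0]) / (2 * t)

chi = rng.normal(size=n); chi -= chi.mean()   # zero-sum perturbation of a
fd_mass = (sinkhorn(C, a + t * chi, b, eps)[0] - sinkhorn(C, a - t * chi, b, eps)[0]) / (2 * t)
```

Here `fd_cost` should match `np.sum(H * P)` and `fd_mass` should match `f @ chi`. The additive constant in $f_\epsilon$ is invisible because $\chi$ sums to zero.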

For a finite-dimensional parametrization $c_\theta$ or $\C_\theta$ , the entropic formula gives the backpropagation rule

\partial_{\theta}\mathcal V_{c_\theta,\epsilon} = \int \partial_\theta c_\theta(x,y)\d\pi_\epsilon(x,y),

(123)

where $\pi_\epsilon$ is the entropic optimizer. For the unregularized value $\mathcal V_{c_\theta}$ , uniqueness of the optimal plan $\pi^\star$ gives $\partial_\theta\mathcal V_{c_\theta} =\int\partial_\theta c_\theta\,\d\pi^\star$ . Without uniqueness, the directional derivative in a parameter direction $\dot\theta$ is obtained by minimizing $\int \dot\theta\cdot\partial_\theta c_\theta\,\d\pi$ over the optimal face, while any selected optimal plan gives a valid supergradient with respect to the cost. This is the calculus behind ground-metric learning, which was explicitly studied in Cuturi & Avis, 2014 and connects to the broader metric-learning literature Kulis, 2012; Bellet et al., 2015. If one uses the entropy-only discrete convention of the Sinkhorn chapter instead of the KL-normalized value, then, for positive source weights,

\mathcal L_C^\epsilon(a,b) = \mathcal V_{C,\epsilon}(a,b) - \epsilon H(a)-\epsilon H(b),

(124)

so its derivative with respect to $a$ is represented on the simplex tangent space by $f_\epsilon+\epsilon\log a$ , up to an irrelevant additive constant.

The figure below gives the geometric counterpart of this differentiation rule: changing an anisotropic quadratic cost changes which transport segments are selected.

Changing the ground metric changes the optimal coupling. The same red and blue empirical measures are matched with $c_A(x,y)=(x-y)^\top A(x-y)$ for the Euclidean metric and two increasingly anisotropic Mahalanobis metrics. The small gray ellipse shows the unit ball of the metric: directions in which the ellipse is elongated are cheaper, and this deforms the transport segments selected by the OT plan.

The interactive demo lets the anisotropy and orientation of the Mahalanobis cost move. The transport plan is recomputed exactly for the displayed particles, so the segments show how the learned cost changes the matching.

Interactive panel. Use the metric and deformation controls to see how learning the ground cost changes the apparent transport geometry.
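For the Mahalanobis family $c_A(x,y)=(x-y)^\top A(x-y)$, the unregularized rule says that a selected optimal plan yields the supergradient $\sum_{i,j}P_{ij}(x_i-y_j)(x_i-y_j)^\top$ with respect to $A$. The sketch below illustrates this for equal-weight point clouds, assuming SciPy's `linear_sum_assignment` as the exact solver; the data, the helper name `mahalanobis_ot` and the test matrices are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mahalanobis_ot(A, X, Y):
    """Equal-weight OT for c_A(x,y) = (x-y)^T A (x-y): value and a supergradient in A."""
    D = X[:, None, :] - Y[None, :, :]                 # D[i, j] = x_i - y_j
    C = np.einsum("ijk,kl,ijl->ij", D, A, D)          # cost matrix C(A)
    rows, cols = linear_sum_assignment(C)
    d = D[rows, cols]                                 # displacements of the selected bijection
    value = C[rows, cols].mean()                      # <C(A), P> with P = permutation / n
    grad = d.T @ d / len(X)                           # sum_ij P_ij (x_i - y_j)(x_i - y_j)^T
    return value, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
Y = rng.normal(size=(8, 2)) + np.array([1.0, 0.0])

A = np.eye(2)
value, grad = mahalanobis_ot(A, X, Y)

# Concavity in A: the value at any other metric lies below the linearization.
A_new = np.array([[2.0, 0.5], [0.5, 1.0]])
value_new, _ = mahalanobis_ot(A_new, X, Y)
linearized = value + np.sum((A_new - A) * grad)

# Small perturbation: for generic data the optimal bijection does not change,
# so the forward difference matches <E, grad>.
t = 1e-6
E = np.array([[0.3, -0.2], [-0.2, 0.1]])
fd = (mahalanobis_ot(A + t * E, X, Y)[0] - value) / t
```

A gradient-based metric-learning loop would feed `grad` into an update of $A$. Any projection onto positive semidefinite matrices or normalization of $A$ is a modeling choice not shown here.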

Inverse Optimal Transport¶

Inverse OT asks for a ground cost that explains observed matchings or flows as optimal transport plans. In its most direct form, one observes a plan $\widehat\pi$ with marginals $(\alpha,\beta)$ and seeks a cost $c$ such that $\widehat\pi$ is optimal for

\inf_{\pi\in\Couplings(\alpha,\beta)} \int c(x,y)\d\pi(x,y).

(125)

This is ill-posed without structure. Adding $u(x)+v(y)$ to a cost shifts every feasible objective by the same marginal-dependent constant, multiplying a cost by a positive scalar does not change its minimizers, and the zero cost rationalizes every feasible plan. An identifiable model must quotient or normalize these gauge and scale freedoms; a sparse observed plan can still be compatible with a nontrivial cone of normalized costs.
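The gauge and scale freedoms are easy to observe on an assignment problem. The sketch below, which assumes SciPy's `linear_sum_assignment` and a random cost with a unique optimizer, shows that adding $u_i+v_j$ or rescaling by a positive factor leaves the optimal bijection unchanged, and that the gauge term shifts the objective of every permutation by the same amount.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 6
C = rng.random((n, n))                    # generic cost: unique optimal permutation
u, v = rng.normal(size=n), rng.normal(size=n)

def optimal_permutation(cost):
    rows, cols = linear_sum_assignment(cost)
    return cols

sigma = optimal_permutation(C)
sigma_gauge = optimal_permutation(C + u[:, None] + v[None, :])   # marginal-only shift
sigma_scaled = optimal_permutation(3.0 * C)                      # positive rescaling

# The gauge term changes the total cost of ANY permutation by sum(u) + sum(v).
perm = rng.permutation(n)
shift = (C + u[:, None] + v[None, :])[np.arange(n), perm].sum() - C[np.arange(n), perm].sum()
```

All three permutations coincide, and `shift` equals `u.sum() + v.sum()` regardless of `perm`. This is why a cost can only be identified modulo these transformations.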

A useful statistical methodology is to measure the suboptimality of the observed plan through a Fenchel--Young loss. Write the score as $s=-c$ and define the convex regularized prediction value

G_\epsilon(s) = \sup_{\pi\in\Couplings(\alpha,\beta)} \int s\d\pi - \epsilon\operatorname{KL}(\pi\mid\alpha\otimes\beta).

(126)

The Fenchel--Young loss

\mathcal L_\epsilon(c;\widehat\pi) = G_\epsilon(-c) + G_\epsilon^*(\widehat\pi) + \int c\d\widehat\pi

(127)

is nonnegative by the Fenchel--Young inequality and vanishes exactly when $\widehat\pi\in\partial G_\epsilon(-c)$ , i.e. when $\widehat\pi$ satisfies the regularized optimality conditions for $c$ . Entropic regularization is important here because it makes the forward map smoother and provides gradients with respect to $c$ , at the price of a statistical bias Andrade et al., 2025; Peyré et al., 2026.
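The loss (127) can be rewritten as an entropic optimality gap. This is a short worked computation in the notation of (117) and (126), not a statement taken from the cited papers. It assumes that $\widehat\pi\in\Couplings(\alpha,\beta)$ has finite relative entropy, and it uses the standard fact that $G_\epsilon$ is the convex conjugate of the lower semicontinuous convex function $\pi\mapsto\epsilon\operatorname{KL}(\pi\mid\alpha\otimes\beta)$ restricted to couplings, so that conjugating twice returns this function.

```latex
G_\epsilon(-c)
  = -\inf_{\pi\in\Couplings(\alpha,\beta)}
    \Big\{ \int c\,\d\pi + \epsilon\operatorname{KL}(\pi\mid\alpha\otimes\beta) \Big\}
  = -\mathcal V_{c,\epsilon}(\alpha,\beta),
\qquad
G_\epsilon^*(\widehat\pi)
  = \epsilon\operatorname{KL}(\widehat\pi\mid\alpha\otimes\beta),

\mathcal L_\epsilon(c;\widehat\pi)
  = \Big( \int c\,\d\widehat\pi
          + \epsilon\operatorname{KL}(\widehat\pi\mid\alpha\otimes\beta) \Big)
    - \mathcal V_{c,\epsilon}(\alpha,\beta)
  \;\geq\; 0 .
```

The loss is therefore the entropic objective evaluated at the observed plan minus its optimal value. By uniqueness of the entropic optimizer, it vanishes exactly when $\widehat\pi=\pi_\epsilon$.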

In the discrete unregularized case, this loss reduces to the optimality gap of the observed coupling. For $\widehat \P\in\mathbf U(a,b)$ and a cost matrix $\C$ , denote it by

\mathcal L_{\mathrm{iOT}}(\C;\widehat \P) = \dotp{\C}{\widehat \P} - \min_{\P\in\mathbf U(a,b)}\dotp{\C}{\P}.

(128)

This inverse-OT gap loss is nonnegative and vanishes exactly when $\widehat \P$ is optimal for $\C$ .

In practice, one restricts the cost to a finite-dimensional model class, often affine:

\C_\theta=\sum_{r=1}^R\theta_r \C^{(r)}, \qquad \theta\in\Theta,

(129)

where $\Theta$ is convex and the matrices $\C^{(r)}$ encode features, graph distances or a Mahalanobis parameterization. This viewpoint appears in low-rank and sparse inverse OT models Dupuy et al., 2019; Andrade et al., 2024 and in convex formulations for learning OT costs from observed plans Ma et al., 2020; Peyré et al., 2026.

A minimal finite-dimensional model is obtained by learning a bilinear cost on $\RR^d$ ,

c_A(x,y)=\dotp{Ax}{y}, \qquad A\in\RR^{d\times d}.

(130)

For empirical measures $\alpha=\frac1n\sum_i\delta_{x_i}$ and $\beta=\frac1n\sum_j\delta_{y_j}$ , this gives the cost matrix

\C(A)_{i,j}=\dotp{Ax_i}{y_j},

(131)

so both maps $A\mapsto \C(A)$ and $A\mapsto c_A$ are linear. Inverse OT within this model asks which matrix $A$ makes an observed matching or coupling look optimal; learning the cost is thus reduced to estimating a linear parameter.

For a fixed matrix $A$ , the forward prediction is the optimal face

\mathcal P_A\eqdef \uargmin{\P\in\CouplingsD(\ones_n/n,\ones_n/n)} \dotp{\C(A)}{\P}.

(132)

When this face is a singleton, write its element as $\P_A$ ; otherwise $\P_A$ denotes a deterministic tie-broken selection. Although $A\mapsto \C(A)$ is linear, the solution correspondence $A\mapsto\mathcal P_A$ is polyhedral: changing $A$ changes the direction in which the transport polytope is probed, and a tie-broken selection is constant on normal-cone cells. The figure below illustrates this correspondence on the OT4ML point clouds. The construction follows the visual idea of the Python Optimal Transport logo Flamary et al., 2021: red source atoms, blue target atoms and straight segments show the selected optimal bijection. With $e=(1,1)^\top$ and $\delta=10^{-3}$ , the first two rank-one matrices are

A_h=-e_1e^\top+\delta e_2e^\top, \qquad A_v=\delta e_1e^\top-e_2e^\top .

(133)

These small transverse terms break the large ties of the pure horizontal or vertical scores while preserving a rank-one cost. The matrix $A=-I$ gives the usual quadratic $\Wass_2$ assignment, up to the marginal-only terms discussed below, while $A=+I$ reverses the correlation and produces an anti- $\Wass_2$ matching.


Interactive panel. Change the bilinear cost matrix and solve the corresponding equal-weight assignment in the browser; each panel uses an actual Hungarian solve.

Forward solutions of the bilinear cost $c_A(x,y)=\dotp{Ax}{y}$ on the OT4ML logo point clouds. Each panel solves the equal-weight assignment problem with a different matrix $A$ ; the first two use $\delta=10^{-3}$ to break rank-one ties. The source atoms are red, the target atoms are blue, and the gray segments give one deterministic optimal bijection.

This elementary model already contains the quadratic Wasserstein assignment. Adding to a cost matrix a term depending only on $x_i$ or only on $y_j$ shifts the objective of every feasible coupling by the same constant, and therefore does not change the optimizer. Since

\norm{x-y}^2=\norm{x}^2+\norm{y}^2-2\dotp{x}{y},

(134)

the usual quadratic Wasserstein assignment has the same optimizer as the bilinear cost with $A_\star=-I$ , up to these marginal-only terms and an irrelevant positive factor. The inverse problem goes in the opposite direction: after observing a coupling, one asks which matrices $A$ could have generated it. The next figure generates an observed coupling $\widehat \P$ from this cost on two empirical mixtures of Gaussians, and then evaluates $\mathcal L_{\mathrm{iOT}}(\C(A_t);\widehat \P)$ along the anisotropic path

A_t=-\diag(1+t,1-t), \qquad -1\leq t\leq 1,

(135)

so that $t=0$ recovers the matrix that generated the observed coupling. Equivalently, with equal weights, $\widehat \P\in\CouplingsD(\ones_n/n,\ones_n/n)=\mathcal B_n/n$ , and the plotted loss is the Kantorovich gap

\mathcal L_{\mathrm{iOT}}(\C(A_t);\widehat \P) = \dotp{\C(A_t)}{\widehat \P} - \min_{\P\in\CouplingsD(\ones_n/n,\ones_n/n)} \dotp{\C(A_t)}{\P}, \qquad \C(A_t)_{i,j}=\dotp{A_t x_i}{y_j}.

(136)

Because $t\mapsto \C(A_t)$ is affine and the Kantorovich value is a minimum of affine functions over the fixed polytope $\CouplingsD(\ones_n/n,\ones_n/n)$ , this one-dimensional gap is convex and piecewise affine. Its zero set can contain an interval for a small sample, reflecting the fact that the same observed coupling remains optimal for a cone of nearby costs.
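The gap along the path (135) can be recomputed in a few lines. The sketch below is not the figure's code: it assumes two small synthetic Gaussian mixtures, equal weights, and SciPy's `linear_sum_assignment` as the exact solver, with hypothetical helper names. It generates the observed bijection with $A_\star=-I$, confirms that it coincides with the squared Euclidean assignment, and evaluates the gap on a uniform grid of $t$.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bilinear_cost(A, X, Y):
    """C(A)_ij = <A x_i, y_j>."""
    return (X @ A.T) @ Y.T

def iot_gap(A, X, Y, sigma_hat):
    """<C(A), P_hat> - min_P <C(A), P> for the equal-weight plan P_hat = perm(sigma_hat) / n."""
    C = bilinear_cost(A, X, Y)
    rows, cols = linear_sum_assignment(C)
    return C[np.arange(len(X)), sigma_hat].mean() - C[rows, cols].mean()

rng = np.random.default_rng(0)
n = 30
# Two illustrative mixtures of two Gaussians (synthetic data).
X = 0.3 * rng.normal(size=(n, 2)) + rng.choice([-1.0, 1.0], size=(n, 1)) * np.array([1.0, 0.5])
Y = 0.3 * rng.normal(size=(n, 2)) + rng.choice([-1.0, 1.0], size=(n, 1)) * np.array([0.5, -1.0])

# Observed coupling generated by A_star = -I.
sigma_hat = linear_sum_assignment(bilinear_cost(-np.eye(2), X, Y))[1]
# Same optimizer as the squared Euclidean cost, by the identity (134).
sigma_w2 = linear_sum_assignment(((X[:, None] - Y[None]) ** 2).sum(-1))[1]

ts = np.linspace(-1.0, 1.0, 41)
gaps = np.array([iot_gap(-np.diag([1.0 + t, 1.0 - t]), X, Y, sigma_hat) for t in ts])
second_diff = gaps[2:] - 2.0 * gaps[1:-1] + gaps[:-2]   # nonnegative for a convex curve
```

The array `gaps` is nonnegative and vanishes at $t=0$, and `second_diff` is nonnegative up to rounding, as expected for a convex piecewise-affine curve sampled on a uniform grid.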


Interactive panel. Vary sample size and cost rotation to recompute the empirical Kantorovich gap along the one-parameter inverse-OT path.

Inverse-OT gap loss for a bilinear cost. Panel (a): two empirical mixtures of two Gaussians are matched with the cost $c_{A_\star}(x,y)=\dotp{A_\star x}{y}$ for $A_\star=-I$ , which gives the same optimizer as the quadratic $\Wass_2$ cost; red and blue level sets display the two sampling densities. Panels (b,c): the unregularized Fenchel--Young Kantorovich gap $\mathcal L_{\mathrm{iOT}}(\C(A_t);\widehat \P)$ along $A_t=-\diag(1+t,1-t)$ for $n=10$ and $n=100$ , using the same vertical scale. The red dot marks the generating parameter $t=0$ ; the curves are convex and piecewise affine.

The comparison between $n=10$ and $n=100$ illustrates a statistical effect: as the number of sampled points grows, the flat zero region can shrink to a sharper piecewise-affine minimum. A finite-sample linear-programming gap remains polyhedral and has no classical second derivative inside its cells. Curvature is instead a population phenomenon. Peyré, Poon and Tron Peyré et al., 2026 prove local curvature and identifiability modulo the natural cost invariances for smooth positive marginals under a nondegeneracy condition. They also identify affine-map settings, including Gaussian or elliptical examples, as genuinely degenerate cases. Larger samples can reveal population curvature, but do not remove structural non-identifiability by themselves.

Proposition: Convex Dual-Gap Formulation of Inverse OT

Let $\widehat \P\in\mathbf U(a,b)$ be an observed coupling and let $\C_\theta$ depend affinely on $\theta\in\Theta$ , where $\Theta$ is convex. The condition that $\widehat \P$ is optimal for the cost $\C_\theta$ is equivalent to the existence of dual potentials $(f,g)$ such that

f_i+g_j\leq (\C_\theta)_{i,j} \qquad\text{and}\qquad \sum_{i,j}\widehat \P_{i,j} \big((\C_\theta)_{i,j}-f_i-g_j\big)=0.

(137)

Consequently, for a convex regularizer $R$ and $\lambda\geq0$ , the dual-gap fitting problem is the convex program

\min_{\theta\in\Theta,f,g} R(\theta) + \lambda \sum_{i,j}\widehat \P_{i,j} \big((\C_\theta)_{i,j}-f_i-g_j\big)

(138)

subject to $f_i+g_j\leq(\C_\theta)_{i,j}$ for all $i,j$ . It is nontrivial only if $\Theta$ imposes a cost normalization or otherwise excludes the zero-cost and marginal-gauge degeneracies.
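With $R=0$ and a polyhedral $\Theta$, problem (138) is a linear program in $(\theta,f,g)$. The sketch below is one possible instance under explicit assumptions: the basis costs are coordinate-wise squared differences of synthetic points, $\Theta$ is the probability simplex (which serves as the cost normalization), and the gauge is fixed by $f_0=0$. It assumes SciPy's `linprog` with the HiGHS backend.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment, linprog

rng = np.random.default_rng(0)
n, R = 8, 3
X, Y = rng.normal(size=(n, R)), rng.normal(size=(n, R))
# Hypothetical basis costs C^(r)_ij = (x_i[r] - y_j[r])^2, stacked as (R, n, n).
basis = np.stack([(X[:, None, r] - Y[None, :, r]) ** 2 for r in range(R)])

# Observed plan: exact equal-weight assignment for a ground-truth theta in the simplex.
theta_true = np.array([0.6, 0.3, 0.1])
rows, cols = linear_sum_assignment(np.tensordot(theta_true, basis, axes=1))
P_hat = np.zeros((n, n))
P_hat[rows, cols] = 1.0 / n
a, b = P_hat.sum(axis=1), P_hat.sum(axis=0)

# Variables z = (theta, f, g). Objective: sum_r theta_r <C^(r), P_hat> - <f,a> - <g,b>.
objective = np.concatenate([(basis * P_hat).sum(axis=(1, 2)), -a, -b])
# Dual feasibility f_i + g_j - sum_r theta_r C^(r)_ij <= 0, one row per pair (i, j).
A_ub = np.hstack([-basis.reshape(R, -1).T,
                  np.kron(np.eye(n), np.ones((n, 1))),
                  np.kron(np.ones((n, 1)), np.eye(n))])
# Normalization: theta >= 0 and sum(theta) = 1. Gauge fixing: f_0 = 0.
A_eq = np.concatenate([np.ones(R), np.zeros(2 * n)])[None, :]
bounds = [(0, None)] * R + [(0, 0)] + [(None, None)] * (2 * n - 1)
res = linprog(objective, A_ub=A_ub, b_ub=np.zeros(n * n),
              A_eq=A_eq, b_eq=[1.0], bounds=bounds, method="highs")
theta_hat = res.x[:R]

# Check: P_hat is optimal for the learned cost (its Kantorovich gap is zero).
C_hat = np.tensordot(theta_hat, basis, axes=1)
r2, c2 = linear_sum_assignment(C_hat)
gap_hat = np.sum(C_hat * P_hat) - C_hat[r2, c2].mean()
```

The optimal value `res.fun` is zero because the observed plan was generated inside the model. The recovered `theta_hat` need not equal `theta_true`, since a whole cone of normalized costs can rationalize the same sparse plan.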

The formulation avoids differentiating through a forward OT solver: it learns a normalized cost by making the observed plan nearly satisfy complementary slackness. In statistical settings, $\widehat \P$ is often only partially observed or noisy, so one adds sparsity, low-rank, smoothness or metric constraints to select a meaningful representative Dupuy et al., 2019; Andrade et al., 2024. For entropic OT, the optimality condition becomes smoother:

\widehat \P_{i,j} \approx a_i b_j \exp\!\left( \frac{f_i+g_j-(\C_\theta)_{i,j}}{\epsilon} \right),

(141)

which leads to likelihood-based or KL-based convex objectives when $\C_\theta$ is affine, and connects inverse OT with generalized Sinkhorn iterations and transport-regularized inverse problems Karlsson & Ringh, 2017; Ma et al., 2020.

Weak Optimal Transport¶

Weak OT relaxes the cost so that it depends on the conditional distribution of destinations rather than only on pointwise pairs. It is useful when a source point is allowed to choose a randomized response and the model only penalizes an aggregate of that response, such as its conditional mean.

Barycentric Projection of a Coupling¶

The first object to isolate is the map obtained by collapsing each conditional law to its barycenter.

The projected target $\bar\beta_\pi$ records the distribution of conditional means, not the full second marginal. Thus it is generally different from $\beta$ ; if $\pi=(\Id,T)_\sharp\alpha$ is induced by a map, then $\bar T_\pi=T$ and $\bar\beta_\pi=\beta$ . This projection is not an optimal map for an arbitrary coupling: a deterministic rotation of a radially symmetric source, for example, projects to the rotation itself, whereas the optimal map from the source to itself is the identity. The useful positive statement is attached to quadratic optimal plans, as in the tangent-space viewpoint on $\Wass_2$ developed by Ambrosio, Gigli and Savaré Ambrosio et al., 2006.

Remark: Barycentric Projection Appears Everywhere

Barycentric projection turns a conditional law into a mean. The definition of the barycentric projection of a coupling applies it to a disintegrated coupling, and the proposition on the barycentric projection of a quadratic optimal plan shows that quadratic optimal plans are stable under this collapse. The barycentric weak cost below inserts $\bar T_\pi$ into (145); martingale OT instead imposes $\int y\,\d\pi_x(y)=x$ in (153). Conditional Wasserstein distances use the same disintegration language fiber by fiber in (158). Later, the mean-shift and Gaussian-attention velocity (112) is another barycentric average, while SVGD replaces it by the RKHS steepest-descent average (73). Across these examples, the recurring move is to retain conditional structure while replacing a full conditional law by a tractable first or kernelized moment.

Weak Transport Costs¶

Weak transport costs use the same disintegration but allow the objective to depend on the whole conditional law, or on summaries such as the barycentric projection (142). The framework was introduced through general transport costs and weak transport inequalities, with existence, duality and optimality conditions developed on Polish spaces Gozlan et al., 2017; Backhoff Veraguas et al., 2019. For a weak cost $C:\Xx\times\mathcal P(\Yy)\to\RR\cup\{+\infty\}$ , the weak OT value is

\WOT_C(\alpha,\beta) \eqdef \inf_{\pi\in\Couplings(\alpha,\beta)} \int C(x,\pi_x)\d\alpha(x).

(145)

The classical Kantorovich problem is recovered when $C(x,\nu)=\int c(x,y)\d\nu(y)$ , because the objective then becomes $\int c(x,y)\d\pi(x,y)$ . The genuinely weak behavior starts when $C$ is nonlinear in $\nu$ .

Proposition: Weak Kantorovich Duality

Let $\Xx$ and $\Yy$ be compact metric spaces, and let $C:\Xx\times\mathcal P(\Yy)\to\RR\cup\{+\infty\}$ be proper, jointly lower semicontinuous, bounded from below, and convex in its second argument. Fix $\alpha\in\mathcal P(\Xx)$ and $\beta\in\mathcal P(\Yy)$ , and assume that $\WOT_C(\alpha,\beta)<+\infty$ . For $g\in C(\Yy)$ , define

g^C(x) \eqdef \inf_{\nu\in\mathcal P(\Yy)} \left\{C(x,\nu)-\int g(y)\d\nu(y)\right\}.

(146)

Then

\WOT_C(\alpha,\beta) = \sup_{g\in C(\Yy)} \left\{\int g^C(x)\d\alpha(x)+\int g(y)\d\beta(y)\right\}.

(147)

For $C(x,\nu)=\int c(x,y)\d\nu(y)$ , this is the usual Kantorovich dual.

The figure below separates the full conditional coupling from its barycentric projection, making explicit which part of each conditional law is retained by the weak cost.

Weak barycentric transport on a small disk-to-annulus coupling. The left panel shows the full conditional laws: each red source atom splits its mass among several blue target atoms, with segment thickness proportional to transported mass. The right panel collapses each conditional law $\pi_x$ to its barycenter $\bar T_\pi(x)=\int y\d\pi_x(y)$ , shown in violet. The barycentric weak cost only sees the red-to-violet displacement, and therefore ignores the conditional spread around each barycenter.

The interactive demo lets each source point split toward several targets. Increasing the split count or spread usually increases the full quadratic cost while the weak barycentric cost can remain much smaller.

Interactive panel. Use the spread and barycentric controls to compare full weak conditional laws with their barycentric projections.

The barycentric cost is the canonical example to keep in mind: admissibility still constrains the full conditional laws to have second marginal $\beta$ , but the objective only charges the displacement from $x$ to $\bar T_\pi(x)$ and ignores the conditional variance around this barycenter. This connects weak OT with martingale transport, Strassen-type convex-order constraints, barycentric projections and learning problems where conditional averages are meaningful objects.
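For a fixed discrete coupling, the statement that the barycentric cost ignores the conditional variance is an exact bias-variance identity. The sketch below assumes the quadratic barycentric weak cost $C(x,\nu)=\norm{x-\int y\,\d\nu(y)}^2$ and an arbitrary random coupling; it evaluates the objective of that coupling and does not solve the weak OT problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 12
X = rng.normal(size=(n, 2))
Y = 3.0 * rng.normal(size=(m, 2))
P = rng.random((n, m))
P /= P.sum()                       # an arbitrary coupling of its own two marginals
a = P.sum(axis=1)                  # source weights

# Barycentric projection: mean of each conditional law pi_{x_i} = P[i, :] / a_i.
T_bar = (P @ Y) / a[:, None]

full_cost = np.sum(P * ((X[:, None] - Y[None]) ** 2).sum(-1))          # sum_ij P_ij |x_i - y_j|^2
weak_cost = np.sum(a * ((X - T_bar) ** 2).sum(-1))                     # sum_i a_i |x_i - T_bar_i|^2
cond_var = np.sum(P * ((Y[None] - T_bar[:, None]) ** 2).sum(-1))       # conditional spread
```

The identity `full_cost == weak_cost + cond_var` holds up to rounding because the cross term vanishes when each conditional law is centered at its barycenter. Spreading the conditional laws raises `cond_var` and `full_cost` together while `weak_cost` only reacts to the displacement of the barycenters.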

Martingale Optimal Transport¶

Martingale OT is the extreme barycentric version of the weak viewpoint: a source point may split randomly, but the average destination must remain equal to the source point. This turns the barycentric projection from an object used in the cost into a hard constraint.

Definition: Martingale Couplings and Martingale OT

Let $\alpha,\beta\in\mathcal P_1(\RR^d)$ . A coupling $\pi\in\Couplings(\alpha,\beta)$ is a martingale coupling if, for the disintegration $\pi(\d x,\d y)=\pi_x(\d y)\alpha(\d x)$ ,

\int_{\RR^d}y\d\pi_x(y)=x \qquad\text{for $\alpha$-a.e. }x .

(153)

Equivalently, $\bar T_\pi=\Id$ in (142). The set of such couplings is

\Couplings_{\mathrm{mart}}(\alpha,\beta) \eqdef \left\{ \pi\in\Couplings(\alpha,\beta)\;:\; \bar T_\pi=\Id\quad\alpha\text{-a.e.} \right\}.

(154)

For a cost $c$ , the martingale OT value is $+\infty$ if the admissible set is empty, and otherwise is obtained by minimizing $\int c\d\pi$ over $\Couplings_{\mathrm{mart}}(\alpha,\beta)$ .

The terminology comes from probability: if $(X,Y)\sim\pi$ , then (153) is exactly $\mathbb E[Y|X]=X$ . Hence martingale OT is a Kantorovich problem with the usual two marginal constraints plus a barycentric constraint on the conditional laws. Equivalently, the barycentric projected coupling $(\Id,\bar T_\pi)_\sharp\alpha$ must be the diagonal coupling $(\Id,\Id)_\sharp\alpha$ . This is stronger than merely asking the projected target $(\bar T_\pi)_\sharp\alpha$ to equal $\alpha$ , since a nontrivial measure-preserving map could have the same projected marginal without satisfying $\bar T_\pi(x)=x$ pointwise. Martingale OT is central in robust finance, where one transports today’s prices to tomorrow’s prices without introducing drift, and has led to a rich martingale transport theory Beiglböck et al., 2013; Galichon et al., 2014; Dolinsky & Soner, 2014; Guo & Obłój, 2019.

Stochastic Orders¶

The admissibility of constrained couplings is governed by stochastic order. The basic principle is that inequalities tested against a class of functions are equivalent to the existence of couplings satisfying a pointwise or conditional constraint. Strassen’s theorem is the canonical result of this kind Strassen, 1965.

Convex Order and Martingale Feasibility¶

For martingale OT, the pointwise order constraint is replaced by a barycentric constraint on conditional laws. The corresponding order is the convex order. For $\alpha,\beta\in\Pp_1(\RR^d)$ ,

\alpha\preceq_{\mathrm{cx}}\beta \quad\Longleftrightarrow\quad \int\varphi\,\d\alpha\leq\int\varphi\,\d\beta \quad\text{for every convex }\varphi\text{ for which both integrals are defined}.

(156)

For finite-first-moment measures, it is enough to test continuous convex functions with at most linear growth. Testing affine functions gives equality of means, while the remaining convex tests say that $\beta$ is more spread out than $\alpha$ . Strassen’s martingale theorem says that this spread condition is exactly what is needed to realize $\beta$ from $\alpha$ by mean-preserving randomization.
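On the real line, convex order between discrete measures can be tested with finitely many functions. The sketch below relies on the classical one-dimensional characterization (an assumption not stated in the text above): $\alpha\preceq_{\mathrm{cx}}\beta$ holds if and only if the means agree and $\int(x-k)_+\d\alpha\leq\int(x-k)_+\d\beta$ for every $k$. For finitely supported measures the difference is piecewise affine in $k$, so it suffices to test $k$ at the atoms. The function name is hypothetical.

```python
import numpy as np

def convex_order_1d(x, a, y, b, tol=1e-12):
    """Test alpha <=_cx beta for alpha = sum_i a_i delta_{x_i}, beta = sum_j b_j delta_{y_j} on R."""
    # Affine test functions: the means must coincide.
    if abs(a @ x - b @ y) > tol:
        return False
    # Hinge test functions (. - k)_+ evaluated at every atom k of either measure.
    knots = np.concatenate([x, y])
    call_alpha = (a[None, :] * np.maximum(x[None, :] - knots[:, None], 0.0)).sum(axis=1)
    call_beta = (b[None, :] * np.maximum(y[None, :] - knots[:, None], 0.0)).sum(axis=1)
    return bool(np.all(call_alpha <= call_beta + tol))

# A Dirac mass is dominated by any centered splitting of it, but not conversely.
x0, a0 = np.array([0.0]), np.array([1.0])
y0, b0 = np.array([-1.0, 1.0]), np.array([0.5, 0.5])
forward = convex_order_1d(x0, a0, y0, b0)     # delta_0 versus its symmetric splitting
backward = convex_order_1d(y0, b0, x0, a0)    # the reverse comparison
shifted = convex_order_1d(x0 + 0.1, a0, y0, b0)   # unequal means
```

In this example `forward` is `True`, while `backward` and `shifted` are `False`: the splitting is more spread out than the Dirac mass, and unequal means already fail the affine tests.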

The same theorem gives an exact geometric description of the barycentric weak cost: weak OT transports to the closest measure below $\beta$ in convex order.

Strassen’s martingale theorem explains why convex order is the right admissibility notion for martingale OT: it is exactly the feasibility condition for the martingale constraint. If $\alpha\not\preceq_{\mathrm{cx}}\beta$ , then $\Couplings_{\mathrm{mart}}(\alpha,\beta)=\emptyset$ and the martingale OT value is $+\infty$ , independently of the cost. If $\alpha\preceq_{\mathrm{cx}}\beta$ , then the admissible set is nonempty and the cost selects, among all mean-preserving splittings of each source point, the martingale coupling best adapted to the application. This is the probabilistic meaning of the barycentric constraint: mass may branch, but it cannot drift on average.

For Gaussian measures with the same mean, convex order reduces to the Loewner order on covariance matrices:

\mathcal N(m,\Sigma_0)\preceq_{\mathrm{cx}}\mathcal N(m,\Sigma_1) \quad\Longleftrightarrow\quad \Sigma_1-\Sigma_0\succeq0 .

(164)

Indeed, if $\Sigma_1-\Sigma_0\succeq0$ , then $\mathcal N(m,\Sigma_1)$ is obtained from $\mathcal N(m,\Sigma_0)$ by adding independent centered Gaussian noise. Conversely, testing the convex quadratic functions $x\mapsto\langle u,x\rangle^2$ gives the Loewner inequality.
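The Gaussian criterion (164) reduces to an eigenvalue test. The sketch below uses illustrative covariance matrices and a hypothetical helper `loewner_leq`. It also extracts, when the order fails, a direction $u$ whose convex quadratic test function $x\mapsto\langle u,x\rangle^2$ certifies the failure, using that its integral against a centered Gaussian with covariance $\Sigma$ is $u^\top\Sigma u$.

```python
import numpy as np

def loewner_leq(S0, S1, tol=1e-10):
    """True when S1 - S0 is positive semidefinite."""
    return bool(np.linalg.eigvalsh(S1 - S0).min() >= -tol)

S0 = np.array([[1.0, 0.2], [0.2, 0.5]])
S1 = S0 + np.array([[0.5, 0.1], [0.1, 0.3]])   # S0 plus a positive definite noise covariance
S2 = np.array([[2.0, 0.0], [0.0, 0.1]])        # wider along e_1 but narrower along e_2

ordered = loewner_leq(S0, S1)
not_ordered = loewner_leq(S0, S2)

# Certificate of failure: eigenvector of the most negative eigenvalue of S2 - S0.
eigvals, eigvecs = np.linalg.eigh(S2 - S0)
u = eigvecs[:, 0]
# Integral of <u,x>^2 under N(0,S0) minus the same integral under N(0,S2).
certificate = u @ S0 @ u - u @ S2 @ u
```

A positive `certificate` means that the convex function $x\mapsto\langle u,x\rangle^2$ has a larger integral under $\mathcal N(0,\Sigma_0)$ than under $\mathcal N(0,\Sigma_2)$, which rules out convex order.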

The figure below gives a discrete non-Gaussian counterpart: centered conditional kernels provide a feasible martingale plan, while optimizing the transport cost selects a much sparser plan with the same marginals and barycentric constraint.

A discrete one-dimensional martingale OT example. The source $\alpha$ is a red Gaussian mixture on a grid. A first feasible plan is generated by centered kernels $K_i(y_j)=\kappa_i(y_j-x_i)$ , whose discrete barycenter is $x_i$ . Keeping the same marginals, the third panel solves the martingale OT linear program with row, column, and martingale constraints $\sum_j(y_j-x_i)P_{ij}=0$ . The optimized plan is much sparser, while both plans have the identity barycentric projection.
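The construction described in this caption can be reproduced at small scale. The sketch below is not the figure's code: the grids, weights, kernel widths and the cost $c(x,y)=|x-y|$ are assumptions, and the solver is SciPy's `linprog` with HiGHS. It builds a feasible plan from symmetric truncated kernels, defines the target marginal from it, and then solves the martingale OT linear program with the same marginals.

```python
import numpy as np
from scipy.optimize import linprog

x = np.linspace(-1.0, 1.0, 5)              # source atoms, all located on the target grid
y = np.linspace(-2.0, 2.0, 17)             # target grid with spacing 0.25
a = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
n, m = len(x), len(y)
D = y[None, :] - x[:, None]                # D[i, j] = y_j - x_i

# Feasible martingale plan: symmetric kernels centered at x_i, truncated to |y - x_i| <= 1.
widths = np.array([0.3, 0.5, 0.8, 0.5, 0.3])
K = np.exp(-0.5 * (D / widths[:, None]) ** 2)
K[np.abs(D) > 1.0 + 1e-9] = 0.0
K /= K.sum(axis=1, keepdims=True)
P0 = a[:, None] * K
b = P0.sum(axis=0)                         # target marginal chosen so that P0 is admissible

# Martingale OT linear program on the row-major flattened plan.
A_rows = np.kron(np.eye(n), np.ones((1, m)))
A_cols = np.kron(np.ones((1, n)), np.eye(m))
A_mart = A_rows * D.ravel()[None, :]       # row i encodes sum_j (y_j - x_i) P_ij = 0
res = linprog(np.abs(D).ravel(),
              A_eq=np.vstack([A_rows, A_cols, A_mart]),
              b_eq=np.concatenate([a, b, np.zeros(n)]),
              bounds=(0, None), method="highs")
P_opt = res.x.reshape(n, m)

cost_kernel = np.sum(np.abs(D) * P0)
cost_opt = np.sum(np.abs(D) * P_opt)
# For any martingale coupling the quadratic cost equals sum_j b_j y_j^2 - sum_i a_i x_i^2.
quad_kernel, quad_opt = np.sum(D ** 2 * P0), np.sum(D ** 2 * P_opt)
```

The cost $|x-y|$ is used because the quadratic cost cannot distinguish martingale couplings: `quad_kernel` and `quad_opt` agree up to solver tolerance, whereas `cost_opt` is at most `cost_kernel`.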

Interactive panel. Change the space-varying kernel width and source skew to see how centered conditional kernels create a more spread target while preserving barycentric centering.

References¶

Agueh, M., & Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2), 904–924.
Carlier, G., & Ekeland, I. (2010). Matching for teams. Economic Theory, 42(2), 397–418.
Anderes, E., Borgwardt, S., & Miller, J. (2016). Discrete Wasserstein barycenters: optimal transport for discrete data. Mathematical Methods of Operations Research, 84(2), 389–409.
Álvarez Esteban, P. C., del Barrio, E., Cuesta-Albertos, J., & Matrán, C. (2016). A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2), 744–762. 10.1016/j.jmaa.2016.04.045
Le Gouic, T., & Loubes, J.-M. (2016). Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, 168, 901–917.
Bhatia, R., Jain, T., & Lim, Y. (2019). On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2), 165–191. 10.1016/j.exmath.2018.01.002
Rüschendorf, L., & Uckelmann, L. (2002). On the n-coupling problem. Journal of Multivariate Analysis, 81(2), 242–258.
Herman, G. (1980). Image Reconstruction from Projections: the Fundamentals of Computerized Tomography. Academic Press.
Bonneel, N., Rabin, J., Peyré, G., & Pfister, H. (2015). Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1), 22–45.
Cramér, H., & Wold, H. (1936). Some Theorems on Distribution Functions. Journal of the London Mathematical Society, s1-11(4), 290–294. 10.1112/jlms/s1-11.4.290
Cuturi, M., & Doucet, A. (2014). Fast computation of Wasserstein barycenters. Proceedings of the 31st International Conference on Machine Learning, 32, 685–693. https://proceedings.mlr.press/v32/cuturi14.html
Cuturi, M., & Peyré, G. (2016). A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1), 320–343.
Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., & Peyré, G. (2015). Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2), A1111–A1138.
Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., & Guibas, L. (2015). Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4), 66:1-66:11.
Bigot, J., & Klein, T. (2018). Characterization of barycenters in the Wasserstein space by averaging optimal transport maps. ESAIM: Probability and Statistics, 22, 35–57. 10.1051/ps/2017020