Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Generalized OT Problems

The second family changes the optimization problem rather than only the ground distance. Barycenters average several measures, multi-marginal OT couples many measures at once, inverse OT learns the cost from observed transport, and weak OT allows randomized conditional responses.

These formulations remain close to Kantorovich linear programming, but the object being optimized is richer than a single two-marginal coupling.

OT Barycenters

Barycenters ask how to average probability measures rather than points. This section explains the variational definition, the special closed forms in one dimension and for Gaussians, and the entropic algorithms used in practice.

Frechet Means

For discrete input histograms {bs}s=1S\{b_s\}_{s=1}^S, with bsΔnsb_s\in\simplex_{n_s}, and weights λΔS\lambda\in\simplex_S, a Wasserstein barycenter can be computed by minimizing

minaΔns=1SλsLCs(a,bs),\min_{a\in\simplex_n} \sum_{s=1}^S \lambda_s\,\mathcal L_{C_s}(a,b_s),

where the cost matrices CsRn×nsC_s\in\RR^{n\times n_s} are prescribed. This barycenter problem was introduced by Carlier and coauthors, following earlier matching ideas, and is the measure-valued analogue of a Frechet mean Agueh & Carlier, 2011Carlier & Ekeland, 2010.

Given input measures (βs)s(\beta_s)_s on a space X\X, the barycenter problem is

minαM+1(X)s=1SλsLc(α,βs).\min_{\alpha\in\mathcal M_+^1(\X)} \sum_{s=1}^S \lambda_s\,\mathcal L_c(\alpha,\beta_s).

For X=Rd\X=\RR^d and c(x,y)=xy2c(x,y)=\norm{x-y}^2, if one input measure has a density, then the barycenter is unique Agueh & Carlier, 2011. Discrete existence, consistency, and fixed-point constructions are studied in Anderes et al., 2016Esteban et al., 2016Le Gouic & Loubes, 2016.

<IPython.core.display.Image object>

Wasserstein barycenter grids for four corner measures. The left panel uses the one-dimensional formula Qu,v=i,jλij(u,v)QijQ_{u,v}=\sum_{i,j}\lambda_{ij}(u,v)Q_{ij} for one Gaussian law and three asymmetric two-Gaussian mixtures, and displays densities reconstructed from the averaged quantiles. The right panel computes entropic Wasserstein barycenters on a common pixel grid for the cat, two-disk, cross and clover silhouettes, using the normalized squared ground cost, ϵ=4104\epsilon=4\cdot10^{-4} and a Sinkhorn tolerance of 51085\cdot10^{-8}. The barycenters are rendered as density images with values clamped at their 95%95\% quantile rather than by threshold contours. Colors interpolate between the four corners and encode the same bilinear weights in both panels.

The interactive demo below keeps the exact one-dimensional formula visible: the two coordinates set bilinear weights on the four corner laws, the middle panel averages their quantile functions, and the right panel reconstructs the resulting barycenter density.

Interactive panel. Use the barycentric coordinate controls to move through the four input laws and compare quantile and entropic barycenter constructions.

Proof

Let (α0,β0)(\alpha_0,\beta_0) and (α1,β1)(\alpha_1,\beta_1) be two pairs of probability measures and let t[0,1]t\in[0,1]. For η>0\eta>0, choose couplings πiΠ(αi,βi)\pi_i\in\Couplings(\alpha_i,\beta_i) such that

cdπiLc(αi,βi)+η(i=0,1).\int c\d\pi_i \leq \mathcal L_c(\alpha_i,\beta_i)+\eta \qquad (i=0,1).

Then πt=(1t)π0+tπ1\pi_t=(1-t)\pi_0+t\pi_1 is a coupling between (1t)α0+tα1(1-t)\alpha_0+t\alpha_1 and (1t)β0+tβ1(1-t)\beta_0+t\beta_1. Hence

Lc((1t)α0+tα1,(1t)β0+tβ1)(1t)Lc(α0,β0)+tLc(α1,β1)+η.\mathcal L_c((1-t)\alpha_0+t\alpha_1,(1-t)\beta_0+t\beta_1) \leq (1-t)\mathcal L_c(\alpha_0,\beta_0) + t\mathcal L_c(\alpha_1,\beta_1) + \eta .

Letting η0\eta\to0 gives the claim.

Even when all input measures are discrete, the support of a barycenter is not known a priori. The multi-marginal formulation below shows that a discrete barycenter can be supported on all weighted averages of one support point from each input. This gives at most sns\prod_s n_s candidate points if the ss-th input has nsn_s atoms, which is prohibitive when the number of inputs is large. A common numerical compromise is therefore to prescribe a smaller support for the barycenter and solve a fixed-support problem.

One-Dimensional Case

On the line, barycenters become linear after the quantile change of variables. This gives the rare case where the barycenter is explicit rather than the solution of a high-dimensional optimization problem.

Proof

The one-dimensional formula for W2\Wass_2 gives

sλsW22(α,βs)=01sλsFα1(r)Fβs1(r)2dr.\sum_s\lambda_s\Wass_2^2(\alpha,\beta_s) = \int_0^1 \sum_s\lambda_s \abs{F_\alpha^{-1}(r)-F_{\beta_s}^{-1}(r)}^2 \d r.

The minimization decouples pointwise in rr. For each fixed rr, the minimizer of zsλszFβs1(r)2z\mapsto\sum_s\lambda_s|z-F_{\beta_s}^{-1}(r)|^2 is the weighted average sλsFβs1(r)\sum_s\lambda_sF_{\beta_s}^{-1}(r). This function is nondecreasing because it is a positive weighted sum of nondecreasing quantile functions, hence it is a valid quantile function.

Gaussian Case

Gaussian barycenters show that the same separation as in the Gaussian Wasserstein formula persists: means average linearly, while covariances average according to the Bures--Wasserstein geometry.

<IPython.core.display.Image object>

Bures--Wasserstein barycenters of centered Gaussian covariance matrices. Each panel shows a 5×55\times5 grid of barycenter ellipses for four corner covariances, without separate input panels: the corner ellipses are the four input covariances themselves. The right grid uses more anisotropic inputs, making the nonlinear rotation and scaling of covariance barycenters more visible.

The interactive Gaussian demo compares the Bures covariance barycenter with a plain Euclidean covariance average under the same weights. The difference is most visible for rotated, anisotropic covariances: the Euclidean average blends matrix entries, whereas the Bures barycenter follows the geometry induced by quadratic Gaussian transport.

Interactive panel. Use the corner-covariance and interpolation controls to see how Gaussian barycenter ellipses interpolate covariance geometry.

Sinkhorn for Barycenters

A key difference with the regularized two-marginal OT problem is that there is no canonical reference measure αβ\alpha\otimes\beta, because the barycenter α\alpha is unknown. To reduce complexity, one usually fixes a candidate support for the barycenter and solves the discrete problem (1); this introduces a discretization error but keeps the number of unknowns manageable.

One can then use entropic smoothing and approximate (1) by

minaΔns=1SλsLCsϵ(a,bs)\min_{a\in\simplex_n} \sum_{s=1}^S \lambda_s\mathcal L_{C_s}^{\epsilon}(a,b_s)

for some ϵ>0\epsilon>0. This is a smooth convex minimization problem, which can be tackled using gradient descent Cuturi & Doucet, 2014. An alternative is to use a descent method, typically quasi-Newton, on the semi-dual Cuturi & Peyré, 2016; this is useful when adding extra regularization on the barycenter, for instance to impose smoothness.

A simple but effective approach rewrites (10) as a weighted KL projection problem Benamou et al., 2015:

min(Ps)sϵsλsKL(PsKs)\min_{(P_s)_s} \epsilon\sum_s\lambda_s \operatorname{KL}(P_s\mid K_s)

subject to

Ps1n=bsfor all s,P11n1==PS1nS.P_s^\top\mathbf 1_n=b_s \quad\text{for all }s, \qquad P_1\mathbf 1_{n_1} = \cdots = P_S\mathbf 1_{n_S}.

Here Ks:=eCs/ϵK_s\eqdef e^{-C_s/\epsilon}. The barycenter aa is implicitly encoded in the common row marginal

a=P11n1==PS1nS.a=P_1\mathbf 1_{n_1}=\cdots=P_S\mathbf 1_{n_S}.

The optimal couplings have scaling form

Ps=diag(us)Ksdiag(vs),P_s=\diag(u_s)K_s\diag(v_s),

and the generalized Sinkhorn iterations are

vsbsKsus,as(Ksvs)λs,usaKsvs.v_s\leftarrow\frac{b_s}{K_s^\top u_s}, \qquad a\leftarrow\prod_s(K_s v_s)^{\lambda_s}, \qquad u_s\leftarrow\frac{a}{K_s v_s}.

The geometric mean enforces the fact that all couplings share the same barycenter marginal.

Proof

Introduce Lagrange multipliers in (11):

min(Ps)s,amax(fs,gs)ssλs(ϵKL(PsKs)+aPs1ns,fs+bsPs1n,gs).\min_{(P_s)_s,a} \max_{(f_s,g_s)_s} \sum_s\lambda_s \left( \epsilon\operatorname{KL}(P_s\mid K_s) + \dotp{a-P_s\mathbf 1_{n_s}}{f_s} + \dotp{b_s-P_s^\top\mathbf 1_n}{g_s} \right).

Strong duality allows one to exchange the minimum and maximum. Minimizing with respect to aa gives the constraint sλsfs=0\sum_s\lambda_s f_s=0, and minimizing with respect to PsP_s gives the Legendre transform of KL(Ks)\operatorname{KL}(\cdot\mid K_s):

max(fs,gs)ssλs[gs,bsϵKL(fsgsϵ|Ks)],sλsfs=0.\max_{(f_s,g_s)_s} \sum_s\lambda_s \left[ \dotp{g_s}{b_s} - \epsilon \operatorname{KL}^* \left(\frac{f_s\oplus g_s}{\epsilon}\middle|K_s\right) \right], \qquad \sum_s\lambda_s f_s=0.

The separable conjugate is

KL(UK)=i,jKi,j(eUi,j1),\operatorname{KL}^*(U\mid K) = \sum_{i,j}K_{i,j}(e^{U_{i,j}}-1),

because for k>0k>0,

supr0ur(rlog(r/k)r+k)=k(eu1).\sup_{r\geq0} ur-\big(r\log(r/k)-r+k\big) = k(e^u-1).

Dropping constants independent of the potentials gives the displayed dual. Coordinate maximization in gsg_s gives the vsv_s update; block maximization in all (fs)s(f_s)_s gives the common marginal update and then the usu_s update.

Classical applications include two-dimensional image interpolation, three-dimensional shape interpolation, and barycenters on surfaces where the ground cost is the square of the geodesic distance Solomon et al., 2015.

Multimarginal OT

Multi-marginal OT couples more than two measures at once. It is the natural language for barycenters, matching with teams and several-body costs, but its tensor dimension is the main computational obstacle.

Definition and Basic Structure

The multi-marginal formulation replaces a coupling between two measures by a joint distribution with several prescribed marginals. Given measures (αs)s=1S(\alpha_s)_{s=1}^S on spaces (Xs)s=1S(\X_s)_{s=1}^S and a cost c:X1××XSRc:\X_1\times\cdots\times\X_S\to\RR, the problem reads

infπΠ(α1,,αS)X1××XSc(x1,,xS)dπ(x1,,xS),\inf_{\pi\in\Couplings(\alpha_1,\ldots,\alpha_S)} \int_{\X_1\times\cdots\times\X_S} c(x_1,\ldots,x_S)\d\pi(x_1,\ldots,x_S),

where Π(α1,,αS)\Couplings(\alpha_1,\ldots,\alpha_S) is the set of probability measures whose ss-th marginal is αs\alpha_s. This is still a linear program in the discrete setting, but its ambient tensor has size sns\prod_s n_s.

Multi-Marginal Formulation of Barycenters

Wasserstein barycenters are the central example. For the squared Euclidean cost, one can introduce a latent barycenter point and eliminate it explicitly, leading to the multi-marginal cost

cbar(x1,,xS)=minxRds=1Sλsxxs2.c_{\mathrm{bar}}(x_1,\ldots,x_S) = \min_{x\in\RR^d} \sum_{s=1}^S\lambda_s\norm{x-x_s}^2.
Proof

For any candidate barycenter α\alpha and couplings πsΠ(α,βs)\pi_s\in\Couplings(\alpha,\beta_s), glue the couplings along their common α\alpha marginal to obtain a joint law of (X,Y1,,YS)(X,Y_1,\ldots,Y_S). Conditioning on (Ys)s(Y_s)_s and minimizing over XX gives

sλsEXYs2EminxsλsxYs2=Ecbar(Y1,,YS).\sum_s\lambda_s\mathbb E\norm{X-Y_s}^2 \geq \mathbb E \min_x \sum_s\lambda_s\norm{x-Y_s}^2 = \mathbb E c_{\mathrm{bar}}(Y_1,\ldots,Y_S).

Taking the infimum over the couplings gives that the barycenter value is at least the multi-marginal value. Conversely, from an optimal multi-marginal plan π\pi^\star, set X=B(Y1,,YS)X=B(Y_1,\ldots,Y_S). The couplings between XX and each YsY_s are feasible for the barycenter problem and attain exactly the multi-marginal cost.

If α\alpha^\star is any barycenter, choose optimal couplings between α\alpha^\star and each βs\beta_s and glue them along the common α\alpha^\star marginal. Since the barycenter and multi-marginal values are equal, the conditional minimization inequality above must be an equality. Thus X=B(Y1,,YS)X=B(Y_1,\ldots,Y_S) almost surely for the induced optimal multi-marginal plan.

Proof

Let the input Gaussians have means msm_s and covariances Σs\Sigma_s. For any candidate barycenter α\alpha with mean mm and covariance Σ\Sigma, Gelbrich’s inequality gives

W22(α,βs)mms2+B(Σ,Σs)2,\Wass_2^2(\alpha,\beta_s) \geq \norm{m-m_s}^2+\mathcal B(\Sigma,\Sigma_s)^2,

with equality for the Gaussian law with mean mm and covariance Σ\Sigma. Therefore the barycenter objective is bounded below by a function depending only on (m,Σ)(m,\Sigma), and this lower bound is attained by the Gaussian measure with the minimizing mean and covariance. For discrete inputs, any multi-marginal optimizer is supported on the finite product of the input supports, and BB maps this product to at most sns\prod_s n_s points.

Entropic Regularization of Multi-Marginal OT

As in the two-marginal case, adding an entropic penalty with respect to the product measure α1αS\alpha_1\otimes\cdots\otimes\alpha_S leads to scaling algorithms:

infπΠ(α1,,αS)cdπ+ϵKL(πα1αS).\inf_{\pi\in\Couplings(\alpha_1,\ldots,\alpha_S)} \int c\d\pi + \epsilon\operatorname{KL} (\pi\mid\alpha_1\otimes\cdots\otimes\alpha_S).

The optimizer has the generalized Gibbs form

dπ(x1,,xS)=exp ⁣(sfs(xs)c(x1,,xS)ϵ)sdαs(xs),\d\pi^\star(x_1,\ldots,x_S) = \exp\!\left( \frac{\sum_s f_s(x_s)-c(x_1,\ldots,x_S)}{\epsilon} \right) \prod_s\d\alpha_s(x_s),

and generalized Sinkhorn iterations alternately update one potential fsf_s so that the ss-th marginal is correct. The bottleneck is the tensor size sns\prod_s n_s in the discrete case. Practical barycenter solvers exploit separability of the cost, low-rank structure, convolutional kernels, or a fixed barycenter support.

Metric Learning and Inverse OT

This section points to inverse problems where the ground cost itself is learned. Such problems are typically bilevel and non-convex, but OT provides useful gradients with respect to the cost.

Metric Learning and Derivatives of OT

OT is convex with respect to the measure and concave with respect to the cost. Ground-metric learning was explicitly studied in Cuturi & Avis, 2014, and it connects to the broader metric-learning literature Kulis, 2012Bellet et al., 2015.

Proof

The value is the minimum of affine functions of CC,

LC(a,b)=minPU(a,b)C,P.\mathcal L_C(a,b) = \min_{P\in\mathbf U(a,b)} \dotp{C}{P}.

The envelope theorem, or equivalently Danskin’s theorem, states that the subdifferential with respect to CC is the convex hull of the optimal couplings. If the optimizer is unique, this subdifferential is the singleton {P(C)}\{P^\star(C)\}, hence the value is differentiable with the displayed gradient.

Thus, if the cost is parameterized as CθC_\theta, gradients of losses involving OT values are obtained by backpropagating through the inner product P(Cθ),θCθ\dotp{P^\star(C_\theta)}{\partial_\theta C_\theta}. The difficulty is not differentiating a solved OT problem, but learning a cost for which the resulting matching has the desired semantic behavior; this is a bilevel and usually non-convex optimization problem.

<IPython.core.display.Image object>

Changing the ground metric changes the optimal coupling. The same red and blue empirical measures are matched with cA(x,y)=(xy)A(xy)c_A(x,y)=(x-y)^\top A(x-y) for the Euclidean metric and two increasingly anisotropic Mahalanobis metrics. The small gray ellipse shows the unit ball of the metric: directions in which the ellipse is elongated are cheaper, and this deforms the transport segments selected by the OT plan.

The interactive demo lets the anisotropy and orientation of the Mahalanobis cost move. The transport plan is recomputed exactly for the displayed particles, so the segments show how the learned cost changes the matching.

Interactive panel. Use the metric and deformation controls to see how learning the ground cost changes the apparent transport geometry.

Inverse Optimal Transport

Inverse OT asks for a ground cost that explains observed matchings or flows as optimal transport plans. In its most direct form, one observes a plan π^\widehat\pi with marginals (α,β)(\alpha,\beta) and seeks a cost cc such that π^\widehat\pi is optimal for

infπΠ(α,β)c(x,y)dπ(x,y).\inf_{\pi\in\Couplings(\alpha,\beta)} \int c(x,y)\d\pi(x,y).

This is ill-posed without structure: adding potentials u(x)+v(y)u(x)+v(y) to a cost does not change the set of optimal couplings, and many costs can rationalize the same sparse plan.

A useful statistical methodology is to measure the suboptimality of the observed plan through a Fenchel--Young loss. Write the score as s=cs=-c and define the convex regularized prediction value

Gϵ(s)=supπΠ(α,β)sdπϵKL(παβ).G_\epsilon(s) = \sup_{\pi\in\Couplings(\alpha,\beta)} \int s\d\pi - \epsilon\operatorname{KL}(\pi\mid\alpha\otimes\beta).

The Fenchel--Young loss

Lϵ(c;π^)=Gϵ(c)+Gϵ(π^)+cdπ^\mathcal L_\epsilon(c;\widehat\pi) = G_\epsilon(-c) + G_\epsilon^*(\widehat\pi) + \int c\d\widehat\pi

is nonnegative by Fenchel’s inequality and vanishes exactly when π^Gϵ(c)\widehat\pi\in\partial G_\epsilon(-c), i.e. when π^\widehat\pi satisfies the regularized optimality conditions for cc. Entropic regularization is important here because it makes the forward map smoother and provides gradients with respect to cc, at the price of a statistical bias Andrade et al., 2025Peyré et al., 2026.

In the discrete unregularized case, this loss reduces to the optimality gap of the observed coupling. For P^U(a,b)\widehat P\in\mathbf U(a,b) and a cost matrix CC, denote it by

LiOT(C;P^)=C,P^minPU(a,b)C,P.\mathcal L_{\mathrm{iOT}}(C;\widehat P) = \dotp{C}{\widehat P} - \min_{P\in\mathbf U(a,b)}\dotp{C}{P}.

This inverse-OT gap loss is nonnegative and vanishes exactly when P^\widehat P is optimal for CC.

In practice, one restricts the cost to a finite-dimensional model class, often affine:

Cθ=r=1RθrC(r),θΘ,C_\theta=\sum_{r=1}^R\theta_r C^{(r)}, \qquad \theta\in\Theta,

where Θ\Theta is convex and the matrices C(r)C^{(r)} encode features, graph distances or a Mahalanobis parameterization. This viewpoint appears in low-rank and sparse inverse OT models Dupuy et al., 2019Andrade et al., 2024 and in convex formulations for learning OT costs from observed plans Ma et al., 2020Peyré et al., 2026.

A minimal finite-dimensional model is obtained by learning a bilinear cost on Rd\RR^d,

cA(x,y)=Ax,y,ARd×d.c_A(x,y)=\dotp{Ax}{y}, \qquad A\in\RR^{d\times d}.

For empirical measures α=1niδxi\alpha=\frac1n\sum_i\delta_{x_i} and β=1njδyj\beta=\frac1n\sum_j\delta_{y_j}, this gives the cost matrix

C(A)i,j=Axi,yj,C(A)_{i,j}=\dotp{Ax_i}{y_j},

so both maps AC(A)A\mapsto C(A) and AcAA\mapsto c_A are linear. Inverse OT within this model asks which matrix AA makes an observed matching or coupling look optimal; learning the cost is thus reduced to estimating a linear parameter.

For a fixed matrix AA, the forward prediction is the optimal face

PA:=arg minPU(1n/n,1n/n)C(A),P.\mathcal P_A\eqdef \uargmin{P\in\CouplingsD(\ones_n/n,\ones_n/n)} \dotp{C(A)}{P}.

When this face is a singleton, write its element as PAP_A; otherwise PAP_A denotes a deterministic tie-broken selection. Although AC(A)A\mapsto C(A) is linear, the solution correspondence APAA\mapsto\mathcal P_A is polyhedral: changing AA changes the direction in which the transport polytope is probed, and a tie-broken selection is constant on normal-cone cells. The figure below illustrates this correspondence on the OT4ML point clouds. The construction follows the visual idea of the Python Optimal Transport logo Flamary et al., 2021: red source atoms, blue target atoms and straight segments show the selected optimal bijection. The rank-one matrices A=e1e1A=-e_1e_1^\top and A=e2e2A=-e_2e_2^\top only score horizontal or vertical correlations. The matrix A=IA=-I gives the usual quadratic W2\Wass_2 assignment, up to the marginal-only terms discussed below, while A=+IA=+I reverses the correlation and produces an anti-W2\Wass_2 matching.

This elementary model already contains the quadratic Wasserstein assignment. Adding to a cost matrix a term depending only on xix_i or only on yjy_j shifts all feasible couplings by the same constant, and therefore does not change the optimizer. Since

xy2=x2+y22x,y,\norm{x-y}^2=\norm{x}^2+\norm{y}^2-2\dotp{x}{y},

the usual quadratic Wasserstein assignment has the same optimizer as the bilinear cost with A=IA_\star=-I, up to these marginal-only terms and an irrelevant positive factor. The inverse problem goes in the opposite direction: after observing a coupling, one asks which matrices AA could have generated it. The next figure generates an observed coupling P^\widehat P from this cost on two empirical mixtures of Gaussians, and then evaluates LiOT(C(At);P^)\mathcal L_{\mathrm{iOT}}(C(A_t);\widehat P) along the anisotropic path

At=diag(1+t,1t),1t1,A_t=-\diag(1+t,1-t), \qquad -1\leq t\leq 1,

so that t=0t=0 recovers the matrix that generated the observed coupling. Equivalently, with equal weights, P^U(1n/n,1n/n)=Bn/n\widehat P\in\CouplingsD(\ones_n/n,\ones_n/n)=\mathcal B_n/n, and the plotted loss is the Kantorovich gap

LiOT(C(At);P^)=C(At),P^minPU(1n/n,1n/n)C(At),P,C(At)i,j=Atxi,yj.\mathcal L_{\mathrm{iOT}}(C(A_t);\widehat P) = \dotp{C(A_t)}{\widehat P} - \min_{P\in\CouplingsD(\ones_n/n,\ones_n/n)} \dotp{C(A_t)}{P}, \qquad C(A_t)_{i,j}=\dotp{A_t x_i}{y_j}.

Because tC(At)t\mapsto C(A_t) is affine and the Kantorovich value is a minimum of affine functions over the fixed polytope U(1n/n,1n/n)\CouplingsD(\ones_n/n,\ones_n/n), this one-dimensional gap is convex and piecewise affine. Its zero set can contain an interval for a small sample, reflecting the fact that the same observed coupling remains optimal for a cone of nearby costs.

<IPython.core.display.Image object>

Inverse-OT gap loss for a bilinear cost. Panel (a): two empirical mixtures of two Gaussians are matched with the cost cA(x,y)=Ax,yc_{A_\star}(x,y)=\dotp{A_\star x}{y} for A=IA_\star=-I, which gives the same optimizer as the quadratic W2\Wass_2 cost; red and blue level sets display the two sampling densities. Panels (b,c): the unregularized Fenchel--Young Kantorovich gap LiOT(C(At);P^)\mathcal L_{\mathrm{iOT}}(C(A_t);\widehat P) along At=diag(1+t,1t)A_t=-\diag(1+t,1-t) for n=10n=10 and n=100n=100, using the same vertical scale. The red dot marks the generating parameter t=0t=0; the curves are convex and piecewise affine.

The comparison between n=10n=10 and n=100n=100 illustrates an important statistical effect: as the number of sampled points grows, the flat region of the empirical gap typically shrinks and the loss develops more visible curvature around the generating parameter. This anticipates the population theory of Peyré, Poon and Tron Peyré et al., 2026: in the limit n+n\to+\infty, when the limiting Monge map itself has nondegenerate curvature as the cost parameter varies, the iOT loss identifies the cost robustly, up to the usual marginal-only gauge freedoms. In that regime, minimizing the gap is not only a certificate of optimality of the observed transport, but also a stable way to recover the underlying cost.

Proof

For a fixed cost CθC_\theta, Kantorovich duality gives

minPU(a,b)Cθ,P=maxfi+gj(Cθ)i,jf,a+g,b.\min_{P\in\mathbf U(a,b)} \dotp{C_\theta}{P} = \max_{f_i+g_j\leq(C_\theta)_{i,j}} \dotp{f}{a}+\dotp{g}{b}.

Since P^\widehat P has marginals (a,b)(a,b), every dual feasible pair satisfies

Cθ,P^f,ag,b=i,jP^i,j((Cθ)i,jfigj)0.\dotp{C_\theta}{\widehat P} - \dotp{f}{a} - \dotp{g}{b} = \sum_{i,j}\widehat P_{i,j} \big((C_\theta)_{i,j}-f_i-g_j\big) \geq0.

This nonnegative quantity is exactly the primal-dual gap of P^\widehat P. It vanishes if and only if P^\widehat P reaches the dual value and is therefore optimal. If CθC_\theta is affine and Θ\Theta and RR are convex, the constraints and objective in (42) are convex.

The formulation is useful because it avoids differentiating through a forward OT solver: it learns a cost by making the observed plan satisfy complementary slackness. In statistical settings, P^\widehat P is only partially observed or noisy, so one adds sparsity, low-rank, smoothness or metric constraints to select a meaningful cost. For entropic OT, the optimality condition becomes smoother:

P^i,jaibjexp ⁣(fi+gj(Cθ)i,jϵ),\widehat P_{i,j} \approx a_i b_j \exp\!\left( \frac{f_i+g_j-(C_\theta)_{i,j}}{\epsilon} \right),

which leads to likelihood-based or KL-based convex objectives when CθC_\theta is affine, and connects inverse OT with generalized Sinkhorn iterations and transport-regularized inverse problems Karlsson & Ringh, 2016Ma et al., 2020.

Weak Optimal Transport

Weak OT relaxes the cost so that it depends on the conditional distribution of destinations rather than only on pointwise pairs. It is useful when a source point is allowed to choose a randomized response and the model only penalizes an aggregate of that response, such as its conditional mean.

Barycentric Projection of a Coupling

The first object to isolate is the map obtained by collapsing each conditional law to its barycenter.

The projected target βˉπ\bar\beta_\pi records the distribution of conditional means, not the full second marginal. Thus it is generally different from β\beta; if π=(Id,T)α\pi=(\Id,T)_\sharp\alpha is induced by a map, then Tˉπ=T\bar T_\pi=T and βˉπ=β\bar\beta_\pi=\beta. This projection is not an optimal map for an arbitrary coupling: a deterministic rotation of a radially symmetric source, for example, projects to the rotation itself, whereas the optimal map from the source to itself is the identity. The useful positive statement is attached to quadratic optimal plans, as in the tangent-space viewpoint on W2\Wass_2 developed by Ambrosio, Gigli and Savare Ambrosio et al., 2006.

Proof

By the cyclic-monotonicity characterization of quadratic optimality, π\pi is concentrated on a cc-cyclically monotone set Γ\Gamma for c(x,y)=xy2c(x,y)=\norm{x-y}^2. This means that every finite cycle (xi,yi)i=1mΓ(x_i,y_i)_{i=1}^m\subset\Gamma satisfies

i=1mxi,yii=1mxi,yi+1,ym+1=y1.\sum_{i=1}^m\dotp{x_i}{y_i} \geq \sum_{i=1}^m\dotp{x_i}{y_{i+1}}, \qquad y_{m+1}=y_1.

After changing the disintegration on an α\alpha-negligible set, πx\pi_x is supported on the section Γx={y:(x,y)Γ}\Gamma_x=\{y:(x,y)\in\Gamma\} for α\alpha-almost every xx. Choose x1,,xmx_1,\ldots,x_m in this full-measure set and independently sample YiπxiY_i\sim\pi_{x_i}. Applying the cyclic inequality to (xi,Yi)(x_i,Y_i) and taking expectations gives

i=1mxi,Tˉπ(xi)i=1mxi,Tˉπ(xi+1).\sum_{i=1}^m\dotp{x_i}{\bar T_\pi(x_i)} \geq \sum_{i=1}^m\dotp{x_i}{\bar T_\pi(x_{i+1})}.

Thus (Id,Tˉπ)α(\Id,\bar T_\pi)_\sharp\alpha is concentrated on a cyclically monotone graph. By the cyclic-monotonicity characterization of quadratic optimality, this plan is optimal between its two marginals.

Weak transport costs use the same disintegration but allow the objective to depend on the whole conditional law, or on summaries such as the barycentric projection (46). The framework was introduced through general transport costs and weak transport inequalities, with existence, duality and optimality conditions developed on Polish spaces Gozlan et al., 2017Backhoff Veraguas et al., 2018. For a weak cost C:X×P(Y)R{+}C:\X\times\mathcal P(\Y)\to\RR\cup\{+\infty\}, the weak OT value is

WC(α,β):=infπΠ(α,β)C(x,πx)dα(x).\mathcal W_C(\alpha,\beta) \eqdef \inf_{\pi\in\Couplings(\alpha,\beta)} \int C(x,\pi_x)\d\alpha(x).

The classical Kantorovich problem is recovered when C(x,ν)=c(x,y)dν(y)C(x,\nu)=\int c(x,y)\d\nu(y), because the objective then becomes c(x,y)dπ(x,y)\int c(x,y)\d\pi(x,y). The genuinely weak behavior starts when CC is nonlinear in ν\nu.

Proof

For any coupling π\pi and any gC(Y)g\in\mathcal C(\Y), the definition of gCg^C gives

C(x,πx)gC(x)+g(y)dπx(y).C(x,\pi_x) \geq g^C(x) + \int g(y)\d\pi_x(y).

After integration with respect to α\alpha, the second term becomes gdβ\int g\d\beta because the second marginal of π\pi is β\beta. This proves weak duality.

For the reverse inequality, consider the convex minimization over probability kernels xπxx\mapsto\pi_x with the affine constraint πxdα(x)=β\int \pi_x\d\alpha(x)=\beta. Fenchel--Rockafellar duality gives a continuous Lagrange multiplier gg for this marginal constraint. Minimizing the Lagrangian over each conditional law gives exactly the pointwise term gC(x)g^C(x), while the multiplier contributes gdβ\int g\d\beta. The compactness, lower semicontinuity, convexity and qualification assumptions ensure no duality gap.

Proof

Let π\pi be any coupling and disintegrate it as πxα\pi_x\alpha. By Jensen’s inequality,

xTˉπ(x)2xy2dπx(y).\norm{x-\bar T_\pi(x)}^2 \leq \int\norm{x-y}^2\d\pi_x(y).

Integrating in xx gives Cbar(x,πx)dα(x)xy2dπ(x,y)\int C_{\mathrm{bar}}(x,\pi_x)\d\alpha(x)\leq \int\norm{x-y}^2\d\pi(x,y). Taking the infimum over π\pi proves the claim.

<IPython.core.display.Image object>

Weak barycentric transport on a small disk-to-annulus coupling. The left panel shows the full conditional laws: each red source atom splits its mass among several blue target atoms, with segment thickness proportional to transported mass. The right panel collapses each conditional law πx\pi_x to its barycenter Tˉπ(x)=ydπx(y)\bar T_\pi(x)=\int y\d\pi_x(y), shown in violet. The barycentric weak cost only sees the red-to-violet displacement, and therefore ignores the conditional spread around each barycenter.

The interactive demo lets each source point split toward several targets. Increasing the split count or spread usually increases the full quadratic cost while the weak barycentric cost can remain much smaller.

Interactive panel. Use the spread and barycentric controls to compare full weak conditional laws with their barycentric projections.

The barycentric cost is the canonical example to keep in mind: admissibility still constrains the full conditional laws to have second marginal β\beta, but the objective only charges the displacement from xx to Tˉπ(x)\bar T_\pi(x) and ignores the conditional variance around this barycenter. This connects weak OT with martingale transport, Strassen-type convex-order constraints, barycentric projections and learning problems where conditional averages are meaningful objects.

References
  1. Agueh, M., & Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2), 904–924.
  2. Carlier, G., & Ekeland, I. (2010). Matching for teams. Economic Theory, 42(2), 397–418.
  3. Anderes, E., Borgwardt, S., & Miller, J. (2016). Discrete Wasserstein barycenters: optimal transport for discrete data. Mathematical Methods of Operations Research, 84(2), 389–409.
  4. Álvarez Esteban, P. C., del Barrio, E., Cuesta-Albertos, J., & Matrán, C. (2016). A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2), 744–762.
  5. Le Gouic, T., & Loubes, J.-M. (2016). Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, 168, 901–917.
  6. Bhatia, R., Jain, T., & Lim, Y. (2019). On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2), 165–191. 10.1016/j.exmath.2018.01.002
  7. Cuturi, M., & Doucet, A. (2014). Fast computation of Wasserstein barycenters. Proceedings of the 31st International Conference on Machine Learning, 32, 685–693. https://proceedings.mlr.press/v32/cuturi14.html
  8. Cuturi, M., & Peyré, G. (2016). A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1), 320–343.
  9. Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., & Peyré, G. (2015). Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2), A1111–A1138.
  10. Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., & Guibas, L. (2015). Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4), 66:1-66:11.
  11. Cuturi, M., & Avis, D. (2014). Ground metric learning. Journal of Machine Learning Research, 15, 533–564.
  12. Kulis, B. (2012). Metric learning: a survey. Foundations and Trends in Machine Learning, 5(4), 287–364.
  13. Bellet, A., Habrard, A., & Sebban, M. (2015). Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9(1), 1–151.
  14. Andrade, F., Peyré, G., & Poon, C. (2025). Learning from Samples: Inverse Problems over Measures via Sharpened Fenchel–Young Losses. arXiv Preprint arXiv:2505.07124.
  15. Peyré, G., Poon, C., & Tron, O. (2026). Curvature of optimal transport with respect to the cost and applications to inverse optimal transport. arXiv Preprint arXiv:2604.22670.