Generalized OT Problems
The second family changes the optimization problem rather than only the ground distance. Barycenters average several measures, multi-marginal OT couples many measures at once, inverse OT learns the cost from observed transport, and weak OT allows randomized conditional responses.
These formulations remain close to Kantorovich linear programming, but the object being optimized is richer than a single two-marginal coupling.
OT Barycenters¶
Barycenters ask how to average probability measures rather than points. This section explains the variational definition, the special closed forms in one dimension and for Gaussians, and the entropic algorithms used in practice.
Frechet Means¶
For discrete input histograms , with , and weights , a Wasserstein barycenter can be computed by minimizing
where the cost matrices are prescribed. This barycenter problem was introduced by Carlier and coauthors, following earlier matching ideas, and is the measure-valued analogue of a Frechet mean Agueh & Carlier, 2011Carlier & Ekeland, 2010.
Given input measures on a space , the barycenter problem is
For and , if one input measure has a density, then the barycenter is unique Agueh & Carlier, 2011. Discrete existence, consistency, and fixed-point constructions are studied in Anderes et al., 2016Esteban et al., 2016Le Gouic & Loubes, 2016.

Wasserstein barycenter grids for four corner measures. The left panel uses the one-dimensional formula for one Gaussian law and three asymmetric two-Gaussian mixtures, and displays densities reconstructed from the averaged quantiles. The right panel computes entropic Wasserstein barycenters on a common pixel grid for the cat, two-disk, cross and clover silhouettes, using the normalized squared ground cost, and a Sinkhorn tolerance of . The barycenters are rendered as density images with values clamped at their quantile rather than by threshold contours. Colors interpolate between the four corners and encode the same bilinear weights in both panels.
The interactive demo below keeps the exact one-dimensional formula visible: the two coordinates set bilinear weights on the four corner laws, the middle panel averages their quantile functions, and the right panel reconstructs the resulting barycenter density.
Interactive panel. Use the barycentric coordinate controls to move through the four input laws and compare quantile and entropic barycenter constructions.
Proof
Even when all input measures are discrete, the support of a barycenter is not known a priori. The multi-marginal formulation below shows that a discrete barycenter can be supported on all weighted averages of one support point from each input. This gives at most candidate points if the -th input has atoms, which is prohibitive when the number of inputs is large. A common numerical compromise is therefore to prescribe a smaller support for the barycenter and solve a fixed-support problem.
One-Dimensional Case¶
On the line, barycenters become linear after the quantile change of variables. This gives the rare case where the barycenter is explicit rather than the solution of a high-dimensional optimization problem.
Proof
The one-dimensional formula for gives
The minimization decouples pointwise in . For each fixed , the minimizer of is the weighted average . This function is nondecreasing because it is a positive weighted sum of nondecreasing quantile functions, hence it is a valid quantile function.
Gaussian Case¶
Gaussian barycenters show that the same separation as in the Gaussian Wasserstein formula persists: means average linearly, while covariances average according to the Bures--Wasserstein geometry.

Bures--Wasserstein barycenters of centered Gaussian covariance matrices. Each panel shows a grid of barycenter ellipses for four corner covariances, without separate input panels: the corner ellipses are the four input covariances themselves. The right grid uses more anisotropic inputs, making the nonlinear rotation and scaling of covariance barycenters more visible.
The interactive Gaussian demo compares the Bures covariance barycenter with a plain Euclidean covariance average under the same weights. The difference is most visible for rotated, anisotropic covariances: the Euclidean average blends matrix entries, whereas the Bures barycenter follows the geometry induced by quadratic Gaussian transport.
Interactive panel. Use the corner-covariance and interpolation controls to see how Gaussian barycenter ellipses interpolate covariance geometry.
Sinkhorn for Barycenters¶
A key difference with the regularized two-marginal OT problem is that there is no canonical reference measure , because the barycenter is unknown. To reduce complexity, one usually fixes a candidate support for the barycenter and solves the discrete problem (1); this introduces a discretization error but keeps the number of unknowns manageable.
One can then use entropic smoothing and approximate (1) by
for some . This is a smooth convex minimization problem, which can be tackled using gradient descent Cuturi & Doucet, 2014. An alternative is to use a descent method, typically quasi-Newton, on the semi-dual Cuturi & Peyré, 2016; this is useful when adding extra regularization on the barycenter, for instance to impose smoothness.
A simple but effective approach rewrites (10) as a weighted KL projection problem Benamou et al., 2015:
subject to
Here . The barycenter is implicitly encoded in the common row marginal
The optimal couplings have scaling form
and the generalized Sinkhorn iterations are
The geometric mean enforces the fact that all couplings share the same barycenter marginal.
Proof
Introduce Lagrange multipliers in (11):
Strong duality allows one to exchange the minimum and maximum. Minimizing with respect to gives the constraint , and minimizing with respect to gives the Legendre transform of :
The separable conjugate is
because for ,
Dropping constants independent of the potentials gives the displayed dual. Coordinate maximization in gives the update; block maximization in all gives the common marginal update and then the update.
Classical applications include two-dimensional image interpolation, three-dimensional shape interpolation, and barycenters on surfaces where the ground cost is the square of the geodesic distance Solomon et al., 2015.
Multimarginal OT¶
Multi-marginal OT couples more than two measures at once. It is the natural language for barycenters, matching with teams and several-body costs, but its tensor dimension is the main computational obstacle.
Definition and Basic Structure¶
The multi-marginal formulation replaces a coupling between two measures by a joint distribution with several prescribed marginals. Given measures on spaces and a cost , the problem reads
where is the set of probability measures whose -th marginal is . This is still a linear program in the discrete setting, but its ambient tensor has size .
Multi-Marginal Formulation of Barycenters¶
Wasserstein barycenters are the central example. For the squared Euclidean cost, one can introduce a latent barycenter point and eliminate it explicitly, leading to the multi-marginal cost
Proof
For any candidate barycenter and couplings , glue the couplings along their common marginal to obtain a joint law of . Conditioning on and minimizing over gives
Taking the infimum over the couplings gives that the barycenter value is at least the multi-marginal value. Conversely, from an optimal multi-marginal plan , set . The couplings between and each are feasible for the barycenter problem and attain exactly the multi-marginal cost.
If is any barycenter, choose optimal couplings between and each and glue them along the common marginal. Since the barycenter and multi-marginal values are equal, the conditional minimization inequality above must be an equality. Thus almost surely for the induced optimal multi-marginal plan.
Proof
Let the input Gaussians have means and covariances . For any candidate barycenter with mean and covariance , Gelbrich’s inequality gives
with equality for the Gaussian law with mean and covariance . Therefore the barycenter objective is bounded below by a function depending only on , and this lower bound is attained by the Gaussian measure with the minimizing mean and covariance. For discrete inputs, any multi-marginal optimizer is supported on the finite product of the input supports, and maps this product to at most points.
Entropic Regularization of Multi-Marginal OT¶
As in the two-marginal case, adding an entropic penalty with respect to the product measure leads to scaling algorithms:
The optimizer has the generalized Gibbs form
and generalized Sinkhorn iterations alternately update one potential so that the -th marginal is correct. The bottleneck is the tensor size in the discrete case. Practical barycenter solvers exploit separability of the cost, low-rank structure, convolutional kernels, or a fixed barycenter support.
Metric Learning and Inverse OT¶
This section points to inverse problems where the ground cost itself is learned. Such problems are typically bilevel and non-convex, but OT provides useful gradients with respect to the cost.
Metric Learning and Derivatives of OT¶
OT is convex with respect to the measure and concave with respect to the cost. Ground-metric learning was explicitly studied in Cuturi & Avis, 2014, and it connects to the broader metric-learning literature Kulis, 2012Bellet et al., 2015.
Proof
The value is the minimum of affine functions of ,
The envelope theorem, or equivalently Danskin’s theorem, states that the subdifferential with respect to is the convex hull of the optimal couplings. If the optimizer is unique, this subdifferential is the singleton , hence the value is differentiable with the displayed gradient.
Thus, if the cost is parameterized as , gradients of losses involving OT values are obtained by backpropagating through the inner product . The difficulty is not differentiating a solved OT problem, but learning a cost for which the resulting matching has the desired semantic behavior; this is a bilevel and usually non-convex optimization problem.

Changing the ground metric changes the optimal coupling. The same red and blue empirical measures are matched with for the Euclidean metric and two increasingly anisotropic Mahalanobis metrics. The small gray ellipse shows the unit ball of the metric: directions in which the ellipse is elongated are cheaper, and this deforms the transport segments selected by the OT plan.
The interactive demo lets the anisotropy and orientation of the Mahalanobis cost move. The transport plan is recomputed exactly for the displayed particles, so the segments show how the learned cost changes the matching.
Interactive panel. Use the metric and deformation controls to see how learning the ground cost changes the apparent transport geometry.
Inverse Optimal Transport¶
Inverse OT asks for a ground cost that explains observed matchings or flows as optimal transport plans. In its most direct form, one observes a plan with marginals and seeks a cost such that is optimal for
This is ill-posed without structure: adding potentials to a cost does not change the set of optimal couplings, and many costs can rationalize the same sparse plan.
A useful statistical methodology is to measure the suboptimality of the observed plan through a Fenchel--Young loss. Write the score as and define the convex regularized prediction value
The Fenchel--Young loss
is nonnegative by Fenchel’s inequality and vanishes exactly when , i.e. when satisfies the regularized optimality conditions for . Entropic regularization is important here because it makes the forward map smoother and provides gradients with respect to , at the price of a statistical bias Andrade et al., 2025Peyré et al., 2026.
In the discrete unregularized case, this loss reduces to the optimality gap of the observed coupling. For and a cost matrix , denote it by
This inverse-OT gap loss is nonnegative and vanishes exactly when is optimal for .
In practice, one restricts the cost to a finite-dimensional model class, often affine:
where is convex and the matrices encode features, graph distances or a Mahalanobis parameterization. This viewpoint appears in low-rank and sparse inverse OT models Dupuy et al., 2019Andrade et al., 2024 and in convex formulations for learning OT costs from observed plans Ma et al., 2020Peyré et al., 2026.
A minimal finite-dimensional model is obtained by learning a bilinear cost on ,
For empirical measures and , this gives the cost matrix
so both maps and are linear. Inverse OT within this model asks which matrix makes an observed matching or coupling look optimal; learning the cost is thus reduced to estimating a linear parameter.
For a fixed matrix , the forward prediction is the optimal face
When this face is a singleton, write its element as ; otherwise denotes a deterministic tie-broken selection. Although is linear, the solution correspondence is polyhedral: changing changes the direction in which the transport polytope is probed, and a tie-broken selection is constant on normal-cone cells. The figure below illustrates this correspondence on the OT4ML point clouds. The construction follows the visual idea of the Python Optimal Transport logo Flamary et al., 2021: red source atoms, blue target atoms and straight segments show the selected optimal bijection. The rank-one matrices and only score horizontal or vertical correlations. The matrix gives the usual quadratic assignment, up to the marginal-only terms discussed below, while reverses the correlation and produces an anti- matching.

Forward solutions of the bilinear cost on the OT4ML logo point clouds. Each panel solves the equal-weight assignment problem with a different matrix ; the source atoms are red, the target atoms are blue, and the gray segments give one deterministic optimal bijection.
This elementary model already contains the quadratic Wasserstein assignment. Adding to a cost matrix a term depending only on or only on shifts all feasible couplings by the same constant, and therefore does not change the optimizer. Since
the usual quadratic Wasserstein assignment has the same optimizer as the bilinear cost with , up to these marginal-only terms and an irrelevant positive factor. The inverse problem goes in the opposite direction: after observing a coupling, one asks which matrices could have generated it. The next figure generates an observed coupling from this cost on two empirical mixtures of Gaussians, and then evaluates along the anisotropic path
so that recovers the matrix that generated the observed coupling. Equivalently, with equal weights, , and the plotted loss is the Kantorovich gap
Because is affine and the Kantorovich value is a minimum of affine functions over the fixed polytope , this one-dimensional gap is convex and piecewise affine. Its zero set can contain an interval for a small sample, reflecting the fact that the same observed coupling remains optimal for a cone of nearby costs.

Inverse-OT gap loss for a bilinear cost. Panel (a): two empirical mixtures of two Gaussians are matched with the cost for , which gives the same optimizer as the quadratic cost; red and blue level sets display the two sampling densities. Panels (b,c): the unregularized Fenchel--Young Kantorovich gap along for and , using the same vertical scale. The red dot marks the generating parameter ; the curves are convex and piecewise affine.
The comparison between and illustrates an important statistical effect: as the number of sampled points grows, the flat region of the empirical gap typically shrinks and the loss develops more visible curvature around the generating parameter. This anticipates the population theory of Peyré, Poon and Tron Peyré et al., 2026: in the limit , when the limiting Monge map itself has nondegenerate curvature as the cost parameter varies, the iOT loss identifies the cost robustly, up to the usual marginal-only gauge freedoms. In that regime, minimizing the gap is not only a certificate of optimality of the observed transport, but also a stable way to recover the underlying cost.
Proof
For a fixed cost , Kantorovich duality gives
Since has marginals , every dual feasible pair satisfies
This nonnegative quantity is exactly the primal-dual gap of . It vanishes if and only if reaches the dual value and is therefore optimal. If is affine and and are convex, the constraints and objective in (42) are convex.
The formulation is useful because it avoids differentiating through a forward OT solver: it learns a cost by making the observed plan satisfy complementary slackness. In statistical settings, is only partially observed or noisy, so one adds sparsity, low-rank, smoothness or metric constraints to select a meaningful cost. For entropic OT, the optimality condition becomes smoother:
which leads to likelihood-based or KL-based convex objectives when is affine, and connects inverse OT with generalized Sinkhorn iterations and transport-regularized inverse problems Karlsson & Ringh, 2016Ma et al., 2020.
Weak Optimal Transport¶
Weak OT relaxes the cost so that it depends on the conditional distribution of destinations rather than only on pointwise pairs. It is useful when a source point is allowed to choose a randomized response and the model only penalizes an aggregate of that response, such as its conditional mean.
Barycentric Projection of a Coupling¶
The first object to isolate is the map obtained by collapsing each conditional law to its barycenter.
The projected target records the distribution of conditional means, not the full second marginal. Thus it is generally different from ; if is induced by a map, then and . This projection is not an optimal map for an arbitrary coupling: a deterministic rotation of a radially symmetric source, for example, projects to the rotation itself, whereas the optimal map from the source to itself is the identity. The useful positive statement is attached to quadratic optimal plans, as in the tangent-space viewpoint on developed by Ambrosio, Gigli and Savare Ambrosio et al., 2006.
Proof
By the cyclic-monotonicity characterization of quadratic optimality, is concentrated on a -cyclically monotone set for . This means that every finite cycle satisfies
After changing the disintegration on an -negligible set, is supported on the section for -almost every . Choose in this full-measure set and independently sample . Applying the cyclic inequality to and taking expectations gives
Thus is concentrated on a cyclically monotone graph. By the cyclic-monotonicity characterization of quadratic optimality, this plan is optimal between its two marginals.
Weak transport costs use the same disintegration but allow the objective to depend on the whole conditional law, or on summaries such as the barycentric projection (46). The framework was introduced through general transport costs and weak transport inequalities, with existence, duality and optimality conditions developed on Polish spaces Gozlan et al., 2017Backhoff Veraguas et al., 2018. For a weak cost , the weak OT value is
The classical Kantorovich problem is recovered when , because the objective then becomes . The genuinely weak behavior starts when is nonlinear in .
Proof
For any coupling and any , the definition of gives
After integration with respect to , the second term becomes because the second marginal of is . This proves weak duality.
For the reverse inequality, consider the convex minimization over probability kernels with the affine constraint . Fenchel--Rockafellar duality gives a continuous Lagrange multiplier for this marginal constraint. Minimizing the Lagrangian over each conditional law gives exactly the pointwise term , while the multiplier contributes . The compactness, lower semicontinuity, convexity and qualification assumptions ensure no duality gap.
Proof
Let be any coupling and disintegrate it as . By Jensen’s inequality,
Integrating in gives . Taking the infimum over proves the claim.

Weak barycentric transport on a small disk-to-annulus coupling. The left panel shows the full conditional laws: each red source atom splits its mass among several blue target atoms, with segment thickness proportional to transported mass. The right panel collapses each conditional law to its barycenter , shown in violet. The barycentric weak cost only sees the red-to-violet displacement, and therefore ignores the conditional spread around each barycenter.
The interactive demo lets each source point split toward several targets. Increasing the split count or spread usually increases the full quadratic cost while the weak barycentric cost can remain much smaller.
Interactive panel. Use the spread and barycentric controls to compare full weak conditional laws with their barycentric projections.
The barycentric cost is the canonical example to keep in mind: admissibility still constrains the full conditional laws to have second marginal , but the objective only charges the displacement from to and ignores the conditional variance around this barycenter. This connects weak OT with martingale transport, Strassen-type convex-order constraints, barycentric projections and learning problems where conditional averages are meaningful objects.
- Agueh, M., & Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2), 904–924.
- Carlier, G., & Ekeland, I. (2010). Matching for teams. Economic Theory, 42(2), 397–418.
- Anderes, E., Borgwardt, S., & Miller, J. (2016). Discrete Wasserstein barycenters: optimal transport for discrete data. Mathematical Methods of Operations Research, 84(2), 389–409.
- Álvarez Esteban, P. C., del Barrio, E., Cuesta-Albertos, J., & Matrán, C. (2016). A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2), 744–762.
- Le Gouic, T., & Loubes, J.-M. (2016). Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, 168, 901–917.
- Bhatia, R., Jain, T., & Lim, Y. (2019). On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2), 165–191. 10.1016/j.exmath.2018.01.002
- Cuturi, M., & Doucet, A. (2014). Fast computation of Wasserstein barycenters. Proceedings of the 31st International Conference on Machine Learning, 32, 685–693. https://proceedings.mlr.press/v32/cuturi14.html
- Cuturi, M., & Peyré, G. (2016). A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1), 320–343.
- Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., & Peyré, G. (2015). Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2), A1111–A1138.
- Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., & Guibas, L. (2015). Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4), 66:1-66:11.
- Cuturi, M., & Avis, D. (2014). Ground metric learning. Journal of Machine Learning Research, 15, 533–564.
- Kulis, B. (2012). Metric learning: a survey. Foundations and Trends in Machine Learning, 5(4), 287–364.
- Bellet, A., Habrard, A., & Sebban, M. (2015). Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9(1), 1–151.
- Andrade, F., Peyré, G., & Poon, C. (2025). Learning from Samples: Inverse Problems over Measures via Sharpened Fenchel–Young Losses. arXiv Preprint arXiv:2505.07124.
- Peyré, G., Poon, C., & Tron, O. (2026). Curvature of optimal transport with respect to the cost and applications to inverse optimal transport. arXiv Preprint arXiv:2604.22670.