The first family of extensions keeps the idea of a distance between measures,
but changes the geometry used to compare them. The variants in this chapter
relax mass conservation, reduce high-dimensional transport to
one-dimensional projections, or replace the trace quadratic cost by spectral
gauges and robust projected viewpoints.
These constructions are useful when the standard distance $W_p$ is too
rigid or too expensive. They preserve much of the metric intuition of optimal
transport, but expose new controls: how expensive it is to delete mass, which
projections should be trusted, and which directions of displacement should be
penalized.
from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

# Locate the directory that contains ot4ml_web.py and put it on sys.path.
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        sys.path.insert(0, str(myst_dir))
        break
if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"


def show_book_figure(name, width=760):
    """Display the thumbnail PNG called `name` at the given pixel width."""
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))
Unbalanced OT allows mass creation and destruction by penalizing marginal
mismatch. It is essential when histograms are not normalized, when
observations contain outliers, or when only part of the source should match
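the target (Liero et al., 2018; Chizat et al., 2018). For positive measures
$\alpha,\beta$ and a ground cost $c$, the unbalanced problem reads

$$
\mathrm{UW}_c(\alpha,\beta) := \inf_{\pi\ge 0} \int c\,\mathrm{d}\pi + D_{\psi_1}(\pi_1|\alpha) + D_{\psi_2}(\pi_2|\beta),
$$

where $\psi_1,\psi_2$ are convex entropy functions and $D_{\psi}$ denotes the
associated divergence. Exact conservation
$(\pi_1,\pi_2)=(\alpha,\beta)$ is replaced by a cost for changing the
marginals. Writing $\psi_s=\tau\bar\psi_s$ exposes the relaxation scale:

$$
\mathrm{UW}_{c,\tau}(\alpha,\beta) := \inf_{\pi\ge 0} \int c\,\mathrm{d}\pi + \tau D_{\bar\psi_1}(\pi_1|\alpha) + \tau D_{\bar\psi_2}(\pi_2|\beta).
$$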
Large τ makes marginal mismatch expensive and approaches balanced OT when
the total masses are compatible. Small τ makes creation and destruction
cheap; after rescaling by τ, the zero-transport part reveals the pure
divergence geometry.
Proof
For the upper bound, restrict to diagonal plans
$\pi=(\mathrm{Id},\mathrm{Id})_\sharp\rho$, whose transport cost is zero and whose two
marginals are both $\rho$. This gives the desired upper bound after
optimizing over $\rho$.
For the lower bound, let $\tau_n\downarrow 0$ and let $\pi_n$ be almost
minimizing plans with bounded scaled values
$\tau_n^{-1}\,\mathrm{UW}_{c,\tau_n}(\alpha,\beta)$. Since the divergences are
nonnegative, $\int c\,\mathrm{d}\pi_n=O(\tau_n)$, hence $\int c\,\mathrm{d}\pi_n\to 0$. The
bounded scaled values also put the two marginals in compact divergence
sublevel sets. Since a coupling has the same total mass as each marginal, the
couplings are tight on $X\times X$. Up to subsequences,
$\pi_n\rightharpoonup\pi_0$.
Lower semicontinuity of the transport cost yields $\int c\,\mathrm{d}\pi_0=0$, so
$\pi_0$ is concentrated on the diagonal. Its two marginals are therefore equal
to a common measure $\rho$. Lower semicontinuity of the marginal divergences
gives
In the dominated case, the minimization over $\rho=r\lambda$ decouples into
the scalar envelope $m_{\bar\psi_1,\bar\psi_2}$. For KL, no
singular part is admissible when $\alpha$ and $\beta$ are dominated by
$\lambda$. Writing $\alpha=a\lambda$ and $\beta=b\lambda$, the pointwise objective is
$r\log(r/a)-r+a+r\log(r/b)-r+b$. Its optimality condition is
$\log(r/a)+\log(r/b)=0$, hence $r=\sqrt{ab}$, and the minimum is
$a+b-2\sqrt{ab}=(\sqrt{a}-\sqrt{b})^2$.
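The closed form for the KL envelope can be checked numerically. The sketch below is only an illustration: the function name `kl_pointwise`, the values of `a` and `b`, and the search grid are arbitrary choices, not part of the text.

```python
import numpy as np

def kl_pointwise(r, a, b):
    """Pointwise objective KL(r|a) + KL(r|b) for positive scalars."""
    return r * np.log(r / a) - r + a + r * np.log(r / b) - r + b

a, b = 0.5, 2.0                                # illustrative densities at one point
grid = np.linspace(1e-3, 5.0, 200001)          # brute-force search grid for r
values = kl_pointwise(grid, a, b)

r_numeric = grid[np.argmin(values)]            # numerical minimizer
r_closed = np.sqrt(a * b)                      # geometric mean predicted by the text
min_closed = (np.sqrt(a) - np.sqrt(b)) ** 2    # squared Hellinger-type value

print(r_numeric, r_closed)
print(values.min(), min_closed)
```

The numerical minimizer should sit at the geometric mean of `a` and `b`, and the minimum should match the squared difference of square roots up to grid resolution.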
Proof
Use the variational formula for the dual of a divergence and introduce the
marginal variables through continuous potentials:
The Liero–Mielke–Savaré formulation rewrites marginal penalties as a local
transport cost and then homogenizes it. Assuming first that the reference
measures and transported marginals have mutually absolutely continuous parts,
one can factor the objective as
with the usual recession convention at $r=0$ or $s=0$. If
$\alpha=F\pi_1+\alpha^\perp$ and $\beta=G\pi_2+\beta^\perp$ are the Lebesgue
decompositions of the reference marginals with respect to the transported
marginals, then
The inequality $\mathrm{HW}\le\mathrm{UW}$ follows from $H_c\le L_c$ by
taking $\theta=1$. Conversely, take a feasible measure $\pi$ in the
homogeneous formulation. By definition of the perspective transform, for
every $(x,y)$ and every $\eta>0$ there exists a scale $\theta(x,y)>0$ such
that
Replacing $\pi$ by the rescaled measure $\tilde\pi=\theta\pi$ and the
densities by $F/\theta$ and $G/\theta$ gives an admissible competitor for the
reverse formulation with cost no larger than the homogeneous cost plus
$\eta\,\pi(X\times Y)$. Letting $\eta\to 0$ yields
$\mathrm{UW}\le\mathrm{HW}$. The singular terms are unchanged because the
same rescaling is performed before taking the Lebesgue decomposition of the
marginals.
Assume now that $X=Y$ and $\psi_1=\psi_2=\psi$. The homogeneous formulation
lifts the problem to the cone space
$C[X]:=(X\times\mathbb{R}_+)/\sim$, where all points $(x,0)$ are
identified at the apex. For an exponent $p\ge 1$, define
The equality $\mathrm{UW}=\mathrm{HW}$ is the homogenization proposition. To
prove $\mathrm{HW}=\mathrm{CW}$, disintegrate an admissible cone coupling
$\gamma$ with respect to its spatial variables $(x,y)$ and radii $(r,s)$. The
cone marginal constraints say precisely that the spatial marginals are
recovered after weighting by $r^p$ and $s^p$. Since
$D((x,r),(y,s))^p=H_{c(x,y)}(r^p,s^p)$, integrating the cone cost
gives the homogeneous objective. Conversely, any homogeneous competitor can be
lifted to the cone by placing, over each $(x,y)$, radii whose $p$th powers are
the two density factors appearing in $H_c$.
If $D$ is a distance on the cone, then $\mathrm{CW}^{1/p}$ is the
usual $p$-Wasserstein distance between lifted measures under the linear
cone-marginal constraints. Symmetry and the triangle inequality follow from
the corresponding Wasserstein properties and the gluing lemma on the cone. If
the distance is zero, an optimal cone coupling is concentrated on the diagonal
of the cone, so the weighted projections agree and therefore $\alpha=\beta$.
The exponent ρ<1 is the visible difference with balanced Sinkhorn:
marginal corrections are damped because violating the marginals is allowed.
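To make the damping exponent concrete, here is a minimal sketch of the scaling iterations for KL marginal penalties. It rests on assumptions not stated in the text: the entropic term is taken relative to the product measure $a\otimes b$, the exponent is the standard choice $\rho=\tau/(\tau+\epsilon)$ for KL penalties, and the function name, grid, and densities are purely illustrative.

```python
import numpy as np

def unbalanced_sinkhorn_kl(a, b, C, eps, tau, n_iter=2000):
    """Scaling iterations for entropic OT with KL marginal penalties (sketch)."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    rho = tau / (tau + eps)              # damping exponent, always < 1
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        # With rho = 1 these would be the balanced Sinkhorn updates.
        u = (1.0 / (K @ (b * v))) ** rho
        v = (1.0 / (K.T @ (a * u))) ** rho
    # Coupling with density u_i K_ij v_j relative to a (x) b.
    return (a * u)[:, None] * K * (b * v)[None, :]

x = np.linspace(0.0, 1.0, 50)
C = (x[:, None] - x[None, :]) ** 2
a = np.exp(-((x - 0.3) ** 2) / 0.01) + 1e-3   # bump near 0.3 plus small floor
b = np.exp(-((x - 0.7) ** 2) / 0.01) + 1e-3   # bump near 0.7 plus small floor
a, b = a / a.sum(), b / b.sum()

def marginal_gap(tau, eps=0.05):
    """L1 distance between transported and prescribed marginals."""
    P = unbalanced_sinkhorn_kl(a, b, C, eps, tau)
    return np.abs(P.sum(axis=1) - a).sum() + np.abs(P.sum(axis=0) - b).sum()

gap_small_tau = marginal_gap(0.01)   # cheap creation and destruction
gap_large_tau = marginal_gap(10.0)   # close to balanced transport
print(gap_small_tau, gap_large_tau)
```

In this toy setting the marginal gap should shrink markedly as $\tau$ grows, since the exponent approaches one and the updates approach the balanced ones.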
show_book_figure("unbalanced-mass-relaxation")
KL unbalanced OT on one-dimensional Gaussian-mixture densities. The central
matrix is the transported coupling. The side curves compare the prescribed
marginals with the transported marginals; increasing τ makes marginal
mismatch more expensive, so more mass is moved rather than created or
destroyed.
The interactive demo below exposes the two most important regularization scales.
Increasing τ pushes the transported marginals closer to the prescribed
ones; increasing ϵ spreads the coupling itself.
Interactive panel. Use the deletion cost and regularization controls to see when unbalanced transport prefers moving mass, creating mass, or removing it.
The entropy used in the marginal relaxation also changes the qualitative
behavior. A KL penalty leads to smooth multiplicative rescaling. The
reverse-KL, or Burg, penalty blows up when a transported marginal vanishes
where the prescribed marginal is positive, so it discourages complete deletion
of small modes. Total variation has a linear kink and behaves closer to
partial transport: mass is either kept active or created and destroyed at
nearly constant marginal price.
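The qualitative differences between these penalties can be read off the scalar entropy functions. The sketch below assumes the common conventions $\psi_{\mathrm{KL}}(s)=s\log s-s+1$, $\psi_{\mathrm{Burg}}(s)=s-1-\log s$ and $\psi_{\mathrm{TV}}(s)=|s-1|$, where $s$ is the density of the transported marginal with respect to the prescribed one; the function names are illustrative.

```python
import numpy as np

def psi_kl(s):
    """KL entropy s*log(s) - s + 1, with the convention 0*log(0) = 0."""
    s = np.asarray(s, dtype=float)
    safe = np.where(s > 0, s, 1.0)          # avoid log(0) inside the product
    return np.where(s > 0, s * np.log(safe), 0.0) - s + 1.0

def psi_burg(s):
    """Burg (reverse-KL) entropy s - 1 - log(s); infinite at s = 0."""
    s = np.asarray(s, dtype=float)
    with np.errstate(divide="ignore"):
        return s - 1.0 - np.log(s)

def psi_tv(s):
    """Total-variation entropy |s - 1|, with a kink at s = 1."""
    return np.abs(np.asarray(s, dtype=float) - 1.0)

ratios = np.array([0.0, 0.5, 1.0, 2.0])     # transported / prescribed density
for name, psi in [("KL", psi_kl), ("Burg", psi_burg), ("TV", psi_tv)]:
    print(name, psi(ratios))
```

At ratio zero, which corresponds to complete deletion, KL and TV stay finite while Burg is infinite; all three vanish at ratio one.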
show_book_figure("unbalanced-divergence-choice")
Effect of the marginal divergence in unbalanced entropic OT. The geometric
cost, entropic plan regularization ϵ, and relaxation strength τ
are fixed; only the marginal penalty changes. KL allows smooth mass
variation, Burg keeps transported marginals from vanishing on prescribed
modes, and total variation gives a sharper active-mass selection.
Sliced Wasserstein trades exact high-dimensional geometry for many
one-dimensional projections. It is cheap, differentiable after sorting, and
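often effective in imaging and learning. For measures on $\mathbb{R}^d$ and
$\theta\in\mathbb{S}^{d-1}$, let
$P_\theta(x)=\langle\theta,x\rangle$ be the projection on direction $\theta$.
Writing $\sigma$ for the uniform probability measure on the sphere, the sliced
Wasserstein distance averages the projected costs:

$$
\mathrm{SW}_p(\alpha,\beta)^p := \int_{\mathbb{S}^{d-1}} W_p\big((P_\theta)_\sharp\alpha,(P_\theta)_\sharp\beta\big)^p\,\mathrm{d}\sigma(\theta).
$$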
This construction is closely related to the Radon transform and is much
cheaper to approximate numerically than high-dimensional OT, since each
projected problem can be solved by sorting or quantiles
(Rabin et al., 2011; Bonneel et al., 2015; Kolouri et al., 2016). It
metrizes the same weak-plus-moment topology as $W_p$, but its geometry is
not bi-Lipschitz equivalent to $W_p$ in high dimension
(Nadjahi et al., 2019).
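A Monte Carlo estimate of the sliced distance takes a few lines once the one-dimensional problems are solved by sorting. The sketch below assumes equal-size point clouds with uniform weights and random directions drawn uniformly on the sphere; the function name and parameters are illustrative.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_dirs=200, p=2, seed=0):
    """Monte Carlo estimate of SW_p between equal-size, equal-weight clouds."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = rng.normal(size=(n_dirs, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # uniform on sphere
    # Project and sort: each row holds the order statistics of one slice.
    px = np.sort(theta @ X.T, axis=1)
    py = np.sort(theta @ Y.T, axis=1)
    # Sorted matching is optimal in 1D; average over samples and directions.
    return np.mean(np.abs(px - py) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
shift = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = X + shift                                   # pure translation of X

sw = sliced_wasserstein(X, Y)
w2 = np.linalg.norm(shift)                      # W_2 between X and its translate
print(sw, w2, w2 / np.sqrt(X.shape[1]))
```

For a pure translation the estimate stays below $W_2$, as the 1-Lipschitz bound predicts, and for $p=2$ it should be close to $W_2/\sqrt{d}$ up to Monte Carlo error.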
Sliced Wasserstein projections between two planar densities. Fixed directions
are drawn on both densities, and the middle panels show smoothed
one-dimensional density estimates of the projected measures. Sliced OT
averages one-dimensional Wasserstein discrepancies over many such directions.
The interactive demo separates two uses of a slice: comparing projected measures and
lifting the sorted one-dimensional matching back to the plane. The lifted plan
is always feasible in the original space, but it need not be the quadratic
optimal plan.
Interactive panel. Use the projection angle and number of directions to see how sliced Wasserstein distances reduce high-dimensional transport to one-dimensional matchings.
Proof
Non-negativity and symmetry follow from the one-dimensional Wasserstein
distance. For the triangle inequality, apply the triangle inequality of
$W_p$ for each direction $\theta$ and then Minkowski’s inequality in
$L^p(\mathbb{S}^{d-1})$.
If $\mathrm{SW}_p(\alpha,\beta)=0$, then
$(P_\theta)_\sharp\alpha=(P_\theta)_\sharp\beta$ for almost every direction.
By continuity of characteristic functions this holds for all directions, and
the Cramér–Wold theorem implies $\alpha=\beta$.
The bound $\mathrm{SW}_p\le W_p$ follows because $P_\theta$ is
1-Lipschitz. For $p=2$, using any coupling $\pi$ between $\alpha$ and
$\beta$,
One-dimensional slices are extremely cheap, but they may discard too much
geometry in high dimension. A natural compromise is to project onto
k-dimensional subspaces: the projected OT problems remain lower
dimensional, while each projection retains correlations inside a small block
of coordinates.
Proof
The first inequality in each line follows because an Lp average over a
probability space is bounded by the corresponding supremum. The second
inequality follows because orthogonal projections are 1-Lipschitz: pushing
any admissible coupling between α and β through a projection gives
an admissible coupling for the projected measures with no larger transport
cost. Optimizing over couplings and then averaging or maximizing over the
projection gives the result.
The preceding constructions define distances between projected measures. A
different use of slicing is to use a projection only as a device for building
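a feasible high-dimensional transport plan. For equal-weight empirical
measures
$\alpha=n^{-1}\sum_i\delta_{x_i}$ and
$\beta=n^{-1}\sum_i\delta_{y_i}$, sort the projected samples
$\langle x_i,\theta\rangle$ and $\langle y_j,\theta\rangle$, and let $\sigma_\theta$ be the
monotone matching induced by this sorting. The lifted plan

$$
\pi_\theta := \frac{1}{n}\sum_{i=1}^n \delta_{(x_i,\,y_{\sigma_\theta(i)})}
$$

is a genuine coupling between $\alpha$ and $\beta$ in the original space.
Min-SWGG-type methods then choose the projection whose lifted plan has the
smallest full-dimensional quadratic cost:

$$
\min_{\theta\in\mathbb{S}^{d-1}} \frac{1}{n}\sum_{i=1}^n \|x_i-y_{\sigma_\theta(i)}\|^2.
$$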
Lifted min-sliced plan. A one-dimensional direction is selected by a small
deterministic sweep, then red and blue atoms are sorted after projection and
matched in that order. The middle panel lifts this one-dimensional matching
back to the plane; it is a feasible coupling but not the same object as the
quadratic W2 matching shown on the right.
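The lifted construction can be sketched in the plane with a deterministic sweep over angles. The code below is an illustration under stated assumptions: the point clouds, the number of angles, and the helper name `lifted_cost` are arbitrary, and the exact quadratic matching is computed with SciPy's `linear_sum_assignment` for comparison only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lifted_cost(X, Y, theta):
    """Mean squared displacement of the matching sorted along direction theta."""
    ix = np.argsort(X @ theta)          # order of source atoms after projection
    iy = np.argsort(Y @ theta)          # order of target atoms after projection
    return np.mean(np.sum((X[ix] - Y[iy]) ** 2, axis=1))

rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 2))
Y = rng.normal(size=(n, 2)) @ np.array([[1.5, 0.0], [0.5, 0.7]]) + np.array([2.0, 1.0])

# Small deterministic sweep over directions in the half circle.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
costs = [lifted_cost(X, Y, np.array([np.cos(t), np.sin(t)])) for t in angles]
best_angle = angles[int(np.argmin(costs))]
min_lifted = min(costs)

# Exact quadratic matching between the two equal-weight clouds.
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
rows, cols = linear_sum_assignment(C)
w2_squared = C[rows, cols].mean()

print(best_angle, min_lifted, w2_squared)
```

Every lifted plan is a permutation coupling, so its cost can only be larger than or equal to the squared $W_2$ value; the gap measures how much is lost by restricting to sorted matchings.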
Linear OT starts from the multivariate analogue of quantile coordinates. The
one-dimensional quantile function represents a probability measure by the
monotone map sending a fixed reference law to it; in dimension d>1,
Brenier’s theorem gives the corresponding construction after choosing an
absolutely continuous reference probability ρ, typically the uniform law
on a convex body or a standard Gaussian.
This construction is canonical only after fixing ρ: changing the
reference law changes the coordinates used to represent μ. Vector
quantile regression uses the same idea conditionally, replacing scalar
conditional quantiles by conditional Brenier maps and thereby encoding
multivariate ranks and depths (Carlier et al., 2017).
Linear OT replaces a nonlinear transport distance by a Hilbert norm between
reference maps. It is useful when one reference measure is fixed and many
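nearby distributions must be compared cheaply. Let $T_\alpha$ be the Brenier
map pushing $\rho$ to $\alpha$, understood as an element of
$L^2(\rho;\mathbb{R}^d)$ and hence defined only $\rho$-almost everywhere. The linear
OT embedding is

$$
\alpha\mapsto T_\alpha, \qquad \mathrm{LOT}_\rho(\alpha,\beta) := \|T_\alpha-T_\beta\|_{L^2(\rho)}.
$$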
If one of the two targets equals the reference, the linearized distance is
exact: for instance,
$\mathrm{LOT}_\rho(\rho,\alpha)=\|T_\alpha-\mathrm{Id}\|_{L^2(\rho)}=W_2(\rho,\alpha)$.
For two arbitrary targets, the coupling
$(T_\alpha,T_\beta)_\sharp\rho$ is admissible but not generally optimal, so
$\mathrm{LOT}_\rho$ is a tangent-space approximation of the Wasserstein
geometry (Wang et al., 2013).
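Gaussian targets give a closed-form illustration of this gap. The sketch below assumes a standard Gaussian reference, for which the Brenier map to $N(m,\Sigma)$ is $x\mapsto m+\Sigma^{1/2}x$, and uses the classical Bures formula for $W_2$ between Gaussians; the function names and covariance matrices are illustrative.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def lot_gaussian(ma, Sa, mb, Sb):
    """LOT distance w.r.t. N(0, I): L2 norm of the difference of affine maps."""
    A, B = psd_sqrt(Sa), psd_sqrt(Sb)
    return np.sqrt(np.sum((ma - mb) ** 2) + np.sum((A - B) ** 2))

def w2_gaussian(ma, Sa, mb, Sb):
    """Exact W_2 between Gaussians via the Bures formula."""
    A = psd_sqrt(Sa)
    cross = psd_sqrt(A @ Sb @ A)
    bures2 = np.trace(Sa + Sb - 2.0 * cross)
    return np.sqrt(np.sum((ma - mb) ** 2) + max(bures2, 0.0))

m0 = np.zeros(2)
Sa = np.diag([2.0, 0.5])
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
R = np.array([[c, -s], [s, c]])
Sb = R @ Sa @ R.T                      # rotated copy: does not commute with Sa
Sc = np.diag([1.0, 3.0])               # diagonal: commutes with Sa

print(lot_gaussian(m0, Sa, m0, Sb), w2_gaussian(m0, Sa, m0, Sb))
print(lot_gaussian(m0, Sa, m0, Sc), w2_gaussian(m0, Sa, m0, Sc))
```

When the covariances commute the two values coincide; for the rotated pair the linearized distance is strictly larger than $W_2$, which reflects the non-optimality of the coupling through the reference.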
For a family $(\alpha_s)_s$ with weights $(\lambda_s)_s$, the linearized
barycenter is obtained by averaging maps,

$$
\bar\alpha := \Big(\sum_s \lambda_s T_{\alpha_s}\Big)_\sharp\rho.
$$

This is exact in one dimension, where quantile functions linearize
$W_2$, and it is especially useful when many barycenters with changing
weights must be evaluated quickly.
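In one dimension the whole pipeline reduces to quantile functions. The sketch below assumes the uniform law on $(0,1)$ as reference, discretized at 200 midpoint levels, and estimates quantiles from samples with `np.quantile`; the sample sizes, target laws, and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
levels = (np.arange(200) + 0.5) / 200          # discretized uniform reference

alpha_samples = rng.normal(-1.0, 0.5, size=5000)
beta_samples = rng.normal(2.0, 1.5, size=5000)

# Quantile maps from the reference: these are the LOT coordinates in 1D.
T_alpha = np.quantile(alpha_samples, levels)
T_beta = np.quantile(beta_samples, levels)

# Linearized distance, which coincides with W_2 in one dimension.
lot = np.sqrt(np.mean((T_alpha - T_beta) ** 2))
w2_exact = np.sqrt((2.0 - (-1.0)) ** 2 + (1.5 - 0.5) ** 2)   # Gaussian closed form

# Barycenter: average the maps, then push the reference forward.
lam = np.array([0.3, 0.7])
T_bar = lam[0] * T_alpha + lam[1] * T_beta     # quantile function of barycenter

print(lot, w2_exact, T_bar.mean())
```

Changing the weights `lam` only requires re-averaging the stored quantile maps, which is the source of the speed-up when many barycenters are needed.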
show_book_figure("dualnorms-linear-ot-embedding")
Linear OT coordinates. Fixing a reference measure $\rho$ turns each target
into a map $T_\alpha$ from $\rho$ to $\alpha$, or equivalently into the
displacement field $T_\alpha-\mathrm{Id}$. In one dimension this is exactly the
quantile parametrization of $W_2$. In two dimensions, averaging the maps
gives the linearized barycenter, which is compared with the genuine McCann
midpoint.
The next control keeps the exact one-dimensional setting. The reference
density defines the coordinate system, the target maps are quantile maps from
that reference, and the barycenter is obtained by averaging those maps before
pushing the reference forward.
Interactive panel. Use the reference and deformation controls to inspect how linear optimal transport embeds measures through maps from a fixed template.
Proof
The first inequality is immediate:
$(T_\alpha,T_\beta)_\sharp\rho$ is a feasible coupling between $\alpha$ and
$\beta$. The reverse local estimate is a standard stability statement for the
Monge–Ampère equation under the stated regularity assumptions: changes in
the target measure control changes in the Brenier potential in Hölder norms,
hence control $T_\alpha-T_\beta$ in $L^2(\rho)$.
In one-dimensional settings, quantile functions make this exact with
$\eta=1$. In several dimensions one should not read the statement as a global
Lipschitz estimate in $W_2$. Quantitative stability results for
semi-discrete and Monge–Ampère maps give Hölder exponents depending on the
dimension, density bounds, support geometry and regularity
(Mérigot et al., 2020).
Spectral OT changes the scalar quadratic cost by measuring the whole
displacement covariance through a matrix gauge. The same object admits a
robust projected formulation: instead of fixing one projection, one maximizes
over the polar set of the gauge. Subspace robust OT is the important
non-convex rank-constrained version of this idea (Paty & Cuturi, 2019);
spectral gauges provide its convex minimax counterpart and connect to recent
spectral-gradient viewpoints such as Muon dynamics (Peyré, 2026).
The monotonicity condition means that increasing the displacement covariance
in Loewner order cannot decrease the transport penalty.
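The special case $\gamma(M)=\operatorname{tr}(M)$ gives the usual quadratic Wasserstein
distance $W_2$. The spectral gauge $\gamma(M)=\lambda_{\max}(M)$ instead
measures the worst transported variance direction. For $A\succeq 0$, define
the quadratic projected transport cost

$$
W_{2,A}(\alpha,\beta)^2 := \min_{\pi\in\Pi(\alpha,\beta)} \int \langle A(x-y),x-y\rangle\,\mathrm{d}\pi(x,y) = \min_{\pi\in\Pi(\alpha,\beta)} \operatorname{tr}(A M_\pi),
$$

where $M_\pi=\int (x-y)(x-y)^\top\,\mathrm{d}\pi(x,y)$ is the displacement covariance of $\pi$.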
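The coupling set is convex and compact for weak convergence under compact
support. The polar set $B_\gamma$ is convex and compact, and the map
$(\pi,A)\mapsto\operatorname{tr}(AM_\pi)$ is affine in each variable and continuous. Sion’s
minimax theorem gives

$$
\min_{\pi\in\Pi(\alpha,\beta)}\,\max_{A\in B_\gamma} \operatorname{tr}(AM_\pi) = \max_{A\in B_\gamma}\,\min_{\pi\in\Pi(\alpha,\beta)} \operatorname{tr}(AM_\pi).
$$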
For fixed $A\succeq 0$, $W_{2,A}$ is the Wasserstein pseudodistance
associated with the seminorm $x\mapsto\|A^{1/2}x\|$. A supremum of
pseudodistances is symmetric and satisfies the triangle inequality. If
$aI\in B_\gamma$ and $A\preceq bI$ for all
$A\in B_\gamma$, then
and, since $M\succeq 0$, the associated support function is the same Ky Fan
gauge. Thus $W_{\gamma_k}$ is the convexified spectral counterpart of
$\mathrm{SRW}_{2,k}$, while $\mathrm{SRW}_{2,k}$ keeps the
original non-convex rank constraint. For $k=1$,
$\gamma_1(M)=\lambda_{\max}(M)$ and
$B_{\gamma_1}=\{A\succeq 0:\operatorname{tr}(A)\le 1\}$.
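The two gauges are easy to compare on a fixed coupling. The sketch below assumes a discrete coupling that pairs `X[i]` with `Y[i]` with weight $1/n$ and takes the displacement covariance to be the second-moment matrix of the displacements $x-y$; the data and the helper name are illustrative.

```python
import numpy as np

def displacement_covariance(X, Y):
    """Second-moment matrix of displacements for the pairing X[i] -> Y[i]."""
    D = X - Y
    return D.T @ D / len(D)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Stretch the first coordinate so that displacements are anisotropic.
Y = X @ np.diag([3.0, 1.0, 1.0]) + rng.normal(scale=0.1, size=(100, 3))

M = displacement_covariance(X, Y)
trace_gauge = np.trace(M)                       # mean squared displacement
top_gauge = np.linalg.eigvalsh(M)[-1]           # worst projected variance

# Support-function view for k = 1: tr(A M) with A = theta theta^T, tr(A) = 1.
thetas = rng.normal(size=(2000, 3))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
projected = np.einsum("ij,jk,ik->i", thetas, M, thetas)

print(trace_gauge, top_gauge, projected.max())
```

The trace gauge equals the mean squared displacement, while the top-eigenvalue gauge is the largest projected value; random rank-one matrices in the polar set can approach it from below but never exceed it.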
show_book_figure("spectral-wasserstein-gauge")
Trace and spectral gauges for displacement covariances. The trace gauge
minimizes the average squared displacement and gives the usual quadratic
transport plan. The $\lambda_{\max}$ gauge penalizes the worst projected
displacement variance; the displayed plan is obtained by approximating the
robust formulation with finitely many directions.
The interactive demo turns the displacement covariance into a visible object. The
trace gauge sums both covariance eigenvalues, while the top-eigenvalue gauge
cares only about the worst transported direction.
Interactive panel. Use the spectral weights and deformation controls to see how the gauge changes the geometry used to compare measures.
Liero, M., Mielke, A., & Savaré, G. (2018). Optimal entropy-transport problems and a new Hellinger–Kantorovich distance between positive measures. Inventiones Mathematicae, 211(3), 969–1117.
Chizat, L., Schmitzer, B., Peyré, G., & Vialard, F.-X. (2018). An interpolating distance between optimal transport and Fisher–Rao metrics. Foundations of Computational Mathematics, 18(1), 1–44.
Rabin, J., Peyré, G., Delon, J., & Bernot, M. (2011). Wasserstein barycenter and its application to texture mixing. International Conference on Scale Space and Variational Methods in Computer Vision, 435–446.
Bonneel, N., Rabin, J., Peyré, G., & Pfister, H. (2015). Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1), 22–45.
Kolouri, S., Zou, Y., & Rohde, G. K. (2016). Sliced Wasserstein kernels for probability distributions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5258–5267.
Nadjahi, K., Durmus, A., Simsekli, U., & Badeau, R. (2019). Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance. Advances in Neural Information Processing Systems.
Bonnotte, N. (2013). Unidimensional and evolution methods for optimal transportation [PhD thesis]. Université Paris-Sud.
Carlier, G., Chernozhukov, V., & Galichon, A. (2017). Vector quantile regression beyond the specified case. Journal of Multivariate Analysis, 161, 96–102. doi:10.1016/j.jmva.2017.07.003
Wang, W., Slepčev, D., Basu, S., Ozolek, J. A., & Rohde, G. K. (2013). A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International Journal of Computer Vision, 101(2), 254–269.
Mérigot, Q., Delalande, A., & Chazal, F. (2020). Quantitative Stability of Optimal Transport Maps and Linearization of the 2-Wasserstein Space. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 108, 3186–3196.
Paty, F.-P., & Cuturi, M. (2019). Subspace Robust Wasserstein Distances. Proceedings of the 36th International Conference on Machine Learning, 97, 5072–5081.
Peyré, G. (2026). Muon Dynamics as a Spectral Wasserstein Flow. arXiv Preprint arXiv:2604.04891.