Convergence for entropic optimal transport has two complementary meanings. At
fixed marginals and fixed temperature, one studies how Sinkhorn iterates
approach the regularized optimizer. When the marginals are empirical, one also
studies how the regularized value and potentials behave as the number of
samples grows toward a mean-field limit.
This chapter keeps these two scales together. It starts with algorithmic
convergence of matrix scaling and soft transforms, then explains the Hilbert
metric contraction mechanism, and finally turns to Gaussian closed forms and
sample-complexity consequences. The recurring message is that entropy helps
optimization and statistics by smoothing the problem, but the constants
deteriorate as the temperature ϵ goes to zero.
from pathlib import Path
import sys
from IPython.display import Image as DisplayImage
from IPython.display import display
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
if (candidate / "ot4ml_web.py").exists():
myst_dir = candidate.resolve()
sys.path.insert(0, str(myst_dir))
break
if myst_dir is None:
raise RuntimeError("Could not locate myst/ot4ml_web.py")
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"
def show_book_figure(name, width=760):
display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))
Sinkhorn can be read as alternating Bregman projections. The main geometric
message is simple: each row or column rescaling is the KL projection onto one
affine marginal constraint. The convergence mechanism then follows from the
Pythagorean identity for Bregman divergences.
For simplicity, this section is written for discrete measures. The same ideas
carry over to general measures, and the robust-rate section below expresses
the constants through cost and potential oscillations rather than through the
number of grid points.
The projection viewpoint explains Sinkhorn as repeated enforcement of one
marginal constraint at a time. It is not specific to entropy, although KL is
the case where the projections reduce to elementary row and column scalings.
Bregman divergences are useful because their geometry can encode constraints.
A Legendre-type generator blows up, or has an infinite derivative, at the
boundary of its domain. For negative entropy, positivity is therefore built
into the divergence, so one projects onto affine marginal constraints without
separately handling non-negativity.
Adding a linear cost to a Bregman penalty merely shifts the reference point in
dual coordinates. The usual Gibbs--KL reformulation is exactly the entropy
specialization.
On the transport polytope, scaling Ka,bϵ is equivalent to scaling
the Gibbs kernel K=e−C/ϵ because the factors ai and bj can
be absorbed into the Sinkhorn scalings. The unique entropic optimizer is the
KL projection of this tilted Gibbs reference onto the coupling constraints:
Applying this identity at every half-step shows that
BΦ(Q∣Zℓ) decreases for every
Q∈C1∩C2, and that the projection drops are
summable. Compactness gives cluster points. The adjacent half-steps approach
one another, so every cluster point belongs to both affine sets. The
telescoped first-order conditions give the normal-cone optimality condition
for the Bregman projection onto the intersection. Strict convexity gives
uniqueness, hence the whole sequence converges.
the two projection steps are the usual Sinkhorn updates on the scaling
vectors. In practice one stores the vectors and multiplies by the Gibbs
kernel, often exploiting separable, sparse, low-rank or geometric structure.
The Bregman proof is geometric, but its direct finite-dimensional linear-rate
constants can degrade with dimension and with small ϵ. The robust
dual analysis below gives a dimension-free qualitative message: before any
local linear regime becomes visible, one can still guarantee an O(1/k) dual
gap whose constants depend on the cost range and potential oscillation.
There is another, older route to convergence, going back to Fortet’s proof of
the Schrodinger system Fortet, 1940Essid & Pavon, 2019Léonard, 2019.
It uses the order structure of soft transforms rather than an explicit
contraction factor.
Proof Sketch
The soft c-transform is order reversing. For example, if g≤g′, then
The composition of two order-reversing transforms is order preserving. Soft
transforms also commute with additive constants in projective form. Starting
from a subsolution representative gives f0≤f1, and order preservation
gives representatives satisfying fk≤fk+1. Soft-transform
oscillation bounds, controlled by supc−infc, prevent escape to infinity
after normalization. Monotone convergence, compactness of equicontinuous soft
transforms, and continuity of A give a fixed point. Uniqueness of
entropic potentials up to constants identifies this fixed point with the
Sinkhorn solution.
This proof is qualitative rather than quantitative, but it is conceptually
useful: Sinkhorn is not only alternating projection or projective contraction;
it is also a monotone fixed-point iteration on potential classes once
constants are quotiented out.
This is the natural size for Sinkhorn potentials because adding constants
changes their gauge but not the coupling.
Hence every value of T(f)−T(g) lies in [a,b], so its
oscillation is at most b−a=∥f−g∥V. The order-reversing,
anti-homogeneous case gives the same oscillation bound with the interval
reversed.
Proof
The soft transform is order reversing and satisfies
(g+λ)cˉ,ϵ=gcˉ,ϵ−λ, and similarly
for the other block. The topical-map proposition applies to each block, and
the composition of two 1-Lipschitz maps is 1-Lipschitz.
Sinkhorn is cyclic coordinate ascent on the smooth dual objective; equivalently,
it alternates KL projections on the two marginal constraint sets. The rate
most useful for complexity estimates is the pre-asymptotic one: before a
linear regime becomes visible, the dual objective gap decreases at order
1/k. Related robust Bregman-projection rates are developed in
Peyré, 2026Altschuler et al., 2017Dvurechensky et al., 2018, and
statistical consequences of entropic smoothing in
Genevay et al., 2019Bigot et al., 2019.
Proof
Let A={i:pi≥qi} and set
a=∑i∈Api, b=∑i∈Aqi. Then
a−b=21∥p−q∥1. Data processing for relative entropy gives
The same identity yields computable stopping diagnostics. The KL drops are
exactly marginal defects: after a row update the row marginal is correct and
the remaining drop is measured by the column marginal, and conversely after a
column update. Marginal violations therefore monitor both feasibility and the
remaining dual gap, up to the bounded-radius constant.
Hilbert’s projective metric gives a complementary convergence mechanism.
Instead of following objective values, it measures distances between positive
scaling vectors modulo global multiplication. Positive kernels are
contractions in this geometry, yielding a global linear convergence statement
Franklin & Lorenz, 1989.
Multiplying both vectors by arbitrary positive constants does not change this
quantity, so it is a distance only after passing to projective classes.
Proof
The map u↦logu identifies R+,∗n/∼ with the quotient
vector space Rn/span(1n). The variation seminorm
vanishes exactly on constant vectors, so it induces a norm on this quotient.
Symmetry, the triangle inequality, separation and completeness follow from
the corresponding finite-dimensional norm properties.
Hilbert’s metric was introduced by Birkhoff and Samelson to give quantitative
proofs of the Perron--Frobenius theorem Birkhoff, 1957Samelson, 1957.
Sinkhorn can be viewed as a nonlinear matrix-scaling analogue of this theory.
Proof Sketch
For a positive linear map A on a cone, define its projective diameter
The same estimate holds for the other block. Applying it twice gives the
linear rate. The a posteriori marginal-residual bounds follow from the
triangle inequality and the fixed-point equations. The logarithmic primal
bound follows because
logP(ℓ)−logP⋆ is the sum of the two logarithmic scaling
errors.
For bounded cost, the contraction can also be read directly on the dual
potentials. With Kϵ=e−c/ϵ and
R=supc−infc,
This is a useful global statement, but it also explains a limitation: the
worst-case factor becomes exponentially close to one as ϵ→0.
In practice, one often observes a much better local linear regime after enough
iterations. The same Hilbert-metric mechanism extends beyond finite matrices
to positive integral operators under compactness and positivity assumptions,
but Gaussian distributions with quadratic cost require a different approach.
The interactive demo below separates the observed marginal residual from two
theoretical guides. The O(1/k) curve captures the robust pre-asymptotic
message, while the Hilbert curve shows how pessimistic the global contraction
bound becomes when the Gibbs kernel is close to degenerate.
Interactive panel. This exploratory panel plots convergence guides for Sinkhorn iterations. Use epsilon and condition controls to compare observed residuals with pessimistic global contraction bounds.
Gaussian marginals provide an explicit finite-dimensional model of Sinkhorn’s
behavior. The soft c-transform preserves quadratic potentials, the optimal
entropic coupling is Gaussian, and the value can be written with matrix square
roots Janati et al., 2020. This is the entropic counterpart of the
Gaussian W2 and Bures formula.
Proof
The exponent is the sum of a quadratic polynomial in y and the logarithm of
the Gaussian density of β. Completing the square in y evaluates the
integral as a positive constant times the exponential of a quadratic
polynomial in x. Taking −ϵlog therefore gives a quadratic
polynomial.
Proof Sketch
For any coupling, replace (X,Y) by the Gaussian vector with the same mean
and covariance. The quadratic cost is unchanged, and Gaussian laws maximize
entropy at fixed covariance, so the relative entropy cannot increase. It is
therefore enough to optimize over Gaussian couplings.
Writing the cross-covariance as
K=Σα1/2SΣβ1/2, the block covariance constraint
is equivalent to the singular values of S being at most one. The cost
depends on K through −2tr(K), while
The first-order condition is
2σi=ϵs/(1−s2), whose positive solution is the displayed
si.
Proof Sketch
The raw Gaussian entropic value is the squared mean displacement plus two
trace terms and the spectral sum
∑iψϵ(σi(Σα,Σβ)). Applying the
same formula to the self-costs (α,α) and (β,β) replaces
the cross singular values by the eigenvalues of Σα and
Σβ.
In the debiased polarization formula, the trace terms cancel:
The polarization of the squared mean terms leaves
∥mα−mβ∥2. Finally,
τϵ(r)→1 and
ϵlog(1−τϵ(r)2)→0, so
ψϵ(r)→−2r. The covariance limit is therefore
which is the Bures--Wasserstein covariance formula.
The controls below expose exactly the quantities in the formula: ϵ
sets the singular-value shrinkage, anisotropy changes the eigenvalues, and the
angle changes the covariance misalignment.
Interactive panel. This exploratory panel exposes the Gaussian formula directly. Use epsilon, anisotropy, and angle to see how entropic shrinkage changes the covariance term.
the derivative of the full-cycle map at the fixed point is
Tϵ′(q⋆)2=A⋆−4.
This scalar calculation illustrates the general Gaussian convergence picture
of Chizat, Delalande and Vaskevicius Chizat et al., 2024: the rate
improves when ϵ is large or the covariance scales overlap well, and
deteriorates in the small-temperature limit where the entropic coupling
approaches a deterministic Brenier map.
This section separates two statistical regimes. Exact OT resolves geometry at
all spatial scales and pays dimension-dependent empirical rates. Fixed
temperature Sinkhorn divergences smooth the dual potentials and recover
parametric fluctuations, at the price of regularization bias.
The sample complexity of unregularized OT suffers from the curse of
dimensionality. Entropic regularization changes the picture: for a fixed
ϵ>0, Sinkhorn divergences have parametric n−1/2 statistical
rates, although the constant deteriorates when ϵ→0Genevay et al., 2019Bigot et al., 2019. Related two-sample-testing
viewpoints are developed in Ramdas et al., 2017, and the
large-ϵ kernel limit connects to classical MMD tests
Gretton et al., 2012.
Empirical fluctuations in dimensions three and six. For each sample size
n, two independent empirical measures are drawn from the same standard
Gaussian law. Exact OT follows a slower dimension-dependent scale, while MMD
and the fixed-ϵ Sinkhorn divergence behave closer to the parametric
n−1/2 guide. This is a statistical illustration, not a solver benchmark.
For the one-sample term, partition [0,1]d into dyadic cubes. At scale
2−j, empirical mass fluctuations over the cells are of order
n−1/22jd/2, while moving this excess mass inside cells costs
2−j. Summing the multiscale contributions up to the scale where the
expected number of samples per cell is order one gives 2−J with
2Jd≃n, hence n−1/d. Matching lower bounds follow from packing
arguments Dudley, 1969Fournier & Guillin, 2015Weed & Bach, 2019.
Proof
Let Φ(x)=k(x,⋅) be the feature map and
mα=EΦ(X). The reverse triangle inequality gives
Jensen’s inequality and k(x,x)≤κ2 give the displayed bound.
Proof Sketch
By the envelope theorem, the fluctuation of
Lcϵ with respect to its first marginal is controlled by
the class of entropic dual potentials. The soft c-transform smooths these
potentials at spatial scale ϵ for a quadratic-type cost.
Covering a bounded d-dimensional domain at this scale gives an effective
complexity of order ϵ−d/2. Standard Rademacher or Dudley entropy
bounds then give an empirical-process fluctuation of order
ϵ−d/2/n for each marginal. Applying the same estimate to
the three terms defining the debiased divergence gives the stated bound.
The interactive demo below is only a scaling guide: change the dimension to see the
exact-OT exponent flatten, and change ϵ to move the Sinkhorn bias
floor against its parametric fluctuation term.
Interactive panel. This exploratory panel is a scaling guide. Use dimension, sample size, and epsilon to compare statistical fluctuation with regularization bias.
Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3), 200–217.
Essid, M., & Pavon, M. (2019). Traversing the Schrödinger Bridge Strait: Robert Fortet’s Marvelous Proof Redux. Journal of Optimization Theory and Applications, 181, 23–60.
Léonard, C. (2019). Revisiting Fortet’s proof of existence of a solution to the Schrödinger system. arXiv Preprint arXiv:1904.13211.
Lemmens, B., & Nussbaum, R. (2012). Nonlinear Perron-Frobenius Theory (Vol. 189). Cambridge University Press.
Peyré, G. (2026). Robust Sublinear Convergence Rates for Iterative Bregman Projections. arXiv Preprint arXiv:2602.01372.
Altschuler, J., Weed, J., & Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Advances in Neural Information Processing Systems, 30, 1964–1974.
Dvurechensky, P., Gasnikov, A., & Kroshnin, A. (2018). Computational Optimal Transport: Complexity by Accelerated Gradient Descent Is Better Than by Sinkhorn’s Algorithm. In J. Dy & A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning (Vol. 80, pp. 1367–1376). PMLR.
Genevay, A., Chizat, L., Bach, F., Cuturi, M., & Peyré, G. (2019). Sample Complexity of Sinkhorn Divergences. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 89, 1574–1583.
Bigot, J., Cazelles, E., & Papadakis, N. (2019). Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications. Electronic Journal of Statistics, 13(2), 5120–5150. 10.1214/19-EJS1637
Franklin, J., & Lorenz, J. (1989). On the scaling of multidimensional matrices. Linear Algebra and Its Applications, 114, 717–735.
Birkhoff, G. (1957). Extensions of Jentzsch’s theorem. Transactions of the American Mathematical Society, 85(1), 219–227.
Samelson, H. (1957). On the Perron-Frobenius theorem. Michigan Mathematical Journal, 4(1), 57–59.
Janati, H., Muzellec, B., Peyré, G., & Cuturi, M. (2020). Entropic Optimal Transport between Unbalanced Gaussian Measures has a Closed Form. Advances in Neural Information Processing Systems.
Chizat, L., Delalande, A., & Vaškevičius, T. (2024). Sharper Exponential Convergence Rates for Sinkhorn’s Algorithm in Continuous Settings. arXiv Preprint arXiv:2407.01202.