Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Entropic Regularization: Convergence

Convergence for entropic optimal transport has two complementary meanings. At fixed marginals and fixed temperature, one studies how Sinkhorn iterates approach the regularized optimizer. When the marginals are empirical, one also studies how the regularized value and potentials behave as the number of samples grows toward a mean-field limit.

This chapter keeps these two scales together. It starts with algorithmic convergence of matrix scaling and soft transforms, then explains the Hilbert metric contraction mechanism, and finally turns to Gaussian closed forms and sample-complexity consequences. The recurring message is that entropy helps optimization and statistics by smoothing the problem, but the constants deteriorate as the temperature ϵ\epsilon goes to zero.

Sinkhorn Convergence: Bregman View

Sinkhorn can be read as alternating Bregman projections. The main geometric message is simple: each row or column rescaling is the KL projection onto one affine marginal constraint. The convergence mechanism then follows from the Pythagorean identity for Bregman divergences.

For simplicity, this section is written for discrete measures. The same ideas carry over to general measures, and the robust-rate section below expresses the constants through cost and potential oscillations rather than through the number of grid points.

Alternating KL Projections

The projection viewpoint explains Sinkhorn as repeated enforcement of one marginal constraint at a time. It is not specific to entropy, although KL is the case where the projections reduce to elementary row and column scalings.

Bregman divergences are useful because their geometry can encode constraints. A Legendre-type generator blows up, or has an infinite derivative, at the boundary of its domain. For negative entropy, positivity is therefore built into the divergence, so one projects onto affine marginal constraints without separately handling non-negativity.

Linear Tilts and Gibbs References

Adding a linear cost to a Bregman penalty merely shifts the reference point in dual coordinates. The usual Gibbs--KL reformulation is exactly the entropy specialization.

Proof

Subtract the two Bregman divergences:

BΦ(PQC)BΦ(PQ)=Φ(Q)Φ(QC),P+cst.B_\Phi(P\mid Q^C)-B_\Phi(P\mid Q) = \langle \nabla\Phi(Q)-\nabla\Phi(Q^C),P\rangle+\text{cst}.

Using Φ(Q)Φ(QC)=C/ϵ\nabla\Phi(Q)-\nabla\Phi(Q^C)=C/\epsilon and multiplying by ϵ\epsilon gives the claim.

For the negative entropy, take Q=abQ=a\otimes b. The tilted reference is

Ka,bϵ:=(ab)eC/ϵ.K_{a,b}^\epsilon \eqdef (a\otimes b)\odot e^{-C/\epsilon}.

Thus

P,C+ϵKL(Pab)=ϵKL(PKa,bϵ)+cst.\langle P,C\rangle+\epsilon\operatorname{KL}(P\mid a\otimes b) = \epsilon\operatorname{KL}(P\mid K_{a,b}^\epsilon)+\text{cst}.

On the transport polytope, scaling Ka,bϵK_{a,b}^\epsilon is equivalent to scaling the Gibbs kernel K=eC/ϵK=e^{-C/\epsilon} because the factors aia_i and bjb_j can be absorbed into the Sinkhorn scalings. The unique entropic optimizer is the KL projection of this tilted Gibbs reference onto the coupling constraints:

Pϵ=arg minPU(a,b)KL(PKa,bϵ).P_\epsilon = \argmin_{P\in\mathcal U(a,b)} \operatorname{KL}(P\mid K_{a,b}^\epsilon).

Cyclic Projection Convergence

The convergence mechanism is the classical one of Bregman projections Bregman, 1967.

Proof Sketch

For three interior points one has the Bregman three-point formula

BΦ(QP)=BΦ(QP+)+BΦ(P+P)+Φ(P+)Φ(P),QP+.B_\Phi(Q\mid P) = B_\Phi(Q\mid P^+) + B_\Phi(P^+\mid P) + \langle\nabla\Phi(P^+)-\nabla\Phi(P),Q-P^+\rangle .

If P+=ProjCBΦ(P)P^+=\operatorname{Proj}_{\mathcal C}^{B_\Phi}(P) and C\mathcal C is affine, the first-order condition cancels the last term for all QCQ\in\mathcal C. This gives the Pythagorean identity

BΦ(QP)=BΦ(QP+)+BΦ(P+P).B_\Phi(Q\mid P) = B_\Phi(Q\mid P^+)+B_\Phi(P^+\mid P).

Applying this identity at every half-step shows that BΦ(QZ)B_\Phi(Q\mid Z_\ell) decreases for every QC1C2Q\in\mathcal C_1\cap\mathcal C_2, and that the projection drops are summable. Compactness gives cluster points. The adjacent half-steps approach one another, so every cluster point belongs to both affine sets. The telescoped first-order conditions give the normal-cone optimality condition for the Bregman projection onto the intersection. Strict convexity gives uniqueness, hence the whole sequence converges.

Let

Ca1:={P:P1m=a},Cb2:={P:P1n=b}.\mathcal C_a^1 \eqdef \{P:P\mathbf 1_m=a\}, \qquad \mathcal C_b^2 \eqdef \{P:P^\top\mathbf 1_n=b\}.

Then U(a,b)=Ca1Cb2\mathcal U(a,b)=\mathcal C_a^1\cap\mathcal C_b^2, and KL cyclic projections read

P(2+1)=ProjCa1KL(P(2)),P(2+2)=ProjCb2KL(P(2+1)).P^{(2\ell+1)} = \operatorname{Proj}_{\mathcal C_a^1}^{\operatorname{KL}}(P^{(2\ell)}), \qquad P^{(2\ell+2)} = \operatorname{Proj}_{\mathcal C_b^2}^{\operatorname{KL}}(P^{(2\ell+1)}).

The preceding proposition applies with P0=Ka,bϵP_0=K_{a,b}^\epsilon and gives convergence to the entropic optimizer.

Row and Column Scalings

The projectors are explicit because they only rescale rows or columns.

Proof

Along each row or column vector, impose a fixed sum s>0s>0:

minp{KL(pq):p,1=s}.\min_p \left\{ \operatorname{KL}(p\mid q) : \langle p,\mathbf 1\rangle=s \right\}.

The Lagrange multiplier equation gives log(p/q)+λ1=0\log(p/q)+\lambda\mathbf 1=0, hence p=uqp=uq with u=eλ>0u=e^{-\lambda}>0. The constraint fixes u=s/iqiu=s/\sum_i q_i, which is exactly the scaling formula.

Defining

P(2):=diag(u())Kdiag(v()),P^{(2\ell)} \eqdef \operatorname{diag}(u^{(\ell)})K\operatorname{diag}(v^{(\ell)}),

the two projection steps are the usual Sinkhorn updates on the scaling vectors. In practice one stores the vectors and multiplies by the Gibbs kernel, often exploiting separable, sparse, low-rank or geometric structure.

The Bregman proof is geometric, but its direct finite-dimensional linear-rate constants can degrade with dimension and with small ϵ\epsilon. The robust dual analysis below gives a dimension-free qualitative message: before any local linear regime becomes visible, one can still guarantee an O(1/k)O(1/k) dual gap whose constants depend on the cost range and potential oscillation.

Sinkhorn Convergence: Monotone Point of View

There is another, older route to convergence, going back to Fortet’s proof of the Schrodinger system Fortet, 1940Essid & Pavon, 2019Léonard, 2019. It uses the order structure of soft transforms rather than an explicit contraction factor.

Proof Sketch

The soft cc-transform is order reversing. For example, if ggg\leq g', then

ϵloge(gc)/ϵdβϵloge(gc)/ϵdβ.-\epsilon\log\int e^{(g-c)/\epsilon}\,\d\beta \geq -\epsilon\log\int e^{(g'-c)/\epsilon}\,\d\beta .

The composition of two order-reversing transforms is order preserving. Soft transforms also commute with additive constants in projective form. Starting from a subsolution representative gives f0f1f_0\leq f_1, and order preservation gives representatives satisfying fkfk+1f_k\leq f_{k+1}. Soft-transform oscillation bounds, controlled by supcinfc\sup c-\inf c, prevent escape to infinity after normalization. Monotone convergence, compactness of equicontinuous soft transforms, and continuity of A\mathcal A give a fixed point. Uniqueness of entropic potentials up to constants identifies this fixed point with the Sinkhorn solution.

This proof is qualitative rather than quantitative, but it is conceptually useful: Sinkhorn is not only alternating projection or projective contraction; it is also a monotone fixed-point iteration on potential classes once constants are quotiented out.

This is the natural size for Sinkhorn potentials because adding constants changes their gauge but not the coupling.

Proof

Set a=inf(fg)a=\inf(f-g) and b=sup(fg)b=\sup(f-g). Then

g+afg+b.g+a\leq f\leq g+b.

If T\mathcal T is order preserving and additively homogeneous, applying T\mathcal T gives

T(g)+aT(f)T(g)+b.\mathcal T(g)+a \leq \mathcal T(f) \leq \mathcal T(g)+b.

Hence every value of T(f)T(g)\mathcal T(f)-\mathcal T(g) lies in [a,b][a,b], so its oscillation is at most ba=fgVb-a=\norm{f-g}_V. The order-reversing, anti-homogeneous case gives the same oscillation bound with the interval reversed.

Proof

The soft transform is order reversing and satisfies (g+λ)cˉ,ϵ=gcˉ,ϵλ(g+\lambda)^{\bar c,\epsilon}=g^{\bar c,\epsilon}-\lambda, and similarly for the other block. The topical-map proposition applies to each block, and the composition of two 1-Lipschitz maps is 1-Lipschitz.

Sinkhorn Convergence: Sublinear Robust Rate

Sinkhorn is cyclic coordinate ascent on the smooth dual objective; equivalently, it alternates KL projections on the two marginal constraint sets. The rate most useful for complexity estimates is the pre-asymptotic one: before a linear regime becomes visible, the dual objective gap decreases at order 1/k1/k. Related robust Bregman-projection rates are developed in Peyré, 2026Altschuler et al., 2017Dvurechensky et al., 2018, and statistical consequences of entropic smoothing in Genevay et al., 2019Bigot et al., 2019.

Proof

Let A={i:piqi}A=\{i:p_i\geq q_i\} and set a=iApia=\sum_{i\in A}p_i, b=iAqib=\sum_{i\in A}q_i. Then ab=12pq1a-b=\frac12\norm{p-q}_1. Data processing for relative entropy gives

KL(pq)alogab+(1a)log1a1b.\operatorname{KL}(p\mid q) \geq a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b}.

The binary relative entropy dominates 2(ab)22(a-b)^2 because its second derivative, after subtracting 2(ab)22(a-b)^2, is nonnegative. Therefore KL(pq)12pq12\operatorname{KL}(p\mid q)\geq\frac12\norm{p-q}_1^2.

Proof Sketch

The proof uses three ingredients. First, soft transforms bound the oscillation of normalized iterates and optima:

fkV+gkV+fV+gVCR.\norm{f_k}_V+\norm{g_k}_V + \norm{f^\star}_V+\norm{g^\star}_V \leq C R.

Second, each Sinkhorn half-step is an exact KL projection. The Pythagorean identity gives an ascent identity of the form

Dϵ(fk+1,gk+1)Dϵ(fk,gk)=ϵ[KL(ππk)KL(ππk+1)].\mathcal D_\epsilon(f_{k+1},g_{k+1}) - \mathcal D_\epsilon(f_k,g_k) = \epsilon\left[ \operatorname{KL}(\pi^\star\mid\pi_k) - \operatorname{KL}(\pi^\star\mid\pi_{k+1}) \right].

Pinsker’s inequality converts this KL drop into control of marginal residuals. Third, convexity of the exponential dual objective yields

Δk2CR2ϵ(Dϵ(fk+1,gk+1)Dϵ(fk,gk)).\Delta_k^2 \leq \frac{C R^2}{\epsilon} \left( \mathcal D_\epsilon(f_{k+1},g_{k+1}) - \mathcal D_\epsilon(f_k,g_k) \right).

Summing the reciprocal inequality

1Δk+11ΔkϵCR2\frac1{\Delta_{k+1}}-\frac1{\Delta_k} \geq \frac{\epsilon}{C R^2}

gives the displayed O(1/k)O(1/k) rate.

The same identity yields computable stopping diagnostics. The KL drops are exactly marginal defects: after a row update the row marginal is correct and the remaining drop is measured by the column marginal, and conversely after a column update. Marginal violations therefore monitor both feasibility and the remaining dual gap, up to the bounded-radius constant.

Sinkhorn Convergence: Linear Hilbert Metric Rate

Hilbert’s projective metric gives a complementary convergence mechanism. Instead of following objective values, it measures distances between positive scaling vectors modulo global multiplication. Positive kernels are contractions in this geometry, yielding a global linear convergence statement Franklin & Lorenz, 1989.

Multiplying both vectors by arbitrary positive constants does not change this quantity, so it is a distance only after passing to projective classes.

Proof

The map uloguu\mapsto\log u identifies R+,n/\RR_{+,*}^n/\sim with the quotient vector space Rn/span(1n)\RR^n/\operatorname{span}(\mathbf 1_n). The variation seminorm vanishes exactly on constant vectors, so it induces a norm on this quotient. Symmetry, the triangle inequality, separation and completeness follow from the corresponding finite-dimensional norm properties.

Hilbert’s metric was introduced by Birkhoff and Samelson to give quantitative proofs of the Perron--Frobenius theorem Birkhoff, 1957Samelson, 1957. Sinkhorn can be viewed as a nonlinear matrix-scaling analogue of this theory.

Proof Sketch

For a positive linear map AA on a cone, define its projective diameter

Δ(A):=supu,v>0H(Au,Av).\Delta(A) \eqdef \sup_{u,v>0}\mathsf H(Au,Av).

The Birkhoff--Hopf estimate states

H(Au,Av)tanh(Δ(A)/4)H(u,v).\mathsf H(Au,Av) \leq \tanh(\Delta(A)/4)\mathsf H(u,v).

For a positive matrix KK, the projective diameter is

Δ(K)=logη(K).\Delta(K)=\log\eta(K).

Substituting into tanh(Δ/4)\tanh(\Delta/4) gives λ(K)=(η(K)1)/(η(K)+1)\lambda(K)=(\sqrt{\eta(K)}-1)/(\sqrt{\eta(K)}+1).

Proof Sketch

Using the invariance of Hilbert’s metric under coordinatewise inversion and positive multiplication,

H ⁣(aKv(),aKv)=H(Kv(),Kv)λ(K)H(v(),v).\mathsf H\!\left( \frac{a}{Kv^{(\ell)}}, \frac{a}{Kv^\star} \right) = \mathsf H(Kv^{(\ell)},Kv^\star) \leq \lambda(K)\mathsf H(v^{(\ell)},v^\star).

The same estimate holds for the other block. Applying it twice gives the linear rate. The a posteriori marginal-residual bounds follow from the triangle inequality and the fixed-point equations. The logarithmic primal bound follows because logP()logP\log P^{(\ell)}-\log P^\star is the sum of the two logarithmic scaling errors.

For bounded cost, the contraction can also be read directly on the dual potentials. With Kϵ=ec/ϵK_\epsilon=e^{-c/\epsilon} and R=supcinfcR=\sup c-\inf c,

λ(Kϵ)tanh(R/(2ϵ)).\lambda(K_\epsilon) \leq \tanh(R/(2\epsilon)).

This is a useful global statement, but it also explains a limitation: the worst-case factor becomes exponentially close to one as ϵ0\epsilon\to0. In practice, one often observes a much better local linear regime after enough iterations. The same Hilbert-metric mechanism extends beyond finite matrices to positive integral operators under compactness and positivity assumptions, but Gaussian distributions with quadratic cost require a different approach.

The interactive demo below separates the observed marginal residual from two theoretical guides. The O(1/k)O(1/k) curve captures the robust pre-asymptotic message, while the Hilbert curve shows how pessimistic the global contraction bound becomes when the Gibbs kernel is close to degenerate.

Interactive panel. This exploratory panel plots convergence guides for Sinkhorn iterations. Use epsilon and condition controls to compare observed residuals with pessimistic global contraction bounds.

Entropic Optimal Transport Between Gaussians

Gaussian marginals provide an explicit finite-dimensional model of Sinkhorn’s behavior. The soft cc-transform preserves quadratic potentials, the optimal entropic coupling is Gaussian, and the value can be written with matrix square roots Janati et al., 2020. This is the entropic counterpart of the Gaussian W2\Wass_2 and Bures formula.

Proof

The exponent is the sum of a quadratic polynomial in yy and the logarithm of the Gaussian density of β\beta. Completing the square in yy evaluates the integral as a positive constant times the exponential of a quadratic polynomial in xx. Taking ϵlog-\epsilon\log therefore gives a quadratic polynomial.

Proof Sketch

For any coupling, replace (X,Y)(X,Y) by the Gaussian vector with the same mean and covariance. The quadratic cost is unchanged, and Gaussian laws maximize entropy at fixed covariance, so the relative entropy cannot increase. It is therefore enough to optimize over Gaussian couplings.

Writing the cross-covariance as K=Σα1/2SΣβ1/2K=\Sigma_\alpha^{1/2}S\Sigma_\beta^{1/2}, the block covariance constraint is equivalent to the singular values of SS being at most one. The cost depends on KK through 2tr(K)-2\operatorname{tr}(K), while

KL(παβ)=12logdet(ISS).\operatorname{KL}(\pi\mid\alpha\otimes\beta) = -\frac12\log\det(I-SS^\top).

Von Neumann’s trace inequality aligns SS with the singular vectors of Σα1/2Σβ1/2\Sigma_\alpha^{1/2}\Sigma_\beta^{1/2}, so the problem separates into

min0s<12σisϵ2log(1s2).\min_{0\leq s<1} -2\sigma_i s - \frac{\epsilon}{2}\log(1-s^2).

The first-order condition is 2σi=ϵs/(1s2)2\sigma_i=\epsilon s/(1-s^2), whose positive solution is the displayed sis_i.

Proof Sketch

The raw Gaussian entropic value is the squared mean displacement plus two trace terms and the spectral sum iψϵ(σi(Σα,Σβ))\sum_i\psi_\epsilon(\sigma_i(\Sigma_\alpha,\Sigma_\beta)). Applying the same formula to the self-costs (α,α)(\alpha,\alpha) and (β,β)(\beta,\beta) replaces the cross singular values by the eigenvalues of Σα\Sigma_\alpha and Σβ\Sigma_\beta.

In the debiased polarization formula, the trace terms cancel:

trΣα+trΣβ12(2trΣα)12(2trΣβ)=0.\operatorname{tr}\Sigma_\alpha+\operatorname{tr}\Sigma_\beta - \frac12(2\operatorname{tr}\Sigma_\alpha) - \frac12(2\operatorname{tr}\Sigma_\beta) =0.

The polarization of the squared mean terms leaves mαmβ2\norm{m_\alpha-m_\beta}^2. Finally, τϵ(r)1\tau_\epsilon(r)\to1 and ϵlog(1τϵ(r)2)0\epsilon\log(1-\tau_\epsilon(r)^2)\to0, so ψϵ(r)2r\psi_\epsilon(r)\to-2r. The covariance limit is therefore

trΣ+trΛ2iσi(Σ,Λ),\operatorname{tr}\Sigma+\operatorname{tr}\Lambda - 2\sum_i\sigma_i(\Sigma,\Lambda),

which is the Bures--Wasserstein covariance formula.

The controls below expose exactly the quantities in the formula: ϵ\epsilon sets the singular-value shrinkage, anisotropy changes the eigenvalues, and the angle changes the covariance misalignment.

Interactive panel. This exploratory panel exposes the Gaussian formula directly. Use epsilon, anisotropy, and angle to see how entropic shrinkage changes the covariance term.

Proof

Completing the square in

exp ⁣(qy2(xy)2ϵ)dN(0,1)(y)\int \exp\!\left( \frac{q y^2-(x-y)^2}{\epsilon} \right) \,\d\mathcal N(0,1)(y)

gives the coefficient Tϵ(q)T_\epsilon(q). The fixed-point equation q=11/Aq_\star=1-1/A_\star, together with q=1+ϵ/2Aq_\star=1+\epsilon/2-A_\star, gives

A2ϵ2A1=0.A_\star^2-\frac{\epsilon}{2}A_\star-1=0.

The positive solution is the displayed AA_\star. Since

Tϵ(q)=1(1q+ϵ/2)2,T_\epsilon'(q) = -\frac{1}{(1-q+\epsilon/2)^2},

the derivative of the full-cycle map at the fixed point is Tϵ(q)2=A4T_\epsilon'(q_\star)^2=A_\star^{-4}.

This scalar calculation illustrates the general Gaussian convergence picture of Chizat, Delalande and Vaskevicius Chizat et al., 2024: the rate improves when ϵ\epsilon is large or the covariance scales overlap well, and deteriorates in the small-temperature limit where the entropic coupling approaches a deterministic Brenier map.

Sample Complexity

This section separates two statistical regimes. Exact OT resolves geometry at all spatial scales and pays dimension-dependent empirical rates. Fixed temperature Sinkhorn divergences smooth the dual potentials and recover parametric fluctuations, at the price of regularization bias.

The sample complexity of unregularized OT suffers from the curse of dimensionality. Entropic regularization changes the picture: for a fixed ϵ>0\epsilon>0, Sinkhorn divergences have parametric n1/2n^{-1/2} statistical rates, although the constant deteriorates when ϵ0\epsilon\to0 Genevay et al., 2019Bigot et al., 2019. Related two-sample-testing viewpoints are developed in Ramdas et al., 2017, and the large-ϵ\epsilon kernel limit connects to classical MMD tests Gretton et al., 2012.

<IPython.core.display.Image object>

Empirical fluctuations in dimensions three and six. For each sample size nn, two independent empirical measures are drawn from the same standard Gaussian law. Exact OT follows a slower dimension-dependent scale, while MMD and the fixed-ϵ\epsilon Sinkhorn divergence behave closer to the parametric n1/2n^{-1/2} guide. This is a statistical illustration, not a solver benchmark.

Proof Sketch

By the triangle inequality,

Wp(α^n,β^m)Wp(α,β)Wp(α^n,α)+Wp(β^m,β).\abs{\Wass_p(\hat\alpha_n,\hat\beta_m)-\Wass_p(\alpha,\beta)} \leq \Wass_p(\hat\alpha_n,\alpha) + \Wass_p(\hat\beta_m,\beta).

For the one-sample term, partition [0,1]d[0,1]^d into dyadic cubes. At scale 2j2^{-j}, empirical mass fluctuations over the cells are of order n1/22jd/2n^{-1/2}2^{jd/2}, while moving this excess mass inside cells costs 2j2^{-j}. Summing the multiscale contributions up to the scale where the expected number of samples per cell is order one gives 2J2^{-J} with 2Jdn2^{Jd}\simeq n, hence n1/dn^{-1/d}. Matching lower bounds follow from packing arguments Dudley, 1969Fournier & Guillin, 2015Weed & Bach, 2019.

Proof

Let Φ(x)=k(x,)\Phi(x)=k(x,\cdot) be the feature map and mα=EΦ(X)m_\alpha=\mathbb E\Phi(X). The reverse triangle inequality gives

MMDk(α^n,β^m)MMDk(α,β)MMDk(α^n,α)+MMDk(β^m,β).\abs{ \operatorname{MMD}_k(\hat\alpha_n,\hat\beta_m) - \operatorname{MMD}_k(\alpha,\beta) } \leq \operatorname{MMD}_k(\hat\alpha_n,\alpha) + \operatorname{MMD}_k(\hat\beta_m,\beta).

Independence cancels cross terms after taking squared norms and expectation:

EMMDk(α^n,α)2=1nEΦ(X)mαHk2=1n(Ek(X,X)Ek(X,X)).\mathbb E\operatorname{MMD}_k(\hat\alpha_n,\alpha)^2 = \frac1n \mathbb E\norm{\Phi(X)-m_\alpha}_{\mathcal H_k}^2 = \frac1n \left( \mathbb E k(X,X)-\mathbb E k(X,X') \right).

Jensen’s inequality and k(x,x)κ2k(x,x)\leq\kappa^2 give the displayed bound.

Proof Sketch

By the envelope theorem, the fluctuation of Lcϵ\mathcal L_c^\epsilon with respect to its first marginal is controlled by the class of entropic dual potentials. The soft cc-transform smooths these potentials at spatial scale ϵ\sqrt\epsilon for a quadratic-type cost. Covering a bounded dd-dimensional domain at this scale gives an effective complexity of order ϵd/2\epsilon^{-d/2}. Standard Rademacher or Dudley entropy bounds then give an empirical-process fluctuation of order ϵd/2/n\epsilon^{-d/2}/\sqrt n for each marginal. Applying the same estimate to the three terms defining the debiased divergence gives the stated bound.

The interactive demo below is only a scaling guide: change the dimension to see the exact-OT exponent flatten, and change ϵ\epsilon to move the Sinkhorn bias floor against its parametric fluctuation term.

Interactive panel. This exploratory panel is a scaling guide. Use dimension, sample size, and epsilon to compare statistical fluctuation with regularization bias.

References
  1. Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3), 200–217.
  2. Fortet, R. (1940). Résolution d’un système d’équations de M. Schrödinger. Journal de Mathématiques Pures et Appliquées, 19(1–4), 83–105. https://www.numdam.org/item/JMPA_1940_9_19_1-4_83_0/
  3. Essid, M., & Pavon, M. (2019). Traversing the Schrödinger Bridge Strait: Robert Fortet’s Marvelous Proof Redux. Journal of Optimization Theory and Applications, 181, 23–60.
  4. Léonard, C. (2019). Revisiting Fortet’s proof of existence of a solution to the Schrödinger system. arXiv Preprint arXiv:1904.13211.
  5. Lemmens, B., & Nussbaum, R. (2012). Nonlinear Perron-Frobenius Theory (Vol. 189). Cambridge University Press.
  6. Peyré, G. (2026). Robust Sublinear Convergence Rates for Iterative Bregman Projections. arXiv Preprint arXiv:2602.01372.
  7. Altschuler, J., Weed, J., & Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Advances in Neural Information Processing Systems, 30, 1964–1974.
  8. Dvurechensky, P., Gasnikov, A., & Kroshnin, A. (2018). Computational Optimal Transport: Complexity by Accelerated Gradient Descent Is Better Than by Sinkhorn’s Algorithm. In J. Dy & A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning (Vol. 80, pp. 1367–1376). PMLR.
  9. Genevay, A., Chizat, L., Bach, F., Cuturi, M., & Peyré, G. (2019). Sample Complexity of Sinkhorn Divergences. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 89, 1574–1583.
  10. Bigot, J., Cazelles, E., & Papadakis, N. (2019). Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications. Electronic Journal of Statistics, 13(2), 5120–5150. 10.1214/19-EJS1637
  11. Franklin, J., & Lorenz, J. (1989). On the scaling of multidimensional matrices. Linear Algebra and Its Applications, 114, 717–735.
  12. Birkhoff, G. (1957). Extensions of Jentzsch’s theorem. Transactions of the American Mathematical Society, 85(1), 219–227.
  13. Samelson, H. (1957). On the Perron-Frobenius theorem. Michigan Mathematical Journal, 4(1), 57–59.
  14. Janati, H., Muzellec, B., Peyré, G., & Cuturi, M. (2020). Entropic Optimal Transport between Unbalanced Gaussian Measures has a Closed Form. Advances in Neural Information Processing Systems.
  15. Chizat, L., Delalande, A., & Vaškevičius, T. (2024). Sharper Exponential Convergence Rates for Sinkhorn’s Algorithm in Continuous Settings. arXiv Preprint arXiv:2407.01202.