Entropic Regularization: Convergence

This chapter focuses on algorithmic convergence for entropic optimal transport: the marginals and the temperature are fixed, and one asks how Sinkhorn iterates, soft transforms and related monotone scaling procedures approach their regularized fixed point. Statistical convergence is a separate question, because the empirical marginals then change with the sample size; it is treated separately.

The chapter revisits Sinkhorn convergence through several complementary lenses. Bregman projections explain the alternating-projection geometry, Fortet’s order argument gives qualitative fixed-point convergence, and an order-theoretic M-function viewpoint explains why Sinkhorn-like scaling can still converge even when the equations are no longer the gradient of a convex potential. Robust Bregman estimates then give a non-asymptotic dual-gap bound, while Hilbert’s metric gives a clean linear contraction when the kernel is uniformly positive. The last sections discuss Gaussian closed forms and continuous $\varepsilon$-Sinkhorn flows as model cases where the fixed-point structure becomes explicit.

# Notebook setup: locate the book's helper module and the figure thumbnails.
from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

# Search the working directory and nearby folders for the myst/ directory.
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        sys.path.insert(0, str(myst_dir))  # make ot4ml_web importable
        break

if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

# Pre-rendered figures live under <repo>/notebooks-figures/thumbnails.
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"

def show_book_figure(name, width=760):
    """Display the pre-rendered PNG thumbnail called `name`."""
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))

Sinkhorn Convergence: Bregman Point of View

Sinkhorn can be read as alternating Bregman projections. The main geometric message is simple: each row or column rescaling is the KL projection onto one affine marginal constraint. The convergence mechanism then follows from the Pythagorean identity for Bregman divergences.

For simplicity, this section is written for discrete measures. The same ideas carry over to general measures, and the robust-rate section below expresses the constants through cost and potential oscillations rather than through the number of grid points.

Alternating KL Projections

The projection viewpoint explains Sinkhorn as repeated enforcement of one marginal constraint at a time. It is not specific to entropy, although KL is the case where the projections reduce to elementary row and column scalings.

The following matrix construction is the finite-dimensional counterpart of the measure-valued Definition: Measure Bregman Divergence.

For the quadratic generator $\Phi(\P)=\frac12\|\P\|_{\mathrm F}^2$ on $\mathbb R^{n\times m}$ , one recovers half the squared Euclidean distance, $B_\Phi(\P\mid Q)=\frac12\|\P-Q\|_{\mathrm F}^2$ . For the negative entropy $\Phi(\P)=\sum_{i,j}\P_{i,j}\log \P_{i,j}$ on the nonnegative orthant, extended by $+\infty$ outside it, one obtains $B_\Phi(\P\mid Q)=\operatorname{KL}(\P\mid Q)$ . For $\P\geq0$ and $Q>0$ , this identity is understood through the lower-semicontinuous extension, with $0\log0=0$ .

Bregman divergences are useful because their geometry can encode constraints. A Legendre-type generator blows up, or has an infinite derivative, at the boundary of its domain. For negative entropy, positivity is therefore built into the divergence, so one projects onto affine marginal constraints without separately handling non-negativity.

Linear Tilts and Gibbs References

Adding a linear cost to a Bregman penalty merely shifts the reference point in dual coordinates. The usual Gibbs--KL reformulation is exactly the entropy specialization.

For the negative entropy, take $Q=a\otimes b$ . The tilted reference is

K_{a,b}^\epsilon \eqdef (a\otimes b)\odot e^{-C/\epsilon}.

(5)

Thus

\langle \P,C\rangle+\epsilon\operatorname{KL}(\P\mid a\otimes b) = \epsilon\operatorname{KL}(\P\mid K_{a,b}^\epsilon)+\text{cst}.

(6)

On the transport polytope, scaling $K_{a,b}^\epsilon$ is equivalent to scaling the Gibbs kernel $K=e^{-C/\epsilon}$ because the factors $a_i$ and $b_j$ can be absorbed into the Sinkhorn scalings. The unique entropic optimizer is the KL projection of this tilted Gibbs reference onto the coupling constraints:

\P_\epsilon = \argmin_{\P\in\mathcal U(a,b)} \operatorname{KL}(\P\mid K_{a,b}^\epsilon).

(7)
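The Gibbs--KL identity (6) can be checked numerically. The sketch below is only an illustration under stated assumptions: the sizes, the random data and the helper names `kl` and `offset` are hypothetical, and `kl` is taken to be the generalized KL divergence $\sum P\log(P/Q)-P+Q$. With that convention the constant in (6) equals $\epsilon(1-\sum_{i,j}[K_{a,b}^\epsilon]_{i,j})$ whenever $a\otimes b$ has unit mass.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 4, 5, 0.3                      # hypothetical sizes and temperature
a = rng.random(n); a /= a.sum()
b = rng.random(m); b /= b.sum()
C = rng.random((n, m))

def kl(P, Q):
    # Generalized KL divergence between entrywise positive matrices.
    return float(np.sum(P * np.log(P / Q) - P + Q))

K_ab = np.outer(a, b) * np.exp(-C / eps)   # tilted reference of (5)

def offset(P):
    # Left-hand side of (6) minus eps * KL(P | K_ab); should not depend on P.
    lhs = np.sum(P * C) + eps * kl(P, np.outer(a, b))
    return lhs - eps * kl(P, K_ab)

# Two unrelated positive matrices give the same offset.
P1 = rng.random((n, m)); P1 /= P1.sum()
P2 = rng.random((n, m)); P2 /= P2.sum()
offsets = (offset(P1), offset(P2))
expected = eps * (1.0 - K_ab.sum())
```

Because the offset is independent of $\P$, minimizing either side of (6) over $\mathcal U(a,b)$ selects the same plan, which is the content of (7).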

Cyclic Projection Convergence

Given two closed convex constraint sets $\mathcal C_1$ and $\mathcal C_2$ , the cyclic method projects first onto $\mathcal C_1$ , then onto $\mathcal C_2$ , and repeats. Full iterates satisfy the second constraint and half-iterates satisfy the first; both are attained together only in the limit.

The following algorithm records the general cyclic iteration. The two nonnegative constraint defects vanish on their respective sets and provide a practical stopping test.

Algorithm: Cyclic Bregman Projections

Input: Closed convex sets $\mathcal C_1,\mathcal C_2$ , Bregman divergence $B_\Phi$ , interior point $\P^{(0)}$ , constraint defects $\mathrm{def}_{\mathcal C_1},\mathrm{def}_{\mathcal C_2}$ , tolerance $\mathrm{tol}$ , and iteration budget $L\geq1$ .

Output: Approximate point in $\mathcal C_1\cap\mathcal C_2$ when the intersection is nonempty.

Initialize: Set $r^{(0)}=+\infty$ and $\ell=0$ .

While $r^{(\ell)}>\mathrm{tol}$ and $\ell<L$ do:

Set $\ell\leftarrow\ell+1$ .
$\P^{(\ell-1/2)}= \operatorname{Proj}_{\mathcal C_1}^{B_\Phi}(\P^{(\ell-1)})$ and $\P^{(\ell)}= \operatorname{Proj}_{\mathcal C_2}^{B_\Phi}(\P^{(\ell-1/2)})$ .
Set $r^{(\ell)}=\max\{\mathrm{def}_{\mathcal C_1}(\P^{(\ell)}), \mathrm{def}_{\mathcal C_2}(\P^{(\ell)})\}$ .

Return $\P^{(\ell)}$ .
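The loop above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `cyclic_bregman`, its argument names, and the small KL instance used to exercise it (row and column rescalings with an $\ell^1$ defect) are assumptions made for the example.

```python
import numpy as np

def cyclic_bregman(P0, proj1, proj2, defect1, defect2, tol=1e-9, budget=1000):
    # Alternate the two projectors until both constraint defects fall below tol.
    P, r, l = P0, np.inf, 0
    while r > tol and l < budget:
        l += 1
        P_half = proj1(P)          # half-iterate, lies in C_1
        P = proj2(P_half)          # full iterate, lies in C_2
        r = max(defect1(P), defect2(P))
    return P, l

# Hypothetical KL instance: C_1 fixes the row sums, C_2 the column sums.
a = np.array([0.2, 0.3, 0.5])
b = np.array([0.6, 0.4])
C = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
K = np.exp(-C / 0.5)

proj_rows = lambda P: P * (a / P.sum(axis=1))[:, None]
proj_cols = lambda P: P * (b / P.sum(axis=0))[None, :]
def_rows = lambda P: np.abs(P.sum(axis=1) - a).sum()
def_cols = lambda P: np.abs(P.sum(axis=0) - b).sum()

P, n_iter = cyclic_bregman(K, proj_rows, proj_cols, def_rows, def_cols)
```

Since the returned iterate has just been projected onto the second set, only the first defect is nonzero in exact arithmetic; the maximum is kept to mirror the algorithm statement.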

The convergence mechanism is the classical one of Bregman projections (Bregman, 1967). General convex constraints determine a feasible limit, while affine constraints give the stronger closest-point characterization.

Proposition: Convergence of Cyclic Bregman Projections

Let $\Phi$ be a Legendre generator on a finite-dimensional convex domain, and let $\mathcal C_1,\mathcal C_2$ be closed convex sets whose intersection meets the interior of the domain. Generate $(\P^{(\ell-1/2)},\P^{(\ell)})$ with Algorithm: Cyclic Bregman Projections from an interior point $\P^{(0)}$ . Assume that the projections are well-defined and that all half-steps remain in a compact subset of the interior. Then there exists $\bar\P\in\mathcal C_1\cap\mathcal C_2$ such that $\P^{(\ell)}\to\bar\P$ and $\P^{(\ell-1/2)}\to\bar\P$ .

If, in addition, $\mathcal C_1$ and $\mathcal C_2$ are affine, then

\bar\P = \operatorname{Proj}_{\mathcal C_1\cap\mathcal C_2}^{B_\Phi}(\P^{(0)}).

(8)

Row and Column Scalings

We now apply the general Bregman-projection framework to entropic OT. The two affine constraints impose the source and target marginals, while negative entropy turns their cyclic projections into explicit row and column rescalings. Denote these constraint sets by

\mathcal C_a^1 \eqdef \{\P\in\mathbb R^{n\times m}:\P\mathbf 1_m=a\}, \qquad \mathcal C_b^2 \eqdef \{\P\in\mathbb R^{n\times m}:\P^\top\mathbf 1_n=b\}.

(13)

Positivity is a separate constraint, so

\mathcal U(a,b) = \mathcal C_a^1\cap\mathcal C_b^2\cap\mathbb R_+^{n\times m}.

(14)

For negative entropy, however, the Legendre generator has effective domain $\Omega=\mathbb R_+^{n\times m}$ and is $+\infty$ outside $\Omega$ . Thus $\operatorname{Proj}_{\mathcal C_i}^{\operatorname{KL}}$ already minimizes over $\mathcal C_i\cap\Omega$ : the generator enforces positivity, which need not be repeated in the projector notation. The cyclic iteration therefore specializes to

\P^{(2\ell+1)} = \operatorname{Proj}_{\mathcal C_a^1}^{\operatorname{KL}}(\P^{(2\ell)}), \qquad \P^{(2\ell+2)} = \operatorname{Proj}_{\mathcal C_b^2}^{\operatorname{KL}}(\P^{(2\ell+1)}).

(15)

These two projectors are explicit: they rescale respectively the rows and the columns.

For positive histograms and a strictly positive Gibbs kernel, iterative proportional fitting keeps the half-steps positive and converges (Rüschendorf, 1995; Rüschendorf & Thomsen, 1998). Since the two marginal sets are affine, Proposition: Convergence of Cyclic Bregman Projections, applied with $\P^{(0)}=K_{a,b}^\epsilon$, identifies the limit as the entropic optimizer.

Defining

\P^{(2\ell)} \eqdef \operatorname{diag}(u^{(\ell)})K\operatorname{diag}(v^{(\ell)}),

(18)

the two projection steps are the usual Sinkhorn updates on the scaling vectors. In practice one stores the vectors and multiplies by the Gibbs kernel, often exploiting separable, sparse, low-rank or geometric structure.
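The equivalence between matrix projections and vector updates can be illustrated as follows. This is a sketch under assumptions: the function name `sinkhorn_scalings`, the initialization $v^{(0)}=\mathbf 1$ (so that the matrix iteration starts from $K$), and the random test data are all hypothetical choices.

```python
import numpy as np

def sinkhorn_scalings(a, b, K, n_cycles):
    # Store only the scaling vectors; the plan is diag(u) K diag(v).
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_cycles):
        u = a / (K @ v)        # KL projection onto the row constraint
        v = b / (K.T @ u)      # KL projection onto the column constraint
    return u, v

rng = np.random.default_rng(0)
a = rng.random(4); a /= a.sum()
b = rng.random(6); b /= b.sum()
K = np.exp(-rng.random((4, 6)) / 0.5)

u, v = sinkhorn_scalings(a, b, K, n_cycles=7)
P_from_vectors = u[:, None] * K * v[None, :]

# The same seven cycles carried out directly on the matrix.
P_matrix = K.copy()
for _ in range(7):
    P_matrix *= (a / P_matrix.sum(axis=1))[:, None]
    P_matrix *= (b / P_matrix.sum(axis=0))[None, :]
```

The vector form needs two matrix-vector products per cycle and never forms the plan, which is what makes structured kernels exploitable.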

The Bregman proof is geometric, but its direct finite-dimensional linear-rate constants can degrade with dimension and with small $\epsilon$ . The robust dual analysis below gives a dimension-free qualitative message: before any local linear regime becomes visible, one can still guarantee an $O(1/\ell)$ dual gap whose constants depend on the cost range and potential oscillation.

Other Divergences

The simplicity of the KL construction relies on negative entropy encoding nonnegativity through its effective domain. Suppose instead that $\Phi$ is Legendre on the full space $\mathbb R^{n\times m}$ . Positivity must then be included explicitly in the marginal constraints:

\mathcal C_{a,+}^1 \eqdef \mathcal C_a^1\cap\mathbb R_+^{n\times m}, \qquad \mathcal C_{b,+}^2 \eqdef \mathcal C_b^2\cap\mathbb R_+^{n\times m}.

(19)

Thus $\mathcal U(a,b)=\mathcal C_{a,+}^1\cap\mathcal C_{b,+}^2$ . These sets are convex but no longer affine, and their $B_\Phi$ -projectors generally have no closed-form scaling formula. The affine clause of Proposition: Convergence of Cyclic Bregman Projections no longer applies: ordinary cyclic projections may converge to a feasible point without computing $\operatorname{Proj}_{\mathcal U(a,b)}^{B_\Phi}(\P^{(0)})$ .

For the Euclidean generator $\Phi(\P)=\frac12\|\P\|_{\mathrm F}^2$ , projection onto $\mathcal C_{a,+}^1$ is a rowwise simplex projection:

\left[ \operatorname{Proj}_{\mathcal C_{a,+}^1}^{B_\Phi}(Q) \right]_{i,j} = (Q_{i,j}-\tau_i)_+, \qquad \sum_j(Q_{i,j}-\tau_i)_+=a_i.

(20)

The column formula is analogous. These thresholds are not multiplicative scalings, but sorting or selection computes them efficiently (Blondel et al., 2018).
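A sorting-based computation of the thresholds $\tau_i$ in (20) might look like the sketch below. It uses the standard sort-and-cumulative-sum simplex projection, which is an assumption about the method rather than something taken from the cited reference; the helper names and test data are hypothetical, and each $a_i$ is assumed strictly positive.

```python
import numpy as np

def project_scaled_simplex(q, mass):
    # Euclidean projection of q onto {x >= 0, sum(x) = mass}, with mass > 0.
    u = np.sort(q)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - mass
    k = np.arange(1, q.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index kept in the support
    tau = css[rho] / (rho + 1)                # threshold tau_i of (20)
    return np.maximum(q - tau, 0.0)

def project_rows(Q, a):
    # Row-wise projection onto C^1_{a,+}: row i is nonnegative and sums to a_i.
    return np.stack([project_scaled_simplex(Q[i], a[i]) for i in range(Q.shape[0])])

Q = np.random.default_rng(3).normal(size=(3, 4))
a = np.array([0.2, 0.3, 0.5])
P = project_rows(Q, a)
```

Entries below the row threshold are set exactly to zero, which is the source of the sparsity mentioned for quadratic regularizers.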

To recover the closest-point projection onto the intersection, Dykstra’s algorithm alternates the same Bregman projectors while carrying correction variables in dual coordinates. Under the standard finite-dimensional assumptions, it converges to $\operatorname{Proj}_{\mathcal C_{a,+}^1\cap\mathcal C_{b,+}^2}^{B_\Phi}(\P^{(0)})$ (Dykstra, 1985; Censor & Reich, 1998; Bauschke & Lewis, 2000). Benamou, Carlier, Cuturi, Nenna, and Peyré developed this construction systematically for regularized transport (Benamou et al., 2015). Through Fenchel duality, Bregman--Dykstra is block-coordinate ascent on the marginal dual potentials; the correction variables store the dual memory absent from plain cyclic projection (Bregman et al., 1999).

For the quadratic generator and the product reference $\xi=a\otimes b$ ,

B_\Phi(\P\mid\xi) = \frac12\sum_{i,j}(\P_{i,j}-a_i b_j)^2.

(21)

This is closely related, but not identical, to the quadratic $\phi$ -divergence generated by $\phi(r)=\frac12(r-1)^2$ :

D_\phi(\P\mid a\otimes b) = \frac12\sum_{i,j}\frac{(\P_{i,j}-a_i b_j)^2}{a_i b_j},

(22)

for positive $a_i,b_j$ . The difference is the fixed inverse product-marginal weighting in the second quadratic norm. Both models yield positive-part thresholding and potentially sparse plans. The section on Other Convex Regularizers develops their dual laws and alternating dual-maximization algorithms.

Sinkhorn Convergence: Monotone Point of View

This section isolates the order structure shared by generalized Sinkhorn updates. It first introduces the abstract language of topical maps, then applies it to the generalized soft transforms of Other Convex Regularizers, and finally derives a monotone fixed-point argument valid beyond KL regularization.

Variation Seminorm and Topical Maps

Potentials are defined only up to additive constants, so their natural gauge-invariant size is their oscillation.

The order maps relevant to Sinkhorn commute with this additive gauge.

Generalized Sinkhorn Maps

For the $\phi$ -divergence regularized transport problem, the one-sided block updates are expressed using the Legendre transform $\phi^*$ defined in (47). They are the generalized soft $c$ -transforms of (121), and their alternating dual-maximization scheme is (124). After choosing a consistent extremal minimizer whenever a scalar update is not unique, one full cycle acting on the first potential is

\mathcal A_\phi(f) \eqdef \big(f^{c,\epsilon,\phi}\big)^{\bar c,\epsilon,\phi}.

(27)

For the KL generator, this is the usual double Sinkhorn map.

Proposition: Generalized Sinkhorn Maps are Topical

Assume that the scalar minimizer sets in (121) are nonempty compact intervals, and select either their smallest elements everywhere or their largest elements everywhere. Each selected one-sided transform is order reversing and additively anti-homogeneous:

g\leq g' \Longrightarrow g^{\bar c,\epsilon,\phi}\geq {g'}^{\bar c,\epsilon,\phi}, \qquad (g+s)^{\bar c,\epsilon,\phi} = g^{\bar c,\epsilon,\phi}-s,

(28)

and likewise for $f\mapsto f^{c,\epsilon,\phi}$ . Consequently, $\mathcal A_\phi$ is topical and

\norm{\mathcal A_\phi(f)-\mathcal A_\phi(h)}_V \leq \norm{f-h}_V.

(29)

No selection convention is needed when the scalar minimizers are unique.

Topicality gives nonexpansiveness, not a strict contraction. For KL, positivity of the Gibbs kernel supplies the stronger Hilbert-metric contraction proved in Sinkhorn Convergence: Linear Hilbert Metric Rate; no comparable strict factor follows from the order properties alone for a general $\phi$ .
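For the KL generator the three properties (order preservation, additive homogeneity, nonexpansiveness in the variation seminorm) are easy to test numerically. The sketch below assumes the discrete soft transform $f_i=-\epsilon\log\sum_j b_j e^{(g_j-C_{ij})/\epsilon}$ and the oscillation $\norm{f}_V=\max f-\min f$; the names `soft_transform`, `cycle` and `var_norm` and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 5, 6, 0.5
a = rng.random(n); a /= a.sum()
b = rng.random(m); b /= b.sum()
C = rng.random((n, m))

def soft_transform(g, cost, weights, eps):
    # Entry i is -eps * log sum_j weights_j * exp((g_j - cost_ij) / eps).
    return -eps * np.log(np.exp((g[None, :] - cost) / eps) @ weights)

def cycle(f):
    # One full KL cycle acting on the first potential.
    g = soft_transform(f, C.T, a, eps)
    return soft_transform(g, C, b, eps)

def var_norm(f):
    # Variation seminorm (oscillation), blind to additive constants.
    return f.max() - f.min()

f = rng.normal(size=n)
h = f + rng.random(n)                 # h >= f entrywise

shift_gap = np.abs(cycle(f + 0.7) - (cycle(f) + 0.7)).max()
order_ok = bool(np.all(cycle(h) >= cycle(f) - 1e-12))
nonexpansive = var_norm(cycle(f) - cycle(h)) <= var_norm(f - h) + 1e-12
```

Each one-sided transform reverses the order and flips the sign of additive shifts, so their composition preserves both.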

Monotone Convergence for Generalized Sinkhorn

The following order argument traces back to Fortet’s proof of the Schrödinger system (Fortet, 1940; Essid & Pavon, 2019; Léonard, 2019). It is not based on an optimization principle: a subsolution generates an increasing orbit, while a fixed point provides an upper barrier.

Proposition: Monotone Convergence of Generalized Sinkhorn Cycles

Let $\mathcal X$ and $\mathcal Y$ be compact metric spaces and $c\in C(\mathcal X\times\mathcal Y)$ . Suppose that the assumptions of Proposition: Generalized Sinkhorn Maps are Topical hold, that the selected transforms are finite valued, and that $\mathcal A_\phi$ has a fixed point $f^\star\in C(\mathcal X)$ .

If $f^{(0)}\in C(\mathcal X)$ is a subsolution, $f^{(0)}\leq\mathcal A_\phi(f^{(0)})$ , then

f^{(\ell+1)} = \mathcal A_\phi(f^{(\ell)})

(30)

converges uniformly to a fixed point of $\mathcal A_\phi$ . The analogous conclusion holds for a supersolution, with a decreasing sequence. If the fixed point is unique modulo additive constants, the corresponding potential classes converge to this unique class.

The subsolution and supersolution conditions are invariant under additive shifts, but a shift cannot turn an arbitrary initialization into either one. The proposition therefore isolates the genuinely order-theoretic convergence mechanism; unrestricted initialization requires an additional compactness, ascent, or contraction argument.

Sinkhorn Convergence: Sublinear Robust Rate

The preceding projection and monotonicity arguments establish convergence but do not provide a quantitative rate. The Hilbert-metric analysis in Sinkhorn Convergence: Linear Hilbert Metric Rate will give geometric, or linear, convergence, the canonical asymptotic behavior of a strictly contractive fixed-point iteration; however, the gap between its global contraction factor and one typically becomes exponentially small in the cost oscillation divided by $\epsilon$. We therefore first establish a slower $O(1/\ell)$ dual-gap rate whose constant grows only as $1/\epsilon$ (Peyré, 2026; Altschuler et al., 2017; Dvurechensky et al., 2018). This robust dependence is crucial when Sinkhorn approximates unregularized discrete OT: Corollary: Approximating Unregularized OT by Regularized Dual Costs chooses $\epsilon=\delta/(2\log(nm))$, so the temperature decreases with the requested accuracy and with the number of support points.

Proposition: Robust $O(1/\ell)$ Dual Rate for Discrete Sinkhorn

Let $a\in\simplex_n$ and $b\in\simplex_m$ be positive histograms, let $C\in\RR^{n\times m}$ be finite, and set

R\eqdef\max_{i,j}C_{ij}-\min_{i,j}C_{ij}.

(33)

Initialize $g^{(0)}=0$ , perform complete Sinkhorn dual cycles, and denote the resulting potentials by $(f^{(\ell)},g^{(\ell)})$ for $\ell\geq1$ . If

\Delta^{(\ell)} \eqdef \mathcal D_\epsilon(f^\star,g^\star) - \mathcal D_\epsilon(f^{(\ell)},g^{(\ell)}),

(34)

then

0\leq\Delta^{(\ell)}\leq\frac{8R^2}{\epsilon\ell}, \qquad \ell\geq1.

(35)

Corollary: Approximating Unregularized OT by Regularized Dual Costs

Under the preceding assumptions, suppose $nm>1$ . Let $\mathcal D_{\epsilon,\ell}$ be the KL-normalized entropic dual value after $\ell\geq1$ complete cycles and define

L_{\epsilon,\ell} \eqdef \mathcal D_{\epsilon,\ell} - \epsilon H(a)-\epsilon H(b), \qquad H(a)=-\sum_i a_i\log a_i.

(40)

Then

0\leq \operatorname{OT}_C(a,b)-L_{\epsilon,\ell} \leq \epsilon\log(nm)+\frac{8R^2}{\epsilon\ell}.

(41)

Consequently,

\epsilon=\frac{\delta}{2\log(nm)}, \qquad \ell\geq\frac{32R^2\log(nm)}{\delta^2}

(42)

produce a lower bound with error at most $\delta$ .
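The parameter choice (42) is simple arithmetic, sketched below. The function name `sinkhorn_budget` and the numerical values are hypothetical; the code only evaluates the two formulas and then plugs them back into the right-hand side of (41).

```python
import math

def sinkhorn_budget(delta, R, n, m):
    # Temperature and number of complete cycles prescribed by (42).
    log_nm = math.log(n * m)
    eps = delta / (2 * log_nm)
    cycles = math.ceil(32 * R**2 * log_nm / delta**2)
    return eps, cycles

delta, R, n, m = 0.1, 1.0, 100, 100
eps, cycles = sinkhorn_budget(delta, R, n, m)

# Right-hand side of (41): entropic bias plus optimization error.
bias = eps * math.log(n * m)
opt_error = 8 * R**2 / (eps * cycles)
bound = bias + opt_error
```

Each of the two terms is at most $\delta/2$, which is how the factor $32$ arises from the constant $8$ in (35). The cycle count scales like $R^2\log(nm)/\delta^2$.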

The same identities give computable diagnostics. If $r=P\mathbf1_m$, the KL projection onto the row constraint satisfies $\operatorname{KL}(\operatorname{Proj}(P)\mid P)=\operatorname{KL}(a\mid r)$; the column identity is analogous. Each observed dual ascent is therefore a marginal KL defect, and Theorem: Pinsker Inequality turns it into an $\ell^1$ residual. The residual is not itself the remaining dual gap: a certificate also needs the quotient-radius estimate used above.
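The projection identity and the resulting $\ell^1$ control can be checked on random data. In this sketch the Pinsker inequality is assumed in its standard form $\norm{a-r}_1\leq\sqrt{2\operatorname{KL}(a\mid r)}$ for probability vectors; the variable names and the random plan are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random(4); a /= a.sum()
P = rng.random((4, 5)); P /= P.sum()      # arbitrary positive plan of unit mass

r = P.sum(axis=1)                          # current row marginal
P_proj = P * (a / r)[:, None]              # KL projection onto the row constraint

# Both plans have unit mass, so the mass terms of the generalized KL cancel.
kl_plan = np.sum(P_proj * np.log(P_proj / P))
kl_marginal = np.sum(a * np.log(a / r))
l1_residual = np.abs(a - r).sum()
pinsker_bound = np.sqrt(2 * kl_marginal)
```

The marginal KL defect is therefore available from the row sums alone, without forming the projected plan.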

Sinkhorn Convergence: Linear Hilbert Metric Rate

Hilbert’s projective metric measures positive scaling vectors modulo global multiplication. A positive kernel contracts this geometry, yielding a global linear rate for Sinkhorn scaling rays.

Projective Contraction

Multiplying either vector by a positive scalar leaves $\mathsf H$ unchanged. It is therefore a metric only on projective classes.

Nested Simplex Images

The contraction is already visible on the three-state probability simplex. With the column-vector convention, every positive column-stochastic matrix $K$ maps $\simplex_3$ into its interior, and

K^{\ell+1}\simplex_3 =K^\ell(K\simplex_3) \subseteq K^\ell\simplex_3.

(47)

The first two panels use the explicit family

J_3=\frac13\mathbf1_3\mathbf1_3^\top, \qquad A= \begin{pmatrix} 0&-1&1\\ 1&0&-1\\ -1&1&0 \end{pmatrix}, \qquad K_{\rho,\delta}=\rho I_3+(1-\rho)J_3+\delta A.

(48)

This family is doubly stochastic and is strictly positive when $|\delta|<(1-\rho)/3$ . We use

\rho=.90,\qquad \delta=.014,\qquad K_1=K_{\rho,0},\quad K_2=K_{\rho,\delta}.

(49)

The control $K_1$ acts as $\rho I$ on the zero-sum tangent plane. Since the restriction of $A$ has eigenvalues $\pm\mathrm i\sqrt3$ , the non-Perron eigenvalues of $K_{\rho,\delta}$ are $\rho\pm\mathrm i\sqrt3\delta$ . Thus $K_2$ contracts isotropically while turning gradually. The third panel instead uses the positive doubly stochastic, non-normal kernel

K_3=K_{\mathrm{aniso}} = \begin{pmatrix} .950&.018&.032\\ .042&.880&.078\\ .008&.102&.890 \end{pmatrix}.

(50)

Its non-Perron eigenvalues are approximately .9222 and .7978. Their unequal moduli make the images progressively slender, while non-normality adds shear and a gradual transient turn. Writing

r_i=\max\{|z|:z\in\operatorname{spec}(K_i),\ z\ne1\}

(51)

gives $r_1=\rho$ , $r_2=\sqrt{\rho^2+3\delta^2}\simeq.9003$ , and $r_3\simeq.9222$ . This Euclidean asymptotic rate is distinct from the global Birkhoff factor $\lambda(K_i)$ in Hilbert’s metric.
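The spectral quantities quoted above can be reproduced with a few lines of NumPy. This is only a verification sketch; the helper name `second_modulus` is hypothetical, and the matrices are copied from (48) to (50).

```python
import numpy as np

rho, delta = 0.90, 0.014
I3 = np.eye(3)
J3 = np.full((3, 3), 1.0 / 3.0)
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])

K1 = rho * I3 + (1 - rho) * J3             # isotropic control
K2 = K1 + delta * A                        # adds a rotation on the tangent plane
K3 = np.array([[0.950, 0.018, 0.032],
               [0.042, 0.880, 0.078],
               [0.008, 0.102, 0.890]])     # non-normal kernel of (50)

def second_modulus(K):
    # Largest modulus among the eigenvalues other than the Perron eigenvalue 1.
    moduli = np.sort(np.abs(np.linalg.eigvals(K)))
    return moduli[-2]

r1, r2, r3 = (second_modulus(K) for K in (K1, K2, K3))
doubly_stochastic = all(
    np.allclose(K.sum(axis=0), 1.0) and np.allclose(K.sum(axis=1), 1.0)
    for K in (K1, K2, K3)
)
```

These moduli describe the Euclidean asymptotic rate of $K_i^\ell$ only; they are not the Birkhoff factor.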

The following figure contrasts isotropic contraction, rotation, and anisotropic non-normal contraction.

Positive Markov kernels contract the three-state simplex and can simultaneously rotate its image. Color progresses from red at $\ell=0$ to blue at $\ell=15$ , and the star is the common stationary vector $\mathbf1_3/3$ . The isotropic $K_1$ keeps parallel edges, $K_2$ rotates an essentially homothetic triangle, and the non-normal $K_3$ both turns and strongly elongates its image because its tangent modes contract at unequal rates. The outer boundary is only a geometric reference, since the Hilbert metric becomes finite after the first positive image.

The theorem applies to positive linear maps between proper cones. Related nonlinear projective results require order preservation and homogeneity; a generic affine map is not covered.

Theorem: Projective Linear Convergence of Sinkhorn

Let $v^{(0)}>0$ , generate Sinkhorn iterates, and fix optimal scalings $(u^\star,v^\star)$ . For $\ell\geq1$ , set

\P^{(\ell)} = \operatorname{diag}(u^{(\ell)})K\operatorname{diag}(v^{(\ell)}),

(52)

and for $\ell\geq0$ define the row-normalized half-step

\P^{(\ell+1/2)} = \operatorname{diag}(u^{(\ell+1)})K\operatorname{diag}(v^{(\ell)}).

(53)

Writing $\lambda=\lambda(K)$ , one has

\mathsf H(v^{(\ell)},v^\star) \leq \lambda^{2\ell}\mathsf H(v^{(0)},v^\star), \qquad \mathsf H(u^{(\ell+1)},u^\star) \leq \lambda^{2\ell+1}\mathsf H(v^{(0)},v^\star).

(54)

The scaling rays converge linearly. After fixing a gauge, their representatives converge and $\P^{(\ell)}\to \P^\star$ entrywise. The a posteriori estimates are

\mathsf H(u^{(\ell)},u^\star) \leq \frac{\mathsf H(\P^{(\ell)}\mathbf1_m,a)}{1-\lambda^2}, \qquad \ell\geq1,

(55)

and

\mathsf H(v^{(\ell)},v^\star) \leq \frac{\mathsf H((\P^{(\ell+1/2)})^\top\mathbf1_n,b)}{1-\lambda^2}, \qquad \ell\geq0.

(56)

Finally,

\norm{\log \P^{(\ell)}-\log \P^\star}_\infty \leq \mathsf H(u^{(\ell)},u^\star) + \mathsf H(v^{(\ell)},v^\star).

(57)
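The first estimate in (54) can be observed numerically. The sketch below assumes the standard definitions $\mathsf H(u,u')=\norm{\log u-\log u'}_V$ and $\lambda(K)=\tanh(\Delta(K)/4)$, where $\Delta(K)$ is the largest log cross-ratio of $K$; the helper names, the random instance, and the use of a long run as a proxy for $v^\star$ are assumptions of the example.

```python
import numpy as np

def hilbert(u, v):
    # Hilbert projective metric between positive vectors.
    r = np.log(u) - np.log(v)
    return r.max() - r.min()

def birkhoff(K):
    # tanh(Delta/4), Delta = max over (i,j,k,l) of log(K_ik K_jl / (K_jk K_il)).
    L = np.log(K)
    D = (L[:, None, :, None] + L[None, :, None, :]
         - L[None, :, :, None] - L[:, None, None, :])
    return np.tanh(D.max() / 4)

rng = np.random.default_rng(2)
n, m, eps = 4, 5, 0.5
C = rng.random((n, m))
K = np.exp(-C / eps)
a = rng.random(n); a /= a.sum()
b = rng.random(m); b /= b.sum()

def cycle(v):
    u = a / (K @ v)
    return b / (K.T @ u)

v_star = np.ones(m)
for _ in range(2000):                      # long run used as a proxy for v*
    v_star = cycle(v_star)

lam = birkhoff(K)
v = rng.random(m) + 0.1
h0 = hilbert(v, v_star)
ratios = []
for l in range(1, 9):
    v = cycle(v)
    ratios.append(hilbert(v, v_star) / (lam ** (2 * l) * h0))

R = C.max() - C.min()
```

Ratios below one confirm the bound; in practice they are often much smaller, because $\lambda(K)$ is a worst-case factor over the whole cone.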

Nonlinear Sinkhorn Images of the Simplex

The projective contraction can be visualized simultaneously for every possible left-scaling ray. Using the row and column maps from the proof, define the full-cycle map

F_u(u) = R(C(u)) = a\oslash\left[K\left(b\oslash(K^\top u)\right)\right].

(61)

It satisfies $F_u(su)=sF_u(u)$ for every $s>0$ . It therefore induces the projective self-map

\widehat F_u(p) = \frac{F_u(p)}{\langle F_u(p),\mathbf1_3\rangle}, \qquad p\in\simplex_3,

(62)

and the normalized complete-cycle iterates obey $\widehat u^{(\ell+1)}=\widehat F_u(\widehat u^{(\ell)})$ . The image sets are nested because

\widehat F_u^{\,\ell+1}(\simplex_3) = \widehat F_u^{\,\ell}\!\left(\widehat F_u(\simplex_3)\right) \subseteq \widehat F_u^{\,\ell}(\simplex_3).

(63)

After one cycle they lie in the positive cone, and the preceding theorem gives

\operatorname{diam}_{\mathsf H}\!\left( \widehat F_u^{\,\ell}(\simplex_3) \right) \leq \lambda(K)^{2(\ell-1)} \operatorname{diam}_{\mathsf H}\!\left( \widehat F_u(\simplex_3) \right), \qquad \ell\geq1.

(64)

The figure reuses the three kernels $K_i$ from the preceding linear figure, with the same uniform Sinkhorn marginals $a=b=\mathbf1_3/3$ in every panel. This isolates the passage from the linear action $K_i^\ell$ to the nonlinear balancing map $\widehat F_{u,i}^{\,\ell}$. The symmetric control remains aligned, $K_2$ turns the curved images, and the unequal tangent rates of $K_3$ produce a pronounced anisotropic collapse. The normalized Sinkhorn fixed ray and stationary Markov vector both equal $\mathbf1_3/3$ in this doubly stochastic example, although they need not coincide in general.


Complete Sinkhorn cycles curve, turn and contract the simplex of normalized left scalings. Panel $i$ reuses $K_i$ from the preceding linear figure and the common marginals $a=b=\mathbf1_3/3$. Each colored curve is the densely sampled boundary of $\widehat F_{u,i}^{\,\ell}(\simplex_3)$, with color progressing from the red triangle at $\ell=0$ to the blue curve at $\ell=15$; the star is $\widehat u_i^\star=\mathbf1_3/3$. Reciprocal scaling bends all sixteen boundaries; $K_2$ adds a gradual turn, whereas the non-normal $K_3$ combines turning with a much stronger collapse across one tangent direction. For these invertible kernels the curves are the actual boundaries of the nested image sets. The Hilbert estimate begins after the first positive cycle.

Dual-Potential Form

The KL-normalized potentials satisfy $f_\ell=\epsilon\log(u^{(\ell)}\oslash a)$ and $g_\ell=\epsilon\log(v^{(\ell)}\oslash b)$ . Hence

\norm{f_\ell-f^\star}_V = \epsilon\mathsf H(u^{(\ell)},u^\star), \qquad \norm{g_\ell-g^\star}_V = \epsilon\mathsf H(v^{(\ell)},v^\star).

(65)

The temperature factor is essential when passing to coupling densities:

\norm{\log(\d\pi_\ell/\d\pi^\star)}_\infty \leq \frac{\norm{f_\ell-f^\star}_V+\norm{g_\ell-g^\star}_V}{\epsilon}.

(66)

For $K_\epsilon=e^{-C/\epsilon}$ and $R=\max C-\min C$ ,

\lambda(K_\epsilon) \leq \tanh\!\left(\frac{R}{2\epsilon}\right)<1.

(67)

Thus the global rate becomes exponentially pessimistic as $\epsilon\downarrow0$. Sharper continuous analyses obtain polynomial rates under additional semiconcavity or log-concavity assumptions (Chizat et al., 2026). In practice, monitor the marginal that has not just been normalized: the source residual at $P^{(\ell)}$ and the target residual at $P^{(\ell+1/2)}$ are the two meaningful a posteriori diagnostics.

Entropic Optimal Transport Between Gaussians

Gaussian marginals provide an explicit finite-dimensional model of Sinkhorn’s behavior. The soft $c$-transform preserves quadratic potentials, the optimal entropic coupling is Gaussian, and the value can be written with matrix square roots (Janati et al., 2020). This is the entropic counterpart of the Gaussian $\Wass_2$ and Bures formula.

Proposition: Quadratic Closure of Sinkhorn Iterates

Let $\beta=\mathcal N(m_\beta,\Sigma_\beta)$ on $\RR^d$ and take $c(x,y)=\norm{x-y}^2$ . If $g(y)$ is a quadratic polynomial such that the Gaussian integral below is finite, then the soft transform

f(x) = -\epsilon\log \int \exp\!\left(\frac{g(y)-\norm{x-y}^2}{\epsilon}\right) \,\d\beta(y)

(68)

is a quadratic polynomial in $x$ . In particular, starting Sinkhorn from $g_0=0$ gives

f_1(x) = \frac{\epsilon}{2} \log\det\!\left(I+\frac{2\Sigma_\beta}{\epsilon}\right) + \epsilon \left\langle x-m_\beta, (\epsilon I+2\Sigma_\beta)^{-1}(x-m_\beta) \right\rangle .

(69)
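Formula (69) can be compared with direct numerical integration in dimension one, where $\Sigma_\beta$ is a scalar variance. The sketch below is a sanity check under assumptions: the test values, the helper names, and the plain Riemann sum on a wide uniform grid are hypothetical choices, adequate here because the integrand is smooth and decays rapidly.

```python
import numpy as np

eps, m_beta, var_beta = 0.7, 0.4, 1.3      # hypothetical 1D test values

def f1_closed(x):
    # One-dimensional specialization of (69).
    return (0.5 * eps * np.log(1 + 2 * var_beta / eps)
            + eps * (x - m_beta) ** 2 / (eps + 2 * var_beta))

def f1_numeric(x, half_width=12.0, n_grid=40001):
    # Soft transform (68) with g = 0, integrated on a uniform grid.
    s = np.sqrt(var_beta)
    y = np.linspace(m_beta - half_width * s, m_beta + half_width * s, n_grid)
    dy = y[1] - y[0]
    density = np.exp(-(y - m_beta) ** 2 / (2 * var_beta)) / np.sqrt(2 * np.pi * var_beta)
    integral = np.sum(np.exp(-(x - y) ** 2 / eps) * density) * dy
    return -eps * np.log(integral)

xs = np.array([-1.0, 0.0, 0.4, 2.0])
gap = max(abs(f1_closed(x) - f1_numeric(x)) for x in xs)
```

The quadratic coefficient $\epsilon/(\epsilon+2\Sigma_\beta)$ tends to one as $\epsilon\to\infty$ and to zero as $\epsilon\downarrow0$, so the first soft transform flattens at low temperature.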

Proposition: Balanced Entropic OT Between Gaussians

Let $\alpha=\mathcal N(m_\alpha,\Sigma_\alpha)$ and $\beta=\mathcal N(m_\beta,\Sigma_\beta)$ with positive-definite covariances, and let

\Sigma_\alpha^{1/2}\Sigma_\beta^{1/2} = U\operatorname{diag}(\sigma_i)V^\top

(70)

be a singular-value decomposition. For the balanced objective

\min_{\pi\in\Couplings(\alpha,\beta)} \int\norm{x-y}^2\,\d\pi(x,y) + \epsilon\operatorname{KL}(\pi\mid\alpha\otimes\beta),

(71)

the optimizer is Gaussian with cross-covariance

K_\epsilon = \Sigma_\alpha^{1/2} U\operatorname{diag}(s_i)V^\top \Sigma_\beta^{1/2}, \qquad s_i = \frac{\sqrt{\epsilon^2+16\sigma_i^2}-\epsilon}{4\sigma_i}.

(72)

The optimal value is

\norm{m_\alpha-m_\beta}^2 + \operatorname{tr}(\Sigma_\alpha) + \operatorname{tr}(\Sigma_\beta) + \sum_i \left( -2\sigma_i s_i - \frac{\epsilon}{2}\log(1-s_i^2) \right).

(73)

As $\epsilon\downarrow0$ , $s_i\to1$ and the full covariance contribution converges to the Bures--Wasserstein covariance term.
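The closed form (70) to (73) translates directly into NumPy. The following is a hedged sketch: the function names `sqrtm_psd` and `gaussian_eot` and the test covariances are hypothetical, and the checks only confirm internal consistency (valid joint covariance, scalar stationarity $2\sigma_i(1-s_i^2)=\epsilon s_i$, and the small-$\epsilon$ limit), not the proposition itself.

```python
import numpy as np

def sqrtm_psd(S):
    # Symmetric square root of a positive-definite matrix.
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(w)) @ Q.T

def gaussian_eot(m_a, S_a, m_b, S_b, eps):
    Ra, Rb = sqrtm_psd(S_a), sqrtm_psd(S_b)
    U, sig, Vt = np.linalg.svd(Ra @ Rb)                        # decomposition (70)
    s = (np.sqrt(eps**2 + 16 * sig**2) - eps) / (4 * sig)      # shrinkage in (72)
    cross = Ra @ (U * s) @ Vt @ Rb                             # cross-covariance (72)
    value = (np.sum((m_a - m_b) ** 2) + np.trace(S_a) + np.trace(S_b)
             + np.sum(-2 * sig * s - 0.5 * eps * np.log(1 - s**2)))  # value (73)
    return cross, value, sig, s

m_a, m_b = np.array([0.0, 0.0]), np.array([1.0, -0.5])
S_a = np.array([[1.0, 0.3], [0.3, 0.8]])
S_b = np.array([[0.6, -0.2], [-0.2, 1.2]])

cross, value, sig, s = gaussian_eot(m_a, S_a, m_b, S_b, eps=0.5)
joint = np.block([[S_a, cross], [cross.T, S_b]])   # covariance of the coupling

_, value_small, sig0, _ = gaussian_eot(m_a, S_a, m_b, S_b, eps=1e-6)
bures_value = (np.sum((m_a - m_b) ** 2) + np.trace(S_a) + np.trace(S_b)
               - 2 * np.sum(sig0))
```

Since $0<s_i<1$ for every $\epsilon>0$, the joint covariance is nondegenerate; the coupling only becomes singular in the limit $s_i\to1$.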

Corollary: Gaussian Sinkhorn Divergence and Smoothed Bures Term

For $r>0$ , define

\tau_\epsilon(r) \eqdef \frac{\sqrt{\epsilon^2+16r^2}-\epsilon}{4r}, \qquad \psi_\epsilon(r) \eqdef -2r\,\tau_\epsilon(r) - \frac{\epsilon}{2}\log(1-\tau_\epsilon(r)^2).

(76)

If $\sigma_i(\Sigma,\Lambda)$ denotes the singular values of $\Sigma^{1/2}\Lambda^{1/2}$ and $\lambda_i(\Sigma)$ the eigenvalues of $\Sigma$ , then the debiased Gaussian Sinkhorn divergence is

\overline{\mathcal L}_{\norm{\cdot-\cdot}^2}^{\epsilon}(\alpha,\beta) = \norm{m_\alpha-m_\beta}^2 + \mathcal B_\epsilon(\Sigma_\alpha,\Sigma_\beta)^2,

(77)

where

\mathcal B_\epsilon(\Sigma,\Lambda)^2 \eqdef \sum_i\psi_\epsilon(\sigma_i(\Sigma,\Lambda)) - \frac12\sum_i\psi_\epsilon(\lambda_i(\Sigma)) - \frac12\sum_i\psi_\epsilon(\lambda_i(\Lambda)).

(78)

Moreover $\mathcal B_\epsilon(\Sigma,\Lambda)^2\to \mathcal B(\Sigma,\Lambda)^2$ as $\epsilon\downarrow0$ .

The interactive panel that accompanies this section exposes exactly the quantities in the formula through three controls: $\epsilon$ sets the singular-value shrinkage, anisotropy changes the eigenvalues, and the angle changes the covariance misalignment.

Proposition: One-Dimensional Gaussian Sinkhorn Rate

Consider $\alpha=\beta=\mathcal N(0,1)$ on $\RR$ with $c(x,y)=(x-y)^2$ . If a dual potential has the form $g_q(y)=q y^2+\text{cst}$ , then one soft transform has quadratic coefficient

T_\epsilon(q) = 1-\frac{1}{1-q+\epsilon/2}, \qquad q<1+\epsilon/2.

(81)

One full Sinkhorn cycle acts as $q\mapsto T_\epsilon(T_\epsilon(q))$ . The fixed point $q_\star=T_\epsilon(q_\star)$ is determined by

A_\star^2-\frac{\epsilon}{2}A_\star-1=0, \qquad A_\star \eqdef 1-q_\star+\frac{\epsilon}{2} = \frac{\epsilon+\sqrt{\epsilon^2+16}}{4}.

(82)

Consequently the local asymptotic contraction factor of one full Sinkhorn cycle on the quadratic coefficient is

\rho_\epsilon = A_\star^{-4} = \left(\frac{4}{\epsilon+\sqrt{\epsilon^2+16}}\right)^4 .

(83)

This scalar calculation illustrates the general Gaussian convergence picture of Chizat, Delalande and Vaskevicius (Chizat et al., 2026): the rate improves when $\epsilon$ is large or the covariance scales overlap well, and deteriorates in the small-temperature limit where the entropic coupling approaches a deterministic Brenier map.
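The scalar recursion (81) can be iterated directly to observe the factor (83). The sketch below is illustrative only; the function names `T` and `rate`, the temperature $\epsilon=1$, the start $q=0$ and the number of cycles are hypothetical choices.

```python
import numpy as np

def T(q, eps):
    # Quadratic coefficient after one soft transform, formula (81).
    return 1.0 - 1.0 / (1.0 - q + eps / 2)

def rate(eps):
    # Predicted contraction factor of one full cycle, formula (83).
    A_star = (eps + np.sqrt(eps**2 + 16)) / 4
    return A_star ** -4

eps = 1.0
A_star = (eps + np.sqrt(eps**2 + 16)) / 4
q_star = 1 + eps / 2 - A_star              # fixed point from (82)

q, errors = 0.0, []
for _ in range(12):
    q = T(T(q, eps), eps)                  # one full Sinkhorn cycle
    errors.append(abs(q - q_star))

observed = errors[-1] / errors[-2]         # empirical ratio of successive errors
predicted = rate(eps)
```

The factor arises because $T_\epsilon'(q_\star)=-A_\star^{-2}$, so two soft transforms multiply the error by $A_\star^{-4}$; as $\epsilon\downarrow0$, $A_\star\to1$ and the factor tends to one.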

Continuous $\varepsilon$-Sinkhorn Flow

This section studies a simultaneous high-resolution, many-iteration limit. It is not a continuous-time interpolation of a fixed-temperature algorithm: the grid is refined while the temperature and the fictitious time step both vanish as $1/k$ .

Parabolic Monge--Ampère Limit

For the quadratic torus cost $c(x,y)=d_{\mathbb T^d}(x,y)^2/2$, Berman’s scaling discretizes both marginals on a grid of mesh $1/k$, sets $\epsilon_k=1/k$, and assigns duration $1/k$ to each Sinkhorn update (Berman, 2020). The $m$-th log-potential is observed at time $t=m/k$. In this coupled limit, Sinkhorn becomes a parabolic Monge--Ampère flow.

To make the scaling explicit, let $\alpha^{(k)}$ and $\beta^{(k)}$ be the positive grid discretizations and define

v_k[u](y) \eqdef \frac1k\log\int e^{-k(c(x,y)+u(x))}\,d\alpha^{(k)}(x),

(88)

(S_ku)(x) \eqdef \frac1k\log\int e^{-k(c(x,y)+v_k[u](y))}\,d\beta^{(k)}(y).

(89)

The multiplicative increment is

\rho_{k,u}(x) \eqdef e^{k(S_ku(x)-u(x))} = e^{-ku(x)} \int \frac{e^{-kc(x,y)}} {\int e^{-k(c(x',y)+u(x'))}\,d\alpha^{(k)}(x')} \,d\beta^{(k)}(y).

(90)

The normalized update subtracts the spatial mean of $S_ku$ .

No explicit $\epsilon$ remains in the normalized limiting PDE: it records the vanishing temperature before rescaling. The Kähler analogue is the parabolic complex Monge--Ampère equation.

Gaussian Closure

Gaussian marginals give a finite-dimensional test case for the continuous flow. The theorem above is stated on the flat torus, but the same local Laplace calculation can be read formally on $\RR^d$ for confining Gaussian densities. Write $\alpha=\Gaussian(m_\alpha,\Sigma_\alpha)$ and $\beta=\Gaussian(m_\beta,\Sigma_\beta)$ , and restrict the potential to the quadratic ansatz for which

T_t(x)\eqdef x+\nabla u_t(x) = q_t+B_t(x-m_\alpha), \qquad B_t\in\mathbb S_{++}^d .

(95)

Taking the spatial gradient of the continuous $\varepsilon$ -Sinkhorn PDE removes the additive gauge. Since

F(x)=\frac12\langle x-m_\alpha,\Sigma_\alpha^{-1}(x-m_\alpha)\rangle+\mathrm{cst}, \qquad G(y)=\frac12\langle y-m_\beta,\Sigma_\beta^{-1}(y-m_\beta)\rangle+\mathrm{cst},

(96)

coefficient matching in the identity $\partial_tT_t=\nabla\partial_tu_t$ gives

\dot B_t=\Sigma_\alpha^{-1}-B_t\Sigma_\beta^{-1}B_t, \qquad \dot q_t=-B_t\Sigma_\beta^{-1}(q_t-m_\beta).

(97)

Thus the parabolic Monge--Ampère equation reduces, on the Gaussian ansatz, to a Riccati evolution for the linear part of the transport. The image mean is $q_t$, and the image covariance is

\Sigma_t=B_t\Sigma_\alpha B_t, \qquad \dot\Sigma_t=\dot B_t\Sigma_\alpha B_t+B_t\Sigma_\alpha\dot B_t.

(98)

At equilibrium,

B\Sigma_\beta^{-1}B=\Sigma_\alpha^{-1}, \qquad q=m_\beta,

(99)

which is equivalent to $B\Sigma_\alpha B=\Sigma_\beta$ . The stationary map is therefore the Gaussian Brenier map, and the endpoint covariance is governed by the same Bures--Wasserstein geometry as in Section Entropic Optimal Transport Between Gaussians. This Gaussian reduction should be viewed as the finite-dimensional covariance shadow of the vanishing-temperature continuous Sinkhorn limit, not as the fixed-temperature Gaussian Sinkhorn formula itself.
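The Riccati reduction (97) and its equilibrium (99) are easy to test numerically. The following is a minimal sketch, assuming an explicit Euler discretization, the initialization $B_0=\mathrm{Id}$ and $q_0=m_\alpha$ (which corresponds to $u_0=0$), and arbitrary illustrative covariances; the helper names `sqrtm_psd` and `riccati_flow` are hypothetical. It compares the long-time limit with the standard closed form of the Gaussian Brenier map, $\Sigma_\alpha^{-1/2}(\Sigma_\alpha^{1/2}\Sigma_\beta\Sigma_\alpha^{1/2})^{1/2}\Sigma_\alpha^{-1/2}$.

```python
import numpy as np

def sqrtm_psd(S):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T

def riccati_flow(Sa, Sb, m_b, q0, dt=1e-2, steps=8000):
    """Explicit Euler integration of the Riccati system (97) from B_0 = I."""
    Sa_inv, Sb_inv = np.linalg.inv(Sa), np.linalg.inv(Sb)
    B, q = np.eye(Sa.shape[0]), np.array(q0, dtype=float)
    for _ in range(steps):
        dB = Sa_inv - B @ Sb_inv @ B          # \dot B_t
        dq = -B @ Sb_inv @ (q - m_b)          # \dot q_t
        B, q = B + dt * dB, q + dt * dq
    return B, q

# Illustrative covariances and means (assumptions, not taken from the text)
Sa = np.array([[2.0, 0.5], [0.5, 1.0]])
Sb = np.array([[1.0, -0.3], [-0.3, 1.5]])
m_a, m_b = np.zeros(2), np.array([1.0, -2.0])

B, q = riccati_flow(Sa, Sb, m_b, q0=m_a)

# Closed-form Gaussian Brenier map for comparison
Sa_h = sqrtm_psd(Sa)
Sa_hi = np.linalg.inv(Sa_h)
B_star = Sa_hi @ sqrtm_psd(Sa_h @ Sb @ Sa_h) @ Sa_hi

err_map = np.max(np.abs(B - B_star))          # distance to the Brenier map
err_cov = np.max(np.abs(B @ Sa @ B - Sb))     # equilibrium B Sigma_alpha B = Sigma_beta
err_mean = np.max(np.abs(q - m_b))            # equilibrium q = m_beta
```

Explicit Euler is used only for brevity; a stiffer pair of covariances would call for a smaller step or an implicit integrator.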

In one dimension the flow reduces to

\partial_tu_t(x) = \log(1+u_t''(x))-G(x+u_t'(x))+F(x)-\bar r_t,

(100)

as long as $1+u_t''>0$ . This scalar case is useful for visualization because the potential curves can be plotted directly and the positivity condition is exactly the monotonicity of $x\mapsto x+u_t'(x)$ .
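The scalar flow (100) can be simulated with a few lines of finite differences. The sketch below makes several assumptions that are not in the text: the domain is the periodic interval $[0,1)$, $F=-\log f$ and $G=-\log g$ for two smooth positive densities chosen for illustration, $\bar r_t$ is taken to be the spatial mean of the right-hand side so that $u_t$ keeps zero mean, and time stepping is explicit Euler under a parabolic step restriction. The function name `flow_1d` is hypothetical.

```python
import numpy as np

def flow_1d(f, g, n=64, t_final=2.0, cfl=0.2):
    """Explicit Euler scheme for the 1D flow (100) on the periodic grid [0, 1)."""
    x = np.arange(n) / n
    dx = 1.0 / n
    dt = cfl * dx**2                     # parabolic step restriction
    F = -np.log(f(x))                    # F = -log(source density), an assumption

    def rhs(u):
        du = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)         # centered u'
        d2u = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2   # centered u''
        # log(1 + u'') - G(x + u') + F(x), with -G = log g and g periodic
        r = np.log1p(d2u) + np.log(g(x + du)) + F
        return r - r.mean(), du, d2u     # subtracting the mean plays the role of r_bar

    u = np.zeros(n)                      # zero initialization
    for _ in range(int(round(t_final / dt))):
        u = u + dt * rhs(u)[0]
    return x, u, rhs(u)

f = lambda x: 1.0 + 0.3 * np.cos(2 * np.pi * x)   # illustrative source density
g = lambda y: 1.0 + 0.3 * np.sin(2 * np.pi * y)   # illustrative target density

x, u, (residual, du, d2u) = flow_1d(f, g)
# At equilibrium the map T(x) = x + u'(x) should satisfy T'(x) g(T(x)) = f(x)
ma_defect = np.max(np.abs((1.0 + d2u) * g(x + du) - f(x)))
```

The stationarity check `residual` and the Monge--Ampère defect `ma_defect` measure two different things: the first is convergence in time of the discrete scheme, the second is the discretization error of the stationary state.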

The following figure shows this evolution from the zero initialization for two smooth pairs of marginals.

Continuous $\varepsilon$-Sinkhorn flow in one dimension. The curves are snapshots of the gauge-fixed potential $u_t$, initialized at $u_0=0$, under the parabolic Monge--Ampère equation obtained from Berman’s high-resolution, vanishing-temperature Sinkhorn scaling. Time is encoded from red to blue; the faint bottom silhouettes show the source density in red and the target density in blue.

Interactive panel. Adjust the entropic scale and flow time to watch the log-domain continuous Sinkhorn relaxation approach the fixed-point dual potentials.

Monotone Clearing Beyond Variational Sinkhorn

The preceding convergence mechanisms mostly used variational structure: Sinkhorn is alternating KL projection, coordinate ascent on a dual objective, or a contraction in a projective metric. There is another, more algebraic, convergence mechanism which keeps the scaling form but discards the existence of an objective. In Galichon’s equilibrium-flow viewpoint, and in related work on substitutability and inverse isotonicity, one studies nonlinear market-clearing equations whose Jacobian has a substitute sign structure (Galichon & Jacquet, 2024; Galichon et al., 2022). This is the nonlinear analogue of the classical theory of nonsingular M-matrices and M-functions (Moré & Rheinboldt, 1973; Plemmons, 1977). The relevance for OT is that Sinkhorn is the canonical scaling example, but the same monotone clearing proof also covers fixed-point equations that are not first-order conditions of any convex regularized transport problem.

Two-block clearing maps

Write the unknowns as two blocks $z=(u,v)\in\RR^n\times\RR^m$ , where $u$ and $v$ will be signed log-scalings, and let

Q(z)=(Q^\alpha(u,v),Q^\beta(u,v))\in\RR^n\times\RR^m.

(101)

The parallel coordinate-clearing map $T$ is defined by solving, for each coordinate,

Q_\ell(T_\ell(z),z_{-\ell})=0, \qquad 1\leq \ell\leq n+m.

(102)

This is the Jacobi version: all coordinates are cleared against the old values of the other coordinates. In two-block scaling problems, all coordinates in $u$ decouple when $v$ is fixed, and all coordinates in $v$ decouple when $u$ is fixed. The more common alternating Sinkhorn sweep is the Gauss--Seidel composition of these two block clearings; the order argument below is stated for the parallel map to keep the notation short. The same construction extends to any finite number of blocks.

Definition: Z-functions and M-functions

Let $D\subset\RR^N$ be an order interval. A map $Q:D\to\RR^N$ is diagonally isotone if $Q_\ell(z_\ell,z_{-\ell})$ is nondecreasing in $z_\ell$ for every $\ell$ . It is a Z-function if increasing the other coordinates cannot increase the $\ell$ -th component:

z_{-\ell}\leq z'_{-\ell} \quad\Longrightarrow\quad Q_\ell(z_\ell,z_{-\ell})\geq Q_\ell(z_\ell,z'_{-\ell}).

(103)

For a $C^1$ map this means $\partial Q_\ell/\partial z_k\leq0$ for $k\neq\ell$ . An M-function is a Z-function that is inverse isotone:

Q(z)\leq Q(z')\quad\Longrightarrow\quad z\leq z'.

(104)

A vector $z$ is a subsolution if $Q(z)\leq0$ , and a supersolution if $Q(z)\geq0$ .

The M-function assumption says that cross-effects have the sign of substitutes, and that own effects dominate them strongly enough to prevent a global reversal of order. Galichon, Samuelson and Vernet formulate this idea through nonreversingness and unified gross substitutes; for single-valued maps this is the inverse-isotone structure used below. A degenerate $M_0$ -function keeps the same order structure but allows a null gauge direction, which is exactly what happens for balanced Sinkhorn before a potential normalization is imposed.

Theorem: Monotone convergence of coordinate clearing

Let $D=[\underline z,\overline z]\subset\RR^N$ be a closed order interval, with $\underline z$ a subsolution and $\overline z$ a supersolution. Assume that $Q:D\to\RR^N$ is continuous, diagonally isotone and an M-function. Assume also that, for every $z\in D$ and every $\ell$ , the scalar clearing equation $Q_\ell(\xi,z_{-\ell})=0$ has a unique solution $\xi\in[\underline z_\ell,\overline z_\ell]$ . Then $Q(z)=0$ has a unique solution $z^\star\in D$ . The Jacobi coordinate-clearing iterates starting from any subsolution in $D$ increase monotonically to $z^\star$ , while those starting from any supersolution in $D$ decrease monotonically to $z^\star$ .
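The simplest instance of this theorem is affine: $Q(z)=Az-b$ with $A$ a strictly diagonally dominant matrix having positive diagonal and nonpositive off-diagonal entries, so that $A$ is a nonsingular M-matrix and $Q$ is an M-function. The sketch below uses an arbitrary $3\times3$ example (the matrix, the right-hand side and the helper names are illustrative assumptions) and checks that Jacobi coordinate clearing increases from a subsolution and decreases from a supersolution toward the same solution.

```python
import numpy as np

# Z-sign off-diagonal entries and strict row diagonal dominance (illustrative values)
A = np.array([[4.0, -1.0, -1.0],
              [-2.0, 5.0, -1.0],
              [0.0, -1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])

def Q(z):
    """Affine clearing map Q(z) = A z - b."""
    return A @ z - b

def jacobi_clearing(z):
    """Solve Q_l(xi, z_{-l}) = 0 for each coordinate l against the old values."""
    d = np.diag(A)
    return (b - (A @ z - d * z)) / d

def iterate(z0, n_iter=200):
    """Return the whole trajectory of Jacobi iterates as an array of rows."""
    path = [np.array(z0, dtype=float)]
    for _ in range(n_iter):
        path.append(jacobi_clearing(path[-1]))
    return np.array(path)

z_sub = np.zeros(3)                                  # Q(0) = -b <= 0
z_sup = np.full(3, np.max(b / A.sum(axis=1)))        # constant vector with Q >= 0
up, down = iterate(z_sub), iterate(z_sup)
z_star = np.linalg.solve(A, b)                       # reference solution
```

In this affine case the clearing map is the classical Jacobi iteration for linear systems; the theorem extends the same order argument to nonlinear $Q$.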

Sinkhorn as the canonical $M_0$-system

Let $K=\exp(-C/\epsilon)>0$ and write

\P_{ij}=r_iK_{ij}s_j, \qquad r_i=e^{u_i},\qquad s_j=e^{-v_j}.

(109)

The signed convention $v=-\log s$ makes the clearing equations

Q_i^\alpha(u,v)=\sum_jK_{ij}e^{u_i-v_j}-a_i, \qquad Q_j^\beta(u,v)=b_j-\sum_iK_{ij}e^{u_i-v_j}.

(110)

Then $Q=0$ is exactly $\P\mathbf 1=a$ and $\P^\top\mathbf 1=b$ , and coordinate clearing gives the usual Sinkhorn scalings

r_i^+=\frac{a_i}{\sum_jK_{ij}s_j}, \qquad s_j^+=\frac{b_j}{\sum_iK_{ij}r_i}.

(111)

The Jacobian has positive diagonal entries and nonpositive off-diagonal entries. Its column sums vanish, reflecting the gauge invariance $(u,v)\mapsto(u+c\mathbf 1,v+c\mathbf 1)$. Thus balanced Sinkhorn is naturally an $M_0$-system. Fixing one log-scaling coordinate turns the reduced Jacobian into a principal submatrix of the weighted bipartite graph Laplacian. Under connected support, automatic here because $K>0$, it is a nonsingular M-matrix. This complements the variational and Hilbert-metric proofs above, and also connects Sinkhorn scaling with choice models (Qu et al., 2023).
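These sign and gauge properties can be verified directly on a random instance. The sketch below is an illustration under assumed data (random cost, uniform marginals, $\epsilon=0.5$); the function name `jacobian` is hypothetical. It assembles the Jacobian of (110) in the variables $(u,v)$, checks the Z-sign pattern and the vanishing column sums, inverts the reduced Jacobian obtained by fixing the last coordinate, and runs the scalings (111) in alternating form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 4, 5, 0.5
K = np.exp(-rng.random((n, m)) / eps)        # positive Gibbs kernel
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)

def jacobian(u, v):
    """Jacobian of Q = (Q^alpha, Q^beta) from (110) with respect to (u, v)."""
    P = np.exp(u)[:, None] * K * np.exp(-v)[None, :]
    J = np.zeros((n + m, n + m))
    J[:n, :n] = np.diag(P.sum(axis=1))       # dQ^alpha_i / du_i
    J[:n, n:] = -P                           # dQ^alpha_i / dv_j
    J[n:, :n] = -P.T                         # dQ^beta_j / du_i
    J[n:, n:] = np.diag(P.sum(axis=0))       # dQ^beta_j / dv_j
    return J

u, v = rng.normal(size=n), rng.normal(size=m)
J = jacobian(u, v)
off_diag = J - np.diag(np.diag(J))
col_sums = J.sum(axis=0)                     # vanish because of the gauge direction
inv_reduced = np.linalg.inv(J[:-1, :-1])     # fix the last coordinate of v

# Alternating (Gauss-Seidel) form of the scalings (111)
r, s = np.ones(n), np.ones(m)
for _ in range(500):
    r = a / (K @ s)
    s = b / (K.T @ r)
P = r[:, None] * K * s[None, :]
```

An entrywise nonnegative inverse of the reduced Jacobian is the matrix counterpart of inverse isotonicity once the gauge has been fixed.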

Example: Lossy Sinkhorn clearing with outside options

The following OT-shaped clearing model is deliberately minimal. It keeps a positive transport kernel and multiplicative scalings, but introduces outside options and lossy arrivals. It should be read as a toy absorptive matching model rather than as a new transport distance. Let $0<\eta_{ij}\leq1$ , $\sigma_i>0$ , $\tau_j>0$ , and set

\P_{ij}=r_i\K_{ij}s_j,\qquad r_i=e^{u_i},\qquad s_j=e^{-v_j}.

(112)

The clearing equations are

\sigma_i r_i+\sum_j\P_{ij}=a_i, \qquad \tau_j s_j+\sum_i\eta_{ij}\P_{ij}=b_j.

(113)

They say that source mass can exit through an outside sink, while target demand can be met either by effective arrivals or by a local outside source. The coordinate-clearing update is still Sinkhorn-like:

r_i^+=\frac{a_i}{\sigma_i+\sum_j\K_{ij}s_j}, \qquad s_j^+=\frac{b_j}{\tau_j+\sum_i\eta_{ij}\K_{ij}r_i}.

(114)

In the variables $(u,v)$ , the off-diagonal derivatives again have the Z-sign. On the invariant domain $r_i\leq a_i/\sigma_i$ , the simple sufficient condition

\tau_j>\sum_i(1-\eta_{ij})\K_{ij}\frac{a_i}{\sigma_i}, \qquad 1\leq j\leq m,

(115)

implies the weighted diagonal-dominance certificate $\ones^\top DQ\gg0$, hence the proposition “Smooth M-matrix certificate” shows that the system is an M-function on that domain. Indeed, for the clearing map $Q_i^\alpha=\sigma_i r_i+\sum_j\P_{ij}-a_i$ and $Q_j^\beta=b_j-\tau_j s_j-\sum_i\eta_{ij}\P_{ij}$, the column sums of the Jacobian are

\sigma_i r_i+\sum_j(1-\eta_{ij})\P_{ij}>0, \qquad \tau_j s_j-\sum_i(1-\eta_{ij})\P_{ij} = s_j\left(\tau_j-\sum_i(1-\eta_{ij})\K_{ij}r_i\right)>0.

(116)

The model is generally non-variational. If $Q=\nabla\mathcal E$ for a $C^2$ potential, the cross-partials would satisfy

\frac{\partial Q_i^\alpha}{\partial v_j} = \frac{\partial Q_j^\beta}{\partial u_i}.

(117)

Here this would force $-\P_{ij}=-\eta_{ij}\P_{ij}$ on every active arc, hence $\eta_{ij}=1$ . Lossy arrivals therefore destroy exactness of the one-form $\sum_\ell Q_\ell\,\d z_\ell$ , while preserving the monotone clearing structure.
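The lossy model is small enough to run end to end. The sketch below uses assumed data (a random positive kernel, loss factors drawn in $[0.6,1)$, uniform $a$ and $b$, $\sigma_i=1$, $\tau_j=0.5$), chosen so that the sufficient condition (115) holds. It applies the updates (114) in alternating order starting from the upper bound $r=a/\sigma$, then evaluates the clearing residuals (113), the column sums (116), and the gap between the two cross-partials in (117).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 5
K = np.exp(-rng.random((n, m)))              # positive kernel, entries in (1/e, 1]
eta = rng.uniform(0.6, 1.0, size=(n, m))     # lossy arrival factors
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
sigma, tau = np.ones(n), np.full(m, 0.5)     # outside-option coefficients

# Sufficient condition (115) on the invariant domain r_i <= a_i / sigma_i
bound = ((1.0 - eta) * K * (a / sigma)[:, None]).sum(axis=0)
condition_115 = bool(np.all(tau > bound))

# Alternating form of the Sinkhorn-like updates (114)
r = a / sigma
for _ in range(300):
    s = b / (tau + (eta * K).T @ r)
    r = a / (sigma + K @ s)

P = r[:, None] * K * s[None, :]
row_res = sigma * r + P.sum(axis=1) - a              # first equation of (113)
col_res = tau * s + (eta * P).sum(axis=0) - b        # second equation of (113)

# Column sums (116) of the Jacobian in the variables (u, v)
col_u = sigma * r + ((1.0 - eta) * P).sum(axis=1)
col_v = tau * s - ((1.0 - eta) * P).sum(axis=0)

# Cross-partials: dQ^alpha_i/dv_j = -P_ij versus dQ^beta_j/du_i = -eta_ij P_ij
asymmetry = np.max(np.abs(P - eta * P))
```

A strictly positive `asymmetry` is the numerical signature of the non-variational structure: the iteration converges even though no potential $\mathcal E$ with $Q=\nabla\mathcal E$ exists.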

The following figure shows this mechanism on two empirical Gaussian mixtures in $\RR^2$.

Non-variational lossy Sinkhorn scaling on two Gaussian-mixture point clouds. The outside coefficients are $\sigma_i=\rho\bar\sigma_i$ and $\tau_j=\rho\bar\tau_j$ , and columns increase the common scale $\rho$ . Colors show centered log-scalings, $\log r-\langle\log r\rangle$ on the source row and $\log s-\langle\log s\rangle$ on the target row; faint violet links mark the largest entries of the induced effective plan. The first displayed case uses uniform outside coefficients, while the second uses spatially varying outside coefficients and directional loss factors. The updates are Sinkhorn-like row and column clearings, but $\eta_{ij}\neq1$ breaks the cross-partial symmetry required by a convex potential.

Interactive panel. Change the two monotone update temperatures to watch the source and target logarithmic scalings stabilize under a non-variational Sinkhorn-like iteration.

References

Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3), 200–217.
Rüschendorf, L. (1995). Convergence of the iterative proportional fitting procedure. Annals of Statistics, 23(4), 1160–1174.
Rüschendorf, L., & Thomsen, W. (1998). Closedness of sum spaces and the generalized Schrödinger problem. Theory of Probability and Its Applications, 42(3), 483–494.
Blondel, M., Seguy, V., & Rolet, A. (2018). Smooth and Sparse Optimal Transport. Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 84, 880–889. https://proceedings.mlr.press/v84/blondel18a.html
Dykstra, R. L. (1985). An iterative procedure for obtaining I-projections onto the intersection of convex sets. Annals of Probability, 13(3), 975–984.
Censor, Y., & Reich, S. (1998). The Dykstra algorithm with Bregman projections. Communications in Applied Analysis, 2, 407–419.
Bauschke, H. H., & Lewis, A. S. (2000). Dykstra’s algorithm with Bregman projections: a convergence proof. Optimization, 48(4), 409–427.
Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., & Peyré, G. (2015). Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2), A1111–A1138.
Bregman, L. M., Censor, Y., & Reich, S. (1999). Dykstra’s Algorithm as the Nonlinear Extension of Bregman’s Optimization Method. Journal of Convex Analysis, 6(2), 319–333.
Lemmens, B., & Nussbaum, R. (2012). Nonlinear Perron-Frobenius Theory (Vol. 189). Cambridge University Press. https://doi.org/10.1017/CBO9781139026079
Fortet, R. (1940). Résolution d’un système d’équations de M. Schrödinger. Journal de Mathématiques Pures et Appliquées, 19(1–4), 83–105. https://www.numdam.org/item/JMPA_1940_9_19_1-4_83_0/
Essid, M., & Pavon, M. (2019). Traversing the Schrödinger Bridge Strait: Robert Fortet’s Marvelous Proof Redux. Journal of Optimization Theory and Applications, 181, 23–60.
Léonard, C. (2019). Revisiting Fortet’s proof of existence of a solution to the Schrödinger system. arXiv Preprint arXiv:1904.13211.
Peyré, G. (2026). Robust Sublinear Convergence Rates for Iterative Bregman Projections. arXiv Preprint arXiv:2602.01372. https://arxiv.org/abs/2602.01372
Altschuler, J., Weed, J., & Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Advances in Neural Information Processing Systems, 30, 1964–1974.