Entropic Regularization: Sinkhorn Algorithm

Entropic regularization makes optimal transport smooth, strictly convex and scalable. This chapter first explains the discrete KL-regularized problem, derives Sinkhorn’s alternating matrix scaling algorithm, and then rewrites the same construction as a relative-entropy projection problem. It then records the general continuous formulation, develops the dual soft-transform picture, explains the path-space Schrödinger problem behind the static coupling formulation, and presents the main convex regularization variants and the debiased Sinkhorn divergence. A final section records a less standard viewpoint: after fixing the potential gauge, the finite-dimensional Sinkhorn equations admit a local holomorphic continuation to complex values of the temperature.

The presentation connects the older matrix-scaling literature (Sinkhorn, 1964; Sinkhorn & Knopp, 1967; Sinkhorn, 1967) with modern entropic OT (Cuturi, 2013; Peyré & Cuturi, 2019).

from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

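# Locate the directory containing the helper module ot4ml_web.py by checking
# the working directory and a few nearby candidates.
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        sys.path.insert(0, str(myst_dir))  # make the helper module importable
        break

if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

# Thumbnails are resolved relative to the parent of that directory.
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"

def show_book_figure(name, width=760):
    """Display the thumbnail PNG called `name` at the given pixel width."""
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))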

Entropic Regularization for Discrete Measures¶

Entropy turns a possibly non-unique linear program into a unique smooth problem. The price is bias, but the reward is differentiability and fast scaling algorithms.

Using this entropy as a regularizing function gives the approximate transport value

\mathcal{L}_{\C}^{\epsilon}(a,b) \eqdef \min_{\P\in\mathbf{U}(a,b)} \langle \P,\C\rangle - \epsilon H(\P).

(2)

Equivalently, the regularizer is $\epsilon\sum_{i,j}\P_{i,j}\log \P_{i,j}$ . It penalizes concentrated couplings and makes the objective strictly convex on the relative interior of the transport polytope.

Smoothing Effect¶

The entropy acts as a barrier for positivity and makes $\mathcal{L}_{\C}^{\epsilon}(a,b)$ smooth in $a$ , $b$ , and $\C$ as long as these variables stay in the relative interior. As $\epsilon\to+\infty$ , the minimizer converges to the independent coupling $a\otimes b$ ; as $\epsilon\to0$ , it approaches the optimal face of the original transport linear program.

The accompanying figure visualizes this temperature-dependent path and contrasts it with a generic logarithmic barrier on linear-programming slacks.

Entropic regularization and slack barriers. Large $\epsilon$ selects an interior reference point, while small $\epsilon$ moves the minimizer toward a low-cost face of the transport polytope. The second row gives the analogous entropy-on-slacks picture for a generic linear program.

Interactive panel. Move the temperature to see the entropic minimizer travel along the central path from the simplex interior toward the linear-programming vertex.

Entropy Barriers Versus Generic LP Barriers¶

For a generic linear program $\min_z \ell^\top z$ with constraints $Az\le b$ , one can introduce positive slacks $s=b-Az$ and penalize them by an entropy. This is a useful analogy, but it is not the standard self-concordant interior-point barrier. The canonical barrier is the Burg, or reverse-KL, barrier $-\sum_i\log s_i$ , which leads to Newton systems.

Optimal transport is special because entropy is placed on the entries of $\P$ , while the constraints are only row and column marginals. This separable structure turns Bregman projections into diagonal rescalings, giving the Sinkhorn iterations.

Remark: Entropy barriers versus generic LP barriers

For a generic linear program $\min_z \ell^\top z$ subject to $Az\leq b$, one can introduce positive slacks $s=b-Az$ and use an entropy-on-slacks penalty $H(s)=\sum_i s_i(\log s_i-1)$ as a smooth interior regularization. This is a useful analogy for the slack-barrier figure above, but it is not the standard interior-point barrier for linear programming. The canonical barrier on the positive orthant is the Burg, or reverse-KL, logarithmic barrier $-\sum_i\log s_i$; it is self-concordant and therefore fits the Newton theory of interior-point methods (Nesterov & Nemirovskii, 1994). The price is that a generic Newton step solves a dense linear system, leading to cubic per-iteration scaling in the relevant number of variables or constraints. Optimal transport is special: the entropy is placed on the entries of $\P$, while the constraints are only the row and column marginals. This separable structure turns the associated Bregman projections into diagonal rescalings, hence into the Sinkhorn matrix-vector iterations developed next.

Sinkhorn’s Algorithm¶

Sinkhorn’s algorithm is alternating normalization of rows and columns. The key point is that the optimizer of the entropic problem has a multiplicative scaling form.

In matrix notation, $\P=\operatorname{diag}(u)K\operatorname{diag}(v)$ . The marginal constraints become

u\odot(Kv)=a, \qquad v\odot(K^\top u)=b.

(7)

Solving each equation in turn gives Sinkhorn’s algorithm:

u^{(\ell+1)} = \frac{a}{Kv^{(\ell)}}, \qquad v^{(\ell+1)} = \frac{b}{K^\top u^{(\ell+1)}}.

(8)

The division is entrywise. The scaling vectors are not unique: multiplying $u$ by $\lambda>0$ and $v$ by $1/\lambda$ leaves $\P$ unchanged.

The accompanying figure exposes the alternating feasibility mechanism on a small matrix: each row or column normalization enforces one marginal exactly while generally perturbing the other.

Marginal constraints during Sinkhorn scaling. Row normalizations align the red source marginal and leave a blue defect; column normalizations align the blue target marginal and leave a red defect.

The interactive demo exposes the alternating row/column normalization directly. Change the half-step count to see the current coupling acquire one marginal, lose the other, and then converge toward both.

Interactive panel. Use the iteration, regularization, and mass controls to watch Sinkhorn row and column scalings enforce the marginals.

The next figure shows the same alternating projection mechanism on a dense one-dimensional discretization, where the marginal defects appear as continuous side curves.

Dense Sinkhorn scaling for one-dimensional Gaussian-mixture marginals. The violet side curves are the current row and column sums; the red and blue curves are the prescribed marginals.

After convergence, the regularization strength controls how much of the Gibbs kernel remains visible in the optimal plan. Small $\epsilon$ produces a concentrated transport band, while larger $\epsilon$ spreads the same marginals into a smoother coupling.

The accompanying figure compares these converged plans at four temperatures while keeping both marginals fixed.

Final Sinkhorn couplings for the same one-dimensional marginals and four regularization strengths. Decreasing $\epsilon$ sharpens the plan toward an optimal-transport graph; increasing $\epsilon$ keeps more of the product structure.

Before that, the accompanying figure tracks the same Sinkhorn scaling in dual variables.

KL-normalized dual potentials along the scaling iteration. The logarithmic scaling potentials stabilize as the row/column normalizations converge.

The next interactive demo keeps the iteration count high and varies the temperature. It is the quickest way to see the geometry-bias tradeoff: low temperature is geometric and sharp, high temperature is smooth and closer to independence.

Interactive panel. Use the regularization slider to compare sparse exact-looking couplings with smoother entropic plans and potentials.

Complexity bounds for Sinkhorn and comparisons with accelerated first-order methods are discussed in Altschuler et al. (2017), Dvurechensky et al. (2018), and Knight (2008). For a dense $n\times m$ problem, each iteration costs one multiplication by $K$ and one by $K^\top$, so the total cost scales like $L\,nm$ for $L$ iterations. For fixed positive $\epsilon$, the marginal error eventually enters a linear convergence regime, but small $\epsilon$ makes the Gibbs kernel more peaked and the scaling problem harder.

The accompanying figure complements the complexity discussion by plotting the marginal defect across half-steps and showing how smaller temperatures slow the observed linear regime.

Marginal violation along Sinkhorn half-steps for several values of $\epsilon$ . Smaller $\epsilon$ gives sharper transport geometry but slower scaling.

Interactive panel. Vary $\epsilon$ and the conditioning parameters to compare observed residual decay with Hilbert-metric convergence guides.

Remark: Separable Gaussian kernels on grids

When the samples lie on a Cartesian grid and $c(x,y)=\norm{x-y}^2$ , the Gibbs kernel is Gaussian and factorizes along coordinates. If the grid has $q$ points per axis in dimension $d$ , so that $N=q^d$ grid points are used, then

K(x,y)=\exp\!\left(-\frac{\norm{x-y}^2}{\epsilon}\right) = \prod_{\ell=1}^d \exp\!\left(-\frac{(x_\ell-y_\ell)^2}{\epsilon}\right).

(9)

Multiplication by $K$ can therefore be applied by successively multiplying along each coordinate direction, equivalently by applying one-dimensional Gaussian kernel operators along the axes. On a periodic or sufficiently padded uniform grid these are literal discrete convolutions. A direct dense one-dimensional multiplication costs $O(q^2)$ on each of the $q^{d-1}$ coordinate lines, and this is repeated for $d$ axes. Hence one Sinkhorn half-step costs

O(d\,q^{d+1})=O(d\,N^{1+1/d})

(10)

instead of $O(N^2)$ . With FFT-based or truncated Gaussian convolutions, the same separability can be pushed further, but the simple tensor-product estimate already explains why grid-based Sinkhorn can scale much better than a generic dense coupling.
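The axis-by-axis multiplication described in this remark can be sketched in NumPy as follows. This is a minimal illustration, assuming a uniform grid on $[0,1]^d$; the helper name `apply_K_separable` and the sizes `q`, `d`, `eps` are chosen here for the example and are not part of the text. The dense kernel is built only to cross-check the result on a small grid.

```python
import numpy as np

q, d, eps = 8, 3, 0.1                      # illustrative sizes (assumptions)
grid = np.linspace(0.0, 1.0, q)
# One-dimensional Gaussian Gibbs kernel, shape (q, q).
K1 = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / eps)

def apply_K_separable(V, K1):
    """Apply the d-dimensional Gaussian kernel to a (q,)*d array, one axis at a time."""
    out = V
    for axis in range(V.ndim):
        # Contract K1 with the current axis, then move the new axis back in place.
        out = np.moveaxis(np.tensordot(K1, out, axes=([1], [axis])), 0, axis)
    return out

rng = np.random.default_rng(0)
V = rng.random((q,) * d)
fast = apply_K_separable(V, K1)            # d passes, each costing O(q^{d+1})

# Dense O(N^2) reference with N = q^d, feasible only because the grid is tiny.
pts = np.stack(np.meshgrid(*[grid] * d, indexing="ij"), axis=-1).reshape(-1, d)
K_dense = np.exp(-((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1) / eps)
dense = (K_dense @ V.ravel()).reshape(V.shape)

print(np.max(np.abs(fast - dense)))
```

The separable version never forms the $N\times N$ matrix, which is the point of the estimate (10); replacing each one-dimensional product by an FFT-based convolution would be a further refinement not shown here.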

Algorithm: Sinkhorn scaling

Input: Positive weights $\a,\b$ , cost matrix $\C$ , regularization $\epsilon>0$ , tolerance $\mathrm{tol}$ .

Output: Entropic coupling $\P$ .

Initialize: Set $\K_{ij}=e^{-\C_{ij}/\epsilon}$ , $\vD^{(0)}=\ones_m$ , $r^{(0)}=+\infty$ , and $\ell=0$ .

While $r^{(\ell)}>\mathrm{tol}$ do:

Set $\ell\leftarrow \ell+1$ .
$\uD^{(\ell)}=\frac{\a}{\K\vD^{(\ell-1)}}.$
$\vD^{(\ell)}=\frac{\b}{\transp{\K}\uD^{(\ell)}}.$
$\P^{(\ell)}=\diag(\uD^{(\ell)})\K\diag(\vD^{(\ell)}).$
Set $r^{(\ell)}=\max\{\norm{\P^{(\ell)}\ones_m-\a}_1,\norm{(\P^{(\ell)})^\top\ones_n-\b}_1\}$ .

Return $\P^{(\ell)}$ .
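The scaling algorithm above translates almost line by line into NumPy. The sketch below is one possible transcription, not a reference implementation: the function name `sinkhorn_scaling`, the iteration cap `max_iter`, and the toy point clouds are assumptions made for the example.

```python
import numpy as np

def sinkhorn_scaling(a, b, C, eps, tol=1e-9, max_iter=10_000):
    """Matrix-scaling Sinkhorn; returns the coupling P and the scalings (u, v)."""
    K = np.exp(-C / eps)                  # Gibbs kernel K_ij = exp(-C_ij / eps)
    v = np.ones_like(b)
    for _ in range(max_iter):             # max_iter is a safety cap (assumption)
        u = a / (K @ v)                   # enforce the row marginal a
        v = b / (K.T @ u)                 # enforce the column marginal b
        P = u[:, None] * K * v[None, :]   # diag(u) K diag(v)
        r = max(np.abs(P.sum(axis=1) - a).sum(),
                np.abs(P.sum(axis=0) - b).sum())
        if r <= tol:
            break
    return P, u, v

# Toy problem: two small point clouds in the unit square with squared Euclidean cost.
rng = np.random.default_rng(0)
x, y = rng.random((5, 2)), rng.random((7, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
a, b = np.full(5, 1 / 5), np.full(7, 1 / 7)
eps = 0.2
P, u, v = sinkhorn_scaling(a, b, C, eps)

# Gauge invariance: rescaling (u, v) -> (2u, v/2) leaves the coupling unchanged.
P_gauge = (2.0 * u)[:, None] * np.exp(-C / eps) * (v / 2.0)[None, :]
print(np.abs(P.sum(axis=1) - a).sum(), np.abs(P - P_gauge).max())
```

Forming $\P^{(\ell)}$ at every iteration mirrors the pseudocode; in practice one would more likely monitor $\uD\odot(\K\vD)-\a$ directly and build the coupling only once at the end.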

Reformulation Using Relative Entropy¶

The KL formulation identifies Sinkhorn as a projection method. It also prepares the continuous and unbalanced settings, where a reference measure is essential.

Relative Entropy¶

A convenient tool to reformulate and normalize discrete entropy is relative entropy. It turns entropy regularization into a finite-dimensional projection problem and admits a direct measure-theoretic extension.

For matrices with the same total mass, the affine terms cancel and

\operatorname{KL}(P|Q) = \sum_{i,j}P_{i,j}\log\frac{P_{i,j}}{Q_{i,j}}.

(12)

On fixed-mass couplings, taking $Q=\mathbf 1_{n\times m}$ is equivalent to subtracting the Shannon–Boltzmann entropy.

KL Reformulation of Regularized OT¶

Choosing the tensor product $\a\otimes\b=(\a_i\b_j)_{i,j}$ as reference measure leads to the normalized problem

\min_{\P\in\CouplingsD(\a,\b)} \langle \P,\C\rangle + \epsilon\KLD(\P|\a\otimes\b).

(14)

For every $\P\in\CouplingsD(\a,\b)$ ,

\langle \P,\C\rangle+\epsilon\KLD(\P|\a\otimes\b) = \langle \P,\C\rangle-\epsilon\HD(\P) +\epsilon\bigl(\HD(\a)+\HD(\b)\bigr).

(15)

Hence this problem has exactly the same minimizer as the original entropic OT problem, while its optimal value is $\MKD_\C^\epsilon(\a,\b)+\epsilon(\HD(\a)+\HD(\b))$ . The normalization becomes substantive in unbalanced OT, where changing the reference measure is no longer merely an additive shift.

The tensor-product reference is nevertheless useful when supports vary. It makes explicit which entries may vanish and passes cleanly to the continuous formulation.
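Identity (15) is easy to confirm numerically. The sketch below assumes the convention $\HD(\P)=-\sum_{i,j}\P_{i,j}\log\P_{i,j}$, which matches the regularizer $\epsilon\sum_{i,j}\P_{i,j}\log\P_{i,j}$ quoted earlier; with a different additive normalization of the entropy the two sides would differ by a constant. The coupling is an arbitrary positive element of the transport polytope obtained by alternately rescaling a random matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random(4); a /= a.sum()
b = rng.random(5); b /= b.sum()
C = rng.random((4, 5))
eps = 0.3

# Any strictly positive coupling works: scale a random positive matrix to marginals (a, b).
P = rng.random((4, 5)) + 0.1
for _ in range(500):
    P *= (a / P.sum(axis=1))[:, None]
    P *= (b / P.sum(axis=0))[None, :]

def H(p):
    """Shannon entropy with the assumed convention H(p) = -sum p log p."""
    return -np.sum(p * np.log(p))

kl = np.sum(P * np.log(P / np.outer(a, b)))     # KL(P | a ⊗ b), equal total mass

lhs = np.sum(P * C) + eps * kl
rhs = np.sum(P * C) - eps * H(P) + eps * (H(a) + H(b))
print(lhs, rhs)
```

Because the difference between the two objectives does not depend on $\P$, both problems share the same minimizer, as stated above.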

The accompanying figure shows how the corresponding KL-normalized dual potentials deform with temperature, from nearly hard Kantorovich potentials to smoother log-sum-exp profiles.

KL-normalized Sinkhorn dual potentials for one-dimensional Gaussian-mixture histograms. For $\epsilon=0.010$ the curves are already close to the unregularized one-dimensional Kantorovich potentials; increasing $\epsilon$ turns this hard $c$ -transform geometry into smoother log-sum-exp potentials.

Interactive panel. Change $\epsilon$ to compare the dual potentials with the corresponding entropic coupling.

The next figure illustrates the two limiting regimes established above: the plan approaches a sparse optimal coupling as $\epsilon\downarrow0$ and the product coupling as $\epsilon$ grows.

Entropically regularized couplings between the red disk and blue annulus point clouds. The plans are strictly positive for every $\epsilon>0$ , but the visible mass pattern evolves from nearly radial and sparse to diffuse as $\epsilon$ increases.

Interactive panel. Use the same temperature control to see positivity, diffusion, and sharpening of entropic couplings in a one-dimensional setting.

General Formulation¶

The continuous formulation replaces matrices by measures and discrete KL by relative entropy. This section records the measure-theoretic problem, explains how the temperature $\epsilon$ connects exact transport to the independent product coupling, and states the two asymptotic regimes that are useful later: a large-temperature expansion around independence and a small-temperature expansion around quadratic optimal transport.

Measure Formulation¶

The only structural change from the discrete problem is that matrix entries are replaced by densities with respect to a product reference measure. For probability measures $\alpha$ and $\beta$ , define

\mathcal{L}_{c}^{\epsilon}(\alpha,\beta) \eqdef \min_{\pi\in\Couplings(\alpha,\beta)} \int_{\X\times\Y}c(x,y)\,\d\pi(x,y) + \epsilon\operatorname{KL}(\pi|\alpha\otimes\beta).

(20)

For fixed balanced marginals, the specific product reference only matters up to additive constants, provided the reference marginals are mutually absolutely continuous with $\alpha$ and $\beta$ . Its support still matters: it determines which couplings have finite entropy.

Probabilistic Interpretation¶

With this terminology, the entropic problem is

\inf_{X\sim\alpha,\;Y\sim\beta} \mathbb E(c(X,Y))+\epsilon\mathcal I(X,Y).

(23)

Large $\epsilon$ favors nearly independent endpoints, while small $\epsilon$ suppresses endpoint randomness and recovers an optimal Monge–Kantorovich coupling in the limit. When the unregularized quadratic problem has a Brenier map, this limiting coupling is deterministic.

Sinkhorn for General Measures¶

The multiplicative scaling structure extends from matrices to positive functions. Define the Gibbs kernel and its two integral operators by

\begin{aligned} k_\epsilon(x,y)&\eqdef\exp\!\left(-\frac{c(x,y)}{\epsilon}\right),\\ (\mathcal K_\epsilon v)(x) &\eqdef\int_\Yy k_\epsilon(x,y)v(y)\d\be(y), & (\mathcal K_\epsilon^*u)(y) &\eqdef\int_\Xx k_\epsilon(x,y)u(x)\d\al(x). \end{aligned}

(24)

Fubini’s theorem gives the adjoint identity $\int_\Xx u\mathcal K_\epsilon v\d\al =\int_\Yy v\mathcal K_\epsilon^*u\d\be$. For compact marginal supports and continuous $c$, Propositions “Continuous Entropic Duality” and “Existence and Uniqueness of Entropic Dual Potentials” provide optimal dual potentials $(f_\epsilon,g_\epsilon)$. Set $u_\epsilon=e^{f_\epsilon/\epsilon}$ and $v_\epsilon=e^{g_\epsilon/\epsilon}$. The continuous density law (54) then becomes

\frac{\d\pi_\epsilon}{\d(\al\otimes\be)}(x,y) =u_\epsilon(x)k_\epsilon(x,y)v_\epsilon(y).

(25)

Because $\al\otimes\be$ already contains the prescribed marginals, their target densities with respect to $\al$ and $\be$ are both one. Thus

u_\epsilon\mathcal K_\epsilon v_\epsilon=1 \quad \al\text{-a.e.}, \qquad v_\epsilon\mathcal K_\epsilon^*u_\epsilon=1 \quad \be\text{-a.e.}

(26)

Starting, for instance, from $v^{(0)}=1$ , continuous Sinkhorn alternately enforces these two identities:

u^{(\ell+1)}=\frac{1}{\mathcal K_\epsilon v^{(\ell)}}, \qquad v^{(\ell+1)}=\frac{1}{\mathcal K_\epsilon^*u^{(\ell+1)}}.

(27)

Equivalently, pointwise,

\begin{aligned} u^{(\ell+1)}(x) &=\frac{1}{\displaystyle\int_\Yy k_\epsilon(x,y)v^{(\ell)}(y)\d\be(y)},\\ v^{(\ell+1)}(y) &=\frac{1}{\displaystyle\int_\Xx k_\epsilon(x,y)u^{(\ell+1)}(x)\d\al(x)}. \end{aligned}

(28)

Given $v^{(\ell)}$ , the first update produces an intermediate coupling with $\Xx$ -marginal $\al$ ; the second produces one with $\Yy$ -marginal $\be$ , while generally perturbing the first marginal again. The scalings retain the gauge $(u,v)\mapsto(\lambda u,v/\lambda)$ .

There is one useful situation in which no iteration is required: choose the target by applying the normalized Gibbs kernel itself to the source.

Proposition: Closed-Form Gibbs Coupling

Let $\beta_0$ be a sigma-finite reference measure on $\Y$ and suppose that

Z_\epsilon(x) \eqdef \int_\Y k_\epsilon(x,y)\d\beta_0(y) \in(0,+\infty) \qquad\text{for $\alpha$-a.e. }x.

(29)

Define the normalized Gibbs transition and its output density by

p_\epsilon(x,y) \eqdef \frac{k_\epsilon(x,y)}{Z_\epsilon(x)}, \qquad q_\epsilon(y) \eqdef \int_\X p_\epsilon(x,y)\d\alpha(x), \qquad \d\beta_\epsilon=q_\epsilon\d\beta_0.

(30)

Assume that $\log Z_\epsilon\in L^1(\alpha)$ , $\log q_\epsilon\in L^1(\beta_\epsilon)$ , and the plan below has finite entropic objective. Then the unique solution of the entropic problem between $\alpha$ and $\beta_\epsilon$ is

\d\pi_\epsilon(x,y) = p_\epsilon(x,y)\d\alpha(x)\d\beta_0(y).

(31)

Relative to $\alpha\otimes\beta_\epsilon$ , its Sinkhorn scalings are

\frac{\d\pi_\epsilon}{\d(\alpha\otimes\beta_\epsilon)}(x,y) = \frac{k_\epsilon(x,y)}{Z_\epsilon(x)q_\epsilon(y)} = u_\epsilon(x)k_\epsilon(x,y)v_\epsilon(y), \qquad u_\epsilon=\frac1{Z_\epsilon}, \quad v_\epsilon=\frac1{q_\epsilon}.

(32)

If $k_\epsilon(x,\cdot)$ is already normalized with respect to $\beta_0$ , then $Z_\epsilon=1$ and (30) reduces to

\frac{\d\beta_\epsilon}{\d\beta_0}(y) = \int_\X k_\epsilon(x,y)\d\alpha(x).

(34)

For example, let $\X=\Y=\RR^d$ , let $\beta_0$ be Lebesgue measure, and take $c(x,y)=\norm{x-y}^2$ . Then $Z_\epsilon=(\pi\epsilon)^{d/2}$ and

p_\epsilon(x,y) = (\pi\epsilon)^{-d/2}e^{-\norm{x-y}^2/\epsilon}, \qquad \beta_\epsilon = \alpha*\Gaussian\!\left(0,\frac{\epsilon}{2}\Id\right).

(35)

Equivalently, the closed-form coupling is the law of $(X,X+\sqrt{\epsilon/2}\,G)$ for independent $X\sim\alpha$ and $G\sim\Gaussian(0,\Id)$. Time-indexed Gaussian blurrings are the forward noising mechanism behind diffusion models, with an additional deterministic rescaling for variance-preserving Ornstein–Uhlenbeck schedules; see Connection with diffusion models.
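The closed-form statement can be checked on a discretization. In the sketch below, $\beta_0$ is approximated by Riemann weights on a one-dimensional grid (an assumption of the example, as are the sizes and the support of $\alpha$), and the claimed scalings $u_\epsilon=1/Z_\epsilon$, $v_\epsilon=1/q_\epsilon$ are verified to satisfy the fixed-point equations (26) without any iteration.

```python
import numpy as np

eps = 0.2
x = np.linspace(-1.0, 1.0, 6)            # support of alpha (illustrative)
y = np.linspace(-4.0, 4.0, 200)          # grid carrying the reference measure beta_0
a = np.full(6, 1 / 6)                    # weights of alpha
w = np.full(200, y[1] - y[0])            # beta_0: Riemann weights for Lebesgue measure

k = np.exp(-(x[:, None] - y[None, :]) ** 2 / eps)   # Gibbs kernel k_eps(x_i, y_j)
Z = k @ w                                # Z_eps(x_i)
p = k / Z[:, None]                       # normalized Gibbs transition p_eps
q_dens = p.T @ a                         # output density q_eps with respect to beta_0
beta = q_dens * w                        # weights of beta_eps

pi = p * a[:, None] * w[None, :]         # closed-form coupling (31)
u, v = 1.0 / Z, 1.0 / q_dens             # claimed scalings relative to alpha ⊗ beta_eps

# Fixed-point equations (26): u * K v = 1 and v * K^* u = 1.
row_check = u * (k @ (v * beta))
col_check = v * (k.T @ (u * a))
print(Z[0], np.sqrt(np.pi * eps))
```

Since the grid comfortably contains the Gaussian tails, the discrete normalizer is very close to the value $(\pi\epsilon)^{1/2}$ given above for $d=1$.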

For discrete measures, setting $\uD_i=\a_i u(x_i)$ and $\vD_j=\b_jv(y_j)$ gives $\P_{i,j}=\a_i\b_j u(x_i)\K_{i,j}v(y_j)=\uD_i\K_{i,j}\vD_j$ , so the functional iteration reduces exactly to the matrix Sinkhorn iteration (8). Its logarithmic interpretation as continuous dual block ascent is derived in Dual Sinkhorn for General Measures. Continuous convergence is revisited through a generalized Fortet-type monotonicity argument in Section Sinkhorn Convergence: Monotone Point of View; the finite-dimensional linear rate is studied through Hilbert’s metric in Section Sinkhorn Convergence: Linear Hilbert Metric Rate.

Convergence with $\epsilon$ ¶

The continuous problem has the same qualitative temperature limits as the finite-dimensional problem, but the zero-temperature selection is subtler. For quadratic transport between smooth densities, the limiting OT plan is typically supported on a graph and is therefore singular with respect to $\alpha\otimes\beta$. Thus the robust statement is weak convergence of minimizers. This is the standard $\Gamma$-convergence mechanism for entropic OT (Léonard, 2012; Carlier et al., 2017); the density hypothesis below isolates the only approximation point needed in the proof.

Proposition: Convergence with $\epsilon$ for measures

Let $\X$ and $\Y$ be compact metric spaces, let $c\in C(\X\times\Y)$ , and let $\alpha\in\mathcal P(\X)$ and $\beta\in\mathcal P(\Y)$ . Assume that finite-entropy couplings are dense in $\Couplings(\alpha,\beta)$ for weak convergence with convergence of the cost integral. If $\pi_\epsilon$ minimizes (20), then

\mathcal{L}_{c}^{\epsilon}(\alpha,\beta) \longrightarrow \mathcal{L}_{c}(\alpha,\beta) \qquad(\epsilon\downarrow0),

(36)

and every weak cluster point of $\pi_\epsilon$ is an exact optimal plan. If the exact optimal plan is unique, then the whole sequence converges to it. In particular, for $c(x,y)=\|x-y\|^2$ and $\alpha$ absolutely continuous, $\pi_\epsilon\rightharpoonup(\mathrm{Id},T)_\sharp\alpha$ , where $T$ is the Brenier map.

As $\epsilon\to+\infty$ ,

\pi_\epsilon\to\alpha\otimes\beta \quad\text{in total variation}, \qquad \mathcal{L}_{c}^{\epsilon}(\alpha,\beta) \to \int_{\X\times\Y}c(x,y)\,\d\alpha(x)\d\beta(y).

(37)

The proof is the standard $\Gamma$-convergence argument: the entropy is nonnegative, finite-entropy couplings provide recovery sequences, and Pinsker’s inequality (Theorem: Pinsker Inequality) turns the large-$\epsilon$ entropy bound into total-variation convergence.
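The large-$\epsilon$ part of this argument can be illustrated on a small discrete problem. Comparing the minimizer with the product coupling gives $\operatorname{KL}(\pi_\epsilon|\alpha\otimes\beta)\le(\bar c-\min c)/\epsilon$, where $\bar c$ denotes the mean cost under $\alpha\otimes\beta$, and Pinsker’s inequality in the form $\|\pi_\epsilon-\alpha\otimes\beta\|_1\le\sqrt{2\operatorname{KL}}$ then controls the distance. The sketch below uses this $\ell^1$ normalization, which may differ by a factor from the total-variation convention of the cited theorem; the helper `entropic_plan` and the random cost are assumptions of the example.

```python
import numpy as np

def entropic_plan(a, b, C, eps, n_iter=2000):
    """Plain Sinkhorn scaling with a fixed number of iterations (illustrative helper)."""
    K = np.exp(-C / eps)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(6)
a, b = np.full(5, 1 / 5), np.full(6, 1 / 6)
C = rng.random((5, 6))
r = np.outer(a, b)                                   # product coupling

rows = []
for eps in [1.0, 10.0, 100.0]:
    P = entropic_plan(a, b, C, eps)
    l1 = np.abs(P - r).sum()                         # l1 distance to independence
    kl = np.sum(P * np.log(P / r))                   # KL(P | a ⊗ b)
    kl_bound = (np.sum(r * C) - C.min()) / eps       # entropy bound from optimality
    rows.append((eps, l1, kl, kl_bound))
    print(f"eps={eps:6.1f}  l1={l1:.3e}  sqrt(2 KL)={np.sqrt(2 * kl):.3e}  bound={kl_bound:.3e}")
```

The printed distances shrink as $\epsilon$ grows, in line with the convergence (37).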

Large-Temperature Expansion¶

When $\epsilon$ is large, the entropy dominates and the optimal plan is a small perturbation of the product coupling. The useful object is the part of the cost that cannot be absorbed into row and column potentials. Let $r=\alpha\otimes\beta$ and define

\bar c_\X(x)=\int_\Y c(x,y)\,\d\beta(y),\qquad \bar c_\Y(y)=\int_\X c(x,y)\,\d\alpha(x),\qquad \bar c=\int_{\X\times\Y}c\,\d r,

(38)

and

c_0(x,y)=c(x,y)-\bar c_\X(x)-\bar c_\Y(y)+\bar c.

(39)

Proposition: Large-temperature expansion

Assume $c\in L^\infty(r)$ and that the large-temperature branch $p_\epsilon=\d\pi_\epsilon/\d r$ admits an expansion to third order in $\epsilon^{-1}$ near 0 in $L^\infty(r)$ . Then

p_\epsilon(x,y) = 1-\frac{c_0(x,y)}{\epsilon} +O(\epsilon^{-2}) \quad\text{in }L^2(r),

(40)

and

\mathcal{L}_{c}^{\epsilon}(\alpha,\beta) = \bar c - \frac{1}{2\epsilon} \int_{\X\times\Y}c_0(x,y)^2\,\d r(x,y) + \frac{1}{6\epsilon^2} \int_{\X\times\Y}c_0(x,y)^3\,\d r(x,y) +O(\epsilon^{-3}).

(41)

With

A(x)=\int_\Y c_0(x,y)^2\,\d\beta(y),\qquad B(y)=\int_\X c_0(x,y)^2\,\d\alpha(x),\qquad \sigma^2=\int_{\X\times\Y}c_0^2\,\d r,

(42)

and the gauge $\int g_\epsilon\,\d\beta=0$ , the corresponding potentials satisfy

f_\epsilon(x) = \bar c_\X(x)-\frac{A(x)}{2\epsilon} +O(\epsilon^{-2}), \qquad g_\epsilon(y) = \bar c_\Y(y)-\bar c +\frac{\sigma^2-B(y)}{2\epsilon} +O(\epsilon^{-2}).

(43)

The coefficient $c_0$ has zero conditional means. The second-order term follows by expanding the constrained exponential tilt and using this conditional orthogonality.
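A quick numerical sanity check of the first terms of this expansion is sketched below on a random discrete problem. The reference value is computed by plain Sinkhorn scaling at $\epsilon=50$ (a choice made for the example), and the double-centered cost $c_0$ is formed as in (39). Only the terms up to order $\epsilon^{-1}$ are compared; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, eps = 6, 8, 50.0
a, b = np.full(n, 1 / n), np.full(m, 1 / m)
C = rng.random((n, m))
r = np.outer(a, b)                                   # reference r = a ⊗ b

# Reference solution by Sinkhorn scaling (well conditioned at large eps).
K = np.exp(-C / eps)
v = np.ones(m)
for _ in range(200):
    u = a / (K @ v)
    v = b / (K.T @ u)
P = u[:, None] * K * v[None, :]
value = np.sum(P * C) + eps * np.sum(P * np.log(P / r))

# Double-centered cost c_0 from (38)-(39).
c_bar_x = C @ b
c_bar_y = C.T @ a
c_bar = a @ C @ b
c0 = C - c_bar_x[:, None] - c_bar_y[None, :] + c_bar
sigma2 = np.sum(c0 ** 2 * r)

err_zeroth = abs(value - c_bar)                          # keeps only the mean cost
err_first = abs(value - (c_bar - sigma2 / (2 * eps)))    # adds the 1/eps correction
dens_err = np.max(np.abs(P / r - (1 - c0 / eps)))        # density expansion (40)
print(err_zeroth, err_first, dens_err)
```

Including the $-\sigma^2/(2\epsilon)$ term should reduce the error by roughly a factor of order $\epsilon^{-1}$, which is what the two printed errors are meant to show.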

Small-Temperature Expansion for Smooth Densities¶

At small temperature, entropic transport is a viscous perturbation of quadratic optimal transport. The expansion contains an $\epsilon\log\epsilon$ term from the Gaussian normalization of Brownian bridges, an endpoint entropy correction, and a Fisher-information term along the McCann interpolation. The formula below translates the small-noise Schrödinger expansion to the convention $\|x-y\|^2+\epsilon\operatorname{KL}(\cdot|\alpha\otimes\beta)$ (Conforti & Tamanini, 2021; Chizat et al., 2020).

Proposition: Small-temperature quadratic expansion

Let $\alpha=\rho_0\,\d x$ and $\beta=\rho_1\,\d x$ be probability measures on $\RR^d$ with bounded compactly supported densities. Let $\alpha_t=\rho_t\,\d x$ be their quadratic displacement interpolation and assume

\mathcal I_{\mathrm{geo}}(\alpha,\beta) = \int_0^1\int_{\RR^d} \|\nabla\log\rho_t(x)\|^2\rho_t(x)\,\d x\,\d t <+\infty.

(44)

For $c(x,y)=\|x-y\|^2$ and $\mathrm H(\alpha)=\int_{\RR^d}\rho_0\log\rho_0\,\d x$ ,

\mathcal{L}_{\|\cdot\|^2}^{\epsilon}(\alpha,\beta) = \mathcal W_2^2(\alpha,\beta) - \frac{d\epsilon}{2}\log(\pi\epsilon) - \frac{\epsilon}{2} \left(\mathrm H(\alpha)+\mathrm H(\beta)\right) + \frac{\epsilon^2}{16}\mathcal I_{\mathrm{geo}}(\alpha,\beta) + o(\epsilon^2).

(45)

If, in addition, the endpoint densities are smooth and positive on their supports and the optimal map is a smooth non-degenerate diffeomorphism, then normalized Sinkhorn potentials converge locally uniformly to Kantorovich potentials on the interiors of the supports Nutz & Wiesel, 2022. In the gauge $\int g_\epsilon\,\d\beta=0$ , the scalar part satisfies

\int f_\epsilon\,\d\alpha = \mathcal L_{\|\cdot\|^2}^{\epsilon}(\alpha,\beta),

(46)

so it has the displayed expansion. Spatial order- $\epsilon$ corrections come from the Laplace prefactors in the soft $c$ -transform equations.

The Brownian entropy used in the proof is relative to the sigma-finite endpoint measure $p_T(x,y)\,\d x\d y$ , and therefore uses $\mathscr H(\pi|\xi)=\int\log(\d\pi/\d\xi)\,\d\pi$ rather than the finite-measure generalized KL above. This distinction is what fixes the Gaussian normalization and the $\epsilon\log\epsilon$ coefficient.

Dual of Sinkhorn¶

The dual point of view replaces couplings by potentials and soft $c$ -transforms. It is the right formulation for stabilized implementations and differentiation.

Discrete Dual¶

The KL-normalized problem has the dual

\min_{\P\in\mathbf U(a,b)} \langle \P,\C\rangle+\epsilon\operatorname{KL}(\P|a\otimes b) = \max_{f,g} \left[ \langle f,a\rangle+\langle g,b\rangle - \epsilon\sum_{i,j} \exp\left(\frac{f_i+g_j-\C_{i,j}}{\epsilon}\right)a_i b_j + \epsilon \right].

(47)

The optimal potentials are linked to the scaling variables through

u_i=a_i e^{f_i/\epsilon}, \qquad v_j=b_j e^{g_j/\epsilon}.

(48)

Discrete Soft $c$ -Transforms¶

For fixed $g$ , maximizing the dual with respect to $f$ gives

f_i = -\epsilon\log \sum_j \exp\left(\frac{g_j-\C_{i,j}}{\epsilon}\right)b_j.

(49)

This is a smoothed minimum.

Exponentiating the alternating soft-transform iterations recovers Sinkhorn’s algorithm. For small $\epsilon$, the log-sum-exp terms must be computed with the usual stabilization trick: subtract the largest exponent (equivalently, the minimum of $\C_{i,j}-g_j$ over $j$) before exponentiating and add it back afterward.
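A stabilized evaluation of the soft transform (49) might look as follows. This is a sketch with hypothetical names (`soft_c_transform`, random test data); it shifts each row by its largest exponent, so no exponential exceeds one even at very small temperature.

```python
import numpy as np

def soft_c_transform(g, C, b, eps):
    """f_i = -eps * log sum_j b_j exp((g_j - C_ij) / eps), computed stably."""
    S = (g[None, :] - C) / eps                   # exponents, shape (n, m)
    S_max = S.max(axis=1, keepdims=True)         # largest exponent in each row
    lse = S_max[:, 0] + np.log(np.exp(S - S_max) @ b)
    return -eps * lse

rng = np.random.default_rng(4)
C = rng.random((5, 7))
g = rng.normal(size=7)
b = np.full(7, 1 / 7)

hard = (C - g[None, :]).min(axis=1)              # hard c-transform: min_j C_ij - g_j
eps_small = 1e-4
soft_small = soft_c_transform(g, C, b, eps_small)
soft_large = soft_c_transform(g, C, b, 1.0)
print(np.max(soft_small - hard), np.max(soft_large - hard))
```

For uniform weights on $m$ points, the shifted sum lies between $1/m$ and $1$, so the soft transform exceeds the hard one by at most $\epsilon\log m$; this is the sense in which (49) is a smoothed minimum.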

The accompanying figure visualizes the corresponding soft minimum: decreasing $\epsilon$ sharpens the smooth best response toward the hard $c$-transform envelope.

Soft $c$ -transforms for decreasing temperatures. A positive $\epsilon$ replaces the hard lower envelope by a log-sum-exp soft minimum.

Interactive panel. Use the epsilon and potential controls to see how the hard c-transform is softened by log-sum-exp smoothing.

Continuous Dual and Soft-Transforms¶

The continuous formula follows from the same entropy conjugacy as its matrix counterpart; it is not a discretization heuristic.

Proposition: Continuous Entropic Duality

Let $\X$ and $\Y$ be compact metric spaces, let $\alpha\in\mathcal P(\X)$ and $\beta\in\mathcal P(\Y)$ , and let $c\in\mathcal C(\X\times\Y)$ . For every $\epsilon>0$ , the continuous KL-regularized problem satisfies

\mathcal{L}_{c}^{\epsilon}(\alpha,\beta) = \sup_{f,g} \mathcal D_\epsilon(f,g),

(52)

where

\mathcal D_\epsilon(f,g) = \int f\,\d\alpha+\int g\,\d\beta - \epsilon \int \left( e^{(f(x)+g(y)-c(x,y))/\epsilon} - 1 \right) \d\alpha(x)\d\beta(y).

(53)

If $\pi^\star$ and $(f^\star,g^\star)$ are primal and dual optimizers, then

\frac{\d\pi^\star}{\d(\alpha\otimes\beta)}(x,y) = \exp\!\left( \frac{f^\star(x)+g^\star(y)-c(x,y)}{\epsilon} \right) \qquad (\alpha\otimes\beta)\text{-a.e.}

(54)

This is the smooth counterpart of the hard feasibility constraint $f\oplus g\le c$ from the Kantorovich dual.

Remark: Convexity properties of soft transforms

The log-sum-exp part behaves like a smoothed maximum and preserves convexity. Since the soft transform takes the negative of this quantity after inserting the cost, it preserves the usual $c$ -concavity structure. In particular, for the bilinear cost $c(x,y)=-\dotp{x}{y}$ , the transform $f^{c,\epsilon}$ is concave for any $f$ . Therefore, for the quadratic cost $c(x,y)=\norm{x-y}^2/2$ , the optimal potentials have the form $f^\star(x)=\norm{x}^2/2-\phi^\star(x)$ and $g^\star(y)=\norm{y}^2/2-\psi^\star(y)$ , where $\phi^\star$ and $\psi^\star$ are convex.

Dual Sinkhorn for General Measures¶

The soft transforms are not only regularized analogues of hard $c$ -transforms: they are the exact block-maximization steps of the continuous dual objective (53). Indeed, for fixed $g$ and $h\in\Cc(\X)$ ,

\left.\frac{\d}{\d s}\mathcal D_\epsilon(f+s h,g)\right|_{s=0} = \int_\X h(x)\left[ 1-e^{f(x)/\epsilon} \int_\Y e^{(g(y)-c(x,y))/\epsilon}\d\beta(y) \right]\d\alpha(x).

(61)

Hence exact maximization first over $f$ and then over $g$ gives

f^{(\ell+1)}=(g^{(\ell)})^{\bar c,\epsilon}, \qquad g^{(\ell+1)}=(f^{(\ell+1)})^{c,\epsilon}.

(62)

The transforms are those of Definition: Continuous Soft $c$-Transforms. Their decorations record their domains: the $\bar c$-transform sends a potential on $\Y$ to one on $\X$, whereas the $c$-transform sends a potential on $\X$ to one on $\Y$.

This dual iteration is exactly the logarithmic form of the continuous scaling iteration (27). Set $u^{(\ell)}=e^{f^{(\ell)}/\epsilon}$ and $v^{(\ell)}=e^{g^{(\ell)}/\epsilon}$ . Exponentiating the dual updates and using the kernel operators (24) gives

u^{(\ell+1)}=\frac{1}{\mathcal K_\epsilon v^{(\ell)}}, \qquad v^{(\ell+1)}=\frac{1}{\mathcal K_\epsilon^*u^{(\ell+1)}}.

(63)

The coupling density reconstructed from the current potentials is therefore $u^{(\ell)}(x)k_\epsilon(x,y)v^{(\ell)}(y)$ with respect to $\alpha\otimes\beta$ , as in (25). At a fixed point, its two marginals are $\alpha$ and $\beta$ ; equivalently, the potentials jointly maximize the continuous dual.

Neural Dual Solvers¶

The convex-potential structure above suggests a sample-based alternative to evaluating soft transforms on a grid or on all pairs of samples. For the bilinear cost $c(x,y)=-\dotp{x}{y}$, the sign-flipped dual potentials are convex: writing $\Phi=-f$ and $\Psi=-g$, the zero-temperature constraint $f(x)+g(y)\leq-\dotp{x}{y}$ becomes

\Phi(x)+\Psi(y)\geq \dotp{x}{y}.

(64)

For the quadratic cost this is the same statement after subtracting the quadratic terms. One can therefore maximize the continuous dual over parameterized convex potentials, estimating the integrals by stochastic samples.

A useful parameterization is given by input-convex neural networks (ICNNs) (Amos et al., 2017; Makkuva et al., 2020). The construction mirrors elementary closure rules for convex functions: nonnegative linear combinations preserve convexity, composition with a convex nondecreasing scalar nonlinearity preserves convexity, and the ReLU $r\mapsto\max(r,0)$ is both convex and nondecreasing. Thus a feed-forward network with nonnegative hidden-to-hidden weights and affine skip connections from the input defines a convex function of its input. This gives a flexible cone of convex trial potentials, although the finite-dimensional optimization over the network weights is not a convex optimization problem. Universal-approximation statements must be read with this distinction in mind: max-affine functions are dense among continuous convex functions on compact convex sets, and ICNN-type architectures are designed to inherit this approximation principle. General ReLU universal approximation results, such as the width $d+1$ theorem on compact subsets of $\RR^d$ (Hanin, 2019), provide useful background but do not by themselves enforce convexity. In practice, neural dual solvers trade exact Sinkhorn scaling for amortized stochastic optimization of the dual potentials.
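The closure rules listed above can be made concrete with a tiny forward pass. The sketch below is a hypothetical two-hidden-layer ICNN with random weights, written in plain NumPy with no training loop; the layer sizes and weight names are assumptions for the illustration, not the architecture of the cited papers. A randomized convexity test along segments serves as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(5)
d, h = 2, 16                                    # input dimension and hidden width (illustrative)

A0, b0 = rng.normal(size=(h, d)), rng.normal(size=h)   # input -> layer 1, any sign
W1 = np.abs(rng.normal(size=(h, h)))                   # hidden -> hidden, nonnegative
A1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)   # affine skip from the input
w2 = np.abs(rng.normal(size=h))                        # output weights, nonnegative
a2 = rng.normal(size=d)                                # final affine skip

def relu(t):
    return np.maximum(t, 0.0)

def icnn(x):
    """Convex function of x: ReLU layers with nonnegative hidden weights plus skips."""
    z1 = relu(A0 @ x + b0)                      # convex: ReLU of an affine map
    z2 = relu(W1 @ z1 + A1 @ x + b1)            # nonnegative mix of convex + affine, then ReLU
    return w2 @ z2 + a2 @ x                     # nonnegative combination plus affine term

# Randomized check: Phi(t x + (1 - t) y) <= t Phi(x) + (1 - t) Phi(y).
max_violation = -np.inf
for _ in range(500):
    x, y, t = rng.normal(size=d), rng.normal(size=d), rng.random()
    gap = icnn(t * x + (1 - t) * y) - (t * icnn(x) + (1 - t) * icnn(y))
    max_violation = max(max_violation, gap)
print(max_violation)
```

Convexity holds for every choice of the unconstrained weights, but the map from weights to function is not convex, which is the distinction emphasized in the paragraph above.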

Algorithm: Log-domain Sinkhorn by soft transforms

Input: Positive weights $\a,\b$ , cost matrix $\C$ , regularization $\epsilon>0$ , tolerance $\mathrm{tol}$ .

Output: Entropic coupling $\P$ computed from stabilized potentials.

Initialize: Set $\gD^{(0)}=0$ , $\eta^{(0)}=+\infty$ , and $\ell=0$ .

While $\eta^{(\ell)}>\mathrm{tol}$ do:

Set $\ell\leftarrow \ell+1$ .
For $i=1,\ldots,n$ do:

Set $M_i=\max_j\{\gD_j^{(\ell-1)}-\C_{ij}\}$ .
Set $\fD_i^{(\ell)}=-M_i-\epsilon\log\sum_j\b_j \exp((\gD_j^{(\ell-1)}-\C_{ij}-M_i)/\epsilon)$ .

For $j=1,\ldots,m$ do:

Set $N_j=\max_i\{\fD_i^{(\ell)}-\C_{ij}\}$ .
Set $\gD_j^{(\ell)}=-N_j-\epsilon\log\sum_i\a_i \exp((\fD_i^{(\ell)}-\C_{ij}-N_j)/\epsilon)$ .

Set $\P_{ij}^{(\ell)}=\a_i\b_j \exp((\fD_i^{(\ell)}+\gD_j^{(\ell)}-\C_{ij})/\epsilon)$ .
Set $\eta^{(\ell)}=\max\{\norm{\P^{(\ell)}\ones_m-\a}_1, \norm{(\P^{(\ell)})^\top\ones_n-\b}_1\}$ .

Return $\P^{(\ell)}$ .
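The algorithm above can be sketched in NumPy as follows. The function names `logsumexp_weighted` and `sinkhorn_log`, the iteration cap `max_iter`, and the small test problem are assumptions made for illustration; the max-shift is applied to the scaled exponent, which is equivalent to the shifts $M_i$ and $N_j$ above.

```python
import numpy as np

def logsumexp_weighted(S, w, axis):
    """Stable evaluation of log(sum_k w_k * exp(S_k)) along `axis`."""
    M = S.max(axis=axis, keepdims=True)
    w = w[None, :] if axis == 1 else w[:, None]
    return np.squeeze(M, axis=axis) + np.log((w * np.exp(S - M)).sum(axis=axis))

def sinkhorn_log(a, b, C, eps, tol=1e-9, max_iter=10_000):
    """Alternate soft transforms until the l1 marginal error drops below tol."""
    g = np.zeros_like(b)
    for _ in range(max_iter):
        f = -eps * logsumexp_weighted((g[None, :] - C) / eps, b, axis=1)
        g = -eps * logsumexp_weighted((f[:, None] - C) / eps, a, axis=0)
        P = a[:, None] * b[None, :] * np.exp((f[:, None] + g[None, :] - C) / eps)
        err = max(np.abs(P.sum(1) - a).sum(), np.abs(P.sum(0) - b).sum())
        if err <= tol:
            break
    return P, f, g

# Small illustrative problem: squared distances between random points on [0, 1].
rng = np.random.default_rng(0)
x, y = rng.random(4), rng.random(5)
C = (x[:, None] - y[None, :]) ** 2
a, b = np.full(4, 0.25), np.full(5, 0.2)
P, f, g = sinkhorn_log(a, b, C, eps=0.2)
```

Because the coupling is rebuilt from the potentials, no entry of the Gibbs kernel is ever formed on its own, which is what keeps the iteration stable at small $\epsilon$.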

The path-space meaning of the static Schrodinger problem now follows the dual construction. A noisy reference dynamics first defines a probability law on trajectories; after optimizing out the conditional law of the path given its endpoints, only the endpoint coupling remains.

Path-Space Schrodinger Problem¶

Schrodinger’s reciprocal problem is naturally posed on paths rather than on endpoint pairs. The Sinkhorn problem appears after the path law is reduced to its two endpoint marginals.

Unregularized Path-Space Transport¶

Let $\Omega=C([0,1];\X)$ be a path space and let $e_t(\omega)=\omega_t$ be the evaluation maps. Given a path action $\mathcal A:\Omega\to[0,+\infty]$ , the unregularized path-space problem is

\inf_{M\in\mathcal P(\Omega)} \left\{ \int_\Omega\mathcal A(\omega)\,\d M(\omega) : (e_0)_\sharp M=\alpha,\ (e_1)_\sharp M=\beta \right\}.

(65)

For quadratic Wasserstein geometry on $\RR^d$ ,

\mathcal A(\omega) = \begin{cases} \int_0^1\norm{\dot\omega_t}^2\,\d t, & \omega\text{ absolutely continuous},\\ +\infty, & \text{otherwise}. \end{cases}

(66)

The endpoint cost induced by the action is

c_{\mathcal A}(x,y) \eqdef \inf\left\{ \mathcal A(\omega): e_0(\omega)=x,\ e_1(\omega)=y \right\}.

(67)

For the quadratic action, the minimizing path is the straight segment and $c_{\mathcal A}(x,y)=\norm{x-y}^2$ .

Entropic Path-Space Problem¶

Let $R^\epsilon\in\mathcal P(\Omega)$ be a reference path law, for instance a Brownian or Langevin dynamics at noise level $\epsilon$ . The dynamic Schrodinger bridge problem is the entropy projection

\mathrm{SB}_\epsilon(\alpha,\beta) \eqdef \inf_{M\in\mathcal P(\Omega)} \left\{ \epsilon\operatorname{KL}(M|R^\epsilon): (e_0)_\sharp M=\alpha,\ (e_1)_\sharp M=\beta \right\}.

(71)

It asks for the most likely path law, relative to the prior dynamics, among all path laws matching the observed endpoints (Schrödinger, 1931; Léonard, 2012; Léonard, 2014; Chen et al., 2016).

Stochastic-Control Interpretation¶

For Brownian references, the entropy projection has a direct control meaning. To fix constants, suppose that $R^\epsilon$ is the law of

\d X_t = \sqrt{\epsilon}\,\d B_t, \qquad X_0\sim\alpha ,

(72)

so that the generator is $(\epsilon/2)\Delta$ . Under the standard finite-entropy assumptions, a path law $M$ with the same initial marginal and $M\ll R^\epsilon$ can be represented, in the weak sense, by an adapted drift $u_t$ of finite energy,

\d X_t = u_t\,\d t+\sqrt{\epsilon}\,\d B_t, \qquad X_0\sim\alpha .

(73)

Conversely, such a drift defines an absolutely continuous change of law when the usual Girsanov integrability conditions hold. Girsanov’s formula then gives

\operatorname{KL}(M|R^\epsilon) = \frac{1}{2\epsilon}\mathbb E_M\int_0^1\norm{u_t}^2\,\d t, \qquad \epsilon\operatorname{KL}(M|R^\epsilon) = \frac12\mathbb E_M\int_0^1\norm{u_t}^2\,\d t .

(74)

If the reference initial law is not $\alpha$ , an additional term $\epsilon\operatorname{KL}(\alpha|R^\epsilon_0)$ appears, but it is fixed by the endpoint constraint. Hence, up to this fixed endpoint cost, the Schrodinger bridge can be read as

\inf_u \left\{ \frac12\mathbb E\int_0^1\norm{u_t}^2\,\d t : \d X_t=u_t\,\d t+\sqrt{\epsilon}\,\d B_t,\quad X_0\sim\alpha,\ X_1\sim\beta \right\}.

(75)

Thus the bridge is the least energetic change of drift that steers the Brownian prior from $\alpha$ to $\beta$ . The optimizer may be chosen Markovian. If $h_t$ denotes the positive space-time harmonic Schrodinger factor associated with the terminal constraint, then the optimal feedback drift has the Doob-transform form

u_t^\star(x)=\epsilon\nabla\log h_t(x).

(76)

Equivalently, with the value potential $\phi_t=\epsilon\log h_t$ , one has $u_t^\star=\nabla\phi_t$ and $\phi_t$ solves the viscous Hamilton--Jacobi equation $\partial_t\phi_t+\frac12\norm{\nabla\phi_t}^2+\frac{\epsilon}{2}\Delta\phi_t=0$ . This control viewpoint and the endpoint-coupling reduction below are two faces of the same object: the controlled diffusion describes the whole path law, whereas the static Sinkhorn coupling records only its two endpoints.

Viscous Benamou--Brenier Formulations¶

The dynamic problem also has viscous Benamou--Brenier formulations. Here $v_t$ denotes the forward drift called $u_t$ in the control formulation, whereas $u_t$ denotes the associated current velocity. If the uncontrolled noise is $\sqrt{\sigma}\,\d B_t$ , its generator is $(\sigma/2)\Delta$ and

\partial_t\rho_t+\operatorname{div}(\rho_t v_t) = \frac{\sigma}{2}\Delta\rho_t

(77)

and one minimizes a kinetic action. After absorbing the diffusion into the velocity $u_t=v_t-\frac{\sigma}{2}\nabla\log\rho_t$ , the same minimizers are obtained from

\int_0^1\!\int \left( \frac12\norm{u_t(x)}^2 + \frac{\sigma^2}{8}\norm{\nabla\log\rho_t(x)}^2 \right) \rho_t(x)\,\d x\,\d t,

(78)

up to endpoint entropy terms. The extra term is a Fisher-information penalty.

Brownian Bridges and Sinkhorn Couplings¶

For the convention $\d X_t=\sqrt{\epsilon}\,\d B_t$ , the endpoint kernel is $\exp(-\norm{x-y}^2/(2\epsilon))$ and the corresponding static cost is $\norm{x-y}^2/2$ . Renaming the static temperature gives the equivalent kernel

\exp\left(-\frac{\norm{x-y}^2}{\epsilon}\right).

(82)

After rewriting this prior with respect to $\alpha\otimes\beta$ , the endpoint problem is exactly the continuous Sinkhorn problem up to an additive constant. Sinkhorn computes which endpoints should be paired; the path-space Schrodinger bridge then connects each pair by a Brownian bridge.

The figure below illustrates this endpoint-to-path lifting on a small discrete example.

Endpoint couplings lifted to Brownian bridges. Increasing $\epsilon$ both softens the endpoint coupling and amplifies the Brownian fluctuations between paired endpoints.

Interactive panel. Vary $\epsilon$ to move continuously from straight OT rays to noisy Brownian-bridge lifts with a more diffuse endpoint coupling.
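The lifting step can be sketched for a single pair of scalar endpoints. The code below samples Brownian bridges of the reference $\d X_t=\sqrt{\epsilon}\,\d B_t$ pinned at $x$ and $y$; the function name `brownian_bridge` and the chosen parameters are illustrative assumptions. In a full lift, the pairs $(x,y)$ would first be drawn from the Sinkhorn coupling.

```python
import numpy as np

def brownian_bridge(x, y, eps, n_steps, n_paths, rng):
    """Sample bridges of dX = sqrt(eps) dB pinned at X_0 = x and X_1 = y (scalar endpoints)."""
    t = np.linspace(0.0, 1.0, n_steps + 1)
    dB = rng.normal(scale=np.sqrt(eps / n_steps), size=(n_paths, n_steps))
    B = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dB, axis=1)], axis=1)
    # Pin the free path by subtracting t * B_1, then add the straight segment from x to y.
    return t, (1.0 - t) * x + t * y + B - t * B[:, -1:]

rng = np.random.default_rng(0)
t, paths = brownian_bridge(x=0.0, y=1.0, eps=0.1, n_steps=100, n_paths=4000, rng=rng)
# The bridge variance is eps * t * (1 - t), so the midpoint variance should be near eps / 4.
mid_var = paths[:, 50].var()
```

The mean path is the straight OT ray, and the fluctuations around it scale with $\sqrt{\epsilon}$, which matches the behavior described in the caption above.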

Marginal-Dependent Problems¶

The balanced Sinkhorn problem fixes both marginals exactly. Many nearby models instead optimize the transported marginals, but only through penalties or constraints applied separately to each marginal. The useful point, emphasized by the generalized scaling algorithms of Chizat, Peyré, Schmitzer and Vialard (Chizat et al., 2018), is that entropic OT remains a diagonal-scaling problem whenever these marginal terms admit simple KL-proximal maps.

Let $\mathcal F$ and $\mathcal G$ be proper convex lower semicontinuous functionals on finite nonnegative measures on $\Xx$ and $\Yy$ . The unregularized marginal-dependent transport problem is

\inf_{\pi\in\Mm_+(\Xx\times\Yy)} \int c(x,y)\,\d\pi(x,y) + \mathcal F(\pi_1) + \mathcal G(\pi_2),

(83)

where $\pi_1=(\mathrm p_{\Xx})_\sharp\pi$ and $\pi_2=(\mathrm p_{\Yy})_\sharp\pi$ are the two marginals of $\pi$ . Entropic regularization turns this problem into the scaling-friendly form

\inf_{\pi\in\Mm_+(\Xx\times\Yy)} \int c(x,y)\,\d\pi(x,y) + \mathcal F(\pi_1) + \mathcal G(\pi_2) + \epsilon\operatorname{KL}(\pi|\alpha\otimes\beta),

(84)

where $\alpha$ and $\beta$ are reference measures. In this subsection, when the total mass of $\pi$ is not fixed, the KL term is understood in the generalized sense associated with $\varphi(s)=s\log s-s+1$ . Thus, if $\lambda=\alpha\otimes\beta$ and $\pi=\rho\lambda+\pi^\perp$ is the Lebesgue decomposition of $\pi$ with respect to $\lambda$ , then

\operatorname{KL}(\pi|\lambda) = \int \bigl(\rho\log\rho-\rho+1\bigr)\,\mathrm d\lambda

(85)

with value $+\infty$ when $\pi^\perp\ne0$ . On probability couplings this coincides with the usual relative entropy.

Balanced OT is recovered by taking $\mathcal F=\iota_{\{\alpha\}}$ and $\mathcal G=\iota_{\{\beta\}}$ . Unbalanced OT replaces these hard indicators by marginal divergences, as developed later in Section Unbalanced OT. An entropic JKO step fixes the first marginal to the previous iterate and puts the energy on the second marginal, for instance $\mathcal F=\iota_{\{\alpha_t\}}$ and $\mathcal G=E$ , with cost $c/(2\tau)$ ; this is the static counterpart of the minimizing-movement schemes of Chapter Paragraph. Barycenters are the multi-coupling extension: several couplings share one unknown marginal and are treated by the generalized Sinkhorn updates of Section OT Barycenters.

In finite dimension, with reference weights satisfying $a_i,b_j>0$ and proper convex functions $\mathsf F:\RR_+^n\to\RR\cup\{+\infty\}$ , $\mathsf G:\RR_+^m\to\RR\cup\{+\infty\}$ , the entropic version becomes

\inf_{\P\in\RR_+^{n\times m}} \langle\C,\P\rangle + \mathsf F(\P\ones_m) + \mathsf G(\P^\top\ones_n) + \epsilon\KLD(\P|\a\otimes\b).

(86)

Equivalently, if $\K_{ij}=a_i b_j e^{-\C_{ij}/\epsilon}$ , the terms involving $\C$ can be absorbed into the Gibbs reference and the problem is, up to the additive constant $\epsilon\sum_{i,j}(a_i b_j-\K_{ij})$ ,

\inf_{\P\ge0} \mathsf F(\P\ones_m) + \mathsf G(\P^\top\ones_n) + \epsilon\KLD(\P|\K).

(87)

Proposition: Dual and scaling for marginal penalties

Assume $a_i,b_j>0$ , $\epsilon>0$ , and a Fenchel qualification condition, for instance the existence of a matrix $\P>0$ such that $\P\ones_m$ and $\P^\top\ones_n$ belong to the relative interiors of $\operatorname{dom}(\mathsf F)$ and $\operatorname{dom}(\mathsf G)$ . The Fenchel dual of the discrete marginal-dependent problem is

\sup_{\mathbf f\in\RR^n,\mathbf g\in\RR^m} - \mathsf F^*(-\mathbf f) - \mathsf G^*(-\mathbf g) - \epsilon\sum_{i,j}a_i b_j \left[ \exp\left(\frac{\mathbf f_i+\mathbf g_j-\C_{ij}}{\epsilon}\right)-1 \right].

(88)

Equivalently, up to the additive constant $\epsilon\sum_{i,j}a_i b_j$ , the last term is

- \epsilon\sum_{i,j}\K_{ij} \exp\left(\frac{\mathbf f_i+\mathbf g_j}{\epsilon}\right).

(89)

If $\mathbf u=e^{\mathbf f/\epsilon}$ and $\mathbf v=e^{\mathbf g/\epsilon}$ , exact block ascent in the two dual variables is the generalized Sinkhorn cycle

\begin{aligned} r &\leftarrow \operatorname{prox}_{\mathsf F/\epsilon}^{\KLD}(\K\mathbf v), &\qquad \mathbf u &\leftarrow r\oslash(\K\mathbf v),\\ s &\leftarrow \operatorname{prox}_{\mathsf G/\epsilon}^{\KLD}(\K^\top\mathbf u), &\qquad \mathbf v &\leftarrow s\oslash(\K^\top\mathbf u),\\ \P&=\operatorname{diag}(\mathbf u)\K\operatorname{diag}(\mathbf v), \end{aligned}

(90)

where divisions are entrywise and

\operatorname{prox}^{\KLD}_{\mathsf h}(z) \eqdef \operatorname*{arg\,min}_{r\in\RR_+^d} \mathsf h(r)+\KLD(r|z).

(91)

In particular, $\operatorname{prox}_{\mathsf F/\epsilon}^{\KLD}(z)$ is the minimizer of $\mathsf F(r)+\epsilon\KLD(r|z)$ .

When $\mathsf F=\iota_{\{\a\}}$ and $\mathsf G=\iota_{\{\b\}}$ , the first two dual terms are $\langle\mathbf f,\a\rangle+\langle\mathbf g,\b\rangle$ , and one recovers the usual entropic dual. The classical Sinkhorn update is the special case in which the KL proximal maps return the prescribed marginals $\a$ and $\b$ .

Algorithm: Generalized Sinkhorn for marginal penalties

Input: Reference weights $\a,\b$ , cost $\C$ , convex marginal penalties $\mathsf F,\mathsf G$ , regularization $\epsilon>0$ , tolerance $\mathrm{tol}$ .

Output: Coupling $\P$ and optimized marginals $r=\P\ones_m$ , $s=\transp{\P}\ones_n$ .

Set $\K_{ij}=a_i b_j e^{-\C_{ij}/\epsilon}$ .

Initialize: $\uD=\ones_n$ , $\vD=\ones_m$ , $\eta=+\infty$ .

While $\eta>\mathrm{tol}$ do:

Store $\uD_{\mathrm{old}}=\uD$ and $\vD_{\mathrm{old}}=\vD$ .
Set $z=\K\vD$ .
Compute $r=\prox_{\mathsf F/\epsilon}^{\KLD}(z)$ .
Set $\uD=r\oslash z$ .
Set $w=\transp{\K}\uD$ .
Compute $s=\prox_{\mathsf G/\epsilon}^{\KLD}(w)$ .
Set $\vD=s\oslash w$ .
Set $\eta=\max\{\norm{\uD-\uD_{\mathrm{old}}}_\infty,\norm{\vD-\vD_{\mathrm{old}}}_\infty\}$ .

Return $\P=\diag(\uD)\K\diag(\vD)$ , $r=\P\ones_m$ , and $s=\transp{\P}\ones_n$ .
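A direct NumPy transcription of this loop is sketched below. The function name `generalized_sinkhorn`, the convention that the two proximal maps are passed as callables, and the iteration cap are assumptions of this sketch. The usage example plugs in the hard-constraint maps, so it should reproduce balanced Sinkhorn.

```python
import numpy as np

def generalized_sinkhorn(a, b, C, eps, prox_F, prox_G, tol=1e-10, max_iter=10_000):
    """Scaling iterations; prox_F(z) and prox_G(w) return the KL-proximal points of F/eps, G/eps."""
    K = a[:, None] * b[None, :] * np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(max_iter):
        u_old, v_old = u, v
        z = K @ v
        u = prox_F(z) / z            # entrywise division
        w = K.T @ u
        v = prox_G(w) / w
        if max(np.abs(u - u_old).max(), np.abs(v - v_old).max()) <= tol:
            break
    P = u[:, None] * K * v[None, :]
    return P, P.sum(axis=1), P.sum(axis=0)

# Balanced OT: both KL-proximal maps return the prescribed marginal.
rng = np.random.default_rng(0)
C = rng.random((4, 5))
a, b = np.full(4, 0.25), np.full(5, 0.2)
P, r, s = generalized_sinkhorn(a, b, C, eps=0.2, prox_F=lambda z: a, prox_G=lambda w: b)
```

Only the two callables change between models; the kernel products `K @ v` and `K.T @ u` are shared by every variant.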

The usefulness of this formulation is that many KL-proximal maps are explicit.

Hard marginal constraint. If $\mathsf F=\iota_{\{\a\}}$ , then $\operatorname{prox}_{\mathsf F/\epsilon}^{\KLD}(z)=\a$ .
KL marginal relaxation. If $\mathsf F(r)=\tau\KLD(r|\a)$ with $\tau>0$ , then, coordinatewise,

\operatorname{prox}_{\mathsf F/\epsilon}^{\KLD}(z) = z^{\epsilon/(\tau+\epsilon)}\odot \a^{\tau/(\tau+\epsilon)},

(94)

so $\mathbf u\leftarrow(\a\oslash z)^{\tau/(\tau+\epsilon)}$ , the damped scaling of unbalanced Sinkhorn.

Pointwise bounds. If $\mathsf F=\iota_{\{\ell\le r\le u\}}$ , then the proximal map is the coordinatewise clipping $\operatorname{prox}_{\mathsf F/\epsilon}^{\KLD}(z)_i=\min\{u_i,\max\{\ell_i,z_i\}\}$ .
Total-variation marginal relaxation. If $\mathsf F(r)=\tau\norm{r-\a}_1$ and $\lambda=\tau/\epsilon$ , then

\operatorname{prox}_{\mathsf F/\epsilon}^{\KLD}(z)_i = \begin{cases} z_i e^{\lambda}, & z_i<a_i e^{-\lambda},\\ a_i, & a_i e^{-\lambda}\le z_i\le a_i e^{\lambda},\\ z_i e^{-\lambda}, & z_i>a_i e^{\lambda}. \end{cases}

(95)

Fixed total mass. If $\mathsf F=\iota_{\{\langle r,\ones\rangle=m\}}$ , then $\operatorname{prox}_{\mathsf F/\epsilon}^{\KLD}(z)=m z/\langle z,\ones\rangle$ .

These examples explain why generalized Sinkhorn algorithms remain practical: the expensive operation is still multiplication by $\K$ or $\K^\top$ , while the model-specific part is a low-dimensional KL-proximal update on a marginal.
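The closed forms listed above translate into one-line NumPy functions. The factory names below (`prox_hard`, `prox_kl`, `prox_box`, `prox_tv`, `prox_mass`) are hypothetical; each returns a callable $z\mapsto\operatorname{prox}_{\mathsf F/\epsilon}^{\KLD}(z)$ and assumes strictly positive inputs.

```python
import numpy as np

def prox_hard(a):
    """F = indicator of {a}: the proximal point ignores z."""
    return lambda z: np.array(a, dtype=float)

def prox_kl(a, tau, eps):
    """F = tau * KL(. | a): geometric interpolation between z and a, as in (94)."""
    return lambda z: z ** (eps / (tau + eps)) * a ** (tau / (tau + eps))

def prox_box(lo, hi):
    """F = indicator of {lo <= r <= hi}: coordinatewise clipping."""
    return lambda z: np.clip(z, lo, hi)

def prox_tv(a, tau, eps):
    """F = tau * ||. - a||_1: multiplicative shrinkage toward a with a dead zone, as in (95)."""
    lam = tau / eps
    return lambda z: np.where(z < a * np.exp(-lam), z * np.exp(lam),
                              np.where(z > a * np.exp(lam), z * np.exp(-lam), a))

def prox_mass(m):
    """F = indicator of {sum r = m}: rescale z to total mass m."""
    return lambda z: m * z / z.sum()

a = np.array([1.0, 1.0, 1.0])
z = np.array([0.1, 1.0, 4.0])
r_kl = prox_kl(a, tau=0.5, eps=1.0)(z)
r_tv = prox_tv(a, tau=0.5, eps=1.0)(z)
```

For the KL relaxation, the output satisfies the stationarity condition $\tau\log(r/a)+\epsilon\log(r/z)=0$ coordinatewise, which gives a quick sanity check of an implementation.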

Heat Kernels and Hopf--Cole Transforms¶

The Gaussian kernel used by Sinkhorn is also the Euclidean heat kernel. This viewpoint clarifies when entropic OT admits fast grid and surface implementations, and it places soft minima and Hopf--Cole transforms in the same heat-kernel calculus.

Geodesics in Heat¶

On $\mathbb R^d$ , the heat kernel for $\partial_t u=\Delta u$ is

h_t(x,y)=(4\pi t)^{-d/2}\exp\!\left(-\frac{\norm{x-y}^2}{4t}\right).

(96)

For the quadratic cost $c(x,y)=\norm{x-y}^2$ , the Sinkhorn Gibbs kernel is exactly a heat kernel up to a scalar factor:

K_\epsilon(x,y) =e^{-\norm{x-y}^2/\epsilon} =(\pi\epsilon)^{d/2}h_{\epsilon/4}(x,y).

(97)

The scalar factor is absorbed by the Sinkhorn scalings and does not change the coupling. On a Riemannian manifold or surface $M$ , write $L=-\Delta_M$ and replace the dense Gibbs matrix by the intrinsic heat operator $H_\epsilon=e^{-(\epsilon/4)L}$ . For two histograms on the same discretized domain, Sinkhorn becomes

u^{(\ell+1)}=a\oslash(H_\epsilon v^{(\ell)}), \qquad v^{(\ell+1)}=b\oslash(H_\epsilon^\top u^{(\ell+1)}).

(98)

Here $H_\epsilon$ includes quadrature weights and $H_\epsilon^\top$ is its discrete adjoint; they coincide for a symmetric mass-normalized discretization. This fits Sinkhorn because kernel multiplication is its only expensive step. Equivalently, the heat kernel defines the effective cost $c_\epsilon(x,y)=-\epsilon\log h_{\epsilon/4}(x,y)$ . Varadhan’s formula gives

c_\epsilon(x,y)\longrightarrow d_M(x,y)^2 \qquad (\epsilon\downarrow0),

(99)

so convolutional Sinkhorn recovers the squared geodesic ground cost at small temperature without computing all pairwise geodesic distances (Varadhan, 1967; Solomon et al., 2015). This is also the asymptotic principle behind geodesics-in-heat distance estimation (Crane et al., 2013).

Computationally, the heat operator admits the resolvent approximation

H_\epsilon =\lim_{q\to\infty} \left(I+\frac{\epsilon}{4q}L\right)^{-q}.

(100)

For a sparse discrete Laplacian $L_h$ , factor $A_{\epsilon,q}=I+\epsilon L_h/(4q)=RR^\top$ once by sparse Cholesky. Each application of $H_\epsilon$ is then approximated by $q$ successive solves with $A_{\epsilon,q}$ , each reduced to two triangular substitutions. The same factorization is reused in every Sinkhorn row and column update, avoiding a dense kernel and an all-pairs distance matrix Solomon et al., 2015. An $\epsilon$ -scaling schedule requires one factorization per temperature. The diffusion length is of order $\sqrt\epsilon$ , so the small-temperature limit must still be resolved by the mesh; taking $\epsilon$ smaller than the squared grid spacing produces metrication and discretization artifacts.
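The resolvent approximation (100) can be tested on a small dense example. The sketch below uses a one-dimensional Neumann Laplacian and dense NumPy linear algebra in place of a sparse Cholesky factorization; the helper names and the grid size are assumptions. The exact heat operator is obtained from an eigendecomposition, which is only affordable because the matrix is small.

```python
import numpy as np

def neumann_laplacian_1d(n, h):
    """Symmetric positive semidefinite L_h = -Laplacian on n cells of width h (Neumann ends)."""
    L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    L[0, 0] = L[-1, -1] = 1.0
    return L / h**2

def heat_exact(L, eps):
    """exp(-(eps/4) L) through the eigendecomposition of the symmetric matrix L."""
    lam, Q = np.linalg.eigh(L)
    return (Q * np.exp(-0.25 * eps * lam)) @ Q.T

def heat_resolvent(L, eps, q):
    """(I + eps L / (4 q))^{-q}: q implicit steps sharing one Cholesky factor."""
    n = L.shape[0]
    R = np.linalg.cholesky(np.eye(n) + eps * L / (4.0 * q))    # A = R R^T, R lower triangular
    H = np.eye(n)
    for _ in range(q):
        # Dense general solves with R and R^T; a sparse code would use triangular substitutions.
        H = np.linalg.solve(R.T, np.linalg.solve(R, H))
    return H

n = 64
L = neumann_laplacian_1d(n, h=1.0 / n)
eps = 0.01
errors = [np.abs(heat_resolvent(L, eps, q) - heat_exact(L, eps)).max() for q in (1, 4, 16)]
```

In this example the entrywise error shrinks as $q$ grows, at the price of $q$ solves per kernel application, which is the trade-off behind choosing a small fixed $q$ in practice.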

The figure below compares the exact distance to a non-convex source curve with heat-kernel and shifted-Laplacian approximations at several smoothing scales.

Geodesics-in-heat approximation of the distance to a dense non-convex source curve. The one-step approximation $(I+\epsilon L_h/4)^{-1}$ with Neumann boundary conditions is followed by a normalized-gradient Poisson solve. Larger Sinkhorn temperatures suppress unresolved grid-scale artifacts but progressively round the non-convex level-set geometry.

Interactive panel. Adjust the Sinkhorn temperature $\epsilon$ and number of sources to see how heat smoothing rounds Voronoi fronts and approximate distance level sets.

Soft Hopf--Lax and Hopf--Cole¶

The Hopf--Lax formula is the Hamilton--Jacobi incarnation of the hard $c$ -transform. We use the normalized quadratic cost

c(x,y)=\frac12\norm{x-y}^2,

(101)

so that the Hopf--Lax operator applied to an initial datum $h$ is precisely

(-h)^{\bar c}(x)=\inf_y\left\{h(y)+\frac12\norm{x-y}^2\right\}.

(102)

The sign only reflects the convention of Definition: $c$-Transform. This is the usual Hopf--Lax formula for the Hamiltonian $\norm{p}^2/2$ (Evans, 2010; Villani, 2009); other quadratic scalings amount to multiplying the cost by a constant. In the present entropic setting, the parameter of interest is instead the temperature $\epsilon$.

The soft version replaces the infimum by a log-sum-exp soft minimum. In the notation of Definition: Continuous Soft $c$ -Transforms, and using Lebesgue measure on $\RR^d$ ,

(-h)^{\bar c,\epsilon}(x) = -\epsilon\log \int \exp\!\left(-\frac{h(y)+\norm{x-y}^2/2}{\epsilon}\right)\,dy .

(103)

This is a soft $c$ -transform of the function $-h$ , and Laplace’s principle gives $(-h)^{\bar c,\epsilon}\to(-h)^{\bar c}$ as $\epsilon\to0$ under the usual compactness assumptions on near-minimizers. The same formula is also a heat-kernel formula. If

G_\epsilon(z)=(2\pi\epsilon)^{-d/2} \exp\!\left(-\frac{\norm{z}^2}{2\epsilon}\right),

(104)

then

(-h)^{\bar c,\epsilon}(x) = -\epsilon\log\big(G_\epsilon\ast e^{-h/\epsilon}\big)(x) -\frac{\epsilon d}{2}\log(2\pi\epsilon).

(105)

Thus a soft quadratic $c$ -transform is a Gaussian convolution followed by a logarithm, up to an explicit additive constant independent of $x$ . This is the bridge between soft-minimum operators, heat kernels and entropic transport potentials.

Proposition: Soft Quadratic $c$-Transform and Legendre Approximation

Let $f:\RR^d\to\RR\cup\{+\infty\}$ be such that the integrals below are finite, and introduce the quadratic shift $\mathsf S f(y)=f(y)-\norm{y}^2/2$ . For $\epsilon>0$ , define the soft conjugate by applying the soft $\bar c$ -transform to the shifted function:

f^{*,\epsilon}(p) =\frac12\norm{p}^2-\big(-\mathsf S f\big)^{\bar c,\epsilon}(p).

(106)

Then

f^{*,\epsilon}(p) =\epsilon\log\int_{\RR^d} \exp\!\left(\frac{\dotp{p}{y}-f(y)}{\epsilon}\right)\,dy,

(107)

and equivalently

f^{*,\epsilon}(p) =\frac12\norm{p}^2 +\epsilon\log\big(G_\epsilon\ast e^{-\mathsf S f/\epsilon}\big)(p) +\frac{\epsilon d}{2}\log(2\pi\epsilon).

(108)

If, for instance, $f$ is proper, lower semicontinuous and superlinear, then $f^{*,\epsilon}(p)\to f^*(p)$ for every $p$ as $\epsilon\to0$ .

The proof is just completion of squares. Since

\inf_y\left\{\mathsf S f(y)+\frac12\norm{p-y}^2\right\} =\frac12\norm{p}^2-f^*(p),

(109)

this shift turns the Legendre--Fenchel transform into a quadratic $\bar c$ -transform. Replacing the hard transform by its soft version gives the definition of $f^{*,\epsilon}$ . Expanding the square yields the log-sum-exp formula, the Gaussian-convolution expression follows from the normalized kernel $G_\epsilon$ , and the convergence follows from Laplace’s principle.

Remark: Fast soft Legendre--Fenchel transforms

Proposition: Soft Quadratic $c$-Transform and Legendre Approximation gives a computational recipe for a smoothed Legendre--Fenchel transform. On a periodic, or sufficiently padded, grid, the term $G_\epsilon\ast e^{-\mathsf S f/\epsilon}$ in (108) is a Gaussian convolution. It can therefore be evaluated in $O(N\log N)$ operations for $N$ grid samples using an FFT. This is not an exact hard discrete Legendre transform, but a regularized approximation whose zero-temperature limit recovers the hard transform by Laplace’s principle. Exact discrete conjugation and lower-envelope algorithms exploit convex-analytic and computational-geometry structure instead; see Lucet’s survey of computational convex analysis and the distance-transform algorithm of Felzenszwalb and Huttenlocher (Lucet, 2010; Felzenszwalb & Huttenlocher, 2012).

The convolutional route is nevertheless delicate when $\epsilon$ is small. In fixed precision, the factors $e^{-f/\epsilon}$ or $e^{-\mathsf S f/\epsilon}$ can underflow or overflow, and the logarithm can amplify small relative convolution errors. Practical implementations therefore use shifts, log-domain evaluations, or stabilized FFT convolutions. In dimension $d$ , one can also use the same separability of the Gaussian kernel as in grid Sinkhorn: direct one-dimensional passes give the tensor-product cost (10), namely $O(d\,N^{1+1/d})$ , while FFT convolutions provide an additional acceleration when boundary conditions and conditioning permit it.
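A one-dimensional sketch of this recipe follows. It evaluates the soft conjugate once by direct quadrature of (107) and once through (108) with a circular FFT convolution; the function names, the double-well test function and the grid are illustrative assumptions. The comparison is restricted to a central region, because the convolution becomes smaller than FFT round-off near the ends of the grid, which is the conditioning issue described above.

```python
import numpy as np

def soft_conjugate_direct(f_vals, y, eps):
    """Quadrature of (107) with a stabilized log-sum-exp; p runs over the same grid as y."""
    dy = y[1] - y[0]
    S = (y[:, None] * y[None, :] - f_vals[None, :]) / eps       # rows: p, columns: y
    M = S.max(axis=1)
    return eps * (M + np.log(np.exp(S - M[:, None]).sum(axis=1) * dy))

def soft_conjugate_fft(f_vals, y, eps):
    """Formula (108) in dimension d = 1 with a circular FFT convolution."""
    n, dy = y.size, y[1] - y[0]
    w = np.exp(-(f_vals - 0.5 * y**2) / eps)                     # exp(-S f / eps)
    z = np.fft.fftfreq(n) * n * dy                               # wrapped grid offsets
    G = np.exp(-z**2 / (2 * eps)) / np.sqrt(2 * np.pi * eps)
    conv = np.fft.ifft(np.fft.fft(G) * np.fft.fft(w)).real * dy
    with np.errstate(divide="ignore", invalid="ignore"):
        log_conv = np.log(conv)                                  # unreliable where conv is tiny
    return 0.5 * y**2 + eps * log_conv + 0.5 * eps * np.log(2 * np.pi * eps)

y = np.linspace(-4.0, 4.0, 1024, endpoint=False)
f_vals = (y**2 - 1.0) ** 2                                       # non-convex double well
eps = 0.1
core = np.abs(y) <= 2.0                                          # region where conv is well resolved
direct = soft_conjugate_direct(f_vals, y, eps)
fast = soft_conjugate_fft(f_vals, y, eps)
hard = (y[:, None] * y[None, :] - f_vals[None, :]).max(axis=1)   # grid Legendre transform
```

The direct route costs $O(N^2)$ but is stable everywhere on the grid, while the FFT route is faster and should only be trusted where the convolution stays well above machine precision.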

The figure below isolates the biconjugation effect. It compares the hard lower convex envelope with finite-temperature soft biconjugates for both a simple and a more oscillatory non-convex profile.

Soft Legendre biconjugates as approximations of lower convex envelopes. The dashed gray curve is the original non-convex function, the red curve is $f^{**}$ , and the purple-to-blue curves show $(f^{*,\epsilon})^{*,\epsilon}$ for increasing $\epsilon$ .

Interactive panel. Change the smoothing temperature to see how the soft $c$ -transform interpolates between hard envelopes and smooth log-sum-exp transforms.

The nonlinear PDEs are linearized by the Hopf--Cole transform. With the same temperature normalization, $u_s=e^{-\phi_s/\epsilon}$ converts

\partial_s\phi_s+\frac12\norm{\nabla\phi_s}^2=\frac{\epsilon}{2}\Delta\phi_s

(110)

into $\partial_su_s=(\epsilon/2)\Delta u_s$ . Conversely, $\phi_s=-\epsilon\log u_s$ gives a Hamilton--Jacobi solution, while $v_s=\nabla\phi_s=-\epsilon\nabla\log u_s$ solves the gradient viscous Burgers equation $\partial_s v_s+(v_s\cdot\nabla)v_s=(\epsilon/2)\Delta v_s$ . In one dimension this is the classical Cole--Hopf transform; in higher dimension this scalar reduction applies to irrotational velocity fields. The figure below keeps only the PDE content: the same initial potential is evolved through the Hopf--Cole transform for three values of the viscosity.

The example starts from a Gaussian velocity bump, whose inviscid evolution would form a shock on its decreasing flank, and shows how the viscosity parameter $\epsilon/2$ regularizes this steepening.

Hopf--Cole numerics for viscous Hamilton--Jacobi and Burgers dynamics. The upper row shows the potentials $\phi_t$ , the lower row shows the velocities $v_t=\partial_x\phi_t$ , and colors encode time from red to blue.

Interactive panel. Change the viscosity, final time and initial velocity bump to see the same Hopf--Cole mechanism: heat evolves the transformed variable, while the logarithm reconstructs the Hamilton--Jacobi potential and the Burgers velocity.
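The Hopf-Cole mechanism can be sketched on a periodic one-dimensional grid. The code below solves the heat equation for $u_s$ with an FFT multiplier and reconstructs $\phi_s=-\epsilon\log u_s$; the function names, the cosine initial potential and the finite-difference check are assumptions of this sketch and do not reproduce the figure.

```python
import numpy as np

def hopf_cole_step(phi0, eps, s, length):
    """Viscous Hamilton-Jacobi solution at time s on a periodic 1D grid of the given length."""
    n = phi0.size
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)              # angular wavenumbers
    shift = phi0.min()                                            # keeps exp(.) <= 1
    u0 = np.exp(-(phi0 - shift) / eps)
    u_s = np.fft.ifft(np.exp(-0.5 * eps * k**2 * s) * np.fft.fft(u0)).real   # heat flow
    return shift - eps * np.log(u_s)

def spectral_derivative(phi, length, order=1):
    """Derivative of a smooth periodic function through its Fourier coefficients."""
    k = 2 * np.pi * np.fft.fftfreq(phi.size, d=length / phi.size)
    return np.fft.ifft((1j * k) ** order * np.fft.fft(phi)).real

n, length, eps = 256, 2 * np.pi, 0.5
x = np.linspace(0.0, length, n, endpoint=False)
phi0 = np.cos(x)
s, h = 0.3, 1e-3
phi = hopf_cole_step(phi0, eps, s, length)
# Central difference in time, spectral derivatives in space.
dphi_ds = (hopf_cole_step(phi0, eps, s + h, length)
           - hopf_cole_step(phi0, eps, s - h, length)) / (2 * h)
velocity = spectral_derivative(phi, length)
residual = dphi_ds + 0.5 * velocity**2 - 0.5 * eps * spectral_derivative(phi, length, order=2)
```

The residual of (110) should be at the level of the time-differencing error, and the velocity should stay bounded by its initial maximum, as expected for viscous Burgers dynamics.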

Other Convex Regularizers¶

KL regularization is the case that leads to multiplicative Sinkhorn scalings. Replacing KL by another density-ratio penalty keeps the same transport constraints but changes the scalar law linking the optimal density to the dual potentials.

$\phi$ -Divergence Regularization¶

The exponential Gibbs relation is replaced by the scalar law generated by a general convex entropy function.

Let $\phi$ be an entropy function and define

\mathcal L_{c,\phi}^{\epsilon}(\alpha,\beta) \eqdef \min_{\pi\in\Couplings(\alpha,\beta)} \int c(x,y)\,\d\pi(x,y) + \epsilon D_\phi(\pi|\alpha\otimes\beta).

(111)

Proposition: Dual and Density Law for $\phi$-Regularized OT

Under standard Fenchel--Rockafellar qualification assumptions,

\mathcal L_{c,\phi}^{\epsilon}(\alpha,\beta) = \sup_{f,g} \int f\,\d\alpha+\int g\,\d\beta - \epsilon \int \phi^* \left( \frac{f(x)+g(y)-c(x,y)}{\epsilon} \right) \d\alpha(x)\d\beta(y).

(112)

If the optimal plan has density $r^\star=\d\pi^\star/\d(\alpha\otimes\beta)$ and the solution is smooth and interior, then

r^\star(x,y) = (\phi')^{-1} \left( \frac{f^\star(x)+g^\star(y)-c(x,y)}{\epsilon} \right).

(113)

For KL, $\phi(r)=r\log r-r+1$ and $\phi^*(s)=e^s-1$ , recovering the Sinkhorn dual. Other choices replace the exponential law by another scalar transfer function:

\begin{array}{lll} \phi(r)=r\log r-r+1 &\Rightarrow& r^\star=e^s,\\ \phi(r)=r-\log r-1 &\Rightarrow& r^\star=(1-s)^{-1}\quad(s<1),\\ \phi(r)=\frac12(r-1)^2 &\Rightarrow& r^\star=(1+s)_+, \end{array} \qquad s=\frac{f^\star\oplus g^\star-c}{\epsilon}.

(114)

Bregman-Divergence Regularization¶

The previous construction regularizes OT by a density-ratio divergence. This differs from using a Bregman divergence generated by a convex functional on the space of measures.

With the product reference $\xi=\alpha\otimes\beta$ , the corresponding regularized transport value is

\mathcal L_{c,\Phi}^{\epsilon}(\alpha,\beta) \eqdef \inf_{\pi\in\Couplings(\alpha,\beta)} \left\{ \int c\,\d\pi+\epsilon B_\Phi(\pi|\xi) \right\}.

(116)

Proposition: Dual and Density Law for Bregman-Regularized OT

Fix $\xi=\alpha\otimes\beta$ and assume that Fenchel duality is exact. Then

\mathcal L_{c,\Phi}^{\epsilon}(\alpha,\beta) = \sup_{f,g} \int f\,\d\alpha+\int g\,\d\beta -\epsilon\left[ \Phi^*\left( \delta\Phi(\xi)+\frac{f\oplus g-c}{\epsilon} \right) -\Phi^*(\delta\Phi(\xi)) \right].

(117)

If $(f^\star,g^\star)$ and $\pi^\star$ are optimal and the solution is interior, then

\delta\Phi(\pi^\star) = \delta\Phi(\xi)+\frac{f^\star\oplus g^\star-c}{\epsilon}.

(118)

The primal--dual relations (118) and (113) make the distinction precise. Bregman regularization translates the reference measure in the functional dual coordinate $\delta\Phi$ , whereas $\phi$ -divergence regularization applies the scalar derivative $\phi'$ pointwise to the density relative to $\alpha\otimes\beta$ . For KL these laws coincide: logarithmic dual coordinates turn additive potential shifts into multiplicative density scalings.

Thus the two generalizations lead to different duals and different algorithms. Only for KL do density-ratio regularization and Bregman projection geometry coincide and reduce to multiplicative Sinkhorn scalings.

Generalized Soft $c$ -Transforms and Alternate Dual Maximization Method¶

The two dual formulations above suggest the same basic optimizer: maximize exactly over one potential while the other is fixed, then exchange their roles. For the $\phi$ -divergence dual, separability with respect to $\alpha\otimes\beta$ makes both block updates pointwise:

\begin{aligned} g^{\bar c,\epsilon,\phi}(x) &\in \operatorname*{argmin}_{u\in\mathbb R} \left\{ \epsilon\int \phi^*\left(\frac{u+g(y)-c(x,y)}{\epsilon}\right)\d\beta(y)-u \right\},\\ f^{c,\epsilon,\phi}(y) &\in \operatorname*{argmin}_{v\in\mathbb R} \left\{ \epsilon\int \phi^*\left(\frac{f(x)+v-c(x,y)}{\epsilon}\right)\d\alpha(x)-v \right\}. \end{aligned}

(121)

When $\phi^*$ is differentiable, the first minimizer satisfies

\int (\phi^*)'\left(\frac{u+g(y)-c(x,y)}{\epsilon}\right)\d\beta(y)=1,

(122)

and the second satisfies the symmetric equation. Thus each update normalizes one conditional density. For Burg or quadratic penalties, the solve remains one-dimensional and monotone, but it is no longer a log-sum-exp.
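For the Burg entropy $\phi(r)=r-\log r-1$, one has $(\phi^*)'(s)=(1-s)^{-1}$ for $s<1$, so the discrete version of (122) is a monotone scalar equation with a pole. A minimal bisection sketch follows; the function name `burg_row_transform`, the vector `t` holding the values $g_j-c(x,y_j)$ for one fixed $x$, and the assumption that the weights sum to one are all choices made for this illustration.

```python
import numpy as np

def burg_row_transform(t, b, eps, n_bisect=200):
    """Solve sum_j b_j / (1 - (u + t_j)/eps) = 1 for u, assuming b > 0 and sum(b) = 1."""
    lhs = lambda u: np.sum(b / (1.0 - (u + t) / eps))
    hi = eps - t.max()        # pole of the left-hand side: admissible u satisfy u < hi
    lo = -t.max() - eps       # every term is at most b_j / 2 here, so lhs(lo) <= 1/2
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if lhs(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
t = rng.normal(size=6)                  # hypothetical values of g_j - c(x, y_j)
b = np.full(6, 1.0 / 6.0)
eps = 0.5
u = burg_row_transform(t, b, eps)
density = 1.0 / (1.0 - (u + t) / eps)   # conditional density r = (1 - s)^{-1}
```

The resulting conditional density is strictly positive on every atom, in contrast with the thresholded quadratic law.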

The Bregman dual has analogous block transforms, but they are function-space minimizations unless $\Phi$ is separable. With $\xi=\alpha\otimes\beta$ , define

\begin{aligned} g^{\bar c,\epsilon,\Phi} &\in \operatorname*{argmin}_{u\in\mathcal C(\mathcal X)} \left\{ \epsilon\Phi^*\left(\delta\Phi(\xi)+\frac{u\oplus g-c}{\epsilon}\right) -\int u\,\d\alpha \right\},\\ f^{c,\epsilon,\Phi} &\in \operatorname*{argmin}_{v\in\mathcal C(\mathcal Y)} \left\{ \epsilon\Phi^*\left(\delta\Phi(\xi)+\frac{f\oplus v-c}{\epsilon}\right) -\int v\,\d\beta \right\}. \end{aligned}

(123)

These are precisely the exact block minimizers of the negative Bregman dual. For separable $\Phi$ , disintegration reduces them to independent scalar problems.

The two alternate dual-maximization schemes are therefore

\begin{aligned} f^{(\ell+1)}&=\big(g^{(\ell)}\big)^{\bar c,\epsilon,\phi}, &g^{(\ell+1)}&=\big(f^{(\ell+1)}\big)^{c,\epsilon,\phi},\\ f^{(\ell+1)}&=\big(g^{(\ell)}\big)^{\bar c,\epsilon,\Phi}, &g^{(\ell+1)}&=\big(f^{(\ell+1)}\big)^{c,\epsilon,\Phi}. \end{aligned}

(124)

For KL, which has both descriptions, these iterations coincide with the usual soft $c$ -transform iteration and recover Sinkhorn.

Quadratic regularizers replace exponentiation by positive-part thresholding. For discrete measures, the choices $\phi(r)=\frac12(r-1)^2$ and $\Phi(\mathrm P)=\frac12\|\mathrm P\|_{\mathrm F}^2$ give, respectively,

\mathrm P^\star_{i,j} = a_i b_j\left(1+\frac{f_i^\star+g_j^\star-\mathrm C_{i,j}}{\epsilon}\right)_+, \qquad \mathrm P^\star_{i,j} = \left(a_i b_j+\frac{f_i^\star+g_j^\star-\mathrm C_{i,j}}{\epsilon}\right)_+.

(125)

Both laws can produce sparse plans, as advocated by Blondel, Seguy, and Rolet (Blondel et al., 2018). For the first law, the row transform solves

\sum_j b_j \left(1+\frac{u+g_j-\mathrm C_{i,j}}{\epsilon}\right)_+=1.

(126)

The left-hand side is continuous, nondecreasing, and piecewise affine. Sorting its breakpoints, equivalently computing a weighted simplex projection, costs $O(m\log m)$ for a row of length $m$ . The unweighted quadratic Bregman transform is an ordinary Euclidean simplex projection.
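The breakpoint-sorting solve of (126) can be sketched as follows. Writing $t_j=g_j-\mathrm C_{i,j}$, term $j$ is active exactly when $u>\beta_j\eqdef-\epsilon-t_j$, and if the $k$ smallest breakpoints are active the root is $u=(\epsilon+\sum_{j\le k}b_j\beta_j)/\sum_{j\le k}b_j$. The function name and the numerical values below are illustrative assumptions.

```python
import numpy as np

def quadratic_row_transform(t, b, eps):
    """Solve sum_j b_j * (1 + (u + t_j)/eps)_+ = 1 for u, with t_j = g_j - C_ij and b_j > 0."""
    beta = -eps - t                          # term j is active exactly when u > beta_j
    order = np.argsort(beta)
    beta_s, b_s = beta[order], b[order]
    B = np.cumsum(b_s)                       # mass of the k smallest breakpoints
    S = np.cumsum(b_s * beta_s)
    u_k = (eps + S) / B                      # candidate root if the first k terms are active
    k = np.nonzero(u_k > beta_s)[0][-1]      # largest consistent active set
    return u_k[k]

t = np.array([0.0, -1.0, -2.0, -5.0])        # hypothetical values of g_j - C_ij
b = np.full(4, 0.25)
eps = 0.5
u = quadratic_row_transform(t, b, eps)
density = np.maximum(1.0 + (u + t) / eps, 0.0)   # conditional density of row i
```

In this example two of the four entries of the row are exactly zero, which is the sparsity produced by the positive-part law.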

The next two figures show the two visible consequences: changing the regularizer modifies both the smoothing of the dual envelope and the support of the primal coupling.

Generalized soft double transforms for $c(x,y)=-xy$ . The dashed curve is the same non-concave input potential in all panels, the dark curve is the hard double $c$ -transform after centering, and colored curves show the centered double transform $(f^{c,\epsilon,\phi})^{\bar c,\epsilon,\phi}$ for increasing $\epsilon$ from red to blue. KL, Burg, and quadratic density-ratio penalties smooth the concave envelope differently.

Density-ratio regularizers and coupling support. KL gives a diffuse positive plan, the Burg barrier keeps positive but differently tailed support, and the rightmost quadratic plan is computed by alternate threshold transforms and is exactly sparse through the positive-part law.

The interactive demo separates the same two effects. The left plot shows the pointwise law $r=h(s)$ , while the right plot recomputes a coupling after enforcing the marginals with that law.

Interactive panel. Compare entropic and quadratic penalties on the same transport problem by changing the regularizer and its strength.

Sinkhorn Divergences¶

Sinkhorn divergences remove the entropic self-bias while retaining smoothness. They interpolate between OT-like geometry and kernel-like norms, which explains their statistical behavior.

Entropic Bias¶

The raw Sinkhorn cost is biased: for $\epsilon>0$, minimizing $\mathcal L_c^\epsilon(\alpha,\beta)$ over $\beta$ does not generally return $\beta=\alpha$. In the large-temperature limit, the raw value behaves like a product interaction:

\mathcal L_c^\epsilon(\alpha,\beta) \longrightarrow \iint c(x,y)\d\alpha(x)\d\beta(y) \qquad \text{as } \epsilon\to+\infty.

For $c(x,y)=\norm{x-y}^2$, the minimizer of this large-temperature limit over $\beta$ is a Dirac mass at the mean of $\alpha$.

Sinkhorn Divergences¶

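The standard debiasing subtracts the two self-interaction energies, each weighted by one half:

\overline{\mathcal L}_c^\epsilon(\alpha,\beta) \eqdef \mathcal L_c^\epsilon(\alpha,\beta) - \frac12\mathcal L_c^\epsilon(\alpha,\alpha) - \frac12\mathcal L_c^\epsilon(\beta,\beta).

This cancellation removes the large-temperature attraction toward the independent coupling; positivity is a separate property, proved below through the positive-definite kernel associated with $e^{-c/\epsilon}$.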
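The subtraction of the two self-costs can be sketched numerically. This is a minimal NumPy sketch under stated assumptions: the entropic cost is regularized by the relative entropy with respect to the product of the marginals, the cost is the squared Euclidean distance, and the debiased value is the raw cost minus one half of each self-cost. The names `entropic_cost`, `sq_dists` and `sinkhorn_divergence` are illustrative, and the fixed iteration count is a simplification rather than a convergence criterion.

```python
import numpy as np

def _logsumexp(M, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = M.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(M - m).sum(axis=axis, keepdims=True)), axis=axis)

def entropic_cost(a, b, C, eps, n_iter=500):
    """Dual value <f,a> + <g,b> after alternating soft c-transforms (log domain)."""
    la, lb = np.log(a), np.log(b)
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(n_iter):
        f = -eps * _logsumexp((g[None, :] - C) / eps + lb[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - C) / eps + la[:, None], axis=0)
    # After the g-update the exponential penalty term of the dual vanishes,
    # so the dual objective reduces to the two linear terms.
    return f @ a + g @ b

def sq_dists(x, y):
    """Matrix of squared Euclidean distances between two point clouds."""
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

def sinkhorn_divergence(x, a, y, b, eps):
    """Raw entropic cost minus half of each self-cost (assumed debiasing)."""
    return (entropic_cost(a, b, sq_dists(x, y), eps)
            - 0.5 * entropic_cost(a, a, sq_dists(x, x), eps)
            - 0.5 * entropic_cost(b, b, sq_dists(y, y), eps))

rng = np.random.default_rng(0)
x = rng.uniform(size=(20, 2))
y = rng.uniform(size=(25, 2)) + np.array([1.0, 0.0])
a = np.full(20, 1.0 / 20)
b = np.full(25, 1.0 / 25)
eps = 0.5

raw_self = entropic_cost(a, a, sq_dists(x, x), eps)
print("raw cost of alpha against itself:", raw_self)
print("debiased value, alpha vs alpha  :", sinkhorn_divergence(x, a, x, a, eps))
print("debiased value, alpha vs beta   :", sinkhorn_divergence(x, a, y, b, eps))
```

The first printed value illustrates the entropic bias: the raw cost of a non-Dirac measure against itself is strictly positive, whereas the debiased value vanishes by construction when both arguments coincide.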

The next figure demonstrates the role of the two self-cost corrections by optimizing a finite point cloud against a fixed target with and without debiasing.

Debiasing by point optimization. For large $\epsilon$ , minimizing the raw entropic cost collapses atoms toward the barycenter, whereas the self-cost subtraction keeps a bimodal cloud.

The interactive demo below shows the same mechanism with two-dimensional point clouds. The raw entropic loss tends to keep the fitted cloud too concentrated, whereas the self-cost correction spreads the moving particles across the target geometry.

Interactive panel. Use the smoothing, correction, and iteration controls to compare raw entropic attraction with the debiased Sinkhorn divergence.

The zero-temperature statement is the $\Gamma$-convergence of entropic OT to the Kantorovich problem (Léonard, 2012; Carlier et al., 2017). Non-negativity of relative entropy and continuity of $c$ give the liminf inequality; a finite-entropy approximation of an optimal coupling, chosen so that its entropy multiplied by $\epsilon$ vanishes, gives a recovery sequence.

Proposition: Positivity of Sinkhorn Divergences

Assume dual optimizers exist and the symmetric kernel $k_\epsilon(x,y)=e^{-c(x,y)/\epsilon}$ is positive semidefinite in the sense of the definition "Positive and Conditionally Positive Kernels". Then $\overline{\mathcal L}_c^\epsilon(\alpha,\beta)\ge0$.

If, in addition, the common state space is compact, $c$ is continuous, and $k_\epsilon$ is universal, then $\overline{\mathcal L}_c^\epsilon(\alpha,\beta)=0$ if and only if $\alpha=\beta$. Moreover, $\overline{\mathcal L}_c^\epsilon(\alpha_n,\alpha)\to0$ is equivalent to $\alpha_n\rightharpoonup\alpha$ (Feydy et al., 2019).

Example: Large Temperature Collapse

Suppose that minimizers $\be_\epsilon$ are tight and that the large-temperature convergence is uniform enough to pass to their cluster points. The limiting functional is linear in the second argument:

\be\mapsto \int V_\al(y)\d\be(y), \qquad V_\al(y)\eqdef\int c(x,y)\d\al(x).

(134)

Thus every cluster point is supported on $\argmin V_\al$. When this set is the singleton $\{y^\star(\al)\}$,

\be_\epsilon \rightharpoonup \delta_{y^\star(\al)}, \qquad y^\star(\al)=\uargmin{y} V_\al(y).

(135)

For the quadratic cost $c(x,y)=\norm{x-y}^2$ on $\RR^d$, assuming $\al$ has finite second moment, one has $V_\al(y)=\norm{y-\int x\d\al(x)}^2+\mathrm{const}$, so the collapse is toward the Dirac mass at the mean of $\al$.
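The quadratic identity behind this collapse can be checked on an empirical measure. The following is a small NumPy sketch, assuming $\al$ is replaced by a uniform measure on sampled points; the sample, the helper `V` and the test locations are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples standing in for alpha (hypothetical data, uniform weights).
x = rng.normal(loc=[1.0, -2.0], scale=[0.5, 1.5], size=(400, 2))
mean = x.mean(axis=0)

def V(y):
    """Empirical potential V_alpha(y): average of |x - y|^2 over the samples."""
    return ((x - y) ** 2).sum(axis=1).mean()

# The constant in V(y) = |y - mean|^2 + const is the total variance of the samples.
const = ((x - mean) ** 2).sum(axis=1).mean()

y_probe = rng.normal(size=(5, 2))
gaps = [V(y) - (((y - mean) ** 2).sum() + const) for y in y_probe]
print("max deviation from |y - mean|^2 + const:", max(abs(g) for g in gaps))
print("V at the mean:", V(mean), " V at a shifted point:", V(mean + 0.3))
```

Since the constant does not depend on $y$, the unique minimizer of the empirical potential is the sample mean, which is the discrete counterpart of the Dirac limit stated above.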

Complex $\epsilon$ ¶

The Sinkhorn temperature is usually a positive real number: positivity makes the Gibbs kernel positive, gives the entropy a convex meaning, and underlies the Hilbert-metric and monotone convergence arguments of the next chapter. Once the equations are written as exponential fixed-point equations, however, $\epsilon$ can also be regarded locally as a complex variable. This does not produce a positive coupling or a contraction theorem; it produces a holomorphic branch of the same scaling equations near any positive real temperature.

Measure fixed point¶

Let $\alpha\in\mathcal P(X)$ and $\beta\in\mathcal P(Y)$ . For $\epsilon\in\mathbb C\setminus\{0\}$ , reuse the Gibbs kernel and integral operators in (24) whenever the defining integrals exist. The density factorization (25), the marginal equations that follow it, and the updates (27) then remain valid verbatim for complex-valued $u$ and $v$ , provided that the divisions are well defined. For real $\epsilon>0$ these are the usual Sinkhorn equations. For complex $\epsilon$ they are only local analytic identities: they imply neither positivity of $\pi_\epsilon$ nor convergence of the alternating iteration.

Discrete histograms¶

For discrete histograms, evaluate the Gibbs kernel entrywise at $(x_i,y_j)$ . The factorization (4) and Sinkhorn updates (8) are then used verbatim over $\mathbb C$ , with the same marginal constraints and multiplicative gauge, whenever $\epsilon\ne0$ and all divisions are defined. This complexified iteration is again a local parametrization of the scaling equations, not a globally convergent algorithm.
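The complexified updates can be tried on a small example. This is only a sketch under stated assumptions: it uses the usual multiplicative updates $u\leftarrow a/(Kv)$ and $v\leftarrow b/(K^\top u)$, a squared-distance cost on two small grids, and a warm start that increases the imaginary part in small steps from a real temperature. The function name `sinkhorn_scalings`, the grid sizes and the step schedule are illustrative; nothing guarantees convergence for larger imaginary parts.

```python
import numpy as np

def sinkhorn_scalings(a, b, C, eps, v=None, n_iter=300):
    """Multiplicative Sinkhorn updates; eps may be complex (no convergence guarantee)."""
    K = np.exp(-C / eps)                        # entrywise Gibbs kernel, complex if eps is
    v = np.ones(len(b), dtype=complex) if v is None else v
    for _ in range(n_iter):
        u = a / (K @ v)                         # enforce row sums given v
        v = b / (K.T @ u)                       # enforce column sums given u
    return u, v, K

x = np.linspace(0.0, 1.0, 6)
y = np.linspace(0.0, 1.0, 7)
C = (x[:, None] - y[None, :]) ** 2
a = np.full(6, 1.0 / 6)
b = np.full(7, 1.0 / 7)

eps0 = 0.55
v = None
for eta in np.linspace(0.0, 0.2, 5):            # continuation in the imaginary part
    u, v, K = sinkhorn_scalings(a, b, C, eps0 + 1j * eta, v=v)

P = u[:, None] * K * v[None, :]                 # complex coupling diag(u) K diag(v)
row_err = np.abs(P.sum(axis=1) - a).max()
col_err = np.abs(P.sum(axis=0) - b).max()
print("marginal errors:", row_err, col_err)
print("total mass of |P|:", np.abs(P).sum())
```

When the iteration settles, the complex matrix `P` carries the prescribed marginals, while the total mass of its entrywise modulus is at least one by the triangle inequality, with a gap that reflects cancellation between complex entries.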

The following finite-dimensional result is the scaling-variable counterpart of Theorem 2.1 and Remark 2.2 of Carlier et al. (2023). For compactly supported marginals and a continuous cost, Carlier, Pegon and Tamanini prove through the Schrödinger system that the normalized potentials, and hence the entropic cost, depend analytically on every real $\epsilon>0$; their remark explicitly records the resulting local extension to complex temperatures. We state the discrete version directly in $(u,v)$ and also allow the marginals and cost matrix to vary.

Theorem: Local Holomorphic Continuation of Sinkhorn Scalings

Fix positive histograms $a^0\in\Delta_n$, $b^0\in\Delta_m$, a finite real cost matrix $\C^0$, and a real temperature $\epsilon_0>0$. Choose positive scalings $(u^0,v^0)$ of the corresponding Sinkhorn coupling. Then there are complex neighborhoods of $(\epsilon_0,a^0,b^0,\C^0)$, inside the affine constraint $\sum_i a_i=\sum_j b_j$, and a unique holomorphic map

(\epsilon,a,b,\C)\mapsto(u_\epsilon,v_\epsilon)\in\mathbb C^n\times\mathbb C^m

(136)

satisfying the linear gauge $\sum_i a_i^0u_{\epsilon,i}=\sum_i a_i^0u_i^0$, such that

\P_\epsilon = \operatorname{diag}(u_\epsilon)K_\epsilon(\C) \operatorname{diag}(v_\epsilon), \qquad \P_\epsilon\mathbf 1_m=a, \qquad \P_\epsilon^\top\mathbf 1_n=b.

(137)

Consequently the gauge-fixed scalings and the coupling $\P_\epsilon$ are holomorphic in all four arguments near the base point.

After shrinking the neighborhoods if necessary, every scaling coordinate is nonzero. Choosing local logarithms defines the holomorphic log-scalings $f_\epsilon=\epsilon\log u_\epsilon$ and $g_\epsilon=\epsilon\log v_\epsilon$. They are useful for visualization but are not needed in the theorem.

The next figure visualizes the local continuation at the level of the coupling, without choosing logarithm branches. For a fixed real part $\epsilon_0$, it displays $|\P_{\epsilon_0+i\eta}|$ at four increasing imaginary parts. The complex matrix $\P_{\epsilon_0+i\eta}$ retains the prescribed marginals $(a,b)$, but taking its entrywise modulus destroys these linear identities. The attached red and blue profiles therefore show $(a,b)$, while the violet profiles show the row and column sums of $|\P_{\epsilon_0+i\eta}|$ and reveal the cancellations hidden by the modulus.

The magnitude of a complex Sinkhorn coupling exposes the oscillations created by an imaginary temperature. Two Gaussian-mixture histograms and $\epsilon_0=0.55$ are fixed, while $\epsilon=\epsilon_0+i\eta$ is continued through $\eta\in\{0,0.20,0.40,0.60\}$. Every panel uses the same intensity scale. The red and blue profiles are the prescribed marginals of the complex coupling; the violet profiles are the marginals of its entrywise modulus and separate from them as cancellation increases.

Interactive panel. Vary the real and imaginary parts of the temperature to inspect $|\P_{\epsilon_0+i\eta}|$. The side profiles distinguish the prescribed marginals of the complex matrix from the marginals obtained after taking its entrywise modulus.

Example: Centered one-dimensional Gaussians

Let $\al=\Gaussian(0,\sigma_\al^2)$ and $\be=\Gaussian(0,\sigma_\be^2)$ on $\RR$, with $\sigma_\al,\sigma_\be>0$, and take $c(x,y)=(x-y)^2$. The proposition "Balanced Entropic OT Between Gaussians" shows that the real-temperature Sinkhorn coupling is Gaussian. Its complex continuation uses the same formula. For $\Re(\epsilon)>0$, set

k_\epsilon \eqdef \frac{\sqrt{\epsilon^2+16\sigma_\al^2\sigma_\be^2}-\epsilon}{4},

(141)

where the square root is the holomorphic branch on the right half-plane that is positive for real $\epsilon>0$ . The continued coupling $\pi_\epsilon$ is a centered complex Gaussian, with covariance parameter

\begin{pmatrix} \sigma_\al^2 & k_\epsilon\\ k_\epsilon & \sigma_\be^2 \end{pmatrix}.

(142)

Thus no new Gaussian computation is required. The formula defines a finite complex Gaussian coupling throughout $\Re(\epsilon)>0$. It can be continued farther along paths on which the Gaussian integral converges and which avoid $\epsilon=0$ and the square-root branch points $\epsilon=\pm4i\sigma_\al\sigma_\be$.
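The closed form for $k_\epsilon$ is easy to evaluate at complex temperatures. The sketch below assumes that the principal complex square root computed by `numpy.sqrt` coincides with the required branch on $\Re(\epsilon)>0$, which holds because $\epsilon^2+16\sigma_\al^2\sigma_\be^2$ then avoids the non-positive real axis. The function name `cross_covariance` and the numerical values are illustrative.

```python
import numpy as np

def cross_covariance(eps, sigma_a, sigma_b):
    """Off-diagonal covariance k_eps of the continued Gaussian coupling, Re(eps) > 0."""
    eps = complex(eps)
    s2 = (sigma_a * sigma_b) ** 2
    # Principal square root: holomorphic for Re(eps) > 0 and positive for real eps > 0.
    return (np.sqrt(eps ** 2 + 16.0 * s2) - eps) / 4.0

sigma_a, sigma_b = 1.5, 0.8
s = sigma_a * sigma_b

k_real = cross_covariance(0.55, sigma_a, sigma_b)
k_cplx = cross_covariance(0.55 + 0.6j, sigma_a, sigma_b)
print("real temperature   :", k_real)
print("complex temperature:", k_cplx)

# Squaring 4k + eps shows that k solves k^2 + eps*k/2 = (sigma_a*sigma_b)^2.
eps_c = 0.55 + 0.6j
residual = k_cplx ** 2 + eps_c * k_cplx / 2.0 - s ** 2
print("quadratic residual :", abs(residual))
```

Two limits give a quick sanity check on the formula: as $\epsilon\to0^+$ the value tends to $\sigma_\al\sigma_\be$, the fully correlated coupling, and as $\epsilon\to+\infty$ it tends to $0$, the independent coupling.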

References¶

Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2), 876–879. 10.1214/aoms/1177703591
Sinkhorn, R., & Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2), 343–348. 10.2140/pjm.1967.21.343
Sinkhorn, R. (1967). Diagonal equivalence to matrices with prescribed row and column sums. American Mathematical Monthly, 74, 402–405.
Cuturi, M. (2013). Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26, 2292–2300.
Peyré, G., & Cuturi, M. (2019). Computational Optimal Transport: With Applications to Data Science. Foundations and Trends in Machine Learning, 11(5–6), 355–607. 10.1561/2200000073
Nesterov, Y., & Nemirovskii, A. (1994). Interior-point polynomial algorithms in convex programming (Vol. 13). SIAM.
Altschuler, J., Weed, J., & Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Advances in Neural Information Processing Systems, 30, 1964–1974.
Dvurechensky, P., Gasnikov, A., & Kroshnin, A. (2018). Computational Optimal Transport: Complexity by Accelerated Gradient Descent Is Better Than by Sinkhorn’s Algorithm. In J. Dy & A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning (Vol. 80, pp. 1367–1376). PMLR.
Knight, P. A. (2008). The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1), 261–275. 10.1137/060659624
Léonard, C. (2012). From the Schrödinger problem to the Monge–Kantorovich problem. Journal of Functional Analysis, 262(4), 1879–1920.
Carlier, G., Duval, V., Peyré, G., & Schmitzer, B. (2017). Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2), 1385–1418.
Conforti, G., & Tamanini, L. (2021). A formula for the time derivative of the entropic cost and applications. Journal of Functional Analysis, 280(11), 108964. 10.1016/j.jfa.2021.108964
Chizat, L., Roussillon, P., Léger, F., Vialard, F.-X., & Peyré, G. (2020). Faster Wasserstein Distance Estimation with the Sinkhorn Divergence. Advances in Neural Information Processing Systems, 33, 2257–2269.
Nutz, M., & Wiesel, J. (2022). Entropic Optimal Transport: Convergence of Potentials. Probability Theory and Related Fields, 184, 401–424. 10.1007/s00440-021-01096-8
Amos, B., Xu, L., & Kolter, J. Z. (2017). Input Convex Neural Networks. Proceedings of the 34th International Conference on Machine Learning, 70, 146–155.
