Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Entropic Regularization: Sinkhorn Algorithm

Entropic regularization makes optimal transport smooth, strictly convex and scalable. This chapter first explains the discrete KL-regularized problem, derives Sinkhorn’s alternating matrix scaling algorithm, and then rewrites the same construction as a relative-entropy projection problem. It then records the general continuous formulation, explains the path-space Schrodinger problem behind the static coupling formulation, develops the dual soft-transform picture, and presents the main convex regularization variants and the debiased Sinkhorn divergence.

The presentation connects the older matrix-scaling literature Sinkhorn, 1964Sinkhorn & Knopp, 1967Sinkhorn, 1967 with modern entropic OT Cuturi, 2013Peyré & Cuturi, 2019.

Entropic Regularization for Discrete Measures

Entropy turns a possibly non-unique linear program into a unique smooth problem. The price is bias, but the reward is differentiability and fast scaling algorithms.

Using this entropy as a regularizing function gives the approximate transport value

LCϵ(a,b):=minPU(a,b)P,CϵH(P).\mathcal{L}_{C}^{\epsilon}(a,b) \eqdef \min_{P\in\mathbf{U}(a,b)} \langle P,C\rangle - \epsilon H(P).

Equivalently, the regularizer is ϵi,jPi,jlogPi,j\epsilon\sum_{i,j}P_{i,j}\log P_{i,j}. It penalizes concentrated couplings and makes the objective strictly convex on the relative interior of the transport polytope.

Proof

The transport polytope is non-empty and compact, and the objective is continuous with the convention 0log0=00\log0=0, so a minimizer exists. On the relative interior,

2H(P)=diag(1/Pi,j)-\partial^2 H(P)=\operatorname{diag}(1/P_{i,j})

is positive definite on every non-zero feasible direction. Hence H-H is strictly convex on the polytope, which gives uniqueness.

If ai,bj>0a_i,b_j>0 and a minimizer had Pi,j=0P_{i,j}=0, then the perturbation Pt=(1t)P+tabP_t=(1-t)P+t\,a\otimes b remains feasible for small t>0t>0. The derivative of rlogrr\log r at zero along a positive direction is -\infty, so the objective decreases, contradicting optimality.

Smoothing Effect

The entropy acts as a barrier for positivity and makes LCϵ(a,b)\mathcal{L}_{C}^{\epsilon}(a,b) smooth in aa, bb, and CC as long as these variables stay in the relative interior. As ϵ+\epsilon\to+\infty, the minimizer converges to the independent coupling aba\otimes b; as ϵ0\epsilon\to0, it approaches the optimal face of the original transport linear program.

<IPython.core.display.Image object>

Entropic regularization and slack barriers. Large ϵ\epsilon selects an interior reference point, while small ϵ\epsilon moves the minimizer toward a low-cost face of the transport polytope. The second row gives the analogous entropy-on-slacks picture for a generic linear program.

Entropy Barriers Versus Generic LP Barriers

For a generic linear program minzz\min_z \ell^\top z with constraints AzbAz\le b, one can introduce positive slacks s=bAzs=b-Az and penalize them by an entropy. This is a useful analogy, but it is not the standard self-concordant interior-point barrier. The canonical barrier is the Burg, or reverse-KL, barrier ilogsi-\sum_i\log s_i, which leads to Newton systems.

Optimal transport is special because entropy is placed on the entries of PP, while the constraints are only row and column marginals. This separable structure turns Bregman projections into diagonal rescalings, giving the Sinkhorn iterations.

Sinkhorn’s Algorithm

Sinkhorn’s algorithm is alternating normalization of rows and columns. The key point is that the optimizer of the entropic problem has a multiplicative scaling form.

Proof

After removing zero-mass rows and columns, the minimizer is strictly positive, so the positivity constraint can be ignored in the first-order conditions. Introduce Lagrange multipliers fRnf\in\RR^n and gRmg\in\RR^m for the two marginal constraints. The Lagrangian is

L(P,f,g)=P,C+ϵi,jPi,jlogPi,j+f,aP1+g,bP1.\mathcal{L}(P,f,g) = \langle P,C\rangle + \epsilon\sum_{i,j}P_{i,j}\log P_{i,j} + \langle f,a-P\mathbf 1\rangle + \langle g,b-P^\top\mathbf 1\rangle .

Stationarity with respect to Pi,jP_{i,j} gives

Ci,j+ϵ(logPi,j+1)figj=0.C_{i,j} + \epsilon(\log P_{i,j}+1) - f_i-g_j =0.

Thus Pi,j=exp((fi+gjCi,j)/ϵ1)P_{i,j}=\exp((f_i+g_j-C_{i,j})/\epsilon-1), which is exactly the scaling form after absorbing the one-body terms into uu and vv.

In matrix notation, P=diag(u)Kdiag(v)P=\operatorname{diag}(u)K\operatorname{diag}(v). The marginal constraints become

u(Kv)=a,v(Ku)=b.u\odot(Kv)=a, \qquad v\odot(K^\top u)=b.

Solving each equation in turn gives Sinkhorn’s algorithm:

u(+1)=aKv(),v(+1)=bKu(+1).u^{(\ell+1)} = \frac{a}{Kv^{(\ell)}}, \qquad v^{(\ell+1)} = \frac{b}{K^\top u^{(\ell+1)}}.

The division is entrywise. The scaling vectors are not unique: multiplying uu by λ>0\lambda>0 and vv by 1/λ1/\lambda leaves PP unchanged.

<IPython.core.display.Image object>

Marginal constraints during Sinkhorn scaling. Row normalizations align the red source marginal and leave a blue defect; column normalizations align the blue target marginal and leave a red defect.

The interactive demo exposes the alternating row/column normalization directly. Change the half-step count to see the current coupling acquire one marginal, lose the other, and then converge toward both.

Interactive panel. Use the iteration, regularization, and mass controls to watch Sinkhorn row and column scalings enforce the marginals.

<IPython.core.display.Image object>

Dense Sinkhorn scaling for one-dimensional Gaussian-mixture marginals. The violet side curves are the current row and column sums; the red and blue curves are the prescribed marginals.

After convergence, the regularization strength controls how much of the Gibbs kernel remains visible in the optimal plan. Small ϵ\epsilon produces a concentrated transport band, while larger ϵ\epsilon spreads the same marginals into a smoother coupling.

<IPython.core.display.Image object>

Final Sinkhorn couplings for the same one-dimensional marginals and four regularization strengths. Decreasing ϵ\epsilon sharpens the plan toward an optimal-transport graph; increasing ϵ\epsilon keeps more of the product structure.

<IPython.core.display.Image object>

KL-normalized dual potentials along the scaling iteration. The logarithmic scaling potentials stabilize as the row/column normalizations converge.

The next interactive demo keeps the iteration count high and varies the temperature. It is the quickest way to see the geometry-bias tradeoff: low temperature is geometric and sharp, high temperature is smooth and closer to independence.

Interactive panel. Use the regularization slider to compare sparse exact-looking couplings with smoother entropic plans and potentials.

Complexity bounds for Sinkhorn and comparisons with accelerated first-order methods are discussed in Altschuler et al., 2017Dvurechensky et al., 2018Knight, 2008. For a dense n×mn\times m problem, each iteration costs one multiplication by KK and one by KK^\top, so the cost scales like CnmCnm for CC iterations. For fixed positive ϵ\epsilon, the marginal error eventually has a linear regime, but small ϵ\epsilon makes the Gibbs kernel more peaked and scaling harder.

<IPython.core.display.Image object>

Marginal violation along Sinkhorn half-steps for several values of ϵ\epsilon. Smaller ϵ\epsilon gives sharper transport geometry but slower scaling.

Reformulation Using Relative Entropy

The KL formulation identifies Sinkhorn as a projection method. It also prepares the continuous and unbalanced settings, where a reference measure is essential.

For matrices with the same total mass, the affine terms cancel and

KL(PQ)=i,jPi,jlogPi,jQi,j.\operatorname{KL}(P|Q) = \sum_{i,j}P_{i,j}\log\frac{P_{i,j}}{Q_{i,j}}.

On fixed-mass couplings, taking Q=1n×mQ=\mathbf 1_{n\times m} is equivalent to subtracting the Shannon--Boltzmann entropy. Taking the tensor-product reference gives the normalized formulation

minPU(a,b)P,C+ϵKL(Pab).\min_{P\in\mathbf U(a,b)} \langle P,C\rangle + \epsilon\operatorname{KL}(P|a\otimes b).
Proof

Write ϕ(s)=slogss+1\phi(s)=s\log s-s+1. Convexity gives ϕ(s)ϕ(1)+ϕ(1)(s1)=0\phi(s)\ge\phi(1)+\phi'(1)(s-1)=0, with equality only at s=1s=1. Hence

KL(PQ)=i,jQi,jϕ(Pi,j/Qi,j)0.\operatorname{KL}(P|Q) = \sum_{i,j}Q_{i,j}\phi(P_{i,j}/Q_{i,j}) \ge0.
Proof

Expand the logarithm and use the marginal constraints:

KL(Pab)=KL(Pab)+iailogaiai+jbjlogbjbj=KL(Pab)KL(aa)KL(bb).\begin{aligned} \operatorname{KL}(P|a\otimes b) &= \operatorname{KL}(P|a'\otimes b') + \sum_i a_i\log\frac{a_i'}{a_i} + \sum_j b_j\log\frac{b_j'}{b_j} \\ &= \operatorname{KL}(P|a'\otimes b') - \operatorname{KL}(a|a') - \operatorname{KL}(b|b'). \end{aligned}

The tensor-product reference is nevertheless useful when supports vary. It makes explicit which entries may vanish and passes cleanly to the continuous formulation.

<IPython.core.display.Image object>

KL-normalized Sinkhorn dual potentials for one-dimensional Gaussian-mixture histograms. Increasing ϵ\epsilon turns the hard cc-transform geometry into smoother log-sum-exp potentials.

Proof Sketch

For ϵ0\epsilon\to0, use compactness of the transport polytope and compare the optimality inequalities for the entropic problem against an exact Kantorovich optimizer. The cost gap is bounded by ϵ\epsilon times a KL difference, so every cluster point is cost-optimal; after dividing by ϵ\epsilon, the cluster point is the KL-minimizer on the optimal face.

For ϵ+\epsilon\to+\infty, subtract a constant from CC so that C0C\ge0. Testing the objective at aba\otimes b gives

KL(Pϵab)C,abϵ,\operatorname{KL}(P_\epsilon|a\otimes b) \le \frac{\langle C,a\otimes b\rangle}{\epsilon},

so the KL divergence to aba\otimes b vanishes.

<IPython.core.display.Image object>

Entropically regularized couplings between the red disk and blue annulus point clouds. The plans are strictly positive for every ϵ>0\epsilon>0, but the visible mass pattern evolves from nearly radial and sparse to diffuse as ϵ\epsilon increases.

General Formulation

The continuous formulation replaces matrices by measures and discrete KL by relative entropy. For probability measures α\alpha and β\beta, define

Lcϵ(α,β):=minπΠ(α,β)X×Yc(x,y)dπ(x,y)+ϵKL(παβ).\mathcal{L}_{c}^{\epsilon}(\alpha,\beta) \eqdef \min_{\pi\in\Couplings(\alpha,\beta)} \int_{\X\times\Y}c(x,y)\,\d\pi(x,y) + \epsilon\operatorname{KL}(\pi|\alpha\otimes\beta).

For fixed balanced marginals, the specific product reference only matters up to additive constants, provided the reference marginals are mutually absolutely continuous with α\alpha and β\beta. Its support still matters: it determines which couplings have finite entropy.

Path-Space Schrodinger Problem

Schrodinger’s reciprocal problem is naturally posed on paths rather than on endpoint pairs. The Sinkhorn problem appears after the path law is reduced to its two endpoint marginals.

Unregularized Path-Space Transport

Let Ω=C([0,1];X)\Omega=C([0,1];\X) be a path space and let et(ω)=ωte_t(\omega)=\omega_t be the evaluation maps. Given a path action A:Ω[0,+]\mathcal A:\Omega\to[0,+\infty], the unregularized path-space problem is

infMP(Ω){ΩA(ω)dM(ω):(e0)M=α, (e1)M=β}.\inf_{M\in\mathcal P(\Omega)} \left\{ \int_\Omega\mathcal A(\omega)\,\d M(\omega) : (e_0)_\sharp M=\alpha,\ (e_1)_\sharp M=\beta \right\}.

For quadratic Wasserstein geometry on Rd\RR^d,

A(ω)={01ω˙t2dt,ω absolutely continuous,+,otherwise.\mathcal A(\omega) = \begin{cases} \int_0^1\norm{\dot\omega_t}^2\,\d t, & \omega\text{ absolutely continuous},\\ +\infty, & \text{otherwise}. \end{cases}

The endpoint cost induced by the action is

cA(x,y):=inf{A(ω):e0(ω)=x, e1(ω)=y}.c_{\mathcal A}(x,y) \eqdef \inf\left\{ \mathcal A(\omega): e_0(\omega)=x,\ e_1(\omega)=y \right\}.

For the quadratic action, the minimizing path is the straight segment and cA(x,y)=xy2c_{\mathcal A}(x,y)=\norm{x-y}^2.

Proof

For any feasible path law MM, set π=(e0,e1)M\pi=(e_0,e_1)_\sharp M. Then πΠ(α,β)\pi\in\Couplings(\alpha,\beta) and by definition of cAc_{\mathcal A},

ΩA(ω)dM(ω)X×XcA(x,y)dπ(x,y).\int_\Omega\mathcal A(\omega)\,\d M(\omega) \ge \int_{\X\times\X}c_{\mathcal A}(x,y)\,\d\pi(x,y).

Conversely, mix selected minimizing paths according to any coupling π\pi and then optimize over π\pi.

Entropic Path-Space Problem

Let RϵP(Ω)R^\epsilon\in\mathcal P(\Omega) be a reference path law, for instance a Brownian or Langevin dynamics at noise level ϵ\epsilon. The dynamic Schrodinger bridge problem is the entropy projection

SBϵ(α,β):=infMP(Ω){ϵKL(MRϵ):(e0)M=α, (e1)M=β}.\mathrm{SB}_\epsilon(\alpha,\beta) \eqdef \inf_{M\in\mathcal P(\Omega)} \left\{ \epsilon\operatorname{KL}(M|R^\epsilon): (e_0)_\sharp M=\alpha,\ (e_1)_\sharp M=\beta \right\}.

It asks for the most likely path law, relative to the prior dynamics, among all path laws matching the observed endpoints Schrödinger, 1931Léonard, 2012Léonard, 2014Chen et al., 2016.

The dynamic problem also has viscous Benamou--Brenier formulations. In one common convention,

tρt+div(ρtvt)=σ2Δρt\partial_t\rho_t+\operatorname{div}(\rho_t v_t) = \frac{\sigma}{2}\Delta\rho_t

and one minimizes a kinetic action. After absorbing the diffusion into the velocity ut=vtσ2logρtu_t=v_t-\frac{\sigma}{2}\nabla\log\rho_t, the same minimizers are obtained from

01 ⁣(12ut(x)2+σ28logρt(x)2)ρt(x)dxdt,\int_0^1\!\int \left( \frac12\norm{u_t(x)}^2 + \frac{\sigma^2}{8}\norm{\nabla\log\rho_t(x)}^2 \right) \rho_t(x)\,\d x\,\d t,

up to endpoint entropy terms. The extra term is a Fisher-information penalty.

Proof

Disintegrate both MM and RϵR^\epsilon with respect to their endpoint pairs. The chain rule gives

KL(MRϵ)=KL(πR01ϵ)+KL(Mx,yRϵ,x,y)dπ(x,y).\operatorname{KL}(M|R^\epsilon) = \operatorname{KL}(\pi|R^\epsilon_{01}) + \int \operatorname{KL}(M^{x,y}|R^{\epsilon,x,y}) \,\d\pi(x,y).

The second term is nonnegative and vanishes exactly when the conditional path law is the reference bridge for π\pi-almost every endpoint pair.

For Brownian reference dynamics on Rd\RR^d, the endpoint prior has heat-kernel density proportional to

exp(xy2ϵ).\exp\left(-\frac{\norm{x-y}^2}{\epsilon}\right).

After rewriting this prior with respect to αβ\alpha\otimes\beta, the endpoint problem is exactly the continuous Sinkhorn problem up to an additive constant. Sinkhorn computes which endpoints should be paired; the path-space Schrodinger bridge then connects each pair by a Brownian bridge.

<IPython.core.display.Image object>

Endpoint couplings lifted to Brownian bridges. Increasing ϵ\epsilon both softens the endpoint coupling and amplifies the Brownian fluctuations between paired endpoints.

With this terminology, the entropic problem is

infXα,  YβE(c(X,Y))+ϵI(X,Y).\inf_{X\sim\alpha,\;Y\sim\beta} \mathbb E(c(X,Y))+\epsilon\mathcal I(X,Y).

Large ϵ\epsilon favors nearly independent endpoints, while small ϵ\epsilon suppresses endpoint randomness and recovers an optimal Monge--Kantorovich coupling in the limit.

Dual of Sinkhorn

The dual point of view replaces couplings by potentials and soft cc-transforms. It is the right formulation for stabilized implementations and differentiation.

Discrete Dual

The KL-normalized problem has the dual

minPU(a,b)P,C+ϵKL(Pab)=maxf,g[f,a+g,bϵi,jexp(fi+gjCi,jϵ)aibj+ϵ].\min_{P\in\mathbf U(a,b)} \langle P,C\rangle+\epsilon\operatorname{KL}(P|a\otimes b) = \max_{f,g} \left[ \langle f,a\rangle+\langle g,b\rangle - \epsilon\sum_{i,j} \exp\left(\frac{f_i+g_j-C_{i,j}}{\epsilon}\right)a_i b_j + \epsilon \right].

The optimal potentials are linked to the scaling variables through

ui=aiefi/ϵ,vj=bjegj/ϵ.u_i=a_i e^{f_i/\epsilon}, \qquad v_j=b_j e^{g_j/\epsilon}.

Discrete Soft cc-Transforms

For fixed gg, maximizing the dual with respect to ff gives

fi=ϵlogjexp(gjCi,jϵ)bj.f_i = -\epsilon\log \sum_j \exp\left(\frac{g_j-C_{i,j}}{\epsilon}\right)b_j.

This is a smoothed minimum.

Exponentiating the alternating soft-transform iterations recovers Sinkhorn’s algorithm. For small ϵ\epsilon, one must compute the log-sum-exp terms with the usual stabilization trick: subtract the minimum before exponentiating and add it back afterward.

<IPython.core.display.Image object>

Soft cc-transforms for decreasing temperatures. A positive ϵ\epsilon replaces the hard lower envelope by a log-sum-exp soft minimum.

Interactive panel. Use the epsilon and potential controls to see how the hard c-transform is softened by log-sum-exp smoothing.

Continuous Dual and Soft-Transforms

For compact spaces and continuous cost, the KL-regularized problem has the concave dual

Lcϵ(α,β)=supf,gDϵ(f,g),\mathcal{L}_{c}^{\epsilon}(\alpha,\beta) = \sup_{f,g} \mathcal D_\epsilon(f,g),

where

Dϵ(f,g)=fdα+gdβϵ(e(f(x)+g(y)c(x,y))/ϵ1)dα(x)dβ(y).\mathcal D_\epsilon(f,g) = \int f\,\d\alpha+\int g\,\d\beta - \epsilon \int \left( e^{(f(x)+g(y)-c(x,y))/\epsilon} - 1 \right) \d\alpha(x)\d\beta(y).

This is the smooth counterpart of the hard feasibility constraint fgcf\oplus g\le c from the Kantorovich dual.

Proof Sketch

Normalize potentials by imposing fdα=0\int f\,\d\alpha=0. Replacing a pair of potentials by the corresponding soft transforms does not decrease the dual objective. The transformed potentials have oscillations bounded by the oscillation of cc, and their modulus of continuity is controlled by the modulus of continuity of cc. Arzela--Ascoli gives existence.

Uniqueness up to constants follows from strict convexity of HeH/ϵd(αβ)H\mapsto\int e^{H/\epsilon}\d(\alpha\otimes\beta) on the image of (f,g)fgc(f,g)\mapsto f\oplus g-c, modulo constants.

Other Convex Regularizers

KL regularization is the case that leads to multiplicative Sinkhorn scalings. Replacing KL by another density-ratio penalty keeps the same transport constraints but changes the scalar law linking the optimal density to the dual potentials.

Let ϕ\phi be an entropy function and define

Lc,ϕϵ(α,β):=minπΠ(α,β)c(x,y)dπ(x,y)+ϵDϕ(παβ).\mathcal L_{c,\phi}^{\epsilon}(\alpha,\beta) \eqdef \min_{\pi\in\Couplings(\alpha,\beta)} \int c(x,y)\,\d\pi(x,y) + \epsilon D_\phi(\pi|\alpha\otimes\beta).

For KL, ϕ(r)=rlogrr+1\phi(r)=r\log r-r+1 and ϕ,0(s)=es1\phi^{*,\ge0}(s)=e^s-1, recovering the Sinkhorn dual. Other choices replace the exponential law by another scalar transfer function:

ϕ(r)=rlogrr+1r=es,ϕ(r)=rlogr1r=(1s)1(s<1),ϕ(r)=12(r1)2r=(1+s)+,s=fgcϵ.\begin{array}{lll} \phi(r)=r\log r-r+1 &\Rightarrow& r^\star=e^s,\\ \phi(r)=r-\log r-1 &\Rightarrow& r^\star=(1-s)^{-1}\quad(s<1),\\ \phi(r)=\frac12(r-1)^2 &\Rightarrow& r^\star=(1+s)_+, \end{array} \qquad s=\frac{f^\star\oplus g^\star-c}{\epsilon}.
<IPython.core.display.Image object>

Density-ratio regularizers and coupling support. KL gives the usual diffuse positive plan, the Burg barrier keeps positive but differently tailed support, and the quadratic density penalty can set entries exactly to zero through its positive-part law.

The interactive demo below separates the two effects. The left plot shows the pointwise law r=h(s)r=h(s), while the right plot recomputes a coupling after enforcing the marginals with that law. Changing ϵ\epsilon controls how much the cost score is softened; changing the regularizer changes whether mass is spread everywhere, protected by a barrier, or clipped to a sparse support.

Interactive panel. Use the regularization-family and strength controls to compare entropic and quadratic penalties on the same transport problem.

Bregman vs. ϕ\phi-Divergence Regularization

The previous construction regularizes OT by a density-ratio divergence. This differs from using a Bregman divergence generated by a convex functional on the space of measures.

Main Idea of the Proof

Write the Bregman-regularized objective, up to constants independent of π\pi, as

ϵΦ(π)+(cϵδΦ(ξ))dπ.\epsilon\Phi(\pi) + \int(c-\epsilon\delta\Phi(\xi))\,\d\pi .

Dualizing the marginal constraints and minimizing over π\pi produces the global conjugate Φ\Phi^*. Equality in Fenchel’s inequality gives the Bregman optimality condition. The density-ratio dual follows instead from the scalar Legendre formula for Dϕ(αβ)D_\phi(\cdot|\alpha\otimes\beta).

Thus the two generalizations lead to different duals and different algorithms. Only for KL do density-ratio regularization and Bregman projection geometry coincide and reduce to multiplicative Sinkhorn scalings.

Sinkhorn Divergences

Sinkhorn divergences remove the entropic self-bias while retaining smoothness. They interpolate between OT-like geometry and kernel-like norms, which explains their statistical behavior.

Entropic Bias

The raw Sinkhorn cost is biased: for ϵ>0\epsilon>0, minimizing Lcϵ(α,β)\mathcal L_c^\epsilon(\alpha,\beta) over β\beta does not generally return β=α\beta=\alpha. In the large-temperature limit, the raw value behaves like a product interaction:

For c(x,y)=xy2c(x,y)=\norm{x-y}^2, minimizing this large-temperature limit over β\beta collapses toward a Dirac at the mean of α\alpha.

Sinkhorn Divergences

The standard debiasing subtracts the two self-interaction energies. This cancellation removes the large-temperature attraction toward the independent coupling; positivity is a separate property, proved below through the positive-definite kernel associated with ec/ϵe^{-c/\epsilon}.

<IPython.core.display.Image object>

Debiasing by point optimization. For large ϵ\epsilon, minimizing the raw entropic cost collapses atoms toward the barycenter, whereas the self-cost subtraction keeps a bimodal cloud.

The interactive demo below shows the same mechanism in a one-dimensional toy setting. Move the model cloud relative to the target and compare the raw entropic objective with its debiased version.

Interactive panel. Use the blur and separation controls to compare the raw entropic cost with the debiased Sinkhorn divergence.

Proof

At optimality,

1=e(fα,β(x)+gα,β(y)c(x,y))/ϵdβ(y)1 = \int e^{(f_{\alpha,\beta}(x)+g_{\alpha,\beta}(y)-c(x,y))/\epsilon} \d\beta(y)

for α\alpha-almost every xx. Therefore the exponential penalty term in the dual integrates to zero, and the dual value reduces to the two linear potential terms.

Proof Sketch

Use the optimal self-potentials for (α,α)(\alpha,\alpha) and (β,β)(\beta,\beta) as a suboptimal pair in the dual problem between α\alpha and β\beta. After rewriting with α~=efα,α/ϵα\tilde\alpha=e^{f_{\alpha,\alpha}/\epsilon}\alpha and β~=efβ,β/ϵβ\tilde\beta=e^{f_{\beta,\beta}/\epsilon}\beta, one obtains

1ϵLcϵ(α,β)1α~,β~k.\frac{1}{\epsilon} \overline{\mathcal L}_c^\epsilon(\alpha,\beta) \ge 1-\langle \tilde\alpha,\tilde\beta\rangle_k.

The self Sinkhorn fixed-point equations imply α~k=β~k=1\norm{\tilde\alpha}_k=\norm{\tilde\beta}_k=1, so Cauchy--Schwarz gives the result.

References
  1. Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2), 876–879. 10.1214/aoms/1177703591
  2. Sinkhorn, R., & Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2), 343–348. 10.2140/pjm.1967.21.343
  3. Sinkhorn, R. (1967). Diagonal equivalence to matrices with prescribed row and column sums. American Mathematical Monthly, 74, 402–405.
  4. Cuturi, M. (2013). Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26, 2292–2300.
  5. Peyré, G., & Cuturi, M. (2019). Computational Optimal Transport with Applications to Data Sciences. Foundations and Trends in Machine Learning, 11(5–6), 355–607. 10.1561/2200000073
  6. Altschuler, J., Weed, J., & Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Advances in Neural Information Processing Systems, 30, 1964–1974.
  7. Dvurechensky, P., Gasnikov, A., & Kroshnin, A. (2018). Computational Optimal Transport: Complexity by Accelerated Gradient Descent Is Better Than by Sinkhorn’s Algorithm. In J. Dy & A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning (Vol. 80, pp. 1367–1376). PMLR.
  8. Knight, P. A. (2008). The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1), 261–275.
  9. Schrödinger, E. (1931). Über die Umkehrung der Naturgesetze. Sitzungsberichte Preuss. Akad. Wiss. Berlin. Phys. Math., 144, 144–153.
  10. Léonard, C. (2012). From the Schrödinger problem to the Monge–Kantorovich problem. Journal of Functional Analysis, 262(4), 1879–1920.
  11. Léonard, C. (2014). A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete Continuous Dynamical Systems Series A, 34(4), 1533–1574.
  12. Chen, Y., Georgiou, T. T., & Pavon, M. (2016). On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint. Journal of Optimization Theory and Applications, 169(2), 671–691.