Beyond Comparing Measures

This chapter leaves the setting of scalar measures on a common ambient space. Vector- and matrix-valued OT transports mass with internal degrees of freedom, Gromov--Wasserstein compares metric-measure spaces without a prescribed correspondence, and quantum OT replaces scalar couplings by positive operators. In each case, the transport plan must also encode structure carried by the support, the fibers, or the non-commutative state space.

from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        sys.path.insert(0, str(myst_dir))
        break

if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"

def show_book_figure(name, width=760):
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))

Vector and Matrix-Valued Measures

Scalar OT transports a nonnegative density. In imaging, color processing, spectral analysis, diffusion tensor imaging and quantum-inspired models, the object attached to a point can instead have several nonnegative components or a positive semidefinite matrix. The first step beyond scalar OT is the positive vector-valued case: the fiber remains linear and commutative, but the transport cost may couple its channels.

Positive Vector-Valued Measures

The simplest way to keep internal structure in transport is to attach several nonnegative masses to each spatial point and to decide whether these channels move independently or interact through the cost.

This models multi-channel densities such as color histograms, spectral bins or several species transported on the same domain. In a conservative model the mass of each channel is preserved, so one assumes $\al_0^k(\X)=\al_1^k(\X)$ for every $k$ . The natural vector-valued extension therefore starts from the positive cone $\RR_+^m$ .

For the dynamic formulas below, assume that $\X$ is either $\RR^d$ , the flat torus, or a bounded convex domain in $\RR^d$ equipped with no-flux boundary conditions. First suppose that the endpoints and the curve have densities. The direct analogue of Benamou--Brenier fixes a vector density $u_t(x)\in\RR_+^m$ and a spatial flux $V_t(x)=(V_{t,1},\ldots,V_{t,d})\in(\RR^m)^d$ , where $V_{t,\ell}^k$ is the momentum of channel $k$ in spatial direction $\ell$ . The conservative vector transport cost associated with an action density $\Phi$ is

\mathcal W_{\Phi}^2(\al_0,\al_1) \eqdef \inf_{u,V} \int_0^1\!\int_\X \Phi(u_t(x),V_t(x))\,\d x\,\d t

(2)

subject to the endpoint constraints $u_0\d x=\al_0$ , $u_1\d x=\al_1$ and the componentwise continuity equation

\partial_t u_t+\nabla_x\cdot V_t=0, \qquad (\nabla_x\cdot V_t)^k = \sum_{\ell=1}^d\partial_{x_\ell}V_{t,\ell}^k.

(3)

Thus each component satisfies its own continuity equation, but the cost may still couple the components. Singular curves are handled as in scalar dynamic OT by replacing densities and fluxes by measures and using the lower semicontinuous perspective recession convention.

A simple quadratic family is obtained from a mobility matrix $\mathsf M(u)\in\mathbb S_+^m$ :

\Phi_{\mathsf M}(u,V) = \sum_{\ell=1}^d V_\ell^\top \mathsf M(u)^\dagger V_\ell,

(4)

with the usual convention that the value is finite only when each $V_\ell$ belongs to the range of $\mathsf M(u)$ . If $u\mapsto\mathsf M(u)$ is linear and takes values in $\mathbb S_+^m$ , this action is jointly convex and positively one-homogeneous in $(u,V)$ . Indeed, the matrix fractional map $(M,z)\mapsto z^\top M^\dagger z$ is jointly convex on its effective domain, and $\mathsf M(su)=s\mathsf M(u)$ for $s>0$ . For $m=1$ and $\mathsf M(u)=u$ , one recovers exactly the scalar Benamou--Brenier action. For

\mathsf M_{\mathrm{diag}}(u)=\operatorname{diag}(u_1,\ldots,u_m),

(5)

the channels move independently. Non-diagonal mobilities are the simplest way to couple the coordinates while keeping the same componentwise conservation law. For instance, with $q=m^{-1/2}(1,\ldots,1)$ and $\kappa\geq0$ ,

\mathsf M_\kappa(u) = \operatorname{diag}(u) + \kappa\left(\sum_{k=1}^m u_k\right)qq^\top

(6)

increases the mobility in the common channel direction $q$ while leaving transverse directions controlled by the diagonal part. The local cost of moving one component can therefore depend on the densities and momenta of the other components, even though each component mass remains conserved.
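The action (4) with the mobilities (5) and (6) can be evaluated pointwise with a few lines of NumPy. The following is only a minimal sketch: the function names `mobility_kappa` and `action`, the array layout of $V$ (one row per spatial direction) and the numerical range tolerance are illustrative assumptions, not part of the text.

```python
import numpy as np

def mobility_kappa(u, kappa):
    """M_kappa(u) = diag(u) + kappa * sum(u) * q q^T, with q = m^{-1/2} (1, ..., 1)."""
    u = np.asarray(u, dtype=float)
    q = np.ones(u.size) / np.sqrt(u.size)
    return np.diag(u) + kappa * u.sum() * np.outer(q, q)

def action(u, V, kappa=0.0, tol=1e-10):
    """Phi_M(u, V) = sum_l V_l^T M(u)^+ V_l, where V has shape (d, m).

    Returns +inf when some V_l lies outside the range of M(u).
    """
    M = mobility_kappa(u, kappa)
    M_pinv = np.linalg.pinv(M)
    V = np.atleast_2d(np.asarray(V, dtype=float))
    # M M^+ is the orthogonal projector onto range(M); it must leave each V_l unchanged.
    if np.linalg.norm(V @ (M @ M_pinv) - V) > tol * (1.0 + np.linalg.norm(V)):
        return np.inf
    return float(np.einsum("lk,kj,lj->", V, M_pinv, V))

u = np.array([0.5, 1.5])
V = np.array([[0.2, -0.3]])  # d = 1 spatial direction, m = 2 channels

phi_diag = action(u, V, kappa=0.0)                # diagonal case: sum_k V_k^2 / u_k
phi_coupled = action(u, V, kappa=2.0)             # coupled mobility M_kappa
phi_scaled = action(3.0 * u, 3.0 * V, kappa=2.0)  # one-homogeneity: 3 * phi_coupled
```

In this sketch the coupled action is never larger than the diagonal one, since $\mathsf M_\kappa(u)\succeq\mathsf M_{\mathrm{diag}}(u)$ for $\kappa\geq0$.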

The conservative positive-cone model above is the basic extension of Benamou--Brenier. Adding a source term $\partial_t u+\nabla\cdot V=S$ and a convex perspective penalty in $S$ gives unbalanced or reaction--transport variants Maas et al., 2015; Maas et al., 2016; Dolbeault et al., 2009; Mielke, 2013. These generalized transport models include dissipation and density modulation. The figure below contrasts the exact diagonal case $\kappa=0$, where each positive channel is transported by its quantile map, with a large-$\kappa$ illustrative common-mode interpolation in which the channels move more coherently. The endpoints are two-mode mixtures: at each spatial mode the two channels have Gaussian profiles with the same center but different amplitudes.

One-dimensional positive $\RR_+^2$ -valued transport displayed by arrow glyphs at eight time levels. Each endpoint is a mixture of two localized Gaussian modes, and, inside each mode, both channel profiles have the same center. Each arrow is proportional to the local fiber value $(u_t^1(x),u_t^2(x))$ , and time runs vertically from the red source to the blue target. Left: for $\kappa=0$ , the diagonal mobility gives two independent scalar quantile geodesics. Right: a large- $\kappa$ common-mode interpolation bends the display toward $q=2^{-1/2}(1,1)$ , illustrating the effect of a mobility that favors coherent channel motion while keeping the same componentwise continuity equation.

The interactive demo keeps the same glyph idea and lets the coupling strength bend the fibers toward a common channel direction.

Interactive panel. Use the coupling and mixture controls to see how vector-valued mass transports both location and channel composition.

Positive Matrix-Valued Measures

The next simplest fiber is the positive matrix cone. This is the simplest tensor-valued model beyond vectors: the diagonal entries behave like positive channels, while the eigenvectors encode local orientations.

Equivalently, $\mathcal A$ is a symmetric matrix of finite signed measures such that $\mathcal A(E)\succeq0$ for every Borel set $E$ . If $\mathcal A$ has density $A(x)\succeq0$ , then $\operatorname{tr}A(x)$ is the scalar amount of mass at $x$ , while, wherever $\operatorname{tr}A(x)>0$ , the normalized matrix $A(x)/\operatorname{tr}A(x)$ records an internal covariance or orientation. This is the matrix analogue of the positive vector case: diagonal matrices encode nonnegative vector components, and non-diagonal matrices add a local eigenbasis.

The conservative Benamou--Brenier model fixes a matrix density $A_t(x)\in\mathbb S_+^m$ and symmetric matrix fluxes $P_t(x)=(P_{t,1},\ldots,P_{t,d})\in(\mathbb S^m)^d$ . With no flux through the boundary of $\X$ , the full matrix mass $\int_\X A_t(x)\d x$ is conserved, so the endpoints must have the same total matrix. The model minimizes the matrix-perspective action

\mathcal W_{\mathrm{mat}}^2(\mathcal A_0,\mathcal A_1) \eqdef \inf_{A,P} \int_0^1\!\int_\X \sum_{\ell=1}^d \operatorname{tr}\!\left(P_{t,\ell}^{\top} A_t^\dagger P_{t,\ell}\right) \d x\,\d t

(10)

subject to $A_0\d x=\mathcal A_0$ , $A_1\d x=\mathcal A_1$ and to the matrix-valued continuity equation

\partial_t A_t+\nabla_x\cdot P_t=0, \qquad \nabla_x\cdot P_t = \sum_{\ell=1}^d\partial_{x_\ell}P_{t,\ell}.

(11)

Here $A^\dagger$ denotes the Moore--Penrose inverse, with the usual lower-semicontinuous perspective convention: the action is finite only when the columns of each $P_{t,\ell}$ belong to the range of $A_t$ . The matrix fractional map $(A,P)\mapsto\operatorname{tr}(P^\top A^\dagger P)$ is jointly convex on this effective domain and positively one-homogeneous in $(A,P)$ . This gives the simplest non-trivial matrix-valued transport model: spatial motion is conservative, but the fiber carries orientation through the eigenvectors of $A_t(x)$ .

Proposition: Diagonal Matrix Subproblem

Assume that the endpoints are diagonal in a fixed orthonormal basis,

\mathcal A_i=\operatorname{diag}(\al_i^1,\ldots,\al_i^m), \qquad i=0,1,

(12)

and that $\al_0^k(\X)=\al_1^k(\X)=m_k$ for every $k$ . If one restricts the admissible curves in (10) to remain diagonal in that basis,

A_t=\operatorname{diag}(u_t^1,\ldots,u_t^m), \qquad P_{t,\ell}=\operatorname{diag}(V_{t,\ell}^1,\ldots,V_{t,\ell}^m),

(13)

then the restricted matrix problem has value

\sum_{k:m_k>0} m_k\, \Wass_2^2\!\left( \frac{\al_0^k}{m_k}, \frac{\al_1^k}{m_k} \right),

(14)

with zero contribution from zero-mass channels. Thus the commuting matrix submodel is exactly the diagonal positive vector-valued Benamou--Brenier model.
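For a one-dimensional base, the value (14) can be computed by sorting, since each normalized channel is then transported by its quantile map. The sketch below assumes, purely for illustration, that every channel is given by the same number of equally weighted atoms; the helper names `w2_squared_1d` and `diagonal_value` are hypothetical.

```python
import numpy as np

def w2_squared_1d(x, y):
    """Squared W_2 between two uniform empirical measures on the line (same atom count)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes the same number of atoms in both measures")
    return float(np.mean((x - y) ** 2))

def diagonal_value(masses, sources, targets):
    """sum_{k: m_k > 0} m_k * W_2^2 of the normalized channels."""
    total = 0.0
    for m_k, x, y in zip(masses, sources, targets):
        if m_k > 0:  # zero-mass channels contribute nothing
            total += m_k * w2_squared_1d(x, y)
    return total

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
masses = [2.0, 0.5, 0.0]
sources = [x1, x2, x1]
targets = [x1 + 1.0, x2 - 2.0, x2]  # the first two channels are pure translations

value = diagonal_value(masses, sources, targets)  # 2.0 * 1^2 + 0.5 * 2^2
```

Because the first two channels are translated by $1$ and $-2$, the value reduces to $2\cdot1+0.5\cdot4$, while the third channel has zero mass and is ignored.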

The restriction to a fixed diagonal basis gives eigenvalue transport; it should be read as a commuting submodel, not as a claim that non-diagonal excursions can never change the unrestricted value. The genuinely matrix-valued case starts when the eigenspaces vary with $x$ or along the interpolation, so that the transported object carries both mass and orientation. Static matrix-valued Monge--Kantorovich problems and dual test-function metrics were developed in Ning & Georgiou, 2014; Jiang et al., 2012; Ning et al., 2015; dynamic versions and related non-commutative geometries appear in Chen et al., 2016; Chen et al., 2020; Carlen & Maas, 2014; Peyré et al., 2019. The figure below shows the analogous independent/coupled contrast for positive $2\times2$ matrix fibers, using two localized matrix modes whose eigenvalue profiles share a common center at each mode.

Positive $2\times2$ matrix-valued transport on a one-dimensional base. Each endpoint is a mixture of two localized matrix modes; within one mode, both eigenvalue profiles are Gaussian bumps with the same center. Each ellipse is the glyph of a positive semidefinite matrix $A_t(x)$ , with axes given by eigenvectors and eigenvalues. Left: the matrices are diagonal in a fixed basis, giving the commuting tensor analogue of independent vector channels. Right: a coupled illustrative interpolation bends packet motion toward the trace-density transport and uses non-commuting eigendirections; the superposition remains positive semidefinite and produces spatially varying orientations.

Interactive panel. Use the coupling and rotation controls to compare matrix-valued transport of anisotropic local structure.

Example: Diagonal and coupled positive mobilities

Choose a mobility matrix $\mathsf M(u)\in\mathbb S_+^m$ , where $\mathbb S_+^m$ denotes the cone of real symmetric positive semidefinite matrices, and set

\Phi_{\mathsf M}(u,V) = \sum_{\ell=1}^d V_{\ell}^{\top}\mathsf M(u)^\dagger V_{\ell},

(16)

with the usual convention that the value is finite only when each $V_\ell$ belongs to the range of $\mathsf M(u)$ . One chooses $\mathsf M$ so that this matrix perspective is convex and one-homogeneous in $(u,V)$ ; this holds for the linear positive mobilities below. For $m=1$ and $\mathsf M(u)=u$ , one recovers exactly the scalar Benamou--Brenier action. For

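\mathsf M_{\mathrm{diag}}(u)=\diag(u_1,\ldots,u_m),

(17)

the channels move independently. With $q=m^{-1/2}(1,\ldots,1)$ and $\kappa\geq0$, the coupled mobility is

\mathsf M_\kappa(u)=\diag(u)+\kappa\Big(\sum_{k=1}^m u_k\Big) q q^\top

(18)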

Wasserstein Over Wasserstein

The construction can be iterated. Once $(\X,d)$ is a metric space, the set of probability measures on $\X$ becomes a metric space through $\Wass_p$ . It can therefore serve as a new ground space. This is useful whenever the objects to compare are themselves random probability measures, or mixtures whose components are meaningful objects rather than only a collapsed density.

The standard setting is that of Polish spaces, introduced in Definition: Polish Metric Space. These assumptions provide separability, completeness, tightness criteria and regular conditional probabilities. The next proposition shows that Wasserstein spaces preserve this well-behaved structure.

Fix $1\leq p<\infty$ . Elements of $\Pp_p(\Pp_p(\X))$ are probability laws over probability measures, or random probability measures. A measurable parametric family gives

\mathfrak A=(\zeta\mapsto\alpha_\zeta)_\sharp\gamma.

(19)

This law belongs to $\Pp_p(\Pp_p(\X))$ precisely when, for one and hence every $x_0\in\X$ ,

\int \Wass_p(\alpha_\zeta,\delta_{x_0})^p\,\d\gamma(\zeta)<\infty.

(20)

If $\mathfrak A\in\Pp_p(\Pp_p(\X))$, then its collapsed (mean) measure $\bar\alpha_{\mathfrak A}\eqdef\int_{\Pp_p(\X)}\alpha\,\d\mathfrak A(\alpha)$ belongs to $\Pp_p(\X)$ because

\int_\X d(x,x_0)^p\,\d\bar\alpha_{\mathfrak A}(x) = \int_{\Pp_p(\X)}\Wass_p(\alpha,\delta_{x_0})^p\,\d\mathfrak A(\alpha).

(22)

The Wasserstein distance on the Wasserstein space is

\mathbb W_p^p(\mathfrak A,\mathfrak B) \eqdef \inf_{\Pi\in\Couplings(\mathfrak A,\mathfrak B)} \int_{\Pp_p(\X)\times\Pp_p(\X)} \Wass_p^p(\alpha,\beta)\d\Pi(\alpha,\beta), \qquad \mathfrak A,\mathfrak B\in\Pp_p(\Pp_p(\X)).

(23)

For $p=2$ , Gaussian mixtures provide an explicit example with two geometries. A mixture can be viewed as a collapsed density on $\X$ , or as a component law over Gaussian atoms in the Bures--Wasserstein space. For two component laws

\mathfrak A=\sum_i a_i\delta_{\Gaussian(m_i,\Sigma_i)}, \qquad \mathfrak B=\sum_j b_j\delta_{\Gaussian(n_j,\Lambda_j)},

(24)

the component-level problem uses the cost

C_{ij}=\norm{m_i-n_j}^2+\Bb(\Sigma_i,\Lambda_j)^2.

(25)

If $\P^\star$ is an optimal coupling between the weights $a$ and $b$ , and if $A_{ij}$ is the Brenier linear part from $\Sigma_i$ to $\Lambda_j$ , each active pair follows the Gaussian geodesic

m_{ij,t}=(1-t)m_i+t n_j, \qquad \Sigma_{ij,t} = \big((1-t)\Id+tA_{ij}\big)\Sigma_i \big((1-t)\Id+tA_{ij}\big)^\top.

(26)

Collapsing these component geodesics gives

\bar\alpha_t= \sum_{i,j}\P^\star_{ij}\Gaussian(m_{ij,t},\Sigma_{ij,t}).

(27)

This component-level interpolation generally differs from the true $\Wass_2$ interpolation between the collapsed mixture densities.
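The ingredients of the component-level construction, namely the cost (25) and the Gaussian geodesic (26), only require symmetric matrix square roots. The following NumPy sketch is one possible way to compute them; the function names are illustrative, the source covariance is assumed invertible, and the optimal coupling $\P^\star$ between the weights is assumed to come from a separate OT solver that is not shown.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix."""
    w, Q = np.linalg.eigh((S + S.T) / 2.0)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def bures_sq(S, L):
    """Squared Bures distance: tr S + tr L - 2 tr (S^{1/2} L S^{1/2})^{1/2}."""
    r = psd_sqrt(S)
    return float(np.trace(S) + np.trace(L) - 2.0 * np.trace(psd_sqrt(r @ L @ r)))

def brenier_linear_part(S, L):
    """A = S^{-1/2} (S^{1/2} L S^{1/2})^{1/2} S^{-1/2}; assumes S is invertible."""
    r = psd_sqrt(S)
    r_inv = np.linalg.inv(r)
    return r_inv @ psd_sqrt(r @ L @ r) @ r_inv

def component_cost(means_a, covs_a, means_b, covs_b):
    """C[i, j] = ||m_i - n_j||^2 + Bures(Sigma_i, Lambda_j)^2."""
    return np.array([[np.sum((m - n) ** 2) + bures_sq(S, L)
                      for n, L in zip(means_b, covs_b)]
                     for m, S in zip(means_a, covs_a)])

def gaussian_geodesic(m, S, n, L, t):
    """Mean and covariance at time t of the geodesic from N(m, S) to N(n, L)."""
    T = (1.0 - t) * np.eye(len(m)) + t * brenier_linear_part(S, L)
    return (1.0 - t) * m + t * n, T @ S @ T.T

m, S = np.zeros(2), np.array([[2.0, 0.5], [0.5, 1.0]])
n, L = np.array([1.0, 2.0]), np.array([[1.0, -0.2], [-0.2, 3.0]])
C = component_cost([m], [S], [n], [L])
mean_end, cov_end = gaussian_geodesic(m, S, n, L, 1.0)  # endpoint should be N(n, L)
```

Given a coupling of the weights, the collapsed curve (27) is then the mixture of these component geodesics with weights $\P^\star_{ij}$.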

The figure below contrasts this component-level geodesic with the true one-dimensional Wasserstein interpolation of the collapsed mixtures, and makes visible the internal mass splitting that is absent from the former.

Two interpolations between the same asymmetric three-component one-dimensional Gaussian mixtures. The red endpoint has a broad central component carrying most of the mass, while the blue endpoint has two dominant sharp side modes. Left: Gaussian components are transported as atoms using their Bures--Wasserstein distance. Right: the collapsed densities are interpolated by the true one-dimensional quantile formula for $\Wass_2$ . The central mass is split and recombined in the collapsed geometry, making the two paths visibly distinct.

The interactive comparison keeps both geometries side by side: component-level transport moves Gaussian atoms, while collapsed transport rearranges the full density.

Interactive panel. Use the mixture and blur controls to compare transport between ordinary measures with transport between distributions of measures.

This viewpoint also clarifies lower bounds for Gromov--Wasserstein distances: a metric-measure space can be mapped to a law of local distance profiles, and these laws can be compared by Wasserstein-over-Wasserstein.

Gromov--Wasserstein

Gromov--Wasserstein compares spaces through their internal distance structures rather than through a fixed ambient ground cost. This is the right extension for graphs, shapes and point clouds whose points are not pre-aligned.

Discrete Formulation

Optimal transport needs a ground cost $\C$ to compare histograms $(a,b)$ , and thus cannot be used directly if the histograms are not defined on the same underlying space, or if one cannot pre-register these spaces to define a ground cost. Instead, assume that two matrices $D\in\RR^{n\times n}$ and $D'\in\RR^{m\times m}$ represent relationships between points. A typical scenario is when these matrices are powers of distance matrices. Define the quadratic distortion and its minimum by

\begin{aligned} \mathcal E_{D,D'}(\P) &\eqdef \sum_{i,j,i',j'} \Delta(D_{i,i'},D'_{j,j'})^p\,\P_{i,j}\P_{i',j'}, \\ \operatorname{GW}((a,D),(b,D'))^p &\eqdef \min_{\P\in\mathbf U(a,b)}\mathcal E_{D,D'}(\P), \end{aligned}

(30)

where $p\geq1$ and $\Delta$ is usually $\Delta(u,v)=|u-v|$ . This is a non-convex quadratic problem over the transport polytope. In the uniform case with $m=n$ and $\P$ constrained to be a permutation matrix, it becomes a Quadratic Assignment Problem, already NP-hard in full generality Loiola et al., 2007. The relaxed coupling formulation can therefore be read as a soft graph-matching model Lyzinski et al., 2016.
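The distortion $\mathcal E_{D,D'}(\P)$ in (30) can be evaluated directly for small problems. The sketch below uses a dense four-index tensor, which costs $O(n^2m^2)$ memory and is meant only as an illustration; the name `gw_energy` and the test point clouds are assumptions made for the example.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix of a point cloud X with shape (n, d)."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def gw_energy(D, Dp, P, p=2):
    """E(P) = sum_{i,j,i',j'} |D[i,i'] - Dp[j,j']|^p P[i,j] P[i',j']."""
    # diff[i, j, k, l] = |D[i, k] - Dp[j, l]|^p
    diff = np.abs(D[:, None, :, None] - Dp[None, :, None, :]) ** p
    return float(np.einsum("ijkl,ij,kl->", diff, P, P))

rng = np.random.default_rng(0)
n = 6
X = rng.normal(size=(n, 2))
perm = rng.permutation(n)
c, s = np.cos(0.7), np.sin(0.7)
Y = X[perm] @ np.array([[c, -s], [s, c]]).T + 3.0  # relabeled rigid copy of X

D, Dp = pairwise_dist(X), pairwise_dist(Y)
P_match = np.zeros((n, n))
P_match[perm, np.arange(n)] = 1.0 / n   # x_{perm[j]} is matched to y_j
P_indep = np.full((n, n), 1.0 / n**2)   # independent coupling a (x) b

e_match = gw_energy(D, Dp, P_match)     # vanishes up to rounding for an isometric copy
e_indep = gw_energy(D, Dp, P_indep)     # strictly positive here
```

The isometric relabeling is invisible to the objective, whereas the independent coupling pays for mixing unrelated distances.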

The figure below shows this intrinsic matching principle under progressively stronger deformations: correspondences are selected from within-space distance patterns rather than an ambient cross-space cost.

Gromov--Wasserstein correspondences under increasing deformation. The red and blue point clouds are not compared through an ambient Euclidean cross-cost; instead, the GW coupling compares their internal pairwise distances. A perfectly isometric copy admits a clean structural match, while mild and deliberately stronger deformations progressively bend the correspondence.

The interactive demo uses a fixed structural correspondence and lets the deformation change the pairwise-distance residual. This isolates the quantity minimized by the GW objective.

Interactive panel. Use the deformation and point controls to inspect correspondences when only within-space distances are meaningful.

When $D,D'$ are genuine distance matrices, the construction below defines a distance between metric spaces equipped with a probability distribution, up to measure-preserving isometries Mémoli, 2011; Sturm, 2012; Schmitzer & Schnörr, 2013. The same construction also explains why GW satisfies the triangle inequality after quotienting by isometries, and its relation to Hausdorff and Gromov--Hausdorff distances is discussed at the end of the section.

General Setting

The continuous formulation abstracts the discrete distance matrices into metric-measure spaces, so that GW compares intrinsic geometries independently of labels, parametrizations or ambient coordinates.

Compactness is often assumed below to avoid additional tightness and integrability arguments.

For metric-measure spaces $\mathbb X=(\X,d_\X,\alpha)$ and $\mathbb Y=(\Y,d_\Y,\beta)$ , define

\operatorname{GW}(\mathbb X,\mathbb Y)^p \eqdef \min_{\pi\in\Couplings(\alpha,\beta)} \int_{\X^2\times\Y^2} \Delta(d_\X(x,x'),d_\Y(y,y'))^p \d\pi(x,y)\d\pi(x',y').

(33)

Proposition: Marginal Stability and Empirical GW Rates

Let $\mathbb X=(\X,d_\X,\alpha)$ and $\mathbb Y=(\Y,d_\Y,\beta)$ be compact metric-measure spaces. Replacing the marginals by $\widetilde\alpha$ and $\widetilde\beta$ gives

\left| \operatorname{GW}((\X,d_\X,\widetilde\alpha),(\Y,d_\Y,\widetilde\beta)) - \operatorname{GW}(\mathbb X,\mathbb Y) \right| \leq 2\Wass_p^\X(\widetilde\alpha,\alpha) + 2\Wass_p^\Y(\widetilde\beta,\beta).

(39)

Consequently, for empirical measures $\widehat\alpha_n$ and $\widehat\beta_m$ ,

\mathbb E\left| \operatorname{GW}((\X,d_\X,\widehat\alpha_n),(\Y,d_\Y,\widehat\beta_m)) - \operatorname{GW}(\mathbb X,\mathbb Y) \right| \leq 2\mathbb E\Wass_p^\X(\widehat\alpha_n,\alpha) + 2\mathbb E\Wass_p^\Y(\widehat\beta_m,\beta).

(40)

The metric structure also gives geodesics. Sturm’s construction allows one to speak about interpolation, barycenters and gradient flows directly on the space of metric-measure spaces, even though the intermediate space lives on a product support and is therefore expensive numerically Sturm, 2012.

Proposition: Gromov--Wasserstein Geodesics

Let $\mathbb X_0=(\X_0,d_{\X_0},\alpha_0)$ and $\mathbb X_1=(\X_1,d_{\X_1},\alpha_1)$ be compact metric-measure spaces, and let $\pi^\star$ be an optimal coupling. Define, on $\mathcal Z=\X_0\times\X_1$ ,

d_t((x_0,x_1),(x'_0,x'_1)) \eqdef (1-t)d_{\X_0}(x_0,x'_0) + t d_{\X_1}(x_1,x'_1), \qquad \mathbb X_t=(\mathcal Z,d_t,\pi^\star).

(41)

For $0<t<1$ , $d_t$ is a metric. At the endpoints one quotients the product space by the corresponding zero-distance relation. Then $t\mapsto\mathbb X_t$ is a constant-speed geodesic:

\operatorname{GW}(\mathbb X_s,\mathbb X_t) = |t-s|\operatorname{GW}(\mathbb X_0,\mathbb X_1).

(42)

The figure below complements the global GW objective with local diagnostics, displaying where a mildly non-isometric correspondence creates the largest pairwise-distance residuals.

Local distortion in a mildly non-isometric GW match. The left panel colors transport segments by the average residual induced by the displayed hard correspondence. The right panel shows the pairwise-distance residual matrix $|d_\X(x_i,x_{i'})-d_\Y(y_{\sigma(i)},y_{\sigma(i')})|$ , with darker entries marking larger local distortion. This matrix is the local contribution minimized by the discrete GW objective for the displayed correspondence.

Interactive panel. Use the deformation and shift controls to see where a Gromov-Wasserstein correspondence preserves or distorts pairwise distances.

Proposition: Mémoli Profile Lower Bound

Let $\mathbb X=(\X,d_\X,\alpha)$ and $\mathbb Y=(\Y,d_\Y,\beta)$ be compact metric-measure spaces. For each $x\in\X$ and $y\in\Y$ , define the distance-profile measures on $\RR_+$ by

\alpha_x\eqdef(d_\X(x,\cdot))_\sharp\alpha, \qquad \beta_y\eqdef(d_\Y(y,\cdot))_\sharp\beta.

(44)

Let $\mathfrak D_\mathbb X=(x\mapsto\alpha_x)_\sharp\alpha$ and $\mathfrak D_\mathbb Y=(y\mapsto\beta_y)_\sharp\beta$ . Then

\mathbb W_p(\mathfrak D_\mathbb X,\mathfrak D_\mathbb Y) \leq \operatorname{GW}(\mathbb X,\mathbb Y),

(45)

where $\mathbb W_p$ is the Wasserstein-over-Wasserstein distance (23), whose ground metric is the one-dimensional $\Wass_p$ distance between profile measures.

The next figure exposes the two nested transport problems for the planar shapes $\mathbb X$ and $\mathbb Y$ : sorting each distance profile computes the one-dimensional costs, then an outer assignment couples the resulting profile laws.

Mémoli distance profiles expose a computable lower bound for intrinsic GW comparison. The planar shapes $\mathbb X$ and $\mathbb Y$ are represented by cat and bunny silhouettes, centered and normalized to unit diameter. Matching colors identify representative anchor pairs selected by the optimal profile assignment with $C_{ij}=\Wass_2^2(\alpha_{x_i},\beta_{y_j})$ , their connecting segments and their two side histograms. The histograms are display summaries only: every profile cost is computed by sorting the complete distance profiles. The resulting outer assignment realizes the Mémoli profile lower bound.

This lower bound is useful computationally because the profile cost matrix $\C_{ij}=\Wass_p(\alpha_{x_i},\beta_{y_j})^p$ is an ordinary OT cost between points. Solving this easier OT problem gives a geometry-aware initialization for the non-convex GW iterations.
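The two nested problems behind (45) are easy to prototype. The sketch below assumes uniform weights and equally many points in both spaces, so that each inner one-dimensional $\Wass_p^p$ is obtained by sorting and the outer OT problem is attained at a permutation; for illustration the outer problem is solved by brute force, which is only viable for very small $n$. All function names are hypothetical.

```python
import itertools
import numpy as np

def pairwise_dist(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def profile_cost(D, Dp, p=2):
    """C[i, j] = W_p^p(alpha_{x_i}, beta_{y_j}) via sorted distance profiles."""
    S, Sp = np.sort(D, axis=1), np.sort(Dp, axis=1)
    return np.mean(np.abs(S[:, None, :] - Sp[None, :, :]) ** p, axis=-1)

def profile_lower_bound(D, Dp, p=2):
    """Outer OT between the two profile laws (uniform weights, brute force)."""
    C = profile_cost(D, Dp, p)
    n = C.shape[0]
    best = min(sum(C[i, s[i]] for i in range(n))
               for s in itertools.permutations(range(n)))
    return (best / n) ** (1.0 / p)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
Y = X + 0.3 * rng.normal(size=(6, 2))          # mildly deformed copy
D, Dp = pairwise_dist(X), pairwise_dist(Y)

lb = profile_lower_bound(D, Dp)
# Any coupling gives an upper bound on GW; here the identity matching x_i -> y_i.
upper = np.mean(np.abs(D - Dp) ** 2) ** 0.5

c, s = np.cos(0.9), np.sin(0.9)
Y_iso = X[rng.permutation(6)] @ np.array([[c, -s], [s, c]]).T
lb_iso = profile_lower_bound(D, pairwise_dist(Y_iso))  # isometric copy: bound is ~0
```

For larger point clouds one would replace the brute-force search by an assignment or OT solver applied to the same cost matrix.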

Relation With Wasserstein-Procrustes

The profile lower bound is intrinsic. In Euclidean applications, it is naturally paired with an extrinsic upper certificate obtained by registering the two measures before applying the ordinary Wasserstein distance. The proposition on the Wasserstein-Procrustes upper certificate, proved in the section Quotient Wasserstein and Wasserstein-Procrustes, supplies exactly this certificate. The converse need not hold, because a small GW value may be achieved by an intrinsic correspondence that is not induced by any ambient rigid motion. Combining the profile lower bound with the Procrustes upper certificate gives the sandwich

\mathbb W_p(\mathfrak D_\mathbb X,\mathfrak D_\mathbb Y) \leq \operatorname{GW}(\mathbb X,\mathbb Y) \leq 2\,\Wass_{p,\mathrm E(d)}([\alpha],[\beta]),

(47)

where $\mathbb X=(\RR^d,\norm{\cdot},\alpha)$ and $\mathbb Y=(\RR^d,\norm{\cdot},\beta)$ . The left term is intrinsic and inexpensive; the right term is an ambient rigid-registration certificate.

Entropic Regularization and Fused GW

For the common squared distortion $\Delta(u,v)^2=(u-v)^2$ , one often seeks a stationary point of the entropic relaxation

\min_{P\in\mathbf U(a,b)} \mathcal E_{D,D'}(P)-\epsilon H(P).

(48)

For symmetric distance matrices, define the half-gradient

\C(P) \eqdef D^{\odot2}a\,\mathbf 1_m^\top + \mathbf 1_n(D'^{\odot2}b)^\top - 2D\,P\,D'^\top, \qquad \nabla\mathcal E_{D,D'}(P)=2\C(P).

(49)

A standard fixed-point linearization Peyré et al., 2016 computes

P^{(\ell+1)} = \operatorname*{argmin}_{P\in\mathbf U(a,b)} \langle P,\C(P^{(\ell)})\rangle -\frac{\epsilon}{2}H(P).

(50)

The factor $\epsilon/2$ is essential because $\C(P)$ is one half of the quadratic gradient. Each update is an ordinary entropic OT problem and can therefore be solved with Sinkhorn iterations. If the iterates converge to a positive fixed point, it satisfies the stationarity conditions of the regularized GW objective. The basic fixed-point iteration is not a descent method in general and has no global guarantee for this non-convex problem; line searches or proximal variants are needed when monotone decrease is required.

Fused Gromov--Wasserstein augments the structural term with a feature transport cost Vayer et al., 2019. In the discrete case, given a cross-feature cost $M\in\RR^{n\times m}$ and a parameter $\lambda\in[0,1]$ , one minimizes

\operatorname{FGW}_{\lambda,p}((a,D),(b,D'))^p \eqdef \min_{P\in\mathbf U(a,b)} (1-\lambda)\sum_{i,j}M_{ij}P_{ij} + \lambda \sum_{i,j,i',j'} \Delta(D_{ii'},D'_{jj'})^pP_{ij}P_{i'j'}.

(51)

The endpoints $\lambda=0$ and $\lambda=1$ recover feature-only OT and pure GW respectively; intermediate values trade attribute matching against structural matching. The first term compares node attributes in the usual OT sense, and the second compares intrinsic geometry; this is useful when two spaces have both distances and features, and the two sources of information may disagree.
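The tradeoff in (51) can be made concrete on a toy pair of four-node path graphs whose binary features disagree with the path ordering. This is a hedged sketch: the function name `fgw_objective`, the squared feature cost and the two hand-built couplings are assumptions chosen for illustration, and no minimization over couplings is performed.

```python
import numpy as np

def fgw_objective(M, D, Dp, P, lam, p=2):
    """(1 - lam) * <M, P> + lam * sum |D[i,i'] - Dp[j,j']|^p P[i,j] P[i',j']."""
    feature = float(np.sum(M * P))
    diff = np.abs(D[:, None, :, None] - Dp[None, :, None, :]) ** p
    structure = float(np.einsum("ijkl,ij,kl->", diff, P, P))
    return (1.0 - lam) * feature + lam * structure

idx = np.arange(4)
D = np.abs(idx[:, None] - idx[None, :]).astype(float)  # path-graph distances
Dp = D.copy()
x_feat = np.array([0.0, 0.0, 1.0, 1.0])
y_feat = np.array([0.0, 1.0, 0.0, 1.0])
M = (x_feat[:, None] - y_feat[None, :]) ** 2           # cross-feature cost

P_struct = np.eye(4) / 4.0          # keeps the path ordering, mismatches two features
P_feat = np.zeros((4, 4))
P_feat[[0, 1, 2, 3], [0, 2, 1, 3]] = 0.25              # matches features, breaks ordering

feature_only = (fgw_objective(M, D, Dp, P_struct, 0.0), fgw_objective(M, D, Dp, P_feat, 0.0))
structure_only = (fgw_objective(M, D, Dp, P_struct, 1.0), fgw_objective(M, D, Dp, P_feat, 1.0))
```

With $\lambda=0$ the feature-matching coupling is preferred, with $\lambda=1$ the order-preserving one is, and for these two candidates the costs coincide at $\lambda=1/2$.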

The figure below isolates this tradeoff on a small graph pair by comparing feature-only, structure-only and fused correspondences.

Feature information and intrinsic geometry in fused Gromov--Wasserstein. Small inner disks encode binary node features. Feature-only OT follows the attributes even when this crosses the shape structure, pure GW follows the intrinsic ordering, and fused GW balances the feature term with the pairwise-distance distortion.

Interactive panel. Use the geometry-weight and feature-conflict controls to balance structural matching against feature agreement.

Hausdorff and Gromov--Hausdorff Viewpoints

If $A,B$ are compact subsets of a common metric space $(\mathcal Z,d_\mathcal Z)$ , their Hausdorff distance is

d_{\mathrm H}^{\mathcal Z}(A,B) = \max\left\{ \sup_{a\in A}\inf_{b\in B}d_\mathcal Z(a,b), \sup_{b\in B}\inf_{a\in A}d_\mathcal Z(a,b) \right\}.

(52)

The Gromov--Hausdorff distance removes the common ambient space by minimizing this quantity over all isometric embeddings into a third space:

d_{\mathrm{GH}}(\X,\Y) = \inf_{\mathcal Z,\phi,\psi} d_{\mathrm H}^{\mathcal Z}(\phi(\X),\psi(\Y)).

(53)

Equivalently, it is half the minimal distortion of a correspondence between $\X$ and $\Y$ Gromov, 2001; Mémoli, 2007. This is a worst-case set distance: every point must be matched with small distortion. Gromov--Wasserstein replaces correspondences by probability couplings and worst-case distortion by averaged distortion. It is therefore better adapted to noisy sampled shapes and weighted graphs, but it can ignore small sets of mass that would dominate the Hausdorff distance.
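For finite subsets of a common Euclidean space, the Hausdorff distance (52) is a max-min over the cross-distance matrix. The short sketch below, with the hypothetical helper `hausdorff`, illustrates the worst-case behavior: a single added outlier determines the value, whatever weight it would carry in a measure-based comparison.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between finite point sets A (n, d) and B (m, d)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # sup_a inf_b d(a, b) and sup_b inf_a d(a, b)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B_clean = A.copy()
B_outlier = np.vstack([A, [[5.0, 0.0]]])   # one extra far-away point

h_clean = hausdorff(A, B_clean)      # identical sets
h_outlier = hausdorff(A, B_outlier)  # driven entirely by the outlier
```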

Algorithm: Entropic Gromov--Wasserstein linearization

Input: Symmetric metric matrices $\distD,\distD'$ , weights $\a,\b$ , regularization $\epsilon>0$ , tolerance $\mathrm{tol}$ , maximum iterations $L$ .

Output: Approximate entropic GW coupling $\P\in\CouplingsD(\a,\b)$ .

Initialize: Set $\P^{(0)}=\a\otimes\b$ .

For $k=0,\ldots,L-1$ do:

Compute the linearized cost: $\C^{(k)} = \distD^{\odot2}\a\,\ones_m^\top + \ones_n(\distD'^{\odot2}\b)^\top - 2\distD\,\P^{(k)}\,\transp{\distD'}.$
Solve the entropic OT subproblem: $\P^{(k+1)} = \uargmin{\P\in\CouplingsD(\a,\b)} \dotp{\P}{\C^{(k)}}-(\epsilon/2)\HD(\P).$
If $\norm{\P^{(k+1)}-\P^{(k)}}_{\mathrm F}\leq\mathrm{tol}$ then return $\P^{(k+1)}$.

Return $\P^{(L)}$.
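The algorithm above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not a reference one: the inner entropic OT problem is solved by a fixed number of log-domain Sinkhorn iterations, and the function names, iteration counts and test data are all choices made for the example. As noted earlier, the fixed-point iteration carries no global convergence guarantee.

```python
import numpy as np

def sinkhorn_log(C, a, b, reg, n_iter=300):
    """Entropic OT, min <P, C> - reg * H(P) over U(a, b), by log-domain Sinkhorn."""
    f, g = np.zeros_like(a), np.zeros_like(b)
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iter):
        f = reg * (log_a - np.logaddexp.reduce((g[None, :] - C) / reg, axis=1))
        g = reg * (log_b - np.logaddexp.reduce((f[:, None] - C) / reg, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / reg)

def entropic_gw(D, Dp, a, b, eps, tol=1e-9, max_iter=50):
    """Fixed-point linearization for squared-distortion entropic GW."""
    P = np.outer(a, b)                                   # P^(0) = a (x) b
    const = (D**2 @ a)[:, None] + (Dp**2 @ b)[None, :]   # terms independent of P
    for _ in range(max_iter):
        C = const - 2.0 * D @ P @ Dp.T                   # half-gradient cost C(P)
        P_new = sinkhorn_log(C, a, b, eps / 2.0)         # note the factor eps / 2
        if np.linalg.norm(P_new - P) <= tol:
            return P_new
        P = P_new
    return P

def pairwise_dist(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
c, s = np.cos(0.7), np.sin(0.7)
Y = X[rng.permutation(8)] @ np.array([[c, -s], [s, c]]).T
D, Dp = pairwise_dist(X), pairwise_dist(Y)
D, Dp = D / D.max(), Dp / Dp.max()                       # normalize scales for eps
a = b = np.full(8, 1.0 / 8.0)

P = entropic_gw(D, Dp, a, b, eps=0.4)
```

In practice one would monitor the objective and the marginal violation, and fall back on a line search or proximal variant when the plain iteration oscillates.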

Quantum Optimal Transport

Quantum optimal transport replaces probability vectors by density matrices and scalar couplings by positive operators on a tensor product space. This is the right language when the transported objects are matrix-valued signals, covariance-like descriptors or quantum states, and it exposes a precise bridge between OT, non-commutative entropy and operator scaling Ning & Georgiou, 2014; Chen et al., 2016; Chen et al., 2020; Peyré et al., 2019; Caglioti et al., 2020; Chakrabarti et al., 2019.

Finite-Dimensional States and Couplings

A joint quantum state between $\mathbb C^n$ and $\mathbb C^m$ is a matrix $T\in\mathbb H_{nm}^+$ acting on $\mathbb C^n\otimes\mathbb C^m$ . Its marginals are the partial traces, defined by duality through

\operatorname{tr}(F\,\operatorname{Tr}_B T) = \operatorname{tr}((F\otimes I_m)T), \qquad \operatorname{tr}(G\,\operatorname{Tr}_A T) = \operatorname{tr}((I_n\otimes G)T).

(55)

for all $F\in\mathbb H_n$ and $G\in\mathbb H_m$ . Thus $\operatorname{Tr}_B(T)\in\mathbb H_n^+$ and $\operatorname{Tr}_A(T)\in\mathbb H_m^+$ play exactly the role of the two marginals of a classical coupling.

The feasible set is never empty, since $A\otimes B$ has marginals $A$ and $B$ .
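Partial traces are a reshape followed by a contraction. The sketch below assumes the Kronecker ordering of `np.kron`, in which the index of $\mathbb C^n\otimes\mathbb C^m$ is $(i,k)\mapsto im+k$; the helper names are illustrative. It checks the duality relation (55) and the fact that $A\otimes B$ has marginals $A$ and $B$ when both have unit trace.

```python
import numpy as np

def partial_trace_B(T, n, m):
    """Tr_B(T): trace out the second factor of C^n (x) C^m, returns an n x n matrix."""
    return np.einsum("ikjk->ij", T.reshape(n, m, n, m))

def partial_trace_A(T, n, m):
    """Tr_A(T): trace out the first factor, returns an m x m matrix."""
    return np.einsum("kikj->ij", T.reshape(n, m, n, m))

def random_density(k, rng):
    """Random positive semidefinite Hermitian matrix with unit trace."""
    X = rng.normal(size=(k, k)) + 1j * rng.normal(size=(k, k))
    rho = X @ X.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(0)
n, m = 2, 3
A, B = random_density(n, rng), random_density(m, rng)

T_prod = np.kron(A, B)                    # product coupling, always feasible
marg_A = partial_trace_B(T_prod, n, m)    # should equal A
marg_B = partial_trace_A(T_prod, n, m)    # should equal B

# Duality check tr(F Tr_B T) = tr((F (x) I_m) T) on a generic joint state.
T = random_density(n * m, rng)
F = rng.normal(size=(n, n))
F = (F + F.T) / 2.0
lhs = np.trace(F @ partial_trace_B(T, n, m))
rhs = np.trace(np.kron(F, np.eye(m)) @ T)
```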

The dual potentials have the usual scalar gauge freedom: replacing $(F,G)$ by $(F+tI_n,G-tI_m)$ leaves both the constraint and the value unchanged because $\operatorname{tr}(A)=\operatorname{tr}(B)=1$ .

Entropic Regularization and Bregman Iterations

For $\epsilon>0$ define

\operatorname{QOT}_C^\epsilon(A,B) = \min_{T\succeq0} \left\{ \operatorname{tr}(CT)-\epsilon H(T): \operatorname{Tr}_B(T)=A,\ \operatorname{Tr}_A(T)=B \right\}.

(61)

This is the non-commutative analogue of entropic OT: the Shannon entropy of a coupling is replaced by the trace entropy of a density matrix Peyré et al., 2019; Chakrabarti et al., 2019.

Proposition: Entropic Quantum OT Duality

Assume $A\succ0$ , $B\succ0$ and $\epsilon>0$ . Then (61) has a unique positive minimizer. Its dual is

\operatorname{QOT}_C^\epsilon(A,B) = \max_{F\in\mathbb H_n,\ G\in\mathbb H_m} \left\{ \operatorname{tr}(FA)+\operatorname{tr}(GB) - \epsilon\, \operatorname{tr} \exp\!\left( \frac{F\otimes I_m+I_n\otimes G-C}{\epsilon} \right) \right\}.

(62)

At optimality, primal and dual variables are linked by the Gibbs formula

T_e(F,G) = \exp\!\left( \frac{F\otimes I_m+I_n\otimes G-C}{\epsilon} \right).

(63)

with $\operatorname{Tr}_B(T_e)=A$ and $\operatorname{Tr}_A(T_e)=B$ .

Writing $K=\exp(-C/\epsilon)$ , the objective differs by a constant from $\epsilon$ times the quantum KL divergence

D_H(T\mid K) = \operatorname{tr}\!\left( T(\log T-\log K)-T+K \right).

(65)
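Expanding the definition makes the constant explicit. Since $\log K=-C/\epsilon$,

\epsilon\,D_H(T\mid K) = \operatorname{tr}(CT)+\epsilon\operatorname{tr}(T\log T-T)+\epsilon\operatorname{tr}(K).

Assuming the convention $H(T)=\operatorname{tr}(T\log T-T)$, which is the one consistent with the dual (62), the objective of (61) therefore equals $\epsilon D_H(T\mid K)-\epsilon\operatorname{tr}(K)$, and the last term does not depend on $T$.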

The exact quantum analogue of Sinkhorn is an implicit alternating Bregman projection scheme onto the affine marginal sets

\mathcal M_A=\{T\succeq0:\operatorname{Tr}_B(T)=A\}, \qquad \mathcal M_B=\{T\succeq0:\operatorname{Tr}_A(T)=B\}.

(66)

In the diagonal case this proposition gives the usual multiplicative Sinkhorn updates. In the non-commutative case, however, the exact block equations

\operatorname{Tr}_B T_e(F,G)=A, \qquad \operatorname{Tr}_A T_e(F,G)=B

(69)

do not admit scalar division formulas, because the exponential of $F\otimes I_m+I_n\otimes G-C$ cannot be separated unless the local potential commutes with the cost.
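For comparison, here is a minimal sketch of the commuting case, where everything is diagonal and the block equations do reduce to scalar divisions. The diagonals of $A$ and $B$ are stored as the lists `a` and `b`, the diagonal of $C$ as the table `cost`, and the coupling is $t_{ij}=u_i k_{ij} v_j$ with $u_i=e^{f_i/\epsilon}$, $v_j=e^{g_j/\epsilon}$; the function name and the test data are illustrative assumptions.

```python
import math

def sinkhorn(a, b, cost, eps, n_iter=200):
    """Classical Sinkhorn scaling: the diagonal special case of (69)."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Row update: enforce the first marginal by scalar division.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column update: enforce the second marginal by scalar division.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

a = [0.5, 0.5]
b = [0.25, 0.25, 0.5]
cost = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]

T = sinkhorn(a, b, cost, eps=0.5)
row_sums = [sum(row) for row in T]
col_sums = [sum(T[i][j] for i in range(2)) for j in range(3)]
```

Each update is a closed-form division precisely because diagonal matrices commute; no such formula is available once $F\otimes I_m+I_n\otimes G$ and $C$ fail to commute.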

Gurvits Scaling and Quantum Sinkhorn¶

The algorithm often called quantum Sinkhorn comes from the operator-scaling literature of Gurvits and subsequent developments Gurvits, 2003; Gurvits, 2004; Georgiou & Pavon, 2015; Garg & Oliveira, 2018. It replaces the true Gibbs coupling (63) by the symmetric factorization

T_s(F,G) = \exp\!\left(\frac{Z}{2\epsilon}\right) \exp(-C/\epsilon) \exp\!\left(\frac{Z}{2\epsilon}\right) = (U\otimes V)K(U\otimes V), \qquad Z=F\otimes I_m+I_n\otimes G,

(70)

where $U=\exp(F/(2\epsilon))$ , $V=\exp(G/(2\epsilon))$ and $K=\exp(-C/\epsilon)$ . If $[Z,C]=0$ , then $T_s(F,G)=T_e(F,G)$ ; otherwise this is a Strang-type symmetric surrogate.
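The gap between the two couplings can be observed on a small example. The sketch below assumes NumPy and real symmetric matrices; the helper names, the potentials and the two costs (a diagonal one, and half the projector complementary to the swap operator) are hypothetical test choices.

```python
import numpy as np

def expm_sym(M):
    """Exponential of a real symmetric matrix via its eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.exp(w)) @ Q.T

def local_sum(F, G):
    """Z = F (x) I_m + I_n (x) G."""
    return np.kron(F, np.eye(len(G))) + np.kron(np.eye(len(F)), G)

def gibbs_exact(F, G, C, eps):
    """T_e = exp((Z - C) / eps)."""
    return expm_sym((local_sum(F, G) - C) / eps)

def gibbs_split(F, G, C, eps):
    """T_s = exp(Z / (2 eps)) exp(-C / eps) exp(Z / (2 eps))."""
    E = expm_sym(local_sum(F, G) / (2 * eps))
    return E @ expm_sym(-C / eps) @ E

F = np.diag([0.3, -0.1])
G = np.diag([0.2, 0.5])
eps = 1.0

C_diag = np.diag([0.0, 1.0, 1.0, 0.0])    # diagonal, hence commutes with Z
swap = np.eye(4)[[0, 2, 1, 3]]            # exchanges the two tensor factors
C_swap = 0.5 * (np.eye(4) - swap)         # does not commute with Z

gap_commuting = np.abs(gibbs_exact(F, G, C_diag, eps)
                       - gibbs_split(F, G, C_diag, eps)).max()
gap_general = np.abs(gibbs_exact(F, G, C_swap, eps)
                     - gibbs_split(F, G, C_swap, eps)).max()
```

In the commuting case the gap is at the level of rounding errors, while for the non-commuting cost it is visibly nonzero, as expected from a splitting error driven by nested commutators of $Z$ and $C$.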

Fix a Choi convention and let $\mathcal K:\mathbb H_m\to\mathbb H_n$ be the completely positive map represented by the positive Choi matrix $K$ ; let $\mathcal K^\star$ be its Hilbert--Schmidt adjoint. Up to the transpose dictated by the chosen Choi convention, the marginal equations for the symmetric coupling take the operator-scaling form

U\,\mathcal K(V^2)\,U=A, \qquad V\,\mathcal K^\star(U^2)\,V=B,

(71)

and can be enforced by the congruence normalizations

\begin{aligned} R_V&=\mathcal K(V^2), & U&\leftarrow R_V^{-1/2} \left(R_V^{1/2} A R_V^{1/2}\right)^{1/2} R_V^{-1/2}, \\ S_U&=\mathcal K^\star(U^2), & V&\leftarrow S_U^{-1/2} \left(S_U^{1/2} B S_U^{1/2}\right)^{1/2} S_U^{-1/2}. \end{aligned}

(72)

These inverse square roots are well-defined when $K\succ0$ and $U,V,A,B\succ0$. Under the standard strict-positivity and scalability hypotheses, the alternating normalizations converge to the prescribed marginals Georgiou & Pavon, 2015; Garg & Oliveira, 2018. At finite tolerance they return an approximate coupling. When all matrices are diagonal, the updates reduce to classical Sinkhorn scaling; when the targets are proportional to identities, they match the usual bistochastic operator-scaling normalization up to trace convention.

Remark: Gurvits scaling is not the exact Bregman scheme

It is important not to identify (72) with the exact Bregman scheme for (61). The exact Bregman step would enforce the marginals of $T_e(F,G)=\exp((Z-C)/\epsilon)$ and would be a block maximization of the true concave dual (62). Gurvits scaling instead enforces the marginals of the surrogate

T_s= \exp\!\left(\frac{Z}{2\epsilon}\right) \exp(-C/\epsilon) \exp\!\left(\frac{Z}{2\epsilon}\right).

(73)

The two coincide in the commuting/diagonal regime, but in general the Baker--Campbell--Hausdorff commutator terms do not vanish. The Gurvits iteration should therefore be understood as a tractable symmetric operator-scaling approximation to entropic quantum OT, not as the literal alternating KL projection algorithm.

Remark: Operator-valued couplings

The same definitions extend formally from matrices to separable Hilbert spaces by replacing density matrices with positive trace-class operators of trace one, observables with bounded self-adjoint operators and (55) with partial traces defined by duality against local bounded observables. If $\Pi(A,B)$ denotes positive trace-class operators with partial traces $A$ and $B$, a bounded cost observable $C$ gives the problem $\inf_{T\in\Pi(A,B)}\operatorname{tr}(CT)$. For unbounded positive costs one must define the energy through the quadratic form or spectral truncations, and in the entropic case one must ensure that the Gibbs operator $\exp(-C/\epsilon)$ is trace class and that the partial traces of the candidate coupling are well-defined. The matrix formulas above are therefore the clean finite-dimensional core; the operator version adds domain and compactness assumptions rather than a different algebraic structure.

Algorithm: Exact quantum Bregman projections

Input: Positive definite density matrices $A,B$ , cost $C$ , regularization $\epsilon>0$ , tolerance $\mathrm{tol}$ , maximum iterations $L$ .

Output: Approximate quantum entropic coupling $T$ and its partial-trace residual.

Initialize: Set Hermitian potentials $F^{(0)}=0$ and $G^{(0)}=0$ .

For $k=0,\ldots,L-1$ do:

$T^{(k)}= T_e(F^{(k)},G^{(k)}) = \exp\!\left( \frac{F^{(k)}\otimes\Id_m+\Id_n\otimes G^{(k)}-C}{\epsilon} \right).$
Solve $A$ -projection equation: $\operatorname{Tr}_B T_e(F^+,G^{(k)})=A,$
Set $F^{(k+1)}=F^+$ .
Solve $B$ -projection equation: $\operatorname{Tr}_A T_e(F^{(k+1)},G^+)=B,$
Set $G^{(k+1)}=G^+$ .
If both partial-trace residuals are at most $\mathrm{tol}$ then:

Return $T_e(F^{(k+1)},G^{(k+1)})$ and its residual.

Return $T_e(F^{(L)},G^{(L)})$ and its residual.

Algorithm: Gurvits/operator scaling for quantum Sinkhorn

Input: Positive definite marginals $A,B$ , positive definite kernel operator $K$ , maps $\mathcal K,\mathcal K^\star$ , tolerance $\mathrm{tol}$ , maximum iterations $L$ .

Output: Symmetrically scaled coupling $T_s$ with approximate prescribed marginals.

Initialize: Set $U=\Id_n$ and $V=\Id_m$ .

Set residual $r=+\infty$ and counter $k=0$ .

While $r>\mathrm{tol}$ and $k<L$ do:

$R_V=\mathcal K(V^2), \qquad U\leftarrow R_V^{-1/2}\bigl(R_V^{1/2} A R_V^{1/2}\bigr)^{1/2}R_V^{-1/2}.$
$S_U=\mathcal K^\star(U^2), \qquad V\leftarrow S_U^{-1/2}\bigl(S_U^{1/2} B S_U^{1/2}\bigr)^{1/2}S_U^{-1/2}.$
Set $T_s=(U\otimes V)K(U\otimes V)$ and $r$ to the maximum of its two operator-marginal residuals against $A$ and $B$ .
Set $k\leftarrow k+1$ .

Return $T_s$ .
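The loop above can be sketched in a few lines of NumPy for real symmetric matrices. To avoid fixing a Choi convention, the sketch computes $R_V$ and $S_U$ directly as partial traces of $(I_n\otimes V)K(I_n\otimes V)$ and $(U\otimes I_m)K(U\otimes I_m)$, which is what the marginal equations of $T_s$ require. All function names and the test data are assumptions made for illustration, and the sketch simply runs the updates without checking the scalability hypotheses.

```python
import numpy as np

def sym_fun(M, f):
    """Apply a scalar function to a real symmetric matrix."""
    w, Q = np.linalg.eigh(M)
    return (Q * f(w)) @ Q.T

def ptrace(T, n, m, over):
    """Partial trace over 'B' (n x n result) or over 'A' (m x m result)."""
    T4 = T.reshape(n, m, n, m)
    return np.einsum("ijkj->ik", T4) if over == "B" else np.einsum("ijil->jl", T4)

def congruence_scale(R, target):
    """Positive definite solution X of X R X = target, as in (72)."""
    Rh = sym_fun(R, np.sqrt)
    Rih = sym_fun(R, lambda w: 1.0 / np.sqrt(w))
    return Rih @ sym_fun(Rh @ target @ Rh, np.sqrt) @ Rih

def gurvits_scaling(A, B, K, tol=1e-10, max_iter=500):
    n, m = len(A), len(B)
    U, V = np.eye(n), np.eye(m)
    T, r = K, np.inf
    for _ in range(max_iter):
        IV = np.kron(np.eye(n), V)
        U = congruence_scale(ptrace(IV @ K @ IV, n, m, "B"), A)   # R_V step
        UI = np.kron(U, np.eye(m))
        V = congruence_scale(ptrace(UI @ K @ UI, n, m, "A"), B)   # S_U step
        W = np.kron(U, V)
        T = W @ K @ W
        r = max(np.abs(ptrace(T, n, m, "B") - A).max(),
                np.abs(ptrace(T, n, m, "A") - B).max())
        if r <= tol:
            break
    return T, r

# Hypothetical test problem on C^2 (x) C^2 with a swap-based cost.
A = np.array([[0.6, 0.1], [0.1, 0.4]])
B = np.array([[0.5, -0.2], [-0.2, 0.5]])
C = 0.5 * (np.eye(4) - np.eye(4)[[0, 2, 1, 3]])
K = sym_fun(-C / 1.0, np.exp)          # K = exp(-C / eps) with eps = 1

T_s, residual = gurvits_scaling(A, B, K)
```

The returned matrix is the symmetric surrogate $T_s$, not the exact Gibbs coupling $T_e$; its marginals match $A$ and $B$ only up to the reported residual.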

Dynamic Time Warping¶

Dynamic time warping (DTW) compares ordered feature sequences when the same phenomenon may be observed under different clocks. It is historically rooted in speech recognition Vintsyuk, 1968; Sakoe & Chiba, 1978, and is now a standard tool for time-series alignment, retrieval and classification Berndt & Clifford, 1994; Müller, 2007. Like OT, it minimizes an aggregate feature mismatch over correspondences; unlike OT, those correspondences must respect chronology.

Ordered Alignments Versus Transport Couplings¶

For two empirical measures, Kantorovich OT minimizes a linear cost over the convex polytope of nonnegative matrices with prescribed row and column sums. DTW instead minimizes over the finite, non-convex set of connected monotone paths through the pairwise cost matrix. Every time index must be visited, but it may be visited repeatedly; the row and column sums therefore record endogenous visit counts rather than prescribed masses. Normalizing a path matrix produces a coupling only for these path-dependent marginals, not for fixed input histograms. Conversely, ordinary OT between the unordered empirical feature measures forgets chronology and may match indices in a crossing order. Temporal penalties, causal constraints, and joint OT--DTW models interpolate between the two viewpoints; spatio-temporal alignment, for example, combines regularized OT for spatial comparison with soft-DTW for chronological alignment Janati et al., 2020. Here “dynamic” refers to Bellman’s dynamic programming on the index grid, not to the PDE-based dynamic formulations of transport.

Discrete Variational Problem¶

Let $x=(x_i)_{i=1}^n$ and $y=(y_j)_{j=1}^m$ be two sequences in a feature space $\mathcal Z$ , and set $\C_{ij}=c(x_i,y_j)$ for a nonnegative cost $c:\mathcal Z\times\mathcal Z\to\RR_+$ . A warping path is a sequence

\omega=((i_\ell,j_\ell))_{\ell=1}^L

(74)

that starts at $(1,1)$ , ends at $(n,m)$ , and has increments

(i_{\ell+1}-i_\ell,j_{\ell+1}-j_\ell) \in\{(1,0),(0,1),(1,1)\}.

(75)

Denote the set of such paths by $\Omega_{n,m}$ and the incidence matrix of a path $\omega$ by $(A_\omega)_{ij}=\mathbf1_{\{(i,j)\in\omega\}}$. The path length satisfies $\max\{n,m\}\leq L\leq n+m-1$, the total mass of $A_\omega$ is $\sum_{ij}(A_\omega)_{ij}=L$, and its two marginals are precisely the row and column visit counts.
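For small grids the path set can be enumerated directly, which makes the length bounds and the visit counts concrete. The sketch below is a brute-force illustration with hypothetical helper names; it is exponential in the grid size and only meant for tiny examples.

```python
def warping_paths(n, m):
    """Yield every warping path from (1, 1) to (n, m) as a list of index pairs."""
    def extend(path):
        i, j = path[-1]
        if (i, j) == (n, m):
            yield list(path)
            return
        for di, dj in ((1, 0), (0, 1), (1, 1)):   # admissible increments (75)
            if i + di <= n and j + dj <= m:
                path.append((i + di, j + dj))
                yield from extend(path)
                path.pop()
    yield from extend([(1, 1)])

def incidence(path, n, m):
    """0/1 incidence matrix A_omega of a path (stored with 0-based indices)."""
    A = [[0] * m for _ in range(n)]
    for i, j in path:
        A[i - 1][j - 1] = 1
    return A

n, m = 3, 4
paths = list(warping_paths(n, m))
lengths = [len(p) for p in paths]
row_counts = [[sum(row) for row in incidence(p, n, m)] for p in paths]

print(len(paths), min(lengths), max(lengths))  # prints "25 4 6"
```

The number of paths is the Delannoy number of the $(n-1)\times(m-1)$ grid, which grows exponentially; this is why the minimization is carried out by dynamic programming rather than by enumeration.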

The definition is symmetric when $c$ is symmetric, but it is generally not a metric: repetitions can give zero cost to distinct sequences, and the triangle inequality can fail. Step weights, slope constraints and a Sakoe-Chiba band are common variants that penalize excessive repetition or restrict the admissible temporal distortion Sakoe & Chiba, 1978; Müller, 2007.

Dynamic Programming¶

The monotone path structure converts the exponentially large variational problem (76) into a shortest-path computation on an acyclic grid.

Algorithm: Dynamic Time Warping by Dynamic Programming

Input: Sequences $x=(x_i)_{i=1}^n$ , $y=(y_j)_{j=1}^m$ , local cost $c$ .

Output: DTW value $D_{nm}$ and an optimal warping path $\omega^\star$ .

Initialize: Set $D\in(\RR\cup\{+\infty\})^{(n+1)\times(m+1)}$ to $+\infty$ , $D_{0,0}=0$ , and allocate predecessors $B$ .

For $i=1,\ldots,n$ do:

For $j=1,\ldots,m$ do:
Set $\C_{ij}=c(x_i,y_j)$ .
Choose $(r^\star,s^\star)\in\argmin_{(r,s)\in\{(i-1,j),(i,j-1),(i-1,j-1)\}}D_{rs}$ .
Set $B_{ij}=(r^\star,s^\star)$ and $D_{ij}=\C_{ij}+D_{r^\star,s^\star}$ .

Initialize: Set $(i,j)=(n,m)$ and $\omega^\star=[\,]$ .

While $(i,j)\neq(0,0)$ do:

Prepend $(i,j)$ to $\omega^\star$ .
Set $(i,j)\leftarrow B_{ij}$ .

Return $D_{nm}$ and $\omega^\star$ .
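A direct transcription of this algorithm into Python might look as follows. It is a minimal sketch: the function name `dtw`, the tie-breaking rule (the diagonal predecessor is preferred) and the test sequences are assumptions made for illustration.

```python
import math

def dtw(x, y, cost):
    """Return the DTW value and one optimal warping path (1-based indices)."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    back = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Candidate predecessors; on ties the first minimal one is kept.
            preds = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            r, s = min(preds, key=lambda p: D[p[0]][p[1]])
            back[(i, j)] = (r, s)
            D[i][j] = cost(x[i - 1], y[j - 1]) + D[r][s]
    # Backtrack from (n, m) to the virtual origin (0, 0).
    path, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        path.append((i, j))
        i, j = back[(i, j)]
    path.reverse()
    return D[n][m], path

x = [0.0, 1.0, 2.0]
y = [0.0, 0.0, 1.0, 2.0]          # same values, first sample repeated
value, path = dtw(x, y, lambda u, v: abs(u - v))
print(value, path)  # prints "0.0 [(1, 1), (1, 2), (2, 3), (3, 4)]"
```

The example also illustrates why DTW is not a metric: the two sequences are distinct, yet a repetition of the first sample aligns them at zero cost.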

Continuous Time Warping¶

Discrete DTW depends on the sampling density because every visited cell contributes once. A direct continuous registration minimizes $\int_0^1 c(x(t),y(\gamma(t)))\d t$ over nondecreasing endpoint-fixing maps $\gamma$ , but this one-clock formulation is asymmetric and privileges the parameterization of $x$ . Continuous DTW instead traverses both clocks and measures mismatch per unit length in the parameter square Buchin et al., 2022.

For simplicity, let $x,y:[0,1]\to\mathcal Z$ use normalized clocks, and let $\Gamma_\uparrow$ contain pairs $(\phi,\psi)$ of absolutely continuous, nondecreasing surjections of $[0,1]$ onto itself. Equivalently, the endpoint conditions are $\phi(0)=\psi(0)=0$ and $\phi(1)=\psi(1)=1$ . The continuous DTW functional is

\mathrm{CDTW}_c(x,y) \eqdef \inf_{(\phi,\psi)\in\Gamma_\uparrow} \int_0^1 c\bigl(x(\phi(s)),y(\psi(s))\bigr) \bigl(\dot\phi(s)+\dot\psi(s)\bigr)\d s.

(78)

Because $\dot\phi,\dot\psi\geq0$ almost everywhere, the last factor is the $\ell^1$ line element of the monotone path $s\mapsto(\phi(s),\psi(s))$ . Formula (78) is invariant under increasing reparameterizations of the auxiliary variable $s$ and does not privilege either clock; when $c$ is symmetric, the resulting functional is also symmetric in $x$ and $y$ . After parameterization by $\ell^1$ arc length, it is simply the line integral of the feature mismatch along a monotone path. For physical clock intervals $[0,p]$ and $[0,q]$ , the same formula uses $\phi:[0,1]\to[0,p]$ and $\psi:[0,1]\to[0,q]$ . Exact computation is substantially harder than the discrete recurrence: for arc-length-parametrized one-dimensional polygonal curves and the standard cost $c(u,v)=|u-v|$ , Buchin, Nusser and Wong propagate piecewise-quadratic boundary costs in $O((n+m)^5)$ time Buchin et al., 2022. This complexity statement does not apply to an arbitrary feature cost $c$ .
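Any admissible pair $(\phi,\psi)$ gives an upper bound on (78), which is easy to evaluate numerically. The sketch below uses a midpoint rule in $s$ and replaces $(\dot\phi+\dot\psi)\,\mathrm ds$ by increments of $\phi+\psi$; the function name, the warp $\gamma(t)=t^2$ and the test signal are illustrative assumptions. It does not minimize over paths.

```python
import math

def path_cost(x, y, phi, psi, cost, n_steps=2000):
    """Approximate the integrand of (78) along a fixed monotone path."""
    total, length = 0.0, 0.0
    for k in range(n_steps):
        s0, s1 = k / n_steps, (k + 1) / n_steps
        sm = 0.5 * (s0 + s1)
        # l1 line element of the path s -> (phi(s), psi(s)) on [s0, s1].
        dl = (phi(s1) - phi(s0)) + (psi(s1) - psi(s0))
        total += cost(x(phi(sm)), y(psi(sm))) * dl
        length += dl
    return total, length

def gamma(t):                      # hypothetical increasing warp of [0, 1]
    return t * t

def x(t):
    return math.sin(2 * math.pi * t)

def y(t):                          # warped observation y = x o gamma
    return x(gamma(t))

def abs_cost(u, v):
    return abs(u - v)

def identity(s):
    return s

# Path (gamma(s), s) matches x(gamma(s)) with y(s): perfect alignment.
aligned, len_aligned = path_cost(x, y, gamma, identity, abs_cost)
# Diagonal path (s, s) ignores the warp.
diagonal, len_diagonal = path_cost(x, y, identity, identity, abs_cost)
```

Both paths have total $\ell^1$ length $2$, the length of every monotone path across the unit square. The aligned path has zero cost, so for this pair the infimum in (78) is zero, whereas the diagonal path pays for the clock mismatch.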

Soft-DTW and the Sinkhorn Analogy¶

The hard minimum in (77) is nonsmooth when several paths tie. Soft-DTW replaces it with the log-sum-exp soft minimum Cuturi & Blondel, 2017,

\operatorname{softmin}_\epsilon(r_1,\ldots,r_q) = -\epsilon\log\!\left(\sum_{k=1}^q e^{-r_k/\epsilon}\right),

(79)

and defines $D_{00}^\epsilon=0$ , $D_{i0}^\epsilon=D_{0j}^\epsilon=+\infty$ for $i,j>0$ , together with

D_{ij}^\epsilon = \C_{ij} +\operatorname{softmin}_\epsilon \bigl(D_{i-1,j}^\epsilon,D_{i,j-1}^\epsilon,D_{i-1,j-1}^\epsilon\bigr), \qquad \mathrm{sDTW}_{c,\epsilon}(x,y)=D_{nm}^\epsilon.

(80)

Equivalently, it is the free energy of all monotone paths,

\mathrm{sDTW}_{c,\epsilon}(x,y) = -\epsilon\log \sum_{\omega\in\Omega_{n,m}} \exp\!\left(-\frac{\dotp{A_\omega}{\C}}{\epsilon}\right).

(81)
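The equivalence between the recursion (80) and the free energy (81) can be checked by brute force on a small cost matrix. In this sketch the function names and the test data are assumptions, and the enumeration of paths is exponential, so it is only meant as a sanity check.

```python
import math

def softmin(values, eps):
    """Numerically stable soft minimum (79); returns +inf if all inputs are +inf."""
    lo = min(values)
    if math.isinf(lo):
        return math.inf
    return lo - eps * math.log(sum(math.exp(-(v - lo) / eps) for v in values))

def soft_dtw(C, eps):
    """Forward recursion (80) on a cost matrix C given as a list of rows."""
    n, m = len(C), len(C[0])
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = C[i - 1][j - 1] + softmin(
                (D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]), eps)
    return D[n][m]

def path_costs(C):
    """Costs <A_omega, C> of all warping paths (0-based cells)."""
    n, m = len(C), len(C[0])
    def walk(i, j, acc):
        acc += C[i][j]
        if (i, j) == (n - 1, m - 1):
            yield acc
            return
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            if i + di < n and j + dj < m:
                yield from walk(i + di, j + dj, acc)
    yield from walk(0, 0, 0.0)

x = [0.0, 1.0, 2.0]
y = [0.0, 0.0, 1.0, 2.0]
C = [[(u - v) ** 2 for v in y] for u in x]
eps = 0.5

recursive_value = soft_dtw(C, eps)
costs = list(path_costs(C))
free_energy = -eps * math.log(sum(math.exp(-c / eps) for c in costs))
hard_value = min(costs)
```

The two values agree up to rounding, and the free energy lies below the hard minimum because every path contributes a positive Gibbs weight.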

To make the regularization explicit, let $\Delta(\Omega_{n,m})$ be the simplex of probability laws $q=(q_\omega)_\omega$ over paths and let $H(q)=-\sum_\omega q_\omega\log q_\omega$ be their Shannon entropy, with $0\log0=0$ . The Gibbs variational identity gives

\mathrm{sDTW}_{c,\epsilon}(x,y) = \min_{q\in\Delta(\Omega_{n,m})} \left\{ \sum_{\omega}q_\omega\dotp{A_\omega}{\C} -\epsilon H(q) \right\}.

(82)

Its unique minimizer is the Gibbs law

\PP_\epsilon(\omega) = \frac{\exp(-\dotp{A_\omega}{\C}/\epsilon)} {\sum_{\omega'}\exp(-\dotp{A_{\omega'}}{\C}/\epsilon)}, \qquad E_\epsilon \eqdef \nabla_\C\mathrm{sDTW}_{c,\epsilon}(x,y).

(83)

Indeed, subtracting the value in (81) from the objective in (82) gives $\epsilon\KL(q|\PP_\epsilon)\geq0$ . Moreover,

\mathrm{DTW}_c(x,y)-\epsilon\log|\Omega_{n,m}| \leq \mathrm{sDTW}_{c,\epsilon}(x,y) \leq \mathrm{DTW}_c(x,y),

(84)

because the partition sum is bounded below by its largest term and above by $|\Omega_{n,m}|$ times that term. Hence $\mathrm{sDTW}_{c,\epsilon}\to\mathrm{DTW}_c$ as $\epsilon\to0$ . Forward and backward dynamic programs compute its value and gradient in $O(nm)$ time and $O(nm)$ memory Cuturi & Blondel, 2017. Differentiating the finite log-partition function gives

E_\epsilon = \EE_{\omega\sim\PP_\epsilon}[A_\omega].

(85)

Thus $(E_\epsilon)_{ij}$ is the probability that a Gibbs path visits cell $(i,j)$. The matrix $E_\epsilon$ is a diffuse expected alignment; as $\epsilon\to0$ it converges, when the hard optimum is unique, to the incidence matrix of the optimal path.
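Identity (85) can be verified numerically by comparing the Gibbs expectation, computed by enumerating paths, with a finite-difference gradient of the forward recursion. The sketch below is for illustration only: the helper names, the step `h` and the test matrix are assumptions, and an efficient implementation would use the backward dynamic program instead.

```python
import math

def soft_dtw(C, eps):
    """Forward soft-DTW recursion on a cost matrix C (list of rows)."""
    n, m = len(C), len(C[0])
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = (D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            lo = min(prev)
            D[i][j] = C[i - 1][j - 1] + lo - eps * math.log(
                sum(math.exp(-(v - lo) / eps) for v in prev))
    return D[n][m]

def all_paths(n, m):
    """Enumerate warping paths as lists of 0-based cells."""
    def walk(path):
        i, j = path[-1]
        if (i, j) == (n - 1, m - 1):
            yield list(path)
            return
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            if i + di < n and j + dj < m:
                yield from walk(path + [(i + di, j + dj)])
    yield from walk([(0, 0)])

def expected_alignment(C, eps):
    """E = expectation of A_omega under the Gibbs law, by enumeration."""
    n, m = len(C), len(C[0])
    paths = list(all_paths(n, m))
    costs = [sum(C[i][j] for i, j in p) for p in paths]
    lo = min(costs)
    weights = [math.exp(-(c - lo) / eps) for c in costs]
    Z = sum(weights)
    E = [[0.0] * m for _ in range(n)]
    for p, w in zip(paths, weights):
        for i, j in p:
            E[i][j] += w / Z
    return E

def fd_gradient(C, eps, h=1e-5):
    """Central finite differences of soft_dtw with respect to each C[i][j]."""
    n, m = len(C), len(C[0])
    G = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            Cp = [row[:] for row in C]; Cp[i][j] += h
            Cm = [row[:] for row in C]; Cm[i][j] -= h
            G[i][j] = (soft_dtw(Cp, eps) - soft_dtw(Cm, eps)) / (2 * h)
    return G

C = [[(u - v) ** 2 for v in (0.0, 0.0, 1.0, 2.0)] for u in (0.0, 1.0, 2.0)]
E = expected_alignment(C, eps=0.5)
G = fd_gradient(C, eps=0.5)
max_gap = max(abs(E[i][j] - G[i][j]) for i in range(3) for j in range(4))
```

The two corner entries of $E_\epsilon$ equal one because every path starts at $(1,1)$ and ends at $(n,m)$; the other entries are visit probabilities in $[0,1]$.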

The analogy with entropic OT is now exact at the level of free energies, but not at the level of feasible variables. Sinkhorn regularizes a coupling with prescribed marginals, whereas soft-DTW regularizes the path law $q$ in (82). Its mean $E_\epsilon$ generally has neither prescribed row sums nor prescribed column sums, and the path entropy $H(q)$ cannot in general be recovered from $E_\epsilon$ alone. Algorithmically, Sinkhorn uses alternating matrix scaling, while soft-DTW uses forward--backward dynamic programming on an acyclic grid. Global-alignment kernels sum the same Gibbs weights over all paths Cuturi et al., 2007; Cuturi, 2011.

Raw soft-DTW also has an entropic self-bias and can be negative. In direct analogy with the Sinkhorn divergence of the section Sinkhorn Divergences, define

\overline{\mathrm{sDTW}}_{c,\epsilon}(x,y) = \mathrm{sDTW}_{c,\epsilon}(x,y) -\frac12\mathrm{sDTW}_{c,\epsilon}(x,x) -\frac12\mathrm{sDTW}_{c,\epsilon}(y,y).

(86)

The debiased quantity (86) always vanishes on the diagonal $x=y$, but positivity requires care.
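A small numerical sketch makes the self-bias visible. The function names, the choice $c(u,v)=|u-v|$, the value $\epsilon=1$ and the test sequences are assumptions made for illustration.

```python
import math

def soft_dtw(x, y, cost, eps):
    """Soft-DTW value computed by the forward recursion (80)."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = (D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            lo = min(prev)
            soft = lo - eps * math.log(
                sum(math.exp(-(v - lo) / eps) for v in prev))
            D[i][j] = cost(x[i - 1], y[j - 1]) + soft
    return D[n][m]

def soft_dtw_divergence(x, y, cost, eps):
    """Debiased soft-DTW (86)."""
    return (soft_dtw(x, y, cost, eps)
            - 0.5 * soft_dtw(x, x, cost, eps)
            - 0.5 * soft_dtw(y, y, cost, eps))

def abs_cost(u, v):
    return abs(u - v)

x = [0.0, 1.0, 2.0]
y = [0.0, 0.0, 1.0, 2.0]

raw_self = soft_dtw(x, x, abs_cost, 1.0)              # negative: entropic bias
div_self = soft_dtw_divergence(x, x, abs_cost, 1.0)   # zero on the diagonal
div_xy = soft_dtw_divergence(x, y, abs_cost, 1.0)
```

The raw self-comparison is strictly negative because the zero-cost diagonal path is only one of several paths with positive Gibbs weight, while the debiased value of a sequence against itself is zero by construction.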

Remark: Validity of the Soft-DTW Divergence

Blondel, Mensch and Vert prove that (86) is nonnegative and vanishes only for identical sequences when either $c(u,v)=|u-v|$ in one dimension or

c(u,v)=\delta(u,v)+\log\!\bigl(2-e^{-\delta(u,v)}\bigr), \qquad \delta(u,v)=\tfrac12\|u-v\|^2.

(87)

For the ordinary squared Euclidean cost used in Div, they prove for equal-length sequences that the diagonal is stationary and provide numerical evidence for nonnegativity, but do not prove it in general Blondel et al., 2021. Thus debiasing alone should not be taken as a universal positivity theorem. Even for the costs covered by the result, no triangle inequality is asserted.

Hard and soft monotone alignments recover a nonlinear time warp. Left: an oscillatory signal $x$ and the warped observation $y(t)=x(\gamma(t))$ for a smooth increasing map $\gamma$; thin gray segments mark exact corresponding times. Middle: the pairwise squared feature-cost matrix $\C_{ij}=|x_i-y_j|^2$, with the optimal DTW path shown in red. Right: the same matrix overlaid with the soft-DTW expected alignment $E_\epsilon$ from (85) at $\epsilon=0.2$; red intensity gives cell-visit probability and the dark red curve is its row-wise barycentric summary.

References¶

Maas, J., Rumpf, M., Schönlieb, C., & Simon, S. (2015). A generalized model for optimal transport of images including dissipation and density modulation. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6), 1745–1769.
Maas, J., Rumpf, M., & Simon, S. (2016). Generalized optimal transport with singular sources. arXiv Preprint arXiv:1607.01186.
Dolbeault, J., Nazaret, B., & Savaré, G. (2009). A new class of transport distances between measures. Calculus of Variations and Partial Differential Equations, 34(2), 193–231.
Mielke, A. (2013). Geodesic convexity of the relative entropy in reversible Markov chains. Calculus of Variations and Partial Differential Equations, 48(1–2), 1–31.
Ning, L., & Georgiou, T. T. (2014). Metrics for matrix-valued measures via test functions. 53rd IEEE Conference on Decision and Control, 2642–2647.
Jiang, X., Ning, L., & Georgiou, T. T. (2012). Distances and Riemannian metrics for multivariate spectral densities. IEEE Transactions on Automatic Control, 57(7), 1723–1735.
Ning, L., Georgiou, T. T., & Tannenbaum, A. (2015). On matrix-valued Monge–Kantorovich optimal mass transport. IEEE Transactions on Automatic Control, 60(2), 373–382. 10.1109/TAC.2014.2350171
Chen, Y., Georgiou, T. T., & Tannenbaum, A. (2016). Matrix optimal mass transport: a quantum mechanical approach. arXiv Preprint arXiv:1610.03041.
Chen, Y., Gangbo, W., Georgiou, T. T., & Tannenbaum, A. (2020). On the matrix Monge-Kantorovich problem. European Journal of Applied Mathematics, 31(4), 574–600. 10.1017/S0956792519000172
Carlen, E. A., & Maas, J. (2014). An analog of the 2-Wasserstein metric in non-commutative probability under which the fermionic Fokker–Planck equation is gradient flow for the entropy. Communications in Mathematical Physics, 331(3), 887–926.
Peyré, G., Chizat, L., Vialard, F.-X., & Solomon, J. (2019). Quantum entropic regularization of matrix-valued optimal transport. European Journal of Applied Mathematics, 30(6), 1079–1102. 10.1017/S0956792517000274
Loiola, E. M., de Abreu, N. M. M., Boaventura-Netto, P. O., Hahn, P., & Querido, T. (2007). A survey for the quadratic assignment problem. European Journal of Operational Research, 176(2), 657–690. 10.1016/j.ejor.2005.09.032
Lyzinski, V., Fishkind, D. E., Fiori, M., Vogelstein, J. T., Priebe, C. E., & Sapiro, G. (2016). Graph matching: relax at your own risk. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1), 60–73.
Mémoli, F. (2011). Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4), 417–487.
Sturm, K.-T. (2012). The space of spaces: curvature bounds and gradient flows on the space of metric measure spaces (Preprint 1208.0434). arXiv.