Divergences and Dual Norms

This chapter compares optimal transport with divergence-based and adversarial ways of measuring discrepancy. The main stake is topological: $\phi$-divergences are cheap to evaluate but induce a strong topology, while dual norms and GAN objectives can be weak enough to compare singular measures. The discussion connects classical information divergences (Csiszár, 1967; Ali & Silvey, 1966) with modern integral probability metrics and generative modeling (Sriperumbudur et al., 2009; Goodfellow et al., 2014; Arjovsky et al., 2017).

# Notebook setup: locate the myst/ directory and the figure thumbnails.
from pathlib import Path
import sys

from IPython.display import Image as DisplayImage
from IPython.display import display

# Search the working directory and a few likely neighbors for ot4ml_web.py.
here = Path.cwd()
myst_dir = None
for candidate in [here, here.parent, here / "myst", here.parent / "myst", here.parent.parent / "myst"]:
    if (candidate / "ot4ml_web.py").exists():
        myst_dir = candidate.resolve()
        sys.path.insert(0, str(myst_dir))  # make modules in myst/ importable
        break

if myst_dir is None:
    raise RuntimeError("Could not locate myst/ot4ml_web.py")

# Figure thumbnails live at the repository root, next to myst/.
repo_root = myst_dir.parent
thumbnails = repo_root / "notebooks-figures" / "thumbnails"

def show_book_figure(name, width=760):
    """Display the thumbnail <name>.png at the given pixel width."""
    display(DisplayImage(filename=str(thumbnails / f"{name}.png"), width=width))

Dual Norms and Integral Probability Metrics

This section isolates the test-function viewpoint behind weak discrepancies. Dual norms generalize the $\Wass_1$ test-function principle and are useful in statistics because they compare distributions through a restricted discriminator class.

Integral Probability Metrics

The Kantorovich--Rubinstein formula for $\Wass_1$ is a special case of a dual norm. This viewpoint designs weak discrepancies by testing signed differences of measures against a controlled class of functions.
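For reference, the dual norm discussed in the next paragraph can be sketched as follows. This is the standard definition, written here under the assumption that $B$ is a symmetric convex set of continuous test functions; it is consistent with the integral probability metric objective (55) used later in the chapter.

```latex
\norm{\xi}_B \eqdef \sup_{f\in B} \int_\Xx f(x)\,\d\xi(x),
\qquad
\norm{\al-\be}_B = \sup_{f\in B} \left\{ \int_\Xx f\,\d\al - \int_\Xx f\,\d\be \right\}.
```

Taking $B=\enscond{f}{\Lip(f)\leq 1}$ recovers the Kantorovich--Rubinstein formula for $\Wass_1$.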

Symmetry makes the supremum equal to $\sup_{f\in B}|\int f\,\d\xi|$, while convexity makes $B$ a natural unit ball.

The choice of the test-function class $B$ determines both the topology and the statistical behavior of the discrepancy (Sriperumbudur et al., 2012; Sriperumbudur et al., 2009; Sriperumbudur et al., 2008).

Example: Flat norm and Dudley metric

If $B$ is uniformly bounded and separates measures, then $\norm{\cdot}_B$ is a finite norm on the whole space $\Mm(\Xx)$ of finite signed measures.

This is not the case for $\Wass_1$: zero total mass is necessary, and on an unbounded space a finite first moment is also required. For nonzero total mass, $\norm{\xi}_B=+\infty$ because constants belong to the Lipschitz ball.

This is remedied by imposing a bound on the value of the potential $\f$, which leads for instance to the flat norm,

B=\enscond{f}{\Lip(f) \leq 1 \qquad\text{and}\qquad \norm{\f}_\infty \leq 1}.

(4)

On compact metric spaces, it metrizes weak convergence of finite nonnegative measures, and weak-$\ast$ convergence on every total-variation-bounded family of signed measures.

The finite-dimensional version is obtained from the usual $\Wass_1$ dual linear program by adding the box constraints $\abs{\fD_k}\leq1$.

The flat norm is sometimes called the “Kantorovich--Rubinstein” norm (Hanin, 1992) and has been used as a fidelity term for inverse problems in imaging (Lellmann et al., 2014).

The flat norm is equivalent to the bounded-Lipschitz, or Dudley, metric, whose test class is

B=\enscond{f}{\Lip(f) + \norm{f}_\infty \leq 1}.

(5)

On a Euclidean domain, $\Lip(f)=\norm{\nabla f}_\infty$ for differentiable $f$.

Figure Div compares the optimal test functions selected by several integral probability metrics for the same pair of one-dimensional densities.

Dual witnesses for integral probability metrics. The red and blue curves are two one-dimensional probability densities and the violet curve is a normalized optimal dual witness $f^\star_{\alpha,\beta}$ for the IPM variational problem. $\Wass_1$ restricts the slope through Kantorovich--Rubinstein duality, MMD restricts the RKHS norm, and total variation can saturate pointwise and therefore reacts sharply to signed density differences.

The interactive demo makes the topology visible. As the two densities move, the total-variation witness jumps with the sign of the density difference, the Wasserstein witness keeps a unit-slope geometry, and the MMD witness is smoothed by the kernel bandwidth.

Interactive panel. Use the kernel, bandwidth, and separation controls to see how witness functions detect differences between measures.

The following proposition gives a compact-space criterion. The dual ball should be rich enough to approximate continuous observables, but compact enough for weak convergence to imply uniform convergence over the discriminator class.

Dual RKHS Norms and Maximum Mean Discrepancies

Kernel methods turn probability measures into mean elements of a reproducing kernel Hilbert space. The resulting Hilbertian dual seminorms are quadratic discrepancies, handled with Euclidean geometry while retaining a weak test-function interpretation.

Here “positive definite” has its standard kernel-theory meaning of positive semidefinite. Strict positivity is an additional property, and its absence can make the induced discrepancy degenerate.

The conditional version is the right notion for probability distances, because one applies the quadratic form to signed measures $\xi=\alpha-\beta$ of total mass zero. Adding $a(x)+a(y)$ to the kernel does not change $\iint K(x,y)\,\d\xi(x)\,\d\xi(y)$ on such measures, and many natural distance kernels are only conditionally positive definite.

Example: Riesz, energy and Matérn-type kernels

On $\RR^d$, translation-invariant kernels are most transparent in Fourier variables. The Riesz family associated with $(-\Delta)^{-s}$ has multiplier $\norm{\om}^{-2s}$ and defines a nonnegative quadratic form on zero-mass measures for which the low-frequency singularity is integrable; this is the kernel counterpart of classical Riesz potentials (Berg et al., 1984). The energy distance corresponds to the conditionally positive kernel $\Krkhs(x,y)=-\norm{x-y}$, whose Fourier multiplier is proportional to $\norm{\om}^{-(d+1)}$; for $\xi=\al-\be$,

-\iint \norm{x-y}\d\xi(x)\d\xi(y)

(11)

is exactly the customary squared energy distance $2\EE\norm{X-Y}-\EE\norm{X-X'}-\EE\norm{Y-Y'}$; only its Fourier representation carries a dimension-dependent constant (Schoenberg, 1938; Székely & Rizzo, 2004).

Shifted kernels replace $(-\Delta)^{-s}$ by $(-\Delta+\lambda I)^{-s}$ with $\lambda>0$. Their Fourier multiplier $(\norm{\om}^2+\lambda)^{-s}$ is bounded at the origin, hence the kernel is positive definite without imposing zero mass. These are Matérn kernels; in closed form they are radial and involve a modified Bessel function (Wendland, 2005). The Laplacian kernel $e^{-\norm{x-y}/\sigma}$ is a low-smoothness Matérn example, while the Gaussian kernel $e^{-\norm{x-y}^2/(2\sigma^2)}$ is the infinite-smoothness limit after the usual rescaling of the Matérn smoothness parameter.

These seminorms are usually called maximum mean discrepancies in statistics and machine learning (Gretton et al., 2012; Muandet et al., 2017), and kernel norms in shape analysis (Hofmann et al., 2008). For a positive-definite kernel, if $X,X'$ are independent with law $\alpha$, then $\norm{\alpha}_K^2=\EE_{X,X'}(K(X,X'))$, whenever this expression is finite. For a conditionally positive kernel, fixing $x_0\in\X$ and replacing $K$ by

\widetilde K(x,y)=K(x,y)-K(x,x_0)-K(x_0,y)+K(x_0,x_0)

(14)

produces a positive-definite kernel with the same energy on zero-mass measures.
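Both claims can be checked numerically on a small example. The sketch below is an illustration only, assuming points on the real line and the energy kernel $-|x-y|$; the helper names `quad_form`, `make_shifted` and `mean_dist` are hypothetical. It verifies that the quadratic form (11) matches the expectation form of the energy distance, and that the anchored kernel (14) gives the same value on a zero-mass measure.

```python
def quad_form(K, pts, w):
    """Quadratic form sum_{i,j} w_i w_j K(p_i, p_j) of a discrete signed measure."""
    n = len(pts)
    return sum(w[i] * w[j] * K(pts[i], pts[j]) for i in range(n) for j in range(n))

def K_energy(x, y):
    """Conditionally positive energy kernel -|x - y| on the real line."""
    return -abs(x - y)

def make_shifted(K, x0):
    """Kernel (14): K anchored at the base point x0."""
    return lambda x, y: K(x, y) - K(x, x0) - K(x0, y) + K(x0, x0)

def mean_dist(p, u, q, v):
    """E|X - Y| for independent X ~ sum_i u_i delta_{p_i}, Y ~ sum_j v_j delta_{q_j}."""
    return sum(ui * vj * abs(pi - qj) for pi, ui in zip(p, u) for qj, vj in zip(q, v))

xs, a = [0.0, 1.0, 2.5], [0.5, 0.3, 0.2]   # alpha
ys, b = [0.5, 3.0], [0.6, 0.4]             # beta

pts = xs + ys
xi = a + [-bj for bj in b]                 # signed measure alpha - beta, zero total mass

energy_quadratic = quad_form(K_energy, pts, xi)
energy_expectation = (2 * mean_dist(xs, a, ys, b)
                      - mean_dist(xs, a, xs, a)
                      - mean_dist(ys, b, ys, b))
energy_shifted = quad_form(make_shifted(K_energy, 0.7), pts, xi)

print(energy_quadratic, energy_expectation, energy_shifted)
```

The base point `0.7` is arbitrary: the correction terms in (14) depend on only one variable at a time, so they integrate to zero against any measure of total mass zero.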

Further background on reproducing kernel Hilbert spaces can be found in (Berlinet & Thomas-Agnan, 2003; Hofmann et al., 2008; Schölkopf & Smola, 2002).

The preceding discrepancies are widely used as sample-based criteria, both for testing whether two populations agree and for evaluating generative models.

Example: Two-sample testing and generative-model evaluation

Given samples $X_1,\ldots,X_n\sim\al$ and $Y_1,\ldots,Y_m\sim\be$, a two-sample test evaluates a statistic $D(\hat\al_n,\hat\be_m)$ under the null hypothesis $H_0:\al=\be$. Wasserstein distances provide geometric choices of $D$, whereas MMD and the energy distance provide kernel and negative-type alternatives (Ramdas et al., 2017; Gretton et al., 2012; Székely & Rizzo, 2004). In generative-model evaluation, the Fréchet Inception Distance embeds the data in a neural feature space, fits a Gaussian to each empirical feature distribution, and computes the squared $\Wass_2$ distance between these Gaussian laws: the squared distance between their empirical means plus the Bures covariance term of the proposition “Gaussian $\Wass_2$ Formula and Bures Covariance Term” (Heusel et al., 2017). The Kernel Inception Distance instead estimates a squared MMD, typically with a polynomial kernel, in the same feature space (Bińkowski et al., 2018). In both settings, the geometry encoded by $D$ must be distinguished from the finite-sample calibration of $D(\hat\al_n,\hat\be_m)$; Sections Sample Complexity and Bias and Variance of OT analyze the resulting rates, bias and fluctuations.

In the special case where $\alpha=\sum_{i=1}^n a_i\delta_{x_i}$ is discrete, one obtains

\norm{\alpha}_K^2 = \sum_{i,i'} a_i a_{i'}K(x_i,x_{i'}) = a^\top K_X a,

(24)

where $(K_X)_{i,i'}=K(x_i,x_{i'})$. In particular, if $\alpha=\sum_i a_i\delta_{x_i}$ and $\beta=\sum_i b_i\delta_{x_i}$ are supported on the same point cloud, then $\norm{\alpha-\beta}_K^2=(a-b)^\top K_X(a-b)$, a Euclidean seminorm on the simplex. It is nondegenerate exactly when $r^\top K_Xr>0$ for every nonzero zero-sum vector $r$. For two arbitrary discrete measures,

\norm{\alpha-\beta}_K^2 = \sum_{i,i'} a_i a_{i'}K(x_i,x_{i'}) + \sum_{j,j'} b_j b_{j'}K(y_j,y_{j'}) - 2\sum_{i,j}a_i b_j K(x_i,y_j).

(25)
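Formulas (24) and (25) translate directly into code. The sketch below is a minimal pure-Python illustration, assuming one-dimensional points and a Gaussian kernel with bandwidth `sigma`; the names `gaussian_kernel`, `cross_energy` and `mmd2` are hypothetical and not taken from the book's code.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on the real line (an illustrative choice of K)."""
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def cross_energy(xs, a, ys, b, K):
    """Bilinear form sum_{i,j} a_i b_j K(x_i, y_j)."""
    return sum(ai * bj * K(xi, yj) for xi, ai in zip(xs, a) for yj, bj in zip(ys, b))

def mmd2(xs, a, ys, b, K=gaussian_kernel):
    """Squared kernel norm of alpha - beta, following formula (25)."""
    return (cross_energy(xs, a, xs, a, K)
            + cross_energy(ys, b, ys, b, K)
            - 2 * cross_energy(xs, a, ys, b, K))

# Two measures on the same point cloud: (25) reduces to (a-b)^T K_X (a-b).
xs = [0.0, 0.5, 2.0]
a = [0.2, 0.3, 0.5]
b = [0.6, 0.1, 0.3]
r = [ai - bi for ai, bi in zip(a, b)]

same_support = cross_energy(xs, r, xs, r, gaussian_kernel)
general = mmd2(xs, a, xs, b)
print(same_support, general)
```

The double sums cost $O(n^2)$ kernel evaluations, which is acceptable for small point clouds; a vectorized implementation would build the Gram matrix $K_X$ once.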

Phi-Divergences

This section develops divergences based on pointwise density ratios. They are computationally simple and statistically classical, but on nondiscrete spaces they generally induce a topology much stronger than weak convergence and do not see small spatial displacements between mutually singular measures.

Definition by Density Ratios

On a common discrete support, phi-divergences cost only $O(n)$ to evaluate, but on a continuous space they generally fail to metrize weak convergence. Bregman divergences provide a different convex construction and should not be conflated with density-ratio divergences.

If $\phi'_\infty=+\infty$, then $\phi$ grows faster than any linear function and is called superlinear. Any entropy function induces a $\phi$-divergence, also known as a Csiszár divergence or $f$-divergence (Csiszár, 1967; Ali & Silvey, 1966).
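For reference, a standard way to write the divergence induced by an entropy function $\phi$ uses the Lebesgue decomposition $\alpha=\frac{\d\alpha}{\d\beta}\beta+\alpha^\perp$. The display below is a sketch in assumed notation, chosen to agree with the discrete formula (30).

```latex
D_\phi(\alpha|\beta) \eqdef \int_\X \phi\!\left(\frac{\d\alpha}{\d\beta}\right)\d\beta
  + \phi'_\infty\,\alpha^\perp(\X),
\qquad
\phi'_\infty \eqdef \lim_{s\to+\infty}\frac{\phi(s)}{s}.
```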

Here $\alpha^\perp$ is the part of $\alpha$ singular with respect to $\beta$. The singular term is the recession contribution of the perspective functional. It gives the weak-$\ast$ lower-semicontinuous extension of the density-ratio integral when singular mass appears. This is essential for linear-growth entropies such as total variation. For superlinear entropies, such as the usual entropy, $\phi'_\infty=+\infty$, so the divergence is infinite when $\alpha$ is not absolutely continuous with respect to $\beta$.

For discrete measures supported on the same set,

\alpha=\sum_i a_i\delta_{x_i}, \qquad \beta=\sum_i b_i\delta_{x_i},

(29)

the formula becomes

D_\phi(a|b) = \sum_{i\in\operatorname{supp}(b)} b_i\, \phi\left(\frac{a_i}{b_i}\right) + \phi'_\infty \sum_{i\notin\operatorname{supp}(b)}a_i .

(30)
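The discrete formula (30) is short enough to implement directly. The following sketch assumes the KL generator $\phi(s)=s\log s-s+1$ (with $\phi'_\infty=+\infty$) and the total-variation generator $|s-1|$ (with $\phi'_\infty=1$); the function names are hypothetical.

```python
import math

def phi_kl(s):
    """Assumed KL generator s log s - s + 1, with the convention 0 log 0 = 0."""
    return s * math.log(s) - s + 1 if s > 0 else 1.0

def phi_tv(s):
    """Total-variation generator |s - 1| for s >= 0."""
    return abs(s - 1)

def phi_divergence(a, b, phi, phi_inf):
    """Formula (30): density-ratio term on supp(b) plus recession term off supp(b)."""
    total = 0.0
    for ai, bi in zip(a, b):
        if bi > 0:
            total += bi * phi(ai / bi)
        elif ai > 0:                 # mass of a that is singular with respect to b
            total += phi_inf * ai
    return total

# a is absolutely continuous with respect to b: both divergences are finite.
a, b = [0.5, 0.5, 0.0], [0.25, 0.25, 0.5]
kl_ac = phi_divergence(a, b, phi_kl, math.inf)
tv_ac = phi_divergence(a, b, phi_tv, 1.0)

# a puts mass where b vanishes: KL blows up, total variation stays finite.
a_sing, b_sing = [0.5, 0.5], [1.0, 0.0]
kl_sing = phi_divergence(a_sing, b_sing, phi_kl, math.inf)
tv_sing = phi_divergence(a_sing, b_sing, phi_tv, 1.0)

print(kl_ac, tv_ac, kl_sing, tv_sing)
```

The guard `ai > 0` avoids the undefined product `inf * 0` when both weights vanish at the same point.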

Classical Examples and Topology

The following examples calibrate the strength of $\phi$-divergences. KL is sensitive to absolute continuity, while total variation gives the strong topology and therefore behaves very differently from Wasserstein-type weak metrics.

Example: Total variation

With the convention of this book, total variation $\TV \eqdef \Divergm_{\phi_{\TV}}$ is the full variation norm, without the factor $1/2$ sometimes used for probabilities. It is associated with

\phi_{\TV}(s)= \begin{cases} |s-1| & \textnormal{for } s\geq0 , \\ +\infty & \textnormal{otherwise.} \end{cases}

(37)

It actually defines a norm on the full space of measures $\Mm(\X)$ where

\TV(\al|\be) = \norm{\al-\be}_{\TV}, \qquad\text{where}\qquad \norm{\al}_{\TV} = |\al|(\X) = \int_\X \d|\al|(x).

(38)

If $\al$ has a density $\density{\al}$ on $\X=\RR^\dim$, then the TV norm is the $L^1$ norm on functions, $\norm{\al}_{\TV} = \int_\X |\density{\al}(x)| \d x = \norm{\density{\al}}_{L^1}$.

If $\al$ is discrete as in (29), then the TV norm is the $\ell^1$ norm of vectors in $\RR^n$, $\norm{\al}_{\TV}=\sum_i |\a_i| = \norm{\a}_{\ell^1}$.

KL and total variation are two very different $\phi$-divergences: the former is smooth and sensitive to density ratios, whereas the latter is a nonsmooth norm. Pinsker’s fundamental inequality nevertheless controls the square of the latter by the former (Pinsker, 1964).
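With the unnormalized convention used here (no factor $1/2$), Pinsker's inequality for probability measures reads $\norm{\al-\be}_{\TV}^2 \leq 2\,\operatorname{KL}(\al|\be)$. The sketch below checks it on random histograms; it is a numerical illustration under that convention, not a proof, and the helper names are hypothetical.

```python
import math
import random

def kl(a, b):
    """KL(a|b) for histograms with b > 0 wherever a > 0."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def tv(a, b):
    """Full variation norm sum_i |a_i - b_i| (no factor 1/2)."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def random_histogram(n, rng):
    """Random strictly positive probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(0)  # fixed seed so the experiment is reproducible
pairs = [(random_histogram(6, rng), random_histogram(6, rng)) for _ in range(200)]

# Pinsker gap 2 KL - TV^2, which should be nonnegative for every pair.
gaps = [2 * kl(a, b) - tv(a, b) ** 2 for a, b in pairs]
print(min(gaps))
```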

Remark: Strong vs. weak topology

The total variation norm (38) defines the so-called “strong” topology on the space of measures.

For probability measures on a compact metric space,

\Wass_1(\al,\be) \leq \frac{\operatorname{diam}(\X)}{2}\norm{\al-\be}_{\TV}.

(43)

Indeed, a 1-Lipschitz test function can be shifted so that its sup norm is at most $\operatorname{diam}(\X)/2$ . Thus total-variation convergence implies weak convergence.

The converse is false: if $x_n\to x$ with $x_n\ne x$, then $\delta_{x_n}\rightharpoonup\delta_x$ but $\norm{\delta_{x_n}-\delta_x}_{\TV}=2$ for every $n$.

A chief advantage of the weak topology is that $\Mm_+^1(\Xx)$ (once again on a compact ground space $\X$) is weakly compact, so that from any sequence of probability measures $(\al_k)_k$ one can always extract a converging subsequence. This makes it a suitable space for several optimization problems.
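The gap between the two topologies in this remark can be observed on Dirac masses. The sketch below assumes measures on the real line, where $\Wass_1$ equals the integral of the absolute difference of the cumulative distribution functions; the helper names `w1_1d` and `tv_norm` are hypothetical.

```python
def w1_1d(xs, a, ys, b):
    """W1 on the real line, computed as the integral of |F_alpha - F_beta|."""
    events = sorted([(x, w) for x, w in zip(xs, a)] + [(y, -w) for y, w in zip(ys, b)])
    total, cdf_gap = 0.0, 0.0
    for (p, w), (q, _) in zip(events, events[1:]):
        cdf_gap += w                      # F_alpha - F_beta on the interval [p, q)
        total += abs(cdf_gap) * (q - p)
    return total

def tv_norm(xs, a, ys, b):
    """Full variation norm of alpha - beta for discrete measures."""
    diff = {}
    for x, w in zip(xs, a):
        diff[x] = diff.get(x, 0.0) + w
    for y, w in zip(ys, b):
        diff[y] = diff.get(y, 0.0) - w
    return sum(abs(v) for v in diff.values())

# delta_{x_n} with x_n = 1/n approaches delta_0 weakly but not in total variation.
rows = []
for n in (1, 10, 100):
    x_n = 1.0 / n
    rows.append((n, w1_1d([x_n], [1.0], [0.0], [1.0]), tv_norm([x_n], [1.0], [0.0], [1.0])))

for n, w1, tv in rows:
    print(n, w1, tv)
```

On $[0,1]$, whose diameter is 1, each row also satisfies the bound (43), since $1/n \leq 1$.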

Main Families of $\phi$-Divergences

Several classical divergences fit in the same template. The power-divergence family

\phi_\gamma(s) = \frac{s^\gamma-\gamma s+\gamma-1}{\gamma(\gamma-1)} \qquad(\gamma\ne0,1)

(44)

interpolates, up to conventional multiplicative normalizations, between Pearson’s $\chi^2$ divergence at $\gamma=2$, Hellinger behavior at $\gamma=1/2$, and, by taking limits, the KL divergence as $\gamma\to1$ and the reverse KL or Burg entropy $\phi_0(s)=-\log s+s-1$ as $\gamma\to0$. The Hellinger divergence is often written with $\phi_H(s)=(\sqrt{s}-1)^2$. If $\alpha=\rho_\alpha\lambda$ and $\beta=\rho_\beta\lambda$, then $\operatorname{Hellinger}(\alpha,\beta) =\norm{\sqrt{\rho_\alpha}-\sqrt{\rho_\beta}}_{L^2(\lambda)}$. The Jensen--Shannon distance (Endres & Schindelin, 2003; Österreicher & Vajda, 2003) is the square root of the symmetrized, bounded KL-to-the-mixture divergence

\operatorname{JS}(\alpha,\beta)^2 = \frac12\operatorname{KL}\!\left(\alpha\middle|\frac{\alpha+\beta}{2}\right) + \frac12\operatorname{KL}\!\left(\beta\middle|\frac{\alpha+\beta}{2}\right),

(45)

and $0\le\operatorname{JS}(\alpha,\beta)^2\le\log2$. Its exact generator is

\phi_{\operatorname{JS}}(s) =\frac12\left[s\log s-(s+1)\log\left(\frac{s+1}{2}\right)\right].

(46)
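The agreement between the mixture form (45) and the generator (46) can be checked on histograms. In the sketch below the recession slope $\lim_{s\to\infty}\phi_{\operatorname{JS}}(s)/s=\tfrac12\log2$ is computed from (46) and handles bins where $b_i=0$; the function names are hypothetical.

```python
import math

def kl(a, b):
    """KL(a|b) for histograms with b > 0 wherever a > 0."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def js2_mixture(a, b):
    """Squared Jensen-Shannon distance via formula (45)."""
    m = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

def phi_js(s):
    """Generator (46), with the convention 0 log 0 = 0."""
    s_log_s = s * math.log(s) if s > 0 else 0.0
    return 0.5 * (s_log_s - (s + 1) * math.log((s + 1) / 2))

PHI_JS_INF = 0.5 * math.log(2)  # recession slope of phi_js, derived from (46)

def js2_generator(a, b):
    """Squared Jensen-Shannon distance via the discrete phi-divergence formula (30)."""
    total = 0.0
    for ai, bi in zip(a, b):
        if bi > 0:
            total += bi * phi_js(ai / bi)
        else:
            total += PHI_JS_INF * ai
    return total

a, b = [0.7, 0.2, 0.1, 0.0], [0.1, 0.4, 0.1, 0.4]
overlap = (js2_mixture(a, b), js2_generator(a, b))

# Disjoint supports: both forms saturate at log 2.
disjoint = (js2_mixture([1.0, 0.0], [0.0, 1.0]), js2_generator([1.0, 0.0], [0.0, 1.0]))
print(overlap, disjoint, math.log(2))
```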

Total variation, generated by $|s-1|$, is exceptional because it is both a $\phi$-divergence and an integral probability metric.

Figure Div places the principal generators and their induced scalar density-ratio penalties side by side, clarifying how their different growth and boundary behavior affect measure comparison.

$\phi$-divergences through density ratios. The left panel shows normalized generators for common divergences as functions of $s=\d\alpha/\d\beta$; all curves vanish at $s=1$ up to affine normalization. The right panel shows the discrete formula $D_\phi(a|b)=\sum_i b_i\phi(a_i/b_i)$: hollow blue circles encode $b_i$, filled red circles encode $a_i$, the violet curve gives the ratios $a_i/b_i$, and orange lollipops show local KL-type contributions.

The interactive demo changes the generator family and the amount of mismatch between two discrete histograms. The near-zero control deliberately creates small target bins, making the recession and singularity behavior visible: ratio-based penalties react to overlap and density ratios rather than to spatial displacement.

Interactive panel. Use the divergence and ratio controls to compare convex generators and their dual penalties around density ratio one.

Variational Dual Formula

The following formula turns a pointwise density-ratio penalty into a dual optimization problem over test functions. It is the analogue, for $\phi$-divergences, of the Kantorovich dual formula for transport costs.
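In the notation of the adversarial formulation (52), the variational formula can be sketched as follows. The supremum is taken over continuous test functions with $f\leq\phi'_\infty$; these qualifications are standard assumptions stated here for completeness.

```latex
D_\phi(\alpha|\beta) = \sup_{f} \left\{ \int_\X f\,\d\alpha - D_\phi^*(f|\beta) \right\},
\qquad
D_\phi^*(f|\beta) \eqdef \int_\X \phi^*(f(x))\,\d\beta(x),
\qquad
\phi^*(u) \eqdef \sup_{s\geq 0}\,\{ su-\phi(s) \}.
```

When $\alpha$ has a density with respect to $\beta$ and $\phi$ is differentiable, the supremum is formally attained at $f=\phi'(\d\alpha/\d\beta)$.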

GANs via Duality

GANs fit naturally into the dual viewpoint: the discriminator is a parameterized potential and the generator moves a reference measure. This section first explains the original divergence-based GAN objective, then contrasts it with integral probability metrics such as MMD and Wasserstein distances.

The goal is to fit a generative parametric model $\alpha_\theta=(g_\theta)_\sharp\zeta$ to empirical data

\beta=\frac1m\sum_{j=1}^m\delta_{y_j},

(51)

where $\zeta$ is a fixed probability measure on the latent space and $g_\theta:\mathcal{Z}\to\X$ is the generator, often a neural network.

Divergence-Based Adversarial Losses

Any $\phi$-divergence can be written in adversarial form through the dual formula:

\min_\theta D_\phi(\alpha_\theta|\beta) = \min_\theta\sup_f \left\{ \int_\X f\,\d\alpha_\theta - D_\phi^*(f|\beta) \right\} = \min_\theta\sup_f \left\{ \int_\mathcal{Z} f(g_\theta(z))\,\d\zeta(z) - \frac1m\sum_{j=1}^m\phi^*(f(y_j)) \right\}.

(52)

Replacing the unrestricted potential $f$ by a neural network $f_\xi$ gives a saddle problem

\min_\theta\max_\xi \int_\mathcal{Z} f_\xi(g_\theta(z))\,\d\zeta(z) - \frac1m\sum_{j=1}^m\phi^*(f_\xi(y_j)).

(53)

For fixed $\theta$, restricting the discriminator gives a lower bound on the exact divergence. This distinction is essential for empirical data: if $\beta$ is discrete and $\alpha_\theta$ is non-atomic, a superlinear divergence is $+\infty$, while the restricted objective can remain finite.

The original vanilla GAN (Goodfellow et al., 2014) corresponds, up to an additive constant and discriminator reparametrization, to the unscaled Jensen--Shannon generator $\widehat\phi_{\operatorname{JS}}=2\phi_{\operatorname{JS}}$,

\widehat\phi_{\operatorname{JS}}(s) = s\log s-(s+1)\log\frac{s+1}{2}, \qquad \widehat\phi_{\operatorname{JS}}^*(u) = -\log(2-e^u), \quad u<\log2.

(54)

Thus $D_{\widehat\phi_{\operatorname{JS}}}=2\operatorname{JS}^2$. In practice the min--max problem is solved by alternating stochastic gradient descent/ascent. Although the unrestricted maximization is concave in $f$, neural parametrization generally destroys concavity in $\xi$; the generator problem is likewise nonconvex in $\theta$. Density-ratio losses can also saturate on singular measures: $\operatorname{JS}^2$ reaches its maximum $\log2$ on disjoint supports.
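The duality behind the vanilla GAN loss can be verified on histograms with a common support. The sketch below assumes strictly positive weights and uses the generator and conjugate from (54); the optimal potential $f^\star=\widehat\phi_{\operatorname{JS}}'(a_i/b_i)=\log\frac{2a_i}{a_i+b_i}$ is obtained by differentiating the generator, and all function names are hypothetical.

```python
import math

def phi_hat(s):
    """Unscaled Jensen-Shannon generator from (54), with 0 log 0 = 0."""
    s_log_s = s * math.log(s) if s > 0 else 0.0
    return s_log_s - (s + 1) * math.log((s + 1) / 2)

def phi_hat_conj(u):
    """Conjugate from (54), finite only for u < log 2."""
    return -math.log(2 - math.exp(u))

def dual_objective(f, a, b):
    """Discrete version of the bracket in (52): <f, a> - sum_j b_j phi*(f_j)."""
    return (sum(ai * fi for ai, fi in zip(a, f))
            - sum(bi * phi_hat_conj(fi) for bi, fi in zip(b, f)))

a = [0.7, 0.2, 0.1]   # generated histogram (alpha_theta)
b = [0.2, 0.3, 0.5]   # data histogram (beta)

primal = sum(bi * phi_hat(ai / bi) for ai, bi in zip(a, b))

# Optimal potential: derivative of the generator evaluated at the density ratio.
f_star = [math.log(2 * ai / (ai + bi)) for ai, bi in zip(a, b)]
dual_at_optimum = dual_objective(f_star, a, b)

# Any other admissible potential gives a lower bound on the divergence.
f_perturbed = [fi - 0.1 for fi in f_star]
dual_perturbed = dual_objective(f_perturbed, a, b)

print(primal, dual_at_optimum, dual_perturbed)
```

Writing $f^\star=\log(2D^\star)$ with $D^\star=a/(a+b)$ recovers the familiar optimal GAN discriminator, which is the reparametrization mentioned above.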

Dual Norms and Integral Probability Metrics

Instead of a density-ratio divergence, one can minimize an integral probability metric:

\min_\theta\norm{\alpha_\theta-\beta}_B = \min_\theta \sup_{f\in B} \left\{ \int_\mathcal{Z} f(g_\theta(z))\,\d\zeta(z) - \frac1m\sum_{j=1}^m f(y_j) \right\}.

(55)

MMD-GANs take $B$ to be a unit ball in an RKHS (Dziugaite et al., 2015); Wasserstein GANs take $B$ to be a Lipschitz ball, following Kantorovich--Rubinstein duality (Arjovsky et al., 2017; Frogner et al., 2015). The advantage is topological: for a continuous kernel on a compact space, the RKHS unit ball is uniformly bounded and equicontinuous, while the normalized Lipschitz ball is compact by Arzelà--Ascoli. The objective is therefore weakly continuous and can compare singular empirical and generated measures through test functions instead of requiring pointwise density ratios. The price is that the discriminator class must be controlled geometrically, by a kernel norm, a Lipschitz constraint, or a related regularization.
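For the RKHS unit ball, the supremum in (55) has a closed form: the unnormalized witness is the difference of kernel mean embeddings, and integrating it against $\alpha_\theta-\beta$ gives the squared MMD. The sketch below illustrates this for empirical measures on the real line, assuming a Gaussian kernel with unit bandwidth; the names `k`, `witness` and `mmd` are hypothetical.

```python
import math

def k(x, y, sigma=1.0):
    """Gaussian kernel on the real line (an illustrative choice)."""
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def witness(t, xs, ys):
    """Unnormalized witness: mean embedding of the xs minus that of the ys, at t."""
    return sum(k(t, x) for x in xs) / len(xs) - sum(k(t, y) for y in ys) / len(ys)

def mmd(xs, ys):
    """MMD between two empirical measures, as the value of the IPM objective (55)."""
    gap_alpha = sum(witness(x, xs, ys) for x in xs) / len(xs)
    gap_beta = sum(witness(y, xs, ys) for y in ys) / len(ys)
    return math.sqrt(max(gap_alpha - gap_beta, 0.0))

# A generated Dirac at t against a data Dirac at 0: the supports are disjoint
# for every t != 0, yet the discrepancy shrinks continuously as t -> 0.
values = [(t, mmd([t], [0.0])) for t in (1.0, 0.1, 0.01, 0.0)]
for t, value in values:
    print(t, value)
```

This contrasts with the saturation of $\operatorname{JS}^2$ at $\log2$ on disjoint supports noted in the previous section.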

Example: Application to imitation learning

In imitation learning, one can compare the expert occupancy measure $\rho_E$ and the learner occupancy measure $\rho_\theta$ on state-action space. OT gives either a primal matching loss $W(\rho_\theta,\rho_E)$, or a dual adversarial reward obtained from a Kantorovich potential. Thus the discriminator in an adversarial imitation method can be interpreted as a learned reward shaping the learner toward the expert distribution, exactly as the GAN discriminator above is a learned potential. Wasserstein adversarial imitation and primal Wasserstein imitation exploit this distribution-matching viewpoint while retaining the geometry of state-action space, for instance through a cost that compares nearby states and actions more mildly than distant ones (Xiao et al., 2019; Dadashi et al., 2020).

References

Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2, 299–318.
Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), 28(1), 131–142.
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., & Lanckriet, G. R. (2009). On integral probability metrics, ϕ-divergences and binary classification. arXiv Preprint arXiv:0901.2698.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 2672–2680.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 214–223). PMLR. https://proceedings.mlr.press/v70/arjovsky17a.html
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., & Lanckriet, G. R. (2012). On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6, 1550–1599.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G., & Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. Proceedings of the 21st Annual Conference on Learning Theory, 111–122.
Hanin, L. G. (1992). Kantorovich-Rubinstein norm and its application in the theory of Lipschitz spaces. Proceedings of the American Mathematical Society, 115(2), 345–352.
Lellmann, J., Lorenz, D. A., Schönlieb, C., & Valkonen, T. (2014). Imaging with Kantorovich–Rubinstein discrepancy. SIAM Journal on Imaging Sciences, 7(4), 2833–2859.
Berg, C., Christensen, J. P. R., & Ressel, P. (1984). Harmonic Analysis on Semigroups. Springer Verlag.
Schoenberg, I. J. (1938). Metric spaces and positive definite functions. Transactions of the American Mathematical Society, 44(3), 522–536. 10.1090/S0002-9947-1938-1501980-0
Székely, G. J., & Rizzo, M. L. (2004). Testing for equal distributions in high dimension. InterStat, 5(16.10).
Wendland, H. (2005). Scattered Data Approximation. Cambridge University Press.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar), 723–773.
Muandet, K., Fukumizu, K., Sriperumbudur, B., & Schölkopf, B. (2017). Kernel mean embedding of distributions: a review and beyond. Foundations and Trends in Machine Learning, 10(1–2), 1–141.