Divergences and Dual Norms

This chapter compares optimal transport with divergence-based and adversarial ways of measuring discrepancy. The main stake is topological: $\phi$-divergences are cheap to compute but induce a strong topology, while dual norms and GAN objectives can be weak enough to compare singular measures. The discussion connects classical information divergences (Csiszár, 1967; Ali & Silvey, 1966) with modern integral probability metrics and generative modeling (Sriperumbudur et al., 2009; Goodfellow et al., 2014; Arjovsky et al., 2017).

Dual Norms and Integral Probability Metrics

Dual norms generalize the $\Wass_1$ test-function principle. They are useful in statistics because they compare distributions by restricting the discriminator class.

Integral Probability Metrics

The Kantorovich–Rubinstein formula for $\Wass_1$ is a special case of a dual norm. This viewpoint designs weak discrepancies by testing signed differences of measures against a controlled class of functions.

The choice of the test-function class $B$ determines both the topology and the statistical behavior of the discrepancy (Sriperumbudur et al., 2012; Sriperumbudur et al., 2009; Sriperumbudur et al., 2008).

Dual witnesses for integral probability metrics. The red and blue curves are two one-dimensional probability densities and the violet curve is a normalized optimal dual witness $f^\star_{\alpha,\beta}$ for the IPM variational problem. $\Wass_1$ restricts the slope through Kantorovich–Rubinstein duality, MMD restricts the RKHS norm, and total variation can saturate pointwise and therefore reacts sharply to signed density differences.

The interactive demo makes the topology visible. As the two densities move, the total-variation witness jumps with the sign of the density difference, the Wasserstein witness keeps a unit-slope geometry, and the MMD witness is smoothed by the kernel bandwidth.
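As a rough numerical companion to the witness picture, the sketch below compares the three discrepancies on two Dirac masses that approach each other on a one-dimensional grid. The function names, the grid, and the Gaussian bandwidth `sigma` are illustrative assumptions, and total variation is normalized as the sum of absolute weight differences.

```python
import numpy as np

def total_variation(a, b):
    # Assumed normalization: sum_i |a_i - b_i| (generator |s - 1|).
    return float(np.abs(a - b).sum())

def wasserstein1_1d(x, a, b):
    # 1-D closed form on a sorted grid: integral of |CDF_a - CDF_b|.
    cdf_gap = np.cumsum(a - b)[:-1]
    return float(np.sum(np.abs(cdf_gap) * np.diff(x)))

def mmd_gaussian(x, a, b, sigma=0.5):
    # Dual RKHS norm of a - b for a Gaussian kernel (bandwidth is an assumption).
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma**2))
    xi = a - b
    return float(np.sqrt(max(xi @ K @ xi, 0.0)))

def dirac(n, i):
    # Unit mass at grid index i.
    e = np.zeros(n)
    e[i] = 1.0
    return e

x = np.linspace(0.0, 1.0, 101)
rows = []
for shift in (50, 10, 1):
    a, b = dirac(x.size, 0), dirac(x.size, shift)
    rows.append((x[shift], total_variation(a, b),
                 wasserstein1_1d(x, a, b), mmd_gaussian(x, a, b)))

for gap, tv, w1, mmd in rows:
    print(f"gap={gap:.2f}  TV={tv:.1f}  W1={w1:.3f}  MMD={mmd:.3f}")
```

In this toy setting total variation stays at its maximal value however close the two Diracs are, whereas the Wasserstein value equals the gap and the MMD value shrinks with it.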

Interactive panel. Use the kernel, bandwidth, and separation controls to see how witness functions detect differences between measures.

The following proposition gives a compact-space criterion. The dual ball should be rich enough to approximate continuous observables, but compact enough for weak convergence to imply uniform convergence over the discriminator class.

Proof

For the first implication, $\norm{\alpha_n-\alpha}_B\to0$ and the symmetry of $B$ imply

$$\left|\int f\,\d(\alpha_n-\alpha)\right| \le \norm{\alpha_n-\alpha}_B \qquad (f\in B).$$

By linearity, integrals converge for every $h\in\operatorname{span}(B)$. Let $u\in\Cc(\X)$ and choose $h\in\operatorname{span}(B)$ with $\norm{u-h}_\infty\le\eta$. Since $\alpha_n$ and $\alpha$ are probabilities,

$$\left|\int u\,\d(\alpha_n-\alpha)\right| \le \left|\int h\,\d(\alpha_n-\alpha)\right| +2\eta .$$

Taking the limsup as $n\to\infty$ and then letting $\eta\to0$ gives weak convergence.

For the second implication, assume $\alpha_n\rightharpoonup\alpha$ and choose a subsequence $(\alpha_{n_k})_k$ realizing the limsup of $\norm{\alpha_n-\alpha}_B$. Since $B$ is compact and $f\mapsto\int f\,\d(\alpha_{n_k}-\alpha)$ is continuous on $B$, the supremum is attained by some $f_{n_k}\in B$. Extract a further subsequence with $f_{n_k}\to f$ uniformly. Then

$$\int f_{n_k}\,\d(\alpha_{n_k}-\alpha) = \int f\,\d(\alpha_{n_k}-\alpha) + \int (f_{n_k}-f)\,\d\alpha_{n_k} - \int (f_{n_k}-f)\,\d\alpha .$$

The first term tends to zero by weak convergence and the last two by uniform convergence. Hence the limsup is zero.

Proof

For $p=1$, take $B=\{f:\operatorname{Lip}(f)\le1\}$. The span of $B$ contains all Lipschitz functions, which are dense in $\Cc(\X)$ on compact metric spaces. This gives $\Wass_1(\alpha_n,\alpha)\to0\Rightarrow\alpha_n\rightharpoonup\alpha$.

Conversely, constants do not change the pairing with $\alpha_n-\alpha$. Fix $x_0\in\X$ and normalize potentials by $f(x_0)=0$. The normalized unit Lipschitz ball is uniformly bounded by $\operatorname{diam}(\X)$ and equicontinuous, hence compact in $\norm{\cdot}_\infty$ by Arzelà–Ascoli. The previous proposition gives $\Wass_1(\alpha_n,\alpha)\to0$. On compact spaces, all $\Wass_p$ distances induce the same topology.

Dual RKHS Norms and Maximum Mean Discrepancies

Kernel methods turn probability measures into mean elements of a reproducing kernel Hilbert space. The resulting Hilbertian dual norms are quadratic discrepancies, handled with Euclidean geometry while retaining a weak test-function interpretation.

The conditional version is the right notion for probability distances, because one applies the quadratic form to signed measures $\xi=\alpha-\beta$ of total mass zero. Adding $a(x)+a(y)$ to the kernel does not change $\iint K(x,y)\,\d\xi(x)\,\d\xi(y)$ on such measures, and many natural distance kernels are only conditionally positive definite.
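The invariance under additive terms can be spelled out in one line. The computation below uses the same notation and assumes only that $a$ is integrable against $\xi$ and that $\xi(\X)=0$:

$$
\iint \big(K(x,y)+a(x)+a(y)\big)\,\d\xi(x)\,\d\xi(y)
= \iint K(x,y)\,\d\xi(x)\,\d\xi(y) + 2\,\xi(\X)\int a\,\d\xi
= \iint K(x,y)\,\d\xi(x)\,\d\xi(y).
$$

A standard example of a kernel that is conditionally positive definite without being positive definite is the Euclidean distance kernel $K(x,y)=-\norm{x-y}$, which vanishes on the diagonal and is negative elsewhere.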

These norms are usually called maximum mean discrepancies in statistics and machine learning (Gretton et al., 2012; Muandet et al., 2017), and kernel norms in shape analysis (Hofmann et al., 2008). If $X,X'$ are independent with law $\alpha$, then $\norm{\alpha}_K^2=\EE_{X,X'}(K(X,X'))$, whenever this expression is finite.

Proof

By the reproducing property,

$$\int h(x)\,\d\xi(x) = \left\langle h,\int K(x,\cdot)\,\d\xi(x) \right\rangle_{\mathcal{H}} = \langle h,m_\xi\rangle_{\mathcal{H}}.$$

Cauchy–Schwarz gives

$$\sup_{\norm{h}_{\mathcal{H}}\le1}\int h\,\d\xi = \norm{m_\xi}_{\mathcal{H}}.$$

Finally,

$$\norm{m_\xi}_{\mathcal{H}}^2 = \iint K(x,y)\,\d\xi(x)\,\d\xi(y).$$

Proof

If $\operatorname{MMD}_K(\alpha_n,\alpha)\to0$, then integrals of all RKHS functions converge. For any $h\in\Cc(\X)$ and any $\eta>0$, choose $g\in\mathcal{H}$ with $\norm{h-g}_\infty\le\eta$. Since $\alpha_n$ and $\alpha$ are probabilities,

$$\left|\int h\,\d(\alpha_n-\alpha)\right| \le 2\eta + \left|\int g\,\d(\alpha_n-\alpha)\right|,$$

and the last term tends to zero. Conversely, if $\alpha_n\rightharpoonup\alpha$, then $\alpha_n\otimes\alpha_n$, $\alpha_n\otimes\alpha$, and $\alpha\otimes\alpha$ converge weakly on the compact product space. Applying this to the continuous bounded function $K$ in

$$\operatorname{MMD}_K(\alpha_n,\alpha)^2 = \iint K\,\d\alpha_n\,\d\alpha_n -2\iint K\,\d\alpha_n\,\d\alpha +\iint K\,\d\alpha\,\d\alpha$$

gives convergence to zero.

Further background on reproducing kernel Hilbert spaces can be found in Berlinet & Thomas-Agnan (2003), Hofmann et al. (2008), and Schölkopf & Smola (2002).

In the special case where $\alpha=\sum_{i=1}^n a_i\delta_{x_i}$ is discrete, one obtains

$$\norm{\alpha}_K^2 = \sum_{i,i'} a_i a_{i'}K(x_i,x_{i'}) = a^\top K_X a,$$

where $(K_X)_{i,i'}=K(x_i,x_{i'})$. In particular, if $\alpha=\sum_i a_i\delta_{x_i}$ and $\beta=\sum_i b_i\delta_{x_i}$ are supported on the same point cloud, then $\norm{\alpha-\beta}_K^2=(a-b)^\top K_X(a-b)$, a Euclidean quadratic form on the simplex. For two arbitrary discrete measures,

$$\norm{\alpha-\beta}_K^2 = \sum_{i,i'} a_i a_{i'}K(x_i,x_{i'}) + \sum_{j,j'} b_j b_{j'}K(y_j,y_{j'}) - 2\sum_{i,j}a_i b_j K(x_i,y_j).$$
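The three-term formula for two arbitrary discrete measures translates directly into a few matrix products. The sketch below assumes a Gaussian kernel with a hypothetical bandwidth `sigma`; the helper names are illustrative, not taken from any library. It cross-checks the result against the single quadratic form obtained by stacking both point clouds with weights $(a,-b)$.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel matrix between point clouds x (n, d) and y (m, d).
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2 * sigma**2))

def mmd_squared(x, a, y, b, kernel=gaussian_kernel):
    # Three-term expansion of the squared kernel norm of alpha - beta.
    return float(a @ kernel(x, x) @ a
                 + b @ kernel(y, y) @ b
                 - 2 * a @ kernel(x, y) @ b)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.normal(size=(7, 2)) + 1.0
a = np.full(5, 1 / 5)
b = np.full(7, 1 / 7)

value = mmd_squared(x, a, y, b)

# Cross-check: one signed measure on the stacked cloud with weights (a, -b).
z = np.concatenate([x, y])
w = np.concatenate([a, -b])
check = float(w @ gaussian_kernel(z, z) @ w)
self_distance = mmd_squared(x, a, x, a)
```

The cost is quadratic in the number of points, since the three kernel matrices have to be formed.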

Phi-Divergences

This section develops divergences based on pointwise density ratios. They are computationally simple and statistically classical, but they do not see small spatial displacements between singular measures.

Definition by Density Ratios

Phi-divergences are simpler to compute, typically $O(n)$ for discrete distributions, but they never metrize weak-$\ast$ convergence on singular measures. Another route is possible through Bregman divergences, which may metrize weak-$\ast$ convergence when the associated entropy functional is weakly regular.

If $\phi'_\infty=+\infty$, then $\phi$ grows faster than any linear function and is called superlinear. Any entropy function induces a $\phi$-divergence, also known as a Csiszár divergence or $f$-divergence (Csiszár, 1967; Ali & Silvey, 1966).

Here $\alpha^\perp$ is the part of $\alpha$ singular with respect to $\beta$. The singular term is the recession contribution of the perspective functional. It gives the weak-$\ast$ lower-semicontinuous extension of the density-ratio integral when singular mass appears. This is essential for linear-growth entropies such as total variation. For superlinear entropies, such as the usual entropy, $\phi'_\infty=+\infty$, so the divergence is infinite when $\alpha$ is not absolutely continuous with respect to $\beta$.

For discrete measures supported on the same set,

$$\alpha=\sum_i a_i\delta_{x_i}, \qquad \beta=\sum_i b_i\delta_{x_i},$$

the formula becomes

$$D_\phi(a|b) = \sum_{i\in\operatorname{supp}(b)} b_i\, \phi\left(\frac{a_i}{b_i}\right) + \phi'_\infty \sum_{i\notin\operatorname{supp}(b)}a_i .$$
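The discrete formula, including its recession term, can be sketched as follows. The function and variable names are illustrative assumptions; the generator and its recession slope are passed in by hand, with `np.inf` for the superlinear KL generator and `1.0` for total variation.

```python
import numpy as np

def phi_divergence(a, b, phi, phi_inf):
    # Regular part on supp(b) plus recession slope times the mass of a off supp(b).
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    on = b > 0
    regular = np.sum(b[on] * phi(a[on] / b[on]))
    singular_mass = a[~on].sum()
    # Convention: the singular term vanishes when a has no mass outside supp(b).
    singular = phi_inf * singular_mass if singular_mass > 0 else 0.0
    return float(regular + singular)

def phi_kl(s):
    # s log s - s + 1 with the convention 0 log 0 = 0.
    s = np.asarray(s, dtype=float)
    safe = np.where(s > 0, s, 1.0)
    return s * np.log(safe) - s + 1

def phi_tv(s):
    return np.abs(s - 1)

a = np.array([0.5, 0.5, 0.0])
b_overlap = np.array([0.25, 0.25, 0.5])   # a is absolutely continuous w.r.t. b
b_singular = np.array([0.5, 0.0, 0.5])    # a puts mass where b vanishes

kl_finite = phi_divergence(a, b_overlap, phi_kl, np.inf)
kl_singular = phi_divergence(a, b_singular, phi_kl, np.inf)
tv_singular = phi_divergence(a, b_singular, phi_tv, 1.0)
```

With these inputs the KL value is finite only in the absolutely continuous case, while total variation remains finite and equals the sum of absolute weight differences.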

Proof

Define the perspective

$$\psi(u,v) = \begin{cases} v\,\phi(u/v), & v>0,\\ u\,\phi'_\infty, & v=0. \end{cases}$$

Joint 1-homogeneity follows directly. In the discrete case, $D_\phi(a|b)=\sum_i\psi(a_i,b_i)$, so it is enough to show that $\psi$ is convex. For $v_1,v_2>0$, $\lambda\in[0,1]$, $\tau=1-\lambda$, set

$$\theta_1=\frac{\tau v_1}{\tau v_1+\lambda v_2}, \qquad \theta_2=\frac{\lambda v_2}{\tau v_1+\lambda v_2}.$$

Then $\theta_1+\theta_2=1$ and

$$\frac{\tau u_1+\lambda u_2}{\tau v_1+\lambda v_2} = \theta_1\frac{u_1}{v_1} + \theta_2\frac{u_2}{v_2}.$$

Convexity of $\phi$ gives convexity of $\psi$ on $v>0$; the case $v=0$ follows by lower semicontinuity of the recession value. In the measure case, weak-$\ast$ lower semicontinuity is the standard theorem for convex integral functionals with recession extension.

Proof

Let $m=\alpha+\beta$ and write $a=\d\alpha/\d m$, $b=\d\beta/\d m$. Using the perspective,

$$D_\phi(\alpha|\beta) = \int \psi(a,b)\,\d m.$$

For probability measures, $\int a\,\d m=\int b\,\d m=1$. Jensen’s inequality and $\psi(1,1)=\phi(1)=0$ give

$$D_\phi(\alpha|\beta) \ge \psi\left(\int a\,\d m,\int b\,\d m\right) =0.$$

If $\phi$ is strictly convex, equality in Jensen forces $a=b$ almost everywhere, hence $\alpha=\beta$.

Classical Examples and Topology

The following examples calibrate the strength of $\phi$-divergences. KL is sensitive to absolute continuity, while total variation gives the strong topology and therefore behaves very differently from Wasserstein-type weak metrics.

Main Families of $\phi$-Divergences

Several classical divergences fit in the same template. The power-divergence family

$$\phi_\gamma(s) = \frac{s^\gamma-\gamma s+\gamma-1}{\gamma(\gamma-1)} \qquad(\gamma\ne0,1)$$

interpolates between the Pearson $\chi^2$ divergence at $\gamma=2$, a Hellinger-type behavior at $\gamma=1/2$, and, by taking limits, the KL divergence as $\gamma\to1$ and the reverse KL or Burg entropy $\phi_0(s)=-\log s+s-1$ as $\gamma\to0$. The Hellinger divergence is often written with $\phi_H(s)=(\sqrt{s}-1)^2$; for measures with densities, $\operatorname{Hellinger}(\alpha,\beta)=\norm{\sqrt{\rho_\alpha}-\sqrt{\rho_\beta}}_{L^2}$. The Jensen–Shannon divergence is the symmetrized and bounded KL-to-the-mixture divergence

$$\operatorname{JS}(\alpha,\beta)^2 = \frac12\operatorname{KL}\!\left(\alpha\middle|\frac{\alpha+\beta}{2}\right) + \frac12\operatorname{KL}\!\left(\beta\middle|\frac{\alpha+\beta}{2}\right),$$

generated, up to the factor $1/2$ and an irrelevant affine term, by $\phi_{\operatorname{JS}}(s)=s\log s-(s+1)\log((s+1)/2)$. Total variation, generated by $|s-1|$, is exceptional because it is both a $\phi$-divergence and an integral probability metric.
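The special cases of the power-divergence family can be checked numerically. In this sketch the sample grid and the small offset used to approach the limits at 1 and 0 are arbitrary choices, and the function names are illustrative.

```python
import numpy as np

def phi_power(s, gamma):
    # Power-divergence generator, valid for gamma not in {0, 1}.
    return (s**gamma - gamma * s + gamma - 1) / (gamma * (gamma - 1))

def phi_kl(s):
    return s * np.log(s) - s + 1

def phi_burg(s):
    return -np.log(s) + s - 1

s = np.linspace(0.2, 3.0, 15)
eps = 1e-6

# Limits gamma -> 1 (KL) and gamma -> 0 (Burg / reverse KL).
gap_kl = np.max(np.abs(phi_power(s, 1 + eps) - phi_kl(s)))
gap_burg = np.max(np.abs(phi_power(s, eps) - phi_burg(s)))

# Exact special cases: gamma = 2 gives (s - 1)^2 / 2, gamma = 1/2 gives 2 (sqrt(s) - 1)^2.
gap_pearson = np.max(np.abs(phi_power(s, 2.0) - 0.5 * (s - 1) ** 2))
gap_hellinger = np.max(np.abs(phi_power(s, 0.5) - 2 * (np.sqrt(s) - 1) ** 2))
```

The constant factors $1/2$ and $2$ show that the family matches the Pearson and Hellinger generators only up to normalization.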

$\phi$-divergences through density ratios. The left panel shows normalized generators for common divergences as functions of $s=\d\alpha/\d\beta$; all curves vanish at $s=1$ up to affine normalization. The right panel shows the discrete formula $D_\phi(a|b)=\sum_i b_i\phi(a_i/b_i)$: hollow blue circles encode $b_i$, filled red circles encode $a_i$, the violet curve gives the ratios $a_i/b_i$, and orange lollipops show local KL-type contributions.

The interactive demo changes the generator family and the amount of mismatch between two discrete histograms. The near-zero control deliberately creates small target bins, making the recession and singularity behavior visible: ratio-based penalties react to overlap and density ratios rather than to spatial displacement.

Interactive panel. Use the divergence and ratio controls to compare convex generators and their dual penalties around density ratio one.

Variational Dual Formula

The following formula turns a pointwise density-ratio penalty into a dual optimization problem over test functions. It is the analogue, for $\phi$-divergences, of the Kantorovich dual formula for transport costs.

Proof

First assume $\phi'_\infty=+\infty$, so the divergence is infinite unless $\alpha$ has a density $\rho\ge0$ with respect to $\beta$. The Legendre–Fenchel transform of $D_\phi(\cdot|\beta)$ is

$$D_\phi^*(f|\beta) = \sup_{\rho\ge0} \int_\X f(x)\rho(x)\,\d\beta(x) - \int_\X\phi(\rho(x))\,\d\beta(x) = \int_\X \sup_{\rho(x)\ge0} \left(f(x)\rho(x)-\phi(\rho(x))\right) \d\beta(x).$$

This is the displayed integral of $\phi^{*,\ge0}$. Fenchel–Moreau gives the dual expression. For a general entropy, the same argument is applied to the perspective with its recession term; the singular part is encoded by the effective domain of $\phi^{*,\ge0}$.
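For a concrete instance of the dual formula, take the KL generator $\phi(s)=s\log s-s+1$, whose conjugate over $s\ge0$ is $e^u-1$. The sketch below assumes strictly positive discrete weights and uses illustrative names. It checks that the dual objective equals the primal value at $f=\log(a/b)$ and that randomly perturbed potentials do not exceed it.

```python
import numpy as np

def kl(a, b):
    # Primal value with phi(s) = s log s - s + 1, assuming a, b > 0.
    return float(np.sum(a * np.log(a / b) - a + b))

def kl_dual_objective(f, a, b):
    # <f, a> - sum_i b_i phi^*(f_i), with phi^*(u) = exp(u) - 1.
    return float(f @ a - b @ (np.exp(f) - 1.0))

rng = np.random.default_rng(0)
a = rng.random(6)
a /= a.sum()
b = rng.random(6)
b /= b.sum()

f_star = np.log(a / b)                      # pointwise optimal potential
primal = kl(a, b)
dual_at_optimum = kl_dual_objective(f_star, a, b)

# Any other potential gives a lower bound on the divergence.
dual_perturbed = [
    kl_dual_objective(f_star + 0.3 * rng.normal(size=6), a, b)
    for _ in range(100)
]
```

The lower-bound property is what makes the dual form usable with a restricted potential class: a suboptimal potential underestimates the divergence but never overestimates it.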

GANs via Duality

GANs fit naturally into the dual viewpoint: the discriminator is a parameterized potential and the generator moves a reference measure. This section first explains the original divergence-based GAN objective, then contrasts it with integral probability metrics such as MMD and Wasserstein distances.

The goal is to fit a generative parametric model $\alpha_\theta=(g_\theta)_\sharp\zeta$ to empirical data

$$\beta=\frac1m\sum_j\delta_{y_j},$$

where $\zeta$ is a fixed density over the latent space and $g_\theta:\mathcal{Z}\to\X$ is the generator, often a neural network.

Divergence-Based Adversarial Losses

Any $\phi$-divergence can be written in adversarial form through the dual formula:

$$\min_\theta D_\phi(\alpha_\theta|\beta) = \min_\theta\sup_f \left\{ \int_\X f\,\d\alpha_\theta - D_\phi^*(f|\beta) \right\} = \min_\theta\sup_f \left\{ \int_\mathcal{Z} f(g_\theta(z))\,\d\zeta(z) - \frac1m\sum_j\phi^*(f(y_j)) \right\}.$$

Replacing the unrestricted potential $f$ by a neural network $f_\xi$ gives a saddle problem

$$\min_\theta\max_\xi \int_\mathcal{Z} f_\xi(g_\theta(z))\,\d\zeta(z) - \frac1m\sum_j\phi^*(f_\xi(y_j)).$$

The original vanilla GAN (Goodfellow et al., 2014) is this construction for the Jensen–Shannon generator

$$\phi_{\operatorname{JS}}(s) = s\log s-(s+1)\log\frac{s+1}{2}, \qquad \phi_{\operatorname{JS}}^*(u) = -\log(2-e^u), \quad u<\log2,$$

up to affine normalizations and the usual reparametrization of the potential by a discriminator with values in $(0,1)$. In practice the min–max problem is solved by alternating stochastic gradient descent/ascent. Unlike the convex-concave variational formula, the neural parametrization is nonconvex in $\theta$ and nonconcave in $\xi$, which explains instability and mode-collapse pathologies. These losses estimate density ratios, which is meaningful when the measures overlap but can saturate when the model and data are mutually singular. For example, the Jensen–Shannon divergence is already maximal for disjoint supports.
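Both the saturation for disjoint supports and the closed-form conjugate can be checked on small examples. The histograms, the grid used for the conjugate, and all names below are illustrative assumptions.

```python
import numpy as np

def kl(a, m):
    # sum_i a_i log(a_i / m_i) with the convention 0 log 0 = 0.
    on = a > 0
    return float(np.sum(a[on] * np.log(a[on] / m[on])))

def js_squared(a, b):
    # Symmetrized KL to the mixture, as in the definition of JS^2.
    m = 0.5 * (a + b)
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

def block(n, start, width):
    # Uniform histogram on `width` consecutive bins.
    h = np.zeros(n)
    h[start:start + width] = 1.0 / width
    return h

n = 100
a = block(n, 0, 10)
# Disjoint supports at decreasing distance from a.
js_values = [js_squared(a, block(n, start, 10)) for start in (80, 40, 10)]
js_same = js_squared(a, a)

def phi_js(s):
    return s * np.log(s) - (s + 1) * np.log((s + 1) / 2)

# Numerical conjugate sup_s (u s - phi_JS(s)) on a fine grid, for one u < log 2.
u = 0.3
s = np.linspace(1e-6, 50.0, 500001)
conj_numeric = float(np.max(u * s - phi_js(s)))
conj_closed = float(-np.log(2 - np.exp(u)))
```

In this example the Jensen–Shannon value equals $\log 2$ for every disjoint pair, regardless of how far apart the supports are, so it carries no information about their distance.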

Dual Norms and Integral Probability Metrics

Instead of a density-ratio divergence, one can minimize an integral probability metric:

$$\min_\theta\norm{\alpha_\theta-\beta}_B = \min_\theta \sup_{f\in B} \left\{ \int_\mathcal{Z} f(g_\theta(z))\,\d\zeta(z) - \frac1m\sum_j f(y_j) \right\}.$$

MMD-GANs take $B$ to be a unit ball in an RKHS (Dziugaite et al., 2015); Wasserstein GANs take $B$ to be a Lipschitz ball, following Kantorovich–Rubinstein duality (Arjovsky et al., 2017; Frogner et al., 2015). The advantage is topological: for bounded continuous RKHS balls, or for bounded Lipschitz balls on compact spaces, the objective is weakly continuous. It can therefore compare singular empirical and generated measures through test functions instead of requiring pointwise density ratios. The price is that the discriminator class must be controlled geometrically, either by a kernel norm, a Lipschitz constraint, or a related regularization.

Wasserstein GANs originally used weight clipping as a proxy for enforcing $f_\xi\in\{f:\operatorname{Lip}(f)\le1\}$. The set of functions reachable with clipped weights is both smaller than the true Lipschitz ball and nonconvex, so clipping should be understood as a practical heuristic rather than a faithful implementation of the Kantorovich–Rubinstein dual constraint.
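To make the heuristic concrete, here is a minimal NumPy sketch of weight clipping on a hypothetical two-layer ReLU critic. The architecture, the clipping threshold `c`, and the helper names are assumptions made for illustration. The Lipschitz bound used is the standard product of layer norms, which is generally loose.

```python
import numpy as np

def critic(x, W1, b1, w2):
    # Hypothetical two-layer ReLU critic f(x) = w2 . relu(W1 x + b1).
    return np.maximum(x @ W1.T + b1, 0.0) @ w2

def clip_params(params, c):
    # Weight clipping: project every parameter entry onto [-c, c].
    return [np.clip(p, -c, c) for p in params]

def lipschitz_bound(W1, w2):
    # ReLU is 1-Lipschitz, so Lip(f) <= ||w2||_2 * ||W1||_op.
    return float(np.linalg.norm(w2) * np.linalg.norm(W1, 2))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 2))
b1 = rng.normal(size=16)
w2 = rng.normal(size=16)

W1c, b1c, w2c = clip_params([W1, b1, w2], c=0.05)
bound = lipschitz_bound(W1c, w2c)

# Empirical slopes between random pairs never exceed the bound.
x = rng.normal(size=(2000, 2))
y = rng.normal(size=(2000, 2))
gaps = np.abs(critic(x, W1c, b1c, w2c) - critic(y, W1c, b1c, w2c))
empirical = float(np.max(gaps / np.linalg.norm(x - y, axis=1)))
```

With this threshold the clipped critic has a Lipschitz constant far below 1, which illustrates how clipping controls the slope only indirectly and can make the critic class much smaller than the unit Lipschitz ball.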

References
  1. Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2, 299–318.
  2. Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), 28(1), 131–142.
  3. Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., & Lanckriet, G. R. (2009). On integral probability metrics, ϕ-divergences and binary classification. arXiv Preprint arXiv:0901.2698.
  4. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 2672–2680.
  5. Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 214–223). PMLR. https://proceedings.mlr.press/v70/arjovsky17a.html
  6. Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., & Lanckriet, G. R. (2012). On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6, 1550–1599.
  7. Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G., & Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. Proceedings of the 21st Annual Conference on Learning Theory, 111–122.
  8. Hanin, L. G. (1992). Kantorovich-Rubinstein norm and its application in the theory of Lipschitz spaces. Proceedings of the American Mathematical Society, 115(2), 345–352.
  9. Lellmann, J., Lorenz, D. A., Schönlieb, C., & Valkonen, T. (2014). Imaging with Kantorovich–Rubinstein discrepancy. SIAM Journal on Imaging Sciences, 7(4), 2833–2859.
  10. Berg, C., Christensen, J. P. R., & Ressel, P. (1984). Harmonic Analysis on Semigroups. Springer Verlag.
  11. Schoenberg, I. J. (1938). Metric spaces and positive definite functions. Transactions of the American Mathematical Society, 44(3), 522–536. https://doi.org/10.1090/S0002-9947-1938-1501980-0
  12. Székely, G. J., & Rizzo, M. L. (2004). Testing for equal distributions in high dimension. InterStat, 5(16.10).
  13. Wendland, H. (2005). Scattered Data Approximation. Cambridge University Press.
  14. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar), 723–773.
  15. Muandet, K., Fukumizu, K., Sriperumbudur, B., & Schölkopf, B. (2017). Kernel mean embedding of distributions: a review and beyond. Foundations and Trends in Machine Learning, 10(1–2), 1–141.