Published as a conference paper at ICLR 2023

FUNCTION-SPACE REGULARIZED RÉNYI DIVERGENCES

Jeremiah Birrell¹, Yannis Pantazis², Paul Dupuis³, Luc Rey-Bellet¹, Markos A. Katsoulakis¹
¹University of Massachusetts, Amherst, ²Foundation for Research & Technology - Hellas, ³Brown University
{jbirrell, luc, markos}@umass.edu, pantazis@iacm.forth.gr, paul_dupuis@brown.edu

ABSTRACT

We propose a new family of regularized Rényi divergences parametrized not only by the order α but also by a variational function space. These new objects are defined by taking the infimal convolution of the standard Rényi divergence with the integral probability metric (IPM) associated with the chosen function space. We derive a novel dual variational representation that can be used to construct numerically tractable divergence estimators. This representation avoids risk-sensitive terms and therefore exhibits lower variance, making it well-behaved when α > 1; this addresses a notable weakness of prior approaches. We prove several properties of these new divergences, showing that they interpolate between the classical Rényi divergences and IPMs. We also study the α → ∞ limit, which leads to a regularized worst-case regret and a new variational representation in the classical case. Moreover, we show that the proposed regularized Rényi divergences inherit features from IPMs, such as the ability to compare distributions that are not absolutely continuous, e.g., empirical measures and distributions with low-dimensional support. We present numerical results on both synthetic and real datasets, showing the utility of these new divergences in both estimation and GAN training applications; in particular, we demonstrate significantly reduced variance and improved training performance.

1 INTRODUCTION

Rényi divergence, Rényi (1961), is a significant extension of the Kullback-Leibler (KL) divergence with numerous applications; see, e.g., Van Erven & Harremos (2014). The recent neural-based estimators for divergences, Belghazi et al. (2018), along with generative adversarial networks (GANs), Goodfellow et al. (2014), accelerated the use of divergences in the field of deep learning. Neural-based divergence estimators are made feasible by variational representation formulas. These formulas are essentially lower bounds (and, occasionally, upper bounds) which are approximated by tractable statistical averages. The estimation of a divergence based on variational formulas is a notoriously difficult problem. Challenges include potentially high bias that may require an exponential number of samples, McAllester & Stratos (2020), or exponential statistical variance for certain variational estimators, Song & Ermon (2019), rendering divergence estimation both data inefficient and computationally expensive. This is especially prominent for Rényi divergences of order larger than 1. Indeed, numerical simulations have shown that, unless the distributions P and Q are very close to one another, the Rényi divergence $R_\alpha(P\|Q)$ is almost intractable to estimate when α > 1 due to the high variance of the statistically-approximated risk-sensitive observables, Birrell et al. (2021); see also the recent analysis in Lee & Shin (2022). A similar issue has also been observed for the KL divergence, Song & Ermon (2019).
Overall, the lack of estimators with low variance for Rényi divergences has prevented widespread and accessible experimentation with this class of information-theoretic tools, except in very special cases. We hope our results here will provide a suitable set of tools to address this gap in the methodology.

One approach to variance reduction is the development of new variational formulas. This direction has been especially fruitful for the estimation of mutual information, van den Oord et al. (2018); Cheng et al. (2020). Another approach is to regularize the divergence by restricting the function space of the variational formula. Instead of directly attacking the variance issue, the function space of the variational formula can be restricted, for instance, by bounding the test functions or, more appropriately, by bounding their derivatives. The latter regularization leads to Lipschitz function spaces, which are also foundational to integral probability metrics (IPMs) and, more specifically, to the duality property of the Wasserstein metric. In this paper we combine the above two approaches, first deriving a new variational representation of the classical Rényi divergences and then regularizing via an infimal convolution as follows:
$$
R^{\Gamma,\mathrm{IC}}_\alpha(P\|Q) := \inf_{\eta}\left\{ R_\alpha(P\|\eta) + W^{\Gamma}(Q,\eta)\right\}, \tag{1}
$$
where P and Q are the probability distributions being compared, the infimum is over the space of probability measures, $R_\alpha$ is the classical Rényi divergence, and $W^{\Gamma}$ is the IPM corresponding to the chosen regularizing function space Γ. The new family of regularized Rényi divergences developed here addresses the risk-sensitivity issue inherent in prior approaches. More specifically, our contributions are as follows.

- We define a new family of function-space regularized Rényi divergences via the infimal convolution operator between the classical Rényi divergence and an arbitrary IPM (1). The new regularized Rényi divergences inherit their function space from the IPM. For instance, they inherit mass transport properties when one regularizes using the 1-Wasserstein metric.
- We derive a dual variational representation (11) of the regularized Rényi divergences which avoids risk-sensitive terms and can therefore be used to construct lower-variance statistical estimators.
- We prove a series of properties for the new object: (a) the divergence property, (b) being bounded by the minimum of the Rényi divergence and the IPM, thus allowing for the comparison of non-absolutely continuous distributions, (c) limits as α → 1 from both the left and the right, (d) regimes in which the limiting cases $R_\alpha(P\|Q)$ and $W^{\Gamma}(Q,P)$ are recovered.
- We propose a rescaled version of the regularized Rényi divergences (16) which leads to a new variational formula for the worst-case regret (i.e., α → ∞). This new variational formula does not involve the essential supremum of the density ratio, as in the classical definition of worst-case regret, thereby avoiding risk-sensitive terms.
- We present a series of illustrative examples and counterexamples that further motivate the proposed definition of the function-space regularized Rényi divergences.
- We present numerical experiments showing (a) that we can estimate the new divergence for large values of the order α without variance issues and (b) that we can train GANs using regularized function spaces.

Related work. The order of the Rényi divergence controls the weight put on the tails, with the limiting cases being mode-covering and mode-selection, Minka (2005).
Rényi divergence estimation is used in a number of applications, including Sajid et al. (2022) (behavioural sciences), Mironov (2017) (differential privacy), and Li & Turner (2016) (variational inference); in the latter the variational formula is an adaptation of the evidence lower bound. Rényi divergences have also been applied in the training of GANs in Bhatia et al. (2021) (loss function for binary classification - discrete case) and in Pantazis et al. (2022) (continuous case, based on the Rényi-Donsker-Varadhan variational formula in Birrell et al. (2021)). Rényi divergences with α > 1 are also used in contrastive representation learning, Lee & Shin (2022), as well as in PAC-Bayesian bounds, Bégin et al. (2016). In the context of uncertainty quantification and sensitivity analysis, Rényi divergences provide confidence bounds for rare events, Atar et al. (2015); Dupuis et al. (2020), with higher rarity corresponding to larger α.

Variance reduction for divergence estimators through control of the function space has recently been proposed. In Song & Ermon (2019) an explicit bound on the output restricts the divergence values. A systematic theoretical framework for regularizing through the function space has been developed in Dupuis, Paul & Mao, Yixiang (2022); Birrell et al. (2022a) for the KL and f-divergences. Despite not covering the Rényi divergence, the theory in Dupuis, Paul & Mao, Yixiang (2022); Birrell et al. (2022a), and particularly the infimal-convolution formulation, clearly inspired the current work. However, adapting the infimal-convolution method to the Rényi divergence setting requires two new technical innovations: (a) We develop a new low-variance convex-conjugate variational formula for the classical Rényi divergence in Theorem 2.1 (see also Fig. 1), allowing us to apply infimal-convolution tools to develop the new Γ-Rényi divergences in Theorem 3.4. (b) We study the α → ∞ limit of (a) to obtain a new low-variance variational representation of worst-case regret in Theorem 2.2 and study its Γ-regularization in Theorem 4.5.

2 NEW VARIATIONAL REPRESENTATIONS OF CLASSICAL RÉNYI DIVERGENCES

The Rényi divergence of order α ∈ (0, 1) ∪ (1, ∞) between P and Q, denoted $R_\alpha(P\|Q)$, can be defined as follows: Let ν be a sigma-finite positive measure with $dP = p\,d\nu$ and $dQ = q\,d\nu$. Then
$$
R_\alpha(P\|Q) := \begin{cases} \frac{1}{\alpha(\alpha-1)}\log\left[\int_{q>0} p^\alpha q^{1-\alpha}\,d\nu\right] & \text{if } 0<\alpha<1\text{, or } \alpha>1 \text{ and } P\ll Q\,,\\ +\infty & \text{if } \alpha>1 \text{ and } P\not\ll Q\,, \end{cases} \tag{2}
$$
where $P\ll Q$ denotes absolute continuity of P with respect to Q. There always exists such a ν (e.g., ν = P + Q) and one can show that the definition (2) does not depend on the choice of ν. The $R_\alpha$ provide a notion of distance between P and Q in that they satisfy the divergence property, i.e., they are non-negative and equal zero iff Q = P. The limit of $R_\alpha$ as α approaches 1 or 0 equals the KL or reverse KL divergence, respectively, Van Erven & Harremos (2014). An alternative representation of $R_\alpha$, the so-called Rényi-Donsker-Varadhan variational formula, was derived from (2) in Birrell et al. (2021):
$$
R_\alpha(P\|Q) = \sup_{\phi\in\mathcal{M}_b(\Omega)}\left\{\frac{1}{\alpha-1}\log\int e^{(\alpha-1)\phi}\,dP - \frac{1}{\alpha}\log\int e^{\alpha\phi}\,dQ\right\}, \quad P,Q\in\mathcal{P}(\Omega)\,. \tag{3}
$$
Here $(\Omega,\mathcal{M})$ denotes a measurable space, $\mathcal{M}_b(\Omega)$ the space of bounded measurable real-valued functions on Ω, and $\mathcal{P}(\Omega)$ the space of probability measures on Ω. By a change-of-variables argument this can be transformed into the following new variational representation; see Theorem A.2 in Appendix A for a proof. We call it the convex-conjugate Rényi variational formula (CC-Rényi).
Theorem 2.1 (Convex-Conjugate Rényi Variational Formula). Let $P,Q\in\mathcal{P}(\Omega)$ and α ∈ (0, 1) ∪ (1, ∞). Then
$$
R_\alpha(P\|Q) = \sup_{g\in\mathcal{M}_b(\Omega):\,g<0}\left\{\int g\,dQ + \frac{1}{\alpha-1}\log\int |g|^{(\alpha-1)/\alpha}\,dP + \alpha^{-1}(\log\alpha+1)\right\}. \tag{4}
$$
If $(\Omega,\mathcal{M})$ is a metric space with the Borel σ-algebra then (4) holds with $\mathcal{M}_b(\Omega)$ replaced by $C_b(\Omega)$, the space of bounded continuous real-valued functions on Ω.

The representation (4) is of convex-conjugate type, which will be key in our development of function-space regularized Rényi divergences. It is also of independent interest as it avoids risk-sensitive terms, unlike (3), which contains cumulant-generating functions. This makes (4) better behaved in estimation problems, especially when α > 1; see the example in Section 6.1 below.

We also obtain a new variational formula for the worst-case regret, as defined by Van Erven & Harremos (2014):
$$
D_\infty(P\|Q) := \lim_{\alpha\to\infty}\alpha R_\alpha(P\|Q) = \begin{cases} \log\operatorname{ess\,sup}_P \frac{dP}{dQ}\,, & P\ll Q\,,\\ +\infty\,, & P\not\ll Q\,. \end{cases} \tag{5}
$$
In contrast to (5), which requires estimation of the likelihood ratio, the new variational formula (6) below avoids risk-sensitive terms.

Theorem 2.2 (Worst-Case Regret Variational Formula). Let $P,Q\in\mathcal{P}(\Omega)$. Then
$$
D_\infty(P\|Q) = \sup_{g\in\mathcal{M}_b(\Omega):\,g<0}\left\{\int g\,dQ + \log\int|g|\,dP\right\} + 1\,. \tag{6}
$$
If Ω is a metric space with the Borel σ-algebra then (6) holds with $\mathcal{M}_b(\Omega)$ replaced by $C_b(\Omega)$.

See Theorem A.5 in Appendix A for a proof. Equation (6) is a new result of independent interest and will also be useful in our study of the α → ∞ limit of the function-space regularized Rényi divergences that we define in the next section.

Remark 2.3. Alternative variational formulas for $D_\infty$ on a finite alphabet were derived in Kurri et al. (2022).
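To make these representations concrete, the following minimal sketch (our own illustration, not code from the paper; NumPy/SciPy with made-up two-point distributions) checks the closed form (2) against the Donsker-Varadhan representation (3) evaluated at the optimizer φ = log(p/q), and against direct numerical maximization of the convex-conjugate objectives (4) and (6).

```python
import numpy as np
from scipy.optimize import minimize

# Made-up two-point distributions with P << Q.
p = np.array([0.3, 0.7])
q = np.array([0.8, 0.2])
alpha = 3.0

# Closed form (2): R_alpha(P||Q) = log(sum p^alpha q^(1-alpha)) / (alpha(alpha-1)).
R_exact = np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha * (alpha - 1))

# Donsker-Varadhan objective (3), evaluated at phi = log(p/q), where it attains the supremum.
phi = np.log(p / q)
R_dv = (np.log(np.sum(p * np.exp((alpha - 1) * phi))) / (alpha - 1)
        - np.log(np.sum(q * np.exp(alpha * phi))) / alpha)

# Convex-conjugate objective (4): maximize over g < 0, parametrized as g = -exp(u).
def neg_cc_renyi(u):
    g = -np.exp(u)
    return -(np.sum(q * g)
             + np.log(np.sum(p * np.abs(g) ** ((alpha - 1) / alpha))) / (alpha - 1)
             + (np.log(alpha) + 1) / alpha)

opts = {"xatol": 1e-10, "fatol": 1e-10, "maxiter": 10_000}
R_cc = -minimize(neg_cc_renyi, x0=np.zeros(2), method="Nelder-Mead", options=opts).fun

# Worst-case regret: closed form (5) versus the variational objective (6).
D_exact = np.log(np.max(p / q))
def neg_wcr(u):
    g = -np.exp(u)
    return -(np.sum(q * g) + np.log(np.sum(p * np.abs(g))) + 1)
D_var = -minimize(neg_wcr, x0=np.zeros(2), method="Nelder-Mead", options=opts).fun

print(R_exact, R_dv, R_cc)  # the three Renyi values agree (up to optimizer tolerance)
print(D_exact, D_var)       # the two worst-case-regret values agree (approximately)
```

Note that the worst-case-regret supremum in (6) is approached, rather than attained, as the test function vanishes off the maximizer of p/q, which is why a numerical optimizer only reaches it approximately.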
3 PRIMAL AND DUAL FORMULATIONS OF THE INFIMAL-CONVOLUTION Γ-RÉNYI DIVERGENCES

We are now ready to define the function-space regularized Rényi divergences and derive their key properties. In this section, X will denote a compact metric space, $\mathcal{P}(X)$ the set of Borel probability measures on X, and $C(X)$ the space of continuous real-valued functions on X. We equip $C(X)$ with the supremum norm and recall that the dual space of $C(X)$ is $C(X)^* = M(X)$, the space of finite signed Borel measures on X (see the Riesz representation theorem, e.g., Theorem 7.17 in Folland (2013)).

Definition 3.1. Given a test-function space Γ ⊂ C(X), we define the infimal-convolution Γ-Rényi divergence (i.e., IC-Γ-Rényi divergence) between $P,Q\in\mathcal{P}(X)$ by
$$
R^{\Gamma,\mathrm{IC}}_\alpha(P\|Q) := \inf_{\eta\in\mathcal{P}(X)}\{R_\alpha(P\|\eta) + W^{\Gamma}(Q,\eta)\}\,, \quad \alpha\in(0,1)\cup(1,\infty)\,, \tag{7}
$$
where $W^{\Gamma}$ denotes the Γ-IPM
$$
W^{\Gamma}(\mu,\nu) := \sup_{g\in\Gamma}\left\{\int g\,d\mu - \int g\,d\nu\right\}, \quad \mu,\nu\in M(X)\,. \tag{8}
$$

Remark 3.2. The classical Rényi divergence is convex in its second argument but not in its first when α > 1, Van Erven & Harremos (2014). This is the motivation for defining the IC-Γ-Rényi divergences via an infimal convolution in the second argument of $R_\alpha$; convex analysis tools will be critical in deriving properties of $R^{\Gamma,\mathrm{IC}}_\alpha$ below. For α ∈ (0, 1) one can use the identity $R_\alpha(P\|Q) = R_{1-\alpha}(Q\|P)$ to rewrite (7) as an infimal convolution in the first argument.

The definition (7) can be thought of as a regularization of the classical Rényi divergence using the Γ-IPM. For computational purposes it is significantly more efficient to have a dual formulation, i.e., a representation of $R^{\Gamma,\mathrm{IC}}_\alpha$ in terms of a supremum over a function space. To derive such a representation we begin with the variational formula for $R_\alpha$ from Theorem 2.1. If we define the convex mapping $\Lambda^P_\alpha : C(X)\to(-\infty,\infty]$,
$$
\Lambda^P_\alpha[g] := \infty\cdot 1_{g\not<0} - \left(\frac{1}{\alpha-1}\log\int|g|^{(\alpha-1)/\alpha}\,dP + \alpha^{-1}(\log\alpha+1)\right)1_{g<0}\,, \tag{9}
$$
then (4) from Theorem 2.1 can be written as a convex conjugate,
$$
R_\alpha(P\|Q) = (\Lambda^P_\alpha)^*[Q] := \sup_{g\in C(X)}\left\{\int g\,dQ - \Lambda^P_\alpha[g]\right\}. \tag{10}
$$
One can then use Fenchel-Rockafellar duality to derive a dual formulation of the IC-Γ-Rényi divergences. To apply this theory we will need to work with spaces of test functions that satisfy the following admissibility properties. These properties are similar to those used in the construction of regularized KL and f-divergences in Dupuis, Paul & Mao, Yixiang (2022) and Birrell et al. (2022a).

Definition 3.3. We will call Γ ⊂ C(X) admissible if it is convex and contains the constant functions. We will call an admissible Γ strictly admissible if there exists a $\mathcal{P}(X)$-determining set Ψ ⊂ C(X) such that for all ψ ∈ Ψ there exist c ∈ ℝ, ϵ > 0 such that c ± ϵψ ∈ Γ. Recall that Ψ being $\mathcal{P}(X)$-determining means that for all $Q,P\in\mathcal{P}(X)$, if $\int\psi\,dQ = \int\psi\,dP$ for all ψ ∈ Ψ then Q = P.

Putting the above pieces together one obtains the following variational representation.

Theorem 3.4. Let Γ ⊂ C(X) be admissible, $P,Q\in\mathcal{P}(X)$, and α ∈ (0, 1) ∪ (1, ∞). Then:
1. $$
R^{\Gamma,\mathrm{IC}}_\alpha(P\|Q) = \sup_{g\in\Gamma:\,g<0}\left\{\int g\,dQ + \frac{1}{\alpha-1}\log\int|g|^{(\alpha-1)/\alpha}\,dP + \alpha^{-1}(\log\alpha+1)\right\}. \tag{11}
$$
2. If (11) is finite then there exists $\eta_*\in\mathcal{P}(X)$ such that
$$
R^{\Gamma,\mathrm{IC}}_\alpha(P\|Q) = \inf_{\eta\in\mathcal{P}(X)}\{R_\alpha(P\|\eta) + W^{\Gamma}(Q,\eta)\} = R_\alpha(P\|\eta_*) + W^{\Gamma}(Q,\eta_*)\,. \tag{12}
$$
3. $R^{\Gamma,\mathrm{IC}}_\alpha(P\|Q) \le \min\{R_\alpha(P\|Q),\, W^{\Gamma}(Q,P)\}$.
4. If Γ is strictly admissible then $R^{\Gamma,\mathrm{IC}}_\alpha$ has the divergence property.

See Theorem B.3 in Appendix B for detailed proofs of these results as well as several additional properties. We note that there are alternative strategies for proving the variational formula (11) which make different assumptions; further comments on this can be found in Remark B.4. Important examples of strictly admissible Γ include the following:
1. Γ = C(X), which leads to the classical Rényi divergences.
2. Γ = Lip_1(X), i.e., all 1-Lipschitz functions. This regularizes the Rényi divergences via the Wasserstein metric.
3. Γ = {c + g : c ∈ ℝ, g ∈ C(X), |g| ≤ 1}. This regularizes the Rényi divergences via the total-variation metric.
4. Γ = {c + g : c ∈ ℝ, g ∈ Lip_1(X), |g| ≤ 1}. This regularizes the Rényi divergences via the Dudley metric.
5. Γ = {c + g : c ∈ ℝ, g ∈ V, ∥g∥_V ≤ 1}, the unit ball in an RKHS V ⊂ C(X). This regularizes the Rényi divergences via MMD.

In practice, uniform bounds can be implemented using an appropriately chosen final NN layer. Lipschitz bounds can be implemented using spectral normalization of neural networks, Miyato et al. (2018), or using a soft gradient penalty, Gulrajani et al. (2017). The function space Γ for structure-preserving GANs discussed in the Appendix is implemented using equivariant neural networks, Birrell et al. (2022b). If Γ is a ball in an RKHS the implementation is carried out using the same tools used in, e.g., MMD distances and divergences, Gretton et al. (2012); Glaser et al. (2021).

The IC-Γ-Rényi divergences also satisfy a data processing inequality. See Theorem B.8 in Appendix B for a proof as well as details regarding the notation.

Theorem 3.5 (Data Processing Inequality). Let α ∈ (0, 1) ∪ (1, ∞), $Q,P\in\mathcal{P}(X)$, and let K be a probability kernel from X to Y such that $K[g]\in C(X)$ for all $g\in C(X\times Y)$. If Γ ⊂ C(Y) is admissible then $R^{\Gamma,\mathrm{IC}}_\alpha(K[P]\|K[Q]) \le R^{K[\Gamma],\mathrm{IC}}_\alpha(P\|Q)$. If Γ ⊂ C(X × Y) is admissible then $R^{\Gamma,\mathrm{IC}}_\alpha(P\otimes K\|Q\otimes K) \le R^{K[\Gamma],\mathrm{IC}}_\alpha(P\|Q)$.

If K[Γ] is strictly contained in Γ then the bounds in Theorem 3.5 can be strictly tighter than the classical data processing inequality, Van Erven & Harremos (2014). Data-processing inequalities are important for constructing symmetry-preserving GANs; see Birrell et al. (2022b) and Section D.1.
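Before turning to limits, here is a minimal sketch (our own illustration in PyTorch; the function names are ours) of how the dual objective (11) becomes a sample-based objective: raw discriminator outputs are mapped to strictly negative values by a final layer in the spirit of the poly-softplus layer used in Section 6, and the resulting Monte Carlo objective would then be maximized over a function class approximating Γ (e.g., via spectral normalization or a gradient penalty, as discussed above).

```python
import math
import torch

def poly_softplus_neg(x: torch.Tensor) -> torch.Tensor:
    """Map raw discriminator outputs to strictly negative values (C^1, poly-softplus style)."""
    neg_branch = -1.0 / (1.0 - torch.clamp(x, max=0.0))  # ~ -1/(1-x) for x < 0, decays to 0-
    pos_branch = -(1.0 + torch.clamp(x, min=0.0))        # ~ -(1+x) for x >= 0
    return torch.where(x < 0, neg_branch, pos_branch)

def ic_renyi_objective(g_on_Q: torch.Tensor, g_on_P: torch.Tensor, alpha: float) -> torch.Tensor:
    """Monte Carlo version of the quantity inside the supremum in (11).

    g_on_Q, g_on_P: values of a strictly negative test function g on samples from Q and P.
    Maximizing this over g in Gamma (with g < 0) estimates R^{Gamma,IC}_alpha(P||Q).
    """
    term_Q = g_on_Q.mean()
    term_P = torch.log((g_on_P.abs() ** ((alpha - 1.0) / alpha)).mean()) / (alpha - 1.0)
    return term_Q + term_P + (math.log(alpha) + 1.0) / alpha

# Example: evaluate the objective for a fixed (untrained) test function on toy Gaussian samples.
xP, xQ = torch.randn(1000, 2), torch.randn(1000, 2) + 0.5
raw = lambda x: x.sum(dim=1, keepdim=True)  # placeholder "discriminator"
print(ic_renyi_objective(poly_softplus_neg(raw(xQ)), poly_softplus_neg(raw(xP)), alpha=2.0))
```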
4 LIMITS, INTERPOLATIONS, AND REGULARIZED WORST-CASE REGRET

Next we use Theorem 3.4 to compute various limits of the IC-Γ-Rényi divergences. First we show that they interpolate between $R_\alpha$ and $W^{\Gamma}$ in the following sense (see Theorem B.5 for a proof).

Theorem 4.1. Let Γ ⊂ C(X) be admissible, $P,Q\in\mathcal{P}(X)$, and α ∈ (0, 1) ∪ (1, ∞).
1. $\lim_{\delta\to 0^+}\frac{1}{\delta} R^{\delta\Gamma,\mathrm{IC}}_\alpha(P\|Q) = W^{\Gamma}(Q,P)$.
2. If Γ is strictly admissible then $\lim_{L\to\infty} R^{L\Gamma,\mathrm{IC}}_\alpha(P\|Q) = R_\alpha(P\|Q)$.

Now we discuss the limiting behavior in α. These results generalize several properties of the classical Rényi divergences, Van Erven & Harremos (2014). First we consider the α → 1 limit; see Theorem B.6 for a proof.

Theorem 4.2. Let Γ ⊂ C(X) be admissible and $P,Q\in\mathcal{P}(X)$. Then
$$
\lim_{\alpha\to 1^+} R^{\Gamma,\mathrm{IC}}_\alpha(P\|Q) = \inf_{\eta\in\mathcal{P}(X):\,\exists\beta>1,\,R_\beta(P\|\eta)<\infty}\{R(P\|\eta) + W^{\Gamma}(Q,\eta)\}\,, \tag{13}
$$
$$
\lim_{\alpha\to 1^-} R^{\Gamma,\mathrm{IC}}_\alpha(P\|Q) = \inf_{\eta\in\mathcal{P}(X)}\{R(P\|\eta) + W^{\Gamma}(Q,\eta)\} \tag{14}
$$
$$
= \sup_{g\in\Gamma:\,g<0}\left\{\int g\,dQ + \int\log|g|\,dP\right\} + 1\,, \tag{15}
$$
where $R(P\|\eta)$ denotes the KL divergence.

Remark 4.3. When Γ = C(X), changing variables to $g = -\exp(\phi-1)$ transforms (15) into the Legendre-transform variational formula for $R(P\|Q)$; see equation (1) in Birrell et al. (2022c) with $f(x) = x\log(x)$. Eq. (14) is an infimal convolution of the reverse KL divergence, as opposed to the results in Dupuis, Paul & Mao, Yixiang (2022), which apply to the (forward) KL divergence.

Function-space regularized worst-case regret. Next we investigate the α → ∞ limit of the IC-Γ-Rényi divergences, which will lead to the function-space regularized worst-case regret. First recall that some authors use an alternative definition of the classical Rényi divergences, related to the one used in this paper by $D_\alpha(\cdot\|\cdot) := \alpha R_\alpha(\cdot\|\cdot)$. This alternative definition has the useful property of being non-decreasing in α; see Van Erven & Harremos (2014). Appropriately rescaled, the IC-Γ-Rényi divergence also satisfies this property, leading to the following definition.

Definition 4.4. For Γ ⊂ C(X), α ∈ (0, 1) ∪ (1, ∞) and $P,Q\in\mathcal{P}(X)$ we define
$$
D^{\Gamma,\mathrm{IC}}_\alpha(P\|Q) := \alpha R^{\Gamma/\alpha,\mathrm{IC}}_\alpha(P\|Q)\,. \tag{16}
$$
Note that $\alpha R^{\Gamma/\alpha,\mathrm{IC}}_\alpha(P\|Q)$ is non-decreasing in α; see Lemma B.1 for a proof. We now show that the divergences $D^{\Gamma,\mathrm{IC}}_\alpha$ are well behaved in the α → ∞ limit, generalizing (5). Taking this limit provides a definition of function-space regularized worst-case regret, along with the following dual variational representation.

Theorem 4.5. Let Γ ⊂ C(X) be admissible and $P,Q\in\mathcal{P}(X)$. Then
$$
D^{\Gamma,\mathrm{IC}}_\infty(P\|Q) := \lim_{\alpha\to\infty} D^{\Gamma,\mathrm{IC}}_\alpha(P\|Q) = \inf_{\eta\in\mathcal{P}(X)}\{D_\infty(P\|\eta) + W^{\Gamma}(Q,\eta)\} \tag{17}
$$
$$
= \sup_{g\in\Gamma:\,g<0}\left\{\int g\,dQ + \log\int|g|\,dP\right\} + 1\,. \tag{18}
$$
We call $D^{\Gamma,\mathrm{IC}}_\infty$ the infimal-convolution Γ-worst-case regret (i.e., IC-Γ-WCR). The method of proof of Theorem 4.5 is similar to that of part (1) of Theorem 3.4; see Theorem B.7 in Appendix B for details. Theorem 4.5 suggests that $D^{\Gamma,\mathrm{IC}}_\alpha$ is the appropriate α-scaling to use when α is large, and we find this to be the case in practice; see the example in Section 6.3.1.
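The rescaling in (16)-(18) can be seen directly at the level of the dual objectives: for a fixed negative test function g, the α-dependent term of the scaled objective, $\frac{\alpha}{\alpha-1}\log\int|g|^{(\alpha-1)/\alpha}\,dP$, equals the logarithm of the power mean of $|g|$ under P of order $s=(\alpha-1)/\alpha$, which is non-decreasing in s and converges to $\log\int|g|\,dP$, so the scaled IC-Rényi objective approaches the IC-Γ-WCR objective (18). A minimal numerical illustration of this (our own sketch, with made-up Gaussian samples and an arbitrary negative test function):

```python
import numpy as np

rng = np.random.default_rng(0)
xP = rng.normal(0.0, 1.0, size=5_000)   # samples from P (made up)
xQ = rng.normal(1.0, 1.0, size=5_000)   # samples from Q (made up)
g = lambda x: -np.exp(-x)               # an arbitrary strictly negative test function

gP, gQ = g(xP), g(xQ)
wcr_obj = gQ.mean() + np.log(np.abs(gP).mean()) + 1.0  # IC-Gamma-WCR objective (18) at g

for alpha in [2.0, 5.0, 20.0, 100.0, 1000.0]:
    # alpha times the objective in (11) evaluated at g/alpha, i.e. the objective behind (16).
    scaled = (gQ.mean()
              + (alpha / (alpha - 1.0)) * np.log((np.abs(gP) ** ((alpha - 1.0) / alpha)).mean())
              + 1.0)
    print(alpha, scaled)                # non-decreasing in alpha, approaching wcr_obj
print("alpha -> infinity:", wcr_obj)
```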
5 ANALYTICAL EXAMPLES AND COUNTEREXAMPLES

In this section we present several analytical examples and counterexamples that illustrate important properties of the IC-Γ-Rényi divergences and demonstrate weaknesses of other attempts to define regularized Rényi divergences. In particular, we show that other attempts at regularizing Rényi divergences fail to inherit important properties from the Γ-IPM. More details on the computations can be found in Appendix C.

Infimal convolution and scaling limits: First we present a simple example that illustrates the infimal convolution formula and limiting properties from Sections 3 and 4. Let $P = \delta_0$, $Q_{x,c} = c\delta_0 + (1-c)\delta_x$ for c ∈ (0, 1), x > 0, and let Γ = Lip_1. Then for L > 0 one can compute
$$
R^{L\Gamma,\mathrm{IC}}_\alpha(P\|Q_{x,c}) = \begin{cases} (1-c)Lx\,, & 0 < \alpha Lx < 1\,,\\ \alpha^{-1} - cLx + \alpha^{-1}\log(\alpha Lx)\,, & 1 \le \alpha Lx \le 1/c\,,\\ \alpha^{-1}\log(1/c)\,, & \alpha Lx > 1/c\,. \end{cases} \tag{19}
$$
In particular, it is straightforward to show that $R^{L\Gamma,\mathrm{IC}}_\alpha(P\|Q_{x,c}) \le (1-c)Lx = W^{L\Gamma}(Q_{x,c},P)$, $\lim_{x\to 0^+} R^{L\Gamma,\mathrm{IC}}_\alpha(P\|Q_{x,c}) = \lim_{x\to 0^+}(1-c)Lx = 0$, and $\lim_{L\to\infty} R^{L\Gamma,\mathrm{IC}}_\alpha(P\|Q_{x,c}) = \alpha^{-1}\log(1/c) = R_\alpha(P\|Q_{x,c})$. We can also rewrite this in terms of the solution to the infimal convolution problem and take the worst-case-regret scaling limit as follows:
$$
R^{L\Gamma,\mathrm{IC}}_\alpha(P\|Q_{x,c}) = \begin{cases} W^{L\Gamma}(Q_{x,c},P)\,, & 0 < \alpha Lx < 1\,,\\ R_\alpha(P\|Q_{x,1/(\alpha Lx)}) + W^{L\Gamma}(Q_{x,c},Q_{x,1/(\alpha Lx)})\,, & 1 \le \alpha Lx \le 1/c\,,\\ R_\alpha(P\|Q_{x,c})\,, & \alpha Lx > 1/c\,, \end{cases}
$$
$$
\lim_{\alpha\to\infty}\alpha R^{\Gamma/\alpha,\mathrm{IC}}_\alpha(P\|Q_{x,c}) = \begin{cases} W^{\Gamma}(Q_{x,c},P)\,, & 0 < x < 1\,,\\ D_\infty(P\|Q_{x,1/x}) + W^{\Gamma}(Q_{x,c},Q_{x,1/x})\,, & 1 \le x \le 1/c\,,\\ D_\infty(P\|Q_{x,c})\,, & x > 1/c\,. \end{cases} \tag{20}
$$

Γ-Rényi-Donsker-Varadhan counterexample: As an alternative to Definition 3.1, one can attempt to regularize the Rényi divergences by restricting the test-function space in the variational representation (3), leading to the Γ-Rényi-Donsker-Varadhan (Γ-Rényi-DV) divergences
$$
R^{\Gamma,\mathrm{DV}}_\alpha(P\|Q) := \sup_{\phi\in\Gamma}\left\{\frac{1}{\alpha-1}\log\int e^{(\alpha-1)\phi}\,dP - \frac{1}{\alpha}\log\int e^{\alpha\phi}\,dQ\right\}. \tag{21}
$$
The bound $\log\int e^{c\phi}\,dP \ge c\int\phi\,dP$ for all ϕ ∈ Γ, c ∈ ℝ implies that $R^{\Gamma,\mathrm{DV}}_\alpha \le W^{\Gamma}$ for α ∈ (0, 1), making (21) a useful regularization of the Rényi divergences in this case; this utility was demonstrated in Pantazis et al. (2022), where it was used to construct GANs. However, estimators built from the representation (21) (i.e., replacing P and Q by empirical measures) are known to be numerically unstable when α > 1. Below we provide a counterexample showing that, unlike for the IC-Γ-Rényi divergences, $R^{\Gamma,\mathrm{DV}}_\alpha \not\le W^{\Gamma}$ in general when α > 1. We conjecture that this is a key reason for the instability of Γ-Rényi-Donsker-Varadhan estimators when α > 1. Let $P_{x,c} = c\delta_0 + (1-c)\delta_x$, $Q = \delta_0$ for x > 0, c ∈ (0, 1), and let $\Gamma_L = \mathrm{Lip}_L$. Then for α > 1 we have $R^{\Gamma_L,\mathrm{DV}}_\alpha(P_{x,c}\|Q) = \frac{1}{\alpha-1}\log\left(c + (1-c)\exp((\alpha-1)Lx)\right)$ and $W^{\Gamma_L}(P_{x,c},Q) = (1-c)Lx$. Using strict concavity of the logarithm one can then obtain the bound
$$
R^{\Gamma_L,\mathrm{DV}}_\alpha(P_{x,c}\|Q) > W^{\Gamma_L}(P_{x,c},Q)\,. \tag{22}
$$
This shows that, when α > 1, Γ-Rényi-DV violates the key property that allows the IC-Γ-Rényi divergences to inherit properties from the corresponding Γ-IPM. Another alternative is to begin with (3) and then reduce the test-function space to $\frac{1}{\alpha}\log(\Gamma)$, where the logarithm is introduced to eliminate the exponential functions in (21). However, this definition also fails to provide an appropriate regularized Rényi divergence; in particular, it is incapable of meaningfully comparing Dirac distributions. See Appendix C.3 for details. These counterexamples lend further credence to our infimal-convolution based regularization approach (7).

6 NUMERICAL EXPERIMENTS

In this section we present numerical examples that demonstrate the use of the IC-Γ-Rényi divergences for both estimation and training of GANs (additional examples can be found in Appendix D). All of the divergences considered in this paper have a variational representation of the form $D(P\|Q) = \sup_{g\in\Gamma} H[g;P,Q]$ for some objective functional H; we use the corresponding estimator
$$
\widehat{D}_n(P\|Q) := \sup_{\theta\in\Theta} H[g_\theta;P_n,Q_n]\,, \tag{23}
$$
where $P_n$, $Q_n$ are n-sample empirical measures and $g_\theta$ is a family of neural networks (NNs) with parameters θ ∈ Θ.
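As a concrete instance of (23) for the IC-Γ-Rényi divergence with Γ = Lip_1, the following self-contained PyTorch sketch (our own illustration; the architecture, penalty weight, and toy data are made up and not taken from the paper) maximizes the objective from (11) over a small neural network whose output is forced to be negative by an abs-style final layer, with the Lipschitz constraint relaxed to the soft one-sided gradient penalty described in the next paragraph.

```python
import math
import torch
import torch.nn as nn

alpha, gp_weight, n, dim = 2.0, 10.0, 2000, 5

# Toy problem: estimate R^{Gamma,IC}_alpha between two Gaussians (made-up parameters).
P_samples = torch.randn(n, dim)
Q_samples = torch.randn(n, dim) + 1.0

class NegDiscriminator(nn.Module):
    """Small fully connected network; the final layer forces strictly negative outputs."""
    def __init__(self, dim: int, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, width), nn.ReLU(), nn.Linear(width, 1))
    def forward(self, x):
        return -self.net(x).abs() - 1e-6  # "-abs" style final layer, shifted so that g < 0 strictly

def ic_renyi_objective(gQ: torch.Tensor, gP: torch.Tensor, alpha: float) -> torch.Tensor:
    # Monte Carlo version of the quantity inside the supremum in (11).
    return (gQ.mean()
            + torch.log((gP.abs() ** ((alpha - 1.0) / alpha)).mean()) / (alpha - 1.0)
            + (math.log(alpha) + 1.0) / alpha)

def one_sided_gradient_penalty(g: nn.Module, xP: torch.Tensor, xQ: torch.Tensor) -> torch.Tensor:
    # Soft relaxation of the 1-Lipschitz constraint defining Gamma = Lip_1:
    # penalize gradient norms exceeding 1 at points interpolated between P and Q samples.
    eps = torch.rand(xP.size(0), 1)
    x = (eps * xP + (1.0 - eps) * xQ).requires_grad_(True)
    grad = torch.autograd.grad(g(x).sum(), x, create_graph=True)[0]
    return ((grad.norm(2, dim=1) - 1.0).clamp(min=0.0) ** 2).mean()

g = NegDiscriminator(dim)
opt = torch.optim.Adam(g.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    obj = ic_renyi_objective(g(Q_samples), g(P_samples), alpha)  # estimator objective, cf. (23)
    loss = -obj + gp_weight * one_sided_gradient_penalty(g, P_samples, Q_samples)
    loss.backward()
    opt.step()

print("estimated IC-Gamma-Renyi divergence:", obj.item())
```

Replacing ic_renyi_objective with the IC-Γ-WCR objective (18) gives the corresponding α = ∞ estimator.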
For Lipschitz function spaces we weaken the Lipschitz constraint to a soft 1-sided gradient penalty (see Section 4.1 of Birrell et al. (2022a)). Optimization is performed using the Adam optimizer, Kingma & Ba (2014). For the infimal-convolution divergences we enforce negativity of the test function (i.e., the discriminator) using a final layer having one of the following forms: 1) $-\mathrm{abs}(x)$ or 2) $-\left(\frac{1}{1-x}1_{x<0} + (1+x)1_{x\ge 0}\right)$. The latter, which we term poly-softplus, is $C^1$ and decays like $O(|x|^{-1})$ as $x\to-\infty$.

6.1 VARIANCE OF RÉNYI ESTIMATORS

As a first example, we compare estimators of the classical Rényi divergences (i.e., without regularization) constructed from DV-Rényi (3) and CC-Rényi (4) in a simple case where the exact Rényi divergence is known. We let Q and P be 1000-dimensional Gaussians with equal variance and study $R_\alpha(P\|Q)$ as a function of the separation between their means. The results are shown in Figure 1. We see that the estimator based on the convex-conjugate Rényi variational formula (4) has smaller variance and mean-squared error (MSE) than the estimator based on the Rényi-Donsker-Varadhan variational formula (3), with the difference becoming very large when α ≫ 1 or when P and Q are far apart (i.e., when $\mu_q$ is large). The Rényi-Donsker-Varadhan estimator only works well when $\mu_q$ and α are both not too large, but even in such cases the convex-conjugate Rényi estimator generally performs better. We conjecture that this difference is due to the presence of risk-sensitive terms in (3), which were eliminated in the new representation (4). We note that the NN for the convex-conjugate Rényi estimator used the poly-softplus final layer, as we found the $-\mathrm{abs}$ final layer to result in a significant percentage of failed runs (i.e., NaN outputs); this issue did not arise when using poly-softplus. We do not show results for either DV-WCR or CC-WCR here, as the exact divergence is infinite in this example.

Figure 1: Variance and MSE of estimators of the classical Rényi divergence between 1000-dimensional Gaussians. DV-Rα refers to Rényi divergence estimators built using (3) while CC-Rα refers to estimators built using our new variational representation (4). We used a NN with one fully connected layer of 64 nodes, ReLU activations, and a poly-softplus final layer (for CC-Rényi). We trained for 10000 epochs with a minibatch size of 500. The variance and MSE were computed using data from 50 independent runs. Note that the CC-Rényi estimator has significantly reduced variance and MSE compared to the DV-Rényi estimator, even when α is large. Strikingly, the 1-D case exhibits the same behavior (see Figure 3 in Appendix D.2), demonstrating that the DV-Rényi estimator is unsuitable even in low dimensions.

6.2 DETECTION OF RARE SUB-POPULATIONS IN SINGLE-CELL BIOLOGICAL DATASETS

A critical task in cancer assessment is the detection of rare sub-populations subsumed in the overall population of cells. The advent of affordable flow and mass cytometry technologies that perform single-cell measurements opens a new direction for the analysis and comparison of high-dimensional cell distributions, Shahi et al. (2017), via divergence estimation. We consider single-cell mass cytometry measurements on 16 bone marrow protein markers (d = 16) coming from healthy individuals and from individuals with acute myeloid leukemia, Levine et al. (2015).
Following Weber et al. (2019), we create two datasets: one containing only healthy samples and another with a decreasing percentage of sick cells, and we compute several divergences. Considering the estimated divergence value as the score of a binary classifier, we compute the ROC curve and the respective area under the ROC curve (AUC) for any pair of sample distributions. More specifically, true negatives correspond to the divergence values between two healthy datasets while true positives correspond to the divergence between a healthy and a diseased dataset. Thus, the AUC is 1.0 when the divergence estimates are completely separable, while the AUC is 0.5 when they completely overlap. Table 1 reports the AUC values for the scaled IC-Γ-Rényi divergences (16) at various levels of rarity and for two sample sizes. The best performance in the Rényi family is obtained for α = ∞, i.e., using the IC-Γ-WCR variational formula (18). IC-Γ-WCR also outperforms the first-order Wasserstein distance in both sample-size regimes.

6.3 IC-Γ-RÉNYI GANS

Finally, we study a pair of GAN examples (the second example is presented in Appendix D). Here the goal is to learn a distribution P using a family of generator distributions $Q_\psi \sim h_\psi(X)$, where X is a noise source and $h_\psi$ is a family of neural networks parametrized by ψ ∈ Ψ, i.e., the goal is to solve
$$
\inf_{\psi\in\Psi}\widehat{D}_n(P\|Q_\psi)\,, \tag{24}
$$
where $\widehat{D}_n$ is a divergence estimator of the form (23). In particular, we study GANs constructed from the newly introduced IC-Γ-Rényi and IC-Γ-WCR divergences and compare them with Wasserstein GAN, Gulrajani et al. (2017); Arjovsky et al. (2017).

Table 1: AUC values (higher is better) for several divergences and various levels of rarity. The AUC values have been averaged over 50 independent runs. The neural discriminator has 2 hidden layers with 32 units each and ReLU activation. The $D^{\Gamma,\mathrm{IC}}_\alpha$ divergences used the poly-softplus final layer.

Sample size                        100K                          |             20K
Probability (%)   0.1   0.2   0.3   0.4   0.5   1.0   10.0  |  0.1   0.3   0.5   1.0   10.0
α = 2             0.51  0.55  0.62  0.64  0.70  0.92  1.00  |  0.48  0.58  0.58  0.60  1.00
α = 5             0.66  0.70  0.71  0.72  0.74  0.80  1.00  |  0.32  0.37  0.43  0.38  0.91
α = 10            0.57  0.50  0.62  0.64  0.49  0.59  1.00  |  0.48  0.48  0.43  0.47  0.74
α = ∞             0.64  0.89  0.96  0.99  1.00  1.00  1.00  |  0.58  0.71  0.79  0.91  1.00
Wasserstein       0.63  0.58  0.58  0.51  0.57  0.55  1.00  |  0.46  0.40  0.45  0.40  1.00

6.3.1 CIFAR-10

In Figure 2 we demonstrate improved performance of the IC-Γ-Rényi and IC-Γ-WCR GANs, as compared to Wasserstein GAN with gradient penalty (WGAN-GP), on the CIFAR-10 dataset, Krizhevsky et al. (2009). The IC GANs also outperform the Rényi-DV GAN (21), as the latter is highly unstable when α > 1 and training generally encounters NaNs after a small number of training epochs (hence we omit those results from the figure). We use the same ResNet neural network architecture as in (Gulrajani et al., 2017, Appendix F) and focus on evaluating the effect of different divergences. Here we let Γ be the set of 1-Lipschitz functions, implemented via a gradient penalty. Note that $D^{\Gamma,\mathrm{IC}}_\infty$ performs significantly better than $R^{\Gamma,\mathrm{IC}}_\alpha$ with large α, and the rescaled $D^{\Gamma,\mathrm{IC}}_\alpha$-GAN performs better than the $R^{\Gamma,\mathrm{IC}}_\alpha$-GAN when α is large.

Figure 2: Comparison between IC-Γ-Rényi GAN, IC-Γ-WCR GAN, and WGAN-GP (both 1- and 2-sided) on the CIFAR-10 dataset. Here we plot the inception score as a function of the number of training epochs (moving average over the last 5 data points, with results averaged over 5 runs).
We also show the averaged final FID score in the legend, computed using 50000 samples from both P and Q. For the IC GANs we enforce negativity of the discriminator by using a final layer equal to abs. The GANs were trained using the Adam optimizer with an initial learning rate of 0.0002. The left pane shows that the IC-Γ-Rényi GANs outperform WGAN while the right pane shows that GANs based on the rescaled DΓ,IC α divergences (16) perform better when α is large, including in the α limit, i.e., IC-Γ-WCR (17). In both cases the IC GANs outperform the Γ-Rényi-DV GANs with α > 1 (21); the latter fail to converge due to the presence of risk-sensitive terms. Published as a conference paper at ICLR 2023 ACKNOWLEDGMENTS The research of J.B., M.K. and L.R.-B. was partially supported by the Air Force Office of Scientific Research (AFOSR) under the grant FA9550-21-1-0354. The research of M. K. and L.R.-B. was partially supported by the National Science Foundation (NSF) under the grants DMS-2008970 and TRIPODS CISE-1934846. The research of P.D. was partially supported by the NSF under the grant DMS-1904992 and by the AFOSR under the grant FA-9550-21-1-0354. The work of Y.P. was partially supported by the Hellenic Foundation for Research and Innovation (HFRI) through the Second Call for HFRI Research Projects to support Faculty Members and Researchers under Project 4753. This work was performed in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative. Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214 223. PMLR, 2017. R. Atar, K. Chowdhary, and P. Dupuis. Robust bounds on risk-sensitive functionals via Rényi divergence. SIAM/ASA Journal on Uncertainty Quantification, 3(1):18 33, 2015. doi: 10.1137/ 130939730. URL https://doi.org/10.1137/130939730. Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning, pp. 531 540, 2018. Himesh Bhatia, William Paul, Fady Alajaji, Bahman Gharesifard, and Philippe Burlina. Least kth-order and Rényi generative adversarial networks. Neural Computation, 33(9):2473 2510, 2021. Jeremiah Birrell, Paul Dupuis, Markos A. Katsoulakis, Luc Rey-Bellet, and Jie Wang. Variational representations and neural network estimation of Rényi divergences. SIAM Journal on Mathematics of Data Science, 3(4):1093 1116, 2021. doi: 10.1137/20M1368926. URL https://doi.org/ 10.1137/20M1368926. Jeremiah Birrell, Paul Dupuis, Markos A. Katsoulakis, Yannis Pantazis, and Luc Rey-Bellet. (f, Γ)- divergences: Interpolating between f-divergences and integral probability metrics. Journal of Machine Learning Research, 23(39):1 70, 2022a. URL http://jmlr.org/papers/v23/ 21-0100.html. Jeremiah Birrell, Markos Katsoulakis, Luc Rey-Bellet, and Wei Zhu. Structure-preserving GANs. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 1982 2020. PMLR, 17 23 Jul 2022b. URL https://proceedings.mlr.press/v162/birrell22a.html. Jeremiah Birrell, Markos A. Katsoulakis, and Yannis Pantazis. 
Optimizing variational representations of divergences and accelerating their statistical estimation. IEEE Transactions on Information Theory, 68(7):4553 4572, 2022c. doi: 10.1109/TIT.2022.3160659. J. Borwein and Q. Zhu. Techniques of Variational Analysis. CMS Books in Mathematics. Springer New York, 2006. ISBN 9780387282718. URL https://books.google.com/books? id=Fquue78k Je EC. Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based on the Rényi divergence. In Arthur Gretton and Christian C. Robert (eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pp. 435 444, Cadiz, Spain, 09 11 May 2016. PMLR. URL https://proceedings.mlr.press/v51/begin16.html. Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin. Club: A contrastive log-ratio upper bound of mutual information. In ICML 2020, July 2020. URL https://www.microsoft.com/en-us/research/publication/ club-a-contrastive-log-ratio-upper-bound-of-mutual-information/. Published as a conference paper at ICLR 2023 Neel Dey, Antong Chen, and Soheil Ghafurian. Group equivariant generative adversarial networks. In International Conference on Learning Representations, 2021. URL https://openreview. net/forum?id=rg FNu JHHXv. Paul Dupuis, Markos A. Katsoulakis, Yannis Pantazis, and Luc Rey-Bellet. Sensitivity analysis for rare events based on Rényi divergence. Ann. Appl. Probab., 30(4):1507 1533, 2020. ISSN 1050-5164. doi: 10.1214/19-AAP1468. URL https://doi.org/10.1214/19-AAP1468. Dupuis, Paul and Mao, Yixiang. Formulation and properties of a divergence used to compare probability measures without absolute continuity. ESAIM: COCV, 28:10, 2022. doi: 10.1051/cocv/ 2022002. URL https://doi.org/10.1051/cocv/2022002. G.B. Folland. Real Analysis: Modern Techniques and Their Applications. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts. Wiley, 2013. ISBN 9781118626399. URL https://books.google.com/books?id=w I4f Aw AAQBAJ. Pierre Glaser, Michael Arbel, and Arthur Gretton. KALE flow: A relaxed KL gradient flow for probabilities with disjoint support. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 8018 8031. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/ paper/2021/file/433a6ea5429d6d75f0be9bf9da26e24c-Paper.pdf. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative Adversarial Nets. In Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 2672 2680, 2014. A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723 773, 2012. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved Training of Wasserstein GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, pp. 5769 5779, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv:1412.6980, 2014. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. Gowtham R. Kurri, Oliver Kosut, and Lalitha Sankar. 
A variational formula for infinity-Rényi divergence with applications to information leakage. In 2022 IEEE International Symposium on Information Theory (ISIT), pp. 2493 2498. IEEE Press, 2022. doi: 10.1109/ISIT50566.2022. 9834358. URL https://doi.org/10.1109/ISIT50566.2022.9834358. Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Kyungmin Lee and Jinwoo Shin. Rényi CL: Contrastive representation learning with skew Rényi divergence. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview. net/forum?id=73h4EZYt Sht. Jacob H. Levine, Erin F. Simonds, Sean C. Bendall, Kara L. Davis, El Ad D. Amir, Michelle D. Tadmor, Oren Litvin, Harris G. Fienberg, Astraea Jager, Eli R. Zunder, Rachel Finck, Amanda L. Gedman, Ina Radtke, James R. Downing, Dana Pe er, and Garry P. Nolan. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell, 162(1): 184 197, 2015. ISSN 10974172. doi: 10.1016/j.cell.2015.05.047. URL http://dx.doi. org/10.1016/j.cell.2015.05.047. Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1073 1081, 2016. Published as a conference paper at ICLR 2023 David Mc Allester and Karl Stratos. Formal limitations on the measurement of mutual information. In Silvia Chiappa and Roberto Calandra (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 875 884. PMLR, 26 28 Aug 2020. URL https://proceedings. mlr.press/v108/mcallester20a.html. Tom Minka. Divergence measures and message passing. Technical Report MSR-TR-2005-173, January 2005. URL https://www.microsoft.com/en-us/research/publication/ divergence-measures-and-message-passing/. Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263 275, 2017. doi: 10.1109/CSF.2017.11. T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. Yannis Pantazis, Dipjyoti Paul, Michail Fasoulakis, Yannis Stylianou, and Markos A. Katsoulakis. Cumulant GAN. IEEE Transactions on Neural Networks and Learning Systems, pp. 1 12, 2022. doi: 10.1109/TNNLS.2022.3161127. Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pp. 547 561, Berkeley, Calif., 1961. University of California Press. URL https: //projecteuclid.org/euclid.bsmsp/1200512181. Noor Sajid, Francesco Faccio, Lancelot Da Costa, Thomas Parr, Jürgen Schmidhuber, and Karl Friston. Bayesian Brains and the Rényi Divergence. Neural Computation, 34(4):829 855, 03 2022. ISSN 0899-7667. doi: 10.1162/neco_a_01484. URL https://doi.org/10.1162/neco_ a_01484. P Shahi, SC Kim, JR Haliburton, ZJ Gartner, and AR Abate. Abseq: Ultrahigh-throughput single cell protein profiling with droplet microfluidic barcoding. Sci Rep, 7:44447, 2017. Jiaming Song and Stefano Ermon. Understanding the Limitations of Variational Mutual Information Estimators. ar Xiv e-prints, art. ar Xiv:1910.06222, October 2019. Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 
Representation learning with contrastive predictive coding. Co RR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748. Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797 3820, 2014. Lukas M Weber, Malgorzata Nowicka, Charlotte Soneson, and Mark D. Robinson. diffcyt: Differential discovery in high-dimensional cytometry via high-resolution clustering. Communications Biology, 2(183), 2019. ISSN 2399-3642. doi: 10.1038/s42003-019-0415-5. URL https://doi.org/10.1038/s42003-019-0415-5. A DERIVATION OF VARIATIONAL FORMULAS FOR THE CLASSICAL RÉNYI DIVERGENCES In this appendix we provide several variational formulas for the classical Rényi divergences, some of which are new. In the following we let (Ω, M) denote a measurable space, M(Ω) be the space of measurable real-valued functions on Ω, Mb(Ω) the subspace of bounded functions, and P(Ω) will be the space of probability measures on Ω. First we recall the Rényi-Donsker-Varadhan variational formula derived in Birrell et al. (2021). This is a generalization of the Donsker-Varadhan variational representation of the KL divergence. Published as a conference paper at ICLR 2023 Theorem A.1 (Rényi-Donsker-Varadhan Variational Formula). Let P and Q be probability measures on (Ω, M) and α R, α = 0, 1. Then for any set of functions, Φ, with Mb(Ω) Φ M(Ω) we have Rα(P Q) = sup ϕ Φ 1 α 1 log Z e(α 1)ϕd P 1 α log Z eαϕd Q , (25) where we interpret and + . If in addition (Ω, M) is a metric space with the Borel σ-algebra then (25) holds for all Φ that satisfy Lipb(Ω) Φ M(Ω), where Lipb(Ω) is the space of bounded Lipschitz functions on Ω(i.e., Lipschitz for any Lipschitz constant L (0, )). Using Theorem A.1 we can derive a new variational representation that takes the form of a convex conjugate. Theorem A.2 (Convex-Conjugate Rényi Variational Formula). Let P, Q P(Ω) and α (0, 1) (1, ). Then Rα(P Q) = sup g Mb(Ω):g<0 Z gd Q + 1 α 1 log Z |g|(α 1)/αd P + α 1(log α + 1) . (26) If (Ω, M) is a metric space with the Borel σ-algebra then (26) holds with Mb(Ω) replaced by Cb(Ω), the space of bounded continuous real-valued functions on Ω. Proof. Let Φ = {α 1 log( h) : h Mb(Ω), h < 0}. We have Mb(Ω) Φ M(Ω), hence Theorem A.1 implies Rα(P Q) = sup ϕ Φ 1 α 1 log Z e(α 1)ϕd P 1 α log Z eαϕd Q (27) = sup h Mb(Ω):h<0 1 α 1 log Z e(α 1)(α 1 log( h))d P 1 α log Z eα(α 1 log( h))d Q = sup h Mb(Ω):h<0 1 α 1 log Z |h|(α 1)/αd P 1 α log Z ( h)d Q . Note that the second term is finite but the first term is possibly infinite when α (0, 1). Next use the identity log(c) = inf z R{z 1 + ce z} , c (0, ) (28) in the second term to write Rα(P Q) = sup h Mb(Ω):h<0 1 α 1 log Z |h|(α 1)/αd P 1 α inf z R{z 1 + e z Z ( h)d Q} = sup z R sup h Mb(Ω):h<0 1 α 1 log Z |h|(α 1)/αd P + z 1 α + α 1e z Z hd Q . For each z R make the change variables h = αezg, g Mb(Ω), g < 0 in the inner supremum to derive Rα(P Q) = sup z R sup g Mb(Ω):g<0 1 α 1 log Z |αezg|(α 1)/αd P z 1 α + α 1e z Z αezgd Q = sup z R sup g Mb(Ω):g<0 1 α 1 log Z |g|(α 1)/αd P + (α 1(log α + 1) + Z gd Q) = sup g Mb(Ω):g<0 Z gd Q + 1 α 1 log Z |g|(α 1)/αd P + α 1(log α + 1) . This completes the proof of (26). The proof of the metric-space version in nearly identical. Published as a conference paper at ICLR 2023 Remark A.3. To reverse the above derivation and obtain (25) (with Φ = {ϕ M(Ω) : ϕ is bounded above}) from (26), change variables g 7 c exp(αϕ), ϕ Φ, c > 0 in (26) and then maximize over c. Corollary A.4. 
If X is a compact metric space with the Borel σ-algebra, P, Q P(X), and α (0, 1) (1, ) then Cb(X) = C(X) and so Rα(P Q) = sup g C(X):g<0 Z gd Q + 1 α 1 log Z |g|(α 1)/αd P + α 1(log α + 1) . (31) Next we derive a variational formula for the worst case regret, defined by D (P Q) := lim α αRα(P Q) . (32) Theorem A.5. Let P, Q P(Ω). Then D (P Q) = sup g Mb(Ω):g<0 Z gd Q + log Z |g|d P + 1 . (33) If Ωis a metric space with the Borel σ-algebra then (33) holds with Mb(Ω) replaced by Cb(Ω). Remark A.6. Note that on a compact metric space, the space of bounded continuous functions is the same as the space of all continuous functions. Proof. Recall Van Erven & Harremos (2014) ( log ess sup P d P d Q , P Q , P Q . (34) First suppose P Q. Then there exists a measurable set A with Q(A) = 0 and P(A) > 0. Let gn = n1A 1Ac. Then sup g Mb(Ω):g<0 Z gd Q + log Z |g|d P + 1 Z gnd Q + log Z |gn|d P + 1 (35) = n Q(A) Q(Ac) + log(n P(A) + P(Ac)) + 1 = log(n P(A) + P(Ac)) (36) as n . Therefore sup g Mb(Ω):g<0 Z gd Q + log Z |g|d P + 1 = = D (P Q) . (37) Now suppose P Q. Using the definition (32) along with Theorem A.2 and changing variables g = g/α we have D (P Q) = lim α αRα(P Q) (38) sup g Mb(Ω):g<0 Z αgd Q + α α 1 log Z |g|(α 1)/αd P + (log α + 1) Z gd Q + α α 1 log Z | g/α|(α 1)/αd P + (log α + 1) Z gd Q + α α 1 log Z | g|(α 1)/αd P + 1 = Z gd Q + log Z | g|d P + 1 for all g Mb(Ω), g < 0. Here we used the dominated convergence theorem to evaluate the limit. Hence, by maximizing over g we obtain D (P Q) sup g Mb(Ω):g<0 Z gd Q + log Z |g|d P + 1 . (39) Published as a conference paper at ICLR 2023 To prove the reverse inequality, take any r (0, ess sup P d P/d Q). By definition of the essential supremum we have P(d P/d Q > r) > 0. We also have the bound P(d P/d Q > r) = Z 1d P/d Q>r d P d Qd Q Z 1d P/d Q>rrd Q = r Q(d P/d Q > r) . (40) For c, ϵ > 0 define gc,ϵ = c1d P/d Q>r ϵ. These satisfy gc,ϵ Mb(Ω), gc,ϵ < 0 and so sup g Mb(Ω):g<0 Z gd Q + log Z |g|d P + 1 Z gc,ϵd Q + log Z |gc,ϵ|d P + 1 (41) = c Q(d P/d Q > r) ϵ + log(c P(d P/d Q > r) + ϵ) + 1 c P(d P/d Q > r)/r ϵ + log(c P(d P/d Q > r) + ϵ) + 1 . Letting ϵ 0+ we find sup g Mb(Ω):g<0 Z gd Q + log Z |g|d P + 1 c P(d P/d Q > r)/r + log(c P(d P/d Q > r)) + 1 for all c > 0. We have P(d P/d Q > r) > 0, hence by maximizing over c > 0 and changing variables to z = c P(d P/d Q > r) we obtain sup g Mb(Ω):g<0 Z gd Q + log Z |g|d P + 1 sup z>0 { z/r + log(z) + 1} = log(r) . (43) This holds for all r < ess sup P d P/d Q, therefore we can take r ess sup P d P/d Q and use (34) to conclude sup g Mb(Ω):g<0 Z gd Q + log Z |g|d P + 1 log(ess sup P d P/d Q) = D (P Q) . (44) Combining this with (39) completes the proof of (33). Now suppose Ωis a metric space. We clearly have D (P Q) = sup g Mb(Ω):g<0 Z gd Q + log Z |g|d P + 1 (45) sup g Cb(Ω):g<0 Z gd Q + log Z |g|d P + 1 . To prove the reverse inequality, take any g Mb(Ω) with g < 0. By Lusin s theorem, for all ϵ > 0 there exists a closed set Eϵ and hϵ Cb(Ω) such that P(Ec ϵ) ϵ, Q(Ec ϵ) ϵ, hϵ|Eϵ = g, and inf g hϵ 0. Define gϵ = hϵ ϵ. Then gϵ < 0, gϵ Cb(Ω) and we have sup g Cb(Ω):g<0 Z gd Q + log Z |g|d P Z gϵd Q + log Z |gϵ|d P (46) = Z gd Q + Z (hϵ g)1Ecϵ d Q ϵ + log( Z |g|d P + Z (|hϵ| |g|)1Ecϵ d P + ϵ) Z gd Q (sup g inf g)Q(Ec ϵ) ϵ + log( Z |g|d P + inf g P(Ec ϵ) + ϵ) Z gd Q (sup g inf g)ϵ ϵ + log( Z |g|d P + inf gϵ + ϵ) . Taking the limit ϵ 0+ we therefore obtain sup g Cb(Ω):g<0 Z gd Q + log Z |g|d P Z gd Q + log Z |g|d P . 
(47) This holds for all g Mb(Ω) with g < 0, hence by taking the supremum over g we obtain the reverse inequality to (45). This completes the proof. Published as a conference paper at ICLR 2023 In this appendix we provide a number of proofs that were omitted from the main text. Recall that X denotes a compact metric space. Lemma B.1. Let Γ C(X) and P, Q P(X). Then αRΓ/α,IC α (P Q) is non-decreasing in α (0, 1) (1, ). If 0 Γ then αRΓ,IC α (P Q) is also non-decreasing. Proof. If 0 Γ then W Γ 0, hence αRΓ,IC α (P Q) = inf η P(X){αRα(P η) + αW Γ(Q, η)} (48) where both α 7 αRα(P η) and α 7 αW Γ(Q, η) are non-decreasing. Therefore the infimum is as well. The proof for α 7 αRΓ/α,IC α (P Q) is similar, though it doesn t require the assumption 0 Γ due to the identity αW Γ/α = W Γ. Next we prove a key lemma that is used in our main result. First recall the definition ΛP α[g] := 1g <0 1 α 1 log Z |g|(α 1)/αd P + α 1(log α + 1) 1g<0 , g C(X). (49) Lemma B.2. ΛP α is convex and is continuous on {g C(X) : g < 0}, an open subset of C(X). Proof. First we prove convexity. Let g0, g1 {C(X) : g < 0} and λ (0, 1). For α (0, 1) we can use the inequality λa + (1 λ)b aλb1 λ for all a, b > 0 to compute 1 α 1 log Z |λg1 + (1 λ)g0|(α 1)/αd P 1 α 1 log Z (|g1|λ|g0|1 λ)(α 1)/αd P . (50) Using Hölder s inequality with exponents p = 1/λ, q = 1/(1 λ) we then obtain 1 α 1 log Z (|g1|λ|g0|1 λ)(α 1)/αd P (51) 1 α 1 log Z |g1|(α 1)/αd P λ Z |g0|(α 1)/αd P 1 λ =λ 1 α 1 log Z |g1|(α 1)/αd P + (1 λ) 1 α 1 log Z |g0|(α 1)/αd P . Therefore g 7 1 α 1 log R |g|(α 1)/αd P is convex on {g < 0}. This proves ΛP α is convex when α (0, 1). Now suppose α > 1. The map t > 0, t 7 t(α 1)/α is concave and log is decreasing and convex, hence 1 α 1 log Z |λg1 + (1 λ)g0|(α 1)/αd P (52) 1 α 1 log λ Z |g1|(α 1)/αd P + (1 λ) Z |g0|(α 1)/αd P λ 1 α 1 log Z |g1|(α 1)/αd P + (1 λ) 1 α 1 log Z |g0|(α 1)/αd P . This proves that ΛP α is also convex when α > 1. Openness of {g < 0} follows from the assumption that X is compact and so any strictly negative continuous function is strictly bounded away from zero. Continuity on {g < 0} then follows from the dominated convergence theorem. Now we prove our main theorem, deriving the dual variational formula and other important properties of the IC-Γ-Rényi divergences. Theorem B.3. Let Γ C(X) be admissible, P, Q P(X), and α (0, 1) (1, ). Then: Published as a conference paper at ICLR 2023 RΓ,IC α (P Q) = sup g Γ:g<0 Z gd Q + 1 α 1 log Z |g|(α 1)/αd P + α 1(log α + 1) . 2. If (53) is finite then there exists η P(X) such that RΓ,IC α (P Q) = inf η P(X){Rα(P η) + W Γ(Q, η)} = Rα(P η ) + W Γ(Q, η ) . (54) 3. RΓ,IC α (P Q) is convex in Q. If α (0, 1) then RΓ,IC α (P Q) is jointly convex in (P, Q). 4. (P, Q) 7 RΓ,IC α (P Q) is lower semicontinuous. 5. RΓ,IC α (P Q) 0 with equality if P = Q. 6. RΓ,IC α (P Q) min{Rα(P Q), W Γ(Q, P)}. 7. If Γ is strictly admissible then RΓ,IC α has the divergence property. Proof. 1. Define F, G : C(X) ( , ] by F = ΛP α and G[g] = 1g Γ EQ[g]. Using the assumptions on Γ along with Lemma B.2 we see that F and G are convex, F[ 1] < , G[ 1] < , and F is continuous at 1. Therefore Fenchel-Rockafellar duality (see, e.g., Theorem 4.4.3 in Borwein & Zhu (2006)) along with the identity C(X) = M(X) gives sup g C(X) { F[g] G[g]} = inf η M(X){F [η] + G [ η]} , (55) and if either side is finite then the infimum on the right hand side is achieved at some η M(X). 
Using the definitions, we can rewrite the left hand side as follow sup g C(X) { F[g] G[g]} (56) = sup g Γ:g<0 Z gd Q + 1 α 1 log Z |g|(α 1)/αd P + α 1(log α + 1) . We can also compute G [ η] = sup g C(X) { Z gdη ( 1g Γ EQ[g])} = W Γ(Q, η) . (57) inf η M(X){(ΛP α) [η] + W Γ(Q, η)} (58) = sup g Γ:g<0 Z gd Q + 1 α 1 log Z |g|(α 1)/αd P + α 1(log α + 1) . Next we show that the infimum over M(X) can be restricted to P(X). First suppose η M(X) with η(X) = 1. Then, using the assumption that Γ contains the constant functions, we have W Γ(Q, η) EQ[ n] Z ndη = n(1 η(X)) (59) as n (for appropriate choice of sign). Therefore W Γ(Q, η) = if η(X) = 1. This implies that the infimum can be restricted to {η M(X) : η(X) = 1}. Now suppose η M(X) is not positive. Take a measurable set A with η(A) < 0. By Lusin s theorem, for all ϵ > 0 there exists a closed set Eϵ X and a continuous function Published as a conference paper at ICLR 2023 gϵ C(X) such that |η|(Ec ϵ) < ϵ, 0 gϵ 1, and gϵ|Eϵ = 1A. Define gn,ϵ = ngϵ 1, n Z+. Then gn,ϵ {g C(X) : g < 0}, hence (ΛP α) [η] Z gn,ϵdη + 1 α 1 log Z |gn,ϵ|(α 1)/αd P + α 1(log α + 1) (60) = nη(A) + nη(A Ec ϵ) n Z gϵ1Ecϵ dη η(X) + 1 α 1 log Z |ngϵ + 1|(α 1)/αd P + α 1(log α + 1) n(|η(A)| 2ϵ) η(X) + 1 α 1 log Z |ngϵ + 1|(α 1)/αd P + α 1(log α + 1) . If α > 1 then log R |ngϵ + 1|(α 1)/αd P 0 and if α (0, 1) then log R |ngϵ + 1|(α 1)/αd P 0. In either case we have 1 α 1 log R |ngϵ + 1|(α 1)/αd P 0 and so (ΛP α) [η] n(|η(A)| 2ϵ) η(X) + α 1(log α + 1) . (61) By choosing ϵ < |η(A)|/2 and taking n we see that (ΛP α) [η] = whenever η M(X) is not positive. Therefore the infimum can further be restricted to positive measures. Combining these results we find sup g Γ:g<0 Z gd Q + 1 α 1 log Z |g|(α 1)/αd P + α 1(log α + 1) (62) = inf η P(X){(ΛP α) [η] + W Γ(Q, η)} . For η P(X), equation (10) implies (ΛP α) [η] = Rα(P η). This completes the proof. 2. The existence of a minimizer follows from Fenchel-Rockafellar duality; again, see Theorem 4.4.3 in Borwein & Zhu (2006). 3. This follows from (53) together with the fact that the supremum of convex functions is convex and y 7 1 α 1 log(y) is convex when α (0, 1). 4. Compactness of X implies that g and |g|(α 1)/α are bounded and continuous whenever g Γ satisfies g < 0. Therefore Q R gd Q and P R |g|(α 1)/αd P are continuous in the weak topology on P(X). Therefore the objective functional in (53) is continuous in (P, Q). The supremum is therefore lower semicontinuous. 5. This easily follows from the definition (7). 6. Rα is a divergence, hence is non-negative. Γ contains the constant functions, hence W Γ 0. Therefore RΓ,IC α 0. If Q = P then 0 RΓ,IC α (P Q) Rα(P P) + W Γ(P, P) = 0, hence RΓ,IC α (P Q) = 0. 7. Suppose Γ is strictly admissible. Due to part 5 of this theorem, we only need to show that if RΓ,IC α (P Q) = 0 then P = Q. If RΓ,IC α (P Q) = 0 then part 2 implies there exists η P(X) such that 0 = Rα(P η ) + W Γ(Q, η ) . (63) Both terms are non-negative, hence Rα(P η ) = 0 = W Γ(Q, η ). Rα has the divergence property, hence η = P. So W Γ(Q, P) = 0. Therefore 0 R gd Q R gd P for all g Γ. Let Ψ be as in the definition of strict admissibility and let ψ Ψ. There exists c R, ϵ > 0 such that c ϵψ Γ and so 0 ϵ( R ψd Q R ψd P). Therefore R ψd Q = R ψd P for all ψ Ψ. Ψ is P(X)-determining, hence Q = P. Remark B.4. The Fenchel-Rockafellar Theorem applies under two different sets of assumptions: the first assumes both mappings are lower semicontinuous (LSC) while the second applies when one mapping is continuous at a point where both are finite. 
The mapping ΛP α, as defined by (49) and Published as a conference paper at ICLR 2023 appearing in (10), is not LSC but it is continuous on its domain, hence we used the second version of Fenchel-Rockafellar in our proof of Theorem B.3. For α > 1 one could alternatively redefine ΛP α along the boundary of {g < 0} to make it LSC while still maintaining the relation (10) and thereby utilize the first version of Fenchel-Rockafellar. This alternative approach is also amenable to extending the theorem to non-compact spaces, using the methods from Dupuis, Paul & Mao, Yixiang (2022); Birrell et al. (2022a). However, these methods do not apply to α (0, 1). With this in mind, in order to provide a simple unified treatment of all α (0, 1) (1, ) we structured our proof around the second version of Fenchel-Rockafellar. Despite the fact that ΛP α is not LSC, the Fenchel-Rockafellar Theorem does imply that convex duality holds at all points of continuity in the domain, i.e., one has ΛP α[g] = sup η M(X) { Z gdη Rα(P η)} for all g < 0 , (64) but this duality formula doesn t necessarily hold if g < 0. Here, Rα(P η) for general η M(X) is defined via the variational formula Rα(P η) := (ΛP α) [η] = sup g C(X) { Z gdη ΛP α[g]} (65) and one can rewrite this in terms of the classical Rényi divergence as follows ( if η 0 or η = 0 , Rα(P η α log η if η is a nonzero positive measure. (66) Next we prove the limiting results from Theorem 4.1. Theorem B.5. Let Γ C(X) be admissible, P, Q P(X), and α (0, 1) (1, ). Then lim δ 0+ 1 δ RδΓ,IC α (P Q) = W Γ(Q, P) (67) and if Γ is strictly admissible we have lim L RLΓ,IC α (P Q) = Rα(P Q) . (68) Proof. It is straightforward to show that the scaled function spaces are admissible and W cΓ = c W Γ for all c > 0. First we prove (67). From the definition 7 we have δ 1RδΓ,IC α (P Q) = inf η P(X){δ 1Rα(P η) + W Γ(Q, η)} W Γ(Q, P) (69) and so δ 1RδΓ,IC α (P Q) is non-increasing in δ. Therefore lim δ 0+ δ 1RδΓ,IC α (P Q) = sup δ>0 δ 1RδΓ,IC α (P Q) (70) lim δ 0+ δ 1RδΓ,IC α (P Q) W Γ(Q, P) . (71) We will assume this inequality is strict and derive a contradiction. This assumption, together with (70), implies RδΓ,IC α (P Q) < for all δ > 0. Part (2) of Theorem 3.4 then implies the existence of η ,δ P(X) such that δ 1RδΓ,IC α (P Q) = δ 1Rα(P η ,δ) + W Γ(Q, η ,δ) W Γ(Q, η ,δ) . (72) Take a sequence δn 0+. We have assumed X is compact, hence P(X) is also compact and so there exists a weakly convergent subsequence η ,δnj η . From the variational formulas (25) and (8) we see that Rα(P ) and W Γ(Q, ) are lower semicontinuous, hence lim infj W Γ(Q, η ,δnj ) W Γ(Q, η ) and Rα(P η ) lim inf j Rα(P η ,δnj ) lim inf j δnj(δ 1 nj Rα(P η ,δnj ) + W Γ(Q, η ,δnj )) (73) = lim inf j δnj(δ 1 nj R δnj Γ,IC α (P Q)) = 0 , (74) Published as a conference paper at ICLR 2023 where the last equality follows from the assumed strictness of the inequality (70). Therefore the divergence property for the classical Rényi divergences implies Rα(P η ) = 0 and P = η . Combining the above results we obtain lim δ 0+ δ 1RδΓ,IC α (P Q) = lim j δ 1 nj R δnj Γ,IC α (P Q) lim inf j W Γ(Q, η ,δnj ) (75) W Γ(Q, η ) = W Γ(Q, P) . This contradicts (71) and therefore we have proven the equality (67). Now we assume Γ is strictly admissible and will prove (68) via similar reasoning. From the definition 7 we see that RLΓ,IC α (P Q) = inf η P(X){Rα(P η) + LW Γ(Q, η)} Rα(P Q) (76) and RLΓ,IC α is non-decreasing in L. Hence lim L RLΓ,IC α (P Q) = sup L>0 RLΓ,IC α (P Q) and lim L RLΓ,IC α (P Q) Rα(P Q) . 
Suppose this inequality is strict. Then $R^{L\Gamma,IC}_\alpha(P\|Q)<\infty$ for all $L$ and we can use part (2) of Theorem 3.4 to conclude that there exists $\eta_{*,L}\in\mathcal P(X)$ such that
$$R^{L\Gamma,IC}_\alpha(P\|Q) = R_\alpha(P\|\eta_{*,L}) + LW^\Gamma(Q,\eta_{*,L})\,. \qquad (78)$$
Take $L_n\to\infty$. Compactness of $\mathcal P(X)$ implies the existence of a weakly convergent subsequence $\eta_{*,j} := \eta_{*,L_{n_j}}\to\eta_*\in\mathcal P(X)$. Lower semicontinuity of $R_\alpha(P\|\cdot)$ and $W^\Gamma(Q,\cdot)$ imply $\liminf_j R_\alpha(P\|\eta_{*,j})\geq R_\alpha(P\|\eta_*)$ and
$$W^\Gamma(Q,\eta_*) \leq \liminf_j W^\Gamma(Q,\eta_{*,j}) = \liminf_j L_{n_j}^{-1}W^{L_{n_j}\Gamma}(Q,\eta_{*,j}) \qquad (79)$$
$$\leq \liminf_j L_{n_j}^{-1}R^{L_{n_j}\Gamma,IC}_\alpha(P\|Q) = 0\,,$$
where the last equality follows from the assumed strictness of the inequality (77). Therefore $W^\Gamma(Q,\eta_*) = 0$. $\Gamma$ is strictly admissible, hence $Q=\eta_*$ (see the proof of part (7) of Theorem 3.4). Combining these results we see that
$$\lim_{L\to\infty}R^{L\Gamma,IC}_\alpha(P\|Q) = \lim_j R^{L_{n_j}\Gamma,IC}_\alpha(P\|Q) = \lim_j\Big(R_\alpha(P\|\eta_{*,L_{n_j}}) + L_{n_j}W^\Gamma(Q,\eta_{*,L_{n_j}})\Big) \qquad (80)$$
$$\geq \liminf_j R_\alpha(P\|\eta_{*,j}) \geq R_\alpha(P\|\eta_*) = R_\alpha(P\|Q)\,.$$
This contradicts the assumed strictness of the inequality (77), hence (77) is an equality. This completes the proof.

Next we prove Theorem 4.2, regarding the $\alpha\to 1$ limit of the IC-$\Gamma$-Rényi divergences.

Theorem B.6. Let $\Gamma\subset C(X)$ be admissible and $P,Q\in\mathcal P(X)$. Then
$$\lim_{\alpha\to 1^+} R^{\Gamma,IC}_\alpha(P\|Q) = \inf_{\eta\in\mathcal P(X):\,\exists\beta>1,\,R_\beta(P\|\eta)<\infty}\{R(P\|\eta) + W^\Gamma(Q,\eta)\}\,, \qquad (81)$$
$$\lim_{\alpha\to 1^-} R^{\Gamma,IC}_\alpha(P\|Q) = \inf_{\eta\in\mathcal P(X)}\{R(P\|\eta) + W^\Gamma(Q,\eta)\} \qquad (82)$$
$$= \sup_{g\in\Gamma:\,g<0}\Big\{\int g\,dQ + \int\log|g|\,dP\Big\} + 1\,. \qquad (83)$$

Proof. Lemma B.1 implies $\alpha\mapsto\alpha R^{\Gamma,IC}_\alpha(P\|Q)$ is non-decreasing on $(1,\infty)$, therefore
$$\lim_{\alpha\to 1^+}\alpha R^{\Gamma,IC}_\alpha(P\|Q) = \inf_{\alpha>1}\alpha R^{\Gamma,IC}_\alpha(P\|Q) \qquad (84)$$
$$= \inf_{\alpha>1}\inf_{\eta\in\mathcal P(X)}\{\alpha R_\alpha(P\|\eta) + \alpha W^\Gamma(Q,\eta)\} = \inf_{\eta\in\mathcal P(X)}\inf_{\alpha>1}\{\alpha R_\alpha(P\|\eta) + \alpha W^\Gamma(Q,\eta)\}$$
$$= \inf_{\eta\in\mathcal P(X)}\Big\{\lim_{\alpha\to 1^+}\alpha R_\alpha(P\|\eta) + W^\Gamma(Q,\eta)\Big\}\,.$$
From Van Erven & Harremos (2014) we have
$$\lim_{\alpha\to 1^+}R_\alpha(P\|\eta) = \begin{cases} R(P\|\eta) & \text{if } \exists\beta>1 \text{ with } R_\beta(P\|\eta)<\infty\,,\\ \infty & \text{otherwise,}\end{cases} \qquad (85)$$
and so we can conclude
$$\lim_{\alpha\to 1^+}R^{\Gamma,IC}_\alpha(P\|Q) = \lim_{\alpha\to 1^+}\alpha R^{\Gamma,IC}_\alpha(P\|Q) = \inf_{\eta\in\mathcal P(X):\,\exists\beta>1,\,R_\beta(P\|\eta)<\infty}\{R(P\|\eta) + W^\Gamma(Q,\eta)\}\,. \qquad (86)$$
This proves (81).

Now we compute the limit as $\alpha\to 1^-$. Note that the limit exists due to the fact that $\alpha\mapsto\alpha R^{\Gamma,IC}_\alpha(P\|Q)$ is non-decreasing. From the definition (7), for all $\eta\in\mathcal P(X)$ we have
$$\lim_{\alpha\to 1^-}R^{\Gamma,IC}_\alpha(P\|Q) \leq \lim_{\alpha\to 1^-}R_\alpha(P\|\eta) + W^\Gamma(Q,\eta) = R(P\|\eta) + W^\Gamma(Q,\eta)\,. \qquad (87)$$
Here we used the fact that $\lim_{\alpha\to 1^-}R_\alpha(P\|\eta) = R(P\|\eta)$ (see Van Erven & Harremos (2014)). Taking the infimum over $\eta$ then gives
$$\lim_{\alpha\to 1^-}R^{\Gamma,IC}_\alpha(P\|Q) \leq \inf_{\eta\in\mathcal P(X)}\{R(P\|\eta) + W^\Gamma(Q,\eta)\}\,. \qquad (88)$$
To prove the reverse inequality, use part 1 of Theorem 3.4 to compute
$$\lim_{\alpha\to 1^-}R^{\Gamma,IC}_\alpha(P\|Q) = \lim_{\alpha\to 1^-}\alpha R^{\Gamma,IC}_\alpha(P\|Q) \qquad (89)$$
$$= \lim_{\alpha\to 1^-}\sup_{g\in\Gamma:\,g<0}\Big\{\alpha\int g\,dQ + \frac{\alpha}{\alpha-1}\log\int|g|^{(\alpha-1)/\alpha}dP + \log\alpha + 1\Big\}$$
$$\geq \int g\,dQ + \lim_{\alpha\to 1^-}\frac{\alpha}{\alpha-1}\log\int|g|^{(\alpha-1)/\alpha}dP + 1$$
$$= \int g\,dQ + \frac{d}{dy}\Big|_{y=0}\log\int e^{y\log|g|}dP + 1 = \int g\,dQ + \int\log|g|\,dP + 1$$
for all $g\in\Gamma$, $g<0$. Therefore, maximizing over $g$ gives
$$\lim_{\alpha\to 1^-}R^{\Gamma,IC}_\alpha(P\|Q) \geq \sup_{g\in\Gamma:\,g<0}\Big\{\int g\,dQ + \int\log|g|\,dP\Big\} + 1\,. \qquad (90)$$
We now use Fenchel-Rockafellar duality (Theorem 4.4.3 in Borwein & Zhu (2006)) to compute the dual variational representation of the right-hand side of (90). Define $F, G: C(X)\to(-\infty,\infty]$ by $F[g] = \infty\cdot 1_{g\nless 0} - \int\log|g|\,dP$ and $G[g] = \infty\cdot 1_{g\notin\Gamma} - E_Q[g]$. It is straightforward to show that $F$ and $G$ are convex, $F[-1]<\infty$, $G[-1]<\infty$, and $F$ is continuous at $-1$. Therefore
$$\inf_{g\in C(X)}\{F[g] + G[g]\} = \sup_{\eta\in C(X)^*}\{-F^*[\eta] - G^*[-\eta]\}\,, \qquad (91)$$
i.e.,
$$\sup_{g\in\Gamma:\,g<0}\Big\{E_Q[g] + \int\log|g|\,dP\Big\} = \inf_{\eta\in M(X)}\{F^*(\eta) + W^\Gamma(Q,\eta)\}\,, \qquad (92)$$
where $F^*(\eta) = \sup_{g\in C(X):\,g<0}\{\int g\,d\eta + \int\log|g|\,dP\}$. Now we show the infimum can be restricted to $\eta\in\mathcal P(X)$: if $\eta(X)\neq 1$ then, by taking $g=\pm n$, we find
$$W^\Gamma(Q,\eta) \geq n\,|Q(X) - \eta(X)| \to\infty \qquad (93)$$
as $n\to\infty$. Therefore $W^\Gamma(Q,\eta)=\infty$ if $\eta(X)\neq 1$. Now suppose $\eta\in M(X)$ is not positive. Take a measurable set $A$ with $\eta(A)<0$.
By Lusin's theorem, for all $\epsilon>0$ there exist a closed set $E_\epsilon\subset X$ and a continuous function $g_\epsilon\in C(X)$ such that $|\eta|(E_\epsilon^c)<\epsilon$, $0\leq g_\epsilon\leq 1$, and $g_\epsilon|_{E_\epsilon} = 1_A$. Define $g_{n,\epsilon} = -ng_\epsilon - 1$, $n\in\mathbb Z^+$. Then $g_{n,\epsilon}\in\{g\in C(X): g<0\}$, hence
$$F^*(\eta) \geq \int g_{n,\epsilon}\,d\eta + \int\log|g_{n,\epsilon}|\,dP \qquad (94)$$
$$= \int(-ng_\epsilon - 1)\,d\eta + \int\log(ng_\epsilon+1)\,dP \geq n\Big(|\eta(A)| - \int(g_\epsilon - 1_A)1_{E_\epsilon^c}\,d\eta\Big) - \eta(X) \geq n\big(|\eta(A)| - \epsilon\big) - \eta(X)\,.$$
Letting $\epsilon = |\eta(A)|/2$ and taking $n\to\infty$ gives $F^*(\eta) = \infty$. Therefore we conclude
$$\inf_{\eta\in M(X)}\{F^*(\eta) + W^\Gamma(Q,\eta)\} = \inf_{\eta\in\mathcal P(X)}\{F^*(\eta) + W^\Gamma(Q,\eta)\}\,. \qquad (95)$$
To evaluate $F^*(\eta)$ for $\eta\in\mathcal P(X)$ we make the change of variables $g = -\exp(h-1)$, $h\in C(X)$, to obtain
$$F^*(\eta) = \sup_{h\in C(X)}\Big\{\int h\,dP - \int e^{h-1}\,d\eta\Big\} - 1 = R(P\|\eta) - 1\,. \qquad (96)$$
Here we used the Legendre-transform variational representation of the KL divergence; see equation (1) in Birrell et al. (2022c) with $f(x) = x\log(x)$. Combining these results we obtain
$$\inf_{\eta\in\mathcal P(X)}\{R(P\|\eta) + W^\Gamma(Q,\eta)\} \geq \lim_{\alpha\to 1^-}R^{\Gamma,IC}_\alpha(P\|Q) \qquad (97)$$
$$\geq \sup_{g\in\Gamma:\,g<0}\Big\{\int g\,dQ + \int\log|g|\,dP\Big\} + 1 = \inf_{\eta\in M(X)}\{F^*(\eta) + W^\Gamma(Q,\eta)\} + 1 = \inf_{\eta\in\mathcal P(X)}\{R(P\|\eta) + W^\Gamma(Q,\eta)\}\,.$$
This completes the proof.

Now we prove Theorem 4.5, regarding the $\alpha\to\infty$ limit of the IC-$\Gamma$-Rényi divergences.

Theorem B.7. Let $\Gamma\subset C(X)$ be admissible and $P,Q\in\mathcal P(X)$. Then
$$\lim_{\alpha\to\infty}\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q) = \inf_{\eta\in\mathcal P(X)}\{D_\infty(P\|\eta) + W^\Gamma(Q,\eta)\} \qquad (98)$$
$$= \sup_{g\in\Gamma:\,g<0}\Big\{\int g\,dQ + \log\int|g|\,dP\Big\} + 1\,. \qquad (99)$$

Proof. First note that
$$\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q) = \inf_{\eta\in\mathcal P(X)}\{\alpha R_\alpha(P\|\eta) + W^\Gamma(Q,\eta)\} \qquad (100)$$
is non-decreasing in $\alpha$, therefore for $\eta\in\mathcal P(X)$ we have
$$\lim_{\alpha\to\infty}\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q) = \sup_{\alpha>1}\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q) \qquad (101)$$
$$\leq \sup_{\alpha>1}\{\alpha R_\alpha(P\|\eta) + W^\Gamma(Q,\eta)\} = D_\infty(P\|\eta) + W^\Gamma(Q,\eta)\,.$$
Taking the infimum over $\eta$ gives the upper bound
$$\lim_{\alpha\to\infty}\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q) \leq \inf_{\eta\in\mathcal P(X)}\{D_\infty(P\|\eta) + W^\Gamma(Q,\eta)\}\,. \qquad (102)$$
To prove the reverse inequality, use the variational formula (53) to write
$$\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q) = \alpha\sup_{g\in\Gamma:\,g<0}\Big\{\int g/\alpha\,dQ + \frac{1}{\alpha-1}\log\int|g/\alpha|^{(\alpha-1)/\alpha}dP + \alpha^{-1}(\log\alpha+1)\Big\} \qquad (103)$$
$$= \sup_{g\in\Gamma:\,g<0}\Big\{\int g\,dQ + \frac{\alpha}{\alpha-1}\log\int|g|^{(\alpha-1)/\alpha}dP\Big\} + 1\,.$$
Therefore, for all $g\in\Gamma$ with $g<0$, we can use the dominated convergence theorem to compute
$$\lim_{\alpha\to\infty}\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q) \geq \int g\,dQ + \lim_{\alpha\to\infty}\frac{\alpha}{\alpha-1}\log\int|g|^{(\alpha-1)/\alpha}dP + 1 = \int g\,dQ + \log\int|g|\,dP + 1\,. \qquad (104)$$
Maximizing over $g$ then gives
$$\lim_{\alpha\to\infty}\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q) \geq \sup_{g\in\Gamma:\,g<0}\Big\{\int g\,dQ + \log\int|g|\,dP\Big\} + 1\,. \qquad (105)$$
Next we use Fenchel-Rockafellar duality to derive a dual formulation of the right-hand side of (105). Define $F, G: C(X)\to(-\infty,\infty]$ by $G[g] = \infty\cdot 1_{g\notin\Gamma} - E_Q[g]$ and $F[g] = \infty\cdot 1_{g\nless 0} - \log\int|g|\,dP$. It is straightforward to prove that $G$ and $F$ are convex, $G[-1]<\infty$, $F[-1]<\infty$, and $F$ is continuous at $-1$. Therefore Fenchel-Rockafellar duality implies
$$\inf_{g\in C(X)}\{F[g] + G[g]\} = \sup_{\eta\in C(X)^*}\{-F^*[\eta] - G^*[-\eta]\}\,, \qquad (106)$$
i.e.,
$$\sup_{g\in\Gamma:\,g<0}\Big\{E_Q[g] + \log\int|g|\,dP\Big\} = \inf_{\eta\in M(X)}\{F^*[\eta] + W^\Gamma(Q,\eta)\}\,, \qquad (107)$$
where $F^*[\eta] = \sup_{g\in C(X):\,g<0}\{\int g\,d\eta + \log\int|g|\,dP\}$. We now prove that the infimum over $M(X)$ can be restricted to $\mathcal P(X)$. First suppose $\eta(X)\neq 1$. Then, because $\Gamma$ contains the constant functions, we have
$$W^\Gamma(Q,\eta) \geq \pm n\,(1-\eta(X)) \to\infty \qquad (108)$$
as $n\to\infty$, for an appropriate choice of sign. Therefore $W^\Gamma(Q,\eta)=\infty$ when $\eta(X)\neq 1$. Now suppose $\eta\in M(X)$ is not positive. Take a measurable set $A$ with $\eta(A)<0$. By Lusin's theorem, for all $\epsilon>0$ there exist a closed set $E_\epsilon\subset X$ and a continuous function $g_\epsilon\in C(X)$ such that $|\eta|(E_\epsilon^c)<\epsilon$, $0\leq g_\epsilon\leq 1$, and $g_\epsilon|_{E_\epsilon} = 1_A$. Define $g_{n,\epsilon} = -ng_\epsilon - 1$, $n\in\mathbb Z^+$. Then $g_{n,\epsilon}\in\{g\in C(X): g<0\}$, hence
$$F^*[\eta] \geq \int g_{n,\epsilon}\,d\eta + \log\int|g_{n,\epsilon}|\,dP \qquad (109)$$
$$= n\Big(|\eta(A)| - \int(g_\epsilon - 1_A)1_{E_\epsilon^c}\,d\eta\Big) - \eta(X) + \log\Big(n\int g_\epsilon\,dP + 1\Big) \geq n\big(|\eta(A)| - \epsilon\big) - \eta(X)\,.$$
Letting $\epsilon = |\eta(A)|/2$ and then taking $n\to\infty$ we see that $F^*[\eta]=\infty$ when $\eta$ is not positive. Together these results imply
$$\inf_{\eta\in M(X)}\{F^*[\eta] + W^\Gamma(Q,\eta)\} = \inf_{\eta\in\mathcal P(X)}\{F^*[\eta] + W^\Gamma(Q,\eta)\}\,. \qquad (110)$$
Finally, using Theorem A.5 we see that
$$F^*[\eta] + 1 = \sup_{g\in C(X):\,g<0}\Big\{\int g\,d\eta + \log\int|g|\,dP\Big\} + 1 = D_\infty(P\|\eta) \qquad (111)$$
for all $\eta\in\mathcal P(X)$. Combining these results gives
$$\lim_{\alpha\to\infty}\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q) \geq \sup_{g\in\Gamma:\,g<0}\Big\{\int g\,dQ + \log\int|g|\,dP\Big\} + 1 \qquad (112)$$
$$= \inf_{\eta\in M(X)}\{F^*[\eta] + W^\Gamma(Q,\eta)\} + 1 = \inf_{\eta\in\mathcal P(X)}\{D_\infty(P\|\eta) + W^\Gamma(Q,\eta)\} \geq \lim_{\alpha\to\infty}\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q)\,.$$
This completes the proof.

Finally, we prove Theorem 3.5, the data-processing inequality for the IC-$\Gamma$-Rényi divergences. First we introduce the following notation: let $Y$ be another compact metric space and let $K$ be a probability kernel from $X$ to $Y$. Given $P\in\mathcal P(X)$ we denote the composition of $P$ with $K$ by $P\otimes K$ (a probability measure on $X\times Y$) and we denote the marginal distribution on $Y$ by $K[P]$. Given $g\in C(X\times Y)$ we let $K[g]$ denote the function on $X$ given by $x\mapsto\int g(x,y)K_x(dy)$.

Theorem B.8 (Data Processing Inequality). Let $\alpha\in(0,1)\cup(1,\infty)$, $Q,P\in\mathcal P(X)$, and let $K$ be a probability kernel from $X$ to $Y$ such that $K[g]\in C(X)$ for all $g\in C(X\times Y)$.

1. If $\Gamma\subset C(Y)$ is admissible then
$$R^{\Gamma,IC}_\alpha(K[P]\|K[Q]) \leq R^{K[\Gamma],IC}_\alpha(P\|Q)\,. \qquad (113)$$
2. If $\Gamma\subset C(X\times Y)$ is admissible then
$$R^{\Gamma,IC}_\alpha(P\otimes K\|Q\otimes K) \leq R^{K[\Gamma],IC}_\alpha(P\|Q)\,. \qquad (114)$$

Proof. It is straightforward to show that admissibility of $\Gamma$ implies admissibility of $K[\Gamma]$. Hence we can write
$$R^{K[\Gamma],IC}_\alpha(P\|Q) = \sup_{\tilde g\in K[\Gamma]:\,\tilde g<0}\Big\{\int \tilde g\,dQ + \frac{1}{\alpha-1}\log\int|\tilde g|^{(\alpha-1)/\alpha}dP\Big\} + \alpha^{-1}(\log\alpha+1) \qquad (115)$$
$$\geq \sup_{g\in\Gamma:\,g<0}\Big\{\int K[g]\,dQ + \frac{1}{\alpha-1}\log\int|K[g]|^{(\alpha-1)/\alpha}dP\Big\} + \alpha^{-1}(\log\alpha+1)\,.$$
Using Jensen's inequality we can derive
$$\Big|\int g(y)K_x(dy)\Big|^{(\alpha-1)/\alpha} \leq \int|g(y)|^{(\alpha-1)/\alpha}K_x(dy)\,,\quad \alpha\in(0,1)\,, \qquad (116)$$
$$\Big|\int g(y)K_x(dy)\Big|^{(\alpha-1)/\alpha} \geq \int|g(y)|^{(\alpha-1)/\alpha}K_x(dy)\,,\quad \alpha>1\,. \qquad (117)$$
Combining (115)-(117) with the monotonicity of $y\mapsto\frac{1}{\alpha-1}\log(y)$ we arrive at (113). The proof of (114) is similar.

C DETAILS ON ANALYTICAL EXAMPLES AND COUNTEREXAMPLES

In this appendix we present several details regarding the analytical examples found in Section 5.

C.1 INFIMAL CONVOLUTION AND SCALING LIMITS

First we present a simple example that illustrates the infimal convolution formula and limiting properties from Sections 3 and 4. Let $P = \delta_0$, $Q_{x,c} = c\delta_0 + (1-c)\delta_x$ for $c\in(0,1)$, $x>0$, and let $\Gamma = {\rm Lip}^1$. Then for $L>0$ one can compute
$$R_\alpha(P\|Q_{x,c}) = \alpha^{-1}\log(1/c)\,, \qquad (118)$$
$$W^{L\Gamma}(Q_{x,c}, P) = (1-c)Lx\,, \qquad (119)$$
$$R^{L\Gamma,IC}_\alpha(P\|Q_{x,c}) = \sup_{a,b<0:\,|a-b|\leq x}\big\{Lca + L(1-c)b + \alpha^{-1}\log(L|a|)\big\} + \alpha^{-1}(\log\alpha+1) \qquad (120)$$
$$= \sup_{a<0}\big\{Lca + L(1-c)\min\{x+a, 0\} + \alpha^{-1}\log(L|a|)\big\} + \alpha^{-1}(\log\alpha+1)$$
$$= \alpha^{-1} + \alpha^{-1}\sup_{y>0}\begin{cases} -cy + \log y\,, & y\leq\alpha Lx\,,\\ (1-c)\alpha Lx - y + \log y\,, & y>\alpha Lx\,,\end{cases}$$
$$= \begin{cases}(1-c)Lx\,, & 0<\alpha Lx<1\,,\\ \alpha^{-1} - cLx + \alpha^{-1}\log(\alpha Lx)\,, & 1\leq\alpha Lx\leq 1/c\,,\\ \alpha^{-1}\log(1/c)\,, & \alpha Lx>1/c\,.\end{cases}$$
In particular, it is straightforward to show that
$$R^{L\Gamma,IC}_\alpha(P\|Q_{x,c}) \leq W^{L\Gamma}(Q_{x,c}, P)\,, \qquad (121)$$
$$\lim_{x\to 0^+} R^{L\Gamma,IC}_\alpha(P\|Q_{x,c}) = \lim_{x\to 0^+}(1-c)Lx = 0\,, \qquad (122)$$
$$\lim_{L\to\infty} R^{L\Gamma,IC}_\alpha(P\|Q_{x,c}) = \alpha^{-1}\log(1/c) = R_\alpha(P\|Q_{x,c})\,. \qquad (123)$$
We can also rewrite this in terms of the solution to the infimal convolution problem as follows:
$$R^{L\Gamma,IC}_\alpha(P\|Q_{x,c}) = \begin{cases} W^{L\Gamma}(Q_{x,c}, P)\,, & 0<\alpha Lx<1\,,\\ R_\alpha(P\|Q_{x,1/(\alpha Lx)}) + W^{L\Gamma}(Q_{x,c}, Q_{x,1/(\alpha Lx)})\,, & 1\leq\alpha Lx\leq 1/c\,,\\ R_\alpha(P\|Q_{x,c})\,, & \alpha Lx>1/c\,. \end{cases} \qquad (124)$$
Taking the worst-case-regret scaling limit we find
$$\lim_{\alpha\to\infty}\alpha R^{\Gamma/\alpha,IC}_\alpha(P\|Q_{x,c}) = \begin{cases}(1-c)x\,, & 0<x<1\,,\\ 1 - cx + \log(x)\,, & 1\leq x\leq 1/c\,,\\ \log(1/c)\,, & x>1/c\,,\end{cases} \qquad (125)$$
$$= \begin{cases} W^\Gamma(Q_{x,c}, P)\,, & 0<x<1\,,\\ D_\infty(P\|Q_{x,1/x}) + W^\Gamma(Q_{x,c}, Q_{x,1/x})\,, & 1\leq x\leq 1/c\,,\\ D_\infty(P\|Q_{x,c})\,, & x>1/c\,,\end{cases}$$
where $D_\infty(P\|Q_{x,c}) = \log(1/c)$.
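The piecewise expression in (120) can be checked numerically by brute force, since for this example a test function $g\in L\Gamma$ with $g<0$ is determined by its two values $g(0)=La$ and $g(x)=Lb$ with $a,b<0$ and $|a-b|\leq x$. The following minimal Python sketch (ours, not part of the paper's code; the grid-search routine `R_ic_bruteforce`, the grid sizes, and the sampled parameter ranges are arbitrary illustrative choices) maximizes the objective in (120) over a grid and compares the result with the closed-form cases.

```python
# Numerical sanity check (ours) of the closed-form expression (120) for
# R^{L Gamma,IC}_alpha(P || Q_{x,c}) with P = delta_0, Q_{x,c} = c delta_0 + (1-c) delta_x,
# Gamma = Lip^1: brute-force the supremum over g(0) = L a, g(x) = L b with a, b < 0, |a-b| <= x.
import numpy as np

def R_ic_bruteforce(alpha, L, x, c, n_grid=200_000):
    a_max = max(2.0 / (c * alpha * L), 2.0 * x, 10.0)   # large enough to contain the optimum
    a = -np.logspace(-8, np.log10(a_max), n_grid)       # candidate values a < 0 (log-spaced magnitudes)
    b = np.minimum(a + x, -1e-12)                       # optimal b given a: sup over b < 0 with |a-b| <= x
    obj = L * c * a + L * (1 - c) * b + np.log(L * np.abs(a)) / alpha
    return obj.max() + (np.log(alpha) + 1.0) / alpha

def R_ic_closed_form(alpha, L, x, c):
    t = alpha * L * x
    if t < 1:
        return (1 - c) * L * x                          # = W^{L Gamma}(Q_{x,c}, P)
    elif t <= 1 / c:
        return 1 / alpha - c * L * x + np.log(t) / alpha
    else:
        return np.log(1 / c) / alpha                    # = R_alpha(P || Q_{x,c})

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for _ in range(5):
        alpha = rng.uniform(1.1, 5.0)                   # alpha > 1, the regime emphasized in the paper
        L, x, c = rng.uniform(0.1, 5.0), rng.uniform(0.1, 3.0), rng.uniform(0.05, 0.9)
        bf, cf = R_ic_bruteforce(alpha, L, x, c), R_ic_closed_form(alpha, L, x, c)
        print(f"alpha={alpha:.2f} L={L:.2f} x={x:.2f} c={c:.2f}  brute={bf:.4f}  closed-form={cf:.4f}")
```

The brute-force and closed-form values agree up to the grid resolution, and one can also confirm the bound (121) and the limits (122)-(123) by varying $x$ and $L$ in the same script.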
C.2 Γ-RÉNYI-DONSKER-VARADHAN COUNTEREXAMPLE

As an alternative to Definition 3.1, one can attempt to regularize the Rényi divergences by restricting the test-function space in the variational representation (3), leading to the Γ-Rényi-Donsker-Varadhan divergences
$$R^{\Gamma,DV}_\alpha(P\|Q) := \sup_{\phi\in\Gamma}\Big\{\frac{1}{\alpha-1}\log\int e^{(\alpha-1)\phi}dP - \frac{1}{\alpha}\log\int e^{\alpha\phi}dQ\Big\}\,. \qquad (126)$$
The bound
$$\log\int e^{c\phi}dP \geq c\int\phi\,dP\,,\quad \phi\in\Gamma\,,\ c\in\mathbb{R}\,, \qquad (127)$$
implies that $R^{\Gamma,DV}_\alpha \leq W^\Gamma$ for $\alpha\in(0,1)$, making (126) a useful regularization of the Rényi divergences in this case; this utility was demonstrated in Pantazis et al. (2022), where it was used to construct GANs. However, the representation (126) is known to be poorly behaved when $\alpha>1$. Here we provide a counterexample showing that, unlike for the IC-$\Gamma$-Rényi divergences, the bound $R^{\Gamma,DV}_\alpha\leq W^\Gamma$ fails in general when $\alpha>1$. We conjecture that this fact is the reason for the poor behavior of $R^{\Gamma,DV}_\alpha$ when $\alpha>1$.

Let $P_{x,c} = c\delta_0 + (1-c)\delta_x$, $Q=\delta_0$ for $x>0$, $c\in(0,1)$, and $\Gamma_L = {\rm Lip}^L$. Then for $\alpha>1$ we have
$$R^{\Gamma_L,DV}_\alpha(P_{x,c}\|Q) = \sup_{a,b\in\mathbb{R}:\,|a-b|\leq Lx}\Big\{\frac{1}{\alpha-1}\log\big(c\exp((\alpha-1)a) + (1-c)\exp((\alpha-1)b)\big) - a\Big\} \qquad (128)$$
$$= \sup_{a\in\mathbb{R}}\Big\{\frac{1}{\alpha-1}\log\big(c\exp((\alpha-1)a) + (1-c)\exp((\alpha-1)(Lx+a))\big) - a\Big\} = \frac{1}{\alpha-1}\log\big(c + (1-c)\exp((\alpha-1)Lx)\big)\,,$$
$$W^{\Gamma_L}(P_{x,c}, Q) = \sup_{|a-b|\leq Lx}\{ca + (1-c)b - a\} = (1-c)Lx\,. \qquad (129)$$
Note that the condition $\alpha>1$ was crucial in computing the supremum over $b$ in (128). Using strict concavity of the logarithm one can then obtain the bound
$$R^{\Gamma_L,DV}_\alpha(P_{x,c}\|Q) > W^{\Gamma_L}(P_{x,c}, Q)\,. \qquad (130)$$

C.3 log-Γ-RÉNYI-DONSKER-VARADHAN COUNTEREXAMPLE

A second alternative to Definition 3.1 is to again start with (3) and then reduce the test-function space to $\{\alpha^{-1}\log g : g\in\Gamma,\, g>0\}$, i.e., define
$$R^{\Gamma,\log\text{-}DV}_\alpha(P\|Q) := \sup_{g\in\Gamma:\,g>0}\Big\{\frac{1}{\alpha-1}\log\int g^{(\alpha-1)/\alpha}dP - \frac{1}{\alpha}\log\int g\,dQ\Big\}\,. \qquad (131)$$
However, as we show below, this definition fails to provide a regularized divergence; in particular, it is incapable of meaningfully comparing Dirac distributions. Let $P=\delta_0$, $Q_x=\delta_x$, $x>0$, $\Gamma_L={\rm Lip}^L$. Then straightforward computations using the variational definition give
$$R^{\Gamma_L,\log\text{-}DV}_\alpha(P\|Q_x) = \alpha^{-1}\sup_{g\in\Gamma_L:\,g>0}\log\big(g(0)/g(x)\big) \qquad (132)$$
$$= \alpha^{-1}\sup_{b>0}\ \sup_{a>0:\,b-Lx\leq a\leq b+Lx}\log(a/b) = \alpha^{-1}\sup_{b>0}\log(1 + Lx/b) = \infty\,.$$
In contrast we have
$$R^{\Gamma_L,IC}_\alpha(P\|Q_x) = \sup_{a,b<0:\,|a-b|\leq x}\big\{Lb + \alpha^{-1}\log L + \alpha^{-1}\log|a|\big\} + \alpha^{-1}(\log\alpha+1) \qquad (133)$$
$$= \sup_{b<0}\big\{Lb + \alpha^{-1}\log(|b-x|)\big\} + \alpha^{-1}\log L + \alpha^{-1}(\log\alpha+1) = \begin{cases}\alpha^{-1}\log(\alpha Lx) + \alpha^{-1}\,, & x\geq 1/(\alpha L)\,,\\ Lx\,, & x<1/(\alpha L)\,.\end{cases}$$
In particular,
$$R^{\Gamma_L,IC}_\alpha(P\|Q_x) \leq Lx = W^{\Gamma_L}(P, Q_x)\,, \qquad (134)$$
$$\lim_{x\to 0^+} R^{\Gamma_L,IC}_\alpha(P\|Q_x) = 0\,, \qquad (135)$$
showing that $R^{\Gamma_L,IC}_\alpha$ is able to capture the convergence of $Q_x$ to $P$ as $x\to 0^+$, while $R^{\Gamma,\log\text{-}DV}_\alpha$ fails to do so.

D ADDITIONAL EXAMPLES

D.1 TRAINING SYMMETRY-PRESERVING GANS ON ROTMNIST

When learning a distribution $P$ that is invariant under a symmetry group (e.g., rotation invariance for images without a preferred orientation), one can greatly increase performance by using a GAN that incorporates the symmetry information into the generator and the discriminator space $\Gamma$, Dey et al. (2021). A theory of such symmetry-preserving GANs was developed in Birrell et al. (2022b), and the new divergences introduced in this paper satisfy the assumptions required to apply that theory. In Table 2 we demonstrate this effectiveness on the RotMNIST dataset, obtained by randomly rotating the original MNIST digit dataset LeCun et al. (1998), resulting in a rotation-invariant distribution. Note that incorporating more symmetry information into the GAN (i.e., progressing down the rows of the table) results in greatly improved performance, especially in the low-data regime.
Table 2: The median of the FIDs (lower is better), calculated every 1,000 generator updates for 20,000 iterations, over three independent trials. The number of training samples used for the experiments varies from 1% (600) to 10% (6,000) of the RotMNIST training set. The NN structure and hyperparameters are the same as those used in Section 5.4 of Birrell et al. (2022b). Eqv G (resp. Inv D) denotes that the symmetry information was incorporated into the generator (resp. discriminator), while CNN indicates that a convolutional NN was used (without rotational symmetry). $\Sigma$ denotes the rotation group used, where $C_n$ denotes the group of rotations by integer multiples of $2\pi/n$.

                                  Reverse $R^{\Gamma,IC}_2$ (FID)
    Architecture                    1%      5%      10%
    CNN G&D                         357     325     298
    Eqv G + CNN D, Σ = C4           464     271     263
    CNN G + Inv D, Σ = C4           366     321     302
    Eqv G + Inv D, Σ = C4           151     105     89
    Eqv G + Inv D, Σ = C8           114     71      62

D.2 VARIANCE OF RÉNYI ESTIMATORS

Here we compare the DV-Rényi and CC-Rényi estimators on the Gaussian test problem from Section 6.1, but in lower dimensions (1-D and 100-D). Qualitatively, the behavior is similar. In particular, it is striking that the DV-Rényi estimator performs extremely poorly even in the 1-D case (see Figure 3), while the CC-Rényi estimator has much lower variance and MSE when the separation between the distributions becomes larger (i.e., as $\mu_q$ increases).

Figure 3: Variance and MSE of estimators of the classical Rényi divergence between 1-dimensional Gaussians. DV-$R_\alpha$ refers to Rényi divergence estimators built using (3), while CC-$R_\alpha$ refers to estimators built using our new variational representation (4). We used a NN with one fully connected layer of 64 nodes, ReLU activations, and a poly-softplus final layer (for CC-Rényi). We trained for 10,000 epochs with a minibatch size of 500. The variance and MSE were computed using data from 50 independent runs. Note that the CC-Rényi estimator has significantly reduced variance and MSE compared to the DV-Rényi estimator, even when $\alpha$ is large.

Figure 4: Variance and MSE of estimators of the classical Rényi divergence between 100-dimensional Gaussians. DV-$R_\alpha$ refers to Rényi divergence estimators built using (3), while CC-$R_\alpha$ refers to estimators built using our new variational representation (4). We used a NN with one fully connected layer of 64 nodes, ReLU activations, and a poly-softplus final layer (for CC-Rényi). We trained for 10,000 epochs with a minibatch size of 500. The variance and MSE were computed using data from 50 independent runs. Again, the CC-Rényi estimator has significantly reduced variance and MSE compared to the DV-Rényi estimator, even when $\alpha$ is large.
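To make the comparison concrete, the following minimal Python sketch (ours, not the code used for Figures 3-4) shows how the two minibatch objectives can be evaluated on the 1-D Gaussian test problem: the DV-Rényi objective as in (126), and a CC-Rényi objective in the form of the variational formula used throughout Appendix B, $\sup_{g<0}\{E_Q[g] + \frac{1}{\alpha-1}\log E_P[|g|^{(\alpha-1)/\alpha}] + \alpha^{-1}(\log\alpha+1)\}$ (the main-text equations (3)-(4) may place normalization constants differently). The test functions `phi` and `g`, the parameters $\mu_q=2$, $\alpha=10$, and the use of fixed functions instead of trained neural networks are all hand-chosen assumptions for illustration; the script only demonstrates how the objectives are computed across independent minibatches, not a reproduction of the figures.

```python
# Illustrative sketch (ours): evaluate the DV-Renyi objective (126) and a CC-Renyi
# objective (the g < 0 variational formula from Appendix B) on minibatches drawn
# from P = N(0,1) and Q = N(mu_q,1), using fixed hand-chosen test functions in
# place of the trained NNs of Figures 3-4.
import numpy as np

def dv_objective(phi_P, phi_Q, alpha):
    """DV-Renyi minibatch objective; phi_P, phi_Q are phi evaluated on P- and Q-samples."""
    return (np.log(np.mean(np.exp((alpha - 1.0) * phi_P))) / (alpha - 1.0)
            - np.log(np.mean(np.exp(alpha * phi_Q))) / alpha)

def cc_objective(g_P, g_Q, alpha):
    """CC-Renyi minibatch objective; g_P, g_Q are a negative test function g on P- and Q-samples."""
    return (np.mean(g_Q)
            + np.log(np.mean(np.abs(g_P) ** ((alpha - 1.0) / alpha))) / (alpha - 1.0)
            + (np.log(alpha) + 1.0) / alpha)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu_q, alpha, m, n_runs = 2.0, 10.0, 500, 50        # minibatch size 500, 50 runs, as in Figures 3-4
    phi = lambda x: -mu_q * x + mu_q**2 / 2.0          # hand-chosen test function (shape of log dP/dQ)
    g = lambda x: -np.exp(phi(x))                      # hand-chosen negative test function
    dv, cc = [], []
    for _ in range(n_runs):
        xp = rng.normal(0.0, 1.0, m)                   # samples from P = N(0,1)
        xq = rng.normal(mu_q, 1.0, m)                  # samples from Q = N(mu_q,1)
        dv.append(dv_objective(phi(xp), phi(xq), alpha))
        cc.append(cc_objective(g(xp), g(xq), alpha))
    print("DV objective over runs: mean %.3f  std %.3f" % (np.mean(dv), np.std(dv)))
    print("CC objective over runs: mean %.3f  std %.3f" % (np.mean(cc), np.std(cc)))
```

The point of the sketch is structural rather than quantitative: the DV objective requires minibatch averages of the risk-sensitive terms $e^{(\alpha-1)\phi}$ and $e^{\alpha\phi}$, whose fluctuations grow rapidly with $\alpha$, whereas the CC objective only involves $g$ and $|g|^{(\alpha-1)/\alpha}$ with exponent strictly less than one, which is the mechanism behind the reduced variance reported in Figures 3 and 4.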