# Structure-preserving GANs

Jeremiah Birrell¹, Markos A. Katsoulakis¹, Luc Rey-Bellet¹, Wei Zhu¹

Abstract

Generative adversarial networks (GANs), a class of distribution-learning methods based on a two-player game between a generator and a discriminator, can generally be formulated as a minimax problem based on the variational representation of a divergence between the unknown and the generated distributions. We introduce structure-preserving GANs as a data-efficient framework for learning distributions with additional structure such as group symmetry, by developing new variational representations for divergences. Our theory shows that we can reduce the discriminator space to its projection onto the invariant discriminator space, using the conditional expectation with respect to the σ-algebra associated with the underlying structure. In addition, we prove that the discriminator space reduction must be accompanied by a careful design of structured generators, as flawed designs may easily lead to a catastrophic "mode collapse" of the learned distribution. We contextualize our framework by building symmetry-preserving GANs for distributions with intrinsic group symmetry, and demonstrate that both players, namely the equivariant generator and the invariant discriminator, play important but distinct roles in the learning process. Empirical experiments and ablation studies across a broad range of data sets, including real-world medical imaging, validate our theory and show that our proposed methods achieve significantly improved sample fidelity and diversity, almost by an order of magnitude as measured in Fréchet Inception Distance, especially in the small data regime.

¹Department of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, MA 01003, USA. Correspondence to: Jeremiah Birrell.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Figure 1. Real and GAN-generated ANHIR images dyed with the H&E stain [cf. Section 5.5]. Left panel: real images. Right panels: randomly selected $D^L_2$-GAN generated samples after 40,000 generator iterations. Top right panel: CNN G&D, i.e., the baseline model. Bottom right panel: Eqv G + Inv D, i.e., our proposed framework contextualized in learning group-invariant distributions. More images are available in Appendix F.

Figure 2. Randomly generated digits 2, 3 and 8 by GANs trained on the rotated MNIST images using 1% (600) training samples. (a): the baseline CNN model. (b): our proposed framework for learning group-invariant distributions.

1. Introduction

Since their introduction by Goodfellow et al. (2014), generative adversarial networks (GANs) have become a burgeoning domain in distribution learning with a diverse range of innovative applications (Karras et al., 2019; Zhu et al., 2019; Mustafa et al., 2019; Yi et al., 2019). Mathematically, the minimax game between a generator and a discriminator in a GAN can typically be formulated as minimizing a divergence, or another notion of distance with a variational representation, between the unknown and the generated distributions. Such formulations, however, do not make prior structural assumptions on the probability measures, making them sub-optimal in sample efficiency when learning distributions with intrinsic structures, such as the (rotation) group symmetry of medical images without preferred orientation; see Figure 1.
We introduce, in this work, structure-preserving GANs, a data-efficient framework for learning probability measures with embedded structures, by developing new variational representations for divergences between structured distributions. We demonstrate that efficient adversarial learning can be achieved by reducing the discriminator space to its projection onto its invariant subspace, using the conditional expectation with respect to the σ-algebra associated with the underlying structure. This practice, which is rigorously justified by our theory and generally applicable to a broad range of variational divergences, acts effectively as an unbiased regularization that prevents discriminator overfitting, a common challenge for GAN optimization in the limited data regime (Zhao et al., 2020). Furthermore, our theory suggests that the discriminator space reduction must be accompanied by correctly building generators that share the same probabilistic structure, as the lack thereof may easily lead to mode collapse in the trained model, i.e., the generated distribution samples only a subset of the support of the data source [cf. Figure 4a (2nd row)].

As an example, we contextualize our framework by building symmetry-preserving GANs for learning distributions with group symmetry. Unlike prior empirical work, our choice of equivariant generators and invariant discriminators is theoretically founded, and we show (theoretically and empirically) how a flawed design of equivariant generators easily results in the aforementioned mode collapse [cf. Figure 4a (4th row)]. Experiments and ablation studies over synthetic and real-world data sets validate our theory, disentangle the contributions of the structural priors on generators and discriminators, and demonstrate the significant outperformance of our framework in terms of both sample quality and diversity, in some cases by almost an order of magnitude as measured in Fréchet Inception Distance; see Figures 1 and 2 for a visual illustration.

2. Related Work

Neural generation of group-invariant distributions has mainly been proposed in a flow-based framework (Köhler et al., 2019; 2020; Rezende et al., 2019; Liu et al., 2019; Biloš & Günnemann, 2021; Boyda et al., 2021; Garcia Satorras et al., 2021). Such models typically use an equivariant normalizing flow to push forward a group-invariant prior distribution to a complex invariant target. In the context of GANs, Dey et al. (2021) intuitively replace the 2D convolutions with group convolutions (Cohen & Welling, 2016a) to build group-equivariant GANs; however, their empirical study has not been justified by theory, and their incomplete design of the equivariant generator may easily lead to a mode collapse of the learned model; see the discussion of Theorem 4.6. The existence of symmetry can often be deduced from prior or domain knowledge of the distribution, e.g., the rotation symmetry of medical images without preferred orientation. Symmetry detection from data has also been studied in recent works such as (Dehmamy et al., 2021). When extended from group symmetry to probability structures induced by other operators, our work is also related to GAN-assisted coarse-graining (CG) for molecular dynamics (Durumeric & Voth, 2019) and cosmology (Mustafa et al., 2019; Feder et al., 2020); see the end of Section 4.1 for a detailed discussion.
3. Background and Motivation

3.1. Generative adversarial networks

Generative adversarial networks are a class of methods for learning a probability distribution via a zero-sum game between a generator and a discriminator (Goodfellow et al., 2014; Arjovsky et al., 2017; Nowozin et al., 2016; Gulrajani et al., 2017). Specifically, let $(X, \mathcal{M})$ be a measurable space and $\mathcal{P}(X)$ be the set of probability measures on $X$; given a target distribution $Q \in \mathcal{P}(X)$, the original GAN proposed by Goodfellow et al. (2014) learns $Q$ by solving

$$\inf_{g \in G} D(Q \,\|\, P_g) = \inf_{g \in G} \sup_{\gamma \in \Gamma} H[\gamma; Q, P_g], \qquad (1)$$

where $H[\gamma; Q, P_g] = E_Q[\log \gamma] + E_{P_g}[\log(1 - \gamma)]$. The map $g : Z \to X$ in Eq. (1) is called a generator, which maps a random vector $z \in Z$ to a generated sample $g(z) \in X$, pushing forward the noise distribution $P \in \mathcal{P}(Z)$ (typically a Gaussian) to a probability measure $P_g \in \mathcal{P}(X)$, i.e., $P_g := g_\# P := P \circ g^{-1}$; the test function $\gamma : X \to \mathbb{R}$ is called a discriminator, which aims to differentiate the source distribution $Q$ and the generated probability measure $P_g$ by maximizing $H[\gamma; Q, P_g]$. The spaces $G$ and $\Gamma$, respectively, of generators and discriminators are both parametrized by neural networks (NNs), and the solution of model (1) is the best generator $g \in G$ that is able to fool all discriminators $\gamma \in \Gamma$ by achieving the smallest $D(Q \| P_g)$, which measures the dissimilarity between $Q$ and $P_g$.

3.2. Variational representations for divergences

Mathematically, most GANs can be formulated as minimizing the distance between the probability measures $Q$ and $P_g$ according to some divergence or probability metric with a variational representation $\sup_{\gamma \in \Gamma} H(\gamma; Q, P_g)$ as in (1). We hereby recast these formulations in a unified but flexible mathematical framework that will prove essential in Section 4.1. Let $\mathcal{M}(X)$ be the space of measurable functions on $X$ and $\mathcal{M}_b(X)$ be the subspace of bounded measurable functions. Given an objective functional $H : \mathcal{M}(X)^n \times \mathcal{P}(X) \times \mathcal{P}(X) \to [-\infty, \infty]$ and a test function space $\Gamma \subset \mathcal{M}(X)^n$, $n \in \mathbb{Z}_+$, we define

$$D^\Gamma_H(Q \| P) = \sup_{\gamma \in \Gamma} H(\gamma; Q, P). \qquad (2)$$

$D^\Gamma_H$ is called a divergence if $D^\Gamma_H \ge 0$ and $D^\Gamma_H(Q \| P) = 0$ if and only if $Q = P$, hence providing a notion of distance between probability measures. Variational representations of the form (2) have been widely used, including in GANs (Goodfellow et al., 2014; Nowozin et al., 2016; Arjovsky et al., 2017), divergence estimation (Nguyen et al., 2007; 2010; Ruderman et al., 2012; Birrell et al., 2021), determining independence through mutual information estimation (Belghazi et al., 2018), uncertainty quantification of stochastic processes (Chowdhary & Dupuis, 2013; Dupuis et al., 2016), bounding risk in probably approximately correct (PAC) learning (McAllester, 1999; Shawe-Taylor & Williamson, 1997; Catoni et al., 2008), parameter estimation (Broniatowski & Keziou, 2009), statistical mechanics and interacting particles (Kipnis & Landim, 1999), and large deviations (Dupuis & Ellis, 2011). It is known that formula (2) includes, through suitable choices of the functional $H(\gamma; Q, P)$ and the function space $\Gamma$, many divergences and probability metrics. Below we list several classes of examples.

(a) f-divergences. Let $f : [0, \infty) \to \mathbb{R}$ be convex and lower semi-continuous (LSC), with $f(1) = 0$ and $f$ strictly convex at $x = 1$. The f-divergence between $Q$ and $P$ is

$$D_f(Q \| P) = \sup_{\gamma \in \mathcal{M}_b(X)} \{ E_Q[\gamma] - E_P[f^*(\gamma)] \}, \qquad (3)$$

where $f^*$ denotes the Legendre transform of $f$. Some notable examples of f-divergences include the Kullback-Leibler (KL) divergence and the family of α-divergences, which are constructed, respectively, from

$$f_{\mathrm{KL}}(x) = x \log x, \qquad f_\alpha(x) = \frac{x^\alpha - 1}{\alpha(\alpha - 1)}, \quad \alpha > 0, \ \alpha \neq 1. \qquad (4)$$
The flexibility of $f$ allows one to tailor the divergence to the data source, e.g., for heavy-tailed data. However, the formula (3) becomes $D_f(Q \| P) = \infty$ when $Q$ is not absolutely continuous with respect to $P$, limiting its efficacy in comparing distributions with low-dimensional support.

(b) Γ-Integral Probability Metrics (IPMs). Given $\Gamma \subset \mathcal{M}_b(X)$, the Γ-IPM between $Q$ and $P$ is defined as

$$W^\Gamma(Q, P) = \sup_{\gamma \in \Gamma} \{ E_Q[\gamma] - E_P[\gamma] \}. \qquad (5)$$

Apart from the Wasserstein metric when $\Gamma = \mathrm{Lip}_1(X)$ (the space of 1-Lipschitz functions), examples of IPMs also include the total variation metric, the Dudley metric, and maximum mean discrepancy (MMD) (Müller, 1997; Sriperumbudur et al., 2012). With suitable choices of $\Gamma$, IPMs are able to meaningfully compare non-absolutely continuous distributions, but they can potentially fail at comparing distributions with heavy tails (Birrell et al., 2022).

(c) (f, Γ)-divergences. This class of divergences was introduced by Birrell et al. (2022), and it subsumes both f-divergences and Γ-IPMs. Given a function $f$ satisfying the same conditions as in the definition of the f-divergence and $\Gamma \subset \mathcal{M}_b(X)$, the (f, Γ)-divergence is defined as

$$D^\Gamma_f(Q \| P) = \sup_{\gamma \in \Gamma} \left\{ E_Q[\gamma] - \Lambda^P_f[\gamma] \right\}, \qquad (6)$$

where $\Lambda^P_f[\gamma] = \inf_{\nu \in \mathbb{R}} \{ \nu + E_P[f^*(\gamma - \nu)] \}$. One can verify that (6) includes the f-divergence (3) as a special case when $\Gamma = \mathcal{M}_b(X)$, and it is demonstrated in (Birrell et al., 2022) that under suitable assumptions on $\Gamma$ we have

$$0 \le D^\Gamma_f(Q \| P) \le \min\{ D_f(Q \| P), W^\Gamma(Q, P) \}, \qquad (7)$$

making $D^\Gamma_f$ suitable for comparing non-absolutely continuous distributions with heavy tails. An example of the (f, Γ)-divergence is the Lipschitz α-divergence,

$$D^L_\alpha(Q \| P) = \sup_{\gamma \in \mathrm{Lip}^L_b(X)} \{ E_Q[\gamma] - \Lambda^P_{f_\alpha}[\gamma] \}, \qquad (8)$$

where $f = f_\alpha$ as in Eq. (4), and $\Gamma = \mathrm{Lip}^L_b(X)$ is the space of bounded L-Lipschitz functions.

(d) Sinkhorn divergences. The Wasserstein metric associated with a cost function $c : X^2 \to \mathbb{R}_+$ has the variational representation $W^\Gamma_c(Q, P) = \sup_{\gamma = (\gamma_1, \gamma_2) \in \Gamma} \{ E_P[\gamma_1] + E_Q[\gamma_2] \}$, where $\Gamma = \{ (\gamma_1, \gamma_2) \in C(X)^2 : \gamma_1(x) + \gamma_2(y) \le c(x, y) \}$, and $C(X)$ is the space of continuous functions on $X$. The Sinkhorn divergence is given by

$$SD^\Gamma_{c,\epsilon}(Q, P) = W^\Gamma_{c,\epsilon}(Q, P) - \frac{W^\Gamma_{c,\epsilon}(Q, Q) + W^\Gamma_{c,\epsilon}(P, P)}{2}, \qquad (9)$$

where $W^\Gamma_{c,\epsilon}(Q, P)$ is the entropic regularization of the Wasserstein metric [cf. Eq. (33)].

We refer to Appendix A for a detailed discussion of the variational divergences introduced above. In all the aforementioned examples, the choice of the discriminator space, $\Gamma$, is a defining characteristic of the divergence. We will explain, in Section 4.1, a general framework, i.e., structure-preserving GANs, for incorporating added structural knowledge of the probability distributions or data sets into the choice of $\Gamma$, leading to enhanced performance and data efficiency in adversarial learning of structured distributions.
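To make the objectives above concrete, the following is a minimal NumPy sketch of the sample-based objectives in Eqs. (3), (5), and (6) for a fixed test function γ and $f = f_{\mathrm{KL}}$ (whose Legendre transform is $f^*(y) = e^{y-1}$); in a GAN, γ would be a neural network and each objective would be maximized over the discriminator family Γ. The helper name and the scalar search over ν are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def variational_objectives(gamma, x_Q, x_P):
    """Monte Carlo versions of the discriminator objectives in Section 3.2.

    gamma : callable mapping an array of samples to real-valued scores.
    x_Q, x_P : samples from the target Q and the model distribution P.
    Uses f = f_KL = x log x, so f*(y) = exp(y - 1).
    """
    g_Q, g_P = gamma(x_Q), gamma(x_P)

    # Gamma-IPM objective, Eq. (5): E_Q[gamma] - E_P[gamma]
    ipm = g_Q.mean() - g_P.mean()

    # f-divergence objective, Eq. (3): E_Q[gamma] - E_P[f*(gamma)]
    f_obj = g_Q.mean() - np.mean(np.exp(g_P - 1.0))

    # (f, Gamma)-objective, Eq. (6): E_Q[gamma] - inf_nu { nu + E_P[f*(gamma - nu)] }
    Lambda_P_f = minimize_scalar(lambda nu: nu + np.mean(np.exp(g_P - nu - 1.0))).fun
    f_gamma_obj = g_Q.mean() - Lambda_P_f

    return ipm, f_obj, f_gamma_obj

# Example usage with a simple bounded test function on R^2 (purely illustrative):
rng = np.random.default_rng(0)
x_Q = rng.normal(loc=1.0, size=(5000, 2))
x_P = rng.normal(loc=0.0, size=(5000, 2))
gamma = lambda x: np.tanh(x @ np.array([1.0, 1.0]))
print(variational_objectives(gamma, x_Q, x_P))
```

Replacing $f_{\mathrm{KL}}$ with $f_\alpha$ and restricting γ to a bounded Lipschitz family $\mathrm{Lip}^L_b(X)$ would give a sample estimate of the Lipschitz α-divergence objective (8) used later in the experiments.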
3.3. Group invariance and equivariance

We first introduce the structure-preserving GAN framework in the context of learning distributions with group symmetry. We emphasize that the focus of this work is not to discuss the group-invariance properties of probability measures (which can be found in, e.g., (Schindler, 2003)), but to understand how to incorporate such structural information into the generator/discriminator of GANs such that invariant probability distributions can be learned more efficiently. We first require the following background and notation.

Groups and group actions. A group is a set $\Sigma$ equipped with a binary operator, the group product, satisfying the axioms of associativity, identity, and invertibility. Given a group $\Sigma$ and a set $X$, a map $T : \Sigma \times X \to X$ is called a group action if, for all $\sigma \in \Sigma$, $T_\sigma := T(\sigma, \cdot) : X \to X$ is an automorphism on $X$, and $T_{\sigma_1} \circ T_{\sigma_2} = T_{\sigma_1 \cdot \sigma_2}$ for all $\sigma_1, \sigma_2 \in \Sigma$. In this paper, we will mainly consider the 2D rotation group $SO(2) = \{ R_\theta \in \mathbb{R}^{2 \times 2} : \theta \in \mathbb{R} \}$ and the roto-reflection group $O(2) = \{ R_{m,\theta} \in \mathbb{R}^{2 \times 2} : m \in \mathbb{Z}, \theta \in \mathbb{R} \}$, where $R_\theta$ is the 2D rotation matrix of angle $\theta$, and $R_{m,\theta}$ includes a further reflection if $m \equiv 1 \pmod 2$. The natural actions of $SO(2)$ and $O(2)$ on $\mathbb{R}^2$ are matrix multiplications, which can be lifted to actions on the space of (k-channel) planar signals $L^2(\mathbb{R}^2, \mathbb{R}^k)$, e.g., RGB images. More specifically, when $\Sigma$ is $SO(2)$ or $O(2)$, let $T_\sigma f(x) := f(\sigma^{-1} x)$, $\sigma \in \Sigma$, $f \in L^2(\mathbb{R}^2, \mathbb{R}^k)$. We will also consider the finite subgroups $C_n$ and $D_n$, respectively, of $SO(2)$ and $O(2)$, with the rotation angles $\theta$ restricted to integer multiples of $2\pi/n$.

Group equivariance and invariance. Let $T^Z$ and $T^X$, respectively, be $\Sigma$-actions on the spaces $Z$ and $X$. A map $g : Z \to X$ is called $\Sigma$-equivariant if $T^X_\sigma \circ g = g \circ T^Z_\sigma$ for all $\sigma \in \Sigma$. A map $\gamma : X \to Y$ is called $\Sigma$-invariant if $\gamma \circ T^X_\sigma = \gamma$ for all $\sigma \in \Sigma$. Invariance is thus a special case of equivariance after equipping $Y$ with the trivial action $T^Y_\sigma y \equiv y$, $\sigma \in \Sigma$. In the context of NNs, achieving equivariance/invariance via group-equivariant CNNs (G-CNNs) has been well studied, and we refer the reader to (Cohen et al., 2019; Weiler & Cesa, 2019) for a complete theory of G-CNNs. Let $G$ be a collection of measurable maps $g : Z \to X$. We denote its subset of $\Sigma$-equivariant maps as $G^{\mathrm{eqv}}_\Sigma := \{ g \in G : T^X_\sigma \circ g = g \circ T^Z_\sigma, \ \forall \sigma \in \Sigma \}$. Similarly, let $\Gamma$ be a set of measurable functions $\gamma : X \to Y$; its subset, $\Gamma^{\mathrm{inv}}_\Sigma$, of $\Sigma$-invariant functions is defined as

$$\Gamma^{\mathrm{inv}}_\Sigma := \{ \gamma \in \Gamma : \gamma \circ T^X_\sigma = \gamma, \ \forall \sigma \in \Sigma \}. \qquad (10)$$

The function space $\Gamma$ is called closed under $\Sigma$ if

$$\gamma \circ T^X_\sigma \in \Gamma, \quad \forall \sigma \in \Sigma, \ \gamma \in \Gamma. \qquad (11)$$

Finally, a probability measure $P \in \mathcal{P}(X)$ is called $\Sigma$-invariant if $P = P \circ (T^X_\sigma)^{-1}$ for all $\sigma \in \Sigma$. For instance, the distribution of medical images without orientation preference should be $SO(2)$-invariant; see Figure 1. The set of all $\Sigma$-invariant distributions on $X$ is denoted as

$$\mathcal{P}_\Sigma(X) := \{ P \in \mathcal{P}(X) : P \text{ is } \Sigma\text{-invariant} \}. \qquad (12)$$

3.4. Definition of the Haar measure on Σ and the symmetrization operators $S_\Sigma$ and $S^\#_\Sigma$

We will make frequent use of the symmetrization operators, on both functions and probability distributions, that are induced by a group action on $X$. These are constructed using the unique Haar probability measure, $\mu_\Sigma$, of a compact Hausdorff topological group $\Sigma$ (see, e.g., Chapter 11 in Folland (2013)). Intuitively, the Haar measure is the uniform probability measure on $\Sigma$. Mathematically, this is expressed via the invariance of the Haar measure under group multiplication, $\mu_\Sigma(\sigma E) = \mu_\Sigma(E \sigma) = \mu_\Sigma(E)$ for all $\sigma \in \Sigma$ and all Borel sets $E \subset \Sigma$. This is a generalization of the invariance of the Lebesgue measure under translations and rotations. The Haar measure can be used to define symmetrization operators on both functions and probability measures as follows (going forward, we assume the group action is measurable).

Symmetrization of functions: $S_\Sigma : \mathcal{M}_b(X) \to \mathcal{M}_b(X)$,

$$S_\Sigma[\gamma](x) := \int_\Sigma \gamma(T_{\sigma'}(x)) \, \mu_\Sigma(d\sigma') = E_{\sigma' \sim \mu_\Sigma}[\gamma \circ T_{\sigma'}(x)]. \qquad (13)$$

Symmetrization of probability measures (dual operator): $S^\#_\Sigma : \mathcal{P}(X) \to \mathcal{P}(X)$, defined for $\gamma \in \mathcal{M}_b(X)$ by

$$E_{S^\#_\Sigma[P]}[\gamma] := \int_X S_\Sigma[\gamma](x) \, dP(x) = E_P[S_\Sigma[\gamma]]. \qquad (14)$$

Remark 3.1. Sampling from $S^\#_\Sigma[P]$: If $x_i$, $i = 1, \dots, N$, are samples from $P$, and $\sigma_j$, $j = 1, \dots, M$, are samples from the Haar probability measure $\mu_\Sigma$ (all independent), then $T_{\sigma_j}(x_i)$ are samples from $S^\#_\Sigma[P]$. If $P$ is $\Sigma$-invariant then the use of $T_{\sigma_j}(x_i)$ can be viewed as a form of data augmentation.
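For a finite group such as $C_4$ acting on images by 90° rotations, the Haar measure is uniform over the four rotations and Remark 3.1 reduces to ordinary rotation augmentation. A minimal NumPy sketch (the helper name is ours, chosen for illustration):

```python
import numpy as np

def sample_symmetrized(images, rng=None):
    """Given samples x_i from P, draw sigma_j ~ Haar(C4), i.e. uniformly among the
    four 90-degree rotations, and return T_{sigma_j}(x_i); by Remark 3.1 these are
    samples from S#_Sigma[P].  If P is already C4-invariant this is plain rotation
    augmentation of the data set.
    """
    rng = np.random.default_rng() if rng is None else rng
    ks = rng.integers(0, 4, size=len(images))   # one independent group element per sample
    return np.stack([np.rot90(img, k, axes=(0, 1)) for img, k in zip(images, ks)])

# usage: aug = sample_symmetrized(np.random.rand(16, 28, 28))
```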
The following lemma provides several key properties of the symmetrization operators. Proofs and further details can be found in Appendix B, Lemma B.1.

Lemma 3.2. (a) The symmetrization operator $S_\Sigma : \mathcal{M}_b(X) \to \mathcal{M}_b(X)$ is a projection onto the subspace of $\Sigma$-invariant bounded measurable functions, $\mathcal{M}^{\mathrm{inv}}_{b,\Sigma}$ [cf. Eq. (10)]. (b) The symmetrization operator $S^\#_\Sigma : \mathcal{P}(X) \to \mathcal{P}(X)$ is a projection onto the subset of $\Sigma$-invariant probability measures, $\mathcal{P}_\Sigma(X)$ [cf. Eq. (12)]. (c) $S_\Sigma$ is the conditional expectation with respect to the σ-algebra $\mathcal{M}_\Sigma$ of $\Sigma$-invariant sets, $\mathcal{M}_\Sigma := \{ \text{measurable sets } B \in \mathcal{M} : T_\sigma(B) = B, \ \forall \sigma \in \Sigma \}$, i.e., $S_\Sigma[\gamma] = E_P[\gamma \mid \mathcal{M}_\Sigma]$ for all $\gamma \in \mathcal{M}_b(X)$, $P \in \mathcal{P}_\Sigma(X)$.

Lemma 3.2 implies that since $S_\Sigma$ and $S^\#_\Sigma$ are projections onto $\mathcal{M}^{\mathrm{inv}}_{b,\Sigma}$ and $\mathcal{P}_\Sigma(X)$, respectively, i.e., $S_\Sigma \circ S_\Sigma = S_\Sigma$ and $S^\#_\Sigma \circ S^\#_\Sigma = S^\#_\Sigma$, they are necessarily structure-preserving, namely here symmetry-preserving. We discuss a general concept of structure-preserving operators at the end of Section 4.1.

4. Structure-preserving GANs

We present in this section our theory for structure-preserving GANs. The results are first stated for the special case of learning group-invariant distributions. We then extend the theory to a general class of structure-preserving operators.

4.1. Invariant discriminator theorem

We demonstrate, under assumptions outlined below and for broad classes of divergences and probability metrics, that for $\Sigma$-invariant probability measures $P, Q$ we can restrict the test function space $\Gamma$ (the discriminator space in GANs) in (2) to the subset of $\Sigma$-invariant functions, $\Gamma^{\mathrm{inv}}_\Sigma$ [cf. Eq. (10)], without changing the divergence/probability metric, i.e.,

$$D^\Gamma_H(Q \| P) = D^{\Gamma^{\mathrm{inv}}_\Sigma}_H(Q \| P) \quad \text{for all } Q, P \in \mathcal{P}_\Sigma. \qquad (15)$$

The space $\Gamma^{\mathrm{inv}}_\Sigma$ is a much smaller and more efficient discriminator space to optimize over in the proposed GANs. We rigorously formulate our results in the following theorem, which first considers the (f, Γ)-divergence (6), the Γ-IPM (5), and the Sinkhorn divergence (9). The proof is found in Appendix B.

Theorem 4.1. If $S_\Sigma[\Gamma] \subset \Gamma$ and the probability measures $P, Q$ are $\Sigma$-invariant then

$$D^\Gamma(Q \| P) = D^{\Gamma^{\mathrm{inv}}_\Sigma}(Q \| P), \qquad (16)$$

where $D^\Gamma$ is an (f, Γ)-divergence or a Γ-IPM. Eq. (16) also holds for Sinkhorn divergences if the cost is $\Sigma$-invariant (i.e., $c(T_\sigma(x), T_\sigma(y)) = c(x, y)$ for all $\sigma \in \Sigma$, $x, y \in X$).

Remark 4.2. Eq. (16) can be generalized to a wider range of objective functionals satisfying appropriate convexity, continuity, and invariance conditions; see Theorem B.10. For the $\Sigma$-invariant (f, Γ)-divergences, we also obtain a refined version of (7), given by the following infimal convolution formula (for appropriate Γ and f):

$$D^{\Gamma^{\mathrm{inv}}_\Sigma}_f(Q \| P) = \inf_{\eta \in \mathcal{P}_\Sigma(X)} \{ D_f(\eta \| P) + W^{\Gamma^{\mathrm{inv}}_\Sigma}(Q, \eta) \} \qquad (17)$$

for all $Q, P \in \mathcal{P}_\Sigma(X)$. See Appendix D for details on (17) and other results generalizing those in (Birrell et al., 2022).

Theorem 4.1 suggests that the discriminator space reduction effectively acts as an unbiased regularization to prevent discriminator overfitting, a common challenge for GAN optimization in the small data regime. Using invariant discriminators can thus improve the data efficiency of the model; this will be empirically verified in Tables 1-3.
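For a finite group, the projection $S_\Sigma$ in Theorem 4.1 has a particularly simple realization: averaging a discriminator over the group orbit. The hypothetical PyTorch wrapper below (our own sketch, not the layer-level G-CNN construction used in the paper's experiments) computes $S_\Sigma[\gamma](x) = \frac{1}{|\Sigma|}\sum_{\sigma \in \Sigma}\gamma(T_\sigma(x))$ for Σ = C4 acting by 90° rotations, so the wrapped discriminator is Σ-invariant by construction; cf. Example 2 in the list that follows.

```python
import torch
import torch.nn as nn

class C4InvariantDiscriminator(nn.Module):
    """S_Sigma-projection of an arbitrary image discriminator onto the C4-invariant
    functions: D_inv(x) = (1/4) * sum_k D(rot90(x, k)).  Because the average runs over
    the whole group, D_inv(rot90(x, k0)) = D_inv(x) for every k0, i.e. D_inv lies in
    Gamma^inv_Sigma (cf. Lemma 3.2(a)).
    """
    def __init__(self, base_disc: nn.Module):
        super().__init__()
        self.base = base_disc

    def forward(self, x):            # x: (batch, channels, H, W)
        scores = [self.base(torch.rot90(x, k, dims=(-2, -1))) for k in range(4)]
        return torch.stack(scores, dim=0).mean(dim=0)
```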
Examples satisfying the key condition $S_\Sigma[\Gamma] \subset \Gamma$ of Theorem 4.1:

1. First, we consider the standard f-divergence (3) between two $\Sigma$-invariant probability measures $P$ and $Q$. The identity $S_\Sigma[\mathcal{M}_b(X)] = \mathcal{M}^{\mathrm{inv}}_{b,\Sigma}(X)$ from Lemma 3.2 implies that the function space can be restricted to the $\Sigma$-invariant bounded functions $\mathcal{M}^{\mathrm{inv}}_{b,\Sigma}(X)$, giving rise to an (f, Γ)-divergence (6) with $\Gamma = \mathcal{M}^{\mathrm{inv}}_{b,\Sigma}(X)$, i.e., $D_f(Q \| P) = D^{\mathcal{M}^{\mathrm{inv}}_{b,\Sigma}(X)}_f(Q \| P)$.

2. If the group $\Sigma$ is finite and the function space $\Gamma \subset \mathcal{M}_b(X)$ is convex and closed under $\Sigma$ in the sense of (11), then $S_\Sigma[\Gamma] \subset \Gamma$, as readily follows from the definition (13). Our implemented examples in Section 5 fall under this category.

3. The space of 1-Lipschitz functions on a metric space $(X, d)$, assuming the action is 1-Lipschitz, i.e., $d(T_\sigma(x), T_\sigma(y)) \le d(x, y)$ for all $\sigma \in \Sigma$, $x, y \in X$.

4. The unit ball in an appropriate RKHS; see Lemma C.1.

5. More generally, any $\Gamma$ that is convex and closed in the weak topology on $\Gamma$ induced by integration against finite signed measures; see Lemma C.3 for a proof.

Extension to other structure-preserving operators. Let $K_x(dx')$ be a probability kernel from $X$ to $X$ and define $S_K : \mathcal{M}_b(X) \to \mathcal{M}_b(X)$ by $S_K[f](x) := \int f(x') K_x(dx')$. $K$ also defines a dual map $S^\#_K : \mathcal{P}(X) \to \mathcal{P}(X)$, $S^\#_K[P] := \int K_x(\cdot) P(dx)$. Let $\mathcal{P}_K(X)$ be the set of K-invariant probability measures, i.e., $\mathcal{P}_K(X) = \{ P \in \mathcal{P}(X) : S^\#_K[P] = P \}$. In this setting we have the following generalization of Theorem 4.1.

Theorem 4.3. If $S_K[\Gamma] \subset \Gamma$ and $Q, P \in \mathcal{P}_K(X)$ then

$$D^\Gamma(Q \| P) = D^{S_K[\Gamma]}(Q \| P), \qquad (18)$$

where $D^\Gamma$ is an (f, Γ)-divergence or a Γ-IPM. It also holds when $D^\Gamma$ is a Sinkhorn divergence if $S_K[c(\cdot, y)] = c(\cdot, y)$ and $S_K[c(x, \cdot)] = c(x, \cdot)$ for all $x, y \in X$. In addition, if $S_K$ is a projection (i.e., $S_K \circ S_K = S_K$) then $S_K[\Gamma] = \Gamma^{\mathrm{inv}}_K$, where $\Gamma^{\mathrm{inv}}_K := \{ \gamma \in \Gamma : S_K[\gamma] = \gamma \}$.

Remark 4.4. Conditional expectations, $S_K[f] := E_P[f \mid \mathcal{A}]$, are a special case of Theorem 4.3 with the kernel being a regular conditional probability, $K = P(\cdot \mid \mathcal{A})$. Here $\Gamma^{\mathrm{inv}}_K$ is the set of $\mathcal{A}$-measurable functions in $\Gamma$, which can be significantly smaller than $\Gamma$. The case where $\mathcal{A} = \sigma(\xi)$ for some random variable $\xi$ has particular importance in coarse-graining of molecular dynamics (Noid, 2013; Pak & Voth, 2018); see Appendix E. The result for $\Sigma$-invariant measures, Theorem 4.1, is also a special case of Theorem 4.3, where the kernel is $K_x = \mu_\Sigma \circ R_x^{-1}$, $R_x(\sigma) := T_\sigma(x)$. Alternatively, Lemma 3.2(c) shows that $S_\Sigma$ can be written as a conditional expectation.

Remark 4.5. Theorem 4.3 is an instance of the data processing inequality; see Theorem 2.21 in (Birrell et al., 2022).

4.2. Equivariant generator theorem

Figure 3. The Σ-symmetrization layer (enclosed in the red rectangle), which is missing in (Dey et al., 2021), ensures generator equivariance, which is critical in preventing GAN mode collapse [cf. Remark 4.11].

Theorem 4.1 provides the theoretical justification for reducing the discriminator space $\Gamma$ to its $\Sigma$-invariant subset $\Gamma^{\mathrm{inv}}_\Sigma$ when the source $Q$ and the generated measure $P_g$ are both $\Sigma$-invariant. Our next theorem, however, shows that this practice can easily lead to mode collapse if one of the two distributions is not $\Sigma$-invariant; see Figure 4a. The proof is deferred to Appendix B.

Theorem 4.6. Let $S_\Sigma[\Gamma] \subset \Gamma$ and $P, Q \in \mathcal{P}(X)$, i.e., not necessarily $\Sigma$-invariant. We have

$$D^{\Gamma^{\mathrm{inv}}_\Sigma}(Q \| P) = D^\Gamma(S^\#_\Sigma[Q] \,\|\, S^\#_\Sigma[P]), \qquad (19)$$

where $D^\Gamma$ is an (f, Γ)-divergence or a Γ-IPM.

Remark 4.7. The analogous result for Sinkhorn divergences also holds if the cost is separately $\Sigma$-invariant in each variable, i.e., $c(T_\sigma(x), y) = c(x, y)$ and $c(x, T_\sigma(y)) = c(x, y)$ for all $\sigma \in \Sigma$, $x, y \in X$. However, this is a strong assumption that is not satisfied by most commonly used cost functions and actions.

Theorem 4.6 has the following implications. If one uses a $\Sigma$-invariant GAN (i.e., invariant discriminators and equivariant generators) to learn a non-invariant data source $Q$, then one will in fact learn the symmetrized version $S^\#_\Sigma[Q]$. On the other hand, if the data source $Q$ is $\Sigma$-invariant (i.e., $S^\#_\Sigma[Q] = Q$; cf.
Lemma 3.2) but the GAN-generated distribution $P_g$ is not, then discriminators from $\Gamma^{\mathrm{inv}}_\Sigma$ alone cannot differentiate $Q$ and $P_g$, i.e., $D^{\Gamma^{\mathrm{inv}}_\Sigma}(Q \| P_g) = 0$, as long as $Q = S^\#_\Sigma[P_g]$. This suggests that $P_g$ can easily suffer from mode collapse, as it only needs to equal $Q$ after $\Sigma$-symmetrization; we refer readers to Figure 4a (2nd and 4th rows) for a visual illustration, where a unimodal $P_g$ can be erroneously selected as the best-fitting model, even though its $\Sigma$-symmetrization $S^\#_\Sigma[P_g]$ should be the correct one. To prevent this from happening, one needs to ensure the generator produces a $\Sigma$-invariant distribution $P_g$; this is guaranteed by the following theorem.

Theorem 4.8. If $P_Z \in \mathcal{P}(Z)$ is $\Sigma$-invariant and $g : Z \to X$ is $\Sigma$-equivariant, then the push-forward measure $P_g := P_Z \circ g^{-1}$ is $\Sigma$-invariant, i.e., $P_g \in \mathcal{P}_\Sigma(X)$.

See Appendix B for a proof. We note that equivariant flow-based methods have also been proposed based on a strategy similar to Theorem 4.8. We refer readers to Section 2 for a discussion of related works.

Remark 4.9. Suppose $g = \gamma_2 \circ \gamma_1$ is a composition of two maps, $\gamma_1 : Z \to W$ and $\gamma_2 : W \to X$. Even if $\gamma_1$ is not $\Sigma$-equivariant (in fact, $Z$ does not even need to be equipped with a $\Sigma$-action $T^Z_\sigma$), as long as $P_{\gamma_1} \in \mathcal{P}(W)$ is $\Sigma$-invariant and $\gamma_2$ is $\Sigma$-equivariant, the push-forward measure $P_g \in \mathcal{P}(X)$ is still $\Sigma$-invariant.

To construct the $\Sigma$-invariant noise source required in Theorem 4.8 (or Remark 4.9), one can begin with an arbitrary noise source and use a $\Sigma$-symmetrization layer, as described by the following theorem.

Theorem 4.10. Let $W \sim \mu_\Sigma$ and let $N$ be a Z-valued random variable (i.e., an arbitrary noise source). If $N$ and $W$ are independent, then the distribution of $T^Z(W, N)$ is $\Sigma$-invariant.

Remark 4.11. Dey et al. (2021) also proposed to use G-CNNs to generate images with $C_4$/$D_4$-invariant distributions. However, the first step in their model, i.e., the "Project & Reshape" step [cf. Figure 3], uses a fully-connected layer which destroys the group symmetry in the noise source, leading to a non-invariant final distribution $P_g$ even if the subsequent layers are all $\Sigma$-equivariant. This easily leads to mode collapse [cf. Theorem 4.6], which we will empirically demonstrate in Section 5; see, e.g., Figure 4a (4th row). An easy remedy is to add a $\Sigma$-symmetrization layer: let $w$ be the output of "Project & Reshape"; the $\Sigma$-symmetrization layer draws a random $\sigma \sim \mu_\Sigma$ and transforms $w$ into $T^W_\sigma(w)$, producing a $\Sigma$-invariant distribution on the layer output (see Theorem 4.10). The final distribution $P_g$ is thus $\Sigma$-invariant if the subsequent layers are all $\Sigma$-equivariant, by Remark 4.9. See Figure 3 for a visual illustration.

5. Experiments

We present experiments on both synthetic and real-world data sets with embedded group symmetry to empirically verify our theory for structure-preserving GANs in Section 4.

5.1. Algorithmic feasibility

Theorems 4.1 and 4.8 imply that one can build invariant GANs by using $\Sigma$-invariant discriminators, $\Sigma$-equivariant generators, and a $\Sigma$-invariant noise source. Equivariant networks for arbitrary group symmetry (and gauge invariance) have been studied in recent works such as (Cohen & Welling, 2016b). Invariant noise sources can be constructed as shown in Theorem 4.10. We note that the symmetrization operators $S_\Sigma$ and $S^\#_\Sigma$ are only used in the proofs of the theoretical properties of the proposed GANs and are not needed in practical implementations. The necessary invariance/equivariance is built into the discriminator/generator via the structure of the layers; see Appendix G.4.
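As an illustration of Theorem 4.10 and Remark 4.11, the following is a minimal PyTorch sketch (class name ours, not the authors' code) of a Σ-symmetrization layer for Σ = C4: it applies an independently drawn, Haar-uniform 90° rotation to each feature map, so its output distribution is C4-invariant regardless of the preceding "Project & Reshape" layer; stacking C4-equivariant layers after it then yields a C4-invariant generator distribution by Remark 4.9.

```python
import torch
import torch.nn as nn

class C4SymmetrizationLayer(nn.Module):
    """Sigma-symmetrization layer for Sigma = C4 acting by 90-degree rotations.

    Each sample in the batch is transformed by an independent, uniformly drawn
    group element (the Haar measure on C4), so the layer output is C4-invariant
    in distribution whenever the group element is independent of the input
    (Theorem 4.10).  Insert this right after the 'Project & Reshape' step to
    restore the symmetry that the fully-connected projection destroys
    (Remark 4.11, Figure 3).
    """
    def forward(self, w):                      # w: (batch, channels, H, W)
        ks = torch.randint(0, 4, (w.shape[0],), device=w.device)
        rotated = [torch.rot90(wi, int(k), dims=(-2, -1)) for wi, k in zip(w, ks)]
        return torch.stack(rotated, dim=0)
```

A fresh group element is drawn for every generated image; it is the push-forward distribution, not any individual sample, that becomes invariant.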
5.2. Data sets and common experimental setups

Toy example. Following (Birrell et al., 2022), this synthetic data source is a mixture of four 2D t-distributions with 0.5 degrees of freedom, embedded in a plane in $\mathbb{R}^{12}$. The four centers of the t-distributions are located (in the supporting plane) at coordinates $(\pm 10, \pm 10)$, exhibiting $C_4$-symmetry [cf. Figure 4a].

RotMNIST is built by randomly rotating the original 10-class 28×28 MNIST digits (LeCun et al., 1998), resulting in an $SO(2)$-invariant distribution. We use different portions of the 60,000 training images for the experiments in Section 5.4.

ANHIR consists of pathology slides stained with 5 distinct dyes for the study of cellular compositions (Borovec et al., 2020). Following (Dey et al., 2021), we extract from the original images 28,407 foreground patches of size 64×64. The staining dye is used as the class label for conditioned image synthesis. As the images have no preferred orientation/reflection, the distribution is $O(2)$-invariant.

LYSTO contains 20,000 patches extracted from whole-slide images of breast, colon and prostate cancer stained with immunohistochemical markers (Ciompi et al., 2019). The images are classified into 3 categories based on the organ source, and we downsize the images to 64×64. Similar to ANHIR, this data set is also $O(2)$-invariant.

Common experimental setups. To verify our theory in Section 4, and to quantify and disentangle the contributions of the structure-preserving discriminator (D) and generator (G) (Theorem 4.1 and Theorem 4.6), we replace the baseline G and/or D by their group-equivariant/invariant counterparts, Eqv G and Inv D, while adjusting the number of filters according to the group size to ensure a similar number of trainable parameters. We also consider the incomplete attempt by Dey et al. (2021) at building equivariant generators ((I)Eqv G), wherein the first fully-connected layer destroys the symmetry in the noise source, resulting in a non-equivariant G even if subsequent layers are all equivariant [cf. Remark 4.11]. We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) to evaluate the quality and diversity of the GAN-generated samples after embedding them in the feature space of a pre-trained Inception-v3 network (Szegedy et al., 2016). Due to the simplicity of RotMNIST, we replace the Inception featurization with the encoding feature space of an autoencoder trained on the rotated digits. We note that, compared to classifiers, autoencoders are guaranteed to produce different features for rotated versions of the same digit; they are thus more suitable for measuring sample diversity in rotation.
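For reference, a sampler for the toy source described at the beginning of this subsection can be written in a few lines of NumPy. The embedding of the supporting plane into $\mathbb{R}^{12}$ is not specified in this excerpt; as an assumption, the sketch below uses the first two coordinates, and the function name is ours.

```python
import numpy as np

def sample_toy_source(n, ambient_dim=12, df=0.5, seed=0):
    """C4-invariant toy source of Section 5.2: an equal-weight mixture of four 2D
    t-distributions (df = 0.5, hence heavy-tailed with no mean) centered at
    (+-10, +-10), embedded in a plane inside R^ambient_dim.
    """
    rng = np.random.default_rng(seed)
    centers = np.array([[10.0, 10.0], [-10.0, 10.0], [-10.0, -10.0], [10.0, -10.0]])
    comp = rng.integers(0, 4, size=n)                 # uniform mixture component
    z = rng.standard_normal((n, 2))                   # multivariate t = normal / sqrt(chi2/df)
    u = rng.chisquare(df, size=(n, 1))
    planar = centers[comp] + z / np.sqrt(u / df)
    x = np.zeros((n, ambient_dim))
    x[:, :2] = planar                                 # assumed embedding plane
    return x
```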
Figure 4. [Panel titles: (a) 2D projection of the generated samples; (b) $D^L_2$-GANs.] This figure illustrates how our method can simultaneously handle heavy tails and low-dimensional support. Panel (a): 2D projection of the $D^L_2$-GAN generated samples onto the support plane of the source Q [cf. Section 5.3]. Each column shows the result after a given number of training epochs. The rows correspond to different settings for the generators (G) and discriminators (D); in particular, the 2nd and 4th rows use an invariant D accompanied by, respectively, a baseline G and an incorrectly constructed equivariant G, leading to mode collapse [cf. Theorem 4.6]. The blue ovals mark the 25% and 50% probability regions of the data source Q, while the heat map shows the generator samples. Panels (b) and (c): generator distribution, projected onto components orthogonal to the support plane of Q. Values concentrated around zero indicate convergence to the sub-manifold. Models are trained on 200 training points.

5.3. Toy example

We test the performance of different GANs (and their equivariant versions) based on three types of divergences, namely the Wasserstein GAN (WGAN) based on the Γ-IPM, Eq. (5), the $D_{f_\alpha}$-GAN based on the classical f-divergence, Eqs. (3) and (4), and the $D^L_\alpha$-GAN based on the (f, Γ)-divergence, Eq. (8), in learning the $C_4$-invariant mixture $Q$. We use fully-connected networks with 3 hidden layers for the baseline G and D (Vanilla G&D). The generator pushes forward a 10D Gaussian noise source, which is itself $C_4$-invariant after prescribing a proper group action, e.g., $\pi/2$-rotations in the first two dimensions. The equivariant G (Eqv G) and invariant D (Inv D) are built by replacing fully-connected layers with $C_4$-convolutional layers, based on Theorem 4.8 and the $C_4$-invariance of the noise source. We also mimic the incomplete attempt by Dey et al. (2021) at building equivariant generators ((I)Eqv G) by leaving the first fully-connected layer unchanged and replacing only the subsequent layers by $C_4$-convolutions.

Figure 4a displays the 2D projection of the generated samples learned by the $D^L_{\alpha=2}$-GAN (and its equivariant versions) on 200 training samples. It is clear that the baseline model without the structural prior (Vanilla G&D) has difficulty learning $Q$ in such a small data regime. Using an Inv D alone without an Eqv G (Vanilla G + Inv D), or with an incorrectly imposed Eqv G ((I)Eqv G + Inv D), easily leads to mode collapse, validating Theorem 4.6. On the other hand, the $D^L_\alpha$-GAN with an Eqv G (even without an Inv D) is able to learn all 4 modes of $Q$. We omit the results of (equivariant) $D_{f_\alpha}$-GANs and WGANs from Figure 4a, as both fail to learn the data source $Q$; this is unsurprising due to the lack of absolute continuity between $Q$ and $P_g$ (the former is supported on a plane, while the latter is supported on the entire 12D space) and the fact that $Q$ is heavy-tailed (its mean does not exist). This demonstrates the importance of our framework's broad applicability to a variety of variational divergences, as an improper choice of the divergence, even with the structural prior, can fail to learn the source distribution.

Figures 4b and 4c show the generated distribution projected onto components orthogonal to the support plane of $Q$. Values concentrated around zero indicate successful learning of the low-dimensional source distribution, i.e., generating high-fidelity samples. Figure 4b indicates that an Inv D in the $D^L_\alpha$-GAN helps produce a distribution with sharper support, whereas an Eqv G alone without an Inv D tends to generate relatively low-quality samples away from the supporting plane. In contrast, Figure 4c indicates that WGAN (even with the symmetry prior) fails to learn the support plane due to $Q$ being heavy-tailed. Results with different numbers of training samples and different $\alpha$'s are shown in Appendix F, and the conclusions are similar.

5.4. RotMNIST

We adopt a setup similar to Dey et al. (2021). Specifically, in the baseline G, a fully-connected layer first projects and reshapes the concatenated Gaussian noise and class embedding into a 2D feature map (see Figure 3); spectrally-normalized convolutions (Miyato et al., 2018), interspersed with pointwise nonlinearities, class-conditional batch normalizations, and upsamplings, are subsequently used to increase the spatial dimension.
We note again that replacing 2D convolutions with $C_n$-convolutions does not by itself yield an Eqv G, as the distribution after the project-and-reshape layer is no longer $C_n$-invariant. This can be fixed by adding a $C_n$-symmetrization layer after the first linear embedding; see Remark 4.11. We consider GANs with the relative average loss (RA-GANs) (Jolicoeur-Martineau, 2019) in addition to the $D^L_\alpha$-GANs for this experiment. All configurations are trained with a batch size of 64 for 20,000 generator iterations. Implementation details are available in Appendix G.

Figure 5. Randomly generated digits 2, 3 and 8 by the RA-GANs trained on RotMNIST after 20K generator iterations and using 1% (600) training data. (a): CNN G&D. (b): (I)Eqv G + Inv D, Σ = C4. (c) & (d): Eqv G + Inv D, i.e., our models with correctly constructed equivariant generators. (c): Σ = C4. (d): Σ = C8. More images are available in Appendix F.

Table 1. The median of the FIDs (lower is better), calculated every 1,000 generator updates for 20,000 iterations, averaged over three independent trials. The number of training samples used for the experiments varies from 1% (600) to 10% (6,000) of the entire training set. See Appendix F for further results.

| Architecture | 1% | 5% | 10% |
| --- | --- | --- | --- |
| CNN G&D | 295 | 357 | 348 |
| Eqv G + CNN D, Σ = C4 | 389 | 333 | 355 |
| CNN G + Inv D, Σ = C4 | 223 | 181 | 188 |
| (I)Eqv G + Inv D, Σ = C4 | 173 | 141 | 132 |
| Eqv G + Inv D, Σ = C4 | 98 | 78 | 89 |
| Eqv G + Inv D, Σ = C8 | 123 | 52 | 51 |

| Architecture | 1% | 5% | 10% |
| --- | --- | --- | --- |
| CNN G&D | 280 | 261 | 283 |
| Eqv G + CNN D, Σ = C4 | 253 | 271 | 251 |
| CNN G + Inv D, Σ = C4 | 330 | 208 | 192 |
| (I)Eqv G + Inv D, Σ = C4 | 273 | 147 | 133 |
| Eqv G + Inv D, Σ = C4 | 149 | 99 | 88 |
| Eqv G + Inv D, Σ = C8 | 122 | 55 | 57 |

Table 1 shows the median of the FIDs, calculated every 1,000 generator updates, averaged over three independent trials. It is clear that our proposed models (Eqv G + Inv D) consistently achieve significantly improved results compared to the baseline CNN G&D and the prior approach ((I)Eqv G + Inv D); the outperformance is even more pronounced when increasing the group size from Σ = C4 to C8. We note that, similar to RotMNIST, one can also use a custom autoencoder featurization for FID evaluation, and the superiority of our model (Eqv G + Inv D) is even more prominent under such a metric: for instance, on ANHIR, the median FIDs calculated through autoencoder featurization for the three comparison models are, respectively, 1221 (CNN G&D), 936 ((I)Eqv G + Inv D), and 329 (Eqv G + Inv D). See also Figure 5 for randomly generated samples by RA-GANs trained with 1% of the training data. More results are available in Appendix F.
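For completeness, the FID values reported above are the Fréchet (2-Wasserstein) distance between Gaussians fitted to the two feature sets (Heusel et al., 2017), whether the features come from Inception-v3 or from the autoencoder used for RotMNIST. A minimal sketch (our own helper, not the evaluation code used in the paper):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """FID-style distance between two (n_samples, n_features) feature arrays:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))
```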
5.5. ANHIR and LYSTO

Compared to RotMNIST, a ResNet and its $D_4$-equivariant counterpart are used instead of CNNs for G and D. All models are trained for 40,000 generator iterations with a batch size of 32. Implementation details are available in Appendix G.

Table 2. The (min, median) of the FIDs over the course of training, averaged over three independent trials on the medical images, where the plus sign "+" after the data set, e.g., ANHIR+, denotes the presence of data augmentation during training.

| Loss | Architecture | ANHIR | ANHIR+ |
| --- | --- | --- | --- |
| $D^L_2$ | CNN G&D | (313, 485) | (347, 539) |
| $D^L_2$ | (I)Eqv G + Inv D | (120, 176) | (119, 177) |
| $D^L_2$ | Eqv G + Inv D | (97, 157) | (90, 128) |

| Loss | Architecture | LYSTO | LYSTO+ |
| --- | --- | --- | --- |
| $D^L_2$ | CNN G&D | (289, 410) | (265, 376) |
| $D^L_2$ | (I)Eqv G + Inv D | (253, 343) | (244, 329) |
| $D^L_2$ | Eqv G + Inv D | (205, 259) | (192, 259) |

Table 2 displays the minimum and median of the FIDs, calculated every 2,000 generator updates, averaged over three independent trials. The plus sign "+" after the data set, e.g., ANHIR+, denotes the presence of data augmentation (random 90° rotations and reflections) during training. It is clear that augmentation usually (but not always) has a positive effect on the results as evaluated by the FID; however, our proposed model, even without data augmentation, still consistently and significantly outperforms the baseline model (CNN G&D) and the prior approach ((I)Eqv G + Inv D) (Dey et al., 2021) with augmentation. Figure 1 presents a random collection of real and generated ANHIR images, visually verifying the improved sample fidelity of our model over the baseline. More results are available in Appendix F.

5.6. Discussion of empirical findings

Consistently across all experiments, our proposed structure-preserving GAN outperforms prior approaches in generating high-fidelity and diverse samples by a significant margin, in some cases by almost an order of magnitude as measured in FID. The results also show that, compared to data augmentation (a common strategy for learning from limited data), building theoretically guided structural probabilistic priors directly into the two GAN players achieves substantially improved performance and data efficiency in adversarial learning.

Acknowledgements

The research of J.B., M.K., and L.R.-B. was partially supported by the Air Force Office of Scientific Research (AFOSR) under the grant FA9550-21-1-0354. The research of M.K. and L.R.-B. was partially supported by the National Science Foundation (NSF) under the grants DMS-2008970 and TRIPODS CISE-1934846. The research of W.Z. was partially supported by NSF under DMS-2052525 and DMS-2140982. We thank Neel Dey for sharing the pre-processed ANHIR data set. This work was performed in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative.

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223. PMLR, 2017.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 531-540, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/belghazi18a.html.

Biloš, M. and Günnemann, S. Scalable normalizing flows for permutation invariant densities. In International Conference on Machine Learning, pp. 957-967. PMLR, 2021.

Birrell, J., Dupuis, P., Katsoulakis, M. A., Rey-Bellet, L., and Wang, J. Variational representations and neural network estimation of Rényi divergences. SIAM Journal on Mathematics of Data Science, 3(4):1093-1116, 2021. doi: 10.1137/20M1368926. URL https://doi.org/10.1137/20M1368926.

Birrell, J., Dupuis, P., Katsoulakis, M. A., Pantazis, Y., and Rey-Bellet, L. (f, Γ)-divergences: Interpolating between f-divergences and integral probability metrics. Journal of Machine Learning Research, (to appear), 2022. URL https://arxiv.org/abs/2011.05953.

Borovec, J., Kybic, J., Arganda-Carreras, I., Sorokin, D. V., Bueno, G., Khvostikov, A. V., Bakas, S., Eric, I., Chang, C., Heldmann, S., et al. ANHIR: Automatic non-rigid histological image registration challenge. IEEE Transactions on Medical Imaging, 39(10):3042-3052, 2020.

Bot, R., Grad, S., and Wanka, G.
Duality in Vector Optimization. Vector Optimization. Springer Berlin Heidelberg, 2009. ISBN 9783642028861.

Boyda, D., Kanwar, G., Racanière, S., Rezende, D. J., Albergo, M. S., Cranmer, K., Hackett, D. C., and Shanahan, P. E. Sampling using SU(N) gauge equivariant flows. Physical Review D, 103(7):074504, 2021.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Broniatowski, M. and Keziou, A. Parametric estimation and tests through divergences and the duality technique. Journal of Multivariate Analysis, 100(1):16-36, 2009. ISSN 0047-259X. doi: 10.1016/j.jmva.2008.03.011. URL http://www.sciencedirect.com/science/article/pii/S0047259X08001036.

Catoni, O., Euclid, P., Library, C. U., and Press, D. U. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Lecture Notes-Monograph Series. Cornell University Library, 2008. URL https://books.google.gr/books?id=-EtrnQAACAAJ.

Chowdhary, K. and Dupuis, P. Distinguishing and integrating aleatoric and epistemic variation in uncertainty quantification. ESAIM: Mathematical Modelling and Numerical Analysis, 47(3):635-662, 2013. doi: 10.1051/m2an/2012038.

Ciompi, F., Jiao, Y., and van der Laak, J. Lymphocyte assessment hackathon (LYSTO), October 2019. URL https://doi.org/10.5281/zenodo.3513571.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990-2999. PMLR, 2016a.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2990-2999, New York, New York, USA, 20-22 Jun 2016b. PMLR. URL https://proceedings.mlr.press/v48/cohenc16.html.

Cohen, T. S., Geiger, M., and Weiler, M. A general theory of equivariant CNNs on homogeneous spaces. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/b9cfe8b6042cf759dc4c0cccb27a6737-Paper.pdf.

Cohn, D. Measure Theory. Birkhäuser Boston, 2013. ISBN 9781489903990. URL https://books.google.com/books?id=rgXyBwAAQBAJ.

Dehmamy, N., Walters, R., Liu, Y., Wang, D., and Yu, R. Automatic symmetry discovery with Lie algebra convolutional network. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 2503-2515. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/148148d62be67e0916a833931bd32b26-Paper.pdf.

Dey, N., Chen, A., and Ghafurian, S. Group equivariant generative adversarial networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=rgFNuJHHXv.

Dupuis, P. and Ellis, R. S. A Weak Convergence Approach to the Theory of Large Deviations, volume 902. John Wiley & Sons, 2011.

Dupuis, P., Katsoulakis, M. A., Pantazis, Y., and Plechac, P. Path-space information bounds for uncertainty quantification and sensitivity analysis of stochastic dynamics. SIAM/ASA Journal on Uncertainty Quantification, 4(1):80-111, 2016. doi: 10.1137/15M1025645.

Durumeric, A. E. and Voth, G. A.
Adversarial-residual-coarse-graining: Applying machine learning theory to systematic molecular coarse-graining. The Journal of Chemical Physics, 151(12):124110, 2019.

Feder, R. M., Berger, P., and Stein, G. Nonlinear 3D cosmic web simulation with heavy-tailed generative adversarial networks. Physical Review D, 102(10):103504, 2020.

Folland, G. Real Analysis: Modern Techniques and Their Applications. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts. Wiley, 2013. ISBN 9781118626399. URL https://books.google.com/books?id=wI4fAwAAQBAJ.

Garcia Satorras, V., Hoogeboom, E., Fuchs, F., Posner, I., and Welling, M. E(n) equivariant normalizing flows. Advances in Neural Information Processing Systems, 34, 2021.

Genevay, A., Cuturi, M., Peyré, G., and Bach, F. Stochastic optimization for large-scale optimal transport. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/2a27b8144ac02f67687f76782a3b5d8f-Paper.pdf.

Glaser, P., Arbel, M., and Gretton, A. KALE flow: A relaxed KL gradient flow for probabilities with disjoint support. arXiv e-prints, arXiv:2106.08929, June 2021.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6-Paper.pdf.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Jolicoeur-Martineau, A. The relativistic discriminator: a key element missing from standard GAN. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1erHoR5t7.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kipnis, C. and Landim, C. Scaling Limits of Interacting Particle Systems. Springer-Verlag, 1999.

Köhler, J., Klein, L., and Noé, F. Equivariant flows: sampling configurations for multi-body systems with symmetric energies. arXiv preprint arXiv:1910.00753, 2019.

Köhler, J., Klein, L., and Noé, F. Equivariant flows: exact likelihood generative learning for symmetric densities. In International Conference on Machine Learning, pp. 5361-5370. PMLR, 2020.

Kullback, S. and Leibler, R. A. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Li, W., Burkhart, C., Polińska, P., Harmandaris, V., and Doxastakis, M.
Backmapping coarse-grained macromolecules: An efficient and versatile machine learning approach. The Journal of Chemical Physics, 153(4):041101, 2020.

Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. Graph normalizing flows. arXiv preprint arXiv:1905.13177, 2019.

McAllester, D. A. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT '99, pp. 164-170, New York, NY, USA, 1999. Association for Computing Machinery. ISBN 1581131674. doi: 10.1145/307400.307435. URL https://doi.org/10.1145/307400.307435.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Mustafa, M., Bard, D., Bhimji, W., Lukić, Z., Al-Rfou, R., and Kratochvil, J. M. CosmoGAN: creating high-fidelity weak lensing convergence maps using generative adversarial networks. Computational Astrophysics and Cosmology, 6(1):1, December 2019. ISSN 2197-7909. doi: 10.1186/s40668-019-0029-9. URL https://comp-astrophys-cosmol.springeropen.com/articles/10.1186/s40668-019-0029-9.

Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997. doi: 10.2307/1428011.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. Nonparametric estimation of the likelihood ratio and divergence functionals. In 2007 IEEE International Symposium on Information Theory, pp. 2016-2020, 2007.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.

Noid, W. G. Perspective: Coarse-grained models for biomolecular systems. The Journal of Chemical Physics, 139(9):090901, 2013. doi: 10.1063/1.4818908. URL https://doi.org/10.1063/1.4818908.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 271-279, 2016.

Pak, A. J. and Voth, G. A. Advances in coarse-grained modeling of macromolecular complexes. Current Opinion in Structural Biology, 52:119-126, 2018. ISSN 0959-440X. doi: 10.1016/j.sbi.2018.11.005. URL https://www.sciencedirect.com/science/article/pii/S0959440X18300939.

Rezende, D. J., Racanière, S., Higgins, I., and Toth, P. Equivariant Hamiltonian flows. arXiv preprint arXiv:1909.13739, 2019.

Ruderman, A., Reid, M. D., García-García, D., and Petterson, J. Tighter variational representations of f-divergences via restriction to probability measures. In Proceedings of the 29th International Conference on Machine Learning, ICML '12, pp. 1155-1162, Madison, WI, USA, 2012. Omnipress. ISBN 9781450312851.

Rudin, W. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, 2006. ISBN 9780070619883.

Schindler, W. Measures with Symmetry Properties. Lecture Notes in Mathematics. Springer Berlin Heidelberg, 2003. ISBN 9783540362104. URL https://books.google.com/books?id=xyt8CwAAQBAJ.

Shawe-Taylor, J. and Williamson, R. C. A PAC analysis of a Bayesian estimator.
In Proceedings of the Tenth Annual Conference on Computational Learning Theory, COLT '97, pp. 2-9, New York, NY, USA, 1997. Association for Computing Machinery. ISBN 0897918916. doi: 10.1145/267460.267466. URL https://doi.org/10.1145/267460.267466.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(70):2389-2410, 2011. URL http://jmlr.org/papers/v12/sriperumbudur11a.html.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. G. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550-1599, 2012. doi: 10.1214/12-EJS722. URL https://doi.org/10.1214/12-EJS722.

Steinwart, I. and Christmann, A. Support Vector Machines. Information Science and Statistics. Springer New York, 2008. ISBN 9780387772424. URL https://books.google.com/books?id=HUnqnrpYt4IC.

Stieffenhofer, M., Bereau, T., and Wand, M. Adversarial reverse mapping of condensed-phase molecular structures: Chemical transferability. APL Materials, 9(3):031107, 2021. doi: 10.1063/5.0039102. URL https://doi.org/10.1063/5.0039102.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.

Weiler, M. and Cesa, G. General E(2)-equivariant steerable CNNs. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/45d6637b718d0f24a237069fe41b0db4-Paper.pdf.

Yi, X., Walia, E., and Babyn, P. Generative adversarial network in medical imaging: A review. Medical Image Analysis, 58:101552, 2019.

Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354-7363. PMLR, 2019.

Zhao, S., Liu, Z., Lin, J., Zhu, J.-Y., and Han, S. Differentiable augmentation for data-efficient GAN training. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 7559-7570. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/55479c55ebd1efd3ff125f1337100388-Paper.pdf.

Zhu, M., Pan, P., Chen, W., and Yang, Y. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

A. More details on variational representations of divergences and probability metrics

We provide, in this appendix, more details on the variational representations of the divergences and probability metrics discussed in Section 3.2. Recall the notation introduced in the main paper: let $(X, \mathcal{M})$ be a measurable space, $\mathcal{M}(X)$ be the space of measurable functions on $X$, and $\mathcal{M}_b(X)$ be the subspace of bounded measurable functions. We denote by $\mathcal{P}(X)$ the set of probability measures on $X$. Given an objective functional $H : \mathcal{M}(X)^n \times \mathcal{P}(X) \times \mathcal{P}(X) \to [-\infty, \infty]$ and a test function space $\Gamma \subset \mathcal{M}(X)^n$, $n \in \mathbb{Z}_+$, we define

$$D^\Gamma_H(Q \| P) = \sup_{\gamma \in \Gamma} H(\gamma; Q, P). \qquad (20)$$

$D^\Gamma_H$ is called a divergence if $D^\Gamma_H \ge 0$ and $D^\Gamma_H(Q \| P) = 0$ if and only if $Q = P$, hence providing a notion of distance between probability measures.
$D^\Gamma_H$ is further called a probability metric if it satisfies the triangle inequality (i.e., $D^\Gamma_H(Q \| P) \le D^\Gamma_H(Q \| \nu) + D^\Gamma_H(\nu \| P)$ for all $Q, P, \nu \in \mathcal{P}(X)$) and is symmetric (i.e., $D^\Gamma_H(Q \| P) = D^\Gamma_H(P \| Q)$ for all $P, Q \in \mathcal{P}(X)$). It is well known that formula (20) includes, through suitable choices of the objective functional $H(\gamma; Q, P)$ and the function space $\Gamma$, many divergences and probability metrics. Below we further elaborate on the examples discussed in Section 3.2.

(a) f-divergences. Let $f : [0, \infty) \to \mathbb{R}$ be convex and lower semi-continuous (LSC), with $f(1) = 0$ and $f$ strictly convex at $x = 1$. The f-divergence between $Q$ and $P$ can be defined via two equivalent variational representations (Birrell et al., 2022), namely

$$D_f(Q \| P) = \sup_{\gamma \in \mathcal{M}_b(X)} \{ E_Q[\gamma] - E_P[f^*(\gamma)] \} \qquad (21)$$

$$= \sup_{\gamma \in \mathcal{M}_b(X)} \{ E_Q[\gamma] - \Lambda^P_f[\gamma] \}, \qquad (22)$$

where $f^*$ in the first representation (21) denotes the Legendre transform (LT) of $f$,

$$f^*(y) = \sup_{x \in \mathbb{R}} \{ yx - f(x) \}, \quad y \in \mathbb{R}, \qquad (23)$$

and $\Lambda^P_f[\gamma]$ in the second representation (22) is defined as

$$\Lambda^P_f[\gamma] := \inf_{\nu \in \mathbb{R}} \{ \nu + E_P[f^*(\gamma - \nu)] \}, \quad \gamma \in \mathcal{M}_b(X). \qquad (24)$$

The two variational representations, Eq. (21) and Eq. (22), share the same $\Gamma = \mathcal{M}_b(X)$, and their equivalence is due to $\mathcal{M}_b(X)$ being closed under the shift map $\gamma \mapsto \gamma - \nu$ for $\nu \in \mathbb{R}$. Examples of f-divergences include the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951), the total variation distance, the $\chi^2$-divergence, the Hellinger distance, the Jensen-Shannon divergence, and the family of α-divergences (Nowozin et al., 2016). For instance, the KL divergence is constructed from

$$f_{\mathrm{KL}}(x) = x \log x, \quad x \ge 0. \qquad (25)$$

A key element in the second variational representation for $D_f$ [Eq. (22)] is the functional $\Lambda^P_f[\gamma]$, which is a generalization of the cumulant generating function from the KL-divergence case to the f-divergence case. Indeed, for the KL divergence, where $f(x) = f_{\mathrm{KL}}(x) = x \log x$, it is straightforward to show that $\Lambda^P_f$ becomes the standard cumulant generating function, $\Lambda^P_{f_{\mathrm{KL}}}[\gamma] = \log E_P[e^\gamma]$, and Eq. (22) becomes the Donsker-Varadhan variational formula; see Appendix C.2 in (Dupuis & Ellis, 2011). The flexibility of $f$ allows one to tailor the divergence to the data source, e.g., for heavy-tailed data. Moreover, the strict concavity of the objective in $\gamma$ can result in improved statistical learning, estimation, and convergence performance. However, the variational representations (21) and (22) both result in $D_f(Q \| P) = \infty$ if $Q$ is not absolutely continuous with respect to $P$, limiting their efficacy in comparing distributions with low-dimensional support.

(b) Γ-Integral Probability Metrics (IPMs). Given $\Gamma \subset \mathcal{M}_b(X)$, the Γ-IPM between $Q$ and $P$ is defined as

$$W^\Gamma(Q, P) = \sup_{\gamma \in \Gamma} \{ E_Q[\gamma] - E_P[\gamma] \}. \qquad (26)$$

We refer to (Müller, 1997; Sriperumbudur et al., 2012) for a complete theory and for conditions on $\Gamma$ ensuring that $W^\Gamma(Q, P)$ is a metric. Apart from the Wasserstein metric, obtained when $\Gamma = \mathrm{Lip}_1(X)$ is the space of 1-Lipschitz functions, examples of IPMs also include: the total variation metric, where $\Gamma$ is the unit ball in $\mathcal{M}_b(X)$; the Dudley metric, where $\Gamma$ is the unit ball in the space of bounded and Lipschitz continuous functions; and maximum mean discrepancy (MMD), where $\Gamma$ is the unit ball in an RKHS (Müller, 1997; Sriperumbudur et al., 2012). With suitable choices of $\Gamma$, IPMs are able to meaningfully compare non-absolutely continuous distributions, but they can potentially fail at comparing distributions with heavy tails (Birrell et al., 2022).

(c) (f, Γ)-divergences. This class of divergences was introduced in (Birrell et al., 2022), and it subsumes both f-divergences and Γ-IPMs.
Given a function f satisfying the same condition as in the definition of the f-divergence and Γ Mb(X), the (f, Γ)-divergence is defined as DΓ f (Q P) = sup γ Γ EQ[γ] ΛP f [γ] , (27) where ΛP f [γ] is again given by Eq. (24), implying that Eq. (6) includes as a special case the f-divergence (3) when Γ = Mb(X) and the Γ Mb(X) implies DΓ f (Q P) Df(Q P) (28) for any Γ Mb(X). It is demonstrated in (Birrell et al., 2022) that one also has DΓ f (Q P) W Γ(Q, P) . (29) Some notable examples of such Γ s can be found in (Birrell et al., 2022), for instance the 1-Lipschitz functions Lip1(X), the RKHS unit ball, Re LU neural networks, Re LU neural networks with spectral normalizations, etc. The property (29) readily implies that (f, Γ) divergences can be defined for non-absolutely continuous probability distributions. If X is further assumed to be a complete separable metric space then, under stronger assumptions on f and Γ, one has the following Infimal Convolution Formula: DΓ f (Q P) = inf η P(X) Df(η P) + W Γ(Q, η) , (30) which implies, in particular, 0 DΓ f (Q P) min{Df(Q P), W Γ(Q, P)}, i.e., Eq. (28) and Eq. (29). (d) Sinkhorn divergences. The Wasserstein (or earth-mover ) metric associated with a cost function c : X X R+ has the variational representation W Γ c (Q, P) = inf π Co(Q,P ) Eπ[c(x, y)] = sup γ=(γ1,γ2) Γ {EP [γ1] + EQ[γ2]} , (31) where Co(Q, P) is the set of all couplings of P and Q and Γ = {γ = (γ1, γ2) C(X) C(X) : γ1(x) + γ2(y) c(x, y) , x, y X}, with C(X) being the space of continuous functions on X (Cb(X) will denote the subspace of bounded continuous functions). The Sinkhorn divergence is given by SDΓ c,ϵ(Q, P) = W Γ c,ϵ(Q, P) 1 2W Γ c,ϵ(Q, Q) 1 2W Γ c,ϵ(P, P), (32) with W Γ c,ϵ(Q, P) being the entropic regularization of the Wasserstein metrics (Genevay et al., 2016), W Γ c,ϵ(Q, P) = inf π Co(Q,P ) {Eπ[c(x, y)] + ϵR(π P Q)} (33) = sup γ=(γ1,γ2) Γ EP [γ1] + EQ[γ2] ϵEP Q exp γ1 γ2 c where now Γ = Cb(X) Cb(X) and γ1 γ2(x, y) := γ1(x) + γ2(y). Structure-preserving GANs In this appendix we provide proofs of results that were stated in the main text. First we prove the properties of the symmetrization operators from Lemma 3.2. Lemma B.1. (a) The symmetrization operator SΣ : Mb(X) Mb(X) is a projection operator onto the subspace of Σ-invariant bounded measurable functions Minv b,Σ(X) := {γ Mb(X) : γ Tσ = γ for all σ Σ} , (35) in the sense that 1. SΣ[Mb(X)] = Minv b,Σ(X), 2. SΣ SΣ = SΣ. SΣ[γ Tσ] = SΣ[γ] (36) for all γ Mb(X), σ Σ. (b) The symmetrization operator SΣ : P(X) P(X) is a projection operator onto the subset of Σ-invariant probability measures PΣ(X) := {P P(X) : P T 1 σ = P for all σ Σ} , (37) in the sense that 1. SΣ[P(X)] = PΣ(X), 2. SΣ SΣ = SΣ. (c) SΣ is the conditional expectation operator with respect to the σ-algebra of Σ-invariant sets. More specifically, for all γ Mb(X), P PΣ(X) we have SΣ[γ] = EP [γ|MΣ] . (38) where MΣ is the σ-algebra of Σ-invariant sets, MΣ := {Measurable sets B X : Tσ(B) = B for all σ Σ} . (39) Proof. We will need the following invariance property of integrals with respect to Haar measure, which can be proven using the invariance of Haar measure under left and right group multiplication: Z Σ h(σ σ )dµΣ(σ ) = Z Σ h(σ σ)dµΣ(σ ) = Z Σ h(σ )dµΣ(σ ) . (40) (a) If γ Mb(X) then γ = SΣ[γ] Minv b,Σ(X) by applying (40) with h(σ) := γ Tσ(x), x X. Indeed we have γ Tσ(x) = Z γ(Tσ (Tσ(x)))dµΣ(σ ) = Z h(σ σ)µΣ(dσ ) = Z h(σ )µΣ(dσ ) = γ (x) . Furthermore any γ Minv b,Σ(X) belongs to the range of SΣ since γ Tσ = γ for all σ Σ implies that γ = SΣ[γ]. 
This also shows that SΣ SΣ = SΣ. Finally, for γ Mb(X), σ Σ, x X we can compute SΣ[γ Tσ](x) = Z γ(Tσ σ (x))µΣ(dσ ) = Z γ(T σ(x))µΣ(dσ ) = SΣ[γ](x) , where we again used the invariance property of integrals with respect to Haar measure (40). Structure-preserving GANs (b) For P P(X), γ Mb(X), and σ Σ we can use (36) to compute Z γd SΣ[P] T 1 σ = Z γ Tσd SΣ[P] = Z SΣ[γ Tσ]d P = Z SΣ[γ]d P = Z γd SΣ[P] . This holds for all γ Mb(X), hence SΣ[P] T 1 σ = SΣ[P] for all σ Σ. Therefore SΣ[P] PΣ(X). Conversely, if P PΣ(X) then EP [γ Tσ] = EP [γ] for all σ Σ and γ Mb(X) and thus, by Fubini s theorem, EP [SΣ[γ]] = EP [γ]. Hence SΣ[P] = P and so P SΣ[P]. This completes the proof that SΣ[P(X)] = PΣ(X). Combining these calculations it is also clear that SΣ SΣ = SΣ. (c) Let γ Mb(X) and P PΣ(X). From part (a) we know that SΣ[γ] Minv b,Σ(X) and from this it is straightforward to show that SΣ[γ] is MΣ-measurable. Now fix A MΣ and note that 1A Tσ = 1A for all σ Σ (where 1A denotes the indicator function for A). Using this fact together with SΣ[P] = P (see part (b)) we can compute Z SΣ[γ]1Ad P = Z Z γ Tσ 1AµΣ(dσ )d P = Z Z (γ1A) Tσ µΣ(dσ )d P = Z SΣ[γ1A]d P = Z γ1Ad SΣ[P] = Z γ1Ad P . This proves SΣ[γ] = EP [γ|MΣ] by the definition of conditional expectation. Now we prove Theorem 4.1. Theorem B.2. If SΣ[Γ] Γ and the probability measures P, Q are Σ-invariant then DΓ(Q P) = DΓinv Σ (Q P) , (41) where DΓ is an (f, Γ)-divergence or a Γ-IPM. Eq. (41) also holds for Sinkhorn divergences if the cost is Σ-invariant (i.e., c(Tσ(x), Tσ(y)) = c(x, y) for all σ Σ, x, y X). Remark B.3. Note that the classical Sinkhorn divergence is obtained when Γ = Cb(X) Cb(X) but the proof of this theorem applies to any Γ Mb(X)2 with SΣ[Γ] Γ. Proof. We first prove the Theorem for (f, Γ)-divergences. Start by using Jensen s inequality and the convexity of the Legendre transform f to obtain f (SΣ[γ](x) ν) = f Z γ(Tσ(x)) ν µΣ(dσ) Z f (γ(Tσ(x)) ν)µΣ(dσ) = SΣ[f (γ(x) ν)] for all γ Mb(X). Therefore DSΣ[Γ] f (Q P) = sup γ Γ,ν R {EQ[SΣ[γ]] ν EP [f (SΣ[γ] ν)]} sup γ Γ,ν R {EQ[SΣ[γ] ν] EP [SΣ[f (γ ν)]]} = sup γ Γ,ν R {EQ[γ] ν EP [f (γ ν)]} = DΓ f (Q P) , where in the next to last equality we use Lemma 3.2(c) together with the assumptions P, Q PΣ(X) to conclude EP [SΣ[f (γ ν)]] = EP [f (γ ν)] and EQ[SΣ[γ]] = EQ[γ]. Hence we obtain DΓ f (Q P) DSΣ[Γ] f (Q P). Furthermore, since SΣ[Γ] Γ, we have from (6) that DSΣ[Γ] f (Q P) = DΓ f (Q P). We conclude by showing that SΣ[Γ] Γ implies SΣ[Γ] = Γinv Σ . First, if γ Γinv Σ , then SΣ[γ] = γ, therefore Γinv Σ SΣ[Γ]. Conversely, since Γ Mb(X), the functions in SΣ[Γ] are Σ-invariant (see Lemma 3.2). We assumed SΣ[Γ] Γ, hence SΣ[Γ] Γinv Σ . Structure-preserving GANs The proof for Γ-IPMs is similar, but does not require Jensen s inequality due to the linearity of the objective functional in γ. Hence the hypothesis SΣ[Γ] Γ is not necessary to obtain W Γ(Q, P) = W SΣ[Γ](Q, P). Finally, we prove the result for Sinkhorn divergences. Equation (32) implies that it suffices to show W Γ c,ϵ(Q, P) = W Γinv Σ c,ϵ (Q, P): By the same reasoning used for (f, Γ)-divergences, our assumptions imply Γinv Σ = SΣ[Γ] and therefore W Γinv Σ c,ϵ (Q, P) =W SΣ[Γ] c,ϵ (Q, P) = sup (γ1,γ2) Γ EP [SΣ[γ1]] + EQ[SΣ[γ2]] ϵEP Q exp SΣ[γ1] SΣ[γ2] c = sup (γ1,γ2) Γ ESΣ[P ][γ1] + ESΣ[Q][γ2] ϵEP Q exp R γ1(Tσ(x)) + γ2(Tσ(y)) c(x, y)µΣ(dσ) Using Jensen s inequality followed by Fubini s theorem on the third term we obtain W Γinv Σ c,ϵ (Q, P) sup (γ1,γ2) Γ ESΣ[P ][γ1] + ESΣ[Q][γ2] ϵ Z EP Q exp γ1(Tσ(x)) + γ2(Tσ(y)) c(x, y) µΣ(dσ) + ϵ . 
Finally, the Σ-invariance of Q, P, and c imply SΣ[P] = P, SΣ[Q] = Q, and exp γ1(Tσ(x)) + γ2(Tσ(y)) c(x, y) exp γ1(Tσ(x)) + γ2(Tσ(y)) c(Tσ(x), Tσ(y)) = Z Z Z exp γ1(x) + γ2(y) c(x, y) Q T 1 σ (dx)P T 1 σ (dy)µΣ(dσ) = Z Z exp γ1(x) + γ2(y) c(x, y) Q(dx)P(dy) . W Γinv Σ c,ϵ (Q, P) sup (γ1,γ2) Γ EP [γ1] + EQ[γ2] ϵEP Q exp γ1 γ2 c + ϵ = W Γ c,ϵ(Q, P) . The reverse inequality follows from Γinv Σ Γ and so the proof is complete. Next we prove Theorem 4.3, a generalization of Theorem 4.1. Theorem B.4. Let Kx(dx ) be a probability kernel from X to X and define SK : Mb(X) 7 Mb(X) by SK[f](x) = R f(x )Kx(dx ). K also defines a dual map SK : P(X) P(X), SK[P] := R Kx( )P(dx). Let PK(X) be the set of K-invariant probability measures, i.e., PK(X) = {P P(X) : SK[P] = P}. If Γ Mb(X) such that SK[Γ] Γ and Q, P PK(X) then DΓ(Q P) = DSK[Γ](Q P) , (42) where DΓ is an (f, Γ)-divergence or a Γ-IPM. It also holds for the Sinkhorn divergence if SK[c( , y)] = c( , y) and SK[c(x, )] = c(x, ) for all x, y X. In addition, if SK is a projection (i.e., SK SK = SK) then SK[Γ] = Γinv K where where Γinv K := {γ Γ : SK[γ] = γ}. Proof. We prove (42) for (f, Γ)-divergences. The proof for Γ-IPMs and Sinkhorn divergences are similar. We note that for Γ-IPMs, (42) does not require the assumption SK[Γ] Γ. Structure-preserving GANs Fix Q, P PK(X) and use Jensen s inequality along with the K-invariance of Q and P to compute DSK[Γ] f (Q P) = sup γ Γ,ν R {EQ[SK[γ] ν] EP [f (SK[γ] ν)]} = sup γ Γ,ν R {EQ[SK[γ ν]] EP [f ( Z (γ(x ) ν)Kx(dx ))]} sup γ Γ,ν R {EQ[SK[γ ν]] EP [ Z f (γ(x ) ν)Kx(dx ))]} = sup γ Γ,ν R {ESK[Q][γ ν] ESK[P ][f (γ ν)]} = sup γ Γ,ν R {EQ[γ ν] EP [f (γ ν)]} = DΓ f (Q P) . Therefore DSK[Γ] f (Q P) DΓ f (Q P). Note that this computation is the same as the proof of the data processing inequality for (f, Γ)-divergences; see Theorem 2.21 in (Birrell et al., 2022). The assumption SK[Γ] Γ implies the reverse inequality, hence we conclude DSK[Γ] f (Q P) = DΓ f (Q P). Now suppose SK SK = SK. If γ = SK[γ ] SK[Γ] then SK[γ] = SK[SK[γ ]] = SK[γ ] = γ. This, together with the assumption that SK[Γ] Γ implies γ Γinv K . Conversely, if γ Γinv K then γ = SK[γ] SK[Γ] by the definition of Γinv K . This completes the proof. We now prove Theorem 4.6, which explains the potential mode collapse in GANs when restricting the test function space from Γ to Γinv Σ if at least one of the distributions Q and P is not Σ-invariant. Theorem B.5. Suppose SΣ[Γ] Γ and P, Q P(X) (i.e., not necessarily Σ-invariant). Then DΓinv Σ f (Q P) = DΓ f (SΣ[Q] SΣ[P]) , (43) W Γinv Σ (Q, P) = W Γ(SΣ[Q], SΣ[P]) . (44) Remark B.6. The analogous result for the Sinkhorn divergences also holds if the cost is separately Σ-invariant in each variable, i.e., c(Tσ(x), y) = c(x, y) and c(x, Tσ(y)) = c(x, y) for all σ Σ, x, y X. Though this is not satisfied by most commonly used cost functions and actions one can always enforce it by replacing the cost function c with the symmetrized cost cΣ(x, y) := Z Z c(Tσ(x), Tσ (y))µΣ(dσ)µΣ(dσ ) . (45) Proof. We prove only the validity of (43); the proof of (44) is similar. DΓ f (SΣ[Q] SΣ[P]) = DΓinv Σ f (SΣ[Q] SΣ[P]) = sup γ Γinv Σ ,ν R ESΣ[Q][γ ν] ESΣ[P ][f (γ ν)] = sup γ Γinv Σ ,ν R {EQ[γ ν] EP [f (γ ν)]} = DΓinv Σ f (Q P) , where the first equality is due to Theorem 4.1, and the third equality holds as γ ν and f (γ ν) are both Σ-invariant when γ Γinv Σ . Next we prove Theorem 4.8, which explains how to ensure the generator produces a Σ-invariant distribution Pg Theorem B.7. 
If PZ P(Z) is Σ-invariant and g : Z X is Σ-equivariant then the push-forward measure Pg := PZ g 1 is Σ-invariant, i.e., Pg PΣ(X). Structure-preserving GANs Proof. The proof is based on the equivalence of the following commutative diagrams: More specifically, Pg (T X σ ) 1 = PZ g 1 (T X σ ) 1 = PZ (T X σ g) 1 =PZ (g T Z σ ) 1 = PZ (T Z σ ) 1 g 1 = PZ g 1 where the third and fifth equalities are due to the equivariance and invariance, respectively, of g and PZ. Next we prove Theorem 4.10, which provides a method for constructing Σ-invariant noise sources. Theorem B.8. Let W µΣ and N be a Z-valued random variable (i.e., an arbitrary noise source). If W and N are independent then the distribution of T Z(W, N) is Σ-invariant. Proof. Let PZ denote the distribution of N. Independence of W and N implies (W, N) µΣ PZ. Therefore T Z(W, N) (µΣ PZ) (T Z) 1 := P Σ Z . We need to show that P Σ Z is Σ-invariant: For σ Σ we can compute P Σ Z (T Z σ ) 1 =(µΣ PZ) (T Z) 1 (T Z σ ) 1 (47) =(µΣ PZ) (T Z σ T Z) 1 =(µΣ PZ) (T Z (T Σ σ id)) 1 =(µΣ PZ) (T Σ σ id) 1 (T Z) 1 , where T Σ is the left-multiplication action of Σ on itself. Invariance of µΣ implies (µΣ PZ) (T Σ σ id) 1 =(µΣ (T Σ σ ) 1) PZ = µΣ PZ . (48) P Σ Z T 1 σ = (µΣ PZ) (T Z) 1 = P Σ Z . (49) This proves P Σ Z is Σ-invariant as claimed. Next we show how the proof of Theorem 4.1 can be generalized to a wider variety of objective functionals. This result will utilize a certain topology on the space of bounded measurable functions which we describe in the following definition. Definition B.9. Let V be a subspace of Mb(X)n, n Z+, and M(X) be the set of finite signed measures on X. For ν M(X)n we define τν : V R by τν(γ) := Pn i=1 R γidνi and we let T = {τν : ν M(X)n}. T is a separating vector space of linear functionals on V and we equip V with the weak topology from T (i.e., the weakest topology on V for which every τ T is continuous). This makes V a locally convex topological vector space with dual space V = T ; see Theorem 3.10 in (Rudin, 2006). In the following we will abbreviate this by saying that V has the M(X)-topology. Theorem B.10. Let V be a subspace of Mb(X)n, n Z+, that is closed under Σ in the sense of (11) and satisfies SΣ[V ] V . Given an objective functional H : V P(X) P(X) [ , ) and a test function space Γ V we define DΓ H(Q P) := sup γ Γ H(γ; Q, P) . (50) If H( ; Q, P) is concave and upper semi-continuous (USC) in the M(X)-topology on V (see Definition B.9) and H(γ Tσ; Q, P) = H(γ; Q T 1 σ , P T 1 σ ) (51) Structure-preserving GANs for all σ Σ, γ V , and Q, P P(X) then for all Σ-invariant Q, P we have DΓ H(Q P) DSΣ[Γ] H (Q P) . (52) If, in addition, SΣ[Γ] Γ then SΣ[Γ] = Γinv Σ and DΓ H(Q P) = DΓinv Σ H (Q P) . (53) Remark B.11. See Appendix C for conditions implying SΣ[Γ] Γ. Proof. Fix γ Γ and Σ-invariant Q, P. Define G := H( ; Q, P) and note that G : V ( , ] is LSC and convex. Convex conjugate duality (see the Fenchel-Moreau Theorem, e.g., Theorem 2.3.6 in Bot et al. (2009)) and Fubini s theorem then imply G(SΣ[γ]) = sup ν M(X)n{τν(SΣ[γ]) G (τν)} = sup ν M(X)n{ X Z SΣ[γi]dνi G (τν)} = sup ν M(X)n{ Z X Z γi Tσdνi G (τν)µΣ(dσ)} = sup ν M(X)n{ Z τν(γ Tσ) G (τν)µΣ(dσ)} Z G(γ Tσ)µΣ(dσ) . We can use our assumptions to compute G(γ Tσ) = H(γ Tσ; Q, P) = H(γ; Q T 1 σ , P T 1 σ ) = H(γ; Q, P) and hence we obtain H(SΣ[γ]; Q, P) H(γ; Q, P) . Taking the supremum over γ Γ gives (52). If SΣ[Γ] Γ then we clearly have the bound DSΣ[Γ] H DΓ H and hence DSΣ[Γ] H = DΓ H. 
The equality SΣ[Γ] = Γinv Σ was shown in the proof of Theorem 4.1 and so we are done. Theorem B.10 applies to many classes of divergences, some of which have not been discussed in the main text. For example: 1. Integral probability metrics and MMD (5); see (M uller, 1997; Sriperumbudur et al., 2012). 2. (f, Γ) divergences (6); concavity and USC of the objective functional follows Proposition B.8 in (Birrell et al., 2022). 3. Sinkhorn divergences (9); concavity and USC of the objective functional follows Lemma B.7 in (Birrell et al., 2022). 4. R enyi divergence for α (0, 1); see Theorem 3.1 in (Birrell et al., 2021). 5. The Kullback-Leibler Approximate Lower bound Estimator (KALE); see Definition 1 in (Glaser et al., 2021). Structure-preserving GANs C. Conditions Ensuring SΣ[Γ] Γ In this appendix we provide conditions under which the test function space Γ is closed under symmetrization, that being a key assumption in our main results in Section 4. First we show that SΣ[Γ] Γ when Γ is the unit ball in an appropriate RKHS. Lemma C.1. Let V Mb(X) be a separable RKHS with reproducing-kernel k : X X R. Let Γ = {γ V : γ V 1} be the unit ball in V . Suppose we have a measurable group action T : Σ X X and k is Σ-invariant under this action (i.e., k(Tσ(x), Tσ(y)) = k(x, y) for all σ Σ, x, y X). Then SΣ[Γ] Γ. Remark C.2. The proof will use many standard properties of a RKHS. In particular, recall that the assumption X Mb(X) implies k is bounded and jointly measurable. See Chapter 4 in (Steinwart & Christmann, 2008) for this and further background. See (Sriperumbudur et al., 2011) and references therein for more discussion of characteristic kernels as well as the related topic of universal kernels. Proof. The Σ-invariance of k implies k(Tσ(x), y) = k(Tσ(x), Tσ(Tσ 1(y))) = k(x, Tσ 1(y)) (54) k( , Tσ(x)), k( , Tσ(y)) V = k(Tσ(x), Tσ(y)) = k(x, y) = k( , x), k( , y) V (55) for all σ Σ and x, y X. Next we will show that the map Uσ : γ 7 γ Tσ is an isometry on V for all σ Σ, γ V : It is clearly a linear map. To show its range is contained in V , first recall that the span of {k( , x)}x X is dense in V . Therefore, given γ V there is a sequence γn γ having the form i=1 an,ik( , xn,i) for some an,i R, xn,i X. Equation (54) implies i=1 an,ik(Tσ( ), xn,i) = i=1 an,ik( , Tσ 1(xn,i)) . Combining Eq. (56) with Eq. (55) we can conclude that γn Tσ V = γn V and γn Tσ γm Tσ V = γn γm V . γn converges in V , hence is Cauchy, therefore γn Tσ is Cauchy as well. We have assumed V is complete, therefore γn Tσ γ for some γ V . V is a RKHS, hence the evaluation maps are continuous and we find γ(x) = limn γn(Tσ(x)) = γ(Tσ(x)) for all x. Therefore γ Tσ = γ V and γ Tσ V = lim n γn Tσ V = lim n γn V = γ V . This proves Uσ is an isometry on V . Now fix γ Γ. We will show that the map σ Uσ[γ] is Bochner integrable (see, e.g., Appendix E in Cohn (2013)): It clearly has has separable range since V was assumed to be separable. By the same reasoning as above, given γ V we have a sequence γn γ where i=1 an,ik( , xn,i) . γ, Uσ[γ] V = lim n i=1 an,i k( , xn,i), Uσ[γ] V = lim n i=1 an,i, Uσ[γ](xn,i) i=1 an,i, γ(Tσ(xn,i)) , Structure-preserving GANs which is now clearly measurable in σ due to the measurability of the action. Therefore σ 7 Uσ[γ] is strongly measurable. Uσ[γ] V = γ V 1, therefore the Bochner integral R Uσ[γ]µΣ(dσ) exists in V and satisfies Z Uσ[γ]µΣ(dσ) V Z Uσ[γ] V µΣ(dσ) 1 . This proves R Uσ[γ]µΣ(dσ) Γ. Finally, V is a RKHS and so the evaluation maps are in V . 
Therefore evaluation commutes with the Bochner integral and we find ( Z Uσ[γ]µΣ(dσ))(x) = Z Uσ[γ](x)µΣ(dσ) = Z γ(Tσ(x))µΣ(dσ) = SΣ[γ](x) . Hence we can conclude SΣ[γ] Γ for all γ Γ as claimed. The next result provides a general framework for proving SΣ[Γ] Γ. Lemma C.3. Let V Mb(X)n, n Z+, be a subspace equipped with the M(X)-topology (see Definition B.9) and Γ V . If Γ is convex and closed, the group action T : Σ X X is measurable, SΣ[V ] V , and Γ is closed under Σ (i.e., γ Tσ Γ for all γ Γ, σ Σ) then SΣ[Γ] Γ. Proof. Suppose we have γ Γ with SΣ[γ] Γ. As noted in Definition B.9, V is a locally convex topological vector space with V = {τν : ν M(X)n}, τν(γ) := Pn i=1 R γidνi. The separating hyperplane theorem (see Theorem 3.4(b) in Rudin (2006)) applied to A = {SΣ[γ]} and B = Γ therefore implies the existence of ν M(X)n such that τν( γ) > τν(SΣ[γ]) (56) for all γ Γ. We have assumed Γ is closed under Σ and so we can let γ = γ Tσ to get Z SΣ[γi]dνi > 0 (57) for all σ Σ. Integrating with respect to µΣ(dσ) and using Fubini s theorem to change the order of integration we obtain a contradiction. Therefore SΣ[γ] Γ as claimed. We end this section with several examples of function spaces, V , that are useful in conjunction with Lemma C.3: 1. V = Mb(X)n, n Z+, in which case SΣ[V ] V follows from measurability of the action. 2. X is a metric space, the action T : Σ X X is continuous, and V = Cb(X)n, n Z+. In this case, SΣ[V ] V follows from the dominated convergence theorem. 3. X is a metric space, the action T : Σ X X is continuous, Tσ is 1-Lipschitz for all σ Σ, and V = Lip1 b(X)n, n Z+. In this case, SΣ[V ] V follows from the following calculation: |SΣ[γ](x) SΣ[γ](y)| Z |γ(Tσ(x)) γ(Tσ(y))|µΣ(dσ) Z d(Tσ(x), Tσ(y))µΣ(dσ) Z d(x, y)µΣ(dσ) = d(x, y) for all γ Lip1 b(X). D. Additional Properties of Σ-Invariant (f, Γ)-Divergences In this appendix we derive further properties of (f, Γ)-divergences between Σ-invariant distributions. Here we will assume that X is a complete separable metric space (with metric d). Our analysis will require the following notion of a determining set of functions. Structure-preserving GANs Definition D.1. Given Q P(X), a subset Ψ Mb(X) will be called Q-determining if for all Q, P Q, EQ[ψ] = EP [ψ] for all ψ Ψ implies Q = P. We will also need f and Γ to satisfy one of the following admissibility criteria, as introduced in (Birrell et al., 2022). Definition D.2. For a, b with a < 1 < b we define F1(a, b) to be the set of convex functions f : (a, b) R with f(1) = 0. For f F1(a, b), if b is finite we extend the definition of f by f(b) := limx b f(x). Similarly, if a is finite we define f(a) := limx a f(x) (convexity implies these limits exist in ( , ]). Finally, extend f to x [a, b] by f(x) = . The resulting function f : R ( , ] is convex and LSC. We will call f F1(a, b) admissible if {f < } = R and limy f (y) < (note that this limit always exists by convexity). If f is also strictly convex at 1 then we will call f strictly admissible. We will call Γ Cb(X) admissible if 0 Γ, Γ is convex, and Γ is closed in the M(X)-topology on Cb(X) (see Definition B.9). Γ will be called strictly admissible if it also satisfies the following property: There exists a P(X)-determining set Ψ Cb(X) such that for all ψ Ψ there exists c R, ϵ > 0 such that c ϵψ Γ. Finally, an admissible Γ Cinv b,Σ(X) (the set of Σ-invariant bounded continuous functions) will be called Σ strictly admissible if there exists a PΣ(X)-determining set Ψ Cb(X) such that for all ψ Ψ there exists c R, ϵ > 0 such that c ϵψ Γ. 
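To make Definition D.2 concrete, here is a short worked example of ours (not taken from the text), reading the two admissibility conditions on f as conditions on its Legendre transform f*: the KL choice f_KL(x) = x log x from (25) is strictly admissible.

\begin{align*}
f_{\mathrm{KL}} &\in \mathcal{F}_1(0,\infty), \qquad f_{\mathrm{KL}}(1) = 0, \qquad f_{\mathrm{KL}}''(x) = 1/x > 0 \quad \text{(strict convexity at } x = 1\text{)}, \\
f_{\mathrm{KL}}^*(y) &= \sup_{x > 0}\,\{xy - x\log x\} = e^{y-1} \quad \text{(the supremum is attained at } x = e^{y-1}\text{)}, \\
&\text{hence } \{f_{\mathrm{KL}}^* < \infty\} = \mathbb{R} \quad \text{and} \quad \lim_{y \to -\infty} f_{\mathrm{KL}}^*(y) = 0 < \infty .
\end{align*}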
One way to construct a Σ-strictly admissible set is to start with an appropriate strictly admissible set and then restrict to the subset of Σ-invariant functions; see Appendix D.1 for a proof. Lemma D.3. Let Γ Cb(X). 1. If Γ is admissible then Γinv Σ is admissible. 2. If Γ is strictly admissible and SΣ[Γ] Γ then Γinv Σ is Σ-strictly admissible. Below are several useful examples of strictly admissible Γ that satisfy SΣ[Γ] Γ. 1. Γ := Cb(X), if the action is continuous in x, i.e., if Tσ : X X is continuous for all σ Σ. 2. Γ := {g Cb(X) : |g| C} for any C > 0 and assuming the action is continuous in x, 3. Γ := Lip L b (X) for any L > 0 and assuming the action is 1-Lipschitz, i.e., d(Tσ(x), Tσ(y)) d(x, y) for all σ Σ, x, y X. 4. Γ := {g Lip L b (X) : |g| C} for any C, L > 0 and assuming the action is 1-Lipschitz. 5. The unit ball in an appropriate RKHS V , Γ := {g V : g V 1}, assuming the kernel is Σ-invariant; see Lemma D.6 for details. The following result extends the infimal convolution formula and divergence properties from (Birrell et al., 2022) to the case where the models and test-function space are Σ-invariant. Theorem D.4. Suppose f and Γ are admissible and Γ Cinv b,Σ(X). For Q, P PΣ(X) we have the following properties: 1. Infimal Convolution Formula on PΣ(X): DΓ f (Q P) = inf η PΣ(X){Df(η P) + W Γ(Q, η)} . (58) 2. Existence of an Optimizer: If DΓ f (Q P) < then there exists η PΣ(X) such that DΓ f (Q P) = Df(η P) + W Γ(Q, η ) . (59) If f is strictly convex then there is a unique such η . 3. PΣ(X)-Divergence Property for W Γ: W Γ(Q, P) 0 and W Γ(Q, P) = 0 if Q = P. If Γ is Σ-strictly admissible then W Γ(Q, P) = 0 implies Q = P. Structure-preserving GANs 4. PΣ(X)-Divergence Property for DΓ f : DΓ f (Q P) 0 and DΓ f (Q P) = 0 if Q = P. If f is strictly admissible and Γ is Σ-strictly admissible then DΓ f (Q P) = 0 implies Q = P. Proof. 1. Part 1 of Theorem 2.15 from (Birrell et al., 2022) implies an infimal convolution formula on P(X), hence DΓ f (Q P) = inf η P(X){Df(η P) + W Γ(Q, η)} inf η PΣ(X){Df(η P) + W Γ(Q, η)} . (60) To prove the reverse inequality, we use the bound Df DSΣ[Mb(X)] f , the equality SΣ[Γ] = Γ, and then Theorem B.5 to compute DΓ f (Q P) inf η P(X){DSΣ[Mb(X)] f (η P) + W SΣ[Γ](Q, η)} (61) = inf η P(X){Df(SΣ[η] P) + W Γ(Q, SΣ[η])} = inf η PΣ(X){Df(η P) + W Γ(Q, η)} . This proves the infimal convolution formula on PΣ(X). 2. Now suppose DΓ f (Q P) < . Part 2 of Theorem 2.15 from (Birrell et al., 2022) implies there exists η P(X) such that DΓ f (Q P) = Df(η P) + W Γ(Q, η ) . (62) We need to show that η can be taken to be Σ-invariant. To do this, first use the infimal convolution formula to bound DΓ f (Q P) Df(SΣ[η ] P) + W Γ(Q, SΣ[η ]) . (63) The Σ-invariance of Q and P together with Theorem B.5 imply W Γ(Q, SΣ[η ]) = W Γ(Q, η ) . (64) Df(SΣ[η ] P) = D Minv b,Σ(X) f (η P) Df(η P) . (65) DΓ f (Q P) Df(SΣ[η ] P) + W Γ(Q, SΣ[η ]) Df(η P) + W Γ(Q, η ) = DΓ f (Q P) . (66) DΓ f (Q P) = Df(SΣ[η ] P) + W Γ(Q, SΣ[η ]) (67) with SΣ[η ] PΣ(X) as claimed. If f is strictly convex then uniqueness is a corollary of the corresponding uniqueness result from Part 2 of Theorem 2.15 in (Birrell et al., 2022). 3. Admissibility of Γ implies 0 Γ, hence W Γ(Q P) EQ[0] EP [0] = 0. If Q = P then the definition clearly implies W Γ(Q, P) = 0. If Γ is Σ-strictly admissible and W Γ(Q, P) = 0 then 0 EQ[g] EP [g] for all g Γ. Letting g = c ϵψ as in the definition of Σ-strict admissiblity we see that 0 (EQ[ψ] EP [ψ]). Hence EQ[ψ] = EP [ψ] for all ψ Ψ. 
Ψ is a PΣ(X)-determining set and Q, P PΣ(X), hence we can conclude that Q = P. 4. We know that Df 0 and W Γ 0, therefore the infimal convolution formula implies DΓ f 0. If Q = P we can bound 0 DΓ f (Q P) Df(Q P) = 0 , (68) Structure-preserving GANs hence DΓ f (Q P) = 0. Finally, suppose f is strictly admissible, Γ is Σ-strictly admissible, and DΓ f (Q P) = 0. Then Part 2 of this theorem implies 0 = DΓ f (Q P) = Df(η P) + W Γ(Q, η ) (69) for some η PΣ(X). Both terms are non-negative, hence Df(η P) = W Γ(Q, η ) = 0 . (70) The PΣ(X)-divergence property for W Γ then implies Q = η . f being strictly admissible implies that Df has the divergence property, hence η = P. Therefore Q = P as claimed. D.1. Admissibility Lemmas In this appendix we prove several lemmas regarding admissible test function spaces. First we prove the admissibility properties of Γinv Σ from Lemma D.3. Lemma D.5. Let Γ Cb(X). 1. If Γ is admissible then Γinv Σ is admissible. 2. If Γ is strictly admissible and SΣ[Γ] Γ then Γinv Σ is Σ-strictly admissible. Proof. 1. The zero function is Σ-invariant, hence is in Γinv Σ . If γ1, γ2 Γinv Σ and t [0, 1] then convexity of Γ implies tγ1 + (1 t)γ2 Γ. We have (tγ1 + (1 t)γ2) Tσ = tγ1 Tσ + (1 t)γ2 Tσ = tγ1 + (1 t)γ2, hence we conclude that Γinv Σ is convex. Finally, we can write Γinv Σ =Γ \ σ Σ,x X {γ Cb(X) : γ(Tσ(x)) = γ(x)} σ Σ,x X {γ Cb(X) : τδTσ(x)[γ] = τδx[γ]} . We have assumed Γ is admissible, hence it is closed. The maps τν, ν M(X) are continuous on Cb(X), hence the sets {γ Cb(X) : τδTσ(x)[γ] = τδx[γ]} are also closed. Therefore Γinv Σ is closed. This proves Γinv Σ is admissible. 2. Now suppose Γ is strictly admissible and SΣ[Γ] Γ. In particular, Γ is admissible and so Part 1 implies Γinv Σ is admissible. Let Ψ be as in the definition of strict admissibility. For every ψ Ψ there exists c R, ϵ > 0 such that c ϵψ Γ. Hence c ϵSΣ[ψ] = SΣ[c ϵψ] SΣ[Γ] = Γinv Σ (see the proof of Theorem 4.1) and SΣ[Ψ] Cb(X). Finally, suppose Q, P PΣ(X) such that EQ[SΣ[ψ]] = EP [SΣ[ψ]] for all ψ Ψ. Part (b) of Lemma 3.2 then implies EQ[ψ] = EP [ψ] for all ψ Ψ. Ψ is P(X)-determining, hence Q = P. Therefore SΣ[Ψ] is a PΣ(X)-determining set and we conclude that Γinv Σ is Σ-strictly admissible. Next we provide assumptions under which the unit ball in a RKHS is closed under SΣ and is (strictly) admissible. Lemma D.6. Let V Cb(X) be a separable RKHS with reproducing-kernel k : X X R. Let Γ = {γ V : γ V 1} be the unit ball in V . Then: 1. Γ is admissible. 2. If the kernel is characteristic (i.e., the map P P(X) 7 R k( , x)P(dx) V is one-to-one) then Γ is strictly admissible. 3. If k is Σ-invariant the SΣ[Γ] Γ. Proof. 1. Admissibility was shown in Lemma C.9 in (Birrell et al., 2022). Structure-preserving GANs 2. Now suppose the kernel is characteristic. Let P, Q P(X) with R γd P = R γd Q for all γ Γ (and hence for all γ V ). Therefore 0 = Z γd Q Z γd P = γ, Z k( , x)Q(dx) Z k( , x)P(dx) V (71) for all γ V . Therefore R k( , x)Q(dx) = R k( , x)P(dx). We have assumed the kernel is characteristic, hence we conclude that Q = P. This proves Γ is P(X)-determining. We also have Γ Γ, hence Γ is strictly admissible. 3. This was shown in Lemma C.1 above. E. Coarse-graining and structure-preserving operators We show in this section how to apply our structure preserving formalism, Theorem 4.3 in particular, in the context of coarse-graining. We refer to the reviews (Noid, 2013; Pak & Voth, 2018) for fundamental concepts in the coarse-graining of molecular systems. 
Mathematically, a coarse-graining of the state space $X$ is given by a measurable (non-invertible) map $\xi : X \to Y$, where $y = \xi(x)$ are thought of as the coarse variables and $Y$ is a space of significantly lower complexity than $X$. If $\mathcal{A} = \sigma(\xi)$ is the $\sigma$-algebra generated by the coarse-graining map $\xi$, then a function is measurable with respect to $\mathcal{A}$ if it is constant on every level set $\xi^{-1}(y)$. To complete the description of the coarse-graining one selects a kernel $K_y(dx)$, which in the coarse-graining literature is called the back-mapping. The kernel $K_y(dx)$ describes the conditional distribution of the fully resolved state $x \in \xi^{-1}(y)$, conditioned on the coarse-grained state $y = \xi(x)$, namely $K_y(dx) = P(dx\,|\,y)$; in particular, $K_y(dx)$ is supported on the set $\xi^{-1}(y)$. The kernel naturally induces a projection $S_K : \mathcal{M}_b(X) \to \mathcal{M}_b(X)$ given by
$$S_K[f](x) = \int_{\xi^{-1}(y)} f(x')\, K_y(dx') \quad \text{for any } x \in \xi^{-1}(y),$$
and, by construction, $S_K[f]$ is $\mathcal{A}$-measurable.

If a measure is $S_K$-invariant, i.e., $S_K[P] = P$, then it is uniquely determined by its values on $\mathcal{A}$; in other words, it is completely specified by a probability measure $Q \in \mathcal{P}(Y)$ on the coarse variable $y = \xi(x)$. We refer to such a $Q$ as a coarse-grained probability measure. Once a coarse-grained measure is constructed on $Y$ (see (Noid, 2013; Pak & Voth, 2018) for a rich array of such methods), it can then be reconstructed as a measure on $X$ via the kernel $K_y(dx)$ as $P(dx) = K_y(dx)Q(dy)$. For example, if we take $X$ and $Y$ to be discrete sets, we can choose the trivial (uniform) reconstruction kernel with density
$$k_y(x) = \delta_x(\xi^{-1}(y))\, \frac{1}{|\xi^{-1}(y)|},$$
and any coarse-grained measure with density $q(y)$ on the coarse variables $y$ is reconstructed as a probability density on $X$:
$$p(x) = \delta_x(\xi^{-1}(y))\, \frac{1}{|\xi^{-1}(y)|}\, q(y), \quad \text{where } y = \xi(x), \ x \in X.$$

Finally, we note that the back-mappings $K_y(dx) = P(dx\,|\,y)$, being probabilities conditioned on the coarse variables, can themselves be constructed to great accuracy as generative models using conditional GANs; see (Li et al., 2020; Stieffenhofer et al., 2021).
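As a toy illustration of these definitions (our example, with an artificial discrete state space and the uniform back-mapping just described), the sketch below computes S_K[f] and the reconstructed density p from a coarse-grained density q.

# Toy illustration (ours, not from the paper) of a discrete coarse-graining
# with the trivial (uniform) back-mapping described above.
import numpy as np

X = np.arange(6)          # fully resolved states
xi = X // 2               # coarse-graining map xi : X -> Y = {0, 1, 2}; level sets have size 2
Y = np.unique(xi)
level_set_size = np.array([np.sum(xi == y) for y in Y])

def S_K(f):
    # S_K[f](x) = average of f over the level set xi^{-1}(xi(x))   (uniform kernel K_y)
    return np.array([f[xi == xi[x]].mean() for x in X])

f = np.array([1.0, 3.0, 2.0, 4.0, 0.0, 4.0])
print(S_K(f))             # [2. 2. 3. 3. 2. 2.]: constant on each level set, i.e. sigma(xi)-measurable

# Reconstruction of a coarse-grained density q on Y as a density p on X:
# p(x) = q(xi(x)) / |xi^{-1}(xi(x))|
q = np.array([0.5, 0.3, 0.2])
p = q[xi] / level_set_size[xi]
print(p, p.sum())         # [0.25 0.25 0.15 0.15 0.1 0.1], sums to 1 and is S_K-invariant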
F. Additional Experiments

Figure 6. 2D projection of the DL 2 -GAN generated samples onto the support plane of the source distribution Q [cf. Section 5.3]. Each column shows the result after a given number of training epochs. The rows correspond to different settings for the generators and discriminators. The solid and dashed blue ovals mark the 25% and 50% probability regions, respectively, of the data source Q, while the heat-map shows the generator samples. Panel (a): models are trained with 50 training samples. Panel (b): models are trained with 5000 training samples.

Figure 7. 2D projection of the DL α-GAN generated samples onto the support plane of the source distribution Q [cf. Section 5.3]. Each column shows the result after a given number of training epochs. The rows correspond to different settings for the generators and discriminators. The solid and dashed blue ovals mark the 25% and 50% probability regions, respectively, of the data source Q, while the heat-map shows the generator samples. Models are trained on 200 training points. Panel (a): α = 5. Panel (b): α = 10.

Figure 8. 2D projection of the DL 2 -GAN generated samples (3000 for each setting) onto the support plane of the source distribution Q [cf. Section 5.3]. Each GAN is trained for 10000 epochs. The rows correspond to the number of training points N = 50, 200, or 5000. The columns correspond to different settings for the generators and discriminators. The solid and dashed blue ovals mark the 25% and 50% probability regions, respectively, of the data source Q. Compared to Figure 6, the heat maps are suppressed in this figure for easier examination of the sample quality.

Figure 9. Randomly generated digits by the DL 2 -GANs trained on Rot MNIST after 20K generator iterations with 1% (600) training data. Panels: (a) CNN G&D; (b) Eqv G + CNN D, Σ = C4; (c) CNN G + Inv D, Σ = C4; (d) (I)Eqv G + Inv D, Σ = C4; (e) Eqv G + Inv D, Σ = C4; (f) Eqv G + Inv D, Σ = C8.

Figure 10. Randomly generated digits by the RA-GANs trained on Rot MNIST after 20K generator iterations with 1% (600) training data. Panels: (a) CNN G&D; (b) Eqv G + CNN D, Σ = C4; (c) CNN G + Inv D, Σ = C4; (d) (I)Eqv G + Inv D, Σ = C4; (e) Eqv G + Inv D, Σ = C4; (f) Eqv G + Inv D, Σ = C8.

Figure 11. Randomly generated digits by the DL 2 -GANs trained on Rot MNIST after 20K generator iterations with 0.33% (200) training data. Our model Eqv G + Inv D, Σ = C8, is the only one that can generate high-fidelity images in this setting. We note that the repetitively generated digits are inevitable in such a small data regime, as the models are forced to learn the empirical distribution of the limited training data (20 images per class). Panels: (a) CNN G&D; (b) Eqv G + CNN D, Σ = C4; (c) CNN G + Inv D, Σ = C4; (d) (I)Eqv G + Inv D, Σ = C4; (e) Eqv G + Inv D, Σ = C4; (f) Eqv G + Inv D, Σ = C8.

Figure 12. Randomly generated digits by the RA-GANs trained on Rot MNIST after 20K generator iterations with 0.33% (200) training data. Our model Eqv G + Inv D, Σ = C8, is the only one that can generate high-fidelity images in this setting. We note that the repetitively generated digits are inevitable in such a small data regime, as the models are forced to learn the empirical distribution of the limited training data (20 images per class). Panels: (a) CNN G&D; (b) Eqv G + CNN D, Σ = C4; (c) CNN G + Inv D, Σ = C4; (d) (I)Eqv G + Inv D, Σ = C4; (e) Eqv G + Inv D, Σ = C4; (f) Eqv G + Inv D, Σ = C8.

Table 3. The median of the FIDs (lower is better), calculated every 1,000 generator updates for 20,000 iterations, averaged over three independent trials. The number of training samples used for the experiments varies from 0.33% (200) to 100% (60,000) of the entire training set.

Loss  Architecture                0.33%    1%     5%    10%    25%    50%   100%
      CNN G&D                       431   295    357    348    407    403    392
      Eqv G + CNN D, Σ = C4         865   389    333    355    325    380    393
      CNN G + Inv D, Σ = C4         382   223    181    188    185    177    176
      (I)Eqv G + Inv D, Σ = C4      360   173    141    132    124    135    130
      Eqv G + Inv D, Σ = C4         190    98     78     89     80     84     82
      Eqv G + Inv D, Σ = C8         313   123     52     51     59     52     57

      CNN G&D                       423   280    261    283    290    297    293
      Eqv G + CNN D, Σ = C4         409   253    271    251    263    274    275
      CNN G + Inv D, Σ = C4         511   330    208    192    190    183    173
      (I)Eqv G + Inv D, Σ = C4      484   273    147    133    141    124    126
      Eqv G + Inv D, Σ = C4         352   149     99     88     80     80     81
      Eqv G + Inv D, Σ = C8         293   122     55     57     53     53     51

(a) ANHIR, RA-GAN
(b) ANHIR, DL 2 -GAN
(c) LYSTO, RA-GAN
(d) LYSTO, DL 2 -GAN

Figure 13. The curves of the Fréchet Inception Distance (FID), calculated after every 2,000 generator updates up to 40,000 iterations, averaged over three random trials on the medical data sets ANHIR (top row) and LYSTO (bottom row). The legend entries are CNN G & D, (I)Eqv G + Inv D, and Eqv G + Inv D, each with and without data augmentation; the suffix aug. denotes the presence of data augmentation during GAN training.

Figure 14. Real and GAN generated ANHIR images dyed with different stains. Left panel: real images. Middle and right panels: randomly selected DL 2 -GAN generated samples after 40,000 generator iterations. Middle panel: CNN G&D. Right panel: Eqv G + Inv D.

Figure 15. Real and GAN generated LYSTO images of breast, colon, and prostate cancer. Left panel: real images. Middle and right panels: randomly selected DL 2 -GAN generated samples after 40,000 generator iterations. Middle panel: CNN G&D. Right panel: Eqv G + Inv D.

Table 4. The (min, median) of the FIDs over the course of training, averaged over three independent trials on the medical images, where the plus sign + after the data set, e.g., ANHIR+, denotes the presence of data augmentation during training.

Loss   Architecture        ANHIR         ANHIR+
RA     CNN G&D             (186, 523)    (184, 503)
RA     (I)Eqv G + Inv D    (100, 142)    (88, 140)
RA     Eqv G + Inv D       (78, 125)     (84, 118)
DL 2   CNN G&D             (313, 485)    (347, 539)
DL 2   (I)Eqv G + Inv D    (120, 176)    (119, 177)
DL 2   Eqv G + Inv D       (97, 157)     (90, 128)

Loss   Architecture        LYSTO         LYSTO+
RA     CNN G&D             (281, 340)    (250, 312)
RA     (I)Eqv G + Inv D    (218, 272)    (212, 271)
RA     Eqv G + Inv D       (175, 238)    (181, 227)
DL 2   CNN G&D             (289, 410)    (265, 376)
DL 2   (I)Eqv G + Inv D    (253, 343)    (244, 329)
DL 2   Eqv G + Inv D       (205, 259)    (192, 259)

G. Implementation Details

G.1. Common experimental setup

All models are trained using the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.0 and β2 = 0.9 (Zhang et al., 2019). Discriminators are updated twice after each generator update. An exponential moving average of the generator weights across iterations, with α = 0.9999, is used when sampling images (Brock et al., 2018).

G.2. Rot MNIST

For the RA-GAN, training is stabilized by regularizing the discriminator γ ∈ Γ with a zero-centered gradient penalty (GP) on the real distribution Q of the form
$$R_1 = \frac{\lambda_1}{2}\, \mathbb{E}_{x \sim Q}\big[\|\nabla_x \gamma(x)\|_2^2\big]. \tag{72}$$
We set the GP weight λ1 = 0.1 according to (Dey et al., 2021). For the DL α-GAN, we use the one-sided GP as a soft constraint on the Lipschitz constant,
$$R_2 = \lambda_2\, \mathbb{E}_{x \sim \rho_g}\big[\max\{0, \|\nabla_x \gamma(x)\|_2 - 1\}\big], \tag{73}$$
where ρg is the distribution of TX + (1 − T)Y, with X ∼ Pg, Y ∼ Q, and T ∼ Unif([0, 1]) all independent. The one-sided GP weight is set to λ2 = 10 according to (Birrell et al., 2022). Unequal learning rates were used, ηG = 0.0001 for the generator and ηD = 0.0004 for the discriminator. The neural architectures for the generators and discriminators are displayed in Table 5 and Table 6.

G.3. ANHIR and LYSTO

As for Rot MNIST, the GP weights are set to λ1 = 0.1 for the RA-GAN in (72) and λ2 = 10 for the DL α-GAN in (73), and we consider only the case α = 2. The learning rates were set to ηG = 0.0001 and ηD = 0.0004, respectively; a sketch of the two gradient penalties is given below.
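The following is a minimal sketch of ours (not the authors' training code) showing how the two gradient penalties (72) and (73) can be computed; `discriminator` plays the role of the test function γ, and the batch shapes, helper names, and default weights are illustrative.

# Illustrative sketch (not the authors' code) of the gradient penalties in Eqs. (72)-(73).
import torch

def r1_penalty(discriminator, x_real, lambda1=0.1):
    # Zero-centered GP on the real distribution Q:  (lambda1 / 2) * E_{x~Q} ||grad_x gamma(x)||_2^2
    x = x_real.detach().requires_grad_(True)
    out = discriminator(x).sum()
    (grad,) = torch.autograd.grad(out, x, create_graph=True)
    return 0.5 * lambda1 * grad.flatten(1).pow(2).sum(dim=1).mean()

def one_sided_penalty(discriminator, x_fake, x_real, lambda2=10.0):
    # One-sided GP at x ~ rho_g, where rho_g is the law of T*X + (1-T)*Y with
    # X ~ P_g, Y ~ Q, T ~ Unif([0,1]) independent:  lambda2 * E[ max(0, ||grad_x gamma(x)||_2 - 1) ]
    t = torch.rand(x_real.shape[0], *([1] * (x_real.dim() - 1)), device=x_real.device)
    x = (t * x_fake + (1 - t) * x_real).detach().requires_grad_(True)
    out = discriminator(x).sum()
    (grad,) = torch.autograd.grad(out, x, create_graph=True)
    grad_norm = grad.flatten(1).norm(2, dim=1)
    return lambda2 * torch.clamp(grad_norm - 1.0, min=0.0).mean()

During training, such penalties would simply be added to the discriminator loss before the backward pass.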
ResNets instead of CNNs are used as the baseline generators and discriminators, and the detailed architectural designs are specified in Table 7 and Table 8.

G.4. Architectures

Table 5. Generator architectures used in the Rot MNIST experiment. Conv SN and C4-Conv SN stand for spectrally-normalized 2D convolution and its C4-equivariant counterpart, respectively. The incomplete attempt at building equivariant generators ((I)Eqv G) does not have the C4-symmetrization layer. The C8-equivariant generator (Eqv G, Σ = C8) is built by replacing 3×3 C4-Conv SN with 5×5 C8-Conv SN while adjusting the number of filters to maintain a similar number of trainable parameters.

CNN Generator (CNN G):
Sample noise z ∈ R^64 ∼ N(0, I)
Embed label class y into ŷ ∈ R^64
Concatenate z and ŷ into h ∈ R^128
Project and reshape h to 7×7×128
3×3 Conv SN, 128 → 512
ReLU; Up ×2
3×3 Conv SN, 512 → 256
CCBN; ReLU; Up ×2
3×3 Conv SN, 256 → 128
CCBN; ReLU
3×3 Conv SN, 128 → 1

C4-Equivariant Generator (Eqv G, Σ = C4):
Sample noise z ∈ R^64 ∼ N(0, I)
Embed label class y into ŷ ∈ R^64
Concatenate z and ŷ into h ∈ R^128
Project and reshape h to 7×7×128
C4-symmetrization of h
3×3 C4-Conv SN, 128 → 256
ReLU; Up ×2
3×3 C4-Conv SN, 256 → 128
CCBN; ReLU; Up ×2
3×3 C4-Conv SN, 128 → 64
CCBN; ReLU
3×3 C4-Conv SN, 64 → 1
C4-Max Pool

Table 6. Discriminator architectures used in the Rot MNIST experiment. The C8-invariant discriminator (Inv D, Σ = C8) is built by replacing 3×3 C4-Conv SN with 5×5 C8-Conv SN while adjusting the number of filters to maintain a similar number of trainable parameters.

CNN Discriminator (CNN D):
Input image x ∈ R^{28×28×1}
3×3 Conv SN, 1 → 128
Leaky ReLU; Avg. Pool
3×3 Conv SN, 128 → 256
Leaky ReLU; Avg. Pool
3×3 Conv SN, 256 → 512
Leaky ReLU; Avg. Pool
Global Avg. Pool into f
Embed label class y into ŷ
Project (ŷ, f) into a scalar

C4-Invariant Discriminator (Inv D, Σ = C4):
Input image x ∈ R^{28×28×1}
3×3 C4-Conv SN, 1 → 64
Leaky ReLU; Avg. Pool
3×3 C4-Conv SN, 64 → 128
Leaky ReLU; Avg. Pool
3×3 C4-Conv SN, 128 → 256
Leaky ReLU; Avg. Pool
C4-Max Pool
Global Avg. Pool into f
Embed label class y into ŷ
Project (ŷ, f) into a scalar

Table 7. Generator architectures used in the ANHIR and LYSTO experiments. The generator residual block (Res Block G) is a cascade of [CCBN, ReLU, Up ×2, 3×3 Conv SN, CCBN, ReLU, 3×3 Conv SN] with a short connection consisting of [Up ×2, 1×1 Conv SN]. The equivariant residual block (D4-Res Block G) is built by replacing each component with its equivariant counterpart. The incomplete attempt at building equivariant generators ((I)Eqv G) does not have the D4-symmetrization layer.

CNN Generator (CNN G):
Sample noise z ∈ R^128 ∼ N(0, I)
Embed label class y into ŷ ∈ R^128
Concatenate z and ŷ into h ∈ R^256
Project and reshape h to 4×4×128
Res Block G, 128 → 256
Res Block G, 256 → 128
Res Block G, 128 → 64
Res Block G, 64 → 32
Res Block G, 32 → 16
3×3 Conv SN, 16 → 3

Equivariant Generator (Eqv G):
Sample noise z ∈ R^128 ∼ N(0, I)
Embed label class y into ŷ ∈ R^128
Concatenate z and ŷ into h ∈ R^256
Project and reshape h to 4×4×128
D4-symmetrization of h
D4-Res Block G, 128 → 90
D4-Res Block G, 90 → 45
D4-Res Block G, 45 → 22
D4-Res Block G, 22 → 11
D4-Res Block G, 11 → 5
D4-BN; ReLU
3×3 D4-Conv SN, 5 → 3
D4-Max Pool

Table 8. Discriminator architectures used in the ANHIR and LYSTO experiments. The discriminator residual block (Res Block D) is a cascade of [ReLU, 3×3 Conv SN, ReLU, 3×3 Conv SN, Max Pool] with a short connection consisting of [1×1 Conv SN, Max Pool].
The equivariant residual block (D4-Res Block D) is built by replacing each component with its equivariant counterpart.

CNN Discriminator (CNN D):
Input image x ∈ R^{64×64×3}
Res Block D, 3 → 16
Res Block D, 16 → 32
Res Block D, 32 → 64
Res Block D, 64 → 128
Res Block D, 128 → 256
Global Avg. Pool into f
Embed label class y into ŷ
Project (ŷ, f) into a scalar

Invariant Discriminator (Inv D):
Input image x ∈ R^{64×64×3}
D4-Res Block D, 3 → 5
D4-Res Block D, 5 → 11
D4-Res Block D, 11 → 22
D4-Res Block D, 22 → 45
D4-Res Block D, 45 → 90
D4-Max Pool
Global Avg. Pool into f
Embed label class y into ŷ
Project (ŷ, f) into a scalar
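To relate these architecture tables back to the theory, the following small sketch of ours (not the paper's implementation) implements the symmetrization operator S_Σ of Lemma B.1 for Σ = C4 acting on images by 90-degree rotations; averaging any test function over the group in this way yields a C4-invariant discriminator.

# Illustrative sketch (not the authors' code): the symmetrization operator
# S_Sigma[gamma](x) = (1/|Sigma|) * sum_sigma gamma(T_sigma(x)) for Sigma = C4
# acting on images by 90-degree rotations.
import numpy as np

def c4_symmetrize(gamma):
    """Return the C4-symmetrized version of a scalar function gamma on H x W (x C) images."""
    def gamma_sym(x):
        # S_Sigma[gamma](x) = (1/4) * sum_{k=0..3} gamma(rot90^k(x))
        return np.mean([gamma(np.rot90(x, k, axes=(0, 1))) for k in range(4)])
    return gamma_sym

# Quick invariance check: gamma_sym(T_sigma(x)) == gamma_sym(x) for a 90-degree rotation T_sigma.
rng = np.random.default_rng(0)
w = rng.normal(size=(28, 28, 1))                  # fixed weights make gamma position-dependent,
gamma = lambda img: float((w * img).sum())        # hence not C4-invariant on its own
x = rng.normal(size=(28, 28, 1))

gamma_sym = c4_symmetrize(gamma)
assert not np.isclose(gamma(x), gamma(np.rot90(x, 1, axes=(0, 1))))
assert np.isclose(gamma_sym(x), gamma_sym(np.rot90(x, 1, axes=(0, 1))))

The Inv D networks above achieve the same invariance architecturally, through equivariant convolutions followed by a group max-pool and a global average pool, rather than by explicit averaging of a full forward pass.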