# Structure-preserving GANs

Jeremiah Birrell¹, Markos A. Katsoulakis¹, Luc Rey-Bellet¹, Wei Zhu¹

Abstract

Generative adversarial networks (GANs), a class of distribution-learning methods based on a two-player game between a generator and a discriminator, can generally be formulated as a minimax problem based on the variational representation of a divergence between the unknown and the generated distributions. We introduce structure-preserving GANs as a data-efficient framework for learning distributions with additional structure such as group symmetry, by developing new variational representations for divergences. Our theory shows that we can reduce the discriminator space to its projection onto the invariant discriminator space, using the conditional expectation with respect to the σ-algebra associated with the underlying structure. In addition, we prove that the discriminator space reduction must be accompanied by a careful design of structured generators, as flawed designs may easily lead to a catastrophic "mode collapse" of the learned distribution. We contextualize our framework by building symmetry-preserving GANs for distributions with intrinsic group symmetry, and demonstrate that both players, namely the equivariant generator and the invariant discriminator, play important but distinct roles in the learning process. Empirical experiments and ablation studies across a broad range of data sets, including real-world medical imaging, validate our theory and show that our proposed methods achieve significantly improved sample fidelity and diversity, almost by an order of magnitude as measured in Fréchet Inception Distance, especially in the small data regime.

¹Department of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, MA 01003, USA. Correspondence to: Jeremiah Birrell.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Figure 1. Real and GAN-generated ANHIR images dyed with the H&E stain [cf. Section 5.5]. Left panel: real images. Right panels: randomly selected $D^L_2$-GAN generated samples after 40,000 generator iterations. Top right panel: CNN G&D, i.e., the baseline model. Bottom right panel: Eqv G + Inv D, i.e., our proposed framework contextualized in learning group-invariant distributions. More images are available in Appendix F.

Figure 2. Randomly generated digits 2, 3 and 8 by GANs trained on the rotated MNIST images using 1% (600) training samples. (a): the baseline CNN model. (b): our proposed framework for learning group-invariant distributions.

1. Introduction

Since their introduction by Goodfellow et al. (2014), generative adversarial networks (GANs) have become a burgeoning domain in distribution learning with a diverse range of innovative applications (Karras et al., 2019; Zhu et al., 2019; Mustafa et al., 2019; Yi et al., 2019). Mathematically, the minimax game between a generator and a discriminator in a GAN can typically be formulated as minimizing a divergence, or another notion of distance with a variational representation, between the unknown and the generated distributions. Such formulations, however, do not make prior structural assumptions on the probability measures, making them sub-optimal in sample efficiency when learning distributions with intrinsic structures, such as the (rotation) group symmetry of medical images without preferred orientation; see Figure 1.
We introduce, in this work, structure-preserving GANs, a data-efficient framework for learning probability measures with embedded structures, by developing new variational representations for divergences between structured distributions. We demonstrate that efficient adversarial learning can be achieved by reducing the discriminator space to its projection onto its invariant subspace, using the conditional expectation with respect to the σ-algebra associated with the underlying structure. This practice, which is rigorously justified by our theory and generally applicable to a broad range of variational divergences, acts effectively as an unbiased regularization that prevents discriminator overfitting, a common challenge for GAN optimization in the limited data regime (Zhao et al., 2020). Furthermore, our theory suggests that the discriminator space reduction must be accompanied by correctly building generators that share the same probabilistic structure, as the lack thereof may easily lead to mode collapse in the trained model, i.e., the generated distribution samples only a subset of the support of the data source [cf. Figure 4a (2nd row)].

As an example, we contextualize our framework by building symmetry-preserving GANs for learning distributions with group symmetry. Unlike prior empirical work, our choice of equivariant generators and invariant discriminators is theoretically founded, and we show (theoretically and empirically) how a flawed design of equivariant generators easily results in the aforementioned mode collapse [cf. Figure 4a (4th row)]. Experiments and ablation studies over synthetic and real-world data sets validate our theory, disentangle the contributions of the structural priors on generators and discriminators, and demonstrate the significant outperformance of our framework in terms of both sample quality and diversity, in some cases by almost an order of magnitude as measured in Fréchet Inception Distance; see Figures 1 and 2 for a visual illustration.

2. Related Work

Neural generation of group-invariant distributions has mainly been proposed in a flow-based framework (Köhler et al., 2019; 2020; Rezende et al., 2019; Liu et al., 2019; Biloš & Günnemann, 2021; Boyda et al., 2021; Garcia Satorras et al., 2021). Such models typically use an equivariant normalizing flow to push forward a group-invariant prior distribution to a complex invariant target. In the context of GANs, Dey et al. (2021) intuitively replace the 2D convolutions with group convolutions (Cohen & Welling, 2016a) to build group-equivariant GANs; however, their empirical study has not been justified by theory, and their incomplete design of the equivariant generator may easily lead to a mode collapse of the learned model; see the discussion of Theorem 4.6. The existence of symmetry can often be deduced from prior or domain knowledge of the distribution, e.g., the rotation symmetry of medical images without preferred orientation. Symmetry detection from data has also been studied in recent works such as (Dehmamy et al., 2021). When extended from group symmetry to probability structures induced by other operators, our work is also related to GAN-assisted coarse-graining (CG) for molecular dynamics (Durumeric & Voth, 2019) and cosmology (Mustafa et al., 2019; Feder et al., 2020); see the end of Section 4.1 for a detailed discussion.
3. Background and Motivation

3.1. Generative adversarial networks

Generative adversarial networks are a class of methods for learning a probability distribution via a zero-sum game between a generator and a discriminator (Goodfellow et al., 2014; Arjovsky et al., 2017; Nowozin et al., 2016; Gulrajani et al., 2017). Specifically, let $(X, \mathcal{M})$ be a measurable space and $\mathcal{P}(X)$ be the set of probability measures on $X$; given a target distribution $Q \in \mathcal{P}(X)$, the original GAN proposed by Goodfellow et al. (2014) learns $Q$ by solving

$$\inf_{g \in G} D(Q \,\|\, P_g) = \inf_{g \in G} \sup_{\gamma \in \Gamma} H[\gamma; Q, P_g], \qquad (1)$$

where $H[\gamma; Q, P_g] = E_Q[\log \gamma] + E_{P_g}[\log(1 - \gamma)]$. The map $g : Z \to X$ in Eq. (1) is called a generator, which maps a random vector $z \in Z$ to a generated sample $g(z) \in X$, pushing forward the noise distribution $P \in \mathcal{P}(Z)$ (typically a Gaussian) to a probability measure $P_g \in \mathcal{P}(X)$, i.e., $P_g := g_\# P := P \circ g^{-1}$; the test function $\gamma : X \to \mathbb{R}$ is called a discriminator, which aims to differentiate the source distribution $Q$ and the generated probability measure $P_g$ by maximizing $H[\gamma; Q, P_g]$. The spaces $G$ and $\Gamma$, respectively, of generators and discriminators are both parametrized by neural networks (NNs), and the solution of model (1) is the best generator $g \in G$ that is able to fool all discriminators $\gamma \in \Gamma$ by achieving the smallest $D(Q \| P_g)$, which measures the dissimilarity between $Q$ and $P_g$.

3.2. Variational representations for divergences

Mathematically, most GANs can be formulated as minimizing the distance between the probability measures $Q$ and $P_g$ according to some divergence or probability metric with a variational representation $\sup_{\gamma \in \Gamma} H(\gamma; Q, P_g)$ as in (1). We hereby recast these formulations in a unified but flexible mathematical framework that will prove essential in Section 4.1. Let $\mathcal{M}(X)$ be the space of measurable functions on $X$ and $\mathcal{M}_b(X)$ be the subspace of bounded measurable functions. Given an objective functional $H : \mathcal{M}(X)^n \times \mathcal{P}(X) \times \mathcal{P}(X) \to [-\infty, \infty]$ and a test function space $\Gamma \subset \mathcal{M}(X)^n$, $n \in \mathbb{Z}_+$, we define

$$D^\Gamma_H(Q \| P) = \sup_{\gamma \in \Gamma} H(\gamma; Q, P). \qquad (2)$$

$D^\Gamma_H$ is called a divergence if $D^\Gamma_H \ge 0$ and $D^\Gamma_H(Q \| P) = 0$ if and only if $Q = P$, hence providing a notion of distance between probability measures. Variational representations of the form (2) have been widely used, including in GANs (Goodfellow et al., 2014; Nowozin et al., 2016; Arjovsky et al., 2017), divergence estimation (Nguyen et al., 2007; 2010; Ruderman et al., 2012; Birrell et al., 2021), determining independence through mutual information estimation (Belghazi et al., 2018), uncertainty quantification of stochastic processes (Chowdhary & Dupuis, 2013; Dupuis et al., 2016), bounding risk in probably approximately correct (PAC) learning (McAllester, 1999; Shawe-Taylor & Williamson, 1997; Catoni et al., 2008), parameter estimation (Broniatowski & Keziou, 2009), statistical mechanics and interacting particles (Kipnis & Landim, 1999), and large deviations (Dupuis & Ellis, 2011). It is known that formula (2) includes, through suitable choices of the functional $H(\gamma; Q, P)$ and the function space $\Gamma$, many divergences and probability metrics. Below we list several classes of examples.

(a) f-divergences. Let $f : [0, \infty) \to \mathbb{R}$ be convex and lower semi-continuous (LSC), with $f(1) = 0$ and $f$ strictly convex at $x = 1$. The f-divergence between $Q$ and $P$ is

$$D_f(Q \| P) = \sup_{\gamma \in \mathcal{M}_b(X)} \{ E_Q[\gamma] - E_P[f^*(\gamma)] \}, \qquad (3)$$

where $f^*$ denotes the Legendre transform of $f$. Some notable examples of f-divergences include the Kullback-Leibler (KL) divergence and the family of α-divergences, which are constructed, respectively, from

$$f_{\mathrm{KL}}(x) = x \log x, \qquad f_\alpha(x) = \frac{x^\alpha - 1}{\alpha(\alpha - 1)}, \quad \alpha > 0, \ \alpha \neq 1. \qquad (4)$$
The flexibility of $f$ allows one to tailor the divergence to the data source, e.g., for heavy-tailed data. However, the formula (3) becomes $D_f(Q \| P) = \infty$ when $Q$ is not absolutely continuous with respect to $P$, limiting its efficacy in comparing distributions with low-dimensional support.

(b) Γ-Integral Probability Metrics (IPMs). Given $\Gamma \subset \mathcal{M}_b(X)$, the Γ-IPM between $Q$ and $P$ is defined as

$$W^\Gamma(Q, P) = \sup_{\gamma \in \Gamma} \{ E_Q[\gamma] - E_P[\gamma] \}. \qquad (5)$$

Apart from the Wasserstein metric when $\Gamma = \mathrm{Lip}_1(X)$ (the space of 1-Lipschitz functions), examples of IPMs also include the total variation metric, the Dudley metric, and maximum mean discrepancy (MMD) (Müller, 1997; Sriperumbudur et al., 2012). With suitable choices of $\Gamma$, IPMs are able to meaningfully compare non-absolutely continuous distributions, but they can potentially fail at comparing distributions with heavy tails (Birrell et al., 2022).

(c) (f, Γ)-divergences. This class of divergences was introduced by Birrell et al. (2022), and it subsumes both f-divergences and Γ-IPMs. Given a function $f$ satisfying the same conditions as in the definition of the f-divergence and $\Gamma \subset \mathcal{M}_b(X)$, the (f, Γ)-divergence is defined as

$$D^\Gamma_f(Q \| P) = \sup_{\gamma \in \Gamma} \left\{ E_Q[\gamma] - \Lambda^P_f[\gamma] \right\}, \qquad (6)$$

where $\Lambda^P_f[\gamma] = \inf_{\nu \in \mathbb{R}} \{ \nu + E_P[f^*(\gamma - \nu)] \}$. One can verify that (6) includes the f-divergence (3) as a special case when $\Gamma = \mathcal{M}_b(X)$, and it is demonstrated in (Birrell et al., 2022) that under suitable assumptions on $\Gamma$ we have

$$0 \le D^\Gamma_f(Q \| P) \le \min\{ D_f(Q \| P), W^\Gamma(Q, P) \}, \qquad (7)$$

making $D^\Gamma_f$ suitable for comparing non-absolutely continuous distributions with heavy tails. An example of the (f, Γ)-divergence is the Lipschitz α-divergence,

$$D^L_\alpha(Q \| P) = \sup_{\gamma \in \mathrm{Lip}^L_b(X)} \{ E_Q[\gamma] - \Lambda^P_{f_\alpha}[\gamma] \}, \qquad (8)$$

where $f = f_\alpha$ as in Eq. (4), and $\Gamma = \mathrm{Lip}^L_b(X)$ is the space of bounded L-Lipschitz functions.

(d) Sinkhorn divergences. The Wasserstein metric associated with a cost function $c : X^2 \to \mathbb{R}_+$ has the variational representation $W^\Gamma_c(Q, P) = \sup_{\gamma = (\gamma_1, \gamma_2) \in \Gamma} \{ E_P[\gamma_1] + E_Q[\gamma_2] \}$, where $\Gamma = \{ (\gamma_1, \gamma_2) \in C(X)^2 : \gamma_1(x) + \gamma_2(y) \le c(x, y) \}$, and $C(X)$ is the space of continuous functions on $X$. The Sinkhorn divergence is given by

$$SD^\Gamma_{c,\epsilon}(Q, P) = W^\Gamma_{c,\epsilon}(Q, P) - \frac{W^\Gamma_{c,\epsilon}(Q, Q) + W^\Gamma_{c,\epsilon}(P, P)}{2}, \qquad (9)$$

where $W^\Gamma_{c,\epsilon}(Q, P)$ is the entropic regularization of the Wasserstein metric [cf. Eq. (33)].

We refer to Appendix A for a detailed discussion of the variational divergences introduced above. In all the aforementioned examples, the choice of the discriminator space, $\Gamma$, is a defining characteristic of the divergence. We will explain, in Section 4.1, a general framework, i.e., structure-preserving GANs, for incorporating added structural knowledge of the probability distributions or data sets into the choice of $\Gamma$, leading to enhanced performance and data efficiency in adversarial learning of structured distributions.
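To make the objectives above concrete, the following is a minimal NumPy sketch of the sample-based objectives in Eqs. (3), (5), and (6) for a fixed test function γ and $f = f_{\mathrm{KL}}$ (whose Legendre transform is $f^*(y) = e^{y-1}$); in a GAN, γ would be a neural network and each objective would be maximized over the discriminator family Γ. The helper name and the scalar search over ν are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def variational_objectives(gamma, x_Q, x_P):
    """Monte Carlo versions of the discriminator objectives in Section 3.2.

    gamma : callable mapping an array of samples to real-valued scores.
    x_Q, x_P : samples from the target Q and the model distribution P.
    Uses f = f_KL = x log x, so f*(y) = exp(y - 1).
    """
    g_Q, g_P = gamma(x_Q), gamma(x_P)

    # Gamma-IPM objective, Eq. (5): E_Q[gamma] - E_P[gamma]
    ipm = g_Q.mean() - g_P.mean()

    # f-divergence objective, Eq. (3): E_Q[gamma] - E_P[f*(gamma)]
    f_obj = g_Q.mean() - np.mean(np.exp(g_P - 1.0))

    # (f, Gamma)-objective, Eq. (6): E_Q[gamma] - inf_nu { nu + E_P[f*(gamma - nu)] }
    Lambda_P_f = minimize_scalar(lambda nu: nu + np.mean(np.exp(g_P - nu - 1.0))).fun
    f_gamma_obj = g_Q.mean() - Lambda_P_f

    return ipm, f_obj, f_gamma_obj

# Example usage with a simple bounded test function on R^2 (purely illustrative):
rng = np.random.default_rng(0)
x_Q = rng.normal(loc=1.0, size=(5000, 2))
x_P = rng.normal(loc=0.0, size=(5000, 2))
gamma = lambda x: np.tanh(x @ np.array([1.0, 1.0]))
print(variational_objectives(gamma, x_Q, x_P))
```

Replacing $f_{\mathrm{KL}}$ with $f_\alpha$ and restricting γ to a bounded Lipschitz family $\mathrm{Lip}^L_b(X)$ would give a sample estimate of the Lipschitz α-divergence objective (8) used later in the experiments.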
3.3. Group invariance and equivariance

We first introduce the structure-preserving GAN framework in the context of learning distributions with group symmetry. We emphasize that the focus of this work is not to discuss the group-invariance properties of probability measures (which can be found in, e.g., (Schindler, 2003)), but to understand how to incorporate such structural information into the generator/discriminator of GANs such that invariant probability distributions can be learned more efficiently. We first require the following background and notation.

Groups and group actions. A group is a set $\Sigma$ equipped with a binary operator, the group product, satisfying the axioms of associativity, identity, and invertibility. Given a group $\Sigma$ and a set $X$, a map $T : \Sigma \times X \to X$ is called a group action if, for all $\sigma \in \Sigma$, $T_\sigma := T(\sigma, \cdot) : X \to X$ is an automorphism on $X$, and $T_{\sigma_1} \circ T_{\sigma_2} = T_{\sigma_1 \cdot \sigma_2}$ for all $\sigma_1, \sigma_2 \in \Sigma$. In this paper, we will mainly consider the 2D rotation group $SO(2) = \{ R_\theta \in \mathbb{R}^{2 \times 2} : \theta \in \mathbb{R} \}$ and the roto-reflection group $O(2) = \{ R_{m,\theta} \in \mathbb{R}^{2 \times 2} : m \in \mathbb{Z}, \theta \in \mathbb{R} \}$, where $R_\theta$ is the 2D rotation matrix of angle $\theta$, and $R_{m,\theta}$ includes a further reflection if $m \equiv 1 \pmod 2$. The natural actions of $SO(2)$ and $O(2)$ on $\mathbb{R}^2$ are matrix multiplications, which can be lifted to actions on the space of (k-channel) planar signals $L^2(\mathbb{R}^2, \mathbb{R}^k)$, e.g., RGB images. More specifically, when $\Sigma$ is $SO(2)$ or $O(2)$, let $T_\sigma f(x) := f(\sigma^{-1} x)$, $\sigma \in \Sigma$, $f \in L^2(\mathbb{R}^2, \mathbb{R}^k)$. We will also consider the finite subgroups $C_n$ and $D_n$, respectively, of $SO(2)$ and $O(2)$, with the rotation angles $\theta$ restricted to integer multiples of $2\pi/n$.

Group equivariance and invariance. Let $T^Z$ and $T^X$, respectively, be $\Sigma$-actions on the spaces $Z$ and $X$. A map $g : Z \to X$ is called $\Sigma$-equivariant if $T^X_\sigma \circ g = g \circ T^Z_\sigma$ for all $\sigma \in \Sigma$. A map $\gamma : X \to Y$ is called $\Sigma$-invariant if $\gamma \circ T^X_\sigma = \gamma$ for all $\sigma \in \Sigma$. Invariance is thus a special case of equivariance after equipping $Y$ with the trivial action $T^Y_\sigma y \equiv y$, $\sigma \in \Sigma$. In the context of NNs, achieving equivariance/invariance via group-equivariant CNNs (G-CNNs) has been well studied, and we refer the reader to (Cohen et al., 2019; Weiler & Cesa, 2019) for a complete theory of G-CNNs. Let $G$ be a collection of measurable maps $g : Z \to X$. We denote its subset of $\Sigma$-equivariant maps as $G^{\mathrm{eqv}}_\Sigma := \{ g \in G : T^X_\sigma \circ g = g \circ T^Z_\sigma, \ \forall \sigma \in \Sigma \}$. Similarly, let $\Gamma$ be a set of measurable functions $\gamma : X \to Y$; its subset, $\Gamma^{\mathrm{inv}}_\Sigma$, of $\Sigma$-invariant functions is defined as

$$\Gamma^{\mathrm{inv}}_\Sigma := \{ \gamma \in \Gamma : \gamma \circ T^X_\sigma = \gamma, \ \forall \sigma \in \Sigma \}. \qquad (10)$$

The function space $\Gamma$ is called closed under $\Sigma$ if

$$\gamma \circ T^X_\sigma \in \Gamma, \quad \forall \sigma \in \Sigma, \ \gamma \in \Gamma. \qquad (11)$$

Finally, a probability measure $P \in \mathcal{P}(X)$ is called $\Sigma$-invariant if $P = P \circ (T^X_\sigma)^{-1}$ for all $\sigma \in \Sigma$. For instance, the distribution of medical images without orientation preference should be $SO(2)$-invariant; see Figure 1. The set of all $\Sigma$-invariant distributions on $X$ is denoted as

$$\mathcal{P}_\Sigma(X) := \{ P \in \mathcal{P}(X) : P \text{ is } \Sigma\text{-invariant} \}. \qquad (12)$$

3.4. Definition of the Haar measure on Σ and the symmetrization operators $S_\Sigma$ and $S^\#_\Sigma$

We will make frequent use of the symmetrization operators, on both functions and probability distributions, that are induced by a group action on $X$. These are constructed using the unique Haar probability measure, $\mu_\Sigma$, of a compact Hausdorff topological group $\Sigma$ (see, e.g., Chapter 11 in Folland (2013)). Intuitively, the Haar measure is the uniform probability measure on $\Sigma$. Mathematically, this is expressed via the invariance of the Haar measure under group multiplication, $\mu_\Sigma(\sigma E) = \mu_\Sigma(E \sigma) = \mu_\Sigma(E)$ for all $\sigma \in \Sigma$ and all Borel sets $E \subset \Sigma$. This is a generalization of the invariance of the Lebesgue measure under translations and rotations. The Haar measure can be used to define symmetrization operators on both functions and probability measures as follows (going forward, we assume the group action is measurable).

Symmetrization of functions: $S_\Sigma : \mathcal{M}_b(X) \to \mathcal{M}_b(X)$,

$$S_\Sigma[\gamma](x) := \int_\Sigma \gamma(T_{\sigma'}(x)) \, \mu_\Sigma(d\sigma') = E_{\sigma' \sim \mu_\Sigma}[\gamma \circ T_{\sigma'}(x)]. \qquad (13)$$

Symmetrization of probability measures (dual operator): $S^\#_\Sigma : \mathcal{P}(X) \to \mathcal{P}(X)$, defined for $\gamma \in \mathcal{M}_b(X)$ by

$$E_{S^\#_\Sigma[P]}[\gamma] := \int_X S_\Sigma[\gamma](x) \, dP(x) = E_P[S_\Sigma[\gamma]]. \qquad (14)$$

Remark 3.1. Sampling from $S^\#_\Sigma[P]$: If $x_i$, $i = 1, \dots, N$, are samples from $P$, and $\sigma_j$, $j = 1, \dots, M$, are samples from the Haar probability measure $\mu_\Sigma$ (all independent), then $T_{\sigma_j}(x_i)$ are samples from $S^\#_\Sigma[P]$. If $P$ is $\Sigma$-invariant then the use of $T_{\sigma_j}(x_i)$ can be viewed as a form of data augmentation.
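For a finite group such as $C_4$ acting on images by 90° rotations, the Haar measure is uniform over the four rotations and Remark 3.1 reduces to ordinary rotation augmentation. A minimal NumPy sketch (the helper name is ours, chosen for illustration):

```python
import numpy as np

def sample_symmetrized(images, rng=None):
    """Given samples x_i from P, draw sigma_j ~ Haar(C4), i.e. uniformly among the
    four 90-degree rotations, and return T_{sigma_j}(x_i); by Remark 3.1 these are
    samples from S#_Sigma[P].  If P is already C4-invariant this is plain rotation
    augmentation of the data set.
    """
    rng = np.random.default_rng() if rng is None else rng
    ks = rng.integers(0, 4, size=len(images))   # one independent group element per sample
    return np.stack([np.rot90(img, k, axes=(0, 1)) for img, k in zip(images, ks)])

# usage: aug = sample_symmetrized(np.random.rand(16, 28, 28))
```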
The following lemma provides several key properties of the symmetrization operators. Proofs and further details can be found in Appendix B, Lemma B.1.

Lemma 3.2. (a) The symmetrization operator $S_\Sigma : \mathcal{M}_b(X) \to \mathcal{M}_b(X)$ is a projection onto the subspace of $\Sigma$-invariant bounded measurable functions, $\mathcal{M}^{\mathrm{inv}}_{b,\Sigma}$ [cf. Eq. (10)]. (b) The symmetrization operator $S^\#_\Sigma : \mathcal{P}(X) \to \mathcal{P}(X)$ is a projection onto the subset of $\Sigma$-invariant probability measures, $\mathcal{P}_\Sigma(X)$ [cf. Eq. (12)]. (c) $S_\Sigma$ is the conditional expectation with respect to the σ-algebra $\mathcal{M}_\Sigma$ of $\Sigma$-invariant sets, $\mathcal{M}_\Sigma := \{ \text{measurable sets } B \in \mathcal{M} : T_\sigma(B) = B, \ \forall \sigma \in \Sigma \}$, i.e., $S_\Sigma[\gamma] = E_P[\gamma \mid \mathcal{M}_\Sigma]$ for all $\gamma \in \mathcal{M}_b(X)$, $P \in \mathcal{P}_\Sigma(X)$.

Lemma 3.2 implies that since $S_\Sigma$ and $S^\#_\Sigma$ are projections onto $\mathcal{M}^{\mathrm{inv}}_{b,\Sigma}$ and $\mathcal{P}_\Sigma(X)$, respectively, i.e., $S_\Sigma \circ S_\Sigma = S_\Sigma$ and $S^\#_\Sigma \circ S^\#_\Sigma = S^\#_\Sigma$, they are necessarily structure-preserving, namely here symmetry-preserving. We discuss a general concept of structure-preserving operators at the end of Section 4.1.

4. Structure-preserving GANs

We present in this section our theory for structure-preserving GANs. The results are first stated for the special case of learning group-invariant distributions. We then extend the theory to a general class of structure-preserving operators.

4.1. Invariant discriminator theorem

We demonstrate, under assumptions outlined below and for broad classes of divergences and probability metrics, that for $\Sigma$-invariant probability measures $P, Q$ we can restrict the test function space $\Gamma$ (the discriminator space in GANs) in (2) to the subset of $\Sigma$-invariant functions, $\Gamma^{\mathrm{inv}}_\Sigma$ [cf. Eq. (10)], without changing the divergence/probability metric, i.e.,

$$D^\Gamma_H(Q \| P) = D^{\Gamma^{\mathrm{inv}}_\Sigma}_H(Q \| P) \quad \text{for all } Q, P \in \mathcal{P}_\Sigma. \qquad (15)$$

The space $\Gamma^{\mathrm{inv}}_\Sigma$ is a much smaller and more efficient discriminator space to optimize over in the proposed GANs. We rigorously formulate our results in the following theorem, which first considers the (f, Γ)-divergence (6), the Γ-IPM (5), and the Sinkhorn divergence (9). The proof is found in Appendix B.

Theorem 4.1. If $S_\Sigma[\Gamma] \subset \Gamma$ and the probability measures $P, Q$ are $\Sigma$-invariant then

$$D^\Gamma(Q \| P) = D^{\Gamma^{\mathrm{inv}}_\Sigma}(Q \| P), \qquad (16)$$

where $D^\Gamma$ is an (f, Γ)-divergence or a Γ-IPM. Eq. (16) also holds for Sinkhorn divergences if the cost is $\Sigma$-invariant (i.e., $c(T_\sigma(x), T_\sigma(y)) = c(x, y)$ for all $\sigma \in \Sigma$, $x, y \in X$).

Remark 4.2. Eq. (16) can be generalized to a wider range of objective functionals satisfying appropriate convexity, continuity, and invariance conditions; see Theorem B.10. For the $\Sigma$-invariant (f, Γ)-divergences, we also obtain a refined version of (7), given by the following infimal convolution formula (for appropriate Γ and f):

$$D^{\Gamma^{\mathrm{inv}}_\Sigma}_f(Q \| P) = \inf_{\eta \in \mathcal{P}_\Sigma(X)} \{ D_f(\eta \| P) + W^{\Gamma^{\mathrm{inv}}_\Sigma}(Q, \eta) \} \qquad (17)$$

for all $Q, P \in \mathcal{P}_\Sigma(X)$. See Appendix D for details on (17) and other results generalizing those in (Birrell et al., 2022).

Theorem 4.1 suggests that the discriminator space reduction effectively acts as an unbiased regularization to prevent discriminator overfitting, a common challenge for GAN optimization in the small data regime. Using invariant discriminators can thus improve the data efficiency of the model; this will be empirically verified in Tables 1-3.
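For a finite group, the projection $S_\Sigma$ in Theorem 4.1 has a particularly simple realization: averaging a discriminator over the group orbit. The hypothetical PyTorch wrapper below (our own sketch, not the layer-level G-CNN construction used in the paper's experiments) computes $S_\Sigma[\gamma](x) = \frac{1}{|\Sigma|}\sum_{\sigma \in \Sigma}\gamma(T_\sigma(x))$ for Σ = C4 acting by 90° rotations, so the wrapped discriminator is Σ-invariant by construction; cf. Example 2 in the list that follows.

```python
import torch
import torch.nn as nn

class C4InvariantDiscriminator(nn.Module):
    """S_Sigma-projection of an arbitrary image discriminator onto the C4-invariant
    functions: D_inv(x) = (1/4) * sum_k D(rot90(x, k)).  Because the average runs over
    the whole group, D_inv(rot90(x, k0)) = D_inv(x) for every k0, i.e. D_inv lies in
    Gamma^inv_Sigma (cf. Lemma 3.2(a)).
    """
    def __init__(self, base_disc: nn.Module):
        super().__init__()
        self.base = base_disc

    def forward(self, x):            # x: (batch, channels, H, W)
        scores = [self.base(torch.rot90(x, k, dims=(-2, -1))) for k in range(4)]
        return torch.stack(scores, dim=0).mean(dim=0)
```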
Examples satisfying the key condition $S_\Sigma[\Gamma] \subset \Gamma$ of Theorem 4.1:

1. First, we consider the standard f-divergence (3) between two $\Sigma$-invariant probability measures $P$ and $Q$. The identity $S_\Sigma[\mathcal{M}_b(X)] = \mathcal{M}^{\mathrm{inv}}_{b,\Sigma}(X)$ from Lemma 3.2 implies that the function space can be restricted to the $\Sigma$-invariant bounded functions $\mathcal{M}^{\mathrm{inv}}_{b,\Sigma}(X)$, giving rise to an (f, Γ)-divergence (6) with $\Gamma = \mathcal{M}^{\mathrm{inv}}_{b,\Sigma}(X)$, i.e., $D_f(Q \| P) = D^{\mathcal{M}^{\mathrm{inv}}_{b,\Sigma}(X)}_f(Q \| P)$.

2. If the group $\Sigma$ is finite and the function space $\Gamma \subset \mathcal{M}_b(X)$ is convex and closed under $\Sigma$ in the sense of (11), then $S_\Sigma[\Gamma] \subset \Gamma$, as readily follows from the definition (13). Our implemented examples in Section 5 fall under this category.

3. The space of 1-Lipschitz functions on a metric space $(X, d)$, assuming the action is 1-Lipschitz, i.e., $d(T_\sigma(x), T_\sigma(y)) \le d(x, y)$ for all $\sigma \in \Sigma$, $x, y \in X$.

4. The unit ball in an appropriate RKHS; see Lemma C.1.

5. More generally, any $\Gamma$ that is convex and closed in the weak topology on $\Gamma$ induced by integration against finite signed measures; see Lemma C.3 for a proof.

Extension to other structure-preserving operators. Let $K_x(dx')$ be a probability kernel from $X$ to $X$ and define $S_K : \mathcal{M}_b(X) \to \mathcal{M}_b(X)$ by $S_K[f](x) := \int f(x') K_x(dx')$. $K$ also defines a dual map $S^\#_K : \mathcal{P}(X) \to \mathcal{P}(X)$, $S^\#_K[P] := \int K_x(\cdot) P(dx)$. Let $\mathcal{P}_K(X)$ be the set of K-invariant probability measures, i.e., $\mathcal{P}_K(X) = \{ P \in \mathcal{P}(X) : S^\#_K[P] = P \}$. In this setting we have the following generalization of Theorem 4.1.

Theorem 4.3. If $S_K[\Gamma] \subset \Gamma$ and $Q, P \in \mathcal{P}_K(X)$ then

$$D^\Gamma(Q \| P) = D^{S_K[\Gamma]}(Q \| P), \qquad (18)$$

where $D^\Gamma$ is an (f, Γ)-divergence or a Γ-IPM. It also holds when $D^\Gamma$ is a Sinkhorn divergence if $S_K[c(\cdot, y)] = c(\cdot, y)$ and $S_K[c(x, \cdot)] = c(x, \cdot)$ for all $x, y \in X$. In addition, if $S_K$ is a projection (i.e., $S_K \circ S_K = S_K$) then $S_K[\Gamma] = \Gamma^{\mathrm{inv}}_K$, where $\Gamma^{\mathrm{inv}}_K := \{ \gamma \in \Gamma : S_K[\gamma] = \gamma \}$.

Remark 4.4. Conditional expectations, $S_K[f] := E_P[f \mid \mathcal{A}]$, are a special case of Theorem 4.3 with the kernel being a regular conditional probability, $K = P(\cdot \mid \mathcal{A})$. Here $\Gamma^{\mathrm{inv}}_K$ is the set of $\mathcal{A}$-measurable functions in $\Gamma$, which can be significantly smaller than $\Gamma$. The case where $\mathcal{A} = \sigma(\xi)$ for some random variable $\xi$ has particular importance in coarse-graining of molecular dynamics (Noid, 2013; Pak & Voth, 2018); see Appendix E. The result for $\Sigma$-invariant measures, Theorem 4.1, is also a special case of Theorem 4.3, where the kernel is $K_x = \mu_\Sigma \circ R_x^{-1}$, $R_x(\sigma) := T_\sigma(x)$. Alternatively, Lemma 3.2(c) shows that $S_\Sigma$ can be written as a conditional expectation.

Remark 4.5. Theorem 4.3 is an instance of the data processing inequality; see Theorem 2.21 in (Birrell et al., 2022).

4.2. Equivariant generator theorem

Figure 3. The Σ-symmetrization layer (enclosed in the red rectangle), which is missing in (Dey et al., 2021), ensures generator equivariance, which is critical in preventing GAN mode collapse [cf. Remark 4.11].

Theorem 4.1 provides the theoretical justification for reducing the discriminator space $\Gamma$ to its $\Sigma$-invariant subset $\Gamma^{\mathrm{inv}}_\Sigma$ when the source $Q$ and the generated measure $P_g$ are both $\Sigma$-invariant. Our next theorem, however, shows that this practice can easily lead to mode collapse if one of the two distributions is not $\Sigma$-invariant; see Figure 4a. The proof is deferred to Appendix B.

Theorem 4.6. Let $S_\Sigma[\Gamma] \subset \Gamma$ and $P, Q \in \mathcal{P}(X)$, i.e., not necessarily $\Sigma$-invariant. We have

$$D^{\Gamma^{\mathrm{inv}}_\Sigma}(Q \| P) = D^\Gamma(S^\#_\Sigma[Q] \,\|\, S^\#_\Sigma[P]), \qquad (19)$$

where $D^\Gamma$ is an (f, Γ)-divergence or a Γ-IPM.

Remark 4.7. The analogous result for Sinkhorn divergences also holds if the cost is separately $\Sigma$-invariant in each variable, i.e., $c(T_\sigma(x), y) = c(x, y)$ and $c(x, T_\sigma(y)) = c(x, y)$ for all $\sigma \in \Sigma$, $x, y \in X$. However, this is a strong assumption that is not satisfied by most commonly used cost functions and actions.

Theorem 4.6 has the following implications. If one uses a $\Sigma$-invariant GAN (i.e., invariant discriminators and equivariant generators) to learn a non-invariant data source $Q$, then one will in fact learn the symmetrized version $S^\#_\Sigma[Q]$. On the other hand, if the data source $Q$ is $\Sigma$-invariant (i.e., $S^\#_\Sigma[Q] = Q$; cf.
Lemma 3.2) but the GAN-generated distribution $P_g$ is not, then discriminators from $\Gamma^{\mathrm{inv}}_\Sigma$ alone cannot differentiate $Q$ and $P_g$, i.e., $D^{\Gamma^{\mathrm{inv}}_\Sigma}(Q \| P_g) = 0$, as long as $Q = S^\#_\Sigma[P_g]$. This suggests that $P_g$ can easily suffer from mode collapse, as it only needs to equal $Q$ after $\Sigma$-symmetrization; we refer readers to Figure 4a (2nd and 4th rows) for a visual illustration, where a unimodal $P_g$ can be erroneously selected as the best-fitting model, even though its $\Sigma$-symmetrization $S^\#_\Sigma[P_g]$ should be the correct one. To prevent this from happening, one needs to ensure the generator produces a $\Sigma$-invariant distribution $P_g$; this is guaranteed by the following theorem.

Theorem 4.8. If $P_Z \in \mathcal{P}(Z)$ is $\Sigma$-invariant and $g : Z \to X$ is $\Sigma$-equivariant, then the push-forward measure $P_g := P_Z \circ g^{-1}$ is $\Sigma$-invariant, i.e., $P_g \in \mathcal{P}_\Sigma(X)$.

See Appendix B for a proof. We note that equivariant flow-based methods have also been proposed based on a strategy similar to Theorem 4.8. We refer readers to Section 2 for a discussion of related works.

Remark 4.9. Suppose $g = \gamma_2 \circ \gamma_1$ is a composition of two maps, $\gamma_1 : Z \to W$ and $\gamma_2 : W \to X$. Even if $\gamma_1$ is not $\Sigma$-equivariant (in fact, $Z$ does not even need to be equipped with a $\Sigma$-action $T^Z_\sigma$), as long as $P_{\gamma_1} \in \mathcal{P}(W)$ is $\Sigma$-invariant and $\gamma_2$ is $\Sigma$-equivariant, the push-forward measure $P_g \in \mathcal{P}(X)$ is still $\Sigma$-invariant.

To construct the $\Sigma$-invariant noise source required in Theorem 4.8 (or Remark 4.9), one can begin with an arbitrary noise source and use a $\Sigma$-symmetrization layer, as described by the following theorem.

Theorem 4.10. Let $W \sim \mu_\Sigma$ and let $N$ be a Z-valued random variable (i.e., an arbitrary noise source). If $N$ and $W$ are independent, then the distribution of $T^Z(W, N)$ is $\Sigma$-invariant.

Remark 4.11. Dey et al. (2021) also proposed to use G-CNNs to generate images with $C_4$/$D_4$-invariant distributions. However, the first step in their model, i.e., the "Project & Reshape" step [cf. Figure 3], uses a fully-connected layer which destroys the group symmetry in the noise source, leading to a non-invariant final distribution $P_g$ even if the subsequent layers are all $\Sigma$-equivariant. This easily leads to mode collapse [cf. Theorem 4.6], which we will empirically demonstrate in Section 5; see, e.g., Figure 4a (4th row). An easy remedy is to add a $\Sigma$-symmetrization layer: let $w$ be the output of "Project & Reshape"; the $\Sigma$-symmetrization layer draws a random $\sigma \sim \mu_\Sigma$ and transforms $w$ into $T^W_\sigma(w)$, producing a $\Sigma$-invariant distribution on the layer output (see Theorem 4.10). The final distribution $P_g$ is thus $\Sigma$-invariant if the subsequent layers are all $\Sigma$-equivariant, by Remark 4.9. See Figure 3 for a visual illustration.

5. Experiments

We present experiments on both synthetic and real-world data sets with embedded group symmetry to empirically verify our theory for structure-preserving GANs in Section 4.

5.1. Algorithmic feasibility

Theorems 4.1 and 4.8 imply that one can build invariant GANs by using $\Sigma$-invariant discriminators, $\Sigma$-equivariant generators, and a $\Sigma$-invariant noise source. Equivariant networks for arbitrary group symmetry (and gauge invariance) have been studied in recent works such as (Cohen & Welling, 2016b). Invariant noise sources can be constructed as shown in Theorem 4.10. We note that the symmetrization operators $S_\Sigma$ and $S^\#_\Sigma$ are only used in the proofs of the theoretical properties of the proposed GANs and are not needed in practical implementations. The necessary invariance/equivariance is built into the discriminator/generator via the structure of the layers; see Appendix G.4.
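As an illustration of Theorem 4.10 and Remark 4.11, the following is a minimal PyTorch sketch (class name ours, not the authors' code) of a Σ-symmetrization layer for Σ = C4: it applies an independently drawn, Haar-uniform 90° rotation to each feature map, so its output distribution is C4-invariant regardless of the preceding "Project & Reshape" layer; stacking C4-equivariant layers after it then yields a C4-invariant generator distribution by Remark 4.9.

```python
import torch
import torch.nn as nn

class C4SymmetrizationLayer(nn.Module):
    """Sigma-symmetrization layer for Sigma = C4 acting by 90-degree rotations.

    Each sample in the batch is transformed by an independent, uniformly drawn
    group element (the Haar measure on C4), so the layer output is C4-invariant
    in distribution whenever the group element is independent of the input
    (Theorem 4.10).  Insert this right after the 'Project & Reshape' step to
    restore the symmetry that the fully-connected projection destroys
    (Remark 4.11, Figure 3).
    """
    def forward(self, w):                      # w: (batch, channels, H, W)
        ks = torch.randint(0, 4, (w.shape[0],), device=w.device)
        rotated = [torch.rot90(wi, int(k), dims=(-2, -1)) for wi, k in zip(w, ks)]
        return torch.stack(rotated, dim=0)
```

A fresh group element is drawn for every generated image; it is the push-forward distribution, not any individual sample, that becomes invariant.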
5.2. Data sets and common experimental setups

Toy example. Following (Birrell et al., 2022), this synthetic data source is a mixture of four 2D t-distributions with 0.5 degrees of freedom, embedded in a plane in $\mathbb{R}^{12}$. The four centers of the t-distributions are located (in the supporting plane) at coordinates $(\pm 10, \pm 10)$, exhibiting $C_4$-symmetry [cf. Figure 4a].

RotMNIST is built by randomly rotating the original 10-class 28×28 MNIST digits (LeCun et al., 1998), resulting in an $SO(2)$-invariant distribution. We use different portions of the 60,000 training images for the experiments in Section 5.4.

ANHIR consists of pathology slides stained with 5 distinct dyes for the study of cellular compositions (Borovec et al., 2020). Following (Dey et al., 2021), we extract from the original images 28,407 foreground patches of size 64×64. The staining dye is used as the class label for conditioned image synthesis. As the images have no preferred orientation/reflection, the distribution is $O(2)$-invariant.

LYSTO contains 20,000 patches extracted from whole-slide images of breast, colon and prostate cancer stained with immunohistochemical markers (Ciompi et al., 2019). The images are classified into 3 categories based on the organ source, and we downsize the images to 64×64. Similar to ANHIR, this data set is also $O(2)$-invariant.

Common experimental setups. To verify our theory in Section 4, and to quantify and disentangle the contributions of the structure-preserving discriminator (D) and generator (G) (Theorem 4.1 and Theorem 4.6), we replace the baseline G and/or D by their group-equivariant/invariant counterparts, Eqv G and Inv D, while adjusting the number of filters according to the group size to ensure a similar number of trainable parameters. We also consider the incomplete attempt by Dey et al. (2021) at building equivariant generators ((I)Eqv G), wherein the first fully-connected layer destroys the symmetry in the noise source, resulting in a non-equivariant G even if subsequent layers are all equivariant [cf. Remark 4.11]. We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) to evaluate the quality and diversity of the GAN-generated samples after embedding them in the feature space of a pre-trained Inception-v3 network (Szegedy et al., 2016). Due to the simplicity of RotMNIST, we replace the Inception featurization with the encoding feature space of an autoencoder trained on the rotated digits. We note that, compared to classifiers, autoencoders are guaranteed to produce different features for rotated versions of the same digit; they are thus more suitable for measuring sample diversity in rotation.
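For reference, a sampler for the toy source described at the beginning of this subsection can be written in a few lines of NumPy. The embedding of the supporting plane into $\mathbb{R}^{12}$ is not specified in this excerpt; as an assumption, the sketch below uses the first two coordinates, and the function name is ours.

```python
import numpy as np

def sample_toy_source(n, ambient_dim=12, df=0.5, seed=0):
    """C4-invariant toy source of Section 5.2: an equal-weight mixture of four 2D
    t-distributions (df = 0.5, hence heavy-tailed with no mean) centered at
    (+-10, +-10), embedded in a plane inside R^ambient_dim.
    """
    rng = np.random.default_rng(seed)
    centers = np.array([[10.0, 10.0], [-10.0, 10.0], [-10.0, -10.0], [10.0, -10.0]])
    comp = rng.integers(0, 4, size=n)                 # uniform mixture component
    z = rng.standard_normal((n, 2))                   # multivariate t = normal / sqrt(chi2/df)
    u = rng.chisquare(df, size=(n, 1))
    planar = centers[comp] + z / np.sqrt(u / df)
    x = np.zeros((n, ambient_dim))
    x[:, :2] = planar                                 # assumed embedding plane
    return x
```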
Figure 4. [Panel titles: (a) 2D projection of the generated samples; (b) $D^L_2$-GANs.] This figure illustrates how our method can simultaneously handle heavy tails and low-dimensional support. Panel (a): 2D projection of the $D^L_2$-GAN generated samples onto the support plane of the source Q [cf. Section 5.3]. Each column shows the result after a given number of training epochs. The rows correspond to different settings for the generators (G) and discriminators (D); in particular, the 2nd and 4th rows use an invariant D accompanied by, respectively, a baseline G and an incorrectly constructed equivariant G, leading to mode collapse [cf. Theorem 4.6]. The blue ovals mark the 25% and 50% probability regions of the data source Q, while the heat map shows the generator samples. Panels (b) and (c): generator distribution, projected onto components orthogonal to the support plane of Q. Values concentrated around zero indicate convergence to the sub-manifold. Models are trained on 200 training points.

5.3. Toy example

We test the performance of different GANs (and their equivariant versions) based on three types of divergences, namely the Wasserstein GAN (WGAN) based on the Γ-IPM, Eq. (5), the $D_{f_\alpha}$-GAN based on the classical f-divergence, Eqs. (3) and (4), and the $D^L_\alpha$-GAN based on the (f, Γ)-divergence, Eq. (8), in learning the $C_4$-invariant mixture $Q$. We use fully-connected networks with 3 hidden layers for the baseline G and D (Vanilla G&D). The generator pushes forward a 10D Gaussian noise source, which is itself $C_4$-invariant after prescribing a proper group action, e.g., $\pi/2$-rotations in the first two dimensions. The equivariant G (Eqv G) and invariant D (Inv D) are built by replacing fully-connected layers with $C_4$-convolutional layers, based on Theorem 4.8 and the $C_4$-invariance of the noise source. We also mimic the incomplete attempt by Dey et al. (2021) at building equivariant generators ((I)Eqv G) by leaving the first fully-connected layer unchanged and replacing only the subsequent layers by $C_4$-convolutions.

Figure 4a displays the 2D projection of the generated samples learned by the $D^L_{\alpha=2}$-GAN (and its equivariant versions) on 200 training samples. It is clear that the baseline model without the structural prior (Vanilla G&D) has difficulty learning $Q$ in such a small data regime. Using an Inv D alone without an Eqv G (Vanilla G + Inv D), or with an incorrectly imposed Eqv G ((I)Eqv G + Inv D), easily leads to mode collapse, validating Theorem 4.6. On the other hand, the $D^L_\alpha$-GAN with an Eqv G (even without an Inv D) is able to learn all 4 modes of $Q$. We omit the results of (equivariant) $D_{f_\alpha}$-GANs and WGANs from Figure 4a, as both fail to learn the data source $Q$; this is unsurprising due to the lack of absolute continuity between $Q$ and $P_g$ (the former is supported on a plane, while the latter is supported on the entire 12D space) and the fact that $Q$ is heavy-tailed (its mean does not exist). This demonstrates the importance of our framework's broad applicability to a variety of variational divergences, as an improper choice of the divergence, even with the structural prior, can fail to learn the source distribution.

Figures 4b and 4c show the generated distribution projected onto components orthogonal to the support plane of $Q$. Values concentrated around zero indicate successful learning of the low-dimensional source distribution, i.e., generating high-fidelity samples. Figure 4b indicates that an Inv D in the $D^L_\alpha$-GAN helps produce a distribution with sharper support, whereas an Eqv G alone without an Inv D tends to generate relatively low-quality samples away from the supporting plane. In contrast, Figure 4c indicates that WGAN (even with the symmetry prior) fails to learn the support plane due to $Q$ being heavy-tailed. Results with different numbers of training samples and different $\alpha$'s are shown in Appendix F, and the conclusions are similar.

5.4. RotMNIST

We adopt a setup similar to Dey et al. (2021). Specifically, in the baseline G, a fully-connected layer first projects and reshapes the concatenated Gaussian noise and class embedding into a 2D feature map (see Figure 3); spectrally-normalized convolutions (Miyato et al., 2018), interspersed with pointwise nonlinearities, class-conditional batch normalizations, and upsamplings, are subsequently used to increase the spatial dimension.
We note again that replacing 2D convolutions with $C_n$-convolutions does not by itself yield an Eqv G, as the distribution after the project-and-reshape layer is no longer $C_n$-invariant. This can be fixed by adding a $C_n$-symmetrization layer after the first linear embedding; see Remark 4.11. We consider GANs with the relative average loss (RA-GANs) (Jolicoeur-Martineau, 2019) in addition to the $D^L_\alpha$-GANs for this experiment. All configurations are trained with a batch size of 64 for 20,000 generator iterations. Implementation details are available in Appendix G.

Figure 5. Randomly generated digits 2, 3 and 8 by the RA-GANs trained on RotMNIST after 20K generator iterations and using 1% (600) training data. (a): CNN G&D. (b): (I)Eqv G + Inv D, Σ = C4. (c) & (d): Eqv G + Inv D, i.e., our models with correctly constructed equivariant generators. (c): Σ = C4. (d): Σ = C8. More images are available in Appendix F.

Table 1. The median of the FIDs (lower is better), calculated every 1,000 generator updates for 20,000 iterations, averaged over three independent trials. The number of training samples used for the experiments varies from 1% (600) to 10% (6,000) of the entire training set. See Appendix F for further results.

| Architecture | 1% | 5% | 10% |
| --- | --- | --- | --- |
| CNN G&D | 295 | 357 | 348 |
| Eqv G + CNN D, Σ = C4 | 389 | 333 | 355 |
| CNN G + Inv D, Σ = C4 | 223 | 181 | 188 |
| (I)Eqv G + Inv D, Σ = C4 | 173 | 141 | 132 |
| Eqv G + Inv D, Σ = C4 | 98 | 78 | 89 |
| Eqv G + Inv D, Σ = C8 | 123 | 52 | 51 |

| Architecture | 1% | 5% | 10% |
| --- | --- | --- | --- |
| CNN G&D | 280 | 261 | 283 |
| Eqv G + CNN D, Σ = C4 | 253 | 271 | 251 |
| CNN G + Inv D, Σ = C4 | 330 | 208 | 192 |
| (I)Eqv G + Inv D, Σ = C4 | 273 | 147 | 133 |
| Eqv G + Inv D, Σ = C4 | 149 | 99 | 88 |
| Eqv G + Inv D, Σ = C8 | 122 | 55 | 57 |

Table 1 shows the median of the FIDs, calculated every 1,000 generator updates, averaged over three independent trials. It is clear that our proposed models (Eqv G + Inv D) consistently achieve significantly improved results compared to the baseline CNN G&D and the prior approach ((I)Eqv G + Inv D); the outperformance is even more pronounced when increasing the group size from Σ = C4 to C8. We note that, similar to RotMNIST, one can also use a custom autoencoder featurization for FID evaluation, and the superiority of our model (Eqv G + Inv D) is even more prominent under such a metric: for instance, on ANHIR, the median FIDs calculated through autoencoder featurization for the three comparison models are, respectively, 1221 (CNN G&D), 936 ((I)Eqv G + Inv D), and 329 (Eqv G + Inv D). See also Figure 5 for randomly generated samples by RA-GANs trained with 1% of the training data. More results are available in Appendix F.
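For completeness, the FID values reported above are the Fréchet (2-Wasserstein) distance between Gaussians fitted to the two feature sets (Heusel et al., 2017), whether the features come from Inception-v3 or from the autoencoder used for RotMNIST. A minimal sketch (our own helper, not the evaluation code used in the paper):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """FID-style distance between two (n_samples, n_features) feature arrays:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))
```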
5.5. ANHIR and LYSTO

Compared to RotMNIST, a ResNet and its $D_4$-equivariant counterpart are used instead of CNNs for G and D. All models are trained for 40,000 generator iterations with a batch size of 32. Implementation details are available in Appendix G.

Table 2. The (min, median) of the FIDs over the course of training, averaged over three independent trials on the medical images, where the plus sign "+" after the data set, e.g., ANHIR+, denotes the presence of data augmentation during training.

| Loss | Architecture | ANHIR | ANHIR+ |
| --- | --- | --- | --- |
| $D^L_2$ | CNN G&D | (313, 485) | (347, 539) |
| $D^L_2$ | (I)Eqv G + Inv D | (120, 176) | (119, 177) |
| $D^L_2$ | Eqv G + Inv D | (97, 157) | (90, 128) |

| Loss | Architecture | LYSTO | LYSTO+ |
| --- | --- | --- | --- |
| $D^L_2$ | CNN G&D | (289, 410) | (265, 376) |
| $D^L_2$ | (I)Eqv G + Inv D | (253, 343) | (244, 329) |
| $D^L_2$ | Eqv G + Inv D | (205, 259) | (192, 259) |

Table 2 displays the minimum and median of the FIDs, calculated every 2,000 generator updates, averaged over three independent trials. The plus sign "+" after the data set, e.g., ANHIR+, denotes the presence of data augmentation (random 90° rotations and reflections) during training. It is clear that augmentation usually (but not always) has a positive effect on the results as evaluated by the FID; however, our proposed model, even without data augmentation, still consistently and significantly outperforms the baseline model (CNN G&D) and the prior approach ((I)Eqv G + Inv D) (Dey et al., 2021) with augmentation. Figure 1 presents a random collection of real and generated ANHIR images, visually verifying the improved sample fidelity of our model over the baseline. More results are available in Appendix F.

5.6. Discussion of empirical findings

Consistently across all experiments, our proposed structure-preserving GAN outperforms prior approaches in generating high-fidelity and diverse samples by a significant margin, in some cases by almost an order of magnitude as measured in FID. The results also show that, compared to data augmentation (a common strategy for learning from limited data), building theoretically guided structural probabilistic priors directly into the two GAN players achieves substantially improved performance and data efficiency in adversarial learning.

Acknowledgements

The research of J.B., M.K., and L.R.-B. was partially supported by the Air Force Office of Scientific Research (AFOSR) under the grant FA9550-21-1-0354. The research of M.K. and L.R.-B. was partially supported by the National Science Foundation (NSF) under the grants DMS-2008970 and TRIPODS CISE-1934846. The research of W.Z. was partially supported by NSF under DMS-2052525 and DMS-2140982. We thank Neel Dey for sharing the pre-processed ANHIR data set. This work was performed in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative.

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223. PMLR, 2017.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 531-540, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/belghazi18a.html.

Biloš, M. and Günnemann, S. Scalable normalizing flows for permutation invariant densities. In International Conference on Machine Learning, pp. 957-967. PMLR, 2021.

Birrell, J., Dupuis, P., Katsoulakis, M. A., Rey-Bellet, L., and Wang, J. Variational representations and neural network estimation of Rényi divergences. SIAM Journal on Mathematics of Data Science, 3(4):1093-1116, 2021. doi: 10.1137/20M1368926. URL https://doi.org/10.1137/20M1368926.

Birrell, J., Dupuis, P., Katsoulakis, M. A., Pantazis, Y., and Rey-Bellet, L. (f, Γ)-divergences: Interpolating between f-divergences and integral probability metrics. Journal of Machine Learning Research, (to appear), 2022. URL https://arxiv.org/abs/2011.05953.

Borovec, J., Kybic, J., Arganda-Carreras, I., Sorokin, D. V., Bueno, G., Khvostikov, A. V., Bakas, S., Eric, I., Chang, C., Heldmann, S., et al. ANHIR: Automatic non-rigid histological image registration challenge. IEEE Transactions on Medical Imaging, 39(10):3042-3052, 2020.

Bot, R., Grad, S., and Wanka, G.
Duality in Vector Optimization. Vector Optimization. Springer Berlin Heidelberg, 2009. ISBN 9783642028861.

Boyda, D., Kanwar, G., Racanière, S., Rezende, D. J., Albergo, M. S., Cranmer, K., Hackett, D. C., and Shanahan, P. E. Sampling using SU(N) gauge equivariant flows. Physical Review D, 103(7):074504, 2021.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Broniatowski, M. and Keziou, A. Parametric estimation and tests through divergences and the duality technique. Journal of Multivariate Analysis, 100(1):16-36, 2009. ISSN 0047-259X. doi: 10.1016/j.jmva.2008.03.011. URL http://www.sciencedirect.com/science/article/pii/S0047259X08001036.

Catoni, O., Euclid, P., Library, C. U., and Press, D. U. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Lecture Notes-Monograph Series. Cornell University Library, 2008. URL https://books.google.gr/books?id=-EtrnQAACAAJ.

Chowdhary, K. and Dupuis, P. Distinguishing and integrating aleatoric and epistemic variation in uncertainty quantification. ESAIM: Mathematical Modelling and Numerical Analysis, 47(3):635-662, 2013. doi: 10.1051/m2an/2012038.

Ciompi, F., Jiao, Y., and van der Laak, J. Lymphocyte assessment hackathon (LYSTO), October 2019. URL https://doi.org/10.5281/zenodo.3513571.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990-2999. PMLR, 2016a.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2990-2999, New York, New York, USA, 20-22 Jun 2016b. PMLR. URL https://proceedings.mlr.press/v48/cohenc16.html.

Cohen, T. S., Geiger, M., and Weiler, M. A general theory of equivariant CNNs on homogeneous spaces. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/b9cfe8b6042cf759dc4c0cccb27a6737-Paper.pdf.

Cohn, D. Measure Theory. Birkhäuser Boston, 2013. ISBN 9781489903990. URL https://books.google.com/books?id=rgXyBwAAQBAJ.

Dehmamy, N., Walters, R., Liu, Y., Wang, D., and Yu, R. Automatic symmetry discovery with Lie algebra convolutional network. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 2503-2515. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/148148d62be67e0916a833931bd32b26-Paper.pdf.

Dey, N., Chen, A., and Ghafurian, S. Group equivariant generative adversarial networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=rgFNuJHHXv.

Dupuis, P. and Ellis, R. S. A Weak Convergence Approach to the Theory of Large Deviations, volume 902. John Wiley & Sons, 2011.

Dupuis, P., Katsoulakis, M. A., Pantazis, Y., and Plechac, P. Path-space information bounds for uncertainty quantification and sensitivity analysis of stochastic dynamics. SIAM/ASA Journal on Uncertainty Quantification, 4(1):80-111, 2016. doi: 10.1137/15M1025645.

Durumeric, A. E. and Voth, G. A.
Adversarial-residual-coarse-graining: Applying machine learning theory to systematic molecular coarse-graining. The Journal of Chemical Physics, 151(12):124110, 2019.

Feder, R. M., Berger, P., and Stein, G. Nonlinear 3D cosmic web simulation with heavy-tailed generative adversarial networks. Physical Review D, 102(10):103504, 2020.

Folland, G. Real Analysis: Modern Techniques and Their Applications. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts. Wiley, 2013. ISBN 9781118626399. URL https://books.google.com/books?id=wI4fAwAAQBAJ.

Garcia Satorras, V., Hoogeboom, E., Fuchs, F., Posner, I., and Welling, M. E(n) equivariant normalizing flows. Advances in Neural Information Processing Systems, 34, 2021.

Genevay, A., Cuturi, M., Peyré, G., and Bach, F. Stochastic optimization for large-scale optimal transport. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/2a27b8144ac02f67687f76782a3b5d8f-Paper.pdf.

Glaser, P., Arbel, M., and Gretton, A. KALE flow: A relaxed KL gradient flow for probabilities with disjoint support. arXiv e-prints, arXiv:2106.08929, June 2021.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6-Paper.pdf.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Jolicoeur-Martineau, A. The relativistic discriminator: a key element missing from standard GAN. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1erHoR5t7.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kipnis, C. and Landim, C. Scaling Limits of Interacting Particle Systems. Springer-Verlag, 1999.

Köhler, J., Klein, L., and Noé, F. Equivariant flows: sampling configurations for multi-body systems with symmetric energies. arXiv preprint arXiv:1910.00753, 2019.

Köhler, J., Klein, L., and Noé, F. Equivariant flows: exact likelihood generative learning for symmetric densities. In International Conference on Machine Learning, pp. 5361-5370. PMLR, 2020.

Kullback, S. and Leibler, R. A. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Li, W., Burkhart, C., Polińska, P., Harmandaris, V., and Doxastakis, M.
Backmapping coarse-grained macromolecules: An efficient and versatile machine learning approach. The Journal of Chemical Physics, 153(4):041101, 2020.

Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. Graph normalizing flows. arXiv preprint arXiv:1905.13177, 2019.

McAllester, D. A. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT '99, pp. 164-170, New York, NY, USA, 1999. Association for Computing Machinery. ISBN 1581131674. doi: 10.1145/307400.307435. URL https://doi.org/10.1145/307400.307435.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Mustafa, M., Bard, D., Bhimji, W., Lukić, Z., Al-Rfou, R., and Kratochvil, J. M. CosmoGAN: creating high-fidelity weak lensing convergence maps using generative adversarial networks. Computational Astrophysics and Cosmology, 6(1):1, December 2019. ISSN 2197-7909. doi: 10.1186/s40668-019-0029-9. URL https://comp-astrophys-cosmol.springeropen.com/articles/10.1186/s40668-019-0029-9.

Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997. doi: 10.2307/1428011.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. Nonparametric estimation of the likelihood ratio and divergence functionals. In 2007 IEEE International Symposium on Information Theory, pp. 2016-2020, 2007.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.

Noid, W. G. Perspective: Coarse-grained models for biomolecular systems. The Journal of Chemical Physics, 139(9):090901, 2013. doi: 10.1063/1.4818908. URL https://doi.org/10.1063/1.4818908.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 271-279, 2016.

Pak, A. J. and Voth, G. A. Advances in coarse-grained modeling of macromolecular complexes. Current Opinion in Structural Biology, 52:119-126, 2018. ISSN 0959-440X. doi: 10.1016/j.sbi.2018.11.005. URL https://www.sciencedirect.com/science/article/pii/S0959440X18300939.

Rezende, D. J., Racanière, S., Higgins, I., and Toth, P. Equivariant Hamiltonian flows. arXiv preprint arXiv:1909.13739, 2019.

Ruderman, A., Reid, M. D., García-García, D., and Petterson, J. Tighter variational representations of f-divergences via restriction to probability measures. In Proceedings of the 29th International Conference on Machine Learning, ICML '12, pp. 1155-1162, Madison, WI, USA, 2012. Omnipress. ISBN 9781450312851.

Rudin, W. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, 2006. ISBN 9780070619883.

Schindler, W. Measures with Symmetry Properties. Lecture Notes in Mathematics. Springer Berlin Heidelberg, 2003. ISBN 9783540362104. URL https://books.google.com/books?id=xyt8CwAAQBAJ.

Shawe-Taylor, J. and Williamson, R. C. A PAC analysis of a Bayesian estimator.
In Proceedings of the Tenth Annual Conference on Computational Learning Theory, COLT '97, pp. 2-9, New York, NY, USA, 1997. Association for Computing Machinery. ISBN 0897918916. doi: 10.1145/267460.267466. URL https://doi.org/10.1145/267460.267466.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(70):2389-2410, 2011. URL http://jmlr.org/papers/v12/sriperumbudur11a.html.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. G. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550-1599, 2012. doi: 10.1214/12-EJS722. URL https://doi.org/10.1214/12-EJS722.

Steinwart, I. and Christmann, A. Support Vector Machines. Information Science and Statistics. Springer New York, 2008. ISBN 9780387772424. URL https://books.google.com/books?id=HUnqnrpYt4IC.

Stieffenhofer, M., Bereau, T., and Wand, M. Adversarial reverse mapping of condensed-phase molecular structures: Chemical transferability. APL Materials, 9(3):031107, 2021. doi: 10.1063/5.0039102. URL https://doi.org/10.1063/5.0039102.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.

Weiler, M. and Cesa, G. General E(2)-equivariant steerable CNNs. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/45d6637b718d0f24a237069fe41b0db4-Paper.pdf.

Yi, X., Walia, E., and Babyn, P. Generative adversarial network in medical imaging: A review. Medical Image Analysis, 58:101552, 2019.

Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354-7363. PMLR, 2019.

Zhao, S., Liu, Z., Lin, J., Zhu, J.-Y., and Han, S. Differentiable augmentation for data-efficient GAN training. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 7559-7570. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/55479c55ebd1efd3ff125f1337100388-Paper.pdf.

Zhu, M., Pan, P., Chen, W., and Yang, Y. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

A. More details on variational representations of divergences and probability metrics

We provide, in this appendix, more details on the variational representations of the divergences and probability metrics discussed in Section 3.2. Recall the notation introduced in the main paper: let $(X, \mathcal{M})$ be a measurable space, $\mathcal{M}(X)$ be the space of measurable functions on $X$, and $\mathcal{M}_b(X)$ be the subspace of bounded measurable functions. We denote by $\mathcal{P}(X)$ the set of probability measures on $X$. Given an objective functional $H : \mathcal{M}(X)^n \times \mathcal{P}(X) \times \mathcal{P}(X) \to [-\infty, \infty]$ and a test function space $\Gamma \subset \mathcal{M}(X)^n$, $n \in \mathbb{Z}_+$, we define

$$D^\Gamma_H(Q \| P) = \sup_{\gamma \in \Gamma} H(\gamma; Q, P). \qquad (20)$$

$D^\Gamma_H$ is called a divergence if $D^\Gamma_H \ge 0$ and $D^\Gamma_H(Q \| P) = 0$ if and only if $Q = P$, hence providing a notion of distance between probability measures.
$D^\Gamma_H$ is further called a probability metric if it satisfies the triangle inequality (i.e., $D^\Gamma_H(Q \| P) \le D^\Gamma_H(Q \| \nu) + D^\Gamma_H(\nu \| P)$ for all $Q, P, \nu \in \mathcal{P}(X)$) and is symmetric (i.e., $D^\Gamma_H(Q \| P) = D^\Gamma_H(P \| Q)$ for all $P, Q \in \mathcal{P}(X)$). It is well known that formula (20) includes, through suitable choices of the objective functional $H(\gamma; Q, P)$ and the function space $\Gamma$, many divergences and probability metrics. Below we further elaborate on the examples discussed in Section 3.2.

(a) f-divergences. Let $f : [0, \infty) \to \mathbb{R}$ be convex and lower semi-continuous (LSC), with $f(1) = 0$ and $f$ strictly convex at $x = 1$. The f-divergence between $Q$ and $P$ can be defined via two equivalent variational representations (Birrell et al., 2022), namely

$$D_f(Q \| P) = \sup_{\gamma \in \mathcal{M}_b(X)} \{ E_Q[\gamma] - E_P[f^*(\gamma)] \} \qquad (21)$$

$$= \sup_{\gamma \in \mathcal{M}_b(X)} \{ E_Q[\gamma] - \Lambda^P_f[\gamma] \}, \qquad (22)$$

where $f^*$ in the first representation (21) denotes the Legendre transform (LT) of $f$,

$$f^*(y) = \sup_{x \in \mathbb{R}} \{ yx - f(x) \}, \quad y \in \mathbb{R}, \qquad (23)$$

and $\Lambda^P_f[\gamma]$ in the second representation (22) is defined as

$$\Lambda^P_f[\gamma] := \inf_{\nu \in \mathbb{R}} \{ \nu + E_P[f^*(\gamma - \nu)] \}, \quad \gamma \in \mathcal{M}_b(X). \qquad (24)$$

The two variational representations, Eq. (21) and Eq. (22), share the same $\Gamma = \mathcal{M}_b(X)$, and their equivalence is due to $\mathcal{M}_b(X)$ being closed under the shift map $\gamma \mapsto \gamma - \nu$ for $\nu \in \mathbb{R}$. Examples of f-divergences include the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951), the total variation distance, the $\chi^2$-divergence, the Hellinger distance, the Jensen-Shannon divergence, and the family of α-divergences (Nowozin et al., 2016). For instance, the KL divergence is constructed from

$$f_{\mathrm{KL}}(x) = x \log x, \quad x \ge 0. \qquad (25)$$

A key element in the second variational representation for $D_f$ [Eq. (22)] is the functional $\Lambda^P_f[\gamma]$, which is a generalization of the cumulant generating function from the KL-divergence case to the f-divergence case. Indeed, for the KL divergence, where $f(x) = f_{\mathrm{KL}}(x) = x \log x$, it is straightforward to show that $\Lambda^P_f$ becomes the standard cumulant generating function, $\Lambda^P_{f_{\mathrm{KL}}}[\gamma] = \log E_P[e^\gamma]$, and Eq. (22) becomes the Donsker-Varadhan variational formula; see Appendix C.2 in (Dupuis & Ellis, 2011). The flexibility of $f$ allows one to tailor the divergence to the data source, e.g., for heavy-tailed data. Moreover, the strict concavity of the objective in $\gamma$ can result in improved statistical learning, estimation, and convergence performance. However, the variational representations (21) and (22) both result in $D_f(Q \| P) = \infty$ if $Q$ is not absolutely continuous with respect to $P$, limiting their efficacy in comparing distributions with low-dimensional support.

(b) Γ-Integral Probability Metrics (IPMs). Given $\Gamma \subset \mathcal{M}_b(X)$, the Γ-IPM between $Q$ and $P$ is defined as

$$W^\Gamma(Q, P) = \sup_{\gamma \in \Gamma} \{ E_Q[\gamma] - E_P[\gamma] \}. \qquad (26)$$

We refer to (Müller, 1997; Sriperumbudur et al., 2012) for a complete theory and for conditions on $\Gamma$ ensuring that $W^\Gamma(Q, P)$ is a metric. Apart from the Wasserstein metric, obtained when $\Gamma = \mathrm{Lip}_1(X)$ is the space of 1-Lipschitz functions, examples of IPMs also include: the total variation metric, where $\Gamma$ is the unit ball in $\mathcal{M}_b(X)$; the Dudley metric, where $\Gamma$ is the unit ball in the space of bounded and Lipschitz continuous functions; and maximum mean discrepancy (MMD), where $\Gamma$ is the unit ball in an RKHS (Müller, 1997; Sriperumbudur et al., 2012). With suitable choices of $\Gamma$, IPMs are able to meaningfully compare non-absolutely continuous distributions, but they can potentially fail at comparing distributions with heavy tails (Birrell et al., 2022).

(c) (f, Γ)-divergences. This class of divergences was introduced in (Birrell et al., 2022), and it subsumes both f-divergences and Γ-IPMs.
Given a function f satisfying the same condition as in the definition of the f-divergence and Γ Mb(X), the (f, Γ)-divergence is defined as DΓ f (Q P) = sup γ Γ EQ[γ] ΛP f [γ] , (27) where ΛP f [γ] is again given by Eq. (24), implying that Eq. (6) includes as a special case the f-divergence (3) when Γ = Mb(X) and the Γ Mb(X) implies DΓ f (Q P) Df(Q P) (28) for any Γ Mb(X). It is demonstrated in (Birrell et al., 2022) that one also has DΓ f (Q P) W Γ(Q, P) . (29) Some notable examples of such Γ s can be found in (Birrell et al., 2022), for instance the 1-Lipschitz functions Lip1(X), the RKHS unit ball, Re LU neural networks, Re LU neural networks with spectral normalizations, etc. The property (29) readily implies that (f, Γ) divergences can be defined for non-absolutely continuous probability distributions. If X is further assumed to be a complete separable metric space then, under stronger assumptions on f and Γ, one has the following Infimal Convolution Formula: DΓ f (Q P) = inf η P(X) Df(η P) + W Γ(Q, η) , (30) which implies, in particular, 0 DΓ f (Q P) min{Df(Q P), W Γ(Q, P)}, i.e., Eq. (28) and Eq. (29). (d) Sinkhorn divergences. The Wasserstein (or earth-mover ) metric associated with a cost function c : X X R+ has the variational representation W Γ c (Q, P) = inf π Co(Q,P ) Eπ[c(x, y)] = sup γ=(γ1,γ2) Γ {EP [γ1] + EQ[γ2]} , (31) where Co(Q, P) is the set of all couplings of P and Q and Γ = {γ = (γ1, γ2) C(X) C(X) : γ1(x) + γ2(y) c(x, y) , x, y X}, with C(X) being the space of continuous functions on X (Cb(X) will denote the subspace of bounded continuous functions). The Sinkhorn divergence is given by SDΓ c,ϵ(Q, P) = W Γ c,ϵ(Q, P) 1 2W Γ c,ϵ(Q, Q) 1 2W Γ c,ϵ(P, P), (32) with W Γ c,ϵ(Q, P) being the entropic regularization of the Wasserstein metrics (Genevay et al., 2016), W Γ c,ϵ(Q, P) = inf π Co(Q,P ) {Eπ[c(x, y)] + ϵR(π P Q)} (33) = sup γ=(γ1,γ2) Γ EP [γ1] + EQ[γ2] ϵEP Q exp γ1 γ2 c where now Γ = Cb(X) Cb(X) and γ1 γ2(x, y) := γ1(x) + γ2(y). Structure-preserving GANs In this appendix we provide proofs of results that were stated in the main text. First we prove the properties of the symmetrization operators from Lemma 3.2. Lemma B.1. (a) The symmetrization operator SΣ : Mb(X) Mb(X) is a projection operator onto the subspace of Σ-invariant bounded measurable functions Minv b,Σ(X) := {γ Mb(X) : γ Tσ = γ for all σ Σ} , (35) in the sense that 1. SΣ[Mb(X)] = Minv b,Σ(X), 2. SΣ SΣ = SΣ. SΣ[γ Tσ] = SΣ[γ] (36) for all γ Mb(X), σ Σ. (b) The symmetrization operator SΣ : P(X) P(X) is a projection operator onto the subset of Σ-invariant probability measures PΣ(X) := {P P(X) : P T 1 σ = P for all σ Σ} , (37) in the sense that 1. SΣ[P(X)] = PΣ(X), 2. SΣ SΣ = SΣ. (c) SΣ is the conditional expectation operator with respect to the σ-algebra of Σ-invariant sets. More specifically, for all γ Mb(X), P PΣ(X) we have SΣ[γ] = EP [γ|MΣ] . (38) where MΣ is the σ-algebra of Σ-invariant sets, MΣ := {Measurable sets B X : Tσ(B) = B for all σ Σ} . (39) Proof. We will need the following invariance property of integrals with respect to Haar measure, which can be proven using the invariance of Haar measure under left and right group multiplication: Z Σ h(σ σ )dµΣ(σ ) = Z Σ h(σ σ)dµΣ(σ ) = Z Σ h(σ )dµΣ(σ ) . (40) (a) If γ Mb(X) then γ = SΣ[γ] Minv b,Σ(X) by applying (40) with h(σ) := γ Tσ(x), x X. Indeed we have γ Tσ(x) = Z γ(Tσ (Tσ(x)))dµΣ(σ ) = Z h(σ σ)µΣ(dσ ) = Z h(σ )µΣ(dσ ) = γ (x) . Furthermore any γ Minv b,Σ(X) belongs to the range of SΣ since γ Tσ = γ for all σ Σ implies that γ = SΣ[γ]. 
This also shows that SΣ SΣ = SΣ. Finally, for γ Mb(X), σ Σ, x X we can compute SΣ[γ Tσ](x) = Z γ(Tσ σ (x))µΣ(dσ ) = Z γ(T σ(x))µΣ(dσ ) = SΣ[γ](x) , where we again used the invariance property of integrals with respect to Haar measure (40). Structure-preserving GANs (b) For P P(X), γ Mb(X), and σ Σ we can use (36) to compute Z γd SΣ[P] T 1 σ = Z γ Tσd SΣ[P] = Z SΣ[γ Tσ]d P = Z SΣ[γ]d P = Z γd SΣ[P] . This holds for all γ Mb(X), hence SΣ[P] T 1 σ = SΣ[P] for all σ Σ. Therefore SΣ[P] PΣ(X). Conversely, if P PΣ(X) then EP [γ Tσ] = EP [γ] for all σ Σ and γ Mb(X) and thus, by Fubini s theorem, EP [SΣ[γ]] = EP [γ]. Hence SΣ[P] = P and so P SΣ[P]. This completes the proof that SΣ[P(X)] = PΣ(X). Combining these calculations it is also clear that SΣ SΣ = SΣ. (c) Let γ Mb(X) and P PΣ(X). From part (a) we know that SΣ[γ] Minv b,Σ(X) and from this it is straightforward to show that SΣ[γ] is MΣ-measurable. Now fix A MΣ and note that 1A Tσ = 1A for all σ Σ (where 1A denotes the indicator function for A). Using this fact together with SΣ[P] = P (see part (b)) we can compute Z SΣ[γ]1Ad P = Z Z γ Tσ 1AµΣ(dσ )d P = Z Z (γ1A) Tσ µΣ(dσ )d P = Z SΣ[γ1A]d P = Z γ1Ad SΣ[P] = Z γ1Ad P . This proves SΣ[γ] = EP [γ|MΣ] by the definition of conditional expectation. Now we prove Theorem 4.1. Theorem B.2. If SΣ[Γ] Γ and the probability measures P, Q are Σ-invariant then DΓ(Q P) = DΓinv Σ (Q P) , (41) where DΓ is an (f, Γ)-divergence or a Γ-IPM. Eq. (41) also holds for Sinkhorn divergences if the cost is Σ-invariant (i.e., c(Tσ(x), Tσ(y)) = c(x, y) for all σ Σ, x, y X). Remark B.3. Note that the classical Sinkhorn divergence is obtained when Γ = Cb(X) Cb(X) but the proof of this theorem applies to any Γ Mb(X)2 with SΣ[Γ] Γ. Proof. We first prove the Theorem for (f, Γ)-divergences. Start by using Jensen s inequality and the convexity of the Legendre transform f to obtain f (SΣ[γ](x) ν) = f Z γ(Tσ(x)) ν µΣ(dσ) Z f (γ(Tσ(x)) ν)µΣ(dσ) = SΣ[f (γ(x) ν)] for all γ Mb(X). Therefore DSΣ[Γ] f (Q P) = sup γ Γ,ν R {EQ[SΣ[γ]] ν EP [f (SΣ[γ] ν)]} sup γ Γ,ν R {EQ[SΣ[γ] ν] EP [SΣ[f (γ ν)]]} = sup γ Γ,ν R {EQ[γ] ν EP [f (γ ν)]} = DΓ f (Q P) , where in the next to last equality we use Lemma 3.2(c) together with the assumptions P, Q PΣ(X) to conclude EP [SΣ[f (γ ν)]] = EP [f (γ ν)] and EQ[SΣ[γ]] = EQ[γ]. Hence we obtain DΓ f (Q P) DSΣ[Γ] f (Q P). Furthermore, since SΣ[Γ] Γ, we have from (6) that DSΣ[Γ] f (Q P) = DΓ f (Q P). We conclude by showing that SΣ[Γ] Γ implies SΣ[Γ] = Γinv Σ . First, if γ Γinv Σ , then SΣ[γ] = γ, therefore Γinv Σ SΣ[Γ]. Conversely, since Γ Mb(X), the functions in SΣ[Γ] are Σ-invariant (see Lemma 3.2). We assumed SΣ[Γ] Γ, hence SΣ[Γ] Γinv Σ . Structure-preserving GANs The proof for Γ-IPMs is similar, but does not require Jensen s inequality due to the linearity of the objective functional in γ. Hence the hypothesis SΣ[Γ] Γ is not necessary to obtain W Γ(Q, P) = W SΣ[Γ](Q, P). Finally, we prove the result for Sinkhorn divergences. Equation (32) implies that it suffices to show W Γ c,ϵ(Q, P) = W Γinv Σ c,ϵ (Q, P): By the same reasoning used for (f, Γ)-divergences, our assumptions imply Γinv Σ = SΣ[Γ] and therefore W Γinv Σ c,ϵ (Q, P) =W SΣ[Γ] c,ϵ (Q, P) = sup (γ1,γ2) Γ EP [SΣ[γ1]] + EQ[SΣ[γ2]] ϵEP Q exp SΣ[γ1] SΣ[γ2] c = sup (γ1,γ2) Γ ESΣ[P ][γ1] + ESΣ[Q][γ2] ϵEP Q exp R γ1(Tσ(x)) + γ2(Tσ(y)) c(x, y)µΣ(dσ) Using Jensen s inequality followed by Fubini s theorem on the third term we obtain W Γinv Σ c,ϵ (Q, P) sup (γ1,γ2) Γ ESΣ[P ][γ1] + ESΣ[Q][γ2] ϵ Z EP Q exp γ1(Tσ(x)) + γ2(Tσ(y)) c(x, y) µΣ(dσ) + ϵ . 
Finally, the Σ-invariance of Q, P, and c imply SΣ[P] = P, SΣ[Q] = Q, and exp γ1(Tσ(x)) + γ2(Tσ(y)) c(x, y) exp γ1(Tσ(x)) + γ2(Tσ(y)) c(Tσ(x), Tσ(y)) = Z Z Z exp γ1(x) + γ2(y) c(x, y) Q T 1 σ (dx)P T 1 σ (dy)µΣ(dσ) = Z Z exp γ1(x) + γ2(y) c(x, y) Q(dx)P(dy) . W Γinv Σ c,ϵ (Q, P) sup (γ1,γ2) Γ EP [γ1] + EQ[γ2] ϵEP Q exp γ1 γ2 c + ϵ = W Γ c,ϵ(Q, P) . The reverse inequality follows from Γinv Σ Γ and so the proof is complete. Next we prove Theorem 4.3, a generalization of Theorem 4.1. Theorem B.4. Let Kx(dx ) be a probability kernel from X to X and define SK : Mb(X) 7 Mb(X) by SK[f](x) = R f(x )Kx(dx ). K also defines a dual map SK : P(X) P(X), SK[P] := R Kx( )P(dx). Let PK(X) be the set of K-invariant probability measures, i.e., PK(X) = {P P(X) : SK[P] = P}. If Γ Mb(X) such that SK[Γ] Γ and Q, P PK(X) then DΓ(Q P) = DSK[Γ](Q P) , (42) where DΓ is an (f, Γ)-divergence or a Γ-IPM. It also holds for the Sinkhorn divergence if SK[c( , y)] = c( , y) and SK[c(x, )] = c(x, ) for all x, y X. In addition, if SK is a projection (i.e., SK SK = SK) then SK[Γ] = Γinv K where where Γinv K := {γ Γ : SK[γ] = γ}. Proof. We prove (42) for (f, Γ)-divergences. The proof for Γ-IPMs and Sinkhorn divergences are similar. We note that for Γ-IPMs, (42) does not require the assumption SK[Γ] Γ. Structure-preserving GANs Fix Q, P PK(X) and use Jensen s inequality along with the K-invariance of Q and P to compute DSK[Γ] f (Q P) = sup γ Γ,ν R {EQ[SK[γ] ν] EP [f (SK[γ] ν)]} = sup γ Γ,ν R {EQ[SK[γ ν]] EP [f ( Z (γ(x ) ν)Kx(dx ))]} sup γ Γ,ν R {EQ[SK[γ ν]] EP [ Z f (γ(x ) ν)Kx(dx ))]} = sup γ Γ,ν R {ESK[Q][γ ν] ESK[P ][f (γ ν)]} = sup γ Γ,ν R {EQ[γ ν] EP [f (γ ν)]} = DΓ f (Q P) . Therefore DSK[Γ] f (Q P) DΓ f (Q P). Note that this computation is the same as the proof of the data processing inequality for (f, Γ)-divergences; see Theorem 2.21 in (Birrell et al., 2022). The assumption SK[Γ] Γ implies the reverse inequality, hence we conclude DSK[Γ] f (Q P) = DΓ f (Q P). Now suppose SK SK = SK. If γ = SK[γ ] SK[Γ] then SK[γ] = SK[SK[γ ]] = SK[γ ] = γ. This, together with the assumption that SK[Γ] Γ implies γ Γinv K . Conversely, if γ Γinv K then γ = SK[γ] SK[Γ] by the definition of Γinv K . This completes the proof. We now prove Theorem 4.6, which explains the potential mode collapse in GANs when restricting the test function space from Γ to Γinv Σ if at least one of the distributions Q and P is not Σ-invariant. Theorem B.5. Suppose SΣ[Γ] Γ and P, Q P(X) (i.e., not necessarily Σ-invariant). Then DΓinv Σ f (Q P) = DΓ f (SΣ[Q] SΣ[P]) , (43) W Γinv Σ (Q, P) = W Γ(SΣ[Q], SΣ[P]) . (44) Remark B.6. The analogous result for the Sinkhorn divergences also holds if the cost is separately Σ-invariant in each variable, i.e., c(Tσ(x), y) = c(x, y) and c(x, Tσ(y)) = c(x, y) for all σ Σ, x, y X. Though this is not satisfied by most commonly used cost functions and actions one can always enforce it by replacing the cost function c with the symmetrized cost cΣ(x, y) := Z Z c(Tσ(x), Tσ (y))µΣ(dσ)µΣ(dσ ) . (45) Proof. We prove only the validity of (43); the proof of (44) is similar. DΓ f (SΣ[Q] SΣ[P]) = DΓinv Σ f (SΣ[Q] SΣ[P]) = sup γ Γinv Σ ,ν R ESΣ[Q][γ ν] ESΣ[P ][f (γ ν)] = sup γ Γinv Σ ,ν R {EQ[γ ν] EP [f (γ ν)]} = DΓinv Σ f (Q P) , where the first equality is due to Theorem 4.1, and the third equality holds as γ ν and f (γ ν) are both Σ-invariant when γ Γinv Σ . Next we prove Theorem 4.8, which explains how to ensure the generator produces a Σ-invariant distribution Pg Theorem B.7. 
If PZ P(Z) is Σ-invariant and g : Z X is Σ-equivariant then the push-forward measure Pg := PZ g 1 is Σ-invariant, i.e., Pg PΣ(X). Structure-preserving GANs Proof. The proof is based on the equivalence of the following commutative diagrams: More specifically, Pg (T X σ ) 1 = PZ g 1 (T X σ ) 1 = PZ (T X σ g) 1 =PZ (g T Z σ ) 1 = PZ (T Z σ ) 1 g 1 = PZ g 1 where the third and fifth equalities are due to the equivariance and invariance, respectively, of g and PZ. Next we prove Theorem 4.10, which provides a method for constructing Σ-invariant noise sources. Theorem B.8. Let W µΣ and N be a Z-valued random variable (i.e., an arbitrary noise source). If W and N are independent then the distribution of T Z(W, N) is Σ-invariant. Proof. Let PZ denote the distribution of N. Independence of W and N implies (W, N) µΣ PZ. Therefore T Z(W, N) (µΣ PZ) (T Z) 1 := P Σ Z . We need to show that P Σ Z is Σ-invariant: For σ Σ we can compute P Σ Z (T Z σ ) 1 =(µΣ PZ) (T Z) 1 (T Z σ ) 1 (47) =(µΣ PZ) (T Z σ T Z) 1 =(µΣ PZ) (T Z (T Σ σ id)) 1 =(µΣ PZ) (T Σ σ id) 1 (T Z) 1 , where T Σ is the left-multiplication action of Σ on itself. Invariance of µΣ implies (µΣ PZ) (T Σ σ id) 1 =(µΣ (T Σ σ ) 1) PZ = µΣ PZ . (48) P Σ Z T 1 σ = (µΣ PZ) (T Z) 1 = P Σ Z . (49) This proves P Σ Z is Σ-invariant as claimed. Next we show how the proof of Theorem 4.1 can be generalized to a wider variety of objective functionals. This result will utilize a certain topology on the space of bounded measurable functions which we describe in the following definition. Definition B.9. Let V be a subspace of Mb(X)n, n Z+, and M(X) be the set of finite signed measures on X. For ν M(X)n we define τν : V R by τν(γ) := Pn i=1 R γidνi and we let T = {τν : ν M(X)n}. T is a separating vector space of linear functionals on V and we equip V with the weak topology from T (i.e., the weakest topology on V for which every τ T is continuous). This makes V a locally convex topological vector space with dual space V = T ; see Theorem 3.10 in (Rudin, 2006). In the following we will abbreviate this by saying that V has the M(X)-topology. Theorem B.10. Let V be a subspace of Mb(X)n, n Z+, that is closed under Σ in the sense of (11) and satisfies SΣ[V ] V . Given an objective functional H : V P(X) P(X) [ , ) and a test function space Γ V we define DΓ H(Q P) := sup γ Γ H(γ; Q, P) . (50) If H( ; Q, P) is concave and upper semi-continuous (USC) in the M(X)-topology on V (see Definition B.9) and H(γ Tσ; Q, P) = H(γ; Q T 1 σ , P T 1 σ ) (51) Structure-preserving GANs for all σ Σ, γ V , and Q, P P(X) then for all Σ-invariant Q, P we have DΓ H(Q P) DSΣ[Γ] H (Q P) . (52) If, in addition, SΣ[Γ] Γ then SΣ[Γ] = Γinv Σ and DΓ H(Q P) = DΓinv Σ H (Q P) . (53) Remark B.11. See Appendix C for conditions implying SΣ[Γ] Γ. Proof. Fix γ Γ and Σ-invariant Q, P. Define G := H( ; Q, P) and note that G : V ( , ] is LSC and convex. Convex conjugate duality (see the Fenchel-Moreau Theorem, e.g., Theorem 2.3.6 in Bot et al. (2009)) and Fubini s theorem then imply G(SΣ[γ]) = sup ν M(X)n{τν(SΣ[γ]) G (τν)} = sup ν M(X)n{ X Z SΣ[γi]dνi G (τν)} = sup ν M(X)n{ Z X Z γi Tσdνi G (τν)µΣ(dσ)} = sup ν M(X)n{ Z τν(γ Tσ) G (τν)µΣ(dσ)} Z G(γ Tσ)µΣ(dσ) . We can use our assumptions to compute G(γ Tσ) = H(γ Tσ; Q, P) = H(γ; Q T 1 σ , P T 1 σ ) = H(γ; Q, P) and hence we obtain H(SΣ[γ]; Q, P) H(γ; Q, P) . Taking the supremum over γ Γ gives (52). If SΣ[Γ] Γ then we clearly have the bound DSΣ[Γ] H DΓ H and hence DSΣ[Γ] H = DΓ H. 
The equality SΣ[Γ] = Γinv Σ was shown in the proof of Theorem 4.1 and so we are done. Theorem B.10 applies to many classes of divergences, some of which have not been discussed in the main text. For example: 1. Integral probability metrics and MMD (5); see (M uller, 1997; Sriperumbudur et al., 2012). 2. (f, Γ) divergences (6); concavity and USC of the objective functional follows Proposition B.8 in (Birrell et al., 2022). 3. Sinkhorn divergences (9); concavity and USC of the objective functional follows Lemma B.7 in (Birrell et al., 2022). 4. R enyi divergence for α (0, 1); see Theorem 3.1 in (Birrell et al., 2021). 5. The Kullback-Leibler Approximate Lower bound Estimator (KALE); see Definition 1 in (Glaser et al., 2021). Structure-preserving GANs C. Conditions Ensuring SΣ[Γ] Γ In this appendix we provide conditions under which the test function space Γ is closed under symmetrization, that being a key assumption in our main results in Section 4. First we show that SΣ[Γ] Γ when Γ is the unit ball in an appropriate RKHS. Lemma C.1. Let V Mb(X) be a separable RKHS with reproducing-kernel k : X X R. Let Γ = {γ V : γ V 1} be the unit ball in V . Suppose we have a measurable group action T : Σ X X and k is Σ-invariant under this action (i.e., k(Tσ(x), Tσ(y)) = k(x, y) for all σ Σ, x, y X). Then SΣ[Γ] Γ. Remark C.2. The proof will use many standard properties of a RKHS. In particular, recall that the assumption X Mb(X) implies k is bounded and jointly measurable. See Chapter 4 in (Steinwart & Christmann, 2008) for this and further background. See (Sriperumbudur et al., 2011) and references therein for more discussion of characteristic kernels as well as the related topic of universal kernels. Proof. The Σ-invariance of k implies k(Tσ(x), y) = k(Tσ(x), Tσ(Tσ 1(y))) = k(x, Tσ 1(y)) (54) k( , Tσ(x)), k( , Tσ(y)) V = k(Tσ(x), Tσ(y)) = k(x, y) = k( , x), k( , y) V (55) for all σ Σ and x, y X. Next we will show that the map Uσ : γ 7 γ Tσ is an isometry on V for all σ Σ, γ V : It is clearly a linear map. To show its range is contained in V , first recall that the span of {k( , x)}x X is dense in V . Therefore, given γ V there is a sequence γn γ having the form i=1 an,ik( , xn,i) for some an,i R, xn,i X. Equation (54) implies i=1 an,ik(Tσ( ), xn,i) = i=1 an,ik( , Tσ 1(xn,i)) . Combining Eq. (56) with Eq. (55) we can conclude that γn Tσ V = γn V and γn Tσ γm Tσ V = γn γm V . γn converges in V , hence is Cauchy, therefore γn Tσ is Cauchy as well. We have assumed V is complete, therefore γn Tσ γ for some γ V . V is a RKHS, hence the evaluation maps are continuous and we find γ(x) = limn γn(Tσ(x)) = γ(Tσ(x)) for all x. Therefore γ Tσ = γ V and γ Tσ V = lim n γn Tσ V = lim n γn V = γ V . This proves Uσ is an isometry on V . Now fix γ Γ. We will show that the map σ Uσ[γ] is Bochner integrable (see, e.g., Appendix E in Cohn (2013)): It clearly has has separable range since V was assumed to be separable. By the same reasoning as above, given γ V we have a sequence γn γ where i=1 an,ik( , xn,i) . γ, Uσ[γ] V = lim n i=1 an,i k( , xn,i), Uσ[γ] V = lim n i=1 an,i, Uσ[γ](xn,i) i=1 an,i, γ(Tσ(xn,i)) , Structure-preserving GANs which is now clearly measurable in σ due to the measurability of the action. Therefore σ 7 Uσ[γ] is strongly measurable. Uσ[γ] V = γ V 1, therefore the Bochner integral R Uσ[γ]µΣ(dσ) exists in V and satisfies Z Uσ[γ]µΣ(dσ) V Z Uσ[γ] V µΣ(dσ) 1 . This proves R Uσ[γ]µΣ(dσ) Γ. Finally, V is a RKHS and so the evaluation maps are in V . 
Therefore evaluation commutes with the Bochner integral and we find ( Z Uσ[γ]µΣ(dσ))(x) = Z Uσ[γ](x)µΣ(dσ) = Z γ(Tσ(x))µΣ(dσ) = SΣ[γ](x) . Hence we can conclude SΣ[γ] Γ for all γ Γ as claimed. The next result provides a general framework for proving SΣ[Γ] Γ. Lemma C.3. Let V Mb(X)n, n Z+, be a subspace equipped with the M(X)-topology (see Definition B.9) and Γ V . If Γ is convex and closed, the group action T : Σ X X is measurable, SΣ[V ] V , and Γ is closed under Σ (i.e., γ Tσ Γ for all γ Γ, σ Σ) then SΣ[Γ] Γ. Proof. Suppose we have γ Γ with SΣ[γ] Γ. As noted in Definition B.9, V is a locally convex topological vector space with V = {τν : ν M(X)n}, τν(γ) := Pn i=1 R γidνi. The separating hyperplane theorem (see Theorem 3.4(b) in Rudin (2006)) applied to A = {SΣ[γ]} and B = Γ therefore implies the existence of ν M(X)n such that τν( γ) > τν(SΣ[γ]) (56) for all γ Γ. We have assumed Γ is closed under Σ and so we can let γ = γ Tσ to get Z SΣ[γi]dνi > 0 (57) for all σ Σ. Integrating with respect to µΣ(dσ) and using Fubini s theorem to change the order of integration we obtain a contradiction. Therefore SΣ[γ] Γ as claimed. We end this section with several examples of function spaces, V , that are useful in conjunction with Lemma C.3: 1. V = Mb(X)n, n Z+, in which case SΣ[V ] V follows from measurability of the action. 2. X is a metric space, the action T : Σ X X is continuous, and V = Cb(X)n, n Z+. In this case, SΣ[V ] V follows from the dominated convergence theorem. 3. X is a metric space, the action T : Σ X X is continuous, Tσ is 1-Lipschitz for all σ Σ, and V = Lip1 b(X)n, n Z+. In this case, SΣ[V ] V follows from the following calculation: |SΣ[γ](x) SΣ[γ](y)| Z |γ(Tσ(x)) γ(Tσ(y))|µΣ(dσ) Z d(Tσ(x), Tσ(y))µΣ(dσ) Z d(x, y)µΣ(dσ) = d(x, y) for all γ Lip1 b(X). D. Additional Properties of Σ-Invariant (f, Γ)-Divergences In this appendix we derive further properties of (f, Γ)-divergences between Σ-invariant distributions. Here we will assume that X is a complete separable metric space (with metric d). Our analysis will require the following notion of a determining set of functions. Structure-preserving GANs Definition D.1. Given Q P(X), a subset Ψ Mb(X) will be called Q-determining if for all Q, P Q, EQ[ψ] = EP [ψ] for all ψ Ψ implies Q = P. We will also need f and Γ to satisfy one of the following admissibility criteria, as introduced in (Birrell et al., 2022). Definition D.2. For a, b with a < 1 < b we define F1(a, b) to be the set of convex functions f : (a, b) R with f(1) = 0. For f F1(a, b), if b is finite we extend the definition of f by f(b) := limx b f(x). Similarly, if a is finite we define f(a) := limx a f(x) (convexity implies these limits exist in ( , ]). Finally, extend f to x [a, b] by f(x) = . The resulting function f : R ( , ] is convex and LSC. We will call f F1(a, b) admissible if {f < } = R and limy f (y) < (note that this limit always exists by convexity). If f is also strictly convex at 1 then we will call f strictly admissible. We will call Γ Cb(X) admissible if 0 Γ, Γ is convex, and Γ is closed in the M(X)-topology on Cb(X) (see Definition B.9). Γ will be called strictly admissible if it also satisfies the following property: There exists a P(X)-determining set Ψ Cb(X) such that for all ψ Ψ there exists c R, ϵ > 0 such that c ϵψ Γ. Finally, an admissible Γ Cinv b,Σ(X) (the set of Σ-invariant bounded continuous functions) will be called Σ strictly admissible if there exists a PΣ(X)-determining set Ψ Cb(X) such that for all ψ Ψ there exists c R, ϵ > 0 such that c ϵψ Γ. 
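To make Definition D.2 concrete, here is a short worked example of ours (not taken from the text), reading the two admissibility conditions on f as conditions on its Legendre transform f*: the KL choice f_KL(x) = x log x from (25) is strictly admissible.

\begin{align*}
f_{\mathrm{KL}} &\in \mathcal{F}_1(0,\infty), \qquad f_{\mathrm{KL}}(1) = 0, \qquad f_{\mathrm{KL}}''(x) = 1/x > 0 \quad \text{(strict convexity at } x = 1\text{)}, \\
f_{\mathrm{KL}}^*(y) &= \sup_{x > 0}\,\{xy - x\log x\} = e^{y-1} \quad \text{(the supremum is attained at } x = e^{y-1}\text{)}, \\
&\text{hence } \{f_{\mathrm{KL}}^* < \infty\} = \mathbb{R} \quad \text{and} \quad \lim_{y \to -\infty} f_{\mathrm{KL}}^*(y) = 0 < \infty .
\end{align*}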
One way to construct a Σ-strictly admissible set is to start with an appropriate strictly admissible set and then restrict to the subset of Σ-invariant functions; see Appendix D.1 for a proof. Lemma D.3. Let Γ Cb(X). 1. If Γ is admissible then Γinv Σ is admissible. 2. If Γ is strictly admissible and SΣ[Γ] Γ then Γinv Σ is Σ-strictly admissible. Below are several useful examples of strictly admissible Γ that satisfy SΣ[Γ] Γ. 1. Γ := Cb(X), if the action is continuous in x, i.e., if Tσ : X X is continuous for all σ Σ. 2. Γ := {g Cb(X) : |g| C} for any C > 0 and assuming the action is continuous in x, 3. Γ := Lip L b (X) for any L > 0 and assuming the action is 1-Lipschitz, i.e., d(Tσ(x), Tσ(y)) d(x, y) for all σ Σ, x, y X. 4. Γ := {g Lip L b (X) : |g| C} for any C, L > 0 and assuming the action is 1-Lipschitz. 5. The unit ball in an appropriate RKHS V , Γ := {g V : g V 1}, assuming the kernel is Σ-invariant; see Lemma D.6 for details. The following result extends the infimal convolution formula and divergence properties from (Birrell et al., 2022) to the case where the models and test-function space are Σ-invariant. Theorem D.4. Suppose f and Γ are admissible and Γ Cinv b,Σ(X). For Q, P PΣ(X) we have the following properties: 1. Infimal Convolution Formula on PΣ(X): DΓ f (Q P) = inf η PΣ(X){Df(η P) + W Γ(Q, η)} . (58) 2. Existence of an Optimizer: If DΓ f (Q P) < then there exists η PΣ(X) such that DΓ f (Q P) = Df(η P) + W Γ(Q, η ) . (59) If f is strictly convex then there is a unique such η . 3. PΣ(X)-Divergence Property for W Γ: W Γ(Q, P) 0 and W Γ(Q, P) = 0 if Q = P. If Γ is Σ-strictly admissible then W Γ(Q, P) = 0 implies Q = P. Structure-preserving GANs 4. PΣ(X)-Divergence Property for DΓ f : DΓ f (Q P) 0 and DΓ f (Q P) = 0 if Q = P. If f is strictly admissible and Γ is Σ-strictly admissible then DΓ f (Q P) = 0 implies Q = P. Proof. 1. Part 1 of Theorem 2.15 from (Birrell et al., 2022) implies an infimal convolution formula on P(X), hence DΓ f (Q P) = inf η P(X){Df(η P) + W Γ(Q, η)} inf η PΣ(X){Df(η P) + W Γ(Q, η)} . (60) To prove the reverse inequality, we use the bound Df DSΣ[Mb(X)] f , the equality SΣ[Γ] = Γ, and then Theorem B.5 to compute DΓ f (Q P) inf η P(X){DSΣ[Mb(X)] f (η P) + W SΣ[Γ](Q, η)} (61) = inf η P(X){Df(SΣ[η] P) + W Γ(Q, SΣ[η])} = inf η PΣ(X){Df(η P) + W Γ(Q, η)} . This proves the infimal convolution formula on PΣ(X). 2. Now suppose DΓ f (Q P) < . Part 2 of Theorem 2.15 from (Birrell et al., 2022) implies there exists η P(X) such that DΓ f (Q P) = Df(η P) + W Γ(Q, η ) . (62) We need to show that η can be taken to be Σ-invariant. To do this, first use the infimal convolution formula to bound DΓ f (Q P) Df(SΣ[η ] P) + W Γ(Q, SΣ[η ]) . (63) The Σ-invariance of Q and P together with Theorem B.5 imply W Γ(Q, SΣ[η ]) = W Γ(Q, η ) . (64) Df(SΣ[η ] P) = D Minv b,Σ(X) f (η P) Df(η P) . (65) DΓ f (Q P) Df(SΣ[η ] P) + W Γ(Q, SΣ[η ]) Df(η P) + W Γ(Q, η ) = DΓ f (Q P) . (66) DΓ f (Q P) = Df(SΣ[η ] P) + W Γ(Q, SΣ[η ]) (67) with SΣ[η ] PΣ(X) as claimed. If f is strictly convex then uniqueness is a corollary of the corresponding uniqueness result from Part 2 of Theorem 2.15 in (Birrell et al., 2022). 3. Admissibility of Γ implies 0 Γ, hence W Γ(Q P) EQ[0] EP [0] = 0. If Q = P then the definition clearly implies W Γ(Q, P) = 0. If Γ is Σ-strictly admissible and W Γ(Q, P) = 0 then 0 EQ[g] EP [g] for all g Γ. Letting g = c ϵψ as in the definition of Σ-strict admissiblity we see that 0 (EQ[ψ] EP [ψ]). Hence EQ[ψ] = EP [ψ] for all ψ Ψ. 
Ψ is a PΣ(X)-determining set and Q, P PΣ(X), hence we can conclude that Q = P. 4. We know that Df 0 and W Γ 0, therefore the infimal convolution formula implies DΓ f 0. If Q = P we can bound 0 DΓ f (Q P) Df(Q P) = 0 , (68) Structure-preserving GANs hence DΓ f (Q P) = 0. Finally, suppose f is strictly admissible, Γ is Σ-strictly admissible, and DΓ f (Q P) = 0. Then Part 2 of this theorem implies 0 = DΓ f (Q P) = Df(η P) + W Γ(Q, η ) (69) for some η PΣ(X). Both terms are non-negative, hence Df(η P) = W Γ(Q, η ) = 0 . (70) The PΣ(X)-divergence property for W Γ then implies Q = η . f being strictly admissible implies that Df has the divergence property, hence η = P. Therefore Q = P as claimed. D.1. Admissibility Lemmas In this appendix we prove several lemmas regarding admissible test function spaces. First we prove the admissibility properties of Γinv Σ from Lemma D.3. Lemma D.5. Let Γ Cb(X). 1. If Γ is admissible then Γinv Σ is admissible. 2. If Γ is strictly admissible and SΣ[Γ] Γ then Γinv Σ is Σ-strictly admissible. Proof. 1. The zero function is Σ-invariant, hence is in Γinv Σ . If γ1, γ2 Γinv Σ and t [0, 1] then convexity of Γ implies tγ1 + (1 t)γ2 Γ. We have (tγ1 + (1 t)γ2) Tσ = tγ1 Tσ + (1 t)γ2 Tσ = tγ1 + (1 t)γ2, hence we conclude that Γinv Σ is convex. Finally, we can write Γinv Σ =Γ \ σ Σ,x X {γ Cb(X) : γ(Tσ(x)) = γ(x)} σ Σ,x X {γ Cb(X) : τδTσ(x)[γ] = τδx[γ]} . We have assumed Γ is admissible, hence it is closed. The maps τν, ν M(X) are continuous on Cb(X), hence the sets {γ Cb(X) : τδTσ(x)[γ] = τδx[γ]} are also closed. Therefore Γinv Σ is closed. This proves Γinv Σ is admissible. 2. Now suppose Γ is strictly admissible and SΣ[Γ] Γ. In particular, Γ is admissible and so Part 1 implies Γinv Σ is admissible. Let Ψ be as in the definition of strict admissibility. For every ψ Ψ there exists c R, ϵ > 0 such that c ϵψ Γ. Hence c ϵSΣ[ψ] = SΣ[c ϵψ] SΣ[Γ] = Γinv Σ (see the proof of Theorem 4.1) and SΣ[Ψ] Cb(X). Finally, suppose Q, P PΣ(X) such that EQ[SΣ[ψ]] = EP [SΣ[ψ]] for all ψ Ψ. Part (b) of Lemma 3.2 then implies EQ[ψ] = EP [ψ] for all ψ Ψ. Ψ is P(X)-determining, hence Q = P. Therefore SΣ[Ψ] is a PΣ(X)-determining set and we conclude that Γinv Σ is Σ-strictly admissible. Next we provide assumptions under which the unit ball in a RKHS is closed under SΣ and is (strictly) admissible. Lemma D.6. Let V Cb(X) be a separable RKHS with reproducing-kernel k : X X R. Let Γ = {γ V : γ V 1} be the unit ball in V . Then: 1. Γ is admissible. 2. If the kernel is characteristic (i.e., the map P P(X) 7 R k( , x)P(dx) V is one-to-one) then Γ is strictly admissible. 3. If k is Σ-invariant the SΣ[Γ] Γ. Proof. 1. Admissibility was shown in Lemma C.9 in (Birrell et al., 2022). Structure-preserving GANs 2. Now suppose the kernel is characteristic. Let P, Q P(X) with R γd P = R γd Q for all γ Γ (and hence for all γ V ). Therefore 0 = Z γd Q Z γd P = γ, Z k( , x)Q(dx) Z k( , x)P(dx) V (71) for all γ V . Therefore R k( , x)Q(dx) = R k( , x)P(dx). We have assumed the kernel is characteristic, hence we conclude that Q = P. This proves Γ is P(X)-determining. We also have Γ Γ, hence Γ is strictly admissible. 3. This was shown in Lemma C.1 above. E. Coarse-graining and structure-preserving operators We show in this section how to apply our structure preserving formalism, Theorem 4.3 in particular, in the context of coarse-graining. We refer to the reviews (Noid, 2013; Pak & Voth, 2018) for fundamental concepts in the coarse-graining of molecular systems. 
Mathematically, a coarse-graining of the state space $X$ is given by a measurable (non-invertible) map $\xi : X \to Y$, where $y = \xi(x)$ are thought of as the coarse variables and $Y$ is a space of significantly lower complexity than $X$. If $\mathcal{A} = \sigma(\xi)$ is the $\sigma$-algebra generated by the coarse-graining map $\xi$, then a function is measurable with respect to $\mathcal{A}$ if it is constant on every level set $\xi^{-1}(y)$. To complete the description of the coarse-graining one selects a kernel $K_y(dx)$, which in the coarse-graining literature is called the back-mapping. The kernel $K_y(dx)$ describes the conditional distribution of the fully resolved state $x \in \xi^{-1}(y)$, conditioned on the coarse-grained state $y = \xi(x)$, namely $K_y(dx) = P(dx\,|\,y)$; in particular, $K_y(dx)$ is supported on the set $\xi^{-1}(y)$. The kernel naturally induces a projection $S_K : \mathcal{M}_b(X) \to \mathcal{M}_b(X)$ given by
$$S_K[f](x) = \int_{\xi^{-1}(y)} f(x')\, K_y(dx') \quad \text{for any } x \in \xi^{-1}(y),$$
and, by construction, $S_K[f]$ is $\mathcal{A}$-measurable.

If a measure is $S_K$-invariant, i.e., $S_K[P] = P$, then it is uniquely determined by its values on $\mathcal{A}$; in other words, it is completely specified by a probability measure $Q \in \mathcal{P}(Y)$ on the coarse variable $y = \xi(x)$. We refer to such a $Q$ as a coarse-grained probability measure. Once a coarse-grained measure is constructed on $Y$ (see (Noid, 2013; Pak & Voth, 2018) for a rich array of such methods), it can then be reconstructed as a measure on $X$ via the kernel $K_y(dx)$ as $P(dx) = K_y(dx)Q(dy)$. For example, if we take $X$ and $Y$ to be discrete sets, we can choose the trivial (uniform) reconstruction kernel with density
$$k_y(x) = \delta_x(\xi^{-1}(y))\, \frac{1}{|\xi^{-1}(y)|},$$
and any coarse-grained measure with density $q(y)$ on the coarse variables $y$ is reconstructed as a probability density on $X$:
$$p(x) = \delta_x(\xi^{-1}(y))\, \frac{1}{|\xi^{-1}(y)|}\, q(y), \quad \text{where } y = \xi(x), \ x \in X.$$

Finally, we note that the back-mappings $K_y(dx) = P(dx\,|\,y)$, being probabilities conditioned on the coarse variables, can themselves be constructed to great accuracy as generative models using conditional GANs; see (Li et al., 2020; Stieffenhofer et al., 2021).
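As a toy illustration of these definitions (our example, with an artificial discrete state space and the uniform back-mapping just described), the sketch below computes S_K[f] and the reconstructed density p from a coarse-grained density q.

# Toy illustration (ours, not from the paper) of a discrete coarse-graining
# with the trivial (uniform) back-mapping described above.
import numpy as np

X = np.arange(6)          # fully resolved states
xi = X // 2               # coarse-graining map xi : X -> Y = {0, 1, 2}; level sets have size 2
Y = np.unique(xi)
level_set_size = np.array([np.sum(xi == y) for y in Y])

def S_K(f):
    # S_K[f](x) = average of f over the level set xi^{-1}(xi(x))   (uniform kernel K_y)
    return np.array([f[xi == xi[x]].mean() for x in X])

f = np.array([1.0, 3.0, 2.0, 4.0, 0.0, 4.0])
print(S_K(f))             # [2. 2. 3. 3. 2. 2.]: constant on each level set, i.e. sigma(xi)-measurable

# Reconstruction of a coarse-grained density q on Y as a density p on X:
# p(x) = q(xi(x)) / |xi^{-1}(xi(x))|
q = np.array([0.5, 0.3, 0.2])
p = q[xi] / level_set_size[xi]
print(p, p.sum())         # [0.25 0.25 0.15 0.15 0.1 0.1], sums to 1 and is S_K-invariant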
F. Additional Experiments

Figure 6. 2D projection of the DL 2 -GAN generated samples onto the support plane of the source distribution Q [cf. Section 5.3]. Each column shows the result after a given number of training epochs. The rows correspond to different settings for the generators and discriminators. The solid and dashed blue ovals mark the 25% and 50% probability regions, respectively, of the data source Q, while the heat-map shows the generator samples. Panel (a): models are trained with 50 training samples. Panel (b): models are trained with 5000 training samples.

Figure 7. 2D projection of the DL α-GAN generated samples onto the support plane of the source distribution Q [cf. Section 5.3]. Each column shows the result after a given number of training epochs. The rows correspond to different settings for the generators and discriminators. The solid and dashed blue ovals mark the 25% and 50% probability regions, respectively, of the data source Q, while the heat-map shows the generator samples. Models are trained on 200 training points. Panel (a): α = 5. Panel (b): α = 10.

Figure 8. 2D projection of the DL 2 -GAN generated samples (3000 for each setting) onto the support plane of the source distribution Q [cf. Section 5.3]. Each GAN is trained for 10000 epochs. The rows correspond to the number of training points N = 50, 200, or 5000. The columns correspond to different settings for the generators and discriminators. The solid and dashed blue ovals mark the 25% and 50% probability regions, respectively, of the data source Q. Compared to Figure 6, the heat maps are suppressed in this figure for easier examination of the sample quality.

Figure 9. Randomly generated digits by the DL 2 -GANs trained on Rot MNIST after 20K generator iterations with 1% (600) training data. Panels: (a) CNN G&D; (b) Eqv G + CNN D, Σ = C4; (c) CNN G + Inv D, Σ = C4; (d) (I)Eqv G + Inv D, Σ = C4; (e) Eqv G + Inv D, Σ = C4; (f) Eqv G + Inv D, Σ = C8.

Figure 10. Randomly generated digits by the RA-GANs trained on Rot MNIST after 20K generator iterations with 1% (600) training data. Panels: (a) CNN G&D; (b) Eqv G + CNN D, Σ = C4; (c) CNN G + Inv D, Σ = C4; (d) (I)Eqv G + Inv D, Σ = C4; (e) Eqv G + Inv D, Σ = C4; (f) Eqv G + Inv D, Σ = C8.

Figure 11. Randomly generated digits by the DL 2 -GANs trained on Rot MNIST after 20K generator iterations with 0.33% (200) training data. Our model Eqv G + Inv D, Σ = C8, is the only one that can generate high-fidelity images in this setting. We note that the repetitively generated digits are inevitable in such a small data regime, as the models are forced to learn the empirical distribution of the limited training data (20 images per class). Panels: (a) CNN G&D; (b) Eqv G + CNN D, Σ = C4; (c) CNN G + Inv D, Σ = C4; (d) (I)Eqv G + Inv D, Σ = C4; (e) Eqv G + Inv D, Σ = C4; (f) Eqv G + Inv D, Σ = C8.

Figure 12. Randomly generated digits by the RA-GANs trained on Rot MNIST after 20K generator iterations with 0.33% (200) training data. Our model Eqv G + Inv D, Σ = C8, is the only one that can generate high-fidelity images in this setting. We note that the repetitively generated digits are inevitable in such a small data regime, as the models are forced to learn the empirical distribution of the limited training data (20 images per class). Panels: (a) CNN G&D; (b) Eqv G + CNN D, Σ = C4; (c) CNN G + Inv D, Σ = C4; (d) (I)Eqv G + Inv D, Σ = C4; (e) Eqv G + Inv D, Σ = C4; (f) Eqv G + Inv D, Σ = C8.

Table 3. The median of the FIDs (lower is better), calculated every 1,000 generator updates for 20,000 iterations, averaged over three independent trials. The number of training samples used for the experiments varies from 0.33% (200) to 100% (60,000) of the entire training set.

Loss  Architecture                0.33%    1%     5%    10%    25%    50%   100%
      CNN G&D                       431   295    357    348    407    403    392
      Eqv G + CNN D, Σ = C4         865   389    333    355    325    380    393
      CNN G + Inv D, Σ = C4         382   223    181    188    185    177    176
      (I)Eqv G + Inv D, Σ = C4      360   173    141    132    124    135    130
      Eqv G + Inv D, Σ = C4         190    98     78     89     80     84     82
      Eqv G + Inv D, Σ = C8         313   123     52     51     59     52     57

      CNN G&D                       423   280    261    283    290    297    293
      Eqv G + CNN D, Σ = C4         409   253    271    251    263    274    275
      CNN G + Inv D, Σ = C4         511   330    208    192    190    183    173
      (I)Eqv G + Inv D, Σ = C4      484   273    147    133    141    124    126
      Eqv G + Inv D, Σ = C4         352   149     99     88     80     80     81
      Eqv G + Inv D, Σ = C8         293   122     55     57     53     53     51

(a) ANHIR, RA-GAN
(b) ANHIR, DL 2 -GAN
(c) LYSTO, RA-GAN
(d) LYSTO, DL 2 -GAN

Figure 13. The curves of the Fréchet Inception Distance (FID), calculated after every 2,000 generator updates up to 40,000 iterations, averaged over three random trials on the medical data sets ANHIR (top row) and LYSTO (bottom row). The legend entries are CNN G & D, (I)Eqv G + Inv D, and Eqv G + Inv D, each with and without data augmentation; the suffix aug. denotes the presence of data augmentation during GAN training.

Figure 14. Real and GAN generated ANHIR images dyed with different stains. Left panel: real images. Middle and right panels: randomly selected DL 2 -GAN generated samples after 40,000 generator iterations. Middle panel: CNN G&D. Right panel: Eqv G + Inv D.

Figure 15. Real and GAN generated LYSTO images of breast, colon, and prostate cancer. Left panel: real images. Middle and right panels: randomly selected DL 2 -GAN generated samples after 40,000 generator iterations. Middle panel: CNN G&D. Right panel: Eqv G + Inv D.

Table 4. The (min, median) of the FIDs over the course of training, averaged over three independent trials on the medical images, where the plus sign + after the data set, e.g., ANHIR+, denotes the presence of data augmentation during training.

Loss   Architecture        ANHIR         ANHIR+
RA     CNN G&D             (186, 523)    (184, 503)
RA     (I)Eqv G + Inv D    (100, 142)    (88, 140)
RA     Eqv G + Inv D       (78, 125)     (84, 118)
DL 2   CNN G&D             (313, 485)    (347, 539)
DL 2   (I)Eqv G + Inv D    (120, 176)    (119, 177)
DL 2   Eqv G + Inv D       (97, 157)     (90, 128)

Loss   Architecture        LYSTO         LYSTO+
RA     CNN G&D             (281, 340)    (250, 312)
RA     (I)Eqv G + Inv D    (218, 272)    (212, 271)
RA     Eqv G + Inv D       (175, 238)    (181, 227)
DL 2   CNN G&D             (289, 410)    (265, 376)
DL 2   (I)Eqv G + Inv D    (253, 343)    (244, 329)
DL 2   Eqv G + Inv D       (205, 259)    (192, 259)

G. Implementation Details

G.1. Common experimental setup

All models are trained using the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.0 and β2 = 0.9 (Zhang et al., 2019). Discriminators are updated twice after each generator update. An exponential moving average of the generator weights across iterations, with α = 0.9999, is used when sampling images (Brock et al., 2018).

G.2. Rot MNIST

For the RA-GAN, training is stabilized by regularizing the discriminator γ ∈ Γ with a zero-centered gradient penalty (GP) on the real distribution Q of the form
$$R_1 = \frac{\lambda_1}{2}\, \mathbb{E}_{x \sim Q}\big[\|\nabla_x \gamma(x)\|_2^2\big]. \tag{72}$$
We set the GP weight λ1 = 0.1 according to (Dey et al., 2021). For the DL α-GAN, we use the one-sided GP as a soft constraint on the Lipschitz constant,
$$R_2 = \lambda_2\, \mathbb{E}_{x \sim \rho_g}\big[\max\{0, \|\nabla_x \gamma(x)\|_2 - 1\}\big], \tag{73}$$
where ρg is the distribution of TX + (1 − T)Y, with X ∼ Pg, Y ∼ Q, and T ∼ Unif([0, 1]) all independent. The one-sided GP weight is set to λ2 = 10 according to (Birrell et al., 2022). Unequal learning rates were used, ηG = 0.0001 for the generator and ηD = 0.0004 for the discriminator. The neural architectures for the generators and discriminators are displayed in Table 5 and Table 6.

G.3. ANHIR and LYSTO

As for Rot MNIST, the GP weights are set to λ1 = 0.1 for the RA-GAN in (72) and λ2 = 10 for the DL α-GAN in (73), and we consider only the case α = 2. The learning rates were set to ηG = 0.0001 and ηD = 0.0004, respectively; a sketch of the two gradient penalties is given below.
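The following is a minimal sketch of ours (not the authors' training code) showing how the two gradient penalties (72) and (73) can be computed; `discriminator` plays the role of the test function γ, and the batch shapes, helper names, and default weights are illustrative.

# Illustrative sketch (not the authors' code) of the gradient penalties in Eqs. (72)-(73).
import torch

def r1_penalty(discriminator, x_real, lambda1=0.1):
    # Zero-centered GP on the real distribution Q:  (lambda1 / 2) * E_{x~Q} ||grad_x gamma(x)||_2^2
    x = x_real.detach().requires_grad_(True)
    out = discriminator(x).sum()
    (grad,) = torch.autograd.grad(out, x, create_graph=True)
    return 0.5 * lambda1 * grad.flatten(1).pow(2).sum(dim=1).mean()

def one_sided_penalty(discriminator, x_fake, x_real, lambda2=10.0):
    # One-sided GP at x ~ rho_g, where rho_g is the law of T*X + (1-T)*Y with
    # X ~ P_g, Y ~ Q, T ~ Unif([0,1]) independent:  lambda2 * E[ max(0, ||grad_x gamma(x)||_2 - 1) ]
    t = torch.rand(x_real.shape[0], *([1] * (x_real.dim() - 1)), device=x_real.device)
    x = (t * x_fake + (1 - t) * x_real).detach().requires_grad_(True)
    out = discriminator(x).sum()
    (grad,) = torch.autograd.grad(out, x, create_graph=True)
    grad_norm = grad.flatten(1).norm(2, dim=1)
    return lambda2 * torch.clamp(grad_norm - 1.0, min=0.0).mean()

During training, such penalties would simply be added to the discriminator loss before the backward pass.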
ResNets instead of CNNs are used as the baseline generators and discriminators, and the detailed architectural designs are specified in Table 7 and Table 8.

G.4. Architectures

Table 5. Generator architectures used in the Rot MNIST experiment. Conv SN and C4-Conv SN stand for spectrally-normalized 2D convolution and its C4-equivariant counterpart, respectively. The incomplete attempt at building equivariant generators ((I)Eqv G) does not have the C4-symmetrization layer. The C8-equivariant generator (Eqv G, Σ = C8) is built by replacing 3×3 C4-Conv SN with 5×5 C8-Conv SN while adjusting the number of filters to maintain a similar number of trainable parameters.

CNN Generator (CNN G):
Sample noise z ∈ R^64 ∼ N(0, I)
Embed label class y into ŷ ∈ R^64
Concatenate z and ŷ into h ∈ R^128
Project and reshape h to 7×7×128
3×3 Conv SN, 128 → 512
ReLU; Up ×2
3×3 Conv SN, 512 → 256
CCBN; ReLU; Up ×2
3×3 Conv SN, 256 → 128
CCBN; ReLU
3×3 Conv SN, 128 → 1

C4-Equivariant Generator (Eqv G, Σ = C4):
Sample noise z ∈ R^64 ∼ N(0, I)
Embed label class y into ŷ ∈ R^64
Concatenate z and ŷ into h ∈ R^128
Project and reshape h to 7×7×128
C4-symmetrization of h
3×3 C4-Conv SN, 128 → 256
ReLU; Up ×2
3×3 C4-Conv SN, 256 → 128
CCBN; ReLU; Up ×2
3×3 C4-Conv SN, 128 → 64
CCBN; ReLU
3×3 C4-Conv SN, 64 → 1
C4-Max Pool

Table 6. Discriminator architectures used in the Rot MNIST experiment. The C8-invariant discriminator (Inv D, Σ = C8) is built by replacing 3×3 C4-Conv SN with 5×5 C8-Conv SN while adjusting the number of filters to maintain a similar number of trainable parameters.

CNN Discriminator (CNN D):
Input image x ∈ R^{28×28×1}
3×3 Conv SN, 1 → 128
Leaky ReLU; Avg. Pool
3×3 Conv SN, 128 → 256
Leaky ReLU; Avg. Pool
3×3 Conv SN, 256 → 512
Leaky ReLU; Avg. Pool
Global Avg. Pool into f
Embed label class y into ŷ
Project (ŷ, f) into a scalar

C4-Invariant Discriminator (Inv D, Σ = C4):
Input image x ∈ R^{28×28×1}
3×3 C4-Conv SN, 1 → 64
Leaky ReLU; Avg. Pool
3×3 C4-Conv SN, 64 → 128
Leaky ReLU; Avg. Pool
3×3 C4-Conv SN, 128 → 256
Leaky ReLU; Avg. Pool
C4-Max Pool
Global Avg. Pool into f
Embed label class y into ŷ
Project (ŷ, f) into a scalar

Table 7. Generator architectures used in the ANHIR and LYSTO experiments. The generator residual block (Res Block G) is a cascade of [CCBN, ReLU, Up ×2, 3×3 Conv SN, CCBN, ReLU, 3×3 Conv SN] with a short connection consisting of [Up ×2, 1×1 Conv SN]. The equivariant residual block (D4-Res Block G) is built by replacing each component with its equivariant counterpart. The incomplete attempt at building equivariant generators ((I)Eqv G) does not have the D4-symmetrization layer.

CNN Generator (CNN G):
Sample noise z ∈ R^128 ∼ N(0, I)
Embed label class y into ŷ ∈ R^128
Concatenate z and ŷ into h ∈ R^256
Project and reshape h to 4×4×128
Res Block G, 128 → 256
Res Block G, 256 → 128
Res Block G, 128 → 64
Res Block G, 64 → 32
Res Block G, 32 → 16
3×3 Conv SN, 16 → 3

Equivariant Generator (Eqv G):
Sample noise z ∈ R^128 ∼ N(0, I)
Embed label class y into ŷ ∈ R^128
Concatenate z and ŷ into h ∈ R^256
Project and reshape h to 4×4×128
D4-symmetrization of h
D4-Res Block G, 128 → 90
D4-Res Block G, 90 → 45
D4-Res Block G, 45 → 22
D4-Res Block G, 22 → 11
D4-Res Block G, 11 → 5
D4-BN; ReLU
3×3 D4-Conv SN, 5 → 3
D4-Max Pool

Table 8. Discriminator architectures used in the ANHIR and LYSTO experiments. The discriminator residual block (Res Block D) is a cascade of [ReLU, 3×3 Conv SN, ReLU, 3×3 Conv SN, Max Pool] with a short connection consisting of [1×1 Conv SN, Max Pool].
The equivariant residual block (D4-Res Block D) is built by replacing each component with its equivariant counterpart.

CNN Discriminator (CNN D):
Input image x ∈ R^{64×64×3}
Res Block D, 3 → 16
Res Block D, 16 → 32
Res Block D, 32 → 64
Res Block D, 64 → 128
Res Block D, 128 → 256
Global Avg. Pool into f
Embed label class y into ŷ
Project (ŷ, f) into a scalar

Invariant Discriminator (Inv D):
Input image x ∈ R^{64×64×3}
D4-Res Block D, 3 → 5
D4-Res Block D, 5 → 11
D4-Res Block D, 11 → 22
D4-Res Block D, 22 → 45
D4-Res Block D, 45 → 90
D4-Max Pool
Global Avg. Pool into f
Embed label class y into ŷ
Project (ŷ, f) into a scalar
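To relate these architecture tables back to the theory, the following small sketch of ours (not the paper's implementation) implements the symmetrization operator S_Σ of Lemma B.1 for Σ = C4 acting on images by 90-degree rotations; averaging any test function over the group in this way yields a C4-invariant discriminator.

# Illustrative sketch (not the authors' code): the symmetrization operator
# S_Sigma[gamma](x) = (1/|Sigma|) * sum_sigma gamma(T_sigma(x)) for Sigma = C4
# acting on images by 90-degree rotations.
import numpy as np

def c4_symmetrize(gamma):
    """Return the C4-symmetrized version of a scalar function gamma on H x W (x C) images."""
    def gamma_sym(x):
        # S_Sigma[gamma](x) = (1/4) * sum_{k=0..3} gamma(rot90^k(x))
        return np.mean([gamma(np.rot90(x, k, axes=(0, 1))) for k in range(4)])
    return gamma_sym

# Quick invariance check: gamma_sym(T_sigma(x)) == gamma_sym(x) for a 90-degree rotation T_sigma.
rng = np.random.default_rng(0)
w = rng.normal(size=(28, 28, 1))                  # fixed weights make gamma position-dependent,
gamma = lambda img: float((w * img).sum())        # hence not C4-invariant on its own
x = rng.normal(size=(28, 28, 1))

gamma_sym = c4_symmetrize(gamma)
assert not np.isclose(gamma(x), gamma(np.rot90(x, 1, axes=(0, 1))))
assert np.isclose(gamma_sym(x), gamma_sym(np.rot90(x, 1, axes=(0, 1))))

The Inv D networks above achieve the same invariance architecturally, through equivariant convolutions followed by a group max-pool and a global average pool, rather than by explicit averaging of a full forward pass.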