# functional_space_analysis_of_local_gan_convergence__19838ade.pdf Functional Space Analysis of Local GAN Convergence Valentin Khrulkov 1 Artem Babenko 1 2 Ivan Oseledets 3 Recent work demonstrated the benefits of studying continuous-time dynamics governing the GAN training. However, this dynamics is analyzed in the model parameter space, which results in finite-dimensional dynamical systems. We propose a novel perspective where we study the local dynamics of adversarial training in the general functional space and show how it can be represented as a system of partial differential equations. Thus, the convergence properties can be inferred from the eigenvalues of the resulting differential operator. We show that these eigenvalues can be efficiently estimated from the target dataset before training. Our perspective reveals several insights on the practical tricks commonly used to stabilize GANs, such as gradient penalty, data augmentation, and advanced integration schemes. As an immediate practical benefit, we demonstrate how one can a priori select an optimal data augmentation strategy for a particular generation task. 1. Introduction Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) allow for efficient learning of complicated probability distributions from samples. However, training such models is notoriously complicated: the dynamics can exhibit oscillatory behavior, and the convergence can be very slow. To alleviate this, a number of practical tricks, including the usage of data augmentation (Karras et al., 2020a; Zhao et al., 2020; Zhang et al., 2020) and optimization methods inspired by numerical integration schemes for ODEs (Qin et al., 2020) have been proposed. Nevertheless, the theory explaining the success of these methods is only starting to catch up with practice. The standard way of training a GAN is to use simultane- 1Yandex, Russia 2National Research University Higher School of Economics, Moscow, Russia 3Skolkovo Institute of Science and Technology, Moscow, Russia. Correspondence to: Valentin Khrulkov . Proceedings of the 38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s). ous gradient descent. In Nagarajan & Kolter (2017) it was shown that under mild assumptions, this method is locally convergent. The authors Mescheder et al. (2018) demonstrated that when these assumptions are not satisfied, the usage of regularization methods such as the gradient penalty (Roth et al., 2017) or consensus optimization (Mescheder et al., 2017) is required to achieve convergence. In Balduzzi et al. (2018); Nie & Patel (2020); Liang & Stokes (2019) the local convergence is studied based on the eigenvalues of the Jacobian of the dynamics near the equilibrium. Such analysis involves the parameterization of the generator and the discriminator by neural networks and usually does not take into account the properties of the target distribution. Our key idea is that this dynamics can be efficiently analyzed in the functional space. This is achieved by constructing a local quadratic approximation of the GAN training objective and writing down the dynamics as a system of partial differential equations (PDEs). The differential operator underlying this system has a remarkably simple form, and its spectrum can be analyzed in terms of the fundamental properties of the target distribution. Furthermore, we show that the convergence of the dynamics is determined by the Poincar e constant of this distribution. Intuitively, this constant describes the connectivity properties of a distribution; for instance, it is smaller for multimodal datasets with disconnected modes, where GAN convergence is known to be slower. This connection is practically important since the Poincar e constant of a dataset can be easily estimated a priori, and the larger it is, the better is the expected convergence of GAN. Thus, we can analyze common techniques that alter the train set in terms of their effect on the Poincar e constant. For instance, one can choose a proper augmentation strategy that increases the Poincar e constant the most. The main contributions of our paper are: We develop a linearized GAN training model in the functional space in the form of a PDE. We derive explicit formulas connecting the eigenfunc- tions and eigenvalues of the resulting PDE operator with the target distribution properties. This connection provides a theoretical justification for common GAN stabilization techniques. We describe an efficient, practical recipe for choos- ing optimal parameters for gradient penalty and data Functional Space Analysis of Local GAN Convergence augmentations for a particular dataset. 2. Second-order approximation of the GAN We assume that data samples {Xi}N i=1 2 Rd are produced by a target probability measure µ. The generator function G(z) is a deterministic mapping from a latent space Rl to Rd; z is sampled from a known probability measure µz. The discriminator D is a real-valued function on Rd which goal is to distinguish between real and fake samples. The GAN objective can be written as a min-max problem D f(D, G), (1) f(D, G) = Ex µ φ1(D(x)) + Ez µz φ2(D(G(z))), (2) and φ1, φ2 are some scalar convex twice differentiable functions. For example, for the LSGAN (Mao et al., 2017) 2(x 1)2, φ2(x) = 1 We analyze the behaviour of a GAN in a neighborhood of the Nash equilibrium corresponding to an optimal discriminator function D and an optimal generator function G . We make standard assumptions (Nagarajan & Kolter, 2017; Mescheder et al., 2018; Nie & Patel, 2020) that the optimal discriminator satisfies D (x) = 0 8x and the measure produced by G is equal to µ. Similarly, we also assume that φ00 2(0) 0 and φ0 2(0) 6= 0. In this setting we can recover many common GAN variations including the vanilla GAN (Goodfellow et al., 2014), WGAN (Arjovsky et al., 2017), LSGAN (Mao et al., 2017). The standard way of solving the min-max problem is to represent D and G by some parametric models and solve it in the parameter space. The convergence of the resulting dynamics can be studied by linearizing it around the equilibrium point (Nagarajan & Kolter, 2017; Mescheder et al., 2018; Nie & Patel, 2020). This is equivalent to the construction of the quadratic approximation of f(D, G) in the parameter space. In this paper, we do the quadratic approximation first, obtaining a new saddle point problem in a functional space that approximates the local dynamics of standard GAN training methods. This representation has a remarkably simple form, and its properties can be studied in detail. Technical assumptions. We assume that µ is a measure with positive density > 0 on Rd, i.e., dµ = (x)dx. In practice, this can be (formally) achieved by smoothing a discrete data distribution with Gaussian noise. As the functional space, we consider L2(µ) = L2(µ, Rd) equipped with a natural weighted inner product: hu1, u2iµ = u1(x)u2(x)dµ. We will also often utilize differential operators, which should be understood in the distributional sense. In these cases, the functions under study are assumed to be the elements of the corresponding Sobolev space. Specifically, we will utilize the weighted Sobolev spaces H1(µ) and H2(µ). These are spaces of generalized functions such that their distributional first order and second order derivatives respectively lie in L2(µ). We start by finding the second-order approximation to the GAN objective near Nash equilibrium in the functional space. This process is analogous to the study of nonlinear (finite-dimensional) ODEs near an equilibrium point via linearization. We assume that G is invertible, i.e., for any x µ there exists a unique z µz such that G (z) = x. Let (δD, δG) be the functional variations of the optimal discriminator and generator respectively. Informally, this means that δD and δG are sufficiently small perturbations of the equilibrium. Let = 1 2(0)) , β = φ0 2(0) and f0 = f(D , G ). Theorem 1. Let D = D + δD = δD, G = G + δG. Let us denote u(x) = δD(x), v(x) = (δG)(G 1 (x)). We assume u 2 H2(µ) and v 2 H1(µ). Then, (2) can be approximated as f(D, G) = f0 + g(u, v) + R3(u, v), g(u, v) = hu, uiµ + βhrxu, viµ, R3(u, v) = O kuk H2(µ) + kvk H1(µ) Proof. See Appendix A for the proof. From Theorem 1 it follows that the local quadratic approximation to (2) is provided by g(u, v). 3. Continuous gradient descent-ascent in the functional space Now we need to solve the min-max problem of the form v g(u, v), g(u, v) = hu, uiµ + βhrxu, viµ. (3) We will refer to (3) as linearized GAN problem (l GAN). Since it is a local quadratic approximation of the original min-max problem, local behavior of GAN training methods can be understood on this task. The linearized GAN problem Functional Space Analysis of Local GAN Convergence depends solely on the measure µ, and we will see what are the particular fundamental properties of this measure that determine the convergence of the gradient-based method for solving (3). The continuous descent-ascent flow in the functional space can be written as: ut = rug(u, v), vt = rvg(u, v), (4) where rug and rvg are Fr echet derivatives of the functional g( , ) with respect to u and v. It is important to define what is the meaning of (4), i.e., it should be understood in the weak sense. Select a test function ˆu and take the scalar product of the first equation with it. We get hut, ˆuiµ = hrug(u, v), ˆuiµ, 8ˆu 2 H2(µ) (5) Similar equation holds for v. The evaluation of the righthand side is done by integration by parts, which leads to the system of the form hut, ˆuiµ = hu, ˆuiµ βhrxˆu, viµ, 8ˆu 2 H2(µ). hvt, ˆviµ = βhrxu, ˆviµ, 8ˆv 2 H1(µ). By making use of the density , equations (6) can be rewritten in the strong form as ut = u + βrx ( v), vt = βrxu. (7) We are interested in the asymptotic behavior of u and v when t goes to infinity. It is completely described by the eigenvalues of the operator on the right-hand side of (6). We will say that (uλ, vλ) is an eigenfunction with an eigenvalue λ if it satisfies the following system of equations. λhuλ, ˆuiµ = huλ, ˆuiµ βhrxˆu, vλiµ, 8ˆu 2 H2(µ). λhvλ, ˆviµ = βhrxuλ, ˆviµ, 8ˆv 2 H1(µ). 4. Eigenfunctions and eigenvalues of the linearized operator If (uλ, vλ) form an eigenfunction, then the solution of the time-dependent problem (6) with the initial conditions (u(0) = uλ, v(0) = vλ) can be written as u(t) = uλ exp(λt), v(t) = vλ exp(λt). (9) We will see that {(uλ, vλ)}λ form a basis in our functional space and thus an arbitrary solution can be written as a series of terms (9). Thus, the real parts of the spectrum of the operator in (6) should be less than 0 in order for the system to have asymptotic stability. We will demonstrate that our system does not have eigenvalues with a positive real part but has a naturally interpretable kernel. 4.1. The kernel. We find that the kernel has the following form. Corollary 1. Let (u0, v0) be an eigenfunction with λ = 0. Then, u0 = C, hrxˆu, v0iµ = 0, 8ˆu 2 H2(µ), (10) or in the strong form: u0 = C, rx ( v0) = 0. (11) Here C is a constant such that C = 0. I.e., for 6= 0 we get C = 0, and C 2 R otherwise. Proof. See Appendix A for the proof. This kernel has a straightforward interpretation. If we add a function v0 that satisfies (10) to the G , the mapped measure will be the same up to second order terms. Indeed, consider (10) in the strong form. We obtain that rx ( v0) = 0. Recall that the function x + v0(x) = x + δG(G 1 (x)) maps a sample x to a sample from the synthetic density produced by G + δG, i.e. v0(x) can be considered as a velocity of each sample. If samples evolve with velocity v0(x), the differential equation for the density takes the form t = rx ( v0) = 0, i.e., the density is invariant under such transformation. We will refer to the condition (10) as the divergence-free condition on v0. 4.2. Non-zero spectrum. 4.2.1. WEIGHTED LAPLACE OPERATOR We now address non-zero eigenvalues of our operator. It will be convenient to utilize the weighted Laplace operator µ which is defined (in the weak form) as follows. h µw, ˆwiµ := hrxw, rx ˆwiµ, 8 ˆw 2 H2(µ). (12) In the strong form µ takes the following form: rx ( rxw), (13) which results in the standard Laplacian in the case of the standard Lebesgue measure on Rd. This operator commonly appears in the study of the diffusion processes (Coifman & Lafon, 2006) and weighted heat equations (Grigoryan, 2009). Spectrum of the weighted Laplacian. In what follows we will use eigenvalues and eigenfunctions of µ. Firstly, this a self-adjoint non-positive definite operator and under mild assumptions on µ (Cianchi et al., 2011; Grigoryan, Functional Space Analysis of Local GAN Convergence 2009), e.g., when decreases sufficiently fast, it has a discrete spectrum. Let us study it in more detail. Consider the eigenvalue problem for µ written in the weak form. hrx ˆw, rxw iµ = h ˆw, w iµ 8 ˆw 2 H2(µ). (14) We note that w0 1 is the eigenfunction with = 0; thus, due to self-adjointness of µ for every other eigenfunction w we have h1, w iµ = 0, i.e., it has zero mean with respect to µ. The Poincar e constant of µ. Let us consider the smallest non-zero eigenvalue min of µ. We obtain that the following inequality holds. hrxw, rxwiµ minhw, wiµ, h1, wiµ = 0, 8w 2 H2(µ), (15) and the exact minimizer of this inequality is achieved by w min (Grigoryan, 2009). The value of min (sometimes its inverse) is called the Poincar e constant of the measure µ and often appears in Sobolev-type inequalities (Adams & Fournier, 2003). The exact values of min for a given measure are almost never known analytically. To make use of our results in practical settings, we propose a simple neural network based approach for estimation of min. We discuss it in Section 6. What is it exactly? We can provide an intuitive meaning of min by drawing an analogy with graph Laplacians. In this case, the second smallest eigenvalue of the graph Laplacian, called the Fiedler value (Fiedler, 1973), reflects how well connected the overall graph is. Thus, the Poincar e constant is in a way a continuous analog of this constant, reflecting the connectivity of a measure. This is the property that we would expect to impact the GAN convergence, as based on rich empirical evidence, the datasets that are more disconnected (such as Image Net) are very challenging to model. We provide experimental evidence in support of this intuitive explanation in Section 7. 4.2.2. SPECTRUM OF LGAN We are now ready to fully describe the spectrum of our l GAN model. Let { i}1 i=1 be the non-zero spectrum of µ and {w i}1 i=1 be the set of the corresponding eigenfunctions. Recall from Theorem 1 that the constants and β are defined purely in terms of functions φ1 and φ2. We now state our main result. For simplicity we assume that all the eigenvalues are of multiplicity one. Theorem 2. The non-zero spectrum of (8) is described as follows. The eigenvalues are given by {λ i=1 where λ i are roots of the quadratic equation: λ2 + λ + β2 i = 0. (16) The corresponding eigenfunctions are written in terms eigenfunctions of µ as follows. i ) = (w i, β rxw i). (17) Proof. See Appendix A for the proof. Theorem 3. Let (uλ, vλ) be the eigenfunction with eigenvalue λ (see (8)) and λ 6= 0. Then uλ is an eigenfunction of the weighted Laplace operator µ, defined as the solution of the eigenvalue problem hrxˆu, rxuiµ = hu, ˆuiµ, 8ˆu 2 H2(µ), (18) λrxuλ. The eigenvalue λ satisfies the quadratic equation λ2 + λ + β2 = 0, (19) Proof. See Appendix A for the proof. With the explicit formulas for eigenvalues and eigenvectors at hand, we are now ready to analyze the convergence of the problem (6). 5. Convergence The following theorem expresses the solution in terms of the eigenfunctions and also provides the convergence estimates. Theorem 4. Let u0 2 H2(µ), v0 2 H1(µ) and R u0dµ = c0. Then, these functions can be written as v0 = ev0 + rx V0, V0 = and ev0 is divergence-free, i.e. hrxˆu, ev0iµ = 0. The coefficients c+ k can be obtained as the solution of the linear systems: h V0, w kiµ With this expansion, the solution to (6) is u(t) = c0e t + v(t) = ev0 + rx V (t), Functional Space Analysis of Local GAN Convergence For > 0 the norms of u(t) and V (t) can be estimated as ku(t)kµ 2ku0kµe t, k V (t)kµ Ck V0kµe t, < 0 is the maximal real part of the eigenvalues. Proof. See Appendix A for the proof. Discussion. Theorem 4 provides the exact representation of the solution in terms of the eigenfunctions of µ. The eigenvalues are given as the solution of the quadratic equation (19), thus the spectral properties of µ completely determine the dynamics of the convergence. Specifically, we observe that the speed of convergence is determined by the value of min, i.e., the lowest non-zero mode of the weighted Laplacian. There are two distinct cases. If the first eigenvalue of µ satisfies 2 4β2 min 0, (i.e., min is large enough), then all the eigenvalues λ i are complex, and their real part is equal to 2 . Ideally, we would have 2 4β2 min = 0, which would provide us with the optimal convergence speed and no oscillations for the highest mode. Consider now the case of a small min 2 4β2 . In this case, will be close to 0, which would result in a slow convergence rate. These observations can be used to explain the success of various practical GAN training methods, as we show in Section 7. In the case is = 0, which holds, for instance, for WGAN, we obtain the well-known purely oscillatory behavior (Nagarajan & Kolter, 2017). We also note that for > 0, the average value of the discriminator exponentially decays to 0. This resembles the convergence plots of state-of-the-art GANs, see, e.g., Karras et al. (2020a, Figure 6b). In Appendix B we discuss a concrete application of Theorem 4 to the case of µ N(0, 1). The spectrum of µ can be found explicitly and is given by Z 0. What about neural networks? The discussion so far focuses on the functional spaces. In reality, these functions are approximated by neural networks, and the dynamics is written in the parameters D and G of these networks. Local convergence analysis of such dynamics is possible in these parameters (Mescheder et al., 2018; Nie & Patel, 2020); however, such analysis involves eigenvalues of the Jacobian of the loss. It is not easy to connect these properties to the fundamental properties of the measure. The functional space analysis shows this connection. The approximation by neural networks (or by any other parametric representation) can be thought of as a spatial discretization of PDEs. If the number of parameters increases, the eigenvalues of the discretized problem should approximate the eigenvalues of the infinite-dimensional problem. It is worth noting that different discretizations (for example, different neural network architectures) may lead to different properties. One can compare such discretizations by looking at the quality of approximation of the eigenvalues of the weighted Laplace operator (see Section 6 for the algorithmic details). Also, such techniques used in practice as Jacobian regularization (Karras et al., 2020b) can be considered as methods for choosing a better-conditioned discretization. Finally, for future research, we would like to mention that the l GAN problem is similar to the saddle point problems (Benzi et al., 2005) appearing in mathematical modeling of fluid problems. For example, for robust discretization, one has to choose discretization spaces of u ( pressure ), and v ( velocity ) such that they satisfy the famous LBB (Ladyzhenskaya-Babuˇska-Brezzi) condition (Boffiet al., 2013; Ladyzhenskaya, 1969) in order to get good convergence properties. This task requires systematic study, and we leave it for future research. 6. Effects of common practices on the asymptotic convergence Regularization. Several studies have shown that if we penalize the norm of the gradient of the GAN objective with respect to the discriminator, it improves the convergence (Gulrajani et al., 2017; Mescheder et al., 2018). In our case, it results in an additional term γ 2 Eµkr Df(D, G)k2. After linearization we obtain a regularized loss ˆg(u, v). u ˆg(u, v), ˆg(u, v) = g(u, v) + γ 2 Eµ krxuk2. New eigenvalue problem has the form λhuλ, ˆuiµ = huλ, ˆuiµ βhrxˆu, vλiµ γhrxu, rxˆui, λhvλ, ˆviµ = βhrxuλ, ˆviµ, 8ˆv 2 H1(µ), ˆu 2 H2(µ). The kernel stays the same, and moreover, the uλ component is also the same, which can be seen again by using ˆv = rxˆu in the second equation. The only difference is the connection between eigenvalues of the weighted Laplace operator and eigenvalues of the linearized problem, which changes as follows: λ2 + ( + γ i)λ + β2 i = 0, i = ( + γ i) ( + γ i)2 4β2 i Optimal parameters. One of the most interesting conclusions of our analysis is that the convergence is determined by the connection between , β, γ and . If we fix φ1, φ2 in advance, we can not control and β and convergence for a particular dataset will be determined by min. I.e., for one dataset the loss may work well, but for another, it may fail. Functional Space Analysis of Local GAN Convergence An alternative approach is to optimize over the hyperparameters of the loss and regularizer taking the Poincar e constant into account. Expression (22) gives a direct way to do that. The eigenvalue with the maximal real part corresponds to = min and is given by the formula λmax = ( + γ min) + ( + γ min)2 4β2 min The maximum is obtained if the discriminant is equal to 0, which gives + γ min = 2|β| Another desirable property is the absence of oscillations. This is only possible if the quadratic function under the square root is non-negative. Since it is equal to zero at = min, it is necessary and sufficient for it to have nonnegative derivative: 2γ( + γ min) 4β2 ! γ |β| p min Simple analysis gives the following conditions for the parameters , β, γ such that we have optimal local convergence rate and all the eigenvalues of the linearized operator are real: + γ min = 2|β| γ 2|β| p min , 0 |β| p min Discrete approximation of the time dynamics. The dynamics (6) can be written in the operator form as wt = Lw, w(0) = w0, w = The usage of gradient descent optimization for this problem is equivalent to the forward Euler scheme: wk+1 = wk + Lwk, (27) and the eigenvalues of the discretized operator are equal to i = 1 + λi, (28) where λ are given by Theorem 3. Since for all i such that 2 4β2 i < 0 the eigenvalue λi is complex, the modulus of the corresponding λe Since µ is an unbounded operator under mild assumptions on µ there exists an eigenvalue i of it that makes |λe i| > 1, i.e. the Euler scheme is absolutely unstable. On the discrete level, when the functions are approximated by a neural network, we might be working in a subspace that does not contain eigenvalues that are too large, and the method might actually converge; but this is very problem specific. As has been noted by Qin et al. (2020) the usage of more advanced time integration schemes leads to competitive GAN training results even without such functional constraints as spectral normalization (Miyato et al., 2018). If the parameters are selected in the range (25), then there are no complex eigenvalues, and we can make Euler scheme convergent. This requires regularization. Another important research direction is the development of more suitable time discretization schemes. The system (6) belongs to the class of hyperbolic problems, and special time discretization schemes have to be used. One class of such methods is total variation diminishing (TVD) schemes (Gottlieb & Shu, 1998). Note, that one of the methods considered in Qin et al. (2020) is the Heun method (S uli & Mayers, 2003) which is a second-order TVD Runge-Kutta scheme, which confirms its empirical efficiency. Thus, our analysis provides the theoretical foundation for the results of Qin et al. (2020). Data augmentation. One of the most successful practical tricks significantly improving GAN convergence have been the usage of data augmentation (Karras et al., 2020a; Zhang et al., 2020; Zhao et al., 2020). In our framework, the usage of data augmentations is reflected in the shift of min. Based on the intuition discussed earlier, higher values of min correspond to more connected distributions and allow for faster convergence. This is exactly what is intuitively achieved by data augmentation: we fill the holes in our dataset with new samples and make it more connected. We describe our experiments confirming this idea in Section 7. Variational estimation of min. For practical analysis of GAN convergence, we need to estimate the value of the Poincar e constant min for a given dataset. This can be performed in the standard supervised learning manner. Recall from the definition that min is the minimizer of the following optimization problem (commonly called the Rayleigh quotient). min hrxw, rxwiµ s.t. w 2 H2(µ), h1, wiµ = 0. Let us rewrite this optimization problem in a more convenient form. Note that the constraint h1, wiµ = 0 simply means that the Ex µ w = 0. Since subtraction of a constant does not affect the gradient, we can rewrite (30) as min Ex µ krxw(x)k2 Varx µw(x) (31) s.t w 2 H2(µ), Functional Space Analysis of Local GAN Convergence where Varx µ(w) denotes the variance of w with respect to µ. In practice we can parameterize w with a neural network, and perform optimization of (31) in a stochastic manner by replacing expectations over µ with their empirical counterparts (over a batch of inputs). Due to the scale invariance of the Rayleigh quotient, we employ spectral normalization (Miyato et al., 2018) as an additional regularization; this also falls in line with commonly used discriminator architectures. To summarize, we utilize a neural network w(x; ) and for a dataset X we perform stochastic optimization of the loss given by (31). I.e, the loss on a single batch Xb = {Xi}N i=1 is given by i=1 krxw(Xi; )k2 i=1 f(Xi)2 ( 1 i=1 w(Xi))2 , (32) and, correspondingly, min min L( ). Minimization of this loss function can be performed with standard optimizers such as SGD or Adam (Kingma & Ba, 2015). Note that the value of min obtained via a neural network approximation as an upper bound on the actual value. In principle, we might obtain arbitrarily small estimations while increasing the network s expressive power. In Section 7 we show that the obtained min estimate in fact converges to a certain (non-zero) value when we increase the number of parameters. 7. Experiments Experimental setup. Our experiments are organized as follows. We start by numerically investigating the value of min and the impact of formulas from Section 6 on the convergence on synthetic datasets. We then study the more practical CIFAR-10 (Krizhevsky et al., 2009) dataset. Firstly, we show the correlation between the min obtained for various augmented versions of the dataset and FID values obtained for the corresponding GAN. We then show that a similar correlation holds when we perform instance selection, a recently proposed technique shown to improve GAN convergence (De Vries et al., 2020). In Appendix C we provide additional visualizations. For the synthetic datasets, our experiments were performed in JAX (Bradbury et al., 2018) using the ODE-GAN code available at Git Hub1. For CIFAR-10 experiments we utilized Py Torch (Paszke et al., 2019). Importantly, we work with standard GAN models and validate theoretical findings predicted by l GAN. Gaussian Mixture. In order to study the effect of min and the choice of and γ on the training procedure, we set up a simple one-dimensional test: a mixture of two normal distributions N(0, 1) and N(M, 1), where M is the separation parameter. Intuitively, the larger the M, the smaller 1https://github.com/deepmind/ deepmind-research/tree/master/ode_gan is min, since it reduces connectivity of our measure. We also verify it numerically, as shown in Figure 1 (top). In this case, we utilize a simple two-layer MLP with spectral normalization on top of linear layers. We observe that indeed the value of min decays rapidly with the increase of the separation parameter M. To experiment with GAN convergence in this toy setting, we setup two MLP models for D and G, and train the LSGANlike model with 2 x2 + βx, φ2(x) = βx, and gradient penalty with factor γ. We train the model using the ODE-GAN approach with the Heun method. We experiment with two moderately separated mixtures, namely with M = 4 and M = 4. Respective min values obtained by the numerical simulation described above are min = 0.25 and min = 0.124. We consider five options for the values of ( , β, γ). We start with the baseline WGAN corresponding to (0, 1, 0). We consider two options for the value of γ: the optimal one γ = 2|β|/p min predicted by the theory, and a sub-optimal γ = |β|/p min. With the latter value we obtain Im λmax |β|p min. We also consider two LSGAN variants. First one is the default version with the parameters (0.25, 0.5, 0). For the second version we fix β = 1 and choose the optimal and γ according to (25). We measure performance of a GAN model via the Earth Mover s Distance (EMD) between the Gaussians fitted on real and synthetic data. This approach resembles the commonly used Frech et Inception Distance (FID) metric for GAN evaluation. Results are visualized on Figure 1. We observe that the resulting convergence plots match our theoretical predictions. In particular, we see that when the parameters are chosen in the optimal way, methods converge more rapidly and stably. On the other hand, when γ is not tuned or is sub-optimal, the convergence is more oscillatory, and GANs struggle to converge. Note that even for the optimal γ for methods with > 0, we still observe some oscillations. This may be a result of a noise induced by neural function approximation or stochasticity of training. Toy data augmentation. In this set of experiments, we verify if the effects of data augmentations on GAN convergence can be predicted by evaluating min for the correspondingly augmented dataset. We perform it is follows: consider 2N( 3, 1) + 1 2N(3, 1) (33) and its corresponding augmented version: N( 3, 1) + p (34) By increasing the value of p, we improve the connectivity of the measure and correspondingly increase the value min of Functional Space Analysis of Local GAN Convergence Figure 1. Numerical experiments with GAN convergence on a mix- ture of two univariate Gaussians. (Top) Estimated value of min vs the mixture separating parameter M. (Bottom) Convergence plots of GANs with different loss functions and regularization strengths. Methods with parameters selected optimally according to theory present better convergence. µp. In Figure 2 we provide visualizations and estimated values of min for p 2 {0, 0.2, 0.4}. For these values, we train a standard LSGAN with the Heun method, and visualize the convergence plots (specifically, we plot k D(t)k2 µp). We observe that increasing min in these cases indeed corresponds to better convergence. Figure 2. A toy data augmentation example: increasing connectiv- ity corresponds to a better convergence. Data augmentation of CIFAR-10. We now analyze a practical case of the CIFAR-10 dataset. We consider a number of augmentations commonly utilized in data augmentation pipelines for GANs (see Figure 3). In particular, these augmentations include both spatial (translations, zooming) and visual augmentations (color adjustment). Folklore knowledge is that the first type is generally helpful, while the second is not. In our framework, the positive effect of augmentations can be understood as improving the connectivity of a dataset, which can be quantitatively measured by evaluating min of the augmented version of a dataset. Figure 3. Image augmentations commonly used in GAN training. Increased value of min both intuitively and theoretically (as supported by Theorem 3) corresponds to better convergence and better FID scores. Our experiments are based on a thorough analysis of the impact of augmentations on the quality of a GAN trained on CIFAR-10 provided in Zhao et al. (2020). Specifically, the authors select a single augmentation from a predefined set and train several types of GANs by augmenting real and fake images. The strength of the augmentation is controlled by a single parameter λ, and the authors provide the obtained FID for a number of its possible values (e.g., we may vary how strongly we zoom an image); we refer the reader to Zhao et al. (2020) for specific details on each augmentation. We selected a large portion of augmentations studied in this paper and replicated its data augmentation setup. For each augmentation and each strength value, we minimize the Rayleigh quotient with a neural network mimicking the SNDCGAN discriminator from Zhao et al. (2020). We train it on the train part of the dataset by applying the respective augmentation to each image with probability one. Note that the actual augmentation strength is randomly sampled from the range [0, λ], so we cover the entire distribution relatively well. For training, we use the batch size of 256 and Adam optimizer with a learning rate 10 4 (results are not sensitive to these parameters). We measure the actual value of min by sweeping across the augmented test set 5 times to better cover the distribution and aggregate the results across all 50000 (augmented) samples. For the reference values, we choose the FID scores obtained by the SNDCGAN model with Balanced Consistency Regularization (b CR) compiled from Zhao et al. (2020, Figure 3). This model achieves better generation quality so our theory is more reliable in this case. Our results are provided at Figure 4. For convenience, we normalize the obtained values by dividing it by min of the non-augmented dataset (approximately equal to 0.0042 by our estimation). We observe that FID scores in many cases indeed follow the behavior of min. For instance, for zoomin and cutout, the estimated min values predict that there is an intermediate optimal augmentation strength, which is matched by the practice. For translation, we observe that stronger augmentations worsen the connectivity of the dataset, which results in worse FID scores. We can also observe a significant drop in min for color-based augmentations, confirming their practical inefficiency. We note, Functional Space Analysis of Local GAN Convergence in some cases (e.g., zoomout), we do not observe a direct correspondence. This may be a result of auxiliary effects not covered by our theory or of a discrepancy between ours and the authors implementation. Figure 4. (Top) The values of min for augmented versions of CIFAR-10 for different values of the augmentation strength. We normalize them dividing by min 0.0042 of the non-augmented version of the dataset. (Bottom) The FID values of a SNDCGAN (BCR) trained on the corresponding dataset. The values are taken from Zhao et al. (2020). For min a higher value is better in terms of connectivity of dataset and theoretical convergence. For FID smaller values are better. In most cases we observe a negative rank correlation between min and FID. Instance selection. Another possible way to manipulate the data distribution is to remove non-typical samples, which can confuse the generator (De Vries et al., 2020). This can be performed by fitting a Gaussian model on feature vectors of a dataset produced by some pretrained model (in our experiments, we used Inception v3 (Szegedy et al., 2016)) and keeping only samples with high likelihood under this model. This procedure results in better FID scores of trained GANs, as shown in De Vries et al. (2020). In our framework, this can be understood as another way to improve the dataset s connectivity since by removing outliers, we reduce the number of gaps in the data. Experiments in the aforementioned paper were conducted on the resized Image Net dataset; however, we hypothesize that this phenomenon may also be understood on the conceptually similar CIFAR-10 dataset. We perform an experiment confirming this effect quantitatively. We follow the steps described above, and for each value of 2 {0.1, 0.2, 0.3, 0.4, 0.5} evaluate min of the dataset obtained by removing samples in the bottom -quantile of the likelihood. Our results are provided in Table 1. We observe that for higher values of the truncation parameter, we indeed obtain better dataset connectivity, confirming practical findings of De Vries et al. (2020). Table 1. Normalized differences in min for CIFAR-10 with in- stance selection performed according to the procedure outlined in De Vries et al. (2020) with respect to the truncation (lower) quantile . min is computed as ( min( )/ min(0) 1) 102 for easier comparison. We observe that stronger truncation increases the value of min, confirming better empirical GAN convergence. 0 0.1 0.2 0.3 0.4 0.5 min " 0 0.95 0.04 1.24 2.90 2.74 Limitations of variational estimation. We now analyze whether the obtained approximations of min are reliable in the sense that they will not get arbitrarily small when the networks of higher capacity are used. To verify the stability of the obtained min estimate, we perform a sweep across the width (i.e., the number of filters in convolutions of the Res Net-18 model) of the network. We evaluate 19 values of width starting with 32 with the increment 8. The obtained min values with respect to the number of parameters are visualized in Figure 5. We observe that the obtained value indeed converges when we increase the depth; based on the plot of the relative error we may note that the convergence is roughly linear in terms of parameter count (R2 = 0.87). Figure 5. (Left) Convergence of the variational min estimate with respect to the number of parameters in the network. Parameters are given in multiples of 107. (Right) Relative error of the obtained estimated with respect to the value obtained for width=176. 8. Conclusion We presented a novel framework for a theoretical understanding of GAN training by analyzing the local convergence in the functional space. Namely, we represent the GAN dynamics as a system of partial differential equations and analyze the spectrum of the corresponding differential operator, which determines the dynamics convergence. As the main result, we show how the spectrum depends on the properties of the target distribution, in particular, on its Poincar e constant. Our perspective provides a new understanding of established GAN tricks, such as gradient penalty or dataset augmentation. For practitioners, our paper develops an efficient method that allows to choose optimal augmentations for a particular dataset. Functional Space Analysis of Local GAN Convergence Adams, R. A. and Fournier, J. J. Sobolev spaces. Elsevier, Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gen- erative adversarial networks. In International conference on machine learning, pp. 214 223. PMLR, 2017. Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pp. 354 363. PMLR, 2018. Benzi, M., Golub, G. H., Liesen, J., et al. Numerical solution of saddle point problems. Acta numerica, 14(1):1 137, 2005. Boffi, D., Brezzi, F., Fortin, M., et al. Mixed finite element methods and applications, volume 44. Springer, 2013. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander Plas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+Num Py programs, 2018. URL http://github.com/google/jax. Cianchi, A., Maz ya, V., et al. On the discreteness of the spectrum of the Laplacian on noncompact Riemannian manifolds. Journal of differential geometry, 87(3):469 492, 2011. Coifman, R. R. and Lafon, S. Diffusion maps. Applied and computational harmonic analysis, 21(1):5 30, 2006. De Vries, T., Drozdzal, M., and Taylor, G. W. Instance Selection for GANs. Advances in Neural Information Processing Systems, 2020. Dieudonn e, J. Foundations of modern analysis. Read Books Fiedler, M. Algebraic connectivity of graphs. Czechoslovak mathematical journal, 23(2):298 305, 1973. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, volume 27, pp. 2672 2680. Curran Associates, Inc., 2014. Gottlieb, S. and Shu, C.-W. Total variation diminishing Runge-Kutta schemes. Mathematics of computation, 67 (221):73 85, 1998. Griffiths, D. J. Introduction to electrodynamics, 2005. Grigoryan, A. Heat kernel and analysis on manifolds, vol- ume 47. American Mathematical Soc., 2009. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved Training of Wasserstein GANs. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30, pp. 5767 5777. Curran Associates, Inc., 2017. Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 2020a. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110 8119, 2020b. Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. In ICLR (Poster), 2015. URL http: //arxiv.org/abs/1412.6980. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Ladyzhenskaya, O. A. The mathematical theory of viscous incompressible flow, volume 2. Gordon and Breach New York, 1969. Liang, T. and Stokes, J. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 907 915. PMLR, Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2794 2802, 2017. Mescheder, L., Nowozin, S., and Geiger, A. The Numerics of GANs. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30, pp. 1825 1835. Curran Associates, Inc., 2017. Mescheder, L., Geiger, A., and Nowozin, S. Which Train- ing Methods for GANs do actually Converge? In International Conference on Machine learning (ICML), pp. 3481 3490. PMLR, 2018. Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spec- tral Normalization for Generative Adversarial Networks. In International Conference on Learning Representations, Functional Space Analysis of Local GAN Convergence 2018. URL https://openreview.net/forum? id=B1QRgzi T-. Nagarajan, V. and Kolter, J. Z. Gradient descent GAN op- timization is locally stable. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5591 5600, 2017. Nie, W. and Patel, A. B. Towards a better understanding and regularization of GAN training dynamics. In Uncertainty in Artificial Intelligence, pp. 281 291. PMLR, 2020. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019. Qin, C., Wu, Y., Springenberg, J. T., Brock, A., Donahue, J., Lillicrap, T. P., and Kohli, P. Training Generative Adversarial Networks by Solving Ordinary Differential Equations. Advances in Neural Information Processing Systems, 2020. Roth, K., Lucchi, A., Nowozin, S., and Hofmann, T. Sta- bilizing Training of Generative Adversarial Networks through Regularization. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30, pp. 2018 2028. Curran Associates, Inc., 2017. S uli, E. and Mayers, D. F. An introduction to numerical analysis. Cambridge university press, 2003. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818 2826, 2016. Zhang, H., Zhang, Z., Odena, A., and Lee, H. Consistency Regularization for Generative Adversarial Networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum? id=S1lx Kl SKPH. Zhao, Z., Zhang, Z., Chen, T., Singh, S., and Zhang, H. Image augmentations for GAN training. ar Xiv preprint ar Xiv:2006.02595, 2020.