# A Solvable High-Dimensional Model of GAN

Chuang Wang (wangchuang@ia.ac.cn)^{1,2}, Hong Hu (honghu@g.harvard.edu)^{2}, Yue M. Lu (yuelu@seas.harvard.edu)^{2}

1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 Zhong Guan Cun Dong Lu, Beijing 100190, China
2. John A. Paulson School of Engineering and Applied Sciences, Harvard University, 33 Oxford Street, Cambridge, MA 02138, USA

**Abstract.** We present a theoretical analysis of the training process for a single-layer GAN fed by high-dimensional input data. The training dynamics of the proposed model can be exactly analyzed, at both microscopic and macroscopic scales, in the high-dimensional limit. In particular, we prove that the macroscopic quantities measuring the quality of the training process converge to a deterministic process characterized by an ordinary differential equation (ODE), whereas the microscopic states containing all the detailed weights remain stochastic, with dynamics described by a stochastic differential equation (SDE). This analysis provides a new perspective, different from recent analyses in the limit of small learning rates, where the microscopic state is always considered deterministic and the contribution of noise is ignored. From our analysis, we show that the level of the background noise is essential to the convergence of the training process: setting the noise level too strong leads to failure of feature recovery, whereas setting the noise too weak causes oscillation. Although this work focuses on a simple copy model of GAN, we believe the analysis methods and insights developed here will prove useful in the theoretical understanding of other variants of GANs with more advanced training algorithms.

## 1 Introduction

A generative adversarial network (GAN) [1] seeks to learn a high-dimensional probability distribution from samples.
While there have been numerous advances on the application front [2–6], considerably less is known about the underlying theory and conditions that can explain or guarantee the successful training of GANs. Recently, studying either the equilibrium properties [7–9] or the training dynamics [10, 11] of GANs has become a very active area of research. Specifically, there is a line of work studying the dynamics of gradient-based training algorithms, e.g., [11–16]. The basic idea is the following. The evolution of the learnable parameters during training can be viewed as a discrete-time process. With a proper time scaling, this discrete-time process converges to a deterministic continuous-time process as the learning rates tend to 0, characterized by an ordinary differential equation (ODE). By studying the local stability of the ODE's fixed points, [12] shows that oscillation in the training algorithm is due to eigenvalues of the Jacobian of the gradient vector field with zero real part and large imaginary part. Based on this observation, various stabilization approaches have been proposed, for example adding additional regularizers [13, 14] and using two-timescale training [15]. Very recently, [16] argues that those stabilization techniques may encourage the algorithms to converge to non-Nash stationary points.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

All of the above works consider a small-learning-rate limit, where the limiting process is always deterministic. The stochasticity and the effect of the noise are essentially ignored, which may not reflect practical situations. Thus, a new analysis paradigm that accounts for the intrinsic stochasticity of the training dynamics is needed. In this paper, we present a high-dimensional and exactly solvable model of GAN. Its dynamics can be precisely characterized at both macroscopic and microscopic scales, where the former is deterministic and the latter remains stochastic.
Interestingly, our theoretical analysis shows that injecting additional noise can stabilize the training. Specifically, our main technical contributions are twofold:

- We present an asymptotically exact analysis of the training process of the proposed GAN model, carried out at both the macroscopic and the microscopic levels. The macroscopic state measures the overall performance of the training process, whereas the microscopic state contains all the detailed weight information. In the high-dimensional limit ($n \to \infty$), we show that the former converges to a deterministic process governed by an ordinary differential equation (ODE), whereas the latter stays stochastic and is described by a stochastic differential equation (SDE).

- We show that, depending on the choice of the learning rates and the strength of the noise, the training process can reach a successful, a failed, an oscillating, or a mode-collapsing phase. By studying the stability of the fixed points of the limiting ODEs, we precisely characterize when each phase takes place. The analysis reveals a condition on the learning rates and the noise strength for successful training. In particular, the level of the background noise is essential to the convergence of the training process: setting the noise level too strong (small signal-to-noise ratio) leads to failure of feature recovery, whereas setting the noise too weak (large signal-to-noise ratio) causes oscillation.

Our work builds upon a general analysis framework [17] for studying the scaling limits of high-dimensional exchangeable stochastic processes, with applications to nonlinear regression problems. Similar techniques have also been used in the literature to study Monte Carlo methods [18], online perceptron learning [19, 20], online sparse PCA [21], subspace estimation [22], online ICA [23] and, more recently, the supervised learning of two-layer neural networks [24]; to the best of our knowledge, however, this technique has not yet been used in analyzing GANs.
The rest of the paper is organized as follows. We present the proposed GAN model and the associated training algorithm in Section 2. Our main results are given in Section 3, where we show that the macroscopic and microscopic dynamics of the training process converge to their respective limiting processes, characterized by an ODE and an SDE, respectively. In Section 4, we analyze the stationary solutions of the limiting ODEs and precisely characterize the long-term behaviors of the training process. We conclude in Section 5.

## 2 Formulations

In this section, we introduce the proposed GAN model and specify the associated training algorithm.

**Model for the real data.** In order to establish the theoretical analysis, we first impose a model for the probability distribution from which we draw our real data samples. We assume that the real data $y_k \in \mathbb{R}^n$, $k = 0, 1, \ldots$ are drawn according to the following generative model:

$$y_k = G(c_k, a_k; U, \eta_T) \overset{\text{def}}{=} U c_k + \sqrt{\eta_T}\, a_k, \qquad (1)$$

where $U \in \mathbb{R}^{n \times d}$ is a deterministic unknown feature matrix with $d$ features; $c_k \in \mathbb{R}^d$ is a random vector drawn from an unknown distribution $P_c$; $a_k$ is an $n$-dimensional random vector acting as the background noise; and $\eta_T$ is a parameter controlling the variance of the noise. Without loss of generality, we assume $U^\top U = I_d$, where $I_d$ is the $d \times d$ identity matrix. (If $U$ is not orthogonal, we can rewrite $Uc$ in (1) as $(UR)(R^{-1}c)$, where $R$ is a matrix that orthogonalizes and normalizes the columns of $U$; we can then study an equivalent system where the new feature vector is $R^{-1}c$.) This generative model, referred to as the spiked covariance model [25] in the literature, is commonly used in the theoretical study of principal component analysis (PCA). We note that this model is not a trivial task for PCA, even when $d = 1$, if the variance of the noise $a_k$ is a non-zero constant. As proved in [25], the best estimator cannot perfectly recover the signal $U$ given an $O(n)$ number of samples $y_k$.
Thus, it is of sufficient interest to investigate whether a GAN can retrieve informative results for the principal components in the same scaling limit.

**The GAN model.** The GAN we are going to analyze is defined as follows. We assume that the generator $G$ has the same linear structure as the real data model (1) given above:

$$\tilde{y}_k = G(\tilde{c}_k, \tilde{a}_k; V, \eta_G), \qquad (2)$$

but the parameters are different. Here, $\tilde{y}_k$ denotes a fake sample produced by the generator; $\tilde{a}_k$ is an $n$-dimensional random noise vector; the random variable $\tilde{c}_k$ is drawn from a fixed distribution $P_{\tilde{c}}$; $\eta_G$ is the noise strength; and the matrix $V \in \mathbb{R}^{n \times d}$ represents the parameters of the generator. (In the ideal case in which the generator learns the underlying true probability distribution perfectly, we have $V = U$.) Throughout the paper, we follow the notational convention that all symbols decorated with a tilde (e.g., $\tilde{y}_k, \tilde{c}_k, \tilde{a}_k$) denote quantities associated with the generator.

We define the discriminator $D$ of our GAN model as $D(y; w) \overset{\text{def}}{=} \hat{D}(y^\top w)$. Here, $y$ is an input vector, which can be either the real data $y_k$ from (1) or the fake one $\tilde{y}_k$ from (2); $\hat{D}: \mathbb{R} \to \mathbb{R}$ can be any function; and the vector $w \in \mathbb{R}^n$ represents the parameters associated with the discriminator. Later, we will show that the generator can learn multiple features even though the discriminator has only one feature vector $w$. Discriminators with multiple features can be analyzed in a similar way, but in this paper we consider the single-feature discriminator for simplicity.

**The training algorithm.** The proposed GAN model has two sets of parameters, $V$ and $w$, to be learned from the data.
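As a concrete illustration, the real-data model (1), the generator (2), and the single-feature discriminator can be simulated in a few lines. The snippet below is our own sketch, not code from the paper: we assume Gaussian latent codes with a diagonal covariance (as in the paper's experiments) and treat $\eta$ as the noise variance, so the noise vector is scaled by $\sqrt{\eta}$.

```python
import numpy as np

def sample_data(W, C_cov, eta, rng):
    """Draw one sample y = W c + sqrt(eta) * a from the linear
    generative model: W is the (n x d) feature matrix, c ~ N(0, C_cov)
    is the latent code, and a ~ N(0, I_n) is background noise whose
    strength is controlled by the variance parameter eta.  The same
    function serves as the real-data model (W = U) and as the
    generator (W = V)."""
    n, d = W.shape
    c = rng.multivariate_normal(np.zeros(d), C_cov)
    a = rng.standard_normal(n)
    return W @ c + np.sqrt(eta) * a

def discriminator(y, w):
    """Single-feature discriminator D(y; w) = D_hat(y^T w); here
    D_hat is the identity, matching the quadratic choice of Example 1
    where F(D_hat(x)) = x^2 / 2."""
    return y @ w

rng = np.random.default_rng(0)
n, d = 1000, 2
# Ground-truth features: random orthonormal columns, so U^T U = I_d.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
y = sample_data(U, np.diag([5.0, 3.0]), eta=2.0, rng=rng)   # a "real" sample
print(y.shape)  # (1000,)
```

The same call with a randomly initialized matrix `V` in place of `U` produces a fake sample $\tilde{y}$.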
The training process is formulated as the following min-max problem

$$\min_V \max_w \; \mathbb{E}_{y \sim P(y; U)}\, \mathbb{E}_{\tilde{y} \sim \tilde{P}(\tilde{y}; V)}\, \mathcal{L}(y, \tilde{y}; w), \qquad (3)$$

where the two probability distributions $P(y; U)$ and $\tilde{P}(\tilde{y}; V)$ represent the distributions of the real data $y$ and the fake data $\tilde{y}$ as specified by (1) and (2), respectively, and

$$\mathcal{L}(y, \tilde{y}; w) \overset{\text{def}}{=} F\big(\hat{D}(y^\top w)\big) - \tilde{F}\big(\hat{D}(\tilde{y}^\top w)\big) - \tfrac{\lambda}{2} H(w^\top w) + \tfrac{\lambda}{2} \operatorname{tr} H(V^\top V), \qquad (4)$$

with $F(\cdot)$ and $\tilde{F}(\cdot)$ being two functions that quantify the performance of the discriminator and $\lambda > 0$ a constant. The function $H(\cdot)$ acts as a regularization term introduced to control the magnitudes of the parameters $w$ and $V$. It can be an arbitrary real-valued function, applied element-wise if the input is a matrix.

We consider a standard training algorithm that uses vanilla stochastic gradient descent/ascent (SGDA) to seek a solution of (3). To simplify the theoretical analysis, we consider an online (i.e., streaming) setting where each data sample $y_k$ is used only once. At step $k$, the model parameters $w_k$ and $V_k$ are updated using a new real sample $y_k$ and two fake samples $\tilde{y}_{2k}$ and $\tilde{y}_{2k+1}$, according to

$$w_{k+1} = w_k + \tfrac{\tau}{n} \nabla_{w_k} \mathcal{L}(y_k, \tilde{y}_{2k}; w_k)$$
$$V_{k+1} = V_k - \tfrac{\tilde{\tau}}{n} \nabla_{V_k} \mathcal{L}\big(y_k, G(\tilde{c}_{2k+1}, \tilde{a}_{2k+1}; V_k, \eta_G); w_k\big), \qquad (5)$$

where $\tilde{c}_{2k+1}, \tilde{a}_{2k+1}$ are the random variables that generate the fake sample $\tilde{y}_{2k+1}$ according to (2). The two parameters $\tau$ and $\tilde{\tau}$ in the above expressions control the learning rates of the discriminator and the generator, respectively. In (5), we only consider a single-step update for $w_k$. This is a special case of Algorithm 1 in [1] with the batch size $m$ set to 1. We note that the analysis presented in this paper can be naturally extended to the mini-batch case where $m$ is a finite number.

**Example 1.** We define $F(\hat{D}(x)) = \tilde{F}(\hat{D}(x)) = x^2/2$, and the regularizer function $H(A) = \log\cosh(A - I)$, where $I$ is the identity matrix of the same dimension as $A$, and the function $\log\cosh(\cdot)$ transforms the input matrix element-wise.
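For the quadratic loss of Example 1, the abstract updates (5) can be written out explicitly. The sketch below uses gradients of (4) that we derived ourselves for this special case (for instance, $\nabla_w \mathcal{L} = (y^\top w)\,y - (\tilde{y}^\top w)\,\tilde{y} - \lambda \tanh(w^\top w - 1)\,w$); the function name and parameter defaults are ours, and the noise-variance convention ($\sqrt{\eta}$ scaling) is an assumption.

```python
import numpy as np

def sgda_step(w, V, U, Lam, Lam_t, eta_T, eta_G, tau, tau_t, lam, rng):
    """One online SGDA step (5) for Example 1, where
    F(x) = F~(x) = x^2/2 and H(A) = log cosh(A - I), so H'(x) = tanh(x).
    Each step consumes one real sample and two fresh fake samples.
    Lam / Lam_t are the diagonal latent covariances of P_c / P_c~."""
    n, d = U.shape
    # One real sample and the first fake sample, for the discriminator.
    y  = U @ (np.sqrt(Lam) * rng.standard_normal(d)) \
         + np.sqrt(eta_T) * rng.standard_normal(n)
    yt = V @ (np.sqrt(Lam_t) * rng.standard_normal(d)) \
         + np.sqrt(eta_G) * rng.standard_normal(n)
    grad_w = (y @ w) * y - (yt @ w) * yt - lam * np.tanh(w @ w - 1.0) * w
    w = w + (tau / n) * grad_w                       # gradient ascent for D
    # A second, fresh fake sample for the generator update.
    ct = np.sqrt(Lam_t) * rng.standard_normal(d)
    yt2 = V @ ct + np.sqrt(eta_G) * rng.standard_normal(n)
    grad_V = -(yt2 @ w) * np.outer(w, ct) \
             + lam * V @ np.tanh(V.T @ V - np.eye(d))
    V = V - (tau_t / n) * grad_V                     # gradient descent for G
    return w, V

rng = np.random.default_rng(1)
n, d = 500, 2
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
w = rng.standard_normal(n) / np.sqrt(n)
V = rng.standard_normal((n, d)) / np.sqrt(n)
for _ in range(100):
    w, V = sgda_step(w, V, U, np.array([5.0, 3.0]), np.array([5.0, 3.0]),
                     2.0, 2.0, 0.2, 0.04, 2.0, rng)
```

Tracking $U^\top V_k$ and $U^\top w_k$ along such a run produces the trajectories that the limiting ODE of Section 3 is meant to predict.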
We use this specific regularizer to control the magnitude of the model parameters $V$ and $w$. In practice, any convex function with its minimum at zero would be fine. Our choice $\log\cosh(A - I)$ is just a convenient special case, since its derivative $H'(x) = \tanh(x)$ is smooth and bounded. Furthermore, as we let the regularization parameter $\lambda \to \infty$, the original problem (3) becomes the constrained min-max problem

$$\min_{\operatorname{diag}(V^\top V) = I_d} \; \max_{\|w\| = 1} \; \mathbb{E}_{y \sim P}\, \mathbb{E}_{\tilde{y} \sim \tilde{P}} \big[ (y^\top w)^2 - (\tilde{y}^\top w)^2 \big],$$

in which the operation $\operatorname{diag}(A)$ returns a matrix whose diagonal entries are the same as those of $A$ and whose off-diagonal entries are all zero. The condition $\operatorname{diag}(V^\top V) = I_d$ ensures that each column vector of $V$ is normalized.

## 3 Dynamics of the GAN

**Definition 1.** Let $X_k \overset{\text{def}}{=} [U, V_k, w_k] \in \mathbb{R}^{n \times (2d+1)}$. We call $X_k$ the microscopic state of the training process at iteration step $k$.

The microscopic state $X_k$ contains all the information about the training process. In fact, the sequence $\{X_k\}_{k=0,1,2,\ldots}$ forms a Markov chain on $\mathbb{R}^{n \times (2d+1)}$. This can be easily verified from the update rule of $X_k$ as defined in (5), in which the real data $y_k$ and fake data $\tilde{y}_k$ are drawn according to (1) and (2), respectively. The Markov chain is driven by the initial state $X_0$ and the sequence of random variables $\{(c_k, a_k, \tilde{c}_{2k}, \tilde{a}_{2k}, \tilde{c}_{2k+1}, \tilde{a}_{2k+1})\}_{k=0,1,2,\ldots}$.

**Definition 2.** Let $P_k \overset{\text{def}}{=} U^\top V_k$, $q_k \overset{\text{def}}{=} U^\top w_k$, $r_k \overset{\text{def}}{=} V_k^\top w_k$, $S_k \overset{\text{def}}{=} V_k^\top V_k$, and $z_k \overset{\text{def}}{=} w_k^\top w_k$. We call the tuple $\{P_k, q_k, r_k, S_k, z_k\}$ the macroscopic state of the Markov chain $X_k$ at step $k$.

These macroscopic quantities measure the cosine similarities among the feature vectors of the true model $U$, the generator $V_k$, and the discriminator $w_k$. For example, the cosine of the angle between the $i$th true feature (i.e., the $i$th column of $U$) and the $j$th feature estimated by the generator (i.e., the $j$th column of $V_k$) is $[P_k]_{i,j} / \sqrt{[S_k]_{j,j}}$, where $[P_k]_{i,j}$ is the inner product between the two feature vectors and $\sqrt{[S_k]_{j,j}}$ is the norm of the $j$th column of $V_k$.
(The columns of $U$ are unit vectors and need not be normalized here.) For simplicity, we introduce a compact notation for the macroscopic state:

$$M_k \overset{\text{def}}{=} X_k^\top X_k = \begin{pmatrix} I & P_k & q_k \\ P_k^\top & S_k & r_k \\ q_k^\top & r_k^\top & z_k \end{pmatrix}.$$

In what follows, we investigate the dynamics of the training algorithm (5) at both the macroscopic and the microscopic levels. At the macroscopic level, by examining the cosines of the angles, we study how closely the model parameters $V_k, w_k$ associated with the generator and discriminator can align with the ground-truth feature vectors, i.e., the columns of $U$. At the microscopic level, we study how the elements of the matrix $V_k$ and the vector $w_k$ evolve as a stochastic process. As our analysis will reveal, the mechanisms behind the two levels are different: the macroscopic dynamics is asymptotically deterministic, whereas the microscopic dynamics stays stochastic even as $n \to \infty$.

### 3.1 Macroscopic dynamics

We first study the asymptotic dynamics of the macroscopic state $M_k$. Our theoretical analysis is carried out under the following assumptions.

(A.1) The sequences $c_k \sim P_c$ and $\tilde{c}_k \sim P_{\tilde{c}}$ for $k = 0, 1, \ldots$ are i.i.d. random variables with bounded moments of all orders, and $\{c_k\}$ is independent of $\{\tilde{c}_k\}$.

(A.2) The sequences $\{a_k\}$ and $\{\tilde{a}_k\}$ for $k = 0, 1, \ldots$ are both independent Gaussian vectors with zero mean and covariance matrix $I_n$. Moreover, $\{a_k\}, \{\tilde{a}_k\}$ are independent of $\{c_k\}$ and $\{\tilde{c}_k\}$.

(A.3) The first-order derivative of $H(\cdot)$ and the derivatives up to fourth order of the functions $F(\hat{D}(\cdot))$ and $\tilde{F}(\hat{D}(\cdot))$ exist and are uniformly bounded.

(A.4) Let $[U, V_0, w_0]$ be the initial microscopic state. For $i = 1, 2, \ldots, n$, we have $\mathbb{E}\big[\sum_{\ell=1}^d \big([U]_{i,\ell}^4 + [V_0]_{i,\ell}^4\big) + [w_0]_i^4\big] \le C/n^2$, where $C$ is a constant not depending on $n$.

(A.5) The initial macroscopic state $M_0$ satisfies $\mathbb{E}\,\|M_0 - \overline{M}_0\| \le C/\sqrt{n}$, where $\overline{M}_0$ is a deterministic matrix and $C$ is a constant not depending on $n$.

We provide a few remarks on the above assumptions. In Assumption (A.1), $P_c$ and $P_{\tilde{c}}$ can be different.
For example, $c$ can be Gaussian while $\tilde{c}$ is uniform on $[-1, 1]^d$. Assumption (A.2) can be relaxed to non-Gaussian cases as long as all moments of $a_k$ and $\tilde{a}_k$ are bounded, but we use the Gaussian assumption here to simplify the proof. Assumption (A.4) requires that the elements of the true feature matrix $U$ and of the initial microscopic state $X_0$ are $O(1/\sqrt{n})$ numbers. Intuitively, this assumption ensures that $U$ and $X_0$ are generic matrices with $O(1)$ Frobenius norms (i.e., not matrices in which most elements are zero and only a few elements are large). Assumption (A.5) ensures that the initial macroscopic state converges to a deterministic value as the system size $n$ goes to infinity.

The following theorem shows that if the initial state is convergent, then the whole training process converges to a deterministic process as $n \to \infty$, characterized by an ODE.

**Theorem 1.** Fix $T > 0$. Under Assumptions (A.1)–(A.5),

$$\max_{0 \le k \le nT} \mathbb{E}\, \big\| M_k - M(\tfrac{k}{n}) \big\| \le \frac{C(T)}{\sqrt{n}}, \qquad (7)$$

where $C(T)$ is a constant that depends on $T$ but not on $n$, and

$$M(t) = \begin{pmatrix} I & P_t & q_t \\ P_t^\top & S_t & r_t \\ q_t^\top & r_t^\top & z_t \end{pmatrix} \in \mathbb{R}^{(2d+1) \times (2d+1)}$$

is a deterministic function. Moreover, $M(t)$ is the unique solution of the following ODE:

$$\frac{d}{dt} P_t = \tilde{\tau} \big( q_t \tilde{g}_t^\top + P_t L_t \big)$$
$$\frac{d}{dt} q_t = \tau \big( g_t - P_t \tilde{g}_t + q_t h_t \big)$$
$$\frac{d}{dt} r_t = \tau \big( P_t^\top g_t - S_t \tilde{g}_t + r_t h_t \big) + \tilde{\tau} \big( z_t \tilde{g}_t + L_t r_t \big) \qquad (8)$$
$$\frac{d}{dt} S_t = \tilde{\tau} \big( r_t \tilde{g}_t^\top + \tilde{g}_t r_t^\top + S_t L_t + L_t S_t \big)$$
$$\frac{d}{dt} z_t = 2\tau \big( q_t^\top g_t - r_t^\top \tilde{g}_t + z_t h_t \big) + \tau^2 b_t$$

with the initial condition $M(0) = \overline{M}_0$, where

$$g_t = \big\langle c\, f(c^\top q_t + e \sqrt{\eta_T z_t}) \big\rangle_{c,e}, \qquad \tilde{g}_t = \big\langle \tilde{c}\, \tilde{f}(\tilde{c}^\top r_t + e \sqrt{\eta_G z_t}) \big\rangle_{\tilde{c},e}, \qquad L_t = -\lambda\, \operatorname{diag}\big(H'(S_t)\big),$$
$$h_t = \eta_T \big\langle f'(c^\top q_t + e \sqrt{\eta_T z_t}) \big\rangle_{c,e} - \eta_G \big\langle \tilde{f}'(\tilde{c}^\top r_t + e \sqrt{\eta_G z_t}) \big\rangle_{\tilde{c},e} - \lambda H'(z_t), \qquad (9)$$
$$b_t = \eta_T \big\langle f^2(c^\top q_t + e \sqrt{\eta_T z_t}) \big\rangle_{c,e} + \eta_G \big\langle \tilde{f}^2(\tilde{c}^\top r_t + e \sqrt{\eta_G z_t}) \big\rangle_{\tilde{c},e}.$$

The two functions $f, \tilde{f}$ stand for $f(x) = \frac{d}{dx} F(\hat{D}(x))$ and $\tilde{f}(x) = \frac{d}{dx} \tilde{F}(\hat{D}(x))$, and $f'$, $\tilde{f}'$ and $H'$ are the derivatives of $f$, $\tilde{f}$ and $H$, respectively. The two constants $\eta_T$ and $\eta_G$ are the strengths of the noise in the true data model and the generator, respectively.
The brackets $\langle \cdot \rangle_{c,e}$ and $\langle \cdot \rangle_{\tilde{c},e}$ denote averages over the random variables $c \sim P_c$, $\tilde{c} \sim P_{\tilde{c}}$, and $e \sim \mathcal{N}(0, 1)$, where $P_c$ and $P_{\tilde{c}}$ are the distributions involved in defining the generative model (1) and the generator (2).

This theorem implies that for each $k = \lfloor tn \rfloor$ with $t \in [0, T]$, the macroscopic state $M_k$ converges to the deterministic value $M(t)$, and the convergence rate is $O(1/\sqrt{n})$. The limiting ODE (8) for the macroscopic states involves $O(d^2)$ variables, where $d$ is the number of internal features, often assumed to be a finite number much less than $n$. This ODE is essentially different from the ODE derived in the small-learning-rate limit [11–16], in which the number of variables is $O(n)$.

The complete proof can be found in the Supplementary Materials; we briefly sketch it here. First, we note that $M_k$ is a discrete-time stochastic process driven by the Markov chain $X_k$. We then apply a martingale decomposition to $M_k$ and get

$$M_{k+1} - M_k = \tfrac{1}{n} \phi(M_k) + \big( M_{k+1} - \mathbb{E}_k M_{k+1} \big) + \big[ \mathbb{E}_k M_{k+1} - M_k - \tfrac{1}{n} \phi(M_k) \big],$$

where the matrix-valued function $\phi(M)$ collects the right-hand sides of the ODE (8), and $\mathbb{E}_k$ denotes the conditional expectation given the state of the Markov chain $X_k$. Finally, we show that the martingale $\sum_{k'=0}^{k} \big( M_{k'+1} - \mathbb{E}_{k'} M_{k'+1} \big)$ and the higher-order term $\mathbb{E}_k M_{k+1} - M_k - \tfrac{1}{n}\phi(M_k)$ have no contribution as $n$ goes to infinity.

Due to a limitation of our current proof, the constant $C(T)$ in (7) grows exponentially as $T$ increases. This is not a problem for any finite $T$, but it may cause difficulties in studying the long-time behavior as $T \to \infty$. However, if we impose a sufficiently large regularization parameter $\lambda$ to limit the norms of the microscopic weights $V_k$ and $w_k$, then the macroscopic state $M_k$ is bounded, since $[M_k]_{i,j}^2 \le [M_k]_{i,i}\,[M_k]_{j,j}$. In our experiments, $\lambda > 1$ is sufficient. In this case, the constant $C(T)$ is bounded independently of $T$. In Example 1, when $\lambda \to \infty$, $[M_k]_{i,i} = 1$, and therefore $[M_k]_{i,j}^2 \le 1$ and $C(T) \le (2d+1)^2$, where the number of features $d$ is considered a constant not growing with $n$. This justifies the fixed-point analysis of the ODE discussed in Section 4, which reflects the long-time training behavior. A better proof strategy that removes this dependence on $T$ is also possible, e.g., [26].

Figure 1: Macroscopic dynamics of the GAN with $d = 2$ features. $[P_k]_{i,j}$ is the cosine of the angle between the $i$th column vector of the real feature matrix $U$ and the $j$th column vector of the generator's weight matrix $V_k$. Similarly, $[q_k]_i$ is the cosine of the angle between the $i$th column vector of $U$ and the discriminator's weight vector $w_k$. Colored dots are results from experiments, and the curves tracing these dots are our theoretical predictions by the ODE (8). From left to right, the variance of the background noise is $\eta_T = \eta_G = 2, 1, 4$, respectively; the other parameters are the same. The left figure is an example of successful training, where two features (red and blue dots) are retrieved by the generator. The center figure shows an oscillating training, which happens when the noise is weak. The right figure shows a mode-collapsing state, in which only the first feature is estimated by the generator.

**Numerical verification.** We verify the theoretical prediction given by the ODE (8) via numerical simulations under the settings stated in Example 1. The results are shown in Figure 1. The number of features is $d = 2$, and $c_k$ and $\tilde{c}_k$ are both Gaussian with zero mean and covariance $\operatorname{diag}([5, 3])$. The dimension is $n = 5{,}000$, and the learning rates of the generator and the discriminator are $\tilde{\tau} = 0.04$ and $\tau = 0.2$, respectively. Testing different noise strengths $\eta_T = \eta_G = 2, 1, 4$, we observe at least three nontrivial dynamical patterns: success, oscillation, and mode collapse. In all these experiments, our theoretical predictions match the actual trajectories of the macroscopic states very well.

Let us take a closer look at the successful case shown in the left panel of Figure 1.
The dynamics can be split into four stages. In the first stage, the discriminator learns the first feature of the true model: $[q_t]_1$ quickly increases. In the second stage, the generator starts to learn the first feature and the discriminator is deceived: $[P_t]_{1,1}^2$ increases and $[q_t]_1^2$ decreases. Once the discriminator completely forgets the first feature, i.e., $[q_t]_1 \to 0$, the third stage begins: the discriminator starts to learn the second feature, and $[q_t]_2^2$ increases. Finally, in the last stage, the generator learns the second feature and the discriminator is fooled again: $[P_t]_{2,2}^2$ increases and $[q_t]_2^2$ decreases to 0. Eventually, the generator learns both features and the discriminator is completely fooled. The training ends up at a stationary state where $q_t = 0$ and $P_t$ is nearly an identity matrix. Interestingly, this experiment shows that the generator learns the features sequentially given a single-feature discriminator. This may be a reason why, in practice, the discriminator's structure can be much simpler than the generator's.

### 3.2 Microscopic dynamics

In this section, we study how the elements of $X_k = [U, V_k, w_k]$ evolve during the training process. Instead of studying the trajectory of $X_k$ directly, we study the evolution of the empirical measure of the microscopic states, defined as

$$\mu_k(\hat{u}, \hat{v}, \hat{w}) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n \delta\big( \hat{u} - \sqrt{n}\, [U]_{i,:},\; \hat{v} - \sqrt{n}\, [V_k]_{i,:},\; \hat{w} - \sqrt{n}\, [w_k]_i \big),$$

where $\delta(\cdot)$ is a Dirac measure on $\mathbb{R}^{2d+1}$ and $[U]_{i,:}, [V_k]_{i,:}$ are the $i$th rows of $U$ and $V_k$, respectively. The scaling factor $\sqrt{n}$ in the Dirac measure is introduced because $[U]_{i,\ell}$, $[V_k]_{i,\ell}$ and $[w_k]_i$ are $O(1/\sqrt{n})$ quantities.
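The second moments of this empirical measure recover exactly the macroscopic state of Definition 2, e.g., $\langle \mu_k, \hat{u}\hat{v}^\top \rangle = U^\top V_k = P_k$. A minimal sketch of this bookkeeping (function name ours):

```python
import numpy as np

def empirical_moments(U, V, w):
    """Second moments of the empirical measure of the rescaled rows
    (sqrt(n) [U]_i, sqrt(n) [V]_i, sqrt(n) [w]_i).  The 1/n average
    cancels the sqrt(n)^2 rescaling, so these moments coincide with
    the macroscopic state P, q, r, S, z of Definition 2."""
    n = U.shape[0]
    uu = np.sqrt(n) * U            # samples of u-hat, shape (n, d)
    vv = np.sqrt(n) * V            # samples of v-hat, shape (n, d)
    ww = np.sqrt(n) * w            # samples of w-hat, shape (n,)
    P = (uu[:, :, None] * vv[:, None, :]).mean(axis=0)   # <u v^T>
    q = (uu * ww[:, None]).mean(axis=0)                  # <u w>
    r = (vv * ww[:, None]).mean(axis=0)                  # <v w>
    S = (vv[:, :, None] * vv[:, None, :]).mean(axis=0)   # <v v^T>
    z = (ww ** 2).mean()                                 # <w^2>
    return P, q, r, S, z

rng = np.random.default_rng(0)
n, d = 1000, 2
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V = rng.standard_normal((n, d)) / np.sqrt(n)
w = rng.standard_normal(n) / np.sqrt(n)
P, q, r, S, z = empirical_moments(U, V, w)
```

By construction, `P` equals `U.T @ V` up to floating-point error; the same identities hold for the other four blocks, which is the content of relation (11) below.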
We next embed the discrete-time measure-valued stochastic process $\mu_k$ into a continuous-time process by defining $\mu^{(n)}_t \overset{\text{def}}{=} \mu_k(\hat{u}, \hat{v}, \hat{w})$ with $k = \lfloor nt \rfloor$. Following the general technical approach presented in [17], we can show that, under the same assumptions as in Theorem 1 and for any given $T > 0$, the sequence of measure-valued processes $\{\{\mu^{(n)}_t\}_{t \in [0,T]}\}_n$ converges weakly to a deterministic process $\{\mu_t\}_{t \in [0,T]}$. In addition, $\mu_t$ is the measure of the solution of the stochastic differential equation

$$d\hat{v}_t = \tilde{\tau} \big( \hat{w}_t \tilde{g}_t + L_t \hat{v}_t \big)\, dt$$
$$d\hat{w}_t = \tau \big( \hat{u}^\top g_t - \hat{v}_t^\top \tilde{g}_t + \hat{w}_t h_t \big)\, dt + \tau \sqrt{b_t}\, dB_t, \qquad (10)$$

where $(\hat{u}_0, \hat{v}_0, \hat{w}_0) \sim \mu_0$ and $B_t$ is a standard Brownian motion. The functions $g_t, \tilde{g}_t, L_t, h_t$ and $b_t$ are defined in (9), in which the macroscopic quantities $P_t, S_t, q_t, z_t, r_t$ are computed as

$$P_t = \langle \mu_t, \hat{u} \hat{v}^\top \rangle, \quad S_t = \langle \mu_t, \hat{v} \hat{v}^\top \rangle, \quad q_t = \langle \mu_t, \hat{u} \hat{w} \rangle, \quad z_t = \langle \mu_t, \hat{w}^2 \rangle, \quad r_t = \langle \mu_t, \hat{v} \hat{w} \rangle, \qquad (11)$$

where $\langle \mu_t, \cdot \rangle$ denotes the expectation with respect to the measure $\mu_t$. The SDE (10) reveals the intuitive meaning of the functions defined in (9): $g_t, \tilde{g}_t, L_t, h_t$ are drift coefficients of the SDE, and $b_t$ is its diffusion coefficient. We also note that if one follows the analysis in the small-learning-rate limit [11–16], one obtains an ODE for the microscopic states. Compared to our SDE formula, the diffusion term $\tau \sqrt{b_t}\, dB_t$ is missing in those works, and therefore the effect of the noise cannot be analyzed.

Moreover, the deterministic measure $\mu_t$ is the unique solution of the following PDE (given in its weak form): for any bounded smooth test function $\varphi(\hat{u}, \hat{v}, \hat{w})$,

$$\frac{d}{dt} \big\langle \mu_t, \varphi(\hat{u}, \hat{v}, \hat{w}) \big\rangle = \tilde{\tau} \big\langle \mu_t, \big( \hat{w} \tilde{g}_t + L_t \hat{v} \big)^\top \nabla_{\hat{v}}\, \varphi \big\rangle + \tau \big\langle \mu_t, \big( \hat{u}^\top g_t - \hat{v}^\top \tilde{g}_t + h_t \hat{w} \big)\, \partial_{\hat{w}}\, \varphi \big\rangle + \frac{\tau^2}{2}\, b_t \big\langle \mu_t, \partial^2_{\hat{w}}\, \varphi \big\rangle, \qquad (12)$$

where $q_t, r_t, S_t$, and $z_t$ are defined in (11), and the functions $g_t, \tilde{g}_t, b_t, h_t$ and $L_t$ are defined in (9). We refer readers to [17] for a general framework for rigorously establishing the above scaling limit. The connection between the microscopic and macroscopic dynamics can also be derived from the weak formulation of the PDE.
Letting $\varphi$ be each element of $\hat{u}\hat{v}^\top$, $\hat{u}\hat{w}$, $\hat{v}\hat{w}$, $\hat{v}\hat{v}^\top$, $\hat{w}^2$ and substituting these choices into the PDE (12), we can recover the ODE (8). In the setting of this paper, the macroscopic dynamics thus enjoys a closed ODE: we can predict the macroscopic states without solving the PDE or the SDE at the microscopic scale. However, in a more general setting, e.g., when we add a regularizer other than the $L_2$ type, the ODE itself may not be closed. In that case, one has to solve the PDE directly.

**Numerical verification.** We verify the predictions given by the PDE (12) by setting $d = 1$ and using a special choice of the $n \times 1$ target feature matrix $U$ whose elements are all $1/\sqrt{n}$, with $n = 10{,}000$. We also set the initial condition $\mu_0(\hat{v}, \hat{w} \mid \hat{u} = 1)$ to be a Gaussian distribution. (When $d = 1$, the macroscopic quantities $P_t, q_t, r_t, S_t$ reduce to scalars, so we remove their boldface here.) In this case, the PDE (12) admits a particularly simple analytical solution: at any time $t$, the solution $\mu_t(\hat{v}, \hat{w} \mid \hat{u} = 1)$ is a Gaussian distribution whose first and second moments are given by

$$\mathbb{E}_{\mu_t(\hat{v}, \hat{w} \mid \hat{u}=1)} \begin{pmatrix} \hat{v} \\ \hat{w} \end{pmatrix} = \begin{pmatrix} P_t \\ q_t \end{pmatrix}, \qquad \mathbb{E}_{\mu_t(\hat{v}, \hat{w} \mid \hat{u}=1)} \begin{pmatrix} \hat{v} \\ \hat{w} \end{pmatrix} \begin{pmatrix} \hat{v} \\ \hat{w} \end{pmatrix}^{\!\top} = \begin{pmatrix} S_t & r_t \\ r_t & z_t \end{pmatrix}.$$

Figure 2 overlays the contours of the probability distribution $\mu_t(\hat{v}, \hat{w} \mid \hat{u} = 1)$ at different times $t$ over the point clouds of the actual experimental data $(\sqrt{n}[w_k]_i, \sqrt{n}[V_k]_{i,1})$. We can see that the theoretical prediction given by (12) is in excellent agreement with the simulation results.

## 4 Local Stability Analysis of the ODE for the Macroscopic States

In this section, we study how the parameters, such as the learning rates $\tau$ and $\tilde{\tau}$ and the noise strengths $\eta_G$ and $\eta_T$, affect the training algorithm. We focus on the concrete model described in Example 1, so that we can obtain analytical solutions. In order to further reduce the degrees of freedom of the ODE (8), we let the regularization parameter $\lambda \to \infty$. In this case, the vector $w_k$ and all column vectors of $V_k$ are always normalized.

Figure 2: The evolution of the microscopic states at $t = 0$, $10$, $100$, and $150$.
For each fixed $t$, the red points in the corresponding panel represent the values of $(\hat{v}, \hat{w}) = (\sqrt{n}[V_k]_{i,1}, \sqrt{n}[w_k]_i)$ for $i = 1, 2, \ldots, n$, where $k = \lfloor nt \rfloor$. The blue ellipses illustrate the contours corresponding to one, two, and three standard deviations of the 2-D Gaussian distribution predicted by the PDE (12).

Thus $z_k = 1$ and $[S_k]_{i,i} = 1$, and the macroscopic state is described by $P_k$, $q_k$, $r_k$ and the off-diagonal terms of $S_k$. Correspondingly, the ODE in Theorem 1 reduces to

$$\frac{d}{dt} P_t = \tilde{\tau} \big( q_t r_t^\top \tilde{\Lambda} + P_t L_t \big)$$
$$\frac{d}{dt} q_t = \tau \big( \Lambda q_t - P_t \tilde{\Lambda} r_t + h_t q_t \big)$$
$$\frac{d}{dt} r_t = \tau \big( P_t^\top \Lambda q_t - S_t \tilde{\Lambda} r_t + h_t r_t \big) + \tilde{\tau} \big( \tilde{\Lambda} + L_t \big) r_t \qquad (13)$$
$$\frac{d}{dt} S_t = \tilde{\tau} \big( r_t r_t^\top \tilde{\Lambda} + \tilde{\Lambda} r_t r_t^\top + S_t L_t + L_t S_t \big),$$

where $\Lambda$ and $\tilde{\Lambda}$ are the covariance matrices of the distributions $P_c$ and $P_{\tilde{c}}$, respectively, and

$$h_t = \Big(1 - \frac{\tau \eta_G}{2}\Big) r_t^\top \tilde{\Lambda} r_t - \Big(1 + \frac{\tau \eta_T}{2}\Big) q_t^\top \Lambda q_t - \tau\, \frac{\eta_G^2 + \eta_T^2}{2}, \qquad L_t = -\operatorname{diag}\big( r_t r_t^\top \tilde{\Lambda} \big), \qquad (14)$$

in which $\eta_T$ and $\eta_G$ are the variances of the noise in the true data model and the generator, respectively. The derivation from the ODE (8) to (13) is presented in the Supplementary Materials.

Next, we discuss under what conditions the GAN can reach a desirable training state by studying the local stability of a particular type of fixed point of the ODE (13). Perfect estimation by the generator corresponds to $P_t$ being an identity matrix (up to a permutation of rows and columns). A completely failed state corresponds to $P^* = 0$. Furthermore, it is easy to verify that if $q_t = r_t = 0$, the ODE (13) is stationary for any $P_t = P^*$.

**Claim 1.** The macroscopic states $P_t = P^*$, $q = r = 0$ for all valid $P^*$ are always fixed points of the ODE (13). Furthermore, a sufficient condition for the perfect-estimation state $P^* = I$, $q = r = 0$ to be locally stable and the failed state $P^* = 0$, $q = r = 0$ to be unstable is

$$\frac{1}{2} \max_\ell \big\{ \Lambda_\ell - \tilde{\Lambda}_\ell + \alpha \tilde{\Lambda}_\ell \big\} \le \tau \eta^2 < \min_\ell \Lambda_\ell, \qquad (15)$$

where $\alpha = \tilde{\tau}/\tau$ is the learning-rate ratio, $\eta^2 = \tfrac{1}{2}(\eta_T^2 + \eta_G^2)$, and $\Lambda_\ell = [\Lambda]_{\ell,\ell}$, $\tilde{\Lambda}_\ell = [\tilde{\Lambda}]_{\ell,\ell}$.

The proof can be found in the Supplementary Materials.
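Condition (15) can be checked numerically for a given parameter setting. The sketch below encodes our reading of the condition, with $\alpha = \tilde{\tau}/\tau$ and $\eta^2 = (\eta_T^2 + \eta_G^2)/2$; the function name and phase labels are ours.

```python
import numpy as np

def training_phase(Lam, Lam_t, tau, tau_t, eta_T, eta_G):
    """Predicted long-time phase from condition (15).  Lam and Lam_t
    are the diagonals of the latent covariances of P_c and P_c~.
    A sketch of the stability check only, not of the full phase
    diagram in the supplementary materials."""
    alpha = tau_t / tau                      # learning-rate ratio
    eta2 = 0.5 * (eta_T ** 2 + eta_G ** 2)   # effective noise level
    lower = 0.5 * np.max(Lam - Lam_t + alpha * Lam_t)
    upper = np.min(Lam)
    if tau * eta2 >= upper:
        return "mode collapse / failure"     # right inequality violated
    if tau * eta2 < lower:
        return "oscillation"                 # left inequality violated
    return "success"

Lam = np.array([5.0, 3.0])
print(training_phase(Lam, Lam, tau=0.2, tau_t=0.04, eta_T=2.0, eta_G=2.0))
```

Under this reading, the Figure 1 parameters ($\Lambda = \tilde{\Lambda} = \operatorname{diag}([5, 3])$, $\tau = 0.2$, $\tilde{\tau} = 0.04$) give success, oscillation, and mode collapse for $\eta_T = \eta_G = 2, 1, 4$, respectively, matching the three panels.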
If the right inequality in (15) is violated, any feature $\ell$ with signal-to-noise ratio $[\Lambda]_{\ell,\ell} < \tau \eta^2$ is not learned by the generator, resulting in mode collapse. The right panel of Figure 1 demonstrates this situation, where only one of the two features is recovered. If the left inequality in (15) is violated, the training process can be trapped in an oscillation phase; this phenomenon is shown in the middle panel of Figure 1. This result indicates that a proper amount of background noise can help avoid oscillation and stabilize the training process. In fact, the trick of injecting additional noise has been used in practice to train multi-layer GANs [27]. To the best of our knowledge, our paper is the first theoretical study of why noise can have such a positive effect, from a dynamical perspective.

In experiments, due to the presence of the noise, the training does not end at the perfect-recovery point but converges to another fixed point nearby. This is because the perfect state is only marginally stable, as the Jacobian matrix always has zero eigenvalues. It indicates that there are other locally stable fixed points near $P^* = I$. In fact, all points in the hyper-rectangular region satisfying $q = r = 0$ and $p^*_\ell \le [P^*]_{\ell,\ell} \le 1$, $\ell = 1, 2, \ldots, d$ are locally stable, for some critical $p^*_\ell$. In the matched case $\Lambda_\ell = \tilde{\Lambda}_\ell$, we have

$$p^*_\ell = \left[ \frac{\big( \Lambda_\ell - \tau \eta^2 \big)\big( \tilde{\Lambda}_\ell + \tau \eta^2 - \alpha \tilde{\Lambda}_\ell \big)}{\Lambda_\ell \tilde{\Lambda}_\ell} \right]^{1/2}, \qquad \alpha = \frac{\tilde{\tau}}{\tau}, \quad \eta^2 = \frac{1}{2}\big( \eta_T^2 + \eta_G^2 \big).$$

Starting from a point near the origin, numerical solution of the ODE shows that the training process ends up at the corner of this hyper-rectangle, i.e., $P^* = \operatorname{diag}(\{p^*_\ell,\ \ell = 1, 2, \ldots, d\})$. In the small-learning-rate limit $\tau \to 0$ with the learning-rate ratio $\alpha \to 0$, we get perfect recovery $P^* = I$. The limit $\tau \to 0$, $\alpha \to 0$ was studied in the small-learning-rate analysis with two-timescale training [15], and the result is consistent; our analysis, however, also covers the situations with finite $\tau$ and $\alpha$. In addition, we provide a phase-diagram analysis of the single-feature case $d = 1$ in the Supplementary Materials.
All possible fixed points in this case are enumerated and their local stability is analyzed. This helps us understand the successful-recovery condition (15), which is the intersection of the informative phases in which each feature can be recovered individually.

## 5 Conclusion

We present a simple high-dimensional model of GAN with an exactly analyzable training process. Using the tools of scaling limits of stochastic processes, we show that the macroscopic state associated with the training process converges to a deterministic process characterized as the unique solution of an ODE, whereas the microscopic state remains stochastic and is described by an SDE, whose time-varying probability measure is characterized by a limiting PDE. Indeed, it is a common picture in statistical physics that the macroscopic states of large systems tend to converge to deterministic values due to self-averaging. These notions, especially mean-field dynamics, have been applied to analyzing neural networks in both shallow [19, 20] and deep models [28]. However, this mean-field regime was not considered in previous analyses of GANs. For example, a series of recent works, e.g., [11–16], considers a different scaling regime where the learning rate goes to zero but the system dimension $n$ stays fixed. In that regime, the microscopic dynamics is deterministic even in the presence of the microscopic noise. In contrast, we study the regime where the learning rate is fixed but the dimension $n \to \infty$. This setting allows us to quantify the effect of training noise on the learning dynamics.

In this paper, we only consider a linear generator with a latent variable $\tilde{c}$ drawn from a fixed distribution $P_{\tilde{c}}$, but our analysis can be extended to a more complex non-linear model with a learnable latent-variable distribution. Specifically, in order to compute derivatives w.r.t.
P_c̃, the latent variable c̃ ∼ P_c̃ should be reparameterized by a deterministic function c̃ = f(z; θ), where θ is a learnable parameter and z is a random variable drawn from a simple, fixed distribution. For example, a Gaussian mixture with L equal-probability modes can be parameterized as c̃ = Σ_{ℓ=1}^{L} (µ_ℓ + Σ_ℓ ε_ℓ) β_ℓ, where µ_ℓ and Σ_ℓ are two learnable parameters representing the mean and covariance of the ℓth mode, respectively, ε_ℓ ∼ N(0, I), and β_ℓ is a random indicator variable such that exactly one of β_1, β_2, ..., β_L equals 1 and the others are 0. In practice, f(z; θ) is implemented by a multilayer neural network. Our analysis can be naturally extended to this model as long as the dimensions of c̃ and θ stay finite as the data dimension n goes to infinity. More challenging situations, in which the dimension of θ is proportional to n, will be explored in future work. Although our analysis is carried out in the asymptotic setting, numerical experiments show that our theoretical predictions accurately capture the actual performance of the training algorithm at moderate dimensions. Our analysis also reveals several distinct phases of the training process that depend strongly on the choice of the learning rates and the noise strength; in particular, it yields a condition on the learning rates and the strength of the noise for successful training, and violating this condition results in either oscillation or mode collapse. Despite its simplicity, the proposed model of GAN provides a new perspective and some insights for the study of more realistic models and more involved training algorithms.

Acknowledgments

This work was supported by the US Army Research Office under contract W911NF-16-1-0265 and by the US National Science Foundation under grants CCF-1319140, CCF-1718698, and CCF-1910410.

References

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y.
Bengio, Generative Adversarial Nets, in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[2] M. Arjovsky, S. Chintala, and L. Bottou, Wasserstein Generative Adversarial Networks, in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1–32.
[3] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, Are GANs created equal? A large-scale study, in Advances in Neural Information Processing Systems, 2018, pp. 698–707.
[4] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[5] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[6] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, Generative Adversarial Text to Image Synthesis, in Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 1060–1069.
[7] S. Arora, R. Ge, Y. Liang, and Y. Zhang, Generalization and Equilibrium in Generative Adversarial Nets, in International Conference on Machine Learning, 2017, pp. 224–232.
[8] M. Arjovsky and L. Bottou, Towards Principled Methods for Training Generative Adversarial Networks, arXiv preprint arXiv:1701.04862, 2017.
[9] S. Feizi, C. Suh, F. Xia, and D. Tse, Understanding GANs: the LQG Setting, arXiv preprint arXiv:1710.10793, 2017.
[10] J. Li, A. Madry, J. Peebles, and L. Schmidt, Towards Understanding the Dynamics of Generative Adversarial Networks, arXiv preprint arXiv:1706.09884, 2017.
[11] L. Mescheder, A. Geiger, and S. Nowozin, Which training methods for GANs do actually converge? in International Conference on Machine Learning, 2018, pp. 3478–3487.
[12] L. Mescheder, S.
Nowozin, and A. Geiger, The numerics of GANs, in Advances in Neural Information Processing Systems, 2017, pp. 1823–1833.
[13] V. Nagarajan and J. Z. Kolter, Gradient descent GAN optimization is locally stable, in Advances in Neural Information Processing Systems, 2017, pp. 5591–5600.
[14] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann, Stabilizing Training of Generative Adversarial Networks through Regularization, in Advances in Neural Information Processing Systems, 2017, pp. 2015–2025.
[15] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, in Advances in Neural Information Processing Systems, 2017, pp. 6629–6640.
[16] E. V. Mazumdar, M. I. Jordan, and S. S. Sastry, On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games, arXiv preprint arXiv:1901.00838, 2019.
[17] C. Wang, J. Mattingly, and Y. M. Lu, Scaling Limit: Exact and Tractable Analysis of Online Learning Algorithms with Applications to Regularized Regression and PCA, arXiv preprint arXiv:1712.04332, 2017.
[18] G. O. Roberts, A. Gelman, and W. R. Gilks, Weak convergence and optimal scaling of random walk Metropolis algorithms, Annals of Applied Probability, vol. 7, no. 1, pp. 110–120, 1997.
[19] D. Saad and S. A. Solla, Exact Solution for On-Line Learning in Multilayer Neural Networks, Phys. Rev. Lett., vol. 74, no. 21, pp. 4337–4340, 1995.
[20] M. Biehl and H. Schwarze, Learning by on-line gradient descent, Journal of Physics A, vol. 28, no. 3, pp. 643–656, 1995.
[21] C. Wang and Y. M. Lu, Online Learning for Sparse PCA in High Dimensions: Exact Dynamics and Phase Transitions, in 2016 IEEE Information Theory Workshop (ITW), 2016, pp. 186–190.
[22] C. Wang, Y. C. Eldar, and Y. M. Lu, Subspace estimation from incomplete observations: A high-dimensional analysis, IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 6, pp. 1240–1252, Dec. 2018.
[23] C.
Wang and Y. M. Lu, The Scaling Limit of High-Dimensional Online Independent Component Analysis, in Advances in Neural Information Processing Systems, 2017, pp. 6641–6650.
[24] S. Mei, A. Montanari, and P.-M. Nguyen, A Mean Field View of the Landscape of Two-Layers Neural Networks, arXiv preprint arXiv:1804.06561, 2018.
[25] I. Johnstone and A. Lu, On consistency and sparsity for principal components analysis in high dimensions, Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009.
[26] B. Jourdain, T. Lelièvre, and B. Miasojedow, Optimal scaling for the transient phase of Metropolis–Hastings algorithms: The longtime behavior, Bernoulli, vol. 20, no. 4, pp. 1930–1978, 2014.
[27] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár, Amortised MAP inference for image super-resolution, in International Conference on Learning Representations, 2017.
[28] P.-M. Nguyen, Mean field limit of the learning dynamics of multilayer neural networks, arXiv preprint arXiv:1902.02880, 2019.