# SGD Learns One-Layer Networks in WGANs

Qi Lei (1), Jason D. Lee (2), Alexandros G. Dimakis (1), Constantinos Daskalakis (3)

(1) University of Texas at Austin, TX. (2) Princeton University, NJ. (3) Massachusetts Institute of Technology, MA. Correspondence to: Qi Lei, Jason D. Lee.

*Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).*

## Abstract

Generative adversarial networks (GANs) are a widely used framework for learning generative models. Wasserstein GANs (WGANs), one of the most successful variants of GANs, require solving a min-max optimization problem to global optimality, but are in practice successfully trained using stochastic gradient descent-ascent. In this paper, we show that, when the generator is a one-layer network, stochastic gradient descent-ascent converges to a global solution with polynomial time and sample complexity.

## 1. Introduction

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a prominent framework for learning generative models of complex, real-world distributions given samples from these distributions. GANs and their variants have been successfully applied to numerous datasets and tasks, including image-to-image translation (Isola et al., 2017), image super-resolution (Ledig et al., 2017), domain adaptation (Tzeng et al., 2017), probabilistic inference (Dumoulin et al., 2016), compressed sensing (Bora et al., 2017), and many more. These advances owe in part to the success of Wasserstein GANs (WGANs) (Arjovsky et al., 2017; Gulrajani et al., 2017), which leverage the neural-net-induced integral probability metric to better measure the difference between a target and a generated distribution.

Along with the aforementioned empirical successes, there have been theoretical studies of the statistical properties of GANs; see e.g. (Zhang et al., 2018; Arora et al., 2017; 2018; Bai et al., 2018; Dumoulin et al., 2016) and their references. These works have shown that, with an appropriate design of the generator and discriminator, the global optimum of the WGAN objective identifies the target distribution with low sample complexity. However, these results cannot be algorithmically attained via practical GAN training algorithms.

On the algorithmic front, prior work has focused on the stability and convergence properties of gradient descent-ascent (GDA) and its variants in GAN training and in more general min-max optimization problems; see e.g. (Nagarajan & Kolter, 2017; Heusel et al., 2017; Mescheder et al., 2017; 2018; Daskalakis et al., 2017; Daskalakis & Panageas, 2018a;b; Gidel et al., 2019; Liang & Stokes, 2019; Mokhtari et al., 2019; Jin et al., 2019; Lin et al., 2019; Lei et al., 2020; 2019; 2017) and their references. These works have studied conditions under which GDA converges to a globally optimal solution for convex-concave objectives, or is locally stable in the non-convex non-concave setting. These results do not ensure convergence to a globally optimal generator, or in fact even convergence to a locally optimal generator. A natural question is thus:

*Are GANs able to learn high-dimensional distributions in polynomial time and polynomial/parametric sample complexity, and thus bypass the curse of dimensionality?*
The aforementioned prior works stop short of this goal due to (a) the intractability of min-max optimization in the non-convex setting, and (b) the curse of dimensionality in learning with the Wasserstein distance in high dimensions (Bai et al., 2018). A notable exception is Feizi et al. (2017), which shows that for WGANs with a linear generator and quadratic discriminator, GDA succeeds in learning a Gaussian using polynomially many samples in the dimension. In the same vein, we are the first, to our knowledge, to study the global convergence properties of stochastic GDA in the GAN setting and to establish such guarantees for nonlinear generators. In particular, we study the WGAN formulation for learning a single-layer generative model with some reasonable choices of activations, including tanh, sigmoid, and leaky ReLU.

**Our contributions.** For a WGAN with a one-layer generator network using an activation from a large family of functions and a quadratic discriminator, we show that stochastic gradient descent-ascent learns a target distribution using polynomial time and samples, under the assumption that the target distribution is realizable in the architecture of the generator. This is achieved by simultaneously satisfying the following two criteria:

1. proving that stochastic gradient descent-ascent attains a globally optimal generator in the metric induced by the discriminator, and
2. proving that an appropriate design of the discriminator ensures a parametric $O(1/\sqrt{n})$ statistical rate (Zhang et al., 2018; Bai et al., 2018) that matches the lower bound for learning one-layer generators shown in (Wu et al., 2019).

## 2. Related Work

We briefly review relevant results on GAN training and on learning generative models.

### 2.1. Optimization viewpoint

For standard GANs and WGANs with appropriate regularization, (Nagarajan & Kolter, 2017), (Mescheder et al., 2017) and (Heusel et al., 2017) establish sufficient conditions for local convergence and stability of GAN training: if, at an equilibrium point, the Jacobian of the associated gradient vector field has only eigenvalues with negative real part, GAN training converges locally for small enough learning rates. A follow-up paper by (Mescheder et al., 2018) shows the necessity of these conditions by identifying a counterexample on which gradient-descent-based GAN optimization fails to converge locally. The lack of global convergence prevents these analyses from yielding any guarantees for learning the real distribution.

The work of (Feizi et al., 2017) described above has goals similar to ours, namely understanding the convergence properties of basic dynamics in simple WGAN formulations. However, they only consider linear generators, which restrict the WGAN model to learning a Gaussian. Our work goes a step further, considering WGANs whose generators are one-layer neural networks with a broad selection of activations. We show that with a proper gradient-based algorithm, we can still recover the ground-truth parameters of the underlying distribution.

More broadly, WGANs typically result in nonconvex-nonconcave min-max optimization problems. In these problems, a global min-max solution may not exist, and there are various notions of local min-max solutions, namely local-min-local-max solutions (Daskalakis & Panageas, 2018b) and local min solutions of the max objective (Jin et al., 2019), the latter being guaranteed to exist under mild conditions.
In fact, (Lin et al., 2019) show that GDA is able to find stationary points of the max objective for nonconvex-concave objectives. Given that GDA may not even converge for convex-concave objectives, another line of work has studied variants of GDA that exhibit global convergence to the min-max solution (Daskalakis et al., 2017; Daskalakis & Panageas, 2018a; Gidel et al., 2019; Liang & Stokes, 2019; Mokhtari et al., 2019); such convergence is established for GDA variants that add negative momentum to the dynamics. While the convergence of GDA with negative momentum is only shown in convex-concave settings, there is experimental evidence that it improves GAN training (Daskalakis et al., 2017; Gidel et al., 2019).

### 2.2. Statistical viewpoint

Several works have studied the issue of mode collapse: one might doubt the ability of GANs to actually learn the distribution rather than just memorize the training data (Arora et al., 2017; 2018; Dumoulin et al., 2016). Corresponding cures have been proposed. For instance, (Zhang et al., 2018; Bai et al., 2018) show that for specific generators combined with an appropriate parametric discriminator design, WGANs can attain parametric statistical rates, avoiding the exponential-in-dimension sample complexity (Liang, 2018; Bai et al., 2018; Feizi et al., 2017).

Recent work of (Wu et al., 2019) provides an algorithm to learn the distribution of a single-layer ReLU generator network. While our conclusion appears similar, our focus is very different: our paper targets understanding when a WGAN formulation trained with stochastic GDA can learn in polynomial time and sample complexity. Their work instead relies on a specifically tailored algorithm for learning truncated normal distributions (Daskalakis et al., 2018).

## 3. Preliminaries

**Notation.** We consider GAN formulations for learning a generator $G_A : \mathbb{R}^k \to \mathbb{R}^d$ of the form $z \mapsto x = \phi(Az)$, where $A$ is a $d \times k$ parameter matrix and $\phi$ some activation function. We consider discriminators $D_v : \mathbb{R}^d \to \mathbb{R}$ or $D_V : \mathbb{R}^d \to \mathbb{R}$, depending on whether the discriminator is parametrized by a vector or a matrix. We assume latent variables $z$ are sampled from the normal $N(0, I_{k \times k})$, where $I_{k \times k}$ denotes the identity matrix of size $k$. The real/target distribution outputs samples $x \sim \mathcal{D} = G_{A^*}(N(0, I_{k_0 \times k_0}))$ for some ground-truth parameters $A^*$, where $A^*$ is $d \times k_0$; we take $k \geq k_0$ for enough expressivity, and $k = d$ when $k_0$ is unknown. The Wasserstein GAN under our choice of generator and discriminator is naturally formulated as

$$\min_{A \in \mathbb{R}^{d \times k}} \max_{v \in \mathbb{R}^d} f(A, v),$$

for $f(A, v) \equiv \mathbb{E}_{x \sim \mathcal{D}}\left[ D_v(x) \right] - \mathbb{E}_{z \sim N(0, I_{k \times k})}\left[ D_v(G_A(z)) \right]$. (We will replace $v$ by matrix parameters $V \in \mathbb{R}^{d \times d}$ when the discriminator is quadratic.)

We use $a_i$ to denote the $i$-th row vector of $A$. We sometimes omit the 2 subscript, using $\|x\|$ to denote the 2-norm of a vector $x$ and $\|X\|$ to denote the spectral norm of a matrix $X$ when there is no ambiguity. $\mathbb{S}^n \subseteq \mathbb{R}^{n \times n}$ denotes the set of symmetric matrices of dimension $n \times n$. We write $Df(X_0)[B]$ for the directional derivative of a function $f$ at point $X_0$ in direction $B$:

$$Df(X_0)[B] = \lim_{t \to 0} \frac{f(X_0 + tB) - f(X_0)}{t}.$$
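To make the notation concrete, here is a minimal NumPy sketch (ours, not from the paper) of the one-layer generator and a Monte Carlo estimate of the objective $f(A, v)$ for a linear discriminator $D_v(x) = v^\top x$; the dimensions, `A_star`, and `f_linear` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3                          # output and latent dimensions (illustrative)
A_star = rng.normal(size=(d, k))     # hypothetical ground-truth parameters
phi = np.tanh                        # one admissible activation

def generator(A, z):
    """One-layer generator G_A(z) = phi(Az), applied to a batch z of shape (m, k)."""
    return phi(z @ A.T)

def f_linear(A, v, m=200_000):
    """Monte Carlo estimate of f(A, v) = E_{x~D}[D_v(x)] - E_z[D_v(G_A(z))]
    for a linear discriminator D_v(x) = v^T x."""
    x = generator(A_star, rng.normal(size=(m, A_star.shape[1])))   # real samples
    x_gen = generator(A, rng.normal(size=(m, A.shape[1])))         # generated samples
    return v @ (x.mean(axis=0) - x_gen.mean(axis=0))

v = rng.normal(size=d)
print(f_linear(A_star, v))   # ~0 at the ground truth (exactly 0 in expectation)
```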
### 3.1. Motivation and Discussion

To provably learn one-layer generators with nonlinear activations, the design of the discriminator must strike a delicate balance:

1. **(Approximation.)** The discriminator should be expressive enough to distinguish the true distribution from incorrect generated ones. More specifically, the max function $g(A) = \max_V f(A, V)$ captures a distance from our learned generator to the target generator, and this distance should only have global minima that correspond to the ground-truth distribution.
2. **(Generalizability.)** The discriminator should be small enough that it can be learned with few samples. In fact, our method guarantees an $O(1/\sqrt{n})$ parametric rate that matches the lower bound established in (Wu et al., 2019).
3. **(Stability.)** The discriminator should be carefully designed so that simple local algorithms such as gradient descent-ascent can find the globally optimal point.

Further, min-max optimization with non-convexity on either side is intractable. In fact, gradient descent-ascent does not even yield last-iterate convergence for bilinear forms (see the sketch below); it requires more carefully designed algorithms such as Optimistic Gradient Descent-Ascent (Daskalakis & Panageas, 2018b) and extra-gradient methods (Korpelevich, 1976).
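To illustrate the last point with a toy example of ours (not from the paper): on the bilinear objective $f(x, y) = xy$, the last iterate of simultaneous GDA spirals away from the equilibrium $(0, 0)$, while the optimistic variant, which adds a correction based on the previous gradient, converges.

```python
import numpy as np

# f(x, y) = x * y: grad_x = y, grad_y = x. Unique equilibrium at (0, 0).
eta = 0.1

def gda(x, y, steps=200):
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x      # simultaneous descent-ascent
    return x, y

def ogda(x, y, steps=200):
    # Optimistic GDA: usual step plus a correction from the previous gradient.
    gx_prev, gy_prev = y, x
    for _ in range(steps):
        gx, gy = y, x
        x, y = x - eta * (2 * gx - gx_prev), y + eta * (2 * gy - gy_prev)
        gx_prev, gy_prev = gx, gy
    return x, y

print(np.hypot(*gda(1.0, 1.0)))    # grows: the last iterate of GDA diverges
print(np.hypot(*ogda(1.0, 1.0)))   # shrinks toward the equilibrium (0, 0)
```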
In this paper we show a stronger hardness result: for simple bilinear forms coupled with ReLU activations, it is NP-hard to even find a stationary point.

**Theorem 1.** Consider min-max optimization of the following ReLU-bilinear form:

$$\min_{x} \max_{y} \; f(x, y) = \sum_{i=1}^{n} \phi(A_i x + b_i)^\top y,$$

where $x \in \mathbb{R}^d$, $A_i \in \mathbb{R}^{O(d) \times d}$, and $\phi$ is the ReLU activation. For $n \geq 4$, the problem of checking whether $f$ has any stationary point is NP-hard in $d$.

We defer the proof to the Appendix, where we show that 3-SAT is reducible to the above problem. This theorem shows that, in general, adding non-linearity (non-convexity) to min-max forms makes the problem intractable. Nevertheless, we are able to show that gradient descent-ascent finds global minima when training one-layer generators with non-linearity. This relies on carefully designed discriminators, regularization, and the specific structure we consider.

Finally, we note that understanding how one-layer generative models are learned is important in practice as well. For instance, Progressive GAN (Karras et al., 2017) proposes to learn one layer at a time, growing both the generator and discriminator progressively during the learning process. Our analysis provides further theoretical support for this kind of progressive learning procedure.

## 4. Warm-up: Learning the Marginal Distributions

As a warm-up, we ask whether a simple linear discriminator is sufficient for learning the marginal distributions of all coordinates of $\mathcal{D}$. Notice that in our setting, the $i$-th output of the generator is $\phi(x)$ where $x \sim N(0, \|a_i\|^2)$, and is thus solely determined by $\|a_i\|$. With a linear discriminator $D_v(x) = v^\top x$, our minimax game becomes

$$\min_{A \in \mathbb{R}^{d \times k}} \max_{v \in \mathbb{R}^d} f_1(A, v), \tag{1}$$

for $f_1(A, v) \equiv \mathbb{E}_{x \sim \mathcal{D}}\left[ v^\top x \right] - \mathbb{E}_{z \sim N(0, I_{k \times k})}\left[ v^\top \phi(Az) \right]$.

Notice that when the activation $\phi$ is an odd function, such as the tanh activation, the symmetry of the Gaussian distribution ensures that $\mathbb{E}_{x \sim \mathcal{D}}[v^\top x] = 0$, hence the linear discriminator in $f_1$ reveals no information about $A^*$. Therefore, specifically for odd activations (or odd-plus-a-constant activations), we instead use an adjusted rectified linear discriminator $D_v(x) \equiv v^\top R(x - C)$ to introduce a bias, where $C = \frac{1}{2}(\phi(x) + \phi(-x))$ for all $x$, and $R$ denotes the ReLU activation. Formally, we slightly modify our loss function as

$$\tilde{f}_1(A, v) \equiv \mathbb{E}_{x \sim \mathcal{D}}\left[ v^\top R(x - C) \right] - \mathbb{E}_{z \sim N(0, I_{k \times k})}\left[ v^\top R(\phi(Az) - C) \right]. \tag{2}$$

We will show that we can learn each marginal of $\mathcal{D}$ if the activation function $\phi$ satisfies the following.

**Assumption 1.** The activation function $\phi$ satisfies either one of the following:
1. $\phi$ is an odd function plus a constant, and $\phi$ is monotone increasing; or
2. the even component of $\phi$, i.e., $\frac{1}{2}(\phi(x) + \phi(-x))$, is positive and monotone increasing on $x \in [0, \infty)$.

**Remark 1.** All common activation functions, such as (leaky) ReLU, tanh, and sigmoid, satisfy Assumption 1.

**Lemma 1.** Suppose $A^* \neq 0$. Consider $f_1$ with an activation satisfying Assumption 1.2, or $\tilde{f}_1$ with an activation satisfying Assumption 1.1. The stationary points of such $f_1$ and $\tilde{f}_1$ yield parameters $A$ satisfying $\|a_i\| = \|a^*_i\|$ for all $i \in [d]$.

To bound the capacity of the discriminator, WGAN adds a Lipschitz constraint, $\|D_v\|_{\mathrm{Lip}} \leq 1$, or simply $\|v\|_2 \leq 1$. To make the training process easier, we instead regularize the discriminator. For the regularized formulation we have:

**Theorem 2.** In the same setting as Lemma 1, alternating gradient descent-ascent with proper learning rates on

$$\min_{A} \max_{v} \left\{ f_1(A, v) - \|v\|^2/2 \right\}, \quad \text{or respectively} \quad \min_{A} \max_{v} \left\{ \tilde{f}_1(A, v) - \|v\|^2/2 \right\},$$

recovers $A$ such that $\|a_i\| = \|a^*_i\|$ for all $i \in [d]$.

All proofs of the paper can be found in the appendix. We show that all local min-max points of the original problem, in the sense of (Jin et al., 2019), are global min-max points and recover the correct norm of $a^*_i$ for all $i$. Notice that for the source data distribution $x = (x_1, x_2, \ldots, x_d) \sim \mathcal{D}$ with activation $\phi$, the marginal distribution of each $x_i$ follows $\phi(N(0, \|a^*_i\|^2))$ and is determined by $\|a^*_i\|$. We have therefore learned the marginal distribution of each entry $i$; it remains to learn the joint distribution. A minimal simulation of this warm-up is sketched below.
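The following is a minimal stochastic GDA simulation of the regularized warm-up objective from Theorem 2 (our sketch; the step sizes, batch size, dimensions, and variable names are arbitrary choices). We use leaky ReLU, which falls under Assumption 1.2, so the plain linear discriminator of Eqn. (1) suffices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, alpha = 4, 3, 0.2

def phi(x):   # leaky ReLU, which satisfies Assumption 1.2
    return np.where(x > 0, x, alpha * x)

def dphi(x):
    return np.where(x > 0, 1.0, alpha)

A_star = rng.normal(size=(d, k))
A = rng.normal(size=(d, k))
v = np.zeros(d)
eta_A, eta_v, m = 0.05, 0.5, 512

for t in range(3000):
    z_real = rng.normal(size=(m, k))
    z_fake = rng.normal(size=(m, k))
    x = phi(z_real @ A_star.T)           # samples from the target distribution
    h = z_fake @ A.T
    # ascent on v for f1(A, v) - ||v||^2 / 2
    grad_v = x.mean(axis=0) - phi(h).mean(axis=0) - v
    v = v + eta_v * grad_v
    # descent on A: gradient of -E[v^T phi(Az)] with respect to A
    grad_A = -((dphi(h) * v).T @ z_fake) / m
    A = A - eta_A * grad_A

print(np.linalg.norm(A, axis=1))        # row norms of the learned A ...
print(np.linalg.norm(A_star, axis=1))   # ... approach those of A_star
```

At the stationary point $v = 0$, the generator matches the first moment of each coordinate, which, by the monotonicity in Assumption 1, pins down each row norm.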
## 5. Learning the Joint Distribution

In the previous section, we used a (rectified) linear discriminator in which each coordinate $v_i$ interacts with the $i$-th random variable, and WGAN learns the correct $\|a^*_i\|$ for all $i$. However, since there is no interaction between different coordinates of the random vector, we do not expect to learn the joint distribution with a linear discriminator. To proceed, a natural idea is to use a quadratic discriminator $D_V(x) := x^\top V x = \langle xx^\top, V \rangle$ to enforce interactions between components. As in the previous section, we study the regularized version:

$$\min_{A \in \mathbb{R}^{d \times k}} \max_{V \in \mathbb{R}^{d \times d}} \left\{ f_2(A, V) - \frac{1}{2}\|V\|_F^2 \right\}, \tag{3}$$

where

$$f_2(A, V) = \mathbb{E}_{x \sim \mathcal{D}}\left[ D_V(x) \right] - \mathbb{E}_{z \sim N(0, I_{k \times k})}\left[ D_V(\phi(Az)) \right] = \left\langle \mathbb{E}_{x \sim \mathcal{D}}\left[ xx^\top \right] - \mathbb{E}_{z \sim N(0, I_{k \times k})}\left[ \phi(Az)\phi(Az)^\top \right], \, V \right\rangle.$$

By adding the regularizer on $V$ and maximizing over $V$ explicitly, we obtain

$$g(A) \equiv \max_V \left\{ f_2(A, V) - \frac{1}{2}\|V\|_F^2 \right\} = \frac{1}{2}\left\| \mathbb{E}_{x \sim \mathcal{D}}\left[ xx^\top \right] - \mathbb{E}_{z \sim N(0, I_{k \times k})}\left[ \phi(Az)\phi(Az)^\top \right] \right\|_F^2.$$

In the next subsection, we first analyze the second-order stationary points of $g$, and then establish that gradient descent-ascent converges to second-order stationary points of $g$.

### 5.1. Global Convergence for Optimizing the Generating Parameters

We first assume that both $A$ and $A^*$ have unit row vectors, and later extend to the general case, since we already know how to learn the row norms from Section 4. To compute $g(A)$ explicitly, we rely on properties of Hermite polynomials. Since the normalized Hermite polynomials $\{h_i\}_{i=0}^{\infty}$ form an orthonormal basis of the functional space, we rewrite the activation function as $\phi(x) = \sum_{i=0}^{\infty} \sigma_i h_i(x)$, where $\sigma_i$ is the $i$-th Hermite coefficient. We use the following claim:

**Claim 1** ((Ge et al., 2017), Claim 4.2). Let $\phi$ be a function from $\mathbb{R}$ to $\mathbb{R}$ such that $\phi \in L^2(\mathbb{R}, e^{-x^2/2})$, and let its Hermite expansion be $\phi = \sum_{i=0}^{\infty} \sigma_i h_i$. Then, for any unit vectors $u, v \in \mathbb{R}^d$,

$$\mathbb{E}_{x \sim N(0, I_{d \times d})}\left[ \phi(u^\top x)\phi(v^\top x) \right] = \sum_{i=0}^{\infty} \sigma_i^2 (u^\top v)^i.$$

Therefore we can compute the value of $f_2$ explicitly using the Hermite expansion:

$$f_2(A, V) = \left\langle \sum_{i=0}^{\infty} \sigma_i^2 \left( (A^*(A^*)^\top)^{\circ i} - (AA^\top)^{\circ i} \right), \, V \right\rangle.$$

Here $X^{\circ i}$ is the Hadamard power operation, $(X^{\circ i})_{jk} = (X_{jk})^i$. Therefore we have

$$g(A) = \frac{1}{2}\left\| \sum_{i=0}^{\infty} \sigma_i^2 \left( (A^*(A^*)^\top)^{\circ i} - (AA^\top)^{\circ i} \right) \right\|_F^2.$$

We reparametrize with $Z = AA^\top$ and define $\bar{g}(Z) = g(A)$, with individual component functions $\bar{g}_{jk}(z) \equiv \frac{1}{2}\left( \sum_{i=0}^{\infty} \sigma_i^2 \left( (z^*_{jk})^i - z^i \right) \right)^2$, where $z^*_{jk} = \langle a^*_j, a^*_k \rangle$ is the $(j, k)$-th entry of the ground-truth covariance matrix $A^*(A^*)^\top$.

**Assumption 2.** The activation function $\phi$ is an odd function plus a constant; in other words, its Hermite expansion $\phi = \sum_{i=0}^{\infty} \sigma_i h_i$ satisfies $\sigma_i = 0$ for even $i \geq 2$. Additionally, we assume $\sigma_1 \neq 0$.

**Remark 2.** Common activations like tanh and sigmoid satisfy Assumption 2.

**Lemma 2.** For activations including leaky ReLU and functions satisfying Assumption 2, $\bar{g}(Z)$ has a unique stationary point, namely $Z = A^*(A^*)^\top$.

Notice that $\bar{g}(Z) = \sum_{jk} \bar{g}_{jk}(z_{jk})$ is separable across the $z_{jk}$, where each $\bar{g}_{jk}$ is a scalar polynomial function. Lemma 2 follows from the fact that, for odd activations $\phi$ and for leaky ReLU, the only zero of $\bar{g}'_{jk}$ is $z_{jk} = z^*_{jk}$. We then transfer this good property to the original problem we want to solve:

**Problem 1.** Optimize $g$ subject to unit row norms, $\|a_i\| = 1$ for all $i$:

$$\min_{A} \; g(A) = \frac{1}{2}\left\| \sum_{i=0}^{\infty} \sigma_i^2 \left( (A^*(A^*)^\top)^{\circ i} - (AA^\top)^{\circ i} \right) \right\|_F^2 \quad \text{s.t. } a_i^\top a_i = 1, \; \forall i.$$

Existing work (Journée et al., 2008) connects optimization over $\bar{g}(Z)$ to optimization over the factorized version $g(A) \equiv \bar{g}(AA^\top)$. Specifically, when $k = d$, all second-order stationary points of $g(A)$ are first-order stationary points of $\bar{g}(Z)$. Though $g$ is not convex, we are able to show that its first-order stationary points are global optima when the generator is sufficiently expressive, i.e., $k \geq k_0$. In reality we do not know the latent dimension $k_0$, so we simply choose $k = d$. We reach the following conclusion:

**Theorem 3.** For activations including leaky ReLU and functions satisfying Assumption 2, when $k = d$, all second-order KKT points of Problem 1 are global minima. Therefore, alternating projected gradient descent-ascent on Eqn. (3) converges to $A$ such that $AA^\top = A^*(A^*)^\top$.

The extension to non-unit row vectors is straightforward, and we defer the analysis to the Appendix. This main theorem demonstrates the success of gradient descent-ascent in learning the ground-truth generator. The result rests on two factors. First, the geometric property of our loss function: all second-order KKT points are global minima. Second, all global minima satisfy $AA^\top = A^*(A^*)^\top$, and for the problem we consider, i.e., one-layer generators, retrieving $AA^\top$ suffices to learn the whole generating distribution.
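As a numerical sanity check of Claim 1 and Assumption 2 (our sketch, not from the paper), one can compute the Hermite coefficients of tanh by Gauss-HermiteE quadrature and compare the truncated series $\sum_i \sigma_i^2 (u^\top v)^i$ with a Monte Carlo estimate of $\mathbb{E}[\phi(u^\top x)\phi(v^\top x)]$:

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He  # probabilists' Hermite polynomials

rng = np.random.default_rng(2)
phi = np.tanh

# Gauss-HermiteE quadrature integrates against the weight exp(-x^2 / 2).
pts, wts = He.hermegauss(80)

def hermite_coeff(i):
    """sigma_i = E_{x ~ N(0,1)}[phi(x) h_i(x)], with h_i = He_i / sqrt(i!)."""
    He_i = He.hermeval(pts, [0] * i + [1])
    return wts @ (phi(pts) * He_i) / np.sqrt(2 * np.pi) / math.sqrt(math.factorial(i))

sigma = np.array([hermite_coeff(i) for i in range(30)])
# tanh is odd, so its even coefficients vanish (Assumption 2):
assert np.allclose(sigma[0::2], 0.0, atol=1e-8)

# Claim 1: E[phi(u.x) phi(v.x)] = sum_i sigma_i^2 (u.v)^i for unit u, v.
d = 5
u = rng.normal(size=d); u /= np.linalg.norm(u)
v = rng.normal(size=d); v /= np.linalg.norm(v)
x = rng.normal(size=(1_000_000, d))
mc = np.mean(phi(x @ u) * phi(x @ v))
series = np.sum(sigma**2 * (u @ v) ** np.arange(30))
print(mc, series)   # the two estimates agree to Monte Carlo accuracy
```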
## 6. Finite Sample Analysis

In the previous section we demonstrated the success of gradient descent-ascent on the population risk. This leaves open the question of how many samples are needed to achieve small error. In this section we analyze Algorithm 1, i.e., gradient descent-ascent on the empirical loss

$$f^{(t)}_{m,n}(A, V) = \left\langle \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top - \frac{1}{m}\sum_{i=1}^{m} \phi(Az^{(t)}_i)\phi(Az^{(t)}_i)^\top, \, V \right\rangle - \frac{1}{2}\|V\|_F^2.$$

Notice that in each iteration, gradient ascent on $V$ with step size $1$ finds the optimal solution for $V$. By Danskin's theorem (Danskin, 2012), our min-max optimization is thus essentially gradient descent over

$$g^{(t)}_{m,n}(A) \equiv \max_V f^{(t)}_{m,n}(A, V) = \frac{1}{2}\left\| \frac{1}{m}\sum_{i=1}^{m} \phi(Az^{(t)}_i)\phi(Az^{(t)}_i)^\top - \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top \right\|_F^2$$

with a fresh batch of samples $\{z^{(t)}_i\}$ at each step, i.e., stochastic gradient descent on $\tilde{f}_n(A) \equiv \mathbb{E}_{z_i \sim N(0, I_{k \times k}),\, i \in [m]}\left[ g_{m,n}(A) \right]$. To bound the difference between $\tilde{f}_n(A)$ and the population risk $g(A)$, we therefore analyze the sample complexity required on the observation side ($x_i \sim \mathcal{D}$, $i \in [n]$) and the mini-batch size required on the learning side ($\phi(Az_j)$, $z_j \sim N(0, I_{k \times k})$, $j \in [m]$). We will show that, with large enough $n$ and $m$, Algorithm 1, which optimizes the empirical risk, yields the ground-truth covariance matrix with high probability. Our proof sketch is roughly as follows:

1. With high probability, projected stochastic gradient descent finds a second-order stationary point $\hat{A}$ of $\tilde{f}_n(\cdot)$, as shown in (Ge et al., 2015; Lee et al., 2019; 2016).
2. For sufficiently large $m$, the empirical objective, though a biased estimator of the population risk $g(\cdot)$, achieves a good $\epsilon$-approximation of the population risk in both gradient and Hessian (Lemmas 4 & 5). Therefore $\hat{A}$ is also an $O(\epsilon)$-approximate second-order stationary point (SOSP) of the population risk $g(A)$.
3. Any $\epsilon$-SOSP $\hat{A}$ of $g(A)$ yields an $O(\epsilon)$-approximate first-order stationary point (FOSP) $\hat{Z} \equiv \hat{A}\hat{A}^\top$ of the semidefinite program on $\bar{g}(Z)$ (Lemma 6).
4. Any $O(\epsilon)$-FOSP of $\bar{g}(Z)$ induces at most $O(\epsilon)$ absolute error relative to the ground-truth covariance matrix $Z^* = A^*(A^*)^\top$ (Lemma 7).

**Algorithm 1** (Online stochastic gradient descent-ascent on WGAN).

1. **Input:** $n$ training samples $x_1, \ldots, x_n$, where each $x_i \sim \phi(A^* z)$, $z \sim N(0, I_{k \times k})$; learning rate $\eta$ for the generating parameters; number of iterations $T$.
2. Randomly initialize the generating matrix $A^{(0)}$.
3. **For** $t = 1, 2, \ldots, T$:
4. Generate $m$ latent variables $z^{(t)}_1, z^{(t)}_2, \ldots, z^{(t)}_m \sim N(0, I_{k \times k})$ for the generator; the empirical objective becomes $f^{(t)}_{m,n}(A, V)$ as above.
5. Gradient ascent on $V$ with the optimal step size $\eta_V = 1$: $V^{(t)} \leftarrow V^{(t-1)} + \eta_V \nabla_V f^{(t)}_{m,n}(A^{(t-1)}, V^{(t-1)})$.
6. Sample noise $e$ uniformly from the unit sphere.
7. Projected gradient descent on $A$ with constraint set $\mathcal{C} = \{A \mid (AA^\top)_{ii} = (A^*(A^*)^\top)_{ii}\}$: $A^{(t)} \leftarrow \mathrm{Proj}_{\mathcal{C}}\big(A^{(t-1)} - \eta(\nabla_A f^{(t)}_{m,n}(A^{(t-1)}, V^{(t)}) + e)\big)$.
8. **Output:** $A^{(T)}(A^{(T)})^\top$.
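Below is a minimal NumPy sketch of Algorithm 1 (ours; the name `train_wgan`, the hyperparameters, and the assumption that the target row norms are known and passed in are illustrative, since the paper learns the norms as in Section 4). Because the $V$-step with $\eta_V = 1$ lands exactly on the optimal discriminator, the $A$-step is a stochastic gradient step on $g^{(t)}_{m,n}$:

```python
import numpy as np

def leaky_relu(x, a=0.2):
    return np.where(x > 0, x, a * x)

def dleaky_relu(x, a=0.2):
    return np.where(x > 0, 1.0, a)

def train_wgan(X, k, row_norms, eta=0.01, m=512, T=5000, phi=leaky_relu,
               dphi=dleaky_relu, seed=0):
    """Projected stochastic gradient descent-ascent (a sketch of Algorithm 1).

    X: (n, d) array of observed samples x_i ~ phi(A* z).
    row_norms: the target row norms of A, i.e. sqrt((A* A*^T)_ii), assumed
               known here (the paper learns them as in Section 4).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xn = X.T @ X / n                      # empirical second moment (1/n) sum x x^T
    A = rng.normal(size=(d, k))
    A *= (row_norms / np.linalg.norm(A, axis=1))[:, None]      # project onto C
    for _ in range(T):
        Z = rng.normal(size=(m, k))       # fresh latent mini-batch
        H = Z @ A.T
        Phi = phi(H)
        # ascent on V with step size 1 lands exactly on the optimal V:
        V = Xn - Phi.T @ Phi / m
        # gradient of f_mn w.r.t. A at this V, which equals the gradient of g_mn:
        G = -2.0 / m * ((dphi(H) * (Phi @ V)).T @ Z)
        E = rng.normal(size=(d, k))
        E /= np.linalg.norm(E)            # noise uniform on the unit sphere
        A = A - eta * (G + E)
        A *= (row_norms / np.linalg.norm(A, axis=1))[:, None]  # project onto C
    return A
```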
### 6.1. Observation Sample Complexity

For simplicity, we assume the activation and its derivative are Lipschitz continuous, and take the Lipschitz constants to be $1$ w.l.o.g.:

**Assumption 3.** The activation is 1-Lipschitz and 1-smooth.

To estimate the observation sample complexity, we bound the gradient and Hessian of the population risk and of the empirical risk over the observation samples,

$$g(A) = \frac{1}{2}\left\| \mathbb{E}_{x \sim \mathcal{D}}\left[ xx^\top \right] - \mathbb{E}_{z}\left[ \phi(Az)\phi(Az)^\top \right] \right\|_F^2, \qquad g_n(A) = \frac{1}{2}\left\| \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top - \mathbb{E}_{z}\left[ \phi(Az)\phi(Az)^\top \right] \right\|_F^2.$$

The gradient estimation error due to finite samples is

$$\nabla g(A) - \nabla g_n(A) = 2\,\mathbb{E}_z\left[ \mathrm{diag}(\phi'(Az))\,(X_n - X)\,\phi(Az)\,z^\top \right],$$

where $X = \mathbb{E}_{x \sim \mathcal{D}}[xx^\top]$ and $X_n = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top$. The directional derivative in an arbitrary direction $B$ is

$$D\nabla g(A)[B] - D\nabla g_n(A)[B] = 2\,\mathbb{E}_z\left[ \mathrm{diag}(\phi'(Az))\,(X_n - X)\,\big(\phi'(Az) \circ (Bz)\big)\,z^\top \right] + 2\,\mathbb{E}_z\left[ \mathrm{diag}\big(\phi''(Az) \circ (Bz)\big)\,(X_n - X)\,\phi(Az)\,z^\top \right].$$

**Lemma 3.** Suppose the activation satisfies Assumption 3. Then $\Pr\left[ \|X - X_n\| \leq \epsilon \|X\| \right] \geq 1 - \delta$ for $n \geq \tilde{\Theta}(d/\epsilon^2 \cdot \log^2(1/\delta))$. (We use $\tilde{\Theta}$ throughout the paper to hide log factors of $d$ for simplicity.)

Bounding the relative difference between the sample and population covariance matrices is essential for bounding the estimation error in both the gradient and its directional derivative. We can show the following relative error bounds:

**Lemma 4.** Suppose the activation satisfies Assumptions 2 & 3. With $n \geq \tilde{\Theta}(d/\epsilon^2 \cdot \log^2(1/\delta))$ samples, we have, with probability $1 - \delta$,

$$\left\| \nabla g(A) - \nabla g_n(A) \right\|_2 \leq O(\epsilon d \|A\|_2),$$

and, with probability $1 - \delta$,

$$\left\| D\nabla g(A)[B] - D\nabla g_n(A)[B] \right\|_2 \leq O(\epsilon d^{3/2} \|A\|_2 \|B\|_2).$$

### 6.2. Bounding the Mini-batch Size

For the empirical risk in supervised learning, the mini-batch size can usually be arbitrarily small, since the gradient estimator is unbiased. In the WGAN setting, however, in each iteration we sample a batch of latent variables $\{z_i\}_{i \in [m]}$ and obtain a gradient of

$$g^{(t)}_{m,n}(A) = \frac{1}{2}\left\| \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top - \frac{1}{m}\sum_{j=1}^{m} \phi(Az_j)\phi(Az_j)^\top \right\|_F^2$$

in Algorithm 1. The finite sum sits inside the Frobenius norm, so the gradient on each mini-batch is no longer an unbiased estimator of our target

$$g_n(A) = \frac{1}{2}\left\| \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top - \mathbb{E}_z\left[ \phi(Az)\phi(Az)^\top \right] \right\|_F^2.$$

In other words, we conduct stochastic gradient descent on the function $\tilde{f}(A) \equiv \mathbb{E}_z[g_{m,n}(A)]$. We therefore need only analyze the gradient error between this $\tilde{f}(A)$ and $g_n(A)$ (i.e., show that $g_{m,n}$ is almost an unbiased estimator of $g_n$). Finally, with the concentration bounds derived in the last subsection, we obtain an error bound between $\tilde{f}(A)$ and $g(A)$.

**Lemma 5.** The empirical risk $g_{m,n}$ is almost an unbiased estimator of $g_n$. Specifically, the expected function $\tilde{f}(A) = \mathbb{E}_{z_i \sim N(0, I_{k \times k}),\, i \in [m]}[g_{m,n}]$ satisfies

$$\left\| \nabla \tilde{f}(A) - \nabla g_n(A) \right\| \leq O\!\left( \frac{1}{\sqrt{m}}\, \|A\|^3 d^{5/2} \right),$$

and, for an arbitrary direction matrix $B$,

$$\left\| D\nabla \tilde{f}(A)[B] - D\nabla g_n(A)[B] \right\| \leq O\!\left( \frac{1}{\sqrt{m}}\, \|B\|\, \|A\|^3 d^{5/2} \right).$$

In summary, we derive concentration bounds over the observation samples and the mini-batch size, showing that the function $\tilde{f}(A)$ that Algorithm 1 optimizes has gradient and Hessian close to those of the population risk $g(A)$. Therefore a second-order stationary point (SOSP) of $\tilde{f}(A)$, which our algorithm is guaranteed to reach, is also an $\epsilon$-approximate SOSP of $g(A)$. Next we show that such a point also yields an $\epsilon$-approximate first-order stationary point of the reparametrized function $\bar{g}(Z) \equiv g(A)$, $Z = AA^\top$.

### 6.3. Relating Approximate Optimality

In this section, we establish the relationship between $g$ and $\bar{g}$. We present the general form of our target Problem 1:

$$\min_{A \in \mathbb{R}^{d \times k}} g(A) \equiv \bar{g}(AA^\top) \quad \text{s.t. } \mathrm{Tr}(A^\top X_i A) = y_i, \; X_i \in \mathbb{S}, \; y_i \in \mathbb{R}, \; i = 1, \ldots, n. \tag{4}$$

As before, stationarity is not transparent in the original problem, so we instead consider the reparametrized version:

$$\min_{Z \in \mathbb{S}} \bar{g}(Z) \quad \text{s.t. } \mathrm{Tr}(X_i Z) = y_i, \; X_i \in \mathbb{S}, \; y_i \in \mathbb{R}, \; i = 1, \ldots, n, \quad Z \succeq 0. \tag{5}$$

**Definition 1.** A matrix $A \in \mathbb{R}^{d \times k}$ is an $\epsilon$-approximate second-order stationary point ($\epsilon$-SOSP) of Eqn. (4) if there exists a vector $\lambda$ such that:
- $\mathrm{Tr}(A^\top X_i A) = y_i$ for all $i \in [n]$;
- $\left\| \left( \nabla_Z \bar{g}(AA^\top) - \sum_{i=1}^{n} \lambda_i X_i \right) \bar{a}_j \right\| \leq \epsilon \|\bar{a}_j\|$, where $\{\bar{a}_j\}_j$ span the column space of $A$;
- $\mathrm{Tr}\left( B^\top D\nabla_A L(A, \lambda)[B] \right) \geq -\epsilon \|B\|^2$ for all $B$ such that $\mathrm{Tr}(B^\top X_i A) = 0$.

Here $L(A, \lambda)$ is the Lagrangian $\bar{g}(AA^\top) - \sum_{i=1}^{n} \lambda_i \left( \mathrm{Tr}(A^\top X_i A) - y_i \right)$. When $\epsilon = 0$, the above definition is exactly the second-order KKT condition for optimizing (4). Next we present the approximate first-order KKT condition for (5):

**Definition 2.** A symmetric matrix $Z \in \mathbb{S}$ is an $\epsilon$-approximate first-order stationary point ($\epsilon$-FOSP) of (5) if and only if there exist a vector $\sigma \in \mathbb{R}^n$ and a symmetric matrix $S \in \mathbb{S}$ such that:
- $\mathrm{Tr}(X_i Z) = y_i$ for all $i \in [n]$, and $Z \succeq 0$;
- $S \succeq -\epsilon I$, and $\|S \bar{a}_j\| \leq \epsilon \|\bar{a}_j\|$, where $\{\bar{a}_j\}_j$ span the column space of $Z$;
- $S = \nabla_Z \bar{g}(Z) - \sum_{i=1}^{n} \sigma_i X_i$.

**Lemma 6.** Let the latent dimension be $k = d$. An $\epsilon$-SOSP of (4) with $A$ and $\lambda$ yields an $\epsilon$-FOSP of (5) with $Z = AA^\top$, $\sigma = \lambda$, and $S = \nabla_Z \bar{g}(AA^\top) - \sum_{i=1}^{n} \lambda_i X_i$.

It remains to show that an $\epsilon$-FOSP of $\bar{g}(Z)$ indeed yields a good approximation of the ground-truth parameter matrix:

**Lemma 7.** If $Z$ is an $\epsilon$-FOSP of (5), then $\|Z - Z^*\|_F \leq O(\epsilon)$, where $Z^* = A^*(A^*)^\top$ is the optimal solution of (5).

Together with the previous arguments, we arrive at our main theorem connecting the recovery guarantee with the sample complexity and batch size:

**Theorem 4.** For arbitrary $\delta < 1$ and $\epsilon$, given a small enough learning rate $\eta < 1/\mathrm{poly}(d, 1/\epsilon, \log(1/\delta))$, sample size $n \geq \tilde{\Theta}(d^5/\epsilon^2 \cdot \log^2(1/\delta))$, batch size $m \geq O(d^5/\epsilon)$, and large enough $T = \mathrm{poly}(1/\eta, 1/\epsilon, d, \log(1/\delta))$, the output of Algorithm 1 satisfies $\|A^{(T)}(A^{(T)})^\top - Z^*\|_F \leq O(\epsilon)$ with probability $1 - \delta$, under Assumptions 2 & 3 and $k = d$. (The exact error bound uses the fact that when the diagonal entries of $AA^\top$ are fixed, $\|A\|_2 = O(\sqrt{d})$.)

Therefore, with finitely many samples, polynomial in $d$ and $1/\epsilon$, we are able to learn the generating distribution, with error measured in parameter space, using stochastic gradient descent-ascent. This echoes the empirical success of WGAN training. Moreover, our error bound matches the lower bound in its dependence on $1/\epsilon$, as shown in (Wu et al., 2019).
## 7. Experiments

In this section, we provide simple experimental results to validate the performance of stochastic gradient descent-ascent and to support our theory.

*Figure 1: Recovery error $\|AA^\top - Z^*\|_F$ with different observed sample sizes $n$ and output dimensions $d$.*

*Figure 2: Learning trajectories with (a) leaky ReLU activation ($\alpha = 0.2$) and (b) tanh activation. Arrows with the same color start from the same starting point; each arrow indicates the progress of 500 iteration steps. In both cases the parameters converge to the true covariance matrix.*

We focus on Algorithm 1, which targets recovery of the parameter matrix. We conduct an empirical study of three joint factors that might affect performance: the number of observed samples $n$ (we set $m = n$, as in typical GAN training), the choice of activation function $\phi$, and the output dimension $d$. In Figure 1 we plot how the parameter-estimation error decreases as the sample size increases. We fix the hidden dimension to $k = 2$ and vary the output dimension over $\{3, 5, 7\}$ and the sample size over $\{500, 1000, 2000, 5000, 10000\}$. Reported values are averaged over 20 runs, with standard deviations shown as colored shadows. The recovery error clearly decreases with larger sample size and smaller output dimension. In these experiments, our algorithm always achieves global convergence to the ground-truth generator from any random initialization.

To visually demonstrate the learning process, we also include a simple comparison of different activations $\phi$, namely leaky ReLU and tanh, for $k = 1$ and $d = 2$. We set the ground-truth covariance matrix to $[1, 1; 1, 1]$, so a valid result is $[1, 1]^\top$ or $[-1, -1]^\top$. From Figure 2 we see that stochastic gradient descent-ascent performs similarly for leaky ReLU and tanh, with exact recovery of the ground-truth parameters in both cases.
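As a usage example (ours, and only a sketch), this setup can be reproduced with the `train_wgan` and `leaky_relu` sketches from Section 6; here we take $k = d = 2$, following the convention of setting $k = d$ when $k_0$ is unknown:

```python
import numpy as np

rng = np.random.default_rng(3)

# Ground truth: k0 = 1, d = 2, covariance A*(A*)^T = [[1, 1], [1, 1]].
A_star = np.array([[1.0], [1.0]])
Z0 = rng.normal(size=(5000, 1))
X = leaky_relu(Z0 @ A_star.T)        # n = 5000 observed samples

A = train_wgan(X, k=2, row_norms=np.linalg.norm(A_star, axis=1), T=2000)
# The recovery error of Theorem 4 / Figure 1; it should shrink toward 0:
print(np.linalg.norm(A @ A.T - A_star @ A_star.T))
```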
## 8. Conclusion

We analyze the convergence of stochastic gradient descent-ascent for Wasserstein GANs when learning a single-layer generator network. We show that stochastic gradient descent-ascent attains the global min-max point and provably recovers the parameters of the network with $\epsilon$ absolute error, measured in Frobenius norm, from $\tilde{\Theta}(d^5/\epsilon^2)$ i.i.d. samples.

## Acknowledgements

The authors thank the Simons Institute Summer 2019 program on the Foundations of Deep Learning for hosting the authors. JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303, the Sloan Research Fellowship, and NSF CCF 2002272. A.D. acknowledges the support of NSF Grants 1618689, DMS 1723052, CCF 1763702, AF 1901292 and research gifts by Google, Western Digital and the Fluor Centennial Teaching Fellowship. C.D. acknowledges support of NSF Awards IIS-1741137, CCF-1617730 and CCF-1901292, a Simons Investigator Award, the DOE PhILMs project (No. DE-AC05-76RL01830), the DARPA award HR00111990021, a Google Faculty award, and the MIT Frank Quick Faculty Research and Innovation Fellowship.

## References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223, 2017.

Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, pp. 224-232, 2017.

Arora, S., Risteski, A., and Zhang, Y. Do GANs learn the distribution? Some theory and empirics. 2018.

Bai, Y., Ma, T., and Risteski, A. Approximability of discriminators implies diversity in GANs. arXiv preprint arXiv:1806.10586, 2018.

Bora, A., Jalal, A., Price, E., and Dimakis, A. G. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning, pp. 537-546, 2017.

Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pp. 2253-2261, 2016.

Danskin, J. M. The Theory of Max-Min and its Application to Weapons Allocation Problems, volume 5. Springer Science & Business Media, 2012.

Daskalakis, C. and Panageas, I. Last-iterate convergence: Zero-sum games and constrained min-max optimization. arXiv preprint arXiv:1807.04252, 2018a.

Daskalakis, C. and Panageas, I. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pp. 9236-9246, 2018b.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.

Daskalakis, C., Gouleakis, T., Tzamos, C., and Zampetakis, M. Efficient statistics, in high dimensions, from truncated samples. In the 59th IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2018.

Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

Feizi, S., Farnia, F., Ginart, T., and Tse, D. Understanding GANs: the LQG setting. arXiv preprint arXiv:1710.10793, 2017.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797-842, 2015.

Ge, R., Lee, J. D., and Ma, T. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

Gidel, G., Hemmat, R. A., Pezeshki, M., Priol, R. L., Huang, G., Lacoste-Julien, S., and Mitliagkas, I. Negative momentum for improved game dynamics. In the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767-5777, 2017.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626-6637, 2017.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125-1134, 2017.

Jin, C., Netrapalli, P., and Jordan, M. I. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.

Journée, M., Bach, F., Absil, P.-A., and Sepulchre, R. Low-rank optimization for semidefinite convex problems. arXiv preprint arXiv:0807.4423, 2008.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

Korpelevich, G. The extragradient method for finding saddle points and other problems. Matecon, 12:747-756, 1976.

Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681-4690, 2017.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246-1257, 2016.

Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., and Recht, B. First-order methods almost always avoid strict saddle points. Mathematical Programming, 176(1-2):311-337, 2019.

Lei, Q., Yen, I. E.-H., Wu, C.-y., Dhillon, I. S., and Ravikumar, P. Doubly greedy primal-dual coordinate descent for sparse empirical risk minimization. In International Conference on Machine Learning, pp. 2034-2042, 2017.

Lei, Q., Zhuo, J., Caramanis, C., Dhillon, I. S., and Dimakis, A. G. Primal-dual block generalized Frank-Wolfe. In Advances in Neural Information Processing Systems, pp. 13866-13875, 2019.

Lei, Q., Nagarajan, S. G., Panageas, I., and Wang, X. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. arXiv preprint arXiv:2002.06768, 2020.

Liang, T. On how well generative adversarial networks learn densities: Nonparametric and parametric results. arXiv preprint arXiv:1811.03179, 2018.

Liang, T. and Stokes, J. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

Lin, T., Jin, C., and Jordan, M. I. On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331, 2019.

Mescheder, L., Nowozin, S., and Geiger, A. The numerics of GANs. In Advances in Neural Information Processing Systems, pp. 1825-1835, 2017.

Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually converge? arXiv preprint arXiv:1801.04406, 2018.

Mokhtari, A., Ozdaglar, A., and Pattathil, S. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, 2019.

Nagarajan, V. and Kolter, J. Z. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5585-5595, 2017.

Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167-7176, 2017.
Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Wu, S., Dimakis, A. G., and Sanghavi, S. Learning distributions generated by one-layer ReLU networks. arXiv preprint arXiv:1909.01812, 2019.

Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. On the discrimination-generalization tradeoff in GANs. 2018.