# Generative Principal Component Analysis

Published as a conference paper at ICLR 2022

Zhaoqiang Liu (National University of Singapore, dcslizha@nus.edu.sg), Jiulong Liu (Chinese Academy of Sciences, jiulongliu@lsec.cc.ac.cn), Subhroshekhar Ghosh (National University of Singapore, subhrowork@gmail.com), Jun Han (PCG, Tencent, junhanjh@tencent.com), Jonathan Scarlett (National University of Singapore, scarlett@comp.nus.edu.sg)

In this paper, we study the problem of principal component analysis with generative modeling assumptions, adopting a general model for the observed matrix that encompasses notable special cases, including spiked matrix recovery and phase retrieval. The key assumption is that the first principal eigenvector lies near the range of an $L$-Lipschitz continuous generative model with bounded $k$-dimensional inputs. We propose a quadratic estimator, and show that it enjoys a statistical rate of order $\sqrt{\frac{k \log L}{m}}$, where $m$ is the number of samples. Moreover, we provide a variant of the classic power method, which projects the calculated data onto the range of the generative model during each iteration. We show that under suitable conditions, this method converges exponentially fast to a point achieving the above-mentioned statistical rate. This rate is conjectured in (Aubin et al., 2019; Cocola et al., 2020) to be the best possible even when we only restrict to the special case of spiked matrix models. We perform experiments on various image datasets for spiked matrix and phase retrieval models, and illustrate the performance gains of our method over the classic power method and the truncated power method devised for sparse principal component analysis.

1 INTRODUCTION

Principal component analysis (PCA) is one of the most popular techniques for data processing and dimensionality reduction (Jolliffe, 1986), with an abundance of applications such as image recognition (Hancock et al., 1996), gene expression data analysis (Alter et al., 2000), and clustering (Ding & He, 2004; Liu & Tan, 2019). PCA seeks to find the directions that capture maximal variance in vector-valued data. In more detail, letting $x_1, x_2, \ldots, x_m$ be $m$ realizations of a random vector $x \in \mathbb{R}^n$ with population covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$, PCA aims to reconstruct the top principal eigenvectors of $\Sigma$. The first principal eigenvector can be computed as follows:
$$u_1 = \arg\max_{w \in \mathbb{R}^n} w^T \widehat{\Sigma} w \quad \text{s.t.} \quad \|w\|_2 = 1, \qquad (1)$$
where the empirical covariance matrix is defined as $\widehat{\Sigma} := \frac{1}{m} \sum_{i=1}^m (x_i - \hat{c})(x_i - \hat{c})^T$, with $\hat{c} := \frac{1}{m} \sum_{i=1}^m x_i$. In addition, subsequent principal eigenvectors can be estimated by similar optimization problems subject to being orthogonal to the previous vectors.

PCA is consistent in the conventional setting where the dimension of the data $n$ is relatively small compared to the sample size $m$ (Anderson, 1962), but leads to rather poor estimates in the high-dimensional setting where $m \ll n$. In particular, it has been shown in various papers that the empirical principal eigenvectors are no longer consistent estimates of their population counterparts (Nadler, 2008; Johnstone & Lu, 2009; Jung & Marron, 2009; Birnbaum et al., 2013). In order to tackle the curse of dimensionality, a natural approach is to impose certain structural constraints on the principal eigenvectors. A common assumption is that the principal eigenvectors are sparse, and this gives rise to the problem of sparse principal component analysis (SPCA) (Zou et al., 2006).
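To make the baseline concrete, the estimator in (1) amounts to an eigendecomposition of the empirical covariance matrix. The following is a minimal NumPy sketch of this computation (an illustrative example, not code from the paper); the power method discussed later approximates the same quantity iteratively.

```python
import numpy as np

def pca_top_eigenvector(X):
    """Estimate the first principal eigenvector as in (1).

    X: array of shape (m, n) whose rows are the samples x_1, ..., x_m.
    Returns a unit-norm w maximizing w^T Sigma_hat w.
    """
    c_hat = X.mean(axis=0)                          # sample mean
    centered = X - c_hat
    sigma_hat = centered.T @ centered / X.shape[0]  # empirical covariance
    eigvals, eigvecs = np.linalg.eigh(sigma_hat)    # eigenvalues in ascending order
    return eigvecs[:, -1]                           # eigenvector of the largest eigenvalue
```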
In particular, for recovering the top principal eigenvector, the optimization problem of SPCA is given by u1 = arg max w Rn w T Σw s.t. w 2 = 1, w 0 s, (2) where w 0 = |{i : wi = 0}| denotes the number of non-zero entries of w, and s N represents the sparsity level. In addition to reducing the effective number of parameters, the sparsity assumption also enhances the interpretability (Zou et al., 2006). Departing momentarily from the PCA problem, recent years have seen tremendous advances in deep generative models in a wide variety of real-world applications (Foster, 2019). This has motivated a new perspective of the related problem of compressed sensing (CS), in which the standard sparsity assumption is replaced by a generative modeling assumption. That is, the underlying signal is assumed to lie near the range of a (deep) generative model (Bora et al., 2017). The authors of (Bora et al., 2017) characterized the number of samples required to attain an accurate reconstruction, and also presented numerical results on image datasets showing that compared to sparsity-based methods, generative priors can lead to large reductions (e.g., a factor of 5 to 10) in the number of measurements needed to recover the signal up to a given accuracy. Additional numerical and theoretical results concerning inverse problems using generative models have been provided in (Van Veen et al., 2018; Dhar et al., 2018; Heckel & Hand, 2019; Jalal et al., 2020; Liu & Scarlett, 2020a; Ongie et al., 2020; Whang et al., 2020; Jalal et al., 2021; Nguyen et al., 2021), among others. In this paper, following the developments in both PCA/SPCA and inverse problems with generative priors, we study the use of generative priors in principal component analysis (GPCA), which gives a generative counterpart of SPCA in (2), formulated as follows: u1 = arg max w Rn w T Σw s.t. w Range(G), (3) where G is a (pre-trained) generative model, which we assume has a range contained in the unit sphere of Rn.1 Similarly to SPCA, the motivation for this problem is to incorporate prior knowledge on the vector being recovered (or alternatively, a prior preference), and to permit meaningful recovery and theoretical bounds even in the high-dimensional regime m n. 1.1 RELATED WORK In this subsection, we summarize some relevant works, which can roughly be divided into (i) the SPCA problem, and (ii) signal recovery with generative models. SPCA: It has been proved that the solution of the SPCA problem in (2) attains the optimal statistical rate p s log n/m (Vu & Lei, 2012), where m is the number of samples, n is the ambient dimension, and s is the sparsity level of the first principal eigenvector. However, due to the combinatorial constraint, the computation of (2) is intractable. To address this computational issue, in recent years, an extensive body of practical approaches for estimating sparse principal eigenvectors have been proposed in the literature, including (d Aspremont et al., 2007; Vu et al., 2013; Chang et al., 2016; Moghaddam et al., 2006; d Aspremont et al., 2008; Jolliffe et al., 2003; Zou et al., 2006; Shen & Huang, 2008; Journ ee et al., 2010; Hein & B uhler, 2010; Kuleshov, 2013; Yuan & Zhang, 2013; Asteris et al., 2011; Papailiopoulos et al., 2013), just to name a few. Notably, statistical guarantees for several approaches have been provided. The authors of (Yuan & Zhang, 2013) propose the truncated power method (TPower), which adds a truncation operation to the power method to ensure the desired level of sparsity. 
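For illustration, the truncation step that TPower (Yuan & Zhang, 2013) adds to the power iteration can be sketched as follows. This is a hedged re-implementation for intuition only, not the authors' code; the sparsity parameter `q`, the initialization `w0`, and the iteration count are user-supplied, matching the way TPower is later used as a baseline in Section 5.

```python
import numpy as np

def tpower(V, q, w0, n_iter=30):
    """Sketch of the truncated power method: a power step followed by
    hard-thresholding to the q entries of largest magnitude, then renormalization."""
    w = w0 / np.linalg.norm(w0)
    for _ in range(n_iter):
        y = V @ w
        keep = np.argsort(np.abs(y))[-q:]   # indices of the q largest-magnitude entries
        w = np.zeros_like(y)
        w[keep] = y[keep]                   # truncate all other entries to zero
        w = w / np.linalg.norm(w)
    return w
```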
It is shown that this approach attains the optimal statistical rate under appropriate initialization. Most approaches for SPCA only focus on estimating the first principal eigenvector, with a certain deflation method (Mackey, 2008) being leveraged to reconstruct the rest. However, there are some exceptions; for instance, an iterative thresholding approach is proposed in (Ma, 2013), and is shown to attain a near-optimal statistical rate 1Similarly to (Liu et al., 2020; 2021a), we assume that the range of G is contained in the unit sphere for convenience. Our results readily transfer to general (unnormalized) generative models by considering its normalized version. See Remark 1 for a detailed discussion. Published as a conference paper at ICLR 2022 when estimating multiple individual principal eigenvectors. In addition, the authors of (Cai et al., 2013) propose a regression-type method that gives an optimal principal subspace estimator. Both works (Ma, 2013; Cai et al., 2013) rely on the assumption of a spiked covariance model to ensure a good initial vector. To avoid the spiked covariance model assumption, the work (Wang et al., 2014) proposes a two-stage procedure that attains the optimal subspace estimator in polynomial time. Signal recovery with generative models: Since the seminal work (Bora et al., 2017), there has been a substantial volume of papers studying various inverse problems with generative priors. One of the problems more closely related to PCA is spectral initialization in phase retrieval, which amounts to solving an eigenvalue problem. Phase retrieval with generative priors has been studied in (Hand et al., 2018; Hyder et al., 2019; Jagatap & Hegde, 2019; Wei et al., 2019; Shamshad & Ahmed, 2020; Aubin et al., 2020; Liu et al., 2021b). In particular, the work (Hand et al., 2018) models the underlying signal as being in the range of a fully-connected Re LU neural network with no offsets, and all the weight matrices of the Re LU neural network are assumed to have i.i.d. zero-mean Gaussian entries. In addition, the neural network needs to be sufficiently expansive in the sense that ni Ω(ni 1 log ni 1), where ni is the width of the i-th layer. Under these assumptions, the authors establish favorable global optimization landscapes for the corresponding objective, and derive a near-optimal sample complexity upper bound. They minimize the objective function directly over the latent space in Rk using gradient descent, which may suffer from local minima in general optimization landscapes (Hyder et al., 2019; Shah & Hegde, 2018). In (Aubin et al., 2020), the assumptions on the neural network are similar to those in (Hand et al., 2018), relaxing to general activation functions (beyond Re LU) and ni Ω(ni 1). The authors focus on the high dimensional regime where n, m, k with the ratio m/n being fixed, and assume that the input vector in Rk is drawn from a separable distribution. They derive sharp asymptotics for the information-theoretically optimal performance and for the associated approximate message passing (AMP) algorithm. Both works (Hand et al., 2018; Aubin et al., 2020) focus on noiseless phase retrieval. When only making the much milder assumption that the generative model is Lipschitz continuous, with no assumption on expansiveness, Gaussianity, and offsets, a spectral initialization step (similar to that of sparse phase retrieval) is typically required in order to accurately reconstruct the signal (Netrapalli et al., 2015; Cand es et al., 2015). 
The authors of (Liu et al., 2021b) propose an optimization problem similar to (3) for the spectral initialization for phase retrieval with generative models. It was left open in (Liu et al., 2021b) how to solve (or approximate sufficiently accurately) the optimization problem in practice. Understanding the eigenvalues of spiked random matrix models has been a central problem of random matrix theory, and spiked matrices have been widely used in the statistical analysis of SPCA. Recently, theoretical guarantees concerning spiked matrix models with generative priors have been provided in (Aubin et al., 2019; Cocola et al., 2020). In particular, in (Aubin et al., 2019), the assumptions are similar to those in (Aubin et al., 2020), except that the neural network is assumed to have exactly one hidden layer. The Bayes-optimal performance is analyzed, and it is shown that the AMP algorithm can attain this optimal performance. In addition, the authors of (Aubin et al., 2019) propose the linearized approximate message passing (LAMP) algorithm, which is a spectral algorithm specifically designed for single-layer feedforward neural networks with no bias terms. The authors show its superiority to classical PCA via numerical results on the Fashion-MNIST dataset. In (Cocola et al., 2020), the same assumptions are made as those in (Hand et al., 2018) on the neural network, and the authors demonstrate the benign global geometry for a nonlinear least squares objective. Similarly to (Hand et al., 2018), the objective is minimized over Rk using a gradient descent algorithm, which can get stuck in local minima for general global geometries. 1.2 CONTRIBUTIONS The main contributions of this paper are as follows: We study eigenvalue problems with generative priors (including GPCA), and characterize the statistical rate of a quadratic estimator similar to (3) under suitable assumptions. We propose a variant of the classic power method, which uses an additional projection operation to ensure that the output of each iteration lies in the range of a generative model. We refer to our method as projected power method (PPower). We further show that under appropriate conditions (most notably, assuming exact projections are possible), PPower Published as a conference paper at ICLR 2022 obtains a solution achieving a statistical rate that is conjectured to be optimal in (Aubin et al., 2019; Cocola et al., 2020) for spiked matrix models. For the spiked matrix and phase retrieval models, we perform numerical experiments on image datasets, and demonstrate that when the number of samples is relatively small compared to the ambient dimension, PPower leads to significantly better performance compared to the classic power method and TPower. Compared to the above-mentioned works that use generative models, we make no assumption on expansiveness, Gaussianity, and offsets for the generative model, and we consider a data model that simultaneously encompasses both spiked matrix and phase retrieval models, among others. 1.3 NOTATION We use upper and lower case boldface letters to denote matrices and vectors respectively. We write [N] = {1, 2, , N} for a positive integer N, and we use IN to denote the identity matrix in RN N. A generative model is a function G : D Rn, with latent dimension k, ambient dimension n, and input domain D Rk. We focus on the setting where k n. For a set S Rk and a generative model G : Rk Rn, we write G(S) = {G(z) : z S}. We use X 2 2 to denote the spectral norm of a matrix X. 
We define the ℓq-ball Bk q (r) := {z Rk : z q r} for q [0, + ]. Sn 1 := {x Rn : x 2 = 1} represents the unit sphere in Rn. The symbols C, C , C are absolute constants whose values may differ from line to line. 2 PROBLEM SETUP In this section, we formally introduce the problem, and overview some important assumptions that we adopt. Except where stated otherwise, we will focus on the following setting: We have a matrix V Rn n satisfying V = V + E, (4) where E is a perturbation matrix, and V is assumed to be positive semidefinite (PSD). For PCA and its constrained variants, V and V can be thought of as the empirical and population covariance matrices, respectively. We have an L-Lipschitz continuous generative model G : Bk 2(r) Rn. For convenience, similarly to that in (Liu et al., 2020), we assume that Range(G) Sn 1. Remark 1. For a general (unnormalized) L-Lipschitz continuous generative model G, we can instead consider a corresponding normalized generative model G : D Sn 1 as in (Liu et al., 2021b), where D := {z Bk 2(r) : G(z) 2 > Rmin} for some Rmin > 0, and G(z) = G(z) G(z) 2 . Then, the Lipschitz constant of G becomes L/Rmin. For a d-layer neural network, we typically have L = nΘ(d) (Bora et al., 2017). Thus, we can set Rmin to be as small as 1/nΘ(d) without changing the scaling laws, which makes the dependence on Rmin very mild. We aim to solve the following eigenvalue problem with a generative prior:2 ˆv := max w Rn w T Vw s.t. w Range(G). (5) Note that since Range(G) Sn 1, we do not need to impose the constraint w 2 = 1. Since V is not restricted to being an empirical covariance matrix, (5) is more general than GPCA in (3). However, we slightly abuse terminology and also refer to (5) as GPCA. To approximately solve (5), we use a projected power method (PPower), which is described by the following iterative procedure:3 w(t+1) = PG Vw(t) , (6) 2To find the top r rather than top one principal eigenvectors that are in the range of a generative model, we may follow the common approach to use the iterative deflation method for PCA/SPCA: Subsequent principal eigenvectors are derived by recursively removing the contribution of the principal eigenvectors that are calculated already under the generative model constraint. See for example (Mackey, 2008). 3In similar iterative procedures, some works have proposed to replace V by V + ρIn for some ρ R to improve convergence, e.g., see Deshpande et al. (2014). Published as a conference paper at ICLR 2022 Algorithm 1 A projected power method for GPCA (PPower) Input: V, number of iterations T, pre-trained generative model G, initial vector w(0) Procedure: Iterate w(t+1) = PG Vw(t) for t {0, 1, . . . , T 1}, and return w(T ) where PG( ) is the projection function onto G(Bk 2(r)),4 and the initialization vector w(0) may be chosen either manually or randomly, e.g., uniform over Sn 1. Often the initialization vector w(0) plays an important role and we may need a careful design for it. For example, for phase retrieval with generative models, as mentioned in (Liu et al., 2021b, Section V), we may choose the column corresponding to the largest diagonal entry of V as the starting point. See also (Yuan & Zhang, 2013, Section 3) for a discussion on the initialization strategy for TPower devised for SPCA. We present the algorithm corresponding to (6) in Algorithm 1. Remark 2. 
To tackle generalized eigenvalue problems encountered in some specific applications, there are variants of the projected power method, which combine a certain power iteration with additional operations to ensure sparsity or enforce other constraints. These applications include but not limited to sparse PCA (Journ ee et al., 2010; Yuan & Zhang, 2013), phase synchronization (Boumal, 2016; Liu et al., 2017), the hidden clique problem (Deshpande & Montanari, 2015), the joint alignment problem (Chen & Cand es, 2018), and coneconstrained PCA (Deshpande et al., 2014; Yi & Neykov, 2020). For example, under the simple spiked Wigner model (Perry et al., 2018) for the observed data matrix V with the underlying signal being assumed to lie in a convex cone, the authors of (Deshpande et al., 2014) show that cone-constrained PCA can be computed efficiently via a generalized projected power method. In general, the range of a Lipschitz-continuous generative model is non-convex and not a cone. In addition, we consider a matrix model that is more general than the spiked Wigner model. Although it is not needed for our main results, we first state a lemma (proved in Appendix A) that establishes a monotonicity property with minimal assumptions, only requiring that V is PSD; see also Proposition 3 of Yuan & Zhang (2013) for an analog in sparse PCA. By comparison, our main results in Section 4 will make more assumptions, but will also provide stronger guarantees. Note that the PSD assumption holds, for example, when E = 0, or when V is a sample covariance matrix. Lemma 1. For any x Rn, let Q(x) = x T Vx. Then, if V is PSD, the sequence {Q(w(t))}t>0 for w(t) in (6) is monotonically non-decreasing. 3 SPECIALIZED DATA MODELS AND EXAMPLES In this section, we make more specific assumptions on V = V + E, starting with the following. Assumption 1 (Assumption on V). Assume that V is PSD with eigenvalues λ1 > λ2 . . . λn 0. We use x (a unit vector) to represent the eigenvector of V that corresponds to λ1. In the following, it is useful to think of x is being close to the range of the generative model G. In the special case of (3), letting m be the number of samples, it is natural to derive that the upper bound of E 2 2 grows linearly in (n/m)b for some positive constant b such as 1 2 or 1 (with high probability; see, e.g., (Vershynin, 2010, Corollary 5.35)). In the following, we consider general scenarios with V depending on m samples (see below for specific examples). Similarly to (Yuan & Zhang, 2013), we may consider a restricted version of E 2 2, leading to the following. Assumption 2 (Assumption on E). Let S1, S2 be two (arbitrary) finite sets in Rn satisfying m = Ω(log(|S1| |S2|)). Then, we have for all s1 S1 and s2 S2 that s T 1 Es2 C log(|S1| |S2|) m s1 2 s2 2, (7) where C is an absolute constant. In addition, we have E 2 2 = O(n/m).5 4That is, for any x Rn, PG(x) := arg minw Range(G) w x 2. We will implicitly assume that the projection step can be performed accurately, e.g., (Deshpande et al., 2014; Shah & Hegde, 2018; Peng et al., 2020), though in practice approximate methods might be needed, e.g., via gradient descent (Shah & Hegde, 2018) or GAN-based projection methods (Raj et al., 2019). 5For the spectral norm of E, one often expects an even tighter bound O( p n/m), but we use O(n/m) to simplify the analysis of our examples. 
Moreover, at least under the typical scaling where L is polynomial Published as a conference paper at ICLR 2022 The following examples show that when the number of measurements is sufficiently large, the data matrices corresponding to certain spiked matrix and phase retrieval models satisfy the above assumptions with high probability. Short proofs are given in Appendix B for completeness. Example 1 (Spiked covariance model). In the spiked covariance model (Johnstone & Lu, 2009; Deshpande & Montanari, 2016), the observed vectors x1, x2, . . . , xm Rn are of the form βquq,isq + zi, (8) where s1, . . . , sr Rn are orthonormal vectors that we want to estimate, while zi N(0, In) and uq,i N(0, 1) are independent and identically distributed. In addition, β1, . . . , βr are positive constants that dictate the signal-to-noise ratio (SNR). To simplify the exposition, we focus on the rank-one case and drop the subscript q [r]. Let i=1 (xix T i In), (9) and V = E[V] = βss T .6 Then, V satisfies Assumption 1 with λ1 = β > 0, λ2 = . . . = λn = 0, and x = s. In addition, letting E = V V, the Bernstein-type inequality (Vershynin, 2010, Proposition 5.10) for the sum of sub-exponential random variables yields that for any finite sets S1, S2 Rn, when m = Ω log(|S1| |S2|)), with probability 1 e Ω(log(|S1| |S2|)), E satisfies (7) in Assumption 2. Moreover, standard concentration arguments give E 2 2 = O(n/m) with probability 1 e Ω(n). Remark 3. We can also consider the simpler spiked Wigner model (Perry et al., 2018; Chung & Lee, 2019) where V = βss T + 1 n H, with the signal s being a unit vector, β > 0 being an SNR parameter, and H Rn n being a symmetric matrix with entries drawn i.i.d. (up to symmetry) from N(0, 1). In this case, when m = n is sufficiently large, with high probability, V := E[V] = βss T and E := V V similarly satisfy Assumptions 1 and 2 respectively. Example 2 (Phase retrieval). Let A Rm n be a matrix having i.i.d. N(0, 1) entries, and let a T i be the i-th row of A. For some unit vector s, suppose that the observed vector is y = |As|, where the absolute value is applied element-wise.7 We construct the weighted empirical covariance matrix as follows (Zhang et al., 2017; Liu et al., 2021b): yiaia T i 1{l l > 1 are positive constants, and for g N(0, 1), γ := E |g|1{l<|g| 0, λ2 = . . . = λn = 0, and x = s. In addition, letting E = V V, we have similarly to Example 1 that E satisfies Assumption 2 with high probability. 4 MAIN RESULTS The following theorem concerns globally optimal solutions of (5). The proof is given in Appendix D. Theorem 1. Let V = V+E with Assumptions 1 and 2 being satisfied by V and E respectively, and let x G := PG( x) = arg minw Range(G) w x 2. Suppose that ˆv is a globally optimal solution to (5). Then, for any δ (0, 1), we have ˆvˆv T x x T F = O q ( λ1 + ϵn) x x G 2 where ϵn = O q in n (Bora et al., 2017), the upper bound for E 2 2 can be easily relaxed to O (n/m)b for any positive constant b, without affecting the scaling of our derived statistical rate. 6To avoid non-essential complications, β is typically assumed to be known (Johnstone & Lu, 2009). 7Without loss of generality, we assume that s is a unit vector. For a general signal s, we may instead focus on estimating s = s/ s 2, and simply use 1 m Pm i=1 yi to approximate s 2. Published as a conference paper at ICLR 2022 We have stated this result as an upper bound on ˆvˆv T x x T F, which intuitively measures the distance between the 1D subspaces spanned by ˆv and x. 
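For concreteness, this metric can be related to the Euclidean distance using only the fact that $\hat{v}$ and $\bar{x}$ are unit vectors (the same identity appears as (73) in Appendix D):
$$\|\hat{v}\hat{v}^T - \bar{x}\bar{x}^T\|_F^2 = \|\hat{v}\|_2^4 - 2(\bar{x}^T\hat{v})^2 + \|\bar{x}\|_2^4 = 2\big(1 - (\bar{x}^T\hat{v})^2\big) = \|\hat{v} - \bar{x}\|_2^2\,\big(1 + \bar{x}^T\hat{v}\big) = \|\hat{v} + \bar{x}\|_2^2\,\big(1 - \bar{x}^T\hat{v}\big),$$
since $\|\hat{v} \mp \bar{x}\|_2^2 = 2(1 \mp \bar{x}^T\hat{v})$. In particular, when $\bar{x}^T\hat{v} \geq 0$, the subspace distance and $\|\hat{v} - \bar{x}\|_2$ differ by at most a factor of $\sqrt{2}$.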
Note, however, that for any two unit vectors w1, w2 with w T 1 w2 0, the distances w1 w2 2 and w1w T 1 w2w T 2 F are equivalent up to constant factors, whereas if w T 1 w2 0, then a similar statement holds for w1 + w2 2. See Appendix C for the precise statements. In Theorem 1, the final term quantifies the effect of representation error. When there is no such error (i.e., x Range(G)), under the scaling λ1 λ2 = Θ(1), L = nΩ(1), and δ = O(1/n), Theorem 1 simplifies to ˆvˆv T x x T F = O( q m ). This provides a natural counterpart to the statistical rate of order q m for SPCA mentioned in Section 1.1. Before providing our main theorem for PPower described in (6), we present the following important lemma, whose proof is presented in Appendix E. To simplify the statement of the lemma, we fix δ to be O(1/n), though a more general form analogous to Theorem 1 is also possible. Lemma 2. Let V = V + E with Assumptions 1 and 2 being satisfied by V and E respectively, and further assume that x Range(G). Let γ = λ2/ λ1 with λ1 = Θ(1). Then, for all s Range(G) satisfying s T x > 0, we have PG(Vs) x 2 2 γ s x 2 k log(n Lr) where C is an absolute constant. Remark 4. The assumption s T x > 0 will be particularly satisfied when the range of G only contains nonnegative vectors. As mentioned in various works studying nonnegative SPCA (Zass & Shashua, 2007; Sigg & Buhmann, 2008; Asteris et al., 2014), for several practical fields such as economics, bioinformatics, and computer vision, it is natural to assume that the underlying signal has no negative entries. More generally, the assumption s T x > 0 can be removed if we additionally have that x is also contained in the range of G. For this case, when s T x < 0, we can instead derive an upper bound for ˆs + x 2. Based on Lemma 2, we have the following theorem, whose proof is given in Appendix F. Theorem 2. Suppose that the assumptions on the data model V = V + E are the same as those in Lemma 2, and assume that there exists t0 N such that x T w(t0) = 2 γ+ν with 2 γ+ν 1 τ, where γ = λ2/ λ1 [0, 1), and ν, τ are both positive and scale as Θ(1). Let µ0 = 2 γ x T w(t0) = 2 γ 2 γ+ν < 1, and in addition, suppose that m Cν,τ k log(n Lr) with Cν,τ > 0 being large enough. Then, we have after 0 = O log m k log(n Lr) iterations of PPower (beyond t0) that w(t) x 2 C (1 µ0)ν k log(n Lr) i.e., this equation holds for all t T0 := t0 + 0. Moreover, if γ = 0 then 0 1, whereas if γ = Θ(1), we have exponentially fast convergence via the following contraction property: There exists a constant ξ (0, 1) such that for t [t0, T0), it holds that w(t+1) x 2 (1 ξ) w(t) x 2. (14) Regarding the assumption x T w(t0) 2 γ + ν, we note that when t0 = 0, this condition can be viewed as having a good initialization. For both Examples 1 and 2, we have γ = 0. Thus, for the spiked covariance and phase retrieval models corresponding to these examples, the assumption x T w(t0) 2 γ +ν reduces to x T w(t0) ν for a sufficiently small positive constant ν, which results in a mild assumption. Such an assumption is also required for the projected power method devised for cone-constrained PCA under the simple spiked Wigner model, with the underlying signal being assumed to lie in a convex cone; see (Deshpande et al., 2014, Theorem 3). Despite using a similar assumption on the initialization, our proof techniques are significantly different from Deshpande et al. (2014); see Appendix G for discussion. 
When $L$ is polynomial in $n$, Theorem 2 reveals that we have established conditions under which PPower in (6) converges exponentially fast to a point achieving the statistical rate of order $\sqrt{\frac{k \log(nLr)}{m}}$. Based on the minimax rates for SPCA (Vu & Lei, 2012; Birnbaum et al., 2013) and the information-theoretic lower bounds for CS with generative models (Liu & Scarlett, 2020b; Kamath et al., 2020), the optimal rate for GPCA is naturally conjectured to be of the same order $\sqrt{\frac{k \log(nLr)}{m}}$. We highlight that Theorem 2 partially addresses the computational-to-statistical gap (e.g., see (Wang et al., 2016; Hand et al., 2018; Aubin et al., 2019; Cocola et al., 2020)) for spiked matrix recovery and phase retrieval under a generative prior, though closing it completely would require efficiently finding a good initialization and addressing the assumption of exact projections.

Perhaps the main caveat to Theorem 2 is that it assumes the projection step can be performed exactly. However, this is a standard assumption in analyses of projected gradient methods, e.g., see (Shah & Hegde, 2018), and both gradient-based projection and GAN-based projection have been shown to be highly effective in practice (Shah & Hegde, 2018; Raj et al., 2019).

5 EXPERIMENTS

In this section, we experimentally study the performance of Algorithm 1 (PPower). We note that these experiments are intended as a simple proof of concept rather than seeking to be comprehensive, as our contributions are primarily theoretical. We compare with the truncated power method (TPower) devised for SPCA proposed in (Yuan & Zhang, 2013, Algorithm 1) and the vanilla power method (Power) that performs the iterative procedure $w^{(t+1)} = \frac{Vw^{(t)}}{\|Vw^{(t)}\|_2}$. For a fair comparison, PPower, TPower, and Power all use the same initial vector. Specifically, as mentioned in (Liu et al., 2021b, Section V), we choose the initialization vector $w^{(0)}$ as the column of $V$ that corresponds to its largest diagonal entry. For all three algorithms, the total number of iterations $T$ is set to 30. To compare the performance across algorithms, we use the scale-invariant cosine similarity metric defined as $\mathrm{Cossim}(\bar{x}, w^{(T)}) := \frac{\langle \bar{x}, w^{(T)} \rangle}{\|\bar{x}\|_2 \|w^{(T)}\|_2}$, where $\bar{x}$ is the ground-truth signal to estimate, and $w^{(T)}$ denotes the output vector of the algorithm.

The experiments are performed on the MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and CelebA (Liu et al., 2015) datasets, with the numerical results for the Fashion-MNIST and CelebA datasets being presented in Appendices H and I. The MNIST dataset consists of 60,000 images of handwritten digits. The size of each image is 28 × 28, and thus n = 784. To reduce the impact of local minima, we perform 10 random restarts, and choose the best among these. The cosine similarity is averaged over the test images, and also over these 10 random restarts. The generative model G is set to be a pre-trained variational autoencoder (VAE) model with latent dimension k = 20. We use the VAE model trained by the authors of (Bora et al., 2017) directly, for which the encoder and decoder are both fully connected neural networks with two hidden layers, with the architecture being 20–500–500–784. The VAE is trained by the Adam optimizer with a mini-batch size of 100 and a learning rate of 0.001. The projection step $P_G(\cdot)$ is solved by the Adam optimizer with a learning rate of 0.03 and 200 steps.
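To make the above concrete, the following PyTorch-style sketch shows one way to implement the gradient-based projection and the PPower loop just described. It is an illustrative example under stated assumptions (the callable `G` stands in for the pre-trained decoder, normalized onto the unit sphere as in Remark 1, and the hyperparameters mirror the ones listed above), not the authors' TensorFlow implementation.

```python
import torch

def project_onto_range(G, x, k=20, steps=200, lr=0.03):
    """Approximate P_G(x) = argmin_{w in Range(G)} ||w - x||_2 by running Adam
    over the latent code z. G: R^k -> R^n is assumed to be a differentiable
    decoder; its output is normalized onto the unit sphere."""
    z = torch.zeros(k, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        w = G(z)
        w = w / w.norm()                # keep the candidate on the unit sphere
        loss = (w - x).pow(2).sum()     # squared distance to the point being projected
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        w = G(z)
        return w / w.norm()

def ppower(V, G, w0, T=30, k=20):
    """Projected power method (Algorithm 1): iterate w <- P_G(V w) for T steps."""
    w = w0 / w0.norm()
    for _ in range(T):
        w = project_onto_range(G, V @ w, k=k)
    return w
```

For instance, under the spiked covariance model of Example 1, one would form $V = \frac{1}{m}\sum_{i=1}^m (x_i x_i^T - I_n)$ from the samples as in (9), take $w^{(0)}$ to be the column of $V$ with the largest diagonal entry, and call `ppower(V, G, w0)`.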
In each iteration of TPower, the calculated entries are truncated to zero except for the largest $q$ entries, where $q \in \mathbb{N}$ is a tuning parameter. Since for TPower, $q$ is usually selected as an integer larger than the true sparsity level, and since it is unlikely that an image of the MNIST dataset can be well approximated by a $k$-sparse vector with $k = 20$, we choose a relatively large $q$, namely $q = 150$. Similarly to (Bora et al., 2017) and other related works, we only report the results on a test set that is unseen by the pre-trained VAE model, i.e., the training of G and the PPower computations do not use common data.8

8All experiments are run using Python 3.6 and TensorFlow 1.5.0, with an NVIDIA GeForce GTX 1080 Ti 11GB GPU. The corresponding code is available at https://github.com/liuzq09/GenerativePCA.

1. Spiked covariance model (Example 1): The numerical results are shown in Figures 1 and 2. We observe from Figure 1 that Power and TPower attain poor reconstructions, while the generative-prior-based method PPower attains significantly better reconstructions. To illustrate the effect of the sample size m, we fix the SNR parameter β = 1 and vary m in {100, 200, 300, 400, 500}. In addition, to illustrate the effect of the SNR parameter β, we fix m = 300, and vary β in {0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4}. From Figure 2, we observe that for these settings of m and β, PPower always leads to a much higher cosine similarity compared to Power and TPower, which is natural given the more precise modeling assumptions used.

Figure 1: Examples of reconstructed images of the MNIST dataset for the spiked covariance model (columns: Original, Power, TPower, PPower). (a) β = 1 and m = 200; (b) β = 2 and m = 100.

Figure 2: Quantitative comparisons of the performance of Power, TPower and PPower according to the cosine similarity for the MNIST dataset, under both the spiked covariance model (Left/Middle) and phase retrieval model (Right). (a) Fixing β = 1 and varying m; (b) fixing m = 300 and varying β; (c) analog of (a) for phase retrieval.

2. Phase retrieval (Example 2): The results are shown in Figure 2 (Right) and Figure 3. Again, we observe that PPower significantly outperforms Power and TPower. In particular, for sparse phase retrieval, when performing experiments on image datasets, even in the noiseless setting, solving an eigenvalue problem similar to (5) can typically only serve as a spectral initialization step, with a subsequent iterative algorithm being required to refine the initial guess. In view of this, it is notable that for phase retrieval with generative priors, PPower can return meaningful reconstructed images for m = 200, which is small compared to n = 784.

6 CONCLUSION

We have proposed a quadratic estimator for eigenvalue problems with generative models, and we showed that this estimator attains a statistical rate of order $\sqrt{\frac{k \log(nLr)}{m}}$. We provided a projected power method to efficiently solve (modulo the complexity of the projection step) the corresponding optimization problem, and showed that our method converges exponentially fast to a point achieving a statistical rate of order $\sqrt{\frac{k \log(nLr)}{m}}$ under suitable conditions.

Acknowledgment. J.S.
was supported by the Singapore National Research Foundation (NRF) under grant R-252-000-A74-281, and S.G. was supported in part by the MOE grants R-146-000-250-133, R-146-000-312-114, and MOE-T2EP20121-0013. Original Power TPower PPower Original Power TPower PPower (a) m = 200 (b) m = 400 Figure 3: Examples of reconstructed images of the MNIST dataset for phase retrieval. Published as a conference paper at ICLR 2022 Orly Alter, Patrick O Brown, and David Botstein. Singular value decomposition for genome-wide expression data processing and modeling. PNAS, 97(18):10101 10106, 2000. Theodore Wilbur Anderson. An introduction to multivariate statistical analysis. Wiley New York, 1962. Megasthenis Asteris, Dimitris S Papailiopoulos, and George N Karystinos. Sparse principal component of a rank-deficient matrix. In Int. Symp. Inf. Theory (ISIT), pp. 673 677. IEEE, 2011. Megasthenis Asteris, Dimitris Papailiopoulos, and Alexandros Dimakis. Nonnegative sparse PCA with provable guarantees. In Int. Conf. Mach. Learn. (ICML), pp. 1728 1736. PMLR, 2014. Benjamin Aubin, Bruno Loureiro, Antoine Maillard, et al. The spiked matrix model with generative priors. Conf. Neur. Inf. Proc. Sys. (Neur IPS), 32:8366 8377, 2019. Benjamin Aubin, Bruno Loureiro, Antoine Baker, et al. Exact asymptotics for phase retrieval and compressed sensing with random generative priors. In Math. Sci. Mach. Learn. (MSML), pp. 55 73. PMLR, 2020. Aharon Birnbaum, Iain M Johnstone, Boaz Nadler, and Debashis Paul. Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Stat., 41(3):1055, 2013. Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. In Int. Conf. Mach. Learn. (ICML), pp. 537 546, 2017. Nicolas Boumal. Nonconvex phase synchronization. SIAM J. Optim., 26(4):2355 2377, 2016. T Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. Ann. Stat., 41(6):3074 3110, 2013. Emmanuel J Cand es, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Trans. Inf. Theory, 61(4):1985 2007, 2015. Xiaojun Chang, Feiping Nie, Yi Yang, et al. Convex sparse PCA for unsupervised feature learning. ACM T. Knowl. Discov. D., 11(1):1 16, 2016. Yuxin Chen and Emmanuel J Cand es. The projected power method: An efficient algorithm for joint alignment from pairwise differences. Comm. Pure Appl. Math., 71(8):1648 1714, 2018. Hye Won Chung and Ji Oon Lee. Weak detection of signal in the spiked Wigner model. In Int. Conf. Mach. Learn. (ICML), pp. 1233 1241. PMLR, 2019. Jorio Cocola, Paul Hand, and Vlad Voroninski. Nonasymptotic guarantees for spiked matrix recovery with generative priors. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), volume 33, 2020. Alexandre d Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert RG Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Rev., 49(3):434 448, 2007. Alexandre d Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res., 9(7), 2008. Yash Deshpande and Andrea Montanari. Finding hidden cliques of size p N/e in nearly linear time. Found. Comput. Math., 15(4):1069 1128, 2015. Yash Deshpande and Andrea Montanari. Sparse PCA via covariance thresholding. J. Mach. Learn. Res., 17(1):4913 4953, 2016. Yash Deshpande, Andrea Montanari, and Emile Richard. Cone-constrained principal component analysis. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), pp. 2717 2725, 2014. 
Manik Dhar, Aditya Grover, and Stefano Ermon. Modeling sparse deviations for compressed sensing using generative models. In Int. Conf. Mach. Learn. (ICML), pp. 1214 1223. PMLR, 2018. Published as a conference paper at ICLR 2022 Chris Ding and Xiaofeng He. K-means clustering via principal component analysis. In Int. Conf. Mach. Learn. (ICML), pp. 29, 2004. David Foster. Generative Deep Learning : Teaching Machines to Paint, Write, Compose and Play. O Reilly Media, Inc, USA, 2019. Peter JB Hancock, A Mike Burton, and Vicki Bruce. Face processing: Human perception and principal components analysis. Mem. Cogn., 24(1):26 40, 1996. Paul Hand, Oscar Leong, and Vladislav Voroninski. Phase retrieval under a generative prior. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), pp. 9154 9164, 2018. Reinhard Heckel and Paul Hand. Deep decoder: Concise image representations from untrained non-convolutional networks. In Int. Conf. Learn. Repr. (ICLR), 2019. Matthias Hein and Thomas B uhler. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), pp. 847 855, 2010. Rakib Hyder, Viraj Shah, Chinmay Hegde, and M Salman Asif. Alternating phase projected gradient descent with generative priors for solving compressive phase retrieval. In IEEE Int. Conf. Acoust. Sp. Sig. Proc. (ICASSP), pp. 7705 7709, 2019. Gauri Jagatap and Chinmay Hegde. Algorithmic guarantees for inverse imaging with untrained network priors. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), volume 32, 2019. A Jalal, S Karmalkar, A Dimakis, and E Price. Instance-optimal compressed sensing via posterior sampling. In Int. Conf. Mach. Learn. (ICML), 2021. Ajil Jalal, Liu Liu, Alexandros G Dimakis, and Constantine Caramanis. Robust compressed sensing using generative models. Conf. Neur. Inf. Proc. Sys. (Neur IPS), 2020. Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. J. Am. Stat. Assoc., 104(486):682 693, 2009. Ian T. Jolliffe. Principal component analysis. Springer-Verlag, 1986. Ian T Jolliffe, Nickolay T Trendafilov, and Mudassir Uddin. A modified principal component technique based on the Lasso. J. Comput. Graph. Stat., 12(3):531 547, 2003. Michel Journ ee, Yurii Nesterov, Peter Richt arik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res., 11(2), 2010. Sungkyu Jung and J Stephen Marron. PCA consistency in high dimension, low sample size context. Ann. Stat., 37(6B):4104 4130, 2009. Akshay Kamath, Sushrut Karmalkar, and Eric Price. On the power of compressed sensing with generative models. In Int. Conf. Mach. Learn. (ICML), pp. 5101 5109, 2020. Volodymyr Kuleshov. Fast algorithms for sparse principal component analysis based on Rayleigh quotient iteration. In Int. Conf. Mach. Learn. (ICML), pp. 1418 1425. PMLR, 2013. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278 2324, 1998. Huikang Liu, Man-Chung Yue, and Anthony Man-Cho So. On the estimation performance and convergence rate of the generalized power method for phase synchronization. SIAM J. Optim., 27 (4):2426 2446, 2017. Zhaoqiang Liu and Jonathan Scarlett. The generalized Lasso with nonlinear observations and generative priors. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), volume 33, 2020a. Zhaoqiang Liu and Jonathan Scarlett. 
Information-theoretic lower bounds for compressive sensing with generative models. IEEE J. Sel. Areas Inf. Theory, 1(1):292 303, 2020b. Published as a conference paper at ICLR 2022 Zhaoqiang Liu and Vincent YF Tan. The informativeness of k-means for learning mixture models. IEEE Trans. Inf. Theory, 65(11):7460 7479, 2019. Zhaoqiang Liu, Selwyn Gomes, Avtansh Tiwari, and Jonathan Scarlett. Sample complexity bounds for 1-bit compressive sensing and binary stable embeddings with generative priors. In Int. Conf. Mach. Learn. (ICML), 2020. Zhaoqiang Liu, Subhroshekhar Ghosh, Jun Han, and Jonathan Scarlett. Robust 1bit compressive sensing with partial Gaussian circulant matrices and generative priors. https://arxiv.org/2108.03570, 2021a. Zhaoqiang Liu, Subhroshekhar Ghosh, and Jonathan Scarlett. Towards sample-optimal compressive phase retrieval with sparse and generative priors. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), 2021b. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Int. Conf. Comput. Vis. (ICCV), pp. 3730 3738, 2015. Zongming Ma. Sparse principal component analysis and iterative thresholding. Ann. Stat., 41(2): 772 801, 2013. Lester W Mackey. Deflation methods for sparse PCA. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), volume 21, pp. 1017 1024, 2008. Baback Moghaddam, Yair Weiss, and Shai Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. Conf. Neur. Inf. Proc. Sys. (Neur IPS), 18:915, 2006. Boaz Nadler. Finite sample approximation results for principal component analysis: A matrix perturbation approach. Ann. Stat., 36(6):2791 2817, 2008. Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. IEEE Trans. Sig. Proc., 63(18):4814 4826, 2015. Thanh V Nguyen, Gauri Jagatap, and Chinmay Hegde. Provable compressed sensing with generative priors via Langevin dynamics. https://arxiv.org/2102.12643, 2021. Gregory Ongie, Ajil Jalal, Christopher A Metzler, et al. Deep learning techniques for inverse problems in imaging. IEEE J. Sel. Areas Inf. Theory, 1(1):39 56, 2020. Dimitris Papailiopoulos, Alexandros Dimakis, and Stavros Korokythakis. Sparse PCA through lowrank approximations. In Int. Conf. Mach. Learn. (ICML), pp. 747 755. PMLR, 2013. Pei Peng, Shirin Jalali, and Xin Yuan. Solving inverse problems via auto-encoders. IEEE J. Sel. Areas Inf. Theory, 1(1):312 323, 2020. Amelia Perry, Alexander S Wein, Afonso S Bandeira, and Ankur Moitra. Optimality and suboptimality of PCA I: Spiked random matrix models. Ann. Stat., 46(5):2416 2451, 2018. Ankit Raj, Yuqi Li, and Yoram Bresler. GAN-based projector for faster recovery with convergence guarantees in linear inverse problems. In IEEE/CVF Int. Conf. Comp. Vision (ICCV), 2019. Viraj Shah and Chinmay Hegde. Solving linear inverse problems using GAN priors: An algorithm with provable guarantees. In Int. Conf. Acoust. Sp. Sig. Proc. (ICASSP), pp. 4609 4613. IEEE, 2018. Fahad Shamshad and Ali Ahmed. Compressed sensing-based robust phase retrieval via deep generative priors. IEEE Sens. J., 21(2):2286 2298, 2020. Haipeng Shen and Jianhua Z Huang. Sparse principal component analysis via regularized low rank matrix approximation. J. Multivar. Anal., 99(6):1015 1034, 2008. Christian D Sigg and Joachim M Buhmann. Expectation-maximization for sparse and non-negative PCA. In Int. Conf. Mach. Learn. (ICML), pp. 960 967, 2008. Dave Van Veen, Ajil Jalal, Mahdi Soltanolkotabi, et al. Compressed sensing with deep image prior and learned regularization. 
https://arxiv.org/1806.06438, 2018. Published as a conference paper at ICLR 2022 Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. https://arxiv.org/abs/1011.3027, 2010. Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Vincent Vu and Jing Lei. Minimax rates of estimation for sparse PCA in high dimensions. In Int. Conf. Artif. Intell. Stat. (AISTATS), pp. 1278 1286. PMLR, 2012. Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), volume 26, 2013. Tengyao Wang, Quentin Berthet, and Richard J Samworth. Statistical and computational trade-offs in estimation of sparse principal components. Ann. Stat., 44(5):1896 1930, 2016. Zhaoran Wang, Huanran Lu, and Han Liu. Tighten after relax: Minimax-optimal sparse PCA in polynomial time. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), 2014. Xiaohan Wei, Zhuoran Yang, and Zhaoran Wang. On the statistical rate of nonlinear recovery in generative models with heavy-tailed data. In Int. Conf. Mach. Learn. (ICML), pp. 6697 6706, 2019. Jay Whang, Qi Lei, and Alexandros G Dimakis. Compressed sensing with invertible generative models and dependent noise. https://arxiv.org/2003.08089, 2020. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. https://arxiv.org/1708.07747, 2017. Yufei Yi and Matey Neykov. Non-sparse PCA in high dimensions via cone projected power iteration. https://arxiv.org/2005.07587, 2020. Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res., 14(4), 2013. Ron Zass and Amnon Shashua. Nonnegative sparse PCA. In Conf. Neur. Inf. Proc. Sys. (Neur IPS), pp. 1561 1568, 2007. Huishuai Zhang, Yi Zhou, Yingbin Liang, and Yuejie Chi. A nonconvex approach for phase retrieval: Reshaped Wirtinger flow and incremental algorithms. J. Mach. Learn. Res., 18, 2017. Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. J. Comput. Graph. Stat., 15(2):265 286, 2006. Published as a conference paper at ICLR 2022 APPENDIX (GENERATIVE PRINCIPAL COMPONENT ANALYSIS, LIU/LIU/GHOSH/HAN/SCARLETT, ICLR 2022) A PROOF OF LEMMA 1 (NON-DECREASING PROPERTY OF Q) Since w(t+1) = PG Vw(t) and w(t) Range(G), we have Vw(t) w(t+1) 2 Vw(t) w(t) 2, (15) and since w(t+1) 2 = w(t) 2 = 1, expanding the square gives D Vw(t), w(t+1)E Q(w(t)). (16) Then, we obtain Q(w(t+1)) = D Vw(t+1), w(t+1)E (17) = Q(w(t+1) w(t)) + 2 D V(w(t+1) w(t)), w(t)E + Q(w(t)) (18) 2 D V(w(t+1) w(t)), w(t)E + Q(w(t)) (19) Q(w(t)), (20) where (18) follows by writing w(t+1) = w(t) + (w(t+1) w(t)) and expanding, (19) follows from the assumption that V is PSD, and (20) follows from (16). B PROOFS FOR SPIKED MATRIX AND PHASE RETRIEVAL EXAMPLES Before proceeding, we present the following standard definitions. Definition 1. A random variable X is said to be sub-Gaussian if there exists a positive constant C such that (E [|X|p])1/p C p for all p 1. The sub-Gaussian norm of a sub-Gaussian random variable X is defined as X ψ2 := supp 1 p 1/2 (E [|X|p])1/p. Definition 2. A random variable X is said to be sub-exponential if there exists a positive constant C such that (E [|X|p]) 1 p Cp for all p 1. 
The sub-exponential norm of X is defined as X ψ1 := supp 1 p 1 (E [|X|p]) The following lemma states that the product of two sub-Gaussian random variables is subexponential, regardless of the dependence between them. Lemma 3. (Vershynin, 2018, Lemma 2.7.7) Let X and Y be sub-Gaussian random variables (not necessarily independent). Then XY is sub-exponential, and satisfies XY ψ1 X ψ2 Y ψ2. (21) The following lemma provides a useful concentration inequality for the sum of independent subexponential random variables. Lemma 4. (Vershynin, 2010, Proposition 5.16) Let X1, . . . , XN be independent zero-mean subexponential random variables, and K = maxi Xi ψ1. Then for every α = [α1, . . . , αN]T RN and ϵ 0, it holds that i=1 αi Xi ϵ 2 exp c min ϵ2 K2 α 2 2 , ϵ K α where c > 0 is an absolute constant. In particular, with α = 1 N , . . . , 1 N T , we have i=1 Xi ϵ 2 exp c min Nϵ2 Published as a conference paper at ICLR 2022 The widely-used notion of an ϵ-net is introduced as follows. Definition 3. Let (X, d) be a metric space, and fix ϵ > 0. A subset S X is said be an ϵ-net of X if, for all x X, there exists some s S such that d(s, x) ϵ. The minimal cardinality of an ϵ-net of X, if finite, is denoted C(X, ϵ) and is called the covering number of X (at scale ϵ). The following lemma provides a useful upper bound for the covering number of the unit sphere. Lemma 5. (Vershynin, 2010, Lemma 5.2) The unit Euclidean sphere SN 1 equipped with the Euclidean metric satisfies for every ϵ > 0 that C(SN 1, ϵ) 1 + 2 The following lemma provides an upper bound for the spectral norm of a symmetric matrix. Lemma 6. (Vershynin, 2010, Lemma 5.4) Let X be a symmetric N N matrix, and let Cϵ be an ϵ-net of SN 1 for some ϵ [0, 1/2). Then, X 2 2 = sup r SN 1 | Xr, r | (1 2ϵ) 1 sup r Cϵ | Xr, r |. (25) With the above auxiliary results in place, we provide the proofs of Assumption 2 holding for the two examples described in Section 3. B.1 SPIKED COVARIANCE MODEL (EXAMPLE 1) As per Assumption 2, fix two finite signal sets S1 and S2. For r = 1, we have xi = βuis + zi and a direct calculation gives E[V] = V = βss T . Recall also that s 2 = 1, ui N(0, 1), and zi N(0, In). It follows that for any s1 S1, we have that s T 1 xi = βuis T s1 + z T i s1 is sub-Gaussian, with the sub-Gaussian norm being upper bounded by C( β + 1) s1 2. Similarly, we have for any s2 S2 that s T 2 xi ψ2 C( β + 1) s2 2. Applying Lemma 3, we deduce that (s T 1 xi)(s T 2 xi) is sub-exponential, with the sub-exponential norm being upper bounded by C2( β + 1)2 s1 2 s2 2. In addition, from (9) and V = βss T , we have s T 1 Es2 = s T 1 (V V)s2 (26) (x T i s1)(x T i s2) (s T 1 s2) + β(s T s1)(s T s2) , (27) and we observe that E[(x T i s1)(x T i s2)] = (s T 1 s2)+β(s T s1)(s T s2). Then, from Lemma 4, we obtain that for any t > 0 satisfying m = Ω(t), the following holds with probability 1 e Ω(t) (recall that C may vary from line to line): 1 m (x T i s1)(x T i s2) (s T 1 s2) + β(s T s1)(s T s2) C( p β + 1)2 s1 2 s2 2 where we note that the assumption m = Ω(t) ensures that the first term is dominant in the minimum in (23). Taking a union bound over all s1 S1 and s2 S2, and setting t = log(|S1| |S2|), we obtain with probability 1 e Ω(log(|S1| |S2|)) that (7) holds (with β being a fixed positive constant). Next, we bound |r T Er| for fixed r Sn 1, but this time consider t > 0 (different from the above t) satisfying t = Ω(m). 
In this case, we can follow the above analysis (with s1 and s2 both replaced by r), but the assumption t = Ω(m) means that when applying Lemma 4, the second term in the minimum in (23) is now the dominant one. As a result, for any t > 0 satisfying t = Ω(m), and any r Sn 1, we have with probability 1 e Ω(t) that (x T i r)2 1 + β(s T r)2 (29) Published as a conference paper at ICLR 2022 From Lemma 5, there exists an (1/4)-net C 1 4 of Sn 1 satisfying log C 1 4 n log 9. Taking a union bound over all r C 1 4 , and setting t = Cn, we obtain with probability 1 e Ω(n) that r T Er = O n Then, from Lemma 6, we have E 2 2 2 sup r C 1 r T Er = O n B.2 PHASE RETRIEVAL (EXAMPLE 2) Let W = 1 m Pm i=1 yiaia T i 1{l 0 satisfying m = Ω(t), with probability 1 e Ω(t), 1 m yi(a T i s1)(a T i s2)1{l 1, let the i-th column of U be ui. Then, we have ˆv T Vˆv = λ1 x T ˆv 2 + X i>1 λi u T i ˆv 2 (48) λ1 x T ˆv 2 + λ2 X u T i ˆv 2 (49) = λ1 x T ˆv 2 + λ2 1 x T ˆv 2 , (50) where we use x T ˆv 2 + P i>1 u T i ˆv 2 = 1 in (50). In addition, for any A Rn n and any s1, s2 Rn, we have s T 1 As1 s T 2 As2 = s1 + s2 T A s1 + s2 T A s1 + s2 = 2 s1 + s2 T A s1 + s2 In particular, when A is symmetric, we obtain s T 1 As1 s T 2 As2 = (s1 + s2)T A(s1 s2). (53) Let M be a (δ/L)-net of Bk 2(r); from (Vershynin, 2010, Lemma 5.2), we know that there exists such a net with log |M| k log 4Lr Since G is L-Lipschitz continuous, we have that G(M) is a δ-net of Range(G) = G(Bk 2(r)). We write ˆv = (ˆv x) + x, (55) where x G(M) satisfies ˆv x 2 δ. Suppose that x T ˆv 0; if this is not the case, we can use analogous steps to obtain an upper bound for x + ˆv 2 instead of x ˆv 2. We have 2 x ˆv 2 2 (56) = ( λ1 λ2) 1 x T ˆv (57) ( λ1 λ2) 1 x T ˆv 2 (58) = λ1 λ1 x T ˆv 2 + λ2 1 x T ˆv 2 (59) Published as a conference paper at ICLR 2022 = x T V x λ1 x T ˆv 2 + λ2 1 x T ˆv 2 (60) x T V x ˆv T Vˆv (61) = x T G Vx G + ( x + x G)T V( x x G) ˆv T Vˆv (62) x T G Vx G + 2 λ1 x x G 2 ˆv T Vˆv (63) = x T G(V E)x G + 2 λ1 x x G 2 ˆv T (V E)ˆv (64) ˆv T Eˆv x T GEx G + 2 λ1 x x G 2 (65) = x T E x + 2 ˆv x x T GEx G + 2 λ1 x x G 2 x T E x + 2δ E 2 2 x T GEx G + 2 λ1 x x G 2 (67) = 2 x + x G T E x + x G + 2δ E 2 2 + 2 λ1 x x G 2 δ m x x G 2 + 2δ E 2 2 + 2 λ1 x x G 2 (69) δ m ( x ˆv 2 + ˆv x 2 + x x G 2) + 2δ E 2 2 + 2 λ1 x x G 2 (70) δ m ˆv x 2 + O δn + O ( λ1 + ϵn) x x G 2 , (71) (57) (58) follow from x 2 = ˆv 2 = 1 and hence | x T ˆv| 1; (60) follows since ( λ1, x) are an eigenvalue-eigenvector pair for V with x 2 = 1; (61) follows from (50); (62) follows from (53) with V being symmetric and setting s1 = x, s2 = x G; (63) follows from x + x G 2 2 and V 2 2 = λ1; (64) follows since V = V E; (65) follows since ˆv is a globally optimal solution to (5) and x G Range(G); (66) follows from (52) with s1 = ˆv and s2 = x; (67) follows from (55) along with ˆv x 2 δ and ˆv + x 2 2; (68) follows from (52); (69) follows from Assumption 2 (with S1 = S2 being G(M) shifted by x G) and (54); (70) follows from the triangle inequality; (71) follows by substituting ˆv x 2 δ, along with the assumptions E 2 2 = O(n/m), m = Ω k log Lr δ , and ϵn = O q From (71), we have the following when ˆv T x 0: ˆv x 2 = O q ( λ1 + ϵn) x x G 2 Published as a conference paper at ICLR 2022 As mentioned earlier, if ˆv T x < 0, we have the same upper bound as in (72) for ˆv+ x 2. 
Therefore, we obtain ˆvˆv T x x T F = 2 1 ( x T ˆv)2 (73) 2 (1 x T ˆv) (1 + x T ˆv) (74) 2 min{ ˆv x 2, ˆv + x 2} (75) ( λ1 + ϵn) x x G 2 where (73) follows from (40), (75) follows from ˆv x 2 2 = 2(1 x T ˆv), and (76) follows from (72). E PROOF OF LEMMA 2 (AUXILIARY RESULT FOR PPOWER ANALYSIS) By the assumption Range(G) Sn 1, for any x Rn and a > 0, we have PG(x) = PG(ax), (77) which is seen by noting that when comparing x a 2 with x b 2 (in accordance with projection mapping to the closest point), as long as a 2 = b 2, the comparison reduces to comparing x, a with x, b , so is invariant to positive scaling of x. Let η = 1/ λ1 > 0 and ˆs = PG(Vs). Then, we have ˆs = PG(Vs) = PG( ηVs). Since x Range(G), we have ηVs ˆs 2 ηVs x 2. (78) This is equivalent to ( ηVs x) + ( x ˆs) 2 2 ηVs x 2 2, (79) and expanding the square gives x ˆs 2 2 2 ηVs x,ˆs x . (80) Note also that from V x = λ1 x, we obtain x = η V x, which we will use throughout the proof. For δ > 0, let M be a (δ/L)-net of Bk 2(r); from Lemma 5, there exists such a net with log |M| k log 4Lr By the L-Lipschitz continuity of G, we have that G(M) is a δ-net of Range(G) = G(Bk 2(r)). We write s = (s s0) + s0, ˆs = (ˆs s) + s, (82) where s G(M) satisfies ˆs s 2 δ, and s0 G(M) satisfies s s0 2 δ. Then, we have ηVs x,ˆs x = η V(s x),ˆs x + ηEs,ˆs x , (83) which follows from V = V + E and x = η V x. In the following, we control the two terms in (83) separately. 1. The term η V(s x),ˆs x : We decompose s = α x+βt and ˆs = ˆα x+ ˆβˆt, where t 2 = ˆt 2 = 1 and t T x = ˆt T x = 0. Since s 2 = ˆs 2 = 1, we have α2 + β2 = ˆα2 + ˆβ2 = 1. In addition, we have α = s T x and ˆα = ˆs T x. Recall that in (47), we write the SVD of V as V = U D UT . Since t T x = 0, we can write t as t = P i>1 hi ui. In addition, since Published as a conference paper at ICLR 2022 t 2 = 1, we have P i>1 h2 i = 1. Hence, by the Cauchy-Schwarz inequality, we have | Vt,ˆt | Vt 2 (84) i>1 λihi ui i>1 λ2 i h2 i (86) i>1 h2 i (87) Therefore, we obtain | η V(s x),ˆs x | = | (α 1) x + ηβ Vt, (ˆα 1) x + ˆβˆt | (89) = |(α 1)(ˆα 1) + ηβ ˆβ Vt,ˆt | (90) (1 α)(1 ˆα) + η|β ˆβ| λ2 (91) = (1 α)(1 ˆα) + γ p 1 ˆα2, (92) where (89) uses η V x = x, and (90) uses x 2 = 1 and x, t = 0. 2. The term ηEs,ˆs x : We have | ηEs,ˆs x | = ηE((s s0) + s0),ˆs x (93) = ηE(s s0),ˆs x + ηEs0, (ˆs s) + ( s x) (94) η E 2 2δ ˆs x 2 + η E 2 2δ + O O (δ E 2 2) + O ˆs x 2, (96) where (95) follows from Assumption 2 and (81), and (96) follows from η = 1/ λ1, along with the fact that we assumed λ1 = Θ(1). Note that x ˆs 2 2 = 2(1 ˆs T x) = 2(1 ˆα). Hence, and using (80), (83), (92), and (96), we obtain 2(1 ˆα) 2 (1 α)(1 ˆα) + γ p + O (δ E 2 2) + O 2(1 ˆα). (97) Using 2(1 ˆα) 2(1 α)(1 ˆα) = 2α(1 ˆα), 1 α2 = 1 α 1 + α p 2(1 α), and similarly 2(1 ˆα), we obtain from (97) that 2α(1 ˆα) 2 γ p 2(1 ˆα) + O 2(1 ˆα) + O (δ E 2 2) . (98) Since ˆs x 2 2 = 2(1 ˆα) and s x 2 2 = 2(1 α), this is equivalent to 2 γ s x 2 + O ˆs x 2 + O (δ E 2 2) . (99) Published as a conference paper at ICLR 2022 This equation is of the form az2 bz + c (where z = ˆs x 2 and a = α == s T x > 0), and using a simple application of the quadratic formula,9 we obtain ˆs x 2 2 γ s x 2 s T x δ E 2 2 k log(n Lr) where we use the assumption E 2 2 = O(n/m) and set δ = 1/n in (101). F PROOF OF THEOREM 2 (MAIN THEOREM FOR PPOWER) Suppose for the time being that (13) holds for at least one index t t0 (we will later verify that this is the case), and let T0 t0 be the smallest such index. 
Thus, we have

$\|w^{(T_0)} - \bar x\|_2 \le \frac{C}{(1 - \mu_0)\nu}\sqrt{\frac{k\log(nLr)}{m}}.$  (102)

Note that according to the theorem statement, $1 - \mu_0$ is bounded away from zero. Using $\|w^{(T_0)}\|_2 = \|\bar x\|_2 = 1$ and the assumption that $m \ge C_{\nu,\tau}\, k\log(nLr)$ with $C_{\nu,\tau} > 0$ being large enough, we deduce from (102) that $\|w^{(T_0)} - \bar x\|_2$ is sufficiently small such that

$\bar x^T w^{(T_0)} \ge 1 - \tau.$  (103)

Next, using the assumption $2\gamma + \nu \le 1 - \tau$, we write

$\frac{2\gamma}{(1 - \mu_0)\nu(1 - \tau)} + \frac{1}{1 - \tau} = \frac{2\gamma + (1 - \mu_0)\nu}{(1 - \mu_0)\nu(1 - \tau)} \le \frac{2\gamma + \nu}{(1 - \mu_0)\nu(1 - \tau)} \le \frac{1}{(1 - \mu_0)\nu}.$  (104)

Then, from Lemma 2, we obtain

$\|w^{(T_0+1)} - \bar x\|_2 \le \frac{2\gamma}{\bar x^T w^{(T_0)}}\|w^{(T_0)} - \bar x\|_2 + \frac{C}{\bar x^T w^{(T_0)}}\sqrt{\frac{k\log(nLr)}{m}}$  (105)
$\le \frac{2\gamma}{1 - \tau}\cdot\frac{C}{(1 - \mu_0)\nu}\sqrt{\frac{k\log(nLr)}{m}} + \frac{C}{1 - \tau}\sqrt{\frac{k\log(nLr)}{m}}$  (106)
$\le \frac{C}{(1 - \mu_0)\nu}\sqrt{\frac{k\log(nLr)}{m}},$  (107)

where (106) follows from (102) and (103), and (107) follows from (104). Thus, we have transferred (102) from $T_0$ to $T_0 + 1$, and proceeding by induction, we obtain

$\|w^{(t)} - \bar x\|_2 \le \frac{C}{(1 - \mu_0)\nu}\sqrt{\frac{k\log(nLr)}{m}}$  (108)

for all $t \ge T_0$.

Next, we consider $t \in [t_0, T_0)$. Again using Lemma 2 (with $\hat s = w^{(t_0+1)} = \mathcal{P}_G(Vw^{(t_0)})$), we have

$\|w^{(t_0+1)} - \bar x\|_2 \le \mu_0\|w^{(t_0)} - \bar x\|_2 + \frac{C}{2\gamma + \nu}\sqrt{\frac{k\log(nLr)}{m}},$  (109)

where we recall that $\mu_0 = \frac{2\gamma}{2\gamma + \nu} < 1$, and note that the denominators in (109) follow since $\bar x^T w^{(t_0)} \ge 2\gamma + \nu$. Supposing that $t_0 < T_0$ (otherwise, the above analysis for $t \ge T_0$ alone is sufficient), we have that (13) is reversed at $t = t_0$:

$\|w^{(t_0)} - \bar x\|_2 > \frac{C}{(1 - \mu_0)\nu}\sqrt{\frac{k\log(nLr)}{m}}.$  (110)

⁹Since the leading coefficient $a = \alpha$ of the quadratic is positive, $z$ must lie between the two associated roots. This yields $z \le \frac{b + \sqrt{b^2 + 4ac}}{2a}$, from which the inequality $\sqrt{b^2 + 4ac} \le b + 2\sqrt{ac}$ gives (100).

This means that we can upper bound the second term in (109) by $\frac{C}{2\gamma + \nu}\sqrt{\frac{k\log(nLr)}{m}} < (1 - \mu_0)\|w^{(t_0)} - \bar x\|_2$, which gives

$\|w^{(t_0+1)} - \bar x\|_2 < \|w^{(t_0)} - \bar x\|_2.$  (111)

Squaring both sides, expanding, and canceling the norms (which all equal one), we obtain

$\bar x^T w^{(t_0+1)} > \bar x^T w^{(t_0)},$  (112)

and by induction, we obtain that $\{\bar x^T w^{(t)}\}_{t \in [t_0, T_0)}$ is monotonically increasing. Recall that we assume that $\bar\lambda_1 = \Theta(1)$, and that $T_0 = t_0 + \ell_0$ is the smallest integer such that (13) holds. To verify that $T_0$ is finite and upper bound $\ell_0$, we consider the following three cases:

• $\mu_0 = 0$ (or equivalently, $\gamma = \bar\lambda_2 = 0$): In this case, (109) gives $T_0 = t_0 + 1$ (or $T_0 = t_0$, which we already addressed above). Thus, we have $\ell_0 \le 1$, as stated in the theorem.

• $\mu_0 = o(1)$ (or equivalently, $\gamma = o(1)$ and $\bar\lambda_2 = o(1)$): Since $\{\bar x^T w^{(t)}\}_{t \in [t_0, T_0)}$ is monotonically increasing, for any positive integer $\ell$ with $t_0 + \ell \le T_0$, by applying Lemma 2 (or (109)) multiple times, we obtain¹⁰

$\|w^{(t_0+\ell)} - \bar x\|_2 \le \mu_0^\ell\|w^{(t_0)} - \bar x\|_2 + \frac{1 - \mu_0^\ell}{1 - \mu_0}\cdot\frac{C}{2\gamma + \nu}\sqrt{\frac{k\log(nLr)}{m}}$  (113)
$\le \mu_0^\ell\|w^{(t_0)} - \bar x\|_2 + \frac{1}{1 - \mu_0}\cdot\frac{C}{2\gamma + \nu}\sqrt{\frac{k\log(nLr)}{m}}.$  (114)

Then, if we choose $\ell_0 \in \mathbb{N}$ such that

$\mu_0^{\ell_0} \le \frac{C\mu_0}{2\nu}\sqrt{\frac{k\log(nLr)}{m}},$  (115)

we obtain from (114) that

$\|w^{(t_0+\ell_0)} - \bar x\|_2 \le \mu_0^{\ell_0}\|w^{(t_0)} - \bar x\|_2 + \frac{1}{1 - \mu_0}\cdot\frac{C}{2\gamma + \nu}\sqrt{\frac{k\log(nLr)}{m}}$  (116)
$\le 2\mu_0^{\ell_0} + \frac{1}{1 - \mu_0}\cdot\frac{C}{2\gamma + \nu}\sqrt{\frac{k\log(nLr)}{m}}$  (117)
$< \frac{C\mu_0}{\nu(1 - \mu_0)}\sqrt{\frac{k\log(nLr)}{m}} + \frac{1}{1 - \mu_0}\cdot\frac{C}{2\gamma + \nu}\sqrt{\frac{k\log(nLr)}{m}}$  (118)
$= \frac{C}{(1 - \mu_0)\nu}\sqrt{\frac{k\log(nLr)}{m}},$  (119)

where (117) follows from $\|w^{(t_0)} - \bar x\|_2 \le 2$, (118) follows from (115) and $1 < \frac{1}{1 - \mu_0}$, and (119) follows from $\mu_0 = \frac{2\gamma}{2\gamma + \nu}$, which implies $\frac{\mu_0}{\nu} + \frac{1}{2\gamma + \nu} = \frac{\mu_0(2\gamma + \nu) + \nu}{\nu(2\gamma + \nu)} = \frac{2\gamma + \nu}{\nu(2\gamma + \nu)} = \frac{1}{\nu}$. Observe that (119) coincides with (13), and since $\mu_0 = o(1)$, we obtain from (115) that $\ell_0 = O\Big(\log\frac{m}{k\log(nLr)}\Big)$, as desired.

• $\mu_0 = \Theta(1)$ (or equivalently, $\gamma = \Theta(1)$ and $\bar\lambda_2 = \Theta(1)$): Recall that we only need to focus on the case $T_0 > t_0$.
This means that (110) holds, implying that we can upper bound the second term in (109) by $\frac{(1 - \mu_0)\nu}{2\gamma + \nu}\|w^{(t_0)} - \bar x\|_2$, yielding

$\|w^{(t_0+1)} - \bar x\|_2 < \mu_0\|w^{(t_0)} - \bar x\|_2 + \frac{(1 - \mu_0)\nu}{2\gamma + \nu}\|w^{(t_0)} - \bar x\|_2$  (120)
$= \frac{2\gamma + (1 - \mu_0)\nu}{2\gamma + \nu}\|w^{(t_0)} - \bar x\|_2$  (121)
$= (1 - \xi)\|w^{(t_0)} - \bar x\|_2,$  (122)

where $\xi = \frac{\mu_0\nu}{2\gamma + \nu} = \mu_0(1 - \mu_0) = \Theta(1)$. With the distance to $\bar x$ shrinking by a constant factor in each iteration according to (122), and the initial distance $\|w^{(t_0)} - \bar x\|_2$ being at most 2 due to the vectors having unit norm, we deduce that $\ell_0 = O\Big(\log\frac{m}{k\log(nLr)}\Big)$ iterations suffice to ensure that (13) holds for $t = t_0 + \ell_0$.

¹⁰In simpler notation, if $z_{t+1} \le az_t + b$, then we get $z_{t+2} \le a^2 z_t + (1 + a)b$, then $z_{t+3} \le a^3 z_t + (1 + a + a^2)b$, and so on, and then we can apply $1 + a + \ldots + a^{i-1} = \frac{1 - a^i}{1 - a}$ for $a \ne 1$.

G COMPARISON OF ANALYSIS TO DESHPANDE ET AL. (2014)

As mentioned in Section 4, our analysis is significantly different from that of Deshpande et al. (2014), despite using a similar assumption on the initialization. We highlight the differences as follows:

1. Perhaps the most significant difference is that the proof of (Deshpande et al., 2014, Theorem 3) relies heavily on the Moreau decomposition, which is only valid for a closed convex cone (see (Deshpande et al., 2014, Definition 1.2)). In particular, the Moreau decomposition needs to be used at the beginning of the proof of (Deshpande et al., 2014, Theorem 3), such as in Eqs. (18) and (19) of the supplementary material therein. We do not see a way for the proof to proceed without the Moreau decomposition, and our Range(G) may be very different from a convex cone.

2. We highlight that one key observation in our proof of Lemma 2 (and thus Theorem 2) is that for a generative model $G$ with $\mathrm{Range}(G) \subseteq \mathcal{S}^{n-1}$, and any $x \in \mathbb{R}^n$ and $a > 0$, we have $\mathcal{P}_G(ax) = \mathcal{P}_G(x)$ (Eq. (77)). This enables us to derive the important equation $\hat s = \mathcal{P}_G(Vs) = \mathcal{P}_G(\eta Vs)$. We are not aware of a similar idea being used in the proof of (Deshpande et al., 2014, Theorem 3).

3. In the projected power method of (Deshpande et al., 2014), the authors need to add $\rho I_n$ with $\rho > 0$ to the observed data matrix $V$ to improve the convergence. In particular, they mention in the paragraph before the statement of their Theorem 3 that "the memory term $\rho v_t$ is necessary for our proof technique to go through". In contrast, our proof of Theorem 2 does not require adding such a term, even when our data model is restricted to the spiked Wigner model considered in (Deshpande et al., 2014).

4. We consider a matrix model that is significantly more general than the spiked Wigner model studied in (Deshpande et al., 2014).

H NUMERICAL RESULTS FOR THE FASHION-MNIST DATASET

The Fashion-MNIST dataset consists of Zalando's article images, with a training set of 60,000 examples and a test set of 10,000 examples. The size of each image in the Fashion-MNIST dataset is also 28 × 28, and thus n = 784. The generative model G is set to be a boundary-seeking generative adversarial network (BEGAN). The BEGAN architecture is summarized as follows.¹¹ The generator has latent dimension k = 62 and four layers. The first two are fully connected layers with the architecture 62 → 1024 → 6272, with ReLU activation functions. The output of the second layer, reshaped to 128 × 7 × 7, is forwarded to a deconvolution layer with kernel size 4 and stride 2. The third layer uses ReLU activations and has output size 64 × 14 × 14, where 64 is the number of channels.
The fourth layer is a deconvolution layer with kernel size 4 and stride 2; it uses ReLU activations and has output size 1 × 28 × 28, where the number of channels is 1. The BEGAN is trained with a mini-batch size of 256, a learning rate of 0.0002, and 100 epochs. The other parameters are the same as those for the MNIST dataset.

We perform two sets of experiments, considering the spiked covariance and phase retrieval models separately. The corresponding results are reported in Figures 4, 5, 6, and 7. From these figures, we again observe the clear superiority of PPower over Power and TPower. We note that for the Fashion-MNIST dataset, some of the images are not sparse in the natural basis, but we observe from Figures 4 and 6 that even for the sparsest images (sandals), PPower still significantly outperforms TPower.

¹¹Further details of the architecture can be found at https://github.com/hwalsuklee/tensorflow-generative-model-collections.

Figure 4: Examples of reconstructed Fashion-MNIST images for the spiked covariance model. (a) β = 1 and m = 200; (b) β = 2 and m = 100.

Figure 5: Quantitative comparisons of the performance of Power, TPower, and PPower according to the Cosine Similarity for the Fashion-MNIST dataset and the spiked covariance model. (a) Fixing β = 1 and varying m; (b) fixing m = 300 and varying β.

Figure 6: Examples of reconstructed images of the Fashion-MNIST dataset for phase retrieval. (a) m = 200; (b) m = 400.

Figure 7: Quantitative comparisons of the performance of Power, TPower, and PPower according to the Cosine Similarity for the Fashion-MNIST dataset and the phase retrieval model.

Figure 8: Examples of reconstructed CelebA images for the spiked covariance model (columns: Original, Power, TPower, TPower_W, PPower). (a) β = 1 and m = 6000; (b) β = 4 and m = 3000.

Figure 9: Quantitative comparisons of the performance of Power, TPower, TPower_W, and PPower according to the Cosine Similarity for the CelebA dataset and the spiked covariance model. (a) Fixing β = 1 and varying m; (b) fixing m = 3000 and varying β.

I NUMERICAL RESULTS FOR THE CELEBA DATASET

The CelebA dataset consists of more than 200,000 face images of celebrities, where each input image is cropped to a 64 × 64 RGB image, so that n = 64 × 64 × 3 = 12288. The generative model G is set to be a pre-trained Deep Convolutional Generative Adversarial Network (DCGAN) model with latent dimension k = 100. We use the DCGAN model trained by the authors of (Bora et al., 2017) directly. We select the best estimate among 2 random restarts. The Adam optimizer with 100 steps and a learning rate of 0.1 is used for the projection operator. For the images of the CelebA dataset, the corresponding vectors are clearly not sparse in the natural basis. To make a fairer comparison to the sparsity-based method TPower, we convert the original images to the wavelet basis, and perform TPower on these converted images. The obtained results of TPower are then converted back to vectors in the natural basis.
The corresponding method is denoted by TPower_W, with "W" referring to the conversion of the images to the wavelet basis. In each iteration of TPower and TPower_W, the calculated entries are truncated to zero except for the largest q entries. For CelebA, q is set to 2000. The other parameters are the same as those for the MNIST dataset. We again perform two sets of experiments, considering the spiked covariance and phase retrieval models separately. The corresponding results are reported in Figures 8 and 9. From these figures, we can observe the superiority of PPower over Power, TPower, and TPower_W, whereas TPower_W only marginally improves over TPower.
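To make the projection step used by PPower in these experiments concrete, the following is a minimal PyTorch-style sketch of the iteration $w^{(t+1)} = \mathcal{P}_G(Vw^{(t)})$, with the projection onto Range(G) approximated by running Adam over the latent variable (100 steps, learning rate 0.1, and 2 random restarts, as described above). The generator interface `G`, the helper names, and the random initialization are illustrative assumptions rather than the released implementation.

```python
import torch

def project_onto_range(G, x, k, steps=100, lr=0.1, restarts=2):
    """Approximate P_G(x): search for a latent z minimizing ||G(z) - x||_2
    with Adam, and return the corresponding point G(z), renormalized since
    Range(G) is assumed to lie on the unit sphere. G (a pre-trained
    generator mapping a (1, k) latent to an image) and k are assumptions."""
    best_out, best_loss = None, float("inf")
    for _ in range(restarts):
        z = torch.randn(1, k, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = torch.norm(G(z).view(-1) - x)
            loss.backward()
            opt.step()
        with torch.no_grad():
            out = G(z).view(-1)
            loss = torch.norm(out - x).item()
            if loss < best_loss:
                best_loss, best_out = loss, out
    return best_out / best_out.norm()

def ppower(V, G, k, num_iters=50):
    """Projected power method: w_{t+1} = P_G(V w_t). The random
    initialization below is a placeholder; in practice one may initialize
    from, e.g., the output of the classic power method."""
    n = V.shape[0]
    w = project_onto_range(G, torch.randn(n), k)
    for _ in range(num_iters):
        w = project_onto_range(G, V @ w, k)
    return w
```

Consistent with Eq. (77), when Range(G) lies on the unit sphere the projection is invariant to positive rescaling of its argument, so $Vw^{(t)}$ need not be normalized before projecting.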