# Misspecified Phase Retrieval with Generative Priors

Zhaoqiang Liu, National University of Singapore, dcslizha@nus.edu.sg
Xinshao Wang, University of Oxford, xinshao.wang@eng.ox.ac.uk
Jiulong Liu, Chinese Academy of Sciences, jiulongliu@lsec.cc.ac.cn

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

In this paper, we study phase retrieval under model misspecification and generative priors. In particular, we aim to estimate an n-dimensional signal x from m i.i.d. realizations of the single index model $y = f(a^\top x)$, where f is an unknown and possibly random nonlinear link function and $a \in \mathbb{R}^n$ is a standard Gaussian vector. We make the assumption $\mathrm{Cov}[y, (a^\top x)^2] \ne 0$, which corresponds to the misspecified phase retrieval problem. In addition, the underlying signal x is assumed to lie in the range of an L-Lipschitz continuous generative model with bounded k-dimensional inputs. We propose a two-step approach, for which the first step plays the role of spectral initialization and the second step iteratively refines the estimated vector produced by the first step. We show that both steps enjoy a statistical rate of order $\sqrt{(k \log L)\,(\log m)/m}$ under suitable conditions. Experiments on image datasets are performed to demonstrate that our approach performs on par with or even significantly outperforms several competing methods.

1 Introduction

Compressed sensing (CS) is perhaps the most popular instance of high-dimensional inverse problems, for which one has the linear measurement model
$$y = a^\top x + \eta, \qquad (1)$$
where $a \in \mathbb{R}^n$ is the sensing vector, $x \in \mathbb{R}^n$ is the sparse signal to estimate, and $\eta$ represents additive noise. It is well known for CS that roughly $m = O(s \log(n/s))$ i.i.d. Gaussian measurements are sufficient to ensure accurate recovery of a signal with s non-zero entries [88, 1, 23, 11, 77]. Phase retrieval (PR) arises in numerous scientific areas including X-ray crystallography, acoustics, astronomy, microscopy, optics, wireless communications, and quantum information [13], where one cannot measure $a^\top x$ directly, and can only record its magnitude. For example, the following noisy magnitude-only measurement model has been adopted in various prior works on sparse (real-valued) PR [92, 39, 25, 36, 9, 7]:
$$y = |a^\top x| + \eta, \qquad (2)$$
where the signal x is assumed to be sparse, and the sensing vector a is assumed to be a standard Gaussian vector. However, both the linear and magnitude-only measurement models in (1) and (2) are idealized views of the data generating process. To make the setup more general, one can utilize the following semi-parametric single index model (SIM) for general nonlinear models:
$$y = f(a^\top x), \qquad (3)$$
where $f : \mathbb{R} \to \mathbb{R}$ is an unknown (possibly random) nonlinear link function, and a is typically assumed to be Gaussian. In addition, since the norm of $x \in \mathbb{R}^n$ is absorbed into the SIM, for brevity, it is common to assume that x is a unit vector, i.e., $\|x\|_2 = 1$. SIMs have long been studied in the conventional setting where the number of measurements m > n [30, 81, 45]. In recent years, they have also been analyzed in the high-dimensional setting where $m \ll n$, mainly under the assumption that the underlying signal is sparse. Relevant works include but are not limited to [72, 63, 26, 73, 68]. For all these works, it is crucial to impose the following assumption on the SIM:
$$\mathrm{Cov}[y, a^\top x] \ne 0. \qquad (4)$$
The pivotal condition in (4) is fairly generic and encompasses notable special cases such as noisy 1-bit measurements and general (non-binary) quantization schemes. However, it is not satisfied by PR models including (2) and the related models
$$y = |a^\top x + \eta|, \qquad y = (a^\top x)^2 + \eta, \qquad (5)$$
where $\eta$ refers to zero-mean random noise that is independent of a. In order to formalize a class of SIMs that encompasses the above-mentioned PR models (and more general models; see the discussion in Section 2.2) as special cases, misspecified phase retrieval (MPR) has been studied in [64, 98], with the condition (4) being replaced by the assumption
$$\mathrm{Cov}\big[y, (a^\top x)^2\big] \ne 0. \qquad (6)$$
It is worth mentioning that another motivation behind studying MPR is that the theoretical analysis for PR typically relies on correct model specification, i.e., that the data points are indeed generated by the assumed model, whereas the MPR model enables theoretical analysis under statistical model misspecification. In both works [64, 98], the signal x is assumed to be sparse.

Recently, motivated by tremendously successful applications of deep generative models and following the seminal work [6] on generative model based linear CS, it has become popular to study high-dimensional inverse problems under generative priors [76, 31, 32, 34, 96, 41, 2, 67, 95, 60, 40, 65]. In particular, instead of being sparse, the underlying signal is assumed to be contained in (or lie near) the range of a generative model. It has been empirically demonstrated in [6] and its various follow-up works (e.g., see [19, 85, 48] and a literature review in [78]) that, compared to sparsity based methods, corresponding generative model based algorithms require significantly fewer samples to recover the signal up to a given accuracy. In this paper, we study the MPR problem under generative modeling assumptions.

1.1 Related Work

In this subsection, we summarize some relevant works on PR and SIMs, both with and without generative priors.

PR and SIM without generative priors: There is a large amount of literature providing practical algorithms for PR, including convex methods [66, 14, 12, 90, 4, 29] and empirically more competitive non-convex methods [27, 21, 62, 70]. In particular, in the seminal work [62] and a variety of its follow-up works [13, 10, 15, 92, 91, 99, 39, 82, 8], whether for general PR (with no priors on the signal) or for sparse PR, two-step approaches have been proposed with provable guarantees. The first step consists of a spectral initialization method, and the second step is typically an iterative algorithm (such as alternating minimization or gradient descent) that further refines the initial guess of the first step. For general PR, the spectral initialization step turns out to be unnecessary, and optimal sample complexity guarantees can be established even when using random initialization [83, 16]. However, to the best of our knowledge, this is not the case for sparse PR. More specifically, all theoretically-guaranteed non-convex algorithms for sparse PR require spectral initialization, and this typically results in a sub-optimal sample complexity of $O(s^2 \log n)$ (instead of $O(s \log n)$), where s refers to the number of non-zero entries of the signal to estimate. MPR is also closely related to SIMs, which have been extensively studied in the conventional setting where m > n, see, e.g., [30, 35, 59].
High-dimensional SIMs have received a lot of attention in recent years, with theoretical guarantees for signal estimation and support recovery [22, 24, 74, 84, 55, 72, 73, 17, 97, 28, 93, 69, 20]. In particular, motivated by the idea that under the assumption (4), a SIM can be converted into a scaled linear measurement model with unconventional noise, the authors of [72] consider minimizing the linear least-squares objective function over a convex set. They show that a reliable estimate of the signal can be obtained by such a simple method despite the unknown nonlinear link function. As mentioned earlier, MPR with sparse priors has been studied in [64, 98]. The work [64] implements a two-step procedure based on convex optimization, with the first step being a spectral initialization method. More specifically, a semidefinite program is solved to produce an initial vector, and then an $\ell_1$-regularized least squares problem is solved to obtain a refined estimator. Such a procedure suffers from high computational costs. A more efficient two-step approach, which is a simple variant of the thresholded Wirtinger flow method [10], is proposed in [98]. In the first step, identical to that in [10], the initial vector is calculated by a thresholded spectral method that first estimates the support of the sparse signal by thresholding and then performs the classic power method over the submatrix corresponding to the estimated support. In the second step, a thresholded gradient descent algorithm is employed. Both approaches in [64, 98] can attain the optimal statistical rate of order $\sqrt{(s \log n)/m}$, provided that the sensing vector is standard Gaussian and the number of samples satisfies $m = \Omega(s^2 \log n)$. In addition, the second step of the approach proposed in [98] is shown to achieve a linear convergence rate.

PR and SIM with generative priors: PR with generative priors has been studied in [31, 37, 38, 80, 3, 47]. More specifically, an approximate message passing algorithm is proposed in [3]. The authors of [31, 80] minimize the objective over the latent space in $\mathbb{R}^k$ using gradient descent, where k is the latent dimension of the generative prior. The corresponding algorithms may easily get stuck in local minima since the explorable solution space is limited. Recovery guarantees for projected gradient descent algorithms over the ambient space in $\mathbb{R}^n$ for noiseless PR with pre-trained or untrained neural network priors have been provided in [37, 38]. No initialization methods have been proposed in these works, making the assumption on the initial vector therein somewhat stringent. On the other hand, the authors of [47] propose a spectral initialization method for PR with generative priors and provide recovery guarantees with respect to globally optimal solutions of a corresponding optimization problem. The optimization problem is non-convex, and a projected power method is proposed in [50] to approximately find an optimal solution. Generative model based SIMs have been studied in [51, 49, 46, 94]. The authors of [51, 49, 46] provide optimal sample complexity upper bounds under the assumption of Gaussian sensing vectors. However, their results rely on the assumption (4), which is not satisfied by widely adopted PR models. The SIM studied in [94] encompasses certain PR models as special cases, and the sensing vector can be non-Gaussian. However, the nonlinear link function f is assumed to be differentiable, making it not directly applicable to the PR model in (2).
Moreover, it is worth mentioning that in both works [94, 51], the recovery guarantees are with respect to globally optimal solutions of typically highly non-convex optimization problems, and attaining these optimal solutions is practically difficult.

1.2 Contributions

Throughout this paper, we assume that the signal x lies in the range of an L-Lipschitz continuous generative model with bounded k-dimensional inputs. Our main contributions are as follows:

- We propose a two-step approach for MPR with generative priors. In particular, in the first step, we make use of the projected power method proposed in [50] to obtain a good initial vector for the iterative algorithm used in the second step.
- We show that under appropriate initialization, both steps attain a statistical rate of order $\sqrt{(k \log L)\,(\log m)/m}$, and the second step achieves a linear convergence rate. The initialization condition for the first step is mild in the sense that it allows the inner product between the starting point and the signal x to be a sufficiently small positive constant. In contrast, the initialization condition for the second step is more restrictive, making the first step necessary. Notably, unlike the works on MPR with sparse priors [64, 98] that require the sub-optimal $O(s^2 \log n)$ sample complexity, the sample complexity requirement for our recovery guarantees is $O((k \log L)(\log m))$, which is naturally conjectured to be near-optimal [52, 42].
- We perform numerical experiments on image datasets to corroborate our theoretical results. In particular, for the noisy magnitude-only measurement model (2), we observe that our approach gives reconstructions that are competitive with those of the alternating phase projected gradient descent (APPGD) algorithm proposed in [37], which is the corresponding state-of-the-art method, even though we do not utilize the knowledge of the link function and allow for model misspecification. Moreover, for several closely related measurement models that satisfy the condition (6), our approach leads to superior reconstruction performance compared to all other competing methods, including APPGD.

2 Problem Formulation

In this section, we overview some important assumptions that we adopt. Before proceeding, we present the notation used in this paper.

2.1 Notation

We use upper and lower case boldface letters to denote matrices and vectors respectively. For any positive integer N, we use the shorthand notation $[N] = \{1, 2, \ldots, N\}$, and $I_N$ represents the identity matrix in $\mathbb{R}^{N \times N}$. The support (set) of a vector is the index set of its non-zero entries. We use $\|X\|_{2 \to 2}$ to represent the spectral norm of X. We use $B^k(r)$ to denote the radius-r ball in $\mathbb{R}^k$, i.e., $B^k(r) := \{z \in \mathbb{R}^k : \|z\|_2 \le r\}$, and $S^{n-1}$ represents the unit sphere in $\mathbb{R}^n$, i.e., $S^{n-1} := \{s \in \mathbb{R}^n : \|s\|_2 = 1\}$. We use G to denote a pre-trained L-Lipschitz continuous generative model from $B^k(r)$ to $\mathbb{R}^n$. We focus on the setting where $k \ll n$. For any set $S \subseteq B^k(r)$, we write $G(S) = \{G(z) : z \in S\}$. Our goal is to estimate the signal $x \in \mathrm{Range}(G) = G(B^k(r))$ from realizations of the sensing vector a and the observation y (generated according to the SIM in (3)). For any two sequences of real values $\{a_j\}$ and $\{b_j\}$, we write $a_j = O(b_j)$ if there exist an absolute constant $C_1$ and a positive integer $j_1$ such that $|a_j| \le C_1 b_j$ for any $j > j_1$, and $a_j = \Omega(b_j)$ if there exist an absolute constant $C_2$ and a positive integer $j_2$ such that $|a_j| \ge C_2 b_j$ for any $j > j_2$.
We use the generic notations C and C' to denote large positive constants, and we use c to denote a small positive constant; their values may differ from line to line.

2.2 Assumptions

First, the following are the standard definitions of a sub-exponential random variable and the associated sub-exponential norm.

Definition 1. A random variable X is said to be sub-exponential if there exists a positive constant C such that $(\mathbb{E}[|X|^p])^{1/p} \le Cp$ for all $p \ge 1$, or equivalently, if there exists a positive constant C' such that $\mathbb{P}(|X| > u) \le \exp(1 - u/C')$ for all $u \ge 0$. The sub-exponential norm of X is defined as $\|X\|_{\psi_1} := \sup_{p \ge 1} p^{-1}(\mathbb{E}[|X|^p])^{1/p}$.

We will focus on the following settings except where stated otherwise:

- The observations are independently generated according to the SIM (3), where f is the link function that is unknown and possibly random.
- We have an L-Lipschitz continuous generative model $G : B^k(r) \to \mathbb{R}^n$. For convenience, similarly to [48, 50], we assume that the generative model is normalized such that $\mathrm{Range}(G) \subseteq S^{n-1}$. For a general (unnormalized) generative model, we may essentially consider its normalized version; see, e.g., [50, Remark 1].
- The signal x is contained in the range of G, i.e., $x \in \mathrm{Range}(G) \subseteq S^{n-1}$.
- The sensing vector $a \in \mathbb{R}^n$ is standard Gaussian, i.e., $a \sim \mathcal{N}(0, I_n)$.
- The random variable $y = f(a^\top x)$ is sub-exponential, with its sub-exponential norm denoted by $K_y$, i.e., $K_y := \|y\|_{\psi_1}$. In addition, we use $M_y$ to denote the expectation of y, i.e., $M_y := \mathbb{E}[y]$.

Remark 1. y will be sub-exponential when f(x) comprises $x^c$ plus lower-order terms with $c \le 2$ (since the product of two sub-Gaussian random variables is sub-exponential), and therefore the y corresponding to all the measurement models presented in our paper is sub-exponential. We remark that the assumption of sub-exponential y is not essential and can easily be relaxed. For example, when $y = x^c$ with c being an even integer larger than 2, there will only be a minor change in the order of the $\log m$ term in the sample complexity and statistical rate. However, for brevity, we follow [64, 98] and make the assumption of sub-exponential y to avoid non-essential complications.

We consider MPR and assume that¹
$$\nu := \mathrm{Cov}\big[y, (a^\top x)^2\big] > 0, \qquad (7)$$
or equivalently,
$$\nu := \mathrm{Cov}\big[f(g), g^2\big] > 0, \qquad (8)$$
where $g \sim \mathcal{N}(0, 1)$ is a standard normal random variable. Note that the assumption (8) is only with respect to the nonlinear link function f. The condition in (7) is satisfied by the PR models described in (2) and (5). It is also satisfied by related models such as [98]
$$y = |a^\top x| + 2\tanh(|a^\top x|) + \eta, \qquad y = 2(a^\top x)^2 + 3\sin(|a^\top x|) + \eta. \qquad (9)$$
See [64, Proposition 3 and Remark 4] for more general examples.

¹ The case where $\nu < 0$ can be handled similarly by considering $-y$.
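To make the conditions (4) and (6) concrete, the following is a minimal Monte Carlo check (a sketch that is not part of the original paper; it assumes NumPy and a toy dimension n) showing that, for the magnitude-only link $f(g) = |g| + \eta$, the correlation in (4) vanishes while $\nu$ in (7) is strictly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 200_000                     # toy dimension and number of samples (illustrative values)
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                 # unit-norm signal, as assumed in the paper

A = rng.standard_normal((m, n))        # rows are i.i.d. N(0, I_n) sensing vectors
g = A @ x                              # inner products a_i^T x ~ N(0, 1)
y = np.abs(g) + 0.1 * rng.standard_normal(m)   # magnitude-only model (2)

cov_linear = np.mean((y - y.mean()) * g)              # estimates Cov[y, a^T x]
cov_quadratic = np.mean((y - y.mean()) * (g**2 - 1))  # estimates nu = Cov[y, (a^T x)^2]

print(f"Cov[y, a^T x]     ~ {cov_linear:.3f}    (close to 0: condition (4) fails)")
print(f"Cov[y, (a^T x)^2] ~ {cov_quadratic:.3f} (clearly positive: condition (6)/(7) holds)")
```

For this link, the quadratic covariance concentrates around $\mathbb{E}[|g|^3] - \mathbb{E}[|g|] = \sqrt{2/\pi} \approx 0.8$, which is the value of $\nu$ used implicitly in the analysis.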
3 Algorithm

In this section, we describe our two-step algorithm devised for MPR with generative priors. Suppose that we have m i.i.d. realizations of a and y, namely $a_1, \ldots, a_m$ and $y_1, \ldots, y_m$. To estimate the signal x, we consider the following two-step approach:

1. We perform $T_1$ iterations in the first step. In particular, let
$$V = \frac{1}{m}\sum_{i=1}^m y_i\big(a_i a_i^\top - I_n\big). \qquad (10)$$
We perform the projected power method proposed in [50]: for $t = 0, 1, \ldots, T_1 - 1$, let
$$w^{(t+1)} = P_G\big(V w^{(t)}\big), \qquad (11)$$
where $P_G(\cdot)$ denotes the projection function onto $\mathrm{Range}(G)$,² and we obtain $x^{(0)} := w^{(T_1)}$. Similarly to [43, 47], we set the starting point $w^{(0)}$ as the column of $\frac{1}{m}\sum_{i=1}^m y_i a_i a_i^\top$ (i.e., a shifted version of V) that corresponds to the largest diagonal entry. Note that it is easy to calculate that $\mathbb{E}[V] = \nu x x^\top$ (see, e.g., [47, Lemma 8]), for which each column is a scalar multiple of x. This motivates the use of a shifted version of V to obtain the initialization vector.

² That is, for any $s \in \mathbb{R}^n$, $P_G(s) := G\big(\arg\min_{z \in B^k(r)} \|G(z) - s\|_2\big)$. Similarly to [79, 37, 38, 71, 50], we implicitly assume exact projections in our analysis. In practice, approximate methods such as gradient- and GAN-based projections [79, 75] have been shown to work well.

2. We perform $T_2$ iterations in the second step. In particular, let
$$\bar{y} = \frac{1}{m}\sum_{i=1}^m y_i \qquad (12)$$
be the empirical mean of the observations. We perform the following iterative procedure: for $t = 0, 1, \ldots, T_2 - 1$, let
$$\hat{\nu}^{(t)} = \frac{1}{m}\sum_{i=1}^m (y_i - \bar{y})\big(a_i^\top x^{(t)}\big)^2, \qquad (13)$$
$$y_i^{(t)} = (y_i - \bar{y})\big(a_i^\top x^{(t)}\big), \quad i = 1, 2, \ldots, m, \qquad (14)$$
$$\tilde{x}^{(t+1)} = x^{(t)} - \frac{\zeta}{m}\sum_{i=1}^m \big(\hat{\nu}^{(t)}\, a_i^\top x^{(t)} - y_i^{(t)}\big)\,a_i, \qquad (15)$$
$$x^{(t+1)} = P_G\big(\tilde{x}^{(t+1)}\big), \qquad (16)$$
where $\zeta > 0$ is a tuning parameter, and $\hat{\nu}^{(t)}$ can be thought of as an approximation of $\nu$ defined in (7). The idea behind calculating $y_i^{(t)}$ is that, by comparing (4) and (6), we observe that to transform the MPR model into a conventional SIM, we may use $(y - \mathbb{E}[y])(a^\top x)$ to replace y; see also [64]. Moreover, (15) can be considered as a gradient descent step with respect to a linear measurement model with scale factor $\hat{\nu}^{(t)}$ and observations $y_i^{(t)}$. Then, in (16), we project the calculated vector $\tilde{x}^{(t+1)}$ onto $\mathrm{Range}(G)$. Finally, we use $\hat{x} := x^{(T_2)}$ to denote the estimated vector obtained after $T_2$ iterations. For convenience, we summarize the details in Algorithm 1.

Algorithm 1: A two-step approach for misspecified phase retrieval with generative priors
Input: $\{(a_i, y_i)\}_{i=1}^m$, step size $\zeta > 0$, number of iterations $T_1$ for the first step, number of iterations $T_2$ for the second step, pre-trained generative model G, initial vector $w^{(0)}$
First step:
1: for $t = 0, 1, \ldots, T_1 - 1$ do
2:   $w^{(t+1)} = P_G\big(V w^{(t)}\big)$
3: end for
Second step: let $x^{(0)} := w^{(T_1)}$
1: for $t = 0, 1, \ldots, T_2 - 1$ do
2:   Calculate $\hat{\nu}^{(t)}$, $y_i^{(t)}$, $\tilde{x}^{(t+1)}$, and $x^{(t+1)}$ according to (13), (14), (15), and (16), respectively
3: end for
Output: $\hat{x} := x^{(T_2)}$
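The following is a compact NumPy sketch of Algorithm 1 under the assumptions stated above. It is an illustration rather than the paper's implementation: the projection `project_G` onto $\mathrm{Range}(G)$ is left abstract (in practice it is approximated by gradient- or GAN-based methods), and the variable names are ours.

```python
import numpy as np

def two_step_mpr(A, y, project_G, T1=20, T2=30):
    """Sketch of Algorithm 1. A: (m, n) Gaussian sensing matrix, y: (m,) observations,
    project_G: callable approximating the projection P_G onto Range(G)."""
    m, n = A.shape

    # First step: projected power method, cf. (10)-(11).
    M = (A * y[:, None]).T @ A / m          # (1/m) sum_i y_i a_i a_i^T (shifted version of V)
    V = M - y.mean() * np.eye(n)            # V in (10)
    w = M[:, np.argmax(np.diag(M))]         # w^{(0)}: column with the largest diagonal entry
    for _ in range(T1):
        w = project_G(V @ w)                # w^{(t+1)} = P_G(V w^{(t)}), cf. (11)

    # Second step: projected gradient-type refinement, cf. (12)-(16).
    x_t, y_bar = w, y.mean()
    for _ in range(T2):
        z = A @ x_t                                   # a_i^T x^{(t)}
        nu_t = np.mean((y - y_bar) * z**2)            # hat{nu}^{(t)}, cf. (13)
        y_t = (y - y_bar) * z                         # y_i^{(t)}, cf. (14)
        zeta = 1.0 / nu_t                             # step size choice from Remark 6
        x_tilde = x_t - zeta / m * A.T @ (nu_t * z - y_t)  # gradient step, cf. (15)
        x_t = project_G(x_tilde)                      # projection, cf. (16)
    return x_t
```

Here `w` and `x_t` correspond to $w^{(t)}$ and $x^{(t)}$; in our experiments, the exact projection is replaced by the Adam-based approximate projection described in Section 5.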
4 Theoretical Results

The following theorem establishes recovery guarantees for the first step of Algorithm 1. The proof is given in the supplementary material. Note that $K_y := \|y\|_{\psi_1}$ (cf. Section 2.2) is considered a fixed constant and is omitted in the $O(\cdot)$ notation.

Theorem 1. Assume that there exists a positive integer $t_0$ such that $x^\top w^{(t_0)} \ge c_0$, where $c_0$ is a sufficiently small positive constant. Suppose that $m = \Omega\big((k \log(nLr))(\log m)\big)$ with a large enough implied constant. Then, with probability $1 - O(1/m)$, it holds for all $t > t_0$ that
$$\big\|w^{(t)} - x\big\|_2 \le C K_y \sqrt{\frac{(k \log(nLr))(\log m)}{m}}. \qquad (17)$$

Since a d-layer feedforward neural network generative model from $B^k(r)$ to $\mathbb{R}^n$ is typically L-Lipschitz continuous with $L = n^{\Theta(d)}$ [6] and we may set $r = n^{\Theta(d)}$, the upper bound in (17) is of order $\sqrt{(k \log L)(\log m)/m}$. Such a statistical rate is naturally conjectured to be near-optimal according to information-theoretic lower bounds established for MPR with sparse priors [64] and generative model based principal component analysis [50]. Therefore, Theorem 1 reveals that the first step of Algorithm 1 attains the near-optimal statistical rate under appropriate initialization and exact projections. The exact projection assumption is perhaps the major caveat to Theorem 1. However, it is a standard assumption in relevant works including [79, 37, 38, 71, 50]. In practice, both gradient- and GAN-based projection methods [79, 75] have been shown to be highly effective in approximating the projection step.

Remark 2. Spectral initialization steps in relevant works on sparsity based PR [92, 39, 25, 36, 9, 7] or MPR [64, 98] require the sub-optimal sample complexity $O(s^2 \log n)$, where s refers to the number of non-zero entries. In contrast, according to Theorem 1, our spectral initialization step only requires the near-optimal $O((k \log L)(\log m))$ sample complexity (with a linear rather than a quadratic dependence on k). However, we note that this advantage of our spectral initialization step comes at a price. In particular, we require the initialization condition $x^\top w^{(t_0)} \ge c_0$, which is not required by the spectral initialization steps in the above-mentioned works on sparse PR/MPR.

Remark 3. For some applications, we may assume that the dataset contains only vectors whose elements are all non-negative. For example, this is a natural assumption for image datasets. During pre-training, we can easily set the activation function of the last layer of the neural network generative model to be a certain non-negative function such as ReLU or sigmoid, so that the range of such a generative model is contained in the non-negative orthant. Therefore, the assumption that $x^\top w^{(t_0)} \ge c_0$ for a sufficiently small positive constant $c_0$ is also mild. Similar assumptions have been made in relevant works including [18, 50], where it is not appropriate to assume that $-x$ is also contained in the structured set (such as a closed convex cone or the range of a deep generative model). As a result, we provide an upper bound on $\|w^{(t)} - x\|_2$, instead of $\min\{\|w^{(t)} - x\|_2, \|w^{(t)} + x\|_2\}$, which is a commonly adopted distance measure in the literature on real-valued PR.

Moreover, although the projected power iterations in the first step of Algorithm 1 can attain the near-optimal statistical rate under appropriate conditions, it is evident from a large body of literature on PR (see, e.g., [62, 13, 10, 99, 98, 64]) that such a spectral method better serves as the initialization step of a subsequent iterative approach. This motivates us to propose the second step of Algorithm 1, and in our numerical experiments, we clearly observe the benefit of the second step. More specifically, compared to simply performing the projected power method, performing both steps of Algorithm 1 leads to significantly better reconstructed images when the total number of iterations $T_1 + T_2$ is fixed to be the same.

Next, we present the following theorem, which is proved in the supplementary material. It shows that under appropriate initialization and the assumption of exact projections, the iterative algorithm in the second step of Algorithm 1 converges linearly to a point achieving the near-optimal statistical rate of order $\sqrt{(k \log L)(\log m)/m}$.

Theorem 2. Assume that the step size $\zeta$ and $x^{(0)}$, the initial vector for the second step of Algorithm 1, satisfy
$$2\,|1 - \zeta\nu| + 5\,\zeta\nu\,\big\|x^{(0)} - x\big\|_2 + \beta_1 = 1 - \beta_2, \qquad (18)$$
where both $\beta_1$ and $\beta_2$ are positive constants.³ Suppose that $m = \Omega\big((k \log(nLr))(\log m)\big)$ with a large enough implied constant. Then, with probability $1 - O(1/m)$, the following event occurs: there exists a positive integer $T_0 = O\big(\log\frac{m}{(k \log(nLr))(\log m)}\big)$ such that the sequence $\{\|x^{(t)} - x\|_2\}_{t \in [0, T_0]}$ is monotonically decreasing, with the following inequality holding for all $t \le T_0$:
$$\big\|x^{(t)} - x\big\|_2 < (1 - \beta_2)^t\,\big\|x^{(0)} - x\big\|_2 + C K_y \sqrt{\frac{(k \log(nLr))(\log m)}{m}}. \qquad (19)$$
In addition, for all $t \ge T_0$, we have
$$\big\|x^{(t)} - x\big\|_2 \le C K_y \sqrt{\frac{(k \log(nLr))(\log m)}{m}}. \qquad (20)$$

³ $\beta_1$ is sufficiently small, and it is used to absorb a certain $o(1)$ term.
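To see how a guarantee of the form (19)-(20) follows from linear convergence, here is a standard sketch-level argument (ours, for illustration; it is not the paper's proof), stated under the simplifying assumption that each iteration of the second step satisfies a contraction-plus-noise inequality with statistical error term $\epsilon_m$:

```latex
% Assumed per-iteration bound (illustrative, not proved here):
\|x^{(t+1)} - x\|_2 \le (1-\beta_2)\,\|x^{(t)} - x\|_2 + \beta_2\,\epsilon_m,
\qquad \epsilon_m \asymp K_y \sqrt{\tfrac{(k\log(nLr))(\log m)}{m}}.
% Unrolling the recursion and bounding the geometric series gives
\|x^{(t)} - x\|_2
  \le (1-\beta_2)^t \|x^{(0)} - x\|_2
   + \beta_2\,\epsilon_m \sum_{j=0}^{t-1} (1-\beta_2)^j
  \le (1-\beta_2)^t \|x^{(0)} - x\|_2 + \epsilon_m,
% which matches the form of (19). After
T_0 = O\big(\log(1/\epsilon_m)\big)
    = O\Big(\log \tfrac{m}{(k\log(nLr))(\log m)}\Big)
% iterations, the geometric term is dominated by \epsilon_m, yielding the error floor in (20).
```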
Remark 4. In our analysis, we need to impose the assumption (18) on the step size $\zeta$ and the initial vector $x^{(0)}$. This makes the first step of Algorithm 1 necessary, since when $\|x^{(0)} - x\|_2$ is not small, say $\|x^{(0)} - x\|_2 = 1$, the condition (18) cannot be satisfied. In comparison, to attain the near-optimal statistical rate $O\big(\sqrt{(k \log L)(\log m)/m}\big)$, the initialization condition of the first step of Algorithm 1 is milder, and $x^\top w^{(t_0)}$ (see Theorem 1) only needs to be lower bounded by a sufficiently small positive constant (thus $\|w^{(t_0)} - x\|_2$ can be close to $\sqrt{2}$). However, although the second step of Algorithm 1 requires a more restrictive initialization condition, we observe from our experimental results that it clearly refines the estimate of the first step. Such a phenomenon is also observed in various works related to PR, including [62, 13, 10, 99, 98, 64].

Remark 5. In Remark 4, we briefly compared the initialization condition $x^\top w^{(t_0)} \ge c_0$ in Theorem 1 with the typical initialization condition $\|x - w^{(t_0)}\|_2 < \delta\|x\|_2$. In the following, we provide a more detailed discussion. When both x and $w^{(t_0)}$ are unit vectors (this is the setting of our Theorem 1), the typical initialization requirement $\|x - w^{(t_0)}\|_2 < \delta\|x\|_2$ can be rewritten as $2\big(1 - x^\top w^{(t_0)}\big) < \delta^2$, or equivalently, $x^\top w^{(t_0)} > 1 - \frac{\delta^2}{2}$. Note that $\delta$ is typically a small positive constant (e.g., $\delta = 1/6$ in [10] and $\delta = 1/8$ in [13]), and thus the typical initialization condition requires $x^\top w^{(t_0)}$ to be larger than some positive constant that is close to 1. This is stronger than the assumption $x^\top w^{(t_0)} \ge c_0$,⁴ where $c_0$ is a sufficiently small positive constant.

⁴ It basically assumes the weak recovery of the signal; see, e.g., [61].

Remark 6. The condition in (18) requires $|1 - \zeta\nu| < \frac{1}{2}$. This reveals that we should choose $\zeta$ such that $\zeta > \frac{1}{2\nu}$, and a good choice is $\zeta = \frac{1}{\nu}$ (in this case, the condition (18) essentially reduces to $\|x^{(0)} - x\|_2 < \frac{1}{5}$). Since the knowledge of the link function f (and thus of $\nu$, which depends on f; see (8)) cannot be assumed, in our experiments we use $\hat{\nu}^{(t)}$ to approximate $\nu$. That is, $\zeta$ is set to $\frac{1}{\hat{\nu}^{(t)}}$ in the t-th iteration of the second step of Algorithm 1 (so it varies slightly across iterations instead of being fixed).

5 Numerical Results

In this section, we demonstrate the empirical performance of our Algorithm 1 (denoted by MPRG). We remark that these numerical experiments are proof-of-concept rather than seeking to be comprehensive, since our contributions are primarily theoretical. We present some numerical results for the MNIST [44] dataset in the main document. Additional results for the MNIST dataset and some experimental results for the CelebA [53] dataset are presented in the supplementary material. The MNIST dataset contains 60,000 images of handwritten digits. The size of each image is 28 × 28, and thus the dimension of the image vector is n = 784. For the MNIST dataset, the generative model G is set to be (the normalized version of) a pre-trained variational autoencoder (VAE) model with latent dimension k = 20. We make use of the VAE model pre-trained by the authors of [6] directly. For this VAE model, both the encoder and decoder are fully connected neural networks with two hidden layers, and the architecture is 20-500-500-784. The VAE model is trained on the images in the training set by the Adam optimizer with a mini-batch size of 100 and a learning rate of 0.001. The projection step $P_G(\cdot)$ (cf. (11)) is approximated by the Adam optimizer with a learning rate of 0.1 and 120 steps.
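For concreteness, the approximate projection can be sketched as follows. This is an illustrative re-implementation using PyTorch for automatic differentiation (the reported results use TensorFlow 1.5, and the function and variable names here are ours); it assumes G is a differentiable decoder module mapping a k-dimensional latent vector to an n-dimensional output, and it omits the latent constraint $\|z\|_2 \le r$.

```python
import torch

def approx_project_G(G, s, k=20, steps=120, lr=0.1):
    """Approximate P_G(s) = G(argmin_z ||G(z) - s||_2) by running Adam over the latent z."""
    s = torch.as_tensor(s, dtype=torch.float32)
    z = torch.randn(k, requires_grad=True)        # random latent initialization
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.norm(G(z) - s)               # ||G(z) - s||_2
        loss.backward()
        opt.step()
    with torch.no_grad():
        out = G(z)
        out = out / torch.norm(out)               # normalize, since Range(G) is assumed to lie on S^{n-1}
    return out.numpy()
```

In practice, the minimization can be restarted from several random latent initializations and the output with the smallest objective kept, which mirrors the random restarts used for the overall reconstructions below.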
We report results on 10 testing images selected from the test set. Note that these images are unseen by the pre-trained generative model. We perform 10 random restarts. For reconstructed images, we choose the best among these 10 random restarts to reduce the impact of local minima. We also provide quantitative comparisons with respect to the reconstruction error $\|\hat{x} - x\|_2$, where x is the underlying signal and $\hat{x}$ refers to the estimated (normalized) vector produced by each of the methods described below. The reconstruction error is averaged over both the 10 testing images and the 10 random restarts. All experiments are run using Python 3.6 and TensorFlow 1.5.0, with an NVIDIA GeForce GTX 1080 Ti 11GB GPU. For Algorithm 1, we set $T_1 = 20$ and $T_2 = 30$. As mentioned in Section 3, the starting point $w^{(0)}$ is set to be the column of $\frac{1}{m}\sum_{i=1}^m y_i a_i a_i^\top$ (i.e., a shifted version of V defined in (10)) that corresponds to the largest diagonal entry. In addition, as mentioned in Remark 6, we set the step size $\zeta$ to $\frac{1}{\hat{\nu}^{(t)}}$ (cf. (13)) in the t-th iteration of the second step of Algorithm 1.

We compare our Algorithm 1 (denoted by MPRG) with the following methods:

1) The method proposed in [98], which is for misspecified phase retrieval with sparse priors and is denoted by MPRS. All the involved parameters are set to be the same as those in [98].
2) Simply performing the first step of Algorithm 1 for $T_1 + T_2 = 50$ iterations. The corresponding method is denoted by PPower. The purpose of comparing to PPower is to verify the benefit of the second step of Algorithm 1.
3) Simply performing the second step of Algorithm 1 for $T_1 + T_2 = 50$ iterations. The corresponding method is denoted by Step2. The purpose of comparing to Step2 is to check whether the first step of Algorithm 1 is practically useful.
4) The alternating phase projected gradient descent (APPGD) algorithm proposed in [37]. This algorithm is specifically designed for phase retrieval with magnitude-only measurements (cf. (2)) and generative priors, and the corresponding iterative procedure is
$$x^{(t+1)} = P_G\Big(x^{(t)} - \frac{\tau}{m}\sum_{i=1}^m \big(a_i^\top x^{(t)} - y_i\,\mathrm{sign}(a_i^\top x^{(t)})\big)\,a_i\Big),$$
where $\tau > 0$ is the step size. We follow [37] to set $\tau = 0.9$. For a fair comparison, we use the vector produced after $T_1 = 20$ iterations of the first step of Algorithm 1 as the initialization vector of APPGD, and then we run APPGD for $T_2 = 30$ iterations.

We first consider the noisy magnitude-only measurement models, for $i \in [m]$,
$$y_i = |a_i^\top x| + \eta_i, \qquad (22)$$
$$y_i = |a_i^\top x + \eta_i|, \qquad (23)$$
where the $\eta_i$ are i.i.d. realizations of an $\mathcal{N}(0, \sigma^2)$ random variable. For such measurement models, the corresponding random nonlinear link function f is $f(x) = |x| + \eta$ or $f(x) = |x + \eta|$, where $\eta \sim \mathcal{N}(0, \sigma^2)$.
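As an illustration of how observations under the models (22) and (23) (and, similarly, (24) and (25) below) can be synthesized for such experiments, here is a short NumPy sketch with our own variable names, under the Gaussian measurement assumption:

```python
import numpy as np

def generate_measurements(x, m, model="abs_plus_noise", sigma=0.1, rng=None):
    """Draw m standard Gaussian sensing vectors and form observations under an MPR link."""
    rng = rng or np.random.default_rng()
    n = x.shape[0]
    A = rng.standard_normal((m, n))          # a_i ~ N(0, I_n)
    g = A @ x                                # a_i^T x
    eta = sigma * rng.standard_normal(m)     # eta_i ~ N(0, sigma^2)
    if model == "abs_plus_noise":            # model (22): y_i = |a_i^T x| + eta_i
        y = np.abs(g) + eta
    elif model == "abs_of_noisy":            # model (23): y_i = |a_i^T x + eta_i|
        y = np.abs(g + eta)
    elif model == "tanh":                    # model (24): y_i = |a_i^T x| + 2 tanh(|a_i^T x|) + eta_i
        y = np.abs(g) + 2 * np.tanh(np.abs(g)) + eta
    else:                                    # model (25): y_i = 2 (a_i^T x)^2 + 3 sin(|a_i^T x|) + eta_i
        y = 2 * g**2 + 3 * np.sin(np.abs(g)) + eta
    return A, y
```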
The numerical results for (22) and (23) are shown in Figures 1 and 2. From these figures, we observe that the sparsity based method MPRS attains poor reconstructions, and the results of PPower are not desirable. The three methods Step2, APPGD, and MPRG lead to similar reconstruction errors, but the reconstructed images of APPGD and MPRG are better than those of Step2. In particular, our method MPRG leads to mostly accurate reconstructions that are competitive with those of APPGD, even though we do not make use of the knowledge of the link function f and MPRG is not specifically designed for the magnitude-only measurement models.

Figure 1: Examples of reconstructed MNIST images (Original, MPRS, PPower, Step2, APPGD, MPRG) for the measurement models (22) and (23); (a) (22) with m = 200 and σ = 0, (b) (23) with m = 400 and σ = 0.01.

Figure 2: Quantitative comparisons of the reconstruction error for the measurement models (22) and (23); (a) (22) with m = 400 fixed (varying σ), (b) (23) with σ = 0.1 fixed and m varying from 200 to 800.

Next, we consider the following two measurement models:
$$y_i = |a_i^\top x| + 2\tanh(|a_i^\top x|) + \eta_i, \qquad (24)$$
$$y_i = 2(a_i^\top x)^2 + 3\sin(|a_i^\top x|) + \eta_i, \qquad (25)$$
where again the $\eta_i$ are i.i.d. realizations of an $\mathcal{N}(0, \sigma^2)$ random variable. For both models in (24) and (25), the corresponding link functions satisfy the condition (8) for MPR [98]. The numerical results are presented in Figures 3 and 4. We observe from these figures that for the measurement models (24) and (25), our method MPRG achieves the best reconstructions. In particular, it outperforms APPGD in terms of recovery quality and/or reconstruction error.

Figure 3: Examples of reconstructed MNIST images (Original, MPRS, PPower, Step2, APPGD, MPRG) for the models in (24) and (25); (a) (24) with m = 200 and σ = 0.01, (b) (25) with m = 300 and σ = 0.5.

Figure 4: Quantitative comparisons of the reconstruction error for the measurement models in (24) and (25); (a) (24) with m = 200 fixed (varying σ), (b) (25) with σ = 0.5 fixed and m varying from 200 to 800.

6 Conclusion and Future Work

We have proposed a two-step approach for phase retrieval under model misspecification and generative priors. We show that under suitable conditions, both steps of our approach produce estimated vectors that achieve the near-optimal statistical rate of order $\sqrt{(k \log L)(\log m)/m}$, where k and L are the latent dimension and the Lipschitz constant of the generative model, respectively, and m refers to the number of samples. We assume exact projections in our analysis and use a gradient-based method to approximate the projection step $P_G(\cdot)$ in our experiments. Although the exact projection assumption is commonly made in relevant works [79, 37, 38, 71, 50], it is a very interesting future research direction to design provably-guaranteed efficient methods for the projection step. In addition, we focus on real Gaussian measurements. While we believe that, based on the technical results in [62, 13] (which study complex Gaussian measurements), it is straightforward to extend our work to the complex case, the extension to more practical non-Gaussian measurement models such as sub-sampled Fourier measurements is a very interesting future direction. Another direction is to use different pre-processing functions to enhance the performance of our spectral initialization method [56, 54, 57]. Moreover, if one has access to the nonlinear link function f, the Bayes-optimal performance can be characterized using message-passing algorithms [5, 58, 3]. It would be interesting to connect or compare our results with the corresponding Bayes-optimal rate.

Acknowledgment. J. Liu was partially supported by NSFC Grant #12288201 and the Fund of the Youth Innovation Promotion Association, CAS (2022002). We sincerely thank the five anonymous reviewers and area chair for their careful reading and helpful comments. We are extremely grateful to Dr.
Jonathan Scarlett for proofreading the manuscript and giving valuable suggestions. [1] E. Arias-Castro, E. J. Candes, and M. A. Davenport, On the fundamental limits of adaptive sensing, IEEE Trans. Inf. Theory, vol. 59, no. 1, pp. 472 481, 2012. [2] M. Asim, M. Daniels, O. Leong, A. Ahmed, and P. Hand, Invertible generative models for inverse problems: Mitigating representation error and dataset bias, in Int. Conf. Mach. Learn. (ICML). PMLR, 2020, pp. 399 409. [3] B. Aubin, B. Loureiro, A. Baker, F. Krzakala, and L. Zdeborová, Exact asymptotics for phase retrieval and compressed sensing with random generative priors, in Math. Sci. Mach. Learn. (MSML). PMLR, 2020, pp. 55 73. [4] S. Bahmani and J. Romberg, Phase retrieval meets statistical learning theory: A flexible convex relaxation, in Int. Conf. Artif. Intell. Stat. (AISTATS). PMLR, 2017, pp. 252 260. [5] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová, Optimal errors and phase transitions in high-dimensional generalized linear models, PNAS, vol. 116, no. 12, pp. 5451 5460, 2019. [6] A. Bora, A. Jalal, E. Price, and A. G. Dimakis, Compressed sensing using generative models, in Int. Conf. Mach. Learn. (ICML), 2017, pp. 537 546. [7] J.-F. Cai, Y. Jiao, X. Lu, and J. You, Sample-efficient sparse phase retrieval via stochastic alternating minimization, https://arxiv.org/2112.07919, 2021. [8] J.-F. Cai, J. Li, X. Lu, and J. You, Sparse signal recovery from phaseless measurements via hard thresholding pursuit, https://arxiv.org/abs/2005.08777, 2020. [9] J. Cai, M. Huang, D. Li, and Y. Wang, Solving phase retrieval with random initial guess is nearly as good as by spectral initialization, https://arxiv.org/abs/2101.03540, 2021. [10] T. T. Cai, X. Li, and Z. Ma, Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow, Ann. Stat., vol. 44, no. 5, pp. 2221 2251, 2016. [11] E. J. Candes and M. A. Davenport, How well can we estimate a sparse vector? Appl. Comp. Harm. Analysis, vol. 34, no. 2, pp. 317 323, 2013. [12] E. J. Candes, Y. C. Eldar, T. Strohmer, and V. Voroninski, Phase retrieval via matrix completion, SIAM Rev., vol. 57, no. 2, pp. 225 251, 2015. [13] E. J. Candès, X. Li, and M. Soltanolkotabi, Phase retrieval via Wirtinger flow: Theory and algorithms, IEEE Trans. Inf. Theory, vol. 61, no. 4, pp. 1985 2007, 2015. [14] E. J. Candès, T. Strohmer, and V. Voroninski, Phase Lift: Exact and stable signal recovery from magnitude measurements via convex programming, Comm. Pure Appl. Math., vol. 66, no. 8, pp. 1241 1274, 2013. [15] Y. Chen and E. J. Candès, Solving random quadratic systems of equations is nearly as easy as solving linear systems, Comm. Pure Appl. Math., vol. 70, no. 5, pp. 822 883, 2017. [16] Y. Chen, Y. Chi, J. Fan, and C. Ma, Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval, Math. Program., vol. 176, no. 1, pp. 5 37, 2019. [17] L. Cheng, P. Zeng, and Y. Zhu, BS-SIM: An effective variable selection method for highdimensional single index model, Electron. J. Stat., vol. 11, no. 2, pp. 3522 3548, 2017. [18] Y. Deshpande, A. Montanari, and E. Richard, Cone-constrained principal component analysis, in Conf. Neur. Inf. Proc. Sys. (Neur IPS), 2014, pp. 2717 2725. [19] M. Dhar, A. Grover, and S. Ermon, Modeling sparse deviations for compressed sensing using generative models, in Int. Conf. Mach. Learn. (ICML). PMLR, 2018, pp. 1214 1223. [20] H. Eftekhari, M. Banerjee, and Y. 
Ritov, Inference in high-dimensional single-index models under symmetric designs, J. Mach. Learn. Res., vol. 22, pp. 27 1, 2021. [21] J. R. Fienup, Phase retrieval algorithms: A comparison, Appl. Opt., vol. 21, no. 15, pp. 2758 2769, 1982. [22] J. C. Foster, J. M. Taylor, and B. Nan, Variable selection in monotone single-index models via the adaptive Lasso, Stat. Med., vol. 32, no. 22, pp. 3944 3954, 2013. [23] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing. Springer New York, 2013. [24] R. Ganti, N. Rao, R. M. Willett, and R. Nowak, Learning single index models in high dimensions, https://arxiv.org/1506.08910, 2015. [25] B. Gao, X. Sun, Y. Wang, and Z. Xu, Perturbed amplitude flow for phase retrieval, IEEE Trans. Sig. Proc., vol. 68, pp. 5427 5440, 2020. [26] M. Genzel, High-dimensional estimation of structured signals from non-linear observations with general convex loss functions, IEEE Trans. Inf. Theory, vol. 63, no. 3, pp. 1601 1619, 2016. [27] R. W. Gerchberg, A practical algorithm for the determination of phase from image and diffraction plane pictures, Optik, vol. 35, pp. 237 246, 1972. [28] L. Goldstein, S. Minsker, and X. Wei, Structured signal recovery from non-linear and heavytailed measurements, IEEE Trans. Inf. Theory, vol. 64, no. 8, pp. 5513 5530, 2018. [29] T. Goldstein and C. Studer, Phasemax: Convex phase retrieval via basis pursuit, IEEE Trans. Inf. Theory, vol. 64, no. 4, pp. 2675 2689, 2018. [30] A. K. Han, Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator, J. Econom., vol. 35, no. 2-3, pp. 303 316, 1987. [31] P. Hand, O. Leong, and V. Voroninski, Phase retrieval under a generative prior, in Conf. Neur. Inf. Proc. Sys. (Neur IPS), 2018, pp. 9154 9164. [32] P. Hand and V. Voroninski, Global guarantees for enforcing deep generative priors by empirical risk, in Conf. Learn. Theory (COLT). PMLR, 2018, pp. 970 978. [33] B. Hao, Y. Yadkori, Z. Wen, and G. Cheng, Bootstrapping upper confidence bound, Conf. Neur. Inf. Proc. Sys. (Neur IPS), 2019. [34] R. Heckel and P. Hand, Deep decoder: Concise image representations from untrained nonconvolutional networks, in Int. Conf. Learn. Repr. (ICLR), 2019. [35] J. L. Horowitz, Semiparametric and nonparametric methods in econometrics. Springer, 2009, vol. 12. [36] M. Huang and Z. Xu, The estimation performance of nonlinear least squares for phase retrieval, IEEE Trans. Inf. Theory, vol. 66, no. 12, pp. 7967 7977, 2020. [37] R. Hyder, V. Shah, C. Hegde, and M. S. Asif, Alternating phase projected gradient descent with generative priors for solving compressive phase retrieval, in Int. Conf. Acoust. Sp. Sig. Proc. (ICASSP). IEEE, 2019, pp. 7705 7709. [38] G. Jagatap and C. Hegde, Algorithmic guarantees for inverse imaging with untrained network priors, in Conf. Neur. Inf. Proc. Sys. (Neur IPS), vol. 32, 2019. [39] G. Jagatap and C. Hegde, Sample-efficient algorithms for recovering structured signals from magnitude-only measurements, IEEE Trans. Inf. Theory, vol. 65, no. 7, pp. 4434 4456, 2019. [40] A. Jalal, S. Karmalkar, A. Dimakis, and E. Price, Instance-optimal compressed sensing via posterior sampling, in Int. Conf. Mach. Learn. (ICML), 2021. [41] A. Jalal, L. Liu, A. G. Dimakis, and C. Caramanis, Robust compressed sensing using generative models, Conf. Neur. Inf. Proc. Sys. (Neur IPS), 2020. [42] A. Kamath, S. Karmalkar, and E. Price, On the power of compressed sensing with generative models, in Int. Conf. Mach. Learn. (ICML), 2020, pp. 5101 5109. [43] V. 
Kuleshov, Fast algorithms for sparse principal component analysis based on Rayleigh quotient iteration, in Int. Conf. Mach. Learn. (ICML). PMLR, 2013, pp. 1418 1425. [44] Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE, vol. 86, no. 11, pp. 2278 2324, 1998. [45] K.-C. Li and N. Duan, Regression analysis under link violation, Ann. Stat., pp. 1009 1052, 1989. [46] J. Liu and Z. Liu, Non-iterative recovery from nonlinear observations using generative models, in IEEE Comput. Soc. Conf. Comput. Vis. Pattern. Recognit. (CVPR), 2022, pp. 233 243. [47] Z. Liu, S. Ghosh, and J. Scarlett, Towards sample-optimal compressive phase retrieval with sparse and generative priors, in Conf. Neur. Inf. Proc. Sys. (Neur IPS), 2021. [48] Z. Liu, S. Gomes, A. Tiwari, and J. Scarlett, Sample complexity bounds for 1-bit compressive sensing and binary stable embeddings with generative priors, in Int. Conf. Mach. Learn. (ICML), 2020. [49] Z. Liu and J. Han, Projected gradient descent algorithms for solving nonlinear inverse problems with generative priors, in Int. Jt. Conf. Artif. Intell. (IJCAI), 2022. [50] Z. Liu, J. Liu, S. Ghosh, J. Han, and J. Scarlett, Generative principal component analysis, in Int. Conf. Learn. Repr. (ICLR), 2022. [51] Z. Liu and J. Scarlett, The generalized Lasso with nonlinear observations and generative priors, in Conf. Neur. Inf. Proc. Sys. (Neur IPS), vol. 33, 2020. [52] Z. Liu and J. Scarlett, Information-theoretic lower bounds for compressive sensing with generative models, IEEE J. Sel. Areas Inf. Theory, vol. 1, no. 1, pp. 292 303, 2020. [53] Z. Liu, P. Luo, X. Wang, and X. Tang, Deep learning face attributes in the wild, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015, pp. 3730 3738. [54] Y. M. Lu and G. Li, Phase transitions of spectral initialization for high-dimensional non-convex estimation, Inf. Inference, vol. 9, no. 3, pp. 507 541, 2020. [55] S. Luo and S. Ghosal, Forward selection and estimation in high dimensional single index models, Stat. Methodol., vol. 33, pp. 172 179, 2016. [56] W. Luo, W. Alghamdi, and Y. M. Lu, Optimal spectral initialization for signal recovery with applications to phase retrieval, IEEE Trans. Sig. Proc., vol. 67, no. 9, pp. 2347 2356, 2019. [57] A. Maillard, F. Krzakala, Y. M. Lu, and L. Zdeborová, Construction of optimal spectral methods in phase retrieval, in Math. Sci. Mach. Learn. (MSML). PMLR, 2022, pp. 693 720. [58] A. Maillard, B. Loureiro, F. Krzakala, and L. Zdeborová, Phase retrieval in high dimensions: Statistical and computational phase transitions, Conf. Neur. Inf. Proc. Sys. (Neur IPS), vol. 33, pp. 11 071 11 082, 2020. [59] P. Mc Cullagh and J. A. Nelder, Generalized linear models. Routledge, 2019. [60] S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin, Pulse: Self-supervised photo upsampling via latent space exploration of generative models, in IEEE Comput. Soc. Conf. Comput. Vis. Pattern. Recognit. (CVPR), 2020, pp. 2437 2445. [61] M. Mondelli and A. Montanari, Fundamental limits of weak recovery with applications to phase retrieval, in Conf. Learn. Theory (COLT). PMLR, 2018, pp. 1445 1450. [62] P. Netrapalli, P. Jain, and S. Sanghavi, Phase retrieval using alternating minimization, IEEE Trans. Sig. Proc., vol. 63, no. 18, pp. 4814 4826, 2015. [63] M. Neykov, J. S. Liu, and T. Cai, L1-regularized least squares for support recovery of high dimensional single index models with Gaussian designs, J. Mach. Learn. Res., vol. 17, no. 1, pp. 2976 3012, 2016. [64] M. 
Neykov, Z. Wang, and H. Liu, Agnostic estimation for misspecified phase retrieval models, J. Mach. Learn. Res., vol. 21, pp. 1 39, 2020. [65] T. V. Nguyen, G. Jagatap, and C. Hegde, Provable compressed sensing with generative priors via Langevin dynamics, https://arxiv.org/2102.12643, 2021. [66] H. Ohlsson, A. Yang, R. Dong, and S. Sastry, CPRL An extension of compressive sensing to the phase retrieval problem, Conf. Neur. Inf. Proc. Sys. (Neur IPS), vol. 25, pp. 1367 1375, 2012. [67] G. Ongie, A. Jalal, C. A. Metzler, R. G. Baraniuk, A. G. Dimakis, and R. Willett, Deep learning techniques for inverse problems in imaging, IEEE J. Sel. Areas Inf. Theory, vol. 1, no. 1, pp. 39 56, 2020. [68] S. Oymak and M. Soltanolkotabi, Fast and reliable parameter estimation from nonlinear observations, SIAM J. Optim., vol. 27, no. 4, pp. 2276 2300, 2017. [69] A. Pananjady and D. P. Foster, Single-index models in the high signal regime, IEEE Trans. Inf. Theory, vol. 67, no. 6, pp. 4092 4124, 2021. [70] E. J. R. Pauwels, A. Beck, Y. C. Eldar, and S. Sabach, On fienup methods for sparse phase retrieval, IEEE Trans. Sig. Proc., vol. 66, no. 4, pp. 982 991, 2017. [71] P. Peng, S. Jalali, and X. Yuan, Solving inverse problems via auto-encoders, IEEE J. Sel. Areas Inf. Theory, vol. 1, no. 1, pp. 312 323, 2020. [72] Y. Plan and R. Vershynin, The generalized Lasso with non-linear observations, IEEE Trans. Inf. Theory, vol. 62, no. 3, pp. 1528 1537, 2016. [73] Y. Plan, R. Vershynin, and E. Yudovina, High-dimensional estimation with geometric constraints, Inf. Inference, vol. 6, no. 1, pp. 1 40, 2017. [74] P. Radchenko, High dimensional single index models, J. Multivar. Anal., vol. 139, pp. 266 282, 2015. [75] A. Raj, Y. Li, and Y. Bresler, GAN-based projector for faster recovery with convergence guarantees in linear inverse problems, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2019. [76] J. Rick Chang, C.-L. Li, B. Poczos, B. Vijaya Kumar, and A. C. Sankaranarayanan, One network to solve them all solving linear inverse problems using deep projection models, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 5888 5897. [77] J. Scarlett and V. Cevher, Limits on support recovery with probabilistic models: An informationtheoretic framework, IEEE Trans. Inf. Theory, vol. 63, no. 1, pp. 593 620, 2016. [78] J. Scarlett, R. Heckel, M. R. Rodrigues, P. Hand, and Y. C. Eldar, Theoretical perspectives on deep learning methods in inverse problems, https://arxiv.org/2206.14373, 2022. [79] V. Shah and C. Hegde, Solving linear inverse problems using GAN priors: An algorithm with provable guarantees, in Int. Conf. Acoust. Sp. Sig. Proc. (ICASSP). IEEE, 2018, pp. 4609 4613. [80] F. Shamshad and A. Ahmed, Compressed sensing-based robust phase retrieval via deep generative priors, IEEE Sens. J., vol. 21, no. 2, pp. 2286 2298, 2020. [81] R. P. Sherman, The limiting distribution of the maximum rank correlation estimator, Econometrica, pp. 123 137, 1993. [82] M. Soltanolkotabi, Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization, IEEE Trans. Inf. Theory, vol. 65, no. 4, pp. 2374 2400, 2019. [83] J. Sun, Q. Qu, and J. Wright, A geometric analysis of phase retrieval, Found. Comput. Math., vol. 18, no. 5, pp. 1131 1198, 2018. [84] C. Thrampoulidis, E. Abbasi, and B. Hassibi, Lasso with non-linear measurements is equivalent to one with linear measurements, in Conf. Neur. Inf. Proc. Sys. (Neur IPS), 2015, pp. 3420 3428. [85] D. Van Veen, A. Jalal, M. 
Soltanolkotabi, E. Price, S. Vishwanath, and A. G. Dimakis, Compressed sensing with deep image prior and learned regularization, https://arxiv.org/1806.06438, 2018. [86] R. Vershynin, Introduction to the non-asymptotic analysis of random matrices, https://arxiv.org/abs/1011.3027, 2010. [87] R. Vershynin, High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018, vol. 47. [88] M. J. Wainwright, Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting, IEEE Trans. Inf. Theory, vol. 55, no. 12, pp. 5728 5741, 2009. [89] M. J. Wainwright, High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019, vol. 48. [90] I. Waldspurger, A. d Aspremont, and S. Mallat, Phase recovery, maxcut and complex semidefinite programming, Math. Program., vol. 149, no. 1, pp. 47 81, 2015. [91] G. Wang, G. B. Giannakis, and Y. C. Eldar, Solving systems of random quadratic equations via truncated amplitude flow, IEEE Trans. Inf. Theory, vol. 64, no. 2, pp. 773 794, 2017. [92] G. Wang, L. Zhang, G. B. Giannakis, M. Akçakaya, and J. Chen, Sparse phase retrieval via truncated amplitude flow, IEEE Trans. Sig. Proc., vol. 66, no. 2, pp. 479 491, 2017. [93] X. Wei, Structured recovery with heavy-tailed measurements: A thresholding procedure and optimal rates, https://arxiv.org/abs/1804.05959, 2018. [94] X. Wei, Z. Yang, and Z. Wang, On the statistical rate of nonlinear recovery in generative models with heavy-tailed data, in Int. Conf. Mach. Learn. (ICML), 2019, pp. 6697 6706. [95] J. Whang, Q. Lei, and A. G. Dimakis, Compressed sensing with invertible generative models and dependent noise, https://arxiv.org/2003.08089, 2020. [96] Y. Wu, M. Rosca, and T. Lillicrap, Deep compressed sensing, in Int. Conf. Mach. Learn. (ICML). PMLR, 2019, pp. 6850 6860. [97] Z. Yang, K. Balasubramanian, and H. Liu, High-dimensional non-Gaussian single index models via thresholded score function estimation, in Int. Conf. Mach. Learn. (ICML). PMLR, 2017, pp. 3851 3860. [98] Z. Yang, L. F. Yang, E. X. Fang, T. Zhao, Z. Wang, and M. Neykov, Misspecified nonconvex statistical optimization for sparse phase retrieval, Math. Program., vol. 176, no. 1, pp. 545 571, 2019. [99] H. Zhang, Y. Zhou, Y. Liang, and Y. Chi, A nonconvex approach for phase retrieval: Reshaped Wirtinger flow and incremental algorithms, J. Mach. Learn. Res., vol. 18, 2017. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] The limitations are relevant to our future work, described in Section 6. (c) Did you discuss any potential negative societal impacts of your work? [N/A] (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The code is included in the supplementary material. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? 
[Yes] (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [N/A] (b) Did you mention the license of the assets? [N/A] (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]