# Phase Retrieval Under a Generative Prior

Paul Hand, Northeastern University, p.hand@northeastern.edu
Oscar Leong, Rice University, oscar.f.leong@rice.edu
Vladislav Voroninski, Helm.ai, vlad@helm.ai

Abstract. We introduce a novel deep-learning-inspired formulation of the phase retrieval problem, which asks to recover a signal $y_0 \in \mathbb{R}^n$ from $m$ quadratic observations, under structural assumptions on the underlying signal. As is common in many imaging problems, previous methodologies have modeled natural signals as sparse with respect to a known basis, resulting in the decision to enforce a generic sparsity prior. However, these methods for phase retrieval have encountered possibly fundamental limitations, as no computationally efficient algorithm for sparse phase retrieval has been proven to succeed with fewer than $O(k^2 \log n)$ generic measurements, which exceeds the theoretical optimum of $O(k \log n)$. In this paper, we propose a new framework for phase retrieval by modeling natural signals as being in the range of a deep generative neural network $G : \mathbb{R}^k \to \mathbb{R}^n$. We introduce an empirical risk formulation that has favorable global geometry for gradient methods as soon as $m = O(k d^2 \log n)$, under the model of a $d$-layer fully connected neural network with random weights. Specifically, when suitable deterministic conditions on the generator and measurement matrix are met, we construct a descent direction for any point outside of a small neighborhood around the true $k$-dimensional latent code and a negative multiple thereof. This formulation for structured phase retrieval thus benefits from two effects: generative priors can represent natural signals more tightly than sparsity priors, and this empirical risk formulation can exploit those priors at an information-theoretically optimal sample complexity, unlike with a sparsity prior. We corroborate these results with experiments showing that exploiting generative models in phase retrieval tasks outperforms both sparse and general phase retrieval methods.

Authors are listed in alphabetical order. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1 Introduction

We study the problem of recovering a signal $y_0 \in \mathbb{R}^n$ given $m$ phaseless observations of the form $b = |Ay_0|$, where the measurement matrix $A \in \mathbb{R}^{m \times n}$ is known and $|\cdot|$ is understood to act entrywise. This is known as the phase retrieval problem. In this work, we assume, as a prior, that the signal $y_0$ is in the range of a generative model $G : \mathbb{R}^k \to \mathbb{R}^n$, so that $y_0 = G(x_0)$ for some $x_0 \in \mathbb{R}^k$. To recover $y_0$, we first recover the latent code $x_0$ corresponding to it, from which $y_0$ is obtained by applying $G$. Hence we study the phase retrieval problem under a generative prior, which asks: find $x \in \mathbb{R}^k$ such that $b = |AG(x)|$. We will refer to this formulation as Deep Phase Retrieval (DPR).

The phase retrieval problem has applications in X-ray crystallography [21, 29], optics [34], astronomical imaging [14], diffraction imaging [5], and microscopy [28]. In these problems, the phase information of an object is lost due to physical limitations of scientific instruments. In crystallography, the linear measurements in practice are typically Fourier modes because they are the far-field limit of a diffraction pattern created by emitting a quasi-monochromatic wave at the object of interest. In many applications, the signals to be recovered are compressible or sparse with respect to a certain basis (e.g. wavelets).
Many researchers have attempted to leverage sparsity priors in phase retrieval to yield more efficient recovery algorithms. However, these methods have been met with potentially severe fundamental limitations. In the Gaussian measurement regime, where $A$ has i.i.d. Gaussian entries, one would hope that recovery of a $k$-sparse $n$-dimensional signal is possible with $O(k \log n)$ measurements. However, there is no known method that succeeds with fewer than $O(k^2 \log n)$ measurements. Moreover, [26] proved that the semidefinite program PhaseLift cannot beat this suboptimal sample complexity via direct $\ell_1$ penalization. This is in stark contrast to the success of leveraging sparsity in linear compressed sensing to achieve optimal sample complexity. Hence enforcing sparsity as a generic prior in phase retrieval may fundamentally limit sample complexity.

Our contribution. We show information-theoretically optimal sample complexity (with respect to the dimensionality of the latent code given to the generative network) for structured phase retrieval under generic measurements and a novel nonlinear formulation based on empirical risk and a generative prior. In this work, we suppose that the signal of interest is the output of a generative model. In particular, the generative model is a $d$-layer, fully connected, feedforward neural network with Rectified Linear Unit (ReLU) activation functions and no bias terms. Let $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ denote the weights in the $i$-th layer of our network for $i = 1, \ldots, d$, where $k = n_0 < n_1 < \cdots < n_d$. Given an input $x \in \mathbb{R}^k$, the output of the generative model $G : \mathbb{R}^k \to \mathbb{R}^{n_d}$ can be expressed as
$$G(x) := \mathrm{relu}(W_d \ldots \mathrm{relu}(W_2(\mathrm{relu}(W_1 x))) \ldots),$$
where $\mathrm{relu}(x) = \max(x, 0)$ acts entrywise. We further assume that the measurement matrix $A$ and each weight matrix $W_i$ have i.i.d. Gaussian entries. The Gaussian assumption on the weight matrices is supported by empirical evidence showing that neural networks learned from data have weights that obey statistics similar to Gaussians [1]. Furthermore, there has also been work establishing a relationship between deep networks and Gaussian processes [25]. Nevertheless, we will introduce deterministic conditions on the weights for which our results hold, allowing the use of other distributions.

To recover $x_0$, we study the following $\ell_2$ empirical risk minimization problem:
$$\min_{x \in \mathbb{R}^k} f(x) := \frac{1}{2} \big\| |AG(x)| - |AG(x_0)| \big\|^2. \tag{1}$$
Due to the non-convexity of the objective function, there is no a priori guarantee that gradient descent schemes can solve (1), as many local minima may exist. In spite of this, our main result shows that the objective function exhibits favorable geometry for gradient methods. Moreover, our result holds with information-theoretically optimal sample complexity:

Theorem 1 (Informal). If we have a sufficient number of measurements $m = \Omega(kd \log(n_1 \cdots n_d))$ and our network is sufficiently expansive at each layer, $n_i = \Omega(n_{i-1} \log n_{i-1})$, then with high probability there exists a descent direction $v_{x,x_0} \in \mathbb{R}^k$ for any non-zero $x \in \mathbb{R}^k$ outside of two small neighborhoods centered at the true solution $x_0$ and a negative multiple $-\rho_d x_0$. In addition, the origin is a local maximum of $f$. Here $\rho_d > 0$ depends on the number of layers $d$, and $\rho_d \to 1$ as $d \to \infty$.

Our main result asserts that the objective function does not have any spurious local minima away from neighborhoods of the true solution and a negative multiple of it. Hence if one were to solve (1) via gradient descent and the algorithm converged, the final iterate would be close to the true solution or a negative multiple thereof.
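To make the model and objective concrete, here is a minimal NumPy sketch of the ReLU generator $G$ and the empirical risk $f$ in (1). The dimensions, helper names, and random seed are illustrative and not taken from the paper; the weight and measurement scalings follow the $\mathcal{N}(0, 1/n_i)$ and $\mathcal{N}(0, 1/m)$ assumptions used in Theorem 2 below.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def generator(x, weights):
    """Forward pass G(x) = relu(W_d ... relu(W_1 x)) with no bias terms."""
    h = x
    for W in weights:
        h = relu(W @ h)
    return h

def dpr_objective(x, x0, A, weights):
    """Empirical risk f(x) = (1/2) || |A G(x)| - |A G(x0)| ||^2 from equation (1)."""
    residual = np.abs(A @ generator(x, weights)) - np.abs(A @ generator(x0, weights))
    return 0.5 * np.sum(residual ** 2)

# Illustrative instance: a d = 2 layer expansive network with Gaussian weights.
rng = np.random.default_rng(0)
k, n1, n2, m = 10, 500, 1000, 300
weights = [rng.normal(0, 1 / np.sqrt(n1), (n1, k)),
           rng.normal(0, 1 / np.sqrt(n2), (n2, n1))]
A = rng.normal(0, 1 / np.sqrt(m), (m, n2))
x0 = rng.normal(size=k)
print(dpr_objective(rng.normal(size=k), x0, A, weights))  # f at a random point
print(dpr_objective(x0, x0, A, weights))                  # f at the truth: 0.0
```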
The proof of this result is a concentration argument. We first prove the sufficiency of two deterministic conditions on the weights $W_i$ and the measurement matrix $A$. We then show that Gaussian $W_i$ and $A$ satisfy these conditions with high probability. Finally, using these two conditions, we argue that the specified descent direction $v_{x,x_0}$ concentrates around a vector $h_{x,x_0}$ that is continuous for non-zero $x \in \mathbb{R}^k$ and vanishes only when $x \approx x_0$ or $x \approx -\rho_d x_0$.

Rather than working against potentially fundamental limitations of polynomial-time algorithms, we examine more sophisticated priors given by generative models. Our results illustrate that these priors are, in reality, less limiting in terms of sample complexity, both by providing more compressibility and by being able to be more tightly enforced.

Prior methodologies for general phase retrieval. In the Gaussian measurement regime, most of the techniques to solve phase retrieval problems can be classified as convex or non-convex methods. Among convex techniques, lifting-based methods transform the signal recovery problem into a rank-one matrix recovery problem by lifting the signal into the space of positive semidefinite matrices. These semidefinite programming (SDP) approaches, such as PhaseLift [9], can provably recover any $n$-dimensional signal with $O(n \log n)$ measurements. A refinement of this analysis by [7] showed that PhaseLift in fact succeeds with $O(n)$ measurements. Other convex methods include PhaseCut [33], an SDP approach, and linear programming algorithms such as PhaseMax, which has been shown to achieve $O(n)$ sample complexity [17]. Non-convex methods encompass alternating minimization approaches, such as the original Gerchberg-Saxton [16] and Fienup [15] algorithms, and direct optimization algorithms such as Wirtinger Flow [8]. These latter methods directly tackle the least squares objective function
$$\big\| |Ay|^2 - |Ay_0|^2 \big\|^2. \tag{2}$$
In the seminal work [8], the authors show that, with an initialization via the spectral method, a gradient descent scheme (where the gradient is understood in the sense of Wirtinger calculus) can solve (2) with $O(n \log n)$ measurements. Expanding on this, a later study of the minimization of (2) in [31] showed that with $O(n \log^3 n)$ measurements, the energy landscape of the objective function exhibits globally benign geometry, which allows it to be solved efficiently by gradient descent schemes without special initialization. There also exist amplitude flow methods that solve the following non-smooth variation of (2):
$$\big\| |Ay| - |Ay_0| \big\|^2. \tag{3}$$
These methods have found success with $O(n)$ measurements [13] and have been shown to empirically perform better than intensity-based methods using the squared formulation in (2) [37].

Sparse phase retrieval. Many of the successful methodologies for general phase retrieval have been adapted to sparse phase retrieval problems. In terms of non-convex optimization, Wirtinger Flow type methods such as Thresholded Wirtinger Flow [6] create a sparse initializer via the spectral method and perform thresholded gradient descent updates to generate sparse iterates that solve (2). Another non-convex method, SPARTA [35], estimates the support of the signal for its initialization and performs hard-thresholded gradient updates to the amplitude-based objective (3). Both of these methods require $O(k^2 \log n)$ measurements for a generic $k$-sparse $n$-dimensional signal to succeed, which is more than the theoretical optimum $O(k \log n)$.
While lifting-based methods such as PhaseLift have been proven unable to beat the suboptimal sample complexity $O(k^2 \log n)$, there has been some progress towards breaking this barrier. In [19], the authors show that, given an initializer that sufficiently correlates with the true solution, a linear program can recover the sparse signal from $O(k \log \frac{n}{k})$ measurements. However, the best known initialization methods require at least $O(k^2 \log n)$ measurements [6]. Outside of the Gaussian measurement regime, there are results showing that if one were able to design the measurement matrices, then the optimal sample complexity could be reached [22]. For example, [2] showed that if the measurement vectors are chosen from an incoherent subspace, then recovery is possible with $O(k \log \frac{n}{k})$ measurements. However, these results are difficult to generalize to the experimental setting, as their design architectures are often unrealistic. Moreover, the Gaussian measurement regime more closely models the experimental Fourier diffraction measurements observed in, for example, X-ray crystallography. As Fourier models are the ultimate goal, either the sample complexity in the Gaussian measurement regime must be lowered or new modes of regularization must be explored in order for phase retrieval to advance.

Related work. There has been recent empirical evidence supporting a deep-learning-based approach to holographic imaging, a phase retrieval problem. The authors in [18] show that a neural network with ReLU activation functions can learn to perform holographic image reconstruction. In particular, they show that, compared to current approaches, this neural-network-based method requires fewer measurements and is computationally more efficient, needing only one hologram to reconstruct the necessary images. Furthermore, there have been a number of recent advances in leveraging generative priors over sparsity priors in compressed sensing. In [4], the authors considered the least squares objective
$$\big\| AG(x) - AG(x_0) \big\|^2. \tag{4}$$
They provided empirical evidence showing that 5-10x fewer measurements were needed for successful recovery compared to standard sparsity-based approaches such as Lasso. In terms of theory, they showed that if $A$ satisfies a restricted eigenvalue condition and one were able to solve (4), then the solution would be close to optimal. The authors in [20] analyze the same optimization problem as in [4] but exhibit global guarantees for the non-convex objective function. Under particular conditions on the expansivity of each neural network layer and randomness assumptions on the weights, they show that the energy landscape of the objective function does not have any spurious local minima. Furthermore, there is always a descent direction outside of two small neighborhoods of the global minimum and a negative scalar multiple of it. The success of leveraging generative priors in compressed sensing, along with the sample complexity bottlenecks in sparse phase retrieval, have motivated this work to consider enforcing a generative prior in phase retrieval to surpass sparse phase retrieval's current theoretical and practical limitations.

Notation. Let $(\cdot)^\top$ denote the real transpose. Let $[n] = \{1, \ldots, n\}$. Let $B(x, r)$ denote the Euclidean ball centered at $x$ with radius $r$. Let $\|\cdot\|$ denote the $\ell_2$ norm for vectors and the spectral norm for matrices. For any non-zero $x \in \mathbb{R}^n$, let $\hat{x} = x / \|x\|$. Let $\prod_{i=d}^{1} W_i = W_d W_{d-1} \cdots W_1$.
Let $I_n$ be the $n \times n$ identity matrix. Let $S^{k-1}$ denote the unit sphere in $\mathbb{R}^k$. We write $c = \Omega(\delta)$ when $c \geq C\delta$ for some positive constant $C$. Similarly, we write $c = O(\delta)$ when $c \leq C\delta$ for some positive constant $C$. When we say that a constant depends polynomially on $\epsilon^{-1}$, we mean that it is at least $C\epsilon^{-k}$ for some positive $C$ and positive integer $k$. For notational convenience, we write $a = b + O_1(\epsilon)$ if $\|a - b\| \leq \epsilon$, where $\|\cdot\|$ denotes $|\cdot|$ for scalars, the $\ell_2$ norm for vectors, and the spectral norm for matrices. Define $\mathrm{sgn} : \mathbb{R} \to \mathbb{R}$ by $\mathrm{sgn}(x) = x/|x|$ for non-zero $x \in \mathbb{R}$ and $\mathrm{sgn}(0) = 0$. For a vector $v \in \mathbb{R}^n$, $\mathrm{diag}(\mathrm{sgn}(v))$ has $\mathrm{sgn}(v_i)$ in the $i$-th diagonal entry, and $\mathrm{diag}(v > 0)$ has 1 in the $i$-th diagonal entry if $v_i > 0$ and 0 otherwise.

2 Algorithm

While our main result shows that the objective function exhibits favorable geometry for optimization, it does not by itself guarantee recovery of the signal, as gradient descent algorithms could, in principle, converge to the negative multiple of the true solution. Hence we propose a gradient descent scheme that recovers the desired solution by escaping this region. First, consider Figure 1, which illustrates the behavior of our objective function in expectation, i.e. when the number of measurements $m \to \infty$. We observe two important attributes of the objective function's landscape: (1) there exist two minima, the true solution $x_0$ and a negative multiple $-\beta x_0$ for some $\beta > 0$, and (2) if $z \approx x_0$ while $w \approx -\beta x_0$, then $f(z) < f(w)$, i.e. the objective function value is lower near the true solution than near its negative multiple. This is because the true solution is in fact the global optimum. Based on these attributes, we introduce a gradient descent scheme that converges to the global minimum.

First, we define some useful quantities. For any $x \in \mathbb{R}^k$ and matrix $W \in \mathbb{R}^{n \times k}$, define $W_{+,x} := \mathrm{diag}(Wx > 0)W$. That is, $W_{+,x}$ keeps the rows of $W$ that have a positive dot product with $x$ and zeroes out the rows that do not. We extend the definition of $W_{+,x}$ to each layer of weights $W_i$ in our neural network. For $W_1 \in \mathbb{R}^{n_1 \times k}$ and $x \in \mathbb{R}^k$, define $W_{1,+,x} := \mathrm{diag}(W_1 x > 0)W_1$. For each layer $i \in [d]$, define
$$W_{i,+,x} := \mathrm{diag}(W_i W_{i-1,+,x} \cdots W_{2,+,x} W_{1,+,x}\, x > 0)\, W_i.$$
That is, $W_{i,+,x}$ keeps the rows of $W_i$ that are active when the input to the generative model is $x$. Then, for any $x \in \mathbb{R}^k$, the output of our generative model can be written as $G(x) = \big(\prod_{i=d}^{1} W_{i,+,x}\big) x$. For any $z \in \mathbb{R}^n$, define $A_z := \mathrm{diag}(\mathrm{sgn}(Az))A$. Note that $|AG(x)| = A_{G(x)} G(x)$ for any $x \in \mathbb{R}^k$.

Since a gradient descent scheme could in principle be attracted to the negative multiple, we exploit the geometry of the objective function's landscape to escape this region. First, choose a random non-zero initial iterate $x_1 \neq 0$ for gradient descent. At each iteration $i = 1, 2, \ldots$, compute the descent direction
$$v_{x_i,x_0} := \Big(\prod_{j=d}^{1} W_{j,+,x_i}\Big)^{\!\top} A_{G(x_i)}^\top \big(|AG(x_i)| - |AG(x_0)|\big).$$
This is the gradient of our objective function $f$ wherever $f$ is differentiable. Once computed, we take a step in the direction $-v_{x_i,x_0}$. However, prior to taking this step, we compare the objective function values at $x_i$ and its negation $-x_i$. If $f(-x_i) < f(x_i)$, then we set $x_i$ to its negation, compute the descent direction, and update the iterate. The intuition for this algorithm relies on the landscape illustrated in Figure 1: since the true solution $x_0$ is the global minimum, the objective function value near $x_0$ is smaller than near $-\rho_d x_0$. Hence if the iterates begin to converge towards $-\rho_d x_0$, this algorithm will escape the region by choosing a point with lower objective function value, which will be in a neighborhood of $x_0$.

Figure 1: Surface (left) and contour plot (right) of the objective function with $m \to \infty$ and true solution $x_0 = [1, 0]^\top \in \mathbb{R}^2$.
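The procedure just described is stated formally as Algorithm 1 below. The following is a minimal, self-contained NumPy sketch of it; the step size, iteration count, and function names are illustrative assumptions (the paper does not prescribe them).

```python
import numpy as np

def dpr_gradient_method(A, weights, b, alpha=0.05, iters=2000, seed=0):
    """Sketch of the DPR gradient method (Algorithm 1 below).
    A: measurement matrix, weights: [W_1, ..., W_d], b = |A G(x0)|, alpha: step size."""
    rng = np.random.default_rng(seed)

    def G(z):                                    # ReLU forward pass
        for W in weights:
            z = np.maximum(W @ z, 0.0)
        return z

    def f(z):                                    # objective (1)
        return 0.5 * np.sum((np.abs(A @ G(z)) - b) ** 2)

    def descent_dir(z):                          # v_{z,x0} = (prod_j W_{j,+,z})^T A_{G(z)}^T (|AG(z)| - b)
        Wprod, h = np.eye(len(z)), z
        for W in weights:
            pre = W @ h
            Wprod = (W * (pre > 0)[:, None]) @ Wprod   # left-multiply by W_{j,+,z}
            h = np.maximum(pre, 0.0)
        A_Gz = np.sign(A @ h)[:, None] * A             # A_{G(z)} = diag(sgn(A G(z))) A
        return Wprod.T @ (A_Gz.T @ (np.abs(A @ h) - b))

    x = rng.normal(size=weights[0].shape[1])     # arbitrary non-zero initial point x_1
    for _ in range(iters):
        if f(-x) < f(x):                         # negation check: escape the negative-multiple basin
            x = -x
        x = x - alpha * descent_dir(x)
    return x
```

With the two-layer Gaussian instance from the earlier sketch, `dpr_gradient_method(A, weights, np.abs(A @ generator(x0, weights)))` should drive the iterate toward a neighborhood of $x_0$, modulo step-size tuning, which the theory does not prescribe.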
Algorithm 1 formally outlines this process.

Algorithm 1 Deep Phase Retrieval (DPR) gradient method
Require: Weights $W_i$, measurement matrix $A$, observations $|AG(x_0)|$, and step size $\alpha > 0$
1: Choose an arbitrary initial point $x_1 \in \mathbb{R}^k \setminus \{0\}$
2: for $i = 1, 2, \ldots$ do
3:   if $f(-x_i) < f(x_i)$ then
4:     $x_i \leftarrow -x_i$
5:   end if
6:   Compute $v_{x_i,x_0} = \big(\prod_{j=d}^{1} W_{j,+,x_i}\big)^\top A_{G(x_i)}^\top \big(|AG(x_i)| - |AG(x_0)|\big)$
7:   $x_{i+1} = x_i - \alpha\, v_{x_i,x_0}$
8: end for

Remark. We note that while the objective function is not differentiable everywhere, the descent direction is well-defined for all $x \in \mathbb{R}^k$ due to the definitions of $W_{i,+,x}$ and $A_{G(x)}$. When the objective function is differentiable, $v_{x,x_0}$ agrees with the true gradient. Otherwise, the descent direction only takes the components of the formula for which the inputs to each ReLU are nonnegative.

3 Main Theoretical Analysis

We now formally present our main result. While the objective function is not smooth, its one-sided directional derivatives exist everywhere due to the continuity and piecewise linearity of $G$. Let $D_v f(x)$ denote the unnormalized one-sided directional derivative of $f$ at $x$ in the direction $v$:
$$D_v f(x) = \lim_{t \to 0^+} \frac{f(x + tv) - f(x)}{t}.$$

Theorem 2. Fix $\epsilon > 0$ such that $K_1 d^8 \epsilon^{1/4} \leq 1$ and let $d \geq 2$. Suppose $G$ is such that $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ has i.i.d. $\mathcal{N}(0, 1/n_i)$ entries for $i = 1, \ldots, d$. Suppose that $A \in \mathbb{R}^{m \times n_d}$ has i.i.d. $\mathcal{N}(0, 1/m)$ entries independent from $\{W_i\}$. If $m \geq C dk \log(n_1 n_2 \cdots n_d)$ and $n_i \geq C n_{i-1} \log n_{i-1}$ for $i = 1, \ldots, d$, then with probability at least $1 - \sum_{i=1}^{d} \gamma n_i e^{-c n_{i-1}} - \gamma m^{4k+1} e^{-cm}$ the following holds: for all non-zero $x, x_0 \in \mathbb{R}^k$, there exists $v_{x,x_0} \in \mathbb{R}^k$ such that the one-sided directional derivatives of $f$ satisfy
$$D_{-v_{x,x_0}} f(x) < 0, \quad \forall x \notin B(x_0, K_2 d^3 \epsilon^{1/4} \|x_0\|) \cup B(-\rho_d x_0, K_2 d^{14} \epsilon^{1/4} \|x_0\|) \cup \{0\},$$
$$D_x f(0) < 0, \quad \forall x \neq 0,$$
where $\rho_d > 0$ converges to 1 as $d \to \infty$ and $K_1$ and $K_2$ are universal constants. Here $C$ depends polynomially on $\epsilon^{-1}$, $c$ depends on $\epsilon$, and $\gamma$ is a universal constant. See Section 3.1 for the definition of the descent direction $v_{x,x_0}$. We note that while we assume the weights to have i.i.d. Gaussian entries, we make no assumption of independence between layers.

The result will be shown by proving the sufficiency of two deterministic conditions on the weights $W_i$ of our generative network and the measurement matrix $A$.

Weight Distribution Condition. The first condition quantifies the Gaussianity and spatial arrangement of the neurons in each layer. We say that $W \in \mathbb{R}^{n \times k}$ satisfies the Weight Distribution Condition (WDC) with constant $\epsilon > 0$ if for any non-zero $x, y \in \mathbb{R}^k$:
$$\big\| W_{+,x}^\top W_{+,y} - Q_{x,y} \big\| \leq \epsilon, \quad \text{where} \quad Q_{x,y} := \frac{\pi - \theta_{x,y}}{2\pi}\, I_k + \frac{\sin \theta_{x,y}}{2\pi}\, M_{\hat{x} \leftrightarrow \hat{y}}.$$
Here $\theta_{x,y} = \angle(x, y)$ and $M_{\hat{x} \leftrightarrow \hat{y}}$ is the matrix that sends $\hat{x} \mapsto \hat{y}$, $\hat{y} \mapsto \hat{x}$, and $z \mapsto 0$ for any $z \in \mathrm{span}(\{x, y\})^\perp$. A formula for this matrix is as follows: consider a rotation matrix $R$ that sends $\hat{x} \mapsto e_1$ and $\hat{y} \mapsto \cos\theta_0 e_1 + \sin\theta_0 e_2$, where $\theta_0 = \angle(x, y)$. Then
$$M_{\hat{x} \leftrightarrow \hat{y}} = R^\top \begin{bmatrix} \cos\theta_0 & \sin\theta_0 & 0 \\ \sin\theta_0 & -\cos\theta_0 & 0 \\ 0 & 0 & 0_{k-2} \end{bmatrix} R,$$
where $0_{k-2}$ is the $(k-2) \times (k-2)$ matrix of zeros. Note that if $\theta_0 = 0$ or $\pi$, then $M_{\hat{x} \leftrightarrow \hat{y}} = \hat{x}\hat{x}^\top$ or $-\hat{x}\hat{x}^\top$, respectively. If $W_{ij} \sim \mathcal{N}(0, 1/n)$, then an elementary calculation gives $\mathbb{E}\big[W_{+,x}^\top W_{+,y}\big] = Q_{x,y}$. [20] proved that Gaussian $W$ satisfies the WDC with high probability (Lemma 1 in the Appendix).
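As a quick numerical sanity check (illustrative dimensions and seed, not from the paper), the expectation identity $\mathbb{E}[W_{+,x}^\top W_{+,y}] = Q_{x,y}$ behind the WDC can be verified by drawing a single Gaussian matrix with many rows, so that $W_{+,x}^\top W_{+,y}$ concentrates:

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 3, 200000                                  # many rows so one draw concentrates

x, y = rng.normal(size=k), rng.normal(size=k)
xh, yh = x / np.linalg.norm(x), y / np.linalg.norm(y)
theta = np.arccos(np.clip(xh @ yh, -1.0, 1.0))

# M_{x_hat <-> y_hat}, built from an orthonormal basis {x_hat, u} of span{x, y}.
u = yh - (yh @ xh) * xh
u /= np.linalg.norm(u)
B = np.column_stack([xh, u])
M = B @ np.array([[np.cos(theta),  np.sin(theta)],
                  [np.sin(theta), -np.cos(theta)]]) @ B.T

Q = (np.pi - theta) / (2 * np.pi) * np.eye(k) + np.sin(theta) / (2 * np.pi) * M

W = rng.normal(0, 1 / np.sqrt(n), (n, k))         # W_ij ~ N(0, 1/n)
Wpx = W * (W @ x > 0)[:, None]                    # W_{+,x} = diag(Wx > 0) W
Wpy = W * (W @ y > 0)[:, None]                    # W_{+,y}
print(np.abs(Wpx.T @ Wpy - Q).max())              # small (order 1e-3): matches Q_{x,y}
```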
Range Restricted Concentration Property. The second condition is similar in spirit: it quantifies whether the measurement matrix behaves like a Gaussian when acting on differences of pairs of vectors in the range of the generative model. We say that $A \in \mathbb{R}^{m \times n}$ satisfies the Range Restricted Concentration Property (RRCP) with constant $\epsilon > 0$ if for all non-zero $x, y \in \mathbb{R}^k$, the matrices $A_{G(x)}$ and $A_{G(y)}$ satisfy the following for all $x_1, x_2, x_3, x_4 \in \mathbb{R}^k$:
$$\big| \big\langle (A_{G(x)}^\top A_{G(y)} - \Phi_{G(x),G(y)})(G(x_1) - G(x_2)),\ G(x_3) - G(x_4) \big\rangle \big| \leq 31\epsilon\, \|G(x_1) - G(x_2)\| \,\|G(x_3) - G(x_4)\|,$$
where
$$\Phi_{z,w} := \frac{\pi - 2\theta_{z,w}}{\pi}\, I_n + \frac{2 \sin \theta_{z,w}}{\pi}\, M_{\hat{z} \leftrightarrow \hat{w}}.$$
If $A_{ij} \sim \mathcal{N}(0, 1/m)$, then for any $z, w \in \mathbb{R}^n$, a calculation similar to that for Gaussian $W$ gives $\mathbb{E}\big[A_z^\top A_w\big] = \Phi_{z,w}$. In our work, we establish that Gaussian $A$ satisfies the RRCP with high probability. Please see Section 6 in the Appendix for a complete proof. We emphasize that these two conditions are deterministic, meaning that other distributions could be considered. We now state our main deterministic result.

Theorem 3. Fix $\epsilon > 0$ such that $K_1 d^8 \epsilon^{1/4} \leq 1$ and let $d \geq 2$. Suppose that $G$ is such that $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ satisfies the WDC with constant $\epsilon$ for all $i = 1, \ldots, d$. Suppose $A \in \mathbb{R}^{m \times n_d}$ satisfies the RRCP with constant $\epsilon$. Then the same conclusion as in Theorem 2 holds.

3.1 Proof sketch for Theorem 2

Before we outline the proof of Theorem 2, we specify the descent direction $v_{x,x_0}$. For any $x \in \mathbb{R}^k$ where $f$ is differentiable, we have that
$$\nabla f(x) = \Big(\prod_{i=d}^{1} W_{i,+,x}\Big)^{\!\top} A_{G(x)}^\top A_{G(x)} \Big(\prod_{i=d}^{1} W_{i,+,x}\Big) x \;-\; \Big(\prod_{i=d}^{1} W_{i,+,x}\Big)^{\!\top} A_{G(x)}^\top A_{G(x_0)} \Big(\prod_{i=d}^{1} W_{i,+,x_0}\Big) x_0.$$
This is precisely the descent direction specified in Algorithm 1, expanded in our notation. When $f$ is not differentiable at $x$, choose a direction $w$ such that $f$ is differentiable at $x + \delta w$ for sufficiently small $\delta > 0$. Such a direction $w$ exists by the piecewise linearity of the generative model $G$. In fact, not only is the function piecewise linear, but each of the pieces is the intersection of a finite number of half spaces. Thus, with probability 1, any randomly chosen direction $w$ moves strictly into one piece, allowing for differentiability at $x + \delta w$ for sufficiently small $\delta$. Any such $w$ can be chosen arbitrarily. Hence we define our descent direction as
$$v_{x,x_0} := \begin{cases} \nabla f(x) & \text{if } f \text{ is differentiable at } x \in \mathbb{R}^k, \\ \lim_{\delta \to 0^+} \nabla f(x + \delta w) & \text{otherwise.} \end{cases}$$

The following is a sketch of the proof of Theorem 2:
- By the WDC and RRCP, the descent direction $v_{x,x_0}$ concentrates uniformly for all non-zero $x, x_0 \in \mathbb{R}^k$ around a particular vector $\bar{v}_{x,x_0}$ defined by equation (5) in the Appendix.
- The WDC establishes that $\bar{v}_{x,x_0}$ concentrates uniformly for all non-zero $x, x_0 \in \mathbb{R}^k$ around a continuous vector $h_{x,x_0}$ defined by equation (7) in the Appendix.
- A direct analysis shows that $h_{x,x_0}$ is only small in norm for $x \approx x_0$ and $x \approx -\rho_d x_0$. See Section 5.3 for a complete proof.
- Since $v_{x,x_0} \approx \bar{v}_{x,x_0} \approx h_{x,x_0}$, the direction $v_{x,x_0}$ is also only small in norm in neighborhoods around $x_0$ and $-\rho_d x_0$, establishing Theorem 3.
- Gaussian $W_i$ and $A$ satisfy the WDC and RRCP with high probability (Lemma 1 and Proposition 2 in the Appendix).
- Theorem 2 is a combination of Lemma 1, Proposition 2, and Theorem 3.

The full proofs of these results can be found in the Appendix.

Remark. In comparison to the results in [20], considerable technical advances were needed in our case, including establishing concentration of $A_{G(x)}$ over the range of $G$. The quantity $A_{G(x)}$ acts like a spatially dependent sensing matrix, requiring a condition similar to the Restricted Isometry Property that must hold simultaneously over a finite number of subspaces given by the range of $G$.
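To illustrate that the expanded gradient in Section 3.1 coincides with the descent direction computed in Algorithm 1, and the identities $G(x) = (\prod_{i=d}^{1} W_{i,+,x})x$ and $|AG(x)| = A_{G(x)}G(x)$ it relies on, here is a small numerical check for a two-layer network. Dimensions, seed, and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
k, n1, n2, m = 4, 20, 30, 25
W1 = rng.normal(0, 1 / np.sqrt(n1), (n1, k))
W2 = rng.normal(0, 1 / np.sqrt(n2), (n2, n1))
A = rng.normal(0, 1 / np.sqrt(m), (m, n2))
x, x0 = rng.normal(size=k), rng.normal(size=k)

relu = lambda z: np.maximum(z, 0.0)
G = lambda z: relu(W2 @ relu(W1 @ z))

def layer_product(z):
    """Return the product W_{2,+,z} W_{1,+,z} of the active-row matrices for input z."""
    W1p = np.diag((W1 @ z > 0).astype(float)) @ W1
    W2p = np.diag((W2 @ (W1p @ z) > 0).astype(float)) @ W2
    return W2p @ W1p

Wx, Wx0 = layer_product(x), layer_product(x0)
A_Gx = np.diag(np.sign(A @ G(x))) @ A              # A_{G(x)}
A_Gx0 = np.diag(np.sign(A @ G(x0))) @ A            # A_{G(x0)}

print(np.allclose(G(x), Wx @ x))                   # G(x) = (prod_i W_{i,+,x}) x
print(np.allclose(np.abs(A @ G(x)), A_Gx @ G(x)))  # |A G(x)| = A_{G(x)} G(x)

# Algorithm 1, line 6, and the expanded gradient of Section 3.1 agree exactly.
v_alg1 = Wx.T @ A_Gx.T @ (np.abs(A @ G(x)) - np.abs(A @ G(x0)))
v_grad = Wx.T @ A_Gx.T @ A_Gx @ Wx @ x - Wx.T @ A_Gx.T @ A_Gx0 @ Wx0 @ x0
print(np.allclose(v_alg1, v_grad))                 # True
```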
4 Experiments

In this section, we investigate enforcing generative priors in phase retrieval tasks. We compare our results with the sparse truncated amplitude flow algorithm (SPARTA) [35] and three popular general phase retrieval methods: Fienup [15], Gerchberg-Saxton [16], and Wirtinger Flow [8]. A MATLAB implementation of the SPARTA algorithm was made publicly available by the authors at https://gangwg.github.io/SPARTA/. We implemented the last three algorithms using the MATLAB phase retrieval library PhasePack [10]. While these methods are not intended for sparse recovery, we include them to serve as baselines.

4.1 Experiments for Gaussian signals

We first consider synthetic experiments using Gaussian measurements on Gaussian signals. In particular, we consider a two-layer network given by $G(x) = \mathrm{relu}(W_2\,\mathrm{relu}(W_1 x))$, where each $W_i$ has i.i.d. $\mathcal{N}(0, 1)$ entries for $i = 1, 2$. We set $k = 10$, $n_1 = 500$, and $n_2 = 1000$. We let the entries of $A \in \mathbb{R}^{m \times n_2}$ and $x_0 \in \mathbb{R}^k$ be i.i.d. $\mathcal{N}(0, 1)$. We ran Algorithm 1 for 25 random instances of $(A, W_1, W_2, x_0)$. A reconstruction $x^\star$ is considered successful if the relative error satisfies $\|G(x^\star) - G(x_0)\| / \|G(x_0)\| \leq 10^{-3}$.

We also compared our results with SPARTA. In this setting, we chose a $k = 10$-sparse $y_0 \in \mathbb{R}^{n_2}$, where the nonzero coefficients are i.i.d. $\mathcal{N}(0, 1)$. As before, we ran SPARTA on 25 random instances of $(A, y_0)$ and considered a reconstruction $y^\star$ successful if $\|y^\star - y_0\| / \|y_0\| \leq 10^{-3}$. We also experimented with sparsity levels $k = 3, 5$. Figure 2 displays the percentage of successful trials for different ratios $m/n$, where $n = n_2 = 1000$ and $m$ is the number of measurements.

Figure 2: Empirical success rate versus the ratio $m/n$, where DPR's latent code dimension is $k = 10$, SPARTA's sparsity level ranges over $k = 3, 5, 10$, and $n = 1000$. DPR achieves nearly the same empirical success rate in recovering a 10-dimensional latent code as SPARTA does in recovering a 3-sparse 1000-dimensional signal.

4.2 Experiments for MNIST and CelebA

We next consider image recovery tasks, where we use two different generative models for the MNIST and CelebA datasets. In each task, the goal is to recover an image $y_0 \in \mathbb{R}^n$ given $|Ay_0|$, where $A \in \mathbb{R}^{m \times n}$ has i.i.d. $\mathcal{N}(0, 1/m)$ entries. We found an estimate image $G(x^\star)$ in the range of our generator via gradient descent, using the Adam optimizer [23]. Empirically, we noticed that Algorithm 1 would typically negate the latent code (Lines 3-4) only at the initial iterate, if necessary. Hence we use a modified version of Algorithm 1 in these image experiments: we ran two sessions of gradient descent, one from a random initial iterate $x_1$ and one from its negation $-x_1$, and chose the more successful reconstruction (a code sketch of this procedure is given below).

In the first image experiment, we used a pretrained Variational Autoencoder (VAE) from [4] that was trained on the MNIST dataset [24]. This dataset consists of 60,000 images of handwritten digits. Each image is of size $28 \times 28$, resulting in vectorized images of size 784. As described in [4], the recognition network is of size 784-500-500-20 while the generator network is of size 20-500-500-784. The latent code space dimension is $k = 20$.

Figure 3: Top left: Example reconstructions with 200 measurements. Top right: Example reconstructions with 500 measurements. Bottom: A comparison of DPR's reconstruction error versus each algorithm for different numbers of measurements.

For SPARTA, we performed sparse recovery by transforming the images using the 2-D Discrete Cosine Transform (DCT).
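The following self-contained NumPy sketch shows the modified procedure described above: two gradient-based sessions from $x_1$ and $-x_1$ with Adam updates on the latent code, keeping the better run. The learning rate, iteration count, the choice to compare runs by objective value, and the plain ReLU generator (standing in for the pretrained VAE/DCGAN generators used in the paper) are all assumptions for illustration.

```python
import numpy as np

def recover_image_latent(A, weights, b, lr=0.01, iters=3000, seed=0):
    """Two Adam sessions from +/- x_1 on the DPR objective; returns the better latent code.
    The estimated image is then G(x_star)."""
    rng = np.random.default_rng(seed)
    k = weights[0].shape[1]

    def G(z):                                       # ReLU generator forward pass
        for W in weights:
            z = np.maximum(W @ z, 0.0)
        return z

    def f(z):                                       # objective (1) with b = |A G(x0)|
        return 0.5 * np.sum((np.abs(A @ G(z)) - b) ** 2)

    def grad(z):                                    # descent direction v_{z,x0}
        Wprod, h = np.eye(k), z
        for W in weights:
            pre = W @ h
            Wprod = (W * (pre > 0)[:, None]) @ Wprod
            h = np.maximum(pre, 0.0)
        A_Gz = np.sign(A @ h)[:, None] * A
        return Wprod.T @ (A_Gz.T @ (np.abs(A @ h) - b))

    def adam_run(x):                                # standard Adam update on the latent code
        m1, m2, b1, b2, eps = np.zeros(k), np.zeros(k), 0.9, 0.999, 1e-8
        for t in range(1, iters + 1):
            g = grad(x)
            m1 = b1 * m1 + (1 - b1) * g
            m2 = b2 * m2 + (1 - b2) * g ** 2
            x = x - lr * (m1 / (1 - b1 ** t)) / (np.sqrt(m2 / (1 - b2 ** t)) + eps)
        return x

    x1 = rng.normal(size=k)
    candidates = [adam_run(x1), adam_run(-x1)]      # sessions from x_1 and -x_1
    return min(candidates, key=f)                   # keep the better reconstruction
```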
We allowed 10 random restarts for each algorithm, including the sparse and general phase retrieval methods. The results in Figure 3 demonstrate the success of our algorithm with very few measurements. With 200 measurements, we can achieve reasonable recovery. SPARTA can achieve good recovery with 500 measurements while the other algorithms cannot. In addition, our algorithm exhibits recovery with 500 measurements, whereas the alternatives begin to succeed only at 1000 to 1500 measurements. The performance of the general phase retrieval methods is to be expected, as they are known to succeed only when $m = \Omega(n)$, where $n = 784$. We note that while our algorithm succeeds with fewer measurements than the other methods, our performance, as measured by per-pixel reconstruction error, saturates as the number of measurements increases, since our reconstruction accuracy is ultimately bounded by the generative model's representational error. As generative models improve, their representational errors will decrease. Nonetheless, as can be seen in the reconstructed digits, the recoveries are semantically correct (the correct digit is legibly recovered) even though the reconstruction error does not decay to zero. In applications such as MRI and molecular structure estimation via X-ray crystallography, semantic error measures would be more informative estimates of recovery performance than per-pixel error measures.

In the second experiment, we used a pretrained Deep Convolutional Generative Adversarial Network (DCGAN) from [4] that was trained on the CelebA dataset [27]. This dataset consists of 200,000 facial images of celebrities. The RGB images were cropped to size $64 \times 64$, resulting in vectorized images of dimension $64 \times 64 \times 3 = 12288$. The latent code space dimension is $k = 100$. We allowed 2 random restarts. We ran numerical experiments with the other methods and they did not succeed at measurement levels below 5000. The general phase retrieval methods began reconstructing the images only when $m = \Omega(n)$, where $n = 12288$. Figure 4 showcases our results on reconstructing 10 images from the DCGAN's test set with 500 measurements.

Figure 4: 10 reconstructed images from CelebA's test set using DPR with DCGAN and 500 measurements.

Acknowledgments

OL acknowledges support from the NSF Graduate Research Fellowship under Grant No. DGE-1450681. PH acknowledges funding by NSF grant DMS-1464525.

References

[1] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. Why are deep nets reversible: A simple theory with implications for training. CoRR, abs/1511.05653, 2015.
[2] Sohail Bahmani and Justin Romberg. Efficient compressive phase retrieval with constrained sensing vectors. Advances in Neural Information Processing Systems (NIPS 2015), pages 523-531, 2015.
[3] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253-263, 2008.
[4] Ashish Bora, Alexandros G. Dimakis, Ajil Jalal, and Eric Price. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.
[5] Oliver Bunk, Ana Diaz, Franz Pfeiffer, Christian David, Bernd Schmitt, Dillip K. Satapathy, and J. Friso van der Veen. Diffractive imaging for periodic samples: Retrieving one-dimensional concentration profiles across microfluidic channels. Acta Crystallographica Section A: Foundations of Crystallography, 63(4):306-314, 2007.
[6] Tony Cai, Xiaodong Li, and Zongming Ma.
Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. The Annals of Statistics, 44(5):2221-2251, 2016.
[7] Emmanuel J. Candès and Xiaodong Li. Solving quadratic equations via PhaseLift when there are about as many equations as unknowns. Foundations of Computational Mathematics, 14(5):1017-1026, 2014.
[8] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985-2007, 2015.
[9] Emmanuel J. Candès, Thomas Strohmer, and Vladislav Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241-1274, 2013.
[10] Rohan Chandra, Ziyuan Zhong, Justin Hontz, Val McCulloch, Christoph Studer, and Tom Goldstein. PhasePack: A phase retrieval library. Asilomar Conference on Signals, Systems, and Computers, 2017.
[11] Mark A. Davenport. Proof of the RIP for sub-Gaussian matrices. OpenStax CNX, http://cnx.org/contents/f37687c1-d62b-4ede-8064-794a7e7da7da@5, 2013.
[12] S. W. Drury. Honours analysis lecture notes, McGill University. http://www.math.mcgill.ca/drury/notes354.pdf, 2001.
[13] Yonina C. Eldar, Georgios B. Giannakis, and Gang Wang. Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory, 23(26):773-794, 2017.
[14] C. Fienup and J. Dainty. Phase retrieval and image reconstruction for astronomy. Image Recovery: Theory and Application, pages 231-275, 1987.
[15] J. R. Fienup. Phase retrieval algorithms: A comparison. Applied Optics, 21:2758-2768, 1982.
[16] R. W. Gerchberg and W. O. Saxton. A practical algorithm for the determination of phase from image and diffraction plane pictures. Optik, 35:237-246, 1972.
[17] Tom Goldstein and Christoph Studer. PhaseMax: Convex phase retrieval via basis pursuit. arXiv preprint arXiv:1610.07531, 2016.
[18] Harun Günaydin, Da Teng, Yair Rivenson, Aydogan Ozcan, and Yibo Zhang. Phase recovery and holographic image reconstruction using deep learning in neural networks. arXiv preprint arXiv:1705.04286, 2017.
[19] Paul Hand and Vladislav Voroninski. Compressed sensing from phaseless Gaussian measurements via linear programming in the natural parameter space. arXiv preprint arXiv:1611.05985, 2016.
[20] Paul Hand and Vladislav Voroninski. Global guarantees for enforcing generative priors by empirical risk. arXiv preprint arXiv:1705.07576, 2017.
[21] Robert W. Harrison. Phase problem in crystallography. J. Opt. Soc. Am. A, 10(5):1046-1055.
[22] Kishore Jaganathan, Samet Oymak, and Babak Hassibi. Sparse phase retrieval: Convex algorithms and limitations. 2013 IEEE International Symposium on Information Theory (ISIT), pages 1022-1026, 2013.
[23] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[25] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. International Conference on Learning Representations (ICLR 2018), 2018.
[26] Xiaodong Li and Vladislav Voroninski. Sparse signal recovery from quadratic measurements via convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019-3033, 2013.
[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. Proceedings of the IEEE International Conference on Computer Vision, pages 3730-3738, 2015.
[28] Jianwei Miao, Tetsuya Ishikawa, Qun Shen, and Thomas Earnest. Extending X-ray crystallography to allow the imaging of noncrystalline materials, cells, and single protein complexes. Annu. Rev. Phys. Chem., 59:387-410, 2008.
[29] R. P. Millane. Phase retrieval in crystallography and optics. J. Opt. Soc. Am. A, 7(3):394-411.
[30] Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275-1297, 2013.
[31] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. 2016 IEEE International Symposium on Information Theory (ISIT), pages 2379-2383, 2016.
[32] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing: Theory and Applications, Cambridge University Press, 2012.
[33] Irène Waldspurger, Alexandre d'Aspremont, and Stéphane Mallat. Phase recovery, MaxCut and complex semidefinite programming. Mathematical Programming, 149(1-2):47-81, 2015.
[34] Adriaan Walther. The question of phase retrieval in optics. Journal of Modern Optics, 10(1):41.
[35] Gang Wang, Liang Zhang, Georgios B. Giannakis, Mehmet Akçakaya, and Jie Chen. Sparse phase retrieval via truncated amplitude flow. IEEE Transactions on Signal Processing, 66:479-491, 2018.
[36] James G. Wendel. A problem in geometric probability. Math. Scand., 11:109-111, 1962.
[37] Li-Hao Yeh, Jonathan Dong, Jingshan Zhong, Lei Tian, Michael Chen, Gongguo Tang, Mahdi Soltanolkotabi, and Laura Waller. Experimental robustness of Fourier ptychography phase retrieval algorithms. Optics Express, PP(99):33214-33240, 2015.