# convolutional_phase_retrieval__fb0d3a71.pdf

Convolutional Phase Retrieval

Qing Qu Columbia University qq2105@columbia.edu

Yuqian Zhang Columbia University yz2409@columbia.edu

Yonina C. Eldar Technion yonina@ee.technion.ac.il

John Wright Columbia University jw2966@columbia.edu

We study the convolutional phase retrieval problem, which considers recovery of an unknown signal x Cn from m measurements consisting of the magnitude of its cyclic convolution with a known kernel a of length m. This model is motivated by applications to channel estimation, optics, and underwater acoustic communication, where the signal of interest is acted on by a given channel/ﬁlter, and phase information is diﬃcult or impossible to acquire. We show that when a is random and m is suﬃciently large, x can be eﬃciently recovered up to a global phase using a combination of spectral initialization and generalized gradient descent. The main challenge is coping with dependencies in the measurement operator; we overcome this challenge by using ideas from decoupling theory, suprema of chaos processes and the restricted isometry property of random circulant matrices, and recent analysis for alternating minimizing methods.

1 Introduction

We study the problem of recovering a unknown signal x Cn from measurements y = |a x|, which consist of the magnitude of the convolution of x and a given ﬁlter a Cm,

ﬁnd z, s.t. y = |a z| , (1)

where denotes cyclic convolution. Let Ca Cm m be a circulant matrix generated by a, and let A Cm n be a matrix formed by the ﬁrst n columns of Ca. Then the convolutional phase retrieval problem can be rewritten in the common matrix-vector form

ﬁnd z, s.t. y = |Az| . (2)

This problem is motivated by applications like channel estimation [37, 1], (non)coherent optical communication [14, 24], and underwater acoustic communication [31]. For example, in millimeter-wave (mm-wave) wireless communications for 5G networks [27], one important problem is to reconstruct signal angle of arrival (Ao A) from measurements, which are taken by the convolution of signal Ao A and the antenna pattern.

Because of technical diﬃculties that the phase measurements are either very noisy and unreliable, or expensive to acquire, it is preferred to only take measurements of signal magnitude and the phase information is lost.

Most known results on the exact solution of phase retrieval problems [8, 29, 10, 38, 36, 35] pertain to generic random matrices, where the entries of A are independent subgaussian random variables. However, in practice it is almost impossible to design purely random measurement matrices: in many cases as we mentioned above, the measurement is much more

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

structured generated by passing a signal through a manually designed ﬁlter. Moreover, the structured measurements often admit more eﬃcient numerical methods: by using the fast Fourier transform for matrix-vector products, the benign structure of the convolutional model (1) allows us to design methods with O(m) memory and O(m log m) computation cost per iteration. While for generic measurements, the cost is around O(mn).

In this work, we study the convolutional phase retrieval problem (1) under the assumption that the kernel a = [a1, , am] is random, with each entry i.i.d. complex Gaussian,

a = u + iv, u, v iid N ( 0, 1

Compared to the generic random measurement, as we can see, the random convolution model we study here is far more structured: it is parameterized by only O(m) independent complex normal random variables, whereas the generic model involves O(mn) ones. Since the rows and columns of A are probabilistically dependent, standard techniques (based on concentration of functions of independent random vectors) do not apply.

We propose and analyze a local1 gradient descent type method, minimizing a weighted, nonconvex and nonsmooth objective

min z Cn f(z) = 1 2m

b1/2 (y |Az|) 2 , (4)

where denotes the Hadamard product and b Rm ++ is a weighting vector. Our result can be informally summarized as follows.

Theorem 1.1 (Informal) When m Ω(n poly log n), with high probability, spectral initialization [25, 5] produces an initialization z(0) that is O(1/ poly log n) close to the optimum. Moreover,

when m Ω ( Cx 2

x 2 n poly log n ) , with high probability, a certain gradient descent method based

on (4) converges linearly from this initialization to the set X = { xeiϕ | ϕ [0, 2π) } of points that diﬀer from the true signal x only by a global phase.

Here, Cx Cm m denotes the circulant matrix corresponding to cyclic convolution with a length m zero padding of x, and poly log n denotes a polynomial in log n. A dependence of the sample complexity m on Cx seems inevitable2 and is corroborated by experiments.

Our proof is based on ideas from decoupling theory [11], the suprema of chaos processes and restricted isometry property of random circulant matrices [26, 20], and a new iterative analysis of alternating minimizing methods [35]. Our analysis draws connections between the convergence properties of gradient descent and the classical alternating direction method. This allows us to avoid the need to argue that high-degree polynomials in the structured random matrix A concentrate uniformly, as would be required by a straightforward translation of existing analysis to this new setting. Instead, we control the bulk eﬀect of phase errors uniformly in a neighborhood around the ground truth. This requires us to develop new decoupling and concentration tools for controlling nonlinear phase functions of circulant random matrices, which could be potentially useful for analyzing other random circulant convolution problems such as blind deconvolution [40], and convolutional dictionary learning [18].

Prior art for phase retrieval. The challenge of developing eﬃcient, guaranteed methods for phase retrieval has attracted substantial interest over the past decade [28, 19]. For the generalized phase retrieval problem in which the sensing matrix A is i.i.d. random, the ﬁrst result on global recovery is based on semideﬁnite programming (SDP) [8, 3, 36]. However, the computational cost of SDP limits its practicality. Nonconvex methods can be more efﬁcient. [25] showed that the alternating minimization method provably converges to the

1It would be nicer to characterize the global geometry of the problem as in [15, 33, 34, 32]. However, the nonhomogeneity of Cx over the space causes tremendous diﬃculties for concentration with m Ω(n poly log n) samples. 2The operator norm of Cx is nonhomogeneous over x CSn 1, ranging from constant to O( n). For instance, Cx = 1 when x is a standard basis vector; and Cx = n when x = 1 n1.

truth, when initialized using a spectral method and provided with fresh samples at each iteration. Candes et al. [5] showed with the same initalization, gradient descent for the nonconvex least squares objective,

min z Cn f1(z) = 1 2m

y2 |Az|2 2 , (5)

provably recovers the ground truth, with near-optimal sample complexity m Ω(n log n). The work [10, 39, 38] further reduce the sample complexity to m Ω(n) by using diﬀerent nonconvex objectives and truncation techniques. Moreover, [34] reveals that the nonconvex objective (5) has a benign global geometry: with high probability, it has no bad critical points with m Ω(n log3 n) samples3.

Structured random measurements. The study of structured random measurement in signal processing [21] includes the study of random Fourier measurements [7, 9, 12] and partial random convolutions [26, 20] in compressed sensing [6]. However, the study of structured random measurement for phase retrieval is still quite limited. In particular, [17] and [4] studied the performance of SDP methods with t-designs and random masked Fourier transform measurements. The authors in [5, 2] show that the phase retrieval problem with random coded diﬀraction and STFT measurements can be solved by minimizing nonconvex objectives, while [5] requires resampling for the initialization, and in [2] the contraction radius is not large enough for initialization. In addition, the motivation of these measurement schemes are quite diﬀerent from ours. For more detailed review of this subject, we refer the readers to Section 4 of [21].

Notations. We use ( ) and ( ) to denote the real and Hermitian transpose, respectively. We use CSn 1 to denote a n dimensional complex sphere. Let ℜ( ) and ℑ( ) denote the real and imaginary parts of a complex variable, respectively. Throughout the paper, we assume the optimal solution is x Cn. Because the solution is only optimal to a global phase shift, we deﬁne the optimal solution set as X = { xeiθ | θ [0, 2π) } , and deﬁne the distance from a point z Cn to the set X as

dist(z, X) .= inf θ [0,2π)

For any z C with |z| = 0, we use ϕ(z) to denote the phase of z, that is, eiϕ(z) = z/ |z|.

2 Algorithm

We develop an approach to convolutional phase retrieval based on local nonconvex optimization. Our proposed algorithm has two components: (1) a careful initialization using the spectral method; (2) local reﬁnement by (generalized) gradient descent. We introduce the two steps in reverse order.

2.1 Minimization of a nonconvex and nonsmooth objective

We consider minimizing a weighted nonconvex and nonsmooth objective

f(z) = 1 2m

b1/2 (y |Az|) 2 . (6)

The introduction of the positive weights b facilitates our analysis, by enabling us to compare certain functions of the dependent random matrix A to functions involving more independent random variables. We will substantiate this claim in the next section.

Although the function (4) is not complex-diﬀerentiable, if one identiﬁes Cn with R2n and treats f(z) as a function in the real domain, f is still diﬀerentiable in the real sense. Thus, we adopt the Wirtinger calculus [22], which can be thought of as a clean way of organizing the real partial derivatives [29, 34].

3[30] tightened the sample complexity to m Ω(n log n) by using advanced probability tools.

On the other hand, it should also be noted that the absolute value | | is nonsmooth at 0 and hence f(z) is not diﬀerentiable everywhere even in the real sense. Similar to [38], for any complex number u C, if we uniquely deﬁne its phase ϕ(u) at 0 by

exp (iϕ(u)) .= {u/ |u| if |u| = 0, 1 otherwise,

then the Wirtinger gradient of (4) can be uniquely determined as

z f(z) = 1 m A diag (b) [Az y exp (iϕ(Az))] . (7)

Starting from some initialization z(0), we minimize the objective (6) by gradient descent

z(r+1) = z(r) τ

z f(z(r)), (8)

where τ > 0 is the stepsize. Indeed, zf(z) can be interpreted as the gradient of f(z) as in the real case; this method is also referred to as amplitude ﬂow [38].

2.2 Initialization via spectral method

Similar to [25, 29], we compute the initialization z(0) via a spectral method, detailed in [29, Algorithm 1]. More speciﬁcally, z(0) is a scaled leading eigenvector of

k=1 y2 kaka k = 1 m A diag ( y2) A, (9)

which is constructed from the knowledge of the sensing vectors and observations. The leading eigenvector of Y can be eﬃciently computed via the power method. Note that E [Y ] = x 2 I + xx , so the leading eigenvector of E [Y ] is proportional to the optimal solution x. Under the random convolutional model of A, by using probability tools from [21], we show that v Y v concentrates to its expectation v E [Y ] v for all v CSn 1 whenever m Ω(n poly log n), ensuring the initialization z(0) close to the optimal set X.

3 Main Result and Analysis

In this section, we describe our main theoretical result, which shows that with high probability, the algorithm described in the previous section succeeds.

Theorem 3.1 (Main Result) Whenever m C0n log31 n, the spectral method [29, Algorithm 1] produces an initialization z(0) that satisﬁes

dist ( z(0), X ) c0 log 6 n x

with probability at least 1 c1m c2. Suppose b = ζσ2(y), where

ζσ2(t) = 1 2πσ2ξσ2(t), ξσ2(t) = 1 2πσ2 exp ( t2

) , t > 0, (10)

with σ2 > 1/2. Starting from z(0), with σ2 = 0.51 and stepsize τ = 2.02, whenever m

x 2 max { log17 n, n log4 n } , with probability at least 1 c3m c4 for all iterate z(r)(r 1) deﬁned in (8), we have

dist ( z(r), X ) (1 ϱ)r dist ( z(0), X ) , (11)

holds for a small scalar ϱ (0, 1). Here, c0, c1, c2, c3, c4, C0, C1 > 0 are numerical constants.

Remark: Our result shows that by initializing the problem O(1/polylog(n))-close to the optimum via spectral method, the gradient descent (8) converges linearly to the optimal solution. As we can see, the sample complexity here also depends on Cx , which is quite different from the i.i.d. case. For a typical x CSn 1 (e.g., x is drawn uniformly random from CSn 1), Cx remains as O(log n), the sample complexity m Ω(n poly log n) matches the i.i.d. case up to log factors. However, Cx is nonhomogeneous over x CSn 1: if x is sparse in the Fourier domain (e.g., x = 1 n1), the sample complexity can be as large as m Ω ( n2 poly log n ) . Such a behavior is also demonstrated in the experiments of Section 4. We believe the (very large!) number of logarithms in our result is an artifact of our analysis, rather than a limitation of the method. We expect to reduce the sample complexity

to m Ω ( Cx 2

x 2 n log6 n ) by a tighter analysis, which is left for future work. The choices

of the weighting b Rm in (10), σ2 = 0.51, and the stepsize τ = 2.02 are purely for the purpose of analysis. In practice, the algorithm converges with b = 1 and a choice of small stepsize τ, or by using backtracking linesearch for the stepsize τ.

In the following, we brieﬂy highlight some major challenges and novel prooﬁng ideas behind the analysis. The details can be found in our full paper.

3.1 Proof sketch of iterative contraction

Our analysis is largely inspired by the recent analysis of alternating direction method (ADM) [35]. In this following, we draw connections between the gradient descent method (8) and ADM, and sketch basic ideas of convergence analysis.

ADM iteration. ADM is a classical method for solving phase retrieval problems [16, 25, 35], which can be considered as a heuristic method that solves the problem

min z Cn,|u|=1

1 2 Az y u 2 .

At every iterate bz(r), ADM proceeds in two steps:

c(r+1) = y exp ( Abz(r)) , bz(r+1) = arg min z 1 2

Az c(r+1) 2 ,

which leads to the following update

bz(r+1) = A ( y exp ( Abz(r))) ,

where A = (A A) 1 A is the pseudo-inverse of A. Let bθr = arg minθ bz(r) xeiθ . The

distance between bz(r+1) and X is bounded by

dist ( bz(r+1), X ) = bz(r+1) xeibθr+1 A Axeibθr ( y exp ( Abz(r))) . (12)

Gradient descent with b = 1. For simplicity, let us consider the gradient descent update

(8) with b = 1. Let θr = arg minθ z(r) xeiθ , with stepsize τ = 1. The distance between

the iterate z(r) and the optimal set X is bounded by

dist ( z(r+1), X ) = z(r+1) xeiθr+1 I 1

m A Axeiθr y exp ( iϕ(Az(r)) ) . (13)

Towards iterative contraction. By measure concentration, it can be shown that I 1

m A A = o(1), A m, A 1/ m, (14)

holds with high probability whenever m Ω(n poly log n). Therefore, to show iterative contraction of both methods, based on (12) and (13), it is suﬃcient to show that Axeiθ y exp (iϕ(Az)) (1 η) m z xeiθ , (15)

for some constant η (0, 1), where θ = arg minθ [0,2π) z xeiθ such that eiθ = x z/ |x z|.

By similar ideas of controlling (15) for the ADM method [35], this observation provides a new way of analyzing the gradient descent method. As an attempt to show (15) for the random circulant matrix A, we invoke the following lemma, which controls the error in a ﬁrst order approximation to exp(iϕ( )).

Lemma 3.2 (Lemma 3.2, [35]) For any ρ > 0, and for any z, z C, we have

|exp (iϕ(z + z)) exp (iϕ(z ))| 21|z| ρ|z | + (1 ρ) 1 |ℑ(z/z )| .

Let us decompose z = αx + βw, where w CSn 1 with w x, and α, β C. Note that ϕ(α) = θ. Then by Lemma 3.2, for any ρ (0, 1), we have

Axeiθ y exp (iϕ(Az)) = |Ax| [ exp (iϕ (Ax)) exp ( iϕ ( Ax + β

α||Aw| ρ|Ax| | {z } T1

ℑ((Aw) exp ( iϕ(Ax))) | {z } T2

The ﬁrst term T1 can be bounded using the restricted isometry property of random circulant matrices [20], together with some auxiliary analysis.

The second term T2 involves a nonlinear function exp ( iϕ(Ax)) of the random circulant matrix A. Controlling this nonlinear, highly dependent random process T2(w) for all w is a nontrivial task. Next, we explain why controlling T2 is technically challenging, and sketch the key ideas about how to control a smoothed variant of T2, by using the weighting b = ζσ2(y) introduced in (10). We also provide intuition for why the weighting b is helpful.

3.2 Controlling the phase term T2

As elaborated above, the major challenge of showing iterative contraction is bounding the suprema of the nonlinear, dependent random process T2(w) over the set S = { w CSn 1 | w x } . By using the fact that ℑ(u) = 1

2i (u u) for any u C, we have

sup w S T 2 2 1 2 A 2 + 1

w A diag (ψ(Ax)) Aw | {z } L(a,w)

where ψ(t) .= exp ( 2iϕ(t)). As from (14), A m, the major task left is to show that

sup w S |L(a, w)| < (1 η )m (16)

for some constant η (0, 1).

Why decoupling? Let A =

L(a, w) = w A diag (ψ(Ax)) Aw =

k=1 ψ(a kx)w aka k w | {z } dependence across k

is a summation of dependent random variables, for which our probability tools are very limited. To overcome this problem, we deploy ideas from decoupling [11]. Informally, decoupling allows us to compare moments of the original random function to functions of more independent random variables, which are usually easier to analyze. The book [11] provides

a beautiful introduction to this area. In our problem, notice that the random vector a occurs twice in the deﬁnition of L(a, w) one in the phase term ψ(Ax) = exp( 2iϕ(Ax)), and another in the quadratic term. The general spirit of decoupling is to seek to replace one a with an independent copy a of the same random vector, yielding a random process with fewer dependencies. Here, we seek to replace L(a, w) with

QL dec(a, a , w) = w A diag (ψ(A x)) Aw. (17)

The usefulness of this new, decoupled form QL dec(a, a , w), is that it introduces extra randomness QL dec(a, a , w) is now a chaos process of a conditioned on a . This makes analyzing supw S QL dec(a, a , w) amenable to existing analysis of suprema of chaos processes for random circulant matrices [21]. However, achieving the decoupling requires additional work; the most general existing results on decoupling pertain to tetrahedral polynomials, which are polynomials with no monomials involving any power larger than one of any random variable. By appropriately tracking cross terms, these results can also be applied to more general (non-tetrahedral) polynomials in Gaussian random variables [23]. However, our random process L(a, w) involves a nonlinear phase term ψ(Aw) which is not a polynomial, and hence is not amenable to a direct appeal to existing results.

Decoupling is recoupling . Existing results [23] for decoupling polynomials of Gaussian random variables are derived from two simple facts: (i) orthogonal projections of Gaussian variables are independent, and (ii) Jensen s inequality. Indeed, for a CN(0, I), let us introduce an independent vector δ CN(0, I). Write

g1 = a + δ, g2 = a δ.

Because of Fact (i), these are independent CN(0, 2I) vectors. By conditional expectation,

Eδ [ QL dec(g1, g2, w) ] = Eδ [ QL dec(a + δ, a δ, w) ] .= b L(a, w). (18)

Thus, we can see that the key idea of decoupling L(a, w) into QL dec(a, a , w), is essentially recoupling QL dec(g1, g2, w) via conditional expectation the recoupled term b L can be viewed as an approximation of L(a, w). Notice that by Fact (ii), for any convex function φ,

[ sup w S φ ( b L(a, w) )] = Ea

[ sup w S φ ( Eδ [ QL dec(a + δ, a δ, w) ])]

[ sup w S φ ( QL dec(a + δ, a δ, w) )] = Eg1,g2 [ sup w S φ ( QL dec(g1, g2, w) )] .

Thus, by choosing φ(t) = |t|p, we can control moments of supw S b L(a, w) via sup w S

QL dec(g1, g2, w) Lp . (19)

For tetrahedral polynomials, b L = L, so the approximation is exact. As the tail bound of

supw S b L(a, w) can be controlled via its moments bounds [13, Chapter 7.2], this allows

us to directly control the object L(a, w) of interest. The reason that this control obtains is because the conditional expectation operator Eδ [ | a] recouples QL dec(a, a , w) back to the target L(a, w). In slogan form, (Gaussian) decoupling is recoupling.

Recoupling is Gaussian smoothing. A distinctive feature in convolutional phase retrieval is that L is not a polynomial. Hence, it may be challenging to posit a QL dec which recouples back to L. In other words, in the existing form, we need to tolerate an approximation error as b L = L. By the triangle inequality,

sup w S |L(a, w)| sup w S

b L(a, w) + sup w S

b L(a, w) L(a, w) . (20)

As discussed above, the supw S b L(a, w) can be sharply controlled via its moments bound

in (19). Now the bound (20) is useful to derive tight control for L(a, w), if L(a, w) is very

close to b L(a, w) uniformly. The question is: for what L is it possible to ﬁnd a well-behaved QL dec for which the approximation error is small? To understand this question, recall that the mechanism that links Qdec back to b L is the conditional expectation operator Eδ [ | a]. For our case, from (18) orthogonality leads to

b L(a, w) = w A diag (h(Ax)) Aw, h(t) .= Es CN (0, x 2) [ψ(t + s)] . (21)

Thus, by combining the results in (20) and (21), we have

sup w S |L(a, w)| sup w S

b L(a, w) + h ψ L | {z } approximation error

Note that the function h is not exactly ψ, but generated by convolving ψ with a multivariate Gaussian pdf : indeed, recoupling is Gaussian smoothing. The Fourier transform of a multivariate Gaussian is again a Gaussian; it decays quickly with frequency. So, in order to admit a small approximation error, the target L must be smooth. However, in our case, the function ψ(t) = exp( 2iϕ(t)) is discontinuous at t = 0; it changes extremely rapidly in the vicinity of t = 0, and hence its Fourier transform (appropriately deﬁned) does not decay quickly at all. Therefore, L(a, w) is a poor target for approximation with a smooth function b L = Eδ[QL dec]. From Fig. 1, the diﬀerence between h and ψ increases as |t| 0. The poor approximation error h ψ L = 1 results in a trivial bound for supw S |L(a, w)| instead of (16).

Decoupling and convolutional phase retrieval. The key idea to reduce the approximation error ψ h L = 1 is to smooth ψ. More speciﬁcally, we introduce a new objective (6) with Gaussian weighting b = ζσ2(y) in (10), replacing the analyzing target T2 with

b T2 = diag ( b1/2) ℑ((Aw) exp ( iϕ(Ax))) .

Consequently, we obtain a smoothed variant Ls(a, w) of L(a, w),

Ls(a, w) = w A diag (ζσ2(y) ψ(Ax)) Aw.

Now the approximation error h ψ L in (22) is replaced by h(t) ζσ2(t)ψ(t) L . As observed from Fig. 1, the function ζσ2(t) smoothes ψ(t) especially near the vicinity of t = 0, such that the new approximation error f(t) ζσ2(t)ψ(t) L is signiﬁcantly reduced. Thus, by using similar ideas as above, we can prove a desired bound supw S |Ls(a, w)| < (1 ηs)m. Finally, because the new weighting b = ζσ2(y), the overall analysis needs to be slightly modiﬁed correspondingly. We refer the readers to our full paper for more details.

Figure 1: Plots of functions ζσ2(t), f(t) and ψ(t) for t R+.

Figure 2: Phase transition for recovering the signal x CSn 1 with different Cx .

4 Experiments

Dependence of sample complexity on Cx . First, we investigate the dependence of the sample complexity m on Cx . We assume the ground truth x CSn 1, and consider three

cases: (1) x = e1 with Cx = 1, where e1 the standard basis vector; (2) x is uniformly random generated from CSn 1; (3) x = 1 n1, with Cx = n. For each case, we ﬁx the signal length n = 1000 and vary the ratio m/n. For each ratio m/n, we randomly generate the kernel a CN(0, I) and repeat the experiment for 100 times. We initialize the algorithm by the spectral method [29, Algorithm 1] and run the gradient descent (8). Given the algorithm output bx, we judge the success of recovery by infϕ [0,2π) bx xeiϕ ϵ, where ϵ = 10 5. From Fig. 2, we can see that the larger the Cx , the more samples are needed for exact recovery.

Figure 3: Experiment on real images.

Experiments on real image. Finally, we run the experiment on some real dataset to demonstrate the eﬀectiveness and the eﬃciency of the proposed method. We choose an image of size 200 300 as in Fig. 4, we use m = 5n log n samples for reconstruction. The kernel a Cm is randomly generated as complex Gaussian CN(0, I). We run power method for 100 iterations for initialization, and stop the algorithm once the error is smaller than 1 10 4. It takes 197.08s to reconstruct all the RGB channels. Experiment using general Gaussian measurements A Cm n could easily run out of memory on a personal computer for problems of this size.

Figure 4: Experiment with real antenna pattern.

Experiments on signal Ao A phase recovery for 5G communications. Finally, we demonstrate the eﬀectiveness of the proposed method on a problem arising in 5G communication, as we mentioned in the introduction. Fig. 4 (left) shows an antenna pattern a C361 obtained from Bell labs. We observe the modulus of the convolution of this pattern with the signal of interest. For three diﬀerent types of signals with length n = 20, (1) x = e1 , (2) x is uniformly random generated from CSn 1, (3) x = 1 n1, our result in Fig. 4 shows that we can achieve almost perfect recovery.

5 Acknowledgement

This work was partially supported by the grants NSF CCF 1527809 and NSF IIS 1546411, the grants from the European Unions Horizon 2020 research and innovation program under grant agreement No. 646804-ERCCOGBNYQ, and the grant from the Israel Science Foundation under grant no. 335/14. QQ thanks the generous support of the Microsoft graduate research fellowship. We would like to thank Shan Zhong for the helpful discussion for real applications and providing the antenna data for experiments, and we thank Ju Sun and Han-wen Kuo for helpful discussion and input regarding the analysis of this work.

[1] Sercan Ö. Arik and Joseph M. Kahn. Direct-detection mode-division multiplexing in modal basis using phase retrieval. Opt. Lett., 41(18):4265 4268, Sep 2016.

[2] T. Bendory, Y. C. Eldar, and N. Boumal. Non-convex phase retrieval from stft measurements. IEEE Transactions on Information Theory, PP(99):1 1, 2017.

[3] Emmanuel J. Candès, Yonina C. Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM Journal on Imaging Sciences, 6(1), 2013.

[4] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval from coded diﬀraction patterns. Applied and Computational Harmonic Analysis, 39(2):277 299, 2015.

[5] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger ﬂow: Theory and algorithms. Information Theory, IEEE Transactions on, 61(4):1985 2007, April 2015.

[6] Emmanuel J Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory, 52(2):489 509, 2006.

[7] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on pure and applied mathematics, 59(8):1207 1223, 2006.

[8] Emmanuel J. Candès, Thomas Strohmer, and Vladislav Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241 1274, 2013.

[9] Emmanuel J Candes and Terence Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE transactions on information theory, 52(12):5406 5425, 2006.

[10] Yuxin Chen and Emmanuel J. Candès. Solving random quadratic systems of equations is nearly as easy as solving linear systems. ar Xiv preprint ar Xiv:1505.05114, 2015.

[11] Victor De la Pena and Evarist Giné. Decoupling: from dependence to independence. Springer, 1999.

[12] Yonina C Eldar and Gitta Kutyniok. Compressed sensing: theory and applications. Cambridge University Press, 2012.

[13] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Springer, 2013.

[14] Robert M Gagliardi and Sherman Karp. Optical communications. New York, Wiley-Interscience, 1976. 445 p., 1, 1976.

[15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797 842, 2015.

[16] R. W. Gerchberg and W. Owen Saxton. A practical algorithm for the determination of the phase from image and diﬀraction plane pictures. Optik, 35:237 246, 1972.

[17] David Gross, Felix Krahmer, and Richard Kueng. A partial derandomization of phaselift using spherical designs. ar Xiv preprint ar Xiv:1310.2267, 2013.

[18] Felix Heide, Wolfgang Heidrich, and Gordon Wetzstein. Fast and ﬂexible convolutional sparse coding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5135 5143, 2015.

[19] Kishore Jaganathan, Yonina C. Eldar, and Babak Hassibi. Phase retrieval: An overview of recent developments. Chapter V, Optical Compressive Imaging, 2016.

[20] Felix Krahmer, Shahar Mendelson, and Holger Rauhut. Suprema of chaos processes and the restricted isometry property. Communications on Pure and Applied Mathematics, 67(11):1877 1904, 2014.

[21] Felix Krahmer and Holger Rauhut. Structured random measurements in signal processing. GAMM-Mitteilungen, 37(2):217 238, 2014.

[22] Ken Kreutz-Delgado. The complex gradient operator and the CR-calculus. ar Xiv preprint ar Xiv:0906.4835, 2009.

[23] Stanislaw Kwapien. Decoupling inequalities for polynomial chaos. The Annals of Probability, pages 1062 1071, 1987.

[24] Antonio Mecozzi, Cristian Antonelli, and Mark Shtaif. Kramers kronig coherent receiver. Optica, 3(11):1220 1227, Nov 2016.

[25] Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pages 2796 2804, 2013.

[26] Holger Rauhut. Compressive sensing and structured random matrices. Theoretical foundations and numerical methods for sparse recovery, 9:1 92, 2010.

[27] Arash Shahmansoori, Gabriel E Garcia, Giuseppe Destino, Gonzalo Seco-Granados, and Henk Wymeersch. 5g position and orientation estimation through millimeter wave mimo. In Globecom Workshops (GC Wkshps), 2015 IEEE, pages 1 6. IEEE, 2015.

[28] Yoav Shechtman, Yonina C. Eldar, Oren Cohen, Henry N. Chapman, Jianwei Miao, and Mordechai Segev. Phase retrieval with application to optical imaging: A contemporary overview. Signal Processing Magazine, IEEE, 32(3):87 109, May 2015.

[29] Mahdi Soltanolkotabi. Algorithms and theory for clustering and nonconvex quadratic programming. Ph D thesis, Stanford University, 2014.

[30] Mahdi Soltanolkotabi. Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization. Co RR, abs/1702.06175, 2017.

[31] Milica Stojanovic, Josko A Catipovic, and John G Proakis. Phase-coherent digital communications for underwater acoustic channels. IEEE Journal of Oceanic Engineering, 19(1):100 111, 1994.

[32] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere. ar Xiv preprint ar Xiv:1504.06785, 2015.

[33] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? ar Xiv preprint ar Xiv:1510.06096, 2015.

[34] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retreival. ar Xiv preprint ar Xiv:1602.06664, 2016.

[35] Irène Waldspurger. Phase retrieval with random gaussian sensing vectors by alternating projections. ar Xiv preprint ar Xiv:1609.03088, 2016.

[36] Irène Waldspurger, Alexandre d Aspremont, and Stéphane Mallat. Phase recovery, maxcut and complex semideﬁnite programming. Mathematical Programming, 149(1-2):47 81, 2015.

[37] P. Walk, H. Becker, and P. Jung. OFDM channel estimation via phase retrieval. In Asilomar 2015, 2015.

[38] G. Wang, G. B. Giannakis, and Y. C. Eldar. Solving systems of random quadratic equations via truncated amplitude ﬂow. IEEE Transactions on Information Theory, PP(99):1 1, 2017.

[39] Huishuai Zhang and Yingbin Liang. Reshaped wirtinger ﬂow for solving quadratic systems of equations. ar Xiv preprint ar Xiv:1605.07719, 2016.

[40] Yuqian Zhang, Yenson Lau, Han-wen Kuo, Sky Cheung, Abhay Pasupathy, and John Wright. On the global geometry of sphere-constrained sparse blind deconvolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.