# Solving Most Systems of Random Quadratic Equations

Gang Wang, Georgios B. Giannakis, Yousef Saad, Jie Chen

Key Lab of Intell. Contr. and Decision of Complex Syst., Beijing Inst. of Technology; Digital Tech. Center & Dept. of Electrical and Computer Eng., Univ. of Minnesota; Department of Computer Science and Engineering, Univ. of Minnesota. {gangwang, georgios, saad}@umn.edu; chenjie@bit.edu.cn

This paper deals with finding an $n$-dimensional solution $x$ to a system of quadratic equations $y_i = |\langle a_i, x\rangle|^2$, $1 \le i \le m$, which in general is known to be NP-hard. We put forth a novel procedure that starts with a weighted maximal correlation initialization obtainable with a few power iterations, followed by successive refinements based on iteratively reweighted gradient-type iterations. The novel techniques distinguish themselves from prior works by the inclusion of a fresh (re)weighting regularization. For certain random measurement models, the proposed procedure returns the true solution $x$ with high probability in time proportional to reading the data $\{(a_i; y_i)\}_{1\le i\le m}$, provided that the number $m$ of equations is some constant $c > 0$ times the number $n$ of unknowns, that is, $m \ge cn$. Empirically, the upshots of this contribution are: i) perfect signal recovery in the high-dimensional regime given only an information-theoretic limit number of equations; and ii) (near-)optimal statistical accuracy in the presence of additive noise. Extensive numerical tests using both synthetic data and real images corroborate its improved signal recovery performance and computational efficiency relative to state-of-the-art approaches.

1 Introduction

One is often faced with solving quadratic equations of the form

$$y_i = |\langle a_i, x\rangle|^2, \quad \text{or equivalently,} \quad \psi_i = |\langle a_i, x\rangle|, \quad 1 \le i \le m \tag{1}$$

where $x \in \mathbb{R}^n/\mathbb{C}^n$ (hereafter, the symbol $A/B$ denotes either $A$ or $B$) is the wanted unknown $n \times 1$ vector, while the observations $\psi_i$ and the feature vectors $a_i \in \mathbb{R}^n/\mathbb{C}^n$ are given and collectively stacked in the data vector $\psi := [\psi_i]_{1\le i\le m}$ and the $m \times n$ sensing matrix $A := [a_i]_{1\le i\le m}$, respectively. Put differently, given information about the (squared) modulus of the inner products between the signal vector $x$ and several known design vectors $a_i$, can one reconstruct exactly (up to a global phase factor) $x$, or alternatively, the missing phase of $\langle a_i, x\rangle$? In fact, much effort has been devoted to determining the number of such equations necessary and/or sufficient for the uniqueness of the solution $x$; see e.g., [1, 8]. It has been proved that $m \ge 2n - 1$ ($m \ge 4n - 4$) generic¹ (which includes the case of random vectors) real (complex) vectors $a_i$ are sufficient for uniquely determining an $n$-dimensional real (complex) vector $x$ [1, Theorem 2.8], [8], while in the real case $m = 2n - 1$ is also shown to be necessary [1]. In this sense, the number $m = 2n - 1$ of equations in (1) can be regarded as the information-theoretic limit for such a quadratic system to be uniquely solvable.

¹It is beyond the scope of the present paper to explain the meaning of generic vectors; interested readers are referred to [1].

In diverse physical sciences and engineering fields, it is impossible or very difficult to record phase measurements. The problem of recovering the signal or phase from magnitude measurements only, commonly known as phase retrieval, thus emerges naturally [10, 11].
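To fix ideas, the following minimal sketch (in Python/NumPy, with names of our choosing such as `gen_data`) simulates noiseless amplitude data under the real Gaussian design considered later in the paper:

```python
import numpy as np

def gen_data(n, m, rng=np.random.default_rng(0)):
    """Simulate model (1) under the real Gaussian design: psi_i = |<a_i, x>|."""
    x = rng.standard_normal(n)           # unknown signal x in R^n
    A = rng.standard_normal((m, n))      # rows a_i ~ N(0, I_n)
    psi = np.abs(A @ x)                  # amplitude data psi_i = |a_i^T x|
    return A, psi, x

A, psi, x = gen_data(n=100, m=2 * 100 - 1)   # m = 2n - 1: the information-theoretic limit
```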
Relevant application domains include e.g., X-ray crystallography, astronomy, microscopy, ptychography, and coherent diffraction imaging [21]. In such setups, optical measurement and detection systems record solely the photon flux, which is proportional to the (squared) magnitude of the field, but not the phase. Problem (1) in its squared form, on the other hand, can be readily recast as an instance of nonconvex quadratically constrained quadratic programming, which subsumes as special cases several well-known combinatorial optimization problems involving Boolean variables, e.g., the NP-complete stone problem [2, Sec. 3.4.1]. A related task of this kind is that of estimating a mixture of linear regressions, where the latent membership indicators can be converted into the missing phases [29]. Although of simple form and practical relevance across different fields, solving systems of nonlinear equations is arguably the most difficult problem in all of numerical computation [19, Page 355].

Notation: Lower- (upper-) case boldface letters denote vectors (matrices), e.g., $a \in \mathbb{R}^n$ ($A \in \mathbb{R}^{m \times n}$). Calligraphic letters are reserved for sets. The floor operation $\lfloor c \rfloor$ gives the largest integer no greater than the given real quantity $c > 0$, the cardinality $|\mathcal{S}|$ counts the number of elements in set $\mathcal{S}$, and $\|x\|$ denotes the Euclidean norm of $x$. Since for any phase $\phi \in \mathbb{R}$, the vectors $x \in \mathbb{C}^n$ and $e^{j\phi}x$ are indistinguishable given $\{\psi_i\}$ in (1), let $\mathrm{dist}(z, x) := \min_{\phi \in [0, 2\pi)} \|z - xe^{j\phi}\|$ be the Euclidean distance of any estimate $z \in \mathbb{C}^n$ to the solution set $\{e^{j\phi}x\}_{0 \le \phi < 2\pi}$ of (1); in particular, $\phi = 0/\pi$ in the real case.

1.1 Prior contributions

Following the least-squares (LS) criterion (which coincides with the maximum likelihood (ML) one assuming additive white Gaussian noise), the problem of solving quadratic equations can be naturally recast as an empirical loss minimization

$$\underset{z \in \mathbb{R}^n/\mathbb{C}^n}{\text{minimize}} \quad L(z) := \frac{1}{m}\sum_{i=1}^m \ell(z; \psi_i/y_i) \tag{2}$$

where one can choose to work with the amplitude-based loss $\ell(z; \psi_i) := \big(\psi_i - |\langle a_i, z\rangle|\big)^2/2$ [28, 30], the intensity-based one $\ell(z; y_i) := \big(y_i - |\langle a_i, z\rangle|^2\big)^2/2$ [3], or its related Poisson likelihood $\ell(z; y_i) := y_i \log\!\big(|\langle a_i, z\rangle|^2\big) - |\langle a_i, z\rangle|^2$ [7]. Either way, the objective functional $L(z)$ is nonconvex; hence, it is generally NP-hard and computationally intractable to compute the ML or LS estimate.

Minimizing the squared-modulus-based LS loss in (2), several polynomial-time numerical algorithms have been devised via convex programming for certain choices of design vectors $a_i$ [4, 25]. Such convex paradigms first rely on the matrix-lifting technique to express all squared-modulus terms as linear ones in a new rank-1 matrix variable, followed by solving a convex semi-definite program (SDP) after dropping the rank constraint. It has been established that perfect recovery and (near-)optimal statistical accuracy are achieved in noiseless and noisy settings, respectively, with an optimal-order number of measurements [4]. In terms of computational efficiency however, such lifting-based convex approaches entail storing and solving for an $n \times n$ semi-definite matrix from $m$ general SDP constraints, whose worst-case computational complexity scales as $n^{4.5}\log(1/\epsilon)$ for $m \approx n$ [25], which is not scalable. Another recent line of convex relaxation [12], [13] reformulated the problem of phase retrieval as that of sparse signal recovery, and solved a linear program in the natural parameter vector domain.
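For concreteness, the three losses in (2) can be evaluated as in the following sketch (function names are ours; `eps` guards the logarithm at zero):

```python
import numpy as np

def amplitude_loss(z, A, psi):
    """l(z; psi_i) = (psi_i - |a_i^T z|)^2 / 2, averaged over the m equations."""
    return 0.5 * np.mean((psi - np.abs(A @ z)) ** 2)

def intensity_loss(z, A, y):
    """l(z; y_i) = (y_i - |a_i^T z|^2)^2 / 2, averaged over the m equations."""
    return 0.5 * np.mean((y - np.abs(A @ z) ** 2) ** 2)

def poisson_loss(z, A, y, eps=1e-12):
    """Poisson likelihood term y_i log(|a_i^T z|^2) - |a_i^T z|^2, averaged."""
    iz = np.abs(A @ z) ** 2
    return np.mean(y * np.log(iz + eps) - iz)
```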
Although exact signal recovery can be established assuming an accurate enough anchor vector, its empirical performance is in general not competitive with state-of-the-art phase retrieval approaches.

Recent proposals advocate suitably initialized iterative procedures that cope with certain nonconvex formulations directly; see e.g., the algorithms abbreviated as AltMinPhase, (R/P)WF, (M)TWF, (S)TAF [16, 3, 7, 26, 28, 27, 30, 22, 6, 24], as well as a prox-linear algorithm [9]. These nonconvex approaches operate directly upon vector optimization variables, thus leading to significant computational advantages over their convex counterparts. With random features, they can be interpreted as performing stochastic optimization over the acquired examples $\{(a_i; \psi_i/y_i)\}_{1\le i\le m}$ to approximately minimize the population risk functional $L(z) := \mathbb{E}_{(a_i, \psi_i/y_i)}\big[\ell(z; \psi_i/y_i)\big]$. It is well documented that minimizing nonconvex functionals is generally intractable due to the existence of multiple critical points [17]. Assuming Gaussian sensing vectors however, such nonconvex paradigms can provably locate the global optimum, and several of them also achieve optimal (statistical) guarantees. Specifically, starting with a judiciously designed initial guess, successive improvement is effected by means of a sequence of (truncated) (generalized) gradient-type iterations

$$z^{t+1} := z^t - \frac{\mu_t}{m}\sum_{i \in \mathcal{T}^{t+1}} \nabla\ell(z^t; \psi_i/y_i), \quad t = 0, 1, \ldots \tag{3}$$

where $z^t$ denotes the estimate returned by the algorithm at the $t$-th iteration, $\mu_t > 0$ is a learning rate that can be pre-selected or found via e.g., the backtracking line search strategy, and $\nabla\ell(z^t; \psi_i/y_i)$ represents the (generalized) gradient of the modulus- or squared-modulus-based LS loss evaluated at $z^t$. Here, $\mathcal{T}^{t+1}$ denotes some time-varying index set signifying the per-iteration gradient truncation.
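As a concrete instance of (3), the following sketch (ours; the boolean mask `keep` abstracts whatever truncation rule defines $\mathcal{T}^{t+1}$, which varies per algorithm) performs one truncated step for the amplitude-based loss:

```python
import numpy as np

def truncated_grad_step(z, A, psi, keep, mu=1.0):
    """One generic truncated update (3) for the amplitude-based loss:
    descend only along gradients of the summands indexed by T^{t+1}."""
    Az = A @ z
    sgn = np.where(Az != 0, np.sign(Az), 0.0)   # convention: a^T z / |a^T z| := 0 at 0
    resid = (Az - psi * sgn) * keep             # keep[i] = True iff i is in T^{t+1}
    return z - mu * A.T @ resid / len(psi)      # gradient of (psi_i - |a_i^T z|)^2 / 2
```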
Although they achieve optimal statistical guarantees in both noiseless and noisy settings, state-of-the-art (convex and nonconvex) approaches studied under Gaussian designs empirically require for stable recovery a number of equations (several) times larger than the information-theoretic limit [7, 3, 30]. As a matter of fact, when there are sufficiently many measurements (on the order of $n$ up to some polylog factors), the squared-modulus-based LS functional admits benign geometric structure, in the sense that [23]: i) all local minimizers are also global; and ii) there always exists a negative directional curvature at every saddle point. In a nutshell, the grand challenge of tackling systems of random quadratic equations remains to develop algorithms capable of achieving perfect recovery and statistical accuracy when the number of measurements approaches the information-theoretic limit.

1.2 This work

Building upon but going beyond the scope of the aforementioned nonconvex paradigms, the present paper puts forward a novel iterative linear-time scheme, namely, one whose running time is proportional to that required by the processor to scan all the data $\{(a_i; \psi_i)\}_{1\le i\le m}$; we term it reweighted amplitude flow, henceforth abbreviated as RAF. Our methodology is capable of solving noiseless random quadratic equations exactly, and of yielding an estimate of (near-)optimal statistical accuracy from noisy modulus observations. Exactness and accuracy hold with high probability and without extra assumptions on the unknown signal vector $x$, provided that the ratio $m/n$ of the number of equations to that of the unknowns exceeds a certain constant. Empirically, our approach is shown able to ensure exact recovery of high-dimensional unstructured signals given a minimal number of equations, where $m/n$ in the real case can be as small as 2. The new twist here is to leverage judiciously designed yet conceptually simple (re)weighting regularization techniques to enhance existing initializations and also gradient refinements. An informal depiction of our RAF methodology is given in two stages as follows, with rigorous details deferred to Section 3:

S1) Weighted maximal correlation initialization: Obtain an initializer $z^0$ maximally correlated with a carefully selected subset $\mathcal{S} \subseteq \mathcal{M} := \{1, 2, \ldots, m\}$ of feature vectors $a_i$, whose contributions toward constructing $z^0$ are judiciously weighted by suitable parameters $\{w_i^0 > 0\}_{i \in \mathcal{S}}$.

S2) Iteratively reweighted gradient-like iterations: Loop over $0 \le t < T$:

$$z^{t+1} = z^t - \frac{\mu_t}{m}\sum_{i=1}^m w_i^t \nabla\ell(z^t; \psi_i) \tag{4}$$

for some time-varying weighting parameters $\{w_i^t \ge 0\}$, each possibly relying on the current iterate $z^t$ and the datum $(a_i; \psi_i)$.

Two attributes of the novel approach are worth highlighting. First, albeit being a variant of the spectral initialization devised in [28], the initialization here [cf. S1)] is distinct in the sense that a different importance is attached to each selected datum $(a_i; \psi_i)$. Likewise, the gradient flow [cf. S2)] judiciously weighs the search direction suggested by each datum $(a_i; \psi_i)$. In this manner, more robust initializations and more stable overall search directions can be constructed, even based solely on a rather limited number of data samples. Moreover, with particular choices of the weights $w_i^t$ (e.g., taking 0/1 values), the developed methodology subsumes as special cases the recently proposed algorithms RWF [30] and TAF [28].

2 Algorithm: Reweighted Amplitude Flow

This section explains the intuition and basic principles behind each stage of the advocated RAF algorithm in detail. For analytical concreteness, we focus on the real Gaussian model with $x \in \mathbb{R}^n$ and independent sensing vectors $a_i \sim \mathcal{N}(0, I_n)$ for all $1 \le i \le m$. Nonetheless, the presented approach can be applied directly when the complex Gaussian and the coded diffraction pattern (CDP) models are considered.

2.1 Weighted maximal correlation initialization

A key enabler of the success of general nonconvex iterative heuristics in finding the global optimum is to seed them with an excellent starting point [14]. Indeed, several smart initialization strategies have been advocated for iterative phase retrieval algorithms; see e.g., the spectral initialization [16], [3] as well as its truncated variants [7], [28], [9], [30], [15]. One promising approach is the one pursued in [28], which was also shown robust to outliers in [9]. To hopefully approach the information-theoretic limit however, its performance may need further enhancement. Intuitively, it is increasingly challenging to improve the initialization (over state-of-the-art) as the number of acquired data samples approaches the information-theoretic limit. In this context, we develop a more flexible initialization scheme based on the correlation property (as opposed to the orthogonality in [28]), whose added benefit is the inclusion of a flexible weighting regularization technique to better balance the useful information exploited in the selected data. Similar to related approaches of the same kind, our strategy entails estimating both the norm $\|x\|$ and the direction $x/\|x\|$ of $x$.
Leveraging the strong law of large numbers and the rotational invariance of Gaussian $a_i$ vectors (the latter allows us to assume, without loss of generality, that $x = \|x\| e_1$, with $e_1$ being the first canonical vector in $\mathbb{R}^n$), it is clear that

$$\frac{1}{m}\sum_{i=1}^m \psi_i^2 = \frac{1}{m}\sum_{i=1}^m \big|\langle a_i, \|x\| e_1\rangle\big|^2 = \Big(\frac{1}{m}\sum_{i=1}^m a_{i,1}^2\Big) \|x\|^2 \approx \|x\|^2 \tag{5}$$

whereby $\|x\|$ can be estimated as $\sqrt{\sum_{i=1}^m \psi_i^2/m}$. This estimate proves very accurate even with a limited number of data samples, because $\sum_{i=1}^m a_{i,1}^2/m$ is unbiased and tightly concentrated. The challenge thus lies in accurately estimating the direction of $x$, or seeking a unit vector maximally aligned with $x$. Toward this end, let us first present a variant of the initialization in [28]. Note that the larger the modulus $\psi_i$ of the inner product between $a_i$ and $x$ is, the more correlated the known design vector $a_i$ is deemed to the unknown solution $x$, hence bearing more useful directional information about $x$. Inspired by this fact and having available data $\{(a_i; \psi_i)\}_{1\le i\le m}$, one can sort all (absolute) correlation coefficients $\{\psi_i\}_{1\le i\le m}$, yielding the ordered coefficients $0 < \psi_{[m]} \le \cdots \le \psi_{[2]} \le \psi_{[1]}$. Sorting $m$ records takes time proportional to $O(m \log m)$.²

Let $\mathcal{S} \subseteq \mathcal{M}$ denote the set of selected feature vectors $a_i$ to be used for computing the initialization, which is designed next. Fix a priori the cardinality $|\mathcal{S}|$ to some integer on the order of $m$, say, $|\mathcal{S}| := \lfloor 3m/13 \rfloor$. It is then natural to define $\mathcal{S}$ to collect the $a_i$ vectors corresponding to the largest $|\mathcal{S}|$ correlation coefficients $\{\psi_{[i]}\}_{1\le i\le |\mathcal{S}|}$, each of which can be thought of as pointing to (roughly) the direction of $x$. Approximating the direction of $x$ therefore boils down to finding a vector that maximizes its correlation with the subset $\mathcal{S}$ of selected directional vectors $a_i$. Succinctly, the wanted approximation vector can be efficiently found as the solution of

$$\underset{\|z\|=1}{\text{maximize}} \quad \frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}} |\langle a_i, z\rangle|^2 = z^* \Big(\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}} a_i a_i^*\Big) z \tag{6}$$

where the superscript $^*$ represents the transpose or the conjugate transpose, as will be clear from the context. Upon scaling the unit-norm solution of (6) by the norm estimate $\sqrt{\sum_{i=1}^m \psi_i^2/m}$ obtained in (5), to match the magnitude of $x$, we arrive at what we will henceforth refer to as the maximal correlation initialization. As long as $|\mathcal{S}|$ is chosen on the order of $m$, the maximal correlation method outperforms the spectral ones in [3, 16, 7], and has comparable performance to the orthogonality-promoting method [28]. Its performance around the information-theoretic limit however, is still not the best that we can hope for.

Recall from (6) that all selected directional vectors $\{a_i\}_{i\in\mathcal{S}}$ are treated the same in terms of their contributions to constructing the initialization. Nevertheless, according to our starting principle, the ordering information carried by the selected $a_i$ vectors is not exploited by the initialization scheme in (6) or in [28]. In other words, if $\psi_i > \psi_j$ for $i, j \in \mathcal{S}$, then $a_i$ is deemed more correlated (with $x$) than $a_j$ is, hence bearing more useful information about the direction of $x$. It is thus prudent to weigh more heavily the selected $a_i$ vectors associated with larger $\psi_i$ values. Given the ordering $\psi_{[|\mathcal{S}|]} \le \cdots \le \psi_{[2]} \le \psi_{[1]}$ available from the sorting procedure, a natural way to achieve this goal is to weight each $a_i$ vector by a simple monotonically increasing function of $\psi_i$, say e.g., taking the weights $w_i^0 := \psi_i^\gamma$, $i \in \mathcal{S}$, with the exponent parameter $\gamma \ge 0$ chosen to maintain the wanted ordering $w_{[|\mathcal{S}|]}^0 \le \cdots \le w_{[2]}^0 \le w_{[1]}^0$.

In a nutshell, a more flexible initialization strategy, which we refer to as weighted maximal correlation, can be summarized as follows:

$$\tilde z^0 := \arg\max_{\|z\|=1} \; z^* \Big(\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}} \psi_i^\gamma a_i a_i^*\Big) z. \tag{7}$$

For any given $\epsilon > 0$, the power method or the Lanczos algorithm can be called upon to find an $\epsilon$-accurate solution to (7) in time proportional to $O(n|\mathcal{S}|)$ [20], assuming a positive eigengap between the largest and the second largest eigenvalues of the matrix $\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}} \psi_i^\gamma a_i a_i^*$, which is often true when the $\{a_i\}$ are sampled from a continuous distribution. The proposed initialization is obtained upon scaling $\tilde z^0$ from (7) by the norm estimate in (5), to yield $z^0 := \sqrt{\sum_{i=1}^m \psi_i^2/m}\; \tilde z^0$. By default, we take $\gamma := 1/2$ in all reported numerical implementations, yielding $w_i^0 := \sqrt{\psi_i} = \sqrt{|\langle a_i, x\rangle|}$ for all $i \in \mathcal{S}$.

²$f(m) = O(g(m))$ means that there exists a constant $C > 0$ such that $|f(m)| \le C|g(m)|$.
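The initialization stage then admits the following compact sketch (Python/NumPy; helper names like `select_set` are ours, and the power iterations apply the matrix in (7) without ever forming it):

```python
import numpy as np

def norm_estimate(psi):
    """||x|| estimate from (5): sqrt(sum_i psi_i^2 / m)."""
    return np.sqrt(np.mean(psi ** 2))

def select_set(psi, frac=3.0 / 13.0):
    """Indices of the |S| = floor(3m/13) largest correlation coefficients psi_i."""
    card = int(np.floor(frac * len(psi)))
    return np.argpartition(psi, -card)[-card:]      # unordered top-|S| indices

def weighted_init(A, psi, gamma=0.5, n_iter=100, rng=np.random.default_rng(0)):
    """Weighted maximal correlation initialization (7): power iterations for the
    principal eigenvector of (1/|S|) sum_{i in S} psi_i^gamma a_i a_i^*, scaled by (5)."""
    S = select_set(psi)
    AS, w0 = A[S], psi[S] ** gamma                  # weights w_i^0 = psi_i^gamma, i in S
    z = rng.standard_normal(A.shape[1])
    z /= np.linalg.norm(z)
    for _ in range(n_iter):                         # each iteration costs O(n|S|)
        z = AS.T @ (w0 * (AS @ z)) / len(S)         # matrix-free multiply by (7)'s matrix
        z /= np.linalg.norm(z)
    return norm_estimate(psi) * z                   # z^0 := sqrt(sum psi_i^2 / m) * z~^0
```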
Regarding the initialization procedure in (7), we next highlight two features, whereas technical details and theoretical performance guarantees are provided in Section 3:

F1) The weights $\{w_i^0\}$ in the maximal correlation scheme enable leveraging the useful information that each feature vector $a_i$ may bear regarding the direction of $x$.

F2) Taking $w_i^0 := \psi_i^\gamma$ for all $i \in \mathcal{S}$ and $0$ otherwise, problem (7) can be equivalently rewritten as

$$\tilde z^0 := \arg\max_{\|z\|=1} \; z^* \Big(\frac{1}{m}\sum_{i=1}^m w_i^0 a_i a_i^*\Big) z \tag{8}$$

which subsumes previous initialization schemes for particular selections of the weights $\{w_i^0\}$. For instance, the spectral initialization in [16, 3] is recovered by choosing $\mathcal{S} := \mathcal{M}$ and $w_i^0 := \psi_i^2$ for all $1 \le i \le m$.

Figure 1: Relative initialization error for i.i.d. $a_i \sim \mathcal{N}(0, I_{1{,}000})$, $1 \le i \le 1{,}999$ (x-axis: signal dimension $n$ from 1,000 to 5,000 with $m = 2n - 1$; curves: reweighted maximal correlation, spectral initialization, truncated spectral in TWF, orthogonality promoting, truncated spectral in RWF).

For comparison, define the relative error $:= \mathrm{dist}(z, x)/\|x\|$. Throughout the paper, all simulated results were averaged over 100 Monte Carlo (MC) realizations, and each simulated scheme was implemented with its pertinent default parameters. Figure 1 evaluates the performance of the developed initialization relative to several state-of-the-art strategies, with the number of data fixed at the information-theoretic limit, benchmarking the minimal number of samples required. It is clear that our initialization is: i) consistently better than the state-of-the-art; and ii) stable as $n$ grows, in contrast to the instability encountered by the spectral ones [16, 3, 7, 30]. It is worth stressing that the more than 5% empirical advantage (relative to the best alternative) at the challenging information-theoretic benchmark is nontrivial, and is one of the main RAF upshots. This advantage becomes increasingly pronounced as the ratio $m/n$ grows.

2.2 Iteratively reweighted gradient flow

For independent data obeying the real Gaussian model, the direction that TAF moves along in stage S2) presented earlier is given by the following (generalized) gradient [28]:

$$\frac{1}{m}\sum_{i\in\mathcal{T}} \nabla\ell(z; \psi_i) = \frac{1}{m}\sum_{i\in\mathcal{T}} \Big(a_i^\top z - \psi_i \frac{a_i^\top z}{|a_i^\top z|}\Big) a_i \tag{9}$$

where the dependence on the iteration count $t$ is neglected for notational brevity, and the convention $a_i^\top z/|a_i^\top z| := 0$ is adopted when $a_i^\top z = 0$. Unfortunately, the (negative) gradient of the average in (9) may not point toward the true solution $x$ unless the current iterate $z$ is already very close to $x$. Therefore, moving along such a descent direction may not drag $z$ closer to $x$. To see this, consider an initial guess $z^0$ that already lies in a basin of attraction (i.e., a region within which there is only a unique stationary point) of $x$.
Certainly, there are summands $\big(a_i^\top z - \psi_i \frac{a_i^\top z}{|a_i^\top z|}\big) a_i$ in (9) that could give rise to bad/misleading gradient directions, due to erroneously estimated signs $\frac{a_i^\top z}{|a_i^\top z|} \ne \frac{a_i^\top x}{|a_i^\top x|}$ [28], or equivalently $(a_i^\top z)(a_i^\top x) < 0$ [30]. Those gradients as a whole may drag $z$ away from $x$, and hence out of the basin of attraction. Such an effect becomes increasingly severe as $m$ approaches the information-theoretic limit of $2n - 1$, thus rendering past approaches less effective in this case. Although this issue is somewhat remedied by TAF through a truncation procedure, its efficacy is limited by misses of bad gradients and mis-rejections of meaningful ones around the information-theoretic limit.

To address this challenge, the reweighted amplitude flow adopts suitable gradient directions from all data samples $\{(a_i; \psi_i)\}_{1\le i\le m}$ in a (timely) adaptive fashion, namely by introducing appropriate weights for all gradients, yielding the update

$$z^{t+1} = z^t - \mu_t \nabla\ell_{\mathrm{rw}}(z^t), \quad t = 0, 1, \ldots \tag{10}$$

The reweighted gradient $\nabla\ell_{\mathrm{rw}}(z^t)$ evaluated at the current point $z^t$ is given as

$$\nabla\ell_{\mathrm{rw}}(z) := \frac{1}{m}\sum_{i=1}^m w_i \nabla\ell(z; \psi_i) \tag{11}$$

for suitable weights $\{w_i\}_{1\le i\le m}$ to be designed next. To that end, we observe that the truncation criterion [28]

$$\mathcal{T} := \Big\{1 \le i \le m : \frac{|a_i^\top z|}{|a_i^\top x|} \ge \alpha\Big\} \tag{12}$$

with some given parameter $\alpha > 0$ suggests including only gradient components associated with $|a_i^\top z|$ of relatively large size. This is because gradients with sizable $|a_i^\top z|/|a_i^\top x|$ offer reliable and meaningful directions pointing toward the truth $x$ with large probability [28]. As such, the ratio $|a_i^\top z|/|a_i^\top x|$ can be viewed as a confidence score on the reliability or meaningfulness of the corresponding gradient $\nabla\ell(z; \psi_i)$. Recognizing that confidence can vary, it is natural to distinguish the contributions that different gradients make to the overall search direction. An easy way is to attach large weights to the reliable gradients, and small weights to the spurious ones. Assume without loss of generality that $0 \le w_i \le 1$ for all $1 \le i \le m$; otherwise, lump the normalization factor achieving this into the learning rate $\mu_t$. Building upon this observation and leveraging the gradient reliability confidence score $|a_i^\top z|/|a_i^\top x|$, the weight for each gradient $\nabla\ell(z; \psi_i)$ in RAF is designed as

$$w_i := \frac{1}{1 + \beta_i/\big(|a_i^\top z|/|a_i^\top x|\big)}, \quad 1 \le i \le m \tag{13}$$

in which $\{\beta_i > 0\}_{1\le i\le m}$ are some pre-selected parameters. Regarding the proposed weighting criterion in (13), three remarks are in order, followed by the RAF algorithm summarized in Algorithm 1.

R1) The weights $\{w_i^t\}_{1\le i\le m}$ are time-adapted to $z^t$. One can also interpret the reweighted gradient flow $z^{t+1}$ in (10) as performing a single gradient step to minimize the smooth reweighted loss $\frac{1}{m}\sum_{i=1}^m w_i^t \ell(z; \psi_i)$ with starting point $z^t$; see also [4] for related ideas successfully exploited in the iteratively reweighted least-squares approach to compressive sampling.

R2) Note that the larger $|a_i^\top z|/|a_i^\top x|$ is, the larger $w_i$ will be: more importance is attached to reliable gradients than to spurious ones. Gradients from almost all data points are judiciously accounted for, in sharp contrast to [28], where withdrawn gradients do not contribute the information they carry.

R3) At points $z$ where $a_i^\top z = 0$ for certain $i \in \mathcal{M}$, the corresponding weight is $w_i = 0$. That is, the losses $\ell(z; \psi_i)$ in (2) that are nonsmooth at such points $z$ are eliminated, preventing their contribution to the reweighted gradient update in (10). Hence, the convergence analysis of RAF can be considerably simplified, because it does not have to cope with the nonsmoothness of the objective function in (2).
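Since $|a_i^\top x| = \psi_i$ is known from the data, the weights (13) and one reweighted update (10)–(11) can be sketched as follows (a hypothetical helper of ours with $\beta_i \equiv \beta$, not the authors' released code; note that $w_i$ vanishes automatically when $a_i^\top z = 0$, as in R3):

```python
import numpy as np

def raf_step(z, A, psi, mu=2.0, beta=10.0, eps=1e-12):
    """One reweighted gradient iteration (10)-(11) with weights (13),
    using |a_i^T x| = psi_i to form the confidence scores |a_i^T z| / psi_i."""
    Az = A @ z
    ratio = np.abs(Az) / (psi + eps)                    # gradient reliability scores
    w = ratio / (ratio + beta)                          # (13): w_i = 1/(1 + beta/ratio_i)
    sgn = np.where(Az != 0, np.sign(Az), 0.0)           # convention a^T z / |a^T z| := 0
    grad = A.T @ (w * (Az - psi * sgn)) / len(psi)      # (11): reweighted amplitude gradient
    return z - mu * grad                                # (10): gradient step
```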
2.3 Algorithmic parameters

To optimize the empirical performance and facilitate numerical implementations, the choice of pertinent algorithmic parameters of RAF is discussed separately here. The RAF algorithm entails four parameters. Our theory and all experiments are based on: i) $|\mathcal{S}|/m \le 0.25$; ii) $0 < \beta_i \le 10$ for all $1 \le i \le m$; and iii) $0 \le \gamma \le 1$. For convenience, a constant step size $\mu_t \equiv \mu > 0$ is suggested, but other step size rules such as backtracking line search with the reweighted objective work as well. As will be formalized in Section 3, RAF converges if the constant $\mu$ is not too large, with the upper bound depending in part on the selection of $\{\beta_i\}_{1\le i\le m}$. In the numerical tests presented in Sections 2 and 4, we take

$$|\mathcal{S}| := \lfloor 3m/13 \rfloor, \quad \beta_i \equiv \beta := 10, \quad \gamma := 0.5, \quad \text{and} \quad \mu := 2 \tag{14}$$

while larger step sizes $\mu > 0$ can be afforded for larger $m/n$ values.

Algorithm 1 Reweighted Amplitude Flow
1: Input: Data $\{(a_i; \psi_i)\}_{1\le i\le m}$; maximum number of iterations $T$; step size $\mu_t = 2/6$ and weighting parameter $\beta_i = 10/5$ for the real/complex Gaussian model; $|\mathcal{S}| = \lfloor 3m/13 \rfloor$, and $\gamma = 0.5$.
2: Construct $\mathcal{S}$ to include the indices associated with the $|\mathcal{S}|$ largest entries among $\{\psi_i\}_{1\le i\le m}$.
3: Initialize $z^0 := \sqrt{\sum_{i=1}^m \psi_i^2/m}\;\tilde z^0$, with $\tilde z^0$ being the unit principal eigenvector of
$$\frac{1}{m}\sum_{i=1}^m w_i^0 a_i a_i^*, \quad \text{where } w_i^0 := \begin{cases} \psi_i^\gamma, & i \in \mathcal{S} \subseteq \mathcal{M} \\ 0, & \text{otherwise} \end{cases} \quad \text{for all } 1 \le i \le m. \tag{15}$$
4: Loop: for $t = 0$ to $T - 1$,
$$z^{t+1} = z^t - \frac{\mu_t}{m}\sum_{i=1}^m w_i^t \Big(a_i^\top z^t - \psi_i \frac{a_i^\top z^t}{|a_i^\top z^t|}\Big) a_i \tag{16}$$
where $w_i^t := \frac{|a_i^\top z^t|/\psi_i}{|a_i^\top z^t|/\psi_i + \beta_i}$ for all $1 \le i \le m$.
5: Output: $z^T$.

3 Main results

Our main result, summarized in Theorem 1 next, establishes exact recovery under the real Gaussian model; its proof is provided in the supplementary material. Our RAF approach, however, can be generalized readily to the complex Gaussian and CDP models.

Theorem 1 (Exact recovery) Consider $m$ noiseless measurements $\psi = |Ax|$ for an arbitrary $x \in \mathbb{R}^n$. If the data size $m \ge c_0|\mathcal{S}| \ge c_1 n$ and the step size $\mu \le \mu_0$, then with probability at least $1 - c_3 e^{-c_2 m}$, the reweighted amplitude flow's estimates $z^t$ in Algorithm 1 obey

$$\mathrm{dist}(z^t, x) \le \frac{1}{10}(1 - \nu)^t \|x\|, \quad t = 0, 1, \ldots \tag{17}$$

where $c_0, c_1, c_2, c_3 > 0$, $0 < \nu < 1$, and $\mu_0 > 0$ are certain numerical constants depending on the choice of the algorithmic parameters $|\mathcal{S}|$, $\beta$, $\gamma$, and $\mu$.

According to Theorem 1, a few interesting properties of our RAF algorithm are worth highlighting. To start, RAF recovers the true solution exactly with high probability whenever the ratio $m/n$ of the number of equations to the unknowns exceeds some numerical constant. Expressed differently, RAF achieves the information-theoretically optimal order of sample complexity, which is consistent with the state-of-the-art including TWF [7], TAF [28], and RWF [30]. Notice that (17) also holds at $t = 0$, namely, $\mathrm{dist}(z^0, x) \le \|x\|/10$, therefore providing a performance guarantee for the proposed initialization scheme (cf. Step 3 in Algorithm 1). Moreover, starting from this initial estimate, RAF converges linearly to the true solution $x$. That is, to reach any $\epsilon$-relative solution accuracy (i.e., $\mathrm{dist}(z^T, x) \le \epsilon \|x\|$), it suffices to run at most $T = O(\log(1/\epsilon))$ RAF iterations (cf. Step 4). This, in conjunction with the per-iteration complexity $O(mn)$, confirms that RAF solves a quadratic system exactly in time $O(mn \log(1/\epsilon))$, which is linear in $O(mn)$, the time required to read the entire data $\{(a_i; \psi_i)\}_{1\le i\le m}$.
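Assembling the pieces, a compact end-to-end sketch of Algorithm 1 (reusing the `gen_data`, `weighted_init`, and `raf_step` helpers sketched earlier, with the real Gaussian defaults from (14)) could read:

```python
import numpy as np

def raf(A, psi, T=2000, mu=2.0, beta=10.0, gamma=0.5):
    """Reweighted Amplitude Flow (Algorithm 1), real Gaussian defaults per (14)."""
    z = weighted_init(A, psi, gamma=gamma)       # Steps 2-3: selection + weighted init
    for _ in range(T):                           # Step 4: reweighted gradient loop (16)
        z = raf_step(z, A, psi, mu=mu, beta=beta)
    return z                                     # Step 5: output z^T

# usage: relative error up to the global sign ambiguity of the real case
A, psi, x = gen_data(n=200, m=2 * 200 - 1)
zT = raf(A, psi)
rel_err = min(np.linalg.norm(zT - x), np.linalg.norm(zT + x)) / np.linalg.norm(x)
```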
Given that the initialization stage can be performed in time $O(n|\mathcal{S}|)$ with $|\mathcal{S}| < m$, the overall linear-time complexity of RAF is order-optimal.

4 Simulated tests

Figure 2: Function value $L(z^T)$ by RAF for 100 MC realizations when $m = 2n - 1$ (plotted as $-\log_{10} L(z^T)$ per realization).

Our theoretical findings about RAF have been corroborated with comprehensive numerical tests, a sample of which are discussed next. The performance of RAF is evaluated relative to the state-of-the-art (T)WF, RWF, and TAF in terms of the empirical success rate among 100 MC trials, where a success is declared for a trial if the returned estimate incurs a small enough error $\|\psi - |Az^T|\|$, where the modulus operator $|\cdot|$ is understood element-wise. The real Gaussian model and the physically realizable CDPs were simulated in this section. For fairness, all schemes were implemented with their suggested parameter values. The true signal vector $x$ was randomly generated as $x \sim \mathcal{N}(0, I_n)$, and the i.i.d. sensing vectors as $a_i \sim \mathcal{N}(0, I_n)$. Each scheme obtained its initial guess based on 200 power iterations, followed by a series of $T = 2{,}000$ (truncated/reweighted) gradient iterations. All experiments were performed using MATLAB on an Intel CPU @ 3.4 GHz (32 GB RAM) computer. For reproducibility, the MATLAB code of the RAF algorithm is publicly available at https://gangwg.github.io/RAF/.

To demonstrate the power of RAF in the high-dimensional regime, the function value $L(z)$ in (2) evaluated at the returned estimate $z^T$ for 100 independent trials is plotted (in negative logarithmic scale) in Fig. 2, where $m = 2n - 1 = 9{,}999$. It is self-evident that RAF succeeded in all trials, even at this challenging information-theoretic limit. To the best of our knowledge, RAF is the first algorithm that empirically recovers any solution exactly from a minimal number of random quadratic equations.

The left panel of Fig. 3 further compares the empirical success rate of five schemes under the real Gaussian model with $n = 1{,}000$ and $m/n$ varying by $0.1$ from $1$ to $5$. Evidently, the developed RAF achieves perfect recovery as soon as $m$ is about $2n$, where its competing alternatives do not work well. To demonstrate the stability and robustness of RAF in the presence of additive noise, the right panel of Fig. 3 depicts the normalized mean-square error $\mathrm{NMSE} := \mathrm{dist}^2(z^T, x)/\|x\|^2$ as a function of the signal-to-noise ratio (SNR) for $m/n$ taking values $\{3, 4, 5\}$. The noise model $\psi_i = |\langle a_i, x\rangle| + \eta_i$, $1 \le i \le m$, with $\eta := [\eta_i]_{1\le i\le m} \sim \mathcal{N}(0, \sigma^2 I_m)$ was employed, where $\sigma^2$ was set so that certain $\mathrm{SNR} := 10\log_{10}\big(\|Ax\|^2/(m\sigma^2)\big)$ values on the x-axis were achieved.

Figure 3: Real Gaussian model with $x \in \mathbb{R}^{1{,}000}$: empirical success rate of RAF, TAF, TWF, RWF, and WF vs. $m/n$ (Left); and NMSE vs. SNR in dB for $m = 3n, 4n, 5n$ (Right).

To examine the efficacy and scalability of RAF under real-world conditions, the last experiment entails the Galaxy image³ depicted by a three-way array $X \in \mathbb{R}^{1{,}080 \times 1{,}920 \times 3}$, whose first two coordinates encode the pixel locations, and the third the RGB color bands. Consider the physically realizable CDP model with random masks [3]. Letting $x \in \mathbb{R}^n$ ($n \approx 2 \times 10^6$) be a vectorization of a certain band of $X$, the CDP model with $K$ masks is

$$\psi^{(k)} = |F D^{(k)} x|, \quad 1 \le k \le K$$

where $F \in \mathbb{C}^{n \times n}$ is a DFT matrix, and the diagonal matrices $D^{(k)}$ have their diagonal entries sampled uniformly at random from $\{1, -1, j, -j\}$ with $j := \sqrt{-1}$. Each $D^{(k)}$ represents a random mask placed after the object to modulate the illumination patterns [5].

³Downloaded from http://pics-about-space.com/milky-way-galaxy.
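For illustration, CDP measurements can be simulated with FFTs as in the following sketch (ours, not the authors' code; `np.fft.fft` plays the role of the unnormalized DFT matrix $F$):

```python
import numpy as np

def cdp_measurements(x, K=4, rng=np.random.default_rng(0)):
    """Sketch of the CDP model: psi^{(k)} = |F D^{(k)} x| for K random masks."""
    n = len(x)
    masks = rng.choice(np.array([1, -1, 1j, -1j]), size=(K, n))  # diag entries of D^{(k)}
    psi = np.stack([np.abs(np.fft.fft(d * x)) for d in masks])   # K x n amplitude data
    return psi, masks
```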
Implementing $K = 4$ masks, each algorithm performed, independently over each band, 100 power iterations for an initial guess, which was refined by 100 gradient iterations. The images recovered by TAF (left) and RAF (right) are displayed in Fig. 4, with relative errors of $1.0347 \times 10^{-3}$ and $1.0715 \times 10^{-3}$, respectively. WF and TWF returned images of corresponding relative errors $1.6870$ and $1.4211$, which are far from the ground truth.

Figure 4: Recovered Galaxy images after 100 gradient iterations of TAF (Left); and of RAF (Right).

5 Conclusion

This paper developed a linear-time algorithm called RAF for solving systems of random quadratic equations. Our procedure consists of two stages: a weighted maximal correlation initializer attainable with a few power or Lanczos iterations, and a sequence of scalable reweighted gradient refinements for a nonconvex, nonsmooth LS loss function. It was demonstrated that RAF achieves optimal sample and computational complexity. Judicious numerical tests showcase its superior performance over state-of-the-art alternatives. Empirically, RAF solves a set of random quadratic equations with high probability so long as a unique solution exists. Promising extensions include studying robust and/or sparse phase retrieval and matrix recovery via (stochastic) reweighted amplitude flow counterparts, and in particular exploiting the power of (re)weighting regularization techniques to enable more general nonconvex optimization, such as training deep neural networks [18].

Acknowledgments

G. Wang and G. B. Giannakis were partially supported by NSF grants 1500713 and 1514056. Y. Saad was partially supported by NSF grant 1505970. J. Chen was partially supported by the National Natural Science Foundation of China grants U1509215 and 61621063, and the Program for Changjiang Scholars and Innovative Research Team in University (IRT1208).

References

[1] R. Balan, P. Casazza, and D. Edidin, "On signal reconstruction without phase," Appl. Comput. Harmon. Anal., vol. 20, no. 3, pp. 345–356, May 2006.
[2] A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. SIAM, 2001, vol. 2.
[3] E. J. Candès, X. Li, and M. Soltanolkotabi, "Phase retrieval via Wirtinger flow: Theory and algorithms," IEEE Trans. Inf. Theory, vol. 61, no. 4, pp. 1985–2007, Apr. 2015.
[4] E. J. Candès, T. Strohmer, and V. Voroninski, "PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming," Commun. Pure Appl. Math., vol. 66, no. 8, pp. 1241–1274, Nov. 2013.
[5] E. J. Candès, X. Li, and M. Soltanolkotabi, "Phase retrieval from coded diffraction patterns," Appl. Comput. Harmon. Anal., vol. 39, no. 2, pp. 277–299, Sept. 2015.
[6] J. Chen, L. Wang, X. Zhang, and Q. Gu, "Robust Wirtinger flow for phase retrieval with arbitrary corruption," arXiv:1704.06256, 2017.
[7] Y. Chen and E. J. Candès, "Solving random quadratic systems of equations is nearly as easy as solving linear systems," in Adv. on Neural Inf. Process. Syst., Montréal, Canada, 2015, pp. 739–747.
[8] A. Conca, D. Edidin, M. Hering, and C. Vinzant, "An algebraic characterization of injectivity in phase retrieval," Appl. Comput. Harmon. Anal., vol. 38, no. 2, pp. 346–356, Mar. 2015.
[9] J. C. Duchi and F. Ruan, "Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval," arXiv:1705.02356, 2017.
[10] J. R. Fienup, "Phase retrieval algorithms: A comparison," Appl. Opt., vol. 21, no. 15, pp. 2758–2769, Aug. 1982.
[11] R. W. Gerchberg and W. O. Saxton, "A practical algorithm for the determination of phase from image and diffraction plane pictures," Optik, vol. 35, pp. 237–246, Nov. 1972.
[12] T. Goldstein and C. Studer, "PhaseMax: Convex phase retrieval via basis pursuit," arXiv:1610.07531v1, 2016.
[13] P. Hand and V. Voroninski, "An elementary proof of convex phase retrieval in the natural parameter space via the linear program PhaseMax," arXiv:1611.03935, 2016.
[14] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from a few entries," IEEE Trans. Inf. Theory, vol. 56, no. 6, pp. 2980–2998, Jun. 2010.
[15] Y. M. Lu and G. Li, "Phase transitions of spectral initialization for high-dimensional nonconvex estimation," arXiv:1702.06435, 2017.
[16] P. Netrapalli, P. Jain, and S. Sanghavi, "Phase retrieval using alternating minimization," in Adv. on Neural Inf. Process. Syst., Stateline, NV, 2013, pp. 2796–2804.
[17] P. M. Pardalos and S. A. Vavasis, "Quadratic programming with one negative eigenvalue is NP-hard," J. Global Optim., vol. 1, no. 1, pp. 15–22, 1991.
[18] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, "Regularizing neural networks by penalizing confident output distributions," arXiv:1701.06548, 2017.
[19] J. R. Rice, Numerical Methods in Software and Analysis. Academic Press, 1992.
[20] Y. Saad, Numerical Methods for Large Eigenvalue Problems: Revised Edition. SIAM, 2011.
[21] Y. Shechtman, Y. C. Eldar, O. Cohen, H. N. Chapman, J. Miao, and M. Segev, "Phase retrieval with application to optical imaging: A contemporary overview," IEEE Signal Process. Mag., vol. 32, no. 3, pp. 87–109, May 2015.
[22] M. Soltanolkotabi, "Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization," arXiv:1702.06175, 2017.
[23] J. Sun, Q. Qu, and J. Wright, "A geometric analysis of phase retrieval," Found. Comput. Math., 2017 (to appear); see also arXiv:1602.06664, 2016.
[24] I. Waldspurger, "Phase retrieval with random Gaussian sensing vectors by alternating projections," arXiv:1609.03088, 2016.
[25] I. Waldspurger, A. d'Aspremont, and S. Mallat, "Phase recovery, MaxCut and complex semidefinite programming," Math. Program., vol. 149, no. 1, pp. 47–81, 2015.
[26] G. Wang and G. B. Giannakis, "Solving random systems of quadratic equations via truncated generalized gradient flow," in Adv. on Neural Inf. Process. Syst., Barcelona, Spain, 2016, pp. 568–576.
[27] G. Wang, G. B. Giannakis, and J. Chen, "Scalable solvers of random quadratic equations via stochastic truncated amplitude flow," IEEE Trans. Signal Process., vol. 65, no. 8, pp. 1961–1974, Apr. 2017.
[28] G. Wang, G. B. Giannakis, and Y. C. Eldar, "Solving systems of random quadratic equations via truncated amplitude flow," IEEE Trans. Inf. Theory, 2017 (to appear); see also arXiv:1605.08285, 2016.
[29] X. Yi, C. Caramanis, and S. Sanghavi, "Alternating minimization for mixed linear regression," in Proc. Intl. Conf. on Mach. Learn., Beijing, China, 2014, pp. 613–621.
[30] H. Zhang, Y. Zhou, Y. Liang, and Y. Chi, "Reshaped Wirtinger flow and incremental algorithm for solving quadratic system of equations," J. Mach. Learn. Res., 2017 (to appear); see also arXiv:1605.07719, 2016.