Journal of Machine Learning Research 21 (2020) 1-39. Submitted 5/18; Revised 4/20; Published 6/20.

Agnostic Estimation for Misspecified Phase Retrieval Models

Matey Neykov (mneykov@stat.cmu.edu), Carnegie Mellon University, Department of Statistics & Data Science
Zhaoran Wang (zhaoranwang@gmail.com), Northwestern University, Department of Industrial Engineering and Management Sciences
Han Liu (hanliu@northwestern.edu), Northwestern University, Department of Electrical Engineering and Computer Science, Department of Statistics

Editor: Qiang Liu

Abstract. The goal of noisy high-dimensional phase retrieval is to estimate an s-sparse parameter β* ∈ ℝ^d from n realizations of the model Y = (X⊤β*)² + ε. Based on this model, we propose a significant semi-parametric generalization called misspecified phase retrieval (MPR), in which Y = f(X⊤β*, ε) with unknown f and Cov(Y, (X⊤β*)²) > 0. For example, MPR encompasses Y = h(|X⊤β*|) + ε with increasing h as a special case. Despite the generality of the MPR model, it eludes the reach of most existing semi-parametric estimators. In this paper, we propose an estimation procedure, which consists of solving a cascade of two convex programs and provably recovers the direction of β*. Furthermore, we prove that our procedure is minimax optimal over the class of MPR models. Interestingly, our minimax analysis characterizes the statistical price of misspecifying the link function in phase retrieval models. Our theory is backed up by thorough numerical results.

1. Introduction

In scientific and engineering fields researchers oftentimes face the problem of quantifying the relationship between a given outcome Y and a corresponding predictor vector X, based on a sample {(Y_i, X_i)}_{i=1}^n of n observations. In such situations it is common to postulate a linear working model, and search for a d-dimensional signal vector β* satisfying the following familiar relationship:

Y = X⊤β* + ε.
(1)

When the predictor X is high-dimensional in the sense that d ≫ n, it is commonly assumed that the underlying signal β* is s-sparse. In a certain line of applications, such as X-ray crystallography, microscopy, diffraction and array imaging, one can only measure the magnitude of X⊤β* but not its phase (i.e., its sign in the real domain). In this case, assuming model (1) may not be appropriate. To cope with such applications in the high-dimensional setting, Cai et al. (2015) proposed the thresholded Wirtinger flow (TWF), a procedure which consistently estimates the signal β* in the following real sparse noisy phase retrieval model:

Y = (X⊤β*)² + ε, (2)

where one additionally knows that the predictors have a Gaussian random design X ∼ N(0, I_d).

©2020 Matey Neykov, Zhaoran Wang, and Han Liu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-259.html.

In the present paper, taking an agnostic point of view, we recognize that both models (1) and (2) represent an idealized view of the data generating mechanism. In reality, the nature of the data could be better reflected through the more flexible viewpoint of a single index model (SIM):

Y = f(X⊤β*, ε), (3)

where f is an unknown link function, and it is assumed that ‖β*‖₂ = 1 for identifiability. A recent line of work on high-dimensional SIMs (Plan and Vershynin, 2015; Neykov et al., 2016) showed that under Gaussian designs, one can apply ℓ1 regularized least squares to successfully estimate the direction of β* and its support. The crucial condition allowing for the above somewhat surprising application turns out to be:

Cov(Y, X⊤β*) ≠ 0. (4)

While condition (4) is fairly generic, encompassing cases with a binary outcome, such as logistic regression and one-bit compressive sensing (Boufounos and Baraniuk, 2008), it fails to capture the phase retrieval model (2).
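This distinction is easy to see in simulation. The following sketch (our own illustration, not from the paper; numpy assumed) draws data from the phase retrieval model (2) and estimates both covariances: the first-moment condition (4) fails for this even link, while the second-moment condition introduced below holds.

```python
import numpy as np

# Simulate the phase retrieval model (2): Y = (X'beta*)^2 + eps.
rng = np.random.default_rng(0)
n, d, s = 200_000, 10, 3
beta = np.zeros(d)
beta[:s] = s ** -0.5                      # s-sparse unit vector beta*
X = rng.standard_normal((n, d))           # Gaussian design X ~ N(0, I_d)
index = X @ beta                          # X'beta* ~ N(0, 1)
Y = index ** 2 + rng.standard_normal(n)

cov_linear = np.cov(Y, index)[0, 1]          # condition (4): zero for even links
cov_quadratic = np.cov(Y, index ** 2)[0, 1]  # Cov(Y, (X'beta*)^2) = Var(Z^2) = 2
```

With two hundred thousand draws, the linear covariance is within Monte Carlo error of zero, while the quadratic covariance concentrates near 2.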
More generally, it is easy to see that when the link function f is even in its first coordinate, condition (4) fails to hold. The goal of the present manuscript is to formalize a class of SIMs which includes the noisy phase retrieval model as a special case, in addition to various other additive and non-additive models with even link functions, and to develop a procedure that can successfully estimate the direction of β* up to a global sign. Formally, we consider models (3) with Gaussian design that satisfy the following moment assumption:

Cov(Y, (X⊤β*)²) > 0. (5)

Unlike (4), one can immediately check that condition (5) is satisfied by model (2). In Section 2 we give multiple examples, both abstract and concrete, of SIMs obeying this constraint. Our second moment constraint (5) can be interpreted as a semi-parametric robust version of phase retrieval. Hence, we will refer to the class of models satisfying condition (5) as misspecified phase retrieval (MPR) models. From this point of view it is worth noting that condition (4) relates to linear regression in a way similar to how condition (5) relates to the phase retrieval model. Our motivation for studying SIMs under such a constraint can ultimately be traced to the vast sufficient dimension reduction (SDR) literature. In particular, we would like to point out Li (1992) as a source of inspiration.

Contributions. Our first contribution is to formulate a novel and easily implementable two-step procedure, which consistently estimates the direction of β* in an MPR model. In the first step we solve a semidefinite program producing a unit vector v̂, such that |v̂⊤β*| is sufficiently large. Once such a pilot estimate is available, we solve an ℓ1 regularized least squares problem on the augmented outcome Ỹ_i = (Y_i − Ȳ)X_i⊤v̂, where Ȳ is the average of the Y_i's, to produce a second estimate b̂, which is then normalized to obtain the final refined estimator β̂ = b̂/‖b̂‖₂.
In addition to being universally applicable to MPR models, our procedure has an algorithmic advantage in that it relies solely on convex optimization, and as a consequence we can obtain the corresponding global minima of the two convex programs in polynomial time.

Our second contribution is to rigorously demonstrate that the above procedure consistently estimates the direction of β*. We prove that for a given MPR model, with high probability, one has

min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ ≲ √(s log d / n),

provided that the sample size n satisfies n ≳ s² log d. While the same rates (with different constants) hold for the TWF algorithm of Cai et al. (2015) in the special case of the noisy phase retrieval model, our procedure provably achieves these rates over the broader class of MPR models.

Our third contribution is to formulate a sub-class H of additive MPR models Y = h(X⊤β*) + ε for h ∈ H, for which the above estimation rate turns out to be tight (in the cases with s = O(d^{1−δ}) for some δ > 0). In fact we show that even when the data is generated by the most favorable model out of the sub-class H, we have

inf_{h ∈ H} inf_{β̂} sup_{‖β*‖₀ ≤ s, ‖β*‖₂ = 1} E_{h,β*} min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ ≳ √(s log(d/s)/n).

The model class H encompasses the noisy phase retrieval model (2) as a special case. The bound above implies the standard minimax lower bound over the entire class of MPR models. Note that our bound is strictly stronger than standard minimax lower bounds, since it allows taking the easiest link function out of the class in addition to the worst case parameter β*. In contrast, in standard minimax lower bounds, we take the hardest link function together with the worst case parameter. On the technical side, the proof of the lower bound uses a more transparent technique in comparison with existing proofs of related lower bounds, which are specific to the standard noisy phase retrieval model.
Interestingly, our minimax analysis characterizes the statistical price of misspecifying the link function in phase retrieval models, i.e., the price of agnosticity, which will be discussed in Section 4.

Related Work. The phase retrieval model has received considerable attention in recent years from the statistics, applied mathematics, and signal processing communities. For the non-sparse version of (2), efficient algorithms have been suggested based on both semidefinite programs (Candès et al., 2013; Candès et al., 2015a) and non-convex optimization methods that extend gradient descent (Candès et al., 2015b). Additionally, a non-traditional instance of the phase retrieval model (which also happens to be a special case of the MPR model) was considered by Chen et al. (2013), where the authors suggested an estimation procedure originally proposed for the problem of mixed regression. For the noisy sparse version of model (2), near optimal solutions were achieved with a computationally infeasible program by Lecué and Mendelson (2013b). Subsequently, a tractable gradient descent approach achieving minimax optimal rates was developed by Cai et al. (2015). Abstracting away from the phase retrieval or linear model settings, we note that inference for SIMs in the case when d is small or fixed has been studied extensively in the literature (e.g., Xia and Li, 1999; Horowitz, 2009; Peng and Huang, 2011; McCullagh and Nelder, 1989, among many others). In another line of research on SDR, seminal insights shedding light on condition (4) can be found in, e.g., Li and Duan (1989); Li (1991); Cook and Ni (2005). The modified condition (5) traces its roots to Li (1992), where a procedure was designed to handle precisely the situations where (4) fails to hold. More recently, there have been active developments for high-dimensional SIMs. Plan and Vershynin (2015) and later Thrampoulidis et al.
(2015) demonstrated that under condition (4), running least squares with ℓ1 regularization yields a consistent estimate of the direction of β*, while Neykov et al. (2016) showed that this procedure also recovers the signed support of the direction. Excess risk bounds were derived in Ganti et al. (2015). Very recently, Genzel (2016) extended this observation to other convex loss functions under a condition corresponding to (4), depending implicitly on the loss function of interest. Radchenko (2015) proposed a non-parametric least squares with an equality ℓ1 constraint to handle simultaneous estimation of β* as well as f. Han and Wang (2015) considered a smoothed-out U-process type of loss function with ℓ1 regularization, and proved their approach works for a sub-class of functions satisfying condition (4). None of the aforementioned works on SIMs can be directly applied to tackle the MPR class (5). A generic procedure for estimating sparse principal eigenvectors was developed in Yuan and Zhang (2013). While in principle this procedure can be applied to estimate the direction in MPR models, it requires proper initialization, and in addition, it requires knowledge of the sparsity of the vector β*. We discuss this approach in more detail in Section 5. Regularized procedures have also been proposed for specific choices of f and Y. For example, Yi et al. (2015) studied consistent estimation under the model P(Y = 1 | X) = (h(X⊤β*) + 1)/2 with binary Y, where h: ℝ → [−1, 1] is possibly unknown. Their procedure is based on taking pairs of differences in the outcome, and therefore replaces condition (4) with a different type of moment condition. Yang et al. (2015) considered the model Y = h(X⊤β*) + ε with a known, continuously differentiable, and monotonic h, and developed estimation and inferential procedures based on the ℓ1 regularized quadratic loss, in a similar spirit to the TWF algorithm suggested by Cai et al. (2015).
In conclusion, although there exists much prior related work, to the best of our knowledge, none of the available literature discusses the MPR models in the generality we attempt in the present manuscript.

Notation. In this section we briefly outline some commonly used notation. Other notation will be defined as needed throughout the paper. For a (sparse) vector v = (v1, ..., vp)⊤, we let S_v := supp(v) = {j : v_j ≠ 0} denote its support, ‖v‖_p denote the ℓp norm (with the usual extension when p = ∞), and v^{⊗2} = vv⊤ is a shorthand for the outer product. With a standard abuse of notation we will denote by ‖v‖₀ = |supp(v)| the cardinality of the support of v. For a matrix A we denote the max and ℓ2 norms by ‖A‖_max = max_{i,j} |A_{ij}| and ‖A‖₂ = sup_{‖v‖₂=1} ‖Av‖₂ respectively. If A is symmetric we denote its spectrum, ordered in a decreasing manner, by λ_j(A). We often use I_d to denote a d × d identity matrix. We will also use ⊙ to denote the Hadamard (or element-wise) product, and the dot product will sometimes be denoted with the angle notation ⟨·, ·⟩. For a real random variable X, define

‖X‖_{ψ2} = sup_{p ≥ 1} p^{−1/2} (E|X|^p)^{1/p},  ‖X‖_{ψ1} = sup_{p ≥ 1} p^{−1} (E|X|^p)^{1/p}.

Recall that a random variable is called sub-Gaussian if ‖X‖_{ψ2} < ∞ and sub-exponential if ‖X‖_{ψ1} < ∞ (e.g., Vershynin, 2010). For any integer k ∈ ℕ we use the shorthand notation [k] = {1, ..., k}. We also use standard asymptotic notation. Given two sequences {a_n}, {b_n} we write a_n = O(b_n) if there exists a constant C < ∞ such that a_n ≤ C b_n, a_n = Ω(b_n) if there exists a positive constant c > 0 such that a_n ≥ c b_n, and a_n ≍ b_n if there exist positive constants c and C such that c < a_n/b_n < C.

Organization. In Sections 2 and 3 we introduce the MPR model class and our estimation procedure, and Section 4 is dedicated to our main results. A brief discussion is provided in Section 6. We defer the technical proofs to the appendices due to space limitations.

2. MPR Models

In this section we formally introduce MPR models.
In detail, we argue that the class of such models is sufficiently rich, including numerous models of interest. Motivated by the setup in the sparse noisy phase retrieval model (2), we assume throughout the remainder of the paper that X ∼ N(0, I_d). Let ‖·‖_{ψ1} be the sub-exponential norm of a random variable, which is also known as the Orlicz ψ1 norm. We begin our discussion with a formal definition.

Definition 1 (MPR Models). Assume that we are given model (3), where X ∼ N(0, I_d), ε ⟂ X and β* ∈ ℝ^d is an s-sparse unit vector, i.e., ‖β*‖₂ = 1. We call such a model a misspecified phase retrieval (MPR) model, if the link function f and noise ε further satisfy, for Z ∼ N(0, 1) and some K > 0,

c0 := Cov(f(Z, ε), Z²) > 0, (6)
‖Y‖_{ψ1} ≤ K. (7)

Both assumptions (6) and (7) impose moment restrictions on the random variable Y. Assumption (6) states that Y is positively correlated with the random variable (X⊤β*)², while assumption (7) requires Y to have somewhat light tails. Also, as mentioned in the introduction, the unit norm constraint on the vector β* is required for the identifiability of model (3). We remark that the class of MPR models is convex in the sense that if we have two MPR models f1(X⊤β*, ε) and f2(X⊤β*, ε), all models generated by their convex combinations λf1(X⊤β*, ε) + (1 − λ)f2(X⊤β*, ε) (λ ∈ [0, 1]) are also MPR models. It is worth noting that the > direction in (6) is assumed without loss of generality: if Cov(Y, (X⊤β*)²) < 0 one can apply the same algorithm to −Y. However, knowledge of the direction of the inequality is important. In the following, we restate condition (6) in a more convenient way, enabling us to easily calculate the explicit value of the constant c0 in several examples.

Proposition 2. Assume that there exists a version of the map ϕ: z ↦ E[f(Z, ε) | Z = z] such that E D²ϕ(Z) > 0, where D² is the second distributional derivative of ϕ and Z ∼ N(0, 1). Then the SIM (3) satisfies assumption (6) with c0 = E D²ϕ(Z).
We now provide three concrete MPR models as warm-up examples for the more general examples discussed in Proposition 3 and Remark 4. Consider the models:

Y = (X⊤β*)² + ε, (8)
Y = |X⊤β*| + ε, (9)
Y = |X⊤β* + ε|, (10)

where ε ⟂ X is sub-exponential noise, i.e., ‖ε‖_{ψ1} ≤ K_ε for some K_ε > 0. Model (8) is the noisy phase retrieval model considered by Cai et al. (2015), while models (9) and (10) were both discussed in Chen et al. (2013), where the authors proposed a method to solve model (10) in the low-dimensional regime. Below we briefly explain why these models satisfy conditions (6) and (7). First, observe that in all three models we have a sum of two sub-exponential random variables, and hence by the triangle inequality it follows that the random variable Y is also sub-exponential, which implies (7). To understand why (6) holds, by applying Proposition 2 we have c0 = E[2] = 2 > 0 for model (8), c0 = E[2δ₀(Z)] = 2/√(2π) > 0 for model (9), and c0 = E[2δ₀(Z + ε)] = 2Eφ(ε) > 0 for model (10), where δ₀(·) is the Dirac delta function centered at zero, and φ is the density of the standard normal distribution. Admittedly, calculating the second distributional derivative could be a laborious task in general. In the remainder of this section we set out to find a simple-to-check generic sufficient condition on the link function f and error term ε, under which both (6) and (7) hold. Before giving our result, note that the condition E D²ϕ(Z) > 0 is implied whenever ϕ is strictly convex and twice differentiable. However, strictly convex functions ϕ may violate assumption (7), as they can inflate the tails of Y arbitrarily (consider, e.g., f(x, ε) = x⁴ + ε). Moreover, the functions in examples (9) and (10) fail to be twice differentiable. In the following result we handle those two problems, and in addition we provide a more generic condition than convexity, which suffices to ensure the validity of (6).

Proposition 3. The following statements hold.
(i) Let the function ϕ defined in Proposition 2 be such that the map z ↦ ϕ(z) + ϕ(−z) is non-decreasing on ℝ⁺ ∪ {0}, and there exist z1 > z2 > 0 such that ϕ(z1) + ϕ(−z1) > ϕ(z2) + ϕ(−z2). Then (6) holds.

(ii) A sufficient condition for (i) to hold is that z ↦ ϕ(z) is convex and sub-differentiable at every point z ∈ ℝ, and there exists a point z0 ∈ ℝ⁺ ∪ {0} satisfying ϕ(z0) + ϕ(−z0) > 2ϕ(0).

(iii) Assume that there exist functions g1, g2 such that f(Z, ε) ≤ g1(Z) + g2(ε), and g1 is essentially quadratic in the sense that there exists a closed interval I = [a, b] with 0 ∈ I, such that for all z satisfying g1(z) ∈ I^c we have |g1(z)| ≤ Cz² for a sufficiently large constant C > 0, and let g2(ε) be sub-exponential. Then (7) holds.

Remark 4. Proposition 3 shows that the class of MPR models is sufficiently broad. By (i) and (ii) it immediately follows that the additive models

Y = h(X⊤β*) + ε, (11)

where the link function h is even and increasing on ℝ⁺ ∪ {0}, or convex, satisfy the covariance condition (6) by (i) and (ii) of Proposition 3 respectively. If h is also essentially quadratic and ε is sub-exponentially distributed, using (iii) we can deduce that Y in (11) is a sub-exponential random variable, and hence under these restrictions model (11) is an MPR model. Both examples (8) and (9) take this form. Additionally, Proposition 3 implies that the model

Y = h(X⊤β* + ε) (12)

satisfies (6) whenever the link h is a convex sub-differentiable function, such that h(z0) + h(−z0) > 2h(0) for some z0 > 0, E|h(z + ε)| < ∞ for all z ∈ ℝ and E|h(Z + ε)| < ∞. This conclusion follows because under the latter conditions the function ϕ(z) = Eh(z + ε) satisfies (ii), which is proved in Appendix B under Lemma 13. Moreover, if it turns out that h is essentially quadratic and h(2ε) is sub-exponential, then by Jensen's inequality we have 2h(Z + ε) ≤ h(2Z) + h(2ε), and hence (iii) implies that (7) is also satisfied. Model (10) is of the type (12).
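The c0 values derived for models (8)-(10) via Proposition 2 can be checked by simulation. A minimal sketch (ours, not the paper's code; numpy assumed, taking ε ∼ N(0,1) throughout, in which case 2Eφ(ε) = 2∫φ(x)² dx = 1/√π):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
Z = rng.standard_normal(n)
eps = rng.standard_normal(n)

def c0_hat(Y):
    """Monte Carlo estimate of c0 = Cov(Y, Z^2)."""
    return np.cov(Y, Z ** 2)[0, 1]

c0_model8 = c0_hat(Z ** 2 + eps)       # Proposition 2: c0 = 2
c0_model9 = c0_hat(np.abs(Z) + eps)    # Proposition 2: c0 = 2/sqrt(2*pi) ~ 0.798
c0_model10 = c0_hat(np.abs(Z + eps))   # Proposition 2: c0 = 2*E[phi(eps)] = 1/sqrt(pi) ~ 0.564
```

All three estimates land within Monte Carlo error of the closed-form constants, illustrating that condition (6) holds strictly for each model.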
Unlike the additive noise models (11), models (12) allow noise corruption even within the argument of the link function. On the negative side, it should be apparent that (6) fails to hold in cases where ϕ is an odd function, i.e., ϕ(−z) = −ϕ(z). In many such cases (e.g., when ϕ is monotone, or non-constant and non-positive/non-negative on ℝ⁺), one would have Cov(Y, X⊤β*) = E[ϕ(Z)Z] ≠ 0, and hence a direct application of the ℓ1 regularized least squares algorithm is possible, as we discussed in the introduction.

3. Agnostic Estimation for MPR

In this section we describe and motivate our two-step procedure, which consists of a convex relaxation and an ℓ1 regularized least squares program, for performing estimation in the MPR class of models described by Definition 1. We begin our motivation by noting that any MPR model satisfies the following inequality:

Cov((Y − µ)X⊤β*, X⊤β*) = E{(Y − µ)(X⊤β*)²} = Cov(f(Z, ε), Z²) = c0 > 0, (13)

where we have denoted µ := EY. This simple observation plays a major role in the motivation of our procedure. Notice that, in view of condition (4), inequality (13) implies that if instead of observing Y we had observed Ỹ = g(X⊤β*, ε) = (Y − µ)X⊤β*, we would be in a position to apply the LASSO on Ỹ. However, there is no direct way of generating the random variable Ỹ, as doing so would require the knowledge of β* and the mean µ.

Figure 1: The geometry of decomposition (14), in a case when v̂⊤β* > 0.

Figure 2: An illustration of the estimates v̂ and β̂ produced by the first and second steps of Algorithm 1. (a) Initialization: after the first step we can guarantee that the vector β* belongs to one of two spherical caps which contain all vectors w such that |v̂⊤w| ≥ κ for some constant κ > 0, provided that the sample size n ≳ s² log d is sufficiently large. (b) Second step: after the second step we can guarantee that the vector β* belongs to one of two spherical caps, which are shrinking with (n, s, d) at an optimal rate.
Here, we propose to first roughly estimate β* by a vector v̂, use an empirical estimate Ȳ of µ, and then obtain the ℓ1 regularized least squares estimate on the augmented variable Ỹ = (Y − Ȳ)X⊤v̂ to sharpen the convergence rate. At first glance it might appear counter-intuitive that introducing a noisy estimate of β* can lead to consistent estimates, as the so-defined Ỹ variable depends on the projection of X on span{β*, v̂}. Decompose

v̂ = (v̂⊤β*)β* + v̂⊥, (14)

where v̂⊥ ⟂ β*. A visualization of decomposition (14) is provided in Figure 1 for convenience. To better motivate this proposal, in the following we analyze the population least squares fit, based on the augmented variable Y̌ = (Y − µ)X⊤v̂ for some fixed unit vector v̂ with decomposition (14). Writing out the population solution for least squares yields:

[E X^{⊗2}]^{−1} E[X Y̌] = E[X(Y − µ)X⊤(v̂⊤β*)β*] (=: I1) + E[X(Y − µ)X⊤v̂⊥] (=: I2). (15)

We will now argue that the left hand side of (15) is proportional to β*. First, we observe that I1 = c0(v̂⊤β*)β*, since multiplying by any vector b ⟂ β* yields b⊤I1 = 0 by independence. Second, and perhaps more importantly, we have that I2 = 0. To see this, we first take a vector b ∈ span{β*, v̂⊥}⊥. Since the three variables b⊤X, Y − µ and v̂⊥⊤X are independent, we have b⊤I2 = 0. Multiplying by β* we have β*⊤I2 = 0, since β*⊤X(Y − µ) is independent of X⊤v̂⊥. Finally, multiplying by v̂⊥ yields I2⊤v̂⊥ = 0, since (X⊤v̂⊥)² is independent of Y − µ. It is noteworthy that the above derivation crucially relies on the fact that the Y variable was centered and the vector v̂ was fixed. In what follows we formulate a pilot procedure which produces an estimate v̂ such that |v̂⊤β*| ≥ κ > 0. A proper initialization can be achieved by using a spectral method, such as the Principal Hessian Directions (PHD) proposed by Li (1992). Cast into the framework of SIMs, the PHD framework implies the following simple observation:

Lemma 5. If we have an MPR model, then argmax_{‖v‖₂=1} v⊤ E[Y(X^{⊗2} − I)]v = {−β*, β*}.
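Lemma 5 can be illustrated numerically: for data from the noisy phase retrieval model, the principal eigenvector of the sample version of E[Y(X^{⊗2} − I)] aligns with ±β*. A simulation sketch (ours, not the paper's code; numpy assumed, with d kept small so no sparsity is exploited):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, s = 50_000, 20, 3
beta = np.zeros(d)
beta[:s] = s ** -0.5
X = rng.standard_normal((n, d))
Y = (X @ beta) ** 2 + rng.standard_normal(n)

# Sample version of E[Y (X X' - I_d)]; its population value is c0 * beta beta'.
Sigma_hat = (X * Y[:, None]).T @ X / n - Y.mean() * np.eye(d)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
v = eigvecs[:, -1]                 # principal eigenvector
alignment = abs(v @ beta)          # close to 1, up to the global sign
```

The top eigenvalue concentrates near c0 = 2, and the eigenvector recovers the direction of β* up to sign, matching the lemma.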
Note that similar observations have been made in the non-sparse well-specified case by other authors in the phase retrieval literature (Candès et al., 2015b; Lu and Li, 2017; Mondelli and Montanari, 2017). A proof of Lemma 5 can be found in Appendix B. Lemma 5 encourages us to look into the following sample version of the maximization problem:

argmax_{‖v‖₂=1, ‖v‖₀=s} n^{−1} v⊤ Σ_{i=1}^n [Y_i(X_i^{⊗2} − I)] v, (16)

which targets a restricted (s-sparse) principal eigenvector. Unfortunately, solving such a problem is a computationally intensive task, and requires knowledge of s. Here we take a standard route of relaxing the above problem to a convex program, and solving it efficiently via semidefinite programming (SDP). A similar in spirit SDP relaxation for solving sparse PCA problems was originally proposed by d'Aspremont et al. (2007). Instead of solving (16), define Σ̂ = n^{−1} Σ_{i=1}^n Y_i(X_i^{⊗2} − I), and solve the following convex program:

Â = argmax_{tr(A)=1, A ∈ S₊^d} tr(Σ̂A) − λ_n Σ_{i,j=1}^d |A_{ij}|, (17)

where S₊^d is the convex cone of positive semidefinite matrices, and λ_n is a regularization parameter encouraging element-wise sparsity in the matrix A. The hope in introducing the optimization program above is that Â will be a good first estimate of β*^{⊗2}. In practice it could turn out that the matrix Â is not rank one, hence we suggest taking v̂ as the principal eigenvector of Â. In theory we show that with high probability the matrix Â will indeed be of rank one. Observation (15), Lemma 5 and the SDP formulation motivate the agnostic two-step estimation procedure for misspecified phase retrieval in Algorithm 1.

Algorithm 1
input: (Y_i, X_i)_{i=1}^n: data; λ_n, ν_n: tuning parameters
1. Split the sample into two approximately equal sets S1, S2, with |S1| = ⌊n/2⌋, |S2| = ⌈n/2⌉.
2. Let Σ̂ := |S1|^{−1} Σ_{i ∈ S1} Y_i(X_i^{⊗2} − I_d). Solve (17). Let v̂ be the first eigenvector of Â.
3. Let Ȳ = |S2|^{−1} Σ_{i ∈ S2} Y_i.
Solve the following program:

b̂ = argmin_b (2|S2|)^{−1} Σ_{i ∈ S2} ((Y_i − Ȳ)X_i⊤v̂ − X_i⊤b)² + ν_n ‖b‖₁. (18)

4. Return β̂ := b̂/‖b̂‖₂.

The sample split is required to ensure that after decomposition (14), the vector v̂⊥ and the value v̂⊤β* are independent of the remaining sample. In Section 4 we demonstrate that Algorithm 1 succeeds with the optimal (in the noisy regime) ℓ2 rate √(s log d / n), provided that s² log d ≲ n. The latter requirement on the sample size suffices to guarantee that the solution Â of optimization program (17) is rank one. Figure 2 illustrates the two steps of Algorithm 1. In addition to our main procedure, we propose an optional refinement step (Algorithm 2) in which one applies steps 3 and 4 of Algorithm 1 on the full dataset, using the output vector β̂ of Algorithm 1, without hurting the rate. Doing so can potentially result in additional stability and further refinements of the rate constant.

Algorithm 2 (Optional Refinement)
input: (Y_i, X_i)_{i=1}^n: data; ν'_n: tuning parameter; output β̂ from Algorithm 1
5. Let Ȳ = n^{−1} Σ_{i ∈ [n]} Y_i. Solve the following program:

b̂ = argmin_b (2n)^{−1} Σ_{i=1}^n ((Y_i − Ȳ)X_i⊤β̂ − X_i⊤b)² + ν'_n ‖b‖₁. (19)

6. Return β̂ := b̂/‖b̂‖₂.

4. Main Results

In this section we present our main results, which consist of theoretical justification of our procedures, as well as lower bounds for certain types of SIM (3). To simplify the presentation in this section, we slightly change the notation and assume that the sample size is 2n, with S1 = [n] and S2 = {n + 1, ..., 2n}. Of course this abuse of notation does not restrict our analysis to only even sample size cases. In Proposition 15, we show that the first step of Algorithm 1 narrows down the search for the direction of β* to a union of two spherical caps (i.e., the estimate v̂ satisfies |v̂⊤β*| ≥ κ for some constant κ > 0; see also Figure 2a).
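To make the two-step structure above concrete, here is a simplified numpy sketch of Algorithm 1. It is not the paper's implementation: the SDP (17) is replaced by a plain spectral initialization (adequate when d is modest relative to n), program (18) is solved by ISTA rather than a generic solver, and the function names and tuning value are our own choices.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def mpr_two_step(X, Y, nu, n_iter=500):
    """Simplified sketch of Algorithm 1: spectral init + l1 least squares."""
    n, d = X.shape
    half = n // 2
    X1, Y1 = X[:half], Y[:half]          # sample S1 (initialization)
    X2, Y2 = X[half:], Y[half:]          # sample S2 (refinement)

    # Step 1 (stand-in for SDP (17)): principal eigenvector of Sigma_hat.
    Sigma_hat = (X1 * Y1[:, None]).T @ X1 / half - Y1.mean() * np.eye(d)
    v = np.linalg.eigh(Sigma_hat)[1][:, -1]

    # Step 2, program (18): lasso on the augmented outcome
    # Y_tilde_i = (Y_i - Ybar) * X_i'v, solved by ISTA.
    Y_tilde = (Y2 - Y2.mean()) * (X2 @ v)
    m = X2.shape[0]
    step = m / np.linalg.norm(X2, 2) ** 2     # 1/L for the quadratic part
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = X2.T @ (X2 @ b - Y_tilde) / m
        b = soft_threshold(b - step * grad, step * nu)
    return b / np.linalg.norm(b)              # step 4: normalize

rng = np.random.default_rng(3)
n, d, s = 10_000, 50, 3
beta = np.zeros(d)
beta[:s] = s ** -0.5
X = rng.standard_normal((n, d))
Y = (X @ beta) ** 2 + rng.standard_normal(n)  # model (8)
beta_hat = mpr_two_step(X, Y, nu=0.1)
error = min(np.linalg.norm(beta_hat - beta), np.linalg.norm(beta_hat + beta))
```

The direction is only identified up to a global sign, hence the minimum over ±β* in the error; on draws of this size the error is small.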
Our main result demonstrates that, in combination with program (18), this suffices to recover the direction of β* at an optimal rate with high probability.

Theorem 6. There exist values λ_n, ν_n ≍ √(log d / n) and a constant R > 0 depending on f and ε, such that if s√(log d / n) < R and log(d) log²(n)/n = o(1), the output of Algorithm 1 satisfies:

sup_{‖β*‖₂=1, ‖β*‖₀≤s} P_{β*}( min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ > L √(s log d / n) ) ≤ O(d^{−1} ∨ n^{−1}), (20)

where L is a constant depending solely on f and ε.

We remark that although the estimation rate is of the order √(s log d / n), our procedure still requires that s√(log d / n) is sufficiently small. This phenomenon is similar to what has been observed by Cai et al. (2015), and it is our belief that this requirement cannot be relaxed for computationally feasible algorithms. We would further like to mention that while in bound (20) we control the worst case probability of failure, it is less clear whether the estimate β̂ is universally consistent (i.e., whether the sup can be moved inside the probability in (20)). In the following theorem we show that the additional refinement in Algorithm 2 does not hurt the rate obtained in Theorem 6. The practical benefits of such an additional improvement will become evident in the numerical studies section.

Theorem 7. Assume that the data satisfies the assumptions in Theorem 6 with sufficiently small R and tuning parameter ν_n. Additionally, let s log²(n)/n = O(1) and n > R′ s²(log d ∨ log n) for some sufficiently large R′ > 0. Then there exists a value ν'_n for which the output of Algorithm 2 satisfies:

sup_{‖β*‖₂=1, ‖β*‖₀≤s} P_{β*}( min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ > L √(s log d / (2n)) ) ≤ O(d^{−1} ∨ (2n)^{−1} ∨ exp(−Ω(s))),

where L is a constant depending solely on f and ε.

We would like to point out that the proof of this result is a non-trivial consequence of our main result, Theorem 6.
Notice that Theorem 7 does not guarantee that the refinement will result in a strictly better estimate, although intuitively one expects to improve the precision and stabilize the estimator by using the full dataset. The proof can further be seen to imply that one can repeat the optional refinement a (fixed and finite) number of times without hurting the rate. We discourage iterating the refinement step more than a couple of times since, first, this might hurt the probability bound in Theorem 7, and second, it could potentially increase the rate constant. In practice we observe that running Algorithm 2 once typically suffices to obtain satisfactory results.

Lower Bound. In this section we investigate minimax lower bounds in the noisy setting. Comments on the performance of algorithms that solve the SIM (3) under the constraints (6) and (7) in the noiseless setting can be found towards the end of this section. We begin by defining the following class of functions:

C(K, M, L) := {h : h is K times cont. diff., ∀k ≤ K : E_{Z∼N(0,1)}(h^{(k)}(Z))² ≤ M, ‖h^{(K)}‖_Lip ≤ L}, (21)

where ‖g‖_Lip := sup_{x ≠ y} |g(x) − g(y)|/|x − y| for g: ℝ → ℝ. Notice that in definition (21) we allow for K = 0, in which case we simply require the function h to be L-Lipschitz. Next, define the classes C(L) := C(0, M, L) and H(K, M, L) := ∪_{k=0}^K C(k, M, L). Fix K, M, L ∈ ℝ⁺. We will use H̄(K, M, L) := conv{H(K, M, L)} to denote the convex hull of the functional class H(K, M, L). We remind the reader that taking the convex hull amounts to the following collection of functions: {f : ∃N ∈ ℕ, {g_i}_{i=1}^N ⊆ H(K, M, L), {w_i}_{i=1}^N ⊆ ℝ⁺ ∪ {0}, Σ_{i=1}^N w_i = 1 s.t. f = Σ_{i=1}^N w_i g_i}. For completeness we note that taking the convex hull of the class C(L) is tautological, as C(L) = conv{C(L)}.

Theorem 8. Let us observe n data points from the additive SIM:

Y = h(X⊤β*) + ε.
(22)

Assuming that s log(d/(4s)) ≥ 12 log 2, the following information theoretic lower bounds hold:

(i) Assume p_ε(x) ∝ exp(−P(x²)), where P is a known, non-constant, finite degree polynomial with non-negative coefficients such that P(0) = 0. Then

inf_{h ∈ C(L)} inf_{β̂} sup_{‖β*‖₂=1, ‖β*‖₀≤s} E_{h,β*} min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ ≥ C_{P,L} √(s log(d/(4s)) / n), (23)

where C_{P,L} is a constant depending on P and L.

(ii) What is more, in the special case ε ∼ N(0, σ²), we have:

inf_{h ∈ H̄(K,M,L)} inf_{β̂} sup_{‖β*‖₂=1, ‖β*‖₀≤s} E_{h,β*} min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ ≥ σ √( s log(d/(4s)) / (12 n (KM + L²)) ) ∧ 1. (24)

A few remarks regarding known lower bounds of this nature are in order. General lower bounds for models with additive Gaussian noise and sub-Gaussian designs have been established by Lecué and Mendelson (2013a). These bounds differ from the bounds (23) and (24) in two major aspects: first, cast in our setting, the bounds of Lecué and Mendelson (2013a) lower bound the prediction accuracy, rather than the estimation error; and second, our lower bound takes an infimum over the functional classes C(L) or H̄(K, M, L), in the spirit of the lower bound produced by Yi et al. (2015). Such bounds have the compelling interpretation that even with the easiest function living in C(L) or H̄(K, M, L), one cannot hope to do better than (23) or (24) in the corresponding noise variants (i) or (ii). Note that the bound above is much stronger than standard minimax lower bounds, which take the hardest link function together with the worst case parameter. Consequently, the bound above implies the standard minimax lower bound within the class of MPR models. Additionally, in contrast to the bound proved in Lecué and Mendelson (2013a), our lower bound is established via a classical Fano inequality type of argument, and hence the proof is more transparent in nature. For any fixed k ∈ ℕ, it can be shown (see Lemma 14) that monomials of the type h(x) = x^k belong to the class C(k−1, (k!)², k!) ⊆ H(k−1, (k!)², k!).
Also, clearly we have |x| ∈ C(1), and hence {|x|, x²} ⊂ H(1, 4, 2). The latter shows that models (8) and (9) belong to H(1, 4, 2). In this sense, in the case where ‖β*‖₂ = 1, Theorem 3.2 in Cai et al. (2015), a version of which is proved in Lecué and Mendelson (2013b), is implied by Theorem 8 (ii). On an important note, H(1, 4, 2) also contains models that are not captured by our framework, such as the linear model Y = X^⊤β* + ε. In conclusion, as we illustrated, the classes H̄(K, M, L) have a nonempty intersection with the class of MPR models of the type (11). For all models lying in this intersection, as soon as σ ≥ σ₀ > 0 in case (ii), Algorithms 1 and 2 are minimax optimal up to a constant multiplier.

Statistical Price of Agnosticity. The lower bounds in Theorem 8 hold only in cases where noise is present in the model. If one is specifically interested in solving the noiseless model (8), for example, the TWF algorithm can recover the vector with essentially no error, as demonstrated by Cai et al. (2015). Our procedure described in Algorithm 1, however, does not come with such a guarantee for the noiseless model (8), although (20) continues to hold. In this section we argue that having an algorithm with no error universally over the entire MPR class is generally impossible, even in the noiseless case, in the worst-case sense. To see why, consider the model:

Y = sign(X^⊤β* + c),   (25)

for a fixed c < 0. It is easy to check that the map z ↦ sign(z + c) + sign(−z + c) is non-decreasing and non-constant on R₊. By Proposition 3 we conclude that model (25) satisfies (6) and (7). Model (25) is very similar to one-bit compressive sensing, introduced by Boufounos and Baraniuk (2008); the only difference is that in the original formulation the constant c = 0. Theorem 1 of Jacques et al. (2013) establishes a lower bound for the one-bit compressive sensing model in the noiseless case: they lower bound the worst-case error by at least s/n whenever n ≳ s^{3/2}.
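The monotonicity claim for model (25) is easy to check numerically. The sketch below is our own illustration: it tabulates g(z) = sign(z + c) + sign(−z + c) on a grid over R₊ for an arbitrarily chosen c < 0 and confirms that g is non-decreasing and non-constant (it steps from −2 up to 0 at z = |c|).

```python
import numpy as np

c = -0.75                                  # any fixed c < 0
z = np.linspace(0.0, 3.0, 301)             # grid on R_+
g = np.sign(z + c) + np.sign(-z + c)       # g(z) = sign(z + c) + sign(-z + c)

non_decreasing = bool(np.all(np.diff(g) >= 0))
non_constant = bool(g.min() < g.max())
```

For c = −0.75 the values are −2 on [0, 0.75) and 0 on (0.75, 3], so both flags are True, matching the condition required by Proposition 3.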
By tracing the proof of Theorem 1 of Jacques et al. (2013), we realize that the sole obstruction to whether their statement holds in the case of (25) is whether the number of quantization points, i.e., points {Y_i}_{i=1}^n that can be generated by model (25) for any β* and X, is larger than the upper bound shown in Lemma 1 of Jacques et al. (2013). In Lemma 36 of Appendix F we argue that the number of such points, for any fixed c < 0, can be roughly bounded from above by (d choose s)·(ne)^{s+1}/(s + 1)^{s+1}. Repeating the volume argument of the proof of Theorem 1 of Jacques et al. (2013), we conclude that when solving model (25) one cannot hope to achieve a worst-case error rate better than

(s + 1)^{1+1/s} / [ (en)^{1+1/s} + 2s^{3/2+1/s} ] ≳ (s + 1)^{1+1/s} / (en)^{1+1/s},

where the last inequality holds when n ≳ s^{3/2}. Our algorithm would appear to perform sub-optimally in such a situation, by roughly a factor of √(s/n) in the statistical error. However, we would like to point out that it is possible that the above lower bound is not tight, and that there might exist other pathological examples of link functions f producing tighter lower bounds. In addition, we have not used the fact that the algorithm does not utilize a priori knowledge of the link function f. We leave the important question of investigating the tightness of the lower bound in the noiseless case for future work.

5. Numerical Experiments

In this section we provide numerical experiments based on the three models (8), (9) and (10), where the random variable ε ∼ N(0, 1). All models are compared with the Truncated Power Method (TPM), proposed in Yuan and Zhang (2013). For model (8) we also compare the results of our approach to the ones given by the TWF algorithm of Cai et al. (2015). Our setup is as follows. In all scenarios the vector β* was held fixed at β* = (s^{−1/2}, …, s^{−1/2}, 0, …, 0)^⊤, with s non-zero entries followed by d − s zeros. Since our theory requires that n ≳ s² log d, we set up four different sample sizes n ≈ θ s² log d, where θ ∈ {4, 8, 12, 16}.
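As a concrete illustration of this setup, the following sketch (our own code, not the paper's; it takes model (8) to be the sparse noisy phase retrieval model Y = (X^⊤β*)² + ε from (2), and the helper names are ours) generates one synthetic dataset with the experiment's scalings:

```python
import numpy as np

def make_beta(d, s):
    """Unit-norm s-sparse signal: s leading entries equal to s**-0.5, rest zero."""
    beta = np.zeros(d)
    beta[:s] = s ** -0.5
    return beta

def simulate_phase_retrieval(n, d, s, rng):
    """Draw (Y, X) from the noisy phase retrieval model Y = (X'beta)^2 + eps."""
    beta = make_beta(d, s)
    X = rng.standard_normal((n, d))      # Gaussian design X ~ N(0, I_d)
    eps = rng.standard_normal(n)         # noise eps ~ N(0, 1)
    Y = (X @ beta) ** 2 + eps
    return Y, X, beta

rng = np.random.default_rng(0)
d, theta = 200, 8
s = int(np.ceil(np.log(d)))              # sparsity scales like log d
n = int(theta * s ** 2 * np.log(d))      # sample size n ~ theta * s^2 * log d
Y, X, beta = simulate_phase_retrieval(n, d, s, rng)
```

Since ‖β*‖₂ = 1, the responses have mean E[Y] = E[Z²] = 1, which gives a quick sanity check on the simulated data.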
We let s depend on the dimension d, taking s ≍ log d. In addition to the suggested approach in Algorithm 1, we also provide results using the refinement procedure (Algorithm 2). We also provide the values of two warm starts of our algorithm, produced by solving program (17) with half of the data and with the full data, respectively. It is evident that in all scenarios the second step of Algorithms 1 and 2 outperforms the warm start from the SDP, except in Figure 3 (b), (c), (e), where the sample size is simply too small for the warm start on half of the data to be accurate. All values we report are based on an average over 100 simulations. The SDP parameter was kept at a constant value (0.015) throughout all simulations, and we observed that varying this parameter had little influence on the final SDP solution. To select the νn parameter for (18), a pre-specified grid of parameters {ν₁, …, ν_l} was selected, and the following heuristic procedure based on K-fold cross-validation was used. We divide S₂ into K = 5 approximately equally sized non-intersecting sets S₂ = ∪_{j∈[K]} S̃₂^j. For each j ∈ [K] and k ∈ [l] we run (18) on the set ∪_{r∈[K], r≠j} S̃₂^r with tuning parameter νn = ν_k to obtain an estimate β̂_{k,S̃₂^j}. Lemma 5 then justifies the following criterion for selecting ν̂n = ν_l̂, where

l̂ = argmax_{k∈[l]} Σ_{j∈[K]} Σ_{i∈S̃₂^j} Y_i (X_i^⊤ β̂_{k,S̃₂^j})².

Our experience suggests that this approach works well in practice, provided that the values {ν₁, …, ν_l} are selected within an appropriate range and are of magnitude √(log d/n). Since the TPM algorithm requires an estimate of the sparsity s, we tuned it as suggested in Section 4.1.2 of Yuan and Zhang (2013). In particular, for each scenario we considered the set of possible sparsities K = {s, 2s, 4s, 8s}. For each k ∈ K the algorithm is run on the first part of the data S₁ to obtain an estimate β̂_k, and the final estimate is taken to be β̂_k̂, where k̂ is given by

k̂ = argmax_{k∈K} β̂_k^⊤ [ |S₂|^{−1} Σ_{i∈S₂} Y_i (X_i X_i^⊤ − I_d) ] β̂_k.
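The cross-validation heuristic for νn can be sketched as follows. This is a minimal illustration of ours, not the paper's code: `soft_threshold_fit` is a toy stand-in for solving program (18) at a given νn (plain least squares followed by soft-thresholding and normalization), and any solver with the same signature can be passed in its place.

```python
import numpy as np

def fold_indices(n, K, rng):
    """Split {0,...,n-1} into K roughly equal disjoint folds."""
    perm = rng.permutation(n)
    return np.array_split(perm, K)

def select_nu(Y, X, nu_grid, fit, K=5, seed=0):
    """K-fold heuristic for the second-step tuning parameter nu_n.

    For each fold j and each nu_k, `fit` is run on the other K-1 folds;
    the selected nu maximizes sum_j sum_{i in fold j} Y_i (X_i' beta_hat)^2,
    the empirical version of the criterion justified by Lemma 5.
    """
    rng = np.random.default_rng(seed)
    folds = fold_indices(len(Y), K, rng)
    scores = np.zeros(len(nu_grid))
    for k, nu in enumerate(nu_grid):
        for j, hold in enumerate(folds):
            train = np.concatenate([f for r, f in enumerate(folds) if r != j])
            b = fit(Y[train], X[train], nu)
            scores[k] += np.sum(Y[hold] * (X[hold] @ b) ** 2)
    return nu_grid[int(np.argmax(scores))], scores

def soft_threshold_fit(Y, X, nu):
    """Toy stand-in for program (18): least squares, soft-thresholded, normalized."""
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    b = np.sign(b) * np.maximum(np.abs(b) - nu, 0.0)
    nrm = np.linalg.norm(b)
    return b / nrm if nrm > 0 else b
```

Held-out folds are scored, rather than the training folds, which matches the description above: the estimate fitted without fold j is evaluated on fold j.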
The TPM is run for 2,000 iterations. In the case of phase retrieval, the TWF algorithm was also run for a total of 2,000 iterations, using the tuning parameters originally suggested in Cai et al. (2015). As expected, the TWF algorithm, which targets the sparse phase retrieval model in particular, outperforms our approach when the sample size n is small; however, our approach performs comparably to the TWF, and in fact even slightly better, once we increase the sample size. It is possible that the TWF algorithm can perform better if it is run for longer than 2,000 iterations,

[Figure 3 appears here: nine panels of estimation error against θ ∈ {4, 8, 12, 16}. Panels (a), (d), (g) show model (8) with d = 200, 300, 400 respectively; panels (b), (e), (h) show model (9); panels (c), (f), (i) show model (10). Each panel compares the initialization (Init), the second step, the initialization on the full data, the refined estimate, the TPM, and, for model (8), the TWF on the full data.]

Figure 3: Simulation results for the three examples considered in Section 2, in three different settings for the dimension d ∈ {200, 300, 400}.
Here the parameter θ ≈ n/(s² log d) describes the relationship between the sample size, the dimension, and the sparsity of the signal. Algorithm 2 dominates in most settings, with exceptions when θ is too small, in which case none of the approaches provides meaningful results.

although in most cases it appeared to have converged to its final value. The results are visualized in Figure 3 above. The TPM algorithm has performance comparable to that of Algorithm 1, is always worse than the estimate produced by Algorithm 2, needs an initialization (for which the first step of Algorithm 1 is used), and further requires rough knowledge of the sparsity s, whereas Algorithms 1 and 2 do not require an estimate of s.

6. Discussion

In this paper we proposed a two-step procedure for the estimation of MPR models with standard Gaussian designs. We argued that the MPR models form a rich class including numerous additive SIMs (i.e., Y = h(X^⊤β*) + ε) with a link function h that is even and increasing on R₊. Our algorithm is based solely on convex optimization, and achieves minimax optimal rates of estimation over broad classes of additive SIMs with normal noise, as illustrated by Theorem 8. Our procedure does require the sample size n ≳ s² log d to ensure successful initialization. The same condition has been exhibited previously, e.g., in Cai et al. (2015) for the phase retrieval model, and in works on sparse principal component analysis (see, e.g., Berthet and Rigollet, 2013; Gao et al., 2014; Wang et al., 2015). We anticipate that for a certain subclass of MPR models, the sample size requirement n ≳ s² log d is necessary for computationally efficient algorithms to exist; we conjecture models (8)-(10) to be such models. It is, however, certainly not true that this sample size requirement holds for all models in the MPR class. In (25), for example, the model can be solved efficiently by applying the Lasso algorithm, without requiring the sample size scaling n ≳ s² log d.
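To illustrate the last point, here is a minimal numpy sketch of ours (not code from the paper, and the ISTA solver and all parameter values are our own choices): it generates noiseless one-bit data Y = sign(X^⊤β* + c) with c < 0 as in (25), and runs plain ℓ1-regularized least squares, which recovers the direction of β* because Cov(Y, X^⊤β*) > 0 for this model.

```python
import numpy as np

def ista_lasso(X, Y, lam, iters=500):
    """ISTA for (1/2n)||Y - Xb||_2^2 + lam * ||b||_1."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ b - Y) / n
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return b

rng = np.random.default_rng(1)
n, d, s, c = 4000, 100, 5, -0.5
beta = np.zeros(d); beta[:s] = s ** -0.5   # unit-norm s-sparse signal
X = rng.standard_normal((n, d))
Y = np.sign(X @ beta + c)                  # noiseless one-bit model (25)

b_hat = ista_lasso(X, Y, lam=0.05)
cos = abs(b_hat @ beta) / np.linalg.norm(b_hat)   # alignment with beta
```

Note that only the direction of β* is recoverable here: Y carries no information about the scale, which is why the alignment |⟨b̂, β*⟩|/‖b̂‖₂ is the natural error measure.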
This discussion leads to the important question: under what conditions on the (known) link and error distribution (f, ε) can one efficiently solve the SIM Y = f(X^⊤β*, ε) with an optimal sample complexity? We would like to investigate this issue further in our future work.

References

Radosław Adamczak and Paweł Wolff. Concentration inequalities for non-Lipschitz functions with bounded derivatives of higher order. Probability Theory and Related Fields, 162(3-4):531-586, 2015.

Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In IEEE International Symposium on Information Theory, pages 2454-2458, 2008.

Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory, pages 1046-1066, 2013.

Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705-1732, 2009.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.

Petros T Boufounos and Richard G Baraniuk. 1-bit compressive sensing. In Annual Conference on Information Sciences and Systems, pages 16-21, 2008.

Peter Bühlmann and Sara van de Geer. Statistics for high-dimensional data: Methods, theory and applications. Springer, 2011.

T Tony Cai, Xiaodong Li, and Zongming Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. arXiv:1506.03382, 2015.

Emmanuel J Candès, Thomas Strohmer, and Vladislav Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241-1274, 2013.

Emmanuel J Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval from coded diffraction patterns.
Applied and Computational Harmonic Analysis, 39(2):277-299, 2015a.

Emmanuel J Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985-2007, 2015b.

Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. arXiv:1312.7006, 2013.

R Dennis Cook and Liqiang Ni. Sufficient dimension reduction via inverse regression. Journal of the American Statistical Association, 100(470), 2005.

Alexandre d'Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert RG Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Springer, 2013.

Ravi Ganti, Nikhil Rao, Rebecca M Willett, and Robert Nowak. Learning single index models in high dimensions. arXiv:1506.08910, 2015.

Chao Gao, Zongming Ma, and Harrison H Zhou. Sparse CCA: Adaptive estimation and computational barriers. arXiv:1409.8565, 2014.

Martin Genzel. High-dimensional estimation of structured signals from non-linear observations with general convex loss functions. arXiv:1602.03436, 2016.

Fang Han and Honglang Wang. Provable smoothing approach in high dimensional generalized regression model. arXiv:1509.07158, 2015.

Joel L Horowitz. Semiparametric and nonparametric methods in econometrics. Springer, 2009.

Laurent Jacques, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Transactions on Information Theory, 59(4):2082-2102, 2013.

Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, pages 1302-1338, 2000.

Guillaume Lecué and Shahar Mendelson.
Learning sub-Gaussian classes: Upper and minimax bounds. arXiv:1305.4825, 2013a.

Guillaume Lecué and Shahar Mendelson. Minimax rate of convergence and the performance of ERM in phase recovery. arXiv:1311.5024, 2013b.

Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316-327, 1991.

Ker-Chau Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420):1025-1039, 1992.

Ker-Chau Li and Naihua Duan. Regression analysis under link violation. The Annals of Statistics, pages 1009-1052, 1989.

Yue M Lu and Gen Li. Phase transitions of spectral initialization for high-dimensional nonconvex estimation. arXiv:1702.06435, 2017.

P McCullagh and JA Nelder. Generalized linear models. Chapman & Hall/CRC, 1989.

Marco Mondelli and Andrea Montanari. Fundamental limits of weak recovery with applications to phase retrieval. arXiv:1708.05932, 2017.

Matey Neykov, Qian Lin, and Jun S Liu. Signed support recovery for single index models in high-dimensions. arXiv:1511.02270, 2015.

Matey Neykov, Jun S Liu, and Tianxi Cai. L1-regularized least squares for support recovery of high dimensional single index models with Gaussian designs. Journal of Machine Learning Research, 17(87):1-37, 2016.

Heng Peng and Tao Huang. Penalized least squares for single index models. Journal of Statistical Planning and Inference, 141(4):1362-1379, 2011.

Yaniv Plan and Roman Vershynin. The generalized Lasso with non-linear observations. IEEE Transactions on Information Theory, 2015.

Peter Radchenko. High dimensional single index models. Journal of Multivariate Analysis, 139:266-282, 2015.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs.
Journal of Machine Learning Research, 11:2241-2259, 2010.

Charles M Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135-1151, 1981.

Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Lasso with non-linear measurements is equivalent to one with linear measurements. arXiv:1506.02181, 2015.

Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, 2009.

Aad W van der Vaart and Jon A Wellner. Weak convergence and empirical processes. Springer, 1996.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2010.

Zhaoran Wang, Quanquan Gu, and Han Liu. Sharp computational-statistical phase transitions via oracle computational model. arXiv:1512.08861, 2015.

Yingcun Xia and WK Li. On single-index coefficient regression models. Journal of the American Statistical Association, 94(448):1275-1285, 1999.

Zhuoran Yang, Zhaoran Wang, Han Liu, Yonina C Eldar, and Tong Zhang. Sparse nonlinear regression: Parameter estimation and asymptotic inference. arXiv:1511.04514, 2015.

Xinyang Yi, Zhaoran Wang, Constantine Caramanis, and Han Liu. Optimal linear estimation under unknown nonlinear transform. In Advances in Neural Information Processing Systems, pages 1549-1557, 2015.

Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr):899-925, 2013.

Appendix A. Auxiliary Results

Here we collect several results which we use in the later development.

Lemma 9 (Lemma 5, Amini and Wainwright (2008)). Consider the following optimization program:

Ẑ = argmax_{tr(Z)=1, Z∈S₊^d} tr(AZ) − λn Σ_{i,j=1}^d |Z_ij|,   (26)

where S₊^d is the set of all d × d positive semi-definite matrices. Suppose there exists a matrix U (independent of ẑ) satisfying:

U_ij = sign(ẑ_i) sign(ẑ_j), if ẑ_i ẑ_j ≠ 0;   U_ij ∈ [−1, 1], otherwise.
(27)

Then if ẑ is the principal eigenvector of the matrix A − λn U, the matrix ẑẑ^⊤ is the optimal solution to problem (26).

Lemma 10 (Lemma 10.12, Foucart and Rauhut (2013)). Given integers 1 < s < d, there exist

N ≥ ((d − 1)/(4(s − 1)))^{(s−1)/2}   (28)

subsets {S_i : S_i ⊂ [d − 1]}_{i∈[N]} such that:

∀ i ∈ [N] : |S_i| = s − 1, and ∀ i ≠ j : |S_i ∩ S_j| < (s − 1)/2.   (29)

Remark 11. Notice that when s > 1, since (s − 1)/2 ≥ s/4 and (d − 1)/(4(s − 1)) ≥ d/(4s), Lemma 10 implies N ≥ (d/(4s))^{s/4}.

For the convenience of the reader, we briefly recall the notation and the result on Gaussian concentration of non-Lipschitz functions used by Adamczak and Wolff (2015), which we apply in Lemma 30 below. For the set [ℓ], we denote by P_ℓ the set of its partitions into non-empty disjoint sets. For a partition J = {J₁, …, J_k} and an ℓ-indexed array A = (a_i)_{i∈[n]^ℓ}, define the norm:

‖A‖_J := sup { Σ_{i∈[n]^ℓ} a_i Π_{l=1}^k x^{(l)}_{i_{J_l}} : ‖(x^{(l)}_{i_{J_l}})‖₂ ≤ 1, 1 ≤ l ≤ k },

where the indexing should be understood as i_I := (i_k)_{k∈I}. Given the convention that #J = |J| is the cardinality of the set J, we restate a (shortened) version of Theorem 1.4 of Adamczak and Wolff (2015).

Theorem 12 (Theorem 1.4, Adamczak and Wolff (2015)). Let X = (X₁, …, X_n) be a random vector with independent components, such that for all i ≤ n, ‖X_i‖_{ψ₂} ≤ Υ. Then for every polynomial f : R^n → R of degree L and every p ≥ 2 we have:

‖f(X) − Ef(X)‖_p ≤ K_L Σ_{ℓ=1}^L Σ_{J∈P_ℓ} p^{#J/2} ‖E D^ℓ f(X)‖_J,

where D^ℓ is the ℓ-th derivative of f and K_L is a constant depending on L and Υ.

Appendix B. Preliminary Proofs

Proof [Proof of Proposition 2] We have the following chain of equalities:

c₀ = Cov(f(Z, ε), Z²) = E[ϕ(Z)Z²] − Eϕ(Z) = E[ϕ′′(Z)] > 0,

where the last (and key) equality follows by Stein's lemma (see, e.g., Lemma 4 of Stein, 1981).

Proof [Proof of Lemma 5] First of all, we notice that:

E[Y(XX^⊤ − I)] = E[(Y − µ)(XX^⊤ − I)] = E[(Y − µ)XX^⊤].

Hence proving Lemma 5 is equivalent to showing:

β* = argmax_{‖v‖₂=1} E[(Y − µ)(v^⊤X)²].

Next, decompose:

v^⊤X = (v^⊤β*)β*^⊤X + (v − (v^⊤β*)β*)^⊤X = (v^⊤β*)β*^⊤X + β⊥^⊤X,

where we used the shorthand notation β⊥ := v − (v^⊤β*)β* for the component of v which is orthogonal to β*.
In terms of this notation, we have the following identity:

E[(Y − µ)(v^⊤X)²] = (v^⊤β*)² E[(Y − µ)(β*^⊤X)²] + 2(v^⊤β*) E[(Y − µ)(β*^⊤X)(β⊥^⊤X)] + E[(Y − µ)(β⊥^⊤X)²].

We next deal with the last two terms of the above decomposition. Since β*^⊤X is independent of β⊥^⊤X, the second term satisfies:

E[(Y − µ)(β*^⊤X)(β⊥^⊤X)] = E[(Y − µ)(β*^⊤X)] E[β⊥^⊤X] = 0.

For the third term, due to the same independence, we have:

E[(Y − µ)(β⊥^⊤X)²] = E[Y − µ] E[(β⊥^⊤X)²] = 0.

Hence:

E[(Y − µ)(v^⊤X)²] = (v^⊤β*)² E[(Y − µ)(β*^⊤X)²] = (v^⊤β*)² c₀.

Since (6) implies that c₀ > 0, by the Cauchy-Schwarz inequality the maximizer of the above expression is v = β*.

Proof [Proof of Proposition 3] We prove the three statements in turn.

(i) Let Z′ be an independent copy of Z. We have the following chain of equalities:

c₀ = E[(ϕ(Z) − Eϕ(Z))Z²] = E[(ϕ(Z) − Eϕ(Z′))Z²] = E[(ϕ(Z) − ϕ(Z′))Z²].

By symmetry one also has: c₀ = E[(ϕ(Z′) − ϕ(Z))(Z′)²]. Adding the last two equations yields:

2c₀ = E[(ϕ(Z) − ϕ(Z′))(Z² − (Z′)²)] = E_{X,X′∼|N(0,1)|}[(ϕ(X) + ϕ(−X) − (ϕ(X′) + ϕ(−X′)))(X² − (X′)²)]/2 > 0,

where we used the fact that sign(ϕ(X) + ϕ(−X) − (ϕ(X′) + ϕ(−X′))) = sign(X² − (X′)²). The last inequality is strict since, by our condition, the integrand is strictly positive on the set [0, z₂] × [z₁, ∞) ⊂ R², which is a set of positive Lebesgue measure.

(ii) To see (ii), for any two points x < y, take v_x ∈ ∂ϕ(x) and v_y ∈ ∂ϕ(y) to be arbitrary points in the corresponding sub-differentials. Adding the following two inequalities:

ϕ(x) − ϕ(y) ≥ v_y(x − y),   ϕ(y) − ϕ(x) ≥ v_x(y − x),

we conclude that (v_x − v_y)(x − y) ≥ 0. Notice that since ϕ is convex, by Jensen's inequality, ϕ(z) + ϕ(−z) ≥ 2ϕ(0) for any z ≥ 0. Next take z > z′ > 0, and consider the difference:

ϕ(z) + ϕ(−z) − ϕ(z′) − ϕ(−z′) ≥ (v_{z′} − v_{−z′})(z − z′) ≥ 0,

where v_{z′} ∈ ∂ϕ(z′) and v_{−z′} ∈ ∂ϕ(−z′) are arbitrary sub-gradients, and the last inequality follows from the fact that z′ > 0, and hence v_{z′} ≥ v_{−z′} as we verified before. The above inequality becomes strict whenever z′ ≥ z₀, since v_{z′} − v_{−z′} is non-decreasing and, by assumption,

z₀ v_{z₀} ≥ ϕ(z₀) − ϕ(0) > ϕ(0) − ϕ(−z₀) ≥ v_{−z₀} z₀,

and hence v_{z₀} − v_{−z₀} > 0.
Hence we may take z₁ = 2z₀ and z₂ = z₀ in (i) to complete the proof.

(iii) Statement (iii) is an implication of the fact that we can control the tail bound of g₁(Z). Notice that when 0 < t ≤ max{|a|, |b|} we trivially have P(|g₁(Z)| ≥ t) ≤ 1. When t > max{|a|, |b|}, using our assumption, by a standard normal tail bound we have:

P(|g₁(Z)| ≥ t) ≤ P(|Z| ≥ √(t/C)) ≤ 2 exp(−t/(2C)) ≤ exp(1 − t/(2C)).

Hence, setting K = max{|a|, |b|, 2C} shows that in any case P(|g₁(Z)| ≥ t) ≤ exp(1 − t/K), which shows that ‖g₁(Z)‖_{ψ₁} ≤ cK < ∞ for some absolute constant c. Finally, by the triangle inequality we conclude that ‖f(Z, ε)‖_{ψ₁} ≤ ‖g₁(Z)‖_{ψ₁} + ‖g₂(ε)‖_{ψ₁} < ∞, which completes the proof.

Lemma 13. If h is a convex function such that h(z₀) + h(−z₀) > 2h(0) for some z₀ > 0, E|h(z + ε)| < ∞ for every z ∈ R, and E|h(Z + ε)| < ∞, then ϕ(z) := E h(z + ε) is convex and sub-differentiable, and there exists a z′₀ > 0 such that ϕ(z′₀) + ϕ(−z′₀) > 2ϕ(0).

Proof [Proof of Lemma 13] Since the function h is convex and the expectation is a linear operator, it follows that ϕ(z) is indeed convex. The linearity of the expectation operator, coupled with the fact that the function |h(z + ε)| is integrable for all z, additionally implies that ϕ(z) is sub-differentiable with E[∂_ε h(z + ε)] ∈ ∂ϕ(z) (the fact that E[∂_ε h(z + ε)] exists is implied by E|h(z + ε)| < ∞), where ∂_ε h(z + ε) ∈ ∂h(z + ε) is chosen so that ε ↦ ∂_ε h(z + ε) is a function (note also that ε ↦ ∂_ε h(z + ε) is monotone and hence measurable). Next, notice that for any fixed ε, we have:

E_Z[h(Z + ε) + h(−Z + ε)] > 2h(ε).

The last inequality is strict since, by Jensen's inequality, E_Z[h(Z + ε) + h(−Z + ε)] ≥ h(EZ + ε) + h(−EZ + ε) = 2h(ε), and equality can be achieved only when h is linear, which is not the case, since h(z₀) + h(−z₀) > 2h(0) by assumption. Take an expectation with respect to ε and exchange the expectations (by Fubini's theorem; recall that E|h(Z + ε)| < ∞) to obtain:

E_Z E_ε[h(Z + ε) + h(−Z + ε)] > 2E_ε h(ε).
Naturally, the above implies the existence of a z′₀ such that:

E_ε[h(z′₀ + ε) + h(−z′₀ + ε)] > 2E_ε h(ε),

and completes the proof.

Lemma 14. Monomials of the type h(x) = x^k belong to the class C(k − 1, (k!)², k!) ⊂ H(k − 1, (k!)², k!).

Proof [Proof of Lemma 14] This statement holds due to the following explicit moment calculation:

max_{i∈[k−1]} E_{Z∼N(0,1)}[(h^{(k−i)}(Z))²] = max_{i∈[k−1]} (k!/i!)² · 2^i Γ((1 + 2i)/2)/Γ(1/2) = (k!)²,

where the last equality follows by noticing that the sequence is decreasing in i.

Appendix C. Proofs for the Initialization Step

Our first result shows that the optimization program (17) succeeds in producing a vector v̂ which is close to the vector β*.

Proposition 15. Assume that n is large enough so that s√(log d/n) ≤ (2/3 − κ²)c₀/(4(C₁ + C₂)), for some small but fixed κ > 0 and constants C₁, C₂ (depending on f and ε). Then there exists a value of λn ≍ √(log d/n) such that the principal eigenvector v̂ of Â, the solution of (17), satisfies |v̂^⊤β*| ≥ κ with probability at least 1 − 4d⁻¹ − O(n⁻¹).

Proof [Proof of Proposition 15] The proof follows by an application of Lemma 9 and Lemma 17.

Lemma 16. Let A = a vv^⊤ − b ww^⊤ be a symmetric rank-two matrix, with a > b ≥ 0 and ‖v‖₂ = ‖w‖₂ = 1, and let N be a symmetric noise matrix. Then, assuming that ‖N‖₂ ≤ (a − b)/2, the principal eigenvector v̂ of A + N satisfies:

|v̂^⊤v| ≥ [ (a − b − 2‖N‖₂)/a ]^{1/2}.

Proof [Proof of Lemma 16] First off, an elementary calculation shows that the non-zero spectrum of A is:

{λ₁(A), λ_d(A)} = { (a − b ± √((a − b)² + 4ab(1 − (v^⊤w)²)))/2 }.

Next we have:

a(v^⊤v̂)² + ‖N‖₂ ≥ v̂^⊤(A + N)v̂ ≥ λ₁(A) − ‖N‖₂,

so that

(v^⊤v̂)² ≥ (λ₁(A) − 2‖N‖₂)/a ≥ (a − b − 2‖N‖₂)/a,

where the last inequality follows by Cauchy-Schwarz.

Lemma 17. Assume that n is large enough so that s√(log d/n) ≤ (2/3 − κ²)c₀/(4(C₁ + C₂)), for some small but fixed κ > 0 and constants C₁, C₂ as defined in Lemmas 21 and 22. Put λn = (C₁ + C₂)√(log d/n). Then there exists a sign matrix Û with Û_{S_β S_β} = sign(β*_{S_β}) sign(β*_{S_β})^⊤ such that the principal eigenvector v̂ of Σ̂ − λn Û satisfies |v̂^⊤β*| ≥ κ, with probability at least 1 − 4d⁻¹ − O(n⁻¹).

Remark 18.
The proof of Lemma 17 also shows that, with high probability, the vector v̂ can be identified with a vector ṽ (the principal eigenvector of Σ̂_{S_β S_β} − λn Û_{S_β S_β}; see below), which is independent of the data X_{S_β^c}, in the sense that v̂ = ṽ with high probability. This becomes evident upon realizing that the matrix N_{S_β S_β} depends solely on X_{S_β}. In addition, it is evident that supp(v̂) ⊂ S_β and supp(ṽ) ⊂ S_β.

Proof [Proof of Lemma 17] Setting λn = (C₁ + C₂)√(log d/n) and using Lemma 19 gives us the block decomposition:

Σ̂ − λn Û = [ η β*_{S_β} β*_{S_β}^⊤ − λn sign(β*_{S_β}) sign(β*_{S_β})^⊤ + N_{S_β S_β} ,  N_{S_β S_β^c} − λn Û_{S_β S_β^c} ;  N_{S_β^c S_β} − λn Û_{S_β^c S_β} ,  N_{S_β^c S_β^c} − λn Û_{S_β^c S_β^c} ].

We can select the sign matrix Û such that all three blocks which do not correspond to the S_β × S_β corner of the above decomposition are 0, since by Lemma 19 we have that ‖N‖_max ≤ (C₁ + C₂)√(log d/n) = λn with high probability. Recall that by our assumption on the sample size we have λn ≤ c₀/(6s), and hence ‖λn sign(β*_{S_β}) sign(β*_{S_β})^⊤‖₂ = λn s ≤ c₀/6. Using Lemmas 16 and 19 on the event ‖N_{S_β S_β}‖₂ ≤ (C₁ + C₂) s √(log d/n), and recalling that η ≥ c₀/2, we obtain:

|v̂_{S_β}^⊤ β*_{S_β}|² ≥ (η − c₀/6 − 2‖N_{S_β S_β}‖₂)/η ≥ 2/3 − 4‖N_{S_β S_β}‖₂/c₀ ≥ 2/3 − 4(C₁ + C₂) s √(log d/n)/c₀ ≥ κ²,

for values of n large enough so that the above expression is positive, which concludes the proof.

Lemma 19. We have that:

Σ̂ = η β* β*^⊤ + N,   (32)

where η > c₀/2, ‖N_{S_β S_β}‖₂ ≤ (C₁ + C₂) s √(log d/n), and ‖N‖_max ≤ (C₁ + C₂)√(log d/n), with probability at least 1 − 4d⁻¹ − O(n⁻¹), where C₁ and C₂ are constants depending on f, ε.

Proof [Proof of Lemma 19] First we observe that decomposition (32) holds with:

η = (1/n) Σ_{i=1}^n Y_i[(β*^⊤X_i)² − 1],

N = (1/n) Σ_{i=1}^n Y_i(β*^⊤X_i)(β* X_i^⊤ P_β + P_β X_i β*^⊤) + (1/n) Σ_{i=1}^n Y_i[P_β (X_i X_i^⊤ − I_d) P_β],

where P_β := I_d − β*β*^⊤. Lemma 20 shows that η ≥ c₀/2 with probability at least 1 − O(n⁻¹). Next, Lemma 21 and Lemma 22 show that ‖N‖_max ≤ (C₁ + C₂)√(log d/n), with probability at least 1 − 4d⁻¹ − O(n⁻¹), where the constants C₁ and C₂ depend on f, ε. Using the fact that ‖N_{S_β S_β}‖₂ ≤ s‖N_{S_β S_β}‖_max ≤ s‖N‖_max completes the proof.

Lemma 20.
We have that η defined in (32) satisfies η ≥ c₀/2 with probability at least 1 − 4 Var[f(Z, ε)(Z² − 1)]/(c₀² n).

Proof [Proof of Lemma 20] By Chebyshev's inequality we have that:

P( |(1/n) Σ_{i=1}^n Y_i((β*^⊤X_i)² − 1) − c₀| ≥ t ) ≤ Var[f(Z, ε)(Z² − 1)]/(n t²).

Notice that in the last inequality we have Var[f(Z, ε)(Z² − 1)] < ∞, since we are assuming that f(Z, ε) is sub-exponential. Setting t = c₀/2 brings the above probability bound to zero at a rate n⁻¹.

Lemma 21. We have that:

‖ (1/n) Σ_{i=1}^n Y_i(β*^⊤X_i)[β* X_i^⊤ P_β + P_β X_i β*^⊤] ‖_max ≤ C₁ √(log d/n),

where C₁ is a constant depending on f, ε, with probability at least 1 − 2d⁻¹ − Var[f²(Z, ε)Z²]/(E[f²(Z, ε)Z²])² · n⁻¹.

Proof [Proof of Lemma 21] We will only deal with the first term of the sum, as the second term follows by the same argument after transposition. First notice that Y_i(β*^⊤X_i) is independent of X_i^⊤P_β. Analyzing this term row-wise, for j ∈ S_β (β*_j ≠ 0), conditionally we have:

(1/n) Σ_{i=1}^n Y_i β*_j (β*^⊤X_i) X_i^⊤ P_β ∼ N( 0, (β*_j)² (1/n²) Σ_{i=1}^n Y_i² (β*^⊤X_i)² P_β ).

By Chebyshev's inequality we have:

P( |(1/n) Σ_{i=1}^n Y_i²(β*^⊤X_i)² − E[f²(Z, ε)Z²]| ≥ t ) ≤ Var[f²(Z, ε)Z²]/(n t²),

assuming Var[f²(Z, ε)Z²] < ∞. Putting t = E[f²(Z, ε)Z²] makes the above probability converge to zero at a rate n⁻¹. Hence, by a conditioning argument, a standard normal tail bound coupled with the facts that ‖P_β‖₂ ≤ 1 and |β*_j| ≤ 1, and a union bound, we obtain:

P( max_{j∈[d]} ‖(1/n) Σ_{i=1}^n Y_i(β*^⊤X_i)β*_j X_i^⊤P_β‖_∞ > t ) ≤ 2d² exp( −n t²/(4E[f²(Z, ε)Z²]) ).

Plugging in t = 2√(3E[f²(Z, ε)Z²] log d/n) brings the probability to 2d⁻¹. This completes the proof with C₁ = 4√(3E[f²(Z, ε)Z²]).

Lemma 22. Let Y_i = f(X_i^⊤β*, ε_i), where X_i ∼ N(0, I). Assume that f and ε are such that ‖f(Z, ε)‖_{ψ₁} ≤ K for Z ∼ N(0, 1) with Z independent of ε, and in addition let log d = o(n/log² n). Then:

‖(1/n) Σ_{i=1}^n Y_i P_β (X_i X_i^⊤ − I_d) P_β‖_max ≤ C₂ √(log d/n),

with probability at least 1 − 2d⁻¹ − (2¹¹K⁴/(E[f²(Z, ε)])² + e) n⁻¹, for some constant C₂ depending on K, and large values of n.

Proof [Proof of Lemma 22] Notice that by the properties of the multivariate normal distribution one has that Y_i is independent of P_β(X_i X_i^⊤ − I_d)P_β.
Next, we have that Z_i := P_β X_i ∼ N(0, P_β), and thus, since ‖P_β‖₂ ≤ 1, each individual entry of Z_i is a normally distributed random variable with variance at most one. Hence we have that for any j, k ∈ [d]: ‖Z_ij Z_ik‖_{ψ₁} ≤ 2‖Z_ij‖_{ψ₂}‖Z_ik‖_{ψ₂} ≤ 2, and hence, conditionally on Y_i, one has ‖Z_ij Z_ik − EZ_ij Z_ik‖_{ψ₁} = ‖Z_ij Z_ik − P_{β,jk}‖_{ψ₁} ≤ 4, for all j, k ∈ [d]. Next, conditionally on the Y_i values, a Bernstein-type inequality (see, e.g., Proposition 5.16 of Vershynin (2010)) yields:

P( ‖(1/n) Σ_{i=1}^n Y_i P_β(X_i X_i^⊤ − I_d)P_β‖_max ≥ t ) ≤ 2d² exp[ −c min( n t²/(16 n⁻¹ Σ_{i=1}^n Y_i²), n t/(4 max_{i∈[n]} |Y_i|) ) ],   (33)

for an absolute constant c > 0. Using the union bound and the fact that the Y_i are sub-exponential, we obtain:

P( max_i |Y_i| ≥ t ) ≤ n exp(−t/(c′K)),   (34)

for some absolute constant c′. Setting t = 2c′K log n brings the above probability to zero at a rate n⁻¹. Furthermore, by Chebyshev's inequality we obtain:

P( |(1/n) Σ_{i=1}^n Y_i² − EY²| ≥ t ) ≤ Var(Y_i²) n⁻¹ t⁻² ≤ 2⁹ K⁴ n⁻¹ t⁻²,   (35)

and thus we can set t = EY²/2 to bring the above probability to zero at a rate n⁻¹. In particular, EY²/2 ≤ n⁻¹ Σ_{i=1}^n Y_i² ≤ 2EY² with probability at least 1 − 2¹¹K⁴n⁻¹/(EY²)². Selecting t = √(96 EY² log d/(cn)) in (33), on these two events the first (quadratic) term in the minimum in (33) is the smaller one for large enough values of n, so long as log d = o(n/log² n). Hence we conclude:

‖(1/n) Σ_{i=1}^n Y_i P_β(X_i X_i^⊤ − I_d)P_β‖_max ≤ √(96 EY² log d/(cn)),

with probability at least 1 − 2d⁻¹ − (2¹¹K⁴/(EY²)² + e)n⁻¹. Taking into account that EY² ≤ 4K², we obtain C₂ = √(384K²/c).

Appendix D. Proofs for the Second Step

Remark 23. For simplicity of presentation we will subtract n from the indexes of the set S₂ in the proofs, i.e., instead of having observations indexed in the range S₂ = {n + 1, …, 2n}, we will pretend that our observations are in the range {1, …, n}.

Proof [Proof of Theorem 6] Take the fixed estimate v̂ from the first step (recall that ‖v̂‖₂ = 1), and decompose it as: v̂ = (v̂^⊤β*)β* + β̂⊥, where β̂⊥ := v̂ − (v̂^⊤β*)β*.
By the Pythagorean theorem we have 1 = ‖v̂‖₂² = (v̂^⊤β*)²‖β*‖₂² + ‖β̂⊥‖₂² (where β̂⊥ := v̂ − (v̂^⊤β*)β*), which implies that

‖β̂⊥‖₂² = 1 − (v̂^⊤β*)² ≤ 1.   (36)

Put α := c₀ v̂^⊤β*, so that by Lemma 17 we have |α| > κc₀ with high probability. By formulation (18) we have:

(1/2n)‖X(b̂ − αβ*)‖₂² + νn‖b̂‖₁ ≤ (1/n) Σ_{i=1}^n [(Y_i − Ȳ)(X_i^⊤v̂) − α(X_i^⊤β*)] X_i^⊤(b̂ − αβ*) + νn‖αβ*‖₁.

We handle the empirical process term in Lemma 25, which also presents the main difficulty in the analysis of the ℓ1-regularized least squares procedure. Using this result we conclude that:

(1/2n)‖X(b̂ − αβ*)‖₂² + νn‖b̂‖₁ ≤ C̃√(log d/n)[ ‖b̂ − αβ*‖₁ + (1/√n)‖X(b̂ − αβ*)‖₂ ] + νn‖αβ*‖₁,   (37)

with probability at least 1 − O(n⁻¹ + d⁻¹). We now distinguish two cases. First assume that ‖X(b̂ − αβ*)‖₂ > 4C̃√(log d). Then (37) implies that:

(1/4n)‖X(b̂ − αβ*)‖₂² + νn‖b̂‖₁ ≤ C̃√(log d/n)‖b̂ − αβ*‖₁ + νn‖αβ*‖₁.   (38)

Next, using a standard trick (see, e.g., Bickel et al., 2009; Bühlmann and van de Geer, 2011), we have:

‖b̂‖₁ = ‖b̂_{S_β}‖₁ + ‖b̂_{S_β^c}‖₁ ≥ ‖αβ*_{S_β}‖₁ − ‖b̂_{S_β} − αβ*_{S_β}‖₁ + ‖b̂_{S_β^c}‖₁,
‖b̂ − αβ*‖₁ = ‖b̂_{S_β} − αβ*_{S_β}‖₁ + ‖b̂_{S_β^c}‖₁.

Selecting νn ≥ 2C̃√(log d/n), the above identities in combination with (38) guarantee that:

(1/2n)‖X(b̂ − αβ*)‖₂² + νn‖b̂_{S_β^c} − αβ*_{S_β^c}‖₁ ≤ 3νn‖b̂_{S_β} − αβ*_{S_β}‖₁.   (39)

Using Corollary 1 of Raskutti et al. (2010), since clearly I_d satisfies the RE condition of order 2s with constants (3, 1) (i.e., for all S ⊂ [d] with |S| ≤ 2s and all θ with ‖θ_{S^c}‖₁ ≤ 3‖θ_S‖₁ we have θ^⊤ I θ ≥ ‖θ‖₂²), we can further bound:

(1/2n)‖X(b̂ − αβ*)‖₂² ≥ (1/(2·8²))‖b̂ − αβ*‖₂²,   (40)

with probability at least 1 − c′ exp(−c′′n), if n > c′′′ s log d, where c′, c′′, c′′′ > 0 are absolute constants. On the above event, (39) implies:

(1/(2·8²))‖b̂ − αβ*‖₂² ≤ 3νn‖b̂_{S_β} − αβ*_{S_β}‖₁ ≤ 3νn√(2s)‖b̂ − αβ*‖₂,

where we used Cauchy-Schwarz and the fact that the vector b̂_{S_β} − αβ*_{S_β} is at most 2s-sparse. The above inequality gives us that:

‖b̂ − αβ*‖₂ ≤ 12·8²·√(2s)·νn.   (41)

In the second case, when ‖X(b̂ − αβ*)‖₂ ≤ 4C̃√(log d), on the event (40) we have ‖b̂ − αβ*‖₂ ≤ 32C̃√(log d/n) ≤ 16νn, and we see that in either case bound (41) holds. Before we complete the proof we need the following straightforward result:

Lemma 24. Assume that n is large enough so that 12·8²·√(2s)·νn ≤ κc₀/2.
Then with probability at least $1 - O(n^{-1} + d^{-1})$ we have:
$$\min_{\eta\in\{-1,1\}}\Big\|\frac{\hat b}{\|\hat b\|_2} - \eta\beta^*\Big\|_2 \le \frac{48\cdot 8^2\sqrt{2s}\,\nu_n}{\kappa c_0}.$$
Finally, notice that $\sqrt{\frac{s\log d}{n}} < R$ implies $12\cdot 8^2\sqrt{2s}\,\nu_n \le \kappa c_0/2$ when $R$ is small enough.
Proof [Proof of Lemma 24] Put $r = 12\cdot 8^2\sqrt{2s}\,\nu_n$ for brevity. By (41) and the triangle inequality we can conclude that $|\alpha| - r \le \|\hat b\|_2 \le r + |\alpha|$. Additionally:
$$\Big\|\frac{\hat b}{\|\hat b\|_2} - \mathrm{sign}(\alpha)\beta^*\Big\|_2 \le \frac{\|\hat b - \alpha\beta^*\|_2 + \big|\,\|\hat b\|_2 - |\alpha|\,\big|}{\|\hat b\|_2} \le \frac{2r}{|\alpha| - r} \le \frac{4r}{|\alpha|},$$
with the last two inequalities holding with high probability when $r < \kappa c_0/2$ (so that $r \le |\alpha|/2$ with high probability by Lemma 17). This completes the proof.

Lemma 25. There exists a constant $\tilde C$ depending on $f, \varepsilon$ such that:
$$\frac{1}{n}\big\langle (Y-\bar Y)\circ X\hat v - \alpha X\beta^*,\; X(\hat b-\alpha\beta^*)\big\rangle \le \tilde C\sqrt{\frac{\log d}{n}}\Big[\frac{1}{\sqrt n}\|X(\hat b-\alpha\beta^*)\|_2 + \|\hat b-\alpha\beta^*\|_1\Big],$$
with probability at least $1 - O(n^{-1} + d^{-1})$.
Proof [Proof of Lemma 25] Using Hölder's inequality we obtain:
$$\frac{1}{n}\big\langle (Y-\bar Y)\circ X\hat v - \alpha X\beta^*,\; X(\hat b-\alpha\beta^*)\big\rangle \le \frac{1}{n}\big\|X^\top[(Y-\mu)\circ X\hat v - \alpha X\beta^*]\big\|_\infty\,\|\hat b-\alpha\beta^*\|_1 + \frac{1}{n}\|(\bar Y-\mu)X\hat v\|_2\,\|X(\hat b-\alpha\beta^*)\|_2, \quad (42)$$
where we have set $\mu := \mathbb E Y$ for brevity. We first handle the second term. We have $\frac{1}{\sqrt n}\|(\bar Y-\mu)X\hat v\|_2 = |\bar Y-\mu|\,\frac{1}{\sqrt n}\|X\hat v\|_2$. Since $Y_i$ is assumed to be sub-exponential, by a Bernstein type of inequality we have $\mathbb P(|\bar Y-\mu| \ge t) \le 2\exp\big(-cn\min(t^2/4K^2,\, t/2K)\big)$, where $c$ is an absolute constant. Thus we conclude that $|\bar Y-\mu| \le 2K\sqrt{\frac{\log d}{cn}}$ with probability at least $1-2d^{-1}$, for values of $n$ such that $\sqrt{\frac{\log d}{cn}} < 1$. Also, since $X\hat v \sim N(0, I_n)$, we obtain $\|X\hat v\|_2^2 \sim \chi^2_n$. Hence by Chebyshev's inequality we obtain $\mathbb P\big(\big|\|X\hat v\|_2^2/n - 1\big| \ge t\big) \le 2/(nt^2)$, and thus, plugging in $t=1$, we conclude that $\frac{1}{n}\|X\hat v\|_2^2 \le 2$ with probability at least $1-2n^{-1}$. Hence $\frac{1}{\sqrt n}\|(\bar Y-\mu)X\hat v\|_2 \le \tilde C_1\sqrt{\frac{\log d}{n}}$ with probability at least $1 - O(n^{-1}) - 2d^{-1}$. Next we analyze the sup norm term appearing in inequality (42).
The first fact we observe is that by construction this term is unbiased since: E[(Y µ)X 2β (bv β ) αX 2β ] + E[(Y µ)X 2 bβ ] = β E[(Y µ)(X β )2(bv β ) αβ X 2β ] | {z } 0 + E[(Y µ)Pβ X 2β (bv β )] | {z } 0 E[αPβ X 2β ] | {z } 0 + E[(Y µ)Pβ X 2 bβ ] | {z } 0 +β E[(Y µ)β X 2 bβ ] | {z } 0 Neykov, Wang, and Liu Now according to the decomposition in the preceding display, we break down the sup norm term in (42) into mean zero terms using the triangle inequality: n 1 X [(Y µ) Xbv αXβ ] n 1 P{β , b β } X [(Y µ) X bβ ] (43) + n 1 β β X [(Y µ) X bβ ] bβ 2 2 bβ ( bβ ) X [(Y µ) X bβ ] + n 1 β β X [(Y µ) Xβ c0Xβ ] + n 1 Pβ X [(Y µ) Xβ c0Xβ ] , where in the last two terms we used the fact that |bv β | 1 and P{β , b β } is the projection on the space span{β , bβ } . We use Lemma 26 to control the first term of the decomposition. Lemma 27 handles the second term, Lemma 28 handles the third term, and Lemmas 29 and 31 show concentration for the remaining terms. We conclude that there exists a constant e C2 such that: n 1 X [(Y µ) Xbv αXβ ] e C2 with probability at least 1 O(n 1 +d 1), which is what we aimed to show with e C = max( e C1, e C2). Lemma 26. We have that: 1 i=1 (Yi µ)X i bβ X i P{β , b β } bβ 2C3 for an absolute constant C3 depending on f and ε with probability at least 1 2d 1 Var(f(Z,ε) Ef(Z,ε))2 [E(f(Z,ε) Ef(Z,ε))2]2 n 1. Proof [Proof of Lemma 26]Notice that X i P{β , b β } is independent of (Yi µ)X i bβ . Hence conditionally on Y and X bβ we have i=1 (Yi µ)X i bβ X i P{β , b β } N 0, 1 i=1 (Yi µ)2(X i bβ )2P{β , b β } . Next using Chebyshev s inequality we can control the probability of spread about the mean: bβ 2 2 (Yi µ)2 E(f(Z, ε) Ef(Z, ε))2 t Var[(f(Z, ε) Ef(Z, ε))2] by setting t = E(f(Z, ε) Ef(Z, ε))2. Using the fact that P{β , b β } 2 1, by a standard normal tail bound and union bound on the event 1 n Pn i=1(X i bβ )2(Yi µ)2 2 bβ 2 2E(f(Z, ε) Ef(Z, ε))2 i=1 (Yi µ)X i bβ X i P{β , b β } t 2d exp( nt2/[4 bβ 2 2E(f(Z, ε) Ef(Z, ε))2]). 
Select t = 2 p 2E(f(Z, ε) Ef(Z, ε))2 bβ 2 q n yields the desired bound with 2E(f(Z, ε) Ef(Z, ε))2 bβ 2. Agnostic Estimation for Phase Retrieval Lemma 27. We have that: n 1 β β X [(Y µ) X bβ ] C4 for an absolute constant C4 depending on f and ε with probability at least 1 2d 1 Var Z2(f(Z,ε) Ef(Z,ε))2 [EZ2(f(Z,ε) Ef(Z,ε))2]2 n 1. Proof [Proof of Lemma 27] Notice that β β 2 = 1 and thus: n 1 β β X [(Y µ) X bβ ] n 1|β X [(Y µ) X bβ ]|. Next since (( bβ ) Xi) (β Xi)(Yi µ), conditioning on {(β Xi)(Yi µ)}i [n] we obtain: i=1 (β Xi)(Yi µ)(( bβ ) Xi) N 0, bβ 2 2 n2 i=1 (β Xi)2(Yi µ)2 , i=1 (β Xi)2(Yi µ)2 EZ2(f(Z, ε) Ef(Z, ε))2 t Var[Z2(f(Z, ε) Ef(Z, ε))2] by setting t = EZ2(f(Z, ε) Ef(Z, ε))2 we can control the variance term above. The final bound follows after an application of a standard Gaussian tail bound, where C4 turns out to be C4 = bβ 22 p EZ2(f(Z, ε) Ef(Z, ε))2. Lemma 28. For large enough values of n we have: bβ 2 2 bβ ( bβ ) X [(Y µ) X bβ ] C5 with prob at least 1 2d 1 O(n 1). Proof [Proof of Lemma 28] We have that bβ bβ 2, and hence: bβ 2 2 bβ ( bβ ) X [(Y µ) X bβ ] n 1 bβ 2 |( bβ ) X [(Y µ) X bβ ]|. Observe that X i bβ is independent from Yi µ, and in addition X i bβ N(0, bβ 2 2). Hence (X i bβ )2/ β 2 2 χ2 1. Next we make usage of the decomposition: i (Yi µ)(X i bβ )2 = bβ 2 2 n i (Yi µ)((X i bβ )2/ bβ 2 2 1) + bβ 2 2 n Since Yi is assumed to be sub-exponential, the second concentrates about zero by Proposition 5.16 in Vershynin (2010): i=1 Yi µ t 2 exp( c min(nt2/K2, nt/K)). Neykov, Wang, and Liu Selecting t = K c n gives a bound on the probability equal to 2d 1, for values of n large enough n c. For the remaining term, conditionally on {Yi}i [n], by Lemma 1 of Laurent and Massart (2000) we obtain: i (Yi µ)((X i bβ )2/ bβ 2 2 1) 2 v u u tn 1 n X i=1 (Yi µ)2 t + 2 max i [n] |Yi µ|t 2 exp( nt). (46) Next, by the triangle inequality: v u u tn 1 n X i=1 (Yi µ)2 v u u tn 1 n X i=1 Y 2 i + |µ|, max i [n] |Yi µ| max i [n] |Yi| + |µ|. 
The inequalities in the preceding display allow us to reuse the results (34) and (35) of Lemma 22. Thus conditioning on these events (46) implies: i (Yi µ)((X i bβ )2/ bβ 2 2 1) (2 t + (4c K log n)t 2 exp( nt), on an event failing with probability at most Var Y 2 [EY 2]2 + e n 1. Selecting t = log d n implies that with probability at least 1 2d 1 Var Y 2 [EY 2]2 + e n 1 we have: i (Yi µ)((X i bβ )2/ bβ 2 2 1) [(2 2EY 2 + µ) + 4c K] with the probability of failing being at most exp( nt) max(2d 1, O(n 1)). We remind the reader that we are assuming log(d) = o(n/ log2(n)). This completes the proof with C5 = (2 2EY 2 + µ) + 4c K + K c bβ 2. Lemma 29. We have that: n 1 β β X [(Y µ) Xβ c0Xβ ] C6 with probability at least 1 O(n 1) 3d 1. Proof [Proof of Lemma 29] As in Lemma 27 we have: n 1 β β X [(Y µ) Xβ c0Xβ ] n 1|β X [(Y µ) Xβ c0Xβ ]|, since β 1. We decompose the right hand side of the preceding display to: i=1 [(β Xi)2Yi (c0 + µ)] + (c0 + µ) i=1 [1 (β Xi)2]. To handle the first term one can easily use Chebyshev s inequality to obtain convergence with probability at least (log d) 1. However, to sharpen this rate, we work around the classic Chebyshev s inequality, by making usage of recent concentration results on polynomials of sub-Gaussian random variables proved in Adamczak and Wolff(2015). We have the following: Agnostic Estimation for Phase Retrieval Lemma 30. We have that: i=1 [(β Xi)2Yi (c0 + µ)] e C6 with probability at least 1 max(O(n 1), d 1). Usual concentration bounds on the χ2 distribution can be used to control the second term. Using Lemma 1 of Laurent and Massart (2000) we obtain: i=1 (β Xi)2 1 2 t + 2t 2 exp( nt). Select t = q n to complete the proof assuming that t < 1 and setting C6 = 4(c0 + µ) + e C6. Proof [Proof of Lemma 30] First we construct the random variable Zi = ηi|β Xi|1/2|Yi|1/4, where ηi is a Rademacher random variable. Notice that Z4 i = (β Xi)2Yi, and hence EZ4 i = (c0 + µ). We now argue that Z is a sub-Gaussian random variable. 
By H older s inequality, and the definition of ψ2 norm we have: E|β X|p E|Y |p/2 (p β X ψ2( Y ψ1/2)1/2)p/2 (p( Y ψ1/2)1/2)p/2, where we used that since β X N(0, 1) we have β X ψ2 1. Hence Z ψ2 (K/2)1/4, and thus Z is sub-Gaussian as claimed. For the remaining part recall the notation preceding Theorem 12. For f(x) = x4 and F(x) = Pn i=1 f(xi) we have DℓF(x) = diagd(f (ℓ)(x1), . . . , f (ℓ)(xn)) for ℓ [4]. Using the definition of ψ2 norm we can easily estimate E[|Z|ℓ] ( ℓ)ℓ Z ℓ ψ2. To this end we observe the following: diagℓ{x1, . . . , xn} J = 1(#J = 1) x 2 + 1(#J 2) x max. EDℓF(Z) J [1(#J = 1) n + 1(#J 2)]4!/(4 ℓ)!( 4 ℓ)4 ℓ Z 4 ℓ ψ2 , for ℓ [4], where with a slight abuse of notation we understand ( 4 ℓ)(4 ℓ) = 1 when ℓ= 4. Using the moment estimate of Theorem 12 we obtain: F(Z) EF(Z) k K4 X ℓ [4] Z ℓ ψ2 X J Pℓ k#J /2[1(#J = 1) n + 1(#J 2)] 4!( 4 ℓ)4 ℓ Z 4 ℓ ψ2 (4 ℓ)! where Pℓis the set of partitions of [ℓ], the absolute constant K4 depends solely on the dimension four, and e K4 on the Z ψ2 norm and K4. Next by Chebyshev s inequality: P(n 1|F(Z) EF(Z)| t) e Kk 4 [ p k/n + k2/n]k Applying this inequality with k = min( log d , (n log d)1/4 ), and t = 2e e K4 q n gives us that: i=1 [(β Xi)2Yi (c0 + µ)] e C5 Neykov, Wang, and Liu with probability at least 1 exp( min( log d , (n log d)1/4 )) 1 max(O(n 1), d 1) where e C5 = 2e e K4. This is what we wanted to show. Lemma 31. We have: n 1 Pβ X [(Y µ) Xβ c0Xβ ] C6 with probability at least 1 O(n 1) 2d 1. Proof [Proof of Lemma 31] We have that Pβ X is independent of (Y µ) Xβ c0Xβ and thus: 1 n Pβ X [(Y µ) Xβ c0Xβ ] N 0, 1 i=1 (Yi µ c0)2(β Xi)2Pβ . By Chebyshev s inequality we obtain that 1 n Pn i=1(Yi µ c0)2(β Xi)2 2E((f(Z, ε) µ c0)2Z2) with probability at least Var((f(Z,ε) µ c0)2Z2) [E((f(Z,ε) µ c0)2Z2)]2n. Since Pβ 2 1, by a standard Gaussian tail bound we obtain: P Pβ X [(Y µ) Xβ c0Xβ ] t 2d exp nt2 4E((f(Z, ε) µ c0)2Z2) Setting t = 2 q 2E((f(Z, ε) µ c0)2Z2) log d n , completes the proof. 
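The proofs of Lemmas 26–31 all conclude with the same move: a coordinatewise Gaussian (or sub-exponential) tail bound combined with a union bound over the $d$ coordinates, producing sup-norm terms of order $\sqrt{\log d/n}$. A minimal simulation of this mechanism (illustrative only; the dimensions, number of repetitions, and seed are arbitrary choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, reps = 500, 200, 50

# Union-bound level: each coordinate of an average of n i.i.d. standard Gaussian
# d-vectors is N(0, 1/n), so the sup-norm should be around sqrt(2 log(2d) / n).
bound = np.sqrt(2 * np.log(2 * d) / n)

sup_norms = []
for _ in range(reps):
    g = rng.standard_normal((n, d)).mean(axis=0)  # each coordinate ~ N(0, 1/n)
    sup_norms.append(np.abs(g).max())

# The empirical sup-norms concentrate at the predicted sqrt(log d / n) scale.
assert max(sup_norms) < 2 * bound
assert sum(sup_norms) / len(sup_norms) > bound / 4
```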
Finally, we prove a lemma showing that the support of the second step of Algorithm 1 stays confined within the true support $S_{\beta^*}$, provided that $\nu_n$ is sufficiently large. We will use this lemma in the proof of Theorem 7. In the following, for a matrix $M\in\mathbb R^{d_1\times d_2}$ and sets $S_1\subseteq[d_1]$, $S_2\subseteq[d_2]$, we let $M_{\cdot,S_2} = [M_{ij}]_{i\in[d_1],\,j\in S_2}$ and $M_{S_1,S_2} = [M_{ij}]_{i\in S_1,\,j\in S_2}$.
Lemma 32. There exists a constant $\rho > 0$ depending solely on $f, \varepsilon$ such that if we set $\nu_n = C\sqrt{\frac{\log d}{n}}$ for a sufficiently large constant $C$ with $C^2 > \rho$, and $n$ satisfies $\frac{n}{s\log d} > 16^{-1}\rho$, then the vector $\hat\beta$ produced by the second step of Algorithm 1 satisfies $\mathrm{supp}(\hat\beta)\subseteq S_{\beta^*}$, with probability at least $1 - O(n^{-1} + d^{-1} + \exp(-\Omega(s)))$. Moreover, $\hat\beta$ can be identified with a vector $\tilde\beta$ (i.e., $\hat\beta \equiv \tilde\beta$ with probability at least $1 - O(n^{-1} + d^{-1} + \exp(-\Omega(s)))$), where $\tilde\beta$ is independent of $X_{\cdot,S^c_{\beta^*}}$ and $\mathrm{supp}(\tilde\beta)\subseteq S_{\beta^*}$.
Proof [Proof of Lemma 32] The proof of this lemma follows the proof of Theorem 2.3.4 (i) of Neykov et al. (2016), and here we will provide only a proof of a modification required in the original argument. It turns out that to show the first part of the statement we need to show that there exists a constant $\rho > 0$ such that:
$$|S_2|^{-1}\sum_{i\in S_2}\big((Y_i-\bar Y)X_i^\top\hat v - c_0(\hat v^\top\beta^*)X_i^\top\beta^*\big)^2 \le \rho,$$
with high probability. Using the elementary inequality $(a+b+c)^2\le 3(a^2+b^2+c^2)$ together with the fact that $|\hat v^\top\beta^*|\le 1$, we obtain:
$$|S_2|^{-1}\sum_{i\in S_2}\big((Y_i-\bar Y)X_i^\top\hat v - c_0(\hat v^\top\beta^*)X_i^\top\beta^*\big)^2/3 \le |S_2|^{-1}\sum_{i\in S_2}\big((\bar Y-\mu)X_i^\top\hat v\big)^2 + |S_2|^{-1}\sum_{i\in S_2}\big((Y_i-\mu)X_i^\top\hat\beta_\perp\big)^2 + |S_2|^{-1}\sum_{i\in S_2}\big((Y_i-\mu)X_i^\top\beta^* - c_0X_i^\top\beta^*\big)^2.$$
Recall that by Remark 18 we can treat $\hat v$ as independent of $X_{\cdot,S^c_{\beta^*}}$, with $\mathrm{supp}(\hat v)\subseteq S_{\beta^*}$. Hence the first two terms can be handled as in Lemmas 33 and 34, whilst the last term can be handled by Chebyshev's inequality as in the original proof of Theorem 2.3.4 (i). Using the statement of Theorem 2.3.4 (i), we conclude that when $\nu_n = C\sqrt{\frac{\log d}{n}}$ for a sufficiently large constant $C$ such that $C^2 > \rho$, and $n$ satisfies $\frac{n}{s\log d} > 16^{-1}\rho$, we have $\mathrm{supp}(\hat\beta)\subseteq S_{\beta^*}$.
Additionally, this also implies that $\hat\beta_{S_{\beta^*}}$ is in fact the solution of:
$$\hat\beta_{S_{\beta^*}} = \operatorname*{argmin}_{b\in\mathbb R^s}\; \frac{1}{2|S_2|}\sum_{i\in S_2}\big((Y_i-\bar Y)X_{i,S_{\beta^*}}^\top\hat v_{S_{\beta^*}} - X_{i,S_{\beta^*}}^\top b\big)^2 + \nu_n\|b\|_1,$$
with high probability. Hence, just as in Remark 18, we can identify $\hat\beta$ with a vector $\tilde\beta$ (i.e., $\hat\beta \equiv \tilde\beta$ with high probability), where $\tilde\beta$ is independent of $X_{\cdot,S^c_{\beta^*}}$ and $\mathrm{supp}(\tilde\beta)\subseteq S_{\beta^*}$.

Appendix E. Proofs for Algorithm 2

In this section, for simplicity, we relabel the sample size of the full data from $2n$ to $n$. We will refer to the sample size of the second half of the data of Algorithm 1 as $|S_2| := n/2$ whenever confusion may otherwise arise. Unlike in the previous sections, here we define $\hat\beta_\perp := \hat\beta - (\hat\beta^\top\beta^*)\beta^*$.
Proof [Proof of Theorem 7] The proof is similar to that of Theorem 6, the difference being that Lemma 33 is used in place of Lemma 25. We omit the details.

Lemma 33. There exists a constant $\tilde C$ depending on $f, \varepsilon$ such that:
$$\frac{1}{n}\big\langle (Y-\bar Y)\circ X\hat\beta - \alpha X\beta^*,\; X(\hat b-\alpha\beta^*)\big\rangle \le \tilde C\sqrt{\frac{\log d}{n}}\Big[\frac{1}{\sqrt n}\|X(\hat b-\alpha\beta^*)\|_2 + \|\hat b-\alpha\beta^*\|_1\Big],$$
with probability at least $1 - O(n^{-1} + d^{-1} + \exp(-cs))$ for some absolute constant $c > 0$.
Proof [Proof of Lemma 33] The proof of this lemma follows a similar route to the proof of Lemma 25, modulo several non-trivial modifications. The first modification is required when handling the term $\frac{1}{\sqrt n}\|X\hat\beta\|_2$, as we now need to take into account that the vector $\hat\beta$ is not independent of the matrix $X$. To handle the term $\|X\hat\beta\|_2$, recall that from Lemma 32 we have $\mathrm{supp}(\hat\beta)\subseteq S_{\beta^*}$. Using the definition of the matrix operator norm we obtain:
$$\|X\hat\beta\|_2 \le \|X_{\cdot,S_{\beta^*}}\|_2\,\|\hat\beta\|_2 \le (2\sqrt n + \sqrt s)\,\|\hat\beta\|_2,$$
with probability at least $1-2\exp(-n/2)$, by Theorem 5.35 of Vershynin (2010). Hence, as in Lemma 25, we obtain $\frac{1}{\sqrt n}\|(\bar Y-\mu)X\hat\beta\|_2 \le \tilde C_1\sqrt{\frac{\log d}{n}}$ with probability at least $1 - O(n^{-1}) - 2d^{-1}$. Next, our strategy is to use the following decomposition (contrast with the decomposition in (43)):
$$\frac{1}{n}\big\langle X^\top[(Y-\mu)\circ X\hat\beta - \alpha X\beta^*],\; \hat b-\alpha\beta^*\big\rangle \le \frac{1}{n}\big\|X_{\cdot,S^c_{\beta^*}}^\top[(Y-\mu)\circ X\hat\beta_\perp]\big\|_\infty\,\|\hat b-\alpha\beta^*\|_1 + \frac{1}{n}\big\|X_{\cdot,S_{\beta^*}}^\top[(Y-\mu)\circ X\hat\beta_\perp]\big\|_\infty\,\|\hat b-\alpha\beta^*\|_1 + \frac{1}{n}\big\|\beta^*\beta^{*\top}X^\top[(Y-\mu)\circ X\beta^* - c_0X\beta^*]\big\|_\infty\,\|\hat b-\alpha\beta^*\|_1 + \frac{1}{n}\big\|P_{\beta^*}^\perp X^\top[(Y-\mu)\circ X\beta^* - c_0X\beta^*]\big\|_\infty\,\|\hat b-\alpha\beta^*\|_1.$$
Since the last two terms do not involve bβ or bβ they are already handled by Lemma 29 and Lemma 31. We handle first two terms in Lemma 34 and 35. Lemma 34. Assuming that s < Un log2(n), we have: n 1 X ,Sc β [(Y µ) X bβ ] C7 for a constant C7 depending on f and ε with probability at least 1 O(d 1) O(n 1) 2 exp( cs) for an absolute constant c. Proof [Proof of Lemma 34] We start by noticing that conditioning on X,Sβ determines both Y µ and X,Sβ while X,Sc β are unaffected. Hence conditionally we have: i=1 (Yi µ)X i bβ Xi,Sc β N 0, bβ 2 2 n2 i=1 (Yi µ)2(X i bβ / bβ 2)2Id s . (47) We will now delicately handle the variance term. Define B := {β : β β , supp(β ) Sβ , β 2 = 1}. We have argued that b β b β 2 B with high probability. The following inequality holds i=1 X 2 i (Yi µ)2β 1 i=1 P{β } X 2 i P{β } (Yi µ)2 2. Now we realize that the terms P{β } Xi are independent of the terms Yi µ. Hence conditioning on Y µ we are in position to use Remark 5.40 of Vershynin (2010) to claim that i=1 P{β } X 2 i P{β } (Yi µ)2 2 1 i=1 (Yi µ)2 +n 1 e C v u u ts i=1 (Yi µ)4 +s max i (Yi µ)2 . Agnostic Estimation for Phase Retrieval with probability at least 1 2 exp( cs) for some absolute constants c and e C. To handle the the second term we simply resort to Chebyshev s inequality: i=1 (Yi µ)2 E(f(Z, ε) Ef(Z, ε))2 t Var[(f(Z, ε) Ef(Z, ε))2] by setting t = E(f(Z, ε) Ef(Z, ε))2 we bring the above probability to zero at a rate n 1. Similarly one can handle the term 1 n Pn i=1(Yi µ)4. The term maxi [n](Yi µ)2 was handled in (34) where we showed that there exists an absolute constant such that maxi [n](Yi µ)2 ee C log2 n with high probability. Putting everything together we conclude: i=1 X 2 i (Yi µ)2β 2E(f(Z, ε) Ef(Z, ε))2 + r s n2E(f(Z, ε) Ef(Z, ε))4 + ee Cs log2 n n | {z } e V with probability at least 1 2 exp( cs) O(n 1). Under the assumption s < Un log2(n) for some constant U > 0 we have e V < . 
This implies on a set with overwhelming probability the variance of (47) is bounded by the constant e V . Then by a standard normal tail bound in conjunction with a union bound we obtain: i=1 (Yi µ)X i bβ X i,Sc β t 2(d s) exp( nt2/(8 bβ 2 2 e V )). Selecting t = bβ 2 q e V log d/n drives the probability to zero at a rate O(d 1). Lemma 35. Suppose that n R s2(log d log n) for a sufficiently large constant R . We have: n 1 X ,Sβ [(Y µ) X bβ ] C8 with probability at least 1 O(n 1 + d 1 + exp( cs)). Proof [Proof of Lemma 35] Recall that by Theorem 6 bβ satisfies minη { 1,1} bβ ηβ 2 L q The last bound implies 1 L2 n | bβ β |. Combining this with bound (36), easy algebra yields n . Next, by the triangle inequality we decompose: X ,Sβ [(Y µ) X bβ ] β Sβ β Sβ X ,Sβ [(Y µ) X bβ ] + P{β Sβ } X ,Sβ [(Y µ) X bβ ] . Below we handle the two terms separately in turn starting with the first one. It is evident that: n 1 β Sβ β Sβ X ,Sβ [(Y µ) X bβ ] sup β B i=1 (Yi µ)β Sβ Xi,Sβ β Xi Neykov, Wang, and Liu where B = {β : β β , supp(β ) Sβ , β 2 = 1}. Notice that if we select two vectors w , v B, then the vector w v w v 2 B. Hence by the triangle inequality we have: i=1 (Yi µ)β Sβ Xi,Sβ w Xi n 1 n X i=1 (Yi µ)β Sβ Xi,Sβ v Xi + w v 2I. The bound in the above display implies that if Nϵ is an ϵ-net on B, then: I (1 ϵ) 1 sup v Nϵ i=1 (Yi µ)β Sβ Xi,Sβ v Xi . Let N1/2 be an 1/2-net of B. By Lemma 5.2 of Vershynin (2010) we have that |N1/2| 5s. Using these facts coupled with the conditional argument in the proof of Lemma 27, we conclude that there exists a constant e C such that I e Cp s n with probability at least 1 O(n 1 + d 1) 2 exp( cs) for some absolute constant c > 0. Next we turn to bounding the term: n 1P{β Sβ } X ,Sβ [(Y µ) X bβ ] bβ 2 sup β B i=1 (β Xi)2(Yi µ) bβ 2 ee C r s n + s log n with probability at least 1 O(n 1) 2 exp( cs), where the final bound follows much like in Lemma 34 so we omit the details. 
Combining the final two bounds with the facts that $\|\hat\beta_\perp\|_2 \le C\sqrt{\frac{s\log d}{n}}$ and $n \ge R's^2(\log d\vee\log n)$, we conclude that $n^{-1}\big\|X_{\cdot,S_{\beta^*}}^\top[(Y-\mu)\circ X\hat\beta_\perp]\big\|_\infty \le C_8\sqrt{\frac{\log d}{n}}$ for some sufficiently large constant $C_8$.

Appendix F. Lower Bound Proofs

Lemma 36. Let $\{S_j\}_{j=1}^L$ be sets with $S_j\subseteq[d]$ and $|S_j| = s$. Construct the sets $B_j = \{\beta : \|\beta\|_2 = 1,\ \mathrm{supp}(\beta) = S_j\}$. Then for any points $X_i\in\mathbb R^d$, $i\in[n]$, and any fixed constant $c < 0$, the set $\bigcup_{j\in[L]}\bigcup_{\beta\in B_j}\{(\mathrm{sign}(X_i^\top\beta + c))_{i\in[n]}\}$ has cardinality at most:
$$\Big|\bigcup_{j\in[L]}\bigcup_{\beta\in B_j}\{(\mathrm{sign}(X_i^\top\beta + c))_{i\in[n]}\}\Big| \le L\Big(\frac{ne}{s+1}\Big)^{s+1}.$$
Proof [Proof of Lemma 36] It suffices to show the result for $L \equiv 1$ with a fixed support set $S$ such that $|S| = s$. Start by constructing the function space $F_S := \{f(X) = X^\top\beta : \mathrm{supp}(\beta)\subseteq S,\ \beta\in\mathbb R^d\}$. It is clear that $\{f(X) = X^\top\beta : \beta\in\mathbb R^d,\ \|\beta\|_2 = 1,\ \mathrm{supp}(\beta) = S\}\subseteq F_S$. Denote the subgraph class of sets associated with $F_S$ by $\mathcal S(F_S) := \bigcup_{f\in F_S}\{(t,X) : t < f(X)\}$. Using Lemma 2.6.15 of van der Vaart and Wellner (1996), we conclude that the VC-index of the class $\mathcal S(F_S)$ is at most $s+2$. Next, applying the Sauer–Shelah lemma (see, e.g., Corollary 2.6.3 of van der Vaart and Wellner, 1996) we conclude that:
$$\max_{X_1,\dots,X_n}\Delta_n\big(\mathcal S(F_S), (-c,X_1),\dots,(-c,X_n)\big) \le \Big(\frac{ne}{s+1}\Big)^{s+1}, \quad (49)$$
where $\Delta_n$ is the number of subsets of $\{(-c,X_i)\}_{i=1}^n$ which can be picked out by the collection of sets $\mathcal S(F_S)$. Recall that a subset $A\subseteq\{(-c,X_i)\}_{i=1}^n$ is picked out by $\mathcal S(F_S)$ iff there exists a function $f\in F_S$ such that $f(X_i) > -c$ for all $(-c,X_i)\in A$, and $f(X_i)\le -c$ for all $(-c,X_i)\in A^c$. Equivalently, this means that there exists a vector $\beta\in\mathbb R^d$ with $\mathrm{supp}(\beta)\subseteq S$ such that $\mathrm{sign}(X_i^\top\beta + c) > 0$ if $(-c,X_i)\in A$ and $\mathrm{sign}(X_i^\top\beta + c) < 0$ if $(-c,X_i)\in A^c$. In other words, bound (49) is an upper bound on the possible number of quantization points which can be produced by the model (25) on the set $\{\beta : \|\beta\|_2 = 1,\ \mathrm{supp}(\beta) = S\}$ for an arbitrary matrix $X$.
Proof [Proof of Theorem 8] We assume throughout the proof, without loss of generality, that the infimum in (24) is attained at the function $h^*\in\mathcal H(K,M,L)$ ($h^*\in\mathcal C(L)$ in case (i)).
If that were not the case, the same argument goes through for a sequence of functions $h_n$ converging to the infimum. Our first step is to construct a sufficiently large set $B$ of $s$-sparse vectors on the unit sphere $\mathbb S^{d-1}$, relying on a standard combinatorial construction which is a sparse variation of the Gilbert–Varshamov bound, stated as Lemma 10 in Appendix A for the convenience of the reader. Apply Lemma 10 to construct the collection of sets $\{S_i\}_{i\in[N]}$. Next define the set $B$ in the following manner:
$$B := \Big\{\beta : \exists\, k\in[N] \text{ s.t. } \beta_j = \rho\,\mathbb 1(j\in S_k)\,(s-1)^{-1/2} \text{ for } j\in[d-1],\ \text{and } \beta_d = \sqrt{1-\rho^2}\Big\}, \quad (50)$$
for some to-be-specified constant $\rho < 1$. It is simple to see that $B$ is indeed a subset of the unit sphere in $\mathbb R^d$, i.e., $B\subset\mathbb S^{d-1}$; furthermore $\|\beta\|_0 = s$ for all $\beta\in B$, and by the construction of the sets $\{S_k\}_{k\in[N]}$ (see (29)) we have for all $k\ne j$:
$$\rho^2 \le \|\beta_k - \beta_j\|_2^2 \le 2\rho^2. \quad (51)$$
Next we resort to a general scheme of reducing estimation to testing, which is wonderfully described in Chapter 2 of Tsybakov (2009). For brevity, let $M(\hat\beta,\beta) = \min_{\eta\in\{-1,1\}}\|\hat\beta - \eta\beta\|_2$. We outline the reduction steps below. We have:
$$\inf_{\hat\beta}\sup_{\beta:\|\beta\|_0\le s,\,\|\beta\|_2=1}\mathbb E_{h^*,\beta}\,M(\hat\beta,\beta) \ge \inf_{\hat\beta}\sup_{\beta:\|\beta\|_0\le s,\,\|\beta\|_2=1}\frac{\rho}{2}\,\mathbb P_{h^*,\beta}\big(M(\hat\beta,\beta)\ge\rho/2\big) \ge \inf_{\hat\beta}\sup_{\beta\in B}\frac{\rho}{2}\,\mathbb P_{h^*,\beta}\big(M(\hat\beta,\beta)\ge\rho/2\big), \quad (52)$$
where in the first inequality we used Markov's inequality, and in the second the fact that $B\subset\mathbb S^{d-1}\cap\{\beta:\|\beta\|_0 = s\}$. Define the test $\psi^* = \operatorname{argmin}_{i\in[N]}M(\hat\beta,\beta_i)$. We now show that for all $j\ne\psi^*$ we have $M(\hat\beta,\beta_j)\ge\rho/2$. The latter holds trivially if $M(\hat\beta,\beta_{\psi^*})\ge\rho/2$, hence we assume $M(\hat\beta,\beta_{\psi^*}) < \rho/2$. Let further (without loss of generality) $M(\hat\beta,\beta_{\psi^*}) = \|\hat\beta-\beta_{\psi^*}\|_2$. By the triangle inequality and (51), this immediately implies that $\|\hat\beta-\beta_j\|_2\ge\rho/2$ for all $j\ne\psi^*$. Next we show that $\|\hat\beta+\beta_j\|_2\ge\rho/2$ for all $j\ne\psi^*$. By the triangle inequality:
$$\|\hat\beta+\beta_j\|_2 \ge \|\beta_{\psi^*}+\beta_j\|_2 - \|\hat\beta-\beta_{\psi^*}\|_2 \ge \sqrt{4(1-\rho^2)+2\rho^2} - \rho/2 > \rho/2,$$
where to see the last inequality recall that $\rho < 1$.
We have shown $M(\hat\beta,\beta_j)\ge\rho/2$ for all $j\ne\psi^*$, which by (52) implies:
$$\inf_{\hat\beta}\sup_{\beta:\|\beta\|_0\le s,\,\|\beta\|_2=1}\mathbb E_{h^*,\beta}\,M(\hat\beta,\beta) \ge \frac{\rho}{2}\max_{j\in[N]}\mathbb P_{h^*,\beta_j}(\psi^*\ne j) \ge \frac{\rho}{2}\inf_\psi\max_{j\in[N]}\mathbb P_{h^*,\beta_j}(\psi\ne j). \quad (53)$$
Hence it suffices to control the quantity on the right hand side of (53). To this end we use Fano's inequality, which is a standard tool for producing minimax lower bounds. We have:
$$\inf_\psi\max_{j\in[N]}\mathbb P_{h^*,\beta_j}(\psi\ne j) \ge 1 - \frac{I\big(J;\{(Y_i,X_i)\}_{i\in[n]}\big) + \log 2}{\log N}, \quad (54)$$
where $J$ is a random variable uniformly distributed on $[N]$, and $I$ denotes the mutual information. By a standard argument, the mutual information is bounded by the KL diameter of the set $B$, i.e.:
$$I\big(J;\{(Y_i,X_i)\}_{i\in[n]}\big) \le \max_{k,j} D_{\mathrm{KL}}\big(\mathbb P_{\beta_k}(\{(Y_i,X_i)\}_{i\in[n]})\,\big\|\,\mathbb P_{\beta_j}(\{(Y_i,X_i)\}_{i\in[n]})\big). \quad (55)$$
Define $U := \beta_k^\top X$, $V := \beta_j^\top X$, and $W := P^\perp_{\{\beta_k,\beta_j\}}X$, and notice that $W$ is independent of $(U,V)$. At this point the analysis of (i) and (ii) branches. We first show (i), and return to (ii) later.
(i) Just as in Example 3 of Neykov et al. (2015), a simple calculation yields:
$$D_{\mathrm{KL}}\big(\mathbb P_{\beta_k}(\{(Y_i,X_i)\}_{i\in[n]})\,\big\|\,\mathbb P_{\beta_j}(\{(Y_i,X_i)\}_{i\in[n]})\big) = n\,\mathbb E\big[P\big((\xi + h^*(U) - h^*(V))^2\big) - P(\xi^2)\big] = n\,\mathbb E\,\tilde P\big((h^*(U)-h^*(V))^2\big),$$
where $\xi$ is a random variable with density $p_\xi(x)\propto\exp(-P(x^2))$ (i.e., $\xi\stackrel{d}{=}\varepsilon$), and $\tilde P$ is another nonzero polynomial of the same degree as $P$, determined by the even moments of $\xi$. This holds since all odd moments of $\xi$ are zero and all even moments of $\xi$ exist. Using the fact that $h^*$ is $L$-Lipschitz, we can further bound the above by
$$\mathbb E\,\tilde P\big(L^2(U-V)^2\big) \le \mathbb E\exp\big(\tilde C_{P,L}(U-V)^2\big) - 1,$$
for a sufficiently large constant $\tilde C_{P,L}$ depending on $P$ and $L$. Finally, notice that $U-V\sim N(0,\nu^2)$, where $\nu^2\le 2\rho^2$ by (51). Direct calculation yields that:
$$\mathbb E\exp\big(\tilde C_{P,L}(U-V)^2\big) - 1 = \frac{1}{\sqrt{1-2\nu^2\tilde C_{P,L}}} - 1 \le \frac{1}{\sqrt{1-4\rho^2\tilde C_{P,L}}} - 1 \le 4\rho^2\tilde C_{P,L},$$
where we are assuming that $4\rho^2\tilde C_{P,L}\le\tfrac12$. Using Fano's bound we get:
$$\inf_{\hat\beta}\sup_{\beta:\|\beta\|_0\le s,\,\|\beta\|_2=1}\mathbb E_{h^*,\beta}\,M(\hat\beta,\beta) \ge \frac{\rho}{2}\Big(1 - \frac{16n\rho^2\tilde C_{P,L} + 4\log 2}{s\log(d/(4s))}\Big).$$
Hence (again in the case $\rho\le(8\tilde C_{P,L})^{-1/2}$), setting $\rho\asymp\tilde C_{P,L}^{-1/2}\sqrt{\frac{s\log(d/(4s))}{n}}$ we obtain the bound $C_{P,L}\sqrt{\frac{s\log(d/(4s))}{n}}$, where $C_{P,L} = \tilde C_{P,L}^{-1/2}/6$.
(ii) In the case when $\varepsilon\sim N(0,\sigma^2)$, using the formula for the KL-divergence between normal distributions, and the law of iterated expectation conditionally on $(U,V)$, we have:
$$D_{\mathrm{KL}}\big(\mathbb P_{\beta_k}(\{(Y_i,X_i)\}_{i\in[n]})\,\big\|\,\mathbb P_{\beta_j}(\{(Y_i,X_i)\}_{i\in[n]})\big) = \frac{n}{2\sigma^2}\,\mathbb E_{U,V}\big(h^*(U)-h^*(V)\big)^2. \quad (56)$$
We now remark that it suffices to consider the case $h^*\in\mathcal C(K,M,L)$ for some $K\in\mathbb N$. To see this, suppose that $h^* = \sum_{i=1}^{N'}w_ih_i^*$ for some $h_i^*\in\mathcal C(i,M,L)$ and nonnegative weights $w_i\ge 0$ such that $\sum_{i=1}^{N'}w_i = 1$. By Jensen's inequality:
$$\mathbb E_{U,V}\big(h^*(U)-h^*(V)\big)^2 \le \sum_{i=1}^{N'}w_i\,\mathbb E_{U,V}\big(h_i^*(U)-h_i^*(V)\big)^2 \le \max_{i\in[N']}\mathbb E_{U,V}\big(h_i^*(U)-h_i^*(V)\big)^2.$$
Next observe that:
$$\mathbb E_{U,V}\big(h^*(U)-h^*(V)\big)^2 = \mathrm{Var}\big(h^*(U)-h^*(V)\big), \quad (57)$$
where the equality follows from the fact that $h^*(U)\stackrel{d}{=}h^*(V)$, which in turn implies $\mathbb E\big(h^*(U)-h^*(V)\big) = 0$. Furthermore, notice that by (51) we have $\mathrm{Cov}(U,V) = \beta_k^\top\beta_j \ge 1-\rho^2$. Set $1-\nu^2 := \mathrm{Cov}(U,V)$ for some $0\le\nu\le\rho$. This fact, coupled with the normality of $U$ and $V$, implies that we can equivalently represent $U$ and $V$ as:
$$U = \sqrt{1-\nu^2}\,Z_1 + \nu Z_2, \qquad V = \sqrt{1-\nu^2}\,Z_1 + \nu Z_3,$$
where $Z_1, Z_2, Z_3$ are i.i.d. $N(0,1)$ random variables. To this end we recall the Gaussian Poincaré inequality (see, e.g., Theorem 3.20 of Boucheron et al., 2013), which states that if $F:\mathbb R^k\to\mathbb R$ is a continuously differentiable map, and $Z\sim N(0,I_k)$, then $\mathrm{Var}(F(Z))\le\mathbb E\|\nabla F(Z)\|_2^2$. Consider the map $F:\mathbb R^3\to\mathbb R$ defined by $F(Z_1,Z_2,Z_3) := h^*(\sqrt{1-\nu^2}Z_1+\nu Z_2) - h^*(\sqrt{1-\nu^2}Z_1+\nu Z_3)$. Applying the Gaussian Poincaré inequality yields:
$$\mathrm{Var}\big(h^*(U)-h^*(V)\big) \le (1-\nu^2)\,\mathbb E\big(h^{*\prime}(U)-h^{*\prime}(V)\big)^2 + 2\nu^2\,\mathbb E_{Z\sim N(0,1)}\big(h^{*\prime}(Z)\big)^2 \le (1-\nu^2)\,\mathrm{Var}\big(h^{*\prime}(U)-h^{*\prime}(V)\big) + 2\nu^2M,$$
where we used the fact that $h^{*\prime}(U)\stackrel{d}{=}h^{*\prime}(V)$ and the fact that $h^*\in\mathcal C(K,M,L)$. Now, since clearly $h^{*\prime}\in\mathcal C(K-1,M,L)$, we can inductively apply this argument to obtain the bound:
$$\mathrm{Var}\big(h^*(U)-h^*(V)\big) \le 2\nu^2M\sum_{k=0}^{K-1}(1-\nu^2)^k + (1-\nu^2)^K\,\mathrm{Var}\big(h^{*(K)}(U)-h^{*(K)}(V)\big) \le 2M\big(1-(1-\nu^2)^K\big) + L^2\,\mathbb E(U-V)^2 \le 2\rho^2\big(MK + L^2\big).$$
Combining this observation with (55), (56), and Fano's bound (54) we obtain:
$$\inf_{\hat\beta}\sup_{\beta:\|\beta\|_0\le s,\,\|\beta\|_2=1}\mathbb E_{h^*,\beta}\,M(\hat\beta,\beta) \ge \frac{\rho}{2}\Big(1 - \frac{4n\rho^2(MK+L^2)/\sigma^2 + 4\log 2}{s\log(d/(4s))}\Big),$$
where we used (28).
Selecting $\rho^2 = \sigma^2\,s\log(d/(4s))\,[12n(MK+L^2)]^{-1}$ makes the right hand side at least $\rho/6$, completing the proof.

Remark 37. In the special case of the phase retrieval model, equation (57) has the following explicit form:
$$\mathbb E_{U,V}(U^2-V^2)^2 = \mathbb E_{U,V}(U-V)^2(U+V)^2 = \mathbb E(U-V)^2\,\mathbb E(U+V)^2 = \|\beta_k-\beta_j\|_2^2\,\|\beta_k+\beta_j\|_2^2,$$
where we used that $U-V$ and $U+V$ are uncorrelated, hence independent, Gaussian random variables. This observation is sufficient for recovering the result of Theorem 3.2 of Cai et al. (2015), even in the general case when one assumes $\|\beta^*\|_2 = R$ instead of $\|\beta^*\|_2 = 1$.
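The identity in Remark 37 is straightforward to verify by simulation (a standalone sanity check, not code from the paper; the dimension, sample size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 5, 400_000

# Two unit-norm directions beta_k, beta_j and Gaussian design X ~ N(0, I_d)
bk = rng.standard_normal(d); bk /= np.linalg.norm(bk)
bj = rng.standard_normal(d); bj /= np.linalg.norm(bj)

X = rng.standard_normal((N, d))
U, V = X @ bk, X @ bj

# Monte Carlo estimate of E (U^2 - V^2)^2 versus the closed form
lhs = np.mean((U**2 - V**2) ** 2)
rhs = np.sum((bk - bj) ** 2) * np.sum((bk + bj) ** 2)  # ||bk - bj||^2 * ||bk + bj||^2

assert abs(lhs - rhs) / rhs < 0.1  # agreement up to Monte Carlo error
```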