Journal of Machine Learning Research 21 (2020) 1-39. Submitted 5/18; Revised 4/20; Published 6/20.

Agnostic Estimation for Misspecified Phase Retrieval Models

Matey Neykov (mneykov@stat.cmu.edu), Carnegie Mellon University, Department of Statistics & Data Science
Zhaoran Wang (zhaoranwang@gmail.com), Northwestern University, Department of Industrial Engineering and Management Sciences
Han Liu (hanliu@northwestern.edu), Northwestern University, Department of Electrical Engineering and Computer Science, Department of Statistics

Editor: Qiang Liu

Abstract. The goal of noisy high-dimensional phase retrieval is to estimate an s-sparse parameter β* ∈ ℝ^d from n realizations of the model Y = (X⊤β*)² + ε. Based on this model, we propose a significant semi-parametric generalization called misspecified phase retrieval (MPR), in which Y = f(X⊤β*, ε) with unknown f and Cov(Y, (X⊤β*)²) > 0. For example, MPR encompasses Y = h(|X⊤β*|) + ε with increasing h as a special case. Despite the generality of the MPR model, it eludes the reach of most existing semi-parametric estimators. In this paper, we propose an estimation procedure, which consists of solving a cascade of two convex programs and provably recovers the direction of β*. Furthermore, we prove that our procedure is minimax optimal over the class of MPR models. Interestingly, our minimax analysis characterizes the statistical price of misspecifying the link function in phase retrieval models. Our theory is backed up by thorough numerical results.

1. Introduction

In scientific and engineering fields researchers oftentimes face the problem of quantifying the relationship between a given outcome Y and a corresponding predictor vector X, based on a sample {(Y_i, X_i)}_{i=1}^n of n observations. In such situations it is common to postulate a linear working model, and search for a d-dimensional signal vector β* satisfying the following familiar relationship:

Y = X⊤β* + ε.
(1)

When the predictor X is high-dimensional in the sense that d ≫ n, it is commonly assumed that the underlying signal β* is s-sparse. In a certain line of applications, such as X-ray crystallography, microscopy, diffraction and array imaging, one can only measure the magnitude of X⊤β* but not its phase (i.e., its sign in the real domain). In this case, assuming model (1) may not be appropriate. To cope with such applications in the high-dimensional setting, Cai et al. (2015) proposed the thresholded Wirtinger flow (TWF), a procedure which consistently estimates the signal β* in the following real sparse noisy phase retrieval model:

Y = (X⊤β*)² + ε, (2)

where one additionally knows that the predictors have a Gaussian random design X ∼ N(0, I_d).

©2020 Matey Neykov, Zhaoran Wang, and Han Liu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-259.html.

In the present paper, taking an agnostic point of view, we recognize that both models (1) and (2) represent an idealized view of the data generating mechanism. In reality, the nature of the data could be better reflected through the more flexible viewpoint of a single index model (SIM):

Y = f(X⊤β*, ε), (3)

where f is an unknown link function, and it is assumed that ‖β*‖₂ = 1 for identifiability. A recent line of work on high-dimensional SIMs (Plan and Vershynin, 2015; Neykov et al., 2016) showed that under Gaussian designs, one can apply ℓ1 regularized least squares to successfully estimate the direction of β* and its support. The crucial condition allowing for the above somewhat surprising application turns out to be:

Cov(Y, X⊤β*) ≠ 0. (4)

While condition (4) is fairly generic, encompassing cases with a binary outcome, such as logistic regression and one-bit compressive sensing (Boufounos and Baraniuk, 2008), it fails to capture the phase retrieval model (2).
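This distinction is easy to see in simulation. The following sketch (our own illustration, not from the paper; numpy assumed) draws data from the phase retrieval model (2) and estimates both covariances: the first-moment condition (4) fails for this even link, while the second-moment condition introduced below holds.

```python
import numpy as np

# Simulate the phase retrieval model (2): Y = (X'beta*)^2 + eps.
rng = np.random.default_rng(0)
n, d, s = 200_000, 10, 3
beta = np.zeros(d)
beta[:s] = s ** -0.5                      # s-sparse unit vector beta*
X = rng.standard_normal((n, d))           # Gaussian design X ~ N(0, I_d)
index = X @ beta                          # X'beta* ~ N(0, 1)
Y = index ** 2 + rng.standard_normal(n)

cov_linear = np.cov(Y, index)[0, 1]          # condition (4): zero for even links
cov_quadratic = np.cov(Y, index ** 2)[0, 1]  # Cov(Y, (X'beta*)^2) = Var(Z^2) = 2
```

With two hundred thousand draws, the linear covariance is within Monte Carlo error of zero, while the quadratic covariance concentrates near 2.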
More generally, it is easy to see that when the link function f is even in its first coordinate, condition (4) fails to hold. The goal of the present manuscript is to formalize a class of SIMs which includes the noisy phase retrieval model as a special case, in addition to various other additive and non-additive models with even link functions, and to develop a procedure that can successfully estimate the direction of β* up to a global sign. Formally, we consider models (3) with Gaussian design that satisfy the following moment assumption:

Cov(Y, (X⊤β*)²) > 0. (5)

Unlike (4), one can immediately check that condition (5) is satisfied by model (2). In Section 2 we give multiple examples, both abstract and concrete, of SIMs obeying this constraint. Our second moment constraint (5) can be interpreted as a semi-parametric robust version of phase retrieval. Hence, we will refer to the class of models satisfying condition (5) as misspecified phase retrieval (MPR) models. From this point of view it is worth noting that condition (4) relates to linear regression in a way similar to how condition (5) relates to the phase retrieval model. Our motivation for studying SIMs under such a constraint can ultimately be traced to the vast sufficient dimension reduction (SDR) literature. In particular, we would like to point out Li (1992) as a source of inspiration.

Contributions. Our first contribution is to formulate a novel and easily implementable two-step procedure, which consistently estimates the direction of β* in an MPR model. In the first step we solve a semidefinite program producing a unit vector v̂, such that |v̂⊤β*| is sufficiently large. Once such a pilot estimate is available, we solve an ℓ1 regularized least squares problem on the augmented outcome Ỹ_i = (Y_i − Ȳ)X_i⊤v̂, where Ȳ is the average of the Y_i's, to produce a second estimate b̂, which is then normalized to obtain the final refined estimator β̂ = b̂/‖b̂‖₂.
In addition to being universally applicable to MPR models, our procedure has an algorithmic advantage in that it relies solely on convex optimization, and as a consequence we can obtain the corresponding global minima of the two convex programs in polynomial time.

Our second contribution is to rigorously demonstrate that the above procedure consistently estimates the direction of β*. We prove that for a given MPR model, with high probability, one has

min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ ≲ √(s log d / n),

provided that the sample size n satisfies n ≳ s² log d. While the same rates (with different constants) hold for the TWF algorithm of Cai et al. (2015) in the special case of the noisy phase retrieval model, our procedure provably achieves these rates over the broader class of MPR models.

Our third contribution is to formulate a sub-class H of additive MPR models Y = h(X⊤β*) + ε for h ∈ H, for which the above estimation rate turns out to be tight (in the cases with s = O(d^{1−δ}) for some δ > 0). In fact we show that even when the data is generated by the most favorable model out of the sub-class H, we have

inf_{h ∈ H} inf_{β̂} sup_{‖β*‖₀ ≤ s, ‖β*‖₂ = 1} E_{h,β*} min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ ≳ √(s log(d/s)/n).

The model class H encompasses the noisy phase retrieval model (2) as a special case. The bound above implies the standard minimax lower bound over the entire class of MPR models. Note that our bound is strictly stronger than standard minimax lower bounds, since it allows taking the easiest link function out of the class in addition to the worst case parameter β*. In contrast, in standard minimax lower bounds, we take the hardest link function together with the worst case parameter. On the technical side, the proof of the lower bound uses a more transparent technique in comparison with existing proofs of related lower bounds, which are specific to the standard noisy phase retrieval model.
Interestingly, our minimax analysis characterizes the statistical price of misspecifying the link function in phase retrieval models, i.e., the price of agnosticity, which will be discussed in Section 4.

Related Work. The phase retrieval model has received considerable attention in recent years from the statistics, applied mathematics, and signal processing communities. For the non-sparse version of (2), efficient algorithms have been suggested based on both semidefinite programs (Candès et al., 2013; Candès et al., 2015a) and non-convex optimization methods that extend gradient descent (Candès et al., 2015b). Additionally, a non-traditional instance of the phase retrieval model (which also happens to be a special case of the MPR model) was considered by Chen et al. (2013), where the authors suggested an estimation procedure originally proposed for the problem of mixed regression. For the noisy sparse version of model (2), near optimal solutions were achieved with a computationally infeasible program by Lecué and Mendelson (2013b). Subsequently, a tractable gradient descent approach achieving minimax optimal rates was developed by Cai et al. (2015). Abstracting away from the phase retrieval or linear model settings, we note that inference for SIMs in the case when d is small or fixed has been studied extensively in the literature (e.g., Xia and Li, 1999; Horowitz, 2009; Peng and Huang, 2011; McCullagh and Nelder, 1989, among many others). In another line of research on SDR, seminal insights shedding light on condition (4) can be found in, e.g., Li and Duan (1989); Li (1991); Cook and Ni (2005). The modified condition (5) traces its roots to Li (1992), where a procedure was designed to handle precisely the situations where (4) fails to hold. More recently, there have been active developments for high-dimensional SIMs. Plan and Vershynin (2015) and later Thrampoulidis et al.
(2015) demonstrated that under condition (4), running least squares with ℓ1 regularization yields a consistent estimate of the direction of β*, while Neykov et al. (2016) showed that this procedure also recovers the signed support of the direction. Excess risk bounds were derived in Ganti et al. (2015). Very recently, Genzel (2016) extended this observation to other convex loss functions under a condition corresponding to (4), depending implicitly on the loss function of interest. Radchenko (2015) proposed a non-parametric least squares with an equality ℓ1 constraint to handle simultaneous estimation of β* as well as f. Han and Wang (2015) considered a smoothed-out U-process type of loss function with ℓ1 regularization, and proved their approach works for a sub-class of functions satisfying condition (4). None of the aforementioned works on SIMs can be directly applied to tackle the MPR class (5). A generic procedure for estimating sparse principal eigenvectors was developed in Yuan and Zhang (2013). While in principle this procedure can be applied to estimate the direction in MPR models, it requires proper initialization, and in addition, it requires knowledge of the sparsity of the vector β*. We discuss this approach in more detail in Section 5. Regularized procedures have also been proposed for specific choices of f and Y. For example, Yi et al. (2015) studied consistent estimation under the model P(Y = 1 | X) = (h(X⊤β*) + 1)/2 with binary Y, where h: ℝ → [−1, 1] is possibly unknown. Their procedure is based on taking pairs of differences in the outcome, and therefore replaces condition (4) with a different type of moment condition. Yang et al. (2015) considered the model Y = h(X⊤β*) + ε with a known, continuously differentiable, and monotonic h, and developed estimation and inferential procedures based on the ℓ1 regularized quadratic loss, in a similar spirit to the TWF algorithm suggested by Cai et al. (2015).
In conclusion, although there exists much prior related work, to the best of our knowledge, none of the available literature discusses the MPR models in the generality we attempt in the present manuscript.

Notation. In this section we briefly outline some commonly used notation. Other notation will be defined as needed throughout the paper. For a (sparse) vector v = (v1, ..., vp)⊤, we let S_v := supp(v) = {j : v_j ≠ 0} denote its support, ‖v‖_p denote the ℓp norm (with the usual extension when p = ∞), and v^{⊗2} = vv⊤ is a shorthand for the outer product. With a standard abuse of notation we will denote by ‖v‖₀ = |supp(v)| the cardinality of the support of v. For a matrix A we denote the max and ℓ2 norms by ‖A‖_max = max_{i,j} |A_{ij}| and ‖A‖₂ = sup_{‖v‖₂=1} ‖Av‖₂ respectively. If A is symmetric we denote its spectrum, ordered in a decreasing manner, by λ_j(A). We often use I_d to denote a d × d identity matrix. We will also use ⊙ to denote the Hadamard (or element-wise) product, and the dot product will sometimes be denoted with the angle notation ⟨·, ·⟩. For a real random variable X, define

‖X‖_{ψ2} = sup_{p ≥ 1} p^{−1/2} (E|X|^p)^{1/p},  ‖X‖_{ψ1} = sup_{p ≥ 1} p^{−1} (E|X|^p)^{1/p}.

Recall that a random variable is called sub-Gaussian if ‖X‖_{ψ2} < ∞ and sub-exponential if ‖X‖_{ψ1} < ∞ (e.g., Vershynin, 2010). For any integer k ∈ ℕ we use the shorthand notation [k] = {1, ..., k}. We also use standard asymptotic notation. Given two sequences {a_n}, {b_n} we write a_n = O(b_n) if there exists a constant C < ∞ such that a_n ≤ C b_n, a_n = Ω(b_n) if there exists a positive constant c > 0 such that a_n ≥ c b_n, and a_n ≍ b_n if there exist positive constants c and C such that c < a_n/b_n < C.

Organization. In Sections 2 and 3 we introduce the MPR model class and our estimation procedure, and Section 4 is dedicated to our main results. A brief discussion is provided in Section 6. We defer the technical proofs to the appendices due to space limitations.

2. MPR Models

In this section we formally introduce MPR models.
In detail, we argue that the class of such models is sufficiently rich, including numerous models of interest. Motivated by the setup in the sparse noisy phase retrieval model (2), we assume throughout the remainder of the paper that X ∼ N(0, I_d). Let ‖·‖_{ψ1} be the sub-exponential norm of a random variable, which is also known as the Orlicz ψ1 norm. We begin our discussion with a formal definition.

Definition 1 (MPR Models). Assume that we are given model (3), where X ∼ N(0, I_d), ε ⟂ X and β* ∈ ℝ^d is an s-sparse unit vector, i.e., ‖β*‖₂ = 1. We call such a model a misspecified phase retrieval (MPR) model, if the link function f and noise ε further satisfy, for Z ∼ N(0, 1) and some K > 0,

c0 := Cov(f(Z, ε), Z²) > 0, (6)
‖Y‖_{ψ1} ≤ K. (7)

Both assumptions (6) and (7) impose moment restrictions on the random variable Y. Assumption (6) states that Y is positively correlated with the random variable (X⊤β*)², while assumption (7) requires Y to have somewhat light tails. Also, as mentioned in the introduction, the unit norm constraint on the vector β* is required for the identifiability of model (3). We remark that the class of MPR models is convex in the sense that if we have two MPR models f1(X⊤β*, ε) and f2(X⊤β*, ε), all models generated by their convex combinations λf1(X⊤β*, ε) + (1 − λ)f2(X⊤β*, ε) (λ ∈ [0, 1]) are also MPR models. It is worth noting that the > direction in (6) is assumed without loss of generality: if Cov(Y, (X⊤β*)²) < 0 one can apply the same algorithm to −Y. However, knowledge of the direction of the inequality is important. In the following, we restate condition (6) in a more convenient way, enabling us to easily calculate the explicit value of the constant c0 in several examples.

Proposition 2. Assume that there exists a version of the map ϕ: z ↦ E[f(Z, ε) | Z = z] such that E D²ϕ(Z) > 0, where D² is the second distributional derivative of ϕ and Z ∼ N(0, 1). Then the SIM (3) satisfies assumption (6) with c0 = E D²ϕ(Z).
We now provide three concrete MPR models as warm-up examples for the more general examples discussed in Proposition 3 and Remark 4. Consider the models:

Y = (X⊤β*)² + ε, (8)
Y = |X⊤β*| + ε, (9)
Y = |X⊤β* + ε|, (10)

where ε ⟂ X is sub-exponential noise, i.e., ‖ε‖_{ψ1} ≤ K_ε for some K_ε > 0. Model (8) is the noisy phase retrieval model considered by Cai et al. (2015), while models (9) and (10) were both discussed in Chen et al. (2013), where the authors proposed a method to solve model (10) in the low-dimensional regime. Below we briefly explain why these models satisfy conditions (6) and (7). First, observe that in all three models we have a sum of two sub-exponential random variables, and hence by the triangle inequality it follows that the random variable Y is also sub-exponential, which implies (7). To understand why (6) holds, by applying Proposition 2 we have c0 = E[2] = 2 > 0 for model (8), c0 = E[2δ₀(Z)] = 2/√(2π) > 0 for model (9), and c0 = E[2δ₀(Z + ε)] = 2Eφ(ε) > 0 for model (10), where δ₀(·) is the Dirac delta function centered at zero, and φ is the density of the standard normal distribution. Admittedly, calculating the second distributional derivative could be a laborious task in general. In the remainder of this section we set out to find a simple-to-check generic sufficient condition on the link function f and error term ε, under which both (6) and (7) hold. Before giving our result, note that the condition E D²ϕ(Z) > 0 is implied whenever ϕ is strictly convex and twice differentiable. However, strictly convex functions ϕ may violate assumption (7), as they can inflate the tails of Y arbitrarily (consider, e.g., f(x, ε) = x⁴ + ε). Moreover, the functions in examples (9) and (10) fail to be twice differentiable. In the following result we handle those two problems, and in addition we provide a more generic condition than convexity, which suffices to ensure the validity of (6).

Proposition 3. The following statements hold.
(i) Let the function ϕ defined in Proposition 2 be such that the map z ↦ ϕ(z) + ϕ(−z) is non-decreasing on ℝ⁺ ∪ {0}, and there exist z1 > z2 > 0 such that ϕ(z1) + ϕ(−z1) > ϕ(z2) + ϕ(−z2). Then (6) holds.

(ii) A sufficient condition for (i) to hold is that z ↦ ϕ(z) is convex and sub-differentiable at every point z ∈ ℝ, and there exists a point z0 ∈ ℝ⁺ ∪ {0} satisfying ϕ(z0) + ϕ(−z0) > 2ϕ(0).

(iii) Assume that there exist functions g1, g2 such that f(Z, ε) ≤ g1(Z) + g2(ε), and g1 is essentially quadratic in the sense that there exists a closed interval I = [a, b] with 0 ∈ I, such that for all z satisfying g1(z) ∈ I^c we have |g1(z)| ≤ Cz² for a sufficiently large constant C > 0, and let g2(ε) be sub-exponential. Then (7) holds.

Remark 4. Proposition 3 shows that the class of MPR models is sufficiently broad. By (i) and (ii) it immediately follows that the additive models

Y = h(X⊤β*) + ε, (11)

where the link function h is even and increasing on ℝ⁺ ∪ {0}, or convex, satisfy the covariance condition (6) by (i) and (ii) of Proposition 3 respectively. If h is also essentially quadratic and ε is sub-exponentially distributed, using (iii) we can deduce that Y in (11) is a sub-exponential random variable, and hence under these restrictions model (11) is an MPR model. Both examples (8) and (9) take this form. Additionally, Proposition 3 implies that the model

Y = h(X⊤β* + ε) (12)

satisfies (6) whenever the link h is a convex sub-differentiable function, such that h(z0) + h(−z0) > 2h(0) for some z0 > 0, E|h(z + ε)| < ∞ for all z ∈ ℝ and E|h(Z + ε)| < ∞. This conclusion follows because under the latter conditions the function ϕ(z) = Eh(z + ε) satisfies (ii), which is proved in Appendix B under Lemma 13. Moreover, if it turns out that h is essentially quadratic and h(2ε) is sub-exponential, then by Jensen's inequality we have 2h(Z + ε) ≤ h(2Z) + h(2ε), and hence (iii) implies that (7) is also satisfied. Model (10) is of the type (12).
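The c0 values derived for models (8)-(10) via Proposition 2 can be checked by simulation. A minimal sketch (ours, not the paper's code; numpy assumed, taking ε ∼ N(0,1) throughout, in which case 2Eφ(ε) = 2∫φ(x)² dx = 1/√π):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
Z = rng.standard_normal(n)
eps = rng.standard_normal(n)

def c0_hat(Y):
    """Monte Carlo estimate of c0 = Cov(Y, Z^2)."""
    return np.cov(Y, Z ** 2)[0, 1]

c0_model8 = c0_hat(Z ** 2 + eps)       # Proposition 2: c0 = 2
c0_model9 = c0_hat(np.abs(Z) + eps)    # Proposition 2: c0 = 2/sqrt(2*pi) ~ 0.798
c0_model10 = c0_hat(np.abs(Z + eps))   # Proposition 2: c0 = 2*E[phi(eps)] = 1/sqrt(pi) ~ 0.564
```

All three estimates land within Monte Carlo error of the closed-form constants, illustrating that condition (6) holds strictly for each model.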
Unlike the additive noise models (11), models (12) allow noise corruption even within the argument of the link function. On the negative side, it should be apparent that (6) fails to hold in cases where ϕ is an odd function, i.e., ϕ(−z) = −ϕ(z). In many such cases (e.g., when ϕ is monotone, or non-constant and non-positive/non-negative on ℝ⁺), one would have Cov(Y, X⊤β*) = E[ϕ(Z)Z] ≠ 0, and hence a direct application of the ℓ1 regularized least squares algorithm is possible, as we discussed in the introduction.

3. Agnostic Estimation for MPR

In this section we describe and motivate our two-step procedure, which consists of a convex relaxation and an ℓ1 regularized least squares program, for performing estimation in the MPR class of models described by Definition 1. We begin our motivation by noting that any MPR model satisfies the following inequality:

Cov((Y − µ)X⊤β*, X⊤β*) = E{(Y − µ)(X⊤β*)²} = Cov(f(Z, ε), Z²) = c0 > 0, (13)

where we have denoted µ := EY. This simple observation plays a major role in the motivation of our procedure. Notice that, in view of condition (4), inequality (13) implies that if instead of observing Y we had observed Ỹ = g(X⊤β*, ε) = (Y − µ)X⊤β*, we would be in a position to apply the LASSO on Ỹ. However, there is no direct way of generating the random variable Ỹ, as doing so would require the knowledge of β* and the mean µ.

Figure 1: The geometry of decomposition (14), in a case when v̂⊤β* > 0.

Figure 2: An illustration of the estimates v̂ and β̂ produced by the first and second steps of Algorithm 1. (a) Initialization: after the first step we can guarantee that the vector β* belongs to one of two spherical caps which contain all vectors w such that |v̂⊤w| ≥ κ for some constant κ > 0, provided that the sample size n ≳ s² log d is sufficiently large. (b) Second step: after the second step we can guarantee that the vector β* belongs to one of two spherical caps, which are shrinking with (n, s, d) at an optimal rate.
Here, we propose to first roughly estimate β* by a vector v̂, use an empirical estimate Ȳ of µ, and then obtain the ℓ1 regularized least squares estimate on the augmented variable Ỹ = (Y − Ȳ)X⊤v̂ to sharpen the convergence rate. At first glance it might appear counter-intuitive that introducing a noisy estimate of β* can lead to consistent estimates, as the so-defined Ỹ variable depends on the projection of X on span{β*, v̂}. Decompose

v̂ = (v̂⊤β*)β* + v̂⊥, (14)

where v̂⊥ ⟂ β*. A visualization of decomposition (14) is provided in Figure 1 for convenience. To better motivate this proposal, in the following we analyze the population least squares fit, based on the augmented variable Y̌ = (Y − µ)X⊤v̂ for some fixed unit vector v̂ with decomposition (14). Writing out the population solution for least squares yields:

[E X^{⊗2}]^{−1} E[X Y̌] = E[X(Y − µ)X⊤(v̂⊤β*)β*] (=: I1) + E[X(Y − µ)X⊤v̂⊥] (=: I2). (15)

We will now argue that the left hand side of (15) is proportional to β*. First, we observe that I1 = c0(v̂⊤β*)β*, since multiplying by any vector b ⟂ β* yields b⊤I1 = 0 by independence. Second, and perhaps more importantly, we have that I2 = 0. To see this, we first take a vector b ∈ span{β*, v̂⊥}⊥. Since the three variables b⊤X, Y − µ and v̂⊥⊤X are independent, we have b⊤I2 = 0. Multiplying by β* we have β*⊤I2 = 0, since β*⊤X(Y − µ) is independent of X⊤v̂⊥. Finally, multiplying by v̂⊥ yields I2⊤v̂⊥ = 0, since (X⊤v̂⊥)² is independent of Y − µ. It is noteworthy that the above derivation crucially relies on the fact that the Y variable was centered and the vector v̂ was fixed. In what follows we formulate a pilot procedure which produces an estimate v̂ such that |v̂⊤β*| ≥ κ > 0. A proper initialization can be achieved by using a spectral method, such as the Principal Hessian Directions (PHD) proposed by Li (1992). Cast into the framework of SIMs, the PHD framework implies the following simple observation:

Lemma 5. If we have an MPR model, then argmax_{‖v‖₂=1} v⊤ E[Y(X^{⊗2} − I)]v = {−β*, β*}.
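Lemma 5 can be illustrated numerically: for data from the noisy phase retrieval model, the principal eigenvector of the sample version of E[Y(X^{⊗2} − I)] aligns with ±β*. A simulation sketch (ours, not the paper's code; numpy assumed, with d kept small so no sparsity is exploited):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, s = 50_000, 20, 3
beta = np.zeros(d)
beta[:s] = s ** -0.5
X = rng.standard_normal((n, d))
Y = (X @ beta) ** 2 + rng.standard_normal(n)

# Sample version of E[Y (X X' - I_d)]; its population value is c0 * beta beta'.
Sigma_hat = (X * Y[:, None]).T @ X / n - Y.mean() * np.eye(d)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
v = eigvecs[:, -1]                 # principal eigenvector
alignment = abs(v @ beta)          # close to 1, up to the global sign
```

The top eigenvalue concentrates near c0 = 2, and the eigenvector recovers the direction of β* up to sign, matching the lemma.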
Note that similar observations have been made in the non-sparse well-specified case by other authors in the phase retrieval literature (Candès et al., 2015b; Lu and Li, 2017; Mondelli and Montanari, 2017). A proof of Lemma 5 can be found in Appendix B. Lemma 5 encourages us to look into the following sample version of the maximization problem:

argmax_{‖v‖₂=1, ‖v‖₀=s} n^{−1} v⊤ Σ_{i=1}^n [Y_i(X_i^{⊗2} − I)] v, (16)

which targets a restricted (s-sparse) principal eigenvector. Unfortunately, solving such a problem is a computationally intensive task, and requires knowledge of s. Here we take a standard route of relaxing the above problem to a convex program, and solving it efficiently via semidefinite programming (SDP). A similar in spirit SDP relaxation for solving sparse PCA problems was originally proposed by d'Aspremont et al. (2007). Instead of solving (16), define Σ̂ = n^{−1} Σ_{i=1}^n Y_i(X_i^{⊗2} − I), and solve the following convex program:

Â = argmax_{tr(A)=1, A ∈ S₊^d} tr(Σ̂A) − λ_n Σ_{i,j=1}^d |A_{ij}|, (17)

where S₊^d is the convex cone of positive semidefinite matrices, and λ_n is a regularization parameter encouraging element-wise sparsity in the matrix A. The hope in introducing the optimization program above is that Â will be a good first estimate of β*^{⊗2}. In practice it could turn out that the matrix Â is not rank one, hence we suggest taking v̂ as the principal eigenvector of Â. In theory we show that with high probability the matrix Â will indeed be of rank one. Observation (15), Lemma 5 and the SDP formulation motivate the agnostic two-step estimation procedure for misspecified phase retrieval in Algorithm 1.

Algorithm 1
input: (Y_i, X_i)_{i=1}^n: data; λ_n, ν_n: tuning parameters
1. Split the sample into two approximately equal sets S1, S2, with |S1| = ⌊n/2⌋, |S2| = ⌈n/2⌉.
2. Let Σ̂ := |S1|^{−1} Σ_{i ∈ S1} Y_i(X_i^{⊗2} − I_d). Solve (17). Let v̂ be the first eigenvector of Â.
3. Let Ȳ = |S2|^{−1} Σ_{i ∈ S2} Y_i.
Solve the following program:

b̂ = argmin_b (2|S2|)^{−1} Σ_{i ∈ S2} ((Y_i − Ȳ)X_i⊤v̂ − X_i⊤b)² + ν_n ‖b‖₁. (18)

4. Return β̂ := b̂/‖b̂‖₂.

The sample split is required to ensure that after decomposition (14), the vector v̂⊥ and the value v̂⊤β* are independent of the remaining sample. In Section 4 we demonstrate that Algorithm 1 succeeds with the optimal (in the noisy regime) ℓ2 rate √(s log d / n), provided that s² log d ≲ n. The latter requirement on the sample size suffices to guarantee that the solution Â of optimization program (17) is rank one. Figure 2 illustrates the two steps of Algorithm 1. In addition to our main procedure, we propose an optional refinement step (Algorithm 2) in which one applies steps 3 and 4 of Algorithm 1 on the full dataset, using the output vector β̂ of Algorithm 1, without hurting the rate. Doing so can potentially result in additional stability and further refinements of the rate constant.

Algorithm 2 (Optional Refinement)
input: (Y_i, X_i)_{i=1}^n: data; ν'_n: tuning parameter; output β̂ from Algorithm 1
5. Let Ȳ = n^{−1} Σ_{i ∈ [n]} Y_i. Solve the following program:

b̂ = argmin_b (2n)^{−1} Σ_{i=1}^n ((Y_i − Ȳ)X_i⊤β̂ − X_i⊤b)² + ν'_n ‖b‖₁. (19)

6. Return β̂ := b̂/‖b̂‖₂.

4. Main Results

In this section we present our main results, which consist of theoretical justification of our procedures, as well as lower bounds for certain types of SIM (3). To simplify the presentation in this section, we slightly change the notation and assume that the sample size is 2n, with S1 = [n] and S2 = {n + 1, ..., 2n}. Of course this abuse of notation does not restrict our analysis to only even sample size cases. In Proposition 15, we show that the first step of Algorithm 1 narrows down the search for the direction of β* to a union of two spherical caps (i.e., the estimate v̂ satisfies |v̂⊤β*| ≥ κ for some constant κ > 0; see also Figure 2a).
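To make the two-step structure above concrete, here is a simplified numpy sketch of Algorithm 1. It is not the paper's implementation: the SDP (17) is replaced by a plain spectral initialization (adequate when d is modest relative to n), program (18) is solved by ISTA rather than a generic solver, and the function names and tuning value are our own choices.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def mpr_two_step(X, Y, nu, n_iter=500):
    """Simplified sketch of Algorithm 1: spectral init + l1 least squares."""
    n, d = X.shape
    half = n // 2
    X1, Y1 = X[:half], Y[:half]          # sample S1 (initialization)
    X2, Y2 = X[half:], Y[half:]          # sample S2 (refinement)

    # Step 1 (stand-in for SDP (17)): principal eigenvector of Sigma_hat.
    Sigma_hat = (X1 * Y1[:, None]).T @ X1 / half - Y1.mean() * np.eye(d)
    v = np.linalg.eigh(Sigma_hat)[1][:, -1]

    # Step 2, program (18): lasso on the augmented outcome
    # Y_tilde_i = (Y_i - Ybar) * X_i'v, solved by ISTA.
    Y_tilde = (Y2 - Y2.mean()) * (X2 @ v)
    m = X2.shape[0]
    step = m / np.linalg.norm(X2, 2) ** 2     # 1/L for the quadratic part
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = X2.T @ (X2 @ b - Y_tilde) / m
        b = soft_threshold(b - step * grad, step * nu)
    return b / np.linalg.norm(b)              # step 4: normalize

rng = np.random.default_rng(3)
n, d, s = 10_000, 50, 3
beta = np.zeros(d)
beta[:s] = s ** -0.5
X = rng.standard_normal((n, d))
Y = (X @ beta) ** 2 + rng.standard_normal(n)  # model (8)
beta_hat = mpr_two_step(X, Y, nu=0.1)
error = min(np.linalg.norm(beta_hat - beta), np.linalg.norm(beta_hat + beta))
```

The direction is only identified up to a global sign, hence the minimum over ±β* in the error; on draws of this size the error is small.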
Our main result demonstrates that, in combination with program (18), this suffices to recover the direction of β* at an optimal rate with high probability.

Theorem 6. There exist values λ_n, ν_n ≍ √(log d / n) and a constant R > 0 depending on f and ε, such that if s√(log d / n) < R and log(d) log²(n)/n = o(1), the output of Algorithm 1 satisfies:

sup_{‖β*‖₂=1, ‖β*‖₀≤s} P_{β*}( min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ > L √(s log d / n) ) ≤ O(d^{−1} ∨ n^{−1}), (20)

where L is a constant depending solely on f and ε.

We remark that although the estimation rate is of the order √(s log d / n), our procedure still requires that s√(log d / n) is sufficiently small. This phenomenon is similar to what has been observed by Cai et al. (2015), and it is our belief that this requirement cannot be relaxed for computationally feasible algorithms. We would further like to mention that while in bound (20) we control the worst case probability of failure, it is less clear whether the estimate β̂ is universally consistent (i.e., whether the sup can be moved inside the probability in (20)). In the following theorem we show that the additional refinement in Algorithm 2 does not hurt the rate obtained in Theorem 6. The practical benefits of such an additional improvement will become evident in the numerical studies section.

Theorem 7. Assume that the data satisfies the assumptions in Theorem 6 with sufficiently small R and tuning parameter ν_n. Additionally, let s log²(n)/n = O(1) and n > R′ s²(log d ∨ log n) for some sufficiently large R′ > 0. Then there exists a value ν'_n for which the output of Algorithm 2 satisfies:

sup_{‖β*‖₂=1, ‖β*‖₀≤s} P_{β*}( min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ > L √(s log d / (2n)) ) ≤ O(d^{−1} ∨ (2n)^{−1} ∨ exp(−Ω(s))),

where L is a constant depending solely on f and ε.

We would like to point out that the proof of this result is a non-trivial consequence of our main result, Theorem 6.
Notice that Theorem 7 does not guarantee that the refinement will result in a strictly better estimate, although intuitively one expects to improve the precision and stabilize the estimator by using the full dataset. The proof can further be seen to imply that one can repeat the optional refinement a (fixed and finite) number of times without hurting the rate. We discourage iterating the refinement step more than a couple of times since, first, this might hurt the probability bound in Theorem 7, and second, it could potentially increase the rate constant. In practice we observe that running Algorithm 2 once typically suffices to obtain satisfactory results.

Lower Bound. In this section we investigate minimax lower bounds in the noisy setting. Comments on the performance of algorithms that solve the SIM (3) under the constraints (6) and (7) in the noiseless setting can be found towards the end of this section. We begin by defining the following class of functions:

C(K, M, L) := {h : h is K times cont. diff., ∀k ≤ K : E_{Z∼N(0,1)}(h^{(k)}(Z))² ≤ M, ‖h^{(K)}‖_Lip ≤ L}, (21)

where ‖g‖_Lip := sup_{x ≠ y} |g(x) − g(y)|/|x − y| for g: ℝ → ℝ. Notice that in definition (21) we allow for K = 0, in which case we simply require the function h to be L-Lipschitz. Next, define the classes C(L) := C(0, M, L) and H(K, M, L) := ∪_{k=0}^K C(k, M, L). Fix K, M, L ∈ ℝ⁺. We will use H̄(K, M, L) := conv{H(K, M, L)} to denote the convex hull of the functional class H(K, M, L). We remind the reader that taking the convex hull amounts to the following collection of functions: {f : ∃N ∈ ℕ, {g_i}_{i=1}^N ⊆ H(K, M, L), {w_i}_{i=1}^N ⊆ ℝ⁺ ∪ {0}, Σ_{i=1}^N w_i = 1 s.t. f = Σ_{i=1}^N w_i g_i}. For completeness we note that taking the convex hull of the class C(L) is tautological, as C(L) = conv{C(L)}.

Theorem 8. Let us observe n data points from the additive SIM:

Y = h(X⊤β*) + ε.
(22)

Assuming that s log(d/(4s)) ≥ 12 log 2, the following information theoretic lower bounds hold:

(i) Assume p_ε(x) ∝ exp(−P(x²)), where P is a known, non-constant, finite degree polynomial with non-negative coefficients such that P(0) = 0. Then

inf_{h ∈ C(L)} inf_{β̂} sup_{‖β*‖₂=1, ‖β*‖₀≤s} E_{h,β*} min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ ≥ C_{P,L} √(s log(d/(4s)) / n), (23)

where C_{P,L} is a constant depending on P and L.

(ii) What is more, in the special case ε ∼ N(0, σ²), we have:

inf_{h ∈ H̄(K,M,L)} inf_{β̂} sup_{‖β*‖₂=1, ‖β*‖₀≤s} E_{h,β*} min_{η ∈ {−1,1}} ‖β̂ − ηβ*‖₂ ≥ σ √( s log(d/(4s)) / (12 n (KM + L²)) ) ∧ 1. (24)

A few remarks regarding known lower bounds of this nature are in order. General lower bounds for models with additive Gaussian noise and sub-Gaussian designs have been established by Lecué and Mendelson (2013a). These bounds differ from the bounds (23) and (24) in two major aspects: first, cast in our setting, the bounds of Lecué and Mendelson (2013a) lower bound the prediction accuracy, rather than the estimation error; and second, our lower bound takes an infimum over the functional classes C(L) or H̄(K, M, L), in the spirit of the lower bound produced by Yi et al. (2015). Such bounds have the compelling interpretation that even with the easiest function living in C(L) or H̄(K, M, L), one cannot hope to do better than (23) or (24) in the corresponding noise variants (i) or (ii). Note that the bound above is much stronger than standard minimax lower bounds, which take the hardest link function together with the worst case parameter. Consequently, the bound above implies the standard minimax lower bound within the class of MPR models. Additionally, in contrast to the bound proved in Lecué and Mendelson (2013a), our lower bound is established via a classical Fano inequality type of argument, and hence the proof is more transparent in nature. For any fixed k ∈ ℕ, it can be shown (see Lemma 14) that monomials of the type h(x) = x^k belong to the class C(k−1, (k!)², k!) ⊆ H(k−1, (k!)², k!).
Also, clearly we have |x| ∈ C(1), and hence {|x|, x²} ⊂ H(1, 4, 2). The latter shows that models (8) and (9) belong to H(1, 4, 2). In this sense, in the case where ‖β*‖₂ = 1, Theorem 3.2 in Cai et al. (2015), a version of which is proved in Lecué and Mendelson (2013b), is implied by Theorem 8 (ii). On an important note, H(1, 4, 2) also contains models that are not captured by our framework, such as the linear model Y = X^⊤β* + ε. In conclusion, as we illustrated, the classes H̄(K, M, L) have a nonempty intersection with the class of MPR models of the type (11). For all models lying in this intersection, as soon as σ ≥ σ₀ > 0 in case (ii), Algorithms 1 and 2 are minimax optimal up to a constant multiplier.

Statistical Price of Agnosticity. The lower bounds in Theorem 8 hold only in cases where noise is present in the model. If one is specifically interested in solving the noiseless model (8), for example, the TWF algorithm can recover the vector with essentially no error, as demonstrated by Cai et al. (2015). Our procedure described in Algorithm 1, however, does not come with such a guarantee for the noiseless model (8), although (20) continues to hold. In this section we argue that having an algorithm with no error universally over the entire MPR class is generally impossible, even in the noiseless case, in the worst-case sense. To see why, consider the model:

Y = sign(X^⊤β* + c),   (25)

for a fixed c < 0. It is easy to check that the map z ↦ sign(z + c) + sign(−z + c) is non-decreasing and non-constant on R₊. By Proposition 3 we conclude that model (25) satisfies (6) and (7). Model (25) is very similar to one-bit compressive sensing, introduced by Boufounos and Baraniuk (2008); the only difference is that in the original formulation the constant c = 0. Theorem 1 of Jacques et al. (2013) establishes a lower bound for the one-bit compressive sensing model in the noiseless case: they lower bound the worst-case error by at least s/n whenever n ≳ s^{3/2}.
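The monotonicity claim for model (25) is easy to check numerically. The sketch below is our own illustration: it tabulates g(z) = sign(z + c) + sign(−z + c) on a grid over R₊ for an arbitrarily chosen c < 0 and confirms that g is non-decreasing and non-constant (it steps from −2 up to 0 at z = |c|).

```python
import numpy as np

c = -0.75                                  # any fixed c < 0
z = np.linspace(0.0, 3.0, 301)             # grid on R_+
g = np.sign(z + c) + np.sign(-z + c)       # g(z) = sign(z + c) + sign(-z + c)

non_decreasing = bool(np.all(np.diff(g) >= 0))
non_constant = bool(g.min() < g.max())
```

For c = −0.75 the values are −2 on [0, 0.75) and 0 on (0.75, 3], so both flags are True, matching the condition required by Proposition 3.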
By tracing the proof of Theorem 1 of Jacques et al. (2013), we realize that the sole obstruction to whether their statement holds in the case of (25) is whether the number of quantization points, i.e., points {Y_i}_{i=1}^n that can be generated by model (25) for any β* and X, is larger than the upper bound shown in Lemma 1 of Jacques et al. (2013). In Lemma 36 of Appendix F we argue that the number of such points, for any fixed c < 0, can be roughly bounded from above by (d choose s)·(ne)^{s+1}/(s + 1)^{s+1}. Repeating the volume argument of the proof of Theorem 1 of Jacques et al. (2013), we conclude that when solving model (25) one cannot hope to achieve a worst-case error rate better than

(s + 1)^{1+1/s} / [ (en)^{1+1/s} + 2s^{3/2+1/s} ] ≳ (s + 1)^{1+1/s} / (en)^{1+1/s},

where the last inequality holds when n ≳ s^{3/2}. Our algorithm would appear to perform sub-optimally in such a situation, by roughly a factor of √(s/n) in the statistical error. However, we would like to point out that it is possible that the above lower bound is not tight, and that there might exist other pathological examples of link functions f producing tighter lower bounds. In addition, we have not used the fact that the algorithm does not utilize a priori knowledge of the link function f. We leave the important question of investigating the tightness of the lower bound in the noiseless case for future work.

5. Numerical Experiments

In this section we provide numerical experiments based on the three models (8), (9) and (10), where the random variable ε ∼ N(0, 1). All models are compared with the Truncated Power Method (TPM), proposed in Yuan and Zhang (2013). For model (8) we also compare the results of our approach to the ones given by the TWF algorithm of Cai et al. (2015). Our setup is as follows. In all scenarios the vector β* was held fixed at β* = (s^{−1/2}, …, s^{−1/2}, 0, …, 0)^⊤, with s non-zero entries followed by d − s zeros. Since our theory requires that n ≳ s² log d, we set up four different sample sizes n ≈ θ s² log d, where θ ∈ {4, 8, 12, 16}.
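As a concrete illustration of this setup, the following sketch (our own code, not the paper's; it takes model (8) to be the sparse noisy phase retrieval model Y = (X^⊤β*)² + ε from (2), and the helper names are ours) generates one synthetic dataset with the experiment's scalings:

```python
import numpy as np

def make_beta(d, s):
    """Unit-norm s-sparse signal: s leading entries equal to s**-0.5, rest zero."""
    beta = np.zeros(d)
    beta[:s] = s ** -0.5
    return beta

def simulate_phase_retrieval(n, d, s, rng):
    """Draw (Y, X) from the noisy phase retrieval model Y = (X'beta)^2 + eps."""
    beta = make_beta(d, s)
    X = rng.standard_normal((n, d))      # Gaussian design X ~ N(0, I_d)
    eps = rng.standard_normal(n)         # noise eps ~ N(0, 1)
    Y = (X @ beta) ** 2 + eps
    return Y, X, beta

rng = np.random.default_rng(0)
d, theta = 200, 8
s = int(np.ceil(np.log(d)))              # sparsity scales like log d
n = int(theta * s ** 2 * np.log(d))      # sample size n ~ theta * s^2 * log d
Y, X, beta = simulate_phase_retrieval(n, d, s, rng)
```

Since ‖β*‖₂ = 1, the responses have mean E[Y] = E[Z²] = 1, which gives a quick sanity check on the simulated data.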
We let s depend on the dimension d, taking s ≍ log d. In addition to the suggested approach in Algorithm 1, we also provide results using the refinement procedure (Algorithm 2). We also provide the values of two warm starts of our algorithm, produced by solving program (17) with half of the data and with the full data, respectively. It is evident that in all scenarios the second step of Algorithms 1 and 2 outperforms the warm start from the SDP, except in Figure 3 (b), (c), (e), where the sample size is simply too small for the warm start on half of the data to be accurate. All values we report are based on an average over 100 simulations. The SDP parameter was kept at a constant value (0.015) throughout all simulations, and we observed that varying this parameter had little influence on the final SDP solution. To select the νn parameter for (18), a pre-specified grid of parameters {ν₁, …, ν_l} was selected, and the following heuristic procedure based on K-fold cross-validation was used. We divide S₂ into K = 5 approximately equally sized non-intersecting sets S₂ = ∪_{j∈[K]} S̃₂^j. For each j ∈ [K] and k ∈ [l] we run (18) on the set ∪_{r∈[K], r≠j} S̃₂^r with tuning parameter νn = ν_k to obtain an estimate β̂_{k,S̃₂^j}. Lemma 5 then justifies the following criterion for selecting ν̂n = ν_l̂, where

l̂ = argmax_{k∈[l]} Σ_{j∈[K]} Σ_{i∈S̃₂^j} Y_i (X_i^⊤ β̂_{k,S̃₂^j})².

Our experience suggests that this approach works well in practice, provided that the values {ν₁, …, ν_l} are selected within an appropriate range and are of magnitude √(log d/n). Since the TPM algorithm requires an estimate of the sparsity s, we tuned it as suggested in Section 4.1.2 of Yuan and Zhang (2013). In particular, for each scenario we considered the set of possible sparsities K = {s, 2s, 4s, 8s}. For each k ∈ K the algorithm is run on the first part of the data S₁ to obtain an estimate β̂_k, and the final estimate is taken to be β̂_k̂, where k̂ is given by

k̂ = argmax_{k∈K} β̂_k^⊤ [ |S₂|^{−1} Σ_{i∈S₂} Y_i (X_i X_i^⊤ − I_d) ] β̂_k.
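The cross-validation heuristic for νn can be sketched as follows. This is a minimal illustration of ours, not the paper's code: `soft_threshold_fit` is a toy stand-in for solving program (18) at a given νn (plain least squares followed by soft-thresholding and normalization), and any solver with the same signature can be passed in its place.

```python
import numpy as np

def fold_indices(n, K, rng):
    """Split {0,...,n-1} into K roughly equal disjoint folds."""
    perm = rng.permutation(n)
    return np.array_split(perm, K)

def select_nu(Y, X, nu_grid, fit, K=5, seed=0):
    """K-fold heuristic for the second-step tuning parameter nu_n.

    For each fold j and each nu_k, `fit` is run on the other K-1 folds;
    the selected nu maximizes sum_j sum_{i in fold j} Y_i (X_i' beta_hat)^2,
    the empirical version of the criterion justified by Lemma 5.
    """
    rng = np.random.default_rng(seed)
    folds = fold_indices(len(Y), K, rng)
    scores = np.zeros(len(nu_grid))
    for k, nu in enumerate(nu_grid):
        for j, hold in enumerate(folds):
            train = np.concatenate([f for r, f in enumerate(folds) if r != j])
            b = fit(Y[train], X[train], nu)
            scores[k] += np.sum(Y[hold] * (X[hold] @ b) ** 2)
    return nu_grid[int(np.argmax(scores))], scores

def soft_threshold_fit(Y, X, nu):
    """Toy stand-in for program (18): least squares, soft-thresholded, normalized."""
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    b = np.sign(b) * np.maximum(np.abs(b) - nu, 0.0)
    nrm = np.linalg.norm(b)
    return b / nrm if nrm > 0 else b
```

Held-out folds are scored, rather than the training folds, which matches the description above: the estimate fitted without fold j is evaluated on fold j.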
The TPM is run for 2,000 iterations. In the case of phase retrieval, the TWF algorithm was also run for a total of 2,000 iterations, using the tuning parameters originally suggested in Cai et al. (2015). As expected, the TWF algorithm, which targets the sparse phase retrieval model in particular, outperforms our approach when the sample size n is small; however, our approach performs comparably to the TWF, and in fact even slightly better, once we increase the sample size. It is possible that the TWF algorithm can perform better if it is run for longer than 2,000 iterations,

[Figure 3 appears here: nine panels of estimation error against θ ∈ {4, 8, 12, 16}. Panels (a), (d), (g) show model (8) with d = 200, 300, 400 respectively; panels (b), (e), (h) show model (9); panels (c), (f), (i) show model (10). Each panel compares the initialization (Init), the second step, the initialization on the full data, the refined estimate, the TPM, and, for model (8), the TWF on the full data.]

Figure 3: Simulation results for the three examples considered in Section 2, in three different settings for the dimension d ∈ {200, 300, 400}.
Here the parameter θ ≈ n/(s² log d) describes the relationship between the sample size, the dimension, and the sparsity of the signal. Algorithm 2 dominates in most settings, with exceptions when θ is too small, in which case none of the approaches provides meaningful results.

although in most cases it appeared to have converged to its final value. The results are visualized in Figure 3 above. The TPM algorithm has performance comparable to that of Algorithm 1, is always worse than the estimate produced by Algorithm 2, needs an initialization (for which the first step of Algorithm 1 is used), and further requires rough knowledge of the sparsity s, whereas Algorithms 1 and 2 do not require an estimate of s.

6. Discussion

In this paper we proposed a two-step procedure for the estimation of MPR models with standard Gaussian designs. We argued that the MPR models form a rich class including numerous additive SIMs (i.e., Y = h(X^⊤β*) + ε) with a link function h that is even and increasing on R₊. Our algorithm is based solely on convex optimization, and achieves minimax optimal rates of estimation over broad classes of additive SIMs with normal noise, as illustrated by Theorem 8. Our procedure does require the sample size n ≳ s² log d to ensure successful initialization. The same condition has been exhibited previously, e.g., in Cai et al. (2015) for the phase retrieval model, and in works on sparse principal component analysis (see, e.g., Berthet and Rigollet, 2013; Gao et al., 2014; Wang et al., 2015). We anticipate that for a certain subclass of MPR models, the sample size requirement n ≳ s² log d is necessary for computationally efficient algorithms to exist; we conjecture models (8)-(10) to be such models. It is, however, certainly not true that this sample size requirement holds for all models in the MPR class. In (25), for example, the model can be solved efficiently by applying the Lasso algorithm, without requiring the sample size scaling n ≳ s² log d.
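To illustrate the last point, here is a minimal numpy sketch of ours (not code from the paper, and the ISTA solver and all parameter values are our own choices): it generates noiseless one-bit data Y = sign(X^⊤β* + c) with c < 0 as in (25), and runs plain ℓ1-regularized least squares, which recovers the direction of β* because Cov(Y, X^⊤β*) > 0 for this model.

```python
import numpy as np

def ista_lasso(X, Y, lam, iters=500):
    """ISTA for (1/2n)||Y - Xb||_2^2 + lam * ||b||_1."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ b - Y) / n
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return b

rng = np.random.default_rng(1)
n, d, s, c = 4000, 100, 5, -0.5
beta = np.zeros(d); beta[:s] = s ** -0.5   # unit-norm s-sparse signal
X = rng.standard_normal((n, d))
Y = np.sign(X @ beta + c)                  # noiseless one-bit model (25)

b_hat = ista_lasso(X, Y, lam=0.05)
cos = abs(b_hat @ beta) / np.linalg.norm(b_hat)   # alignment with beta
```

Note that only the direction of β* is recoverable here: Y carries no information about the scale, which is why the alignment |⟨b̂, β*⟩|/‖b̂‖₂ is the natural error measure.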
This discussion leads to the important question: under what conditions on the (known) link and error distribution (f, ε) can one efficiently solve the SIM Y = f(X^⊤β*, ε) with an optimal sample complexity? We would like to investigate this issue further in our future work.

References

Radosław Adamczak and Paweł Wolff. Concentration inequalities for non-Lipschitz functions with bounded derivatives of higher order. Probability Theory and Related Fields, 162(3-4):531-586, 2015.

Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In IEEE International Symposium on Information Theory, pages 2454-2458, 2008.

Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory, pages 1046-1066, 2013.

Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705-1732, 2009.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.

Petros T Boufounos and Richard G Baraniuk. 1-bit compressive sensing. In Annual Conference on Information Sciences and Systems, pages 16-21, 2008.

Peter Bühlmann and Sara van de Geer. Statistics for high-dimensional data: Methods, theory and applications. Springer, 2011.

T Tony Cai, Xiaodong Li, and Zongming Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. arXiv:1506.03382, 2015.

Emmanuel J Candès, Thomas Strohmer, and Vladislav Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241-1274, 2013.

Emmanuel J Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval from coded diffraction patterns.
Applied and Computational Harmonic Analysis, 39(2):277-299, 2015a.

Emmanuel J Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985-2007, 2015b.

Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. arXiv:1312.7006, 2013.

R Dennis Cook and Liqiang Ni. Sufficient dimension reduction via inverse regression. Journal of the American Statistical Association, 100(470), 2005.

Alexandre d'Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert RG Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Springer, 2013.

Ravi Ganti, Nikhil Rao, Rebecca M Willett, and Robert Nowak. Learning single index models in high dimensions. arXiv:1506.08910, 2015.

Chao Gao, Zongming Ma, and Harrison H Zhou. Sparse CCA: Adaptive estimation and computational barriers. arXiv:1409.8565, 2014.

Martin Genzel. High-dimensional estimation of structured signals from non-linear observations with general convex loss functions. arXiv:1602.03436, 2016.

Fang Han and Honglang Wang. Provable smoothing approach in high dimensional generalized regression model. arXiv:1509.07158, 2015.

Joel L Horowitz. Semiparametric and nonparametric methods in econometrics. Springer, 2009.

Laurent Jacques, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Transactions on Information Theory, 59(4):2082-2102, 2013.

Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, pages 1302-1338, 2000.

Guillaume Lecué and Shahar Mendelson.
Learning sub-Gaussian classes: Upper and minimax bounds. arXiv:1305.4825, 2013a.

Guillaume Lecué and Shahar Mendelson. Minimax rate of convergence and the performance of ERM in phase recovery. arXiv:1311.5024, 2013b.

Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316-327, 1991.

Ker-Chau Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420):1025-1039, 1992.

Ker-Chau Li and Naihua Duan. Regression analysis under link violation. The Annals of Statistics, pages 1009-1052, 1989.

Yue M Lu and Gen Li. Phase transitions of spectral initialization for high-dimensional nonconvex estimation. arXiv:1702.06435, 2017.

P McCullagh and JA Nelder. Generalized linear models. Chapman & Hall/CRC, 1989.

Marco Mondelli and Andrea Montanari. Fundamental limits of weak recovery with applications to phase retrieval. arXiv:1708.05932, 2017.

Matey Neykov, Qian Lin, and Jun S Liu. Signed support recovery for single index models in high-dimensions. arXiv:1511.02270, 2015.

Matey Neykov, Jun S Liu, and Tianxi Cai. L1-regularized least squares for support recovery of high dimensional single index models with Gaussian designs. Journal of Machine Learning Research, 17(87):1-37, 2016.

Heng Peng and Tao Huang. Penalized least squares for single index models. Journal of Statistical Planning and Inference, 141(4):1362-1379, 2011.

Yaniv Plan and Roman Vershynin. The generalized Lasso with non-linear observations. IEEE Transactions on Information Theory, 2015.

Peter Radchenko. High dimensional single index models. Journal of Multivariate Analysis, 139:266-282, 2015.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs.
Journal of Machine Learning Research, 11:2241-2259, 2010.

Charles M Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135-1151, 1981.

Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Lasso with non-linear measurements is equivalent to one with linear measurements. arXiv:1506.02181, 2015.

Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, 2009.

Aad W van der Vaart and Jon A Wellner. Weak convergence and empirical processes. Springer, 1996.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2010.

Zhaoran Wang, Quanquan Gu, and Han Liu. Sharp computational-statistical phase transitions via oracle computational model. arXiv:1512.08861, 2015.

Yingcun Xia and WK Li. On single-index coefficient regression models. Journal of the American Statistical Association, 94(448):1275-1285, 1999.

Zhuoran Yang, Zhaoran Wang, Han Liu, Yonina C Eldar, and Tong Zhang. Sparse nonlinear regression: Parameter estimation and asymptotic inference. arXiv:1511.04514, 2015.

Xinyang Yi, Zhaoran Wang, Constantine Caramanis, and Han Liu. Optimal linear estimation under unknown nonlinear transform. In Advances in Neural Information Processing Systems, pages 1549-1557, 2015.

Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr):899-925, 2013.

Appendix A. Auxiliary Results

Here we collect several results which we use in the later development.

Lemma 9 (Lemma 5, Amini and Wainwright (2008)). Consider the following optimization program:

Ẑ = argmax_{tr(Z)=1, Z∈S₊^d} tr(AZ) − λn Σ_{i,j=1}^d |Z_ij|,   (26)

where S₊^d is the set of all d × d positive semi-definite matrices. Suppose there exists a matrix U (independent of ẑ) satisfying:

U_ij = sign(ẑ_i) sign(ẑ_j), if ẑ_i ẑ_j ≠ 0;   U_ij ∈ [−1, 1], otherwise.
(27)

Then if ẑ is the principal eigenvector of the matrix A − λn U, the matrix ẑẑ^⊤ is the optimal solution to problem (26).

Lemma 10 (Lemma 10.12, Foucart and Rauhut (2013)). Given integers 1 < s < d, there exist

N ≥ ((d − 1)/(4(s − 1)))^{(s−1)/2}   (28)

subsets {S_i : S_i ⊂ [d − 1]}_{i∈[N]} such that:

∀ i ∈ [N] : |S_i| = s − 1, and ∀ i ≠ j : |S_i ∩ S_j| < (s − 1)/2.   (29)

Remark 11. Notice that when s > 1, since (s − 1)/2 ≥ s/4 and (d − 1)/(4(s − 1)) ≥ d/(4s), Lemma 10 implies N ≥ (d/(4s))^{s/4}.

For the convenience of the reader, we briefly recall the notation and the result on Gaussian concentration of non-Lipschitz functions used by Adamczak and Wolff (2015), which we apply in Lemma 30 below. For the set [ℓ], we denote by P_ℓ the set of its partitions into non-empty disjoint sets. For a partition J = {J₁, …, J_k} and an ℓ-indexed array A = (a_i)_{i∈[n]^ℓ}, define the norm:

‖A‖_J := sup { Σ_{i∈[n]^ℓ} a_i Π_{l=1}^k x^{(l)}_{i_{J_l}} : ‖(x^{(l)}_{i_{J_l}})‖₂ ≤ 1, 1 ≤ l ≤ k },

where the indexing should be understood as i_I := (i_k)_{k∈I}. Given the convention that #J = |J| is the cardinality of the set J, we restate a (shortened) version of Theorem 1.4 of Adamczak and Wolff (2015).

Theorem 12 (Theorem 1.4, Adamczak and Wolff (2015)). Let X = (X₁, …, X_n) be a random vector with independent components, such that for all i ≤ n, ‖X_i‖_{ψ₂} ≤ Υ. Then for every polynomial f : R^n → R of degree L and every p ≥ 2 we have:

‖f(X) − Ef(X)‖_p ≤ K_L Σ_{ℓ=1}^L Σ_{J∈P_ℓ} p^{#J/2} ‖E D^ℓ f(X)‖_J,

where D^ℓ is the ℓ-th derivative of f and K_L is a constant depending on L and Υ.

Appendix B. Preliminary Proofs

Proof [Proof of Proposition 2] We have the following chain of equalities:

c₀ = Cov(f(Z, ε), Z²) = E[ϕ(Z)Z²] − Eϕ(Z) = E[ϕ′′(Z)] > 0,

where the last (and key) equality follows by Stein's lemma (see, e.g., Lemma 4 of Stein, 1981).

Proof [Proof of Lemma 5] First of all, we notice that:

E[Y(XX^⊤ − I)] = E[(Y − µ)(XX^⊤ − I)] = E[(Y − µ)XX^⊤].

Hence proving Lemma 5 is equivalent to showing:

β* = argmax_{‖v‖₂=1} E[(Y − µ)(v^⊤X)²].

Next, decompose:

v^⊤X = (v^⊤β*)β*^⊤X + (v − (v^⊤β*)β*)^⊤X = (v^⊤β*)β*^⊤X + β⊥^⊤X,

where we used the shorthand notation β⊥ := v − (v^⊤β*)β* for the component of v which is orthogonal to β*.
In terms of this notation, we have the following identity:

E[(Y − µ)(v^⊤X)²] = (v^⊤β*)² E[(Y − µ)(β*^⊤X)²] + 2(v^⊤β*) E[(Y − µ)(β*^⊤X)(β⊥^⊤X)] + E[(Y − µ)(β⊥^⊤X)²].

We next deal with the last two terms of the above decomposition. Since β*^⊤X is independent of β⊥^⊤X, the second term satisfies:

E[(Y − µ)(β*^⊤X)(β⊥^⊤X)] = E[(Y − µ)(β*^⊤X)] E[β⊥^⊤X] = 0.

For the third term, due to the same independence, we have:

E[(Y − µ)(β⊥^⊤X)²] = E[Y − µ] E[(β⊥^⊤X)²] = 0.

Hence:

E[(Y − µ)(v^⊤X)²] = (v^⊤β*)² E[(Y − µ)(β*^⊤X)²] = (v^⊤β*)² c₀.

Since (6) implies that c₀ > 0, by the Cauchy-Schwarz inequality the maximizer of the above expression is v = β*.

Proof [Proof of Proposition 3] We prove the three statements in turn.

(i) Let Z′ be an independent copy of Z. We have the following chain of equalities:

c₀ = E[(ϕ(Z) − Eϕ(Z))Z²] = E[(ϕ(Z) − Eϕ(Z′))Z²] = E[(ϕ(Z) − ϕ(Z′))Z²].

By symmetry one also has: c₀ = E[(ϕ(Z′) − ϕ(Z))(Z′)²]. Adding the last two equations yields:

2c₀ = E[(ϕ(Z) − ϕ(Z′))(Z² − (Z′)²)] = E_{X,X′∼|N(0,1)|}[(ϕ(X) + ϕ(−X) − (ϕ(X′) + ϕ(−X′)))(X² − (X′)²)]/2 > 0,

where we used the fact that sign(ϕ(X) + ϕ(−X) − (ϕ(X′) + ϕ(−X′))) = sign(X² − (X′)²). The last inequality is strict since, by our condition, the integrand is strictly positive on the set [0, z₂] × [z₁, ∞) ⊂ R², which is a set of positive Lebesgue measure.

(ii) To see (ii), for any two points x < y, take v_x ∈ ∂ϕ(x) and v_y ∈ ∂ϕ(y) to be arbitrary points in the corresponding sub-differentials. Adding the following two inequalities:

ϕ(x) − ϕ(y) ≥ v_y(x − y),   ϕ(y) − ϕ(x) ≥ v_x(y − x),

we conclude that (v_x − v_y)(x − y) ≥ 0. Notice that since ϕ is convex, by Jensen's inequality, ϕ(z) + ϕ(−z) ≥ 2ϕ(0) for any z ≥ 0. Next take z > z′ > 0, and consider the difference:

ϕ(z) + ϕ(−z) − ϕ(z′) − ϕ(−z′) ≥ (v_{z′} − v_{−z′})(z − z′) ≥ 0,

where v_{z′} ∈ ∂ϕ(z′) and v_{−z′} ∈ ∂ϕ(−z′) are arbitrary sub-gradients, and the last inequality follows from the fact that z′ > 0, and hence v_{z′} ≥ v_{−z′} as we verified before. The above inequality becomes strict whenever z′ ≥ z₀, since v_{z′} − v_{−z′} is non-decreasing and, by assumption,

z₀ v_{z₀} ≥ ϕ(z₀) − ϕ(0) > ϕ(0) − ϕ(−z₀) ≥ v_{−z₀} z₀,

and hence v_{z₀} − v_{−z₀} > 0.
Hence we may take z₁ = 2z₀ and z₂ = z₀ in (i) to complete the proof.

(iii) Statement (iii) is an implication of the fact that we can control the tail bound of g₁(Z). Notice that when 0 < t ≤ max{|a|, |b|} we trivially have P(|g₁(Z)| ≥ t) ≤ 1. When t > max{|a|, |b|}, using our assumption, by a standard normal tail bound we have:

P(|g₁(Z)| ≥ t) ≤ P(|Z| ≥ √(t/C)) ≤ 2 exp(−t/(2C)) ≤ exp(1 − t/(2C)).

Hence, setting K = max{|a|, |b|, 2C} shows that in any case P(|g₁(Z)| ≥ t) ≤ exp(1 − t/K), which shows that ‖g₁(Z)‖_{ψ₁} ≤ cK < ∞ for some absolute constant c. Finally, by the triangle inequality we conclude that ‖f(Z, ε)‖_{ψ₁} ≤ ‖g₁(Z)‖_{ψ₁} + ‖g₂(ε)‖_{ψ₁} < ∞, which completes the proof.

Lemma 13. If h is a convex function such that h(z₀) + h(−z₀) > 2h(0) for some z₀ > 0, E|h(z + ε)| < ∞ for every z ∈ R, and E|h(Z + ε)| < ∞, then ϕ(z) := E h(z + ε) is convex and sub-differentiable, and there exists a z′₀ > 0 such that ϕ(z′₀) + ϕ(−z′₀) > 2ϕ(0).

Proof [Proof of Lemma 13] Since the function h is convex and the expectation is a linear operator, it follows that ϕ(z) is indeed convex. The linearity of the expectation operator, coupled with the fact that the function |h(z + ε)| is integrable for all z, additionally implies that ϕ(z) is sub-differentiable with E[∂_ε h(z + ε)] ∈ ∂ϕ(z) (the fact that E[∂_ε h(z + ε)] exists is implied by E|h(z + ε)| < ∞), where ∂_ε h(z + ε) ∈ ∂h(z + ε) is chosen so that ε ↦ ∂_ε h(z + ε) is a function (note also that ε ↦ ∂_ε h(z + ε) is monotone and hence measurable). Next, notice that for any fixed ε, we have:

E_Z[h(Z + ε) + h(−Z + ε)] > 2h(ε).

The last inequality is strict since, by Jensen's inequality, E_Z[h(Z + ε) + h(−Z + ε)] ≥ h(EZ + ε) + h(−EZ + ε) = 2h(ε), and equality can be achieved only when h is linear, which is not the case, since h(z₀) + h(−z₀) > 2h(0) by assumption. Take an expectation with respect to ε and exchange the expectations (by Fubini's theorem; recall that E|h(Z + ε)| < ∞) to obtain:

E_Z E_ε[h(Z + ε) + h(−Z + ε)] > 2E_ε h(ε).
Naturally, the above implies the existence of a z′₀ such that:

E_ε[h(z′₀ + ε) + h(−z′₀ + ε)] > 2E_ε h(ε),

and completes the proof.

Lemma 14. Monomials of the type h(x) = x^k belong to the class C(k − 1, (k!)², k!) ⊂ H(k − 1, (k!)², k!).

Proof [Proof of Lemma 14] This statement holds due to the following explicit moment calculation:

max_{i∈[k−1]} E_{Z∼N(0,1)}[(h^{(k−i)}(Z))²] = max_{i∈[k−1]} (k!/i!)² · 2^i Γ((1 + 2i)/2)/Γ(1/2) = (k!)²,

where the last equality follows by noticing that the sequence is decreasing in i.

Appendix C. Proofs for the Initialization Step

Our first result shows that the optimization program (17) succeeds in producing a vector v̂ which is close to the vector β*.

Proposition 15. Assume that n is large enough so that s√(log d/n) ≤ (2/3 − κ²)c₀/(4(C₁ + C₂)), for some small but fixed κ > 0 and constants C₁, C₂ (depending on f and ε). Then there exists a value of λn ≍ √(log d/n) such that the principal eigenvector v̂ of Â, the solution of (17), satisfies |v̂^⊤β*| ≥ κ with probability at least 1 − 4d⁻¹ − O(n⁻¹).

Proof [Proof of Proposition 15] The proof follows by an application of Lemma 9 and Lemma 17.

Lemma 16. Let A = a vv^⊤ − b ww^⊤ be a symmetric rank-two matrix, with a > b ≥ 0 and ‖v‖₂ = ‖w‖₂ = 1, and let N be a symmetric noise matrix. Then, assuming that ‖N‖₂ ≤ (a − b)/2, the principal eigenvector v̂ of A + N satisfies:

|v̂^⊤v| ≥ [ (a − b − 2‖N‖₂)/a ]^{1/2}.

Proof [Proof of Lemma 16] First off, an elementary calculation shows that the non-zero spectrum of A is:

{λ₁(A), λ_d(A)} = { (a − b ± √((a − b)² + 4ab(1 − (v^⊤w)²)))/2 }.

Next we have:

a(v^⊤v̂)² + ‖N‖₂ ≥ v̂^⊤(A + N)v̂ ≥ λ₁(A) − ‖N‖₂,

so that

(v^⊤v̂)² ≥ (λ₁(A) − 2‖N‖₂)/a ≥ (a − b − 2‖N‖₂)/a,

where the last inequality follows by Cauchy-Schwarz.

Lemma 17. Assume that n is large enough so that s√(log d/n) ≤ (2/3 − κ²)c₀/(4(C₁ + C₂)), for some small but fixed κ > 0 and constants C₁, C₂ as defined in Lemmas 21 and 22. Put λn = (C₁ + C₂)√(log d/n). Then there exists a sign matrix Û with Û_{S_β S_β} = sign(β*_{S_β}) sign(β*_{S_β})^⊤ such that the principal eigenvector v̂ of Σ̂ − λn Û satisfies |v̂^⊤β*| ≥ κ, with probability at least 1 − 4d⁻¹ − O(n⁻¹).

Remark 18.
The proof of Lemma 17 also shows that, with high probability, the vector v̂ can be identified with a vector ṽ (the principal eigenvector of Σ̂_{S_β S_β} − λn Û_{S_β S_β}; see below), which is independent of the data X_{S_β^c}, in the sense that v̂ = ṽ with high probability. This becomes evident upon realizing that the matrix N_{S_β S_β} depends solely on X_{S_β}. In addition, it is evident that supp(v̂) ⊂ S_β and supp(ṽ) ⊂ S_β.

Proof [Proof of Lemma 17] Setting λn = (C₁ + C₂)√(log d/n) and using Lemma 19 gives us the block decomposition:

Σ̂ − λn Û = [ η β*_{S_β} β*_{S_β}^⊤ − λn sign(β*_{S_β}) sign(β*_{S_β})^⊤ + N_{S_β S_β} ,  N_{S_β S_β^c} − λn Û_{S_β S_β^c} ;  N_{S_β^c S_β} − λn Û_{S_β^c S_β} ,  N_{S_β^c S_β^c} − λn Û_{S_β^c S_β^c} ].

We can select the sign matrix Û such that all three blocks which do not correspond to the S_β × S_β corner of the above decomposition are 0, since by Lemma 19 we have that ‖N‖_max ≤ (C₁ + C₂)√(log d/n) = λn with high probability. Recall that by our assumption on the sample size we have λn ≤ c₀/(6s), and hence ‖λn sign(β*_{S_β}) sign(β*_{S_β})^⊤‖₂ = λn s ≤ c₀/6. Using Lemmas 16 and 19 on the event ‖N_{S_β S_β}‖₂ ≤ (C₁ + C₂) s √(log d/n), and recalling that η ≥ c₀/2, we obtain:

|v̂_{S_β}^⊤ β*_{S_β}|² ≥ (η − c₀/6 − 2‖N_{S_β S_β}‖₂)/η ≥ 2/3 − 4‖N_{S_β S_β}‖₂/c₀ ≥ 2/3 − 4(C₁ + C₂) s √(log d/n)/c₀ ≥ κ²,

for values of n large enough so that the above expression is positive, which concludes the proof.

Lemma 19. We have that:

Σ̂ = η β* β*^⊤ + N,   (32)

where η > c₀/2, ‖N_{S_β S_β}‖₂ ≤ (C₁ + C₂) s √(log d/n), and ‖N‖_max ≤ (C₁ + C₂)√(log d/n), with probability at least 1 − 4d⁻¹ − O(n⁻¹), where C₁ and C₂ are constants depending on f, ε.

Proof [Proof of Lemma 19] First we observe that decomposition (32) holds with:

η = (1/n) Σ_{i=1}^n Y_i[(β*^⊤X_i)² − 1],

N = (1/n) Σ_{i=1}^n Y_i(β*^⊤X_i)(β* X_i^⊤ P_β + P_β X_i β*^⊤) + (1/n) Σ_{i=1}^n Y_i[P_β (X_i X_i^⊤ − I_d) P_β],

where P_β := I_d − β*β*^⊤. Lemma 20 shows that η ≥ c₀/2 with probability at least 1 − O(n⁻¹). Next, Lemma 21 and Lemma 22 show that ‖N‖_max ≤ (C₁ + C₂)√(log d/n), with probability at least 1 − 4d⁻¹ − O(n⁻¹), where the constants C₁ and C₂ depend on f, ε. Using the fact that ‖N_{S_β S_β}‖₂ ≤ s‖N_{S_β S_β}‖_max ≤ s‖N‖_max completes the proof.

Lemma 20.
We have that η defined in (32) satisfies η ≥ c₀/2 with probability at least 1 − 4 Var[f(Z, ε)(Z² − 1)]/(c₀² n).

Proof [Proof of Lemma 20] By Chebyshev's inequality we have that:

P( |(1/n) Σ_{i=1}^n Y_i((β*^⊤X_i)² − 1) − c₀| ≥ t ) ≤ Var[f(Z, ε)(Z² − 1)]/(n t²).

Notice that in the last inequality we have Var[f(Z, ε)(Z² − 1)] < ∞, since we are assuming that f(Z, ε) is sub-exponential. Setting t = c₀/2 brings the above probability bound to zero at a rate n⁻¹.

Lemma 21. We have that:

‖ (1/n) Σ_{i=1}^n Y_i(β*^⊤X_i)[β* X_i^⊤ P_β + P_β X_i β*^⊤] ‖_max ≤ C₁ √(log d/n),

where C₁ is a constant depending on f, ε, with probability at least 1 − 2d⁻¹ − Var[f²(Z, ε)Z²]/(E[f²(Z, ε)Z²])² · n⁻¹.

Proof [Proof of Lemma 21] We will only deal with the first term of the sum, as the second term follows by the same argument after transposition. First notice that Y_i(β*^⊤X_i) is independent of X_i^⊤P_β. Analyzing this term row-wise, for j ∈ S_β (β*_j ≠ 0), conditionally we have:

(1/n) Σ_{i=1}^n Y_i β*_j (β*^⊤X_i) X_i^⊤ P_β ∼ N( 0, (β*_j)² (1/n²) Σ_{i=1}^n Y_i² (β*^⊤X_i)² P_β ).

By Chebyshev's inequality we have:

P( |(1/n) Σ_{i=1}^n Y_i²(β*^⊤X_i)² − E[f²(Z, ε)Z²]| ≥ t ) ≤ Var[f²(Z, ε)Z²]/(n t²),

assuming Var[f²(Z, ε)Z²] < ∞. Putting t = E[f²(Z, ε)Z²] makes the above probability converge to zero at a rate n⁻¹. Hence, by a conditioning argument, a standard normal tail bound coupled with the facts that ‖P_β‖₂ ≤ 1 and |β*_j| ≤ 1, and a union bound, we obtain:

P( max_{j∈[d]} ‖(1/n) Σ_{i=1}^n Y_i(β*^⊤X_i)β*_j X_i^⊤P_β‖_∞ > t ) ≤ 2d² exp( −n t²/(4E[f²(Z, ε)Z²]) ).

Plugging in t = 2√(3E[f²(Z, ε)Z²] log d/n) brings the probability to 2d⁻¹. This completes the proof with C₁ = 4√(3E[f²(Z, ε)Z²]).

Lemma 22. Let Y_i = f(X_i^⊤β*, ε_i), where X_i ∼ N(0, I). Assume that f and ε are such that ‖f(Z, ε)‖_{ψ₁} ≤ K for Z ∼ N(0, 1) with Z independent of ε, and in addition let log d = o(n/log² n). Then:

‖(1/n) Σ_{i=1}^n Y_i P_β (X_i X_i^⊤ − I_d) P_β‖_max ≤ C₂ √(log d/n),

with probability at least 1 − 2d⁻¹ − (2¹¹K⁴/(E[f²(Z, ε)])² + e) n⁻¹, for some constant C₂ depending on K, and large values of n.

Proof [Proof of Lemma 22] Notice that by the properties of the multivariate normal distribution one has that Y_i is independent of P_β(X_i X_i^⊤ − I_d)P_β.
Next, we have that Z_i := P_β X_i ∼ N(0, P_β), and thus, since ‖P_β‖₂ ≤ 1, each individual entry of Z_i is a normally distributed random variable with variance at most one. Hence we have that for any j, k ∈ [d]: ‖Z_ij Z_ik‖_{ψ₁} ≤ 2‖Z_ij‖_{ψ₂}‖Z_ik‖_{ψ₂} ≤ 2, and hence, conditionally on Y_i, one has ‖Z_ij Z_ik − EZ_ij Z_ik‖_{ψ₁} = ‖Z_ij Z_ik − P_{β,jk}‖_{ψ₁} ≤ 4, for all j, k ∈ [d]. Next, conditionally on the Y_i values, a Bernstein-type inequality (see, e.g., Proposition 5.16 of Vershynin (2010)) yields:

P( ‖(1/n) Σ_{i=1}^n Y_i P_β(X_i X_i^⊤ − I_d)P_β‖_max ≥ t ) ≤ 2d² exp[ −c min( n t²/(16 n⁻¹ Σ_{i=1}^n Y_i²), n t/(4 max_{i∈[n]} |Y_i|) ) ],   (33)

for an absolute constant c > 0. Using the union bound and the fact that the Y_i are sub-exponential, we obtain:

P( max_i |Y_i| ≥ t ) ≤ n exp(−t/(c′K)),   (34)

for some absolute constant c′. Setting t = 2c′K log n brings the above probability to zero at a rate n⁻¹. Furthermore, by Chebyshev's inequality we obtain:

P( |(1/n) Σ_{i=1}^n Y_i² − EY²| ≥ t ) ≤ Var(Y_i²) n⁻¹ t⁻² ≤ 2⁹ K⁴ n⁻¹ t⁻²,   (35)

and thus we can set t = EY²/2 to bring the above probability to zero at a rate n⁻¹. In particular, EY²/2 ≤ n⁻¹ Σ_{i=1}^n Y_i² ≤ 2EY² with probability at least 1 − 2¹¹K⁴n⁻¹/(EY²)². Selecting t = √(96 EY² log d/(cn)) in (33), on these two events the first (quadratic) term in the minimum in (33) is the smaller one for large enough values of n, so long as log d = o(n/log² n). Hence we conclude:

‖(1/n) Σ_{i=1}^n Y_i P_β(X_i X_i^⊤ − I_d)P_β‖_max ≤ √(96 EY² log d/(cn)),

with probability at least 1 − 2d⁻¹ − (2¹¹K⁴/(EY²)² + e)n⁻¹. Taking into account that EY² ≤ 4K², we obtain C₂ = √(384K²/c).

Appendix D. Proofs for the Second Step

Remark 23. For simplicity of presentation we will subtract n from the indexes of the set S₂ in the proofs, i.e., instead of having observations indexed in the range S₂ = {n + 1, …, 2n}, we will pretend that our observations are in the range {1, …, n}.

Proof [Proof of Theorem 6] Take the fixed estimate v̂ from the first step (recall that ‖v̂‖₂ = 1), and decompose it as: v̂ = (v̂^⊤β*)β* + β̂⊥, where β̂⊥ := v̂ − (v̂^⊤β*)β*.
By the Pythagorean theorem we have 1 = ‖v̂‖₂² = (v̂^⊤β*)²‖β*‖₂² + ‖β̂⊥‖₂² (where β̂⊥ := v̂ − (v̂^⊤β*)β*), which implies that

‖β̂⊥‖₂² = 1 − (v̂^⊤β*)² ≤ 1.   (36)

Put α := c₀ v̂^⊤β*, so that by Lemma 17 we have |α| > κc₀ with high probability. By formulation (18) we have:

(1/2n)‖X(b̂ − αβ*)‖₂² + νn‖b̂‖₁ ≤ (1/n) Σ_{i=1}^n [(Y_i − Ȳ)(X_i^⊤v̂) − α(X_i^⊤β*)] X_i^⊤(b̂ − αβ*) + νn‖αβ*‖₁.

We handle the empirical process term in Lemma 25, which also presents the main difficulty in the analysis of the ℓ1-regularized least squares procedure. Using this result we conclude that:

(1/2n)‖X(b̂ − αβ*)‖₂² + νn‖b̂‖₁ ≤ C̃√(log d/n)[ ‖b̂ − αβ*‖₁ + (1/√n)‖X(b̂ − αβ*)‖₂ ] + νn‖αβ*‖₁,   (37)

with probability at least 1 − O(n⁻¹ + d⁻¹). We now distinguish two cases. First assume that ‖X(b̂ − αβ*)‖₂ > 4C̃√(log d). Then (37) implies that:

(1/4n)‖X(b̂ − αβ*)‖₂² + νn‖b̂‖₁ ≤ C̃√(log d/n)‖b̂ − αβ*‖₁ + νn‖αβ*‖₁.   (38)

Next, using a standard trick (see, e.g., Bickel et al., 2009; Bühlmann and van de Geer, 2011), we have:

‖b̂‖₁ = ‖b̂_{S_β}‖₁ + ‖b̂_{S_β^c}‖₁ ≥ ‖αβ*_{S_β}‖₁ − ‖b̂_{S_β} − αβ*_{S_β}‖₁ + ‖b̂_{S_β^c}‖₁,
‖b̂ − αβ*‖₁ = ‖b̂_{S_β} − αβ*_{S_β}‖₁ + ‖b̂_{S_β^c}‖₁.

Selecting νn ≥ 2C̃√(log d/n), the above identities in combination with (38) guarantee that:

(1/2n)‖X(b̂ − αβ*)‖₂² + νn‖b̂_{S_β^c} − αβ*_{S_β^c}‖₁ ≤ 3νn‖b̂_{S_β} − αβ*_{S_β}‖₁.   (39)

Using Corollary 1 of Raskutti et al. (2010), since clearly I_d satisfies the RE condition of order 2s with constants (3, 1) (i.e., for all S ⊂ [d] with |S| ≤ 2s and all θ with ‖θ_{S^c}‖₁ ≤ 3‖θ_S‖₁ we have θ^⊤ I θ ≥ ‖θ‖₂²), we can further bound:

(1/2n)‖X(b̂ − αβ*)‖₂² ≥ (1/(2·8²))‖b̂ − αβ*‖₂²,   (40)

with probability at least 1 − c′ exp(−c′′n), if n > c′′′ s log d, where c′, c′′, c′′′ > 0 are absolute constants. On the above event, (39) implies:

(1/(2·8²))‖b̂ − αβ*‖₂² ≤ 3νn‖b̂_{S_β} − αβ*_{S_β}‖₁ ≤ 3νn√(2s)‖b̂ − αβ*‖₂,

where we used Cauchy-Schwarz and the fact that the vector b̂_{S_β} − αβ*_{S_β} is at most 2s-sparse. The above inequality gives us that:

‖b̂ − αβ*‖₂ ≤ 12·8²·√(2s)·νn.   (41)

In the second case, when ‖X(b̂ − αβ*)‖₂ ≤ 4C̃√(log d), on the event (40) we have ‖b̂ − αβ*‖₂ ≤ 32C̃√(log d/n) ≤ 16νn, and we see that in either case bound (41) holds. Before we complete the proof we need the following straightforward result:

Lemma 24. Assume that n is large enough so that 12·8²·√(2s)·νn ≤ κc₀/2.
Then with probability at least $1 - O(n^{-1} + d^{-1})$ we have:
$$\min_{\eta\in\{-1,1\}}\Big\|\frac{\hat b}{\|\hat b\|_2} - \eta\beta^*\Big\|_2 \le \frac{48\cdot 8^2\sqrt{2s}\,\nu_n}{\kappa c_0}.$$
Finally, notice that $\sqrt{\frac{s\log d}{n}} < R$ implies $12\cdot 8^2\sqrt{2s}\,\nu_n \le \kappa c_0/2$ when $R$ is small enough.
Proof [Proof of Lemma 24] Put $r = 12\cdot 8^2\sqrt{2s}\,\nu_n$ for brevity. By (41) and the triangle inequality we can conclude that $|\alpha| - r \le \|\hat b\|_2 \le r + |\alpha|$. Additionally:
$$\Big\|\frac{\hat b}{\|\hat b\|_2} - \mathrm{sign}(\alpha)\beta^*\Big\|_2 \le \frac{\|\hat b - \alpha\beta^*\|_2 + \big|\,\|\hat b\|_2 - |\alpha|\,\big|}{\|\hat b\|_2} \le \frac{2r}{|\alpha| - r} \le \frac{4r}{|\alpha|},$$
with the last two inequalities holding with high probability when $r < \kappa c_0/2$ (so that $r \le |\alpha|/2$ with high probability by Lemma 17). This completes the proof.

Lemma 25. There exists a constant $\tilde C$ depending on $f, \varepsilon$ such that:
$$\frac{1}{n}\big\langle (Y-\bar Y)\circ X\hat v - \alpha X\beta^*,\; X(\hat b-\alpha\beta^*)\big\rangle \le \tilde C\sqrt{\frac{\log d}{n}}\Big[\frac{1}{\sqrt n}\|X(\hat b-\alpha\beta^*)\|_2 + \|\hat b-\alpha\beta^*\|_1\Big],$$
with probability at least $1 - O(n^{-1} + d^{-1})$.
Proof [Proof of Lemma 25] Using Hölder's inequality we obtain:
$$\frac{1}{n}\big\langle (Y-\bar Y)\circ X\hat v - \alpha X\beta^*,\; X(\hat b-\alpha\beta^*)\big\rangle \le \frac{1}{n}\big\|X^\top[(Y-\mu)\circ X\hat v - \alpha X\beta^*]\big\|_\infty\,\|\hat b-\alpha\beta^*\|_1 + \frac{1}{n}\|(\bar Y-\mu)X\hat v\|_2\,\|X(\hat b-\alpha\beta^*)\|_2, \quad (42)$$
where we have set $\mu := \mathbb E Y$ for brevity. We first handle the second term. We have $\frac{1}{\sqrt n}\|(\bar Y-\mu)X\hat v\|_2 = |\bar Y-\mu|\,\frac{1}{\sqrt n}\|X\hat v\|_2$. Since $Y_i$ is assumed to be sub-exponential, by a Bernstein type of inequality we have $\mathbb P(|\bar Y-\mu| \ge t) \le 2\exp\big(-cn\min(t^2/4K^2,\, t/2K)\big)$, where $c$ is an absolute constant. Thus we conclude that $|\bar Y-\mu| \le 2K\sqrt{\frac{\log d}{cn}}$ with probability at least $1-2d^{-1}$, for values of $n$ such that $\sqrt{\frac{\log d}{cn}} < 1$. Also, since $X\hat v \sim N(0, I_n)$, we obtain $\|X\hat v\|_2^2 \sim \chi^2_n$. Hence by Chebyshev's inequality we obtain $\mathbb P\big(\big|\|X\hat v\|_2^2/n - 1\big| \ge t\big) \le 2/(nt^2)$, and thus, plugging in $t=1$, we conclude that $\frac{1}{n}\|X\hat v\|_2^2 \le 2$ with probability at least $1-2n^{-1}$. Hence $\frac{1}{\sqrt n}\|(\bar Y-\mu)X\hat v\|_2 \le \tilde C_1\sqrt{\frac{\log d}{n}}$ with probability at least $1 - O(n^{-1}) - 2d^{-1}$. Next we analyze the sup norm term appearing in inequality (42).
The first fact we observe is that by construction this term is unbiased since: E[(Y µ)X 2β (bv β ) αX 2β ] + E[(Y µ)X 2 bβ ] = β E[(Y µ)(X β )2(bv β ) αβ X 2β ] | {z } 0 + E[(Y µ)Pβ X 2β (bv β )] | {z } 0 E[αPβ X 2β ] | {z } 0 + E[(Y µ)Pβ X 2 bβ ] | {z } 0 +β E[(Y µ)β X 2 bβ ] | {z } 0 Neykov, Wang, and Liu Now according to the decomposition in the preceding display, we break down the sup norm term in (42) into mean zero terms using the triangle inequality: n 1 X [(Y µ) Xbv αXβ ] n 1 P{β , b β } X [(Y µ) X bβ ] (43) + n 1 β β X [(Y µ) X bβ ] bβ 2 2 bβ ( bβ ) X [(Y µ) X bβ ] + n 1 β β X [(Y µ) Xβ c0Xβ ] + n 1 Pβ X [(Y µ) Xβ c0Xβ ] , where in the last two terms we used the fact that |bv β | 1 and P{β , b β } is the projection on the space span{β , bβ } . We use Lemma 26 to control the first term of the decomposition. Lemma 27 handles the second term, Lemma 28 handles the third term, and Lemmas 29 and 31 show concentration for the remaining terms. We conclude that there exists a constant e C2 such that: n 1 X [(Y µ) Xbv αXβ ] e C2 with probability at least 1 O(n 1 +d 1), which is what we aimed to show with e C = max( e C1, e C2). Lemma 26. We have that: 1 i=1 (Yi µ)X i bβ X i P{β , b β } bβ 2C3 for an absolute constant C3 depending on f and ε with probability at least 1 2d 1 Var(f(Z,ε) Ef(Z,ε))2 [E(f(Z,ε) Ef(Z,ε))2]2 n 1. Proof [Proof of Lemma 26]Notice that X i P{β , b β } is independent of (Yi µ)X i bβ . Hence conditionally on Y and X bβ we have i=1 (Yi µ)X i bβ X i P{β , b β } N 0, 1 i=1 (Yi µ)2(X i bβ )2P{β , b β } . Next using Chebyshev s inequality we can control the probability of spread about the mean: bβ 2 2 (Yi µ)2 E(f(Z, ε) Ef(Z, ε))2 t Var[(f(Z, ε) Ef(Z, ε))2] by setting t = E(f(Z, ε) Ef(Z, ε))2. Using the fact that P{β , b β } 2 1, by a standard normal tail bound and union bound on the event 1 n Pn i=1(X i bβ )2(Yi µ)2 2 bβ 2 2E(f(Z, ε) Ef(Z, ε))2 i=1 (Yi µ)X i bβ X i P{β , b β } t 2d exp( nt2/[4 bβ 2 2E(f(Z, ε) Ef(Z, ε))2]). 
Select t = 2 p 2E(f(Z, ε) Ef(Z, ε))2 bβ 2 q n yields the desired bound with 2E(f(Z, ε) Ef(Z, ε))2 bβ 2. Agnostic Estimation for Phase Retrieval Lemma 27. We have that: n 1 β β X [(Y µ) X bβ ] C4 for an absolute constant C4 depending on f and ε with probability at least 1 2d 1 Var Z2(f(Z,ε) Ef(Z,ε))2 [EZ2(f(Z,ε) Ef(Z,ε))2]2 n 1. Proof [Proof of Lemma 27] Notice that β β 2 = 1 and thus: n 1 β β X [(Y µ) X bβ ] n 1|β X [(Y µ) X bβ ]|. Next since (( bβ ) Xi) (β Xi)(Yi µ), conditioning on {(β Xi)(Yi µ)}i [n] we obtain: i=1 (β Xi)(Yi µ)(( bβ ) Xi) N 0, bβ 2 2 n2 i=1 (β Xi)2(Yi µ)2 , i=1 (β Xi)2(Yi µ)2 EZ2(f(Z, ε) Ef(Z, ε))2 t Var[Z2(f(Z, ε) Ef(Z, ε))2] by setting t = EZ2(f(Z, ε) Ef(Z, ε))2 we can control the variance term above. The final bound follows after an application of a standard Gaussian tail bound, where C4 turns out to be C4 = bβ 22 p EZ2(f(Z, ε) Ef(Z, ε))2. Lemma 28. For large enough values of n we have: bβ 2 2 bβ ( bβ ) X [(Y µ) X bβ ] C5 with prob at least 1 2d 1 O(n 1). Proof [Proof of Lemma 28] We have that bβ bβ 2, and hence: bβ 2 2 bβ ( bβ ) X [(Y µ) X bβ ] n 1 bβ 2 |( bβ ) X [(Y µ) X bβ ]|. Observe that X i bβ is independent from Yi µ, and in addition X i bβ N(0, bβ 2 2). Hence (X i bβ )2/ β 2 2 χ2 1. Next we make usage of the decomposition: i (Yi µ)(X i bβ )2 = bβ 2 2 n i (Yi µ)((X i bβ )2/ bβ 2 2 1) + bβ 2 2 n Since Yi is assumed to be sub-exponential, the second concentrates about zero by Proposition 5.16 in Vershynin (2010): i=1 Yi µ t 2 exp( c min(nt2/K2, nt/K)). Neykov, Wang, and Liu Selecting t = K c n gives a bound on the probability equal to 2d 1, for values of n large enough n c. For the remaining term, conditionally on {Yi}i [n], by Lemma 1 of Laurent and Massart (2000) we obtain: i (Yi µ)((X i bβ )2/ bβ 2 2 1) 2 v u u tn 1 n X i=1 (Yi µ)2 t + 2 max i [n] |Yi µ|t 2 exp( nt). (46) Next, by the triangle inequality: v u u tn 1 n X i=1 (Yi µ)2 v u u tn 1 n X i=1 Y 2 i + |µ|, max i [n] |Yi µ| max i [n] |Yi| + |µ|. 
The inequalities in the preceding display allow us to reuse the results (34) and (35) of Lemma 22. Thus conditioning on these events (46) implies: i (Yi µ)((X i bβ )2/ bβ 2 2 1) (2 t + (4c K log n)t 2 exp( nt), on an event failing with probability at most Var Y 2 [EY 2]2 + e n 1. Selecting t = log d n implies that with probability at least 1 2d 1 Var Y 2 [EY 2]2 + e n 1 we have: i (Yi µ)((X i bβ )2/ bβ 2 2 1) [(2 2EY 2 + µ) + 4c K] with the probability of failing being at most exp( nt) max(2d 1, O(n 1)). We remind the reader that we are assuming log(d) = o(n/ log2(n)). This completes the proof with C5 = (2 2EY 2 + µ) + 4c K + K c bβ 2. Lemma 29. We have that: n 1 β β X [(Y µ) Xβ c0Xβ ] C6 with probability at least 1 O(n 1) 3d 1. Proof [Proof of Lemma 29] As in Lemma 27 we have: n 1 β β X [(Y µ) Xβ c0Xβ ] n 1|β X [(Y µ) Xβ c0Xβ ]|, since β 1. We decompose the right hand side of the preceding display to: i=1 [(β Xi)2Yi (c0 + µ)] + (c0 + µ) i=1 [1 (β Xi)2]. To handle the first term one can easily use Chebyshev s inequality to obtain convergence with probability at least (log d) 1. However, to sharpen this rate, we work around the classic Chebyshev s inequality, by making usage of recent concentration results on polynomials of sub-Gaussian random variables proved in Adamczak and Wolff(2015). We have the following: Agnostic Estimation for Phase Retrieval Lemma 30. We have that: i=1 [(β Xi)2Yi (c0 + µ)] e C6 with probability at least 1 max(O(n 1), d 1). Usual concentration bounds on the χ2 distribution can be used to control the second term. Using Lemma 1 of Laurent and Massart (2000) we obtain: i=1 (β Xi)2 1 2 t + 2t 2 exp( nt). Select t = q n to complete the proof assuming that t < 1 and setting C6 = 4(c0 + µ) + e C6. Proof [Proof of Lemma 30] First we construct the random variable Zi = ηi|β Xi|1/2|Yi|1/4, where ηi is a Rademacher random variable. Notice that Z4 i = (β Xi)2Yi, and hence EZ4 i = (c0 + µ). We now argue that Z is a sub-Gaussian random variable. 
By H older s inequality, and the definition of ψ2 norm we have: E|β X|p E|Y |p/2 (p β X ψ2( Y ψ1/2)1/2)p/2 (p( Y ψ1/2)1/2)p/2, where we used that since β X N(0, 1) we have β X ψ2 1. Hence Z ψ2 (K/2)1/4, and thus Z is sub-Gaussian as claimed. For the remaining part recall the notation preceding Theorem 12. For f(x) = x4 and F(x) = Pn i=1 f(xi) we have DℓF(x) = diagd(f (ℓ)(x1), . . . , f (ℓ)(xn)) for ℓ [4]. Using the definition of ψ2 norm we can easily estimate E[|Z|ℓ] ( ℓ)ℓ Z ℓ ψ2. To this end we observe the following: diagℓ{x1, . . . , xn} J = 1(#J = 1) x 2 + 1(#J 2) x max. EDℓF(Z) J [1(#J = 1) n + 1(#J 2)]4!/(4 ℓ)!( 4 ℓ)4 ℓ Z 4 ℓ ψ2 , for ℓ [4], where with a slight abuse of notation we understand ( 4 ℓ)(4 ℓ) = 1 when ℓ= 4. Using the moment estimate of Theorem 12 we obtain: F(Z) EF(Z) k K4 X ℓ [4] Z ℓ ψ2 X J Pℓ k#J /2[1(#J = 1) n + 1(#J 2)] 4!( 4 ℓ)4 ℓ Z 4 ℓ ψ2 (4 ℓ)! where Pℓis the set of partitions of [ℓ], the absolute constant K4 depends solely on the dimension four, and e K4 on the Z ψ2 norm and K4. Next by Chebyshev s inequality: P(n 1|F(Z) EF(Z)| t) e Kk 4 [ p k/n + k2/n]k Applying this inequality with k = min( log d , (n log d)1/4 ), and t = 2e e K4 q n gives us that: i=1 [(β Xi)2Yi (c0 + µ)] e C5 Neykov, Wang, and Liu with probability at least 1 exp( min( log d , (n log d)1/4 )) 1 max(O(n 1), d 1) where e C5 = 2e e K4. This is what we wanted to show. Lemma 31. We have: n 1 Pβ X [(Y µ) Xβ c0Xβ ] C6 with probability at least 1 O(n 1) 2d 1. Proof [Proof of Lemma 31] We have that Pβ X is independent of (Y µ) Xβ c0Xβ and thus: 1 n Pβ X [(Y µ) Xβ c0Xβ ] N 0, 1 i=1 (Yi µ c0)2(β Xi)2Pβ . By Chebyshev s inequality we obtain that 1 n Pn i=1(Yi µ c0)2(β Xi)2 2E((f(Z, ε) µ c0)2Z2) with probability at least Var((f(Z,ε) µ c0)2Z2) [E((f(Z,ε) µ c0)2Z2)]2n. Since Pβ 2 1, by a standard Gaussian tail bound we obtain: P Pβ X [(Y µ) Xβ c0Xβ ] t 2d exp nt2 4E((f(Z, ε) µ c0)2Z2) Setting t = 2 q 2E((f(Z, ε) µ c0)2Z2) log d n , completes the proof. 
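The proofs of Lemmas 26–31 all conclude with the same move: a coordinatewise Gaussian (or sub-exponential) tail bound combined with a union bound over the $d$ coordinates, producing sup-norm terms of order $\sqrt{\log d/n}$. A minimal simulation of this mechanism (illustrative only; the dimensions, number of repetitions, and seed are arbitrary choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, reps = 500, 200, 50

# Union-bound level: each coordinate of an average of n i.i.d. standard Gaussian
# d-vectors is N(0, 1/n), so the sup-norm should be around sqrt(2 log(2d) / n).
bound = np.sqrt(2 * np.log(2 * d) / n)

sup_norms = []
for _ in range(reps):
    g = rng.standard_normal((n, d)).mean(axis=0)  # each coordinate ~ N(0, 1/n)
    sup_norms.append(np.abs(g).max())

# The empirical sup-norms concentrate at the predicted sqrt(log d / n) scale.
assert max(sup_norms) < 2 * bound
assert sum(sup_norms) / len(sup_norms) > bound / 4
```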
Finally, we prove a lemma showing that the support of the second step of Algorithm 1 stays confined within the true support $S_{\beta^*}$, provided that $\nu_n$ is sufficiently large. We will use this lemma in the proof of Theorem 7. In the following, for a matrix $M\in\mathbb R^{d_1\times d_2}$ and sets $S_1\subseteq[d_1]$, $S_2\subseteq[d_2]$, we let $M_{\cdot,S_2} = [M_{ij}]_{i\in[d_1],\,j\in S_2}$ and $M_{S_1,S_2} = [M_{ij}]_{i\in S_1,\,j\in S_2}$.
Lemma 32. There exists a constant $\rho > 0$ depending solely on $f, \varepsilon$ such that if we set $\nu_n = C\sqrt{\frac{\log d}{n}}$ for a sufficiently large constant $C$ with $C^2 > \rho$, and $n$ satisfies $\frac{n}{s\log d} > 16^{-1}\rho$, then the vector $\hat\beta$ produced by the second step of Algorithm 1 satisfies $\mathrm{supp}(\hat\beta)\subseteq S_{\beta^*}$, with probability at least $1 - O(n^{-1} + d^{-1} + \exp(-\Omega(s)))$. Moreover, $\hat\beta$ can be identified with a vector $\tilde\beta$ (i.e., $\hat\beta \equiv \tilde\beta$ with probability at least $1 - O(n^{-1} + d^{-1} + \exp(-\Omega(s)))$), where $\tilde\beta$ is independent of $X_{\cdot,S^c_{\beta^*}}$ and $\mathrm{supp}(\tilde\beta)\subseteq S_{\beta^*}$.
Proof [Proof of Lemma 32] The proof of this lemma follows the proof of Theorem 2.3.4 (i) of Neykov et al. (2016), and here we will provide only a proof of a modification required in the original argument. It turns out that to show the first part of the statement we need to show that there exists a constant $\rho > 0$ such that:
$$|S_2|^{-1}\sum_{i\in S_2}\big((Y_i-\bar Y)X_i^\top\hat v - c_0(\hat v^\top\beta^*)X_i^\top\beta^*\big)^2 \le \rho,$$
with high probability. Using the elementary inequality $(a+b+c)^2\le 3(a^2+b^2+c^2)$ together with the fact that $|\hat v^\top\beta^*|\le 1$, we obtain:
$$|S_2|^{-1}\sum_{i\in S_2}\big((Y_i-\bar Y)X_i^\top\hat v - c_0(\hat v^\top\beta^*)X_i^\top\beta^*\big)^2/3 \le |S_2|^{-1}\sum_{i\in S_2}\big((\bar Y-\mu)X_i^\top\hat v\big)^2 + |S_2|^{-1}\sum_{i\in S_2}\big((Y_i-\mu)X_i^\top\hat\beta_\perp\big)^2 + |S_2|^{-1}\sum_{i\in S_2}\big((Y_i-\mu)X_i^\top\beta^* - c_0X_i^\top\beta^*\big)^2.$$
Recall that by Remark 18 we can treat $\hat v$ as independent of $X_{\cdot,S^c_{\beta^*}}$, with $\mathrm{supp}(\hat v)\subseteq S_{\beta^*}$. Hence the first two terms can be handled as in Lemmas 33 and 34, whilst the last term can be handled by Chebyshev's inequality as in the original proof of Theorem 2.3.4 (i). Using the statement of Theorem 2.3.4 (i), we conclude that when $\nu_n = C\sqrt{\frac{\log d}{n}}$ for a sufficiently large constant $C$ such that $C^2 > \rho$, and $n$ satisfies $\frac{n}{s\log d} > 16^{-1}\rho$, we have $\mathrm{supp}(\hat\beta)\subseteq S_{\beta^*}$.
Additionally, this also implies that $\hat\beta_{S_{\beta^*}}$ is in fact the solution of:
$$\hat\beta_{S_{\beta^*}} = \operatorname*{argmin}_{b\in\mathbb R^s}\; \frac{1}{2|S_2|}\sum_{i\in S_2}\big((Y_i-\bar Y)X_{i,S_{\beta^*}}^\top\hat v_{S_{\beta^*}} - X_{i,S_{\beta^*}}^\top b\big)^2 + \nu_n\|b\|_1,$$
with high probability. Hence, just as in Remark 18, we can identify $\hat\beta$ with a vector $\tilde\beta$ (i.e., $\hat\beta \equiv \tilde\beta$ with high probability), where $\tilde\beta$ is independent of $X_{\cdot,S^c_{\beta^*}}$ and $\mathrm{supp}(\tilde\beta)\subseteq S_{\beta^*}$.

Appendix E. Proofs for Algorithm 2

In this section, for simplicity, we relabel the sample size of the full data from $2n$ to $n$. We will refer to the sample size of the second half of the data of Algorithm 1 as $|S_2| := n/2$ whenever confusion may otherwise arise. Unlike in the previous sections, here we define $\hat\beta_\perp := \hat\beta - (\hat\beta^\top\beta^*)\beta^*$.
Proof [Proof of Theorem 7] The proof is similar to that of Theorem 6, the difference being that Lemma 33 is used in place of Lemma 25. We omit the details.

Lemma 33. There exists a constant $\tilde C$ depending on $f, \varepsilon$ such that:
$$\frac{1}{n}\big\langle (Y-\bar Y)\circ X\hat\beta - \alpha X\beta^*,\; X(\hat b-\alpha\beta^*)\big\rangle \le \tilde C\sqrt{\frac{\log d}{n}}\Big[\frac{1}{\sqrt n}\|X(\hat b-\alpha\beta^*)\|_2 + \|\hat b-\alpha\beta^*\|_1\Big],$$
with probability at least $1 - O(n^{-1} + d^{-1} + \exp(-cs))$ for some absolute constant $c > 0$.
Proof [Proof of Lemma 33] The proof of this lemma follows a similar route to the proof of Lemma 25, modulo several non-trivial modifications. The first modification is required when handling the term $\frac{1}{\sqrt n}\|X\hat\beta\|_2$, as we now need to take into account that the vector $\hat\beta$ is not independent of the matrix $X$. To handle the term $\|X\hat\beta\|_2$, recall that from Lemma 32 we have $\mathrm{supp}(\hat\beta)\subseteq S_{\beta^*}$. Using the definition of the matrix operator norm we obtain:
$$\|X\hat\beta\|_2 \le \|X_{\cdot,S_{\beta^*}}\|_2\,\|\hat\beta\|_2 \le (2\sqrt n + \sqrt s)\,\|\hat\beta\|_2,$$
with probability at least $1-2\exp(-n/2)$, by Theorem 5.35 of Vershynin (2010). Hence, as in Lemma 25, we obtain $\frac{1}{\sqrt n}\|(\bar Y-\mu)X\hat\beta\|_2 \le \tilde C_1\sqrt{\frac{\log d}{n}}$ with probability at least $1 - O(n^{-1}) - 2d^{-1}$. Next, our strategy is to use the following decomposition (contrast with the decomposition in (43)):
$$\frac{1}{n}\big\langle X^\top[(Y-\mu)\circ X\hat\beta - \alpha X\beta^*],\; \hat b-\alpha\beta^*\big\rangle \le \frac{1}{n}\big\|X_{\cdot,S^c_{\beta^*}}^\top[(Y-\mu)\circ X\hat\beta_\perp]\big\|_\infty\,\|\hat b-\alpha\beta^*\|_1 + \frac{1}{n}\big\|X_{\cdot,S_{\beta^*}}^\top[(Y-\mu)\circ X\hat\beta_\perp]\big\|_\infty\,\|\hat b-\alpha\beta^*\|_1 + \frac{1}{n}\big\|\beta^*\beta^{*\top}X^\top[(Y-\mu)\circ X\beta^* - c_0X\beta^*]\big\|_\infty\,\|\hat b-\alpha\beta^*\|_1 + \frac{1}{n}\big\|P_{\beta^*}^\perp X^\top[(Y-\mu)\circ X\beta^* - c_0X\beta^*]\big\|_\infty\,\|\hat b-\alpha\beta^*\|_1.$$
Since the last two terms do not involve bβ or bβ they are already handled by Lemma 29 and Lemma 31. We handle first two terms in Lemma 34 and 35. Lemma 34. Assuming that s < Un log2(n), we have: n 1 X ,Sc β [(Y µ) X bβ ] C7 for a constant C7 depending on f and ε with probability at least 1 O(d 1) O(n 1) 2 exp( cs) for an absolute constant c. Proof [Proof of Lemma 34] We start by noticing that conditioning on X,Sβ determines both Y µ and X,Sβ while X,Sc β are unaffected. Hence conditionally we have: i=1 (Yi µ)X i bβ Xi,Sc β N 0, bβ 2 2 n2 i=1 (Yi µ)2(X i bβ / bβ 2)2Id s . (47) We will now delicately handle the variance term. Define B := {β : β β , supp(β ) Sβ , β 2 = 1}. We have argued that b β b β 2 B with high probability. The following inequality holds i=1 X 2 i (Yi µ)2β 1 i=1 P{β } X 2 i P{β } (Yi µ)2 2. Now we realize that the terms P{β } Xi are independent of the terms Yi µ. Hence conditioning on Y µ we are in position to use Remark 5.40 of Vershynin (2010) to claim that i=1 P{β } X 2 i P{β } (Yi µ)2 2 1 i=1 (Yi µ)2 +n 1 e C v u u ts i=1 (Yi µ)4 +s max i (Yi µ)2 . Agnostic Estimation for Phase Retrieval with probability at least 1 2 exp( cs) for some absolute constants c and e C. To handle the the second term we simply resort to Chebyshev s inequality: i=1 (Yi µ)2 E(f(Z, ε) Ef(Z, ε))2 t Var[(f(Z, ε) Ef(Z, ε))2] by setting t = E(f(Z, ε) Ef(Z, ε))2 we bring the above probability to zero at a rate n 1. Similarly one can handle the term 1 n Pn i=1(Yi µ)4. The term maxi [n](Yi µ)2 was handled in (34) where we showed that there exists an absolute constant such that maxi [n](Yi µ)2 ee C log2 n with high probability. Putting everything together we conclude: i=1 X 2 i (Yi µ)2β 2E(f(Z, ε) Ef(Z, ε))2 + r s n2E(f(Z, ε) Ef(Z, ε))4 + ee Cs log2 n n | {z } e V with probability at least 1 2 exp( cs) O(n 1). Under the assumption s < Un log2(n) for some constant U > 0 we have e V < . 
This implies on a set with overwhelming probability the variance of (47) is bounded by the constant e V . Then by a standard normal tail bound in conjunction with a union bound we obtain: i=1 (Yi µ)X i bβ X i,Sc β t 2(d s) exp( nt2/(8 bβ 2 2 e V )). Selecting t = bβ 2 q e V log d/n drives the probability to zero at a rate O(d 1). Lemma 35. Suppose that n R s2(log d log n) for a sufficiently large constant R . We have: n 1 X ,Sβ [(Y µ) X bβ ] C8 with probability at least 1 O(n 1 + d 1 + exp( cs)). Proof [Proof of Lemma 35] Recall that by Theorem 6 bβ satisfies minη { 1,1} bβ ηβ 2 L q The last bound implies 1 L2 n | bβ β |. Combining this with bound (36), easy algebra yields n . Next, by the triangle inequality we decompose: X ,Sβ [(Y µ) X bβ ] β Sβ β Sβ X ,Sβ [(Y µ) X bβ ] + P{β Sβ } X ,Sβ [(Y µ) X bβ ] . Below we handle the two terms separately in turn starting with the first one. It is evident that: n 1 β Sβ β Sβ X ,Sβ [(Y µ) X bβ ] sup β B i=1 (Yi µ)β Sβ Xi,Sβ β Xi Neykov, Wang, and Liu where B = {β : β β , supp(β ) Sβ , β 2 = 1}. Notice that if we select two vectors w , v B, then the vector w v w v 2 B. Hence by the triangle inequality we have: i=1 (Yi µ)β Sβ Xi,Sβ w Xi n 1 n X i=1 (Yi µ)β Sβ Xi,Sβ v Xi + w v 2I. The bound in the above display implies that if Nϵ is an ϵ-net on B, then: I (1 ϵ) 1 sup v Nϵ i=1 (Yi µ)β Sβ Xi,Sβ v Xi . Let N1/2 be an 1/2-net of B. By Lemma 5.2 of Vershynin (2010) we have that |N1/2| 5s. Using these facts coupled with the conditional argument in the proof of Lemma 27, we conclude that there exists a constant e C such that I e Cp s n with probability at least 1 O(n 1 + d 1) 2 exp( cs) for some absolute constant c > 0. Next we turn to bounding the term: n 1P{β Sβ } X ,Sβ [(Y µ) X bβ ] bβ 2 sup β B i=1 (β Xi)2(Yi µ) bβ 2 ee C r s n + s log n with probability at least 1 O(n 1) 2 exp( cs), where the final bound follows much like in Lemma 34 so we omit the details. 
Combining the final two bounds with the facts that $\|\hat\beta_\perp\|_2 \le C\sqrt{\frac{s\log d}{n}}$ and $n \ge R's^2(\log d\vee\log n)$, we conclude that $n^{-1}\big\|X_{\cdot,S_{\beta^*}}^\top[(Y-\mu)\circ X\hat\beta_\perp]\big\|_\infty \le C_8\sqrt{\frac{\log d}{n}}$ for some sufficiently large constant $C_8$.

Appendix F. Lower Bound Proofs

Lemma 36. Let $\{S_j\}_{j=1}^L$ be sets with $S_j\subseteq[d]$ and $|S_j| = s$. Construct the sets $B_j = \{\beta : \|\beta\|_2 = 1,\ \mathrm{supp}(\beta) = S_j\}$. Then for any points $X_i\in\mathbb R^d$, $i\in[n]$, and any fixed constant $c < 0$, the set $\bigcup_{j\in[L]}\bigcup_{\beta\in B_j}\{(\mathrm{sign}(X_i^\top\beta + c))_{i\in[n]}\}$ has cardinality at most:
$$\Big|\bigcup_{j\in[L]}\bigcup_{\beta\in B_j}\{(\mathrm{sign}(X_i^\top\beta + c))_{i\in[n]}\}\Big| \le L\Big(\frac{ne}{s+1}\Big)^{s+1}.$$
Proof [Proof of Lemma 36] It suffices to show the result for $L \equiv 1$ with a fixed support set $S$ such that $|S| = s$. Start by constructing the function space $F_S := \{f(X) = X^\top\beta : \mathrm{supp}(\beta)\subseteq S,\ \beta\in\mathbb R^d\}$. It is clear that $\{f(X) = X^\top\beta : \beta\in\mathbb R^d,\ \|\beta\|_2 = 1,\ \mathrm{supp}(\beta) = S\}\subseteq F_S$. Denote the subgraph class of sets associated with $F_S$ by $\mathcal S(F_S) := \bigcup_{f\in F_S}\{(t,X) : t < f(X)\}$. Using Lemma 2.6.15 of van der Vaart and Wellner (1996), we conclude that the VC-index of the class $\mathcal S(F_S)$ is at most $s+2$. Next, applying the Sauer–Shelah lemma (see, e.g., Corollary 2.6.3 of van der Vaart and Wellner, 1996) we conclude that:
$$\max_{X_1,\dots,X_n}\Delta_n\big(\mathcal S(F_S), (-c,X_1),\dots,(-c,X_n)\big) \le \Big(\frac{ne}{s+1}\Big)^{s+1}, \quad (49)$$
where $\Delta_n$ is the number of subsets of $\{(-c,X_i)\}_{i=1}^n$ which can be picked out by the collection of sets $\mathcal S(F_S)$. Recall that a subset $A\subseteq\{(-c,X_i)\}_{i=1}^n$ is picked out by $\mathcal S(F_S)$ iff there exists a function $f\in F_S$ such that $f(X_i) > -c$ for all $(-c,X_i)\in A$, and $f(X_i)\le -c$ for all $(-c,X_i)\in A^c$. Equivalently, this means that there exists a vector $\beta\in\mathbb R^d$ with $\mathrm{supp}(\beta)\subseteq S$ such that $\mathrm{sign}(X_i^\top\beta + c) > 0$ if $(-c,X_i)\in A$ and $\mathrm{sign}(X_i^\top\beta + c) < 0$ if $(-c,X_i)\in A^c$. In other words, bound (49) is an upper bound on the possible number of quantization points which can be produced by the model (25) on the set $\{\beta : \|\beta\|_2 = 1,\ \mathrm{supp}(\beta) = S\}$ for an arbitrary matrix $X$.
Proof [Proof of Theorem 8] We assume throughout the proof, without loss of generality, that the infimum in (24) is attained at the function $h^*\in\mathcal H(K,M,L)$ ($h^*\in\mathcal C(L)$ in case (i)).
If that were not the case, the same argument goes through for a sequence of functions $h_n$ converging to the infimum. Our first step is to construct a sufficiently large set $B$ of $s$-sparse vectors on the unit sphere $\mathbb S^{d-1}$, relying on a standard combinatorial construction which is a sparse variation of the Gilbert–Varshamov bound, stated as Lemma 10 in Appendix A for the convenience of the reader. Apply Lemma 10 to construct the collection of sets $\{S_i\}_{i\in[N]}$. Next define the set $B$ in the following manner:
$$B := \Big\{\beta : \exists\, k\in[N] \text{ s.t. } \beta_j = \rho\,\mathbb 1(j\in S_k)\,(s-1)^{-1/2} \text{ for } j\in[d-1],\ \text{and } \beta_d = \sqrt{1-\rho^2}\Big\}, \quad (50)$$
for some to-be-specified constant $\rho < 1$. It is simple to see that $B$ is indeed a subset of the unit sphere in $\mathbb R^d$, i.e., $B\subset\mathbb S^{d-1}$; furthermore $\|\beta\|_0 = s$ for all $\beta\in B$, and by the construction of the sets $\{S_k\}_{k\in[N]}$ (see (29)) we have for all $k\ne j$:
$$\rho^2 \le \|\beta_k - \beta_j\|_2^2 \le 2\rho^2. \quad (51)$$
Next we resort to a general scheme of reducing estimation to testing, which is wonderfully described in Chapter 2 of Tsybakov (2009). For brevity, let $M(\hat\beta,\beta) = \min_{\eta\in\{-1,1\}}\|\hat\beta - \eta\beta\|_2$. We outline the reduction steps below. We have:
$$\inf_{\hat\beta}\sup_{\beta:\|\beta\|_0\le s,\,\|\beta\|_2=1}\mathbb E_{h^*,\beta}\,M(\hat\beta,\beta) \ge \inf_{\hat\beta}\sup_{\beta:\|\beta\|_0\le s,\,\|\beta\|_2=1}\frac{\rho}{2}\,\mathbb P_{h^*,\beta}\big(M(\hat\beta,\beta)\ge\rho/2\big) \ge \inf_{\hat\beta}\sup_{\beta\in B}\frac{\rho}{2}\,\mathbb P_{h^*,\beta}\big(M(\hat\beta,\beta)\ge\rho/2\big), \quad (52)$$
where in the first inequality we used Markov's inequality, and in the second the fact that $B\subset\mathbb S^{d-1}\cap\{\beta:\|\beta\|_0 = s\}$. Define the test $\psi^* = \operatorname{argmin}_{i\in[N]}M(\hat\beta,\beta_i)$. We now show that for all $j\ne\psi^*$ we have $M(\hat\beta,\beta_j)\ge\rho/2$. The latter holds trivially if $M(\hat\beta,\beta_{\psi^*})\ge\rho/2$, hence we assume $M(\hat\beta,\beta_{\psi^*}) < \rho/2$. Let further (without loss of generality) $M(\hat\beta,\beta_{\psi^*}) = \|\hat\beta-\beta_{\psi^*}\|_2$. By the triangle inequality and (51), this immediately implies that $\|\hat\beta-\beta_j\|_2\ge\rho/2$ for all $j\ne\psi^*$. Next we show that $\|\hat\beta+\beta_j\|_2\ge\rho/2$ for all $j\ne\psi^*$. By the triangle inequality:
$$\|\hat\beta+\beta_j\|_2 \ge \|\beta_{\psi^*}+\beta_j\|_2 - \|\hat\beta-\beta_{\psi^*}\|_2 \ge \sqrt{4(1-\rho^2)+2\rho^2} - \rho/2 > \rho/2,$$
where to see the last inequality recall that $\rho < 1$.
We have shown $M(\hat\beta,\beta_j)\ge\rho/2$ for all $j\ne\psi^*$, which by (52) implies:
$$\inf_{\hat\beta}\sup_{\beta:\|\beta\|_0\le s,\,\|\beta\|_2=1}\mathbb E_{h^*,\beta}\,M(\hat\beta,\beta) \ge \frac{\rho}{2}\max_{j\in[N]}\mathbb P_{h^*,\beta_j}(\psi^*\ne j) \ge \frac{\rho}{2}\inf_\psi\max_{j\in[N]}\mathbb P_{h^*,\beta_j}(\psi\ne j). \quad (53)$$
Hence it suffices to control the quantity on the right hand side of (53). To this end we use Fano's inequality, which is a standard tool for producing minimax lower bounds. We have:
$$\inf_\psi\max_{j\in[N]}\mathbb P_{h^*,\beta_j}(\psi\ne j) \ge 1 - \frac{I\big(J;\{(Y_i,X_i)\}_{i\in[n]}\big) + \log 2}{\log N}, \quad (54)$$
where $J$ is a random variable uniformly distributed on $[N]$, and $I$ denotes the mutual information. By a standard argument, the mutual information is bounded by the KL diameter of the set $B$, i.e.:
$$I\big(J;\{(Y_i,X_i)\}_{i\in[n]}\big) \le \max_{k,j} D_{\mathrm{KL}}\big(\mathbb P_{\beta_k}(\{(Y_i,X_i)\}_{i\in[n]})\,\big\|\,\mathbb P_{\beta_j}(\{(Y_i,X_i)\}_{i\in[n]})\big). \quad (55)$$
Define $U := \beta_k^\top X$, $V := \beta_j^\top X$, and $W := P^\perp_{\{\beta_k,\beta_j\}}X$, and notice that $W$ is independent of $(U,V)$. At this point the analysis of (i) and (ii) branches. We first show (i), and return to (ii) later.
(i) Just as in Example 3 of Neykov et al. (2015), a simple calculation yields:
$$D_{\mathrm{KL}}\big(\mathbb P_{\beta_k}(\{(Y_i,X_i)\}_{i\in[n]})\,\big\|\,\mathbb P_{\beta_j}(\{(Y_i,X_i)\}_{i\in[n]})\big) = n\,\mathbb E\big[P\big((\xi + h^*(U) - h^*(V))^2\big) - P(\xi^2)\big] = n\,\mathbb E\,\tilde P\big((h^*(U)-h^*(V))^2\big),$$
where $\xi$ is a random variable with density $p_\xi(x)\propto\exp(-P(x^2))$ (i.e., $\xi\stackrel{d}{=}\varepsilon$), and $\tilde P$ is another nonzero polynomial of the same degree as $P$, determined by the even moments of $\xi$. This holds since all odd moments of $\xi$ are zero and all even moments of $\xi$ exist. Using the fact that $h^*$ is $L$-Lipschitz, we can further bound the above by
$$\mathbb E\,\tilde P\big(L^2(U-V)^2\big) \le \mathbb E\exp\big(\tilde C_{P,L}(U-V)^2\big) - 1,$$
for a sufficiently large constant $\tilde C_{P,L}$ depending on $P$ and $L$. Finally, notice that $U-V\sim N(0,\nu^2)$, where $\nu^2\le 2\rho^2$ by (51). Direct calculation yields that:
$$\mathbb E\exp\big(\tilde C_{P,L}(U-V)^2\big) - 1 = \frac{1}{\sqrt{1-2\nu^2\tilde C_{P,L}}} - 1 \le \frac{1}{\sqrt{1-4\rho^2\tilde C_{P,L}}} - 1 \le 4\rho^2\tilde C_{P,L},$$
where we are assuming that $4\rho^2\tilde C_{P,L}\le\tfrac12$. Using Fano's bound we get:
$$\inf_{\hat\beta}\sup_{\beta:\|\beta\|_0\le s,\,\|\beta\|_2=1}\mathbb E_{h^*,\beta}\,M(\hat\beta,\beta) \ge \frac{\rho}{2}\Big(1 - \frac{16n\rho^2\tilde C_{P,L} + 4\log 2}{s\log(d/(4s))}\Big).$$
Hence (again in the case $\rho\le(8\tilde C_{P,L})^{-1/2}$), setting $\rho\asymp\tilde C_{P,L}^{-1/2}\sqrt{\frac{s\log(d/(4s))}{n}}$ we obtain the bound $C_{P,L}\sqrt{\frac{s\log(d/(4s))}{n}}$, where $C_{P,L} = \tilde C_{P,L}^{-1/2}/6$.
(ii) In the case when $\varepsilon\sim N(0,\sigma^2)$, using the formula for the KL-divergence between normal distributions, and the law of iterated expectation conditionally on $(U,V)$, we have:
$$D_{\mathrm{KL}}\big(\mathbb P_{\beta_k}(\{(Y_i,X_i)\}_{i\in[n]})\,\big\|\,\mathbb P_{\beta_j}(\{(Y_i,X_i)\}_{i\in[n]})\big) = \frac{n}{2\sigma^2}\,\mathbb E_{U,V}\big(h^*(U)-h^*(V)\big)^2. \quad (56)$$
We now remark that it suffices to consider the case $h^*\in\mathcal C(K,M,L)$ for some $K\in\mathbb N$. To see this, suppose that $h^* = \sum_{i=1}^{N'}w_ih_i^*$ for some $h_i^*\in\mathcal C(i,M,L)$ and nonnegative weights $w_i\ge 0$ such that $\sum_{i=1}^{N'}w_i = 1$. By Jensen's inequality:
$$\mathbb E_{U,V}\big(h^*(U)-h^*(V)\big)^2 \le \sum_{i=1}^{N'}w_i\,\mathbb E_{U,V}\big(h_i^*(U)-h_i^*(V)\big)^2 \le \max_{i\in[N']}\mathbb E_{U,V}\big(h_i^*(U)-h_i^*(V)\big)^2.$$
Next observe that:
$$\mathbb E_{U,V}\big(h^*(U)-h^*(V)\big)^2 = \mathrm{Var}\big(h^*(U)-h^*(V)\big), \quad (57)$$
where the equality follows from the fact that $h^*(U)\stackrel{d}{=}h^*(V)$, which in turn implies $\mathbb E\big(h^*(U)-h^*(V)\big) = 0$. Furthermore, notice that by (51) we have $\mathrm{Cov}(U,V) = \beta_k^\top\beta_j \ge 1-\rho^2$. Set $1-\nu^2 := \mathrm{Cov}(U,V)$ for some $0\le\nu\le\rho$. This fact, coupled with the normality of $U$ and $V$, implies that we can equivalently represent $U$ and $V$ as:
$$U = \sqrt{1-\nu^2}\,Z_1 + \nu Z_2, \qquad V = \sqrt{1-\nu^2}\,Z_1 + \nu Z_3,$$
where $Z_1, Z_2, Z_3$ are i.i.d. $N(0,1)$ random variables. To this end we recall the Gaussian Poincaré inequality (see, e.g., Theorem 3.20 of Boucheron et al., 2013), which states that if $F:\mathbb R^k\to\mathbb R$ is a continuously differentiable map, and $Z\sim N(0,I_k)$, then $\mathrm{Var}(F(Z))\le\mathbb E\|\nabla F(Z)\|_2^2$. Consider the map $F:\mathbb R^3\to\mathbb R$ defined by $F(Z_1,Z_2,Z_3) := h^*(\sqrt{1-\nu^2}Z_1+\nu Z_2) - h^*(\sqrt{1-\nu^2}Z_1+\nu Z_3)$. Applying the Gaussian Poincaré inequality yields:
$$\mathrm{Var}\big(h^*(U)-h^*(V)\big) \le (1-\nu^2)\,\mathbb E\big(h^{*\prime}(U)-h^{*\prime}(V)\big)^2 + 2\nu^2\,\mathbb E_{Z\sim N(0,1)}\big(h^{*\prime}(Z)\big)^2 \le (1-\nu^2)\,\mathrm{Var}\big(h^{*\prime}(U)-h^{*\prime}(V)\big) + 2\nu^2M,$$
where we used the fact that $h^{*\prime}(U)\stackrel{d}{=}h^{*\prime}(V)$ and the fact that $h^*\in\mathcal C(K,M,L)$. Now, since clearly $h^{*\prime}\in\mathcal C(K-1,M,L)$, we can inductively apply this argument to obtain the bound:
$$\mathrm{Var}\big(h^*(U)-h^*(V)\big) \le 2\nu^2M\sum_{k=0}^{K-1}(1-\nu^2)^k + (1-\nu^2)^K\,\mathrm{Var}\big(h^{*(K)}(U)-h^{*(K)}(V)\big) \le 2M\big(1-(1-\nu^2)^K\big) + L^2\,\mathbb E(U-V)^2 \le 2\rho^2\big(MK + L^2\big).$$
Combining this observation with (55), (56), and Fano's bound (54) we obtain:
$$\inf_{\hat\beta}\sup_{\beta:\|\beta\|_0\le s,\,\|\beta\|_2=1}\mathbb E_{h^*,\beta}\,M(\hat\beta,\beta) \ge \frac{\rho}{2}\Big(1 - \frac{4n\rho^2(MK+L^2)/\sigma^2 + 4\log 2}{s\log(d/(4s))}\Big),$$
where we used (28).
Selecting $\rho^2 = \sigma^2\,s\log(d/(4s))\,[12n(MK+L^2)]^{-1}$ makes the right hand side at least $\rho/6$, completing the proof.

Remark 37. In the special case of the phase retrieval model, equation (57) has the following explicit form:
$$\mathbb E_{U,V}(U^2-V^2)^2 = \mathbb E_{U,V}(U-V)^2(U+V)^2 = \mathbb E(U-V)^2\,\mathbb E(U+V)^2 = \|\beta_k-\beta_j\|_2^2\,\|\beta_k+\beta_j\|_2^2,$$
where we used that $U-V$ and $U+V$ are uncorrelated, hence independent, Gaussian random variables. This observation is sufficient for recovering the result of Theorem 3.2 of Cai et al. (2015), even in the general case when one assumes $\|\beta^*\|_2 = R$ instead of $\|\beta^*\|_2 = 1$.
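The identity in Remark 37 is straightforward to verify by simulation (a standalone sanity check, not code from the paper; the dimension, sample size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 5, 400_000

# Two unit-norm directions beta_k, beta_j and Gaussian design X ~ N(0, I_d)
bk = rng.standard_normal(d); bk /= np.linalg.norm(bk)
bj = rng.standard_normal(d); bj /= np.linalg.norm(bj)

X = rng.standard_normal((N, d))
U, V = X @ bk, X @ bj

# Monte Carlo estimate of E (U^2 - V^2)^2 versus the closed form
lhs = np.mean((U**2 - V**2) ** 2)
rhs = np.sum((bk - bj) ** 2) * np.sum((bk + bj) ** 2)  # ||bk - bj||^2 * ||bk + bj||^2

assert abs(lhs - rhs) / rhs < 0.1  # agreement up to Monte Carlo error
```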