# Statistical-Computational Tradeoffs in High-Dimensional Single Index Models

Lingxiao Wang (Northwestern University; lingxiaowang2022@u.northwestern.edu), Zhuoran Yang (Princeton University; zy6@princeton.edu), Zhaoran Wang (Northwestern University; zhaoranwang@gmail.com)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We study the statistical-computational tradeoffs in a high-dimensional single index model Y = f(X^⊤β*) + ε, where f is unknown, X is a Gaussian vector, and β* is s-sparse with unit norm. When Cov(Y, X^⊤β*) ≠ 0, [43] shows that the direction and support of β* can be recovered using a generalized version of Lasso. In this paper, we investigate the case when this critical assumption fails to hold, where the problem becomes considerably harder. Using the statistical query model to characterize the computational cost of an algorithm, we show that when Cov(Y, X^⊤β*) = 0 and Cov[Y, (X^⊤β*)²] > 0, no computationally tractable algorithm can achieve the information-theoretic limit of the minimax risk. This implies that one must pay an extra computational cost for the nonlinearity involved in the model.

## 1 Introduction

A single index model (SIM) specifies that the response Y and the covariate X satisfy Y = f(X^⊤β*) + ε, where β* ∈ R^d is an unknown parameter, f : R → R is an unknown link function, and ε ∈ R is a random noise. This model extends linear regression by incorporating the unknown link function, which offers additional modeling flexibility and robustness to model misspecification. SIMs are extensively studied in the literature, with wide applications such as time series [17], survival analysis [35], and quantile regression [56]. Given n i.i.d. observations of this model, the primary focus is to estimate the parametric component β* without knowing the exact form of f. When β* is estimated accurately, f can be fitted via univariate nonparametric regression.

Recently, there is growing research interest in recovering β* in the high-dimensional setting where the dimensionality d is much larger than the sample size n and β* is sparse. When Y and X^⊤β* have nonzero correlation, [43, 44] propose to estimate β* by fitting an ℓ1-regularized linear model, i.e., Lasso [50], directly to Y and X. More interestingly, they also establish theoretical guarantees similar to those for the linear model. Specifically, they show that the Lasso estimator is consistent as long as the sample size is of the order s log d, where s is the number of nonzero entries in β*. Moreover, this sample complexity is known to be optimal in the sense that it attains the information-theoretic lower bound [46, 53], and the proposed estimator can be obtained efficiently using convex optimization. However, the Lasso approach fails when Y and X^⊤β* are uncorrelated, which is the case when the link function is symmetric. A prominent example is phase retrieval [10, 11], where f is known to be either the absolute value or the quadratic function. For sparse phase retrieval, the s log d sample complexity is attained only by the empirical risk minimizer [33], which searches over all (d choose s) possible support sets of β* and is thus computationally intractable. In addition, various efficient estimators have been proposed based on convex relaxation or projected gradient descent [8, 13], whose consistency is only shown when the sample size is of the order s² log d.
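To make the correlated case concrete, the following sketch simulates a SIM with a monotone link and fits a plain ℓ1-regularized linear model to (X, Y), in the spirit of [43, 44]. The link function, noise level, and regularization choice are illustrative assumptions, not the estimator analyzed in those papers.

```python
# Illustrative sketch: when Cov(Y, X^T beta*) != 0, fitting Lasso directly to
# (X, Y), as if the model were linear, recovers the direction of beta*.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, s = 500, 2000, 5
beta = np.zeros(d)
beta[:s] = 1.0 / np.sqrt(s)                       # s-sparse, unit-norm beta*

X = rng.standard_normal((n, d))
Y = np.tanh(X @ beta) + 0.1 * rng.standard_normal(n)   # unknown monotone link f

lam = np.sqrt(np.log(d) / n)                      # heuristic regularization level
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, Y).coef_

cos = beta_hat @ beta / (np.linalg.norm(beta_hat) * np.linalg.norm(beta) + 1e-12)
print("cosine similarity with beta*:", cos)
print("estimated support:", np.flatnonzero(np.abs(beta_hat) > 1e-3))
```

No analogous guarantee at the s log d scale is known for computationally efficient procedures when the link is symmetric, which is the gap discussed next.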
Thus, there seems to be an interesting tradeoff between statistical optimality and computational efficiency, i.e., there is a gap between the optimal statistical performance achieved by the family of computationally efficient estimators and that attained by all possible estimators. In sparse phase retrieval, such a gap is conjectured to be fundamental [8], and it is also observed in SIMs where f is symmetric [42, 48, 62]. This intriguing phenomenon motivates the following two questions: (i) How does the unknown link function affect the statistical and computational aspects of learning SIMs in high dimensions? (ii) Is the gap observed for symmetric link functions intrinsic, or can it be eliminated by more sophisticated algorithm design and analysis?

For the first question, we introduce the notions of first- and second-order Stein's associations, which characterize the dependence between Y and X^⊤β* at two different orders. We differentiate two types of link functions: (i) f with nonzero first-order Stein's association and (ii) f with zero first-order and nonzero second-order Stein's association. These two classes capture the functions considered in [43, 44] and [42, 48, 62], respectively. More importantly, we establish the statistical-computational barrier under an oracle computational model [16, 18, 19, 54], which is an abstraction of the computations made by algorithms that interact with data. Specifically, we study the signal detection problem where the link function is defined as a continuous interpolation of two link functions of different types. We establish information-theoretic and computational lower bounds for the minimum signal strength required for successful detection and also propose algorithms that yield matching upper bounds. Moreover, we characterize the gap between the signal strengths required for learning SIMs under limited and unlimited computational budgets and display the evolution of this gap as the link function transitions from one type to the other.

Main Contribution. Our contribution is three-fold. First, we introduce the first- and second-order Stein's associations, which provide a general characterization of the link functions considered in the literature. Second, for the detection problem, we establish nearly tight information-theoretic and computational lower bounds under the framework of the oracle model, which exhibit the statistical price paid for achieving computational efficiency in learning SIMs. Third, we also construct algorithms that yield matching upper bounds. Our results also imply a similar computational barrier for parameter estimation, thus providing a positive answer to the open problem raised in [8].

Related Work. There is a huge body of literature on single index models in the low-dimensional setting. See, for example, [25, 27, 29, 39] and the references therein. For high-dimensional SIMs, when Y and X^⊤β* have a nonzero correlation, [22, 23, 26, 40, 41, 43, 44, 58] study the statistical rates of Lasso-type estimators, which are shown to achieve both statistical accuracy and computational efficiency. In contrast, [42, 49, 61, 62] study SIMs which are generalizations of sparse phase retrieval [8]. In addition, the statistical query model is proposed by [30] and further extended by [15, 18-20] for studying the computational complexity of planted clique, random satisfiability problems, stochastic convex optimization, and Gaussian mixture models.
In addition, based on a slightly modified version, [16, 34, 54, 63] establish statistical-computational tradeoffs in statistical problems including sparse PCA, high-dimensional mixture models, weakly supervised learning, and graph structure inference. Among them, our work is most related to [16], which validates the computational barrier in phase retrieval with the absolute value link function by drawing a connection to mixtures of regression models. In comparison, we tackle SIMs directly, which include phase retrieval as a particular case. More importantly, by interpolating the two sub-classes of SIMs, we obtain the full spectrum of phase transitions, which sheds new light on the open problem raised in [8]. Furthermore, there is a massive body of literature on understanding the computational barriers of statistical models. Besides our oracle model approach, there are two other popular means of attacking such problems. The first is based on polynomial-time reductions from conjecturally hard problems to the statistical problems of interest. See, e.g., [3-7, 9, 12, 21, 24, 37, 57] and the references therein. The second method constructs a sequence of increasingly tight sum-of-squares convex relaxations based on semidefinite programming [1, 2, 14, 28, 31, 36, 38, 45, 55]. Although this approach is free of hardness conjectures, the resulting computational barriers only hold for the restricted family of convex relaxation algorithms.

## 2 Background

In this section, we first introduce the single index model and the associated signal detection problem. We then introduce the statistical query model, which quantifies the computational cost of an algorithm that interacts with data and is later used to establish the main results.

### 2.1 Statistical Model

We consider the single index model

Y = f(X^⊤β*) + ε,   (2.1)

where X ∼ N(0, I_d) is the covariate, Y is the response, β* ∈ R^d is the unknown parameter of interest, ε ∼ N(0, σ²) is the noise, and f : R → R is the unknown link function. Given n independent realizations {zi = (yi, xi)}_{i∈[n]} of this model, our goal is to estimate β* under the assumption that β* is s-sparse, s ≪ n, and d ≫ n. [43] estimate β* by exploiting the covariance structure Cov(Y, X^⊤β*). When such a structure is unavailable, that is, Cov(Y, X^⊤β*) = 0, [42, 62] estimate β* by exploiting Cov[Y, (X^⊤β*)²]. However, the resulting estimators require a higher sample complexity than the estimators that are based on Cov(Y, X^⊤β*). To understand such a gap in sample complexity, we consider more general settings under a unified framework. The key to this framework is the following pair of Stein's identities [47]. Let X ∼ N(0, I_d) be a standard Gaussian vector and Y = h(X), where h is differentiable. If the expectation E[∇h(X)] exists, the first-order Stein's identity takes the form

E[∇h(X)] = E[Y X].   (2.2)

Let Y = h(X), where h is twice differentiable. If the expectation E[∇²h(X)] exists, the second-order Stein's identity takes the form

E[∇²h(X)] = E[Y (XX^⊤ − I_d)].   (2.3)

The above identities show that the covariance structures Cov(Y, X^⊤β*) and Cov[Y, (X^⊤β*)²] are pivotal in the estimation of the model defined in (2.1) [59, 60]. Specifically, following from (2.2) with h(X) = f(X^⊤β*) + ε, it holds that E[Y X] = E[f′(X^⊤β*)] · β*, where f′ denotes the derivative of f. In other words, E[Y X] recovers β* up to a scaling under the assumption that Cov(Y, X^⊤β*) ≠ 0. Meanwhile, following from (2.3) with h(X) = f(X^⊤β*) + ε, it holds that E[Y XX^⊤] = E[f″(X^⊤β*)] · β*β*^⊤ + E[Y] · I_d. In other words, β* is the leading eigenvector of E[Y XX^⊤] under the assumption that Cov[Y, (X^⊤β*)²] > 0.
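As a quick numerical illustration of these two identities (a sketch under assumed example link functions, not the paper's construction), the snippet below estimates E[Y X] and E[Y XX^⊤] − E[Y] I_d by sample averages and checks that they expose the direction of β*.

```python
# Monte-Carlo check of the first- and second-order Stein's identities (2.2)-(2.3).
# Assumed setup for illustration: d = 10, beta* = (0.6, 0.8, 0, ..., 0), sigma = 0.1.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200000, 10
beta = np.zeros(d)
beta[0], beta[1] = 0.6, 0.8                      # unit-norm beta*

X = rng.standard_normal((n, d))
u = X @ beta                                     # the index X^T beta*
eps = 0.1 * rng.standard_normal(n)

# Link with nonzero first-order association, e.g., f(u) = u + u^2:
Y1 = u + u**2 + eps
first_moment = X.T @ Y1 / n                      # estimates E[Y X] = E[f'(u)] * beta*
print("E[Y X] direction  :", first_moment / np.linalg.norm(first_moment))

# Link with zero first-order association, e.g., f(u) = u^2 (phase-retrieval type):
Y2 = u**2 + eps
M = X.T @ (X * Y2[:, None]) / n - Y2.mean() * np.eye(d)   # estimates E[Y XX^T] - E[Y] I_d
leading = np.linalg.eigh(M)[1][:, -1]            # leading eigenvector of the symmetric matrix
print("leading eigenvector:", leading * np.sign(leading[0]))
```

Both printed vectors should be close to ±β* = ±(0.6, 0.8, 0, ..., 0) up to Monte-Carlo error.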
We define the following covariance structures, which play important roles in the estimation of β* in the model in (2.1) with unknown link function f.

Definition 2.1 (First-order and second-order Stein's associations). Let ψ be a twice differentiable transformation from R to R and let Y be the response of X under the model in (2.1). We define the first- and second-order Stein's associations between Y and X^⊤β* as

S1(Y) = Cov(Y, X^⊤β*),   S2(Y, ψ) = Cov[ψ(Y), (X^⊤β*)²],

respectively, where ψ is called the marginal transformation.

In the following, we introduce the classes of link functions of interest. We consider the following two classes of link functions,

C1 = { f : Cov[f(X^⊤β*), X^⊤β*] / ‖β*‖₂² = 1 },   C2 = { f : Cov[f(X^⊤β*), X^⊤β*] = 0 }.   (2.4)

The function class C1 is a class of normalized link functions. Following from the first-order Stein's identity in (2.2), it holds that Cov[f(X^⊤β*), X^⊤β*] = E[f′(X^⊤β*)] · ‖β*‖₂². In other words, the definition of C1 in (2.4) equivalently requires the link function f ∈ C1 to satisfy E[f′(X^⊤β*)] = 1. For any twice differentiable marginal transformation ψ, we define C(ψ) as the class of link functions f such that

C(ψ) = { f : Cov[ψ(Y), (X^⊤β*)²] / ‖β*‖₂⁴ ≍ 1 for Y = f(X^⊤β*) + ε }.   (2.5)

The definition of C(ψ) is a generalization of the misspecified phase retrieval model studied by [42, 62] with additive noise. By allowing marginal transformations of Y, such a class also covers the linear regression model as a special case. Note that in (2.5), we require the covariance structure Cov[ψ(Y), (X^⊤β*)²] to have a magnitude comparable to ‖β*‖₂⁴. Without any loss of generality, such a requirement specifies the scaling of the marginal transformation ψ and the corresponding link function f ∈ C(ψ). To see this, note that it holds from the second-order Stein's identity in (2.3) that

Cov[ψ(Y), (X^⊤β*)²] = E[D²ψ(f(X^⊤β*) + ε)] · ‖β*‖₂⁴,

where D is the differentiation operator with respect to X^⊤β*. In other words, (2.5) equivalently requires the link function f ∈ C(ψ) to satisfy E[D²ψ(f(X^⊤β*) + ε)] ≍ 1. For ψ(y) = y, the function class C(ψ) defined in (2.5) reduces to the misspecified phase retrieval models considered by [42, 62] with additive noise. For ψ(y) = y², C(ψ) characterizes the linear regression model, the mixed regression model, and various phase retrieval models, including Y = (X^⊤β*)² + ε and Y = |X^⊤β*| + ε, up to normalizations. In particular, C(ψ) also characterizes a class of one-hidden-layer neural networks with the Rectified Linear Unit (ReLU) activation function. For a neural network with two neurons in the hidden layer, where the parameters in the first layer are β* and −β*, and the parameter in the second layer is (1, 1) ∈ R², we have Y = max{X^⊤β*, 0} + max{−X^⊤β*, 0} + ε = |X^⊤β*| + ε, which is captured by C(ψ) with ψ(y) = y or ψ(y) = y² up to normalizations.

Throughout this paper, we focus on the marginal transformations ψ such that C(ψ) ∩ C1 ≠ ∅ and C(ψ) ∩ C2 ≠ ∅, where the function classes C1, C2, and C(ψ) are defined in (2.4) and (2.5). Such a class of marginal transformations ψ enables us to study the phase transition between f1 ∈ C(ψ) ∩ C1 and f2 ∈ C(ψ) ∩ C2. As an example, we consider ψ(y) = y. It holds that f1 ∈ C(ψ) ∩ C1 for f1(X^⊤β*) = X^⊤β* + (X^⊤β*)², and f2 ∈ C(ψ) ∩ C2 for f2(X^⊤β*) = (X^⊤β*)². In other words, it holds that C(ψ) ∩ C1 ≠ ∅ and C(ψ) ∩ C2 ≠ ∅ for ψ(y) = y.
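The membership of these example links in C1, C2, and C(ψ) can be checked numerically. The snippet below is an illustrative sketch: the link functions and ψ(y) = y are the example choices above, while the noise level and sample size are arbitrary assumptions.

```python
# Estimate S1(Y) = Cov(Y, X^T beta*) and S2(Y, psi) = Cov(psi(Y), (X^T beta*)^2)
# for f1(u) = u + u^2 (in C(psi) ∩ C1) and f2(u) = u^2 (in C(psi) ∩ C2), psi(y) = y.
import numpy as np

rng = np.random.default_rng(1)
n = 500000
u = rng.standard_normal(n)               # the index X^T beta*, with ||beta*||_2 = 1
eps = 0.1 * rng.standard_normal(n)

for name, f in [("f1(u) = u + u^2", lambda t: t + t**2),
                ("f2(u) = u^2    ", lambda t: t**2)]:
    Y = f(u) + eps
    S1 = np.cov(Y, u)[0, 1]              # first-order Stein's association
    S2 = np.cov(Y, u**2)[0, 1]           # second-order association with psi(y) = y
    print(f"{name}:  S1(Y) = {S1:+.3f},  S2(Y, psi) = {S2:+.3f}")

# Up to Monte-Carlo error, f1 gives S1 = 1 and S2 = 2 (both nonzero), whereas
# f2 gives S1 = 0 and S2 = 2, matching the two classes in (2.4) and (2.5).
```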
With link functions f1 ∈ C(ψ) ∩ C1 and f2 ∈ C(ψ) ∩ C2, we introduce the following statistical model of interest,

Y = f1(X^⊤β*) + ε with probability α,  and  Y = f2(X^⊤β*) + ε with probability 1 − α,   (2.6)

where ε ∼ N(0, σ²), X ∼ N(0, I_d), and β* is s-sparse. We assume that f1 and f2 are unknown, and ψ is known a priori. In (2.6), the mixture probability α controls the magnitude of the first-order Stein's association S1(Y) defined in Definition 2.1, which characterizes a notion of linearity between the response Y and the index X^⊤β*. Given n independent observations {zi = (yi, xi)}_{i∈[n]} of (2.6) with n ≪ d, we aim at detecting the existence of a nonzero parameter β*, that is, testing the following hypotheses,

H0 : β* = 0 versus H1 : β* ≠ 0.   (2.7)

In what follows, we assume that s is a known integer and σ² is an unknown constant. Meanwhile, to address the identifiability issue, we assume that ‖β*‖₂ is fixed. The difficulty of the testing problem in (2.7) is characterized by the signal-to-noise ratio (SNR), which is defined as κ(β*, σ) = ‖β*‖₂²/σ². Moreover, to characterize the minimum required SNR, we consider the following parameter spaces corresponding to the null and alternative hypotheses,

G0 = { (β*, σ) ∈ R^{d+1} : β* = 0 },   G1(s, γn) = { (β*, σ) ∈ R^{d+1} : ‖β*‖₀ = s, κ(β*, σ) ≥ γn },   (2.8)

where {γn}_{n=1}^∞ is a nonnegative sequence. For notational simplicity, we write θ = (β*, σ) and denote by P^n_θ the joint distribution of {zi}_{i=1}^n, which are generated by the model in (2.6) with the parameter of interest θ and nuisance parameters f1, f2, and ψ. For any test function φ that maps z = (z1, . . . , zn) ∈ R^{(d+1)×n} to {0, 1}, the worst-case risk for testing H0 : θ ∈ G0 versus H1 : θ ∈ G1(s, γn) is defined as the sum of the maximum type-I and type-II errors,

Rn(φ; G0, G1) = sup_{θ∈G0} P_θ(φ = 1) + sup_{θ∈G1} P_θ(φ = 0).   (2.9)

Correspondingly, the minimax risk is defined as

R*_n(G0, G1) = inf_φ sup_{f1, f2, ψ} Rn(φ; G0, G1),   (2.10)

where we take the supremum over the nuisance parameters f1, f2, and ψ of the model in (2.6), and the infimum over the test function φ. We further define the minimax separation rate in the following.

Definition 2.2 (Minimax separation rate [32, 51]). A sequence {γ*_n}_{n=1}^∞ is called the minimax separation rate if (i) given any sequence {γn}_{n=1}^∞ with γn = o(γ*_n), it holds that lim inf_n R*_n(G0, G1(s, γn)) = 1, and (ii) given any sequence {γn}_{n=1}^∞ with γn = Ω(γ*_n), it holds that lim_n R*_n(G0, G1(s, γn)) = 0.

The minimax separation rate characterizes the minimum SNR that guarantees the existence of an asymptotically powerful test. Therefore, it captures the difficulty of the hypothesis testing problem in (2.7).
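The detection problem above can be simulated directly. The following generator is a sketch under the assumed example links f1(u) = u + u² and f2(u) = u², with an arbitrary sparsity pattern; it draws observations from the mixture model (2.6) and reports the SNR κ(β*, σ).

```python
# Data generator for the mixture single index model (2.6).
import numpy as np

def sample_mixture_sim(n, d, s, alpha, sigma, rng,
                       f1=lambda u: u + u**2,   # example member of C(psi) ∩ C1
                       f2=lambda u: u**2):      # example member of C(psi) ∩ C2
    """Draw n observations (y_i, x_i) from (2.6) with an s-sparse, unit-norm beta*."""
    beta = np.zeros(d)
    beta[rng.choice(d, size=s, replace=False)] = 1.0 / np.sqrt(s)
    X = rng.standard_normal((n, d))
    u = X @ beta
    use_f1 = rng.random(n) < alpha               # link f1 with probability alpha
    Y = np.where(use_f1, f1(u), f2(u)) + sigma * rng.standard_normal(n)
    return Y, X, beta

rng = np.random.default_rng(2)
Y, X, beta = sample_mixture_sim(n=1000, d=500, s=5, alpha=0.3, sigma=1.0, rng=rng)
print("SNR kappa(beta*, sigma):", beta @ beta / 1.0**2)   # = ||beta*||_2^2 / sigma^2
```

Under H0 one would instead set β* = 0; a test has to decide between the two regimes from (Y, X) alone.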
### 2.2 Oracle Computational Model

In what follows, we introduce an oracle computational model that quantifies the computational cost of an algorithm. Our model follows the one considered in [16, 54], which slightly extends the statistical query model originally proposed in [18-20, 30].

Definition 2.3 (Statistical query model). A statistical oracle r responds to a given query function q with Z_q, which is a random variable in R. We define Q as the space of all query functions q : R^{d+1} → [−M, M]. We define an algorithm A as an iterative process that queries a given statistical oracle with query functions in Q_A ⊆ Q but does not access the data directly. We denote by A(T) the set of algorithms that query the statistical oracle for T rounds, where T is called the oracle complexity.

We denote by R[ξ, n, T, η(Q_A)] the set of statistical oracles r such that, for every query function q ∈ Q_A,

P( |Z_q − E[q(Z)]| ≤ τ_q ) ≥ 1 − 2ξ,   (2.11)

where Z_q is the response of the statistical oracle r, Z = (Y, X) is a random variable following the underlying statistical model, ξ ∈ [0, 1) is the tail probability, and τ_q is the tolerance parameter given by

τ_q = [η(Q_A) + log(1/ξ)] · M / n + √( 2 [η(Q_A) + log(1/ξ)] · ( M² − {E[q(Y, X)]}² ) / n ).   (2.12)

Here the parameter η(Q_A) is a logarithmic measure of the capacity of Q_A. For a countable Q_A, we have η(Q_A) = log(|Q_A|). For an uncountable Q_A, the magnitude η(Q_A) can be the Vapnik-Chervonenkis dimension or the metric entropy. The intuition behind Definition 2.3 is to separate the algorithm from the dataset. Under this definition, the algorithms we consider are black-box systems that access the necessary information through a statistical oracle. The definition of the statistical oracle r ∈ R[ξ, n, T, η(Q_A)] is a generalization of the sample average. Note that it holds that

M² − {E[q(Y, X)]}² ≥ Var[q(Y, X)].   (2.13)

If the response Z_q of the statistical oracle is the sample mean of n independent realizations of q(Z), then (2.11) follows from Bernstein's inequality coupled with a uniform concentration argument over Q_A, where the variance term is replaced by its upper bound in (2.13) [16].

To capture the computational difficulty of the hypothesis testing problem in (2.7), we introduce the following definition of the computational minimax risk, which is an analog of the minimax risk defined in (2.10) with an additional constraint on the oracle complexity. We consider algorithms A ∈ A(T) associated with a statistical oracle r ∈ R[ξ, n, T, η(Q_A)], and denote by H(A, r) the set of all test functions based on an algorithm A ∈ A(T) that queries r ∈ R[ξ, n, T, η(Q_A)] for T rounds. We define the risk of a test function φ ∈ H(A, r) as

Rn(φ; G0, G1) = sup_{θ∈G0} P_θ(φ = 1) + sup_{θ∈G1} P_θ(φ = 0).   (2.14)

Correspondingly, we define the computational minimax risk as

R*_n(G0, G1; A, r) = inf_{φ∈H(A,r)} sup_{f1, f2, ψ} Rn(φ; G0, G1).   (2.15)

The probability P_θ in the above formulation is taken over the distribution of the responses from the statistical oracle r under the model in (2.6) with the parameter of interest θ and nuisance parameters f1, f2, and ψ. We introduce the following definition of the computational minimax separation rate [18, 19, 54].

Definition 2.4 (Computational minimax separation rate). A sequence {γ̃*_n}_{n=1}^∞ is called the computational minimax separation rate if (i) given any sequence {γn}_{n=1}^∞ with γn = o(γ̃*_n), for any positive absolute constant µ and any A ∈ A(d^µ), there exists a statistical oracle r ∈ R[ξ, n, d^µ, η(Q_A)] such that lim inf_n R*_n(G0, G1(s, γn); A, r) = 1, and (ii) given any sequence {γn}_{n=1}^∞ with γn = Ω(γ̃*_n), there exists an algorithm A ∈ A(d^µ) with some absolute constant µ such that it holds for any statistical oracle r ∈ R[ξ, n, d^µ, η(Q_A)] that lim_n R*_n(G0, G1(s, γn); A, r) = 0.

In the following section, we give the explicit forms of γ*_n and γ̃*_n. In particular, when the link function deviates from the class C(ψ) ∩ C1, a gap between γ*_n and γ̃*_n arises, which characterizes the computational cost to pay for the lack of the first-order Stein's association defined in Definition 2.1.
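To make Definition 2.3 concrete, here is a toy oracle (an illustrative sketch with made-up class and parameter names; it is not the adversarial oracle constructed in the lower-bound proofs) that answers each bounded query with a sample mean, which satisfies a guarantee of the form (2.11) by Bernstein's inequality.

```python
# A toy statistical oracle: answers each query q with the sample mean of q(Z)
# over n observations, and tracks the oracle complexity T (number of rounds).
import numpy as np

class SampleMeanOracle:
    def __init__(self, data, M):
        self.data = data      # list of observations z_i = (y_i, x_i)
        self.M = M            # every query must map into [-M, M]
        self.rounds = 0       # oracle complexity used so far

    def query(self, q):
        """Respond to a query function q: (y, x) -> [-M, M]."""
        self.rounds += 1
        vals = np.clip([q(y, x) for (y, x) in self.data], -self.M, self.M)
        return float(np.mean(vals))

# Example usage: query the (truncated) correlation between Y and the first coordinate of X.
rng = np.random.default_rng(3)
data = list(zip(rng.standard_normal(200), rng.standard_normal((200, 5))))
oracle = SampleMeanOracle(data, M=10.0)
response = oracle.query(lambda y, x: y * x[0])
print("oracle response:", response, "| rounds used:", oracle.rounds)
```

An algorithm in A(T) interacts with the data only through at most T such responses; the lower bounds in the next section show that, below the computational separation rate, no algorithm with polynomially many queries can detect the signal.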
## 3 Main Results

In this section, we lay out the theoretical results. For the hypothesis testing problem in (2.7), we establish the information-theoretic and computational lower bounds by constructing a worst-case hypothesis testing problem. We further establish upper bounds that attain these lower bounds up to logarithmic factors, which are deferred to Appendix A. These lower and upper bounds together characterize the statistical-computational tradeoff. Finally, we show that such a tradeoff in hypothesis testing implies similar computational barriers in parameter estimation.

### 3.1 Lower Bounds

In what follows, we present lower bounds on the minimax and computational minimax separation rates defined in Definitions 2.2 and 2.4, respectively. For the hypothesis testing problem in (2.7) with parameter spaces defined in (2.8), we have the following proposition that characterizes its information-theoretic difficulty.

Proposition 3.1. We assume that β* in (2.6) is sparse such that s = o(d^{1/2−δ}) for some positive absolute constant δ. For any sequence {γn}_{n=1}^∞ with

γn = o( min{ √(s log d / n), s log d / (α² n) } ),   (3.1)

it holds that lim inf_n R*_n(G0, G1(s, γn)) ≥ 1. In other words, any test for the hypothesis testing problem in (2.7) and (2.8) is asymptotically powerless.

Proof. See Appendix B.1 for a detailed proof.

It follows from Proposition 3.1 that any sequence satisfying (ii) of Definition 2.2 is asymptotically lower bounded by any sequence that satisfies (3.1). As a result, it holds that

γ*_n = Ω( min{ √(s log d / n), s log d / (α² n) } ),   (3.2)

where γ*_n is the minimax separation rate defined in Definition 2.2. Based on (3.2) and the upper bound in Theorem A.2, which is deferred to Appendix A, up to logarithmic factors, the minimax separation rate defined in Definition 2.2 takes the form

γ*_n ≍ min{ √(s log d / n), s log d / (α² n) }.   (3.3)

The following theorem establishes a lower bound on the computational minimax separation rate defined in Definition 2.4.

Theorem 3.2. We assume that β* in (2.6) is sparse such that s = o(d^{1/2−δ}) for some positive absolute constant δ. For any positive absolute constant µ and any A ∈ A(d^µ), given any sequence {γn}_{n=1}^∞ with

γn = o( min{ √(s²/n), s log d / (α² n) } ),   (3.4)

there exists a statistical oracle r ∈ R[ξ, n, d^µ, η(Q_A)] such that lim inf_n R*_n(G0, G1; A, r) ≥ 1. In other words, any computationally tractable test for the hypothesis testing problem in (2.7) and (2.8) is asymptotically powerless.

Proof. See Appendix B.2 for a detailed proof.

It follows from Theorem 3.2 that any sequence satisfying (ii) of Definition 2.4 is asymptotically lower bounded by any sequence that satisfies (3.4). As a result, it holds that

γ̃*_n = Ω( min{ √(s²/n), s log d / (α² n) } ),   (3.5)

where γ̃*_n is the computational minimax separation rate defined in Definition 2.4. Based on (3.5) and the upper bound in Theorem A.3, which is deferred to Appendix A, up to logarithmic factors, the computational minimax separation rate defined in Definition 2.4 takes the form

γ̃*_n ≍ min{ √(s²/n), s log d / (α² n) }.   (3.6)

### 3.2 Phase Transition

In what follows, we characterize the phase transition in the minimax and computational minimax separation rates as the mixture probability α transitions from zero to one. We categorize the phase transition into the following regimes in terms of α.

1. For 0 < α ≤ ((log d)²/n)^{1/4}, our results show that γ*_n = √(s log d/n) and γ̃*_n = √(s²/n). For γn = o(√(s log d/n)), any test for the hypothesis testing problem in (2.7) is asymptotically powerless. For γn = Ω(√(s log d/n)) and γn = o(√(s²/n)), any asymptotically powerful test for (2.7) is computationally intractable, requiring superpolynomial oracle complexity in the sense of Definition 2.3. For γn = Ω(√(s²/n)), there exists an asymptotically powerful test that is computationally tractable with polynomial oracle complexity. In this regime, the gap between the computational minimax separation rate γ̃*_n and the minimax separation rate γ*_n is invariant to α.

2. For ((log d)²/n)^{1/4} ≤ α ≤ (s log d/n)^{1/4}, our results show that γ*_n = √(s log d/n) and γ̃*_n = s log d/(α² n). For γn = o(√(s log d/n)), any test is asymptotically powerless. For γn = Ω(√(s log d/n)) and γn = o(s log d/(α² n)), any asymptotically powerful test for (2.7) is computationally intractable.
For γn = Ω(s log d/(α² n)), there exists an asymptotically powerful test that is computationally tractable. In this regime, a larger α implies a smaller gap between γ̃*_n and γ*_n.

3. For (s log d/n)^{1/4} < α ≤ 1, our results show that γ*_n = γ̃*_n = s log d/(α² n). For γn = o(s log d/(α² n)), any test for the hypothesis testing problem in (2.7) is asymptotically powerless, whereas for γn = Ω(s log d/(α² n)), there exists an asymptotically powerful test that is computationally tractable. In this regime, the gap between γ*_n and γ̃*_n vanishes.

By the normalization specified following (2.7), the mixture probability α characterizes the first-order Stein's association of the model under the alternative hypothesis. Therefore, the phase transition implies that when the first-order Stein's association attains its maximum, which corresponds to α = 1, the gap between the computational minimax separation rate γ̃*_n and the minimax separation rate γ*_n vanishes, whereas when the first-order Stein's association vanishes, which corresponds to α = 0, the gap between γ̃*_n and γ*_n attains its maximum. In other words, the lack of the first-order Stein's association leads to an extra price in computational cost.

### 3.3 Implication for Parameter Estimation

For the model in (2.6), our result on the computational minimax separation rate implies computational barriers in the estimation of β*, which are established in the following theorem.

Theorem 3.3. For the estimation of β* in (2.6) with

γn = o( min{ √(s²/n), s log d / (α² n) } ),   (3.7)

where γn = ‖β*‖₂²/σ², it holds that, for any positive absolute constant µ and any algorithm A ∈ A(T) that outputs an estimator β̂ within oracle complexity T = O(d^µ), there exists a statistical oracle r ∈ R[ξ, n, T, η(Q_A)] such that

P( ‖β̂ − β*‖₂ ≥ σ γn^{1/2} / 4 ) ≥ C,   (3.8)

where C is a positive absolute constant.

Proof. See Appendix B.5 for a detailed proof.

Figure 1: Phase transition in the gap between the minimax separation rate and the computational minimax separation rate: (i) for 0 < α ≤ ((log d)²/n)^{1/4}, the gap is invariant to α; (ii) for ((log d)²/n)^{1/4} ≤ α ≤ (s log d/n)^{1/4}, a larger α implies a smaller gap; (iii) for (s log d/n)^{1/4} < α ≤ 1, the gap vanishes.

For α = 0, the estimation of β* in (2.6) reduces to the sparse phase retrieval problem. For simplicity of discussion, let γn = ‖β*‖₂²/σ² be a constant in the following discussion. Theorem 3.3 implies that for n = o(s²), any computationally tractable estimator is statistically inconsistent in the sense that ‖β̂ − β*‖₂ ≥ C holds with at least constant probability. [8] construct a computationally tractable estimator for sparse phase retrieval with the quadratic link function Y = |X^⊤β*|² + ε. The estimator by [8] is statistically consistent under the assumption that n ≥ C(1 + σ²/‖β*‖₂⁴) · s² log d. A similar phenomenon arises in the misspecified sparse phase retrieval problem studied by [42], although their work is slightly more general, in the sense that they consider f(X^⊤β*, ε) as the link function. The estimator by [42] requires n ≥ Cs² log d to be statistically consistent. Both [8] and [42] conjecture that their requirements on the sample size cannot be relaxed for computationally tractable estimators. Theorem 3.3 confirms this conjecture for the sparse phase retrieval problem under the statistical query model defined in Definition 2.3. For α = 1, the requirement for a computationally tractable estimator to be statistically consistent becomes n ≥ Cs log d. Such a sample size requirement agrees with the information-theoretic lower bound.
[43] construct a computationally tractable estimator of β*, which requires the sample size n ≥ Cs log(d/s) to be statistically consistent. It follows from Theorem 3.3 that such a requirement is necessary. For 0 < α < 1, we observe a phase transition in the required sample size in terms of α, which is similar to the phase transition of the computational minimax separation rates. For 0 < α ≤ √(γn log d/s), the requirement becomes n ≥ Cs². For √(γn log d/s) ≤ α ≤ 1, the requirement becomes n ≥ Cs log d/α². In this regime, a larger α implies a smaller sample size required for a computationally tractable estimator to be statistically consistent.
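To see the three regimes numerically, the following back-of-the-envelope script evaluates the separation rates stated in Section 3.2 for a few values of α; constants and logarithmic factors are ignored, and the specific values of n, d, and s are arbitrary assumptions.

```python
# Evaluate the minimax and computational minimax separation rates from Section 3.2
# and the resulting statistical-computational gap as a function of alpha.
import numpy as np

def separation_rates(n, d, s, alpha):
    """Return (statistical rate, computational rate), up to constants and log factors."""
    linear_route = s * np.log(d) / (alpha**2 * n) if alpha > 0 else np.inf
    statistical = min(np.sqrt(s * np.log(d) / n), linear_route)   # gamma*_n
    computational = min(np.sqrt(s**2 / n), linear_route)          # computational rate
    return statistical, computational

n, d, s = 10000, 2000, 20
boundaries = [(np.log(d)**2 / n) ** 0.25, (s * np.log(d) / n) ** 0.25]
for alpha in [0.05, boundaries[0], boundaries[1], 0.8, 1.0]:
    stat, comp = separation_rates(n, d, s, alpha)
    print(f"alpha = {alpha:.3f}: gamma*_n ~ {stat:.4f}, "
          f"computational rate ~ {comp:.4f}, gap factor ~ {comp / stat:.2f}")
```

The gap factor stays constant for small α (regime 1), shrinks as α grows (regime 2), and equals one once α exceeds (s log d/n)^{1/4} (regime 3).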