# Robust Testing in High-Dimensional Sparse Models

Anand Jerry George
School of Computer and Communication Sciences
École Polytechnique Fédérale de Lausanne (EPFL)
anand.george@epfl.ch

Clément L. Canonne
School of Computer Science
The University of Sydney
clement.canonne@sydney.edu.au

We consider the problem of robustly testing the norm of a high-dimensional sparse signal vector under two different observation models. In the first model, we are given $n$ i.i.d. samples from the distribution $N(\theta, I_d)$ (with unknown $\theta$), of which a small fraction has been arbitrarily corrupted. Under the promise that $\|\theta\|_0 \le s$, we want to correctly distinguish whether $\|\theta\|_2 = 0$ or $\|\theta\|_2 > \gamma$, for some input parameter $\gamma > 0$. We show that any algorithm for this task requires $n = \Omega\big(s \log\frac{ed}{s}\big)$ samples, which is tight up to logarithmic factors. We also extend our results to other common notions of sparsity, namely, $\|\theta\|_q \le s$ for any $0 < q < 2$. In the second observation model that we consider, the data is generated according to a sparse linear regression model, where the covariates are i.i.d. Gaussian and the regression coefficient (signal) is known to be $s$-sparse. Here too we assume that an $\varepsilon$-fraction of the data is arbitrarily corrupted. We show that any algorithm that reliably tests the norm of the regression coefficient requires at least $n = \Omega\big(\min(s \log d, 1/\gamma^4)\big)$ samples. Our results show that the complexity of testing in these two settings significantly increases under robustness constraints. This is in line with the recent observations made in robust mean testing and robust covariance testing.

## 1 Introduction

Hypothesis testing is a fundamental task in statistics and a staple of the scientific method, in which we seek to test the validity of a pre-specified hypothesis based on empirical observations. In this work, we are concerned with the problem of testing whether a given high-dimensional sparse signal vector is zero under two common and well-studied observation models: 1) the Gaussian location model and 2) the Gaussian linear regression model. Specifically, in the Gaussian location model, we observe i.i.d. samples from a $d$-dimensional spherical Gaussian distribution with an unknown sparse mean vector, and seek to detect whether its $\ell_2$ norm is large (equivalently, we observe a set of measurements subject to white noise, and seek to determine whether there exists an underlying (sparse) signal). Similarly, in the Gaussian linear regression model we seek to detect whether the $\ell_2$ norm of the sparse regression coefficient is large. We further assume that our samples are imperfect or even corrupted, allowing an adversary to arbitrarily tamper with up to an $\varepsilon$-fraction of the observations. Our objective is to characterize the minimum number of samples required to perform these testing tasks, and, crucially, to understand the effect that requiring robustness to this adversarial corruption has on the complexity of the problems.

It is known [DKS17, DK21] that, for a variety of high-dimensional tasks, robust testing becomes as costly (in terms of sample complexity) as the corresponding estimation task. This is in contrast to the non-robust version, where testing is typically much more efficient, often by a quadratic factor in the dimension.
However, it is unclear how sparsity enters the picture, and for instance whether robustness only starts becoming costly when the signal vector is sufficiently dense, i.e., whether the problem exhibits a phase transition. This is particularly relevant, as the non-robust versions of the problems we consider are known to present such a phase transition at sparsity $s \approx \sqrt{d}$.

*How does robustness affect the sample and computational complexities of testing the norm of the signal vector in high-dimensional sparse models? Does testing remain easier than learning?*

This type of question, framed in a minimax setting, sits at the intersection of theoretical computer science (where it is captured under the framework of distribution testing) and robust statistics. Specifically, for $\Theta \subseteq \mathbb{R}^d$, let $\{p_\theta\}_{\theta\in\Theta}$ be a family of distributions and $p_0$ be a reference distribution (simple null hypothesis; in our case, the standard Gaussian). Then, we say that an algorithm $T$ reliably tests the $\ell_2$-norm of $\theta$ if it satisfies the condition
$$\max\Big\{ \Pr_{X\sim p_0^{n}}\big[\,T(X)=\mathrm{reject}\,\big],\ \sup_{\theta\in\Theta:\,\|\theta\|_2\ge\gamma} \Pr_{X\sim p_\theta^{n}}\big[\,T(X)=\mathrm{accept}\,\big] \Big\} \le \delta, \qquad (1)$$
where $\delta \in (0, 1]$ is the failure probability, which, following the literature, we will hereafter set to $1/3$ (the specific value $1/3$ is arbitrary: any constant smaller than $1/2$ would suffice, as one can amplify the success probability to $1-\delta$, for any $\delta > 0$, using a standard majority vote). The quantity of interest here is the sample complexity, that is, the minimum number of samples $n$ required by any algorithm to solve the problem.

The above task, however, assumes access to perfect samples from the unknown distribution $p_\theta$. This is often an unrealistic assumption, as a fraction of the $n$ samples could be imperfect or corrupted. This motivates the setting of robust testing. The problem is then similar to the formulation in (1), with a crucial difference: the algorithm $T$ does not have access to the i.i.d. samples $X = (X_1, \dots, X_n) \in (\mathbb{R}^d)^n$, but instead to a contaminated version $\widetilde X = (\widetilde X_1, \dots, \widetilde X_n) \in (\mathbb{R}^d)^n$ obtained by arbitrarily modifying up to $\varepsilon n$ of the $X_i$'s (i.e., an $\varepsilon$-fraction). We will refer to this as the $\varepsilon$-corruption model.

In this work we consider two instances of the general testing task (1) in the robust testing setting. First, let us introduce some notation common to the problems. For $q \in (0, 2)$, let $B_{s,q} := \{\theta \in \mathbb{R}^d : \|\theta\|_q \le s\}$ be the $\ell_q$ ball of radius $s$ in $\mathbb{R}^d$. For $q = 0$, we get the usual notion of sparsity, and will simply write $B_s$, for $0 \le s \le d$. Let $N(\theta, I_d)$ denote the $d$-dimensional Gaussian with mean $\theta \in \mathbb{R}^d$ and identity covariance.

Sparse Gaussian mean testing. The first problem that we consider is sparse Gaussian mean testing, in which, given an $\varepsilon$-corrupted dataset of $n$ samples from $N(\theta, I_d)$, where $\theta \in B_{s,q}$ is unknown, our goal is to robustly distinguish between (1) $\|\theta\|_2 = 0$, and (2) $\|\theta\|_2 \ge \gamma$ (equivalently, the total variation distance between $N(\theta, I_d)$ and the standard Gaussian $N(0, I_d)$ is $\Omega(\gamma)$), for some input parameter $\gamma \in (0, 1]$.

Testing in the sparse linear regression model. This is the second problem that we consider. In the sparse linear regression model the data is generated according to the following process. Let $X_1, X_2, \dots, X_n$ be i.i.d. samples from $N(0, I_d)$. Let $\theta \in B_s$ be unknown, and let the $\xi_i$'s be i.i.d. samples from $N(0, 1)$ (independent of the $X_i$'s), for $1 \le i \le n$. Then the $y_i$'s are generated as
$$y_i = \langle X_i, \theta\rangle + \xi_i \quad \text{for all } 1 \le i \le n. \qquad (2)$$
Note that for a given $\theta$, the joint distribution of $(X_i, y_i)$ is $N(0, \Sigma_\theta)$, where
$$\Sigma_\theta = \begin{pmatrix} I_d & \theta \\ \theta^\top & 1 + \|\theta\|_2^2 \end{pmatrix}.$$
Our aim is to robustly distinguish between (1) $\|\theta\|_2 = 0$, and (2) $\|\theta\|_2 \ge \gamma$, given an $\varepsilon$-corrupted version of the observations $(X_1, y_1), (X_2, y_2), \dots, (X_n, y_n)$.
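To make the two observation models concrete, the following is a purely illustrative simulation sketch (the function names, parameter values, and the particular corruption strategy are arbitrary choices, not part of the analysis in this paper) of the $\varepsilon$-corrupted Gaussian location model and of the sparse linear regression model (2).

```python
import numpy as np

def sparse_mean_samples(n, d, s, gamma, eps, rng):
    """n samples from N(theta, I_d) with an s-sparse mean of l2 norm gamma,
    followed by replacement of an eps-fraction of the points (requires eps > 0).
    The corruption used here simply re-centers the corrupted points so that the
    overall mean cancels; a real adversary may modify the points arbitrarily."""
    theta = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    theta[support] = gamma / np.sqrt(s) * rng.choice([-1.0, 1.0], size=s)
    X = rng.standard_normal((n, d)) + theta
    m = int(eps * n)                                            # number of corrupted samples
    X[:m] = rng.standard_normal((m, d)) - ((1 - eps) / eps) * theta
    return X, theta

def sparse_regression_samples(n, d, s, gamma, rng):
    """(X_i, y_i) pairs from the sparse linear regression model y = <X, theta> + xi."""
    theta = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    theta[support] = gamma / np.sqrt(s) * rng.choice([-1.0, 1.0], size=s)
    X = rng.standard_normal((n, d))
    y = X @ theta + rng.standard_normal(n)
    return X, y, theta

rng = np.random.default_rng(0)
X_mean, _ = sparse_mean_samples(n=500, d=100, s=5, gamma=0.3, eps=0.05, rng=rng)
X_reg, y_reg, _ = sparse_regression_samples(n=500, d=100, s=5, gamma=0.3, rng=rng)
```

The mean-cancelling corruption above mirrors the Huber-style contaminations used in the lower-bound constructions of Section 1.2; under the general $\varepsilon$-corruption model the adversary is not restricted to such oblivious modifications.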
Note that for both problems, one can restrict attention to the case $\gamma \gtrsim \varepsilon$, as otherwise the problem becomes trivially information-theoretically impossible.

### 1.1 Our contributions

Our main contributions are the characterization of the sample complexity of robust sparse Gaussian mean testing for a range of notions of sparsity, and a lower bound on the sample complexity of robust testing in the sparse linear regression model. Together, these results fully answer the above question, and provide more evidence for the belief that robustness requirements make testing tasks as hard as the corresponding estimation tasks. To establish our lower bounds, we draw upon and combine a variety of methods from the literature, in order to upper bound the $\chi^2$-divergence between a point and a mixture distribution before concluding by Le Cam's two-point method. We elaborate further on those aspects below.

Sparse Gaussian mean testing. It is known [CCT17] that, in the non-robust setting described in (1), the sample complexity of sparse Gaussian mean testing is
$$n_0(s, d, \gamma) = \Theta\!\left(\frac{s}{\gamma^2}\log\!\Big(1 + \frac{d}{s^2}\Big)\right) \quad \text{if } s < \sqrt{d},$$
for $q = 0$, and, for $q \in (0, 2)$,
$$n_q(s, d, \gamma) = \Theta\!\left(\frac{m}{\gamma^2}\log\!\Big(1 + \frac{d}{m^2}\Big)\right) \quad \text{if } m < \sqrt{d},$$
where $m := \max\{u \in [d] : \gamma^2 u^{2/q - 1} \le s^2\}$ is the effective sparsity. In particular, both sample complexities present a phase transition at sparsity $\sqrt{d}$, after which the sparsity no longer helps decrease the sample complexity of the problem, which defaults to the folklore non-sparse bound of $\Theta(\sqrt{d}/\gamma^2)$.

Our main result in this setting is a lower bound on robust sparse mean testing, which shows a significantly different landscape:

Theorem 1 (Informal; see Theorems 4 and 5). For every constant $\varepsilon, \gamma$, the sample complexity of robust sparse Gaussian mean testing in the $\varepsilon$-corruption model is $\Omega\big(s\log\frac{ed}{s}\big)$ for $q = 0$, and $\Omega\big(m\log\frac{ed}{m}\big)$ for $q \in (0, 2)$, where $m = \max\{u \in [d] : \gamma^2 u^{2/q-1} \le s^2\}$.

Moreover, our bound for standard $s$-sparsity is tight, in view of the known $O\big(\frac{s}{\gamma^2}\log\frac{ed}{s}\big)$ sample complexity bound for robust sparse mean estimation [Li17, DK19] (which implies the same bound for robust testing). This not only shows that the robust testing problem is much harder than its non-robust counterpart, especially in the dense regime (and actually as hard as the robust estimation problem), but also that the robust setting no longer presents any threshold phenomenon. If we set $s = d$, we further recover the result of [DKS17] that the robustness requirement increases the sample complexity of Gaussian mean testing. Figure 1 illustrates the sample complexity of robust and non-robust sparse Gaussian mean testing as a function of the sparsity.

The tightness of our bound, however, only follows from previous work in the case $q = 0$ (standard sparsity). We provide a (near) matching upper bound for the case $q \in (0, 2)$, which essentially resolves the question: showing that the aforementioned hardness and disappearance of a phase transition apply to all types of sparsity.

Theorem 2 (Informal; see Theorem 6). For every constant $\varepsilon, \gamma$, the sample complexity of robust sparse Gaussian mean testing in the $\varepsilon$-corruption model is $\widetilde{O}\big(m\log\frac{ed}{m}\big)$ for $q \in (0, 2)$, where $m = \max\{u \in [d] : \gamma^2 u^{2/q-1} \le s^2\}$.

Note that, for constant $\gamma$, our upper bound only differs from the lower bound of Theorem 1 by a logarithmic dependence on the effective sparsity $m$.
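As a quick numerical illustration of these rates, the snippet below (an illustrative sketch only: constants, as well as the dependence on $\varepsilon$ and $\gamma$ of the robust bound, are suppressed, and the function names are ours) computes the effective sparsity $m$ and tabulates the non-robust rate against the robust lower bound of Theorem 1.

```python
import numpy as np

def effective_sparsity(d, s, gamma, q):
    """m = max{u in [d] : gamma^2 * u^(2/q - 1) <= s^2}; returns at least 1."""
    u = np.arange(1, d + 1)
    ok = gamma ** 2 * u ** (2.0 / q - 1.0) <= s ** 2
    return int(u[ok][-1]) if ok.any() else 1

def nonrobust_rate(d, s, gamma):
    # Theta((s / gamma^2) log(1 + d / s^2)) for s <= sqrt(d); Theta(sqrt(d) / gamma^2) beyond.
    base = s * np.log1p(d / s ** 2) if s <= np.sqrt(d) else np.sqrt(d)
    return base / gamma ** 2

def robust_lower_bound(d, s):
    # Omega(s log(ed / s)), with constants and the eps/gamma dependence suppressed.
    return s * np.log(np.e * d / s)

d, gamma, q = 10_000, 0.2, 1.0
for s in (10, 50, 100, 1000):
    m = effective_sparsity(d, s, gamma, q)
    print(s, m, round(nonrobust_rate(d, s, gamma)), round(robust_lower_bound(d, s)))
```

Past sparsity $\sqrt{d}$ the non-robust rate saturates at $\sqrt{d}/\gamma^2$ while the robust lower bound keeps growing with $s$, which is precisely the disappearance of the phase transition discussed above.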
Finally, we note that while the upper bounds from [Li17] and Theorem 2 are achieved by computationally inefficient algorithms (time complexity exponential in $s$), this is actually inherent; indeed, [BB20] recently proved that any computationally efficient algorithm for robust sparse mean estimation must have a much higher sample complexity, namely $\Omega(s^2)$. A simple inspection of their proof shows that this result extends to the robust testing problem.

Figure 1: Sample complexity as a function of sparsity for non-robust and robust sparse Gaussian mean testing (the behavior applies to testing in the sparse regression model as well).

Testing in the sparse linear regression model. From the results in [CCC+19] we can deduce that the sample complexity of non-robust testing in the sparse linear regression model is given by
$$n_0(s, d, \gamma) = \Theta\!\left(\min\!\left(\frac{s}{\gamma^2}\log\!\Big(1 + \frac{d}{s^2}\Big),\ \frac{1}{\gamma^4}\right)\right).$$
This expression looks very similar to the sample complexity of sparse Gaussian mean testing, except for an additional $1/\gamma^4$ term. This term is essentially due to the fact that we can ignore the observations $X_i$'s and estimate $\gamma$ just by using the $y_i$'s, since their variance is $1 + \gamma^2$. This would require $O(1/\gamma^4)$ samples. Nevertheless, testing in sparse linear regression still exhibits a phase transition at $s \approx \sqrt{d}$, as was observed in sparse Gaussian mean testing. It is thus natural to wonder whether the parallels between these two problems extend to the robust setting as well. We show that this is indeed the case: the sample complexity of testing in sparse linear regression significantly increases when introducing the robustness condition.

Theorem 3 (Informal; see Theorem 7). For any sufficiently small $\gamma > 0$, $\varepsilon = \gamma^C$ for a sufficiently large $C$, and $s = d^{1-\delta}$ for any $\delta \in (0, 1)$, the sample complexity of testing in sparse linear regression under the $\varepsilon$-corruption model is $\Omega\big(\min\big(s\log d, \frac{1}{\gamma^4}\big)\big)$.

The tightness of this bound follows from [LSLC20, Theorem 2.1], which states that any algorithm for robust sparse mean estimation can be used for robust sparse linear regression with a $\mathrm{polylog}(1/\gamma)$ increase in the sample complexity. Hence the agnostic hypothesis selection via tournaments algorithm of [Li17], which is a statistically optimal algorithm for robust sparse mean estimation, works in this case as well.

### 1.2 Our techniques

To establish the lower bounds in Theorem 1, we combine a range of tools from information theory and the literature on robust estimation. First, we argue that it is enough to consider the standard sparsity case ($q = 0$), as this will imply the analogous result for $q \in (0, 2)$. Indeed, the definition of effective sparsity will enable us to deduce that any $x \in \mathbb{R}^d$ with $\|x\|_0 = m$ and $\|x\|_2 = \gamma$ belongs to the set $B_{s,q}$, letting us establish the lower bound given in Theorem 1 for general $q$ from the $q = 0$ case. In this discussion, we therefore focus on standard sparsity, and assume that our observations are from a distribution for which the unknown parameter $\theta$ lies in the set $B_s$. To simplify further, we note that in order to establish the desired lower bound it suffices to consider the weaker $\varepsilon$-Huber contamination model instead of the (more general) $\varepsilon$-corruption model [DK19]. In the $\varepsilon$-Huber contamination model, $n$ i.i.d. samples from a distribution $p$ are, after contamination, modeled as a set of $n$ i.i.d. samples from a distribution $(1-\varepsilon)p + \varepsilon N$, where $N$ is an arbitrary and unknown probability distribution (that is, the adversary is oblivious: limited to choosing, ahead of time, a bad mixture component $N$ to fool the algorithm).
We thus can restrict ourselves, for our lower bound, to the setting where the $n$ i.i.d. samples are from some distribution in $\{(1-\varepsilon)N(\theta, I_d) + \varepsilon N_{\varepsilon,\theta} : \|\theta\|_0 \le s\}$, where we get to design the distributions $N_{\varepsilon,\theta}$. Let $\mathcal{C}_s := \{(1-\varepsilon)N(\theta, I_d) + \varepsilon N_{\varepsilon,\theta} : \|\theta\|_0 \le s,\ \|\theta\|_2 \ge \gamma\}$ denote the set of $\varepsilon$-Huber contaminated versions of Gaussian distributions whose mean has $\ell_2$ norm at least $\gamma$. Our goal can be rephrased as choosing a suitable set of parameters $\Theta$ and a corresponding ensemble of distributions $\{q_\theta\}_{\theta\in\Theta} \subseteq \mathcal{C}_s$, and arguing that no algorithm can distinguish $n$ samples drawn from a (randomly chosen) element $q_\theta$ from $n$ samples drawn from $N(0, I_d)$. To do so, define the mixture $q := \frac{1}{|\Theta|}\sum_{\theta\in\Theta} q_\theta^n$ (equivalently, $q = \mathbf{E}_{\theta\sim u_\Theta}[q_\theta^n]$). Le Cam's two-point method allows us to reduce the above indistinguishability problem to showing that $d_{\mathrm{TV}}(q, N(0, I_d)^n) = o(1)$, for which, in view of the standard inequality $d_{\mathrm{TV}}(q, p)^2 \le \frac14\chi^2(q \,\|\, p)$, it suffices to show that $\chi^2(q \,\|\, N(0, I_d)^n) = o(1)$. By the Ingster–Suslina method, this $\chi^2$-divergence between a point and a mixture distribution can then be simplified as
$$1 + \chi^2(q \,\|\, N(0, I_d)^n) = 1 + \chi_{N(0,I_d)^n}\big(\mathbf{E}_\theta[q_\theta^n], \mathbf{E}_{\theta'}[q_{\theta'}^n]\big) = \mathbf{E}_{\theta,\theta'}\Big[\big(1 + \chi_{N(0,I_d)}(q_\theta, q_{\theta'})\big)^n\Big], \qquad (3)$$
where $\theta, \theta' \sim u_\Theta$ are independent, and $\chi_p(q, q') := \int \frac{dq\,dq'}{dp} - 1$ is the $\chi^2$-correlation between $q$ and $q'$ with respect to $p$. Thus, the challenge is to design an ensemble $\{q_\theta\}$ that yields a good upper bound for the r.h.s. of (3), while still being simple enough to analyze. Building upon previous works [DKS17, BB20], we define our ensemble $\{q_\theta\}$ as follows: let
$$\Theta := \Big\{\theta \in \Big\{-\tfrac{1}{\sqrt{s}}, 0, \tfrac{1}{\sqrt{s}}\Big\}^d : \|\theta\|_0 = s\Big\} \qquad (4)$$
and, for $\theta \in \Theta$, $q_\theta := (1-\varepsilon)N(\gamma\theta, I_d) + \varepsilon N(\mu_\theta, I_d)$, where $\mu_\theta$ is chosen such that $(1-\varepsilon)\gamma\theta + \varepsilon\mu_\theta = 0$. Note that indeed, for all $\theta \in \Theta$, $q_\theta \in \mathcal{C}_s$ and $\|\theta\|_2 = 1$. This choice of $q_\theta$, combined with the above outline, will enable us to prove Theorem 1, by carefully upper bounding $\mathbf{E}_{\theta,\theta'}\big[(1 + \chi_{N(0,I_d)}(q_\theta, q_{\theta'}))^n\big]$ as a function of $n$, $d$, and $s$.

One can attempt to prove Theorem 3 (the lower bound for testing in the sparse linear regression model) in a similar vein. By restricting ourselves to the $\varepsilon$-Huber contamination model (which only strengthens the resulting lower bound by constraining the adversary), we get the following set of distributions in the sparse linear regression problem: $\{(1-\varepsilon)N(0, \Sigma_\theta) + \varepsilon N_{\varepsilon,\theta} : \|\theta\|_0 \le s\}$, where
$$\Sigma_\theta := \begin{pmatrix} I_d & \theta \\ \theta^\top & 1 + \|\theta\|_2^2 \end{pmatrix}.$$
Further defining $\mathcal{D}_s := \{(1-\varepsilon)N(0, \Sigma_\theta) + \varepsilon N_{\varepsilon,\theta} : \|\theta\|_0 \le s,\ \|\theta\|_2 \ge \gamma\}$, as earlier our problem boils down to finding an ensemble $\{q_\theta\} \subseteq \mathcal{D}_s$ such that $\chi^2(q \,\|\, N(0, I_{d+1})^n) = o(1)$ for $q := \frac{1}{|\Theta|}\sum_{\theta\in\Theta} q_\theta^n$. Notice that in the sparse linear regression model, $y_i$ has marginal distribution $N(0, 1 + \|\theta\|_2^2)$ and, conditioned on $y_i$, the distribution of $X_i$ is $N\big(\frac{y_i\theta}{1+\|\theta\|_2^2},\ I_d - \frac{\theta\theta^\top}{1+\|\theta\|_2^2}\big)$. This observation, along with the indistinguishability result established while proving Theorem 1, encourages us to choose $\Theta$ as in (4) and $q_\theta$ in the following manner:
$$q_\theta(y_i) = N(0, 1+\gamma^2), \qquad q_\theta(X_i \mid y_i) = (1-\varepsilon)\,N\Big(\tfrac{y_i\gamma\theta}{1+\gamma^2},\ I_d - \tfrac{\gamma^2\theta\theta^\top}{1+\gamma^2}\Big) + \varepsilon\, N\Big(\tfrac{y_i\mu_\theta}{1+\gamma^2},\ I_d - \tfrac{\gamma^2\theta\theta^\top}{1+\gamma^2}\Big),$$
where $\mu_\theta$ is chosen such that $(1-\varepsilon)\gamma\theta + \varepsilon\mu_\theta = 0$. Again, note that $q_\theta \in \mathcal{D}_s$. Although this setup looks promising in giving us the right lower bound, it turns out that certain tail events can cause $\chi^2(q \,\|\, N(0, I_{d+1})^n)$ to blow up to infinity. To circumvent this, one of the remedies available in the literature is to use the conditional second moment method [RXZ19, WX18]. In this method, one carefully conditions out the rare events that preclude us from evaluating the $\chi^2$-divergence. Specifically, we define an event $E$ generated by the random variables $(y_1, y_2, \dots, y_n)$ such that $q(E^C) = o(1)$. We then define $q_E$ as the distribution $q$ conditioned on the good event $E$.
That is,
$$q_E(X, Y) = q(X, Y \mid E) = \frac{\mathbf{E}_\theta\big[q_\theta(X, Y)\,\mathbf{1}_E\{Y\}\big]}{q(E)}. \qquad (5)$$
Note that we have the relation $q = q(E)\, q_E + q(E^C)\, q_{E^C}$. The convexity of the total variation distance consequently gives
$$d_{\mathrm{TV}}(q, N(0, I_{d+1})^n) \le q(E)\, d_{\mathrm{TV}}(q_E, N(0, I_{d+1})^n) + q(E^C)\, d_{\mathrm{TV}}(q_{E^C}, N(0, I_{d+1})^n) \le d_{\mathrm{TV}}(q_E, N(0, I_{d+1})^n) + q(E^C).$$
Thus, in view of the fact that $q(E^C) = o(1)$, deriving a lower bound reduces to showing that $d_{\mathrm{TV}}(q_E, N(0, I_{d+1})^n)$ is small, for which we saw earlier that it is sufficient (and more convenient) to show that $\chi^2(q_E \,\|\, N(0, I_{d+1})^n) = o(1)$. This is the roadmap we will follow: first defining a suitable event $E$, before showing that $\chi^2(q_E \,\|\, N(0, I_{d+1})^n) = o(1)$.

The above outlines our approach to proving Theorems 1 and 3. To establish the upper bound in Theorem 2, we show that the agnostic hypothesis selection via tournaments algorithm [DKK+16] achieves the stated sample complexity. The argument, in turn, is very similar to the proof of the upper bound for robust sparse mean estimation given in [Li17].
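As a small numerical sanity check of the mean-testing construction above (an illustrative sketch, with helper names of our choosing), the $\chi^2$-correlation $\chi_{N(0,I_d)}(q_\theta, q_{\theta'})$ between two contaminated distributions can be computed in closed form, since for mixtures of identity-covariance Gaussians one has $\int N(x; a, I)\,N(x; b, I)/N(x; 0, I)\,dx = \exp(\langle a, b\rangle)$.

```python
import numpy as np

def chi2_correlation(means_p, weights_p, means_q, weights_q):
    """chi_{N(0,I)}(p, q) for two mixtures of identity-covariance Gaussians,
    using  int N(x; a, I) N(x; b, I) / N(x; 0, I) dx = exp(<a, b>)."""
    total = 0.0
    for wa, a in zip(weights_p, means_p):
        for wb, b in zip(weights_q, means_q):
            total += wa * wb * np.exp(a @ b)
    return total - 1.0

def contaminated_pair(theta, gamma, eps):
    """q_theta = (1-eps) N(gamma*theta, I) + eps N(mu_theta, I), with
    (1-eps)*gamma*theta + eps*mu_theta = 0 so the overall mean is zero."""
    mu_theta = -(1 - eps) * gamma / eps * theta
    return [gamma * theta, mu_theta], [1 - eps, eps]

d, s, gamma, eps = 50, 5, 0.05, 0.1
rng = np.random.default_rng(1)

def random_theta():
    theta = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    theta[support] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return theta

theta1, theta2 = random_theta(), random_theta()
means1, weights1 = contaminated_pair(theta1, gamma, eps)
means2, weights2 = contaminated_pair(theta2, gamma, eps)
print(chi2_correlation(means1, weights1, means2, weights2))  # tiny: the linear term cancels
```

Because the mixture means are chosen so that $(1-\varepsilon)\gamma\theta + \varepsilon\mu_\theta = 0$, the first-order term in $\langle\theta, \theta'\rangle$ cancels and the correlation is of order $\gamma^4\langle\theta, \theta'\rangle^2$, which is what makes the quadratic bound of Lemma 2 below possible.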
### 1.3 Related work

The study of high-dimensional signals with a sparse underlying structure has enjoyed a significant amount of attention in statistics and signal processing for the past few decades. This line of research has led to the discovery of surprising phenomena such as phase transitions and computational hardness in problems involving sparse signals. The main motivation to study sparse signals is that the sample complexity of statistical tasks involving sparsity is typically significantly smaller than that of dense signals, leading to much more data-efficient algorithms. This yields significant savings whenever the problem is expected to exhibit such a sparse structure, e.g., for physical or biological reasons, or due to a specific design choice when engineering a system. Yet, practical scenarios seldom involve noise-free or perfect signals, which effectively destroys the sparsity of the signal one would have capitalized on. This led to the study of these questions under various noise models, in order to understand whether one could still see the same type of sample size savings in these settings. There is a large body of work on the estimation and detection of signals with a sparse structure under noise (e.g., [DJ04, Bar02, Ver12]). Most recently, [CCT17] gave a tight characterization of the minimax rates of estimating linear and quadratic functionals of sparse signals under Gaussian noise. The authors also derived the minimum detection level required for reliably testing the $\ell_2$ norm of sparse signals under Gaussian noise. Another prominent sparse signal model studied in the literature is that of sparse linear regression [ITV10, Ver12, RXZ19]. In the non-asymptotic setting, [CCC+19] established the tight sample complexity of testing in the sparse linear regression model. However, these lines of work focused on random noise, and not the more challenging types of noise allowing for adversarial corruptions, that is, what is commonly known as seeking robust algorithms.

The systematic study of the robustness of statistical procedures was initiated in the foundational works of Huber [Hub64] and Tukey [Tuk60]. Several statistically optimal procedures were found to break down even under slight model misspecification or sample contamination. Although there has been substantial progress in the field of robust statistics, computationally efficient procedures surprisingly remained elusive until recently, even for simple tasks such as high-dimensional mean estimation.

In this regard, the past few years witnessed incredible progress in algorithmic robust statistics. A line of work initiated by [DKK+16] and [LRV16] provided computationally efficient optimal robust estimators for various estimation tasks in high dimension [DKK+19]. The surprising upshot from these papers is that even with robustness requirements, the sample complexity of many high-dimensional estimation tasks remains essentially the same, at no extra computational cost: i.e., that robustness and computational efficiency are not at odds for those estimation tasks. And still, a subsequent result of [DKS17] shows that, surprisingly, imposing robustness constraints does significantly increase the sample complexity of the Gaussian mean testing problem, and makes the sample complexity as large as that of Gaussian mean estimation: that is, for testing, robustness comes at a very high cost, and negates the usual savings that testing allows over estimation. Extending their results on mean testing, [DK21] shows the analogue in the case of Gaussian covariance testing under the Frobenius norm. Yet, those striking results focus on testing dense parameters; our work seeks to combine the two lines of work (inference under sparsity guarantees, and robustness) to understand whether an analogous jump in sample complexity occurs for testing in high-dimensional sparse models.

We further note that several works have shown evidence for the existence of statistical-computational gaps in estimation problems with a sparse signal structure [BR13, CW20]. [DKS17] gave the first evidence for the presence of an $s$-to-$s^2$ statistical-computational gap in robust sparse mean estimation, in the form of a Statistical Query (SQ) lower bound (i.e., for a restricted type of algorithms). This was complemented by [Li17] and [BDLS17], which gave computationally efficient algorithms for robust sparse estimation achieving $O(s^2)$ sample complexity. Finally, [BB20] recently used average-case reductions to prove the algorithmic hardness of robust sparse mean estimation and robust sparse linear regression.

### 1.4 Preliminaries and notation

Given a probability distribution $p$ and an integer $n \ge 1$, we denote by $p^n$ the $n$-fold product distribution with marginals $p$, and given a set of distributions $\mathcal{D}$ we write $\mathcal{D}^n = \{p^n : p \in \mathcal{D}\}$. For $0 < q \le 2$ and $r > 0$, we let $B_{q,r} = \{\theta \in \mathbb{R}^d : \|\theta\|_q \le r\}$ be the $\ell_q$ ball of radius $r$. We say a vector $\theta \in \mathbb{R}^d$ is $s$-sparse if $\|\theta\|_0 \le s$; for $q \in (0, 2]$, we will accordingly say that $\theta$ is $s$-sparse in $\ell_q$ norm whenever $\|\theta\|_q \le s$ (note that $s$ need not be an integer). We denote the uniform distribution over a set $S$ by $u_S$. The delta measure (point mass) at an element $x$ is denoted by $\delta_x$. The total variation distance between two distributions $\nu$ and $\mu$ is defined as $d_{\mathrm{TV}}(\nu, \mu) = \frac12\int |d\nu - d\mu|$, which is to be interpreted as $\frac12\int\big|\frac{d\nu}{d\lambda} - \frac{d\mu}{d\lambda}\big|\,d\lambda$, where $\lambda$ is any distribution dominating both $\nu$ and $\mu$; equivalently, $d_{\mathrm{TV}}(\nu, \mu) = \sup_S(\nu(S) - \mu(S))$, where the supremum is over all measurable sets $S$. The $\chi^2$-divergence between $\nu$ and $\mu$ is given by $\chi^2(\nu \,\|\, \mu) = \int\frac{d\nu\,d\nu}{d\mu} - 1$, and satisfies $d_{\mathrm{TV}}(\nu, \mu) \le \frac12\sqrt{\chi^2(\nu \,\|\, \mu)}$. Finally, given a reference probability measure $\mu$, the $\chi^2$-correlation between $\nu$ and $\lambda$ with respect to $\mu$ is defined as $\chi_\mu(\nu, \lambda) = \int\frac{d\nu\,d\lambda}{d\mu} - 1$; note that $\chi^2(\nu \,\|\, \mu) = \chi_\mu(\nu, \nu)$.
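The inequality $d_{\mathrm{TV}}(\nu, \mu) \le \frac12\sqrt{\chi^2(\nu \,\|\, \mu)}$, which underlies the reduction from total variation to $\chi^2$ used throughout the paper, can be checked numerically in one dimension; the throwaway sketch below (our code, relying on the standard closed form $\chi^2(N(\mu,1) \,\|\, N(0,1)) = e^{\mu^2} - 1$) does exactly that.

```python
import numpy as np

def tv_vs_half_sqrt_chi2(mu, half_width=12.0, n_grid=200_001):
    """Compare d_TV(N(mu,1), N(0,1)) with (1/2) * sqrt(chi^2(N(mu,1) || N(0,1)))."""
    x = np.linspace(-half_width, half_width, n_grid)
    dx = x[1] - x[0]
    p = np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)   # density of N(mu, 1)
    q = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)          # density of N(0, 1)
    tv = 0.5 * np.abs(p - q).sum() * dx                     # numerical integral of |p - q| / 2
    chi2 = np.exp(mu ** 2) - 1.0                            # closed form for unit-variance Gaussians
    return tv, 0.5 * np.sqrt(chi2)

for mu in (0.1, 0.5, 1.0):
    print(mu, tv_vs_half_sqrt_chi2(mu))  # the first entry never exceeds the second
```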
We will also rely on several occasions on the following technical lemma due to Cai, Ma, and Wu, which helps us bound the moment generating function of the square of a random sum of Rademacher random variables; such sums will arise when we consider sparse priors in our lower bounds.

Lemma 1 (Lemma 1 in [CMW15]). Fix $d \in \mathbb{N}$ and $s \in [d]$. Let $H \sim \mathrm{Hypergeometric}(d, s, s)$, and let $\xi_1, \xi_2, \dots, \xi_s$ be i.i.d. Rademacher. Define $Y := \sum_{i=1}^{H}\xi_i$. Then there exists a function $\tau : (0, \tfrac{1}{36}) \to (1, \infty)$ with $\tau(0^+) = 1$ such that, for any $0 < b < \tfrac{1}{36}$ and $\lambda := \frac{b}{s}\log\frac{ed}{s}$,
$$\mathbf{E}\big[\exp(\lambda Y^2)\big] \le \tau(b).$$

## 2 Main results

In this section, we formally state and give proofs for the theorems outlined informally in Section 1.1. Recall that a lower bound on (i.e., the hardness of) hypothesis testing between two distributions $p$ and $q$ can be characterized by the total variation distance between them. Indeed, by the Neyman–Pearson lemma, if there exists a test which successfully distinguishes between two distributions $p^n$ and $q^n$ (which in our case will correspond to distributions over $n$-tuples of i.i.d. samples) with probability at least $2/3$, then one must have $d_{\mathrm{TV}}(p^n, q^n) \ge 1/3$. Hence, to prove indistinguishability for a given $n$, it suffices to show $d_{\mathrm{TV}}(p^n, q^n) = o(1)$; since $d_{\mathrm{TV}}(p^n, q^n)^2 \le \frac14\chi^2(q^n \,\|\, p^n)$, one can then focus on showing $\chi^2(q^n \,\|\, p^n) = o(1)$.

Sparse Gaussian mean testing. We now state and prove our results related to sparse Gaussian mean testing. First, we derive the lower bounds in Theorem 1 using the techniques outlined in Section 1.2, and then show a matching upper bound (up to logarithmic factors) for the case $q > 0$.

Theorem 4. Let $\varepsilon, \gamma > 0$ be fixed. Let $X_1, X_2, \dots, X_n$ be i.i.d. samples from an unknown distribution $p$. Moreover, suppose an $\varepsilon$-fraction of these $n$ samples are arbitrarily corrupted. Then, if there exists an algorithm that distinguishes between the cases $p = N(0, I_d)$ and $p \in \{N(\theta, I_d) : \|\theta\|_2 \ge \gamma, \|\theta\|_0 \le s\}$ with probability greater than $2/3$, we must have $n = \Omega\big(s\log\frac{ed}{s}\big)$.

To prove this theorem, we will require the following lemma due to Diakonikolas, Kane, and Stewart [DKS17].

Lemma 2 ([DKS17, Lemma 6.9]). Fix $\theta, \theta' \in \mathbb{R}^d$ and $\varepsilon \in (0, 1/3]$, and let $q_\theta$ be defined as $q_\theta := (1-\varepsilon)N(\theta, I_d) + \varepsilon N\big(-\tfrac{1-\varepsilon}{\varepsilon}\theta, I_d\big)$. Then
$$1 + \big|\chi_{N(0,I_d)}(q_\theta, q_{\theta'})\big| \le \exp\!\left(\frac{\langle\theta, \theta'\rangle^2}{\varepsilon^4}\right).$$

With this in hand, we are now able to establish Theorem 4.

Proof of Theorem 4. For the purpose of deriving a lower bound, it is enough to consider the weaker $\varepsilon$-Huber model for the corruption of samples, as a lower bound on the sample complexity of the hypothesis testing problem in this setting will also be a lower bound for robust sparse Gaussian mean testing in the adversarial one. Consider the hypotheses
$$H_0 : X_i \sim N(0, I_d), \qquad H_1 : X_i \sim (1-\varepsilon)N(\theta, I_d) + \varepsilon N_{\varepsilon,\theta} \ \text{ s.t. } \|\theta\|_2 \ge \gamma \text{ and } \|\theta\|_0 \le s,$$
for $1 \le i \le n$. Here the distributions $N_{\varepsilon,\theta}$ need to be chosen appropriately. Let $B = \{\beta \in \{-1, 0, 1\}^d : \|\beta\|_0 = s\}$ and $\beta \sim u_B$. We can think of $\beta$ as being generated according to the following process: first pick an element uniformly from the set $\{b \in \{0, 1\}^d : \|b\|_0 = s\}$, and then set the non-zero elements to be i.i.d. Rademacher random variables. Define $\Theta := \frac{1}{\sqrt{s}}B$ and $\theta := \frac{1}{\sqrt{s}}\beta$, so that $\|\theta\|_2 = 1$. Define $q_\theta$ as
$$q_\theta = (1-\varepsilon)N(\gamma\theta, I_d) + \varepsilon N\Big(-\tfrac{(1-\varepsilon)\gamma}{\varepsilon}\theta, I_d\Big).$$
It is immediate to see that $q_\theta \in H_1$ for all $\theta \in \Theta$. Let $q := \mathbf{E}_{\theta\sim u_\Theta}[q_\theta^n]$. Then by (3),
$$1 + \chi^2(q \,\|\, N(0, I_d)^n) = \mathbf{E}_{\theta,\theta'}\Big[\big(1 + \chi_{N(0,I_d)}(q_\theta, q_{\theta'})\big)^n\Big].$$
By Lemma 2, we get
$$1 + \chi_{N(0,I_d)}(q_\theta, q_{\theta'}) \le \exp\!\left(\frac{\gamma^4\langle\theta, \theta'\rangle^2}{\varepsilon^4}\right) = \exp\!\left(\frac{\gamma^4\langle\beta, \beta'\rangle^2}{s^2\varepsilon^4}\right),$$
so that
$$1 + \chi^2(q \,\|\, N(0, I_d)^n) \le \mathbf{E}_{\beta,\beta'}\left[\exp\!\left(\frac{n\gamma^4\langle\beta, \beta'\rangle^2}{s^2\varepsilon^4}\right)\right]. \qquad (6)$$
Now, by symmetry of the problem, it suffices to evaluate the expectation for any fixed $\beta'$, say $\beta_0 := (\underbrace{1, 1, \dots, 1}_{s}, 0, 0, \dots, 0)$. We can characterize $\langle\beta, \beta_0\rangle$ as follows: let $H$ be a random variable with distribution $\mathrm{Hypergeometric}(d, s, s)$ and let $\xi_1, \xi_2, \dots, \xi_s$ be i.i.d. Rademacher random variables independent of $H$. Then we have $Y := \langle\beta, \beta_0\rangle = \sum_{i=1}^{H}\xi_i$, where $Y$ can be thought of as a symmetric random walk with a Hypergeometric stopping time (a quick Monte Carlo sanity check of the moment generating function of $Y^2$ is sketched below).
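The following crude Monte Carlo sketch (our code; the parameters are arbitrary, and $\lambda$ is taken as in Lemma 1 above) samples $Y$ and estimates $\mathbf{E}[\exp(\lambda Y^2)]$, illustrating that it stays bounded for small enough $b$.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_Y(d, s, size, rng):
    """Y = sum_{i=1}^{H} xi_i with H ~ Hypergeometric(d, s, s) and xi_i i.i.d. Rademacher."""
    H = rng.hypergeometric(ngood=s, nbad=d - s, nsample=s, size=size)
    # A sum of H independent Rademachers has the same law as 2 * Binomial(H, 1/2) - H.
    return 2 * rng.binomial(H, 0.5) - H

def mgf_estimate(d, s, b, size=200_000, rng=rng):
    """Monte Carlo estimate of E[exp(lambda * Y^2)] with lambda = (b / s) * log(e d / s)."""
    lam = (b / s) * np.log(np.e * d / s)
    Y = sample_Y(d, s, size, rng).astype(float)
    return np.exp(lam * Y ** 2).mean()

for b in (0.005, 0.01, 0.02):
    print(b, mgf_estimate(d=2000, s=40, b=b))  # stays close to 1 for small b, as in Lemma 1
```

In the proof, the relevant value is $\lambda = \frac{n\gamma^4}{s^2\varepsilon^4}$, so requiring $\lambda \le \frac{c}{s}\log\frac{ed}{s}$ is exactly the constraint $n \le c\,\frac{\varepsilon^4}{\gamma^4}\,s\log\frac{ed}{s}$ appearing below.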
We can then rewrite (6) as
$$1 + \chi^2(q \,\|\, N(0, I_d)^n) \le \mathbf{E}\left[\exp\!\left(\frac{n\gamma^4 Y^2}{s^2\varepsilon^4}\right)\right].$$
By Lemma 1, if $n \le c\,\frac{\varepsilon^4}{\gamma^4}\,s\log\frac{ed}{s}$ for a sufficiently small $c > 0$ (e.g., $c < 1/36$ suffices),
$$\chi^2(q \,\|\, N(0, I_d)^n) \le \mathbf{E}\left[\exp\!\left(\frac{n\gamma^4 Y^2}{s^2\varepsilon^4}\right)\right] - 1 \le \tau(c) - 1.$$
Since $\tau$ is continuous at $0^+$, we can choose $c > 0$ to make $\tau(c) - 1$ arbitrarily small. This implies that it is impossible to distinguish between $H_0$ and $H_1$ with high probability if $n = o\big(s\log\frac{ed}{s}\big)$, establishing the theorem.

Next, we extend this lower bound to other sparsity notions, namely, sparsity with respect to the $\ell_q$-norms.

Theorem 5. Let $\varepsilon, \gamma > 0$ be fixed, and $q \in (0, 2)$. Let $X_1, X_2, \dots, X_n$ be i.i.d. samples from an unknown distribution $p$. Moreover, suppose an $\varepsilon$-fraction of these $n$ samples are arbitrarily corrupted. Then, if there exists an algorithm that distinguishes between the cases $p = N(0, I_d)$ and $p \in \{N(\theta, I_d) : \|\theta\|_2 \ge \gamma, \|\theta\|_q \le s\}$ with probability greater than $2/3$, we must have $n = \Omega\big(m\log\frac{ed}{m}\big)$, where $m$ is the effective sparsity defined as $m := \max\{u \in [d] : \gamma^2 u^{2/q-1} \le s^2\}$.

Proof. The parameter set of the alternative hypothesis in this problem is $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\|_q \le s \text{ and } \|\theta\|_2 \ge \gamma\}$. To conclude, it suffices to show that every $m$-sparse (in the $\ell_0$ sense) vector with $\ell_2$ norm equal to $\gamma$ belongs to $\Theta$, and then appeal to the proof of Theorem 4 (whose hard instances had means of magnitude exactly $\gamma$). To do so, let $x \in \mathbb{R}^d$ be $m$-sparse with $\|x\|_2 = \gamma$. Then
$$\|x\|_q^2 = \left(\sum_{i=1}^d |x_i|^q\right)^{2/q} \overset{\text{(convexity)}}{\le} m^{\frac{2}{q}-1}\,\|x\|_2^2 = m^{\frac{2}{q}-1}\gamma^2 \overset{(a)}{\le} s^2,$$
where (a) is due to the definition of $m$, and the convexity step follows from $q/2 \le 1$. Hence $x \in \Theta$, and the lower bound in Theorem 4 holds here with $m$ replacing $s$.

We will now proceed to show that the lower bounds in Theorems 4 and 5 are tight up to a logarithmic factor. First, we note that the lower bound given in Theorem 4 is optimal, due to the result that the sample complexity of robust sparse Gaussian mean estimation is upper bounded by the same value [DK19, Li17], and the folklore fact that testing is no harder than estimation. So we need to prove the upper bound only for the cases where $q > 0$. We will show that the sample complexity lower bound given in Theorem 5 is tight by proving that the algorithm for robust sparse Gaussian mean estimation given in [Li17] works in the $\ell_q$-norm constrained case as well (and so, again, the upper bound for testing will follow from the upper bound for estimation), in the regime $\gamma = \Theta(\varepsilon)$. Intuitively, this is because all but $m$ coordinates of a vector $\theta$ must be small if $\|\theta\|_q \le s$ and $\|\theta\|_2 = \gamma$. Our proof is similar to the proof of [Li17, Fact A.1] with a minor modification; we defer it to the supplemental material.

Theorem 6. Let $\varepsilon, \delta > 0$ and $q \in (0, 2)$ be fixed. Let $X_1, X_2, \dots, X_n$ be i.i.d. samples from an unknown distribution $N(\mu, I_d)$, where $\|\mu\|_q \le s$. Moreover, suppose an $\varepsilon$-fraction of these $n$ samples are arbitrarily corrupted. Then there exists an algorithm that, for $n = O\big(\frac{m\log\frac{ed}{m}\,\log m + \log\frac{1}{\delta}}{\varepsilon^2}\big)$, upon being given $X_1, \dots, X_n$, outputs a $\hat\mu$ such that $\|\hat\mu - \mu\|_2 \le O(\varepsilon)$ with probability $1 - \delta$. Here $m$ is the effective sparsity defined as $m = \max\{u \in [d] : \varepsilon^2 u^{2/q-1} \le s^2\}$.
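The containment used in the proof of Theorem 5 (and, with $\gamma$ replaced by $\Theta(\varepsilon)$, implicitly in Theorem 6) can be checked numerically: an $m$-sparse vector of $\ell_2$ norm $\gamma$, with $m$ the effective sparsity, indeed lies in the $\ell_q$ ball of radius $s$. A small sketch follows (our code; the specific numbers are arbitrary).

```python
import numpy as np

def effective_sparsity(d, s, gamma, q):
    """m = max{u in [d] : gamma^2 * u^(2/q - 1) <= s^2} (at least 1)."""
    u = np.arange(1, d + 1)
    ok = gamma ** 2 * u ** (2.0 / q - 1.0) <= s ** 2
    return int(u[ok][-1]) if ok.any() else 1

d, s, gamma, q = 5000, 3.0, 0.11, 1.0
m = effective_sparsity(d, s, gamma, q)

rng = np.random.default_rng(3)
x = np.zeros(d)
support = rng.choice(d, size=m, replace=False)
x[support] = rng.choice([-1.0, 1.0], size=m) * gamma / np.sqrt(m)  # m-sparse, ||x||_2 = gamma

lq_norm = (np.abs(x) ** q).sum() ** (1.0 / q)
print(m, round(lq_norm, 4), lq_norm <= s)  # the flat m-sparse vector (worst case for q < 2) satisfies ||x||_q <= s
```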
Testing in the sparse linear regression model. Lastly, we state the formal version of Theorem 3 and provide an outline of its proof (see the supplementary material for a complete proof).

Theorem 7. Let $\gamma > 0$ be sufficiently small, $\varepsilon = \gamma^C$ for a sufficiently large $C$, and $s = d^{1-\delta}$ for some $\delta \in (0, 1)$. Let $(X_1, y_1), (X_2, y_2), \dots, (X_n, y_n)$ be i.i.d. samples obtained from the sparse linear regression model described in (2). Moreover, suppose an $\varepsilon$-fraction of these $n$ samples are arbitrarily corrupted. Then, if there exists an algorithm that distinguishes between the cases $\|\theta\|_2 = 0$ and $\|\theta\|_2 \ge \gamma$ with probability greater than $2/3$, we must have $n = \Omega\big(\min\big(s\log d, \frac{1}{\gamma^4}\big)\big)$.

Proof outline. Let $X$ denote $(X_1, X_2, \dots, X_n)$ and $Y$ denote $(y_1, y_2, \dots, y_n)$. Let $\Theta = \{\theta \in \{-\tfrac{1}{\sqrt s}, 0, \tfrac{1}{\sqrt s}\}^d : \|\theta\|_0 = s\}$ and $p = \frac{1-\varepsilon}{\varepsilon}$. We define the hypotheses $H_0$ and $H_1$ as follows:
$$H_0 : (X_i, y_i) \overset{\text{i.i.d.}}{\sim} p = N(0, I_{d+1});$$
$$H_1 : \theta \sim q(\theta) = u_\Theta; \quad y_i \overset{\text{i.i.d.}}{\sim} q(y_i) = N(0, 1+\gamma^2); \quad q(X_i \mid y_i, \theta) = q_\theta(X_i \mid y_i) = \mathbf{E}_b\left[N\Big(\tfrac{b\,y_i\gamma\theta}{1+\gamma^2},\ I_d - \tfrac{\gamma^2\theta\theta^\top}{1+\gamma^2}\Big)\right],$$
where $b \sim (1-\varepsilon)\delta_1 + \varepsilon\delta_{-p}$. Here, we denote the probability distribution under hypothesis $H_0$ by $p$ and under $H_1$ by $q$. Notice that $q_\theta$ is an $\varepsilon$-Huber contaminated version of $N(0, \Sigma_{\gamma\theta})$, where
$$\Sigma_{\gamma\theta} = \begin{pmatrix} I_d & \gamma\theta \\ \gamma\theta^\top & 1 + \gamma^2 \end{pmatrix}.$$
Hence, a lower bound on the sample complexity of this hypothesis testing task is also a lower bound for robust testing in the sparse linear regression model. A natural approach to prove the impossibility of detection between $H_0$ and $H_1$ would be to try to show that $\chi^2(q \,\|\, p) = o(1)$. However, as previously discussed, this method does not yield a useful lower bound for this problem, due to low-probability events which cause $\chi^2(q \,\|\, p)$ to blow up. Instead, as alluded to in Section 1.2, we evaluate $\chi^2(q_E \,\|\, p)$, where $E$ is an event measurable with respect to $Y = (y_1, \dots, y_n)$, chosen so that $q(E^C) = o(1)$ (its exact definition is given in the supplementary material), and the probability distribution $q_E$ is given by (5). Evaluating $\chi^2(q_E \,\|\, p)$ gives
$$1 + \chi^2(q_E \,\|\, p) = \frac{1}{q(E)^2}\,\mathbf{E}_{\theta,\theta'}\Bigg[\underbrace{\mathbf{E}_{Y\sim p}\left[\mathbf{1}_E\{Y\}\,\frac{q(Y)^2}{p(Y)^2}\,\mathbf{E}_{X\sim p}\left[\frac{q_\theta(X\mid Y)\,q_{\theta'}(X\mid Y)}{p(X)^2}\right]\right]}_{=:\,Z}\Bigg].$$
We compute $\mathbf{E}_{\theta,\theta'}[Z]$ by splitting it into two cases, depending on whether $|\langle\theta, \theta'\rangle| \le \tau$ or $|\langle\theta, \theta'\rangle| > \tau$, where $\tau = \frac{\varepsilon^2}{4e\gamma^2\log d}$. This leads to the result that $\chi^2(q_E \,\|\, p) = o(1)$ if $n = o\big(\min\big(s\log d, \frac{1}{\gamma^4}\big)\big)$. We then show that $E$ is asymptotically a high-probability event, which allows us to argue that $d_{\mathrm{TV}}(p, q) = o(1)$, thereby proving Theorem 7.
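To make the construction in the proof outline concrete, here is a small sampler for the $H_1$ distribution as reconstructed above (an illustrative sketch reflecting our reading of the construction; names and parameter values are arbitrary).

```python
import numpy as np

def sample_H1(n, d, s, gamma, eps, rng):
    """One draw of (X, y) from the contaminated hard instance: theta uniform over
    s-sparse +-1/sqrt(s) vectors, y_i ~ N(0, 1 + gamma^2), and X_i | y_i a two-component
    Gaussian mixture with weights (1 - eps, eps) and means b * y_i * gamma * theta / (1 + gamma^2)
    for b in {1, -(1 - eps)/eps}; the b = 1 component, together with y, is exactly the
    in-model distribution N(0, Sigma_{gamma * theta})."""
    theta = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    theta[support] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)

    y = rng.normal(0.0, np.sqrt(1.0 + gamma ** 2), size=n)
    b = np.where(rng.random(n) < 1.0 - eps, 1.0, -(1.0 - eps) / eps)
    cond_mean = np.outer(b * y * gamma / (1.0 + gamma ** 2), theta)

    # Sample the conditional covariance I - gamma^2 theta theta^T / (1 + gamma^2)
    # via a rank-one correction of standard Gaussian noise.
    c = gamma ** 2 / (1.0 + gamma ** 2)
    Z = rng.standard_normal((n, d))
    X = cond_mean + Z - (1.0 - np.sqrt(1.0 - c)) * np.outer(Z @ theta, theta)
    return X, y, theta

rng = np.random.default_rng(4)
X, y, theta = sample_H1(n=2000, d=200, s=14, gamma=0.1, eps=0.05, rng=rng)
print(round(y.var(), 3), round(float((X * theta).sum(axis=1).mean()), 3))
# y-variance is about 1 + gamma^2; <X_i, theta> averages to roughly 0
```

Under $H_0$ one would instead draw $(X_i, y_i)$ directly from $N(0, I_{d+1})$; the two data sets match in their first moments, which is one reason the testing problem becomes hard once robustness is required.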
## 3 Discussion and Future Work

In this section, we discuss some of the limitations of our results, which we believe are grounds for possible future work. Firstly, we suspect that our lower bounds are not tight with respect to the parameters $\gamma$ and $\varepsilon$. The dependence on $\gamma$ and $\varepsilon$ in Theorems 4 and 5 is $\Omega(\varepsilon^4/\gamma^4)$. While the exact dependence on these parameters is still an open problem even for robust Gaussian mean testing (non-sparse), we conjecture that the actual dependence on these parameters should scale as $\varepsilon^2/\gamma^4$, and hence believe that our dependence on these parameters is suboptimal. In the current proof, the bottleneck appears while obtaining an upper bound for the chi-squared correlation between two Huber-contaminated distributions, and we suspect that more advanced techniques might be required to get the right dependence on $\gamma$ and $\varepsilon$.

Furthermore, we restricted ourselves to the case where the covariances of the Gaussian distributions under consideration are identity matrices ("spherical Gaussians"); of course, handling the case of arbitrary covariances is a natural and important question. We believe that deriving the sample complexity in the case of non-identity (and unknown) covariance matrices would require significant additional effort, as well as new techniques and ideas. It is worth pointing out that we are not aware of any work addressing the sample complexity for non-identity covariances even in the non-robust (but sparse) setting.

## References

[Bar02] Yannick Baraud. Non-asymptotic minimax rates of testing in signal detection. Bernoulli, 8(5):577-606, 2002.

[BB20] Matthew Brennan and Guy Bresler. Reducibility and statistical-computational gaps from secret leakage. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of the Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 648-847. PMLR, 09-12 Jul 2020.

[BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 169-212. PMLR, 07-10 Jul 2017.

[BR13] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. Ann. Statist., 41(4):1780-1815, 2013.

[CCC+19] A. Carpentier, O. Collier, L. Comminges, A. B. Tsybakov, and Yu. Wang. Minimax rate of testing in sparse linear regression. Autom. Remote Control, 80(10):1817-1834, Oct 2019.

[CCT17] Olivier Collier, Laëtitia Comminges, and Alexandre B. Tsybakov. Minimax estimation of linear and quadratic functionals on sparsity classes. Ann. Statist., 45(3):923-958, 2017.

[CMW15] Tony Cai, Zongming Ma, and Yihong Wu. Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields, 161(3-4):781-815, 2015.

[CW20] T. Tony Cai and Yihong Wu. Statistical and computational limits for sparse matrix detection. Ann. Statist., 48(3):1593-1614, 2020.

[DJ04] David Donoho and Jiashun Jin. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist., 32(3):962-994, 2004.

[DK19] Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics, 2019, arXiv:1911.05911.

[DK21] Ilias Diakonikolas and Daniel M. Kane. The sample complexity of robust covariance testing. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of the Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 1511-1521. PMLR, 15-19 Aug 2021.

[DKK+16] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016), pages 655-664. IEEE Computer Soc., Los Alamitos, CA, 2016.

[DKK+19] Ilias Diakonikolas, Daniel Kane, Sushrut Karmalkar, Eric Price, and Alistair Stewart. Outlier-robust high-dimensional sparse estimation via iterative filtering. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[DKS17] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures (extended abstract). In 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2017), pages 73-84. IEEE Computer Soc., Los Alamitos, CA, 2017.

[Hub64] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73-101, 1964.

[ITV10] Yuri I. Ingster, Alexandre B. Tsybakov, and Nicolas Verzelen. Detection boundary in sparse regression. Electron. J. Stat., 4:1476-1526, 2010.

[Li17] Jerry Li. Robust sparse estimation tasks in high dimensions, 2017, arXiv:1702.05860.

[LRV16] Kevin A. Lai, Anup B. Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016), pages 665-674. IEEE Computer Soc., Los Alamitos, CA, 2016.
[LSLC20] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 411-421. PMLR, 26-28 Aug 2020.

[RXZ19] Galen Reeves, Jiaming Xu, and Ilias Zadik. The all-or-nothing phenomenon in sparse linear regression. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2652-2663. PMLR, 25-28 Jun 2019.

[Tuk60] John W. Tukey. A survey of sampling from contaminated distributions. In Contributions to Probability and Statistics, pages 448-485. Stanford Univ. Press, Stanford, Calif., 1960.

[Ver12] Nicolas Verzelen. Minimax risks for sparse regressions: ultra-high dimensional phenomenons. Electron. J. Stat., 6:38-90, 2012.

[WX18] Yihong Wu and Jiaming Xu. Statistical problems with planted structures: Information-theoretical and computational limits, 2018, arXiv:1806.00118.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] Our work focuses on theoretical aspects of statistical estimation and high-dimensional testing, and as such is remote enough from direct applications that its societal impacts are impossible to assess.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]