# Efficient List-Decodable Regression using Batches

Abhimanyu Das¹, Ayush Jain², Weihao Kong¹, Rajat Sen¹

¹Google Research. ²UC San Diego. The majority of this work was completed while Ayush Jain was an intern at Google Research. Correspondence to: Ayush Jain.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

We demonstrate the use of batches in studying list-decodable linear regression, in which only an $\alpha \in (0, 1]$ fraction of batches contain genuine samples from a common distribution and the rest can contain arbitrary or even adversarial samples. When genuine batches have $\Omega(1/\alpha)$ samples each, our algorithm can efficiently find a small list of potential regression parameters, with a high probability that one of them is close to the true parameter. This is the first polynomial time algorithm for list-decodable linear regression, and its sample complexity scales nearly linearly with the dimension of the covariates. The polynomial time algorithm is made possible by the batch structure and may not be feasible without it, as suggested by a recent Statistical Query lower bound (Diakonikolas et al., 2021b).

## 1. Introduction

Linear regression is one of the most fundamental tasks in supervised learning, with applications in various sciences and industries (McDonald, 2009; Dielman, 2001). In the standard linear regression setup, one is given $m$ samples $(x_i, y_i)$ such that $y_i = \langle w^*, x_i \rangle + n_i$, where $n_i$ is observation noise with bounded variance and the covariates $x_i \in \mathbb{R}^d$ are drawn i.i.d. from some fixed distribution. For this setup, the commonly used least-squares estimator, which minimizes the square loss $\sum_i (y_i - \langle w, x_i \rangle)^2$, provides a good estimate of the unknown regression vector $w^*$.

In many applications, some samples are inadvertently or maliciously corrupted, for example due to mislabeling, measurement errors, or data poisoning attacks. For instance, such corruptions are commonplace in biology (Rosenberg et al., 2002; Paschou et al., 2010) and machine learning security (Barreno et al., 2010; Biggio et al., 2012). Even a small number of corrupt samples in the data can cause the least-squares estimator to fail catastrophically. Classical robust estimators have been proposed in (Huber, 2011; Rousseeuw, 1991), but they suffer from exponential runtime. Recent works (Lai et al., 2016; Diakonikolas et al., 2019a; 2017) have derived efficient algorithms for robust mean estimation with provable guarantees even when a small fraction of the data is corrupt or adversarial. These works have inspired efficient algorithms for robust regression (Prasad et al., 2018; Diakonikolas et al., 2019b;c; Pensia et al., 2020) under the same corruption model. (Cherapanamjeri et al., 2020a; Jambulapati et al., 2021) have obtained robust regression algorithms with near-optimal runtime and sample complexity.

In this paper, we are interested in the setting where only a small fraction $\alpha$ of the data, potentially even less than half, is considered inlier, and the majority of the data may be influenced by factors such as adversarial manipulation, corruption, or bias, or may be drawn from a diverse distribution.
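To make the failure mode concrete, the following small simulation (our own illustration with made-up parameters, not an experiment from the paper) fits ordinary least squares when a majority of samples follow a different regression model: the clean-data fit recovers $w^*$, while the corrupted fit does not.

```python
# Illustrative simulation: OLS recovers w* from clean data but fails when a
# (1 - alpha) majority of samples follow a different model. All parameters
# here (d, m, alpha, sigma) are arbitrary choices for the demo.
import numpy as np

rng = np.random.default_rng(0)
d, m, alpha, sigma = 20, 5000, 0.3, 0.5      # alpha = fraction of genuine samples
w_star = rng.normal(size=d)

X = rng.normal(size=(m, d))
y = X @ w_star + sigma * rng.normal(size=m)

# Corrupt a (1 - alpha) majority: relabel them with a different regression vector.
n_bad = int((1 - alpha) * m)
w_bad = rng.normal(size=d)
y[:n_bad] = X[:n_bad] @ w_bad + sigma * rng.normal(size=n_bad)

w_clean, *_ = np.linalg.lstsq(X[n_bad:], y[n_bad:], rcond=None)   # fit on genuine samples only
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)                      # fit on the corrupted data
print("clean-data error    :", np.linalg.norm(w_clean - w_star))
print("corrupted-data error:", np.linalg.norm(w_ls - w_star))
```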
This setting also encompasses the problem of learning a mixture of regressions (Jordan & Jacobs, 1994; Zhong et al., 2016; Kong et al., 2020b; Pal et al., 2022), because any solution to the former immediately yields a solution to the latter by setting $\alpha$ to be the proportion of the data from the smallest mixture component. However, it is information-theoretically impossible to output a single accurate estimate of the regression parameter when $\alpha < 1/2$. Instead, it may be possible to generate a short list of estimates such that at least one of them is accurate. This relaxed notion of learning is known as list-decodable learning and is useful since a learner can identify a single accurate estimate from the list given a small number of reliable samples. For high-dimensional mean estimation, Charikar et al. (2017) derived the first polynomial time algorithm for the list-decodable setting.

List-decodable linear regression has been studied in (Karmalkar et al., 2019; Raghavendra & Yau, 2020), yielding algorithms with runtime and sample complexity of $d^{\mathrm{poly}(1/\alpha)}$. In contrast to list-decodable mean estimation, recent work (Diakonikolas et al., 2021b) has shown that a sub-exponential runtime and sample complexity might be impossible for linear regression. These prior results may lead to a pessimistic conclusion for obtaining practical algorithms for the fundamental learning paradigm of linear regression when less than half of the data may be inlier or genuine. However, our work demonstrates that this barrier can be overcome in various real-world applications such as federated learning (Wang et al., 2021), learning from multiple sensors (Wax & Ziskind, 1989), and crowd-sourcing (Steinhardt et al., 2016). In these and many other applications, individual data sources often provide multiple samples. We refer to a collection of samples from a single source as a batch. If a fraction $\alpha$ of the sources follow the underlying distribution we aim to learn, then an $\alpha$ fraction of the batches will contain independent samples from that distribution, while the remaining batches may contain arbitrary samples.

When each batch contains $\Omega(d)$ samples, one can estimate the regression vector from each batch separately. However, in modern applications the dimension of the data is typically high and only a moderate number of samples are available per batch (Grottke et al., 2015; Park & Tuzhilin, 2008; Kong et al., 2020b). As we show in this paper, for any $\alpha \in (0, 1]$, as long as the number of samples provided by each genuine source exceeds a small threshold of $\Omega(1/\alpha)$, we can use the grouping of samples in batches to develop a polynomial-time algorithm.

The batch setting has a natural advantage in the context of list-decodable learning. When there are multiple possible inlier distributions for the data sources, the list will include regression vectors for all distributions that underlie more than an $\alpha$ fraction of sources. To determine the best-fitting solution for a specific source from the short list generated by the list-decodable algorithm, a small hold-out portion of the batch provided by that source can be used. This post hoc identification of the best weight for a source/batch is naturally not feasible in the single-sample setting. This motivates the problem of list-decodable linear regression using batches.
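The post hoc identification step described above is simple enough to sketch directly. The snippet below (our own illustration; the candidate list and data are placeholders) selects, for a given source, the candidate from the list with the smallest squared loss on a few held-out samples from that source's batch.

```python
# Sketch of selecting the best candidate from a list using a small holdout set.
import numpy as np

def select_from_list(candidates, X_holdout, y_holdout):
    """Return the candidate with the smallest squared loss on the holdout samples."""
    losses = [np.mean((X_holdout @ w - y_holdout) ** 2) for w in candidates]
    return candidates[int(np.argmin(losses))]

rng = np.random.default_rng(1)
d = 10
w_true = rng.normal(size=d)
# A hypothetical list returned by a list-decodable algorithm: one candidate is close to w_true.
candidates = [w_true + 0.05 * rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)]
X_hold = rng.normal(size=(16, d))                       # a few holdout samples from the source
y_hold = X_hold @ w_true + 0.1 * rng.normal(size=16)
best = select_from_list(candidates, X_hold, y_hold)
print(np.linalg.norm(best - w_true))                    # small: the close candidate wins
```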
Formally, there are $m$ batches. Each batch is a collection of $n$ regression samples which either all come from a global regression model with true weight $w^*$ and noise variance $\sigma^2$ (a good batch) or are arbitrarily corrupted (an adversarial batch). The task is to output a small list of regression vectors, at least one of which is approximately correct, given that only an $\alpha$ fraction of the batches are good.

It is important to highlight that in this scenario, any algorithm aiming to provide reasonable estimation guarantees must return a list of estimates. This is because the formulation allows the data to stem from $\Theta(1/\alpha)$ different distributions, each of which generates at least an $\alpha$ fraction of the batches. The regression parameters of these distributions can differ arbitrarily. Without any method to identify the genuine distribution among these $\Theta(1/\alpha)$ possibilities, an algorithm providing a single estimate of the regression parameter would fail to offer a meaningful estimation guarantee.

Our main result is the following theorem:

**Theorem 1.1** (Informal). For any $\alpha \in (0, 1]$, there exists a polynomial time algorithm for list-decodable regression that uses $m = O_{n,\alpha}(d)$ batches each of size $n = \Omega(1/\alpha)$, and outputs $O(1/\alpha^2)$ weights such that with high probability at least one of them, $\hat w$, satisfies $\|\hat w - w^*\|_2 = O(\sigma/\sqrt{n\alpha})$.

We formally state the problem in Section 2, introduce necessary notation in Section 3, and present our main result in Section 4. In Section 5, we describe the main ideas behind our algorithm and provide a comprehensive overview of our technical contributions. We present our algorithm and prove its performance guarantee in Section 6. We provide a detailed discussion of related work in Appendix A.

## 2. Problem Formulation

We have $m$ sources. At least an $\alpha$-fraction of these $m$ sources are genuine and provide $n$ i.i.d. samples from a common distribution. The remaining sources may provide arbitrary data. Since we can use only the first $n$ samples from each source and ignore the rest, w.l.o.g. we assume that each source provides exactly $n$ samples. We refer to the collection of all samples from a single source as a batch.

To formalize the setting, let $B$ be a collection of $m$ batches. Each batch $b \in B$ in this collection has $n$ samples $\{(x^b_i, y^b_i)\}_{i=1}^n$, where $x^b_i \in \mathbb{R}^d$ and $y^b_i \in \mathbb{R}$. Among these batches $B$, there is a sub-collection $G$ of good batches such that for each $b \in G$ and $i \in [n]$ the samples $(x^b_i, y^b_i)$ are generated independently from a common distribution $D$, and the size of this sub-collection is $|G| \ge \alpha|B|$. The remaining batches $B \setminus G$ are adversarial batches and contain arbitrary samples, which may be selected by an adversary depending on the good batches. Next, we describe the assumptions on the distribution $D$. We require the same set of general assumptions on the distribution as in the recent work (Cherapanamjeri et al., 2020a), which focuses on the case when $n = 1$ and $1 - \alpha$ is small, that is, when all but a small fraction of the data is genuine.

**Distribution Assumptions.** For an unknown $d$-dimensional vector $w^*$, the sample noise $n^b_i$, the covariates $x^b_i$ and the outputs $y^b_i$ are random variables related as $y^b_i = \langle x^b_i, w^* \rangle + n^b_i$. Let $\Sigma = \mathbb{E}_D[x^b_i (x^b_i)^\top]$. For scaling purposes, we assume $\|\Sigma\| = 1$. We make the following general assumptions:

1. $x^b_i$ is L4-L2 hypercontractive, that is, for some $C \ge 1$ and all vectors $u$, $\mathbb{E}_D[(x^b_i \cdot u)^4] \le C\, \mathbb{E}_D[(x^b_i \cdot u)^2]^2$.
2. For some constant $C_1 > 0$, $\|x^b_i\| \le C_1$.
3. The condition number of $\Sigma$ is at most $C_3$, that is, for each unit vector $u$, we have $u^\top \Sigma u \ge \|\Sigma\|/C_3 = 1/C_3$.
4. The sample noise $n^b_i$ is independent of $x^b_i$, has zero mean $\mathbb{E}_D[n^b_i] = 0$, and bounded variance $\mathbb{E}_D[(n^b_i)^2] \le \sigma^2$.
5. The distribution of the noise $n^b_i$ is symmetric around 0.
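To make the batch model concrete, here is a small synthetic instance (our own illustration, not a benchmark from the paper): good batches draw i.i.d. samples from a linear model with standard Gaussian covariates, which are L4-L2 hypercontractive with $C = 3$ and have $\Sigma = I$; adversarial batches here simply follow a different regression vector, though the model allows them to be arbitrary.

```python
# Synthetic data following the batch model: alpha fraction good, rest adversarial.
import numpy as np

def make_batches(m, n, d, alpha, sigma, rng):
    w_star = rng.normal(size=d) / np.sqrt(d)      # true regression parameter
    w_adv = rng.normal(size=d)                    # adversary's (arbitrary) choice
    batches, is_good = [], []
    for _ in range(m):
        X = rng.normal(size=(n, d))               # Sigma = I, so ||Sigma|| = 1, condition number 1
        good = rng.random() < alpha
        w = w_star if good else w_adv
        y = X @ w + sigma * rng.normal(size=n)
        batches.append((X, y))
        is_good.append(good)
    return batches, np.array(is_good), w_star

batches, is_good, w_star = make_batches(m=500, n=50, d=30, alpha=0.2, sigma=1.0,
                                        rng=np.random.default_rng(2))
print(int(is_good.sum()), "good batches out of", len(batches))
```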
We note that assumptions 1, 3, and 4 are standard in heavy-tailed linear regression (Cherapanamjeri et al., 2020b; Lecué & Mendelson, 2016). Assumptions 2 and 5, on the other hand, are introduced solely for ease of presentation, and we discuss in Appendix G that these two assumptions can be eliminated without any impact on our results.

## 3. Notation

We use $h_b$ to denote a function over batches.¹ For a function $h_b$, we use $\mathbb{E}_D[h_b]$ and $\mathrm{Cov}_D(h_b)$ to denote the expected value and covariance of $h_b$ for a random batch $b$ of $n$ independent samples from $D$.

Next, we define the expectation and covariance w.r.t. the collection of batches $B$. When batches are chosen uniformly from a sub-collection $B' \subseteq B$, the expected value and covariance of a function $h_b$ are denoted as
$$\mathbb{E}_{B'}[h_b] = \sum_{b \in B'} \frac{1}{|B'|}\, h_b \quad \text{and} \quad \mathrm{Cov}_{B'}(h_b) = \sum_{b \in B'} \frac{1}{|B'|}\, \big(h_b - \mathbb{E}_{B'}[h_b]\big)\big(h_b - \mathbb{E}_{B'}[h_b]\big)^\top,$$
respectively. To allow for more general samplings, the definition is extended to use a weight vector. A weight vector, denoted by $\beta$, is a collection of weights $\beta_b$, one for each batch $b \in B$, such that each $\beta_b$ is between 0 and 1. The total weight of the vector is $\beta^B = \sum_{b \in B} \beta_b$. It can be helpful to think of $\beta$ as a soft cluster of batches, with its components denoting the membership weight of the batches in the cluster. When defining the expectation or covariance of a function w.r.t. a weight vector $\beta$, the probability of sampling a batch $b$ is $\beta_b/\beta^B$. The expectation of a function $h_b$ over batches, when using a weight vector $\beta$, is
$$\mathbb{E}_\beta[h_b] := \sum_{b \in B} \frac{\beta_b}{\beta^B}\, h_b, \quad \text{and the covariance is} \quad \mathrm{Cov}_\beta(h_b) := \sum_{b \in B} \frac{\beta_b}{\beta^B}\, \big(h_b - \mathbb{E}_\beta[h_b]\big)\big(h_b - \mathbb{E}_\beta[h_b]\big)^\top.$$
For a weight vector $\beta$, the weight of all batches in a subset $B' \subseteq B$ is denoted $\beta^{B'} := \sum_{b \in B'} \beta_b$.

We use $f(x) = \tilde O(g(x))$ as a shorthand for $f(x) = O(g(x)\log^k x)$, where $k$ is some integer, and $f(x) = O_y(g(x))$ means that if $y$ is bounded then $f(x) = O(g(x))$. Throughout the paper, we use the notation $c_i$, with $i \ge 1$, to represent universal constants.

¹ With a slight abuse of notation, instead of $h(b)$ we use $h_b$ to denote a function over batches. Note that $h_b$ may be a function of some or all of the samples in the batch $b$.
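The weighted statistics $\mathbb{E}_\beta[h_b]$ and $\mathrm{Cov}_\beta(h_b)$ defined above can be computed directly. The helpers below are a minimal numpy sketch (our own names, not code from the paper); `h` holds one vector-valued batch statistic per row and `beta` is the weight vector.

```python
# Weighted batch expectation E_beta[h_b] and covariance Cov_beta(h_b).
import numpy as np

def weighted_mean(h, beta):
    beta = np.asarray(beta, dtype=float)
    return (beta[:, None] * h).sum(axis=0) / beta.sum()

def weighted_cov(h, beta):
    beta = np.asarray(beta, dtype=float)
    centered = h - weighted_mean(h, beta)
    return (beta[:, None, None] * centered[:, :, None] * centered[:, None, :]).sum(axis=0) / beta.sum()

h = np.random.default_rng(3).normal(size=(100, 5))   # e.g. one 5-dimensional statistic per batch
beta = np.ones(100)                                  # uniform weights recover E_B and Cov_B
print(weighted_mean(h, beta).shape, weighted_cov(h, beta).shape)
```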
## 4. Main Results

Recently there has been significant interest in the problem of list-decodable linear regression. The prior works considered only the non-batch setting. The sample and time complexities of the algorithms in (Karmalkar et al., 2019; Raghavendra & Yau, 2020) are $d^{O(1/\alpha^4)}$ and $d^{O(1/\alpha^8)}$, respectively. (Raghavendra & Yau, 2020) achieves an error $O(\sigma/\alpha^{3/2})$ with a list of size $(1/\alpha)^{O(\log(1/\alpha))}$, and (Karmalkar et al., 2019) obtains an error guarantee $O(\sigma/\alpha)$ with a list of size $O(1/\alpha)$. (Diakonikolas et al., 2021b) improved the sample complexity. For Gaussian noise and covariates distributed according to a standard Gaussian, they gave an information-theoretic algorithm that uses $O(d/\alpha^3)$ samples and estimates $w^*$ to an accuracy $O(\sigma\sqrt{\log(1/\alpha)}/\alpha)$ using a list of size $O(1/\alpha)$. They also showed that no algorithm, even with infinitely many samples, can achieve an error much smaller than $\sigma\sqrt{\log(1/\alpha)}/\alpha$ with a $\mathrm{poly}(1/\alpha)$-size list. As these works considered the non-batch setting, they do not obtain a polynomial time algorithm for this problem, which may in fact be impossible (Diakonikolas et al., 2021b).

Our main result shows that using batches one can achieve a polynomial time algorithm for this setting; moreover, the algorithm requires only $O_{n,\alpha}(d)$ genuine samples.

**Theorem 4.1.** For any $0 < \alpha < 1$, $n \ge \Theta\!\left(\frac{C_3^2 C^2 \log^2(2/\alpha)}{\alpha}\right)$ and $|G| = \Omega_C\!\left(d n^2 \log d\right)$, Algorithm 1 runs in time $\mathrm{poly}(|G|, \alpha, d, n)$ and returns a list $M$ of size at most $4/\alpha^2$ such that, with probability at least $1 - 4/d^2$,
$$\min_{\hat w \in M} \|\hat w - w^*\| \le O\!\left(\frac{C_3\, C \log(2/\alpha)}{\sqrt{n\alpha}}\, \sigma\right).$$

Interestingly, for $n = \Omega(1/\alpha)$, the estimation error of our polynomial time algorithm has a better dependence on $\alpha$ than the best possible error $\sigma\sqrt{\log(1/\alpha)}/\alpha$ achievable by any algorithm (even with infinite resources) in the non-batch setting, i.e. $n = 1$ (Diakonikolas et al., 2021b).

We restate the above result as the following corollary, which for given $\epsilon$, $d$ and $\alpha$ characterizes the number of good batches $|G|$ and the batch size $n$ required by Algorithm 1 to achieve an estimation error $O(\epsilon\sigma)$.

**Corollary 4.2.** For any $0 < \alpha < 1$, $0 \le \epsilon \le 1$, $n_{\min} = \Theta_{C_3,C}\!\left(\frac{\log^2(2/\alpha)}{\alpha \epsilon^2}\right)$, $n \ge n_{\min}$, and $|G| = \Omega_C\!\left(d\, n_{\min}^2 \log d\right)$, Algorithm 1 runs in time $\mathrm{poly}(\alpha, d, \epsilon)$ and returns a list $M$ of size at most $4/\alpha^2$ such that, with probability at least $1 - 4/d^2$, $\min_{\hat w \in M} \|\hat w - w^*\| \le O(\epsilon\sigma)$.

For $\epsilon = \Theta(1)$ in the above corollary, we get $n = \Omega(1/\alpha)$ and $|G| = \Omega_C(d \log(d)/\alpha^2)$.

**Remark 4.1.** As discussed earlier, for the case where a majority of the data is genuine, i.e. $\alpha > 1/2$, polynomial time algorithms have been developed in prior works (Prasad et al., 2018; Diakonikolas et al., 2019b; Cherapanamjeri et al., 2020b) to estimate the regression parameter even in the non-batch setting. Since the majority of the data is genuine, these algorithms can return a single estimate of the regression parameter instead of a list. In particular, the algorithm in (Cherapanamjeri et al., 2020b) requires $O(d/(1-\alpha)^2)$ genuine samples and estimates the regression parameter $w^*$ to an $\ell_2$ distance of $O(C_3\sqrt{1-\alpha}\,\sigma)$ for any $1-\alpha = O(1/C_3^2)$, where $C_3$ is the condition number of the covariance matrix $\Sigma$ of the covariates. A lower bound of $\Omega(\sqrt{1-\alpha}\,\sigma)$ is also known for the non-batch setting. We note that the algorithm in (Cherapanamjeri et al., 2020b) for the case $\alpha > 1/2$ can easily be extended to the batch setting: by using batch gradients instead of sample gradients in their algorithm, the regression parameter $w^*$ can be estimated to a much smaller $\ell_2$ distance of $O(C_3\sqrt{1-\alpha}\,\sigma/\sqrt{n})$.

## 5. Technical Overview

This section presents the main ideas behind our algorithm. For a given batch $b$ from $B$, the square loss of its $i$-th sample at a point $w$ in the parameter space is $f^b_i(w) := (w \cdot x^b_i - y^b_i)^2/2$. If all batches in $B$ had samples generated from $D$, then the minimizer of the average loss across all batches, $\mathbb{E}_B[f^b_i(w)]$, would converge to the optimal solution $w^*$. However, the presence of even a single outlier sample can cause this method to fail. In our setting, a majority of batches may contain outlier samples.

The gradient of the loss function $f^b_i(w)$ is $\nabla f^b_i(w) = (w \cdot x^b_i - y^b_i)\, x^b_i$. For good batches, which have i.i.d. samples from distribution $D$, the expected value of this gradient is $\mathbb{E}_D[\nabla f^b_i(w)] = \Sigma(w - w^*)$. When $|G|$ is sufficiently large, then
$$\mathbb{E}_G[\nabla f^b_i(w)] = \frac{1}{|G|\, n} \sum_{b \in G} \sum_{i \in [n]} \nabla f^b_i(w) \approx \mathbb{E}_D[\nabla f^b_i(w)] = \Sigma\,(w - w^*). \quad (1)$$

Suppose $\hat w$ is a stationary point of the average loss over all samples, i.e. $\mathbb{E}_B[\nabla f^b_i(\hat w)] = 0$. If $\hat w$ is far from $w^*$, then the above equation implies that the mean of the gradients of the good samples will be large. The norm of the covariance of the sample gradients at $\hat w$ will then be at least
$$\big\|\mathrm{Cov}_B[\nabla f^b_i(\hat w)]\big\| \ge \frac{|G|}{|B|}\, \big\|\mathbb{E}_G[\nabla f^b_i(\hat w)] - \mathbb{E}_B[\nabla f^b_i(\hat w)]\big\|^2 \ge \alpha\, \big\|\mathbb{E}_G[\nabla f^b_i(\hat w)]\big\|^2 \overset{(a)}{\approx} \alpha\, \big\|\Sigma(\hat w - w^*)\big\|^2, \quad (2)$$
where $(a)$ follows from Equation (1).
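The first inequality in Equation (2) holds deterministically for any sub-collection $G \subseteq B$ of per-batch statistics. The short check below (our own, on arbitrary synthetic vectors) verifies it numerically.

```python
# Numerical sanity check of the variance lower bound behind Equation (2):
# ||Cov_B(g_b)|| >= (|G|/|B|) * ||E_G[g_b] - E_B[g_b]||^2 for any subset G of B.
import numpy as np

rng = np.random.default_rng(4)
m, k = 200, 8
g = rng.normal(size=(m, k)) + np.where(rng.random(m) < 0.5, 5.0, 0.0)[:, None]  # arbitrary per-batch gradients
G = rng.random(m) < 0.25                                                         # an arbitrary "good" subset

cov_B = np.cov(g, rowvar=False, bias=True)
lhs = np.linalg.norm(cov_B, ord=2)
rhs = (G.sum() / m) * np.linalg.norm(g[G].mean(axis=0) - g.mean(axis=0)) ** 2
print(lhs >= rhs, lhs, rhs)
```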
When the covariance of the good samples is much smaller than the overall covariance of all samples, it is possible to iteratively divide or filter the samples into two (possibly overlapping) clusters such that one of the clusters is cleaner than the original (Steinhardt et al., 2016; Diakonikolas et al., 2020b). Hence, if we had $\|\mathrm{Cov}_B[\nabla f^b_i(\hat w)]\| \gg \|\mathrm{Cov}_G[\nabla f^b_i(\hat w)]\|$, then we could obtain a cleaner version of $B$ with a higher fraction of good batches. For a batch $b \in G$, the norm of the covariance of the gradient of a single sample is $\|\mathrm{Cov}_D[\nabla f^b_i(\hat w)]\| = \Theta(\sigma^2 + \|\Sigma(\hat w - w^*)\|^2)$ (using L4-L2 hypercontractivity). Even if we had $\|\mathrm{Cov}_G[\nabla f^b_i(\hat w)]\| \approx \|\mathrm{Cov}_D[\nabla f^b_i(\hat w)]\|$, this does not guarantee $\|\mathrm{Cov}_B[\nabla f^b_i(\hat w)]\| \gg \|\mathrm{Cov}_G[\nabla f^b_i(\hat w)]\|$, since
$$\alpha\, \|\Sigma(\hat w - w^*)\|^2 \not\gg \sigma^2 + \|\Sigma(\hat w - w^*)\|^2,$$
regardless of how large the distance between the stationary point $w^*$ of the distribution $D$ and the stationary point $\hat w$ of all samples is. We will now see that focusing on batch gradients rather than single-sample gradients alleviates this problem.

### 5.1. How Batches Help

The preceding approach did not leverage the batch structure; in fact, the SQ lower bound in (Diakonikolas et al., 2021b) suggests that a polynomial-time algorithm may be impossible in the non-batch setting. To take advantage of the batch structure, instead of considering the loss function and its gradient for each sample individually, we consider the loss of a batch and the gradient of the batch loss. The loss function of a batch $b$ is $f^b(w) = \frac{1}{n}\sum_{i=1}^n f^b_i(w)$, i.e. the average of the loss over its samples. By the linearity of differentiation, the gradient of the batch loss is $\nabla f^b(w) = \frac{1}{n}\sum_{i=1}^n \nabla f^b_i(w)$. Then, by the linearity of expectation, $\mathbb{E}_G[\nabla f^b_i(w)] = \mathbb{E}_G[\nabla f^b(w)]$ for any $w$. However, averaging over $n$ samples reduces the covariance by a factor of $n$; therefore, $\mathrm{Cov}_D[\nabla f^b(w)] = \mathrm{Cov}_D[\nabla f^b_i(w)]/n$.

For $|G|$ large enough, we will have $\|\mathrm{Cov}_G[\nabla f^b(\hat w)]\| \approx \|\mathrm{Cov}_D[\nabla f^b(\hat w)]\|$. Further, since $\|\mathrm{Cov}_D[\nabla f^b_i(\hat w)]\| = O(\sigma^2 + \|\Sigma(\hat w - w^*)\|^2)$, it follows that
$$\|\mathrm{Cov}_G[\nabla f^b(\hat w)]\| = O\!\left(\frac{\sigma^2 + \|\Sigma(\hat w - w^*)\|^2}{n}\right).$$
If the batch size is $n = \Omega(\log^2(1/\alpha)/\alpha)$ and $\hat w$ is a stationary point of the average loss over all samples in $B$, then for a large value $\|\hat w - w^*\| = \Omega(\sigma\log(1/\alpha)/\sqrt{n\alpha})$ it can be shown that
$$\alpha\, \|\Sigma(\hat w - w^*)\|^2 \gg \log^2\!\big(\tfrac{1}{\alpha}\big)\cdot O\!\left(\frac{\sigma^2 + \|\Sigma(\hat w - w^*)\|^2}{n}\right).$$
Since the expectations of batch and sample gradients are the same over any batch sub-collection, a similar argument as for Equation (2) shows that $\|\mathrm{Cov}_B[\nabla f^b(\hat w)]\| \gtrsim \alpha\, \|\Sigma(\hat w - w^*)\|^2$. Combining this bound with the bound above gives
$$\|\mathrm{Cov}_B[\nabla f^b(\hat w)]\| \gg \log^2(1/\alpha)\, \|\mathrm{Cov}_G[\nabla f^b(\hat w)]\|.$$

Therefore, either the distance between the stationary point of this cluster and $w^*$ is $O(\sigma\log(1/\alpha)/\sqrt{n\alpha})$, or the covariance of the gradients over the set of all batches is much larger than that over the good batches. If it is the former, then we have a good approximation of $w^*$; if it is the latter, we can divide $B$ into two (possibly overlapping) clusters, where at least one of the new clusters contains a majority of the good batches and has a higher proportion of good batches than the initial cluster. The same argument extends from $B$ to any sub-collection of $B$ that retains a major portion of the good batches $G$.

To divide the clusters, we use the MULTIFILTER routine from (Diakonikolas et al., 2020b). Instead of hard clustering, this routine performs soft clustering: it produces a membership or weight vector $\beta$ of length $|B|$, with each entry in $[0, 1]$ denoting the membership weight of the corresponding batch in the cluster.
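The covariance reduction that drives this argument is easy to see numerically. The simulation below (our own, with arbitrary parameters) evaluates gradients at $w = w^*$ on good data only: per-sample gradients have covariance of order $\sigma^2$, while batch gradients have covariance roughly $\sigma^2/n$.

```python
# Why batches help: averaging n gradients within a batch shrinks the covariance by a factor n.
import numpy as np

rng = np.random.default_rng(5)
d, n, num_batches, sigma = 10, 25, 2000, 1.0
w_star = rng.normal(size=d)

sample_grads, batch_grads = [], []
for _ in range(num_batches):
    X = rng.normal(size=(n, d))
    y = X @ w_star + sigma * rng.normal(size=n)
    g = (X @ w_star - y)[:, None] * X          # per-sample gradients at w = w*
    sample_grads.append(g)
    batch_grads.append(g.mean(axis=0))         # batch gradient = average over the batch

sample_grads = np.concatenate(sample_grads)
batch_grads = np.asarray(batch_grads)
print(np.linalg.norm(np.cov(sample_grads, rowvar=False), 2))   # ~ sigma^2
print(np.linalg.norm(np.cov(batch_grads, rowvar=False), 2))    # ~ sigma^2 / n
```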
The above discussion leads to the following algorithm. We begin with the initial cluster of all batches $B$. We keep applying the MULTIFILTER routine iteratively on the clusters (or weight vectors) until, for every cluster, the covariance of the gradients at the stationary point of that cluster becomes small. The MULTIFILTER routine ensures that at least one of the clusters retains a major portion of the good batches and that there are never more than $O(1/\alpha^2)$ clusters at any stage. As discussed, for a cluster that retains a major portion of the good batches $G$, the covariance of the batch gradients is small only if the stationary point $\hat w$ of that cluster approximates $w^*$ to an accuracy of $\|\hat w - w^*\| = O(\sigma\log(1/\alpha)/\sqrt{n\alpha})$. Since the final set of $O(1/\alpha^2)$ clusters includes at least one such cluster, the stationary point of at least one of the clusters approximates $w^*$ to the desired accuracy. However, applying the MULTIFILTER routine for this purpose presents additional challenges, which we address through various technical contributions in the next sections.

### 5.2. Clipping to Improve Sample Complexity

We would like to obtain a high-probability concentration bound of $\|\mathrm{Cov}_G[\nabla f^b(w)]\| \le O\big(\frac{\|w - w^*\|^2 + \sigma^2}{n}\big)$ on the empirical covariance of the batch gradients over the good batches. No such bounds for general $n$ are known in the previous literature, and even for $n = 1$, using known concentration bounds would require a large number of good batches or samples. For example, (Diakonikolas et al., 2019b) needed $d^5$ samples in total, and in fact a minimum requirement of $d^2$ samples can be shown for such a bound to hold. (Cherapanamjeri et al., 2020a) required $O(d)$ samples (for $n$ fixed to 1) for a related bound, but for each point $w$ they need to ignore certain samples in the calculation of the empirical covariance. These samples can be different depending on $w$. While such guarantees sufficed for their application, where a majority of the data was genuine, it is unclear whether they can be extended to the list-decodable setting.

To address these challenges, we use a clipped loss instead. For a clipping parameter $\kappa > 0$ and any batch $b \in B$, the clipped loss of its $i$-th sample at a point $w$ is given by
$$f^b_i(w, \kappa) := \begin{cases} \dfrac{(w \cdot x^b_i - y^b_i)^2}{2} & \text{if } |w \cdot x^b_i - y^b_i| \le \kappa, \\[4pt] \kappa\,|w \cdot x^b_i - y^b_i| - \kappa^2/2 & \text{otherwise.} \end{cases}$$
We specify the choice of the clipping parameter $\kappa$ later. The clipped loss defined above is known as Huber's loss in the literature. The gradient of this clipped loss is
$$\nabla f^b_i(w, \kappa) := \mathrm{sign}(x^b_i \cdot w - y^b_i)\, \min\big\{|x^b_i \cdot w - y^b_i|,\, \kappa\big\}\; x^b_i. \quad (3)$$
We refer to the gradient of the clipped loss above as the clipped gradient. For a batch $b$, its clipped loss is simply the average of the clipped losses over its samples: the clipped loss of a batch $b$ at point $w$ is $f^b(w, \kappa) := \frac{1}{n}\sum_{i \in [n]} f^b_i(w, \kappa)$. By the linearity of gradients, the gradient of the clipped batch loss, or clipped batch gradient, is the average of the clipped gradients of its samples, i.e. $\nabla f^b(w, \kappa) := \frac{1}{n}\sum_{i \in [n]} \nabla f^b_i(w, \kappa)$.

**Ideal choice of the clipping parameter.** When $\kappa \to \infty$, the clipped loss is the same as the squared loss, hence clipping has no effect in reducing the number of samples required. On the other hand, if $\kappa \to 0$, the loss function is overly clipped, which can lead to the expected norm of the clipped gradient being much smaller than that of the unclipped gradient in Equation (1).
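Equation (3) amounts to clipping the residual at $\pm\kappa$ before multiplying by the covariate. The sketch below (plain numpy, our own helper names) computes the clipped sample gradients and the clipped batch gradient, and contrasts them with the unclipped gradient when a batch contains one wild outlier.

```python
# Clipped (Huber) gradients of Equation (3): clip(residual, -kappa, kappa) * x.
import numpy as np

def clipped_sample_grads(X, y, w, kappa):
    """Gradient of the clipped loss for each sample in a batch."""
    residuals = X @ w - y
    return np.clip(residuals, -kappa, kappa)[:, None] * X

def clipped_batch_grad(X, y, w, kappa):
    """Clipped batch gradient: the average of the samples' clipped gradients."""
    return clipped_sample_grads(X, y, w, kappa).mean(axis=0)

rng = np.random.default_rng(6)
X = rng.normal(size=(25, 10))
w = rng.normal(size=10)
y = X @ w + rng.normal(size=25)
y[0] += 100.0                                         # one wild outlier in the batch
print(np.linalg.norm(clipped_batch_grad(X, y, w, kappa=3.0)))                 # stays small
print(np.linalg.norm(clipped_batch_grad(X, y, w, kappa=np.inf)))              # unclipped, dominated by the outlier
```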
Theorem C.2 shows that as long as the clipping parameter is set to $\Omega(\|w - w^*\|) + \Omega_{n\alpha}(\sigma)$, the expected norm of the clipped gradient will be $\Omega(\|w - w^*\|) - O(\sigma/\sqrt{n\alpha})$. This means that for any point $w$ whose distance from $w^*$ is greater than $\Omega(\sigma/\sqrt{n\alpha})$, the expected norm of the clipped gradient at $w$ is $\|\mathbb{E}_D[\nabla f^b_i(w, \kappa)]\| = \Omega(\|w - w^*\|)$, which is of the same order as that of the unclipped gradient in Equation (1). Furthermore, taking advantage of clipping, in Theorem B.1 we show that for all points $w$ and any clipping parameter $\kappa = O(\|w - w^*\|) + O_{n\alpha}(\sigma)$, the covariance of the clipped gradients satisfies $\|\mathrm{Cov}_G[\nabla f^b(w, \kappa)]\| \le O\big(\frac{\|w - w^*\|^2 + \sigma^2}{n}\big)$ with only $O_{n,\alpha}(d)$ samples. As discussed previously, the same bound on the covariance of the unclipped gradients would instead require $\Omega(d^2)$ samples.

From the preceding discussion, in order for both requirements, $\|\mathbb{E}_D[\nabla f^b_i(w, \kappa)]\| = \Omega(\|w - w^*\|)$ and $\|\mathrm{Cov}_G[\nabla f^b(w, \kappa)]\| \le O\big(\frac{\|w - w^*\|^2 + \sigma^2}{n}\big)$, to be met using only $O_{n,\alpha}(d)$ samples, the clipping parameter must be set to $\kappa = \Theta(\|w - w^*\|) + \Theta_{n\alpha}(\sigma)$. This requires obtaining a constant-factor approximation of $\|w - w^*\|$.

Additionally, when using the MULTIFILTER on a cluster, a tight approximation of $\|w - w^*\|$ is necessary to obtain a tight upper bound on $\|\mathrm{Cov}_G[\nabla f^b(w, \kappa)]\|$, which MULTIFILTER requires as an input parameter. Furthermore, recall that when applying the MULTIFILTER to any cluster, we set $w$ to a stationary point of the clipped loss for that cluster. This stationary point $w$ depends on the clipping parameter $\kappa$, and the appropriate range for $\kappa$ depends on $w$, creating a cyclic dependence that we must also overcome when estimating $\|w - w^*\|$.

### 5.3. Estimating Parameters for Multifilter

Recall that our goal is to return a small set of (soft) clusters such that at least one of them retains a major portion of the good batches and its stationary point closely approximates $w^*$. When the MULTIFILTER routine is applied to a cluster, it generates sub-clusters. Hence, the sub-clusters originating from a cluster that has already lost a majority of the good batches are not relevant for us. Therefore, we need accurate parameter estimation only for clusters that have retained a substantial weight of good batches, and we consider only such clusters in the rest of this section.

Let $v_b(w) := \frac{1}{n}\sum_{i \in [n]} |w \cdot x^b_i - y^b_i|$ denote the mean absolute loss of a batch at point $w$. We developed a subroutine called FINDCLIPPINGPARAMETER (Algorithm 2) that overcomes the cyclic dependence to find an appropriate stationary point $w$ and clipping parameter $\kappa$ for a given soft cluster $\beta$. These values ensure that $w$ is a stationary point of the clipped gradients $\nabla f^b(w, \kappa)$ for the batches in the cluster and that $\kappa$ falls in a range determined by the expected absolute loss of the batches in cluster $\beta$ at the stationary point $w$; specifically, $\kappa = \Theta(\mathbb{E}_\beta[v_b(w)]) + \Theta_{n\alpha}(\sigma)$.

For $|G| = \Omega(d)$, we prove that w.h.p.
$$\mathrm{Var}_G(v_b(w)) \le \mathbb{E}_G\big[(v_b(w) - \mathbb{E}_D[v_b(w)])^2\big] = O\!\left(\frac{\sigma^2 + \mathbb{E}_D[v_b(w)]^2}{n}\right).$$
From the above bound, it follows that for most of the good batches, $v_b(w)$ is very close to $\mathbb{E}_D[v_b(w)]$. Further, it can be shown that $\mathbb{E}_D[v_b(w)] = \Theta(\|w - w^*\|) \pm O(\sigma)$, where $\mathbb{E}_D[v_b(w)]$ is the expectation of $v_b(w)$ for a batch sampled from $D$.

We derive a novel way that, given a weight vector $\beta$, estimates an upper bound $\theta_1$ on the variance $\mathrm{Var}_G(v_b(w))$. Further, this upper bound is tight enough to ensure that if $\mathrm{Var}_\beta(v_b(w)) = O(\theta_1)$, then for most batches $b$ in $\beta$, $v_b(w)$ is close to its expectation $\mathbb{E}_\beta[v_b(w)]$ over $\beta$. As the soft cluster contains a significant proportion of good batches, and since $v_b(w)$ for most of these good batches is close to $\mathbb{E}_D[v_b(w)]$, $\mathbb{E}_\beta[v_b(w)]$ is then also close to $\mathbb{E}_D[v_b(w)]$. Furthermore, since $\mathbb{E}_D[v_b(w)] = \Theta(\|w - w^*\|) \pm O(\sigma)$, it follows that $\mathbb{E}_\beta[v_b(w)] = \Theta(\|w - w^*\|) \pm O(\sigma)$. Therefore, if for a certain weight vector $\beta$ the variance $\mathrm{Var}_\beta(v_b(w))$ is close to our estimated variance of the good batches, $\theta_1$, then we can ensure that $\kappa = \Theta(\|w - w^*\|) + \Theta_{n\alpha}(\sigma)$ and use $\mathbb{E}_\beta[v_b(w)]$ as an estimate for $\|w - w^*\|$.
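The quantity $v_b(w)$ and a scale-based choice of $\kappa$ are straightforward to compute. The sketch below is a simplified heuristic in the spirit of the description above, not the paper's FINDCLIPPINGPARAMETER; the constants `c_scale` and `c_sigma` are our own placeholders for the $\Theta(\cdot)$ factors.

```python
# v_b(w) = mean absolute loss per batch, and a heuristic clipping scale
# kappa = Theta(E_beta[v_b(w)]) + Theta(sigma).
import numpy as np

def mean_abs_loss(batches, w):
    """v_b(w) = (1/n) * sum_i |w . x_i - y_i| for every batch b."""
    return np.array([np.mean(np.abs(X @ w - y)) for X, y in batches])

def heuristic_kappa(batches, beta, w, sigma, c_scale=4.0, c_sigma=8.0):
    v = mean_abs_loss(batches, w)
    beta = np.asarray(beta, dtype=float)
    return c_scale * float((beta * v).sum() / beta.sum()) + c_sigma * sigma

rng = np.random.default_rng(7)
batches = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(50)]
beta = np.ones(len(batches))
print(heuristic_kappa(batches, beta, w=np.zeros(5), sigma=1.0))
```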
However, if the variance $\mathrm{Var}_\beta(v_b(w))$ for a weight vector $\beta$ is significantly greater than our estimated variance of the good batches, $\theta_1$, then we do not use the MULTIFILTER routine for gradients on that cluster. Instead, we apply the MULTIFILTER routine to this cluster w.r.t. the average absolute loss $v_b(w)$. As a result, the estimation of $\|w - w^*\|$, and ensuring that $\kappa$ is within the correct range, which is necessary for using the MULTIFILTER routine for gradients, is no longer relevant. Hence, in the estimation part, we either apply the MULTIFILTER routine to the cluster for the average absolute loss to obtain new clusters with one of them being cleaner, or else our estimates of the parameters are in the desired range to apply the MULTIFILTER routine w.r.t. the gradients.

## 6. Algorithm and Proof of Theorem 4.1

In Subsection 6.2, a triplet $(\beta, \kappa, w)$ is defined as nice if it meets certain criteria: $\beta$ retains a substantial amount of weight among the good batches, $\kappa$ falls within a specific range, $w$ is a stationary point of the clipped loss, and the covariance of the gradients of the clipped loss for the cluster $\beta$ is bounded at $w$. For any such triplet, the point $w$ is a good approximation of $w^*$; in Section 5.1 we provided intuition for this claim without the clipping.

To identify a cluster with bounded covariance of the clipped gradients, we require that the covariance of the clipped gradients for $G$ be bounded. And to estimate the correct range of $\kappa$ and an upper bound on the covariance of the clipped gradients for $G$, as described in Section 5.3, we require that the variance of the mean absolute loss be bounded for the set of good batches. We formalize these requirements in the next subsection in the form of two regularity conditions. To specify the range in which the clipping parameter $\kappa$ should be set, we define $\kappa_{\max}$ and $\kappa_{\min}$ in Definition B.1 in the appendix; they are functions of $w$ and other distribution parameters. Finally, in the last two subsections, we describe the algorithm and show that it finds a nice triplet.

### 6.1. Regularity Conditions

The first condition is that for all unit vectors $u$, all vectors $w$, and all $\kappa \le \kappa_{\max}$,
$$\mathbb{E}_G\Big[\big(\nabla f^b(w, \kappa) \cdot u - \mathbb{E}_D[\nabla f^b(w, \kappa) \cdot u]\big)^2\Big] \le U_1, \quad \text{where } U_1 := c_4\, \frac{\sigma^2 + C\, \mathbb{E}_D[|(w - w^*) \cdot x^b_i|^2]}{n}.$$
The second regularity condition is that for all vectors $w$,
$$\mathbb{E}_G\Big[\big(v_b(w) - \mathbb{E}_D[|w \cdot x^b_i - y^b_i|]\big)^2\Big] \le U_2, \quad \text{where } U_2 := c_2\, \frac{\sigma^2 + C\, \mathbb{E}_D[|w \cdot x^b_i - y^b_i|]^2}{n}.$$
We will repeatedly refer to the upper bounds $U_1$ and $U_2$ in the regularity conditions throughout this section. In Section B, we show that even with a minimal number of good batches, $|G| = \Omega_{n,\alpha}(d)$, the two regularity conditions hold w.h.p. As a simple consequence, the first regularity condition implies that
$$\|\mathrm{Cov}_G(\nabla f^b(w, \kappa))\| \le U_1, \quad (5)$$
and similarly, the second regularity condition implies that
$$\mathrm{Var}_G(v_b(w)) \le U_2. \quad (6)$$
We note that the expressions for $U_1$ and $U_2$ simplify to $\Theta\big(\frac{\sigma^2 + \|w - w^*\|^2}{n}\big)$, and the expressions for $\kappa_{\max}$ and $\kappa_{\min}$ simplify to $\Theta(\|w - w^*\| + \sigma)$, if one is not concerned with the dependence on the distribution parameters $C$, $C_3$ and $C_1$.
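As a quick empirical illustration of the scale in the second regularity condition (our own synthetic check, not a proof), the variance of $v_b(w)$ over good batches stays within a constant factor of $(\sigma^2 + \mathbb{E}_D[v_b(w)]^2)/n$:

```python
# Empirical check: over good batches, Var_G(v_b(w)) is of order (sigma^2 + E[v_b(w)]^2) / n.
import numpy as np

rng = np.random.default_rng(8)
d, n, num_good, sigma = 20, 40, 3000, 1.0
w_star = rng.normal(size=d)
w = w_star + 0.5 * rng.normal(size=d)          # an arbitrary query point w

v = []
for _ in range(num_good):
    X = rng.normal(size=(n, d))
    y = X @ w_star + sigma * rng.normal(size=n)
    v.append(np.mean(np.abs(X @ w - y)))       # v_b(w) for one good batch
v = np.asarray(v)
print("Var_G(v_b)              :", v.var())
print("(sigma^2 + E[v_b]^2) / n:", (sigma**2 + v.mean()**2) / n)
```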
### 6.2. Nice Triplet

First, we introduce the notion of a nice weight vector. A weight vector $\beta$ is considered nice if the total weight it assigns to the good batches is at least $\beta^G \ge 3|G|/4$. We refer to a combination of a weight vector $\beta$, a clipping parameter $\kappa$, and an estimate $w$ as a triplet. Next, we introduce the concept of a nice triplet.

**Condition 1.** A triplet $(\beta, \kappa, w)$ is considered nice if

(a) $\beta$ is a nice weight vector, i.e. $\beta^G \ge 3|G|/4$.

(b) The clipping parameter is in the range $\kappa_{\min} \le \kappa \le \kappa_{\max}$.

(c) $w$ is an approximate stationary point, namely the mean clipped gradient for weight vector $\beta$ at $w$ satisfies $\|\mathbb{E}_\beta[\nabla f^b(w, \kappa)]\| \le \log(2/\alpha)\,\sigma/(8\sqrt{n\alpha})$.

(d) The covariance of the clipped gradients over $\beta$ at the stationary point $w$ is at most
$$\|\mathrm{Cov}_\beta(\nabla f^b(w, \kappa))\| \le c_5\, C^2 \log^2\!\big(\tfrac{2}{\alpha}\big)\, \frac{\sigma^2 + \mathbb{E}_D[|(w - w^*) \cdot x^b_i|]^2}{n},$$
where $c_5$ is a positive universal constant.

According to these conditions, a triplet $(\beta, \kappa, w)$ is nice if the weight vector $\beta$ is nice, the clipping parameter $\kappa$ is within the appropriate range, $w$ is an approximate stationary point of the clipped loss for this weight vector, and the covariance of the clipped gradients over the weight vector $\beta$ at this point $w$ is small. As discussed briefly at the beginning of this section, for a triplet satisfying these conditions, $w$ is a good approximation of $w^*$. Theorem C.1 formally shows that for any nice triplet $(\beta, \kappa, w)$ we have $\|w - w^*\| \le O\big(\frac{C_3\, C\, \sigma \log(2/\alpha)}{\sqrt{n\alpha}}\big)$.

Then, to prove Theorem 4.1, it suffices to show that the algorithm returns a small list of triplets such that at least one of them is nice. In the next two subsections, we describe the algorithm and demonstrate that it returns a small list of triplets, at least one of which is nice.

**Algorithm 1** MAINALGORITHM

1: **Input:** Data $\{\{(x^b_i, y^b_i)\}_{i \in [n]}\}_{b \in B}$, $\alpha$, $C$, $\sigma$.
2: For each $b \in B$, $\beta^b_{\mathrm{init}} \leftarrow 1$ and $\beta_{\mathrm{init}} \leftarrow \{\beta^b_{\mathrm{init}}\}_{b \in B}$.
3: List $L \leftarrow \{\beta_{\mathrm{init}}\}$ and $M \leftarrow \emptyset$.
4: **while** $L \ne \emptyset$ **do**
5: Pick any element $\beta$ in $L$ and remove it from $L$.
6: $a_1 = 256\, C_3^2$ and $a_2 = \frac{a_1}{4} + 64$.
7: $(\kappa, w) \leftarrow \mathrm{FINDCLIPPINGPARAMETER}(B, \beta, a_1, a_2, \{\{(x^b_i, y^b_i)\}_{i \in [n]}\}_{b \in B})$.
8: Find a top approximate unit eigenvector $u$ of $\mathrm{Cov}_\beta(\nabla f^b(w, \kappa))$.
9: For each batch $b \in B$, let $v_b \leftarrow \frac{1}{n}\sum_{i \in [n]} |w \cdot x^b_i - y^b_i|$ and $\hat v_b \leftarrow \nabla f^b(w, \kappa) \cdot u$.
10: $\theta_0 \leftarrow \inf\{v : \beta(\{b : v_b \ge v\}) \le \alpha|B|/4\}$; using $\theta_0$, $\mathbb{E}_\beta[v_b]$, $\sigma$, $C$ and $n$, set the variance estimates $\theta_1$ and $\theta_2$ of order $\frac{\sigma^2 + 16\,C^2(\mathbb{E}_\beta[v_b] + \sigma)^2}{n}$. (8)
11: **if** $\mathrm{Var}_\beta(v_b) > c_3 \log^2(2/\alpha)\,\theta_1$ **then**
12: NEWWEIGHTS $\leftarrow$ MULTIFILTER$(B, \alpha, \beta, \{v_b\}, \theta_1)$. {Type-1 use}
13: Append each weight vector $\tilde\beta \in$ NEWWEIGHTS that has total weight $\tilde\beta^B \ge \alpha|B|/2$ to the list $L$.
14: **else if** $\mathrm{Var}_\beta(\hat v_b) > c_3 \log^2(2/\alpha)\,\theta_2$ **then**
15: NEWWEIGHTS $\leftarrow$ MULTIFILTER$(B, \alpha, \beta, \{\hat v_b\}, \theta_2)$. {Type-2 use}
16: Append each weight vector $\tilde\beta \in$ NEWWEIGHTS that has total weight $\tilde\beta^B \ge \alpha|B|/2$ to the list $L$.
17: **else**
18: Append $(\beta, \kappa, w)$ to $M$.
19: **end if**
20: **end while**
21: Return $M$.

### 6.3. Description of the Algorithm

MAINALGORITHM starts with $L = \{\beta_{\mathrm{init}}\}$, where the initial weight vector $\beta_{\mathrm{init}}$ assigns an equal weight of 1 to each batch in $B$. This initial weight vector is nice since $\beta^G_{\mathrm{init}} = |G|$. In each iteration of the while loop, the algorithm selects one of the weight vectors $\beta$ from the list $L$, until the list $L$ is empty. It then uses the subroutine FINDCLIPPINGPARAMETER on this weight vector $\beta$, which returns the value of the clipping parameter $\kappa$ and an approximate stationary point $w$ of the clipped loss. Next, the algorithm uses the MULTIFILTER subroutine on $\beta$. Given a weight vector, a function over batches, and an estimate of the variance of the function over the good batches, this subroutine divides the cluster to produce new clusters, each of which is smaller than the original.
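The statistics that the main loop compares against its thresholds are simple to compute. The following sketch (our own variable names, and the helper is self-contained) produces the two weighted variances corresponding to lines 8-11 of Algorithm 1: the variance of $v_b$ and the variance of the clipped batch gradients projected on the top eigenvector of their weighted covariance.

```python
# Statistics compared against the thresholds ~log^2(2/alpha)*theta_1 and ~log^2(2/alpha)*theta_2.
import numpy as np

def filtering_statistics(batches, beta, w, kappa):
    beta = np.asarray(beta, dtype=float)
    beta = beta / beta.sum()
    # clipped batch gradients and per-batch mean absolute losses
    grads = np.array([(np.clip(X @ w - y, -kappa, kappa)[:, None] * X).mean(axis=0) for X, y in batches])
    v = np.array([np.mean(np.abs(X @ w - y)) for X, y in batches])
    centered = grads - beta @ grads
    cov = centered.T @ (beta[:, None] * centered)        # Cov_beta of the clipped batch gradients
    u = np.linalg.eigh(cov)[1][:, -1]                    # top eigenvector
    v_hat = grads @ u                                    # projections, used in the Type-2 test
    wvar = lambda z: float(beta @ (z - beta @ z) ** 2)   # weighted variance under beta
    return wvar(v), wvar(v_hat)

rng = np.random.default_rng(10)
batches = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(40)]
print(filtering_statistics(batches, np.ones(40), w=np.zeros(5), kappa=5.0))
```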
To apply this subroutine, the algorithm first calculates the parameters $\theta_1$ and $\theta_2$, which are estimates of the upper bounds $U_2$ and $U_1$ in the two regularity conditions at the point $w$. If the variance of the mean absolute loss at $w$ for batches in this weight vector $\beta$ is much larger than the estimate $\theta_1$, namely $\mathrm{Var}_\beta(v_b) \ge c_3 \log^2(2/\alpha)\,\theta_1$, the algorithm applies the MULTIFILTER subroutine to the function $v_b(w)$. This is referred to as a Type-1 use of the subroutine.

If instead the variance of $v_b$ under the weight vector is small, the algorithm defines a new function on batches, $\hat v_b := \nabla f^b(w, \kappa) \cdot u$, where $u$ is a top approximate unit eigenvector of $\mathrm{Cov}_\beta(\nabla f^b(w, \kappa))$ such that $u^\top \mathrm{Cov}_\beta(\nabla f^b(w, \kappa))\, u \ge 0.5\, \|\mathrm{Cov}_\beta(\nabla f^b(w, \kappa))\|$. This function $\hat v_b$ is the projection of the clipped batch gradients along the direction in which the covariance is nearly the highest. From (5), it follows that the variance of this new function $\hat v_b$ over the good batch collection $G$ is bounded by $U_1$. If the variance of $\hat v_b$ under the weight vector $\beta$ is much larger than the estimate $\theta_2$ of $U_1$, namely $\mathrm{Var}_\beta(\hat v_b) \ge c_3 \log^2(2/\alpha)\,\theta_2$, then the algorithm applies the MULTIFILTER subroutine to the function $\hat v_b(w)$. This is referred to as a Type-2 use of the subroutine.

When MULTIFILTER is applied to a weight vector, it returns a list NEWWEIGHTS of weight vectors. MAINALGORITHM appends the weight vectors in NEWWEIGHTS that have total weight more than $\alpha|B|/2$ to the list $L$, and the iteration terminates. The weight vectors with total weight less than $\alpha|B|/2$ are ignored, as they cannot be nice weight vectors and cannot lead to any nice weight vector in future iterations. If the variances of both $v_b$ and $\hat v_b$ are small, then the iteration ends by appending $(\beta, \kappa, w)$ to $M$. Next, we argue that $M$ ends up with at least one nice triplet.
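For intuition only, the snippet below shows a much simpler soft down-weighting step; it is not the paper's MULTIFILTER (which may split a cluster into two possibly overlapping weight vectors with formal guarantees), but it illustrates the general mechanism of shrinking the weight of batches whose statistic deviates most whenever the weighted variance exceeds the good-batch estimate.

```python
# Simplified soft down-weighting illustration (NOT the MULTIFILTER routine).
import numpy as np

def soft_downweight(beta, values, threshold):
    """If Var_beta(values) exceeds threshold, shrink weights in proportion to squared deviation."""
    beta = np.asarray(beta, dtype=float)
    p = beta / beta.sum()
    mu = p @ values
    dev2 = (values - mu) ** 2
    if p @ dev2 <= threshold:
        return beta                                 # variance already small: nothing to filter
    return beta * (1.0 - dev2 / dev2.max())         # batches deviating the most lose the most weight

rng = np.random.default_rng(9)
vals = np.concatenate([rng.normal(0, 1, 80), rng.normal(20, 1, 20)])   # 20 outlier batches
beta = np.ones(100)
for _ in range(5):
    beta = soft_downweight(beta, vals, threshold=2.0)
print(beta[:80].sum(), beta[80:].sum())             # outlier batches lose weight much faster
```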
### 6.4. Finding a Nice Triplet

We first show that a Type-1 application of MULTIFILTER on a nice weight vector $\beta$ only occurs when
$$\mathrm{Var}_\beta(v_b) \ge c_3 \log^2(2/\alpha)\, \mathrm{Var}_G(v_b). \quad (9)$$
Recall that a Type-1 application of MULTIFILTER on $\beta$ takes place when $\mathrm{Var}_\beta(v_b) \ge c_3 \log^2(2/\alpha)\,\theta_1$. From Equation (6), we have $\mathrm{Var}_G(v_b) \le U_2$, and Theorem E.1 shows that for a nice weight vector $\beta$ the parameter $\theta_1$ upper bounds $U_2$. Thus, a Type-1 use of MULTIFILTER on a nice weight vector $\beta$ only takes place when Equation (9) holds.

The subroutine FINDCLIPPINGPARAMETER returns $\kappa$ and $w$ for a given weight vector $\beta$. Theorem D.1 in Appendix D shows that these parameters $w$ and $\kappa$ satisfy:

1. $w$ is an approximate stationary point for $\{f^b(\cdot, \kappa)\}$ w.r.t. the weight vector $\beta$.
2. $2\big(\mathbb{E}_\beta[v_b(w)] - a_2\sigma\big) \le \kappa \le 4a_1^2\big(\mathbb{E}_\beta[v_b(w)] + a_2\sigma\big)$, where $a_1$ and $a_2$ are the input parameters of FINDCLIPPINGPARAMETER.

The first guarantee implies that if a triplet $(\beta, \kappa, w)$ ends up in the set $M$, then it satisfies condition (c) for a nice triplet. Theorem E.4 shows that if Type-1 filtering did not occur for a nice weight vector, then for this weight vector the range of $\kappa$ specified in the second guarantee of the subroutine FINDCLIPPINGPARAMETER is a subset of the desired range $(\kappa_{\min}, \kappa_{\max})$. Specifically, if for a nice weight vector $\beta$, $\mathrm{Var}_\beta(v_b) \le c_3 \log^2(2/\alpha)\,\theta_1$, then $\kappa \in (\kappa_{\min}, \kappa_{\max})$ and
$$U_1 \le \theta_2 \le \frac{c_5}{2c_3}\cdot \frac{C^2\big(\sigma^2 + \mathbb{E}_D[|(w - w^*) \cdot x^b_i|]^2\big)}{n}. \quad (10)$$

Recall that a triplet $(\beta, \kappa, w)$ ends up in $M$ only when $\mathrm{Var}_\beta(v_b) \le c_3 \log^2(2/\alpha)\,\theta_1$ and $\mathrm{Var}_\beta(\hat v_b) \le c_3 \log^2(2/\alpha)\,\theta_2$ are both satisfied. From the above discussion, it follows that if a triplet $(\beta, \kappa, w)$ is in $M$ and $\beta$ is nice, then $\kappa \in (\kappa_{\min}, \kappa_{\max})$ and it satisfies
$$\mathrm{Var}_\beta(\hat v_b) \le \frac{c_5}{2}\, \log^2(2/\alpha)\cdot \frac{C^2\big(\sigma^2 + \mathbb{E}_D[|(w - w^*) \cdot x^b_i|]^2\big)}{n}.$$
From the definition of $\hat v_b$, it follows that $\|\mathrm{Cov}_\beta(\nabla f^b(w, \kappa))\| \le 2\,\mathrm{Var}_\beta(\hat v_b)$. Therefore, for any triplet $(\beta, \kappa, w)$ in $M$ such that $\beta$ is a nice weight vector, conditions (b) and (d) are also satisfied. This means that any such triplet is a nice triplet. Finally, it remains to be shown that $M$ contains at least one triplet with a nice weight vector, which we do next.

Recall that a Type-2 application of MULTIFILTER on a weight vector $\beta$ only takes place when $\mathrm{Var}_\beta(\hat v_b) \ge c_3 \log^2(2/\alpha)\,\theta_2$. For a nice $\beta$, from Equation (10) we have $U_1 \le \theta_2$; from Equation (5), $\|\mathrm{Cov}_G(\nabla f^b(w, \kappa))\| \le U_1$; and from the definition of $\hat v_b$, $\mathrm{Var}_G(\hat v_b) \le \|\mathrm{Cov}_G(\nabla f^b(w, \kappa))\|$. Therefore, $\theta_2 \ge \mathrm{Var}_G(\hat v_b)$, and hence a Type-2 application on a nice weight vector $\beta$ only takes place when
$$\mathrm{Var}_\beta(\hat v_b) \ge c_3 \log^2(2/\alpha)\, \mathrm{Var}_G(\hat v_b). \quad (11)$$

Theorem F.2 in Appendix F states that if Equation (9) holds for all Type-1 uses and Equation (11) holds for all Type-2 uses of the subroutine MULTIFILTER on nice weight vectors, then at least one of the triplets in the final list $M$ includes a nice weight vector. Since we have already shown that these two equations hold, it follows that $M$ will contain a nice triplet. The theorem also shows that the size of $M$ is at most $4/\alpha^2$ and that the total number of iterations of the while loop is at most $O(|B|/\alpha^2)$, implying a small list size and a polynomial runtime for the algorithm.

## 7. Conclusion

In summary, this paper addresses the problem of linear regression in the setting where the data is presented in batches and only a small fraction of the batches contain genuine data. The paper presents a polynomial time algorithm that identifies a small list containing a good approximation of the true regression parameter when the genuine batches have at least $\Omega(1/\alpha)$ samples each. By utilizing the batch structure, the paper introduces the first polynomial-time algorithm for list-decodable linear regression. Additionally, the algorithm requires a number of genuine samples that increases nearly linearly with the dimension of the covariates. The SQ lower bound in (Diakonikolas et al., 2021b) for the non-batch setting suggests that a polynomial time algorithm is impossible with batch size 1, and this paper demonstrates that a batch size of $\Omega(1/\alpha)$ is sufficient to obtain a polynomial time algorithm. This raises the question of the smallest batch size required to obtain a polynomial time algorithm, which is a promising direction for future work.

## References

Acharya, J., Jain, A., Kamath, G., Suresh, A. T., and Zhang, H. Robust estimation for random graphs. In Conference on Learning Theory, pp. 130-166. PMLR, 2022.

Anscombe, F. J. Rejection of outliers. Technometrics, 2(2):123-146, 1960.

Balakrishnan, S., Du, S. S., Li, J., and Singh, A. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pp. 169-212, 2017.

Barreno, M., Nelson, B., Joseph, A. D., and Tygar, J. D. The security of machine learning. Machine Learning, 81(2):121-148, 2010.

Bhatia, K., Jain, P., and Kar, P. Robust regression via hard thresholding. Advances in Neural Information Processing Systems, 28, 2015.

Bhatia, K., Jain, P., Kamalaruban, P., and Kar, P. Consistent robust regression. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pp. 2107-2116, 2017a.

Bhatia, K., Jain, P., Kamalaruban, P., and Kar, P. Consistent robust regression. Advances in Neural Information Processing Systems, 30, 2017b.

Biggio, B., Nelson, B., and Laskov, P. Poisoning attacks against support vector machines.
ar Xiv preprint ar Xiv:1206.6389, 2012. Chaganty, A. T. and Liang, P. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning (ICML), pp. 1040 1048, 2013. Charikar, M., Steinhardt, J., and Valiant, G. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 47 60, 2017. Chen, S., Li, J., and Moitra, A. Learning structured distributions from untrusted batches: Faster and simpler. Advances in Neural Information Processing Systems, 33: 4512 4523, 2020a. Chen, S., Li, J., and Song, Z. Learning mixtures of linear regressions in subexponential time via Fourier moments. In STOC. https://arxiv.org/pdf/1912. 07629.pdf, 2020b. Chen, Y. and Poor, H. V. Learning mixtures of linear dynamical systems. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 3507 3557. PMLR, 17 23 Jul 2022. Cheng, Y., Diakonikolas, I., Ge, R., and Woodruff, D. P. Faster algorithms for high-dimensional robust covariance estimation. In Conference on Learning Theory, pp. 727 757. PMLR, 2019. Cherapanamjeri, Y., Aras, E., Tripuraneni, N., Jordan, M. I., Flammarion, N., and Bartlett, P. L. Optimal robust linear regression in nearly linear time. ar Xiv preprint ar Xiv:2007.08137, 2020a. Cherapanamjeri, Y., Aras, E., Tripuraneni, N., Jordan, M. I., Flammarion, N., and Bartlett, P. L. Optimal robust linear regression in nearly linear time. ar Xiv preprint ar Xiv:2007.08137, 2020b. Cherapanamjeri, Y., Mohanty, S., and Yau, M. List decodable mean estimation in nearly linear time. ar Xiv preprint ar Xiv:2005.09796, 2020c. Dalalyan, A. and Thompson, P. Outlier-robust estimation of a sparse linear model using ℓ1 -penalized huber s mestimator. Advances in neural information processing systems, 32, 2019. Efficient List-Decodable Regression using Batches Diakonikolas, I. and Kane, D. M. Small covers for near-zero sets of polynomials and learning latent variable models. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pp. 184 195. IEEE, 2020. Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pp. 999 1008. PMLR, 2017. Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Being Robust (in High Dimensions) Can Be Practical. ar Xiv e-prints, art. ar Xiv:1703.00893, March 2017. Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2683 2702. SIAM, 2018a. Diakonikolas, I., Kane, D. M., and Stewart, A. Listdecodable robust mean estimation and learning mixtures of spherical gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1047 1060, 2018b. Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A., and Stewart, A. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742 864, 2019a. Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Steinhardt, J., and Stewart, A. Sever: A robust meta-algorithm for stochastic optimization. In Proceedings of the 36th International Conference on Machine Learning, ICML 19, pp. 1596 1606. 
JMLR, Inc., 2019b. Diakonikolas, I., Kong, W., and Stewart, A. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2745 2754. SIAM, 2019c. Diakonikolas, I., Hopkins, S. B., Kane, D., and Karmalkar, S. Robustly learning any clusterable mixture of gaussians. ar Xiv preprint ar Xiv:2005.06417, 2020a. Diakonikolas, I., Kane, D., and Kongsgaard, D. Listdecodable mean estimation via iterative multi-filtering. Advances in Neural Information Processing Systems, 33: 9312 9323, 2020b. Diakonikolas, I., Kane, D., Kongsgaard, D., Li, J., and Tian, K. List-decodable mean estimation in nearly-pca time. Advances in Neural Information Processing Systems, 34: 10195 10208, 2021a. Diakonikolas, I., Kane, D., Pensia, A., Pittas, T., and Stewart, A. Statistical query lower bounds for list-decodable linear regression. Advances in Neural Information Processing Systems, 34:3191 3204, 2021b. Dielman, T. E. Applied regression analysis for business and economics. Duxbury/Thomson Learning Pacific Grove, CA, 2001. Dong, Y., Hopkins, S., and Li, J. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. Advances in Neural Information Processing Systems, 32, 2019. Gao, C. Robust regression via mutivariate regression depth. Bernoulli, 26(2):1139 1170, 2020. Grottke, M., Knoll, J., and Groß, R. How the distribution of the number of items rated per user influences the quality of recommendations. In 2015 15th International Conference on Innovations for Community Services (I4CS), pp. 1 8. IEEE, 2015. Hopkins, S., Li, J., and Zhang, F. Robust and heavy-tailed mean estimation made simple, via regret minimization. Advances in Neural Information Processing Systems, 33, 2020a. Hopkins, S. B. and Li, J. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1021 1034, 2018. Hopkins, S. B. et al. Mean estimation with sub-gaussian rates in polynomial time. Annals of Statistics, 48(2): 1193 1213, 2020b. Huber, P. J. Robust estimation of a location parameter. Annals Mathematics Statistics, 35, 1964. Huber, P. J. Robust statistics. In International encyclopedia of statistical science, pp. 1248 1251. Springer, 2011. Jain, A. and Orlitsky, A. Optimal robust learning of discrete distributions from batches. In Proceedings of the 37th International Conference on Machine Learning, ICML 20, pp. 4651 4660. JMLR, Inc., 2020a. Jain, A. and Orlitsky, A. A general method for robust learning from batches. ar Xiv preprint ar Xiv:2002.11099, 2020b. Jain, A. and Orlitsky, A. Robust density estimation from batches: The best things in life are (nearly) free. In International Conference on Machine Learning, pp. 4698 4708. PMLR, 2021. Efficient List-Decodable Regression using Batches Jambulapati, A., Li, J., and Tian, K. Robust sub-gaussian principal component analysis and width-independent schatten packing. Advances in Neural Information Processing Systems, 33, 2020. Jambulapati, A., Li, J., Schramm, T., and Tian, K. Robust regression revisited: Acceleration and improved estimation rates. Advances in Neural Information Processing Systems, 34:4475 4488, 2021. Jia, H. and Vempala, S. Robustly clustering a mixture of gaussians. ar Xiv preprint ar Xiv:1911.11838, 2019. Jordan, M. I. and Jacobs, R. A. Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2): 181 214, 1994. Karmalkar, S. and Price, E. 
Compressed sensing with adversarial sparse noise via l1 regression. In 2nd Symposium on Simplicity in Algorithms, 2019. Karmalkar, S., Klivans, A., and Kothari, P. List-decodable linear regression. Advances in neural information processing systems, 32, 2019. Klivans, A., Kothari, P. K., and Meka, R. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, pp. 1420 1430. PMLR, 2018. Kong, W., Somani, R., Kakade, S., and Oh, S. Robust metalearning for mixed linear regression with small batches. Advances in Neural Information Processing Systems, 33, 2020a. Kong, W., Somani, R., Song, Z., Kakade, S., and Oh, S. Meta-learning for mixed linear regression. In International Conference on Machine Learning, pp. 5394 5404. PMLR, 2020b. Kong, W., Sen, R., Awasthi, P., and Das, A. Trimmed maximum likelihood estimation for robust learning in generalized linear models. ar Xiv preprint ar Xiv:2206.04777, 2022. Konstantinov, N., Frantar, E., Alistarh, D., and Lampert, C. On the sample complexity of adversarial multi-source pac learning. In International Conference on Machine Learning, pp. 5416 5425. PMLR, 2020. Kothari, P. K., Steinhardt, J., and Steurer, D. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1035 1046, 2018. Lai, K. A., Rao, A. B., and Vempala, S. Agnostic estimation of mean and covariance. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS 16, pp. 665 674, Washington, DC, USA, 2016. IEEE Computer Society. Lecu e, G. and Mendelson, S. Performance of empirical risk minimization in linear aggregation. 2016. Li, J. and Ye, G. Robust gaussian covariance estimation in nearly-matrix multiplication time. Advances in Neural Information Processing Systems, 33, 2020. Li, Y. and Liang, Y. Learning mixtures of linear regressions with nearly optimal complexity. In COLT. ar Xiv preprint ar Xiv:1802.07895, 2018. Liu, L., Shen, Y., Li, T., and Caramanis, C. High dimensional robust sparse regression. ar Xiv preprint ar Xiv:1805.11643, 2018. Mc Donald, J. H. Handbook of biological statistics, volume 2. sparky house publishing Baltimore, MD, 2009. Mukhoty, B., Gopakumar, G., Jain, P., and Kar, P. Globallyconvergent iteratively reweighted least squares for robust regression problems. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 313 322, 2019. Pal, S., Mazumdar, A., Sen, R., and Ghosh, A. On learning mixture of linear regressions in the non-realizable setting. In International Conference on Machine Learning, pp. 17202 17220. PMLR, 2022. Park, Y.-J. and Tuzhilin, A. The long tail of recommender systems and how to leverage it. In Proceedings of the 2008 ACM conference on Recommender systems, pp. 11 18, 2008. Paschou, P., Lewis, J., Javed, A., and Drineas, P. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics, 47(12):835 847, 2010. Pensia, A., Jog, V., and Loh, P.-L. Robust regression with covariate filtering: Heavy tails and adversarial contamination. ar Xiv preprint ar Xiv:2009.12976, 2020. Philippe, R. 18.s997 high-dimensional statistics. Massachusetts Institute of Technology: MIT Open Course Ware, https://ocw.mit.edu. License: Creative Commons BY-NC-SA, 2015. Prasad, A., Suggala, A. S., Balakrishnan, S., and Ravikumar, P. Robust estimation via robust gradient estimation. ar Xiv preprint ar Xiv:1802.06485, 2018. Qiao, M. and Valiant, G. 
Learning discrete distributions from untrusted batches. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS 18, pp. 47:1 47:20, Dagstuhl, Germany, 2018. Schloss Dagstuhl Leibniz-Zentrum fuer Informatik. Efficient List-Decodable Regression using Batches Raghavendra, P. and Yau, M. List decodable learning via sum of squares. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 161 180. SIAM, 2020. Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A., and Feldman, M. W. Genetic structure of human populations. science, 298 (5602):2381 2385, 2002. Rousseeuw, P. J. Tutorial to robust statistics. Journal of chemometrics, 5(1):1 20, 1991. Sedghi, H., Janzamin, M., and Anandkumar, A. Provable tensor methods for learning mixtures of generalized linear models. In Artificial Intelligence and Statistics (AISTATS), pp. 1223 1231, 2016. Steinhardt, J., Valiant, G., and Charikar, M. Avoiding imposters and delinquents: Adversarial crowdsourcing and peer prediction. Advances in Neural Information Processing Systems, 29, 2016. Suggala, A. S., Bhatia, K., Ravikumar, P., and Jain, P. Adaptive hard thresholding for near-optimal consistent robust regression. In Conference on Learning Theory, pp. 2892 2897. PMLR, 2019. Tukey, J. W. A survey of sampling from contaminated distributions. Contributions to probability and statistics, pp. 448 485, 1960. Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. In Eldar, Y. C. and Kutyniok, G. (eds.), Compressed Sensing, pp. 210 268. Cambridge University Press, 2012. Wang, J., Charles, Z., Xu, Z., Joshi, G., Mc Mahan, H. B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al. A field guide to federated optimization. ar Xiv preprint ar Xiv:2107.06917, 2021. Wax, M. and Ziskind, I. On unique localization of multiple sources by passive sensor arrays. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(7):996 1000, 1989. Yi, X., Caramanis, C., and Sanghavi, S. Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. ar Xiv preprint ar Xiv:1608.05749, 2016. Zhong, K., Jain, P., and Dhillon, I. S. Mixed linear regression with multiple components. In Advances in neural information processing systems (NIPS), pp. 2190 2198, 2016. Efficient List-Decodable Regression using Batches A. Related Work Robust Estimation and Regression. Designing estimators which are robust under the presence of outliers has been broadly studied since 1960s (Tukey, 1960; Anscombe, 1960; Huber, 1964). However, most prior works either requires exponential time or have a dimension dependency on the error rate, even for basic problems such as mean estimation. Recently, (Diakonikolas et al., 2019a) proposed a filter-based algorithm for mean estimation which achieves polynomial time and has no dependency on the dimensionality in the estimation error. 
There has been a flurry of research on robust estimation problems, including mean estimation (Lai et al., 2016; Diakonikolas et al., 2017; Dong et al., 2019; Hopkins et al., 2020a;b; Diakonikolas et al., 2018a), covariance estimation (Cheng et al., 2019; Li & Ye, 2020), linear regression and sparse regression (Bhatia et al., 2015; 2017a; Balakrishnan et al., 2017; Gao, 2020; Prasad et al., 2018; Klivans et al., 2018; Diakonikolas et al., 2019b; Liu et al., 2018; Karmalkar & Price, 2019; Dalalyan & Thompson, 2019; Mukhoty et al., 2019; Diakonikolas et al., 2019c; Karmalkar et al., 2019; Pensia et al., 2020; Cherapanamjeri et al., 2020b), principal component analysis (Kong et al., 2020a; Jambulapati et al., 2020), mixture models (Diakonikolas et al., 2020a; Jia & Vempala, 2019; Kothari et al., 2018; Hopkins & Li, 2018). The results on robust linear regression are particularly related to the setting of this work, though those papers considered non-batch settings and the fraction of good examples α > 1/2. (Prasad et al., 2018; Diakonikolas et al., 2019b;c; Pensia et al., 2020; Cherapanamjeri et al., 2020b; Jambulapati et al., 2021) considered the setting when both both covariate xi and label yi are corrupted. When there are only label corruptions, (Bhatia et al., 2015; Dalalyan & Thompson, 2019; Kong et al., 2022) achieve nearly optimal rates with O(d) samples. Under the oblivious label corruption model, i.e., the adversary only corrupts a fraction of labels in complete ignorance of the data, (Bhatia et al., 2017b; Suggala et al., 2019) provide a consistent estimator whose approximate error goes to zero as the sample size goes to infinity. Robust Learning from Batches. (Qiao & Valiant, 2018) introduced the problem of learning discrete distribution from untrusted batches and derived an exponential time algorithm. Subsequent works (Chen et al., 2020b) improved the run-time to quasi-polynomial and (Jain & Orlitsky, 2020a) obtained polynomial time with an optimal sample complexity. (Jain & Orlitsky, 2021; Chen et al., 2020a) extended these results to one-dimensional structured distributions. (Jain & Orlitsky, 2020b; Konstantinov et al., 2020) studied the problem of classification from untrusted batches. (Acharya et al., 2022) studies a closely related problem of learning parameter of Erd os-R enyi random graph when a fraction of nodes are corrupt. All these works focus on different problems than ours and only consider the case when a majority of the data is genuine. List Decodable Mean Estimation and Regression. List decodable framework was first introduced in (Charikar et al., 2017) to obtain learning guarantees when a majority of data is corrupt. They derived the first polynomial algorithm for list decodable mean estimation under co-variance bound. Subsequent works (Diakonikolas et al., 2020b; Cherapanamjeri et al., 2020c; Diakonikolas et al., 2021a) obtained a better run time. (Diakonikolas et al., 2018b; Kothari et al., 2018) improved the error guarantees, however, under stronger distributional assumptions and has higher sample and time complexities. (Karmalkar et al., 2019) studies the problem of list-decodable linear regression with batch-size n = 1 and derive an algorithm with sample complexity (d/α)O(1/α4) and runtime (d/α)O(1/α8). (Raghavendra & Yau, 2020) show a sample complexity of (d/α)O(1/α4) with runtime (d/α)O(1/α8)(1/α)log(1/α). Polynomial time might indeed be impossible for the single sample setting owing to the statistical query lower bounds in (Diakonikolas et al., 2021b). 
Mixed Linear Regression. When each batch contains only one sample (i.e., n = 1) drawn from one of the k regression components, the problem becomes classical mixed linear regression, which has been widely studied (Diakonikolas & Kane, 2020; Chen et al., 2020b; Li & Liang, 2018; Sedghi et al., 2016; Zhong et al., 2016; Yi et al., 2016; Chaganty & Liang, 2013). It is worth noting that no algorithm is known to achieve polynomial sample complexity in this setting. The problem has been studied only recently in the batched setting with n > 1 by (Kong et al., 2020b;a), where all samples in a batch come from the same component. (Kong et al., 2020b) proposed a polynomial-time algorithm that requires O(d) batches, each of size O(√k). (Kong et al., 2020a) leveraged the sum-of-squares hierarchy to introduce a class of algorithms that trades off the batch size n against the sample complexity. Both of these works assume that the covariate distributions of all components are identical and Gaussian. Since this problem is a special case of list-decodable linear regression, our algorithm recovers the k regression components with batch size n = O(k) and O(d) batches. Our algorithms allow more general covariate distributions than the Gaussian assumption in the previous works, and further allow the covariate distributions of the different components to differ. It is worth noting that list-decodable linear regression is a strictly harder problem than mixed linear regression, as shown in (Diakonikolas et al., 2021b), and thus our result is incomparable to those in the mixed linear regression setting. Learning mixtures of linear dynamical systems has been studied in (Chen & Poor, 2022).

B. Regularity conditions

In this section, we state the regularity conditions on the genuine data that are used in proving the guarantees of our algorithm. Before we proceed, we define upper and lower bounds on the clipping parameter κ that are functions of w and other distribution parameters.

Definition B.1. We define the following upper and lower bounds on the clipping parameter κ as a function of w and other distribution parameters:

κmax = c7 C² (√(E_D[|x^b_i · (w − w*)|²]) + σ),   κmin = max{ 8 √(C E_D[|x^b_i · (w − w*)|²]), 8σ }.

κmax will be used in this section to define our first regularity condition, while κmin will be used in Section C for defining a nice triplet.

Regularity Conditions.
1. For all κ ≤ κmax, all unit vectors u, and all vectors w,
E_G[(∇f^b(w, κ) · u − E_D[∇f^b(w, κ) · u])²] ≤ c4 (σ² + C E_D[((w − w*) · x^b_i)²]) / n.
2. For all vectors w,
E_G[((1/n) Σ_{i∈[n]} |w · x^b_i − y^b_i| − E_D[|w · x^b_i − y^b_i|])²] ≤ c2 (σ² + C E_D[|w · x^b_i − y^b_i|]²) / n.

The first regularity condition on the set of good batches G bounds the mean squared deviation of the projections of the clipped batch gradients from their population mean. It requires the clipping parameter κ to be upper bounded, with the upper bound depending on w − w* and σ. As discussed in Section 5, when κ → ∞ the clipping has no effect, and establishing such a regularity condition for unclipped gradients would require Ω(d²) samples. By using clipping, and ensuring that the clipping parameter κ lies in the desired range, we are able to achieve an On,α(d) sample complexity. Theorem B.1 characterizes the number of good batches required for regularity condition 1 as a function of the upper bound on κ. Theorem B.1.
There exists a universal constant c4 such that for µmax ∈ [1, d⁴n²/√C] and |G| = Ω(µ⁴max n² d log(d)), with probability at least 1 − 4/d², for all unit vectors u, all vectors w, and all κ² ≤ µmax (σ² + C E_D[((w − w*) · x^b_i)²]),

E_G[(∇f^b(w, κ) · u − E_D[∇f^b(w, κ) · u])²] ≤ c4 (σ² + C E_D[((w − w*) · x^b_i)²]) / n. (12)

We prove the above theorem in Section H. The second regularity condition on the set of good batches G bounds the mean squared deviation of the average absolute error of a batch from its population mean. Theorem B.2 characterizes the number of good batches required for regularity condition 2.

Theorem B.2. For |G| = Ω(n²d log(d)) and a universal constant c2 > 0, with probability at least 1 − 4/d², for all vectors w,

E_G[((1/n) Σ_{i∈[n]} |w · x^b_i − y^b_i| − E_D[|w · x^b_i − y^b_i|])²] ≤ c2 (σ² + C E_D[|w · x^b_i − y^b_i|²]) / n.

Proof. The proof of the above theorem is similar to the proof of Theorem B.1, and for brevity, we skip it.

Combining the two theorems shows that the two regularity conditions hold with high probability with On,α(d) batches.

Corollary B.3. For |G| ≥ Ω_C(dn² log(d)), both regularity conditions hold with probability at least 1 − 8/d².

We conclude the section with the following lemma, which lists some simple consequences of the regularity conditions that we use in later sections.

Lemma B.4. If the regularity conditions hold, then
1. For all vectors w and for all κ ≤ κmax, ‖Cov_G(∇f^b(w, κ))‖ ≤ c4 (σ² + C E_D[((w − w*) · x^b_i)²]) / n.
2. For all vectors w, E_G[((1/n) Σ_{i∈[n]} |w · x^b_i − y^b_i| − E_D[|w · x^b_i − y^b_i|])²] ≤ c2 (σ² + C E_D[|w · x^b_i − y^b_i|]²) / n.
3. For all G' ⊆ G of size ≥ |G|/2, ‖E_G'[∇f^b(w, κ)] − E_D[∇f^b(w, κ)]‖ ≤ √(2 c4 (σ² + C E_D[((w − w*) · x^b_i)²]) / n).

Proof. The first item in the lemma follows as
‖Cov_G(∇f^b(w, κ))‖ = max_{u: ‖u‖ ≤ 1} E_G[(∇f^b(w, κ) · u − E_G[∇f^b(w, κ) · u])²] ≤ max_{u: ‖u‖ ≤ 1} E_G[(∇f^b(w, κ) · u − E_D[∇f^b(w, κ) · u])²] ≤ c4 (σ² + C E_D[((w − w*) · x^b_i)²]) / n,
where the first inequality follows because the expected squared deviation about the mean is the smallest, and the second inequality follows from the first regularity condition. Similarly, the second item follows from the second regularity condition. Finally, we prove the last item using the first regularity condition. Let u be any unit vector and Z^b(u) := (∇f^b(w, κ) · u − E_D[∇f^b(w, κ) · u])². Then
E_G'[Z^b(u)] ≤ (|G| / |G'|) E_G[Z^b(u)] ≤ 2 E_G[Z^b(u)],
where the first inequality uses the fact that Z^b(u) is nonnegative and the second inequality uses |G'| ≥ |G|/2. Then, using the bound on E_G[Z^b(u)] in the first regularity condition, we get
E_G'[Z^b(u)] ≤ 2 c4 (σ² + C E_D[((w − w*) · x^b_i)²]) / n.
Using the Cauchy-Schwarz inequality and the above bound,
E_G'[|∇f^b(w, κ) · u − E_D[∇f^b(w, κ) · u]|] ≤ √(E_G'[Z^b(u)]) ≤ √(2 c4 (σ² + C E_D[((w − w*) · x^b_i)²]) / n).
Since the above bound holds for every unit vector u, we have
‖E_G'[∇f^b(w, κ)] − E_D[∇f^b(w, κ)]‖ ≤ √(2 c4 (σ² + C E_D[((w − w*) · x^b_i)²]) / n).

C. Guarantees for nice triplet

For completeness, we first restate the conditions that a nice triplet (β, κ, w) satisfies. A triplet (β, κ, w) is nice if
(a) β is a nice weight vector, i.e. β^G ≥ 3|G|/4,
(b) κmin ≤ κ ≤ κmax,
(c) w is an approximate stationary point w.r.t. β for the clipped loss with clipping parameter κ, namely ‖E_β[∇f^b(w, κ)]‖ ≤ log(2/α) σ / √(nα),
(d) ‖Cov_β(∇f^b(w, κ))‖ ≤ c5 C² log²(2/α) (σ² + E_D[|(w − w*) · x^b_i|]²) / n.

In this section, we establish the following guarantee for any nice triplet. In doing so, we assume the regularity conditions hold for G.

Theorem C.1. Suppose (β, κ, w) is a nice triplet, n ≥ max{ 32 c4 C C3², (256/α) c5 C² C3² log²(2/α) }, and the regularity conditions hold. Then ‖w − w*‖ ≤ O(C3 C σ log(2/α) / √(nα)).

In the remainder of this section, we prove the theorem.
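Before turning to the proof, and as a reference for the quantities appearing in conditions (c) and (d) and throughout the following sections, here is a minimal NumPy sketch of the clipped sample gradient, the clipped batch gradient, and the β-weighted mean and covariance of the batch gradients. The sketch is not taken from the paper: the array layout, helper names, and the normalization of the weighted covariance by the total weight are illustrative assumptions; only the form of the clipped gradient (the residual clipped at ±κ times the covariate) follows the clipped-loss definition used in the appendix.

```python
import numpy as np

def clipped_sample_grad(w, x, y, kappa):
    # Gradient of the clipped (Huber-style) loss at one sample:
    # the residual x.w - y, clipped to [-kappa, kappa], times the covariate x.
    r = x @ w - y
    return np.clip(r, -kappa, kappa) * x

def clipped_batch_grad(w, X_b, y_b, kappa):
    # Clipped batch gradient: average of the clipped sample gradients in batch b.
    R = np.clip(X_b @ w - y_b, -kappa, kappa)      # clipped residuals, shape (n,)
    return (R[:, None] * X_b).mean(axis=0)

def weighted_grad_stats(grads, beta):
    # beta-weighted mean of the batch gradients and the spectral norm of their
    # beta-weighted covariance (normalization by the total weight is an assumption).
    beta = np.asarray(beta, dtype=float)
    G = np.stack(grads)                            # shape (num_batches, d)
    mu = beta @ G / beta.sum()
    centered = G - mu
    cov = (centered * beta[:, None]).T @ centered / beta.sum()
    return mu, np.linalg.norm(cov, 2)
```

Condition (c) then corresponds to the norm of the returned weighted mean being small, and condition (d) to the returned spectral norm being small, for the batches B weighted by β.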
First, we provide an overview of the proof and state some auxiliary lemma that we use to prove the theorem. In this section, we show that for any nice triplet (β, κ, w) if w w = Ω(σ/ nα) then the following lower bound on clipped gradient co-variance, Covβ( f b(w, κ)) Ω(α w w 2) holds. For n = Ω( 1 α) and w w = Ω(σ/ nα) this lower bound contradicts the upper bound in condition (d). Hence, the theorem concludes that w w = O(σ/ nα). To show the lower bound Covβ( f b(w, κ)) Ω(α w w 2), we first show E D[ f b i (w, κ)] = Ω( w w ) O(σ/ nα) in Theorem C.2. Since E D[ f b i (w, κ)] = E D[ f b(w, κ)] , the same bound will hold for the norm of expectation of clipped batch gradients. When clipping parameter κ then f b i (w, κ) = f b i (w) and for unclipped gradients, a straightforward calculation shows the desired lower bound E D[ f b i (w, κ)] = Ω( w w ). However, if κ is too small then clipping may introduce a large bias in the gradients and such a lower bound may no longer hold. Yet, the lower bound on κ in condition (b) ensures that κ is much larger than the typical error which is of the order w w + σ. And when clipping parameter κ is much larger than the typical error, it can be shown that with high probability clipped and unclipped gradients for a random sample from D would be the same. The next theorem uses this observation and for the case when κ satisfies the lower bound in condition (b) it shows the desired lower bound on the norm of expectation of clipped gradient. Theorem C.2. If κ max{8 p C E D[|xb i (w w )|2], 8σ}, then E D[ f b i (w, κ)] 3 4C3 w w . We prove the above theorem in subsection C.1 Since E D[ f b i (w, κ)] = E D[ f b(w, κ)], the same bound holds for the clipped batch gradients. Next, in Lemma C.3 we show that for any sufficiently large collection G G of the good batches E G [ f b(w, κ)] E D[ f b(w, κ)] . Lemma C.3. Suppose κ and w are part of a nice triplet, n 32c4CC3 and regularity conditions holds, then for all G G of size |G|/2, E G [ f b(w, κ)] 1 2C3 w w 2c4σ n . Efficient List-Decodable Regression using Batches Proof. From item 3 in Lemma B.4, E G [ f b(w, κ)] E D[ f b(w, κ)] C E D[((w w ) xb i)2] n 2c4C w w 2 Σ n 2c4σ n + w w 2c4C n . Using n 32c4CC3 2, E G [ f b(w, κ)] E D[ f b(w, κ)] 1 4C3 w w + 2c4σ n . From Theorem C.2, and the observation E D[ f b i (w, κ)] = E D[ f b(w, κ)], we get E D[ f b(w, κ)] 3 4C3 w w . The lemma follows by combining the above equation using triangle inequality. Next, the general bound on the co-variance will be useful in proving Theorem C.1. Lemma C.4. For any weight vector β, any set of vectors zb associated with batches, and any sub-collection of vectors B {b B : βb 1/2}, Covβ(zb) |B | 2|B| E β[zb] E B [zb] 2. The proof of the lemma appears in Section C.2. In Theorem C.1 we show that since βG 3/4|G|, we can find a sub-collection G of size |G|/2 such that for each b G , its weight βb 1/2. The we use the previous results for B = G and z = f b(w, κ) to get, Covβ( f b(w, κ)) |G | 4|B| Eβ[ f b(w, κ)] EG [ f b(w, κ)] |G| 8|B| Eβ[ f b(w, κ)] EG [ f b(w, κ)] 8 Eβ[ f b(w, κ)] EG [ f b(w, κ)] 2. From condition (c) of nice triplets we have Eβ[ f b(w, κ)] 0 and from Lemma C.3 we have EG [ f b(w, κ)] w w . Then from Lemma C.4, we get an upper bound Covβ( f b(w, κ)) α w w 2. As discussed before, combining this lower bound with the upper bound in condition (d), the theorem concludes w w = O(σ/ nα). Next, we formally prove Theorem C.1 using the above auxiliary lemmas and theorems. Proof of Theorem C.1. Let G := {b G : βb 1/2}. 
Next, we show that |G | |G|/2. To prove it by contradiction assume the contrary that |G | < |G|/2. Then b G\G βb + X b G βb (a) X b G 1 |G \ G | 2 + |G | = |G| |G | 2 + |G | < 3|G|/4, here (a) follows as the definition of G implies that for any b / G , βb < 1/2 and for all batches βb 1. Above is a contradiction, as we assumed in the Theorem that βG 3|G|/4. Applying Lemma C.4 for B = G and zb = f b(w, κ) we have Covβ( f b(w, κ)) G 2|B| E G [ f b(w, κ)] E β[ f b(w, κ)] 2 4|B| E G [ f b(w, κ)] E β[ f b(w, κ)] 2 4 E G [ f b(w, κ)] E β[ f b(w, κ)] 2. (13) Efficient List-Decodable Regression using Batches In the above equation, using the bound in Lemma C.3 and bound on E β[ f b(w, κ)] in condition (c) for nice triplet we get, Covβ( f b(w, κ)) α max 0, 1 2C3 w w 2c4σ n log(2/α)σ We show that when w w O( C3Cσ log(2/α) nα ), the above upper bound contradicts the following lower bound in condition (d), Covβ( f b(w, κ)) c5C2 log2(2/α)(σ2 + E D[|(w w ) xb i|]2) n c5C2 log2(2/α)(σ2 + w w 2) To prove the contradiction assume 8C3 > max log(2/α)σ 8 nα , 2c4σ n , 2 c5Cσ log(2/α) nα Using this lower bound on w w , we lower bound the co-variance. Combining the above lower bound on w w and equation (13), we get, Covβ( f b(w, κ)) α 2 c5Cσ log(2/α) nα + 1 8C3 w w 2 2 c5Cσ log(2/α) nα c5C2 log2(2/α)σ2 c5C2 log2(2/α)σ2 n + c5C2 log2(2/α) w w 2 here the last step used n 256 α c5C2C3 2 log2(2/α). This completes the proof of the contradiction. Hence, 8C3 max log(2/α)σ 8 nα , 2c4σ n , 2 c5Cσ log(2/α) nα The above equation implies w w O( C3Cσ log(2/α) nα ). C.1. Proof of Theorem C.2 The following auxiliary lemma will be useful in the proof of the theorem. Lemma C.5. For any z1 R, z2 > 0 and a symmetric random variable Z, E (z1 + Z) (z1 + Z)z2 max(|z1 + Z|, z2) 2|z1| Pr(Z > z2 |z1|) Proof. We consider z1 0 and prove the lemma for this case. The proof for z1 < 0 case then follows from the symmetry of the distribution of Z around 0. The term inside the expectation can be expressed in terms of indicator random variables as follows: (z1 + Z) (z1 + Z)z2 max(|z1 + Z|, z2) = (z1 + Z z2) 1(Z > z2 z1) + (z1 + Z + z2) 1(Z < z2 z1) = (z1 + Z z2) 1(z2 z1 < Z z2 + z1) + (z1 + Z z2) 1(Z > z2 + z1) + (z1 + Z + z2) 1(Z < z2 z1). Efficient List-Decodable Regression using Batches Next, taking the expectation on both sides in the above equation, E (z1 + Z) (z1 + Z)z2 max(|z1 + Z|, z2) = E[(z1 + Z z2) 1(z2 z1 < Z z2 + z1)] + E[(z1 + Z z2) 1(Z > z2 + z1)] + E[(z1 + Z + z2) 1(Z < z2 z1)] = E[(z1 + Z z2) 1(z2 z1 < Z z1 + z2)] + 2|z1| Pr(Z > z2 + z1), where the last step follows because Z is symmetric and z1 = |z1| since we assumed z1 0. E (z1 + Z) (z1 + Z)z2 max(|z1 + Z|, z2) = E[|z1 + Z z2| 1(z2 z1 < Z z2 + z1)] + 2|z1| Pr(Z > z2 + z1) 2|z1| Pr(z2 z1 < Z z2 + z1) + 2|z1| Pr(Z > z2 + z1) = 2|z1| Pr(Z > z2 z1). Next, using the above lemma we prove Theorem C.2. Proof of Theorem C.2. Consider a random sample (xb i, yb i ) from distribution D. Recall that nb i = yb i w nb i denote the random noise and is independent of xb i. Consider (xb i w yb i )xb i f b i (w, κ). the difference between the unclipped and the clipped gradient for the sample: (xb i w yb i )xb i f b i (w, κ) = (xb i w yb i )xb i (xb i w yb i ) |xb i w yb i | κκxb i = (xb i (w w ) nb i)xb i (xb i (w w ) nb i) |xb i (w w ) nb i| κκ xb i, (14) where in the last equality we used the relation between xb i, yb i and nb i. 
Next, by applying Lemma C.5, we get: (xb i (w w ) nb i)xb i (xb i (w w ) nb i) |xb i (w w ) nb i| κκ 2|xb i (w w )| Pr nb i > κ |xb i (w w )| , note that in the above expectation xb i is fixed and expectation is taken over nb i. Let Z := 1 |xb i (w w )| κ/2 . Observe that Pr(nb i > κ |(w w ) xb i|) Z + Pr(nb i > κ/2). Combining this observation with the above equation, we have: ((w w ) xb i nb i) ((w w ) xb i nb i) |(w w ) xb i nb i| κκ 2|(w w ) xb i| Pr(nb i > κ/2) + Z . (15) When w = w the bound holds trivially. Hence, in the remainder of the proof, we assume w = w . Let v := w w w w and Efficient List-Decodable Regression using Batches Zb i := 1 (|xb i (w w )| κ/2) (|nb i| κ/2) . Then, for unit vector v Rd, we have | E D[((w xb i yb i )xb i f b i (w, κ)) v]| = | E D[E D[((w xb i yb i )xb i f b i (w, κ)) v|xb i]]| E D[| E D[((w xb i yb i )xb i f b i (w, κ)) v xb i]]|] (a) E D 2|(w w ) xb i| |xb i v| Z + Pr(nb i > κ/2) 2|(w w ) xb i|2 w w Z + Pr(nb i > κ/2) 2 w w E D[Z |(w w ) xb i|2] + Pr(nb i > κ/2) E D[|(w w ) xb i|2] , (16) here (a) follows from Equation (14) and Equation (15), and (b) follows from the definition of vector v. Next, we bound the two terms on the right one by one. We start with the first term: E D[Z |(w w ) xb i|2] (a) E[(Z)2] E D[(xb i (w w ))4] 1/2 (b) E[Z] C E D[(xb i (w w ))2]2 1/2 (c) C Pr[|xb i (w w )| κ/2] 1/2 E D[(xb i (w w ))2], (17) where (a) used the Cauchy-Schwarz inequality, (b) used the fact that Z is an indicator random variable, hence, Z2 = Z and L4 L2 hypercontractivity, and (c) follows from the definition of Z. Applying the Markov inequality to (nb i)2 we get: Pr[|nb i| κ/2] E D[(nb i)2] (κ/2)2 σ2 (κ/2)2 . (18) Similarly, applying the Markov inequality to |xb i (w w )|4 yields: Pr[|xb i (w w )| κ/2] E D[|xb i (w w )|4] (κ/2)4 C E D[|xb i (w w )|2]2 (κ/2)4 , (19) where the last inequality uses L4 L2 hypercontractivity. Combining Equations (16), (17), (18) and (19), we have | E D[((w xb i yb i )xb i f b i (w, κ)) v]| 8 E D[(xb i (w w ))2] κ2 w w C E D[(xb i (w w ))2] + σ2 . E D[(w xb i yb i )xb i v] (a)= E D[((w w ) xb i nb i)xb i v] (b)= E D[((w w ) xb i)xb i v] E D[nb i] E D[xb i v] (c)= E D[((w w ) xb i)xb i v] (d)= E D ((w w ) xb i)2 here (a) follows from the relationship between xb i, yb i and nb i, (b) follows from as xb i and nb i are independent, (c) uses E D[nb i] = 0 and (d) follows from the definition of v. Efficient List-Decodable Regression using Batches Combining the previous two equations using the triangle inequality: | E D[ f b i (w, κ) v]| | E D[(w xb i yb i )xb i v]| | E D[((w xb i yb i )xb i f b i (w, κ)) v]| ((w w ) xb i)2 κ2 C E D[(xb i (w w ))2] + σ2 ((w w ) xb i)2 C3 = 3 4C3 w w , here the second last inequality follows from lower bound on κ. The theorem then follows by observing, E D[ f b i (w, κ)] max u =1 | E D[ f b i (w, κ) u]| | E D[ f b i (w, κ) v]| 3 4C3 w w . C.2. Proof of Lemma C.4 Proof. Note that βB (zb E β[zb])(zb E β[zb]) βB (zb E β[zb])(zb E β[zb]) 1 2|B|(zb E β[zb])(zb E β[zb]) |B |(E B [zb] E β[zb])(E B []zb] E β[zb]) 2|B| E β[zb] E B [zb] 2, where (a) used βb 1/2 for b B and the trivial bound βB |B| and (b) follows from the fact that any Z, b B (zb Z)(zb Z) |B | E B [zb] Z E B [zb] Z . We complete the proof of the lemma by proving the above fact. 
b B (zb Z)(zb Z) = b B (zb E B [zb] + E B [zb] Z)(zb E B [zb] + E B [zb] Z) (zb E B [zb])(zb E B [zb]) + (E B [zb] Z)(E B [zb] Z) (b) |B | (E B [zb] Z)(E B [zb] Z) , here (a) follows as P b B zb = |B | E B [zb] and hence, P b B (zb E B [zb])(E B [zb] Z) = P b B (E B [zb] Z)(zb E B [zb]) = 0, and (b) follows as (zb E B [zb])(zb E B [zb]) are positive semi-definite matrices. Efficient List-Decodable Regression using Batches Algorithm 2 FINDCLIPPINGPPARAMETER 1: Input: Set B, β, σ, a1 1, a2 data {{(xb i, yb i }i [n]}b B. 2: κ 3: while True do 4: wκ any approximate stationary point of clipped losses {f b( , κ)} w.r.t. weight vector β such that Eβ[f b(wκ, κ)] log(2/α)σ 5: κnew max n a1 p E β[f b(wκ, κ)], a2σ o . 6: if κnew κ/2 then 7: Break 8: end if 9: κ κnew 10: end while 11: Return(κ, wκ) D. Subroutine FINDCLIPPINGPARAMETER and its analysis Theorem D.1. For any weight vector β, a1 1, and a2 > 0, Algorithm FINDCLIPPINGPARAMETER runs at most log O maxi,b |yb i | σ iterations of the while loop and returns κ and wκ such that 1. wκ is a (approximate) stationary point for {f b( , κ)} w.r.t. weight vector β such that E β[f b(wκ, κ)] log(2/α)σ 2. max n a1 p Eβ[f b(wκ, κ)], a2σ o κ 2 max n a1 p Eβ[f b(wκ, κ)], a2σ o . 3. max n a1 2 Eβ h 1 n P i [n] |wκ xb i yb i | i , a2σ o κ max n 4a2 1 Eβ h 1 n P i [n] |wκ xb i yb i | i , a2σ o . Proof. First, we bound the number of iterations of the while loop. Since wκ is a stationary point for f b(., κ), hence its will achieve a smaller loss than w = 0, hence Eβ[f b(wκ, κ)] Eβ[f b(0, κ)]. And, since the clipped loss is smaller than unclipped loss, Eβ[f b(0, κ)] Eβ[f b(0)] = Eβ[ 1 i [n](yb i )2] maxi,b(yb i )2. Therefore after the first iteration κ max a1 maxi,b |yb i |, a2σ . Also in each iteration apart from the last one κ decreases by a factor 2 and κ can t be smaller than a2σ. Hence, the number of iterations between the first one and the last one are at most log( a1 maxi,b |yb i | a2σ )). Therefore the total number of iterations are at most log( a1 maxi,b |yb i | a2σ )) + 2. The first item follows from the definition of wκ in the subroutine FINDCLIPPINGPPARAMETER. Next to prove the lower bound in item 2 we prove the claim that if in an iteration κ max n a1 p E β[f b(wκ, κ)], a2σ o then the same condition will hold in the next iteration. The condition κ max n a1 p E β[f b(wκ, κ)], a2σ o in the claim implies that κ κnew. Then from the defini- tion of clipped loss, for each w and each b we have f b(w, κ) f b(w, κnew). It follows that Eβ[f b(wκ, κ)] Eβ[f b(wκ, κnew)]. And further wκnew is stationary point for f b(., κnew), hence it will achieve a smaller loss, Eβ[f b(wκnew, κnew)] Eβ[f b(wκ, κnew)]. Therefore, Eβ[f b(wκnew, κnew)] Eβ[f b(wκ, κ)]. Hence, κnew = E β[f b(wκ, κ)], a2σ o max n a1 p Eβ[f b(wκnew, κnew)], a2σ o . This completes the proof of the claim. Since the initial value of κ is infinite the claim must hold in the first iteration, and therefore in each iteration thereafter. Therefore it must hold in the iteration when the algorithm terminates. This completes the proof of the lower bound in item 2. The upper bound in the second item follows by observing that when the algorithm ends κ 2κnew and κnew = a1 p Eβ[f b(wκ, κ)] + a2σ. Efficient List-Decodable Regression using Batches Finally, we prove item 3 using item 2. We start by proving the lower bound in item 3. From the lower bound in item 2, we have, κ a2σ. Then to complete the proof of the lower bound in item 3, it suffices to prove κ > a1 2 Eβ h 1 n P i [n] |wκ xb i yb i | i . 
To prove this by contradiction suppose κ < a1 2 Eβ h 1 n P i [n] |wκ xb i yb i | i . Then E β[f b(wκ, κ)] i [n] f b i (wκ, κ) 1(|wκ xb i yb i | κ) (wκ xb i yb i )2 2 + 1(|wκ xb i yb i | > κ) κ|wκ xb i yb i | κ2 1(|wκ xb i yb i | κ) (wκ xb i yb i )2 1(|wκ xb i yb i | > κ) κ|wκ xb i yb i | 2 i [n] E β 1(|wκ xb i yb i | κ) |wκ xb i yb i | 2 + κ i [n] E β 1(|wκ xb i yb i | > κ) |wκ xb i yb i | i [n] E β 1(|wκ xb i yb i | κ) |wκ xb i yb i | i [n] E β 1(|wκ xb i yb i | > κ) |wκ xb i yb i | i [n] E β 1(|wκ xb i yb i | κ) |wκ xb i yb i | i [n] E β 1(|wκ xb i yb i | > κ) |wκ xb i yb i | E β 1(|wκ xb i yb i | κ) |wκ xb i yb i | + E β 1(|wκ xb i yb i | > κ) |wκ xb i yb i | = κ 2a1 E β i [n] |wκ xb i yb i | a2 1 , (20) here (a) and (b) follows the Cauchy-Schwarz inequality, (c) and (e) follows from our assumption κ < 2 Eβ h 1 n P i [n] |wκ xb i yb i | i and (d) follows since a1 1. This contradicts the lower bound κ a1 p E β[f b(wκ, κ)] in item 2. Hence we conclude, κ 2 Eβ h 1 n P i [n] |wκ xb i yb i | i . This completes the proof of the lower bound in item 3. Next, we prove the upper bound in item 3. We consider two cases. For the case when a1 p E β[f b(wκ, κ)] a2σ then upper bound in item 3 follows from the upper bound in item 2. Next we prove for the other case, when a1 p E β[f b(wκ, κ)] > a2σ. For this case item 2 implies E β[f b(wκ, κ)] κ2 Next, from the definition of f b(w, κ), E β[f b(wκ, κ)] = E β i [n] f b i (wκ, κ) i [n] κ|wκ xb i yb i | i [n] |wκ xb i yb i | Combining the above equation and E β[f b(wκ, κ)] κ2 4a2 1 , we get, 4a2 1 κ E β i [n] |wκ xb i yb i | Efficient List-Decodable Regression using Batches The upper bound in item 3 then follows from the above equation. E. Correctness of estimated parameters for nice weight vectors For batch b B, let vb(w) := 1 i [n] |w xb i yb i |. Since w will be fixed in the proofs, we will often denote vb(w) as vb. In this section, we state and prove Theorems E.1, E.2 and E.4. For any triplet with a nice weight vector, Theorem E.1 ensures the correctness of parameters calculated for Type-1 use of MULTIFILTER. For any triplet with a nice weight vector, Theorem E.4 ensures the correctness of parameters calculated for the case when it gets added to M or goes through Type-2 use of MULTIFILTER. Theorem E.2 serves as an intermediate step in proving Theorem E.4. Theorem E.1. In Algorithm 1 if the weight vector β is such that βG 3|G|/4, n (16)2c2C, and Theorem B.2 s conclusion holds, then for any w, the parameter θ1 computed in the subroutine satisfies σ2 + C E D[|w xb i yb i |]2 where c2 is the same universal positive constant as item 2 in Lemma B.4. Proof. To prove the theorem we first show that θ0 calculated in the algorithm is 7 E D[|w xb i yb i |] 8 σ 8 Let MED denote median of the set {vb : b G}. From Theorem B.2 and Markov s inequality, it follows that MED E D[|w xb i yb i |] 2 σ2 + C E D[|w xb i yb i |]2 E D[|w xb i yb i |] 8 + σ 8 where the last inequality uses n (16)2c2C. It follows that MED 7 E D[|w xb i yb i |] 8 σ 8 Then to complete the proof we show that MED θ0. Note that b G:vb βG |G| And since from the definition of θ0, we have P b:vb>θ0 βb α|B|/4 |G| 4 , it follows that MED θ0. Therefore, θ0 7 E D[|w xb i yb i |] 8 σ 8 C . The lower bound in the theorem on θ1 then follows from the relation between θ0 and θ1. Theorem E.2. Suppose regularity conditions holds, and β, w and n satisfy n max{ (32)2c3c2C log2(2/α) α , (16)2c2C}, βG 3|G|/4, and Varβ vb(w) c3 log2(2/α)θ1, 3 E D[|(w w ) xb i|] 4 σ E β vb(w) 4 E D[|(w w ) xb i|] 3 + 2σ. 
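The preceding proof of Theorem E.1 works with the per-batch mean absolute residual v^b(w) and the β-weighted upper quantile θ0 (the smallest threshold whose strictly-above β-weight is below α|B|/4). The following sketch illustrates these two quantities; the helper names and the tie-breaking rule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mean_abs_residual(w, X_b, y_b):
    # v^b(w): average absolute residual of batch b
    return np.mean(np.abs(X_b @ w - y_b))

def weighted_upper_quantile(values, beta, mass):
    """Smallest observed threshold t such that the beta-weight of batches with
    value strictly above t is at most `mass` (tie-breaking is illustrative)."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    b = np.asarray(beta, dtype=float)[order]
    tail = np.cumsum(b[::-1])[::-1]        # tail[k] = weight of {value >= v[k]}
    for vk, tk, bk in zip(v, tail, b):
        if tk - bk <= mass:                # weight strictly above v[k]
            return vk
    return v[-1]

# Illustrative use, matching the role of theta0 in the text:
# vb = [mean_abs_residual(w, X_b, y_b) for X_b, y_b in batches]
# theta0 = weighted_upper_quantile(vb, beta, mass=alpha * len(batches) / 4)
```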
Efficient List-Decodable Regression using Batches In proving Theorem E.2 the following auxiliary lemma will be useful. We prove this lemma in Subsection E.1. Lemma E.3. Let Z be any random variable over the reals. For any z R, such that Pr[Z > z] 1/2, we have Var(Z) Pr[Z z] E[Z] z + p and for all z Z, Var(Z) min{Pr[Z z], Pr[Z z], 0.5}. Now we prove Theorem E.2 using the above Lemma. Proof of Theorem E.2. Let MED denote median of the set {vb : b G}. In Equation (23) we showed, b B:vb MED βb |G| b B:vb MED βb Similarly, by symmetry, one can show P b B:vb MED βb Then from the second bound in Lemma E.3, | E β[vb] MED| From Equation (22), the above equation, and the triangle inequality, | E β[vb] E D[|w xb i yb i |]| α + E D[|w xb i yb i |] 8 + σ 8 Next, from the definition of θ0, we have P b:vb θ0 βb α|B|/4 and P b:vb>θ0 βb < α|B|/4. Then P 4βG α|B| 4(3|G|/4) 1 Then from the first bound in Lemma E.3, α E β[vb]. (26) In this lemma, we had assumed the following bound on the variance of vb, Varβ[vb] c3 log2(2/α)θ1. Efficient List-Decodable Regression using Batches α c3 log2(2/α)θ1 α = c3 log2(2/α)c2(σ2 + (2 Cθ0 + σ)2) nα (σ2 + 4Cθ2 0 + 2σ2) 322C σ2 256C + θ2 0 256, here the equality follows from the relation between θ0 and θ1 and the first inequality follows as n (32)2Cc3c2 log2(2/α) 256C + θ2 0 256 σ 16 E β[vb] + 2 here the second inequality used a2 + b2 |a| + |b| and the last inequality used (26). From the above equation, it follows that r 14 E β[vb]. Combining the above bound and Equation (25) | E β[vb] E D[|w xb i yb i |]| σ 7 7 E β[vb] + E D[|w xb i yb i |] 8 + σ 8 From the above equation it follows that 49 E D[|w xb i yb i |] 64 15σ 64 C E β[vb] 21 E D[|w xb i yb i |] 16 + 5σ 16 Finally, we upper bound and lower bound E D[|w xb i yb i |] to complete the proof. To prove the upper bound, note that, E D[|w xb i yb i |] = E D[|(w w ) xb i nb i|] E D[|(w w ) xb i|] + E D[|nb i|] E D[|(w w ) xb i|] + σ, here the last inequality used E D[|nb i|] p E D[|nb i|2]. Combining the above upper bound with the upper bound in (27) and using C 1 proves the upper bound in the lemma. Similarly, we can show E D[|w xb i yb i |] E D[|(w w ) xb i|] σ, Combining the above lower bound with the lower bounds in (27) and using C 1 proves the lower bound in the lemma. Theorem E.4. Suppose regularity conditions holds, and β, w and n satisfy n max{ (32)2c3c2C log2(2/α) α , (16)2c2C}, βG 3|G|/4, and i [n] |w xb i yb i | c3 log2(2/α)θ1, then for κ, w returned by subroutine FINDCLIPPINGPARAMETER and θ2 calculated by MAINALGORITHM, we have 1. c4 σ2+C E D[((w w ) xb i)2] n θ2 c6C2(σ2+E D[|(w w ) xb i|]2) n , where c4 is the same positive constant as in item 1 of Lemma B.4 and c6 is some other positive universal constant. C E D[|xb i (w w )|2], 8σ} κ and κ c7C2 p E D[|xb i (w w )|2] + σ , where c7 is some other positive universal constant. Note that the range of κ in item 2 of the above Theorem is the same as that in (b). In proving the theorem the following lemma will be useful. Efficient List-Decodable Regression using Batches Lemma E.5. For any vectors u, we have r E D[|u xb i|2] 8C E D[|u xb i|] q E D[|u xb i|2] We prove the above auxiliary lemma in Section E.2 using the Cauchy-Schwarz inequality for the upper bound and L4 L2 hypercontractivity for the lower bound. Next, we prove Theorem E.4 using the above lemma and Theorem E.2. Proof of Theorem E.4. We start by proving the first item. For convenience, we recall the definition of θ2 in (8), n σ2 + 16C2 E β[vb] + σ 2 . 
The upper bound in the item follows from this definition of θ2 and the upper bound on E β[vb] in Lemma E.2. Using the lower bound bound on E β[vb] in Lemma E.2 and definition of θ2, n σ2 + 9C2 E D[|(w w ) xb i|]2 c4 8C2 E D[|(w w ) xb i|2] , where the last step used Lemma E.5. This completes the proof of lower bound item 1. Next, we prove item 2. From Theorem D.1, 2 Eβ h 1 n P i [n] |w xb i yb i | i , a2σ o κ max n 4a2 1 Eβ h 1 n P i [n] |w xb i yb i | i , a2σ o . Since for any a, b > 0, (a + b)/2 max(a, b) a + b. Then from the above bound, 4 Eβ h 1 n P i [n] |wκ xb i yb i | i + a2σ 2 κ 4a2 1 Eβ h 1 n P i [n] |wκ xb i yb i | i + a2σ. Using the bound on E β[vb] in Lemma E.2 in the above equation 4 3 E D[|(w w ) xb i|] 4 σ 2 κ 4a2 1 4 E D[|(w w ) xb i|] 3 + 2σ + a2σ. Using Lemma E.5, and the above equation, E D[|(w w ) xb i|2] + (4a2 a1)σ 8 κ 16a2 1 3 p E D[|(w w ) xb i|2] + (8a2 1 + a2)σ. The upper bound and lower bound in item 2 then follow by using the values a1 = 256C 2 3 and a2 = a1 E.1. Proof of Lemma E.3 Proof of Lemma E.3. We only prove the first statement as the second statement and then follow from the symmetry. We start by proving the upper bound in the first statement. We consider two cases, E[Z] z and E[Z] > z. For the first case, the upper bound automatically follows. Next, we prove the second case. In this case, Var(Z) = E[(Z E[Z])2] E[1(Z z)(Z E[Z])2] E[1(Z z)(z E[Z])2] = Pr[Z z](z E[Z])2. Then using Pr[Z z] = 1 Pr[Z > z] 1/2, we get Var(Z) (z E[Z])2 The upper bound from the above equation. Next, we prove the lower bound. Again, we consider two cases, E[Z] z and E[Z] < z. For the first case, the lower bound automatically follows. Next, we prove the second case. In this case, Var(Z) = E[(Z E[Z])2] E[1(Z z)(Z E[Z])2] E[1(Z z)(z E[Z])2] = Pr[Z z](z E[Z])2, Efficient List-Decodable Regression using Batches from which the lower bound follows. By symmetry, for any z R, such that Pr[Z < z] 1/2, one can show that 2 Var(Z) E[Z] z + Var(Z) Pr[Z z]. Since for any z, either Pr[Z > z] 1/2 or Pr[Z < z] 1/2, Hence, either the first bound in the Lemma or the above bound holds for each z, therefore for any z R, Var(Z) Pr[Z z], p E[Z] z + max Var(Z) Pr[Z z], p The second bound in the lemma is implied by the above bound. E.2. Proof of Lemma E.5 Proof of Lemma E.5. The upper bound on E D[|u xb i|] follows from the Cauchy-Schwarz inequality, E D[|u xb i|] q E D[|u xb i|2] Σ 1. Next, we prove the lower bound. From Markov s inequality Pr D[ xb i u 2 2C E D[ xb i u 2]] = Pr D[ xb i u 4 4C2(E D[ xb i u 2])2] = E D[ xb i u 4] 4C2(E D[ xb i u 2])2 1 where the last step uses L4 L2 hypercontractivity. Then, from the Cauchy-Schwarz inequality, E D[ xb i u 2 1( xb i u 2 2C E D[ xb i u 2])] q E D 1 xb i u 2 2C E D[ xb i u 2] E D[ xb i u 4] Pr D xb i u 2 2C E D[ xb i u 2] C E D[ xb i u 2]2 2 E D[ xb i u 2]. E D[ xb i u 2 1( xb i u 2 < 2C E D[ xb i u 2])] = E D[ xb i u 2] E D[ xb i u 2 1( xb i u 2 2C E D[ xb i u 2])] 2 E D[ xb i u 2] E D xb i u 2 1 xb i u 2 < 2C E D[ xb i u 2] 2C E D[ xb i u 2] 1 xb i u 2 < 2C E D[ xb i u 2] 2C E D[ xb i u 2] E D[ xb i u ]. Combining the above two equations we get E D[ xb i u ] E D[ xb i u 2] 2C E D[ xb i u 2] = E D[ xb i u 2] 2 Efficient List-Decodable Regression using Batches F. Multi-filtering In this section, we state the subroutine MULTIFILTER, a simple modification of BASICMULTIFILTER algorithm in (Diakonikolas et al., 2020b). The subroutine takes a weight vector β, a real function zb on batches, and a parameter θ as input and produces new weight vectors. 
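The gating condition stated next, and the guarantees in Theorems F.1 and F.2 below, are phrased in terms of the β-weighted mean and variance of a batch statistic z^b. A minimal sketch of these weighted moments follows; the normalization by the total weight β^B is an assumption for illustration and is not taken from the paper.

```python
import numpy as np

def weighted_mean(z, beta):
    # E_beta[z^b], normalized by the total weight beta^B
    z, beta = np.asarray(z, dtype=float), np.asarray(beta, dtype=float)
    return (beta * z).sum() / beta.sum()

def weighted_var(z, beta):
    # Var_{B,beta}(z^b): beta-weighted second moment about the weighted mean
    z, beta = np.asarray(z, dtype=float), np.asarray(beta, dtype=float)
    mu = (beta * z).sum() / beta.sum()
    return (beta * (z - mu) ** 2).sum() / beta.sum()

# The subroutine below is invoked only when weighted_var(z, beta) exceeds
# the threshold c3 * log(2/alpha)**2 * theta of the gating condition.
```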
This subroutine is used only when: Var B,β(zb) > c3 log2(2/α)θ, (28) where c3 is an universal constant (Same as 2 C, where C is the constant in BASICMULTIFILTER algorithm in (Diakonikolas et al., 2020b)). When the variance of zb for good batches is smaller than θ and the weight vector β is nice that is βG 3/4|G|, then at least one of the new weight vectors produced by this subroutine has a higher fraction of weights in good vector than the original weight vector β. Algorithm 3 MULTIFILTER Input: Set B, α, β, {zb}b B, θ. {Input must satisfy Condition (28)} Let a = inf{z : P b:zbz βb αβB/8} Let B = {b B : zb [a, b]} if Var B ,β(zb) c3 log2(2/α)θ 2 then Let f b = minz [a,b] |zb z|2, and the new weight of each batch b B be βb new = 1 f b maxb B:βb>0 f b βb (29) NEWWEIGHTS {βnew} else Find z R and R > 0 such that sets B = {b B : zb z R} and B = {b B : zb < z + R} satisfy (βB )2 + (βB )2 (βB)2, (30) βB 48 log( 2 α ) R2 . (31) {Existence of such z and R is guaranteed as shown in Lemma 3.6 of (Diakonikolas et al., 2020b).} For each b B, let βb 1 = βb 1(b B ) and βb 2 = βb 1(b B ). Let β1 = {βb 1}b B and β2 = {βb 2}b B. NEWWEIGHTS {β1, β2} end if Return(NEWWEIGHTS) In BASICMULTIFILTER subroutine of (Diakonikolas et al., 2020b) input is not restricted by the condition in Equation (28). However, when input meets this condition BASICMULTIFILTER and its modification MULTIFILTER behaves the same. Therefore, the guarantees for weight vectors returned by MULTIFILTER follows from the guarantees of BASICMULTIFILTER in (Diakonikolas et al., 2020b). We characterize these guarantees in Theorem F.1. Theorem F.1. Let {zb}b B be collection of real numbers associated with batches, β be a weight vector, and threshold θ > 0 be such that condition in (28) holds. Then MULTIFILTER(B, β, {zb}b B, θ1) returns a list NEWWEIGHTS containing either one or two new weight vectors such that, 1. Sum of square of the total weight of new weight vectors is bounded by the square of the total weight of β, namely X eβ NEWWEIGHTS (eβB)2 (βB)2. (32) Efficient List-Decodable Regression using Batches 2. In the new weight vectors returned the weight of at least one of the weight vectors has been set to zero, that is for each weight vector eβ NEWWEIGHTS, {b : βb > 0} {b : βb > 0}, (33) 3. If weight vector β is such that βG 3|G|/4 and for good batches the variance Var G(zb) θ is bounded, then for at least one of the weight vector eβ NEWWEIGHTS, βB 1 24 log(2/α). (34) Proof. When the list NEWWEIGHTS contains one weight vector it is generated using Equation (29), and when the list NEWWEIGHTS contains one weight vector it is generated using Equations (30) and (31). In both cases, item 1 and item 2 of the Theorem follow immediately from these equations. The last item follows from Corollary 3.8 in (Diakonikolas et al., 2020b). F.1. Guarantees for the use of MULTIFILTER in Algorithm 1 The following Theorem characterizes the use MULTIFILTER by our algorithm. The proof of the theorem is similar to the proofs for the main algorithm in (Diakonikolas et al., 2020b). Theorem F.2. At the end of Algorithm 1 the size of M is at most 4/α2 and the algorithm makes at most O(|B|/α2) calls to MULTIFILTER. And, if for every use of subroutine MULTIFILTER by the algorithm we have Var G(zb) θ then there is at least one triplet (β, w, κ) in M such that βG 3|G|/4. Proof. First note that the if blocks in Algorithm 1 ensures that for every use of subroutine MULTIFILTER Equation (28) is satisfied, therefore we can use the guarantees in Theorem F.1. 
First we upper bound the size of M. The progress of Algorithm 1 may be described using a tree. The internal nodes of this tree are the weight vectors that have gone through subroutine MULTIFILTER at some point of the algorithm, and children of these internal nodes are new weight vectors returned by MULTIFILTER. Observe that any weight vector β encountered in Algorithm 1 is ignored iff βB < α|B|/2. If it is not ignored then either it is added to M (in form of a triplet), or else it goes through subroutine MULTIFILTER. It follows that, if a node β is an internal node or a leaf in M then βB α|B|/2. (35) From Equation (32), it follows that the total weight squared for each node is greater than equal to that of its children. It follows that the total weight squared of the root, βinit is greater than equal to the sum of the square of weights of all the leaves. And since all weight vectors in M are among the leaves of the tree, and have total weight at least α|B|/2, (βB init)2 X β M (βB)2 X here the last step follows from Equation (35). Using βB init = |B|, in the above equation we get |M| 4/α2. Similarly, it can be shown that the number of branches in the tree is at most O(1/α2). Item 2 in Theorem F.1 implies that each iteration of MULTIFILTER zeroes out the weight of one of the batches. Hence for any weight β at depth d, we have βB |B| d. Therefore, the depth of the tree can t be more than |B|. Hence, the number of nodes in the tree is upper bounded by O(|B|/α2). And since each call to MULTIFILTER corresponds to a non-leaf node in the tree, the total calls to MULTIFILTER by Algorithm 1 are upper bounded by O(|B|/α2). Next, we show that if for each use of MULTIFILTER we have Var G(zb) θ then one of the weight vector β M must satisfy βG 3|G|/4. Efficient List-Decodable Regression using Batches Let β0 = βinit and suppose for each i, weight vectors βi and βi+1 are related as follows: βG i βG i+1 βG i βB i βB i+1 βB i 1 24 log(2/α). (36) Then Lemma 3.12 in (Diakonikolas et al., 2020b) showed that under the above relation, for each i, we have βG i 3|G|/4. We show that there is a branch of the tree such that βi and βi+1 are related using the above equation, where for each i, βi denote the weight vector corresponding to the node at ith level in this branch. From the preceding discussion, this would imply that for each i, βG i 3|G|/4. We prove it by induction. For i = 0, we select βi = βinit. Note that βG init = |G|, hence βG i 3|G|/4. If βi is a leaf then the branch is complete. Else, since βG i 3|G|/4, item 3 in Theorem F.1 implies that we can select one of the child of βi as βi+1 so that (36) holds. Then from the preceding discussion, we have βG i+1 3|G|/4. By repeating this argument, we keep finding the next node in the branch, until we reach the leaf. Next, we argue that the leaf at the end of this branch must be in M. Let β denote the weight vector for the leaf. From the above discussion, it follows that βG 3|G|/4. Hence, βB βG 3|G|/4 3α|B|/4 > α|B|/2. As discussed earlier any leaf β is not part of M iff βB α|B|/2. Hence, the leaf at the end of the above branch must be in M. This concludes the proof of the Theorem. G. Eliminating Additional Distributional Assumptions In this section, we discuss how we can remove assumptions 2 and 5 regarding the distribution of data in Section 2 of the main paper. We demonstrate that our results can still be achieved without these assumptions. 
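As a concrete illustration of the two reductions described in the rest of this section (discarding large-norm samples to enforce Assumption 2, and pairing samples within a batch to symmetrize the noise for Assumption 5), here is a minimal NumPy sketch. The constant C1 and the 10% threshold follow the discussion below; the array layout and helper names are illustrative assumptions.

```python
import numpy as np

def enforce_norm_bound(batches, C1):
    """Drop samples with ||x|| > C1*sqrt(d); then drop any batch that lost
    more than 10% of its samples, as described for Assumption 2."""
    kept = []
    for X, y in batches:                              # X: (n, d), y: (n,)
        d = X.shape[1]
        mask = np.linalg.norm(X, axis=1) <= C1 * np.sqrt(d)
        if mask.mean() >= 0.9:                        # at most 10% removed
            kept.append((X[mask], y[mask]))
    return kept

def symmetrize_noise(X, y):
    """Pair consecutive samples of a batch: x~ = (x_i - x_{i+1})/sqrt(2),
    y~ = (y_i - y_{i+1})/sqrt(2); the induced noise is symmetric about 0,
    as described for Assumption 5 (batch size is halved)."""
    m = (len(y) // 2) * 2
    X1, X2 = X[0:m:2], X[1:m:2]
    y1, y2 = y[0:m:2], y[1:m:2]
    return (X1 - X2) / np.sqrt(2), (y1 - y2) / np.sqrt(2)
```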
Assumption 2 states that there exists a constant C1 > 0 such that for random samples (x^b_i, y^b_i) ~ D, we have ‖x^b_i‖ ≤ C1√d almost surely. In the non-batch setting, Cherapanamjeri et al. (2020b) have shown that this assumption is not limiting. They proved that if the other assumptions are met, then there exists a constant C1 such that the probability of ‖x^b_i‖ ≤ C1√d exceeds 0.99. Thus, discarding the samples with ‖x^b_i‖ > C1√d does not significantly reduce the size of the dataset. Additionally, it has minimal impact on the covariance matrix and the hypercontractivity constants of the distribution. This reasoning extends easily to the batch setting. In the batch setting, we first exclude samples from batches where ‖x^b_i‖ > C1√d. We then remove batches that have lost more than 10% of their original size. Since, on average, this operation removes 1% of the samples from genuine batches, a simple argument using Markov's inequality shows that the probability of removing a genuine batch is at most 10%. It can be shown that, with high probability, the fraction of genuine batches removed for any component is at most 10%. Therefore, assumption 2 on the data distribution is not required, and this simple procedure can be used to enforce it, at the cost of decreasing the batch size and α by at most 10%. Consequently, the guarantees in our theorems change only by a small constant factor.

Assumption 5 states that the noise distribution is symmetric. We can address this by employing a simple technique. Consider two independent samples (x^b_i, y^b_i) and (x^b_{i+1}, y^b_{i+1}), where y^b_j = w* · x^b_j + n^b_j for j ∈ {i, i+1}. We define x̃^b_i = (x^b_i − x^b_{i+1})/√2, ỹ^b_i = (y^b_i − y^b_{i+1})/√2, and ñ^b_i = (n^b_i − n^b_{i+1})/√2. Since n^b_i and n^b_{i+1} are i.i.d., the distribution of ñ^b_i is symmetric around 0 and the variance of ñ^b_i matches that of n^b_i. Moreover, the covariance of x̃^b_i is the same as that of x^b_i, and we have ỹ^b_i = w* · x̃^b_i + ñ^b_i. Therefore, the new sample (x̃^b_i, ỹ^b_i), obtained by combining two i.i.d. samples (x^b_i, y^b_i) and (x^b_{i+1}, y^b_{i+1}) in a batch, satisfies the same distributional assumptions as before and, in addition, has a symmetric noise distribution. We can apply this approach to combine every two samples in a batch, which only reduces the batch size by a constant factor of 1/2. Thus, the assumption of symmetric noise can be eliminated by increasing the required batch sizes in our theorems by a factor of 2.

H. Proof of Theorem B.1

In Section H.1, we state and prove two auxiliary lemmas that will be used in proving Theorem B.1, and in Section H.2, we prove Theorem B.1. We will use the following notation in describing the auxiliary lemmas and in the proofs. Let S := {(x^b_i, y^b_i) : b ∈ G, i ∈ [n]} denote the collection of all good samples. Note that |S| = |G|n. For any function h over (x, y), we denote the expectation of h w.r.t. the uniform distribution on a subset S' ⊆ S by E_{S'}[h(x^b_i, y^b_i)] := Σ_{(x^b_i, y^b_i) ∈ S'} h(x^b_i, y^b_i) / |S'|.

H.1. Auxiliary lemmas

In this subsection, we state and prove Lemmas H.1 and H.2. We will use these lemmas in the proof of Theorem B.1 in the following subsection. In the next lemma, for any unit vector u, we bound the expected second moment of the tails of |x^b_i · u|, where x^b_i is the covariate of a random sample from the distribution D.

Lemma H.1. For all θ > 1 and all unit vectors u ∈ R^d,
Pr_D[|x^b_i · u|² ≥ √C θ] ≤ 1/θ²  and  E_D[1(|x^b_i · u|² ≥ √C θ) |x^b_i · u|²] ≤ (√C/θ) E_D[|x^b_i · u|²].

Proof.
The first part of the lemma follows from Markov s inequality, Pr D[ xb i u 2 Cθ] = Pr D[ xb i u 4 Cθ2] = E D[ xb i u 4] Cθ2 1 where the last step uses L4 L2 hypercontractivity. This proves the first bound in the lemma. For the second bound, note that E D[1( xb i u 2 Cθ) xb i u 2] (a) q E D[1( xb i u 2 Cθ)] E D[ xb i u 4] Pr D[ xb i u 2 Cθ] C E D[ xb i u 2]2 C θ E D[ xb i u 2], here (s) follows from the Cauchy-Schwarz inequality, (b) uses L4 L2 hypercontractivity, and (c) follows from the first bound in the lemma. In the next lemma, for any unit vectors u, we provide a high probability bound on the expected second moment of the tails of |xb i u|, wheres xb i are covariates of samples in good batches G. Lemma H.2. For any given θ > 1, and |G|n = Ω(dθ2 log( C1dθ C )), with probability at least 1 2/d2, for all unit vectors u, E S h 1 xb i u 2 3 Cθ xb i u 2i O The following lemma restates Lemma 5.1 of (Cherapanamjeri et al., 2020a). The lemma shows that for any large subset of S, the covariance of covariates xb i in S is close to the true covariance for distribution D of samples. We will use this result in proving Lemma H.2. Lemma H.3. For any fix θ > 1, and |G|n = Ω(dθ2 log(dθ)), with probability at least 1 1/d2 for all subsets of S S of size (1 1 θ2 )|S|, we have I E S [xb i(xb i) ] Σ + O Remark H.1. Lemma 5.1 of (Cherapanamjeri et al., 2020a) assumes that hypercontractive parameter C is a constant and its dependence doesn t appear in their lemma but is implicit in their proof. hides/ignores its dependence. The following corollary is a simple consequence. We will use this corollary in proving Lemma H.2. Efficient List-Decodable Regression using Batches Corollary H.4. For any fix θ > 1, and |G|n = Ω(dθ2 log(dθ)), with probability at least 1 1/d2 for all subsets S S of size |S| θ2 and all unit vectors u, we have |S| E S [(xb i u)2] O Proof. Consider any set S of size |S| θ2 . Since |S \ S | (1 1 θ2 )|S|, applying Lemma H.3 for S \ S and S, I E S\S [xb i(xb i) ], E S[xb i(xb i) ] Σ + O E S[xb i(xb i) ] = |S | |S| E S [xb i(xb i) ] + |S \ S | |S| E S\S [xb i(xb i) ] = |S | E S [xb i(xb i) ] = |S| E S[xb i(xb i) ] |S \ S | E S\S [xb i(xb i) ]). Combining the previous three equations, |S | E S [xb i(xb i) ] |S| |S |Σ + (|S| + |S \ S |)O θ2 |S|Σ + 2|S|O where the last line used Σ I, |S | |S|/θ2, C 1, and 1/θ2 1/θ for θ 1. Finally, observing that for any unit vector u E S [xb i(xb i) ]u = E S [(xb i u)2] completes the proof. Now we complete the proof of the Lemma H.2 with help of the above corollary. Proof of Lemma H.2. From Lemma H.1 we have ED[1( xb i u 2 Cθ)] = Pr[ xb i u 2 Cθ] 1 θ2 Applying Chernoff bound for random variable 1( xb i u 2 Pr E S[1( xb i u 2 (i,b) S 1( xb i u 2 Hence, for a fix unit vector u, with probability 1 exp |S| E S[1( xb i u 2 Next, we show that this bound holds uniformly over all unit vectors u. Consider an q Cθ 2C1d net of unit sphere {u Rd : u 1} such that for any vector u in this ball there exist a u in the net such that u u q Cθ 2C1d. The standard covering argument (Vershynin, 2012) shows the existence of such a net of size e O(d log( C1d Cθ )). Then from the union bound, for all vectors u in this net with probability at least 1 e O(d log( C1d E S[1( xb i u 2 Efficient List-Decodable Regression using Batches 3θ2 d log( C1dθ C ) d log( C1d Cθ )), therefore, e O(d log( C1d Now consider any vector u in unit ball and u in the net such that u u q Cθ 2C1d. 
Then (xb i u)2 = (xb i (u + (u u )))2 = 2(xb i u )2 + (xb i (u u ))2 2(xb i u )2 + 2 u u 2 xb i 2 2(xb i u )2 + 2 Cθ 2C1d C1d 2(xb i u )2 + where in the last line we used the assumption that xb i C1 d. When (xb i u )2 Cθ, then above sum is bounded by 2 Cθ. It follows that with probability 1 1/d2, for all unit vectors u, E S[1( xb i u 2 3 Applying Corollary H.4 for S = { xb i u 2 3 Cθ}, proves the lemma E S[1( xb i u 2 3 Cθ) xb i u 2] = |S | |S| E S [ xb i u 2] O H.2. Proof of Theorem B.1 Proof of Theorem B.1. Note that E G h f b(w, κ) u E D[ f b(w, κ) u] 2i = 1 |G| f b(w, κ) u E D[ f b(w, κ) u] 2 i n f b i (w, κ) u E D[ f b(w, κ) u] f b i (w, κ) u E D[ f b i (w, κ) u] !2 where in the last step we used the expectation of batch and sample gradients are the same, namely E D[ f b i (w, κ) u] = E D[ f b(w, κ) u]. For any positive ρ > 0 and unit vector u, define gb i (w, κ, u, ρ) := f b i (w, κ) u xb i u ρ ρ. Recall that for a good batch b G, yb i = w xb i + nb i. Using this in equation (3), for any good batch b G, we have f b i (w, κ) = (xb i (w w ) nb i) |xb i (w w ) nb i| κκxb i. (37) Combining the above two equations, gb i (w, κ, u, ρ) = κρ (xb i (w w ) nb i) xb i (w w ) nb i κ xb i u xb i u ρ From the above expression it follows that |gb i (w, κ, u, ρ)| κρ a.s. Efficient List-Decodable Regression using Batches We will choose ρ later in the proof. Let Zb i (w, κ, u, ρ) := gb i (w, κ, u, ρ) E D gb i (w, κ, u, ρ) . Zb i (w, κ, u, ρ) := f b i (w, κ) u E D f b i (w, κ) u Zb i (w, κ, u, ρ) = f b i (w, κ) u gb i (w, κ, u, ρ) E D f b i (w, κ) u gb i (w, κ, u, ρ) . When w, u, κ, and ρ are fixed or clear from the context, we will omit them from the notation of Zb i and Zb i . Then, E G h f b(w, κ) u E D[ f b(w, κ) u] 2i = 1 |G| i n (Zb i + Zb i ) i n ( Zb i )2, (39) here in the last step we used Jensen s inequality (E[Z])2 E[Z2]. We bound the two summations separately. To bound the first summation we first show that Zb i are bounded, and then use Bernstein s inequality. We bound the second term using Lemma H.2 and Lemma H.1. From (38), it follows that |gb i (w, κ, u, ρ)| κρ a.s., and therefore, |Zb i | 2κρ. Since |Zb i | is bounded by 2κρ, it is a (2κρ)2 sub-gaussian random variable. Using the fact that the sum of sub-gaussian random variables is sub-gaussian, the sum Pn i=1 Zb i is n(2κρ)2 sub-gaussian random variable. Since square of a subgaussian is sub-exponential (Philippe, 2015) (Lemma 1.12), hence (Pn i=1 Zb i )2 ED(Pn i=1 Zb i )2 is sub-exponential with parameter 16n(2κρ)2. Bernstein s inequality (Philippe, 2015) (Theorem 1.12) for sub-Gaussian random variables implies that with probability 1 δ, 16n(2κρ)2 max ( 2 ln(1/δ) Since Zb i are zero mean independent random variables, i=1 (Zb i ) = n E D (Zb i )2 . 
Efficient List-Decodable Regression using Batches We bound the expectation on the right, E D (Zb i )2 = E D h gb i (w, κ, u, ρ) E D gb i (w, κ, u, ρ) 2i (a) E D h gb i (w, κ, u, ρ) 2i (b) E D[(nb i + (w w ) xb i)2(xb i u)2] (c)= E D[(nb i)2(xb i u)2] + E D[((w w ) xb i)2(xb i u)2] (d) E D[(nb i)2] E D[(xb i u)2] + q E D[((w w ) xb i)4] E D[(u xb i)4] (e) σ2 E D[(xb i u)2] + q C2 E D[((w w ) xb i)2]2 E D[(u xb i)2]2 (f) σ2 + C E D[((w w ) xb i)2], here inequality (a) uses that squared deviation is smaller than mean squared deviation, inequality (b) follows from the definition of gb i in (38), inequality (c) follows from the independence of nb i and xb i, inequality (d) follows the Cauchy Schwarz inequality, (e) uses the L-4 to L-2 hypercontractivity assumption E D[(u (xb i))4] C, and (f) follows as for any unit vector E D[(xb i u)2] Σ 1. Combining the last three equations, we get that with probability 1 δ, n(σ2 + C E D[((w w ) xb i)2]) + 64n(κρ)2 max ( 2 ln(1/δ) The above bound holds for given fixed values of parameters κ, w, and u. To extend the bound for all values of these parameters (for appropriate ranges of interest), we will use the covering argument. With the help of the covering argument, we show that with probability 1 δe O(d log(C1dn) 1 d2 , for all unit vectors u, all vectors w and κ (σ + w w )d2n, i=1 Zb i (w, κ, u, ρ) 2σ2n + 13Cn E D[((w w ) xb i)2] + 384n(κρ)2 max ( 2 ln(1/δ) We delegate the proof of Equation (41) using Equation (40) and the covering argument to the very end. The use of covering argument is rather standard. The main subtlety is that the above bound holds for all vectors w. The cover size of all d dimensional vectors is infinite. To overcome this difficulty we first take union bound for vectors for all w such that w w R for an appropriate choice of R. To extend it to any w for which w w > R is large we show that the behavior of the above quantity on the left for such a w can be approximated by its behavior for w = w +(w w ) R w w . Note that dividing Equation (41) by n2 bounds the first term in Equation (39). Next, we bound the second term in Equation(39). Note that i n ( Zb i )2 1 n|G| f b i (w, κ) u gb i (w, κ, u, ρ) E D f b i (w, κ) u gb i (w, κ, u, ρ) 2 f b i (w, κ) u gb i (w, κ, u, ρ) 2 + E D f b i (w, κ) u gb i (w, κ, u, ρ) 2 f b i (w, κ) u gb i (w, κ, u, ρ) 2 + E D h f b i (w, κ) u gb i (w, κ, u, ρ) 2i . Efficient List-Decodable Regression using Batches From the definitions of gb i (w, κ, u, ρ) and f b i (w, κ), | f b i (w, κ) u gb i (w, κ, u, ρ)| = 1( xb i u ρ) f b i (w, κ) u ρ xb i u f b i (w, κ) u 1( xb i u ρ) f b i (w, κ) u κ xb i u 1( xb i u ρ). From the above equation, it follows that E D h f b i (w, κ) u gb i (w, κ, u, ρ) 2i κ2 E D h 1( xb i u ρ) xb i u 2i . Combining the above three bounds, i n ( Zb i )2 2κ2 1( xb i u ρ) xb i u 2 + E D h 1( xb i u ρ) xb i u 2i = 2κ2 E S h 1( xb i u ρ) xb i u 2i + E D h 1( xb i u ρ) xb i u 2i , here the last line uses the fact that S is the collection of all good samples. 
C, and |G|n = Ω(dρ4 log( C1dρ C )), Lemma H.2 implies that with probability at least 1 2/d2, for all unit vectors u, we have E S h 1( xb i u ρ) xb i u 2i = E S h 1( xb i u 2 ρ2) xb i u 2i O( And from Lemma H.1, for ρ2 C and any unit vectors u, E D h 1( xb i u ρ) xb i u 2i O( By combining the above three bounds it follows that, if ρ2 C, and |G|n = Ω(dρ4 log( C1dρ C )), with probability at least 1 2/d2, for all unit vectors u, i n ( Zb i (w, κ, u, β))2 O Combining the above bound, Equation (41) and (39) we get that if ρ2 = Ω( C), and |G| = Ω( dρ4 n log( C1dρ C )) then with probability 1 δe O(d log(C1dn) 3 d2 , for all unit vectors u, all vectors w and κ (σ + w w )d2n, E G h f b(w, κ) u E D[ f b(w, κ) u] 2i 5 2σ2n + 13Cn E D[((w w ) xb i)2] + 384n(κρ)2 max ( 2 ln(1/δ) Recall that 1 µmax d4n2 C . Choose ρ2 = µmax Cn. Note that p µmax(σ2 + C E D[((w w ) xb i)2]) (σ + w w )d2n. Then from the above equation choosing ρ2 = µmax Cn, for all µmax(σ2 + C E D[((w w ) xb i)2]), with probability 1 δe O(d log(C1dn) 3 d2 , for all unit vectors u, all vectors w , E G h f b(w, κ) u E D[ f b(w, κ) u] 2i O σ2 + C E D[((w w ) xb i)2] n 1 + nµ2 max ( 2 ln(1/δ) Efficient List-Decodable Regression using Batches Choose δ = e Θ(d log(C1dn), and |G| = Ω( dρ4 n log( C1dρ C ) + Cµ4 maxdn2 log(C1dn)) = Ω(ρ4 maxn2d log(d)). Then with probability 1 4 d2 , for all unit vectors u, all vectors w and for all κ2 µmax(σ2 + C E D[((w w ) xb i)2]), E G h f b(w, κ) u E D[ f b(w, κ) u] 2i O σ2 + C E D[((w w ) xb i)2] n which is the desired bound. We complete the proof by proving Equation (41). Proof of Equation (41) To complete the proof of the theorem next we prove Equation (41) with the help of Equation (40) and covering argument. To use the covering argument, we first show that gb i (w, κ, u, ρ) do not change by much by slight deviation of these parameters. From the definition of Zb i (w, κ, u, ρ), the same conclusion would then hold for it. By the triangle inequality, |gb i (w, κ, u, ρ) gb i (w , κ , u , ρ)| |gb i (w , κ , u, ρ) gb i (w , κ , u , ρ)| + |gb i (w, κ , u, ρ) gb i (w , κ , u, ρ)| + |gb i (w, κ, u, ρ) gb i (w, κ , u, ρ)|. We bound each term on the right one by one. To bound these terms we use Equation (38), the assumption that xb i C1 d and the definition of the function g(). For the first term, |gb i (w , κ , u, ρ) gb i (w , κ , u , ρ)| (u u )xb i κ C1 u u for the second term, |gb i (w, κ , u, ρ) gb i (w , κ , u, ρ)| |u xb i| |(w w ) xb i| xb i 2 w w C2 1d w w , and for the last term |gb i (w, κ, u, ρ) gb i (w, κ , u, ρ)| |κ κ | |u xb i| C1 Combining the three bounds, |gb i (w, κ, u, ρ) gb i (w , κ , u , ρ)| C1 u u dκ + C2 1d w w + C1 For u u 1/(24C1d5n3), κ 2d4σn2, w w σ/(12d C2 1n) and |κ κ | σ/(12C1dn), |gb i (w, κ, u, ρ) gb i (w , κ , u , ρ)| σ/4n a.s. This would imply, |Zb i (w, κ, u, ρ) Zb i (w , κ , u , ρ)| σ/2n a.s. Using this bound, i=1 Zb i (w, κ, u, ρ) Zb i (w , κ , u , ρ ) + σ i=1 Zb i (w , κ , u , ρ ) i=1 Zb i (w , κ , u , ρ ) Let U := {u Rd : u = 1}, W := {w Rd : w w d2σn)}, and K := 0, 2d4σn2 . Efficient List-Decodable Regression using Batches Standard covering argument shows that there exist covers such that U U : u U, min u U u u 1 (24C1d5n3), (43) W W : w W, min w W w w σ 12C2 1dn, (44) K K : κ K, min κ K ,κ κ |κ κ | σ 12C1dn, (45) and the size of each is |U |, |W |, |K | e O(d log(C1dn). 
In equation (40), taking the union bound over all elements in U , W and K , it follows that with probability 1 δe O(d log(C1dn), for all u U , w W and κ K i=1 Zb i (w , κ , u , ρ) n(σ2 + C E D[((w w ) xb i)2]) + 64n(κ ρ)2 max ( 2 ln(1/δ) Combining the above bound with Equation (42), it follows that with probability 1 δe O(d log(C1dn), for all u U, w W and κ K and elements u , w and κ in the respective nets satisfying equations (43),(44), and (45), i=1 Zb i (w, κ, u, ρ) 2n(σ2 + C E D[((w w ) xb i)2]) + 128n(κ ρ)2 max ( 2 ln(1/δ) 2n(σ2(1 + 1 4n) + C E D[((w w ) xb i)2]) + 128n(κρ)2 max ( 2 ln(1/δ) 4σ2 + 2C E D[((w w ) xb i)2]) + 128n(κρ)2 max ( 2 ln(1/δ) here (a) follows from the bound κ κ in Equation (45), and (b) follows by first writing w w = (w w ) + (w w) and then using the bound w w σ 12C2 1dn in Equation (44). Next, we further remove the restriction w W and extend the above bound to all vectors w. Consider a w / W and κ 0, (σ + w w )d2n . From the definition of W, we have w w > d2σn. Let w = w + w w w w d2σn and κ = d2σn w w κ. Observe that w w = d2σn and κ (σ + w w )d2n d2σn w w d2σn d2σn w w + d4σn2 d2σn + d4σn2 2d4σn2, Efficient List-Decodable Regression using Batches hence, w W and κ K. From Equation (38), d2σn gb i (w , κ , u, ρ) gb i (w, κ, u, ρ) (a)= ρ xb i u xb i u ρ w w d2σn κ (xb i (w w ) nb i) xb i (w w ) nb i κ κ (xb i (w w ) nb i) xb i (w w ) nb i κ (xb i w w w w d2σn nb i) xb i w w w w d2σn nb i d2σn w w κ κ (xb i (w w ) nb i) xb i (w w ) nb i κ (xb i (w w ) d2σn w w nb i) xb i (w w ) d2σn w w nb i κ κ (xb i (w w ) nb i) xb i (w w ) nb i κ (c) xb i u d2σn w w nb i nb i d|nb i| d2σn w w here (a) follows from the definition of gb i , inequality (b) follows as ρ xb i u ρ, inequality (c) uses the fact that for any a, and b 0, we have |b a+ (a+ ) b b a a b| | |, inequality (c) uses xb i C1 d and the last inequality (e) uses w w > d2σn. d2σn Zb i (w , κ , u, ρ) Zb i (w, κ, u, ρ) C1 d |nb i| + E[|nb i|] C1 d |nb i| + σ . From the above equation, i=1 Zb i (w, κ, u, ρ) d2σn Zb i (w , κ , u, ρ) + C1 d|nb i| + C1 d2σn Zb i (w , κ , u, ρ) (b) 3 w w 2 d4σ2n2 1 |G| i=1 Zb i (w , κ , u, ρ) + 3d C2 1 |G| i=1 |nb i|2 ! + 3d C2 1n2σ2 (c) 3 w w 2 d4σ2n2 1 |G| i=1 Zb i (w , κ , u, ρ) + 3d C2 1n2σ2(d2 + 1), here (a) and (b) uses (Pt i=1 zi) t Pt i=1 z2 i and inequality (c) holds with with probability 1 1 d2 by Markov inequality, as Pr[ 1 n|G| P b G Pn i=1 |nb i|2 > d2 ED[(nb i)2)] 1 d2 . Recall that w w d2σn and κ 2d4σn2, therefore in the above equation, we can bound the first term on the right by Efficient List-Decodable Regression using Batches using high probability bound in Equation (46). Then, with probability 1 δe O(d log(C1dn) 1 i=1 Zb i (w, κ, u, ρ) 4σ2 + 2C E D[((w w ) xb i)2]) + 128n(κ ρ)2 max ( 2 ln(1/δ) + 3d C2 1n2σ2(d2 + 1) (a)= 12Cn E D[((w w ) xb i)2] + 15 w w 2 2d4σ2n2 nσ2 + 384n(κρ)2 max ( 2 ln(1/δ) + 3d C2 1n2σ2(d2 + 1) (b) 13Cn E D[((w w ) xb i)2] + 384n(κρ)2 max ( 2 ln(1/δ) here equality (a) uses the relation w w = w w w w d2σn and κ = d2σn w w κ, and (b) follows as C E D[((w w ) xb i)2] C w w 2 Σ C3 = C w w 2 C3 Cd4σ2n2/C3, where C3 is condition number of Σ, hence C E D[((w w ) xb i)2] 15 w w 2 2d4σ2n2 nσ2 and C E D[((w w ) xb i)2] C w w 2 C3 15 w w 2 2d4σ2n2 nσ2 and C E D[((w w ) xb i)2] 3d C2 1n2σ2(d2 + 1). The above bound holds for all unit vectors u, w / W and κ (σ + w w )d2n, i=1 Zb i (w, κ, u, ρ) 13Cn E D[((w w ) xb i)2] + 450n(κρ)2 max ( 2 ln(1/δ) Recall that bound in Equation (46) holds for all unit vectors u, w W and κ K with probability 1 δe O(d log(C1dn). 
Note that for w W, (σ + w w )d2n 2d4σn2, hence [0, (σ + w w )d2n] K . Hence the above bound holds for all unit vectors u, w W and κ (σ + w w )d2n. Combining the two bounds, with probability 1 δe O(d log(C1dn) 1 d2 , for all unit vectors u, all vectors w and κ (σ + w w )d2n, i=1 Zb i (w, κ, u, ρ) 2σ2n + 13Cn E D[((w w ) xb i)2] + 384n(κρ)2 max ( 2 ln(1/δ) This completes the proof of Equation (41).