# Linear Regression using Heterogeneous Data Batches

Ayush Jain (Granica Computing Inc., ayush.jain@granica.ai), Rajat Sen (Google Research, senrajat@google.com), Weihao Kong (Google Research, weihaokong@google.com), Abhimanyu Das (Google Research, abhidas@google.com), Alon Orlitsky (UC San Diego, alon@ucsd.edu)

(This work was done while the first author was a student at UCSD and interning part-time at Google Research. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).)

Abstract. In many learning applications, data are collected from multiple sources, each providing a batch of samples that by itself is insufficient to learn its input-output relationship. A common approach assumes that the sources fall into one of several unknown subgroups, each with an unknown input distribution and input-output relationship. We consider one of this setup's most fundamental and important manifestations, where the output is a noisy linear combination of the inputs and there are k subgroups, each with its own regression vector. Prior work [KSS+20] showed that with abundant small batches, the regression vectors can be learned with only a few, Ω(k^{3/2}), medium-size batches of Ω(√k) samples each. However, that paper requires the input distribution of all k subgroups to be isotropic Gaussian, and states that removing this assumption is "an interesting and challenging problem". We propose a novel gradient-based algorithm that improves on the existing results in several ways. It extends the applicability of the algorithm by: (1) allowing the subgroups' underlying input distributions to be different, unknown, and heavy-tailed; (2) recovering every subgroup followed by a significant proportion of the batches, even for infinite k; (3) removing the separation requirement between the regression vectors; (4) reducing the number of batches and allowing smaller batch sizes.

1 Introduction

In numerous applications, including federated learning [WCX+21], sensor networks [WZ89], crowd-sourcing [SVC16], and recommendation systems [WDVR06], data are collected from multiple sources, each providing a batch of samples. For instance, in movie recommendation systems, users typically rate multiple films. Since all samples in a batch are generated by the same source, they are often assumed to share the same underlying distribution. However, the batches are frequently very small; e.g., many users provide only a few ratings. Hence, it may be impossible to learn a different model for each batch. A common approach has therefore been to assume [TLW99] that all batches share the same underlying distribution and to learn this common model by pooling the data from all batches. While this may work well for some applications, in others it may fail or lack personalization: in recommendation systems, for instance, it may not capture the characteristics of individual users. A promising alternative that allows for personalization even with many small batches assumes that batches can be categorized into k sub-populations with similar underlying distributions. Hence, in each sub-population, all underlying distributions are close to, and can be represented by, a single distribution. Even when k is large, our work allows the recovery of models for sub-populations with a significant fraction of batches.
For example, in the recommendation setting, most users can be classified into a few sub-populations within which the users' distributions are close, for instance those preferring certain genres.

In this paper, we focus on the canonical supervised-learning model of linear regression. A distribution D of samples (x, y) follows a linear regression model if, for some regression vector w ∈ R^d, the output is y = w·x + η, where the input x is a random d-dimensional vector and η is zero-mean noise. While we allow the presence of sub-populations that do not follow the linear regression model or have very few batches, our goal is to recover the regression vectors of all large sub-populations that do follow it.

1.1 Our Results

This setting was first considered in [KSS+20] for meta-learning applications, where batches are viewed as, and termed, tasks. [KSS+20] argue that in meta-learning applications task (batch) lengths follow a long-tailed distribution: in the majority of batches only a few labeled examples are available, only a few batches have medium-size labeled samples, and almost all batches have length much smaller than d. Similar observations have been made in the recommender-system literature, where the distribution of the number of ratings per user is long-tailed, with an overwhelming number of users rating only a few items while rare tail users rate hundreds of items [GKG15]; the same holds for the distribution of the number of ratings per item [PT08]. It is therefore reasonable to assume that in these applications a handful of medium-size batches is available along with a large number of batches of constant size. Under this setting, our main results allow recovery of every sub-population that has a significant fraction of batches and follows a linear regression model.

Building on the preceding discussion, we have two sets of batches. Each batch comprises i.i.d. samples drawn from one of the k sub-populations. The sample distributions of these sub-populations are unknown, as is the identity of the sub-population from which each batch originates. Batches in the set B_s (for "small") contain at least two samples each, while those in the set B_m (for "medium") are larger, containing n_m samples. Consider a sub-population indexed by i that follows a linear regression model with parameter vector w_i and output-noise variance σ². The next theorem presents the guarantees our algorithm provides for estimating w_i.

Theorem 1.1 (Informal). Let α > 0 be such that both B_s and B_m comprise at least an α fraction of batches from sub-population i. Given |B_s| ≥ Ω(d/α²), |B_m| ≥ Ω(min(√k, 1/√α)/α), and n_m ≥ Ω(min(√k, 1/√α)), our algorithm runs in time poly(d, 1/α, k) and outputs a list L of size O(1/α) such that, w.h.p., at least one estimate in L is within distance o(σ) of w_i and has expected prediction error σ²(1 + o(1)) for sub-population i. Moreover, given Ω(log |L|) samples from sub-population i, we can identify such an estimate from L.

In the above theorem, the batches within B_s must contain at least 2 samples, while those within B_m must have size n_m. Additionally, both sets must include an α fraction of batches originating from the sub-population we want to recover. It follows that our algorithm requires only α(2|B_s| + n_m|B_m|) = Ω(d/α) samples from that specific sub-population.
It is worth noting that any algorithm, even when dealing with a single sub-population (k = 1), requires at least Ω(d) samples from that sub-population. To the best of our knowledge, ours is the best sample complexity for recovering linear regression models in the presence of multiple sub-populations using batch sizes smaller than d.

Remark 1.1. When multiple sub-populations meet the criteria outlined for sub-population i, the list L returned by our algorithm includes estimates of the regression vector of each of them. Furthermore, since there can be up to min{k, 1/α} distinct sub-populations satisfying these criteria, any algorithm that aims to provide meaningful guarantees must produce a list of size at least min{k, 1/α}, unless it has access to a sample set known to originate from the specific sub-population i it aims to recover. A similar situation arises in the list-decodable setup [CSV17], where the algorithm also returns a list of estimates.

1.2 Comparison to Prior Work

The only work that provides a polynomial-time algorithm in the dimension d, in the same generality as ours, is [DJKS22]. They even allow the presence of adversarial batches. However, they require Ω(d/α²) batches from the sub-population, each of size Ω(1/α), and therefore Ω(d/α³) samples in total, which exceeds our sample complexity by a factor of 1/α². Furthermore, their algorithm requires a minimum number of samples per batch that is quadratically larger than ours. All other works place strong assumptions on the distributions of the sub-populations and still require many more samples than ours, as we discuss next.

Most previous works [CL13, SJA16, ZJD16, YCS16, LL18, CLS20, DK20, PMSG22] address the widely studied mixed linear regression (MLR) model, where all batches have size 1, and adhere to the following three assumptions:
1. All k sub-populations have an α fraction of the data. This assumption implies k ≤ 1/α.
2. All k distributions follow a linear regression model.
3. All k regression vectors are well separated, namely ‖w_i − w_j‖ ≥ ∆ for all i ≠ j.
Even for k = 2, solving MLR in general is NP-hard [YCS14]. Hence all these works on mixed linear regression, except [LL18], also made the following assumption:
4. All input distributions (i.e., the distributions over x) are the same for every sub-population; in fact, the same isotropic Gaussian distribution.
With this additional isotropic Gaussian assumption, they provide algorithms whose runtime and sample complexity are polynomial in the dimension. However, even with these four strong assumptions, their sample complexity is super-polynomial overall. In particular, the sample complexity in [ZJD16, CLS20, DK20] is quasi-polynomial in k, and [CL13, SJA16, DK20] require at least quadratic scaling in d. In [CL13, SJA16, YCS16] the sample complexity scales as a large negative power of the minimum singular value of a certain moment matrix of the regression vectors, which can be zero even when the gap between the regression vectors is large. In addition, [ZJD16, YCS16, CLS20] require zero noise, i.e., η = 0. The only work we are aware of that avoids Assumption 4 and handles different input distributions for different sub-populations under MLR is [LL18]. However, it still requires all distributions to be Gaussian and η = 0, and its sample size, and hence run-time, is exponential in k, namely Ω(exp(k²)). Recently, [TV23] proposed and analyzed an approximate message-passing algorithm for mixed linear regression.
However, their analysis is asymptotic, providing convergence guarantees as the number of samples and the dimension tend to infinity while the number of components stays finite. Their results also rely on Assumptions 2 and 4.

The work most closely related to ours is [KSS+20], which considers batch sizes > 1. While it achieves the same dependence as ours on d, k, and 1/α in the length and number of medium and small batches, its sample complexity and the minimum length it requires for medium-size batches carry an additional multiplicative dependence on the inverse separation parameter 1/∆. It also requires Assumption 4 above. The follow-up work [KSKO20], which still makes all four assumptions, can handle a small, α²/k², fraction of adversarial batches, but requires Ω(dk²/α² + k⁵/α⁴) samples. It suffers from strong assumptions similar to those of the earlier work, and its sum-of-squares approach makes it impractical. The sum-of-squares approach and the stronger isotropic Gaussian assumption let it achieve a better dependence on 1/α in the medium-size batch length, at the cost of a significant increase in the number of medium-size batches required.

Our improvement over prior work. In contrast, our work avoids all four assumptions and can recover any sufficiently large sub-population that follows a linear regression model. In particular:
(1) Even when a large number of different sub-populations is present (e.g., k ≫ 1/α), we can still recover the regression vector of any sub-population with a sufficient fraction of batches.
(2) The k distributions do not all need to follow a linear regression model: our algorithm is robust to the presence of other sub-populations for which the conditional distribution of the output given the input is arbitrary.
(3) We require no assumption on the separation of the regression vectors, and our guarantees have no dependence on the separation.
(4) We allow different input distributions for different sub-populations, which can be unknown and heavy-tailed.
(5) In addition to removing the four assumptions, the algorithm does not require all batches in a sub-population to have identical distributions; it only requires them to be close (see Remark 2.1).

1.3 Techniques and Organization

To recover the regression vector of a sub-population i that follows the linear regression model and has more than an α fraction of batches, the algorithm begins by selecting a medium-size batch from this sub-population. To obtain such a batch, we randomly choose Θ(1/α) batches from the set of medium-size batches B_m and run our algorithm once for each chosen batch, obtaining one estimate per batch. With high probability, at least one of the randomly selected medium-size batches belongs to sub-population i, and when the algorithm is run with such a batch, it w.h.p. recovers an estimate of w_i. Therefore, at least one of the Θ(1/α) estimates accurately recovers w_i.

The regression vector w_i minimizes the expected squared loss for sub-population i. We therefore use a gradient-descent-based approach to estimate w_i. We start with an initial estimate (all zeros) and improve it over multiple rounds of gradient descent steps. First, using a large number of small batches, we estimate a lower-dimensional subspace of R^d that preserves the norm of the gradient for sub-population i.
Next, using the medium-size batch selected at the beginning, we test which of the other medium-size batches have a projected gradient close to that of the chosen batch, and use them to estimate the gradient in the smaller subspace. The advantage of the subspace reduction is that testing and estimating the gradient in the smaller subspace is easier, which reduces both the minimum length of the medium-size batches required for testing and the number of medium-size batches required for estimation. Our gradient estimation approach draws inspiration from [KSS+20]. However, while they employed it to directly estimate the regression vectors of all sub-populations simultaneously, we adapt it to sequentially estimate gradients for one sub-population at a time. A crucial new ingredient of our algorithm is the clipping of gradients, which limits the effect of the other components and allows the algorithm to work even for heavy-tailed distributions.

We describe the algorithm in detail in Section 3, after presenting our main theorems in Section 2. In Section 4 we compare our algorithm with the one in [KSS+20] on simulated datasets, showing that our algorithm performs better in the setting of that paper and also generalizes to settings outside its assumptions.

2 Problem Formulation and Main Results

Consider distributions D_0, …, D_{k−1} over input-output pairs (x, y) ∈ R^d × R, each corresponding to one of the k sub-populations. A batch b consists of i.i.d. samples from one of the distributions, and samples in different batches are independent. There are two sets of batches: the batches in B_s are small and contain at least two samples each, while the batches in B_m are of medium size and contain at least n_m samples. For any batch, the identity of the distribution it is sampled from is unknown.

Next, we describe the distributions. To aid this and the rest of the paper, we first introduce some notation. The L2 norm of a vector u is denoted ‖u‖. The norm, or spectral norm, of a matrix M is denoted ‖M‖ and defined as max_{‖u‖=1} ‖Mu‖; if M is symmetric, it simplifies to ‖M‖ = max_{‖u‖=1} |u⊤Mu|. The trace of a symmetric matrix M is Tr(M) := Σ_i M_ii, the sum of the elements on its main diagonal. We use the symbol S to denote an arbitrary collection of samples, and for a batch b we use S_b to denote the set of all samples in the batch.

2.1 Data Distributions

Let Σ_i := E_{D_i}[xx⊤] denote the second-moment matrix of the input for distribution D_i. To recover the regression vector of a distribution D_i, we assume it satisfies the following assumptions. (These assumptions are standard in heavy-tailed linear regression, where some condition on the anti-concentration of E[xx⊤] is necessary to achieve finite sample complexity [LM16, Oli16, CAT+20], even in the special case of a single sub-population.)

1. (Input distribution) There are constants C and C_1 such that for all i ∈ I,
(a) L4-L2 hypercontractivity: for all u ∈ R^d, E_{D_i}[(x·u)⁴] ≤ C (E_{D_i}[(x·u)²])².
(b) Bounded condition number: for normalization purposes we assume min_{‖u‖=1} u⊤Σ_i u ≥ 1, and to bound the condition number we assume ‖Σ_i‖ ≤ C_1.

2. (Input-output relation) There is a σ > 0 such that for all i ∈ I, y = w_i·x + η, where w_i ∈ R^d is an unknown regression vector and η is noise independent of x with zero mean E_{D_i}[η] = 0 and E_{D_i}[η²] ≤ σ². Note that, by definition, the distribution of η may differ for each i.
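To make the data model concrete, the following minimal sketch (not from the paper; all function names and constants are illustrative) generates k sub-populations satisfying assumptions 1(b) and 2, with Gaussian inputs for simplicity (the paper allows heavier-tailed, bounded inputs), and produces unlabeled small and medium batches:

```python
# A minimal sketch of the data model in Section 2: k sub-populations, each with its own
# covariance Sigma_i and regression vector w_i, producing small batches (2 samples) and
# medium batches (n_m samples). Names and constants here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_subpopulation(d, cond_number, noise_std):
    """Random covariance with eigenvalues in [1, cond_number], so min eigenvalue >= 1 and
    ||Sigma|| <= C_1 as in assumption 1(b), plus a random regression vector."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    eigs = rng.uniform(1.0, cond_number, size=d)
    Sigma = (Q * eigs) @ Q.T
    w = rng.standard_normal(d)
    return Sigma, w, noise_std

def sample_batch(Sigma, w, noise_std, n):
    """n i.i.d. samples (x, y) with y = w.x + eta, eta zero-mean symmetric noise."""
    x = rng.multivariate_normal(np.zeros(len(w)), Sigma, size=n)
    eta = noise_std * rng.standard_normal(n)
    return x, x @ w + eta

d, k = 20, 8
pops = [make_subpopulation(d, cond_number=4.0, noise_std=1.0) for _ in range(k)]

# Small batches (2 samples) and medium batches (n_m = 16 samples); labels stay hidden.
B_s = [sample_batch(*pops[rng.integers(k)], n=2) for _ in range(2000)]
B_m = [sample_batch(*pops[rng.integers(k)], n=16) for _ in range(128)]
print(len(B_s), len(B_m))
```

Only the batches are observed by a learner; the sub-population labels and parameters remain hidden, matching the setting above.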
For given α_s > 0 and α_m > 0, our goal is to estimate the regression vector w_i of every sub-population that has at least an α_s fraction of the batches in B_s and an α_m fraction of the batches in B_m, and that satisfies the assumptions described above. Let I ⊆ {0, 1, …, k−1} denote the set of indices of all such sub-populations. (In the introduction we simplified this more general scenario by taking α_s = α_m = α.) For the other sub-populations i ∉ I, we only need the input distribution to satisfy ‖Σ_i‖ ≤ C_1, the same as the second half of assumption 1(b). The input-output relation of samples generated by D_i for i ∉ I can be arbitrary and, in particular, need not follow a linear regression model. Moreover, the fraction of batches from these other sub-populations can also be arbitrary.

To simplify the presentation, we make two additional assumptions. First, there is a constant C_2 > 0 such that for all components i ∈ {0, 1, …, k−1} and a random sample (x, y) ∼ D_i, ‖x‖ ≤ C_2√d almost surely. Second, for all i ∈ I and a random sample (x, y) ∼ D_i, the noise η = y − w_i·x is distributed symmetrically around 0. As discussed in Appendix J, these assumptions can be easily removed.

Remark 2.1. For simplicity, we assumed that each batch exactly follows one of the k distributions. However, our methods extend to more general scenarios. Let D_b denote the underlying distribution of batch b. Instead of requiring D_b = D_i for some i ∈ {0, 1, …, k−1}, our approach extends to cases where the expected gradients of D_b and D_i are close; additionally, if i ∈ I, the regression vector w_i must achieve a mean squared error (MSE) of at most σ² for D_b. These conditions are satisfied if: (1) ‖E_{D_b}[xx⊤] − Σ_i‖ is small; (2) for all x ∈ R^d, |E_{D_b}[y|x] − E_{D_i}[y|x]| is small; and (3) if i ∈ I, then for all x ∈ R^d, E_{D_b}[(y − w_i·x)²|x] ≤ σ². The strict identity requirement D_b = D_i can thus be replaced by these three approximation conditions in our analysis. This extension is feasible because our algorithm and its analysis are based on gradient descent and can handle stochastic noise; since the analysis tolerates statistical noise, it also accommodates small deviations in the gradient distributions across batches of the same sub-population.

2.2 Main Results

We begin with our result for estimating the regression vector of a component D_i for any i ∈ I. This result assumes that, in addition to the batch collections B_s and B_m, we have an extra medium-size batch b* containing samples from D_i; w.l.o.g. we assume i = 0.

Theorem 2.1. Suppose index 0 is in the set I and let b* be a batch of n_m i.i.d. samples from D_0. Given δ, ϵ ∈ (0, 1], |B_s| = Ω(d/(α_s²ϵ⁴)), n_m = Ω(min{√k, 1/(ϵ√α_s)}·(1/ϵ²)), and |B_m| = Ω((1/α_m)·min{√k, 1/(ϵ√α_s)}), Algorithm 1 runs in polynomial time and returns an estimate ŵ such that, with probability ≥ 1 − δ, ‖ŵ − w_0‖ ≤ ϵσ.

We provide a proof sketch of Theorem 2.1 and the description of Algorithm 1 in Section 3, and a formal proof in Appendix H. Algorithm 1 can be used to estimate w_i for all i ∈ I, and the requirement of a separate batch b* is not crucial. As mentioned in the introduction, such a batch can be obtained by repeatedly sampling a batch from B_m and running the algorithm for each sampled b*. Since every component in I has an α_m fraction of the batches in B_m, sampling b* from B_m Θ(1/α_m) times ensures that, with high probability, we have a b* corresponding to each component.
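The claim that sampling b* from B_m Θ(1/α_m) times suffices can be made quantitative with a standard union-bound calculation; a short derivation under the stated assumptions (T draws uniform with replacement, each i ∈ I holding at least an α_m fraction of B_m):

```latex
\Pr[\text{no draw comes from } D_i] \le (1-\alpha_m)^T \le e^{-\alpha_m T},
\qquad
\Pr[\text{some } i \in I \text{ is missed}] \le |I|\, e^{-\alpha_m T} \le \delta
\quad\text{whenever}\quad
T \ge \frac{1}{\alpha_m}\log\frac{|I|}{\delta}.
```

Since |I| ≤ min{k, 1/α_m}, it follows that T = Θ((1/α_m) log(1/(α_m δ))) draws cover every recoverable component with probability ≥ 1 − δ.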
We can then return a list of size Θ(1/α_m) containing the estimates corresponding to each sampled b*. With high probability, this list contains an estimate of the regression vector of every component in I. Note that returning a list is unavoidable here, as there is no way to assign an appropriate index to each regression vector estimate. The following corollary follows from the above discussion and Theorem 2.1.

Corollary 2.2. For given δ, ϵ ∈ (0, 1], if |B_s|, n_m, and |B_m| meet the requirements of Theorem 2.1, then the above modification of Algorithm 1 runs in polynomial time and outputs a list L of size O(1/α_m) such that, with probability ≥ 1 − δ, the list contains an accurate estimate of the regression vector w_i for each i ∈ I, namely max_{i∈I} min_{ŵ∈L} ‖ŵ − w_i‖ ≤ ϵσ.

In particular, this corollary implies that for any i ∈ I, the algorithm requires only Ω(d/α_s) batches of size two and Ω(min{√k, 1/√α_s}) medium-size batches of size Ω(min{√k, 1/√α_s}) from distribution D_i to estimate w_i within accuracy o(σ). Furthermore, it is easy to show that any o(σ)-accurate estimate of the regression parameter w_i achieves an expected prediction error of σ²(1 + o(1)) for the output y given an input x generated from D_i. Note that these results hold even for infinite k and without any separation assumption on the regression vectors. The min(√k, 1/√α_s) dependence is the best of both worlds: it is reasonable both for recovering components with a significant presence and when the number of components is small. The total numbers of samples the algorithm requires from D_i in the small batches B_s and the medium batches B_m are only O(d/α_s) and O(min{k, 1/α_s}), respectively. It is well known that any estimator requires Ω(d) samples for such estimation guarantees, even in the much simpler setting with just i.i.d. data. Consequently, the total number of samples the algorithm requires from D_i is within an O(1/α_s) factor of that required in the much simpler single-component setting.

The next theorem shows that for any i ∈ I, given a list L containing an estimate of w_i and Ω(log(1/α_s)) samples from D_i, we can identify an estimate of the regression vector achieving a small prediction error for D_i. The proof of the theorem and the algorithm are in Appendix B.

Theorem 2.3. Let index i be in the set I and let L be a list of d-dimensional vectors. Suppose that for a given β > 0 at least one of the vectors in L is within distance β of the regression vector of D_i, namely min_{w∈L} ‖w − w_i‖ ≤ β. Given O(max{1, σ²/(C_1β²)} log(|L|/δ)) samples from D_i, Algorithm 3 identifies an estimate ŵ from the list L such that, with probability ≥ 1 − δ, ‖ŵ − w_i‖ = O(β), and it achieves an expected estimation error E_{D_i}[(ŵ·x − y)²] ≤ σ² + O(β²).

Remark 2.2. As a special case, the above theorem implies that for any β ≥ Ω(σ), Algorithm 3 can identify a near-best estimate of w_i from the list L using only O(log(|L|/δ)) samples from D_i.

Combining the above theorem and Corollary 2.2, we get:

Theorem 2.4. For given δ, ϵ ∈ (0, 1], if |B_s|, n_m, and |B_m| meet the requirements of Theorem 2.1, then there exists a polynomial-time algorithm that, with probability ≥ 1 − δ, outputs a list L of size O(1/α_m) containing estimates of w_i for all i ∈ I. Furthermore, given |S| ≥ Ω((1/ϵ²) log(1/(δα_m))) samples from D_i, for any i ∈ I, Algorithm 3 returns a ŵ ∈ L that with probability ≥ 1 − δ satisfies ‖w_i − ŵ‖ ≤ O(ϵσ) and achieves an expected estimation error E_{D_i}[(ŵ·x − y)²] ≤ σ² + O(ϵ²σ²).
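For completeness, here is the short calculation behind the claim (made after Corollary 2.2) that an o(σ)-accurate estimate yields prediction error σ²(1 + o(1)); it uses only the assumptions of Section 2.1, with the cross term vanishing because η has zero mean and is independent of x:

```latex
\mathbb{E}_{D_i}\big[(\hat w\cdot x - y)^2\big]
= \mathbb{E}_{D_i}\big[\big((\hat w - w_i)\cdot x - \eta\big)^2\big]
= (\hat w - w_i)^\top \Sigma_i (\hat w - w_i) + \mathbb{E}_{D_i}[\eta^2]
\le C_1\|\hat w - w_i\|^2 + \sigma^2 .
```

Hence ‖ŵ − w_i‖ ≤ ϵσ gives expected prediction error at most (1 + C_1ϵ²)σ² = σ²(1 + o(1)) whenever ϵ = o(1).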
3 Algorithm for recovering regression vectors

This section provides an overview and pseudo-code of Algorithm 1, along with an outline of the proof of the guarantee stated in Theorem 2.1. As per the theorem, we assume that index 0 belongs to I and that we have a batch b* containing n_m samples from the distribution D_0. Note that D_0 satisfies the conditions of Section 2.1 and that B_s and B_m contain at least |B_s|α_s and |B_m|α_m batches, respectively, with i.i.d. samples from D_0; however, the identity of these batches is unknown.

Gradient Descent. Note that w_0 minimizes the expected squared loss for distribution D_0. Our algorithm estimates w_0 by a gradient descent approach, performing a total of R gradient descent steps. Let ŵ^(r) denote the algorithm's estimate of w_0 at the beginning of step r; without loss of generality, the algorithm starts with the initial estimate ŵ^(1) = 0. In step r, the algorithm produces an estimate ∆^(r) of the gradient of the expected squared loss of distribution D_0 at its current estimate ŵ^(r); for convenience, we refer to this expected gradient of D_0 at ŵ^(r) simply as the expected gradient. The algorithm then updates its estimate for the next round as ŵ^(r+1) = ŵ^(r) − ∆^(r)/C_1.

The main challenge the algorithm faces is the accurate estimation of the expected gradients in each step. Accurately estimating the expected gradient at each step would require Ω(d/ϵ²) i.i.d. samples from D_0. However, the only batch guaranteed to contain samples from D_0 is the medium-size batch b*, which has far fewer samples, and for the batches in B_s and B_m the algorithm does not know which of them contain samples from D_0. Despite these challenges, we demonstrate an efficient method to estimate the expected gradients accurately.

Algorithm 1 MAINALGORITHM
1: Input: collections of batches B_s and B_m, α_s, k, a medium-size batch b* of i.i.d. samples from D_0, distribution parameters (upper bounds) σ, C, C_1, an upper bound M on ‖w_0‖, ϵ and δ.
2: Output: estimate of w_0.
3: ϵ_1 ← Θ(1), ϵ_2 ← Θ((ϵ_1 + 1/C_1)/(C_1√(C+1))), ℓ ← min{k, 1/(2α_s ϵ_2²)}, δ′ ← δ/(5R), and R ← Θ(C_1 log(M/(ϵσ))).
4: Partition the collection of batches B_s into R disjoint, equal-size random parts {B_s^(r)}_{r∈[R]}.
5: Similarly, partition B_m into R disjoint, equal-size random parts {B_m^(r)}_{r∈[R]}.
6: Divide the samples S_{b*} into 2R disjoint, equal-size random parts {S_1^{b*,(r)}}_{r∈[R]} and {S_2^{b*,(r)}}_{r∈[R]}.
7: Initialize ŵ^(1) ← 0.
8: for r from 1 to R do
9:   κ^(r) ← CLIPEST(S_1^{b*,(r)}, ŵ^(r), ϵ_1, δ′, σ, C, C_1)
10:  P^(r) ← GRADSUBEST(B_s^(r), κ^(r), ŵ^(r), ℓ)
11:  ∆^(r) ← GRADEST(B_m^(r), S_2^{b*,(r)}, κ^(r), ŵ^(r), P^(r), ϵ_2, δ′)
12:  ŵ^(r+1) ← ŵ^(r) − ∆^(r)/C_1
13: end for
14: ŵ ← ŵ^(R+1) and return ŵ.

The algorithm randomly divides the sets B_s and B_m into R disjoint, equal-size subsets, denoted {B_s^(r)}_{r=1}^R and {B_m^(r)}_{r=1}^R, respectively. The samples in batch b* are divided into two collections of R disjoint, equal-size subsets each, denoted {S_1^{b*,(r)}}_{r=1}^R and {S_2^{b*,(r)}}_{r=1}^R. At each iteration r, the algorithm uses the collections of small and medium batches B_s^(r) and B_m^(r), respectively, along with the two collections of i.i.d. samples S_1^{b*,(r)} and S_2^{b*,(r)} from D_0, to estimate the gradient at the point ŵ^(r). While this division may not be necessary in a practical implementation, as our simulation results later suggest, it ensures independence between the point ŵ^(r) and the gradient estimate, which facilitates our theoretical analysis and costs only a logarithmic factor in sample complexity.

Next, we describe how the algorithm estimates the gradient and the guarantees of this estimation. Due to space limitations, we provide a brief summary here; a more detailed description, along with formal proofs, can be found in the appendix.
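As a reading aid, the sketch below (our own simplification, not the paper's implementation) shows only the descent skeleton of Algorithm 1: clipped gradients computed from a single batch assumed to come from D_0, a clipping level chosen in the spirit of CLIPEST and Theorem C.2, and the update ŵ^(r+1) = ŵ^(r) − ∆^(r)/C_1. The subspace-reduction and batch-selection subroutines, which are the substance of the algorithm, are omitted here and sketched after their descriptions below; all function names and constants are illustrative.

```python
# Minimal sketch (assumptions: a single trusted batch from D_0, no subspace reduction,
# no batch filtering). Illustrates only the clipping + gradient-descent skeleton of
# Algorithm 1; helper names are illustrative, not from the paper.
import numpy as np

def clipped_grad(X, y, w, kappa):
    """Average clipped gradient (1/|S|) * sum of clip(x.w - y; kappa) * x."""
    r = np.clip(X @ w - y, -kappa, kappa)
    return (r[:, None] * X).mean(axis=0)

def estimate_kappa(X, y, w, eps=0.5, sigma=1.0, C=3.0, C1=4.0):
    """CLIPEST-style choice: kappa^2 ~ 8*C*C1*E[(y - x.w)^2]/eps, estimated from samples."""
    mse = np.mean((y - X @ w) ** 2) + sigma ** 2
    return np.sqrt(8.0 * C * C1 * mse / eps)

def descend(X, y, C1=4.0, rounds=50, **kw):
    w = np.zeros(X.shape[1])                    # initial estimate w^(1) = 0
    for _ in range(rounds):
        kappa = estimate_kappa(X, y, w, **kw)   # clipping level for this round
        grad = clipped_grad(X, y, w, kappa)     # estimate of E[(x.w - y) x] for D_0
        w = w - grad / C1                       # step size 1/C_1, as in Algorithm 1
    return w

# toy usage
rng = np.random.default_rng(1)
d, n = 10, 400
w0 = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w0 + rng.standard_normal(n)
print(np.linalg.norm(descend(X, y) - w0))
```

With identity input covariance the update contracts the error by a factor of roughly 1 − 1/C_1 per round, which is why a number of rounds logarithmic in M/(ϵσ) suffices in Algorithm 1.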
We start by introducing a clipping operation on the gradients, which plays a crucial role in the estimation process.

Clipping. Recall that the squared loss of a sample (x, y) at a point w is (x·w − y)²/2 and its gradient is (x·w − y)x. Instead of working directly with the gradient of the squared loss, we work with its clipped version. Given a clipping parameter κ > 0, the clipped gradient of a sample (x, y) evaluated at a point w is defined as

f̃(x, y, w, κ) := ((x·w − y)/max{|x·w − y|, κ}) · κx,

that is, the residual x·w − y is clipped to the range [−κ, κ] before being multiplied by x. For a collection of samples S, the clipped gradient f̃(S, w, κ) is the average of the clipped gradients of all samples in S, i.e., f̃(S, w, κ) := (1/|S|) Σ_{(x,y)∈S} f̃(x, y, w, κ). The clipping parameter κ controls the level of clipping, and for κ = ∞ the clipped and unclipped gradients coincide. The clipping step makes our gradient estimate more robust by limiting the influence of the components other than D_0 (Lemma C.1), and, as a bonus, we also obtain better tail bounds for the clipped gradients. Theorem C.2 shows that for κ ≥ Ω(√(E_{D_0}[(y − x·w)²])), the difference between the expected clipped gradient and the expected gradient, ‖E_{D_0}[f̃(x, y, w, κ)] − E_{D_0}[(x·w − y)x]‖, is small. Therefore, the ideal value of κ at a point w is Θ(√(E_{D_0}[(y − x·w)²])).

For the estimate ŵ^(r) at step r, the choice of the clipping parameter is denoted κ^(r). To obtain a value of κ^(r) close to its ideal value, the algorithm employs the subroutine CLIPEST (presented as Algorithm 4 in the appendix), which estimates the expected value of (y − x·ŵ^(r))² using the i.i.d. samples S_1^{b*,(r)} from D_0. By Theorem D.1 in Appendix D, the subroutine w.h.p. obtains a κ^(r) close to the ideal value. This ensures that the difference between the expectations of the clipped and unclipped gradients is small, and thus estimating the expectation of the clipped gradient can replace estimating the actual gradient.

Subspace Estimation. The algorithm proceeds by calling the subroutine GRADSUBEST (presented as Algorithm 5 in Appendix E) with B̂ = B_s^(r), w = ŵ^(r), and κ = κ^(r) to estimate a smaller subspace P^(r) of R^d. The objective of this subroutine is to ensure that, with high probability, the expected clipped gradient of distribution D_0 at the point ŵ^(r) remains largely unchanged when projected onto the identified subspace. Estimating the expected clipped gradient then reduces to estimating its projection onto P^(r), which requires fewer samples due to the lower dimensionality of the subspace. To achieve this, the subroutine constructs a matrix A with E[A] = Σ_i p_i E_{D_i}[f̃(x, y, w, κ)] E_{D_i}[f̃(x, y, w, κ)]⊤, where p_i denotes the fraction of batches in B̂ that have samples from D_i. Since the B_s^(r) are obtained by randomly partitioning B_s, w.h.p. p_0 = Ω(α_s). It is crucial for the success of the subroutine that the expected contribution of every batch to the above expression is a PSD matrix; the clipping helps bound the contribution of the other components and of the statistical noise. The subroutine returns the projection matrix P^(r) onto the subspace spanned by the top ℓ singular vectors of A, where we choose ℓ = min{k, Θ(1/α_s)}. It is worth noting that when 1/α_s is much smaller than k (think of the extreme case k = ∞), our algorithm still only requires estimating the top Θ(1/α_s)-dimensional subspace, since those infinitely many components can create at most 1/α_s − 1 directions with weight greater than α_s. This, along with clipping, ensures that the direction of D_0 must appear in the top-Θ(1/α_s) subspace (see Lemma E.3).
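The following sketch illustrates the subspace-estimation idea. The paper's exact construction of A is given in its Appendix E, so the particular estimator below (a symmetrized outer product of the two samples' clipped gradients in each 2-sample batch, which has expectation Σ_i p_i E_{D_i}[f̃] E_{D_i}[f̃]⊤) is an assumption on our part, chosen to match the stated E[A]:

```python
# Sketch of the GRADSUBEST idea (an assumed variant; the paper's exact construction is in
# its appendix). For a 2-sample batch from sub-population i, the symmetrized outer product
# of the two samples' clipped gradients has expectation g_i g_i^T with g_i = E_{D_i}[f~],
# so averaging over batches gives E[A] = sum_i p_i g_i g_i^T. The top-ell eigenvectors of A
# then span a subspace intended to nearly preserve the D_0 gradient.
import numpy as np

def clipped_grad_single(x, y, w, kappa):
    r = float(x @ w - y)
    return np.clip(r, -kappa, kappa) * x

def grad_subspace(small_batches, w, kappa, ell):
    """small_batches: list of (X, y) with X of shape (2, d). Returns a projection matrix P."""
    d = small_batches[0][0].shape[1]
    A = np.zeros((d, d))
    for X, y in small_batches:
        g1 = clipped_grad_single(X[0], y[0], w, kappa)
        g2 = clipped_grad_single(X[1], y[1], w, kappa)
        A += 0.5 * (np.outer(g1, g2) + np.outer(g2, g1))   # symmetrized outer product
    A /= len(small_batches)
    eigvals, eigvecs = np.linalg.eigh(A)                    # ascending eigenvalues
    V = eigvecs[:, np.argsort(eigvals)[-ell:]]              # top-ell eigenvectors
    return V @ V.T                                          # projection onto their span

# toy usage: two sub-populations; check how much of the D_0 gradient the projection keeps
rng = np.random.default_rng(2)
d, ell = 30, 2
w0, w1 = rng.standard_normal(d), rng.standard_normal(d)

def batch(w_true):
    X = rng.standard_normal((2, d))
    return X, X @ w_true + rng.standard_normal(2)

batches = [batch(w0) if rng.random() < 0.5 else batch(w1) for _ in range(4000)]
w = np.zeros(d)
P = grad_subspace(batches, w, kappa=50.0, ell=ell)
g0 = -w0   # expected gradient at w = 0 for D_0 with identity covariance is Sigma(w - w0) = -w0
print(np.linalg.norm(P @ g0 - g0), np.linalg.norm(g0))
```

The printed numbers compare the projection error ‖P g_0 − g_0‖ with ‖g_0‖; the former should be comparatively small, which is the property GRADSUBEST needs.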
Theorem E.1 in Appendix E characterizes the guarantees of this subroutine. Informally, if |B̂| ≥ Ω(d/α_s²), then w.h.p. the expected projection of the clipped gradient onto this subspace is nearly the same as the expected clipped gradient, namely ‖E_{D_0}[P^(r) f̃(x, y, w, κ)] − E_{D_0}[f̃(x, y, w, κ)]‖ is small. We note that our construction of the matrix A for the subroutine is inspired by a similar construction in [KSS+20], where it was used to directly estimate the regression vectors. Our results generalize the applicability of the procedure to provide meaningful guarantees even when the number of components is k = ∞. Additionally, Lemma E.3 improves the matrix perturbation bounds of Lemma 5.1 in [KSS+20], which is crucial for applying the procedure to heavy-tailed distributions and for reducing the number of required batches.

Estimating the expectation of the clipped gradient projection. The final subroutine, GRADEST, estimates the expected projection of the clipped gradient. It does so by utilizing two key sets: B_m^(r), one of the R parts of the medium-size batches, and S_2^{b*,(r)}, one of the 2R parts of the batch b*, which contains i.i.d. samples from the reference distribution D_0.

Algorithm 2 GRADEST
1: Input: a collection of medium batches B̂, a collection of samples S from D_0, κ, w, a projection matrix P for a subspace of R^d, parameters ϵ, δ.
2: Output: an estimate of the projected clipped gradient at the point w.
3: T_1 ← Θ(log(|B̂|/δ)) and T_2 ← Θ(log(1/δ)).
4: For each b, divide S_b into two equal random parts S_1^b and S_2^b.
5: For each b, further divide S_1^b into 2T_1 equal random parts, denoted {S_{1,j}^b}_{j∈[2T_1]}.
6: Divide S into 2T_1 equal random parts, denoted {S′_j}_{j∈[2T_1]}.
7: ζ_j^b := (f̃(S_{1,j}^b, w, κ) − f̃(S′_j, w, κ))⊤ P (f̃(S_{1,T_1+j}^b, w, κ) − f̃(S′_{T_1+j}, w, κ)).
8: Let B̃ ← {b ∈ B̂ : median{ζ_j^b : j ∈ [T_1]} ≤ ϵ²κ²C_1}.
9: For each b ∈ B̃, divide S_2^b into T_2 equal random parts, denoted {S_{2,j}^b}_{j∈[T_2]}.
10: For i ∈ [T_2], let ∆_i ← (1/|B̃|) Σ_{b∈B̃} P f̃(S_{2,i}^b, w, κ).
11: Let ξ_i ← median{‖∆_i − ∆_j‖ : j ∈ [T_2]}.
12: Let i* ← arg min_{i∈[T_2]} ξ_i and ∆ ← ∆_{i*}.
13: Return ∆.

GRADEST begins by splitting each batch in B_m^(r) into two equal parts. For each batch b ∈ B_m^(r), the first part of the split is used, along with the samples from D_0 in S_2^{b*,(r)}, to test whether the expected projection of the clipped gradient of the distribution that generated batch b is close to that of the reference distribution D_0. With high probability, the algorithm retains the batches sampled from D_0 and rejects batches from other distributions for which the difference in expected projections is significant. This test requires Ω(√ℓ) samples in each batch, where ℓ is the dimension of the projected clipped gradient, i.e., the dimension of the subspace estimated in the subspace-estimation step. After identifying the relevant batches, GRADEST estimates the projection of the clipped gradient using the second half of the samples in these batches. Since the projections of the clipped gradients lie in an ℓ-dimensional subspace, Ω(ℓ) samples suffice for the estimation. To obtain high-probability guarantees, the procedure uses the median-of-means approach for both testing and estimation. The guarantees of the subroutine are described in Theorem F.1, which implies that the estimate ∆^(r) of the gradient satisfies that ‖∆^(r) − E_{D_0}[P^(r) f̃(x, y, w, κ)]‖ is small.
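The sketch below mirrors the filter-then-estimate structure of GRADEST in a simplified form: a single split per batch instead of the median over T_1 sub-splits, an illustrative acceptance threshold in place of ϵ²κ²C_1, and a basic median-of-means selection. It is not the paper's exact procedure; names and constants are ours.

```python
# Simplified sketch of the GRADEST filter-and-estimate step (thresholds and split counts
# are illustrative assumptions, not the paper's constants). Each medium batch is compared
# to the reference batch b* through its projected clipped gradient; retained batches are
# then combined with a median-of-means estimate.
import numpy as np

def clipped_grad(X, y, w, kappa):
    r = np.clip(X @ w - y, -kappa, kappa)
    return (r[:, None] * X).mean(axis=0)

def grad_est(medium_batches, ref_batch, P, w, kappa, threshold, n_means=5):
    """medium_batches: list of (X, y); ref_batch: (X, y) from D_0; P: projection matrix."""
    Xr, yr = ref_batch
    half = len(yr) // 2
    g_ref = P @ clipped_grad(Xr[:half], yr[:half], w, kappa)
    kept = []
    for X, y in medium_batches:
        h = len(y) // 2
        g_test = P @ clipped_grad(X[:h], y[:h], w, kappa)     # first half: acceptance test
        if np.dot(g_test - g_ref, g_test - g_ref) <= threshold:
            kept.append((X[h:], y[h:]))                       # second half: estimation
    if not kept:
        return g_ref
    # median-of-means over the retained batches' projected clipped gradients
    gs = np.array([P @ clipped_grad(X, y, w, kappa) for X, y in kept])
    groups = np.array_split(gs, min(n_means, len(gs)))
    means = np.array([g.mean(axis=0) for g in groups])
    dists = np.array([[np.linalg.norm(a - b) for b in means] for a in means])
    return means[np.argmin(np.median(dists, axis=1))]  # group mean with smallest median distance
```

The returned vector plays the role of ∆^(r) in the descent loop sketched earlier.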
Estimation guarantees for the expected gradient. Using the triangle inequality, we have

‖∆^(r) − E_{D_0}[(x·w − y)x]‖ ≤ ‖E_{D_0}[f̃(x, y, w, κ)] − E_{D_0}[(x·w − y)x]‖ + ‖E_{D_0}[f̃(x, y, w, κ)] − E_{D_0}[P^(r) f̃(x, y, w, κ)]‖ + ‖∆^(r) − E_{D_0}[P^(r) f̃(x, y, w, κ)]‖.

The first term on the right is the difference between the expectations of the gradient and the clipped gradient. The second term captures the difference between the expectation of the clipped gradient and that of its projection onto the estimated subspace. The last term reflects the difference between the expectation of the projection of the clipped gradient and its estimate. All expectations are taken with respect to the distribution D_0, and both gradients and clipped gradients are evaluated at the point w, which at the beginning of each gradient descent step is updated to the previous round's estimate of w_0. As argued above, all three terms on the right are small, so ∆^(r) is an accurate estimate of the expected gradient of distribution D_0. Moreover, Lemma G.1 shows that with accurate estimates of the expected gradients, gradient descent reaches an accurate estimate of w_0 after O(log(‖w_0‖/σ)) steps; therefore, setting R = Ω(log(‖w_0‖/σ)) suffices. This completes the description of the algorithm and the proof sketch of Theorem 2.1; a more formal proof is available in Appendix H.

4 Empirical Results

Setup. We have sets B_s and B_m of small and medium-size batches and k distributions D_i for i ∈ {0, 1, …, k−1}. For a subset of indices I ⊆ {0, 1, …, k−1}, both B_s and B_m have an α fraction of batches containing i.i.d. samples from D_i for each i ∈ I, and for each i ∈ {0, 1, …, k−1}\I, B_s and B_m have a (1 − |I|α)/(k − |I|) fraction of batches with i.i.d. samples from D_i. In all figures, the output noise is distributed as N(0, 1). All small batches have 2 samples each, while medium-size batches have n_m samples each, which we vary from 4 to 32 as shown in the plots. We fix the data dimension d = 100, α = 1/16, the number of small batches |B_s| = min{8dk², 8d/α²}, and the number of medium batches |B_m| = 256. In all plots we average over 10 runs and report the standard error.

Figure 1: Same input distribution, k = 16, large minimum distance between regression vectors (test MSE vs. medium batch size, for KSSKO and our algorithm, each evaluated with 4 and 8 samples).

Figure 2: Different input distributions, k = 16, large minimum distance between regression vectors (test MSE vs. medium batch size, for KSSKO and our algorithm, each evaluated with 4 and 8 samples).

Evaluation. Our objective is to recover a small list containing good estimates of the regression vectors of D_i for each i ∈ I. We compare our proposed algorithm's performance with that of the algorithm in [KSS+20]. Given a new batch, we choose the weight vector from the returned list L that achieves the best error on it.³ The MSE of the chosen weight is then reported on another new batch drawn from the same distribution. The size of the new batch is either 4 or 8, as marked in the plots. More details about our setup can be found in Appendix L.

Setting in [KSS+20]. We first compare our algorithm with the one in [KSS+20] in the setting of that paper, i.e., under its more restrictive assumptions. The results are displayed in Figure 1, where I = {0, 1, …, 15} and each of the 16 distributions generates a 1/16 fraction of the batches. All the D_i's have input distribution N(0, I), and the minimum distance between the regression vectors is comparable to their norm. Even in the original setting of [KSS+20], our algorithm significantly outperforms theirs at all the medium batch sizes plotted on the x-axis.
Input distributions. Our algorithm can handle different input distributions for different subgroups. We test this in the next experiment, presented in Figure 2. Specifically, for each i we randomly generate a covariance matrix Σ_i whose eigenvalues are uniformly distributed in [1, C_1], and choose the input distribution of D_i as N(0, Σ_i), with C_1 = 4. The algorithm of [KSS+20] completely fails in this case, while ours retains its good performance. In the interest of space, we provide additional results in Appendix L covering even more general settings: (i) when the minimum distance between regression vectors is much smaller than their norm, and (ii) when the number of subgroups k is very large but the task is to recover the regression weights of the subgroups that appear in a sufficient fraction of the batches. In both cases, our algorithm performs much better than the baseline.

5 Conclusion

We studied the problem of learning linear regression from batched data in the presence of sub-populations. We removed several restrictive assumptions made in prior work and provide better guarantees on the overall sample complexity. Moreover, we require relatively few medium batches, and they need to contain fewer samples than in prior work. Finally, our empirical results show that our algorithm is both practical and more performant than a prior baseline. One limitation of the algorithm is that it requires knowledge of the parameters C, C_1, M, and σ. However, the algorithm works even if the assumed values exceed the actual ones; the change in the recovery guarantees is proportional to the disparity between the estimates and the actual parameters. The necessity of knowing reasonable upper bounds on these parameters, even for simpler problems, has been highlighted by previous work, e.g., [CAT+20].

³This simple approach showed better empirical performance than Algorithm 3, whose theoretical guarantees we described in Theorem 2.3.

References

[AJK+22] Jayadev Acharya, Ayush Jain, Gautam Kamath, Ananda Theertha Suresh, and Huanyu Zhang, Robust estimation for random graphs, Conference on Learning Theory, PMLR, 2022, pp. 130-166.
[BDLS17] S. Balakrishnan, S. S. Du, J. Li, and A. Singh, Computationally efficient robust sparse estimation in high dimensions, Proceedings of the 30th Conference on Learning Theory, COLT 2017, 2017, pp. 169-212.
[BJK15] Kush Bhatia, Prateek Jain, and Purushottam Kar, Robust regression via hard thresholding, Advances in Neural Information Processing Systems 28 (2015).
[BJKK17] K. Bhatia, P. Jain, P. Kamalaruban, and P. Kar, Consistent robust regression, Advances in Neural Information Processing Systems 30 (2017), pp. 2107-2116.
[CAT+20] Yeshwanth Cherapanamjeri, Efe Aras, Nilesh Tripuraneni, Michael I. Jordan, Nicolas Flammarion, and Peter L. Bartlett, Optimal robust linear regression in nearly linear time, arXiv preprint arXiv:2007.08137 (2020).
[CL13] Arun Tejasvi Chaganty and Percy Liang, Spectral experts for estimating mixtures of linear regressions, International Conference on Machine Learning (ICML), 2013, pp. 1040-1048.
[CLM20] Sitan Chen, Jerry Li, and Ankur Moitra, Learning structured distributions from untrusted batches: Faster and simpler, Advances in Neural Information Processing Systems 33 (2020), 4512-4523.
[CLS20] Sitan Chen, Jerry Li, and Zhao Song, Learning mixtures of linear regressions in subexponential time via Fourier moments, STOC, https://arxiv.org/pdf/1912.07629.pdf, 2020.
[CP22] Yanxi Chen and H. Vincent Poor, Learning mixtures of linear dynamical systems, Proceedings of the 39th International Conference on Machine Learning (Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, eds.), Proceedings of Machine Learning Research, vol. 162, PMLR, 17-23 Jul 2022, pp. 3507-3557.
[CSV17] Moses Charikar, Jacob Steinhardt, and Gregory Valiant, Learning from untrusted data, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017, pp. 47-60.
[DJKS22] Abhimanyu Das, Ayush Jain, Weihao Kong, and Rajat Sen, Efficient list-decodable regression using batches, arXiv preprint arXiv:2211.12743 (2022).
[DK20] Ilias Diakonikolas and Daniel M. Kane, Small covers for near-zero sets of polynomials and learning latent variable models, 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2020, pp. 184-195.
[DKK+19] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart, Sever: A robust meta-algorithm for stochastic optimization, Proceedings of the 36th International Conference on Machine Learning, ICML '19, JMLR, Inc., 2019, pp. 1596-1606.
[DKP+21] Ilias Diakonikolas, Daniel Kane, Ankit Pensia, Thanasis Pittas, and Alistair Stewart, Statistical query lower bounds for list-decodable linear regression, Advances in Neural Information Processing Systems 34 (2021), 3191-3204.
[DKS19] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart, Efficient algorithms and lower bounds for robust linear regression, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 2019, pp. 2745-2754.
[DT19] Arnak Dalalyan and Philip Thompson, Outlier-robust estimation of a sparse linear model using ℓ1-penalized Huber's M-estimator, Advances in Neural Information Processing Systems 32 (2019).
[FAL17] Chelsea Finn, Pieter Abbeel, and Sergey Levine, Model-agnostic meta-learning for fast adaptation of deep networks, Proceedings of the 34th International Conference on Machine Learning (ICML), 2017, pp. 1126-1135.
[Gao20] Chao Gao, Robust regression via multivariate regression depth, Bernoulli 26 (2020), no. 2, 1139-1170.
[GKG15] Michael Grottke, Julian Knoll, and Rainer Groß, How the distribution of the number of items rated per user influences the quality of recommendations, 2015 15th International Conference on Innovations for Community Services (I4CS), IEEE, 2015, pp. 1-8.
[JLST21] Arun Jambulapati, Jerry Li, Tselil Schramm, and Kevin Tian, Robust regression revisited: Acceleration and improved estimation rates, Advances in Neural Information Processing Systems 34 (2021), 4475-4488.
[JO20a] Ayush Jain and Alon Orlitsky, A general method for robust learning from batches, arXiv preprint arXiv:2002.11099 (2020).
[JO20b] Ayush Jain and Alon Orlitsky, Optimal robust learning of discrete distributions from batches, Proceedings of the 37th International Conference on Machine Learning, ICML '20, JMLR, Inc., 2020, pp. 4651-4660.
[JO21] Ayush Jain and Alon Orlitsky, Robust density estimation from batches: The best things in life are (nearly) free, International Conference on Machine Learning, PMLR, 2021, pp. 4698-4708.
[KFAL20] Nikola Konstantinov, Elias Frantar, Dan Alistarh, and Christoph Lampert, On the sample complexity of adversarial multi-source PAC learning, International Conference on Machine Learning, PMLR, 2020, pp. 5416-5425.
[KKK19] Sushrut Karmalkar, Adam Klivans, and Pravesh Kothari, List-decodable linear regression, Advances in Neural Information Processing Systems 32 (2019).
[KKM18] Adam Klivans, Pravesh K. Kothari, and Raghu Meka, Efficient algorithms for outlier-robust regression, Conference On Learning Theory, PMLR, 2018, pp. 1420-1430.
[KP19] Sushrut Karmalkar and Eric Price, Compressed sensing with adversarial sparse noise via L1 regression, 2nd Symposium on Simplicity in Algorithms, 2019.
[KSAD22] Weihao Kong, Rajat Sen, Pranjal Awasthi, and Abhimanyu Das, Trimmed maximum likelihood estimation for robust learning in generalized linear models, arXiv preprint arXiv:2206.04777 (2022).
[KSKO20] Weihao Kong, Raghav Somani, Sham Kakade, and Sewoong Oh, Robust meta-learning for mixed linear regression with small batches, Advances in Neural Information Processing Systems 33 (2020).
[KSS+20] Weihao Kong, Raghav Somani, Zhao Song, Sham Kakade, and Sewoong Oh, Meta-learning for mixed linear regression, International Conference on Machine Learning, PMLR, 2020, pp. 5394-5404.
[KZS15] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov, Siamese neural networks for one-shot image recognition, ICML Deep Learning Workshop, vol. 2, 2015.
[LL18] Yuanzhi Li and Yingyu Liang, Learning mixtures of linear regressions with nearly optimal complexity, COLT, arXiv preprint arXiv:1802.07895, 2018.
[LM16] Guillaume Lecué and Shahar Mendelson, Performance of empirical risk minimization in linear aggregation.
[LSLC18] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis, High dimensional robust sparse regression, arXiv preprint arXiv:1805.11643 (2018).
[MGJK19] Bhaskar Mukhoty, Govind Gopakumar, Prateek Jain, and Purushottam Kar, Globally-convergent iteratively reweighted least squares for robust regression problems, The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 313-322.
[Oli16] Roberto Imbuzeiro Oliveira, The lower tail of random quadratic forms with applications to ordinary least squares, Probability Theory and Related Fields 166 (2016), 1175-1194.
[OLL18] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste, TADAM: Task dependent adaptive metric for improved few-shot learning, Advances in Neural Information Processing Systems, 2018, pp. 721-731.
[PJL20] Ankit Pensia, Varun Jog, and Po-Ling Loh, Robust regression with covariate filtering: Heavy tails and adversarial contamination, arXiv preprint arXiv:2009.12976 (2020).
[PMSG22] Soumyabrata Pal, Arya Mazumdar, Rajat Sen, and Avishek Ghosh, On learning mixture of linear regressions in the non-realizable setting, International Conference on Machine Learning, PMLR, 2022, pp. 17202-17220.
[PSBR18] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar, Robust estimation via robust gradient estimation, arXiv preprint arXiv:1802.06485 (2018).
[PT08] Yoon-Joo Park and Alexander Tuzhilin, The long tail of recommender systems and how to leverage it, Proceedings of the 2008 ACM Conference on Recommender Systems, 2008, pp. 11-18.
[QV18] Mingda Qiao and Gregory Valiant, Learning discrete distributions from untrusted batches, Proceedings of the 9th Conference on Innovations in Theoretical Computer Science (Dagstuhl, Germany), ITCS '18, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018, pp. 47:1-47:20.
[RL17] Sachin Ravi and Hugo Larochelle, Optimization as a model for few-shot learning, International Conference on Learning Representations, 2017.
[RRS+18] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell, Meta-learning with latent embedding optimization, arXiv preprint arXiv:1807.05960 (2018).
[RY20] Prasad Raghavendra and Morris Yau, List decodable learning via sum of squares, Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 2020, pp. 161-180.
[Sch87] Jürgen Schmidhuber, Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook, Ph.D. thesis, Technische Universität München, 1987.
[SJA16] Hanie Sedghi, Majid Janzamin, and Anima Anandkumar, Provable tensor methods for learning mixtures of generalized linear models, Artificial Intelligence and Statistics (AISTATS), 2016, pp. 1223-1231.
[SVC16] Jacob Steinhardt, Gregory Valiant, and Moses Charikar, Avoiding imposters and delinquents: Adversarial crowdsourcing and peer prediction, Advances in Neural Information Processing Systems 29 (2016).
[TLW99] Kai Ming Ting, Boon Toh Low, and Ian H. Witten, Learning from batched data: Model combination versus data combination, Knowledge and Information Systems 1 (1999), 83-106.
[TP12] Sebastian Thrun and Lorien Pratt, Learning to learn, Springer Science & Business Media, 2012.
[TV23] Nelvin Tan and Ramji Venkataramanan, Mixed regression via approximate message passing, Journal of Machine Learning Research 24 (2023), no. 317, 1-44.
[TZD+19] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle, Meta-dataset: A dataset of datasets for learning to learn from few examples, arXiv preprint arXiv:1903.03096 (2019).
[WCX+21] Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, et al., A field guide to federated optimization, arXiv preprint arXiv:2107.06917 (2021).
[WDVR06] Jun Wang, Arjen P. De Vries, and Marcel J. T. Reinders, Unifying user-based and item-based collaborative filtering approaches by similarity fusion, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 501-508.
[WZ89] Mati Wax and Ilan Ziskind, On unique localization of multiple sources by passive sensor arrays, IEEE Transactions on Acoustics, Speech, and Signal Processing 37 (1989), no. 7, 996-1000.
[YCS14] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi, Alternating minimization for mixed linear regression, International Conference on Machine Learning, PMLR, 2014, pp. 613-621.
[YCS16] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi, Solving a mixture of many random linear equations by tensor decomposition and alternating minimization, arXiv preprint arXiv:1608.05749 (2016).
[ZJD16] Kai Zhong, Prateek Jain, and Inderjit S. Dhillon, Mixed linear regression with multiple components, Advances in Neural Information Processing Systems (NIPS), 2016, pp. 2190-2198.

A Other related work
Meta Learning. The setting considered in this paper is closely related to meta-learning if we treat each batch as a task. Meta-learning approaches aim to jointly learn from past experience to quickly adapt to new tasks with little available data [Sch87, TP12]. This is particularly significant in our setting, where each task is associated with only a few training examples. By leveraging structural similarities among tasks (e.g., the sub-population structure), meta-learning algorithms achieve far better accuracy than what can be achieved for each task in isolation [FAL17, RL17, KZS15, OLL18, TZD+19, RRS+18]. Learning mixtures of linear dynamical systems has been studied in [CP22].

Robust and List-decodable Linear Regression. Several recent works have focused on obtaining efficient algorithms for robust linear regression and sparse linear regression when a small fraction of the data may be adversarial [BJK15, BJKK17, BDLS17, Gao20, PSBR18, KKM18, DKK+19, LSLC18, KP19, DT19, MGJK19, DKS19, KKK19, PJL20, CAT+20, JLST21, KSAD22]. In scenarios where over half of the data may be arbitrary or adversarial, it becomes impossible to return a single estimate of the underlying model; the requirement is therefore relaxed to returning a small list of estimates such that at least one of them is a good estimate of the underlying model. This relaxed framework, called list-decodable learning, was first introduced in [CSV17]. List-decodable linear regression has been studied by [KKK19, RY20, DKP+21], who provided exponential-run-time algorithms. Additionally, [DKP+21] established statistical query lower bounds, indicating that polynomial-time algorithms may be impossible in this setting. However, as mentioned earlier, the problem can be solved in polynomial time in the batch setting as long as the batch size is greater than the inverse of the fraction of genuine data, as demonstrated in [DJKS22]. It is worth noting that an algorithm for list-decodable linear regression can be used to obtain a list of regression vector estimates for mixed linear regression.

Robust Learning from Batches. [QV18] introduced the problem of robust learning of discrete distributions from untrusted batches, where a majority of the batches share the same distribution and a small fraction are adversarial, and developed an exponential-time algorithm for it. Subsequent work [CLS20] improved the run-time to quasi-polynomial, while [JO20b] derived a polynomial-time algorithm with optimal sample complexity. The results were extended to learning one-dimensional structured distributions in [JO21, CLM20] and to classification in [JO20a, KFAL20]. [AJK+22] examined a closely related problem of learning the parameters of an Erdős-Rényi random graph when a portion of the nodes may be corrupted and their edges may be chosen by an adversary.

B Selecting a regression vector from a given list

In this section, we introduce Algorithm 3 and prove that it achieves the guarantees presented in Theorem 2.3.

Algorithm 3 SELECTING THE REGRESSION VECTOR
1: Input: samples S from D_i for some i ∈ I, C_1, a list L of possible estimates of w_i, and β ≥ 0 such that β ≥ min_{w∈L} ‖w − w_i‖.
2: Output: an estimate of w_i from the list L.
3: Divide S into T_3 = Θ(log(|L|/δ)) equal parts {S_j}_{j=1}^{T_3}.
4: while max{‖w − w′‖ : w, w′ ∈ L} ≥ 12C_1β do
5:   pick any w, w′ ∈ L such that ‖w − w′‖ ≥ 12C_1β.
6:   for j ∈ [T_3], let a_j ← (1/|S_j|) Σ_{(x,y)∈S_j} (x·w − y)·x·(w − w′).
7:   a ← median{a_j : j ∈ [T_3]}.
8:   If a > ‖w − w′‖²/4, remove w; else remove w′ from L.
9: end while
10: Return any of the remaining w ∈ L.
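A compact sketch of the Algorithm 3 tournament follows; the 12C_1β radius, the test statistic a_j, and the ‖w − w′‖²/4 threshold are taken from the pseudocode above, while the number of splits and the default constants are illustrative:

```python
# Sketch of the Algorithm 3 tournament. The statistic a_j, the 12*C1*beta radius, and the
# |w - w'|^2 / 4 threshold follow the pseudocode above; split count T3 and defaults are
# illustrative.
import numpy as np

def select_from_list(X, y, L, beta, C1=4.0, T3=9):
    """X, y: a few samples from the target sub-population; L: list of candidate vectors."""
    L = [np.asarray(w, dtype=float) for w in L]
    T3 = max(1, min(T3, len(y)))
    X_parts = np.array_split(X, T3)
    y_parts = np.array_split(y, T3)

    def stat(w, w2):
        # median over splits of (1/|S_j|) * sum (x.w - y) * x.(w - w2)
        vals = [np.mean((Xj @ w - yj) * (Xj @ (w - w2)))
                for Xj, yj in zip(X_parts, y_parts)]
        return np.median(vals)

    alive = list(range(len(L)))
    while True:
        # find a pair of surviving candidates that is still far apart
        pair = next(((i, j) for ai, i in enumerate(alive) for j in alive[ai + 1:]
                     if np.linalg.norm(L[i] - L[j]) > 12 * C1 * beta), None)
        if pair is None:
            return L[alive[0]]
        i, j = pair
        if stat(L[i], L[j]) > np.linalg.norm(L[i] - L[j]) ** 2 / 4:
            alive.remove(i)   # L[i] is unlikely to be within beta of the true vector
        else:
            alive.remove(j)
```

Each pass through the loop removes one candidate, so the procedure terminates after at most |L| − 1 comparisons, matching the union bound used in the analysis below.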
Without loss of generality, assume i = 0, and let w* = arg min_{w∈L} ‖w − w_0‖. From the condition in the theorem, ‖w* − w_0‖ ≤ β. The algorithm is given access to |S| = Ω(max{1, σ²/(C_1β²)} log(|L|/δ)) samples. It picks any two vectors w, w′ ∈ L that are more than 12C_1β apart and tests which of them is more likely to be within distance β of w_0, using the samples in S. The algorithm performs T_3 = Θ(log(|L|/δ)) such tests and takes the majority vote: it retains the vector that is more likely to be closer to w_0 and discards the other from L. The algorithm terminates when all vectors in L are within distance 12C_1β of each other, at which point it returns any remaining vector in L as its estimate of w_0.

If w* is retained in L until the end, then by the triangle inequality, every w remaining in L at the end satisfies ‖w − w_0‖ ≤ ‖w − w*‖ + ‖w* − w_0‖ ≤ 12C_1β + β ≤ 13C_1β = O(β). Therefore, the estimate returned by the algorithm achieves the desired accuracy in estimating w_0, and it suffices to show that w* is retained until the end with high probability. Suppose w* is not in the final list. Then it must have been discarded by the test in favor of some w ∈ L with ‖w − w*‖ ≥ 12C_1β. The following theorem shows that for any such w, the probability that the testing procedure rejects w* in favor of w is at most δ/|L|.

Theorem B.1. Given β > 0, a list L, and samples S from D_0, if min_{w∈L} ‖w − w_0‖ ≤ β and |S| = Ω(max{1, σ²/(C_1β²)} log(|L|/δ)), then for the parameter a computed in the while loop of Algorithm 3 on a pair (w, w′), with probability ≥ 1 − δ/|L| we have a ≤ ‖w − w′‖²/4 if w = w*, and a > ‖w − w′‖²/4 if w′ = w*.

The testing procedure used in the algorithm is based on gradients. Specifically, it computes the average over samples of the gradient at the point w, projected onto the vector (w − w′). The expected value of the gradient at w and of its projection onto (w − w′) are Σ_0(w − w_0) and (w − w_0)⊤Σ_0(w − w′), respectively. If w is close to w_0, the expected projection is small. On the other hand, if w′ is close to w_0 and w is far from it, then the expected projection is roughly (w − w′)⊤Σ_0(w − w′) ≥ ‖w − w′‖². Using these observations, we prove Theorem B.1 in the next subsection. Finally, since the algorithm makes at most |L| − 1 comparisons, a union bound ensures that w* is retained until the end with probability greater than 1 − δ, completing the proof of Theorem 2.3.

B.1 Proof of Theorem B.1

Proof. Note that a is the median of the set {a_j : j ∈ [T_3]}, where each a_j is computed from a different set of i.i.d. samples; consequently the a_j are i.i.d. random variables. We begin by calculating the expected value of a_j. We have

E[a_j] = E[ (1/|S_j|) Σ_{(x,y)∈S_j} (x·w − y)·x·(w − w′) ]
     = E_{D_0}[(x·w − y)·x·(w − w′)]
     = E_{D_0}[(x·(w − w_0) − (y − x·w_0))·x·(w − w′)]
     = E_{D_0}[x·(w − w_0)·x·(w − w′)] − E_{D_0}[(y − x·w_0)·x·(w − w′)]
     = E_{D_0}[x·(w − w_0)·x·(w − w′)]   (1)
     = E_{D_0}[(x·(w − w′))²] + E_{D_0}[x·(w′ − w_0)·x·(w − w′)],   (2)

where the first two steps use the definition of a_j, the linearity of expectation, and the fact that S_j contains i.i.d. samples from D_0; the step to (1) follows because the noise y − x·w_0 has zero mean and is independent of x; and (2) follows by writing w − w_0 = (w − w′) + (w′ − w_0).

Next, we compute the variance of a_j. Since a_j is the average of (x·w − y)·x·(w − w′) over |S_j| i.i.d. samples,

Var(a_j) = (1/|S_j|) Var_{D_0}((x·w − y)·x·(w − w′)) ≤ (1/|S_j|) E_{D_0}[((x·w − y)·x·(w − w′))²].   (3)

By Chebyshev's inequality, with probability at least 3/4, each a_j satisfies

E[a_j] − 2√Var(a_j) ≤ a_j ≤ E[a_j] + 2√Var(a_j).   (4)

First, consider the case w = w*, so that ‖w − w_0‖ ≤ β. Using Equation (3),

Var(a_j) ≤ (1/|S_j|) E_{D_0}[((x·w − y)·x·(w − w′))²]
       = (1/|S_j|) E_{D_0}[(x·(w − w_0)·x·(w − w′) + (w_0·x − y)·x·(w − w′))²]
       ≤ (2/|S_j|) ( E_{D_0}[(x·(w − w_0)·x·(w − w′))²] + E_{D_0}[((w_0·x − y)·x·(w − w′))²] ),

where the last step uses (u + v)² ≤ 2u² + 2v² for any u, v ∈ R. We now bound the two terms on the right. For the first term, by the Cauchy-Schwarz inequality and L4-L2 hypercontractivity (assumption 1(a)),

E_{D_0}[(x·(w − w_0)·x·(w − w′))²] ≤ √( E_{D_0}[(x·(w − w_0))⁴] · E_{D_0}[(x·(w − w′))⁴] ) ≤ C E_{D_0}[(x·(w − w_0))²] E_{D_0}[(x·(w − w′))²].   (5)
(5) For the second term, we have: E D0 h ((w0 x y)x (w w ))2i = E D0 (w0 x y)2 E D0 h (x (w w ))2i = σ2 E D0 h (x (w w ))2i , (6) where the first inequality follows from assumption 1a and the second inequality follows from assumption 1b. Combining the above three equations, we obtain: Var(aj) 2 |Sj| E D0 h (x (w w ))2i C E D0[(x (w w0))2] + σ2 2 |Sj|C1 w w 2 CC1 w w0 2 + σ2 , where the last inequality uses assumption 1b. Using Equation (1), the Cauchy-Schwarz inequality, and assumption 1b, we have: E[aj] C1 w w0 w w . (7) Combining the two equations above, we obtain: E[aj] + 2 q Var(aj) w w C1 w w0 + 2 2C1 p C1β + 8C1 p (b) 3C1 w w β here, in (a), we use w = w , which implies w w0 β, in (b), we utilize |Sj| 48C and |Sj| 12σ2 C1β2 , in (c), we use the fact that for any w and w in the while loop of the algorithm, w w 12C1β. Consequently, it follows from Equation (4) that each aj w w 2 probability 3/4. Hence, with probability 1 δ the median of aj is w w 2 Next, we consider the case when w = w . Firstly, we bound the variance using Equation (3): Var(aj) 1 |Sj| E D0 h ((x w y)x (w w ))2i = 1 |Sj| E D0 h (x (w w ))2 + x (w w0)x (w w ) + (w0 x y)x (w w ) 2i E D0 h ((w w ) x)4i + E D0 h (x (w w0)x (w w ))2i (8) + E D0 h ((w0 x y)x (w w ))2i C E D0 h (x (w w ))2i2 + C E D0[(x (w w0))2] + σ2 E D0[(x (w w ))2] C E D0 h (x (w w ))2i + p C E D0[(x (w w0))2] + σ p E D0[(x (w w ))2] 2 . In (a), we use the fact that for any t, u, v R, (t + u + v)2 3t2 + 3u2 + 3v2. In (b) the first term is bounded using assumption 1a, the bound on the second term can be obtained similarly to Equation (5), and the bound on the last term is from Equation (6). Finally, in (c) we use the fact that for any t, u, v 0, (t + u + v)2 t2 + u2 + v2. Using Equation (2) and the equation above, we get Var(aj) E D0[(x (w w ))2] + E D0[x (w w0)x (w w )] C E D0 h (x (w w ))2i + p C E D0[(x (w w0))2] + σ p E D0[(x (w w ))2] E D0[(x (w w ))2] p E D0[(x (w w0))2] p E D0[(x (w w ))2] C E D0[(x (w w0))2] + σ p E D0[(x (w w ))2] E D0[(x (w w ))2] 3 E D0[(x (w w0))2] E D0[(x (w w ))2] E D0[(x (w w ))2] 3 E D0[(x (w w0))2] p E D0[(x (w w ))2], here, in (a) we use the Cauchy-Schwarz inequality, (b) follows from |Sj| 48C, and (c) utilizes |Sj| 12σ2 C1β2 . Next, we have: E D0[(x (w w ))2] 3 E D0[(x (w w0))2] p 4 w w , (10) here in (a), we use assumption 1b, (b) relies on w = w , which implies w w0 β, and (c) uses the fact that for any w and w in the while loop of the algorithm, w w 12C1β and C1 1. Combining the above two equations, we obtain Var(aj) > 1 Then from Equation (4) it follows that each aj > w w 2 4 with probability 3/4. Hence, with probability 1 δ/|L| the median of aj is > w w 2 C Properties of Clipped Gradients The norm of the expected value and covariance of unclipped gradients for components other than D0 can be significantly larger than D0, acting as noise in the recovery process of D0. When using unclipped gradients, the algorithm s batch size and the number of batches must increase to limit the effect of these noisy components. And while the norm of the expected value and covariance of the unclipped gradient for D0 follows desired bounds, the maximum value of the unclipped gradient is unbounded, posing difficulties in applying concentration bounds. The following lemma shows that the clipping operation described in the main paper is able to address these challenges. Lemma C.1. Let S be a collection of random samples drawn from distribution Di for some i {0, 1, ..., k 1}. 
For any κ ≥ 0 and w ∈ Rd, the clipped gradient satisfies the following properties:
1. ‖E[∇f(S, w, κ)]‖ ≤ κ√C1,
2. ‖Cov(∇f(S, w, κ))‖ ≤ κ²C1/|S|,
3. ‖∇f(S, w, κ)‖ ≤ κC2√d almost surely,
4. E[‖∇f(S, w, κ)‖²] ≤ C1κ²d,
5. for all unit vectors u, E[(∇f(S, w, κ)·u)²] ≤ κ²C1.

This lemma implies that for smaller values of κ, the norms of the expectation and covariance of the clipped gradients are bounded by a smaller upper limit. The proof of the lemma is presented in Subsection C.1. The following theorem demonstrates that by choosing a sufficiently large value of κ, the norm of the expected difference between the clipped and unclipped gradients for distribution D0 can be made small:

Theorem C.2. For any ϵ > 0 and κ² ≥ 8CC1 E_{D0}[(y − x⊤w)²]/ϵ, the norm of the difference between the expected clipped gradient E_{D0}[∇f(x, y, w, κ)] and the expected unclipped gradient E_{D0}[(w⊤x − y)x] satisfies
‖E_{D0}[∇f(x, y, w, κ)] − E_{D0}[(w⊤x − y)x]‖ ≤ ϵ‖w − w0‖,
where E_{D0}[(w⊤x − y)x] = Σ0(w − w0).

The theorem shows that, in order to estimate the expectation of the gradient at a point w for distribution D0, it suffices to estimate the expectation of the clipped gradient at w, as long as the clipping parameter κ is chosen to be at least Ω(√(E_{D0}[(y − x⊤w)²]/ϵ)). Intuitively, when κ is much larger than E_{D0}[|y − x⊤w|], with high probability the clipped and unclipped gradients at w for a random sample from D0 are identical. The proof of the theorem is a bit more nuanced and leverages the symmetry of the noise distribution and the L4–L2 hypercontractivity of the distribution of x. The proof appears in Subsection C.2.

In the algorithm, we set κ to approximately Θ(√((E_{D0}[(y − x⊤w)²] + σ²)/ϵ)). This choice ensures that κ is close to the minimum value recommended by Theorem C.2 for preserving the gradient expectation of D0. By selecting a small κ, we ensure a tighter upper bound on the expectation and covariance of the clipped gradient for the other components, as described in Lemma C.1. The clipping operation also assists in obtaining improved bounds on the tails of the gradient by limiting the maximum possible norm of the clipped gradients, as stated in item 3 of the lemma.

C.1 Proof of Lemma C.1

Proof. Since ∇f(S, w, κ) is the average of the clipped gradients of |S| independent samples from Di, it follows that
a) E[∇f(S, w, κ)] = E_{Di}[∇f(x, y, w, κ)],
b) Cov(∇f(S, w, κ)) = (1/|S|) Cov_{Di}(∇f(x, y, w, κ)),
c) ‖∇f(S, w, κ)‖ ≤ ess sup_{(x,y)∼Di} ‖∇f(x, y, w, κ)‖ a.s.,
d) E[‖∇f(S, w, κ)‖²] ≤ E_{Di}[‖∇f(x, y, w, κ)‖²], and
e) for all vectors u, E[(∇f(S, w, κ)·u)²] ≤ E_{Di}[(∇f(x, y, w, κ)·u)²].
We will now prove the five claims in the lemma using these properties.

Firstly, we bound the norm of E_{Di}[∇f(x, y, w, κ)]:
‖E_{Di}[∇f(x, y, w, κ)]‖ = max_u E_{Di}[∇f(x, y, w, κ)·u] ≤ max_u E_{Di}[(∇f(x, y, w, κ)·u)²]^{1/2} ≤ max_u E_{Di}[(κ x·u)²]^{1/2} ≤ κ√C1,
where the first inequality follows from the Cauchy–Schwarz inequality and the last from the assumptions on the distributions Di. Combining the above inequality with item a) proves the first claim in the lemma.

Next, to prove the second claim, we first bound the norm of the covariance of the clipped gradient of a random sample:
‖Cov_{Di}(∇f(x, y, w, κ))‖ = max_u Var_{Di}(∇f(x, y, w, κ)·u) ≤ max_u E_{Di}[(∇f(x, y, w, κ)·u)²] ≤ max_u E_{Di}[(κ x·u)²] ≤ κ²C1.
Combining this bound with item b) establishes the second claim in the lemma.
To prove the third item in the lemma, we first bound the norm of the clipped gradient: ess sup (x,y) Di f(x, y, w, κ) κ ess sup (x,y) Di x κC2 We then combine this bound with item c) to prove the third claim in the lemma. Next, we bound the expected value of the square of the norm of the clipped gradient of a random sample, E Di h f(x, y, w, κ) 2i E Di h κ2 x 2i = κ2Tr(Σi) κ2d Σi C1κ2d. This bound, combined with item d), proves the fourth claim in the lemma. Finally, for any unit vector u, we bound E Di h ( f(x, y, w, κ) u)2i κ2 E Di h (x u)2i κ2 Σi κ2C1. This bound, combined with item e), shows the fifth claim in the lemma. C.2 Proof of Theorem C.2 We will utilize the following auxiliary lemma in the proof of the theorem. This lemma applies to general random variables. Lemma C.3. For any a R, b > 0 and a symmetric random variable z, E (a + z) (a + z)b max(|a + z|, b) 2|a| Pr(z > b |a|) Proof. We assume a 0 and prove the lemma for this case. The statement for a < 0 case then follows from symmetry. We rewrite the term inside the expectation in terms of indicator random variables: (a + z) (a + z)b max(|a + z|, b) = (a + z b) 1(z > b a) + (a + z + b) 1(z < b a) = (a + z b) 1(b a < z b + a) + (a + z b) 1(z > b + a) + (a + z + b) 1(z < b a). Taking the expectation on both sides, E (a + z) (a + z)b max(|a + z|, b) = E[(a + z b) 1(b a < z b + a)] + E[(a + z b) 1(z > b + a)] + E[(a + z + b) 1(z < b a)] = E[(a + z b) 1(b a < z b + a)] + 2a Pr(z > b + a), where the last step follows because z is symmetric. Next, E (a + z) (a + z)b max(|a + z|, b) = E[|a + z b| 1(b a < z b + a)] + 2|a| Pr(z > b + a) E[|2a| 1(b a < z b + a)] + 2|a| Pr(z > b + a) 2|a| Pr(b a < z b + a) + 2|a| Pr(z > b + a) = 2|a| Pr(z > b a). Next, we proceed with the proof of Theorem C.2 using the aforementioned lemma. Proof of Theorem C.2. Let (x, y) be a random sample from distribution D0, and let η = y w0 x denote the noise. Recall that η is independent of x. (w x y)x = ((w w0) x η)x. We will now evaluate the expected value of the unclipped gradient. E D0[(w x y)x] = E D0[((w w0) x η)x] = E D0[((w w0) x)x] = Σ0(w w0). (11) Next, we will bound the norm of the expected value of (w x y)x f(x, y, w, κ), which represents the difference between the clipped gradient and the true gradient. We first expand this expression: (w x y)x f(x, y, w, κ) = (w x y) (w x y) |w x y| κκ x = ((w w0) x η) ((w w0) x η) |(w w0) x η| κκ x. (12) Next, by applying Lemma C.3, we have ((w w0) x η) ((w w0) x η) |(w w0) x η| κκ 2|(w w0) x| Pr(η > κ |(w w0) x|). Note that in the above expectation, we fixed x and took the expectation over noise η. Let Z := 1(|x (w w0)| κ/2). Observe that Pr(η > κ |(w w0) x|) Z + Pr(η > κ/2). Combining this observation with the above equation, we have: ((w w0) x η) ((w w0) x η) |(w w0) x η| κκ 2|(w w0) x| (Pr(η > κ/2) + Z). Then, for any unit vector v Rd, we have | E D0[((w x y)x f(x, y, w, κ)) v]| = | E x D0[E η D0[((w x y)x f(x, y, w, κ)) v]]| E x D0[| E η D0[((w x y)x f(x, y, w, κ)) v]|] E x D0[2|(w w0) x| |x v|(Z + Pr(η > κ/2))] 2 E D0[Z |(w w0) x| |x v|] + 2 Pr(η > κ/2) E D0[|(w w0) x| |x v|], (14) here the second last inequality follows from Equation (12) and Equation (13). Next, we bound the two terms on the right one by one. 
We start with the first term: E D0[Z |(x (w w0))(x v)|] (a) E[(Z)2] E D0[(x (w w0))2(x v)2] 1/2 (b) E[Z] E D0[(x (w w0))4]1/2 E D0[(x v)4]1/2 1/2 (c) E[Z] C E D0[(x (w w0))2] E D0[(x v)2] 1/2 (d) CC1 Pr[|x (w w0)| κ/2] E D0[(x (w w0))2] 1/2, (15) where (a) used the Cauchy-Schwarz inequality, (b) used the fact that Z is an indicator random variable, hence, Z2 = Z, and the Cauchy-Schwarz inequality, (c) uses L4 L2 hypercontractivity, and (d) follows from the definition of Z and the assumption that Σ0 C1. Similarly, we can show that E D0[|(w w0) x| |x v|] C1 E D0[(x (w w0))2] 1/2. (16) Applying the Markov inequality to η2 we get: Pr[|η| κ/2] E D0[η2] (κ/2)2 . (17) Similarly, applying the Markov inequality to |x (w w0)|4 yields: Pr[|x (w w0)| κ/2] E D0[|x (w w0)|4] (κ/2)4 C E D0[|x (w w0)|2]2 (κ/2)4 , (18) where the last inequality uses L4 L2 hypercontractivity. Combining Equations (14), (15), (16), (17) and (18), we have | E D0[((w x y)x f(x, y, w, κ)) v]| 8C C1 E D0[(x (w w0))2] 3/2 κ2 + 8 C1 E D0[η2] E D0[(x (w w0))2] 1/2 8C C1 E D0[(x (w w0))2] 1/2( E D0[(x (w w0))2] + E D0[η2]/C) κ2 (a) ϵ E D0[(x (w w0))2] 1/2 C1 ( E D0[(x (w w0))2] + E D0[η2]/C) E D0[(y x w)2] (b) ϵ w w0 ( E D0[(x (w w0))2] + E D0[η2]) E D0[(y x w)2] (c) = ϵ w w0 , here inequality (a) follows from the lower bound on κ2 in theorem, inequality (b) follows from Assumption 1b and C 1, and the last equality follows since y x w = x(w w0) + η, and x and η are independent. Note that the above bound holds for all unit vectors v, therefore, E D0[((w x y)x f(x, y, w, κ))] max v E D0[((w x y)x f(x, y, w, κ))] v ϵ w w0 . D Estimation of clipping parameter Algorithm 4 CLIPEST 1: Input: A collection of samples S from D0, w, ϵ, δ , σ, C, and C1. 2: Output: clipping parameter κ 3: T Θ(log 1/δ ) 4: Divide S into T equal parts randomly, and denote them as {S j }j [T ] 5: θ Median n 1 |S j | P (x,y) S j (x w y)2 : j [T] o 32(C+1)C1(θ+17σ2) ϵ 7: Return κ In round r, to set κ p E D0[(y x w)2]/ϵ at point w = ˆw(r), the main algorithm 1 runs subroutine CLIPEST 4 for S = Sb ,(r) 1 and w = ˆw(r). Recall that Sb ,(r) 1 is collection of i.i.d. samples from D0. Using these samples this subroutine estimates E D0[(y x w)2] at w = ˆw(r) using the median of means and then use it to obtain κ in the desired range. The following theorem provides the guarantees on the estimation of κ by this subroutine. Theorem D.1. For ϵ > 0, T Ω(log 1/δ ) and |S | 64C2T and w Rd. With probability 1 δ , the clipping parameter κ returned by subroutine CLIPEST satisfy, q 8(C+1)C1 E D0[(y x w)2] 2(C+1)C1(E D0[((w w0) x)2]+σ2) To prove the theorem, we will make use of the following lemma: Lemma D.2. Let S be a collection of m 64C2 i.i.d. samples from D0 and w Rd, then with probability at least 7/8, the following holds: 1 4 E D0[(w x y)2] 17σ2 1 (x,y) S (y w x)2 3 E D0[((w w0) x)2] + 32σ2. Proof. We start by expanding the expression: 1 m (x,y) S (w x y)2 = 1 (x,y) S ((w w0) x + (w0 x y))2 ((w w0) x)2 + 2((w w0) x)(w0 x y) + (w0 x y)2 2((w w0) x)2 (w0 x y)2 , (19) where the last inequality follows since for any a, b, we have a2 + 2ab + b2 a2/2 b2. Similarly, we can show: 1 m (x,y) S (w x y)2 1 2((w w0) x)2 + 2(w0 x y)2 . (20) Since S contains independent samples from D0, we have: (x,y) S (((w w0) x)2 = E D0[((w w0) x)2], (x,y) S ((w w0) x)2 = Var D0(((w w0) x)2) E D0[((w w0) x)4] m C E D0[((w w0) x)2]2 where the last inequality follows from L4-L2 hypercontractivity. 
For any a > 0, using Chebyshev s inequality, (x,y) S ((w w0) x)2 E D0[((w w0) x)2] a C E D0[((w w0) x)2] m Using the Markov inequality, for any a > 0, we have: (x,y) S (w0 x y)2 > a2σ2 E D0[(w0 x y)2] By combining the equations above, we can derive the following inequality: With probability 1 2 a2 , the following holds: (x,y) S (w x y)2 1 2 E D0[((w w0) x)2](1 a C m) a2σ2, (x,y) S (w x y)2 2 E D0[((w w0) x)2](1 + a C m) + 2a2σ2. By choosing a = 4 and using m 64C2 in the above equation, we can conclude that with probability 1 2 a2 , the following holds: 1 4 E D0[((w w0) x)2] 16σ2 1 (x,y) S (w x y)2 3 E D0[((w w0) x)2] + 32σ2. Next, note that E D0[(w x y)2] = E D0[((w w0) x (y w0 x))2] (a) = E D0[((w w0) x)2] + E D0[(y w0 x)2] (b) E D0[((w w0) x)2] + σ2, here (a) follows since x is independent of the output noise y w0 x, and (b) follows since the output noise y w0 x is zero mean and has a variance at most σ2. Combining the above two equations completes the proof. Now we prove Theorem D.1 using the above lemma: Proof of Theorem D.1. From the previous lemma and Chernoff bound it follows that with probability 1 δ , 1 4 E D0[(w x y)2] 17σ2 θ 3 E D0[((w w0) x)2] + 32σ2. Then bound on κ follows from the relation κ = q 32(C+1)C1(θ+17σ2) E Subspace Estimation As a part of gradient estimation in step r, the main algorithm 1 uses subroutine GRADSUBEST for b B = B(r) s and w = ˆw(r). Recall that B(r) s is a random subset of the collection of small batches Bs. The purpose of this subroutine is to estimate a smaller subspace of Rd such that for distribution D0, the expectation of the projection of the clipped gradient onto this subspace closely approximates the Algorithm 5 GRADSUBEST 1: Input: A collection of medium batches b B, κ, w, ℓ 2: Output: A rank ℓprojection matrix. 3: For each b b B divide its samples Sb into two equal random parts Sb 1 and Sb 2 4: A 1 2| b B| P b b B f(Sb 1, w, κ) f(Sb 2, w, κ) + f(Sb 2, w, κ) f(Sb 1, w, κ) 5: U [u1, u2, ..., uℓ], where {ui} s are top ℓsingular vectors of A 6: Return UU true expectation of the clipped gradient, for distribution D0. This reduction to a smaller subspace helps reduce the number of medium-sized batches and their required length in the subsequent part of the algorithm. The following theorem characterizes the final guarantee for subroutine GRADSUBEST. Theorem E.1. Let p0 denote the fraction of batches in b B that are sampled from D0. For any ϵ, δ > 0, and b B = Ω d αsϵ2 1 αsϵ2 + C2 2 C1 δ , p0 αs/2 and ℓ min{k, 1 2αsϵ2 }, with probability 1 δ , the projection matrix UU returned by subroutine GRADSUBEST satisfy (I UU ) E D0[ f(x, y, w, κ)] 4ϵκ p The above theorem implies that the difference between the expectation of the clipped gradient and the expectation of projection of the clipped gradient for distribution D0 is small. Next, we present the description of the subroutine GRADSUBEST and provide a brief outline of the proof for the theorem before formally proving it in the subsequent subsection. The subroutine divides samples in each batch b b B into two parts, namely Sb 1 and Sb 2. Then it computes the clipped gradients ub := f(Sb 1, w, κ) and vb := f(Sb 2, w, κ). From linearity of expectation, for any i and batch b that contain i.i.d. samples from Di, E[ub] = E[vb] = EDi[ f(x, y, w, κ)]. The subroutine defines A = P b b B 1 2| b B|ub(vb) + vb(ub) . Let pi denote the fraction of batches in b B that have samples from Di. Then using the linearity of expectation, we have: E[A] = Pk 1 i=0 pi E Di[ f(x, y, w, κ)] E Di[ f(x, y, w, κ)] . 
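To make this construction concrete, here is a minimal sketch, under the assumption that each batch is stored as a pair of arrays (X, y); the clipping rule mirrors the description in Section C (the residual x⊤w − y is capped at magnitude κ before multiplying by x), and the helper names are ours rather than the paper's.

```python
import numpy as np

def clipped_grad(X, y, w, kappa):
    """Average clipped gradient over a set of samples: the residual
    (x^T w - y) is clipped to [-kappa, kappa] before being multiplied
    by x, which caps the norm of each per-sample gradient."""
    resid = np.clip(X @ w - y, -kappa, kappa)
    return (resid[:, None] * X).mean(axis=0)

def grad_subspace(batches, w, kappa, ell):
    """GRADSUBEST-style sketch: build the symmetrized outer-product
    matrix A from the two halves of each batch and return the
    projection onto its top-ell singular vectors."""
    d = batches[0][0].shape[1]
    A = np.zeros((d, d))
    for X, y in batches:
        half = len(y) // 2
        u = clipped_grad(X[:half], y[:half], w, kappa)
        v = clipped_grad(X[half:], y[half:], w, kappa)
        A += 0.5 * (np.outer(u, v) + np.outer(v, u))
    A /= len(batches)
    U = np.linalg.svd(A)[0][:, :ell]
    return U @ U.T        # rank-ell projection matrix P = U U^T
```

Because the two halves of a batch are independent, the expectation of each outer product factors into a product of expectations, which is why A concentrates around the weighted sum of rank-one terms in the display above.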
If the matrix U were formed by selecting the top k singular vectors of E[A], then the projection of E_{D0}[∇f(x, y, w, κ)] onto UU⊤ would equal E_{D0}[∇f(x, y, w, κ)] itself, and the guarantee stated in the theorem would hold. However, we do not have access to E[A], and furthermore, when the number of components k is large, it may be desirable to obtain a subspace of dimension smaller than k. To address the first challenge, Theorem E.2 in the next subsection shows that ‖A − E[A]‖ is small, which permits using A as a substitute for E[A]. The clipping operation, introduced in the previous section, plays a crucial role in the proof of Theorem E.2 by controlling the norm of the expectation and the covariance of the clipped gradient for the other components, as well as the maximum length of the clipped gradients across all components. This is crucial for obtaining a good bound on the number of small batches required. Additionally, the clipping operation ensures that the subroutine remains robust to arbitrary input-output relationships for the other components. Furthermore, clipping assists in addressing the second challenge by ensuring a uniform upper bound on the norm of the expectation for all components, i.e., ‖E_{Di}[∇f(x, y, w, κ)]‖ ≤ O(κ). Leveraging this property, Lemma E.3 demonstrates that it suffices to estimate a top subspace of dimension O(1/p0). Intuitively, this is because even infinitely many components can create at most roughly 1/p0 directions with weight greater than p0, so the direction of D0 must be present in the top Θ(1/p0)-dimensional subspace. Since B̂ = B_s^{(r)} is obtained by randomly partitioning Bs into R subsets, and Bs contains at least an αs fraction of batches with samples from D0, with high probability p0 ≥ αs/2. Consequently, when ℓ ≥ min{k, Ω(1/αs)}, the subspace spanned by the top ℓ singular vectors of A satisfies the desired property in Theorem E.1.

We note that the construction of the matrix A in subroutine GRADSUBEST is inspired by the previous work [KSS+20]. However, while they employed it to approximate the k-dimensional subspace of the true regression vectors of all components, we focus exclusively on one distribution D0 at a time and recover a subspace such that, for distribution D0, the expectation of the projection of the clipped gradient onto this subspace closely matches the true expectation of the clipped gradient. In addition to repurposing the subroutine from [KSS+20], we achieve four significant improvements: 1) a more meticulous statistical analysis and the use of clipping enable our algorithm to handle heavy-tailed distributions for both the noise and the inputs; 2) clipping also allows arbitrary input-output relationships for the other components. The next two improvements are due to an improved linear-algebraic analysis; specifically, our Lemma E.3 sharpens the matrix perturbation bounds in [LL18] and [KSS+20]. These enhancements enable us to 3) provide meaningful guarantees even when the number of components k is very large, and 4) reduce the number of batches required when the distance between the regression vectors is small.

E.1 Proof of Theorem E.1

To prove Theorem E.1, in the following theorem we first demonstrate that ‖A − E[A]‖ is small given enough batches.

Theorem E.2. For 0 ≤ i ≤ k − 1, let zi = E_{Di}[∇f(x, y, w, κ)], and let pi denote the fraction of batches in B̂ that have samples from Di.
For any ϵ, δ > 0, and b B = Ω d αsϵ2 1 αsϵ2 + C2 2 C1 with probability at least 1 δ , A i=0 piziz i where A is the matrix defined in subroutine GRADSUBEST. Proof. Let Zb := f(Sb 1, w, κ) f(Sb 2, w, κ) . b b B (Zb + (Zb) ). Then, from the triangle inequality, we have: A i=0 piziz i i=0 piziz i i=0 piziz i i=0 piziz i For a batch b sampled from distribution Di, we have: E[Zb] = E[ f(Sb 1, w, κ) f(Sb 2, w, κ) ] = E[ f(Sb 1, w, κ)] E[ f(Sb 2, w, κ) ] = E Di[ f(x, y, w, κ)] E Di[ f(x, y, w, κ) ] = ziz i , where the second inequality follows since samples in Sb 1 and Sb 2 are independent, and the third equality follows from the linearity of expectation. It follows that b b B E[Zb] = i=0 piziz i , i=0 piziz i b b B E[Zb] To complete the proof, we will prove a high probability bound on the term on the right by applying the Matrix Bernstein inequality. To apply this inequality, we first upper bound |Zb| as follows: Zb = f(Sb 1, w, κ) f(Sb 2, w, κ) f(Sb 1, w, κ) f(Sb 2, w, κ) . From item 3 in Lemma C.1, we have f(Sb 1, w, κ) κC2 d almost surely, and f(Sb 2, w, κ) κC2 d almost surely. It follows that Zb κ2C2 2d. Therefore, Zb E[Zb] 2κ2C2 2d. Next, we will provide an upper bound for E P b b B(Zb E[Zb]) P b b B(Zb E[Zb]) : E P b b B(Zb E[Zb]) P b b B(Zb E[Zb]) b b B(Zb E[Zb])(Zb E[Zb]) | b B| max b b B E (Zb E[Zb])(Zb E[Zb]) | b B| max b b B | b B| max b b B E[ f(Sb 2, w, κ) 2] E f(Sb 1, w, κ) f(Sb 1, w, κ) | b B| max b b B,u: u =1 E[ f(Sb 2, w, κ) 2] E ( f(Sb 1, w, κ) u)2 . From item 4 and item 5 in lemma C.1, we have: E[ f(Sb 2, w, κ) 2] C1κ2d, and E ( f(Sb 1, w, κ) u)2 κ2C1. Combining these two bounds, wee obtain: E b b B (Zb E[Zb]) b b B (Zb E[Zb]) | b B|dκ4C1 2. Due to symmetry, the same bound holds for E P b b B(Zb E[Zb]) P b b B(Zb E[Zb]) . Finally, by applying the Matrix Bernstein inequality, we have: b b B (Zb E[Zb]) | b B|dκ4C1 2 + | b B|θ(2C2 2κ2d) For b B = Ω d αsϵ2 1 αsϵ2 + C2 2 C1 δ , the quantity on the right-hand side is bounded by δ . Therefore, with probability at least 1 δ , we have: b b B (Zb E[Zb]) Combining the above equation with Equation (23) completes the proof of the Theorem. In the proof of Theorem E.1, we will utilize the following general linear algebraic result: Lemma E.3. For z0, z1, ..., zk 1 Rd and a probability distribution (p0, p1, ..., pk 1) over k elements, let Z = Pk 1 i=0 piziz i . For a symmetric matrix M and ℓ> 0, let u1, u2, .., uℓbe top ℓ singular vectors of M and let U = [u1, u2, ..., uℓ] Rd ℓ, then we have: (I UU )z0 2 ( 2(ℓ+1) M Z +maxj zj 2 (ℓ+1)p0 ℓ< k 2 M Z Lemma E.3 provides a bound on the preservation of the component z0 by the subspace spanned by the top-ℓsingular vectors of a symmetric matrix M. This bound is expressed in terms of the spectral distance between matrices Z and M, the maximum norm of any zi, and the weight of the component corresponding to z0 in Z. The proof of Lemma E.3 can be found in Section I. Utilizing Lemma E.3 in conjunction with Theorem E.2, we proceed to prove Theorem E.1. Proof of Theorem E.1. From Lemma E.3, we have the following inequality: (I UU ) E D0[ f(x, y, w, κ)] 2 2(ℓ+1) A Pk 1 i=0 piziz i +maxj E Dj [ f(x,y,w,κ)] 2 (ℓ+1)p0 ℓ< k 2 A Pk 1 i=0 piziz i p0 if ℓ k. By applying Theorem E.2 and utilizing item 1 of Lemma C.1, it follows that with a probability of at least 1 δ , we have: (I UU ) E D0[ f(x, y, w, κ)] 2 ( 2αsϵ2κ2C1 p0 + κ2C1 (ℓ+1)p0 ℓ< k The theorem then follows by using p0 αs/2 and ℓ min{k, 1 2αsϵ2 }. 
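As a quick numerical illustration of the type of guarantee Lemma E.3 gives (not part of the paper's analysis), the following sketch builds a mixture matrix Z, perturbs it, and compares the residual of z0 outside the top-ℓ subspace with the bound for the ℓ < k case; all sizes and noise levels here are arbitrary choices made only for illustration.

```python
import numpy as np

# Sanity check of the flavor of Lemma E.3: the top-ell singular subspace of a
# perturbed mixture matrix M still captures most of z_0, provided ||M - Z|| is
# small relative to the weight p_0.
rng = np.random.default_rng(0)
d, k, ell = 50, 20, 10
z = rng.normal(size=(k, d))                      # component vectors z_0, ..., z_{k-1}
p = np.full(k, 1.0 / k)                          # uniform weights, so p_0 = 1/k
Z = sum(p[i] * np.outer(z[i], z[i]) for i in range(k))

E = rng.normal(size=(d, d)) * 1e-3
M = Z + (E + E.T) / 2                            # symmetric perturbation of Z
U = np.linalg.svd(M)[0][:, :ell]                 # top-ell singular vectors of M

residual = np.linalg.norm(z[0] - U @ (U.T @ z[0])) ** 2
spectral_gap = np.linalg.norm(M - Z, 2)
bound = (2 * spectral_gap + max(np.linalg.norm(zi) ** 2 for zi in z) / (ell + 1)) / p[0]
print(residual, bound)                           # the residual stays below the bound
```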
F Grad Estimation

Recall that in the gradient estimation for step r, Algorithm 1 uses the subroutine GRADSUBEST to find a projection matrix P(r) for an ℓ-dimensional subspace. In the previous section, we showed that for distribution D0 the difference between the expectation of the clipped gradient and the expectation of its projection onto this subspace is small. Therefore, it suffices to estimate the expectation of the projection of the clipped gradient onto the subspace. The main algorithm (Algorithm 1) passes the medium-size batches B̂ = B_m^{(r)}, the ℓ-dimensional projection matrix P = P(r), and a collection S′ = S_2^{b∗,(r)} of i.i.d. samples from D0 to the subroutine GRADEST. Here, B_m^{(r)} is a random subset of the collection of medium-size batches Bm. The purpose of the GRADEST subroutine is to estimate the expected value of the projection of the clipped gradient onto the ℓ-dimensional subspace defined by the projection matrix P. Since the subroutine operates in a smaller, ℓ-dimensional subspace, the minimum batch size and the number of batches required from Bm depend on ℓ rather than on d. The following theorem characterizes the final guarantee for the GRADEST subroutine:

Theorem F.1. For subroutine GRADEST, let nm denote the length of the smallest batch in B̂, let N denote the number of batches in B̂ that have samples from D0, and let P be a projection matrix for some ℓ-dimensional subspace of Rd. If T1 ≥ Ω(log(|B̂|/δ′)), T2 ≥ Ω(log(1/δ′)), each batch has nm ≥ 4T1·Ω(√ℓ/ϵ²) samples, and N·nm ≥ T2·Ω(ℓ/ϵ²), then with probability ≥ 1 − 2δ′ the estimate ∆ returned by subroutine GRADEST satisfies ‖∆ − E_{D0}[P∇f(x, y, w, κ)]‖ ≤ 9ϵκ√C1.

The above theorem implies that when the length of the medium-size batches is Ω(√ℓ) and the batches in B̂ containing samples from D0 provide Ω(ℓ) samples in total, the GRADEST subroutine provides a reliable estimate of the projection of the clipped gradient onto the ℓ-dimensional subspace defined by the projection matrix P. Next, we describe the GRADEST subroutine and give a brief outline of the proof of the theorem before proving it formally in the subsequent subsection.

In the GRADEST subroutine, the first step is to divide the samples in each batch of B̂ into two equal parts. Using the first half of the samples in a batch b together with the samples S′, it tests whether the expected values of the projection of the clipped gradient for D0 and for the distribution that generated b are close. With high probability, the algorithm retains all the batches from D0 while rejecting batches from distributions for which the difference between the two expectations is large. To achieve this within an ℓ-dimensional subspace, Ω(√ℓ) samples per batch suffice (see Lemma F.6). Following the rejection step, the GRADEST subroutine estimates the projection of the clipped gradients within the ℓ-dimensional subspace using the second half of the samples from the retained batches. To estimate the gradient accurately in the ℓ-dimensional subspace, Ω(ℓ) samples are sufficient (see Lemma F.7). To obtain guarantees with high probability, the procedure employs the median-of-means approach, both for deciding which batches to keep and for the estimation from the retained batches. We prove the theorem formally in the next subsection.

F.1 Proof of Theorem F.1

The following lemma provides an upper bound on the covariance of the projection of the clipped gradients.

Lemma F.2. Consider a collection S of m i.i.d. samples from distribution Di.
For κ > 0, w Rd and a projection matrix P for an ℓdimensional subspace of Rd, we have E[P f(S, w, κ)] = E Di[P f(x, y, w, κ)], and Cov(P f(S, w, κ)) 2κ2 m C1 and Tr (Cov(P f(S, w, κ))) ℓ Cov(P f(S, w, κ)) . Proof. Note that, E P f(S, w, κ) = P E[ f(S, w, κ)]] = P E Di[ f(x, y, w, κ)] = E Di[P f(x, y, w, κ)], where the second-to-last equality follows from Lemma C.1. This proves the first part of the lemma. To prove the second part, we bound the norm of the covariance matrix: Cov(P f(S, w, κ)) = max u 1 Var(u P f(S, w, κ)) = max v 1 Var(v f(S, w, κ)) Cov( f(S, w, κ)) where the last inequality follows from Lemma C.1. Similarly, Cov(P f(S , w, κ)) κ2 Finally, since random vector P f(S , w, κ) lies in ℓdimensional subspace of Rd, corresponding to projection matrix P, hence its covariance matrix has rank ℓ. Hence, the relation Tr (Cov(P f(S, w, κ))) ℓ Cov(P f(S, w, κ)) follows immediately. The following corollary is a simple consequence of the previous lemma: Corollary F.3. Consider two collections S and S each consisting of m i.i.d. samples from distributions Di and D0, respectively. For κ > 0, w Rd and a projection matrix P for an ℓdimensional subspace of Rd, let z = P f(S, w, κ) f(S , w, κ) , we have: E[z] = E Di[P f(x, y, w, κ)] E D0[P f(x, y, w, κ)], and Cov(z) 4κ2 m C1 and Tr (Cov(z)) ℓ Cov(z) . Proof. The expression for E[z] can be obtained from the previous lemma and the linearity of expectation. To prove the second part, we bound the norm of the covariance matrix of z. Cov(z) = Cov(P f(S, w, κ) f(S , w, κ) ) 2(Cov(P f(S, w, κ)) + Cov(P f(S , w, κ))). Using the bounds from the previous lemma, we can conclude that Cov(z) 4κ2 Finally, since the random vector z lies in the ℓ-dimensional subspace of Rd defined by the projection matrix P, its covariance matrix has rank ℓ. Therefore, we have Tr, (Cov(z)) ℓ Cov(z) . The following theorem bounds the variance of the dot product of two independent random vectors. It will be helpful in upper bounding the variance of ζb j (defined in subroutine GRADEST). Theorem F.4. For any two independent random vectors z1 and z2, we have: Var(z1 z2) 3Tr(Cov(z1)) Cov(z2) + 3 E[z1] 2 Cov(z2) + 3 E[z2] 2 Cov(z1) . Proof. We start by expanding the variance expression: Var(z1 z2) = Var(z1 z2 E[z1 z2]) = Var(z1 z2 E[z1] E[z2]) = Var((z1 E[z1]) (z2 E[z2]) + E[z1] z2 + E[z2] z1) 3 Var((z1 E[z1]) (z2 E[z2])) + 3 Var(E[z1] z2) + 3 Var(E[z2] z1) = 3 Var((z1 E[z1]) (z2 E[z2])) + 3 E[z1] Cov(z2) E[z1] + E[z2] Cov(z1) E[z2] 3 Var((z1 E[z1]) (z2 E[z2])) + 3 E[z1] 2 Cov(z2) + 3 E[z2] 2 Cov(z1) . To complete the proof, we bound the first term in the last expression: Var((z1 E[z1]) (z2 E[z2])) = E ((z1 E[z1]) (z2 E[z2]))2 = E[(z1 E[z1]) Cov(z2)(z1 E[z1])] E z1 E[z1] 2 Cov(z2) = Tr(Cov(z1)) Cov(z2) . Using the two previous results, we can establish a bound on the expectation and variance of ζb j. Lemma F.5. In subroutine GRADEST, let P be a projection matrix of an ℓdimensional subspace. Suppose S j has m i.i.d. samples from D0 and, Sb 1,j and Sb 1,j+T1 have m i.i.d. samples from Di for some i {0, 1, ..., k 1}. Than we have: E[ζb j] = E Di[P f(x, y, w, κ)] E D0[P f(x, y, w, κ)] 2 Var(ζb j) 48 m2 κ4ℓC1 2 + 24 m κ2 E[ζb j]C1. Proof. Let z1 = P f(Sb 1,j, w, κ) f(S j , w, κ) and z2 = P f(Sb 1,T1+j, w, κ) f(S T1+j, w, κ) . From Corollary F.3, we know that E[z1] = E[z2] = E Di[P f(x, y, w, κ)] E D0[P f(x, y, w, κ)], Cov(z1) = Cov(z2) = 4κ2 Tr(Cov(z1)) = Tr(Cov(z2)) = 4 Note that ζb j = z1 z2. Then bound on the variance of ζb j follows by combining the above bounds with Theorem F.4. 
Finally, the expected value of ζb j is: E[ζb j] = E[z1] E[z2] = E[z1] 2. The following lemma provides a characterization of the minimum batch length in b B and the size of the collection S required for successful testing in subroutine GRADEST. Lemma F.6. In subroutine GRADEST, let P be a projection matrix of an ℓdimensional subspace, T1 Ω(log | b B| δ ), and each batch b b B has at least |Sb| 4T1Ω( ℓ ϵ2 ) samples, and |S | = ℓ ϵ2 ). Then with probability 1 δ , the subset B in subroutine GRADEST satisfy the following: 1. | B| retains all the batches in b B that had samples from D0. 2. B does not contain any batch that had samples from Di if i is such that E Di[P f(x, y, w, κ)] E D0[P f(x, y, w, κ)] > 2ϵκ p Proof. The lower bound on |Sb| in the lemma ensures that for each batch b and all j [2T1], we have Sb 1,j = Ω( ℓ ϵ2 ), and the lower bound on |S| ensures that for all j [2T1], Sj = Ω( First, consider the batches that have samples from the distribution D0. For any such batch b and j [T1], from Lemma F.5, we have E[ζb j] = 0 and Var(ζb j) = O(ϵ4κ2C1 2). Therefore, for T1 Ω(log | b B| δ ), it follows that with probability 1 δ /2 for every batch b b B that has samples from D0 the median of {ζb j}j [T1] will be less than ϵ2κ2C1, and it will be retained in B. This completes the proof of the first part. Next, consider the batches that have samples from any distribution Di for which E Di[P f(x, y, w, κ)] E D0[P f(x, y, w, κ)] > 2ϵκ p For any such batch b and j [T1], according to Lemma F.5, we have E[ζb j] 4ϵ2κ2C1 and Var(ζb j) = O(E[ζb j]2). Hence, for T1 Ω(log | b B| δ ), it follows that with probability at least 1 δ /2, the median of {ζb j}j [T1] for every batch will be greater than ϵ2κ2C1, and those batches will not be included in B. This completes the proof of the second part. The following theorem characterizes the number of samples required in B for an accurate estimation of . Lemma F.7. Suppose the conclusions in Lemma F.6 hold for B defined in subroutine GRADEST, T2 Ω(log 1 δ ), each batch b B has size nm, and | B| nm 2T2Ω( ℓ ϵ2 ), then with probability 1 δ the estimate returned by subroutine GRADEST satisfy E D0[P f(x, y, w, κ)] 9ϵκ p Proof. Recall that in subroutine GRADEST, we defined b B P f(Sb 2,i, w, κ). Let zb i = P f(Sb 2,i, w, κ). From Lemma F.6, for all b B, we have E[zb i ] E D0[P f(x, y, w, κ)] 2ϵκ p E[ i] E D0[P f(x, y, w, κ)] = b B E[zb i ] E D0[P f(x, y, w, κ)] E[zb i ] E D0[P f(x, y, w, κ)] 2ϵκ p Next, from Lemma F.2, Cov(zb i ) κ2 |Sb 2,i|C1 = T2κ2 |Sb 2| C1 = 2T2κ2 |Sb| C1 2T2κ2C1 minb B |Sb|, (25) where the two equalities follow because for all batches b B, |Sb 2,i| = |Sb 2|/T2 and |Sb 2| = |Sb|/2. Cov( i) = 1 | B| max b B Cov(zb i ) 2T2κ2C1 | B| minb B |Sb| Since i lies in an ℓdimensional subspace of Rd, it follows that Tr(Cov( i)) ℓ Cov( i) 2ℓT2κ2C1 | B| minb B |Sb| Note that Var( i E[ i] ) = Tr(Cov( i)). Then, from Chebyshev s bound: Pr[ i E[ i] ϵκ p C1] Var( i E[ i] ) ϵ2κ2C1 2ℓT2 ϵ2| B| minb B |Sb| 1/8. Combining above with Equation (24), Pr[ i E D0[P f(x, y, w, κ)] 3ϵκ p Let D := {i [T2] : i E D0[P f(x, y, w, κ)] 3ϵκ C1 }. Then, for T2 = Ω(log 1 δ ), with probability 1 δ , we have Recall that in the subroutine, we defined ξi = median{j [T2] : i j } and i = arg min{i [T2] : ξi}. From the definition of D, and triangle inequality, for all i, j D, we have i j 6ϵκ C1. Therefore, if |D| 1 2T2, then for any i D, ξi 6ϵκ C1. This would imply ξi 6ϵκ C1. Furthermore, since |D| 1 2T2, there exist at least one i D such that i i 6ϵκ C1. 
Using the definition of D, and the triangle inequality, we can conclude that i E D0[P f(x, y, w, κ)] i E D0[P f(x, y, w, κ)] + i i 9ϵκ p Theorem F.1 then follows by combining lemmas F.6 and F.7. G Number of steps required The following lemma shows that with a sufficiently accurate estimation of the expectation of gradients, a logarithmic number of gradient descent steps are sufficient in the main algorithm 1. Lemma G.1. For ϵ > 0, suppose (r) Σ0( ˆw(r) w0) 1 2 ˆw(r) w0 + ϵσ 4 , and R = Ω(C1 log w0 σ ), then ˆw(r) w0 ϵσ. Proof. Recall that ˆw(r+1) = ˆw(r) 1 C1 (r). Then we have: ˆw(r+1) w0 = ˆw(r) w0 1 C1 (r) = ( ˆw(r) w0) I 1 C1 (Σ0( ˆw(R+1) w0) (r)). Using triangle inequality, we obtain: ˆw(r+1) w0 ˆw(r) w0 I 1 C1 Σ0( ˆw(r) w0) (r) ˆw(r) w0 1 1 ˆw(r) w0 1 1 2C1 Using recursion, we have: ˆw(R+1) w0 ˆw(1) w0 1 1 2C1 ˆw(1) w0 exp R + 2C1 ϵσ 4C1 ϵσ, where the second inequality follows from the upper bound on the sum of infinite geometric series and the last inequality follows from the bound on R and ˆw(1) = 0. H Final Estimation Guarantees Proof of Theorem 2.1. We show that with probability 1 δ, for each r [R], the gradient computed by the algorithm satisfies (r) Σ0( ˆw(r) w0) 1 2 ˆw(r) w0 + ϵσ 4 . Lemma G.1 then implies that for R = Ω(C1 log M σ ), the output returned by the algorithm ˆw = ˆw(R+1) satisfy ˆw w0 ϵσ. To show this, we fix r, and for this value of r, we show that with probability 1 δ/R, (r) Σ0( ˆw(r) w0) 1 2 ˆw(r) w0 + ϵσ 4 . Since each round uses an independent set of samples, the theorem then follows by applying the union bound. First, we determine the bound on the clipping parameter. From Theorem D.1, for |Sb |/R = Ω(C2 log 1/δ ), with probability 1 δ , we have s 8(C + 1)C1 E D0[(y x w(r))2] 2(C + 1)C1 E D0[(x (w(r) w0))2] + σ2 Next, employing Theorem C.2 and utilizing the lower bound on the clipping parameter in the above equation, we obtain the following bound on the norm of the expected difference between clipped and unclipped gradients: E D0[( f(x, y, w(r), κ(r)) Σ0(w(r) w0) ϵ1 w(r) w0 . (27) Recall that in Bs, at least αs fraction of the batches contain samples from D0. When Bs is divided into R equal random parts, w.h.p. each part B(r) s will have at least αs fraction of the batches containing samples from D0. From Theorem E.1, if |B(r) s | = |Bs| R = Ω d αsϵ2 2 1 αsϵ2 2 + C2 2 C1 δ , then with probability 1 δ , the projection matrix P (r) satisfies E D0[ f(x, y, w(r), κ(r))] P (r) E D0[ f(x, y, w(r), κ(r))] 4ϵ2κ(r)p The above equation shows subroutine GRADSUBEST finds projection matrix P (r) such that the expected value of clipped gradients projection is roughly the same as the expected value of the clipped gradient. Next, we show that subroutine GRADEST provides a good estimate of the expected value of clipped gradients projection. Let N denote the number of batches in Bm that have samples from D0. If N Ω(R + log 1/δ ) then with probability 1 δ , B(r) m has Θ(N/R) batches sampled from D0. If each batch in Bm and batch b has more than nm samples, nm ℓ ϵ2 2 log( |Bm| ϵ2 2 log 1/δ ), then from Theorem F.1, with probability 1 δ (r) E D0[P (r) f(x, y, w(r), κ(r))] 9ϵ2κ(r)p Combining the above three equations using triangle inequality, (r) Σ0( ˆw(r) w0) 13ϵ2κ(r)p C1 + ϵ1 w(r) w0 , (30) with probability 1 5δ . 
In equation (26) using the upper bound, E D0[(x (w(r) w0))2] w(r) w0 2 Σ0 C1 w(r) w0 2 we get 2(C + 1)C1 2 w(r) w0 2 + (C + 1)C1σ2 2(C + 1) ϵ1 (C1 w(r) w0 + p Combining the two equations, (r) Σ0( ˆw(r) w0) 364ϵ2 p 2(C + 1)C1 ϵ1 (C1 w(r) w0 + p C1σ) + ϵ1 w(r) w0 , with probability 1 5δ There exist universal constants c1, c2 > 0 such that for ϵ1 = c1 and ϵ2 = c2 C1 C+1 , the quantity on the right is bounded by w(r) w0 /2 + ϵσ/4. We choose these values for ϵ1 and ϵ2 and δ = δ 5R. From the above discussion, it follows that if |Bs| = Ω d αs2ϵ4 , nm Ω( ℓ ϵ2 ), and Bm has ϵ2 batches sampled from D0, then with probability 1 δ/R, (r) Σ0( ˆw(r) w0) 1 2 ˆw(r) w0 + ϵσ Using ℓ= min{k, 1 ϵ2αs }, we get the bounds on the number of samples and batches required by the algorithm. I Proof of Lemma E.3 To establish the lemma, we first introduce and prove two auxiliary lemmas. Lemma I.1. For k > 0, and a probability distribution (p0, p1, ..., pk 1) over k elements, let Z = Pk 1 i=0 p0ziz i , where zi are d-dimensional vectors. Then for all ℓ 0, ℓth largest singular value of Z is bounded by maxi zi 2/ℓ. Proof. Note that Z is a symmetric matrix, so its left and right singular values are the same. Let v1, v2, ... be the singular vectors in the SVD decomposition of Z, and let a1 a2 a3 ... be the corresponding singular values. Using the properties of SVD, we have: i v i Zvi = X j=0 pjzjz j j=0 pj zj 2 max j zj 2. Next, we have: X i ℓ aℓ= ℓ aℓ. Combining the last two equations yields the desired result. Lemma I.2. Let u1, u2, .., uℓ Rd be ℓmutually orthogonal unit vectors, and let U = [u1, u2, ..., uℓ] Rd ℓ. For any set of k vectors z0, z1, ..., zk 1 Rd, non-negative reals p0, p1, ..., pk 1, and reals a1, a2, ..., aℓ, we have: (I UU )z0 2 Pk 1 i=1 piziz i P j [ℓ] ajuju j Proof. Let v = (I UU )z0. First we show that for all j [ℓ], the vectors v and uj are orthogonal, u j (I UU )z0 = (u j z0) (u j z0) = 0. Then, v Pk 1 i=0 piziz i P j [ℓ] ajuju j v = v Pk 1 i=0 piziz i v = Pk 1 i=0 pi(z i v)2 p0(z 0v)2 Next, we have: z 0v = z 0(I UU )v + z0UU v = z 0(I UU )v = v v = v 2 Combining the last two equations, we obtain: v 2 Pk 1 i=0 piziz i P j [ℓ] ajuju j v Pk 1 i=0 ziz i P j [ℓ] ajuju j v p0 v 4. Dividing both sides by v 2 completes the proof. Next, combining the above two auxiliary lemmas we prove Lemma E.3. Proof of Lemma E.3. Let Λi( ) denote the ith largest singular value of a matrix. Let ˆ M be rank ℓ truncated-SVD of M, then it follows that, M ˆ M = Λℓ+1(M). First, we consider the case ℓ< k. By applying Weyl s inequality for singular values, we have Λℓ+1(M) Λℓ+1(Z) + Λ1(M Z) maxj zj 2 ℓ+ 1 + M Z , where the last equation follows from Lemma I.1. First applying the triangle inequality, and then using the above two equations, we have ˆ M Z M ˆ M + M Z maxj zj 2 ℓ+ 1 + 2 M Z . Combining the above equation with Lemma I.2, we have: (I UU )z0 2 2(ℓ+ 1) M Z + maxj zj 2 This completes the proof for ℓ< k. To prove for the case ℓ> k, we use Λℓ+1(Z) = 0 in place of the bound Λℓ+1(Z) maxj zj 2 ℓ+1 in the above proof for the case ℓ< k. J Removing the Additional Assumptions To simplify our analysis, we made two assumptions about the data distributions. We now argue that these assumptions are not limiting. The first additional assumption was that there exists a constant C2 > 0 such that for all components i {0, 1, . . . , k 1} and random samples (x, y) Di, we have x C2 d almost surely. In the non-batch setting, Cherapanamjeri et al. (2020) [CAT+20] have shown that this assumption is not limiting. 
They showed that if the other assumptions are satisfied, then there exists a constant C2 such that with probability 0.99 we have ‖x‖ ≤ C2√d. Therefore, disregarding the samples for which ‖x‖ > C2√d does not significantly reduce the data size. Moreover, it has minimal impact on the covariance matrices and hypercontractivity constants of the distributions. This argument easily extends to the batch setting. In the batch setting, we first exclude samples with ‖x‖ > C2√d from their batches. Then we remove small-size batches left with at most two samples and medium-size batches that have been reduced by more than 10% of their original size. It is easy to show that w.h.p. the fraction of medium and small-size batches that gets removed for any component is at most 10%. Hence, this assumption can be removed with a small increase in the batch size and the number of required samples in our main results.

Next, we address the assumption that the noise distribution is symmetric. We can handle this by employing a simple trick. Consider two independent and identically distributed (i.i.d.) samples (x1, y1) and (x2, y2), where yi = w⊤xi + ηi. We define x = (x1 − x2)/√2, y = (y1 − y2)/√2, and η = (η1 − η2)/√2. Note that the distribution of η is symmetric around 0, the covariance of x is the same as that of the xi, and the variance of η is the same as that of the ηi. Furthermore, we have y = w⊤x + η. Therefore, the new sample (x, y) obtained by combining two i.i.d. samples satisfies the same distributional assumptions as before and, in addition, has a symmetric noise distribution. We can combine every two samples in a batch using this approach, which only reduces the size of each batch by a factor of 2. Thus, the assumption of symmetric noise can be eliminated by increasing the required batch sizes in our theorems by a factor of 2.

K Algorithm's Parameter Requirements

The algorithm requires two distribution parameters: the L4–L2 hypercontractivity parameter C and the condition number C1 of the input covariance matrices. These parameters significantly generalize the input distributions typically considered in mixed linear regression, which has primarily focused on sub-Gaussian distributions with an identity covariance matrix. Specifically, the L4–L2 hypercontractivity parameter generalizes the sub-Gaussian assumption and allows for heavier-tailed distributions, while the condition number C1 of the input covariance matrices generalizes the identity-covariance assumption, which corresponds to C1 = 1. Additionally, the algorithm requires two population parameters: the number of sub-populations k and the fraction αs of batches from the relevant sub-population. Note that the algorithm combines these two parameters into a single parameter ℓ. Finally, the algorithm requires two regression parameters: the norm bound M on the regression vector and the additive noise level σ. In total, the algorithm relies on six parameters, two of which can be combined into one. Although this may seem like a lot, the algorithm requires only an upper bound on each value. If the upper bounds are tight, they will result in a good estimate. If for some parameter we do not have a tight upper bound, we can run the algorithm for different values of that parameter, dividing its possible range logarithmically. This approach increases the size of the resulting list at most logarithmically, ensuring that at least one estimate in the list will be accurate.
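To illustrate the last point, here is a minimal, hypothetical sketch of sweeping a logarithmic grid over one unknown parameter bound; `run_algorithm` and its keyword argument are placeholders rather than the paper's actual interface.

```python
import numpy as np

def sweep_parameter(run_algorithm, data, lo, hi, **fixed_params):
    """Run the algorithm with logarithmically spaced guesses for one unknown
    upper bound (e.g., the noise level) and pool the resulting candidate
    lists. Consecutive guesses differ by at most a factor of 2, so if the
    true value lies in [lo, hi], some guess is a tight upper bound and the
    pooled list should contain at least one good estimate."""
    candidates = []
    for guess in np.geomspace(lo, hi, num=int(np.ceil(np.log2(hi / lo))) + 1):
        candidates.extend(run_algorithm(data, sigma_upper_bound=guess, **fixed_params))
    return candidates
```

The pooled list is only logarithmically larger than a single run's list, matching the discussion above.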
L More Simulation Details

Setup. We have sets Bs and Bm of small and medium-size batches and k distributions Di for i ∈ {0, 1, . . . , k − 1}. For a subset of indices I ⊆ {0, 1, . . . , k − 1}, both Bs and Bm have a fraction α of batches that contain i.i.d. samples from Di for each i ∈ I. For each i ∈ {0, 1, . . . , k − 1}\I in the remaining set of indices, Bs and Bm have a (1 − |I|/16)/(k − |I|) fraction of batches with i.i.d. samples from Di. In all figures the output noise is distributed as N(0, 1). All small batches have 2 samples each, while medium-size batches have nm samples each, which we vary from 4 to 32, as shown in the plots. We fix the data dimension d = 100, α = 1/16, the number of small batches |Bs| = min{8dk², 8d/α²}, and the number of medium batches |Bm| = 256. In all the plots, we average over 10 runs.

[Figure 3: Same input distribution (standard normal), k = 16, small minimum distance between regression vectors, recovering all components. Curves compare KSSKO and Ours at new-batch sizes of 4 and 8 samples, as a function of medium batch size.]

[Figure 4: Different input distributions, k = 100, large minimum distance between regression vectors, recovering 4 components that have a 1/16 fraction of batches each. Curves show Ours at new-batch sizes of 4 and 8 samples, as a function of medium batch size.]

Evaluation. Our objective is to recover a small list containing good estimates of the regression vectors of Di for each i ∈ I. We compare our proposed algorithm's performance with that of the algorithm in [KSS+20]. We generate lists of regression vector estimates LOurs and LKSSKO using our algorithm and [KSS+20], respectively. Then, we create 1600 new batches, each containing nnew i.i.d. samples drawn from the distribution Di, where for each batch the index i is chosen randomly from I. Each list enables the clustering of the new sample batches: to cluster a batch using a list, we assign it to the regression vector in the list that achieves the lowest mean square error (MSE) on its samples. To evaluate the average MSE for each algorithm, for each clustered batch we generate additional samples from the distribution that the batch was generated from and calculate the error achieved by the regression vector that the batch was assigned to. We then average this error over all batches. We evaluate both algorithms' performance for new batch sizes nnew = 4 and nnew = 8, as shown in the plots.
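As an illustration of this evaluation protocol, the sketch below assigns each new batch to the candidate regression vector with the lowest MSE on its samples and then scores that candidate on held-out samples from the same distribution; the function and variable names are ours, and the per-batch (X, y) data layout is an assumption rather than the paper's code.

```python
import numpy as np

def assign_batch(X, y, candidates):
    """Assign a batch to the candidate regression vector with the
    lowest mean squared error on the batch's samples."""
    mses = [np.mean((X @ w - y) ** 2) for w in candidates]
    return int(np.argmin(mses))

def average_heldout_mse(new_batches, heldout_batches, candidates):
    """For each new batch, pick a candidate by MSE on its n_new samples,
    then score that candidate on held-out samples from the same distribution."""
    errors = []
    for (X, y), (Xh, yh) in zip(new_batches, heldout_batches):
        w = candidates[assign_batch(X, y, candidates)]
        errors.append(np.mean((Xh @ w - yh) ** 2))
    return float(np.mean(errors))
```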
Minimum distance between regression vectors. Our theoretical analysis suggests that our algorithm is robust to the case when the minimum distance between the regression vectors is much smaller than their norms. To test this, in Figure 3 we generate half of the regression vectors with elements independently and uniformly distributed in U[9, 11], and the other half with elements independently and uniformly distributed in U[−11, −9]. Notably, the minimum gap between the vectors in this case is much smaller than their norm. It can be seen that the performance gap between our algorithm and the one in [KSS+20] increases significantly as we deviate from the assumptions required for the latter algorithm to work.

Number of different distributions. Our algorithm can notably handle very large k (even infinite) while still recovering regression vectors for the subgroups that represent a sufficient fraction of the data. In the last plot, we set k = 100 and I = {0, 1, 2, 3} to highlight this ability. In this case, the first four distributions each generate a 1/16 fraction of batches, and the remaining 96 distributions each generate a 1/128 fraction of batches. We provide the algorithm with one additional medium-size batch from Di for each i ∈ I to identify a list of size |I|. The results are plotted in Figure 4, where we can see that the performance improves with the medium batch size, as expected. Note that the algorithm in [KSS+20] cannot be applied to this scenario.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: These claims are justified by our theorems and the experimental results.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We present such a discussion in the last section of the main paper.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All assumptions and proofs are clearly stated in the paper.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We include the details to reproduce our experiments in the paper. We also include the scripts to reproduce all our results in the supplementary material.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in the supplemental material?
Answer: [Yes]
Justification: Our experiments are on synthetic data, and we include the scripts to reproduce all our results in the supplementary material.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide these details in the experiment sections in the appendix and the main paper. We also provide the code used to generate the figures in the paper.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We report the standard error over 10 runs in all our plots.
- The method for calculating the error bars should be explained (closed-form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: Our experiments are not compute intensive; they were run on a 16-inch MacBook Pro laptop with 16 GB RAM and an Intel processor (2020 model), and each took less than 6 hours to finish.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We have read the Code of Ethics and our submission abides by it.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: This paper presents work whose goal is to advance the field of Machine Learning. There could be many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: We do not release any new dataset or model.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We used an algorithm from prior work [KSS+20] and we have cited the original paper.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets
Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.