# Scaling-Up Robust Gradient Descent Techniques

Matthew J. Holland, Osaka University (matthew-h@ar.sanken.osaka-u.ac.jp)

*Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.*

We study a scalable alternative to robust gradient descent (RGD) techniques that can be used when losses and/or gradients can be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we choose a candidate which does not diverge too far from the majority of cheap stochastic sub-processes run over partitioned data. This lets us retain the formal strength of RGD methods at a fraction of the cost.

## Introduction

Obtaining strong contracts for the performance of machine learning algorithms is difficult, a point described lucidly in a keynote lecture by L. Bottou (Bottou 2015). Classical tasks in computer science, such as sorting integers or simple matrix operations, come with lucid worst-case guarantees: with enough resources, the job can be done correctly and completely. In machine learning, since we only have access to highly impoverished information regarding the phenomena or goal of interest, the learning task is inevitably uncertain, and any meaningful performance guarantee can only be stated with some degree of confidence, typically over the random draw of the data used for training.

This uncertainty is reflected in the standard formulation of machine learning tasks as risk minimization problems (Vapnik 1982; Haussler 1992). Here we consider risk minimization over some set of candidates $W \subseteq \mathbb{R}^d$, where the risk of $w$ is defined as the expected loss to be incurred by $w$, namely

$$R_P(w) := E_P L(w; Z) = \int_{\mathcal{Z}} L(w; z)\, P(dz), \quad w \in W.$$

Here we have a loss function $L : W \times \mathcal{Z} \to \mathbb{R}_+$, and random data $Z \sim P$ takes values in a set $\mathcal{Z}$. At most, any learning algorithm will have access to $n$ data points sampled from $P$, denoted $Z_1, \ldots, Z_n$. Write $(Z_1, \ldots, Z_n) \mapsto \hat{w}_n$ to denote the output of an arbitrary learning algorithm. The usual starting point for analyzing algorithm performance is the estimation error $R_P(\hat{w}_n) - R_P^*$, where $R_P^* := \inf\{R_P(w) : w \in W\}$, or more precisely, the distribution of this error. Since we never know much about the underlying data-generating process, typically all we can assume is that $P$ belongs to some class $\mathcal{P}$ of probability measures on $\mathcal{Z}$, and typical guarantees are given in the form of

$$P\left\{R_P(\hat{w}_n) - R_P^* > \varepsilon(n, \delta, P, W)\right\} \leq \delta, \quad \forall\, P \in \mathcal{P}.$$

Flipping the inequalities around, this says that the algorithm generating $\hat{w}_n$ enjoys $\varepsilon$-good performance with $(1-\delta)$-high confidence over the draw of the sample, where the error level depends on the sample size $n$, the desired confidence level $\delta$, the underlying data distribution $P$, and any constraints encoded in $W$, not to mention the nature of the loss $L$. Ideally, we would like formal guarantees to align as closely as possible with performance observed in the real world by machine learning practitioners. With this in mind, the following properties are important to consider.

1. **Transparency:** can we actually compute the output $\hat{w}_n$ that we study in theory?
2. **Strength:** what form do bounds on $\varepsilon(n, \delta, P, W)$ take? How rich is the class $\mathcal{P}$?
3. **Scalability:** how do computational costs scale with the above-mentioned factors?
Balancing these three points is critical to developing guarantees for algorithms that will actually be used in practice. If strong assumptions are made on the data distribution (i.e., if $\mathcal{P}$ is a small class), then most of the data any practitioner runs into will fall out of scope. If the error bound grows too quickly with $1/\delta$ or shrinks too slowly with $n$, then either the guarantees are vacuous, or the procedure is truly sub-optimal. If the procedure outputting $\hat{w}_n$ cannot be implemented, then we run into a gap between what we code and what we study formally.

**Our problem setting.** In this work, we consider the setup of potentially heavy-tailed data, specialized to the case of strongly convex loss functions; in related work (Holland 2021), we consider a completely different approach for when we do not have strong convexity. More concretely, all the learner can know is that for some $m < \infty$,

$$\forall\, P \in \mathcal{P}: \quad \sup_{w \in W} E_P |L(w; Z)|^m < \infty, \tag{1}$$

where typically $m = 2$. Thus, it is unknown whether the losses (or partial derivatives, etc.) are congenial in a sub-Gaussian sense (where (1) holds for all $m$), or heavy-tailed in the sense that all higher-order moments could be infinite or undefined. We next review the related technical literature, and give an overview of our contributions.

## Context and Contributions

With the three properties of transparency, strength, and scalability highlighted in the previous section in mind, for the next few paragraphs we look at the characteristics of several important families of learning algorithms.

**ERM: can scale well, but lacks robustness.** Classical learning theory is primarily centered around empirical risk minimization (ERM) (Vapnik 1998; Anthony and Bartlett 1999), and studies the statistical properties that hold for any minimizer of the empirical risk, namely

$$\hat{w}_n \in \arg\min_{w \in W} \frac{1}{n} \sum_{i=1}^{n} L(w; Z_i). \tag{2}$$

Clearly, this leaves all algorithmic aspects of the problem totally abstract, and opens up the possibility for substantial gaps between the performance of good and bad ERM solutions, as studied by Feldman (2017). Furthermore, the empirical mean is sensitive to outliers, and formally speaking is sub-optimal in the sense that it cannot achieve sub-Gaussian error bounds under potentially heavy tails, while other practical procedures can; see Catoni (2012) and Devroye et al. (2016) for comprehensive studies. Roughly speaking, the empirical mean cannot guarantee better error bounds than those which scale as $\Omega(1/\sqrt{\delta n})$. In the context of machine learning, these statistical limitations have an important implication for the feedback available to any learner which tries to directly minimize the empirical risk, effectively lower-bounding the statistical error (in contrast to the optimization error) incurred by any such procedure.

**Robust risk minimizers: strong in theory, but lack transparency.** To deal with the statistical weaknesses of ERM, it is natural to consider algorithms based on more robust feedback, i.e., minimizers of estimators of the risk which provide stronger guarantees than the empirical mean under potentially heavy tails. A seminal example of this is the work of Brownlees, Joly, and Lugosi (2015), who consider learning algorithms of the form

$$\hat{w}_n \in \arg\min_{w \in W} \hat{R}(w), \quad \text{s.t.} \quad \sum_{i=1}^{n} \psi\!\left(\frac{L(w; Z_i) - \hat{R}(w)}{s}\right) = 0, \tag{3}$$

where $s > 0$ is a scaling parameter. That is, they consider minimizers of an M-estimator of the risk, using an influence function $\psi$ of the type studied by Catoni (2012). Under weak moment bounds like (1), their minimizers enjoy $O(1/\sqrt{n})$ rates with $O(\log(\delta^{-1}))$ dependence on the confidence.
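As a toy illustration of this statistical gap (not taken from the paper), the following sketch compares the high-confidence deviations of the empirical mean with those of a simple one-dimensional median-of-means estimator on heavy-tailed data; the Pareto distribution and the block count $k = 10$ are arbitrary choices made purely for the demonstration.

```python
# Toy comparison: empirical mean vs. median-of-means (MoM) under heavy tails.
# MoM is the archetypal robust mean estimator behind the methods cited above.
import numpy as np

rng = np.random.default_rng(0)

def median_of_means(x, k):
    """Split x into k blocks, average each block, return the median of the means."""
    blocks = np.array_split(x, k)
    return np.median([b.mean() for b in blocks])

n, trials, true_mean = 1000, 2000, 1.5   # Pareto(3) on [1, inf) has mean 3/2
err_mean, err_mom = [], []
for _ in range(trials):
    x = rng.pareto(3.0, size=n) + 1.0    # heavy-tailed, finite variance
    err_mean.append(abs(x.mean() - true_mean))
    err_mom.append(abs(median_of_means(x, k=10) - true_mean))

# Compare 99th-percentile deviations over trials (a proxy for delta = 0.01).
print(np.quantile(err_mean, 0.99), np.quantile(err_mom, 0.99))
```

Runs of this kind typically show the two estimators behaving similarly in the bulk, while the empirical mean's high-confidence deviations are markedly worse, mirroring the $\Omega(1/\sqrt{\delta n})$ versus $O(\sqrt{\log(\delta^{-1})/n})$ comparison sketched above.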
Such robust risk estimators provide a significant improvement in terms of the strength of guarantees compared with ERM, but unfortunately the issue of transparency remains. As with ERM, the algorithmic side of the problem is left abstract here, and in general it may even pose a much more difficult computational task. Observe that the new objective $\hat{R}(\cdot)$ cannot be written in closed form, and even if $L(\cdot; Z)$ is convex, $\hat{R}(\cdot)$ need not preserve such convexity. Direct optimization is hard, but verifying improvement in the function value is easy, and some researchers have utilized a guess-and-check strategy to make the approach viable in practice (Holland and Ikeda 2017). However, these methods are inexact, and due to optimization error, strictly speaking the algorithm being run does not enjoy the full guarantees given by Brownlees, Joly, and Lugosi (2015) for the ideal case.

**Robust gradient descent: transparent, but scales poorly.** To try and address the issue of transparency without sacrificing the strength of formal guarantees, several new families of algorithms have been designed in the past few years to tackle the potentially heavy-tailed setting using a tractable procedure. Such algorithms may naturally be called robust gradient descent (RGD), the naming being appropriate since their core updates all take the form

$$\hat{w}_{t+1} = \hat{w}_t - \alpha_t\, \hat{G}_n(\hat{w}_t), \tag{4}$$

and they are robust in the sense that the estimate $\hat{G}_n(w) \approx \nabla R_P(w)$ has deviations with near-optimal confidence intervals under potentially heavy-tailed data (i.e., both the losses and the partial gradients are potentially heavy-tailed). The most common strategy replaces the empirical risk gradient with a high-dimensional median-of-means estimate (RGD-by-MoM) (Chen, Su, and Xu 2017b,a; Prasad et al. 2018; El-Mhamdi et al. 2019; Rajput et al. 2020). Other strategies involve dimension-wise truncation (Holland 2019) and M-estimation of the risk partial derivatives (RGD-M) (Holland and Ikeda 2019). Both approaches enjoy error bounds with optimal dependence on $n$ and $1/\delta$ under potentially heavy-tailed data, with the significant merit that the computational procedures are transparent and easy to implement as-is. Unfortunately, instead of a simple one-dimensional robust mean estimate as in (3), all RGD methods rely on sub-routines that work in $d$ dimensions. This makes the procedures much more expensive computationally for big learning tasks, and leads to an undesirable dependence on the ambient dimension $d$ in the statistical guarantees as well, hampering their overall scalability.

Summarizing the above points: first, ERM and robust risk minimizers leave the potential for a severe gap between what is guaranteed on paper and what is done in practice; second, both the formal guarantees and the computational requirements of RGD methods do not scale well to high-dimensional learning tasks. For comparison, we show concrete error bounds for RGD procedures under strongly convex losses in Table 1. The key issues are clear: even when working with the Euclidean geometry, a quick glance at the proofs in the cited works on RGD shows that direct dependence on $d$ in the error bounds is unavoidable. Furthermore, the extra computational overhead, scaling at least linearly in $d$, must be incurred at every step in the iterative procedure, which severely hurts scalability.

| Method | Error | Cost |
| --- | --- | --- |
| DC-SGD (Algorithm 1) | $O\!\left(\frac{\log(\delta^{-1})}{n}\right)$ | $O(dn) + \text{cost}(\texttt{Merge})$ |
| RGD-by-MoM (Chen et al. 2017a) | $O\!\left((1-c)^{2T}\right) + O\!\left(\frac{k(d + \log(\delta^{-1}))}{n}\right)$ | $O(Tdn) + T \cdot \text{cost}(\texttt{GeoMed})$ |
| RGD-M (Holland and Ikeda 2019) | $O\!\left((1-c)^{2T}\right) + O\!\left(\frac{d(\log(d\delta^{-1}) + \log(n))}{n}\right)$ | $O(Tdn)$ |

Table 1: Comparison of performance guarantees for different learning algorithms. Error refers to $1-\delta$ confidence intervals for $R_P(\cdot) - R_P^*$, evaluated at the output of each algorithm after $T$ iterations (with DC-SGD using $T = n$ by definition). All error bounds stated are under the assumptions of Theorem 1. Here GeoMed refers to the geometric median sub-routine (Minsker 2015; Vardi and Zhang 2000). Cost estimates assume the availability of $k$ cores for parallel computations. See appendix for additional details.
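To make the per-step overhead concrete, here is a minimal sketch of a single RGD update of the form (4). For brevity the block means are aggregated with a coordinate-wise median, whereas the cited RGD-by-MoM methods typically use a geometric median of the block means (sketched later); the per-example gradient oracle `grad_fn` is an assumed stand-in, not part of the paper's code.

```python
# One robust gradient descent step with a median-of-means gradient estimate.
import numpy as np

def rgd_mom_step(w, Z, grad_fn, k, alpha):
    """One update w <- w - alpha * MoM(gradient), using k data blocks."""
    blocks = np.array_split(Z, k)                 # split the sample into k blocks
    g = np.stack([np.mean([grad_fn(w, z) for z in b], axis=0) for b in blocks])
    return w - alpha * np.median(g, axis=0)       # robust d-dimensional aggregate
```

Note that this $d$-dimensional aggregation must be re-computed at every iteration, which is precisely the per-step overhead (and a source of the $d$-dependence) that the approach studied below is designed to avoid.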
### Our contributions

Considering the issues highlighted above, in particular the scalability of modern RGD techniques in terms of loose guarantees and computational cost, here we investigate a different algorithmic approach of equal generality, with the goal of achieving as-good or better dependence on $n$, $d$, and $1/\delta$, under the same assumptions, and in provably less time for larger problems. The core technique uses distance-based rules to select among independent weak candidates, which are implemented using inexpensive stochastic gradient-based updates. Our main contributions:

- We analyze a flexible and robust archetype for learning under heavy tails (Algorithm 1), and obtain sharp high-probability error bounds (Theorem 1) that improve on the poor dimension dependence of existing RGD routines under strongly convex risks when both the losses and gradients can be heavy-tailed (see Table 1).
- The procedure outlined in Algorithm 1 is simple to implement and amenable to distributed computation, providing superior computational scalability over existing serial RGD procedures, without sacrificing the strength or transparency of theoretical guarantees.
- Empirically, we study the efficiency and robustness of the proposed algorithm and its key competitors in a tightly controlled simulated setting (see the section on empirical analysis), verifying a substantial improvement in the cost-performance tradeoff, robustness to heavy-tailed data, and performance that scales well to higher dimensions.

Taken together, our results suggest a promising class of algorithms for convex risk minimization, which achieve an appealing balance between transparency, strength, and scalability.

## Theoretical Analysis

### Preliminaries

First we establish some basic notation, and organize technical details in one place for ease of reference.

**Notation.** For any positive integer $k$, write $[k] := \{1, \ldots, k\}$. For any index set $I \subseteq [n]$, write $Z_I := (Z_i)_{i \in I}$, defined analogously for an independent copy $Z'_I$. To keep the notation simple, in the special case of $I = [n]$ we write $Z_n := Z_{[n]} = (Z_1, \ldots, Z_n)$. We use $P$ as a generic symbol to denote computing probability; in most cases this will be the product measure induced by the sample $Z_n$ or $Z'_n$. For any function $f : \mathbb{R}^d \to \mathbb{R}$, denote by $\partial f(u)$ the subdifferential of $f$ evaluated at $u$. The variance of the loss is denoted by $\sigma_P^2(w) := \operatorname{var}_P L(w; Z) = E_P (L(w; Z) - R_P(w))^2$ for each $w \in W$. When we write $I\{\text{event}\}$, this refers to the indicator function which returns 1 when event is true, and 0 otherwise.

**Technical conditions.** The two key running assumptions that we make are related to independence and convexity. First, we assume that all the observed data are independent, i.e., the random variables $Z_i$ and $Z'_i$ taken over all $i \in [n]$ are independent copies of $Z \sim P$. Second, for each $z \in \mathcal{Z}$, we assume the map $w \mapsto L(w; z)$ is a real-valued convex function over $\mathbb{R}^d$, and that the parameter set $W \subset \mathbb{R}^d$ is non-empty, convex, and compact.
All results derived in the next sub-section hold for an arbitrary choice of $P \in \mathcal{P}$, where $\mathcal{P}$ satisfies (1) with $m = 2$. We say a function $f$ is $\lambda_1$-smooth if its gradient is $\lambda_1$-Lipschitz continuous, and $\mu$-strongly convex if $\langle u - v, \nabla f(u) - \nabla f(v) \rangle \geq \mu \|u - v\|^2$ for all $u, v$ (details given in the appendix). Finally, to make the formal statements technically simpler, we assume that $R_P(\cdot)$ achieves its minimum on the interior of $W$.

### Overview of algorithm

We study a general-purpose learning algorithm using a divide-and-conquer strategy, summarized in Algorithm 1. Following a partition of the data into $k$ disjoint subsets, a sub-routine SGD is used to generate $k$ candidates, and from these candidates, a final output is determined by Merge, a generic sub-routine whose desirable properties will be discussed shortly (see (7)).

> **Algorithm 1** Robust divide-and-conquer archetype; $\mathrm{DCSGD}[Z_n, w_0; k]$.
> **inputs:** sample $Z_n$, initial value $w_0 \in W$, parameter $1 \leq k \leq n$.
> 1. Disjoint partition: $\bigcup_{j=1}^{k} I_j = [n]$, with $|I_j| \geq \lfloor n/k \rfloor$, and $I_j \cap I_l = \emptyset$ when $j \neq l$.
> 2. Obtain inexpensive candidates: $\hat{w}^{(j)} = \mathrm{SGD}[w_0; Z_{I_j}, W]$, for each $j \in [k]$.
> 3. Follow the majority: **return** $\hat{w}_{\mathrm{DC}} = \mathrm{Merge}[\{\hat{w}^{(1)}, \ldots, \hat{w}^{(k)}\}; \|\cdot\|_2]$.

As for SGD, for concreteness we consider traditional (projected) stochastic gradient descent. The core update of an arbitrary point $w$ given data $Z \sim P$ is given by

$$\mathrm{SGD}[w; Z, \alpha, W] := \Pi_W\left(w - \alpha\, G(w; Z)\right). \tag{5}$$

Here $\alpha \geq 0$ denotes a step-size parameter, $\Pi_W$ denotes projection onto $W$ with respect to the $\ell_2$ norm, and the standard assumption is that the random vector $G(w; Z)$ satisfies $E_P\, G(w; Z) \in \partial R_P(w)$ for each $w \in W$. That is, we assume access to an unbiased estimate of some sub-gradient of the true risk. For an arbitrary sequence $(Z_1, Z_2, \ldots, Z_t)$ of length $t \geq 1$, let

$$\mathrm{SGD}[w_0; (Z_1, \ldots, Z_t), W] := \mathrm{SGD}[\hat{w}_{t-1}; Z_t, \alpha_{t-1}, W],$$

where $\hat{w}_{t-1}$ denotes the result of the first $t-1$ updates. Note that using (5), the right-hand side is defined recursively, and bottoms out at $t = 0$ with the pre-fixed initial value $w_0$; we suppress the step sizes $(\alpha_0, \ldots, \alpha_{t-1})$ from this notation for readability. For any arbitrary sub-index $I \subseteq [n]$, the quantity $\mathrm{SGD}[w_0; Z_I, W]$ is defined analogously; since the $Z_i$ are iid, the order of the sequence does not matter.

### Error Bounds Under Heavy-Tailed Losses and Gradients

We start by stating the main performance guarantee for Algorithm 1, which holds for potentially heavy-tailed losses/gradients, and then sketch out the core principles behind the proof.

**Theorem 1.** *Let the risk be $\mu$-strongly convex, and the loss be $\lambda_1$-smooth. Run Algorithm 1 with a sample size of at least $n \geq \max\{k, M_\delta\}$, where $k = \lceil 8 \log(\delta^{-1}) \rceil$ and*

$$M \geq \max\left\{ \frac{\lambda_1 \mu \|w_0 - w^*\|^2}{E_P \|G(w^*; Z)\|^2},\; 1 \right\}.$$

*For the initial update set $\alpha_0 = 1/(2\lambda_1)$, and for subsequent step sizes set $\alpha_t = a/(\mu t + b)$ for $t > 0$, with $b = 2a\lambda_1$, and $a > 0$ set such that $\alpha_t \leq \alpha_0$ for all $t$. Then, with probability no less than $1 - \delta$, we have*

$$R_P(\hat{w}_{\mathrm{DC}}) - R_P^* \leq \frac{a \lambda_1\, E_P \|G(w^*; Z)\|^2}{2} \cdot \frac{2c \log(\delta^{-1})}{\mu (n - M_\delta)},$$

*where $M_\delta := 16 \log(\delta^{-1}) (M \vee b)$, and $c$ depends only on the choice of Merge.*

Proving such a theorem is straightforward using the quadratic growth property of strongly convex functions in conjunction with $\lambda_1$-smoothness. When the risk is $\mu$-strongly convex, we have the critical property that points which are $\varepsilon$-far away from the minimum $w^*$ must be $(\varepsilon^2 \mu / 2)$-bad in terms of excess risk. As such, simple distance-based robust aggregation metrics can be used to efficiently boost the confidence. To start, we need a few basic facts which will be used to characterize a valid Merge operation; procedures with the property required here are called robust distance approximations by Hsu and Sabato (2016). Given $k$ points $u_1, \ldots, u_k \in \mathbb{R}^d$, the basic requirement is that the output of Merge be close to the majority of these points, in the appropriate norm.
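Before formalizing that requirement, here is a minimal sketch of the Algorithm 1 archetype, assuming for simplicity that $W$ is an $\ell_2$ ball of known radius; the gradient oracle `grad_fn` and all constants are illustrative stand-ins following the Theorem 1 schedule, not the paper's reference implementation, and `merge` is any callable meeting the requirement formalized below.

```python
# A minimal sketch of the DC-SGD archetype (Algorithm 1).
import numpy as np

def sgd_subprocess(w0, Z, grad_fn, alpha0, a, b, mu, radius):
    """Projected SGD over one data subset; single pass, last iterate returned."""
    w = w0.copy()
    for t, z in enumerate(Z):
        alpha = alpha0 if t == 0 else a / (mu * t + b)   # Theorem 1 step sizes
        w = w - alpha * grad_fn(w, z)
        norm = np.linalg.norm(w)
        if norm > radius:                                # projection onto W,
            w = w * (radius / norm)                      # assumed an l2 ball
    return w

def dc_sgd(Z, w0, k, grad_fn, merge, **kwargs):
    """Partition the sample into k disjoint subsets, run SGD on each, merge."""
    cands = [sgd_subprocess(w0, S, grad_fn, **kwargs)
             for S in np.array_split(Z, k)]              # disjoint partition
    return merge(np.stack(cands))                        # follow the majority
```

Each sub-process touches only its own subset, so the total gradient-computation cost is $O(dn)$ plus a single call to `merge` at the end, and the $k$ sub-processes are embarrassingly parallel.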
To make the majority requirement concrete, define

$$\Delta(u; \gamma, \{u_1, \ldots, u_k\}, \|\cdot\|) := \inf\left\{ r \geq 0 : \left|\{j \in [k] : \|u_j - u\| \leq r\}\right| > k\left(\tfrac{1}{2} + \gamma\right) \right\}. \tag{6}$$

When the other parameters are obvious from the context, we shall write simply $\Delta(u; \gamma) = \Delta(u; \gamma, \{u_1, \ldots, u_k\}, \|\cdot\|)$. In words, $\Delta(u; \gamma)$ is the radius of the smallest ball centered at $u$ which contains a $\gamma$-majority of the points $u_1, \ldots, u_k$. Using this quantity, our requirement on Merge is that for any $0 \leq \gamma < 1/2$ and $u \in \mathbb{R}^d$, we have

$$\|\bar{u} - u\| \leq c_\gamma\, \Delta(u; \gamma, \{u_1, \ldots, u_k\}), \quad \text{where } \bar{u} = \mathrm{Merge}[\{u_1, \ldots, u_k\}; \|\cdot\|]. \tag{7}$$

Here $c_\gamma$ is a factor that is independent of the choice of $u$ or the points $u_1, \ldots, u_k$ given, and depends only on the choice of $\gamma$.

Next, considering the partitioning scheme of Algorithm 1, the ideal case is of course where, given some desired performance level $R_P(\hat{w}^{(j)}) - R_P^* \leq \varepsilon$, the SGD sub-routine returns an $\varepsilon$-good candidate for all $j \in [k]$ subsets. In practice, we will not always be so lucky, but the following lemma shows that with enough candidates, most of them will be $\varepsilon$-good with high confidence.

**Lemma 2.** *Let $(S, \|\cdot\|)$ be any normed linear space. Let $X_1, \ldots, X_k$ be iid random entities taking values in $S$, and fix $x \in S$. For $\varepsilon > 0$, write $a_i(\varepsilon) := I\{\|X_i - x\| \leq \varepsilon\}$ and $\delta_\varepsilon := 1 - E\, a_1(\varepsilon)$. For any $0 \leq \gamma < (1/2 - \delta_\varepsilon)$, it follows that*

$$P\left\{ \sum_{i=1}^{k} a_i(\varepsilon) > k\left(\frac{1}{2} + \gamma\right) \right\} \geq 1 - e^{-2k(\gamma + \delta_\varepsilon - 1/2)^2}.$$

Applying Lemma 2 with the event $a_j(\varepsilon) = I\{R_P(\hat{w}^{(j)}) - R_P^* \leq \varepsilon\}$, we see that when $k$ scales with $\log(\delta^{-1})$, we can guarantee a $1-\delta$ probability good event in which at least a $\gamma$-majority of the candidates are $\varepsilon$-good. On this good event, via strong convexity it follows that a $\gamma$-majority of the candidates are $\sqrt{2\varepsilon/\mu}$-close to $w^*$, which means $\Delta(w^*; \gamma, \{\hat{w}^{(1)}, \ldots, \hat{w}^{(k)}\}) \leq \sqrt{2\varepsilon/\mu}$. Leveraging the requirement (7) on Merge, one obtains the following general-purpose boosting procedure.

**Lemma 3 (Boosting the confidence, under strong convexity).** *Assume the risk is $\mu$-strongly convex and $\lambda_1$-smooth, and that we have a learning algorithm $\hat{w}_{\mathrm{OLD}}$ which, for each sample size $n \geq 1$ and some fixed $\delta_0 \in (0, 1)$, achieves*

$$P\left\{R_P(\hat{w}_{\mathrm{OLD}}) - R_P^* > \varepsilon_P(n)\right\} \leq \delta_0.$$

*For a desired confidence level $\delta$, split $Z_n$ into $k = \lceil 8 \log(\delta^{-1})/(1-\gamma)^2 \rceil$ disjoint subsets, let $\hat{w}^{(1)}_{\mathrm{OLD}}, \ldots, \hat{w}^{(k)}_{\mathrm{OLD}}$ denote the outputs of $\hat{w}_{\mathrm{OLD}}$ run on these subsets, and construct $\hat{w}_{\mathrm{NEW}}$ as*

$$\hat{w}_{\mathrm{NEW}} = \mathrm{Merge}\left[\{\hat{w}^{(1)}_{\mathrm{OLD}}, \ldots, \hat{w}^{(k)}_{\mathrm{OLD}}\}; \|\cdot\|\right].$$

*As long as Merge satisfies (7), then for any $0 \leq \gamma < 1/4$ and $n \geq k$, we have*

$$R_P(\hat{w}_{\mathrm{NEW}}) - R_P^* \leq \frac{4 c_\gamma^2 \lambda_1}{\mu}\, \varepsilon_P(\lfloor n/k \rfloor)$$

*with probability no less than $1 - \delta$.*

With these basic results in place, we can readily prove Theorem 1.

**Proof of Theorem 1.** Using the assumptions provided in the hypothesis, we can leverage Lemma 3; the correspondence between this lemma and Algorithm 1 is $\hat{w}^{(j)}_{\mathrm{OLD}} \leftrightarrow \hat{w}^{(j)}$ and $\hat{w}_{\mathrm{NEW}} \leftrightarrow \hat{w}_{\mathrm{DC}}$. The remaining task is to fill in $\varepsilon_P(\cdot)$ for the final iterate of standard SGD, using the prescribed step sizes. It is well-known that for averaged SGD, one need not require that the losses be Lipschitz. On the other hand, for last-iterate SGD, it was only quite recently that Nguyen et al. (2018), in a nice argument building upon Bottou, Curtis, and Nocedal (2016), showed that the Lipschitz condition is not required if we have $\lambda_1$-smooth losses. For our purposes, this implies that $\varepsilon_P(m) = O\left(E_P\|G(w^*; Z)\|^2 / (\mu m)\right)$ for any choice of $m \geq M$; a detailed statement of the more general property used is given in Theorem 7 in the appendix. Applying this to Lemma 3 for Algorithm 1, we shall have $m = (1-\gamma)^2 n / (8 \log(\delta^{-1}))$. Multiplying these factors out, we end up with $n - 8 \log(\delta^{-1})(M \vee b)/(1-\gamma)^2$ in the denominator, for an arbitrary choice of $0 \leq \gamma < 1/4$. To cover all choices of Merge and thus $\gamma$, we simply use the rough upper bound $8 \log(\delta^{-1})(M \vee b)/(1-\gamma)^2 \leq 16 \log(\delta^{-1})(M \vee b) = M_\delta$ in the stated result. ∎

**Concrete implementations of Merge.** There are many natural choices for implementing Merge. For example, the geometric median (minimizing the sum of distances to the candidates) can be computed easily (Vardi and Zhang 2000; Cohen et al. 2016), and enjoys a factor of $c_\gamma \leq 1 + 1/(2\gamma)$ (Minsker 2015). A simple smallest-ball procedure, which returns the candidate containing a $\gamma$-majority within the smallest-radius ball, requires only pairwise distance calculations, and satisfies $c_\gamma \leq 3$ (Hsu and Sabato 2016). Using a coordinate-wise median is the easiest to code, but introduces dimension dependence, with $c_\gamma \leq \sqrt{d}(1 + 1/(2\gamma))$. Plugging these upper bounds into the proof of Theorem 1, by basic arithmetic, the Merge-dependent constant $c$ can be bounded as $c \leq 1536$, $c \leq 288$, and $c \leq 1536d$, respectively.
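As a rough illustration (assuming the candidates are stacked as rows of a NumPy array), here are hedged sketches of two of these choices: an approximate geometric median computed via Weiszfeld-type fixed-point iterations, and a smallest-ball rule in the spirit of Hsu and Sabato (2016); the iteration counts and tolerances are arbitrary.

```python
# Illustrative Merge candidates: approximate geometric median and smallest ball.
import numpy as np

def merge_geomedian(cands, iters=100, eps=1e-9):
    """Approximate geometric median of the candidate rows (Weiszfeld updates)."""
    u = cands.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(cands - u, axis=1), eps)
        u_new = (cands / dist[:, None]).sum(axis=0) / (1.0 / dist).sum()
        if np.linalg.norm(u_new - u) < eps:
            break
        u = u_new
    return u

def merge_smallest_ball(cands, gamma=0.0):
    """Candidate whose ball covering a strict gamma-majority has smallest radius."""
    k = cands.shape[0]
    need = int(np.floor(k * (0.5 + gamma))) + 1   # strict gamma-majority count
    D = np.linalg.norm(cands[:, None, :] - cands[None, :, :], axis=2)
    radii = np.sort(D, axis=1)[:, need - 1]       # distance to need-th nearest
    return cands[np.argmin(radii)]
```

Either function can be passed as the `merge` argument in the `dc_sgd` sketch given earlier.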
**Additional related literature.** The excess risk bounds given by Theorem 1 give us an example of the guarantees that are possible under potentially heavy-tailed data, for arguably the simplest divide-and-conquer strategy one could conceive of. Here we remark that the core idea of using robust aggregation methods to boost the confidence of independent candidates under potentially heavy-tailed data can be seen in various special cases throughout the literature. For example, influential work from Minsker (2015, Sec. 4) applies the geometric median to robustify both PCA and high-dimensional linear regression procedures, under potentially heavy-tailed observations. Hsu and Sabato (2016, Sec. 4.2) look at merging ERM solutions when the empirical risk is strongly convex, using a smallest-ball strategy. In contrast, we do not require the losses to be strongly convex, and our computational procedure is explicit, yielding bounds which incorporate error of both a statistical and a computational nature, unlike ERM-type guarantees.

## Empirical Analysis

In this section, we use controlled simulations to investigate how the differences in formal performance guarantees discussed in the previous section play out in practice.

### Experimental setup

We essentially follow the noisy convex minimization tests used in the literature to test the robustness of RGD methods (Holland and Ikeda 2019). Complete details of the experimental setup are provided in the supplementary materials (repository: https://github.com/feedbackward/sgd-roboost). Put simply, we provide the learner with random losses of the form $L(w; Z) = (\langle w - w^*, X \rangle + E)^2/2$, where $w^* \in \mathbb{R}^d$ is a pre-defined vector unknown to the learner, $X$ is a $d$-dimensional random vector, $E$ is zero-mean random noise, and $X$ and $E$ are independent of each other. This approach is advantageous in that we can compute the resulting risk $R_P(w) = E_P L(w; Z)$ exactly, and by modifying the distribution $P$, we can control the $\mu$-strong convexity of the risk and ensure the gradients $\nabla L(w; Z) = (\langle w - w^*, X \rangle + E)\, X$ are Lipschitz continuous, satisfying the two key conditions of Theorem 1.

Regarding the methods being compared: as classical baselines, we use empirical risk minimization by batch gradient descent (denoted ERM-GD) and by stochastic gradient descent (denoted SGD). We also implement standard RGD methods as more modern benchmarks: RGD-by-MoM (Chen, Su, and Xu 2017a; Prasad et al. 2018) (denoted here as RGD-MoM), RGD-M (Holland and Ikeda 2019) (RGD-M), and the median-of-means minimization by gradient descent of Lecué, Lerasle, and Mathieu (2018) (RGD-Lec).
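Since $X$ and $E$ are independent and $E$ has mean zero, the risk in this model is $R_P(w) = \frac{1}{2}\left((w - w^*)^\top \Sigma\, (w - w^*) + \operatorname{var}(E)\right)$ with $\Sigma = E[XX^\top]$, so the excess risk is exactly $\frac{1}{2}(w - w^*)^\top \Sigma\, (w - w^*)$. The following sketch instantiates one such setting; the Gaussian inputs, log-Normal noise, and all constants are illustrative choices, not the paper's exact configuration.

```python
# Sketch of the simulated setup: losses L(w; Z) = (<w - w*, X> + E)^2 / 2
# with X and E independent, making the risk exactly computable.
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 10_000
w_star = rng.normal(size=d)        # pre-fixed minimizer, unknown to the learner
Sigma = np.eye(d)                  # E[X X^T]; identity kept for simplicity
X = rng.normal(size=(n, d))        # well-behaved inputs
E = rng.lognormal(0.0, 1.75, size=n) - np.exp(1.75**2 / 2)  # heavy-tailed, zero mean

def grad(w, x, e):
    """Per-example gradient (<w - w*, X> + E) X of the squared loss."""
    return (x @ (w - w_star) + e) * x

def excess_risk(w):
    """Exact excess risk (w - w*)^T Sigma (w - w*) / 2 for this model."""
    delta = w - w_star
    return 0.5 * delta @ Sigma @ delta

print(excess_risk(np.zeros(d)))    # exact risk gap at the origin
```

Combined with the `dc_sgd` and `merge` sketches given earlier, this provides everything needed to reproduce the flavor of the experiments, with `excess_risk` serving as an exact performance metric.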
Against these methods, we compare Algorithm 1 (DC-SGD), with Merge computed using the geometric median (Vardi and Zhang 2000). All detailed settings are given in the supplementary materials. The basic idea of these tests is to calibrate and fix the methods to the case of "nice" data, characterized by additive Gaussian noise, and then to see how the performance of each method changes as different experimental parameters are modified. The key performance metric that we look at in the figures to follow is excess risk, computed as $R_P(\hat{w}) - R_P(w^*)$, where $\hat{w}$ is the output of the learning algorithm being studied, and $w^*$ is the pre-fixed minimizer described in the previous paragraph. Each experimental setting is characterized by the triplet $(P, n, d)$, which we modify in many different ways to investigate different phenomena. For each setting, we run multiple independent trials, and compute performance statistics based on these trials; for example, when we give the average (denoted ave) and standard deviation (denoted sd) of excess risk, these statistics are computed over all trials, and all box-plots are likewise computed from multiple independent trials. For convenience, here we list the key contents of our empirical analysis (extra figures are given in the supplementary materials):

1. Error trajectories in low dimensions (fixed $n$ and $d$, many iterations).
2. Statistical error in high dimensions ($d$ grows, $n$ fixed).
3. Actual computation times ($d$ grows, $n$ fixed/grows).
4. Impact of initialization on error trajectories ($\|w_0 - w^*\|$ grows).
5. Impact of noise level on error trajectories (signal/noise ratio gets worse).

### Discussion of results

Plots of representative empirical test results are given in Figures 1–4.

*Figure 1: Excess risk statistics as a function of cost in gradient computations (log scale, base 10). The two right-most plots zoom in on the region between the dashed lines in the two left-most plots.*

*Figure 2: Analogous results to Figure 1, for the case of log-Normal noise. Note that for the zoomed-in plots on the right, we have removed the volatile SGD trajectory for visibility.*

*Figure 3: Left: box-plot of final excess risk for batch methods (many-pass) versus DC-SGD (two-pass), with $d = 1024$ under log-Normal noise. Right: median computation times as $d$ increases.*

*Figure 4: Excess risk trajectories (averaged over trials) with different initialization error ranges. Here "sup" refers to the error range, where $w_{0,j} = w^*_j + \mathrm{Uniform}[-\mathrm{sup}, +\mathrm{sup}]$ for each $j \in [d]$.*

We start with low-dimensional tests to examine the nature of the error trajectory of Algorithm 1 (DC-SGD), when run for multiple passes over the data. With no fine-tuning of algorithm parameters, by simply doing a proper aggregation of noisy SGD sub-processes, it is clearly possible to achieve performance comparable to well-tuned RGD methods, and to do so using far fewer computational resources (Figures 1–2). Even when the underlying sub-processes used by Algorithm 1 are very noisy, only a few passes over the data are necessary to match the best-performing RGD methods, both on average and in terms of between-trial variance. Furthermore, without any algorithmic adjustments, this robustness holds over changes to the signal/noise ratio and the initialization error (Figure 4). While sample-splitting can cause our procedure to take a small hit in statistical error for very large $n$, it makes up for this in scalability as $d$ grows.
Even with just two passes over the data, at the order of thousands of parameters, the proposed procedure is able to achieve comparable performance under well-behaved gradients, and superior performance under heavy-tailed gradients, without prior knowledge or re-tuning, and at a fraction of the cost (Figure 3).

## Future Directions

This paper presents evidence, both formal and empirical, that a general-purpose learning algorithm following the archetype drawn out in Algorithm 1 should be able to improve significantly on the scalability of modern robust gradient descent methods under potentially heavy-tailed losses and gradients, without sacrificing formal guarantees. Extending the theory to other algorithms besides vanilla SGD is a straightforward exercise; less straightforward is the case of a stage-wise strategy, in which the partition size $k$ can change from stage to stage. Extending the results to allow for multiple passes is also of natural interest; the work of Lin and Rosasco (2017) does this for the squared error, but without heavy tails. We only covered the $\ell_2$ norm case here, but extensions to cover other geometries (via stochastic mirror descent, for example) are also of interest. The RGD methods cited in this work are chiefly designed for convex optimization problems; they sacrifice exploration in favor of exploitation, and it will be interesting to investigate how approaches like Algorithm 1, with their higher tendency to explore, fare in non-convex settings.

## Acknowledgements

This work was partially supported by the JSPS KAKENHI Grant Number 19K20342, and the Kayamori Foundation of Information Science Advancement Grant Number K30XXIII-518.

## References

Anthony, M.; and Bartlett, P. L. 1999. *Neural Network Learning: Theoretical Foundations*. Cambridge University Press.

Bottou, L. 2015. Two Big Challenges in Machine Learning. Keynote at ICML 2015. URL: https://leon.bottou.org/talks/2challenges.

Bottou, L.; Curtis, F. E.; and Nocedal, J. 2016. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838.

Brownlees, C.; Joly, E.; and Lugosi, G. 2015. Empirical risk minimization for heavy-tailed losses. *Annals of Statistics* 43(6): 2507–2536.

Catoni, O. 2012. Challenging the empirical mean and empirical variance: a deviation study. *Annales de l'Institut Henri Poincaré, Probabilités et Statistiques* 48(4): 1148–1185.

Chen, Y.; Su, L.; and Xu, J. 2017a. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. In *Proceedings of the ACM on Measurement and Analysis of Computing Systems*. ACM.

Chen, Y.; Su, L.; and Xu, J. 2017b. Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent. arXiv preprint arXiv:1705.05491v2.

Cohen, M. B.; Lee, Y. T.; Miller, G.; Pachocki, J.; and Sidford, A. 2016. Geometric median in nearly linear time. In *Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing*, 9–21.

Devroye, L.; Lerasle, M.; Lugosi, G.; and Oliveira, R. I. 2016. Sub-Gaussian mean estimators. *Annals of Statistics* 44(6): 2695–2725.

El-Mhamdi, E.-M.; Guerraoui, R.; Guirguis, A.; and Rouault, S. 2019. SGD: Decentralized Byzantine Resilience. arXiv preprint arXiv:1905.03853.

Feldman, V. 2017. Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back. In *Advances in Neural Information Processing Systems 29 (NIPS 2016)*, 3576–3584.

Haussler, D. 1992. Decision theoretic generalizations of the PAC model for neural net and other learning applications. *Information and Computation* 100(1): 78–150.
Holland, M. J. 2019. Robust descent using smoothed multiplicative noise. In *22nd International Conference on Artificial Intelligence and Statistics (AISTATS)*, volume 89 of *Proceedings of Machine Learning Research*, 703–711.

Holland, M. J. 2021. Robustness and scalability under heavy tails, without strong convexity. In *24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021)*, volume 130 of *Proceedings of Machine Learning Research*.

Holland, M. J.; and Ikeda, K. 2017. Robust regression using biased objectives. *Machine Learning* 106(9): 1643–1679. doi:10.1007/s10994-017-5653-5.

Holland, M. J.; and Ikeda, K. 2019. Better generalization with less data using robust gradient descent. In *36th International Conference on Machine Learning (ICML)*, volume 97 of *Proceedings of Machine Learning Research*.

Hsu, D.; and Sabato, S. 2016. Loss Minimization and Parameter Estimation with Heavy Tails. *Journal of Machine Learning Research* 17(18): 1–40.

Lecué, G.; Lerasle, M.; and Mathieu, T. 2018. Robust classification via MOM minimization. arXiv preprint arXiv:1808.03106v1.

Lin, J.; and Rosasco, L. 2017. Optimal Learning for Multi-pass Stochastic Gradient Methods. In *Advances in Neural Information Processing Systems 29 (NIPS 2016)*, 4556–4564.

Minsker, S. 2015. Geometric median and robust estimation in Banach spaces. *Bernoulli* 21(4): 2308–2335.

Nguyen, L. M.; Nguyen, P. H.; van Dijk, M.; Richtárik, P.; Scheinberg, K.; and Takáč, M. 2018. SGD and Hogwild! convergence without the bounded gradients assumption. arXiv preprint arXiv:1802.03801v2.

Prasad, A.; Suggala, A. S.; Balakrishnan, S.; and Ravikumar, P. 2018. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485.

Rajput, S.; Wang, H.; Charles, Z.; and Papailiopoulos, D. 2020. DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation. In *Advances in Neural Information Processing Systems 32 (NeurIPS 2019)*, 10320–10330.

Vapnik, V. 1982. *Estimation of Dependences Based on Empirical Data*. Springer Series in Statistics. Springer-Verlag.

Vapnik, V. N. 1998. *Statistical Learning Theory*. Wiley.

Vardi, Y.; and Zhang, C.-H. 2000. The multivariate L1-median and associated data depth. *Proceedings of the National Academy of Sciences* 97(4): 1423–1426.