# ℓ1-regression with Heavy-tailed Distributions

Lijun Zhang, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
{zhanglj, zhouzh}@lamda.nju.edu.cn

In this paper, we consider the problem of linear regression with heavy-tailed distributions. Different from previous studies that use the squared loss to measure the performance, we choose the absolute loss, which is capable of estimating the conditional median. To address the challenge that both the input and output could be heavy-tailed, we propose a truncated minimization problem, and demonstrate that it enjoys an $\widetilde{O}(\sqrt{d/n})$ excess risk, where $d$ is the dimensionality and $n$ is the number of samples. Compared with traditional work on ℓ1-regression, the main advantage of our result is that we achieve a high-probability risk bound without exponential moment conditions on the input and output. Furthermore, if the input is bounded, we show that the classical empirical risk minimization is competent for ℓ1-regression even when the output is heavy-tailed.

## 1 Introduction

Linear regression used to be a mainstay of statistics, and remains one of our most important tools for data analysis [Hastie et al., 2009]. Let $\mathcal{T} = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subseteq \mathbb{R}^d \times \mathbb{R}$ be a set of input-output pairs that are independently drawn from an unknown distribution $\mathbb{P}$. In linear regression, we assume that the relationship between the input and output can be well modeled by a linear function, and aim to discover it from the training set $\mathcal{T}$. For a linear function $f(x) = x^\top w$, its quality is measured by the expected prediction error on a random pair $(x, y)$ sampled from $\mathbb{P}$, i.e., the risk

$$R_\ell(w) = \mathbb{E}_{(x,y)\sim \mathbb{P}}\big[\ell(x^\top w, y)\big],$$

where $\ell(\cdot, \cdot)$ is a loss that quantifies the prediction error. The most popular losses include the squared loss $\ell_2(u, v) = (u - v)^2$ and the absolute loss $\ell_1(u, v) = |u - v|$. Let $\mathcal{W} \subseteq \mathbb{R}^d$ be a domain of linear coefficients. The standard approach for linear regression is empirical risk minimization (ERM),

$$\min_{w \in \mathcal{W}} \ \widehat{R}_\ell(w) = \frac{1}{n}\sum_{i=1}^n \ell(x_i^\top w, y_i),$$

which selects the linear function that minimizes the empirical risk on the training set.

In the literature, there are plenty of theoretical guarantees on linear regression by ERM, either targeting linear regression directly [Birgé and Massart, 1998, Györfi et al., 2002], or derived from the general theories of statistical learning [Vapnik, 2000, Koltchinskii, 2011, Zhang et al., 2017]. In order to establish high-probability risk bounds, most of the previous analyses rely on the assumption that the input and output are bounded or sub-Gaussian. However, if the input and output are heavy-tailed, which is commonly encountered in many disciplines such as finance and the environment [Finkenstädt and Rootzén, 2003], existing high-probability bounds of ERM become invalid. In fact, for heavy-tailed distributions, it has been explicitly proved that the empirical risk is no longer a good approximation of the risk [Catoni, 2012], which inspires recent studies on learning with heavy-tailed losses [Audibert and Catoni, 2011, Hsu and Sabato, 2014, Brownlees et al., 2015]. In particular, for linear regression with the squared loss, i.e., ℓ2-regression, Audibert and Catoni [2011] develop a truncated min-max estimator, and establish an $O(d/n)$ excess risk that holds with high probability even when the input and output are heavy-tailed.
This result is a great breakthrough as it extends the scope of linear regression, but it is limited to the squared loss. The motivation of the squared loss is to estimate the conditional mean of $y$ given $x$. Besides the squared loss, there also exist other losses for regression. For example, if we are interested in the conditional median, then the absolute loss is a natural choice [Hastie et al., 2009]. Furthermore, the absolute loss is more robust in the sense that it is resistant to outliers in the data [Rousseeuw and Leroy, 1987].

In this paper, we target linear regression with the absolute loss, namely ℓ1-regression. Inspired by the truncation function of Audibert and Catoni [2011], we propose a truncated minimization problem to support heavy-tailed distributions. Theoretical analysis shows that our method achieves an $\widetilde{O}(\sqrt{d/n})$ excess risk, which holds with high probability. Our theoretical guarantee requires very mild assumptions, and is derived from standard techniques: the covering number and concentration inequalities. Furthermore, we show that if the input is bounded, the classical ERM is sufficient even when the output is heavy-tailed. We highlight the contributions of this paper as follows.

- We propose a truncated minimization problem for ℓ1-regression, which is simpler than the truncated min-max formulation of Audibert and Catoni [2011] for ℓ2-regression. This is the first time an $\widetilde{O}(\sqrt{d/n})$ excess risk is established for ℓ1-regression under the condition that both the input and output could be heavy-tailed. Although Brownlees et al. [2015] develop a general theorem for heavy-tailed losses, when applied to ℓ1-regression it requires the input to be bounded.
- When the input is bounded, we prove that the classical ERM achieves an $O(D/\sqrt{n})$ excess risk for ℓ1-regression, where $D$ is the maximum norm of the input, without requiring any assumption on the output.

Finally, we note that although this paper focuses on ℓ1-regression, the idea of truncated minimization and our analysis can be directly extended to Lipschitz losses [Alquier et al., 2017] that satisfy

$$|\ell(x^\top w, y) - \ell(x^\top w', y)| \le |x^\top w - x^\top w'|, \quad \forall w, w' \in \mathcal{W}.$$

Many popular losses for classification, including the hinge loss and the logistic loss, are Lipschitz losses. We will provide detailed investigations in an extended paper.

## 2 Related Work

In this section, we review the recent work on learning with heavy-tailed losses. A distribution $F$ is heavy-tailed if and only if [Foss et al., 2013]

$$\int e^{\lambda x}\, dF(x) = \infty, \quad \forall \lambda > 0.$$

There exist studies that try to equip ERM with theoretical guarantees under heavy-tailed distributions. However, these theories need additional assumptions, such as the small-ball assumption [Mendelson, 2014, 2015] and the Bernstein condition [Dinh et al., 2016, Alquier et al., 2017], and do not always hold with high probability. In the following, we only discuss alternatives to ERM.

### 2.1 Truncation-based Approaches

Our truncated method follows the line of research stemming from Catoni [2012]. In his seminal work, Catoni [2012] considers the estimation of the mean and variance of a real random variable from independent and identically distributed (i.i.d.) samples. Let $x_1, \ldots, x_n \in \mathbb{R}$ be i.i.d. samples drawn from some unknown probability distribution $\mathbb{P}$. Catoni [2012] builds the estimator $\widehat{\theta}$ of the mean as the solution of the equation

$$\sum_{i=1}^n \psi\big(\alpha(x_i - \widehat{\theta})\big) = 0 \tag{1}$$

where $\alpha > 0$ is a positive parameter, and $\psi(\cdot): \mathbb{R} \mapsto \mathbb{R}$ is a truncation function that is non-decreasing and satisfies

$$-\log\Big(1 - x + \frac{x^2}{2}\Big) \le \psi(x) \le \log\Big(1 + x + \frac{x^2}{2}\Big). \tag{2}$$

By choosing $\alpha = \sqrt{2/(n\nu)}$, where $\nu$ is the variance of the random variable, Catoni [2012] demonstrates that with high probability, the deviation of $\widehat{\theta}$ from the mean is upper bounded by $O(\sqrt{\nu/n})$. Thus, (1) can be applied to heavy-tailed distributions, as long as the variance is finite.
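To make the estimator concrete, the following Python sketch solves (1) by bisection: since $\psi$ is non-decreasing, the left-hand side of (1) is non-increasing in $\widehat{\theta}$, and a root is bracketed by the sample range. The particular choice of $\psi$ and the plug-in of a sample variance for $\nu$ are our own illustrative choices (Catoni's analysis treats $\nu$, or an upper bound on it, as known), so this is a sketch of the idea rather than the exact procedure analyzed in the paper.

```python
import numpy as np

def psi_narrow(x):
    # A truncation function satisfying (2): psi(x) = log(1 + x + x^2/2) for x >= 0
    # and psi(x) = -psi(-x) for x < 0.
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def catoni_mean(samples, var_guess=None, tol=1e-10):
    """Solve sum_i psi(alpha * (x_i - theta)) = 0 for theta by bisection."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    nu = np.var(x) if var_guess is None else var_guess   # heuristic plug-in for the variance
    alpha = np.sqrt(2.0 / (n * nu))                       # Catoni's choice alpha = sqrt(2/(n*nu))
    f = lambda theta: psi_narrow(alpha * (x - theta)).sum()
    lo, hi = x.min(), x.max()                             # f(lo) >= 0 >= f(hi), and f is non-increasing
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Heavy-tailed example: Pareto data with finite variance (tail index 2.5, mean = 2.5/1.5).
rng = np.random.default_rng(0)
data = rng.pareto(2.5, size=2000) + 1.0
print(np.mean(data), catoni_mean(data))
```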
Audibert and Catoni [2011] extend the robust estimator of Catoni [2012] to linear regression with the squared loss. Specifically, they consider ℓ2-norm-regularized ℓ2-regression (i.e., ridge regression), and propose the following min-max estimator

$$\min_{w\in\mathcal{W}} \max_{u\in\mathcal{W}} \ \lambda\|w\|^2 - \lambda\|u\|^2 + \frac{1}{n\alpha}\sum_{i=1}^n \psi\Big(\alpha(y_i - x_i^\top w)^2 - \alpha(y_i - x_i^\top u)^2\Big) \tag{3}$$

where $\|\cdot\|$ denotes the ℓ2-norm of vectors and $\psi(\cdot)$ is the truncation function defined as

$$\psi(x) = \begin{cases} -\log\big(1 - x + \frac{x^2}{2}\big), & 0 \le x \le 1; \\ \log(2), & x \ge 1; \\ -\psi(-x), & x \le 0. \end{cases} \tag{4}$$

Let $(\widehat{w}, \widehat{u})$ be the optimal solution to (3). With a suitable choice of $\alpha$, Audibert and Catoni [2011] have proved that the risk bound

$$R_{\ell_2}(\widehat{w}) + \lambda\|\widehat{w}\|^2 - \min_{w\in\mathcal{W}}\Big(R_{\ell_2}(w) + \lambda\|w\|^2\Big) = O\Big(\frac{d}{n}\Big)$$

holds with high probability even when neither the input nor the output has exponential moments.

In a subsequent work, Brownlees et al. [2015] apply the robust estimator of Catoni [2012] to the general problem of learning with heavy-tailed losses. Let $x$ be a random variable taking values in some measurable space $\mathcal{X}$ and $\mathcal{F}$ be a set of nonnegative functions defined on $\mathcal{X}$. Given $n$ independent random variables $x_1, \ldots, x_n$, all distributed as $x$, Brownlees et al. [2015] propose to use the robust estimator in (1) to estimate the mean of each function $f \in \mathcal{F}$, and to choose the one that minimizes this estimate. Formally, the optimization problem is given by

$$\min_{f\in\mathcal{F}} \ \widehat{\mu}_f \quad \text{where } \widehat{\mu}_f \text{ solves} \quad \sum_{i=1}^n \psi\big(\alpha(f(x_i) - \widehat{\mu}_f)\big) = 0 \tag{5}$$

and $\psi(\cdot)$ is defined as

$$\psi(x) = \begin{cases} \log\big(1 + x + \frac{x^2}{2}\big), & x \ge 0; \\ -\log\big(1 - x + \frac{x^2}{2}\big), & x \le 0. \end{cases}$$

Based on the generic chaining method [Talagrand, 2005], Brownlees et al. [2015] develop performance bounds for the above problem. However, the theoretical guarantees rely on the condition that the function space $\mathcal{F}$ is bounded in terms of a certain distance. When applying their results to ℓ1-regression, the linear function needs to be bounded, which means the input vector must be bounded [Brownlees et al., 2015, Section 4.1.1]. When applying them to ℓ2-regression, the input also needs to be bounded and the theoretical guarantee is no longer a high-probability bound [Brownlees et al., 2015, Section 4.1.2].

### 2.2 Median-of-means Approaches

Another way to deal with heavy-tailed distributions is the median-of-means estimator [Nemirovski and Yudin, 1983, Alon et al., 1999, Bubeck et al., 2013, Minsker, 2015]. The basic idea is to divide the data into several groups, calculate the sample mean within each group, and take the median of these means. Recently, Hsu and Sabato [2014, 2016] generalize the median-of-means estimator to arbitrary metric spaces, and apply it to the minimization of smooth and strongly convex losses. Specifically, for ℓ2-regression with heavy-tailed distributions, a high-probability $\widetilde{O}(d/n)$ excess risk is established, under slightly stronger assumptions than those of Audibert and Catoni [2011]. For the regression problem, Lugosi and Mendelson [2016] introduce a new procedure, the so-called median-of-means tournament, which achieves the optimal tradeoff between accuracy and confidence under minimal assumptions. The setting of Lugosi and Mendelson [2016] is general in the sense that the function space could be any convex class of functions, not necessarily linear, but the performance is only measured by the squared loss.

Compared with truncation-based approaches, the advantage of median-of-means approaches is that they do not require prior knowledge of distributional properties. However, the current theoretical results of median-of-means are restricted to the squared loss or strongly convex losses, and thus cannot be applied to the ℓ1-regression considered in this paper.
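For concreteness, the snippet below implements the basic median-of-means mean estimator described at the beginning of this subsection. The number of groups $k$ and the random partition are illustrative choices (in the literature $k$ is typically of order $\log(1/\delta)$); this is the plain estimator, not the median-of-means tournament of Lugosi and Mendelson [2016].

```python
import numpy as np

def median_of_means(samples, k=10, rng=None):
    """Split the data into k groups, average within each group,
    and return the median of the group means."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(samples, dtype=float)
    perm = rng.permutation(x.size)            # random partition into k blocks
    blocks = np.array_split(x[perm], k)
    return float(np.median([b.mean() for b in blocks]))

# Heavy-tailed example: Student-t samples with 3 degrees of freedom (mean 0, finite variance).
rng = np.random.default_rng(1)
data = rng.standard_t(df=3, size=5000)
print(np.mean(data), median_of_means(data, k=20, rng=rng))
```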
## 3 Our Results

In this section, we first present our truncated minimization problem, then discuss its theoretical guarantee, and finally study the special setting of bounded inputs.

### 3.1 Our Formulation

Inspired by the truncated min-max estimator of Audibert and Catoni [2011], we propose the following truncated minimization problem for ℓ1-regression with heavy-tailed distributions:

$$\min_{w\in\mathcal{W}} \ \frac{1}{n\alpha}\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top w|\big) \tag{6}$$

where the truncation function $\psi(\cdot)$ is non-decreasing and satisfies (2), and $\alpha > 0$ is a parameter whose value will be determined later. Note that we can choose (4) or (5) as the truncation function.

Compared with the min-max problem in (3), our minimization problem in (6) is simpler. Although (6) is still a non-convex problem, it has a special structure that can be exploited. Because $\psi(\cdot)$ is non-decreasing, it is easy to verify that each individual function $\psi(\alpha|y_i - x_i^\top w|)$ is quasiconvex. Thus, our problem is to minimize a sum of quasiconvex functions. Following the recent developments in quasiconvex optimization [Hazan et al., 2015], we may apply (stochastic) normalized gradient descent (NGD) to solve (6). This paper focuses on the statistical properties of (6), and we leave the design of efficient optimization procedures as future work.
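The following Python sketch spells out the objective in (6) and a plain normalized gradient descent loop in the spirit of Hazan et al. [2015]. The specific truncation function, step size, iteration budget, projection onto the ball $\{\|w\| \le B\}$, and the rough scaling of $\alpha$ are our own illustrative choices; the paper deliberately leaves the design of optimization procedures open, so this should not be read as the authors' algorithm.

```python
import numpy as np

def psi_narrow(x):
    # A truncation function satisfying (2), namely the choice in (5).
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def truncated_objective(w, X, y, alpha):
    """The truncated empirical risk in (6): (1/(n*alpha)) * sum_i psi(alpha * |y_i - x_i^T w|)."""
    n = X.shape[0]
    return psi_narrow(alpha * np.abs(y - X @ w)).sum() / (n * alpha)

def ngd(X, y, alpha, B, eta=0.1, iters=2000):
    """Normalized gradient descent on (6), projected onto the domain {||w|| <= B}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        r = y - X @ w
        z = alpha * np.abs(r)
        dpsi = (1.0 + z) / (1.0 + z + 0.5 * z**2)          # psi'(z) for the choice above, z >= 0
        grad = -(X * (dpsi * np.sign(r))[:, None]).sum(axis=0) / n
        norm = np.linalg.norm(grad)
        if norm < 1e-12:
            break
        w = w - eta * grad / norm                          # normalized step
        nrm = np.linalg.norm(w)
        if nrm > B:                                        # project back onto the domain
            w = w * (B / nrm)
    return w

# Example with heavy-tailed inputs and outputs (t-distribution, 3 degrees of freedom).
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.standard_t(df=3, size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.standard_t(df=3, size=n)
alpha = np.sqrt(np.log(n) / n)          # rough scaling of order sqrt(log(.)/n); see Section 3.2
w_hat = ngd(X, y, alpha, B=10.0)
print(truncated_objective(w_hat, X, y, alpha))
```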
### 3.2 Theoretical Guarantees

Let $\mathcal{W}$ be a subset of a Hilbert space $\mathcal{H}$, and $\|\cdot\|$ be the norm associated with the inner product of $\mathcal{H}$. We introduce the assumptions used in our analysis.

**Assumption 1** The domain $\mathcal{W}$ is totally bounded, i.e., for any $\varepsilon > 0$, there exists a finite $\varepsilon$-net of $\mathcal{W}$.¹

**Assumption 2** The expectation of the squared norm of $x$ is bounded, that is, $\mathbb{E}_{(x,y)\sim\mathbb{P}}\big[\|x\|^2\big] < \infty$.

**Assumption 3** The ℓ2-risk of all $w \in \mathcal{W}$ is bounded, that is,
$$\sup_{w\in\mathcal{W}} R_{\ell_2}(w) = \sup_{w\in\mathcal{W}} \mathbb{E}_{(x,y)\sim\mathbb{P}}\big[(y - x^\top w)^2\big] < \infty.$$

¹ A subset $\mathcal{N} \subseteq \mathcal{K}$ is called an $\varepsilon$-net of $\mathcal{K}$ if for every $w \in \mathcal{K}$ one can find a $\widetilde{w} \in \mathcal{N}$ so that $\|w - \widetilde{w}\| \le \varepsilon$.

**Remark 1** We have the following comments regarding our assumptions. Although Assumption 1 requires the domain $\mathcal{W}$ to be bounded, the input and output could be unbounded, which allows us to model heavy-tailed distributions. Because our goal is to bound the ℓ1-risk, which is a first-order moment, it is natural to require higher-order moment conditions. Thus, in Assumptions 2 and 3, we introduce second-order moment conditions on the inputs and outputs. Our assumptions support heavy-tailed distributions in the sense that commonly used heavy-tailed distributions, such as the Pareto distribution (with parameter $\alpha > 2$) and the log-normal distribution, have finite second-order moments. By Jensen's inequality, we have $(\mathbb{E}[\|x\|])^2 \le \mathbb{E}[\|x\|^2]$. Thus, Assumption 2 implies $\mathbb{E}[\|x\|]$ is bounded. Given Assumptions 1 and 2, Assumption 3 can be relaxed to only requiring that the ℓ2-risk of the optimal solution $w^*$ is bounded. To see this, we have
$$R_{\ell_2}(w) \le 2R_{\ell_2}(w^*) + 2(w - w^*)^\top\,\mathbb{E}[xx^\top]\,(w - w^*) \le 2R_{\ell_2}(w^*) + 2\|w - w^*\|^2\,\big\|\mathbb{E}[xx^\top]\big\|_2,$$
where $\|\cdot\|_2$ is the spectral norm of matrices. First, Assumption 1 implies $\|w - w^*\|$ is bounded. Second, Assumption 2 implies the spectral norm of $\mathbb{E}[xx^\top]$ is also bounded, that is,
$$\big\|\mathbb{E}[xx^\top]\big\|_2 \le \mathbb{E}\big[\|xx^\top\|_2\big] = \mathbb{E}\big[\|x\|^2\big] < \infty.$$
Thus, $R_{\ell_2}(w)$ is bounded as long as $R_{\ell_2}(w^*)$ is bounded and Assumptions 1 and 2 hold.

To present our theoretical guarantee, we introduce the notion of the covering number. The minimal cardinality of an $\varepsilon$-net of $\mathcal{W}$ is called the covering number and is denoted by $N(\mathcal{W}, \varepsilon)$. Let $\widehat{w}$ be a solution to (6), and $w^* \in \operatorname{argmin}_{w\in\mathcal{W}} R_{\ell_1}(w)$ be an optimal solution that minimizes the ℓ1-risk. We have the following excess risk bound.

**Theorem 1** Let $0 < \delta < 1/2$. Under Assumptions 1, 2 and 3, with probability at least $1 - 2\delta$, we have
$$R_{\ell_1}(\widehat{w}) - R_{\ell_1}(w^*) \le 2\varepsilon\,\mathbb{E}[\|x\|] + \alpha\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big] + \frac{3\alpha}{2}\sup_{w\in\mathcal{W}} R_{\ell_2}(w) + \frac{1}{n\alpha}\log\frac{N(\mathcal{W}, \varepsilon)}{\delta^2}$$
for any $\varepsilon > 0$. Furthermore, by setting
$$\alpha = \sqrt{\frac{1}{n}\log\frac{N(\mathcal{W}, \varepsilon)}{\delta^2}},$$
we have
$$R_{\ell_1}(\widehat{w}) - R_{\ell_1}(w^*) \le 2\varepsilon\,\mathbb{E}[\|x\|] + \sqrt{\frac{1}{n}\log\frac{N(\mathcal{W}, \varepsilon)}{\delta^2}}\left(\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big] + \frac{3}{2}\sup_{w\in\mathcal{W}} R_{\ell_2}(w) + 1\right).$$

**Remark 2** Note that Theorem 1 is very general in the sense that it can be applied to infinite-dimensional Hilbert spaces, provided the domain $\mathcal{W}$ has a finite covering number [Cucker and Smale, 2002]. In contrast, the result of Audibert and Catoni [2011] is limited to finite-dimensional spaces. The difference is caused by the different techniques used in the analysis: while Audibert and Catoni [2011] employ a PAC-Bayesian analysis, we make use of the covering number and standard concentration inequalities.

To reveal the order of the excess risk, we need to specify the value of the covering number. To this end, we consider the special case that $\mathcal{W}$ is a bounded subset of a Euclidean space, and introduce the following condition.

**Assumption 4** The domain $\mathcal{W}$ is a subset of $\mathbb{R}^d$ and its radius is bounded by $B$, that is,
$$\|w\| \le B, \quad \forall w \in \mathcal{W} \subseteq \mathbb{R}^d. \tag{7}$$

Let $\mathcal{B}_r \subset \mathbb{R}^d$ be the ball centered at the origin with radius $r$, and $\mathcal{N}(\mathcal{B}_r, \varepsilon)$ be its $\varepsilon$-net with minimal cardinality, denoted by $N(\mathcal{B}_r, \varepsilon)$. According to a standard volume comparison argument [Pisier, 1989], we have
$$\log N(\mathcal{B}_1, \varepsilon) \le d\log\frac{3}{\varepsilon} \quad\Longrightarrow\quad \log N(\mathcal{B}_r, \varepsilon) \le d\log\frac{3r}{\varepsilon}.$$
Since $\mathcal{W} \subseteq \mathcal{B}_B$, we have
$$\log N(\mathcal{W}, \varepsilon) \le \log N\Big(\mathcal{B}_B, \frac{\varepsilon}{2}\Big) \le d\log\frac{6B}{\varepsilon},$$
where the first inequality holds because covering numbers are (almost) increasing by inclusion [Plan and Vershynin, 2013, (3.2)]. Then, we have the following corollary by setting $\varepsilon = 1/n$.

**Corollary 2** Let $0 < \delta < 1/2$, and set
$$\alpha = \sqrt{\frac{1}{n}\log\frac{N(\mathcal{W}, 1/n)}{\delta^2}}.$$
Under Assumptions 2, 3 and 4, with probability at least $1 - 2\delta$, we have
$$R_{\ell_1}(\widehat{w}) - R_{\ell_1}(w^*) \le \frac{2}{n}\mathbb{E}[\|x\|] + \sqrt{\frac{d\log(6nB) + 2\log\frac{1}{\delta}}{n}}\left(\frac{1}{n^2}\mathbb{E}\big[\|x\|^2\big] + \frac{3}{2}\sup_{w\in\mathcal{W}} R_{\ell_2}(w) + 1\right).$$

**Remark 3** Ignoring the logarithmic factor, Corollary 2 shows an $\widetilde{O}(\sqrt{d/n})$ excess risk that holds with high probability. From the above discussions, we see that the square-root dependence on $d$ comes from the upper bound of the covering number. If the domain $\mathcal{W}$ has additional structure (e.g., sparsity), the covering number may have a smaller dependence on $d$, and as a result, the dependence on $d$ could be improved.

### 3.3 Bounded Inputs

If we only allow the output to be heavy-tailed and the input is bounded, the problem becomes much easier. Although both our method and the algorithm of Brownlees et al. [2015] are applicable, at least in theory, there is no need to resort to sophisticated methods. In fact, a careful analysis shows that the classical ERM is sufficient in this case. We introduce the following assumption.

**Assumption 5** The norm of the random vector $x \in \mathbb{R}^d$ is upper bounded by a constant $D$, that is,
$$\|x\| \le D, \quad \forall (x, y) \sim \mathbb{P}. \tag{8}$$

Then, we have the following risk bound for ERM.

**Theorem 3** Let
$$\bar{w} \in \operatorname*{argmin}_{w\in\mathcal{W}} \ \frac{1}{n}\sum_{i=1}^n |y_i - x_i^\top w|$$
be a solution returned by ERM. Under Assumptions 4 and 5, with probability at least $1 - \delta$, we have
$$R_{\ell_1}(\bar{w}) - R_{\ell_1}(w^*) \le \frac{4BD}{\sqrt{n}} + 2BD\sqrt{\frac{2\log\frac{1}{\delta}}{n}}.$$

**Remark 4** When the inputs are bounded, we do not need any assumption about the outputs. That is because the absolute loss is 1-Lipschitz continuous, and the outputs are canceled in the analysis. Theorem 3 implies that ERM achieves an $O(D/\sqrt{n})$ excess risk which holds with high probability. Compared with the risk bound in Corollary 2, we observe that the new bound is independent of the dimensionality $d$, but it has a linear dependence on $D$, which is the upper bound of the norm of the inputs.
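To complement Theorem 3, here is a minimal projected-subgradient sketch of the ERM problem $\min_{\|w\|\le B}\frac{1}{n}\sum_{i=1}^n |y_i - x_i^\top w|$. The step-size schedule $B/(D\sqrt{t})$, the iteration budget, and the averaging of iterates follow the standard recipe for Lipschitz losses over a bounded domain and are our own choices; in practice the same problem can also be solved exactly as a linear program.

```python
import numpy as np

def l1_erm(X, y, B, iters=5000):
    """Projected subgradient descent for min_{||w|| <= B} (1/n) * sum_i |y_i - x_i^T w|."""
    n, d = X.shape
    D = np.max(np.linalg.norm(X, axis=1))                      # empirical bound on ||x_i|| (Assumption 5)
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, iters + 1):
        g = -(X * np.sign(y - X @ w)[:, None]).mean(axis=0)    # subgradient of the empirical risk
        w = w - (B / (D * np.sqrt(t))) * g
        nrm = np.linalg.norm(w)
        if nrm > B:                                            # projection onto {||w|| <= B}
            w = w * (B / nrm)
        w_avg += (w - w_avg) / t                               # running average of the iterates
    return w_avg

# Example: bounded inputs, heavy-tailed outputs (t-distributed noise with 2 degrees of freedom,
# whose variance is infinite but whose conditional median is x^T w_true).
rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.uniform(-1.0, 1.0, size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.standard_t(df=2, size=n)
print(np.round(l1_erm(X, y, B=10.0), 2), np.round(w_true, 2))
```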
## 4 Analysis

Due to space limitations, we only present the proof of Theorem 1. The proof of Theorem 3 can be found in the full paper [Zhang and Zhou, 2018].

### 4.1 Proof of Theorem 1

To simplify notations, define
$$\widehat{R}^\psi_{\ell_1}(w) = \frac{1}{n\alpha}\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top w|\big).$$
From the optimality of $\widehat{w}$, we have
$$\underbrace{\frac{1}{n\alpha}\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top \widehat{w}|\big)}_{\widehat{R}^\psi_{\ell_1}(\widehat{w})} \le \underbrace{\frac{1}{n\alpha}\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top w^*|\big)}_{\widehat{R}^\psi_{\ell_1}(w^*)}. \tag{9}$$
Next, we will discuss how to upper bound $\widehat{R}^\psi_{\ell_1}(w^*)$ by $R_{\ell_1}(w^*)$ and lower bound $\widehat{R}^\psi_{\ell_1}(\widehat{w})$ by $R_{\ell_1}(\widehat{w})$.

Because $w^*$ is independent of the samples $(x_1, y_1), \ldots, (x_n, y_n)$, it is easy to relate $\widehat{R}^\psi_{\ell_1}(w^*)$ to $R_{\ell_1}(w^*)$ by standard concentration techniques. To this end, we have the following lemma.

**Lemma 1** With probability at least $1 - \delta$, we have
$$\widehat{R}^\psi_{\ell_1}(w^*) \le R_{\ell_1}(w^*) + \frac{\alpha}{2} R_{\ell_2}(w^*) + \frac{1}{n\alpha}\log\frac{1}{\delta}.$$

Lower bounding $\widehat{R}^\psi_{\ell_1}(\widehat{w})$ is more involved because $\widehat{w}$ depends on the samples. To this end, we combine the covering number and concentration inequalities to develop the following lemma.

**Lemma 2** With probability at least $1 - \delta$, we have
$$R_{\ell_1}(\widehat{w}) \le \widehat{R}^\psi_{\ell_1}(\widehat{w}) + 2\varepsilon\,\mathbb{E}[\|x\|] + \alpha\sup_{w\in\mathcal{W}} R_{\ell_2}(w) + \alpha\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big] + \frac{1}{n\alpha}\log\frac{N(\mathcal{W}, \varepsilon)}{\delta}$$
for any $\varepsilon > 0$.

Then, Theorem 1 is a direct consequence of (9), Lemmas 1 and 2, and the union bound.

### 4.2 Proof of Lemma 1

First, note that our truncation function $\psi$ satisfies
$$\psi(x) \le \log\Big(1 + x + \frac{x^2}{2}\Big), \quad \forall x \in \mathbb{R}. \tag{10}$$
Then, we have
$$\mathbb{E}\Big[\exp\big(n\alpha\,\widehat{R}^\psi_{\ell_1}(w^*)\big)\Big] = \mathbb{E}\bigg[\prod_{i=1}^n \exp\Big(\psi\big(\alpha|y_i - x_i^\top w^*|\big)\Big)\bigg] \overset{(10)}{\le} \mathbb{E}\bigg[\prod_{i=1}^n \Big(1 + \alpha|y_i - x_i^\top w^*| + \frac{\alpha^2}{2}(y_i - x_i^\top w^*)^2\Big)\bigg]$$
$$= \Big(\mathbb{E}\Big[1 + \alpha|y - x^\top w^*| + \frac{\alpha^2}{2}|y - x^\top w^*|^2\Big]\Big)^n = \Big(1 + \alpha R_{\ell_1}(w^*) + \frac{\alpha^2}{2} R_{\ell_2}(w^*)\Big)^n \overset{1+x\le e^x}{\le} \exp\Big(n\Big(\alpha R_{\ell_1}(w^*) + \frac{\alpha^2}{2} R_{\ell_2}(w^*)\Big)\Big).$$
By Chernoff's method [Lugosi, 2009], we have
$$\mathbb{P}\bigg[n\alpha\,\widehat{R}^\psi_{\ell_1}(w^*) \ge n\Big(\alpha R_{\ell_1}(w^*) + \frac{\alpha^2}{2} R_{\ell_2}(w^*)\Big) + \log\frac{1}{\delta}\bigg] = \mathbb{P}\bigg[\exp\big(n\alpha\,\widehat{R}^\psi_{\ell_1}(w^*)\big) \ge \exp\Big(n\Big(\alpha R_{\ell_1}(w^*) + \frac{\alpha^2}{2} R_{\ell_2}(w^*)\Big) + \log\frac{1}{\delta}\Big)\bigg]$$
$$\le \frac{\mathbb{E}\big[\exp\big(n\alpha\,\widehat{R}^\psi_{\ell_1}(w^*)\big)\big]}{\exp\big(n\big(\alpha R_{\ell_1}(w^*) + \frac{\alpha^2}{2} R_{\ell_2}(w^*)\big) + \log\frac{1}{\delta}\big)} \le \delta,$$
which completes the proof.

### 4.3 Proof of Lemma 2

Let $\mathcal{N}(\mathcal{W}, \varepsilon)$ be an $\varepsilon$-net of $\mathcal{W}$ with minimal cardinality $N(\mathcal{W}, \varepsilon)$. From the definition of an $\varepsilon$-net, there must exist a $\widetilde{w} \in \mathcal{N}(\mathcal{W}, \varepsilon)$ such that $\|\widehat{w} - \widetilde{w}\| \le \varepsilon$. So, we have
$$|y_i - x_i^\top \widehat{w}| \ge |y_i - x_i^\top \widetilde{w}| - |x_i^\top(\widetilde{w} - \widehat{w})| \ge |y_i - x_i^\top \widetilde{w}| - \varepsilon\|x_i\|. \tag{12}$$
Since $\psi(\cdot)$ is non-decreasing, we have
$$\widehat{R}^\psi_{\ell_1}(\widehat{w}) = \frac{1}{n\alpha}\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top \widehat{w}|\big) \overset{(12)}{\ge} \frac{1}{n\alpha}\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top \widetilde{w}| - \alpha\varepsilon\|x_i\|\big). \tag{13}$$
To proceed, we develop the following lemma to lower bound the last term in (13).

**Lemma 3** With probability at least $1 - \delta$, for all $\widetilde{w} \in \mathcal{N}(\mathcal{W}, \varepsilon)$, we have
$$\frac{1}{n\alpha}\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top \widetilde{w}| - \alpha\varepsilon\|x_i\|\big) \ge R_{\ell_1}(\widetilde{w}) - \left(\varepsilon\,\mathbb{E}[\|x\|] + \alpha\sup_{w\in\mathcal{W}} R_{\ell_2}(w) + \alpha\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big] + \frac{1}{n\alpha}\log\frac{N(\mathcal{W}, \varepsilon)}{\delta}\right). \tag{14}$$

Substituting (14) into (13), with probability at least $1 - \delta$, we have
$$\widehat{R}^\psi_{\ell_1}(\widehat{w}) \ge R_{\ell_1}(\widetilde{w}) - \left(\varepsilon\,\mathbb{E}[\|x\|] + \alpha\sup_{w\in\mathcal{W}} R_{\ell_2}(w) + \alpha\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big] + \frac{1}{n\alpha}\log\frac{N(\mathcal{W}, \varepsilon)}{\delta}\right)$$
$$\ge R_{\ell_1}(\widehat{w}) - \left(2\varepsilon\,\mathbb{E}[\|x\|] + \alpha\sup_{w\in\mathcal{W}} R_{\ell_2}(w) + \alpha\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big] + \frac{1}{n\alpha}\log\frac{N(\mathcal{W}, \varepsilon)}{\delta}\right),$$
where the last step is due to the following inequality:
$$R_{\ell_1}(\widehat{w}) = \mathbb{E}\big[|y - x^\top \widehat{w}|\big] \le \mathbb{E}\big[|y - x^\top \widetilde{w}| + |x^\top(\widetilde{w} - \widehat{w})|\big] \le R_{\ell_1}(\widetilde{w}) + \varepsilon\,\mathbb{E}[\|x\|].$$

### 4.4 Proof of Lemma 3

We first consider a fixed $\widetilde{w} \in \mathcal{N}(\mathcal{W}, \varepsilon) \subseteq \mathcal{W}$. The proof is similar to that of Lemma 1. Recall that the truncation function $\psi(\cdot)$ satisfies
$$\psi(x) \ge -\log\Big(1 - x + \frac{x^2}{2}\Big), \quad \forall x \in \mathbb{R}. \tag{15}$$
Then, we have
$$\mathbb{E}\bigg[\exp\bigg(-\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top \widetilde{w}| - \alpha\varepsilon\|x_i\|\big)\bigg)\bigg] \overset{(15)}{\le} \mathbb{E}\bigg[\prod_{i=1}^n \Big(1 - \alpha|y_i - x_i^\top \widetilde{w}| + \alpha\varepsilon\|x_i\| + \frac{\alpha^2}{2}\big(|y_i - x_i^\top \widetilde{w}| - \varepsilon\|x_i\|\big)^2\Big)\bigg]$$
$$= \Big(1 - \alpha R_{\ell_1}(\widetilde{w}) + \alpha\varepsilon\,\mathbb{E}[\|x\|] + \frac{\alpha^2}{2}\,\mathbb{E}\Big[\big(|y - x^\top \widetilde{w}| - \varepsilon\|x\|\big)^2\Big]\Big)^n \le \exp\Big(n\Big(-\alpha R_{\ell_1}(\widetilde{w}) + \alpha\varepsilon\,\mathbb{E}[\|x\|] + \alpha^2 R_{\ell_2}(\widetilde{w}) + \alpha^2\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big]\Big)\Big),$$
where the last step is due to the basic inequalities $1 + x \le e^x$ and $(a + b)^2 \le 2a^2 + 2b^2$. By Chernoff's method [Lugosi, 2009], we have
$$\mathbb{P}\bigg[-\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top \widetilde{w}| - \alpha\varepsilon\|x_i\|\big) \ge n\Big(-\alpha R_{\ell_1}(\widetilde{w}) + \alpha\varepsilon\,\mathbb{E}[\|x\|] + \alpha^2 R_{\ell_2}(\widetilde{w}) + \alpha^2\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big]\Big) + \log\frac{1}{\delta}\bigg]$$
$$= \mathbb{P}\bigg[\exp\bigg(-\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top \widetilde{w}| - \alpha\varepsilon\|x_i\|\big)\bigg) \ge \exp\Big(n\Big(-\alpha R_{\ell_1}(\widetilde{w}) + \alpha\varepsilon\,\mathbb{E}[\|x\|] + \alpha^2 R_{\ell_2}(\widetilde{w}) + \alpha^2\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big]\Big) + \log\frac{1}{\delta}\Big)\bigg]$$
$$\le \frac{\mathbb{E}\Big[\exp\Big(-\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top \widetilde{w}| - \alpha\varepsilon\|x_i\|\big)\Big)\Big]}{\exp\Big(n\big(-\alpha R_{\ell_1}(\widetilde{w}) + \alpha\varepsilon\,\mathbb{E}[\|x\|] + \alpha^2 R_{\ell_2}(\widetilde{w}) + \alpha^2\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big]\big) + \log\frac{1}{\delta}\Big)} \le \delta. \tag{16}$$
Thus, with probability at least $1 - \delta$, we have
$$\frac{1}{n\alpha}\sum_{i=1}^n \psi\big(\alpha|y_i - x_i^\top \widetilde{w}| - \alpha\varepsilon\|x_i\|\big) \ge R_{\ell_1}(\widetilde{w}) - \left(\varepsilon\,\mathbb{E}[\|x\|] + \alpha R_{\ell_2}(\widetilde{w}) + \alpha\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big] + \frac{1}{n\alpha}\log\frac{1}{\delta}\right)$$
$$\ge R_{\ell_1}(\widetilde{w}) - \left(\varepsilon\,\mathbb{E}[\|x\|] + \alpha\sup_{w\in\mathcal{W}} R_{\ell_2}(w) + \alpha\varepsilon^2\,\mathbb{E}\big[\|x\|^2\big] + \frac{1}{n\alpha}\log\frac{1}{\delta}\right).$$
We complete the proof by taking the union bound over all $\widetilde{w} \in \mathcal{N}(\mathcal{W}, \varepsilon)$.
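The proofs of Lemmas 1 and 3 rely only on the two sides of (2), restated as (10) and (15). As a quick numerical sanity check (ours, not part of the paper), the snippet below verifies both inequalities on a grid for the piecewise truncation function in (4).

```python
import numpy as np

def psi_ac(x):
    # The truncation function in (4): -log(1 - x + x^2/2) on [0, 1], log(2) for x >= 1,
    # extended to negative arguments by psi(x) = -psi(-x).
    ax = np.abs(x)
    pos = np.where(ax <= 1.0, -np.log1p(-ax + 0.5 * ax**2), np.log(2.0))
    return np.sign(x) * pos

x = np.linspace(-20.0, 20.0, 400001)
lower = -np.log1p(-x + 0.5 * x**2)     # left-hand side of (2), used as (15)
upper = np.log1p(x + 0.5 * x**2)       # right-hand side of (2), used as (10)
vals = psi_ac(x)
print(np.all(lower <= vals + 1e-12), np.all(vals <= upper + 1e-12))   # expect: True True
```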
## 5 Conclusion and Future Work

In this paper, we consider ℓ1-regression with heavy-tailed distributions, and propose a truncated minimization problem. Under mild assumptions, we prove that our method enjoys an $\widetilde{O}(\sqrt{d/n})$ excess risk, which holds with high probability. Compared with traditional work on ℓ1-regression, the main advantage of our result is that we establish a high-probability bound without exponential moment conditions on the input and output. Furthermore, we demonstrate that when the input is bounded, the classical ERM is sufficient for ℓ1-regression.

In the future, we will develop optimization algorithms and theories for the non-convex problem in (6). Another future direction is to apply the idea of truncated minimization to other losses in machine learning, especially Lipschitz losses.

## Acknowledgments

This work was partially supported by the NSFC (61751306), YESS (2017QNRC001), and the Collaborative Innovation Center of Novel Software Technology and Industrialization. We thank an anonymous reviewer of COLT 2018 for helping us simplify the proof of Theorem 1.

## References

N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.

P. Alquier, V. Cottet, and G. Lecué. Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. arXiv e-prints, arXiv:1702.01402, 2017.

J.-Y. Audibert and O. Catoni. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.

L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 1998.

C. Brownlees, E. Joly, and G. Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.

S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

O. Catoni. Challenging the empirical mean and empirical variance: A deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185, 2012.

F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.

V. C. Dinh, L. S. Ho, B. Nguyen, and D. Nguyen. Fast learning rates with heavy-tailed losses. In Advances in Neural Information Processing Systems 29, pages 505–513, 2016.

B. Finkenstädt and H. Rootzén, editors. Extreme Values in Finance, Telecommunications, and the Environment. Chapman & Hall/CRC, 2003.

S. Foss, D. Korshunov, and S. Zachary. An Introduction to Heavy-Tailed and Subexponential Distributions. Springer, 2013.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, 2009.

E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems 28, pages 1594–1602, 2015.
D. Hsu and S. Sabato. Heavy-tailed regression with a generalized median-of-means. In Proceedings of the 31st International Conference on Machine Learning, pages 37–45, 2014.

D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1–40, 2016.

V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011.

G. Lugosi. Concentration-of-measure inequalities. Technical report, Department of Economics, Pompeu Fabra University, 2009.

G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. arXiv e-prints, arXiv:1608.00757, 2016.

S. Mendelson. Learning without concentration. In Proceedings of the 27th Annual Conference on Learning Theory, pages 25–39, 2014.

S. Mendelson. Learning without concentration. Journal of the ACM, 62(3):21:1–21:25, 2015.

S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.

A. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.

G. Pisier. The Volume of Convex Bodies and Banach Space Geometry. Cambridge Tracts in Mathematics (No. 94). Cambridge University Press, 1989.

Y. Plan and R. Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.

P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, 1987.

M. Talagrand. The Generic Chaining. Springer, 2005.

V. Vapnik. The Nature of Statistical Learning Theory. Springer, second edition, 2000.

L. Zhang and Z.-H. Zhou. ℓ1-regression with heavy-tailed distributions. arXiv e-prints, arXiv:1805.00616, 2018.

L. Zhang, T. Yang, and R. Jin. Empirical risk minimization for stochastic convex optimization: O(1/n)- and O(1/n²)-type of risk bounds. In Proceedings of the 30th Annual Conference on Learning Theory, pages 1954–1979, 2017.