# On Binary Classification in Extreme Regions

Hamid Jalalzai, Stephan Clémençon and Anne Sabourin
LTCI, Télécom ParisTech, Université Paris-Saclay, 75013 Paris, France
first.last@telecom-paristech.fr

*32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.*

In pattern recognition, a random label $Y$ is to be predicted based upon observing a random vector $X$ valued in $\mathbb{R}^d$ with $d \geq 1$ by means of a classification rule with minimum probability of error. In a wide variety of applications, ranging from finance/insurance to environmental sciences through teletraffic data analysis for instance, extreme (i.e. very large) observations $X$ are of crucial importance, while contributing in a negligible manner to the (empirical) error, simply because of their rarity. As a consequence, empirical risk minimizers generally perform very poorly in extreme regions. It is the purpose of this paper to develop a general framework for classification in the extremes. Precisely, under non-parametric heavy-tail assumptions for the class distributions, we prove that a natural and asymptotic notion of risk, accounting for predictive performance in extreme regions of the input space, can be defined, and we show that minimizers of an empirical version of a non-asymptotic approximant of this dedicated risk, based on a fraction of the largest observations, lead to classification rules with good generalization capacity, by means of maximal deviation inequalities in low probability regions. Beyond the theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed.

## 1 Introduction

Because it covers a wide range of practical applications and its probabilistic theory can be straightforwardly extended, to some extent, to various other prediction problems, binary classification can be considered as the flagship problem in statistical learning. In the standard setup, $(X, Y)$ is a random pair defined on a certain probability space with (unknown) joint probability distribution $P$, where the (output) r.v. $Y$ is a binary label, taking its values in $\{-1, +1\}$ say, and $X$ models some information, valued in $\mathbb{R}^d$ and hopefully useful to predict $Y$. In this context, the goal pursued is generally to build, from a training sample $\mathcal{D}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ composed of $n \geq 1$ i.i.d. realizations of $(X, Y)$, a classifier $g: \mathbb{R}^d \to \{-1, +1\}$ minimizing the probability of error $L_P(g) = \mathbb{P}\{Y \neq g(X)\}$. The Empirical Risk Minimization paradigm (ERM in abbreviated form, see e.g. [5]) suggests considering solutions $\hat{g}_n$ of the minimization problem $\min_{g \in \mathcal{G}} \widehat{L}_n(g)$, where $\widehat{L}_n(g)$ is a statistical estimate of the risk $L_P(g)$. In general the empirical version $\widehat{L}_n(g) = (1/n)\sum_{i=1}^{n} \mathbb{1}\{Y_i \neq g(X_i)\}$ is considered, denoting by $\mathbb{1}\{\mathcal{E}\}$ the indicator function of any event $\mathcal{E}$. This amounts to replacing $P$ in $L_P$ with the empirical distribution of the $(X_i, Y_i)$'s. The class $\mathcal{G}$ of predictive rules is supposed to be rich enough to contain a reasonable approximant of the minimizer of $L_P$, i.e. the Bayes classifier $g^*(x) = 2\,\mathbb{1}\{\eta(x) \geq 1/2\} - 1$, where $\eta(X) = \mathbb{P}\{Y = +1 \mid X\}$ denotes the posterior probability. Because extreme observations $X$, i.e. observations whose norm $\|X\|$ exceeds some large threshold $t > 0$, are rare and thus underrepresented in the training dataset $\mathcal{D}_n$, classification errors in these regions of the input space may have a negligible impact on the global prediction error of $\hat{g}_n$.
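To fix ideas on how little the extremes weigh in the standard empirical risk, the following short simulation (an illustration of the point above, not taken from the paper; the unit-Pareto norms and the 99.9% threshold are arbitrary choices) counts how many of $n$ heavy-tailed observations fall above a high threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
norms = rng.pareto(1.0, size=n) + 1.0   # heavy-tailed surrogate for ||X_i|| (unit Pareto)
t = np.quantile(norms, 0.999)           # a "large" threshold, here the 99.9% quantile

n_extreme = int(np.sum(norms > t))
# Even a classifier that errs on *every* extreme point increases the standard
# empirical risk (1/n) * (number of errors) by at most n_extreme / n:
print(n_extreme, n_extreme / n)         # about 100 points, i.e. at most a 0.1% increase
```

Minimizing the global empirical risk therefore says essentially nothing about the error committed on $\{x: \|x\| > t\}$.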
Notice incidentally that the threshold $t$ may depend on $n$, since "large" should naturally be understood as large w.r.t. the vast majority of the data previously observed. Using the total probability formula, one may indeed write
$$L_P(g) = \mathbb{P}\{\|X\| > t\}\,\mathbb{P}\{Y \neq g(X) \mid \|X\| > t\} + \mathbb{P}\{\|X\| \leq t\}\,\mathbb{P}\{Y \neq g(X) \mid \|X\| \leq t\}. \quad (1)$$
Hence, due to the extremely small order of magnitude of $\mathbb{P}\{\|X\| > t\}$ and of its empirical counterpart, there is no guarantee that the standard ERM strategy produces an optimal classifier on the extreme region $\{x: \|x\| > t\}$. In other words, the quantity $\mathbb{P}\{Y \neq \hat{g}_n(X) \mid \|X\| > t\}$ may be far from optimal, whereas in certain practical applications (e.g. finance, insurance, environmental sciences, aeronautics safety), accurate prediction in extreme regions is crucial. The purpose of the subsequent analysis is to investigate the problem of building a classifier such that the first term of the decomposition (1) is asymptotically minimum as $t \to +\infty$. We thus consider the conditional probability of error, next referred to as the classification risk above level $t$, given by
$$L_t(g) := L_{P_t}(g) = \mathbb{P}\{Y \neq g(X) \mid \|X\| > t\}, \quad (2)$$
denoting by $P_t$ the conditional distribution of $(X, Y)$ given $\|X\| > t$. In this paper, we address the issue of learning a classifier $g_n$ whose risk $L_t(g_n)$ is asymptotically minimum as $t \to \infty$ with high probability. In order to develop a framework showing that a variant of the ERM principle tailored to this statistical learning problem leads to predictive rules with good generalization capacities, (non-parametric) distributional assumptions related to the tail behavior of the class distributions $F_+$ and $F_-$, the conditional distributions of the input r.v. $X$ given $Y = +1$ and $Y = -1$ respectively, are required. Precisely, we assume that they are both multivariate regularly varying, which corresponds to a large non-parametric class of (heavy-tailed) distributions, widely used in applications where the impact of extreme observations should be enhanced, or at least not neglected. Hence, under appropriate non-parametric assumptions on $F_+$ and $F_-$, as well as on the tail behavior of $\eta(x)$, we prove that $\min_g L_t(g)$ converges to a quantity denoted by $L_\infty^*$ and referred to as the asymptotic risk in the extremes, as $t \to \infty$. It is also shown that this limit can be interpreted as the minimum classification error related to a (non-observable) random pair $(X_\infty, Y_\infty)$, whose distribution $P_\infty$ corresponds to the limit of the conditional distribution of $(X, Y)$ given $\|X\| > t$, for an appropriate normalization of $X$, as $t \to \infty$. With respect to the goal set above, we next investigate the performance of minimizers $\hat{g}_{n,\tau}$ of an empirical version of the risk $L_{P_{t_\tau}}$, where $t_\tau$ is the $(1-\tau)$ quantile of the r.v. $\|X\|$ and $\tau \ll 1$. The computation of $\hat{g}_{n,\tau}$ involves the $\lfloor n\tau \rfloor$ input observations with largest norm, and the minimization is performed over a collection of classifiers of finite VC dimension. Based on a variant of the VC inequality tailored to low probability regions, rate bounds for the deviation $L_t(\hat{g}_{n,\tau}) - L_\infty^*$ are established, of order $O_{\mathbb{P}}(1/\sqrt{n\tau})$ namely. These theoretical results are also illustrated by preliminary experiments based on synthetic data. The rest of the paper is organized as follows. Multivariate extreme value theory (MEVT) notions involved in the framework we develop are described in Section 2, together with the probabilistic setup we consider for classification in the extremes. A notion of risk tailored to this statistical learning task is also introduced therein. Section 3 investigates how to extend the ERM principle to this situation.
In particular, probability bounds proving the generalization ability of minimizers of a non-asymptotic approximant of the risk previously introduced are established. Illustrative numerical results are displayed in Section 4, while several concluding remarks are collected in Section 5. Some technical details and proofs are deferred to the Supplementary Material.

## 2 Probabilistic Framework - Preliminary Results

We start off with recalling concepts pertaining to MEVT and next develop a general framework in order to formulate the problem of binary classification in the extremes in a rigorous manner. For completeness, additional details about regular variation and vague convergence are given in the Supplementary Material (Appendix A).

### 2.1 Regularly Varying Random Vector

By definition, heavy-tail phenomena are those which are ruled by very large values, occurring with a far from negligible probability and with significant impact on the system under study. When the phenomenon of interest is described by the distribution of a univariate random variable, the theory of regularly varying functions provides the appropriate mathematical framework for the study of heavy-tailed distributions. One may refer to [11] for an excellent account of the theory of regularly varying functions and its application to the study of heavy-tailed distributions. For examples of works where such assumptions are considered in the context of statistical learning, see e.g. [6, 3, 12, 10, 1] or [8].

Let $\alpha > 0$. A random variable $X$ is said to be regularly varying with tail index $\alpha$ if
$$\mathbb{P}\{X > tx \mid X > t\} \xrightarrow[t \to \infty]{} x^{-\alpha}, \quad x > 1.$$
This is the case if and only if there exists a function $b: \mathbb{R}_+ \to \mathbb{R}_+$ with $b(t) \to \infty$ such that for all $x > 0$, the quantity $t\,\mathbb{P}\{X/b(t) > x\}$ converges to some limit $h(x)$ as $t \to \infty$. Then $b$ may be chosen as $b(t) = t^{1/\alpha}$ and $h(x) = c\,x^{-\alpha}$ for some $c > 0$. Based on this characterization, the heavy-tail model can be extended to the multivariate setup. Consider a $d$-dimensional random vector $X = (X^{(1)}, \ldots, X^{(d)})$ taking its values in $\mathbb{R}_+^d$. Assume that all the $X^{(j)}$'s are regularly varying with index $\alpha > 0$. Then the random vector $X$ is said to be regularly varying with tail index $\alpha$ if there exists a non-null positive Radon measure $\mu$ on the punctured space $E = [0, \infty]^d \setminus \{0\}$ and a function $b(t) \to \infty$ such that for every Borel set $A \subset E$ such that $0 \notin \bar{A}$ and $\mu(\partial A) = 0$,
$$t\,\mathbb{P}\{X/b(t) \in A\} \xrightarrow[t \to \infty]{} \mu(A).$$
In such a case, the so-called exponent measure $\mu$ fulfills the homogeneity property $\mu(tC) = t^{-\alpha}\mu(C)$ for all $t > 0$ and any Borel set $C \subset E$. This suggests a decomposition of $\mu$ into a radial component and an angular component $\Phi$. For all $x = (x_1, \ldots, x_d) \in \mathbb{R}_+^d$, set
$$R(x) = \|x\|, \qquad \Theta(x) = \left(\frac{x_1}{R(x)}, \ldots, \frac{x_d}{R(x)}\right) \in S,$$
where $S$ is the positive orthant of the unit sphere in $\mathbb{R}^d$ for the chosen norm $\|\cdot\|$. The choice of the norm is unimportant as all norms are equivalent in $\mathbb{R}^d$. Define an angular measure $\Phi$ on $S$ as $\Phi(B) = \mu\{r\theta : \theta \in B,\ r \geq 1\}$, for $B \subset S$ measurable. The angular measure $\Phi$ is finite, and the conditional distribution of $(R(X)/t, \Theta(X))$ given that $R(X) > t$ converges as $t \to \infty$ to a limit which admits the following product decomposition: for $r \geq 1$ and $B \subset S$ such that $\Phi(\partial B) = 0$,
$$\lim_{t \to \infty} \mathbb{P}\{R(X)/t > r,\ \Theta(X) \in B \mid R(X) > t\} = c\,\mu\{x: R(x) > r,\ \Theta(x) \in B\} = c\,r^{-\alpha}\,\Phi(B),$$
where $c = \mu\{x: R(x) > 1\}^{-1} = \Phi(S)^{-1}$ is a normalizing constant. Thus $c\Phi$ may be viewed as the limiting distribution of $\Theta(X)$ given that $R(X)$ is large.

**Remark 1** It is assumed above that all marginal distributions are tail equivalent to the Pareto distribution with index $\alpha$.
In practice, the tails of the marginals may be different and it may be convenient to work with marginally standardized variables, that is, to separate the margins $F_j(x_j) = \mathbb{P}\{X^{(j)} \leq x_j\}$ from the dependence structure in the description of the joint distribution of $X$. Consider the standardized variables $V^{(j)} = 1/(1 - F_j(X^{(j)})) \in [1, \infty]$ and $V = (V^{(1)}, \ldots, V^{(d)})$. Replacing $X$ by $V$ permits to take $\alpha = 1$ and $b(t) = t$.

### 2.2 Classification in the Extremes - Assumptions, Criterion and Optimal Elements

We place ourselves in the binary classification framework recalled in the introduction. For simplicity, we suppose that $X$ takes its values in the positive orthant $\mathbb{R}_+^d$. The general aim is to build, from training data in the extreme region (i.e. data points $(X_i, Y_i)$ such that $\|X_i\| > t_n$ for a large threshold value $t_n > 0$), a classifier $g_n(x)$ whose risk $L_{t_n}(g_n)$ defined in (2) is asymptotically minimum as $t_n \to \infty$. For this purpose, we introduce general assumptions guaranteeing that the minimum risk $L_t(g^*)$ above level $t$ has a limit as $t \to \infty$. Throughout the article, we assume that the class distributions $F_+$ and $F_-$ are heavy-tailed with the same index $\alpha = 1$.

**Assumption 1** For each $\sigma \in \{-, +\}$, the conditional distribution of $X$ given $Y = \sigma 1$ is regularly varying with index $1$ and angular measure $\Phi_\sigma(d\theta)$ (respectively, exponent measure $\mu_\sigma(dx)$): for $A \subset [0, \infty]^d \setminus \{0\}$ a measurable set such that $0 \notin \bar{A}$ and $\mu_\sigma(\partial A) = 0$,
$$t\,\mathbb{P}\{t^{-1}X \in A \mid Y = \sigma 1\} \xrightarrow[t \to \infty]{} \mu_\sigma(A), \quad \sigma \in \{-, +\},$$
and for $B \subset S$ a measurable set, $\Phi_\sigma(B) = \mu_\sigma\{x \in \mathbb{R}_+^d : R(x) > 1,\ \Theta(x) \in B\}$, $\sigma \in \{-, +\}$.

Under the hypothesis above, $X$'s marginal distribution, given by $F = pF_+ + (1-p)F_-$ where $p = \mathbb{P}\{Y = +1\} > 0$, is heavy-tailed as well with index $1$. Indeed, we have
$$t\,\mathbb{P}\{t^{-1}X \in A\} \xrightarrow[t \to \infty]{} \mu(A) := p\,\mu_+(A) + (1-p)\,\mu_-(A),$$
and similarly $\Phi(B) := p\,\Phi_+(B) + (1-p)\,\Phi_-(B)$. Observe also that the limiting class balance can be expressed using the latter asymptotic measures. Indeed, let $\Omega = \{x \in \mathbb{R}_+^d : \|x\| \leq 1\}$ denote the positive orthant of the unit ball and let $\Omega^c$ denote its complementary set in $\mathbb{R}_+^d$. We have
$$p_t = \mathbb{P}\{Y = +1 \mid \|X\| > t\} = \frac{t\,\mathbb{P}\{\|X\| > t \mid Y = +1\}\,p}{t\,\mathbb{P}\{\|X\| > t\}} \xrightarrow[t \to \infty]{} \frac{p\,\mu_+(\Omega^c)}{\mu(\Omega^c)} = \frac{p\,\Phi_+(S)}{\Phi(S)} =: p_\infty. \quad (3)$$

**Remark 2** (ON ASSUMPTION 1) We point out that only the situation where the supposedly heavy-tailed class distributions $F_+$ and $F_-$ have the same tail index is of interest. Suppose for instance that the tail index $\alpha_+$ of $F_+$ is strictly larger than that of $F_-$, namely $\alpha_-$, that is, $F_-$ has a heavier tail than $F_+$. In such a case $F$ is still regularly varying with index $\min\{\alpha_+, \alpha_-\}$ and $p_t \to 0$. One may then straightforwardly see that the classifier always predicting $-1$ on $\{x \in \mathbb{R}_+^d : \|x\| > t\}$ is optimal as $t$ increases to infinity.

**Remark 3** (ON ASSUMPTION 1 (BIS)) As noticed in Remark 1, assuming that $\alpha = 1$ is not restrictive when the marginal distributions are known. In practice however, they must be estimated. Due to space limitations, in the present analysis we shall neglect the error arising from their estimation. Relaxing this assumption, as done in e.g. [7], will be the subject of future work.

**Asymptotic criterion for classification in the extremes.** The goal pursued is to construct a classifier $g_n$, based on the training examples $\mathcal{D}_n$, minimizing the asymptotic risk in the extremes given by
$$L_\infty(g) = \limsup_{t \to \infty} L_t(g). \quad (4)$$
We also set $L_\infty^* = \inf_{g \text{ measurable}} L_\infty(g)$. It is immediate that any classifier which coincides with the Bayes classifier $g^*$ on the region $t\Omega^c = \{x \in \mathbb{R}_+^d : \|x\| > t\}$ is optimal w.r.t. the distribution $P_t$. In particular $g^*$ minimizes $L_t$ and the associated risk is
$$L_t^* := L_t(g^*) = \mathbb{E}\left[\min\{\eta(X), 1 - \eta(X)\} \mid \|X\| > t\right], \quad t > 0. \quad (5)$$
Thus, for any classifier $g$, $L_t(g) \geq L_t(g^*)$, and taking the limit superior shows that $g^*$ minimizes $L_\infty$, that is, $L_\infty^* = L_\infty(g^*)$.

**Optimality.** The objective formulated above can be connected with a standard binary classification problem, related to a random pair $(X_\infty, Y_\infty)$ taking its values in the limit space $\Omega^c \times \{-1, +1\}$, see Theorem 1 below. Let $\mathbb{P}\{Y_\infty = +1\} = p_\infty$ as in (3) and define the distribution of $X_\infty$ given that $Y_\infty = \sigma 1$, $\sigma \in \{-, +\}$, as $\mu_\sigma(\Omega^c)^{-1}\mu_\sigma(\cdot)$. Then for $A \subset \Omega^c$, using (3),
$$\mathbb{P}\{X_\infty \in A,\ Y_\infty = +1\} = p_\infty \frac{\mu_+(A)}{\mu_+(\Omega^c)} = \frac{p\,\mu_+(A)}{\mu(\Omega^c)} = \frac{p\,\lim_{t\to\infty} t\,\mathbb{P}\{X \in tA \mid Y = +1\}}{\lim_{t\to\infty} t\,\mathbb{P}\{X \in t\Omega^c\}} = \lim_{t \to \infty} \mathbb{P}\{X \in tA,\ Y = +1 \mid \|X\| > t\}.$$
We denote by $P_\infty$ the joint distribution of $(X_\infty, Y_\infty)$ thus defined. As shall be seen below, under appropriate and natural assumptions, classifiers with minimum asymptotic risk in the extremes are in one-to-one correspondence with solutions of the binary classification problem related to $(X_\infty, Y_\infty)$. Let $\rho$ be a common dominating measure for $\Phi_-, \Phi_+$ on $S$ ($\rho$ does not need to be the Lebesgue measure, take e.g. $\rho = \Phi_+ + \Phi_-$), and denote by $\varphi_+, \varphi_-$ respectively the densities of $\Phi_+, \Phi_-$ w.r.t. $\rho$. By homogeneity of $\mu_+, \mu_-$, the conditional distribution of $Y_\infty$ given $X_\infty = x$ is
$$\eta_\infty(x) \overset{\mathrm{def}}{=} \mathbb{P}\{Y_\infty = +1 \mid X_\infty = x\} = \frac{p_\infty\,\varphi_+(\Theta(x))/\Phi_+(S)}{p_\infty\,\varphi_+(\Theta(x))/\Phi_+(S) + (1 - p_\infty)\,\varphi_-(\Theta(x))/\Phi_-(S)} = \frac{p\,\varphi_+(\Theta(x))}{p\,\varphi_+(\Theta(x)) + (1-p)\,\varphi_-(\Theta(x))}.$$
Notice that $\eta_\infty$ is independent of the chosen reference measure $\rho$ and that $\eta_\infty$ is constant along rays, that is, $\eta_\infty(tx) = \eta_\infty(x)$ for $(t, x)$ such that $\min(\|tx\|, \|x\|) \geq 1$. The optimal classifier for the random pair $(X_\infty, Y_\infty)$ with respect to the classical risk $L_{P_\infty}$ is clearly $g_\infty^*(x) = 2\,\mathbb{1}\{\eta_\infty(x) \geq 1/2\} - 1$. Again, $g_\infty^*$ is constant along rays on $\Omega^c$ and is thus a function of $\Theta(x)$ only. We abusively write $\eta_\infty(x) = \eta_\infty(\Theta(x))$. The minimum classification error is
$$L_{P_\infty}^* = L_{P_\infty}(g_\infty^*) = \mathbb{E}\left[\min\{\eta_\infty(\Theta_\infty),\ 1 - \eta_\infty(\Theta_\infty)\}\right], \quad (6)$$
where $\Theta_\infty = \Theta(X_\infty)$. More generally, observe that any class $\mathcal{G}_S$ of classifiers $g: \theta \in S \mapsto g(\theta) \in \{-1, +1\}$ defines a class of classifiers on $\mathbb{R}_+^d$, namely $x \in \mathbb{R}_+^d \mapsto g(\Theta(x))$, that shall still be denoted by $\mathcal{G}_S$ for simplicity. The next result claims that, under the regularity hypothesis stated below, the classifier $g_\infty^*$ is optimal for the asymptotic risk in the extremes, that is, $L_\infty(g_\infty^*) = \inf_g L_\infty(g)$. We shall also prove that $L_\infty(g_\infty^*) = L_{P_\infty}^*$.

**Assumption 2** (UNIFORM CONVERGENCE ON THE SPHERE OF $\eta(tx)$) The limiting regression function $\eta_\infty$ is continuous on $S$ and
$$\sup_{\theta \in S} |\eta(t\theta) - \eta_\infty(\theta)| \xrightarrow[t \to \infty]{} 0.$$

**Remark 4** (ON ASSUMPTION 2) By invariance of $\eta_\infty$ along rays, Assumption 2 is equivalent to
$$\sup_{\{x \in \mathbb{R}_+^d : \|x\| \geq t\}} |\eta(x) - \eta_\infty(x)| \xrightarrow[t \to \infty]{} 0.$$
Assumption 2 is satisfied whenever the probability densities $f_+, f_-$ of $F_+, F_-$ are continuous, regularly varying with limit functions $q_+, q_-$, and when the convergence is uniform, that is, if
$$\lim_{t \to \infty} \sup_{x \in S} |t^{d+1} f_\sigma(tx) - q_\sigma(x)| = 0, \quad \sigma \in \{+, -\}. \quad (7)$$
In such a case $q_+, q_-$ are respectively the densities of $\mu_+, \mu_-$ with respect to the Lebesgue measure and are continuous, which implies the continuity of $\varphi_+, \varphi_-$. The latter uniform convergence assumption is introduced in [4] and is used e.g. in [2] in the context of minimum level set estimation.

**Theorem 1** (OPTIMAL CLASSIFIERS IN THE EXTREMES) Under Assumptions 1 and 2,
$$L_t^* \xrightarrow[t \to \infty]{} L_{P_\infty}^*. \quad (8)$$
Hence, we have $L_\infty^* = L_{P_\infty}^*$. In addition, the classifier $g_\infty^*$ minimizes the asymptotic risk in the extremes:
$$\inf_{g \text{ measurable}} L_\infty(g) = L_\infty(g_\infty^*) = \mathbb{E}\left[\min\{\eta_\infty(\Theta_\infty),\ 1 - \eta_\infty(\Theta_\infty)\}\right].$$

Refer to the Supplementary Material for the technical proof. Theorem 1 gives us the form of the optimal classifier in the extremes, $g_\infty^*(x) = g_\infty^*(\Theta(x))$, which depends only on the angular component $\Theta(x)$, not on the norm $R(x)$.
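To make the limit objects concrete, here is a minimal worked sketch of $\eta_\infty$ and $g_\infty^*$ in dimension $d = 2$ with the $L_1$ norm, where $S$ is parametrized by $\theta = x_1/(x_1 + x_2) \in [0, 1]$. The Beta angular densities, the value $p = 0.5$ and the normalization $\Phi_+(S) = \Phi_-(S) = 1$ are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np
from scipy.stats import beta

# Hypothetical angular densities of the two classes on S, normalized so that
# Phi_+(S) = Phi_-(S) = 1 (then the rightmost expression for eta_infty applies).
phi_plus = beta(5, 2).pdf
phi_minus = beta(2, 5).pdf
p = 0.5                                   # class balance P{Y = +1}

def eta_inf(theta):
    """Limiting regression function eta_infty(theta), theta in (0, 1)."""
    num = p * phi_plus(theta)
    return num / (num + (1 - p) * phi_minus(theta))

def g_star_inf(x):
    """Optimal classifier in the extremes: a function of Theta(x) only."""
    x = np.asarray(x, dtype=float)
    theta = x[0] / x.sum()                # angular component for the L1 norm, d = 2
    return 1 if eta_inf(theta) >= 0.5 else -1

# Constant along rays: rescaling x does not change the prediction.
assert g_star_inf([30.0, 10.0]) == g_star_inf([3000.0, 1000.0])
```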
This naturally leads to applying the ERM principle to a collection of classifiers of the form $g(x) = g(\Theta(x))$ on the domain $\{x \in \mathbb{R}_+^d : \|x\| > t\}$ for $t > 0$ large enough. The next section provides statistical guarantees for this approach.

## 3 Empirical Risk Minimization in the Extremes

Consider a class $\mathcal{G}_S$ of classifiers $g: \theta \in S \mapsto g(\theta) \in \{-1, +1\}$ on the sphere $S$. It also defines a collection of classifiers on $\mathbb{R}_+^d$, namely $\{g(\Theta(x)) : g \in \mathcal{G}_S\}$, which we still denote by $\mathcal{G}_S$ for simplicity. Sorting the training observations by decreasing order of magnitude, we introduce the order statistics $\|X_{(1)}\| > \ldots > \|X_{(n)}\|$ and denote by $Y_{(i)}$ the corresponding sorted labels. Fix a small fraction $\tau > 0$ of extreme observations, and let $t_\tau$ be the quantile at level $(1-\tau)$ of the r.v. $\|X\|$: $\mathbb{P}\{\|X\| > t_\tau\} = \tau$. Set $k = \lfloor n\tau \rfloor$ and consider the empirical risk
$$\widehat{L}_k(g) = \frac{1}{k}\sum_{i=1}^{k} \mathbb{1}\{Y_{(i)} \neq g(\Theta(X_{(i)}))\} = L_{\widehat{P}_k}(g), \quad (9)$$
where $\widehat{P}_k$ denotes the empirical distribution of the truncated training sample $\{(X_i, Y_i) : \|X_i\| \geq \|X_{(k)}\|,\ i \in \{1, \ldots, n\}\}$, the statistical version of the conditional distribution $P_{t_\tau}$. We now investigate the performance, in terms of the asymptotic risk in the extremes $L_\infty$, of the solutions of the minimization problem
$$\min_{g \in \mathcal{G}_S} \widehat{L}_k(g). \quad (10)$$
The practical issue of designing efficient algorithms for solving (10) is beyond the scope of this paper. Focus is here on the study of the learning principle that consists in assigning to any very large input value $x$ the likeliest label based only on the direction $\Theta(x)$ it defines (the construction is summarized in Algorithm 1 below). The following result provides an upper bound for the excess classification error in the domain $t_\tau\Omega^c$ of solutions of (10). Its proof, which relies on a maximal deviation inequality tailored to low probability regions, is given in the Supplementary Material.

**Theorem 2** Suppose that the class $\mathcal{G}_S$ is of finite VC dimension $V_{\mathcal{G}_S} < +\infty$. Let $\hat{g}_k$ be any solution of (10) and recall that $k = \lfloor n\tau \rfloor$. Then, for $\delta \in (0, 1)$ and $n \geq 1$, we have with probability larger than $1 - \delta$:
$$L_{t_\tau}(\hat{g}_k) - L_{t_\tau}^* \leq \frac{1}{\sqrt{k}}\left[\sqrt{\frac{1}{2(1-\tau)}\log\frac{2}{\delta}} + C\sqrt{V_{\mathcal{G}_S}\log(1/\delta)}\left(5 + \sqrt{2\log(1/\delta)}\right) + C\sqrt{2\log(1/\delta)}\right] + \left\{\inf_{g \in \mathcal{G}_S} L_{t_\tau}(g) - L_{t_\tau}^*\right\},$$
where $C$ is a constant independent of $n$, $\tau$ and $\delta$.

**Remark 5** (ON MODEL SELECTION) Selecting an appropriate model class $\mathcal{G}_S$ is a crucial issue in machine learning. Following in the footsteps of structural risk minimization, one may use a VC bound for $\mathbb{E}[\sup_{g \in \mathcal{G}_S} |\widehat{L}_k(g) - \mathbb{E}[\widehat{L}_k(g)]|]$ as a complexity regularization term to penalize in an additive fashion the empirical risk (9). Considering a collection of such models, oracle inequalities guaranteeing the quasi-optimality of the rule minimizing the penalized empirical risk can then be classically established by means of a slight modification of the argument of Theorem 2's proof, see e.g. Chapter 18 in [5].

The upper bound stated above shows that the learning rate is of order $O_{\mathbb{P}}(1/\sqrt{k})$, where $k$ is the actual size of the training data set used to perform approximate empirical risk minimization in the extremes. As revealed by the corollary below, this strategy permits to build a consistent sequence of classifiers for the $L_\infty$-risk when the fraction $\tau = \tau_n$ decays at an appropriate rate (provided that the model bias can be neglected, of course).

**Corollary 1** Suppose that the assumptions of Theorems 1 and 2 are fulfilled. In addition, assume that the model bias asymptotically vanishes as $\tau \to 0$, i.e.
$$\inf_{g \in \mathcal{G}_S} L_{t_\tau}(g) - L_{t_\tau}^* \to 0 \quad \text{as } \tau \to 0.$$
Then, as soon as $k \to +\infty$ as $n \to \infty$, the sequence of classifiers $(\hat{g}_k)$ is consistent in the extremes, meaning that we have the convergence in probability $L_\infty(\hat{g}_k) \to L_\infty^*$ as $n \to \infty$.
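Before turning to the algorithmic summary, here is a minimal sketch of the criterion (9) minimized in (10) (not the authors' code; `g` is assumed to map an array of angular points to labels in $\{-1, +1\}$, and the $L_1$ norm is used as in the experiments of Section 4).

```python
import numpy as np

def empirical_risk_extremes(g, X, Y, k):
    """Empirical risk (9): misclassification rate of the angular classifier g
    over the k observations with largest L1 norm."""
    R = np.linalg.norm(X, ord=1, axis=1)     # radial components R(X_i)
    idx = np.argsort(R)[-k:]                 # indices of the k largest observations
    Theta = X[idx] / R[idx, None]            # angular components Theta(X_(i))
    return np.mean(g(Theta) != Y[idx])       # average of 1{Y_(i) != g(Theta(X_(i)))}
```

Solving (10) then amounts to minimizing `empirical_risk_extremes(g, X, Y, k)` over $g \in \mathcal{G}_S$.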
**Algorithm 1** (ERM in the extremes)

**Input** Training dataset $\mathcal{D}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, collection $\mathcal{G}_S$ of classifiers on the sphere, number $k \leq n$ of extreme observations used for training.

1. **Standardization.** Standardize the input vectors by applying the rank transformation: for $i \in \{1, \ldots, n\}$, $\hat{V}_i = \hat{T}(X_i)$, where $\hat{T}(x) = \bigl(1/(1 - \hat{F}_j(x_j))\bigr)_{j=1,\ldots,d}$ for all $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, with $\hat{F}_j$ the empirical c.d.f. of the $j$-th margin.
2. **Truncation.** Sort the transformed training observations by decreasing order of magnitude, $\|\hat{V}_{(1)}\| \geq \ldots \geq \|\hat{V}_{(n)}\|$, and consider the set of extreme training points $\{(\hat{V}_{(1)}, Y_{(1)}), \ldots, (\hat{V}_{(k)}, Y_{(k)})\}$.
3. **Optimization.** Compute a solution $\hat{g}_k(\theta)$ of the minimization problem
$$\min_{g \in \mathcal{G}_S} \frac{1}{k}\sum_{i=1}^{k} \mathbb{1}\bigl\{Y_{(i)} \neq g\bigl(\Theta(\hat{V}_{(i)})\bigr)\bigr\}.$$

**Output** The classifier $\hat{g}_k\bigl(\Theta(\hat{T}(x))\bigr)$, applicable on the region $\{x : \|\hat{T}(x)\| > \|\hat{V}_{(k)}\|\}$.

**Remark 6** (CHOICE OF $k$) Determining the best value of $k$ is a typical challenge of extreme value analysis and amounts to a bias/variance trade-off: too large values introduce a bias by taking into account observations which are not large enough, so that their distribution deviates significantly from the limit distribution of extremes, while too small values obviously increase the variance of the classifier. See e.g. [6] or [7] and the references therein for a discussion. In practice a possible default choice is $k = \sqrt{n}$; otherwise cross-validation can be performed.

## 4 Illustrative Numerical Experiments

The purpose of our experiments is to provide insights into the performance on extreme regions of the classifier $\hat{g}_k$ constructed via Algorithm 1. The training set is standardized and ordered as in Steps 1 and 2 of Algorithm 1. For a chosen $k$, let $t = \|\hat{T}(X^{\mathrm{train}}_{(k)})\|$; the $L_1$ norm is used throughout our experiments. The extreme test set $\mathcal{T}$ is the subset of test points such that $\|\hat{T}(X^{\mathrm{test}}_i)\| > t$. To approximate the asymptotic risk in the extremes $L_\infty(\hat{g}_k)$ and illustrate the generalization ability of the proposed classifier in the extreme region, we consider decreasing subsets of $\mathcal{T}$. Namely, denoting $n_{\mathrm{test}} = |\mathcal{T}|$, we keep only the $\lceil\kappa\, n_{\mathrm{test}}\rceil$ largest instances of $\mathcal{T}$ in terms of $\|\hat{T}(X^{\mathrm{test}}_i)\|$, for decreasing values of $\kappa \in (0, 1]$. This experimental framework is summarized in Figure 1, where $\lambda_t = \|\hat{T}(X^{\mathrm{test}}_{(\lceil\kappa n_{\mathrm{test}}\rceil)})\| \geq t$.

We consider two different classification algorithms for Step 3 in Algorithm 1, namely random forest (RF) and $k$-nearest neighbors ($k$-NN), which correspond to two different classes $\mathcal{G}_S$ of classifiers. For each class $\mathcal{G}_S$, the performance of $\hat{g}_k$ (which considers only the direction $\Theta(\hat{T}(x))$ of both training and testing data, in other words classifies the datasets projected onto the unit sphere, see Figure 2) is compared with that of the classical version of the algorithm (RF or $k$-NN) taking as input the same training data, but without the standardization and truncation steps, nor the projection onto the unit sphere. Figures 4 and 5 summarize the results obtained using RF, respectively with a multivariate simulated dataset and with a real-world dataset. The simulated dataset is generated from a logistic distribution as described in [13]; the positive and negative instances are generated using two different dependence parameters. An example of a dataset thus obtained is displayed in Figure 3.

Figure 1: Train set (dotted area) and test set (colored area).

Figure 2: Colored cones correspond to a given label from the classifier on the simplex.

We report the results obtained with $5 \cdot 10^3$ points per label for the training set and $5 \cdot 10^4$ points per label for the test set, with $k = 100$ and $\kappa$ ranging from $1$ down to $0.3$; a sketch of this comparison protocol is given below.
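The following sketch mirrors the protocol above with scikit-learn's random forest. It is an illustration only: the Pareto-distributed toy data and its labelling rule are placeholders for the logistic datasets of [13], the rank transform below is one simple way to implement $\hat{T}$, and `RandomForestClassifier` stands in for the class $\mathcal{G}_S$ (200 trees, as in the setting reported next).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_rank_transform(X_train):
    """Return T_hat: x -> (1 / (1 - F_hat_j(x_j)))_j, with F_hat_j the empirical
    marginal c.d.f.s of the training sample (Step 1 of Algorithm 1)."""
    n = X_train.shape[0]
    sorted_cols = np.sort(X_train, axis=0)
    def transform(X):
        ranks = np.stack([np.searchsorted(sorted_cols[:, j], X[:, j], side="right")
                          for j in range(X.shape[1])], axis=1)
        return 1.0 / (1.0 - ranks / (n + 1.0))   # n + 1 keeps the transform finite
    return transform

def angles(V):
    """Angular components Theta(v) = v / ||v||_1 (L1 norm, as in the experiments)."""
    return V / np.linalg.norm(V, ord=1, axis=1, keepdims=True)

# Toy heavy-tailed data standing in for the simulated logistic datasets (hypothetical).
rng = np.random.default_rng(0)
n, d, k = 10_000, 2, 100
X = rng.pareto(1.0, size=(2 * n, d)) + 1.0
Y = np.where(X[:, 0] > X[:, 1], 1, -1)           # label driven by the direction of X
Y = np.where(rng.random(2 * n) < 0.9, Y, -Y)     # 10% label noise
X_tr, Y_tr, X_te, Y_te = X[:n], Y[:n], X[n:], Y[n:]

T_hat = fit_rank_transform(X_tr)
V_tr, V_te = T_hat(X_tr), T_hat(X_te)
norms_tr = np.linalg.norm(V_tr, ord=1, axis=1)
norms_te = np.linalg.norm(V_te, ord=1, axis=1)
extreme_tr = np.argsort(norms_tr)[-k:]           # Step 2: k largest training points
t = norms_tr[extreme_tr].min()                   # threshold ||V_(k)||

rf_simplex = RandomForestClassifier(n_estimators=200, random_state=0)
rf_simplex.fit(angles(V_tr[extreme_tr]), Y_tr[extreme_tr])   # Step 3 on the simplex
rf_regular = RandomForestClassifier(n_estimators=200, random_state=0)
rf_regular.fit(X_tr, Y_tr)                                   # baseline: regular RF

# Extreme test set T and its kappa * |T| largest points, as in Figure 1.
T_idx = np.where(norms_te > t)[0]
T_idx = T_idx[np.argsort(norms_te[T_idx])[::-1]]
for kappa in (1.0, 0.7, 0.5, 0.3):
    idx = T_idx[: max(1, int(kappa * len(T_idx)))]
    err_simplex = np.mean(rf_simplex.predict(angles(V_te[idx])) != Y_te[idx])
    err_regular = np.mean(rf_regular.predict(X_te[idx]) != Y_te[idx])
    print(f"kappa={kappa:.1f}  RF on the simplex: {err_simplex:.3f}  regular RF: {err_regular:.3f}")
```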
The number of trees for both random forests (in the regular setting and in the setting of Algorithm 1) is set to 200. The number of neighbors for both $k$-NNs is set to 5.

Figure 3: Toy dataset generated from a multivariate logistic distribution, projected on $\mathbb{R}^2$.

The real dataset, known as the Ecoli dataset and introduced in [9], deals with protein localization and contains 336 instances and 8 features. The Supplementary Material gathers additional details concerning the datasets and the tuning of RF and $k$-NN in our experiments, as well as additional results obtained with the above described datasets and with a simulated dataset from a different distribution.

Figure 4: Logistic data - test loss of RF on the simplex and regular RF depending on the multiplicative factor $\kappa$.

Figure 5: Real data - test loss of RF on the simplex and regular RF depending on the multiplicative factor $\kappa$.

Figure 4 shows the evolution of the Hamming losses for decreasing values of $\kappa \in [0.3, 1]$. The boxplots display the losses obtained with 10 independently simulated datasets. For the experiment on the Ecoli dataset (Figure 5), one third of the dataset is used as a test set and the rest as the training set, with $k = 100$ and $\kappa \in [0.3, 1]$ (considering smaller values of $\kappa$ was prevented by data scarcity). The boxplots display the results for different (random) partitions of the data into a training and a test set. In both examples, the loss of the regular classifier is worse (and even increases) as $\kappa$ decreases, whereas the classifier resulting from the proposed approach performs better and has better extrapolation ability.

## 5 Conclusion

In various applications (e.g. safety/security, finance, insurance, environmental sciences), it is of prime importance to predict the response $Y$ of a system when it is impacted by shocks, corresponding to extremely large input values $X$. In this paper, we have developed a rigorous probabilistic framework for binary classification in extreme regions, relying on the (non-parametric) theory of regularly varying random vectors, and proved the accuracy of the ERM approach in this context, when the risk functional is computed from extreme observations only. The present contribution may open a new line of research, insofar as progress can naturally be expected in the design of algorithmic learning methods tailored to extreme points (or their projection onto the unit sphere), while statistical issues such as estimation of the minimum risk in the extremes, $L_\infty^*$, remain to be addressed.

## References

[1] C. Brownlees, E. Joly, and G. Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
[2] J.J. Cai, J.H.J. Einmahl, and L. de Haan. Estimation of extreme risk regions under multivariate regular variation. The Annals of Statistics, pages 1803–1826, 2011.
[3] A. Carpentier and M. Valko. Extreme bandits. In Advances in Neural Information Processing Systems 27, pages 1089–1097. Curran Associates, Inc., 2014.
[4] L. de Haan and S. Resnick. On regular variation of probability densities. Stochastic Processes and their Applications, 25:83–93, 1987.
[5] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Applications of Mathematics: Stochastic Modelling and Applied Probability. Springer, 1996.
[6] N. Goix, A. Sabourin, and S. Clémençon. Sparse representation of multivariate extremes with applications to anomaly ranking. In Artificial Intelligence and Statistics, pages 75–83, 2016.
[7] N. Goix, A. Sabourin, and S. Clémençon. Sparse representation of multivariate extremes with applications to anomaly detection. Journal of Multivariate Analysis, 161:12–31, 2017.
[8] S. Mendelson. Learning without concentration for general loss functions. Probability Theory and Related Fields, 171(1):459–502, 2018.
[9] K. Nakai and M. Kanehisa. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14(4):897–911, 1992.
[10] M.I. Ohannessian and M.A. Dahleh. Rare probability estimation under regularly varying heavy tails. In Conference on Learning Theory, pages 21-1, 2012.
[11] S. Resnick. Extreme Values, Regular Variation, and Point Processes. Springer Series in Operations Research and Financial Engineering. Springer, 1987.
[12] T. Roos, P. Grünwald, P. Myllymäki, and H. Tirri. Generalization to unseen cases. In Y. Weiss, B. Schölkopf, and J.C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1129–1136. MIT Press, 2006.
[13] A. Stephenson. Simulating multivariate extreme value distributions of logistic type. Extremes, 6(1):49–59, 2003.