# On Medians of (Randomized) Pairwise Means

Pierre Laforgue¹, Stephan Clémençon¹, Patrice Bertail²

¹LTCI, Télécom Paris, Institut Polytechnique de Paris. ²Modal'X, UPL, Université Paris-Nanterre. Correspondence to: Pierre Laforgue.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Tournament procedures, recently introduced in Lugosi & Mendelson (2016), offer an appealing alternative, from a theoretical perspective at least, to the principle of Empirical Risk Minimization in machine learning. Statistical learning by Median-of-Means (MoM) basically consists in segmenting the training data into blocks of equal size and comparing the statistical performance of every pair of candidate decision rules on each data block: that with highest performance on the majority of the blocks is declared the winner. In the context of nonparametric regression, functions having won all their duels have been shown to outperform empirical risk minimizers w.r.t. the mean squared error under minimal assumptions, while exhibiting robustness properties. It is the purpose of this paper to extend this approach, in order to address other learning problems in particular, for which the performance criterion takes the form of an expectation over pairs of observations rather than over one single observation, as may be the case in pairwise ranking, clustering or metric learning. Precisely, it is proved here that the bounds achieved by MoM are essentially conserved when the blocks are built by means of independent sampling without replacement schemes instead of a simple segmentation. These results are next extended to situations where the risk is related to a pairwise loss function and its empirical counterpart is of the form of a U-statistic. Beyond theoretical results guaranteeing the performance of the learning/estimation methods proposed, some numerical experiments provide empirical evidence of their relevance in practice.

1. Introduction

In Lugosi & Mendelson (2016), the concept of tournament procedure for statistical learning has been introduced and analyzed in the context of nonparametric regression, one of the flagship problems of machine learning. The task is to predict a real-valued random variable (r.v.) $Y$ based on the observation of a random vector $X$ with marginal distribution $\mu(dx)$, taking its values in $\mathbb{R}^d$ with $d \ge 1$, say by means of a regression function $f : \mathbb{R}^d \to \mathbb{R}$ with minimum expected quadratic risk $R(f) = \mathbb{E}[(Y - f(X))^2]$. Statistical learning usually relies on a training dataset $S_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ formed of independent copies of the generic pair $(X, Y)$. Following the Empirical Risk Minimization (ERM) paradigm, one is encouraged to build predictive rules by minimizing an empirical version of the risk $\hat{R}_n(f) = (1/n) \sum_{i=1}^n (Y_i - f(X_i))^2$ over a class $\mathcal{F}$ of regression function candidates of controlled complexity (e.g. of finite VC dimension), while being rich enough to contain a reasonable approximant of the optimal regression function $f^*(x) = \mathbb{E}[Y \mid X = x]$: for any $f \in \mathcal{F}$, the excess risk $R(f) - R(f^*)$ is then equal to $\|f - f^*\|^2_{L_2(\mu)} = \mathbb{E}[(f(X) - f^*(X))^2]$. A completely different learning strategy, recently proposed in Lugosi & Mendelson (2016), consists in implementing a tournament procedure based on the Median-of-Means (MoM) method (see Nemirovsky & Yudin (1983)). Precisely, the full dataset is first divided into 3 subsamples of equal size.
For every pair of candidate functions $(f_1, f_2) \in \mathcal{F}^2$, the first step consists in computing the MoM estimator of the quantity $\|f_1 - f_2\|_{L_1(\mu)} := \mathbb{E}[|f_1(X) - f_2(X)|]$ based on the first subsample: the latter being segmented into $K \ge 1$ subsets of (approximately) equal size, $\|f_1 - f_2\|_{L_1(\mu)}$ is estimated by the median of the collection of estimators formed by its empirical versions computed from each of the $K$ sub-datasets. When the MoM estimate is large enough, the match between $f_1$ and $f_2$ is allowed. The rationale behind this approach is as follows: if one of the candidates, say $f_2$, is equal to $f^*$, and the quantity $\|f_1 - f^*\|_{L_1(\mu)}$ (which is less than $\|f_1 - f^*\|_{L_2(\mu)} = \sqrt{R(f_1) - R(f^*)}$) is large, so is its (robust) MoM estimate (much less sensitive to atypical values than sample averages) with high probability. Therefore, $f^*$ is compared to distant candidates only, against which it should hopefully win its matches. The second step consists in computing the MoM estimator of $R(f_1) - R(f_2)$ based on the second subsample, for every pair of sufficiently distant candidates $f_1$ and $f_2$. If a candidate wins all its matches, it is kept for the third round. As said before, $f^*$ should be part of this final pool, denoted by $H$. Finally, matches involving all pairs of candidates in $H$ are computed, using a third MoM estimate on the third part of the data. A champion winning again all its matches is either $f^*$ or has a small enough excess risk anyway.

It is the purpose of the present article to extend the MoM-based statistical learning methodology. Firstly, we investigate the impact of randomization in the MoM technique: by randomization, it is meant that data subsets are built through sampling schemes, say simple random sampling without replacement (SRSWoR in abbreviated form) for simplicity, rather than partitioning. Though introducing more variability in the procedure, we provide theoretical and empirical evidence that the attractive properties of the original MoM method are essentially preserved by this more flexible variant (in particular, the number of blocks involved in this alternative procedure is arbitrary). Secondly, we consider the application of the tournament approach to other statistical learning problems, namely those involving pairwise loss functions, like popular formulations of ranking, clustering or metric learning. In this setup, natural statistical versions of the risk of low variance take the form of U-statistics (of degree two), i.e. averages over all pairs of observations, see e.g. Clémençon et al. (2008). In this situation, we propose to estimate the risk by the median of U-statistics computed from blocks obtained through data partitioning or sampling. Results showing the accuracy of this strategy, referred to here as Median of (Randomized) Pairwise Means, are established, and the application of this estimation technique to pairwise learning is next investigated from a theoretical perspective, yielding generalization bounds. The relevance of this approach is also supported by convincing illustrative numerical experiments.

The rest of the paper is organized as follows. Section 2 briefly recalls the main ideas underlying the MoM procedure, its applications to robust machine learning, as well as basic concepts pertaining to the theory of U-statistics/processes. In Section 3, the variants of the MoM approach we propose are described at length and theoretical results establishing their statistical performance are stated.
Illustrative numerical experiments are displayed in Section 4, while proofs are deferred to the Appendix section. Some technical details and additional experimental results are postponed to the Supplementary Material (SM).

2. Background - Preliminaries

As a first go, we briefly describe the main ideas underlying the tournament procedure for robust machine learning, and next recall basic notions of the theory of U-statistics, as well as crucial results related to their efficient approximation. Here and throughout, the indicator function of any event $\mathcal{E}$ is denoted by $\mathbb{I}\{\mathcal{E}\}$, the variance of any square integrable r.v. $Z$ by $\mathrm{Var}(Z)$, and the cardinality of any finite set $A$ by $\#A$. If $(a_1, \ldots, a_n) \in \mathbb{R}^n$, the median (sometimes abbreviated med) of $a_1, \ldots, a_n$ is defined as $a_{\sigma((n+1)/2)}$ when $n$ is odd and $a_{\sigma(n/2)}$ otherwise, $\sigma$ denoting a permutation of $\{1, \ldots, n\}$ such that $a_{\sigma(1)} \le \ldots \le a_{\sigma(n)}$. The floor and ceiling functions are denoted by $u \mapsto \lfloor u \rfloor$ and $u \mapsto \lceil u \rceil$.

2.1. Medians of Means based Statistical Learning

First introduced independently by Nemirovsky & Yudin (1983), Jerrum et al. (1986), and Alon et al. (1999), the Median-of-Means (MoM) is a mean estimator dedicated to real random variables. It is now receiving a great deal of attention in the statistical learning literature, following in the footsteps of the results established in Audibert & Catoni (2011) and Catoni (2012), where mean estimators are studied through the angle of their deviation probabilities, rather than their traditional mean squared errors, for robustness purposes. Indeed, Devroye et al. (2016) showed that MoM provides an optimal $\delta$-dependent sub-Gaussian mean estimator, under the sole assumption that a second order moment exists. The MoM estimator has later been extended to random vectors, through different generalizations of the median (Minsker et al., 2015; Hsu & Sabato, 2016; Lugosi & Mendelson, 2017). In Bubeck et al. (2013), it is used to design robust bandit strategies, while Lerasle & Oliveira (2011) and Brownlees et al. (2015) advocate minimizing a MoM, respectively Catoni, estimate of the risk, rather than performing ERM, to tackle different learning tasks. More recently, Lugosi & Mendelson (2016) introduced a tournament strategy based on the MoM approach.

The MoM estimator. Let $S_n = \{Z_1, \ldots, Z_n\}$ be a sample composed of $n \ge 1$ independent realizations of a square integrable real-valued r.v. $Z$, with expectation $\theta$ and finite variance $\mathrm{Var}(Z) = \sigma^2$. Dividing $S_n$ into $K$ disjoint blocks, each with same cardinality $B = \lfloor n/K \rfloor$, $\bar{\theta}_k$ denotes the empirical mean based on the data lying in block $k$, for $k \le K$. The MoM estimator $\hat{\theta}_{\mathrm{MoM}}$ of $\theta$ is then given by $\hat{\theta}_{\mathrm{MoM}} = \mathrm{median}(\bar{\theta}_1, \ldots, \bar{\theta}_K)$. It offers an appealing alternative to the sample mean $\hat{\theta}_n = (1/n) \sum_{i=1}^n Z_i$, being much more robust, i.e. less sensitive to the presence of atypical values in the sample. Exponential concentration inequalities for the MoM estimator can be established in heavy-tail situations, under the sole assumption that the $Z_i$'s are square integrable. For any $\delta \in [e^{1-n/2}, 1[$, choosing $K = \lceil \log(1/\delta) \rceil$ and $B = \lfloor n/K \rfloor$, we have (see e.g. Devroye et al. (2016), Lugosi & Mendelson (2016)):

\[
\mathbb{P}\left\{ \left| \hat{\theta}_{\mathrm{MoM}} - \theta \right| > 2\sqrt{2}\,\sigma \sqrt{\frac{1 + \log(1/\delta)}{n}} \right\} \le \delta. \tag{1}
\]
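For readers who prefer code to notation, the block construction just described can be sketched in a few lines of Python. The function below is our own illustration (not code from the paper); it partitions a shuffled sample into $K$ blocks of size $\lfloor n/K \rfloor$ and returns the median of the block means, with $K$ set to $\lceil \log(1/\delta) \rceil$ as in (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def median_of_means(z, K):
    """Median-of-Means: median of the empirical means over K disjoint blocks."""
    z = np.asarray(z, dtype=float)
    B = len(z) // K                        # common block size B = floor(n / K)
    perm = rng.permutation(len(z))         # make the partition independent of data order
    block_means = [z[perm[k * B:(k + 1) * B]].mean() for k in range(K)]
    return float(np.median(block_means))

# Heavy-tailed example: Pareto(3) sample (true mean 3/2), with K = ceil(log(1/delta)).
delta = 0.001
z = rng.pareto(3.0, size=1000) + 1.0
K = int(np.ceil(np.log(1.0 / delta)))
print(median_of_means(z, K), z.mean())
```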
The tournament procedure. Placing ourselves in the distribution-free regression framework recalled in the Introduction section, it has been shown that, under appropriate complexity conditions and in a possibly non sub-Gaussian setup, the tournament procedure outputs a candidate $\hat{f}$ with optimal accuracy/confidence trade-off, thus outperforming ERM in heavy-tail situations. Namely, there exist $c$, $c_0$, and $r > 0$ such that, with probability at least $1 - \exp(-c_0 n \min\{1, r^2\})$, it holds both at the same time (see Theorem 2.10 in Lugosi & Mendelson (2016)):

\[
\|\hat{f} - f^*\|_{L_2} \le c r, \quad \text{and} \quad R(\hat{f}) - R(f^*) \le (c r)^2.
\]

2.2. Pairwise Means and U-Statistics

Rather than the mean of an integrable r.v., suppose now that the quantity of interest is of the form $\theta(h) = \mathbb{E}[h(X_1, X_2)]$, where $X_1$ and $X_2$ are i.i.d. random vectors, taking their values in some measurable space $\mathcal{X}$ with distribution $F(dx)$, and $h : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a measurable mapping, square integrable w.r.t. $F \otimes F$. For simplicity, we assume that $h(x_1, x_2)$ is symmetric (i.e. $h(x_1, x_2) = h(x_2, x_1)$ for all $(x_1, x_2) \in \mathcal{X}^2$). A natural estimator of the parameter $\theta(h)$ based on an i.i.d. sample $S_n = \{X_1, \ldots, X_n\}$ drawn from $F$ is the average over all pairs

\[
U_n(h) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} h(X_i, X_j), \tag{2}
\]

known as the U-statistic of degree two with kernel $h$ (see Hoeffding (1948) and Lee (1990)). Setting $h_1(x) = \mathbb{E}[h(x, X_2)]$, $\sigma_1^2(h) = \mathrm{Var}(h_1(X_1))$ and $\sigma_2^2(h) = \mathrm{Var}(h(X_1, X_2))$, its variance is given by

\[
\mathrm{Var}(U_n(h)) = \frac{4(n-2)}{n(n-1)}\, \sigma_1^2(h) + \frac{2}{n(n-1)}\, \sigma_2^2(h) \;\le\; \frac{4 \sigma_1^2(h)}{n} + \frac{2 \sigma_2^2(h)}{n(n-1)}. \tag{3}
\]

Such pairwise statistics arise naturally in ranking, clustering (Clémençon, 2014) and metric learning (Bellet et al., 2013; Vogel et al., 2018). Consider for instance the pairwise ranking problem: given two independent copies $(X, Y)$ and $(X', Y')$ of a random pair, the goal is to predict whether $Y > Y'$ and, in this case, one would ideally have $r(X, X') = +1$, the rule $r$ being supposed anti-symmetric (i.e. $r(x, x') = -r(x', x)$ for all $(x, x') \in \mathcal{X}^2$). This can be formulated as the problem of minimizing the U-statistic known as the empirical ranking risk (see Clémençon et al. (2005)) for a given loss function $\ell : \mathbb{R} \to \mathbb{R}_+$:

\[
\hat{L}_n(r) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \ell\big( r(X_i, X_j) \cdot (Y_i - Y_j) \big).
\]

3.1. Medians of Randomized Means

Suppose again that the quantity of interest is the expectation $\theta$ of a square integrable r.v. $Z$, observed through the sample $S_n = \{Z_1, \ldots, Z_n\}$. Instead of partitioning $S_n$, we propose to build $K \ge 1$ data blocks $\mathcal{B}_{\epsilon_1}, \ldots, \mathcal{B}_{\epsilon_K}$ of size $B \le n$ by means of independent SRSWoR schemes: each block is described by a random vector $\epsilon_k = (\epsilon_k(1), \ldots, \epsilon_k(n)) \in \{0, 1\}^n$ with $\sum_{i \le n} \epsilon_k(i) = B$, drawn uniformly at random among the $\binom{n}{B}$ such vectors, independently of $S_n$ and of the other blocks. Denoting by $\bar{\theta}_k = (1/B) \sum_{i \le n} \epsilon_k(i) Z_i$ the empirical mean computed from the $k$-th block, the Median of Randomized Means (MoRM) estimator of $\theta$ is defined as

\[
\tilde{\theta}_{\mathrm{MoRM}} = \mathrm{median}\big( \bar{\theta}_1, \ldots, \bar{\theta}_K \big). \tag{4}
\]

In contrast to the MoM procedure, the number $K$ of blocks and their size $B$ can here be chosen independently of each other. The analysis relies on the observation that, conditioned upon $S_n$, the block means are i.i.d., and on the conditional probabilities $\mathbb{P}\{|\bar{\theta}_1 - \theta| > \varepsilon \mid S_n\}$, seen as U-statistics of degree $B$. Refer to the Appendix for details.

Proposition 1. Suppose that $Z_1, \ldots, Z_n$ are independent copies of a square integrable r.v. $Z$ with mean $\theta$ and variance $\sigma^2$. Then, for any $\tau \in\, ]0, 1/2[$ and any $\delta \in [2 e^{-8\tau^2 n/9}, 1[$, choosing $K = \lceil \log(2/\delta) / (2(1/2 - \tau)^2) \rceil$ and $B = \lfloor 8 \tau^2 n / (9 \log(2/\delta)) \rfloor$, we have:

\[
\mathbb{P}\left\{ \left| \tilde{\theta}_{\mathrm{MoRM}} - \theta \right| > \frac{3\sqrt{3}\,\sigma}{2\,\tau^{3/2}} \sqrt{\frac{\log(2/\delta)}{n}} \right\} \le \delta. \tag{5}
\]

The bound stated above presents three main differences with (1). Recall first that the number $K$ of randomized blocks is completely arbitrary in the MoRM procedure and may even exceed $n$. Consequently, it is always possible to build $\lceil \log(2/\delta)/(2(1/2 - \tau)^2) \rceil$ blocks, and there is no restriction on the range of admissible confidence levels $\delta$ due to $K$. Second, the size $B$ of the blocks can also be chosen completely arbitrarily in $\{1, \ldots, n\}$, and independently from $K$. Proposition 1 exhibits their respective dependence with respect to $\delta$ and $n$. Still, $B$ needs to be greater than 1, which results in a restriction on the admissible $\delta$'s, as specified. Observe finally that $B$ never exceeds $n$: indeed, for all $\tau \in\, ]0, 1/2[$, $8\tau^2/(9 \log(2/\delta))$ does not exceed 1 as long as $\delta$ is lower than $2 \exp(-2/9) \approx 1.6$, which is always true. Third, the proposed bound involves an additional parameter $\tau$, which can be arbitrarily chosen in $]0, 1/2[$. As revealed by examination of the proof, the choice of this extra parameter reflects a trade-off between the order of magnitude of $K$ and that of $B$: the larger $\tau$, the larger $K$, the larger the confidence range, and the lower $B$ and the lower the constant in (5) as well. Since one can pick $K$ arbitrarily large, $\tau$ can be chosen as close to $1/2$ as desired. This way, one asymptotically achieves a $3\sqrt{6}$ constant factor, which is the same as that obtained in Hsu & Sabato (2016) for a comparable confidence range. However, the price of such an improvement is the construction of a higher number of blocks in practice (for a comparable number of blocks, the constant in (5) becomes $27\sqrt{2}$).
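The randomized variant only changes how the blocks are built. The sketch below is again our own illustration, with hypothetical function names: it draws $K$ blocks of size $B$ by independent sampling without replacement and takes the median of the block means, and the helper computes $K$ and $B$ from $\delta$ and $\tau$ as prescribed by Proposition 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def median_of_randomized_means(z, K, B):
    """MoRM: median of means over K blocks of size B drawn by independent SRSWoR."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    block_means = [z[rng.choice(n, size=B, replace=False)].mean() for _ in range(K)]
    return float(np.median(block_means))

def morm_parameters(n, delta, tau):
    """K and B suggested by Proposition 1 for a target confidence level delta."""
    K = int(np.ceil(np.log(2.0 / delta) / (2.0 * (0.5 - tau) ** 2)))
    B = max(1, int(8.0 * tau ** 2 * n / (9.0 * np.log(2.0 / delta))))
    return K, B

z = rng.standard_t(3, size=1000)          # Student(3): heavy tails, true mean 0
K, B = morm_parameters(len(z), delta=0.001, tau=9 / 20)
print(K, B, median_of_randomized_means(z, K, B))
```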
Remark 1. (ALTERNATIVE SAMPLING SCHEMES) We point out that procedures other than the SRSWoR scheme above (e.g. Poisson/Bernoulli/Monte-Carlo sampling) can be considered to build blocks and estimates of the parameter $\theta$. However, as discussed in the SM, the theoretical analysis of such variants is much more challenging, due to possible replications of the same original observation within a block.

Remark 2. (EXTENSION TO RANDOM VECTORS) Among approaches extending MoM to random vectors, that of Minsker et al. (2015) could be readily adapted to MoRM. Indeed, once Lemma 2.1 therein has been applied, the sum of indicators can be bounded exactly as in Proposition 1's proof. Computationally, MoRM only differs from MoM in the sampling, adding no difficulty, while multivariate medians can be computed efficiently (Hopkins, 2018).

Remark 3. (RANDOMIZATION MOTIVATION) Theoretically, randomization being a natural alternative to data segmentation, it appeared interesting to study its impact on MoM. On the practical side, when performing a MoM Gradient Descent (GD), it is often necessary to shuffle the blocks at each step (see e.g. Remark 5 in Lecué et al. (2018)). While this shuffling may seem artificial and ad hoc in a MoM GD, it is already included and controlled within MoRM. Finally, extending MoU to incomplete U-statistics as in subsection 3.4 first requires a study of the randomization of MoM.

3.2. Medians of (Randomized) U-statistics

We now consider the situation described in subsection 2.2, where the parameter of interest is $\theta(h) = \mathbb{E}[h(X_1, X_2)]$, and investigate the performance of two possible approaches for extending the MoM methodology.

Medians of U-statistics. The most straightforward way of extending the MoM approach is undoubtedly to form complete U-statistics based on $K$ subsamples corresponding to sets of indexes $I_1, \ldots, I_K$ of size $B = \lfloor n/K \rfloor$ built by segmenting the full sample, as originally proposed: for $k \in \{1, \ldots, K\}$,

\[
\hat{U}_k(h) = \frac{2}{B(B-1)} \sum_{(i, j) \in I_k^2,\; i < j} h(X_i, X_j),
\]

and the Median of U-statistics (MoU) estimator of $\theta(h)$ is $\hat{\theta}_{\mathrm{MoU}}(h) = \mathrm{median}(\hat{U}_1(h), \ldots, \hat{U}_K(h))$. Alternatively, the $K$ blocks (of arbitrary size $B$) may be drawn by means of independent SRSWoR schemes, as in subsection 3.1; denoting by $\bar{U}_1(h), \ldots, \bar{U}_K(h)$ the complete U-statistics computed from these randomized blocks, the Median of Randomized U-statistics (MoRU) estimator is $\tilde{\theta}_{\mathrm{MoRU}}(h) = \mathrm{median}(\bar{U}_1(h), \ldots, \bar{U}_K(h))$. Propositions 2 and 3, proved in the Technical Proofs section, establish exponential concentration bounds for these two estimators: for $\hat{\theta}_{\mathrm{MoU}}(h)$, with $K = \lceil (9/2) \log(1/\delta) \rceil$ blocks obtained by partitioning, the deviations are of order $\sqrt{\sigma_1^2(h) \log(1/\delta)/n + \sigma_2^2(h) \log^2(1/\delta)/n^2}$ with explicit constants, while $\tilde{\theta}_{\mathrm{MoRU}}(h)$ satisfies a bound of the same order with $K$ and $B$ chosen as in Proposition 1.

3.3. Application to Pairwise Learning

Consider now a generic pairwise learning problem, where one seeks a rule $f$ in a class $\mathcal{F}$ minimizing a risk of the form $R(f) = \mathbb{E}[\ell(f, X, X')]$, for a nonnegative pairwise loss function $\ell$ and $X'$ an independent copy of $X$. Following Lugosi & Mendelson (2016), define for every $f \in \mathcal{F}$ the function $H_f(x, x') = \sqrt{\ell(f, x, x')}$ and set $\mathcal{H}_{\mathcal{F}} = \{H_f : f \in \mathcal{F}\}$.

Theorem 1. Let $q > 2$ and $L > 1$ be such that, for all $f \in \mathcal{F}$, $H_f \in \mathrm{span}(\mathcal{H}_{\mathcal{F}})$ and $\|H_f\|_{L_q} \le L \|H_f\|_{L_2}$. Let $r^*$ (properly defined in the SM due to space limitation) only depend on $f^*$, $L$, $q$, and the geometry of $\mathcal{H}_{\mathcal{F}}$ around $H_{f^*}$. Set $r \ge 2 r^*$. Then there exist $c_0, c > 0$, and a procedure based on $X_1, \ldots, X_{2n}$, $L$, and $r$ that selects a function $\hat{f} \in \mathcal{F}$ such that, with probability at least $1 - \exp(-c_0 n \min\{1, r^2\})$,

\[
R(\hat{f}) - R(f^*) \le c r.
\]

Proof. The proof is analogous to that of Theorem 2.11 in Lugosi & Mendelson (2016), and is sketched in the SM.

Remark 6. In pairwise learning, one seeks to minimize $\ell(f, X, X') = (\sqrt{\ell(f, X, X')} - 0)^2 = (H_f(X, X') - 0)^2$. We almost recover the setting of Lugosi & Mendelson (2016): quadratic loss, with $Y = 0$, for the decision function $H_f$. This is why any loss function $\ell$ can be considered, once the technicalities induced by U-statistics are tackled. The control obtained on $\|H_{\hat{f}} - H_{f^*}\|_{L_2}$ then translates into a control on the excess risk of $\hat{f}$ (see the SM for further details).

Remark 7. As discussed at length in Lugosi & Mendelson (2016), computing the tournament winner is a nontrivial problem. However, one could alternatively consider performing a tournament on an $\epsilon$-coverage of $\mathcal{F}$, while controlling the approximation error of this coverage.
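The two estimators of subsection 3.2 can be sketched as follows, using the variance kernel $h(x, y) = (x - y)^2/2$ that also appears in the experiments of Section 4. The code is a minimal illustration of ours, not the authors' implementation; it brute-forces the complete U-statistic on each block.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def complete_u_stat(x, h):
    """Complete degree-two U-statistic: average of h over all pairs of the block."""
    return float(np.mean([h(a, b) for a, b in combinations(x, 2)]))

def median_of_u_stats(x, K, h, randomized=False, B=None):
    """MoU (partition into K blocks) or MoRU (K independent SRSWoR blocks of size B)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if randomized:
        blocks = [x[rng.choice(n, size=B, replace=False)] for _ in range(K)]
    else:
        B = n // K
        perm = rng.permutation(n)
        blocks = [x[perm[k * B:(k + 1) * B]] for k in range(K)]
    return float(np.median([complete_u_stat(b, h) for b in blocks]))

def h_var(a, b):
    """Variance kernel h(x, y) = (x - y)^2 / 2."""
    return 0.5 * (a - b) ** 2

x = rng.lognormal(0.0, 1.0, size=1000)
print(median_of_u_stats(x, K=9, h=h_var))                            # MoU, partition
print(median_of_u_stats(x, K=25, h=h_var, randomized=True, B=100))   # MoRU, SRSWoR
```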
3.4. Discussion - Further Extensions

The computation of the U-statistic (2) is expensive in the sense that it involves the summation of $O(n^2)$ terms. The concept of incomplete U-statistic, see Blom (1976), precisely permits to address this computational issue and to achieve a trade-off between scalability and variance reduction. In one of its simplest forms, it consists in selecting a subsample of size $M \ge 1$ by sampling with replacement in the set of all pairs of observations that can be formed from the original sample. Setting $\Lambda = \{(i, j) : 1 \le i < j \le n\}$, and denoting by $\{(i_1, j_1), \ldots, (i_M, j_M)\} \subset \Lambda$ the subsample drawn by Monte-Carlo, the incomplete version of the U-statistic (2) is:

\[
\tilde{U}_M(h) = \frac{1}{M} \sum_{m=1}^{M} h\big(X_{i_m}, X_{j_m}\big).
\]

$\tilde{U}_M(h)$ is directly an unbiased estimator of $\theta(h)$, with variance

\[
\mathrm{Var}\big(\tilde{U}_M(h)\big) = \left(1 - \frac{1}{M}\right) \mathrm{Var}(U_n(h)) + \frac{\sigma_2^2(h)}{M}.
\]

The difference between its variance and that of (2) vanishes as $M$ increases. In contrast, when $M \le \#\Lambda = n(n-1)/2$, the variance of a complete U-statistic based on a subsample of size $O(\sqrt{M})$, and thus on $O(M)$ pairs just like $\tilde{U}_M(h)$, is of order $O(1/\sqrt{M})$. Minimization of incomplete U-statistics has been investigated in Clémençon et al. (2016) from the perspective of scalable statistical learning. Hence, rather than first sampling observations and next forming pairs from data blocks in order to compute a collection of complete U-statistics, of which the median is subsequently taken, one could directly sample pairs of observations, compute estimates of drastically reduced variance, and output a Median of Incomplete U-statistics. However, one faces significant difficulties when trying to analyze such a variant theoretically, as explained in the SM.
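A minimal sketch of the incomplete U-statistic is given below (our own illustration). It draws $M$ pairs with replacement, here via index pairs $(i, j)$ with $j \ne i$, which is equivalent to uniform draws from $\Lambda$ since $h$ is symmetric, and averages the kernel over them.

```python
import numpy as np

rng = np.random.default_rng(0)

def incomplete_u_stat(x, h, M):
    """Incomplete U-statistic: average of h over M pairs drawn with replacement."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    i = rng.integers(0, n, size=M)
    j = rng.integers(0, n - 1, size=M)
    j = np.where(j >= i, j + 1, j)   # ensures j != i; fine for a symmetric kernel h
    return float(np.mean(h(x[i], x[j])))

def h_var(a, b):
    """Vectorized variance kernel h(x, y) = (x - y)^2 / 2."""
    return 0.5 * (a - b) ** 2

# Cost is O(M) kernel evaluations instead of O(n^2) for the complete U-statistic (2).
x = rng.standard_normal(5000)
print(incomplete_u_stat(x, h_var, M=5000))
```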
4. Numerical Experiments

Here we display numerical results supporting the relevance of the MoM variants analyzed in the paper. Additional experiments are presented in the SM for completeness.

MoRM experiments. Considering inference of the expectation of four specified distributions (Gaussian, Student, Log-normal and Pareto), based on a sample of size $n = 1000$, seven estimators are compared below: the standard MoM, and six MoRM estimators, related to different sampling schemes (SRSWoR, Monte-Carlo) or different values of the tuning parameter $\tau$. Results are obtained through 5000 replications of the estimation procedures. Beyond the quadratic risk, the accuracy of the estimators is assessed by means of deviation probabilities (see SM), i.e. empirical quantiles for a geometric grid of confidence levels $\delta$.

Table 1. Quadratic risks for the mean estimation, $\delta = 0.001$.

| | Normal(0, 1) | Student(3) | Log-normal(0, 1) | Pareto(3) |
|---|---|---|---|---|
| MoM | 0.00149 ± 0.00218 | 0.00410 ± 0.00584 | 0.00697 ± 0.00948 | 1.02036 ± 0.06115 |
| MoRM 1/6, SWoR | 0.01366 ± 0.01888 | 0.02947 ± 0.04452 | 0.06210 ± 0.07876 | 1.12256 ± 0.14970 |
| MoRM 3/10, SWoR | 0.00255 ± 0.00361 | 0.00602 ± 0.00868 | 0.01241 ± 0.01610 | 1.05458 ± 0.07041 |
| MoRM 9/20, SWoR | 0.00105 ± 0.00148 | 0.00264 ± 0.00372 | 0.00497 ± 0.00668 | 1.02802 ± 0.04903 |

Table 2. Quadratic risks for the variance estimation, $\delta = 0.001$.

| | Normal(0, 1) | Student(3) | Log-normal(0, 1) | Pareto(3) |
|---|---|---|---|---|
| MoU 1/2; 1/2 | 0.00409 ± 0.00579 | 1.72618 ± 28.3563 | 2.61283 ± 23.5001 | 1.35748 ± 36.7998 |
| MoU partition | 0.00324 ± 0.00448 | 0.38242 ± 0.31934 | 1.62258 ± 1.41839 | 0.09300 ± 0.05650 |
| MoRU SWoR | 0.00504 ± 0.00705 | 0.51202 ± 3.88291 | 2.01399 ± 4.85311 | 0.09703 ± 0.07116 |

As highlighted above, $\tau = 1/6$ leads to (approximately) the same number of blocks as in the MoM procedure. However, MoRM usually selects blocks of cardinality lower than $n/K$, so that the MoRM estimator with $\tau = 1/6$ uses fewer examples than MoM. Proposition 1 exhibits a higher constant for MoRM in that case, and this is confirmed empirically here. The choice $\tau = 3/10$ guarantees that the number of MoRM blocks multiplied by their cardinality is equal to $n$; this way, MoRM uses as many samples as MoM. Nevertheless, the increased variability leads to a slightly lower performance in this case. Finally, $\tau = 9/20$ is chosen to be closer to $1/2$, as suggested by (5). In this setting, the two constant factors are (almost) equal, and MoRM even empirically shows a systematic improvement over MoM. Note that the quantile curves should be decreasing; however, the estimators being $\delta$-dependent, different experiments are run for each value of $\delta$, and the rare small increases are due to this random effect.

Mo(R)U experiments. In these experiments assessing empirically the performance of the Mo(R)U methods, the parameter of interest is the variance (i.e. $h(x, y) = (x - y)^2/2$) of the four laws used above. Again, estimators are assessed through their quadratic risk and empirical quantiles. A metric learning application is also proposed in the SM.

5. Conclusion

In this paper, various extensions of the Median-of-Means methodology, on which tournament-based statistical learning techniques recently proposed in the literature to handle heavy-tailed situations rely, have been investigated at length. First, confidence bounds showing that accuracy can be fully preserved when data blocks are built through SRSWoR schemes rather than simple segmentation, giving more flexibility to the approach regarding the number of blocks and their size, are established. Second, the application of this methodology to the estimation of pairwise expectations (i.e. Medians of U-statistics) is studied in a valid theoretical framework, paving the way for the design of robust pairwise statistical learning techniques in clustering, ranking or similarity-learning tasks.

Technical Proofs

Proof of Proposition 1

Let $\varepsilon > 0$. Just like in the classic argument used to prove (1), observe that

\[
\mathbb{P}\left\{ \left| \tilde{\theta}_{\mathrm{MoRM}} - \theta \right| > \varepsilon \right\} \le \mathbb{P}\left\{ \sum_{k=1}^{K} I_\varepsilon(\mathcal{B}_{\epsilon_k}) \ge \frac{K}{2} \right\},
\]

where $I_\varepsilon(\mathcal{B}_{\epsilon_k}) = \mathbb{I}\{ |\bar{\theta}_k - \theta| > \varepsilon \}$ for $k = 1, \ldots, K$. In order to benefit from the conditional independence of the blocks given the original sample $S_n$, we first condition upon $S_n$ and consider the variability induced by the $\epsilon_k$'s only:

\[
\mathbb{P}\left\{ \left| \tilde{\theta}_{\mathrm{MoRM}} - \theta \right| > \varepsilon \;\Big|\; S_n \right\} \le \mathbb{P}\left\{ \frac{1}{K} \sum_{k=1}^{K} I_\varepsilon(\mathcal{B}_{\epsilon_k}) \ge \frac{1}{2} \;\Big|\; S_n \right\}.
\]

Now, the average $(1/K) \sum_{k=1}^{K} I_\varepsilon(\mathcal{B}_{\epsilon_k})$ can be viewed as an approximation of a U-statistic of degree $B$ (refer to Lee (1990)), its conditional expectation given $S_n$ being

\[
U^\varepsilon_n = \binom{n}{B}^{-1} \sum_{\epsilon \in \Lambda(n, B)} I_\varepsilon(\mathcal{B}_\epsilon),
\]

where $\Lambda(n, B)$ denotes the set of all $\binom{n}{B}$ vectors $\epsilon \in \{0, 1\}^n$ such that $\sum_{i \le n} \epsilon(i) = B$. Denoting by $p_\varepsilon = \mathbb{E}[U^\varepsilon_n] = \mathbb{P}\{ |\bar{\theta}_1 - \theta| > \varepsilon \}$ the common expectation of the $I_\varepsilon(\mathcal{B}_{\epsilon_k})$'s, we have, for all $\tau \in\, ]0, 1/2[$:

\[
\mathbb{P}\left\{ \left| \tilde{\theta}_{\mathrm{MoRM}} - \theta \right| > \varepsilon \right\} \le \mathbb{P}_{S_n}\left\{ U^\varepsilon_n - p_\varepsilon \ge \tau - p_\varepsilon \right\} + \mathbb{E}\left[ \mathbb{P}\left\{ \frac{1}{K} \sum_{k=1}^{K} I_\varepsilon(\mathcal{B}_{\epsilon_k}) - U^\varepsilon_n \ge \frac{1}{2} - \tau \;\Big|\; S_n \right\} \right]. \tag{7}
\]

By virtue of Hoeffding's inequality for i.i.d. averages (see Hoeffding (1963)) conditioned upon $S_n$, we have, for all $t > 0$:

\[
\mathbb{P}\left\{ \frac{1}{K} \sum_{k=1}^{K} I_\varepsilon(\mathcal{B}_{\epsilon_k}) - U^\varepsilon_n \ge t \;\Big|\; S_n \right\} \le \exp\left( -2 K t^2 \right). \tag{8}
\]

In addition, the version of Hoeffding's inequality for U-statistics (cf. Hoeffding (1963), see also Theorem A in Chapter 5 of Serfling (1980)) yields, for all $t > 0$:

\[
\mathbb{P}_{S_n}\left\{ U^\varepsilon_n - p_\varepsilon \ge t \right\} \le \exp\left( -2 \lfloor n/B \rfloor\, t^2 \right). \tag{9}
\]

One may also show (see SM A.2.1) that $p_\varepsilon \le \sigma^2 / (B \varepsilon^2)$. Combining this remark with equations (7), (8) and (9), the deviation probability of $\tilde{\theta}_{\mathrm{MoRM}}$ can be bounded by

\[
\exp\left( -2 \lfloor n/B \rfloor \left( \tau - \frac{\sigma^2}{B \varepsilon^2} \right)^2 \right) + \exp\left( -2 K \left( \frac{1}{2} - \tau \right)^2 \right).
\]

Choosing $K = \lceil \log(2/\delta)/(2(1/2 - \tau)^2) \rceil$ and $B = \lfloor 8 \tau^2 n / (9 \log(2/\delta)) \rfloor$ leads to the desired result.
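To spell out the last step, the display below is our own worked reconstruction; it assumes $\varepsilon$ is taken large enough that $p_\varepsilon \le \tau/3$ and neglects the integer part in $\lfloor n/B \rfloor$, and shows why the stated choices make each of the two exponential terms at most $\delta/2$.

```latex
% Both terms are at most delta/2 under the choices of Proposition 1
% (assuming p_eps <= tau/3 and neglecting the integer part in floor(n/B)).
\[
2\,\frac{n}{B}\,(\tau - p_\varepsilon)^2
  \;\ge\; 2\,\frac{n}{B}\cdot\frac{4\tau^2}{9}
  \;=\; \frac{8\tau^2 n}{9B}
  \;\ge\; \log(2/\delta)
  \qquad \text{since } B \le \frac{8\tau^2 n}{9\log(2/\delta)},
\]
\[
2K\Big(\tfrac{1}{2}-\tau\Big)^2 \;\ge\; \log(2/\delta)
  \qquad \text{since } K \ge \frac{\log(2/\delta)}{2(1/2-\tau)^2},
\]
% hence exp(-2*floor(n/B)*(tau - p_eps)^2) + exp(-2K(1/2 - tau)^2) <= delta/2 + delta/2 = delta.
```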
Proof of Proposition 2

The data blocks are built here by partitioning the original dataset into $K \le n$ subsamples of size $B = \lfloor n/K \rfloor$. Set $I_{\varepsilon, k} = \mathbb{I}\{ |\hat{U}_k(h) - \theta(h)| > \varepsilon \}$ for $k \in \{1, \ldots, K\}$. Again, observe that $\mathbb{P}\{ |\hat{\theta}_{\mathrm{MoU}}(h) - \theta(h)| > \varepsilon \}$ is lower than

\[
\mathbb{P}\left\{ \frac{1}{K} \sum_{k=1}^{K} I_{\varepsilon, k} - q_\varepsilon \ge \frac{1}{2} - q_\varepsilon \right\},
\]

where $q_\varepsilon = \mathbb{E}[I_{\varepsilon, 1}] = \mathbb{P}\{ |\hat{U}_1(h) - \theta(h)| > \varepsilon \}$. By virtue of Chebyshev's inequality and equation (3):

\[
q_\varepsilon \le \frac{\mathrm{Var}(\hat{U}_1(h))}{\varepsilon^2} \le \frac{4 \sigma_1^2(h)}{B \varepsilon^2} + \frac{2 \sigma_2^2(h)}{B(B-1) \varepsilon^2}.
\]

Using Hoeffding's inequality, the deviation probability can thus be bounded by

\[
\exp\left( -2 K \left( \frac{1}{2} - \frac{4 \sigma_1^2(h)}{B \varepsilon^2} - \frac{2 \sigma_2^2(h)}{B(B-1) \varepsilon^2} \right)^2 \right).
\]

Choosing $K = \log(1/\delta)/(2(1/2 - \lambda)^2)$, with $\lambda \in\, ]0, 1/2[$, gives:

\[
\mathbb{P}\left\{ \left| \hat{\theta}_{\mathrm{MoU}}(h) - \theta(h) \right| > \sqrt{ C_1(\lambda)\, \frac{\log(1/\delta)}{n} + C_2(\lambda)\, \frac{\log^2(1/\delta)}{n \left( 2(1/2 - \lambda)^2 n - \log(1/\delta) \right)} } \right\} \le \delta,
\]

with $C_1(\lambda) = 2 \sigma_1^2(h) / (\lambda (1/2 - \lambda)^2)$ and $C_2(\lambda) = \sigma_2^2(h) / (\lambda (1/2 - \lambda)^2)$. The optimal constant for the first and leading term is attained for $\lambda = 1/6$, which corresponds to $K = (9/2) \log(1/\delta)$ and gives:

\[
\mathbb{P}\left\{ \left| \hat{\theta}_{\mathrm{MoU}}(h) - \theta(h) \right| > \sqrt{ C_1\, \frac{\log(1/\delta)}{n} + C_2\, \frac{\log^2(1/\delta)}{n \left( 2n - 9 \log(1/\delta) \right)} } \right\} \le \delta,
\]

with $C_1 = 108\, \sigma_1^2(h)$ and $C_2 = 486\, \sigma_2^2(h)$. Finally, taking $\lceil K \rceil$ instead of $K$ does not change the result.
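For completeness, here is the short optimization behind the choice $\lambda = 1/6$; it is a worked step we add, and it simply minimizes $C_1(\lambda)$ over $]0, 1/2[$.

```latex
% Minimizing C_1(lambda) = 2*sigma_1^2(h) / (lambda*(1/2 - lambda)^2) over ]0, 1/2[.
\[
\frac{d}{d\lambda}\Big[\lambda\big(\tfrac{1}{2}-\lambda\big)^2\Big]
  = \big(\tfrac{1}{2}-\lambda\big)\big(\tfrac{1}{2}-3\lambda\big) = 0
  \;\Longrightarrow\; \lambda^\star = \tfrac{1}{6},
\]
\[
C_1\big(\tfrac{1}{6}\big) = \frac{2\sigma_1^2(h)}{\tfrac{1}{6}\cdot\big(\tfrac{1}{3}\big)^2} = 108\,\sigma_1^2(h),
\qquad
K = \frac{\log(1/\delta)}{2\big(\tfrac{1}{2}-\tfrac{1}{6}\big)^2} = \frac{9}{2}\,\log(1/\delta).
\]
```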
Proof of Proposition 3

Here we consider the situation where the estimator is the median of $K$ randomized U-statistics, computed from data blocks built by means of independent SRSWoR schemes. We set $I_\varepsilon(\epsilon_k) = \mathbb{I}\{ |\bar{U}_k(h) - \theta(h)| > \varepsilon \}$. For all $\tau \in\, ]0, 1/2[$, we have:

\[
\mathbb{P}\left\{ \left| \tilde{\theta}_{\mathrm{MoRU}}(h) - \theta(h) \right| > \varepsilon \right\}
\le \mathbb{P}\left\{ \frac{1}{K} \sum_{k=1}^{K} I_\varepsilon(\epsilon_k) \ge \frac{1}{2} \right\}
\le \mathbb{E}\left[ \mathbb{P}\left\{ \frac{1}{K} \sum_{k=1}^{K} I_\varepsilon(\epsilon_k) - W^\varepsilon_n \ge \frac{1}{2} - \tau \;\Big|\; S_n \right\} \right]
+ \mathbb{P}\left\{ W^\varepsilon_n - q_\varepsilon \ge \tau - q_\varepsilon \right\}, \tag{10}
\]

where we set

\[
W^\varepsilon_n = \mathbb{E}\left[ I_\varepsilon(\epsilon_1) \mid S_n \right] = \binom{n}{B}^{-1} \sum_{\epsilon \in \Lambda(n, B)} I_\varepsilon(\epsilon),
\qquad
q_\varepsilon = \mathbb{E}[I_\varepsilon(\epsilon_1)] = \mathbb{P}\left\{ \left| \bar{U}_1(h) - \theta(h) \right| > \varepsilon \right\}.
\]

The conditional expectation $W^\varepsilon_n$, with mean $q_\varepsilon$, is a U-statistic of degree $B$, so that Theorem A in Chapter 5 of Serfling (1980) yields:

\[
\mathbb{P}\left\{ W^\varepsilon_n - q_\varepsilon \ge \tau - q_\varepsilon \right\} \le \exp\left( -2 \lfloor n/B \rfloor\, (\tau - q_\varepsilon)^2 \right). \tag{11}
\]

One may also show (see SM A.2.2) that

\[
q_\varepsilon \le \frac{4 \sigma_1^2(h)}{B \varepsilon^2} + \frac{2 \sigma_2^2(h)}{B(B-1) \varepsilon^2}. \tag{12}
\]

Combining (10), standard Hoeffding's inequality conditioned on the data, as well as (11) together with (12), gives that the deviation probability of $\tilde{\theta}_{\mathrm{MoRU}}(h)$ is upper bounded by

\[
\exp\left( -2 K \left( \frac{1}{2} - \tau \right)^2 \right)
+ \exp\left( -2 \lfloor n/B \rfloor \left( \tau - \frac{4 \sigma_1^2(h)}{B \varepsilon^2} - \frac{2 \sigma_2^2(h)}{B(B-1) \varepsilon^2} \right)^2 \right).
\]

Choosing $K = \lceil \log(2/\delta)/(2(1/2 - \tau)^2) \rceil$ and $B = \lfloor 8 \tau^2 n / (9 \log(2/\delta)) \rfloor$ leads to the desired bound.

References

Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999.

Audibert, J.-Y. and Catoni, O. Robust linear least squares regression. The Annals of Statistics, 39(5):2766-2794, 2011.

Bellet, A., Habrard, A., and Sebban, M. A survey on metric learning for feature vectors and structured data. arXiv e-prints, June 2013.

Blom, G. Some properties of incomplete U-statistics. Biometrika, 63(3):573-580, 1976.

Brownlees, C., Joly, E., and Lugosi, G. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507-2536, 2015.

Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711-7717, 2013.

Callaert, H. and Janssen, P. The Berry-Esseen theorem for U-statistics. The Annals of Statistics, 6(2):417-421, 1978.

Catoni, O. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 48, pp. 1148-1185. Institut Henri Poincaré, 2012.

Clémençon, S. A statistical view of clustering performance through the theory of U-processes. Journal of Multivariate Analysis, 124:42-56, 2014.

Clémençon, S., Lugosi, G., and Vayatis, N. Ranking and scoring using empirical risk minimization. In Proceedings of COLT, 2005.

Clémençon, S., Lugosi, G., and Vayatis, N. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844-874, 2008.

Clémençon, S., Colin, I., and Bellet, A. Scaling-up empirical risk minimization: optimization of incomplete U-statistics. Journal of Machine Learning Research, 17:1-36, 2016.

Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695-2725, 2016.

Hoeffding, W. A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19:293-325, 1948.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Hopkins, S. B. Sub-Gaussian mean estimation in polynomial time. arXiv preprint arXiv:1809.07425, 2018.

Hsu, D. and Sabato, S. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(1):543-582, 2016.

Jerrum, M., Valiant, L., and Vazirani, V. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169-188, 1986.

Joly, E. and Lugosi, G. Robust estimation of U-statistics. Stochastic Processes and their Applications, 126(12):3760-3773, 2016.

Lecué, G. and Lerasle, M. Robust machine learning by median-of-means: theory and practice. arXiv preprint arXiv:1711.10306, 2017.

Lecué, G., Lerasle, M., and Mathieu, T. Robust classification via MoM minimization. arXiv preprint arXiv:1808.03106, 2018.

Lee, A. J. U-statistics: Theory and Practice. Marcel Dekker, Inc., New York, 1990.

Lerasle, M. and Oliveira, R. I. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914, 2011.

Lugosi, G. and Mendelson, S. Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757, 2016.

Lugosi, G. and Mendelson, S. Sub-Gaussian estimators of the mean of a random vector. arXiv preprint arXiv:1702.00482, 2017.

McDiarmid, C. On the method of bounded differences, pp. 148-188. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989. doi: 10.1017/CBO9781107359949.008.

Mendelson, S. On aggregation for heavy-tailed classes. Probability Theory and Related Fields, 168(3-4):641-674, 2017.

Minsker, S. and Wei, X. Robust modifications of U-statistics and applications to covariance estimation problems. arXiv preprint arXiv:1801.05565, 2018.

Minsker, S. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308-2335, 2015.

Nemirovsky, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, New York, 1983.

Peña, V. H. and Giné, E. Decoupling: From Dependence to Independence. Springer, 1999.

Serfling, R. Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, 1980.

Van der Vaart, A. Asymptotic Statistics. Cambridge University Press, 2000.

Vogel, R., Clémençon, S., and Bellet, A. A probabilistic theory of supervised similarity learning: pairwise bipartite ranking and pointwise ROC curve optimization. In International Conference on Machine Learning, 2018.