# aggregated_holdout__192802a0.pdf

Journal of Machine Learning Research 22 (2021) 1-55 Submitted 7/19; Revised 11/20; Published 1/21

Aggregated Hold-Out

Guillaume Maillard guillaume.maillard@uni.lu Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d Orsay, 91405, Orsay, France

Sylvain Arlot sylvain.arlot@universite-paris-saclay.fr Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d Orsay, 91405, Orsay, France Institut Universitaire de France (IUF)

Matthieu Lerasle matthieu.lerasle@universite-paris-saclay.fr Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d Orsay, 91405, Orsay, France

Editor: Shie Mannor

Aggregated hold-out (agghoo) is a method which averages learning rules selected by holdout (that is, cross-validation with a single split). We provide the ﬁrst theoretical guarantees on agghoo, ensuring that it can be used safely: Agghoo performs at worst like the holdout when the risk is convex. The same holds true in classiﬁcation with the 0 1 risk, with an additional constant factor. For the hold-out, oracle inequalities are known for bounded losses, as in binary classiﬁcation. We show that similar results can be proved, under appropriate assumptions, for other risk-minimization problems. In particular, we obtain an oracle inequality for regularized kernel regression with a Lipschitz loss, without requiring that the Y variable or the regressors be bounded. Numerical experiments show that aggregation brings a signiﬁcant improvement over the hold-out and that agghoo is competitive with cross-validation. Keywords: cross-validation, aggregation, bagging, hyperparameter selection, regularized kernel regression

1. Introduction

The problem of choosing from data among a family of learning rules is central to machine learning. There is typically a variety of rules which can be applied to a given problem for instance, support vector machines, neural networks or random forests. Moreover, most machine learning rules depend on hyperparameters which have a strong impact on the ﬁnal performance of the algorithm. For instance, k-nearest-neighbors rules (Biau and Devroye, 2015) depend on the number k of neighbors. A second example, among many others, is given by regularized empirical risk minimization rules, such as support vector machines (Steinwart and Christmann, 2008) or the lasso (Tibshirani, 1996; Bühlmann and van de Geer, 2011), which all depend on some regularization parameter. A related problem is model selection (Burnham and Anderson, 2002; Massart, 2007), where one has to choose among a family of candidate models. In supervised learning, cross-validation (CV) is a general, eﬃcient and classical answer to the problem of selecting a learning rule (Arlot and Celisse, 2010). It relies on the idea

c 2021 Guillaume Maillard, Sylvain Arlot and Matthieu Lerasle.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/19-624.html.

Maillard, Arlot and Lerasle

of splitting data into a training sample used for training a predictor with each rule in competition and a validation sample used for assessing the performance of each predictor. This leads to an estimator of the risk the hold-out estimator when data are split once, the CV estimator when an average is taken over several data splits , which can be minimized for selecting among a family of competing rules. A completely diﬀerent strategy, called aggregation, is to combine the predictors obtained with all candidates (Nemirovski, 2000; Yang, 2001; Tsybakov, 2004). Aggregation is the key step of ensemble methods (Dietterich, 2000), among which we can mention bagging (Breiman, 1996), adaboost (Freund and Schapire, 1997) and random forests (Breiman, 2001; Biau and Scornet, 2016). A major interest of aggregation is that it builds a learning rule that may not belong to the family of rules in competition. Therefore, it sometimes has a smaller risk than the best of all rules (Salmon and Dalalyan, 2011, Table 1). In contrast, cross-validation, which selects only one candidate, cannot outperform the best rule in the family.

1.1 Aggregated Hold-Out (Agghoo)

This paper studies a procedure mixing cross-validation and aggregation ideas, that we call aggregated hold-out (agghoo). Data are split several times; for each split, the hold-out selects one predictor; then, the predictors obtained with the diﬀerent splits are aggregated. A formal deﬁnition is provided in Section 3. This procedure is as general as cross-validation and it has roughly the same computational cost (see Section 3.4). Agghoo is already popular among practicioners, and has appeared in the neuro-imaging literature (Hoyos-Idrobo et al., 2015; Varoquaux et al., 2017) under the name CV + averaging . Yet, to the best of our knowledge, existing experimental studies do not give any indication on how to choose agghoo s parameters. No general mathematical deﬁnition has been provided, so it is unclear how to generalize agghoo beyond a given article s setting. Theoretical guarantees on agghoo have not been established yet, to the best of our knowledge. The closest results we found study other procedures, called ACV (Jung and Hu, 2015), EKCV (Jung, 2016), or Hall and Robinson (2009) s bagged cross-validation (shortened into Hall s BCV, which should not be confused with other procedures combining bagging and cross-validation, see Section 3.3). These authors do not prove oracle inequalities. We explain in Section 3.3 why agghoo should be preferred to these procedures in the general prediction setting. Because of the aggregation step, agghoo is an ensemble method, and like bagging, it combines resampling with aggregation. However, unlike agghoo, bagging applies to single estimators, and does not adress the problem of estimator selection. Hence, if there is a free hyperparameter, bagging must be combined with some estimator selection method, such as cross-validation. The application of bagging to the hold-out was ﬁrst suggested by Breiman (1996) as a way to combine pruning and bagging of CART trees. We discuss in detail in Section 3.3 how agghoo relates to bagging and subagging combined with the hold-out. In particular, we explain why previous results on bagging or subagging do not apply to agghoo; new developments are required.

Aggregated Hold-Out

1.2 Contributions

In this article, agghoo s performance is studied both theoretically and experimentally. We consider agghoo from a prediction point of view. Performance is measured by a risk functional. On the theoretical side, the aim is to show that the risk of agghoo s ﬁnal predictor is as low as the risk of the optimal rule among the given collection. This is known as an oracle inequality. By a convexity argument, agghoo always improves on the hold-out, provided that the risk is convex. Hence, agghoo can safely replace the hold-out in any application where this hypothesis holds true. Another consequence is that oracle inequalities for agghoo can be deduced from oracle inequalities for the hold-out. This kind of result on the hold-out has already appeared in the literature: for example, Massart (2007, Corollary 8.8) proves a general theorem under an abstract noise assumption; more explicit results have been obtained in speciﬁc settings such as least-squares regression (Györﬁet al., 2002, Theorem 7.1) or maximum-likelihood density estimation (Massart, 2007, Theorem 8.9). A review on cross-validation which includes the hold-out is done by Arlot and Celisse (2010). Most existing theoretical guarantees on the hold-out have a limitation: they assume that the loss function is uniformly bounded. In regression, the variable Y and the regressors are also usually assumed to be bounded, which excludes some standard least-squares estimators. Even when the boundedness assumption holds true, constants arising from general bounds may be of the wrong order of magnitude, leading to vacuous results. By replacing uniform supremum bounds by local ones, we are able to relax these hypotheses in a general setting (Theorem 17). This enables us to prove an oracle inequality for the hold-out and agghoo in regularized kernel regression with a general Lipschitz loss (Theorem 11). This oracle inequality allows for instance to recover state-of-the-art convergence rates in median regression without knowing the regularity of the regression function (adaptivity), both in the general case and, for small enough regularity, also in the speciﬁc setting of Eberts and Steinwart (2013). To illustrate the implications of Theorem 11, we also apply it to ε-regression (Corollary 12). To the best of our knowledge, all these oracle inequalities are new, even for the hold-out. In addition to the RKHS setting studied here, Theorem 17 of this article has also been applied to sparse linear regression (Maillard, 2020a) and to least-squares density estimation (Maillard, 2020b). A limitation of agghoo is that it does not cover settings where averaging does not make sense, such as classiﬁcation. In classiﬁcation with the 0 1 loss, the natural way to aggregate classiﬁers is to take a majority vote among them. This yields a procedure which we call majhoo. Using existing theory for the hold-out in classiﬁcation, we prove that majhoo satisﬁes a general margin-adaptive oracle inequality (Theorem 13) under Tsybakov s margin assumption (Mammen and Tsybakov, 1999). All our oracle inequalities are valid for any size of the aggregation ensemble. Qualitatively, since bagging and subagging are well-known for their stabilizing eﬀects (Breiman, 1996; Bühlmann and Yu, 2002), we can expect agghoo to behave similarly. In particular, large ensembles should improve much the prediction performance of CV when the hold-out selected predictor is unstable. For further insights into agghoo and majhoo, we conduct in Section 5 a numerical study on simulated data sets. Its results conﬁrm our intuition: in all settings considered, agghoo

Maillard, Arlot and Lerasle

and majhoo actually perform much better than the hold-out, and sometimes better than CV, provided their parameters are well-chosen. When choosing the number of neighbors for k-nearest neighbors, the prediction performance of majhoo is much better than the one of CV, which illustrates the strong interest of using agghoo/majhoo when learning rules are unstable . In support vector regression, agghoo can even perform as well as the oracle, an achievement that is not matched by CV, ACV, EKCV or bagging applied to K-fold cross-validation. Based upon our experiments, we also give in Section 5 some guidelines for choosing agghoo s parameters: the training set size and the number of data splits.

The remaining of the article is structured as follows. In Section 2, we introduce the general statistical setting. In Section 3, we give a formal deﬁnition of agghoo. In Section 4, we state the main theoretical results. In Section 5, we present our numerical experiments and discuss the results. Finally, in Section 6, we draw some qualitative conclusions about agghoo. The proofs are postponed to the Appendix.

2. Setting and Deﬁnitions

We consider a general statistical learning setting, following the book by Massart (2007).

2.1 Risk Minimization

The goal is to minimize over a set S a risk functional L : S R {+ }. The set S may be inﬁnite dimensional for non-parametric problems. Assume that L attains its minimum over S at a point s, called a Bayes element. Then the excess risk of any f S is the nonnegative quantity ℓ(s, f) = L(f) L(s) .

Suppose that the risk can be written as an expectation over an unknown probability distribution, L(f) = E γ(f, ξ) ,

for a contrast function γ : S Ξ R and a random variable ξ with values in some set Ξ and unknown distribution P, such that

f S, eξ Ξ 7 γ(f, eξ) is P-measurable .

The statistical learning problem is to use data Dn = (ξ1, ..., ξn), where ξ1, ..., ξn are independent and identically distributed with common distribution P, to ﬁnd an approximate minimizer for L. The quality of this approximation is measured by the excess risk.

2.2 Examples

Supervised learning aims at predicting a quantity of interest Y Y using explanatory variables X X. The statistician observes pairs (X1, Y1), . . . (Xn, Yn), so that Ξ = X Y, and seeks a predictor in S = {f : X Y : f measurable}. The contrast function is deﬁned by γ(f, (x, y)) = g(f(x), y) for some loss function g : Y Y R. Here, g(y , y) measures the loss incurred by predicting y instead of the observed value y. Two classical supervised learning problems are classiﬁcation and regression, which we detail below.

Aggregated Hold-Out

Example 1 (Classiﬁcation) In classiﬁcation, Y belongs to a ﬁnite set of labels, that is, Y = {0, . . . , M 1} for some M 2. We wish to correctly label any new data point X, and the risk is the probability of error:

f S, L(f) = P f(X) = Y ,

which corresponds to the loss function g(y , y) = I{y = y}. Classiﬁcation with convex losses (such as the hinge loss or logistic loss) can also be described using the formalism of Section 2.1.

Example 2 (Regression) In regression, we wish to predict a continuous variable Y that belongs to Y = Rd. The error made by predicting y instead of y is measured by the loss function deﬁned by g(y , y) = φ( y y ) where φ : R+ R+ is nondecreasing and convex. Some typical choices are φ(x) = x2 (least squares), φ(x) = |x| (median regression) or φ(x) = (|x| ε)+ (Vapnik s ε-insensitive loss, leading to ε-regression). The risk is given by

L(f) = E h φ Y f(X) i .

If φ is strictly convex, the minimizer of L over S is a unique function, up to modiﬁcation on a set of probability 0 under the distribution of X.

In some applications, such as robust regression, it is of interest to deﬁne s and ℓ(s, f) even when φ( Y ) / L1. This is possible for Lipschitz contrasts, by the following remark.

Remark 1 When φ is convex and increasing (as in Example 2), assuming also that φ is Lipschitz-continuous, it is always possible to deﬁne

s : x 7 argmin u R E φ( Y u ) φ( Y ) X = x .

When s L1(X), it is a Bayes element for the loss function g(y , y) = φ( y y ) φ( y ). Whenever φ( Y ) L1, this loss yields the same Bayes element and excess risk as in Example 2.2.

This adjustment to the general deﬁnition allows to consider Example 2 when φ( Y s(X) ) is not integrable, for example when Y = s(X)+η, where η is independent from X and follows a multivariate Cauchy distribution with location parameter 0.

Some density estimation problems, such as maximum-likelihood or least-squares density estimation, also ﬁt the formalism of Section 2.1, see the book by Massart (2007).

2.3 Learning Rules and Estimator Ensembles

Statistical procedures use data to compute an element of S which approximately minimizes L. Since agghoo uses subsampling, we require learning rules to accept as input data sets of any size. Therefore, we deﬁne a learning rule to be a function which maps any data set to an element of S.

Maillard, Arlot and Lerasle

Deﬁnition 2 A data set Dn of length n is a ﬁnite sequence (ξi)1 i n of independent and identically distributed Ξ-valued random variables with common distribution P. A learning rule A is a measurable function1

In the risk minimization setting, A should be chosen so as to minimize L(A(Dn)).

A generic situation is when a family (Am)m M of learning rules is given, so that we have to select one of them (estimator selection), or to combine their outputs (estimator aggregation). For instance, when X is a metric space, we can consider the family (ANN k )k 1 of nearest-neighbors classiﬁers where k is the number of neighbors , or, for a given kernel on X, the family (ASVM λ )λ [0,+ ) of support vector machine classiﬁers where λ is the regularization parameter. Not all rules in such families perform well on a given data set. Bad rules should be avoided when selecting the hyperparameter, or be given small weights if the outputs are combined in a weighted average. This requires a data-adaptive procedure, as the right choice of rule in general depends on the unknown distribution P.

Aggregation and parameter selection methods aim to resolve this problem, as described in the next section.

3. Cross-Validation and Aggregated Hold-Out (Agghoo)

This section recalls the deﬁnition of cross-validation for estimator selection, and introduces a new procedure called aggregated hold-out (agghoo). For more details and references on cross-validation, we refer the reader to the survey by Arlot and Celisse (2010).

3.1 Background: Cross-Validation

Cross-validation uses subsampling and the empirical risk. We ﬁrst introduce some notation.

Deﬁnition 3 (Empirical risk) For any data set Dn = (ξi)1 i n and any f S, the empirical risk of f over Dn is deﬁned by

Pnγ(f, ) = 1

i=1 γ(f, ξi) .

For any nonempty subset T {1, . . . , n}, let also

DT n = (ξi)i T

be the subsample of Dn indexed by T, and deﬁne the associated empirical risk by

f S, P T n γ(f, ) = 1

i T γ(f, ξi) .

1. For any n, ( Ξn Ξ R (ξ1:n, ξ) 7 γ(A(ξ1:n), ξ)

is assumed to be measurable with respect to the product σ-algebra on Ξn+1.

Aggregated Hold-Out

The most classical estimator selection procedure is to hold out some data to calculate the empirical risk of each estimator, and then to select the estimator with the lowest empirical risk. This ensures that the data used to evaluate the risk are independent from the training data used to compute the learning rules.

Deﬁnition 4 (Hold-out) For any data set Dn and any subset T {1, . . . , n}, the associated hold-out risk estimator of a learning rule A is deﬁned by

HOT (A, Dn) = P T c n γ A(DT n ), .

Given a collection of learning rules (Am)m M, the hold-out procedure selects

bmho T (Dn) argmin m M HOT (Am, Dn) ,

measurably with respect to Dn. The overall learning rule is then given by

bf ho T (Am)m M, Dn = A bmho T (Dn)(DT n ) .

Hold-out depends on the arbitrary choice of a training set T, and is known to be quite unstable, despite its good theoretical properties (Massart, 2007, Section 8.5.1). Therefore, practicioners often prefer to use cross-validation instead, which considers several training sets.

Deﬁnition 5 (Cross-validation) Let Dn denote a data set. Let T denote a collection of nonempty subsets of {1, . . . , n}. The associated cross-validation risk estimator of a learning rule A is deﬁned by

CVT (A, Dn) = 1 |T |

T T HOT (A, Dn) .

The cross-validation procedure then selects

bmcv T (Dn) argmin m M CVT (Am, Dn) .

The ﬁnal predictor obtained through this procedure is

bf cv T (Am)m M, Dn = A bmcv T (Dn)(Dn) .

Depending on how T is chosen, this can lead to leave-one-out, leave-p-out, V -fold crossvalidation or Monte-Carlo cross-validation, among others (Arlot and Celisse, 2010). In the following, we omit some of the arguments A, Dn which appear in Deﬁnitions 4 and 5, when they are clear from context. For example, we often write HOT (A) , bmho T , bf ho T instead of HOT (A, Dn) , bmho T (Dn), bf ho T (Am)m M, Dn , respectively.

3.2 Aggregated Hold-Out (Agghoo) Estimators

In this paper, we study another way to improve on the stability of hold-out selection, by aggregating the predictors bf ho T obtained by the hold-out procedure applied repeatedly with diﬀerent training sets T T . When S is convex (for example, regression), aggregated holdout (agghoo) consists in averaging them.

Maillard, Arlot and Lerasle

Deﬁnition 6 (Agghoo) Assume that S is a convex set. Let (Am)m M denote a collection of learning rules, Dn a data set, and T a collection of subsets of {1, . . . , n}. Using the notation of Deﬁnition 4, the associated agghoo estimator is deﬁned by

bf ag T (Am)m M, Dn = 1 |T |

T T bf ho T (Am)m M, Dn .

In the classiﬁcation framework, as seen in Example 1, S = {f : X {0, . . . , M 1}} which is not convex. However, there is still a natural way to aggregate several classiﬁers, by taking a majority vote.

Deﬁnition 7 (Majhoo) Let Y = {0, . . . , M 1} be the set of labels. Given a collection of learning rules (Am)m M, a data set Dn and a collection T of subsets of {1, . . . , n}, the majority hold-out (majhoo) classiﬁer is any measurable bf mv T (Am)m M, Dn : X Y such that, using the notation bf ho T introduced in Deﬁnition 4, for all x X,

bf mv T (Am)m M, Dn (x) argmax j Y

n T T : bf ho T (Am)m M, Dn (x) = j o .

In most situations, it is clear how hold-out rules should be aggregated and there is no ambiguity in discussing hold-out aggregation. However, there is an important exception where both agghoo and majhoo can be used.

Remark 8 (Two options for binary classiﬁcation) In binary classiﬁcation (Example 1 with M = 2), it is classical to consider classiﬁers of the form x 7 If(x) 0 where the function f Sconv = {f : X R} aims at minimizing a surrogate convex risk associated with the loss gconv : (y , y) 7 φ[(2y 1)(2y 1)] with φ : R R convex (Boucheron et al., 2005). Then, given a family of Sconv-valued learning rules Am

m M, one can either apply agghoo to the surrogate problem and get

I bf ag T ((Am)m M,Dn) 0 ,

or apply majhoo to the binary classiﬁcation problem and get

bf mv T IAm( ) 0

In the rest of Section 3, we focus on agghoo, though much of the following discussion applies also to majhoo.

Compared to cross-validation rules (Deﬁnition 5), agghoo reverses the order between aggregation (majority vote or averaging) and minimization of the risk estimator: instead of averaging hold-out risk estimators before selecting the hyperparameter, the selection step is made ﬁrst to produce hold-out predictors bf ho T

T T (given by Deﬁnition 4) and then an average is taken.

Aggregated Hold-Out

3.3 Related Procedures

To the best of our knowledge, agghoo has not been studied theoretically before, though it is used in applications (Hoyos-Idrobo et al., 2015; Varoquaux et al., 2017), under the name CV + averaging in Varoquaux et al. (2017). According to Varoquaux et al. (2017), agghoo is commonly used by the machine learning community thanks to the scikit-learn library (Pedregosa et al., 2011). A related procedure is K-fold averaging cross-validation (ACV), proposed by Jung and Hu (2015). In linear regression, ACV corresponds to averaging the A bmho T (Dn), which are retrained on the whole data set, while agghoo averages the A bmho T (DT n ). An advantage of averaging the rules A bmho T (DT n ) is that they have been selected for their good performance on the validation set T c, unlike the A bmho T (Dn) whose performance has not been assessed on independent data. Furthermore, similarly to bagging, using several distinct training sets may result in improvements for unstable methods through a reduction in variance. Note ﬁnally that the theoretical results of Jung and Hu (2015) on ACV are limited to a speciﬁc setting, and much weaker than an oracle inequality.

A second family of related procedures is averaging the chosen parameters bmho T

T T , contrary to agghoo which averages the chosen prediction rules. This leads to diﬀerent procedures for learning rules that are not linear functions of their parameters. This is the approach taken by Jung and Hu (2015) for selecting a regularization parameter, still under the name of ACV. The idea has also been put forward under the name bagged cross-validation (that we call in this article Hall s BCV, for avoiding confusion with other ways of combining bagging and cross-validation) by Hall and Robinson (2009) with numerical and theoretical results in the case of bandwidth choice in kernel density estimation , and under the name eﬃcient K-fold cross-validation (EKCV; Jung, 2016) for the choice of a regularization parameter in high-dimensional regression with numerical results only. Unlike agghoo, which only depends on the set {Am : m M} of learning rules, ACV, EKCV and Hall s BCV depend on the parametrization m 7 Am . Sometimes, the most natural parametrization does not allow the use of such procedures: for example, model dimensions are integers, and averaging them does not make sense. In contrast, in regression, it is always possible to average the real-valued functions Am(Dnt) S. Even when all procedures are applicable, averaging rules is generally safer than averaging hyperparameters. Often in regression, the risk L is known to be convex over S, so given f1, . . . , f V S,

i=1 L(fi) .

Hence, averaging regressors (agghoo) always improves performance compared to selecting a single fi at random (hold-out). On the other hand, if (fθ)θ Θ is a family of elements of S parametrized by a convex set Θ, there is no guarantee in general that the function θ 7 L(fθ) is convex over Θ. So, for some θ1, . . . , θV Θ, it may happen that

V PV i=1 θi

i=1 L(fθi) .

Maillard, Arlot and Lerasle

In such a case, it is better to choose one parameter at random (hold-out) than to average them (ACV, EKCV or Hall s BCV).

A third family of related procedures is bagging or subagging applied to some CV predictor Dn 7 bf cv T ((Am)m M, Dn) given by Deﬁnition 5. The bagging case (that we call bagged CV ) has been studied numerically by Petersen et al. (2007), but clearly diﬀers from agghoo since it relies on bootstrap resamples, in which the original data can appear several times. As a consequence, some data can be shared between training and test samples, which breaks the independence between them. This is a major issue for theory, and it can degrade the performance as shown by our numerical experiments (Tables 1 2 in Section 5.1). The problem disappears if sampling with replacement (bagging) is replaced with sampling without replacement (subagging). The resulting procedure, that can be called subagged CV in general, and subagged hold-out when the kind of CV considered is the hold-out, is not explicitly studied in the literature, to the best of our knowledge. Subagged hold-out is closer to agghoo but there is still a slight diﬀerence. In subagged hold-out, the sample is divided into three parts: the training part of the bagging subsample, the validation part of the bagging subsample, and the data not in the bagging subsample. Thus, part of the data is discarded in each iteration of subagging. With agghoo, the sample is only divided into two parts: training set and validation set. Thus, all the data is used, either to train the learning rules or to estimate their risk. As a result, agghoo is potentially more eﬃcient in its use of the data. Another consequence is that theoretical results on bagging or subagging cannot be used directly for studying agghoo. Note also that Petersen et al. (2007) recommend a diﬀerent approach, where rather than bagging the whole CV procedure, bagging is instead applied within each CV fold, which leads to the (random) selection of a single (bagged) estimator. In contrast, agghoo selects diﬀerent estimators for each fold, potentially reducing the variance as usual with ensemble methods (Catoni, 2001; Lecué, 2007; Genuer, 2012).

3.4 Computational Complexity

In general, for a given value of V = |T |, both agghoo ( bf ag T ) and CV ( bf cv T ) must compute V hold-out risk estimators over all values of m M. Assume for simplicity that all training data sets DT n , T T , have the same size |T| = nt, and denote by nv = n nt the size of the validation data set. Let Cho(M, nt, nv) be the average computational complexity of the hold-out. Then the overall complexity of risk estimation is of order V Cho(M, nt, nv) for both agghoo and CV. Next, CV must average V risk vectors of length |M| and ﬁnd a single minimum, while agghoo computes V minima over m M; these operations have similar complexity, of order V |M|. Thus, computing the ensemble aggregated by agghoo takes about as much time as selecting a learning rule using cross-validation. A potential diﬀerence occurs when evaluating agghoo and CV on new data. If there is no fast way to perform aggregation at training time, it is always possible to evaluate each predictor in the ensemble on the new data, and to average the results; then, agghoo is slower than CV by a factor of order V at test time.

Aggregated Hold-Out

4. Theoretical Results

The purpose of agghoo is to construct an estimator whose risk is as small as possible, compared to the (unknown) best rule in the class (Am)m M. This is guaranteed theoretically by proving oracle inequalities of the form

E ℓ(s, bf ag T ) CE h inf m M ℓ s, Am(Dn) i + εn , (1)

with εn negligible compared to the oracle excess risk E[infm M ℓ(s, Am(Dn))] and C close to 1. Equation (1) then implies that agghoo performs as well as the best choice of m M, up to the constant C. In the following, we actually prove slightly weaker inequalities that are more natural in our setting. By deﬁnition, agghoo is an average of predictors chosen by hold-out over the collection (Am)m M . Therefore, when the risk is convex, an oracle inequality (1) can be deduced from an oracle inequality for the hold-out, provided that there exists an integer nt {1, . . . , n 1} such that T is independent from Dn and T T , |T| = nt . (2)

We make this assumption in the rest of the article, and then deﬁne nv = n nt the size of the validation data set. Most cross-validation methods satisfy hypothesis (2), including leave-p-out, V -fold crossvalidation (with nt = n(V 1)/V ) and Monte-Carlo cross-validation (Arlot and Celisse, 2010). In the remainder of this section, we introduce the RKHS setting of interest, and prove an oracle inequality for agghoo without changing the standard estimators or requiring Y to be bounded.

4.1 Agghoo in Regularized Kernel Regression

Kernel methods such as support vector machines (SVM), kernel least squares or ε-regression use a kernel function to map the data Xi into an inﬁnite-dimensional function space, more speciﬁcally a reproducing kernel Hilbert space (RKHS) (Scholkopf and Smola, 2001; Steinwart and Christmann, 2008). We consider in this section regularized empirical risk minimization using a training loss function c, with a penalty proportional to the square norm of the RKHS, to solve the supervised learning problem (deﬁned in Section 2.2) with loss function g for deﬁning the risk. Hence, the contrast γ can be written γ(f, (x, y)) = g(f(x), y) := (g f)(x, y). We assume that g and c are convex in their ﬁrst argument.

Deﬁnition 9 (Regularized kernel estimator) Let c : R R R be convex in its ﬁrst argument, and let K : X X R be a positive-deﬁnite kernel function. Given λ > 0 and training data (Xi, Yi)1 i nt, deﬁne the regularized kernel estimator as

Aλ(Dnt) = argmin f H

n Pnt(c f) + λ f 2 H o ,

Maillard, Arlot and Lerasle

where H is the reproducing kernel Hilbert space induced by K. By the representer theorem, Aλ can be computed explicitly:

Aλ(Dnt)(x) =

j=1 bθλ,j K(Xj, x) where bθλ,j is the j-th component of the vector

bθλ = argmin θ Rnt

j=1 θj K(Xj, Xi), Yi

j=1 θiθj K(Xi, Xj)

The loss function c is used to measure the accuracy of the ﬁt on the training data: for example, taking c : (u, y) 7 (1 uy)+ (the hinge loss) in Deﬁnition 9 corresponds to SVM. The loss function g used for risk evaluation may or may not be equal to c. For example, in classiﬁcation, the 0 1 loss often cannot be used for training for computational reasons, hence a surrogate convex loss, such as the hinge loss, is used instead (see Remark 8), but there is no reason to use the hinge loss for risk estimation and hyperparameter selection.

In Deﬁnition 9, the hyperparameter of interest is λ (we assume that K is ﬁxed). We show below some guarantees on agghoo s performance when it is applied to a ﬁnite subfamily (Aλ)λ Λ of the one deﬁned by Deﬁnition 9. We ﬁrst state some useful assumptions.

Hypothesis Comp C(g, c): The functions Lc : f 7 P(c f) and Lg have a common minimum s argminf S Lc(f) argminf S Lg(f) and for any f S, Lc(f) Lc(s) C [Lg(f) Lg(s)].

Note that Comp1(g, c) is always satisﬁed when g = c. When g = c, some hypothesis relating c and g is necessary anyway for Deﬁnition 9 to be of interest, if only to ensure consistency (asymptotic minimization of the risk) for some sequence of hyperparameters (λn)n N.

In addition, some information about the evaluation loss g helps to obtain an oracle inequality (1) with a smaller remainder term εn.

Hypothesis SCρ,ν: Let ℓX(u) = E[g(u, Y )|X] infv R E[g(v, Y )|X]. The triple (g, X, Y ) satisﬁes SCρ,ν if and only if, for any u, v R,

E h g(u, Y ) g(v, Y ) 2 X i h ρ ν|u v| i ℓX(u) + ℓX(v) . (4)

For example, in the case of median regression, that is, g(u, y) = |u y|, hypothesis SCρ,ν holds whenever there is a uniform lower bound on the concentration of Y around s(X), as shown by the following proposition.

Proposition 10 Let g(u, y) = |u y| for all u, y R. For any x X, let Fx be the conditional cumulative distribution function of Y knowing X = x. Assume that, for any x X, Fx is continuous with a unique median s(x) and that there exists a(x) > 0, b(x) > 0 such that u R, Fx(u) Fx s(x) a(x) h u s(x) b(x) i . (5)

Aggregated Hold-Out

For instance, this holds true if d Fx

du a(x)I|u s(x)| b(x) for every x X. Let

am = inf x X a(x) and µm = inf x X a(x)b(x) .

If am > 0 and µm > 0, then (g, X, Y ) satisﬁes SC 2 am , 2 µm .

Proposition 10 is proved in Appendix C.1. We can now state our ﬁrst main result.

Theorem 11 Let Λ R + be a ﬁnite grid. Using the notation of Deﬁnition 6 and assuming that (2) holds true, let bf ag T be the output of agghoo, applied to the collection (Aλ)λ Λ given by Deﬁnition 9. Assume that λm = min Λ > 0 and κ = supx X K(x, x) < + . Assume that Comp C(g, c) holds for a constant C > 0 and that (g, X, Y ) satisﬁes SCρ,ν with constants ρ 0, ν 0. Assume that c and g are convex and Lipschitz in their ﬁrst argument, with Lipschitz constant less than L. Assume also that nv 100 and 3 |Λ| e nv. Then, for any θ (0, 1],

(1 θ)E h ℓ s, bf ag T i (1 + θ)E h min λ Λ ℓ s, Aλ(Dnt) i

18ρlog nv|Λ|

θnv , b1 log2 nv|Λ|

θ3λmn2v , b2 log 3 2 nv|Λ|

where b1, b2 do not depend on nv, nt, λm, ρ or θ but only on κ, L, ν and C.

Theorem 11 is proved in Appendix B as a consequence of a result valid in the general framework of Section 2.1 (Theorem 17). It shows that bf ag T satisﬁes an oracle inequality of the form (1), with Aλ(Dnt) instead of Aλ(Dn) on the right-hand side of the inequality. The fact that Dnt appears in the bound instead of Dn is a limitation of our result, but it is natural since predictors aggregated by agghoo are only trained on part of the data. In most cases, it can be expected that ℓ(s, Aλ(Dnt)) is close to ℓ(s, Aλ(Dn)) whenever nt

n is close to 1. The assumption that K is bounded is mild. For instance, popular kernels such as Gaussian kernels, (x, x ) 7 exp[ x x 2 /(2h2)] for some h > 0, or Laplace kernels, (x, x ) 7 exp( x x /h) for some h > 0, are bounded by κ = 1. Taking |T | = 1 in Theorem 11 yields a new oracle inequality for the hold-out. Oracle inequalities for the hold-out have already been proved in a variety of settings (see Arlot and Celisse, 2010, for a review), and used to obtain adaptive rates in regularized kernel regression (Steinwart and Christmann, 2008). However, this work has mostly been accomplished under the assumption that the contrast γ (Aλ(Dn), (X, Y )) is bounded uniformly (in n, Dn and λ Λ) by a constant. If this constant increases with n, bounds obtained in this manner may worsen considerably. As many natural regression procedures including regularized kernel regression (Deﬁnition 9) fail to satisfy such bounds, some theoreticians introduce truncated versions of standard procedures (Steinwart and Christmann, 2008), but truncation has no basis in practice. Theorem 11 avoids these complications.

In order to be satisfactory, Theorem 11 should prove that agghoo performs asymptotically as well as the best choice of λ Λ, at least for reasonable choices of Λ. This is the case

Maillard, Arlot and Lerasle

whenever the maximum in Equation (6) is negligible with respect to the oracle excess risk E[minλ Λ ℓ(s, Aλ(Dnt))] as n + . This depends on the range [λm, + ) in which the hold-out is allowed to search for the optimal λ. On the one hand, it is desirable that this interval be wide enough to contain the true optimal value. On the other hand, if λm = 0, then inequality (6) becomes vacuous. We now provide precise examples where Theorem 11 applies with a remainder term in Equation (6) that is negligible relative to the oracle excess risk. Take the example of median regression, in which c(u, y) = g(u, y) = |u y|. Then Comp1(g, c) holds trivially. Make also the same assumptions as in Proposition 10, which ensures that SCρ,ν holds for some ﬁnite values of ρ and ν. Theorem 11 therefore applies as long as the kernel K is bounded and λm > 0. Choose nv = nt = n

2 and Λ of cardinality at most polynomial in n (which is suﬃcient in theory and in practice). Then Steinwart and Christmann (2008, Theorem 9.6) prove the consistency of Aλn(Dn) as n + , provided that λ2 nn + . This suggests choosing λm = 1/ nt, in which case the remainder term of Equation (6) is of order (log n)3/2/n, which is negligible relative to nonparametric convergence rates in median regression. In order to have a more precise idea of the order of magnitude of the oracle excess risk, let us consider median regression with a Gaussian kernel. Under some assumptions, one of which coincides with Proposition 10, Eberts and Steinwart (2013, Corollary 4.12) show that taking λn = c1

n leads to rates of order n 2α 2α+d , where d N is the dimension of X and α > 0 is the smoothness of s. Therefore, taking λm = 1/nt in Theorem 11, the remainder term of Equation (6) is at most of order (log n)3/2/ n, hence negligible relative to the above risk rates as soon as 2α < d.

Theorem 11 can handle situations where g is diﬀerent from the training loss c, provided that Comp C(g, c) holds true. Such situations arise for instance in the case of support vector regression (Scholkopf and Smola, 2001, Chapter 9), which uses for training Vapnik s ε-insensitive loss ceps ε (u, y) = (|u y| ε)+. This loss depends on a parameter ε, the choice of which is usually motivated by a tradeoﬀbetween sparsity and prediction accuracy (Scholkopf and Smola, 2001). Therefore, some other loss is typically used to measure predictive performance, independently of ε. We state one possible application of Theorem 11 to this case, as a corollary.

Corollary 12 (ε-regression) Let c = ceps ε : (u, y) 7 (|y u| ε)+ be Vapnik s ε-insensitive loss and assume that the evaluation loss is g = ceps 0 : (u, y) 7 |u y|. Assume that for every x the conditional distribution of Y given X = x has a unimodal density with respect to the Lebesgue measure, symmetric around its mode. Introduce the robust noise parameter

σ = sup x X

inf y R : P(Y y | X = x) 3

sup y R : P(Y y | X = x) 1

Then, applying agghoo to a ﬁnite subfamily (Aλ)λ Λ of the rules given by Deﬁnition 9 with c = ceps ε and a kernel K such that K 1 yields the following oracle inequality. Assuming

Aggregated Hold-Out

nv 100, 3 |Λ| e nv, and that (2) holds true, we have that for any θ (0, 1],

(1 θ)E h ℓ s, bf ag T i (1 + θ)E h min λ Λ ℓ s, Aλ(Dnt) i

72σlog nv|Λ|

θnv , b1 log2 nv|Λ|

θ3λmn2v , b2 log 3 2 nv|Λ|

where b1 and b2 represent numerical constants.

Corollary 12 is proved in Appendix C.2. When ε = 0, ε-regression becomes median regression, which is discussed above. The oracle inequality of Corollary 12 is then the same as that given by Theorem 11 and Proposition 10. Assumptions of unimodality and symmetry allow to give more explicit values of am and µm in terms of σ. When ε > 0, the unimodality and symmetry assumptions are used to prove hypothesis Comp C(g, c).

4.2 Classiﬁcation

Loss functions are not all convex. When convexity fails, the aggregation procedure should be revised. In classiﬁcation, majhoo is a possible solution (see Deﬁnition 7). By Proposition 35 in Appendix D, majority voting satisﬁes a kind of convexity inequality with respect to the 0 1 loss; as a result, oracle inequalities for the hold-out imply oracle inequalities for majhoo. Hold-out for binary classiﬁcation with 0 1 loss has been studied by Massart (2007). In that work, Massart makes an assumption which is closely related to margin hypotheses, such as Tsybakov s noise condition (Mammen and Tsybakov, 1999) which we consider here. This approach allows to derive the following theorem.

Theorem 13 Consider the classiﬁcation setting described in Example 1 with M = 2 classes (binary classiﬁcation). Let (Am)m M be a collection of learning rules and T a collection of training sets satisfying assumption (2). Assume that there exists β 0 and r 1 such that for ξ = (X, Y ) with distribution P,

h > 0, P 2η(X) 1 h rhβ (MA)

where η(X) := P(Y = 1 | X). Then, we have

E h ℓ s, bf mv T i 3E inf m M ℓ s, Am(Dnt) + 29r 1 β+2 log e|M|

β+1 β+2 v .

Theorem 13 is proved in Appendix D. It shows that bf mv T , like bf ag T , satisﬁes an oracle inequality of the form (1) with Am(Dnt) instead of Am(Dn). Tsybakov s margin assumption (MA) only depends on the distribution of (X, Y ) and not on the collection of learning rules. It is a standard hypothesis in classiﬁcation, under which fast learning rates faster than n 1/2 are attainable (Tsybakov, 2004). In contrast with the results of Section 4.1, that are

Maillard, Arlot and Lerasle

valid for various losses but only for a speciﬁc type of learning rule, Theorem 13 holds true for any family of classiﬁcation rules. The constant 3 in front of the oracle excess risk can be replaced by any constant larger than 2, at the price of increasing the constant in the remainder term, as can be seen from the proof (in Appendix D). However, our approach cannot yield a constant lower than 2, because we use Proposition 35 instead of a convexity argument, since the 0 1 loss is not convex.

5. Numerical Experiments

This section investigates how agghoo and majhoo s performance vary with their parameters V = |T | = and τ = nt

n , and how it compares to the performance of CV and related methods at a similar computational cost that is, for the same values of V and τ. Two settings are considered, corresponding to Corollary 12 (ε-regression) and Theorem 13 (classiﬁcation).

5.1 ε-Regression

Consider the collection (Aλ)λ Λ of regularized kernel estimators (see Deﬁnition 9) with loss function ceps ε (u, y) = (|u y| ε)+ and Gaussian kernel K(x, x ) = exp[ (x x )2/(2h2)] over X = R.

5.1.1 Experimental Procedure

Agghoo and CV training sets T T are chosen independently and uniformly among the subsets of {1, . . . , n} with cardinality τn , for diﬀerent values of τ and V = |T |; hence, CV corresponds to what is usually called Monte-Carlo CV (Arlot and Celisse, 2010). Each algorithm is run on 1000 independent samples of size n = 500, and independent test samples of size 1000 are used for estimating the excess risks ℓ(s, bf ag T ), ℓ(s, bfcv T ) and the oracle excess risk infλ Λ ℓ(s, Aλ(Dn)). The risks (and excess risks) are evaluated using the L1 loss g(u, y) = |u y|. Expectations of these quantities are estimated by taking an average over the 1000 samples; we also compute standard deviations for these estimates, which are not displayed, since they are suﬃciently small to ensure that visible gaps on the graph are statistically signiﬁcant. Agghoo and CV are applied to (Aλ)λ Λ over the grid Λ = { 2j 1

500nt : 0 j 17}, corresponding to the grid {500

2j : 0 j 17} over the cost parameter C = 1 2λnt of the R implementation svm from package e1071. In a second step, V -fold agghoo and Monte-Carlo agghoo are compared with other averaging cross-validation procedures:

ACV (Jung and Hu, 2015),

EKCV (Jung, 2016),

bagged K-FCV, that is, bagging applied to the K-fold CV predictor bf cv T given by Deﬁnition 5, with T = {{1, . . . , n}\Ji : 1 i K} for some partition (Ji)1 i K of {1, . . . , n} into K blocks of equal size |Ji| = nv = n/K. Following the terminology detailed in Section 3.3, bagged K-FCV is a speciﬁc instance of bagged CV. Note that bagged K-FCV is not Hall s BCV.

Aggregated Hold-Out

0.0 0.2 0.4 0.6 0.8 1.0

0.00 0.01 0.02 0.03 0.04

Performance in epsilon regression 1

mean excess risk

Agghoo CV selection V = 1 (ho) V = 2 V = 5 V = 10 oracle

Figure 1: Performance of agghoo and CV for ε-regression in setup 1

These methods can be seen as having the same two hyperparameters: the fraction τ of data assigned to the training set in each fold of CV, and the number V of estimators or parameters being aggregated. Using the notation of Jung and Hu (2015) for ACV, we have τ = (K 1)/K and V = K for some integer K 1. Using the notation of Jung (2016) for EKCV, we have τ = K 1

K and V = M. For bagged K-FCV, V is the number of bagging resamples considered, and τ = (K 1)/K (or equivalently, K = 1/(1 τ)). For all methods, we consider the choices τ {0.8, 0.9} and V {5, 10}. For ACV and EKCV, these are the values recommended by Jung and Hu (2015) and Jung (2016), respectively.

5.1.2 Setup 1

Data are generated as follows: (X1, Y1), . . . , (Xn, Yn) are independent, with Xi N(0, π2), Yi = s(Xi) + Zi, with Zi N(0, 1/4) independent from Xi. The regression function is s : x 7 exp[cos(x)], the kernel parameter is h = 1

2 and the threshold for the ε-insensitive loss is ε = 1

Results for agghoo and CV in setup 1 are shown on Figure 1. The performance of agghoo strongly depends on both τ and V . For a ﬁxed τ, increasing V signiﬁcantly decreases the risk of the resulting estimator. This is not surprising and conﬁrms that considering several data splits is always useful. Most of the improvement occurs between V = 1 and V = 5, and taking V much larger seems useless at least for τ 0.5 , a behavior previously observed for CV (Arlot and

Maillard, Arlot and Lerasle

τ V MC agghoo V -fold agghoo ACV EKCV Bagged K-FCV 0.8 5 2.48 0.2 1.95 0.2 1.66 0.2 1.51 0.2 5.46 0.4 0.8 10 1.93 0.2 1.47 0.2 3.54 0.3 0.9 5 3.46 0.3 1.52 0.2 6.03 0.4 0.9 10 2.82 0.2 2.35 0.2 2.07 0.2 1.38 0.2 3.99 0.3

Table 1: Setup 1, diﬀerence between the average risk of each method and the average risk of the oracle, multiplied by 103. The uncertainty is obtained using the formula 2 bσ ns , where bσ2 is the empirical variance and ns is the number of simulations. Missing values occur when the given combination of τ and V is not allowed by the method; this is the case with ACV and V -fold agghoo, for which τ = V 1

Lerasle, 2016). For a ﬁxed V , the risk strongly decreases when τ increases from 0.1 to 0.5, decreases slowly over the interval [0.5, 0.8] and seems to rise for τ > 0.8. It seems that τ [0.6, 0.9] yields the best performance, while taking τ close to 0 should clearly be avoided (at least for V 10). Taking V large enough, say V = 10, makes the choice of τ less crucial: a large region of values of τ yields (almost) optimal performance. We do not know whether taking V larger can make the performance of agghoo with τ 0.4 close to the optimum. As a function of τ, the risk of CV behaves quite diﬀerently from agghoo s. The performance does not degrade signiﬁcantly when τ is small. The optimum is located around τ = 0.1, but the risk curve is so ﬂat that there is no perceptible diﬀerence between the values of τ [0.1, 0.4]. In any case, the optimum is much smaller than for agghoo. A possible explanation is that the regressors produced by cross-validation are all trained on the whole sample, so that τ only impacts risk estimation. Furthermore, additional simulations show, as expected, that higher values of τ (τ = 0.8 or τ = 0.9) improve risk estimation while degrading the hyperparameter selection performance. Compared to agghoo, CV s performance depends much less on V : only V = 2 appears to be signiﬁcantly worse than V 5. Let us now compare agghoo and CV. For small values of τ (that is, τ 0.5), agghoo generally performs much worse than CV for all values of V . In the case of the hold-out, this is unsurprising as the hold-out estimator is then trained on a much smaller sample than the CV estimator. Clearly, aggregation does not suﬃciently compensate for this, at least for V 10. On the other hand, for τ [0.6, 0.9], agghoo with V = 10 approximately matches CV s performance. The risks of the two methods are indistinguishable for V = 10, τ = 0.8.

Comparison of agghoo with ACV, EKCV and bagged K-FCV in setup 1. According to the results summarized by Table 1, the best performing method in this experiment is EKCV, followed by ACV. The performance of EKCV does not vary very much over the tested values of V, τ, whereas other methods show stronger variation. Among the two agghoo methods, V -fold appears to perform better than Monte-Carlo for a given value of τ and equal or smaller value of V . Bagged K-FCV performs the worst out of all the methods, for all values of τ and V . Overall, in this simulation, the methods which select a single regressor from the collection (Aλ(Dn))λ R+ (CV, ACV and EKCV) generally perform better than the methods which aggregate them (agghoo and bagged K-FCV). This could be due to the fact that the regression function exp[cos(x)] of setup 1 is very smooth (analytic) and bounded. Combined

Aggregated Hold-Out

0.0 0.2 0.4 0.6 0.8 1.0

0.04 0.05 0.06 0.07 0.08

Performance in epsilon regression 2

mean excess risk

Agghoo CV selection V = 1 (ho) V = 2 V = 5 V = 10 oracle

Figure 2: Performance of agghoo and CV for ε-regression in setup 2

with a one-dimensional variable X and Gaussian noise, this yields an easy non-parametric regression problem. As a result, the collection (Aλ(Dn))λ R+ may already have very good approximation properties and improving on it with aggregation may be diﬃcult. Estimator aggregation could prove more useful in harder problems where s is less smooth and the dimension is higher. In order to assess this hypothesis, we carry out a second experiment.

5.1.3 Setup 2

Data are generated as follows: (X1, Y1), . . . , (Xn, Yn) are independent, with Xi R2 distributed as Cauchy(0, 1) 2, Yi = s(Xi) + Zi, with Zi N(0, 1/4) independent from Xi. The regression function is deﬁned almost everywhere by

s(x1, x2) = 2 sin(x1x2)

x2 1 + x2 2 ,

the kernel parameter is h = 1

2 and the threshold for the ε-insensitive loss is ε = 1

4. This regression function is less regular than in the previous setup, since it has a discontinuity at (0, 0) R2.

Results for agghoo and CV in setup 2 are shown on Figure 2. The qualitative conclusions about the behaviour of agghoo and CV, taken separately, are mostly the same as in setup 1, with the exception that CV now shows the expected increase in risk for the smallest values of τ.

Maillard, Arlot and Lerasle

τ V MC agghoo V -fold agghoo ACV EKCV Bagged K-FCV 0.8 5 0.94 0.3 0.32 0.2 2.1 0.2 1.97 0.2 6.97 0.6 0.8 10 0.08 0.2 3.66 0.4 3.64 0.5 0.9 5 2.39 0.4 1.96 0.2 8.81 0.6 0.9 10 1.07 0.3 0.72 0.3 3.18 0.4 3.65 0.4 4.65 0.5

Table 2: Setup 2, diﬀerence between the average risk of each method and the average risk of the oracle, multiplied by 103. The uncertainty is obtained using the formula 2 bσ ns , where bσ2 is the empirical variance and ns is the number of simulations.

The main diﬀerence with setup 1 is that agghoo performs much better relative to CV and the oracle. For V = 10 and τ [0.4, 0.9], agghoo outperforms CV by a signiﬁcant margin; for V = 10 and τ [0.6, 0.8], agghoo even matches the oracle s performance, up to statistical uncertainty. Part of the explanation is that, on a given data set, agghoo can perform better than the oracle using aggregation whereas CV, as a parameter selection method, naturally cannot. Indeed, for a randomly drawn data set in setup 2, this situation can be observed to occur quite regularly. Overall, if the computational cost of V = 10 data splits is not prohibitive, agghoo with optimized parameters (V = 10, τ [0.6, 0.8]) clearly improves over CV with optimized parameters (V = 10, τ [0.5, 0.7]). The same holds with V = 5.

Comparison of agghoo with ACV, EKCV and bagged K-FCV in setup 2. According to the results summarized by Table 2, aggregated hold-out is clearly the best performing method in setup 2, by a large margin. This shows the potential advantage of aggregating estimators (agghoo) rather than parameters (ACV, EKCV) when s is non-smooth. Overall, bagged K-FCV performs the worst, except for (τ, V ) = (0.8, 10) where it is tied with EKCV. Its poor performance relative to agghoo can be explained by several factors: ﬁrst, K-fold CV is more stable than the hold-out, which leads to a less diverse ensemble for aggregation. Secondly, bagging (sampling with replacement) breaks the independence between training and test samples, potentially leading to overﬁtting. Among the two agghoo methods, V - fold seems to outperform Monte-Carlo whenever both are deﬁned, though the diﬀerence is not signiﬁcant. However, the overall best performance is attained at (τ, V ) = (0.8, 10), a combination which is not achievable using a V -fold subsampling scheme.

5.1.4 Computational Complexity

By Equation (3), regularized kernel regressors can be represented linearly by vectors of length nt, therefore the aggregation step can be performed at training time by averaging these vectors. The complexity of this aggregation is at most O(V nt). In general, this is negligible relative to the cost of computing the hold-out, as simply computing the kernel matrix requires nt(nt + 1)/2 kernel evaluations. Therefore, the aggregation step does not aﬀect much the computational complexity of agghoo, so the conclusion of Section 3.4 that agghoo and CV have similar complexity applies in the present setting. The same holds for ACV and EKCV which rely on V -fold type splits. In contrast, bagged K-FCV has a higher

Aggregated Hold-Out

complexity, as one must carry out 1/(1 τ)-fold CV within each bagging sample. As a result, the hold-out must be computed V/(1 τ) times. Evaluating agghoo, CV, ACV, EKCV and bagged K-FCV on new data x X also takes the same time in general, as all can be computed by evaluating Pnt j=1 θj K(Xj, x) with a

pre-computed value of θ. A potential diﬀerence occurs when the bθλ given by Deﬁnition 9, Equation (3) are sparse: aggregation increases the number of non-zero coeﬃcients, so evaluating bf ag T on new data can be slower than evaluating bf cv T (or ACV and EKCV) if the implementation is designed to take advantage of sparsity.

5.2 k-Nearest-Neighbors Classiﬁcation

We consider the collection (ANN k )k 1, k odd of nearest-neighbors classiﬁers assuming k is odd to avoid ties on the following binary classiﬁcation problem.

5.2.1 Experimental Setup

Data (X1, Y1), . . . , (Xn, Yn) are independent, with Xi uniformly distributed over X = [0, 1]2

P(Yi = 1 | Xi) = σ h(Xi) b

where u, v R, σ(u) = 1 1 + e u and h (u, v) = exp (u2 + v)3 + u2 + v2 ,

b = 1.18 and λ = 0.05. The Bayes classiﬁer is s : x 7 Ih(x) b and the Bayes risk, computed numerically using the scipy.integrate python library, is approximately equal to 0.242. Majhoo (the classiﬁcation version of agghoo, see Deﬁnition 7) and CV are used with the collection (ANN k )k 1, k odd and Monte-Carlo training sets as in Section 5.1. An experimental procedure similar to the one of Section 5.1 is used to evaluate the performance of agghoo and to compare it with Monte-Carlo cross-validation. Standard deviations of the excess risk were computed; they are smaller than 3.6% of the estimated value.

5.2.2 Results

As shown by Figure 3, the results are similar to the regression case (see Section 5.1), with a few diﬀerences. First, agghoo does not perform better than the oracle. In fact, all methods considered here remain far from the oracle, which has an excess risk around 0.0034 0.0004; both agghoo and CV have excess risks at least 4 times larger. Second, risk curves as a function of τ for agghoo are almost U-shaped, with a signiﬁcant rise of the risk for τ > 0.6. Therefore, less data is needed for training, compared to Section 5.1. The optimal value of τ here is 0.6, at least for some values of V , up to statistical error. Third, the performance of CV as a function of τ has a similar U-shape, which makes the comparison between agghoo and CV easier. For a given τ, agghoo performs signiﬁcantly better if V 10, while CV performs signiﬁcantly better if V = 2; the diﬀerence is mild for V = 5.

5.2.3 Alternative Methods

Neither ACV nor EKCV can be applied to k-nearest neighbors, as there are no models and the parameter k of k-NN cannot be averaged, as it is an integer. Bagged K-FCV can be

Maillard, Arlot and Lerasle

0.0 0.2 0.4 0.6 0.8 1.0

0.010 0.015 0.020 0.025 0.030 0.035 0.040

excess risk

Agghoo CV selection V = 1 (ho) V = 2 V = 5 V = 10 V = 20

Figure 3: Classiﬁcation performance of majhoo and CV for the k-NN family

used here, however it performs very poorly: using V bagging samples and K-fold CV with (V, K) {5, 10}2 yields excess risks close to 0.072, again with negligible error (estimated standard deviation 8 10 4), hence two to ﬁve times worse than agghoo. Further investigations reveal that within a bagging sample, CV often chooses k = 1, so that bagged K-FCV practically reduces to bagged 1-NN. A plausible explanation for this is that sampling with replacement leads to the same data point appearing in both training and test sets; 1-NN perfectly classiﬁes these repeated samples, which artiﬁcially improves its CV score.

5.2.4 Computational Complexity

As explained in Section 3.4, the complexity of computing the optimal parameters for CV (bkcv T ) is the same as for computing (bk ho T )T T for majhoo. Here, there is no simple way to represent the aggregated estimator, so aggregation may have to be performed at test time. In that case, the complexity of evaluating majhoo on new data is roughly V times greater than for CV, as explained in Section 3.4 for agghoo.

6. Discussion

Theoretical and numerical results of the paper show that agghoo can be used safely in RKHS regression, at least when its parameters are properly chosen; V 10 and τ = 0.8 seem to be safe choices. A variant, majhoo, can be used in supervised classiﬁcation with the 0 1 loss, with a general guarantee on its performance (Theorem 13). Experiments show that

Aggregated Hold-Out

agghoo/majhoo actually performs much better than what the upper bounds of Section 4 suggest. In one simulation setup, it roughly matches CV s performance for well chosen V, τ. In two others setups, it outperforms CV by a signiﬁcant margin, as long as V 5 splits are used. Proving theoretically that agghoo can improve over CV is an open problem that deserves future works, solved in a speciﬁc setting during the revision of this article (Maillard, 2020b, Chapters 5 6). Since agghoo and CV have the same training computational cost for any ﬁxed (V, τ), agghoo with properly chosen parameters V, τ is competitive with CV in practice, unless aggregation is undesirable for some other reason, such as interpretability of the predictors, or computational complexity at test time.

Our results can be extended in several ways. First, our theoretical bounds directly apply to subagging hold-out, which also averages several hold-out selected estimators. As explained in Section 3.3, the diﬀerence with agghoo is that, in subagging, the training set size is n p q and the validation set size is q, for some q {1, . . . , n p 1}, leading to slightly worse bounds than those we obtained for agghoo (at least if E [ℓ(s, Am(Dn))] decreases with n). The diﬀerence should not be large in practice, if q is well chosen. Oracle inequalities can also be obtained for agghoo in other settings, as a consequence of our general Theorems 16 and 17 in Appendix A. Since the ﬁrst version of this paper appeared as a preprint, such results have been obtained in two settings. Maillard (2020a) applies Theorem 17 to sparse linear regression with the Huber loss function. Maillard (2020b, Chapter 6) applies Theorem 17 to the collection of empirical Fourier projections in leastsquares density estimation, as a preliminary step to a deeper study of agghoo in that setting.

Acknowledgements

While ﬁnishing the writing of this article, the ﬁrst author (Guillaume Maillard) has received funding from the European Union s Horizon 2020 research and innovation programme under grant agreement No 811017.

Appendix A. General Theorems

We need the following hypothesis, deﬁned for two functions wi : R+ R+, i {1, 2} and a family (tm)m M SM.

Hypothesis H(w1, w2, (tm)m M): w1 and w2 are nondecreasing, and for any (m, m ) M2, some cm m R exists such that, for all k 2,

P γ(tm) γ(tm ) cm m k k! h w1 p

ℓ(s, tm) + w1( p

ℓ(s, tm )) i2

ℓ(s, tm) + w2 p

ℓ(s, tm ) ik 2 .

This hypothesis is similar to those used by Massart (2007) to study the hold-out and empirical risk minimizers. However, unlike Massart (2007), we intend to go beyond the setting of bounded risks. We also need the following deﬁnition.

Maillard, Arlot and Lerasle

Deﬁnition 14 Let w : R+ R+ and r R+. Let

δ(w, r) = inf δ 0 : x δ, w(x) rx2 ,

with the convention inf = + .

Remark 15 If r > 0 and x 7 w(x)

x is nonincreasing, then δ(w, r) is the unique solution to the equation w(x)

r 7 δ(w, r) is nonincreasing.

If w(x) = cxβ for c > 0 and β [0, 2), then δ(w, r) = c

A.1 Theorem Statements

We can now state two general theorems from which we deduce all the theoretical results of the paper. The ﬁrst theorem is a general oracle inequality for the hold-out.

Theorem 16 Let (tm)m M be a ﬁnite collection in S, and

bm argmin m M Pnvγ(tm, ) .

Assume that H(w1, w2, (tm)m M) holds true. Let x > 0. Then, with probability larger than 1 e x, for any θ (0, 1], we have

(1 θ) ℓ(s, t bm) (1 + θ) min m M ℓ(s, tm) +

r nv x + log|M|

2 δ2 w2, θ2

4 nv x + log|M|

If in addition, the two functions x 7 wj(x)

x , j = 1, 2, are nonincreasing, then for any x > 0, with probability larger than 1 e x, for all θ (0, 1], we have

(1 θ)ℓ(s, t bm) (1 + θ) min m M ℓ(s, tm) + δ2(w1, nv) θ + 2(x + log|M|)

+ δ2(w2, nv) θ + (x + log|M|)2

Using Theorem 16, we prove the following general oracle inequality for agghoo.

Theorem 17 Assume that the hyperparameter space S is convex and that the risk L is convex. Let (Am)m M be a ﬁnite collection of learning rules of size |M| 3. Let bf ag T be an agghoo estimator, according to Deﬁnition 6, with T satisfying assumption (2). Assume that bw1,1, bw1,2 are Dnt-measurable random functions such that almost surely, hypothesis H bw1,1, bw1,2, (Am(Dnt))m M holds true. Assume also that for i {1, 2}, x 7 bw1,i(x)

x is nonincreasing. Then for any θ (0, 1],

(1 θ)E h ℓ s, bf ag T i (1 + θ)E min m M ℓ s, Am(Dnt) + R1(θ) (9)

Aggregated Hold-Out

where R1(θ) = R1,1(θ) + R1,2(θ) with

θ + 2 1 + log|M|

E h δ2 bw1,1, nv i ,

θ + 2 1 + log|M| + log2|M| θ

E h δ2 bw1,2, nv i .

For any Dnt-measurable functions bw2,1 and bw2,2 such that H( bw2,1, bw2,2, (Am(Dnt))m M) holds true almost surely, and any x > 0, θ (0, 1], we have

(1 θ)E h ℓ s, bf ag T i (1 + θ)E min m M ℓ s, Am(Dnt) + R2(θ) (10)

where R2(θ) = R2,1(θ) + R2,2(θ) + R2,3(θ) + R2,4(θ) with

2θE δ2 bw2,1, θ

r nv x + log|M|

R2,2(θ) = θ2

2 E δ2 bw2,2, θ2

4 nv x + log|M|

R2,3(θ) = e x θ + 2(1 + x + log|M|)

E h δ2 bw1,1, nv i ,

and R2,4(θ) = e x θ + 2(1 + x + log|M|) + (x + log|M|)2

E h δ2 bw1,2, nv i .

A.2 Proof of Theorem 16

We start by proving three lemmas.

Lemma 18 Let w be a nondecreasing function on R+. Let r > 0. Then

u 0 , w(u) r u2 δ2(w, r) ,

where δ(w, r) is given by Deﬁnition 14.

Proof If u > δ(w, r), by Deﬁnition 14,

If u δ(w, r), since w is nondecreasing, for all v > δ(w, r),

w(u) w(v) rv2.

By taking the inﬁmum over v, we recover w(u) rδ(w, r)2.

Lemma 19 Let w be a nondecreasing function such that x 7 w(x)

x is nonincreasing over (0, + ). Let a R+ and b (0, + ). For any θ (0, 1] and u 0,

2 u + δ2(w, b) + a2δ2(w, b)

Maillard, Arlot and Lerasle

Proof Since w is nondecreasing,

u + δ2(w, b)

u + δ2(w, b)w p

u + δ2(w, b)

u + δ2(w, b) .

x is nonincreasing and δ(w, b) > 0,

u + δ2(w, b)w δ(w, b)

u + δ2(w, b) b δ(w, b) by Deﬁnition 14.

Therefore, using the inequality xy θ

2θ, valid for any x > 0, y > 0,

a2 u + δ(w, b)2 δ(w, b)2 θ

2 u + δ(w, b)2 + a2δ(w, b)2

Lemma 20 Let nv N . Let M be a ﬁnite set and let (tm)m M SM. Assume that there exists p [0, 1/|M|) and a function R : (0, 1] R+ such that for any m, m in M, with probability greater than 1 p,

θ (0, 1] , (Pnv P) γ(tm, ) γ(tm , ) θℓ(s, tm) + θℓ(s, tm ) + R(θ) .

Then, with probability greater than 1 |M|p, for any bm argminm M Pnvγ(tm, ),

θ (0, 1] , (1 θ)ℓ(s, t bm) (1 + θ) min m M ℓ(s, tm) + R(θ) .

Proof Let m argminm M Pγ(tm, ). Then for any m M, with probability greater than 1 p,

θ (0, 1] , (Pnv P) γ(tm , ) γ(tm, ) θℓ(s, tm ) + θℓ(s, tm) + R(θ) .

So by the union bound, with probability greater than 1 |M|p,

θ (0, 1] , m M , (Pnv P) γ(tm , ) γ(tm, ) θℓ(s, tm ) + θℓ(s, tm) + R(θ) .

On that event, for all θ (0, 1],

Pγ(t bm, ) = Pnvγ(t bm, ) + (P Pnv)γ(t bm, )

Pnvγ(tm , ) + (P Pnv)γ(t bm, )

= Pγ(tm , ) + (P Pnv) γ(t bm, ) γ(tm , )

Pγ(tm , ) + θℓ(s, tm ) + θℓ(s, t bm) + R(θ) .

Aggregated Hold-Out

Subtracting the Bayes risk Pγ(s, ) on both sides, we get with probability greater than 1 |M|p, for all θ (0, 1],

ℓ(s, t bm) ℓ(s, tm ) + θℓ(s, tm ) + θℓ(s, t bm) + R(θ) ,

that is, (1 θ)ℓ(s, t bm) (1 + θ) min m M ℓ(s, tm) + R(θ) .

We now prove Theorem 16. Let (m, m ) M2 be ﬁxed. Let

ℓ(s, tm) + w1 p

ℓ(s, tm ) ,

and c := w2 p

ℓ(s, tm) + w2 p

ℓ(s, tm ) .

By hypothesis H w1, w2, (tm)m M ,

cm m R such that k 2 , P γ(tm, ) γ(tm , ) cm m k k! σ2ck 2 .

For all y > 0, let Ωy(m, m ) be the event on which

(Pnv P) γ(tm, ) γ(tm , ) r

2y nv σ + cy

By Bernstein s inequality (Boucheron et al., 2013, Theorem 2.10),

P Ωy(m, m ) 1 e y .

nv x+log|M|. By Lemma 18 with r = q,

ℓ(s, tm) + w1 p

ℓ(s, tm ) q ℓ(s, tm) δ2(w1, q) + ℓ(s, tm ) δ2(w1, q) .

Set y = x + log|M| in Equation (11). Then,

2(x + log|M|)

2(x + log|M|)

r nv x + log|M| ℓ(s, tm) δ2(w1, q) + ℓ(s, tm ) δ2(w1, q)

ℓ(s, tm) + ℓ(s, tm ) + 2δ2 w1, θ

r nv x + log|M|

As for the second term of Equation (11), by Lemma 18 with r = q2, we have

ℓ(s, tm)) + w2( p

ℓ(s, tm )) q2 ℓ(s, tm) δ2(w2, q2) + ℓ(s, tm ) δ2(w2, q2) .

Maillard, Arlot and Lerasle

Recall that q is shorthand for θ

nv x+log|M|. Therefore,

nv x + log|M|

4 nv x + log|M| ℓ(s, tm) δ2(w2, q2) + ℓ(s, tm ) δ2(w2, q2)

4 ℓ(s, tm) δ2(w2, q2) + ℓ(s, tm ) δ2(w2, q2)

ℓ(s, tm) + ℓ(s, tm ) + 2δ2 w2, θ2

4 nv x + log|M|

4 1 and θ (0, 1], plugging Equations (12) and (13) in Equation (11) yields, on the event Ωx+log|M|(m, m ), for all θ (0, 1],

(Pnv P) γ(tm, ) γ(tm , ) θ ℓ(s, tm) + ℓ(s, tm ) +

r nv x + log|M|

2 δ2 w2, θ2

4 nv x + log|M|

Suppose now that x 7 wj(x)

x is nonincreasing for j {1, 2}. Let θ (0, 1]. Let y 0. By Lemma 19 with a = 2y and b = nv, r

2y nv σ = r

ℓ(s, tm) + w1 p

ℓ(s, tm ) i

2ℓ(s, tm) + θ

2ℓ(s, tm ) + δ2(w1, nv) θ + 2y

By Lemma 19 with a = y and b = nv,

ℓ(s, tm) + w2 p

ℓ(s, tm ) i

2ℓ(s, tm) + θ

2ℓ(s, tm ) + δ2(w2, nv) θ + y2

Plugging Equations (15) and (16) in Equation (11) yields, on the event Ωy(m, m ), for all θ (0, 1],

(Pnv P) γ(tm, ) γ(tm , )

θℓ(s, tm) + θℓ(s, tm ) + δ2(w1, nv) θ + 2y

+ δ2(w2, nv) θ + y2

By Equation (14), Lemma 20 applies with p = exp( x)/|M| and

r nv x + log|M|

2 δ2 w2, θ2

4 nv x + log|M|

This yields Equation (7). By Equation (17), Lemma 20 applies with p = e y and

R(θ) = δ2(w1, nv) θ + 2y

+ δ2(w2, nv) θ + y2

Setting y = log|M| + x yields Equation (8).

Aggregated Hold-Out

A.3 Proof of Theorem 17

We start by proving two lemmas.

Lemma 21 Let f L1(R+, e xdx) be a non-negative, nondecreasing function such that lim x + f(x) = + . Let X be a random variable such that

x R+ , P X > f(x) e x .

0 f(x)e xdx .

Proof Let g L1(R+, e xdx) be a nondecreasing, diﬀerentiable function such that g f. Then

0 P(X > t)dt

0 P(X > t)dt + Z +

0 P X > g(x) g (x)dx

0 e xg (x)dx since g f

= g(0) + e xg(x) 0 + Z +

0 e xg(x)dx

0 e xg(x)dx .

It remains to show that g can approximate f in L1(Ix 0e xdx). Let K be a nonnegative smooth function vanishing outside [ 1, 1], normalized such that R K(t)dt = 1. Let ε > 0. Deﬁne

Z f(t)K x + ε t

Z f(x + ε t)K t

By Equation (18), fε is smooth. By Equation (19), fε is nondecreasing, moreover

fε(x) f(x) = 1

Z f(x + ε t) f(x) K t

dt since Z K = 1

f(x + ε t) f(x) K t

dt since K(u) = 0 when |u| 1

0 since f is nondecreasing and K 0 .

Thus fε f. Finally, by Jensen s inequality and Fubini s theorem, Z fε(x) f(x) e xdx 1

Z f(x + ε t) f(x) e xdxdt

Z f(x + τ) f(x) e xdx ,

Maillard, Arlot and Lerasle

which converges to 0 when ε 0 since f L1(R+, e xdx).

We use the following additional notation:

Deﬁnition 22 Let g be the function deﬁned by

(θ, y, p, q) (0, 1] R3 + , g(θ, y, p, q) = θ[p + q] + 1

θ(2yp + y2q) .

This function satisﬁes the following properties.

Lemma 23 Let g be the function given in Deﬁnition 22. For any θ (0, 1] and any real numbers u > 0, p 0, q 0,

u g(θ, y, p, q)e ydy = θ + 2(1 + u)

p + θ + 2 + 2u + u2

Proof Using the formulae Z +

u e xdx = e u , Z +

u xe xdx = (1 + u)e u

u x2e xdx = (2 + 2u + u2)e u ,

u g(θ, y, p, q)e ydy = θ(p + q) + (1 + u)2p

θ + (2 + 2u + u2)q

= θ + 2(1 + u)

p + θ + 2 + 2u + u2

We can now proceed with the proof of Theorem 17. Let θ (0, 1] be ﬁxed. Let ( bf ho T )T T be the individual hold out estimators, so that bf ag T = 1 |T | P T T bf ho T . By convexity of the risk functional L, we have

L( bf ag T ) 1 |T |

T T L( bf ho T ) .

It follows by subtracting L(s) that

ℓ(s, bf ag T ) 1 |T |

T T ℓ(s, bf ho T ) .

Since the data are independent and identically distributed, by assumption (2), all bf ho T have the same distribution. Let T1 = {1, . . . , nt}, so that DT1 n = Dnt. Taking expectations yields

E ℓ(s, bf ag T ) E ℓ(s, bf ho T1 ) . (20)

Since H ( bw1,1, bw1,2, (Am(Dnt))m M) holds, we can apply Theorem 16 conditionally on Dnt, with tm = Am(Dnt).

Aggregated Hold-Out

A.3.1 Proof of Equation (9)

For i {1, 2}, let bδ1,i = δ( bw1,i, nv i). Let g be given by Deﬁnition 22. By Theorem 16, Equation (8), for any z 0, with probability greater than 1 e z,

(1 θ)ℓ(s, bf ho T1 ) (1 + θ) min m M ℓ(s, tm) + g θ, z + log|M|, bδ 2 1,1, bδ 2 1,2 . (21)

As g is nondecreasing in its second variable, Lemma 21 applied to the random variable (1 θ)ℓ(s, bf ho T1 ) yields

(1 θ)E h ℓ(s, bf ho T1 ) DT1 n i (1 + θ) min m M ℓ(s, tm) + Z +

log|M| g θ, y, bδ 2 1,1, bδ 2 1,2 e (y log|M|)dy .

Lemma 23 yields

(1 θ)E h ℓ(s, bf ho T1 ) DT1 n i (1 + θ) min m M ℓ(s, tm) + θ + 2 (1 + log|M|)

+ θ + 2 (1 + log|M|) + log2|M|

Taking expectations with respect to DT1 n = Dnt, we get

(1 θ)E h ℓ(s, bf ho T1 ) i (1 + θ)E min m M ℓ(s, Am(Dnt)) + θ + 2 (1 + log|M|)

E h bδ 2 1,1 i

+ θ + 2 (1 + log|M|) + log2|M|

E h bδ 2 1,2 i .

Equation (9) then follows from Equation (20).

A.3.2 Proof of Equation (10)

Fix x 0. For i {1, 2}, let bδ2,i = δ

bw2,i, θ 2 q

nv x+log|M|

By Theorem 16, Equation (7), with probability larger than 1 e x,

(1 θ)ℓ(s, bf ho T1 ) (1 + θ) min m M ℓ(s, tm) +

2θbδ 2 2,1 + θ2

2 bδ 2 2,2 . (22)

Combining Equations (21) and (22), for any z 0, with probability larger than 1 e z,

(1 θ)ℓ(s, bf ho T1 ) (1 + θ) min m M ℓ(s, tm)

2θbδ 2 2,1 + θ2

2 bδ 2 2,2 + Iz xg θ, z + log|M|, bδ 2 1,1, bδ 2 1,2 .

By Lemma 21,

(1 θ)E h ℓ(s, bf ho T1 ) DT1 n i (1 + θ) min m M ℓ(s, tm) +

2θbδ 2 2,1 + θ2

x+log|M| g θ, y, bδ 2 1,1, bδ 2 1,2 e (y log|M|)dy .

Maillard, Arlot and Lerasle

By Lemma 23, it follows that

(1 θ)E h ℓ(s, bf ho T1 ) DT1 n i (1 + θ) min m M ℓ(s, tm) +

2θbδ 2 2,1 + θ2

+ e x θ + 2(1 + x + log|M|)

+ e x θ + 2(1 + x + log|M|) + (x + log|M|)2

Taking expectations with respect to DT1 n and using inequality (20) yields Equation (10) of Theorem 17.

Appendix B. RKHS Regression: Proof of Theorem 11

In the following, for any g : R R R and t : X R, the function (x, y) 7 g(t(x), y) is denoted by g t.

B.1 Preliminary Results

Remark ﬁrst that the RKHS norm dominates the supremum norm.

Lemma 24 If κ = supx X K(x, x) < + then for any t H,

Proof By deﬁnition of an RKHS, t H, x X, t, K(x, ) H = t(x). It follows that, for any t H,

t 2 = sup x X t(x)2 = sup x X t, K(x, ) 2 H

t 2 H sup x X K(x, ), K(x, )

t 2 H sup x X K(x, x) .

Using standard arguments, the following deviation inequality can be derived.

Proposition 25 Let H denote a RKHS with bounded kernel K : X X R. Let κ = supx X K(x, x) and h : R2 R be Lipschitz in its ﬁrst argument with Lipschitz constant L. For any t H and r > 0, denote

BH(t, r) = t H : t t H r .

Let t0 H. Then for any probability measure P on X R and any y > 0,

sup (t1,t2) BH(t0,r)2(Pn P) h t1 h t2 2(2 + p

Aggregated Hold-Out

Proof Let Dn = (Xi, Yi)1 i n be a data set drawn from P. Let (σi)1 i n be independent Rademacher variables independent from Dn. Denote by

i=1 σif(Xi)

the Rademacher complexity of a class F of real valued functions. By Lemma 24, for any (t1, t2) BH(t0, r)2,

h t1 h t2 L t1 t2 L [ t1 t0 + t2 t0 ] 2L κr .

By symmetry under exchange of t1 and t2, notice that

Rn {h t1 h t2 : (t1, t2) BH(t0, r)2} = sup (t1,t2) BH(t0,r)2 1 n

i=1 σi(h t1 h t2)(Xi)

By the bounded diﬀerence inequality and Boucheron et al. (2005, Theorem 3.2), it follows that for any y > 0, with probability greater than 1 e y,

sup (t1,t2) BH(t0,r)2(Pn P)(h t1 h t2)

2Rn {h t1 h t2 : (t1, t2) BH(t0, r)2} + 2Lr

Rn {h t1 h t2 : (t1, t2) BH(t0, r)2}

Rn h t : t BH(t0, r) + Rn h t : t BH(t0, r)

2LRn BH(t0, r) by the contraction lemma

as formulated by Meir and Zhang (2003, Theorem 7),

= 2LRn BH(0, r) by translation invariance.

Finally, by a classical computation (see for example Boucheron et al., 2005, Section 4.1.2),

Rn h t1 h t2 : (t1, t2) BH(t0, r)2

i=1 K(Xi, Xi)

The proof of Theorem 11 also uses the following peeling lemma.

Maillard, Arlot and Lerasle

Lemma 26 Let (Zu)u T be a stochastic process and d : T R+ be a function. Let a 0 and b (0, 2] and assume that

sup u T:d(u) r Zu r1 + p

Then, for any θ (0, + ),

u T , Zu θd2(u) + 2 + b 1.1 + 2(a + y)

Proof Let x > 0. Let η (1, 2], jm N and y0 R be numerical constants that will be determined later. Then,

Zu d2(u) + x2 1 + p

b(a + y) x n

sup u T:d(u) x

Zu d2(u) + x2 1 + p

b(a + y) x n

sup u T:ηjx d(u) ηj+1x

Zu d2(u) + x2 1 + p

b(a + y) x n

sup u T:d(u) x

b(a + y) x n

sup u T:ηjx d(u) ηj+1x

Zu (1 + η2j)x2 1 + p

b(a + y) x n

sup u T:d(u) x Zu x 1 + p

sup u T:d(u) ηj+1x Zu (1 + η2j)x 1 + p

Notice that

(1 + η2j)x 1 + p

n = xηj+1 η2j + 1

= xηj+1 1 + p

b(a + zj) n ,

ηj+1 1 + η2j + 1

b(a + y) 2 a

ηj+1 1 2 + η2j + 1

2 y since a 0 and η2j + 1 ηj+1 .

Aggregated Hold-Out

Taking expectations in Equation (24) and using hypothesis (23), we obtain

Zu d2(u) + x2 1 + p

b(a + y) x n

So for any y y0 ,

Zu d2(u) + x2 1 + p

b(a + y) x n

e y + e y + X

ηj+1 1 2 (η2j + 1)2

(ηj+1)2 1 y

e y + e y + X

ηj+1 1 2 (η2j + 1)2

(ηj+1)2 1 y0

Now, we have

ηj+1 1 2 (η2j + 1)2

(ηj+1)2 1 y0

exp (η2j + 1)2

(ηj+1)2 1 y0

exp y0 η2(j 1)y0 . (26)

Let v denote the sequence vj = exp y0 η2(j 1)y0 . Then for j jm,

log vj+1 log vj = η2(j 1)y0 η2jy0

= y0(1 η2)η2(j 1)

y0(1 η2)η2(jm 1) since η > 1 .

Thus, j jm, vj+1 vj exp y0(η2 1)η2(jm 1) .

Therefore, we have

j 0, vj+jm vjm exp jy0(η2 1)η2(jm 1))

j=jm vj vjm h 1 exp y0(η2 1)η2(jm 1) i 1 .

It follows from Equations (25) and (26) that for any y y0, since b 2,

Zu d2(u) + x2 1 + p

b(a + y) x n

ηj+1 1 2 (η2j + 1)2

(ηj+1)2 1 y0

+ exp y0 η2(jm 1)y0

1 exp y0(η2 1)η2(jm 1) .

Maillard, Arlot and Lerasle

On the other hand, when y y0,

Zu d2(u) + x2 1 + p

b(a + y) x n

Taking η = 1.18, jm = 10, y0 = 0.52, the right-hand side of Equation (27) evaluates to 1.6765 < 1.7 whereas ey0 1.683 < 1.7. It follows that for all y > 0,

Zu d2(u) + x2 1 + p

b(a + y) x n

1.7e y . (28)

Now take x = 1+

b(a+y) θ n with θ > 0. We can rewrite

Zu d2(u) + x2 1 + p

b(a + y) x n

= P u T , Zu d2(u) + x2 θ

= P u T , Zu θd2(u) + 1

b(a + y) i2

P u T , Zu θd2(u) + 2 + 2b(a + y)

It follows from Equation (28), with y replaced by y + 0.55, that

u T, Zu θd2(u) + 2 + b 1.1 + 2(a + y)

1.7e 0.55e y e y .

We need two other technical lemmas in the proof of Theorem 11.

Lemma 27 For any nonnegative, continuous convex function h over a Hilbert space H, and any λ R+, the elements of the regularization path,

tλ = argmin t H

n h(t) + λ t 2 H o ,

satisfy, for any (λ, µ) R2 such that 0 < λ µ,

tλ tµ 2 H tλ 2 H tµ 2 H .

Proof By Barbu and Precupanu (2012, Theorem 2.11), tλ exists for any λ R+. Moreover, it is unique by strong convexity of 2 H. For a closed convex set C H, let ΠC denote the orthogonal projection onto C.

Aggregated Hold-Out

Let µ > 0. The set {t : h(t) h(tµ)} is closed by continuity of h and convex by convexity of h. Moreover, for any t H such that h(t) h(tµ),

µ tµ 2 H h(tµ) h(t) + µ tµ 2 H µ t 2 H by deﬁnition of tµ .

Therefore, tµ = Π{t:h(t) h(tµ)}(0). Let λ (0, µ). By deﬁnition of tλ, tµ,

µ + tµ 2 H h(tλ)

λ + tλ 2 H + 1

λ + tµ 2 H + 1

which implies (µ 1 λ 1)h(tµ) (µ 1 λ 1)h(tλ) and thus h(tλ) h(tµ) since λ < µ. For a projection ΠC, it is well known that

t H , t C , t ΠC(t), ΠC(t) t H 0 .

Choosing C = {t : h(t) h(tµ)}, t = tλ C, t = 0 yields tµ, tµ tλ H 0. Therefore

tλ 2 H = tµ + (tλ tµ) 2 H = tµ 2 H + tλ tµ 2 H + 2 tµ, tλ tµ H

tµ 2 H + tλ tµ 2 H .

Lemma 28 Let (b, c) R2 + and lb,c(x) = bx + c. Let δ be given by Deﬁnition 14. For any r R+,

δ2(lb,c, r) b2

For (a, b, c) R3 +, let ga,b,c(x) = ax (bx3 + cx2) 1 2 . For any r R+,

δ2 ga,b,c, r a2

Proof Since x 7 lb,c(x)

x is nonincreasing, we have by Remark 15

bδ(lb,c, r) + c = rδ2(lb,c, r) ,

that is, δ2(lb,c, r) bδ(lb,c, r)

Maillard, Arlot and Lerasle

hence δ(lb,c, r) = b 2r + 1

r . Thus, we have

δ2(lb,c, r) 2 b2

This proves Equation (29). For any x > 0, ga,b,c(x) rx2 is equivalent to

ax rx2 (31)

and bx3 + cx2 r2x4 . (32)

Equation (31) is equivalent to x a

r . On the other hand, for every

we have x > δ(lb,c, r2) by Equation (29), hence bx + c r2x2 by Deﬁnition 14, so that Equation (32) holds true. Therefore, whenever

it holds that ga,b,c(x) rx2. Equation (30) follows by Deﬁnition 14.

B.2 Uniform Control on the Empirical Process

From now on, until the end of the proof, the notation and hypotheses of Theorem 11 are used. Recall also the notation g t : (x, y) 7 g(t(x), y), for any g : R R R and t : X R. Fix a training set Dnt. Start with the following deﬁnition.

Deﬁnition 29 For t1, t2 H, let

d(t1, t2) = min λ Λ t1 sλ H + t1 t2 H , (33)

where sλ = argmint H n P(c t) + λ t 2 H o . Furthermore, let

32κL2 sup (t1,t2) H2

(Pnt P)(c t1 c t2) λm

2 d(t1, t2)2 ,

(t1, t2) H2 , (Pnt P)(c t1 c t2) λm

2 d(t1, t2)2 + 32κL2by

λmnt . (34)

We then have the following bounds on by.

Aggregated Hold-Out

Claim 30 For all x 0,

P by 2.6 + log|Λ| + x e x .

In particular, E[by] 4 + log|Λ|.

Proof Let t1, t2 H be such that d(t1, t2) r. Let λ Λ be such that

t1 sλ H + t1 t2 H r .

By the triangle inequality, t1, t2 B(sλ, r), hence

sup (t1,t2):d(t1,t2) r

(Pnt P)(c t1 c t2) max λ Λ sup (t1,t2) B(sλ,r)2(Pnt P)(c t1 c t2) . (35)

From Proposition 25 and the union bound, it follows that, for any x 0,

max λ Λ sup (t1,t2) B(sλ,r)2

(Pnt P)(c t1 c t2) 2 2 + p

2(x + log|Λ|) Lr κ nt

It follows by Equation (35) that, for all x 0,

sup (t1,t2):d(t1,t2) r

(Pnt P)(c t1 c t2)

By Lemma 26 with θ = λm 8L κ, a = log|Λ|, b = 1

2, with probability larger than 1 e x,

(t1, t2) H2 , (Pnt P)(c t1 c t2) λm

2 d(t1, t2)2 + 32L2 κ(2.6 + x + log|Λ|)

On the same event, by 2.6 + x + log|Λ| by Deﬁnition 29. Therefore, by Lemma 21, E[by] 3.6 + log|Λ|.

Deﬁnition 29 and Claim 30 together imply a uniform control on the empirical process thanks to the drift term λmd(t1, t2)2, whereas Proposition 25 only gives a bound on an RKHS ball of ﬁxed radius.

B.3 Verifying the Assumptions of Theorem 17

Theorem 11 is a consequence of Theorem 17. For all λ Λ, let btλ = Aλ(Dnt), where Aλ is given by Deﬁnition 9. To verify the assumptions of Theorem 17, adequate functions ( bwi,j)(i,j) {1,2}2 must be found such that for i {1, 2}, H bwi,1, bwi,2, (btλ)λ Λ holds almost surely. This is the purpose of the present subsection. The core of the proof of Theorem 11 lies in the following deterministic claim.

Claim 31 For all λ, µ Λ such that λ µ, we have

btλ btµ 2 κC

λm ℓ(s, btµ) + 96L2 κ2by

Maillard, Arlot and Lerasle

Proof Let (λ, µ) Λ2 with λ µ. Let sµ be as in Deﬁnition 29. By convexity of c, the function t 7 P(c t) + µ t 2 H is µ-strongly convex. Since sµ is its optimum, we get

t H , P(c t) + µ t 2 H P(c sµ) + µ sµ 2 H + µ t sµ 2 H .

Hence, taking t = btµ,

λm btµ sµ 2 H µ btµ sµ 2 H P(c btµ) + µ btµ 2 H P(c sµ) µ sµ 2 H

= Pnt(c btµ) + µ btµ 2 H Pnt(c sµ) µ sµ 2 H + (P Pnt)(c btµ c sµ) .

By Deﬁnition 9, Pnt(c btµ) + µ btµ 2 H Pnt(c sµ) + µ sµ 2 H .

Hence λm btµ sµ 2 H (P Pnt)(c btµ c sµ) = (Pnt P)(c sµ c btµ). Now take t1 = sµ and t2 = btµ in Equation (34) of Deﬁnition 29 to get

λm btµ sµ 2 H λm

2 d(sµ, btµ)2 + 32L2 κby

sµ btµ 2 H + 32L2 κby

Therefore, btµ sµ 2 H 64L2 byκ

λ2mnt . (36)

Now btλ btµ 2 H can be bounded as follows. Since t 7 Pnt(c t) + λ t 2 H is λ-strongly convex and btλ is its optimum,

λm btλ btµ 2 H λ btλ btµ 2 H Pnt(c btµ) Pnt(c btλ) + λ btµ 2 H λ btλ 2 H .

By Lemma 27 with h(t) = Pnt(c t), btλ btµ 2 H btλ 2 H btµ 2 H. Hence

(λm + λ) btλ btµ 2 H Pnt(c btµ) Pnt(c btλ)

= P(c btµ) P(c btλ) + (Pnt P) c btµ c btλ

P(c btµ) min t S P c t) + (Pnt P)(c btµ c btλ

Cℓ(s, btµ) + (Pnt P) c btµ c btλ

by hypothesis Comp C(g, c). By Equation (34) with t1 = btµ and t2 = btλ, we have

(λm + λ) btλ btµ 2 H Cℓ(s, btµ) + λm

2 btµ sµ H + btλ btµ H 2 + 32L2 κby

Cℓ(s, btµ) + λm

byκ λm nt + btλ btµ H

!2 + 32L2 κby

Aggregated Hold-Out

by Equation (36). For any a, b R, (a + b)2 2a2 + 2b2, hence

(λ + λm) btλ btµ 2 H Cℓ(s, btµ) + λm

λ2mnt + 2 btλ btµ 2 H

This yields

λ btλ btµ 2 H Cℓ(s, btµ) + 96L2 κby

and ﬁnally, since λ λm,

btλ btµ 2 H Cℓ(s, btµ)

λm + 96L2 κby

Now, by Lemma 24,

btλ btµ 2 κ btλ btµ 2 H

λm ℓ(s, btµ) + 96L2 κ2by

This proves Claim 31.

Using hypothesis SCρ,ν Equation (4) , a reﬁned bound can be obtained on

P h g btλ g btµ 2i .

Claim 32 For any (λ, µ) Λ2,

P h g btλ g btµ 2i bw B

ℓ(s, btλ) 2 + bw B

ℓ(s, btµ) 2

bw B(x)2 = max

κC λm x3 + 10νL κ p

by λm nt x2 )

Proof By hypothesis SCρ,ν Equation (4) with u = btλ(X) and v = btµ(X),

E h g btλ g btµ 2(X, Y ) X i ρ ν|btλ(X) btµ(X)| ℓX(btλ(X)) + ℓX(btµ(X))

ρ ν btλ btµ ℓX(btλ(X)) + ℓX(btµ(X)) ,

where ℓX(u) = E[g(u, Y )|X] minv R E[g(v, Y )|X]. Integrating this inequality with respect to X, it follows that

P h g btλ g btµ 2i h ρ ν btλ btµ ih ℓ(s, btλ) + ℓ(s, btµ) i .

Maillard, Arlot and Lerasle

Assume without loss of generality that λ µ. By Claim 31,

P h g btλ g btµ 2i

ℓ(s, btµ) + 10 Lκ p

#! h ℓ(s, btλ) + ℓ(s, btµ) i

ρ h ℓ(s, btλ) + ℓ(s, btµ) i , ν

ℓ(s, btµ)ℓ(s, btλ) + q

ℓ(s, btµ) 3

h ℓ(s, btλ) + ℓ(s, btµ) i#)

Using the inequality ab ap

q with Hölder conjugates p = 3, q = 3

ℓ(s, btµ) ℓ(s, btλ) + q

ℓ(s, btµ) 3 1

ℓ(s, btµ) 3 + 2

3ℓ(s, btλ) 3 2 + q

ℓ(s, btµ) 3

ℓ(s, btλ) 3 + q

ℓ(s, btµ) 3 . (38)

Claim 32 then follows from Equations (37) and (38), using the elementary inequality

a, b, c, d R , (a + b) (c + d) a c + b d .

As g is L-Lipschitz in its ﬁrst argument, it follows from Claim 31 that for all λ, µ Λ such that λ µ, g btλ g btµ L btλ btµ

ℓ(s, btµ) + 10L2 κ p

ℓ(s, btµ) + bw A

ℓ(s, btλ) , (39)

bw A(x) = L

κC λm x + 5L2 κ p

by λm nt . (40)

If follows that for all k 2,

P h g btλ g btµ ki g btλ g btµ k

ℓ(s, btµ) + bw A

ℓ(s, btλ) k .

This proves that hypothesis H bw A, bw A, (btλ)λ Λ , as deﬁned in Appendix A, holds true.

Aggregated Hold-Out

It follows from Claim 32 and Equation (39) that, for all k 2,

P |g btλ g btµ|k g btλ g btµ k 2 P h g(btλ(X), Y ) g(btµ(X), Y ) 2i

ℓ(s, btλ) + bw A q

ℓ(s, btµ) k 2

ℓ(s, btλ) + bw B q

ℓ(s, btµ) 2

which proves that H bw B, bw A, (btλ)λ Λ holds true.

B.4 Conclusion of the Proof

We have proved that H bw B, bw A, (btλ)λ Λ and H bw A, bw A, (btλ)λ Λ hold, where bw B is deﬁned in Claim 32 and bw A in Equation (40). Moreover, x 7 bw A(x)

x is nonincreasing. Therefore, Theorem 17 applies with bw1,1 = bw A, bw1,2 = bw A, bw2,1 = bw B, bw2,2 = bw A, x = log nv and it remains to bound the remainder terms (R2,i)1 i 4 of Equation (10). For each i, we bound R2,i(θ) by a numerical constant times max{T1(θ), T2(θ), T3(θ)}, where

100 log(nv|Λ|)

T2(θ) = (ν L)2 κC log2(nv|Λ|)

T3(θ) = L(ν L)κlog 3 2 (nv|Λ|) θλmnv nt .

Summing up these bounds yields Theorem 11.

B.4.1 Bound on R2,1(θ) =

2θE δ2 bw B, θ

nv log(nv|Λ|)

Recall that bw B(x)2 := max ρx2, ν 4

κC λm x3 + 10νL κ

by λm nt x2 .

By Equation (30) in Lemma 28 with a = ρ, b = ν 4

κC λm and c = 10νL κ

by λm nt , we have

r nv log(nv|Λ|)

4ρlog(nv|Λ|)

θ2nv + 29ν2κC

log(nv|Λ|) 2

θ4λmn2v +80νLκ

log(nv|Λ|) p

by θ2λmnv nt .

2ρlog(nv|Λ|)

log(nv|Λ|) 2

θ3λmn2v + 80

log(nv|Λ|) p

E[by] θλmnv nt .

Maillard, Arlot and Lerasle

By Claim 30, E[by] 4 + log|Λ|. Since nv 100 e4, E[by] log(nv|Λ|). As a result,

R2,1(θ) 6ρlog(nv|Λ|)

θnv + 42ν2κC

log(nv|Λ|) 2

θ3λmn2v + 114νLκ

log(nv|Λ|) 3

θλmnv nt 100T1(θ) + 42T2(θ) + 114T3(θ)

256 max T1(θ), T2(θ), T3(θ) .

B.4.2 Bound on R2,2(θ) = θ2

2 E h δ2 bw A, θ2

4 nv log(nv|Λ|) i

Recall that by Equation (40), bw A(x) = L q

κC λm x + 5L2 κ

By Equation (29) in Lemma 28 with b = L q

κC λm and c = 5L2 κ

by λm nt , we have

δ2 bw A, θ2

4 nv log(nv|Λ|)

16L2κC log2(nv|Λ|)

θ4λmn2v + 40L2κ

log(nv|Λ|) p

by θ2λmnv nt .

As E[by] log(nv|Λ|) by Claim 30, it follows that

R2,2(θ) 8L2κC log2(nv|Λ|)

θ2λmn2v + 20L2κlog 3 2 (nv|Λ|) λmnv nt 8θT2(θ) + 20θT3(θ)

28 max {T1(θ), T2(θ), T3(θ)} since θ (0, 1] .

B.4.3 Bound on R2,3(θ) = 1 nv

θ + 2 1+log(nv|Λ|)

E h bδ 2 bw A, nv i

By Equation (29) in Lemma 28 with b = L q

κC λm and c = 5L2 κ

by λm nt , we have

δ2( bw A, nv) L2 κC

λmnv + L2 10κ p

by λm nvnt .

As θ (0, 1] and nv 100 e 3 2 , we have θ + 2

θ 2 log(nv|Λ|)

θ + 2[1 + log(nv|Λ|)]

θ 4 log(nv|Λ|)

R2,3(θ) 4 log(nv|Λ|)

λmnv + L2 10κ p

E[by] λm nvnt

Aggregated Hold-Out

Since E[by] log(nv|Λ|) by Claim 30,

R2,3(θ) 4 log(nv|Λ|) L2κC

θλmn2v + 40L2κ log 3 2 (nv|Λ|) θλmnv nvnt

log(nv|Λ|)T2(θ) + 40 nv T3(θ)

0.8T2(θ) + 4T3(θ) since nv 100 and |Λ| 2

4.8 max{T1, T2, T3} .

B.4.4 Bound on R2,4(θ) = 1 nv

θ + 2 1+log(nv|Λ|) +log2(nv|Λ|) θ

E h bδ 2 bw A, nv i

By Equation (29) in Lemma 28 with b = L q

κC λm and c = 5L2 κ

by λm nt , we have

δ2( bw A, nv) L2 κC

λmn2v + L2 10κ p

by λmnv nt . (42)

Since θ [0, 1], nv 100 and |Λ| 2, we have log(nv|Λ|) log(200) 5 and

θ + 2 1 + log(nv|Λ|)

θ 4 log(nv|Λ|)

θ by Equation (41)

4 log2(nv|Λ|)

Hence, by Equation (42),

R2,4(θ) 1.8 log2(nv|Λ|)

λmn2v + L2 10κ p

E[by] λmnv nt

Since E[by] log(nv|Λ|),

R2,4(θ) 1.8 log2(nv|Λ|) L2κC

θλmn3v + 18L2κlog 5 2 (nv|Λ|) θλmn2v nt

nv T2(θ) + 18log(nv|Λ|)

Since nv 100 and |Λ| e nv, we have log(nv|Λ|)

nv + log(e nv ) nv log(100)

10 0.15 and so

R2,4(θ) 0.018 T2(θ) + 2.7 T3(θ)

2.8 max T1(θ) , T2(θ) , T3(θ) .

B.4.5 Conclusion

Summing up the above inequalities, we get that for every θ (0, 1],

R2(θ) = R2,1(θ) + R2,2(θ) + R2,3(θ) + R2,4(θ)

292 max T1(θ) , T2(θ) , T3(θ) .

Maillard, Arlot and Lerasle

Equation (10) in Theorem 17 thus yields

(1 θ)E ℓ(s, bf ag T ) (1 + θ)E h min λ Λ ℓ(s, Aλ(Dnt)) i + 292 max T1(θ) , T2(θ) , T3(θ)

which proves Theorem 11 with b1 = 292(ν L)2κC and b2 = 292 L(ν L)κ.

Appendix C. Proof of Proposition 10 and Corollary 12

Let us start by two useful lemmas.

Lemma 33 If ψ is a convex, Lipschitz-continuous, and even function, and if Y is a random variable with a non-atomic distribution, then the function

R : u 7 E ψ(u Y )

is convex and diﬀerentiable with derivative R (u) = E[ψ (u Y )]. Moreover, if Y is symmetric around q, that is, (q Y ) (Y q), then R reaches a minimum at q.

Proof First, remark that R is convex by convexity of ψ. Let u R. For h = 0, let k(h, Y ) = ψ(u+h Y ) ψ(u Y )

h . Let A be the set on which ψ is non-diﬀerentiable. Since ψ is convex, A is at most countable. By deﬁnition, k(h, Y ) h 0 ψ (u Y ) whenever u Y / A,

that is to say Y / u A. Since Y is non-atomic, P(Y / u A) = 1. Moreover, since ψ is Lipschitz, there exists a constant L such that h = 0, |k(h, Y )| L. Therefore, by the dominated convergence theorem,

R(u + h) R(u)

h = E[k(h, Y )] h 0 E[ψ (u Y )] .

Thus, R is diﬀerentiable and for all u R, R (u) = E[ψ (u Y )]. Moreover, we have

R (q) = E[ψ (q Y )]

= E[ψ (Y q)] since ψ ( x) = ψ (x) on R\A

= E[ψ (q Y )] since (Y q) (q Y ) ,

which implies that R (q) = 0. Hence, R reaches a minimum at q since R is convex.

Lemma 34 Let G : R R be a diﬀerentiable convex function that reaches a minimum at u R. If there exists ε, δ > 0 such that

u [u δ , u + δ] , |G (u)| ε|u u | , (43)

then for all (u, v) R2,

εδ|u v| G(u) + G(v) 2G(u ) .

Aggregated Hold-Out

Proof By integrating Equation (43),

u [u δ , u + δ] , G(u) G(u ) ε

2(u u )2 . (44)

Let h(u) = 1

δ G(u + δ) G(u ) (u u ) . (45)

By convexity of G, for any u u + δ, G(u) G(u ) h(u). Hence by Equation (44) with u = u + δ and Equation (45),

u u + δ , G(u) G(u ) 1

δ ε 2δ2(u u ) = εδ

The same argument applies to the convex function x 7 G( x) with minimum u , which yields that

u R such that |u u | δ , G(u) G(u ) εδ

2 |u u | . (46)

Let (u, v) R2. Assume without loss of generality that |u u | |v u |. If |u u | δ then by Equation (44),

(u v)2 2(u u )2 + 2(v u )2

ε G(u) + G(v) 2G(u ) .

Otherwise, by Equation (46),

(u v)2 |u v| |u u | + |v u |

2|u v| |u u |

εδ|u v| G(u) G(u )

εδ|u v| G(u) + G(v) 2G(u ) .

C.1 Proof of Proposition 10

Now, we can prove Proposition 10. Let Rx : u 7 R |u y|d Fx(y). By Lemma 33 with ψ = | |, for all v R,

R x(v) = Z ( Iv y 0 + Iv y 0) d Fx(y)

= Fx(v) 1 Fx(v)

= 2 h Fx(v) Fx s(x) i

Maillard, Arlot and Lerasle

since by deﬁnition, Fx(s(x)) = 1

2. Hence by hypothesis (5),

u s(x) b(x) , s(x) + b(x) , R x(u) 2a(x) u s(x) .

Therefore by Lemma 34 with G = Rx, δ = b(x) and ε = 2a(x), we obtain that for all x X and (u, v) R2,

(u v)2 4 2a(x) 4|u v|

h Rx(u) + Rx(v) 2Rx s(x) i

µm |u v| h Rx(u) + Rx(v) 2Rx s(x) i .

Since g is the function (u, y) 7 |u y|, it follows by taking x = X that

g(u, Y ) g(v, Y ) 2 (u v)2 2

µm |u v| ℓX(u) + ℓX(v) ,

which implies hypothesis SC 2 am , 2 µm .

C.2 Proof of Corollary 12

Corollary 12 is a consequence of Theorem 11. Let us check that its assumptions are satisﬁed.

C.2.1 Compatibility Hypothesis Comp1(ceps 0 , ceps ε )

Fix x X and let px, Fx be respectively the density and the cumulative distribution function of Y given X = x. By assumption, px is symmetric. Recall that the contrast function here is γ(t, (x, y)) = ceps 0 (t(x), y) = |t(x) y|, so any conditional median is a possible value for s(x), and we can take s(x) equal to the center of symmetry. Let

Rε,x : u 7 Z

y ceps ε (u, y)px(y)dy = Z ψε(u y)px(y)dy , (47)

where ψε(z) = (|z| ε)+ for any z R. Lemma 33 applies, since px is symmetric by assumption and ψε is even, convex and 1-Lipschitz. Hence for any ε 0, Rε,x has a minimum at s(x) and is diﬀerentiable, with

R ε,x(u) = Z ψ ε(u y)px(y)dy = Z [ Iu y ε + Iu y ε] px(y)dy

= Fx(u ε) 1 Fx(u + ε) .

Therefore, for any ε 0 and u R,

R ε,x(u) R 0,x(u) = Z ε

px(u t) + px(u + t) dt . (48)

Now, assume that u s(x). By symmetry of px around s(x), for all t 0,

px(u t) = px s(x) + u s(x) t

= px s(x) + |u s(x) t| . (49)

Aggregated Hold-Out

Since px is unimodal, its mode is s(x) and px is nonincreasing on [s(x), + ). It follows from Equation (49) that for all u s(x) and t 0,

px(u t) px s(x) + |u s(x)| + t

= px(u + t) . (50)

Therefore, by Equations (48) and (50), for all u s(x) and ε 0, R ε,x(u) R 0,x(u). By integration, this implies that for all u s(x),

Rε,x(u) Rε,x s(x) R0,x(u) R0,x s(x) . (51)

By Equation (47) and symmetry of px, Rε,x and R0,x are symmetric around s(x), hence inequality (51) is also valid when u s(x). Choosing x = X, u = t(X) and taking an expectation, we get Lceps ε (t) Lceps ε (s) Lceps 0 (t) Lceps 0 (s) which proves Comp1(ceps 0 , ceps ε ).

C.2.2 Hypothesis SC4σ,8

We ﬁrst compute a lower bound on R0,x. Let qx, 1

4 = sup{y : Fx(y) 1

4} and qx, 3

4 = inf{y : Fx(y) 3

4}. By continuity of Fx,

4 and Fx(qx, 3

4. Let σ(x) = qx, 3

4 , which is the smallest determination

of the interquartile range. By symmetry of px around s(x), 1

4 = s(x), therefore

4 = s(x) + σ(x)

2 and qx, 1

4 = s(x) σ(x)

For any u h s(x) σ(x)

2 , s(x) + σ(x)

2 i , by symmetry of px around s(x),

Fx(u) Fx s(x) = Z s(x)+|u s(x)|

s(x) px(v)dv

= |u s(x)| 1 |u s(x)|

Z s(x)+|u s(x)|

s(x) px(v)dv .

Since px is nonincreasing on [s(x), + ) and |u s(x)| σ(x)

Fx(u) Fx s(x) |u s(x)| 2 σ(x)

Z s(x)+ σ(x)

s(x) px(v)dv

= |u s(x)| 2 σ(x)

4 Fx s(x) i

Hence, by Proposition 10 with a(x) = 1 2σ(x) and b(x) = σ(x)

2 , (g, X, Y ) satisﬁes hypothesis SC4σ,8.

C.2.3 Conclusion

To conclude, we apply Theorem 11 with κ = 1, C = 1, L = 1 (since ceps 0 and ceps ε are 1-Lipschitz), ρ = 4σ and ν = 8. Since constants b1, b2 of Theorem 11 only depend on

Maillard, Arlot and Lerasle

κ, L, C, ν and all these parameters have now received explicit values, the constants b1, b2 are now absolute.

Appendix D. Classiﬁcation: Proof of Theorem 13

In the proof of Theorem 17, we use convexity of the risk to show that the risk of the average is less than the average of the risk. A property of this type also holds in the setting of classiﬁcation, with the average replaced by the majority vote.

Proposition 35 In the classiﬁcation framework see Example 1 , let ( bfi)1 i V denote a ﬁnite family of functions X Y and let bfmv be some majority vote rule, deﬁned by

x X , bfmv(x) argmax y Y |{i [V ] : bfi(x) = m}| .

Then, we have

ℓ(s, bfmv) M

i=1 ℓ(s, bfi) and L( bfmv) 2

i=1 L( bfi) .

Proof For any y Y, deﬁne ηy : x 7 P(Y = y|X = x). Then, for any f S, we have L(f) = E[1 ηf(X)(X)] hence s(X) argmaxy Y ηy(X) and

ℓ(s, f) = E h max y Y ηy(X) ηf(X)(X) i = E ηs(X)(X) ηf(X)(X) .

We now ﬁx some x X and deﬁne Cx(y) = {i [V ] : bfi(x) = y} and Cx = maxy Y |Cx(y)|. Since Cx M P y Y |Cx(y)| = V , we get Cx V/M. On the other hand, by deﬁnition

ηs(x)(x) η bfi(x)(x) | {z } 0

V ηs(x)(x) η bfmv(x)(x) 1

M ηs(x)(x) η bfmv(x)(x) .

Integrating over x (with respect to the distribution of X) yields the ﬁrst bound. For the second bound, ﬁx x X and deﬁne Cx(y) and Cx as above. Let y Y be such that bfmv(x) = y. Since y occurs less often than bfmv(x) among bf1(x), . . . , bf V (x), we have |Cx(y)| V/2. Therefore,

i=1 I{ bfi(x) =y} = V |Cx(y)|

bfmv(x) = y implies 1 V

i=1 I{ bfi(x) =y} 1

Aggregated Hold-Out

Hence, for any y Y,

I{ bfmv(x) =y} 2

i=1 I{ bfi(x) =y} .

Taking expectations with respect to (x, y) yields L( bfmv) 2V 1 PV i=1 L( bfi).

We can now proceed with the proof of Theorem 13. It relies on a result by Massart (2007, Equation 8.60, which is itself a consequence of Corollary 8.8), which holds true as soon as t S, Var I{t(X) =Y } I{s(X) =Y } h w p

ℓ(s, t) i2 (52)

for some nonnegative and nondecreasing continuous function w on R+, such that x 7 w(x)/x is nonincreasing on (0, + ) and w(1) 1. Let us ﬁrst prove that assumption (52) holds true. On the one hand, since Y = {0, 1}, for any t S,

Var I{t(X) =Y } I{s(X) =Y } E h I{t(X) =Y } I{s(X) =Y } 2i

= E I{t(X) =s(X)} = E |t(X) s(X)|] . (53)

On the other hand, since we consider binary classiﬁcation with the 0 1 loss, for any t S and h > 0,

ℓ(s, t) = E |2η(X) 1| |t(X) s(X)| by Devroye et al. (1996, Theorem 2.2)

h E |t(X) s(X)|I{|2η(X) 1| h}

h E |t(X) s(X)| I{|2η(X) 1|<h} since t s 1

h E |t(X) s(X)| rhβ+1 by (MA).

This lower bound is maximized by taking

E |t(X) s(X)|

which belongs to [0, 1] since r 1 and E |t(X) s(X)| 1. Thus, we obtain

ℓ(s, t) h β β + 1E |t(X) s(X)| = β (β + 1)(β+1)/βr1/β E |t(X) s(X)| (β+1)/β .

Therefore, Equation (53) leads to

Var I{t(X) =Y } I{s(X) =Y } E |t(X) s(X)| β + 1 ββ/(β+1) r 1 β+1 ℓ(s, t)

By Lemma 36 below, β+1 ββ/(β+1) 2; hence, deﬁning r1 = 2r 1 β+1 , Equation (52) holds true

with w(u) = r1u

β β+1 , which statisﬁes the required conditions. So, by Massart (2007, Equation 8.60), for any θ (0, 1),

E ℓ s, bf ho T | DT n 1 + θ

1 θ inf m M

n ℓ s, Am(DT n ) o + δ2 1 θ

2θ + log e|M| 1

Maillard, Arlot and Lerasle

where δ is the positive solution of the ﬁxed-point equation w(δ ) = nvδ2 , that is,

Taking expectations with respect to the training data DT n , we obtain

E ℓ(s, bf ho T ) 1 + θ

1 θE inf m M ℓ s, Am(DT n ) + 2r 1 β+2

1 θ 2θ + log e|M| 1

β+1 β+2 v .

Under assumption (2), E ℓ(s, bf ho T ) and E L( bf ho T ) do not depend on T T (they only depend on T through its cardinality nt). Now, by Proposition 35 applied to ( bf ho T )T T ,

E ℓ(s, bf mv T ) 2E ℓ(s, bf ho T1 )

1 θE inf m M ℓ s, Am(DT n ) + 4r 1 β+2

1 θ 2θ + log e|M| 1

β+1 β+2 v .

Taking θ = 1/5 leads to the result.

The proof of Theorem 13 makes use of the following lemma.

Lemma 36 For all β > 0, we have

β + 1 ββ/(β+1) 2 .

Proof We ﬁrst notice that

= log(β + 1) β β + 1 log β

= log(β + 1) β β + 1

h log(β + 1) + log β β+1 i

= log(β + 1)

β + 1 β β + 1 log β β+1 .

Deﬁning p = 1 β+1, this can be written

= p log p (1 p) log(1 p) ,

which is the entropy of a Bernoulli distribution. The result follows from the fact that this entropy attains its maximal value log 2 at p = 1

Aggregated Hold-Out

Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40 79, 2010.

Sylvain Arlot and Matthieu Lerasle. Choice of V for V -fold cross-validation in least-squares density estimation. Journal of Machine Learning Research (JMLR), 17(208):1 50, 2016.

Viorel Barbu and Teodor Precupanu. Convexity and optimization in Banach spaces. Springer, 2012.

Gérard Biau and Luc Devroye. Lectures on the Nearest Neighbor Method. Springer Series in the Data Sciences. Springer, 2015.

Gérard Biau and Erwan Scornet. A random forest guided tour. TEST, 25(2):197 227, 2016.

Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classiﬁcation: a survey of some recent advances. ESAIM: PS, 9:323 375, 2005.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, 2013.

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123 140, 1996.

Leo Breiman. Random forests. Machine Learning, 45:5 32, 2001.

Peter Bühlmann and Sara van de Geer. Statistics for high-dimensional data. Springer Series in Statistics. Springer, Heidelberg, 2011. Methods, theory and applications.

Peter Bühlmann and Bin Yu. Analyzing bagging. The Annals of Statistics, 30(4):927 961, 2002.

Kenneth P. Burnham and David R. Anderson. Model Selection and Multimodel Inference. Springer-Verlag, New York, second edition, 2002. A practical information-theoretic approach.

Olivier Catoni. Statistical Learning Theory and Stochastic Optimization, volume 1851 of Lecture Notes in Mathematics. Springer, Berlin, 2001. Lectures from the 31st Summer School on Probability Theory held in Saint-Flour, July 8 25, 2001, with a foreword by Jean Picard.

Luc P. Devroye, László Györﬁ, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics (New York). Springer-Verlag, New York, 1996.

Thomas G. Dietterich. Ensemble methods in machine learning. In International workshop on multiple classiﬁer systems, pages 1 15. Springer, 2000.

Mona Eberts and Ingo Steinwart. Optimal regression rates for SVMs using Gaussian kernels. Electron. J. Statist., 7:1 42, 2013.

Maillard, Arlot and Lerasle

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1, part 2): 119 139, 1997. Euro COLT 95.

Robin Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24(3):543 562, 2012.

László Györﬁ, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer New York, 2002.

Peter Hall and Andrew P. Robinson. Reducing variability of crossvalidation for smoothingparameter choice. Biometrika, 96(1):175 186, January 2009.

Andres Hoyos-Idrobo, Yannick Schwartz, Gael Varoquaux, and Bertrand Thirion. Improving sparse recovery on structured images with bagged clustering. In 2015 International Workshop on Pattern Recognition in Neuro Imaging. IEEE, June 2015.

Yoonsuh Jung. Eﬃcient tuning parameter selection by cross-validated score in high dimensional models. International Journal of Mathematical, Computational, Physical, Electrical and Computer Engineering, 10(1):19 25, 2016.

Yoonsuh Jung and Jianhua Hu. A K-fold averaging cross-validation procedure. Journal of Nonparametric Statistics, 27(2):167 179, 2015.

Guillaume Lecué. Suboptimality of Penalized Empirical Risk Minimization in Classiﬁcation. In COLT 2007, volume 4539 of Lecture Notes in Artiﬁcial Intelligence. Springer, Berlin, 2007.

Guillaume Maillard. Aggregated hold out for sparse linear regression with a robust loss function, 2020a. ar Xiv:2002.11553.

Guillaume Maillard. Hold-out and Aggregated hold-out. Ph D thesis, Université Paris-Saclay, September 2020b.

Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808 1829, 1999.

Pascal Massart. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6 23, 2003, With a foreword by Jean Picard.

Ron Meir and Tong Zhang. Generalization error bounds for bayesian mixture algorithms. Journal of Machine Learning Research, 4:839 860, 2003.

Arkadi Nemirovski. Topics in Non-parametric Statistics, volume 1738 of Lecture Notes in Math. Springer, Berlin, 2000.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, and et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res., 12:2825 2830, 2011.

Aggregated Hold-Out

Maya L. Petersen, Annette M. Molinaro, Sandra E. Sinisi, and Mark J. van der Laan. Crossvalidated bagged learning. Journal of Multivariate Analysis, 98(9):1693 1704, October 2007.

Joseph Salmon and Arnak S. Dalalyan. Optimal aggregation of aﬃne estimators. In COLT - 24th Conference on Learning Theory - 2011, Budapest, Hungary, Jul 2011.

Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

Ingo Steinwart and Andreas Christmann. Support vector machines. Information Science and Statistics. Springer, New York, 2008.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B. Methodological, 58(1):267 288, 1996.

Alexandre B. Tsybakov. Optimal aggregation of classiﬁers in statistical learning. The Annals of Statistics, 32(1):135 166, 2004.

Gaël Varoquaux, Pradeep Reddy Raamana, Denis A. Engemann, Andrés Hoyos-Idrobo, Yannick Schwartz, and Bertrand Thirion. Assessing and tuning brain decoders: Crossvalidation, caveats, and guidelines. Neuro Image, 145:166 179, January 2017.

Yuhong Yang. Adaptive regression by mixing. Journal of the American Statistical Association, 96(454):574 588, 2001.