# Empirical Risk Minimization under Random Censorship

Journal of Machine Learning Research 23 (2022) 1-59. Submitted 6/19; Revised 10/21; Published 12/21.

Guillaume Ausset (guillaume.ausset@telecom-paris.fr), LTCI, Télécom Paris, Institut Polytechnique de Paris; BNP Paribas
Stéphan Clémençon (stephan.clemencon@telecom-paris.fr), LTCI, Télécom Paris, Institut Polytechnique de Paris
François Portier (francois.portier@ensai.fr), CREST UMR 9194, ENSAI, Univ Rennes

Editor: Ingo Steinwart

Abstract. We consider the classic supervised learning problem where a continuous non-negative random label Y (e.g. a random duration) is to be predicted based upon observing a random vector X valued in R^d with d ≥ 1, by means of a regression rule with minimum least squares error. In various applications, ranging from industrial quality control to public health through credit risk analysis for instance, training observations can be right censored, meaning that, rather than on independent copies of (X, Y), statistical learning relies on a collection of n ≥ 1 independent realizations of the triplet (X, min{Y, C}, δ), where C is a nonnegative random variable with unknown distribution, modelling censoring, and δ = I{Y ≤ C} indicates whether the duration is right censored or not. As ignoring censoring in the risk computation may clearly lead to a severe underestimation of the target duration and jeopardize prediction, we consider a plug-in estimate of the true risk based on a Kaplan-Meier estimator of the conditional survival function of the censoring C given X, referred to as the Beran risk, in order to perform empirical risk minimization. It is established, under mild conditions, that the learning rate of minimizers of this biased/weighted empirical risk functional is of order O_P(√(log(n)/n)) when ignoring model bias issues inherent to plug-in estimation, as can be attained in the absence of censoring.
Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed.

Keywords: censored data, empirical risk minimization, U-processes, statistical learning theory, survival data analysis.

©2022 G. Ausset, S. Clémençon and F. Portier. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/19-450.html.

1. Introduction

Covering a wide variety of practical applications, distribution-free regression can be considered one of the flagship problems in statistical learning. In the most standard setup, (X, Y) is a random pair defined on a certain probability space with (unknown) joint probability distribution P, where the output Y is a real-valued square integrable random variable (r.v.) and X models some input information, valued in R^d, supposedly useful to predict Y. In this context, one is interested in building a (measurable) function f : R^d → R minimizing the (expected quadratic) risk

    R_P(f) = E[(Y − f(X))²],    (1)

which is finite as soon as the r.v. f(X) is square integrable. Obviously, the minimizer of (1) is the regression function f*(X) = E[Y | X]. As the distribution of (X, Y) is unknown in practice, the Empirical Risk Minimization paradigm (ERM in abbreviated form, see e.g. Györfi et al. (2002)) suggests considering solutions f̂_n of the minimization problem min_{f∈F} R̂_n(f), also referred to as least squares regression, where R̂_n(f) is a statistical estimate of the risk R_P(f) computed from a training sample D_n = {(X_1, Y_1), . . . , (X_n, Y_n)} of independent copies of (X, Y). In general the empirical version

    R̂_n(f) = (1/n) Σ_{i=1}^n (Y_i − f(X_i))²,    (2)

is considered. This boils down to replacing P in the risk functional R_P(·) with the empirical distribution of the (X_i, Y_i)'s. The class F of predictive functions is supposed to be of controlled complexity (e.g.
of finite VC dimension), while being rich enough to contain a reasonable approximant of the minimizer f*(x) of R_P. In a framework stipulating in addition that the random variables Y and f(X), f ∈ F, are sub-Gaussian, ERM is proved to yield rules with good generalization properties, see e.g. Györfi et al. (2002); Bartlett et al. (2005); Lecué and Mendelson (2016) (notice, however, that, in heavy-tail situations, alternative strategies are preferred, refer to Lugosi and Mendelson (2016) for instance). In many applications such as industrial reliability, see Mann et al. (1974), or clinical trials, the r.v. Y to be predicted represents a duration, e.g. the lifespan of a manufactured component or the time to recovery of sick patients, and it is far from uncommon in survival analysis that the data at disposal to learn a predictive rule are not composed of independent realizations (X_1, Y_1), . . . , (X_n, Y_n) of distribution P but of observations (X_1, Ȳ_1, δ_1), . . . , (X_n, Ȳ_n, δ_n), where the observed durations are of the form Ȳ_i = min{Y_i, C_i} with i ∈ {1, . . . , n}, the random variables C_i modelling a possible right censoring, and the δ_i are binary variables indicating whether censoring has occurred for each duration. Of course, other types of censoring (e.g. left/interval/progressive censoring) can be encountered in practice and result in partially observed durations. Since the results established in this paper can be straightforwardly extended to a more general framework, the focus here is on the right censoring case. Whereas the asymptotic theory of statistical estimation based on censored data is very well documented in the literature (see e.g. Fleming and Harrington (1991); Andersen et al. (1993) and the references therein), the issues raised by censoring in statistical learning have received much less attention, and it is the major purpose of this article to investigate how ERM can be extended to this setup with sound generalization guarantees.
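As a concrete illustration of the ERM principle (1)-(2) recalled above, the following sketch (not taken from the paper; the synthetic data and the grid-searched class of linear rules are illustrative choices) minimizes the empirical quadratic risk over a simple function class:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: Y = f*(X) + noise, with f*(x) = 2x.
n = 2000
X = rng.uniform(0.0, 1.0, size=n)
Y = 2.0 * X + rng.normal(0.0, 0.1, size=n)

def empirical_risk(f, X, Y):
    """Empirical version (2) of the quadratic risk (1)."""
    return np.mean((Y - f(X)) ** 2)

# Minimize over the class F of linear rules f_a(x) = a * x by grid search.
grid = np.linspace(0.0, 4.0, 401)
risks = [empirical_risk(lambda x, a=a: a * x, X, Y) for a in grid]
a_hat = grid[int(np.argmin(risks))]
```

With a sample of this size, the minimizer `a_hat` lands close to the true slope 2, as the theory for uncensored ERM predicts.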
As the empirical risk (2) cannot be computed directly from the data available, we rely on the inverse probability of censoring weighting (IPCW) approach, see Gerds et al. (2017) for a recent review. Namely, we first build a plug-in (biased) estimator of the risk (1) by means of the Beran estimator of the conditional survival function of the censoring (Beran, 1981; Dabrowska, 1989; van Keilegom and Veraverbeke, 1996) and next minimize the resulting risk estimate, referred to as the Beran risk, which can be interpreted as a weighted version of the empirical risk process based on the observations. The asymptotic behaviour of such weighted empirical risks was first considered in the seminal contributions of Stute (1993, 1996) and refined recently in Lopez (2011); Lopez et al. (2013). In this paper, more in the spirit of the popular statistical learning theory of empirical risk minimization, nonasymptotic maximal deviation bounds for this risk functional, much more complex than a basic empirical process due to the strong dependency exhibited by the terms averaged to compute it, are established by means of linearization techniques combined with concentration results pertaining to the theory of U-processes. We prove that, under appropriate conditions, minimizers of the Beran risk proposed have good generalization properties, achieving learning rate bounds of order O_P(√(log(n)/n)) when ignoring the model bias impact on the plug-in estimation step, the same as ERM in the absence of any censoring. Beyond this theoretical analysis, illustrative numerical results are also displayed, providing strong empirical evidence of the relevance of the approach promoted. They reveal in particular that, even if the estimator of the conditional survival function plugged in is only moderately accurate, Beran risk minimizers significantly outperform approaches ignoring censoring.
Eventually, we point out that some of the results established in this paper have been preliminarily presented in an elementary form at the 2018 NeurIPS Machine Learning for Healthcare workshop (ML4HEALTH), see Ausset et al. (2018). The rest of the paper is organized as follows. The framework we consider for statistical learning based on censored training data is detailed in section 2, where notions pertaining to survival data analysis involved in the subsequent study are also briefly recalled and a nonasymptotic uniform bound for the Beran estimator of the conditional survival function of the censoring is also stated. In section 3, the statistical version of the expected quadratic risk we propose, based on the Beran estimator previously studied, is introduced and the performance of its minimizers is analysed. Illustrative numerical results are displayed in section 4, while several concluding remarks are collected in section 5. Technical proofs are postponed to the appendices.

2. Background - Preliminaries

In this section, we first describe at length the probabilistic setup considered in this paper and recall basic concepts of censored data analysis, which the subsequent analysis relies on. Next, we establish a nonasymptotic bound for the deviation between the conditional survival function of the random censoring and its Beran estimator under adequate smoothness assumptions. Here and throughout, the indicator function of any event E is denoted by I{E}, the Dirac mass at any point x by δ_x. When well-defined, the convolution product between two real-valued Borel functions g(x) and w(x) on R^d is denoted by (g ⋆ w)(x) = ∫_{x′∈R^d} g(x − x′) w(x′) dx′. The left-limit at s > 0 of any càdlàg function S on R_+ is denoted by S(s−) = lim_{t↑s} S(t).
2.1 The Statistical Framework

In this paper, we consider a pair (X, Y) of random variables defined on the same probability space (Ω, A, P), with unknown joint distribution P, where Y, representing a duration, takes positive values only and X models some information valued in R^d, d ≥ 1, a priori useful to predict Y. We assume that X's marginal distribution has a density g(x) w.r.t. Lebesgue measure on R^d. We are concerned with building a prediction rule f : R^d → R_+ with minimum expected quadratic risk R_P(f), see Eq. (1), based on a training dataset D_n = {(X_1, Ȳ_1, δ_1), . . . , (X_n, Ȳ_n, δ_n)} composed of n ≥ 1 independent realizations of the random triplet (X, Ȳ, δ), where Ȳ = min{Y, C}, C is a positive r.v. defined on (Ω, A, P) and δ = I{Y ≤ C} indicates whether the duration is (right) censored (δ = 0) or not (δ = 1). The following hypothesis is required in the present study.

Assumption 1 (Conditional independence) The random variables Y and C are conditionally independent given the input X and we have Y ≠ C with probability one.

Naturally, many other types of censoring can be encountered in practice. However, since the goal of the present paper is to explain the main ideas to apply the ERM principle to censored data rather than dealing with the problem at the highest level of generality, we restrict our attention to the type of right random censoring introduced above. Though simple, it covers many situations. Addressing the problem in a more complex probabilistic framework, where Y and C are not conditionally independent given X anymore for instance, will be the subject of future research. The assumption stipulating that {Y = C} is a zero-probability event is quite general, insofar as it allows considering situations where Y and/or C are discrete variables. Under conditional independence, it is obviously satisfied when the r.v. Y is continuous.
Easy to state but difficult to solve, the statistical learning problem we consider here is of considerable importance. In a wide variety of applications, the input information is of increasing granularity and described by a random vector of very large dimension d, while (censored) data are progressively becoming massively available. Machine-learning techniques are thus expected to complement traditional approaches, based on statistical modelling, in order to produce more flexible/accurate predictive models based on censored data. Incidentally, we point out that the problem under study can be viewed as a very specific type of transfer learning problem, see e.g. Pan and Yang (2010), insofar as, due to the censoring, the distribution of the training/source data is not that of the test/target data. However, the source domain coincides here with the target one and the predictive task (regression) remains the same.

Weighted empirical risk. Discarding censored observations to evaluate the risk of a candidate function f(x) would lead to the quantity

    Σ_{i=1}^n δ_i (Ȳ_i − f(X_i))² / Σ_{i=1}^n δ_i,    (3)

with 0/0 = 0 by convention, which is clearly a biased estimate of R_P(f) in general, since, by virtue of the strong law of large numbers, it converges to E[(Y − f(X))² | Y ≤ C] with probability one. One may easily check that the minimizer of this functional is given by

    f̄(X) = E[Y I{Y ≤ C} | X] / P{Y ≤ C | X},

which significantly differs from f*(X) in general.
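The bias of the complete-case estimate (3) can be seen concretely on a toy simulation (a hypothetical covariate-free setting with Y and C both exponential, not an experiment from the paper), comparing (3) with the true risk of a constant predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Y = rng.exponential(1.0, size=n)        # true durations, Exp(1)
C = rng.exponential(1.0, size=n)        # censoring times, Exp(1), independent here
delta = (Y <= C).astype(float)          # 1 if uncensored
Y_obs = np.minimum(Y, C)                # observed durations

f0 = 1.0                                # constant predictor f(x) = E[Y] = 1
true_risk = np.mean((Y - f0) ** 2)      # uses the (normally unobserved) Y's; ~ Var(Y) = 1
naive_risk = np.sum(delta * (Y_obs - f0) ** 2) / np.sum(delta)  # estimate (3)
# Conditionally on {Y <= C}, Y is Exp(2) here, so (3) converges to
# Var + bias^2 = 1/4 + 1/4 = 0.5, half the true risk of 1.
```

The naive average thus systematically understates the risk, exactly the underestimation phenomenon the IPCW correction below is designed to remove.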
Observing that, by means of a straightforward conditioning argument, one can write the risk as

    R_P(f) = E[ δ(Ȳ − f(X))² / S_C(Ȳ | X) ],    (4)

where S_C(u | x) = P{C > u | X = x} denotes the conditional survival function of the random right censoring given X, we propose to estimate the risk (1) by computing first a nonparametric estimator Ŝ_C(u | x) of S_C(u | x) and by plugging it next into (4), so as to obtain

    R̃_n(f) = (1/n) Σ_{i=1}^n δ_i(Ȳ_i − f(X_i))² / Ŝ_C(Ȳ_i | X_i),    (5)

which approximates the unknown quantity

    (1/n) Σ_{i=1}^n δ_i(Ȳ_i − f(X_i))² / S_C(Ȳ_i | X_i),    (6)

whose expectation is equal to (4), the conditional survival function of C given X being itself unknown. Observe that the risk estimate (5) can be viewed as a weighted version of the sum of the observed squared errors (Ȳ_i − f(X_i))², just like (3), except that the i-th weight is not δ_i / Σ_{j≤n} δ_j anymore but δ_i / Ŝ_C(Ȳ_i | X_i). In the terminology of survival analysis, the weighted empirical risk (5) is usually referred to as an IPCW estimate (Gerds et al., 2017) and a natural strategy to learn a predictive function in the censored framework described above then consists in solving the minimization problem

    inf_{f∈F} R̃_n(f),    (7)

over an appropriate class F. In Rotnitzky and Robins (1992) or Section 3.3 in van der Laan and Robins (2003), a parametric estimator (Cox, 1972) of S_C(Y | X) is used to infer the risk. However, such an approach is naturally limited by well-known misspecification issues, inherent in the choice of the parametric model. In van der Laan and Robins (2003); Rubin and van der Laan (2007), the misspecification problem is alleviated by the use of a doubly robust loss which allows for misspecification of one of the models (either S_C or else S_T) and improves in addition the efficiency of the procedure. This has been further investigated in Molinaro et al. (2004); Steingrimsson et al. (2016, 2019), where different methodologies are proposed to build classification trees.
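Continuing the toy simulation above (covariate-free exponential durations, with the censoring survival function S_C assumed known, which is exactly what the oracle quantity (6) postulates), the IPCW weighting removes the bias of the complete-case average:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
Y = rng.exponential(1.0, size=n)            # durations, Exp(1)
C = rng.exponential(1.0 / 0.3, size=n)      # censoring, exponential with rate 0.3
delta = (Y <= C).astype(float)
Y_obs = np.minimum(Y, C)

def S_C(u):
    """True survival function of C: S_C(u) = P(C > u) = exp(-0.3 u)."""
    return np.exp(-0.3 * u)

f0 = 1.0
ipcw_risk = np.mean(delta * (Y_obs - f0) ** 2 / S_C(Y_obs))     # analogue of (6)
naive_risk = np.sum(delta * (Y_obs - f0) ** 2) / np.sum(delta)  # analogue of (3)
true_risk = np.mean((Y - f0) ** 2)                              # oracle, ~ 1
```

The IPCW average tracks the true risk closely, while the complete-case average remains biased downward; replacing the known S_C by the Beran estimator is precisely the plug-in step (5).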
The use of the Kaplan-Meier estimator (Kaplan and Meier, 1958) for S_C(u | x) has been considered in several papers (Stute, 1993, 1996; Bang and Tsiatis, 2002; Kohler et al., 2002). Even if the censoring model is free from any parametric modelling, the assumptions required to ensure consistency are quite strong, as the distribution of C is supposed to be independent from X, see Stute (1996) for more details. In particular, the weights used are independent from X. To overcome the previous restrictions, the Beran estimator (Beran, 1981), which is a kernel smoothing version of the Kaplan-Meier estimate (see the next section), can be employed instead of the Cox or the Kaplan-Meier estimators. Such an approach is promoted and studied in Lopez (2011); Lopez et al. (2013). When using the Beran approach to estimate S_C(u | x), as detailed in the next subsection, the risk functional (5) is referred to as the Beran risk throughout the article. Based on accuracy results for kernel-based Beran estimators of the conditional survival function S_C(· | x) such as those subsequently presented, the performance of solutions of (7) is investigated in the next section. We point out that, as highlighted in section 4, alternative inference strategies for conditional survival function estimation can be considered. For simplicity, here we restrict our attention to kernel-smoothing techniques, although the analysis carried out can be extended to other nonparametric methods (e.g. partition-based techniques, nearest neighbours). Our results are therefore comparable to those of Lopez (2011), as both studies are concerned with the Beran risk, relaxing in particular the restrictive assumption on the dependence between C and X used in Stute (1993, 1996). In Lopez (2011), an asymptotic representation of the estimation error is established when the input variable is univariate (d = 1). An extension with a single index model is considered in Lopez et al. (2013).
The proof technique is based on the asymptotic equicontinuity of the empirical process and imposes strong conditions on the bandwidth choice, e.g. nh³ → ∞ (see Theorem 3.3 in Lopez (2011) and Theorem 3.1 in Lopez et al. (2013)). The (nonasymptotic) analysis carried out in the present paper is quite different and based on two steps: 1) the risk estimate is linearized and 2) concentration results for generalized U-processes are used to describe its fluctuations uniformly over the class of predictive functions considered (see e.g. Clémençon and Portier (2018)). Notice additionally that the approach we adopt to establish nonasymptotic rate bounds requires weaker conditions, namely nh^{2d}/|log(h^d)| → ∞ in the d-dimensional case.

Integration domain. As any (conditional) survival function, S_C(y | x) vanishes as y tends to infinity. In order to avoid dealing with the asymptotic behaviour of the conditional survival function of the censoring and stipulating decay rate assumptions for its tail behaviour, in the analysis carried out in section 3 we restrict the study of the prediction problem to a (Borel) domain K ⊂ R_+ × R^d such that S_C(y | x) stays bounded away from 0 on it, and consider the risk

    R_{P,K}(f) = E[ δ(Ȳ − f(X))² / S_C(Ȳ | X) · I{(Ȳ, X) ∈ K} ]    (8)

as well as its empirical counterpart

    (1/n) Σ_{i=1}^n δ_i(Ȳ_i − f(X_i))² / S_C(Ȳ_i | X_i) · I{(Ȳ_i, X_i) ∈ K}.    (9)

As discussed at length below, the present analysis distinguishes itself in several respects from previous works also relying on the IPCW approach. First, the problem of regression (in presence of censoring) is tackled here from the angle of prediction, not as the problem of estimating the conditional expectation f* with minimum L2(µ_X)-error, denoting X's marginal distribution by µ_X.
Although an estimator of Beran's type of C's conditional survival function given X is involved in the empirical risk construction as explained above, the goal pursued here is to ensure that the predictor f̃_n obtained by solving (7) has a small excess risk R_P(f̃_n) − R_P(f*) with large probability. As will be discussed in detail in the next section, establishing such nonasymptotic guarantees for statistical learning in the censored context, in the form of generalization bounds, yields technical difficulties, which are far from straightforward to overcome, when avoiding the simplifying assumption, hardly met in practice, that the output variable Y is independent from the r.v. C modelling the censoring mechanism (see Assumption 1, offering a much more realistic framework for regression based on censored training data). In contrast, we derive in this article sound theoretical results, providing nonasymptotic guarantees for the risk minimizers by jointly estimating the intrinsic loss and the censoring mechanism.

2.2 Preliminary Results

In this subsection, we briefly recall the Beran approach to estimate a (conditional) survival function by means of a kernel smoothing procedure and state a uniform bound for the deviations between the conditional survival function of C given X and its Beran estimator, involved in the statistical learning framework developed in the next section for distribution-free censored regression. As shall be discussed below, this result refines those obtained in Dabrowska (1989) and Du and Akritas (2002), which are of similar nature, except that they are related to the estimation of the conditional survival function of the duration Y given X, denoted by S_Y(u | x) = P{Y > u | X = x}, rather than that of the conditional survival function of the censoring C given X.
Define the conditional integrated hazard function of the right censoring C given X

    Λ_C(u | x) = −∫_0^u S_C(ds | x) / S_C(s− | x),    (10)

and the conditional subsurvival functions H(u | x) = P{Ȳ > u | X = x} and H_0(u | x) = P{Ȳ > u, δ = 0 | X = x} for u ≥ 0 and x ∈ R^d. As we have (under Assumption 1) H_0(du | x) = S_Y(u | x) S_C(du | x) and H(u | x) = S_Y(u | x) S_C(u | x), we obtain

    Λ_C(u | x) = −∫_0^u H_0(ds | x) / H(s− | x).

Here, we propose to build an estimate of Λ_C(u | x) by plugging into formula (10) Nadaraya-Watson type kernel estimates of the conditional subsurvival functions and derive from it an estimator of S_C(u | x). Of course, alternative estimation techniques can be considered for this purpose. Throughout the paper, K : R^d → R_+ is a symmetric bounded kernel function, i.e. a bounded nonnegative Borel function, integrable w.r.t. Lebesgue measure, such that ∫ K(x)dx = 1 and K(x) = K(−x) for all x ∈ R^d, see Wand and Jones (1994). We assume it lies in the linear span of functions w whose subgraphs {(s, u) : w(s) ≥ u} can be represented as a finite number of Boolean operations among sets of the form {(s, u) : p(s, u) ≥ ζ(u)}, where p is a polynomial on R^d × R and ζ an arbitrary real-valued function. This assumption guarantees that the collection of functions {K((x − ·)/h) : x ∈ R^d, h > 0} is a bounded VC type class, see Giné et al. (2004), a property that will be useful to establish our results. Although very technical at first glance, this hypothesis is very general and is satisfied by kernels of the form K(x) = ζ(p(x)), p being any polynomial and ζ any bounded real function of bounded variation (see Nolan and Pollard (1987)), or when the graph of K is a pyramid (truncated or not). For any bandwidth h > 0 and x ∈ R^d, we set K_h(x) = K(h⁻¹x)/h^d.
Based on the kernel estimators given by

    Ĥ_{0,n}(u, x) = (1/n) Σ_{i=1}^n I{Ȳ_i > u, δ_i = 0} K_h(x − X_i),    (11)

    Ĥ_n(u, x) = (1/n) Σ_{i=1}^n I{Ȳ_i > u} K_h(x − X_i),    (12)

    ĝ_n(x) = (1/n) Σ_{i=1}^n K_h(x − X_i),    (13)

define the conditional subsurvival function estimates Ĥ_{0,n}(u | x) = Ĥ_{0,n}(u, x)/ĝ_n(x) and Ĥ_n(u | x) = Ĥ_n(u, x)/ĝ_n(x), as well as the (biased) estimators of Λ_C(u | x) and S_C(u | x)

    Λ̂_{C,n}(u | x) = −∫_0^u Ĥ_{0,n}(ds | x) / Ĥ_n(s− | x),    (14)

    Ŝ_{C,n}(u | x) = ∏_{s≤u} (1 − ΔΛ̂_{C,n}(s | x)),    (15)

with ΔΛ(t) = Λ(t) − Λ(t−), which are classically referred to as the conditional Nelson-Aalen and Beran estimators (Dabrowska, 1989). Let b > 0 and define the set

    Γ_b = {(y, x) ∈ R_+ × R^d : S_Y(y | x) S_C(y | x) g(x) > b},

which is supposed to be non-empty. On this set, one may guarantee that Ĥ_n(y, x) and ĝ_n(x) are both bounded away from 0 with high probability, which permits the study of the fluctuations of (15). In addition, the mild and standard smoothness assumption below is required in the analysis to control the estimation bias.

Assumption 2 For all u ∈ R_+, the functions x ↦ H(u | x), x ↦ H_0(u | x) and x ↦ g(x) are twice continuously differentiable on R^d with all partial derivatives bounded by L.

The result stated below provides a uniform bound for the deviation between S_C(u | x) and its estimator (15).

Proposition 1 Suppose that Assumptions 1 and 2 are fulfilled. Then, there exist constants M_1 > 0, M_2 > 0 and h_0 > 0 depending on b, L and K only such that, for all ϵ ∈ (0, 1), we have with probability greater than 1 − ϵ:

    sup_{(t,x)∈Γ_b} |Ŝ_{C,n}(t | x) − S_C(t | x)| ≤ M_1 ( h² + √( |log(h^{d/2}ϵ)| / (n h^d) ) ),

as soon as h ≤ h_0 and nh^d ≥ M_2 |log(h^{d/2}ϵ)|.

The technical proof is given in the Appendix section (refer to the latter for a description of the constants M_1, M_2 and h_0 involved in the result stated above). A similar result, for the conditional survival function of Y given X, is proved in Dabrowska (1989), see Theorem 2.1 therein. Observe also that choosing h = h_n ∼ n^{−1/(d+4)} yields a rate bound of order O_P(√(log(n)/n^{4/(d+4)})).
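The construction (11)-(15) can be sketched numerically in its equivalent product-limit form: kernel weights localize the Kaplan-Meier construction around x, and the jumps are placed at the censored observations. The following is a minimal one-dimensional sketch (the Gaussian kernel and the absence of tie handling are illustrative simplifications; a real use would also need a bandwidth selection rule):

```python
import numpy as np

def beran_survival(u, x, X, Y_obs, delta, h):
    """Beran estimate of S_C(u | x): product-limit form of (14)-(15).

    Assumes no ties among the observed durations Y_obs; censoring
    events are those with delta == 0.
    """
    w = np.exp(-0.5 * ((X - x) / h) ** 2)   # K_h(x - X_i), up to a constant
    order = np.argsort(Y_obs)
    w, y, d = w[order], Y_obs[order], delta[order]
    at_risk = np.cumsum(w[::-1])[::-1]      # sum of w_j over {j : Y_j >= Y_i}
    # Jump of the localized Nelson-Aalen estimate at each censored Y_i <= u.
    jumps = np.where((y <= u) & (d == 0), w / at_risk, 0.0)
    return float(np.prod(1.0 - jumps))
```

With all weights equal (e.g. every X_i at the query point x), the estimator reduces to the classical Kaplan-Meier product-limit estimate of the censoring distribution, consistent with the description of the Beran estimator as its kernel-smoothed version.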
Prediction vs Estimation. Finally, we emphasize that conditional density/expectation estimation is not the goal we pursue here, the regression framework considered in the next section having to do with prediction, i.e. the construction of a predictive rule f̃_n(X) from censored training data with good predictive capacity. The learning procedure we investigate in this article consists in minimizing a plug-in estimate of the risk (4) and consequently involves the nonparametric estimator (15); the accuracy of the prediction is measured by the excess risk, not by the estimation error E[(f̃_n(X) − f*(X))²]. This contrasts with estimation techniques, which consist in forming directly an estimator of the conditional expectation f*(X) under specific (smoothness) assumptions. In addition, we point out that alternative flexible (local averaging) methods could naturally be used to compute estimators of H_0(u, x), H(u, x) and g(x), and consequently estimators of S_C(u | x) and Λ_C(u | x), including k-nearest neighbours, decision trees or random forests. However, whereas the accuracy of nonparametric estimators based on kernel smoothing under appropriate smoothness hypotheses can be rather easily studied, it is much less convenient to establish rates for estimators produced by tree-based techniques for example (one generally prefers to investigate estimators built by means of variants, involving random splitting for instance, quite different from the algorithms used in practice). For this reason, the predictive performance of extensions of the statistical learning approach under study, based on estimators of S_C(t | x) built by means of tree-based techniques, is studied from an empirical angle only in this article, see section 4 for further details.

3.
Generalization Bounds for Kaplan-Meier Risk Minimizers

It is the purpose of this section to investigate the excess risk (8), related to a domain K ⊂ R_+ × R^d, of minimizers f̃_n(x) of the Kaplan-Meier risk (9) over a class F of predictive functions that is of controlled complexity (see the technical assumptions below), while being rich enough to yield a small bias inf_{f∈F} R(f) − R(f*), denoting R_{P,K}(·) by R(·) for simplicity throughout the present section. We consider here the situation where, for all i ∈ {1, . . . , n}, the estimate of the quantity S_C(Ȳ_i | X_i) plugged into (6) is obtained by evaluating at (y, x) = (Ȳ_i, X_i) the kernel smoothing estimator of S_C(y | x) investigated in subsection 2.2, based on the subsample {(X_j, Ȳ_j, δ_j) : 1 ≤ j ≤ n, j ≠ i}. The corresponding versions of the kernel estimators (11), (12), (13) and those of (14) and (15) are respectively denoted by Ĥ^{(i)}_{0,n}(y | x), Ĥ^{(i)}_n(y | x), ĝ^{(i)}_n(x), Λ̂^{(i)}_{C,n}(y | x) and Ŝ^{(i)}_{C,n}(y | x). This yields the leave-one-out estimator of the risk of any candidate f

    R̃_n(f) = (1/n) Σ_{i=1}^n δ_i(Ȳ_i − f(X_i))² / Ŝ^{(i)}_{C,n}(Ȳ_i | X_i) · I{(Ȳ_i, X_i) ∈ K},

which is well-defined on the event ∩_{i=1}^n {Ŝ^{(i)}_{C,n}(Ȳ_i | X_i) > 0}. As we clearly have

    R(f̃_n) − inf_{f∈F} R(f) ≤ 2 sup_{f∈F} |R̃_n(f) − R(f)|,

the key of the analysis is the control of the fluctuations of the process {R̃_n(f) − R(f) : f ∈ F}. Slightly more generally, we establish below a uniform deviation bound for processes of the type

    Z_n(ϕ) = (1/n) Σ_{i=1}^n δ_i ϕ(Ȳ_i, X_i) / Ŝ^{(i)}_{C,n}(Ȳ_i | X_i) − E[ϕ(Y, X)],  ϕ ∈ Φ,

where the indexing class Φ fulfils the following property, allowing us to control the fluctuations of the pseudo-variables Ŝ^{(i)}_{C,n}(Ȳ_i | X_i), as in Proposition 1.

Assumption 3 There exists a domain K ⊂ Γ_b such that ϕ(y, x) = 0 as soon as (y, x) ∉ K, for all ϕ ∈ Φ.

Equipped with these notations, observe that R̃_n(f) − R(f) = Z_n(ϕ) when ϕ(y, x) = (y − f(x))² I{(y, x) ∈ K}.

Linearization.
Whereas in the standard regression framework or in classification ERM can be straightforwardly studied by means of maximal deviation inequalities for empirical processes, the form of the process {Z_n(ϕ) : ϕ ∈ Φ} of interest is very complex, since the terms averaged in (5) are obviously far from being independent due to the presence of the plugged leave-one-out estimators of the quantities S_C(Ȳ_i | X_i). The subsequent analysis is all the more technically difficult since, in contrast to most works devoted to statistical censored data analysis (see subsection 2.1 for more details), the simplifying assumption, unrealistic in many situations in practice, that Y and C are independent is avoided here, cf Assumption 1. Our approach to the study of the fluctuations of the process Z_n consists in linearizing the statistic Z_n(ϕ), i.e. approximating Z_n(ϕ) by a standard i.i.d. average in the L2-sense, as stated in the next proposition. In order to make this decomposition explicit, further notations are needed. We set, for all i ∈ {1, . . . , n},

    â^{(i)}_n(t | x) = ∫_0^t (c(u | x)/H(u, x)) (Ĥ^{(i)}_{0,n}(du, x) − H_0(du, x)) − ∫_0^t (c(u | x)/H(u, x)²) (Ĥ^{(i)}_n(u, x) − H(u, x)) Ĥ^{(i)}_{0,n}(du, x),

    b̂^{(i)}_n(t | x) = ∫_0^t ((Ĥ^{(i)}_n(u, x) − H(u, x))² / (H(u, x)² Ĥ^{(i)}_n(u, x))) Ĥ^{(i)}_{0,n}(du, x) − ∫_0^t ((Ŝ^{(i)}_{C,n}(u | x) − S_C(u | x)) / S_C(u | x)) Δ̂^{(i)}_n(du | x),

    Δ̂^{(i)}_n(du | x) = Λ̂^{(i)}_{C,n}(du | x) − Λ_C(du | x),  c(u | x) = S_C(u− | x)/S_C(u | x).

Equipped with these notations, we can now state the following result.

Proposition 2 (KM risk decomposition) Suppose that Assumptions 1, 2 and 3 are fulfilled.
There exist constants h_0 > 0 and M_1 > 0 that depend on b, L and K only such that (i) for any n ≥ 2 and ϵ ∈ (0, 1), provided that h ≤ h_0 and nh^d ≥ M_1 |log(h^{d/2}ϵ)|, the event

    E_n := ∩_{i=1}^n { ∀(t, x) ∈ K : Ŝ^{(i)}_{C,n}(t | x) ≥ b/2 and Ĥ^{(i)}_n(t, x) ≥ b³/2 }

occurs with probability greater than 1 − ϵ; (ii) for all ϕ ∈ Φ and n ≥ 2, we have on the event E_n:

    Z_n(ϕ) = L_n(ϕ) + M_n(ϕ) + R_n(ϕ),

where

    L_n(ϕ) = (1/n) Σ_{i=1}^n δ_i ϕ(Ȳ_i, X_i)/S_C(Ȳ_i | X_i) − E[δ ϕ(Ȳ, X)/S_C(Ȳ | X)],

    M_n(ϕ) = (1/n) Σ_{i=1}^n δ_i ϕ(Ȳ_i, X_i) â^{(i)}_n(Ȳ_i | X_i) / S_C(Ȳ_i | X_i),

    R_n(ϕ) = (1/n) Σ_{i=1}^n δ_i ϕ(Ȳ_i, X_i)/S_C(Ȳ_i | X_i) · ( b̂^{(i)}_n(Ȳ_i | X_i) + (S_C(Ȳ_i | X_i) − Ŝ^{(i)}_{C,n}(Ȳ_i | X_i))² / (S_C(Ȳ_i | X_i) Ŝ^{(i)}_{C,n}(Ȳ_i | X_i)) ).

The proof is given in the Appendix section. Observe that the term L_n(ϕ) is a basic centred i.i.d. sample mean statistic and its uniform rate of convergence 1/√n can be recovered by applying maximal deviation bounds for empirical processes under classic complexity assumptions such as those stipulated below, whereas the term M_n(ϕ) is more complicated, since it involves multiple sums. It is dealt with by means of results pertaining to the theory of U-processes (de la Peña and Giné, 1999), by showing that it can be decomposed as M_n(ϕ) = L′_n(ϕ) + R′_n(ϕ), the sum of a linear term and a second-order term. The term R_n(ϕ) + R′_n(ϕ) is a remainder (second-order) term and shall be proved to be negligible with respect to L_n(ϕ) + L′_n(ϕ). The theory of U-processes is used next to describe the uniform behaviour of M_n + R_n. Such concentration results are also used in Clémençon et al. (2008) and Papa et al. (2016) in simpler situations, where the residuals take the form of a degenerate U-statistic. In our case, due to the presence of a leave-one-out estimate of the survival function, the U-processes that arise do not have all their diagonal terms (i.e., the sums over indexes 1 ≤ i, j ≤ n are restricted to i ≠ j).
This is of particular interest because results dealing with U-processes are in most cases stated for such sums (see Lemma 7 and Corollary 8 in the Appendix section) and, more importantly, removing diagonal terms improves the estimation accuracy by reducing the bias (see also Delyon and Portier (2016), remark 4). To obtain uniform concentration inequalities over the function class Φ, it is standard (Nolan and Pollard, 1987; Giné and Guillou, 2001) to assume the following type of control on the complexity of the class.

Assumption 4 The set Φ of real-valued functions on R_+ × R^d is a bounded VC type class with parameters (A, v) and constant envelope M_Φ.

The formal definition of VC classes is given in the Appendix section. By means of these assumptions, the following result, proved in the Appendix section, describes the order of magnitude of the fluctuations of the process Z_n.

Proposition 3 Suppose that Assumptions 1-4 are fulfilled. There exist constants h_0, M_1, M_2 and M_3 that depend on (A, v), M_Φ, L, K and b only, such that, for all n ≥ 2 and ϵ ∈ (0, 1), the event

    sup_{ϕ∈Φ} |Z_n(ϕ)| ≤ M_1 ( h² + √( (log n + |log(ϵh^{d/2})|) / n ) )

occurs with probability greater than 1 − ϵ, provided that h ≤ h_0, nh^d ≥ M_2 |log(ϵh^{d/2})| and nh^{2d} ≥ M_3 |log(ϵh^d)|.

The risk excess probability bound stated in the following theorem shows that, remarkably, minimizers of the Kaplan-Meier risk attain the same learning rate as that achieved by classic empirical risk minimizers in the absence of censoring, when ignoring the model bias effect induced by the plug-in estimation step (cf the choice of the bandwidth h).

Theorem 4 Suppose that Assumptions 1-4 are fulfilled. There exist constants h_0, M_1, M_2 and M_3 that depend on (A, v), M_Φ, L, K and b only, such that, for all n ≥ 2 and ϵ ∈ (0, 1), the event

    R(f̃_n) − inf_{f∈F} R(f) ≤ M_1 ( h² + √( (log n + |log(ϵh^{d/2})|) / n ) )

occurs with probability greater than 1 − ϵ, provided that h ≤ h_0, nh^d ≥ M_2 |log(ϵh^{d/2})| and nh^{2d} ≥ M_3 |log(ϵh^d)|.

The proof is a direct application of Proposition 3.
A similar bound for the expectation of the risk excess of minimizers of the empirical Kaplan-Meier risk can be classically derived with quite similar arguments; details are left to the reader. We finally point out that, since Proposition 3 holds true for a fairly general class of functions Φ, the guarantees provided by Theorem 4 naturally extend to more general risk measures than that defined by the quadratic loss.

4. Numerical Experiments

Beyond the theoretical generalization guarantees established in the previous section, we now examine, from an empirical perspective, the performance of the predictive approach proposed for regression based on censored data. We present various experiments using both synthetic and real data, and compare the approach to alternative methods documented in the survival analysis literature that stand as natural competitors. As shall be seen below, the experimental results obtained provide strong empirical evidence of the relevance of the Kaplan-Meier empirical risk minimization approach described in section 2 and analysed theoretically in section 3. All the experiments and figures displayed in this article can be reproduced using the code available at https://github.com/aussetg/ipcw.

4.1 Experimental Setup

Before presenting and discussing the numerical results obtained, we first describe the experimental schemes used here to investigate the predictive capacity of the learning procedure under random censoring previously studied.

4.1.1 Data Generative Models

In all the synthetic experiments we have carried out, the generation of the data is based either on the proportional hazards model of Cox and Oakes (1984) or else on the accelerated failure time model of Buckley and James (1979), both commonly used for parametric modelling and statistical estimation of conditional survival functions in the censored setup.
Samples of the triplet (Ỹ, δ, X) are obtained by specifying the marginal distribution of X, as well as the conditional distribution of (Y, C) given X. For simplicity, the input r.v. X is here uniformly distributed on the unit hypercube [0, 1]^d, for d ∈ {2, 4, 8}. Only the results for d = 4 are presented below, while those obtained for d ∈ {2, 8} are available through the link mentioned above.

Cox Model. The first survival model we use to simulate synthetic data stipulates that

SY(y | x) = exp(−exp(β^T x) y) and SC(y | x) = exp(−exp(βC^T x) y),   (16)

where β and βC are parameters in Rd. Given X, the conditional distribution of Y is thus exponential with parameter exp(β^T X), while that of C is exponential with parameter exp(βC^T X).

Accelerated Failure Time Model (AFT). The second generative model we considered assumes that

log(Y) = β^T X + ε0 and log(C) = βC^T X + ε1,   (17)

where the r.v. ε0 (respectively, ε1) is independent of X. Different accelerated failure time models can thus be generated, depending on the distributions D0 and D1 chosen for ε0 and ε1. Three distributions have been used: Normal (N) with mean and variance (3/2, 1), Laplace (L) with location and scale (1, 1) and Gamma (G) with shape and scale (0, 1). Denoting by AFT(D0, D1) the model such that (ε0, ε1) ∼ D0 ⊗ D1, the variants AFT(N, N), AFT(N, L) and AFT(N, G) have been simulated. Since the results obtained for these AFT models are quite similar to those based on the Cox model, only the latter are presented below. We refer to the aforementioned link for a description of the results based on the data generated through the AFT models.

Parameters β and βC. In the Cox and AFT models, the level of censoring can be tuned by carefully choosing the parameters β and βC.
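As an illustration, the Cox generative scheme (16) can be sketched as follows; this is a minimal simulation written for this text, not taken from the authors' repository, and the parameter values are placeholders:

```python
import numpy as np

def simulate_cox(n, beta, beta_c, rng):
    """Draw n censored observations (X, Y_tilde, delta) from model (16).

    Given X, the durations Y and C are exponential with rates
    exp(beta @ X) and exp(beta_c @ X); only Y_tilde = min(Y, C) and
    delta = I{Y <= C} are observed.
    """
    d = len(beta)
    X = rng.uniform(size=(n, d))                   # X ~ U([0, 1]^d)
    Y = rng.exponential(1.0 / np.exp(X @ beta))    # scale = 1 / rate
    C = rng.exponential(1.0 / np.exp(X @ beta_c))  # censoring duration
    Y_tilde = np.minimum(Y, C)
    delta = (Y <= C).astype(int)                   # 1 iff uncensored
    return X, Y_tilde, delta

rng = np.random.default_rng(0)
beta = np.array([1.0, 1.0, 0.0, 0.0])             # placeholder, d = 4
beta_c = 0.5 * np.array([1.0, 0.0, 1.0, 0.0])     # 0.5 plays the role of lambda
X, y_obs, delta = simulate_cox(2000, beta, beta_c, rng)
```

Scaling beta_c up makes C stochastically smaller, hence raises the censoring level 1 − p; this is the mechanism exploited below when calibrating λ.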
In order to guarantee that the censoring is informative, we use the following parametrization:

β^T = (1, . . . , 1, 0, . . . , 0) and βC^T = λ (1, 0, 1, 0, . . .),

where the first ⌈d/2⌉ coordinates of β are equal to 1 and the remaining ones to 0, the coordinates of βC alternate between 1 and 0, the tuning parameter λ > 0 controls the level of censoring 1 − p with p = P{Y ≤ C}, and u ∈ R ↦ ⌈u⌉ denotes the ceiling function. For a targeted censoring level p, the parameter λ can be empirically determined so that Σ_{i=1}^n δi ≈ np.

4.1.2 Plugged estimator of the conditional survival function SC(· | x)

The estimate of the risk (9) one seeks to minimize is partly determined by the choice of the estimator ŜC(· | x) of SC(· | x) plugged into it. We consider for ŜC the kernelized Kaplan-Meier estimator (15), in its standard version denoted by ŜKernC(· | x), as well as in its leave-one-out version, denoted by Ŝ(i)KernC(· | x). We denote by ŜKMC(·) the Kaplan-Meier estimator of C's (unconditional) survival function, which can be seen as the limit of ŜKernC when h → ∞ and yields the Kaplan-Meier risk considered in Stute (1995). In addition, we used Ŝ(i)KNNC(· | x), the estimator obtained by replacing the kernel smoothing involved in (9) by a nearest neighbour averaging, in a leave-one-out fashion. Finally, we also considered ŜRFC, the survival random forest estimator proposed in Ishwaran et al. (2008). From each estimator of SC, one computes a plug-in estimate of the risk:

IPCW Kern:    Σ_{i=1}^n δi (Ỹi − f(Xi))² / ŜKernC(Ỹi | Xi)
IPCW LoO:     Σ_{i=1}^n δi (Ỹi − f(Xi))² / Ŝ(i)KernC(Ỹi | Xi)
IPCW Forest:  Σ_{i=1}^n δi (Ỹi − f(Xi))² / ŜRFC(Ỹi | Xi)
IPCW Stute:   Σ_{i=1}^n δi (Ỹi − f(Xi))² / ŜKMC(Ỹi)
IPCW kNN:     Σ_{i=1}^n δi (Ỹi − f(Xi))² / Ŝ(i)KNNC(Ỹi | Xi)
IPCW Oracle:  Σ_{i=1}^n δi (Ỹi − f(Xi))² / SC(Ỹi | Xi)
Naive:        Σ_{i=1}^n (Ỹi − f(Xi))²
Observed:     Σ_{i=1}^n δi (Ỹi − f(Xi))²   (18)

The naive and observed empirical risks introduced above are of course strongly biased estimators of the true risk (1). Note that the normalizing constant is ignored here, insofar as it has no impact on the empirical risk minimizer.
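To make the weighting concrete, here is a minimal sketch of the simplest variant above, the 'IPCW Stute' risk built from the unconditional Kaplan-Meier estimator ŜKMC; the function names are ours, and the conditional (kernel, kNN or forest) variants replace the global risk sets below by locally weighted ones:

```python
import numpy as np

def km_censoring_survival(y_obs, delta):
    """Kaplan-Meier estimator of the censoring survival function S_C.

    Censoring events are the observations with delta == 0; the returned
    function evaluates the left limit t -> S_C(t-), the weight needed at
    an observed (uncensored) duration equal to t.
    """
    order = np.argsort(y_obs, kind="stable")
    y, d = y_obs[order], delta[order]
    n = len(y)
    at_risk = n - np.arange(n)                     # |{j : y_(j) >= y_(i)}|
    # multiplicative decrements occur only at censoring times
    factors = np.where(d == 0, 1.0 - 1.0 / at_risk, 1.0)
    surv = np.cumprod(factors)                     # S_C just after each y_(i)

    def s_c_left(t):
        k = np.searchsorted(y, t, side="left")     # number of y_(i) < t
        return 1.0 if k == 0 else surv[k - 1]
    return s_c_left

def ipcw_stute_risk(pred, y_obs, delta):
    """(1/n) sum_i delta_i (y_i - pred_i)^2 / S_C_hat(y_i-)."""
    s_c = km_censoring_survival(y_obs, delta)
    # the floor on the weight mirrors the truncation of the last jump
    # discussed in the text
    w = np.array([d / max(s_c(t), 1e-12) for t, d in zip(y_obs, delta)])
    return np.mean(w * (y_obs - pred) ** 2)
```

In the absence of censoring (all δi = 1), ŜKMC ≡ 1 and the criterion reduces to the 'Observed' empirical risk; under censoring, the weights are at least 1 on the uncensored points, compensating for the deaths that censoring removes from the sample.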
However, when estimating the true risk itself, it is necessary to correctly normalize the previous quantities in order to obtain complete case estimators. A point of comparison is the oracle risk (1/n) Σ_{i=1}^n (Yi − f(Xi))², i.e. the empirical risk in absence of any censoring (i.e. when all the Yi's are observed). For each risk, a prediction rule f̄n is built by (approximately) minimizing it over a certain class F. The results are depicted in Fig. 1 for various sizes of the (censored) training sample and different censoring levels, the prediction error being evaluated by means of a test (uncensored) sample of size 5000: learning a predictive function by minimizing an IPCW estimator (here IPCW LoO) of the risk always outperforms the naive alternatives, the gain in predictive performance naturally becoming more pronounced as the level of censoring 1 − p increases. Unsurprisingly, when most of the points are observed (i.e. p ≈ 1), all methods reach roughly the same error, all the losses in (18) being equal for p = 1, as depicted by Fig. 2. Notice also in Fig. 2 that the IPCW estimator performs best compared to the naive methods for a moderate level of censoring. This can be explained by the fact that in absence of censoring the methods are equivalent, while when censoring reaches a very high level, there is not enough data to estimate the IPCW weights reliably. Of course, the phenomenon is exaggerated in the present example, as the training set is of size 1000: with a censoring level of 90%, only 100 observations are then available for the conditional estimation of the weights.

Figure 1: Prediction error E[(Y − f̄n(X))²] as a function of the number n of training samples, for censoring levels p ∈ {0.75, 0.5, 0.25}, for data generated by the Cox model (16), when minimization is performed over the class of affine predictive rules.
Figure 2: Prediction error E[(Y − f̄n(X))²] as a function of the proportion p of observed durations, for data generated by the Cox model (16) with n = 1000, when minimization is performed over the class of affine predictive rules.

Truncation of the estimator ŜC(· | x). In the theoretical analysis carried out in the previous section, we placed ourselves on a restricted set Γb. However, in practice, we employ a truncation approach by simply removing the last jump of the estimated survival function ŜKernC(· | x). To evaluate the ranking performance of the learned rules, we also report the concordance index (20), which can be seen as an extension of the classical AUC metric for the standard classification problem to censored data, measuring how well ordered the predicted death times are. Note that, as a complete case statistic, the same methodology as presented in this paper could be applied to estimate the AUC type statistic P{f(X1) > f(X2) | Y1 > Y2}, using IPCW weights. However, because it has been used for evaluating alternative methods, we use the criterion (20) here in order to facilitate comparisons and to avoid discussing the impact of specific IPCW approaches to compute the weights. For all the results, we use the IPCW methodology presented earlier. The Cox proportional hazards model was, however, learned after variable selection via a Lasso regression so as to improve performance. We observe from the results in Table 2 that, as expected, the predictors built through IPCW risk minimization significantly outperform their competitors, including the standard Cox model, for the prediction task. The large improvement compared to the Cox approach may surprise at first but is actually not unexpected, insofar as the seemingly less sophisticated IPCW approach is specifically designed for the purpose of prediction.
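For concreteness, a standard Harrell-type concordance estimator for censored data, one common form of such a complete case pairwise criterion (the precise variant used in the experiments may differ), can be sketched as:

```python
import numpy as np

def concordance_index(pred, y_obs, delta):
    """Harrell-type concordance for censored data.

    A pair (i, j) is comparable when the smaller observed time is an
    actual death (delta_i == 1); it is concordant when the predicted
    durations are ordered like the observed ones.
    """
    comparable = 0
    concordant = 0.0
    n = len(y_obs)
    for i in range(n):
        if delta[i] == 0:
            continue
        for j in range(n):
            if y_obs[i] < y_obs[j]:        # i died before j was last seen
                comparable += 1
                if pred[i] < pred[j]:
                    concordant += 1.0
                elif pred[i] == pred[j]:
                    concordant += 0.5      # usual convention for tied predictions
    return concordant / comparable
```

A perfect ordering of the comparable pairs gives 1, a fully reversed ordering gives 0, and an uninformative predictor sits around 1/2.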
By directly minimizing an estimate of the loss of interest, it naturally achieves a lower test prediction error than that reached by the traditional two-stage approach in statistics, which consists in first estimating the distribution and next deducing an estimator of the minimizer of the loss of interest (here, such an approach would consist in first building an estimator of Y's conditional density given X based on the censored sample and next taking its expectation to form an estimator of the theoretical minimizer f*). The Cox estimator only maximizes the likelihood of the model, without any concern for predictive performance. In particular, extreme errors are not penalized in any way, even though they are hurtful to the overall L2 error. More interestingly, we see that, while the difference is not as pronounced, the IPCW predictors also outperform the Cox estimator with respect to the concordance index. Notice incidentally that the concordance index, as an extension of the Wilcoxon-Mann-Whitney or AUC statistic, is an empirical criterion that can hardly be optimized directly in practice (using gradient ascent techniques for instance) because of the nonsmooth character of the pairwise loss function it involves, see e.g. section 7 in Clémençon et al. (2008), but it is often used by practitioners for evaluating performance a posteriori. Remarkably, as the IPCW risk minimization approach can be combined with highly sophisticated learners (such as random forests), without any modification or increase in complexity, it is possible to significantly increase its predictive capacity, while edging out the standard survival techniques on auxiliary metrics as well.
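To illustrate the nonsmoothness issue, one standard workaround (not employed in the paper) replaces the pairwise indicator by a sigmoid with temperature tau, yielding a surrogate that is differentiable in the predictions and recovers the hard criterion as tau shrinks; the names below are ours:

```python
import numpy as np

def smooth_concordance(pred, y_obs, delta, tau=0.1):
    """Sigmoid relaxation of the pairwise concordance criterion.

    Replaces I{pred_i < pred_j} by sigmoid((pred_j - pred_i) / tau),
    making the criterion amenable to gradient-based optimization.
    """
    num = 0.0
    den = 0.0
    n = len(y_obs)
    for i in range(n):
        if delta[i] == 0:
            continue
        for j in range(n):
            if y_obs[i] < y_obs[j]:        # comparable pair, as before
                den += 1.0
                z = np.clip((pred[j] - pred[i]) / tau, -500.0, 500.0)
                num += 1.0 / (1.0 + np.exp(-z))
    return num / den
```

With a small tau and well-separated predictions, the surrogate essentially coincides with the hard concordance index, while larger tau trades fidelity for smoother gradients.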
                    IPCW                       Naive                 Observed
Method              L2 Error (years)  C        L2 Error  C           L2 Error  C
Cox                 18.78             0.6095
SVR                 2.768             0.563    2.796     0.575       2.795     0.543
Linear Regression   3.193             0.594    4.971     0.557       3.898     0.508
Ridge               3.193             0.594    4.962     0.5573      3.896     0.5077
Kernel Ridge        2.683             0.597    2.704     0.592       2.956     0.513
Random Forest       2.577             0.630    2.636     0.603       2.878     0.542

Table 2: Results of the IPCW approach on the TCGA Cancer data.

5. Conclusion

In the present article, we have presented both theoretical and experimental work on statistical learning based on censored data. Precisely, we considered the problem of learning a predictive/regression function when the output variables related to the training observations are subject to random right censoring, under mild assumptions. Following in the footsteps of the approach introduced in Stute (1995), we studied from a nonasymptotic perspective the performance of predictive functions built by minimizing a weighted version of the empirical (quadratic) risk, constructed by means of the Kaplan-Meier methodology. Learning rate bounds describing the generalization ability of such predictive rules have been proved, through the study of the fluctuations of the Kaplan-Meier risk functional, relying on linearization techniques combined with concentration results for U-processes. These theoretical results have also been confirmed by various numerical experiments, supporting the approach promoted. A difficult question, which will be the subject of further research, is the design of model selection methods (structural risk minimization) to automatically pick the optimal hyperparameters of the plugged estimator ŜC,n. Indeed, this is far from straightforward, insofar as changing the hyperparameters or the model modifies the loss that is being optimized, which makes standard methods such as cross-validation unsuitable.

Appendix A.
Concentration inequalities for VC classes and permanence properties

For completeness, concentration results as well as preservation properties of VC classes, extensively used in the subsequent proofs, are recalled. For the sake of generality, this section is independent from the rest of the paper. For a function f : S → R, we define ‖f‖∞ = sup_{x ∈ S} |f(x)| and ‖f‖A = sup_{x ∈ A} |f(x)|.

Concentration inequalities over VC classes. The following concentration inequalities provide uniform bounds on empirical sums over VC classes of functions. We start by recalling the definition of a VC class.

Definition 5 Let (S, S) be a measurable space. A class F of real-valued functions defined on S is called VC with parameters (A, v) ∈ (0, +∞)² and constant envelope UF > 0 if for any probability measure Q on (S, S) and any ϵ ∈ (0, 1): N(F, L2(Q), ϵUF) ≤ (A/ϵ)^v, where N(F, L2(Q), ϵ) denotes the smallest number of L2(Q)-balls of radius less than ϵ required to cover the class F (covering number), see e.g. Nolan and Pollard (1987) and Giné and Guillou (2001).

The following inequality for empirical processes over VC classes is stated in Einmahl and Mason (2000) and Giné and Guillou (2001) under various forms. The present version is taken from Giné and Sang (2010).

Lemma 6 Let ξ1, ξ2, . . . be i.i.d. r.v.'s valued in a measurable space (S, S) and U be a class of functions on S, uniformly bounded and of VC type with constants (A, v) and envelope U : S → R. Set σ²(u) = var(u(ξ1)) for all u ∈ U. There exist constants C1 > 0, C2 ≥ 1, C3 > 0 (depending on v and A) such that, for all t > 0 satisfying

C1 √n σ √( log(2‖U‖∞/σ) ) ≤ t ≤ n σ²/‖U‖∞,

where σ is any value such that sup_{u ∈ U} σ²(u) ≤ σ² ≤ ‖U‖²∞, we have

P{ sup_{u ∈ U} | Σ_{i=1}^n (u(ξi) − E[u(ξi)]) | > t } ≤ C2 exp( −C3 t²/(n σ²) ).

The previous result is extended to the case of degenerate U-processes over VC classes (Major, 2006, Theorem 2).

Lemma 7 Let ξ1, ξ2, . . . be an i.i.d. sequence of random variables taking their values in a measurable space (S, S) and distributed according to a probability measure P.
Let H be a class of functions on Sk uniformly bounded such that H is of VC type with constants (A, v) and envelope G. For any H H, set σ2(H) = var(H(ξ1, . . . , ξk)) and assume that j {1, . . . , k}, E[H(ξ1, . . . , ξk) | ξ1, . . . , ξj 1, ξj+1, . . . , ξk] = 0 with probability one. Then, there exist constants C1 > 0, C2 1, C3 > 0 (depending on v and A) such that for all t > 0 satisfying C1σ n log 2 G k/2 t σ nσ G (i1,...,ik) H(ξi1, . . . , ξik) H > t C2 exp where G 2 σ2 Var(H) 2 H. The following result is directly derived from that stated above by specifying an appropriate value of t. Corollary 8 Let ξ1, ξ2, . . . be an i.i.d. sequence of random variables taking their values in a measurable space (S, S) and distributed according to a probability measure P. Let H be a class of functions on Sk uniformly bounded such that H is of VC type with constants (A, v) and envelope G. For any H H, set σ2(H) = var(H(ξ1, . . . , ξk)) and assume that j {1, . . . , k}, E[H(ξ1, . . . , ξk) | ξ1, . . . , ξj 1, ξj+1, . . . , ξk] = 0 with probability one. Then, there exist constants C1 > 0 , C2 1 , C3 > 0 (depending on v and A) such that (i1,...,ik) H(ξi1, . . . , ξik) t(n, σ, ϵ) = σnk/2 k/2 + log(C2/ϵ) provided that C2/k 1 log 2 G + log(C2/ϵ) sup H H σ2(H) σ2 G 2 . Ausset, Cl emenc on and Portier VC type classes of functions - Permanence properties. In the subsequent sections, several results are obtained by applying the concentration bounds recalled above to specific classes of functions built up from the elements of the class Φ and other functions such as Kh(x), SC(u | x) or g(x). To show that these specific classes are VC, we rely on the following lemmas which exhibits situations where the VC type property is preserved, while controlling the constants (A, v) involved. In what follows the kernel K is assumed to satisfy the hypotheses introduced in section 2.2. 
Lemma 9 (see Nolan and Pollard (1987), Lemma 22, Assertion (ii)) The class {z 7 K(h 1(x z)) : x Rd, h > 0} is a bounded VC class of functions. The following result is an extension of a result established in the proof of Proposition 8 in Portier and Segers (2018). Lemma 10 Let (V, W) be a pair of random variables taking their values in Rq and in Rd respectively, denote by f0(v | W) the density of the conditional distribution of the random variable V given W, supposed to be absolutely continuous w.r.t. Lebesgue measure on Rq. Let F be a bounded VC class of functions defined on Rq Rd with parameter (A, v) and constant envelop UF. The class G = {w Rd 7 E[f(V, W) | W = w] : f F} is a bounded VC class of functions with parameter (A, v) and constant envelop UF. Proof Let Q be a probability measure on Rd. Consider Q the probability measure defined through d Q(v) = Z f0(v|w)Q(dw)dv. Let ϵ (0, 1) and consider the centres f1, . . . , f N of an ϵUF-covering of the VC class F with respect to the metric L2( Q). Let g G, i.e., g : w Rd 7 E[f(V, W) | W = w] with f in F. Define gk = E[fk(V, W) | W = w], for k = 1, . . . , N. There exists k {1, . . . , N} such that Z (g(w) gk(w))2 Q(dw) Z E (f(V, W) fk(V, W))2 | W = w Q(dw) = ZZ (f(v, w) fk(v, w))2 f0(v|w) dv Q(dw) = Z (f(v, w) fk(v, w))2 Q(dv) ϵ2U2 F, using Jensen s inequality and Fubini s theorem. Consequently, we have: N G, L2(Q), ϵUF N F, L2( Q), ϵUF A Since the constant UF is an envelop for the class G, the result is established. Empirical Risk Minimization under Random Censorship Lemma 11 Let Ψ be a VC class of functions defined on Rq Rd with constant envelop U > 0 that satisfies the following Lipschitz property: for all ψ Ψ, z Rq, (x, y) Rd Rd, |ψ(z, x) ψ(z, y)| κ x y . with κ > 0. Let K : Rd R be a positive function such that R K(u)du = 1 and v K = R u 2K(u)du < . The class F = {(z, x) 7 R ψ(z, x hu)K(u)du : ψ Ψ, 0 < h h} is a bounded measurable VC class of functions with constant envelope (κ h v K + U). 
Proof Let 0 < ϵ 1 and hk = kϵ h, k = 1, . . . , 1/ϵ , an (ϵ h)-subdivision of the interval (0, h]. Let Q be a probability measure on Rq Rd. For each k, define µk as the probability measure of the random variable (Z, X hk U) when (Z, X, U) Q K. Let Ψk,j, j = 1, . . . , N be an ϵU-cover of the function class Ψ with respect to L2(µk). Let h (0, h] and ψ Ψ. For any measurable function f and any k, we have Z f(z, x hku)K(u)du L2(Q) f L2(µk). As a consequence, for each k there exists j such that Z (ψ(z, x hku) ψk,j(z, x hku))K(u)du L2(Q) ϵU. Besides, from the Lipschitz property, there exists k such that Z (ψ(z, x hu) ψ(z, x hku))K(u)du L2(Q) ϵκ h v K. The triangle inequality allows to claim that there exists j and k such that Z (ψ(z, x hu) ψk,j(z, x hku))K(u)du L2(Q) = ϵ(κ h v K + U). There are 1/ϵ Aϵ v such functions Ψk,j meaning that N F, L2(Q), ϵ(κ h v K + U) Aϵ (v+1), where (κ h v K + U) is indeed an envelop for the class F. We conclude the section by a preservation result for the product and the inverse. Lemma 12 Suppose that F and G are two VC classes defined on S with parameters (AF, v F) and (AG, v G) and constant envelops UF and UG, respectively. Then it holds: (i) The class FG = {fg : f F, g G} is VC with parameter (2(AF AG), v F + v G) and envelop UFUG. (ii) In addition, if for all f F and x S, f(x) b F, then the class F 1 = {1/f : f F, g G} is VC with parameter (AFUF/b F, v F) and envelop 1/b F. Ausset, Cl emenc on and Portier Proof Let 0 < ϵ 1 and fk, k = 1, . . . , NF the centers of an (ϵUF)-covering of F. Similarly, denote by gk, k = 1, . . . , NG the centers of an (ϵUG)-covering of G. By applying the operation (fk UF) UF, we can assume without loss of generality that the fk (resp. gk) are bounded by UF (resp. UG). Then for any f F and g G, there are k {1, . . . , NF} and j {1, . . . , NG} such that fg fkgj UG f fk + UF g gj 2ϵUGUF, which implies that N(FG, L2(Q), 2ϵUGUF) (AF/ϵ)v F(AG/ϵ)v G. Taking ϵ = 2ϵ gives the result. 
For the second point, taking fk b F, we have f 1 f 1 k 1 b2 F f fk UF and the result follows taking ϵ = (UF/b F)ϵ. Appendix B. Integration results In this section we establish useful bounds related to these quantities: kernel smoothers, integrals with respect to signed measures, survival functions and hazard functions namely. This corresponds to Lemmas 13, 14, 15 and 16, respectively. As the previous section, this section is independent from the rest of the paper. Lemma 13 Let Ωan open convex subset of Rd. Suppose that f is twice differentiable on Ω such that the greatest eigenvalue of the Hessian matrix is uniformly bounded by M > 0, then, if the kernel K is symmetric, i.e., K(u) = K( u), we have: for all h > 0, sup x Ω |(Kh f)(x) f(x)| M 2 h2 Z z 2K(z)dz. Proof The proof follows the same lines as the proof of Lemma 11 given in Delyon and Portier (2020). Lemma 14 Let θ (0, 1), h : R+ [1, [ be Borelian, increasing, with limit 1/θ at + and ν be any signed measure on R+. Then, we have: T > 0, t [0, T], θ sup s [0,T] Proof Recall first the identity 0 dν = sup f DE Z fdν , (21) Empirical Risk Minimization under Random Censorship where DE is the space of non-increasing functions valued in [0, 1] and vanishing at infinity (see e.g. Dudley (1992)). Since h is increasing from 1 to 1/θ, we have for any signed measure ν (whose restriction to [0, T] is denoted by ν[0,T]), 0 hdν = θ 1 0 dν + θ Z t h θ 1 dν 2θ 1 sup f DE Then applying (21) we obtain that θ sup s [0,T] Lemma 15 Let τ > 0. Let S(1) and S(2) be c ad-l ag non-increasing functions on R+ such that S(1)(0) = S(2)(0) = 1 and S(2)(τ) θ > 0. For k {1, 2}, Λ(k)(t) = R t 0 S(k)(du)/S(k)(u ) is the corresponding cumulative hazard function. We have: S(1) S(2) [0,τ] 2θ 1 Λ(1) Λ(2) [0,τ]. Proof Let t [0, τ]. As S(2)(t) > 0, the integration by part argument of Theorem 3.2.3 in Fleming and Harrington (1991) yields S(1)(t) S(2)(t) S(2)(t) = Z t S(2)(u) (Λ(1)(du) Λ(2)(du)). 
(22) Set 1(du) = (Λ(1)(du) Λ(2)(du))/S(2)(u) and apply the integration by parts formula (refer to page 305 in Shorack and Wellner (2009) for instance) to get S(1)(t) S(2)(t) S(2)(t) = Z t 0 S(1)(u ) 1(du) = S(1)(t) 1(t) + Z t 0 1(u)S(1)(du). Then, as S(2)(t) 1, we obtain that |S(1)(t) S(2)(t)| S(1)(t)| 1(t)| + (1 S(1)(t)) sup u [0,τ] | 1(u)| sup u [0,τ] | 1(u)|. We conclude by using Lemma 14 with dν = d(Λ(1) Λ(2)) and h = 1/S(2). Lemma 16 Let 0 < θ1, θ2 < 1 and τ > 0. For k {1, 2}, define Λ(k)(t) = R t 0 G(k)(du)/H(k)(u), where G(k) : [0, τ] [0, β] is c ad-l ag non-decreasing and H(k) : [0, τ] [θk, 1] is Borelian non-increasing. Then, we have: Λ(1) Λ(2) [0,τ] 2 θ1 G(1) G(2) [0,τ] + β θ1θ2 H(1) H(2) [0,τ]. Ausset, Cl emenc on and Portier Proof Let t [0, τ]. Observe that, by triangular inequality, Λ(1)(t) Λ(2)(t) = d(G(1) G(2)) (H(2) H(1)) H(1)H(2) d G(2) θ1 G(1) G(2) [0,τ] + β θ1θ2 H(2) H(1) [0,τ], where the bound for the second term on the right hand side is straightforward and that for the first term can be deduced from the application of Lemma 14 with the measure ν equal to A 7 R A d(G(1) G(2)) and the function h equal to 1/H(1). Lemma 17 Let τ > 0. Let S(1) and S(2) be c ad-l ag non-increasing functions on R+ such that S(1)(0) = S(2)(0) = 1 and S(2)(τ) θ > 0. For k {1, 2}, define Λ(k)(t) = R t 0 S(k)(u )S(k)(du) and suppose that Λ(k)(t) = R t 0 G(k)(du)/H(k)(u), where G(k) : [0, τ] [0, β] and H(k) : [0, τ] [θ, 1] are respectively non-decreasing and non-increasing borelian functions. Then, there exists a constant Cθ,β > 0, depending only on θ and β, such that sup t [0,τ] S(1)(u ) S(2)(u ) Λ(1)(du) Λ(2)(du) Cθ,β H(1) H(2) 2 [0,τ] + G(1) G(2) 2 [0,τ] + W [0,τ] , S(2)(s ) G(1)(ds) G(2)(ds) S(2)(s)H(2)(s) d G(1)(du) G(2)(du) S(2)(u)H(2)(u) . 
Proof The proof consists in showing first that there exist constants C1,θ,β and C2,θ,β such that sup t [0,τ] ( ˆS(1)(u ) S(2)(u )) S(2)(u) (Λ(1)(du) Λ(2)(du)) C1,θ,β( G(1) G(2) 2 [0,τ] + H(1) H(2) 2 [0,τ]) + Π [0,τ], (23) 0 2(u) 1(du), 2(t) = Z t 0 S(2)(u ) 1(du), 1(t) = Z t 0 S(2)(u) 1 (du), and = Λ(1) Λ(2), and next that Π W [0,τ] C2,θ,β H(1) H(2) 2 [0,τ] + G(1) G(2) 2 [0,τ] . (24) Empirical Risk Minimization under Random Censorship In order to establish (23), we successively apply (22), Fubini s theorem and the integration by part formula: u=0 (S(1)(u ) S(2)(u )) 1(du) v=0 S(1)(v ) 1(dv)S(2)(u ) 1(du) u=v S(2)(u ) 1(du) S(1)(v ) 1(dv) 0 S(1)(v ) 1(dv) + Z t 0 S(1)(v )Π(dv) = 2(t) S(1)(t) 1(t) Z t 0 1(u)S(1)(du) + S(1)(t)Π(t) Z t 0 Π(u)S(1)(du) 2 2 [0,τ] 1 [0,τ] + 2 Π [0,τ]. (25) From (21), we deduce that 2 [0,τ] 1 [0,τ] (because S(2)I[0,τ] belongs to the space DE) and, from Lemma 14, it follows that 1 [0,τ] 2θ 1 [0,τ]. Apply next Lemma 16 to obtain 2 [0,τ] 1 [0,τ] 8 θ2 G(1) G(2) 2 [0,τ] + β2 θ4 H(1) H(2) 2 [0,τ] Combined with (25), this proves (23). For (24), the application of the Taylor expansion a2 + (x a)2 d = d(G(1) G(2)) H(2) (H(1) H(2))d G(1) (H(2))2 + (H(1) H(2))2d G(1) (H(2))2H(1) . Set c(s) = S(2)(s )/S(2)(s). It follows that G(1)(ds) G(2)(ds) H(1)(s) H(2)(s) G(1)(ds) H(2)(s)2 H(1)(s) H(2)(s) 2 G(1)(ds) H(2)(s)2H(1)(s) Ausset, Cl emenc on and Portier Observe that Π(t) W(t) = G(1)(ds) G(2)(ds) H(1)(u) H(2)(u) G(1)(du) S(2)(u)H(1)(u)H(2)(u) H(1)(s) H(2)(s) G(1)(ds) H(2)(s)2 H(1)(u) H(2)(u) G(1)(du) S(2)(u)H(1)(u)H(2)(u) H(1)(s) H(2)(s) G(1)(ds) H(2)(s)2 G(1)(du) G(2)(du) S(2)(u)H(2)(u) H(1)(s) H(2)(s) 2 G(1)(ds) H(2)(s)2H(1)(s) 1(du) = A + B + C + D. We next bound each term on the right hand side of the equation above. Successively apply Lemma 14 and (21) to get G(1)(ds) G(2)(ds) 0 S(2)(s ) G(1)(ds) G(2)(ds) Z S(2)(s )I {s u} G(1)(ds) G(2)(ds) θ2 G(1) G(2) [0,τ]. 
Because, for any u [0, τ], 1/{S(2)(u)H(1)(u)H(2)(u)} 1/θ3, we can write G(1)(ds) G(2)(ds) H(1)(u) H(2)(u) G(1)(du) G(1)(ds) G(2)(ds) + H(1)(u) H(2)(u) 2 ) θ7 G(1) G(2) 2 [0,τ] + H(1) H(2) 2 [0,τ] In addition, because for any u [0, τ], c(u)/(H(2)(u))2 1/θ3 we have: t [0, τ], H(1)(s) H(2)(s) G(1)(ds) H(1)(u) H(2)(u) G(1)(du) H(1)(s) H(2)(s) G(1)(ds) 2 H(1) H(2) 2 Empirical Risk Minimization under Random Censorship Define Γ2(t) = R t 0(G(1)(du) G(2)(du))/(S(2)(u)H(2)(u)). Applying Fubini s theorem, we get H(1)(s) H(2)(s) G(1)(ds) H(2)(s)2 G(1)(du) G(2)(du) S(2)(u)H(2)(u) G(1)(du) G(2)(du) S(2)(u)H(2)(u) c(s) H(1)(s) H(2)(s) G(1)(ds) (H(2)(s))2 n |Γ2(t) Γ2(s)| H(1)(s) H(2)(s) o G(1)(ds) θ3 Γ2 [0,τ] H(1) H(2) [0,τ] . Then, using Lemma 14, it follows that |C| 2(1/θ3)β(2/θ3) G(1) G(2) [0,τ] H(1) H(2) [0,τ] (1/θ3)β(2/θ3)( G(1) G(2) 2 [0,τ] + H(1) H(2) 2 [0,τ]). The last term can be treated by means of Fubini s theorem. Indeed, because 1 [0,τ] 2(β/θ) and for any u [0, τ], 1/{H(2)(u)2H(1)(u)} 1/θ3, we have H(1)(s) H(2)(s) 2 G(1)(ds) (H(2)(s))2H(1)(s) 1(du) u=s 1(du) H(1)(s) H(2)(s) 2 G(1)(ds) H(2)(s)2H(1)(s) θ3 β 1 [0,τ] H(1) H(2) 2 [0,τ] θ4 H(1) H(2) 2 [0,τ]. Putting all this together, the triangular inequality leads to (24) . Appendix C. Proof of Proposition 1 We start by establishing 3 useful lemmas, namely Lemma 18, 19 and 20. Then the proof will follow easily. Define H0,h(y, x) = E h ˆH0,n(y, x) i , Hh(y, x) = E h ˆHn(y, x) i . H0(y, x) = H0(y | x)g(x), H(y, x) = H(y | x)g(x). Ausset, Cl emenc on and Portier Lemma 18 Under Assumption 2, there exists C0 > 0 depending only on K and L such that for all h > 0, sup (t,x) R+ Rd |H0,h(t, x) H0(t, x)| C0h2, sup (t,x) R+ Rd |Hh(t, x) H(t, x)| C0h2. Proof The proof results from the application of Lemma 13 combined with the smoothness assumptions stipulated. 
Lemma 19 Under Assumption 2, There exist constants M1 > 0 and h0 > 0 depending only on K and L such that: sup (t,x) R+ Rd | ˆH0,n(t, x) H0,h(t, x)| M1| log(ϵhd/2)| sup (t,x) R+ Rd | ˆHn(t, x) Hh(t, x)| M1| log(ϵhd/2)| provided that h h0 and M1| log(ϵhd/2)| nhd. Proof The exponential inequalities stated above directly result from the application of Corollary 8 to the uniformly bounded VC-type classes (see Lemma 9 and 12) {(y, x ) R+ Rd 7 I{y > u}K((x x )/h) : (x, u, h) Rd R+ R +} and {(y, δ, x ) R+ {0, 1} Rd 7 I{y > u, δ = 0}K((x x )/h) : (x, u, h) Rd R+ R +} whose VC constants are independent from h, with constant envelope ||K|| , with k = 1 and σ2 = c2 K,Lhd with L R K2(x)dx. This gives that sup (t,x) R+ Rd | ˆH0,n(t, x) H0,h(t, x)| t sup (t,x) R+ Rd | ˆHn(t, x) Hh(t, x)| t provided that hd/2c K,L ||K|| and + C2 1 log 2 K Since, for any positive numbers a, b, γ, it holds that aγ + bγ 2γ(a + b)γ, we find, taking h0 sufficiently small, that t2 M1| log(ϵhd/2)|/nhd for some constant M1 > 0. Finally, taking h0 sufficiently small ensures that log(C2)/C3 + C2 1 log(2 K /c K,L) C2 1 log(1/hd/2), Empirical Risk Minimization under Random Censorship for any h h0, which permits to ensure that the previous condition is satisfied whenever M2| log(ϵhd/2)| nhd, for some M2 > 0. Take M1 = M1 + M2 to obtain the desired result. Lemma 20 Suppose that Assumptions 1 and 2 are fulfilled. There exist constants M1 > 0 and h0 > 0 depending only on b, L and K such that: P inf (t,x) Γb ˆHn(t, x) 3b3 provided that h h0 and M1| log(ϵhd/2)| nhd. Proof Define sup (t,x) Γb |H(t, x) ˆHn(t, x)| b3 By virtue of Assumption 1, for any (t, x) Γb, we have: H(t|x) = SC(t|x)SY (t|x) b2. As a consequence of ˆHn(t, x) H(t, x) |H(t, x) ˆHn(t, x)|, An {inf(t,x) Γb ˆHn(t, x) 3b3/4}. Hence we only have to prove that event An occurs with probability 1 ϵ at least. 
By virtue of Lemma 18, as soon as h p 3b3/(8C0), we have sup (t,x) Γb |Hh(t, x) H(t, x)| 3b3 sup (t,x) Γb | ˆHn(t, x) Hh(t, x)| 3b3 Simply use Lemma 19 to ensure that the event in the left-hand side holds with probability 1 ϵ whenever M1| log(ϵhd/2)| nhd (where M1 now depends on b, L and K) and h h0. Now we conclude the proof. We start by using Lemma 20 to get that inf(t,x) Γb ˆHn(t, x) 3b3/4 happens with probability 1 ϵ/3. We suppose that this event is realized in the following. Let (t, x) Γb and define τx = inf{t 0 : min{SC(t|x), SY (t|x)} > b}. Observing that the choice of kernel K guarantees that ˆSC,n( |x) is a (random) survival function, we first apply Lemma 15 with S(1) = ˆSC,n( |x), S(2) = SC( |x) and θ = b to get: ˆSC,n( |x) SC( |x) [0,τx] (2/b) ˆΛC,n( |x) ΛC( |x) [0,τx]. (27) Applying Lemma 16 with Λ(1)(u) = ΛC(u | x) = R u 0 H0(ds, x)/H(s , x), Λ(2)(u) = ˆΛC,n(u | x) = R u 0 ˆH0,n(ds, x)/ ˆHn(s , x), β = 1, θ1 = b3 H(s, x), θ2 = 3b3/4 (because Ausset, Cl emenc on and Portier inf(t,x) Γb ˆHn(t, x) b3/4), next yields ˆΛC,n( |x) ΛC( |x) [0,τx] 2 b3 ˆH0,n( , x) H0( | x)g(x) [0,τx] (28) 3b6 ˆHn( , x) H( | x)g(x) [0,τx]. Combining (27) and (28), using Lemma 18 and taking the supremum over x such that g(x) > b, we obtain that, the following bound holds true: sup (t,x) Γb | ˆSC,n(t | x) SC(t | x)| b4 sup (t,x) Γb | ˆH0,n(t, x) H0(t, x)| + 8 3b7 sup (t,x) Γb | ˆHn(t, x) H(t, x)| b4 sup (t,x) Γb | ˆH0,n(t, x) H0,h(t, x)| + 4 b4 C0h2 + 8 3b7 sup (t,x) Γb | ˆHn(t, x) Hh(t, x)| + 8 Lemma 19 with the probability level ϵ/3 allows us to bound the 2 previous random terms. Combined with the union bound (with 3 events having probability smaller than ϵ/3), permits claiming that with probability greater than 1 ϵ: sup (t,x) Γb ˆSC,n(t | x) SC(t | x) 4 M1 log(ϵhd/2) provided that (to apply Lemma 19) h h0 and nhd M1| log(3ϵhd/2)|. Examining the different terms and taking h0 small enough lead to the stated result. Appendix D. 
Proof of Proposition 2 Proof of (i): Observe that: i {1, . . . , n}, sup (t,x) K | ˆH(i) 0,n(t, x) ˆH0,n(t, x)| 2||K|| /((n 1)hd), (30) sup (t,x) K | ˆH(i) n (t, x) ˆHn(t, x)| 2||K|| /((n 1)hd). (31) The result follows from the union bound and that each of these events B(1) n := \ n (t, x) K, ˆH(i) n (t, x) b3/2 o , B(2) n := \ n (t, x) K, ˆS(i) C,n(t, x) b/2 o , has probability 1 ϵ/2 under the mentioned condition on (n, h). Apply Lemma 20 to choose (n, h) such that with probability 1 ϵ/2, inf (t,x) K ˆHn(t, x) 3b3/4. Empirical Risk Minimization under Random Censorship Using (31) and the triangle inequality, we get that B(1) n has probability 1 ϵ/2 provided that 2||K|| /((n 1)hd) b3/4. Suppose that event B(1) n is realized. The same reasoning as that used in the proof of Proposition 1 (see (27),(28),(29)), with S(1)( ) = SC( |x), S(2)( ) = S(i) C,n( |x), β = 1, θ1 = b3 and θ2 = b3/2 (as B(1) n is realized), combined with the triangular inequality, yields: i {1, . . . , n}, sup (t,x) K | ˆS(i) C,n(t|x) SC(t|x)| sup (t,x) K | ˆH(i) 0,n(t, x) ˆH0,n(t, x)| + sup (t,x) K | ˆH0,n(t, x) H0(t, x)| sup (t,x) K | ˆH(i) n (t, x) ˆHn(t, x)| + sup (t,x) K | ˆHn(t, x) H(t, x)| We further assume that 4 2||K|| /((n 1)hd) b/4, which is realized whenever h0 is small enough and M1, appearing in the condition nhd M1| log(hd/2ϵ)|, is large enough. From K Γb and (30)-(31), it results that sup (t,x) K ˆH0,n(t, x) H0(t, x) b5 sup (t,x) K ˆHn(t, x) H(t, x) b8 is included in the set B(2) n . Following the treatment of (29), it is easy to see that the latter event occurs with probability 1 ϵ/2 whenever h h0 is small enough (for the bias) and nhd M1| log(hd/2ϵ)|. Proof of (ii). We suppose that the event En is realized. For all i {1, . . . , n}, recall that ˆΛ(i) C,n(du | x) = ˆH(i) 0,n(du, x) ˆH(i) n (u , x) , ˆ (i) n = (ˆΛ(i) C,n ΛC), and that c(s | x) = SC(s | x)/SC(s | x). 
It results from Theorem 3.2.3 in (Fleming and Harrington, 1991, page 97) that ˆS(i) C,n(t | x) SC(t | x) SC(t | x) = 0 c(u | x) ˆ (i) n (du | x) Z t ( ˆS(i) C,n(u | x) SC(u | x)) SC(u | x) ˆ (i) n (du | x). Ausset, Cl emenc on and Portier The Taylor expansion (26) gives that ˆ (i) n (du | x) = ( ˆH(i) 0,n(du, x) H0(du, x)) H(u, x) ( ˆH(i) n (u, x) H(u, x)) ˆH(i) 0,n(du, x) + ( ˆH(i) n (u, x) H(u, x))2 ˆH(i) 0,n(du, x) H(u, x)2 ˆH(i) n (u, x) , which implies that ˆS(i) C,n(t | x) SC(t | x) SC(t | x) = ˆa(i) n (t | x) + ˆb(i) n (t | x), (32) ˆa(i) n (t | x) = Z t c(u | x) H(u, x)( ˆH(i) 0,n(du, x) H0(du, x)) c(u | x) H(u, x)2 ( ˆH(i) n (u, x) H(u, x)) ˆH(i) 0,n(du, x), ˆb(i) n (t | x) = Z t H(u, x)2 ˆH(i) n (u, x) ( ˆH(i) n (u, x) H(u, x))2 ˆH(i) 0,n(du, x) ( ˆS(i) C,n(u | x) SC(u | x)) SC(u | x) ˆ (i) n (du | x). Now, using (26), we obtain that: ϕ Φ, δi ϕ( Yi, Xi) ˆS(i) C,n( Yi | Xi) E δ ϕ( Y , X) δi ϕ( Yi, Xi) SC( Yi | Xi) E δ ϕ( Yi, Xi) i=1 δiϕ( Yi, Xi) ˆS(i) C,n( Yi | Xi) SC( Yi | Xi) S2 C( Yi | Xi) i=1 δiϕ( Yi, Xi) (SC( Yi | Xi) ˆS(i) C,n( Yi | Xi))2 S2 C( Yi | Xi) ˆS(i) C,n( Yi | Xi) . Then, using (32), we retrieve the expected terms Zn(ϕ) = Ln(ϕ) + Mn(ϕ) + Rn(ϕ), which proves (ii). Empirical Risk Minimization under Random Censorship Appendix E. Proof of Proposition 3 The proof is based on the decomposition stated in Proposition 2, combined with the lemmas below that permit to control each term involved in it. Their proofs are given in the next section of the Appendix. The term Ln(ϕ) is a basic i.i.d. (centred) average. As shown in the lemma stated below, its uniform fluctuations can be controlled by standard results in empirical process theory. Lemma 21 Suppose that the hypotheses of Proposition 3 are fulfilled. Then, for any ϵ (0, 1), we have with probability at least 1 ϵ: sup ϕ Φ |Ln(ϕ)| M1 log(M2/ϵ) provided that n M1 log(M2/ϵ), where M1 > 0 and M2 > 1 are constants depending on (A, v), K, MΦ, and b only. We now turn to the term Mn(ϕ). 
Observe it can be decomposed as Mn(ϕ) = Vn,1(ϕ) + Bn,1(ϕ) + Vn,2(ϕ) + Bn,2(ϕ) Vn,1(ϕ) = 1 δiϕ( Yi, Xi) SC( Yi | Xi) c(u | Xi) H(u, Xi)2 ˆH(i) n (u, Xi) Hh(u, Xi) ˆH(i) 0,n(du, Xi), Bn,1(ϕ) = 1 δiϕ( Yi, Xi) SC( Yi | Xi) c(u | Xi) H(u, Xi)2 (Hh(u, Xi) H(u, Xi)) ˆH(i) 0,n(du, Xi), Vn,2(ϕ) = 1 δiϕ( Yi, Xi) SC( Yi | Xi) c(u | Xi) H(u, Xi) ˆH(i) 0,n(du, Xi) H0,h(du, Xi) , Bn,2(ϕ) = 1 δiϕ( Yi, Xi) SC( Yi | Xi) c(u | Xi) H(u, Xi) (H0,h(du, Xi) H0(du, Xi)) . Next we treat the bias terms Bn,1 and Bn,2. Lemma 22 Under the assumptions of Proposition 3, for any ϵ (0, 1), we have, with probability 1 ϵ: sup ϕ Φ |Bn,1(ϕ)| M1h2, sup ϕ Φ |Bn,2(ϕ)| M1h2, provided that n M2| log(hd/2ϵ)|, where M1 > 0, M2 > 0 depend only on MΦ, K, L and b. Ausset, Cl emenc on and Portier Now we consider Vn,1(ϕ). For simplicity, we set Kij = Kh(Xi Xj) for 1 i, j n. We have: Vn,1(ϕ) = 1 n(n 1) δiϕ( Yi, Xi) SC( Yi | Xi) Yj Yi (1 δj)Kijc( Yj | Xi) H( Yj, Xi)2 ˆH(i) n ( Yj, Xi) Hh( Yj, Xi) = 1 n(n 1)2 X (i,j,k) i =j,i =k = V n,1(ϕ) + V n,1(ϕ), where, for all 1 i, j, k n, vi,j,k(ϕ) = δiϕ( Yi, Xi) SC( Yi | Xi) Yj Yi (1 δj)Kijc( Yj | Xi) H( Yj, Xi)2 Yk> Yj Kik Hh( Yj, Xi) V n,1(ϕ) = 1 n(n 1)2 X (i,j,k) i =j,i =k,j =k V n,1(ϕ) = 1 n(n 1)2 X The lemma stated below provides a uniform bound for V n,1(ϕ). Lemma 23 Under the assumptions of Proposition 3, we have, with probability 1: V n,1(ϕ) M1 where M1 > 0 depends only on MΦ, K, L and b. We now consider V n,1(ϕ). Set Zk = (Xk, Yk, δk) for k {1, . . . , n}. 
It can be decomposed as follows: V n,1(ϕ) = n 2 n U(1) n,1(ϕ) + U(2) n,1(ϕ) + U(3) n,1(ϕ) + L n(ϕ) o , Empirical Risk Minimization under Random Censorship U(1) n,1(ϕ) = 1 n(n 1)(n 2) (i,j,k) i =j,i =k,j =k vi,j,k(ϕ) E[vi,j,k(ϕ)|Zj, Zk] E[vi,j,k(ϕ)|Zi, Zk] + E[vi,j,k(ϕ)|Zk] , U(2) n,1(ϕ) = 1 n(n 1) {E[vi,j,k(ϕ)|Zj, Zk] E[vi,j,k(ϕ)|Zk]} , U(3) n,1(ϕ) = 1 n(n 1) {E[vi,j,k(ϕ)|Zi, Zk] E[vi,j,k(ϕ)|Zk]} , k E[vi,j,k(ϕ)|Zk], where i, j and k always denote pairwise distinct indexes, the varying amounts of indexes in the summations being the results of the successive marginalizations necessary to obtain degenerate U-processes. Observe that, for all ϕ Φ and pairwise distinct indexes i, j and k in {1, . . . , n}, we have with probability one: E[vi,j,k(ϕ) | Zi, Zj] = E[vi,j,k(ϕ) | Zi] = E[vi,j,k(ϕ) | Zj] = 0. The quantities U(k) n,1(ϕ), k {1, 2, 3} are thus degenerate U-statistics of degree 3, 2 and 2 respectively, whereas L n(ϕ) is a basic (centred) i.i.d. average. The following result is essentially proved by applying Corollary 8, once the complexity assumptions related to the classes of kernels involved in the definition of these degenerate U-processes have been established. It shows that the terms U(k) n,1(ϕ) s are uniformly negligible. Lemma 24 Suppose that the hypotheses of Proposition 3 are fulfilled. There exist constants M1, M2 and h0 depending on (A, v), MΦ, L, K and b only, such that for any ϵ (0, 1), each of the following events holds true with probability at least 1 ϵ: sup ϕ Φ |U(1) n,1(ϕ)| M1| log(ϵhd/2)| sup ϕ Φ |U(2) n,1(ϕ)| M1| log(ϵhd/2)| sup ϕ Φ |U(3) n,1(ϕ)| M1| log(ϵhd/2)| as soon as h h0 and M2| log(ϵhd)| nh2d. Maximal deviation inequalities for the L n(ϕ) can be obtained by means of classical results in empirical process theory, like for Ln(ϕ). Ausset, Cl emenc on and Portier Lemma 25 Suppose that the hypotheses of Proposition 3 are fulfilled. 
Then, for any ϵ (0, 1), we have with probability at least 1 ϵ: sup ϕ Φ |L n(ϕ)| M1 log(M2/ϵ) as soon as M2| log(ϵhd)| nh2d and h h0 where h0, M1 > 0 and M2 > 1 are constants depending on (A, v), K, MΦ, L and b only. The two preceding lemmas combined with the union bound directly yield the following result. Corollary 26 Suppose that the hypotheses of Proposition 3 are fulfilled. There exist constants M1, M2, M3 and h0 depending on (A, v), MΦ, L, K and b only such that for any ϵ (0, 1), we have with probability greater than 1 ϵ: V n,1(ϕ) M1 n + | log(ϵhd/2)| | log(ϵhd/2)| as soon as h h0, M3| log(ϵhd)| nh2d. We next deal with the term Vn,2(ϕ). Lemma 27 Suppose that the hypotheses of Proposition 3 are fulfilled. There exist constants M1, M2, M3 and h0 depending on (A, v), MΦ, L, K and b only such that for any ϵ (0, 1), we have with probability greater than 1 ϵ: sup ϕ Φ |Vn,2(ϕ)| M1 n + | log(ϵhd/2)| as soon as h h0, M3| log(ϵhd/2)| nhd. Finally, we consider the residual Rn(ϕ). Recall first that, for all ϕ Φ, we have Rn(ϕ) = R n(ϕ) + R n(ϕ), where δiϕ( Yi, Xi) SC( Yi | Xi) ˆb(i) n ( Yi | Xi), i=1 δiϕ( Yi, Xi) SC( Yi | Xi) ˆS(i) C,n( Yi | Xi) 2 S2 C( Yi | Xi) ˆS(i) C,n( Yi | Xi) . Each of the quantities, R n(ϕ) and R n(ϕ), is treated separately. We start with R n(ϕ). Lemma 28 Suppose that the assumptions of Proposition 3 are satisfied. Then, for all ϵ (0, 1), we have with probability greater than 1 ϵ | log(ϵhd/2)| nhd + 1 (nhd)2 + h4 ! as soon as h h0 and M2| log(ϵhd/2)| nhd, where M1 and M2 are nonnegative constants depending on K, L, MΦ and b only. Empirical Risk Minimization under Random Censorship We now state a uniform bound for R n(ϕ). Lemma 29 Suppose that the assumptions of Proposition 3 are satisfied. Then, for all ϵ (0, 1), we have with probability greater than 1 ϵ |log(ϵhd/2)| |log(ϵhd/2)| (nhd)3/2 + 1 nhd + 1 (nhd)2 + h2 ! as soon as h h0 and M2| log(ϵhd/2)| nhd, where M1 and M2 are nonnegative constants depending on K, L, MΦ and b only. 
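The leave-one-out quantities ˆS(i)C,n(·|x) controlled in Lemmas 28 and 29 are kernel conditional Kaplan-Meier (Beran) estimators of the censoring survival function. For readers who want a concrete picture of this estimator, here is a minimal sketch (our own illustration, not the authors' code; it assumes a Gaussian product kernel, a single query point and no tie handling, and all variable names are ours):

```python
import numpy as np

def beran_survival(t, x, X, T, delta, h):
    """Beran (kernel conditional Kaplan-Meier) estimate of the censoring
    survival function S_C(t | x) from data (X_i, T_i, delta_i), where
    T_i = min(Y_i, C_i) and delta_i = 1{Y_i <= C_i}, so delta_i = 0 marks
    an observed censoring time. Gaussian product kernel, bandwidth h."""
    # Nadaraya-Watson smoothing weights w_i(x), proportional to K_h(X_i - x)
    k = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
    w = k / k.sum()
    # sort by observed time; cum_w[i] = total weight of points strictly before i
    order = np.argsort(T)
    T_s, d_s, w_s = T[order], delta[order], w[order]
    cum_w = np.concatenate(([0.0], np.cumsum(w_s)))
    surv = 1.0
    for i in range(len(T_s)):
        if T_s[i] > t:
            break
        if d_s[i] == 0:                  # censoring event occurring before t
            at_risk = 1.0 - cum_w[i]     # kernel weight still at risk
            if at_risk > 0.0:
                surv *= 1.0 - w_s[i] / at_risk
    return surv
```

Setting all weights to 1/n recovers the ordinary Kaplan-Meier product-limit estimator; the smoothing weights are what make the estimate conditional on x.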
Now we can conclude the proof of Proposition 3 by gathering each of the previous results. First note that they are all valid under the condition that h h0 and M1| log(ϵhd/2)| nhd and n M2 log(M3/ϵ). By taking h0 small enough, the last requirement is no longer necessary. In addition, if nhd > 1 and | log(ϵhd/2)| > 1 we guarantee that | log(ϵhd/2)|/nhd 1/nhd 1/(nhd)2 and | log(ϵhd/2)|1/2 | log(ϵhd/2)|3/2, leading to | log(ϵhd/2)| | log(ϵhd/2)| !3/2 | log(ϵhd/2)| Using this manipulation, we obtain the stated result. Appendix F. Intermediary Results Here we prove the lemmas involved in the proof of Proposition 3. Recall that, under the assumptions stipulated: (t, x) K, H(t, x) b3, SC(t | x) b, c(t | x) 1/b, Hh(t | x) L. (34) F.1 Proof of Lemma 21 The proof is a direct application of Corollary 8 to the i.i.d. sequence {(Xn, Yn, δn) : n 1} and the class of functions (x, u, δ) K {0, 1} ↦ δϕ(u, x) indexed by (ϕ, h) Φ ]0, h0]. The previous class is of VC type by virtue of Lemma 12. We choose σ = G = 2MΦ/b; the bound obtained for Ln(ϕ) is simply (2MΦ/b)n 1/2 C2 1 log (2) 1/2 + log(C2/ϵ) (4MΦ/b)n 1/2 C2 1 log (2) + log(C2/ϵ) where the constants C1, C2, C3 are those of Corollary 8. Easy manipulations give the result. F.2 Proof of Lemma 22 Taking the supremum of each element in the sum, we find that |Bn,1(ϕ)| MΦ b8 sup (u,x) K |Hh(u, x) H(u, x)| sup (u,x) K | ˆH(i) 0,n(u, x)|. An appeal to Lemma 18 and Lemma 19, combined with (30), gives the first result. Concerning Bn,2, we write |Bn,2(ϕ)| MΦ b sup (t,x) K c(u | x) H(u, x) (H0,h(du, x) H0(du, x)) . Because for any signed measure ν on R+ and any measurable function f with total variation at most 1 vanishing at infinity, we have (Dudley, 1992) | Z f(u) ν(du)| sup t R | Z t 0 ν(du)|, we conclude that |Bn,2(ϕ)| M sup (u,x) K |H0,h(u, x) H0(u, x)|, where M > 0 depends only on L, b and MΦ. Conclude by using the bound given in Lemma 18.
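The h2 rate in Lemma 22 is the standard smoothing-bias rate for a second-order kernel. As an exact sanity check (our own illustration, independent of the paper's estimators): when a Gaussian kernel smooths the standard normal density, the smoothed density is itself a normal density, so the bias at a point is available in closed form and the h2 rate can be verified without simulation.

```python
import numpy as np

PHI0 = 1.0 / np.sqrt(2.0 * np.pi)  # standard normal density at 0

def smoothing_bias_at_zero(h):
    """Exact bias E f_hat(0) - f(0) of a Gaussian-kernel density estimate
    of the N(0, 1) density at x = 0: the expectation of the estimate is
    the N(0, 1 + h^2) density, since the smoothed density is a convolution
    of two Gaussians."""
    return PHI0 / np.sqrt(1.0 + h * h) - PHI0

# second-order kernel: bias(h) / h^2 converges to f''(0) / 2 = -phi(0) / 2
ratios = [smoothing_bias_at_zero(h) / h ** 2 for h in (0.2, 0.1, 0.05)]
```

The ratio bias(h)/h2 stabilizes near f''(0)/2 as h shrinks, which is exactly the constant governing the M1h2 bounds used throughout the Appendix.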
F.3 Proof of Lemma 23 Observe that, for i ≠ j, we have vi,j,j(ϕ) = δiϕ( Yi, Xi) SC( Yi | Xi) I{ Yj Yi}(1 δj)Kijc( Yj | Xi) H( Yj, Xi)2 Hh( Yj, Xi). It follows from (34) that |vi,j,j(ϕ)| MΦ b8 K h d L, and since V n,1(ϕ) is a sum over n(n 1) such terms divided by n(n 1)2, we get the stated bound. F.4 Proof of Lemma 24 We will use the expression vi,j,k(ϕ) = wi,j,k(ϕ)Kik Kij E[wi,j,k(ϕ)Kik Kij | Zi, Zj] wi,j,k(ϕ) = δiϕ( Yi, Xi) SC( Yi | Xi) I{ Yj Yi}(1 δj)c( Yj | Xi) H( Yj, Xi)2 I{ Yk > Yj}. Using (34), we have that |wi,j,k| MΦ Recall that E[vi,j,k(ϕ) | Zi, Zj] = E[vi,j,k(ϕ) | Zi] = E[vi,j,k(ϕ) | Zj] = 0. As a result, the quantities U(k) n,1(ϕ), k {1, 2, 3}, are degenerate U-statistics of degree 3, 2 and 2 respectively. For this reason we can apply Corollary 8 to each of them as soon as their respective kernels are shown to form VC classes. The kernel of h2dn(n 1)2U(1) n,1 is h2d{vi,j,k(ϕ) E [vi,j,k(ϕ)|Zj, Zk] E [vi,j,k(ϕ) | Zi, Zk] + E [vi,j,k(ϕ) | Zk]}. Lemma 10 and Corollary 17 in Nolan and Pollard (1987) imply that it is of VC type with constant envelope 8MΦ/b8 K 2 as soon as {vi,j,k(ϕ)} is of VC type with envelope MΦ/b8 K 2 . The latter holds by virtue of Lemma 9 and Lemma 12. The same arguments imply that the kernels of {h2dn(n 1)U(2) n,1(ϕ)} and {h2dn(n 1)U(3) n,1(ϕ)} are of VC type with the constant envelope 4MΦ/b8 K 2 . In what follows we specify, for each U(k) n,1(ϕ), the value of σ to use in the application of Corollary 8. The bound for U(1) n,1(ϕ). Observe that E h2dwi,j,k(ϕ)Kik Kij 2 MΦ 2 L2c4 Kh2d, where c2 K = R K2(x)dx. Since we have a sum of 8 terms in the U-statistic of interest, h2dn(n 1)2U(1) n,1(ϕ), each having an L2-norm smaller than E[h4dvi,j,k(ϕ)2] (by Jensen's inequality), we obtain a bound for the resulting variance (using Minkowski's inequality), namely 82(MΦ/b8)2L2c4 Kh2d. We apply Corollary 8 with k = 3 and a value for σ larger than the previous bound.
We take σ2 = 82(MΦ/b8)2L2c4 Khdhd 0 (note that h h0) and G = 8MΦ K 2 /b8. The conditions are K 4 L2c4 Khd 0 2 K 2 hd/2hd/2 0 Lc2 K + log(C2/ϵ) and L2c4 Khdhd 0 K 4 , where C1, C2 and C3 are the constants in Corollary 8. The latter conditions are indeed of the type h h0 and nhd M2| log(ϵhd/2)|. This gives sup ϕ Φ |h2dn(n 1)2U(1) n,1(ϕ)| M1hd/2n3/2 !!3/2 + log(C2/ϵ) where M1 and M2 are constants depending on MΦ, L, K, b, and h0. To recover the stated result, one just needs to multiply the previous bound by 1/(n(n 1)(n 2)h2d) and to use similar manipulations as the ones presented at the end of the proof of Lemma 19. Ausset, Cl emenc on and Portier The bound for U(2) n,1(ϕ). In what follows, we use the shortcut E[ |Zi, Zj] = E[ |i, j]. The kernel of h2dn(n 1)U(2) n,1(ϕ) is given by h2d{E[vi,j,k(ϕ)|j, k] E[vi,j,k(ϕ)|k]} = h2d{E[wi,j,k(ϕ)Kik Kij|j, k] E[wi,j,k(ϕ)Kik Kij|j] E[wi,j,k(ϕ)Kik Kij|k] + E[wi,j,k(ϕ)Kik Kij]}. By Jensen s inequality and Minkowski s inequality, the variance is then smaller than 42h4d E[E[wi,j,k(ϕ)Kik Kij|j, k]2] 42h4d(MΦ/b8)2E [Kij Kik | j, k]2 . But we have E [Kij Kik | j, k] = Z Kh(x Xj)Kh(x Xk)g(x)dx L Z K(u)Kh(Xj Xk + hu)du LK h(Xk Xj), where K = K K and K h(u) = K (u/h)/hd (note that R K (u) du = 1 and K K ). This implies that 42h4d E h E [wi,j,k(ϕ)Kik Kij|j, k]2i 42h4d MΦL where c2 K = R K2(x)dx. The bound (33) is thus obtained by applying Corollary 8 to h2dn(n 1)U(2) n,1(ϕ) with k = 2 and σ2 = 42h2dhd 0 The bound for U(3) n,1(ϕ). Based on conditioning arguments, we have E[vi,j,k(ϕ)|Zk] = Ak(ϕ) E[Ak(ϕ)], where Ak(ϕ) = E[wi,j,k(ϕ)Kik Kij|k]. We now show that the class of functions n Zk 7 hd Ak(ϕ) : ϕ Φ o is a VC class with constant envelop. 
Define β1(Zi, Zk) = Kik δiϕ( Yi, Xi) SC( Yi | Xi) β2(Zi, Zk) = Z M(Xi + hu, Zi, Zk)K(u)du M(Xj, Zi, Zk) = Z I{u Yi, u < Yk}c(u | Xi) H(u, Xi)2 H0(du | Xj), and observe that E[wi,j,k(ϕ)Kik Kij|i, k] = β1(Zi, Zk)E " I{ Yj Yi}(1 δj)c( Yj | Xi) H( Yj, Xi)2 I{ Yk > Yj}Kij | i, k = β1(Zi, Zk)E[M(Xj, Zi, Zk)Kij | i, k] = β1(Zi, Zk) Z M(Xi + hu, Zi, Zk)g(Xi + hu)K(u)du, where we have used Assumption 1 and the fact that H0(du|x) = SY (u |x)SC(du|x). For any f with total variation at most 1 vanishing at infinity, we have (Dudley, 1992) Z f(u) {H0(du | x) H0(du | x )} sup u R |H0(u | x) H0(u | x )|, where the last inequality is a consequence of Assumption 2. The same holds true for g by virtue of Assumption 2. Hence, the map M is uniformly Lipschitz with respect to Xj. Appealing to Lemma 11, we obtain that the kernel hd E[wi,j,k(ϕ)Kik Kij|i, k] is VC with constant envelope K MΦL/b8. The same holds true for hd Ak(ϕ) by Lemma 10. Moreover, observe that for all (ϕ, h) Φ ]0, h0], using (35), we have almost surely, b8 E[Kij Kik|Zk]. Since E[Kij Kik|Zk] = ZZ Kh(x y)Kh(x Xk)g(x)g(y)dxdy = ZZ K(z)Kh(x Xk)g(x)g(x hz)dxdz L Z Kh(x Xk)g(x) Z K(z)dz, it follows that E[(hd Ak(ϕ))2] h2d MΦL2 Applying Corollary 8 to the kernel hd{Ak(ϕ) E[Ak(ϕ)]} with k = 1 and G = 2 K MΦL/b8 and σ2 = h2d(MΦL2/b8)2 yields the bound sup ϕ Φ |nhd L n(ϕ)| MΦL2 nhd log(C2/ϵ)/C3 , with probability 1 ϵ, provided a condition of the type nh2d M2| log(hdϵ)|. Straightforward calculations then give the desired result.
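The proof above hinges on degeneracy: each U(k)n,1 has vanishing conditional means, which is what buys the fast rates of Lemma 24. The Monte Carlo sketch below (our own toy kernels, sample sizes and seed, purely illustrative) contrasts the OP(1/n) scale of a degenerate degree-2 U-statistic with the OP(1/ n) scale of a non-degenerate one.

```python
import numpy as np

def u_stat(v, Z):
    """Degree-2 U-statistic U_n = (1 / (n(n-1))) * sum_{i != j} v(Z_i, Z_j)."""
    n = len(Z)
    V = v(Z[:, None], Z[None, :])   # n x n matrix of kernel values
    np.fill_diagonal(V, 0.0)        # exclude diagonal terms i = j
    return V.sum() / (n * (n - 1))

def sd_of_u_stat(v, n, reps=300, seed=0):
    """Monte Carlo standard deviation of U_n over repeated N(0,1) samples."""
    rng = np.random.default_rng(seed)
    return np.std([u_stat(v, rng.standard_normal(n)) for _ in range(reps)])

# degenerate kernel: E[z1 * z2 | z1] = z1 * E[Z] = 0, so U_n = O_P(1/n);
# quadrupling n should divide the standard deviation by about 4
sd_degenerate = sd_of_u_stat(lambda a, b: a * b, 100), sd_of_u_stat(lambda a, b: a * b, 400)

# non-degenerate kernel: v(z1, z2) = z1 + z2 reduces to 2 * mean(Z), O_P(1/sqrt(n));
# quadrupling n should only halve the standard deviation
sd_plain = sd_of_u_stat(lambda a, b: a + b, 100), sd_of_u_stat(lambda a, b: a + b, 400)
```

This gap between the two scales is what Corollary 8 exploits, and why the degenerate terms are negligible in front of the linear term Ln(ϕ).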
F.5 Proof of Lemma 27 For all ϕ Φ, we first set wij(ϕ) = δiϕ( Yi, Xi)I{ Yj Yi}(1 δj)Kijc( Yj | Xi) SC( Yi | Xi)H( Yj | Xi) and observe next that Vn,2(ϕ) = 1 δiϕ( Yi, Xi) SC( Yi | Xi) c(u | Xi) H(u, Xi)d ˆH(i) 0,n(u, Xi) H0,h(u, Xi) δiϕ( Yi, Xi) SC( Yi | Xi) I{ Yj Yi}(1 δj)Kijc( Yj | Xi) H( Yj | Xi) Z Yi c(u | Xi) H(u, Xi)d H0,h(u, Xi) i =j {wij(ϕ) E[wij(ϕ) | Zj]} = U(1) n,2(ϕ) + U(2) n,2(ϕ), U(1) n,2(ϕ) = 1 n(n 1) wij(ϕ) E[wij(ϕ) | Zj] E[wij(ϕ) | Zi] + E[w12(ϕ)] , (36) U(2) n,2(ϕ) = 1 i=1 {E[wij(ϕ) | Zi] E[w12(ϕ)]} . (37) Hence, Vn,2(ϕ) can be decomposed as the sum of a degenerate U-statistic (36) and an i.i.d. average (37). Note also that, by (34), we have |wij(ϕ)| MΦ The bound for U(1) n,2(ϕ). By virtue of Lemma 9 and Lemma 12, the collection of kernels of the degenerate U-statistics n hdn(n 1)U(1) n,2(ϕ) : (ϕ, h) Φ ]0, h0] o forms a class of VC type with constants depending only on (v, A), K and h0. In addition, these terms are all bounded by 4 MΦ K /b5 and we have: Var hdwij(ϕ) 4MΦ 2 hd Lc2 K. It thus results from the application of Corollary 8 with k = 2 and σ2 = 42 MΦ/b5 2 hd Lc2 K that, with probability greater than 1 ϵ hdn(n 1)U(1) n,2(ϕ) nhd/2 M1 + log (C2/ϵ) where M1 and M2 depend on MΦ, K, L, b, provided a condition of the type h h0 and nhd M2| log(ϵhd/2)|. The bound for U(2) n,2(ϕ). Following the proof of Lemma 25, the collection of kernels of U(2) n,2(ϕ) is of VC type with constant envelope. Besides, we have, with probability one: E[wij(ϕ) | Zi] MΦL and therefore Var (E[wij(ϕ) | Zi]) MΦL b5 2 . Thus, applying Corollary 8 with k = 1, σ2 = 4 MΦL/b5 2 and G = 2 MΦL/b5, we obtain that, with probability 1 ϵ, n U(2) n,2(ϕ) C n where C depends on MΦ, K, L, b, provided a condition of the type n M3| log(M4/ϵ)| holds true; this is already implied by h h0 and nhd M2| log(ϵhd/2)| whenever h0 is small enough. The bound stated in the lemma results from rearranging the bounds (38) and (39).
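The remaining lemmas control how plug-in estimation of SC perturbs the weighted empirical risk. As background, here is a minimal sketch of that inverse-probability-of-censoring-weighted (IPCW) risk estimate (our own illustration; `survival_fn` stands for any estimate of the conditional censoring survival function, e.g. a Beran estimator, and all names are ours):

```python
import numpy as np

def ipcw_risk(predict, X, T, delta, survival_fn, eps=1e-12):
    """IPCW estimate of the least-squares risk E[(Y - f(X))^2]: only the
    uncensored points (delta_i = 1, so T_i = Y_i) contribute, reweighted by
    1 / S_C(T_i | X_i) to compensate for the points lost to censoring."""
    preds = np.array([predict(x) for x in X])
    s = np.array([survival_fn(t, x) for t, x in zip(T, X)])
    w = delta / np.maximum(s, eps)   # clip to avoid division by ~0 weights
    return float(np.mean(w * (T - preds) ** 2))
```

With no censoring (delta identically 1 and SC identically 1) this reduces to the ordinary empirical risk (2); in general, the identity E[δ ϕ( Y , X)/SC( Y | X)] = E[ϕ(Y, X)] is what makes the reweighting correct the downward bias caused by censored durations.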
F.6 Proof of Lemma 28 Use the triangle inequality, (30) and (31) with Lemmas 19 and 18 to get that, with probability 1 ϵ, max i=1,...n sup (t,x) K | ˆH(i) 0,n(t, x) H0(t, x)| M1 | log(ϵhd/2)| max i=1,...n sup (t,x) K | ˆH(i) n (t, x) H(t, x)| M2 | log(ϵhd/2)| We suppose further that both previous inequalities are realized. Note that under the mentioned condition on (n, h), it holds that (t, x) K inf i=1,...,n ˆH(i) n (t, x) b3 Ausset, Cl emenc on and Portier In a similar fashion as in the proof of Proposition 1 (see (27),(28) and (29)), we apply Lemma 15 to get that sup (t,x) K | ˆS(i) C,n(t|x) SC(t|x)| (2/b) sup (t,x) K |ˆΛ(i) C,n(t|x) ΛC(t|x)|. Then, we apply Lemma 16, with θ1 = b3, θ2 = b3/2, β = 1, to finally obtain that: i {1, . . . , n}, sup (t,x) K | ˆS(i) C,n(t|x) SC(t|x)| 2 b3 sup (t,x) K | ˆH(i) 0,n(t, x) H0(t, x)| + 2 b6 sup (t,x) K | ˆH(i) n (t, x) H(t, x)| | log(ϵhd/2)| Hence provided that h h0 and nhd M2| log(ϵhd/2)|, we have R n(ϕ) 2MΦ b3 sup (t,x) K SC(t | x) ˆS(i) C,n(t | x) 2 . F.7 Proof of Lemma 29 Recall first that δiϕ( Yi, Xi) SC( Yi | Xi) ˆb(i) n ( Yi | Xi), ˆb(i) n (t | x) = Z t H(u, x)2 ˆH(i) n (u, x) ( ˆH(i) n (u, x) H(u, x))2 ˆH(i) 0,n(du, x) ( ˆS(i) C,n(u | x) SC(u | x)) SC(u | x) ˆ (i) n (du | x). and ˆ (i) n (du | x) = ˆΛ(i) C,n(du | x) ΛC(du | x). The following argument is based on Lemma 17, stated in section B. 
Note that, on the event En, we have: ˆb(i) n (t | x) 2 Z ˆH(i) n (u, x) Hh(u, x) 2 ˆH(i) n (du, x), ( ˆS(i) C,n(u | x) SC(u | x)) SC(u | x) ˆ (i) n (du | x) b10 sup (u,x) Γb ˆH(i) n (u, x) Hh(u, x) 2 ( ˆS(i) C,n(u | x) SC(u | x)) SC(u | x) ˆ (i) n (du | x) Empirical Risk Minimization under Random Censorship The application of the Lemma 17, with S(2)(u) = SC(u | x), S(1)(u) = ˆS(i) C,n(u | x), β = 1, θ = b and Λ(1)(u) = ˆΛ(i) C,n(u | x) = Z u ˆH(i) 0,n(ds, x) ˆH(i) n (s , x) , Λ(2)(u) = ΛC(u | x) = Z u ˆS(i) C,n(u | x) SC(u | x) SC(u | x) ˆ (i) n (du | x) sup (u,x) Γb ˆH(i) n (u, x) H(u, x) 2 + sup (u,x) Γb ˆH(i) 0,n(u, x) H0(u, x) 2 + sup (u,x) Γb ˆW (i) n (u, x) , where C > 0 depends on b and ˆW (i) n (t, x) is defined as ˆH(i) 0,n(ds, x) H0(ds, x) ˆH(i) 0,n(du, x) H0(du, x) SC(u | x)H(u, x) . Using (30) and (31) combined with Lemma 18 and Lemma 19, we obtain that with probability at least 1 ϵ: sup (u,x) Γb ˆH(i) n (u, x) H(u, x) 2 + sup (u,x) Γb ˆH(i) 0,n(u, x) H0(u, x) 2 1 (nhd)2 + | log(ϵhd/2)| as soon as h h0 and M2| log(ϵhd/2)| nhd. It remains to show that, with probability at least 1 ϵ: max i {1, ..., n} sup (u,x) Γb ˆW (i) n (u, x) M1 | log(hd/2ϵ)| as soon as h h0 and M2| log(ϵhd/2)| nhd. We first define ˆWn,1(t, x), for all (t, x) K, ˆH0,n(ds, x) H0(ds, x) d ˆH0,n(du, x) H0(du, x) SC(u | x)H(u, x) Ausset, Cl emenc on and Portier and notice that, since c(s | x)/(H(s, x)SC(u | x)H(u, x)) 1/b8 (using (34)), we have by virtue of (30) max i {1, ..., n} sup (t,x) K ˆWn,1(t, x) ˆW (i) n (t, x) C/(nhd), where C is a constant depending on b and K only. Let α1(u, x) = (SC(u | x)H(u, x)) 1 Z u 0 c(s | x)(H0,h(ds, x) H0(ds, x)) and note that sup (u,x) K |α1(u, x)| M1 sup (u,x) K |H0,h(u, x) H0(u, x)| M2h2. 
We define ˆWn,2(t, x) as ˆH0,n(ds, x) H0,h(ds, x) ˆH0,n(du, x) H0,h(du, x) SC(u | x)H(u, x) , we have ˆWn(t, x) = ˆWn,2(t, x) ˆH0,n(ds, x) H0,h(ds, x) H(s, x) (H0,h(du, x) H0(du, x)) SC(u | x)H(u, x) 0 α1(u, x) ˆH0,n(du, x) H0(du, x) 0 α1(u, x) (H0,h(du, x) H0(du, x)) . Applying Fubini s theorem in the second term, we see that the last three terms are similar. We give the details only for the second one. We have 0 α1(u, x)( ˆH0,n(du, x) H0(du, x)) 0 |α1(u, x)|( ˆH0,n(du, x) + H0(du, x)) sup (u,x) K |α1(u, x)| sup (u,x) K | ˆH0,n(u, x) + H0(u, x)| M2h2 sup (u,x) K | ˆH0,n(u, x) + H0(u, x)| In addition, observe that ˆWn,2(t, x) = n 2 n X j=1 vij(t, x) i =j vij(t, x) + n 2 n X i=1 vii(t, x) := Un(t, x) + Mn(t, x), Empirical Risk Minimization under Random Censorship where, for all 1 i, j n, we set vij(t, x) = uij(t, x) E[uij(t, x)|Zi] E[uij(t, x)|Zj] + E[u1,2(t, x)], uij(t, x) = ξi,j(x)I{ Yi t}Kh(Xi x)Kh(Xj x), ξi,j(x) = δiδjc( Yj | x) SC( Yi, x)H( Yi, x)H( Yj, x) I{ Yj Yi}. Because we have, for all (t, x) K, E[v12(t, x)|Z1] = E[v12(t, x)|Z2] = 0, the collection of random variables {n2h2d Un(t, x) : (t, x, h) K ]0, h0]} is a degenerate U-process of order 2. The related class of kernels is uniformly bounded by 4||K||2 /b8 and of VC type, by virtue of classic permanence properties recalled in Appendix A. Observe in addition that Var h2dv12(t, x) h4d42E[u2 12(t, x)] 2 E[K2 1x K2 2x] Applying Corollary 8 with k = 2 and σ2 = hdhd 0 4 b8 L2c4 K, G = 4 K 2 9b8 , we obtain that, with probability greater than 1 ϵ, sup (t,x) K |Un(t, x)| M1 | log(ϵhd/2)| as soon as M2| log(ϵhd/2)| nhd and h h0. Notice now that, for all (t, x) K, Mn(t, x) = Ln(t, x) + Rn(t, x), where Ln(t, x) = n 2 n X i=1 {vii(t, x) E[v11(t, x)]} , Rn(t, x) = n 1E[v11(t, x)]. Ausset, Cl emenc on and Portier Observing that, for all (t, x) K, |h2dv11(t, x)| 4||K||2 /b8 and Var h2dv11(t, x) h4d42E[u2 11(t, x)] 2 L Z K4(x)dx. 
Hence, the application of Corollary 8 with k = 1 to the empirical sums {n2h2d Ln(t, x) : (t, x, h) K ]0, h0]} yields that, with probability at least 1 ϵ, sup (t,x) K |Ln(t, x)| M1 | log( M2/hd/2)| + p log(C2/ϵ)/C3 (nhd)3/2 , where M1 and M2 are constants depending on K, b and L. The previous bound is valid whenever M2| log(ϵhd/2)| nhd and h h0. We also have n 1E[|v11(t, x)|] 4n 1E[|u11(t, x)|] (4/b8)Lc2 K/(nhd). This leads to the stated results. References P. K. Andersen, Ø. Borgan, R. D. Gill, and N. Keiding. Statistical Models Based on Counting Processes. Springer Series in Statistics. Springer US, New York, NY, 1993. ISBN 978-0-387-94519-4. doi: 10.1007/978-1-4612-4348-9. G. Ausset, F. Portier, and S. Clémençon. Machine Learning for Survival Analysis: Empirical Risk Minimization for Censored Distribution Free Regression with Applications. In NeurIPS ML4H Workshop, Montreal, Canada, 2018. H. Bang and A. A. Tsiatis. Median Regression with Censored Cost Data. Biometrics, 58(3):643–649, 2002. ISSN 0006-341X. P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, Aug. 2005. ISSN 0090-5364, 2168-8966. doi: 10.1214/009053605000000282. R. Beran. Nonparametric regression with randomly censored survival data. Technical report, 1981. J. Buckley and I. James. Linear Regression with Censored Data. Biometrika, 66(3):429–436, 1979. ISSN 0006-3444. doi: 10.2307/2335161. S. Clémençon and F. Portier. Beating Monte Carlo Integration: A Nonasymptotic Study of Kernel Smoothing Methods. In International Conference on Artificial Intelligence and Statistics, pages 548–556. PMLR, Mar. 2018. S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and Empirical Minimization of U-statistics. Annals of Statistics, 36(2):844–874, Apr. 2008. ISSN 0090-5364, 2168-8966. doi: 10.1214/009052607000000910. D. R. Cox. Regression Models and Life-Tables.
Journal of the Royal Statistical Society. Series B (Methodological), 34(2):187–220, 1972. ISSN 0035-9246. D. R. Cox and D. Oakes. Analysis of Survival Data. Chapman and Hall/CRC, Boca Raton, 1984. ISBN 978-1-315-13743-8. doi: 10.1201/9781315137438. D. M. Dabrowska. Uniform Consistency of the Kernel Conditional Kaplan-Meier Estimate. Annals of Statistics, 17(3):1157–1167, Sept. 1989. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1176347261. V. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Probability and Its Applications. Springer-Verlag, New York, 1999. ISBN 978-0-387-98616-6. doi: 10.1007/978-1-4612-0537-1. B. Delyon and F. Portier. Integral approximation by kernel smoothing. Bernoulli, 22(4):2177–2208, Nov. 2016. ISSN 1350-7265. doi: 10.3150/15-BEJ725. B. Delyon and F. Portier. Safe and adaptive importance sampling: A mixture approach. Annals of Statistics, Mar. 2020. Y. Du and M. G. Akritas. Uniform strong representation of the conditional Kaplan-Meier process. Mathematical Methods of Statistics, 11(2):152–182, 2002. R. M. Dudley. Fréchet Differentiability, p-Variation and Uniform Donsker Classes. Annals of Probability, 20(4):1968–1982, Oct. 1992. ISSN 0091-1798, 2168-894X. doi: 10.1214/aop/1176989537. U. Einmahl and D. M. Mason. An Empirical Process Approach to the Uniform Consistency of Kernel-Type Function Estimators. Journal of Theoretical Probability, 13(1):1–37, Jan. 2000. ISSN 1572-9230. doi: 10.1023/A:1007769924157. T. Fleming and D. Harrington. Counting Processes and Survival Analysis. 1991. doi: 10.2307/2290673. T. A. Gerds, J. Beyersmann, L. Starkopf, S. Frank, M. J. van der Laan, and M. Schumacher. The Kaplan-Meier Integral in the Presence of Covariates: A Review. From Statistics to Mathematical Finance, pages 25–42, 2017. doi: 10.1007/978-3-319-50986-0. E. Giné and A. Guillou.
On consistency of kernel density estimators for randomly censored data: Rates holding uniformly over adaptive intervals. Annales de l'Institut Henri Poincaré (B) Probability and Statistics, 37(4):503–522, July 2001. ISSN 0246-0203. doi: 10.1016/S0246-0203(01)01081-0. E. Giné and H. Sang. Uniform asymptotics for kernel density estimators with variable bandwidths. Journal of Nonparametric Statistics, 22(6):773–795, Aug. 2010. ISSN 1048-5252. doi: 10.1080/10485250903483331. E. Giné, V. Koltchinskii, and J. Zinn. Weighted uniform consistency of kernel density estimators. Annals of Probability, 32(3B):2570–2605, July 2004. ISSN 0091-1798, 2168-894X. doi: 10.1214/009117904000000063. R. L. Grossman, A. P. Heath, V. Ferretti, H. E. Varmus, D. R. Lowy, W. A. Kibbe, and L. M. Staudt. Toward a Shared Vision for Cancer Genomic Data. New England Journal of Medicine, 375(12):1109–1112, Sept. 2016. ISSN 0028-4793. doi: 10.1056/NEJMp1607591. L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer-Verlag, New York, 2002. ISBN 978-0-387-95441-7. doi: 10.1007/b97848. T. Hothorn, P. Bühlmann, S. Dudoit, A. Molinaro, and M. J. van der Laan. Survival ensembles. Biostatistics, 7(3):355–373, 2006. ISSN 1465-4644. doi: 10.1093/biostatistics/kxj011. H. Ishwaran and U. B. Kogalur. Random survival forests for R. R News, 7(2):25–31, 2007. H. Ishwaran, U. B. Kogalur, E. H. Blackstone, and M. S. Lauer. Random survival forests. The Annals of Applied Statistics, 2(3):841–860, 2008. ISSN 1932-6157. doi: 10.1214/08-AOAS169. E. L. Kaplan and P. Meier. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association, 53(282):457–481, 1958. ISSN 0162-1459. doi: 10.2307/2281868. M. Kohler, K. Máthé, and M. Pintér. Prediction from Randomly Right Censored Data. Journal of Multivariate Analysis, 80(1):73–100, Jan. 2002.
ISSN 0047-259X. doi: 10.1006/jmva.2000.1973. G. Lecué and S. Mendelson. Learning subgaussian classes: Upper and minimax bounds. arXiv:1305.4825 [math, stat], Sept. 2016. O. Lopez. Nonparametric Estimation of the Multivariate Distribution Function in a Censored Regression Model with Applications. Communications in Statistics - Theory and Methods, 40(15):2639–2660, Aug. 2011. ISSN 0361-0926, 1532-415X. doi: 10.1080/03610926.2010.489175. O. Lopez, V. Patilea, and I. van Keilegom. Single index regression models in the presence of censoring depending on the covariates. Bernoulli, 19(3):721–747, Aug. 2013. ISSN 1350-7265. doi: 10.3150/12-BEJ464. G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. Journal of the European Mathematical Society, 22(3), 2016. P. Major. An estimate on the supremum of a nice class of stochastic integrals and U-statistics. Probability Theory and Related Fields, 134(3):489–537, Mar. 2006. ISSN 0178-8051, 1432-2064. doi: 10.1007/s00440-005-0440-9. N. R. Mann, R. E. Schafer, and N. D. Singpurwalla. Methods for Statistical Analysis of Reliability and Life Data. Wiley, 1974. ISBN 978-0-471-56737-0. A. M. Molinaro, S. Dudoit, and M. J. van der Laan. Tree-based multivariate regression and density estimation with right-censored data. Journal of Multivariate Analysis, 90(1 SPEC. ISS.):154–177, 2004. ISSN 1095-7243. doi: 10.1016/j.jmva.2004.02.003. D. Nolan and D. Pollard. U-Processes: Rates of Convergence. Annals of Statistics, 15(2):780–799, June 1987. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1176350374. S. J. Pan and Q. Yang. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, Oct. 2010. ISSN 1558-2191. doi: 10.1109/TKDE.2009.191. G. Papa, A. Bellet, and S. Clémençon. On Graph Reconstruction via Empirical Risk Minimization: Fast Learning Rates and Scalability. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R.
Garnett, editors, Advances in Neural Information Processing Systems 29, pages 694–702. Curran Associates, Inc., 2016. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, and V. Dubourg. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. S. Pölsterl. Scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn. Journal of Machine Learning Research, 21(212):1–6, 2020. ISSN 1533-7928. S. Pölsterl, N. Navab, and A. Katouzian. Fast Training of Support Vector Machines for Survival Analysis. In A. Appice, P. P. Rodrigues, V. Santos Costa, J. Gama, A. Jorge, and C. Soares, editors, Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 243–259. Springer International Publishing, 2015. ISBN 978-3-319-23525-7. S. Pölsterl, N. Navab, and A. Katouzian. An Efficient Training Algorithm for Kernel Survival Support Vector Machines. In ECML PKDD MLLS 2016, 2016. F. Portier and J. Segers. On the weak convergence of the empirical conditional copula under a simplifying assumption. Journal of Multivariate Analysis, 166(C):160–181, 2018. A. Rotnitzky and J. M. Robins. Recovery of Information and Adjustment for Dependent Censoring Using Surrogate Markers. AIDS Epidemiology, 88(424):1473, 1992. ISSN 0162-1459. doi: 10.1007/978-1-4757-1229-2_14. P. Royston and M. K. B. Parmar. The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Statistics in Medicine, 30(19):2409–2421, 2011. ISSN 1097-0258. doi: 10.1002/sim.4274. D. Rubin and M. J. van der Laan. A Doubly Robust Censoring Unbiased Transformation. The International Journal of Biostatistics, 3(1), Jan. 2007. ISSN 1557-4679. doi: 10.2202/1557-4679.1052. G. R. Shorack and J. A. Wellner.
Empirical Processes with Applications to Statistics. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, Jan. 2009. ISBN 978-0-89871-684-9. doi: 10.1137/1.9780898719017. J. A. Steingrimsson and S. Morrison. Deep learning for survival outcomes. Statistics in Medicine, 39(17):2339–2349, 2020. ISSN 1097-0258. doi: 10.1002/sim.8542. J. A. Steingrimsson, L. Diao, A. M. Molinaro, and R. L. Strawderman. Doubly robust survival trees. Statistics in Medicine, 35(20):3595–3612, Sept. 2016. ISSN 0277-6715. doi: 10.1002/sim.6949. J. A. Steingrimsson, L. Diao, and R. L. Strawderman. Censoring Unbiased Regression Trees and Ensembles. Journal of the American Statistical Association, 114(525):370–383, Jan. 2019. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.2017.1407775. W. Stute. Consistent estimation under random censorship when covariables are present. Journal of Multivariate Analysis, 45(1), 1993. doi: 10.1006/jmva.1993.1028. W. Stute. The Central Limit Theorem Under Random Censorship. Annals of Statistics, 23(2):422–439, Apr. 1995. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1176324528. W. Stute. Distributional Convergence under Random Censorship when Covariables are Present. Scandinavian Journal of Statistics, 23(4):461–471, 1996. V. Van Belle, K. Pelckmans, J. A. K. Suykens, and S. Van Huffel. Support Vector Machines for Survival Analysis. Proceedings of the Third International Conference on Computational Intelligence in Medicine and Healthcare, 2007. V. Van Belle, K. Pelckmans, J. A. K. Suykens, and S. Van Huffel. Learning Transformation Models for Ranking and Survival Analysis. Journal of Machine Learning Research, 12:819–862, 2011. ISSN 1532-4435. M. J. van der Laan and J. M. Robins. Unified Methods for Censored Longitudinal Data and Causality. Springer Series in Statistics. Springer New York, New York, NY, first edition, 2003. ISBN 978-0-387-21700-0. I.
van Keilegom and N. Veraverbeke. Uniform strong convergence results for the conditional Kaplan-Meier estimator and its quantiles. Communications in Statistics - Theory and Methods, 25(10):2251–2265, Jan. 1996. ISSN 0361-0926. doi: 10.1080/03610929608831836. M. P. Wand and M. C. Jones. Kernel Smoothing. Number 60. Chapman & Hall, Boca Raton, FL, U.S., Dec. 1994. ISBN 978-0-412-55270-0.