# empirical_risk_minimization_under_random_censorship__254f1a4f.pdf

Journal of Machine Learning Research 23 (2022) 1-59 Submitted 6/19; Revised 10/21; Published 12/21

Empirical Risk Minimization under Random Censorship

Guillaume Ausset guillaume.ausset@telecom-paris.fr LTCI, T el ecom Paris, Institut Polytechnique de Paris BNP Paribas

Stephan Cl emen con stephan.clemencon@telecom-paris.fr LTCI, T el ecom Paris, Institut Polytechnique de Paris

Fran cois Portier francois.portier@ensai.fr CREST UMR 9194, ENSAI, Univ Rennes

Editor: Ingo Steinwart

We consider the classic supervised learning problem where a continuous non-negative random label Y (e.g. a random duration) is to be predicted based upon observing a random vector X valued in Rd with d 1 by means of a regression rule with minimum least square error. In various applications, ranging from industrial quality control to public health through credit risk analysis for instance, training observations can be right censored, meaning that, rather than on independent copies of (X, Y ), statistical learning relies on a collection of n 1 independent realizations of the triplet (X, min{Y, C}, δ), where C is a nonnegative random variable with unknown distribution, modelling censoring and δ = I{Y C} indicates whether the duration is right censored or not. As ignoring censoring in the risk computation may clearly lead to a severe underestimation of the target duration and jeopardize prediction, we consider a plug-in estimate of the true risk based on a Kaplan Meier estimator of the conditional survival function of the censoring C given X, referred to as Beran risk, in order to perform empirical risk minimization. It is established, under mild conditions, that the learning rate of minimizers of this biased/weighted empirical risk functional is of order OP( p

log(n)/n) when ignoring model bias issues inherent to plug-in estimation, as can be attained in absence of censoring. Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed.

Keywords: Censored data, empirical risk minimization, U-processes, statistical learning theory, survival data analysis.

1. Introduction

Covering a wide variety of practical applications, distribution-free regression can be considered one of the ﬂagship problems in statistical learning. In the most standard setup, (X, Y ) is a random pair deﬁned on a certain probability space with (unknown) joint probability distribution P, where the output Y is a real-valued square integrable random variable (r.v.) and X models some input information, valued in Rd, supposedly useful to predict Y . In this context, one is interested in building a (measurable) function f : Rd R minimizing the

2022 G. Ausset, S. Cl emen con and F. Portier.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/19-450.html.

Ausset, Cl emenc on and Portier

(expected quadratic) risk RP (f) = E h (Y f(X))2i , (1)

which is ﬁnite as soon as the r.v. f(X) is square integrable. Obviously, the minimizer of (1) is the regression function f (X) = E[Y | X]. As the distribution of (X, Y ) is unknown in practice, the Empirical Risk Minimization paradigm (ERM in abbreviated form, see e.g. Gy orﬁet al. (2002)) suggests considering solutions bfn of the minimization problem, also referred to as least squares regression, minf F b Rn(f), where b Rn(f) is a statistical estimate of the risk RP (f) computed from a training sample Dn = {(X1, Y1), . . . , (Xn, Yn)} of independent copies of (X, Y ). In general the empirical version

b Rn(f) = 1

i=1 (Yi f(Xi))2 , (2)

is considered. This boils down to replacing P in the risk functional RP ( ) with the empirical distribution of the (Xi, Yi) s. The class F of predictive functions is supposed to be of controlled complexity (e.g. of ﬁnite VC dimension), while being rich enough to contain a reasonable approximant of the minimizer of RP , f (x). In a framework stipulating in addition that the random variables Y and f(X), f F, are sub-Gaussian, ERM is proved to yield rules with good generalization properties, see e.g. Gy orﬁet al. (2002); Bartlett et al. (2005); Lecu e and Mendelson (2016) (notice, however, that, in heavy-tail situations, alternative strategies are preferred, refer to Lugosi and Mendelson (2016) for instance). In many applications such as industrial reliability, see Mann et al. (1974), or clinical trials, the r.v. Y to be predicted represents a duration, e.g. the lifespan of a manufactured component or the time to recovery of sick patients, and it is far from uncommon in survival analysis that the data at disposal to learn a predictive rule are not composed of independent realizations (X1, Y1), . . . , (Xn, Yn) of distribution P but of observations (X1, Y1, δ1), . . . , (Xn, Yn, δn), where the observed durations are of the form

Yi = min{Ci, Yi} with i {1, . . . , n},

the random variables Ci modelling a possible right censoring, and the δi are binary variables indicating whether censoring has occurred for each duration. Of course, other types of censoring (e.g. left/interval/progressive censoring) can be encountered in practice and result in partially observed durations. Since the results established in this paper can be straightforwardly extended to a more general framework, focus is here on the right censoring case. Whereas the asymptotic theory of statistical estimation based on censored data is very well documented in the literature (see e.g. Fleming and Harrington (1991); Andersen et al. (1993) and the references therein), the issues raised by censoring in statistical learning has received much less attention and it is the major purpose of this article to investigate how ERM can be extended to this setup with sound generalization guarantees. As the empirical risk (2) cannot be computed directly from the data available, we rely on the inverse of the probability of censoring weighted (IPCW) approach, see Gerds et al. (2017) for a recent review. In that, we build ﬁrst a plug-in (biased) estimator of the risk (1) by means of the Beran estimator of the conditional survival function of the censoring (Beran, 1981; Dabrowska, 1989; van Keilegom and Veraverbeke, 1996) and minimize next the

Empirical Risk Minimization under Random Censorship

resulting risk estimate, referred to as Beran risk and that can be interpreted as a weighted version of the empirical risk process based on the observations. The asymptotic behaviour of such weighted empirical risk has been ﬁrst considered in the seminal contributions of Stute (1993, 1996) and reﬁned recently in Lopez (2011); Lopez et al. (2013). In this paper, more in the spirit of the popular statistical learning theory of empirical risk minimization, nonasymptotic maximal deviation bounds for this risk functional, much more complex than a basic empirical process due to the strong dependency exhibited by the terms averaged to compute it, are established by means of linearization techniques combined with concentration results pertaining to the theory of U-processes. We prove that, under appropriate conditions, minimizers of the Beran risk proposed have good generalization properties, achieving learning rate bounds of order OP( p

log(n)/n) when ignoring the model bias impact on the plug-in estimation step, the same as ERM in absence of any censoring. Beyond this theoretical analysis, illustrative numerical results are also displayed, providing strong empirical evidence of the relevance of the approach promoted. They reveal in particular that, even if the estimator of the conditional survival function plugged is only moderately accurate, Beran risk minimizers signiﬁcantly outperform approaches ignoring censoring. Eventually, we point out that some of the results established in this paper have been preliminarily presented in an elementary form at the 2018 Neur IPS Machine Learning for Healthcare workshop (ML4HEALTH), see Ausset et al. (2018). The rest of the paper is organized as follows. The framework we consider for statistical learning based on censored training data is detailed in section 2, where notions pertaining to survival data analysis involved in the subsequent study are also brieﬂy recalled and a nonasymptotic uniform bound for the Beran estimator of the conditional survival function of the censoring is also stated. In section 3, the statistical version of the expected quadratic risk we propose, based on the Beran estimator previously studied, is introduced and the performance of its minimizers is analysed. Illustrative numerical results are displayed in section 4, while several concluding remarks are collected in section 5. Technical proofs are postponed to the appendices.

2. Background - Preliminaries

In this section, we ﬁrst describe at length the probabilistic setup considered in this paper and recall basic concepts of censored data analysis, which the subsequent analysis relies on. Next, we establish a nonasymptotic bound for the deviation between the conditional survival function of the random censoring and its Beran estimator under adequate smoothness assumptions. Here and throughout, the indicator function of any event E is denoted by I{E}, the Dirac mass at any point x by δx. When well-deﬁned, the convolution product between two real-valued Borelian functions on Rd g(x) and w(x) is denoted by (g w)(x) = R

x Rd g(x x )w(x )dx . The left-limit at s > 0 of any c adla g function S on R+ is denoted by S(s ) = limt s S(t).

2.1 The Statistical Framework

In this paper, we consider a pair (X, Y ) of random variables deﬁned on the same probability space (Ω, A, P), with unknown joint distribution P and where Y , representing a duration, takes positive values only and X models some information valued in Rd, d 1, a priori

Ausset, Cl emenc on and Portier

useful to predict Y . We assume that X s marginal distribution has a density g(x) w.r.t. Lebesgue measure on Rd. We are concerned with building a prediction rule f : Rd R+ with minimum expected quadratic risk RP (f), see Eq. (1), based on a training dataset Dn = {(X1, Y1, δ1), . . . , (Xn, Yn, δn)} composed of n 1 independent realizations of the random triplet (X, Y , δ), where Y = min{Y, C}, C is a positive r.v. deﬁned on (Ω, A, P) and δ = I{Y C} indicates whether the duration is (right) censored (δ = 0) or not (δ = 1). The following hypothesis is required in the present study.

Assumption 1 (Conditional independence) The random variables Y and C are conditionally independent given the input X and we have Y = C with probability one.

Naturally, many other types of censoring can be encountered in practice. However, since the goal of the present paper is to explain the main ideas to apply the ERM principle to censored data rather than dealing with the problem at the highest level of generality, we restrict our attention to the type of right random censoring introduced above. Though simple, it covers many situations. Addressing the problem in a more complex probabilistic framework, where Y and C are not conditionally independent given X anymore for instance, will be the subject of future research. The assumption stipulating that {Y = C} is a zero-probability event is quite general, insofar as it allows considering situations where Y and/or C are discrete variables. Under conditional independence, it is obviously satisﬁed when the r.v. Y is continuous. Easy to state but diﬃcult to solve, the statistical learning problem we consider here is of considerable importance. In a wide variety of applications, the input information is of increasing granularity and described by a random vector of very large dimension d, while (censored) data are progressively becoming massively available. Machine-learning techniques are thus expected to complement traditional approaches, based on statistical modelling, in order to produce more ﬂexible/accurate predictive models based on censored data. Incidentally, we point out that the problem under study can be viewed as a very speciﬁc type of transfer learning problem, see e.g. Pan and Yang (2010) insofar as, due to the censoring, the distribution of the training/source data is not that of the test/target data. However, the source domain coincides here with the target one and the predictive task (regression) remains the same.

Weighted empirical risk. Discarding censored observations to evaluate the risk of a candidate function f(x) would lead to the quantity

i=1 δi Yi f(Xi) 2 /

i=1 δi, (3)

with 0/0 = 0 by convention, which is clearly a biased estimate of RP (f) in general, since, by virtue of the strong law of large numbers, it converges to E[(Y f(X))2 | Y C] with probability one. One may easily check that the minimizer of this functional is given by

f (X) = E[Y I{Y C} | X]/P{Y C | X},

Empirical Risk Minimization under Random Censorship

which signiﬁcantly diﬀers from f (X) in general. Observing that, by means of a straightforward conditioning argument, one can write the risk as

" δ( Y f(X))2

where SC(u | x) = P{C > u | X = x} denotes the conditional survival function of the random right censoring given X, we propose to estimate the risk (1) by computing ﬁrst a nonparametric estimator ˆSC(u | x) of SC(u | x) and by plugging it next into (4), so as to obtain

e Rn(f) = 1

δi( Yi f(Xi))2

ˆSC( Yi | Xi) , (5)

which approximates the unknown quantity whose expectation is equal to (4)

δi( Yi f(Xi))2

SC( Yi | Xi) , (6)

the conditional survival function of C given X being itself unknown. Observe that the risk estimate (5) can be viewed as a weighted version of the sum of the observed squared errors ( Yi f(Xi))2, just like (3) except that the i-th weight is not δi/ P j n δj anymore but δi/ ˆSC( Yi | Xi). In the terminology of survival analysis, the weighted empirical risk (5) is usually referred to as an IPCW (Gerds et al., 2017) estimate and a natural strategy to learn a predictive function in the censored framework described above then consists in solving the minimization problem inf f F e Rn(f), (7)

over an appropriate class F. In Rotnitzky and Robins (1992) or Section 3.3 in van der Laan and Robins (2003), a parametric estimator (Cox, 1972) of SC(Y | X) is used to infer the risk. However, such an approach is naturally limited by well-known misspeciﬁcation issues, inherent in the choice of the parametric model. In van der Laan and Robins (2003); Rubin and van der Laan (2007), the misspeciﬁcation problem is alleviated by the use of a doubly robust loss which allows for misspeciﬁcation of one of the models (either SC or else ST ) and improves in addition the eﬃciency of the procedure. This has been further investigated in Molinaro et al. (2004); Steingrimsson et al. (2016, 2019) , where diﬀerent methodologies are proposed to build classiﬁcation trees. The use of the Kaplan-Meier estimator (Kaplan and Meier, 1958) for SC(u | x) has been considered in several papers (Stute, 1993, 1996; Bang and Tsiatis, 2002; Kohler et al., 2002). Even if the censoring model is free from any parametric modelling, the assumptions required to ensure consistency are quite strong as the distribution of C is supposed to be independent from X, see Stute (1996) for more details. In particular, the weights used are independent from X. To overcome the previous restrictions, the Beran estimator (Beran, 1981), which is a kernel smoothing version of the Kaplan-Meier estimate (see the next section), can be employed instead of the Cox or the Kaplan-Meier estimators. Such an approach is promoted and studied in Lopez (2011); Lopez et al. (2013). When using the Beran approach to estimate SC(u | x), as detailed in the next subsection, the risk functional (5) is referred to as the Beran risk throughout

Ausset, Cl emenc on and Portier

the article. Based on accuracy results for kernel-based Beran estimators of the conditional survival function SC( | x) such as those subsequently presented, the performance of solutions of (7) is investigated in the next section. We point out that, as highlighted in section 4, alternative inference strategies for conditional survival function estimation can be considered. For simplicity, here we restrict our attention to kernel-smoothing techniques, although the analysis carried out can be extended to other nonparametric methods (e.g. partition-based techniques, nearest neighbours). Our results are therefore comparable to those of Lopez (2011), as both studies are concerned with the Beran risk, relaxing in particular the restrictive assumption on the dependence between C and X used in Stute (1993, 1996). In Lopez (2011), an asymptotic representation of the estimation error is established when the input variable is univariate (d = 1). An extension with a single index model is considered in Lopez et al. (2013). The proof technique is based on the asymptotic equicontinuity of the empirical process and imposes strong conditions on the bandwidth choice, e.g. nh3 (see Theorem 3.3 in Lopez (2011) and Theorem 3.1 in Lopez et al. (2013)). The (nonasymptotic) analysis carried out in the present paper is quite diﬀerent and based on two steps: 1) the risk estimate is linearized and 2) concentration results for generalized U-processes are used to describe its ﬂuctuations uniformly over the class of predictive functions considered (see e.g. Cl emen con and Portier (2018)). Notice additionally that the approach we adopt to establish nonasymptotic rate bounds requires weaker conditions, i.e. namely nh2d/| log(hd)| in the d-dimensional case.

Integration domain. As any (conditional) survival function, SC(y | x) vanishes as y tends to inﬁnity. In order to avoid dealing with the asymptotic behaviour of the conditional survival function of the censoring and stipulating decay rate assumptions for its tail behaviour, in the analysis carried out in section 3 we restrict the study of the prediction problem to a (borelian) domain K R+ Rd such that SC(y | x) stays bounded away from 0 on it and consider the risk

RP,K(f) = E

SC( Y | X) I{( Y , X) K}

as well as its empirical counterpart

δi( Yi f(Xi))2

SC( Yi | Xi) I{( Yi, Xi) K}. (9)

As discussed at length below, the present analysis distinguishes itself from previous works, relying on the IPCW approach as well, in several respects. First, the problem of regression (in presence of censoring) is tackled here from the angle of prediction, not as the problem of estimating the conditional expectation f with minimum L2(µX)-error, denoting X s marginal distribution by µX. Although an estimator of Beran s type of C s conditional survival function given X is involved in the empirical risk construction as explained above, the goal pursued here is to ensure that the predictor fn obtained by solving (7) has a small excess risk RP ( fn) RP (f ) with large probability. As will be discussed in detail in the next section, establishing such nonasymptotic guarantees for statistical learning in the censored

Empirical Risk Minimization under Random Censorship

context, in the form of generalization bounds, yields technical diﬃculties, which are far from straightforward to overcome, when avoiding the simplifying assumption, hardly met in practice, that the output variable Y is independent from the r.v. C modelling the censoring mechanism (see Assumption 1, oﬀering a much more realistic framework for regression based on censored training data).

In contrast, we derive in this article sound theoretical results, providing nonasymptotic guarantees for the risk minimizers by jointly estimating the intrinsic loss and the censoring mechanism.

2.2 Preliminary Results

In this subsection, we brieﬂy recall the Beran approach to estimate a (conditional) survival function by means of a kernel smoothing procedure and state a uniform bound for the deviations between the conditional survival function of C given X and its Beran estimator, involved in statistical learning framework developed in the next section for distributionfree censored regression. As shall be discussed below, this result reﬁnes those obtained in Dabrowska (1989) and Du and Akritas (2002), which are of similar nature, except that they are related to the estimation of the conditional survival function of the duration Y given X, denoted by SY (u | x) = P{Y > u | X = x}, rather than that of the conditional survival function of the censoring C given X. Deﬁne the conditional integrated hazard function of the right censoring C given X

ΛC(u | x) = Z u

SC(ds | x) SC(s | x), (10)

and the conditional subsurvival functions H(u | x) = P{ Y > u | X = x} and H0(u | x) = P{ Y > u, δ = 0 | X = x} for u 0 and x Rd. As we have (under Assumption 1), H0(du | x) = SY (u | x)SC(du | x) and H(u | x) = SY (u | x)SC(u | x), we obtain

ΛC(u | x) = Z u

Here, we propose to build an estimate of ΛC(u | x) by plugging into formula (10) Nadaraya Watson type kernel estimates of the conditional subsurvival functions and derive from it an estimator of SC(u | x). Of course, alternative estimation techniques can be considered for this purpose. Throughout the paper, K : Rd R+ is a symmetric bounded kernel function, i.e. a bounded nonnegative Borelian function, integrable w.r.t. Lebesgue measure such that R K(x)dx = 1, K(x) = K( x) for all x Rd, see Wand and Jones (1994). We assume it lies in the linear span of functions w, whose subgraphs {(s, u) : w(s) u}, can be represented as a ﬁnite number of Boolean operations among sets of the form {(s, u) : p(s, u) ζ(u)}, where p is a polynomial on Rd R and ζ an arbitrary real-valued function. This assumption guarantees that the collection of functions

{K((x )/h) : x Rd, h > 0},

is a bounded VC type class, see Gin e et al. (2004), a property that will be useful to establish our results. Although very technical at ﬁrst glance, this hypothesis is very general and

Ausset, Cl emenc on and Portier

is satisﬁed by kernels of the form K(x) = ζ(p(x)), p being any polynomial and ζ any bounded real function of bounded variation (see Nolan and Pollard (1987)) or when the graph of K is a pyramid (truncated or not). For any bandwidth h > 0 and x Rd, we set Kh(x) = K(h 1x)/hd. Based on the kernel estimators given by

ˆH0,n(u, x) = 1

i=1 I{ Yi > u, δi = 0}Kh(x Xi), (11)

ˆHn(u, x) = 1

i=1 I{ Yi > u}Kh(x Xi), (12)

i=1 Kh(x Xi), (13)

deﬁne the conditional subsurvival function estimates

ˆH0,n(u | x) = ˆH0,n(u, x)

ˆgn(x) and ˆHn(u | x) = ˆHn(u, x)

as well as the (biased) estimators of ΛC(u | x) and SC(u | x)

ˆΛC,n(u | x) = Z u

ˆH0,n(ds | x)

ˆHn(s | x) , (14)

ˆSC,n(u | x) = Y

1 ˆΛC,n(s | x) , (15)

with Λ(t) = Λ(t) Λ(t ), which are classically referred to as the conditional Nelson-Aalen and Beran estimators (Dabrowska, 1989). Let b > 0 and deﬁne the set

Γb = n (y, x) R+ Rd : SY (y|x) SC(y|x) g(x) > b o ,

which is supposed to be non-empty. On this set, one may guarantee that ˆH0,n(y, x) and ˆH0,n(y, x) are both away from 0 with high probability, which permits the study of the ﬂuctuations of (15). In addition, the mild and standard smoothness assumption below is required in the analysis to control the estimation bias.

Assumption 2 For all u R+, the functions x 7 H(u | x), x 7 H0(u | x) and x 7 g(x) are twice continuously diﬀerentiable on Rd with all partial derivatives bounded by L.

The result stated below provides a uniform bound for the deviation between SC(u | x) and its estimator (15).

Proposition 1 Suppose that Assumptions 1 and 2 are fulﬁlled. Then, there exist constants M1 > 0, M2 > 0 and h0 > 0 depending on b, L and K only such that, for all ϵ (0, 1), we have with probability greater than 1 ϵ:

sup (t,x) Γb | ˆSC,n(t | x) SC(t | x)| M1

| log(hd/2ϵ)|

as soon as h h0 and nhd M2| log(hd/2ϵ)|.

Empirical Risk Minimization under Random Censorship

The technical proof is given in the Appendix section (refer to the latter for a description of the constants M1, M2 and h0 involved in the result stated above). A similar result, for the conditional survival function of Y given X, is proved in Dabrowska (1989), see Theorem 2.1 therein. Observe also that choosing h = hn n 1/(d+4) yields a rate bound of order OP( p

log(n)/n4/(d+4)).

Prediction vs Estimation. Finally, we emphasize that conditional density/expectation estimation is not the goal we pursue here, the regression framework considered in the next section having to do with prediction, i.e. the construction of a predictive rule fn(X) from censored training data with good predictive capacity. The learning procedure we investigate in this article consists in minimizing a plug-in estimation of the risk (4) and consequently involves the nonparametric estimator (15), the accuracy of the prediction is measured by the excess risk, not by the estimation error E[( fn(X) f (X))2]. This contrasts with estimation techniques, which consist in forming directly an estimator of the conditional expectation f (X) under speciﬁc (smoothness) assumptions. In addition, we point out that alternative ﬂexible (local averaging) methods could be naturally used to compute estimators of H0(u, x), H(u, x) and g(x) and consequently estimators of SC(u | x) and ΛC(u | x), including k-nearest neighbours, decision trees or random forest. However, whereas the accuracy of nonparametric estimators based on kernel smoothing under appropriate smoothness hypotheses can be rather easily studied, it is much less convenient to establish rates for estimators produced by tree-based techniques for example (one generally prefers to investigate estimators built by means of variants, involving random splitting for instance, quite diﬀerent from the algorithms used in practice). For this reason, the predictive performance of extensions of the statistical learning approach under study, based on estimators of SC(t | x) built by means of tree-based techniques, are studied from an empirical angle only in this article, see section 4 for further details.

3. Generalization Bounds for Kaplan-Meier Risk Minimizers

It is the purpose of this section to investigate the excess of risk (8) related to a domain K R+ Rd of minimizers efn(x) of the Kaplan-Meier risk (9) over a class F of predictive functions that is of controlled complexity (see the technical assumptions below), while being rich enough to yield a small bias R(f ) R( f ), denoting RP,K( ) by R( ) for simplicity throughout the present section. We consider here the situation where, for all i {1, . . . , n}, the estimate of the quantity SC( Yi | Xi) plugged into (6) is obtained by evaluating the kernel smoothing estimator of SC(y | x) investigated in subsection 2.2 and based on the subsample {(Xj, Yj, δj) : 1 j n, j = i} at (y, x) = ( Yi, Xi). The corresponding versions of the kernel estimators (11), (12), (13) and those of (14) and (15) are respectively denoted by ˆH(i) 0,n(y | x), ˆH(i) n (y | x), ˆg(i) n (x), ˆΛ(i) C,n(y | x) and ˆS(i) C,n(y | x). This yields the leave-one-out estimator of the risk of any candidate f

e Rn(f) = 1

δi( Yi f(Xi))2

ˆS(i) C,n( Yi | Xi) I{( Yi, Xi) K},

Ausset, Cl emenc on and Portier

that is well-deﬁned on the event Tn i=1{ ˆS(i) C,n( Yi | Xi) > 0}. As we clearly have

R( efn) inf f F R(f) 2 sup f F

e Rn(f) R(f) ,

the key of the analysis is the control of the ﬂuctuations of the process { e Rn(f) R(f) : f F}. Slightly more generally, we establish below a uniform deviation bound for processes of type

δiϕ( Yi, Xi) ˆS(i) C,n( Yi | Xi)

E [ϕ(Y, X)] , ϕ Φ,

where the indexing class Φ fulﬁlls the following property allowing us to control the ﬂuctuations of the pseudo-variables ˆS(i) C,n( Yi | Xi), as in Proposition 1.

Assumption 3 There exists a domain K Γb such that ϕ(y, x) = 0 as soon as (y, x) / K for all ϕ Φ.

Equipped with these notations, observe that e Rn(f) R(f) = Zn(ϕ) when ϕ(Y, X) = (Y f(X))2I{( Y , X) K}.

Linearization. Whereas in the standard regression framework or in classiﬁcation ERM can be straightforwardly studied by means of maximal deviation inequalities for empirical processes, the form of the process {Zn(ϕ) : ϕ Φ} of interest is very complex since the terms averaged in (5) are obviously far from being independent due to the presence of the plugged leave-one-out estimators of the quantities SC( Yi | Xi). The subsequent analysis is all the more technically diﬃcult that, in contrast to most works devoted to statistical censored data analysis (see subsection 2.1 for more details), the simplifying assumption, unrealistic in many situations in practice, that Y and C are independent is avoided here, cf Assumption 1. Our approach to the study of the ﬂuctuations of the process Zn consists in linearizing the statistic Zn(ϕ), i.e. approximating Zn(ϕ) by a standard i.i.d. average in the L2-sense, as stated in the next proposition. In order to make this decomposition explicit, further notations are needed. We set, for all i {1, . . . , n},

ˆa(i) n (t | x) = Z t

c(u | x) H(u, x)

ˆH(i) 0,n(du, x) H0(du, x)

c(u | x) H(u, x)2

ˆH(i) n (u, x) H(u, x) ˆH(i) 0,n(du, x),

ˆb(i) n (t | x) = Z t

H(u, x)2 ˆH(i) n (u, x)

ˆH(i) n (u, x) H(u, x) 2 ˆH(i) 0,n(du, x)

ˆS(i) C,n(u | x) SC(u | x)

SC(u | x) ˆ (i) n (du | x),

ˆ (i) n (du | x) = ˆΛ(i) C,n(du | x) ΛC(du | x),

c(u | x) = SC(u | x)

SC(u | x) .

Empirical Risk Minimization under Random Censorship

Equipped with these notations, we can now state the following result.

Proposition 2 (KM risk decomposition) Suppose that Assumptions 1, 2 and 3 are fulﬁlled. There exist constants h0 > 0 and M1 > 0 that depends on b, L and K only such that

(i) for any n 2 and ϵ (0, 1), provided that h h0 and nhd M1| log(hd/2ϵ)|, the event En def = \

n (t, x) K, ˆS(i) C,n(t, x) b/2 and ˆH(i) n (t, x) b3/2 o ,

occurs with probability greater than 1 ϵ;

(ii) for all ϕ Φ and n 2, we have on the event En:

Zn(ϕ) = Ln(ϕ) + Mn(ϕ) + Rn(ϕ),

δi ϕ( Yi, Xi) SC( Yi | Xi) E

δ ϕ( Y , X)

i=1 δiϕ( Yi, Xi)ˆa(i) n ( Yi | Xi) SC( Yi | Xi) ,

δiϕ( Yi, Xi) SC( Yi | Xi)

ˆb(i) n ( Yi | Xi) +

SC( Yi | Xi) ˆS(i) C,n( Yi | Xi) 2

SC( Yi | Xi) ˆS(i) C,n( Yi | Xi)

The proof is given in the Appendix section. Observe that the term Ln(ϕ) is a basic centred i.i.d. sample mean statistic and its uniform rate of convergence 1/ n can be recovered by applying maximal deviation bounds for empirical processes under classic complexity assumptions such as those stipulated below, whereas the term Mn(ϕ) is more complicated, since it involves multiple sums. It is dealt with by means of results pertaining to the theory of U-processes (de la Pe na and Gin e, 1999), by showing that it can be decomposed as Mn(ϕ) = L n(ϕ) + R n(ϕ), the sum of a linear term and a second-order term. The term Rn(ϕ) + R n(ϕ) is a remainder term (second order) and shall be proved to be negligible with respect to Ln(ϕ) + L n(ϕ). The theory of U-processes is used next to describe the uniform behaviour of Mn + Rn. Such concentration results are also used in Cl emen con et al. (2008) and Papa et al. (2016) in simpler situations, where the residuals take the form of a degenerate U-statistic. In our case, due to the presence of a leave-one-out estimate of the survival function, the U-processes that arise do not have all their diagonal terms (e.g., the sum indexes 1 i, j n are restrained to i = j). This is of particular interest because results dealing with U-processes are in most cases stated for such sums (see Lemma 7 and Corollary 8 in the Appendix section) and, more importantly, removing diagonal terms improves the estimation accuracy by reducing the bias (see also Delyon and Portier (2016), remark 4). To obtain uniform concentration inequalities over the function class Φ, it is standard (Nolan and Pollard, 1987; Gin e and Guillou, 2001) to assume the following type of control on the complexity of the class.

Ausset, Cl emenc on and Portier

Assumption 4 The set Φ of real-valued functions on R+ Rd is a bounded VC type of class with parameter (A, v) and constant envelope MΦ.

The formal deﬁnition of VC classes is given in the Appendix section. By means of these assumptions, the following result, proved in the Appendix section, describes the order of magnitude of the ﬂuctuations of the process Zn.

Proposition 3 Suppose that Assumptions 1-4 are fulﬁlled. There exist constants h0, M1, M2 and M3 that depend on (A, v), MΦ, L, K and b only, such that, for all n 2 and ϵ (0, 1), the event

sup ϕ Φ |Zn(ϕ)| M1

n + | log(ϵhd/2)|

occurs with probability greater than 1 ϵ provided that h h0, nh2d M3| log(ϵhd)|.

The risk excess probability bound stated in the following theorem shows that, remarkably, minimizers of the Kaplan-Meier risk attain the same learning rate as that achieved by classic empirical risk minimizers in absence of censoring, when ignoring the model bias eﬀect induced by the plug-in estimation step (cf choice of the bandwidth h).

Theorem 4 Suppose that Assumptions 1-4 are fulﬁlled. There exist constants h0, M1, M2 and M3 that depend on (A, v), MΦ, L, K and b only, such that, for all n 2 and ϵ (0, 1), the event

|R( fn) R(f )| M1

n + | log(ϵhd/2)|

occurs with probability greater than 1 ϵ provided that h h0, nh2d M3| log(ϵhd)|.

The proof is a direct application of Proposition 3. A similar bound for the expectation of the risk excess of minimizers of the empirical Kaplan-Meier risk can be classically derived with quite similar arguments, details are left to the reader. We ﬁnally point out that, given that Proposition 3 holds true for a fairly general class of functions Φ, the guarantees provided by Theorem 4 can be naturally extended to more general risk measures than that deﬁned by the quadratic loss.

4. Numerical Experiments

Beyond the theoretical generalization guarantees established in the previous section, we now examine at length the performance of the predictive approach proposed in the context of regression based on censored data from an empirical perspective. We present various experiments using both synthetic and real data, and compare it to alternative methods documented in the survival analysis literature standing as natural competitors. As shall be seen below, the experimental results obtained provide strong empirical evidence of the relevance of the Kaplan-Meier empirical risk minimization approach described in section 2 and analysed theoretically in section 3. All the experiments and ﬁgures displayed in this article can be reproduced using the code available at https://github.com/aussetg/ipcw.

Empirical Risk Minimization under Random Censorship

4.1 Experimental Setup

Before presenting and discussing the numerical results obtained, we ﬁrst describe the experimental schemes used here to investigate the predictive capacity of the learning procedure under random censoring previously studied.

4.1.1 Data Generative Models

In all the synthetic experiments we have carried out, the generation of the data is based either on the proportional hazard model of Cox and Oakes (1984) or else on the accelerated time failure model of Buckley and James (1979); both commonly used for parametric modelling and statistical estimation of conditional survival functions in the censored setup. Samples of the triplet ( Y , δ, X) are obtained by specifying the marginal distribution of X, as well as the conditional distribution of (Y, C) given X. For simplicity, the input r.v. X is here uniformly distributed on the unit square [0, 1]d, for d {2, 4, 8}. Only the results for d = 4 are presented below, while those obtained for d {2, 8} are available through the link mentioned above.

Cox Model. The ﬁrst survival model we use to simulate synthetic data stipulates that

SY (y | x) = exp exp βT x y and SC(y | x) = exp exp βT Cx y , (16)

where β and βC are parameters in Rd. Given X, the conditional distribution of Y is thus exponential with parameter exp(βT X), while that of C is exponential with parameter exp(βT CX).

Accelerated Failure Time Model (AFT). The second generative model we considered assumes that log(Y ) = βT X + ε0 and log(C) = βT CX + ε1, (17)

where the r.v. ε0 (respectively ε1) is independent from X. Diﬀerent accelerated failure time models can thus be generated, depending on the distributions D0 and D1 chosen for ε0 and ε1. Three distributions have been used: Normal (N) with mean and variance (3/2, 1), Laplace (L) with location and scale (1, 1) and Gamma (G) with shape and scale (0, 1). Denoting by AFT(D0, D1) the model such that (ε0, ε1) D0 D1, the variants AFT(N, N), AFT(N, L) and AFT(N, G) have been simulated. Since the results obtained for these AFT models are quite similar to those based on the Cox model, only the latter are presented below. We refer to the link aforementioned for a description of the results based on the data generated through the AFT models.

Parameters β and βC. In the Cox and AFT models, the level of censoring can be tuned by carefully choosing the parameters β and βC. In order to guarantee that the censoring is informative, we use the following parametrization:

βT = d/2 z }| { 1 1 0 0 ,

βT C = λ 1 0 1 0 1 ,

where the tuning parameter λ > 0 controls the level of censoring 1 p with p = P{Y C} and u R 7 u is the ceiling function. For a targeted censoring level p, the parameter λ can be empirically determined so that Pn i=1 δi np.

Ausset, Cl emenc on and Portier

4.1.2 Plugged estimator of the conditional survival function SC( | x)

The estimate of the risk (9) one seeks to minimize is partly determined by the choice of the estimator ˆSC( | x) of SC( | x) plugged into it. We consider for ˆSC the kernelized Kaplan-Meier estimator (15), in its standard version denoted by ˆSKern C ( | x) and in its leaveone-out version as well, denoted by ˆS(i) Kern C ( | x). We denote by ˆSKM C ( ) the Kaplan-Meier estimator of C s (unconditional) survival function, which can be seen as the limit of ˆSKern C when h and yields the Kaplan-Meier risk considered in Stute (1995). In addition, we used ˆS(i) KNN C ( | x), the estimator obtained by replacing the kernel smoothing involved in (9) by a nearest neighbour averaging, in a leave-one-out fashion. Finally, we also considered ˆSRF C , the survival random forest estimator proposed in Ishwaran et al. (2008). From each estimator of SC, one computes a plug-in estimation of the risk:

i=1 δi ( Yi f(Xi))2

ˆSKern C ( Yi|Xi) IPCW Lo O

i=1 δi ( Yi f(Xi))2

ˆS(i) Kern C ( Yi|Xi)

IPCW Forest

i=1 δi ( Yi f(Xi))2

ˆSRF C ( Yi|Xi) IPCW Stute

i=1 δi ( Yi f(Xi))2

ˆSKM C ( Yi)

i=1 δi ( Yi f(Xi))2

ˆS(i) KNN C ( Yi|Xi) IPCW Oracle

i=1 δi ( Yi f(Xi))2

i=1 ( Yi f(Xi))2 Observed

i=1 δi( Yi f(Xi))2.

The naive and observed empirical risks introduced above correspond to strongly biased estimators of the true risk (1) of course. Note that the normalizing constant is ignored here, insofar as it has no impact on the empirical risk minimizer. However, when estimating the true risk itself, it is necessary to correctly normalize the previous quantities in order to obtain complete case estimators. A point of comparison is the oracle risk

i=1 (Yi f(Xi))2,

i.e. the empirical risk in absence of any censoring (i.e. when all the Yi s are observed). For each risk, a prediction rule f n is built by (approximately) minimizing it over a certain class F. The results are depicted in Fig.1 for various sizes of the (censored) training sample and diﬀerent censoring levels, the prediction error being evaluated by means of a test (uncensored) sample of size 5000: learning a predictive function by minimizing an IPCW estimator (here IPCW Lo O) of the risk always outperforms naive alternatives, the gain in predictive performance naturally becoming more pronounced as the level of censoring 1 p increases. Unsurprisingly, when most of the points are observed (i.e. p 1), all methods reach roughly the same error, all the losses in (18) being equal for p = 1, as depicted by Fig.2. Notice also in Fig.2 that the IPCW estimator performs best compared to the naive methods for a moderate level of censoring. This can be explained by the fact that in absence of censoring the methods are equivalent and when censoring reaches a very high level, there is not enough data to estimate reliably the IPCW weights. Of course, the phenomenon is

Empirical Risk Minimization under Random Censorship

exaggerated in the present example, as the training set is of size 1000: with a censoring level of 90%, only 100 observations are then available for the conditional estimation of the weights.

200 500 1000 2000 200 500 1000 2000 200 500 1000 2000

p = 0.75 p = 0.5 p = 0.25

Number n of training samples.

E[(Y f n(X))2]

Naive Observed

Figure 1: Prediction error E[(Y f n(X))2] for data generated by the Cox model (16), when minimization is performed over the class of aﬃne predictive rules.

0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1

E[(Y f n(X))2]

Naive Observed

Figure 2: Prediction error E[(Y f n(X))2] for data generated by the Cox model (16) with n = 1000, when minimization is performed over the class of aﬃne predictive rules.

Truncation of the estimator ˆSC( | x). In the theoretical analysis carried out in the previous section, we placed ourselves on a restricted set Γb. However, in practice, we employ a truncation approach by simply removing the last jump of the estimated survival function. For instance, ˆSKern C (y|x) is taken as Y

Yi y Yi<maxj: δj=0 Yj

1 ˆΛC,n( Yi | x) . (19)

Observe that, although it is not a survival function anymore, it is still a relevant estimator. This alleviates possible diﬃculties caused by the frequent edge case where the last individual is observed (δ = 1), since, in the case where (15) is used, we have then δn/ ˆSC( Yn | Xn) = . Of course, it would have been possible to decide to force a restriction on Γb by considering

Ausset, Cl emenc on and Portier

the ﬂooring max(b, ˆSC,n(y | x)) rather than ˆSC,n(y | x). However, b then becomes an hyperparameter of the procedure that has to be tuned by the practitioner. In the survival analysis literature, it is common to consider the restricted mean survival time, the conditional distribution of the restricted variable min(Y, τ) being easier to learn than that of Y ; see e.g. Royston and Parmar (2011) or Steingrimsson et al. (2016, 2019); Steingrimsson and Morrison (2020). In this case the conditional survival function of the restricted variable is equal to the truncated conditional survival function of Y . Our approach therefore consists in using the restricted mean censoring time as the target for the weights in order to reduce the noise. We evaluated truncation of the survival function and ﬂooring by comparing the predictive performance attained by the rule learnt from data generated by means of the Cox model, when choosing successively ˆSKern C , ˆS(i) Kern C and ˆS(i) KNN C for ˆSC. As depicted by Fig. 3 for the speciﬁc case where ˆSC = ˆS(i) Kern C (and this remains true in the other cases), a wrong choice for b may have serious consequences, while the truncation approach of Eq. (19) consistently produces good results. Consequently, the truncated version is always used in the following experiments.

100 200 300 500 1000 100 200 300 500 1000 100 200 300 500 1000 10

p = 0.75 p = 0.5 p = 0.25

Number n of training samples.

E[(Y f n(X))2]

0 0.001 0.1 0.3 0.5 Truncation

Figure 3: Prediction error E[(Y f n(X))2] when F is the class of aﬃne functions, choosing the IPCW Lo O risk estimator and the Cox model (16) for generating the data. The curves correspond to diﬀerent ﬂoors b and to the truncation approach of (19).

Calibration of ˆSC( | x). In order to fully specify the estimator ˆSC( | x), it may be necessary to choose speciﬁc hyperparameters. Without censoring, and therefore without having to resort to the IPCW approach, one would select the various hyperparameters by way of a cross-validation; this approach is, however, impossible in our case as the loss itself is unknown and only estimated, worse any modiﬁcation of the parameters of ˆSC results in a modiﬁcation of the estimator of the loss we wish to minimize. One possible solution is to rely on a surrogate loss i.e. an auxiliary loss that we are able to compute exactly and on which a cross-validation is therefore possible. For ˆSKern C and ˆS(i) Kern C , we consider ˆmh(x) the nonparametric kernel regression of Y w.r.t. X, known as the Nadaraya-Watson estimator, and the surrogate loss E[| Y ˆmh(X)|2] which is then minimized using cross-validation with respect to h. In this way, a value for the bandwidth parameter h cv is obtained and might be used in ˆSKern C and ˆS(i) Kern C . This approach is also easily applied to set the number of

Empirical Risk Minimization under Random Censorship

neighbours involved in ˆS(i) KNN C . As a close relative of the task of estimating SC, the previous regression loss is a good candidate for the surrogate cross-validation. In the speciﬁc case of ˆS(i) Kern C and ˆS(i) KNN C , we experimentally studied the impact of the choice of h and k on the prediction performance E[(Y f n(X))2]. The results for the speciﬁc case of ˆS(i) Kern C are given in Fig. 4 and demonstrate the need to be over-conservative rather than under-conservative in the choice of these hyperparameters. In our experiments,

100 200 300 500 1000 100 200 300 500 1000 100 200 300 500 1000 10

p = 0.75 p = 0.5 p = 0.25

Number n of training samples.

E[(Y f n(X))2]

0.1 1.0 5.0 10.0 20.0 50.0

Figure 4: Prediction error E[(Y f n(X))2] for varying bandwidth h, with F a random forest model, IPCW Lo O estimator of SC, and data following the Cox model (16).

choosing h at least equal to the value h cv obtained by minimizing the surrogate, and up to 5 times this value, is a safe choice. Consequently, we use h = 5h cv in the following experiments. For ˆSRF C , given the large number of hyperparameters, the default parameters selected by the package s authors have been used.

4.2 Experimental Results based on Synthetic Data

We now present the results obtained from the data generated by means of the model previously described.

Risk estimation. While not the focus of the predictive approach studied in this paper, it is of interest to evaluate the quality of the estimation of E[ϕ(Y, X)], related to a certain function ϕ, attained by the IPCW method. In order to make computations easier, we choose to study functionals of the form ϕ(Y, X) = Y exp XT β , where Y follows the Cox model described in Equation (16). In this case E[ϕ(Y, X)] = 1. For a single random dataset Dn = {(Xi, Yi, δi) : i = 1, . . . , n} of size n, the estimation error is given by

(Xi,Yi,δi) Dn

δi ˆSC( Yi | Xi) Yi exp ( X i β)

Based on M = 100 simulated datasets, we study the distribution of errn for varying sizes n. We represent the median, 5% and 95% quantiles of errn in Fig. 5 for each survival estimator. As can be seen in Fig. 5, while the naive uncorrected method results in a poor approximation of the considered expectation (as expected, since it is strongly biased), the

Ausset, Cl emenc on and Portier

100 200 300 500 1000 2000 3000 4000 5000 0

Number n of training samples.

IPCW Lo O IPCW Stute IPCW KNN IPCW Forest

Naive IPCW Oracle

Figure 5: Estimation error for the IPCW Risks of (18) compared to the naive method, for p = 1/4 and data following the Cox model (16).

IPCW reweighting errors converge towards 0. One should pay speciﬁc attention to the particularly good performance of the leave-one-out version of the IPCW estimators. We also point out that low-bias estimators of SC, such as the Random Forest estimator, can underperform signiﬁcantly compared to their high bias counterparts such as the unconditional Stute estimator. This behaviour is consistent with the observations made in the previous discussion about calibration. It is illustrated by Fig. 4. We empirically observe that the IPCW estimator of the risk with oracle weights (i.e. computed from the true conditional survival function SC(. | x)) may be less accurate than plug-in versions (i.e. computed from an estimator of the conditional ˆSC(. | x)) and exhibits a much higher variance. Intuitively, this phenomenon can be explained by the fact that the empirical weights governed by the value 1/SC(Yi|Xi) can grow arbitrarily large for observations in the tail. This phenomenon is reduced for the estimated version Lo O (resp. KNN) because of the truncation (see the implementation details above) and the over-conservative choice of the bandwidth (resp. of the number of neighbours). A similar phenomenon occurs for estimated of importance sampling type, for which the weights appearing in the denominator need to be tuned ﬁnely, see Delyon and Portier (2020).

Predictive performance. In this paper, we are concerned with the predictive task, rather than risk estimation. Hence, we now focus on the problem of minimizing (1). We study the prediction error E[(Y f n(X))2] for the following types of predictive model F: Support vector Regression (SVR), Random Forests and Linear Regression. Although the

Empirical Risk Minimization under Random Censorship

choices we made are far from being exhaustive, they correspond to tools commonly used by practitioners.

Following the experimental scheme presented in subsection 4.1 we ﬁrst generate train sets of varying size n and test sets of ﬁxed size 5000 according to the data generative models described in 4.1.1. We then estimate on the train set the weights corresponding to each risk described in Eq. (18) with hyperparameters chosen using the procedure given in 4.1.2 before computing f n, the minimizer of the resulting empirical risk over the class F considered. We ﬁnally estimate the prediction error E[(Y f n(X))2] using the test dataset. Each experiment is entirely replicated (including the sampling of the train and test sets) 100 times in order to obtain reliable statistics for the distribution of the true risk of the learning procedure.

200 300 500 1000 2000 3000

Number n of training samples.

E[(Y f n(X))2]

IPCW Stute IPCW Forest

IPCW KNN IPCW Oracle

IPCW IPCW Lo O

Figure 6: Prediction error E[(Y f n(X))2] for diﬀerent estimators of SC using the linear regression model for data generated by the Cox model (16).

As already observed earlier in Fig. 5, Fig. 6 and Fig. 7 show that the IPCW Lo O predictor systematically outperforms the other predictors in our experiments, no matter the level of censoring p and across diﬀerent distributions as can be seen here for the speciﬁc case of the Cox model and the AFT(N, L). Consequently, any further mention of IPCW implicitly refers to the IPCW Lo O version from now on and all subsequent experiments involve the use of ˆS(i) Kern C . We also underline that these results hold true, no matter the predictive model F considered, as can be seen by examining Fig. 8 and no matter the underlying distribution, see Fig. 9. It is noteworthy that the methods are all the most diﬀerent as the number of observations in the training set increases. As the conditional estimators become more and

Ausset, Cl emenc on and Portier

200 300 500 1000 2000 3000

Number n of training samples.

E[(Y f n(X))2]

IPCW IPCW Lo O IPCW Stute IPCW Forest

IPCW KNN IPCW Oracle

Figure 7: Prediction error E[(Y f n(X))2] for diﬀerent estimators of SC using the linear regression model for data generated by the AFT(N, L) model (17).

more complex, more and more data are required to diﬀerentiate them from the unconditional versions.

200 500 1000 2000 200 500 1000 2000 200 500 1000 2000

p = 0.75 p = 0.5 p = 0.25

Number n of training samples.

E[(Y f n(X))2]

Random Forest

SVR Linear Regression

Figure 8: Prediction error E[(Y f n(X))2] for the three predictive models (SVR, random forest and linear regression) using the IPCW Lo O for data generated by the Cox model (16).

Finally, we compare variants of popular machine learning methods implementing the IPCW technique promoted in this paper to standard state-of-the-art procedures from the survival analysis literature that do not rely on (re-weighted) risk minimization. Such

Empirical Risk Minimization under Random Censorship

200 500 1000 2000 200 500 1000 2000 200 500 1000 2000

p = 0.75 p = 0.5 p = 0.25

Number n of training samples.

E[(Y f n(X))2]

Random Forest

SVR Linear Regression

Figure 9: Prediction error E[(Y f n(X))2] for the three predictive models (SVR, random forest and linear regression) using the IPCW Lo O for data generated by the AFT(N, L) model (17).

techniques include classic statistical methods based on the preliminary estimation of the survival function, as already mentioned in Section 1 (see e.g. van der Laan and Robins (2003)), the survival function estimator being next used to estimate the downstream quantity of interest in a plug-in fashion, provided that the latter can be expressed as an integral w.r.t. the survival function, just like the conditional mean. An alternative approach, in the spirit of machine learning methods, consists in designing losses tailored to the censored regression problem, either through transformation models in Van Belle et al. (2011), or else by adapting the SVM methodology, as done in e.g. Van Belle et al. (2007); P olsterl et al. (2015, 2016). We also include the method of Hothorn et al. (2006) which shares similarities with the approach investigated in this paper and uses a boosting technique to optimize a loss reweighted by unconditional Kaplan-Meier weights, as well as the technique proposed in Ishwaran et al. (2008) that builds a recursive splitting of the feature space X by maximizing a measure of inter-cluster dissimilarity of the survival functions, the resulting clusters being then used for downstream tasks such as classiﬁcation, regression, or quantile estimation. We compare the predictive performance of ten estimators of the regression function based on statistical models documented in the survival literature with that of ﬁve predictive functions learned using the IPCW risk minimization approach. The IPCW versions of the machine learning techniques for regression considered in these experiments, corresponding to the approach studied in the present article, have been implemented with Scikit-Learn (Pedregosa et al., 2011), combined with our own implementation of the Lo O IPCW predictor we propose. For the survival machine learning methods mentioned above, we use the reference implementations of the Scikit Survival package (P olsterl, 2020). The canonical implementation of Ishwaran and Kogalur (2007) is used for Random Survival Forest. The default values for the hyperparameters are used in every case. All experiments are based on n = 200 training observations. Results for all methods can be found in Table 1. While the undeniable superiority of IPCW methods compared to the standard survival techniques may appear surprising at ﬁrst glance. However, keeping in mind that the performance measure is here the prediction error E[(Y f n(X))2], it is expected that directly minimizing an estimator of the prediction error yields better results than

Ausset, Cl emenc on and Portier

a two-stage procedure, which consists in estimating ﬁrst the underlying distribution and forming next an estimator of its mean.

E[(Y f n(X))2]

Method p = 0.75 p = 0.5 p = 0.25

Scikit Survival

Survival Gradient Boosting 3.19 3.55 3.61 Component-wise Survival Gradient Boosting 3.19 3.87 4.23 Cox Proportional Hazards 7.86 7.61 7.03 Coxnet 7.62 7.39 6.85 Kernel Survival SVM 4.02 3.92 4.13 Survival SVM 4.04 4.09 3.94 Hinge Loss Survival SVM 8.10 8.28 8.09 Minlip Survival SVM 3.27 3.96 4.22 Random Survival Forest 2.01 2.94 2.78

Scikit Learn

Ridge + IPCW 1.75 1.49 1.24 Kernel Ridge + IPCW 2.07 1.60 1.35 Linear Regression + IPCW 1.81 1.49 1.24 Random Forest + IPCW 1.85 1.57 1.36 SVR + IPCW 1.87 1.66 1.42

Table 1: Performance on the Cox dataset

4.3 Experimental Results based on Real Data

The performance of the IPCW risk minimization approach is now investigated on the TCGA Cancer data (Grossman et al., 2016) using solely the RNA transcriptomes as informative variables. All models are trained on n = 8080 patients with a censoring rate of 18%, we measure on the remaining 1449 observed patients the prediction error, as well as the Harrel concordance index deﬁned by P Yj Yi δj I{f(Xj) > f(Xi)}

P Yj Yi δj , (20)

which can be seen as an extension of the classical AUC metric for the standard classiﬁcation problem to censored data, measuring how well ordered the predicted death times are. Note that, as a complete case statistic, the same methodology as presented in this paper could be applied to estimate the AUC type statistic P{f(X1) > f(X2) | Y1 > Y2}, using IPCW weights. However, because it has been used for evaluating alternative methods, we use the criterion (20) here in order to facilitate comparisons and avoid to discuss the impact of speciﬁc IPCW approaches to compute the weights. For all the results, we use the IPCW methodology presented earlier. The Cox proportional hazards model was, however, learned after variable selection via a Lasso regression so as to augment performance. We observe from the results in Table 2 that, as expected, the predictors built through IPCW risk minimization signiﬁcantly outperform their competitors, including the standard

Empirical Risk Minimization under Random Censorship

Cox model, for the prediction task. The large improvement compared to the Cox approach may surprise at ﬁrst but is not unexpected actually, insofar as the seemingly less sophisticated IPCW approach is speciﬁcally designed for the purpose of prediction. By directly minimizing an estimate of the loss of interest, it naturally achieves a lower test prediction error than that reached by the traditional two-stage approach in statistics, which consists in estimating ﬁrst the distribution and deducing next an estimator of the minimizer of the loss of interest (here, such an approach would consist in building ﬁrst an estimator of Y s conditional density given X based on the censored sample and taking next its expectation to form an estimator of the theoretical minimizer f ). The Cox estimator only controls the likelihood of the model without any concern for the predictive performance. In particular, extreme errors are not penalized in any way while those are hurtful to the overall L2 error. More interestingly, we see that, while the diﬀerence is not as pronounced, the IPCW predictors also outperform the Cox estimator with respect to the concordance index. Notice incidentally that the concordance index, as an extension of the Wilcoxon-Mann-Whitney or AUC statistic, is an empirical criterion that can hardly be optimized directly in practice (using gradient ascent techniques for instance) because of the nonsmooth character of the pairwise loss function it involves, see e.g. section 7 in Cl emen con et al. (2008), but it is often used for evaluating performance a posteriori by practitioners. Remarkably, as the IPCW risk minimization approach can be combined with highly sophisticated learners (such as random forests), without any modiﬁcation or increase in complexity, it is possible to signiﬁcantly increase its predictive capacity, while edging the standard survival techniques on auxiliary metrics as well.

IPCW Naive Observed

Method L2 Error (years) Concordance L2 Error C L2 Error C

Cox 18.78 0.6095 SVR 2.768 0.563 2.796 0.575 2.795 0.543 Linear Regression 3.193 0.594 4.971 0.557 3.898 0.508 Ridge 3.193 0.594 4.962 0.5573 3.896 0.5077 Kernel Ridge 2.683 0.597 2.704 0.592 2.956 0.513 Random Forest 2.577 0.630 2.636 0.603 2.878 0.542

Table 2: Results of the IPCW approach on the TCGA Cancer data.

5. Conclusion

In the present article, we have presented both theoretical and experimental work on statistical learning based on censored data. Precisely, we considered the problem of learning a predictive/regression function when the output variables related to the training observations are subject to random right censoring under mild assumptions. Following in the footsteps of the approach introduced in Stute (1995), we studied from a nonasymptotic perspective the performance of predictive functions built by minimizing a weighted version of the empirical

Ausset, Cl emenc on and Portier

(quadratic) risk, constructed by means of the Kaplan-Meier methodology. Learning rate bounds describing the generalization ability of such predictive rules have been proved, through the study of the ﬂuctuations of the Kaplan-Meier risk functional, relying on linearization techniques combined with concentration results for U-processes. These theoretical results have also been conﬁrmed by various numerical experiments, supporting the approach promoted. A diﬃcult question, that will be the subject of further research, is the design of model selection methods (structural risk minimization) to pick automatically the optimal hyperparameters for the plugged estimator ˆSC,n. Indeed, this is far from straightforward, insofar as changing the hyperparameters or the model modiﬁes the loss that is being optimized, which makes standard methods such as cross-validation unsuitable.

Appendix A. Concentration inequalities for VC classes and permanence properties

For completeness, concentration results as well as preservation properties of VC classes, extensively used in the subsequent proofs, are recalled. For the sake of generality, this section is independent from the rest of the paper. For a function f : S R, we deﬁne f = supx S |f(x)| and f A = supx A |f(x)|.

Concentration inequalities over VC classes. The following concentration inequalities provide uniform bound on empirical sums over VC classes of functions. We start by recalling the deﬁnition of a VC class.

Deﬁnition 5 Let (S, S) be a measurable space. A class F of real-valued functions deﬁned on S is called VC of parameter (A, v) (0, + )2 and constant envelope UF > 0 if for any probability measure Q on (S, S) and any ϵ (0, 1):

N(F, L2(Q), ϵUF) (A/ϵ)v,

where N(F, L2(Q), ϵ) denotes the smallest number of L2(Q)-balls of radius less than ϵ required to cover class Φ (covering number), see e.g. Nolan and Pollard (1987) and Gin e and Guillou (2001).

The following inequality for empirical processes over VC classes is stated in Einmahl and Mason (2000); Gin e and Guillou (2001) under various forms. The present version is taken from Gin e and Sang (2010).

Lemma 6 Let ξ1, ξ2, . . . be i.i.d. r.v. s valued in a measurable space (S, S) and U be a class of functions on S, uniformly bounded and of VC-type with constant (A, v) and envelope U : S R. Set σ2(u) = var(u(ξ1)) for all u U. There exist constants C1 > 0, C2 1, C3 > 0 (depending on v and A) such that t > 0 satisfying

i=1 {u(ξi) E[u(ξi)]}

C2 exp C3 t2

Empirical Risk Minimization under Random Censorship

The previous result is extended to the case of degenerated U-processes over VC classes (Major, 2006, Theorem 2).

Lemma 7 Let ξ1, ξ2, . . . be an i.i.d. sequence of random variables taking their values in a measurable space (S, S) and distributed according to a probability measure P. Let H be a class of functions on Sk uniformly bounded such that H is of VC type with constants (A, v) and envelope G. For any H H, set σ2(H) = var(H(ξ1, . . . , ξk)) and assume that

j {1, . . . , k}, E[H(ξ1, . . . , ξk) | ξ1, . . . , ξj 1, ξj+1, . . . , ξk] = 0 with probability one.

Then, there exist constants C1 > 0, C2 1, C3 > 0 (depending on v and A) such that for all t > 0 satisfying

C1σ n log 2 G

k/2 t σ nσ G

(i1,...,ik) H(ξi1, . . . , ξik) H > t C2 exp

where G 2 σ2 Var(H) 2 H.

The following result is directly derived from that stated above by specifying an appropriate value of t.

Corollary 8 Let ξ1, ξ2, . . . be an i.i.d. sequence of random variables taking their values in a measurable space (S, S) and distributed according to a probability measure P. Let H be a class of functions on Sk uniformly bounded such that H is of VC type with constants (A, v) and envelope G. For any H H, set σ2(H) = var(H(ξ1, . . . , ξk)) and assume that

j {1, . . . , k}, E[H(ξ1, . . . , ξk) | ξ1, . . . , ξj 1, ξj+1, . . . , ξk] = 0 with probability one.

Then, there exist constants C1 > 0 , C2 1 , C3 > 0 (depending on v and A) such that

(i1,...,ik) H(ξi1, . . . , ξik)

t(n, σ, ϵ) = σnk/2

k/2 + log(C2/ϵ)

provided that

C2/k 1 log 2 G

+ log(C2/ϵ)

sup H H σ2(H) σ2 G 2 .

Ausset, Cl emenc on and Portier

VC type classes of functions - Permanence properties. In the subsequent sections, several results are obtained by applying the concentration bounds recalled above to speciﬁc classes of functions built up from the elements of the class Φ and other functions such as Kh(x), SC(u | x) or g(x). To show that these speciﬁc classes are VC, we rely on the following lemmas which exhibits situations where the VC type property is preserved, while controlling the constants (A, v) involved. In what follows the kernel K is assumed to satisfy the hypotheses introduced in section 2.2.

Lemma 9 (see Nolan and Pollard (1987), Lemma 22, Assertion (ii)) The class {z 7 K(h 1(x z)) : x Rd, h > 0} is a bounded VC class of functions.

The following result is an extension of a result established in the proof of Proposition 8 in Portier and Segers (2018).

Lemma 10 Let (V, W) be a pair of random variables taking their values in Rq and in Rd

respectively, denote by f0(v | W) the density of the conditional distribution of the random variable V given W, supposed to be absolutely continuous w.r.t. Lebesgue measure on Rq. Let F be a bounded VC class of functions deﬁned on Rq Rd with parameter (A, v) and constant envelop UF. The class G = {w Rd 7 E[f(V, W) | W = w] : f F} is a bounded VC class of functions with parameter (A, v) and constant envelop UF.

Proof Let Q be a probability measure on Rd. Consider Q the probability measure deﬁned through

d Q(v) = Z f0(v|w)Q(dw)dv.

Let ϵ (0, 1) and consider the centres f1, . . . , f N of an ϵUF-covering of the VC class F with respect to the metric L2( Q). Let g G, i.e., g : w Rd 7 E[f(V, W) | W = w] with f in F. Deﬁne gk = E[fk(V, W) | W = w], for k = 1, . . . , N. There exists k {1, . . . , N} such that Z (g(w) gk(w))2 Q(dw) Z E (f(V, W) fk(V, W))2 | W = w Q(dw)

= ZZ (f(v, w) fk(v, w))2 f0(v|w) dv Q(dw)

= Z (f(v, w) fk(v, w))2 Q(dv) ϵ2U2 F,

using Jensen s inequality and Fubini s theorem. Consequently, we have:

N G, L2(Q), ϵUF N F, L2( Q), ϵUF A

Since the constant UF is an envelop for the class G, the result is established.

Empirical Risk Minimization under Random Censorship

Lemma 11 Let Ψ be a VC class of functions deﬁned on Rq Rd with constant envelop U > 0 that satisﬁes the following Lipschitz property: for all ψ Ψ, z Rq, (x, y) Rd Rd,

|ψ(z, x) ψ(z, y)| κ x y .

with κ > 0. Let K : Rd R be a positive function such that R K(u)du = 1 and v K = R u 2K(u)du < . The class F = {(z, x) 7 R ψ(z, x hu)K(u)du : ψ Ψ, 0 < h h} is a bounded measurable VC class of functions with constant envelope (κ h v K + U).

Proof Let 0 < ϵ 1 and hk = kϵ h, k = 1, . . . , 1/ϵ , an (ϵ h)-subdivision of the interval (0, h]. Let Q be a probability measure on Rq Rd. For each k, deﬁne µk as the probability measure of the random variable (Z, X hk U) when (Z, X, U) Q K. Let Ψk,j, j = 1, . . . , N be an ϵU-cover of the function class Ψ with respect to L2(µk). Let h (0, h] and ψ Ψ. For any measurable function f and any k, we have

Z f(z, x hku)K(u)du L2(Q) f L2(µk).

As a consequence, for each k there exists j such that

Z (ψ(z, x hku) ψk,j(z, x hku))K(u)du L2(Q) ϵU.

Besides, from the Lipschitz property, there exists k such that

Z (ψ(z, x hu) ψ(z, x hku))K(u)du L2(Q) ϵκ h v K.

The triangle inequality allows to claim that there exists j and k such that

Z (ψ(z, x hu) ψk,j(z, x hku))K(u)du L2(Q) = ϵ(κ h v K + U).

There are 1/ϵ Aϵ v such functions Ψk,j meaning that

N F, L2(Q), ϵ(κ h v K + U) Aϵ (v+1),

where (κ h v K + U) is indeed an envelop for the class F.

We conclude the section by a preservation result for the product and the inverse.

Lemma 12 Suppose that F and G are two VC classes deﬁned on S with parameters (AF, v F) and (AG, v G) and constant envelops UF and UG, respectively. Then it holds:

(i) The class FG = {fg : f F, g G} is VC with parameter (2(AF AG), v F + v G) and envelop UFUG.

(ii) In addition, if for all f F and x S, f(x) b F, then the class F 1 = {1/f : f F, g G} is VC with parameter (AFUF/b F, v F) and envelop 1/b F.

Ausset, Cl emenc on and Portier

Proof Let 0 < ϵ 1 and fk, k = 1, . . . , NF the centers of an (ϵUF)-covering of F. Similarly, denote by gk, k = 1, . . . , NG the centers of an (ϵUG)-covering of G. By applying the operation (fk UF) UF, we can assume without loss of generality that the fk (resp. gk) are bounded by UF (resp. UG). Then for any f F and g G, there are k {1, . . . , NF} and j {1, . . . , NG} such that

fg fkgj UG f fk + UF g gj 2ϵUGUF,

which implies that N(FG, L2(Q), 2ϵUGUF) (AF/ϵ)v F(AG/ϵ)v G. Taking ϵ = 2ϵ gives the result. For the second point, taking fk b F, we have

f 1 f 1 k 1

b2 F f fk UF

and the result follows taking ϵ = (UF/b F)ϵ.

Appendix B. Integration results

In this section we establish useful bounds related to these quantities: kernel smoothers, integrals with respect to signed measures, survival functions and hazard functions namely. This corresponds to Lemmas 13, 14, 15 and 16, respectively. As the previous section, this section is independent from the rest of the paper.

Lemma 13 Let Ωan open convex subset of Rd. Suppose that f is twice diﬀerentiable on Ω such that the greatest eigenvalue of the Hessian matrix is uniformly bounded by M > 0, then, if the kernel K is symmetric, i.e., K(u) = K( u), we have: for all h > 0,

sup x Ω |(Kh f)(x) f(x)| M

2 h2 Z z 2K(z)dz.

Proof The proof follows the same lines as the proof of Lemma 11 given in Delyon and Portier (2020).

Lemma 14 Let θ (0, 1), h : R+ [1, [ be Borelian, increasing, with limit 1/θ at + and ν be any signed measure on R+. Then, we have: T > 0, t [0, T],

θ sup s [0,T]

Proof Recall ﬁrst the identity

0 dν = sup f DE

Z fdν , (21)

Empirical Risk Minimization under Random Censorship

where DE is the space of non-increasing functions valued in [0, 1] and vanishing at inﬁnity (see e.g. Dudley (1992)). Since h is increasing from 1 to 1/θ, we have for any signed measure ν (whose restriction to [0, T] is denoted by ν[0,T]),

0 hdν = θ 1

0 dν + θ Z t

h θ 1 dν 2θ 1 sup f DE

Then applying (21) we obtain that

θ sup s [0,T]

Lemma 15 Let τ > 0. Let S(1) and S(2) be c ad-l ag non-increasing functions on R+ such that S(1)(0) = S(2)(0) = 1 and S(2)(τ) θ > 0. For k {1, 2}, Λ(k)(t) = R t 0 S(k)(du)/S(k)(u ) is the corresponding cumulative hazard function. We have:

S(1) S(2) [0,τ] 2θ 1 Λ(1) Λ(2) [0,τ].

Proof Let t [0, τ]. As S(2)(t) > 0, the integration by part argument of Theorem 3.2.3 in Fleming and Harrington (1991) yields

S(1)(t) S(2)(t)

S(2)(t) = Z t

S(2)(u) (Λ(1)(du) Λ(2)(du)). (22)

Set 1(du) = (Λ(1)(du) Λ(2)(du))/S(2)(u) and apply the integration by parts formula (refer to page 305 in Shorack and Wellner (2009) for instance) to get

S(1)(t) S(2)(t)

S(2)(t) = Z t

0 S(1)(u ) 1(du) = S(1)(t) 1(t) + Z t

0 1(u)S(1)(du).

Then, as S(2)(t) 1, we obtain that

|S(1)(t) S(2)(t)|

S(1)(t)| 1(t)| + (1 S(1)(t)) sup u [0,τ] | 1(u)|

sup u [0,τ] | 1(u)|.

We conclude by using Lemma 14 with dν = d(Λ(1) Λ(2)) and h = 1/S(2).

Lemma 16 Let 0 < θ1, θ2 < 1 and τ > 0. For k {1, 2}, deﬁne Λ(k)(t) = R t 0 G(k)(du)/H(k)(u), where G(k) : [0, τ] [0, β] is c ad-l ag non-decreasing and H(k) : [0, τ] [θk, 1] is Borelian non-increasing. Then, we have:

Λ(1) Λ(2) [0,τ] 2

θ1 G(1) G(2) [0,τ] + β θ1θ2 H(1) H(2) [0,τ].

Ausset, Cl emenc on and Portier

Proof Let t [0, τ]. Observe that, by triangular inequality,

Λ(1)(t) Λ(2)(t) =

d(G(1) G(2))

(H(2) H(1))

H(1)H(2) d G(2)

θ1 G(1) G(2) [0,τ] + β θ1θ2 H(2) H(1) [0,τ],

where the bound for the second term on the right hand side is straightforward and that for the ﬁrst term can be deduced from the application of Lemma 14 with the measure ν equal to A 7 R

A d(G(1) G(2)) and the function h equal to 1/H(1).

Lemma 17 Let τ > 0. Let S(1) and S(2) be c ad-l ag non-increasing functions on R+ such that S(1)(0) = S(2)(0) = 1 and S(2)(τ) θ > 0. For k {1, 2}, deﬁne Λ(k)(t) = R t 0 S(k)(u )S(k)(du) and suppose that Λ(k)(t) = R t 0 G(k)(du)/H(k)(u), where G(k) : [0, τ] [0, β] and H(k) : [0, τ] [θ, 1] are respectively non-decreasing and non-increasing borelian functions. Then, there exists a constant Cθ,β > 0, depending only on θ and β, such that

sup t [0,τ]

S(1)(u ) S(2)(u )

Λ(1)(du) Λ(2)(du)

Cθ,β H(1) H(2) 2 [0,τ] + G(1) G(2) 2 [0,τ] + W [0,τ] ,

S(2)(s ) G(1)(ds) G(2)(ds)

S(2)(s)H(2)(s) d G(1)(du) G(2)(du)

S(2)(u)H(2)(u) .

Proof The proof consists in showing ﬁrst that there exist constants C1,θ,β and C2,θ,β such that

sup t [0,τ]

( ˆS(1)(u ) S(2)(u ))

S(2)(u) (Λ(1)(du) Λ(2)(du))

C1,θ,β( G(1) G(2) 2 [0,τ] + H(1) H(2) 2 [0,τ]) + Π [0,τ], (23)

0 2(u) 1(du), 2(t) = Z t

0 S(2)(u ) 1(du), 1(t) = Z t

0 S(2)(u) 1 (du),

and = Λ(1) Λ(2), and next that

Π W [0,τ] C2,θ,β H(1) H(2) 2 [0,τ] + G(1) G(2) 2 [0,τ] . (24)

Empirical Risk Minimization under Random Censorship

In order to establish (23), we successively apply (22), Fubini s theorem and the integration by part formula:

u=0 (S(1)(u ) S(2)(u )) 1(du)

v=0 S(1)(v ) 1(dv)S(2)(u ) 1(du)

u=v S(2)(u ) 1(du) S(1)(v ) 1(dv)

0 S(1)(v ) 1(dv) + Z t

0 S(1)(v )Π(dv)

= 2(t) S(1)(t) 1(t) Z t

0 1(u)S(1)(du) + S(1)(t)Π(t) Z t

0 Π(u)S(1)(du)

2 2 [0,τ] 1 [0,τ] + 2 Π [0,τ]. (25)

From (21), we deduce that 2 [0,τ] 1 [0,τ] (because S(2)I[0,τ] belongs to the space DE) and, from Lemma 14, it follows that 1 [0,τ] 2θ 1 [0,τ]. Apply next Lemma 16 to obtain

2 [0,τ] 1 [0,τ] 8

θ2 G(1) G(2) 2 [0,τ] + β2

θ4 H(1) H(2) 2 [0,τ]

Combined with (25), this proves (23). For (24), the application of the Taylor expansion

a2 + (x a)2

d = d(G(1) G(2))

H(2) (H(1) H(2))d G(1)

(H(2))2 + (H(1) H(2))2d G(1)

(H(2))2H(1) .

Set c(s) = S(2)(s )/S(2)(s). It follows that

G(1)(ds) G(2)(ds)

H(1)(s) H(2)(s) G(1)(ds) H(2)(s)2

H(1)(s) H(2)(s) 2 G(1)(ds) H(2)(s)2H(1)(s)

Ausset, Cl emenc on and Portier

Observe that

Π(t) W(t) =

G(1)(ds) G(2)(ds)

H(1)(u) H(2)(u) G(1)(du) S(2)(u)H(1)(u)H(2)(u)

H(1)(s) H(2)(s) G(1)(ds) H(2)(s)2

H(1)(u) H(2)(u) G(1)(du) S(2)(u)H(1)(u)H(2)(u)

H(1)(s) H(2)(s) G(1)(ds) H(2)(s)2

G(1)(du) G(2)(du)

S(2)(u)H(2)(u)

H(1)(s) H(2)(s) 2 G(1)(ds) H(2)(s)2H(1)(s) 1(du) = A + B + C + D.

We next bound each term on the right hand side of the equation above. Successively apply Lemma 14 and (21) to get

G(1)(ds) G(2)(ds)

0 S(2)(s ) G(1)(ds) G(2)(ds)

Z S(2)(s )I {s u} G(1)(ds) G(2)(ds)

θ2 G(1) G(2) [0,τ].

Because, for any u [0, τ], 1/{S(2)(u)H(1)(u)H(2)(u)} 1/θ3, we can write

G(1)(ds) G(2)(ds)

H(1)(u) H(2)(u) G(1)(du)

G(1)(ds) G(2)(ds)

+ H(1)(u) H(2)(u) 2 )

θ7 G(1) G(2) 2 [0,τ] + H(1) H(2) 2 [0,τ]

In addition, because for any u [0, τ], c(u)/(H(2)(u))2 1/θ3 we have: t [0, τ],

H(1)(s) H(2)(s) G(1)(ds) H(1)(u) H(2)(u) G(1)(du)

H(1)(s) H(2)(s) G(1)(ds) 2

H(1) H(2) 2

Empirical Risk Minimization under Random Censorship

Deﬁne Γ2(t) = R t 0(G(1)(du) G(2)(du))/(S(2)(u)H(2)(u)). Applying Fubini s theorem, we get

H(1)(s) H(2)(s) G(1)(ds) H(2)(s)2

G(1)(du) G(2)(du)

S(2)(u)H(2)(u)

G(1)(du) G(2)(du)

S(2)(u)H(2)(u) c(s)

H(1)(s) H(2)(s) G(1)(ds) (H(2)(s))2

n |Γ2(t) Γ2(s)| H(1)(s) H(2)(s) o G(1)(ds)

θ3 Γ2 [0,τ] H(1) H(2) [0,τ] .

Then, using Lemma 14, it follows that

|C| 2(1/θ3)β(2/θ3) G(1) G(2) [0,τ] H(1) H(2) [0,τ]

(1/θ3)β(2/θ3)( G(1) G(2) 2 [0,τ] + H(1) H(2) 2 [0,τ]).

The last term can be treated by means of Fubini s theorem. Indeed, because 1 [0,τ] 2(β/θ) and for any u [0, τ], 1/{H(2)(u)2H(1)(u)} 1/θ3, we have

H(1)(s) H(2)(s) 2 G(1)(ds) (H(2)(s))2H(1)(s) 1(du)

u=s 1(du) H(1)(s) H(2)(s) 2 G(1)(ds) H(2)(s)2H(1)(s)

θ3 β 1 [0,τ] H(1) H(2) 2 [0,τ]

θ4 H(1) H(2) 2 [0,τ].

Putting all this together, the triangular inequality leads to (24) .

Appendix C. Proof of Proposition 1

We start by establishing 3 useful lemmas, namely Lemma 18, 19 and 20. Then the proof will follow easily. Deﬁne

H0,h(y, x) = E h ˆH0,n(y, x) i ,

Hh(y, x) = E h ˆHn(y, x) i .

H0(y, x) = H0(y | x)g(x),

H(y, x) = H(y | x)g(x).

Ausset, Cl emenc on and Portier

Lemma 18 Under Assumption 2, there exists C0 > 0 depending only on K and L such that for all h > 0,

sup (t,x) R+ Rd |H0,h(t, x) H0(t, x)| C0h2,

sup (t,x) R+ Rd |Hh(t, x) H(t, x)| C0h2.

Proof The proof results from the application of Lemma 13 combined with the smoothness assumptions stipulated.

Lemma 19 Under Assumption 2, There exist constants M1 > 0 and h0 > 0 depending only on K and L such that:

sup (t,x) R+ Rd | ˆH0,n(t, x) H0,h(t, x)|

M1| log(ϵhd/2)|

sup (t,x) R+ Rd | ˆHn(t, x) Hh(t, x)|

M1| log(ϵhd/2)|

provided that h h0 and M1| log(ϵhd/2)| nhd.

Proof The exponential inequalities stated above directly result from the application of Corollary 8 to the uniformly bounded VC-type classes (see Lemma 9 and 12) {(y, x ) R+ Rd 7 I{y > u}K((x x )/h) : (x, u, h) Rd R+ R +} and {(y, δ, x ) R+ {0, 1} Rd 7 I{y > u, δ = 0}K((x x )/h) : (x, u, h) Rd R+ R +} whose VC constants are independent from h, with constant envelope ||K|| , with k = 1 and σ2 = c2 K,Lhd with

L R K2(x)dx. This gives that

sup (t,x) R+ Rd | ˆH0,n(t, x) H0,h(t, x)| t

sup (t,x) R+ Rd | ˆHn(t, x) Hh(t, x)| t

provided that hd/2c K,L ||K|| and

+ C2 1 log 2 K

Since, for any positive numbers a, b, γ, it holds that aγ + bγ 2γ(a + b)γ, we ﬁnd, taking h0 suﬃciently small, that t2 M1| log(ϵhd/2)|/nhd for some constant M1 > 0. Finally, taking h0 suﬃciently small ensures that log(C2)/C3 + C2 1 log(2 K /c K,L) C2 1 log(1/hd/2),

Empirical Risk Minimization under Random Censorship

for any h h0, which permits to ensure that the previous condition is satisﬁed whenever M2| log(ϵhd/2)| nhd, for some M2 > 0. Take M1 = M1 + M2 to obtain the desired result.

Lemma 20 Suppose that Assumptions 1 and 2 are fulﬁlled. There exist constants M1 > 0 and h0 > 0 depending only on b, L and K such that:

P inf (t,x) Γb ˆHn(t, x) 3b3

provided that h h0 and M1| log(ϵhd/2)| nhd.

Proof Deﬁne

sup (t,x) Γb |H(t, x) ˆHn(t, x)| b3

By virtue of Assumption 1, for any (t, x) Γb, we have: H(t|x) = SC(t|x)SY (t|x) b2. As a consequence of ˆHn(t, x) H(t, x) |H(t, x) ˆHn(t, x)|, An {inf(t,x) Γb ˆHn(t, x) 3b3/4}. Hence we only have to prove that event An occurs with probability 1 ϵ at least. By virtue of Lemma 18, as soon as h p

3b3/(8C0), we have

sup (t,x) Γb |Hh(t, x) H(t, x)| 3b3

sup (t,x) Γb | ˆHn(t, x) Hh(t, x)| 3b3

Simply use Lemma 19 to ensure that the event in the left-hand side holds with probability 1 ϵ whenever M1| log(ϵhd/2)| nhd (where M1 now depends on b, L and K) and h h0.

Now we conclude the proof. We start by using Lemma 20 to get that inf(t,x) Γb ˆHn(t, x) 3b3/4 happens with probability 1 ϵ/3. We suppose that this event is realized in the following. Let (t, x) Γb and deﬁne

τx = inf{t 0 : min{SC(t|x), SY (t|x)} > b}.

Observing that the choice of kernel K guarantees that ˆSC,n( |x) is a (random) survival function, we ﬁrst apply Lemma 15 with S(1) = ˆSC,n( |x), S(2) = SC( |x) and θ = b to get:

ˆSC,n( |x) SC( |x) [0,τx] (2/b) ˆΛC,n( |x) ΛC( |x) [0,τx]. (27)

Applying Lemma 16 with Λ(1)(u) = ΛC(u | x) = R u 0 H0(ds, x)/H(s , x), Λ(2)(u) = ˆΛC,n(u | x) = R u 0 ˆH0,n(ds, x)/ ˆHn(s , x), β = 1, θ1 = b3 H(s, x), θ2 = 3b3/4 (because

Ausset, Cl emenc on and Portier

inf(t,x) Γb ˆHn(t, x) b3/4), next yields ˆΛC,n( |x) ΛC( |x) [0,τx] 2

b3 ˆH0,n( , x) H0( | x)g(x) [0,τx] (28)

3b6 ˆHn( , x) H( | x)g(x) [0,τx].

Combining (27) and (28), using Lemma 18 and taking the supremum over x such that g(x) > b, we obtain that, the following bound holds true:

sup (t,x) Γb | ˆSC,n(t | x) SC(t | x)|

b4 sup (t,x) Γb | ˆH0,n(t, x) H0(t, x)| + 8

3b7 sup (t,x) Γb | ˆHn(t, x) H(t, x)|

b4 sup (t,x) Γb | ˆH0,n(t, x) H0,h(t, x)| + 4

b4 C0h2 + 8

3b7 sup (t,x) Γb | ˆHn(t, x) Hh(t, x)| + 8

Lemma 19 with the probability level ϵ/3 allows us to bound the 2 previous random terms. Combined with the union bound (with 3 events having probability smaller than ϵ/3), permits claiming that with probability greater than 1 ϵ:

sup (t,x) Γb

ˆSC,n(t | x) SC(t | x) 4

M1 log(ϵhd/2)

provided that (to apply Lemma 19) h h0 and nhd M1| log(3ϵhd/2)|. Examining the diﬀerent terms and taking h0 small enough lead to the stated result.

Appendix D. Proof of Proposition 2

Proof of (i): Observe that: i {1, . . . , n},

sup (t,x) K | ˆH(i) 0,n(t, x) ˆH0,n(t, x)| 2||K|| /((n 1)hd), (30)

sup (t,x) K | ˆH(i) n (t, x) ˆHn(t, x)| 2||K|| /((n 1)hd). (31)

The result follows from the union bound and that each of these events

B(1) n := \

n (t, x) K, ˆH(i) n (t, x) b3/2 o ,

B(2) n := \

n (t, x) K, ˆS(i) C,n(t, x) b/2 o ,

has probability 1 ϵ/2 under the mentioned condition on (n, h). Apply Lemma 20 to choose (n, h) such that with probability 1 ϵ/2,

inf (t,x) K ˆHn(t, x) 3b3/4.

Empirical Risk Minimization under Random Censorship

Using (31) and the triangle inequality, we get that B(1) n has probability 1 ϵ/2 provided that 2||K|| /((n 1)hd) b3/4. Suppose that event B(1) n is realized. The same reasoning as that used in the proof of Proposition 1 (see (27),(28),(29)), with S(1)( ) = SC( |x), S(2)( ) = S(i) C,n( |x), β = 1,

θ1 = b3 and θ2 = b3/2 (as B(1) n is realized), combined with the triangular inequality, yields: i {1, . . . , n},

sup (t,x) K | ˆS(i) C,n(t|x) SC(t|x)|

sup (t,x) K | ˆH(i) 0,n(t, x) ˆH0,n(t, x)| + sup (t,x) K | ˆH0,n(t, x) H0(t, x)|

sup (t,x) K | ˆH(i) n (t, x) ˆHn(t, x)| + sup (t,x) K | ˆHn(t, x) H(t, x)|

We further assume that 4

2||K|| /((n 1)hd) b/4,

which is realized whenever h0 is small enough and M1, appearing in the condition nhd M1| log(hd/2ϵ)|, is large enough. From K Γb and (30)-(31), it results that

sup (t,x) K

ˆH0,n(t, x) H0(t, x) b5

sup (t,x) K

ˆHn(t, x) H(t, x) b8

is included in the set B(2) n . Following the treatment of (29), it is easy to see that the latter event occurs with probability 1 ϵ/2 whenever h h0 is small enough (for the bias) and nhd M1| log(hd/2ϵ)|.

Proof of (ii). We suppose that the event En is realized. For all i {1, . . . , n}, recall that

ˆΛ(i) C,n(du | x) = ˆH(i) 0,n(du, x)

ˆH(i) n (u , x) , ˆ (i) n = (ˆΛ(i) C,n ΛC),

and that c(s | x) = SC(s | x)/SC(s | x). It results from Theorem 3.2.3 in (Fleming and Harrington, 1991, page 97) that

ˆS(i) C,n(t | x) SC(t | x)

SC(t | x) =

0 c(u | x) ˆ (i) n (du | x) Z t

( ˆS(i) C,n(u | x) SC(u | x))

SC(u | x) ˆ (i) n (du | x).

Ausset, Cl emenc on and Portier

The Taylor expansion (26) gives that

ˆ (i) n (du | x) = ( ˆH(i) 0,n(du, x) H0(du, x))

H(u, x) ( ˆH(i) n (u, x) H(u, x)) ˆH(i) 0,n(du, x)

+ ( ˆH(i) n (u, x) H(u, x))2 ˆH(i) 0,n(du, x)

H(u, x)2 ˆH(i) n (u, x) ,

which implies that

ˆS(i) C,n(t | x) SC(t | x)

SC(t | x) = ˆa(i) n (t | x) + ˆb(i) n (t | x), (32)

ˆa(i) n (t | x) = Z t

c(u | x) H(u, x)( ˆH(i) 0,n(du, x) H0(du, x))

c(u | x) H(u, x)2 ( ˆH(i) n (u, x) H(u, x)) ˆH(i) 0,n(du, x),

ˆb(i) n (t | x) = Z t

H(u, x)2 ˆH(i) n (u, x) ( ˆH(i) n (u, x) H(u, x))2 ˆH(i) 0,n(du, x)

( ˆS(i) C,n(u | x) SC(u | x))

SC(u | x) ˆ (i) n (du | x).

Now, using (26), we obtain that: ϕ Φ,

δi ϕ( Yi, Xi) ˆS(i) C,n( Yi | Xi) E

δ ϕ( Y , X)

δi ϕ( Yi, Xi) SC( Yi | Xi) E

δ ϕ( Yi, Xi)

i=1 δiϕ( Yi, Xi)

ˆS(i) C,n( Yi | Xi) SC( Yi | Xi)

S2 C( Yi | Xi)

i=1 δiϕ( Yi, Xi) (SC( Yi | Xi) ˆS(i) C,n( Yi | Xi))2

S2 C( Yi | Xi) ˆS(i) C,n( Yi | Xi) .

Then, using (32), we retrieve the expected terms

Zn(ϕ) = Ln(ϕ) + Mn(ϕ) + Rn(ϕ),

which proves (ii).

Empirical Risk Minimization under Random Censorship

Appendix E. Proof of Proposition 3

The proof is based on the decomposition stated in Proposition 2, combined with the lemmas below that permit to control each term involved in it. Their proofs are given in the next section of the Appendix. The term Ln(ϕ) is a basic i.i.d. (centred) average. As shown in the lemma stated below, its uniform ﬂuctuations can be controlled by standard results in empirical process theory.

Lemma 21 Suppose that the hypotheses of Proposition 3 are fulﬁlled. Then, for any ϵ (0, 1), we have with probability at least 1 ϵ:

sup ϕ Φ |Ln(ϕ)|

M1 log(M2/ϵ)

provided that n M1 log(M2/ϵ), where M1 > 0 and M2 > 1 are constants depending on (A, v), K, MΦ, and b only.

We now turn to the term Mn(ϕ). Observe it can be decomposed as

Mn(ϕ) = Vn,1(ϕ) + Bn,1(ϕ) + Vn,2(ϕ) + Bn,2(ϕ)

Vn,1(ϕ) = 1

δiϕ( Yi, Xi) SC( Yi | Xi)

c(u | Xi) H(u, Xi)2

ˆH(i) n (u, Xi) Hh(u, Xi) ˆH(i) 0,n(du, Xi),

Bn,1(ϕ) = 1

δiϕ( Yi, Xi) SC( Yi | Xi)

c(u | Xi) H(u, Xi)2 (Hh(u, Xi) H(u, Xi)) ˆH(i) 0,n(du, Xi),

Vn,2(ϕ) = 1

δiϕ( Yi, Xi) SC( Yi | Xi)

c(u | Xi) H(u, Xi)

ˆH(i) 0,n(du, Xi) H0,h(du, Xi) ,

Bn,2(ϕ) = 1

δiϕ( Yi, Xi) SC( Yi | Xi)

c(u | Xi) H(u, Xi) (H0,h(du, Xi) H0(du, Xi)) .

Next we treat the bias terms Bn,1 and Bn,2.

Lemma 22 Under the assumptions of Proposition 3, for any ϵ (0, 1), we have, with probability 1 ϵ:

sup ϕ Φ |Bn,1(ϕ)| M1h2,

sup ϕ Φ |Bn,2(ϕ)| M1h2,

provided that n M2| log(hd/2ϵ)|, where M1 > 0, M2 > 0 depend only on MΦ, K, L and b.

Ausset, Cl emenc on and Portier

Now we consider Vn,1(ϕ). For simplicity, we set Kij = Kh(Xi Xj) for 1 i, j n. We have:

Vn,1(ϕ) = 1 n(n 1)

δiϕ( Yi, Xi) SC( Yi | Xi)

Yj Yi (1 δj)Kijc( Yj | Xi)

H( Yj, Xi)2

ˆH(i) n ( Yj, Xi) Hh( Yj, Xi)

= 1 n(n 1)2 X

(i,j,k) i =j,i =k

= V n,1(ϕ) + V n,1(ϕ),

where, for all 1 i, j, k n,

vi,j,k(ϕ) = δiϕ( Yi, Xi) SC( Yi | Xi) Yj Yi (1 δj)Kijc( Yj | Xi)

H( Yj, Xi)2

Yk> Yj Kik Hh( Yj, Xi)

V n,1(ϕ) = 1 n(n 1)2 X

(i,j,k) i =j,i =k,j =k

V n,1(ϕ) = 1 n(n 1)2 X

The lemma stated below provides a uniform bound for V n,1(ϕ).

Lemma 23 Under the assumptions of Proposition 3, we have, with probability 1:

V n,1(ϕ) M1

where M1 > 0 depends only on MΦ, K, L and b.

We now consider V n,1(ϕ). Set Zk = (Xk, Yk, δk) for k {1, . . . , n}. It can be decomposed as follows:

V n,1(ϕ) = n 2

n U(1) n,1(ϕ) + U(2) n,1(ϕ) + U(3) n,1(ϕ) + L n(ϕ) o ,

Empirical Risk Minimization under Random Censorship

U(1) n,1(ϕ) = 1 n(n 1)(n 2)

(i,j,k) i =j,i =k,j =k

vi,j,k(ϕ) E[vi,j,k(ϕ)|Zj, Zk]

E[vi,j,k(ϕ)|Zi, Zk] + E[vi,j,k(ϕ)|Zk] ,

U(2) n,1(ϕ) = 1 n(n 1)

{E[vi,j,k(ϕ)|Zj, Zk] E[vi,j,k(ϕ)|Zk]} ,

U(3) n,1(ϕ) = 1 n(n 1)

{E[vi,j,k(ϕ)|Zi, Zk] E[vi,j,k(ϕ)|Zk]} ,

k E[vi,j,k(ϕ)|Zk],

where i, j and k always denote pairwise distinct indexes, the varying amounts of indexes in the summations being the results of the successive marginalizations necessary to obtain degenerate U-processes. Observe that, for all ϕ Φ and pairwise distinct indexes i, j and k in {1, . . . , n}, we have with probability one:

E[vi,j,k(ϕ) | Zi, Zj] = E[vi,j,k(ϕ) | Zi] = E[vi,j,k(ϕ) | Zj] = 0.

The quantities U(k) n,1(ϕ), k {1, 2, 3} are thus degenerate U-statistics of degree 3, 2 and 2 respectively, whereas L n(ϕ) is a basic (centred) i.i.d. average. The following result is essentially proved by applying Corollary 8, once the complexity assumptions related to the classes of kernels involved in the deﬁnition of these degenerate U-processes have been established. It shows that the terms U(k) n,1(ϕ) s are uniformly negligible.

Lemma 24 Suppose that the hypotheses of Proposition 3 are fulﬁlled. There exist constants M1, M2 and h0 depending on (A, v), MΦ, L, K and b only, such that for any ϵ (0, 1), each of the following events holds true with probability at least 1 ϵ:

sup ϕ Φ |U(1) n,1(ϕ)|

M1| log(ϵhd/2)|

sup ϕ Φ |U(2) n,1(ϕ)| M1| log(ϵhd/2)|

sup ϕ Φ |U(3) n,1(ϕ)| M1| log(ϵhd/2)|

as soon as h h0 and M2| log(ϵhd)| nh2d.

Maximal deviation inequalities for the L n(ϕ) can be obtained by means of classical results in empirical process theory, like for Ln(ϕ).

Ausset, Cl emenc on and Portier

Lemma 25 Suppose that the hypotheses of Proposition 3 are fulﬁlled. Then, for any ϵ (0, 1), we have with probability at least 1 ϵ:

sup ϕ Φ |L n(ϕ)|

M1 log(M2/ϵ)

as soon as M2| log(ϵhd)| nh2d and h h0 where h0, M1 > 0 and M2 > 1 are constants depending on (A, v), K, MΦ, L and b only.

The two preceding lemmas combined with the union bound directly yield the following result.

Corollary 26 Suppose that the hypotheses of Proposition 3 are fulﬁlled. There exist constants M1, M2, M3 and h0 depending on (A, v), MΦ, L, K and b only such that for any ϵ (0, 1), we have with probability greater than 1 ϵ:

V n,1(ϕ) M1

n + | log(ϵhd/2)|

| log(ϵhd/2)|

as soon as h h0, M3| log(ϵhd)| nh2d.

We next deal with the term Vn,2(ϕ).

Lemma 27 Suppose that the hypotheses of Proposition 3 are fulﬁlled. There exist constants M1, M2, M3 and h0 depending on (A, v), MΦ, L, K and b only such that for any ϵ (0, 1), we have with probability greater than 1 ϵ:

sup ϕ Φ |Vn,2(ϕ)| M1

n + | log(ϵhd/2)|

as soon as h h0, M3| log(ϵhd/2)| nhd.

Finally, we consider the residual Rn(ϕ). Recall ﬁrst that, for all ϕ Φ, we have Rn(ϕ) = R n(ϕ) + R n(ϕ), where

δiϕ( Yi, Xi) SC( Yi | Xi) ˆb(i) n ( Yi | Xi),

i=1 δiϕ( Yi, Xi)

SC( Yi | Xi) ˆS(i) C,n( Yi | Xi) 2

S2 C( Yi | Xi) ˆS(i) C,n( Yi | Xi) .

Each of the quantities, R n(ϕ) and R n(ϕ), is treated separately. We start with R n(ϕ).

Lemma 28 Suppose that the assumptions of Proposition 3 are satisﬁed. Then, for all ϵ (0, 1), we have with probability greater than 1 ϵ

| log(ϵhd/2)|

nhd + 1 (nhd)2 + h4 !

as soon as h h0 and M2| log(ϵhd/2)| nhd, where M1 and M2 are nonnegative constants depending on K, L, MΦ and b only.

Empirical Risk Minimization under Random Censorship

We now state a uniform bound for R n(ϕ).

Lemma 29 Suppose that the assumptions of Proposition 3 are satisﬁed. Then, for all ϵ (0, 1), we have with probability greater than 1 ϵ

|log(ϵhd/2)|

|log(ϵhd/2)|

(nhd)3/2 + 1 nhd + 1 (nhd)2 + h2 !

as soon as h h0 and M2| log(ϵhd/2)| nhd, where M1 and M2 are nonnegative constants depending on K, L, MΦ and b only.

Now we can conclude the proof of Proposition 3 by gathering each of the previous results. First note that they all are valid under the condition that h h0 and M1| log(ϵhd/2)| nhd

and n M2 log(M3/ϵ). By taking h0 small enough, the last requirement is no longer necessary. In addition, if nhd > 1 and | log(ϵhd/2)| > 1 we guarantee that | log(ϵhd/2)|/nhd 1/nhd 1/(nhd)2 and | log(ϵhd/2)|1/2 | log(ϵhd/2)|3/2, leading to

| log(ϵhd/2)|

| log(ϵhd/2)|

!3/2 | log(ϵhd/2)|

Using this manipulation, we obtain the stated result.

Appendix F. Intermediary Results

Here we prove lemmas involved in the argument of Proposition 3 s proof. Recall that, under the assumptions stipulated: (t, x) K,

H(t, x) b3, SC(t | x) b, c(t | x) 1/b, Hh(t | x) L. (34)

F.1 Proof of Lemma 21

The proof is a direct application of Corollary 8 to the i.i.d. sequence {(Xn, Yn, δn) : n 1} and the class of functions

(x, u, δ) K {0, 1} 7 δϕ(u, x)

indexed by (ϕ, h) Φ ]0, h0]. The previous class is of VC type in virtue of Lemma 12. We choose σ = G = 2MΦ/b, the bound obtained for Ln(ϕ) is simply

(2MΦ/b)n 1/2 C2 1 log (2) 1/2 + log(C2/ϵ)

(4MΦ/b)n 1/2 C2 1 log (2) + log(C2/ϵ)

where the constants C1, C2, C3 are the ones of Corollary 8. Easy manipulations give the result.

Ausset, Cl emenc on and Portier

F.2 Proof of Lemma 22

Taking the supremum of each element in the sum we ﬁnd that

|Bn,1(ϕ)| MΦ

b8 sup (u,x) K |Hh(u, x) H(u, x)| sup (u,x) K | ˆH(i) 0,n(u, x)|.

An appeal to Lemma 18, Lemma 19 combined with (30) gives the ﬁrst result. Concering Bn,2, we write

|Bn,2(ϕ)| MΦ

b sup (t,x) K

c(u | x) H(u, x) (H0,h(du, x) H0(du, x)) .

Because for any signed measure ν on R+ and any measurable function f with total variation at most 1 vanishing at inﬁnity, we have (Dudley, 1992),

| Z f(u) ν(du)| sup t R | Z t

we conclude that

|Bn,2(ϕ)| M sup (u,x) K |H0,h(u, x) H0(u, x)|.

where M > 0 depends only on L, b and MΦ. Conclude by using the bound given in Lemma 18.

F.3 Proof of Lemma 23

Observe that, for i = j, we have

vi,j,j(ϕ) = δiϕ( Yi, Xi)

SC( Yi | Xi) I{ Yj Yi}(1 δj)Kijc( Yj | Xi)

H( Yj, Xi)2 Hh( Yj, Xi).

It follows from (34) that

|vi,j,j(ϕ)| MΦ

b8 K h d L,

and since V n,1(ϕ) is a sum over n(n 1) such terms divided by n(n 1)2 we get the stated bound.

F.4 Proof of Lemma 24

We will use the expression

vi,j,k(ϕ) = wi,j,k(ϕ)Kik Kij E[wi,j,k(ϕ)Kik Kij | Zi, Zj]

wi,j,k(ϕ) = δiϕ( Yi, Xi)

SC( Yi | Xi) I{ Yj Yi}(1 δj)c( Yj | Xi)

H( Yj, Xi)2 I{ Yk > Yj}.

Empirical Risk Minimization under Random Censorship

Using (34), we have that

|wi,j,k| MΦ

Recall that E[vi,j,k(ϕ) | Zi, Zj] = E[vi,j,k(ϕ) | Zi] = E[vi,j,k(ϕ) | Zj] = 0. As a result, the quantities U(k) n,1(ϕ), k {1, 2, 3} are degenerate U-statistics of degree 3, 2 and 2 respectively. For this reason we can apply Corollary 8 to each of them as soon as their respective kernels are shown to form VC classes. The kernel of h2dn(n 1)2U(1) n,1 is

h2d{vi,j,k(ϕ) E [vi,j,k(ϕ)|Zj, Zk] E [vi,j,k(ϕ) | Zi, Zk] + E [vi,j,k(ϕ) | Zk]}.

Lemma 10 and Corollary 17 in Nolan and Pollard (1987) implies that it is of VC type with constant envelop 8MΦ/b8 K 2 as soon as {vi,j,k(ϕ)} is of VC type with envelop MΦ/b8 K 2 . The later is true in virtue of Lemma 9 and Lemma 12. The same arguments implies that the kernels of {h2dn(n 1)U(2) n,1(ϕ)} and {h2dn(n 1)U(3) n,1(ϕ)} are of VC type

with the constant envelop 4MΦ/b8 K 2 . In what follows we specify, for each U(k) n,1(ϕ), the value of σ to use in the application of Corollary 8.

The bound for U(1) n,1(ϕ). Observe that

E h2dwi,j,k(ϕ)Kik Kij 2 MΦ

2 L2c4 Kh2d.

where c2 K = R K2(x)dx. Since we have a sum of 8 terms in the U-statistics of interest, h2dn(n 1)2U(1) n,1(ϕ), each having an L2-norm smaller that E[h4dvi,j,k(ϕ)2] (by Jensen s inequality), we obtain a bound for the resulting variance (using Minkowski s inequality), in 82(MΦ/b8)2L2c4 Kh2d. We apply Corollary 8 with k = 3 and a value for σ larger than the previous bound. We take σ2 = 82(MΦ/b8)2L2c4 Khdhd 0 (note that h h0) and G = 8MΦ K 2 /b8. The conditions are

K 4 L2c4 Khd 0

2 K 2 hd/2hd/2 0 Lc2 K

+ log(C2/ϵ)

and L2c4 Khdhd 0 K 4 ,

where C1, C2 and C3 are the constants in Corollary 8. The latter conditions are indeed of the type h h0 and nhd M2| log(ϵhd/2)|. This gives

sup ϕ Φ |h2dn(n 1)2U(1) n,1(ϕ)| M1hd/2n3/2

!!3/2 + log(C2/ϵ)

where M1 and M2 are constants depending on MΦ, L, K, b, and h0. To recover the stated result, one just needs to multiply the previous bound by 1/(n(n 1)(n 2)h2d) and to use similar manipulations as the ones presented at the end of the proof of Lemma 19.

Ausset, Cl emenc on and Portier

The bound for U(2) n,1(ϕ). In what follows, we use the shortcut E[ |Zi, Zj] = E[ |i, j]. The

kernel of h2dn(n 1)U(2) n,1(ϕ) is given by

h2d{E[vi,j,k(ϕ)|j, k] E[vi,j,k(ϕ)|k]}

= h2d{E[wi,j,k(ϕ)Kik Kij|j, k] E[wi,j,k(ϕ)Kik Kij|j]

E[wi,j,k(ϕ)Kik Kij|k] + E[wi,j,k(ϕ)Kik Kij]}.

By Jensen s inequality and Minkowski s inequality, the variance is then smaller than

42h4d E[E[wi,j,k(ϕ)Kik Kij|j, k]2] 42h4d(MΦ/b8)2E [Kij Kik | j, k]2 .

But we have

E [Kij Kik | j, k] = Z Kh(x Xj)Kh(x Xk)g(x)dx

L Z K(u)Kh(Xj Xk + hu)du

LK h(Xk Xj),

where K = K K and K h(u) = K (u/h)/hd (note that R K (u) du = 1 and K K ). This implies that

42h4d E h E [wi,j,k(ϕ)Kik Kij|j, k]2i 42h4d MΦL

where c2 K = R K2(x)dx. The bound (33) is thus obtained by applying Corollary 8 to h2dn(n 1)U(2) n,1(ϕ) with k = 2 and

σ2 = 42h2dhd 0

The bound for U(3) n,1(ϕ). Based on conditioning arguments, we have

E[vi,j,k(ϕ)|Zk] = Ak(ϕ) E[Ak(ϕ)],

where Ak(ϕ) = E[wi,j,k(ϕ)Kik Kij|k]. We now show that the class of functions n Zk 7 hd Ak(ϕ) : ϕ Φ o

is a VC class with constant envelop. Deﬁne

β1(Zi, Zk) = Kik δiϕ( Yi, Xi) SC( Yi | Xi)

β2(Zi, Zk) = Z M(Xi + hu, Zi, Zk)K(u)du

M(Xj, Zi, Zk) = Z I{u Yi, u < Yk}c(u | Xi)

H(u, Xi)2 H0(du | Xj),

Empirical Risk Minimization under Random Censorship

and observe that

E[wi,j,k(ϕ)Kik Kij|i, k]

= β1(Zi, Zk)E

" I{ Yj Yi}(1 δj)c( Yj | Xi)

H( Yj, Xi)2 I{ Yk > Yj}Kij | i, k

= β1(Zi, Zk)E[M(Xj, Zi, Zk)Kij | i, k]

= β1(Zi, Zk) Z M(Xi + hu, Zi, Zk)g(Xi + hu)K(u)du,

where we have used Assumption 1 and the fact that H0(du|x) = SY (u |x)SC(du|x). Because for any f with total variation at most 1 vanishing at inﬁnity, we have (Dudley, 1992),

Z f(u) {H0(du | x) H0(du | x )} sup u R |H0(u | x) H0(u | x )|

where the last inequality is a consequence of Assumption 2. The same holds true for g in virtue of Assumption 2. Hence, the map M is uniformly Lipschitz with respect to Xj. Appealing to Lemma 11, we obtain that the kernel hd E[wi,j,k(ϕ)Kik Kij|i, k] is VC with constant envelop K MΦL/b8. The same holds true for hd Ak(ϕ) by Lemma 10. Moreover, observe that for all (ϕ, h) Φ ]0, h0], using (35), we have almost surely,

b8 E[Kij Kik|Zk].

E[Kij Kik|Zk] = ZZ Kh(x y)Kh(x Xk)g(x)g(y)dxdy

= ZZ K(z)Kh(x Xk)g(x)g(x hz)dxdz

L Z Kh(x Xk)g(x) Z K(z)dz

it follows that

E[(hd Ak(ϕ))2] h2d MΦL2

Applying Corollary 8 to the kernel hd{Ak(ϕ) E[Ak(ϕ)]} with k = 1 and G = 2 K MΦL/b8

and σ2 = h2d(MΦL2/b8)2 yields the bound

sup ϕ Φ |nhd L n(ϕ)| MΦL2 nhd

log(C2/ϵ)/C3 ,

Ausset, Cl emenc on and Portier

with probability 1 ϵ, provided a condition of the type

nh2d M2| log(hdϵ)|

Straightforward calculations then give the desired result.

F.5 Proof of Lemma 27

For all ϕ Φ, we ﬁrst set

wij(ϕ) = δiϕ( Yi, Xi)I{ Yj Yi}(1 δj)Kijc( Yj | Xi)

SC( Yi | Xi)H( Yj | Xi)

and observe next that

Vn,2(ϕ) = 1

δiϕ( Yi, Xi) SC( Yi | Xi)

c(u | Xi) H(u, Xi)d ˆH(i) 0,n(u, Xi) H0,h(u, Xi)

δiϕ( Yi, Xi) SC( Yi | Xi)

I{ Yj Yi}(1 δj)Kijc( Yj | Xi)

H( Yj | Xi) Z Yi

c(u | Xi) H(u, Xi)d H0,h(u, Xi)

i =j {wij(ϕ) E[wij(ϕ) | Zj]}

= U(1) n,2(ϕ) + U(2) n,2(ϕ),

U(1) n,2(ϕ) = 1 n(n 1)

wij(ϕ) E[wij(ϕ) | Zj]

E[wij(ϕ) | Zi] + E[w12(ϕ)] ,

U(2) n,2(ϕ) = 1

i=1 {E[wij(ϕ) | Zi] E[w12(ϕ)]} . (37)

Hence, Vn,2(ϕ) can be decomposed as the sum of a degenerate U-statistic (36) and an i.i.d. average (37). Note also that, by (34), we have

|wij(ϕ)| MΦ

The bound for U(1) n,2(ϕ). In virtue of Lemma 9 and Lemma 12, the collection of kernels of the degenerate U-statistics n hdn(n 1)U(1) n,2(ϕ) : (ϕ, h) Φ ]0, h0] o

Empirical Risk Minimization under Random Censorship

forms a class of VC type with constants depending only on (v, A), K and h0. In addition, these terms are all bounded by 4 MΦ K /b5 and we have:

Var hdwij(ϕ) 4MΦ

2 hd Lc2 K.

It thus results from the application of Corollary 8 with k = 2 and σ2 = 42 MΦ/b5 2 hd Lc2 K that, with probability greater than 1 ϵ

hdn(n 1)U(2) n,1(ϕ) nhd/2 M1

+ log (C2/ϵ)

where M1 and M2 depends on MΦ, K, L, b, provided a condition of the type h h0 and nhd M2| log(ϵhd/2)|.

The bound for U(2) n,2(ϕ). Following the proof of Lemma 25, the collection of kernels of

U(2) n,2(ϕ) is of VC type with constant envelop. Besides, we have, with probability one:

E[wij(ϕ) | Zi] MΦL

and therefore Var (E[wij(ϕ) | Zi]) MΦL

b5 2 . Applying thus Corollary 8 with k = 1,

σ2 = 4 MΦL/b5 2 and G = 2 MΦL/b5, we obtain that, with probability 1 ϵ,

n U(2) n,2(ϕ) C n

where C depends on MΦ, K, L, b, provided a condition of the type n M3| log(M4/ϵ)| holds true but this is already implied by h h0 and nhd M2| log(ϵhd/2)| whenever h0 is small. The bound stated in the lemma results from rearranging the bounds (38) and (39).

F.6 Proof of Lemma 28

Use the triangle inequality, (30) and (31) with Lemmas 19 and 18 to get that, with probability 1 ϵ,

max i=1,...n sup (t,x) K | ˆH(i) 0,n(t, x) H0(t, x)| M1

| log(ϵhd/2)|

max i=1,...n sup (t,x) K | ˆH(i) n (t, x) H(t, x)| M2

| log(ϵhd/2)|

We suppose further that both previous inequalities are realized. Note that under the mentioned condition on (n, h), it holds that (t, x) K

inf i=1,...,n ˆH(i) n (t, x) b3

Ausset, Cl emenc on and Portier

In a similar fashion as in the proof of Proposition 1 (see (27),(28) and (29)), we apply Lemma 15 to get that

sup (t,x) K | ˆS(i) C,n(t|x) SC(t|x)| (2/b) sup (t,x) K |ˆΛ(i) C,n(t|x) ΛC(t|x)|.

Then, we apply Lemma 16, with θ1 = b3, θ2 = b3/2, β = 1, to ﬁnally obtain that: i {1, . . . , n},

sup (t,x) K | ˆS(i) C,n(t|x) SC(t|x)|

2 b3 sup (t,x) K | ˆH(i) 0,n(t, x) H0(t, x)| + 2

b6 sup (t,x) K | ˆH(i) n (t, x) H(t, x)|

| log(ϵhd/2)|

Hence provided that h h0 and nhd M2| log(ϵhd/2)|, we have R n(ϕ) 2MΦ

b3 sup (t,x) K

SC(t | x) ˆS(i) C,n(t | x) 2 .

F.7 Proof of Lemma 29

Recall ﬁrst that

δiϕ( Yi, Xi) SC( Yi | Xi) ˆb(i) n ( Yi | Xi),

ˆb(i) n (t | x) = Z t

H(u, x)2 ˆH(i) n (u, x) ( ˆH(i) n (u, x) H(u, x))2 ˆH(i) 0,n(du, x)

( ˆS(i) C,n(u | x) SC(u | x))

SC(u | x) ˆ (i) n (du | x).

and ˆ (i) n (du | x) = ˆΛ(i) C,n(du | x) ΛC(du | x).

The following argument is based on Lemma 17, stated in section B. Note that, on the event En, we have: ˆb(i) n (t | x) 2

Z ˆH(i) n (u, x) Hh(u, x) 2 ˆH(i) n (du, x),

( ˆS(i) C,n(u | x) SC(u | x))

SC(u | x) ˆ (i) n (du | x)

b10 sup (u,x) Γb

ˆH(i) n (u, x) Hh(u, x) 2

( ˆS(i) C,n(u | x) SC(u | x))

SC(u | x) ˆ (i) n (du | x)

Empirical Risk Minimization under Random Censorship

The application of the Lemma 17, with S(2)(u) = SC(u | x), S(1)(u) = ˆS(i) C,n(u | x), β = 1, θ = b and

Λ(1)(u) = ˆΛ(i) C,n(u | x) = Z u

ˆH(i) 0,n(ds, x)

ˆH(i) n (s , x) ,

Λ(2)(u) = ΛC(u | x) = Z u

ˆS(i) C,n(u | x) SC(u | x)

SC(u | x) ˆ (i) n (du | x)

sup (u,x) Γb

ˆH(i) n (u, x) H(u, x) 2

+ sup (u,x) Γb

ˆH(i) 0,n(u, x) H0(u, x) 2

+ sup (u,x) Γb

ˆW (i) n (u, x) ,

where C > 0 depends on b and ˆW (i) n (t, x) is deﬁned as

ˆH(i) 0,n(ds, x) H0(ds, x)

ˆH(i) 0,n(du, x) H0(du, x)

SC(u | x)H(u, x) .

Using (30) and (31) combined with Lemma 18 and Lemma 19, we obtain that with probability at least 1 ϵ:

sup (u,x) Γb

ˆH(i) n (u, x) H(u, x) 2 + sup (u,x) Γb

ˆH(i) 0,n(u, x) H0(u, x) 2

1 (nhd)2 + | log(ϵhd/2)|

as soon as h h0 and M2| log(ϵhd/2)| nhd. It remains to show that, with probability at least 1 ϵ:

max i {1, ..., n} sup (u,x) Γb

ˆW (i) n (u, x) M1

| log(hd/2ϵ)|

as soon as h h0 and M2| log(ϵhd/2)| nhd. We ﬁrst deﬁne ˆWn,1(t, x), for all (t, x) K,

ˆH0,n(ds, x) H0(ds, x)

d ˆH0,n(du, x) H0(du, x)

SC(u | x)H(u, x)

Ausset, Cl emenc on and Portier

and notice that, since c(s | x)/(H(s, x)SC(u | x)H(u, x)) 1/b8 (using (34)), we have by virtue of (30)

max i {1, ..., n} sup (t,x) K

ˆWn,1(t, x) ˆW (i) n (t, x) C/(nhd),

where C is a constant depending on b and K only. Let

α1(u, x) = (SC(u | x)H(u, x)) 1 Z u

0 c(s | x)(H0,h(ds, x) H0(ds, x))

and note that

sup (u,x) K |α1(u, x)| M1 sup (u,x) K |H0,h(u, x) H0(u, x)| M2h2.

We deﬁne ˆWn,2(t, x) as

ˆH0,n(ds, x) H0,h(ds, x)

ˆH0,n(du, x) H0,h(du, x)

SC(u | x)H(u, x) ,

we have ˆWn(t, x) = ˆWn,2(t, x)

ˆH0,n(ds, x) H0,h(ds, x)

H(s, x) (H0,h(du, x) H0(du, x))

SC(u | x)H(u, x)

0 α1(u, x) ˆH0,n(du, x) H0(du, x)

0 α1(u, x) (H0,h(du, x) H0(du, x)) .

Applying Fubini s theorem in the second term, we see that the last three terms are similar. We give the details only for the second one. We have

0 α1(u, x)( ˆH0,n(du, x) H0(du, x))

0 |α1(u, x)|( ˆH0,n(du, x) + H0(du, x))

sup (u,x) K |α1(u, x)| sup (u,x) K | ˆH0,n(u, x) + H0(u, x)|

M2h2 sup (u,x) K | ˆH0,n(u, x) + H0(u, x)|

In addition, observe that

ˆWn,2(t, x) = n 2 n X

j=1 vij(t, x)

i =j vij(t, x) + n 2 n X

i=1 vii(t, x)

:= Un(t, x) + Mn(t, x),

Empirical Risk Minimization under Random Censorship

where, for all 1 i, j n, we set

vij(t, x) = uij(t, x) E[uij(t, x)|Zi] E[uij(t, x)|Zj] + E[u1,2(t, x)],

uij(t, x) = ξi,j(x)I{ Yi t}Kh(Xi x)Kh(Xj x),

ξi,j(x) = δiδjc( Yj | x) SC( Yi, x)H( Yi, x)H( Yj, x) I{ Yj Yi}.

Because we have, for all (t, x) K,

E[v12(t, x)|Z1] = E[v12(t, x)|Z2] = 0,

the collection of random variables

{n2h2d Un(t, x) : (t, x, h) K ]0, h0]}

is a degenerate U-process of order 2. The related class of kernels is uniformly bounded by 4||K||2 /b8 and of VC type, by virtue of classic permanence properties recalled in Appendix A. Observe in addition that

Var h2dv12(t, x) h4d42E[u2 12(t, x)]

2 E[K2 1x K2 2x]

Applying Corollary 8 with k = 2 and

σ2 = hdhd 0 4 b8 L2c4 K,

G = 4 K 2 9b8 ,

we obtain that, with probability greater than 1 ϵ,

sup (t,x) K |Un(t, x)| M1 | log(ϵhd/2)|

as soon as M2| log(ϵhd/2)| nhd and h h0. Notice now that, for all (t, x) K, Mn(t, x) = Ln(t, x) + Rn(t, x), where

Ln(t, x) = n 2 n X

i=1 {vii(t, x) E[v11(t, x)]} ,

Rn(t, x) = n 1E[v11(t, x)].

Ausset, Cl emenc on and Portier

Observing that, for all (t, x) K, |h2dv11(t, x)| 4||K||2 /b8 and

Var h2dv11(t, x) h4d42E[u2 11(t, x)]

2 L Z K4(x)dx.

Hence, the application of Corollary 8 with k = 1 to the empirical sums {n2h2d Ln(t, x) : (t, x, h) K ]0, h0]} permits to get that, with probability at least 1 ϵ,

sup (t,x) K |Ln(t, x)| M1

| log( M2/hd/2)| + p

log(C2/ϵ)/C3 (nhd)3/2 ,

where M1 and M2 are constants depending on K, b and L. The previous bound is valid whenever M2| log(ϵhd/2)| nhd and h h0. We also have

n 1E[|v11(t, x)|] 4n 1E[|u11(t, x)|] (4/b8)Lc2 K/(nhd).

This leads to the stated results.

Empirical Risk Minimization under Random Censorship

P. K. Andersen, Ø. Borgan, R. D. Gill, and N. Keiding. Statistical Models Based on Counting Processes. Springer Series in Statistics. Springer US, New York, NY, 1993. ISBN 978-0-387-94519-4. doi: 10.1007/978-1-4612-4348-9.

G. Ausset, F. Portier, and S. Cl emen con. Machine Learning for Survival Analysis: Empirical Risk Minimization for Censored Distribution Free Regression with Applications. In Neur IPS ML4H Workshop, Montreal, Canada, 2018.

H. Bang and A. A. Tsiatis. Median Regression with Censored Cost Data. Biometrics, 58(3): 643 649, 2002. ISSN 0006-341X.

P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497 1537, Aug. 2005. ISSN 0090-5364, 2168-8966. doi: 10.1214/009053 605000000282.

R. Beran. Nonparametric regression with randomly censored survival data. Technical report, 1981.

J. Buckley and I. James. Linear Regression with Censored Data. Biometrika, 66(3):429 436, 1979. ISSN 0006-3444. doi: 10.2307/2335161.

S. Cl emen con and F. Portier. Beating Monte Carlo Integration: A Nonasymptotic Study of Kernel Smoothing Methods. In International Conference on Artiﬁcial Intelligence and Statistics, pages 548 556. PMLR, Mar. 2018.

S. Cl emen con, G. Lugosi, and N. Vayatis. Ranking and Empirical Minimization of Ustatistics. Annals of Statistics, 36(2):844 874, Apr. 2008. ISSN 0090-5364, 2168-8966. doi: 10.1214/009052607000000910.

D. R. Cox. Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2):187 220, 1972. ISSN 0035-9246.

D. R. Cox and D. Oakes. Analysis of Survival Data. Chapman and Hall/CRC, Boca Raton, 1984. ISBN 978-1-315-13743-8. doi: 10.1201/9781315137438.

D. M. Dabrowska. Uniform Consistency of the Kernel Conditional Kaplan-Meier Estimate. Annals of Statistics, 17(3):1157 1167, Sept. 1989. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1176347261.

V. de la Pe na and E. Gin e. Decoupling: From Dependence to Independence. Probability and Its Applications. Springer-Verlag, New York, 1999. ISBN 978-0-387-98616-6. doi: 10.1007/978-1-4612-0537-1.

B. Delyon and F. Portier. Integral approximation by kernel smoothing. Bernoulli, 22(4): 2177 2208, Nov. 2016. ISSN 1350-7265. doi: 10.3150/15-BEJ725.

B. Delyon and F. Portier. Safe and adaptive importance sampling: A mixture approach. Annals of Statistics, Mar. 2020.

Ausset, Cl emenc on and Portier

Y. Du and M. G. Akritas. Uniform strong representation of the conditional Kaplan-Meier process. Mathematical Methods of Statistics, 11(2):152 182, 2002.

R. M. Dudley. Frechet Diﬀerentiability, p-Variation and Uniform Donsker Classes. Annals of Probability, 20(4):1968 1982, Oct. 1992. ISSN 0091-1798, 2168-894X. doi: 10.1214/aop/ 1176989537.

U. Einmahl and D. M. Mason. An Empirical Process Approach to the Uniform Consistency of Kernel-Type Function Estimators. Journal of Theoretical Probability, 13(1):1 37, Jan. 2000. ISSN 1572-9230. doi: 10.1023/A:1007769924157.

T. Fleming and D. Harrington. Counting Processes and Survival Analysis. 1991. doi: 10.2307/2290673.

T. A. Gerds, J. Beyersmann, L. Starkopf, S. Frank, M. J. van der Laan, and M. Schumacher. The Kaplan-Meier Integral in the Presence of Covariates: A Review. From Statistics to Mathematical Finance, pages 25 42, 2017. doi: 10.1007/978-3-319-50986-0.

E. Gin e and A. Guillou. On consistency of kernel density estimators for randomly censored data: Rates holding uniformly over adaptive intervals. Annales de l Institut Henri Poincare (B) Probability and Statistics, 37(4):503 522, July 2001. ISSN 0246-0203. doi: 10.1016/S0246-0203(01)01081-0.

E. Gin e and H. Sang. Uniform asymptotics for kernel density estimators with variable bandwidths. Journal of Nonparametric Statistics, 22(6):773 795, Aug. 2010. ISSN 10485252. doi: 10.1080/10485250903483331.

E. Gin e, V. Koltchinskii, and J. Zinn. Weighted uniform consistency of kernel density estimators. Annals of Probability, 32(3B):2570 2605, July 2004. ISSN 0091-1798, 2168894X. doi: 10.1214/009117904000000063.

R. L. Grossman, A. P. Heath, V. Ferretti, H. E. Varmus, D. R. Lowy, W. A. Kibbe, and L. M. Staudt. Toward a Shared Vision for Cancer Genomic Data. New England Journal of Medicine, 375(12):1109 1112, Sept. 2016. ISSN 0028-4793. doi: 10.1056/NEJMp1607591.

L. Gy orﬁ, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer-Verlag, New York, 2002. ISBN 978-0387-95441-7. doi: 10.1007/b97848.

T. Hothorn, P. B uhlmann, S. Dudoit, A. Molinaro, and M. J. van der Laan. Survival ensembles. Biostatistics, 7(3):355 373, 2006. ISSN 14654644. doi: 10.1093/biostatistics/ kxj011.

H. Ishwaran and U. B. Kogalur. Random survival forests for R. R News, 7(2):25 31, 2007.

H. Ishwaran, U. B. Kogalur, E. H. Blackstone, and M. S. Lauer. Random survival forests. The Annals of Applied Statistics, 2(3):841 860, 2008. ISSN 19326157. doi: 10.1214/08-A OAS169.

Empirical Risk Minimization under Random Censorship

E. L. Kaplan and P. Meier. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association, 53(282):457 481, 1958. ISSN 0162-1459. doi: 10.2307/2281868.

M. Kohler, K. M ath e, and M. Pint er. Prediction from Randomly Right Censored Data. Journal of Multivariate Analysis, 80(1):73 100, Jan. 2002. ISSN 0047259X. doi: 10.1006/ jmva.2000.1973.

G. Lecu e and S. Mendelson. Learning subgaussian classes : Upper and minimax bounds. ar Xiv:1305.4825 [math, stat], Sept. 2016.

O. Lopez. Nonparametric Estimation of the Multivariate Distribution Function in a Censored Regression Model with Applications. Communications in Statistics - Theory and Methods, 40(15):2639 2660, Aug. 2011. ISSN 0361-0926, 1532-415X. doi: 10.1080/03610926.2010.48 9175.

O. Lopez, V. Patilea, and I. van Keilegom. Single index regression models in the presence of censoring depending on the covariates. Bernoulli, 19(3):721 747, Aug. 2013. ISSN 1350-7265. doi: 10.3150/12-BEJ464.

G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. Journal of the European Mathematical Society, 22(3), 2016.

P. Major. An estimate on the supremum of a nice class of stochastic integrals and Ustatistics. Probability Theory and Related Fields, 134(3):489 537, Mar. 2006. ISSN 0178-8051, 1432-2064. doi: 10.1007/s00440-005-0440-9.

N. R. Mann, R. E. Schafer, and N. D. Singpurwalla. Methods for Statistical Analysis of Reliability and Life Data. Wiley, 1974. ISBN 978-0-471-56737-0.

A. M. Molinaro, S. Dudoit, and M. J. van der Laan. Tree-based multivariate regression and density estimation with right-censored data. Journal of Multivariate Analysis, 90(1 SPEC. ISS.):154 177, 2004. ISSN 10957243. doi: 10.1016/j.jmva.2004.02.003.

D. Nolan and D. Pollard. U-Processes: Rates of Convergence. Annals of Statistics, 15(2): 780 799, June 1987. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1176350374.

S. J. Pan and Q. Yang. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345 1359, Oct. 2010. ISSN 1558-2191. doi: 10.1109/TK DE.2009.191.

G. Papa, A. Bellet, and S. Cl emen con. On Graph Reconstruction via Empirical Risk Minimization: Fast Learning Rates and Scalability. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 694 702. Curran Associates, Inc., 2016.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, and V. Dubourg. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011.

Ausset, Cl emenc on and Portier

S. P olsterl. Scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn. Journal of Machine Learning Research, 21(212):1 6, 2020. ISSN 1533-7928.

S. P olsterl, N. Navab, and A. Katouzian. Fast Training of Support Vector Machines for Survival Analysis. In A. Appice, P. P. Rodrigues, V. Santos Costa, J. Gama, A. Jorge, and C. Soares, editors, Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 243 259. Springer International Publishing, 2015. ISBN 978-3-319-23525-7.

S. P olsterl, N. Navab, and A. Katouzian. An Eﬃcient Training Algorithm for Kernel Survival Support Vector Machines. In CML PKDD MLLS 2016, 2016.

F. Portier and J. Segers. On the weak convergence of the empirical conditional copula under a simplifying assumption. Journal of Multivariate Analysis, 166(C):160 181, 2018.

A. Rotnitzky and J. M. Robins. Recovery of Information and Adjustment for Dependent Censoring Using Surrogate Markers. AIDS Epidemiology, 88(424):1473, 1992. ISSN 01621459. doi: 10.1007/978-1-4757-1229-2 14.

P. Royston and M. K. B. Parmar. The use of restricted mean survival time to estimate the treatment eﬀect in randomized clinical trials when the proportional hazards assumption is in doubt. Statistics in Medicine, 30(19):2409 2421, 2011. ISSN 1097-0258. doi: 10.1002/sim.4274.

D. Rubin and M. J. van der Laan. A Doubly Robust Censoring Unbiased Transformation. The International Journal of Biostatistics, 3(1), Jan. 2007. ISSN 1557-4679. doi: 10.2202/ 1557-4679.1052.

G. R. Shorack and J. A. Wellner. Empirical Processes with Applications to Statistics. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, Jan. 2009. ISBN 978-0-89871-684-9. doi: 10.1137/1.9780898719017.

J. A. Steingrimsson and S. Morrison. Deep learning for survival outcomes. Statistics in Medicine, 39(17):2339 2349, 2020. ISSN 1097-0258. doi: 10.1002/sim.8542.

J. A. Steingrimsson, L. Diao, A. M. Molinaro, and R. L. Strawderman. Doubly robust survival trees: Doubly Robust Survival Trees. Statistics in Medicine, 35(20):3595 3612, Sept. 2016. ISSN 02776715. doi: 10.1002/sim.6949.

J. A. Steingrimsson, L. Diao, and R. L. Strawderman. Censoring Unbiased Regression Trees and Ensembles. Journal of the American Statistical Association, 114(525):370 383, Jan. 2019. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.2017.1407775.

W. Stute. Consistent estimation under random censorship when covariables are present. Journal of Multivariate Analysis, 45(1), 1993. doi: 10.1006/jmva.1993.1028.

W. Stute. The Central Limit Theorem Under Random Censorship. Annals of Statistics, 23 (2):422 439, Apr. 1995. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1176324528.

Empirical Risk Minimization under Random Censorship

W. Stute. Distributional Convergence under Random Censorship when Covariables are Present. Scandinavian Journal of Statistics, 23(4):461 471, 1996.

V. Van Belle, K. Pelckmans, J. A. K. Suykens, and S. Van Huﬀel. Support Vector Machines for Survival Analysis. Proceedings of the Third International Conference on Computational Intelligence in Medicine and Healthcare, 2007.

V. Van Belle, K. Pelckmans, J. A. K. Suykens, and S. V. Huﬀel. Learning Transformation Models for Ranking and Survival Analysis. Journal of Machine Learning Research, 12: 819 862, 2011. ISSN 15324435.

M. J. van der Laan and J. M. Robins. Uniﬁed Methods for Censored Longitudinal Data and Causality. Springer Series in Statistics. Springer New York, New York, NY, ﬁrst edition, 2003. ISBN 978-0-387-21700-0.

I. van Keilegom and N. Veraverbeke. Uniform strong convergence results for the conditional kaplan-meier estimator and its quantiles. Communications in Statistics - Theory and Methods, 25(10):2251 2265, Jan. 1996. ISSN 0361-0926. doi: 10.1080/03610929608831836.

M. P. Wand and M. C. Jones. Kernel Smoothing. Number 60. Chapman & Hall, Boca Raton, FL, U.S., Dec. 1994. ISBN 978-0-412-55270-0.