# survival_analysis_via_density_estimation__74207676.pdf Survival Analysis via Density Estimation Hiroki Yanagisawa 1 Shunta Akiyama 1 This paper introduces a framework for survival analysis by reinterpreting it as a form of density estimation. Our algorithm post-processes density estimation outputs to derive survival functions, enabling the application of any density estimation model to effectively estimate survival functions. This approach broadens the toolkit for survival analysis and enhances the flexibility and applicability of existing techniques for density estimation. Our framework is versatile enough to handle various survival analysis scenarios, including competing risk models for multiple event types. It can also address dependent censoring when prior knowledge of the dependency between event time and censoring time is available in the form of a copula. In the absence of such information, our framework can estimate the upper and lower bounds of survival functions, accounting for the associated uncertainty. 1. Introduction Survival analysis constitutes a statistical methodology extensively employed across diverse domains, including medicine, engineering, and social sciences, to analyze and predict the probability distribution of the time until the occurrence of an event of interest. In numerous real-world scenarios, the phenomenon of competing risks emerges when an individual is subject to multiple mutually exclusive events, where the occurrence of one event precludes the observation or alters the probability of the other events. The field of survival analysis with competing risks has a rich historical foundation, as extensively reviewed in survey papers (Wang et al., 2019; Wiegrebe et al., 2024). Survival analysis with K competing risks can be conceptualized as a variant of density estimation, a subfield of machine learning aimed at estimating the probability distribution of a 1Cyber Agent, Tokyo, Japan. Correspondence to: Hiroki Yanagisawa . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). target variable. Given this conceptual similarity, numerous methodologies originally developed for density estimation have been adapted for survival analysis, particularly in scenarios with K = 2 under the conditional independence assumption. For instance, random forests (Breiman, 2001) for density estimation have been adapted into random survival forests (Ishwaran et al., 2008) for survival analysis, and modern neural network models for density estimation have been extended into models such as Deep Hit (Lee et al., 2018). Furthermore, strictly proper scoring rules for density estimation (Gneiting & Raftery, 2007) have been adapted for survival analysis (Rindt et al., 2022; Yanagisawa, 2023). Calibration metrics, such as the expected calibration error (Naeini et al., 2015; Guo et al., 2017), have similarly been extended to D-calibration for survival analysis (Haider et al., 2020). Despite the numerous extensions of density estimation methodologies for survival analysis, these adaptations encounter several limitations. First, these adaptations are typically tailored to specific methodologies on a case-by-case basis, necessitating the development of a new customized extension for each novel density estimation method. Second, most survival analysis models rely on the conditional independence assumption (or even stronger assumptions such as the proportional hazards assumption (Cox, 1972)), which may not hold in various real-world applications. This issue highlights the need for survival models that operate under weaker assumptions. Third, while Tsiatis (1975) demonstrates that the survival function cannot be identified without making any assumption, the estimation of the upper and lower bounds of the survival function has not been fully investigated under the condition where no assumptions can be made. Fourth, while the existence of a strictly proper scoring rule for survival analysis with K = 2 under the conditional independence assumption has been established (Rindt et al., 2022) and can be used as an evaluation metric, no strictly proper scoring rule has been established for K > 2. In this paper, we propose a novel framework that reinterprets survival analysis through the lens of density estimation. By post-processing the outputs of density estimation models in the form of cumulative incidence functions, our algorithm derives survival functions, thus enabling any density estimation model to be effectively utilized for survival anal- Survival Analysis via Density Estimation Cumulative incidence function ˆVk(t|x) postprocess Output ˆFk(t|x) Figure 1. Two-step (TS) algorithm for survival analysis with K competing risks: it first estimates the cumulative incidence functions ˆVk(t|x) via density estimation and then postprocesses them to obtain the outputs ˆFk(t|x). ysis, as illustrated in Fig. 1. This reinterpretation not only broadens the toolkit available for survival analysis but also enhances the flexibility and applicability of existing density estimation techniques. Our framework s versatility is evident in its capacity to handle a diverse range of survival analysis scenarios, including competing risks and dependent censoring. The contributions of this paper are summarized as follows: Model-agnostic framework. Our two-step algorithm can be integrated with any density estimation model, allowing for the selection of a density model based on various criteria such as prediction performance, training time, and interpretability. To demonstrate the effectiveness of our algorithm, we show that our algorithm, when combined with the Light GBM model (Ke et al., 2017), outperforms baseline models on real datasets. Furthermore, we prove that if the density estimation can achieve an arbitrarily small error ϵ (as the number of data points increases), our algorithm can estimate survival functions with a small error for K = 2 under plausible assumptions (see Appendices C and D). Dependent censoring. Our approach addresses the challenge of dependent censoring, a prevalent issue in survival data where the censoring mechanism is not independent of the event time. To handle dependent censoring, a widely used assumption is that prior knowledge regarding the dependency between event time and censoring time is available in the form of a copula (Emura & Chen, 2018). Our algorithm is based on this copula-based assumption and, therefore, relies on weaker assumptions than the conditional independence assumption. Additionally, our two-step algorithm is capable of handling situations where the dependency information is provided as a parameterized copula, along with appropriate assumptions on the form of the survival function to guarantee the identifiability of the parameter. While our framework is primarily designed to estimate the individual survival function, for the estimation of the aver- age survival function, the Kaplan-Meier estimator (1958) is widely used when the conditional independence assumption is valid. This estimator has been extended as the copulagraphic (CG) estimator (Zheng & Klein, 1995; Carri ere, 1995), which operates under the same assumption as ours, that a copula is provided as prior knowledge. Therefore, our two-step algorithm can be seen as an extension of the CG estimators to estimate the individual survival function. Upper and lower bounds estimation. Even in scenarios where prior knowledge of the dependency is absent, our framework is capable of estimating the upper and lower bounds of individual survival functions, accounting for the uncertainty arising from the lack of knowledge about the copula. Regarding the estimation of the average survival function, Peterson (1976) presents the upper and lower bounds estimation. Our method can be seen as an extension of this approach for estimating individual survival functions. Strictly proper scoring rule for competing risks. Given a probabilistic output, a strictly proper scoring rule is usually employed as an evaluation metric. Several proper scoring rules exist for survival analysis with K = 2 (Rindt et al., 2022; Yanagisawa, 2023), but no scoring rule has been proven to be proper for K > 2. In this paper, we introduce a new strictly proper scoring rule, termed NLL-SC, for K 2. NLL-SC is based on the copula and can be used as an evaluation metric for survival analysis under dependent censoring, addressing the difficulties discussed by Gharari et al. (2023) in defining an appropriate evaluation metric under dependent censoring. Additionally, by utilizing NLLSC, we construct a new monotone neural network model for K competing risks based on the copula-based assumption, named the survival copula network (SC-Net), which can be seen as an extension of the monotone neural network model for K = 2 under the conditional independence assumption (Rindt et al., 2022). Survival Analysis via Density Estimation 2. Preliminaries In this study, we investigate survival analysis with K competing risks. Let X denote the random variable representing a feature vector whose support is X. Let T1, T2, . . . , TK be the K random variables corresponding to the event times of K distinct types of events, with each Tk supported on R 0. Due to censoring, direct observation of samples (t1, t2, . . . , t K) (T1, T2, . . . , TK) is not feasible. However, we can observe their minimum T = mink{Tk}K k=1 and the corresponding index = arg mink{Tk}K k=1. We note that in much of the existing literature on survival analysis, the index starts at 0 (i.e., {0, 1} for K = 2); however, in this paper, we assume starts at 1 (i.e., {1, 2} for K = 2). The primary objective of survival analysis is to estimate the cumulative distribution function (CDF) Fk(t|x) = Pr(Tk t|x) of Tk conditioned on x X for each k [K] from samples (x, t, δ) i.i.d. (X, T, ), where [K] = {1, 2, . . . , K}. In this study, we frame the problem to estimate Fk(t|x) for each t in a finite set of times {ζb}B b=0 such that 0 = ζ0 < ζ1 < < ζB, where ζB is sufficiently large to ensure 0 t < ζB for any observed time t T. We assume that the true Fk(t|x) is a continuous and monotonically increasing function of t, satisfying Fk(ζ0|x) = 0 and Fk(ζB|x) = 1 for all k [K] and x X. This discretization approach is commonly adopted in numerous survival models (e.g., (Lee et al., 2018; Yanagisawa, 2023; Hickey et al., 2024)). While survival analysis often focuses on estimating the survival function, defined as Sk(t|x) = 1 Fk(t|x), this study aims to estimate the CDF Fk(t|x) of Tk. Additionally, we consider the estimation of the average CDF Fk(t) of Fk(t|x) over x X and the average survival function Sk(t) = 1 Fk(t). Censored Joint Distribution (CJD) representation. Given an observation (x, t, δ) (X, T, ), we interpret the pair (t, δ) in a K-dimensional space. For instance, in the case where K = 2, the pair (t, δ) can be represented as a line segment in a two-dimensional plane, as illustrated in Fig. 2(a). In this figure, the pair (t, δ) = (20, 1) observed for x(1) is depicted as a vertical line segment, indicating that Event 1 is observed at time t1 = 20 and t2 t1. Similarly, the pair (t, δ) = (35, 2) observed for x(2) is represented as a horizontal line segment, indicating that Event 2 is observed at time t2 = 35 and t1 t2. Given that the time horizon is discretized by the boundaries {ζb}B b=0, the K-dimensional space is partitioned as illustrated in Fig. 2(b) by defining the set Rb,k of observations (t, δ) (T, ) as follows: Rb,k = {(t, δ) : ζb < t ζb+1, δ = k}. t = 20, δ = 1 t = 35, δ = 2 35 (a) Observations in 2D space R6,2 R1,1 R2,1 R3,1 R4,1 R5,1 0 ζ1 ζ2 ζ3 ζ4 ζ5 ζ6 (b) CJD divided by Rb,k Figure 2. (a) Two observations {(x(1), 20, 1), (x(2), 35, 2)} are illustrated as vertical and horizontal line segments in the twodimensional space. (b) The CJD space with B = 6 and K = 2, which is divided into subregions Rb,k. In this study, we refer to this partitioned region as the Censored Joint Distribution (CJD) representation. Copula and survival copula. Many models for survival analysis with K = 2 operate under the conditional independence assumption, denoted as T1 T2|X, or even stronger assumptions such as the proportional hazards assumption, exemplified by the Cox model (1972). In this work, we adopt a more flexible assumption that the dependence structure among T1, T2, . . . , TK can be modeled using a copula, a widely recognized method for capturing dependencies in survival analysis (Emura & Chen, 2018; Gharari et al., 2023; Zhang et al., 2024). In probability theory and statistics, a copula is defined as a multivariate cumulative distribution function where each univariate marginal distribution is uniformly distributed over the interval [0, 1]. Copulas are particularly useful for characterizing the dependence structure among random variables. Formally, a copula is defined as follows (Nelsen, 2006; Gharari et al., 2023). Definition 2.1. (Copula.) A copula is a K-dimensional function C : [0, 1]K [0, 1] that satisfies the following conditions: (i) C(u1, u2, . . . , uk 1, 0, uk+1, . . . , u K) = 0 for every u [0, 1]K, (ii) C(1, . . . , 1, u, 1, . . . , 1) = u for every u [0, 1], (iii) Given u, v [0, 1]K such that uk < vk is valid for all k [K], the following condition is satisfied: X l {0,1}K ( 1)l1+l2+ +l K C(ul1 1 v1 l1 1 , ul2 2 v1 l2 2 , . . . , ul K K v1 l K K ) 0, where l = (l1, l2, . . . , l K) is a length-K binary vector. Survival Analysis via Density Estimation A notable example of a copula is the independence copula, defined as: Cind(u1, u2, . . . , u K) = k=1 uk. (1) Another example is the Frank copula for the bivariate case with a non-zero parameter θ: CFrank(u1, u2) = 1 θ log 1 + (e θu1 1)(e θu2 1) A significant attribute of copulas is that the joint distribution Pr(T1 t1, T2 t2, . . . , TK t K) can be uniquely represented using a copula, as stated in Sklar s theorem. Theorem 2.2. (Sklar s Theorem (1959).) There exists a copula C such that for all t1, t2, . . . , t K, Pr(T1 t1, T2 t2, . . . , TK t K) = C(F1(t1), F2(t2), . . . , FK(t K)). If the marginal distribution Fk is continuous for all k, then C is unique. Copulas are instrumental in computing joint probabilities. For instance, the joint probability Pr(ζ1 < T1 ζ4, T2 ζ3) can be determined using a copula as follows: Pr(ζ1 < T1 ζ4, T2 ζ3) = C(F1(ζ4), F2(ζ3)) C(F1(ζ1), F2(ζ3)), (3) where Fk(t) = Pr(Tk t). In the context of survival analysis, a survival copula C is frequently employed, which satisfies the following equation: Pr(T1 > t1, T2 > t2, . . . , TK > t K) = C(1 F1(t1), 1 F2(t2), . . . , 1 FK(t K)). It is well-established that any survival copula C can be represented using its corresponding copula C (see, e.g., (Georges et al., 2001)). For instance, in the case where K = 2, the survival copula C can be expressed as: C(u1, u2) = u1 + u2 1 + C(1 u1, 1 u2). Additionally, it is important to note that if C = Cind (defined in (1)), then C = Cind. 3. Two-Step Algorithm We introduce a two-step algorithm designed to estimate the CDF Fk(ζb|x) for survival analysis with K competing risks. As illustrated in Algorithm 1, the initial step involves estimating Vk(t|x), the conditional k-th cumulative incidence function (CIF), which is formally defined as follows: Vk(t|x) = Pr(T t, = k|x). (4) Subsequently, utilizing the estimated ˆVk(ζ|x), we estimate the probability rb,k|x = Pr((t, δ) Rb,k|x) within the context of the CJD representation via the following equation: ˆrb,k|x = ˆVk(ζb|x) ˆVk(ζb 1|x). (5) Thereafter, the output distribution ˆFk(ζb|x) is derived from the estimation ˆrb,k|x, assuming that we have prior knowledge regarding the dependencies among the random variables T1, T2, . . . , TK in the form of a copula C. It is noteworthy that the conditional independence assumption (T1 T2|X) used in many survival models for K = 2 is equivalent to using the independence copula (1) for the copula C. 3.1. Step 1: Cumulative Incidence Function Estimation A straightforward methodology for estimating the CIF (4) is the application of a distribution regression model, which is specifically engineered to estimate a conditional CDF F(y|x) = Pr(Y y|x) given samples (x, y) i.i.d. (X, Y ), where the random variables X and Y represent feature vectors and target values from R, respectively. Exemplary distribution regression models include those based on monotone neural networks (Chilinski & Silva, 2020) and random forests (Schlosser et al., 2019; Hothorn & Zeileis, 2021; Cevid et al., 2022), and NGBoost (Duan et al., 2020), which is founded on gradient boosting. An alternative approach involves directly estimating ˆrb,k|x (defined in (5)) through the utilization of a density estimation model. Here, density estimation refers to estimating Pr(Y = y|x) from samples (x, y) i.i.d. (X, Y ), where the random variables X and Y represent feature vectors and target values from a discrete set, respectively. Most multiclass classification models can be used for density estimation, including random forests (Breiman, 2001), gradient boosting (e.g., Light GBM (Ke et al., 2017)), and neural networks. Recent advancements in density estimation techniques are found in (Filho et al., 2023). While a broader array of models for density estimation exists compared to distribution regression models, an advantage of employing a distribution regression model is that, if we wish to adjust the hyperparameter B, there is no need to retrain the predictive model to estimate ˆVk(t|x); we only need to recompute Eq. (5) to obtain ˆrb,k|x. 3.2. Step 2: Postprocessing The second step of our algorithm involves computing ˆFk(ζb|x) utilizing the estimates ˆrb,k|x obtained in the first Survival Analysis via Density Estimation step and a specified copula C. Let rb|x [0, 1]K denote the vector of length K whose k-th element is rb,k|x, and let Fb|x RK denote the vector of length K whose k-th element is Fk(ζb|x). We begin by representing rb|x as a function of Fb 1|x, Fb|x, and the copula C. For simplicity, we consider the case where K = 2 in this section, with generalizations for K > 2 detailed in Appendix B.1. By the definition of rb,k|x, we have the following representations: rb,1|x = Pr(ζb 1 < T1 ζb, T1 T2|x) = q{1},b|x w1 q{1,2},b|x , (6) rb,2|x = Pr(ζb 1 < T2 ζb, T2 T1|x) = q{2},b|x w2 q{1,2},b|x , (7) q{1},b|x = Pr(ζb 1 < T1 ζb, ζb 1 T2|x), (8) q{2},b|x = Pr(ζb 1 < T2 ζb, ζb 1 T1|x), q{1,2},b|x = Pr(ζb 1 < T1 ζb, ζb 1 < T2 ζb|x), and w1, w2 0 are unknown weight parameters such that w1 + w2 = 1. See Fig. 3 for an illustration of quantities (8) and (9) with b = 2. Unless otherwise stated, we set w1 = w2 = 1/2. Note that, if we use a sufficiently large B, the correction term (9) should be small, and therefore the choices of the weight parameters w1 and w2 should have minimal impact on these equations. Then, recalling that any joint distribution can be computed using a copula (as demonstrated in equation (3)), it is straightforward to represent (8) (9) using F1(ζb 1|x), F1(ζb|x), F2(ζb 1|x), F2(ζb|x), and the copula C: q{1},b|x = C(F1(ζb|x), 1) C(F1(ζb 1|x), 1) C(F1(ζb|x), F2(ζb 1|x)) + C(F1(ζb 1|x), F2(ζb 1|x)), (10) q{2},b|x = C(1, F2(ζb|x)) C(1, F2(ζb 1|x)) C(F1(ζb 1|x), F2(ζb|x)) + C(F1(ζb 1|x), F2(ζb 1|x)), (11) q{1,2},b|x = C(F1(ζb|x), F2(ζb|x)) C(F1(ζb|x), F2(ζb 1|x)) C(F1(ζb 1|x), F2(ζb|x)) + C(F1(ζb 1|x), F2(ζb 1|x)). (12) Combining (6) (12), we can represent rb,1|x and rb,2|x using F1(ζb 1|x), F1(ζb|x), F2(ζb 1|x), F2(ζb|x), and the copula C. This means that we can write this relationship as rb|x = g C(Fb|x|Fb 1|x) (13) 0 ζ1 ζ2 ζ3 ζ4 ζ5 ζ6 Figure 3. Illustration of q{1},2|x and q{1,2},2|x. Algorithm 1 Two-Step (TS) Algorithm 1: Estimate ˆVk(t|x) 2: Let ˆrb,k|x = ˆVk(ζb|x) ˆVk(ζb 1|x) 3: Let ˆF0|x = 0 4: for b = 1, 2, . . . , B 1 do 5: Calculate ˆFb|x by solving Eq. (13) 6: end for 7: return {ˆFb|x}B 1 b=1 by using a function g C that depends on C. Having established (13), we can obtain the estimation ˆFb|x of Fb|x for all b by solving this equation based on the estimation ˆrb|x of rb|x, as outlined in Steps 3 7 of Algorithm 1. We leverage the initial condition ˆF0|x = F0|x = 0 for all x, where 0 is the K-dimensional vector of zeros. At the initial step for b = 1, we obtain ˆFb|x by solving (13). This equation is solvable because it contains K equality constraints and the unknown value is the length-K vector Fb|x. We repeat this procedure for b = 2, 3, . . . , B 1 to obtain ˆFb|x for all b. See Appendix B.2 for more details of the algorithm. Note that this second step of our algorithm is similar to the algorithm based on a bisection root-finding algorithm (Zheng & Klein, 1995), but their algorithm is valid only for K = 2 and its extension for K > 2 is unknown. In contrast, our algorithm is extendable for K > 2 as shown in Appendix B.1. We also note that as B , Carri ere (1995) shows another method to estimate ˆFb|x by solving an equation similar to (13). We discuss this method further in Appendix B.3. Simplified implementation of our algorithm. In our implementation, we adopted a simpler approach to solve Eq. (13) for all b simultaneously, rather than sequentially solving Eq. (13) for each b as outlined in Lines 3 7 of Algorithm 1. Specifically, by utilizing an automatic differentiation library for Python (e.g., Py Torch and Tensorflow), we estimate ˆFb|x by minimizing the following objective Survival Analysis via Density Estimation 0 ζ1 ζ2 ζ3 ζ4 ζ5 ζ6 0 ζ1 ζ2 ζ3 ζ4 ζ5 ζ6 0 ζ1 ζ2 ζ3 ζ4 ζ5 ζ6 0 ζ1 ζ2 ζ3 ζ4 ζ5 ζ6 column-wise sum column-wise sum column-wise sum Censored joint distribution Lower bound estimation Copula-based estimation Upper bound estimation Figure 4. Illustration of the upper and lower bounds estimation. Here, region Rb,k is divided into grids, with denser color indicating a higher probability that a data point is contained in the corresponding region. The lower bound estimation is achieved by assigning the probability mass ˆrb,k to the last time slot within region Rb,k and then calculating the column-wise sum. Conversely, the upper bound estimation is obtained by assigning the probability mass ˆrb,k to the earliest time slot within region Rb,k and calculating the column-wise sum. g C(ˆFb|x|ˆFb 1|x) ˆrb|x 2 (14) for all b simultaneously. Note that Eq. (14) is minimized if Eq. (13) holds for all b. 3.3. Non-Identifiablity A limitation of our two-step algorithm is the necessity for prior knowledge of the copula C, which may not be readily available in practical scenarios. One potential approach to address this issue is to employ a parameterized copula Cθ, where θ represents some parameter (e.g., the Frank copula (2) with parameter θ), and to extend the survival analysis framework to estimate the copula parameter θ in addition to the CDFs Fk(t|x). However, Tsiatis (1975) demonstrated that this approach is not feasible because it is impossible to identify Fk(t|x) without making some assumptions about Fk(t|x). Consequently, many researchers have explored the introduction of additional assumptions on the distribution Fk(t|x) to ensure the identifiability of both the copula parameter θ and Fk(t|x). For instance, Gharari et al. (2023) assume that Fk(t|x) follows a Weibull distribution and present an algorithm to estimate the copula parameter θ. Other examples include the study of copula identifiability under the proportional hazards assumption in (Heckman & Honor e, 1989; Deresa & Keilegom, 2024), and the investigation of copula identifiability for other restricted classes of distributions Fk(t|x) in (Czado & Keilegom, 2022; Wang, 2023; Zhang et al., 2024). Under the strongest assumption that the true distribution Fk(t|x) is completely known, Schwarz et al. (2013) discuss the identifiability of Archimedean copulas and the non-identifiability of symmetric copulas. It is important to note that our two-step algorithm is sufficiently flexible to incorporate these identifiability results. Suppose that the distribution Fk,η(t|x) and the copula Cθ are parameterized by η and θ, respectively. If the identifiability of the parameters η and θ is established, then the objective function (14) should have a unique solution, where ˆFb|x and C are replaced with the parameterized ˆFb,η|x and Cθ, respectively. 4. Upper and Lower Bounds Estimation As we discuss in Appendix A, there exist scenarios where it is infeasible to verify dependency in any manner and we cannot have any copula as a prior knowledge on the dependency. Even in such instances, our algorithm can still be employed to estimate the upper and lower bounds of Fk(t|x). By definition, the following inequalities can be readily verified: Fk(ζb|x) Pr Fk(ζb|x) Pr b b,k [K] Rb ,k which are equivalent to X b b rb ,k|x Fk(ζb|x) X b b,k [K] rb ,k |x. (15) Survival Analysis via Density Estimation Based on these inequalities, we estimate the upper and lower bounds of Fk(ζb|x) by substituting rb,k|x with the output ˆrb,k|x from Step 1 of our algorithm. It is important to note that these inequalities are derived without utilizing the parameters w1 and w2 in our two-step algorithm. As illustrated in Fig. 4, the estimation of upper and lower bounds and the second step of our algorithm to estimate Fk(ζb|x) can be interpreted as redistributing the probability mass within the CJD representation into fine-grained grid cells. While these bounds for Fk(t|x) are computed only for discrete values t {ζb}B b=1, it is possible to compute these bounds for any t using a distribution regression model to estimate the CIF Vk(t|x) (as defined in (4)). This approach is explained in Appendix E. These bounds can be viewed as variants of the bounds established for the average survival function in (Peterson, 1976). It is crucial to distinguish that our upper and lower bounds differ from the concept of a confidence interval, which quantifies the epistemic uncertainty inherent in the prediction model (Bengs et al., 2022). Our bounds quantify uncertainty arising from the absence of prior knowledge about the true copula C. As we discuss in Appendix E, these two types of bounds can be combined to account for both sources of uncertainty. 5. Strictly Proper Scoring Rule for Competing Risks Strictly proper scoring rules hold significant importance in the domain of statistics, particularly for the evaluation of probabilistic estimates (Gneiting & Raftery, 2007). These scoring rules ensure that the expected score is minimized when the estimated probabilities accurately reflect the true distribution of outcomes. In this section, we establish the existence of a strictly proper scoring rule for any number of competing risks, K, given that the dependency structure is defined by a survival copula. While Rindt et al. (2022) demonstrated the existence of a strictly proper scoring rule for K = 2 under the assumption of conditional independence, no such rule has been established for K > 2 or for cases where the conditional independence assumption does not hold. We begin by defining proper and strictly proper scoring rules as follows: Definition 5.1. (Proper and Strictly Proper Scoring Rules.) A scoring rule S for the estimation ˆFk(t|x) of Fk(t|x) is proper if the following inequality is satisfied: E[S({ ˆFk(t|x)}K k=1, (t, δ))] E[S({Fk(t|x)}K k=1, (t, δ))]. (16) A scoring rule is strictly proper if equality in (16) holds if and only if ˆFk(t|x) is equal to Fk(t|x) for all k and t. Subsequently, we demonstrate the existence of a strictly proper scoring rule for any K. According to Tsiatis (1975), the derivative vk(t|x) of the cumulative incidence function (CIF) Vk(t|x) (as defined in Eq. (4)) can be represented using Fk(t|x) and a survival copula C: vk(t|x) = d C(1 F1(t1|x), 1 F2(t2|x), 1 FK(t K|x)) t1=t2= =t K=t . By using this equation, we propose a scoring rule NLLSC, which stands for Negative Log-Likelihood based on Survival Copula. Assumption 5.2. The Kullback-Leibler (KL) divergence between vk(t|x) and its estimate ˆvk(t|x) satisfies DKL(vx||ˆvx) < for all k, where vx and ˆvx denote the probability distribution over (k, t) of vk(t|x) and ˆvk(t|x), respectively. Theorem 5.3. (A Strictly Proper Scoring Rule for Competing Risks.) If Assumption 5.2 holds, the following scoring rule SNLL SC is strictly proper: SNLL SC({ ˆFk(t|x)}K k=1, (t, δ)) = 1δ=k log ˆvk(t|x). Proof. Since Assumption 5.2 ensures the existence of the KL divergence, we have E[SNLL SC({ ˆFk(t|x)}K k=1, (t, δ))] E[SNLL SC({Fk(t|x)}K k=1, (t, δ))] 0 vk(t|x)(log vk(t|x) log ˆvk(t|x))dt = DKL(vx||ˆvx) Equality in the last inequality holds if and only if vk(t|x) = ˆvk(t|x) holds for all k and t, which is equivalent to that Fk(t|x) = ˆFk(t|x) holds for all k and t. Hence the scoring rule SNLL SC is strictly proper. Neural network model based on NLL-SC. Utilizing the scoring rule NLL-SC, we propose a new monotone neural network model, the survival copula network (SC-Net). Similar to the monotone neural network model for K = 2 proposed by Rindt et al. (2022), our SC-Net employs a monotone neural network to represent a CDF (Chilinski & Silva, 2020), and we use the NLL-SC as its loss function. Survival Analysis via Density Estimation Table 1. Evaluation metrics Task Metric Name Assumption Metric Type Density estimation CJD-Brier - Discrimination CJD-Logarithmic - Discrimination CJD-KS - Calibration Survival analysis NLL-SC Copula Discrimination Cen-log Independence Discrimination D-calibration Independence Calibration KM-calibration Independence Calibration We utilized the smooth min-max neural network (Igel, 2024) as the monotone neural network, though other monotone neural networks, such as those proposed in (Yanagisawa et al., 2022; Kim & Lee, 2024), could also be used. Note that our SC-Net with K = 2 can be viewed as incorporating prior knowledge of the ground truth copula into the DCSurvival model (Zhang et al., 2024), although the primary objective of DCSurvival appears to be the identification of parameters within an Archimedean copula. 6. Experiments We conducted a series of experimental evaluations to assess the performance of our proposed two-step algorithm. We conducted the experimental procedures on a virtual machine possessing a single CPU devoid of any GPU, equipped with a memory of 64 GB, and operating on Cent OS Stream 9. The software implementation was achieved using Python 3.11.6 and Py Torch 2.1.2. The datasets employed for this purpose were the Dialysis and oldmort datasets, sourced from the Python package Surv Set (Drysdale, 2022). Models. In the first step of our algorithm, we utilized five different models. Specifically, the TS-Brier and TSLog models employed neural networks for density estimation, utilizing the Brier and Logarithmic scores (Gneiting & Raftery, 2007) as their respective loss functions. The TS-LGB, TS-RF, and TS-DRF models utilized the Light GBM (LGB) model (Ke et al., 2017), the random forest (RF) model available in the Python package sklearn, and the distribution regression forest (DRF) ( Cevid et al., 2022), respectively. The prefix TS denotes the Two-Step algorithm, with each model implementing a distinct first-step model but sharing a common second-step algorithm (specifically, the simplified implementation as detailed in Sec. 3.2). The hyperparameter was set to B = 32. Additionally, we employed our SC-Net model proposed in Sec. 5. For comparative purposes, we included the Cox model (Cox, 1972), the random survival forest (RSF) (Ishwaran et al., 2008), and the Deep Hit model (Lee et al., 2018), a neural network model. In the Deep Hit model, we set the parameter α = 0 to ensure its loss function was a proper scoring rule as recommended by Yanagisawa (2023). Evaluation metrics. We used three metrics to evaluate the estimates of the CJD representation ˆrb,k|x and four metrics to assess the estimated distribution ˆFk(t|x) as summarized in Table 1. For the CJD representation, we used the Brier and Logarithmic scores (Gneiting & Raftery, 2007) as discrimination metrics. Additionally, we applied the sum of the Kolmogorov-Smirnov calibration error (Gupta et al., 2021) as a calibration metric. This metric, based on the Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnov, 1939), is defined as follows: k=1 max 0 σ 1 hk,σ hk,σ , i=1 1 ˆ fk(x(i)) σ 1y(i)=k i=1 1 ˆ fk(x(i)) σ ˆfk(x(i)). Here, ˆfk(x(i)) represents the probability of the feature vector x(i) being classified in class k, and each ˆfk(x) is equal to its corresponding ˆrb,k|x. For models that output only the distribution ˆFk(t|x), the CJD representation ˆrb,k|x was estimated using Eq. (13) with the output and the independence copula Cind (as defined in (1)). For the estimated distribution ˆFk(t|x), we utilized the simplified variant of the censored logarithmic score (referred to as cen-log) (Yanagisawa, 2023) and our strictly proper scoring rule, NLL-SC, as evaluation metrics. Additionally, Dcalibration (Haider et al., 2020) and KM-calibration (Yanagisawa, 2023) were employed as calibration metrics. Results. Figure 5 illustrates the results for the Dialysis and oldmort datasets. The results indicate that while the Deep Hit model demonstrated superior performance compared to the Cox and RSF models, our TS-LGB model Survival Analysis via Density Estimation 0.00 0.25 0.50 0.75 1.00 1.25 1.50 Cen-log (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.000 0.005 0.010 0.015 D-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.0 0.2 0.4 0.6 0.8 1.0 1.2 KM-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 1.5 2.0 2.5 3.0 3.5 NLL-SC SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.6 0.8 1.0 1.2 CJD-Brier SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 2.0 2.5 3.0 3.5 4.0 CJD-Logarithmic SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.0 0.2 0.4 0.6 0.8 CJD-KS SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.00 0.25 0.50 0.75 1.00 1.25 1.50 Cen-log (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.000 0.002 0.004 0.006 0.008 D-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.0 0.2 0.4 0.6 0.8 1.0 1.2 KM-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 2 3 4 5 6 NLL-SC SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.6 0.8 1.0 1.2 CJD-Brier SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 2.5 3.0 3.5 4.0 CJD-Logarithmic SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.0 0.2 0.4 0.6 0.8 CJD-KS SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit Figure 5. Performance comparison on the Dialysis and oldmort datasets with B = 32 (lower is better). While Deep Hit performs the best among the baseline models, our TS-LGB model shows comparable or better performances than Deep Hit. exhibited comparable or superior performance relative to Deep Hit. It is also noteworthy that neural network-based models, including Deep Hit, typically require longer training times compared to tree-based models such as RSF and TS-LGB. Consequently, the TS-LGB model emerged as the most effective model for the Dialysis and oldmort datasets. Additional evaluation results using other datasets for K = 2 and scenarios involving competing risks with K = 3, in comparison with models such as Deep Survival Machines (DSM) (Nagpal et al., 2021), De Surv (Danks & Yau, 2022), and Neural Fine-Gray (Neural FG) (Jeanselme et al., 2023), are detailed in Appendix F. Furthermore, the appendix includes additional evaluation results encompassing the estimation of upper and lower bounds, as well as an ablation study on the hyperparameter B. Source Codes. The implementations of our models are accessible at https://github.com/Cyber Agent AILab/cenreg. 7. Conclusion We demonstrated a reduction from survival analysis to density estimation, which allows the application of any density estimation model to survival analysis. This algorithm op- erates under the assumption of having prior knowledge of the copula C, which is a weaker assumption compared to the conditional independence assumption. This approach is consistent with Tsiatis s non-identifiability result (1975), and we have shown that our algorithm can also be utilized to estimate the upper and lower bounds of individual survival functions. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Bengs, V., H ullermeier, E., and Waegeman, W. Pitfalls of epistemic uncertainty quantification through loss minimisation. In Advances in Neural Information Processing Systems, pp. 29205 29216, 2022. Bilodeau, B., Foster, D. J., and Roy, D. M. Minimax rates for conditional density estimation via empirical entropy. The Annals of Statistics, 51(2):762 790, 2023. Survival Analysis via Density Estimation Breiman, L. Random forests. Machine Learning, 45:5 32, 2001. Carri ere, J. F. Removing cancer when it is correlated with other causes of death. Biometrical Journal, 37(3):339 350, 1995. Cevid, D., Michel, L., N af, J., B uhlmann, P., and Meinshausen, N. Distributional random forests: Heterogeneity adjustment and multivariate distributional regression. Journal of Machine Learning Research, 23(333):1 79, 2022. Chilinski, P. and Silva, R. Neural likelihoods via cumulative distribution functions. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 420 429, 2020. Cox, D. R. Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2):187 220, 1972. Czado, C. and Keilegom, I. V. Dependent censoring based on parametric copulas. Biometrika, 110(3):721 738, 2022. Danks, D. and Yau, C. Derivative-based neural modelling of cumulative distribution functions for survival analysis. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pp. 7240 7256, 2022. Defazio, A., Yang, X. A., Mehta, H., Mishchenko, K., Khaled, A., and Cutkosky, A. The road less scheduled. In Advances in Neural Information Processing Systems, 2024. Deresa, N. W. and Keilegom, I. V. Copula based Cox proportional hazards models for dependent censoring. Journal of the American Statistical Association, 119(546):1044 1054, 2024. Drysdale, E. Surv Set: An open-source time-to-event dataset repository. Technical report, ar Xiv:2203.03094, 2022. Duan, T., Avati, A., Ding, D. Y., Thai, K. K., Basu, S., Ng, A., and Schuler, A. NGBoost: Natural gradient boosting for probabilistic prediction. In Proceedings of the 37th International Conference on Machine Learning, pp. 2690 2700, 2020. Emura, T. and Chen, Y.-H. Analysis of Survival Data with Dependent Censoring: Copula-Based Approaches. Springer, 2018. Filho, T. S., Song, H., Perello-Nieto, M., Santos-Rodriguez, R., Kull, M., and Flach, P. Classifier calibration: a survey on how to assess and improve predicted class probabilities. Machine Learning, 112:3211 3260, 2023. Fine, J. P. and Gray, R. J. A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association, 94(446):496 509, 1999. Georges, P., Lamy, A.-G., Nicolas, E., Quibel, G., and Roncalli, T. Multivariate survival modelling: A unified approach with copulas. Technical report, SSRN, 2001. Gharari, A. H. F., Cooper, M., Greiner, R., and Krishnan, R. G. Copula-based deep survival models for dependent censoring. In Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 669 680, 2023. Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359 378, 2007. Greenwood, M. A report on the natural duration of cancer. In Reports on Public Health and Medical Subjects. Ministry of Health. London: H.M.S.O., 1926. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1321 1330, 2017. Gupta, K., Rahimi, A., Ajanthan, T., Mensink, T., Sminchisescu, C., and Hartley, R. Calibration of neural networks using splines. In ICLR 2021, 2021. Haider, H., Hoehn, B., Davis, S., and Greiner, R. Effective ways to build and evaluate individual survival distributions. Journal of Machine Learning Research, 21(1), 2020. Heckman, J. J. and Honor e, B. E. The identifiability of the competing risks model. Biometrika, 76(2):325 330, 1989. Hickey, J., Henao, R., Wojdyla, D., Pencina, M., and Engelhard, M. Adaptive discretization for event prediction (ADEPT). In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, pp. 1351 1359, 2024. Hothorn, T. and Zeileis, A. Predictive distribution modeling using transformation forests. Journal of Computational and Graphical Statistics, 30(4):1181 1196, 2021. Igel, C. Smooth min-max monotonic networks. In Proceedings of the 41st International Conference on Machine Learning, pp. 20908 20923, 2024. Ishwaran, H., Kogalur, U. B., Blackstone, E. H., and Lauer, M. S. Random survival forests. The Annals of Applied Statistics, 2(3):841 860, 2008. Survival Analysis via Density Estimation Jeanselme, V., Yoon, C. H., Tom, B., and Barrett, J. Neural Fine-Gray: Monotonic neural networks for competing risks. In Proceedings of the Conference on Health, Inference, and Learning, pp. 379 392, 2023. Kannel, W. B. and Mc Gee, D. L. Diabetes and cardiovascular disease: the Framingham study. JAMA, 241(19): 2035 2038, 1979. Kaplan, E. L. and Meier, P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457 481, 1958. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. Light GBM: a highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3149 3157, 2017. Kim, H. and Lee, J.-S. Scalable monotonic neural networks. In The Twelfth International Conference on Learning Representations, 2024. Kolmogorov, A. Sulla determinazione emp ırica di uma legge di distribuzione, 1933. Lee, C., Zame, W., Yoon, J., and van der Schaar, M. Deep Hit: A deep learning approach to survival analysis with competing risks. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018. Naeini, M. P., Cooper, G., and Hauskrecht, M. Obtaining well calibrated probabilities using Bayesian binning. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1), 2015. Nagpal, C., Li, X., and Dubrawski, A. Deep survival machines: Fully parametric survival regression and representation learning for censored data with competing risks. IEEE Journal of Biomedical and Health Informatics, 25 (8):3163 3175, 2021. Nelsen, R. B. An Introduction to Copulas. Springer New York, NY, 2006. Niles-Weed, J. and Berthet, Q. Minimax estimation of smooth densities in Wasserstein distance. The Annals of Statistics, 50(3):1519 1540, 2022. Peterson, A. V. Bounds for a joint distribution function with fixed sub-distribution functions: Application to competing risks. Proceedings of the National Academy of Sciences, 73(1):11 13, 1976. Rindt, D., Hu, R., Steinsaltz, D., and Sejdinovic, D. Survival regression with proper scoring rules and monotonic neural networks. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pp. 1190 1205, 2022. Sart, M. Estimating the conditional density by histogram type estimators and model selection. ESAIM: Probability and Statistics, 21:34 55, 2017. Schlosser, L., Hothorn, T., Stauffer, R., and Zeileis, A. Distributional regression forests for probabilistic precipitation forecasting in complex terrain. The Annals of Applied Statistics, 13(3):1564 1589, 2019. Schwarz, M., Jongbloed, G., and Van Keilegom, I. On the identifiability of copulas in bivariate competing risks models. Canadian Journal of Statistics, 41(2):291 303, 2013. Sklar, A. Fonctions de r epartition a n dimensions et leurs marges. Publications de l Institut Statistique de l Universit e de Paris, 8:229 231, 1959. Smirnov, N. On the estimation of the discrepancy between empirical curves of distribution for two independent samples, 1939. Therneau, T. M. and Grambsch, P. M. Modeling Survival Data: Extending the Cox Model. Springer Science & Business Media, 2000. Tsiatis, A. A nonidentifiability aspect of the problem of competing risks. Proc. Natl. Acad. Sci. USA, 72(1):20 22, 1975. Vallender, S. S. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications, 18(4):784 786, 1974. Wang, A. The identifiability of copula models for dependent competing risks data with exponentially distributed margins. Statistica Sinica, 33(2):983 1001, 2023. Wang, P., Li, Y., and Reddy, C. K. Machine learning for survival analysis: A survey. ACM Comput. Surv., 51(6), 2019. Wiegrebe, S., Kopper, P., Sonabend, R., Bischl, B., and Bender, A. Deep learning for survival analysis: A review. Artificial Intelligence Review, 57, 2024. Yanagisawa, H. Proper scoring rules for survival analysis. In ICML 2023, 2023. Yanagisawa, H., Miyaguchi, K., and Katsuki, T. Hierarchical lattice layer for partially monotone neural networks. In Advances in Neural Information Processing Systems, pp. 11092 11103, 2022. Zhang, W., Ling, C. K., and Zhang, X. Deep copula-based survival analysis for dependent censoring with identifiability guarantees. In AAAI 2024, 2024. Zheng, M. and Klein, J. P. Estimates of marginal survival for dependent competing risks based on an assumed copula. Biometrika, 82(1):127 138, 1995. Survival Analysis via Density Estimation A. On Applications of Proposed Methods Survival analysis is a crucial tool in various fields such as medicine, engineering, and social sciences, where the time until an event of interest occurs is studied. The applicability of our proposed methods can be classified into three distinct types based on the nature of the dependency between the event time and the censoring time. (Conditional Independence.) The first class of applications encompasses scenarios where the conditional independence assumption between the event time and the censoring time is valid or highly likely to hold. This situation typically arises in cases of administrative censoring, where data points are censored solely due to the limited window of observation times. For instance, in clinical trials, patients might be censored at the end of the study period regardless of whether the event of interest has occurred. In such cases, the conditional independence assumption simplifies the analysis, allowing the use of survival models based on the conditional independence assumption including our two-step algorithm. (Verifiable Dependency.) The second class of applications includes scenarios where there is a dependency between the event time and the censoring time, but this dependency can be verified, albeit at a significant cost, for a small subset of data points. An example of this situation is found in medical studies where the primary event of interest is patient mortality, and censoring occurs when patients are discharged from the hospital. In such cases, it might be feasible to investigate the true event time for a small fraction of patients, thereby assessing the dependency between the event and censoring times as a form of copula. This estimated copula can then be used to model the dependency in the entire dataset, allowing for more accurate estimation of survival functions. (Unverifiable Dependency.) The third class of applications involves situations where the dependency between the event time and the censoring time cannot be determined, even with extensive resources. In such scenarios, identifying the survival function is inherently challenging due to the unknown nature of the dependency. We note that, even without any knowledge of the dependency, our method can estimate the upper and lower bounds to account for the uncertainty in the dependency structure. Additionally, if we have some confidence that the dependency can be represented by a parameterized copula with some range of the parameters (e.g., the Frank copula with parameters 5 θ 5), we can also estimate the survival functions using the copula information to narrow down the bounds. B. Step 2 of Proposed Algorithm B.1. Generalization for competing risks We generalize the second step of our algorithm presented in Sec. 3 for K > 2. For notational simplicity, we omit x in this section. For a subset I [K], let QI,b = {(t1, t2, . . . , t K) : k I(ζb 1 < tk ζb) and k I (ζb 1 tk )}. Then we can compute the probability q I,b = Pr((t1, t2, . . . , t K) QI,b) by using the inclusion-exclusion principle: j=0 ( 1)K j X J:J [K],|J|=j c(I, J, b), where c(I, J, b) = C(p I,J,b,1, p I,J,b,2, . . . , p I,J,b,K) and p I,J,b,k = 1 if k J \ I, Fk(ζb) if k J I, Fk(ζb 1) if k J. Note that, if K = 2, the quantity q I,b here is equal to the quantities computed in equations (10), (11), and (12). As generalizations of equations (6) and (7) for K > 2, we represent rb,k using q I,b as rb,k = q{k},b i=2 ( 1)i X H:k H [K],|H|=i w H q H,b | {z } Correction term Survival Analysis via Density Estimation Algorithm 2 Algorithm to solve equations 1: Initialize ˆFb|x = ˆFb 1|x 2: while ˆFb|x is not converged do 3: for k {1, 2, . . . , K} do 4: Increase the k-th element of ˆFb|x so that the k-th equation of ˆrb|x = g C(ˆFb|x|ˆFb 1|x) is satisfied (while other k -th element (k = k ) of ˆFb|x is fixed) 5: end for 6: end while 7: return ˆFb|x where w H 0 is a weight parameter and we assume that w H = 1/|H|. Note that, if K = 2, this equation is equal to equations (6) and (7). Since q I,b are functions of Fb, Fb 1, and copula C, we can write this relationship as rb = g C(Fb|Fb 1), by using a function g C that depends on C. Having established this equation, we can estimate ˆFb by using Algorithm 1. B.2. Solving Equations In this section, we present an algorithm to solve Eq. (13). Algorithm 2 shows a pseudo-code to solve this equation, and it capitalizes on the following property: Property B.1. Assuming that ˆFb 1|x is fixed: (1) The k-th element of the length-K vector g C(ˆFb|x|ˆFb 1|x) is monotonically increasing with respect to the k-th element of ˆFb|x. (2) The k ( = k)-th element of the length-K vector g C(ˆFb|x|ˆFb 1|x) is monotonically decreasing with respect to the k -th element of ˆFb|x. First, we demonstrate that the following inequality always holds during the execution of Algorithm 2: ˆrb|x g C(ˆFb|x|ˆFb 1|x). (17) At line 1 of Algorithm 2, ˆFb|x is initialized with ˆFb 1|x. Hence, by the definition of Eq. (13), we have g C(ˆFb|x|ˆFb 1|x) = 0, which means that inequality (17) holds. At line 4 of this algorithm, we can increase the k-th element of ˆFb|x to satisfy ˆrb|x = g C(ˆFb|x|ˆFb 1|x) due to Property B.1(1), and this increment does not violate inequality (17) due to Property B.1(2). Since each element in the length-K vector ˆFb|x does not decrease during the execution of this algorithm, we can find the solution to equation (13) by repeating the while-loop (lines 2 6) until the convergence of ˆFb|x. Parameterization. When we formulate equation (13), we assume that Fk(ζb|x) is represented by using some parameters. In our implementation, we used the parameterization by using the softmax function Fk(ζb|x) = Pb b =1 exp(αb ,k,x) PB b =1 exp(αb ,k,x) with B parameters αb,k,x for each k and x. The softmax function was chosen to ensure that any function Fk(ζ|x) can be expressed. Note that, any other parameterization can be used to represent Fk(ζb|x) in our algorithm. Survival Analysis via Density Estimation B.3. Alternative Algorithm for Step 2 We briefly explain the algorithm presented in (Carri ere, 1995). This algorithm exploits the following equation: d dt Vk(t) = tk C(1 F1(t1), 1 F2(t2), . . . , 1 FK(t K)) t1=t2= =t K=t C(S1(t1), S2(t2), . . . , SK(t K)) t1=t2= =t K=t = C(u1, u2, . . . , u K) u1=S1(t),u2=S2(t),...,u K=SK(t) d dt Sk(tk) tk=t , where C is the survival copula corresponding to copula C. Assuming that B is sufficiently large, his algorithm solves this equation with respect to Sk(t) on t {ζb}B b=0 by using an estimate ˆVk(t) of Vk(t) and these approximations for all k: uk Sk(ζb) + Sk(ζb+1) d dt Sk(t) t=ζb Sk(ζb+1) Sk(ζb) d dt ˆVk(t) t=ζb ˆVk(ζb+1) ˆVk(ζb) This algorithm is designed to estimate the average survival function Sk(t) and thereby Fk(t), but we can modify the algorithm to obtain Fk(t|x) conditional on x if a conditional estimate ˆVk(t|x) is available. C. Theoretical Analysis In this section, we theoretically verify that the two-step algorithm outputs solutions with sufficiently small errors. We consider the case K = 2 for simplicity, and we assume ζb = b B ζB. As discussed in the preceding section, various models can be implemented to estimate the CIF in the first step of our algorithm. Therefore, we evaluate errors affected by Step 2, solving (13), under the assumption that the models employed in the first step accurately approximate the true probabilities such that |ˆrb,k rb,k| ϵ (18) holds for all b = 1, . . . , B and k = 1, 2. Note that while how small a value we can take as ϵ in (18) depends on the choice of the model in Step 1, we can apply the results exhibited in this section. We provide examples of achieving (18) in Appendix D. To formally state our theoretical results, we introduce the following assumption: Assumption C.1. We assume the following conditions: (1) (True probability is not biased.) There exists a global constant c0 > 0 such that for every b = 1, . . . , B and k = 1, 2, Fk(ζb|x) Fk(ζb 1|x) = Pr(ζb 1 < Tk ζb) c0 (2) (Copula.) We assume that the copula C is of class C2 and satisfies ℓ:= inf (u,v) [0,1]2 2 u v C(u, v) > 0, L := sup (u,v) [0,1]2 max 2 u2 C(u, v), 2 u v C(u, v), 2 v2 C(u, v) < + . (3) (All tk are equally observed.) There exist constants c1 > 0 depend on ℓ, L, and τ > 0 such that the following condition holds: Let δ0 > 0 be a constant determined by ℓand L and b0 := max b n b | k, Fk(ζb|x) 1 δ0 τ log B o . Then, min k Fk(ζb0|x) 1 c1 log B . We make some remarks on the assumption. The first condition is required to bound the error by the choice of w1 and w2; if the probability concentrates on a squared region partitioned by suboptimal w1 and w2, significant errors are inevitable. Survival Analysis via Density Estimation This condition is satisfied if Fk is Lipschitz-continuous with c0ζB serving as the Lipschitz constant. The second condition manages the sensitivity of the estimation relative to the true distribution and noise. For example, if 2C u v 1, indicating that C exhibits minimal variation as u and v change, substantial adjustments to the estimation are necessary to accommodate for noise and achieve (13). This condition is typically met for the independence copula C(u, v) = uv with any ℓ< 1 and L > 1. The third condition appears to be technical. As will be demonstrated in subsequent analyses, errors between ˆFk and Fk can only be effectively bounded for b b0. As b approaches B, and consequently rb,k diminishes, the impact of ϵ intensifies. Condition (3) excludes scenarios where a part of tks is concentrated in the region b > b0. In other words, all tks are equally observed in the region b b0. Let W1( , ) be the Wasserstein distance1. Then, we provide the statement about the W1 distance between the estimated and true probabilities. We consider the extension of ˆFk(ζb|x) to a CDF on [0, ζB] by ˆFk(t|x) := ˆFk(ζb|x), where ζb < t ζb+1. Theorem C.2. Suppose that Assumption C.1 holds. Then, there exists a constant cϵ > 0 depending c0, ℓand L such that if ϵ cϵ B , the following inequality holds: W1 ˆµk|x, µk|x ζB B1+τϵ + c1 B b0 where ˆµk|x and µk|x are probability measures whose CDFs are given by ˆFk( |x) and Fk( |x), respectively. The proof is deffered to the following subsection. Suppose that the condition ϵ = o(B1+τ) holds. Then, we obtain an upper bound as W1 ˆµk|x, µk|x = ζB o(1). Thus, we can ensure that as B + and the sample size increases as we can take sufficiently small ϵ, the output of Step 2 converges to the ground truth distribution in terms of the W1 distance. We provide some comments on Theorem C.2. We can observe a trade-off in (19) based on the choice of B: while the second term decreases as B increases, B1+τ and ϵ in the first term should increase. Consequently, an optimal choice of B should be considered under appropriate assumptions that determine ϵ and b0, such as the model utilized in Step 1 and the properties of F |x. It is also significant to examine that the derived bound achieves a statistical min-max lower bound exhibited in (Niles-Weed & Berthet, 2022; Bilodeau et al., 2023), for example. We reserve these considerations for future research endeavors. C.1. Proof of Theorem C.2 This subsection provides proofs of Theorem C.2. Here, for notational simplicity, we abbreviate the conditional variable x. For example, we denote Fk(ζb) instead of Fk(ζb|x). We assume w1 = w2 = 1 2 just for simplicity. We can extend our analysis to arbitrary w1, w2 (0, 1). Moreover, we assume that Step 2 exactly solves (13), i.e., ˆFk exactly satisfies the equation (13). First, we evaluate the error between the outputs of Step 2 and the true probability. Proposition C.3 (Estimation error when solving (13)). Suppose that Assumption C.1 holds. Let ˆFb be the output of Algorithm 1. Then, for every b = 1, . . . , B and k = 1, 2, there exists a constant cϵ > 0 depending c0, ℓand L such that under the condition ϵ cϵ B , for sufficiently large B and every b satisfying max k {1,2} Fk(ζb) 1 δ with a positive constant δ satisfying 8c0L τℓlog B δ with τ (0, 1 | ˆFk(ζb) Fk(ζb)| ϵ + 16Lδ 2 Proof. Let us denote b,k := ˆFk(ζb) Fk(ζb). Instead of (20), we aim to obtain a tighter bound | b,k| ϵ + 16Lδ 2 1W1(µ, ν) := inf π Π(µ,ν) R R2|x y|dπ(x, y), where Π(µ, ν) denotes the set of all couplings of two probability measures µ and ν on Survival Analysis via Density Estimation We give its proof by induction on b. For the case b = 1, we have C( ˆF1(ζ1), 1) = ˆr1,1 + w1 C( ˆF1(ζ1), ˆF2(ζ1)) C(F1(ζ1), 1) = r1,1 + w 1,1 C(F1(ζ1), F2(ζ1)) with w 1,1 [0, 1]. By taking the difference of the both sides, we obtain C( ˆF1(ζ1), 1) C(F1(ζ1), 1) = ˆr1,1 r1,1 + w1 C( ˆF1(ζ1), ˆF2(ζ1)) w 1,1 C(F1(ζ1), F2(ζ1)). (22) First, we evaluate the term C( ˆF1(ζ1), ˆF2(ζ1)). Rearranging terms gives C( ˆF1(ζ1), 1) w1 C( ˆF1(ζ1), ˆF2(ζ1)) = ˆr1,1 r1,1 + C(F1(ζ1), 1) w 1,1 C(F1(ζ1), F2(ζ1)) ϵ + C(F1(ζ1), 1) where the first inequality follows from ˆr1,1 r1,1 ϵ by (18) and w 1,1 C(F1(ζ1), F2(ζ1)) 0, and the last inequality is derived from F1(ζ1) c0 B by Assumption C.1-(1) and C(F1(ζ1), 1) = Z F1(ζ1) u v C(u, v) | {z } L dvdu L F1(ζ1) L c0 Since C( ˆF1(ζ1), ˆF2(ζ1)) C( ˆF1(ζ1), 1), we obtain C( ˆF1(ζ1), 1) w1 C( ˆF1(ζ1), 1) C( ˆF1(ζ1), 1) w1 C( ˆF1(ζ1), ˆF2(ζ1)) ϵ + L c0 By using 1 w1 = 1 2, we obtain C( ˆF1(ζ1), 1) 2 ϵ + L c0 Moreover, we have C( ˆF1(ζ1), 1) = Z F1(ζ1) u v C(u, v) | {z } ℓ dvdu ℓˆF1(ζ1), A similar argument gives By combining these bounds, we obtain C( ˆF1(ζ1), ˆF2(ζ1)) = Z ˆ F1(ζ1) u v C(u, v) | {z } L dvdu L ˆF1(ζ1) ˆF2(ζ1) 4L Thus we get the bound on the term C( ˆF1(ζ1), ˆF2(ζ1)). Survival Analysis via Density Estimation Moreover, we have C(F1(ζ1), F2(ζ1)) = Z F1(ζ1) u v C(u, v) | {z } L dvdu L F1(ζ1)F2(ζ1) L c0 where we use Assumption C.1-(1) for the last inequality. Then, by taking the absolute value of the both sides of (22), we have C( ˆF1(ζ1), 1) C(F1(ζ1), 1) = ˆr1,1 r1,1 + w1 C( ˆF1(ζ1), ˆF2(ζ1)) w 1,1 C(F1(ζ1), F2(ζ1)) |ˆr1,1 r1,1| + w1 C( ˆF1(ζ1), ˆF2(ζ1)) w 1,1 C(F1(ζ1), F2(ζ1)) ϵ + max n C( ˆF1(ζ1), ˆF2(ζ1)), C(F1(ζ1), F2(ζ1)) o where we use the triangle inequality for the first inequality, 0 w1, w 1,1 1 for the second one, and (23), (24) for the third one. Since C( ˆF1(ζ1), 1) C(F1(ζ1), 1) ℓ ˆF1(ζ1) F1(ζ1) = ℓ| 1,1|, we obtain where the last term coincides to the right hand side of (21) with b = 1. We can obtain the bound for k = 2 by utilizing the same argument. Thus we get (21) for b = 1. Assume that (21) holds for b = b , i.e., | b ,k| ϵ + 16Lδ 2 b +1 B 4c0L holds for k = 1, 2. We consider the case b = b + 1. This bound gives that by taking ϵ cϵ B with a sufficiently small cϵ, | b ,k| ϵ + 16Lδ 2 1 B1 τ , (26) where we use 1 x e x in the third inequality and the definition of δ in the fourth inequality. This and Fk(ζb ) < 1 δ gives ˆFk(ζb ) < 1 δ 2 for sufficiently large B. ˆrb +1,1 = ˆq{1},b w1 ˆq{1,2},b , (27) rb +1,1 = q{1},b w b +1,1 q{1,2},b , (28) Survival Analysis via Density Estimation Now, we consider integral representation of ˆq{1},b and ˆq{1,2},b as ˆq{1},b = C( ˆF1(ζb +1), 1) C( ˆF1(ζb ), 1) C( ˆF1(ζb +1), ˆF2(ζb )) + C( ˆF1(ζb ), ˆF2(ζb )) = Z ˆ F1(ζb +1) u v C(u, v)dvdu, (29) ˆq{1,2},b = C( ˆF1(ζb +1), ˆF2(ζb +1)) C( ˆF1(ζb +1), ˆF2(ζb )) C( ˆF1(ζb ), ˆF2(ζb +1)) + C( ˆF1(ζb ), ˆF2(ζb )) = Z ˆ F1(ζb +1) Z ˆ F2(ζb +1) u v C(u, v)dvdu. The same expression holds for q{1},b and q{1,2},b by replacing ˆF1 and ˆF2 with F1 and F2. By taking the difference between (27) and (28), we obtain ˆrb +1,1 rb +1,1 = ˆq{1},b ˆq{1},b w1 ˆq{1,2},b + w 1,1 q{1,2},b . (30) Similar to the case b = 1, we first evaluate the term ˆq{1,2},b (note that ˆq{1,2},b = C( ˆF1(ζ1), ˆF2(ζ1))). By rearranging terms, we have ˆq{1},b w1 ˆq{1,2},b = ˆrb +1,1 rb +1,1 + q{1},b w 1,1 q{1,2},b = ϵ + Z F1(ζb +1) u v C(u, v) | {z } L ϵ + L (F1(ζb +1) F1(ζb ))(1 F2(ζ b)) where the first inequality follows from ˆrb +1,1 rb +1,1 ϵ by (18) and q{1,2},b 0, and the last inequality, follows from Assumption C.1-(1) and 1 F2(ζb) < 1. Moreover, the left hand side is lower bounded by ˆq{1},b w1 ˆq{1,2},b (1 w1)ˆq{1},b Z ˆ F1(ζb +1) u v C(u, v) | {z } ℓ 2 ℓ ˆF1(ζb +1) ˆF1(ζb ) (1 ˆF2(ζb )), where we use 1 w1 = 1 2 and (29) for the equality. Combining this with (31), we have ˆF1(ζb +1) ˆF1(ζb ) 2 ℓ(1 ˆF2(ζb )) This and the triangle inequality give | b +1,1| = ˆF1(ζb +1) ˆF1(ζb ) + ˆF1(ζb ) F1(ζb ) + (F1(ζb ) F1(ζb +1)) ˆF1(ζb +1) ˆF1(ζb ) + ˆF1(ζb ) F1(ζb ) + |F1(ζb ) F1(ζb +1)| ℓ(1 ˆF2(ζb )) + | b ,1| + c0 which we will use in the latter of this proof. The same bound as (32) holds for k = 2, which gives ˆF1(ζb +1) ˆF1(ζb ) ˆF2(ζb +1) ˆF2(ζb ) 4 ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 Survival Analysis via Density Estimation Thus, we obtain ˆq{1,2},b = Z ˆ F1(ζb +1) Z ˆ F2(ζb +1) u v C(u, v) | {z } L L ˆF1(ζb +1) ˆF1(ζb ) ˆF2(ζb +1) ˆF2(ζb ) ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 Moreover, we have q{1,2},b |x = Z F1(ζb +1) Z F2(ζb +1) u v C(u, v) | {z } L L (F1(ζb +1) F1(ζb ))(F2(ζb +1) F2(ζb )) where we use Assumption C.1-(1) for the last inequality. Then, by taking the absolute value of the both sides of (30), we have ˆq{1},b |x q{1},b |x = ˆrb +1,1 rb +1,1 + w1 ˆq{1,2},b |x w b +1,1 q{1,2},b |x |ˆrb +1,1 rb +1,1| + w1 ˆq{1,2},b |x w b +1,1 q{1,2},b |x ϵ + max ˆq{1,2},b |x, q{1,2},b |x ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 where we use the triangle inequality for the first inequality, 0 w1, w b +1,1 1 for the second one, and (33) and (34) for the third one. Then, we evaluate the left hand side. We have ˆq{1},b q{1},b = Z ˆ F1(ζb +1) u v C(u, v)dvdu Z F1(ζb +1) u v C(u, v)dvdu Survival Analysis via Density Estimation Z F1(ζb +1) u v C(u, v)dvdu Z ˆ F1(ζb +1) u v C(u, v)dvdu = Z F1(ζb +1) u C(u, 1)du Z ˆ F1(ζb +1) u C(u, 1)du Z F1(ζb +1) u C(u, F2(ζb ))du + Z ˆ F1(ζb +1) u C(u, ˆF2(ζb ))du = Z F1(ζb +1) ˆ F1(ζb +1) u C(u, 1)du Z F1(ζb ) u C(u, 1)du Z ˆ F1(ζb ) ˆ F1(ζb +1) u C(u, F2(ζb ))du + Z F1(ζb ) u C(u, F2(ζb ))du + Z ˆ F1(ζb +1) u C(u, ˆF2(ζb )) u C(u, F2(ζb )) du = Z F1(ζb +1) ˆ F1(ζb +1) u C(u, 1)du Z F1(ζb ) u C(u, 1)du Z F1(ζb +1) ˆ F1(ζb +1) u C(u, F2(ζb ))du + Z F1(ζb ) u C(u, F2(ζb ))du + Z ˆ F1(ζb +1) u C(u, ˆF2(ζb )) u C(u, F2(ζb )) du. By the mean value theorem for integrals, there exist constants ub ,1, ub ,2 h min n ˆF1(ζb ), F1(ζb ) o , max n ˆF1(ζb ), F1(ζb ) oi ub +1,1, ub +1,2 h min n ˆF1(ζb +1), F1(ζb +1) o , max n ˆF1(ζb +1), F1(ζb +1) oi (36) = b +1,1 u C(ub +1,1, 1) + b ,1 u C(ub ,1, 1) + b +1,1 u C(ub +1,2, F2(ζb +1)) u C(ub ,2, F2(ζb )) + Z ˆ F1(ζb +1) u C(u, ˆF2(ζb )) u C(u, F2(ζb )) du. (I) = b +1,1 u C(ub +1,1, 1) b ,1 u C(ub ,1, 1) u C(ub +1,2, F2(ζb +1)) + b +1,1 u C(ub ,2, F2(ζb )), (II) = Z ˆ F1(ζb +1) u C(u, ˆF2(ζb )) u C(u, F2(ζb )) du. Survival Analysis via Density Estimation We evaluate the each term. First, we have (I) =( b +1,1 b ,1) u C(ub ,1, 1) + b +1,1 u C(ub +1,1, 1) u C(ub ,1, 1) ( b +1,1 b ,1) u C(ub ,2, F2(ζb )) b +1,1 u C(ub +1,2, F2(ζb +1)) u C(ub ,2, F2(ζb )) =( b +1,1 b ,1) u C(ub ,1, 1) u C(ub ,2, F2(ζb )) + b +1,1 u C(ub +1,1, 1) u C(ub ,1, 1) u C(ub +1,2, F2(ζb +1)) u C(ub ,2, F2(ζb )) . Thus, we obtain |(I)| ( b +1,1 b ,1) u C(ub ,1, 1) u C(ub ,2, F2(ζb )) u C(ub +1,1, 1) u C(ub ,1, 1) b +1,1 u C(ub +1,2, F2(ζb +1)) u C(ub ,2, F2(ζb )) , where we use the triangle inequality. Moreover, we can bound the terms in the above inequality by u C(ub +1,1, 1) u C(ub ,1, 1) = u2 C(u, 1) | {z } L L(ub +1,1 ub ,1) L | b +1,1| + c0 and u C(ub +1,2, F2(ζb +1)) u C(ub ,2, F2(ζb )) u C(ub +1,2, F2(ζb +1)) u C(ub +1,2, F2(ζb )) + u C(ub +1,2, F2(ζb )) u C(ub ,2, F2(ζb )) Z F2(ζb +1) v2 C(ub +1,2, v) | {z } L u2 C(u, F2(ζb )) | {z } L B + | b +1,1| = L | b +1,1| + 2c0 where we use Assumption C.1-(1) for the first term and |ub ,2 ub +1,2| b +1,1 + c0 B for the second term. Moreover, we have Z ˆ F1(ζb +1) Z ˆ F2(ζb ) u v C(u, v) | {z } L | b ,2| 4c0L2 Bℓδ | b ,2|, where we use ϵ c0 B in the last inequality. Then, (35) gives | b +1,1 b ,1| u C(ub ,1, 1) u C(ub ,2, F2(ζb )) ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 Bℓδ | b ,2| + L| b +1,1| | b +1,1| + 3c0 Survival Analysis via Density Estimation Moreover, by the triangle inequality, for a sufficiently large B satisfying L c0 2 , we have u C(ub ,1, 1) u C(ub ,2, F2(ζb )) u C(ub ,2, 1) u C(ub ,2, F2(ζb )) u C(ub ,1, 1) u C(ub ,2, 1) v2 C(ub ,2, v) | {z } ℓ u2 C(u, 1) | {z } L ℓ(1 F2(ζb )) L|ub ,2 ub ,1| ℓδ L c0 where the third inequality follows from F2(ζb) 1 δ and |ub ,2 ub ,1| max n ˆF1(ζb ), F1(ζb ) o min n ˆF1(ζb ), F1(ζb ) o = | b ,k| c0 Then, we have | b +1,1 b ,1| ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 Bℓδ | b ,2| + L| b +1,1| | b +1,1| + 3c0 ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 Bℓδ | b ,2| ℓδ | b +1,1| | b +1,1| + 3c0 We consider two cases (i) b +1,1 c0 B and (ii) b +1,1 > c0 B . If (i) holds, we have | b +1,1 b ,1| 2 ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 Bℓδ | b ,2| Bℓδ | b +1,1|. By using the triangle inequality again, we obtain | b +1,1| | b ,1| + 2 ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 Bℓδ | b ,2| Bℓδ | b +1,1|. Finally, by rearranging the above inequality and using the induction hypothesis, we have | b +1,1| ϵ + 16Lδ 2 b +1 B 4c0L ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 b +1 B 4c0L Survival Analysis via Density Estimation where we use Fk(t) < 1 δ 2 for the last inequality, and hence, | b +1,1| ϵ + 16Lδ 2 b +1 B 4c0L = ϵ + 16Lδ 2 (b +1)+1 B 4c0L which ensures (21) for b = b + 1 and k = 1. Since the same argument holds for k = 2, we obtain the conclusion for the case (i). If (ii) holds, we have | b +1,1 b ,1| 2 ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 Bℓδ | b ,2| ℓδ | b +1,1|2. Then, by using (26), we have | b +1,1|2 B 2(1 τ). This and triangle inequality gives | b +1,1| | b ,1| + 2 ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 Bℓδ | b ,2| ℓδ | b +1,1|2 | b ,1| + 1 8c0L ℓ2 1 ˆF1(ζb ) 1 ˆF2(ζb ) ϵ + L c0 Bℓδ | b ,2| with taking a sufficiently large B. This gives the conclusion for the case (ii). To obtain the bound with respect to W1, we utilize the following lemma: Lemma C.4 (Vallender (1974)). Let µ and ν be probability measures on R whose CDFs are defined by F and G. Then, Z + |F(t) G(t)|dt = W1(µ, ν). Then, we move to the proof of Theorem C.2. Proof of Theorem C.2. Set δ0 in Assumption C.1-(3) by δ0 := 8c0L ℓ . By using Lemma C.3, we obtain W1( ˆFk, Fk) = Z + ˆFk(t) Fk(t) dt ˆFk(t) Fk(t) dt ˆFk(t) Fk(t) dt ˆFk(t) Fk(t) dt | {z } (II) where b0 is defined by Assumption C.1-(3). Then, we bound the each term. We denote b,k := ˆFk(ζb) Fk(ζb) again. By using the bound (25), we have | b,k| ϵ + log B Survival Analysis via Density Estimation Then, the first term can be bounded by ζb 1 max n ˆFk(t) Fk(ζb 1) , ˆFk(t) Fk(ζb) o dt ζb 1 max n | b,k|, c0 | b,k| + c0 B = ϵ + log B B1+τζB. (38) For the second term, since 1 Fk(ζb 0) log 1 B by Assumption C.1-(3), we have ˆFk(t) Fk(t) log 1 B for t ζb0 with sufficiently large B. This implies ζb0 log 1 Bdt B b0 B log B ζB. (39) By substituting the bounds (38) and (39) into (37), we obtain the conclusion. D. Examples of Bounds on Step 1 Error In this section, we present an example that satisfies (18). As mentioned in Sections 3 and C, any estimation method for the probability distribution can be employed, and theoretical results pertaining to those models can be leveraged to guarantee (18). Among the various methods, we introduce results derived from the Distributional Random Forest (DRF) ( Cevid et al., 2022) and histogram type estimators (Sart, 2017). D.1. Distributional Random Forest DRF constructs random forests designed to estimate the conditional distribution of multivariate responses. It achieves this by splitting the data using a distributional metric, specifically the maximal mean discrepancy (MMD), with the goal of maximizing the differences in distributions between child nodes. DRF then estimates targets, such as the CIF in this study, by employing a weight function that reflects how frequently the training data points end up in the same leaf as the test point across different trees. Cevid et al. (2022) impose the following assumptions: (P1) (Data Sampling.) Instead of the traditional bootstrap sampling with replacement, commonly used in forest-based methods, a subsampling approach is employed. For each tree, a random subset of size sn is selected from n training data points. It is assumed that sn approaches infinity as n increases, with the rate specified below. (P2) (Honesty.) The data used to construct each tree is split into two parts: one part is used for determining the splits, and the other is used for populating the leaves and thus for estimating the response. (P3) (α-Regularity.) Each split leaves at least a fraction 0 < α 0.2 of the available training sample on each side. Additionally, trees are grown until each leaf contains between κ and 2κ 1 observations, where κ N is a fixed tuning parameter. (P4) (Symmetry.) The (randomized) output of a tree does not depend on the ordering of the training samples. (P5) (Random-Split.) At every split point, the probability that the split occurs along the feature Xj is bounded below by π/p, for some π > 0 and for all j = 1, . . . , p. Survival Analysis via Density Estimation Note that each of these conditions can be verified by inspecting the constructed forest. The following proposition is direct consequence from Corollary 5 of ( Cevid et al., 2022). Proposition D.1. Under the assumptions (P1)-(P5), it holds that ˆrb,k|x p rb,k|x for any b and k as the sample size goes to infinity2. The proposition above ensures the probabilistic convergence of the estimation. Specifically, for arbitrary values of ϵ > 0 and δ > 0, there exists a sample size threshold nϵ,δ > 0 such that if the sample size exceeds nϵ,δ, then (18) holds with a probability of at least 1 δ. D.2. Histogram Type Estimators Sart (2017) proposes histogram type conditional density estimators on X Y by c Pr(y|x) := ˆs(x, y) = X Pn i=1 1K(xi, yi) Pn i=1(δxi µ)(K)1K(x, y), where m is a partition of X Y, µ is a reference measure of the conditional density, and δx is the Dirac measure at x X. In the context of survival analysis, we can choose Y = [0, 1]K equipped with the Lebesgue measure µ and m as a set of regions defined by (ζb 1 < tk δb, δ = k) for b [B], k [K]. Let ν be a measure defined on X and h(f, g) = Z g(x, y) 2 dν(x) dµ(y) be the Hellinger distance. Then, Sart (2017) provides the following result: Proposition D.2 (Proposition 2.6 of (Sart, 2017)). Let s be a true conditional density. Then, there exists global constants C1, C2 > 0 such that for any ξ > 0, Pr h2(s, ˆs) inf v Vmh2(s, v) + C1 |m| n + C2ξ 1 e nξ, K m a K1K, K m, a K 0 See (Sart, 2017) for specific examples of deriving the term inf v Vmh2(s, v) under the conditions on s and X. The bound on the Hellinger distance implies the bound on the total variation distance TV(f, g) = R X Y|f(x, y) g(x, y)| dν(x) dµ(y) as TV(f, g) h(f, g), which follows from the inequality between the L1-norm and the L2-norm. Thus, by utilizing Proposition D.2, we obtain (18) by taking ϵ as the total variation distance between ˆr and r. E. Upper and Lower Bounds Based on Cumulative Incidence Function If the CIF Vk(t|x) (as defined in (4)) is available, the upper and lower bounds of Fk(t|x) can be easily computed by Pr(T t, δ = k|x) Fk(t|x) Pr(T t|x) Vk(t|x) Fk(t|x) X 2 p denotes the probability convergence. Survival Analysis via Density Estimation Table 2. Real datasets used in our experiments Name K N # categorical # numuerical censored max. time data DIVAT1 2 5943 3 2 83.6% 6225 oldmort 2 6495 5 2 69.7% 20 Dialysis 2 6805 2 2 76.4% 44 flchain 2 7874 4 6 72.5% 5215 support2 2 9105 11 24 31.9% 2029 prostate Survival 2 14294 3 0 94.4% 119 PBC 3 312 5 12 45.8% 15 Framingham 3 4434 10 9 56.2% 8767 By averaging over x X, we can derive the same upper and lower bounds as in (Peterson, 1976): E x X[Vk(t|x)] Fk(t) E x X Given a dataset D = {(x(i), t(i), δ(i))}N i=1, we can compute the empirical estimates of the expectations in (40) as follows: E x X[Vk(t|x)] 1 (x(i),t(i),δ(i)) D 1t(i) t,δ(i)=k (x(i),t(i),δ(i)) D 1t(i) t. Note that these values are equivalent to empirical CDFs, and they may not correspond to the actual bounds if the number of data points N is insufficient. In such cases, the confidence intervals of these empirical CDFs should also be computed using methods such as Greenwood s method (1926). F. Additional Experiments Datasets. We used eight datasets, summarized in Table 2, where N denotes the number of data points, and the fourth and fifth columns indicate the numbers of categorical and numerical features in the feature vectors, respectively. The six datasets with K = 2 were obtained from the Python package Surv Set (Drysdale, 2022). The Framingham (Kannel & Mc Gee, 1979) and PBC (Therneau & Grambsch, 2000) datasets with K = 3 were ones used in (Jeanselme et al., 2023). All datasets were randomly split into training (65%), validation (15%), and testing (20%) sets. The results reported in this section are the mean and standard deviation over five random splits. We divided the time horizon [0, tmax] into B 1 evenly spaced boundaries and added an additional time slot to represent times greater than tmax, where tmax is the maximum observed time within the dataset. This setup means that we used B time slots divided by {ζb}B b=0. We set B = 32 unless otherwise stated. Models and hyperparameters. We used a multi-layer perceptron (MLP) with three hidden layers as a neural network model. The dropout layer was employed with a dropout rate of 0.5, and the Re LU function was utilized as the activation layer. The softmax function served as the output layer. The neural network was trained using the Adam WSchedule Free optimizer (Defazio et al., 2024) with early stopping. For each dataset, we performed a hyperparameter search to determine the number of neurons in the hidden layers and the learning rate of the optimizer: the number of neurons was chosen from the set {4, 8, 16, 32, 64, 128, 256}, and the learning rate was chosen from the set {0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0}. For TS-RF and TS-DRF models, hyperparameter searches were performed on these parameters: n estimators was chosen from the integers between 100 and 1000, max depth was chosen from the integers between 10 and 50, min samples split was chosen from the integers between 2 and 64, min samples leaf was chosen from the integers between 1 and 32, max features was chosen from sqrt or log2, criterion was chosen from log loss, Survival Analysis via Density Estimation gini, or entropy, splitting rule was chosen from CART or Fourier MMD, num features was chosen from the integers between 1 and 100, sample fraction was chosen from the numbers between 0.1 and 0.5, min node size was chosen from the integers between 1 and 10, and alpha was chosen from the numbers between 0.01 and 0.3. For TS-LGB model, hyperparameter search was performed on these parameters: n estimators was chosen from the integers between 30 and 300, max depth was chosen from the integers between 1 and 64, min child samples was chosen from the integers between 1 and 32, num leaves was chosen from the integers between 2 and 64, criterion was chosen from log loss, gini, or entropy, learning rate was chosen from the numbers between 0.001 and 0.01, lambda l1 was chosen from the numbers between 10 8 and 1.0, and lambda l2 was chosen from the numbers between 10 8. Prediction performance on the other datasets. To complement the results in Sec. 6 for the Dialysis and oldmort datasets, we conducted experiments using the data DIVAT1, flchain, prostate Survival, and support datasets with K = 2. The outcomes of these experiments are presented in Fig. 6. Unlike the Dialysis and oldmort datasets, no single model consistently outperformed the others across different metrics and datasets. We also evaluated the prediction performance using the Framingham and PBC datasets with K = 3. As baseline methods, we compared our models with those utilized in Jeanselme et al. (2023). Specifically, we compared against: Deep Hit (Lee et al., 2018), Deep Survival Machines (DSM) (Nagpal et al., 2021), De Surv (Danks & Yau, 2022), and Neural Fine-Gray (Neural FG) (Jeanselme et al., 2023), which is a neural network model extending the Fine-Gray model (Fine & Gray, 1999). For these models, we used the implementations available at https://github.com/Jeanselme/Neural Fine Gray/ under MIT license, and performed hyperparameter searches based on the guidelines provided in the source code. We compared our models with Deep Hit, DSM, De Surv, and Neural FG models using the independence copula, and the results are displayed in Fig. 7. These results demonstrate that our two-step algorithm is competitive with these baseline models. Ablation study on hyperparameter B. We conducted an ablation study on the hyperparameter B in our two-step algorithm. This study aimed to evaluate the prediction performance on the cen-log metric using the TS-LGB model with the parameters w1 {0, 0.5, 1} and B {4, 8, 16, 32, 64}, where w1 is the parameter for the primary event of interest. In this study, we also implemented the method proposed in (Carri ere, 1995), which is labeled as middle . Figure 8 shows the results, where the prediction performances are normalized relative to those with w1 = 0.5 for each B. The results indicate that prediction performances varied significantly depending on the choice of the parameter w1 on several datasets when B was small. However, these differences diminished for B 32. Regarding the method proposed in (Carri ere, 1995), while it is valid only if B in theory, somewhat surprisingly, the results demonstrate that it performed comparable to our algorithm even for small B in practice. Upper and lower bounds estimation. We illustrate survival functions along with bounds for the six datasets with K = 2 in Fig. 9. The six graphs on the left depict average survival functions. In these graphs, we employed the Kaplan-Meier (KM) estimator (1958) and the copula-graphic (CG) estimator (Zheng & Klein, 1995) using the Frank copula (2) with parameters θ = 5 and θ = 5. Recall that the CG estimator is a generalization of the KM estimator. The shaded regions indicate the bounds enclosed by the upper and lower limits (Peterson, 1976). The twelve graphs in the center and right display individual survival functions. To estimate these functions, we used the TS-LGB model combined with the independence copula and the Frank copula with parameters θ = 5 and θ = 5. The shaded regions represent the bounds enclosed by the upper and lower limits given in Inequality (15). The figures displaying the average survival functions indicate that uncertainty due to the unknown copula increases as time progresses. The figures displaying the individual survival functions also show similar uncertainty, but the degree of uncertainty varies by individual. In particular, the right figures for the flchain and support2 datasets showed small uncertainty, even without prior knowledge of the copula. We also see that the estimated uncertainty due to the unknown copula can be narrowed by using the estimated survival functions based on the Frank copula. Survival Analysis via Density Estimation data DIVAT1 0.0 0.2 0.4 0.6 0.8 1.0 Cen-log (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit data DIVAT1 0.000 0.002 0.004 0.006 D-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit data DIVAT1 0.0 0.5 1.0 1.5 2.0 2.5 KM-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit data DIVAT1 5 6 7 8 NLL-SC SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit data DIVAT1 0.6 0.8 1.0 1.2 CJD-Brier SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit data DIVAT1 2.0 2.5 3.0 CJD-Logarithmic SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit data DIVAT1 0.0 0.2 0.4 0.6 0.8 CJD-KS SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit data DIVAT1 0.0 0.5 1.0 1.5 Cen-log (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.000 0.005 0.010 0.015 0.020 0.025 D-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 KM-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 5.0 5.5 6.0 6.5 7.0 7.5 8.0 NLL-SC SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.6 0.8 1.0 1.2 CJD-Brier SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 1.5 2.0 2.5 3.0 CJD-Logarithmic SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.0 0.2 0.4 0.6 0.8 CJD-KS SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit prostate Survival 0.0 0.1 0.2 0.3 0.4 Cen-log (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit prostate Survival 0.000 0.001 0.002 0.003 0.004 0.005 D-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit prostate Survival 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 KM-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit prostate Survival 3.5 4.0 4.5 NLL-SC SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit prostate Survival 0.6 0.8 1.0 1.2 CJD-Brier SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit prostate Survival 2.75 3.00 3.25 3.50 3.75 4.00 CJD-Logarithmic SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit prostate Survival 0.0 0.2 0.4 0.6 0.8 CJD-KS SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit prostate Survival 0.0 0.5 1.0 1.5 Cen-log (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.000 0.005 0.010 0.015 0.020 0.025 D-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.0 0.2 0.4 0.6 0.8 1.0 KM-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 4.5 5.0 5.5 6.0 6.5 7.0 NLL-SC SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.6 0.8 1.0 1.2 CJD-Brier SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 1.75 2.00 2.25 2.50 2.75 3.00 CJD-Logarithmic SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit 0.0 0.2 0.4 0.6 0.8 CJD-KS SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit Figure 6. Prediction performance comparison on data DIVAT1, flchain, prostate Survival, and support2 datasets with various metrics (lower is better). Survival Analysis via Density Estimation 0.00 0.25 0.50 0.75 1.00 1.25 1.50 Cen-log (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0.0000 0.0025 0.0050 0.0075 0.0100 0.0125 D-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0.0 0.5 1.0 1.5 2.0 Cen-log (event 2) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0.00 0.01 0.02 0.03 0.04 0.05 D-calibration (event 2) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 5 6 7 8 9 10 NLL-SC SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0.6 0.8 1.0 1.2 CJD-Brier SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 2 4 6 8 CJD-Logarithmic SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0.0 0.5 1.0 1.5 CJD-KS SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0 1 2 3 Cen-log (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0.00 0.01 0.02 0.03 0.04 0.05 0.06 D-calibration (event 1) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0.0 0.2 0.4 0.6 0.8 1.0 Cen-log (event 2) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0.000 0.005 0.010 0.015 0.020 0.025 0.030 D-calibration (event 2) SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 2 4 6 8 NLL-SC SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0.6 0.8 1.0 1.2 CJD-Brier SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 3 4 5 6 CJD-Logarithmic SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG 0.00 0.25 0.50 0.75 1.00 1.25 1.50 CJD-KS SC-Net TS-Brier TS-RF TS-DRF TS-LGB Deep Hit DSM Neural FG Figure 7. Prediction performance comparison on Framingham and PBC datasets with various metrics (lower is better). 4 8 16 32 64 B Relative Cen-log data DIVAT1 w1=0 w1=0.5 w1=1 middle 4 8 16 32 64 B Relative Cen-log w1=0 w1=0.5 w1=1 middle 4 8 16 32 64 B Relative Cen-log w1=0 w1=0.5 w1=1 middle 4 8 16 32 64 B Relative Cen-log w1=0 w1=0.5 w1=1 middle 4 8 16 32 64 B Relative Cen-log prostate Survival w1=0 w1=0.5 w1=1 middle 4 8 16 32 64 B Relative Cen-log w1=0 w1=0.5 w1=1 middle Figure 8. Ablation study on hyperparameter B on the six datasets (lower is better). Survival Analysis via Density Estimation 0 1000 2000 3000 4000 5000 6000 Time Survival prob. data DIVAT1: Average survival function Copula-Graphic Kaplan-Meier 0 1000 2000 3000 4000 5000 6000 Time Survival prob. data DIVAT1: Individual survival function Frank (theta= 5) Independence 0 1000 2000 3000 4000 5000 6000 Time Survival prob. data DIVAT1: Individual survival function Frank (theta= 5) Independence 0 10 20 30 40 Time Survival prob. Dialysis: Average survival function Copula-Graphic Kaplan-Meier 0 10 20 30 40 Time Survival prob. Dialysis: Individual survival function Frank (theta= 5) Independence 0 10 20 30 40 Time Survival prob. Dialysis: Individual survival function Frank (theta= 5) Independence 0 1000 2000 3000 4000 5000 Time Survival prob. flchain: Average survival function Copula-Graphic Kaplan-Meier 0 1000 2000 3000 4000 5000 Time Survival prob. flchain: Individual survival function Frank (theta= 5) Independence 0 1000 2000 3000 4000 5000 Time Survival prob. flchain: Individual survival function Frank (theta= 5) Independence 0 5 10 15 20 Time Survival prob. oldmort: Average survival function Copula-Graphic Kaplan-Meier 0 5 10 15 20 Time Survival prob. oldmort: Individual survival function Frank (theta= 5) Independence 0 5 10 15 20 Time Survival prob. oldmort: Individual survival function Frank (theta= 5) Independence 0 20 40 60 80 100 Time Survival prob. prostate Survival: Average survival function Copula-Graphic Kaplan-Meier 0 20 40 60 80 100 Time Survival prob. prostate Survival: Individual survival function Frank (theta= 5) Independence 0 20 40 60 80 100 Time Survival prob. prostate Survival: Individual survival function Frank (theta= 5) Independence 0 500 1000 1500 2000 Time Survival prob. support2: Average survival function Copula-Graphic Kaplan-Meier 0 500 1000 1500 2000 Time Survival prob. support2: Individual survival function Frank (theta= 5) Independence 0 500 1000 1500 2000 Time Survival prob. support2: Individual survival function Frank (theta= 5) Independence Figure 9. Estimated survival functions with upper and lower bounds for the six datasets with K = 2. The left graphs show average survival functions, while the graphs in the center and the right show arbitrary chosen individual survival functions. In these graphs, the shaded region corresponds to the upper and lower bounds of survival functions accounting for the uncertainty arising from the lack of knowledge about the copula.