# dynamic_survival_analysis_with_controlled_latent_states__2c5f2e1e.pdf

Dynamic Survival Analysis with Controlled Latent States

Linus Bleistein * 1 2 3 Van-Tuan Nguyen * 1 4 5 Adeline Fermanian 4 Agathe Guilloux 1 2

We consider the task of learning individualspecific intensities of counting processes from a set of static variables and irregularly sampled time series. We introduce a novel modelization approach in which the intensity is the solution to a controlled differential equation. We first design a neural estimator by building on neural controlled differential equations. In a second time, we show that our model can be linearized in the signature space under sufficient regularity conditions, yielding a signature-based estimator which we call Cox Sig. We provide theoretical learning guarantees for both estimators, before showcasing the performance of our models on a vast array of simulated and real-world datasets from finance, predictive maintenance and food supply chain management.

1. Introduction

Time-to-event data is ubiquitous in numerous fields such as meteorology, economics, healthcare and finance. We typically want to predict when an event - which can be a catastrophic earthquake, the burst of a housing bubble, the onset of a disease or a financial crash - will occur by using some prior historical information (Ogata, 1988; Bacry et al., 2015; Bussy et al., 2019). This general problem encompasses many settings and in particular survival analysis, where every individual experiences at most one event (Cox, 1972).

For an individual i, we have access to several event times T i 1 < T i 2 < . . . and features Wi Rs measured at time

*Equal contribution 1Inria Paris, F-75015 Paris, France 2Centre de Recherche des Cordeliers, INSERM, Universit e de Paris, Sorbonne Universit e, F-75006 Paris, France 3La MME, UEVE and UMR 8071, Paris Saclay University, F-91042, Evry, France 4LOPF, Califrais Machine Learning Lab, Paris, France 5Laboratoire de Probabilit es, Statistique et Mod elisation, LPSM, Univ. Paris Cit e, F-75005, Paris, France. Correspondence to: Linus Bleistein <linus.bleistein@inria.fr>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

0. For instance, in neurology, one might consider the onset times of a series of seizures (Rasheed et al., 2020) and Wi summarizes unchanging characteristics of the individual (age, gender, ethnicity, ...). The physician s goal is to determine whether an individual has a high probability to experience a seizure at time t given their characteristics. Such a task is most often addressed by modelling the individual specific intensity of a counting process of the form P

j 1 1T i j t, using for instance Cox models (Cox, 1972; Aalen et al., 2008; Kvamme et al., 2019) or Hawkes processes in the case of self-exciting processes (Bacry et al., 2015). Recent advances in the field have also enriched these models using deep architectures (Mei & Eisner, 2017; Kvamme et al., 2019; Omi et al., 2019; Chen et al., 2021; Groha et al., 2020; Shchur et al., 2021; De Brouwer et al., 2022; Tang et al., 2022). Once learnt, the intensity of the process can be used to predict occurrence times of future events or rank individuals based on their relative risks.

Learning with Time-dependent Data. More realistically, in addition to the static features Wi, we often have access to time-dependent features along with their sampling times

Xi =: {(Xi(t1), t1), . . . , (Xi(t K), t K)} Rd K,

where D = {t1, . . . , t K} [0, τ] is a set of measurement times and τ the end of study. Taking again the example of seizure prediction, the time-dependent features may represent some measurements made by a wearable device, as done for instance by Dumanis et al. (2017). Taking both the static and time-dependent information into account is crucial when making predictions. This setting calls for highly flexible models of the intensity which take into account the stream of information carried by the longitudinal features.

From joint models to ODE-based methods. This problem has been tackled by the bio-statistics community, in particular using joint models that concurrently fit parametric models to the trajectory of the longitudinal features and the intensity of the counting process (Ibrahim et al., 2010; Crowther et al., 2013; Proust-Lima et al., 2014; Long & Mills, 2018). Popular implementations include JMBayses (Rizopoulos, 2016). While being highly interpretable, they do not scale to high-dimensional and fre-

Dynamic Survival Analysis with Controlled Latent States

quently measured data, despite some recent algorithmic advances (Hickey et al., 2016; Murray & Philipson, 2022; Rustand et al., 2024) adapted to moderate dimension (up to 5 longitudinal features).

Modern deep methods, that can encode complex and meaningful patterns from complex data in latent states, offer a particularly attractive alternative for this problem. However, the literature bridging the gap between deep learning and survival analysis is scarce. Notably, Lee et al. (2019) tackle this problem by embedding the time-dependent data through a recurrent neural network combined with an attention mechanism. They then use this embedding in a discrete-time setting to maximize the likelihood of dying in a given time-frame conditional on having survived until this time. Moon et al. (2022) combine a probabilistic model with a continuous-time neural network, namely the ODE-RNNS of Rubanova et al. (2019) in a similar setup.

Modelling Time Series with Controlled Latent States. Building on the increasing momentum of differential equation-based methods for learning (Chen et al., 2018; De Brouwer et al., 2019; Rubanova et al., 2019; Chen et al., 2021; Moon et al., 2022; Marion et al., 2022), we propose a novel modelling framework in which the unknown intensity of the counting process is parameterized by a latent state driven by a controlled differential equation (CDE). Formally, we let the unknown intensity of the counting process of individual i depend on their covariates Wi and an unobserved process xi : [0, τ] Rd that is the continuous unobserved counterpart of the time series Xi, i.e., (Xi(t), t) = xi(t) for all t D. We model the intensity (i.e. the instantaneous probability of experiencing an event see Section 2.2) by setting

λi t | Wi, (xi(s))s t = exp zi (t) + β Wi , (1)

where the dynamical latent state zi (t) R is the solution to the CDE

dzi (t) = G zi (t) dxi(t) (2)

with initial condition zi (0) = 0 driven by xi. Here, the vector field G : R Rd and β Rs are both unknown. This means that the latent dynamics are common between individuals, but are driven by individual-specific data, yielding individual-specific intensities. Such a modelling strategy is reminiscent of state space models, which embed times series through linear controlled latent differential equations (Gu et al., 2022; Cirone et al., 2024). Our framework is introduced in more detail later.

Contributions. In an effort to provide scalable and efficient models for event-data analysis, we propose two novel estimators. We first leverage neural CDEs (Kidger et al., 2020), which directly approximate the vector field G with

a neural vector field Gψ. In a second time, following Fermanian et al. (2021) and Bleistein et al. (2023), we propose to linearize the unknown dynamic latent state zi ( ) in the signature space. Informally, this means that at any time t, we have the simplified expression

zi (t) α ,NSN(xi [0,t])

where α ,N is an unknown finite-dimensional vector and SN(xi [0,t]) is a deterministic transformation of the time series xi observed up to time t called the signature transform. Notice that in this form, the vector α ,N does not depend on t and can hence be learned at any observation time. We obtain theoretical guarantees for both models ; for the second model in particular, we state a precise decomposition of the variance and the discretization bias of our estimator, which crucially depends on the coarseness of the sampling grid D. Finally, we benchmark both methods on simulated and real-world datasets from finance, healthcare and digital food retail, in a survival analysis setting. Our signaturebased estimator provides state-of-the-art results.

Summary. Section 2 details our theoretical framework. In Section 3, we state theoretical guarantees for our model. Lastly, we conduct a series of experiments in Section 4 that displays the strong performances of our models against an array of benchmarks. All proofs are given in the appendix. The code is available at https://github. com/Linus Bleistein/signature_survival.

2. Modelling Point Processes with Controlled Latent States

2.1. The Data

In practice, an individual can be censored (for example after dropping out from a study) or cannot experience more than a given number of events. To take this into account, we introduce Y i : [0, τ] {0, 1} the at-risk indicator function, which equals 1 when the individual i is still at risk of experiencing an event. Together with Y i, we define

N i(t) := X

j 1 1T i j t Y i(T i j)

as the stochastic process counting the number of events experienced by individual i up to time t and while Y i(T i j) = 1. Our dataset

Dn := {Xi, Wi, Y i(t), N i(t), 0 t τ}

consists of n i.i.d. historical observations up to time τ. Our setup can be extended to individual-dependent grids (Di)n i=1, but we choose to focus on the former setting for the sake of clarity. The individual specific time series are only observed as long as the individual is at risk. We first make an assumption on the time series.

Dynamic Survival Analysis with Controlled Latent States

Assumption 1. For every individual i = 1, . . . , n, there exists a continuous path of bounded variation xi : [0, τ] Rd satisfying, for all 0 s < t τ, xi 1-var,[s,t] := sup D

xi(tk+1) xi(tk) Lx|t s|

where is the Euclidean norm and the supremum is taken over all finite dissections D = {s = t1 < < t K = t}. The time series Xi is a discretization of xi on the grid D.

Remark that this assumption implies that the paths are Lx Lipschitz. We now state a supplementary assumption on the static features.

Assumption 2. There exists a constant BW > 0 such that for every i = 1, . . . , n, Wi 2 BW.

2.2. Modelling Intensities with Controlled Differential Equations

Intensity of a counting process. We define the individual-specific intensity λi t | Wi, xi [0,t] of the underlying counting process, which we will simply write λi (t) in the following, as

λi (t) := lim h 0+ 1 h E N i(t + h) N i(t) | Fi t

where Fi t is the past information at time t which includes Wi and xi [0,t] (Aalen et al., 2008).

Controlled Dynamics. Controlled differential equations are a theoretical framework that allows to generalize ODEs beyond the non-autonomous regime (Lyons et al., 2007). Recall that a non-autonomous ODE is the solution to

dz(t) = F(z(t), t)dt

with a given initial value z(0) = z0 Rp. Here, the vector field F : Rp [0, + [ Rp depends explicitly on t 0, allowing for time-varying dynamics unlike autonomous ODEs whose dynamics remain unchanged through time. Controlled differential equations can be seen as a generalization of non-autonomous ODEs. They allow for the vector field to depend explicitly on the values of another Rd-valued function x : [0, 1] Rd through

dz(t) = F(z(t), x(t))dt

thus encoding even richer dynamics. Formally, a CDE writes

dz(t) = G z(t) dx(t)

z(0) = z0 Rp

where G is a Rp d-valued vector field. Existence and uniqueness of the solution is ensured under regularity conditions on G and x by the Picard-Lindelh of Theorem (see

Theorem A.1). The following assumption is needed in order to ensure that the function

λi (t) = exp zi (t) + β Wi ,

where the dynamical latent state zi (t) R is the solution to the CDE

dzi (t) = G zi (t) dxi(t)

with initial condition zi (0) = 0 driven by xi is welldefined.

Assumption 3. The vector field G : R Rd defining λi in Equation (2) is LG -Lipschitz; β is such that β 2 Bβ,2, β 1 Bβ,1 and β 0 Bβ,0, where Bβ,2, Bβ,1, Bβ,0 > 0 are constants.

Under these assumptions, the intensity is bounded at all times.

Lemma 2.1 (A bound on the intensity). For every individual i = 1, . . . , n and all t [0, τ], the log intensity log λi (t) is upper bounded by

Bβ,2BW + G (0) op Lxt exp LG Lxt

almost surely.

This is a direct consequence of Lemma 3.3 in Bleistein & Guilloux (2024). Remark that G (0) op < since the vector field is Lipschitz and hence continuous. Remark 2.2. By differentiation, one can see that the intensity itself satisfies a so-called controlled Volterra differential equation (Lin & Yong, 2020). Indeed, differentiating the intensity λi yields the CDE

dλi (s) = λi (s)G (zi (s))dxi(s)

with initial condition λi (0) = exp(β Wi). Note that this CDE is path dependent, i.e., its vector field depends on the path zi : [0, τ] R. Remark 2.3. This model enforces continuity of the intensity: indeed, the solution of a CDE inherits the regularity of its driving path. A possible solution to accommodate discontinuous intensity functions is to add a jump term to the generative CDE, which could then be learnt using neural jump ODEs (Jia & Benson, 2019).

2.3. Neural Controlled Differential Equations

Following the ideas of continuous time models, our first approach to learning the dynamics is to fit a parameterized intensity to this model by setting

λi θ(s) = exp(α zi θ(s) + β Wi),

where zi θ(s) Rp is an embedding of the time series Xi

parameterized by θ Rv and α Rp is a learnable parameter. We propose to use Neural Controlled Differential

Dynamic Survival Analysis with Controlled Latent States

Equations (NCDEs), a powerful tool for embedding irregular time series introduced by Kidger et al. (2020). NCDEs work by first embedding a time series Xi in the space of functions of bounded variation, yielding xi,D : [0, τ] Rd, before defining a representation of the data through

dzθ(t) = Gψ zθ(s) dxi,D(s)

with initial condition zθ(0) = 0. It is common practice to set Gψ : Rp Rp d to be a small feed-forward neural network parameterized by ψ. The learnable parameters of this model are thus θ = (α, ψ, β). In our setup, the embedding must be carefully chosen in order not to leak information from the future observations. Hence natural cubic splines, used in the original paper by Kidger et al. (2020), cannot be used and we resort to the piecewise constant interpolation scheme proposed by Morrill et al. (2021) and defined as xi,D(s) = (Xi(tk), s) for all s [tk, tk+1[. This yields a discretely updated latent state equal to

zi,D θ (tk) = zi,D θ (tk 1) + Gψ(zi,D θ (tk 1)) Xi(tk)

where Xi(tk) = Xi(tk) Xi(tk 1). This architecture has been studied under the name of controlled Res Net because of its resemblance with the popular Res Net (Cirone et al., 2023; Bleistein & Guilloux, 2024).

In order to provide theoretical guarantees, we restrict ourselves to a bounded set of NCDEs i.e. we consider a set of NCDE predictors

Θ1 = {θ Rvs.t. α 2 Bα, ψ Bψ, β 2 Bβ,2}

where the norm on ψ refers to the sum of ℓ2 norms of the weights and biases of the neural vector field Gψ. This restriction is fairly classical in statistical learning theory (Bach, 2021).

2.4. Linearizing CDEs in the Signature Space

The Signature Transform. While neural controlled differential equations allow for great flexibility in representation of the time series, they are difficult to train and require significant computational resources. The signature is a promising and theoretically well-grounded tool from stochastic analysis, that allows for a parameter-free embedding of the time series. Mathematically, the signature coefficient of a function

x : t [0, τ] 7 x(1)(t), . . . , x(d)(t)

associated to a word I = (i1, . . . , ik) {1, . . . , d}k of size k is the function

SI(x[0,t]) := Z

0<u1< <uk<t dx(i1)(u1) . . . dx(ik)(uk)

0.0 0.2 0.4 0.6 0.8 1.0

t SI(x[0, t])

0.0 0.2 0.4 0.6 0.8 1.0 Time

I = (1,1,3)

t SI(x[0, t])

0.0 0.2 0.4 0.6 0.8 1.0 0.020

I = (3,3,3,2)

t SI(x[0, t])

0.0 0.2 0.4 0.6 0.8 1.0

Figure 1: Sample path x(t) of a 3-dimensional fractional Brownian motion on top, and three signature coefficients SI(x[0,t]) associated to different words on the bottom.

which maps [0, τ] to R. The integral is to be understood as the Riemann-Stieltjes integral. While the definition of the signature is technical, it can simply be seen as a feature extraction step. We refer to Figure 1 for an illustration. The truncated signature of order N 1, which we write SN(x[0,t]), is equal to the collection of all signature coefficients associated to words of size k N sorted by lexicographical order. Finally, the infinite signature is the sequence defined through

S(x[0,t]) = lim N + SN(x[0,t]).

Learning with Signatures. Signatures are a prominent tool in stochastic analysis since the pioneering work of Chen (1958) and Lyons et al. (2007). They have recently found successful applications in statistics and machine learning as a feature representation for irregular time series (Kidger et al., 2019; Morrill et al., 2020; Fermanian, 2021; Salvi et al., 2021; Fermanian, 2022; Lyons & Mc Leod, 2022; Bleistein et al., 2023; Horvath et al., 2023) and a tool for analyzing residual neural networks in the infinite depth limit (Fermanian et al., 2021).

Signatures and CDEs. An appealing feature of signatures is their connection to controlled differential equations. Indeed, under sufficient regularity assumptions (Friz & Victoir, 2010; Fermanian et al., 2021; Bleistein et al., 2023; Cirone et al., 2023), the generative CDE (1) can be linearized in the signature space. Informally, this means that there exists a sequence α such that for all t [0, τ] we have

zi (t) = α S(xi [0,t]).

The mathematical definition of α is technical and we refer to Appendix A.4 for a formal statement and a discussion of the regularity assumptions. Hence, under the corresponding regularity conditions, the true intensity for individual i

Dynamic Survival Analysis with Controlled Latent States

λi (t) = exp α S(xi [0,t]) + β Wi .

This motivates the use of the signature-based estimator

λi,D θ (t) := exp α SN(xi,D [0,t]) + β Wi ,

where θ = (α, β) Rq Rs, N 1 is treated as a hyperparameter and xi,D corresponds to the piecewise constant embedding of the observed time series Xi described previously. The integer q = d N 1 1

d 1 is the size of the signature truncated at depth N 1. The superscript in D emphasises the dependence of this estimator on the observation grid D.

Similarly to the NCDE-based estimator, we restrict ourselves to the bounded set of estimators

Θ2 = {θ s.t. α Bα, β Bβ,2}.

2.5. Connections to Cox Models with Time-Varying Covariates

Cox models with time-varying covariates are the classical class of models (Therneau & Grambsch, 2000; Aalen et al., 2008; Zhang et al., 2018), where the individual specific hazard rate has the form λi θ(t) = λ0(t) exp(α Xi(t) + β Wi), where λ0 : [0, τ] R+ is called the baseline hazard.

For signature-based embeddings, recall that we compute the signature of a time-embedded time series Xi = {(Xi(t1), t1), . . . , (Xi(tk), tk)}. In fact, this amounts to

α SN(xi,D [0,t])

| {z } =log λ0(t)

+ α I1Xi(t) + X

I I2 αISI(xi,D [0,t])

| {z } =log of individual specific hazard rate

where αI1 is a subvector of α and I2 QN k=2{1, . . . , d}k. Hence our model can be interpreted as a generalized version of Cox models with time-varying covariates. A similar interpretation holds for NCDEs. We detail this link in Appendix A.6.

3. Theoretical Guarantees

3.1. The Learning Problem

For both models, the parameter θ can be fitted by likelihood maximization by solving

ˆθ arg min θ Θ ℓD n (θ) + pen(θ), (3)

where Θ {Θ1, Θ2} depending on whether one uses signature or NCDE-based embeddings, pen : Θ R+ is a penalty and ℓD n (θ) is equal to the negative log-likelihood of the sample Dn evaluated at θ.

Unless specified other, the following statements hold for both NCDEs and signature-based embeddings (up to different constants given explicitly in the proofs). Following Aalen et al. (2008), the negative log likelihood ℓD n (θ) of the sample writes

0 λi,D θ (s)Y i(s)ds Z τ

0 log λi,D θ (s)d N i(s),

Z λi (s)Y i(s)ds Z log λi (s)d N i(s)

be the true likelihood of the data. Our goal, in this section, is to obtain a bias-variance decomposition of the difference

ℓD n (ˆθ) ℓ n

between the true likelihood and the likelihood of the learnt model.

3.2. A Risk Bound

Theorem 3.1 (Informal Risk Bound for the Signature Model). Consider the signature-based embedding. Let ˆθ be the solution of (3) with pen(θ) = η1 α 1 + η2 β 1. For any N 1, we have with high probability and an appropriate choice of η1, η2 that

ℓD n (ˆθ) ℓ n Discretization bias + Approximation bias

For a formal statement, see Appendix B.4. We make a series of comments on this result.

1. This full risk bound can only be obtained for the signature-based model. It can also be extended to other types of penalty such as Ridge or Group Lasso (see for instance Nardi & Rinaldo (2008)) For NCDEs, we are able to give precise guarantees on the bias following Bleistein & Guilloux (2024), but a precise control of the variance term is out of reach.

2. The discretization bias is proportional to |D| := max i=1,...,K |ti ti 1| and hence vanishes as sampling

gets finer.

Dynamic Survival Analysis with Controlled Latent States

3. The approximation bias crucially depends on the regularity of the unknown tensor field G , and more precisely on the speed of decay of its derivatives, which can be seen as a measure of smoothness of the target function.

4. The regularity assumptions made on G are not necessary to bound the approximation bias of the NCDE model: in this case, this bias term depends on the approximation capacities of the neural tensor field.

5. Remarkably, we obtain classical rates in n 1/2 for the variance term. For signature based methods, fast rates in n 1 are yet to be obtained.

4. Experimental Evaluation

We now focus on the survival analysis setup. We hence let T i be the unique time-of-event, which may eventually be censored, of individual i. i is the censorship indicator, equal to 1 if the individual experiences the event and to 0 otherwise.

4.1. Training Setup

We train on a dataset Dn of the same structure than described in Section 2.1 and learn the parameter ˆθ by solving the optimization problem (3). NCDEs are trained without penalization, while we use a mixture of elastic-net penalties

pen(θ) := η1pen EN(α) + η2pen EN(β)

for training the signature-based model, where pen EN( ) = γ 1 + (1 γ) 2. The hyperparameters (η1, η2, N) are chosen by cross-validation of a mixed metric equal to the difference between the C-index and the Brier score (see below) and we set γ = 0.1. We refer to Appendix C.1 for a detailed description of the training procedures. We evaluate our model s capacity to predict events in [t, t + δt] by leveraging values of the longitudinal features up to t (see Figure 2) through a ranking metric and a calibration metric. This evaluation procedure is standard (Lee et al., 2019).

4.2. Metrics

We compute four metrics using the individual specific survival functions as estimated by our model with parameters θ. At time t + δt for δt > 0 conditional on survival up to time t, and on observation of the longitudinal features up to time t, it is defined as

ri θ(t, δt) = P T i > t + δt | T i > t, (Xi(s)) s t s D , Wi .

We describe its detailed computation in Appendix C.2.

0.0 0.2 0.4 0.6 0.8 1.0 0.2

t SI1(x[0, t])

t SI2(x[0, t])

t SI3(x[0, t])

0.0 0.2 0.4 0.6 0.8 1.0 Time

t SI1(x FF, [0, t])

t SI2(x FF, [0, t])

t SI3(x FF, [0, t])

0.0 0.2 0.4 0.6 0.8 1.0 0.0

||S(x(t)) S(x FF(t))||2

0.0 0.2 0.4 0.6 0.8 1.0 1.0

t increases

Figure 2: On the top, observed time series up to time t in bold colors and true time series in faded colors. When evaluating our models, we fill-forward the last observed value from t on. On the bottom, signatures of the true path (left), of the observed path (center) and difference in ℓ2 norm (right) x F F (t) denotes the filled-forward time series.

Time-dependent Concordance Index. Following Lee et al. (2019), we measure the discriminative power of our models by using a time-dependent concordance index C(t, δt) that captures our models ability to correctly rank individuals on their predicted probability of survival. The concordance index C(t, δt) is then finally computed as

i=1 1ri θ(t,δt)>rj θ(t,δt)1T i>T j, T j [t,t+δt], j=1

i=1 1T i>T j, T j [t,t+δt], j=1 .

This metric captures the capacity of our model to discriminate between j and another individual i through the conditional probability of survival.

Brier Score. While the concordance index is a rankingbased measure, the Brier Score measures the accuracy in predictions by comparing the estimated survival function and the survival indicator function (Lee et al., 2019; Kvamme et al., 2019; Kvamme & Borgan, 2023). Formally, we define the Brier score BS(t, δt) as

i=1 1T i t+δt, i=1ri θ(t, δt)2

i=1 1T i>t+δt(1 ri θ(t, δt))2.

Contrarily to the C-index, the Brier score is a measure of calibration of the predictions: it measures the distance between the estimated survival function and the indicator function of survival on the interval [t, t + δt].

Averaged performance. Additionally, we evaluate the average prediction performance of our models over a set

Dynamic Survival Analysis with Controlled Latent States

0 2 4 6 8 10

0 2 4 6 8 10 Time

Observed Unobserved

Figure 3: Time series Xi of a randomly picked individual on bottom and unobserved SDE wi(t) on the top. The red star indicates the first hitting time of the threshold value w = 2.5.

of different prediction times. The averaged C-index and Brier score on the interval [t1, t2] along with the window time δt are defined respectively as

t1 C(s, δt)ds and 1 t2 t1

t1 BS(s, δt)ds.

Comparison with static metrics. A crucial difference with static survival analysis metrics is that our metric only compares the individuals who experienced the event in this time window to all the ones who are still at risk at time t. This can lead to a C-index below 0.5 and Brier scores above 0.25 without the model being worse than random.

Additional metrics. We furthermore report AUC and weighted Brier score in Appendix D.

4.3. Methods

We propose three distinct methods. In addition to the signature-based model, which we call Cox Sig, we also consider Cox Sig+ which adds the first value of the time series to the static features. This is motivated by the translation invariance of signatures (see discussion below). Our last method is the NCDE embedding of the longitudinal features. We benchmark our three models against a set of competing methods. All methods are detailed in Appendix C.1.

Time-Independent Cox Model. As a sanity check, we implement a simple Cox model with elastic-net penalty which uses the parameterized intensity λi θ(t) = λ0(t) exp(β Wi) using scikit-survival (P olsterl, 2020). This baseline allows to check whether our proposed methods can make use of the supplementary timedependent information. If no static features are available, we use the first observed value of the time series, i.e., Wi = Xi(0).

Name n d Censoring Avg. Times

Hitting time 500 5 Terminal (3.2%) 177 Tumor Growth 500 2 Terminal (8.4%) 250 Maintenance 200 17 Online (50%) 167 Churn 1043 14 Terminal (38.4%) 25

Table 1: Description of the 4 datasets we consider. The integer d is the dimension of the time series including the time channel. Terminal censoring means that the individuals are censored at the end of the overall observation period [0, τ] if they have not experienced any event. It is opposed to online censoring that can happen at any time in [0, τ]. The reported percentage indicates the censoring level i.e. the share of the population that does not experience the event. The last column reports the average number of observations times over individuals.

Random Survival Forest (RSF). We use RSF (Ishwaran et al., 2008) with static features Wi as the only input. Similarly to our implementation of the Cox model, we use the first value of the time series as static features if no other features are available.

Dynamic Deep Hit (Lee et al., 2019). DDH is a state-ofthe-art method for dynamical survival analysis, that combines an RNN with an attention mechanism and uses both time dependent and static features.

Surv Latent ODE (Moon et al., 2022). SLODE is a recent deep learning framework for survival analysis that leverages an ODE-RNN architecture (Rubanova et al., 2019) to handle the time dependent features.

4.4. Synthetic Experiments

Hitting time of a partially observed SDE. Predicting hitting times is a crucial problem in finance for instance, when pricing so-called catastrophe bounds triggering a payment to the holder in case of an event (Cheridito & Xu, 2015; Corcuera & Valdivia, 2016). Their relation to survival analysis is well documented, see e.g. (Lee & Whitmore, 2006). Building on this problem, we consider the Ornstein-Uhlenbeck SDE

dwi(t) = ω(wi(t) µ)dt +

j=1 dx(i,j)(t) + σd Bi(t)

where d = 5, σ = 1, µ = 0.1 and ω = 0.1 are fixed parameters. xi(t) = (x(i,1)(t), . . . , x(i,d 1)(t)) is a sample path of a fractional Brownian motion with Hurst parameter H = 0.6, and Bi(t) is a Brownian noise term. In this setup, our data consists of Xi which is a downsampled version of xi and the Brownian part is unobserved. Our goal is to predict the first hitting time min{t > 0 | wt w }

Dynamic Survival Analysis with Controlled Latent States

0.2 0.4 0.6 0.8 1.0

Brier Score

Cox Sig Cox Sig+ NCDE Cox

RSF DDH SLODE

Figure 4: Brier score δt 7 BS(t, δt), evaluated at t = 0.23, for the partially observed SDE experiment. Confidence intervals indicate 1 standard deviation.

of a threshold value w = 2.5. We train on n = 500 individuals. Figure 3 shows the sample paths and SDE of a randomly selected individual. This setup is close to a well-specified model since signatures linearize controlled differential equations.

Tumor Growth. We similarly aim at predicting the hitting time of a stochastic process modelling the growth of a tumor (Simeoni et al., 2004), where xi represents a drugintake. In this experiment, the time series Xi is very-low dimensional (d = 2, which includes the time channel).

4.5. Real-World Datasets

Predictive Maintenance. (Saxena et al., 2008) This dataset collects simulations of measurements of sensors placed on aircraft gas turbine engines run until a threshold value is reached. In this context, the time-to-event is the failure time. This dataset features a small sample size, considerable censoring rates and a high number of time channels.

Churn prediction. We use a private dataset provided by Califrais, a food supply chain company that delivers fresh products from Rungis to food professionals. The company has access to a variety of features observed through time for every customer. Its goal is for example to predict when the customer will churn. The time series in this setup are high dimensional but sampled at a low frequency.

Further details on all datasets are provided in Appendix D. Overall, our datasets are diverse in terms of sample size, size and length of the time series and censoring type.

4.6. Results

General performance of Cox Sig. Overall, the signaturebased estimators outperform competing methods. We observe that Cox Sig and Cox Sig+ improve over the strongest

Algorithms Avg. C-Index IBS

Cox Sig 0.857 0.01 0.857 0.01 0.091 0.01 0.091 0.01 Cox Sig+ 0.857 0.01 0.095 0.01 NCDE 0.517 0.04 0.103 0.01 DDH 0.545 0.02 0.094 0.01 SLODE 0.621 0.05 0.253 0.03

Cox Sig 0.696 0.02 0.138 0.01 Cox Sig+ 0.797 0.03 0.137 0.01 NCDE 0.827 0.02 0.130 0.01 0.130 0.01 DDH 0.941 0.05 0.941 0.05 0.133 0.01 SLODE 0.601 0.07 0.136 0.01

Cox Sig 0.858 0.04 0.154 0.03 Cox Sig+ 0.867 0.04 0.867 0.04 0.154 0.03 NCDE 0.541 0.09 0.178 0.04 DDH 0.813 0.06 0.156 0.02 SLODE 0.438 0.14 0.145 0.02 0.145 0.02

Cox Sig 0.741 0.01 0.130 0.01 Cox Sig+ 0.751 0.01 0.751 0.01 0.129 0.01 0.129 0.01 NCDE 0.529 0.05 0.152 0.01 DDH 0.570 0.03 0.139 0.01 SLODE 0.542 0.03 0.193 0.03

Table 2: Averaged value of our metrics for 4 considered dataset over set of 10 different values of t chosen from the 5 to the 50th percentile of the distribution of event times. The values of δt for each dataset is chosen to be the same as that shown in Figure 5.

baselines in terms of Brier scores. Contrarily to the strong baseline DDH, this improvement is consistent over larger prediction windows [t, t+δt] as δt increases (see Figure 4). They provide even stronger improvements in terms of Cindexes (see Figure 5 and Appendix D). This suggests that they are particularly well-tailored for ranking tasks, such as identifying the most-at-risk individual. Including the first observed value of the time series generally improves Cox Sig s performance: this is possibly due to the fact that signatures are invariant by translation (i.e. the signature of x : t 7 x(t) is equal to the signature of x : t 7 x(t) + a), and hence including the first value of the time series provides non-redundant information.

Performance on low-dimensional data. A notable exception is the tumor growth simulation, in which Cox Sig is generally outperformed (see Figures 15 and 16 in the appendix). The competitive performance of signatures for moderate to high dimensional data streams and its below average performance on low dimensional data is a wellstudied feature (see Fermanian (2021) for an empirical study). A possible solution to handle low-dimensional data is to use embeddings before computing signatures to make them more informative (Morrill et al., 2020).

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.23, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 3.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 166.60, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 0.23, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 3.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 166.60, t = 16.20

Figure 5: C-Index (higher is better) on top and Brier score (lower is better) on bottom for hitting time of a partially observed SDE (left), churn prediction (center) and predictive maintenance (right) evaluated at chosen points (t, δt). t is chosen as the first decile of the event times i.e. 90% of the events occur after t. Hollow dots indicate outliers, and error bars indicate 80% of the interquartile range. We report detailed results for numerous points (t, δt) in Appendix D.

NCDEs. On the other side, NCDEs generally tie or perform worse than competing methods. Notably, when considering C-indexes, they even perform worse than random on the predictive maintenance dataset. This stands in stark contrast to their good performances on classification or regression tasks (Kidger et al., 2020; Morrill et al., 2021; Vanderschueren et al., 2023).

Running times. Finally, we observe that our methods run in similar times than DDH, while including crossvalidation (see Figure 13 in the appendix). Models that do not use time dependent features (RSF and Cox) are 2 orders of magnitude faster to train.

5. Conclusion

We have designed and analyzed a model for generic counting processes driven by a controlled latent state, which can be readily estimated using either NCDE or signature-based estimators. Cox Sig in particular offers a parsimonious alternative to deep models and yields excellent performance for survival analysis. Future research efforts will be targeted at extending our model to competing risks and multimodal data.

Limitations. While our model shows competitive performance on moderate to high-dimensional data, one central limitation is its below average performance on low dimensional data. We also stress that the extension to very high dimensional time series is computationally prohibitive since the signature scales exponentially with the dimension of the time series. Finally, our experimental setup is lim-

ited to the survival analysis case: we plan on extending it to general counting processes in future work.

Acknowledgements

For this work, AG has benefited from the support of the National Agency for Research under the France 2030 program with the reference ANR-22-PESN-0016. LB and AG thank Killian Bakong-Epoun e, who performed preliminary analysis of the NCDE model during an internship at Inria. LB, VTN and AF thank the Sorbonne Center for Artificial Intelligence (SCAI) and its team for their support. LB thanks Claire Boyer and G erard Biau for supervising a previous internship that sparked his interest in signatures.

Impact statement

This paper presents work whose goal is to rank individuals or to predict their risk of event by relying on past information. This is a common goal in many diverse fields, including healthcare (which patient is most at risk of dying ?), insurance policy pricing (given a person s history, what is her probability of experiencing an event covered by her policy ?) and human resources management (which employee is most likely to quit ?).

Our proposed method can be used in any of these settings to personalize predictions, and hence target interventions. As any prediction algorithm, it may cause some disadvantaged groups to suffer from biased decisions.

Dynamic Survival Analysis with Controlled Latent States

Aalen, O., Borgan, O., and Gjessing, H. Survival and event history analysis: a process point of view. Springer Science & Business Media, 2008.

Bach, F. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384 414, 2010.

Bach, F. Learning theory from first principles. Draft of a book, version of Sept, 6:2021, 2021.

Bacry, E., Mastromatteo, I., and Muzy, J.-F. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01):1550005, 2015.

Bleistein, L. and Guilloux, A. On the generalization capacities of neural controlled differential equations. In International Conference on Learning Representations, 2024.

Bleistein, L., Fermanian, A., Jannot, A.-S., and Guilloux, A. Learning the dynamics of sparsely observed interacting systems. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 2603 2640. PMLR, 2023.

Boyd, S. P. and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.

Bussy, S., Veil, R., Looten, V., Burgun, A., Ga ıffas, S., Guilloux, A., Ranque, B., and Jannot, A.-S. Comparison of methods for early-readmission prediction in a highdimensional heterogeneous covariates and time-to-event outcome framework. BMC medical research methodology, 19:1 9, 2019.

Chen, K.-T. Integration of paths a faithful representation of paths by non-commutative formal power series. Transactions of the American Mathematical Society, 89: 395 407, 1958.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31, pp. 6572 6583. Curran Associates, Inc., 2018.

Chen, R. T. Q., Amos, B., and Nickel, M. Neural spatiotemporal point processes. In International Conference on Learning Representations, 2021.

Cheridito, P. and Xu, Z. Pricing and hedging cocos. Available at SSRN 2201364, 2015.

Cirone, N. M., Lemercier, M., and Salvi, C. Neural signature kernels as infinite-width-depth-limits of controlled resnets. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, pp. 25358 25425. PMLR, 2023.

Cirone, N. M., Orvieto, A., Walker, B., Salvi, C., and Lyons, T. Theoretical foundations of deep selective statespace models. ar Xiv preprint ar Xiv:2402.19047, 2024.

Corcuera, J. M. and Valdivia, A. Pricing cocos with a market trigger. In Benth, F. E. and Di Nunno, G. (eds.), Stochastics of Environmental and Financial Economics, pp. 179 209. Springer International Publishing, 2016.

Cox, D. R. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187 202, 1972.

Crowther, M. J., Abrams, K. R., and Lambert, P. C. Joint modeling of longitudinal and survival data. The Stata Journal, 13(1):165 184, 2013.

De Brouwer, E., Simm, J., Arany, A., and Moreau, Y. GRUODE-Bayes: Continuous modeling of sporadicallyobserved time series. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32, pp. 7379 7390. Curran Associates, Inc., 2019.

De Brouwer, E., Gonzalez, J., and Hyland, S. Predicting the impact of treatments over time with uncertainty aware neural differential equations. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I. (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151, pp. 4705 4722. PMLR, 2022.

Dumanis, S. B., French, J. A., Bernard, C., Worrell, G. A., and Fureman, B. E. Seizure forecasting from idea to reality. outcomes of the my seizure gauge epilepsy innovation institute workshop. Eneuro, 4(6), 2017.

Fermanian, A. Embedding and learning with signatures. Computational Statistics & Data Analysis, 157:107148, 2021.

Fermanian, A. Functional linear regression with truncated signatures. Journal of Multivariate Analysis, 192: 105031, 2022.

Fermanian, A., Marion, P., Vert, J.-P., and Biau, G. Framing RNN as a kernel method: A neural ODE approach. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 3121 3134. Curran Associates, Inc., 2021.

Dynamic Survival Analysis with Controlled Latent States

Friz, P. K. and Victoir, N. B. Multidimensional Stochastic Processes as Rough Paths: Theory and Applications, volume 120 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2010.

Groha, S., Schmon, S. M., and Gusev, A. A general framework for survival analysis and multi-state modelling. ar Xiv preprint ar Xiv:2006.04893, 2020.

Gu, A., Goel, K., and R e, C. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022.

Hickey, G. L., Philipson, P., Jorgensen, A., and Kolamunnage-Dona, R. Joint modelling of time-to-event and multivariate longitudinal outcomes: recent developments and issues. BMC medical research methodology, 16:1 15, 2016.

Horvath, B., Lemercier, M., Liu, C., Lyons, T., and Salvi, C. Optimal stopping via distribution regression: a higher rank signature approach. ar Xiv preprint ar Xiv:2304.01479, 2023.

Ibrahim, J. G., Chu, H., and Chen, L. M. Basic concepts and methods for joint models of longitudinal and survival data. Journal of Clinical Oncology, 28(16):2796, 2010.

Ishwaran, H., Kogalur, U. B., Blackstone, E. H., and Lauer, M. S. Random survival forests. The Annals of Applied Statistics, 2(3):841 860, 2008.

Jia, J. and Benson, A. R. Neural jump stochastic differential equations. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32, 2019.

Kidger, P. and Lyons, T. Signatory: differentiable computations of the signature and logsignature transforms, on both cpu and gpu. In International Conference on Learning Representations, 2020.

Kidger, P., Bonnier, P., Perez Arribas, I., Salvi, C., and Lyons, T. Deep signature transforms. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Kidger, P., Morrill, J., Foster, J., and Lyons, T. Neural controlled differential equations for irregular time series. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6696 6707. Curran Associates, Inc., 2020.

Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013.

Kvamme, H. and Borgan, Ø. The brier score under administrative censoring: Problems and a solution. Journal of Machine Learning Research, 24(2):1 26, 2023.

Kvamme, H., Ørnulf Borgan, and Scheel, I. Time-toevent prediction with neural networks and cox regression. Journal of Machine Learning Research, 20(129): 1 30, 2019.

Lee, C., Yoon, J., and Van Der Schaar, M. Dynamicdeephit: A deep learning approach for dynamic survival analysis with competing risks based on longitudinal data. IEEE Transactions on Biomedical Engineering, 67(1): 122 133, 2019.

Lee, M.-L. T. and Whitmore, G. A. Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary. 2006.

Lemler, S. Oracle inequalities for the Lasso in the highdimensional Aalen multiplicative intensity model. Annales de l Institut Henri Poincar e, Probabilit es et Statistiques, 52(2):981 1008, 2016.

Lin, P. and Yong, J. Controlled singular volterra integral equations and pontryagin maximum principle. SIAM Journal on Control and Optimization, 58(1):136 164, 2020.

Long, J. D. and Mills, J. A. Joint modeling of multivariate longitudinal data and survival data in several observational studies of huntington s disease. BMC medical research methodology, 18(1):1 15, 2018.

Lyons, T. and Mc Leod, A. D. Signature methods in machine learning. ar Xiv preprint ar Xiv:2206.14674, 2022.

Lyons, T., Caruana, M., and L evy, T. Differential Equations driven by Rough Paths, volume 1908 of Lecture Notes in Mathematics. Springer, Berlin, 2007.

Marion, P., Fermanian, A., Biau, G., and Vert, J.-P. Scaling resnets in the large-depth regime. ar Xiv preprint ar Xiv:2206.06929, 2022.

Mei, H. and Eisner, J. M. The neural hawkes process: A neurally self-modulating multivariate point process. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30, 2017.

Dynamic Survival Analysis with Controlled Latent States

Moon, I., Groha, S., and Gusev, A. Survlatent ode : A neural ode based time-to-event model with competing risks for longitudinal data improves cancer-associated venous thromboembolism (vte) prediction. In Lipton, Z., Ranganath, R., Sendak, M., Sjoding, M., and Yeung, S. (eds.), Proceedings of the 7th Machine Learning for Healthcare Conference, volume 182, pp. 800 827. PMLR, 2022.

Morrill, J., Fermanian, A., Kidger, P., and Lyons, T. A generalised signature method for multivariate time series feature extraction. ar Xiv preprint ar Xiv:2006.00873, 2020.

Morrill, J., Kidger, P., Yang, L., and Lyons, T. Neural controlled differential equations for online prediction tasks. ar Xiv preprint ar Xiv:2106.11028, 2021.

Murray, J. and Philipson, P. A fast approximate em algorithm for joint models of survival and multivariate longitudinal data. Computational Statistics & Data Analysis, 170:107438, 2022.

Nardi, Y. and Rinaldo, A. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics, 0, 01 2008. doi: 10.1214/ 08-EJS200.

Ogata, Y. Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical association, pp. 9 27, 1988.

Omi, T., ueda, n., and Aihara, K. Fully neural network based model for general temporal point processes. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32, 2019.

P olsterl, S. scikit-survival: A library for time-to-event analysis built on top of scikit-learn. Journal of Machine Learning Research, 21(212):1 6, 2020.

Proust-Lima, C., S ene, M., Taylor, J. M., and Jacqmin Gadda, H. Joint latent class models for longitudinal and time-to-event data: a review. Statistical methods in medical research, 23(1):74 90, 2014.

Rasheed, K., Qayyum, A., Qadir, J., Sivathamboo, S., Kwan, P., Kuhlmann, L., O Brien, T., and Razi, A. Machine learning for predicting epileptic seizures using eeg signals: A review. IEEE Reviews in Biomedical Engineering, 14:139 155, 2020.

Reizenstein, J. F. and Graham, B. Algorithm 1004: The iisignature library: Efficient calculation of iteratedintegral signatures and log signatures. ACM Transactions on Mathematical Software, 46(1):1 21, 2020.

Rizopoulos, D. The r package jmbayes for fitting joint models for longitudinal and time-to-event data using mcmc. Journal of Statistical Software, 72(7):1 46, 2016.

Rubanova, Y., Chen, R. T. Q., and Duvenaud, D. K. Latent ordinary differential equations for irregularly-sampled time series. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32, pp. 5320 5330. Curran Associates, Inc., 2019.

Rustand, D., Van Niekerk, J., Krainski, E. T., Rue, H., and Proust-Lima, C. Fast and flexible inference for joint models of multivariate longitudinal and survival data using integrated nested laplace approximations. Biostatistics, 25(2):429 448, 2024.

Salvi, C., Lemercier, M., Liu, C., Horvath, B., Damoulas, T., and Lyons, T. Higher order kernel mean embeddings to capture filtrations of stochastic processes. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 16635 16647, 2021.

Saxena, A., Goebel, K., Simon, D., and Eklund, N. Damage propagation modeling for aircraft engine run-tofailure simulation. In 2008 International Conference on Prognostics and Health Management, pp. 1 9, 2008.

Shchur, O., T urkmen, A. C., Januschowski, T., and G unnemann, S. Neural temporal point processes: A review. In Zhou, Z.-H. (ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 4585 4593, 2021.

Shorack, G. R. and Wellner, J. A. Empirical processes with applications to statistics. Society for Industrial and Applied Mathematics, 2009.

Simeoni, M., Magni, P., Cammia, C., De Nicolao, G., Croci, V., Pesenti, E., Germani, M., Poggesi, I., and Rocchetti, M. Predictive pharmacokineticpharmacodynamic modeling of tumor growth kinetics in xenograft models after administration of anticancer agents. Cancer research, 64(3):1094 1101, 2004.

Tang, W., Ma, J., Mei, Q., and Zhu, J. Soden: A scalable continuous-time survival model through ordinary differential equation networks. Journal of Machine Learning Research, 23(34):1 29, 2022.

Therneau, T. M. and Grambsch, P. M. Modeling survival data: extending the Cox model. Springer, New-York, 2000.

Van de Geer, S. Exponential inequalities for martingales, with application to maximum likelihood estimation for counting processes. The Annals of Statistics, pp. 1779 1801, 1995.

Dynamic Survival Analysis with Controlled Latent States

Vanderschueren, T., Curth, A., Verbeke, W., and Van Der Schaar, M. Accounting for informative sampling when learning to forecast treatment outcomes over time. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 34855 34874. PMLR, 2023.

Virmaux, A. and Scaman, K. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31, 2018.

Zhang, Z., Reinikainen, J., Adeleke, K. A., Pieterse, M. E., and Groothuis-Oudshoorn, C. G. Time-varying covariates and coefficients in cox regression models. Annals of Translational Medicine, 6(7), 2018.

Dynamic Survival Analysis with Controlled Latent States

Supplementary Material

A. Supplementary Mathematical Elements

A.1. Supplementary elements on survival analysis

The counting process associated with the observation of T i 1 < T i 2 < . . . is denoted by N i. The observed counting process is t N i(t) = R t 0 Y i(s)d N i(s). The integral against the counting process N i is to be understood is to be understood as a Stieltjes integral, i.e., R t 0 λi (s)d N i(s) = P

Ti t λi (Ti) see Aalen et al. (2008, p.55-56). Its intensity writes λi (t | Wi, (xi(s))s t)Y i(t), which we simply write λi (t)Y i(t) to alleviate notations.

To the observations, we associate the filtration F, with all σ-fields at 0 t τ defined as

i=1,...,n Fi t

where Fi t = σ xi(s), Wi, N i(s), Y i(s), 0 s t . We assume in addition that Y i is Fi-predictable.

Using the Doob-Meyer decomposition of counting processes - see Aalen et al. (2008, p. 52-60) - we have

N i(t) = Z t

0 λi (s)Y i(s)ds + M i(t) (4)

where M i is local square integrable martingale with respect to Fi.

A.2. Picard-Lindelh of Theorem

Theorem A.1. Let x : [0, τ] Rd be a continuous path of bounded variation, and assume that G : Rp Rp d is Lipschitz continuous. Then the CDE dz(t) = G(z(t))dx(t)

with initial condition z0 Rp has a unique solution on [0, τ].

A full proof can be found in Fermanian et al. (2021, Theorem 4). Remark that since in our setting, NCDEs are Lipschitz since typical neural vector fields, such as feed-forward neural networks, are Lipschitz (Virmaux & Scaman, 2018). This ensures that the solutions to NCDEs are well defined.

A.3. Continuity of the Flow of CDEs

We state a result on the continuity of the flow adapted from Bleistein & Guilloux (2024), Theorem B.5. Theorem A.2. Let F, G : Rp Rp d be two Lipschitz vector fields with Lipschitz constants LF, LG > 0. Let x, r : [0, τ] Rd be either continuous or piecewise constant paths of total variations bounded by Lx and Lr. Consider the controlled differential equations

dw(t) = F(w(t))dx(t) and dv(t) = G(v(t))dr(t)

with initial conditions w(0) = v(0) = 0 respectively. It then follows that for any t [0, τ]

x r ,[0,t] 1 + LFLr K + max v Ω F(v) G(v) op Lr

LF F(0) op Lx exp(LFLx) + F(0) op

Ω= u Rp | u ( G(0) op Lr) exp(LGLr) .

Dynamic Survival Analysis with Controlled Latent States

A.4. Linearization in the Signature Space

A.4.1. GENERAL RESULT

In this section, we give additional details on the linearization of CDEs in the signature space. We first define the differential product.

Definition A.3. Let F, G : Rp Rp be two C vector fields and let J( ) be the Jacobian matrix. Their differential product F G : Rp Rp is the smooth vector field defined for every h Rp by

G hj (h)Fj(h) = J(G)(h)F(h).

We now consider a tensor field F : Rp Rp d which we write

| . . . | F 1 . . . F d

where for every 1 i d, F i : Rp Rp, and we define

Γk(F) := sup h M, i1 ik d

F i1 F ik(h) 2 .

Consider the solution z : [0, τ] Rp to the CDE

dz(t) = F(z(t))dx(t) (5)

z(0) = 0 Rp

where x : [0, τ] Rd is a continuous path of finite total variation bounded by Lxτ > 0. We recall the following result from Fermanian et al. (2021), Proposition 4.

Proposition A.4 (Fermanian et al. (2021), Proposition 4.). We have

zi (t) α ,NSN(x[0,t]) (d Lxt)N+1

(N + 1)! ΓN+1(F)

As a consequence, we have the following theorem.

Theorem A.5. Let F : Rp Rp d be a C tensor field. If

(N + 1)! ΓN+1(F) 0

as N + , then the solution z to the CDE (5) can be written as

I {1,...,d}k SI(x[0,t])F i1 F ik(0).

A.4.2. APPLICATION TO OUR MODEL

Recall that we have defined our generative model through the CDE

dzi (t) = G (zi (t))dxi(t)

with initial condition zi (0) = 0, where G : R Rp is a LG -Lipschitz vector field. Since in our case, the vector field G maps R to Rd, it can be written as

G (z) = G1 (z) . . . Gd (z), ,

Dynamic Survival Analysis with Controlled Latent States

where for every 1 i d, Gi : R R. In this setup, for 1 i1, i2 d the differential product collapses to

(Gi1 Gi2 )(h) = (Gi2 ) (h) Gi1 (h) R.

For 1 i1, i2, i3 d, it writes

(Gi1 Gi2 Gi3 )(h) = (Gi2 Gi3 ) (h) Gi1 (h)

= (Gi3 ) (h) Gi2 (h) Gi1 (h)

= (Gi3 )(2)(h) Gi2 (h) + (Gi3 ) (h) (Gi2 ) (h) Gi1 (h) R.

One can derive similar expression for 1 i1, . . . , ik d. In line with Theorem A.5, we make the following Assumption on the vector field G .

Assumption 4. The vector field G satisfies

(N + 1)! ΓN+1(G ) 0

We can write the ℓ2 and ℓ1 norms of α ,N as functions of the differential product of G .

Lemma A.6. We have that

k=1 dkΓk(G )2 1/2

k=1 dkΓk(G ).

Proof. Starting with the ℓ2 norm, one has

α ,N 2 = N X

1 i1,i2,...,ik d Gi1 Gik (0)2 1/2

k=1 dk max 1 i1,i2,...,ik d |Gi1 Gik (0)|2 1/2

k=1 dkΓk(G )2 1/2 .

Moving on to the ℓ1 norm, we similarly obtain

α ,N 1 = N X

1 i1,i2,...,ik d |Gi1 Gik (0)|

k=1 dk max 1 i1,i2,...,ik d |Gi1 Gik (0)|

k=1 dkΓk(G ).

Dynamic Survival Analysis with Controlled Latent States

A.5. Signature of a Discretized Path

We recall the following result from Bleistein et al. (2023).

Theorem A.7. Let x : [0, τ] Rd be a path satisfying Assumption 1. Let D = {t1, . . . , t K} [0, τ] be a grid of sampling points, and x D the piecewise constant interpolation of the path x sampled on the grid D. For all α Rq, where q := d N 1

d 1 , we have

|α SN(x[0,t]) SN(x D [0,t]) | c3(N) α |D|,

c3(N) = 2e(Lxt)N 1 1

A.6. The Cox Connection

Signature-based embeddings. Consider a continuous path of bounded variation x : t 7 (x(t), t) Rd. First, remark that for every word of size one I {1, . . . , d}, the signature writes

SI(x[0,t]) = Z

0<u1<t dx(I)(s) = x(I)(t).

Furthermore, for any word I = (d, . . . , d) of size k made only of the letter d, i.e., words that only include the time channel, we have

SI(x[0,t]) = Z

0<u1< <uk<t du1 . . . duk = 1

This shows that for x = xi,D

α SN(xi,D [0,t]) = α I1(1, t, t2, . . . , t N) + α I2Xi(t) + X

I I3 αISI(xi,D [0,t])

where αI1 is the subvector of α collecting all coefficients associated to the words {d}, {d, d}, . . . , {d, . . . , d} containing only the letter d, αI2 is the subvector collecting all coefficients associated to the d 1 words {1}, {2}, . . . , {d 1} of size 1, and αI3 collects the remaining coefficients.

NCDEs. For any N 1, consider the augmented vector field

Gψ(z) = Gψ(z) 0p (N 1) 0(N 1) d I(N 1) (N 1)

R(N 1+p) (N 1+d), z Rp,

and an embedding of the time series Xi of the form xi,D(s) = (Xi(tk), s, s2, . . . , s N) Rd+N 1 for s [tk, tk+1[. The latent state of the NCDE model is now updated as

zi,D θ (tk+1) = zi,D θ (tk) + Gψ( zi,D θ (tk)) Xi(tk+1)

= zi,D θ (tk) +

Gψ(zi,D θ ) Xi(tk+1) tk+1 ... t N k+1

zi,D θ (tk) + Gψ(zi,D θ ) Xi(tk+1) tk+1 ... t N k+1

This proves that in the NCDE based model, the intensity can similarly be written as

α zi,D θ (t) = α I1zi,D θ (t) + α I2(1, t, t2, . . . , t N)

where α = (αI1, αI2), and α1 Rp and α2 RN.

Dynamic Survival Analysis with Controlled Latent States

A.7. Self-concordance

We now state a self-concordance bound, which can be found along with its proof in Bach (2010).

Lemma A.8. Let g : R R be a convex, three times differentiable function such that

|g(3)(x)| Mg(2)(x)

for all x R and for some M 0. Then it follows that

M 2 Φ( Mt) g(t) g(0) tg (0) g(2)(0)

for all t 0, where

Φ : t 7 exp(t) t 1.

A.8. Decomposition of the difference in likelihoods

We first define the empirical KL-divergence between the true and parameterized intensity associated to the sample Dn as

KLn(λ , λD θ ) := 1

0 log λi (s) λi θ(s)λi (s)Y i(s)ds 1

λi (s) λi θ(s) Y i(s)ds.

This definition is classical for intensities of counting processes (Aalen et al., 2008; Lemler, 2016). We now show that minimizing the empirical KL-divergence between the true and the parameterized intensity amounts to minimizing the empirical log likelihood, ignoring a noise term that will be canceled by setting the penalty accordingly.

Proposition A.9. For every θ Θ, the difference in likelihoods ℓD n (θ) ℓ n decomposes as

KLn(λ , λD θ ) 1

Z log λi,D θ (s) λi (s) d M i(s),

where M i : [0, τ] R is a local square integrable martingale.

This proposition is a consequence of the Doob-Meyer decomposition N i(t) = R t 0 λi (s)Y i(s)ds + M i(t) of the counting process (Aalen et al., 2008). We now furthermore define the total variation divergence as

TVn(λ , λD θ ) := 1

0 |λi (s) λi,D θ (s)|Y i(s)ds

and the quadratic log divergence D2 n(λ , λD θ ) as

log λi,D θ (s) log λi (s) 2λi (s)Y i(s)ds.

Proposition A.10. There exist two constants c1, c2 > 0 such that

c1TVn(λ , λD θ )2 KLn(λ , λD θ ) c2D2 n(λ , λD θ ).

More precisely, the constants c1, c2 are functions of Θ, Lx, τ and LG and are given explicitly in Appendix B.2.

This bound is obtained by combining a Pinsker-type inequality (Proposition B.1) and a self-concordance bound (Proposition B.2). It is informative in two ways. First, it shows that minimizing the negative empirical log-likelihood and hence the KL-divergence between the true and parameterized intensity will lead to a minimization of the total variation between the two intensities. Secondly, it shows that the KL-divergence is upper bounded by a term involving the difference of the logarithms of the intensities. We make use of this second bound to obtain a bias-variance decomposition.

Dynamic Survival Analysis with Controlled Latent States

B.1. Proof of Proposition A.9

Proof. Thanks to the Doob-Meyer decomposition of Equation (4), the log-likelihood associated to individual i is

ℓD i (θ) = Z τ

0 λi,D θ (s)Y i(s)ds Z τ

0 log λi,D θ (s)d N i(s)

λi,D θ (s) log λi,D θ (s)λi (s) Y i(s)ds Z τ

0 log λi,D θ (s)d M i(s).

Similarly, the log-likelihood associated to the true intensity λi writes

0 λi (s)Y i(s)ds Z τ

0 log λi (s)d N i(s)

λi (s) log λi (s)λi (s) Y i(s)ds Z τ

0 log λi (s)d M i(s).

Hence, we get

ℓD n (θ) ℓ n

Z h λi,D θ (s) λi (s) λi (s) log λi,D θ (s) λi (s)

i Y i(s)ds 1

Z log λi,D θ (s) λi (s) d M i(s)

= KLn(λ , λD θ ) 1

Z log λi,D θ (s) λi (s) d M i(s).

This concludes the proof.

B.2. Proof of Proposition A.10

To prove Proposition A.10, we essentially combine a Pinsker-type inequality and a self-concordance bound. We prove these two bounds separatly bellow. Combining them yields the double inequality of Proposition A.10. In the following, for all t [0, τ], we let

Λi,D θ (t) = Z t

0 λi,D θ (s)Y i(s)ds

Λi (t) = Z t

0 λi (s)Y i(s)ds

be the cumulative hazard functions.

Proposition B.1 (Pinsker s inequality.). Let λ be the true intensity defined in Equations (1) and (2), and λθ be an intensity parameterized by θ Θ. Under Assumptions 1, 2, and 3, we have that

c1TVn(λ , λD θ )2 KLn(λ , λD θ ),

τ exp( Bβ,2BW) h4

3 exp G (0) op Lxτ exp LG Lxτ + 2

3 exp Bα exp(Lxτ) i 1

for signature-based embeddings and

τ exp( Bβ,2BW) h4

3 exp G (0) op Lxτ exp LG Lxτ + 2

3 exp Gψ(0) op Lxτ exp LGψLxτ) i 1

Dynamic Survival Analysis with Controlled Latent States

Proof. We have

TVn(λ , λD θ ) = 1

0 |λi (s) λi,D θ (s)|Y i(s)ds = 1

λi,D θ (s) 1 λi,D θ (s)Y i(s)ds

λi,D θ (s) 1

λi,D θ (s)Y i(s)ds,

where we have used that |x| =

x2. Note that by definition, λi,D θ (s) = exp zi,D θ (s) + β Wi > 0 for all s [0, τ]: we can thus safely divide by this term. Now, since for all x 0

ξ(x) := x log x x + 1,

one obtains

TVn(λ , λD θ ) 1

λi,D θ (s)Y i(s)ds.

Using the Cauchy-Schwarz inequality yields

TVn(λ , λθ) 1

λi,D θ (s)Y i(s)ds

λi,D θ (s)Y i(s)ds

" 4 3Λi,D θ (τ) + 2

λi (s) log λi (s)

λi,D θ (s) + λi,D θ (s) λi (s) Y i(s)ds

max i=1,...,n

" 4 3Λi,D θ (τ) + 2

KLn(λ , λD θ ).

Taking the square on both sides yields

TVn(λ , λD θ )2 max i=1,...,n

" 4 3Λi,D θ (τ) + 2

KLn(λ , λD θ ).

Bounding the true cumulative hazard function. We have

max i=1,...,n

" 4 3Λi,D θ (τ) + 2

3 max i=1,...,n Λi,D θ (τ) + 2

3 max i=1,...,n Λi (τ).

Moreover, Lemma 2.1 yeilds that, for all s [0, τ],

log λi (s) Bβ,2BW + G (0) op Lxs exp LG Lxs

Bβ,2BW + G (0) op Lxτ exp LG Lxτ .

Hence for all i = 1, . . . , n, it holds that

Λi (τ) = Z τ

0 λi (s)Y i(s)ds τ sup s [0,τ] λi (s) τ exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ .

Dynamic Survival Analysis with Controlled Latent States

Since this last bound does not depend on i, this gives us that

2 3 max i=1,...,n Λi (τ) 2τ

3 exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ .

Similarly, one obtains that

4 3 max i=1,...,n Λi,D θ (τ) 4τ

3 max i=1,...,n sup s [0,τ] λi θ(s).

Signature-based embeddings. For signature-based embeddings, we have

sup s [0,τ] λi θ(s) exp(Bβ,2BW) h exp(Bα exp(Lxτ)) i .

NCDEs. For NCDEs, one obtains

sup s [0,τ] λi θ(s) exp Bβ,2BW) exp h Gψ(0) op Lxτ exp LGψLxτ) i .

Final Bound. Putting everything together, one finally has that

c1TVn(λ , λD θ )2 KLn(λ , λθ),

τ exp( Bβ,2BW) h4

3 exp( G (0) op Lxτ exp LG Lxτ ) + 2

3 exp(Bα exp(Lxτ)) i 1

when using signatures and

τ exp( Bβ,2BW) h4

3 exp( G (0) op Lxτ exp LG Lxτ ) + 2

3 exp h Gψ(0) op Lxτ exp LGψLxτ) ii 1

when using NCDEs.

We also prove the following self-concordance bound, a close result can be found in Lemler (2016).

Proposition B.2 (A self-concordance bound.). Let λ be the true intensity defined in Equations (1) and (2), and λθ be a intensity parameterized by θ Θ. Under Assumptions 1, 2, and 3, it holds that

KLn(λ , λθ) c2D2 n(λ , λD θ ),

c2 := exp M M 1

M := exp(Bβ,2BW) exp G (0) op Lxτ exp LG Lxτ + exp Bα exp(Lxτ)

when using signatures and

M := exp(Bβ,2BW) exp G (0) op Lxτ exp LG Lxτ + exp Gψ(0) op Lxτ exp LGψLxτ)

when using NCDEs.

Dynamic Survival Analysis with Controlled Latent States

Proof. Define

h exp t(log λi,D θ (s) log λi (s)) t log λi,D θ (s) λi (s) 1 i λi (s)Y i(s)ds.

This function satisfies all assumptions needed in Lemma A.8. The function f : t 7 exp(γt) γt 1 is convex for all γ R. Convexity is preserved by integration against a positive function: indeed, if f(t, s) is convex in t for all s and h(s) 0 for all s A, then Z

A f(t, s)h(s)ds

is convex in t (see Boyd & Vandenberghe, 2004, Page 79). The function is also clearly C . Finally, remark that by differentiating the integral, one obtains

h log λi,D θ (s) log λi (s) exp t(log λi,D θ (s) log λi (s)) log λi,D θ (s) λi (s)

i λi (s)Y i(s)ds,

g(2)(t) = 1

log λi,D θ (s) log λi (s) 2 exp t(log λi,D θ (s) log λi (s)) λi (s)Y i(s)ds,

g(3)(t) = 1

log λi,D θ (s) log λi (s) 3 exp t(log λi,D θ (s) log λi (s)) λi (s)Y i(s)ds.

|g(3)(t)| 1

λi,D θ λi (log λi,D θ (s) log λi (s))2 exp t(log λi,D θ (s) log λi (s)) λi (s)Y i(s)ds

for all t R with

M := max i=1,...,n

λi,D θ λi .

Using now Lemma A.8, we get at t = 1

M 2 Φ( M) g(1) g(0) g (0) | {z } =0

g(2)(0) = 1

log λi,D θ (s) log λi (s) 2λi (s)Y i(s)ds

Finally, remark that

g(1) = KLn(λ , λθ),

KLn(λ , λθ) exp M M 1

log λi,D θ (s) log λi (s) 2λi (s)Y i(s)ds.

Turning to the constant M, remark that

M := max i=1,...,n λi θ λi max i=1,...,n λi θ + max i=1,...,n λi . (6)

Dynamic Survival Analysis with Controlled Latent States

Similarly to what is done in the proof of Proposition B.1, we have

max i=1,...,n λi exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ ,

max i=1,...,n

λi,D θ exp(Bβ,2BW) h exp Bα exp(Lxτ) i

when using signatures and

max i=1,...,n

λi,D θ exp Bβ,2BW) exp h Gψ(0) op Lxτ exp LGψLxτ) i

when using NCDEs.

B.3. Formal Statement and Proof of the Risk Decomposition

Proposition B.3. Let λ be the true intensity defined in Equations 12 and λD θ be the intensity parameterized by θ Θ. Under Assumptions 1, 2, 3 and 4, for signature-based embeddings and θ = (α ,N, β ), it holds that

D2 n(λ , λD θ ) 2τ exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ " (d Lx)N+1

(N + 1)! ΛN+1(G )

2N + 3 | {z } Approximation bias

+ 2τ exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ c3(N)2 N X

k=1 dkΓk(G )2|D|2

| {z } Discretization bias

whereas for NCDEs, for θ = (α, ψ, β ), it holds that

D2 n(λ , λD θ ) 2τ exp(2LGψLx) Bα 1 + LGψLx K Lx|D| 2

| {z } Discretization bias

+ max v Ω α Gψ(v) G (v) 2

| {z } Approximation bias

exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ .

Proof. We have the following.

NCDEs. We first consider the case where the individual time series are embedded using NCDEs. Considering a general θ Θ, we then have

D2 n(λ , λD θ ) = 1

α zi,D θ (s) + β Wi zi (s) + β Wi 2λi (s)Y i(s)ds

α zi,D θ (s) zi (s) + (β β ) Wi 2λi (s)Y i(s)ds

α zi,D θ (s) zi (s) 2λi (s)Y i(s)ds + 2

(β β ) Wi 2λi (s)Y i(s)ds,

where the last inequality is obtained using (a + b)2 2a2 + 2b2. Since this is true for all θ Θ, we first chose β = β , hence cancelling the second term. Turning to the remaining term, we have

α zi,D θ (s) zi (s) = α zi,D θ (s) α zi θ(s) + α zi θ(s) zi (s),

Dynamic Survival Analysis with Controlled Latent States

where zi θ(s) is the solution to the CDE driven by continuous unobserved path

dzi θ(t) = Gψ(zi θ(t))dxi(t)

with initial condition zi(0) = 0 R. Using the continuity of the flow (Theorem A.2), one obtains

|α zi,D θ (s) α zi θ(s)| α zi,D θ (s) zi θ(s)

Bα exp(LGψLx) 1 + LGψLx K xi xi,D .

Using the fact that xi xi,D Lx|D|,

one finally obtains

|α zi,D θ (s) α zi θ(s)| Bα exp(LGψLx) 1 + LGψLx K Lx|D|.

Using again the continuity of the flow, one also has that

|α zi θ(s) zi (s)| exp(LGψLx) max v Ω α Gψ(v) G (v)

since α zi θ(s) is the solution to the CDE

dui(t) = α Gψ(ui(t))dxi(t)

with initial condition ui(0) = 0. Putting everything together, one obtains

(α zi,D θ (s) zi (s))2 2 exp(2LGψLx) Bα 1 + LGψLx K Lx|D| 2 + max v Ω α Gψ(v) G (v) 2 .

Plugging this result in the original inequality on the squared log-divergence yields

D2 n(λ , λD θ ) 2τ exp(2LGψLx)

" Bα 1 + LGψLx K Lx|D| 2 + max v Ω α Gψ(v) G (v) 2 #

Bβ,2BW + G (0) op Lxτ exp LG Lxτ !

Signature-based embeddings. We proceed in a similar fashion. We have for any θ Θ that

log λi,D θ (s) log λi (s) 2 = α SN(xi,D [0,s]) + β Wi α S(xi [0,s]) β Wi 2

= α SN(xi,D [0,s]) α S(xi [0,s]) + (β β ) Wi 2 .

Now, we furthermore have

α SN(xi,D [0,s]) α S(xi [0,s]) = α SN(xi,D [0,s]) α ,NSN(xi [0,s]) + α ,NSN(xi [0,s]) α S(xi [0,s]).

In particular, taking α = α ,N, we have

α ,NSN(xi,D [0,s]) α S(xi [0,s]) = α ,N SN(xi,D [0,s]) SN(xi [0,s]) + α ,NSN(xi [0,s]) α S(xi [0,s]).

Using Proposition A.4, one obtains

|α ,NSN(xi [0,s]) α S(xi [0,s])| (d Lxs)N+1

(N + 1)! ΛN+1(G ).

Dynamic Survival Analysis with Controlled Latent States

Additionally, using Theorem A.7, we have α ,N SN(xi,D [0,s]) SN(xi [0,s]) α ,N c3(N)|D|.

We obtain for θ = (α ,N, β ) that

log λi,D θ (s) log λi (s) 2 2 h(d Lxs)N+1

(N + 1)! ΛN+1(G ) i2 + 2 α ,N 2 c3(N)2|D|2

using the fact that (a + b)2 2a2 + 2b2. Finally, integrating yields

D2 n(λ , λD θ ) 1

i=1 exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ 2τ h(d Lx)N+1

(N + 1)! ΛN+1(G ) i2 Z τ

0 s2(N+1)ds

i=1 exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ 2τ α ,N 2 c3(N)2|D|2

2τ exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ " (d Lx)N+1

(N + 1)! ΛN+1(G )

+ 2τ exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ α ,N 2 c3(N)2|D|2.

Using Lemma A.6, we can furthermore simplify the bound to

D2 n(λ , λD θ ) 2τ exp

Bβ,2BW + G (0) op Lxτ exp LG Lxτ !" (d Lx)N+1

(N + 1)! ΓN+1(G )

Bβ,2BW + G (0) op Lxτ exp LG Lxτ !

k=1 dkΓk(G )2|D|2.

B.4. Proof of Theorem 3.1

Theorem B.4 (Formal statement of Theorem 3.1). Consider the signature-based embedding. Let ˆθ = (ˆα, ˆβ) be the solution of (3) with pen(θ) = η1 α 1 + η2 β 1. Under Assumptions 1, 2, 3 and 4, and writing β = (β(1) , . . . , β(s) ), we have for any N 1 and any x > 0 that with probability greater than 1 4e x

ℓD n (ˆθ) ℓ n 2τ exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ " (d Lx)N+1

(N + 1)! ΓN+1(G )

2N + 3 | {z } Approximation bias

+ 2τ exp Bβ,2BW + G (0) op Lxτ exp LG Lxτ c3(N)2 N X

k=1 dkΓk(G )2|D|2

| {z } Discretization bias

2 x + log(Nd N) λ n

k=1 dkΓk(G ) + 4Bβ,0 sup k=1,...,s |β(k) |BW

2τλ (x + log s)

Proof. By optimality of ˆθ, we have for all θ Θ that

ℓD n (ˆθ) + pen(ˆθ) ℓ n ℓD n (θ) + pen(θ) ℓ n

Dynamic Survival Analysis with Controlled Latent States

and hence, using Proposition A.9, one obtains

KLn(λ , λD ˆθ ) KLn(λ , λD θ ) + pen(θ) pen(ˆθ) + 1

Z log λi,D ˆθ (s)

λi,D θ (s) d M i(s).

Using Proposition A.10, the KL-divergence on the right hand side can be bounded by the squared log divergence, yielding

KLn(λ , λD ˆθ ) c2D2 n(λ , λD θ ) + pen(θ) pen(ˆθ) + 1

Z log λi,D ˆθ (s)

λi,D θ (s) d M i(s). (7)

The bias term c2D2 n(λ , λD θ ) can be bounded thanks to Proposition B.3. We shall now derive a bound for the term

pen(θ) pen(ˆθ) + 1

Z log λi,D ˆθ (s)

λi,D θ (s) d M i(s).

Since Equation (7) holds true for all θ Θ, we can set θ = (α ,N, β ). We now study the term

0 log λi,D ˆθ (s)

λi,D θ N (s) d M i(s) = 1

ˆα SN(xi,D [0,s]) + ˆβ Wi α ,NS(xi [0,t]) β Wi d M i(s) (8)

= (ˆα α ,N) 1

0 SN(xi,D [0,s])d M i(s) + (ˆβ β ) 1

i=1 Wi M i(t). (9)

which appears in Proposition A.9 when considering signature based embeddings. We make a repeated use of the following lemmas in our derivation of a bound for this term.

Lemma B.5. Let S[k],j(xi,D [0,t]) R be the signature coefficient associated to the j-th word of the k-th signature layer of a

time series xi,D [0,t] evaluated at time t τ. Then we have

S[k],j(xi,D [0, ]) ,[0,t] = max s [0,t] |S[k],j(xi,D [0,s])| (Lxt)k

Proof. For all s [0, t], we have

|S[k],j(xi,D [0,s])| S[k](xi,D [0,s])

xi,D [0,t] k

where S[k](xi,D [0,s]) refers to the full signature layer of depth k, and the last inequality can be found in Fermanian (2021). Using Assumption 1, we have xi,D [0,t] k

1-var k! (Lxt)k

The following deviation inequality is a direct consequence from the one in Van de Geer (1995) and derives from inequalities for general martingales that can be found in Shorack & Wellner (2009) for instance.

Lemma B.6 (Deviation inequality for a martingale). Let Υ be a locally square integrable martingale .Then, for any x > 0 and t 0, the following holds true for

2v(t)x + B(t)x

3 , Υ(t) v(t), sup s [0,t] | Υ(s)| B(t) 2e x, (10)

where Υ(t) is the predictable variation of Υ and Υ(t) its jump at time t.

Dynamic Survival Analysis with Controlled Latent States

Going back to Equation (8) and decomposing on the signature layers, we can write

(ˆα α ,N) 1

0 SN(xi,D [0,s])d M i(s) =

k=1 (ˆα[k] α ,[k]) 1

0 S[k](xi,D [0,s])d M i(s)

ˆα[k] α ,[k] 1 sup 1 j dk

0 S[k],j(xi,D [0,s])d M i(s) .

Because the martingales M i are independent, and the signature coefficients bounded, the term

0 S[k],j(xi,D [0,s])d M i(s)

is itself a martingale. Moreover, since each M i comes from a counting process via a Doob Meier decomposition, its jumps are bounded by 1 and, at a given time, there is (almost surely) at most one M i that jumps. As a consequence, we get the following bound with jumps bounded by

sup t [0,τ]

n sup i=1,...,n

S[k],j(xi,D [0, ]) ,[0,τ] (Lxτ)k

and quadratic variation given at time t [0, τ] by

χn(t) = D 1

0 S[k],j(xi,D [0,s])d M i(s) E = 1

0 S[k],j(xi,D [0,s])2λi (s)Y i(s)ds

sup i=1,...,n S[k],j(xi,D [0, ]) 2 ,[0,τ] 1 n2

n sup i=1,...,n S[k],j(xi,D [0, ]) 2 ,[0,τ] sup i=1,...,n Λi (t)

L2k x τ 2k+1λ

n(k!)2 , (11)

where we have used Lemma B.5 and the fact that

Λi (t) = Z t

0 λi (s)Y i(s)ds t sup s [0,t] λi (s) τ sup s [0,τ] λi (s) τλ ,

with λ = exp Bβ,2BW) exp G (0) op Lxτ exp LG Lxτ

according to Lemma 2.1. Lemma B.6 now warrants that for any ε > 0 with a probability greater than 1 2e ε

0 S[k],j(xi,D [0,s])d M i(s)

2εL2k x τ 2k+2λ

n(k!)2 + ε(Lxτ)k

3nk! (Lxτ)k

k = argmaxk 1 (Lxτ)k

A double union bound on the signature layers and the signature coefficients within each layer ensures that for any ε > 0 with a probability greater than 1 2e ε

sup 1 k N sup 1 j dk

0 S[k],j(xi,D [0,s])d M i(s) (Lxτ)k

2 ε + log(Nd N) τ 2λ n + ε + log(Nd N)

Dynamic Survival Analysis with Controlled Latent States

As a consequence

(ˆα α ,N) 1

0 SN(xi,D [0,s])d M i(s) ˆα α ,N 1 (Lxτ)k

2 ε + log(Nd N) τ 2λ n + ε + log(Nd N)

for any ε > 0 with a probability greater than 1 2e ε.

We apply the same line of reasoning to

i=1 Wi M i(t) ˆβ β 1 sup 1 m s

i=1 W i m M i(t) .

For each m, the term n P

i=1 W i m M i( ) is a martingale with predictable variation less than

i=1 (W i m)2Λi (t) B2 Wτλ .

Its jumps are bounded by BW, and therefore Lemma B.6 applies. Via an union bound, we deduce that for any ε > 0 and with a probability greater that 1 2e ε

i=1 Wi M i(t) ˆβ β 1

2B2 Wτλ (ε + log s)

n + BW(ε + log s)

Now defining the penalty

pen(θ) = α 1 (Lxτ)k

2 ε + log(Nd N) τ 2λ n + ε + log(Nd N)

2B2 Wτλ (ε + log s)

n + BW(ε + log s)

pen(θ ,N) pen(ˆθ) + 1

Z log λi,D ˆθ (s)

λi,D θ (s) d M i(s)

2 α ,N 1 (Lxτ)k

2 x + log(Nd N) τ 2λ n + x + log(Nd N)

2B2 Wτλ (x + log s)

n + BW(x + log s)

with a probability greater that 1 4e ε for any ε > 0. For n large enough, we can write

2 α ,N 1 (Lxτ)k

2 ε + log(Nd N) τ 2λ n + ε + log(Nd N)

2τλ (ε + log s)

n + BW(ε + log s)

4 α ,N 1 (Lxτ)k

2 ε + log(Nd N) τ 2λ n + 4 β 1BW

2τλ (ε + log s)

Dynamic Survival Analysis with Controlled Latent States

Finally, using Lemma A.6 and Assumption 3, we can bound the ℓ1 norms of α ,N and β ,N and obtain that for large n, for any ε > 0 we have with probability greater than 1 4e ε that

4 α ,N 1 (Lxτ)k

2 ε + log(Nd N) τ 2λ n + 4 β 1BW

2τλ (ε + log s)

2 ε + log(Nd N) τ 2λ n

k=1 dkΓk(G ) + 4Bβ,0 sup k=1,...,s |β(k) |BW

2τλ (ε + log s)

Dynamic Survival Analysis with Controlled Latent States

C. Algorithmic and Implementation Details

In this Section, we provide extra information about learning algorithms described in the main paper and their hyperparameters optimization by gridsearch.

C.1. Description of Competing Methods

C.1.1. COXSIG AND COXSIG+

Implementation. We use iisignature (Reizenstein & Graham, 2020) to compute signatures. Alternatives for computing signatures include the signatory library (Kidger & Lyons, 2020).

Training. We minimize the penalized negative log-likelihood (defined in 3 in the main paper) using a vanilla proximal point algorithm (Boyd & Vandenberghe, 2004).

Hyperparameters. The initial learning rate of the proximal gradient algorithm is set to e 3 and the learning rate for each iteration is chosen by back tracking linesearch method (Boyd & Vandenberghe, 2004). The hyperparameters of penalization strength (η1, η2) and truncation depth N are chosen by 1-fold cross-validation of a mixed metric equal to the difference between the C-index and the Brier score. We select the best hyperparameters that minimize the average of this mixed metric on the validation set. We list the hyperparameters search space of this algorithm below.

η1: {1, e 1, e 2, e 3, e 4, e 5};

η2: {1, e 1, e 2, e 3, e 4, e 5};

N: {2, 3}. Larger values were considered in the beginning of experiments but were removed from the cross-validation grid because they yielded bad performance and numerical instabilities.

C.1.2. NCDE

Implementation. We implement the fill-forward discrete update of NCDEs in Pytorch.

Structure. The neural vector field is a feed-forward network composed of two fully connected hidden layers whose hidden dimension is set to 128. We choose to represent the latent state in 4 dimensions the number of nodes in the input layer is therefore set to 4. The dimension of the output layer is equal to the multiplication of the dimension of the hidden layer (128) and the dimension of the sample paths of a given data set. tanh is set to be the activation function for all the nodes in the network.

Training. The model was trained for 50 epochs using the Adam optimizer (Kingma & Ba, 2015) with a batch size of 32 and cross-validated learning rate set to e 4.

C.1.3. COX MODEL

Implementation and Training. We use a classical Cox model with elastic-net penalty as a baseline, which is given either the first measured value of the individual time series or the static features if they are available. The intensity of this model has then the form λi θ(t) = λ0(t) exp(β Wi),

where Wi = Xi(0) if no static features are available. We use the implementation provided in the Python package scikit-survival and called Coxnet Survival Analysis (P olsterl, 2020).

Hyperparameters. The Elastic Net mixing parameter γ is set to 0.1. The hyperparameter of penalization strength η is chosen by cross-validation as described above. We crossvalidate over the set {1, e 1, e 2, e 3, e 4, e 5} to select the best value.

C.1.4. RANDOM SURVIVAL FOREST

Implementation. We use the implementation of RSF (Ishwaran et al., 2008) provided in the Python package scikit-survival (P olsterl, 2020).

Dynamic Survival Analysis with Controlled Latent States

Training. We train this model with static features Wi as the only input. Similarly to our implementation of the Cox model, we use the first value of the time series as static features if no other features are available.

Hyperparameters. We cross-validate two hyperparameters on the following grids.

max features: {None, sqrt};

min samples leaf: {1, 5, 10};

C.1.5. DYNAMIC DEEP-HIT (LEE ET AL., 2019)

DDH is a dynamical survival analysis algorithm that frames dynamical survival analysis as a classification problem. It divides the considered time period [0, τ] into a set of contiguous time intervals. The network is then trained to predict a time interval of event for every subject, which is a multiclass classification task.

Network Architecture. Being adapted to competing events, Dynamic Deep-Hit combines a shared network with a causespecific network. The shared network is a combination of a RNN-like network that processes the longitudinal data and an attention mechanism, which helps the network decide which part of the history of the measurements is important. The cause-specific network is a feed-forward network taking as an input the history of embedded measurements and learning a cause-specific representation. See Figure 6 for a graphical representation of the network s structure.

Figure 6: Network structure of Dynamic Deep Hit. Figure is taken from Lee et al. (2019).

Loss Function. The loss function of DDH is a sum of three loss functions

ℓDynamic Deep Hit = ℓlog-likelihood + ℓranking + ℓprediction.

The first loss maximizes the conditional likelihood of dying in the interval [tk, tk+1[ given that the individual has survived up to time tk. On a side note, we notice that the claim of Lee et al. (2019) that this loss corresponds to the negative log-likelihood of the joint distribution of the first hitting time and corresponding event considering the right-censoring of the data is hence inexact. This might explain the results observed in Figure 4: DDH s performance, in terms of Brier score, strongly degrades as δt increases because the model is only trained to predict one step ahead, instead of maximizing the full likelihood.

The second loss favors correct rankings among at risk individuals: an individual experiencing an event at time T i should have a higher risk score at time t < T i than an individual j for which T j > T i.

The third loss is a prediction loss, which measures the difference between the value of the time-dependent features and a prediction of this value made by the shared network. The loss is minimized using Adam (Kingma & Ba, 2015).

Hyperparameters. In our setting, we use the network in its original structure. The learning rate is set to e 4 and the number of epochs to 300.

Dynamic Survival Analysis with Controlled Latent States

C.1.6. SURVLATENT ODE (MOON ET AL., 2022)

Network Architecture. Surv Latent ODE is a variational autoencoder architecture (Kingma & Welling, 2013). The encoder embeds the entire longitudinal features into an initial latent state, and the decoder uses this latent state to drive the latent trajectory and to estimate the distribution of event time. In this framework, the encoder is an ODE-RNN architecture (Rubanova et al., 2019), which handles the longitudinal features sequentially backward in time and outputs the posterior over the initial latent state. The decoder, which is adapted to competing events, consists of an ODE model and cause-specific decoder modules. The latent trajectory derived from the ODE model is shared across cause-specific decoder modules to estimate the cause-specific discrete hazard functions. See Figure 7 for a graphical representation of the network s structure.

Figure 7: Network structure of Surv Latent ODE. Figure taken from Moon et al. (2022).

Loss Function. The loss function is a combination of the log-likelihood and the Kullback-Leibler divergence between the approximate and the true posterior over the initial latent state.

Hyperparameters. In our setting, we use the network in its original structure. The learning rate is set to e 2 and the number of epochs to 15, as in the original paper. The training of this framework cannot use subjects whose last longitudinal measurement time is equal to the event time, which is not the case for our proposed methods as well as other competing methods. In order to avoid this problem, we then stop observing the longitudinal measurement before the time-to-event for a period equal to 80 % of the event time of these subjects when training the model of this framework.

C.2. Computation of the Different Metrics

The following lemma details the computation of the conditional survival function.

Lemma C.1. For any i {1, . . . , n},

ri θ(t, δt) = exp Z t+δt

t λi θ(u, xi,D [0,u t])du ,

where ri θ(t, δt) = P T i > t + δt | T i > t, xi,D [0,t] is the survival function of individual i, as estimated by the model with parameters θ, at time t + δt for δt > 0 conditional on survival up to time t, and on observation of the longitudinal features up to time t, and the notation λi θ(u, xi,D [0,u t]) means that the intensity at time u is computed by using the longitudinal features up to time u t = min(u, t).

Proof. Since Bayes rule gives

ri θ(t, δt) = P T i > t + δt | T i > t, xi,D [0,t], Wi = P T i > t + δt | xi,D [0,t], Wi

P T i > t | xi,D [0,t], Wi ,

Dynamic Survival Analysis with Controlled Latent States

we can compute this score by using the fact that

P T i > t | xi,D [0,t], Wi = exp( Λi,D θ (t)),

where we recall that Λi,D θ (t) is the cumulative hazard function

Λi,D θ (t) := Z t

0 λi,D θ (s)Y i(s)ds.

We refer the reader unfamiliar with survival analysis to Aalen et al. (2008, Chapter 1, p. 6) for a proof of this expression of the survival function. This then yields

ri θ(t, δt) = exp( R t+δt 0 λi θ(u, xi,D [0,u t])du)

exp( R t 0 λi θ(u, xi,D [0,u t])du)

= exp( Z t+δt

t λi θ(u, xi,D [0,u t])du).

Beside the two metrics described in the main paper, we report our results in term of two more metrics namely the weighted Brier Score and the area under the receiver operating characteristic curve (AUC). The details of these metrics are given below.

Weighted Brier Score. The weighted version of the Brier score, which we write WBS(t, δt), is defined as

i=1 1T i t, i=1 ri θ(t, δt))2

ˆG(T i) + 1T i t (1 ri θ(t, δt))2

where ˆG( ) is the probability of censoring weight, estimated by the Kaplan-Meier estimator.

AUC. We define the area under the receiver operating characteristic curve AUC(t, δt) as

j=1 1ri θ(t,δt)>rj θ(t,δt)1T i>t+δt, T j [t,t+δt]wj

i=1 1T i>t+δt)( n P

i=1 1T i [t,t+δt]wi) ,

where wi are inverse probability of censoring weights, estimated by the Kaplan-Meier estimator.

D. Details of Experiments and Datasets

The main characteristics of the datasets used in the paper are summarized in Table 3 and we provide more detailed information of these datasets in subsections below. For the experiments, each dataset is randomly divided into a training set (80%) and test set (20%). Hyperparameter optimization is performed as follows. We split the training set, using 4/5 for training and 1/5 for validation. We then re-fit on the whole training set with the best hyperparameters and report the results on the test set for 10 runs. Note that the performance is evaluated at numerous points (t, δt), where t is set to the 5th, 10th, and 20th percentile of the distribution of event times.

D.1. Hitting Time of a partially observed SDE

Time series. The paths xt = (x(1) t , . . . , x(d 1) t ) are (d 1)-dimensional sample paths of a fractional Brownian motion with Hurst parameter H = 0.6, and Bi(t) is a Brownian noise term. We set d = 5. The paths are sampled at 1000 times over the time interval [0, 10]. All simulations are done using the stochastic package1. The time series Xi are identical, up to observation time, to the ones used for simulations.

1Available at https://github.com/crflynn/stochastic

Dynamic Survival Analysis with Controlled Latent States

Name n d Static Features Censoring Avg. Observation Times Source

Hitting time 500 5 Terminal (3.2%) 177 Simulation Tumor Growth 500 2 Terminal (8.4%) 250 Simeoni et al. (2004) Predictive Maintenance 200 17 Online (50%) 167 Saxena et al. (2008) Churn 1043 14 Terminal (38.4%) 25 Private dataset

Table 3: Description of the 4 datasets we consider. The integer d is the dimension of the time series including the time channel. Terminal censoring means that the individuals are censored at the end of the overall observation period [0, τ] if they have not experienced any event. It is opposed to online censoring that can happen at any time in [0, τ]. The reported percentage indicates the censoring level i.e. the share of the population that does not experience the event.

0 2 4 6 8 10 Time

0 2 4 6 8 10 Time

Figure 8: Full sample path of an individual (left) and distribution of the event times (left) for the partially observed SDE experiment. The surge in events at the terminal time indicates terminal censorship.

Event definition We consider the stochastic differential equation

dwt = ω(wt µ)dt +

i=1 dx(i) t + σd Bt,

where wt is trajectory of each individual with (σ, µ, ω) R3 are fixed parameters. In our experiment, the parameters are chosen to be σ = 1, µ = 0.1 and ω = 0.1. We then define the time-of-event as the time when trajectory cross the threshold w R during the observation period [t0 t N], which is

T = min{t0 t t N | wt w }.

In our experiments, we use the threshold value w = 2.5. The target SDE is simulated using an Euler discretization. We train on n = 500 individuals.

Censorship We censor individuals whose trajectory does not cross the threshold during the observation period. This means that individuals are never censored during the observation period, but only at the end. The simulated censoring level is 3.2%.

Supplementary Figures. Figure 8 provides an example of the full sample path of an individual and the distribution of the event times of the whole population. We add additional results on the test set in Figures 9, 10, 11, 12 and 13.

D.2. Tumor Growth

Time series. Similarly to the partially observed SDE experiment described above, we set d = 2 which includes 1dimensional sample path xt of a fractional Brownian motion with Hurst parameter H = 0.6. The paths are sampled at 1000 times over the time interval [0, 10]. All simulations are done using the stochastic package. The time series Xi

are identical, up to observation time, to the ones used for simulations.

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.17, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.17, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.17, t = 0.17

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.23, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.23, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.23, t = 0.17

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.31, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.31, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.31, t = 0.17

Figure 9: C-Index (higher is better) for hitting time of a partially observed SDE for numerous points (t, δt).

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 0.17, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 0.17, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 0.17, t = 0.17

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 0.23, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 0.23, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 0.23, t = 0.17

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 0.31, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 0.31, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 0.31, t = 0.17

Figure 10: Brier score (lower is better) for hitting time of a partially observed SDE for numerous points (t, δt).

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 0.17, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 0.17, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 0.17, t = 0.17

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 0.23, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 0.23, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 0.23, t = 0.17

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 0.31, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 0.31, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 0.31, t = 0.17

Figure 11: Weighted Brier score (lower is better) for hitting time of a partially observed SDE for numerous points (t, δt).

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.17, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.17, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.17, t = 0.17

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.23, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.23, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.23, t = 0.17

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.31, t = 0.06

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.31, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 0.31, t = 0.17

Figure 12: AUC (higher is better) for hitting time of a partially observed SDE for numerous points (t, δt).

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 100

Running Time (sec.)

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 100

Running Time (sec.)

Figure 13: Running times on the partially observed SDE experiment (log-scale) averaged over 10 runs including cross-validation of the hyperparameters on Cox Sig, Cox Sig+, Cox and RSF (left) and over 1 run without cross-validation of the hyperparameters on Cox Sig, Cox Sig+, Cox and RSF (right).

Event definition. Following Simeoni et al. (2004), we consider the differential equations

du(1) t dt = λ0u(1) t 1 + ( λ0

λ1 wt)Ψ 1/Ψ κ2xtu(1) t

du(2) t dt = κ2xtu(1) t κ1u(2) t

du(3) t dt = κ1(u(2) t u(3) t )

du(4) t dt = κ1(u(3) t u(4) t )

wt = u(1) t + u(2) t + u(3) t + u(4) t ,

where wt is trajectory of each individual with initial status of (u(1) 0 , u(2) 0 , u(3) 0 , u(4) 0 ) = (0.8, 0, 0, 0) and (λ0, λ1, κ1, κ2, Ψ) R5 are fixed parameters. In our experiment, the parameters are chosen to be λ0 = 0.9, λ1 = 0.7, κ1 = 10, κ2 = 0.15 and Ψ = 20. We then define the time-of-event as the time when trajectory cross the threshold w R during the observation period [t0 t N], which is

T = min{t0 t t N | wt w }.

In our experiments, we use the threshold value w = 1.7. The target differential equations are simulated using an Euler discretization. We train on n = 500 individuals.

Censorship. Similarly to the partially observed SDE experiment, we consider terminal censorship: individuals that do not experience the event within the observation period are censored. The censoring level is 8.4%.

Supplementary Figures. Figure 14 provides an example of the full sample path of an individual and the distribution of the event times of the whole population. We add additional results on the test set in Figures 15, 16, 17 and 18.

D.3. Predictive Maintenance

Time series. This dataset describes the degradation of 200 aircraft gas turbine engines, where 22 measurements of sensors and 3 operational settings are recorded each operational cycle until its failure. After removing low-variance features, 16 longitudinal features are selected for training models. The average time length of these features is about 25 cycles. Note that we apply standardization for selected features before training.

Event definition. The times of event are given as-is in the dataset. We refer to Saxena et al. (2008) for a precise description of the data generation.

Dynamic Survival Analysis with Controlled Latent States

0 2 4 6 8 10 Time

0 2 4 6 8 10 Time

Figure 14: Full sample path of an individual (left) and distribution of the event times (left) for the tumor growth experiment. The surge in events at the terminal time indicates terminal censorship.

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.04, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.04, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.04, t = 0.13

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.09, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.09, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.09, t = 0.13

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.17, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.17, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.17, t = 0.13

Figure 15: C-Index (higher is better) for Tumor Growth for numerous points (t, δt).

Censorship. Censorship is given as-in in the dataset. The censoring level of this dataset is 50%, which is a high censorship rate in survival analysis. We refer again to Saxena et al. (2008) for more details.

Supplementary Figures. Figure 19 provides an example of several randomly picked sample paths of an individual and the distribution of the event times of the whole population. We add additional results in Figures 20, 21, 22 and 23.

D.4. Churn Prediction

For this dataset, the amount of details that we can release is limited both because of the sensitive nature of the data and of the anonymity requirements of the reviewing process.

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 1.04, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 1.04, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 1.04, t = 0.13

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 1.09, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 1.09, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 1.09, t = 0.13

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 1.17, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 1.17, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 1.17, t = 0.13

Figure 16: Brier score (lower is better) for Tumor Growth for numerous points (t, δt).

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 1.04, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 1.04, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 1.04, t = 0.13

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 1.09, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 1.09, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 1.09, t = 0.13

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 1.17, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 1.17, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 1.17, t = 0.13

Figure 17: Weighted Brier score (lower is better) for Tumor Growth for numerous points (t, δt).

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.04, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.04, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.04, t = 0.13

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.09, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.09, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.09, t = 0.13

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.17, t = 0.05

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.17, t = 0.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 1.17, t = 0.13

Figure 18: AUC (higher is better) for Tumor Growth for numerous points (t, δt).

0 50 100 150 200 250 300 350 Time

setting1 s2

0 50 100 150 200 250 300 350 Time

Figure 19: Partial sample path of an individual (left) and distribution of the event times (left) for the predictive maintenance experiment. On the left, the time series is filled with the last observed value from the time of the event on.

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 153.90, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 153.90, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 153.90, t = 22.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 166.60, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 166.60, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 166.60, t = 22.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 191.60, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 191.60, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 191.60, t = 22.10

Figure 20: C-Index (higher is better) for predictive maintenance for numerous points (t, δt).

Time series. All longitudinal features have been computed on a temporal window of one week, the raw data corresponding to all product orders placed on the platform from 06-12-2021 to 12-11-2023. For clients who have no order during the week, we fill zero value for all longitudinal measurements this week. After removing features with more than 90 % of missingness, 14 longitudinal features of 1043 clients are selected for the training step. Note that we apply standardization for selected features before training.

Event definition. We consider that a customer has churned if she has no passed any order in the last 4 weeks. If the customer starts ordering again after a churn, we register her as a new customer.

Censorship. Censorship is terminal based on the data collection period (give dates here). Hence any customer that has not churned by 12-11-2023 is censored. In this dataset, 38.4% of the clients are terminally censored.

Supplementary Figures. Figure 24 provides an example of four sample paths of four randomly chosen individuals. We add additional results in Figures 25, 26, 27 and 28.

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 153.90, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 153.90, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 153.90, t = 22.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 166.60, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 166.60, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 166.60, t = 22.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 191.60, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 191.60, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 191.60, t = 22.10

Figure 21: Brier Score (lower is better) for predictive maintenance for numerous points (t, δt).

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 153.90, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 153.90, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 153.90, t = 22.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 166.60, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 166.60, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 166.60, t = 22.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 191.60, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 191.60, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 191.60, t = 22.10

Figure 22: Weighted Brier Score (lower is better) for predictive maintenance for numerous points (t, δt).

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 153.90, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 153.90, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 153.90, t = 22.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 166.60, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 166.60, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 166.60, t = 22.10

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 191.60, t = 9.40

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 191.60, t = 16.20

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 191.60, t = 22.10

Figure 23: AUC (higher is better) for predictive maintenance for numerous points (t, δt).

0 20 40 60 80 Time

Number of items per order

client_id = 0 client_id = 1

client_id = 4 client_id = 6

0 20 40 60 80 Time

Number of order per week

client_id = 0 client_id = 1

client_id = 4 client_id = 6

0 20 40 60 80 Time

Weighted average price

client_id = 0 client_id = 1

client_id = 4 client_id = 6

0 20 40 60 80 Time

Average order value

client_id = 0 client_id = 1

client_id = 4 client_id = 6

Figure 24: Values of 4 different time-dependent features for 4 randomly chosen individuals from the churn prediction dataset. Individual time-to-event and distribution of the event times cannot be displayed to protect consumer and business privacy. A precise description of the different time-dependent features will be provided upon publication.

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 2.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 2.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 2.00, t = 4.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 3.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 3.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 3.00, t = 4.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 4.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 4.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 4.00, t = 4.00

Figure 25: C-Index (higher is better) for churn prediction for numerous points (t, δt).

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 2.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 2.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 2.00, t = 4.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 3.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 3.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 3.00, t = 4.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 4.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 4.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Brier Score

t = 4.00, t = 4.00

Figure 26: Brier score (lower is better) for churn prediction for numerous points (t, δt).

Dynamic Survival Analysis with Controlled Latent States

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 2.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 2.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 2.00, t = 4.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 3.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 3.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 3.00, t = 4.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.00

Weighted Brier Score

t = 4.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 4.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

Weighted Brier Score

t = 4.00, t = 4.00

Figure 27: Weighted Brier score (lower is better) for churn prediction for numerous points (t, δt).

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 2.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 2.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 2.00, t = 4.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 3.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 3.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 3.00, t = 4.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 4.00, t = 1.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 4.00, t = 2.00

Cox Sig Cox Sig+ NCDE Cox RSF DDH SLODE 0.0

t = 4.00, t = 4.00

Figure 28: AUC (higher is better) for churn prediction for numerous points (t, δt).