# Set Functions for Time Series

Max Horn 1,2, Michael Moor 1,2, Christian Bock 1,2, Bastian Rieck 1,2, Karsten Borgwardt 1,2

1 Department of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland. 2 SIB Swiss Institute of Bioinformatics, Switzerland. Correspondence to: Karsten Borgwardt.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Despite the eminent successes of deep neural networks, many architectures are often hard to transfer to irregularly-sampled and asynchronous time series that commonly occur in real-world datasets, especially in healthcare applications. This paper proposes a novel approach for classifying irregularly-sampled time series with unaligned measurements, focusing on high scalability and data efficiency. Our method SeFT (Set Functions for Time Series) is based on recent advances in differentiable set function learning and is extremely parallelizable with a beneficial memory footprint, thus scaling well to large datasets of long time series and online monitoring scenarios. Furthermore, our approach permits quantifying per-observation contributions to the classification outcome. We extensively compare our method with existing algorithms on multiple healthcare time series datasets and demonstrate that it performs competitively whilst significantly reducing runtime.

## 1. Introduction

With the increasing digitalization, measurements over extensive time periods are becoming ubiquitous. Nevertheless, in many application domains, such as healthcare (Yadav et al., 2018), measurements might not necessarily be observed at a regular rate or could be misaligned. Moreover, the presence or absence of a measurement and its observation frequency may carry information of its own (Little & Rubin, 2014), such that imputing the missing values is not always desired.

While some algorithms can be readily applied to datasets with varying lengths, these methods usually assume regular sampling of the data and/or require the measurements across modalities to be aligned/synchronized, preventing their application to the aforementioned settings. By contrast, existing approaches for unaligned measurements, in particular in clinical applications, typically rely on imputation to obtain a regularly-sampled version of a dataset for classification (Desautels et al., 2016; Moor et al., 2019). Learning a suitable imputation scheme, however, requires understanding the underlying dynamics of a system; this task is significantly more complicated and not necessarily required when classification or pattern detection is the main goal. Furthermore, even though a decoupled imputation scheme followed by classification is generally more scalable, it may lose information (in terms of missingness patterns) that could be crucial for prediction tasks. The fact that decoupled schemes perform worse than methods that are trained end-to-end was empirically demonstrated by Li & Marlin (2016). Approaches that jointly optimize both tasks, on the other hand, add a large computational overhead, thus suffering from poor scalability or high memory requirements.

Our method is motivated by the understanding that, while RNNs and similar architectures are well suited for capturing and modelling the dynamics of a time series and thus excel at tasks such as forecasting, retaining the order of an input sequence can even be a disadvantage in some scenarios (Vinyals et al., 2016).
We show that by relaxing the condition that a sequence must be processed in order, we can naturally derive an architecture that directly accounts for (i) irregular sampling and (ii) unsynchronized measurements. Our method, SeFT (Set Functions for Time Series), extends recent advances in set function learning to irregularly-sampled time series classification tasks, yields favourable classification performance, is highly scalable, and improves over current approaches by almost an order of magnitude in terms of runtime. With SeFT, we propose to rephrase the problem of classifying time series as classifying a set of observations. We show how set functions can be used to create classifiers that are applicable to unaligned and irregularly-sampled time series, leading to favourable performance in classification tasks. Next to being highly parallelizable, thus permitting ready extensions to online monitoring setups with thousands of patients, our method also yields importance values for each observation and each modality. This makes it possible to interpret predictions, providing much-needed insights into the decisions made by the model.

## 2. Related Work

This paper focuses on classifying time series with irregular sampling and potentially unaligned measurements. We briefly discuss recent work in this field; all approaches can be broadly grouped into the following three categories.

**Irregular sampling as missing data:** While the problem of supervised classification in the presence of missing data is closely related to irregular sampling of time series, there are some core differences. Missing data is usually defined with respect to a fixed number of features that could be observed, whereas time series can have different lengths and a typical number of observed values might not exist. Generally, an irregularly-sampled time series can be converted into a missing data problem by discretizing the time axis into non-overlapping intervals and declaring intervals in which no data was sampled as missing (Bahadori & Lipton, 2019). This approach is followed by Marlin et al. (2012), who used a Gaussian Mixture Model for semi-supervised clustering on electronic health records. Similarly, Lipton et al. (2016) discretize the time series into intervals, aggregate multiple measurements within an interval, and add missingness indicators to the input of a Recurrent Neural Network. By contrast, Che et al. (2018) present several variants of the Gated Recurrent Unit (GRU) combined with imputation schemes. Most prominently, the GRU model was extended to include a decay term (GRU-D), such that the last observed value is decayed to the empirical mean of the time series via a learnable decay term. While these approaches are applicable to irregularly-sampled data, they either rely on imputation schemes or on empirical global estimates of the data distribution (our method, by contrast, requires neither), without directly exploiting the global structure of the time series.

**Frameworks supporting irregular sampling:** Some frameworks support missing data natively. For example, Lu et al. (2008) directly defined a kernel on irregularly-sampled time series, permitting subsequent classification and regression with kernel-based classifiers or regression schemes. Furthermore, Gaussian Processes (Williams & Rasmussen, 2006) constitute a common probabilistic model for time series; they directly permit modelling of continuous-time data using mean and covariance functions.
Along these lines, Li & Marlin (2015) derived a kernel on Gaussian Process posteriors, allowing the comparison and classification of irregularly-sampled time series using kernel-based classifiers. Nevertheless, all of these approaches still rely on separate tuning/training of the imputation method and the classifier, so that structures supporting the classification could potentially be missed in the imputation step. An emerging line of research employs Hawkes processes (Hawkes, 1971; Liniger, 2009), i.e. a specific class of self-exciting point processes, for time series modelling and forecasting (Mei & Eisner, 2017; Yang et al., 2017; Xiao et al., 2017). While Hawkes processes exhibit extraordinary performance in these domains, there is no standardised way of using them for classification. Previous work (Lukasik et al., 2016) trains multiple Hawkes processes (one for each label) and classifies a time series by assigning it the label that maximises the respective likelihood function. Since this approach does not scale to our datasets, we were unable to perform a fair comparison. We conjecture that further research will be required to make Hawkes processes applicable to general time series classification scenarios.

**End-to-end learning of imputation schemes:** Methods of this type are composed of two modules with separate responsibilities, namely an imputation scheme and a classifier, where both components are trained in a discriminative manner and end-to-end using gradient-based training. Recently, Li & Marlin (2016) proposed the Gaussian Process Adapters (GP Adapters) framework, in which the parameters of a Gaussian Process kernel are trained alongside a classifier. The Gaussian Process gives rise to a fixed-size representation of the irregularly-sampled time series, making it possible to apply any differentiable classification architecture. This approach was further extended to multivariate time series by Futoma et al. (2017) using Multi-task Gaussian Processes (MGPs) (Bonilla et al., 2008), which allow correlations between the imputed channels. Moreover, Futoma et al. (2017) made the approach more compatible with time series of different lengths by applying a Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) classifier. Motivated by the limited scalability of approaches based on GP Adapters, Shukla & Marlin (2019) suggest an alternative imputation scheme, Interpolation-Prediction Networks, which apply multiple semi-parametric interpolation schemes to obtain a regularly-sampled time series representation. The parameters of the interpolation network are trained with the classifier in an end-to-end setup.

## 3. Proposed Method

Our paper focuses on the classification of irregularly-sampled and unaligned time series. We first define the required terms before describing our model.

### 3.1. Notation & Requirements

**Definition 1 (Time series).** We describe a time series of an instance $i$ as a set $S_i$ of $M := |S_i|$ observations $s_j$ such that $S_i := \{s_1, \dots, s_M\}$. We assume that each observation $s_j$ is represented as a tuple $(t_j, z_j, m_j)$, consisting of a time value $t_j \in \mathbb{R}^+$, an observed value $z_j \in \mathbb{R}$, and a modality indicator $m_j \in \{1, \dots, D\}$, where $D$ represents the dimensionality of the time series.
We write $\Omega \subseteq \mathbb{R}^+ \times \mathbb{R} \times \mathbb{N}^+$ to denote the domain of observations. An entire $D$-dimensional time series can thus be represented as

$$S_i := \{(t_1, z_1, m_1), \dots, (t_M, z_M, m_M)\}, \qquad (1)$$

where for notational convenience we omit the index $i$ from individual measurements. We leave this definition deliberately general, in particular allowing the length of each time series to differ, since our models are inherently capable of handling this. Likewise, we neither enforce nor expect all time series to be synchronized, i.e. sampled at the same time points; rather, we are fully agnostic to non-synchronized observations in the sense of not having to observe all modalities at each time point. (We make no assumptions about the time values $t_j$ and merely require them to be positive real-valued numbers, because our time encoding procedure, see below, is symmetric with respect to zero; in practice, positive time values can always be achieved by applying a shift transformation.) We collect all time series and their associated labels in a dataset $\mathcal{D}$.

**Definition 2 (Dataset).** We consider a dataset $\mathcal{D}$ to consist of $N$ time series. Elements of $\mathcal{D}$ are tuples, i.e. $\mathcal{D} := \{(S_1, y_1), \dots, (S_N, y_N)\}$, where $S_i$ denotes the $i$th time series and $y_i \in \{1, \dots, C\}$ its class label.

For an online monitoring scenario, we will slightly modify Definition 2 and only consider subsets of time series that have already been observed. Figure 1 gives a high-level overview of our method, including the individual steps required to perform classification.

Figure 1. Schematic overview of SeFT's architecture (panels from left to right: Multivariate Time Series; Set Encoding; Embedding, Aggregation, and Attention; Classification). The first panel exemplifies a potential input, namely a multivariate time series consisting of 3 modalities $m_1, m_2, m_3$. We treat the $j$th observation as a tuple $(t_j, z_j, m_j)$, comprising a time $t_j$, a value $z_j$, and a modality indicator $m_j$. All observations are summarized as a set of such tuples. The elements of each set are summarized using a set function $f'$. Conditional on this summarized representation and the individual set elements, an attention mechanism (as described in Section 3.3) is applied to learn the importance of the individual observations. The respective query vectors of the two attention heads are illustrated as purple and orange blocks. The results of each attention head are then concatenated and used as the input for the final classification layers.

To get a more intuitive grasp of these definitions, we briefly illustrate our time series notation with an example. Let instance $i$ be an in-hospital patient, while the time series represent measurements of two channels of vital parameters during a hospital stay, namely heart rate (HR) and mean arterial blood pressure (MAP). We enumerate these channels as modalities 1 and 2. Counting from admission time, an HR of 60 and 65 beats per minute (bpm) was measured after 0.5 h and 3.0 h, respectively, whereas MAP values of 80, 85, and 87 mmHg were observed after 0.5 h, 1.7 h, and 2.5 h. According to Definition 1, the time series is thus represented as $S_i := \{(0.5, 60, 1), (3, 65, 1), (0.5, 80, 2), (1.7, 85, 2), (3, 87, 2)\}$. In this example, observations are ordered by modality to increase readability; in practice, we are dealing with unordered sets. This does not imply, however, that we throw away any time information: we encode time values in our model, thus making it possible to maintain the temporal ordering of observations.
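For concreteness, the following minimal sketch shows how such a set of observations could be held in code. It is purely illustrative and not part of the authors' implementation; only numpy is assumed.

```python
import numpy as np

# The example time series from the text, stored as an unordered collection of
# (time, value, modality) tuples; the storage order carries no meaning.
S_i = [
    (0.5, 60.0, 1),   # HR (modality 1) of 60 bpm after 0.5 h
    (3.0, 65.0, 1),   # HR of 65 bpm after 3.0 h
    (0.5, 80.0, 2),   # MAP (modality 2) of 80 mmHg
    (1.7, 85.0, 2),   # MAP of 85 mmHg
    (3.0, 87.0, 2),   # MAP of 87 mmHg
]

# For batched processing, several such sets of different sizes can be padded to a
# common length and masked; here we simply unpack a single instance.
times, values, modalities = map(np.asarray, zip(*S_i))
print(times.shape, values.shape, modalities.shape)   # (5,) (5,) (5,)
```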
Our model, however, does not assume that all observations are stored or processed in the same order; this assumption was already shown to be detrimental with respect to classification performance in some scenarios (Vinyals et al., 2016). Therefore, our model does not employ a sequentialness prior: instead of processing a sequence conditionally on previously-seen elements (as in RNNs or other sequence-based models), it processes all values of a sequence at once through encoding and aggregation steps, while retaining all information about event occurrence times.

In our experiments, we will focus on time series in which certain modalities (channels) are not always observed, i.e. some measurements might be missing. We call such time series non-synchronized.

**Definition 3 (Non-synchronized time series).** A $D$-dimensional time series is non-synchronized if there exists at least one time point $t_j \in \mathbb{R}^+$ at which at least one modality is not observed, i.e. if there is a $t_j \in \mathbb{R}^+$ such that $|\{(t_k, z_k, m_k) \mid t_k = t_j\}| \neq D$.

Furthermore, we require that the measurements of each modality satisfy $t_i \neq t_j$ for $i \neq j$, such that no two measurements of the same modality occur at the same time. This assumption is not required for technical reasons but for consistency; moreover, it permits interpreting the results later on.

**Summary.** To summarize our generic setup, we do not require $M$, the number of observations per time series, to be the same, i.e. $|S_i| \neq |S_j|$ for $i \neq j$ is permitted, nor do we assume that the time points and modalities of the observations are the same across time series. This setting is common in clinical and biomedical time series. Since typical machine learning algorithms are designed to operate on data of a fixed dimension, novel approaches to this non-trivial problem are required.

### 3.2. Our Model

In the following, we describe an approach inspired by differentiable learning of functions that operate on sets (Zaheer et al., 2017; Wagstaff et al., 2019). The following paragraphs provide a brief overview of this domain while describing the building blocks of our model. Specifically, we phrase the problem of classifying time series on irregular grids as learning a function $f$ on a set of arbitrarily many time series observations following Definition 1, i.e. $S = \{(t_1, z_1, m_1), \dots, (t_M, z_M, m_M)\}$, such that $f: \mathcal{S} \to \mathbb{R}^C$, where $\mathcal{S}$ represents a generic time series of arbitrary cardinality and $\mathbb{R}^C$ corresponds to the logits of the $C$ classes in the dataset. As previously discussed, we interpret each time series as an unordered set of measurements, where all information is conserved because the observation time is included for each set element. Specifically, we define $f$ to be a set function, i.e. a function that operates on a set and thus has to be invariant to the ordering of the elements in the set. Multiple architectures are applicable for constructing set functions, such as Transformers (Lee et al., 2019; Vaswani et al., 2017) or Deep Sets (Zaheer et al., 2017). Given its exceptional scalability properties, we base this work on the framework of Zaheer et al. (2017). Intuitively, this amounts to computing multivariate dataset-specific summary statistics, which are optimized to maximize classification performance. Thus, we sum-decompose the set function $f$ into the form

$$f(S) = g\left(\frac{1}{|S|} \sum_{s_j \in S} h(s_j)\right), \qquad (2)$$

where $h: \Omega \to \mathbb{R}^d$ and $g: \mathbb{R}^d \to \mathbb{R}^C$ are neural networks, $d \in \mathbb{N}^+$ determines the dimensionality of the latent representation, and $s_j$ represents a single observation of the time series $S$.
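The sum-decomposition in Equation 2 can be sketched in a few lines of numpy. The two-layer stand-ins for $h$ and $g$ and the chosen dimensions below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, C = 16, 2                       # latent dimensionality and number of classes

W_h = rng.normal(size=(3, d))      # toy stand-in for h: Omega -> R^d (an observation has 3 raw inputs)
W_g = rng.normal(size=(d, C))      # toy stand-in for g: R^d -> R^C

def h(s):
    return np.tanh(np.asarray(s, dtype=float) @ W_h)   # per-observation embedding

def g(e):
    return e @ W_g                                      # aggregated embedding -> class logits

def f(S):
    # Equation 2: embeddings are computed independently (O(|S|)), averaged, then passed to g.
    E = np.stack([h(s) for s in S])
    return g(E.mean(axis=0))

S = [(0.5, 60.0, 1), (3.0, 65.0, 1), (0.5, 80.0, 2)]
print(f(S))        # class logits
print(f(S[::-1]))  # identical (up to float round-off): the set function ignores ordering
```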
We can view the averaged representation $\frac{1}{|S|} \sum_{s_j \in S} h(s_j)$ as a dataset-specific summary statistic that is learned to best distinguish the class labels. Equation 2 also implies the beneficial scalability properties of our approach: each embedding can be calculated independently of the others; hence, the constant computational cost of passing a single observation through the function $h$ is merely scaled by the number of observations, resulting in a runtime of $\mathcal{O}(M)$ for a time series of length $M$.

Recently, Wagstaff et al. (2019) derived requirements for a practical universal function representation of sum-decomposable set functions, i.e. the requirements necessary for a sum-decomposable function to represent an arbitrary set function, given that $h$ and $g$ are arbitrarily expressive. In particular, they show that a universal function representation can only be guaranteed provided that $d \geq \max_i |S_i|$ is satisfied. During hyperparameter search, we therefore independently sample the dimensionality of the aggregation space and allow it to be on the order of the number of observations that are to be expected in the dataset. Further, we explored the use of max, sum, and mean as alternative aggregation functions, inspired by Zaheer et al. (2017) and Garnelo et al. (2018).

Intuitively, our method can be related to Takens's embedding theorem (Takens, 1981) for dynamical systems: we also observe a set of samples from some unknown (but deterministic) dynamical process; provided the dimensionality of our architecture is sufficiently large (in Takens's embedding theorem, $d > d_B$ is required, where $d_B$ refers to the fractal box-counting dimension (Liebovitch & Toth, 1989), which typically lies well below the size of common neural network architectures), we are capable of reconstructing the system up to diffeomorphism. The crucial difference is that we do not have to construct a time-delay embedding (since we are not interested in being able to perfectly reproduce the dynamics of the system); rather, we let the network learn an embedding that is suitable for classification, which is arguably a simpler task due to its restricted scope.

**Time encoding.** To represent the time point of an observation on a normalized scale, we employ a variant of positional encodings (Vaswani et al., 2017). Preliminary results indicated that this encoding scheme reduces the sensitivity of a model towards initialization and training hyperparameters. Specifically, the time encoding converts the one-dimensional time axis into a multi-dimensional input by passing the time $t$ of each observation through multiple trigonometric functions of varying frequencies. Given a dimensionality $\tau \in \mathbb{N}^+$ of the time encoding, we refer to the encoded position as $x \in \mathbb{R}^\tau$, where

$$x_{2k}(t) := \sin\!\left(\frac{t}{t_{\max}^{2k/\tau}}\right) \qquad \text{and} \qquad x_{2k+1}(t) := \cos\!\left(\frac{t}{t_{\max}^{2k/\tau}}\right),$$

with $k \in \{0, \dots, \tau/2\}$ and $t_{\max}$ representing the maximum time scale that is expected in the data. Intuitively, we select the wavelengths using a geometric progression from $2\pi$ to $t_{\max} \cdot 2\pi$, and treat the number of steps and the maximum timescale $t_{\max}$ as hyperparameters of the model. We used time encodings for all experiments, such that an observation is represented as $s_j = (x(t_j), z_j, m_j)$.
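A minimal sketch of this time encoding is given below. It assumes an even τ and builds exactly τ features from τ/2 sine–cosine pairs, glossing over the boundary case k = τ/2 in the formula above; parameter names and defaults are illustrative, not taken from the authors' code.

```python
import numpy as np

def time_encoding(t, tau=8, t_max=100.0):
    """Map a scalar time t to a tau-dimensional vector of trigonometric features.

    The wavelengths follow a geometric progression between 2*pi and t_max * 2*pi,
    mirroring the positional-encoding variant described above; tau (assumed even)
    and t_max are hyperparameters.
    """
    k = np.arange(tau // 2)
    freq = 1.0 / (t_max ** (2 * k / tau))   # geometrically decreasing frequencies
    enc = np.empty(tau)
    enc[0::2] = np.sin(t * freq)            # x_{2k}(t)
    enc[1::2] = np.cos(t * freq)            # x_{2k+1}(t)
    return enc

# An observation (t, z, m) is then fed to h as (time_encoding(t), z, m).
print(time_encoding(0.5).round(3))
```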
### 3.3. Attention-based Aggregation

So far, our method permits encoding sets of arbitrary sizes into a fixed-size representation. For increasingly large set sizes, however, many irrelevant observations could influence the result of the set function. The mean aggregation function is particularly susceptible to this, because the influence of an observation on the embedding shrinks proportionally to the size of the set. We therefore suggest using a weighted mean, which allows the model to decide which observations are relevant and which should be considered irrelevant. This is equivalent to computing an attention over the set input elements and subsequently computing the weighted sum over all elements in the set.

Our approach is based on scaled dot-product attention with multiple heads $i \in \{1, \dots, m\}$ in order to be able to cover different aspects of the aggregated set. (Since we are dealing with only a single instance, i.e. a single time series, in this section, we use $i$ and $j$ to denote a head and an observation, respectively.) We define $a(S, s_j)$, the attention weight of an individual observation, to depend on the overall set of observations $S$ and on the value of the set element $s_j$. This is achieved by computing an embedding of the set elements using a smaller set function $f'$, and projecting the concatenation of the set representation and the individual set elements into a $d$-dimensional space. Specifically, we have

$$K_{j,i} = \left[f'(S),\, s_j\right]^T W_i,$$

where $W_i \in \mathbb{R}^{(\dim(f'(S)) + |s_j|) \times d}$ and $K \in \mathbb{R}^{|S| \times d}$. Furthermore, we define a matrix of query points $Q \in \mathbb{R}^{m \times d}$, which allows the model to summarize different aspects of the dataset via

$$e_{j,i} = \frac{K_{j,i} \cdot Q_i}{\sqrt{d}} \qquad \text{and} \qquad a_{j,i} = \frac{\exp(e_{j,i})}{\sum_{j'} \exp(e_{j',i})},$$

where $a_{j,i}$ represents the amount of attention that head $i$ gives to set element $j$. The head-specific row $Q_i$ of the query matrix $Q$ allows a head to focus on individual aspects of a time series (such as the distribution of one or multiple modalities). For each head, we multiply the set element embeddings computed via the function $h$ with the attention values derived for the individual set elements, i.e. $r_i = \sum_j a_{j,i}\, h(s_j)$. The computed representations are concatenated and passed to the aggregation network $g$ as in a regular set function, i.e. $r = [r_1, \dots, r_m]$. In our setup, we initialize $Q$ with zeros, such that at the beginning of training, the attention mechanism is equivalent to computing the unweighted mean over the set elements.

Overall, this aggregation function is similar to Transformers (Vaswani et al., 2017), but differs from them in a few key aspects. Commonly, Transformer blocks use the information from all set elements to compute the embedding of an individual set element, leading to a runtime and space complexity of $\mathcal{O}(n^2)$. By contrast, our approach computes the embeddings of set elements independently, leading to a lower runtime and memory complexity of $\mathcal{O}(n)$. This is particularly relevant, as set elements in our case are individual observations, so that we obtain set sizes that are often multiples of the time series length. Furthermore, we observed that computing embeddings with information from other set elements (as the Transformer does) actually decreases generalization performance in several scenarios (please refer to Table 1 for details).
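The following numpy sketch illustrates this attention-based aggregation. The toy stand-ins for $h$ and $f'$, the dimensions, and the use of raw $(t, z, m)$ triples instead of time-encoded observations are simplifying assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, d_h, d_set, obs_dim = 2, 8, 16, 4, 3   # heads, key dim, h-embedding dim, f'-summary dim, raw obs dim

W_h    = rng.normal(size=(obs_dim, d_h))            # toy h: per-observation embedding
W_set  = rng.normal(size=(obs_dim, d_set))          # toy f': small set function (mean of projections)
W_keys = rng.normal(size=(m, d_set + obs_dim, d))   # per-head key projections W_i
Q      = np.zeros((m, d))                           # query matrix, zero-initialised as in the text

def h(X):       return np.tanh(X @ W_h)
def f_small(X): return np.tanh(X @ W_set).mean(axis=0)        # set-level summary f'(S)

def attentive_aggregation(X):
    """X has shape (n_obs, obs_dim), one row per set element s_j."""
    summary = np.tile(f_small(X), (len(X), 1))                 # broadcast f'(S) to every element
    concat = np.concatenate([summary, X], axis=1)              # [f'(S), s_j]
    heads = []
    for i in range(m):
        K = concat @ W_keys[i]                                 # keys K_{j,i}
        e = K @ Q[i] / np.sqrt(d)                              # scaled dot-product scores e_{j,i}
        a = np.exp(e - e.max()); a /= a.sum()                  # softmax over the set elements
        heads.append(a @ h(X))                                 # r_i = sum_j a_{j,i} h(s_j)
    return np.concatenate(heads)                               # r = [r_1, ..., r_m], fed to g

X = rng.normal(size=(5, obs_dim))             # five (hypothetically encoded) observations
print(attentive_aggregation(X).shape)         # (m * d_h,) = (32,)
```

With $Q$ initialised to zero, all scores are equal and every head reduces to the unweighted mean over the set, matching the behaviour described above.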
**Online monitoring scenario.** In an online monitoring scenario, we compute all variables of the model in a cumulative fashion: the set of observations used to predict at the current time point is a subset of the total observations available at the time at which the prediction is made. If this were computed naively, i.e. from scratch for every prediction time point, the attention computation would result in $\mathcal{O}(|S|^2)$ runtime and memory complexity, where $|S|$ is the number of observations. Instead, we rearrange the computation of the weighted mean as follows (discarding the head indices for simplicity):

$$\sum_j \frac{\exp(e_j)}{\sum_k \exp(e_k)}\, h(s_j) \;=\; \frac{\sum_j \exp(e_j)\, h(s_j)}{\sum_k \exp(e_k)}.$$

Both the numerator and the denominator of the right-hand side can be computed in a cumulative fashion, which allows reusing computations from previous time points.

#### 3.3.1. Loss Function

Unless mentioned otherwise, we choose $h$ and $g$ in Equation 2 to be multilayer perceptrons, parametrized by weights $\theta$ and $\psi$, respectively. We thus denote these neural networks by $h_\theta$ and $g_\psi$; their parameters are shared across all instances of a dataset. Our training setup follows Zaheer et al. (2017): we apply the set function to the complete time series, i.e. to the set of all observations of each time series. Overall, we optimize a loss function that is defined as

$$\mathcal{L}(\theta, \psi) := \mathbb{E}_{(S, y) \in \mathcal{D}}\left[\ell\!\left(y;\; g_\psi\!\left(\sum_{s_j \in S} a(S, s_j)\, h_\theta(s_j)\right)\right)\right],$$

where $\ell(\cdot)$ represents a task-specific loss function. In all of our experiments, we utilize the binary cross-entropy loss in combination with a sigmoid activation function in the last layer of $g_\psi$ for binary classification.

## 4. Experiments

We executed all experiments and implementations in a unified and modular code base, which we make available to the community. We provide two dedicated packages: (i) for automatically downloading and preprocessing the datasets according to the splits used in this work, and (ii) for training the introduced method and the baselines to which we compare in the following. We make both publicly available at https://github.com/BorgwardtLab/Set_Functions_for_Time_Series. While some of the datasets used in the following have access restrictions, anybody can gain access after satisfying the defined requirements. This ensures the reproducibility of our results. Please consult Appendix A.3 for further details.

### 4.1. Datasets

In order to benchmark the proposed method, we selected three datasets with irregularly-sampled and non-synchronized measurements. We focus on two tasks with different challenges: first, we predict patient mortality on two datasets; this task is exacerbated by the high class imbalance in the datasets. Second, we predict the onset of sepsis (an organ dysfunction caused by a dysregulated host response to an infection; sepsis is potentially life-threatening and is associated with high mortality of patients) in an online scenario.

**MIMIC-III Mortality Prediction.** MIMIC-III (Johnson et al., 2016) is a widely-used, freely-accessible dataset consisting of distinct ICU stays of patients. The median length of a stay is 2.1 d; a wide range of physiological measurements (e.g. MAP and HR) are recorded with a resolution of 1 h. Furthermore, laboratory test results, collected at irregular time intervals, are available. Recently, Harutyunyan et al. (2019) defined a set of machine learning tasks, labels, and benchmarks using a subset of the MIMIC-III dataset. We trained and evaluated our method and the competing methods on the binary mortality prediction task, while discarding the binning step and applying the additional filtering described in Appendix A.1. The goal of this task, which we abbreviate as M3-Mortality, is to predict whether a patient will die during their hospital stay, using only data from the first 48 h of the ICU stay. In total, the dataset contains around 21,000 stays, of which approximately 10% result in death.
**Physionet 2012 Mortality Prediction Challenge.** The 2012 Physionet challenge dataset (Goldberger et al., 2000), which we abbreviate P-Mortality, contains 12,000 ICU stays, each of which lasts at least 48 h. For each stay, a set of general descriptors (such as gender or age) is collected at admission time. Depending on the course of the stay and the patient's status, up to 37 time series variables were measured (e.g. blood pressure, lactate, and respiration rate). While some modalities might be measured at regular time intervals (e.g. hourly or daily), others are only collected when required; moreover, not all variables are available for each stay. Similar to M3-Mortality, our goal is to predict whether a patient will die during the hospital stay. The training set comprises 8,000 stays, while the testing set comprises 4,000 ICU visits. The datasets are similarly imbalanced, with a prevalence of around 14%. Please refer to Table A.2, Table A.1, and Table A.3 in the appendix for a more detailed enumeration of sample sizes, label distributions, and the handling of demographics data. The total number of samples deviates slightly from the originally-published splits, as time series of excessive length precluded us from fitting some methods in reasonable time, and we therefore excluded them.

**Physionet 2019 Sepsis Early Prediction Challenge.** Given the high global epidemiological burden of sepsis, Reyna et al. (2020) launched a challenge for the early detection of sepsis from clinical data. Observations from over 60,000 intensive care unit patients from three different U.S. hospitals were aggregated. Up to 40 variables (e.g. vital signs and lab results) were recorded hourly, with each hour being labelled with a binary variable indicating whether an onset of sepsis, according to the Sepsis-3 definition (Seymour et al., 2016), occurred. Our task is the hourly prediction of sepsis onset within the next 6 h to 12 h. In our work, we refer to this task as P-Sepsis. To account for the clinical utility of a model, the authors introduced a novel evaluation metric (see Reyna et al. (2020) for more details), which we also report in our experiments: at each prediction time point $t$, a classifier is either rewarded or penalized using a utility function $U(p, t)$, which depends on the distance to the actual sepsis onset for patient $p$. The total utility is the sum over all patients $P$ and all prediction time points $T$, i.e.

$$U_{\text{total}} := \sum_{p \in P} \sum_{t \in T} U(p, t).$$

The score is then normalized ($U_{\text{norm}}$) such that a perfect classifier receives a score of 1, while a classifier with no positive predictions at all receives a score of 0.

### 4.2. Competitor Methods

In order to achieve a thorough comparison, we compare our method to the following six approaches:

1. GRU-Simple (Che et al., 2018)
2. GRU-Decay (GRU-D) (Che et al., 2018)
3. Phased-LSTM (Neil et al., 2016)
4. Interpolation Prediction Networks (IP-Nets) (Shukla & Marlin, 2019)
5. Transformer (Vaswani et al., 2017)
6. Latent-ODE (Rubanova et al., 2019)

All methods except LATENT-ODE were implemented in the same framework using the same training pipeline. Due to differences in implementation and limited comparability, we highlight the results of LATENT-ODE with an asterisk (∗). In particular, for LATENT-ODE we were unable to run an extensive hyperparameter search using the provided codebase, as its runtime was considerably higher than that of any other method. This results in an underestimation of the performance of LATENT-ODE compared to the other methods. For a detailed description of the differences between the implementations, we refer the reader to Appendix A.3.
### 4.3. Experimental Setup

To mitigate the problem of unbalanced datasets, all models were trained on balanced batches of the training data rather than utilizing class weights. This was done in order not to penalize models with a higher memory footprint: such models would only allow training with small batch sizes, which, combined with the unbalanced nature of the datasets, would lead to high variance in the gradients. Due to oversampling, the notion of an epoch differs from the common understanding: in our experiments, we set the number of optimizer steps per epoch to be the minimum of the number of steps required for seeing all samples from the majority class and the number of steps required to see each sample from the minority class three times. Training was stopped after 30 epochs without improvement of the area under the precision–recall curve (AUPRC) on the validation data for the mortality prediction tasks, whereas balanced accuracy was utilized for the online prediction scenario. The hyperparameters with the best overall validation performance were selected for quantifying the performance on the test set. The train, validation, and test splits were the same for all models and all evaluations. To permit a fair comparison between the methods, we executed hyperparameter searches for each model on each dataset, composed of uniformly sampling 20 configurations according to Appendix A.4. Final performance on the test set was calculated from 3 independent runs of the models; evaluation took place after the model was restored to the state with the best validation AUPRC / balanced accuracy. In all subsequent benchmarks, we use the standard deviation of the test performance of these runs as an estimate of generalization performance.

### 4.4. Results

Table 1. Performance comparison of methods on the mortality prediction datasets. AUROC denotes the area under the Receiver Operating Characteristic (ROC) curve; AUPRC denotes the area under the precision–recall curve. Evaluation metrics were scaled to 100 in order to increase readability. ∗ denotes that the performance could be underestimated due to limited hyperparameter tuning compared to other methods.

| Dataset | Model | Accuracy | AUPRC | AUROC | s/epoch |
| --- | --- | --- | --- | --- | --- |
| M3M | GRU-D | 77.0 ± 1.5 | **52.0 ± 0.8** | **85.7 ± 0.2** | 133 ± 8 |
| M3M | GRU-SIMPLE | 78.1 ± 1.3 | 43.6 ± 0.4 | 82.8 ± 0.0 | 140 ± 7 |
| M3M | IP-NETS | *78.3 ± 0.7* | *48.3 ± 0.4* | 83.2 ± 0.5 | 81.2 ± 8.5 |
| M3M | PHASED-LSTM | 73.8 ± 3.3 | 37.1 ± 0.5 | 80.3 ± 0.4 | 166 ± 7 |
| M3M | TRANSFORMER | 77.4 ± 5.6 | 42.6 ± 1.0 | 82.1 ± 0.3 | 20.1 ± 0.1 |
| M3M | LATENT-ODE∗ | 72.8 ± 1.7 | 39.5 ± 0.5 | 80.9 ± 0.2 | 4622 |
| M3M | SEFT-ATTN | **79.0 ± 2.2** | 46.3 ± 0.5 | *83.9 ± 0.4* | 14.5 ± 0.5 |
| P12 | GRU-D | 80.0 ± 2.9 | **53.7 ± 0.9** | **86.3 ± 0.3** | 8.67 ± 0.49 |
| P12 | GRU-SIMPLE | *82.2 ± 0.2* | 42.2 ± 0.6 | 80.8 ± 1.1 | 30.0 ± 2.5 |
| P12 | IP-NETS | 79.4 ± 0.3 | 51.0 ± 0.6 | *86.0 ± 0.2* | 25.3 ± 1.8 |
| P12 | PHASED-LSTM | 76.8 ± 5.2 | 38.7 ± 1.5 | 79.0 ± 1.0 | 44.6 ± 2.3 |
| P12 | TRANSFORMER | **83.7 ± 3.5** | *52.8 ± 2.2* | **86.3 ± 0.8** | 6.06 ± 0.06 |
| P12 | LATENT-ODE∗ | 76.0 ± 0.1 | 50.7 ± 1.7 | 85.7 ± 0.6 | 3500 |
| P12 | SEFT-ATTN | 75.3 ± 3.5 | 52.4 ± 1.1 | 85.1 ± 0.4 | 7.62 ± 0.10 |

Figure 2. Runtime vs. AUPRC trade-offs for all methods on the two mortality prediction tasks. LATENT-ODE is not shown, as its runtime is significantly higher compared to the other models.

Table 1 depicts the results on the two mortality prediction tasks. For each metric, we use bold font to indicate the best value and italics to indicate the second-best. Overall, our proposed method SEFT-ATTN exhibits competitive performance. In terms of AUPRC, arguably the most appropriate metric for unbalanced datasets such as these, we consistently rank among the first four methods.
For M3-Mortality, abbreviated as M3M in the table, our runtime is lower than that of TRANSFORMER, and we outperform it in terms of AUPRC. Both GRU-D and IP-NETS outperform our approach here, while exhibiting considerably higher runtimes. The favourable trade-off SEFT-ATTN achieves between runtime and AUPRC is further underscored by Figure 2. On P-Mortality, abbreviated as P12 in the table, our method is comparable in performance to GRU-D and TRANSFORMER and shows comparable or lower runtime. The picture is similar for the area under the ROC curve (AUROC), where IP-NETS show slightly higher performance than our approach, at the cost of an almost three-fold higher runtime.

**Opening the black box.** In the medical domain, it is of particular interest to understand the decisions a model makes based on the input it is provided with. The formulation of our model and its per-observation perspective on time series give it the unique property of being able to quantify the extent to which an individual observation contributed to the output of the model. We exemplify this in Figure 3 with a patient time series that was combined with the attention values of our model, displayed for a set of clinically relevant variables. After reviewing these records with our medical expert, we find that in channels showing frequent and regularly-spaced observations, the model attends to drastic changes; see, for instance, the sudden increase in continuously-monitored invasive systolic blood pressure. Interestingly, in channels that are observed only intermittently (due to manual intervention, such as non-invasive blood pressure measurements), we observe that our model additionally attends to regions of high observation density, thereby reflecting the increased concern of clinicians for the circulatory stability of the patient.

Figure 3. Visualization of a single attention head on an instance of the P-Mortality dataset. We display a set of blood pressure variables that are most relevant for assessing patient stability: non-invasive diastolic arterial blood pressure (NIDiasABP), non-invasive systolic arterial blood pressure (NISysABP), and invasively measured systolic arterial blood pressure (SysABP). Darker colours represent higher attention values. In the invasive channel with high time resolution (rightmost panel), our model attends foremost to the region after a sudden increase in blood pressure. In the non-invasive, intermittently observed channels, the model additionally focuses on regions of high observation density, reflecting the clinicians' concern.
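Readouts such as Figure 3 only require pairing each observation with its attention weight. The following hedged sketch (hypothetical helper, illustrative attention values, not the authors' plotting code) shows how the most-attended observations of one modality could be listed for inspection.

```python
def top_attended(observations, attention_weights, modality, k=3):
    """Return the k most-attended (weight, (t, z, m)) pairs of one modality."""
    scored = [(a, obs) for a, obs in zip(attention_weights, observations)
              if obs[2] == modality]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

obs  = [(0.5, 60.0, 1), (3.0, 65.0, 1), (0.5, 80.0, 2), (1.7, 85.0, 2), (3.0, 87.0, 2)]
attn = [0.05, 0.40, 0.10, 0.15, 0.30]   # illustrative attention values, not model output
print(top_attended(obs, attn, modality=2))
```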
**Online prediction scenario.** In order to provide models with potential clinical applicability, it is instrumental to also cover an online monitoring scenario that could support clinicians. We present the results of the sepsis early prediction online monitoring task P-Sepsis in Table 2. In this scenario, TRANSFORMER and IP-NETS yield the highest performance and outperform all other methods almost two-fold. These results are very surprising, given that the best of 853 submissions to the Physionet 2019 challenge only achieved a test utility score of 0.36 (Reyna et al., 2020). In order to investigate this issue further, we designed a second evaluation scenario in which future information of each instance is guaranteed to be unavailable to the model, by splitting each instance into multiple cumulative versions of increasing length. We then ran all trained models in this second scenario; whenever the performance metrics differ between the two scenarios, we additionally report in Table 2 the results obtained with masked future information, marked with †.

It is clearly recognizable that the performance of both IP-NETS and TRANSFORMER decreases in the second evaluation scenario, indicating the models' reliance on leaked future information. For IP-NETS, information can leak through the non-parametric imputation step prior to the application of the downstream recurrent neural network. It is infeasible to train the vanilla IP-NETS approach on slices of the time series up until the time point of prediction, as we cannot reuse computations from previous imputation steps. While it would be possible to construct an IP-NETS variant that does not rely on future information during the imputation step, for example using smoothing techniques, we deem this beyond the scope of this work. Similar effects occur in the case of the TRANSFORMER: while observations from the future are masked in the attention computation, preventing access to future values results in a detrimental reduction in performance. Even though the source of the dependence on future values quite probably resides in the layer normalization applied in the TRANSFORMER model, the performance drop can have multiple explanations: 1. the absence of future time points leads to high-variance estimates of mean and variance in the layer-norm operation, resulting in bad performance at the initial time points of the time series, or 2. the model actively exploits future information through the layer normalization. The latter could, for example, occur if the model looks for indicative signals in future time points and, when they are present, returns outputs of very high norm; the signature of these high-norm outputs can then, through the layer-norm operation, be observed at earlier time points. While one could construct variants in which the TRANSFORMER model can by no means access future information, for example by replacing the layer-norm layer with an alternative normalization scheme (Nguyen & Salazar, 2019; Bachlechner et al., 2020), we reserve a more thorough investigation of this issue for future work.

By contrast, our model does not contain any means of leaking future information into the prediction of the current time point and thus exhibits the same performance in both evaluation scenarios, while remaining competitive with alternative approaches.

Table 2. Results of the online prediction scenario on the P-Sepsis task. The dataset is highly imbalanced, such that we only report measures that are sensitive to class imbalance. Whenever the results differ between the two evaluation scenarios, we report both the variant with access to future information and the variant evaluated with masked future information, marked with †. ∗ indicates that the results might underestimate the true performance due to limited hyperparameter tuning compared to the other methods.
| Model | B-Accuracy | AUPRC | AUROC | U_norm | s/epoch |
| --- | --- | --- | --- | --- | --- |
| GRU-D | 57.4 ± 0.2 | 5.33 ± 0.39 | 67.4 ± 1.2 | 12.6 ± 1.1 | 72.3 |
| GRU-SIMPLE | 71.0 ± 1.4 | 6.10 ± 0.75 | 78.1 ± 1.5 | 26.9 ± 4.1 | 116 |
| IP-NETS | 87.1 ± 0.9 | 29.4 ± 2.1 | 94.1 ± 0.4 | 62.2 ± 1.3 | 253 |
| IP-NETS† | 63.8 ± 0.9 | 5.11 ± 0.80 | 74.2 ± 1.2 | 11.9 ± 4.0 | 253 |
| PHASED-LSTM | 67.5 ± 1.7 | 5.54 ± 0.91 | 75.4 ± 1.3 | 20.2 ± 3.2 | 192 |
| LATENT-ODE∗ | 62.4 ± 0.1 | 11.4 ± 2.1 | 64.6 ± 0.7 | 12.3 ± 1.0 | 1872 |
| TRANSFORMER | 91.2 ± 0.2 | 53.4 ± 5.6 | 97.3 ± 0.2 | 71.3 ± 1.4 | 28.5 |
| TRANSFORMER† | 53.6 ± 1.7 | 3.63 ± 0.95 | 65.8 ± 3.7 | 43.9 ± 10.0 | 28.5 |
| SEFT-ATTN | 70.9 ± 0.8 | 4.84 ± 0.22 | 76.8 ± 0.9 | 25.6 ± 1.9 | 77.5 |

Surprisingly, the model with the highest performance in this masked scenario is GRU-SIMPLE, which could be explained by the very regular sampling character of the P-Sepsis dataset: the measurements were already binned into hours, such that the dataset cannot be considered completely irregularly sampled. This explains the high performance of GRU-SIMPLE compared to models that were specifically designed to cope with the irregular sampling problem.

## 5. Conclusion and Discussion

We presented a novel approach for classifying time series with irregularly-sampled and unaligned, i.e. non-synchronized, observations. Our approach yields competitive performance on real-world datasets at a low runtime compared to many competitors. While it does not outperform state-of-the-art models, it shows that shifting the perspective to individual observations represents a promising avenue for future models on irregularly-sampled data. Further, as the model does not contain any sequential inductive biases, in contrast to RNNs, our results indicate that for time series classification tasks this bias might not be of high importance. This is in line with recent research on Transformer architectures for NLP (Vaswani et al., 2017), where order is solely retained through positional embeddings and is not inherent to the processing structure of the model.

Our experiments demonstrated that combining the perspective of individual observations with an attention mechanism permits increasing the interpretability of the model. This is particularly relevant for medical and healthcare applications. Along these lines, we also want to briefly discuss a phenomenon that we observed on M3-Mortality: the performance we report on this task is often lower than that reported in the original paper (Harutyunyan et al., 2019) and follow-up work (Song et al., 2018). We suspect that this is most likely caused by a distribution shift between the validation and test datasets: in fact, model performance is on average 6.3% higher (in terms of AUPRC) on the validation dataset than on the test dataset, which might indicate that the validation dataset is not representative here. We also notice that previous work (Harutyunyan et al., 2019) uses heavily-regularised and comparatively small models for this specific scenario, making them more impervious to distribution shifts. Finally, the fact that we do not bin the observations prior to the application of the models could make our task more difficult compared to the original setup.

Additionally, we would like to discuss the low performance of the LATENT-ODE model. This can be partially attributed to the fact that we did not perform an extensive hyperparameter search with this model. Furthermore, all runs of the LATENT-ODE model contain an additional reconstruction loss term, so that it is necessary to define the trade-off between reconstruction and classification. In this work, we used the same parameters as Rubanova et al.
(2019), which (due to potentially different characteristics of the datasets) could lead to a less than optimal trade-off. For example, some modalities in the datasets we considered might be harder to reconstruct, such that the value of the reconstruction term could be higher, leading to gradients that prioritize reconstruction over classification. Nevertheless, despite these differences, the performance of LATENT-ODE in terms of AUC on the P-Mortality dataset measured in this work is actually higher than the performance reported in the original publication.

In future work, we plan to explore attention mechanisms specifically designed for sets of very high cardinalities. We also strive to make the attention computation more robust, so that elements with low attention values do not get neglected due to the numerical imprecision of aggregation operations; this is also known as catastrophic cancellation (Knuth, 1998), in our case during summation. GPU implementations of algorithms such as Kahan summation (Kahan, 1965) would represent a promising avenue for further improving the performance of attention mechanisms for set functions.

## Acknowledgements

The authors would like to thank Dr. Felipe Llinares López for comments and discussions during the progress of this work. This project was supported by grant #2017-110 of the Strategic Focal Area "Personalized Health and Related Technologies" (PHRT) of the ETH Domain for the SPHN/PHRT Driver Project "Personalized Swiss Sepsis Study" and by the SNSF Starting Grant "Significant Pattern Mining" (K.B., grant no. 155913). Moreover, this work was funded in part by the Alfried Krupp Prize for Young University Teachers of the Alfried Krupp von Bohlen und Halbach-Stiftung (K.B.).

## References

TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets.

Bachlechner, T., Majumder, B. P., Mao, H. H., Cottrell, G. W., and McAuley, J. ReZero is all you need: Fast convergence at large depth. arXiv preprint arXiv:2003.04887, 2020.

Bahadori, M. T. and Lipton, Z. C. Temporal-clustering invariance in irregular healthcare time series. arXiv preprint arXiv:1904.12206, 2019.

Bonilla, E. V., Chai, K. M., and Williams, C. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems (NeurIPS), pp. 153–160, 2008.

Che, Z., Purushotham, S., Cho, K., Sontag, D., and Liu, Y. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.

Desautels, T., Calvert, J., Hoffman, J., Jay, M., Kerem, Y., Shieh, L., Shimabukuro, D., Chettipally, U., Feldman, M. D., Barton, C., et al. Prediction of sepsis in the intensive care unit with minimal electronic health record data: A machine learning approach. JMIR Medical Informatics, 4(3):e28, 2016.

Futoma, J., Hariharan, S., and Heller, K. Learning to detect sepsis with a multitask Gaussian process RNN classifier. In International Conference on Machine Learning (ICML), pp. 1174–1182, 2017.

Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D., and Eslami, S. A. Conditional neural processes. In International Conference on Machine Learning (ICML), pp. 1690–1699, 2018.

Goldberger, A. L., Amaral, L. A., Glass, L., Hausdorff, J. M., Ivanov, P., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.
Harutyunyan, H., Khachatrian, H., Kale, D. C., Ver Steeg, G., and Galstyan, A. Multitask learning and benchmarking with clinical time series data. Scientific Data, 6(1):96, 2019. ISSN 2052-4463.

Hawkes, A. G. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.

Kahan, W. Pracniques: Further remarks on reducing truncation errors. Communications of the ACM, 8(1):40–41, 1965.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Knuth, D. The Art of Computer Programming, Volume 2. Addison-Wesley, 1998.

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set Transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3744–3753. PMLR, 2019.

Li, S. C.-X. and Marlin, B. M. Classification of sparse and irregularly sampled time series with mixtures of expected Gaussian kernels and random features. In UAI, pp. 484–493, 2015.

Li, S. C.-X. and Marlin, B. M. A scalable end-to-end Gaussian process adapter for irregularly sampled time series classification. In Advances in Neural Information Processing Systems 29, pp. 1804–1812, 2016.

Liebovitch, L. S. and Toth, T. A fast algorithm to determine fractal dimensions by box counting. Physics Letters A, 141(8):386–390, 1989.

Liniger, T. J. Multivariate Hawkes processes. PhD thesis, ETH Zurich, 2009.

Lipton, Z. C., Kale, D., and Wetzel, R. Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series. In Machine Learning for Healthcare Conference, pp. 253–270, 2016.

Little, R. J. and Rubin, D. B. Statistical Analysis with Missing Data, volume 333. John Wiley & Sons, 2014.

Lu, Z., Leen, T. K., Huang, Y., and Erdogmus, D. A reproducing kernel Hilbert space framework for pairwise time series distances. In Proceedings of the 25th International Conference on Machine Learning, pp. 624–631, 2008.

Lukasik, M., Srijith, P. K., Vu, D., Bontcheva, K., Zubiaga, A., and Cohn, T. Hawkes processes for continuous time sequence classification: An application to rumour stance classification in Twitter. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 393–398, Berlin, Germany, 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-2064. URL https://www.aclweb.org/anthology/P16-2064.

Marlin, B. M., Kale, D. C., Khemani, R. G., and Wetzel, R. C. Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pp. 389–398. ACM, 2012.

Mei, H. and Eisner, J. M. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems 30, pp. 6754–6764. Curran Associates, Inc., 2017.
Moor, M., Horn, M., Rieck, B., Roqueiro, D., and Borgwardt, K. Early recognition of sepsis with Gaussian process temporal convolutional networks and dynamic time warping. In Proceedings of the 4th Machine Learning for Healthcare Conference, volume 106 of Proceedings of Machine Learning Research, pp. 2–26. PMLR, 2019.

Neil, D., Pfeiffer, M., and Liu, S.-C. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems 29, pp. 3882–3890, 2016.

Nguyen, T. Q. and Salazar, J. Transformers without tears: Improving the normalization of self-attention. arXiv preprint arXiv:1910.05895, 2019.

Reyna, M. A., Josef, C. S., Jeter, R., Shashikumar, S. P., Westover, M. B., Nemati, S., Clifford, G. D., and Sharma, A. Early prediction of sepsis from clinical data: The PhysioNet/Computing in Cardiology Challenge 2019. Critical Care Medicine, 48(2):210–217, 2020.

Rubanova, Y., Chen, R. T., and Duvenaud, D. Latent ODEs for irregularly-sampled time series. arXiv preprint arXiv:1907.03907, 2019.

Seymour, C. W., Liu, V. X., Iwashyna, T. J., Brunkhorst, F. M., Rea, T. D., Scherag, A., Rubenfeld, G., Kahn, J. M., Shankar-Hari, M., Singer, M., et al. Assessment of clinical criteria for sepsis: For the Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA, 315(8):762–774, 2016.

Shukla, S. N. and Marlin, B. Interpolation-prediction networks for irregularly sampled time series. In International Conference on Learning Representations, 2019.

Song, H., Rajan, D., Thiagarajan, J. J., and Spanias, A. Attend and diagnose: Clinical time series analysis using attention models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Takens, F. Detecting strange attractors in turbulence. In Rand, D. and Young, L.-S. (eds.), Dynamical Systems and Turbulence, pp. 366–381, Heidelberg, Germany, 1981. Springer.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008, 2017.

Vinyals, O., Bengio, S., and Kudlur, M. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations, 2016.

Wagstaff, E., Fuchs, F. B., Engelcke, M., Posner, I., and Osborne, M. On the limitations of representing functions on sets. arXiv preprint arXiv:1901.09006, 2019.

Williams, C. K. and Rasmussen, C. E. Gaussian Processes for Machine Learning. MIT Press, 2006.

Xiao, S., Yan, J., Farajtabar, M., Song, L., Yang, X., and Zha, H. Joint modeling of event sequence and time series with attentional twin recurrent neural networks. arXiv preprint arXiv:1703.08524, 2017.

Yadav, P., Steinbach, M., Kumar, V., and Simon, G. Mining electronic health records (EHRs): A survey. ACM Computing Surveys, 50(6):85, 2018.

Yang, Y., Etesami, J., He, N., and Kiyavash, N. Online learning for multivariate Hawkes processes. In Advances in Neural Information Processing Systems 30, pp. 4937–4946. Curran Associates, Inc., 2017.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3391–3401, 2017.