# SyncTwin: Treatment Effect Estimation with Longitudinal Outcomes

Zhaozhi Qian (University of Cambridge, zq224@cam.ac.uk), Yao Zhang (University of Cambridge, yz555@cam.ac.uk), Ioana Bica (University of Oxford; The Alan Turing Institute, ioana.bica@eng.ox.ac.uk), Angela Mary Wood (University of Cambridge, amw79@medschl.cam.ac.uk), Mihaela van der Schaar (University of Cambridge; UCLA; The Alan Turing Institute, mv472@cam.ac.uk)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Most medical observational studies estimate causal treatment effects using electronic health records (EHR), where a patient's covariates and outcomes are both observed longitudinally. However, previous methods focus only on adjusting for the covariates while neglecting the temporal structure in the outcomes. To bridge this gap, this paper develops a new method, SyncTwin, that learns a patient-specific, time-constant representation from the pre-treatment observations. SyncTwin issues counterfactual predictions for a target patient by constructing a synthetic twin that closely matches the target in this representation. The reliability of the estimated treatment effect can be assessed by comparing the observed and synthetic pre-treatment outcomes. Medical experts can interpret an estimate by examining the individuals that contribute most to the synthetic twin. In the real-data experiment, SyncTwin successfully reproduced the findings of a randomized controlled clinical trial using observational data, which demonstrates its usability on complex real-world EHR.

1 Introduction

Estimating the causal individual treatment effect (ITE) using observational data has become increasingly common in the medical literature due to the popularization of electronic health records (EHR). The EHR is a longitudinal collection of records: it contains repeated measurements of a patient's health condition over irregular time intervals. Although the treatments may also vary over time, many studies consider a point treatment, where the treatment allocation is performed at some observed time and stays fixed during the study [12, 44, 34, 5, 51]. Point treatment is widely applicable to problems involving one-off treatments (e.g. surgical operations) or treatments that do not change frequently (e.g. long-term medication for chronic disease). We will refer to the setting above as the Longitudinal and Irregularly sampled data with Point treatment setting, or LIP. The LIP setting will be the focus of this work (Figure 1 A).

Figure 1: A: Illustration of the LIP setting. Yellow dots: the potential outcomes. B: Illustration of SyncTwin (shaded area: the time points after the treatment allocation). 1. Temporal covariates are encoded as representation vectors. 2.
The synthetic twin of a treated target individual is constructed as the weighted average of a few contributors from the control group. 3. The difference between the observed outcome and the synthetic twin's outcome estimates the ITE. Similar procedures can be carried out for control or new individuals.

The LIP setting is different from the conventional settings in causal inference because we observe the pre-treatment outcomes $y_t$ over the period leading up to the treatment allocation. These pre-treatment outcomes may unveil the inherent temporal structure in the outcome time series (e.g. trend and seasonality), leading to better ITE estimation over time. In contrast, the conventional settings only consider the outcomes after treatment initiation [54, 52, 36, 43, 6]. This is true even for works that deal with dynamic treatment allocations [40, 35, 11].

A naive way to incorporate pre-treatment outcomes into the standard ITE methods is to treat them as additional covariates. This approach is viable but inadequate: the pre-treatment outcomes are arguably much more closely linked to the outcomes after treatment than the covariates are; they hence deserve special consideration, e.g. modifying the architecture or loss function to reflect their importance. Without such modification, the ITE methods will not incorporate our prior belief on the importance of the pre-treatment outcomes, which may lead to worse performance.

To bridge this gap, we propose SyncTwin, a novel ITE estimation method tailored for the LIP setting. We assume that the temporal outcomes are generated by individualized latent factors and time-varying latent trends. This assumption is similar to the factor model assumption commonly used in Econometrics [1] and allows us to achieve the right balance between parametric assumptions and modeling flexibility. Figure 1 B illustrates the schematics of SyncTwin using an example with two treatment options (treated and control). SyncTwin first uncovers the individualized latent factors using representation learning. For a target individual, SyncTwin selects and weights a few contributors based on their latent factors and a sparsity constraint. It proceeds to construct a synthetic twin whose temporal outcomes are the weighted average of the contributors' outcomes. Finally, the ITE is estimated as the difference between the target individual's and the synthetic twin's outcomes.

Unlike conventional methods, SyncTwin does not use the learned latent factors to directly predict the outcomes; instead it uses these variables to find the contributors and their importance weights so as to construct the synthetic twin. This approach brings two bonus features; both are important to medical applications. (1) We can calculate an individualized estimation error bound based on the actual and the synthetic outcomes before treatment. In practice, the clinician can accept the recommended treatment when the error bound is below a threshold and resort to expert knowledge otherwise. (2) We can identify the most important contributors to an estimate as the ones who receive the highest weights. The clinician can further examine these contributors (e.g. whether they are indeed similar to the target individual) to validate and interpret the estimate.

Contributions. (1) We develop SyncTwin to leverage the temporal structure in the outcomes for better ITE estimation. (2) We provide theoretical justifications for each step involved in SyncTwin and prove an individualized error bound that can be used to identify untrustworthy estimates. (3) In addition to extensive simulations, we conduct an observational study using real EHR data and successfully reproduced the findings of a randomized controlled clinical trial with SyncTwin.

2 Problem setting

We consider an observational study with $N$ individuals indexed by $i \in [N] = \{1, \ldots, N\}$.
Let $a_i \in \{0, 1\}$ be the treatment indicator, with $a_i = 1$ if $i$ received the treatment at some time and $a_i = 0$ otherwise.¹ We realign the time steps such that all treatments were initiated at time $t = 0$. Let $\mathcal{I}_1 = \{i \in [N] \mid a_i = 1\}$ and $\mathcal{I}_0 = \{i \in [N] \mid a_i = 0\}$ be the sets of the treated and the control individuals respectively. Denote $N_0 = |\mathcal{I}_0|$ and $N_1 = |\mathcal{I}_1|$ as the sizes of the groups.

Let $X_i = [x_{is}]_{s \in [S_i]}$ be the temporal covariates, consisting of a sequence of observations $x_{is} \in \mathbb{R}^D$. Let $M_i = [m_{is}]_{s \in [S_i]}$ be the sequence of binary masking vectors $m_{is} \in \{0, 1\}^D$, where $[m_{is}]_d = 1$ indicates that the $d$-th element of $x_{is}$ is measured and $[m_{is}]_d = 0$ otherwise. The entire sequence $X_i$ contains $S_i \in \mathbb{N}$ observations taken before treatment at times $T_i = [t_{is}]_{s \in [S_i]}$, $t_{is} < 0$. The maximum sequence length is $S = \max_i(S_i)$. The outcome $y_{it} \in \mathbb{R}$ is observed at times $t \in \mathcal{T}^- \cup \mathcal{T}^+$, where $\mathcal{T}^- = \{-M, \ldots, -1\}$ and $\mathcal{T}^+ = \{0, \ldots, H-1\}$ are the observation times before and after treatment allocation. We arrange the outcomes after treatment into an $H$-dimensional vector $\mathbf{y}_i = [y_{it}]_{t \in \mathcal{T}^+} \in \mathbb{R}^H$, and similarly define the pre-treatment outcome vector $\mathbf{y}^-_i = [y_{it}]_{t \in \mathcal{T}^-} \in \mathbb{R}^M$.

Using the potential outcome framework [41], let $y_{it}(a_i) \in \mathbb{R}$ denote the potential outcome at time $t$ in a world where $i$ received the treatment indicated by $a_i$. Let $\mathbf{y}_i(1) = [y_{it}(1)]_{t \in \mathcal{T}^+} \in \mathbb{R}^H$ and $\mathbf{y}^-_i(1) = [y_{it}(1)]_{t \in \mathcal{T}^-} \in \mathbb{R}^M$. Similarly, let $\mathbf{y}_i(0) = [y_{it}(0)]_{t \in \mathcal{T}^+}$ and $\mathbf{y}^-_i(0) = [y_{it}(0)]_{t \in \mathcal{T}^-}$. The individual treatment effect (ITE) is defined as $\tau_i = \mathbf{y}_i(1) - \mathbf{y}_i(0) \in \mathbb{R}^H$.

¹ SyncTwin can model more than two treatment options, but we focus on binary treatments for illustration.

3 ITE estimation via SyncTwin

SyncTwin estimates the ITE by predicting both potential outcomes $\mathbf{y}_i(1)$ and $\mathbf{y}_i(0)$ and taking the difference. Note that for an individual $i \in [N]$ in the dataset, it is only necessary to predict the counterfactual outcome $\mathbf{y}_i(1 - a_i)$ because the factual outcome $\mathbf{y}_i(a_i) = \mathbf{y}_i$ is observed (under the consistency assumption discussed later). Without loss of generality, we will use estimating $\mathbf{y}_i(0)$ as an example in the following sections. One can estimate $\mathbf{y}_i(1)$ using the same method.

3.1 Assumptions

SyncTwin relies on three assumptions. (1) Stable Unit Treatment Value Assumption [41]: $y_{it}(a_i) = y_{it}$, $\forall i \in [N]$, $t \in \mathcal{T}^- \cup \mathcal{T}^+$. (2) No anticipation, also known as causal systems [4]: $y_{it} = y_{it}(1) = y_{it}(0)$, $\forall t \in \mathcal{T}^-$, $i \in [N]$. (3) Data generating process (DGP). The DGP assumption involves two parts. First, the causal directed acyclic graph (DAG) is specified in Figure 2 [37], which involves a latent factor $\mathbf{c}_i \in \mathbb{R}^K$. Secondly, we assume the potential outcomes have a parametric form:

$$y_{it}(0) = \mathbf{q}_t^\top \mathbf{c}_i + \xi_{it}, \quad t \in \mathcal{T}^- \cup \mathcal{T}^+, \tag{1}$$

where $\mathbf{q}_t \in \mathbb{R}^K$, $K < \min(M, H)$, is a weight vector and $\xi_{it}$ is white noise. Equation 1 is commonly referred to as a latent factor model, with $\mathbf{c}_i$ as the individualized latent factor and $\mathbf{q}_t$ as the time-varying latent trend [10]. Without loss of generality, we require the weight vectors to satisfy $\|\mathbf{q}_t\| = 1$, $\forall t \in \mathcal{T}^- \cup \mathcal{T}^+$ [50].

Figure 2: The DAG of the assumed data generating model (nodes: covariates, treatment, outcomes, latent variable).
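To make assumption (3) concrete, the following minimal sketch simulates outcomes from a factor model of the form in Equation 1. The dimensions, noise scale and random draws below are arbitrary illustrative choices of ours, not values taken from the paper.

```python
import numpy as np

# Minimal simulation of the assumed DGP (Equation 1): y_it(0) = q_t^T c_i + xi_it.
# N, K, M, H and the noise scale are arbitrary illustrative choices.
rng = np.random.default_rng(0)
N, K, M, H = 100, 3, 10, 5                     # individuals, latent dim, pre/post horizons
T = M + H                                      # total number of outcome time steps

c = rng.normal(size=(N, K))                    # individualized latent factors c_i (constant over time)
q = rng.normal(size=(T, K))
q /= np.linalg.norm(q, axis=1, keepdims=True)  # latent trends q_t with ||q_t|| = 1
xi = 0.1 * rng.normal(size=(N, T))             # white noise xi_it

y0 = c @ q.T + xi                              # potential outcomes y_it(0), shape (N, T)

# Any convex combination of individuals' outcomes preserves the trend q_t:
# sum_i b_i y_it(0) = q_t^T (sum_i b_i c_i) + sum_i b_i xi_it.
b = rng.dirichlet(np.ones(N))                  # weights on the probability simplex
mixed = b @ y0                                 # a "synthetic" outcome series, shape (T,)
```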
Discussion. SyncTwin differs from the nonparametric ITE estimation methods [27] because it assumes the parametric form in Equation 1. In essence, it means that we can fully separate the time effect from the individual effect. We would like to highlight two aspects of this assumption. First, the latent factor $\mathbf{c}_i$ does not change over time. It provides a constant link between the pre-treatment outcomes and the post-treatment outcomes. Secondly, the time-varying latent trend $\mathbf{q}_t$ is common to all individuals. Together, they ensure that any linear combination of different individuals' outcomes will automatically preserve $\mathbf{q}_t$, i.e. $\sum_i b_i y_{it}(0) = \mathbf{q}_t^\top \sum_i b_i \mathbf{c}_i + \sum_i b_i \xi_{it}$, for any weights $b_i \in \mathbb{R}$, $i \in [N]$. For this reason, SyncTwin bypasses $\mathbf{q}_t$ and directly finds the weights $b_i$ to issue counterfactual predictions. As we will show later, doing so allows us to derive various theoretical results to inform the design of SyncTwin. It also leads to a checking procedure to validate estimation quality and improvements in interpretability.

The DAG in Figure 2 is different from the one commonly used in the static setting [41]. We assume the latent factor $\mathbf{c}_i$ captures the physiological factors that impact both the outcome and the covariates. Here, the covariates $x_{is}$ are a confounder in a general sense because they open up a backdoor path from treatment to outcome, i.e. $a_i \leftarrow x_{is} \leftarrow \mathbf{c}_i \rightarrow y_{it}$ [37]. Based on the backdoor criterion [37], adjusting for the covariates $x_{is}$ is sufficient to identify the treatment effect from observational data.

We compare our assumptions with those used in related works in Appendix A.2. In particular, we show that our assumption is weaker than the one used by Synthetic Control, a widely applied method in Econometrics [2]. In Appendix A.5, we further demonstrate the plausibility of our assumptions in real-world scenarios. In the simulation study (Section 5.1) we show experimentally that SyncTwin performs well even when the data is not generated from the assumed DGP exactly but instead from a set of differential equations. Finally, and perhaps most importantly, we show that, based on our assumptions, SyncTwin is able to reproduce the findings of a large-scale randomized controlled trial using observational data (Section 5.2).

3.2 Learning to represent temporal covariates

The latent factor $\mathbf{c}_i$ plays an important role as it affects the covariates $x_{is}$ and the outcomes $y_{it}$. SyncTwin uses deep neural networks to learn a representation $\tilde{\mathbf{c}}_i$ as a proxy for $\mathbf{c}_i$. SyncTwin is agnostic to the exact choice of architecture as long as the network translates the irregularly sampled temporal covariates into a fixed-sized representation vector. For this reason, we use the well-proven sequence-to-sequence architecture (Seq2Seq) [47] with a standard attentive encoder [9] and an LSTM decoder [26]. The learned representation is $\tilde{\mathbf{c}}_i = f_e(X_i, M_i, T_i; \theta_e)$, where $f_e$ is the encoder with trainable weights $\theta_e$. The reconstructed covariates are $\hat{X}_i = f_d(\tilde{\mathbf{c}}_i, T_i; \theta_d)$, where $f_d$ is the decoder with trainable weights $\theta_d$. To ensure the learned representation $\tilde{\mathbf{c}}_i$ is a linear predictor of the potential outcome $\mathbf{y}_i(0)$ (Equation 1), we introduce a trainable parameter $Q \in \mathbb{R}^{H \times K}$ and define $\tilde{\mathbf{y}}_i(0) := Q \tilde{\mathbf{c}}_i$. Note that using a nonlinear function to map from $\tilde{\mathbf{c}}_i$ to $\tilde{\mathbf{y}}_i(0)$ would be inconsistent with the DGP and would not uncover the latent factor $\mathbf{c}_i$ as desired.

We train the networks end-to-end by optimizing the loss function $L_{tr} = \lambda_r L_r + \lambda_p L_s$, where $\lambda_r$ and $\lambda_p$ are hyper-parameters² balancing the supervised loss $L_s$ and the reconstruction loss $L_r$:

$$L_s(D_0) = \sum_{i \in D_0} \|\tilde{\mathbf{y}}_i(0) - \mathbf{y}_i(0)\|^2; \qquad L_r(D_0, D_1) = \sum_{i \in D_0 \cup D_1} \|(\hat{X}_i - X_i) \odot M_i\|^2, \tag{2}$$

where $D_0 \subseteq \mathcal{I}_0$ and $D_1 \subseteq \mathcal{I}_1$ are the training data, $m_{is}$ is the masking vector, $\odot$ represents the element-wise product and $\|\cdot\|$ is the L2 norm. In Proposition 1, we show that minimizing the supervised loss $L_s$ will reduce the error bound on the learned representations, making them closer to the true latent factor $\mathbf{c}_i$. The proof and the motivations for $L_r$ are given in A.1.1.

² The hyper-parameter sensitivity is studied in A.13; $\lambda_r$ and $\lambda_p$ do not significantly impact the performance.
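The released implementation (linked in the footnote of Section 5.1) uses an attentive Seq2Seq encoder-decoder; the sketch below is a deliberately simplified stand-in that only illustrates how the two loss terms in Equation 2 could be combined. The plain-LSTM modules, shapes and argument names are our own choices, not the authors' code.

```python
import torch
import torch.nn as nn

class SimpleSyncTwinNet(nn.Module):
    """Simplified encoder/decoder with a linear outcome head Q (illustrative only)."""
    def __init__(self, x_dim, hidden_dim, latent_dim, horizon):
        super().__init__()
        self.encoder = nn.LSTM(x_dim, hidden_dim, batch_first=True)
        self.to_latent = nn.Linear(hidden_dim, latent_dim)        # representation c~_i
        self.decoder = nn.LSTM(latent_dim, hidden_dim, batch_first=True)
        self.to_x = nn.Linear(hidden_dim, x_dim)                  # reconstruction of X_i
        self.Q = nn.Linear(latent_dim, horizon, bias=False)       # y~_i(0) := Q c~_i (linear, per the DGP)

    def forward(self, x):                                         # x: (batch, seq_len, x_dim)
        _, (h, _) = self.encoder(x)
        c = self.to_latent(h[-1])                                 # (batch, latent_dim)
        dec_in = c.unsqueeze(1).expand(-1, x.size(1), -1)         # feed c~_i at every decoding step
        dec_out, _ = self.decoder(dec_in)
        x_hat = self.to_x(dec_out)                                # reconstructed covariates
        y_hat = self.Q(c)                                         # predicted control outcomes
        return c, x_hat, y_hat

def training_loss(model, x, mask, y, is_control, lam_r=1.0, lam_p=1.0):
    """L_tr = lam_r * L_r + lam_p * L_s (Equation 2), with masked reconstruction."""
    _, x_hat, y_hat = model(x)
    l_r = (((x_hat - x) * mask) ** 2).sum()                       # reconstruction loss, all individuals
    l_s = (((y_hat - y) ** 2).sum(dim=1) * is_control).sum()      # supervised loss, control group only
    return lam_r * l_r + lam_p * l_s
```

Note that, unlike the actual architecture, this stand-in ignores the masking vectors and observation times inside the encoder; it only uses the mask to restrict the reconstruction loss, which is the part Equation 2 specifies.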
Proposition 1 (Error bound on the learned representations). Under the assumptions in Section 3.1, the total error on the learned representations for the control group is bounded by:

$$\sum_{j \in \mathcal{I}_0} \|\tilde{\mathbf{c}}_j - \mathbf{c}_j\| \le \beta L_s + \sum_{j \in \mathcal{I}_0} \|\xi_j\|, \tag{3}$$

where $L_s$ is the supervised loss, $\beta$ is a constant depending on $\mathbf{q}_t$ and $Q$, and $\xi_j$ is the white noise (Equations 1 and 2).

3.3 Synthesizing the twin

At this point, one may be tempted to predict the counterfactual outcome $\mathbf{y}_i(0)$ of a treated individual $i \in \mathcal{I}_1$ with the output of the neural network, $\tilde{\mathbf{y}}_i(0)$. However, $\tilde{\mathbf{y}}_i(0)$ is issued by a black-box neural network, which is not easily interpretable. As a remedy, SyncTwin explicitly constructs a synthetic twin that matches the target in representation. As we will show, this approach is able to control the estimation bias and is more interpretable.

The synthetic twin of a target individual is defined by a set of weights $\mathbf{b}_i := [b_{ij}]_{j \in \mathcal{I}_0} \in \mathbb{R}^{N_0}$, each associated with a contributor in the control group $j \in \mathcal{I}_0$ (or the treatment group when predicting $\hat{\mathbf{y}}_i(1)$). SyncTwin solves the following optimization problem to find the weights:

$$\mathbf{b}^*_i = \arg\min_{\mathbf{b}_i} \Big\|\tilde{\mathbf{c}}_i - \sum_{j \in \mathcal{I}_0} b_{ij} \tilde{\mathbf{c}}_j\Big\|^2 \quad \text{s.t.} \quad b_{ij} \ge 0, \ \forall j \in \mathcal{I}_0 \quad \text{and} \quad \sum_{j \in \mathcal{I}_0} b_{ij} = 1, \tag{4}$$

where $\tilde{\mathbf{c}}_j$ is the representation learned by the encoder. Denote the loss function in Equation 4 as $L_m$. We will call $L_m$ the matching loss because it indicates how well the synthetic twin matches the target individual in representation. Let $d^c_i$ denote the optimal value of $L_m$ and let $\hat{\mathbf{c}}_i$ be the synthetic representation given by the solution $\mathbf{b}^*_i$:

$$d^c_i := \Big\|\tilde{\mathbf{c}}_i - \sum_{j \in \mathcal{I}_0} b^*_{ij} \tilde{\mathbf{c}}_j\Big\|^2; \qquad \hat{\mathbf{c}}_i := \sum_{j \in \mathcal{I}_0} b^*_{ij} \tilde{\mathbf{c}}_j. \tag{5}$$

The optimization procedure is detailed in Appendix A.8. SyncTwin predicts the potential outcome $y_{it}(0)$ for a target individual $i$ using the same weights $\mathbf{b}^*_i$:

$$\hat{y}_{it}(0) = \sum_{j \in \mathcal{I}_0} b^*_{ij} y_{jt}(0) = \sum_{j \in \mathcal{I}_0} b^*_{ij} y_{jt}, \quad \forall t \in \mathcal{T}^- \cup \mathcal{T}^+, \tag{6}$$

where the last equality follows from the consistency assumption (Section 3.1). Denote $\hat{\mathbf{y}}_i(0) = [\hat{y}_{it}(0)]_{t \in \mathcal{T}^+}$ as the predicted potential outcomes after treatment initiation.

The estimator in Equation 6 follows directly from the DGP assumption (Equation 1). To see this, remember that $y_{it}(0)$ is linear in the latent factor $\mathbf{c}_i$. Hence, if we find a set of weights $b^*_{ij}$ that matches the latent factor, i.e. $\mathbf{c}_i = \sum_j b^*_{ij} \mathbf{c}_j$, the same set of weights will also match the outcome, $y_{it}(0) \approx \sum_j b^*_{ij} y_{jt}(0)$, up to random noise. In practice, since we do not observe $\mathbf{c}_i$, we have to perform matching on the learned representation $\tilde{\mathbf{c}}_i$ as in Equation 4. Therefore, the quality of the estimator $\hat{y}_{it}(0)$ depends on two things: (1) whether the learned representation $\tilde{\mathbf{c}}_i$ is close to the true latent factor $\mathbf{c}_i$, and (2) whether good matching weights $\mathbf{b}_i$ can be found on the learned representations. SyncTwin attempts to fulfil the first condition by representation learning (Section 3.2). The second condition is facilitated by solving the optimization problem in Equation 4. Proposition 2 formalizes this intuition and shows that when a perfectly matching twin is found, the estimation error only depends on the quality of the learned representations (proved in A.1.1).

Proposition 2 (Bias bound on counterfactual prediction). Suppose that $d^c_i = 0$ for some $i \in \mathcal{I}_1$ (Equations 1 and 5); then the absolute value of the counterfactual prediction bias for $i$ is bounded by:

$$|\mathbb{E}[\hat{\mathbf{y}}_i(0) - \mathbf{y}_i(0)]| \le |\mathcal{T}^+| \, \|\tilde{\mathbf{c}}_i - \mathbf{c}_i\| + \sum_{j \in \mathcal{I}_0} \|\tilde{\mathbf{c}}_j - \mathbf{c}_j\|.$$
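Appendix A.8 describes the optimization procedure actually used for Equation 4. As a rough illustration only, the simplex constraints can also be enforced by parameterizing the weights with a softmax and running gradient descent; the sketch below is our own simplified solver, not the released one, and it does not produce exactly sparse weights.

```python
import torch

def fit_twin_weights(c_target, c_controls, steps=2000, lr=0.1):
    """Approximately solve Equation 4: min_b ||c_target - sum_j b_j c_j||^2
    subject to b_j >= 0 and sum_j b_j = 1, via b = softmax(theta)."""
    theta = torch.zeros(c_controls.shape[0], requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        b = torch.softmax(theta, dim=0)                   # weights on the probability simplex
        loss = ((c_target - b @ c_controls) ** 2).sum()   # matching loss L_m
        loss.backward()
        opt.step()
    with torch.no_grad():
        b = torch.softmax(theta, dim=0)
        d_c = ((c_target - b @ c_controls) ** 2).sum().item()   # optimal value d^c_i
    return b, d_c

# Usage (Equation 6): predict the counterfactual outcome of a treated target i as the
# weighted average of the observed control outcomes (y_controls: shape (N0, M + H)).
# b_i, d_c_i = fit_twin_weights(c_tilde_target, c_tilde_controls)
# y_hat_i0 = b_i @ y_controls
```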
Notes on Interpretability. In addition to reducing bias by learning and matching representations, SyncTwin is also more interpretable. In particular, the weights $\mathbf{b}_i$ can be interpreted as the contribution or importance of a contributor $j$ to the target $i$, due to the optimization constraints. We can create a shortlist of the most important contributors based on $\mathbf{b}_i$. A domain expert can understand the rationale behind the estimate by examining the shortlist. For instance, they can check whether the important contributors share similar disease progression patterns with the target individual. This procedure corresponds to the notion of data point interpretability or example-based interpretability [45], where one explains the prediction by presenting the most relevant data points to the users. Note that the learned representations $\tilde{\mathbf{c}}_i$ are internal to the algorithm; we do not intend to show these representations to the user or to make interpretations on them.

3.4 Calculating individualized error bound with pre-treatment outcomes

Since Equation 6 applies to all time points before and after treatment, let $\hat{\mathbf{y}}^-_i(0) = [\hat{y}_{it}(0)]_{t \in \mathcal{T}^-}$ be the predicted pre-treatment potential outcome vector. We define $d^y_i$ as the estimation error in the pre-treatment period:

$$d^y_i = \|\hat{\mathbf{y}}^-_i(0) - \mathbf{y}^-_i(0)\|_1 = \|\hat{\mathbf{y}}^-_i(0) - \mathbf{y}^-_i\|_1, \tag{7}$$

where $\|\cdot\|_1$ is the vector $\ell_1$-norm. The second equality allows $d^y_i$ to be evaluated for all individuals. It holds because of the no-anticipation assumption (Section 3.1). Achieving a small error $d^y_i$ implies that the synthetic twin $\hat{\mathbf{c}}_i$ matches well with the true latent factor $\mathbf{c}_i$, due to the linearity between $\mathbf{y}^-_i(0)$ and $\mathbf{c}_i$ (Equation 1). Since $\mathbf{c}_i$ is assumed to be constant over time, the pre-treatment error can be used to assess the post-treatment error. We formalize this intuition in Proposition 3 and use $d^y_i$ to control the error in the post-treatment period (proved in A.1.1).

Proposition 3 (Error control under no hidden confounders). Given any target error threshold $\delta > 0$, define the acceptance group of treated individuals as $A_\delta = \{ i \in \mathcal{I}_1 \mid d^y_i \le \delta |\mathcal{T}^-| / |\mathcal{T}^+| \}$. Under the assumptions in Section 3.1, the post-treatment estimation error satisfies $|\mathbb{E}[\hat{\mathbf{y}}_i(0)] - \mathbb{E}[\mathbf{y}_i(0)]| \le \delta$, $\forall i \in A_\delta$.

Proposition 3 shows that we can control the estimation error to be below a certain threshold $\delta$ by rejecting the estimate if its error $d^y_i$ during the pre-treatment period is larger than $\delta |\mathcal{T}^-| / |\mathcal{T}^+|$. Alternatively, we can rank the estimation trustworthiness for the individuals based on $d^y_i$. This is helpful when the user is willing to accept a percentage of the estimates that are deemed most trustworthy.
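In code, the pre-treatment check of Equation 7 and the acceptance rule of Proposition 3 amount to a few lines. The sketch below is ours; array names and shapes are illustrative.

```python
import numpy as np

def pretreatment_error(y_pre_hat, y_pre):
    """d_i^y = || y_hat_i^-(0) - y_i^- ||_1 (Equation 7), one value per individual.
    y_pre_hat, y_pre: arrays of shape (num_individuals, |T^-|)."""
    return np.abs(y_pre_hat - y_pre).sum(axis=1)

def acceptance_group(d_y, delta, n_pre, n_post):
    """Proposition 3: accept individual i when d_i^y <= delta * |T^-| / |T^+|."""
    return np.where(d_y <= delta * n_pre / n_post)[0]

# Example: indices of treated individuals whose estimates meet a tolerance delta.
# accepted = acceptance_group(pretreatment_error(y_pre_hat, y_pre), delta=0.12, n_pre=M, n_post=H)
```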
3.5 Training, validation and inference

We perform model training, validation and inference (testing) on three disjoint datasets. In summary, we train the encoder and decoder on the training data using the loss functions described in Section 3.2. The validation data is then used to validate and tune the hyper-parameters of the encoder and decoder. Then, we solve the optimization problem in Section 3.3 to find the weights $\mathbf{b}_i$ for the target individuals in the testing data. Finally, we compute the potential outcome estimator (Section 3.3) and the individualized error bound (Section 3.4). The pseudocode is described in A.7.

Table 1: Problem settings considered in the literature. Static: observed (or allocated) only once; Regular: observed (or allocated) over time at a regular frequency; Irregular: observed over time irregularly; "-": not observed or modeled. *: can be extended to Irregular. †: can be extended to Regular. LIP: Longitudinal, Irregular, Point treatment.

| Setting | Example | Pre-treatment $X$ | Pre-treatment $\mathbf{y}^-$ | Treatment $a$ | Post-treatment $\mathbf{y}$ | Nonlinear $y = f(X)$ |
|---|---|---|---|---|---|---|
| Static | [43] | Static* | - | Static | Static | Yes |
| DT | [11] | Regular* | - | Regular | Regular | Yes |
| SC | [2] | Regular | Regular | Static | Regular | No |
| LIP (This work) | This work | Irregular | Regular | Static | Regular | Yes |

4 Related work

Synthetic Control. Similar to SyncTwin, Synthetic Control (SC) [1] and its extensions [8, 7] estimate ITE based on synthetic outcomes. However, SC needs to flatten the temporal covariates $[x_{is}]_{s \in [S_i]}$ into a fixed-sized (high-dimensional) vector $x_i$ and use it to construct the twin. As a result, SC does not allow the covariates to be variable-length or sampled at irregular frequencies (Table 1); otherwise the dimensionality of $x_i$ will vary across individuals. This severely limits SC's usability in medical observational studies because such sampling irregularities are common in EHR. Moreover, SC assumes $y_{it}(0) = \mathbf{q}_t^\top x_i + \xi_{it}$, i.e. the flattened covariates $x_i$ linearly predict $y_{it}(0)$, which is a special case of SyncTwin's assumption (Equation 1) and seems unlikely in many medical applications (e.g. body mass index does not linearly relate to blood pressure).

Static nonparametric ITE estimation. Most ITE estimation methods in the literature consider the static setting and do not pay special attention to the outcomes before treatment $\mathbf{y}^-_i$; these outcomes are treated as standard covariates $X_i$, if used at all (Table 1). However, the pre-treatment outcomes $\mathbf{y}^-_i$ may encode temporal structure that persists in the outcomes $\mathbf{y}_i$ after treatment (e.g. trend and seasonality). Explicitly unveiling such temporal structure is very beneficial for accurately estimating and validating the ITE after treatment. Moreover, these methods make no parametric assumption on the data generating model [27] and they learn to predict the two potential outcomes using deep neural networks [43, 54, 22]. Despite the recent progress in interpretable machine learning, these methods are in general still less interpretable than parametric models such as SyncTwin, whose coefficients can be directly understood by an expert.

ITE estimation for dynamic treatments. Several works consider ITE estimation for dynamic treatments (DT) [40]. In the DT setting (Table 1), multiple treatment decisions are made over time for each individual. ITE estimation is much more challenging for DT because a prior treatment $a_{t-1}$ may influence the temporal covariates $x_t$, which may in turn confound a later treatment $a_t$. This problem is known as temporal confounding. Marginal structural models [40] and related works [42, 35, 11] have been proposed to overcome temporal confounding. In comparison, our setting considers a point treatment and therefore does not suffer from temporal confounding. As a result, although the works in the DT setting are applicable, they might be over-complicated for our problem. In addition, our previous discussion about the inadequate way in which pre-treatment outcomes are leveraged also applies to DT methods.

Works with similar terminology. Several unrelated works in the literature use terms such as "twin", which might cause confusion. We discuss these works in Appendix A.4.

5 Experiments

5.1 Simulation Study

Data generation.³ In this simulation study, we evaluate SyncTwin on the task of estimating the LDL cholesterol-lowering effect of statins, a common drug prescribed to hypercholesterolaemic patients. Following our convention, the individuals are enrolled at $t = 0$, the covariates are observed in $[-S, 0)$, where $S \in \{15, 25, 45\}$, and the ITE is to be estimated in the period $\mathcal{T}^+ = [0, 4]$.

³ The implementation of SyncTwin and the experiment code are available at https://github.com/ZhaozhiQIAN/SyncTwin-NeurIPS-2021 or https://github.com/orgs/vanderschaarlab/repositories.
We start by generating the covariate $k^{in}_{it}$ from a mixture distribution:

$$k^{in}_{it} = \delta_i f_{it} + (1 - \delta_i) g_{it}; \qquad \delta_i \overset{iid}{\sim} \text{Bern}(p_i), \tag{8}$$

where $\text{Bern}(p)$ is the Bernoulli distribution with success probability $p$, and $f_{it}$, $g_{it}$ are drawn from two different distributions (specified in A.10). To introduce confounding bias, we vary $p_i$ for the treated and the control: $p_i = p_0$, $\forall i \in \mathcal{I}_0$ and $p_i = 1$, $\forall i \in \mathcal{I}_1$. The constant $p_0$ controls the degree of confounding bias (smaller $p_0$, larger bias). We simulate the outcome $y_t$ using a pharmacology model widely adopted in the literature [16, 53, 32]. It is obtained by solving the differential equations (9) and adding independent white noise $\epsilon \sim N(0, 0.1)$:

$$\dot{p}_t = k^{in}_t - k \, p_t; \qquad \dot{d}_t = a_t - h \, d_t; \qquad \dot{y}_t = k \, p_t - k_{dt} \, (d_t + d_{50})^{-1} \, y_t, \tag{9}$$

where $\dot{y}_t$ is the derivative of $y_t$ and $a_t$ is the indicator of statin treatment. The covariates are $x_t = \{k^{in}_t, y_t, p_t\}$. The interpretation of all constants involved is presented in Appendix A.10. Finally, we introduce irregular sampling by creating masks $m_{it} \sim \text{Bern}(m)$, with probability $m \in \{0.3, 0.5, 0.7, 1\}$. It is worth highlighting that the simulation data are not generated from SyncTwin's assumed DGP (Equation 1) but from domain-specific ODEs (Equation 9).
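As an illustration of how a single trajectory could be generated from Equations 8-9, the sketch below integrates the ODE system with a simple Euler scheme. The rate constants, step size and noise level are arbitrary placeholders (the actual values are given in Appendix A.10 and are not reproduced here), so this is not the paper's exact simulator.

```python
import numpy as np

def simulate_outcome(k_in, a, k=0.5, h=0.5, k_d=1.0, d50=1.0, dt_step=0.1, noise_sd=0.1, seed=0):
    """Euler integration of the ODE system in Equation 9 (sketch only; all rate
    constants are illustrative placeholders, not the values from Appendix A.10)."""
    rng = np.random.default_rng(seed)
    p = d = y = 0.0
    ys = np.empty(len(k_in))
    for t in range(len(k_in)):
        p += dt_step * (k_in[t] - k * p)                # precursor: p' = k_in - k p
        d += dt_step * (a[t] - h * d)                   # drug:      d' = a_t - h d
        y += dt_step * (k * p - k_d * y / (d + d50))    # outcome:   y' = k p - k_d (d + d50)^{-1} y
        ys[t] = y + noise_sd * rng.normal()             # add independent white observation noise
    return ys

# Example: constant synthesis covariate, treatment switched on for the last few steps.
# y = simulate_outcome(k_in=np.ones(60), a=(np.arange(60) >= 45).astype(float))
```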
Benchmarks. From the Synthetic Control literature, we considered the original Synthetic Control method (SC) [2], Robust Synthetic Control (RSC) [7] and MC-NNM [8]. From the dynamic treatment literature, we compared against the Counterfactual Recurrent Network (CRN) [11] and the Recurrent Marginal Structural Network (RMSN) [35]. Note that RMSN belongs to the family of Marginal Structural Models (MSMs) and we use it as a representative of MSMs. We included a modified CFRNet, which was originally developed for the static setting [43]; we replaced its fully-connected encoder with the encoder used by SyncTwin to allow CFRNet to model temporal covariates (Section 3.2). We also included a benchmark adapted from the counterfactual Gaussian Process (CGP) [42] in order to adjust for the patient-level covariates. We use One-Nearest-Neighbour Matching (1NN) [46] as a baseline. The implementation details of all benchmarks are available in Appendix A.9. We also compared with two ablated versions of SyncTwin: SyncTwin-Lr is trained only with the reconstruction loss and SyncTwin-Ls only with the supervised loss.

Table 2: Mean absolute error on ITE under different levels of confounding bias $p_0$. $m = 1$ and $S = 25$ are used. Estimated standard deviations are shown in the parentheses. The best performer is in bold. Additional quantitative results are shown in Appendix A.11.

| Method | $N_0 = 200$, $p_0 = 0.1$ | $N_0 = 200$, $p_0 = 0.25$ | $N_0 = 200$, $p_0 = 0.5$ | $N_0 = 1000$, $p_0 = 0.1$ | $N_0 = 1000$, $p_0 = 0.25$ | $N_0 = 1000$, $p_0 = 0.5$ |
|---|---|---|---|---|---|---|
| SyncTwin-Full | **0.323 (.039)** | **0.144 (.013)** | 0.128 (.008) | 0.178 (.012) | **0.106 (.006)** | **0.094 (.005)** |
| SyncTwin-Lr | 0.353 (.040) | 0.171 (.016) | 0.135 (.010) | 0.256 (.026) | 0.146 (.013) | 0.102 (.006) |
| SyncTwin-Ls | 0.336 (.040) | 0.171 (.015) | **0.119 (.008)** | **0.145 (.013)** | 0.114 (.007) | 0.127 (.010) |
| SC | 0.341 (.041) | 0.151 (.024) | 0.149 (.018) | 0.258 (.050) | 0.166 (.034) | 0.214 (.036) |
| RSC | 0.842 (.045) | 0.361 (.020) | 0.322 (.019) | 0.310 (.016) | 0.298 (.014) | 0.302 (.014) |
| MC-NNM | 1.160 (.060) | 0.612 (.031) | 0.226 (.011) | 0.527 (.029) | 0.159 (.008) | 0.124 (.007) |
| CFRNet | 0.903 (.077) | 0.387 (.035) | 0.291 (.003) | 0.399 (.038) | 0.178 (.013) | 0.104 (.007) |
| CRN | 0.809 (.050) | 0.613 (.039) | 0.335 (.023) | 0.779 (.041) | 0.589 (.040) | 0.563 (.035) |
| RMSN | 0.418 (.032) | 0.391 (.029) | 0.334 (.027) | 0.478 (.039) | 0.414 (.034) | 0.390 (.032) |
| CGP | 0.660 (.043) | 0.610 (.039) | 0.561 (.035) | 0.826 (.056) | 0.693 (.047) | 0.602 (.038) |
| 1NN | 1.866 (.099) | 1.721 (.091) | 1.614 (.078) | 2.446 (.131) | 1.746 (.106) | 1.384 (.083) |

Figure 3: (A) The percentage of accepted estimates (red) and their MAE (blue) using various rejection thresholds $\delta$ on $d^y_i$. E.g. setting $\delta = 0.12$ gives a 75% acceptance rate and an MAE around 0.105. (B) Performance when certain covariates are omitted (hidden confounders). Shaded area: 95% confidence interval.

Evaluation metric. In many medical studies, the outcome of interest is measured with heavy-tailed, non-Gaussian noise [30]. In these settings, the mean absolute error (MAE) is preferred over the mean squared error (MSE) as an evaluation metric due to its robustness to outliers [38]. For this reason, we evaluate the mean absolute error on counterfactual prediction: $\frac{1}{N_1} \sum_{i=1}^{N_1} \|\mathbf{y}_i(0) - \hat{\mathbf{y}}_i(0)\|_1$. Note that MAE is also directly comparable to the error bound in Proposition 3 (i.e. the L1 norm).

Main results. Table 2 presents the results for various levels of confounding bias $p_0$. Additional results for different sequence lengths $S$ and sampling irregularity $m$ are shown in the tables in Appendix A.11. SyncTwin achieves the best or equally-best performance in all cases, despite the fact that the assumed DGP does not hold exactly. The full SyncTwin with both loss functions also consistently outperforms the versions trained only with $L_r$ or $L_s$. In comparison, SC, RSC and MC-NNM underperform because their assumption that the flattened covariates $x_i$ linearly predict the outcome is violated (Section 4).

Error bound. SyncTwin allows the user to reject untrustworthy estimates based on $d^y_i$. In Figure 3 (A), we show the scenarios of using various rejection thresholds on $d^y_i$. As expected from Proposition 3, when the user increases the threshold, more estimates will be accepted but they will have a higher MAE. The fact that the two curves in the figure share a similar increasing trend suggests $d^y_i$ is a good indicator of individual error: rejection based on a non-informative criterion would lead to a flat line of MAE. In practice, the user can use Figure 3 (A) to calibrate the threshold in order to achieve a balance between accuracy and workload. For instance, if the user would like SyncTwin to accept 75% of the estimates, he or she should set the threshold $\delta = 0.12$ on the pre-treatment error $d^y_i$ and expect an MAE of around 0.105 for the post-treatment outcome.

Sensitivity to omitted covariates.
In Figure 3 (B), we show the situation when some covariates are omitted from the analysis, thus making it harder to adjust for the confounding bias. In particular, we set $x_t = \{k^{in}_t, y_t, p_t\}$, $x_t = \{y_t, p_t\}$ or $x_t = \{y_t\}$ (omitting 0, 1 or 2 covariates). We observe that SyncTwin is still the best-performing model when some covariates are omitted.

Figure 4: Top: the outcomes (LDL) before and after treatment of a target individual and its synthetic twin. Bottom left: histogram of the distance $d^y$ (Equation 7). Bottom right: histogram of the number of contributors used to construct the synthetic twin.

5.2 Experiment on real data

Few existing works in the literature validate their method by conducting a real-world observational study to replicate the findings of a randomized controlled trial (RCT), which is the gold standard of causal inference. Most existing works rely on synthetic or semi-synthetic data for validation, which may significantly under-represent the complexity of real-world data. To validate the usability of SyncTwin in real-world medical problems, we conduct an observational study to emulate a large-scale RCT, the Heart Protection Study (HPS) [21, 20]. Although most existing methods have been validated in synthetic or semi-synthetic experiments [43, 11], few of them have been demonstrated to successfully reproduce the findings of an RCT from observational data. Since the RCT is the gold standard of treatment effect estimation, the ability to reproduce an RCT provides strong evidence of a method's usability in medical research and facilitates its adoption in the medical community.

The HPS was conducted to investigate the treatment effect of statins, a drug commonly used to lower LDL cholesterol (LDL). It reported a change of -1.26 mmol/L (SD = 0.06) in LDL for participants randomised to statins versus placebo at the end of the first year after treatment [39]. The study enrolled 20,536 individuals and lasted for eight years, making it one of the largest studies to investigate the treatment effect of statins.

Data Source. We used medical records from English National Health Service general practices that contributed anonymised primary care electronic health records to the Clinical Practice Research Datalink (CPRD), covering approximately 6.9 percent of the UK population [25]. CPRD was linked to secondary care admissions from Hospital Episode Statistics and national mortality records from the Office for National Statistics. We defined treatment initiation as the date of the first CPRD prescription, and the outcome of interest was the measured LDL. Known risk factors for LDL were selected as temporal covariates measured before treatment initiation: HDL cholesterol, systolic blood pressure, diastolic blood pressure, body mass index, pulse, creatinine, triglycerides and smoking status. Our analysis is based on a subset of 125,784 individuals who met the enrollment criteria of HPS (Appendix A.15). They were split into three equally-sized subsets for training, validation and testing, each with 17,371 treated and 24,557 control individuals.

Main findings. SyncTwin estimates the average treatment effect as -1.25 mmol/L (SD 0.01) by averaging the estimated ITE on the testing data, $\sum_{i \in D_{te}} \hat{\tau}_{it} / |D_{te}|$. The estimate is very close to the result reported in HPS, -1.26 mmol/L (SD = 0.06), after taking into account the standard errors. In contrast, CRN and RMSN estimate the effect to be -0.72 mmol/L (SD 0.01) and -0.83 mmol/L (SD 0.01) respectively, which are significantly smaller than the trial's finding.
Other benchmark methods (e.g. SC, RSC and MC-NNM) either cannot handle irregularly-measured covariates or do not scale to the size of the dataset. Our result suggests SyncTwin is able to overcome the confounding bias in complex real-world datasets. Note that SyncTwin is also able to estimate the ITE and the effect over different time horizons. However, since these results are not reported in HPS, we are not able to evaluate them quantitatively.

Qualitative evaluation of ITE over time. For each individual, we can visualize the outcomes before treatment and compare them with the synthetic twin to sense-check the estimate. The individual shown in Figure 4 (top) has a sensible ITE estimate because the synthetic twin matches its pre-treatment outcomes closely. In addition to visualization, we can calculate the individualized error bound based on $d^y_i$ (Equation 7). From Figure 4 (bottom left) we can see that in most cases $d^y_i$ is small, with a median of 0.24 mmol/L (compared to an average of 0.76 mmol/L between any two randomly chosen individuals). This means that if the expert can only tolerate an error of 0.24 mmol/L on ITE estimation, half of the estimates (those with $d^y_i \le 0.24$ mmol/L) can be accepted (Section 3). As shown in Figure 4 (bottom right), on average only 15 (out of 24,557) individuals contribute to the synthetic twin ($b_{ij} > 0.01$). The medical experts can check these few contributors to interpret the estimate.

6 Conclusion and future work

We present SyncTwin, an ITE estimation method tailored for temporal outcomes with point treatment. SyncTwin achieves interpretability and strong performance in both simulated and real data experiments. In future work, we plan to extend SyncTwin to the dynamic treatment setting and to model the outcomes in continuous time. Developing additional assumptions to guarantee robustness to certain types of unobserved confounders is also an interesting avenue for future research.

Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers as well as members of the van der Schaar lab for many insightful comments and suggestions. This work is supported by the US Office of Naval Research (ONR), the National Science Foundation (NSF, grant number 1722516), the Alan Turing Institute (under EPSRC grant EP/N510129/1), and GlaxoSmithKline (GSK).

References

[1] Alberto Abadie. Using synthetic controls: Feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 2019.
[2] Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493-505, 2010.
[3] Alberto Abadie and Javier Gardeazabal. The economic costs of conflict: A case study of the Basque Country. American Economic Review, 93(1):113-132, 2003.
[4] Jaap H Abbring and Gerard J Van den Berg. The nonparametric identification of treatment effects in duration models. Econometrica, 71(5):1491-1517, 2003.
[5] Linda H Aiken, Douglas M Sloane, Luk Bruyneel, Koen Van den Heede, Peter Griffiths, Reinhard Busse, Marianna Diomidous, Juha Kinnunen, Maria Kózka, Emmanuel Lesaffre, et al. Nurse staffing and education and hospital mortality in nine European countries: a retrospective observational study. The Lancet, 383(9931):1824-1830, 2014.
[6] Ahmed M Alaa and Mihaela van der Schaar. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Advances in Neural Information Processing Systems, pages 3424-3432, 2017.
[7] Muhammad Amjad, Devavrat Shah, and Dennis Shen. Robust synthetic control. The Journal of Machine Learning Research, 19(1):802-852, 2018.
[8] Susan Athey, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. Matrix completion methods for causal panel data models. Technical report, National Bureau of Economic Research, 2018.
[9] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, 2015.
[10] Jushan Bai and Serena Ng. Large dimensional factor analysis. Now Publishers Inc, 2008.
[11] Ioana Bica, Ahmed M Alaa, James Jordon, and Mihaela van der Schaar. Estimating counterfactual treatment outcomes over time through adversarially balanced representations. In International Conference on Learning Representations, 2020.
[12] Melissa DA Carlson and R Sean Morrison. Study design, precision, and validity in observational studies. Journal of Palliative Medicine, 12(1):77-82, 2009.
[13] Barbra A Dickerman, Xabier García-Albéniz, Roger W Logan, Spiros Denaxas, and Miguel A Hernán. Avoidable flaws in observational analyses: an application to statins and cancer. Nature Medicine, 25(10):1601-1606, 2019.
[14] Dumitru Erhan, Aaron Courville, Yoshua Bengio, and Pascal Vincent. Why does unsupervised pre-training help deep learning? In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 201-208, 2010.
[15] Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Artificial Intelligence and Statistics, pages 153-160, 2009.
[16] Demiana William Faltaos, Saïk Urien, Valérie Carreau, Marina Chauvenet, Jean Sebastian Hulot, Philippe Giral, Eric Bruckert, and Philippe Lechat. Use of an indirect effect model to describe the LDL cholesterol-lowering effect by statins in hypercholesterolaemic patients. Fundamental & Clinical Pharmacology, 20(3):321-330, 2006.
[17] Steven E Finkel. Causal analysis with panel data. Sage, 1995.
[18] Jared C Foster, Jeremy MG Taylor, and Stephen J Ruberg. Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867-2880, 2011.
[19] GPy. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, since 2012.
[20] Heart Protection Study Collaborative Group et al. MRC/BHF Heart Protection Study: randomised placebo-controlled trial of cholesterol-lowering with simvastatin in 20,536 high-risk individuals. Lancet, 360(9326):7-22, 2002.
[21] Heart Protection Study Collaborative Group et al. Randomized trial of the effects of cholesterol-lowering with simvastatin on peripheral vascular and other major vascular outcomes in 20,536 people with peripheral arterial disease and other high-risk conditions. Journal of Vascular Surgery, 45(4):645-654, 2007.
[22] Negar Hassanpour and Russell Greiner. Learning disentangled representations for counterfactual regression. In International Conference on Learning Representations, 2020.
[23] Harshad Hegde, Neel Shimpi, Aloksagar Panny, Ingrid Glurich, Pamela Christie, and Amit Acharya. MICE vs PPCA: Missing data imputation in healthcare. Informatics in Medicine Unlocked, 17:100275, 2019.
[24] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pages 15663-15674, 2019.
[25] Emily Herrett, Arlene M Gallagher, Krishnan Bhaskaran, Harriet Forbes, Rohini Mathur, Tjeerd Van Staa, and Liam Smeeth. Data resource profile: Clinical Practice Research Datalink (CPRD). International Journal of Epidemiology, 44(3):827-836, 2015.
[26] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[27] Guido W Imbens. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1):4-29, 2004.
[28] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
[29] Fredrik D Johansson, Nathan Kallus, Uri Shalit, and David Sontag. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018.
[30] Andrew Michael Jones. Data visualization and health econometrics. Foundations and Trends in Econometrics, 2017.
[31] Nathan Kallus. DeepMatch: Balancing deep covariate representations for causal inference using adversarial training. arXiv preprint arXiv:1802.05664, 2018.
[32] Jimyon Kim, Byung-Jin Ahn, Hong-Seok Chae, Seunghoon Han, Kichan Doh, Jeongeun Choi, Yong K Jun, Yong W Lee, and Dong-Seok Yim. A population pharmacokinetic-pharmacodynamic model for simvastatin that predicts low-density lipoprotein-cholesterol reduction in patients with primary hyperlipidaemia. Basic & Clinical Pharmacology & Toxicology, 109(3):156-163, 2011.
[33] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[34] Matthieu Legrand, Claire Dupuis, Christelle Simon, Etienne Gayat, Joaquim Mateo, Anne Claire Lukaszewicz, and Didier Payen. Association between systemic hemodynamics and septic acute kidney injury in critically ill patients: a retrospective observational study. Critical Care, 17(6):1-8, 2013.
[35] Bryan Lim, Ahmed M Alaa, and Mihaela van der Schaar. Forecasting treatment responses over time using recurrent marginal structural networks. In Advances in Neural Information Processing Systems, pages 7483-7493, 2018.
[36] Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6446-6456, 2017.
[37] Judea Pearl. Causality. Cambridge University Press, 2009.
[38] Ryan Poplin, Avinash V Varadarajan, Katy Blumer, Yun Liu, Michael V McConnell, Greg S Corrado, Lily Peng, and Dale R Webster. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering, 2(3):158-164, 2018.
[39] Paul M Ridker and Nancy R Cook. Statins: new American guidelines for prevention of cardiovascular disease. The Lancet, 382(9907):1762-1765, 2013.
[40] James M Robins, Miguel Angel Hernán, and Babette Brumback. Marginal structural models and causal inference in epidemiology, 2000.
[41] Donald B Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322-331, 2005.
[42] Peter Schulam and Suchi Saria. Reliable decision support using counterfactual models. In Advances in Neural Information Processing Systems, pages 1697-1708, 2017.
[43] Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, pages 3076-3085. PMLR, 2017.
[44] Jae W Song and Kevin C Chung. Observational studies: cohort and case-control studies. Plastic and Reconstructive Surgery, 126(6):2234, 2010.
[45] Gregor Stiglic, Primoz Kocbek, Nino Fijacko, Marinka Zitnik, Katrien Verbert, and Leona Cilar. Interpretability of machine learning-based prediction models in healthcare. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(5):e1379, 2020.
[46] Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward. Statistical Science: a review journal of the Institute of Mathematical Statistics, 25(1):1, 2010.
[47] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.
[48] Timo Teräsvirta, Dag Tjøstheim, Clive William John Granger, et al. Modelling nonlinear economic time series. Oxford University Press, Oxford, 2010.
[49] Madeleine Udell and Alex Townsend. Nice latent variable models have log-rank. arXiv preprint arXiv:1705.07474, 1, 2017.
[50] Yiqing Xu. Generalized synthetic control method: Causal inference with interactive fixed effects models. Political Analysis, 25(1):57-76, 2017.
[51] Xiaobo Yang, Yuan Yu, Jiqian Xu, Huaqing Shu, Hong Liu, Yongran Wu, Lu Zhang, Zhui Yu, Minghao Fang, Ting Yu, et al. Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study. The Lancet Respiratory Medicine, 8(5):475-481, 2020.
[52] Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. Representation learning for treatment effect estimation from observational data. In Advances in Neural Information Processing Systems, pages 2633-2643, 2018.
[53] Koutaro Yokote, Hideaki Bujo, Hideki Hanaoka, Masaki Shinomiya, Keiji Mikami, Yoh Miyashita, Tetsuo Nishikawa, Tatsuhiko Kodama, Norio Tada, and Yasushi Saito. Multicenter collaborative randomized parallel group comparative study of pitavastatin and atorvastatin in Japanese hypercholesterolemic patients: collaborative study on hypercholesterolemia drug intervention and their benefits for atherosclerosis prevention (CHIBA study). Atherosclerosis, 201(2):345-352, 2008.
[54] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations, 2018.
[55] Hao Zou, Peng Cui, Bo Li, Zheyan Shen, Jianxin Ma, Hongxia Yang, and Yue He. Counterfactual prediction for bundle treatment. Advances in Neural Information Processing Systems, 33, 2020.