Estimating the Long-Term Effects of Novel Treatments

Keith Battocchi (Microsoft Research), Eleanor W. Dillon (Microsoft Research), Maggie Hei (Microsoft Research), Greg Lewis (Microsoft Research), Miruna Oprescu (Microsoft Research), Vasilis Syrgkanis (Microsoft Research)

Abstract

Policy makers often need to estimate the long-term effects of newly-developed treatments while only having historical data on older treatment options. We propose a surrogate-based approach that combines a long-term dataset where only past treatments were administered with a short-term dataset where the novel treatments have been administered. Our approach generalizes previous surrogate-style methods, allowing for continuous treatments and serially-correlated treatment policies while maintaining consistency and root-n asymptotically normal estimates under a Markovian assumption on the data and the observational policy. Using a semi-synthetic dataset on customer incentives from a major corporation, we evaluate the performance of our method and discuss solutions to practical challenges when deploying our methodology.

1 Introduction

Businesses invent new ways of interacting with their customers. Marketing departments devise new campaigns. Pharmaceutical companies roll out trials of new drugs. In many cases, decision makers expect these novel treatments to affect outcomes like sales or health over several years, but they are eager to begin evaluating these innovations within months of deployment. We propose an estimation methodology that leverages historical data, collected under earlier treatment regimes, to derive early estimates of the long-term effects of novel, newly-deployed treatments.

Athey et al. (2020) develop a surrogate index as one solution to this problem. This method relies on the assumption that causal effects on long-term outcomes are channeled through a set of observable short-term proxies. This assumption allows one to use the historical long-term data set to learn a mapping from short-term signals to a projected long-term reward, and subsequently to estimate the causal effect of novel treatments on this index of surrogates.

In practice, treatment policies are often dynamic: treatments are assigned repeatedly, and their assignments depend on past treatments and short-term outcomes. Unfortunately, this pattern can easily break the assumptions needed for the surrogate index to work. For example, suppose that a firm offers multiple investments to a particular customer in the historical data and these investments are auto-correlated, i.e. if a customer receives an investment this month, then they will receive an investment with high probability in one of the subsequent months. These future investments can substantially increase the long-term outcome of interest, and this increase will be attributed to the short-term proxies. The surrogate index thus formed will tend to over-predict long-run outcomes, and so when it is used to measure the treatment effect of some new treatment in the short-run data, the estimated treatment effect will be bigger (in absolute magnitude) than the truth.

Correspondence to: Vasilis Syrgkanis.
35th Conference on Neural Information Processing Systems (NeurIPS 2021).

The main methodological innovation of this paper is to use the dynamic treatment effect analysis of Lewis & Syrgkanis (2020) on the historical data in order to create an unbiased, dynamically adjusted surrogate index.
Our new index takes the interpretation of the projected long-term reward in the absence of any future treatments. This model produces unbiased causal effect estimates of the long-term effects of the novel treatments, even with auto-correlated treatments. A second contribution of the paper is to generalize the final causal analysis step to allow for multiple continuous treatments rather than a single binary treatment, as in Athey et al. (2020). Our new estimator for this expanded surrogate approach allows for the construction of valid confidence intervals, even when using flexible machine learning models at both stages of the estimation to deal with high-dimensional data. In short, we show how one can combine three recently developed techniques, i) the surrogate index approach of Athey et al. (2020), ii) the double machine learning approach of Chernozhukov et al. (2018a), and iii) the dynamic treatment effect estimation approach of Lewis & Syrgkanis (2020), in a single data analysis pipeline to estimate treatment effects in the presence of dynamic treatment policies.

Our work lies in the broader field of estimating causal effects with machine learning and Neyman orthogonality (Neyman, 1979; Robinson, 1988; Ai & Chen, 2003; Chernozhukov et al., 2016; Chernozhukov et al., 2018b). Moreover, it relates to the work on machine learning estimation of treatment effects in the dynamic treatment regime (Nie et al., 2019; Thomas & Brunskill, 2016; Petersen et al., 2014; Kallus & Uehara, 2019b,a; Lewis & Syrgkanis, 2020; Bodory et al.; Singh et al., 2020) and on structural nested models in biostatistics (Robins, 1986; Robins et al., 1992; Robins, 1994; Robins & Ritov, 1997; Robins et al., 2000; Lok & De Gruttola, 2012; Vansteelandt et al., 2014; Vansteelandt & Sjolander, 2016). Finally, it relates to the surrogacy literature in causal inference (Prentice, 1989; Begg & Leung, 2000; Frangakis & Rubin, 2002; Freedman et al., 1992). In the following sections we build on insights from these works to design a new, streamlined method that resolves several difficult aspects of a common applied problem.

2 Problem and Methodology

We illustrate the difficulties of our problem and how our method resolves them with a running example of a firm making investments in its customers in order to increase subsequent purchases. In this section we describe the problem setup and data requirements and use a simple two-period model to build intuition for how our approach estimates long-term causal effects in this environment and why earlier approaches fail. In the following section we provide formal proofs of our methodology in the general, multi-period case.

2.1 Problem statement

A firm has a number of distinct treatments $T_1, T_2, \ldots, T_k$ it can offer to customers, such as discounts or access to special services. At each period $t$ (say, a month), a vector of treatments $T_{i,t} = (T_{i,t,1}, \ldots, T_{i,t,k}) \in \mathbb{R}^k$ is applied to each customer $i$. We also observe a vector of $p$ characteristics $X_{i,t} = (X_{i,t,1}, \ldots, X_{i,t,p}) \in \mathbb{R}^p$, some of which are constant within customer (e.g. industry) and some of which vary over time and customer (e.g. last month's revenue). We observe the outcome of interest $Y_{i,t} \in \mathbb{R}$ (e.g. monthly revenue). Treatments may affect outcomes over many future periods, and this horizon could vary by treatment. We are therefore interested in identifying the average effect of each treatment at some period $t$ on the cumulative outcome in the subsequent $M$ periods, i.e. $\bar{Y}_{i,t} = \sum_{\kappa=1}^{M} Y_{i,t+\kappa}$.
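For concreteness, the target $\bar{Y}_{i,t}$ can be built directly from a long customer-by-period panel. The sketch below is illustrative only and assumes a pandas DataFrame with hypothetical columns `customer_id`, `period` and `Y`.

```python
# Minimal sketch (not from the paper): constructing the M-period forward cumulative
# outcome Ybar_{i,t} = sum_{kappa=1..M} Y_{i,t+kappa} from a long panel.
import pandas as pd

def add_cumulative_outcome(df: pd.DataFrame, M: int) -> pd.DataFrame:
    """Adds a Y_bar column for each customer and period (NaN when fewer than M periods remain)."""
    df = df.sort_values(["customer_id", "period"]).copy()

    def forward_sum(y: pd.Series) -> pd.Series:
        # Reverse, take a rolling sum of M values, reverse back, then shift so that
        # period t aggregates periods t+1, ..., t+M.
        return y[::-1].rolling(M).sum()[::-1].shift(-1)

    df["Y_bar"] = df.groupby("customer_id")["Y"].transform(forward_sum)
    return df
```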
Suppose direct inference is impossible because we have not yet observed the subsequent $M$ periods following the introduction of a new treatment. Instead, we have access to a vector of $d$ short-term proxies/surrogates, $S_{i,t} = (S_{i,t,1}, \ldots, S_{i,t,d})$, that act as leading indicators for our target outcome. In this example, we would like surrogates that capture a customer's trajectory, such as the next few months of revenue, intensity of product usage, new contract commitments, and participation in company events.

2.2 Dynamic Adjusted Surrogate Index

The key innovation of all surrogate-based solutions is to use the proxy variables to bridge two incomplete data sets to estimate $\bar{Y}_{i,t}$. A long-term observational data set need only include customer characteristics $X_{i,t}$, surrogates $S_{i,t+1}$, and realized $M$-period outcomes $\bar{Y}_{i,t}$. A second, short-term experimental data set includes customer features $X_{i,t}$, surrogates $S_{i,t}$, and treatments $T_{i,t}$. This experimental data set is restricted to the period where all treatments of interest have been introduced, but only requires a few months of leading surrogates rather than $M$ periods of leading outcomes. We propose the following estimation strategy (sketched in code after Figure 1):

1. Use the observational data set to estimate the expected long-term outcome net of any post-period-$t$ treatments, using the recursive methodology of Lewis & Syrgkanis (2020). This adjusted outcome, $\bar{Y}^{adj}_{i,t}$, is the expected cumulative outcome starting in period $t$ under the assumption that customer $i$ receives no further treatments over the next $M$ periods.

2. Still using the observational data set, construct a model that predicts this adjusted outcome from $S_{i,t}$ and $X_{i,t}$:
$$g^{adj}_0(S_{i,t}, X_{i,t}) := E[\bar{Y}^{adj}_{i,t} \mid S_{i,t}, X_{i,t}]. \quad \text{(surr. model)}$$
Any machine learning estimation algorithm can be used to construct $\hat{g}$.

3. Finally, calculate the adjusted surrogate index using the surrogates in the experimental data set:
$$I_{i,t} := \hat{g}(S_{i,t}, X_{i,t}). \quad \text{(predicted surrogate index)}$$
This predicted index becomes the outcome in the final causal model.

This approach will be consistent for the treatment effects under three assumptions, which we illustrate as a causal graph in Figure 1.

1. Conditional Independence: There are no paths from current treatments to future outcomes except through current or future surrogates. Formally, $T_{i,t} \perp Y_{i,s} \mid X_{i,t}, S_{i,t}, \ldots, S_{i,s}$ for all $s \geq t$.

2. Stability: Relationships between current and future surrogates and between surrogates and outcomes, marked in red in the figure, are constant over time and across samples.

3. Unconfoundedness: There are no variables beyond our observed set that directly affect both the treatment assignment and the outcome.

Note that the first assumption allows relationships between current and future treatments and between current treatments and future surrogates, in contrast to the canonical surrogate model. The stability assumption does not extend to the relationships between treatments and surrogates, which allows for the introduction of novel treatments in the experimental data that are not found in the observational data. The model is able to account for steady growth in a natural way: as the surrogates grow, the predicted outcomes grow too.

Figure 1: Causal graph representation of the main assumptions of our causal analysis.
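The following Python sketch outlines steps 2 and 3 under simplified assumptions: the adjusted outcome from step 1 is taken as given (e.g. produced by a dynamic double machine learning routine), the learners are generic, and the final causal step is reduced to a plain regression rather than the orthogonalized estimator of Section 3. All names are hypothetical.

```python
# Illustrative sketch of the three-step strategy above (not the paper's exact code).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

def adjusted_surrogate_index_effect(S_obs, X_obs, Y_adj_obs,
                                    S_exp, X_exp, T_exp):
    # Step 2: learn g_adj on the observational sample: E[Y_adj | S, X].
    g_hat = GradientBoostingRegressor()
    g_hat.fit(np.hstack([S_obs, X_obs]), Y_adj_obs)

    # Step 3: score the experimental sample to obtain the predicted surrogate index I.
    I_exp = g_hat.predict(np.hstack([S_exp, X_exp]))

    # Final causal step (simplified): regress the index on treatments, controlling
    # for pre-treatment features; T_exp has shape (n, k).
    final = LinearRegression().fit(np.hstack([T_exp, X_exp]), I_exp)
    return final.coef_[:T_exp.shape[1]]  # one effect estimate per treatment
```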
2.3 A two-period example

For conciseness, we now collapse $X_{i,t}$ and $S_{i,t-1}$ so that the surrogates at each period and the controls in the next period are all denoted by $S_{i,t-1}$. In other words, all customer observable characteristics in the next period are candidate surrogates and become controls for the period after next. This is without loss of generality. For the remainder of this section only, we further simplify by assuming a linear Markovian model of how the random variables evolve over time and focusing on a single scalar investment ($k = 1$). With these simplifications, the evolution of the random variables can be described in three equations:
$$S_{i,t} = A T_{i,t} + B S_{i,t-1} + \epsilon_{i,t} \quad \text{(evolution of controls)}$$
$$Y_{i,t} = \gamma^\top S_{i,t} + \zeta_{i,t} \quad \text{(outcome equation)}$$
$$T_{i,t+1} = \kappa T_{i,t} + \lambda^\top S_{i,t} + \eta_{i,t} \quad \text{(treatment policy)}$$
where $A$ is a $p \times 1$ matrix, $B$ is a $p \times p$ matrix, $\gamma, \lambda$ are $p$-dimensional vectors and $\kappa$ is a scalar. The terms $\epsilon_{i,t}, \zeta_{i,t}, \eta_{i,t}$ are exogenous, mean-zero, independent noise terms. For conciseness we drop the customer index $i$ in the remaining equations.

Target effect estimand. With two periods, the long-term outcome is $\bar{Y}_t = Y_t + Y_{t+1}$. The effect of treatment $T_t$ on $\bar{Y}_t$ can be derived from the structural model equations:
$$\bar{Y}_t = \gamma^\top S_t + \gamma^\top S_{t+1} + \text{noise} = \gamma^\top A T_{t+1} + \gamma^\top (B + I) A T_t + \gamma^\top (B + I) B S_{t-1} + \text{noise}'$$
Thus we see that the effect of $T_t$ on $Y_{t+1} + Y_t$, assuming future treatments are held constant, is:
$$\theta_0 := \gamma^\top (B + I) A. \quad (1)$$

Bias from a standard surrogate index. Athey et al. (2020) propose training a surrogate index by regressing the realized cumulative outcome $\bar{Y}_t$ on $S_t$, essentially our proposed step 2 using raw instead of adjusted outcomes. Following that approach, we would estimate:
$$g_0(S_t) = E[\bar{Y}_t \mid S_t] = \gamma^\top (I + B) S_t + \gamma^\top A \, E[T_{t+1} \mid S_t] \quad (2)$$
Subsequently, if we estimate the causal effect of $T_t$ on $g_0(S_t)$, then this effect would be:
$$\theta = \theta_0 + \gamma^\top A \, \frac{E\big[E[T_{t+1} \mid S_t]\, T_t\big]}{E[T_t^2]} \quad (3)$$
The standard estimate contains a bias stemming from the fact that investment today can lead to higher investment tomorrow. As illustrated in the equation above, this bias can arise through two channels, both of which are plausible in many use cases. In our example, a customer that is treated once may build a stronger relationship with the firm and receive further treatments (i.e. $\kappa > 0$ in the treatment policy). Alternatively, if the firm targets treatments to fast-growing customers, any treatment that successfully drives surrogate metrics higher will make the customer eligible for further treatment ($\lambda^\top A > 0$).

Dynamic adjustment. With only two periods, our first step simply requires estimating the single-period causal effect of $T_{t+1}$ on $Y_{t+1}$, controlling for $S_t$. This conditional expectation is equal to:
$$E[Y_{t+1} \mid T_{t+1}, S_t] = \gamma^\top A T_{t+1} + \gamma^\top B S_t. \quad (4)$$
We can now create an adjusted long-term outcome, $\bar{Y}^{adj}_t := Y_t + Y_{t+1} - \hat{\alpha}_{t+1} T_{t+1}$, where $\hat{\alpha}_{t+1}$ is our estimate of $\gamma^\top A$ from this first step. The dynamically adjusted surrogate index builds this adjustment into equation (2) above:
$$g^{adj}_0(S_t) = E[Y_t \mid S_t] + E[Y_{t+1} - \gamma^\top A T_{t+1} \mid S_t] = \gamma^\top S_t + \gamma^\top B S_t = \gamma^\top (I + B) S_t \quad (5)$$
This new index captures the projected $M$-period outcome as if the customer was offered no future treatments. When we estimate the effect of the treatment based on this adjusted surrogate index we recover:
$$E[g^{adj}_0(S_t) \mid T_t, S_{t-1}] = \gamma^\top (I + B) E[S_t \mid T_t, S_{t-1}] = \gamma^\top (I + B) A T_t + \gamma^\top (I + B) B S_{t-1} \quad (6)$$
With the dynamic adjustment, the coefficient in front of $T_t$ that we recover is the true causal effect $\theta_0 = \gamma^\top (I + B) A$. Lewis & Syrgkanis (2020) show how this adjustment approach can be extended to many periods, via a recursive peeling process, and also to high-dimensional surrogates and controls via a dynamic double machine learning approach.
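A minimal simulation of this two-period model makes the bias and its removal concrete. The code below is purely illustrative: the parameter values are arbitrary and simple linear regressions stand in for the full estimation pipeline.

```python
# Illustrative simulation of the two-period example: naive vs. adjusted surrogate index.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 200_000, 3
A = rng.normal(size=(p, 1))                    # treatment -> surrogates
B = 0.5 * np.eye(p)                            # surrogate autoregression
gamma = rng.normal(size=p)                     # surrogates -> outcome
lam, kappa = rng.normal(size=p), 0.7           # treatment policy parameters

S_prev = rng.normal(size=(n, p))
T_t = S_prev @ lam + rng.normal(size=n)                    # confounded by S_{t-1}
S_t = T_t[:, None] @ A.T + S_prev @ B.T + rng.normal(size=(n, p))
T_next = kappa * T_t + S_t @ lam + rng.normal(size=n)      # auto-correlated policy
S_next = T_next[:, None] @ A.T + S_t @ B.T + rng.normal(size=(n, p))
Y_t = S_t @ gamma + rng.normal(size=n)
Y_next = S_next @ gamma + rng.normal(size=n)
Y_bar = Y_t + Y_next

theta_0 = (gamma @ (B + np.eye(p)) @ A).item()             # true effect, eq. (1)

def effect_on_index(index, T, controls):
    """Coefficient on T when regressing the index on [T, controls]."""
    return LinearRegression().fit(np.column_stack([T, controls]), index).coef_[0]

# Naive surrogate index: regress realized Y_bar on S_t, then estimate the effect.
g0 = LinearRegression().fit(S_t, Y_bar)
theta_naive = effect_on_index(g0.predict(S_t), T_t, S_prev)

# Dynamic adjustment: remove the estimated effect of T_{t+1} before building the index.
alpha = LinearRegression().fit(np.column_stack([T_next, S_t]), Y_next).coef_[0]
g_adj = LinearRegression().fit(S_t, Y_bar - alpha * T_next)
theta_adj = effect_on_index(g_adj.predict(S_t), T_t, S_prev)

print(f"true {theta_0:.3f}  naive {theta_naive:.3f}  adjusted {theta_adj:.3f}")
```

With auto-correlated treatments ($\kappa > 0$) the naive estimate drifts away from $\theta_0$, while the adjusted index recovers it, mirroring equations (3) and (6).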
Estimating causal effects of new treatments. With this adjusted surrogate index in hand, we can also estimate the long-term effect of any other treatment that was introduced more recently. In this example, the stationarity assumption of the proposed surrogate approaches requires that $B$, which governs how the surrogates evolve, and $\gamma$, which governs how surrogates translate to per-period outcomes, do not change between the observational and experimental data sets. These two parameters govern how surrogates today relate to future outcomes in the absence of any treatment. Under such a condition, if we introduce a new treatment $T^{new}_t$, which has a different effect $A^{new}$ on the surrogates (and hence on the long-term outcome), then the effect of this treatment on the long-term outcome is $\theta^{new}_0 = \gamma^\top (I + B) A^{new}$. This $\theta^{new}_0$ is exactly the outcome of estimating the causal effect of $T^{new}_t$ on $g^{adj}_0(S_{t+1})$ controlling for $S_t$.

3 Formal Guarantees

In this section we present our main results. In addition to expanding to $M$ periods, we introduce two substantive innovations over the basic strategy presented in the two-period example above. First, we develop a generalization of the doubly robust estimation method in Athey et al. (2020) to the case of multiple continuous treatments. Second, we relax the linearity assumptions in the structural model and make use of orthogonal machine learning techniques (Chernozhukov et al., 2018a) to allow for a rich set of potential confounders and valid analytic confidence intervals.

We now assume that we observe the full $M$-period time series, $(S_0, T_1, S_1, Y_1, T_2, S_2, Y_2, \ldots, T_M, S_M, Y_M)$, for each sample in the observational population, denoted as $o$. In contrast, we only observe $(S_0, T_1, S_1)$ for each sample from the experimental population, denoted as $e$.

Notation. Throughout, we denote with $E_e[\cdot]$ the expectation conditional on the experimental population and with $E_o[\cdot]$ the expectation conditional on the observational population. Moreover, for any vector-valued function $f$ that takes as input a random variable $Z$, we denote
$$\|f\|_{2,o} := \sqrt{E_o\big[\|f(Z)\|_2^2\big]} \quad (7)$$
and analogously $\|f\|_{2,e}$. We denote with $E_n[\cdot]$ the empirical expectation over all the samples, i.e. for any random variable $Z$, $E_n[Z] := \frac{1}{n} \sum_i Z_i$, and with $E_{e,n}$ and $E_{o,n}$ the empirical expectations over the experimental and observational samples correspondingly, i.e. $E_{e,n}[Z] = \frac{1}{n_e} \sum_{i \in e} Z_i$ and $E_{o,n}[Z] = \frac{1}{n_o} \sum_{i \in o} Z_i$.
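As a small illustration of this notation, assuming the two samples are stacked into one array with an experimental-population indicator (all names hypothetical):

```python
# Sketch of the stacked-sample layout and empirical expectations used below.
import numpy as np

def sample_expectations(Z: np.ndarray, is_exp: np.ndarray):
    """Returns (E_n[Z], E_{e,n}[Z], E_{o,n}[Z]) for a stacked sample with indicator 1{e}."""
    return Z.mean(axis=0), Z[is_exp].mean(axis=0), Z[~is_exp].mean(axis=0)

def norm_2o(f_vals_obs: np.ndarray) -> float:
    """Empirical analogue of ||f||_{2,o} = sqrt(E_o[||f(Z)||_2^2])."""
    return float(np.sqrt((np.linalg.norm(f_vals_obs, axis=1) ** 2).mean()))
```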
Our goal is to isolate the causal effect of treatments in period 1, $T_1$, on the long-term outcome $\bar{Y}_1 := \sum_{t=1}^{M} Y_t$. Formally, if we fix the future treatments that each sample receives and we change the treatment $T_1$ from some value $t_0$ to some other value $t_1$, the treatment effect is:
$$\tau(t_1, t_0) := E_e\big[E[\bar{Y}_1 \mid do(T_1 = t_1), T_2, \ldots, T_M] - E[\bar{Y}_1 \mid do(T_1 = t_0), T_2, \ldots, T_M]\big]. \quad (8)$$
We present our theoretical results in two steps. For exposition, we begin by assuming that treatments happen only at period 1. This is the setting analyzed in Athey et al. (2020), albeit only for the case of a single binary treatment $T_1$.³ We then show how this approach can be modified to incorporate a dynamic treatment policy in the observational and experimental samples.

3.1 Double/Debiased Correction without Dynamic Adjustment

To begin, we assume that $T_2, \ldots, T_M = 0$. Throughout, we assume a partially linear relationship between the treatment and the long-term outcome:
$$E_e[\bar{Y}_1 \mid T_1, S_0] = \theta_0^\top \phi(T_1, S_0) + b_0(S_0) \quad \text{(PLR)}$$
for some known feature map $\phi(\cdot, \cdot)$, but arbitrary function $b_0(\cdot)$.

³ Athey et al. (2020) also allowed for estimation of average treatment effects, even in the case when there is arbitrary treatment effect heterogeneity. In this work, we assume that treatment effects are constant. A generalization to the case of arbitrary treatment effect heterogeneity is feasible, but would require the estimation of conditional covariance matrices, which would make the estimation algorithm more brittle and the exposition much more complex.

Formally, the invariance of the surrogate-outcome relationship requires that the mean relationship between the surrogates $S_1$ and the long-term outcome does not change between the observational and the experimental sample:
$$g_0(S_1) := E_o[\bar{Y}_1 \mid S_1] = E_e[\bar{Y}_1 \mid S_1]. \quad \text{(IR)}$$
Under the PLR assumption and the causal graph governing our data, we have that:
$$\tau(t_1, t_0) = \theta_0^\top E[\phi(t_1, S_0) - \phi(t_0, S_0)]. \quad (9)$$
We establish three consistent estimators for $\theta_0$ that follow the same intuition as Athey et al. (2020), adapted to consider linear effects of continuous treatments rather than a single binary treatment.

Theorem 3.1 (Identification). Denote the residual surrogate index and the residual treatment with:
$$\tilde{g}_0(S_1) := g_0(S_1) - E_e[g_0(S_1) \mid S_0] \quad (10)$$
$$\tilde{T}_1 := \phi(T_1, S_0) - E_e[\phi(T_1, S_0) \mid S_0] \quad (11)$$
Then under assumptions (PLR) and (IR):
$$\theta_0 = E_e[\tilde{T}_1 \tilde{T}_1^\top]^{-1} E_e[\tilde{T}_1 \tilde{g}_0(S_1)] \quad \text{(surrogate index rep.)}$$
$$\theta_0 = E_e[\tilde{T}_1 \tilde{T}_1^\top]^{-1} E_o\left[\frac{\Pr(e \mid S_1, S_0)}{\Pr(o \mid S_1, S_0)}\, \frac{\Pr(o)}{\Pr(e)}\, E_e[\tilde{T}_1 \mid S_1, S_0]\, \bar{Y}_1\right] \quad \text{(surrogate score rep.)}$$
$$\theta_0 = E_e[\tilde{T}_1 \tilde{T}_1^\top]^{-1} \left( E_e[\tilde{T}_1 \tilde{g}_0(S_1)] + E_o\left[\frac{\Pr(e \mid S_1, S_0)}{\Pr(o \mid S_1, S_0)}\, \frac{\Pr(o)}{\Pr(e)}\, E_e[\tilde{T}_1 \mid S_1, S_0]\, \big(\bar{Y}_1 - g_0(S_1)\big)\right] \right) \quad \text{(orthogonal rep.)}$$

The core estimation challenge that the surrogate approach resolves is that the treatments and the outcome of interest are not observed in a single dataset. Intuitively, the first, surrogate index representation approaches this challenge by using realized treatments from the experimental sample and, in place of realized outcomes, substituting the expected outcome conditional on the surrogates, $g_0(S_1)$, which can be identified from the observational sample and then constructed in the experimental sample. The second, surrogate score representation reverses this substitution. Its second term pairs an expectation of the treatment conditional on surrogates, $E_e[\tilde{T}_1 \mid S_1, S_0]$, with the realized outcomes from the observational sample. This representation requires an added ratio of the probabilities of appearing in each sample, $\Pr(e)$ and $\Pr(o)$, to adjust for sampling variation across the two datasets. The third, orthogonal representation blends the first two representations and satisfies Neyman orthogonality, which allows the construction of confidence intervals and provides double robustness.

The parameter identified in the equations in Theorem 3.1 is interpretable even if the partially linear assumption is violated. In this case, the equations identify the best linear projection of the variation in the long-term outcome that is not explained by the initial state.

We formulate the estimation of $\theta_0$ based on the third, orthogonal representation as a Z-estimator based on a vector of moment equations that depends on a vector of nuisance functions $f_0$, i.e.:
$$m(\theta_0; f_0) := E[\psi(Z; \theta_0, f_0)] = 0 \quad (12)$$
and such that it satisfies the Neyman orthogonality condition:
$$D[m(\theta_0; f_0), f - f_0] := \partial_t\, m\big(\theta_0; f_0 + t (f - f_0)\big)\big|_{t=0} = 0$$
Subsequently, this allows us to invoke the results in Chernozhukov et al. (2018b) to derive an asymptotically normal estimator, even when high-dimensional, regularized approaches are used to estimate the nuisance functions $f_0$.
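As a rough illustration of the surrogate index representation in Theorem 3.1, the sketch below fits $g_0$ on the observational sample, residualizes the index and the treatment on $S_0$ within the experimental sample, and solves the resulting linear system. It takes $\phi(T_1, S_0) = T_1$, uses generic learners, and omits the cross-fitting needed for the formal guarantees; all names are hypothetical.

```python
# Hedged sketch of the "surrogate index representation" (not the authors' code).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def surrogate_index_theta(S1_o, Ybar_o,          # observational sample
                          S0_e, S1_e, T1_e):     # experimental sample; T1_e has shape (n_e, k)
    # g0(S1) = E_o[Ybar_1 | S1], learned on the observational sample (IR assumption).
    g_hat = RandomForestRegressor().fit(S1_o, Ybar_o)
    g_e = g_hat.predict(S1_e)

    # Residualize the index and the treatment on S0 within the experimental sample.
    g_res = g_e - RandomForestRegressor().fit(S0_e, g_e).predict(S0_e)
    T_res = T1_e - RandomForestRegressor().fit(S0_e, T1_e).predict(S0_e)

    # theta_0 = E_e[T_res T_res']^{-1} E_e[T_res g_res]
    M = T_res.T @ T_res / len(T_res)
    v = T_res.T @ g_res / len(T_res)
    return np.linalg.solve(M, v)
```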
Theorem 3.2 (Orthogonal Moment Formulation). Let $f = (g, q, p, h)$ denote a set of nuisance functions and define:
$$q_0(S_1, S_0) := \frac{\Pr(e \mid S_1, S_0)}{\Pr(o \mid S_1, S_0)}\, E_e[\tilde{T}_1 \mid S_1, S_0], \qquad p_0(S_0) := E_e[\phi(T_1, S_0) \mid S_0], \qquad h_0(S_0) := E_e[g(S_1) \mid S_0]$$
Then $\theta_0$ is the solution to the moment equation $m(\theta_0; f_0) := m_e(\theta_0; f_0) + m_o(\theta_0; f_0) = 0$, where
$$m_e(\theta; f) := E\Big[1\{e\}\, \big(g(S_1) - h(S_0) - \theta^\top(\phi(T_1, S_0) - p(S_0))\big)\, \big(\phi(T_1, S_0) - p(S_0)\big)\Big]$$
$$m_o(\theta; f) := E\Big[1\{o\}\, q(S_1, S_0)\, \big(\bar{Y}_1 - g(S_1)\big)\Big]$$
Moreover, the moment $m$ satisfies the Neyman orthogonality property with respect to $f$. Furthermore, it satisfies a stronger double robustness property with respect to $g$ and $q$: if we denote with $\hat{\theta}$ the solution to $m(\theta; \hat{g}, \hat{q}, p_0, \hat{h}) = 0$, then:
$$\big\| E_e[\tilde{T}_1 \tilde{T}_1^\top]\, (\hat{\theta} - \theta_0) \big\|_2 \leq \frac{\Pr(o)}{\Pr(e)}\, \|g_0 - \hat{g}\|_{2,o}\, \|q_0 - \hat{q}\|_{2,o}$$

We estimate $\theta_0$ by the method of moments, solving the empirical analogues of these moment conditions after estimating the nuisance functions $f$. We show that this estimator is asymptotically normal under relatively mild rate conditions on the estimation of the nuisances in Appendix Theorem A.2 (Theorem 3.3, below, establishes the same result for a more general case).

3.2 Double/Debiased Correction with Dynamic Adjustment

We now consider the case of potentially serially correlated treatments in all periods. To achieve consistent estimates of the effects of period-1 treatments in this setting, we need to again assume that all the relationships are linear, albeit potentially high-dimensional:
$$S_t = A T_t + B S_{t-1} + \epsilon_t, \qquad Y_t = C S_t + \eta_t, \qquad T_{t+1} = D T_t + G S_t + \zeta_t \quad (13)$$
where $\epsilon_t, \eta_t, \zeta_t$ are i.i.d. random shocks. In this case, the main identification result in Lewis & Syrgkanis (2020) shows that:
$$\tau(t_1, t_0) = \sum_{t=1}^{M} \theta_{1,t}^\top (t_1 - t_0) = \theta_0^\top (t_1 - t_0) \quad (14)$$
where $\theta_0 := \sum_{t=1}^{M} \theta_{1,t}$ and $\theta_{1,t}$ is interpreted as the dynamic effect of treatment at period 1 on the outcome at period $t$, controlling for all future treatments. This quantity is equivalent to the effect assuming that all future treatments are zero, since the effects of each period's treatment under this linear model are linearly separable. This property simplifies both the estimation and the interpretation of the effect quantity, as many causal quantities of potential interest collapse in this setting to the same object. The identification results in Lewis & Syrgkanis (2020) also imply that we can identify the quantities $\theta_{1,t}$ by estimating the effect of $T_1$ on the multi-period, dynamically adjusted target outcome:
$$\bar{Y}^{adj}_1 := \sum_{t=1}^{M} \Big( Y_t - \sum_{\tau=2}^{t} \theta_{\tau,t}^\top T_\tau \Big) = \theta_0^\top T_1 + \beta_0^\top S_0 + \mu_1 \quad (15)$$
for some exogenous mean-zero random shock variable $\mu_1$. This is the setup that we analyzed in the previous section, albeit with $\bar{Y}_1$ replaced by $\bar{Y}^{adj}_1$ and with $\phi(T_1, S_0) = T_1$. The PLR assumption with respect to this dynamically adjusted target immediately holds by the equation above. Furthermore, we need to assume that the invariance of the surrogate-outcome relationship holds, albeit now for the dynamically adjusted long-term outcome:
$$g^{adj}_0(S_1) := E_o[\bar{Y}^{adj}_1 \mid S_1] = E_e[\bar{Y}^{adj}_1 \mid S_1] \quad \text{(dyn IR)}$$
This assumption is more permissive in practice than the standard invariance assumption, as we no longer require that the dynamic treatment policy in the observational data be the same as in the experimental data, but only that the adjusted outcomes retain the same relationship with the surrogates. For estimation, we can apply the same analysis as in the previous section. The differences are that 1) we now have this new target long-term outcome and 2) we must additionally account for the variance of the $\hat{\theta}_t$ terms that are being estimated by the dynamic adjustment algorithm.
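For intuition, the following sketch solves the empirical version of the orthogonal moment in Theorem 3.2 with $\phi(T_1, S_0) = T_1$, taking all nuisance estimates as given and omitting cross-fitting. In the dynamic case just described, the outcome passed for the observational sample would be the adjusted outcome $\bar{Y}^{adj}_1$. All names are hypothetical.

```python
# Hedged sketch of the orthogonal moment estimator (pre-computed nuisances passed in).
import numpy as np

def orthogonal_theta(T1_e, g_e, h_e, p_e,    # experimental: T1 (n_e,k), g(S1), h(S0)=E_e[g|S0], p(S0)=E_e[T1|S0]
                     Ybar_o, g_o, q_o):      # observational: outcome, g(S1), q(S1,S0) weights (n_o,k)
    n = len(T1_e) + len(Ybar_o)
    T_res = T1_e - p_e                       # phi(T1,S0) - p(S0)
    # Jacobian term: E_n[1{e} T_res T_res']
    J = T_res.T @ T_res / n
    # Experimental part of the moment: E_n[1{e} (g - h) T_res]
    a = T_res.T @ (g_e - h_e) / n
    # Observational correction: E_n[1{o} q(S1,S0) (Ybar_1 - g(S1))]
    b = (q_o * (Ybar_o - g_o)[:, None]).sum(axis=0) / n
    return np.linalg.solve(J, a + b)
```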
We articulate the full algorithm, including these additions, in Appendix B. We show below that the method of moments estimator with plug-in nuisance estimates is asymptotically normal:

Theorem 3.3 (Surrogates with Dynamic Adjustment: Estimation and Asymptotic Normality). Let:
$$\psi_0(Z; \theta, f_0) := \underbrace{1\{e\}\, \big(\tilde{g}^{adj}_0(S_1) - \theta^\top \tilde{T}_1\big)\, \tilde{T}_1}_{\psi_{0,e}(Z;\theta,f_0)} + \underbrace{1\{o\}\, \big(\bar{Y}^{adj}_1 - g^{adj}_0(S_1)\big)\, q_0(S_1, S_0)}_{\psi_{0,o}(Z;\theta,f_0)}$$
$$2 \leq \tau \leq t \leq M: \quad \psi_{\tau,t}(Z; \theta, f_0) := 1\{o\}\, \Big(\tilde{Y}_{t,\tau} - \sum_{\kappa=\tau+1}^{t} \theta_{\kappa,t}^\top \tilde{T}_{\kappa,\tau} - \theta_{\tau,t}^\top \tilde{T}_{\tau,\tau}\Big)\, \tilde{T}_{\tau,\tau}$$
$$2 \leq t \leq M: \quad \psi_t(Z; \theta, f_0) = \big[\psi_{t,t}(Z; \theta, f_0), \ldots, \psi_{t,2}(Z; \theta, f_0)\big]$$
where we use the short-hand notation $\tilde{Y}_{t,\tau} = Y_t - E[Y_t \mid S_{\tau-1}]$, $\bar{Y}^{adj}_1 = \sum_{t=1}^{M} \big(Y_t - \sum_{\tau=2}^{t} \theta_{\tau,t}^\top T_\tau\big)$ and $\tilde{T}_{t,\tau} = T_t - E[T_t \mid S_{\tau-1}]$, and, as in Theorem 3.1, $\tilde{g}^{adj}_0$ and $\tilde{T}_1$ denote residuals with respect to $E_e[\cdot \mid S_0]$. Let $h_{t,\tau}(S_{\tau-1}) = E[Y_t \mid S_{\tau-1}]$ and $p_{t,\tau}(S_{\tau-1}) = E[T_t \mid S_{\tau-1}]$. Let $f = \{g, q, p, h, \{h_{\tau,t}, p_{\tau,t}\}_{2 \leq \tau \leq t \leq M}\}$ denote the set of all nuisance functions and let $f_0$ denote their true values. Consider the estimator based on the empirical version of the orthogonal moment, with plug-in nuisance estimates $\hat{f}$ trained on a separate sample, i.e. $\hat{\theta}$ solves the empirical moment equation $m_n(\hat{\theta}; \hat{f}) = 0$, where
$$m_n(\theta; f) := E_n[\psi(Z; \theta, f)] := \big(E_n[\psi_0(Z; \theta, f)],\; E_n[\psi_2(Z; \theta, f)],\; \ldots,\; E_n[\psi_M(Z; \theta, f)]\big)$$
If $\|\hat{r} - r_0\|_{2,o} = o_p(n^{-1/4})$ for each $r \in \{h_{\tau,t}, p_{\tau,t}\}_{2 \leq \tau \leq t \leq M}$, and if $\|\hat{h} - h_0\|_{2,e} = o_p(n^{-1/4})$, $\|\hat{p} - p_0\|_{2,e} = o_p(n^{-1/4})$ and $\|g_0 - \hat{g}\|_{2,o}\, \|q_0 - \hat{q}\|_{2,o} = o_p(n^{-1/2})$, then:
$$\sqrt{n}\, \big(\hat{\theta} - \theta_0\big) \rightarrow_d N\big(0,\; J_e^{-1} (\Sigma_e + \Sigma_{o,1} + \Sigma_{o,2})\, J_e^{-1}\big)$$
where $J_e, \Sigma_e, \Sigma_{o,1}, \Sigma_{o,2}$ are defined in the appendix.

4 Semi-Synthetic Experimental Evaluation

To evaluate the performance of our proposed estimation strategy we construct a semi-synthetic dataset that retains qualitative characteristics of data on real-world incentive investments in customers at a major corporation, while preserving confidentiality. The semi-synthetic data set preserves several common patterns that require thoughtful attention. The treatments, in this case incentive investments, are lumpy: in most periods most customers get no investments. Proxies, which include single-period values of the outcome of interest, are highly auto-correlated over time. Treatments are also auto-correlated, and correlated with past values of proxies. The data also include time-invariant controls that affect proxies and treatments in all periods.

To build the data we estimate a series of moments from a real-world dataset: a full covariance matrix of all proxies, treatments, and controls in one period, and a series of linear prediction models (lasso CV) of each proxy and treatment on a set of 6 lags of each treatment, 6 lags of each proxy, and the time-invariant controls. Using these values, we draw new parameters from distributions matching the key characteristics of each family of parameters. We then use these new parameters to simulate proxies, treatments, and controls by drawing a set of initial values from the covariance matrix and forward simulating to match the intertemporal relationships from the transformed prediction models. Finally, we arbitrarily select one proxy to be the outcome of interest and construct the cumulative sum of this selected outcome over four or eight periods. For further details on the data generation process, see the appendix.
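A simplified sketch of the forward-simulation step is given below; it omits the lumpiness of treatments and the time-invariant controls, and all parameter objects (the covariance matrix and lag coefficient matrices) are assumed to have been estimated or drawn beforehand.

```python
# Illustrative sketch of the forward simulation used to build a semi-synthetic panel.
import numpy as np

def simulate_panel(Sigma0, coef_proxy, coef_treat, n_customers, n_periods,
                   n_lags=6, noise_scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = Sigma0.shape[0]                          # proxies + treatments per period
    # Initial n_lags periods drawn from the fitted covariance matrix.
    history = [rng.multivariate_normal(np.zeros(d), Sigma0, size=n_customers)
               for _ in range(n_lags)]
    for _ in range(n_periods):
        lags = np.hstack(history[-n_lags:])      # (n_customers, n_lags * d)
        # coef_proxy: (n_lags * d, n_proxies); coef_treat: (n_lags * d, n_treatments)
        proxies = lags @ coef_proxy + noise_scale * rng.normal(size=(n_customers, coef_proxy.shape[1]))
        treats = lags @ coef_treat + noise_scale * rng.normal(size=(n_customers, coef_treat.shape[1]))
        history.append(np.hstack([proxies, treats]))
    return np.stack(history[n_lags:], axis=1)    # (n_customers, n_periods, d)
```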
The true treatment effects in the synthetic data are known functions of parameters from the linear prediction models. We then compare these true causal effects to estimated effects using a variety of approaches, gradually incorporating our proposed innovations. Figure 2 shows the distribution of the $\ell_2$ estimation error ($\|\hat{\theta} - \theta_0\|_2$) for each approach over 100 simulated data sets. The top row plots the estimation error when estimating the effect on four periods of outcomes, increasing the sample size of each simulation from left to right, while the bottom row shows the same for the effect on eight periods of outcomes.

Figure 2: Experimental performance for M = 4 periods and M = 8 periods. (Panels vary the observational sample size n over 1000, 2000, 5000 and 10000, with n_exp = 100 experimental samples.)

Because we construct a single, long synthetic data set for this experiment, it is possible to estimate the treatment effects on realized long-term outcomes directly, unlike in the typical use case for a surrogate approach. We begin by estimating the effect of each treatment at time $t$ directly on the realized total outcomes, rather than on the surrogate index, without dynamic adjustment. The blue "total" bars in each panel of Figure 2 display substantial bias from this method, which does nothing to control for future treatments. Because current treatments are positively correlated with future treatments in the semi-synthetic data, this first approach consistently overestimates treatment effects.

We then estimate the same set of treatment effects using the surrogate approach described in Section 3.1, still without dynamic adjustment. The distribution of estimation errors from this approach is represented by the orange "surrogate" bars. The estimated treatment effects are still substantially larger than the true effects on average, but the surrogate model exhibits slightly less bias than the direct "total" approach. Intuitively, because the surrogate approach only captures the relationship between treatment and outcome that passes through the surrogates, it picks up less of the bias resulting from future correlated treatments than the direct approach.

The third set of green "adj. total" bars plots errors when estimating treatment effects on adjusted realized outcomes using the method of Lewis & Syrgkanis (2020), the preferred approach when all treatments of interest and realized long-run outcomes are available in a single data set. This method removes the effects of future treatments from the long-run outcome in a first step and, as expected, exhibits significantly less bias than the first two methods, particularly for reasonably large samples.

The final two bars illustrate the success of the adjusted surrogate approach described in Section 3.2. As illustrated by the red "adj. surrogate" bars, our proposed approach is highly accurate in predicting long-term effects, with performance comparable to that of having access to the raw long-term outcome itself. The final purple "new treat." bars show that the approach works equally well when considering the effect of a novel treatment that appears only in the experimental sample and was not part of the dynamic adjustment. Overall, this methodology overcomes a common data limitation when considering long-term effects of novel treatments and expands the surrogate approach to handle a common, and previously problematic, pattern of serially correlated treatments.

Acknowledgments and Disclosure of Funding

The authors have no sources of additional funding to disclose.

References
Ai, C. and Chen, X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003.

Athey, S., Chetty, R., Imbens, G., and Kang, H. Estimating treatment effects using multiple surrogates: The role of the surrogate score and the surrogate index, 2020.

Begg, C. B. and Leung, D. H. On the use of surrogate end points in randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 163(1):15–28, 2000.

Bodory, H., Huber, M., and Lafférs, L. Evaluating (weighted) dynamic treatment effects by double machine learning.

Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. Locally robust semiparametric estimation. arXiv e-prints, art. arXiv:1608.00033, July 2016.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018a. ISSN 1368-4221. doi: 10.1111/ectj.12097. URL https://doi.org/10.1111/ectj.12097.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018b.

Frangakis, C. E. and Rubin, D. B. Principal stratification in causal inference. Biometrics, 58(1):21–29, 2002.

Freedman, L. S., Graubard, B. I., and Schatzkin, A. Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine, 11(2):167–178, 1992.

Kallus, N. and Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes, 2019a.

Kallus, N. and Uehara, M. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning, 2019b.

Lewis, G. and Syrgkanis, V. Double/debiased machine learning for dynamic treatment effects, 2020.

Lok, J. J. and De Gruttola, V. Impact of time to start treatment following infection with application to initiating HAART in HIV-positive patients. Biometrics, 68(3):745–754, 2012.

Neyman, J. C(α) tests and their use. Sankhya, pp. 1–21, 1979.

Nie, X., Brunskill, E., and Wager, S. Learning when-to-treat policies, 2019.

Petersen, M., Schwab, J., Gruber, S., Blaser, N., Schomaker, M., and van der Laan, M. Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. Journal of Causal Inference, 2(2):147–185, 2014.

Prentice, R. L. Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine, 8(4):431–440, 1989.

Robins, J. A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393–1512, 1986.

Robins, J. M. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics - Theory and Methods, 23(8):2379–2412, 1994.

Robins, J. M. and Ritov, Y. Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine, 16(3):285–319, 1997.

Robins, J. M., Blevins, D., Ritter, G., and Wulfsohn, M. G-estimation of the effect of prophylaxis therapy for Pneumocystis carinii pneumonia on the survival of AIDS patients. Epidemiology, pp. 319–336, 1992.

Robins, J. M., Hernan, M. A., and Brumback, B. Marginal structural models and causal inference in epidemiology, 2000.

Robinson, P. M. Root-N-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, pp. 931–954, 1988.
Singh, R., Xu, L., and Gretton, A. Kernel methods for policy evaluation: Treatment effects, mediation analysis, and off-policy planning. arXiv preprint arXiv:2010.04855, 2020.

Thomas, P. S. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pp. 2139–2148. JMLR.org, 2016.

Vansteelandt, S. and Sjolander, A. Revisiting g-estimation of the effect of a time-varying exposure subject to time-varying confounding. Epidemiologic Methods, 5(1):37–56, 2016. doi: https://doi.org/10.1515/em-2015-0005. URL https://www.degruyter.com/view/journals/em/5/1/article-p37.xml.

Vansteelandt, S., Joffe, M., et al. Structural nested models and g-estimation: the partially realized promise. Statistical Science, 29(4):707–731, 2014.