# continuous_treatment_effects_with_surrogate_outcomes__52ec17ba.pdf

Continuous Treatment Effects with Surrogate Outcomes

Zhenghao Zeng 1 David Arbour 2 Avi Feller 3 Raghavendra Addanki 2 Ryan Rossi 2 Ritwik Sinha 2

Edward H. Kennedy 1

In many real-world causal inference applications, the primary outcomes (labels) are often partially missing, especially if they are expensive or difficult to collect. If the missingness depends on covariates (i.e., missingness is not completely at random), analyses based on fully observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method to efficiently incorporate surrogates in the analysis, which uses both labeled and unlabeled data and does not suffer from the above selection bias problem. Importantly, we establish the asymptotic normality of the proposed estimator and show possible improvements on the variance compared with methods that solely use labeled data. Extensive simulations show our methods enjoy appealing empirical performance.

1. Introduction

In many causal inference applications, the primary outcomes are missing for a non-trivial number of observations. For instance, in studies on long-term health effects of medical interventions, some measurements require expensive testing and a loss to follow-up is common (Hogan et al., 2004). In evaluating commercial online ad effectiveness, some individuals may drop out from the panel because they use multiple devices (Shankar et al., 2023), leading to missing revenue measures. In many of these studies, however, there often exist short-term outcomes that are easier and faster to

1Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA, USA 2Adobe Research, San Jose, CA, USA 3Goldman School of Public Policy and Department of Statistics, University of California, Berkeley, Berkeley, CA, USA. Correspondence to: Zhenghao Zeng <zhenghaz@andrew.cmu.edu>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

measure, e.g., short-term health measures or an online ad s click-through rate, that are observed for a greater share of the sample. These outcomes, which are typically informative about the primary outcomes themselves, are referred to as surrogate outcomes or surrogates.

There is a rich causal inference literature addressing missing outcome data. Simply restricting to data with observed primary outcomes may induce strong bias (Hern an & Robins, 2010). Ignoring unlabeled data also reduces the effective sample size for estimating the treatment effects and inflates the variance. Chakrabortty et al. (2022) considered the missing completely at random (MCAR) setting and showed that incorporating unlabeled data reduces variance. Zhang et al. (2023) generalized the results to missing-at-random (MAR) settings where the unlabeled data has a much larger size than the labeled data. Kallus & Mao (2020) further examined the role of surrogates in datasets with limited primary outcomes and showed efficiency gains after including surrogates and unlabeled data in the analysis. Singh (2022) proposed a generalized kernel ridge regression framework, which incorporates information on surrogate outcomes in treatment effect estimation. See also Zhang & Bradic (2022); Hou et al. (2021); Zeng et al. (2023) for relevant discussions.

Continuous treatments appear in many applications; e.g., waiting time before follow-up, percent of discount, and drug dosage. Existing estimation procedures include outcome modeling (Newey, 1994) and treatment process modeling (Galvao & Wang, 2015). Kennedy et al. (2017) proposed doubly robust methods that model both the outcome and the treatment process, and enjoy appealing robustness properties. Bonvini & Kennedy (2022) further examined highorder estimators to achieve faster rates.

In this paper, we consider the estimation of dose-response functions with limited primary outcome data. We propose novel doubly robust methods using both labeled and unlabeled data, with the help of surrogates. Our approach avoids the potential selection bias caused by restricting only to labeled data and provably reduces the variance. Importantly, we study the theoretical properties of the estimator proposed, and establish its asymptotic normality to facilitate statistical inference. Our work serves as a counterpart for Kallus & Mao (2020) in continuous treatment effects

Continuous Treatment Effects with Surrogate Outcomes

settings and enriches the existing dose-response estimation literature (Kennedy et al., 2017; Bonvini & Kennedy, 2022).

The rest of this paper is organized as follows: In Section 2 we introduce the problem setup and notation. Section 3 provides assumptions to identify the continuous treatment effect. Novel methodology and theoretical guarantees are discussed in Section 4. Our simulation study in Section 6 demonstrates the performance of our method. Finally we conclude with a discussion in Section 7. All proofs, additional simulation results and a real data example are included in supplementary materials.

2. Setup and Notation

In this section we introduce the problem of estimating continuous treatment effects with surrogates. We formalize the problem using potential outcomes framework (Splawa Neyman et al., 1990; Rubin, 1974) and introduce notation to present our results concisely.

2.1. Data Structure

Suppose we have access to two datasets L and U from a randomized experiment/observational study. The labeled dataset is L = {Zi = (Vi, Si, Ai, Yi, Ri = 1), 1 i n1}, where V is the set of covariates, S is the vector of surrogates, A is the continuous treatment, and Y is the primary outcome. Here R is an indicator with R = 1 if a sample is from the labeled data L and R = 0 otherwise. The unlabeled dataset consists of samples without primary outcomes U = {Zi = (Vi, Si, Ai, Ri = 0), n1 + 1 i n1 + n2}. Hence the total sample size is n = n1 + n2 and both L and U contain information on treatment A, surrogates S and covariates V, but the primary outcome is missing in the unlabeled dataset U due to, for example, loss to follow-up. Let X = (V, S) be the union of covariates and surrogates. Note that we will present the results for S = , but S = can be viewed as a special setting where our methodology still applies.

2.2. Estimand and Nuisance Functions

Now we define the continuous treatment effects of interest. We use the random variable Y a to denote the potential (counterfactual) outcome we would have observed had a subject received treatment A = a, which may be contrary to the observed Y . The continuous treatment effect (or dose-response function) is defined as

θ(a) = E[Y a]. (1)

Without additional causal assumptions introduced in Section 3, θ(a) involves counterfactual outcome Y a and cannot be identified by observed data. Note that since S contains posttreatment surrogate outcomes and may be affected by the

treatment, we also use potential outcomes Sa to denote the surrogate under treatment A = a.

We further introduce some nuisance functions that are not of primary interest, but which our estimation method depends on. Let µ(A, X, 1) = E[Y |A, X, R = 1] be the regression function of the primary outcome in the labeled population R = 1. Denote the function obtained by further regressing µ(A, X, 1) on (A, V) as τ(A, V) = E[µ(A, X, 1)|A, V] , where the expectation is over the conditional distribution of S given A, V. Denote the conditional density of A given V as π(a|V) and the marginal density of A as f(a) = R π(a|v)d P(v). The ratio of the marginal density and conditional density is denoted as w(a, v) = f(a)/π(a|v). w(a, v) is known as a stabilized weight in the literature (Robins et al., 2000; Ai et al., 2021). The propensity score for R (i.e. the conditional probability that the primary outcome is observed) is denoted as ρ(A, X) = P(R = 1|A, X).

Let (V, S, A) denote the support of (V, S, A). For a (possibly random) function f on variables Z we use Pn[f(Z)] or Pn[f] to denote the sample average 1

n Pn i=1 f(Zi) on a sample of size n. The sample over which averages are taken should be clear from context. We use P[f] = R f(z)d P(z) to denote the expectation of f(Z) where only randomness of Z is considered and f is conditioned on. Finally we use f = supz Z |f(z)| to denote the uniform norm,

f a = R f 2(z)d P(z|A = a) 1/2 to denote the L2-norm with respect to the conditional distribution Z|A = a and f 2 = R f 2(z)d P(z) 1/2 to denote the usual L2-norm.

3. Identification

In this section we discuss sufficient conditions to identify the dose-response function (1), summarized as follows:

Assumption 3.1. (Consistency) Y = Y a, S = Sa if A = a, a A.

Assumption 3.2. (Exchangeability) (Y a, Sa) A | V for a A.

Assumption 3.3. (Missing at random) R Y a | V, Sa, A = a for a A.

Assumption 3.4. (Positivity) π(a|V) > 0, ρ(a, X) > 0 almost surely for a A.

Assumption 3.1 is also known as the stable unit treatment value assumption (SUTVA), and requires an absence of interference between different individuals in the study. Assumption 3.2 is commonly used to identify treatment effects and holds in a randomized experiment or observational study with all confounders measured. Assumption 3.3 ensures whether the primary outcome is observed only depends on covariates V, surrogates S, and treatment A so that the distributions of labeled and unlabeled data are comparable after

Continuous Treatment Effects with Surrogate Outcomes

Figure 1. Example of a causal graph with surrogate outcome S.

conditioning on X, A. Assumption 3.4 says every subject has some chance to receive treatment A = a and has the primary outcome observed. An illustrative causal graph is shown in Figure 1. The readers are referred to Gill & Robins (2001) for detailed discussions on identifying continuous treatment effects and Kallus & Mao (2020) for the role of surrogates in identifying treatment effects. With all these assumptions, the treatment effects of interest can be identified using the observable distribution as summarized in Theorem 3.5. Theorem 3.5. Under Assumption 3.1 3.4 we have

θ(a) = E{E[E(Y |A = a, X, R = 1)|A = a, V]}

= E {E[µ(a, X, 1)|A = a, V]}

= E[τ(a, V)] (2)

for fixed a A, where the expectations are over Y, S, V in (2).

The identification formula (2) suggests the following plugin style estimator of θ(a): first regress Y on A, X in the labeled dataset L and obtain an estimator of µ as bµ, which is further regressed on A, V to get an estimator bτ. Finally we take an average over all the samples to get an estimator of θ(a) as bθ(a) = Pn[bτ(a, V)]. (3)

The performance of such a plug-in style estimator highly depends on the estimation error of bτ. To see this, assume the nuisance estimator bτ is independent of the samples that we average over, then the conditional bias of bθ(a) given bτ is

E h bθ(a) i θ(a) = E [bτ(a, V) τ(a, V)] ,

which solely depends on the estimation error of τ. When τ is hard to estimate (e.g., non-smooth/sparse in highdimensional problems), the plug-in style estimator may inherit the slow convergence rate of bτ and have sub-optimal performance.

4. Doubly Robust Estimation

In this section we present the novel method of this paper. We begin with an alternative characterization of the doseresponse function (1) in Section 4.1, based on which we propose a doubly robust estimator in Section 4.2.

4.1. Doubly Robust Characterization

Since treatment A is continuous, the function θ(a) in (2) is not pathwise differentiable (Bickel et al., 1993; D ıaz & van der Laan, 2013) and we need a novel way to apply semiparametric efficiency theory. The idea is to find a pseudo-outcome φ(Z) := φ(Z; µ, π, ρ) depending on nuisance functions such that E[φ(Z; µ, π, ρ)|A = a] = θ(a), i.e., regressing φ(Z; µ, π, ρ) on A yields the target doseresponse function ideally with second-order dependence on nuisance estimation error. Following Rubin & van der Laan (2005); Kennedy et al. (2017), we consider the functional ψ = E[θ(A)], which is pathwise differentiable and admits an efficient influence function. Then the pseudo-outcome φ(Z; µ, π, ρ) is a component in the influence function of ψ. We omit the derivation of the influence function and only present the form of pseudo-outcome. Let ( µ, π, ρ) be nuisance functions that may not necessarily equal the true (µ, π, ρ), and define

φ(Z; µ, π, ρ)

= R(Y µ(A, X, 1))

ρ(A, X) + µ(A, X, 1) τ(A, V)

V π(A|v)d P(v)

V τ(A, v)d P(v),

where τ(A, V) = E[ µ(A, X, 1)|A, V]. The following theorem shows an alternative characterization of the doseresponse function θ(a) through φ(Z; µ, π, ρ). Proposition 4.1. Let ( µ, π, ρ) be nuisance functions that may not necessarily equal the true (µ, π, ρ). Then

E[φ(Z; µ, π, ρ)|A = a] = θ(a)

if either µ = µ or ( π, ρ) = (π, ρ).

Proposition 4.1 gives the first interpretation of the double robustness of our methods. There are two chances to obtain the dose-response function θ(a) by regressing the pseudooutcome φ(Z; µ, π, ρ) on A: we correctly specify either the outcome regression model µ or both the propensity score of R and conditional density of A given V (although see Proposition 5.2 for a slightly different parameterization).

4.2. Estimation Procedure

The doubly robust characterization in Proposition 4.1 motivates a two-stage procedure to estimate θ(a): In the first step

Continuous Treatment Effects with Surrogate Outcomes

Algorithm 1 Doubly Robust Estimation Let (Dn 1 , Dn 2 , T n) denote three independent samples of n i.i.d observations of Z and Dn = (Dn 1 , Dn 2 ) denote the training set to train the nuisance functions.

1. Nuisance functions training: Construct estimates of µ, τ, ρ, w using Dn 1 . Then use Dn 2 to get an initial estimator of θ(a) as

i Dn 2 bτ(a, Vi).

2. Pseudo-outcome regression: Construct estimated pseudo-outcome

= R(Y bµ(A, X, 1))

bρ(A, X) + bµ(A, X, 1) bτ(A, V)

bw(A, V) + bθ0(A)

for each sample in T n and regress the pseudooutcomes on the treatment A in T n using a linear smoother to obtain

bθ(a) = b En [bφ(Z)|A = a] = X

i T n Wi(a; An)bφ(Zi),

where An = (A1, . . . , An) are the treatments in T n

and Wi(a; An) is the coefficient of i-th sample.

3. (Optional) Cross-fitting: Swap the role of Dn 1 , Dn 2 , T n

and repeat steps 1 and 2. Use the average of different estimates as the final estimator of θ(a).

we model the nuisance functions that appear in the pseudooutcome φ(Z) with flexible nonparametric or machine learning methods. We then construct estimated pseudo-outcomes bφ(Z) and regress these on A to obtain an estimator of θ(a). We will use the stability framework developed in Kennedy (2023) to analyze such an estimator, which regresses imputed outcomes bφ(Z) on treatment. The formal procedure is summarized in Algorithm 1.

Sample splitting is used in Algorithm 1 to avoid complicated empirical process assumptions that are difficult to justify in practice and simplify our theoretical analysis (Robins et al., 2008; Chernozhukov et al., 2018; Kennedy et al., 2020; Levis et al., 2023; Bonvini et al., 2023). In Step 1, one can use any appropriate regression/classification algorithms to estimate µ, τ, ρ. However, there are fewer results on estimating stabilized weight w(a, v) = f(a)/π(a|v). Ai et al. (2021) proposed a method that directly estimates w with entropy maximization. Alternatively, one can estimate

the conditional density π (see, e.g., Colangelo & Lee, 2020, for discussions on related methods), and then the marginal density can be estimated by

i Dn 2 bπ(a|Vi)

and use their ratio to estimate w.

Our analysis in Section 5 applies to a wide class of nuisance estimators if the product of convergence rates is fast enough. In step 2 we focus on linear smoother-based estimators since they are relatively straightforward to analyze. We believe similar results hold for a wider class of regression estimators under the stability framework in Kennedy (2023) and leave the theoretical analysis of applying general regression algorithms for future investigation. In applications, researchers can choose suitable parametric methods based on their domain knowledge or flexible nonparametric machine learning methods to avoid model misspecification (or ensembles thereof).

5. Theoretical Results

In this section, we provide theoretical guarantees of the proposed method. We first present estimation error guarantees of Algorithm 1, for general linear smoothers. Then we focus on a specific type of linear smoother, namely local linear regression, and study the asymptotic distribution of the estimator in Section 5.2.

5.1. Oracle Estimation Theory

In Algorithm 1 we obtain the estimator bθ(a) by regressing the imputed pseudo-outcome bφ(Z) on A. It is natural to compare bθ(a) with the oracle estimator eθ(a) defined as

i T n Wi(a; An)φ(Zi),

which regresses the ground-truth pseudo-outcome φ(Z) on A using the same linear smoother. Intuitively it is hard for bθ to have a faster convergence rate than eθ. In the following theorem, we summarize the conditions under which bθ enjoys the same rate as eθ and hence is oracle efficient . Theorem 5.1. Let bθ(a) denote the doubly robust estimator obtained from Algorithm 1 and eθ(a) = P i T n Wi(a; An)φ(Zi) denote the oracle estimator with oracle risk R2 n(a) = E[(eθ(a) θ(a))2] at point a. Suppose

Var(φ(Z)|A = a) c, a A for some constants c > 0.

The estimator for nuisance functions in Step 2 is uniformly consistent in the sense that supz |bφ(z) φ(z)| = o P(1).

Continuous Treatment Effects with Surrogate Outcomes

Then we have

bθ(a) θ(a) = eθ(a) θ(a)+b En h bb(A)|A = a i +o P (Rn(a))

where bb(a) = E[bφ(Z) φ(Z)|Dn, A = a] is the conditional bias of bφ(Z) and b En h bb(A)|A = a i = P

i T n Wi(a; An)bb(Ai). Further assume

b En h bb(A)|A = a i = o P(Rn(a)),

then bθ is oracle efficient in the sense that

bθ(a) eθ(a)

The conditions in Theorem 5.1 are mild: φ(Z) is not a function of A and hence its conditional variance given A should be positive; We do not impose conditions on the convergence rate of bφ but only require its consistency. The key condition for bθ to be oracle efficient is b En h bb(A)|A = a i =

o P(Rn(a)) and hence we need a bound on the conditional bias bb(a), as summarized in the following proposition.

Proposition 5.2. Under the conditions in Theorem 5.1, further assume the estimated conditional probability bρ(a, x) c and the estimated stabilized weight bw(a, v) C hold for all a A, x X, v V for some constant c, C > 0. Then we have

|bb(a)| bρ ρ a bµ µ a + bτ τ a bw w a

i Dn 2 bτ(a, Vi) P[bτ(a, V)],

where recall f 2 a = R f 2(z)d P(z|A = a). If we further assume the weights of the linear smoother satisfy P

i T n Wi(a; An) C and there exists a neighborhood N(a) around a such that Wi(a; An) = 0 if Ai / N(a). Then we have

b En h bb(A)|A = a i sup t N(a) |bb(t)|.

We note that the condition on linear smoothers will be satisfied by Nadaraya Watson estimators (Nadaraya, 1964; Watson, 1964) and local polynomial estimators (Fan et al., 1994; Fan & Gijbels, 1996) when the kernel function used has compact support, e.g., the uniform and Epanechnikov kernel. In the bound for bb(a), the first two terms depend on the product of the convergence rates of nuisance functions. This phenomenon is commonly observed in influence functionsbased doubly robust approaches (Kennedy et al., 2016; Chernozhukov et al., 2018; Meza & Singh, 2021) and gives the second interpretation of double robustness: Introducing an

extra term to the plug-in style estimator in φ can correct for the first-order bias, making the remainder second-order and doubly small . In common examples of ATE estimation, conditional bias only involves estimation error of outcome model and propensity scores. The additional dependence on the convergence rate of ˆτ (compared with Proposition 4.1) appears in the bias since the formula (2) is more complicated compared with ATE-style functional and we use an agnostic estimator of τ in Algorithm 1. In most settings we expect the nuisance error rate bµ µ a (with respect to measure d P(z|A = a)) has the same order as the more common conventional rate bµ µ (with respect to measure d P(z)), which is n α/(2α+d+1) if µ(a, x) belongs to a H older class of order α and X is d-dimensional. Alternatively one can always upper bound bµ µ a with bµ µ at the cost of log factors (Tsybakov, 2009, Section 1.6.2). The empirical process term 1

i Dn 2 bτ(a, Vi) P[bτ(a, V)] would be OP(1/ n) provided that E[bτ 2(a, V)|Dn] is bounded. The

bound on b En h bb(A)|A = a i in Proposition 5.2 involves

i Dn 2 bτ(t, Vi) P[bτ(t, V)]

as a coarse bound and for specific estimators, a tighter bound can be derived. For instance, as we will see in local linear estimation this empirical process term is o P 1/

nh and asymptotically negligible. Hence we can focus on the first two second-order terms in the conditional bias. Theorem 5.1 together with Proposition 5.2 gives conditions under which the estimator bθ(a) has the same rate as the oracle estimator eθ(a). For instance, assume V = X (no surrogates) and |X| = d, suppose the dose-response function θ(a) is αsmooth (i.e. belongs to a H older class of order α) and w and µ are s-smooth, then the rate condition for bθ(a) to be oracle efficient is

n 2s 2s+d+1 n α 2α+1

or equivalently

5.2. Asymptotic Normality

In the following discussions, we will analyze the estimator bθ(a) based on a particular linear smoother (i.e. the local linear regression estimator). For a scalar bandwidth parapmeter h, let gha(t) = [1, (t a)/h] be the local linear basis, Kha(t) = h 1K((t a)/h) with K being a probability density. The local linear regression solves the following weighted least square problem:

i T n Kha(Ai) bφ(Zi) gha(Ai) β 2

Continuous Treatment Effects with Surrogate Outcomes

which gives the closed-form solution:

bβh(a) = b D 1 ha Pn[gha(A)Kha(A)bφ(Z)],

where b Dha = Pn[gha(A)Kha(A)g ha(A)] and Pn is the sample average over T n. Then the local linear estimator of θ(a) is e 1 bβh(a), i.e. the first component of bβh(a). We summarize the asymptotic properties of this local linear estimator in Theorem 5.3, which is the key contribution of our paper.

Theorem 5.3. (Asymptotic normality of Local Linear Estimator) Let a A be an inner point of the compact support A of A. Assume

1. The bandwidth h satisfies h = hn 0 and nh as n .

2. The marginal density of the treatment f(a) is continuously differentiable, the conditional variance Var(φ(Z)|A = a) is continuous and the dose response θ(a) is twice continuously differentiable.

3. For any a A, x X, v V, the conditional variance Var(φ(Z)|A = a) > 0, the marginal density of treatment f(a) c, the estimated conditional probability bρ(a, x) c and the estimated stablized weight bw(a, v) C for some constant C, c > 0.

4. K is a continuous symmetric probability density with support [ 1, 1].

5. All nuisance functions are estimated consistently in ℓ norm and the estimated pseudo-outcome also satisfies

bφ φ = o P(1).

Furthermore, the convergence rates of nuisance estimation satisfy

sup |t a| h bρ ρ t bµ µ t = o P

sup |t a| h bτ τ t bw w t = o P

Then we have

nh bθ(a) θ(a) h2θ (a) R u2K(u)du 2

d N 0, σ2(a) R K2(u)du f(a)

σ2(a) = E Var(Y |A = a, X, R = 1)

+ Var(µ(a, X, 1)|A = a, V)] w2(a, V) A = a

Assumptions 1-4 in Theorem 5.3 are standard in the nonparametric kernel regression literature. Assumption 5 guarantees that the contribution of nuisance estimation error is asymptotically negligible compared with the smoothing error. One can also use a symmetric kernel K supported on R that is square-integrable and has a finite second-order moment. Then the rate condition would be supt A bρ ρ t bµ µ t = o P 1

. Similar to our discussions in Section 5.1, the rate conditions in assumption 5 are imposed on the product of nuisance estimation error since we use a doubly robust estimator and the conditional bias is second-order small . The theoretically optimal bandwidth to estimate a twice continuously differentiable function is h n 1/5 and yields a root mean square error of order n 2/5. With such a choice of h the requirement on the convergence rate becomes supt A bρ ρ t bµ µ t = o P n 2/5 , which, for example, can be satisfied when ρ is consistent and µ is estimated with rate OP(n 2/5). In applications, one can select the bandwidth using leave-one-out cross-validation (H ardle et al., 1988) due to its computational ease. Specifically, after we obtain the estimated pseudo-outcome, we treat them as known and select h by

bhopt = argmin h H

( bφ(Zi) bθh(Ai)

where c Wh(Ai) = e 1 b D 1 h Aie1h 1K(0). In the setting of Algorithm 1 we can select the bandwidth on Dn 2 to avoid overfitting on T n.

Similar to most nonparametric inference methods, Theorem 5.3 shows the estimator bθ is centered around θ(a) = θ(a) h2θ (a) R u2K(u)du/2 instead of θ(a) under optimal smoothing, which is known as the bias problem in the literature (Wasserman, 2006, Section 5.7). There are several methods to overcome the bias problem and each of them has its own consideration and trade-offs. For instance, one can estimate the second-order derivative and debias the estimator (Calonico et al., 2018; Takatsu & Westling, 2022) but this requires extra smoothness conditions. Another method is to undersmooth (Fan et al., 2022) and make the bias decrease asymptotically relative to the variance. Unfortunately, there does not seem to be a simple, practical rule for choosing just the right amount of undersmoothing. In this paper we choose to live with the bias and report uncertainty quantification for θ. Theoretically, the bias shrinks to 0 as n and the proposed estimator ˆθ(a) is still consistent for θ(a). To construct confidence intervals for θ(a) one needs to estimate the variance. Define a localized functional θh(a) = e 1 D 1 ha E[gha(A)Kha(A)θ(A)] (which can be viewed as population version of local linear estimator

Continuous Treatment Effects with Surrogate Outcomes

e 1 bβh(a)) with efficient influence function

= e 1 D 1 ha gha(A)Kha(A)

φ(Z) g ha(A)D 1 ha E[gha(A)Kha(A)θ(A)]

+ e 1 D 1 ha

Z gha(t)Kha(t)τ(t, V)f(t)dt θh(a).

Following Kennedy et al. (2017); Takatsu & Westling (2022), one can show the variance of bθ(a) can be approxi-

bϕha(Z) 2 for

= e 1 b D 1 ha gha(A)Kha(A) bφ(Z) g ha(A)bβh(a)

+ e 1 b D 1 ha

Z gha(t)Kha(t)bτ(t, V)d Pn(t) bθ(a).

Finally, we compare the asymptotic variance in Theorem 5.3 with the asymptotic variance in Kennedy et al. (2017), where the unlabeled dataset U and surrogates S are unavailable. Consider the MCAR setting where R is independent of all other variables so that ρ(a, x) = ρ (0, 1), and for simplicity assume n1 = nρ to show how the surrogates and unlabeled data help to reduce the variance in this special setting. Note that since R is independent of all other variables, we have Var(Y |A = a, X, R = 1) = Var(Y |A = a, X) and µ(a, X, 1) = E[Y |A = a, X, R = 1] = E[Y |A = a, X] = µ(a, X). The asymptotic variance of bθ(a) in our setting (i.e. in Theorem 5.3) is reduced to

1 nh E Var(Y |A = a, X)

+ Var(µ(a, X)|A = a, V)] w2(a, V) A = a (5)

under the MCAR assumption (note the additional factor R K2(u)du/f(a) is omitted since it appears in both settings). In the setting where the unlabeled data is unavailable, the asymptotic variance is shown to be

1 n1h E Var(Y |A = a, V)w2(a, V)|A = a (6)

in Kennedy et al. (2017). By the property of conditional variance

Var(Y |A = a, V) = E[Var(Y |A = a, X)|A = a, V]

+ Var(E[Y |A = a, X]|A = a, V),

(6) can be re-written as

1 n1h E {(Var(Y |A = a, X)

+ Var(µ(a, X)|A = a, V))w2(a, V)|A = a . (7)

Comparing (5) with (7) we see the first term is the same since nρ = n1. However the second term in (7) is improved by a factor of ρ in (5). This shows how the variance of the estimator is smaller after introducing unlabeled data and surrogate outcomes. The amount of improvement depends on the missing rate 1 ρ and Var(µ(a, X)|A = a, V), which measures the variation of µ(A, V, S) that cannot be explained by (A, V).

6. Simulation Study

In this section we use simulations to evaluate the performance of the proposed methods. We will illustrate the advantage of doubly robust estimation over naive plug-in style estimators. Consider the following data-generating process: The covariates V have a multi-variate Gaussian distribution

V = (V1, V2, V3, V4) N(0, I4).

Conditioning on V, the continuous treatment A has normal distribution N(λ(V), 1) with

λ(V) = 1 + 0.2V1 + 0.2V2 0.2V3 + 0.3V4.

The surrogates S has a normal distribution

S = (S1, S2) N(0, I2).

The indicator of whether the outcome is observed or not R is Bernoulli(0.5) (so we assume a missing completely at random mechanism and ρ = 0.5). Finally the outcome Y conditioning on A, X, R = 1 has a normal distribution N(µ(A, X, 1), 1) with

µ(A, X, 1) = 1 + (0.1, 0.1) S + (0.2, 0.2, 0.3, 0.1) V

+ A(1 0.1V1 + 0.1V3) A2.

By direct calculations we have

τ(A, V) = 1 + (0.2, 0.2, 0.3, 0.1) V

+ A(1 0.1V1 + 0.3V3) A2,

The dose-response function of interest is

θ(A) = 1 + A A2.

To illustrate the performance of two estimators with different nuisance estimation errors we will manually set the estimation error, which is applicable for simulation purposes. For a fixed α we let ϵ1, . . . , ϵ4 N(n α, n 2α) and set bλ(V) = λ(V)+ϵ1, the estimated conditional density of A is

N bλ(V), 1 , bµ(A, X, 1) = µ(A, X, 1) + ϵ2, bτ(A, V) =

τ(A, V)+ϵ3, logit (bρ) = logit(ρ)+ϵ4. Such estimates guarantee the nuisance estimation error is of order n α. After

Continuous Treatment Effects with Surrogate Outcomes

0.10 0.15 0.20 0.25 0.30 0.35 0.40 Alpha

Plug-in DR-learner

(a) n = 500

0.10 0.15 0.20 0.25 0.30 0.35 0.40 Alpha

Plug-in DR-learner

(b) n = 2000

Figure 2. Root mean square error Versus α, where n α is the estimation error of the nuisance functions.

we generate a sample of size n, the plug-in style estimator is defined as bθ(a) = 1

n Pn i=1 bτ(a, Vi). To implement the doubly robust estimator we split the sample into two parts D, T (since the nuisance estimators are given there is no need to split the sample into three parts as in Algorithm 1). We use D to construct estimator of the initial estimator bθ0(a) = 1 |D| P

i D bτ(a, Vi), marginal density bf(a) =

i D bπ(a|Vi) and select the optimal bandwidth h (i.e. construct pseudo-outcomes and run LOOCV as in equation (4) on D). We then construct pseudo-outcomes on T and perform local linear estimation using the optimal bandwidth h . Finally, the roles of D and T are exchanged to obtain another estimator and we average the two estimates as the final doubly robust estimator. For sample size n {500, 2000} and convergence rate α {0.1, 0.13, . . . , 0.4}, we repeat the data generation and estimation process M = 500 times. We will aim at estimating θ(1) and compare the

RMSE= 1 M PM m=1 h bθm(1) θ(1) i2 1/2 of plug-in es-

timator and doubly robust estimator, where bθm(1) is the estimate from m-th repetition. The results are summarized in Figure 2.

As we see in Figure 2, if the nuisance estimation error is large (α is small), the doubly robust estimator outperforms the naive plug-in estimator. This can be explained by the second-order bias term of the doubly robust estimator in Proposition 5.2, i.e., the conditional bias is the product of nuisance estimation errors and is doubly small . On the other hand, the plug-in style estimator inherits the slow convergence rate of bτ. As α increases, the estimators of nuisance functions are more accurate, and plug-in style estimators finally outperform the doubly robust estimator because the doubly robust estimator may suffer from accumulating error in constructing pseudo-outcomes, bandwidth selection, and local linear regression, which dominates the conditional

bias when nuisance estimation is accurate enough. In real applications, there are typically many covariates and parametric models for nuisance functions may not be correct. The convergence rate of nonparametric nuisance estimation can be slow when the dimension of covariates is large so the doubly robust estimator with a smaller bias is recommended for use. Additional simulation results and a real data example are provided in the supplementary materials.

7. Discussion

In this work, we study the estimation of continuous treatment effects when there is limited access to the primary outcome of interest but auxiliary information on surrogate outcomes is available. We propose a doubly robust estimator that is less sensitive to nuisance estimation error and hence incorporates flexible nonparametric machine learning methods. Although nonparametric machine learning methods usually suffer from slow convergence rates, they are widely used in nuisance function estimation, especially when practitioners do not have sufficient domain knowledge to justify parametric models. Our doubly robust estimator facilitates the application of nonparametric methods and enjoys optimal estimation rates under mild conditions. Asymptotic normality is further established, which enables researchers to construct confidence intervals and perform statistical inference. We also show how incorporating information on surrogate outcomes improves the variance, compared with methods solely based on labeled data, as in Kennedy et al. (2017). In summary, our methodology provides a robust and efficient approach to leverage surrogate outcomes in continuous treatment effect estimation. However, our method could not deal with the case where only covariate is available in the unlabeled dataset and continuous treatment is also missing (corresponds to the generalizability problem as in Dahabreh et al. (2019)). It is also interesting to extend our

Continuous Treatment Effects with Surrogate Outcomes

results to studies with multiple outcomes (Kennedy et al., 2019; Du et al., 2024). We leave these problems for future investigation.

Impact Statement

Our research introduces an efficient method for incorporating surrogate outcomes into the estimation of continuous treatment effects, especially useful in contexts where measuring the primary outcome for all subjects is costly or impractical. This approach is particularly helpful in fields like medicine, where it can facilitate the assessment of new treatments effects on mortality rates, and in business, for evaluating long-term strategies. By appropriately taking advantage of surrogate outcomes and modeling the missing mechanism of the primary outcome, our method enables more feasible and data-driven decision-making, which provides important implications in difference areas and enhance both the effectiveness and efficiency of research and strategic planning.

Ai, C., Linton, O., Motegi, K., and Zhang, Z. A unified framework for efficient estimation of general treatment models. Quantitative Economics, 12(3):779 816, 2021.

Bickel, P. J., Klaassen, C. A., Bickel, P. J., Ritov, Y., Klaassen, J., Wellner, J. A., and Ritov, Y. Efficient and adaptive estimation for semiparametric models, volume 4. Springer, 1993.

Bonvini, M. and Kennedy, E. H. Fast convergence rates for dose-response estimation. ar Xiv preprint ar Xiv:2207.11825, 2022.

Bonvini, M., Zeng, Z., Yu, M., Kennedy, E. H., and Keele, L. Flexibly estimating and interpreting heterogeneous treatment effects of laparoscopic surgery for cholecystitis patients. ar Xiv preprint ar Xiv:2311.04359, 2023.

Calonico, S., Cattaneo, M. D., and Farrell, M. H. On the effect of bias estimation on coverage accuracy in nonparametric inference. Journal of the American Statistical Association, 113(522):767 779, 2018.

Chakrabortty, A., Dai, G., and Tchetgen, E. T. A general framework for treatment effect estimation in semisupervised and high dimensional settings. ar Xiv preprint ar Xiv:2201.00468, 2022.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1 C68, 2018.

Colangelo, K. and Lee, Y.-Y. Double debiased machine learning nonparametric inference with continuous treatments. ar Xiv preprint ar Xiv:2004.03036, 2020.

Dahabreh, I. J., Robertson, S. E., Tchetgen, E. J., Stuart, E. A., and Hern an, M. A. Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics, 75(2):685 694, 2019.

D ıaz, I. and van der Laan, M. J. Targeted data adaptive estimation of the causal dose response curve. Journal of Causal Inference, 1(2):171 192, 2013.

Du, J.-H., Zeng, Z., Kennedy, E. H., Wasserman, L., and Roeder, K. Causal inference for genomic data with multiple heterogeneous outcomes. ar Xiv preprint ar Xiv:2404.09119, 2024.

Fan, J. and Gijbels, I. Local polynomial modelling and its applications: monographs on statistics and applied probability 66, volume 66. CRC Press, 1996.

Fan, J., Hu, T.-C., and Truong, Y. K. Robust non-parametric function estimation. Scandinavian journal of statistics, pp. 433 446, 1994.

Fan, Q., Hsu, Y.-C., Lieli, R. P., and Zhang, Y. Estimation of conditional average treatment effects with highdimensional data. Journal of Business & Economic Statistics, 40(1):313 327, 2022.

Flores, C. A. and Flores-Lagunes, A. Identification and estimation of causal mechanisms and net effects of a treatment under unconfoundedness, 2009. URL http://hdl.handle.net/10419/35701. urn:nbn:de:101:1-20090622213.

Fr olich, M. and Huber, M. Direct and indirect treatment effects causal chains and mediation analysis with instrumental variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(5):1645 1666, 2017.

Galvao, A. F. and Wang, L. Uniformly semiparametric efficient estimation of treatment effects with a continuous treatment. Journal of the American Statistical Association, 110(512):1528 1542, 2015.

Gill, R. D. and Robins, J. M. Causal inference for complex longitudinal data: the continuous case. Annals of Statistics, pp. 1785 1811, 2001.

H ardle, W., Hall, P., and Marron, J. S. How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association, 83(401):86 95, 1988.

Hern an, M. A. and Robins, J. M. Causal inference, 2010.

Continuous Treatment Effects with Surrogate Outcomes

Hogan, J. W., Roy, J., and Korkontzelou, C. Handling dropout in longitudinal studies. Statistics in medicine, 23(9): 1455 1497, 2004.

Hou, J., Mukherjee, R., and Cai, T. Efficient and robust semi-supervised estimation of ate with partially annotated treatment and response. ar Xiv preprint ar Xiv:2110.12336, 2021.

Huber, M. Identifying causal mechanisms (primarily) based on inverse probability weighting. Journal of Applied Econometrics, 29(6):920 943, 2014.

Huber, M. Replication data for Direct and indirect effects of continuous treatments based on generalized propensity score weighting , 2020. URL https://doi.org/ 10.7910/DVN/AJ7Q9B.

Huber, M., Hsu, Y.-C., Lee, Y.-Y., and Lettry, L. Direct and indirect effects of continuous treatments based on generalized propensity score weighting. Journal of Applied Econometrics, 35(7):814 840, 2020.

Kallus, N. and Mao, X. On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. ar Xiv preprint ar Xiv:2003.12408, 2020.

Kennedy, E. H. Semiparametric doubly robust targeted double machine learning: a review. ar Xiv preprint ar Xiv:2203.06469, 2022.

Kennedy, E. H. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008 3049, 2023.

Kennedy, E. H., Ma, Z., Mc Hugh, M. D., and Small, D. S. Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1229 1245, 2016. doi: 10.1111/ rssb.12212. URL https://rss.onlinelibrary. wiley.com/doi/abs/10.1111/rssb.12212.

Kennedy, E. H., Ma, Z., Mc Hugh, M. D., and Small, D. S. Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 79 (4):1229 1245, 2017.

Kennedy, E. H., Kangovi, S., and Mitra, N. Estimating scaled treatment effects with multiple outcomes. Statistical methods in medical research, 28(4):1094 1104, 2019.

Kennedy, E. H., Balakrishnan, S., and G Sell, M. Sharp instruments for classifying compliers and generalizing causal effects. The Annals of Statistics, 48(4):2008 2030, 2020.

Laan, M. J. and Robins, J. M. Unified methods for censored longitudinal data and causality. Springer, 2003.

Levis, A. W., Bonvini, M., Zeng, Z., Keele, L., and Kennedy, E. H. Covariate-assisted bounds on causal effects with instrumental variables. ar Xiv preprint ar Xiv:2301.12106, 2023.

Meza, I. and Singh, R. Nested nonparametric instrumental variable regression: Long term, mediated, and time varying treatment effects. ar Xiv e-prints, pp. ar Xiv 2112, 2021.

Nadaraya, E. A. On estimating regression. Theory of Probability & Its Applications, 9(1):141 142, 1964.

Newey, W. K. Kernel estimation of partial means and a general variance estimator. Econometric Theory, 10(2): 1 21, 1994.

Robins, J., Li, L., Tchetgen, E., van der Vaart, A., et al. Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and statistics: essays in honor of David A. Freedman, volume 2, pp. 335 422. Institute of Mathematical Statistics, 2008.

Robins, J. M., Hernan, M. A., and Brumback, B. Marginal structural models and causal inference in epidemiology. Epidemiology, pp. 550 560, 2000.

Rubin, D. and van der Laan, M. J. A general imputation methodology for nonparametric regression with censored data. UC Berkeley Division of Biostatistics Working Paper Series, 2005.

Rubin, D. B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688 701, 1974.

Schochet, P. Z., Burghardt, J., and Glazerman, S. National job corps study: The impacts of job corps on participants employment and related outcomes [and] methodological appendixes on the impact analysis. 2001.

Schochet, P. Z., Burghardt, J., and Mc Connell, S. Does job corps work? impact findings from the national job corps study. American economic review, 98(5):1864 1886, 2008.

Shankar, S., Sinha, R., Mitra, S., Sinha, M., and Fiterau, M. Direct inference of effect of treatment (diet) for a cookieless world. In International Conference on Artificial Intelligence and Statistics, pp. 1869 1887. PMLR, 2023.

Singh, R. Generalized kernel ridge regression for long term causal inference: Treatment effects, dose responses, and counterfactual distributions. ar Xiv preprint ar Xiv:2201.05139, 2022.

Continuous Treatment Effects with Surrogate Outcomes

Splawa-Neyman, J., Dabrowska, D. M., and Speed, T. P. On the application of probability theory to agricultural experiments. essay on principles. section 9. Statistical Science, pp. 465 472, 1990.

Takatsu, K. and Westling, T. Debiased inference for a covariate-adjusted regression function. ar Xiv preprint ar Xiv:2210.06448, 2022.

Tsiatis, A. A. Semiparametric theory and missing data, volume 4. Springer, 2006.

Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer, 1st edition, 2009.

Van der Laan, M. J., Polley, E. C., and Hubbard, A. E. Super learner. Statistical applications in genetics and molecular biology, 6(1), 2007.

Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge university press, 2000.

Wasserman, L. All of nonparametric statistics. Springer Science & Business Media, 2006.

Watson, G. S. Smooth regression analysis. Sankhy a: The Indian Journal of Statistics, Series A, pp. 359 372, 1964.

Zeng, Z., Kennedy, E. H., Bodnar, L. M., and Naimi, A. I. Efficient generalization and transportation. ar Xiv preprint ar Xiv:2302.00092, 2023.

Zhang, Y. and Bradic, J. High-dimensional semi-supervised learning: in search of optimal inference of the mean. Biometrika, 109(2):387 403, 2022.

Zhang, Y., Chakrabortty, A., and Bradic, J. Semi-supervised causal inference: Generalizable and double robust inference for average treatment effects under selection bias with decaying overlap. ar Xiv preprint ar Xiv:2305.12789, 2023.

Continuous Treatment Effects with Surrogate Outcomes

A. Background on Efficiency Theory

As discussed in Section 3, the plug-in estimator suffers from first-order error and is sensitive to the estimation accuracy of nuisance functions. To address this problem, one can derive the efficient influence function of the target functional (E[θ(A)] in our problem), based on which a one-step estimator can be obtained to reduce the bias. The efficient influence function is critical in non-parametric efficiency theory (Bickel et al., 1993; Tsiatis, 2006; Van der Vaart, 2000; Laan & Robins, 2003; Kennedy, 2022). Mathematically, the influence function is the derivative of the target statistical functional in a Von Mises expansion (i.e., distributional Taylor expansion). In the discrete case, it coincides with the Gateaux derivative of the functional when the contamination distribution is a point mass. Influence functions are important in different respects. First, the variance of the influence function is equal to the efficiency bound of the target statistical functional, which characterizes the inherent estimation difficulty of the target functional and provides a benchmark to compare against when we construct estimators. Moreover, it allows us to correct for first-order bias in the plug-in estimator and obtain doubly robust-style estimators, which enjoy appealing statistical properties even if non-parametric methods with relatively slow rates are used in nuisance estimation.

Suppose the statistical functional of interest ψ = ψ(P) admits the first-order Von Mises expansion. Mathematically, we have

ψ(ˆP) ψ(P) = Z ϕ1(z, ˆP)d P(z) + R2(ˆP, P), (8)

where ϕ1(z, P) is the influence function of ψ(P), ψ(ˆP) is the plug-in estimator and R2(ˆP, P) is the second-order reminder. Under regularity conditions, Von Mises expansion implies the pathwise differentiability

ϵψ (Pϵ) ϵ=0 = Z ϕ1(z; P)sϵ(z)d P(z)

where sϵ(z) = ϵ log pϵ(z) ϵ=0 is the submodel score function. (8) suggests that the plug-in estimator has first-order bias R ϕ1(z, ˆP)d P(z). Equivalently, we can write

ψ(P) = ψ(ˆP) + Z ϕ1(z, ˆP)d P(z) + R2(ˆP, P),

which motivates us to correct for the first-order bias and arrive at the doubly robust estimator

ˆψdr = ψ(ˆP) + Pn{ϕ1(Z, ˆP)}.

Under further empirical process assumptions or sample splitting assumptions, we can show the dominating term in the conditional bias of ˆψdr is R2(ˆP, P), which is usually a second-order error term and depends on the product of convergence rates of nuisance functions. We refer the readers to Kennedy et al. (2016); Kennedy (2022) for a more complete review. In our problem, the estimand θ(a) in (2) is not pathwise differentiable and the ideas above do not apply directly. However, the pseudo-outcome φ(Z) is obtained by deriving the influence function of E[θ(A)]. The readers are referred to Section 3.1 of Kennedy et al. (2017) for more discussion on the connection between the pseudo-outcome of θ(a) and the influence function of E[θ(A)].

B.1. Proof of Theorem 3.5

= E[E(Y a|V, Sa]

= E[E(Y a|A = a, V, Sa)]

= E[E(Y a|A = a, V, Sa, R = 1)]

= E{E[E(Y a|A = a, V, Sa, R = 1)|V]}

= E{E[E(Y a|A = a, V, Sa, R = 1)|A = a, V]} = E{E[E(Y |A = a, V, S, R = 1)|A = a, V]},

Continuous Treatment Effects with Surrogate Outcomes

where the first and fourth equations follow from the property of conditional expectation. The second equation holds since Assumption 3.2 implies Y a A|V, Sa. The third equation follows from Assumption 3.3. The fifth equation follows from Assumption 3.2 (specifically Sa A|V). The last equation follows from Assumption 3.1. Note that positivity is implicitly assumed to guarantee the conditional expectations are well-defined.

B.2. Proof of Proposition 4.1

Proof. If µ = µ we have

φ(Z; µ, π, ρ) = R(Y µ(A, X, 1))

ρ(A, X) + µ(A, X, 1) E[µ(A, X, 1)|A, V] R

V π(A|v)d P(v)

E R(Y µ(A, X, 1))

A = a, X, R = R ρ(a, X)E[Y µ(a, X, 1)|A = a, X, R = 1] = 0,

E[µ(A, X, 1) E[µ(A, X, 1)|A, V]|A = a, V] = 0.

These two equations imply

E R(Y µ(A, X, 1))

ρ(A, X) + µ(A, X, 1) E[µ(A, X, 1)|A, V] R

V π(A|v)d P(v)

and hence E[φ(Z; µ, π, ρ)|A = a] = θ(a).

If ( π, ρ) = (π, ρ),

φ(Z; µ, π, ρ) = R(Y µ(A, X, 1))

ρ(A, X) + µ(A, X, 1) E[ µ(A, X, 1)|A, V] R

V π(A|v)d P(v)

V E[ µ(A, X, 1)|A, V = v]d P(v).

E R(Y µ(A, X, 1))

A = a, X, R = R(µ(a, X, 1) µ(a, X, 1))

E R(Y µ(A, X, 1))

A = a, X = E R(µ(a, X, 1) µ(a, X, 1))

A = a, X = µ(a, X, 1) µ(a, X, 1).

By the property of conditional expectation, we have

E R(Y µ(A, X, 1))

ρ(A, X) + µ(A, X, 1) E[ µ(A, X, 1)|A, V] R

V π(A|v)d P(v)

= E µ(a, X, 1) E[ µ(a, X, 1)|A = a, V] A = a, V R

V π(a|v)d P(v)

= E[µ(a, X, 1) µ(a, X, 1) A = a, V]

V π(a|v)d P(v)

Due to the following equation on the measure:

d P(v|a) = π(a|v)d P(v) R

V π(a|v)d P(v), (9)

we can write

E R(Y µ(A, X, 1))

ρ(A, X) + µ(A, X, 1) E[ µ(A, X, 1)|A, V] R

V π(A|v)d P(v)

V E[µ(a, X, 1) µ(a, X, 1) A = a, V = v]

V π(a|v)d P(v)

π(a|v) d P(v|a)

V E[µ(a, X, 1) µ(a, X, 1) A = a, V = v]d P(v).

Continuous Treatment Effects with Surrogate Outcomes

Finally we get

E[φ(Z; µ, π, ρ)|A = a]

V E[µ(a, X, 1) µ(a, X, 1) A = a, V = v]d P(v) + Z

V E[ µ(a, X, 1)|A = a, V = v]d P(v)

V E[µ(a, X, 1)|A = a, V = v]d P(v)

B.3. Proof of Theorem 5.1

Proof. By Theorem 1 in Kennedy (2023), the linear smoother is stable if the variance Var(φ(Z)|A = a) is bounded away from 0, i.e. bθ(a) eθ(a) b En[bb(A)|A = a] = o P(Rn(a))

if supz |bφ(z) φ(z)| = o P(1).

B.4. Proof of Proposition 5.2

Proof. For abbreviations, we will omit conditioning on Dn in our notation but all the expectations in this part are conditioning on Dn (recall such expectation is denoted using P). Note that

E[φ(Z)|A = a] = θ(a)

by Proposition 4.1. By the property of conditional expectation

P R(Y bµ(a, X, 1))

bρ(a, X) + bµ(a, X, 1) bτ(a, V) bw(a, V) A = a

= P R(µ(a, X, 1) bµ(a, X, 1))

bρ(a, X) + bµ(a, X, 1) bτ(a, V) bw(a, V) A = a

= P ρ(a, X)(µ(a, X, 1) bµ(a, X, 1))

bρ(a, X) + bµ(a, X, 1) bτ(a, V) bw(a, V) A = a

= P 1 ρ(a, X)

(bµ(a, X, 1) µ(a, X, 1)) + µ(a, X, 1) bτ(a, V) bw(a, V) A = a

= P 1 ρ(a, X)

(bµ(a, X, 1) µ(a, X, 1)) bw(a, V) A = a + P[(τ(a, V) bτ(a, V)) bw(a, V)|A = a]

where the first equation follows by conditioning on X, R, A = a, the second equation follows from conditioning on X, A = a and the last equation follows from conditioning on V, A = a. Further note that θ(a) = E[τ(a, V)] and

P h bθ0(a) i θ(a) = 1

i Dn 2 bτ(a, Vi) P[bτ(a, V)] + P[bτ(a, V) τ(a, V)]. (11)

By equation 9

P[bτ(a, V) τ(a, V)] = Z

V (bτ(a, v) τ(a, v))d P(v) = Z

V (bτ(a, v) τ(a, v))w(a, v)d P(v|A = a).

Add equation 10 and equation 11 together we have

bb(a) = P 1 ρ(a, X)

(bµ(a, X, 1) µ(a, X, 1)) bw(a, V) A = a

+ P[(τ(a, V) bτ(a, V))( bw(a, V) w(a, V))|A = a] + 1

i Dn 2 bτ(a, Vi) P[bτ(a, V)].

Continuous Treatment Effects with Surrogate Outcomes

The bound on bb then follows from Cauchy-Schwarz s inequality.

For linear smoother b En, we have

b En h bb(A)|A = a i =

i T n Wi(a; An)bb(Ai)

sup t N(a) |bb(t)| X

i T n |Wi(a; An)| C sup t N(a) |bb(t)|

B.5. Proof of Theorem 5.3

Proof. We will prove the results with the following decomposition. Let eθ(a) = e 1 b D 1 ha Pn[gha(A)Kha(A)φ(Z)] be the oracle estimator. We write

bθ(a) θ(a) = eθ(a) θ(a) + e 1 b D 1 ha Pn[gha(A)Kha(A)(bφ(Z) φ(Z))]

=: eθ(a) θ(a) + R1 + R2

where R1 = e 1 b D 1 ha (Pn P)[gha(A)Kha(A)(bφ(Z) φ(Z))] and R2 = e 1 b D 1 ha P[gha(A)Kha(A)(bφ(Z) φ(Z))].

Step 1: The CLT term. The existing results on local linear estimator (Fan et al., 1994; Fan & Gijbels, 1996) imply

nh eθ(a) θ(a) h2θ (a) R u2K(u)du 2

d N 0, σ2(a) R K2(u)du f(a)

under the conditions stated in the theorem. The conditional variance can be computed as follows:

σ2(a) = Var(φ(Z)|A = a)

= E[(φ(Z) θ(a))2|A = a]

( R(Y µ(a, X, 1))

ρ(a, X) + µ(a, X, 1) τ(a, V) 2 w2(a, V) A = a

= E R(Y µ(a, X, 1))2

ρ2(a, X) + (µ(a, X, 1) τ(a, V))2 w2(a, V) A = a ,

where the last equation follows since by conditioning on A = a, X, R one can show

E R(Y µ(a, X, 1))(µ(a, X, 1) τ(a, V))

w2(a, V) A = a = 0.

For the first term in σ2(a)

E R(Y µ(a, X, 1))2w2(a, V)

= E R Var(Y |A = a, X, R = 1)w2(a, V)

= E Var(Y |A = a, X, R = 1)w2(a, V)

where the first equation follows from conditioning on A = a, X, R and the second equation follows from conditioning on A = a, X. For the second term in σ2(a) we have

E (µ(a, X, 1) τ(a, V))2w2(a, V)|A = a = E Var(µ(a, X, 1)|A = a, V)w2(a, V)|A = a .

σ2(a) = E Var(Y |A = a, X, R = 1)

ρ(a, X) + Var(µ(a, X, 1)|A = a, V) w2(a, V) A = a

Continuous Treatment Effects with Surrogate Outcomes

Step 2: Bounding R1. Then we proceed to analyze R1 = e 1 b D 1 ha (Pn P)[gha(A)Kha(A)(bφ(Z) φ(Z))]. We first show e 1 b D 1 ha = OP(1). Recall b Dha = Pn[gha(A)Kha(A)g ha(A)] R2 2. Note that Pn[Kha(A)] is the kernel density estimator of f(a) and standard results in the literature show

E[(Pn[Kha(A)] f(a))2] = O h2 + 1

which implies b Dha,11 = Pn[Kha(A)] P f(a). For the element of b Dha not on the diagonal

Kha(A) = Z t a

h 1 h K t a

f(t)dt = Z u K(u)f(a + hu)du.

So we have E A a

Kha(A) Z u K(u)f(a)du Z |u|K(u)|f(a + hu) f(a)| h Z u2K(u)du 0.

Since K is symmetric around 0 we have R u K(u)du = 0 and hence

Kha(A) = O(h).

Further notice

Z u2K2(u)f(a + hu)du

Z u2K2(u)du = O 1

Hence we have

which implies b Dha,12 = Pn A a

h Kha(A) P 0. Finally

f(t)dt = Z u2K(u)f(a + hu)du.

So we have E

Z u2K(u)f(a)du

Z u2K(u)|f(a + hu) f(a)| h Z |u|3K(u)du 0.

One can similarly show

Continuous Treatment Effects with Surrogate Outcomes

Z u2K(u)f(a)du

which implies b Dha,22 = Pn h A a

h 2 Kha(A) i P f(a) R u2K(u)du. Hence we have

b D 1 ha P diag

f(a) 1, f(a) 1 Z u2K(u)du 1)

This implies e 1 b D 1 ha = OP(1). Then we consider (Pn P)[gha(A)Kha(A)(bφ(Z) φ(Z))]. By Lemma 2 in Kennedy et al. (2020) we have for j = 1, 2

(Pn P)[gha,j(A)Kha(A)(bφ(Z) φ(Z))] = OP

gha,j(A)Kha(A)(bφ(Z) φ(Z)) 2 n

Note that gha,j(A)Kha(A)(bφ(Z) φ(Z)) 2 2 = P g2 ha,j(A)K2 ha(A)(bφ(Z) φ(Z))2

Z u2(j 1)K2(u)f(a + hu)du

This together with bφ φ = o P(1) implies

(Pn P)[gha,j(A)Kha(A)(bφ(Z) φ(Z))] = OP

We conclude that R1 = o P 1

Step 3: Bounding R2. The last step is to bound R2 = e 1 b D 1 ha P[gha(A)Kha(A)(bφ(Z) φ(Z))]. Since e 1 b D 1 ha = OP(1) we only need to consider

P[gha(A)Kha(A)(bφ(Z) φ(Z))] = Z gha(t)Kha(t)P[bφ(Z) φ(Z)|A = t]f(t)dt

In Proposition 5.2 we show bb(a) is equal to

P[bφ(Z) φ(Z)|A = t] = P 1 ρ(t, X)

(bµ(t, X, 1) µ(t, X, 1)) bw(t, V) A = t

+ P[(τ(t, V) bτ(t, V))( bw(t, V) w(t, V))|A = t] + 1

i Dn 2 bτ(t, Vi) P[bτ(t, V)]

Plug into the equation above we have Z gha,j(t)Kha(t)P[bφ(Z) φ(Z)|A = t]f(t)dt

= Z gha,j(t)Kha(t)P 1 ρ(t, X)

(bµ(t, X, 1) µ(t, X, 1)) bw(t, V) A = t f(t)dt

+ Z gha,j(t)Kha(t)P[(τ(t, V) bτ(t, V))( bw(t, V) w(t, V))|A = t]f(t)dt

+ (Pn P) Z gha,j(t)Kha(t)bτ(t, V)f(t)dt

Continuous Treatment Effects with Surrogate Outcomes

Here we slightly abuse the notation and denote Pn as the average over Dn 2 . By boundedness of nuisance estimates and Cauchy-Schwarz s inequality, we have

Z gha,j(t)Kha(t)P 1 ρ(t, X)

(bµ(t, X, 1) µ(t, X, 1)) bw(t, V) A = t f(t)dt

Z |gha,j(t)|Kha(t) bρ ρ t bµ µ tf(t)dt

sup |t a| h bρ ρ t bµ µ t

= sup |t a| h bρ ρ t bµ µ t

Z |u|j 1 K (u) f(a + hu)du

sup |t a| h bρ ρ t bµ µ t.

Similarly we have

Z gha,j(t)Kha(t)P[(τ(t, V) bτ(t, V))( bw(t, V) w(t, V))|A = t]f(t)dt

Z |gha,j(t)|Kha(t) bτ τ t bw w tf(t)dt

sup |t a| h bτ τ t bw w t.

For the remaining term (Pn P) R gha,j(t)Kha(t)bτ(t, V)f(t)dt, by Lemma 2 in Kennedy et al. (2020) we have

(Pn P) Z gha,j(t)Kha(t)(bτ(t, V) τ(t, V))f(t)dt = OP

R gha,j(t)Kha(t)(bτ(t, V) τ(t, V))f(t)dt 2 n

Z gha,j(t)Kha(t)(bτ(t, V) τ(t, V))f(t)dt

= Z Z gha,j(t)Kha(t)(bτ(t, V) τ(t, V))f(t)dt 2 d P(v)

Z |gha,j(t)|Kha(t)f(t)dt 2

This together with bτ τ = o P(1) implies

(Pn P) Z gha,j(t)Kha(t)(bτ(t, V) τ(t, V))f(t)dt = o P

By direct calculations (where we assume the outcome is bounded hence τ is also bounded)

( (Pn P) Z gha,j(t)Kha(t)τ(t, V)f(t)dt 2)

( Z gha,j(t)Kha(t)τ(t, V)f(t)dt 2)

( Z |gha,j(t)|Kha(t)f(t)dt 2)

Continuous Treatment Effects with Surrogate Outcomes

which implies

(Pn P) Z gha,j(t)Kha(t)τ(t, V)f(t)dt = OP

Combining this equation with (12) yields

(Pn P) Z gha,j(t)Kha(t)bτ(t, V)f(t)dt = OP

Hence under the rate conditions in the theorem, we conclude

sup |t a| h bρ ρ t bµ µ t + sup |t a| h bτ τ t bw w t

The asymptotic normality then follows from Slutsky s theorem.

C. Additional Simulation Results

C.1. Nuisance Functions Estimated by Parametric Methods

We evaluate the performance of plug-in-style estimator and doubly robust estimator when nuisance functions are estimated using parametric models under the same setting as Section 6. We will fit linear regression models for µ, τ, λ and separately consider correctly specifying the outcome model µ, τ or not, where a misspecified model left out the quadratic term in a but keeps all other main effects and interactions. The conditional density of A given V is obtained by first estimating the conditional mean of E[A|V] and plug-in the normal density (i.e., we assume the model for conditional density is always correct). For the plug-in estimator we randomly separate the sample into two parts D, T. The first part D is used to fit the regression models for all nuisance functions and we take the average on the second part as bθ(a) = 1 |T | P

i T bτ(a, Vi). The roles of D, T are then exchanged to obtain another estimate and the final estimator is the average of two estimates. The doubly robust estimator is implemented according to Algorithm 1, where the bandwidth is selected on D2. We generate samples with sample size n {102.6, 102.8, . . . , 104.6}, apply two estimators to estimate θ(1) and repeat the process 500 times. The results are summarized in Figure 3.

Sample Size

Plug-in with Correct Outcome Model Plug-in with Wrong Outcome Model DR-learner with Correct Outcome Model DR-learner with Wrong Outcome Model

Figure 3. RMSE versus sample size (in log scale) when nuisance functions are estimated by parametric models.

When the outcome model is misspecified, the plug-in estimator (solely based on outcome modeling) is no longer consistent, as shown in Figure 3. The doubly robust estimator, however, models both the outcome and treatment process and has a

Continuous Treatment Effects with Surrogate Outcomes

smaller estimation error when the outcome model is misspecified. Estimation with correctly specified outcome model corresponds to α = 0.5 in Figure 2, where doubly robust estimator with outcome model and propensity score both correctly specified has larger error compared with the plug-in estimator since slow rate of local linear smoothing O(n 2/5) dominates the nuisance estimation error O(n 1/2).

C.2. Nuisance Functions Estimated by Nonparametric Methods

We further evaluate the performance of the plug-in style estimator and doubly robust estimator when nuisance functions are estimated using nonparametric models under the same setting as Section 6. We will fit nuisance functions µ, τ, ρ by superlearner combining generalized linear models and random forests. The conditional density of A given V is estimated by kernel density estimation. The plug-in estimator is implemented the same way as in Appendix C.1 and the doubly robust estimator is implemented according to Algorithm 1. We generate samples with sample size n {102.6, 102.8, . . . , 104}, apply two estimators to estimate θ(1) and repeat the process 500 times. The results are summarized in Figure 3.

Sample Size

Plug-in DR-learner

Figure 4. RMSE versus sample size when nuisance functions are estimated by nonparametric models.

As shown in Figure 4, the plug-in-style estimator has smaller estimation error when all the nuisance functions are fitted by nonparametric methods. This could be explained by the diffculty in nonparametrically estimating conditional density π. Due to the curse of dimensionality, a large sample size is required for the kernel density estimator to estimate π well. In our simulations, we find bπ could be small and hence violate the positivity assumption, yielding variation in the construction of pseudo-outcome bφ and larger estimation error of dose-response compared with the plug-in estimator. In applications, prior knowledge on the conditional density, including a reliable parametric model on π from domain knowledge or information on the lower bound on π due to study design might help us reduce the variation of bπ and estimate the conditional density better, which could improve the performance of doubly robust estimator proposed.

C.3. Surrogates Dependent on Treatment and Covariates

In this section we provide additional simulation results on the setting where surrogates may depend on the treatment and covariates. In real applications, the surrogates S are post-treatment short-term outcomes and are likely to depend on pre-treatment covariates and the treatment. We follow the same setting and estimation procedures as in Section 6 (so the nuisance estimation error is set manually) but modify the conditional distribution of S given (A, V) as

S N (V1 + A, V2 A) , I2 .

The RMSE is estimated from M = 500 replications of the data-generating process and estimation procedures. The results of DR-learner and plug-in estimators are summarized in Figure 5.

Continuous Treatment Effects with Surrogate Outcomes

0.10 0.15 0.20 0.25 0.30 0.35 0.40 Alpha

Plug-in DR-learner

(a) n = 500

0.10 0.15 0.20 0.25 0.30 0.35 0.40 Alpha

Plug-in DR-learner

(b) n = 2000

Figure 5. Root mean square error Versus α, where n α is the estimation error of the nuisance functions.

Figure 5 shows when surrogate outcomes S depend on treatment and covariates, the doubly robust estimator enjoys better estimation accuracy compared with the plug-in estimator when α is small (and hence nuisance estimation error is large). As α increases and the nuisance estimation error becomes smaller, the plug-in estimator gradually outperforms the doubly robust estimator. The results are similar to those in Section 6 and readers are referred to Section 6 for more discussion and explanation.

C.4. Comparison Between Supervised and Semi-supervised Methods

In this section we compare the supervised estimator, which implements the method in (Kennedy et al., 2017) on labeled data L, with our semi-supervised estimator incorporating the unlabeled data U with surrogate outcomes. We follow the same setting in Section 6 and implement method in (Kennedy et al., 2017) via similar cross-fitting techniques as in Algorithm 1 except that there is no need to fit ρ(a, x) = P(R = 1 | A = a, X = x) since their method only uses labeled data. All the nuisance functions are estimated by correctly specified parametric models. We generate samples with sample size n {102.6, 102.8, . . . , 104.6}, apply supervised and semi-supervised methods to estimate θ(1) and repeat the process M = 500 times. The results are summarized in Figure 6.

Figure 6 shows the semi-supervised method incorporating unlabeled data with surrogate outcomes has a smaller estimation error compared with the supervised method only using labeled data. The improvement comes from many aspects: As discussed in Section 5.2, the asymptotic variance of our semi-supervised method is smaller than that of the supervised method only using labeled data; Also in the semi-supervised setting, both labeled and unlabeled data are used to estimate the nuisance functions, making nuisance estimation more accurate than supervised method solely based on labeled data since the effective sample size of semi-supervised method is larger.

D. Real Data Analysis

In this section we apply the proposed method to the Job Corps study, conducted in the 1990s to evaluate the effects of the publicly funded U.S. Job Corps program. The Job Corps program targets a population between 16 and 24 years old living in the U.S. and coming from low-income households, where participants received an average of 1,200 hours of vocational training over approximately eight months. Schochet et al. (2001) and Schochet et al. (2008) discuss the study design in detail and analyze the effects of the Jobs Corps program on various outcomes. They found the Job Corps program effectively increased educational attainment, prevented arrests, and increased employment and earnings. The effects of the Job Corps program have been extensively studied under different causal inference frameworks (Flores & Flores-Lagunes, 2009; Huber, 2014; Fr olich & Huber, 2017).

However, these previous studies on the Job Corps program mainly considered binary treatment definitions. In this work, we are interested in estimating the effects of different doses of participation in the program on future involvement in the criminal

Continuous Treatment Effects with Surrogate Outcomes

Sample Size

Only labeled data Both labeled and unlabeled data

Figure 6. RMSE versus sample size when nuisance functions are estimated by correctly specified parametric models.

justice system, namely the number of arrests in the fourth year after the program (outcome Y ). Specifically, our treatment variable A is defined as the total hours spent either in academic or vocational classes of the program. The short-term surrogate outcome S is the proportion of weeks employed in the second year after the program. Huber et al. (2020) used generalized propensity score weighting to estimate the continuous treatment effects of time spent in the Job Corps program under a mediation analysis framework. We re-analyze their dataset publically available on Harvard dataverse (Huber, 2020), using the doubly robust estimator proposed as an illustration of our method.

We follow Huber et al. (2020) and focus on n = 4000 samples with a positive treatment (i.e., Ai > 0, 1 i n). To identify the causal estimand, we invoke the conditional exchangeability in Section 3, where a set of covariates is conditioned on to adjust for confounding bias. The covariate set V we adjust for confounding consists of age, gender, ethnicity, education, marital status, previous employment status and income, welfare receipt during childhood, and family background (e.g., parents education). Missing dummies are created for covariates in V containing missing values. The readers are referred to Table 4 in Huber et al. (2020) for descriptive statistics of the pretreatment covariates as well as the treatment, surrogate, and outcome variables in the data. Conditioning on a rich set of covariates is important since it enables us to identify the causal estimand by making the conditional exchangeability assumption plausible. In our analysis, the outcome Y (number of arrests in year 4) is omitted for 25% of the samples randomly to mimic the setting where the primary outcome is missing and a surrogate outcome is used as auxiliary information. We estimate the treatment effects θ(a) for each of a {100, 150, 200, . . . , 2000} using both plug-in-style estimator and doubly robust estimator in Algorithm 1, where the nuisance functions ρ, µ, τ are estimated by superlearner (Van der Laan et al., 2007) combining generalized linear model and random forests, conditional density π is estimated by kernel density estimator. The estimated dose response curve is plotted in Figure 7.

Figure 7 shows that, as participants spend more hours in the program, the expected number of arrests in year 4 has a decreasing trend, which confirms the conclusion that such training programs effectively reduce involvement in the criminal justice system. Importantly, the dose-response fitted by the doubly robust estimator is very similar to the results in Huber et al. (2020): the shape of the function is similar to their Figure 2. (Note that the specific values on the y-axis are different since they plot a contrast effect and we instead plot the expectation of potential outcome Y a.) As pointed out in Huber et al. (2020), the treatment effect is highly nonlinear, which is further verified by the curve estimated from the doubly robust estimator. By contrast, the curve estimated by a simple plug-in estimator in (3) fails to capture this non-linearity, possibly because it suffers from a large first-order bias.

Continuous Treatment Effects with Surrogate Outcomes

250 500 750 1000 1250 1500 1750 2000 Hours in academic and/or vocational training

Number of arrests in year 4

Plug-in DR Learner

Figure 7. Dose response curve of number of arrests in year 4 (outcome) versus hours in academic and/or vocational training (treatment)