Published in Transactions on Machine Learning Research (06/2023)

Off-Policy Evaluation with Out-of-Sample Guarantees

Sofia Ek sofia.ek@it.uu.se
Department of Information Technology, Uppsala University

Dave Zachariah dave.zachariah@it.uu.se
Department of Information Technology, Uppsala University

Fredrik D. Johansson fredrik.johansson@chalmers.se
Department of Computer Science & Engineering, Chalmers University of Technology

Petre Stoica ps@it.uu.se
Department of Information Technology, Uppsala University

Reviewed on OpenReview: https://openreview.net/forum?id=XnYtGPgG9p

Abstract

We consider the problem of evaluating the performance of a decision policy using past observational data. The outcome of a policy is measured in terms of a loss (aka. disutility or negative reward), and the main problem is making valid inferences about its out-of-sample loss when the past data were observed under a different and possibly unknown policy. Using a sample-splitting method, we show that it is possible to draw such inferences with finite-sample coverage guarantees about the entire loss distribution, rather than just its mean. Importantly, the method takes into account model misspecifications of the past policy, including unmeasured confounding. The evaluation method can be used to certify the performance of a policy using observational data under a specified range of credible model assumptions.

1 Introduction

In this work, we are interested in evaluating the performance of a decision policy, denoted π, which chooses an action from a discrete action set. Each action A is taken in a context with observable covariates X and incurs a real-valued loss L (aka. disutility or negative reward). Such policies are considered, for example, in contextual bandit problems and precision medicine (Langford & Zhang, 2007; Qian & Murphy, 2011; Lattimore & Szepesvári, 2020; Tsiatis et al., 2019).
For instance, A may be one of several treatment options for a patient with observable characteristics X, and L measures the severity of the outcome. A target policy π can be evaluated using experimental data obtained from trials. Such experiments are, however, often costly to perform and may lead to rather small sample sizes in, e.g., clinical settings. Moreover, in safety-critical applications, it is often unethical to test new policies without severe restrictions on the trial population as well as the target policy. A more fundamental inferential problem, however, is the lack of external validity, i.e., the limited ability to extrapolate from the trial population to the intended target population, which may lead to invalid inferences (Westreich, 2019; Manski, 2019).

The main alternative is off-policy evaluation, i.e., using observational data from a past decision process to infer the performance of the target policy. For this to be valid, one needs to assume that the past process is modeled without systematic errors, i.e., no unmeasured confounding and well-specified models. The credibility of these assumptions, therefore, determines the internal validity of inferences about π from observational data (Manski, 2003).

Inferences that lack validity are particularly serious when evaluating π in decision processes that are costly or safety-critical. In such cases, even inferences that are asymptotically valid with increasing sample size may not be adequate. Moreover, when the resulting distribution of losses is skewed or widely dispersed, its tails are important to evaluate. Then inferring the expected loss Eπ[L], as is commonly done, provides a very limited evaluation of π (Wang et al., 2018). For instance, the average loss in a population may be small but the tail losses can be unacceptably large.
In such applications, we are more concerned with providing valid certifications of the overall performance (see Figure 1a), rather than inferences of a single distributional parameter. In this paper, we propose a method for evaluating a specified target policy using observational data that provides finite-sample coverage guarantees for the out-of-sample loss, evaluates the entire loss distribution instead of, e.g., its expected value, and takes model misspecification, including unmeasured confounding, into account.

Figure 1: Evaluating out-of-sample losses under a target policy with binary decisions A ∈ {0, 1}. Policies π0 and π1 correspond to treat none (A ≡ 0) and treat all (A ≡ 1), respectively, while π denotes a policy that adapts to context X. (For more details, see the example in Section 5.1 with p1.) (a) Evaluation of out-of-sample loss L: each curve certifies that a new loss L falls below the limit ℓα with a probability of at least 1 − α, using n = 1000 data points. The certified performance of policy π dominates those of the alternative policies. (b) Evaluation of coverage of limit curves: actual coverage, that is, the probability of L ≤ ℓα, versus target coverage 1 − α.

2 Problem formulation

We consider a target policy π for deciding an action A in different contexts, which are described by observed and unobserved covariates X and U, respectively. The policy corresponds to a distribution pπ(A|X), and can be either deterministic or random. Our aim is to evaluate the losses L that result from applying any given π. Each instance of contextual covariates, action, and loss, i.e., (X, U, A, L), is drawn independently from a target distribution pπ(X, U, A, L). At our disposal is an observational data set

D = {(Xi, Ai, Li)}_{i=1}^{n},    (1)

and our goal is to use it to characterize the out-of-sample loss Ln+1. Note that D was collected under a possibly different policy than π.
If we continue with the example of patients, then Ln+1 would quantify the severity of the outcome for a new future patient that is unobserved at the time of our evaluation. Specifically, for any miscoverage level α ∈ (0, 1), we seek an informative limit ℓα(D) on the loss such that

Pπ{Ln+1 ≤ ℓα(D)} ≥ 1 − α.    (2)

In other words, ℓα(D) as a function of α yields a finite-sample performance certification of π, as illustrated in Figure 1a. Unlike a single point estimate, the limit curve characterizes the distribution of losses under π. For instance, consider a patient population where (X, U) denote its covariates, and A is its received treatment with an outcome loss L. Then, across all coverage levels, (2) ensures that the treatment of a future patient will incur a loss no greater than ℓα(D) under policy π with confidence 1 − α. Figure 1b shows the probability that a limit ℓα(D), proposed below, bounds the future loss Ln+1 across all values of 1 − α. Note that limit curves can be constructed for several alternative target policies, which enables us to compare and focus on the parts of their loss distributions that are most relevant to the specific decision problem.

The causal structure of this decision process is illustrated in Figure 2a. The target distribution admits a causal factorization

pπ(X, U, A, L) = p(X, U) pπ(A|X) p(L|A, X, U),    (3)

where p(X, U) and p(L|A, X, U) are unknown. The central challenge in off-policy evaluation of π is that (1) is not sampled from (3) but from a shifted training distribution, which admits a causal factorization

p(X, U, A, L) = p(X, U) p(A|X, U) p(L|A, X, U),    (4)

where p(A|X, U) characterizes a past policy (aka. behavioral policy) that may differ from π. The causal structure corresponding to (4) is illustrated in Figure 2b.
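To make the shift between (3) and (4) concrete, the two factorizations can be simulated side by side. The following sketch uses toy choices of p(X, U), the past policy, and p(L|A, X, U); all of these distributions, and the function names, are illustrative assumptions, not the paper's experimental setting.

```python
import numpy as np

def sample_decision_process(n, policy, rng):
    """Draw (X, U, A, L) with a common p(X, U) and p(L|A, X, U);
    only the action mechanism `policy(X, U, rng)` differs."""
    X = rng.uniform(size=n)
    U = rng.normal(size=n)                        # unobserved context
    A = policy(X, U, rng)
    L = np.where(A == 1, X + U, 1.0 - X) + 0.1 * rng.normal(size=n)
    return X, U, A, L

def past_policy(X, U, rng):
    """Training factorization (4): actions may depend on the unobserved U."""
    p1 = 1.0 / (1.0 + np.exp(-(X + U)))           # U enters -> confounding
    return (rng.uniform(size=len(X)) < p1).astype(int)

def target_policy(X, U, rng):
    """Target factorization (3): actions depend on X only."""
    return (X > 0.5).astype(int)

rng = np.random.default_rng(0)
X_tr, U_tr, A_tr, L_tr = sample_decision_process(1000, past_policy, rng)
X_te, U_te, A_te, L_te = sample_decision_process(1000, target_policy, rng)
```

The only difference between the two draws is the action mechanism, which is exactly the distribution shift that off-policy evaluation must adjust for.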
If the past policy were known, it would be possible to adjust for the distribution shift from training to target distribution using the ratio

pπ(X, U, A, L) / p(X, U, A, L) = pπ(A|X) / p(A|X, U).    (5)

This is feasible in certain problems with fully automated decision-making, such as online recommendation systems, where the past policy is designed using observable covariates only, i.e., p(A|X, U) ≡ p(A|X). In more general problems, however, we have only a nominal model of the past policy p̂(A|X) (aka. propensity model), typically estimated from prior observable data. The nominal model may therefore diverge from p(A|X, U) due to various modelling errors that persist even in the large-sample scenario: model misspecification and unmeasured confounding via U (Peters et al., 2017; Westreich, 2019).

Here we follow the marginal sensitivity methodology of Tan (2006) and characterize the model divergence with respect to the odds of taking action A. That is, the unknown odds diverge from the nominal odds by at most some bounded factor Γ ≥ 1:

Γ⁻¹ ≤ [ p(A|X, U) / (1 − p(A|X, U)) ] / [ p̂(A|X) / (1 − p̂(A|X)) ] ≤ Γ,    (6)

where the first bracket is the unknown odds and the second is the nominal odds, for all discrete actions A and almost surely all X, U. When the bound equals Γ = 1, the nominal model is perfectly specified and there is no unmeasured confounding. In general, we consider all cases where the nominal odds diverge by at most a factor Γ. Note that (6) can accommodate all possible sources of errors in the nominal model p̂(A|X), including model misspecification as well as finite-sample errors.

In summary, the problem we consider is to construct a limit ℓα(D) for target policy π using a nominal model p̂(A|X) and Γ. The resulting limit should satisfy the finite-sample guarantee (2) for all miscoverage levels α, and thereby certify the target policy performance for any specified bound Γ in (6).
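The bound (6) translates directly into an interval for the unknown propensity. A minimal sketch (function name ours) that converts a nominal propensity and a bound Γ into this interval:

```python
def propensity_interval(p_nominal, gamma):
    """Interval for the unknown p(A|X,U) implied by (6): its odds lie
    within a factor gamma of the nominal odds p_nominal / (1 - p_nominal)."""
    odds = p_nominal / (1.0 - p_nominal)
    lo = (odds / gamma) / (1.0 + odds / gamma)   # odds shrunk by gamma
    hi = (odds * gamma) / (1.0 + odds * gamma)   # odds inflated by gamma
    return lo, hi

# With nominal propensity 0.5 (odds 1) and gamma = 2, the unknown
# propensity can be anywhere in [1/3, 2/3].
lo, hi = propensity_interval(0.5, 2.0)
```

With gamma = 1 the interval collapses to the nominal value, matching the perfectly specified case.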
This enables a robust evaluation of target policies using observational data, since it can be performed across a range of credible odds divergence bounds Γ, as we will illustrate in the numerical experiments in Section 5. By increasing the divergence bound Γ, the credibility of our assumptions on p̂(A|X) increases, but the informativeness of inferences about Ln+1 decreases, cf. Manski (2003). Therefore we advocate evaluating a policy π using several Γ in the range [1, Γmax], where Γmax is the maximum value for which the limit curve remains informative in any given problem. The informativeness of a valid limit curve can be defined as follows:

Informativeness = 1 − ᾱ, where ᾱ = inf{α : ℓα(D) < Lmax},    (7)

where Lmax is the maximum value of the support of L. That is, the informativeness is the highest coverage probability at which we can informatively certify the performance of π, i.e., the right limit of a curve. Note that the informativeness is a problem-dependent quantity.

Figure 2: Directed acyclic graphs representing the causal structure of the decision process under (a) the target policy, which yields (3), and (b) the past policy, which yields (4). In (b), both contextual covariates (X, U) may jointly affect the action A and the outcome loss L. Thus U gives rise to unmeasured confounding.

3 Background

We place the problem considered in this paper and our proposed method within the framework of off-policy evaluation.

Expected loss: In the off-policy evaluation literature, the target quantity is often the unknown expected loss Eπ[L] of policy π. A standard estimator of the mean, which dates back to Horvitz & Thompson (1952), is based on inverse propensity weighting:

V̂_IPW(D) = (1/n) Σ_{i=1}^{n} ŵ(Xi, Ai) Li,    (8)

where ŵ(X, A) = pπ(A|X) / p̂(A|X) is a model of (5); see for instance Rosenbaum & Rubin (1983); Beygelzimer et al. (2009); Qian & Murphy (2011); Zhang et al.
(2012); Zhao et al. (2012); Kallus (2018). We note that the estimator in (8) is unbiased when Γ = 1. An alternative standard estimator is based on regression modeling:

V̂_RM(D) = (1/n) Σ_{i=1}^{n} Σ_{a∈A} pπ(a|Xi) ℓ̂(a, Xi),    (9)

where ℓ̂(A, X) is a model of E[L|A, X]. The approaches in (8) and (9) have complementary strengths and weaknesses, resulting from the challenges of modelling the past policy and the conditional mean of losses, respectively. Even when the models are well-specified, the accuracy of the estimators depends highly on the overlap of covariates X across decisions A in the training data (Oberst et al., 2020; D'Amour et al., 2021). When the overlap is weak, the variance of V̂_IPW(D) can become excessively large, even when it is unbiased. This can be mitigated by clipping the weights, which in turn causes bias (Rubin, 2001; Kang & Schafer, 2007; Schafer & Kang, 2008; Strehl et al., 2010). When the models ŵ(X, A) or ℓ̂(A, X) exhibit systematic errors, however, the corresponding estimators in (8) and (9) are biased and may invalidate the evaluation of π. The doubly robust estimator

V̂_DR(D) = (1/n) Σ_{i=1}^{n} [ ŵ(Xi, Ai) (Li − ℓ̂(Ai, Xi)) + Σ_{a∈A} pπ(a|Xi) ℓ̂(a, Xi) ],

is one way of robustifying the evaluation when either model ŵ(X, A) or ℓ̂(A, X) is misspecified. Moreover, it reduces the estimator variance provided ℓ̂(A, X) is sufficiently accurate (Bang & Robins, 2005; Dudík et al., 2011; Rotnitzky et al., 2012).

Distribution of losses: When the loss distribution is highly skewed and the tails are long, the use of expected loss to evaluate policies can be inadequate, especially in high-stakes problems. There are alternative parameters of the loss distribution, described by the cumulative distribution function (cdf) F(ℓ) = Pπ{Ln+1 ≤ ℓ}, that one can consider in such problems, e.g., a quantile or the conditional value at risk (Wang et al., 2018; Chandak et al., 2021; Huang et al., 2021).
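The three mean estimators above can be sketched in a few lines of numpy; the function names and array layout (one row per sample, one column per action) are our own conventions, not the paper's.

```python
import numpy as np

def vhat_ipw(L, w):
    """Inverse propensity weighted estimate (8);
    w[i] is a model of pi(A_i|X_i) / p_hat(A_i|X_i)."""
    return np.mean(w * L)

def vhat_rm(pi_a_x, lhat_a_x):
    """Regression-model estimate (9); both arguments have shape
    (n, n_actions): pi(a|X_i) and the loss model l_hat(a, X_i)."""
    return np.mean(np.sum(pi_a_x * lhat_a_x, axis=1))

def vhat_dr(L, A, w, pi_a_x, lhat_a_x):
    """Doubly robust estimate: the regression-model term plus an
    IPW-weighted correction using the observed losses."""
    n = len(L)
    lhat_obs = lhat_a_x[np.arange(n), A]          # l_hat(A_i, X_i)
    return np.mean(w * (L - lhat_obs) + np.sum(pi_a_x * lhat_a_x, axis=1))
```

When the loss model is exact, the correction term in `vhat_dr` vanishes and the estimator reduces to the regression-model value, illustrating the double-robustness property.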
Off-policy evaluation of π with respect to some alternative parameter can be achieved using cdf-estimators that are analogous to the mean estimators above; see Huang et al. (2021). In analogy to (8), the inverse propensity weighted cdf-estimator

F̂_IPW(ℓ; D) = (1/n) Σ_{i=1}^{n} ŵ(Xi, Ai) 1(Li ≤ ℓ),    (10)

is point-wise unbiased when Γ = 1. Similar to (9), the estimator

F̂_RM(ℓ; D) = (1/n) Σ_{i=1}^{n} Σ_{a∈A} pπ(a|Xi) ĉ(ℓ; a, Xi),

requires a model ĉ(ℓ; a, x) of the conditional distribution P{L ≤ ℓ|A, X}. To mitigate against model misspecification that threatens the validity of the evaluation of π, one can use the doubly robust estimator

F̂_DR(ℓ; D) = (1/n) Σ_{i=1}^{n} [ ŵ(Xi, Ai) (1(Li ≤ ℓ) − ĉ(ℓ; Ai, Xi)) + Σ_{a∈A} pπ(a|Xi) ĉ(ℓ; a, Xi) ],

which protects against the case that either model ŵ(X, A) or ĉ(ℓ; A, X) is misspecified. While this estimator is consistent, it is not guaranteed to yield a proper cdf.

In this paper, we are interested in limiting the out-of-sample loss Ln+1. The α-quantile is the smallest ℓα such that Pπ{Ln+1 ≤ ℓα} ≥ 1 − α. It can be estimated as

ℓ̂α(D) = inf{ℓ : F̂(ℓ; D) ≥ 1 − α},

using one of the cdf-estimators above. We will use F̂_IPW as a benchmark below.

Distribution-free inference: Derivations of finite-sample valid, nonparametric limits on random variables date back to the works of Wilks (1941); Wald (1943); Scheffe & Tukey (1945). More recently, the related methodology of conformal prediction has focused on developing covariate-specific prediction regions (Vovk et al., 2005; Shafer & Vovk, 2008; Vovk, 2012). See Lei & Wasserman (2014); Lei et al. (2018); Romano et al. (2019) for further developments. Tibshirani et al. (2019) extended the methodology to make it valid also under known covariate shifts. This extended methodology was used to provide context-specific prediction intervals for any given policy π, which are statistically valid under the assumption that the past policy p(A|X, U) is known (Osama et al., 2020; Taufiq et al., 2022).
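The IPW cdf-estimator (10) and the plug-in quantile above can be sketched as follows; restricting the infimum to observed losses is our simplification, since the estimator only changes value at those points.

```python
import numpy as np

def cdf_ipw(ell, L, w):
    """IPW cdf-estimator (10) at threshold ell; w[i] models
    pi(A_i|X_i) / p_hat(A_i|X_i)."""
    return np.mean(w * (L <= ell))

def quantile_limit(alpha, L, w):
    """Plug-in quantile: smallest observed loss ell with
    F_hat_IPW(ell) >= 1 - alpha."""
    for ell in np.sort(L):
        if cdf_ipw(ell, L, w) >= 1.0 - alpha:
            return ell
    return np.inf  # the estimator never reaches 1 - alpha
```

With unit weights this reduces to the ordinary empirical quantile, which is the Γ = 1, on-policy special case.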
The marginal sensitivity methodology developed in Tan (2006) enables us to specify a more credible range of assumptions using (6). This type of methodology was used for robust policy learning in Kallus & Zhou (2021), for robust policy learning in sequential decisions in Namkoong et al. (2020), and for sensitivity analysis of treatment effects under unobserved confounding in Jin et al. (2023). The present paper considers the overall performance of π, similar to Huang et al. (2021). Different from that paper, however, we focus on ensuring inferences on the out-of-sample losses that are valid even for finite training data and systematic modelling errors, including unobserved confounding, using a sample-splitting technique that leverages results derived in Jin et al. (2023).

We show that one can limit the out-of-sample losses for π under any specified odds divergence bound Γ ≥ 1 for the nominal model in (6), which we assume holds. It follows that (5) is bounded as:

W̲ ≤ pπ(X, U, A, L) / p(X, U, A, L) ≤ W̄,    (11)

where the bounds equal

W̲ = pπ(A|X) [1 + Γ⁻¹(p̂(A|X)⁻¹ − 1)] and W̄ = pπ(A|X) [1 + Γ(p̂(A|X)⁻¹ − 1)].    (12)

The model p̂(A|X) is typically estimated from prior observable data. The above bounds are functions of X and A drawn from the training distribution (4). In order to ensure finite-sample guarantees, we randomly split the training data (1) into two separate sets, D0 and D1, with samples of sizes n0 and n − n0, respectively. The set D1 is used to form the function

F̂(ℓ; D1, w) = [ Σ_{i=n0+1}^{n} W̲i 1(Li ≤ ℓ) ] / [ Σ_{i=n0+1}^{n} W̲i 1(Li ≤ ℓ) + Σ_{i=n0+1}^{n} W̄i 1(Li > ℓ) + w ],    (13)

as a proxy for the unknown cdf of the out-of-sample loss Ln+1, where w > 0 represents the ratio in (11). We use the set D0 to construct an upper bound on this unknown ratio. As the following result shows, this enables us to construct a valid limit ℓα on the future loss Ln+1.

Theorem 4.1.
Define the following quantile function of (13):

ℓα,β = inf{ℓ : F̂(ℓ; D1, wβ(D0)) ≥ 1 − α + β},    (14)

where

wβ(D0) = W̄_[⌈(n0+1)(1−β)⌉] if ⌈(n0+1)(1−β)⌉ ≤ n0, and wβ(D0) = ∞ otherwise,    (15)

and W̄_[k] denotes the kth order statistic of (W̄i)_{i=1}^{n0}. For any miscoverage probability α ∈ (0, 1), construct

ℓα(D) = min_{β : 0<β<α} ℓα,β,    (16)

which is a valid limit on the out-of-sample loss Ln+1 with a probability of at least 1 − α, as specified in (2).

The proof of Theorem 4.1 can be found in Appendix A.1. The variable wβ(D0) in (15) provides a statistically valid upper bound on the unknown ratio in (11), where β only specifies its confidence level. For any given value of 0 < β < α, ℓα,β in (14) bounds the loss Ln+1 with a coverage probability of at least 1 − α. Consequently, (16) chooses the tightest achievable limit.

Remark 1. When Γ = 1, then W̲ = W̄ in (11). Therefore (13) can be thought of as an inverse propensity weighted cdf, where the weights are normalized to sum to unity.

We now turn to the implementation of (16). We note that (13) is a piecewise constant function in ℓ and that wβ in (15) is a piecewise constant function in β. Therefore (14), and subsequently (16), can be solved by evaluating the functions along a discrete set of points. The computation of a limit curve is summarized in Algorithm 1, using a discrete grid of miscoverage levels α. Given a model p̂(A|X) and a range of odds divergence bounds Γ, Algorithm 1 produces a set of corresponding limit curves.

5 Numerical experiments

In the experiments below, we evaluate policies using the limit curves (α, ℓα). We quantify how increasing the credibility of our model assumption, i.e., increasing Γ, affects the informativeness of the limit curve using (7). We also consider the coverage probability of the curves:

Miscoverage gap = α − Pπ{Ln+1 > ℓα(D)}.    (17)

Algorithm 1 Limit curve of policy π
Input: Policy pπ(A|X), training data D, model p̂(A|X), bound Γ ≥ 1 and sample split size n0.
1: Randomly split D into D0 and D1.
2: for α in a discrete grid over (0, 1) do
3:   for β in a discrete grid over (0, α) do
4:     Compute wβ using (15).
5:     Compute ℓα,β using (14).
6:   end for
7:   Set ℓα to the smallest ℓα,β above.
8:   Store (α, ℓα).
9: end for
Output: {(α, ℓα)}

When the gap (17) is positive, the limit is conservative at level α; when it is negative, the limit is invalid. A natural benchmark for the proposed limit (16) in this problem setting is the estimated quantile

ℓ̂α(D) = inf{ℓ : F̂_IPW(ℓ; D) ≥ 1 − α},    (18)

using the inverse propensity weighted cdf-estimator (10). In all examples below, the limit (16) is computed using sample splits of equal size, i.e., n0 = n/2. An additional experiment using data from the Infant Health and Development Program (IHDP) can be found in Appendix A.2. The code used for the experiments is available at https://github.com/sofiaek/off-policy-evaluation.

5.1 Synthetic data

In the first example, we consider synthetic data in order to evaluate the coverage of the derived limit curves. We use a simulation setting similar to Jin et al. (2023). The miscoverage gap (17) is estimated by Monte Carlo simulation using 1000 runs, for each run drawing 1000 new independent samples (Xn+1, Un+1, An+1, Ln+1). We consider a population of individuals with two-dimensional covariates X = (X1, X2) distributed uniformly on the unit square. The actions are binary, A ∈ {0, 1}, corresponding to not treat and treat, respectively. We want to evaluate a deterministic target policy, described by

pπ(A = 0|X) = 1(X1X2 ≥ τ),    (19)

for different τ ∈ [0, 1]. That is, all individuals whose covariate product X1X2 falls below τ are treated. Note that τ = 0 corresponds to a treat none policy (A ≡ 0 for all X) and τ = 1 corresponds to a treat all policy (A ≡ 1 for all X). Below we discuss the resulting losses under this policy using observational data with sample sizes n ∈ {250, 500, 1000}.

Case: Known past policy (Γ = 1). In the first scenario, we assume that the past policy is known exactly and there is no unmeasured confounding.
For the training data, the past policy has selected actions as a Bernoulli process:

p(A = 0|X) ≡ p̂(A = 0|X) = f(c(X1X2 + 1)), c ∈ {1/2, 1, 2},    (20)

where f(·) is the sigmoid function. The conditional loss distribution is given by (L|A = 0, X) ~ N(1 − X1X2, 0.1) and (L|A = 1, X) ~ N(X1X2, 0.1), and thus Lmax = ∞. We consider three configurations c of past policies (20), denoted p1(A|X) (c = 1/2), p2(A|X) (c = 1), and p3(A|X) (c = 2), which yield inverse propensity weights in the ranges [1, 3.72), [1, 8.39), and [1, 55.6), respectively. Thus we anticipate p3(A|X) to be the most challenging case.

Here we evaluate three target policies, τ ∈ {0, 0.5, 1} in (19), and present their resulting limit curves using data from the different past policies (20); see Figure 3. The dashed line shows the corresponding past policy. The limit curves for a given target policy are quite similar across training distributions and are still informative at the 90% level using (7). The main difference is in the inferred tail losses, which is notable for τ = 1 under the more challenging past policy p3(A|X).

We now turn to evaluating the miscoverage gap (17). Figure 4 presents the gaps for target policy τ = 0.5 in (19). The solid lines illustrate the proposed method and the dashed lines show the benchmark (18). We see that the proposed method is slightly conservative, but remains valid for all α. In contrast, the benchmark is not valid in the tail of the distribution, but is less conservative for higher α in this well-specified case.

Figure 3: Limit curves when the past policy is known (Γ = 1) for three different potential target policies (τ ∈ {0.0, 0.5, 1.0} in (19)), under past policies p1(A|X), p2(A|X) and p3(A|X). Dashed curve denotes the past policy. n = 1000.
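Equations (12)–(16), as organized in Algorithm 1, can be sketched in numpy as follows. The function names, the discrete β grid, the small numerical guard in the order-statistic index, and the quantile level in (14) reflect our reading of the method, not the authors' released code.

```python
import numpy as np

def weight_bounds(pi_a_x, p_hat, gamma):
    """Per-sample lower/upper bounds (12) on the shift ratio (11)."""
    inv = 1.0 / p_hat - 1.0                       # (1 - p_hat) / p_hat
    return pi_a_x * (1.0 + inv / gamma), pi_a_x * (1.0 + inv * gamma)

def limit_curve_point(alpha, L1, w_lo1, w_hi1, w_hi0, n_beta=20):
    """Limit ell_alpha in (16): minimize (14) over a grid of beta in
    (0, alpha). L1, w_lo1, w_hi1 come from split D1; w_hi0 are the
    upper weight bounds on split D0, used for w_beta in (15)."""
    n0 = len(w_hi0)
    order = np.sort(w_hi0)
    best = np.inf
    for beta in np.linspace(alpha / n_beta, alpha * (1 - 1 / n_beta), n_beta):
        k = int(np.ceil((n0 + 1) * (1.0 - beta) - 1e-9))  # index in (15)
        if k > n0:
            continue                                       # w_beta = infinity
        w_beta = order[k - 1]
        for ell in np.sort(L1):
            # proxy cdf (13) at threshold ell
            num = np.sum(w_lo1 * (L1 <= ell))
            den = num + np.sum(w_hi1 * (L1 > ell)) + w_beta
            if num / den >= 1.0 - alpha + beta:            # quantile (14)
                best = min(best, ell)
                break
    return best
```

With Γ = 1 and a well-specified model, `w_lo1` and `w_hi1` coincide and (13) reduces to a self-normalized IPW cdf, as noted in Remark 1.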
Figure 4: Miscoverage gaps (17) versus target α under past policy p3(A|X), for n ∈ {250, 500, 1000}, when the past policy is known (Γ = 1). Dashed curve denotes the benchmark (18).

Case: Estimated past policy (Γ > 1). In the second scenario, we assume that we only have access to an estimate p̂(A|X) (given by (20)) and that there is unmeasured confounding in the training distribution. To render visually distinct curves from the previous case, we consider here a rather extreme case of confounding, following Jin et al. (2023). Specifically, we have an unobserved variable drawn as (U|X) ~ N(0, 0.1(X1 + X2)), and the loss (L|A, X, U) is generated as

L = 1 − X1X2 + U if A = 0, and L = X1X2 + U if A = 1.

We define the past policy in a manner that enables us to control the divergence from the nominal model p̂(A|X) in (20):

p(A = 0|X, U) = 1(U ≤ t(X)) [1 + Γ0⁻¹(p̂(A = 0|X)⁻¹ − 1)]⁻¹ + 1(U > t(X)) [1 + Γ0(p̂(A = 0|X)⁻¹ − 1)]⁻¹,    (21)

where the threshold function t(X) is designed empirically to ensure that the resulting median loss of the past policy for A = 1 is maximized. Our design of the past policy can be seen as the worst case among all unknown past policies that diverge by a factor Γ0 in (6). We fix Γ0 = 2 here, but treat it as unknown.

For simplicity, we consider a treat all target policy (τ = 1). Its limit curves, under different assumed odds divergence bounds Γ ∈ {1, 2, 3}, are presented in Figure 5. Note that in the case of unmeasured confounding, the resulting curves differ notably across the training distributions, unlike in Figure 3. We see that for the first and second distributions, the informativeness of all curves stays around the 90% level.
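The Monte Carlo estimation of the miscoverage gap (17) described above can be sketched as follows; `sample_losses` is a hypothetical callable standing in for drawing fresh losses under the target policy.

```python
import numpy as np

def miscoverage_gap(alpha, ell_alpha, sample_losses, n_runs=1000,
                    n_new=1000, seed=0):
    """Monte Carlo estimate of the gap (17): alpha - P{L_{n+1} > ell_alpha}.
    sample_losses(rng, n) draws n fresh losses under the target policy."""
    rng = np.random.default_rng(seed)
    exceed = np.empty(n_runs)
    for r in range(n_runs):
        L_new = sample_losses(rng, n_new)
        exceed[r] = np.mean(L_new > ell_alpha)   # empirical miscoverage
    return alpha - exceed.mean()
```

A positive value indicates a conservative limit and a negative value an invalid one, matching the interpretation given after Algorithm 1.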
However, in the more extreme third case, the informativeness drops to just above the 60% level when we increase the credibility of our model assumption to an odds bound of Γ = 3. This example illustrates an inherent trade-off between credibility and informativeness. Figure 6 validates our guarantees using data drawn from p1(A|X). When Γ ≥ Γ0 = 2, the limit curves are valid, and as Γ increases to 3, the limits become quite conservative. Note that the conservativeness persists even as the sample size n is increased fourfold. For Γ = 1, there is no guarantee of coverage, and in this case the limit curve is invalid. The benchmark does not take the unmeasured confounding into account and is consequently invalid throughout.

Figure 5: Limit curves for the treat all target policy using odds bounds Γ ∈ {1, 2, 3}, when the past policy is unknown and subject to unmeasured confounding (Γ0 = 2 in (21)), under past policies p1(A|X), p2(A|X) and p3(A|X). n = 1000.

5.2 Real data

In the second example, we use data from the National Health and Nutrition Examination Survey (NHANES) for the years 2013–2014 to illustrate the use of the proposed method. Following Zhao et al. (2019), we study the effect of seafood consumption on blood mercury levels. The action A indicates whether a person has a low or high consumption of fish or shellfish (≤ 1 vs. > 12 servings in the past month) and the loss L is the total concentration of blood mercury (in µg/L). There are 8 covariates in X for each person: gender, age, income, whether income is missing or not, race, education, ever smoked, and number of cigarettes smoked last month. One individual with missing education data and 7 with missing smoking data are omitted. 175 individuals have missing income data; for them we impute the median income.
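The preprocessing just described might be sketched as follows; the column names and the toy values are illustrative stand-ins, not the official NHANES variable codes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged NHANES 2013-2014 extract.
df = pd.DataFrame({
    "income": [3.0, np.nan, 5.0, 2.0],
    "education": [2.0, 3.0, np.nan, 1.0],
    "smoked_last_month": [0.0, 5.0, 0.0, np.nan],
})

# Omit the few individuals with missing education or smoking data.
df = df.dropna(subset=["education", "smoked_last_month"])

# Flag and median-impute missing income; the missingness indicator is
# itself kept as one of the covariates in X.
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
```

Keeping the missingness indicator as a covariate lets the propensity model account for whether income was observed at all.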
This preprocessing results in a data set of 1107 individuals, of which 21% follow a high fish consumption diet and 79% follow a low consumption diet; see Zhao et al. (2018). We use a fitted logistic regression model for p̂(A|X), where the continuous covariates in X have been standardized.

Figure 6: Miscoverage gaps (17) versus α, when the past policy is unknown and subject to unmeasured confounding. Dashed curve denotes the benchmark (18), which does not take confounding into account. Note that Γ = 1 assumes there is no odds divergence for our model, which in this case leads to an invalid limit curve.

We compare two policies, π0 and π1, corresponding to low (A ≡ 0) and high (A ≡ 1) seafood consumption, respectively. Their corresponding limit curves are presented in Figure 7 under different odds divergence bounds Γ on the model p̂(A|X). In the most extreme case, we consider the nominal odds diverging by at most a factor Γ = 3. For reference, we display the limit curve under the unknown consumption policy in the population. We see that lower mercury levels can be certified by low seafood consumption. Note that a guidance value for mercury levels is 8 µg/L for women of child-bearing age and 20 µg/L for women ≥ 50 years of age (Kales & Thompson, 2016). If we were to evaluate policies based on the expected loss alone, then high seafood consumption for women has an expected loss of approximately 3.3 µg/L and could be deemed safe. However, this figure masks the tail losses that a substantial proportion of women may face under such a policy. For a high consumption policy, our method could only certify that approximately 80% (Γ = 1) to 50% (Γ = 3) of females would stay below the guideline level of 8 µg/L. In contrast, for the low consumption policy, the corresponding figure is approximately 95% (for 1 ≤ Γ ≤ 3).
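The fitted logistic propensity model with standardized covariates can be sketched with scikit-learn; the design matrix here is synthetic stand-in data, and the fitting details are our assumptions rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 8))                  # stand-in for the 8 covariates
A = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(int)

X_std = StandardScaler().fit_transform(X)    # standardize continuous covariates
model = LogisticRegression().fit(X_std, A)

# Nominal propensity p_hat(A_i|X_i): probability of the observed action.
p_hat = model.predict_proba(X_std)[np.arange(n), A]
```

These fitted probabilities play the role of p̂(A|X) in the weight bounds (12), with the divergence from the unknown past policy absorbed by the chosen Γ.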
Figure 7: Limit curves for blood mercury levels (ℓα in µg/L) in a population under different seafood consumption policies. Left: overall population. Right: female population. We compare two policies: low (π0) and high (π1) consumption. The credibility of the assumptions increases with the odds bound Γ ∈ {1, 2, 3}. The past policy is indicated by dotted curves.

6 Conclusion

We have considered the problem of off-policy evaluation, i.e., making valid inferences about a target policy using past observational data obtained under a different decision process with a possibly unknown policy. Using the marginal sensitivity model, we derived a sample-splitting method that provides limit curves with finite-sample coverage guarantees even under model misspecifications, including unmeasured confounding. The validity, informativeness, and conservativeness of the resulting limit curves were demonstrated in the numerical experiments.

Acknowledgments

We thank the anonymous reviewers from TMLR for their constructive feedback. This research was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation, and by the Swedish Research Council under contract 2021-05022.

References

Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.

Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 49–56, 2009.

Yash Chandak, Scott Niekum, Bruno da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S Thomas. Universal off-policy evaluation. Advances in Neural Information Processing Systems, 34:27475–27490, 2021.

Miroslav Dudík, John Langford, and Lihong Li.
Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 1097–1104, 2011.

Alexander D'Amour, Peng Ding, Avi Feller, Lihua Lei, and Jasjeet Sekhon. Overlap in observational studies with high-dimensional covariates. Journal of Econometrics, 221(2):644–654, 2021.

The Infant Health and Development Program. Enhancing the outcomes of low-birth-weight, premature infants: a multisite, randomized trial. JAMA, 263(22):3035–3042, 1990.

Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.

Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.

Audrey Huang, Liu Leqi, Zachary Lipton, and Kamyar Azizzadenesheli. Off-policy risk assessment in contextual bandits. Advances in Neural Information Processing Systems, 34:23714–23726, 2021.

Ying Jin, Zhimei Ren, and Emmanuel J Candès. Sensitivity analysis of individual treatment effects: A robust conformal inference approach. Proceedings of the National Academy of Sciences, 120(6), 2023.

Stefanos N Kales and Aaron MS Thompson. A young woman concerned about mercury. CMAJ, 188(2):133–134, 2016.

Nathan Kallus. Balanced policy evaluation and learning. Advances in Neural Information Processing Systems, 31, 2018.

Nathan Kallus and Angela Zhou. Minimax-optimal policy learning under unobserved confounding. Management Science, 67(5):2870–2890, 2021.

Joseph DY Kang and Joseph L Schafer. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539, 2007.

John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits.
Advances in Neural Information Processing Systems, 20, 2007.
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71–96, 2014.
Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
Charles F Manski. Identification problems in the social sciences and everyday life. Southern Economic Journal, 70(1):11–21, 2003.
Charles F Manski. Patient Care Under Uncertainty. Princeton University Press, 2019.
Hongseok Namkoong, Ramtin Keramati, Steve Yadlowsky, and Emma Brunskill. Off-policy policy evaluation for sequential decisions under unobserved confounding. Advances in Neural Information Processing Systems, 33:18819–18831, 2020.
Michael Oberst, Fredrik Johansson, Dennis Wei, Tian Gao, Gabriel Brat, David Sontag, and Kush Varshney. Characterization of overlap in observational studies. In International Conference on Artificial Intelligence and Statistics, pp. 788–798. PMLR, 2020.
Muhammad Osama, Dave Zachariah, and Peter Stoica. Learning robust decision policies from observational data. Advances in Neural Information Processing Systems, 33:18205–18214, 2020.
Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017.
Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.
Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Paul R Rosenbaum and Donald B Rubin.
The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
Andrea Rotnitzky, Quanhong Lei, Mariela Sued, and James M Robins. Improved double-robust estimation in missing data and causal inference models. Biometrika, 99(2):439–456, 2012.
Donald B Rubin. Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3):169–188, 2001.
Joseph L Schafer and Joseph Kang. Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13(4):279, 2008.
Henry Scheffe and John W Tukey. Non-parametric estimation. I. Validation of order statistics. The Annals of Mathematical Statistics, 16(2):187–192, 1945.
Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.
Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from logged implicit exploration data. Advances in Neural Information Processing Systems, 23, 2010.
Zhiqiang Tan. A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association, 101(476):1619–1637, 2006.
Muhammad Faaiz Taufiq, Jean-Francois Ton, Rob Cornish, Yee Whye Teh, and Arnaud Doucet. Conformal off-policy prediction in contextual bandits. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=IfgOWI5v2f.
Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Anastasios A Tsiatis, Marie Davidian, Shannon T Holloway, and Eric B Laber.
Dynamic Treatment Regimes: Statistical Methods for Precision Medicine. Chapman and Hall/CRC, 2019.
Vladimir Vovk. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning, pp. 475–490. PMLR, 2012.
Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.
Abraham Wald. An extension of Wilks' method for setting tolerance limits. The Annals of Mathematical Statistics, 14(1):45–55, 1943.
Lan Wang, Yu Zhou, Rui Song, and Ben Sherwood. Quantile-optimal treatment regimes. Journal of the American Statistical Association, 113(523):1243–1254, 2018.
Daniel Westreich. Epidemiology by Design: A Causal Approach to the Health Sciences. Oxford University Press, 2019. ISBN 9780190665760. URL https://books.google.se/books?id=5R2yDwAAQBAJ.
Samuel S Wilks. Determination of sample sizes for setting tolerance limits. The Annals of Mathematical Statistics, 12(1):91–96, 1941.
Baqun Zhang, Anastasios A Tsiatis, Marie Davidian, Min Zhang, and Eric Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012.
Qingyuan Zhao, Dylan S Small, and Paul R Rosenbaum. Cross-screening in observational studies that test many hypotheses. Journal of the American Statistical Association, 113(523):1070–1084, 2018.
Qingyuan Zhao, Dylan S Small, and Bhaswar B Bhattacharya. Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(4):735–761, 2019.
Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.

A.1 Proof of Theorem 4.1

The first part of the proof builds on techniques used to derive weighted conformal prediction intervals in Tibshirani et al. (2019).
Let us consider a sequence of n − n0 samples drawn from (4) followed by a new sample drawn from (3), i.e.,

$$D_+ = \big( (X_{n_0+1}, U_{n_0+1}, A_{n_0+1}, L_{n_0+1}), \dots, (X_n, U_n, A_n, L_n), (X_{n+1}, U_{n+1}, A_{n+1}, L_{n+1}) \big).$$

The joint distribution of this sequence can be expressed as

$$\prod_{i=n_+}^{n} p(x_i, u_i, a_i, \ell_i) \cdot p(x_{n+1}, u_{n+1}, a_{n+1}, \ell_{n+1})\, w_{n+1} = p(D_+)\, w_{n+1} = p(S_+)\, w_{n+1}, \tag{22}$$

where n_+ = n_0 + 1 for notational simplicity, S_+ is the unordered set of elements from D_+, and the weight

$$w_i = \frac{p_\pi(x_i, u_i, a_i, \ell_i)}{p(x_i, u_i, a_i, \ell_i)}$$

is the (unobservable) ratio (5) that quantifies the distribution shift from the training to the target distribution. We shall use this expression for the joint distribution to derive the distribution function of the new loss L_{n+1}.

Suppose we are given the unordered set S_+ alone; then the particular sequence D_+ is unknown. Let E_i denote the event that the sample (X_{n+1}, U_{n+1}, A_{n+1}, L_{n+1}) equals the i-th sample (x_i, u_i, a_i, ℓ_i) in the unknown sequence D_+. We consider all possible sequences D_+ obtained by permutations σ of the elements in the set S_+. Using the joint distribution in (22), the joint probability of the event E_i and S_+ is then

$$P\{E_i, S_+\} = \sum_{\sigma : \sigma(n+1) = i} p(S_+)\, w_i = p(S_+)\, w_i\, (n - n_0)!,$$

where we use the symbol P for the samples that are drawn from the two different processes (4) and (3), and the factor (n − n0)! counts the permutations consistent with E_i (it is the same for every i and cancels below). The conditional probability of the event E_i can now be expressed as

$$p_i = P\{E_i \mid S_+\} = \frac{P\{E_i, S_+\}}{\sum_{j=n_+}^{n+1} P\{E_j, S_+\}} = \frac{w_i}{\sum_{j=n_+}^{n+1} w_j},$$

where the first equality follows from the law of total probability. The probability that the loss L_{n+1} of the new sample equals ℓ_i, when conditioning on the unordered set S_+, is equal to P{L_{n+1} = ℓ_i | S_+} = P{E_i | S_+} = p_i. Thus, conditional on S_+, the new loss L_{n+1} has the following cdf:

$$P\{L_{n+1} \le \ell \mid S_+\} = \sum_{i=n_+}^{n+1} p_i\, \mathbb{1}(\ell_i \le \ell) = \frac{\sum_{i=n_+}^{n+1} w_i\, \mathbb{1}(L_i \le \ell)}{\sum_{i=n_+}^{n+1} w_i}. \tag{23}$$

Before marginalizing out S_+ from (23), we consider a limit ℓ that is a function of the observable elements in S_+.
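To make the weighted distribution function in (23) concrete, here is a minimal numerical sketch. The array names are illustrative, and the oracle weights `w` are an assumption for the example only; in practice the ratio (5) is unobservable and only the bounds discussed in (11) are available:

```python
import numpy as np

def weighted_cdf(losses, w, ell):
    """Weighted empirical cdf in (23) at threshold ell:
    sum_i w_i * 1(L_i <= ell) / sum_i w_i."""
    losses = np.asarray(losses, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * (losses <= ell)) / np.sum(w)

# Toy calibration losses with hypothetical importance weights
losses = np.array([0.5, 1.2, 2.0, 3.1])
w = np.array([1.0, 0.5, 2.0, 0.5])

print(weighted_cdf(losses, w, 2.0))  # -> 0.875, the weighted mass at or below 2.0
```

With uniform weights this reduces to the ordinary empirical cdf; the weights reweight each calibration loss by how likely it is under the target policy relative to the training distribution.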
For this part, we build on the proof technique in (Jin et al., 2023, Thm. 2.2). Specifically, using (13) we define the following limit:

$$\ell_\alpha(D_1, \overline{W}_{n+1}) = \inf\Big\{ \ell : \hat{F}(\ell; D_1, \overline{W}_{n+1}) \ge \frac{1-\alpha}{1-\beta} \Big\} \tag{24}$$

for any 0 < β < α, where \(\overline{W}_{n+1} \ge W_{n+1}\) is given in (12). Now insert the limit ℓ_α(D_1, \overline{W}_{n+1}) into (23) and use the law of total expectation to marginalize out S_+:

$$P\{L_{n+1} \le \ell_\alpha(D_1, \overline{W}_{n+1})\} = \mathbb{E}\big[ P\{L_{n+1} \le \ell_\alpha(D_1, \overline{W}_{n+1}) \mid S_+\} \big] = \mathbb{E}\Bigg[ \frac{\sum_{i=n_+}^{n+1} W_i\, \mathbb{1}(L_i \le \ell_\alpha)}{\sum_{i=n_+}^{n+1} W_i} \Bigg].$$

We now proceed to lower bound this probability. Note that by construction:

$$\hat{F}(\ell_\alpha; D_1, \overline{W}_{n+1}) = \frac{\sum_{i \in D_1} \underline{W}_i\, \mathbb{1}(L_i \le \ell_\alpha)}{\sum_{i \in D_1} \underline{W}_i\, \mathbb{1}(L_i \le \ell_\alpha) + \sum_{i \in D_1} \overline{W}_i\, \mathbb{1}(L_i > \ell_\alpha) + \overline{W}_{n+1}} \ge \frac{1-\alpha}{1-\beta}.$$

Using this expression, we have that

$$\frac{\sum_{i=n_+}^{n+1} W_i\, \mathbb{1}(L_i \le \ell_\alpha)}{\sum_{i=n_+}^{n+1} W_i} \ge \frac{\sum_{i=n_+}^{n} W_i\, \mathbb{1}(L_i \le \ell_\alpha)}{\sum_{i=n_+}^{n} W_i\, \mathbb{1}(L_i \le \ell_\alpha) + \sum_{i=n_+}^{n} W_i\, \mathbb{1}(L_i > \ell_\alpha) + W_{n+1}} \ge \frac{\sum_{i=n_+}^{n} W_i\, \mathbb{1}(L_i \le \ell_\alpha)}{\sum_{i=n_+}^{n} W_i\, \mathbb{1}(L_i \le \ell_\alpha) + \sum_{i=n_+}^{n} \overline{W}_i\, \mathbb{1}(L_i > \ell_\alpha) + \overline{W}_{n+1}} \ge \frac{\sum_{i=n_+}^{n} \underline{W}_i\, \mathbb{1}(L_i \le \ell_\alpha)}{\sum_{i=n_+}^{n} \underline{W}_i\, \mathbb{1}(L_i \le \ell_\alpha) + \sum_{i=n_+}^{n} \overline{W}_i\, \mathbb{1}(L_i > \ell_\alpha) + \overline{W}_{n+1}} = \hat{F}(\ell_\alpha; D_1, \overline{W}_{n+1}),$$

where the first inequality drops the nonnegative term for i = n + 1 from the numerator, and the bounds in (11) give the remaining inequalities since x ↦ x/(x + c) is increasing in x and decreasing in c. Therefore we obtain a valid limit:

$$P\{L_{n+1} \le \ell_\alpha(D_1, \overline{W}_{n+1})\} \ge \frac{1-\alpha}{1-\beta}. \tag{25}$$

However, \overline{W}_{n+1} depends on a new sample drawn from the training distribution and thus the limit is unusable. In lieu of \overline{W}_{n+1}, we use w_β(D_0) in (15) to define the modified limit

$$\ell_\alpha(D) = \inf\Big\{ \ell : \hat{F}(\ell; D_1, w_\beta(D_0)) \ge \frac{1-\alpha}{1-\beta} \Big\}. \tag{26}$$

Comparing it with (24), we see that

$$\ell_\alpha(D) \ge \ell_\alpha(D_1, \overline{W}_{n+1}), \tag{27}$$

whenever \(\overline{W}_{n+1} \le w_\beta(D_0)\), since \(\hat{F}\) is nonincreasing in its weight argument. By the construction in (15), the probability of this event is lower bounded by

$$P\{\overline{W}_{n+1} \le w_\beta(D_0)\} \ge 1 - \beta, \tag{28}$$

see Vovk et al. (2005); Lei et al. (2018). We use this property to lower bound the probability of \(L_{n+1} \le \ell_\alpha(D)\).
First, note that

$$\begin{aligned} P\{L_{n+1} \le \ell_\alpha(D)\} &= P\{L_{n+1} \le \ell_\alpha(D) \mid \overline{W}_{n+1} \le w_\beta(D_0)\}\, P\{\overline{W}_{n+1} \le w_\beta(D_0)\} \\ &\quad + P\{L_{n+1} \le \ell_\alpha(D) \mid \overline{W}_{n+1} > w_\beta(D_0)\}\, P\{\overline{W}_{n+1} > w_\beta(D_0)\} \\ &\ge P\{L_{n+1} \le \ell_\alpha(D) \mid \overline{W}_{n+1} \le w_\beta(D_0)\}\, P\{\overline{W}_{n+1} \le w_\beta(D_0)\} + 0. \end{aligned}$$

The first factor can be lower bounded using (27), so that

$$\begin{aligned} P\{L_{n+1} \le \ell_\alpha(D)\} &\ge P\{L_{n+1} \le \ell_\alpha(D_1, \overline{W}_{n+1}) \mid \overline{W}_{n+1} \le w_\beta(D_0)\}\, P\{\overline{W}_{n+1} \le w_\beta(D_0)\} \\ &= P\{L_{n+1} \le \ell_\alpha(D_1, \overline{W}_{n+1})\}\, P\{\overline{W}_{n+1} \le w_\beta(D_0)\} \\ &\ge \frac{1-\alpha}{1-\beta}\, P\{\overline{W}_{n+1} \le w_\beta(D_0)\} \\ &\ge \frac{1-\alpha}{1-\beta}\, (1-\beta) = 1 - \alpha. \end{aligned} \tag{29}$$

The second line follows from sample splitting, which ensures that \(L_{n+1} \le \ell_\alpha(D_1, \overline{W}_{n+1})\) and \(\overline{W}_{n+1} \le w_\beta(D_0)\) are independent events. The third and fourth lines follow from (25) and (28), respectively. Since (29) holds for any 0 < β < α, we choose the β in (26) that yields the tightest limit, cf. (16). Note that P here represents the probability over samples drawn from both (4) and (3). Since L_{n+1} is drawn from (3), we can write (29) as in (2) for notational clarity.

A.2 Semi-real data

The Infant Health and Development Program (IHDP) investigated the impact of early childhood interventions on the health of low-birth-weight, premature infants (The Infant Health and Development Program, 1990). Hill (2011) used this study to assemble a data set of 25 covariates X that measure various aspects of the children and their mothers, such as birth weight, weeks born preterm, head circumference, and age of the mother. These covariates were standardized to have zero mean and unit standard deviation. The data set also records whether each child received special medical care, denoted by the action A. The original study was a randomised controlled study, but the data were made imbalanced by removing a non-random subset of the treated population. The unbalanced data set contains 747 children, including 139 treated and 608 control units. The outcome is the child's cognitive development score, which we reverse into an underdevelopment score so that it can be treated as a loss L.
The outcome is generated synthetically and we use response surface A from Hill (2011):

$$L \mid A = 0, X \sim \mathcal{N}(-\varphi^\top X,\, 1), \qquad L \mid A = 1, X \sim \mathcal{N}(-(\varphi^\top X + 4),\, 1).$$

The vector φ is discrete-valued, with elements drawn from {0, 1, 2, 3, 4} with probabilities {0.5, 0.2, 0.15, 0.1, 0.05}. We use a fitted logistic regression model for p̂(A|X). The limit (16) is computed using sample splits where n0 = 0.1n. Another 0.1n samples of the data are randomly held out to evaluate the policies in Figure 9. We compare two policies: the child did not receive special medical care, π0 ("treat none"), or the child received special medical care, π1 ("treat all"). The limit curves are presented in Figure 8 under different odds divergence bounds Γ ∈ {1, 1.5, 2} on the nominal model. When Γ = 1, the "treat all" policy results in the lowest limit curve. When Γ is increased to 2, the limit curves move closer to each other, and the "treat none" policy instead has the lowest limit curve. Figure 9 validates our guarantee in (2). The miscoverage gap (17) is estimated using 1000 simulation runs, where we draw a new vector φ and a new underdevelopment score L for each run. The proposed method is valid for all values of Γ, but is more conservative for the higher values. The benchmark does not take the unmeasured confounding into account and is invalid for the "treat all" policy (right).

[Figure 8: Limit curves ℓα for a synthetic underdevelopment score in a population under different policies. We compare two policies: the child did not receive special medical care (π0) or the child received special medical care (π1). The past policy is indicated by dotted curves (pπ = p). The credibility of the assumptions increases with the odds bound Γ ∈ {1, 1.5, 2}.]
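The synthetic loss generation above can be sketched as follows. This is a minimal illustration, not the authors' code: the covariates `X` here are simulated standard-normal stand-ins for the standardized IHDP covariates, and the action assignment `A` is a placeholder for the logged treatment indicators:

```python
import numpy as np

rng = np.random.default_rng(0)

# phi: 25 coefficients drawn from {0,...,4} with the stated probabilities
phi = rng.choice(5, size=25, p=[0.5, 0.2, 0.15, 0.1, 0.05])

# Stand-ins for the 747 standardized covariate vectors and logged actions
n = 747
X = rng.standard_normal((n, 25))
A = rng.integers(0, 2, size=n)

# Underdevelopment score, i.e. the negated response surface A of Hill (2011):
# L | A=0, X ~ N(-phi'X, 1)  and  L | A=1, X ~ N(-(phi'X + 4), 1)
L = rng.normal(-(X @ phi + 4 * A), 1.0)
```

Redrawing `phi` and `L` in each simulation run, as in the 1000-run evaluation, amounts to repeating this block with a fresh random seed.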
[Figure 9: Miscoverage gaps (17) versus target α for data from the Infant Health and Development Program (IHDP) under two different policies: the child did not receive special medical care (π0) or the child received special medical care (π1). The dashed curves denote the benchmark (18), which does not take confounding or modelling errors into account. The credibility of the assumptions increases with the odds bound Γ ∈ {1, 1.5, 2}.]
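To summarize the construction analyzed in Appendix A.1, the following sketch computes a tolerance weight wβ(D0) as an order statistic of the upper weight bounds and then a limit ℓα(D) from the conservative cdf estimate. All names are illustrative assumptions and the weight bounds would come from the marginal sensitivity model with odds bound Γ; this is a minimal sketch of the construction, not the authors' implementation:

```python
import math
import numpy as np

def tolerance_weight(w_upper_d0, beta):
    """w_beta(D0): the ceil((1-beta)(n0+1))-th smallest upper weight bound
    in the split D0, so that a new upper bound falls below it with
    probability at least 1 - beta (a one-sided tolerance limit)."""
    w = np.sort(np.asarray(w_upper_d0, dtype=float))
    n0 = len(w)
    k = math.ceil((1 - beta) * (n0 + 1))
    if k > n0:            # tolerance level not attainable with n0 samples
        return np.inf
    return w[k - 1]

def limit_curve(losses, w_lower, w_upper, w_new, alpha, beta):
    """ell_alpha(D): smallest calibration loss at which the conservative
    cdf estimate (lower bounds below the threshold, upper bounds above,
    plus the weight w_new for the new sample) reaches (1-alpha)/(1-beta)."""
    order = np.argsort(losses)
    L = np.asarray(losses, dtype=float)[order]
    lo = np.asarray(w_lower, dtype=float)[order]
    hi = np.asarray(w_upper, dtype=float)[order]
    below = np.cumsum(lo)                    # sum of lower bounds for L_i <= ell
    above = np.cumsum(hi[::-1])[::-1] - hi   # sum of upper bounds for L_i > ell
    fhat = below / (below + above + w_new)
    target = (1 - alpha) / (1 - beta)
    idx = np.nonzero(fhat >= target)[0]
    return L[idx[0]] if len(idx) else np.inf
```

With Γ = 1 the lower and upper bounds coincide with the nominal weights and the construction reduces to a weighted conformal-style limit; larger Γ widens the gap between `w_lower` and `w_upper`, which pushes `fhat` down and the limit curve up, matching the conservativeness seen in Figures 8 and 9.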