# Adaptive Instrument Design for Indirect Experiments

Published as a conference paper at ICLR 2024

Yash Chandak (Stanford University), Shiv Shankar (UMass Amherst), Vasilis Syrgkanis (Stanford University), Emma Brunskill (Stanford University)

ABSTRACT

Indirect experiments provide a valuable framework for estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical. Unlike RCTs, indirect experiments estimate treatment effects by leveraging (conditional) instrumental variables, enabling estimation through encouragement and recommendation rather than strict treatment assignment. However, the sample efficiency of such estimators depends not only on the inherent variability in outcomes but also on the varying compliance levels of users with the instrumental variables and the choice of estimator being used, especially when dealing with numerous instrumental variables. While adaptive experiment design has a rich literature for direct experiments, in this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy over instrumental variables. Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy, minimizing the mean-squared error of the desired (non-linear) estimator. Through experiments conducted in various domains inspired by real-world applications, we showcase how our method can significantly improve the sample efficiency of indirect experiments.

1 INTRODUCTION

Advances in machine learning, especially from large language models, are greatly expanding the potential of AI to augment and support humans in an enormous number of settings, such as helping direct teachers to students unproductively struggling while using math software (Holstein et al., 2018), providing suggestions to novice customer support operators (Brynjolfsson et al., 2023), and helping peer volunteers learn to be more effective mental health supporters (Hsu et al., 2023). AI augmentation will often (importantly) leave autonomy with the human, who may consider information or suggestions provided by an AI before ultimately making their own choice about how to proceed (we provide more example use-cases in Appendix B). In such settings it will be highly informative to separately estimate the strength of the intervention's influence on humans' treatment choices, as well as the actual treatment effects (the outcomes if the human chooses to follow the treatment of interest), instead of estimating only intent-to-treat effects. This estimation problem becomes more challenging given the likely existence of latent confounders that may influence both whether a person is likely to take up a particular recommendation (such as following a large language model's recommendation of what to say to a customer) and their downstream outcomes (such as customer satisfaction). Fortunately, if the AI intervention satisfies the requirements of an instrumental variable (Hartford et al., 2017; Syrgkanis et al., 2019), it is well known when and how consistent estimates of the treatment effect can be obtained. In this paper, we consider how to automatically learn data-efficient adaptive instrument-selection decision policies, in order to quickly estimate conditional average treatment effects.
While adaptive experiment design is well-studied when direct experimentation is feasible (Murphy, 2005; Xiong et al., 2019; Chaloner & Verdinelli, 1995; Rainforth et al., 2023; Che & Namkoong, 2023; Hahn et al., 2011; Kato et al., 2020; Song et al., 2023), it remains largely unexplored in situations where direct assignment to interventions is impractical, unethical, or undesirable, such as for the real-time AI support of teachers, customer support agents, and volunteer mental health support trainees. Some work has considered adaptive data collection to minimize regret when using instrumental variables (Kallus, 2018; Della Vecchia & Basu, 2023), which also relates to work in the multi-armed bandit literature. Other work has considered when user compliance may be dynamic, and shaped over time (Ngo et al., 2021). In contrast, our work focuses on estimating treatment effects through the adaptive allocation of instruments, but the impact of those instruments is assumed to be static, though unknown. As Figure 1 demonstrates, we will shortly show that it is possible to create adaptive instrument-selection strategies that far outperform standard uniform allocation.

Figure 1: In randomized control trials (direct experiments), there is no unobserved confounding, i.e., the value of the treatment is the same as that of the instrument. Naively using the adaptive experiment design strategy of RCTs for indirect experiments, i.e., when confounding does exist, can result in performance worse than the uniform sampling strategy. In contrast, the proposed algorithm for designing instruments adaptively (DIA) is more effective for indirect experiments. (Plot: percent of MSE of uniform vs. confounder strength, for a binary instrument and treatment; methods compared are Adaptive RCT, DIA, Uniform, and Oracle.)

In particular, our paper addresses this challenge of adaptive instrument design through two main avenues:

1. A general framework for adaptive indirect experiments: We envision modern scenarios where available instruments could be high-dimensional, even encompassing natural language (e.g., personalised texts to encourage participation in treatment or control groups). This necessitates modeling the sampling procedure with rich function approximators. Taking the first steps towards this goal, we restrict ourselves to relatively small synthetic and semi-synthetic experiments but prioritize understanding the fundamental challenges of creating such a flexible framework. In Section 3 we introduce influence function estimators for black-box two-stage causal estimators and a multi-rejection sampler. These tools enable us to perform a gradient-based optimization of a data collection policy π. Importantly, the proposed method is flexible enough to support π modeled using deep networks, and is applicable to various (non-)linear (conditional) IV estimators.

2. Balancing sample complexity and computational complexity: Experiment design is most valuable when sample efficiency is paramount, i.e., when data is relatively costly in terms of time or resources compared to computation. Therefore, we advocate for leveraging computational resources to enhance the search for an optimal data collection strategy π. Our framework is designed to minimize the need for expert knowledge and can readily scale with computational capabilities.

Perhaps the most relevant is the work by Gupta et al.
(2021) for enhancing sample efficiency for a specific (linear) IV estimator by learning a sampling distribution over a few data sources, where each source provides different moment conditions. This is complementary to our work, as we consider the problem of adaptive sampling even within a single data source, and using potentially non-linear estimators. A detailed related-work discussion is therefore deferred to Appendix A.

2 BACKGROUND AND PROBLEM SETUP

Figure 2: Causal graph.

Let $X \in \mathcal{X}$ be the observable covariates, $U \in \mathcal{U}$ the unobserved confounding variables, $Z \in \mathcal{Z}$ the instruments, $A \in \mathcal{A}$ the actions/treatments, and $Y \in \mathbb{R}$ the outcomes. Let $\pi : \mathcal{X} \to \Delta(\mathcal{Z})$ be an instrument-sampling policy, and let $\Pi$ be the set of all such policies. Let $D_n := \{S_i\}_{i=1}^{n}$ be a dataset of size $n$, where each data sample $S_i := \{X_i, Z_i, A_i, Y_i\}$ is collected using a (potentially different) policy $\pi_i \in \Pi$. With the model given in Figure 2, we assume that the causal relationship can be represented with the following structural model:

$$Y = g(X, A) + U, \qquad \mathbb{E}[U] = 0, \qquad \mathbb{E}[U \mid A, X] \neq 0, \tag{1}$$

where $g$ is an unknown (potentially non-linear) function. Importantly, note that the variable $U$ can affect not only the outcomes $Y$ but also $X$ and $A$. We consider treatment effects to be homogeneous across instruments, i.e., the CATE matches the conditional LATE (Angrist et al., 1996). Following Newey & Powell (2003) and Hartford et al. (2017), we define the counterfactual prediction as $f(x, a) := g(x, a) + \mathbb{E}[U \mid x]$, where $\mathbb{E}[U \mid x]$ corresponds to the fixed baseline effect across all treatments $a \in \mathcal{A}$, for a given covariate $x \in \mathcal{X}$. This definition of $f$ is useful, as it can be identified (discussed later) and can also be used to determine the effect of a treatment $a_1 \in \mathcal{A}$ compared to another treatment $a_2 \in \mathcal{A}$, since $f(x, a_1) - f(x, a_2) = g(x, a_1) - g(x, a_2)$. Further, in the simpler setting where $U \perp X$, we have $f(x, a) = g(x, a)$.

In the supervised learning setting (which assumes $U \perp A, X$), the prediction model estimates $\mathbb{E}[Y \mid x, a]$; however, this can result in a biased (and inconsistent) treatment effect estimate because $\mathbb{E}[Y \mid x, a] = g(x, a) + \mathbb{E}[U \mid x, a] \neq f(x, a)$. This issue can often be alleviated using instruments. For $Z$ to be a valid instrument, $Z$ should satisfy the following assumptions (red arrows in Figure 2):

Assumption 1. (a) Exclusion: $Z \perp Y \mid X, A$, i.e., instruments do not affect the outcomes directly. (b) Relevance: $Z \not\perp A \mid X$, i.e., the action chosen can be influenced using instruments. (c) Independence: $Z \perp U \mid X$, i.e., the unobserved confounder does not affect the instrument.

Under Assumption 1, an inverse problem can be designed for estimating $f$ with the help of $z$:

$$\mathbb{E}[Y \mid x, z] = \mathbb{E}[g(x, A) \mid x, z] + \mathbb{E}[U \mid x] = \mathbb{E}[f(x, A) \mid x, z] = \int_{\mathcal{A}} f(x, a)\, d\mathbb{P}(a \mid x, z). \tag{2}$$

However, identifiability of $f$ using (2) depends on specific assumptions. In the simple setting where $f$ is linear and $\mathbb{E}[U \mid x, z] = 0$, Angrist et al. (1996) discuss how the two-stage least-squares estimator (2SLS) can be used to identify $f$. In contrast, if $f$ is non-linear (or non-parametric), exact recovery of $f$ may not be viable, as equation (2) results in an ill-posed inverse problem (Newey & Powell, 2003; Xu et al., 2020). Different techniques have been proposed to mitigate this issue (Newey & Powell, 2003; Xu et al., 2020; Syrgkanis et al., 2019; Frauen & Feuerriegel, 2022); see Wu et al. (2022) for a survey. These methods often employ a two-stage procedure, where each stage solves a set of moment conditions. We formalize these below.

Let $f : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ be parametrized using $\theta \in \Theta$, where $\Theta$ is the set of all parameters.
Let $\theta_0 \in \Theta$ be the true parameter for $f$. Let $q(S_i; \phi)$ and $m(S_i; \theta, \phi)$ be the moment conditions for the nuisance parameters $\phi$ and the parameters of interest $\theta$, respectively, for each sample $S_i$. Let $M_n$ and $Q_n$ be the corresponding sample-average moments, such that

$$M_n(\theta, \hat\phi) := \frac{1}{n}\sum_{i=1}^{n} m(S_i; \theta, \hat\phi), \qquad Q_n(\phi) := \frac{1}{n}\sum_{i=1}^{n} q(S_i; \phi), \tag{3}$$

where $\hat\phi$ is the solution to $Q_n(\phi) = 0$, and the estimated parameters of interest $\hat\theta$ are obtained as a solution to $M_n(\theta, \hat\phi) = 0$. For example, in the basic 2SLS setting without covariates $X_i$ (Pearl, 2009), the first-stage moment is $q(S_i; \phi) := \nabla_\phi (Z_i\phi - A_i)^2$ and the second-stage moment is $m(S_i; \theta, \hat\phi) := \nabla_\theta (\hat{A}_i\theta - Y_i)^2$, where $\hat{A}_i = Z_i\hat\phi$. Similarly, to solve (2) in the non-linear setting with deep networks, Hartford et al. (2017) use $q(S_i; \phi) := \nabla_\phi \log \Pr(A_i \mid X_i, Z_i; \phi)$ to estimate the density $d\mathbb{P}(A_i \mid X_i, Z_i; \phi)$ with a neural density estimator, and subsequently define $m(S_i; \theta, \hat\phi) := \nabla_\theta \big(Y_i - \int f(X_i, a; \theta)\, d\mathbb{P}(a \mid X_i, Z_i; \hat\phi)\big)^2$, where $f$ is also a neural network parametrized by $\theta$.

To make the dependency of the estimated parameters of $f$ on the data explicit, we denote the estimated parameter as $\theta(D_n)$. Further, we assume that $\theta(D_n)$ is symmetric in $D_n$, i.e., $\theta(D_n)$ is invariant to the ordering of samples in the dataset $D_n$. This holds for 2SLS, Deep IV (Hartford et al., 2017), and most other popular estimators (Syrgkanis et al., 2019; Wu et al., 2022). We also consider $\theta(D_n)$ to be deterministic for a given $D_n$. In Appendix H.2, we discuss how the proposed algorithm can be used as-is in the setting where $\theta(D_n)$ is stochastic given $D_n$.

Problem Statement: While these estimators mitigate the confounder bias in the estimation of $f$, they are often subject to high variance (Imbens, 2014). In this work, we aim to develop an adaptive data collection policy that can improve the sample efficiency of the estimate of the counterfactual prediction $f$. Importantly, we aim to develop an algorithmic framework that can work with general (non-)linear two-stage estimators. Specifically, let $X, A \sim P_{x,a}$ be samples from a fixed distribution over which the expected mean-squared error is evaluated,

$$\mathcal{L}(\pi) := \mathbb{E}_{D_n \sim \pi}\big[\mathrm{MSE}(\theta(D_n))\big], \qquad \mathrm{MSE}(\theta) := \mathbb{E}_{X,A}\Big[\big(f(X, A; \theta_0) - f(X, A; \theta)\big)^2\Big]. \tag{4}$$

To begin, we aim to find a single policy $\pi^* \in \arg\min_{\pi \in \Pi} \mathcal{L}(\pi)$ for the most sample-efficient estimation of $\theta$. In Section 3.3, we will adapt this solution for sequential collection of data.

3 APPROACH

As Figure 1 illustrates, selecting instruments strategically to generate a dataset can have a substantial impact on the efficiency of indirect experiments. Importantly, as $\mathcal{L}(\pi)$ is a function of both the data $D_n$ and the estimator $f(\cdot\,;\theta)$, an optimal $\pi$ depends not only on the data-generating process (e.g., heteroskedasticity in compliances and outcomes $Y$ across instruments $Z$, even for a given covariate $X$) but also on the actual procedure used to compute an estimate of the parameter of interest $\theta$, such as 2SLS. In Appendix C, we present a simple illustrative case of linear models for conditional average treatment effect (CATE) estimation, where each aspect can be observed explicitly. Now we provide the key insights for a general gradient-based optimization procedure to find an instrument-selection decision policy $\pi$ that tackles all of these problems simultaneously. A formal analysis of the claims and the algorithmic choices we make, and how they contrast with some alternatives, is presented later in Section 4.
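To make the two-stage structure of (3) and the objective in (4) concrete before describing the optimization, the following is a minimal sketch (in Python/NumPy) of the covariate-free 2SLS special case described above. The function names, the linear parameterization $f(a;\theta) = a\theta$, and the plug-in evaluation of the MSE are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def two_stage_ls(Z, A, Y):
    """Black-box estimator theta(D_n) for the covariate-free 2SLS case:
    stage 1 solves Q_n(phi)=0, stage 2 solves M_n(theta, phi_hat)=0 (cf. Eq. 3)."""
    phi_hat, *_ = np.linalg.lstsq(Z, A, rcond=None)      # compliance model: A ~ Z phi
    A_hat = Z @ phi_hat                                  # predicted treatment given the instrument
    theta_hat, *_ = np.linalg.lstsq(A_hat[:, None], Y, rcond=None)
    return theta_hat

def mse(theta_hat, theta_0, A_eval):
    """Plug-in version of Eq. (4) with f(a; theta) = a * theta, evaluated on A_eval ~ P_{x,a}."""
    return float(np.mean((A_eval * theta_0 - A_eval * theta_hat) ** 2))
```

In the adaptive-design problem below, `two_stage_ls` plays the role of the black-box estimator $\theta(D_n)$ whose MSE the data collection policy is optimized against.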
As our main objective is to search for $\pi^*$, we propose computing the gradient of $\mathcal{L}(\pi)$, which can be expressed as follows using the popular REINFORCE approach (Williams, 1992):

$$\nabla \mathcal{L}(\pi) = \mathbb{E}_{D_n \sim \pi}\Big[\mathrm{MSE}(\theta(D_n))\, \nabla \log \Pr(D_n; \pi)\Big] \overset{(a)}{=} \mathbb{E}_{D_n \sim \pi}\Big[\mathrm{MSE}(\theta(D_n)) \sum_{i=1}^{n} \nabla \log \pi(Z_i \mid X_i)\Big], \tag{5}$$

where (a) follows because $\Pr(D_n; \pi)$, the probability of observing the entire dataset $D_n$, factorizes as

$$\Pr(D_n; \pi) = \prod_{i=1}^{n} \Pr(X_i, U_i)\, \pi(Z_i \mid X_i)\, \Pr(A_i \mid X_i, U_i, Z_i)\, \Pr(Y_i \mid X_i, U_i, A_i). \tag{6}$$

The gradient in (5) is appealing as it provides an unbiased direction to update $\pi$, and it encapsulates the properties of both the data-generating process and the estimator as a black box in $\mathrm{MSE}(\theta(D_n))$. However, using $\nabla\mathcal{L}(\pi)$ for optimization is impractical due to two main challenges:

Challenge 1: As we state more formally in Section 4, the variance of the sample estimate of $\nabla\mathcal{L}(\pi)$ can be $\Theta(n)$, i.e., it can grow linearly with the size of $D_n$. This can make the optimization process prohibitively inefficient as we collect more data.

Challenge 2: Optimization using (5) also requires evaluating $\nabla\mathcal{L}(\pi')$ for a policy $\pi'$ different from the $\pi$ that was used to collect the data. If the off-policy correction is done using importance weighting (Precup, 2000), then the variance might grow exponentially with $n$.

To address these two issues, we now demonstrate how influence functions and multi-rejection importance sampling can leverage the structure of our indirect experiment design setup to not only significantly reduce the variance of the resulting gradient estimate, but also do so in a computationally practical manner that is compatible with deep neural networks. The pseudocode for our algorithmic approach is shown in Appendix H.

3.1 INFLUENCE FUNCTIONS FOR CONTROL VARIATES

To address Challenge 1, a common approach in reinforcement learning and other areas is to introduce control variates, which reduce the variance of estimation without introducing additional bias. In particular, we observe that for our setting, the MSE computed using all other data points $D_{n\setminus i}$ (i.e., all of $D_n$ except the sample $S_i$ used in the $\nabla\log\pi(Z_i\mid X_i)$ term) serves as a control variate:

$$\widehat{\nabla}_{\mathrm{CV}} \mathcal{L}(\pi) := \sum_{i=1}^{n} \nabla \log \pi(Z_i \mid X_i)\, \Big(\mathrm{MSE}(\theta(D_n)) - \mathrm{MSE}\big(\theta(D_{n\setminus i})\big)\Big). \tag{7}$$

As we illustrate in Figure 3 and formalize in Section 4, this immediately results in a substantial variance reduction over the originally proposed gradient (5). However, the computation required for estimating $\widehat{\nabla}_{\mathrm{CV}}\mathcal{L}(\pi)$ can be prohibitively expensive, as for each $\nabla\log\pi(Z_i\mid X_i)$ term in (7) an entire separate re-training process is required to estimate the control variate $\mathrm{MSE}(\theta(D_{n\setminus i}))$. To be efficient in terms of both variance and computation, we now show that $\mathrm{MSE}(\theta(D_n)) - \mathrm{MSE}(\theta(D_{n\setminus i}))$ can be estimated using influence functions (Fisher & Kennedy, 2021). As we discuss below, this requires only a single optimization process for estimating $\mathrm{MSE}(\theta(D_n)) - \mathrm{MSE}(\theta(D_{n\setminus i}))$ across all $i$'s, and thus preserves the best of both $\widehat{\nabla}_{\mathrm{CV}}\mathcal{L}(\pi)$ and $\widehat{\nabla}\mathcal{L}(\pi)$.

Figure 3: Bias and variance of different gradient estimators, for a given policy π, when using a 2SLS estimator under model misspecification, as a function of the number of samples. Observe the scale on the y-axes.
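Before turning to the influence-function construction, here is a minimal sketch contrasting the naive REINFORCE gradient (5) with the leave-one-out control-variate estimator (7). The `policy.log_prob_grad` and `mse_fn` interfaces are hypothetical placeholders standing in for the sampling policy's score function and the black-box MSE of the fitted estimator; the second routine makes explicit why (7) is expensive without influence functions.

```python
def reinforce_grad(data, policy, mse_fn):
    """Naive estimator of Eq. (5): one shared MSE multiplies the summed score terms."""
    full_mse = mse_fn(data)                        # MSE(theta(D_n)), estimator treated as a black box
    return full_mse * sum(policy.log_prob_grad(s.z, s.x) for s in data)

def reinforce_grad_cv(data, policy, mse_fn):
    """Control-variate estimator of Eq. (7): each score term is paired with a
    leave-one-out MSE, which requires n expensive refits without influence functions."""
    full_mse = mse_fn(data)
    grad = 0.0
    for i, s in enumerate(data):
        loo_mse = mse_fn(data[:i] + data[i + 1:])  # MSE(theta(D_{n \ i})): a full refit per sample
        grad += policy.log_prob_grad(s.z, s.x) * (full_mse - loo_mse)
    return grad
```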
Let $D$ be the distribution function induced by the dataset $D_n$, and with a slight overload of notation, let $\theta(D_n) \equiv \theta(D)$. Let $D_{i,\epsilon} := (1-\epsilon)D + \epsilon\,\delta(S_i)$ be a distribution perturbed in the direction of $S_i$, where $\delta(\cdot)$ denotes the Dirac distribution. Note that for $\epsilon = -1/(n-1)$, the distribution $D_{i,\epsilon}$ corresponds to the distribution function without the sample $S_i$, i.e., the distribution induced by $D_{n\setminus i}$.

From this perspective, $\mathrm{MSE}(\theta(\cdot))$ corresponds to a functional of the data distribution function $D$, and thus we can perform a Von Mises expansion (Fernholz, 2012), i.e., a distributional analog of the Taylor expansion for statistical functionals, to directly approximate $\mathrm{MSE}(\theta(D_{i,\epsilon}))$ in terms of $\mathrm{MSE}(\theta(D))$ as

$$\mathrm{MSE}(\theta(D_{i,\epsilon})) = \mathrm{MSE}(\theta(D)) + \sum_{k=1}^{\infty} \frac{\epsilon^{k}}{k!}\, I^{(k)}_{\mathrm{MSE}}(S_i), \tag{8}$$

where $I^{(k)}_{\mathrm{MSE}}$ is the $k$-th order influence function (see the work by Fisher & Kennedy (2021) and Kahn (2015) for an accessible introduction to influence functions). Therefore, when all the higher-order influence functions exist, we can use (8) to recover an estimate of $\mathrm{MSE}(\theta(D_n)) - \mathrm{MSE}(\theta(D_{n\setminus i}))$ without re-training. In practice, approximating (8) with a $K$-th order expansion often suffices, which permits a gradient estimator $\widehat{\nabla}_{\mathrm{IF}}\mathcal{L}(\pi)$ that avoids any re-computation of the mean-squared error:

$$\widehat{\nabla}_{\mathrm{IF}} \mathcal{L}(\pi) := \sum_{i=1}^{n} \nabla \log \pi(Z_i \mid X_i)\, \mathcal{I}_{\mathrm{MSE}}(S_i), \quad\text{where}\quad \mathcal{I}_{\mathrm{MSE}}(S_i) := -\sum_{k=1}^{K} \frac{\epsilon^{k}}{k!}\, I^{(k)}_{\mathrm{MSE}}(S_i). \tag{9}$$

In fact, as we illustrate in Figure 3 and state formally in Section 4, when using finitely many terms $K$, the bias of the gradient in (9) is of the order $n^{-K}$, while still providing significant variance reduction. This allows the bias to be neglected even when using $K = 1$. Intuitively, for $k = 1$, $I^{(k)}_{\mathrm{MSE}}$ operationalizes the concept of the first-order derivative for the functional $\mathrm{MSE}(\theta(\cdot))$, i.e., it characterizes the effect on the MSE when the training sample $S_i$ is infinitesimally up-weighted during optimization:

$$I^{(1)}_{\mathrm{MSE}}(S_i) := \frac{d}{dt}\, \mathrm{MSE}\Big(\theta\big((1-t)D + t\,\delta(S_i)\big)\Big)\Big|_{t=0}. \tag{10}$$

A key contribution is to derive (Theorem 2 in Appendix D) the following form for $I^{(1)}_{\mathrm{MSE}}(S_i)$:

$$I^{(1)}_{\mathrm{MSE}}(S_i) = \frac{d}{d\theta}\mathrm{MSE}(\hat\theta)^{\top}\, I^{(1)}_{\theta}(S_i), \tag{11}$$

$$\text{where}\quad I^{(1)}_{\theta}(S_i) = -\left(\frac{\partial M_n}{\partial \theta}(\hat\theta, \hat\phi)\right)^{-1}\left( m(S_i; \hat\theta, \hat\phi) - \frac{\partial M_n}{\partial \phi}(\hat\theta, \hat\phi)\left(\frac{\partial Q_n}{\partial \phi}(\hat\phi)\right)^{-1} q(S_i; \hat\phi)\right). \tag{12}$$

Note that these expressions generalize the influence function for standard machine learning estimators (e.g., least-squares, classification) (Koh & Liang, 2017) to more general estimators (e.g., IV estimators, GMM estimators, double machine-learning estimators) that involve black-box estimation of nuisance parameters $\phi$ alongside the parameters of interest $\theta$, as discussed in (3). In Appendix H we show how $I^{(1)}_{\theta}(S_i)$ can be computed efficiently, without ever explicitly storing or inverting derivatives of $M_n$ or $Q_n$, using Hessian-vector products (Pearlmutter, 1994) readily available in auto-diff packages (Bradbury et al., 2018; Paszke et al., 2019). Our approach is inspired by techniques used to scale implicit gradients and influence function computation to models with millions of parameters (Lorraine et al., 2020). In Appendix H, we also discuss how $I^{(1)}_{\theta}(S_i)$ can be further re-used to partially estimate $I^{(2)}_{\mathrm{MSE}}(S_i)$, thereby making $\mathcal{I}_{\mathrm{MSE}}(S_i)$ more accurate.
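As a concrete (if simplified) illustration of (11)-(12), the sketch below computes the first-order influence of a sample with PyTorch autodiff. It forms the small Jacobians of the averaged moments explicitly; the paper's Appendix H instead avoids storing or inverting these matrices via Hessian-vector products, so this is a readability-oriented sketch under those assumptions, not the scalable procedure. All function and argument names are assumptions.

```python
import torch

def first_order_influence(theta_hat, phi_hat, m_fn, q_fn, M_n, Q_n, mse_fn, S_i):
    """I^(1)_MSE(S_i) via Eqs. (11)-(12), with explicit Jacobians (small theta, phi only)."""
    jac = torch.autograd.functional.jacobian
    dM_dtheta = jac(lambda th: M_n(th, phi_hat), theta_hat)  # d M_n / d theta at (theta_hat, phi_hat)
    dM_dphi = jac(lambda ph: M_n(theta_hat, ph), phi_hat)    # d M_n / d phi
    dQ_dphi = jac(Q_n, phi_hat)                              # d Q_n / d phi
    # Influence of S_i on the nuisance phi, propagated into theta (Eq. 12).
    infl_phi = torch.linalg.solve(dQ_dphi, q_fn(S_i, phi_hat))
    infl_theta = -torch.linalg.solve(dM_dtheta,
                                     m_fn(S_i, theta_hat, phi_hat) - dM_dphi @ infl_phi)
    # Chain rule through the MSE functional (Eq. 11).
    dmse_dtheta = jac(mse_fn, theta_hat)
    return dmse_dtheta @ infl_theta
```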
3.2 MULTI-REJECTION SAMPLING FOR DISTRIBUTION CORRECTION

Recall from Challenge 2 that a gradient-based search for $\pi$ requires evaluating the gradient at a policy $\pi'$ using data collected from $\pi \neq \pi'$. A common technique in RL to address this is importance sampling (IS) (Precup, 2000; Thomas, 2015). Unfortunately, the effective importance ratio when using IS in our setting is a product of the importance ratios for all $n$ data points (see Equation 165 in Appendix F). This can result in the variance of gradient estimates being exponential in the number of samples $n$, in the worst case. We illustrate this in Figure 4 and discuss it formally in Appendix F.

Figure 4: (Left) Changing k did not impact the π optima, as the orderings were preserved (MSE estimates shown for k = n, 0.75n, and 0.5n). (Right) Variance of rejection sampling (RS) vs. importance sampling (IS) as the number of samples grows. See Figure 8 for the plot with log-scale. Observe the scale on the y-axes.

We now introduce a multi-rejection importance sampling approach to estimate the gradient under a decision policy different from the one used to gather the historical data. Unlike in RL, in our setting the estimator $\theta(D_n)$ can be considered symmetric for most popular estimators, as the order of samples in $D_n$ does not matter. Therefore, instead of performing importance sampling over the whole prior dataset, we propose creating a new dataset of size $k$ by employing rejection sampling. Critically, the importance ratio used to accept (or reject) each sample $S_i$ when evaluating a new policy $\pi'$ depends only on the importance ratio $\rho(s)$ for that single sample $s$, instead of the product of those ratios over the entire dataset:

$$S_i = \begin{cases} \text{Accept} & \text{if } \xi_i \leq \rho(S_i)/\rho_{\max} \\ \text{Reject} & \text{otherwise,} \end{cases} \qquad \rho(S_i) := \frac{\Pr(S_i; \pi')}{\Pr(S_i; \pi_i)} = \frac{\pi'(Z_i \mid X_i)}{\pi_i(Z_i \mid X_i)}, \tag{13}$$

where $\xi_i \sim \mathrm{Uniform}(0, 1)$ and $\rho_{\max} := \max_s \rho(s)$. Note that (13) allows simulating samples from the data distribution governed by $\pi'$ without requiring any knowledge of the unobserved confounders $U_i$. As we prove in Section 4, this ensures that the variance of the resulting dataset scales with the worst single-sample importance ratio, instead of exponentially in the dataset size $n$. This process can therefore be used to generate a smaller, $k$-sized dataset with the same distribution of data as sampling from the desired policy $\pi'$. As we illustrate in Figure 4 and discuss formally in Appendix F.4, it is reasonable to consider that the instrument-selection policy ordering (in terms of MSE) is robust to the dataset size, i.e., $\mathbb{E}_{\pi}[\mathrm{MSE}(\theta(D_n))] \leq \mathbb{E}_{\pi'}[\mathrm{MSE}(\theta(D_n))]$ implies that $\mathbb{E}_{\pi}[\mathrm{MSE}(\theta(D_k))] \leq \mathbb{E}_{\pi'}[\mathrm{MSE}(\theta(D_k))]$, where $k < n$. In addition, as we state formally in Property 4, if $k = o(n)$, then the probability of the failure case, where the number of samples $n_{\mathrm{RS}}$ accepted by rejection sampling is less than the desired $k$ samples, decays exponentially. This insight is advantageous, as we can now directly obtain datasets $D_k$ from $\pi'$, albeit of size $k < n$, to evaluate $\pi'$, as desired. We implement this by first running the rejection process for all samples, yielding a set of $n_{\mathrm{RS}}$ samples that are drawn from $\pi'$, and then selecting a random subset of size $k$ from the $n_{\mathrm{RS}}$ accepted samples. For the sub-sample $D_k$, analogous to the true MSE in (4), we calculate an estimate of the MSE using $X_i$ and $A_i$ drawn from a held-out dataset with $M$ samples from $P_{x,a}$:

$$\widehat{\mathrm{MSE}}(\theta(D_k)) = \frac{1}{M}\sum_{i=1}^{M} \big(f(X_i, A_i; \theta(D_n)) - f(X_i, A_i; \theta(D_k))\big)^2. \tag{14}$$

Note that, in general, our historical data may come from a sequence of decision policies, as we adaptively deploy new instrument-selection policies. In this setting, a straightforward application of rejection sampling requires the support assumption that $\pi_i(Z_i \mid X_i) > 0$ whenever $\pi'(Z_i \mid X_i) > 0$, for all $i \in \{1, ..., n\}$. This may restrict the class of new instrument-selection policies that can be deployed. Instead, we propose the following multi-rejection sampling strategy:

$$S_i = \begin{cases} \text{Accept} & \text{if } \xi_i \leq \bar\rho(S_i)/\bar\rho_{\max} \\ \text{Reject} & \text{otherwise,} \end{cases} \qquad \bar\rho(S_i) := \frac{\pi'(Z_i \mid X_i)}{\frac{1}{n}\sum_{j=1}^{n} \pi_j(Z_i \mid X_i)}, \tag{15}$$

where $\bar\rho_{\max} := \max_s \bar\rho(s)$. As we show in Appendix G, this relaxes the support assumption enforced for every $\pi_i$ to a support assumption over the union of supports of $\{\pi_i\}_{i=1}^{n}$, while still ensuring that the accepted samples follow the distribution specified by $\pi'$.
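A minimal sketch of the multi-rejection step (15) and the proxy MSE (14) is given below. The per-sample fields `s.z`, `s.x` and the policy call signatures are illustrative assumptions; in practice the normalizing constant $\bar\rho_{\max}$ can be replaced by the empirical maximum over the observed samples, in the spirit of empirical supremum rejection sampling (Caffo et al., 2002).

```python
import numpy as np

def multi_rejection_sample(data, behavior_policies, target_policy, rng):
    """Accept samples per Eq. (15): ratio of the target policy to the average behavior policy."""
    avg_behavior = np.array([np.mean([pi(s.z, s.x) for pi in behavior_policies]) for s in data])
    rho = np.array([target_policy(s.z, s.x) for s in data]) / avg_behavior
    accept = rng.uniform(size=len(data)) <= rho / rho.max()   # empirical stand-in for rho_max
    return [s for s, keep in zip(data, accept) if keep]

def proxy_mse(f, theta_full, theta_sub, heldout_xa):
    """Eq. (14): compare predictions under theta(D_k) against theta(D_n) on held-out (X, A) pairs."""
    return float(np.mean([(f(x, a, theta_full) - f(x, a, theta_sub)) ** 2 for x, a in heldout_xa]))
```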
For example, if the initial data was collected using a uniform instrument-selection decision policy, then subsequent policies can be arbitrarily deterministic and the multi-rejection procedure can still use all the data and remain valid. This makes it particularly well suited for our adaptive experiment design framework. In Appendix G, we further prove that the expected number of samples $n_{\mathrm{MRS}}$ accepted under multi-rejection sampling is always greater than or equal to the number $n_{\mathrm{RS}}$ obtained using standard rejection sampling, irrespective of $\{\pi_i\}_{i=1}^{n}$.

3.3 DIA: DESIGNING INSTRUMENTS ADAPTIVELY

Using influence functions and multi-rejection sampling, we have addressed both of the challenges concerning high variance. Leveraging the flexibility offered by the proposed gradient-based procedure, we can now readily extend the idea to sequential collection of data. To do so, we propose a parameterization of π that accounts for the data already collected, which is important during adaptive instrument-selection policy design. Let $n$ be the size of the dataset $D_n = \{S_i\}_{i=1}^{n}$ collected so far, and let $N$ be the budget for the total number of samples that can be collected. At each stage of data collection, DIA searches for a policy $\pi_\varphi$, parameterized by $\varphi$, such that when the remaining $D_{N-n} = \{S_i\}_{i=n+1}^{N}$ samples are collected using $\pi_\varphi$, the combined dataset $D_N := D_n \cup D_{N-n}$ approximates the optimal distribution for estimating $\theta$. Let $\pi^{\mathrm{eff}}_\varphi$ denote the effective policy that would have generated $D_N$ if $D_{N-n}$ were collected using $\pi_\varphi$:

$$\pi^{\mathrm{eff}}_\varphi(z \mid x) := \frac{1}{N}\left(\sum_{i=1}^{n} \pi_i(z \mid x) + (N - n)\, \pi_\varphi(z \mid x)\right). \tag{16}$$

Let $\mathcal{I}_{\widehat{\mathrm{MSE}}}(S_i)$ be the influence function analogous to (9), where the MSE is replaced with its proxy (14), and let $\{S_i\}_{i=1}^{k}$ below be the samples for $\pi^{\mathrm{eff}}_\varphi$ obtained using the multi-rejection sampling in (15). For the first batch, $\pi_\varphi$ is set to the uniform distribution, and for the later batches we leverage (9) and (14) to update $\pi_\varphi$ using

$$\widehat{\nabla}_{\mathrm{IF}} \mathcal{L}(\pi^{\mathrm{eff}}_\varphi) := \sum_{i=1}^{k} \nabla_\varphi \log \pi^{\mathrm{eff}}_\varphi(Z_i \mid X_i)\, \mathcal{I}_{\widehat{\mathrm{MSE}}}(S_i). \tag{17}$$

We use $\pi_\varphi$ to collect the next batch of data and then update $\pi_\varphi$ again, as discussed above, so that the final dataset $D_N = \{S_i\}_{i=1}^{N}$ approximates, as closely as possible, the optimal distribution for estimating $\theta$. Intuitively, this technique is similar to receding horizon control, where the controller is planned for the entire remaining horizon of length $N - n$, but is updated and refined periodically. A pseudo-code for the proposed algorithm is provided in Appendix H.
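To summarize how the pieces fit together, the following is a high-level sketch of the adaptive loop of Section 3.3. Every callable argument (`collect`, `fit_estimator`, `effective_policy`, `multi_rejection_sample`, `policy_gradient_step`) is an injected stand-in for the corresponding component described above; this mirrors, but does not reproduce, the pseudo-code of Appendix H.

```python
import numpy as np

def dia_loop(total_budget, batch_size, uniform_policy, collect, fit_estimator,
             effective_policy, multi_rejection_sample, policy_gradient_step, n_inner_steps=10):
    """Designing Instruments Adaptively: collect in batches, refit theta, re-optimize pi_phi."""
    data, behavior_policies = [], []
    policy = uniform_policy                                    # first batch uses uniform allocation
    rng = np.random.default_rng(0)
    while len(data) < total_budget:
        batch = collect(policy, batch_size)                    # deploy pi_phi, observe (X, Z, A, Y)
        data += batch
        behavior_policies += [policy] * len(batch)
        theta_full = fit_estimator(data)                       # black-box two-stage estimate theta(D_n)
        for _ in range(n_inner_steps):                         # gradient steps on Eq. (17)
            pi_eff = effective_policy(behavior_policies, policy, total_budget)      # Eq. (16)
            subset = multi_rejection_sample(data, behavior_policies, pi_eff, rng)   # Eq. (15)
            policy = policy_gradient_step(policy, subset, theta_full)               # Eqs. (14), (9)
    return fit_estimator(data), policy
```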
4 THEORETICAL ANALYSIS

We now provide theoretical statements about the impact of our algorithmic design choices, relative to alternatives, in terms of their benefit on the variance. Proofs and more formal, detailed statements of the theorems are deferred to Appendix E. We first state that the variance of the naive gradient can scale poorly with the dataset size:

Property 1 (Informal). If $\mathrm{MSE}(\theta(D_n))$ is leave-one-out stable and $n\,\|\mathrm{MSE}(\theta(D_n))\|^2_{L_2} = \Omega(\alpha(n)) = \omega(1)$, then $\mathrm{Var}(\widehat{\nabla}\mathcal{L}(\pi)) = \Omega(\alpha(n))$ (c.f. Theorem 3 in Appendix).

In many cases, we would expect the mean-squared error to decay at the rate $n^{-1/2}$ (unless we can invoke the fast parametric rate of $1/n$, which holds under strong convexity assumptions); in such cases $\alpha(n) = \omega(1)$ and $\mathrm{Var}(\widehat{\nabla}\mathcal{L}(\pi))$ can grow with the sample size. Moreover, if the mean-squared error does not converge to zero, due to persistent approximation error, local optima in the optimization, or because (2) is ill-posed when dealing with non-parametric estimators, then the variance can grow linearly, i.e., $\alpha(n) = \Theta(n)$. See Figure 3.

In contrast, the variance of the gradient of the loss with the MSE control variates (7) is often independent of the dataset size:

Property 2 (Informal). $\mathbb{E}[\widehat{\nabla}_{\mathrm{CV}}\mathcal{L}(\pi)] = \nabla\mathcal{L}(\pi)$. Further, if $\mathrm{MSE}(\theta(D_n))$ is leave-one-out stable then $\mathrm{Var}(\widehat{\nabla}_{\mathrm{CV}}\mathcal{L}(\pi)) = O(1)$ (c.f. Theorem 4 in Appendix).

Figure 5: Performance gains when considering a different number of instrumental variables, with a total of 5 re-allocations. The first allocation (1000 samples) is done using uniform sampling, and thus all the methods have the same performance initially. DIA-X refers to the application of the proposed algorithm to the domain setting where the number of instruments is X. Note that since the MSE under the uniform distribution also keeps decreasing, a saturating or increasing trend of the ratio does not imply that the MSE is increasing. (Left) Results for the IV setting (using a closed-form linear estimator); dotted lines correspond to the oracle estimate for the respective number of instruments. (Middle) Results for the conditional IV setting (using a logistic-regression-based estimator). (Right) Results for the Trip Advisor domain (Syrgkanis et al., 2019) (using a neural-network-based estimator). In all panels the y-axis is the percent of MSE of uniform and the x-axis is the number of samples. Error bars correspond to one standard error.

Unlike Property 1, Property 2 holds irrespective of $\alpha(n)$, thus providing a reliable gradient estimator even if the function approximator is mis-specified, the optimization is susceptible to local optima, or (2) is ill-posed when dealing with non-parametric estimators. Further, our loss using influence functions (9) can yield a similar variance reduction at much lower computational cost:

Property 3 (Informal). If the Von Mises expansion in (8) exists, $\mathbb{E}[\widehat{\nabla}_{\mathrm{IF}}\mathcal{L}(\pi)] = \nabla\mathcal{L}(\pi) + O(n^{-K})$. If $\mathrm{MSE}(\theta(D_n))$ is leave-one-out stable then $\mathrm{Var}(\widehat{\nabla}_{\mathrm{IF}}\mathcal{L}(\pi)) = O(1)$ (c.f. Theorem 5 in Appendix).

We then prove how rejection sampling impacts the MSE estimate, enabling an exponential reduction in variance compared to importance sampling, whose variance can depend on $\rho_{\max}^{k}$ (c.f. Theorem 7 in Appendix), at the expense of only a bounded failure rate of not producing the desired number of samples $k$. For this result, let $\mathrm{MSE}^{\mathrm{RS}}_{k}$ be the estimate of the MSE computed as the mean of $\widehat{\mathrm{MSE}}(\theta(D_k))$ across all the subsets $D_k$ of the $n_{\mathrm{RS}}$ accepted samples.

Property 4 (Informal). Let $\mathbb{E}_{\pi}[\mathrm{MSE}(\theta(D_k))^2] = \alpha(k)^2$. Then $\mathrm{Var}\big(\mathrm{MSE}^{\mathrm{RS}}_{k}\big) = O\big(\alpha(k)^2 k \rho_{\max}/n + \alpha(n)\big)$ and $\mathrm{Bias}\big(\mathrm{MSE}^{\mathrm{RS}}_{k}\big) = O\big(e^{-n/\rho_{\max}} + \sqrt{\alpha(n)}\big)$. Further, the failure probability is bounded as $\Pr(n_{\mathrm{RS}} < k) \leq e^{-n/(8\rho_{\max})}$ (c.f. Theorem 8 in Appendix).

We also show in the Appendix (c.f. Corollary 1) that the above theorem, together with further regularity conditions, can be used to argue that ranking policies based on $\mathrm{MSE}^{\mathrm{RS}}_{k}$ leads to approximately optimal policy decisions, with respect to $\mathbb{E}[\mathrm{MSE}(\theta(D_n))]$, in a strong approximation sense.

5 EXPERIMENTS

In this section we empirically investigate the flexibility and efficiency of our approach. We provide the key takeaways here; more experimental details are deferred to Appendix I.

A. Flexibility of the proposed approach: One of the key benefits of DIA is that it can be used as-is for a variety of estimators (linear and non-linear) and settings (unconditional and conditional IV).
Drawing inspiration from real-world use cases, we consider binary treatment settings where we only have access to a plethora of instruments that can be used to encourage treatment uptake (e.g., using different types of notifications, emails, etc.). Specifically, we consider three regimes:

IV: This simulator is a basic IV setting with homogeneous treatment effects, which thus permits a closed-form solution using the standard two-stage least-squares procedure (Pearl, 2009). This domain has heteroskedastic outcome noise and varying levels of compliance for different instruments. Figure 5 (left) presents the results for this setting.

Conditional IV: Second, we consider the important setting where instruments/encouragements can be allocated in a context-dependent way, and the objective is to estimate the conditional average treatment effect. Our simulator has covariates containing a mix of binary and continuous-valued features, and heteroskedastic noises and compliances, across instruments and outcomes, for every covariate. We solve for the counterfactual prediction f in (2) using the two-stage moment conditions in (3) involving logistic regression; this estimator does not have a closed-form solution. Figure 5 (middle) presents the results for this setting.

Trip Advisor: We use the simulator built from Trip Advisor customer data (Syrgkanis et al., 2019). The goal is to estimate the effect on revenue if a customer becomes a subscribed member. The original setup had instruments corresponding to easier/promotional sign-up options, but to provide ablation studies we consider a larger, synthetically augmented, set of instruments. Covariates correspond to demographic data about the customer and their past interaction behavior on the website. For this setting, we use the estimator by Hartford et al. (2017) and model all the functions f, dP (from (2)), and π using neural networks. Figure 5 (right) presents the results for this setting.

Figure 6: DIA-X refers to the application of the proposed algorithm to the domain setting where the number of instruments is X. The plots illustrate the percent of MSE of uniform, after all 5000 samples are collected, as a function of the number of re-allocations (3, 5, or 7), for the IV and conditional IV settings. Error bars correspond to one standard error.

B. Understanding the performance gains: In order to simulate a realistic situation where the deployed policy can only be updated periodically, we consider a batched allocation setting. To measure the performance gain, we compute the relative improvement in MSE over the uniform allocation strategy, a metric popular in the direct experiment design literature (Che & Namkoong, 2023) (lower is better). In the IV setting, in Figure 5 (left), we also provide a comparison with the oracle obtained using brute-force search. To assess the robustness of our method, we also vary the number of available instruments and the number of batch re-allocations possible. Across all the domains, we observe that DIA can provide substantial gains for estimating the counterfactual prediction f by adaptively designing the instruments for indirect experiments (Figure 5). Performance gains are observed even as we vary the number of instruments present in the domain. Importantly, DIA achieves this across different (linear and non-linear) estimators, thereby illustrating its flexibility.
In addition, there is in general a benefit to increasing the number of times the instrument-design policy is updated (Figure 6). It is worth highlighting that DIA can improve the accuracy of the base estimator by a factor of 1.2 to 10×; equivalently, for a desired target treatment-effect estimation accuracy, DIA needs between roughly 80% of the samples and an order of magnitude fewer samples. Our results also suggest a tension between the amount of data and the complexity of the problem, creating a U-shaped trend (see e.g. Figure 6). With a larger number of instruments, there is more potential advantage to being strategic, especially when, for many contexts, most instruments have a weak influence. However, learning the optimal strategy to realize those gains also becomes harder when there are so many instruments (and the sample budget is held fixed). In the middle regime, gains are substantial and the method can also quickly learn an effective instrument-selection design π.

6 CONCLUSION

The increasing prevalence of human-AI interaction systems presents an important development. However, as AI systems can often only be suggestive and not prescriptive, estimating the effect of a suggested action necessitates the development of sample-efficient methods to estimate treatment effects merely through suggestions, while leaving the agency of the decision with the user. This work took an initial step toward laying the foundation for adaptive data collection for such indirect experiments. We characterized the key challenges, theoretically assessed the proposed remedies, and validated them empirically on domains inspired by real-world settings. Scaling the framework to settings with natural-language instruments, and doing inference with adaptively collected data (Zhang et al., 2021; Gupta et al., 2021), remain exciting future directions.

7 ACKNOWLEDGEMENT

We thank Stefan Wager, Jann Spiess, and Art Owen for their valuable feedback. Earlier versions of the draft also benefited from feedback from Jonathan Lee and Allen Nie. This work was supported in part by a 2023 Amazon Research Award.

REFERENCES

Karim Abou-Moustafa and Csaba Szepesvári. An exponential Efron-Stein inequality for l_q stable learning rules. In Algorithmic Learning Theory, pp. 31-63. PMLR, 2019. [Page(s): 27]

Ahmed Alaa and Mihaela Van Der Schaar. Validating causal inference models via influence functions. In International Conference on Machine Learning, pp. 191-201. PMLR, 2019. [Page(s): 17, 25]

Ahmed Alaa and Mihaela Van Der Schaar. Discriminative jackknife: Quantifying uncertainty in deep learning via higher-order influence functions. In International Conference on Machine Learning, pp. 165-174. PMLR, 2020. [Page(s): 17]

Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444-455, 1996. [Page(s): 2, 3, 17]

Susan Athey and Stefan Wager. Policy learning with observational data. Econometrica, 89(1):133-161, 2021. [Page(s): 17]

Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, and Roger B Grosse. If influence functions are the answer, then what is the question? Advances in Neural Information Processing Systems, 35:17953-17967, 2022. [Page(s): 24, 46, 47]

Samyadeep Basu, Philip Pope, and Soheil Feizi. Influence functions in deep learning are fragile. arXiv preprint arXiv:2006.14651, 2020.
[Page(s): 46] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. URL http: //github.com/google/jax. [Page(s): 5] Patrick Breheny. Statistical functionals and influence functions. Technical report, The University of Iowa, 2020. https://myweb.uiowa.edu/pbreheny/uk/teaching/621/notes/ 8-28.pdf. [Page(s): 24] Erik Brynjolfsson, Danielle Li, and Lindsey R Raymond. Generative ai at work. Technical report, National Bureau of Economic Research, 2023. [Page(s): 1] Brian S Caffo, James G Booth, and AC Davison. Empirical supremum rejection sampling. Biometrika, 89(4):745 754, 2002. [Page(s): 45] Ricardo Cao. Bootstrapping the mean integrated squared error. Journal of Multivariate Analysis, 45 (1):137 160, 1993. [Page(s): 17] Alain Celisse and Benjamin Guedj. Stability revisited: new generalisation bounds for the leave-oneout. ar Xiv preprint ar Xiv:1608.06412, 2016. [Page(s): 27] Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical science, pp. 273 304, 1995. [Page(s): 1, 17, 47] Published as a conference paper at ICLR 2024 Ethan Che and Hongseok Namkoong. Adaptive experimentation at scale: Bayesian algorithms for flexible batches. ar Xiv preprint ar Xiv:2303.11582, 2023. [Page(s): 1, 9, 17] Yen-Chi Chen. Introduction to resampling methods. lecture 5: Bootstrap. Technical report, University of Washington, 2017a. https://faculty.washington.edu/yenchic/17Sp_ 403/Lec5-bootstrap.pdf. [Page(s): 17] Yen-Chi Chen. Introduction to resampling methods. lecture 9: Introduction to the bootstrap theory. Technical report, University of Washington, 2017b. https://faculty.washington. edu/yenchic/17Sp_403/Lec9_theory.pdf. [Page(s): 17] Yen-Chi Chen. Stat 512: Statistical inference. lecture 10: Statistical functionals and the bootstrap. Technical report, University of Washington, 2020. https://faculty.washington.edu/ yenchic/20A_stat512/Lec10_functional.pdf. [Page(s): 17] Paul S Clarke and Frank Windmeijer. Instrumental variable estimators for binary outcomes. Journal of the American Statistical Association, 107(500):1638 1652, 2012. [Page(s): 17] R Dennis Cook and Sanford Weisberg. Residuals and influence in regression. New York: Chapman and Hall, 1982. [Page(s): 24] Alicia Curth, Alihan Hüyük, and Mihaela van der Schaar. Adaptively identifying patient populations with treatment benefit in clinical trials. ar Xiv preprint ar Xiv:2208.05844, 2022. [Page(s): 17] Michiel Debruyne, Mia Hubert, and Johan AK Suykens. Model selection in kernel based regression using the influence function. Journal of machine learning research.-Cambridge, Mass., 9:2377 2400, 2008. [Page(s): 17] Riccardo Della Vecchia and Debabrota Basu. Online instrumental variable regression: Regret analysis and bandit feedback. ar Xiv preprint ar Xiv:2302.09357, 2023. [Page(s): 1] Ozkan Eren and Daniel J Henderson. The impact of homework on student achievement. The Econometrics Journal, 11(2):326 348, 2008. [Page(s): 18] Yingying Fan, Jinchi Lv, and Jingbo Wang. Dnn: A two-scale distributional tale of heterogeneous treatment effect inference. Available at SSRN 3238897, 2018. [Page(s): 31, 41] Luisa Turrin Fernholz. Von Mises calculus for statistical functionals, volume 19. Springer Science & Business Media, 2012. [Page(s): 5] Johannes Ferstad, Priya Prahalad, David M Maahs, Emily Fox, Ramesh Johari, and David Scheinker. 
1009-p: The association between patient characteristics and the efficacy of remote patient monitoring and messaging. Diabetes, 71(Supplement_1), 2022. [Page(s): 18] Aaron Fisher and Edward H Kennedy. Visually communicating and teaching intuition for influence functions. The American Statistician, 75(2):162 172, 2021. [Page(s): 4, 5] Dennis Frauen and Stefan Feuerriegel. Estimating individual treatment effects under unobserved confounding using binary instruments. ar Xiv preprint ar Xiv:2208.08544, 2022. [Page(s): 3] Charles J Geyer. 5601 notes: The subsampling bootstrap. Unpublished manuscript, 2006. URL https://www.stat.umn.edu/geyer/5601/notes/sub.pdf. [Page(s): 17] Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions. ar Xiv preprint ar Xiv:2308.03296, 2023. [Page(s): 45] Shantanu Gupta, Zachary Lipton, and David Childers. Efficient online estimation of causal effects by deciding what to observe. Advances in Neural Information Processing Systems, 34:20995 21007, 2021. [Page(s): 2, 10] Jinyong Hahn, Keisuke Hirano, and Dean Karlan. Adaptive experimental design using the propensity score. Journal of Business & Economic Statistics, 29(1):96 108, 2011. [Page(s): 1, 17] Peter Hall. Using the bootstrap to estimate mean squared error and select smoothing parameter in nonparametric problems. Journal of multivariate analysis, 32(2):177 203, 1990. [Page(s): 17] Peter Hall. Methodology and theory for the bootstrap. Technical report, University of California, Davis, 2016. https://anson.ucdavis.edu/~peterh/sta251/ bootstrap-lectures-to-may-16.pdf. [Page(s): 17] Published as a conference paper at ICLR 2024 Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep iv: A flexible approach for counterfactual prediction. In International Conference on Machine Learning, pp. 1414 1423. PMLR, 2017. [Page(s): 1, 2, 3, 9, 48] James J Heckman and Sergio Urzua. Comparing iv with structural models: What simple iv can and cannot identify. Journal of Econometrics, 156(1):27 37, 2010. [Page(s): 17] Oliver Hines, Oliver Dukes, Karla Diaz-Ordaz, and Stijn Vansteelandt. Demystifying statistical learning based on efficient influence functions. The American Statistician, 76(3):292 304, 2022. [Page(s): 17] Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293 325, 1948. [Page(s): 31] Kenneth Holstein, Bruce M Mc Laren, and Vincent Aleven. Student learning benefits of a mixedreality teacher awareness tool in ai-enhanced classrooms. In Artificial Intelligence in Education: 19th International Conference, AIED 2018, London, UK, June 27 30, 2018, Proceedings, Part I 19, pp. 154 168. Springer, 2018. [Page(s): 1] Shang-Ling Hsu, Raj Sanjay Shah, Prathik Senthil, Zahra Ashktorab, Casey Dugan, Werner Geyer, and Diyi Yang. Helping the helper: Supporting peer counselors via ai-empowered practice and feedback. ar Xiv preprint ar Xiv:2305.08982, 2023. [Page(s): 1] Hidehiko Ichimura and Whitney K Newey. The influence function of semiparametric estimators. Quantitative Economics, 13(1):29 61, 2022. [Page(s): 17, 24] Guido Imbens. Instrumental variables: an econometrician s perspective. Technical report, National Bureau of Economic Research, 2014. [Page(s): 3] Guido W Imbens. Better late than nothing: Some comments on deaton (2009) and heckman and urzua (2009). 
Journal of Economic literature, 48(2):399 423, 2010. [Page(s): 17] Jay Kahn. Influence functions for fun and profit. Ross School of Business, University of Michigan, 2015. URL http://j-kahn.com/files/influencefunctions.pdf. [Page(s): 5, 19, 24, 48] Nathan Kallus. Instrument-armed bandits. In Firdaus Janoos, Mehryar Mohri, and Karthik Sridharan (eds.), Proceedings of Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research, pp. 529 546. PMLR, 07 09 Apr 2018. URL https://proceedings. mlr.press/v83/kallus18a.html. [Page(s): 1] Maximilian Kasy. Semiparametrically efficient estimation of conditional instrumental variables parameters. The International Journal of Biostatistics, 5(1), 2009. [Page(s): 17] Maximilian Kasy and Anja Sautmann. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113 132, 2021. [Page(s): 17] Masahiro Kato, Takuya Ishihara, Junya Honda, and Yusuke Narita. Efficient adaptive experimental design for average treatment effect estimation. ar Xiv preprint ar Xiv:2002.05308, 2020. [Page(s): 1, 17] Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Proceedings of the tenth annual conference on Computational learning theory, pp. 152 162, 1997. [Page(s): 27] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pp. 1885 1894. PMLR, 2017. [Page(s): 5, 24] Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! In Deep Reinforcement Learning Meets Structured Prediction, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019. Open Review.net, 2019. URL https: //openreview.net/forum?id=r1lg TGL5DE. [Page(s): 17] Steven George Krantz and Harold R Parks. The implicit function theorem: history, theory, and applications. Springer Science & Business Media, 2002. [Page(s): 24] Zhaobin Kuang, Frederic Sala, Nimit Sohoni, Sen Wu, Aldo Córdova-Palomera, Jared Dunnmon, James Priest, and Christopher Ré. Ivy: Instrumental variable synthesis for causal inference. In International Conference on Artificial Intelligence and Statistics, pp. 398 410. PMLR, 2020. [Page(s): 17] Published as a conference paper at ICLR 2024 Harrison H Li and Art B Owen. Double machine learning and design in batch adaptive experiments. ar Xiv preprint ar Xiv:2309.15297, 2023. [Page(s): 17] Pierre Liotet, Francesco Vidaich, Alberto Maria Metelli, and Marcello Restelli. Lifelong hyperpolicy optimization with multiple importance sampling regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7525 7533, 2022. [Page(s): 17] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-depedent control variates for policy optimization via stein s identity. ar Xiv preprint ar Xiv:1710.11198, 2017. [Page(s): 17] Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. Dataset distillation with convexified implicit gradients. ar Xiv preprint ar Xiv:2302.06755, 2023. [Page(s): 17, 24] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International conference on artificial intelligence and statistics, pp. 1540 1552. PMLR, 2020. [Page(s): 5, 45] Anna Mikusheva. Time series analysis. lecture 9: Bootstrap. Technical report, Massachusetts Institute of Technology, 2013. 
https://ocw.mit.edu/courses/ 14-384-time-series-analysis-fall-2013/resources/mit14_384f13_ lec9/. [Page(s): 17] Susan A Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in medicine, 24(10):1455 1481, 2005. [Page(s): 1] Whitney K Newey and Daniel Mc Fadden. Large sample estimation and hypothesis testing. Handbook of econometrics, 4:2111 2245, 1994. [Page(s): 19] Whitney K Newey and James L Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565 1578, 2003. [Page(s): 2, 3] Dung Daniel T Ngo, Logan Stapleton, Vasilis Syrgkanis, and Steven Wu. Incentivizing compliance with algorithmic instruments. In International Conference on Machine Learning, pp. 8045 8055. PMLR, 2021. [Page(s): 2] Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, and Marcello Restelli. Optimistic policy optimization via multiple importance sampling. In International Conference on Machine Learning, pp. 4989 4999. PMLR, 2019. [Page(s): 17] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019. [Page(s): 5] Judea Pearl. Causality. Cambridge university press, 2009. [Page(s): 3, 8] Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147 160, 1994. [Page(s): 5, 45] Dimitris N Politis, Joseph P Romano, and Michael Wolf. Subsampling. Springer Science & Business Media, 1999. [Page(s): 17] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000. [Page(s): 4, 6] Tom Rainforth, Adam Foster, Desi R Ivanova, and Freddie Bickford Smith. Modern bayesian experimental design. ar Xiv preprint ar Xiv:2302.14545, 2023. [Page(s): 1, 17, 47] Lorenz Richter, Ayman Boustati, Nikolas Nüsken, Francisco Ruiz, and Omer Deniz Akyildiz. Vargrad: a low-variance gradient estimator for variational inference. Advances in Neural Information Processing Systems, 33:13481 13492, 2020. [Page(s): 17] Emmanuel Rio. Moment inequalities for sums of dependent random variables under projective conditions. Journal of Theoretical Probability, 22(1):146 163, 2009. [Page(s): 28] James Robins, Lingling Li, Eric Tchetgen, Aad van der Vaart, et al. Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and statistics: essays in honor of David A. Freedman, volume 2, pp. 335 422. Institute of Mathematical Statistics, 2008. [Page(s): 25] Published as a conference paper at ICLR 2024 Joseph P Romano. On subsampling estimators with unknown rate of convergence. 1995. [Page(s): 47] Tim Salimans and David A Knowles. On using control variates with stochastic approximation for variational bayes and its connection to stochastic linear regression. ar Xiv preprint ar Xiv:1401.1022, 2014. [Page(s): 17] Andrea Schioppa, Polina Zablotskaia, David Vilar, and Artem Sokolov. Scaling up influence functions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8179 8186, 2022. [Page(s): 24] Jiaxin Shi, Yuhao Zhou, Jessica Hwang, Michalis Titsias, and Lester Mackey. 
Gradient estimation with discrete stein operators. Advances in neural information processing systems, 35:25829 25841, 2022. [Page(s): 17] Xiaoxia Shi. Econ 715. lecture 10: Bootstrap. Technical report, University of Wisconsin-Madison, 2012. https://www.ssc.wisc.edu/~xshi/econ715/Lecture_10_bootstrap. pdf. [Page(s): 17] Difan Song, Simon Mak, and CF Wu. Ace: Active learning for causal inference with expensive experiments. ar Xiv preprint ar Xiv:2306.07480, 2023. [Page(s): 1, 17] Masashi Sugiyama, Motoaki Kawanabe, et al. Dimensionality reduction for density ratio estimation in high-dimensional spaces. Artificial Intelligence Society, 2009(DMSM-A901):04, 2009. [Page(s): 49] Masashi Sugiyama, Makoto Yamada, Paul Von Buenau, Taiji Suzuki, Takafumi Kanamori, and Motoaki Kawanabe. Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search. Neural Networks, 24(2):183 198, 2011. [Page(s): 49] Vasilis Syrgkanis, Victor Lei, Miruna Oprescu, Maggie Hei, Keith Battocchi, and Greg Lewis. Machine learning estimation of heterogeneous treatment effects with instruments. Advances in Neural Information Processing Systems, 32, 2019. [Page(s): 1, 3, 8, 9, 18, 49] Hajime Takahashi. A note on edgeworth expansions for the von mises functionals. Journal of multivariate analysis, 24(1):56 65, 1988. [Page(s): 17] Yunhao Tang. Biased gradient estimate with drastic variance reduction for meta reinforcement learning. In International Conference on Machine Learning, pp. 21050 21075. PMLR, 2022. [Page(s): 17] Philip S Thomas. Safe reinforcement learning. Ph D thesis, University of Massachusetts, Amherst, 2015. [Page(s): 6] Michalis Titsias and Jiaxin Shi. Double control variates for gradient estimation in discrete latent variable models. In International Conference on Artificial Intelligence and Statistics, pp. 6134 6151. PMLR, 2022. [Page(s): 17] Toon Vanderschueren, Alicia Curth, Wouter Verbeke, and Mihaela van der Schaar. Accounting for informative sampling when learning to forecast treatment outcomes over time. ar Xiv preprint ar Xiv:2306.04255, 2023. [Page(s): 17] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228 1242, 2018a. [Page(s): 42] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228 1242, 2018b. [Page(s): 31, 41] Linbo Wang, Yuexia Zhang, Thomas S Richardson, and James M Robins. Estimation of local treatment effects under the binary instrumental variable model. Biometrika, 108(4):881 894, 2021. [Page(s): 17] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229 256, 1992. [Page(s): 4] Anpeng Wu, Kun Kuang, Ruoxuan Xiong, and Fei Wu. Instrumental variables in causal inference and machine learning: A survey. ar Xiv preprint ar Xiv:2212.05778, 2022. [Page(s): 3] Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. ar Xiv preprint ar Xiv:1803.07246, 2018. [Page(s): 17] Published as a conference paper at ICLR 2024 Ruoxuan Xiong, Susan Athey, Mohsen Bayati, and Guido Imbens. Optimal experimental design for staggered rollouts. ar Xiv preprint ar Xiv:1911.03764, 2019. 
[Page(s): 1]

Liyuan Xu, Yutian Chen, Siddarth Srinivasan, Nando de Freitas, Arnaud Doucet, and Arthur Gretton. Learning deep features in instrumental variable regression. arXiv preprint arXiv:2010.07154, 2020. [Page(s): 3]

Junkun Yuan, Anpeng Wu, Kun Kuang, Bo Li, Runze Wu, Fei Wu, and Lanfen Lin. AutoIV: Counterfactual prediction via automatic instrumental variable decomposition. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(4):1-20, 2022. [Page(s): 17]

Kelly Zhang, Lucas Janson, and Susan Murphy. Statistical inference with m-estimators on adaptively collected data. Advances in Neural Information Processing Systems, 34:7460-7471, 2021. [Page(s): 10]

Adaptive Instrument Design for Indirect Experiments (Appendix)

Contents: 1 Introduction; 2 Background and Problem Setup; 3 Approach (3.1 Influence Functions for Control Variates; 3.2 Multi-Rejection Sampling for Distribution Correction; 3.3 DIA: Designing Instruments Adaptively); 5 Experiments; 6 Conclusion; 7 Acknowledgement; A Extended Discussion on Related Work; B Examples of Practical Use-cases; C Linear Conditional Instrumental Variables (C.1 Specific Instantiations of Lemma 1); D Influence Functions; E Bias and Variances of Gradient Estimators (E.1 Naive REINFORCE Estimator; E.2 REINFORCE Estimator with CV; E.3 REINFORCE Estimator with Influence); F Estimation Error, Bias and Variance Analysis (F.1 Preliminaries on U-Statistics; F.2 Variance of IS Procedure; F.3 Guarantees for Rejection Sampling Procedure; F.4 Policy Ordering Consistency Guarantees; F.5 Proof of Theorem 6); G Multi-Rejection Sampling (G.1 Review: Standard Rejection Sampling; G.2 Proposed: Multi-Rejection Sampling); H Algorithm (H.1 Pseudo-code; H.2 Influence function usage when the estimates are stochastic; H.3 Proxy MSE alternatives); I Additional Empirical Details (I.1 Synthetic Domains; I.2 Trip Advisor Domain (semi-synthetic); I.3 Hyper-parameters; I.4 Details of other plots).

A EXTENDED DISCUSSION ON RELATED WORK

Experiment design has a rich literature, and no short discussion can do it justice; we refer readers to the work by Chaloner & Verdinelli (1995) and Rainforth et al. (2023) for surveys.
Researchers have considered adaptive estimation of (conditional) average treatment effects (Hahn et al., 2011; Kato et al., 2020), adaptive designs leveraging Gaussian processes (Song et al., 2023), active estimation of the sub-population benefiting the most from the treatment (Curth et al., 2022), estimation of treatment effects when samples are collected irregularly over time (Vanderschueren et al., 2023), and adaptive data collection to find the best arm while also maximizing welfare (Kasy & Sautmann, 2021). Che & Namkoong (2023) consider adaptive experiment design using differentiable programming; our work shares their philosophy of balancing sample complexity with computational complexity. Li & Owen (2023) consider using methods from double ML for adaptive direct experiments. In contrast, we consider adaptive indirect experiments, which often make use of two-stage and double ML estimators. However, all these works only consider design for direct experiments.
For instrumental variables, researchers have developed methods that automatically decide how to combine different instruments efficiently to enhance the efficiency of the estimator (Kuang et al., 2020; Yuan et al., 2022). In contrast, our work considers how to collect more data online. Further, while we considered IV estimation under specific assumptions, we refer readers to the work by Angrist et al. (1996); Clarke & Windmeijer (2012); Wang et al. (2021) for discussions of alternative assumptions. In particular, we considered the conditional LATE to be equal to the CATE. For more discussion on the relation between the CATE and the conditional LATE, we refer readers to the work by Heckman & Urzua (2010); Imbens (2010); Athey & Wager (2021).
Some of the tools that we developed are also related to literature outside of causal estimation and experiment design. Our treatment of the dataset selection policy as a factorized distribution, and our development of the leave-one-out estimate, are related to ideas in action-factorized baselines in reinforcement learning (Wu et al., 2018; Liu et al., 2017) and leave-one-out control variates (Richter et al., 2020; Salimans & Knowles, 2014; Shi et al., 2022; Kool et al., 2019; Titsias & Shi, 2022). In contrast to these, our work is for experiment design, and to be compute-efficient in our setup we additionally had to develop black-box influence functions for two-stage estimators. Our work also complements prior work on influence functions for conditional IVs (Kasy, 2009; Ichimura & Newey, 2022), which is not directly applicable to deep neural network-based estimators. Further, our characterization of the leave-one-out estimate of the MSE requires the existence of all higher-order influence functions. Similar conditions have been used by prior works in the context of uncertainty quantification (Alaa & Van Der Schaar, 2020), model selection (Alaa & Van Der Schaar, 2019; Hines et al., 2022; Debruyne et al., 2008), and dataset distillation (Loo et al., 2023).
Tang (2022) quantifies and develops first-order gradient-based control variates to perform a bias-variance trade-off for the naive REINFORCE gradient in the meta-reinforcement-learning setting. However, in their analysis the effective reward function admits a simple additive decomposition, whereas in our setting it corresponds to MSE(θ(Dn)), which cannot be analyzed using their technique. Further, the potential non-linearity of θ makes the analysis considerably more involved in our case.
Similarly, to deal with off-policy correction in RL, researchers have also considered using multi-importance sampling (Papini et al., 2019; Liotet et al., 2022). However, using multi-importance sampling (as opposed to multi-rejection sampling) in our setup can still result in exponential variance. Our multi-rejection sampling is designed by leveraging the symmetric structure of our estimator. Further, our $\widehat{\mathrm{MSE}}$ estimator is related to MSE estimation using the sub-sampling bootstrap (Politis et al., 1999; Geyer, 2006), where the mean-squared error is computed by comparing the estimate from the entire dataset of size n with the estimate obtained from a subset of the dataset of size k < n, and then rescaling appropriately using the convergence rate of the estimator (Hall, 1990; Cao, 1993). In our setting, we can avoid the rescaling factor (which requires knowledge of convergence rates) since we only care about computing the gradient of the $\widehat{\mathrm{MSE}}$. Additionally, we had to address the distribution-shift issue in our setting. For more background on the statistical bootstrap, we refer readers to some helpful lecture notes (Chen, 2017a; Mikusheva, 2013; Shi, 2012; Hall, 2016). See (Chen, 2017b, 2020) for connections between the bootstrap and influence functions, and (Takahashi, 1988) for connections between the von Mises expansion (used for LOO estimation via influence functions) and Edgeworth expansions (used in bootstrap theory).
Finally, while we leverage tools from RL, our problem setup presents several unique challenges that make the application of standard RL methods as-is ineffective. Off-policy learning is an important requirement for our method, and policy gradients and Q-learning are the two main pillars of RL. As we discussed in Section 4, conventional importance-sampling-based policy gradient methods can have variance that is exponential in the number of samples (in the worst case), rendering them practically useless for our setting. On the other hand, it is unclear how to efficiently formulate the optimization problem using Q-learning. From an RL point of view, the reward corresponds to the MSE and the actions correspond to instruments. Since this reward depends on all the samples, the effective state space for Q-learning-style methods would depend on the combinatorial set of covariates, i.e., if $\mathcal{X}$ is the set of covariates and n is the number of samples, then the state $s \in \mathcal{X}^n$. This causes additional trouble, as n continuously increases in our setting as we collect more data.

B EXAMPLES OF PRACTICAL USE-CASES
Due to space constraints, we presented the motivation concisely in the main paper. To provide more discussion of real-world use cases, we describe three applications from different fields below:
Education: When designing an education curriculum, it is important to estimate the effect of homework assignments, extra reading material, etc., on a student's final performance (Eren & Henderson, 2008). However, as students cannot be forced to (not) do these, conducting an RCT is infeasible. Nonetheless, students can be encouraged to participate in these exercises via a multitude of different methods. These different forms of encouragement provide various instruments, and choosing them strategically can enable sample-efficient estimation of the desired treatment effect.
Healthcare: Similarly, for mobile gym applications or remote patient-monitoring settings (Ferstad et al., 2022), users retain the agency to decide whether they would like to follow the suggested at-home exercise or medical intervention, respectively. As we cannot randomly assign users to a control/treatment group, RCTs cannot be performed. However, there is a variety of messages and reminders (in terms of how to phrase the message, when to notify the patient, etc.) that serve as instruments. Being strategic about the choice of instrument can enable sample-efficient estimation of the treatment effect.
Digital marketing: Many online digital platforms aim to estimate the impact of premier membership on both the company's revenue and customer satisfaction. However, it is infeasible to randomly make users members or not, thereby making RCTs inapplicable. Nonetheless, various forms of promotional offers, easier sign-up options, etc., serve as instruments to encourage customers to become members. Strategically personalizing these instruments for customers can enable sample-efficient treatment effect estimation. In our work, we consider one such case study using the publicly available Trip Advisor domain (Syrgkanis et al., 2019).

C LINEAR CONDITIONAL INSTRUMENTAL VARIABLES
In this section, we aim to elucidate how different factors in (a) the data-generating process (e.g., heteroskedasticity in compliance and outcome noise, structure in the covariates) and (b) the choice of estimator can affect the optimal data collection strategy. Since this discussion is aimed at qualitative insights (instead of quantitative results like those in Section 5), we consider a (partially) linear CATE IV model,
$$Y = \theta_0^\top X A + f_0(X) + \epsilon, \qquad (18)$$
where $E[\epsilon \mid X, A] \neq 0$ but $E[\epsilon \mid X, Z] = 0$. As $E[X Z \epsilon] = E[X Z\, E[\epsilon \mid X, Z]] = 0$, the estimate $\hat{\theta}$ is constructed as the solution to the moment vector:
$$E[(Y - \theta^\top X A - \hat{f}(X))\, X Z] = 0, \qquad (19)$$
where Z is a scalar, $A \in \{0, 1\}$, and $E[Z \mid X] = 0$ (i.e., we know the propensity of the instrument and can always exactly center it). Moreover, $\hat{f}(X)$ is an arbitrary function that converges in probability to some function $f_0$.
An example: $Z = w(X)^\top V$ with $E[V \mid X] = 0$. This can simulate a situation where V represents the one-hot encoding of a collection of binary instruments and w(X) selects a linear combination of these instruments to observe, yielding the observed instrument Z (e.g., $w(X) \in \{e_1, \ldots, e_m\}$ encodes that we can only select one base instrument to observe). Moreover, each of these base instruments comes with a known propensity, and we can always exactly center it conditional on X. The estimate $\hat{\theta}$ is the solution to:
$$\frac{1}{n} \sum_i (Y_i - \theta^\top X_i A_i - \hat{f}(X_i))\, X_i Z_i = 0. \qquad (20)$$
We care about the MSE of the estimate $\hat{\theta}$:
$$\mathrm{MSE}(\hat{\theta}) = E_X[(\hat{\theta}^\top X - \theta_0^\top X)^2]. \qquad (21)$$
Lemma 1. Let $V = E[X X^\top]$, $J = E[A Z X X^\top]$, and $\Sigma = E[\epsilon^2 Z^2 X X^\top]$, with $\epsilon = Y - \theta_0^\top X A - f_0(X)$. Finally, let
$$U = V^{1/2} J^{-1} \Sigma J^{-1} V^{1/2}. \qquad (22)$$
The mean and the variance of the MSE converge to:
$$n\, E[\mathrm{MSE}(\hat{\theta})] \to \|U\|_{\mathrm{nuclear}}, \qquad n^2\, \mathrm{Var}[\mathrm{MSE}(\hat{\theta})] \to 2\, \|U\|_{\mathrm{fro}}^2. \qquad (23)$$
Proof. The proof is structured as follows. Part A: We first recall the asymptotic properties of GMM estimators. Part B: We then use this result to characterize the asymptotic behavior of $\sqrt{n}(\hat{\theta} - \theta)$, and also of $n\, \mathrm{MSE}(\hat{\theta})$. Part C: Finally, we use a singular value decomposition to express the mean and variance of $\mathrm{MSE}(\hat{\theta})$ in simplified terms.
Part A: Let $S_i = \{X_i, Z_i, A_i, Y_i\}$.
Recall that for a GMM-estimator, ˆθ is the solution to i m(θ, Si) = 0, (24) where m(θ, Si) is the moment for sample Si. For large enough n we can represent the difference between the estimates ˆθ and the true θ by use of a Taylor expansion (Kahn, 2015), i IF(θ, Si) + higher order terms, (25) where IF(θ, Si) is the influence of the sample i, and is given by E[ m(θ, S)] 1m(θ, Si). Therefore, n(ˆθ θ) = 1 n i IF(θ, Si) + op(1). (26) Moreover, E[IF(θ, S)] = 0 and Cov(IF(θ, S)) = E[IF(θ, S)IF(θ, S) ]. See (Newey & Mc Fadden, 1994) for a more formal argument. Part B: Therefore, the parameters estimated by the empirical analogue of the vector of moment equations in 20 are asymptotically linear: n(ˆθ θ0) = 1 n i=1 E[AZXX ] 1Xi Zi(Yi θ 0Xi Ai f0(Xi)) + op(1). (27) For simplicity, let s take f0(X) = 0. We can always redefine Y Y f0(X). Moreover, let v(X) = E[AZ | X] = Cov(A, Z | X) and J = E[v(X)XX ]. Let ϵi = Yi θ 0Xi Ai. Then we have: n(ˆθ θ0) = 1 n i=1 J 1Xi Ziϵi + op(1). (28) Published as a conference paper at ICLR 2024 We care about the average prediction error: E[(θ X θ 0X)2] = (θ θ0) E[XX ](θ θ0). (29) Let V = E[XX ]. Then, E[(θ X θ 0X)2] = V 1/2(θ θ0) 2. (30) Invoking the asymptotic linearity of n(ˆθ θ0): n E[(θ X θ 0X)2] = i ϵi Zi V 1/2J 1Xi + op(1). (31) Moreover, we have: 1 n i ϵi Zi V 1/2J 1Xi d N 0, V 1/2J 1E[ϵ2Z2XX ]J 1V 1/2 . (32) Let Σ = E[ϵ2Z2XX ] and U = V 1/2J 1ΣJ 1V 1/2. Thus the mean squared error is distributed as the squared of the ℓ2 norm of a multivariate Gaussian vector N(0, U). Part C: If we let U = CΛC , be the SVD decomposition, then note that if v N(0, U) then since C C = I, we have: Cv 2 = v C Cv = v v = v 2. (33) Now note that Cv N(0, C UC) = N(0, Λ). Thus Cv 2 is distributed as the weighted sum of d independent chi-squared distributions, i.e. we can write: i=1 λj X 2 j , (34) where X 2 j are independent χ2(1) distributed random variables and λj are the singular values of the matrix: U = V 1/2J 1ΣJ 1V 1/2. (35) Using the mean and variance of the chi-squared distribution, asymptotic mean and variance of the RMSE can be expressed as: n E[MSE(ˆθ)] j=1 λj (36) n2Var[MSE(ˆθ)] j=1 Var(λj X 2 j ) = X j λ2 j Var(X 2 j ) = X j 2 λ2 j. (37) Thus the expected MSE is the trace norm (or nuclear norm) of U and the variance of the MSE is twice the square of the Forbenius norm. C.1 SPECIFIC INSTANTIATIONS OF LEMMA 1 In this subsection, we instantiate Lemma 1 to understand the impact of heteroskedastic compliance, heteroskedastic outcome errors, and structure of covariates on the optimal data collection policy. Note that U in Lemma 1 takes the form: U = E[XX ]1/2E[AZXX ] 1E[ϵ2Z2XX ]E[AZXX ] 1E[XX ]1/2. (38) Noting that by the instrument assumption ϵ Z | X, and denoting with σ2(X) = E[ϵ2 | X] (residual variance of the outcome) (39) u2(X) = E[Z2 | X] (variance of the instrument) (40) γ(X) = E[A Z | X] E[Z2 | X] (heteroskedastic compliance) (41) Published as a conference paper at ICLR 2024 where the heteroskedastic compliance corresponds to the coefficient in the regression A Z conditional on X. Then we can simplify: U = E[XX ]1/2E[γ(X)u2(X)XX ] 1E[σ2(X)u2(X)XX ]E[γ(X)u2(X)XX ] 1E[XX ]1/2 Remark 1 (Homoskedastic Compliance, Outcome Error and Instrument Propensity). 
If we have a lot of homoskedasticity, i.e.: σ2(X) = σ2, u2(X) = u2 and γ2(X) = γ then U simplifies to: U = γ 2u 2σ2V 1/2V 1V V 1V 1/2 = γ 2u 2σ2I (43) and we get: E[MSE(ˆθ)] = d nγ 2u 2σ2 (44) Var[MSE(ˆθ)] = 2 d n2 γ 4u 4σ4 (45) Thus we are just looking for instruments that maximize γ u where γ := E[AZ]/E[Z2] is the OLS coefficient of A Z (the first stage coefficient in 2SLS) and u = p Var(Z) is the standard deviation of the instrument. Remark 2 (Homoskedastic Compliance and Outcome Error). If we have homoskedasticity in compliance (i.e. γ(X) = γ) and in outcome (σ(X) = σ), then: U = σ2γ 2E[XX ]1/2E[u2(X)XX ] 1E[XX ]1/2 (46) Or equivalently, we want to maximize the trace of the inverse of U: E[XX ] 1/2E γ2u2(X)XX E[XX ] 1/2 (47) As XX is positive semi-definite, the latter is maximized if we simply maximize γu(X) for each X. Thus assuming that all the instruments in our choice set have a homogeneous compliance, then we are looking for the instrument policy that maximizes: γu(X), where γ is the compliance coefficient and u(X) is the conditional standard deviation of the instrument. If for instance, in our choice set we have only binary instruments and we can fully randomize them, then we should always randomize the instrument equiprobably and we should choose the binary instrument with the largest γ. It is harder to understand how the trace norm or the schatten-2 norm of these more complex matrices behave. Though maybe some matrix tricks could lead to further simplifications of these. Remark 3 (Orthonormal Supported X). If X takes values on the orthonormal basis {e1, . . . , ej}, then by denoting σ2 i = E[ϵ2 | X = ei], u2 i = E[Z2 | X = ei] and γi = E[A Z|X=ei] E[Z2|X=ei] and by Σ = diag{σ2 1, . . . , σ2 d} and K = diag{u2 1, . . . , u2 d} and Γ = diag{γ1, . . . , γd}, let V = E[XX ] = WΛW be the eigen-representation of V , then we have: E σ2(X)u2(X)XX = WΛKΣW , (48) E γ(X)u2(X)XX = WΛΓKW . (49) Leading to: U = WΛ1/2W WΛ 1Γ 1K 1W WΛKΣW WΛ 1Γ 1K 1W WΛ1/2W, (50) which by orthonormality of the W, simplifies to: U = WΓ 2K 1ΣW . (51) hence we get that: n E[MSE(ˆθ)] = X σ2 i γ2 i u2 i = X E[ϵ2 | X = ei]E[Z2 | X = ei] E[AZ | X = ei]2 , (52) n2Var[MSE(ˆθ)] = 2 X σ4 i γ4 i u4 i , (53) Published as a conference paper at ICLR 2024 Estimator Compliance Instrument Var. E[Z2|X] Optimal? X=-1 X = 1 X = 1 X = 1 En (Y θ X A)X Z = 0 1 1/3 1/4 0 Yes 1/4 1/4 No 1 1/2 1/4 0 No 1/4 1/4 Yes γ(X) σ2(X)(Y θ X A)X Z i = 0 1 1/3 1/4 0 No 1/4 1/4 Yes 1 1/2 1/4 0 No 1/4 1/4 Yes Table 1: In general the optimal instrument-selection policy will depend not only on compliance per covariates but also on the estimation method. We demonstrate this here for a simple linear CATE estimation where g(X, A) := θ 0 XA, instrument Z is a scalar, A {0, 1}, X { 1, 1}, E[Z | X] = 0, and outcome error is homoskedastic. For the first estimator, the optimal instrumentselection policy for X = 1 changes depending on the compliance for X = 1, where we characterize optimality in terms of the asymptotic mse. In contrast, for the second estimator, randomizing the instrument is best for both compliance values of X = 1. and the problem decomposes into optimizing separately for each i, maximize the terms: u2 i = E[AZ | X = ei] E[Z2 | X = ei] E[Z2 | X = ei] (54) assuming that our constraints on the policy for choosing Z are decoupled across the values of X. So similarly, we just want to maximize the compliance coefficient, multiplied by the standard deviation of the instrument, conditional on each X. 
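To make the orthonormal-covariate case concrete, the following minimal sketch (Python; all numerical values are illustrative assumptions, not quantities from the paper) scores candidate randomization probabilities for a single centered binary instrument at each covariate value, using the decoupled per-covariate objective above, i.e., the contribution of covariate value e_i to n E[MSE] is sigma_i^2 / (gamma_i^2 u_i^2), with u_i^2 = p(1 - p) for randomization probability p.

import numpy as np

# Sketch of the orthonormal-covariate case: the objective decouples across
# covariate values, so each randomization probability can be chosen separately.
# (gammas and sigmas2 below are illustrative assumptions, not paper values.)
gammas = np.array([1.0, 1.0 / 3.0])         # compliance coefficient per covariate value
sigmas2 = np.array([1.0, 1.0])              # outcome-noise variance per covariate value
candidate_ps = np.linspace(0.05, 0.95, 19)  # candidate randomization probabilities

total = 0.0
for i, (g, s2) in enumerate(zip(gammas, sigmas2)):
    u2 = candidate_ps * (1 - candidate_ps)   # Var(Z | X = e_i) for each candidate p
    contrib = s2 / (g ** 2 * u2)             # per-covariate term of n * E[MSE]
    best = np.argmin(contrib)
    total += contrib[best]
    print(f"covariate {i}: best p = {candidate_ps[best]:.2f}, term = {contrib[best]:.3f}")
print(f"asymptotic n * E[MSE] under per-covariate optimal choices: {total:.3f}")

With decoupled constraints, this recovers the earlier conclusion that the chosen binary instrument should be randomized equiprobably, since p = 1/2 maximizes p(1 - p) for every covariate value.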
Beyond this case, when X doesn t only take values on the orthonormal basis, then the problem of optimizing the choice of an instrument distribution so as to minimize the expected MSE doesn t seem to decouple across the values of X. Remark 4 (Scalar X). One can see the intricacies of the problem when there is no homoskedastic compliance, even when we have a scalar X. In this case, we are trying to maximize: U 1 := E[γ(X)u2(X)X2]2 E[X2]E[σ2(X)u2(X)X2] (55) Suppose for instance that X U({ 1, 1}) and that our policy can only choose the conditional variance of Z, but not the conditional compliance γ(X) (e.g. we have a fixed binary instrument and we can only play with its randomization probability). Then we are maximizing: (γ1u2 1 + γ2u2 2)2 σ2 1u2 1 + σ2 2u2 2 (56) over u2 1, u2 2 [0, 1/4] (the variance of Z for each value of X). Without loss of generality, we can take γ1 = 1 and σ1 = 1 in the above problem, since we can always take out these factors and rename γ2/γ1 γ2 and σ2/σ1 σ2. We then get the simplified problem: (u2 1 + γu2 2)2 u2 1 + σ2u2 2 (57) where γ, σ (0, ). Take for instance the case when σ = 1 (i.e. homoskedastic outcome error) and γ = 1/3 (i.e. compliance is 30% less for X = 1 than for X = 1. Then the optimal solution is u2 1 = 1/4, u2 2 = 0, i.e. we don t want to randomize the instrument at all when X = 1, yielding a value of 1/4 = 0.25 for U 1. If we were for instance to randomize also when X = 1, then we would get a value of (1/3)2/(1/2) = 2/9 0.22. If on the other hand γ = 1/2, then if we randomize on both points we get (3/8)2/(1/2) = 9/32 0.281, while if we only randomize when X = 1, then we get again 1/4 = 0.25. Thus the relative compliance strength for different values of X, changes the optimal randomization solution for Z. Published as a conference paper at ICLR 2024 Remark 5 (Efficient Instrument). It may very well be the case that the above contorted optimal solution is an artifact of the estimator that we are using. Under such a heteroskedastic compliance, most probably the estimate that we are using is not the efficient estimate, and we should be dividing the instruments by the compliance measure (i.e. construct efficient instruments, e.g. Z = Z γ(X) σ2(X) ). Then maybe once we use an efficient instrument estimator, its variance most probably would always be improved if we full randomize on all X points. But at least the above shows that for some fixed estimator (and in particular one that is heavily used in practice), it can very well be the case that we don t want to fully randomize the instrument all the time. If we use such an instrument Z then observe that for this instrument, by definition γ(X) = σ2(X). Then U simplifies to: E[XX ]1/2E u2(X)γ2(X) σ2(X) XX 1 E[XX ]1/2 (58) Thus we just want to maximize u2(X)γ2(X) σ2(X) , i.e. maximize the product of the OLS coefficient of A Z conditional on X1 and the conditional standard deviation of the instrument. Revisiting the simple example in the previous remark, when we use an efficient instrument, the matrix U 1 takes the form: u2 1 + u2 2 γ2 For the case when γ = 1/3 and σ = 1, if we choose u2 1 = u2 2 = 1/4, we get 1/4 + 1/36 (compared to the 1/4 we would achieve with the inefficient estimator under the optimal randomization policy). When γ = 1/2, with the same u2 i , we would get 1/4 + 1/16 0.3125, which is larger than the 0.281 we would get under the optimal policy with the inefficient estimator. Remark 6. 
(Non-conditional IV) Even when X = 1 always, it can be shown that randomizing a single binary instrument equiprobably might be sub-optimal. To show this, we will consider the setting where there is heteroskedasticty in outcomes with respected to the treatment A, thus with a slight abuse of notation let σ2(Z) = E[ϵ2 | Z]. Under this setting, we can simplify, U = E[XX ]1/2E[γ(X)u2(X)XX ] 1E[σ2(Z)u2(X)XX ]E[γ(X)u2(X)XX ] 1E[XX ]1/2 U = E[σ2(Z)u2] E[γu2]2 (61) = E[σ2(Z)E[ Z2]] (γE[ Z2])2 u2(X) = E[ Z2|X] (62) where Z is the centered instrument. Now assume γ = 1 (full compliance, i.e., A = Z, such that σ2(Z = 0) = σ2 0 = variance in outcome for treatment A = 0, and σ2 1 is defined similarly.), and Z {0, 1} with probability p and (1 p), U = p2(1 p)σ2 0 + p(1 p)2σ1 p2(1 p)2 (63) = pσ2 0 + (1 p)σ2 1 p(1 p) (64) = σ2 0 1 p + σ2 1 p . (65) When σ2 0 = 1 and σ2 1 = 2, then the minimizer is p 0.6, thereby illustrating that drawing the binary instrument equiprobably can be sub-optimal even in this simple setting. 1Equivalently the efficient instrument can be viewed as the normalized residualized efficient instrument Zγ(X) := E[A | Z, X] E[A | X], and γ(X) is the heterogeneous compliance; normalized by the conditional variance of the conditional moment σ2(X). Published as a conference paper at ICLR 2024 D INFLUENCE FUNCTIONS Influence functions have been widely studied in the machine learning literature from the perspective of robustness and interpretability (Koh & Liang, 2017; Bae et al., 2022; Schioppa et al., 2022; Loo et al., 2023), and dates back to the seminal work in robust statistics of (Cook & Weisberg, 1982). Asymptotic influence functions for two stage procedures and semi-parametric estimators have also been well-studied in the econometrics and semi-parametric inference literature (see e.g. (Ichimura & Newey, 2022; Breheny, 2020; Kahn, 2015)). Here we derive a finite sample influence function of a two-stage estimator that arises in our IV setting, from the perspective of a robust statistics definition of an influence function, that approximates the leave-one-out variation of the mean-squared-error of our estimate. Influence functions characterize the effect of a training sample Si := {Xi, Zi, Ai, Yi} on ˆθ (for brevity, let ˆθ θ(Dn)), i.e., the change in ˆθ if moments associated with Si were perturbed by a small ϵ. Specifically, let ˆθ be a solution to the Mn(θ, ˆϕ) = 0 and ˆϕ be a solution to Qn(ϕ) = 0, as defined in 3. Similarly, let ˆθϵ, ˆϕϵ be a solution to the following perturbed moment conditions M(ϵ, θ, ˆϕϵ) = 0, and ˆϕϵ is the solution to Q(ϵ, ϕ) = 0, where M(ϵ, θ, ϕ) = Mn(θ, ϕ) + ϵm(Si, θ, ϕ) = 0 (66) Q(ϵ, ϕ) = Qn(ϕ) + ϵq(Si, ϕ) = 0. (67) The rate of change in MSE due to an infinitesimal perturbation ϵ in the moments of Si can be obtained by the (first-order) influence function I(1) MSE, which itself can be decomposed using the chain-rule in terms of the (first-order) influence I(1) θ on the estimated parameter ˆθ by perturbing the moment associated with Si, I(1) MSE(Si) := d dθMSE ˆθϵ, ˆϕϵ I(1) θ (Si) ϵ=0 where, I(1) θ (Si) := dˆθϵ, ˆϕϵ To derive the influence function for our estimate we will use the classical implicit function theorem: Theorem 1 (Implicit Function Theorem (Krantz & Parks, 2002)). 
Consider a multi-dimensional vector-valued function F(x, y) Rn, with x Rm and y Rn and fix any point (x0, y0), such that F(x0, y0) = 0 (69) Suppose that F is differentiable and let Fx(x, y) and Fy(x, y), denote the Jacobians of F with respect to its first and second arguments correspondingly. Suppose that det(Fy(x0, y0)) = 0. Then there exists an open set U containing x0 and a unique function ψ(x), such that ψ(x0) = y0 and such that F(x, ψ(x)) = 0 for all x U. Moreover, the Jacobian matrix of partial derivatives of ψ in U is given by: n m = [Fy(x, ψ(x))] 1 Fx(x, ψ(x)) (70) If F is k-times differentiable, then there exists an open set U and a unique function ψ that satisfies the above properties and is also k-times differentiable. Theorem 2 (Influence Functions for Black-box DML Estimators). Assuming that the inverses exist, I(1) θ (Si) = Mn 1 (ˆθ, ˆϕ) m(Si, ˆθ, ˆϕ) Mn ϕ (ˆθ, ˆϕ) Qn 1 (ˆϕ)q(Si, ˆϕ) . (71) Proof. Let ˆθ, ˆϕ be the empirical solutions without perturbation. Note that the ϵ-perturbed estimates θ(ϵ), ϕ(ϵ) are defined as the solution to the zero equations: M(ϵ, θ, ϕ) = Mn(θ, ϕ) + ϵm(Si, θ, ϕ) = 0 (72) Q(ϵ, ϕ) = Qn(ϕ) + ϵq(Si, ϕ) = 0 (73) Published as a conference paper at ICLR 2024 Defining x = ϵ, y = (θ; ϕ) and F(ϵ, (θ, ϕ)) = [ M(ϵ, θ, ϕ); Q(ϵ, θ)]. Suppose that: Fy(0, (ˆθ, ˆϕ)) = DθMn(ˆθ, ˆϕ) DϕMn(ˆθ, ˆϕ) 0 DϕQn(ˆϕ) is invertible (equiv. has a non-zero determinant). Note that for the latter it suffices that the diagonal block matrices are invertible, since it is an upper block triangular matrix. Applying the implicit function theorem, we get that there exists an open set U containing ϵ = 0 and a unique function ψ(ϵ) = (θ(ϵ), ϕ(ϵ)), such that ψ(0) = (ˆθ, ˆϕ) and such that ψ(ϵ) solves the zero equations for all ϵ U. Moreover, for all ϵ U: Dϵθ(ϵ) Dϵϕ(ϵ) = [Fy(ϵ, (θ(ϵ), ϕ(ϵ)))] 1 Fx(ϵ, (θ(ϵ), ϕ(ϵ))) (75) Using the form of the inverse of an upper triangular matrix2 and applying the latter at ϵ = 0 we get that Dϵθ(0) Dϵϕ(0) takes the form: h DθMn(ˆθ, ˆϕ) i 1 h DθMn(ˆθ, ˆϕ) i 1 DϕMn(ˆθ, ˆϕ) h DϕQn(ˆϕ) i 1 0 h DϕQn(ˆϕ) i 1 m(Si, ˆθ, ˆϕ) q(Si, ˆϕ) Since the influence function is I(1) θ (Si) := Dϵθ(0), we get the result by applying the matrix multiplication. Higher order influence functions can be calculated in a similar manner, using repeated applications of the chain rule and the implicit function theorem. Note that writing an explicit form for a general k-th order can often be tedious, and computing it can be practically challenging. Discussion about higher-order influences for the single-stage estimators can be found in the works by Alaa & Van Der Schaar (2019) and Robins et al. (2008). As we formally show in Theorem 5, for our purpose, we recommend using K = 1 as it suffices to dramatically reduce the variance of the gradient estimator, incurring bias that decays quickly, while also being easy to compute (Alternative) Informal Proof. Here we provide an alternate way to derive the form of the influence function without invoking the implicit function theorem. This derivation only makes use of the chain rule of the derivatives. In the following, we will use f to denote partial derivative with respect to the immediate argument of the function, whereas df is for the total derivative. That is, let f(x, g(x)) := xg(x) then f x(x, g(x)) = g(x) df dx(x, g(x)) = g(x) + x g From 67 we know that ˆθϵ, ˆϕϵ is the solution to the moment conditions M ϵ, θ, ˆϕϵ = 0 M ϵ, ˆθϵ, ˆϕϵ, ˆϕϵ = 1 j=1 m(Sj, ˆθϵ, ˆϕϵ, ˆϕϵ) + ϵ m(Si, ˆθϵ, ˆϕϵ, ˆϕϵ) = 0. 
(78) d M dϵ (ϵ, ˆθϵ, ˆϕϵ, ˆϕϵ) = 1 dϵ (Sj, ˆθϵ, ˆϕϵ, ˆϕϵ) + m(Si, ˆθϵ, ˆϕϵ, ˆϕϵ) + ϵdm dϵ (Si, ˆθϵ, ˆϕϵ, ˆϕϵ) (79) dϵ (ˆθϵ, ˆϕϵ, ˆϕϵ) + m(Si, ˆθϵ, ˆϕϵ, ˆϕϵ) + ϵdm dϵ (Si, ˆθϵ, ˆϕϵ, ˆϕϵ) (80) θ (ˆθϵ, ˆϕϵ, ˆϕϵ) dˆθϵ, ˆϕϵ ϕ (ˆθϵ, ˆϕϵ, ˆϕϵ)dˆϕϵ dϵ + m(Si, ˆθϵ, ˆϕϵ, ˆϕϵ) + ϵdm dϵ (Si, ˆθϵ, ˆϕϵ, ˆϕϵ). 2For any block matrix: A X 0 B 1 = A 1 A 1XB 1 Published as a conference paper at ICLR 2024 Further, note that, dϵ = ˆθϵ, ˆϕϵ ϵ + ˆθϵ, ˆϕϵ Using the fact that d M(ϵ, ˆθϵ, ˆϕϵ, ˆϕϵ)/dϵ = 0, rearranging terms in equation 81, 1 (ˆθϵ, ˆϕϵ, ˆϕϵ) m(Si, ˆθϵ, ˆϕϵ, ˆϕϵ) + ϵdm dϵ (Si, ˆθϵ, ˆϕϵ, ˆϕϵ) + Mn ϕ (ˆθϵ, ˆϕϵ, ˆϕϵ)dˆϕϵ Now we expand the terms in blue. Similar to earlier, since ˆϕϵ is the solution to the moment condition Q (ϵ, ϕ) = 0, Q ϵ, ˆϕϵ = 1 j=1 q(Sj, ˆϕϵ) + ϵ q(Si, ˆϕϵ) = 0. (84) dq dϵ (Sj, ˆϕϵ) + q(Si, ˆϕϵ) + ϵdq dϵ (Si, ˆϕϵ) (85) dϵ (ˆϕϵ) + q(Si, ˆϕϵ) + ϵdq dϵ (Si, ˆϕϵ) (86) ϕ (ˆϕϵ)dˆϕϵ dϵ + q(Si, ˆϕϵ) + ϵdq dϵ (Si, ˆϕϵ) (87) Now using the fact that d Q(ϵ, ˆϕϵ)/dϵ = 0, rearranging terms in equation 87, 1 (ˆϕϵ) q(Si, ˆϕϵ) + ϵdq dϵ (Si, ˆϕϵ) . (88) Combining equation 83 and equation 88, 1 (ˆθϵ, ˆϕϵ, ˆϕϵ) m(Si, ˆθϵ, ˆϕϵ, ˆϕϵ) + ϵdm dϵ (Si, ˆθϵ, ˆϕϵ, ˆϕϵ) (89) ϕ (ˆθϵ, ˆϕϵ, ˆϕϵ) Qn 1 (ˆϕϵ) q(Si, ˆϕϵ) + ϵdq dϵ (Si, ˆϕϵ) # At ϵ = 0, notice that ˆθ0, ˆϕ0 = ˆθ and ˆϕ0 = ˆϕ. Therefore, using equation 90, 1 (ˆθ, ˆϕ) m(Si, ˆθ, ˆϕ) Mn ϕ (ˆθ, ˆϕ) Qn 1 (ˆϕ)q(Si, ˆϕ) . (91) E BIAS AND VARIANCES OF GRADIENT ESTIMATORS To convey the key insights, we will consider log π(Si) R to be scalar for simplicity. Similar results would follow when log π(Si) Rd. E.1 NAIVE REINFORCE ESTIMATOR For any random variable Z, we let Z Lp = (E[Zp])1/p. Theorem 3. E h ˆ L(π) i = L(π). Moreover, let: X1 = MSE θ({Si}n j=2) (92) Published as a conference paper at ICLR 2024 denote the leave-one-out MSE and let: 1 = MSE θ({Si}n j=1) MSE θ({Si}n j=2) (93) denote the leave-one-out stability. Suppose that for a sufficiently large n, X1 L4 C2,4 X1 L2 (94) and for Yi := log π(Si), let i, Yi L C and Yi L2 c > 0 for some universal constants c, C, C2,4. Then the variance of the naive REINFORCE estimator is upper and lower bounded as: 2 n X1 2 L2 O n2 1 2 L2 Var ˆ L(π) 3C2 2 n X1 2 L2 + O n2 1 2 L2 (95) Thus, when n 1 L2 = O(1) and n MSE θ({Si}n j=2) 2 L2 = Ω(α(n)) = ω(1), we have Var ˆ L(π) = Ω(α(n)). (96) Remark 7. In many cases, we would expect that the mean squared error decays at the rate of n 1/2 (unless we can invoke a fast parametric rate of 1/n, which holds under strong convexity assumptions); in such cases, we have that α(n) = ω(1) and Var ˆ L(π) can grow with the sample size. Moreover, if the mean squared error does not converge to zero, due to persistent approximation error, or local optima in the optimization, then the variance can grow linearly, i.e. α(n) = Θ(n). Remark 8. The quantity is a leave-one-out stability quantity. Hence, the property that n L2 = O(1) is a leave-one-out-stability property of the estimator, which is a well-studied concept (see e.g. Kearns & Ron (1997); Celisse & Guedj (2016); Abou-Moustafa & Szepesvári (2019)). It states that the estimator is 1/n-leave-one-out-stable. This will typically hold for many M-estimators over parametric spaces. Proof. In the following, we first define some new notation to make the proof more concise. Subsequently, we show (I) unbiasedness, and then (II) we discuss the variance of ˆ L(π). Let Dn be the entire dataset {Si}n i=1, where Si is the i-th sample. 
Zi := MSE(θ({Si}n i=1)) log π(Si) (97) X := MSE(θ({Si}n i=1)) (98) Xi := MSE θ({Si}n j=1,j =i) (99) i := X Xi (100) Yi := log π(Si) (101) (I) Unbiased E h ˆ L(π) i = Eπ i MSE(θ({Si}n i=1)) log π(Si) (a) = Eπ[MSE(θ(Dn)) log Pr(Dn; π)] (103) = Eπ[MSE(θ(Dn))] (104) = L(π), (105) where Eπ indicates that the dataset Dn is sampled using π, and (a) follows from 5. (II) Variance Var ˆ L(π) = Var X Zi = Var Notice {Yi}n i=1 are i.i.d random variables, and X is dependent on all of {Yi}n i=1, which makes the analysis more involved. We begin with the following alternate expansion for variance, Var X Zi = X j Cov(Zi, Zj), (107) where, Cov(Zi, Zj) = Cov(XYi, XYj) = Cov((Xi + i)Yi, (Xi + i)Yj). (108) Published as a conference paper at ICLR 2024 Var X Zi = X i,j Cov(Xi Yi, Xi Yj) + Cov(Xi Yi, i Yj) + Cov( i Yi, Xi Yj) + Cov( i Yi, i Yj). Now, observe that as Xi Yi, Yi Yj, and E[Yi] = E[Xi Yi] = E[Xi Yj] = 0. Therefore, Cov(Xi Yi, Xi Yj) = E[X2 i Yi Yj] = E[X2 i Yj]E[Yi] = 0, i = j (110) Cov(Xi Yi, Xi Yj) = E[X2 i Y 2 i ], i = j. (111) For all i, j: Cov(Xi Yi, i Yj) = E[Xi i Yi Yj] E[Xi Yi]E[ i Yj] = E[Xi i Yi Yj], (112) Cov( i Yi, Xi Yj) = E[Xi i Yi Yj] E[ i Yi]E[Xi Yj], (113) |Cov( i Yi, i Yj)| (a) q E[ 2 i Y 2 i ]E[ 2 i Y 2 j ] C2 E[ 2 i ], (114) where (a) follows from Cauchy-Schwarz by noting that for any two random variables A and B, Cov(A, B) p Var(A)Var(B) and that Var(A) E[A2]. If we let Y = Pn j=1 Yi, we have:3 i E[X2 i Y 2 i ] j 2E[Xi i Yi Yj] E[ i Yi]E[Xi Yj] + n2C2E[ 2 1] i 2E[Xi i Yi Y ] E[ i Yi]E[Xi Y ] + n2C2E[ 2 1]. (116) Now, by assumption we have that Yi L C for some constant C. Further, from Cauchy Schwarz, |E[AB]| p E[A2]E[B2] for any two random variable A and B. Thus we get: E[ i Yi]E[Xi Y ] q E[ 2 i ]E[Y 2 i ] q E[ Y 2] (117) E[ 2 i ]E[Y 2 i ] q n E[Y 2 i ] (118) n E[ 2 i ]E[X2 i ] Yi L C (119) = C2 n i L2 Xi L2 (120) where (a) follows from observing that E[(P i Yi)2] = E[P i,j Yi Yj] = E[P i Y 2 i ] = n E[Y 2 i ] as E[Yi Yj] = 0 for i = j. The final expression uses that p E[X2 i ] = Xi L2. Similarly, E[Xi i Yi Y ] q E[ 2 i Y 2 i ] q E[X2 i Y 2] C q E[ 2 i ] E[X4 i ]E[ Y 4] 1/4 (121) Now by the Marcinkiewicz-Zygmund inequality Rio (2009) we have that 1 n P Cp Yi Lp for any p 2 and for some universal constant Cp. Therefore, (E[ Y 4])1/4 n C4 Yi L4 and E[Xi i Yi Y ] CC4 i L2 Xi L4 n Yi L4 (122) C2C4 n i L2 Xi L4 (123) C2C4C2,4 n i L2 Xi L2, (124) where the last line follows as Xi L4 C2,4 Xi L2. Thus for some constant C1, combining 116, 120, and 124, we conclude that: Var X Zi X i E[X2 i Y 2 i ] n i L2 Xi L2 + n2C2 1 2 L2. (125) 3We note here that an important aspect of the proof is to first push the summation inside to get Y and then to apply the Cauchy-Schwarz inequality, since Y concentrates around n and one saves an important extra n factor. Published as a conference paper at ICLR 2024 Recall from the AM-GM inequality that for any a and b: a b 1 2ηa2 + 2ηb2 for any η > 0. Let a = C1 n i and b = Xi L2 = p E[X2 i ] and η = c2/4, then Var X Zi X i E[X2 i Y 2 i ] 2C2 1n i 2 L2 c2 + c2 2 E[X2 i ] + n2C2 1 2 L2. (126) Since c2 E[Y 2 i ] C2 and Xi Yi, we have E[X2 i Y 2 i ] = E[X2 i ]E[Y 2 i ] C2E[X2 i ], and E[X2 i Y 2 i ] c2E[X2 i ] . Further, as |A B| C = A B + C and A B C, we obtain the lower bound: E[X2 i ]c2 2C2 1n i 2 L2 c2 c2 2 E[X2 i ] n2C2 1 2 L2 (127) 2 E[X2 1] O(n2 1 2 L2). (128) Similarly, for the upper bound: E[X2 i ]C2 + 2C2 1n i 2 L2 c2 + c2 2 E[X2 i ] + n2C2 1 2 L2 (129) 2 E[X2 1] + O n2 1 2 L2 . 
c2 C2 Therefore, we obtain our result 2 n X1 2 L2 O n2 1 2 L2 Var ˆ L(π) 3C2 2 n X1 2 L2 + O n2 1 2 L2 . (130) E.2 REINFORCE ESTIMATOR WITH CV Theorem 4. E h ˆ CVL(π) i = L(π). Moreover, let: 1 = MSE θ({Si}n j=1) MSE θ({Si}n j=2) (131) denote the leave-one-out stability. Let Yi = log π(Si) and suppose that and that Yi L C for some universal constant C. Then the variance of the CV REINFORCE estimator is upper bounded as: Var ˆ CVL(π) C2n2 1 2 L2 (132) Thus if the estimator satisfies a n 1-leave-one-out stability, i.e. n 1 L2 = O(1), then we have: Var ˆ CVL(π) O(1). (133) Remark 9. This holds irrespective of α(n), thus providing a more reliable gradient estimator even if the function approximator is mis-specified or optimization is susceptible to local optima. Proof. Let Dn be the entire dataset {Si}n i=1, where Si is the i-th sample. Zi = log π(Si) MSE θ({Sj}n j=1) MSE θ({Sj}n j=1,j =i) (134) i = MSE θ({Sj}n j=1) MSE θ({Sj}n j=1,j =i) (135) Yi = log π(Si) (136) Published as a conference paper at ICLR 2024 (I) Unbiased E h ˆ CVL(π) i = Eπ i log π(Si) MSE θ({Sj}n j=1) MSE θ({Sj}n j=1,j =i) # = Eπ h ˆ L(π) i Eπ i log π(Si)MSE θ({Sj}n j=1,j =i) # (a) = L(π) Eπ i MSE θ(Dn\i) Eπ log π(Si) Dn\i = L(π), (140) where (a) follows as Si Dn\i, and Eπ[ log π(Si)] = 0. (II) Variance Let Zi = Zi E[Zi]. First, observe that From Jensen s inequality, as 1 i E[ Z2 i ] = 1 i Var(Zi). (142) Var ˆ CVL(π) = Var X Zi n X Var(Zi) n2E[ 2 1Y 2 1 ] n2C2E[ 2 1]. (143) E.3 REINFORCE ESTIMATOR WITH INFLUENCE Theorem 5. E h ˆ IFL(π) i = L(π) + O(n K). Moreover, let: 1 = MSE θ({Si}n j=1) MSE θ({Si}n j=2) (144) denote the leave-one-out stability. Let MSE θ({Si}n j=1) admit a convergent Taylor series expansion. Let Yi = log π(Si) and suppose that and that Yi L C for some universal constant C. Then the variance of the Influence function based REINFORCE estimator is upper bounded as: Var ˆ IFL(π) O n2 1 2 L2 + n2(1 K) (145) Thus if the estimator satisfies a n 1-leave-one-out stability, i.e. n 1 L2 = O(1), and K 1, then we have: Var ˆ IFL(π) O(1). (146) Proof. Let {Yi}n i=1 be i.i.d random variables, and let i be a variable dependent on all of {Yi}n i=1. Zi = log π(Si) i (147) k! I(k) MSE = i + O(n K) (148) i = MSE θ({Sj}n j=1) MSE θ({Sj}n j=1,j =i) (149) Yi = log π(Si) (150) Published as a conference paper at ICLR 2024 Notice {Yi}n i=1 are i.i.d random variables, and i is dependent on all of {Yi}n i=1. The bias follows from standard results on K-th order remainder term in a convergent Taylor series expansion. Finally, it can be observed that Var(P Zi) follows the exact same analysis as done for ˆ CVL(π): Var ˆ IFL(π) n2C2E[ 2 1] n2C22E[( 2 1 + O(n 2K)] = O n2E[ 2 1] + n2 F ESTIMATION ERROR, BIAS AND VARIANCE ANALYSIS F.1 PRELIMINARIES ON U-STATISTICS We first present a result on the variance of U-statistics with a growing order that primarily follows from recent prior work. For references see Fan et al. (2018); Hoeffding (1948); Wager & Athey (2018b) and some concise course notes [A, B, C, D, E]. With a slight abuse of notation, in the following let ZS represents a set of S random variables (and is unrelated to notation for instrumental variables). Let n be the number of random variables available and k be the size of the selected random variables. Theorem 6 (Variance of U-Statistics). Consider any k-order U-statistic U = 1 n k X S [n]:|S|=k h(ZS) where h is a symmetric function in its arguments. Let: ηc = Var(E[h(Z) | Z1:c]) (152) Then, we have that: Var(U) = n k Moreover, one can show that: n η1 Var(U) k2 n2 ηk, (154) k ηk. 
(155) These results, for instance, follow from the classical Hoeffding s canonical decomposition of a Ustatistic and an inequality due to Wager & Athey (2018b), to upper bound the non-leading term, for every k, n, η1, . . . , ηk not just asymptotically and for fixed k, η1, . . . , ηk, and letting n as in Hoeffding s classical analysis. The proof is included in Appendix F.5. Typically, due to computational considerations, one considers an incomplete version of a U-statistic, where instead of calculating all possible subsets of size k, we draw B random subsets {S1, . . . , Sb}, each of size k, uniformly at random among all subsets of size k. Then we define the incomplete U-statistic approximation as: b=1 h(ZSb) (156) We can also derive the following finite sample variance characterization for an incomplete Ustatistic: Lemma 2. The variance of the incomplete U-statistic ˆU is: Var( ˆU) = 1 1 B ηk. (157) Published as a conference paper at ICLR 2024 Proof. Note that ˆU is an unbiased estimate of U, conditional on the samples, i.e.: E[ ˆU | Z1:n] = U. (158) Moreover, the variance of an incomplete U-statistic can be decomposed by the law of total variance, as: Var( ˆU) = Var(E[ ˆU | Z1:n]) + E[Var( ˆU | Z1:n)] = Var(U) + E[Var( ˆU | Z1:n)]. (159) Note that, conditional on Z1:n, the incomplete statistic ˆU is the mean of i.i.d. samples of h(ZSb), where S is a random subset of the original n samples of size k. Thus: Var( ˆU | Z1:n) = 1 B Var(h(ZS) | Z1:n). (160) Combining 159 and 160, Var( ˆU) = Var(U) + 1 B E[Var(h(ZS) | Z1:n)]. (161) Moreover, note that the un-conditional variance of the first k samples, is equal to the unconditional variance of a random k subset of the n samples, since we can equivalently think of drawing k i.i.d. samples, by first drawing n samples and then choosing a random subset of them. Var(h(Z1:k)) = Var(h(ZS)) = Var(E[h(ZS) | Z1:n]) + E[Var(h(ZS) | Z1:n)]. (162) Moreover, note that by the definition of the random sample S and the U-statistic: E[h(ZS) | Z1:n] = U = Var(E[h(ZS) | Z1:n]) = Var(U). (163) Thus we get: Var( ˆU) = Var(U) + 1 B Var(h(Z1:k)) 1 B Var(U) = 1 1 B ηk. (164) F.2 VARIANCE OF IS PROCEDURE Observe that for the full n sample size, the importance sampling based estimate of MSE is MSEIS n = E Dn {πi}n i=1 Pr(Dn; π) Pr(Dn; πi) (a) = E Dn {πi}n i=1 π(Zi|Xi) πi(Zi|Xi) where (a) follows from expanding Pr(Dn; π) and Pr(Dn; πi) using 6. Let MSEIS k be the estimate of MSE for a subset dataset of size k, obtained using importance sampling and a complete enu- meration over all subsets of n of size k. Similarly, let [ MSE IS k denote it s computationally friendly approximation that draws B subsets of size k uniformly at random. Moreover, for simplicity, we will assume here, that given a sub-sample of size k we can exactly calculate the MSE of the estimate that is produced from that sub-sample. Since our goal is to showcase the poor behavior of the IS approach, it suffices to showcase such a behavior in this ideal scenario. Theorem 7. If MSE(θ(Dk)) L2 = α(k) then Var(MSEIS k ) α(k)2 n k ρc max (166) n ρmax + k2 Var([ MSE IS k ) α(k)2 k2 n ρmax + k2 n2 ρk max + 1 Published as a conference paper at ICLR 2024 Moreover, for any deterministic data generating process without covariates, and for any deterministic policy π, the variance can be exactly characterized as: Var(MSEIS k ) = α(k)2 n k (ρc max 1). (169) Var([ MSE IS k ) = 1 1 Var(MSEIS k ) + α(k)2 1 B ρk max 1 . 
(170) Even though the lower bound formula for the complete U-statistic is hard to parse in terms of a rate, we see that for many reasonable values of n and k, its dependence in ρmax is exponential (see Figure 7). Moreover, it is obvious that for any finite B (which would be the typical application), the dependence in ρmax is exponential in k, due to the extra variance stemming from the random sampling of subsets. Proof. Let n be the total number of samples. Let k = o(n) be the desired number of samples. To consider analysis with all possible combination C(n, k), we will construct a U-estimator. Let the h be the kernel for the importance weighted base estimator, and Dk = {Si}k i=1, h({Si}k i=1) = ρ1:k MSE(θ(Dk)), (171) i=1 ρ(Si). (172) The U-statistic corresponding to this kernel is U = k!(n k)! i1> n. We will also consider an alternative estimate [ MSE RS k that is the computationally friendly approximation that draws B subsets of size k uniformly at random from the N accepted samples, instead of performing complete enumeration over subsets of size k done for MSERS k . Theorem 8 (Consistency of Rejection Sampling). Consider the rejection sampling estimate MSERS k as defined in Equation (195) and its incomplete U-statistic variant [ MSE RS k . For any dataset Dn of iid samples of si ze n drawn from policy π, define: α(n) := MSE(θ(Dn)) L2 (197) Assume that k n/2 and n is large enough, such that n/ log(n) 8ρmax. Then: Var MSERS k O α(k)2 kρmax M + α(n) , (198) BIAS MSERS k O α(k)e n/8ρmax + 1 α(n) , (199) and similarly for the incomplete variant: Var [ MSE RS k O α(k)2 kρmax n + α(k)2 1 M + α(n) , (200) BIAS [ MSE RS k O α(k)e n/8ρmax + 1 α(n) . (201) Moreover, the probability of failure, i.e. Pr(N < k), is bounded as: e n/(8ρmax). Proof. Let n be the total number of samples and N be the number of of samples accepted by the rejection sampler, which in expectation equals to n := E[N ] = n/ρmax. Like before, let k be the desired number of samples. Let Dk = {S1:k}. We first consider the difference between MSERS k from 195 and its ideal analogue (note that this ideal version is biased as it returns 0 when N < k), i=1 MSE(θ({Sij}k j=1)) c = N MSE(θ(Dk)) = EX,A h (f(X, A; θ0) f(X, A; θ(Dk)))2i (203) We now break the proof into three parts: Published as a conference paper at ICLR 2024 Part I: Express bias and variance of MSERS k in terms of bias and variance of U. Part II: Compute Variance of U (for both complete and incomplete U-statistic). Part III: Compute bias of U (for both complete and incomplete U-statistic). (Part I) Bias and variance of MSERS k in terms of that of U: We will first express the error of MSERS k in terms of the the convergence rate of MSE(θ(Dn)), as θ(Dn) is used as a proxy for θ0 by MSERS k . Observe that, E[(MSERS k U)2] (204) = E En[[ MSE(θ(Dk)] En[MSE(θ(Dk))] 2 (205) = E En[[ MSE(θ(Dk) MSE(θ(Dk))] 2 (206) [ MSE(θ(Dk)) MSE(θ(Dk)) 2 (Jensen s inequality) = E [ MSE(θ(Dk)) MSE(θ(Dk)) 2 . (207) (a) = E [ MSE(θ(Dk)) ] MSE(θ(Dk)) MSE(θ(Dk)) 2 (208) (b) 2E [ MSE(θ(Dk)) ] MSE(θ(Dk)) 2 + 2E ] MSE(θ(Dk)) MSE(θ(Dk)) 2 (209) where (a) follows from defining an intermediate quantity ] MSE , which unlike [ MSE(θ(Dk)) in 195 computes the true expectation instead of empirical average, ] MSE(θ(Dk)) = EX,A h (f(Xi, Ai; θ(Dn)) f(Xi, Ai; θ(Dk)))2i , (210) and (b) follows from (a + b)2 2(a2 + b2). Now we bound the two terms in the RHS of 209. 
To bound the first term in the RHS of 209, note that since predictions are uniformly bounded and the M samples are independent of the samples in Dn, we have: E [ MSE(θ(Dk)) ] MSE(θ(Dk)) 2 C for some constant C, since ] MSE(θ(Dk)) is the mean of [ MSE(θ(Dk)), conditional on Dn and [ MSE(θ(Dk)) is the sum of M iid variables, conditional on Dn. To bound the second term in the RHS of 209, we have that: E MSE(θ(Dk)) ] MSE(θ(Dk)) 2 (212) = E EX,A h (f(X, A; θ0) f(X, A; θ(Dk)))2 (f(X, A; θ(Dn)) f(X, A; θ(Dk)))2i2 (a) E h EX,A[(f(X, A; θ0) f(X, A; θ(Dn)))]2i (213) E h EX,A h (f(X, A; θ0) f(X, A; θ(Dn)))2ii (Jensen s inequality) = E [MSE(θ(Dn))] p E [MSE(θ(Dn))2] = α(n), (214) where (a) follows as (a b)2 (c b)2 (a c)2. Thus, combining 209, 211, and 214: E[(MSERS k U)2] 2C M + 2α(n). (215) Published as a conference paper at ICLR 2024 Therefore, using 215, bias and variance of MSERS k can be expressed using the bias and variance of the idealized estimate U as the following, Var(MSERS k ) = Var(MSERS k + U U) (216) 2Var(U) + 2Var(MSERS k U) Var(A + B) 2(Var(A) + Var(B)) 2Var(U) + 2E[(MSERS k U)2] (217) 2Var(U) + 4C M + 4α(n). (218) Bias(MSERS k ) = |E[MSERS k ] E[MSE(θ(Dk))]| (219) = |E[MSERS k ] E[U] E[MSE(θ(Dk))]| (220) |E[U] E[MSE(θ(Dk))]| + q E[(MSERS k U)2] (221) M + 2α(n). (222) Thus it suffices to analyze the bias and variance of the idealized estimate U. (Part II) Analysis of variance of U: Note that the idealized estimate U returns 0 if N < k and otherwise returns a U-statistic (or an incomplete version of it): i=1 h {Sij}k j=1 c = N So if we let U be our estimate of the MSE on k samples, we have: U = UN 1{N k}. (224) Variance of complete U-statistic First we note that: Var(U) = Var(E[U | N ]) + E[Var(U | N )] (225) The latter term, i.e. Var(U | N ) is the variance of a U-statistic of order k with N samples drawn iid from distribution π. Thus based on Theorem 6, we have that for N k: Var(U | N ) k2 (N )2 ηk (226) k N ηk + k2 (N )2 ηk ( η1 1 N Varπ(h(S1:k)) (k/N 1) N Eπ[MSE(θ(Dk))2] = 2 k N α(k)2. (227) Moreover, for N < k, we have by the definition of U that Var(U | N ) = 0. Thus we get that: E[Var(U | N )] = 2kα(k)2E 1{N k} 2kα(k)2E 1 max{N , k} Note that N is a binomially distributed random variable, Bin(n, 1/ρmax). Thus by a Chernoff bound, and reminding that n = E[N ] = n/ρmax, we have that: Pr(N (1 δ)n ) e δ2n /2. (229) Therefore to upperbound E h 1 max{N ,k} i , we split it into the following two cases 1 max{N , k} ( 1 (1 δ)n when N > (1 δ)n 1 k otherwise (230) Published as a conference paper at ICLR 2024 where the first case holds because irrespective of the value of N , the value of 1/ max{N , k} 1/N . Further, when N > (1 δ)n , then 1/N 1/(1 δ)n . Thus we get that: E[Var(U | N )] 2kα(k)2 Pr(N > (1 δ)n ) (1 δ)n + Pr(N (1 δ)n ) 2kα(k)2 1 (1 δ)n + 1 k e δ2n /2 (232) = 2α(k)2 k (1 δ)n + e δ2n /2 (233) = 2α(k)2 kρmax (1 δ)n + e δ2n/(2ρmax) (234) Choosing δ = q 2ρmax log(n) n , we get: E[Var(U | N )] 2α(k)2 kρmax 4α(k)2 kρmax (1 δ)n (235) For n large enough such that n/ log(n) 8ρmax, we get δ 1/2 and: E[Var(U | N )] 8α(k)2 kρmax We now consider the first part in the variance decomposition in Equation (225). Note that for N k: E[U | N ] = E[h(Z1:k)] = Eπ[MSE(θ(Dk))] (237) While for N < k, we have E[U | N ] = 0. Thus we have: Var(E[U | N ]) = Var(Eπ[MSE(θ(Dk))]1{N k}) (238) = Eπ[MSE(θ(Dk))]2Var(1{N k}) (239) Eπ[MSE(θ(Dk))2]Var(1{N k}) (240) α(k)2Var(1{N k}) (241) = α(k)2 Pr(N k) (1 Pr(N k)) (242) α(k)2 Pr(N < k). 
(243) If we let δ, be such that k = (1 δ)n, then we have by a Chernoff bound Pr(N < k) e δ2n /2 = e δ2n/(2ρmax) = e (1 k/n)2n/(2ρmax) Assuming that k n/2, we have: Pr(N < k) e n/(8ρmax). (244) Since, we assume that n/ log(n) 8ρmax, we have: Pr(N < k) 1 Thus overall we have derived that: Var(U) 9α(k)2 kρmax Variance of incomplete U-statistic With an identical analysis, if we instead use an incomplete Ustatistic with B random sub-samples (Lemma 2), then we can apply the exact same analysis, albeit, when N k: Var(U | N ) 2 k N α(k)2 + 1 B Varπ(h(S1:k)) (247) N α(k)2 + 1 B α(k)2. (248) Published as a conference paper at ICLR 2024 Following then an identical analysis, we can derive: Var(U) α(k)2 9 kρmax (Part III) Analysis of bias of U: Finally, we also quantify the bias of the rejection sampling estimate. Note that when N k, we have: E[U | N ] = E[h(S1:k)] = Eπ[MSE(θ(Dk))] (250) and the estimate is un-biased. However, when N < k, then we output 0, hence we incur a bias of: |E[U | N ] Eπ[MSE(θ(Dk))]| = Eπ[MSE(θ(Dk))] α(k). (251) Thus overall we have a bias of: |E[U] Eπ[MSE(θ(Dk))]| |E[U | N ] Eπ[MSE(θ(Dk))]| Pr(N < k) (252) α(k)e n/(8ρmax) (253) The bias of the incomplete U-statistic variant is identical, since both variants are un-biased when N k. Combining the aforementioned bounds on the idealized estimate U with our error bounds between U and MSERS k , we get the result. F.4 POLICY ORDERING CONSISTENCY GUARANTEES Let: Lπ(k) = Eπ [MSE(θ(Dk))] , απ(k) = Eπ h MSE(θ(Dk))2i . (254) Suppose that we assume that the mean-squared-error, irrespective of sampling distribution, satisfies that: rn Lπ(n) Vπ, rnαπ(n) = Θ(1), (255) for some random variable Vπ and some rate rn (e.g. rn = n). Then we can consider the normalized estimate: rk [ MSE RS k . (256) Suppose we choose M >> n kα(k)2 and B >> n k and we choose k such that r2 k/rn 0 and k/n 0. Then, from Theorem 8, we have that the normalized estimate of the mean-squared-error from 256 has: Var(rk [ MSE RS k ) = O r2 kα(k)2 k n + r2 kα(n) = O k n + r2 k/rn rk BIAS [ MSE RS k O rkα(k)e n/8ρmax + rk p α(n) = O e n/8ρmax + rk/ rn 0 Thus we can conclude that the normalized mean squared error of the estimate will satisfy: r2 k E[([ MSE RS k Lπ(k))2] 0 (259) Thus we get that: rk [ MSE RS k = rk Lπ(k) + rk([ MSE RS k Lπ(k)) = Vπ + op(1). (260) From this we conclude that optimizing rk [ MSE RS k over policies π approximately optimizes Vπ, up to an asymptotically neglibible error. Thus for a large enough sample n, we should expect that we are choosing an approximately optimal policy π. Published as a conference paper at ICLR 2024 Corollary 1 (Approximate Policy Ordering). For any dataset Dn of iid samples of size n drawn from policy π, define: Lπ(k) = Eπ [MSE(θ(Dk))] απ(k) = Eπ h MSE(θ(Dk))2i (261) Suppose that for some appropriately defined growth rate rn, we have that: |rn Lπ(n) Vπ| ϵn rnαπ(n) = O(1) (262) for some rate rn that is policy independent and for some approximation error ϵn 0. Let [ MSE RS k (π) denote the rejection sampling estimate of the mean squared error of π. Suppose that we choose k such that r2 k/rn 0 and k/n 0 and k and M, B are large enough. Then for any fixed policy π: E[(rk [ MSE RS k (π) Vπ)2] = E[(rk [ MSE RS k (π) rk Lπ(k) Vπ)2] (263) 2ϵ2 k + 2r2 k E[([ MSE RS k Lπ(k))2] (264) (a) O ϵ2 k + k n + r2 k rn + e n/(8ρmax) =: γ(n, k) 0, (265) where (a) follows from using the bias and variance terms from Theorem 8 to expand the mean- squared-error E[([ MSE RS k Lπ(k))2]. Let π1, π2 be two candidate policies. 
if we choose π2 over π1 based on ˆVπ (i.e., ˆVπ1 < ˆVπ2), By our assumptions on the convergence of rn Lπ(n), we have: rn(Lπ1(n) Lπ2(n)) Vπ1 Vπ2 + 2ϵn (266) ˆVπ1 ˆVπ2 + 2 ϵn + max π {π1,π2} |Vπ ˆVπ| (267) 2 ϵn + max π {π1,π2} |Vπ ˆVπ| (268) By Markov s inequality and a union bound we have w.p. 1 δ: δ E[|Vπ ˆVπ|] 2 E[|Vπ ˆVπ|2] 2 γ(n, k) (269) for π {π1, π2}. Thus overall we get w.p. 1 δ: rn(Lπ1(n) Lπ2(n)) O ϵn + 1 γ(n, k) . (270) Thus the policy that is best according to the estimated criterion ˆVπ is also approximately best according to the normalized expected mean-squared-error with n samples, i.e. rn Lπ(n). F.5 PROOF OF THEOREM 6 For the first part of the Lemma, i.e. proving Equation (153), we first observe that: Var(U) = n k 1 i1<... 0} it hods that qi(x) > 0, i {1, ..., k}, x X. Multiple-rejection sampling replaces the support condition from all qi s to at least one qi. That is, we only require that 1 k Pk i=1 qi(x) > 0, x X. Alternatively, we only require that x X, { i {1, ..., k}, s.t. qi(x) > 0}. For instance, if q1 (the first exploration policy) is uniform, then subsequent qi s can even be deterministic and the data from it can still be used for multiple rejection sampling. This would not have been possible if we had used single-rejection sampling. Remark 11. Multiple-rejection sampling can often also reduce the value of the normalizing constant significantly as now ρmax := max x p(x) 1 k Pk j=1 qj(x) (308) which can reduce the failure rate of rejection sampling significantly. It can be shown that the expected number of samples accepted using multiple-rejection sampling will always be more than or equal to the number of samples accepted under individual rejection sampling, irrespective of how different are the proposal distributions {qi}k i=1 to the target p. To see this, recall that for individual rejection sampling, the expected number of samples accepted from N samples drawn from a proposal q is N/ρmax as Pr(Accept X) = Pr U p(X) ρmax q(X) = Z Pr U p(x) ρmax q(x) X = x q(x) dx (310) = Z p(x) ρmax q(x)q(x) dx (311) = 1 ρmax . (312) Published as a conference paper at ICLR 2024 Let there be k proposal distributions, with N samples drawn from each distributions. Therefore, the expected number of samples accepted would be N ρimax = N 1 ρimax = N maxx p(x) qi(x) = N Contrast this with the expected number of samples accepted under multiple-rejection sampling, where all the Nk samples are treated equally, Nk ρmax = Nk maxx p(x) 1 k Pk i=1 qi(x) = Nk min x 1 k Pk i=1 qi(x) p(x) = N min x N ρimax N ρimax i {1, ..., k}. (317) Example 1. consider the case where N samples are drawn from each of the two proposal distributions q1 and q2 over {0, 1}, such that q1(x = 1) = q2(x = 0) = 0.75 and q1(x = 0) = q2(x = 1) = 0.25. Let target distribution be p(x = 0) = p(x = 1) = 0.5. In this case, it can be observed that individual rejection sampling will reject 50% of the samples! In contrast, multiple-rejection sampling will accept all the samples as the average proposal distribution (q1 + q2)/2 = p. Example 2. Consider the other extreme case where one of the proposal distribution is exactly the same as the target distribution and the other is completely opposite. We will see that multiplerejection sampling stays robust against such settings as well. 
Let N samples be drawn from each of the two proposal distributions q1 and q2 over {0, 1}, such that q1(x = 1) = q2(x = 0) 0 and q1(x = 0) = q2(x = 1) 1 (here we use instead of = to ensure that individual rejection sampling satisfies support assumption (note: multiple rejection sampling will be valid in the case of equality as well)). Let the target distribution be p = q1. In this case, both individual rejection sampling and multiple-rejection sampling accept 50% of all the samples drawn (i.e., 100% of the samples drawn from q1). This can be worked out by observing that (a) Since p(x = 1) = 0, no samples will be selected from the samples drawn from q2. (b) The average proposal distribution is q = [0.5, 0.5], therefore maxx p(x)/ q(x) = 2. Since q1(x = 0) = 1 only 0 s are sampled from q1. Probability of accepting zero under multiple rejection sampling is (p(x = 0)/ q(x = 0))/( ρmax) = (1/0.5)/2 = 1. H ALGORITHM H.1 PSEUDO-CODE Algorithm 1: DIA: Designing Instruments Adaptively 1 Input Number of allocations K, Size of each allocation n, Subsampling factor α 2 π = uniform 3 D = Get Data(π, n) Sample n new data using π 4 for i [1, ..., K 1] do 5 π = Optimize(D, α, Kn) 6 D = Get Data(π, n) 7 Return θ = Estimate(D) Published as a conference paper at ICLR 2024 Algorithm 2: Optimize 1 Input Dataset D, Sub-sampling factor α, Max-budget N 2 Initialize πφ 3 Initialize Θ Set of B parameters 5 ˆθ0 = Estimate(D) Compute estimate on the entire data 6 while πφ not converged do 7 πeff(Z|X) n N Pr(Z|X; D) + 1 n N πφ(Z|X) Equation 16 #Parallel For loops 8 for b [1, ..., B] do 9 Dk = Multi-Reject-Sampler(D, πeff, α) Equation 15, with k = |D|α 10 Θ[b] = Estimate(Dk, Init = Θ[b]) Optional: Init from previous solution 11 MSE = (Θ[b] ˆθ0)2 12 I = Get Influences(MSE, Θ[b], Dk) Equation 12 13 IF [b] = P s Dk φ log πeff(s)I(s) Equation 17 14 πφ Update(πφ, IF ) Any optimizer: SGD, Adam, etc. 15 Return πφ We provide a pseudo code for the proposed DIA algorithm in 1 and 2. In the following we provide a short description of the functions used there-in. Get Data: Uses the given policy to sample n new data samples Estimate: Uses 3 to solve for the required nuisances and the parameters of interest. Multi-Reject-Sampler: Takes as input dataset D of size n and returns a sub-sampled dataset of size k = nα using the proposed multi-rejection sampling 15. Since the data is collected by the algorithm itself, DIA has access to π and also all the prior policies (that are require for the denominator of ρ) and can directly estimate ρ. In practice, the max mutli-importance ratio ρmax may not be known. Therefore, we use the empirical supremum (Caffo et al., 2002), where empirical max of multi-importance ratios are used in the place of ρmax. Get Influences: Uses 12 to return influence of each of the samples s Dk on the MSE. Here we discuss how to compute it efficiently using Hessian vector products (Pearlmutter, 1994). Recall, from 11 and 12, I(1) MSE(Si) = d dθMSE(ˆθ) I(1) θ (Si) (318) I(1) θ (Si) = Mn 1 (ˆθ, ˆϕ) m(Si, ˆθ, ˆϕ) Mn ϕ (ˆθ, ˆϕ) Qn 1 (ˆϕ)q(Si, ˆϕ) . (319) Note that d dθ can be computed directly as it is just the gradient of the MSE. Below, we focus on I(1) θ (Si). To begin, Let θ Rd1 and ϕ Rd2. Let H = Qn ϕ Rd2 d2 and v = q(Si, ˆϕ) Rd2, such that we want to compute H 1v. Notice that this is a solution to the equation Hp = v, therefore we can solve for arg minp Rd2 Hp v 2 using any method (we make use of conjugate gradients). Importantly, note that we no longer need to explicitly compute the inverse. 
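As a concrete illustration of the conjugate-gradient step described above, the following sketch solves Hp = v using only matrix-vector products, never forming or inverting H explicitly. Here a small synthetic symmetric positive-definite matrix stands in for Qn/ ϕ; in practice the product hvp(p) would instead be supplied by an autodiff Hessian-vector product (e.g., via the stop-gradient trick above), and the least-squares formulation arg min_p ||Hp - v||^2 additionally covers non-symmetric H.

import numpy as np

def conjugate_gradient(hvp, v, iters=50, tol=1e-10):
    """Solve H p = v using only matrix-vector products hvp(p) = H p."""
    p = np.zeros_like(v)
    r = v - hvp(p)          # residual
    d = r.copy()            # search direction
    rs = r @ r
    for _ in range(iters):
        Hd = hvp(d)
        alpha = rs / (d @ Hd)
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

# Toy stand-in for H (assumed symmetric positive definite) and v = q(S_i, phi_hat).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + 5 * np.eye(5)
v = rng.normal(size=5)
p = conjugate_gradient(lambda x: H @ x, v)
print(np.allclose(p, np.linalg.solve(H, v), atol=1e-6))  # True: p approximates H^{-1} v

The same routine can be reused for each of the inverse-Jacobian-vector products appearing in I(1)θ(Si), with the matrix-vector products supplied by repeated Hessian-vector products.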
Further, to avoid even explicitly computing or storing H, we use the stop-gradient operator sg available in auto-diff libraries and express

∇_φ ⟨Q, sg(p)⟩ ∈ R^{d_2},   (320)

which reduces to merely taking the derivative of the scalar ⟨Q, sg(p)⟩ ∈ R. Thus, using Hessian-vector products and conjugate gradients, we can compute H^{-1}v even for arbitrary neural networks with d_2 parameters. Similarly, we can compute all the terms in I^(1)_θ(S_i) by repeated use of Hessian-vector products. Other tricks can be used to scale up this computation to millions (Lorraine et al., 2020) and billions of hyper-parameters (Grosse et al., 2023).

Finally, we also observe that the higher-order influence I^(2)_MSE(S_i) can be partially estimated using the first-order I^(1)_θ(S_i) terms. Recall from (68) that

I^(1)_MSE(S_i) = (d/dθ) MSE(θ̂_{i,ε}, φ̂_{i,ε}) · I^(1)_θ(S_i) |_{ε=0},  where  I^(1)_θ(S_i) = (d/dε)(θ̂_{i,ε}, φ̂_{i,ε}).

I^(2)_MSE(S_i) = (d/dε) [ (d/dθ) MSE(θ̂_{i,ε}, φ̂_{i,ε}) · I^(1)_θ(S_i) ] |_{ε=0}   (322)
= I^(1)_θ(S_i)ᵀ (d²/dθ²) MSE(θ̂_{i,ε}, φ̂_{i,ε}) I^(1)_θ(S_i) + (d/dθ) MSE(θ̂_{i,ε}, φ̂_{i,ε}) · (d/dε) I^(1)_θ(S_i) |_{ε=0}   [second term ignored]
≈ I^(1)_θ(S_i)ᵀ (d²/dθ²) MSE(θ̂) I^(1)_θ(S_i).   (325)

A Hessian-vector product can now again be used to compute this approximation of I^(2)_MSE(S_i) using the already computed I^(1)_θ(S_i).

Update: Uses any optimizer to update the parameters of the policy.

H.2 INFLUENCE FUNCTION USAGE WHEN THE ESTIMATES ARE STOCHASTIC

When using influence functions for deep-learning-based estimators, prior works caution that influence functions might not provide an accurate estimate of the leave-one-out loss (Basu et al., 2020; Bae et al., 2022). One reason is that the optimization process may converge to different (local) optima depending on the stochasticity in the training process. Therefore, the difference between an estimate of θ(D_n) and θ(D_{n\i}) is not only due to the sample S_i but also due to other stochastic factors (e.g., the parameter initialization used when computing θ(D_n) vs. θ(D_{n\i})). However, computing the influence of sample S_i using influence functions does not account for this stochasticity in the training process. In the following, we discuss why this might be less of a concern for our use-case.

Let ξ be a noise variable that governs all the stochastic factors (e.g., parameter initialization). Let θ(D_n, ξ) be a deterministic function given a dataset D_n and ξ. Therefore, similar to (4),

L(π) = E_{D_n ∼ π, ξ}[ MSE(θ(D_n, ξ)) ].   (326)

Then, similar to (5), (7), and (9),

∇L(π) = E_{D_n ∼ π, ξ}[ MSE(θ(D_n, ξ)) Σ_i ∇ log π(Z_i|X_i) ]   (327)
= E_{D_n ∼ π, ξ}[ Σ_i ∇ log π(Z_i|X_i) ( MSE(θ(D_n, ξ)) − MSE(θ(D_{n\i}, ξ)) ) ]   (328)
≈ E_{D_n ∼ π, ξ}[ Σ_i ∇ log π(Z_i|X_i) I_MSE(S_i, ξ) ],   (329)

where I_MSE(S_i, ξ) is similar to (9) and (10), but θ is also a function of ξ. Therefore, from (328) it can be observed that, for our setting, we indeed want to compute the difference from the leave-one-out estimate, where there is no difference due to the stochasticity from ξ. In practice, as we compute a sample estimate using a single sample of D_n and ξ,

Σ_i ∇ log π(Z_i|X_i) I_MSE(S_i, ξ),   (330)

the resulting procedure is essentially the same as running the DIA algorithm while considering θ(D_n) to be deterministic. Our experimental setup uses DIA from Algorithm 1 as-is for all the domains considered. For instance, the Trip Advisor setting uses deep-learning-based estimators whose outputs are stochastic (e.g., due to parameter initialization), and the synthetic IV domain uses closed-form solutions where there is no stochasticity in the learned estimator, given the data.
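For concreteness, the sample estimate in (330) is a score-function gradient weighted by per-sample influence values (the same form as line 13 of Algorithm 2). The sketch below illustrates it under a simple softmax policy over instruments; the policy parameterization, shapes, and names are illustrative assumptions and not the paper's released code.

```python
import numpy as np

def policy_gradient_from_influences(X, Z, influences, phi, n_instruments):
    """Score-function gradient of Eq. (330): sum_i grad_phi log pi(Z_i|X_i) * I_MSE(S_i).

    Assumes a softmax-in-features policy pi_phi(z|x) with phi of shape
    (n_features, n_instruments); all names here are illustrative.
    """
    logits = X @ phi                                  # (n, n_instruments)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    grad = np.zeros_like(phi)
    for x, z, inf, p in zip(X, Z, influences, probs):
        one_hot = np.zeros(n_instruments)
        one_hot[z] = 1.0
        # grad_phi log pi(z|x) for a softmax policy is the outer product x (e_z - p)^T
        grad += inf * np.outer(x, one_hot - p)
    return grad

# Toy usage: 5 covariate dimensions, 3 instruments, per-sample influence values.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
Z = rng.integers(0, 3, size=32)
influences = rng.normal(size=32)                      # stand-ins for Get_Influences output
phi = np.zeros((5, 3))
print(policy_gradient_from_influences(X, Z, influences, phi, 3).shape)   # (5, 3)
```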
Beyond the initialization difference, Bae et al. (2022) also point out three other sources of issues with practical influence-function computation. (a) Linearization, i.e., using only first-order influences in (9). For our purpose, we formally characterize in Theorem 5 the bias due to this approximation. Further, in (325), we discussed how the second-order influence can also be partially estimated and used to obtain a better estimate of the leave-one-out difference. (b) Non-convergence of the parameters. As experiment design is most useful when sample complexity is much more expensive than compute complexity, this issue can be mitigated with longer training time. (c) Regularization when computing the inverse-Hessian-vector product. For our purpose, we use conjugate gradients with Hessian-vector products to compute the inverse-Hessian-vector product (see the discussion around (320)); in our experimental setup, we did not use any additional regularization for this computation. More importantly, the contributions of our proposed DIA framework are complementary to the solutions to the above-mentioned challenges. Therefore, any advancements that resolve these challenges can also potentially be incorporated in DIA.

H.3 PROXY MSE ALTERNATIVES

Recall from (4) that under knowledge of the oracle θ_0, the MSE can be expressed as

MSE(θ(D_k)) = (1/n) Σ_{i=1}^{n} ( f(X_i, A_i; θ_0) − f(X_i, A_i; θ(D_k)) )².   (331)

However, as we have access to neither θ_0 nor (unbiased estimates of) the outcomes f(X_i, A_i; θ_0), we cannot directly estimate (331). One option would be to decompose the MSE into bias and variance and only consider the variance component,

Var̂(θ(D_k)) = (1/n) Σ_{i=1}^{n} ( E[f(X_i, A_i; θ(D_k))] − f(X_i, A_i; θ(D_k)) )².   (332)

Optimizing only for the variance is analogous to the objective considered by most prior work (Chaloner & Verdinelli, 1995; Rainforth et al., 2023) and can also easily be used in our framework. However, instead of discarding the bias term from the MSE's decomposition, we consider the proxy MSÊ, where θ_0 is substituted with θ(D_n) (Lines 5 and 11 in Algorithm 2),

MSÊ(θ(D_k)) = (1/n) Σ_{i=1}^{n} ( f(X_i, A_i; θ(D_n)) − f(X_i, A_i; θ(D_k)) )².   (333)

Intuitively, for a consistent estimator, θ(D_n) can be considered a reasonable approximation of θ_0 when estimating the MSE of θ(D_k), where k is much smaller than n (e.g., k = o(n)). This approach is similar to MSE estimation using the sub-sampling bootstrap (Romano, 1995), and we provide formal justification for it in Theorem 8. As noted above, there are multiple choices of target quantity that can be used, and this work focuses on the one in (333). However, an important advantage of DIA is that the target is easily customizable, and one can use any desired target quantity f⋆,

MSE⋆(θ(D_k)) = (1/n) Σ_{i=1}^{n} ( f⋆(X_i, A_i) − f(X_i, A_i; θ(D_k)) )².   (334)
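The proxy in (333) is just a squared discrepancy between the sub-sample fit and the full-data fit; a minimal sketch is given below. The outcome model f and all names are illustrative placeholders (any estimator satisfying the paper's moment conditions could be plugged in), not the paper's released code.

```python
import numpy as np

def proxy_mse(f, theta_full, theta_sub, X, A):
    """Proxy MSE of Eq. (333): compare the sub-sample fit theta(D_k) against the
    full-data fit theta(D_n), which stands in for the unknown oracle theta_0."""
    return np.mean((f(X, A, theta_full) - f(X, A, theta_sub)) ** 2)

# Toy usage with a linear-in-treatment outcome model f(x, a; theta) = a*theta[0] + (1-a)*theta[1].
f = lambda X, A, th: A * th[0] + (1 - A) * th[1]
rng = np.random.default_rng(0)
X, A = rng.normal(size=100), rng.integers(0, 2, size=100)
print(proxy_mse(f, np.array([1.0, 0.5]), np.array([0.9, 0.6]), X, A))
```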
I ADDITIONAL EMPIRICAL DETAILS

To demonstrate the flexibility of the proposed approach, we use DIA across various settings, with different kinds of estimators.

IV (synthetic): For this domain, we use 2SLS-based linear regression for both the nuisance and ATE parameters, where the solutions can be obtained in closed form. The parameters φ of the instrument-sampling policy π_φ correspond to logits for each instrument. The results for this setting are reported using 100 trials.

CIV (synthetic): For this domain, we use the moment conditions from Hartford et al. (2017) and use logistic regression with linear parameters for both the nuisance and CATE parameters, and thus require gradient descent to search for the optimal parameters (no closed-form solution). Further, to account for covariates, the parameters φ of the instrument-sampling policy π_φ correspond to the parameters of a non-linear function modeled using a neural network. The results for this setting are reported using 30 trials.

Trip Advisor (semi-synthetic): For this domain, we use neural-network-based non-linear parameterizations of both θ and φ, for the CATE and the nuisance functions, respectively. The parameters φ of the instrument-sampling policy π_φ also correspond to the parameters of a non-linear function modeled using a neural network. The results for this setting are reported using 30 trials.

The policy parameters φ in all the settings are searched using the gradient expression in (17). Below, we discuss the details for each of these settings.

I.1 SYNTHETIC DOMAINS

For both the IV and the CIV settings, a random 5% of the instruments (or 1, whichever is more) had strong positive encouragement for treatment A = 0, and 5% (or 1, whichever is more) had strong positive encouragement for treatment A = 1. The strength (γ) of the other instruments was in between. Below, |Z| denotes the number of instruments. The following provides the data-generating process for the IV setting (a short simulation sketch is included at the end of this subsection):

θ_0 ∈ R², γ ∈ [0, 1]^{|Z|}, π(·) ∈ Δ(Z)   (335)
Z ∼^(a) π(·) ∈ {0, 1}^{|Z|}   (336)
U ∼ N(0, σ_u) ∈ R   (337)
p = clip(Zᵀγ + U, min = 0, max = 1) ∈ [0, 1]   (338)
A ∼ Bernoulli(p) ∈ {0, 1}   (339)
ξ ∼^(b) A N(0, σ_1) + (1 − A) N(0, σ_0) ∈ R   (340)
Y = [A, 1 − A]ᵀ θ_0 + U + ξ ∈ R,   (341)

where (a) samples from a multinomial and Z is a one-hot vector that indicates the chosen instrument, and (b) induces heteroskedastic noise in the outcomes based on the chosen treatment. For the IV setting, we make use of 2SLS, which uses linear functions in both the first and the second stages of regression. The parameters for this setting can be obtained in closed form. Additionally, the influence function can also be computed in closed form (Kahn, 2015). The instrument-sampling policy π_φ(·) is an (unconditional) distribution on Δ(Z).

The following provides the data-generating process for the CIV setting:

θ_0 ∈ R, X ∈ R^d, γ ∈ [0, 1]^{|Z|}, ∀x: π(·|x) ∈ Δ(Z)   (342)
Z ∼^(a) π(·|X) ∈ {0, 1}^{|Z|}   (343)
U ∼ N(0, σ_u) ∈ R   (344)
C =^(b) X_1 γ + (1 − X_1)(1 − γ) ∈ [0, 1]^{|Z|}   (345)
p = clip(Zᵀ C + U, min = 0, max = 1) ∈ [0, 1]   (346)
A ∼ Bernoulli(p) ∈ {0, 1}   (347)
ξ ∼^(c) A N(0, σ_1) + (1 − A) N(0, σ_0) ∈ R   (348)
Y =^(d) X_2 + A θ_0 + U + ξ ∈ R,   (349)

where (a) samples from a multinomial and Z is a one-hot vector that indicates the chosen instrument, (b) induces heteroskedastic compliance by flipping the compliance factor of the instruments based on the covariates (by construction, X_1 ∈ {0, 1}), (c) induces heteroskedastic noise in the outcomes based on the chosen treatment, and (d) induces heteroskedastic outcomes based on the covariates (by construction, X_2 ∈ R). For the CIV setting, d = 2 and we parameterize θ such that g(X, A; θ) is a linear function of (X, A). Similarly, φ is linearly parameterized to estimate the nuisance function P̂(A|X, Z; φ) using logistic regression. Parameters for both g and P̂ are obtained using gradient descent and the moment conditions in (3). The instrument-sampling policy π_φ(·|x) is modeled using a two-layered neural network with 8 hidden units.
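The following minimal simulation of the IV data-generating process (335)–(341) is included for concreteness. The specific constants (number of instruments, noise scales) and function names are illustrative defaults, not the values used in the paper's experiments.

```python
import numpy as np

def sample_iv_batch(pi, gamma, theta0, n, sigma_u=1.0, sigma0=0.1, sigma1=0.5, seed=0):
    """One batch from the synthetic IV process (335)-(341).

    pi: sampling distribution over |Z| instruments; gamma: per-instrument
    compliance strengths in [0, 1]; theta0: outcome parameters for A=1 and A=0.
    """
    rng = np.random.default_rng(seed)
    z_idx = rng.choice(len(pi), size=n, p=pi)              # (336) instrument choice
    U = rng.normal(0.0, sigma_u, size=n)                   # (337) unobserved confounder
    p = np.clip(gamma[z_idx] + U, 0.0, 1.0)                # (338) compliance probability
    A = rng.binomial(1, p)                                 # (339) chosen treatment
    xi = np.where(A == 1,                                  # (340) heteroskedastic noise
                  rng.normal(0.0, sigma1, size=n),
                  rng.normal(0.0, sigma0, size=n))
    Y = A * theta0[0] + (1 - A) * theta0[1] + U + xi       # (341) outcome
    return z_idx, A, Y

# Toy usage: 10 instruments, uniform sampling policy.
k = 10
gamma = np.linspace(0.05, 0.95, k)
z, A, Y = sample_iv_batch(np.full(k, 1.0 / k), gamma, theta0=np.array([1.0, 0.0]), n=1000)
print(z.shape, A.mean(), Y.mean())
```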
I.2 TRIP ADVISOR DOMAIN (SEMI-SYNTHETIC)

Adaptivity and learning a distribution over instruments are most relevant when there are many instruments, and our paper is developed for the setting where online sampling is feasible. Unfortunately, the simulators for real-world scenarios that we are aware of and could access publicly (e.g., Trip Advisor's (Syrgkanis et al., 2019)) only consider a single, binary instrument. These simulator domains model the setting where expert knowledge was used to find optimal instruments (e.g., a better/easier sign-up strategy to make the membership process easier). In reality, however, there could be multiple ways to encourage membership (emails, notifications, sign-up bonuses, etc.) that can all serve as instruments. Adaptivity is most relevant in such a setting, to discover an optimal strategy for instrument selection while minimizing the need for expert knowledge. Therefore, given the above considerations, we created a semi-synthetic Trip Advisor domain with varying numbers of potential instruments to simulate such settings.

The original data-generating process for the Trip Advisor domain is available on GitHub. The covariates X ∈ R^d correspond to how many times the user visited web pages of different categories (hotels, restaurants, etc.), how many times they visited the webpage through different modes (direct search, ad click, etc.), and other demographic features (geo-location, laptop operating system, etc.). The original setting only had a binary instrument. To make it resemble realistic settings that are more broadly applicable (e.g., where there is a multitude of different instruments, each corresponding to a different way to encourage a user to take up the treatment), we incorporated several weak instruments. The compliance strength of each instrument varies non-linearly with the covariates. Our semi-synthetic version is available in the supplementary material.

For the Trip Advisor setting, we parameterize θ such that g(X, A; θ) is a non-linear function of (X, A). Similarly, φ is non-linearly parameterized to estimate the nuisance function P̂(A|X, Z; φ). In both cases, the non-linear function is a two-layered neural network with 8 hidden nodes and sigmoid activation units. Parameters for both g and P̂ are obtained using gradient descent and the moment conditions in (3). The instrument-sampling policy π_φ(·|x) is modeled using a two-layered neural network with 8 hidden units.

I.3 HYPER-PARAMETERS

The choice of the sub-sample size k is a hyper-parameter in our work. We parameterize k as n^α. For our empirical results, we searched over the hyper-parameter settings α ∈ [0.5, 0.8] and report the best-performing setting.

Auto-tuning suggestion for α: As the expected number of samples that get accepted via rejection sampling from a set of n samples is n/ρ_max, it would be ideal for α to be such that n^α > n/ρ_max (a small sketch of this heuristic follows below). The empirical supremum can be used as a proxy for ρ_max. Further, rejection sampling in higher dimensions can be challenging, as ρ_max can be very large. A standard technique for density-ratio estimation (i.e., estimating the ratio ρ that governs the acceptance-rejection decision in (13)) in higher dimensions is to perform density-ratio estimation on a lower-dimensional subspace (Sugiyama et al., 2009; 2011). We believe that incorporating similar ideas into our framework can provide further improvements.
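The sketch below mirrors the stated condition, screening a candidate grid of α values against the empirical supremum of the multi-importance ratios. The candidate grid, ratio inputs, and function name are illustrative assumptions, not part of the paper's released code.

```python
import numpy as np

def alpha_candidates(rho, n, alphas=(0.5, 0.6, 0.7, 0.8)):
    """Screen candidate alphas with the Appendix I.3 heuristic n**alpha > n / rho_hat,
    where rho_hat is the empirical supremum of the multi-importance ratios.

    `rho` holds per-sample ratios p(x_i) / ((1/k) * sum_j q_j(x_i)).
    """
    rho_hat = np.max(rho)                      # empirical supremum as a proxy for rho_max
    expected_accepted = n / rho_hat            # expected yield of rejection sampling
    return [a for a in alphas if n ** a > expected_accepted]

# Toy usage: 10,000 samples with mildly varying importance ratios.
rng = np.random.default_rng(0)
rho = np.exp(rng.normal(0.0, 0.3, size=10_000))
print(alpha_candidates(rho, n=10_000))
```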
I.4 DETAILS OF OTHER PLOTS

Figure 1: This figure compares conventional experiment design (which assumes that there is no confounding between the group (treatment/control) the user was recommended to be a part of (i.e., Z) and the group the user actually became a part of (i.e., A)) with the proposed method DIA, on the following data-generating process (a code sketch is given at the end of this section):

θ_0 ∈ R², π(·) ∈ Δ(Z)   (350)
Z ∼ π(·) ∈ {0, 1}   (351)
U ∼ N(0, σ_u) ∈ R   (352)
p = clip(Z + U, min = 0, max = 1) ∈ [0, 1]   (353)
A ∼ Bernoulli(p) ∈ {0, 1}   (354)
ξ ∼^(b) A N(0, σ_1) + (1 − A) N(0, σ_0) ∈ R   (355)
Y = [A, 1 − A]ᵀ θ_0 + U + ξ ∈ R.   (356)

The confounder strength corresponds to σ_u. Observe that when σ_u = 0, the binary treatment's value is the same as the binary instrument's value, i.e., A = Z.

Figure 3: For the illustrative plots regarding the bias and variance trade-offs of different gradient estimators and sampling strategies, we make use of the following data-generating process, where the treatment A ∈ R is real-valued, the outcome Y is quadratic in A, and |Z| = 2:

θ_0 ∈ R, γ ∈ [0, 1]^{|Z|}, π(·) ∈ Δ(Z)   (357)
Z ∼ π(·) ∈ {0, 1}^{|Z|}   (358)
U ∼ N(0, σ_u) ∈ R   (359)
A = Zᵀγ + U + N(0, σ_a) ∈ R   (360)
Y = A² θ_0 + U + N(0, σ_y) ∈ R.   (361)

This figure illustrates the bias and variance of different gradient estimators when the function approximator is mis-specified, i.e., Y is modeled as a linear function of the treatment A although it is actually a quadratic function of the treatment. We use the standard 2SLS estimator for this setting.

Figure 4: This domain also uses a 2SLS estimator and a binary instrument, with the data-generating process in (357)–(361). (Left) Illustration of how the policy orderings are preserved when using different values of α in k = n^α. Here π_0(1) = 0.2, π_1(1) = 0.5, and π_2(1) = 0.8, while the data collection policy is β(1) = 0.5. (Right) Variance of evaluating the MSE under different sampling strategies. The data collection policy is β(1) = 0.5 and the target policy is π(1) = 0.6. The sub-sample size is k = o(n) = n^0.75. To better visualize the difference in the variance for small sample sizes, we have also included a log-scale version of the plot in Figure 8.

Figure 8: Log-scale version of Figure 4 to better visualize the difference in variances at smaller values of n.
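The following minimal simulation of the binary-instrument process (350)–(356) used for Figure 1 is included for concreteness. The noise scales, defaults, and function name are illustrative assumptions, not the values used to generate the paper's figures.

```python
import numpy as np

def sample_figure1_batch(pi_z1, theta0, sigma_u, n, sigma0=0.1, sigma1=0.5, seed=0):
    """One batch from the binary-instrument process (350)-(356).

    pi_z1 is the probability of sampling instrument Z=1; sigma_u is the
    confounder strength.
    """
    rng = np.random.default_rng(seed)
    Z = rng.binomial(1, pi_z1, size=n)                    # (351) instrument
    U = rng.normal(0.0, sigma_u, size=n)                  # (352) confounder
    p = np.clip(Z + U, 0.0, 1.0)                          # (353) compliance probability
    A = rng.binomial(1, p)                                # (354) treatment uptake
    xi = np.where(A == 1,
                  rng.normal(0.0, sigma1, size=n),
                  rng.normal(0.0, sigma0, size=n))        # (355) heteroskedastic noise
    Y = A * theta0[0] + (1 - A) * theta0[1] + U + xi      # (356) outcome
    return Z, A, Y

# With sigma_u = 0 the instrument is always complied with (A = Z), i.e., an RCT.
Z, A, Y = sample_figure1_batch(pi_z1=0.5, theta0=np.array([1.0, 0.0]), sigma_u=0.0, n=1000)
print((A == Z).all())   # True when the confounder strength is zero
```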