# Post-Contextual-Bandit Inference

Aurélien Bibaut (Netflix), abibaut@netflix.com; Maria Dimakopoulou (Netflix), mdimakopoulou@netflix.com; Nathan Kallus (Cornell University and Netflix), kallus@cornell.edu; Antoine Chambaz (Université de Paris), antoine.chambaz@u-paris.fr; Mark van der Laan (University of California, Berkeley), laan@stat.berkeley.edu

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract. Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking because they can both improve outcomes for study participants and increase the chance of identifying good or even best policies. To support credible inference on novel interventions at the end of the study, however, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or the value of new policies. The adaptive nature of the data collected by contextual bandit algorithms makes this difficult: standard estimators are no longer asymptotically normally distributed and classic confidence intervals fail to provide correct coverage. While this has been addressed in non-contextual settings by using stabilized estimators, variance-stabilized estimators in the contextual setting pose unique challenges that we tackle for the first time in this paper. We propose the Contextual Adaptive Doubly Robust (CADR) estimator, a novel estimator for policy value that is asymptotically normal under contextual adaptive data collection. The main technical challenge in constructing CADR is designing adaptive and consistent conditional standard deviation estimators for stabilization. Extensive numerical experiments using 57 OpenML datasets demonstrate that confidence intervals based on CADR uniquely provide correct coverage.

1 Introduction

Contextual bandits, where personalized decisions are made sequentially and simultaneously with data collection, are increasingly used to address important decision-making problems where data is limited and/or expensive to collect, with applications in product recommendation [Li et al., 2010], revenue management [Kallus and Udell, 2020, Qiang and Bayati, 2016], and personalized medicine [Tewari and Murphy, 2017]. Adaptive experiments, whether based on bandit algorithms or Bayesian optimization, are increasingly being considered in place of classic randomized trials in order to improve both the outcomes for study participants and the chance of identifying the best treatment allocations [Athey et al., 2018, Quinn et al., 2019, Kasy and Sautmann, 2021, Bakshy et al., 2018]. But, at the end of the study, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or the value of new personalized interventions. Such confidence intervals are, for example, crucial for enabling credible inference on the presence or absence of improvement under novel policies. However, due to the adaptive nature of the data collection, unlike in classic randomized trials, standard estimates and their confidence intervals fail to provide correct coverage, that is, to contain the true parameter with the desired confidence probability (e.g., 95%). A variety of recent work has recognized this and offered remedies [Hadad et al., 2019, Luedtke and van der Laan, 2016], but only for the case of non-contextual adaptive data collection.
Like classic confidence intervals, these intervals also fail to provide correct coverage when the data come from a contextual bandit or any other context-dependent adaptive data collection. In this paper, we propose a novel asymptotically normal estimator for the value of a (possibly contextual) policy from context-dependent adaptively collected data. This asymptotic normality leads directly to the construction of valid confidence intervals. Our estimator takes the form of a stabilized doubly robust estimator, that is, a weighted time average of estimates of the so-called canonical gradient using plug-in estimators for the outcome model, where each time point is inversely weighted by its estimated conditional standard deviation given the past. We term this the Contextual Adaptive Doubly Robust (CADR) estimator. We show that, given consistent conditional variance estimates that at each time point depend only on previous data, the CADR estimator is asymptotically normal, and as a result we can easily construct asymptotically valid confidence intervals. This normality is in fact robust to misspecification of the outcome model. A significant technical challenge is actually constructing such variance estimators. We resolve this using an adaptive variance estimator based on the importance-sampling ratio of current to past (adaptive) policies at each time point. We also show that we can reliably estimate outcome models from the adaptively collected data so that we can plug them in. Extensive experiments using 57 OpenML datasets demonstrate the failure of previous approaches and the success of ours at constructing confidence intervals with correct coverage.

1.1 Problem Statement and Notation

The data. Our data consist of a sequence of observations indexed $t = 1, \dots, T$, each comprising a context $X(t) \in \mathcal{X}$, an action $A(t) \in \mathcal{A}$, and an outcome $Y(t) \in \mathcal{Y} \subseteq \mathbb{R}$, generated by an adaptive experiment such as a contextual bandit algorithm. Roughly, at each round $t = 1, 2, \dots, T$, an agent formed a contextual policy $g_t(a \mid x)$ based on all past observations, then observed an independently drawn context vector $X(t) \sim Q_{0,X}$, carried out an action $A(t)$ drawn from its current policy $g_t(\cdot \mid X(t))$, and observed an outcome $Y(t) \sim Q_{0,Y}(\cdot \mid A(t), X(t))$ depending only on the present context and action. The context and action spaces $\mathcal{X}, \mathcal{A}$ are arbitrary measurable spaces, e.g., finite or continuous. We assume outcomes are bounded and, without loss of generality, take $\mathcal{Y} = [-1, 1]$. More formally, we let $O(t) := (X(t), A(t), Y(t))$ and make the following assumptions about the sequence $O(1), \dots, O(T)$ comprising our dataset. First, we assume $X(t)$ is independent of $O(1), \dots, O(t-1)$ and has a time-independent marginal distribution that we denote by $Q_{0,X}$. Second, we assume $A(t)$ is independent of all else given $O(1), \dots, O(t-1), X(t)$, and we let $g_t(\cdot \mid X(t))$ denote its (random) conditional distribution given $O(1), \dots, O(t-1), X(t)$. Third, we assume $Y(t)$ is independent of all else given $X(t), A(t)$ and has a time-independent conditional distribution given $X(t) = x, A(t) = a$ that we denote by $Q_{0,Y}(\cdot \mid a, x)$. The distributions $Q_{0,X}$ and $Q_{0,Y}$ are unknown, while the policies $g_t(a \mid x)$ are known, as would be the case when running an adaptive experiment. To simplify presentation, we endow $\mathcal{A}$ with a base measure $\mu_A$ (e.g., the counting measure for finite actions or the Lebesgue measure for continuous actions) and identify policies $g_t$ with conditional densities with respect to (w.r.t.) $\mu_A$. In the case of $K < \infty$ actions, policies are maps from $\mathcal{X}$ to the $K$-simplex.
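To make the data-collection protocol above concrete, here is a minimal simulation sketch in Python. It is not the paper's experimental code: the exploration schedule, the reward function, and the deliberately crude per-arm-mean policy (`epsilon_schedule`, `reward_fn`, `epsilon_greedy_policy`) are hypothetical placeholders. The point is only the protocol: $g_t$ is formed from the past, $X(t)$ is drawn i.i.d., $A(t)$ is drawn from $g_t(\cdot \mid X(t))$, $Y(t)$ depends only on the current context and action, and the propensities are logged.

```python
import numpy as np

def epsilon_greedy_policy(history, x, K, eps):
    """g_t(. | x): a crude illustration that exploits per-arm sample means fit on
    past data and explores uniformly with probability eps. A real contextual
    bandit would instead fit an outcome model on (context, action, outcome)."""
    probs = np.full(K, eps / K)
    means = np.zeros(K)
    for arm in range(K):
        ys = [y for (_, a, y, _) in history if a == arm]
        if ys:
            means[arm] = np.mean(ys)
    probs[int(np.argmax(means))] += 1.0 - eps
    return probs

def collect_adaptive_data(T, K, d, epsilon_schedule, reward_fn, seed=0):
    """At each round t: form g_t from the past, draw X(t) i.i.d., draw
    A(t) ~ g_t(.|X(t)), observe Y(t) depending only on (X(t), A(t)),
    and log the propensities along with O(t)."""
    rng = np.random.default_rng(seed)
    history = []  # O(1), ..., O(T) stored as (x, a, y, propensity vector of g_t at x)
    for t in range(1, T + 1):
        x = rng.normal(size=d)                       # X(t) ~ Q_{0,X}
        g_t = epsilon_greedy_policy(history, x, K, epsilon_schedule(t))
        a = int(rng.choice(K, p=g_t))                # A(t) ~ g_t(. | X(t))
        y = reward_fn(x, a) + rng.normal()           # Y(t) ~ Q_{0,Y}(. | A(t), X(t))
        history.append((x, a, y, g_t))
    return history

# Example usage with hypothetical choices of exploration schedule and reward function:
data = collect_adaptive_data(
    T=1000, K=3, d=5,
    epsilon_schedule=lambda t: max(0.05, t ** (-1 / 3)),
    reward_fn=lambda x, a: float(x[a] > 0),
)
```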
Note that, as the agent updates its policy based on already collected observations, $g_t$ is a random, $O(1), \dots, O(t-1)$-measurable object. This is the major departure from the setting considered in most of the literature on off-policy evaluation, which considers a fixed logging policy, $g_t = g$, independent of the data. See Section 1.2.

The target parameter. We are interested in inference on a generalized average causal effect expressed as a functional of the unknown distributions above, $\Psi_0 = \Psi(Q_{0,X}, Q_{0,Y})$, where for any distributions $Q_X, Q_Y$ we define
$$\Psi(Q_X, Q_Y) := \int y \, Q_X(dx)\, g^\star(a \mid x)\, d\mu_A(a)\, Q_Y(dy \mid a, x),$$
where $g^\star(a \mid x) : \mathcal{A} \times \mathcal{X} \to [-G, G]$ is a given fixed, bounded function. Two examples are: (a) when $g^\star$ is a policy (conditional density), then $\Psi_0$ is its value; (b) when $g^\star$ is the difference between two policies, then $\Psi_0$ is the difference between their values. A prominent example of the latter is when $\mathcal{A} = \{+1, -1\}$ and $g^\star(a \mid x) = a$, in which case $\Psi_0$ is known as the average treatment effect. If we additionally include an indicator for $x$ being in some set, then we get the subgroup effect. Defining the conditional mean outcome
$$\bar{Q}_0(a, x) := E_{Q_{0,Y}(\cdot \mid a, x)}[Y] = \int y\, Q_{0,Y}(dy \mid a, x),$$
we note that the target parameter only depends on $Q_{0,Y}$ via $\bar{Q}_0$, so we also overload notation and write $\Psi(Q_X, \bar{Q}) = \int \bar{Q}(a, x)\, Q_X(dx)\, g^\star(a \mid x)\, d\mu_A(a)$ for any function $\bar{Q} : \mathcal{A} \times \mathcal{X} \to \mathcal{Y}$. Note that when $|\mathcal{A}| < \infty$ and $\mu_A$ is the counting measure, the integral over $a$ is a simple sum.

Canonical gradient. We will make repeated use of the following function: for any conditional density $(a, x) \mapsto g(a \mid x)$, any probability distribution $Q_X$ over the context space $\mathcal{X}$, and any function $\bar{Q} : \mathcal{A} \times \mathcal{X} \to \mathbb{R}$, we define the function $D^\star(g, \bar{Q}) : \mathcal{O} \to \mathbb{R}$ by
$$D^\star(g, \bar{Q})(x, a, y) := \frac{g^\star(a \mid x)}{g(a \mid x)}\bigl(y - \bar{Q}(a, x)\bigr) + \int \bar{Q}(a', x)\, g^\star(a' \mid x)\, d\mu_A(a').$$
Further, define $D(g, Q_X, \bar{Q}) := D^\star(g, \bar{Q}) - \Psi(Q_X, \bar{Q})$, which coincides with the so-called canonical gradient of the target parameter $\Psi$ w.r.t. the usual nonparametric statistical model comprising all joint distributions over $\mathcal{O}$ [van der Vaart, 2000, van der Laan and Robins, 2003].

Integration operator notation. For any policy $g$ and distributions $Q_X, Q_Y$, denote by $P_{Q,g}$ the induced distribution on $\mathcal{O} := \mathcal{X} \times \mathcal{A} \times \mathcal{Y}$. For any function $f : \mathcal{O} \to \mathbb{R}$, we use the integration operator notation
$$P_{Q,g} f = \int f(x, a, y)\, Q_X(dx)\, g(a \mid x)\, d\mu_A(a)\, Q_Y(dy \mid a, x),$$
that is, the expectation w.r.t. $P_{Q,g}$ alone. Then, for example, for any $O(1), \dots, O(s-1)$-measurable random function $f : \mathcal{O} \to \mathbb{R}$, we have that $P_{Q_0, g_s} f = E_{Q_0, g_s}[f(O(s)) \mid O(1), \dots, O(s-1)]$.

1.2 Related Literature and Challenges for Post-Contextual-Bandit Inference

Off-policy evaluation. In non-adaptive settings, where $g_t = g$ is fixed and does not depend on previous observations, common off-the-shelf estimators for the value of $g^\star$ include the Inverse Propensity Scoring (IPS) estimator [Beygelzimer and Langford, 2009, Li et al., 2011] and the Doubly Robust (DR) estimator [Dudík et al., 2011, Robins et al., 1994]:
$$\hat{\Psi}_{\mathrm{IPS}} := \frac{1}{T} \sum_{t=1}^{T} D^\star(g, 0)(O(t)), \qquad \hat{\Psi}_{\mathrm{DR}} := \frac{1}{T} \sum_{t=1}^{T} D^\star(g, \hat{Q})(O(t)),$$
where $\hat{Q}$ is an estimator of the outcome model $\bar{Q}_0(a, x)$. If we use cross-fitting to estimate $\hat{Q}$ [Chernozhukov et al., 2018], then both the IPS and DR estimators are unbiased and asymptotically normal, permitting straightforward inference using Wald confidence intervals (i.e., the estimate plus or minus 1.96 estimated standard errors). There also exist many variants of the IPS and DR estimators that, rather than plugging in the importance sampling (IS) ratios $(g^\star/g_t)(A(t) \mid X(t))$ and/or outcome-model estimators, instead choose them directly with the aim of minimizing error [e.g.,
Kallus, 2018, Farajtabar et al., 2018, Thomas and Brunskill, 2016, Wang et al., 2017, Kallus and Uehara, 2019].

Inference challenges in adaptive settings. In the adaptive setting, it is easy to see that, if in the $t$-th term for DR we use an outcome model $\hat{Q}_{t-1}$ fit using only the observations $O(1), \dots, O(t-1)$, then both the IPS and DR estimators remain unbiased. However, neither generally converges to a normal distribution. One key difference between the non-adaptive and adaptive settings is that the IS ratios $(g^\star/g_t)(A(t) \mid X(t))$ can either diverge to infinity or converge to zero. As a result, the above two estimators may be dominated by either their first terms or their last terms. At a more theoretical level, this violates the classical condition of martingale central limit theorems that the conditional variance of the terms given previous observations stabilizes asymptotically.

Stabilized estimators. The issue for inference due to instability of the DR estimator terms was recognized by Luedtke and van der Laan [2016] in another setting. They work in the non-adaptive setting but consider the problem of inferring the maximum mean outcome over all policies when the optimal policy is non-unique. Their proposal is a so-called stabilized estimator, in which each term is inversely weighted by an estimate of its conditional standard deviation given the previous terms. This stabilization trick has also been reused for off-policy inference from non-contextual bandit data by Hadad et al. [2019], as the stabilized estimator remains asymptotically normal, permitting inference. In their non-contextual setting, an estimate of the conditional standard deviation of the terms can easily be obtained from the inverse square-root propensities. In contrast, in our contextual setting, obtaining valid stabilization weights is more challenging and requires a construction involving adaptive training on past data. Zhang et al. [2021] use weights similar to Hadad et al. [2019] in order to stabilize an M-estimator fit to adaptively collected data for the purpose of inference on a true model for $\bar{Q}_0$. While they do consider a contextual setting, their method requires correct specification, else the inferential target depends on the logging policies. We focus only on off-policy evaluation but do not require well-specified models.

Concurrent work of Zhan et al. [2021]. The paper of Zhan et al. [2021], which was published after the submission of our paper, is also motivated by establishing an unbiased and asymptotically normal estimator from data collected by a contextual bandit. Both our method and theirs rely on adaptively weighted DR scores and yield, under certain conditions, asymptotically normal estimators. However, the way they construct their adaptive weights is quite different from ours. In particular, their adaptive weights at time $t$ (a) do not incorporate or rely on an outcome model and (b) depend only on the known propensity score, which is plugged in. As a result of these different constructions, our analysis and theirs require different sets of conditions. While we require slightly more stringent conditions on the exploration rate and the complexity of the logging policy class than they do (our Assumptions 5 and 6), they require some form of asymptotic stabilization of the logging policies (their conditions (8) and (12)).
An instance where the logging policy does not stabilize is, for example, when various actions have the same expected reward on certain subsets of the context space. Our approach in this paper is to use variance-stabilization weights in the contextual bandit setting, just like Hadad et al. [2019] do in the non-contextual bandit setting and Luedtke and van der Laan [2016] do in an i.i.d. setting with multiple optimal policies. Although Zhan et al. [2021] also tackle the contextual bandit setting, they do not use variance-stabilization weights; correspondingly, their conditions and results apply in different, complementary settings. The key challenge we tackle in applying variance-stabilization weights in the contextual bandit setting is dealing with the need to use sequentially estimated weights.

1.3 Contributions

In this paper, we construct and analyze a stabilized estimator for policy evaluation from context-dependent adaptively collected data, such as the result of running a contextual bandit algorithm. This then immediately enables inference. After constructing a generic extension of the stabilization trick, the main technical challenge is to construct a sequence of estimators $\hat\sigma_1, \dots, \hat\sigma_T$ of the conditional standard deviations that are both consistent and such that, for each $t$, $\hat\sigma_t$ only uses the previous data points $O(1), \dots, O(t-1)$. We show in extensive experiments across a large set of contextual bandit environments that our confidence intervals uniquely achieve close to nominal coverage.

2 Construction and Analysis of the Generic Contextual Stabilized Estimator

In this section, we give a generic construction of a stabilized estimator in our contextual and adaptive setting, given generic plug-ins for the outcome model and the conditional standard deviation. We then provide conditions under which the estimator is asymptotically normal, as desired. To develop CADR, we will then construct appropriate plug-in estimators in the subsequent sections.

2.1 Construction of the Estimator

Outcome and variance estimators. Our estimator uses a sequence $(\hat{Q}_t)_{t \ge 1}$ of estimators of the outcome model $\bar{Q}_0$ such that, for every $t$, $\hat{Q}_t$ is $O(1), \dots, O(t)$-measurable, that is, trained using only the data up to time $t$. A key ingredient of our estimator is estimates of the conditional standard deviation of the canonical gradient. Define
$$\sigma_{0,t} := \sigma_{0,t}(g_t), \quad \text{where} \quad \sigma^2_{0,t}(g) := \mathrm{Var}_{Q_0, g}\bigl(D^\star(g, \hat{Q}_{t-1})(O(t)) \mid O(1), \dots, O(t-1)\bigr).$$
Let $(\hat\sigma_t)_{t \ge 1}$ be a given sequence of estimates of $\sigma_{0,t}$ such that $\hat\sigma_t$ is $O(1), \dots, O(t-1)$-measurable, that is, estimated using only the data up to time $t-1$.

The generic form of the estimator. The generic contextual stabilized estimator is then defined as
$$\hat{\Psi}_T := \Bigl(\sum_{t=1}^{T} \hat\sigma_t^{-1}\Bigr)^{-1} \sum_{t=1}^{T} \hat\sigma_t^{-1}\, D^\star(g_t, \hat{Q}_{t-1})(O(t)). \tag{1}$$

2.2 Asymptotic Normality Guarantees

We next characterize the asymptotic distribution of $\hat{\Psi}_T$ under some assumptions.

Assumption 1 (Non-degenerate efficiency bound). $\inf_g P_{Q_0, g}\, D^2(g, Q_{0,X}, \bar{Q}_0) > 0$.

Assumption 1 states that there is no fixed logging policy $g$ such that the efficiency bound for estimation of $\Psi(Q_{0,X}, \bar{Q}_0)$ in the nonparametric model, from i.i.d. draws of $P_{Q_0, g}$, is zero. If Assumption 1 does not hold, there exists a logging policy $g$ such that, if $O = (X, A, Y) \sim P_{Q_0, g}$, then $(g^\star(A \mid X)/g(A \mid X))\,Y$ equals $\Psi(Q_{0,X}, \bar{Q}_0)$ with probability 1. In other words, if Assumption 1 does not hold, there exists a logging policy $g$ such that $\Psi(Q_{0,X}, \bar{Q}_0)$ can be estimated with no error with probability 1 from a single draw of $P_{Q_0, g}$.
Thus, it is very lax. An easy sufficient condition for Assumption 1 is that the outcome model has nontrivial variance in that $\mathrm{Var}_{Q_{0,X}}\bigl(\int \bar{Q}_0(a, X)\, g^\star(a \mid X)\, d\mu_A(a)\bigr) > 0$.

Assumption 2 (Consistent standard deviation estimators). $\hat\sigma_t - \sigma_{0,t} \to 0$ almost surely as $t \to \infty$.

In the next section we will construct specific estimators $\hat\sigma_t$ that satisfy Assumption 2, leading to our proposed CADR estimator and confidence intervals.

Assumption 3 (Exploration rate). $\sup_{a \in \mathcal{A},\, x \in \mathcal{X}} g_t^{-1}(a \mid x) = o(t^{1/2})$, that is, the smallest propensity decays more slowly than $t^{-1/2}$.

Assumption 3 requires that the exploration rate of the adaptive experiment does not decay too quickly. Crucially, Assumption 3 requires that we have nontrivial propensities of pulling any one arm in any one context, conditioned on the history. While this may be satisfied by inherently randomized algorithms such as $\epsilon$-greedy and Thompson sampling, it is not satisfied by algorithms such as UCB, which are deterministic once given the history. This limitation is the same as for Hadad et al. [2019], Zhan et al. [2021].

Remark 1. One can weaken Assumption 3 to requiring only $\omega(t^{-1})$ decay of the propensities by instead slightly strengthening Assumption 1 to require that $\sigma^2_{0,t} \gtrsim \sup_{a \in \mathcal{A}, x \in \mathcal{X}} g_t^{-1}(a \mid x)$, which would be implied by $g^\star$ placing some mass on actions with decaying propensity and $Y$ having everywhere-nontrivial conditional variance. Here $a_t \gtrsim b_t$ means that for some constant $c > 0$ we have $a_t \ge c\, b_t$ for all $t \ge 1$. However, since we will in fact strengthen Assumption 3 anyway in Assumption 6 below, which we use in constructing estimators $\hat\sigma_t$ that satisfy Assumption 2, we do not discuss this potential weakening of Assumption 3 further.

Based on these assumptions, we have the following asymptotic normality result:

Theorem 1. Denote $\Gamma_T := \bigl(T^{-1} \sum_{t=1}^{T} \hat\sigma_t^{-1}\bigr)^{-1}$. Under Assumptions 1 to 3, it holds that
$$\sqrt{T}\, \Gamma_T^{-1} \bigl(\hat{\Psi}_T - \Psi_0\bigr) \xrightarrow{d} N(0, 1).$$

Remark 2. Theorem 1 does not require the outcome model estimator to converge at all. As we will see in Section 3, our conditional variance estimator does require that the outcome model converge to a fixed limit $\bar{Q}_1$, but this limit does not have to be the true outcome model $\bar{Q}_0$. In other words, consistency of the outcome model is not required at any point of our analysis.

3 Construction of the Conditional Variance Estimator and CADR

We now tackle the construction of $\hat\sigma_t$ satisfying our assumptions; namely, the estimators must be adaptively trained only on past data at each $t$ and they must be consistent. Observe that $\sigma^2_{0,t} = \sigma^2_0(g_t, \hat{Q}_{t-1})$, where we define
$$\sigma^2_0(g, \bar{Q}) := \Phi_{0,2}(g, \bar{Q}) - \bigl(\Phi_{0,1}(g, \bar{Q})\bigr)^2, \qquad \Phi_{0,i}(g, \bar{Q}) := P_{Q_0, g} (D^\star)^i(g, \bar{Q}), \quad i = 1, 2.$$

Algorithm 1: The CADR Estimator and Confidence Interval
Input: data $O(1), \dots, O(T)$; policies $g_1, \dots, g_T$; target $g^\star$; an outcome regression estimator.
For $t = 1, 2, \dots, T$:
  1. Train $\hat{Q}_{t-1}$ on $O(1), \dots, O(t-1)$ using the outcome regression estimator.
  2. Set $D^\star_{t,s} = D^\star(g_s, \hat{Q}_{t-1})(O(t))$ for $s = t, \dots, T$ (note the index order compared to the next step).
  3. Set $\hat\sigma^2_t = \frac{1}{t-1} \sum_{s=1}^{t-1} \frac{g_t(A(s) \mid X(s))}{g_s(A(s) \mid X(s))} (D^\star_{s,t})^2 - \Bigl(\frac{1}{t-1} \sum_{s=1}^{t-1} \frac{g_t(A(s) \mid X(s))}{g_s(A(s) \mid X(s))} D^\star_{s,t}\Bigr)^2$.
Set $\Gamma_T = \bigl(\frac{1}{T} \sum_{t=1}^{T} \hat\sigma_t^{-1}\bigr)^{-1}$.
Return the estimate $\hat{\Psi}_T = \frac{\Gamma_T}{T} \sum_{t=1}^{T} \hat\sigma_t^{-1} D^\star_{t,t}$ and the confidence interval $\mathrm{CI}_\alpha = \bigl[\hat{\Psi}_T \pm \zeta_{1-\alpha/2}\, \Gamma_T/\sqrt{T}\bigr]$.

Designing an $O(1), \dots, O(t-1)$-measurable estimator of $\sigma^2_{0,t}$ presents several challenges. First, while we can only use observations $O(1), \dots, O(t-1)$ to estimate it, $\sigma^2_{0,t}$ is defined as a function of integrals w.r.t. $P_{Q_0, g_t}$, from which we have only one observation, namely $O(t)$. Second, our estimation target $\sigma^2_{0,t} = \sigma^2_0(g_t, \hat{Q}_{t-1})$ is random, as it depends on $g_t$ and $\hat{Q}_{t-1}$.
Third, $g_t$ and $\hat{Q}_{t-1}$ depend on the same observations $O(1), \dots, O(t-1)$ that we have at our disposal to estimate $\sigma^2_{0,t}$.

Representation via importance sampling. We can overcome the first difficulty via importance sampling, which allows us to write $\Phi_{0,i}(g, \bar{Q})$, $i = 1, 2$, as integrals w.r.t. $P_{Q_0, g_s}$, $s = 1, \dots, t-1$, i.e., the conditional distributions of the observations $O(s)$, $s = 1, \dots, t-1$, given their respective pasts. Namely, for any $s \ge 1$ and $i = 1, 2$, we have that
$$\Phi_{0,i}(g, \bar{Q}) = P_{Q_0, g_s} \frac{g}{g_s} (D^\star)^i(g, \bar{Q}). \tag{2}$$

Dealing with the randomness of the estimation target. We now turn to the second challenge. Since $\sigma^2_{0,t}$ can be written in terms of $\Phi_{0,i}(g_t, \hat{Q}_{t-1})$ for $i = 1, 2$, Eq. (2) suggests an approach based on sample averages of $(g_t/g_s)(D^\star)^i(g_t, \hat{Q}_{t-1})$ over $s$. However, whenever $s < t$, the latter is an $O(1), \dots, O(t-1)$-measurable function due to its dependence on $g_t$ and $\hat{Q}_{t-1}$. Namely, $P_{Q_0, g_s}\{(g_t/g_s)(D^\star)^i(g_t, \hat{Q}_{t-1})\}$ does not in general coincide with the conditional expectation $E_{Q_0, g_s}[((g_t/g_s)(D^\star)^i(g_t, \hat{Q}_{t-1}))(O(s)) \mid O(1), \dots, O(s-1)]$, as would arise from a sample average. We now look at solutions to overcome this difficulty, considering first $\hat{Q}_{t-1}$ and then $g_t$.

Dealing with the randomness of $\hat{Q}_{t-1}$. We propose an estimator of $\sigma^2_0(g, \hat{Q}_{t-1})$ for any fixed $g$. While requiring that $\hat{Q}_{t-1}$ converge to the true outcome regression function $\bar{Q}_0$ would be a strong requirement, most reasonable estimators will at least converge to some fixed limit $\bar{Q}_1$. As a result, under an appropriate stochastic convergence condition on $(\hat{Q}_{t-1})_{t \ge 1}$, $\Phi_{0,i}(g, \hat{Q}_{t-1})$ can be reasonably approximated by the corresponding Cesàro averages, defined for $i = 1, 2$ as
$$\Phi_{0,i,t}(g) := \frac{1}{t-1} \sum_{s=1}^{t-1} \Phi_{0,i}(g, \hat{Q}_{s-1}) = \frac{1}{t-1} \sum_{s=1}^{t-1} E_{Q_0, g_s}\Bigl[\bigl((g/g_s)(D^\star)^i(g, \hat{Q}_{s-1})\bigr)(O(s)) \,\Big|\, O(1), \dots, O(s-1)\Bigr].$$
These are easy to estimate by the corresponding sample averages, defined for $i = 1, 2$ as
$$\hat{\Phi}_{i,t}(g) := \frac{1}{t-1} \sum_{s=1}^{t-1} \bigl((g/g_s)(D^\star)^i(g, \hat{Q}_{s-1})\bigr)(O(s)),$$
since for each $i = 1, 2$ the difference $\hat{\Phi}_{i,t}(g) - \Phi_{0,i,t}(g)$ is the average of a martingale difference sequence (MDS). We then define our estimator of $\sigma^2_0(g, \hat{Q}_{t-1})$ as
$$\hat\sigma^2_t(g) := \hat{\Phi}_{2,t}(g) - \bigl(\hat{\Phi}_{1,t}(g)\bigr)^2. \tag{3}$$

From fixed $g$ to random $g_t$. So far, we have proposed and justified the construction of $\hat\sigma^2_t(g)$ as an estimator of $\sigma^2_0(g, \hat{Q}_{t-1})$ for a fixed $g$. We now discuss conditions under which $\hat\sigma^2_t(g_t)$ is a valid estimator of $\sigma^2_0(g_t, \hat{Q}_{t-1})$. When $g$ is fixed, for each $i = 1, 2$, the error $\hat{\Phi}_{i,t}(g) - \Phi_{0,i}(g, \hat{Q}_{t-1})$ decomposes as the sum of the MDS average $\hat{\Phi}_{i,t}(g) - \Phi_{0,i,t}(g)$ and the Cesàro approximation error $\Phi_{0,i,t}(g) - \Phi_{0,i}(g, \hat{Q}_{t-1})$. Both differences are straightforward to bound. For a random $g_t$, the term $\hat{\Phi}_{i,t}(g_t) - \Phi_{0,i,t}(g_t)$ is no longer an MDS average. Fortunately, under a complexity condition on the logging policy class $\mathcal{G}$, we can bound the supremum of the martingale empirical process $\{|\hat{\Phi}_{i,t}(g) - \Phi_{0,i,t}(g)| : g \in \mathcal{G}\}$, which in turn gives us a bound on $|\hat{\Phi}_{i,t}(g_t) - \Phi_{0,i,t}(g_t)|$.

Consistency guarantee for $\hat\sigma^2_t$. Our formal consistency result relies on the following assumptions.

Assumption 4 (Outcome regression estimator convergence). There exist $\beta > 0$ and a fixed function $\bar{Q}_1 : \mathcal{A} \times \mathcal{X} \to \mathbb{R}$ such that $\|\hat{Q}_t - \bar{Q}_1\|_{1, Q_{0,X}, g^\star} = O(t^{-\beta})$ almost surely.

The next assumption is a bound on the bracketing entropy (see, e.g., van der Vaart and Wellner [1996] for the definition) of the logging policy class.

Assumption 5 (Complexity of the logging policy class).
There exist a fixed reference function $g_{\mathrm{ref}}(a \mid x)$, a class of conditional densities $\mathcal{G}$, and some $p > 0$ such that $g_t \in \mathcal{G}$ for all $t \ge 1$ almost surely, $\sup_{g \in \mathcal{G}} \|g/g_{\mathrm{ref}}\|_\infty \le G$, $\|g^\star/g_{\mathrm{ref}}\|_\infty \le G$, and $\log N_{[\,]}(\epsilon, \mathcal{G}/g_{\mathrm{ref}}, \|\cdot\|_{2, Q_{0,X}, g_{\mathrm{ref}}}) \lesssim \epsilon^{-p}$, where $\mathcal{G}/g_{\mathrm{ref}} := \{g/g_{\mathrm{ref}} : g \in \mathcal{G}\}$.

Next, we require a condition on the exploration rate that is stronger than Assumption 3.

Assumption 6 (Exploration rate (stronger)). $\sup_{a \in \mathcal{A},\, x \in \mathcal{X}} g_{\mathrm{ref}}(a \mid x)/g_t(a \mid x) \lesssim t^{\alpha(\beta, p)}$, where $\alpha(\beta, p) := \min\bigl(1/(3 + p),\, 1/(1 + 2p),\, \beta\bigr)$.

Theorem 2. Suppose that Assumptions 4 to 6 hold. Then $\hat\sigma^2_t - \sigma^2_{0,t} = o(1)$ almost surely.

Remark 3. While we theoretically require the existence of a logging policy class $\mathcal{G}$ with controlled complexity, we do not actually need to know $\mathcal{G}$ to construct our estimator. Moreover, while we require a bound on the bracketing entropy of the logging policy class $\mathcal{G}$, we impose no restriction on the complexity of the outcome regression model, permitting us to use flexible black-box regression methods. Usually we simply take $g_{\mathrm{ref}}(a \mid x) = 1$, so that the boundedness in Assumption 5 is simply boundedness of densities, but the above allows for slightly more flexibility in cases where we do not necessarily explore all actions equally well in Assumption 6.

Remark 4. Assumption 4 requires $(\hat{Q}_t)_{t \ge 1}$ to be a sequence of regression estimators such that, for every $t \ge 1$, $\hat{Q}_t$ is fitted on $O(1), \dots, O(t)$ and for which we can guarantee a rate of convergence to some fixed limit $\bar{Q}_1$. This can at first glance pose a challenge, since the observations $O(1), \dots, O(t)$ are adaptively collected. In the appendix, we give guarantees for outcome regression estimation over a nonparametric model using importance-sampling-weighted empirical risk minimization.

Remark 5. Assumptions 5 and 6 put restrictions on the logging policy, that is, on the contextual bandit algorithm that was run. Assumption 5 is not very restrictive, especially for large $p$. It is generally satisfied for $p$ arbitrarily close to 0 for $\epsilon$-greedy with outcome regressions in a VC-subgraph class such as trees, or for Thompson sampling with parametric models. For larger $p$ it can accommodate very general policy classes, even non-Donsker classes ($p > 2$). Assumption 6, on the other hand, may be restrictive as it can require a lot of exploration. While it appears that at least $\omega(1/t)$ exploration might be necessary for normality results such as Theorem 1 (see Remark 1), the rates needed in Assumption 6 may be stronger than actually necessary. We therefore view it as a high-level assumption, as it does not change the method itself and may not need to be satisfied in practice, and we leave it as an open question how to tighten our analysis so as to rely on less exploration.

CADR asymptotics. Our proposed CADR estimator is now given by plugging our estimates $\hat\sigma_t$ from Eq. (3) into Eq. (1), as summarized in Algorithm 1. As an immediate corollary of Theorems 1 and 2, we have our main guarantee for this final estimator, showing CADR is asymptotically normal, whence we immediately obtain asymptotically valid confidence intervals.

Corollary 1 (CADR Asymptotics and Inference). Suppose that Assumptions 1 and 4 to 6 hold. Let $\hat\sigma_t$ be given as in Eq. (3). Denote $\Gamma_T := \bigl(T^{-1} \sum_{t=1}^{T} \hat\sigma_t^{-1}\bigr)^{-1}$. Then
$$\sqrt{T}\, \Gamma_T^{-1} \bigl(\hat{\Psi}_T - \Psi_0\bigr) \xrightarrow{d} N(0, 1).$$
Moreover, letting $\zeta_\alpha$ denote the $\alpha$-quantile of the standard normal distribution,
$$\Pr\Bigl[\Psi(Q_{0,X}, \bar{Q}_0) \in \bigl[\hat{\Psi}_T \pm \zeta_{1-\alpha/2}\, \Gamma_T/\sqrt{T}\bigr]\Bigr] \xrightarrow[T \to \infty]{} 1 - \alpha.$$
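Before turning to the empirical evaluation, the following Python sketch pulls Algorithm 1 together for the case of finitely many actions. It assumes the logged policies are available as callables (recall that the policies $g_t$ are known), that `fit_Q` is some user-supplied regression routine, and it handles the first round, for which Algorithm 1 specifies no $\hat\sigma_1$, via a `sigma_floor` constant; these interface choices are ours, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def dr_score(gstar_probs, glog_probs, q_vals, a, y):
    """D*(g, Q)(x, a, y) for finitely many actions: importance-weighted residual
    plus the plug-in term, where the vectors hold g*(.|x), g(.|x), and Q(.,x)."""
    return gstar_probs[a] / glog_probs[a] * (y - q_vals[a]) + float(np.dot(q_vals, gstar_probs))

def cadr(history, policies, g_star, fit_Q, alpha=0.05, sigma_floor=1.0):
    """Sketch of Algorithm 1. history[t-1] = (x, a, y) is O(t); policies[t-1] is the
    known callable g_t, mapping a context to a propensity vector; g_star(x) returns
    g*(.|x); fit_Q(past) returns a callable x -> vector of predicted outcomes and
    must handle an empty `past` (e.g., by predicting zeros)."""
    T = len(history)
    D = np.zeros((T + 1, T + 1))          # D[t, s] = D*(g_s, Qhat_{t-1})(O(t)), 1-indexed
    sigma = np.full(T + 1, float(sigma_floor))
    for t in range(1, T + 1):
        Q_hat = fit_Q(history[:t - 1])    # trained only on O(1), ..., O(t-1)
        x_t, a_t, y_t = history[t - 1]
        q_t = Q_hat(x_t)
        for s in range(t, T + 1):         # score O(t) under every policy g_s with s >= t
            D[t, s] = dr_score(g_star(x_t), policies[s - 1](x_t), q_t, a_t, y_t)
        if t >= 2:                        # hat{sigma}_t^2 via importance sampling over the past
            w = np.empty(t - 1)
            d = np.empty(t - 1)
            for s in range(1, t):
                x_s, a_s, _ = history[s - 1]
                w[s - 1] = policies[t - 1](x_s)[a_s] / policies[s - 1](x_s)[a_s]
                d[s - 1] = D[s, t]
            var = np.mean(w * d ** 2) - np.mean(w * d) ** 2
            sigma[t] = np.sqrt(max(var, 1e-12))
    inv = 1.0 / sigma[1:]
    gamma = 1.0 / np.mean(inv)            # Gamma_T = (T^{-1} sum_t sigma_t^{-1})^{-1}
    psi = gamma * np.mean(inv * np.diag(D)[1:])
    half_width = norm.ppf(1 - alpha / 2) * gamma / np.sqrt(T)
    return psi, (psi - half_width, psi + half_width)
```

Precomputing $D^\star_{t,s}$ for $s \ge t$ inside the loop over $t$ mirrors the index ordering noted in Algorithm 1, so that $\hat\sigma^2_t$ only ever touches scores built from $\hat{Q}_{s-1}$ with $s < t$.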
4 Empirical Evaluation

In this section and in Sections E.2-E.5 of the appendix, we present an extensive and reproducible empirical evaluation on public datasets that demonstrates the robustness of CADR confidence intervals on contextual bandit data, in comparison to several baselines (code: https://github.com/mdimakopoulou/post-contextual-bandit-inference). To produce the 8 figures in the paper (Figure 1 in the main paper and Figures 2-8 in the appendix) we worked with: 57-72 real-world classification datasets (OpenML Curated Classification benchmarking suite 2018 [Bischl et al., 2017]) converted to contextual bandit datasets (a widespread evaluation approach in the offline and online contextual bandit literature); 4 different target policies; 6 estimators (CADR and 5 other popular baseline estimators from the literature); 8 different training procedures (sequential cross-fitting vs. cross-time cross-fitting in Figures 2 and 3; misspecified vs. well-specified outcome model family in Figures 4 and 5; weighted vs. unweighted outcome model fitting in Figures 6 and 7; large data vs. small data in Figures 3 and 8). For each (dataset, target policy, estimator, training procedure) combination, 64 simulations were run to obtain and report tight standard errors of the estimators' performance. In the subsections that follow, we present our approach and the main results.

Figure 1: Comparison of the CADR estimator against DM, IPW, DR, ADR and MRDR w.r.t. 95% confidence interval coverage on 57 OpenML-CC18 datasets and 4 target policies.

4.1 Baseline Estimators

We compare CADR to several benchmarks (our experiments focus on the case of finitely many actions, $\mathcal{A} = \{1, \dots, K\}$). All baselines take the following form for a choice of $w_t$, $\omega_t$, $\hat{Q}_t$:
$$\hat{\Psi} := \frac{\sum_{t=1}^{T} w_t \tilde{D}_t}{\sum_{t=1}^{T} w_t}, \qquad \mathrm{CI}_\alpha = \hat{\Psi} \pm \zeta_{1-\alpha/2} \sqrt{\frac{\sum_{t=1}^{T} w_t^2 (\tilde{D}_t - \hat{\Psi})^2}{\bigl(\sum_{t=1}^{T} w_t\bigr)^2}},$$
where $\tilde{D}_t = \omega_t\bigl(Y(t) - \hat{Q}_{t-1}(A(t), X(t))\bigr) + \sum_a \hat{Q}_{t-1}(a, X(t))\, g^\star(a \mid X(t))$.

The Direct Method (DM) sets $w_t = 1$, $\omega_t = 0$ and fits $\hat{Q}_{t-1}(a, \cdot)$ by running some regression method for each $a$ on the data $\{(X(s), Y(s)) : 1 \le s \le t-1,\ A(s) = a\}$. We use either linear regression or decision-tree regression, both with default sklearn parameters. Note that even in non-contextual settings, where $\hat{Q}_{t-1}$ is a simple per-arm sample average, $\hat{Q}_{t-1}$ may be biased due to adaptive data collection [Xu et al., 2013, Luedtke and van der Laan, 2016, Bowden and Trippa, 2017, Nie et al., 2018, Hadad et al., 2019, Shin et al., 2019]. Inverse Propensity Score Weighting (IPW) sets $w_t = 1$, $\omega_t = (g^\star/g_t)(A(t) \mid X(t))$, $\hat{Q}_t = 0$. Doubly Robust (DR) sets $w_t = 1$, $\omega_t = (g^\star/g_t)(A(t) \mid X(t))$ and fits $\hat{Q}_{t-1}$ as in DM. More Robust Doubly Robust (MRDR) [Farajtabar et al., 2018] is the same as DR, but when fitting $\hat{Q}_{t-1}$ we reweight each data point by $\frac{g^\star(A(s) \mid X(s))\bigl(1 - g_s(A(s) \mid X(s))\bigr)}{g_s(A(s) \mid X(s))^2}$. None of the above are generally asymptotically normal under adaptive data collection [Hadad et al., 2019]. Adaptive Doubly Robust (ADR; a.k.a. the stabilized one-step estimator for multi-armed bandit data) [Luedtke and van der Laan, 2016, Hadad et al., 2019] is the same as DR but sets $w_t = g_t^{1/2}(A(t) \mid X(t))$. ADR is unbiased and asymptotically normal for multi-armed bandit logging policies but is biased for context-dependent adaptive logging policies, which are the focus of this paper. Finally, note that our proposal CADR takes the same form as DR but with $w_t = \hat\sigma_t^{-1}$, using our adaptive conditional standard deviation estimators $\hat\sigma_t$ from Eq. (3).
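For concreteness, here is a minimal sketch of the shared weighted-average form and the per-round scores and weights of the baselines of Section 4.1, again for finitely many actions. The function and argument names are ours (not from the released code), and the per-round inputs (logged propensities, target-policy probabilities, outcome-model predictions) are assumed to be supplied by the caller.

```python
import numpy as np
from scipy.stats import norm

def baseline_estimate(D_tilde, w, alpha=0.05):
    """Weighted-average form shared by the baselines of Section 4.1, with the
    plug-in standard error used for the confidence interval."""
    D_tilde, w = np.asarray(D_tilde, dtype=float), np.asarray(w, dtype=float)
    psi = np.sum(w * D_tilde) / np.sum(w)
    se = np.sqrt(np.sum(w ** 2 * (D_tilde - psi) ** 2)) / np.sum(w)
    z = norm.ppf(1 - alpha / 2)
    return psi, (psi - z * se, psi + z * se)

def score_and_weight(estimator, a, y, g_t_probs, g_star_probs, q_hat_vals):
    """Per-round score D_t and weight w_t for each baseline, given the logged
    propensities g_t(.|X(t)), the target probabilities g*(.|X(t)), and the
    outcome-model predictions Qhat_{t-1}(.,X(t))."""
    plug_in = float(np.dot(q_hat_vals, g_star_probs))
    ratio = g_star_probs[a] / g_t_probs[a]
    if estimator == "DM":              # omega_t = 0, w_t = 1
        return plug_in, 1.0
    if estimator == "IPW":             # omega_t = ratio, Qhat = 0, w_t = 1
        return ratio * y, 1.0
    if estimator in ("DR", "MRDR"):    # MRDR differs only in how Qhat is fit
        return ratio * (y - q_hat_vals[a]) + plug_in, 1.0
    if estimator == "ADR":             # w_t = g_t(A(t)|X(t))^{1/2}; for non-contextual logging
        return ratio * (y - q_hat_vals[a]) + plug_in, float(np.sqrt(g_t_probs[a]))
    raise ValueError(f"unknown estimator: {estimator}")
```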
4.2 Contextual Bandit Data from Multiclass Classification Data

To construct our data, we turn $K$-class classification tasks into $K$-armed contextual bandit problems [Dudík et al., 2014, Dimakopoulou et al., 2017, Su et al., 2019], which has the benefits of reproducibility using public datasets and of enabling uncontroversial comparisons against actual ground-truth counterfactuals. We use the public OpenML Curated Classification benchmarking suite 2018 (OpenML-CC18; BSD 3-Clause license) [Bischl et al., 2017], which has datasets that vary in domain, number of observations, number of classes, and number of features. Among these, we select the classification datasets that have fewer than 100 features. This results in 57 classification datasets from OpenML-CC18 used for evaluation; Table 1 in Section E.1 of the appendix summarizes the characteristics of these datasets.

Each dataset is a collection of pairs of covariates $X$ and labels $L \in \{1, \dots, K\}$. We transform each dataset to a contextual bandit problem as follows. At each round, we draw $(X(t), L(t))$ uniformly at random with replacement from the dataset. We reveal the context $X(t)$ to the agent and, given an arm pull $A(t)$, we draw and return the reward $Y(t) \sim N(\mathbb{1}\{A(t) = L(t)\}, 1)$. To generate our data, we set $T = 10000$ and use the following $\epsilon$-greedy procedure. We pull arms uniformly at random until each arm has been pulled at least once. Then, at each subsequent round $t$, we fit $\hat{Q}_{t-1}$ using the data up to that time in the same fashion as used for the DM estimator above, using decision-tree regressions. We set $\bar{A}_x(t) = \arg\max_{a = 1, \dots, K} \hat{Q}_{t-1}(a, x)$ and $\epsilon_t = 0.01 \vee t^{-1/3}$. We then let $g_t(a \mid x) = \epsilon_t/K$ for $a \ne \bar{A}_x(t)$ and $g_t(\bar{A}_x(t) \mid x) = 1 - \epsilon_t + \epsilon_t/K$. That is, with probability $\epsilon_t$ we pull an arm uniformly at random, and otherwise we pull $\bar{A}_{X(t)}(t)$.

We then consider four candidate policies to evaluate:
(1) arm 1 non-contextual: $g^\star(1 \mid x) = 1$ and $g^\star(a \mid x) = 0$ otherwise (note that the meaning of label 1 changes by dataset);
(2) arm 2 non-contextual: $g^\star(2 \mid x) = 1$ and $g^\star(a \mid x) = 0$ otherwise;
(3) linear contextual: we sample a new dataset of size $T$ using a uniform exploration policy, then fit $\hat{Q}_T$ as above using linear regression, fix $a^\star(x) = \arg\max_{a \in \{1, \dots, K\}} \hat{Q}_T(a, x)$, and set $g^\star(a^\star(x) \mid x) = 1$ and $g^\star(a \mid x) = 0$ otherwise;
(4) tree contextual: same as linear contextual but with $\hat{Q}_T$ fit using decision-tree regression.
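A minimal sketch of the supervised-to-bandit conversion and the $\epsilon$-greedy logging policy just described; fitting $\hat{Q}_{t-1}$ with decision trees and the initial uniform phase are omitted, and the helper names are ours.

```python
import numpy as np

def make_bandit_round(X_data, labels, rng):
    """One round of the conversion in Section 4.2: draw a labelled example
    uniformly with replacement, reveal its covariates as the context, and pay
    reward Y = 1{A = L} plus standard normal noise."""
    i = int(rng.integers(len(labels)))
    x, label = X_data[i], labels[i]
    def reward(action):
        return float(action == label) + rng.normal()
    return x, reward

def eps_greedy_probs(q_hat_vals, t, K):
    """Logging policy of Section 4.2 after the initial phase: explore uniformly
    with probability epsilon_t = 0.01 v t^(-1/3), otherwise pull the greedy arm
    under Qhat_{t-1} (whose fitting is omitted here)."""
    eps_t = max(0.01, t ** (-1.0 / 3.0))
    probs = np.full(K, eps_t / K)
    probs[int(np.argmax(q_hat_vals))] += 1.0 - eps_t
    return probs
```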
4.3 Results

Figure 1 shows the comparison of the CADR estimator against DM, IPW, DR, ADR, and MRDR w.r.t. coverage, that is, the frequency over 64 replications with which the 95% confidence interval covers the true $\Psi_0$, for each of the 57 OpenML-CC18 datasets and 4 target policies. In each subfigure, each dot represents a dataset, the y-axis corresponds to the coverage of the CADR estimator, and the x-axis corresponds to the coverage of one of the baseline estimators. The lines represent one standard error over the 64 replications. A dot is depicted in blue if for that dataset CADR has significantly (i.e., by more than one standard error) better coverage than the baseline estimator, in red if it has significantly worse coverage, and in black if the difference in coverage of the two estimators is within one standard error. Results are averaged over the 64 replications that we ran for each dataset, estimator, and target policy. The standard errors (computed as the sample standard deviation over the 64 replications divided by $\sqrt{64}$) are depicted, and the 95% coverage targets are shown as dashed lines.

In Figure 1, outcome models for CADR, DM, DR, ADR, and MRDR are fit using linear regression (with default sklearn parameters). In the appendix, we provide additional empirical results where we use decision-tree regressions, where we use the MRDR outcome model for CADR, or where we use cross-fold estimation across time. Across all of our experiments, we observe that the confidence interval of CADR has better coverage of the ground truth than any baseline, which can be attributed to its asymptotic normality. The second-best estimator in terms of coverage is DR. The advantages of CADR over DR are most pronounced when either (a) there is a mismatch between the logging policy and the target policy (e.g., compare the 1st and 2nd rows in Figure 1; the tree target policy is most similar to the logging policy, which also uses trees) or (b) the outcome model is poor (either due to model misspecification, such as with a linear model on real data, or due to small sample size).

5 Conclusions

Adaptive experiments hold great promise for better, more efficient, and even more ethical experiments. However, they complicate post-experiment inference, which is a cornerstone of drawing credible conclusions from controlled experiments. We provided here a novel asymptotically normal estimator for policy value and causal effects when data were generated by a contextual adaptive experiment, such as a contextual bandit algorithm. This leads to simple and effective confidence intervals given by adding and subtracting multiples of the standard error, making contextual adaptive experiments a more viable option for experimentation in practice.

6 Societal Impact and Limitations

Adaptive experiments hold particular promise in settings where experimentation is costly and/or dangerous, such as in medicine and policymaking. By adapting treatment allocation, harmful interventions can be avoided, outcomes for study participants improved, and smaller studies enabled. Being able to draw credible conclusions from such experiments makes them viable replacements for classic randomized trials. Our confidence intervals offer one way to do so. At the same time, and especially subject to our assumption of vanishing but nonzero exploration, these experiments must be subject to the same ethical guidelines as classic randomized experiments. Additionally, the usual caveats of frequentist confidence intervals hold here, such as their interpretation only as a guarantee over data collection, this guarantee only being approximate in finite samples when we rely on asymptotic normality, and the risks of multiple comparisons and of p-hacking. Finally, we note that our inference focuses on an average quantity; as such, it focuses on social welfare and need not capture the risk to individuals or groups. Subgroup analyses may therefore be helpful in complementing the analysis; these can be conducted by setting $g^\star(a \mid x)$ to zero for some values of $x$. Future work may be necessary to further extend our results to conducting inference on risk metrics such as quantiles of outcomes.

References

Susan Athey, Sarah Baird, Julian Jamison, Craig McIntosh, Berk Ozler, and Dohbit Sama. A sequential and adaptive experiment to increase the uptake of long-acting reversible contraceptives in Cameroon, 2018. URL http://pubdocs.worldbank.org/en/606341582906195532/Study-Protocol-Adaptive-experiment-on-FP-counseling-and-uptake-of-MCs.pdf. Study protocol.

Eytan Bakshy, Lili Dworkin, Brian Karrer, Konstantin Kashin, Benjamin Letham, Ashwin Murthy, and Shaun Singh.
AE: A domain-agnostic platform for adaptive experimentation. In Workshop on Systems for ML, 2018.

Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138, 2009.

Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G Mantovani, Jan N van Rijn, and Joaquin Vanschoren. OpenML benchmarking suites. arXiv preprint arXiv:1708.03731, 2017.

Jack Bowden and Lorenzo Trippa. Unbiased estimation for response adaptive clinical trials. Statistical Methods in Medical Research, 26(5):2376–2388, 2017.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.

Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077, 2017.

Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011.

Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.

Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018.

Vitor Hadad, David A Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.

Nathan Kallus. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pages 8895–8906, 2018.

Nathan Kallus and Madeleine Udell. Dynamic assortment personalization in high dimensions. Operations Research, 68(4):1020–1037, 2020.

Nathan Kallus and Masatoshi Uehara. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.

Maximilian Kasy and Anja Sautmann. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113–132, 2021.

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306, 2011.

Alexander R. Luedtke and Mark J. van der Laan. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. The Annals of Statistics, 44(2):713–742, 2016. doi: 10.1214/15-AOS1384. URL https://doi.org/10.1214/15-AOS1384.

Xinkun Nie, Xiaoying Tian, Jonathan Taylor, and James Zou. Why adaptively collected data have negative bias and how to correct for it. In International Conference on Artificial Intelligence and Statistics, pages 1261–1269. PMLR, 2018.

Sheng Qiang and Mohsen Bayati. Dynamic pricing with demand covariates. arXiv preprint arXiv:1604.07463, 2016.
Simon Quinn, Alex Teytelboym, Maximilian Kasy, Grant Gordon, and Stefano Caria. A sequential and adaptive experiment to increase the uptake of long-acting reversible contraceptives in Cameroon, 2019. URL https://www.socialscienceregistry.org/trials/3870. Study registration.

James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

Jaehyeok Shin, Aaditya Ramdas, and Alessandro Rinaldo. On the bias, risk and consistency of sample means in multi-armed bandits. arXiv preprint arXiv:1902.00746, 2019.

Yi Su, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. CAB: Continuous adaptive blending for policy evaluation and learning. In International Conference on Machine Learning, pages 6005–6014. PMLR, 2019.

Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.

Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR, 2016.

Mark J van der Laan and James M Robins. Unified Methods for Censored Longitudinal Data and Causality. Springer Science & Business Media, 2003.

A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag New York, 1996. ISBN 9781475725452.

Aad W van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597. PMLR, 2017.

Min Xu, Tao Qin, and Tie-Yan Liu. Estimation bias in multi-armed bandit algorithms for search advertising. Advances in Neural Information Processing Systems, 26:2400–2408, 2013.

Ruohan Zhan, Vitor Hadad, David A Hirshberg, and Susan Athey. Off-policy evaluation via adaptive weighting with data from contextual bandits. arXiv preprint arXiv:2106.02029, 2021.

Kelly W Zhang, Lucas Janson, and Susan A Murphy. Statistical inference with M-estimators on adaptively collected data. arXiv preprint arXiv:2104.14074, 2021.