# Off-Policy Confidence Sequences

Nikos Karampatziakis¹, Paul Mineiro², Aaditya Ramdas³

¹Microsoft Azure AI. ²Microsoft Research. ³Carnegie Mellon University. Correspondence to: Nikos Karampatziakis. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting. These confidence sequences are based on recent ideas from martingale analysis and are non-asymptotic, non-parametric, and valid at arbitrary stopping times. We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency. We empirically demonstrate the tightness of our approach in terms of failure probability and width, and apply it to the gated deployment problem of safely upgrading a production contextual bandit system.

## 1. Introduction

Reasoning about the reward that a new policy π would have achieved if it had been deployed, a task known as Off-Policy Evaluation (OPE), is one of the key challenges in modern Contextual Bandits (CBs) (Langford and Zhang, 2007) and Reinforcement Learning (RL). A typical OPE use case is the validation of new modeling ideas by data scientists. If OPE suggests that π is better, this can then be validated online by deploying the new policy to the real world.

The classic way to answer whether π has better reward than the current policy h is via a confidence interval (CI). Unfortunately, CIs take a very static view of the world. Suppose that π is better than h and our OPE shows a higher but not significantly better estimated reward. What should we do? We could collect more data, but since a CI holds for a particular (fixed) sample size and is not designed to handle interactive/adaptive data collection, simply recalculating the CI at a larger sample size invalidates its coverage guarantee. While there are ways to fix this, such as a crude union bound, the proper statistical tool for such cases is called a Confidence Sequence (CS). A CS is a sequence of CIs such that the probability that they ever exclude the true value is bounded by a prespecified quantity. In other words, they retain validity under optional (early) stopping and optional continuation (collecting more data).

In this work we develop CSs for OPE using recent insights from martingale analysis (for simpler problems). Besides the aforementioned high-probability, uniform-over-time guarantee, these CSs make no parametric assumptions and are easy to compute. We use them to create a gated deployment primitive: instead of deploying π directly we keep it in a staging area where we compute its off-policy CS as h is collecting data. Then π can replace h as soon as (if ever) we can reject the hypothesis that h is better than π.

We now introduce some notation to give context to our contributions. We have iid CB data of the form (x, a, r) collected by a historical policy h in the following way. First a context x was sampled from an unknown distribution D. Then h assigns a probability to each action. An action a is sampled with probability h(a; x) and performed. A reward r associated with performing a in situation x is sampled from an unknown distribution R(x, a). Afterwards, we wish to estimate the reward of another policy π that is absolutely continuous with respect to h.
We have

$$V(\pi) = \mathbb{E}_{x \sim D,\, a \sim \pi(\cdot;x),\, r \sim R(x,a)}[r] = \mathbb{E}_{x \sim D,\, a \sim h(\cdot;x),\, r \sim R(x,a)}\left[\frac{\pi(a;x)}{h(a;x)}\, r\right] \tag{1}$$

where the last quantity can be estimated from data. Letting $w = \frac{\pi(a;x)}{h(a;x)}$ we see that $\mathbb{E}_{x \sim D, a \sim h}[w] = 1$, where we write $w$ instead of $w(x,a)$ to reduce notation clutter. More generally, for any function $q(x,a)$, which is typically a predictor of the reward of $a$ at $x$, we have

$$\mathbb{E}_{x \sim D,\, a \sim h}[w\, q(x,a)] = \mathbb{E}_{x \sim D}\Big[\sum_{a'} \pi(a'; x)\, q(x, a')\Big], \tag{2}$$

which reduces to $\mathbb{E}[w] = 1$ when $q(x,a) = 1$ always. Eq. (1) and (2) are the building blocks of OPE estimators. The IPS estimator (Horvitz and Thompson, 1952) estimates (1) via Monte Carlo: $\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^n w_i r_i$. A plethora of other OPE estimators are discussed in Section 6. In general there is a tension between the desirability of an unbiased estimator like $\hat{V}_{\mathrm{IPS}}$ and the difficulty of working with it in finite samples due to its excessive variance.

Recently, Kallus and Uehara (2019) proposed an OPE estimator based on Empirical Likelihood (Owen, 2001) with several desirable properties. Empirical Likelihood (EL) has also been used to derive CIs for OPE in CBs (Karampatziakis et al., 2020) and RL (Dai et al., 2020). Our CSs can be thought of as a natural extension of the batch-setting CIs for OPE to the online setting; their advantages include:

- Our CSs hold non-asymptotically, unlike most existing CIs mentioned above, which are either asymptotically valid or non-asymptotic but overly conservative.
- Our CSs are not unnecessarily conservative due to naive union bounds or peeling techniques.
- We do not make any assumptions, either parametric or about the support of w and r, beyond boundedness.
- Our validity guarantees are time-uniform, meaning that they remain valid under optional continuation (collecting more data) and/or at stopping times, both of which are not guaranteed by the aforementioned CIs.

## 2. Background: OPE Confidence Intervals

We start by reviewing OPE CIs from the perspective of Karampatziakis et al. (2020). Their CI is constructed by considering plausible distributions from a nonparametric family $\mathcal{Q}$ of distributions $Q$ for random vectors $(w, r) \in [0, w_{\max}] \times [0, 1]$ under the constraint $\mathbb{E}_Q[w] = 1$. Let $Q_{wr}$ be the probability that $Q \in \mathcal{Q}$ assigns to the event where the importance weight is $w$ and the reward is $r$. Then there exists $Q^* \in \mathcal{Q}$ such that

$$Q^*_{wr} = \mathbb{E}_{x \sim D,\, a \sim h,\, \rho \sim R(x,a)}\left[\mathbb{I}\!\left[\tfrac{\pi(a;x)}{h(a;x)} = w\right]\, \mathbb{I}[\rho = r]\right]$$

and $V(\pi) = \mathbb{E}_{Q^*}[wr]$. To estimate $V(\pi)$ we can find $Q^{\mathrm{mle}} \in \mathcal{Q}$ that maximizes the data likelihood. To find a CI we minimize/maximize $\mathbb{E}_Q[wr]$ over plausible $Q \in \mathcal{Q}$, i.e., those whose data likelihood is not far off from that of $Q^{\mathrm{mle}}$. Using convex duality, the MLE is $Q^{\mathrm{mle}}_{wr} = \frac{1}{n(1 + \lambda_1^{\mathrm{mle}}(w-1))}$, where $\lambda_1^{\mathrm{mle}}$ is a dual variable solving

$$\lambda_1^{\mathrm{mle}} = \operatorname*{argmax}_{\lambda_1} \sum_{i=1}^n \log(1 + \lambda_1(w_i - 1)) \quad \text{subject to } 1 + \lambda_1(w_{\max} - 1) \ge 0,\; 1 - \lambda_1 \ge 0.$$

The profile likelihood $\mathcal{L}(v) = \sup_{Q:\, \mathbb{E}_Q[w]=1,\, \mathbb{E}_Q[wr]=v}\; \prod_{i=1}^n Q_{w_i, r_i}$ is used for CIs. From EL theory, an asymptotic $(1-\alpha)$-CI is

$$\left\{ v : 2 \log \frac{\prod_{i=1}^n Q^{\mathrm{mle}}_{w_i, r_i}}{\mathcal{L}(v)} \le \chi^{2,1-\alpha}_1 \right\}$$

where $\chi^{2,1-\alpha}_1$ is the $1-\alpha$ quantile of a $\chi^2$ distribution with one degree of freedom. Using convex duality the CI is

$$\left\{ v : B(v) - \sum_{i=1}^n \log(1 + \lambda_1^{\mathrm{mle}}(w_i - 1)) \le \tfrac{1}{2}\chi^{2,1-\alpha}_1 \right\}$$

where the dual profile log likelihood $B(v)$ is

$$B(v) = \sup_{\lambda_1, \lambda_2} \sum_{i=1}^n \log\big(1 + \lambda_1(w_i - 1) + \lambda_2(w_i r_i - v)\big) \quad \text{subject to } (\lambda_1, \lambda_2) \in D^0_v, \tag{3}$$

where

$$D^m_v = \big\{(\lambda_1, \lambda_2) : 1 + \lambda_1(w - 1) + \lambda_2(wr - v) \ge m \;\; \forall (w, r) \in \{0, w_{\max}\} \times \{0, 1\}\big\}. \tag{4}$$

The CI endpoints can be found via bisection on $v$.
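To make the dual formulation concrete, here is a minimal sketch that evaluates $B(v)$ from eq. (3) with an off-the-shelf solver; the CI endpoints could then be located by bisection on $v$ as described above. The function name, the corner-constraint encoding, and the use of `scipy.optimize.minimize` with SLSQP are our own choices and not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def dual_profile_loglik(v, w, r, wmax):
    """Evaluate B(v) from eq. (3): maximize the dual EL objective over
    (lambda1, lambda2) subject to the linear constraints D^0_v of eq. (4)."""
    w, r = np.asarray(w, float), np.asarray(r, float)

    def neg_obj(lam):
        z = 1.0 + lam[0] * (w - 1.0) + lam[1] * (w * r - v)
        if np.any(z <= 1e-12):        # outside the domain of the log: large penalty
            return 1e10
        return -np.sum(np.log(z))

    # Constraints 1 + l1*(w-1) + l2*(w*r - v) >= 0 at the corners {0, wmax} x {0, 1}.
    corners = [(0.0, 0.0), (0.0, 1.0), (wmax, 0.0), (wmax, 1.0)]
    cons = [{"type": "ineq",
             "fun": (lambda lam, cw=cw, cr=cr:
                     1.0 + lam[0] * (cw - 1.0) + lam[1] * (cw * cr - v))}
            for cw, cr in corners]

    res = minimize(neg_obj, x0=np.zeros(2), method="SLSQP", constraints=cons)
    return -res.fun

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wmax = 10.0
    w = rng.choice([0.5, 1.5], size=500)              # toy importance weights with E[w] = 1
    r = rng.binomial(1, 0.4, size=500).astype(float)  # toy rewards
    for v in (0.2, 0.5, 0.8):
        print(v, dual_profile_loglik(v, w, r, wmax))
```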
## 3. Off-Policy Confidence Sequences

We now move from the batch setting and asymptotics to online procedures and finite-sample, time-uniform results. We adapt and extend ideas from Waudby-Smith and Ramdas (2020), which constructs CSs for the means of random variables in [0, 1]. Our key insight is to combine their construction with an interpretation of (3) as the log wealth accrued by a skeptic who is betting against the hypotheses $\mathbb{E}_{Q^*}[w] = 1$ and $\mathbb{E}_{Q^*}[wr] = v$. In particular, the skeptic starts with a wealth of 1 and wants to maximize her wealth. Her bet on the outcome $w - 1$ is captured by $\lambda_1$, while $\lambda_2$ represents the bet on the outcome of $wr - v$, so that the wealth after the $i$-th sample is multiplied by $1 + \lambda_1(w_i - 1) + \lambda_2(w_i r_i - v)$. If the outcomes had been in $[-1, 1]$ then $|\lambda_1|$ and $|\lambda_2|$ would have an interpretation as the fraction of the skeptic's wealth that is being risked on each step. The bets can be positive or negative, and their signs represent the directions of the bet. For example, $\lambda_2 < 0$ means the skeptic will make money if $w_i r_i - v < 0$. Enforcing the constraints (4) from the batch setting here means that the resulting wealth cannot be negative.

The first benefit of this framing is that we have mapped the abstract concepts of dual likelihood, dual variables, and dual constraints to the more familiar concepts of wealth, bets, and avoiding bankruptcy. We now formalize our constructions and show how they lead to always-valid, finite-sample CSs. We introduce a family of processes

$$K_t(v) = \prod_{i=1}^t \big(1 + \lambda_{1,i}(w_i - 1) + \lambda_{2,i}(w_i r_i - v)\big)$$

where $\lambda_{1,i}$ and $\lambda_{2,i}$ are predictable, i.e., based on past data (formally, measurable with respect to the sigma field $\sigma(\{(w_j, r_j)\}_{j=1}^{i-1})$). We also formalize CIs and CSs below.

Definition 1. Given data $S_n = \{(x_i, a_i, r_i)\}_{i=1}^n$, where $x_i \sim D$, $a_i \sim h(\cdot; x_i)$, $r_i \sim R(x_i, a_i)$, a $(1-\alpha)$-confidence interval for $V(\pi)$ is a set $C_n = C(h, \pi, S_n)$ such that $\sup_{D,R} \Pr(V(\pi) \notin C_n) \le \alpha$. In contrast, a $(1-\alpha)$-confidence sequence for $V(\pi)$ is a sequence of confidence intervals $(C_t)_{t \in \mathbb{N}}$ such that $\sup_{D,R} \Pr(\exists t \in \mathbb{N} : V(\pi) \notin C_t) \le \alpha$.

We now have the setup to state our first theoretical result.

Theorem 1. $K_t(V(\pi))$ is a nonnegative martingale. Moreover, the sequences $C_t = \{v : K_t(v) < 1/\alpha\}$ and $\bar{C}_t = \bigcap_{i=1}^t C_i$ are $(1-\alpha)$-confidence sequences for $V(\pi)$.

All proofs are in the appendix. The process $K_t(v)$ tracks the wealth of a skeptic betting against $V(\pi) = v$. The process $K_t(V(\pi))$ is a nonnegative martingale, so it has a small probability of attaining large values (formally, Ville's inequality states that the probability of ever exceeding $1/\alpha$ is at most $\alpha$). Of course, we don't know $V(\pi)$, but if we retain all values of $v$ where the wealth is below $1/\alpha$, and reject the values of $v$ for which it has crossed $1/\alpha$ at some point, this set will always contain $V(\pi)$ with high probability; this is the basis of our construction. The strength of our approach comes from this result, as it guarantees always-valid bounds for $V(\pi)$ using only martingale arguments, crucially avoiding parametric or other assumptions.

What about $v \neq V(\pi)$? Can we be sure that $C_t$ does not contain values $v$ very far from $V(\pi)$? That is where the betting strategy, quantified by the predictable sequences $(\lambda_{1,i})$ and $(\lambda_{2,i})$, enters. The hope is that the skeptic can eventually force $K_t(v)$ to be large via a series of good bets. Importantly, Theorem 1 holds regardless of how the bets are set, but good bets will lead to a small $C_t$. How to smartly bet is the subject of what follows.
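As a concrete (if naive) illustration of Theorem 1, the sketch below tracks the wealth process $K_t(v)$ on a grid of candidate values $v$ and rejects a value once its wealth reaches $1/\alpha$. The fixed bets, the grid, and all names are our own simplifications, not the strategy developed in the next section.

```python
import numpy as np

def grid_confidence_sequence(samples, alpha=0.05, lam1=0.0, lam2=0.5, grid_size=101):
    """Naive CS for V(pi): one wealth process K_t(v) per grid value v, with
    constant (hence predictable) bets lam1, lam2. A value v is excluded forever
    once its wealth reaches 1/alpha (Ville's inequality)."""
    vs = np.linspace(0.0, 1.0, grid_size)
    wealth = np.ones(grid_size)
    alive = np.ones(grid_size, dtype=bool)   # running intersection of the C_t
    intervals = []
    for w, r in samples:
        wealth *= 1.0 + lam1 * (w - 1.0) + lam2 * (w * r - vs)
        alive &= wealth < 1.0 / alpha
        kept = vs[alive]
        intervals.append((kept.min(), kept.max()) if kept.size else (np.nan, np.nan))
    return intervals

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy logged data with E[w] = 1 and V(pi) = E[w r] = 0.475.
    w = rng.choice([0.5, 1.5], size=2000)
    r = rng.binomial(1, np.where(w > 1, 0.5, 0.4)).astype(float)
    print(grid_confidence_sequence(zip(w, r))[-1])
```

Note that with a single fixed positive bet $\lambda_2$ only the lower endpoint improves over time; shrinking both endpoints is precisely what the hedged construction of Section 4.3 is for.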
## 4. Main Betting Strategy: MOPE

We develop our main betting strategy, MOPE (Martingale OPE), in steps, starting with a slow but effective algorithm and making changes that gradually lead to a computationally efficient algorithm.

### 4.1. Follow The Leader

We begin with a Follow-The-Leader (FTL) strategy that is known to work very well for iid problems (De Rooij et al., 2014). We define $\ell^v_i(\lambda) = \ln(1 + \lambda_1(w_i - 1) + \lambda_2(w_i r_i - v))$ and set $\lambda = [\lambda_1, \lambda_2]^\top$ to maximize wealth in hindsight,

$$\lambda^{\mathrm{ftl}}_t(v) = \operatorname*{argmax}_{\lambda} \sum_{i=1}^{t-1} \ell^v_i(\lambda), \tag{5}$$

for every step of betting in $K_t(v)$. The problem (5) is convex and can be solved in polynomial time, leading to an overall polynomial-time algorithm. However, this approach has three undesirable properties. First, the algorithm needs to store the whole history of (w, r) samples. Second, the overall algorithm is tractable but slow. Finally, we need to solve (5) for all values of $v$ that we have not yet rejected. We address these issues by replacing the wealth with a lower bound that is a quadratic polynomial, introducing a hedged process, and using common bets for all $v$. We detail these below.

### 4.2. Maximizing a Lower Bound on Wealth

We can avoid having to store all history by optimizing an easy-to-maintain lower bound of (5).

Lemma 1. For all $x \ge -\tfrac{1}{2}$ and $\psi = 2 - 4\ln(2)$, we have $\ln(1 + x) \ge x + \psi x^2$.

Observe that if we restrict our bets to lie in the convex set $D^{1/2}_v$ (cf. eq. (4)) then for all $\lambda \in D^{1/2}_v$

$$\sum_{i=1}^{t-1} \ell^v_i(\lambda) \ge \lambda^\top \sum_{i=1}^{t-1} b_i(v) + \psi\, \lambda^\top \Big(\sum_{i=1}^{t-1} A_i(v)\Big) \lambda,$$

where $b_i(v) = \begin{bmatrix} w_i - 1 \\ w_i r_i - v \end{bmatrix}$ and $A_i(v) = b_i(v)\, b_i(v)^\top$. The first step towards a more efficient algorithm is to set our bets at time $t$ as

$$\lambda_t(v) = \operatorname*{argmax}_{\lambda \in D^{1/2}_v}\; \psi\, \lambda^\top \Big(\sum_{i=1}^{t-1} A_i(v)\Big) \lambda + \lambda^\top \sum_{i=1}^{t-1} b_i(v). \tag{6}$$

The restriction $\lambda \in D^{1/2}_v$ is very mild: it does not allow the skeptic to lose more than half of her wealth from any single outcome. The first advantage of this formulation is that $\sum_i A_i(v)$ and $\sum_i b_i(v)$ are low-degree polynomials of $v$ and can share the coefficients

$$\sum_{i=1}^{t-1} A_i(v) = A^{(0)} + v A^{(1)} + v^2 A^{(2)}, \qquad \sum_{i=1}^{t-1} b_i(v) = b^{(0)} + v\, b^{(1)},$$

where the coefficient matrices and vectors are assembled from the first $t-1$ samples. Secondly, the coefficients can be updated incrementally: after observing $(w_i, r_i)$,

$$A^{(0)} \leftarrow A^{(0)} + \begin{bmatrix} (w_i-1)^2 & (w_i-1) w_i r_i \\ (w_i-1) w_i r_i & w_i^2 r_i^2 \end{bmatrix}, \tag{7}$$
$$A^{(1)} \leftarrow A^{(1)} + \begin{bmatrix} 0 & -(w_i-1) \\ -(w_i-1) & -2 w_i r_i \end{bmatrix}, \tag{8}$$
$$A^{(2)} \leftarrow A^{(2)} + \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}, \tag{9}$$
$$b^{(0)} \leftarrow b^{(0)} + \begin{bmatrix} w_i - 1 \\ w_i r_i \end{bmatrix}, \tag{10}$$
$$b^{(1)} \leftarrow b^{(1)} + \begin{bmatrix} 0 \\ -1 \end{bmatrix}. \tag{11}$$

Finally, we can solve (6) exactly in O(1) time. Section 4.4 will elaborate on this using a slight variation of eq. (6).
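A minimal sketch of the bookkeeping in eqs. (7)–(11): the class below accumulates the polynomial coefficients and evaluates $\sum_i A_i(v)$, $\sum_i b_i(v)$, and the quadratic lower bound for any $v$ in O(1) time. The class and method names are ours, not the paper's code.

```python
import numpy as np

class QuadraticWealthBound:
    """Maintain the coefficients of sum_i A_i(v) and sum_i b_i(v) so that the
    quadratic lower bound on the log wealth can be evaluated for any v in O(1)."""

    def __init__(self):
        self.A0 = np.zeros((2, 2)); self.A1 = np.zeros((2, 2)); self.A2 = np.zeros((2, 2))
        self.b0 = np.zeros(2); self.b1 = np.zeros(2)

    def update(self, w, r):
        # Eqs. (7)-(11): pieces of A_i(v) = b_i(v) b_i(v)^T expanded in powers of v.
        wr = w * r
        self.A0 += np.array([[(w - 1) ** 2, (w - 1) * wr], [(w - 1) * wr, wr ** 2]])
        self.A1 += np.array([[0.0, -(w - 1)], [-(w - 1), -2.0 * wr]])
        self.A2 += np.array([[0.0, 0.0], [0.0, 1.0]])
        self.b0 += np.array([w - 1.0, wr])
        self.b1 += np.array([0.0, -1.0])

    def A(self, v):
        return self.A0 + v * self.A1 + v ** 2 * self.A2

    def b(self, v):
        return self.b0 + v * self.b1

    def lower_bound(self, lam, v, psi=2.0 - 4.0 * np.log(2.0)):
        """psi * lam' A(v) lam + lam' b(v): by Lemma 1 this lower-bounds
        sum_i l_i^v(lam) whenever lam lies in D^{1/2}_v."""
        lam = np.asarray(lam, float)
        return psi * lam @ self.A(v) @ lam + lam @ self.b(v)
```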
### 4.3. Common Bets and Hedging

So far we have made betting simpler, but we still need to reason about betting per value of $v$. While the most competitive betting sequences for the process $K_t(v)$ will take advantage of the knowledge of $v$, placing different bets for different values of $v$ creates two problems: first, the resulting confidence set need not be an interval, and second, it makes it hard to implement Theorem 1 in a computationally efficient way. Indeed, even in the simpler setup of Waudby-Smith and Ramdas (2020), the authors maintain a grid of test values for the quantity of interest (here $v$) and, at a minimum, keep track of the wealth separately for each grid value, because tracking the wealth across the grid is not straightforward when the bets are different.

To make wealth tracking easy and obtain algorithms that do not require discretizing the domain of $v$, a natural proposal would be to use a common bet for all $v$ in each timestep. Unfortunately, this is not adequate because we do need $\lambda_2 > 0$ for $v < \mathbb{E}_{Q^*}[wr]$ and $\lambda_2 < 0$ for $v > \mathbb{E}_{Q^*}[wr]$. A simple fix is to use a hedged strategy as in Waudby-Smith and Ramdas (2020). First, we split the initial wealth equally. The first half is used to bet against low $v$'s via the process

$$K^+_t(v) = \prod_{i=1}^t \big(1 + \lambda^+_{1,i}(w_i - 1) + \lambda^+_{2,i}(w_i r_i - v)\big)$$

and the second half to bet against high $v$'s via a separate process $K^-_t(v)$, which for symmetry we parametrize as

$$K^-_t(v) = \prod_{i=1}^t \big(1 + \lambda^-_{1,i}(w_i - 1) + \lambda^-_{2,i}(w_i r'_i - v')\big),$$

where $r'_i = 1 - r_i$ and $v' = 1 - v$. This can be seen as the wealth process for betting against $1 - v$ in a world where $r$ has been remapped to $1 - r$. Thus betting against high values of $v$ reduces to betting against low values of $v$ in a modified process. The total wealth of the hedged process is

$$K^{\pm}_t(v) = \tfrac{1}{2}\big(K^+_t(v) + K^-_t(v)\big), \tag{12}$$

and it can be used for CSs in the same way as $K_t(v)$:

Theorem 2. The sequence $C^{\pm}_t = \{v : K^{\pm}_t(v) < 1/\alpha\}$ and its running intersection $\bigcap_{i=1}^t C^{\pm}_i$ are $(1-\alpha)$ CSs for $V(\pi)$.

It remains to design a common bet for $K^+_t(v)$. Betting against any fixed $v_0$ will not work well when $V(\pi) = v_0$, since the optimal bet for $V(\pi)$ is 0 but such a bet cannot help us reject those $v$ that are far from $V(\pi)$. Therefore we propose to adaptively choose the bets against the smallest $v$ that has not been rejected. As we construct the CS, we have access to the values of $v$ that constitute the endpoints of the CS at the last time step. These values are on the cusp of plausibility given the available data and confidence level, which means the bets are neither too conservative nor too detached from what can be estimated.

### 4.4. Avoiding Grid Search

Once we have determined the $v$ to bet against for the current step, we could choose $\lambda$ via (6). For reasons that will become apparent shortly, we can also consider a more restrictive criterion

$$\lambda_t = \operatorname*{argmax}_{\lambda \in C}\; \psi\, \lambda^\top \Big(\sum_{i=1}^{t-1} A_i(v)\Big) \lambda + \lambda^\top \sum_{i=1}^{t-1} b_i(v), \tag{13}$$

where $C = \{\lambda : \lambda_2 \ge 0\} \cap \bigcap_{v \in [0,1]} D^{1/2}_v$, or more succinctly

$$C = \left\{\lambda : \lambda_2 \ge 0,\;\; \lambda_1 + \lambda_2 \le \tfrac{1}{2},\;\; \lambda_1(1 - w_{\max}) + \lambda_2 \le \tfrac{1}{2}\right\}.$$

The constraint $\lambda_2 \ge 0$ is expected for good bets in $K^+_t(v)$ (and by reduction in $K^-_t(v)$) since we are eliminating $v$'s with $\mathbb{E}[wr - v] > 0$. Since there are only three constraints and two variables, we can exactly solve (13) very efficiently. Our implementation first tries to return the unconstrained maximizer, if feasible. If not, we evaluate the objective on up to 6 candidates: up to one candidate per face of $C$ (obtained by maximizing the objective subject to one equality constraint) and its 3 vertices. Algorithm 1 summarizes this.

Algorithm 1: Solve $\lambda^* = \operatorname{argmax}_{\lambda \in C}\; \psi \lambda^\top A \lambda + \lambda^\top b$
  Input: A, b
  $\lambda \leftarrow -(2\psi A)^{-1} b$   {unconstrained maximizer}
  if $\lambda \in C$ then return $\lambda$
  $\Lambda \leftarrow \{[\tfrac{1}{2(1-w_{\max})}, 0]^\top,\; [\tfrac{1}{2}, 0]^\top,\; [0, \tfrac{1}{2}]^\top\}$   {vertices of C}
  for $(c, d) \in \{([0,1]^\top, 0),\; ([1,1]^\top, \tfrac{1}{2}),\; ([1-w_{\max}, 1]^\top, \tfrac{1}{2})\}$ do   {faces $c^\top\lambda = d$}
    $\mu \leftarrow -\dfrac{c^\top (2\psi A)^{-1} b + d}{c^\top (2\psi A)^{-1} c}$   {Lagrange multiplier}
    $\lambda \leftarrow -(2\psi A)^{-1}(b + \mu c)$
    if $\lambda \in C$ then $\Lambda \leftarrow \Lambda \cup \{\lambda\}$   {add feasible solutions on faces}
  end for
  return $\operatorname{argmax}_{\lambda \in \Lambda}\; \psi \lambda^\top A \lambda + \lambda^\top b$

Algorithm 2: MOPE — Martingale Off-Policy Evaluation
  Input: process $Z = (w_i, r_i)_{i=1}^{\infty}$, $w_{\max}$, $\alpha$
  Let $Z' = (w_i, 1 - r_i)$ for $(w_i, r_i)$ in $Z$
  for $(v_i, v'_i)$ in zip(LCS($Z$), LCS($Z'$)) do
    Output $(v_i, 1 - v'_i)$
  end for
  function LCS($Z$)
    $\lambda_1 = [0, 0]^\top$, $v = 0$
    for $i = 1, 2, \ldots$ do
      Observe $(w_i, r_i)$ from $Z$
      Update statistics via (7)–(11) and (18)–(22)
      if (14) has real roots then $v \leftarrow \max(v, \text{largest root of (14)})$
      yield $v$   {execution suspends/resumes here}
      $A = A^{(0)} + v A^{(1)} + v^2 A^{(2)}$,  $b = b^{(0)} + v\, b^{(1)}$
      $\lambda_{i+1} = \operatorname{argmax}_{\lambda \in C}\; \psi \lambda^\top A \lambda + b^\top \lambda$   {via Algorithm 1}
    end for
  end function

Given $\lambda_1, \ldots, \lambda_{t-1}$ from (13), Lemma 1 gives

$$\sum_{i=1}^{t-1} \ell^v_i(\lambda_i) \ge \psi \sum_{i=1}^{t-1} \lambda_i^\top A_i(v) \lambda_i + \sum_{i=1}^{t-1} \lambda_i^\top b_i(v)$$

for all $v \in [0, 1]$. Thus, if the lower bound exceeds $\ln(1/\alpha)$ for a particular $v$, the log wealth will also exceed it. Furthermore, the lower bound is quadratic in $v$, so we can easily find those values $v \in [0, 1]$ such that

$$\psi \sum_{i=1}^{t-1} \lambda_i^\top A_i(v) \lambda_i + \sum_{i=1}^{t-1} \lambda_i^\top b_i(v) = \ln\frac{2}{\alpha}. \tag{14}$$

The extra 2 is due to the hedged process. Appendix B explains this and the details of how to incrementally maintain statistics for solving (14) via eqs. (18)–(22).
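The following is a sketch of an Algorithm 1 style solver for the two-dimensional constrained quadratic in (13): it tries the unconstrained maximizer, then the three faces and three vertices of the triangle $C$, and returns the best feasible candidate. It assumes $w_{\max} > 1$; the function name, the feasibility tolerance, and the tiny ridge added for numerical stability are our own choices.

```python
import numpy as np

PSI = 2.0 - 4.0 * np.log(2.0)   # psi from Lemma 1 (negative, so the objective is concave)

def solve_bet(A, b, wmax, tol=1e-9):
    """argmax over lam in C of PSI * lam'A lam + lam'b, where
    C = {lam2 >= 0, lam1 + lam2 <= 1/2, lam1*(1 - wmax) + lam2 <= 1/2}."""
    A = np.asarray(A, float) + 1e-12 * np.eye(2)     # ridge keeps 2*PSI*A invertible early on
    b = np.asarray(b, float)
    faces = [(np.array([0.0, 1.0]), 0.0),            # lam2 = 0
             (np.array([1.0, 1.0]), 0.5),            # lam1 + lam2 = 1/2
             (np.array([1.0 - wmax, 1.0]), 0.5)]     # lam1*(1 - wmax) + lam2 = 1/2

    def feasible(lam):
        return (lam[1] >= -tol and lam[0] + lam[1] <= 0.5 + tol
                and lam[0] * (1.0 - wmax) + lam[1] <= 0.5 + tol)

    def objective(lam):
        return PSI * lam @ A @ lam + lam @ b

    M = 2.0 * PSI * A
    lam = -np.linalg.solve(M, b)                     # unconstrained maximizer
    if feasible(lam):
        return lam

    # Vertices of C plus the maximizer on each face (equality-constrained candidates).
    candidates = [np.array([0.5 / (1.0 - wmax), 0.0]),
                  np.array([0.5, 0.0]),
                  np.array([0.0, 0.5])]
    Minv_b = np.linalg.solve(M, b)
    for c, d in faces:
        Minv_c = np.linalg.solve(M, c)
        mu = -(c @ Minv_b + d) / (c @ Minv_c)
        cand = -np.linalg.solve(M, b + mu * c)
        if feasible(cand):
            candidates.append(cand)
    return max(candidates, key=objective)
```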
The advantage of (13) over (6) is that the latter cannot ensure that old bets will produce values in $D^{1/2}_v$ for future values of $v$, while the former always does because $C \subseteq D^{1/2}_v$ for all $v \in [0, 1]$. The whole process of updating the statistics, tightening the lower bound $v$ via (14), and computing the new bets via (13) is summarized in Algorithm 2.

Confidence Intervals. If one only desires a single CI using a fixed batch of data, then a CI can be formed by returning the last set from the CS on any permutation of the data. To reduce variance, we can average the wealth of several independent permutations without violating validity.

Alternative Betting Algorithms. An obvious question is why develop this strategy and not just feed the convex functions $\ell^v_i(\lambda)$ to an online learning algorithm? The Online Newton Step (ONS) (Hazan et al., 2007) is particularly well suited since $\ell^v_i(\lambda)$ is exp-concave. While ONS does not require storing all history and needs small per-step computation, we could not find a way to efficiently reason about $K_t(v)$ for every $v \in [0, 1]$. While ONS bounds the log wealth in terms of the gradients observed at each bet, these gradients depend on $v$ in a way that makes them hard to reuse efficiently for different values of $v$. Our approach, on the other hand, maintains a lower bound on the wealth as a second-degree polynomial in $v$, enabling us to reason about all values of $v$ in constant time.

## 5. Extensions

### 5.1. Adding a Reward Predictor

When considering just the task of Off-Policy Estimation, a well-known improvement over eq. (1) is the doubly robust estimator (Dudík et al., 2011; Jiang and Li, 2016). Suppose we have access to a reward predictor $q(x, a)$ mapping context and action to an estimated reward. If this estimated reward is well correlated with the actual reward, then we should expect that subtracting the zero-mean quantity

$$c_i = w_i q(x_i, a_i) - \sum_{a'} \pi(a'; x_i)\, q(x_i, a') \tag{15}$$

from $w_i r_i$ will not affect the mean but will reduce the variance of our estimator.
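As a small illustration of eq. (15), the sketch below computes the control variate $c_i$ for one logged interaction from tabular representations of $\pi$ and $q$; all names are ours and the inputs are toy values.

```python
import numpy as np

def control_variate(w_i, a_i, pi_x, q_x):
    """c_i = w_i * q(x_i, a_i) - sum_a' pi(a'; x_i) * q(x_i, a')   (eq. 15).

    pi_x[a] = pi(a; x_i) and q_x[a] = q(x_i, a) for the logged context x_i."""
    return w_i * q_x[a_i] - float(np.dot(pi_x, q_x))

if __name__ == "__main__":
    pi_x = np.array([0.1, 0.7, 0.2])    # target policy's action probabilities at x_i
    h_x  = np.array([0.3, 0.4, 0.3])    # behavior policy's action probabilities at x_i
    q_x  = np.array([0.2, 0.9, 0.5])    # reward predictions at x_i
    a_i  = 1                            # logged action
    w_i  = pi_x[a_i] / h_x[a_i]         # importance weight
    print(control_variate(w_i, a_i, pi_x, q_x))   # subtracted from w_i * r_i in K^q_t(v)
```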
We can follow the same rationale for defining our wealth process. Let

$$K^q_t(v) = \prod_{i=1}^t \big(1 + \lambda_{1,i}(w_i - 1) + \lambda_{2,i}(w_i r_i - c_i - v)\big)$$

for predictable sequences $(\lambda_{1,i}, \lambda_{2,i}) \in E^0_v$, where

$$E^m_v = \big\{(\lambda_1, \lambda_2) : 1 + \lambda_1(w - 1) + \lambda_2(wr - c - v) \ge m \;\; \forall (x, a, r, q) \in \operatorname{supp}(D) \times \mathcal{A} \times \{0, 1\} \times [0, 1]^{|\mathcal{A}|}\big\}.$$

Note that $w = w(x, a)$ and $c = c(x, a, q)$, so all quantities are well defined. This set looks daunting, but without loss of generality it suffices to only consider two actions: $a$, which is sampled by $h$, and an alternative one $a'$, with $h(a) \in \{1/w_{\max}, 1\}$ and $\pi(a), q(x, a), q(x, a') \in \{0, 1\}$. Considering all these combinations and removing redundant constraints leads to the equivalent description for $E^m_v$ as

$$\left\{\lambda : \begin{bmatrix} -1 & -1 & W & W \\ -v & \bar{v} & W + \bar{v} & -W - v \end{bmatrix}^\top \lambda \ge (m-1)\mathbf{1} \right\}, \tag{16}$$

where $W = w_{\max} - 1$ and $\bar{v} = 1 - v$. For an efficient procedure we introduce the set $C_q = \bigcap_{v \in [0,1]} E^{1/2}_v$ to enable the use of our lower bound and common bets for all $v$. This set can be shown to be the same as (16) but with $v = 1$ and $\bar{v} = 1$. For predictable sequences of bets $\lambda^{+q}_t, \lambda^{-q}_t \in C_q$ define the processes

$$K^{+q}_t(v) = \prod_{i=1}^t \big(1 + \lambda^{+q}_{1,i}(w_i - 1) + \lambda^{+q}_{2,i}(w_i r_i - c_i - v)\big),$$
$$K^{-q}_t(v) = \prod_{i=1}^t \big(1 + \lambda^{-q}_{1,i}(w_i - 1) + \lambda^{-q}_{2,i}(w_i r'_i - c'_i - v')\big),$$

where $r'_i = 1 - r_i$ and $v' = 1 - v$. For the definition of $c'_i$ we reason as follows: if $q(x, a)$ is a good reward predictor for $r_i$, then $q'(x, a) = 1 - q(x, a)$ is a good reward predictor for $r'_i$. Plugging $q'(x, a)$ in place of $q(x, a)$ in (15) leads to $c'_i = w_i - 1 - c_i$. Finally, the hedged process is just $K^{\pm q}_t(v) = \tfrac{1}{2}\big(K^{+q}_t(v) + K^{-q}_t(v)\big) and we have

Theorem 3. The sequences $C^q_t = \{v : K^q_t(v) < 1/\alpha\}$ and $C^{\pm q}_t = \{v : K^{\pm q}_t(v) < 1/\alpha\}$, as well as their running intersections $\bigcap_{i=1}^t C^q_i$ and $\bigcap_{i=1}^t C^{\pm q}_i$, are $(1-\alpha)$ CSs for $V(\pi)$.

Appendix D contains the details on how to bet. We close this section with two remarks. First, a bad $q(x, a)$ can make $wr - c$ have larger variance than $wr$. To protect against this case we can run a doubly hedged process, $K^{\pm 2}_t(v) = \tfrac{1}{2}\big(K^{\pm q}_t(v) + K^{\pm}_t(v)\big)$, which will accrue wealth almost as well as the best of its two components. Second, our framework allows for $q(x, a)$ to be updated in every step as long as the updates are predictable.

### 5.2. Scalar Betting

Since $\mathbb{E}[w] = 1$, it would seem that the $\lambda_1$ bet cannot have any long-term benefits. While this will be shown to be false in our experiments, we nevertheless develop a betting strategy that only bets on $w_i r_i - v$. The advantages of this strategy are computational and conceptual simplicity. Similarly to Section 4.3 we use a hedged process $K^{\gtrless}_t(v) = \tfrac{1}{2}\big(K^>_t(v) + K^<_t(v)\big)$ where

$$K^>_t(v) = \prod_{i=1}^t \big(1 + \lambda^>_{2,i}(w_i r_i - v)\big), \qquad K^<_t(v) = \prod_{i=1}^t \big(1 + \lambda^<_{2,i}(w_i(1 - r_i) - (1 - v))\big).$$

The definition of $K^<_t(v)$ is similar to that of $K^-_t(v)$, and in Appendix C.1 we provide an alternative justification for using $1 - r$ and $1 - v$ via a worst-case argument. The upshot is that if we start from our original proposal for $K_t(v)$ and try to eliminate $\lambda_1$ to ensure a worst-case wealth, we end up with $\lambda_1 = \max(0, -\lambda_2)$. Plugging this $\lambda_1$ into $K_t(v)$ leads to the above processes.

We explain betting for $K^>_t(v)$, since betting for $K^<_t(v)$ reduces to that. We use a result by Fan et al. (2015): $\ln(1 + \lambda\xi) \ge \lambda\xi + (\ln(1 - \lambda) + \lambda)\,\xi^2$ for all $\xi \ge -1$ and $\lambda \in [0, 1)$, which we reproduce in Appendix C.2. We apply it in our case with $\xi_i = w_i r_i - v \ge -1$ and consider, for a fixed $\lambda_2$, the lower bound on the log wealth accumulated over the first $t-1$ samples:

$$\ln\big(K^>_{t-1}(v)\big) \ge \lambda_2 \sum_{i=1}^{t-1} \xi_i + \big(\ln(1 - \lambda_2) + \lambda_2\big) \sum_{i=1}^{t-1} \xi_i^2.$$

When $\sum_{i=1}^{t-1} \xi_i^2 > 0$ the lower bound is concave and can be maximized in $\lambda_2$ by setting its derivative to 0. This gives

$$\lambda^>_{2,t}(v) = \frac{\sum_{i=1}^{t-1}(w_i r_i - v)}{\sum_{i=1}^{t-1}(w_i r_i - v) + \sum_{i=1}^{t-1}(w_i r_i - v)^2}.$$

When $\sum_{i=1}^{t-1} \xi_i^2 = 0$ we can set $\lambda^>_{2,t}(v) = 0$. Finally, employing the same ideas as Section 4.4, we can adaptively choose the $v$ to bet against and avoid maintaining a grid of values for $v$. Details are in Appendix C.3.

### 5.3. Gated Deployment

A common OPE use case is to estimate the difference $V(\pi) - V(h)$. If we can reject all negative values (i.e., the lower CS crosses 0) then $\pi$ should be deployed. Conversely, rejecting all positive values (i.e., the upper CS crosses 0) means $\pi$ should be discarded. Since $h$ is the policy collecting the data we have $V(h) = \mathbb{E}[r]$. Thus we can form a CS around $V(\pi) - V(h)$ by considering the process

$$K^{\mathrm{gd}}_t(v) = \prod_{i=1}^t \big(1 + \lambda_{1,i}(w_i - 1) + \lambda_{2,i}(w_i r_i - r_i - v)\big)$$

for predictable $\lambda_{1,i}, \lambda_{2,i}$ subject to $(\lambda_{1,i}, \lambda_{2,i}) \in G^0_v$, where

$$G^m_v = \big\{(\lambda_1, \lambda_2) : 1 + \lambda_1(w - 1) + \lambda_2(wr - r - v) \ge m \;\; \forall (w, r) \in \{0, w_{\max}\} \times \{0, 1\}\big\}. \tag{17}$$

As before, we can form a hedged process and restrict bets to a set that enables the use of our lower bound. We defer these details to Appendix E. We can then show

Theorem 4. The sequences $C^{\mathrm{gd}}_t = \{v : K^{\mathrm{gd}}_t(v) < 1/\alpha\}$ and $\bigcap_{i=1}^t C^{\mathrm{gd}}_i$ are $(1-\alpha)$ CSs for $V(\pi) - V(h)$.

This CS has two advantages over a classical A/B test. First, we don't have to choose a stopping time in advance: the CS can run for as little or as long as necessary. Second, if $\pi$ were worse than $h$, the A/B test would have an adverse effect on the quality of the overall system, while here we can reason about this degradation without deploying $\pi$.
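Below is a sketch of the scalar betting rule of Section 5.2 for a single process $K^>_t(v)$ at a fixed $v$: it maintains the two running sums in the bet formula and the wealth, and flags $v$ as rejected once the wealth reaches the Ville threshold. Setting the bet to zero when the running sum is non-positive and the `threshold_scale` argument are our own choices; the paper's full procedure additionally hedges with $K^<_t(v)$ and adapts the $v$ being bet against.

```python
class ScalarBettor:
    """Wealth process K^>_t(v) with the Fan et al. (2015) based bet
    lambda = S / (S + Q), where S = sum(xi), Q = sum(xi^2), xi = w*r - v."""

    def __init__(self, v, alpha=0.05, threshold_scale=2.0):
        self.v = v
        self.threshold = threshold_scale / alpha   # 2/alpha if this is half of a hedged process
        self.S = 0.0        # sum of xi over past samples
        self.Q = 0.0        # sum of xi^2 over past samples
        self.wealth = 1.0
        self.rejected = False

    def update(self, w, r):
        xi = w * r - self.v
        # Bet is computed from *past* data only, so it is predictable; it lies in [0, 1).
        lam = self.S / (self.S + self.Q) if (self.S > 0.0 and self.Q > 0.0) else 0.0
        self.wealth *= 1.0 + lam * xi
        if self.wealth >= self.threshold:
            self.rejected = True                   # v is excluded from the CS from now on
        self.S += xi
        self.Q += xi * xi
        return self.rejected
```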
## 6. Related Work

Behind our CSs there is always a concentration inequality for martingales. Such inequalities are in a sense dual to regret guarantees in online learning: every online learning regret bound gives rise to a concentration inequality for martingales and vice versa (Rakhlin and Sridharan, 2017). At the same time, minimizing regret in online learning can be achieved by betting algorithms (Cutkosky and Orabona, 2018; Orabona and Pál, 2016). These two ideas together imply that betting algorithms can also lead to CSs via their regret bounds. In particular, Jun and Orabona (2019) develop regret bounds via a betting view and derive interesting concentration inequalities in the spirit of the law of the iterated logarithm. Our approach here, as well as that of Waudby-Smith and Ramdas (2020), is related and uses similar techniques. However, there is a key difference: we do not rely on a regret bound. In other words, our CSs are driven by the actual regret instead of the regret bound.

Apart from IPS, other popular OPE estimators include Doubly Robust (Robins and Rotnitzky, 1995; Dudík et al., 2011), which incorporates (2) as an additive control variate, and SNIPS (Swaminathan and Joachims, 2015), which incorporates $\mathbb{E}[w] = 1$ as a multiplicative control variate. The quest to balance the bias-variance tradeoff in OPE has led to many different proposals (Wang et al., 2017; Vlassis et al., 2019). EL-based estimators are proposed in Kallus and Uehara (2019) and Karampatziakis et al. (2020). CIs for OPE include both finite-sample (Thomas et al., 2015) and asymptotic (Li et al., 2015; Karampatziakis et al., 2020) ones. Some works that propose both types are Bottou et al. (2013) and Dai et al. (2020). The latter obtains CIs without knowledge of $w$, a much more challenging scenario that requires additional assumptions. We are not aware of any CSs for OPE. For on-policy setups, the most competitive CSs all rely on exploiting (super)martingales, and in some sense all admissible CSs have to (Ramdas et al., 2020). Examples include Robbins' mixture martingale (Robbins, 1970) and the techniques of Howard et al. (2020). The recent work of Waudby-Smith and Ramdas (2020) substantially increases the scope of these techniques, while simplifying and tightening the constructions.

## 7. Experiments

Code to reproduce all experiment results is available at https://github.com/n17s/mope

### 7.1. Coverage

While any predictable betting sequence guarantees correct coverage, some will overcover more than others. Here we investigate the coverage properties of MOPE and the strategy of Section 5.2. We generate 1000 sequences of 100000 (w, r) pairs, each sequence from a different distribution. All distributions are maximum entropy distributions subject to $(w, r) \in \{0, 0.5, 2, 100\} \times \{0, 1\}$, $\mathbb{E}[w] = 1$, $\mathbb{E}[w^2] = 10$, and $V(\pi)$ sampled uniformly in [0, 1]. In Figure 1 we show the empirical mean coverage of the two CSs for $\alpha = 0.05$. MOPE approaches nominal coverage from above, a property rarely seen with standard confidence bounds.

Figure 1. Empirical coverage for the two proposed CSs. The CS that bets on both $w - 1$ and $wr - v$ converges to nominal coverage, while the CS that does not bet on $w - 1$ overcovers.
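For the coverage setup above, a maximum entropy distribution over the finite support subject to the stated moment constraints can be computed by solving the dual (exponential-family) problem. The sketch below is one way to do this, assuming an interior target value for $V(\pi) = \mathbb{E}[wr]$; the function names and the use of `scipy.optimize.minimize` are our own choices, not the paper's code, and the very different feature scales may warrant rescaling in practice.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def maxent_distribution(targets=(1.0, 10.0, 0.3)):
    """Maximum entropy p over (w, r) in {0, 0.5, 2, 100} x {0, 1} with
    E[w], E[w^2], E[w r] matching `targets`. Max-ent solutions have the form
    p ~ exp(theta . f(w, r)) with features f = (w, w^2, w*r); we fit theta by
    minimizing the convex dual: log Z(theta) - theta . targets."""
    support = np.array(list(itertools.product([0.0, 0.5, 2.0, 100.0], [0.0, 1.0])))
    feats = np.stack([support[:, 0], support[:, 0] ** 2,
                      support[:, 0] * support[:, 1]], axis=1)
    m = np.asarray(targets, float)

    def dual(theta):
        scores = feats @ theta
        top = scores.max()                       # log-sum-exp stabilization
        return top + np.log(np.exp(scores - top).sum()) - theta @ m

    res = minimize(dual, x0=np.zeros(3), method="BFGS")
    scores = feats @ res.x
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return support, p

if __name__ == "__main__":
    support, p = maxent_distribution()
    print("E[w]   =", p @ support[:, 0])
    print("E[w^2] =", p @ support[:, 0] ** 2)
    print("V(pi)  = E[wr] =", p @ (support[:, 0] * support[:, 1]))
```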
### 7.2. Computational vs. Statistical Efficiency

We run an ablation study for the three ingredients of MOPE: −Vector uses the scalar betting technique of Section 5.2; −Common solves (6) over a grid of 200 values of $v$ at each timestep instead of using common bets; and −Bound optimizes the log wealth exactly rather than the bound of Lemma 1, i.e., Algorithm 2 with equation (5) in lieu of equation (13). We use four synthetic environments, which are distributions over (w, r) generated in the same way as Section 7.1 but with $(V(\pi), \mathbb{E}[w^2]) \in \{0.05, 0.5\} \times \{10, 50\}$.

Table 1 shows the running times for each method in the environment with the largest variance. We see that directly maximizing wealth and individual betting per $v$ are very slow. MOPE and −Vector are computationally efficient.

Table 1. Timings for MOPE and its ablations on 500000 samples.

| Method | MOPE | −Vector | −Common | −Bound |
|---|---|---|---|---|
| Time (sec) | 32 | 14.5 | 10440 | 15882 |

In Figure 2 we show the average CS width over 10 repetitions for 500000 time steps for MOPE and its ablations, as well as the asymptotic CI from Karampatziakis et al. (2020), which is only valid pointwise and whose width provides a lower bound for all CSs in the figure. MOPE is better than −Vector and as good as or better than −Bound. This may look surprising, but there is a simple explanation. In particular, using the bound leads to underestimating the optimal bet size, while using the empirical wealth so far can lead to overestimating the optimal bet size. Moreover, there is an asymmetry between overestimating and underestimating the optimal bet. Under the approximation $\mathbb{E}[\ln(1 + \lambda X)] \approx \lambda \mathbb{E}[X] - \lambda^2 \mathbb{E}[X^2]/2$ (valid for small $\lambda X$), the optimal bet is $\lambda^* = \mathbb{E}[X]/\mathbb{E}[X^2]$. Underestimating with $\lambda = \lambda^*/2$ leads to an expected log wealth of 75% of the optimal, while overestimating with $\lambda = 2\lambda^*$ leads to an expected log wealth of 0. We believe this asymmetry is why the bound is sometimes better than empirical wealth maximization. Finally, while MOPE is not as tight as the (much more computationally demanding) −Common, the gap is small in all but the most challenging environment.

Figure 2. The width of 95% CSs produced by MOPE and its three ablations. The pointwise asymptotic curve is not a CS.

### 7.3. Effect of a Reward Predictor

We now investigate the use of reward predictors in our CSs using the processes $K^{\pm q}_t(v)$ and $K^{\pm 2}_t(v)$ of Section 5.1. We use the first 1 million samples from the mnist8m dataset, which has 10 classes, and train the following functions: $h$ using linear multinomial logistic regression (MLR); $\pi$ again using MLR but now on 1000 random Fourier features (RFF) (Rahimi and Recht, 2007) that approximate a Gaussian kernel machine; and finally $q$, which uses the same RFF representation as $\pi$ but whose $i$-th output is independently trained to predict whether the input is the $i$-th class using 10 binary logistic regressions. We used the rest of the data with the following protocol: for each input/label pair $(x_i, y_i)$, we sample action $a_i$ with probability $0.9\, h(a_i; x_i) + 0.01$ (so that we can safely set $w_{\max} = 100$), we set $r_i = 1$ if $a_i = y_i$, otherwise $r_i = 0$, and record $w_i$ and $c_i$. We estimated $V(\pi) \approx 0.9385$ using the next million samples.

In Figure 3 we show the CS for $V(\pi)$ averaged over 5 runs, each with 10000 different samples, using the processes $K^{\pm q}_t(v)$, $K^{\pm}_t(v)$ and $K^{\pm 2}_t(v)$. We see that including a reward predictor dramatically improves the lower bound and somewhat hurts the upper bound. The doubly hedged process, on the other hand, attains the best of both worlds.

Figure 3. Three 99.9% CSs with/without a reward predictor, and a doubly hedged one that achieves the best of both.
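The logging protocol of Section 7.3 can be written in a few lines; the sketch below is a schematic of that protocol with hypothetical `h_probs`, `pi_probs`, and `q_preds` callables standing in for the trained models (the paper's actual training pipeline is not reproduced here).

```python
import numpy as np

def log_interaction(x, y, h_probs, pi_probs, q_preds, rng, eps=0.01, keep=0.9):
    """One logged sample of the Section 7.3 protocol: sample a from the smoothed
    behavior policy keep*h + eps (uniform over 10 classes), reward is label
    correctness, and return (w, r, c) with the control variate c of eq. (15)."""
    h = keep * h_probs(x) + eps          # smoothed behavior probabilities; min prob 0.01 -> wmax = 100
    pi = pi_probs(x)                     # target policy probabilities
    q = q_preds(x)                       # per-action reward predictions in [0, 1]
    a = rng.choice(len(h), p=h / h.sum())
    r = 1.0 if a == y else 0.0
    w = pi[a] / h[a]
    c = w * q[a] - float(pi @ q)         # eq. (15)
    return w, r, c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for the trained models on a 10-class problem.
    h_probs = lambda x: np.full(10, 0.1)
    pi_probs = lambda x: np.eye(10)[3]
    q_preds = lambda x: np.full(10, 0.5)
    print(log_interaction(x=None, y=3, h_probs=h_probs, pi_probs=pi_probs,
                          q_preds=q_preds, rng=rng))
```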
### 7.4. CSs for Gated Deployment

Here we investigate the use of CSs for gated deployment. We use the same $h$ and $\pi$ and the same data as in Section 7.3, but now we are using the process $K^{\mathrm{gd}}_t(v)$ (or rather a computationally efficient version of this process based on a hedged process with common bets and optimizing a quadratic lower bound, cf. Appendix E). Figure 4 shows the average CS over 5 runs, each with 10000 different samples. We see that the CS contains the true difference (about 0.17) and quickly decides that $\pi$ is better than $h$, at t = 657 samples. We also include an on-policy CS (from Waudby-Smith and Ramdas (2020)) which can only be computed if $\pi$ is deployed, e.g., in an A/B test. While this is riskier when $\pi$ is inferior to $h$, the on-policy rewards typically have lower variance. Thus the on-policy CS can conclude that $\pi$ is better than $h$ at the same $\alpha = 0.01$ using 440 samples (220 for each policy). If the roles of $\pi$ and $h$ were swapped, i.e., $\pi$ was the behavior policy and $h$ was a proposed alternative, the on-policy CS would still need to collect 220 samples from $h$. In contrast, a system using the off-policy CS would never have to experience any regret when the behavior policy is superior. Finally, during the initial phase of the experiment we observe that the off-policy CS is tighter than the on-policy CS. This is because the on-policy CS is splitting the data into two disjoint sets (one for $h$ and one for $\pi$) while the off-policy CS is reusing the data for both $h$ and $\pi$.

Figure 4. CS for gated deployment and A/B test. $\pi$ can be deployed as soon as the lower CS crosses 0 (dotted line at t = 657).

## 8. Conclusions

We presented a generic way to construct confidence sequences for OPE in the Contextual Bandit setting. The construction leaves a lot of freedom in designing betting strategies, and we mostly explored options with an eye towards computational efficiency. Theoretically, we achieve finite-sample coverage and validity at any time with minimal assumptions. Empirically, the resulting sequences are tight and not too far away from existing asymptotic, pointwise-valid work. Theoretical results on the width of our CSs remain elusive; they are both an interesting area for future work and a key to unlocking much stronger analyses of various algorithms in Bandits and RL.

## References

John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 817–824. Citeseer, 2007.

Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.

Nathan Kallus and Masatoshi Uehara. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. arXiv preprint arXiv:1906.03735, 2019.

Art B. Owen. Empirical likelihood. Chapman and Hall/CRC, 2001.

Nikos Karampatziakis, John Langford, and Paul Mineiro. Empirical likelihood for contextual bandits. Advances in Neural Information Processing Systems, 33, 2020.

Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvari, and Dale Schuurmans. CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems, 33, 2020.

Ian Waudby-Smith and Aaditya Ramdas. Variance-adaptive confidence sequences by betting. arXiv:2010.09686 [math, stat], October 2020. URL http://arxiv.org/abs/2010.09686v1.

Steven De Rooij, Tim Van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, hedge if you must. The Journal of Machine Learning Research, 15(1):1281–1316, 2014.
Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.

Xiequan Fan, Ion Grama, and Quansheng Liu. Exponential inequalities for martingales with applications. Electronic Journal of Probability, 20, 2015.

Alexander Rakhlin and Karthik Sridharan. On equivalence of martingale tail bounds and deterministic regret inequalities. In Conference on Learning Theory, pages 1704–1722. PMLR, 2017.

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Conference on Learning Theory, pages 1493–1529. PMLR, 2018.

Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In NIPS, pages 577–585, 2016.

Kwang-Sung Jun and Francesco Orabona. Parameter-free online convex optimization with sub-exponential noise. In Conference on Learning Theory, pages 1802–1823. PMLR, 2019.

James M. Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.

Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015.

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3589–3597, 2017. URL http://proceedings.mlr.press/v70/wang17a.html.

Nikos Vlassis, Aurelien Bibaut, Maria Dimakopoulou, and Tony Jebara. On the design of estimators for bandit off-policy evaluation. In International Conference on Machine Learning, pages 6468–6476, 2019.

Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web, pages 929–934. ACM, 2015.

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Aaditya Ramdas, Johannes Ruf, Martin Larsson, and Wouter Koolen. Admissible anytime-valid sequential inference must rely on nonnegative martingales. arXiv preprint arXiv:2009.03167, 2020.

Herbert Robbins. Statistical methods related to the law of the iterated logarithm. The Annals of Mathematical Statistics, 41(5):1397–1409, October 1970.

Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, forthcoming, 2020.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, volume 3, page 5. Citeseer, 2007.

Jean Ville. Étude critique de la notion de collectif. Bull. Amer. Math. Soc., 45(11):824, 1939.