# Empirical Likelihood for Contextual Bandits

Nikos Karampatziakis (Microsoft Dynamics 365 AI, nikosk@microsoft.com), John Langford (Microsoft Research, jcl@microsoft.com), Paul Mineiro (Microsoft Research, pmineiro@microsoft.com)

Abstract: We propose an estimator and confidence interval for computing the value of a policy from off-policy data in the contextual bandit setting. To this end we apply empirical likelihood techniques to formulate our estimator and confidence interval as simple convex optimization problems. Using the lower bound of our confidence interval, we then propose an off-policy policy optimization algorithm that searches for policies with a large reward lower bound. We empirically find that both our estimator and confidence interval improve over previous proposals in finite sample regimes. Finally, the policy optimization algorithm we propose outperforms a strong baseline system for learning from off-policy data.

1 Introduction

Contextual bandits [3, 17] are now in widespread practical use [19, 7, 25]. Key to their success is the ability to do off-policy or counterfactual estimation [12] of the value of any policy, enabling sound train/test regimes similar to supervised learning. However, off-policy evaluation requires more data than supervised learning to produce estimates of the same accuracy. This is because off-policy data needs to be importance-weighted, and accurate estimation for importance-weighted data is still an active research area.

How can we find a tight confidence interval (CI) on counterfactual estimates? And since tight CIs are deeply dependent on the form of their estimate, how can we find a tight estimate? And given what we discover, how can we leverage this for improved learning algorithms?

We discover good answers to these questions through the application of empirical likelihood [24], a nonparametric maximum likelihood approach that treats the sample as a realization from a multinomial distribution with an infinite number of categories. Like a likelihood method, empirical likelihood (EL) adapts to the difficulty of the problem in an automatic way and results in efficient estimators. Unlike parametric likelihood methods, we do not need to make any parametric assumptions about the data generating process. We do assume that the expected importance weight is 1, a nonparametric moment condition that is supposed to hold for correctly collected off-policy data. Finally, EL-based estimators and confidence intervals can be computed by efficient algorithms that solve low dimensional convex optimization problems.

Figure 1 shows a preview of our results. In section 4.2 we introduce our estimator. The estimator is computationally tractable, requiring only a bisection search over a single scalar, and has provably low bias (see Theorem 1); in section 5.1 we experimentally demonstrate performance exceeding that of popular alternatives. The estimator leads to an asymptotically exact confidence interval for off-policy estimation, which we describe in section 4.3. Other CIs are either narrow but fail to guarantee prescribed coverage, or guarantee prescribed coverage but are too wide to be useful. Our interval is narrow and (despite having only an asymptotic guarantee) empirically approaches nominal coverage from above, as in Figure 1 and Table 3. Finally, in section 4.5, we use our CI to construct a robust counterfactual learning objective. We experiment with this in section 5.3 and empirically outperform a strong baseline.
We now highlight several innovations in our approach:

- We use a nonparametric likelihood approach. This maintains [15] some of the asymptotic optimality results known for likelihood in the multinomial (hence well-specified) case [11].
- We prove a finite sample result on the bias of our estimator. This also implies our estimator is asymptotically consistent.
- Our CI considers a large set of plausible worlds (alternative hypotheses) from which the observed off-policy data could have come. One implication (cf. section 4.4) is that for binary rewards the CI lower bound will be < 1 (and > 0) even if all observed rewards are 1.
- We show how to compute the confidence interval directly, saving a factor of $\log(1/\epsilon)$ in time complexity compared to standard implementations of EL for general settings.
- We propose a learning objective that searches for a policy with the best lower bound on its reward, and draw connections with distributionally robust optimization.

Figure 1: A comparison of confidence intervals on contextual bandit data. (Two panels: average CI width, left, and empirical coverage, right, both plotted against $n / w_{\max}$ for the EL, Gaussian, and Binomial intervals.) The EL confidence interval is dramatically tighter than an approach based on a binomial confidence interval while avoiding the chronic undercoverage of the asymptotic Gaussian confidence interval. In some regimes, the asymptotic Gaussian CI both undercovers and has greater average width. This is possible as the EL CI has a different functional form than a multiplier on the Gaussian CI. On the left, the shaded area represents 90% of the empirical distribution, indicating the EL CI width varies less over realizations. On the right, the shaded area represents 4 times the standard error of the mean, indicating coverage differences are everywhere statistically significant.

2 Related Work

There are many off-policy estimators for contextual bandits. The "Inverse Propensity Score" (IPS) [12] is unbiased, but has high variance. The Self-Normalized IPS (SNIPS) [30] estimator trades off some bias for better mean squared error (MSE). Our estimator has bias of the same order as SNIPS and empirically better MSE. The EMP estimator of [14] also uses EL techniques, and we will explain the differences in detail in section 4.2. Critically, it would be challenging to use EMP to construct a CI with correct coverage for small samples, as we will explain in section 4.4. An orthogonal way to reduce variance is to incorporate a reward estimator as in the doubly robust (DR) estimator and associated variants [27, 9, 33, 32]. The estimator presented here is a natural alternative to IPS and SNIPS and can naturally replace the IPS part of a doubly robust estimator.

There is less work on off-policy CIs for contextual bandits. A simple baseline randomly rounds the rewards to {0, 1} and the importance weights to 0 or the largest possible weight value, and applies a Binomial confidence interval. Another simple, asymptotically motivated approach, previously applied to contextual bandits [18], is via a Gaussian approximation. The EL confidence intervals are also asymptotically motivated, but empirically approach nominal coverage from above and are much tighter than the Binomial confidence interval. In [5] empirical Bernstein bounds or Gaussian approximations are combined with clipping of large importance weights to trade bias for variance. This requires hyperparameter tuning, whereas EL provides parameter-free CIs.
Similar ideas to ours have been used for the upper confidence bound in the empirical KL-UCB algorithm [6], an on-policy algorithm for multi-armed bandits. As detailed in section 4.4, both constructions need to consider some events that may not be in the data. While this happens without explicit data augmentation, it is analogous to the use of explicitly augmented MDPs for off-policy estimation in Markov Decision Processes [20].

Learning algorithms for contextual bandits include theoretical [3, 17], reduction-oriented [9], optimization-based [29], and Bayesian [21] algorithms. A recent paper about empirical contextual bandit learning [4] informs our experiments. Ideas from empirical likelihood have previously been applied to robust supervised learning [8]. Our combination of CIs with learning is a contextual bandit analogue of robust supervised learning. Regularizing counterfactual learning via lower-bound optimization has been previously considered, e.g., based upon empirical Bernstein bounds [29] or divergence-based trust regions grounded in lower bounds from conservative policy iteration [28, 13].

3 Notation and Warm-up

We consider the off-policy contextual bandit problem, with contexts $x \in X$, a finite set of actions $A$, and bounded real rewards $r : A \to [0, 1]$. The environment generates i.i.d. context-reward pairs $(x, r) \sim D$ and first reveals $x$. Then an action $a \in A$ is sampled and the reward $r(a)$ is revealed. Let $\pi$ be the policy whose value we want to estimate. For off-policy estimation we assume a dataset $\{(x_n, a_n, p_n, r_n)\}_{n=1}^N$, generated from an arbitrary sequence of historical stochastic policies $h_n$, with $p_n \doteq h_n(a_n \mid x_n)$ and $r_n \doteq r_n(a_n)$. Let $w(a) \doteq \pi(a \mid x) / h(a \mid x)$ be a random variable denoting the density ratio between $\pi$ and $h$, and $w_n \doteq \pi(a_n \mid x_n) / h_n(a_n \mid x_n)$ its realization. We assume $\pi \ll h_n$ (absolute continuity), and that $w \in [w_{\min}, w_{\max}]$. The value of $\pi$ is defined as $V(\pi) = \mathbb{E}_{(x,r) \sim D,\, a \sim \pi(\cdot \mid x)}[r(a)]$. Since we don't have data from $\pi$, but from $h_n$, we use importance weighting to write $V(\pi) = \mathbb{E}_{(x,r) \sim D,\, a \sim h(\cdot \mid x)}[w(a)\, r(a)]$; note that $\mathbb{E}_{a \sim h(\cdot \mid x)}[w(a)] = \sum_a h(a \mid x)\, \pi(a \mid x) / h(a \mid x) = 1$, which is the moment condition used throughout. The inverse propensity score (IPS) estimator is a direct implementation of this: $\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{N} \sum_{n=1}^N w_n r_n$. We can do better by observing that each policy $h_n$ is created using data before time $n$. Formally, let $\{\mathcal{F}_n\}$ be the filtration generated by $\{(x_k, a_k, p_k, r_k)\}_{k \ge 1}$. In the warm-up case where $(w, r)$ has finite support, the data can be modeled as draws from a multinomial distribution $Q$ over $(w, r)$ pairs; distributions consistent with correctly logged data have to satisfy $\mathbb{E}_Q[w] = 1$, but the likelihood is not sensitive to the value of $r$.

4 Off-Policy Estimation and Confidence Interval

We first review how empirical likelihood extends the above results, then present our results.

4.1 Empirical Likelihood

So far, we assumed that the random vector $(w, r)$ has finite support and that data is i.i.d. Empirical likelihood [24] allows us to transfer the above results to settings where the support is infinite. The seminal work [23] showed that the finite support assumption is immaterial. This was later extended [26] to prove that estimating equations such as $\mathbb{E}[w] = 1$ could be incorporated into the estimation of $\mathbb{E}[wr]$. As long as $\mathrm{Cov}(w, wr) \ne 0$, Corollary 5 of [26] (also Theorem 3.5 of [24]) implies that

$$-2\Big(L(V(\pi)) - \sum_n \log Q^{\mathrm{mle}}_{w_n, r_n}\Big) \to \chi^2_{(1)} \qquad (1)$$

in distribution as $n \to \infty$, without assuming finite support for $(w, r)$. Here $L(v)$ denotes the maximum log-likelihood over distributions $Q$ satisfying $\mathbb{E}_Q[w] = 1$ and $\mathbb{E}_Q[wr] = v$, and $Q^{\mathrm{mle}}$ is the maximizer when only $\mathbb{E}_Q[w] = 1$ is enforced. Asymptotic optimality results for empirical likelihood are established in [15], but require different proof techniques from the multinomial case [11].

We now turn to the i.i.d. assumption. In many practical setups data may have been collected from various logging policies, which makes the $w$'s non-i.i.d. Existing estimators, such as IPS, have no trouble handling such data.
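For concreteness, here is a minimal NumPy sketch of the IPS and SNIPS baselines on logged data; the function names and toy values are ours, for illustration only.

```python
import numpy as np

def ips(w, r):
    # V_IPS = (1/N) * sum_n w_n * r_n
    w, r = np.asarray(w, dtype=float), np.asarray(r, dtype=float)
    return np.mean(w * r)

def snips(w, r):
    # Self-normalized IPS: sum_n w_n r_n / sum_n w_n
    w, r = np.asarray(w, dtype=float), np.asarray(r, dtype=float)
    return np.sum(w * r) / np.sum(w)

# Toy logged data: w_n = pi(a_n|x_n) / h_n(a_n|x_n), r_n = r_n(a_n) in [0, 1].
w = np.array([0.2, 1.5, 0.7, 3.1, 0.5])
r = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(ips(w, r), snips(w, r))
# The moment condition E[w] = 1 suggests np.mean(w) should concentrate near 1
# for correctly collected off-policy data.
print(np.mean(w))
```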
A key insight is that all the information about the problem is captured in the martingale estimating equation $m_n(V(\pi)) = 0$, where $m_n(v) \doteq \sum_{k \le n} \big(w_k r_k - v,\; w_k - 1\big)$ has conditionally mean-zero increments when $v = V(\pi)$. The extension of empirical likelihood to martingales is given by Dual Likelihood [22]. The reason for the name is that the functional of interest is the convex dual of the empirical likelihood formulation subject to the martingale estimating equation of interest. In our case, we use dual variables $\tau$ and $\beta$ that correspond to the first and second component of $m_n(v) = 0$ respectively. As derived in appendix A we get the dual likelihood

$$l_v(\beta, \tau) = \sum_n \log\big(1 + \beta(w_n - 1) + \tau(w_n r_n - v)\big). \qquad (2)$$

That derivation also reveals the constraint set associated with a feasible primal solution,

$$C = \big\{ (\beta, \tau) \;\big|\; \forall w, r:\ 1 + \beta(w - 1) + \tau(wr - v) \ge 0 \big\}. \qquad (3)$$

Despite the domains of $w$ and $r$ being potentially infinite, we can express $C$ using only 4 constraints as $C = \big\{ (\beta, \tau) \;\big|\; \forall w \in \{w_{\min}, w_{\max}\},\ r \in \{0, 1\}:\ 1 + \beta(w - 1) + \tau(wr - v) \ge 0 \big\}$. This is also the convex dual of (1), as i.i.d. and finite support data are just special cases of this framework. However, $L(v)$ and the corresponding $Q$ do not have a generative interpretation when the $w$'s are not i.i.d. Nevertheless, under very mild conditions [22] the maximum of eq. (2) with $v = V(\pi)$ still has an asymptotic distribution that obeys a nonparametric analogue of Wilks' theorem. Thus it functions similarly for hypothesis testing. We will still refer to the support of $Q$ to provide intuition.

What is the set of alternative hypotheses considered when constructing hypothesis tests or CIs via a dual likelihood formulation? This is easier to understand in the primal, as the dual likelihood corresponds to a primal optimization over all distributions $Q$ over $(w, r)$ which measure-theoretically dominate the empirical distribution (i.e., place positive probability on each realized datum) and satisfy the moment condition $\mathbb{E}_Q[w] = 1$. Although this includes distributions with unbounded support, the optima are supported on the sample plus at most one more point, as discussed in section 4.4.

4.2 Off-Policy Estimation

We start by defining a (dual) analogue to the nonparametric maximum likelihood estimator (NPMLE) in the primal formulation for the i.i.d. case. Consider the quantity

$$l^{\mathrm{mle}} = \sup_{(\beta, 0) \in C} l_v(\beta, 0), \qquad (4)$$

which is obtained by setting $\tau = 0$ (so the value of $v$ is immaterial) and optimizing over $\beta$. This quantity may seem mysterious, but it corresponds to the NPMLE. Indeed, $\tau = 0$ means $\mathbb{E}_Q[wr]$ is free to take on any value, as in the primal maximum likelihood formulation. We propose our estimator as any $v$ which obtains the maximum dual likelihood, i.e., any value in the set

$$\Big\{ v \;\Big|\; \sup_{(\beta, \tau) \in C} l_v(\beta, \tau) = l^{\mathrm{mle}} \Big\}. \qquad (5)$$

In appendix B we prove there is an interval of maximizers of the form

$$\hat{V}(\pi; \rho) = \rho + \frac{1}{N} \sum_n \frac{w_n (r_n - \rho)}{1 + \beta^*(w_n - 1)}, \qquad (6)$$

where $\rho$ is any value in $[0, 1]$ and $\beta^*$ maximizes

$$\sum_n \log\big(1 + \beta(w_n - 1)\big) \quad \text{s.t.} \quad \forall w:\ 1 + \beta(w - 1) \ge 0. \qquad (7)$$

The constraints on $\beta$ are over all possible values of $w$, not just the observed $w$. However the constraints with $w = w_{\min}$ and $w = w_{\max}$ imply all other constraints. We solve this 1-d convex problem via bisection to accuracy $\epsilon$ in $O(N \log(1/\epsilon))$ time. Note that $\beta = 0$ is always feasible and it is optimal when $\sum_n w_n = N$. When $\beta^* = 0$, (6) becomes $\hat{V}_{\mathrm{IPS}}$ for all values of $\rho$. Eq. (6) (and eq. (9) in section 4.3) are valid in the martingale setting, i.e., for a sequence of historical policies. Appendix B shows that when there exists an unobserved extreme value of $w$, say $w_{\mathrm{ex}}$, any associated primal solution $Q^{\mathrm{mle}}$ will assign some probability to a pair $(w_{\mathrm{ex}}, \rho)$. Section 4.4 discusses the beneficial implications of this.
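As a concrete illustration of eqs. (6) and (7), the following is a minimal NumPy sketch that solves the one-dimensional concave problem for $\beta^*$ by bisection on its derivative and then evaluates the estimate. The function names are ours, and we assume $0 \le w_{\min} < 1 < w_{\max}$ with rewards in $[0, 1]$.

```python
import numpy as np

def solve_beta(w, wmin, wmax, tol=1e-10):
    # Maximize sum_n log(1 + beta * (w_n - 1)) subject to the constraints of
    # eq. (7); only w = wmin and w = wmax matter, so beta lies in
    # [-1 / (wmax - 1), 1 / (1 - wmin)].  The objective is concave in beta,
    # so we bisect on the sign of its derivative.
    w = np.asarray(w, dtype=float)
    lo = -1.0 / (wmax - 1.0) + tol
    hi = 1.0 / (1.0 - wmin) - tol
    deriv = lambda b: np.sum((w - 1.0) / (1.0 + b * (w - 1.0)))
    if deriv(lo) <= 0.0:
        return lo
    if deriv(hi) >= 0.0:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if deriv(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def el_estimate(w, r, wmin, wmax, rho=0.5):
    # Eq. (6): rho + (1/N) sum_n w_n (r_n - rho) / (1 + beta* (w_n - 1)).
    w, r = np.asarray(w, dtype=float), np.asarray(r, dtype=float)
    beta = solve_beta(w, wmin, wmax)
    return rho + np.mean(w * (r - rho) / (1.0 + beta * (w - 1.0)))
```

In this sketch, when $\sum_n w_n = N$ the bisection returns $\beta^* \approx 0$ and the estimate reduces to $\hat{V}_{\mathrm{IPS}}$, matching the remark above.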
Once both $w_{\min}$ and $w_{\max}$ are observed with any $r$, eq. (6) becomes a point estimate because $\sum_n w_n \big(1 + \beta^*(w_n - 1)\big)^{-1} = N$, i.e., $\rho$ cancels out and $Q^{\mathrm{mle}}$ only has support on the observed data. The EMP estimator, based on empirical likelihood, was proposed in [14]. Specializing it to a constant reward predictor for all $(x, a)$, we can write both estimators in terms of $Q^{\mathrm{mle}}$. Eq. (6) leads to $\hat{V}(\pi) = \big(1 - \sum_n Q^{\mathrm{mle}}_{w_n, r_n} w_n\big)\rho + \sum_n Q^{\mathrm{mle}}_{w_n, r_n} w_n r_n$, while EMP is $\hat{V}_{\mathrm{EMP}}(\pi) = \sum_n Q^{\mathrm{mle}}_{w_n, r_n} w_n r_n \big/ \sum_n Q^{\mathrm{mle}}_{w_n, r_n}$. When $w_{\min}$ and $w_{\max}$ are observed, $\sum_n Q^{\mathrm{mle}}_{w_n, r_n} = \sum_n Q^{\mathrm{mle}}_{w_n, r_n} w_n = 1$ and the two estimators coincide. Section 5.1 empirically investigates their finite sample behavior.

4.2.1 Finite Sample Bias

We show a finite-sample bound on the bias of an estimator, based upon eq. (6), of the value difference $R(\pi) \doteq V(\pi) - V(h)$ between $\pi$ and the logging policy. We obtain our estimator for $R(\pi)$ via $\mathbb{E}_{Q^{\mathrm{mle}}}[wr] - \mathbb{E}_{Q^{\mathrm{mle}}}[r]$, using the primal-dual relationship for $Q^{\mathrm{mle}}$ from appendix A. In practical applications $R(\pi)$ is the relevant quantity for deciding when to update a production policy. The proof is in appendix D.

Theorem 1. Let $\hat{R}(\pi) \doteq \frac{1}{N} \sum_n \frac{(w_n - 1)(r_n - \rho)}{1 + \beta^*(w_n - 1)}$ with $\beta^*$ as in eq. (7), and let a.s. $\forall n:\ 0 \le w_n \le w_{\max}$ with $w_{\max} \ge 1$. Then

$$\Big| \mathbb{E}\big[\hat{R}(\pi)\big] - R(\pi) \Big| \le 10 \sqrt{\frac{w_{\max}}{N}},$$

where $R(\pi) \doteq V(\pi) - V(h)$ is the true policy value difference between $\pi$ and $\{h_n\}_{n \le N}$.

The leading term in Theorem 1 is actually any $\omega \ge \mathbb{E}_n[(w_n - 1)^2]$; $\omega = w_{\max}$ is a worst case. This result indicates low bias for the estimator (leading terms comparable to finite-sample variance); meanwhile, inspection of eq. (6) (and (9)) indicates neither can overflow the underlying range of reward. This explains the excellent mean squared error performance observed in section 5.1.

4.3 Off-Policy Confidence Interval

We can use the dual likelihood of eq. (2) to construct an asymptotic confidence interval [22] in a manner completely analogous to Wilks' theorem for the primal likelihood formulation:

$$\Big\{ v \;\Big|\; \sup_{(\beta, \tau) \in C} l_v(\beta, \tau) \le l^{\mathrm{mle}} + \tfrac{1}{2}\chi^{2,\alpha}_{(1)} \Big\}, \qquad (8)$$

where $\alpha$ is the desired nominal coverage and $\chi^{2,\alpha}_{(1)}$ is the $\alpha$-quantile of a $\chi^2$ distribution with one degree of freedom. The asymptotic guarantee is that the coverage error of this interval is $O(1/n)$. In general applications of EL a bisection on $v$ is recommended for finding the boundaries of the CI: given an interval $[\ell, u]$, check whether $v = (\ell + u)/2$ is in the set given by (8) and update $\ell$ or $u$. This requires $O(\log(1/\epsilon))$ calls to maximize (2). Here we derive a more explicit form for the boundary points which is more insightful and faster to compute (2 optimization calls). In appendix C we prove the lower bound of the CI is

$$v_{lb}(\pi) = \kappa^* \frac{1}{N} \sum_n \frac{w_n r_n}{\gamma^* + \beta^* w_n + w_n r_n}, \qquad (9)$$

where $(\beta^*, \gamma^*, \kappa^*)$ are given by

$$\sup_{\kappa \ge 0,\, \beta,\, \gamma}\ -\kappa \log \kappa + \kappa\,\phi + \frac{\kappa}{N} \sum_n \log\big(\gamma + \beta w_n + w_n r_n\big) - \frac{\beta + \gamma}{N} \quad \text{subject to} \quad \forall w:\ \gamma + \beta w \ge 0,$$

where $\phi = \frac{1}{N}\big(1 - \tfrac{1}{2}\chi^{2,\alpha}_{(1)} + \sum_n \log(N Q^{\mathrm{mle}}_{w_n, r_n})\big)$. The constraints range over all possible values of $w$, but $w_{\min}$ and $w_{\max}$ are the only relevant ones. This is a convex problem with 3 variables and 2 constraints that can be solved to $\epsilon$-accuracy by the ellipsoid method (for example) in $O(N \log(1/\epsilon))$ time. The upper bound can be obtained by transforming the rewards $r \to 1 - r$, finding the lower bound, and then setting $v_{ub} = 1 - v_{lb}$. In eq. (9) we can have $\kappa^* \sum_n w_n \big(\gamma^* + \beta^* w_n + w_n r_n\big)^{-1} < N$ even after extreme values of $w$ have been observed. This corresponds to a primal solution which places extra probability on either $(w_{\min}, 0)$ or $(w_{\max}, 0)$. For example, this allows our lower bound to be $< 1$ even if all observed rewards are 1. Section 4.4 discusses the benefits of the additional primal support.
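For contrast with the direct formulation above, the following SciPy sketch implements the generic bisection on $v$ over the set in eq. (8) that the text describes (and that eq. (9) renders unnecessary). The names are ours; `v_hat` is assumed to be a point estimate from eq. (6), we assume $0 \le w_{\min} < 1 < w_{\max}$, and the inner maximization of eq. (2) is delegated to a generic solver rather than anything specialized.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.stats import chi2

def dual_loglik(v, beta, tau, w, r):
    # l_v(beta, tau) from eq. (2); guard the log domain for the solver.
    z = 1.0 + beta * (w - 1.0) + tau * (w * r - v)
    return -1e30 if np.any(z <= 0.0) else np.sum(np.log(z))

def profile(v, w, r, wmin, wmax):
    # sup over (beta, tau) in C of l_v, with C reduced to its four corner
    # constraints at (w, r) in {wmin, wmax} x {0, 1} (eq. (3)).
    cons = [{'type': 'ineq',
             'fun': lambda x, a=a, b=b: 1.0 + x[0] * (a - 1.0) + x[1] * (a * b - v)}
            for a in (wmin, wmax) for b in (0.0, 1.0)]
    res = minimize(lambda x: -dual_loglik(v, x[0], x[1], w, r),
                   x0=np.zeros(2), method='SLSQP', constraints=cons)
    return -res.fun

def l_mle(w, r, wmin, wmax):
    # Eq. (4): tau = 0, so v is immaterial; optimize over beta alone.
    lo = -1.0 / (wmax - 1.0) + 1e-9
    hi = 1.0 / (1.0 - wmin) - 1e-9
    res = minimize_scalar(lambda b: -dual_loglik(0.0, b, 0.0, w, r),
                          bounds=(lo, hi), method='bounded')
    return -res.fun

def ci_lower_by_bisection(w, r, wmin, wmax, v_hat, alpha=0.95, tol=1e-4):
    # Smallest v in [0, v_hat] whose profile dual likelihood stays below the
    # threshold of eq. (8); the point estimate v_hat is always inside the CI.
    w, r = np.asarray(w, dtype=float), np.asarray(r, dtype=float)
    threshold = l_mle(w, r, wmin, wmax) + 0.5 * chi2.ppf(alpha, df=1)
    lo, hi = 0.0, v_hat
    if profile(lo, w, r, wmin, wmax) <= threshold:
        return lo
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if profile(mid, w, r, wmin, wmax) <= threshold:
            hi = mid
        else:
            lo = mid
    return hi
```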
4.4 The Importance of E[w] = 1

By inspection, the primal constraint $\mathbb{E}[w] = 1$ can be infeasible for a distribution only supported on the observed values if 1 is not in the convex hull of the observed importance weights. Consequently, solutions to equations (6) and (9) can correspond to distributions $Q$ in the primal formulation with support beyond the observed values. This is a known property of constrained empirical likelihood [10]. We precisely characterize the additional support as confined to a single extreme point. In appendix B we show the support of the primal distribution associated with equation (6) is a subset of $\{(w_n, r_n)\}_{n \le N} \cup \{(w_{\mathrm{ex}}, \rho)\}$, where $w_{\mathrm{ex}} = w_{\min}$ if $\sum_n w_n \ge N$ and otherwise $w_{\mathrm{ex}} = w_{\max}$. In appendix C we show the support of the primal distribution associated with equation (9) is $\{(w_n, r_n)\}_{n \le N} \cup \{(w_{\mathrm{ex}}, 0)\}$, where $w_{\mathrm{ex}}$ is either $w_{\min}$ or $w_{\max}$. We also point out that similar ideas have already been used for multi-armed bandits. For example, the empirical KL-UCB algorithm [6] uses empirical likelihood to construct an upper confidence bound on each arm by considering distributions that can place additional mass on the largest possible reward.

Although the modification of the support from the observed data points seems modest, it greatly improves both the estimator and the CI. Critically, both can produce values that are outside the convex hull of the observations, but never overflow the possible range $[0, 1]$. In contrast, empirical likelihood on the sample is constrained to the convex hull of the observations, while empirical likelihood on the bounded range without the $\mathbb{E}[w] = 1$ constraint can produce value estimates in the range $[0, w_{\max}]$. Furthermore, we observe in practice that our CIs approach nominal coverage values from above, as in Figure 1. This is not typical behavior when empirical likelihood is constrained to the sample. Per Lemma 2.1 of [24], empirical likelihood can only place $O(1/n)$ mass outside the sample. With our primal constraint $\mathbb{E}[w] = 1$ this mass is further limited to $O(1/w_{\max})$, and decreases as the realized average importance weight approaches 1. As seen in Figure 1, this can result in non-trivial CIs in the regime $n < w_{\max}$, where other interval estimation techniques struggle.

4.5 Offline Contextual Bandit Learning

Here the goal is to learn a policy $\pi$ using a dataset $\{(x_n, a_n, p_n, r_n)\}_{n \le N}$, i.e., without interacting with the system generating the data. One strategy is to leverage a counterfactual estimator to reduce policy learning to optimization [18], suggesting the use of equation (6) in the objective. Alternatively, we can instead optimize the lower bound of equation (9). In the i.i.d. case optimizing the lower bound corresponds to a variant of distributionally robust optimization. The log-empirical likelihood for a distribution $Q$ is equivalent (up to sign and scaling) to the KL divergence between the empirical distribution $\frac{1}{N}$ and $Q$. A likelihood maximizer $Q^{\mathrm{mle}}$ attains the minimum such KL divergence. By optimizing the lower bound (9) we are performing distributionally robust optimization with uncertainty set

$$\mathcal{Q}(\pi) = \Big\{ Q \;\Big|\; \mathbb{E}_Q[w(\pi)] = 1,\ \mathrm{KL}\big(\tfrac{1}{N} \,\big\|\, Q\big) \le B(\pi) \Big\},$$

where $\frac{1}{N}$ denotes the empirical (uniform) distribution over the sample, $B(\pi) = \mathrm{KL}\big(\tfrac{1}{N} \,\big\|\, Q^{\mathrm{mle}}(\pi)\big) + \frac{1}{2N}\chi^{2,\alpha}_{(1)}$, and we have made dependences on $\pi$ explicit. Given a set of policies $\Pi$ we can set up the game

$$\max_{\pi \in \Pi}\ \min_{Q \in \mathcal{Q}(\pi)}\ \sum_n Q_{w(\pi)_n, r_n}\, w(\pi)_n\, r_n$$

for finding the policy $\pi \in \Pi$ with the best reward lower bound. For our experiments we use a heuristic alternating optimization strategy. In one phase the policy is fixed and we find the optimal dual variables associated with equation (9).
In the alternate phase we find a policy with a better lower bound, i.e., a policy which improves upon equation (9) with the dual variables held fixed. Developing better methods for solving this game is deferred to future work.

5 Experiments

The purpose of our experiments is to demonstrate the empirical behavior of the proposed methods against other methods that use the same information. Comparing against methods that leverage or focus on reward predictors is therefore out of scope, as reward predictors can help/hurt any method. Our experiments compare the MSE of estimators (section 5.1), confidence interval coverage and width (section 5.2), and the utility of lower bound optimization for off-policy learning (section 5.3). Replication instructions are available in the supplement, and replication software is available at http://github.com/pmineiro/elfcb. All experiment details are in the appendix.

5.1 Off-Policy Estimation

Synthetic Data. We begin with a synthetic example to build intuition. In appendix E we detail how we sample $w = \pi/h$ and $r$ for each synthetic environment. Figure 2 shows the mean squared error (MSE) over 10,000 environment samples for various estimators. The best constant predictor of 1/2 ("Constant") has an MSE of 1/12, as expected. "Clipped DR" is the doubly robust estimator with the best constant predictor of 1/2, clipped to the range $[0, 1]$, i.e., $\min\big(1, \max\big(0, \tfrac{1}{2} + \tfrac{1}{N}\sum_n w_n (r_n - \tfrac{1}{2})\big)\big)$. SNIPS is the self-normalized IPS estimator. EMP is the estimator of [14]. For EL, we use $\rho = \tfrac{1}{2}$. When a small number of large importance weight events is expected in a realization, both Clipped DR and SNIPS suffer due to their poor handling of the $\mathbb{E}[w] = 1$ constraint. EMP is an improvement and EL is a further improvement. Asymptotically all estimators are similar.

Figure 2: Mean squared error of EL and other estimators (Constant 0.5, Clipped DR, SNIPS, EMP, EL) on synthetic data, plotted against $n / w_{\max}$. Asymptotics are similar while EL dominates in the small sample regime. Line width is 4 times the standard error of the population mean.

Table 1: Off-policy evaluation results, where ε = 0.05 is ε-greedy exploration, bags=10 is bootstrap exploration with 10 replicas, and cover=10 is online cover [2] with 10 policies.

| EL vs. | Exploration | Wins | Ties | Losses |
|---|---|---|---|---|
| IPS | ε = 0.05 | 26 | 11 | 3 |
| | bags=10 | 13 | 19 | 8 |
| | cover=10 | 16 | 16 | 9 |
| SNIPS | ε = 0.05 | 5 | 34 | 1 |
| | bags=10 | 7 | 30 | 3 |
| | cover=10 | 7 | 33 | 0 |
| EMP | ε = 0.05 | 24 | 13 | 3 |
| | bags=10 | 8 | 26 | 6 |
| | cover=10 | 8 | 23 | 9 |

Table 2: Learning results. CI LB uses equation (9); EL uses equation (6). EL serves as an ablation study on whether the improvement in CI LB is due to distributional robustness, or the estimator itself.

| Exploration | CI LB Wins | CI LB Ties | CI LB Losses | EL Wins | EL Ties | EL Losses |
|---|---|---|---|---|---|---|
| ε = 0.05 greedy | 16 | 18 | 6 | 11 | 26 | 3 |
| ε = 0.1 greedy | 16 | 19 | 5 | 13 | 24 | 3 |
| ε = 0.25 greedy | 15 | 22 | 3 | 3 | 34 | 3 |
| bagging, 10 bags | 21 | 18 | 1 | 11 | 28 | 1 |
| bagging, 32 bags | 4 | 26 | 10 | 7 | 31 | 2 |
| cover, 10 policies | 18 | 21 | 1 | 6 | 30 | 4 |
| cover, 32 policies | 9 | 29 | 2 | 6 | 34 | 0 |

Realistic Data. We employ an experimental protocol inspired by the operations of the Decision Service [1], an industrial contextual bandit platform. Details are in appendix F. Succinctly, we use 40 classification datasets from OpenML [31]; apply a supervised-to-bandit transform [9]; and limit the datasets to 10,000 examples. Each dataset is randomly split 20%/60%/20% into Initialize/Learn/Evaluate subsets, used to learn $h$, learn $\pi$, and evaluate $\pi$ respectively. Learning is via Vowpal Wabbit [16] using various exploration strategies, with default parameters and $\pi$ initialized to $h$.
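For readers unfamiliar with this protocol, here is a minimal sketch of the standard supervised-to-bandit conversion with ε-greedy logging; the classifier interface, names, and defaults are our own illustrative assumptions, not the replication code.

```python
import numpy as np

def to_bandit_logs(X, y, predict, n_actions, epsilon=0.05, seed=0):
    # Supervised-to-bandit transform: each class is an action and the reward
    # is 1 iff the chosen action equals the true label.  Actions are drawn
    # from an epsilon-greedy logging policy around `predict(x) -> action`,
    # and only the bandit feedback (x_n, a_n, p_n, r_n) is retained.
    rng = np.random.default_rng(seed)
    logs = []
    for x, label in zip(X, y):
        probs = np.full(n_actions, epsilon / n_actions)
        probs[predict(x)] += 1.0 - epsilon
        a = rng.choice(n_actions, p=probs)
        logs.append((x, a, probs[a], float(a == label)))
    return logs
```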
We compare the MSE of EL, IPS, SNIPS, and EMP using the true value of $\pi$ on the evaluation set (available because the underlying dataset is fully observed and $\pi(a|x)$ is known). For each dataset we evaluate multiple times, each time resampling the logged actions $a \sim h(\cdot|x)$. Table 1 shows the results of a paired t-test with 60 trials per dataset and a 95% confidence level: a tie indicates a null result, and a win or loss indicates significantly better or worse. EL is overall superior to IPS and SNIPS. It is similar to EMP except when the data comes from 0.05-greedy exploration, where EL is better than EMP.

5.2 Confidence Intervals

Table 3: Off-Policy Confidence Intervals

| Technique | Coverage (Average) | Width Ratio (Median) |
|---|---|---|
| EL | 0.975 | n/a |
| Binomial | 0.996 | 2.89 |
| AG | 0.912 | 0.99 |

Synthetic Data. We use the same synthetic ε-greedy data as described above. Figure 1 shows the mean width and empirical coverage over 10,000 environment samples for various CIs at 95% nominal coverage. The Binomial CI is the Clopper-Pearson confidence interval on the random variable $\frac{w}{w_{\max}} r$. This is an excessively wide CI. The asymptotic Gaussian CI is the standard z-score CI around the empirical mean and standard deviation, motivated by the central limit theorem. These intervals are narrow but typically violate nominal coverage. The EL interval is narrow and obeys nominal coverage throughout the entire range despite only having asymptotic guarantees. Once again there is a qualitative change when the sample size is comparable to the largest importance weight. The Binomial CI only begins to make progress at this point. Meanwhile, the asymptotic Gaussian interval widens as empirical variance increases.

Realistic Data. We use the same datasets mentioned above, but produce a 95% confidence interval for off-policy evaluation rather than the maximum likelihood estimate. With 40 datasets and 60 evaluations per dataset, we have 2400 confidence intervals from which we compute the coverage and the ratio of the width of the interval to that of EL in Table 3. As expected from simulation, the Binomial CI overcovers and has wider intervals. EL widths are comparable to the asymptotic Gaussian (AG) on this data, but AG undercovers. A 95% binomial confidence interval on the coverage of AG is [90.0%, 92.3%], indicating sufficient data to conclude undercoverage.

5.3 Offline Contextual Bandit Learning

We use the same 40 datasets as above, but with a 20%/20%/60% Initialize/Learn/Evaluate split. We made no effort to tune the confidence level, setting it to 95% for all experiments. For optimizing the policy parameters and the distribution dual variables, we alternate between solving the dual problem with the policy fixed and then optimizing the policy with the dual variables fixed. To optimize the policy we do a single pass over the data using Vowpal Wabbit as a black-box oracle for learning, supplying different importance weights on each example depending upon the dual variables. We do 4 passes over the learning set and update the dual variables before each pass. Details are in appendix G. We compare the true value of $\pi$ on the evaluation set resulting from learning with the different objectives. For each dataset we learn multiple times, with different actions chosen by the historical policy $h$. Table 2 shows the results of a paired t-test with 60 trials per dataset and a 95% confidence level: a tie indicates a null result, and a win or loss indicates a significantly better or worse evaluation value for the CI lower bound. Using the CI lower bound overall yields superior results.
Using the EL estimate also provides some lift but is less effective than using the CI lower bound.

6 Conclusions

We presented a practical estimator and a CI for contextual bandits with correct asymptotic coverage and empirically valid coverage for small samples. To this end we used empirical likelihood techniques, which yielded computationally efficient and hyperparameter-free procedures for estimation, CIs, and learning. Empirically, our proposed CI is a substantial improvement over existing methods, and the learning algorithm is a useful improvement over techniques that optimize the value of a point estimate. Our methods offer the largest advantage in regimes where existing methods struggle, such as when the number of samples N is of the same order as the largest possible importance weight.

Broader Impact

Not applicable to this work.

Acknowledgments and Disclosure of Funding

We thank Adith Swaminathan and the anonymous reviewers for their valuable comments on earlier drafts of this work.

References

[1] Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo Ribas, et al. Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966, 2016.

[2] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.

[3] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[4] Alberto Bietti, Alekh Agarwal, and John Langford. Practical evaluation and optimization of contextual bandit algorithms. CoRR, abs/1802.04064, 2018.

[5] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

[6] Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

[7] Ashok Chandrashekar, Fernando Amat, Justin Basilico, and Tony Jebara. Artwork personalization at Netflix. The Netflix Tech Blog, 2017.

[8] John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.

[9] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.

[10] Marian Grendár, Vladimír Špitalský, et al. Multinomial and empirical likelihood under convex constraints: Directions of recession, Fenchel duality, the PP algorithm. Electronic Journal of Statistics, 11(1):2547–2612, 2017.

[11] Wassily Hoeffding. Asymptotically optimal tests for multinomial distributions. The Annals of Mathematical Statistics, pages 369–401, 1965.

[12] Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.

[13] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.

[14] Nathan Kallus and Masatoshi Uehara. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. arXiv preprint arXiv:1906.03735, 2019.
[15] Yuichi Kitamura. Asymptotic optimality of empirical likelihood for testing moment restrictions. Econometrica, 69(6):1661–1672, 2001.

[16] John Langford, Lihong Li, and Alexander Strehl. Vowpal Wabbit open source project. https://github.com, 2007.

[17] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 817–824. Citeseer, 2007.

[18] Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web, pages 929–934. ACM, 2015.

[19] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. CoRR, abs/1003.0146, 2010.

[20] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.

[21] Benedict C. May, Nathan Korda, Anthony Lee, and David S. Leslie. Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13:2069–2106, 2012.

[22] Per Aslak Mykland. Dual likelihood. The Annals of Statistics, pages 396–421, 1995.

[23] Art B. Owen. Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2):237–249, 1988.

[24] Art B. Owen. Empirical likelihood. Chapman and Hall/CRC, 2001.

[25] Pablo Paredes, Ran Gilad-Bachrach, Mary Czerwinski, Asta Roseway, Kael Rowan, and Javier Hernandez. PopTherapy: Coping with stress through pop-culture. In Proceedings of the 8th International Conference on Pervasive Computing Technologies for Healthcare, PervasiveHealth 2014, Oldenburg, Germany, May 20-23, 2014, pages 109–117, 2014.

[26] Jin Qin and Jerry Lawless. Empirical likelihood and general estimating equations. The Annals of Statistics, pages 300–325, 1994.

[27] James M. Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.

[28] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[29] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16(1):1731–1755, 2015.

[30] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015.

[31] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013.

[32] Nikos Vlassis, Aurelien Bibaut, Maria Dimakopoulou, and Tony Jebara. On the design of estimators for bandit off-policy evaluation. In International Conference on Machine Learning, pages 6468–6476, 2019.

[33] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3589–3597, 2017.