# Practical Contextual Bandits with Regression Oracles

Dylan J. Foster¹, Alekh Agarwal², Miroslav Dudík², Haipeng Luo³, Robert E. Schapire²

¹Cornell University (work performed while the author was an intern at Microsoft Research). ²Microsoft Research. ³University of Southern California. Correspondence to: Dylan J. Foster.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## Abstract

A major challenge in contextual bandits is to design general-purpose algorithms that are both practically useful and theoretically well-founded. We present a new technique that has the empirical and computational advantages of realizability-based approaches combined with the flexibility of agnostic methods. Our algorithms leverage the availability of a regression oracle for the value-function class, a more realistic and reasonable oracle than the classification oracles over policies typically assumed by agnostic methods. Our approach generalizes both UCB and LinUCB to far more expressive model classes and achieves low regret under certain distributional assumptions. In an extensive empirical evaluation, we find that our approach typically matches or outperforms both realizability-based and agnostic baselines.

## 1. Introduction

We study the design of practically useful, theoretically well-founded, general-purpose algorithms for the contextual bandit (CB) problem. In this setting, the learner repeatedly receives a context, then selects an action, resulting in a received reward. The aim is to learn a policy, a mapping from contexts to actions, that maximizes the long-term cumulative reward. For instance, a news portal must repeatedly choose articles to present to each user to maximize clicks. Here, the context is information about the user, the actions are the articles, and the reward might be the indicator of a click. We refer the reader to an ICML 2017 tutorial (http://hunch.net/~rwil/) for further examples.

CB algorithms can be put into two groups. Some methods (Langford & Zhang, 2008; Agarwal et al., 2014) are agnostic in the sense that they are provably effective for any given policy class and data distribution. In contrast, realizability-based approaches such as LinUCB and variants (Chu et al., 2011; Li et al., 2017; Filippi et al., 2010) or Thompson sampling (Thompson, 1933) assume the data is generated from a particular parametrized family of models. Computationally tractable realizability-based algorithms are only known for specific model families, such as when the conditional reward distributions come from a generalized linear model.

The two groups of approaches seem to have different advantages and disadvantages. Empirically, in the contextual semibandit setting, Krishnamurthy et al. (2016) found that the realizability-based LinUCB approach outperforms all agnostic baselines using a linear policy class. However, the agnostic approaches were able to overcome this shortcoming by using a more powerful policy class. Computationally, previous realizability-based approaches have been limited by their reliance on either closed-form confidence bounds (as in LinUCB variants), or the ability to efficiently sample from and frequently update the posterior (as in Thompson sampling).
Agnostic approaches, on the other hand, typically assume an oracle for cost-sensitive classification, which is computationally intractable in the worst case, but often practically feasible for many natural policy classes.

In this paper, we aim to develop techniques that combine the best of both of these approaches. To this end, in Section 3, we propose computationally efficient and practical realizability-based algorithms for arbitrary model classes. As is often done in agnostic approaches, we assume the availability of an oracle which reduces to a standard learning setting and knows how to efficiently leverage the structure of the model class. Specifically, we require access to a least-squares regression oracle over the model class that we use for predicting rewards given contexts. Since regression can often be solved efficiently, the availability of such an oracle is a far more reasonable assumption than the cost-sensitive classification oracle usually assumed, which typically must solve NP-hard problems. In fact, for this reason, even the classification oracles are typically approximated by regression oracles in practice (see, e.g., Beygelzimer & Langford, 2009). Our main algorithmic components here are motivated by and adapted from a recent work of Krishnamurthy et al. (2017) on cost-sensitive active learning.

In Section 4, we prove that our algorithms are effective in achieving low regret under certain distributional assumptions. Specifically, we show that our methods enjoy low regret so long as certain quantities like the disagreement coefficient (Hanneke, 2014; Krishnamurthy et al., 2017) are bounded, or when some other distributional coefficients inspired by Bastani & Bayati (2015) are well-behaved. As a special consequence, we obtain nearly dimension-free results for sparse linear bandits in high dimensions.

Finally, in Section 5, we conduct a very extensive empirical evaluation of our algorithms on a number of datasets and against both realizability-based and agnostic baselines. In this test of practical effectiveness, we find that our approach gives comparable or superior results in nearly all cases, and we also validate the distributional assumptions required for low-regret guarantees on these datasets.

## 2. Preliminaries

We consider the following contextual bandit protocol. Contexts are drawn from an arbitrary space, $x \in \mathcal{X}$; actions are from a finite set, $a \in \mathcal{A} = \{1, \ldots, K\}$, for some fixed $K$; and reward vectors are from a bounded set, $r \in [0,1]^K$, with component $r(a)$ denoting the reward for action $a \in \mathcal{A}$. We consider an i.i.d. setting where there is a fixed and unknown distribution $\mathcal{D}$ over context-reward pairs $(x, r)$, with $\mathcal{D}_{\mathcal{X}}$ denoting its marginal over $\mathcal{X}$. At each round $t = 1, 2, \ldots, T$, nature samples $(x_t, r_t)$ according to $\mathcal{D}$ and reveals $x_t$ to the learner. The learner chooses an action $a_t \in \mathcal{A}$ and observes the reward $r_t(a_t)$.

The learner aims to maximize its reward and compete with strategies that model the expected reward $\mathbb{E}[r(a) \mid x, a]$ via functions $f : \mathcal{X} \times \mathcal{A} \to [0,1]$. We consider mappings $f$ from a given class $\mathcal{F}$, such as linear predictors or regression trees. The main assumption underlying this paper is that the class $\mathcal{F}$ is rich enough to contain a predictor that perfectly predicts the expected reward of any action under any context, that is:

**Assumption 1** (Realizability). There is a predictor $f^\star \in \mathcal{F}$ such that $\mathbb{E}[r(a) \mid x, a] = f^\star(x, a)$ for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$.
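To make the protocol concrete, the following minimal Python sketch simulates the i.i.d. interaction loop above for a toy instance satisfying Assumption 1. The linear form of $f^\star$, the Bernoulli rewards, and all variable names are illustrative assumptions of this sketch, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 4, 1000

# Toy realizable instance: f*(x, a) = <theta_a, x>, constructed to lie in [0, 1].
theta_star = rng.uniform(0, 1, size=(K, d)) / d

def sample_context():
    return rng.uniform(0, 1, size=d)

def sample_reward(x, a):
    mean = float(theta_star[a] @ x)     # f*(x, a) = E[r(a) | x, a]
    return float(rng.random() < mean)   # Bernoulli reward with this mean

total = 0.0
for t in range(T):
    x_t = sample_context()              # nature draws (x_t, r_t) and reveals x_t
    a_t = int(rng.integers(K))          # placeholder policy: uniform over actions
    total += sample_reward(x_t, a_t)    # only the chosen action's reward is observed
print("average reward:", total / T)
```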
This assumption is used by essentially all regression-based contextual bandit algorithms (Chu et al., 2011; Filippi et al., 2010; Russo & Van Roy, 2013; Li et al., 2017).

Given a predictor $f \in \mathcal{F}$, the associated optimal strategy $\pi_f : \mathcal{X} \to \mathcal{A}$, called a policy, picks the action with the highest predicted reward, i.e., $\pi_f(x) = \arg\max_{a \in \mathcal{A}} f(x, a)$ (ties broken arbitrarily). Using $\pi^\star = \pi_{f^\star}$ to denote an optimal policy, the learner aims to minimize its regret
$$\mathrm{Reg}_T = \sum_{t=1}^{T} r_t(\pi^\star(x_t)) - \sum_{t=1}^{T} r_t(a_t),$$
which compares the accumulated rewards of the optimal policy and the learner's strategy. The classic Exp4 algorithm (Auer et al., 2002b) achieves an optimal regret bound of order $O\big(\sqrt{TK \ln |\mathcal{F}|}\big)$ (for any finite $\mathcal{F}$), but its computational complexity is unfortunately linear in $|\mathcal{F}|$.

**Regression Oracle.** To overcome the computational obstacle, our algorithms reduce the contextual bandit problem to weighted least-squares regression. Abstracting the computational complexity, we assume access to a weighted least-squares regression oracle over the predictor class $\mathcal{F}$, which takes any set $H$ of weighted examples $(w, x, a, y) \in \mathbb{R}_+ \times \mathcal{X} \times \mathcal{A} \times [0,1]$ as input, and outputs the predictor with the smallest weighted squared loss:
$$\mathrm{ORACLE}(H) = \arg\min_{f \in \mathcal{F}} \sum_{(w, x, a, y) \in H} w \, \big(f(x, a) - y\big)^2.$$
As mentioned, such regression tasks are very common in machine learning practice, and the availability of such an oracle is thus a very mild assumption.

## 3. Algorithms

The high-level idea of our algorithms is the following. As data is collected, we maintain a subset of $\mathcal{F}$, referred to as the version space, that only contains $f \in \mathcal{F}$ with small squared loss on observed data. For a new example, we construct a confidence interval for the expected reward of each action based on this version space. Finally, with these confidence intervals, we either optimistically pick the action with the highest upper bound, similar to UCB and LinUCB, or randomize among all actions that are potentially the best. The challenge here is to maintain such version spaces and compute upper and lower confidence bounds efficiently, and we show that this can be done using a binary search together with a small number of regression oracle calls.

More formally, we define the upper and lower bounds on the expected reward with respect to a subset $\mathcal{F}' \subseteq \mathcal{F}$ as
$$\mathrm{HIGH}_{\mathcal{F}'}(x, a) = \max_{f \in \mathcal{F}'} f(x, a), \qquad \mathrm{LOW}_{\mathcal{F}'}(x, a) = \min_{f \in \mathcal{F}'} f(x, a).$$
Our algorithms induce the confidence bounds by instantiating these quantities with the version space as $\mathcal{F}'$.

To reduce computation, our algorithms update on a doubling epoch schedule. There are $M = O(\log T)$ epochs, and epoch $m$ begins at time $\tau_m = 2^{m-1}$. At epoch $m$ our algorithms (implicitly) construct a version space $\mathcal{F}_m \subseteq \mathcal{F}$, and then select an action based on the reward ranges defined by $\mathrm{HIGH}_{\mathcal{F}_m}(x, a)$ and $\mathrm{LOW}_{\mathcal{F}_m}(x, a)$ for each time $t$ that falls into epoch $m$. Specifically, we consider two algorithm variants: the first picks uniformly at random from the actions that are plausibly best (see lines 6-7 in Algorithm 1); the second simply behaves optimistically and picks the action with the highest upper bound (see line 9 in Algorithm 2). For technical reasons, the optimistic variant also performs pure exploration in the first few epochs to warm-start the algorithm.

**Algorithm 1** REGCB.ELIMINATION
1: Input: square-loss tolerance $\beta_m$
2: for epoch $m = 1, \ldots, M$ do
3:   $\mathcal{F}_m \leftarrow \prod_{a \in \mathcal{A}} \mathcal{G}_m(\beta_m, a)$ (Option I) or $\mathcal{F}_m(\beta_m)$ (Option II)
4:   for time $t = \tau_m, \ldots, \tau_{m+1} - 1$ do
5:     Receive $x_t$ and define $A_t$ as:
6:       $\{a : \mathrm{HIGH}_{\mathcal{F}_m}(x_t, a) \ge \max_{a' \in \mathcal{A}} \mathrm{LOW}_{\mathcal{F}_m}(x_t, a')\}$.
7:     Sample $a_t \sim \mathrm{Unif}(A_t)$ and receive $r_t(a_t)$.
8:   end for
9: end for
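As one concrete instantiation of the oracle abstraction, the sketch below implements ORACLE(H) for a linear predictor class over a user-supplied feature map. The closed-form weighted least-squares solve and the small ridge term (added only for numerical stability) are assumptions of this illustration rather than part of the paper's definition.

```python
import numpy as np

def linear_oracle(H, phi, d, ridge=1e-6):
    """ORACLE(H): argmin over linear predictors f(x, a) = <theta, phi(x, a)>
    of the weighted squared loss sum_{(w, x, a, y) in H} w * (f(x, a) - y)^2.

    H   : list of (weight, context, action, target) tuples
    phi : feature map, phi(x, a) -> length-d numpy array
    d   : feature dimension
    """
    A = ridge * np.eye(d)   # tiny ridge keeps the normal equations well-posed
    b = np.zeros(d)
    for w, x, a, y in H:
        v = phi(x, a)
        A += w * np.outer(v, v)
        b += w * y * v
    theta = np.linalg.solve(A, b)
    # Return the fitted predictor as a callable f(x, a).
    return lambda x, a: float(theta @ phi(x, a))
```

With the oracle returning a callable predictor in this way, the bounds HIGH and LOW for a candidate version space can then be probed by re-invoking the oracle on augmented histories, as described in Section 3.1.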
To construct these version spaces, we further introduce the following least-squares notation for any $m \ge 2$:
$$\widehat{R}_m(f) = \frac{1}{\tau_m - 1} \sum_{s < \tau_m} \big(f(x_s, a_s) - r_s(a_s)\big)^2,$$
$$\mathcal{F}_m(\beta) = \Big\{ f \in \mathcal{F} : \widehat{R}_m(f) \le \min_{f' \in \mathcal{F}} \widehat{R}_m(f') + \beta \Big\},$$
and also let $\mathcal{F}_1(\beta) = \mathcal{F}$ for any $\beta$. With this notation, $\mathcal{F}_m$ is simply set to $\mathcal{F}_m(\beta_m)$ for some $\beta_m$, and $\mathrm{HIGH}_{\mathcal{F}_m}$ and $\mathrm{LOW}_{\mathcal{F}_m}$ recover the confidence bounds in UCB (Auer et al., 2002a) and LinUCB (Chu et al., 2011) for appropriate $\beta_m$.

**Product Classes.** Sometimes it is desirable to have a product predictor class, that is, $\mathcal{F} = \mathcal{G}^{\mathcal{A}}$, where $\mathcal{G}$ is a base class of functions $\mathcal{X} \to [0,1]$ and each $f \in \mathcal{F}$, described by a $K$-tuple $(g_a)_{a \in \mathcal{A}}$ with $g_a \in \mathcal{G}$, predicts according to $f(x, a) = g_a(x)$. Similar to the general case, we define
$$\widehat{R}_m(g, a) = \frac{1}{\tau_m - 1} \sum_{s < \tau_m} \big(g(x_s) - r_s(a_s)\big)^2 \mathbf{1}\{a_s = a\},$$
$$\mathcal{G}_m(\beta, a) = \Big\{ g \in \mathcal{G} : \widehat{R}_m(g, a) \le \min_{g' \in \mathcal{G}} \widehat{R}_m(g', a) + \beta \Big\},$$
and let $\mathcal{G}_1(\beta, a) = \mathcal{G}$ for any $\beta$. In this case we construct $\mathcal{F}_m$ as $\prod_{a \in \mathcal{A}} \mathcal{G}_m(\beta_m, a)$ for some tolerance parameter $\beta_m$.

Our two procedures are described in Algorithms 1 and 2.

**Algorithm 2** REGCB.OPTIMISTIC
1: Input: square-loss tolerance $\beta_m$, number of warm-start epochs $M_0$
2: for time $t = 1, \ldots, \tau_{M_0} - 1$ do
3:   Receive $x_t$, play $a_t \sim \mathrm{Unif}(\mathcal{A})$, and receive $r_t(a_t)$.
4: end for
5: for epoch $m = M_0, \ldots, M$ do
6:   $\mathcal{F}_m \leftarrow \mathcal{F}_m(\beta_m)$.
7:   for time $t = \tau_m, \ldots, \tau_{m+1} - 1$ do
8:     Receive $x_t$.
9:     Select $a_t = \arg\max_{a \in \mathcal{A}} \mathrm{HIGH}_{\mathcal{F}_m}(x_t, a)$.
10:    Receive $r_t(a_t)$.
11:  end for
12: end for

### 3.1. Efficient Reward-Range Computation

Algorithms 1 and 2 hinge on the computation of the bounds $\mathrm{HIGH}_{\mathcal{F}_m}$ and $\mathrm{LOW}_{\mathcal{F}_m}$. This can be carried out efficiently via a small number of calls to the regression oracle. Specifically, to calculate the confidence bounds for a given pair $(x, a)$, we augment the data set $H_m$ with a single example $(x, a, r)$ with a weight $w$. For the upper bound $\mathrm{HIGH}_{\mathcal{F}_m}$ we use $r = 2$; for the lower bound, $r = -1$ (these values are chosen as they lie outside the reward range). By changing the weight $w$, we trade off the loss on this single example against that on the history $H_m$. The binary search over $w$ identifies, up to a given precision, the weight $w$ at which the empirical regret on $H_m$ is exactly the desired tolerance $\beta$, with the corresponding prediction on $(x, a)$ yielding $\mathrm{HIGH}_{\mathcal{F}_m(\beta)}(x, a)$ or $\mathrm{LOW}_{\mathcal{F}_m(\beta)}(x, a)$ (see Algorithm 3).

**Algorithm 3** BINSEARCH
1: Input: bound type $\in$ {LOW, HIGH}, target pair $(x, a)$, history $H$, radius $\beta > 0$, precision $\epsilon > 0$
2: Based on bound type: $r \leftarrow 2$ if HIGH and $r \leftarrow -1$ if LOW
3: Let $R(f) = \sum_{(x', a', r') \in H} (f(x', a') - r')^2$
4: Let $R(f, w) = R(f) + \frac{w}{2} (f(x, a) - r)^2$
5: $w_L \leftarrow 0$, $w_H \leftarrow \beta$   // Invoke oracle twice
6: $f_L \leftarrow \arg\min_{f \in \mathcal{F}} R(f, w_L)$, $z_L \leftarrow f_L(x, a)$
7: $f_H \leftarrow \arg\min_{f \in \mathcal{F}} R(f, w_H)$, $z_H \leftarrow f_H(x, a)$
8: $R_{\min} \leftarrow R(f_L)$
9: $\beta \leftarrow (r - z_L)^3$
10: while $z_H - z_L > \epsilon$ and $w_H - w_L > \epsilon$ do
11:   $w \leftarrow (w_H + w_L)/2$   // Invoke oracle.
12:   $\hat{f} \leftarrow \arg\min_{f \in \mathcal{F}} R(f, w)$, $z \leftarrow \hat{f}(x, a)$
13:   if $R(\hat{f}) \ge R_{\min} + \beta$ then
14:     $w_H \leftarrow w$, $z_H \leftarrow z$
15:   else
16:     $w_L \leftarrow w$, $z_L \leftarrow z$
17:   end if
18: end while
19: return $z_H$.

In Appendix A.1 we prove that this strategy works as intended and computes the confidence bounds up to a precision of $\epsilon$ in $O(\log(1/\epsilon))$ iterations.

**Theorem 1.** Let $H_m = \{(x_s, a_s, r_s(a_s))\}_{s=1}^{\tau_m - 1}$. If the function class $\mathcal{F}$ is convex and closed under pointwise convergence, then the calls
$$z_{\mathrm{HIGH}} \leftarrow \mathrm{BINSEARCH}(\mathrm{HIGH}, (x, a), H_m, \beta, \epsilon), \qquad z_{\mathrm{LOW}} \leftarrow \mathrm{BINSEARCH}(\mathrm{LOW}, (x, a), H_m, \beta, \epsilon)$$
terminate after $O(\log(1/\epsilon))$ oracle invocations and satisfy $\mathrm{HIGH}_{\mathcal{F}_m(\beta)}(x, a) \le z_{\mathrm{HIGH}}$ and $\mathrm{LOW}_{\mathcal{F}_m(\beta)}(x, a) \ge z_{\mathrm{LOW}}$.

Compared to the procedure from Krishnamurthy et al. (2017), Algorithm 3 is much simpler and achieves an exponential improvement in terms of oracle calls, namely $O(\log(1/\epsilon))$ as opposed to $O(1/\epsilon)$, when $\mathcal{F}$ is convex. Compared to oracles for cost-sensitive classification, convexity is not a strong assumption for regression oracles. When $\mathcal{F}$ is not convex, reward bounds can be computed in $O(1/\epsilon)$ oracle calls (see Krishnamurthy et al., 2017).
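To illustrate Section 3.1, the sketch below computes an approximate HIGH bound in the spirit of Algorithm 3: it adds a single fake example $(x, a, 2)$ and bisects over its weight until the fit on the original history exceeds the tolerance $\beta$. The generic `oracle`/`sq_loss` callables, the bracketing-by-doubling step, and the iteration cap are simplifications assumed here, not the paper's exact procedure.

```python
def high_bound(oracle, sq_loss, H, x, a, beta, eps=1e-3, max_iters=100):
    """Approximate HIGH_{F(beta)}(x, a) using O(log(1/eps)) oracle calls.

    oracle(H)     : weighted least-squares fit over F; returns a predictor f(x, a)
    sq_loss(f, H) : unweighted squared loss of f on the history H
    H             : list of (weight, context, action, target) tuples
    """
    r_fake = 2.0                          # target outside [0, 1], as in Algorithm 3
    f0 = oracle(H)                        # unconstrained least-squares fit
    loss_min = sq_loss(f0, H)
    z_lo, w_lo = f0(x, a), 0.0

    # Find a weight w_hi whose fit already pays more than beta on the history,
    # so that [w_lo, w_hi] brackets the boundary of the version space.
    w_hi = 1.0
    f = oracle(H + [(w_hi, x, a, r_fake)])
    while sq_loss(f, H) <= loss_min + beta and w_hi < 1e8:
        w_hi *= 2.0
        f = oracle(H + [(w_hi, x, a, r_fake)])
    z_hi = f(x, a)

    for _ in range(max_iters):
        if z_hi - z_lo <= eps:
            break
        w_mid = 0.5 * (w_lo + w_hi)
        f = oracle(H + [(w_mid, x, a, r_fake)])
        if sq_loss(f, H) <= loss_min + beta:
            w_lo, z_lo = w_mid, f(x, a)   # still inside the radius: raise the feasible side
        else:
            w_hi, z_hi = w_mid, f(x, a)   # outside the radius: lower the infeasible side
    return z_hi                           # upper estimate of HIGH_{F(beta)}(x, a)
```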
## 4. Regret Guarantees

In this section we provide regret guarantees for RegCB (Algorithm 1 and Algorithm 2). Note that RegCB is not minimax optimal: while it can obtain $\widetilde{O}\big(\sqrt{KT \log|\mathcal{F}|}\big)$ regret or even logarithmic regret under certain distributional assumptions, which we describe shortly, for some instances it can make as many as $|\mathcal{F}|$ mistakes, which is suboptimal:

**Proposition 1.** For every $\epsilon \in (0,1]$ and $N \in \mathbb{N}$ there exists a class of reward predictors satisfying Assumption 1 with $|\mathcal{F}| = N + 1$ and a distribution for which both Algorithms 1 and 2 have regret lower bounded by $(1 - \epsilon)\min\{N, \Omega(T)\}$.

Proposition 1 is proved in Appendix A.2. The proof builds on a well-known albeit rather pathological instance. In contrast, our strong empirical results in the following section show that such instances are not encountered in practice. In order to understand the typical behavior of such algorithms, prior works have considered structural assumptions such as finite eluder dimension (Russo & Van Roy, 2013) or disagreement coefficients (Hanneke, 2014; Krishnamurthy et al., 2017). In the next two subsections, we use similar ideas to analyze the regret incurred by our algorithm. We assume that $\mathrm{HIGH}_{\mathcal{F}_m}$ and $\mathrm{LOW}_{\mathcal{F}_m}$ are computed exactly, but the extension to the approximate case is straightforward.

### 4.1. Disagreement-based Analysis

Disagreement coefficients come from the active learning literature (Hanneke, 2014), and roughly assume that, given a set of functions which fit the historical data well, the probability that these functions make differing predictions on a new example is small. This rules out the bad case of Proposition 1, where a near-optimal predictor significantly disagrees with the others on each context. Our development in this subsection largely follows Krishnamurthy et al. (2017), with appropriate modifications to translate from active learning to contextual bandits. We begin with a formal definition of the disagreement coefficient.

**Definition 1** (Disagreement Coefficient). The disagreement coefficient for $\mathcal{F}$ (with respect to $\mathcal{D}_{\mathcal{X}}$) is defined as
$$\theta_0 = \sup_{\delta > 0,\, \varepsilon > 0} \frac{\delta}{\varepsilon}\, \Pr_{x \sim \mathcal{D}_{\mathcal{X}}}\Big[ x \in \mathrm{Dis}(\mathcal{F}(\varepsilon)) \text{ and } \exists a \in \mathcal{A}_{\mathcal{F}(\varepsilon)}(x) : W_{\mathcal{F}(\varepsilon)}(x, a) > \delta \Big].$$
Here $\mathcal{F}(\varepsilon)$ is the set of all predictors $f$ whose greedy policies have regret at most $\varepsilon$, $\mathrm{Dis}(\mathcal{F}(\varepsilon))$ is the set of $x$'s where the greedy policies of at least two functions in $\mathcal{F}(\varepsilon)$ choose different actions, $\mathcal{A}_{\mathcal{F}'}(x) = \bigcup_{f \in \mathcal{F}'} \arg\max_{a \in \mathcal{A}} f(x, a)$, and $W_{\mathcal{F}'}(x, a)$ is the difference between the upper and lower bounds, $\mathrm{HIGH}_{\mathcal{F}'}(x, a) - \mathrm{LOW}_{\mathcal{F}'}(x, a)$. Formal definitions of these quantities can be found in Appendix A.3.

Informally, the disagreement coefficient is small if on most contexts either all $f \in \mathcal{F}(\varepsilon)$ choose the same action according to their greedy policies, or all actions chosen by those policies have a low range of predicted rewards. The following theorem provides regret bounds in terms of the disagreement coefficient. In all theorems we use $\widetilde{O}$ to suppress polynomial terms in $\log T$, $\log K$, and $\log(1/\delta)$, where $\delta$ is the failure probability. Moreover, all results can be improved to be logarithmic (in $T$) under the standard Massart noise condition (see the appendix for details).

**Theorem 2.** With $\beta_m = \frac{(M - m + 1) C_\delta}{\tau_m - 1}$ and $C_\delta = 16 \log\frac{2|\mathcal{G}| K T^2}{\delta}$, Algorithm 1 with Option I incurs
$$\mathrm{Reg}_T = \widetilde{O}\Big( T^{3/4} (\log|\mathcal{G}|)^{1/4} \sqrt{\theta_0 K} \Big)$$
with probability at least $1 - \delta$.
See Theorem 5 in Appendix A.3 for the full version of this theorem, which applies to infinite classes and additionally obtains faster rates under the Massart noise condition.

**Discussion.** Theorem 2 critically uses the product class structure, specifically the fact that the set $A_t$ computed by the algorithm coincides with the disagreement set $\mathcal{A}_{\mathcal{F}_m}(x_t)$ for $t \in \{\tau_m, \ldots, \tau_{m+1} - 1\}$. This is true for product classes, but not necessarily for general (non-product) predictor classes. Computing the disagreement set efficiently for non-product classes is a challenge for future work. While bounding the disagreement coefficient a priori often requires strong assumptions on the model class and the distribution, the size of the disagreement set can be easily checked empirically under the product class assumption, and we include this diagnostic in our experimental results. Finally, while the disagreement coefficient enables the analysis of Algorithm 1, it is not obvious how to use it to analyze Algorithm 2. Our analysis crucially requires that any plausibly optimal action $a$ be chosen with a reasonable probability, something which the optimistic algorithm fails to ensure.

### 4.2. Moment-based Analysis

The disagreement-based analysis of Theorem 2 is not entirely satisfying, because even for linear predictors (e.g., as in LinUCB, Chu et al. 2011), fairly strong assumptions on $\mathcal{D}_{\mathcal{X}}$ (e.g., log-concavity) are required to bound the disagreement coefficient $\theta_0$ (Hanneke, 2014). To generalize the analysis to richer-than-linear classes without distributional assumptions on the contexts, prior work has used the notion of eluder dimension (Russo & Van Roy, 2013). It remains challenging, however, to show examples with a small eluder dimension beyond linearly parameterized functions. In addition, taking the worst case over all histories, as in eluder dimension, is overly pessimistic for i.i.d. contextual bandits. To address the shortcomings of both the disagreement-based analysis and the eluder dimension for i.i.d. settings, we define two new distributional properties which we will use to analyze the regret of both of our algorithms.

**Definition 2** (Surprise bound). The surprise bound $L_1 > 0$ is the smallest constant such that for all $f \in \mathcal{F}$, $x \in \mathcal{X}$, and $a \in \mathcal{A}$, the gap $(f(x, a) - f^\star(x, a))^2$ is at most
$$L_1 \; \mathbb{E}_{x' \sim \mathcal{D}_{\mathcal{X}}} \mathbb{E}_{a' \sim \mathrm{Unif}(\mathcal{A})} \big( f(x', a') - f^\star(x', a') \big)^2.$$

The surprise bound is small if functions with a small expected squared error relative to $f^\star$ (under a uniform choice of actions) do not encounter a much larger squared error on any single context-action pair. The second quantity, which we call the implicit exploration coefficient (or IEC), relates the expected regression error under actions chosen by the optimal policy to the worst-case error on any other context-action pair. For $\lambda \in (0, 1]$ define $U_\lambda(a) = \{x : f^\star(x, a) \ge f^\star(x, a') + \lambda \text{ for all } a' \ne a\}$.

**Definition 3** (Implicit exploration coefficient, IEC). For any $\lambda \in (0, 1]$, the implicit exploration coefficient $L_{2,\lambda} > 0$ is the smallest constant such that for all $f \in \mathcal{F}$, $x \in \mathcal{X}$, and $a \in \mathcal{A}$, the gap $(f(x, a) - f^\star(x, a))^2$ is at most
$$L_{2,\lambda} \; \mathbb{E}_{x' \sim \mathcal{D}_{\mathcal{X}}} \mathbb{E}_{a' \sim \mathrm{Unif}(\mathcal{A})} \Big[ \mathbf{1}\{x' \in U_\lambda(a')\} \big( f(x', a') - f^\star(x', a') \big)^2 \Big]. \tag{1}$$

We now make two remarks about these definitions and their impact on the performance of Algorithms 1 and 2.

- By definition, $L_{2,\lambda}$ is non-decreasing in $\lambda$. For Algorithm 1 we can simply use $\lambda = 0$, for which $L_{2,0}$ is defined by replacing the right-hand side of (1) with $\frac{L_{2,0}}{K}\, \mathbb{E}_{x' \sim \mathcal{D}_{\mathcal{X}}}\big[ (f(x', \pi^\star(x')) - f^\star(x', \pi^\star(x')))^2 \big]$. The analysis of Algorithm 2 requires $\lambda > 0$, and this $\lambda$ must be used to tune the algorithm's warm-start period.
- We always have $L_1 \le L_{2,0}$, but $L_1$ may be much smaller. $L_1$ appears in the regret bound of Algorithm 2, but not Algorithm 1.

We now state the regret bound for Algorithm 1 with a general class $\mathcal{F}$, and employ the shorthand $C'_\delta = 16 \log\frac{2|\mathcal{F}| T^2}{\delta}$.

**Theorem 3.** With $\beta_m = \frac{(M - m + 1) C'_\delta}{\tau_m - 1}$, Algorithm 1 with Option II incurs
$$\mathrm{Reg}_T = \widetilde{O}\Big( \sqrt{T L_{2,0} \log|\mathcal{F}|} \Big)$$
with probability at least $1 - \delta$.

We now move on to describe the performance guarantee for Algorithm 2. Because this optimistic strategy does not explore as readily as the elimination-based strategy of Algorithm 1, the analysis requires both that (i) the IEC $L_{2,\lambda}$ be invoked for some $\lambda > 0$ and (ii) that the algorithm use a warm-start period whose size grows as $1/\lambda^2$.

**Theorem 4.** With $\beta_m = \frac{(M - m + 1) C'_\delta}{\tau_m - 1}$ and $M_0 = 2 + \log_2\!\Big(1 + \frac{(2M + 3) L_1 C'_\delta}{\lambda^2}\Big)$ for any $\lambda \in (0,1)$, Algorithm 2 incurs
$$\mathrm{Reg}_T = \widetilde{O}\Big( L_1 \log|\mathcal{F}| + \sqrt{T L_{2,\lambda} \log|\mathcal{F}|} \Big)$$
with probability at least $1 - \delta$.

As Algorithm 2 requires a warm start, the regret bounds of Theorem 4 are always worse than those of Theorem 3. Appendix A.4 contains full versions of these theorems, Theorem 6 and Theorem 7, which obtain faster rates under the Massart noise condition and apply to infinite classes.

**Linear classes.** For concreteness, let us discuss the regret of both algorithms in a linear setting with a fixed feature map $\phi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^d$ and $\mathcal{F} = \{(x, a) \mapsto \langle w, \phi(x, a) \rangle : w \in \mathcal{W}\}$ for some $\mathcal{W} \subseteq \mathbb{R}^d$ (e.g., as in LinUCB). In the basic $\ell_2$-bounded case, $L_1$ and $L_{2,\lambda}$ can be bounded in terms of the minimum eigenvalues of $\mathbb{E}_x[\phi(x, a)\phi(x, a)^\top]$ and $\mathbb{E}_x\big[\mathbf{1}\{x \in U_\lambda(a)\}\phi(x, a)\phi(x, a)^\top\big]$, respectively. When predictors are $s$-sparse we can instead obtain bounds in terms of the minimum restricted eigenvalue for $2s$-sparse vectors, $\psi(A) = \min_{w \ne 0 : \|w\|_0 \le 2s} \frac{w^\top A w}{w^\top w}$ (Raskutti et al., 2010). For Algorithm 1 this yields a near dimension-independent bound on $\mathrm{Reg}_T$ of $\widetilde{O}\Big(\sqrt{KT \log d \,\big/\, \psi\big(\mathbb{E}_x[\phi(x, \pi^\star(x))\phi(x, \pi^\star(x))^\top]\big)}\Big)$ (see Proposition 3, Lemma 9, and Theorem 3 in the appendix). This improves upon the moment-matrix conditions of Bastani & Bayati (2015), although our algorithm requires nonconvex optimization oracles (and, since the class $\mathcal{F}$ is non-convex, the slower binary search algorithm of Krishnamurthy et al., 2017). Note that without the scaling with $K$ as in our result, a $\sqrt{d}$ dependence is unavoidable (Abbasi-Yadkori et al., 2012). The result highlights the strengths of our analysis in the best case compared with eluder dimension, which does not adapt to sparsity structures. On the other hand, for the standard LinUCB setting, our result is inferior by at least a factor of $K$.

**Discussion.** Our analysis is influenced by the results of Bastani & Bayati (2015) for the (high-dimensional) linear setting, but extends to general classes $\mathcal{F}$; when applied to Algorithm 1 with linear classes, the assumed bound on $L_{2,\lambda}$ is weaker than their "diversity condition". Similar assumptions have been used to analyze purely greedy linear contextual bandits (Bastani et al., 2017; Kannan et al., 2018); our assumptions are strictly weaker.

## 5. Experiments

We compared our new algorithms with existing oracle-based alternatives. In addition to showing that RegCB (which refers collectively to both Algorithms 1 and 2) has strong empirical performance, our experiments provide a more extensive empirical study of oracle-based contextual bandit algorithms than any past works (e.g., Agarwal et al., 2014; Krishnamurthy et al., 2016). Descriptions of the datasets, benchmark algorithms, and oracle configurations, as well as further experimental results, are included in Appendix B.
**Datasets.** We begin with 10 datasets with full reward information and simulate bandit feedback by withholding the rewards for actions not selected by the algorithm. We use two large-scale learning-to-rank datasets, Microsoft MSLR-WEB30k (mslr) (Qin & Liu, 2010) and Yahoo! Learning to Rank Challenge V2.0 (yahoo) (Chapelle & Chang, 2011), which were previously used to evaluate contextual semibandits (Krishnamurthy et al., 2016). We also use eight classification datasets from the UCI repository (Lichman, 2013), summarized in Table 1 of Appendix B.1. The ranking datasets have natural rewards (relevances), but the rewards for the classification datasets always have multiclass structure (1 for the correct action and 0 for all others). To ensure that we evaluate the full generality of the CB setting, we create eight noisy UCI datasets by sampling new rewards for the datasets according to a noisy reward matrix model described in Appendix B. This yields an additional 8 datasets, for a total of 18. On each dataset we consider several replicates obtained by randomly permuting examples and, on noisy UCI, also randomly generating rewards. All the methods are evaluated on the same set of replicates.

**Algorithms.** We evaluate Algorithms 1 and 2 against three baselines, all based on various optimization-oracle assumptions. The first two are agnostic baselines: ε-Greedy (Langford & Zhang, 2008) and the minimax-optimal ILOVETOCONBANDITS (ILTCB) strategy of Agarwal et al. (2014), for which we use an implementation available at https://github.com/akshaykr/oracle_cb, also used by Krishnamurthy et al. (2016). ε-Greedy and ILTCB both assume cost-sensitive classification oracles and come with theoretical guarantees. The third baseline is a bootstrapping-based exploration strategy of Dimakopoulou et al. (2017) (Bootstrap-TS), which uses bootstrapping to estimate confidence intervals and then performs Thompson sampling to select an action based on the intervals. This algorithm represents a computationally tractable alternative to Thompson sampling, as it works in the regression-oracle model we consider here, but it does not have a theoretical analysis. (It is not known how to implement the standard formulation of Thompson sampling for contextual bandits, e.g., Russo & Van Roy 2013, with optimization oracles.) Note that the LinUCB algorithm (Chu et al., 2011; Abbasi-Yadkori et al., 2011), which is a natural baseline as well, coincides with our Algorithm 2 (with a linear oracle), so we only plot the performance of RegCB with a linear oracle. All of the algorithms update on an epoch schedule with epoch lengths of $2^{i/2}$, which is a theoretically rigorous choice for each algorithm.

**Oracles.** We consider two baseline predictor classes $\mathcal{F}$: $\ell_2$-regularized linear functions (Linear) and gradient-boosted depth-5 regression trees (GB5) (Friedman, 2001). For the regularized linear class, Algorithm 2 is equivalent to LinUCB on an epoch schedule (more precisely, to the well-known OFUL variant of LinUCB; Abbasi-Yadkori et al., 2011). See Appendix B.3 for details. When running both RegCB variants with the GB5 oracle, we use a simple heuristic to substantially speed up the computation. At the beginning of each epoch $m$, we find the best regression-tree ensemble on the dataset so far (i.e., with respect to $\widehat{R}_m$). Throughout the epoch, we keep the structure of the ensemble fixed, and in each call to $\mathrm{ORACLE}(H)$ we only re-optimize the predictions in the leaves. This can be solved in closed form, similar to LinUCB, so the full binary search procedure (Algorithm 3) does not need to be run.
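The following sketch illustrates the kind of leaf re-optimization heuristic described above, using scikit-learn: the tree structure is fit once per epoch, and each subsequent oracle call reduces to a weighted least-squares problem over leaf values (encoded here as one-hot leaf-membership features). The encoding, ridge term, and function names are illustrative; this is not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OneHotEncoder

def fit_epoch_ensemble(X, y, n_trees=50, depth=5):
    """Fit the GB5 ensemble once at the start of an epoch; its structure stays fixed."""
    gb = GradientBoostingRegressor(n_estimators=n_trees, max_depth=depth)
    gb.fit(X, y)
    enc = OneHotEncoder(handle_unknown="ignore")
    enc.fit(gb.apply(X).reshape(len(X), -1))     # leaf index of each example in each tree
    return gb, enc

def leaf_features(gb, enc, X):
    """Indicator features for the leaves each example falls into."""
    return enc.transform(gb.apply(X).reshape(len(X), -1)).toarray()

def reoptimize_leaves(gb, enc, X, y, sample_weight, ridge=1e-3):
    """Closed-form weighted least squares over leaf values only (structure unchanged)."""
    Phi = leaf_features(gb, enc, X)
    w = np.asarray(sample_weight, dtype=float)
    y = np.asarray(y, dtype=float)
    A = Phi.T @ (Phi * w[:, None]) + ridge * np.eye(Phi.shape[1])
    b = Phi.T @ (w * y)
    leaf_values = np.linalg.solve(A, b)
    return lambda X_new: leaf_features(gb, enc, X_new) @ leaf_values
```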
**Parameter Tuning.** We evaluate each algorithm for eight exponentially spaced parameter values across five replicates. For ε-Greedy we tune the exploration constant ε, and for ILTCB we tune a certain smoothing parameter (see Appendix B). For Algorithms 1 and 2 we set $\beta_m = \beta$ for all $m$ and tune $\beta$. For Algorithm 2 we use a warm start of 0. We tune a confidence parameter similar to $\beta$ for Bootstrap-TS.

**Evaluation.** Each dataset is split into training data, on which the algorithm receives one example at a time and must predict online, and a holdout validation set. Validation is performed by simulating the algorithm's predictions on examples from the holdout set without allowing the algorithm to incorporate these examples. We also plot the validation reward of a supervised baseline obtained by training the oracle (either Linear or GB5) on the entire training set at once (including rewards for all actions). For Algorithms 1 and 2 we show average reward at various numbers of training examples for the best fixed parameter value on each dataset. For the baselines, we take the pointwise maximum of the average reward across all parameter values at each number of examples. Thus, the curves for our methods correspond to an actual run of the algorithm, while the baselines are an upper envelope aggregating multiple parameter values.

**Results: Performance.** Figure 1 shows the average reward of each algorithm on a holdout validation set for three representative datasets: letter from UCI, letter-noise (the variant with simulated rewards), and yahoo. RegCB (both Algorithms 1 and 2) outperforms all baselines on the unmodified UCI datasets (e.g., letter in Figure 1). On the noisy variants (e.g., letter+N in Figure 1), the performance of the ILTCB and Bootstrap-TS benchmarks improves significantly, with Bootstrap-TS slightly edging out the rest of the algorithms. On the yahoo ranking dataset (Figure 1, right), the ordering of the algorithms in performance is similar to the noisy UCI datasets. Validation performance plots for all datasets are in Appendix B. Overall, the RegCB methods and Bootstrap-TS generally dominate the field. While Bootstrap-TS can outperform the RegCB methods when using GB5 models, the gap is typically quite small. For linear models, the RegCB methods generally outperform Bootstrap-TS, hinting that the approximate binary search might be hurting RegCB with GB5 models. We also observe that when the RegCB methods outperform Bootstrap-TS, the gap is often quite large. We will see further evidence of this behavior in the next set of results.

**Results: Aggregate Performance.** To rigorously draw conclusions about overall performance, Figure 2 aggregates performance across all datasets. We compute a normalized relative loss for each algorithm by rescaling the validation reward (computed as in Figure 1) so that, at each round, the best performing algorithm has loss 0 and the worst has loss 1. In each plot of Figure 2 we consider normalized relative losses at a specific cutoff time (1000 examples in the left plot, and all examples in the center and right), and for each method we plot the number of datasets where it achieves loss below a threshold, as a function of the threshold. Thus, curves towards the top left corner correspond to methods that achieve lower relative loss on more datasets. The intercept at loss 0 is the number of datasets where an algorithm is the best, and the intercept at 0.99 is the number of datasets where it is not the worst (so the distance from the top is the number of datasets where it is the worst). Solid lines are runs with the GB5 oracle and dashed lines are with the Linear oracle.
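As a small illustration of the aggregation just described, the helpers below compute the normalized relative loss at a fixed cutoff and the per-threshold dataset counts plotted in Figure 2; the dictionary-based data layout is an assumption of this sketch.

```python
import numpy as np

def normalized_relative_loss(reward_by_alg):
    """Map each algorithm's validation reward (at a fixed cutoff) to a loss in [0, 1]:
    the best algorithm gets loss 0, the worst gets 1 (all get 0 in case of a tie)."""
    vals = np.array(list(reward_by_alg.values()), dtype=float)
    best, worst = vals.max(), vals.min()
    span = best - worst
    if span == 0.0:
        return {alg: 0.0 for alg in reward_by_alg}
    return {alg: (best - r) / span for alg, r in reward_by_alg.items()}

def datasets_below_threshold(loss_per_dataset, alg, threshold):
    """Number of datasets on which `alg` has normalized relative loss <= threshold."""
    return sum(losses[alg] <= threshold for losses in loss_per_dataset)
```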
The aggregate performance with the GB5 oracle across all datasets can be briefly summarized as follows: RegCB always beats ε-Greedy and ILTCB, but sometimes loses out to Bootstrap-TS, and Bootstrap-TS itself sometimes underperforms relative to the other baselines, especially on the UCI datasets. Even when RegCB is not the best, it is almost always within 20% of the best. The elimination and optimistic variants of RegCB have comparable performance, with elimination performing slightly better in aggregate.

The RegCB algorithms with the GB5 oracle also dominate the ε-Greedy, ILTCB, and Bootstrap-TS baselines when the latter are equipped with Linear oracles (the dashed lines in Figure 2). When the RegCB algorithms use the Linear oracle, they also dominate the baselines with the Linear oracle across all datasets, including Bootstrap-TS (the aggregate plots for RegCB with the Linear oracle can be found in Appendix B, along with additional aggregate plots). This suggests that the gap between RegCB and Bootstrap-TS for GB5 may be due to the approximation of fixing the ensemble structure in each epoch, as noted earlier.

**Results: Confidence Width.** The analysis of RegCB relies on assumptions on $\mathcal{D}$ (disagreement coefficient or moment parameters) that are not easy to verify. The main role of these parameters is to control the rate at which the confidence width $W_{\mathcal{F}_m}(x_t, a) = \mathrm{HIGH}_{\mathcal{F}_m}(x_t, a) - \mathrm{LOW}_{\mathcal{F}_m}(x_t, a)$ used in RegCB shrinks, since small widths imply that the algorithm makes good decisions and thus has low regret. To investigate whether the width indeed shrinks empirically, we compute $W_{\mathcal{F}_m}(x_t, a)$ on each dataset for Algorithm 2 and Bootstrap-TS, where $a$ is the optimistic action with the highest upper confidence bound under each algorithm. Finally, for both Algorithm 2 and Bootstrap-TS we compute the size of the disagreement set $A_t$, defined in Algorithm 1, which measures how many actions the algorithm thinks are plausibly best. (This set is well-defined for both RegCB-Opt and Bootstrap-TS even though neither algorithm instantiates it explicitly. For the yahoo and mslr datasets, this $A_t$ is technically a lower bound on the true disagreement set size $|\mathcal{A}_{\mathcal{F}_m}(x_t)|$, because our classes $\mathcal{F}$ do not have product structure on these datasets; see Section 4.1.)

Figure 3 shows width and disagreement for a representative sample of datasets under the GB5 oracle; the remaining datasets are in Appendix B. The figure suggests that our distributional assumptions are reasonable for real-world datasets. In particular, for our algorithm, the width decays roughly as $T^{-1/3}$ for letter and $T^{-1/2}$ for letter+N and yahoo. Interestingly, the best hyper-parameter setting for Bootstrap-TS on letter yields a low but essentially constant (i.e., not shrinking) width, and obtains a poor validation reward in Figure 1 (left). This suggests that while the Bootstrap-TS confidence intervals are small, they may not be faithful in the sense of containing $f^\star(x, a)$.
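The two diagnostics plotted in Figure 3 can be computed directly from the per-action confidence bounds; the sketch below shows one way to do so, with array shapes and names chosen purely for illustration.

```python
import numpy as np

def width_and_disagreement(high, low):
    """Diagnostics for a single round t.

    high, low : length-K arrays with HIGH_{F_m}(x_t, a) and LOW_{F_m}(x_t, a).
    Returns the width of the optimistic action and the size of the disagreement
    set A_t = {a : HIGH(x_t, a) >= max_a' LOW(x_t, a')}.
    """
    high = np.asarray(high, dtype=float)
    low = np.asarray(low, dtype=float)
    a_opt = int(np.argmax(high))                    # action with the highest upper bound
    width_opt = float(high[a_opt] - low[a_opt])     # W_{F_m}(x_t, a_opt)
    disagreement = int(np.sum(high >= low.max()))   # |A_t|
    return width_opt, disagreement
```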
Figure 1. Validation performance for three representative datasets as a function of the number of rounds t of interaction. The number of rounds t is on a log scale. For each dataset, we show separately the performance with the GB5 oracle and the Linear oracle.

Figure 2. Aggregate performance across all datasets, at various sample sizes; solid lines: GB5 oracle; dashed lines: Linear oracle. Left: all datasets (the UCI datasets, their noisy variants, and the Microsoft and Yahoo ranking datasets) at 1000 examples (datasets with fewer examples dropped). Center: all datasets at their final round. Right: unmodified UCI datasets at their final round.

Figure 3. For each dataset, disagreement set size as a function of the number of rounds t (with t on a log scale), and the log-log plot of the width of the optimistic action (the action chosen by Algorithm 2) as a function of t. Plots are averaged using a sliding window of length 20. Black lines on the width plots are best linear fits, whose slopes give the rate of the width decay as follows: letter/Bootstrap-TS: -0.05, letter/RegCB: -0.34, letter-noise/Bootstrap-TS: -0.33, letter-noise/RegCB: -0.51, yahoo/Bootstrap-TS: -0.26, yahoo/RegCB: -0.52.

## 6. Conclusion and Discussion

This work serves as a starting point for what we hope will be a fruitful line of research on oracle-efficient contextual bandit algorithms in realizability-based settings. We have shown that the RegCB family of algorithms has strong empirical performance and enjoys nice theoretical properties.

These results suggest several compelling future directions. First, is there a regression-oracle-based algorithm that achieves the optimal $\widetilde{O}\big(\sqrt{KT \log|\mathcal{F}|}\big)$ regret? For example, is it possible to oraclize the regressor elimination of Agarwal et al. (2012)? Second, given the competitive empirical performance of Bootstrap-TS, are there reasonable assumptions as in Section 4 under which it can be analyzed? There is recent work in this direction for linear models (Lu & Van Roy, 2017). Finally, randomizing uniformly or putting all the mass on the optimistic choice are two extreme cases of choosing amongst the plausibly optimal actions. Are there better randomization schemes that lead to stronger regret guarantees?

## Acknowledgements

We thank Akshay Krishnamurthy and Alberto Bietti for helpful discussions.

## References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312-2320, 2011.

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1-9, 2012.

Agarwal, A., Dudík, M., Kale, S., Langford, J., and Schapire, R. E. Contextual bandit learning with predictable rewards. In International Conference on Artificial Intelligence and Statistics, pp. 19-26, 2012.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638-1646, 2014.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002b.

Bastani, H. and Bayati, M. Online decision-making with high-dimensional covariates. 2015.

Bastani, H., Bayati, M., and Khosravi, K. Exploiting the natural exploration in contextual bandits. arXiv preprint arXiv:1704.09011, 2017.
Beygelzimer, A. and Langford, J. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129-138. ACM, 2009.

Chapelle, O. and Chang, Y. Yahoo! Learning to Rank Challenge overview. In Proceedings of the Learning to Rank Challenge, pp. 1-24, 2011.

Chu, W., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pp. 208-214, 2011.

Dimakopoulou, M., Athey, S., and Imbens, G. Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077, 2017.

Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 1097-1104. Omnipress, 2011.

Filippi, S., Cappé, O., Garivier, A., and Szepesvári, C. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pp. 586-594, 2010.

Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pp. 1189-1232, 2001.

Hanneke, S. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.

Kannan, S., Morgenstern, J., Roth, A., Waggoner, B., and Wu, Z. S. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. arXiv e-prints, January 2018.

Krishnamurthy, A., Agarwal, A., and Dudík, M. Contextual semibandits via supervised learning oracles. In Advances in Neural Information Processing Systems, pp. 2388-2396, 2016.

Krishnamurthy, A., Agarwal, A., Huang, T.-K., Daumé III, H., and Langford, J. Active learning for cost-sensitive classification. International Conference on Machine Learning, 2017.

Langford, J. and Zhang, T. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pp. 817-824, 2008.

Li, L., Lu, Y., and Zhou, D. Provable optimal algorithms for generalized linear contextual bandits. International Conference on Machine Learning, 2017.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Lu, X. and Van Roy, B. Ensemble sampling. In Advances in Neural Information Processing Systems, pp. 3260-3268, 2017.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.

Qin, T. and Liu, T.-Y. MSLR: Microsoft Learning to Rank dataset, 2010. URL http://www.microsoft.com/en-us/research/project/mslr/.

Raskutti, G., Wainwright, M. J., and Yu, B. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(Aug):2241-2259, 2010.

Rockafellar, R. T. Convex Analysis. Princeton University Press, 1970.

Russo, D. and Van Roy, B. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pp. 2256-2264, 2013.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.