# Adapting to Misspecification in Contextual Bandits

Dylan J. Foster (dylanf@mit.edu) · Claudio Gentile (cgentile@google.com) · Mehryar Mohri (mohri@google.com) · Julian Zimmert (zimmert@google.com)

Massachusetts Institute of Technology · Google Research · Courant Institute of Mathematical Sciences

*34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.*

## Abstract

A major research direction in contextual bandits is to develop algorithms that are computationally efficient, yet support flexible, general-purpose function approximation. Algorithms based on modeling rewards have shown strong empirical performance, yet they typically require a well-specified model, and can fail when this assumption does not hold. Can we design algorithms that are efficient and flexible, yet degrade gracefully in the face of model misspecification? We introduce a new family of oracle-efficient algorithms for $\epsilon$-misspecified contextual bandits that adapt to unknown model misspecification in both finite- and infinite-action settings. Given access to an online oracle for square loss regression, our algorithm attains optimal regret, and in particular optimal dependence on the misspecification level, with no prior knowledge. Specializing to linear contextual bandits with infinite actions in $d$ dimensions, we obtain the first algorithm that achieves the optimal $O(d\sqrt{T} + \epsilon\sqrt{d}\,T)$ regret bound for unknown $\epsilon$. On a conceptual level, our results are enabled by a new optimization-based perspective on the regression-oracle reduction framework of Foster and Rakhlin [20], which we believe will be useful more broadly.

## 1 Introduction

The contextual bandit (CB) problem is an extension of the standard multi-armed bandit problem that is relevant to a variety of applications in practice, including health services [42], online advertisement [34, 4], and recommendation systems [8]. In the contextual bandit setting, at each round, the learner observes a feature vector (or context) and an action set. The learner must select an action from that set and only observes the reward of that action. To make its selection, the learner has access to a family of hypotheses (or policies), which map contexts to actions. The objective of the learner is to achieve cumulative reward close to that of the best hypothesis in hindsight for that specific sequence of contexts and action sets.

A common approach to the contextual bandit problem consists of reducing it to a supervised learning task such as classification or regression [32, 19, 6, 7, 41, 8, 35]. Recently, Foster and Rakhlin [20] proposed SquareCB, an efficient reduction from $K$-armed contextual bandits to square loss regression under realizability assumptions. One question left open by this work is whether their approach can be generalized to action spaces with many (or infinitely many) actions in $d$ dimensions. Another open question is whether one can seamlessly shift from realizability to misspecified models without requiring prior knowledge of the amount of misspecification. This is precisely the setup we study here, where the action set is large or infinite, but where the learner has a good feature representation available, up to some unknown amount of misspecification.

Adequately handling misspecification has been a subject of intense recent interest even for the simple special case of linear contextual bandits. Du et al. [18] questioned whether a "good" representation is indeed enough,
that is, whether we can learn efficiently even without realizability. Lattimore et al. [33] gave a positive answer to that question, provided the misspecification level $\epsilon$ is known in advance, and showed that the price of misspecification (in terms of regret) is roughly $\epsilon\sqrt{d}\,T$, where $d$ is the dimension and $T$ is the time horizon. However, they left adapting to unknown $\epsilon$ as an open question.

**Our results.** We provide an affirmative answer to all of these questions. We generalize SquareCB to infinite action sets, and use this strategy to adapt to unknown misspecification $\epsilon$ by combining it with a bandit model selection procedure akin to the one proposed by Agarwal et al. [9]. Our algorithm is oracle-efficient, and adapts to misspecification efficiently and optimally whenever it has access to an online oracle for square loss regression. When specialized to linear contextual bandits, it answers the open question of Lattimore et al. [33]. An important conceptual contribution of our work is to show that one can view the action selection scheme used by SquareCB as an approximation to a log-barrier-regularized optimization problem, which paves the way for a generalization to infinite action spaces. Another by-product of our results is a generalization of the original CORRAL algorithm [9] for combining bandit algorithms, which is simpler, more flexible, and enjoys improved logarithmic factors.

### 1.1 Related Work

The contextual bandit is a well-studied problem, and misspecification in bandits and reinforcement learning has been the subject of intense recent interest. We mention a few works which are closely related to our results. For linear bandits in $d$ dimensions, Lattimore et al. [33] gave an algorithm with regret $O(d\sqrt{T} + \epsilon\sqrt{d}\,T)$, and left adapting to unknown misspecification for changing action sets as an open problem. Concurrent work of Pacchiano et al. [37] solves this problem for the special case where contexts/action sets are stochastic, and also leverages CORRAL-type aggregation of contextual bandit algorithms. Our results resolve this question in the general adversarial setting. Within the literature on general-purpose contextual bandit algorithms, our approach builds on a recent line of research that provides reductions to offline/online square loss regression [21, 20, 38, 45, 23]. Besides the standard references on oracle-based agnostic contextual bandits (e.g., [32, 19, 6, 7]), $\epsilon$-misspecification is somewhat related to the recent stream of literature on bandits with adversarially corrupted feedback [36, 26, 13]. See the discussion in the supplementary material.

## 2 Problem Setting

We consider the following contextual bandit protocol. At every round $t = 1, \ldots, T$, the learner first observes a context $x_t \in \mathcal{X}$ and an action set $\mathcal{A}_t \subseteq \mathcal{A}$, where $\mathcal{A} \subseteq \mathbb{R}^d$ is a compact action space; for simplicity, we assume throughout that $\mathcal{A} = \{a \in \mathbb{R}^d : \|a\| \le 1\}$, but place no restriction on $(\mathcal{A}_t)_{t=1}^T$. Given the context and action set, the learner chooses an action $a_t \in \mathcal{A}_t$, then observes a stochastic loss $\ell_t \in [-1, +1]$ depending on the action selected. We assume that the sequence of context vectors $x_t$ and the associated sequence of action sets $\mathcal{A}_t$ are generated by an oblivious adversary. We let $\mu(a, x) := \mathbb{E}[\ell_t \mid x_t = x, a_t = a]$ denote the mean loss function, which is unknown to the learner. We adopt a semi-parametric approach to modeling the losses, in which $\mu(a, x)$ is modeled as (nearly) linear in the action $a$, but may depend on the context $x$ arbitrarily [20, 45, 14]. (A toy simulation of this protocol is sketched below.)
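To make the protocol concrete, the following toy environment simulates the interaction above with a (nearly) linear mean-loss function. This is a minimal sketch: the Gaussian contexts and action sets, the particular noise model, and the interface names (`observe`, `play`) are illustrative assumptions and not part of the paper's setup. A learner interacts by calling `observe`, choosing an action from the returned set, and calling `play`.

```python
import numpy as np

class LinearCBEnvironment:
    """Toy instance of the contextual bandit protocol of Section 2.

    The mean loss is mu(a, x) = <a, theta(x)> plus an optional perturbation
    modeling misspecification. Contexts, action sets, and the noise model
    are illustrative choices only.
    """

    def __init__(self, d, n_actions, theta, misspec=None, seed=0):
        self.d, self.n_actions = d, n_actions
        self.theta = theta                              # map x -> R^d
        self.misspec = misspec or (lambda a, x: 0.0)    # deviation from linearity
        self.rng = np.random.default_rng(seed)

    def observe(self, t):
        """Reveal the context x_t and the action set A_t (rows have norm <= 1)."""
        x = self.rng.standard_normal(self.d)
        A = self.rng.standard_normal((self.n_actions, self.d))
        A /= np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1.0)
        self._x, self._A = x, A
        return x, A

    def mu(self, a, x):
        """Mean loss mu(a, x), clipped so that realized losses stay in [-1, 1]."""
        return float(np.clip(a @ self.theta(x) + self.misspec(a, x), -0.9, 0.9))

    def play(self, t, a):
        """Return a stochastic loss in [-1, 1] whose conditional mean is mu(a, x_t)."""
        return self.mu(a, self._x) + float(self.rng.uniform(-0.1, 0.1))
```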
In particular, we assume the learner has access to a class of functions $\mathcal{F} \subseteq \{f : \mathcal{X} \to \mathbb{R}^d\}$, where for each $f \in \mathcal{F}$, $\langle a, f(x)\rangle$ attempts to predict the value of $\mu(a, x)$. In a well-specified/realizable setting, one would assume that there exists some $f^\star \in \mathcal{F}$ such that $\mu(a, x) = \langle a, f^\star(x)\rangle$. In this paper, we make no such assumption, but the regret incurred by our algorithms depends on how far this is from being true. For each $f \in \mathcal{F}$, we let $\pi_f(\cdot, \cdot)$ denote the induced policy, whose action at time $t$ is given by $\pi_f(x_t, \mathcal{A}_t) := \arg\min_{a\in\mathcal{A}_t}\langle a, f(x_t)\rangle$. The goal of the learner is to minimize its pseudoregret $\mathrm{Reg}(T)$ against the best unconstrained policy:
$$\mathrm{Reg}(T) := \mathbb{E}\Big[\sum_{t=1}^T \mu(a_t, x_t) - \inf_{a\in\mathcal{A}_t}\mu(a, x_t)\Big].$$
Here, and for the remainder of the paper, we use $\mathbb{E}[\cdot]$ to denote the expectation with respect to both the randomized choices of the learner and the stochastic realization of the losses $\ell_t$.

This setup recovers the usual finite-arm contextual bandit setting with $K$ arms by taking $\mathcal{A}_t = \{e_1, \ldots, e_K\}$. Another important special case is the well-studied linear contextual bandit setting, which corresponds to the case where $\mathcal{F}$ consists of constant vector-valued functions that do not depend on $x$. Specifically, for any $\Theta \subseteq \mathbb{R}^d$, we can take $\mathcal{F} = \{x \mapsto \theta \mid \theta \in \Theta\}$. In this case, the prediction $\langle a, f(x)\rangle$ simplifies to $\langle a, \theta\rangle$, a fixed linear function on the action space $\mathcal{A}$. This special case recovers the most widely studied version of linear contextual bandits [3, 12, 1, 15, 2, 10, 16], as well as Gaussian process extensions [39, 30, 17, 40].

### 2.1 Misspecification

Contextual bandit algorithms based on modeling rewards typically rely on the assumption of a well-specified model (or "realizability"): that is, the existence of a function $f^\star \in \mathcal{F}$ such that $\mu(a, x) = \langle a, f^\star(x)\rangle$ for all $a \in \mathcal{A}$ and $x \in \mathcal{X}$ [15, 1, 6, 21]. Since this assumption may not be realistic in practice, a more recent line of work has begun to develop algorithms for misspecified models. In particular, Crammer and Gentile [16], Ghosh et al. [25], Lattimore et al. [33], and Foster and Rakhlin [20] consider a uniform $\epsilon$-misspecified setting in which
$$\inf_{f\in\mathcal{F}}\;\sup_{a\in\mathcal{A},\,x\in\mathcal{X}} \big|\mu(a, x) - \langle a, f(x)\rangle\big| \le \epsilon, \tag{1}$$
for some misspecification level $\epsilon > 0$. Notably, Lattimore et al. [33] show that for the linear setting, regret must grow as $\Omega(d\sqrt{T} + \epsilon\sqrt{d}\,T)$. Since $d\sqrt{T}$ is the optimal regret for a well-specified model, $\epsilon\sqrt{d}\,T$ may be thought of as the price of misspecification.

In this paper, we consider a weaker, average-case notion of misspecification. Given a sequence $S = (x_1, \mathcal{A}_1), \ldots, (x_T, \mathcal{A}_T)$ of context-action set pairs, we define the average misspecification level $\epsilon_T(S)$ as
$$\epsilon_T(S) := \inf_{f\in\mathcal{F}}\bigg(\frac{1}{T}\sum_{t=1}^T \sup_{a\in\mathcal{A}_t}\big(\langle a, f(x_t)\rangle - \mu(a, x_t)\big)^2\bigg)^{1/2}. \tag{2}$$
This quantity measures the misspecification level for the specific sequence $S$ at hand. Of course, the uniform bound in Eq. (1) directly implies $\epsilon_T(S) \le \epsilon$ for all $S$, and $\epsilon_T(S) = 0$ whenever the model is well-specified. We provide regret bounds that optimally adapt to $\epsilon_T(S)$ for any given realization of the sequence $S$, with no prior knowledge of the misspecification level.

The issue of adapting to unknown misspecification has not been addressed even for the stronger uniform notion (1). Indeed, previous efforts typically use prior knowledge of $\epsilon$ to tune the exploration-exploitation scheme so as to encourage conservative exploration when misspecification is large; see Lattimore et al. [33, Appendix E], Foster and Rakhlin [20, Section 5.1], Crammer and Gentile [16, Section 4.2], and Zanette et al. [46] for examples. Naively adapting such schemes using, e.g., doubling tricks presents difficulties, because the quantity in Eq. (2) does not appear to be estimable without knowledge of $\mu$.
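To illustrate Eq. (2), the sketch below evaluates the average misspecification of a finite function class on a fixed sequence. It assumes finite action sets and, importantly, access to $\mu$ itself, which the learner does not have; this is exactly why $\epsilon_T(S)$ cannot simply be estimated and plugged into a doubling trick. All function and argument names here are illustrative.

```python
import numpy as np

def avg_misspecification(F, contexts, action_sets, mu):
    """Average misspecification eps_T(S) of Eq. (2) for a finite class F.

    F           : list of functions f(x) -> np.ndarray of shape (d,)
    contexts    : list of contexts x_1, ..., x_T
    action_sets : list of arrays A_t of shape (|A_t|, d)
    mu          : mean-loss function mu(a, x) -> float (unknown to the learner;
                  used here only to illustrate the definition)
    """
    T = len(contexts)
    best = np.inf
    for f in F:
        total = 0.0
        for x, A in zip(contexts, action_sets):
            theta = f(x)
            # worst-case squared prediction error over the action set A_t
            errs = [(A[i] @ theta - mu(A[i], x)) ** 2 for i in range(len(A))]
            total += max(errs)
        best = min(best, total / T)
    return float(np.sqrt(best))
```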
### 2.2 Regression Oracles

Following Foster and Rakhlin [20], we assume access to an online regression oracle $\mathrm{SqAlg}$, which is simply an algorithm for sequential prediction under the square loss, using $\mathcal{F}$ as a benchmark class. More precisely, the oracle operates under the following protocol: at each round $t \in [T]$, the algorithm receives a context $x_t \in \mathcal{X}$, outputs a predictor $\hat{y}_t \in \mathbb{R}^d$ (we interpret $\langle a, \hat{y}_t\rangle$ as the predicted loss for action $a$), then observes an action $a_t \in \mathcal{A}$ and a loss $\ell_t \in [-1, +1]$, and incurs the square loss $(\langle a_t, \hat{y}_t\rangle - \ell_t)^2$. (As in Foster and Rakhlin [20], the square loss itself does not play a crucial role, and can be replaced by any other loss that is strongly convex with respect to the learner's predictions.) Formally, we make the following assumption.

**Assumption 1.** The regression oracle $\mathrm{SqAlg}$ guarantees that for any (potentially adaptively chosen) sequence $\{(x_t, a_t, \ell_t)\}_{t=1}^T$,
$$\sum_{t=1}^T \big(\langle a_t, \hat{y}_t\rangle - \ell_t\big)^2 - \inf_{f\in\mathcal{F}}\sum_{t=1}^T \big(\langle a_t, f(x_t)\rangle - \ell_t\big)^2 \;\le\; \mathrm{Reg}_{\mathrm{Sq}}(T),$$
for some (non-data-dependent) upper bound $\mathrm{Reg}_{\mathrm{Sq}}(T)$.

For the finite-action setting, this definition coincides with that of Foster and Rakhlin [20]. To simplify the presentation of our results, we assume throughout the paper that $\mathrm{Reg}_{\mathrm{Sq}}(T)$ is a non-decreasing function of $T$.

While this type of oracle suffices for all of our results, our algorithms are stated more naturally in terms of a stronger oracle which supports weighted online regression. In this model, we follow the same protocol as in Assumption 1, except that at each time $t$ the regression oracle observes a weight $w_t \ge 0$ at the same time as the context $x_t$, and the loss incurred is now $w_t\,(\langle a_t, \hat{y}_t\rangle - \ell_t)^2$. For technical reasons, we also allow the oracle for this model to be randomized. We make the following assumption.

**Assumption 2.** The weighted regression oracle $\mathrm{SqAlg}$ guarantees that for any (potentially adaptively chosen) sequence $\{(w_t, x_t, a_t, \ell_t)\}_{t=1}^T$,
$$\mathbb{E}\Big[\sum_{t=1}^T w_t\big(\langle a_t, \hat{y}_t\rangle - \ell_t\big)^2 - \inf_{f\in\mathcal{F}}\sum_{t=1}^T w_t\big(\langle a_t, f(x_t)\rangle - \ell_t\big)^2\Big] \;\le\; \mathbb{E}\Big[\max_{t\in[T]} w_t\Big]\cdot\mathrm{Reg}_{\mathrm{Sq}}(T),$$
for some upper bound $\mathrm{Reg}_{\mathrm{Sq}}(T)$, where the expectation is taken with respect to the oracle's randomization.

We show in the supplementary material that any unweighted regression oracle satisfying Assumption 1 can be transformed into a randomized oracle for weighted regression that satisfies Assumption 2, with no overhead in runtime. Hence, to simplify exposition, for the remainder of the paper we state our results in terms of weighted regression oracles satisfying Assumption 2.

Online regression has been well studied, and many efficient algorithms are known for standard classes $\mathcal{F}$. One example, which is important for our applications, is the case where $\mathcal{F}$ is linear.

**Example 1 (Linear models).** Suppose $\mathcal{F} = \{x \mapsto \theta \mid \theta \in \Theta\}$, where $\Theta \subseteq \mathbb{R}^d$ is a convex set with $\|\theta\| \le 1$. Then the online Newton step algorithm [27] satisfies Assumption 1 with $\mathrm{Reg}_{\mathrm{Sq}}(T) = O(d\log(T))$, and via our reduction it can be augmented to satisfy Assumption 2.

Further examples include kernels [44], generalized linear models [28], and standard nonparametric classes [24]. We refer to Foster and Rakhlin [20] for a more extensive discussion.
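For the linear class of Example 1, a simple weighted online least-squares oracle can serve as a stand-in for the interface of Assumption 2. This is a minimal sketch: the paper uses online Newton step [27] together with a generic unweighted-to-weighted reduction, so the regret guarantee of the ridge-style update below is not the one quoted in Example 1, and the regularization constant is an arbitrary choice.

```python
import numpy as np

class WeightedOnlineRidgeOracle:
    """Weighted online least-squares oracle for the linear class of Example 1.

    Stand-in for the weighted-oracle interface of Assumption 2 (predict a
    parameter vector, then update on a weighted (action, loss) pair).
    """

    def __init__(self, d, reg=1.0):
        self.A = reg * np.eye(d)   # regularized second-moment matrix
        self.b = np.zeros(d)
        self.theta = np.zeros(d)

    def predict(self, x=None):
        # For the constant linear class F = {x -> theta}, the predictor ignores x.
        return self.theta

    def update(self, w, a, loss):
        # Weighted least-squares update on the observed (action, loss) pair.
        self.A += w * np.outer(a, a)
        self.b += w * loss * a
        self.theta = np.linalg.solve(self.A, self.b)
```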
**Additional notation.** We make use of the following additional notation. Given a set $\mathcal{X}$, we let $\Delta(\mathcal{X})$ denote the set of all probability distributions over $\mathcal{X}$; if $\mathcal{X}$ is continuous, we restrict $\Delta(\mathcal{X})$ to distributions with countable support. We let $\|x\|$ denote the Euclidean norm of $x \in \mathbb{R}^d$. For any positive definite matrix $H \in \mathbb{R}^{d\times d}$, we denote the induced norm of $x \in \mathbb{R}^d$ by $\|x\|_H^2 = \langle x, Hx\rangle$. For functions $f, g : \mathcal{X} \to \mathbb{R}_+$, we write $f = O(g)$ if there exists a constant $C > 0$ such that $f(x) \le C g(x)$ for all $x \in \mathcal{X}$. We write $f = \widetilde{O}(g)$ if $f = O(g\cdot\max\{1, \mathrm{polylog}(g)\})$, and define $\widetilde{\Omega}(\cdot)$ analogously.

## 3 Adapting to Misspecification: An Oracle-Efficient Algorithm

We now present our main result: an efficient reduction from contextual bandits to online regression that adapts to the unknown misspecification level $\epsilon_T(S)$ and supports infinite action sets. Our main theorem is as follows.

**Theorem 1.** Suppose we have access to a weighted regression oracle $\mathrm{SqAlg}$ that satisfies Assumption 2 for the class $\mathcal{F}$. Then there exists an efficient reduction which guarantees that for any sequence $S = (x_1, \mathcal{A}_1), \ldots, (x_T, \mathcal{A}_T)$ with misspecification level $\epsilon_T(S)$,
$$\mathrm{Reg}(T) = O\Big(\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)\log(T)} + \epsilon_T(S)\,\sqrt{d}\,T\Big).$$

The algorithm has two building blocks. First, we extend the reduction of [20] to infinite action sets via a new optimization-based perspective, and we show that the resulting algorithm has favorable dependence on the misspecification level when that level is known in advance. Then, we combine this reduction with a scheme which aggregates multiple instances to adapt to unknown misspecification. If the time required for a single query to $\mathrm{SqAlg}$ is $\mathcal{T}_{\mathrm{SqAlg}}$, then the per-step runtime of our algorithm is $O(\mathcal{T}_{\mathrm{SqAlg}} + |\mathcal{A}_t|\cdot\mathrm{poly}(d))$.

As an important application, we solve an open question recently posed by Lattimore et al. [33]: we exhibit an efficient algorithm for infinite-action linear contextual bandits which optimally adapts to unknown misspecification.

**Corollary 1.** Let $\mathcal{F} = \{x \mapsto \theta \mid \theta \in \mathbb{R}^d,\ \|\theta\| \le 1\}$. Then there exists an efficient algorithm that, for any sequence $S = (x_1, \mathcal{A}_1), \ldots, (x_T, \mathcal{A}_T)$, satisfies
$$\mathrm{Reg}(T) = O\big(d\sqrt{T}\log(T) + \epsilon_T(S)\,\sqrt{d}\,T\big).$$

This result follows immediately from Theorem 1 by using the online Newton step algorithm as the regression oracle, as in Example 1. Modulo logarithmic factors, this bound coincides with the one achieved by Lattimore et al. [33] for the simpler non-contextual linear bandit problem, for which the authors also present a matching lower bound.

The remainder of this section is dedicated to proving Theorem 1. The roadmap is as follows. First, we revisit the reduction from $K$-armed contextual bandits to online regression of Foster and Rakhlin [20] and provide a new optimization-based perspective. This new viewpoint leads to a natural generalization from the $K$-armed case to the infinite-action case. We then provide an aggregation-type procedure which combines multiple instances of this algorithm to adapt to unknown misspecification, and finally put all the pieces together to prove the main result. As an extension, we also give a variant of the algorithm which enjoys improved bounds when the action sets $\mathcal{A}_t$ lie in low-dimensional subspaces of $\mathbb{R}^d$. Going forward, we abbreviate $\epsilon_T(S)$ to $\epsilon_T$ whenever the sequence $S$ is clear from context.

### 3.1 Oracle Reductions with Finite Actions: An Optimization-Based Perspective

An important special case of our setting is the finite-arm contextual bandit problem, where $\mathcal{A}_t = \mathcal{A}_K := \{e_1, \ldots, e_K\}$. For this setting, Foster and Rakhlin [20] proposed an efficient and optimal reduction called SquareCB, displayed in Algorithm 1. At each step, SquareCB queries the oracle $\mathrm{SqAlg}$ with the current context $x_t$ and receives a loss predictor $\hat\theta_t \in \mathbb{R}^K$ (so that $(\hat\theta_t)_i$ predicts the loss of action $i$). The algorithm then samples an action from a probability distribution introduced by Abe and Long [3].
Specifically, for any $\theta \in \mathbb{R}^K$ and learning rate $\gamma > 0$, we define $\textsf{abe-long}(\theta, \gamma)$ as the distribution $p \in \Delta([K])$ obtained by first selecting any $i^\star \in \arg\min_{i\in[K]}\theta_i$, then defining
$$p_i = \begin{cases} \dfrac{1}{K + \gamma(\theta_i - \theta_{i^\star})}, & \text{if } i \ne i^\star,\\[2mm] 1 - \sum_{j\ne i^\star} p_j, & \text{if } i = i^\star. \end{cases} \tag{3}$$

**Algorithm 1 (SquareCB [20]).**
Input: learning rate $\gamma$, time horizon $T$. Initialize regression oracle $\mathrm{SqAlg}$.
For $t = 1, \ldots, T$:
- Receive context $x_t$ and let $\hat\theta_t$ be the oracle's prediction for $x_t$.
- Sample $I_t \sim \textsf{abe-long}(\hat\theta_t, \gamma)$, play $a_t = e_{I_t}$, and observe loss $\ell_t$.
- Update $\mathrm{SqAlg}$ with $(x_t, a_t, \ell_t)$.

By choosing $\gamma \asymp \sqrt{KT/(\mathrm{Reg}_{\mathrm{Sq}}(T) + \epsilon_T^2 T)}$, this algorithm guarantees that
$$\mathrm{Reg}(T) = O\Big(\sqrt{KT\,\mathrm{Reg}_{\mathrm{Sq}}(T)} + \epsilon_T\sqrt{K}\,T\Big).$$
Since this approach is the starting point for our results, it will be useful to sketch the proof. For $p \in \Delta(\mathcal{A})$, let $H_p := \mathbb{E}_{a\sim p}[aa^{\top}]$ be the correlation matrix, and let $a_p := \mathbb{E}_{a\sim p}[a]$ be the expected action. Let the sequence $S$ be fixed, and let $f^\star \in \mathcal{F}$ be any regression function which attains the value of $\epsilon_T(S)$ in Eq. (2) (if the infimum is not attained, we can simply apply the argument that follows with a limit sequence). With $a^\star_t := \pi_{f^\star}(x_t, \mathcal{A}_t)$ and $\theta^\star_t := f^\star(x_t)$, we have
$$\mathbb{E}\Big[\sum_{t=1}^T \mu(a_t, x_t) - \inf_{a\in\mathcal{A}_t}\mu(a, x_t)\Big] \le \mathbb{E}\Big[\sum_{t=1}^T \langle a_t - a^\star_t,\, \theta^\star_t\rangle\Big] + 2\epsilon_T T$$
$$= \mathbb{E}\Big[\sum_{t=1}^T \Big(\langle a_{p_t} - a^\star_t,\, \theta^\star_t\rangle - \tfrac{\gamma}{4}\|\theta^\star_t - \hat\theta_t\|^2_{H_{p_t}}\Big)\Big] + \mathbb{E}\Big[\sum_{t=1}^T \tfrac{\gamma}{4}\|\theta^\star_t - \hat\theta_t\|^2_{H_{p_t}}\Big] + 2\epsilon_T T.$$
The first expectation term above is bounded by $O(KT/\gamma)$, which is established by showing that $\textsf{abe-long}(\hat\theta, \gamma)$ is an approximate solution to the per-round minimax problem
$$\min_{p\in\Delta([K])}\ \max_{\theta^\star\in\mathbb{R}^K}\ \max_{a^\star\in\mathcal{A}_K}\ \langle a_p - a^\star,\, \theta^\star\rangle - \tfrac{\gamma}{4}\|\hat\theta - \theta^\star\|^2_{H_p}, \tag{4}$$
whose value is $O(K/\gamma)$. The second expectation term is bounded by $O(\gamma\,(\mathrm{Reg}_{\mathrm{Sq}}(T) + \epsilon_T^2 T))$, which follows almost immediately from the definition of the square loss regret in Assumption 1 (see the proof of Theorem 3 for details). Choosing $\gamma$ to balance the terms leads to the result.

As a first step toward generalizing this result to infinite actions, we propose a new distribution which exactly solves the minimax problem (4). This distribution is the solution to a dual optimization problem based on log-barrier regularization, and provides a new, principled approach to deriving reductions.

**Lemma 1.** For any $\theta \in \mathbb{R}^K$ and $\gamma > 0$, the unique minimizer of Eq. (4) coincides with the unique minimizer of the $\textsf{log-barrier}(\theta, \gamma)$ optimization problem defined by
$$\textsf{log-barrier}(\theta, \gamma) = \operatorname*{argmin}_{p\in\Delta([K])}\ \langle p, \theta\rangle - \gamma^{-1}\sum_{a\in[K]}\log(p_a), \tag{5}$$
whose solution is given coordinate-wise by $p_i = \frac{1}{\lambda + \gamma\theta_i}$, where $\lambda$ is the unique value that ensures that these weights sum to one.

The abe-long distribution is closely related to the log-barrier distribution: rather than finding the optimal Lagrange multiplier $\lambda$ that solves the log-barrier problem, the abe-long strategy simply plugs in $\lambda = K - \gamma\min_i\theta_i$, then shifts weight onto $p_{i^\star}$ to ensure the distribution is normalized. Since the log-barrier strategy solves the minimax problem Eq. (4) exactly, plugging it into the results of Foster and Rakhlin [20] and Simchi-Levi and Xu [38] in place of abe-long leads to slightly improved constants. More importantly, this new perspective leads to a principled way to extend these reductions to infinite actions.
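Below is a minimal sketch of the two finite-action distributions just described: the closed-form abe-long assignment of Eq. (3), and the log-barrier distribution of Lemma 1, with the normalization constant $\lambda$ found by bisection. The bisection tolerance and bracket are arbitrary implementation choices.

```python
import numpy as np

def abe_long(theta, gamma):
    """Abe-Long action distribution of Eq. (3)."""
    theta = np.asarray(theta, dtype=float)
    K = len(theta)
    i_star = int(np.argmin(theta))
    p = 1.0 / (K + gamma * (theta - theta[i_star]))
    p[i_star] = 0.0
    p[i_star] = 1.0 - p.sum()   # leftover mass goes to the greedy arm i*
    return p

def log_barrier(theta, gamma, tol=1e-10):
    """Log-barrier distribution of Lemma 1: p_i = 1/(lambda + gamma*theta_i),
    with lambda chosen by bisection so that the weights sum to one."""
    theta = np.asarray(theta, dtype=float)
    K = len(theta)
    # Positivity of every p_i requires lambda > -gamma*min_i(theta_i); the sum of
    # weights is > 1 just above that point and decreases monotonically in lambda.
    lo = -gamma * float(np.min(theta)) + 1e-12
    hi = lo + K   # at lambda = lo + K, every p_i <= 1/K, so the sum is <= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.sum(1.0 / (mid + gamma * theta)) > 1.0:
            lo = mid
        else:
            hi = mid
    return 1.0 / (0.5 * (lo + hi) + gamma * theta)
```

For $\gamma \to 0$ both distributions approach the uniform distribution, while for large $\gamma$ they concentrate on the greedy action $i^\star$.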
### 3.2 Moving to Infinite Action Sets: The Log-Determinant Barrier

**Algorithm 2 (SquareCB.Inf).**
Input: learning rate $\gamma$, time horizon $T$. Initialize regression oracle $\mathrm{SqAlg}$.
For $t = 1, \ldots, T$:
- Receive context $x_t$ and let $\hat\theta_t$ be the oracle's prediction for $x_t$.
- Play $a_t \sim \textsf{logdet-barrier}(\hat\theta_t, \gamma;\, \mathcal{A}_t)$ and observe loss $\ell_t$.
- Update $\mathrm{SqAlg}$ with $(x_t, a_t, \ell_t)$.

We generalize the log-barrier distribution to infinite action sets using the log-determinant function. For any $p \in \Delta(\mathcal{A})$, denote $a_p = \mathbb{E}_{a\sim p}[a]$ and $H_p = \mathbb{E}_{a\sim p}[aa^{\top}]$. Furthermore, we use $\dim(\mathcal{A})$ to denote the dimension of the smallest affine linear subspace that contains $\mathcal{A}$. When $\dim(\mathcal{A}) < d$, we adopt the convention that $\det(\cdot)$ takes the product of only the first $\dim(\mathcal{A})$ eigenvalues of the matrix in its argument, so that the solution below is well-defined. Our logdet-barrier distributions are defined as follows.

**Definition 1.** For any $\theta \in \mathbb{R}^d$, action set $\mathcal{A} \subseteq \mathbb{R}^d$, and $\gamma > 0$, the set of $\textsf{logdet-barrier}(\theta, \gamma; \mathcal{A})$ distributions is defined as the set of solutions to
$$\operatorname*{argmin}_{p\in\Delta(\mathcal{A})}\ \langle a_p, \theta\rangle - \gamma^{-1}\log\det\big(H_p - a_p a_p^{\top}\big). \tag{6}$$

In general, Eq. (6) does not admit a unique solution; all of our results apply to any minimizer. Our key result is that these logdet-barrier distributions solve a minimax problem analogous to that of Eq. (4).

**Lemma 2.** Any solution $p$ to $\textsf{logdet-barrier}(\hat\theta, \gamma; \mathcal{A})$ satisfies
$$\max_{\theta^\star\in\mathbb{R}^d}\ \max_{a^\star\in\mathcal{A}}\ \langle a_p - a^\star,\, \theta^\star\rangle - \tfrac{\gamma}{4}\|\hat\theta - \theta^\star\|^2_{H_p} \;\le\; \gamma^{-1}\dim(\mathcal{A}). \tag{7}$$

By replacing the abe-long distribution with the logdet-barrier distribution in Algorithm 1, we obtain an optimal reduction for infinite action sets. This algorithm, which we call SquareCB.Inf, is displayed in Algorithm 2.

**Theorem 2.** Given a regression oracle $\mathrm{SqAlg}$ that satisfies Assumption 1 for the class $\mathcal{F}$, SquareCB.Inf with learning rate $\gamma \asymp \sqrt{dT/(\mathrm{Reg}_{\mathrm{Sq}}(T) + \epsilon^2 T)}$ guarantees, for all sequences $S$ with $\epsilon_T(S) \le \epsilon$, that
$$\mathrm{Reg}(T) = O\Big(\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)} + \epsilon\sqrt{d}\,T\Big).$$

The logdet-barrier optimization problem is closely related to the D-optimal experimental design problem and to finding the John ellipsoid [29, 43], which correspond to the case $\theta = 0$ in Eq. (6) [31]. By adapting specialized optimization algorithms for these problems (in particular, a Frank-Wolfe-type scheme), we can efficiently solve the logdet-barrier problem. In particular, we have the following proposition.

**Proposition 1.** An approximation to Eq. (6) that achieves the same regret bound up to a constant factor can be computed in time $O(|\mathcal{A}_t|\cdot\mathrm{poly}(d))$ and memory $O(\log|\mathcal{A}_t|\cdot\mathrm{poly}(d))$ per round.

The algorithm, together with a full analysis of its runtime and memory complexity and of the impact on the regret, is provided in the supplementary material.
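The following is a rough Frank-Wolfe-style sketch for approximating the logdet-barrier objective (6) over a finite action set, in the spirit of the D-optimal-design algorithms mentioned above. It is not the procedure behind Proposition 1 (which is given in the supplementary material): the gradient computation follows directly from differentiating (6), but the uniform initialization, the classic $2/(k+2)$ step size, and the fixed iteration budget are illustrative assumptions.

```python
import numpy as np

def logdet_barrier_fw(actions, theta, gamma, iters=200):
    """Frank-Wolfe-style heuristic for the logdet-barrier objective (6)
    over a finite action set.

    actions : array of shape (n, d), rows are the actions in A_t
    theta   : oracle prediction, shape (d,)
    gamma   : learning rate of the reduction

    Returns a distribution p over the rows of `actions`.
    """
    actions = np.asarray(actions, dtype=float)
    n, d = actions.shape
    p = np.full(n, 1.0 / n)   # start from the uniform distribution
    for k in range(iters):
        a_p = p @ actions                                        # mean action
        M = actions.T @ (actions * p[:, None]) - np.outer(a_p, a_p)
        Minv = np.linalg.pinv(M)                 # pseudo-inverse if dim(A_t) < d
        # Coordinate-wise gradient of <a_p, theta> - (1/gamma) log det(M):
        #   a_i^T theta - (1/gamma) (a_i^T Minv a_i - 2 a_i^T Minv a_p)
        quad = np.einsum('ij,jk,ik->i', actions, Minv, actions)
        cross = actions @ (Minv @ a_p)
        grad = actions @ theta - (quad - 2.0 * cross) / gamma
        i = int(np.argmin(grad))                 # best vertex of the simplex
        eta = 2.0 / (k + 2.0)
        step = np.zeros(n)
        step[i] = 1.0
        p = (1.0 - eta) * p + eta * step
    return p
```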
### 3.3 Adapting to Misspecification: Algorithmic Framework

The regret bound of SquareCB.Inf in Theorem 2 achieves optimal dependence on the dimension and on the misspecification level, but requires an a-priori upper bound on $\epsilon_T(S)$ to set the learning rate. We now turn our attention to adapting to this parameter. At a high level, our approach is to run multiple instances of SquareCB.Inf, each tuned to a different level of misspecification, and then run an aggregation procedure on top to learn the best instance. Specifically, if we initialize a collection of $M := \lceil\log(T)\rceil$ instances of Algorithm 2 in which the learning rate for instance $m$ is tuned for misspecification level $\epsilon_m := \exp(-m)$ (that is, we follow a geometric grid), then it is straightforward to show that there exists $m^\star \in [M]$ such that the $m^\star$-th instance would enjoy optimal regret if we ran it on the sequence $S$. Since $m^\star$ is not known a priori, we run an aggregation (or "corralling") procedure [9] to select the best instance. This approach is, in general, not suitable for model selection, since it typically requires prior knowledge of the optimal regret bound to tune certain parameters appropriately [22]. We show that adaptation to misspecification is an exception to this rule, and provides a simple setting where model selection for contextual bandits is possible.

**Algorithm 3 (Corralling [9]).**
Input: master algorithm Master, time horizon $T$. Initialize base algorithms $(\mathrm{Base}_m)_{m=1}^M$.
For $t = 1, \ldots, T$:
- Receive context $x_t$.
- Receive the sampled index $A_t$ and probability $q_{t,A_t}$ from Master.
- Pass $(x_t, \mathcal{A}_t, q_{t,A_t}, \rho_{t,A_t})$ to $\mathrm{Base}_{A_t}$.
- $\mathrm{Base}_{A_t}$ plays $a_t$ and observes $\ell_t$.
- Update Master with $\ell_{t,A_t} = \ell_t + 1$.

We consider the aggregation scheme in Algorithm 3, which is a generalization of the CORRAL algorithm of Agarwal et al. [9]. The algorithm is initialized with $M$ base algorithms, and uses a multi-armed bandit algorithm with $M$ arms as a master algorithm responsible for choosing which base algorithm to follow at each round. The master maintains a distribution $q_t \in \Delta([M])$ over the base algorithms. At each round $t$, it samples an algorithm $A_t \sim q_t$ and passes the current context $x_t$ to this algorithm, along with the sampling probability $q_{t,A_t}$ and a weight $\rho_{t,A_t}$, where we define $\rho_{t,m} := 1/\min_{s\le t} q_{s,m}$ for each $m$. The base algorithm $A_t$ now plays a regular contextual bandit round: given the context $x_t$, it proposes an action $a_t$, which is pulled; it then receives the loss $\ell_t$ and updates its internal state. Finally, the master updates its own state with the arm-loss pair $(A_t, \ell_{t,A_t})$, where $\ell_{t,A_t} := \ell_t + 1$ (for technical reasons, it is useful to shift the loss by $1$ to ensure non-negativity). Let
$$\mathrm{Reg}^m_{\mathrm{Imp}}(T) := \mathbb{E}\Big[\sum_{t=1}^T \frac{\mathbb{I}\{A_t = m\}}{q_{t,m}}\Big(\mu(a_t, x_t) - \inf_{a\in\mathcal{A}_t}\mu(a, x_t)\Big)\Big]$$
denote the importance-weighted regret for base algorithm $m$, which is simply the pseudoregret incurred in the rounds where Algorithm 3 follows this base algorithm, weighted inversely proportional to the probability that this occurs. It is straightforward to show that for any choice of master and base algorithms, this scheme guarantees that
$$\mathrm{Reg}(T) = \mathbb{E}\Big[\sum_{t=1}^T \ell_{t,A_t} - \ell_{t,m^\star}\Big] + \mathrm{Reg}^{m^\star}_{\mathrm{Imp}}(T), \tag{8}$$
where $\ell_{t,m}$ henceforth denotes the loss the algorithm would have suffered at round $t$ if we had $A_t = m$. That is, the regret of Algorithm 3 is equal to the regret $\mathrm{Reg}_{\mathrm{M}}(T) := \mathbb{E}[\sum_{t=1}^T \ell_{t,A_t} - \ell_{t,m^\star}]$ of the master algorithm, plus the importance-weighted regret of the optimal base algorithm $m^\star$.

The difficulty in instantiating this general scheme lies in the fact that the importance-weighted regret of the best base algorithm typically scales with $\mathbb{E}[\rho^\alpha_{T,m^\star}]\cdot\mathrm{Reg}^{m^\star}_{\mathrm{Unw}}(T)$, where $\alpha \in [\tfrac12, 1]$ is an algorithm-dependent parameter and $\mathrm{Reg}^m_{\mathrm{Unw}}(T) := \mathbb{E}[\sum_{t=1}^T \mathbb{I}\{A_t = m\}(\mu(a_t, x_t) - \inf_{a\in\mathcal{A}_t}\mu(a, x_t))]$ denotes the unweighted regret of algorithm $m$. A priori, the factor $\mathbb{E}[\rho^\alpha_{T,m^\star}]$ can be unbounded, leading to large regret. The key to the analysis of Agarwal et al. [9], and the approach we follow here, is to use a master algorithm with negative regret proportional to $\mathbb{E}[\rho^\alpha_{T,m^\star}]$, which allows us to cancel this factor.
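The corralling loop of Algorithm 3 can be summarized by the following skeleton. It is a sketch under assumed interfaces: `master` is any multi-armed bandit over the $M$ base algorithms exposing `distribution()` and `update(arm, loss)`, each base exposes `act(x, A, q, rho)` and `update(loss)`, and `env` is an environment such as the toy one sketched in Section 2; none of these names come from the paper.

```python
import numpy as np

def corral_loop(master, bases, env, T):
    """Skeleton of the corralling scheme (Algorithm 3)."""
    M = len(bases)
    min_q = np.ones(M)                      # running minima of q_{t,m}
    for t in range(T):
        x, A = env.observe(t)               # context x_t and action set A_t
        q = master.distribution()           # q_t over the M base algorithms
        m = int(np.random.choice(M, p=q))   # sampled base algorithm A_t
        min_q = np.minimum(min_q, q)
        rho = 1.0 / min_q[m]                # rho_{t,m} = 1 / min_{s<=t} q_{s,m}
        a = bases[m].act(x, A, q[m], rho)   # the chosen base proposes an action
        loss = env.play(t, a)               # observe ell_t in [-1, 1]
        bases[m].update(loss)
        master.update(m, loss + 1.0)        # shifted loss ell_{t,A_t} in [0, 2]
```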
**Algorithm 4 (SquareCB.Imp, for base algorithm $m$).**
Input: $T$, $\mathrm{Reg}_{\mathrm{Sq}}(T)$. Initialize weighted regression oracle $\mathrm{SqAlg}$.
For $t \in (\tau_1, \tau_2, \ldots) \subseteq [T]$ (the rounds on which instance $m$ is selected):
- Receive context $x_t$ and $(q_{t,m}, \rho_{t,m})$.
- Set $\gamma_{t,m} = \sqrt{dT/(\rho_{t,m}\,\mathrm{Reg}_{\mathrm{Sq}}(T))}$ and $w_t = \gamma_{t,m}/q_{t,m}$.
- Compute the oracle's prediction $\hat\theta_t$ for $(x_t, w_t)$.
- Sample $a_t \sim \textsf{logdet-barrier}(\hat\theta_t, \gamma_{t,m};\, \mathcal{A}_t)$, play $a_t$, and observe loss $\ell_t$.
- Update $\mathrm{SqAlg}$ with $(w_t, x_t, a_t, \ell_t)$.

**Base algorithm.** As the first step towards instantiating the aggregation scheme above, we specify the base algorithm. We use a modification of SquareCB.Inf based on importance weighting, which is designed to ensure that the importance-weighted regret in Eq. (8) is bounded. Pseudocode for the $m$-th base algorithm is given in Algorithm 4. Let the instance $m$ be fixed, and let $Z_{t,m} = \mathbb{I}\{A_t = m\}$ indicate the event that this instance gets to select an action; note that we have $Z_{t,m} \sim \mathrm{Ber}(q_{t,m})$ marginally. When $Z_{t,m} = 1$, instance $m$ receives $q_{t,m}$ and $\rho_{t,m} = \max_{s\le t} q^{-1}_{s,m}$ from the master algorithm. The instance then follows the same update scheme as the vanilla version of SquareCB.Inf, except that (i) it uses an adaptive learning rate $\gamma_{t,m}$, which is tuned based on $\rho_{t,m}$, and (ii) it uses a weighted square loss regression oracle (Assumption 2), with the weight $w_t$ set as a function of $\gamma_{t,m}$ and $q_{t,m}$. The importance-weighted regret $\mathrm{Reg}^m_{\mathrm{Imp}}(T)$ for this scheme is bounded as follows.

**Theorem 3.** When invoked within Algorithm 3 with a regression oracle satisfying Assumption 2, the importance-weighted regret for each instance $m \in [M]$ of Algorithm 4 satisfies
$$\mathrm{Reg}^m_{\mathrm{Imp}}(T) \;\le\; \tfrac{3}{2}\,\mathbb{E}\big[\sqrt{\rho_{T,m}}\big]\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)} \;+\; \big(\epsilon_m \vee \epsilon_T + \epsilon_T\big)\sqrt{d}\,T \;+\; 2\,\epsilon_T\,T. \tag{9}$$

The key feature of this regret bound is that only the leading term involving $\mathrm{Reg}_{\mathrm{Sq}}(T)$ depends on the importance weights, not the second term involving the misspecification. This allows us to get away with tuning the master algorithm using only $d$, $T$, and $\mathrm{Reg}_{\mathrm{Sq}}(T)$, but not $\epsilon_T$, which is critical for adapting without prior knowledge. Another important detail is that if $\epsilon_m$ is within a constant factor of $\epsilon_T$, the second term simplifies to $O(\epsilon_T\sqrt{d}\,T)$ as desired.
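The per-round step of the base algorithm can be sketched as follows, in the interface assumed by the corralling skeleton above. The learning-rate schedule mirrors the pseudocode of Algorithm 4; constants are illustrative, and the tuning of each instance for its misspecification guess $\epsilon_m$ (Section 3.3) is omitted here.

```python
import numpy as np

class SquareCBImpBase:
    """Sketch of the importance-weighted base algorithm (Algorithm 4)."""

    def __init__(self, oracle, d, T, reg_sq):
        self.oracle = oracle          # weighted regression oracle (Assumption 2)
        self.d, self.T, self.reg_sq = d, T, reg_sq

    def act(self, x, A, q, rho):
        # Adaptive learning rate gamma_{t,m} based on the importance weight rho_{t,m}.
        gamma = np.sqrt(self.d * self.T / (rho * self.reg_sq))
        # (The paper additionally tunes each instance for its misspecification
        #  guess eps_m; that tuning is omitted in this sketch.)
        self.w = gamma / q            # weight passed to the regression oracle
        theta_hat = self.oracle.predict(x)
        p = logdet_barrier_fw(A, theta_hat, gamma)   # sketch from Section 3.2
        p = p / p.sum()                              # renormalize for safety
        idx = int(np.random.choice(len(A), p=p))
        self.last = (x, A[idx])
        return A[idx]

    def update(self, loss):
        x, a = self.last
        self.oracle.update(self.w, a, loss)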
### 3.4 Improved Master Algorithms for Combining Bandit Algorithms

It remains to provide a master algorithm for use within Algorithm 3. While it turns out that the master algorithm proposed in Agarwal et al. [9] suffices for this task, we go a step further and propose a new master algorithm, called $(\alpha, R)$-hedged FTRL, which is simpler and enjoys slightly improved regret, removing logarithmic factors. While this is not the focus of the paper, we believe it to be a useful contribution on its own, because it provides a new approach to designing master algorithms for bandit aggregation. We hope that it will find use more broadly.

The $(\alpha, R)$-hedged FTRL algorithm is parameterized by a regularizer and two scale parameters $\alpha \in (0, 1)$ and $R > 0$. We defer a precise definition and analysis to the supplementary material, and state only the result relevant for our aggregation setup here. This result concerns a specific instance of the $(\alpha, R)$-hedged FTRL algorithm that we call $(\alpha, R)$-hedged Tsallis-INF, which instantiates the framework using the Tsallis entropy as the regularizer [11, 5, 47]. The key property of the algorithm is that its regret with respect to a policy playing a fixed arm $m$ contains a negative contribution of magnitude $\rho^{\alpha}_{T,m} R$. The following result is a corollary of a more general theorem, found in the supplementary material.

**Corollary 2.** Consider the adversarial multi-armed bandit problem with $M$ arms and losses $\ell_{t,m} \in [0, 2]$. For any $\alpha \in (0, 1)$ and $R > 0$, the $(\alpha, R)$-hedged Tsallis-INF algorithm with learning rate $\eta = \sqrt{1/(2T)}$ guarantees that for all $m \in [M]$,
$$\mathbb{E}\Big[\sum_{t=1}^T \ell_{t,A_t} - \ell_{t,m}\Big] \;\le\; 4\sqrt{2MT} + \mathbb{E}\Big[\min\Big\{\tfrac{1}{1-\alpha},\, 2\log(\rho_{T,m})\Big\}\,M^{\alpha} - \rho^{\alpha}_{T,m}\Big]\cdot R. \tag{10}$$

### 3.5 Putting Everything Together

Crucially, the regret bound in Corollary 2 has a negative $R\,\rho^{\alpha}_{T,m}$ term which, for sufficiently large $R$ and appropriate $\alpha$, can be used to offset the regret incurred from importance-weighting the base algorithms. In particular, $\big(\tfrac12,\ 3\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)}\big)$-hedged Tsallis-INF has exactly the negative regret contribution needed to cancel the importance-weighting term in Eq. (9) when we use SquareCB.Imp as the base algorithm. In more detail, we combine the regret bounds for the master and base algorithms as follows to prove Theorem 1.

**Proof sketch for Theorem 1.** Using Eq. (8), it suffices to bound the regret of the bandit master, $\mathrm{Reg}_{\mathrm{M}}(T)$, and the importance-weighted regret $\mathrm{Reg}^{m^\star}_{\mathrm{Imp}}(T)$ of the optimal instance $m^\star$. By Corollary 2, using $\big(\tfrac12,\ 3\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)}\big)$-hedged Tsallis-INF as the master algorithm gives
$$\mathrm{Reg}_{\mathrm{M}}(T) \le O\Big(\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)\log(T)}\Big) - \tfrac{3}{2}\,\mathbb{E}\big[\sqrt{\rho_{T,m^\star}}\big]\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)}.$$
Whenever the misspecification level is not trivially small, the geometric grid ensures that there exists $m^\star \in [M]$ such that $e^{-1}\epsilon_T \le \epsilon_{m^\star} \le \epsilon_T$. For this instance, Theorem 3 yields
$$\mathrm{Reg}^{m^\star}_{\mathrm{Imp}}(T) \le \tfrac{3}{2}\,\mathbb{E}\big[\sqrt{\rho_{T,m^\star}}\big]\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)} + O\big(\epsilon_T\sqrt{d}\,T\big).$$
Summing the two bounds using Eq. (8) completes the proof.

### 3.6 Extension: Adapting to the Average Dimension

A canonical application of linear contextual bandits is online news article recommendation, where the context $x_t$ is taken to be a feature vector containing information about the user, and each action $a \in \mathcal{A}_t$ is the concatenation of $x_t$ with a feature representation for a candidate article (e.g., Li et al. [34]). In this application and others like it, it is often the case that while examples lie in a high-dimensional space, the true dimensionality $\dim(\mathcal{A}_t)$ of the action set is small, so that $d_{\mathrm{avg}} := \frac{1}{T}\sum_{t=1}^T \dim(\mathcal{A}_t) \ll d$. If we have prior knowledge of $d_{\mathrm{avg}}$ (or an upper bound thereon), we can exploit this low dimensionality for tighter regret. In fact, following the proofs of Theorem 3 and Theorem 1, and bounding $\sum_{t=1}^T \dim(\mathcal{A}_t)$ by $d_{\mathrm{avg}} T$ instead of $dT$, it is fairly immediate to show that Algorithm 3 enjoys the improved regret bound $\mathrm{Reg}(T) = O\big(\sqrt{d_{\mathrm{avg}} T\,\mathrm{Reg}_{\mathrm{Sq}}(T)\log(T)} + \epsilon_T\sqrt{d_{\mathrm{avg}}}\,T\big)$, so long as $d$ is replaced by $d_{\mathrm{avg}}$ in the algorithm's various parameter settings. Our final result shows that it is possible to adapt to unknown $d_{\mathrm{avg}}$ and unknown misspecification simultaneously. The key idea is to apply a doubling trick on top of Algorithm 3.

**Theorem 4.** There exists an algorithm that, under the same conditions as Theorem 1, satisfies
$$\mathrm{Reg}(T) = O\Big(\sqrt{d_{\mathrm{avg}} T\,\mathrm{Reg}_{\mathrm{Sq}}(T)\log(T)} + \epsilon_T\sqrt{d_{\mathrm{avg}}}\,T\Big)$$
without prior knowledge of $d_{\mathrm{avg}}$ or $\epsilon_T$.

We remark that while the bound in Theorem 4 replaces the $d$ factor in the reduction with the data-dependent quantity $d_{\mathrm{avg}}$, the oracle's regret $\mathrm{Reg}_{\mathrm{Sq}}(T)$ may itself still depend on $d$ unless a sufficiently sophisticated algorithm is used.

## 4 Discussion

We have presented the first general-purpose, oracle-efficient algorithms for contextual bandits that adapt to unknown model misspecification. For infinite-action linear contextual bandits, our results yield the first optimal algorithms that adapt to unknown misspecification with changing action sets. Our results suggest a number of interesting conceptual questions. Can our optimization-based perspective lead to new oracle-based algorithms for richer types of infinite action sets? Examples include nonparametric actions and structured (e.g., sparse) linear actions. Can our reduction-based techniques be lifted to more sophisticated interactive learning settings such as reinforcement learning? On the technical side, we anticipate that our new approach to reductions will find broader use; natural extensions include reductions for offline oracles [38] and adapting to low-noise conditions [23]. Lastly, we recall that, in passing, we have derived a novel class of master algorithms for combining bandit algorithms which enjoys more flexibility, an improvement in logarithmic factors, and a greatly simplified analysis. We hope this result will be useful for future work on model selection in contextual bandits.

## Acknowledgements

DF acknowledges the support of NSF TRIPODS grant #1740751. We thank Teodor Marinov and Alexander Rakhlin for discussions on related topics.

## Broader Impact

This paper concerns contextual bandit algorithms that adapt to unknown model misspecification.
Because of their efficiency and their ability to adapt to the amount of misspecification with no prior knowledge, our algorithms are robust, and may be suitable for large-scale practical deployment. On the other hand, our work is foundational research, and hence its impact on society is shaped by the applications that stem from it. We focus our brief discussion on the applications mentioned in the introduction. Health services [42] offer an opportunity for positive impact: contextual bandits can be used to propose medical interventions that lead to better health outcomes. However, care must be taken to implement the explore-exploit tradeoff ethically in this sensitive setting, and more research is required. Online advertisement [4, 34] and recommendation systems [8] are another well-known application. While improved, robust algorithms can lead to increased profits here, it is important to recognize that this may not positively impact society as a whole. Lastly, we mention that predictive algorithms like contextual bandits become more and more powerful as more information is gathered about users. This provides a clear incentive toward collecting as much information as possible. We believe that the net benefit of research on contextual bandits outweighs the harm, but we welcome regulatory efforts to produce a legal framework that steers the usage of machine learning algorithms, including contextual bandits, in a direction that respects the privacy rights of users.

## References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, NIPS, pages 2312–2320. Curran Associates, Inc., 2011.

[2] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proc. of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1–9, 2012.

[3] N. Abe and P. M. Long. Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the 16th International Conference on Machine Learning, ICML, pages 3–11, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[4] N. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.

[5] J. D. Abernethy, C. Lee, and A. Tewari. Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems 28, NIPS, pages 2197–2205. Curran Associates, Inc., 2015.

[6] A. Agarwal, M. Dudik, S. Kale, J. Langford, and R. Schapire. Contextual bandit learning with predictable rewards. In Proc. of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[7] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1638–1646, 2014.

[8] A. Agarwal, S. Bird, M. Cozowicz, L. Hoang, J. Langford, S. Lee, J. Li, D. Melamed, G. Oshri, and O. Ribas. Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966, 2016.

[9] A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12–38, 2017.

[10] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proc. of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 127–135, 2013.
[11] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the Conference on Learning Theory (COLT), pages 217–226, 2009.

[12] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

[13] I. Bogunovic, A. Krause, and J. Scarlett. Corruption-tolerant Gaussian process bandit optimization. In Proc. of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

[14] V. Chernozhukov, M. Demirer, G. Lewis, and V. Syrgkanis. Semi-parametric efficient policy learning with continuous actions. In Advances in Neural Information Processing Systems, pages 15065–15075, 2019.

[15] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proc. of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15, pages 208–214. PMLR, 2011.

[16] K. Crammer and C. Gentile. Multiclass classification with bandit feedback using adaptive regularization. Machine Learning, 90(3):347–383, 2013.

[17] J. Djolonga, A. Krause, and V. Cevher. High-dimensional Gaussian process bandits. In Proc. of the 27th NIPS, pages 1025–1033, 2013.

[18] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.

[19] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, UAI, pages 169–178, 2011.

[20] D. J. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. International Conference on Machine Learning (ICML), 2020.

[21] D. J. Foster, A. Agarwal, M. Dudik, H. Luo, and R. Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pages 1539–1548, 2018.

[22] D. J. Foster, A. Krishnamurthy, and H. Luo. Model selection for contextual bandits. In Advances in Neural Information Processing Systems, pages 14741–14752, 2019.

[23] D. J. Foster, A. Rakhlin, D. Simchi-Levi, and Y. Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. arXiv preprint arXiv:2010.03104, 2020.

[24] P. Gaillard and S. Gerchinovitz. A chaining algorithm for online nonparametric regression. In Conference on Learning Theory, pages 764–796, 2015.

[25] A. Ghosh, S. R. Chowdhury, and A. Gopalan. Misspecified linear bandits. In Proc. of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[26] A. Gupta, T. Koren, and K. Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Proc. of the Conference on Learning Theory, pages 1562–1578, 2019.

[27] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[28] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, pages 927–935, 2011.

[29] L. G. Khachiyan and M. J. Todd. On the complexity of approximating the maximal inscribed ellipsoid for a polytope. Technical report, Cornell University Operations Research and Industrial Engineering, 1990.
[30] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Proc. of the 25th NIPS, 2011.

[31] P. Kumar and E. A. Yıldırım. Minimum-volume enclosing ellipsoids and core sets. Journal of Optimization Theory and Applications, 126(1):1–21, 2005.

[32] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20, NIPS, pages 817–824, 2008.

[33] T. Lattimore, C. Szepesvári, and G. Weisz. Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676, 2019.

[34] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.

[35] H. Luo, C.-Y. Wei, A. Agarwal, and J. Langford. Efficient contextual bandits in non-stationary worlds. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 1739–1776, 2018.

[36] T. Lykouris, V. Mirrokni, and R. Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proc. of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122. ACM, 2018.

[37] A. Pacchiano, M. Phan, Y. Abbasi-Yadkori, A. Rao, J. Zimmert, T. Lattimore, and C. Szepesvári. Model selection in contextual stochastic bandit problems. Neural Information Processing Systems (NeurIPS), 2020.

[38] D. Simchi-Levi and Y. Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN, 2020.

[39] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML'10: Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, 2010.

[40] Y. Sui, A. Gotovos, J. Burdick, and A. Krause. Safe exploration for optimization with Gaussian processes. In Proc. of the 32nd International Conference on Machine Learning, volume 37, pages 997–1005, 2015.

[41] V. Syrgkanis, A. Krishnamurthy, and R. Schapire. Efficient algorithms for adversarial contextual learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 2159–2168, 2016.

[42] A. Tewari and S. A. Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.

[43] M. J. Todd and E. A. Yıldırım. On Khachiyan's algorithm for the computation of minimum-volume enclosing ellipsoids. Discrete Applied Mathematics, 155(13):1731–1744, 2007.

[44] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Proc. of the 29th Conference on Uncertainty in Artificial Intelligence, UAI, pages 654–663, 2013.

[45] Y. Xu and A. Zeevi. Upper counterfactual confidence bounds: A new optimism principle for contextual bandits. arXiv preprint arXiv:2007.07876, 2020.

[46] A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill. Learning near optimal policies with low inherent Bellman error. arXiv preprint arXiv:2003.00153, 2020.

[47] J. Zimmert and Y. Seldin. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 467–475. PMLR, 2019.