# Adapting to Misspecification in Contextual Bandits

Dylan J. Foster (dylanf@mit.edu) · Claudio Gentile (cgentile@google.com) · Mehryar Mohri (mohri@google.com) · Julian Zimmert (zimmert@google.com)

Massachusetts Institute of Technology · Google Research · Courant Institute of Mathematical Sciences

*34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.*

## Abstract

A major research direction in contextual bandits is to develop algorithms that are computationally efficient, yet support flexible, general-purpose function approximation. Algorithms based on modeling rewards have shown strong empirical performance, yet they typically require a well-specified model, and can fail when this assumption does not hold. Can we design algorithms that are efficient and flexible, yet degrade gracefully in the face of model misspecification? We introduce a new family of oracle-efficient algorithms for $\epsilon$-misspecified contextual bandits that adapt to unknown model misspecification in both finite- and infinite-action settings. Given access to an online oracle for square loss regression, our algorithm attains optimal regret, and in particular optimal dependence on the misspecification level, with no prior knowledge. Specializing to linear contextual bandits with infinite actions in $d$ dimensions, we obtain the first algorithm that achieves the optimal $O(d\sqrt{T} + \epsilon\sqrt{d}\,T)$ regret bound for unknown $\epsilon$. On a conceptual level, our results are enabled by a new optimization-based perspective on the regression-oracle reduction framework of Foster and Rakhlin [20], which we believe will be useful more broadly.

## 1 Introduction

The contextual bandit (CB) problem is an extension of the standard multi-armed bandit problem that is relevant to a variety of applications in practice, including health services [42], online advertisement [34, 4], and recommendation systems [8]. In the contextual bandit setting, at each round, the learner observes a feature vector (or context) and an action set. The learner must select an action from that set and only observes the reward of that action. To make its selection, the learner has access to a family of hypotheses (or policies), which map contexts to actions. The objective of the learner is to achieve cumulative reward close to that of the best hypothesis in hindsight for that specific sequence of contexts and action sets.

A common approach to the contextual bandit problem consists of reducing it to a supervised learning task such as classification or regression [32, 19, 6, 7, 41, 8, 35]. Recently, Foster and Rakhlin [20] proposed SquareCB, an efficient reduction from $K$-armed contextual bandits to square loss regression under realizability assumptions. One question left open by this work is whether their approach can be generalized to action spaces with many (or infinitely many) actions in $d$ dimensions. Another open question is whether one can seamlessly shift from realizability to misspecified models without requiring prior knowledge of the amount of misspecification. This is precisely the setup we study here, where the action set is large or infinite, but where the learner has a good feature representation available, up to some unknown amount of misspecification.

Adequately handling misspecification has been a subject of intense recent interest even for the simple special case of linear contextual bandits. Du et al. [18] questioned whether a "good" representation is indeed enough,
that is, whether we can learn efficiently even without realizability. Lattimore et al. [33] gave a positive answer to that question, provided the misspecification level $\epsilon$ is known in advance, and showed that the price of misspecification (in terms of regret) is roughly $\epsilon\sqrt{d}\,T$, where $d$ is the dimension and $T$ is the time horizon. However, they left adapting to unknown $\epsilon$ as an open question.

**Our results.** We provide an affirmative answer to all of these questions. We generalize SquareCB to infinite action sets, and use this strategy to adapt to unknown misspecification $\epsilon$ by combining it with a bandit model selection procedure akin to the one proposed by Agarwal et al. [9]. Our algorithm is oracle-efficient, and adapts to misspecification efficiently and optimally whenever it has access to an online oracle for square loss regression. When specialized to linear contextual bandits, it answers the open question of Lattimore et al. [33]. An important conceptual contribution of our work is to show that one can view the action selection scheme used by SquareCB as an approximation to a log-barrier-regularized optimization problem, which paves the way for a generalization to infinite action spaces. Another by-product of our results is a generalization of the original CORRAL algorithm [9] for combining bandit algorithms, which is simpler, more flexible, and enjoys improved logarithmic factors.

### 1.1 Related Work

The contextual bandit is a well-studied problem, and misspecification in bandits and reinforcement learning has been the subject of intense recent interest. We mention a few works which are closely related to our results. For linear bandits in $d$ dimensions, Lattimore et al. [33] gave an algorithm with regret $O(d\sqrt{T} + \epsilon\sqrt{d}\,T)$, and left adapting to unknown misspecification for changing action sets as an open problem. Concurrent work of Pacchiano et al. [37] solves this problem for the special case where contexts/action sets are stochastic, and also leverages CORRAL-type aggregation of contextual bandit algorithms. Our results resolve this question in the general adversarial setting. Within the literature on general-purpose contextual bandit algorithms, our approach builds on a recent line of research that provides reductions to offline/online square loss regression [21, 20, 38, 45, 23]. Besides the standard references on oracle-based agnostic contextual bandits (e.g., [32, 19, 6, 7]), $\epsilon$-misspecification is somewhat related to the recent stream of literature on bandits with adversarially corrupted feedback [36, 26, 13]. See the discussion in the supplementary material.

## 2 Problem Setting

We consider the following contextual bandit protocol. At every round $t = 1, \ldots, T$, the learner first observes a context $x_t \in \mathcal{X}$ and an action set $\mathcal{A}_t \subseteq \mathcal{A}$, where $\mathcal{A} \subseteq \mathbb{R}^d$ is a compact action space; for simplicity, we assume throughout that $\mathcal{A} = \{a \in \mathbb{R}^d : \|a\| \le 1\}$, but place no restriction on $(\mathcal{A}_t)_{t=1}^T$. Given the context and action set, the learner chooses an action $a_t \in \mathcal{A}_t$, then observes a stochastic loss $\ell_t \in [-1, +1]$ depending on the action selected. We assume that the sequence of context vectors $x_t$ and the associated sequence of action sets $\mathcal{A}_t$ are generated by an oblivious adversary. We let $\mu(a, x) := \mathbb{E}[\ell_t \mid x_t = x, a_t = a]$ denote the mean loss function, which is unknown to the learner. We adopt a semi-parametric approach to modeling the losses, in which $\mu(a, x)$ is modeled as (nearly) linear in the action $a$, but may depend on the context $x$ arbitrarily [20, 45, 14]. (A toy simulation of this protocol is sketched below.)
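To make the protocol concrete, the following toy environment simulates the interaction above with a (nearly) linear mean-loss function. This is a minimal sketch: the Gaussian contexts and action sets, the particular noise model, and the interface names (`observe`, `play`) are illustrative assumptions and not part of the paper's setup. A learner interacts by calling `observe`, choosing an action from the returned set, and calling `play`.

```python
import numpy as np

class LinearCBEnvironment:
    """Toy instance of the contextual bandit protocol of Section 2.

    The mean loss is mu(a, x) = <a, theta(x)> plus an optional perturbation
    modeling misspecification. Contexts, action sets, and the noise model
    are illustrative choices only.
    """

    def __init__(self, d, n_actions, theta, misspec=None, seed=0):
        self.d, self.n_actions = d, n_actions
        self.theta = theta                              # map x -> R^d
        self.misspec = misspec or (lambda a, x: 0.0)    # deviation from linearity
        self.rng = np.random.default_rng(seed)

    def observe(self, t):
        """Reveal the context x_t and the action set A_t (rows have norm <= 1)."""
        x = self.rng.standard_normal(self.d)
        A = self.rng.standard_normal((self.n_actions, self.d))
        A /= np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1.0)
        self._x, self._A = x, A
        return x, A

    def mu(self, a, x):
        """Mean loss mu(a, x), clipped so that realized losses stay in [-1, 1]."""
        return float(np.clip(a @ self.theta(x) + self.misspec(a, x), -0.9, 0.9))

    def play(self, t, a):
        """Return a stochastic loss in [-1, 1] whose conditional mean is mu(a, x_t)."""
        return self.mu(a, self._x) + float(self.rng.uniform(-0.1, 0.1))
```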
In particular, we assume the learner has access to a class of functions $\mathcal{F} \subseteq \{f : \mathcal{X} \to \mathbb{R}^d\}$, where for each $f \in \mathcal{F}$, $\langle a, f(x)\rangle$ attempts to predict the value of $\mu(a, x)$. In a well-specified/realizable setting, one would assume that there exists some $f^\star \in \mathcal{F}$ such that $\mu(a, x) = \langle a, f^\star(x)\rangle$. In this paper, we make no such assumption, but the regret incurred by our algorithms depends on how far this is from being true. For each $f \in \mathcal{F}$, we let $\pi_f(\cdot, \cdot)$ denote the induced policy, whose action at time $t$ is given by $\pi_f(x_t, \mathcal{A}_t) := \arg\min_{a\in\mathcal{A}_t}\langle a, f(x_t)\rangle$. The goal of the learner is to minimize its pseudoregret $\mathrm{Reg}(T)$ against the best unconstrained policy:
$$\mathrm{Reg}(T) := \mathbb{E}\Big[\sum_{t=1}^T \mu(a_t, x_t) - \inf_{a\in\mathcal{A}_t}\mu(a, x_t)\Big].$$
Here, and for the remainder of the paper, we use $\mathbb{E}[\cdot]$ to denote the expectation with respect to both the randomized choices of the learner and the stochastic realization of the losses $\ell_t$.

This setup recovers the usual finite-arm contextual bandit setting with $K$ arms by taking $\mathcal{A}_t = \{e_1, \ldots, e_K\}$. Another important special case is the well-studied linear contextual bandit setting, which corresponds to the case where $\mathcal{F}$ consists of constant vector-valued functions that do not depend on $x$. Specifically, for any $\Theta \subseteq \mathbb{R}^d$, we can take $\mathcal{F} = \{x \mapsto \theta \mid \theta \in \Theta\}$. In this case, the prediction $\langle a, f(x)\rangle$ simplifies to $\langle a, \theta\rangle$, a fixed linear function on the action space $\mathcal{A}$. This special case recovers the most widely studied version of linear contextual bandits [3, 12, 1, 15, 2, 10, 16], as well as Gaussian process extensions [39, 30, 17, 40].

### 2.1 Misspecification

Contextual bandit algorithms based on modeling rewards typically rely on the assumption of a well-specified model (or "realizability"): that is, the existence of a function $f^\star \in \mathcal{F}$ such that $\mu(a, x) = \langle a, f^\star(x)\rangle$ for all $a \in \mathcal{A}$ and $x \in \mathcal{X}$ [15, 1, 6, 21]. Since this assumption may not be realistic in practice, a more recent line of work has begun to develop algorithms for misspecified models. In particular, Crammer and Gentile [16], Ghosh et al. [25], Lattimore et al. [33], and Foster and Rakhlin [20] consider a uniform $\epsilon$-misspecified setting in which
$$\inf_{f\in\mathcal{F}}\;\sup_{a\in\mathcal{A},\,x\in\mathcal{X}} \big|\mu(a, x) - \langle a, f(x)\rangle\big| \le \epsilon, \tag{1}$$
for some misspecification level $\epsilon > 0$. Notably, Lattimore et al. [33] show that for the linear setting, regret must grow as $\Omega(d\sqrt{T} + \epsilon\sqrt{d}\,T)$. Since $d\sqrt{T}$ is the optimal regret for a well-specified model, $\epsilon\sqrt{d}\,T$ may be thought of as the price of misspecification.

In this paper, we consider a weaker, average-case notion of misspecification. Given a sequence $S = (x_1, \mathcal{A}_1), \ldots, (x_T, \mathcal{A}_T)$ of context-action set pairs, we define the average misspecification level $\epsilon_T(S)$ as
$$\epsilon_T(S) := \inf_{f\in\mathcal{F}}\bigg(\frac{1}{T}\sum_{t=1}^T \sup_{a\in\mathcal{A}_t}\big(\langle a, f(x_t)\rangle - \mu(a, x_t)\big)^2\bigg)^{1/2}. \tag{2}$$
This quantity measures the misspecification level for the specific sequence $S$ at hand. Of course, the uniform bound in Eq. (1) directly implies $\epsilon_T(S) \le \epsilon$ for all $S$, and $\epsilon_T(S) = 0$ whenever the model is well-specified. We provide regret bounds that optimally adapt to $\epsilon_T(S)$ for any given realization of the sequence $S$, with no prior knowledge of the misspecification level.

The issue of adapting to unknown misspecification has not been addressed even for the stronger uniform notion (1). Indeed, previous efforts typically use prior knowledge of $\epsilon$ to tune the exploration-exploitation scheme so as to encourage conservative exploration when misspecification is large; see Lattimore et al. [33, Appendix E], Foster and Rakhlin [20, Section 5.1], Crammer and Gentile [16, Section 4.2], and Zanette et al. [46] for examples. Naively adapting such schemes using, e.g., doubling tricks presents difficulties, because the quantity in Eq. (2) does not appear to be estimable without knowledge of $\mu$.
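To illustrate Eq. (2), the sketch below evaluates the average misspecification of a finite function class on a fixed sequence. It assumes finite action sets and, importantly, access to $\mu$ itself, which the learner does not have; this is exactly why $\epsilon_T(S)$ cannot simply be estimated and plugged into a doubling trick. All function and argument names here are illustrative.

```python
import numpy as np

def avg_misspecification(F, contexts, action_sets, mu):
    """Average misspecification eps_T(S) of Eq. (2) for a finite class F.

    F           : list of functions f(x) -> np.ndarray of shape (d,)
    contexts    : list of contexts x_1, ..., x_T
    action_sets : list of arrays A_t of shape (|A_t|, d)
    mu          : mean-loss function mu(a, x) -> float (unknown to the learner;
                  used here only to illustrate the definition)
    """
    T = len(contexts)
    best = np.inf
    for f in F:
        total = 0.0
        for x, A in zip(contexts, action_sets):
            theta = f(x)
            # worst-case squared prediction error over the action set A_t
            errs = [(A[i] @ theta - mu(A[i], x)) ** 2 for i in range(len(A))]
            total += max(errs)
        best = min(best, total / T)
    return float(np.sqrt(best))
```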
### 2.2 Regression Oracles

Following Foster and Rakhlin [20], we assume access to an online regression oracle $\mathrm{SqAlg}$, which is simply an algorithm for sequential prediction under the square loss, using $\mathcal{F}$ as a benchmark class. More precisely, the oracle operates under the following protocol: at each round $t \in [T]$, the algorithm receives a context $x_t \in \mathcal{X}$, outputs a predictor $\hat{y}_t \in \mathbb{R}^d$ (we interpret $\langle a, \hat{y}_t\rangle$ as the predicted loss for action $a$), then observes an action $a_t \in \mathcal{A}$ and a loss $\ell_t \in [-1, +1]$, and incurs the square loss $(\langle a_t, \hat{y}_t\rangle - \ell_t)^2$. (As in Foster and Rakhlin [20], the square loss itself does not play a crucial role, and can be replaced by any other loss that is strongly convex with respect to the learner's predictions.) Formally, we make the following assumption.

**Assumption 1.** The regression oracle $\mathrm{SqAlg}$ guarantees that for any (potentially adaptively chosen) sequence $\{(x_t, a_t, \ell_t)\}_{t=1}^T$,
$$\sum_{t=1}^T \big(\langle a_t, \hat{y}_t\rangle - \ell_t\big)^2 - \inf_{f\in\mathcal{F}}\sum_{t=1}^T \big(\langle a_t, f(x_t)\rangle - \ell_t\big)^2 \;\le\; \mathrm{Reg}_{\mathrm{Sq}}(T),$$
for some (non-data-dependent) upper bound $\mathrm{Reg}_{\mathrm{Sq}}(T)$.

For the finite-action setting, this definition coincides with that of Foster and Rakhlin [20]. To simplify the presentation of our results, we assume throughout the paper that $\mathrm{Reg}_{\mathrm{Sq}}(T)$ is a non-decreasing function of $T$.

While this type of oracle suffices for all of our results, our algorithms are stated more naturally in terms of a stronger oracle which supports weighted online regression. In this model, we follow the same protocol as in Assumption 1, except that at each time $t$ the regression oracle observes a weight $w_t \ge 0$ at the same time as the context $x_t$, and the loss incurred is now $w_t\,(\langle a_t, \hat{y}_t\rangle - \ell_t)^2$. For technical reasons, we also allow the oracle for this model to be randomized. We make the following assumption.

**Assumption 2.** The weighted regression oracle $\mathrm{SqAlg}$ guarantees that for any (potentially adaptively chosen) sequence $\{(w_t, x_t, a_t, \ell_t)\}_{t=1}^T$,
$$\mathbb{E}\Big[\sum_{t=1}^T w_t\big(\langle a_t, \hat{y}_t\rangle - \ell_t\big)^2 - \inf_{f\in\mathcal{F}}\sum_{t=1}^T w_t\big(\langle a_t, f(x_t)\rangle - \ell_t\big)^2\Big] \;\le\; \mathbb{E}\Big[\max_{t\in[T]} w_t\Big]\cdot\mathrm{Reg}_{\mathrm{Sq}}(T),$$
for some upper bound $\mathrm{Reg}_{\mathrm{Sq}}(T)$, where the expectation is taken with respect to the oracle's randomization.

We show in the supplementary material that any unweighted regression oracle satisfying Assumption 1 can be transformed into a randomized oracle for weighted regression that satisfies Assumption 2, with no overhead in runtime. Hence, to simplify exposition, for the remainder of the paper we state our results in terms of weighted regression oracles satisfying Assumption 2.

Online regression has been well studied, and many efficient algorithms are known for standard classes $\mathcal{F}$. One example, which is important for our applications, is the case where $\mathcal{F}$ is linear.

**Example 1 (Linear models).** Suppose $\mathcal{F} = \{x \mapsto \theta \mid \theta \in \Theta\}$, where $\Theta \subseteq \mathbb{R}^d$ is a convex set with $\|\theta\| \le 1$. Then the online Newton step algorithm [27] satisfies Assumption 1 with $\mathrm{Reg}_{\mathrm{Sq}}(T) = O(d\log(T))$, and via our reduction it can be augmented to satisfy Assumption 2.

Further examples include kernels [44], generalized linear models [28], and standard nonparametric classes [24]. We refer to Foster and Rakhlin [20] for a more extensive discussion.
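For the linear class of Example 1, a simple weighted online least-squares oracle can serve as a stand-in for the interface of Assumption 2. This is a minimal sketch: the paper uses online Newton step [27] together with a generic unweighted-to-weighted reduction, so the regret guarantee of the ridge-style update below is not the one quoted in Example 1, and the regularization constant is an arbitrary choice.

```python
import numpy as np

class WeightedOnlineRidgeOracle:
    """Weighted online least-squares oracle for the linear class of Example 1.

    Stand-in for the weighted-oracle interface of Assumption 2 (predict a
    parameter vector, then update on a weighted (action, loss) pair).
    """

    def __init__(self, d, reg=1.0):
        self.A = reg * np.eye(d)   # regularized second-moment matrix
        self.b = np.zeros(d)
        self.theta = np.zeros(d)

    def predict(self, x=None):
        # For the constant linear class F = {x -> theta}, the predictor ignores x.
        return self.theta

    def update(self, w, a, loss):
        # Weighted least-squares update on the observed (action, loss) pair.
        self.A += w * np.outer(a, a)
        self.b += w * loss * a
        self.theta = np.linalg.solve(self.A, self.b)
```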
**Additional notation.** We make use of the following additional notation. Given a set $\mathcal{X}$, we let $\Delta(\mathcal{X})$ denote the set of all probability distributions over $\mathcal{X}$; if $\mathcal{X}$ is continuous, we restrict $\Delta(\mathcal{X})$ to distributions with countable support. We let $\|x\|$ denote the Euclidean norm of $x \in \mathbb{R}^d$. For any positive definite matrix $H \in \mathbb{R}^{d\times d}$, we denote the induced norm of $x \in \mathbb{R}^d$ by $\|x\|_H^2 = \langle x, Hx\rangle$. For functions $f, g : \mathcal{X} \to \mathbb{R}_+$, we write $f = O(g)$ if there exists a constant $C > 0$ such that $f(x) \le C g(x)$ for all $x \in \mathcal{X}$. We write $f = \widetilde{O}(g)$ if $f = O(g\cdot\max\{1, \mathrm{polylog}(g)\})$, and define $\widetilde{\Omega}(\cdot)$ analogously.

## 3 Adapting to Misspecification: An Oracle-Efficient Algorithm

We now present our main result: an efficient reduction from contextual bandits to online regression that adapts to the unknown misspecification level $\epsilon_T(S)$ and supports infinite action sets. Our main theorem is as follows.

**Theorem 1.** Suppose we have access to a weighted regression oracle $\mathrm{SqAlg}$ that satisfies Assumption 2 for the class $\mathcal{F}$. Then there exists an efficient reduction which guarantees that for any sequence $S = (x_1, \mathcal{A}_1), \ldots, (x_T, \mathcal{A}_T)$ with misspecification level $\epsilon_T(S)$,
$$\mathrm{Reg}(T) = O\Big(\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)\log(T)} + \epsilon_T(S)\,\sqrt{d}\,T\Big).$$

The algorithm has two building blocks. First, we extend the reduction of [20] to infinite action sets via a new optimization-based perspective, and we show that the resulting algorithm has favorable dependence on the misspecification level when that level is known in advance. Then, we combine this reduction with a scheme which aggregates multiple instances to adapt to unknown misspecification. If the time required for a single query to $\mathrm{SqAlg}$ is $\mathcal{T}_{\mathrm{SqAlg}}$, then the per-step runtime of our algorithm is $O(\mathcal{T}_{\mathrm{SqAlg}} + |\mathcal{A}_t|\cdot\mathrm{poly}(d))$.

As an important application, we solve an open question recently posed by Lattimore et al. [33]: we exhibit an efficient algorithm for infinite-action linear contextual bandits which optimally adapts to unknown misspecification.

**Corollary 1.** Let $\mathcal{F} = \{x \mapsto \theta \mid \theta \in \mathbb{R}^d,\ \|\theta\| \le 1\}$. Then there exists an efficient algorithm that, for any sequence $S = (x_1, \mathcal{A}_1), \ldots, (x_T, \mathcal{A}_T)$, satisfies
$$\mathrm{Reg}(T) = O\big(d\sqrt{T}\log(T) + \epsilon_T(S)\,\sqrt{d}\,T\big).$$

This result follows immediately from Theorem 1 by using the online Newton step algorithm as the regression oracle, as in Example 1. Modulo logarithmic factors, this bound coincides with the one achieved by Lattimore et al. [33] for the simpler non-contextual linear bandit problem, for which the authors also present a matching lower bound.

The remainder of this section is dedicated to proving Theorem 1. The roadmap is as follows. First, we revisit the reduction from $K$-armed contextual bandits to online regression of Foster and Rakhlin [20] and provide a new optimization-based perspective. This new viewpoint leads to a natural generalization from the $K$-armed case to the infinite-action case. We then provide an aggregation-type procedure which combines multiple instances of this algorithm to adapt to unknown misspecification, and finally put all the pieces together to prove the main result. As an extension, we also give a variant of the algorithm which enjoys improved bounds when the action sets $\mathcal{A}_t$ lie in low-dimensional subspaces of $\mathbb{R}^d$. Going forward, we abbreviate $\epsilon_T(S)$ to $\epsilon_T$ whenever the sequence $S$ is clear from context.

### 3.1 Oracle Reductions with Finite Actions: An Optimization-Based Perspective

An important special case of our setting is the finite-arm contextual bandit problem, where $\mathcal{A}_t = \mathcal{A}_K := \{e_1, \ldots, e_K\}$. For this setting, Foster and Rakhlin [20] proposed an efficient and optimal reduction called SquareCB, displayed in Algorithm 1. At each step, SquareCB queries the oracle $\mathrm{SqAlg}$ with the current context $x_t$ and receives a loss predictor $\hat\theta_t \in \mathbb{R}^K$ (so that $(\hat\theta_t)_i$ predicts the loss of action $i$). The algorithm then samples an action from a probability distribution introduced by Abe and Long [3].
Specifically, for any $\theta \in \mathbb{R}^K$ and learning rate $\gamma > 0$, we define $\textsf{abe-long}(\theta, \gamma)$ as the distribution $p \in \Delta([K])$ obtained by first selecting any $i^\star \in \arg\min_{i\in[K]}\theta_i$, then defining
$$p_i = \begin{cases} \dfrac{1}{K + \gamma(\theta_i - \theta_{i^\star})}, & \text{if } i \ne i^\star,\\[2mm] 1 - \sum_{j\ne i^\star} p_j, & \text{if } i = i^\star. \end{cases} \tag{3}$$

**Algorithm 1 (SquareCB [20]).**
Input: learning rate $\gamma$, time horizon $T$. Initialize regression oracle $\mathrm{SqAlg}$.
For $t = 1, \ldots, T$:
- Receive context $x_t$ and let $\hat\theta_t$ be the oracle's prediction for $x_t$.
- Sample $I_t \sim \textsf{abe-long}(\hat\theta_t, \gamma)$, play $a_t = e_{I_t}$, and observe loss $\ell_t$.
- Update $\mathrm{SqAlg}$ with $(x_t, a_t, \ell_t)$.

By choosing $\gamma \asymp \sqrt{KT/(\mathrm{Reg}_{\mathrm{Sq}}(T) + \epsilon_T^2 T)}$, this algorithm guarantees that
$$\mathrm{Reg}(T) = O\Big(\sqrt{KT\,\mathrm{Reg}_{\mathrm{Sq}}(T)} + \epsilon_T\sqrt{K}\,T\Big).$$
Since this approach is the starting point for our results, it will be useful to sketch the proof. For $p \in \Delta(\mathcal{A})$, let $H_p := \mathbb{E}_{a\sim p}[aa^{\top}]$ be the correlation matrix, and let $a_p := \mathbb{E}_{a\sim p}[a]$ be the expected action. Let the sequence $S$ be fixed, and let $f^\star \in \mathcal{F}$ be any regression function which attains the value of $\epsilon_T(S)$ in Eq. (2) (if the infimum is not attained, we can simply apply the argument that follows with a limit sequence). With $a^\star_t := \pi_{f^\star}(x_t, \mathcal{A}_t)$ and $\theta^\star_t := f^\star(x_t)$, we have
$$\mathbb{E}\Big[\sum_{t=1}^T \mu(a_t, x_t) - \inf_{a\in\mathcal{A}_t}\mu(a, x_t)\Big] \le \mathbb{E}\Big[\sum_{t=1}^T \langle a_t - a^\star_t,\, \theta^\star_t\rangle\Big] + 2\epsilon_T T$$
$$= \mathbb{E}\Big[\sum_{t=1}^T \Big(\langle a_{p_t} - a^\star_t,\, \theta^\star_t\rangle - \tfrac{\gamma}{4}\|\theta^\star_t - \hat\theta_t\|^2_{H_{p_t}}\Big)\Big] + \mathbb{E}\Big[\sum_{t=1}^T \tfrac{\gamma}{4}\|\theta^\star_t - \hat\theta_t\|^2_{H_{p_t}}\Big] + 2\epsilon_T T.$$
The first expectation term above is bounded by $O(KT/\gamma)$, which is established by showing that $\textsf{abe-long}(\hat\theta, \gamma)$ is an approximate solution to the per-round minimax problem
$$\min_{p\in\Delta([K])}\ \max_{\theta^\star\in\mathbb{R}^K}\ \max_{a^\star\in\mathcal{A}_K}\ \langle a_p - a^\star,\, \theta^\star\rangle - \tfrac{\gamma}{4}\|\hat\theta - \theta^\star\|^2_{H_p}, \tag{4}$$
whose value is $O(K/\gamma)$. The second expectation term is bounded by $O(\gamma\,(\mathrm{Reg}_{\mathrm{Sq}}(T) + \epsilon_T^2 T))$, which follows almost immediately from the definition of the square loss regret in Assumption 1 (see the proof of Theorem 3 for details). Choosing $\gamma$ to balance the terms leads to the result.

As a first step toward generalizing this result to infinite actions, we propose a new distribution which exactly solves the minimax problem (4). This distribution is the solution to a dual optimization problem based on log-barrier regularization, and provides a new, principled approach to deriving reductions.

**Lemma 1.** For any $\theta \in \mathbb{R}^K$ and $\gamma > 0$, the unique minimizer of Eq. (4) coincides with the unique minimizer of the $\textsf{log-barrier}(\theta, \gamma)$ optimization problem defined by
$$\textsf{log-barrier}(\theta, \gamma) = \operatorname*{argmin}_{p\in\Delta([K])}\ \langle p, \theta\rangle - \gamma^{-1}\sum_{a\in[K]}\log(p_a), \tag{5}$$
whose solution is given coordinate-wise by $p_i = \frac{1}{\lambda + \gamma\theta_i}$, where $\lambda$ is the unique value that ensures that these weights sum to one.

The abe-long distribution is closely related to the log-barrier distribution: rather than finding the optimal Lagrange multiplier $\lambda$ that solves the log-barrier problem, the abe-long strategy simply plugs in $\lambda = K - \gamma\min_i\theta_i$, then shifts weight onto $p_{i^\star}$ to ensure the distribution is normalized. Since the log-barrier strategy solves the minimax problem Eq. (4) exactly, plugging it into the results of Foster and Rakhlin [20] and Simchi-Levi and Xu [38] in place of abe-long leads to slightly improved constants. More importantly, this new perspective leads to a principled way to extend these reductions to infinite actions.
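Below is a minimal sketch of the two finite-action distributions just described: the closed-form abe-long assignment of Eq. (3), and the log-barrier distribution of Lemma 1, with the normalization constant $\lambda$ found by bisection. The bisection tolerance and bracket are arbitrary implementation choices.

```python
import numpy as np

def abe_long(theta, gamma):
    """Abe-Long action distribution of Eq. (3)."""
    theta = np.asarray(theta, dtype=float)
    K = len(theta)
    i_star = int(np.argmin(theta))
    p = 1.0 / (K + gamma * (theta - theta[i_star]))
    p[i_star] = 0.0
    p[i_star] = 1.0 - p.sum()   # leftover mass goes to the greedy arm i*
    return p

def log_barrier(theta, gamma, tol=1e-10):
    """Log-barrier distribution of Lemma 1: p_i = 1/(lambda + gamma*theta_i),
    with lambda chosen by bisection so that the weights sum to one."""
    theta = np.asarray(theta, dtype=float)
    K = len(theta)
    # Positivity of every p_i requires lambda > -gamma*min_i(theta_i); the sum of
    # weights is > 1 just above that point and decreases monotonically in lambda.
    lo = -gamma * float(np.min(theta)) + 1e-12
    hi = lo + K   # at lambda = lo + K, every p_i <= 1/K, so the sum is <= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.sum(1.0 / (mid + gamma * theta)) > 1.0:
            lo = mid
        else:
            hi = mid
    return 1.0 / (0.5 * (lo + hi) + gamma * theta)
```

For $\gamma \to 0$ both distributions approach the uniform distribution, while for large $\gamma$ they concentrate on the greedy action $i^\star$.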
### 3.2 Moving to Infinite Action Sets: The Log-Determinant Barrier

**Algorithm 2 (SquareCB.Inf).**
Input: learning rate $\gamma$, time horizon $T$. Initialize regression oracle $\mathrm{SqAlg}$.
For $t = 1, \ldots, T$:
- Receive context $x_t$ and let $\hat\theta_t$ be the oracle's prediction for $x_t$.
- Play $a_t \sim \textsf{logdet-barrier}(\hat\theta_t, \gamma;\, \mathcal{A}_t)$ and observe loss $\ell_t$.
- Update $\mathrm{SqAlg}$ with $(x_t, a_t, \ell_t)$.

We generalize the log-barrier distribution to infinite action sets using the log-determinant function. For any $p \in \Delta(\mathcal{A})$, denote $a_p = \mathbb{E}_{a\sim p}[a]$ and $H_p = \mathbb{E}_{a\sim p}[aa^{\top}]$. Furthermore, we use $\dim(\mathcal{A})$ to denote the dimension of the smallest affine linear subspace that contains $\mathcal{A}$. When $\dim(\mathcal{A}) < d$, we adopt the convention that $\det(\cdot)$ takes the product of only the first $\dim(\mathcal{A})$ eigenvalues of the matrix in its argument, so that the solution below is well-defined. Our logdet-barrier distributions are defined as follows.

**Definition 1.** For any $\theta \in \mathbb{R}^d$, action set $\mathcal{A} \subseteq \mathbb{R}^d$, and $\gamma > 0$, the set of $\textsf{logdet-barrier}(\theta, \gamma; \mathcal{A})$ distributions is defined as the set of solutions to
$$\operatorname*{argmin}_{p\in\Delta(\mathcal{A})}\ \langle a_p, \theta\rangle - \gamma^{-1}\log\det\big(H_p - a_p a_p^{\top}\big). \tag{6}$$

In general, Eq. (6) does not admit a unique solution; all of our results apply to any minimizer. Our key result is that these logdet-barrier distributions solve a minimax problem analogous to that of Eq. (4).

**Lemma 2.** Any solution $p$ to $\textsf{logdet-barrier}(\hat\theta, \gamma; \mathcal{A})$ satisfies
$$\max_{\theta^\star\in\mathbb{R}^d}\ \max_{a^\star\in\mathcal{A}}\ \langle a_p - a^\star,\, \theta^\star\rangle - \tfrac{\gamma}{4}\|\hat\theta - \theta^\star\|^2_{H_p} \;\le\; \gamma^{-1}\dim(\mathcal{A}). \tag{7}$$

By replacing the abe-long distribution with the logdet-barrier distribution in Algorithm 1, we obtain an optimal reduction for infinite action sets. This algorithm, which we call SquareCB.Inf, is displayed in Algorithm 2.

**Theorem 2.** Given a regression oracle $\mathrm{SqAlg}$ that satisfies Assumption 1 for the class $\mathcal{F}$, SquareCB.Inf with learning rate $\gamma \asymp \sqrt{dT/(\mathrm{Reg}_{\mathrm{Sq}}(T) + \epsilon^2 T)}$ guarantees, for all sequences $S$ with $\epsilon_T(S) \le \epsilon$, that
$$\mathrm{Reg}(T) = O\Big(\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)} + \epsilon\sqrt{d}\,T\Big).$$

The logdet-barrier optimization problem is closely related to the D-optimal experimental design problem and to finding the John ellipsoid [29, 43], which correspond to the case $\theta = 0$ in Eq. (6) [31]. By adapting specialized optimization algorithms for these problems (in particular, a Frank-Wolfe-type scheme), we can efficiently solve the logdet-barrier problem. In particular, we have the following proposition.

**Proposition 1.** An approximation to Eq. (6) that achieves the same regret bound up to a constant factor can be computed in time $O(|\mathcal{A}_t|\cdot\mathrm{poly}(d))$ and memory $O(\log|\mathcal{A}_t|\cdot\mathrm{poly}(d))$ per round.

The algorithm, together with a full analysis of its runtime and memory complexity and of the impact on the regret, is provided in the supplementary material.
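The following is a rough Frank-Wolfe-style sketch for approximating the logdet-barrier objective (6) over a finite action set, in the spirit of the D-optimal-design algorithms mentioned above. It is not the procedure behind Proposition 1 (which is given in the supplementary material): the gradient computation follows directly from differentiating (6), but the uniform initialization, the classic $2/(k+2)$ step size, and the fixed iteration budget are illustrative assumptions.

```python
import numpy as np

def logdet_barrier_fw(actions, theta, gamma, iters=200):
    """Frank-Wolfe-style heuristic for the logdet-barrier objective (6)
    over a finite action set.

    actions : array of shape (n, d), rows are the actions in A_t
    theta   : oracle prediction, shape (d,)
    gamma   : learning rate of the reduction

    Returns a distribution p over the rows of `actions`.
    """
    actions = np.asarray(actions, dtype=float)
    n, d = actions.shape
    p = np.full(n, 1.0 / n)   # start from the uniform distribution
    for k in range(iters):
        a_p = p @ actions                                        # mean action
        M = actions.T @ (actions * p[:, None]) - np.outer(a_p, a_p)
        Minv = np.linalg.pinv(M)                 # pseudo-inverse if dim(A_t) < d
        # Coordinate-wise gradient of <a_p, theta> - (1/gamma) log det(M):
        #   a_i^T theta - (1/gamma) (a_i^T Minv a_i - 2 a_i^T Minv a_p)
        quad = np.einsum('ij,jk,ik->i', actions, Minv, actions)
        cross = actions @ (Minv @ a_p)
        grad = actions @ theta - (quad - 2.0 * cross) / gamma
        i = int(np.argmin(grad))                 # best vertex of the simplex
        eta = 2.0 / (k + 2.0)
        step = np.zeros(n)
        step[i] = 1.0
        p = (1.0 - eta) * p + eta * step
    return p
```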
### 3.3 Adapting to Misspecification: Algorithmic Framework

The regret bound of SquareCB.Inf in Theorem 2 achieves optimal dependence on the dimension and on the misspecification level, but requires an a-priori upper bound on $\epsilon_T(S)$ to set the learning rate. We now turn our attention to adapting to this parameter. At a high level, our approach is to run multiple instances of SquareCB.Inf, each tuned to a different level of misspecification, and then run an aggregation procedure on top to learn the best instance. Specifically, if we initialize a collection of $M := \lceil\log(T)\rceil$ instances of Algorithm 2 in which the learning rate for instance $m$ is tuned for misspecification level $\epsilon_m := \exp(-m)$ (that is, we follow a geometric grid), then it is straightforward to show that there exists $m^\star \in [M]$ such that the $m^\star$-th instance would enjoy optimal regret if we ran it on the sequence $S$. Since $m^\star$ is not known a priori, we run an aggregation (or "corralling") procedure [9] to select the best instance. This approach is, in general, not suitable for model selection, since it typically requires prior knowledge of the optimal regret bound to tune certain parameters appropriately [22]. We show that adaptation to misspecification is an exception to this rule, and provides a simple setting where model selection for contextual bandits is possible.

**Algorithm 3 (Corralling [9]).**
Input: master algorithm Master, time horizon $T$. Initialize base algorithms $(\mathrm{Base}_m)_{m=1}^M$.
For $t = 1, \ldots, T$:
- Receive context $x_t$.
- Receive the sampled index $A_t$ and probability $q_{t,A_t}$ from Master.
- Pass $(x_t, \mathcal{A}_t, q_{t,A_t}, \rho_{t,A_t})$ to $\mathrm{Base}_{A_t}$.
- $\mathrm{Base}_{A_t}$ plays $a_t$ and observes $\ell_t$.
- Update Master with $\ell_{t,A_t} = \ell_t + 1$.

We consider the aggregation scheme in Algorithm 3, which is a generalization of the CORRAL algorithm of Agarwal et al. [9]. The algorithm is initialized with $M$ base algorithms, and uses a multi-armed bandit algorithm with $M$ arms as a master algorithm responsible for choosing which base algorithm to follow at each round. The master maintains a distribution $q_t \in \Delta([M])$ over the base algorithms. At each round $t$, it samples an algorithm $A_t \sim q_t$ and passes the current context $x_t$ to this algorithm, along with the sampling probability $q_{t,A_t}$ and a weight $\rho_{t,A_t}$, where we define $\rho_{t,m} := 1/\min_{s\le t} q_{s,m}$ for each $m$. The base algorithm $A_t$ now plays a regular contextual bandit round: given the context $x_t$, it proposes an action $a_t$, which is pulled; it then receives the loss $\ell_t$ and updates its internal state. Finally, the master updates its own state with the arm-loss pair $(A_t, \ell_{t,A_t})$, where $\ell_{t,A_t} := \ell_t + 1$ (for technical reasons, it is useful to shift the loss by $1$ to ensure non-negativity). Let
$$\mathrm{Reg}^m_{\mathrm{Imp}}(T) := \mathbb{E}\Big[\sum_{t=1}^T \frac{\mathbb{I}\{A_t = m\}}{q_{t,m}}\Big(\mu(a_t, x_t) - \inf_{a\in\mathcal{A}_t}\mu(a, x_t)\Big)\Big]$$
denote the importance-weighted regret for base algorithm $m$, which is simply the pseudoregret incurred in the rounds where Algorithm 3 follows this base algorithm, weighted inversely proportional to the probability that this occurs. It is straightforward to show that for any choice of master and base algorithms, this scheme guarantees that
$$\mathrm{Reg}(T) = \mathbb{E}\Big[\sum_{t=1}^T \ell_{t,A_t} - \ell_{t,m^\star}\Big] + \mathrm{Reg}^{m^\star}_{\mathrm{Imp}}(T), \tag{8}$$
where $\ell_{t,m}$ henceforth denotes the loss the algorithm would have suffered at round $t$ if we had $A_t = m$. That is, the regret of Algorithm 3 is equal to the regret $\mathrm{Reg}_{\mathrm{M}}(T) := \mathbb{E}[\sum_{t=1}^T \ell_{t,A_t} - \ell_{t,m^\star}]$ of the master algorithm, plus the importance-weighted regret of the optimal base algorithm $m^\star$.

The difficulty in instantiating this general scheme lies in the fact that the importance-weighted regret of the best base algorithm typically scales with $\mathbb{E}[\rho^\alpha_{T,m^\star}]\cdot\mathrm{Reg}^{m^\star}_{\mathrm{Unw}}(T)$, where $\alpha \in [\tfrac12, 1]$ is an algorithm-dependent parameter and $\mathrm{Reg}^m_{\mathrm{Unw}}(T) := \mathbb{E}[\sum_{t=1}^T \mathbb{I}\{A_t = m\}(\mu(a_t, x_t) - \inf_{a\in\mathcal{A}_t}\mu(a, x_t))]$ denotes the unweighted regret of algorithm $m$. A priori, the factor $\mathbb{E}[\rho^\alpha_{T,m^\star}]$ can be unbounded, leading to large regret. The key to the analysis of Agarwal et al. [9], and the approach we follow here, is to use a master algorithm with negative regret proportional to $\mathbb{E}[\rho^\alpha_{T,m^\star}]$, which allows us to cancel this factor.
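The corralling loop of Algorithm 3 can be summarized by the following skeleton. It is a sketch under assumed interfaces: `master` is any multi-armed bandit over the $M$ base algorithms exposing `distribution()` and `update(arm, loss)`, each base exposes `act(x, A, q, rho)` and `update(loss)`, and `env` is an environment such as the toy one sketched in Section 2; none of these names come from the paper.

```python
import numpy as np

def corral_loop(master, bases, env, T):
    """Skeleton of the corralling scheme (Algorithm 3)."""
    M = len(bases)
    min_q = np.ones(M)                      # running minima of q_{t,m}
    for t in range(T):
        x, A = env.observe(t)               # context x_t and action set A_t
        q = master.distribution()           # q_t over the M base algorithms
        m = int(np.random.choice(M, p=q))   # sampled base algorithm A_t
        min_q = np.minimum(min_q, q)
        rho = 1.0 / min_q[m]                # rho_{t,m} = 1 / min_{s<=t} q_{s,m}
        a = bases[m].act(x, A, q[m], rho)   # the chosen base proposes an action
        loss = env.play(t, a)               # observe ell_t in [-1, 1]
        bases[m].update(loss)
        master.update(m, loss + 1.0)        # shifted loss ell_{t,A_t} in [0, 2]
```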
**Algorithm 4 (SquareCB.Imp, for base algorithm $m$).**
Input: $T$, $\mathrm{Reg}_{\mathrm{Sq}}(T)$. Initialize weighted regression oracle $\mathrm{SqAlg}$.
For $t \in (\tau_1, \tau_2, \ldots) \subseteq [T]$ (the rounds on which instance $m$ is selected):
- Receive context $x_t$ and $(q_{t,m}, \rho_{t,m})$.
- Set $\gamma_{t,m} = \sqrt{dT/(\rho_{t,m}\,\mathrm{Reg}_{\mathrm{Sq}}(T))}$ and $w_t = \gamma_{t,m}/q_{t,m}$.
- Compute the oracle's prediction $\hat\theta_t$ for $(x_t, w_t)$.
- Sample $a_t \sim \textsf{logdet-barrier}(\hat\theta_t, \gamma_{t,m};\, \mathcal{A}_t)$, play $a_t$, and observe loss $\ell_t$.
- Update $\mathrm{SqAlg}$ with $(w_t, x_t, a_t, \ell_t)$.

**Base algorithm.** As the first step towards instantiating the aggregation scheme above, we specify the base algorithm. We use a modification of SquareCB.Inf based on importance weighting, which is designed to ensure that the importance-weighted regret in Eq. (8) is bounded. Pseudocode for the $m$-th base algorithm is given in Algorithm 4. Let the instance $m$ be fixed, and let $Z_{t,m} = \mathbb{I}\{A_t = m\}$ indicate the event that this instance gets to select an action; note that we have $Z_{t,m} \sim \mathrm{Ber}(q_{t,m})$ marginally. When $Z_{t,m} = 1$, instance $m$ receives $q_{t,m}$ and $\rho_{t,m} = \max_{s\le t} q^{-1}_{s,m}$ from the master algorithm. The instance then follows the same update scheme as the vanilla version of SquareCB.Inf, except that (i) it uses an adaptive learning rate $\gamma_{t,m}$, which is tuned based on $\rho_{t,m}$, and (ii) it uses a weighted square loss regression oracle (Assumption 2), with the weight $w_t$ set as a function of $\gamma_{t,m}$ and $q_{t,m}$. The importance-weighted regret $\mathrm{Reg}^m_{\mathrm{Imp}}(T)$ for this scheme is bounded as follows.

**Theorem 3.** When invoked within Algorithm 3 with a regression oracle satisfying Assumption 2, the importance-weighted regret for each instance $m \in [M]$ of Algorithm 4 satisfies
$$\mathrm{Reg}^m_{\mathrm{Imp}}(T) \;\le\; \tfrac{3}{2}\,\mathbb{E}\big[\sqrt{\rho_{T,m}}\big]\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)} \;+\; \big(\epsilon_m \vee \epsilon_T + \epsilon_T\big)\sqrt{d}\,T \;+\; 2\,\epsilon_T\,T. \tag{9}$$

The key feature of this regret bound is that only the leading term involving $\mathrm{Reg}_{\mathrm{Sq}}(T)$ depends on the importance weights, not the second term involving the misspecification. This allows us to get away with tuning the master algorithm using only $d$, $T$, and $\mathrm{Reg}_{\mathrm{Sq}}(T)$, but not $\epsilon_T$, which is critical for adapting without prior knowledge. Another important detail is that if $\epsilon_m$ is within a constant factor of $\epsilon_T$, the second term simplifies to $O(\epsilon_T\sqrt{d}\,T)$ as desired.
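The per-round step of the base algorithm can be sketched as follows, in the interface assumed by the corralling skeleton above. The learning-rate schedule mirrors the pseudocode of Algorithm 4; constants are illustrative, and the tuning of each instance for its misspecification guess $\epsilon_m$ (Section 3.3) is omitted here.

```python
import numpy as np

class SquareCBImpBase:
    """Sketch of the importance-weighted base algorithm (Algorithm 4)."""

    def __init__(self, oracle, d, T, reg_sq):
        self.oracle = oracle          # weighted regression oracle (Assumption 2)
        self.d, self.T, self.reg_sq = d, T, reg_sq

    def act(self, x, A, q, rho):
        # Adaptive learning rate gamma_{t,m} based on the importance weight rho_{t,m}.
        gamma = np.sqrt(self.d * self.T / (rho * self.reg_sq))
        # (The paper additionally tunes each instance for its misspecification
        #  guess eps_m; that tuning is omitted in this sketch.)
        self.w = gamma / q            # weight passed to the regression oracle
        theta_hat = self.oracle.predict(x)
        p = logdet_barrier_fw(A, theta_hat, gamma)   # sketch from Section 3.2
        p = p / p.sum()                              # renormalize for safety
        idx = int(np.random.choice(len(A), p=p))
        self.last = (x, A[idx])
        return A[idx]

    def update(self, loss):
        x, a = self.last
        self.oracle.update(self.w, a, loss)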
### 3.4 Improved Master Algorithms for Combining Bandit Algorithms

It remains to provide a master algorithm for use within Algorithm 3. While it turns out that the master algorithm proposed in Agarwal et al. [9] suffices for this task, we go a step further and propose a new master algorithm, called $(\alpha, R)$-hedged FTRL, which is simpler and enjoys slightly improved regret, removing logarithmic factors. While this is not the focus of the paper, we believe it to be a useful contribution on its own, because it provides a new approach to designing master algorithms for bandit aggregation. We hope that it will find use more broadly.

The $(\alpha, R)$-hedged FTRL algorithm is parameterized by a regularizer and two scale parameters $\alpha \in (0, 1)$ and $R > 0$. We defer a precise definition and analysis to the supplementary material, and state only the result relevant for our aggregation setup here. This result concerns a specific instance of the $(\alpha, R)$-hedged FTRL algorithm that we call $(\alpha, R)$-hedged Tsallis-INF, which instantiates the framework using the Tsallis entropy as the regularizer [11, 5, 47]. The key property of the algorithm is that its regret with respect to a policy playing a fixed arm $m$ contains a negative contribution of magnitude $\rho^{\alpha}_{T,m} R$. The following result is a corollary of a more general theorem, found in the supplementary material.

**Corollary 2.** Consider the adversarial multi-armed bandit problem with $M$ arms and losses $\ell_{t,m} \in [0, 2]$. For any $\alpha \in (0, 1)$ and $R > 0$, the $(\alpha, R)$-hedged Tsallis-INF algorithm with learning rate $\eta = \sqrt{1/(2T)}$ guarantees that for all $m \in [M]$,
$$\mathbb{E}\Big[\sum_{t=1}^T \ell_{t,A_t} - \ell_{t,m}\Big] \;\le\; 4\sqrt{2MT} + \mathbb{E}\Big[\min\Big\{\tfrac{1}{1-\alpha},\, 2\log(\rho_{T,m})\Big\}\,M^{\alpha} - \rho^{\alpha}_{T,m}\Big]\cdot R. \tag{10}$$

### 3.5 Putting Everything Together

Crucially, the regret bound in Corollary 2 has a negative $R\,\rho^{\alpha}_{T,m}$ term which, for sufficiently large $R$ and appropriate $\alpha$, can be used to offset the regret incurred from importance-weighting the base algorithms. In particular, $\big(\tfrac12,\ 3\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)}\big)$-hedged Tsallis-INF has exactly the negative regret contribution needed to cancel the importance-weighting term in Eq. (9) when we use SquareCB.Imp as the base algorithm. In more detail, we combine the regret bounds for the master and base algorithms as follows to prove Theorem 1.

**Proof sketch for Theorem 1.** Using Eq. (8), it suffices to bound the regret of the bandit master, $\mathrm{Reg}_{\mathrm{M}}(T)$, and the importance-weighted regret $\mathrm{Reg}^{m^\star}_{\mathrm{Imp}}(T)$ of the optimal instance $m^\star$. By Corollary 2, using $\big(\tfrac12,\ 3\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)}\big)$-hedged Tsallis-INF as the master algorithm gives
$$\mathrm{Reg}_{\mathrm{M}}(T) \le O\Big(\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)\log(T)}\Big) - \tfrac{3}{2}\,\mathbb{E}\big[\sqrt{\rho_{T,m^\star}}\big]\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)}.$$
Whenever the misspecification level is not trivially small, the geometric grid ensures that there exists $m^\star \in [M]$ such that $e^{-1}\epsilon_T \le \epsilon_{m^\star} \le \epsilon_T$. For this instance, Theorem 3 yields
$$\mathrm{Reg}^{m^\star}_{\mathrm{Imp}}(T) \le \tfrac{3}{2}\,\mathbb{E}\big[\sqrt{\rho_{T,m^\star}}\big]\sqrt{dT\,\mathrm{Reg}_{\mathrm{Sq}}(T)} + O\big(\epsilon_T\sqrt{d}\,T\big).$$
Summing the two bounds using Eq. (8) completes the proof.

### 3.6 Extension: Adapting to the Average Dimension

A canonical application of linear contextual bandits is online news article recommendation, where the context $x_t$ is taken to be a feature vector containing information about the user, and each action $a \in \mathcal{A}_t$ is the concatenation of $x_t$ with a feature representation for a candidate article (e.g., Li et al. [34]). In this application and others like it, it is often the case that while examples lie in a high-dimensional space, the true dimensionality $\dim(\mathcal{A}_t)$ of the action set is small, so that $d_{\mathrm{avg}} := \frac{1}{T}\sum_{t=1}^T \dim(\mathcal{A}_t) \ll d$. If we have prior knowledge of $d_{\mathrm{avg}}$ (or an upper bound thereon), we can exploit this low dimensionality for tighter regret. In fact, following the proofs of Theorem 3 and Theorem 1, and bounding $\sum_{t=1}^T \dim(\mathcal{A}_t)$ by $d_{\mathrm{avg}} T$ instead of $dT$, it is fairly immediate to show that Algorithm 3 enjoys the improved regret bound $\mathrm{Reg}(T) = O\big(\sqrt{d_{\mathrm{avg}} T\,\mathrm{Reg}_{\mathrm{Sq}}(T)\log(T)} + \epsilon_T\sqrt{d_{\mathrm{avg}}}\,T\big)$, so long as $d$ is replaced by $d_{\mathrm{avg}}$ in the algorithm's various parameter settings. Our final result shows that it is possible to adapt to unknown $d_{\mathrm{avg}}$ and unknown misspecification simultaneously. The key idea is to apply a doubling trick on top of Algorithm 3.

**Theorem 4.** There exists an algorithm that, under the same conditions as Theorem 1, satisfies
$$\mathrm{Reg}(T) = O\Big(\sqrt{d_{\mathrm{avg}} T\,\mathrm{Reg}_{\mathrm{Sq}}(T)\log(T)} + \epsilon_T\sqrt{d_{\mathrm{avg}}}\,T\Big)$$
without prior knowledge of $d_{\mathrm{avg}}$ or $\epsilon_T$.

We remark that while the bound in Theorem 4 replaces the $d$ factor in the reduction with the data-dependent quantity $d_{\mathrm{avg}}$, the oracle's regret $\mathrm{Reg}_{\mathrm{Sq}}(T)$ may itself still depend on $d$ unless a sufficiently sophisticated algorithm is used.

## 4 Discussion

We have presented the first general-purpose, oracle-efficient algorithms for contextual bandits that adapt to unknown model misspecification. For infinite-action linear contextual bandits, our results yield the first optimal algorithms that adapt to unknown misspecification with changing action sets. Our results suggest a number of interesting conceptual questions. Can our optimization-based perspective lead to new oracle-based algorithms for richer types of infinite action sets? Examples include nonparametric actions and structured (e.g., sparse) linear actions. Can our reduction-based techniques be lifted to more sophisticated interactive learning settings such as reinforcement learning? On the technical side, we anticipate that our new approach to reductions will find broader use; natural extensions include reductions for offline oracles [38] and adapting to low-noise conditions [23]. Lastly, we recall that, in passing, we have derived a novel class of master algorithms for combining bandit algorithms which enjoys more flexibility, an improvement in logarithmic factors, and a greatly simplified analysis. We hope this result will be useful for future work on model selection in contextual bandits.

## Acknowledgements

DF acknowledges the support of NSF TRIPODS grant #1740751. We thank Teodor Marinov and Alexander Rakhlin for discussions on related topics.

## Broader Impact

This paper concerns contextual bandit algorithms that adapt to unknown model misspecification.
Because of their efficiency and their ability to adapt to the amount of misspecification with no prior knowledge, our algorithms are robust, and may be suitable for large-scale practical deployment. On the other hand, our work is foundational research, and hence its impact on society is shaped by the applications that stem from it. We focus our brief discussion on the applications mentioned in the introduction. Health services [42] offer an opportunity for positive impact: contextual bandits can be used to propose medical interventions that lead to better health outcomes. However, care must be taken to implement the explore-exploit tradeoff ethically in this sensitive setting, and more research is required. Online advertisement [4, 34] and recommendation systems [8] are another well-known application. While improved, robust algorithms can lead to increased profits here, it is important to recognize that this may not positively impact society as a whole. Lastly, we mention that predictive algorithms like contextual bandits become more and more powerful as more information is gathered about users. This provides a clear incentive toward collecting as much information as possible. We believe that the net benefit of research on contextual bandits outweighs the harm, but we welcome regulatory efforts to produce a legal framework that steers the usage of machine learning algorithms, including contextual bandits, in a direction that respects the privacy rights of users.

## References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, NIPS, pages 2312–2320. Curran Associates, Inc., 2011.

[2] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proc. of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1–9, 2012.

[3] N. Abe and P. M. Long. Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the 16th International Conference on Machine Learning, ICML, pages 3–11, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[4] N. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.

[5] J. D. Abernethy, C. Lee, and A. Tewari. Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems 28, NIPS, pages 2197–2205. Curran Associates, Inc., 2015.

[6] A. Agarwal, M. Dudik, S. Kale, J. Langford, and R. Schapire. Contextual bandit learning with predictable rewards. In Proc. of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[7] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1638–1646, 2014.

[8] A. Agarwal, S. Bird, M. Cozowicz, L. Hoang, J. Langford, S. Lee, J. Li, D. Melamed, G. Oshri, and O. Ribas. Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966, 2016.

[9] A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12–38, 2017.

[10] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proc. of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 127–135, 2013.
[11] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the Conference on Learning Theory (COLT), pages 217–226, 2009.

[12] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

[13] I. Bogunovic, A. Krause, and J. Scarlett. Corruption-tolerant Gaussian process bandit optimization. In Proc. of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

[14] V. Chernozhukov, M. Demirer, G. Lewis, and V. Syrgkanis. Semi-parametric efficient policy learning with continuous actions. In Advances in Neural Information Processing Systems, pages 15065–15075, 2019.

[15] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proc. of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15, pages 208–214. PMLR, 2011.

[16] K. Crammer and C. Gentile. Multiclass classification with bandit feedback using adaptive regularization. Machine Learning, 90(3):347–383, 2013.

[17] J. Djolonga, A. Krause, and V. Cevher. High-dimensional Gaussian process bandits. In Proc. of the 27th NIPS, pages 1025–1033, 2013.

[18] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.

[19] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, UAI, pages 169–178, 2011.

[20] D. J. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. International Conference on Machine Learning (ICML), 2020.

[21] D. J. Foster, A. Agarwal, M. Dudik, H. Luo, and R. Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pages 1539–1548, 2018.

[22] D. J. Foster, A. Krishnamurthy, and H. Luo. Model selection for contextual bandits. In Advances in Neural Information Processing Systems, pages 14741–14752, 2019.

[23] D. J. Foster, A. Rakhlin, D. Simchi-Levi, and Y. Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. arXiv preprint arXiv:2010.03104, 2020.

[24] P. Gaillard and S. Gerchinovitz. A chaining algorithm for online nonparametric regression. In Conference on Learning Theory, pages 764–796, 2015.

[25] A. Ghosh, S. R. Chowdhury, and A. Gopalan. Misspecified linear bandits. In Proc. of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[26] A. Gupta, T. Koren, and K. Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Proc. of the Conference on Learning Theory, pages 1562–1578, 2019.

[27] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[28] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, pages 927–935, 2011.

[29] L. G. Khachiyan and M. J. Todd. On the complexity of approximating the maximal inscribed ellipsoid for a polytope. Technical report, Cornell University Operations Research and Industrial Engineering, 1990.
[30] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Proc. of the 25th NIPS, 2011.

[31] P. Kumar and E. A. Yıldırım. Minimum-volume enclosing ellipsoids and core sets. Journal of Optimization Theory and Applications, 126(1):1–21, 2005.

[32] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20, NIPS, pages 817–824, 2008.

[33] T. Lattimore, C. Szepesvári, and G. Weisz. Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676, 2019.

[34] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.

[35] H. Luo, C.-Y. Wei, A. Agarwal, and J. Langford. Efficient contextual bandits in non-stationary worlds. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 1739–1776, 2018.

[36] T. Lykouris, V. Mirrokni, and R. Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proc. of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122. ACM, 2018.

[37] A. Pacchiano, M. Phan, Y. Abbasi-Yadkori, A. Rao, J. Zimmert, T. Lattimore, and C. Szepesvári. Model selection in contextual stochastic bandit problems. Neural Information Processing Systems (NeurIPS), 2020.

[38] D. Simchi-Levi and Y. Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN, 2020.

[39] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML'10: Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, 2010.

[40] Y. Sui, A. Gotovos, J. Burdick, and A. Krause. Safe exploration for optimization with Gaussian processes. In Proc. of the 32nd International Conference on Machine Learning, volume 37, pages 997–1005, 2015.

[41] V. Syrgkanis, A. Krishnamurthy, and R. Schapire. Efficient algorithms for adversarial contextual learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 2159–2168, 2016.

[42] A. Tewari and S. A. Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.

[43] M. J. Todd and E. A. Yıldırım. On Khachiyan's algorithm for the computation of minimum-volume enclosing ellipsoids. Discrete Applied Mathematics, 155(13):1731–1744, 2007.

[44] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Proc. of the 29th Conference on Uncertainty in Artificial Intelligence, UAI, pages 654–663, 2013.

[45] Y. Xu and A. Zeevi. Upper counterfactual confidence bounds: A new optimism principle for contextual bandits. arXiv preprint arXiv:2007.07876, 2020.

[46] A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill. Learning near optimal policies with low inherent Bellman error. arXiv preprint arXiv:2003.00153, 2020.

[47] J. Zimmert and Y. Seldin. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 467–475. PMLR, 2019.