# Model selection for contextual bandits

Dylan J. Foster (Massachusetts Institute of Technology, dylanf@mit.edu), Akshay Krishnamurthy (Microsoft Research NYC, akshay@cs.umass.edu), Haipeng Luo (University of Southern California, haipengl@usc.edu)

We introduce the problem of model selection for contextual bandits, where a learner must adapt to the complexity of the optimal policy while balancing exploration and exploitation. Our main result is a new model selection guarantee for linear contextual bandits. We work in the stochastic realizable setting with a sequence of nested linear policy classes of dimension $d_1 < d_2 < \cdots$, where the $m^*$-th class contains the optimal policy, and we design an algorithm that achieves $O(T^{2/3} d_{m^*}^{1/3})$ regret with no prior knowledge of the optimal dimension $d_{m^*}$. The algorithm also achieves regret $O(T^{3/4} + \sqrt{T d_{m^*}})$, which is optimal for $d_{m^*} \geq \sqrt{T}$. This is the first model selection result for contextual bandits with non-vacuous regret for all values of $d_{m^*}$, and to the best of our knowledge is the first positive result of this type for any online learning setting with partial information. The core of the algorithm is a new estimator for the gap in the best loss achievable by two linear policy classes, which we show admits a convergence rate faster than the rate required to learn the parameters for either class.

## 1 Introduction

Model selection is the fundamental statistical task of choosing a hypothesis class using data. The choice of hypothesis class modulates a tradeoff between approximation error and estimation error, as a small class can be learned with less data, but may have worse asymptotic performance than a richer class. In the classical statistical learning setting, model selection algorithms provide the following luckiness guarantee: if the class of models decomposes as a nested sequence $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}$, the sample complexity of the algorithm scales with the statistical complexity of the smallest subclass $\mathcal{F}_{m^*}$ containing the true model, even though $m^*$ is not known in advance. Such guarantees date back to Vapnik's structural risk minimization principle and are by now well-known (Vapnik, 1982, 1992; Devroye et al., 1996; Birgé and Massart, 1998; Shawe-Taylor et al., 1998; Lugosi and Nobel, 1999; Koltchinskii, 2001; Bartlett et al., 2002; Massart, 2007). In practice, one may use cross-validation, the de facto model selection procedure, to decide whether to use, for example, a linear model, a decision tree, or a neural network. That cross-validation appears in virtually every machine learning pipeline highlights the necessity of model selection for successful ML deployments.

We investigate model selection in contextual bandits, a simple interactive learning setting. Our main question is: Can model selection guarantees be achieved in contextual bandit learning, where a learner must balance exploration and exploitation to make decisions online?

Contextual bandit learning is more challenging than statistical learning on two fronts: first, decisions must be made online without seeing the entire dataset, and second, the learner's actions influence what data is observed ("bandit feedback"). Between these extremes is full-information online learning, where the learner does not have to deal with bandit feedback, but still makes decisions online. Even in this simpler setting, model selection is challenging, since the learner must attempt to identify the appropriate model class while making irrevocable decisions and incurring regret.
Nevertheless, several prior works on so-called parameter-free online learning (McMahan and Abernethy, 2013; Orabona, 2014; Koolen and Van Erven, 2015; Luo and Schapire, 2015; Foster et al., 2017; Cutkosky and Boahen, 2017) provide algorithms for online model selection with guarantees analogous to those in statistical learning. With bandit feedback, however, the learner must carefully balance exploration and exploitation, which presents a substantial challenge for model selection. At an intuitive level, the reason is that different hypothesis classes require different amounts of exploration, but either over- or under-exploring can incur significant regret (a detailed discussion requires a formal setup and is deferred to Section 2). At this point, it suffices to say that prior to this work, we are not aware of any adequate model selection guarantee that adapts results from statistical learning to any online learning setting with partial information.

We provide a new model selection guarantee for the linear stochastic contextual bandit setting (Chu et al., 2011; Abbasi-Yadkori et al., 2011). We consider a sequence of feature maps into $d_1 < d_2 < \cdots < d_M$ dimensions and assume that the losses are linearly related to the contexts via the $m^*$-th feature map, so that the optimal policy is a $d_{m^*}$-dimensional linear policy. We design an algorithm that achieves $O(T^{2/3} d_{m^*}^{1/3})$ regret to this optimal policy over $T$ rounds, with no prior knowledge of $d_{m^*}$. As this bound has no dependence on the maximum dimensionality $d_M$, we say that the algorithm adapts to the complexity of the optimal policy. All prior approaches suffer linear regret for non-trivial values of $d_{m^*}$, whereas the regret of our algorithm is sublinear whenever $d_{m^*}$ is such that the problem is learnable. Our algorithm can also be tuned to achieve $O(T^{3/4} + \sqrt{T d_{m^*}})$ regret, which matches the optimal rate when $d_{m^*} \geq \sqrt{T}$.

At a technical level, we design a sequential test to determine whether the optimal square loss for a large linear class is substantially better than that of a smaller linear class. We show that this test has sublinear sample complexity: while learning a near-optimal predictor in $d$ dimensions requires at least $\Omega(d)$ labeled examples, we can estimate the improvement in value of the optimal loss using only $O(\sqrt{d})$ examples, analogous to so-called variance estimation results in statistics (Dicker, 2014; Kong and Valiant, 2018). Crucially, this implies that we can test whether or not to use the larger class without over-exploring for the smaller class.

## 2 Preliminaries

We work in the stochastic contextual bandit setting (Langford and Zhang, 2008; Beygelzimer et al., 2011; Agarwal et al., 2014). The setting is defined by a context space $\mathcal{X}$, a finite action space $\mathcal{A} = \{1,\ldots,K\}$, and a distribution $\mathcal{D}$ supported over $(x, \ell)$ pairs, where $x \in \mathcal{X}$ and $\ell \in \mathbb{R}^{\mathcal{A}}$ is a loss vector. The learner interacts with nature for $T$ rounds, where in round $t$: (1) nature samples $(x_t, \ell_t) \sim \mathcal{D}$, (2) the learner observes $x_t$ and chooses action $a_t$, (3) the learner observes $\ell_t(a_t)$. The goal of the learner is to choose actions to minimize the cumulative loss. Following several prior works (Chu et al., 2011; Abbasi-Yadkori et al., 2011; Agarwal et al., 2012; Russo and Van Roy, 2013; Li et al., 2017), we study a variant of the contextual bandit setting where the learner has access to a class of regression functions $\mathcal{F} \subseteq (\mathcal{X} \times \mathcal{A} \to \mathbb{R})$ containing the Bayes optimal regressor:

$$f^*(x,a) = \mathbb{E}[\ell(a) \mid x] \qquad \forall x, a. \tag{1}$$

We refer to this assumption ($f^* \in \mathcal{F}$) as realizability.
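To make the interaction protocol and the realizability assumption concrete, here is a minimal simulation sketch in Python. Everything in it (the Gaussian feature distribution, the noise level, the placeholder uniform policy, and all names) is a hypothetical instance chosen only for illustration; it is not the environment or algorithm analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 5, 10, 1000
beta_star = rng.normal(size=d) / np.sqrt(d)      # hypothetical optimal coefficient vector

def sample_context():
    # One row of features phi(x_t, a) per action a = 1, ..., K (hypothetical distribution).
    return rng.normal(size=(K, d))

cumulative_loss = 0.0
for t in range(T):
    phi = sample_context()                        # (1) nature samples (x_t, ell_t); learner sees x_t
    f_star = phi @ beta_star                      # realizability: f*(x_t, a) = <beta*, phi(x_t, a)>
    a_t = rng.integers(K)                         # (2) placeholder policy; ModCB would choose here
    loss = f_star[a_t] + 0.1 * rng.normal()       # (3) only ell_t(a_t) is revealed (bandit feedback)
    cumulative_loss += loss
```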
For each regression function $f$ we define the induced policy $\pi_f(x) = \arg\min_{a} f(x,a)$. Note that $\pi^* := \pi_{f^*}$ is the globally optimal policy, and chooses the best action on every context. We measure performance via regret to $\pi^*$:

$$\mathrm{Reg} := \sum_{t=1}^{T} \ell_t(a_t) - \sum_{t=1}^{T} \ell_t(\pi^*(x_t)).$$

Low regret is tractable here due to the realizability assumption, and it is well known that the optimal regret is $\sqrt{T \cdot \mathrm{comp}(\mathcal{F})}$, where $\mathrm{comp}(\mathcal{F})$ measures the statistical complexity of $\mathcal{F}$. For example, $\mathrm{comp}(\mathcal{F}) = \log|\mathcal{F}|$ for finite classes, and $\mathrm{comp}(\mathcal{F}) = d$ for $d$-dimensional linear classes (Agarwal et al., 2012; Chu et al., 2011). (We suppress dependence on $K$ and logarithmic dependence on $T$ for this discussion.)

**Model selection for contextual bandits.** We aim for refined guarantees that scale with the complexity of the optimal regressor $f^*$ rather than the worst-case complexity of the class $\mathcal{F}$. To this end, we assume that $\mathcal{F}$ is structured as a nested sequence of classes $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_M = \mathcal{F}$, and we define $m^* = \min\{m : f^* \in \mathcal{F}_m\}$. The model selection problem for contextual bandits asks: Given that $m^*$ is not known in advance, can we achieve regret scaling as $O(\sqrt{T\,\mathrm{comp}(\mathcal{F}_{m^*})})$, rather than the less favorable $O(\sqrt{T\,\mathrm{comp}(\mathcal{F})})$?

A slightly weaker model selection problem is to achieve $O(T^{\alpha}\,\mathrm{comp}(\mathcal{F}_{m^*})^{1-\alpha})$ for some $\alpha \in [1/2, 1)$, again without knowing $m^*$. Crucially, the exponents on $T$ and $\mathrm{comp}(\mathcal{F}_{m^*})$ sum to one, implying that we can achieve sublinear regret whenever $\mathrm{comp}(\mathcal{F}_{m^*})$ is sublinear in $T$, which is precisely whenever the optimal model class is learnable. This implies that the bound, in spite of having worse dependence on $T$, adapts to the complexity of the optimal class with no prior knowledge.

We achieve this type of guarantee for linear contextual bandits. We assume that each regressor class $\mathcal{F}_m$ consists of linear functions of the form $\mathcal{F}_m = \{(x,a) \mapsto \langle\beta, \phi_m(x,a)\rangle : \beta \in \mathbb{R}^{d_m}\}$, where $\phi_m : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^{d_m}$ is a fixed feature map. To obtain a nested sequence of classes, and to ensure the complexity is monotonically increasing, we assume that $d_1 < d_2 < \cdots < d_M = d$ and that for each $m$, the feature map $\phi_m$ contains the map $\phi_{m-1}$ as its first $d_{m-1}$ coordinates. (This is without loss of generality in a certain quantitative sense, since we can concatenate features without significantly increasing the complexity of $\mathcal{F}_m$; see Corollary 5.) If $m^*$ is the smallest index whose feature map realizes the optimal regressor, we can write $f^*(x,a) = \langle\beta^*, \phi_{m^*}(x,a)\rangle$, where $\beta^* \in \mathbb{R}^{d_{m^*}}$ is the optimal coefficient vector. In this setup, the optimal rate if $m^*$ is known is $O(\sqrt{T d_{m^*}})$, obtained by LinUCB (Chu et al., 2011). (Regret scaling as $O(\sqrt{dT})$ is optimal for the finite action setting we work in. Results for the infinite action case, where regret scales as $\Theta(d\sqrt{T})$, are incomparable to ours.) Our main result achieves both $O(T^{2/3} d_{m^*}^{1/3})$ regret (i.e., $\alpha = 2/3$) and $O(T^{3/4} + \sqrt{T d_{m^*}})$ regret without knowing $m^*$ in advance.

**Related work.** The model selection guarantee we seek is straightforward for full-information online learning and statistical learning. A simple strategy for the former setting is to use a low-regret online learner for each sub-class $\mathcal{F}_m$ and aggregate these base learners with a master Hedge instance (Freund and Schapire, 1997). Other strategies include parameter-free methods like AdaNormalHedge (Luo and Schapire, 2015) and Squint (Koolen and Van Erven, 2015). Unfortunately, none of these methods appear to adequately handle bandit feedback. For example, the regret bounds of parameter-free methods do not depend on the so-called "local norms", which are essential for achieving $\sqrt{T}$-regret in the bandit setting via the usual importance weighting approach (Auer et al., 2002). See Appendix B for further discussion. In the bandit setting, two approaches we are aware of also fail: the Corral algorithm of Agarwal et al. (2017b), and an adaptive version of the classical $\epsilon$-greedy strategy (Langford and Zhang, 2008).
Unfortunately, both algorithms require tuning parameters in terms of the unknown index $m^*$, and naive tuning gives a guarantee of the form $O(T^{\alpha}\,\mathrm{comp}(\mathcal{F}_{m^*})^{\beta})$ where $\alpha + \beta > 1$. For example, for finite classes Corral gives regret $\sqrt{T}\,\log|\mathcal{F}_{m^*}|$. This guarantee is quite weak, since it is vacuous when $\log|\mathcal{F}_{m^*}| = \Omega(\sqrt{T})$, even though such a class admits sublinear regret if $m^*$ were known in advance (see Appendix B). The conceptual takeaway from these examples is that model selection for contextual bandits appears to require new algorithmic ideas, even when we are satisfied with $O(T^{\alpha}\,\mathrm{comp}(\mathcal{F}_{m^*})^{1-\alpha})$-type rates.

Several recent papers have developed adaptive guarantees for various contextual bandit settings. These include: (1) adaptivity to easy data, where the optimal policy achieves low loss (Allenberg et al., 2006; Agarwal et al., 2017a; Lykouris et al., 2018; Allen-Zhu et al., 2018), (2) adaptivity to smoothness in settings with continuous action spaces (Locatelli and Carpentier, 2018; Krishnamurthy et al., 2019), and (3) adaptivity in non-stationary environments, where distribution drift parameters are unknown (Luo et al., 2018; Cheung et al., 2019; Auer et al., 2018; Chen et al., 2019). The latter results can be cast as model selection with appropriate nested classes of time-dependent policies, but these results are incomparable to our own, since they are specialized to the non-stationary setting.

Interestingly, for multi-armed (non-contextual) bandits, several lower bounds demonstrate that model selection is not possible. The simplest of these results is Lattimore's Pareto frontier (Lattimore, 2015), which states that for multi-armed bandits, if we want to ensure $O(\sqrt{T})$ regret against a single fixed arm instead of the usual $O(\sqrt{KT})$ rate, we must incur $\Omega(K\sqrt{T})$ regret to the remaining $K-1$ arms. This precludes a model selection guarantee of the form $\sqrt{T\,\mathrm{comp}(\mathcal{A})}$, since for bandits the statistical complexity is simply the number of arms. Related lower bounds are known for Lipschitz bandits (Locatelli and Carpentier, 2018; Krishnamurthy et al., 2019). Our results show that model selection is possible for contextual bandits, and thus highlight an important gap between the two settings.

In concurrent work, Chatterji et al. (2019) studied a similar model selection problem with two classes, where the first class consists of all $K$ constant policies and the second is a $d$-dimensional linear class. They obtain logarithmic regret to the first class and $O(\sqrt{Td})$ regret to the second, but their assumptions on the context distribution are strictly stronger than our own. A detailed discussion is deferred to the end of the section.

**Technical preliminaries and assumptions.** For a matrix $A$, $A^{\dagger}$ denotes the pseudoinverse and $\|A\|_2$ denotes the spectral norm. $I_d$ denotes the identity matrix in $\mathbb{R}^{d\times d}$, and $\|\cdot\|_p$ denotes the $\ell_p$ norm. We use non-asymptotic big-O notation, and use $\tilde{O}$ to hide terms logarithmic in $K$, $d_M$, $M$, and $T$. For a real-valued random variable $z$, we use the following notation to indicate that $z$ is subgaussian or subexponential, following Vershynin (2012):

$$z \sim \mathrm{subG}(\sigma^2) \iff \sup_{p \geq 1}\big\{p^{-1/2}\,(\mathbb{E}|z|^p)^{1/p}\big\} \leq \sigma, \qquad z \sim \mathrm{subE}(\lambda) \iff \sup_{p \geq 1}\big\{p^{-1}\,(\mathbb{E}|z|^p)^{1/p}\big\} \leq \lambda. \tag{2}$$
When $z$ is a random variable in $\mathbb{R}^d$, we write $z \sim \mathrm{subG}_d(\sigma^2)$ if $\langle v, z\rangle \sim \mathrm{subG}(\sigma^2)$ for all $\|v\|_2 = 1$, and $z \sim \mathrm{subE}_d(\lambda)$ if $\langle v, z\rangle \sim \mathrm{subE}(\lambda)$ for all $\|v\|_2 = 1$. These definitions are equivalent to many other familiar definitions for subgaussian/subexponential random variables; see Appendix C.1.

We assume that for each $m$ and $a \in \mathcal{A}$, $\phi_m(x,a) \sim \mathrm{subG}_{d_m}(\tau_m^2)$ under $x \sim \mathcal{D}$. Nestedness implies that $\tau_1 \leq \tau_2 \leq \cdots$, and we define $\tau = \tau_M$. We also assume that $\ell(a) - \mathbb{E}[\ell(a) \mid x] \sim \mathrm{subG}(\sigma^2)$ for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$. Finally, we assume that $\|\beta^*\|_2 \leq B$. To keep notation clean, we use the convention that $\sigma \leq \tau$ and $B \leq 1$, which ensures that $\ell(a) \sim \mathrm{subG}(4\tau^2)$.

We require a lower bound on the eigenvalues of the second moment matrices for the feature vectors. For each $m$, define $\Sigma_m = \frac{1}{K}\sum_{a \in \mathcal{A}} \mathbb{E}_{x\sim\mathcal{D}}[\phi_m(x,a)\phi_m(x,a)^{\top}]$. We let $\gamma_m^2 = \lambda_{\min}(\Sigma_m)$, where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue; nestedness implies $\gamma_1 \geq \gamma_2 \geq \cdots$. We assume $\gamma_m \geq \gamma > 0$ for all $m$, and our regret bounds scale inversely proportional to $\gamma$. Note that prior linear contextual bandit algorithms (Chu et al., 2011; Abbasi-Yadkori et al., 2011) do not require lower bounds on the second moment matrices. As discussed earlier, the work of Chatterji et al. (2019) obtains stronger model selection guarantees in the case of two classes, but their result requires a lower bound on $\lambda_{\min}(\mathbb{E}[\phi(x,a)\phi(x,a)^{\top}])$ for all actions. Previous work suggests that advanced exploration is not needed under such assumptions (Bastani et al., 2017; Kannan et al., 2018; Raghavan et al., 2018), which considerably simplifies the problem. (It appears that exploration is still required for linear contextual bandits under our average eigenvalue assumption. Consider the case $d = 2$ and $\beta^* = (1/2, 1)$. Suppose there are four actions, and that at the first round $\phi(x,\cdot) = \{e_1, -e_1, e_2, -e_2\}$. We can ensure that with probability $1/2$, the first action played will be one of the first two. At this point a greedy strategy will always choose $-e_1$, but the average context distribution has minimum eigenvalue $1/2$.) As such, the result should be seen as complementary to our own. Whether model selection can be achieved without some type of eigenvalue condition is an important open question.

## 3 Model selection for linear contextual bandits

We now present our algorithm for model selection in linear contextual bandits, ModCB ("Model Selection for Contextual Bandits"). Pseudocode is displayed in Algorithm 1. The algorithm maintains an active policy class index $m \in [M]$, which it updates over the $T$ rounds starting from $m = 1$. The algorithm updates $m$ only when it can prove that $m < m^*$, which is achieved through a statistical test called EstimateResidual (Algorithm 2).

**Algorithm 1: ModCB (Model Selection for Contextual Bandits)**

- **input:** feature maps $\{\phi_m(\cdot,\cdot)\}_{m\in[M]}$, where $\phi_m(x,a) \in \mathbb{R}^{d_m}$, and time horizon $T \in \mathbb{N}$; subgaussian parameter $\tau > 0$; second moment parameter $\gamma > 0$; failure probability $\delta \in (0,1)$; exploration parameter $\epsilon \in (0,1)$; confidence parameters $C_1, C_2 > 0$.
- **definitions:** Define $\delta_0 = \delta/(10 M^2 T^2)$ and $\mu_t = (K t^{\epsilon})^{-1}$. Define $\Delta_{m,t} = C_1\,\frac{\tau_m^{6}\log^{2}(2d_m/\delta_0)}{\gamma^{8}}\cdot\frac{\sqrt{d_m\log(2/\delta_0)}}{\mu_t\,t}$ and $T_m^{\min} = \big(C_2\,\gamma^{-2}\,d_m\log(2/\delta_0) + \log(2/\delta_0)\big)^{\frac{1}{1-\epsilon}} + K + 1$.
- **initialization:** $m \leftarrow 1$.  // index of candidate policy class
- Exp4-IX$_1$ $\leftarrow$ Exp4-IX$(\Pi_1, T, \delta_0)$.  $S \leftarrow \{\}$.  // rounds at which uniform exploration takes place
- **for** $t = 1, \ldots, T$ **do**
  - Receive $x_t$.
  - **with probability** $1 - \mu_t$: feed $x_t$ into Exp4-IX$_m$ and take $a_t$ to be the predicted action; update Exp4-IX$_m$ with $(x_t, a_t, \ell_t(a_t))$.
  - **otherwise:** sample $a_t$ uniformly from $\mathcal{A}$ and let $S \leftarrow S \cup \{t\}$.
  - /* Test whether we should move on to a larger policy class. */
  - $\hat{\Sigma}_i \leftarrow \frac{1}{tK}\sum_{s=1}^{t}\sum_{a\in\mathcal{A}}\phi_i(x_s,a)\phi_i(x_s,a)^{\top}$ for each $i \geq m$.  // empirical second moment
  - $H_i \leftarrow \{(\phi_i(x_s, a_s), \ell_s(a_s))\}_{s\in S}$ for each $i > m$.
  - $\hat{\mathcal{E}}_{m,i} \leftarrow$ EstimateResidual$(H_i, \hat{\Sigma}_m, \hat{\Sigma}_i)$ for each $i > m$.  // gap estimate
  - **if** there exists $i > m$ such that $\hat{\mathcal{E}}_{m,i} \geq 2\Delta_{i,t}$ and $t \geq T_i^{\min}$ **then** let $m$ be the smallest such $i$ and re-initialize Exp4-IX$_m$ $\leftarrow$ Exp4-IX$(\Pi_m, T, \delta_0)$.

When $m$ is not being updated, the algorithm operates as if $m = m^*$ by running a low-regret contextual bandit algorithm with the policies induced by $\mathcal{F}_m$; we call this policy class $\Pi_m = \{x \mapsto \arg\min_{a\in\mathcal{A}}\langle\beta, \phi_m(x,a)\rangle : \|\beta\|_2 \leq 2/\gamma\}$. (The norm constraint $2/\gamma$ guarantees that $\Pi_m$ contains the parameter vectors arising from a certain square loss minimization problem under our assumption that $\|\beta^*\|_2 \leq 1$; see Proposition 20.) Note that the low-regret algorithm we run for $\Pi_m$ cannot be based on realizability, since $\mathcal{F}_m$ will not contain the true regressor $f^*$ until we reach $m^*$. This rules out the usual linear contextual bandit algorithms such as LinUCB. Instead we use a variant of Exp4-IX (Neu, 2015), which is an agnostic contextual bandit algorithm and does not depend on realizability. To deal with infinite classes, unbounded losses, and other technical issues, we require some simple modifications to Exp4-IX; pseudocode and analysis are deferred to Appendix C.3.

### 3.1 Key idea: Estimating prediction error with sublinear sample complexity

Before stating the main result, we elaborate on the statistical test (EstimateResidual) used in Algorithm 1. EstimateResidual estimates an upper bound on the gap between the best-in-class loss for two policy classes $\Pi_i$ and $\Pi_j$, which we define as $\Delta_{i,j} = L_i^* - L_j^*$, where $L_i^* = \min_{\pi\in\Pi_i} L(\pi)$ and $L(\pi) = \mathbb{E}_{(x,\ell)\sim\mathcal{D}}[\ell(\pi(x))]$. At each round, Algorithm 1 uses EstimateResidual to estimate the gap $\Delta_{m,i}$ for all $i > m$. If the estimated gap is sufficiently large for some $i$, the algorithm sets $m$ to the smallest such $i$ for the next round. This approach is based on the following observation: for all $m \geq m^*$, $L_m^* = L(\pi^*)$. Hence, if $\Delta_{m,i} > 0$, it must be the case that $m < m^*$, and we should move on to a larger class.

The key challenge is to estimate $\Delta_{m,i}$ while ensuring low regret. Naively, we could use uniform exploration and find a policy in $\Pi_i$ that has minimal empirical loss, which gives an estimate of $L_i^*$. Unfortunately, this requires tuning the exploration parameter in terms of $d_i$ and would compromise the regret if $m = m^*$. Similar tuning issues arise with other approaches and are discussed further in Appendix B.

**Algorithm 2: EstimateResidual**

- **input:** examples $\{(x_s, y_s)\}_{s=1}^{n}$, second moment matrix estimates $\hat{\Sigma} \in \mathbb{R}^{d\times d}$ and $\hat{\Sigma}_1 \in \mathbb{R}^{d_1\times d_1}$.
- Define $d_2 = d - d_1$ and $\hat{R} = I_d - \hat{\Sigma}\hat{D}$, where $\hat{D} = \begin{pmatrix} \hat{\Sigma}_1^{-1} & 0_{d_1\times d_2} \\ 0_{d_2\times d_1} & 0_{d_2\times d_2} \end{pmatrix}$.
- Return estimator $\hat{\mathcal{E}} := \frac{1}{n(n-1)}\sum_{s\neq t}\big\langle \hat{\Sigma}^{-1/2}\hat{R}\,x_s y_s,\; \hat{\Sigma}^{-1/2}\hat{R}\,x_t y_t \big\rangle$.

We do not estimate the gaps $\Delta_{i,j}$ directly, but instead estimate an upper bound motivated by the realizability assumption. For each $m$, define

$$\beta_m^* := \arg\min_{\beta\in\mathbb{R}^{d_m}} \frac{1}{K}\sum_{a\in\mathcal{A}}\mathbb{E}_{x\sim\mathcal{D}}\big(\langle\beta, \phi_m(x,a)\rangle - \ell(a)\big)^2, \qquad \mathcal{E}_{i,j} := \frac{1}{K}\sum_{a\in\mathcal{A}}\mathbb{E}_{x\sim\mathcal{D}}\big(\langle\beta_i^*, \phi_i(x,a)\rangle - \langle\beta_j^*, \phi_j(x,a)\rangle\big)^2. \tag{3}$$

We call $\mathcal{E}_{i,j}$ the square loss gap and we call $\Delta_{i,j}$ the policy gap. A key lemma driving these definitions is that the square loss gap upper bounds the policy gap.

**Lemma 1.** For all $i \in [M]$ and $j \geq m^*$, $\Delta_{i,j} \leq \sqrt{4K\,\mathcal{E}_{i,j}}$. Furthermore, if $i, j \geq m^*$ then $\mathcal{E}_{i,j} = 0$.

With realizability, the square loss gap behaves similar to the policy gap: it is non-zero whenever the latter is non-zero, and $m^*$ has zero gap to all $m \geq m^*$. Therefore, we seek estimators for the square loss gap $\mathcal{E}_{m,i}$ for $i > m$. Observe that $\mathcal{E}_{m,i}$ depends on the optimal predictors $\beta_m^*$ and $\beta_i^*$ in the two classes. A natural approach to estimate $\mathcal{E}_{m,i}$ is to solve regression problems over both classes to estimate the predictors, then plug them into the expression for $\mathcal{E}$; we call this the plug-in approach. As this relies on linear regression, it gives an $O(d_i/n)$ error rate for estimating $\mathcal{E}_{m,i}$ from $n$ uniform exploration samples. Unfortunately, since the error scales linearly with the size of the larger class, we must over-explore to ensure low error, and this compromises the regret if the smaller class is optimal.
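For concreteness, here is a minimal numpy sketch of the plug-in baseline just described: fit least squares over both classes and plug the estimates into the expression for the square loss gap. The function name and the use of an externally supplied second-moment estimate are illustrative choices, not part of the paper.

```python
import numpy as np

def plugin_gap_estimate(X, y, d1, Sigma_hat):
    """Naive plug-in estimate of the square loss gap between the full d-dimensional
    class and its restriction to the first d1 coordinates."""
    d = X.shape[1]
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)            # regression over the larger class
    beta_small, *_ = np.linalg.lstsq(X[:, :d1], y, rcond=None)   # regression over the smaller class
    diff = beta_full - np.concatenate([beta_small, np.zeros(d - d1)])
    # Plug the estimated predictors into E = E<beta* - beta*_(1), x>^2, using Sigma_hat for E[xx^T].
    return float(diff @ Sigma_hat @ diff)
```

Because both regressions must be accurate, the error of this estimate scales as $O(d_i/n)$, which is exactly the over-exploration problem described above; Algorithm 2 avoids fitting the predictors altogether.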
As a key technical contribution, we design more efficient estimators for the square loss gap $\mathcal{E}_{m,i}$. We work in the following slightly more general gap estimation setup: we receive pairs $\{(x_s, y_s)\}_{s=1}^{n}$ i.i.d. from a distribution $\mathcal{D}'$ over $\mathbb{R}^d \times \mathbb{R}$, where $x \sim \mathrm{subG}_d(\tau^2)$ and $y \sim \mathrm{subG}(\sigma^2)$. We partition the feature space into $x = (x^{(1)}, x^{(2)})$, where $x^{(1)} \in \mathbb{R}^{d_1}$, and define

$$\beta^* := \arg\min_{\beta\in\mathbb{R}^{d}} \mathbb{E}\big(\langle\beta, x\rangle - y\big)^2, \qquad \beta_{(1)}^* := \arg\min_{\beta\in\mathbb{R}^{d_1}} \mathbb{E}\big(\langle\beta, x^{(1)}\rangle - y\big)^2.$$

These are, respectively, the optimal linear predictor and the optimal linear predictor restricted to the first $d_1$ dimensions. (In Appendix D we show that $\beta_m^*$, and consequently $\mathcal{E}_{i,j}$, are always uniquely defined.) The square loss gap for the two predictors is defined as $\mathcal{E} = \mathbb{E}\big(\langle\beta^*, x\rangle - \langle\beta_{(1)}^*, x^{(1)}\rangle\big)^2$. Our problem of estimating $\mathcal{E}_{m,i}$ clearly falls into this general setup if we uniformly explore the actions for $n$ rounds, then set $\{x_s\}_{s=1}^{n}$ to be the features obtained through the feature map $\phi_i$ and $\{y_s\}_{s=1}^{n}$ to be the observed losses.

The pseudocode for our estimator EstimateResidual is displayed in Algorithm 2. In addition to the $n$ labeled samples, it takes as input two empirical second moment matrices $\hat{\Sigma}$ and $\hat{\Sigma}_1$ constructed via an extra set of $m$ i.i.d. unlabeled samples; these serve as estimates for $\Sigma := \mathbb{E}[xx^{\top}]$ and $\Sigma_1 := \mathbb{E}[x^{(1)}x^{(1)\top}]$. The intuition is that one can write the square loss gap as $\mathcal{E} = \|\Sigma^{-1/2}R\,\mathbb{E}[xy]\|_2^2$, where $R \in \mathbb{R}^{d\times d}$ is a certain function of $\Sigma$ and $\Sigma_1$. EstimateResidual simply replaces the second moment matrices with their empirical counterparts and then uses the labeled examples to estimate the weighted norm of $\mathbb{E}[xy]$ through a U-statistic. The main guarantee for the estimator is as follows.

**Theorem 2.** Suppose we take $\hat{\Sigma}$ and $\hat{\Sigma}_1$ to be the empirical second moment matrices formed from $m$ i.i.d. unlabeled samples. Then when $m \geq C(d + \log(2/\delta))\,\tau^4/\lambda_{\min}(\Sigma)$, EstimateResidual, given $n$ labeled samples, guarantees that with probability at least $1 - \delta$,

$$\hat{\mathcal{E}} \leq 2\mathcal{E} + O\!\left(\frac{\sigma^2\tau^4}{\lambda_{\min}(\Sigma)}\cdot\frac{d^{1/2}\log^2(2d/\delta)}{n} + \frac{\sigma^2\tau^4}{\lambda_{\min}(\Sigma)}\cdot\frac{d\log(2/\delta)}{m}\right).$$

To compare with the plug-in approach, we focus on the dependence between $d$ and $n$. When EstimateResidual is applied within ModCB we have plenty of unlabeled data, so the dependence on $m$ is less important. The dominant term in Theorem 2 is $O(\sqrt{d}/n)$, a significant improvement over the $O(d/n)$ rate for the plug-in estimator. In particular, the dependence on the larger ambient dimension is much milder: we can achieve constant error with $n \ll d$, or in other words the estimator has sublinear sample complexity. This property is crucial for our model selection result.

The result generalizes and is inspired by the variance estimation method of Dicker (Dicker, 2014; Kong and Valiant, 2018), which obtains a rate of $O\big(\frac{\sqrt{d}}{n} + \frac{1}{\sqrt{n}}\big)$ to estimate the optimal square loss $\min_{\beta\in\mathbb{R}^d}\mathbb{E}(\langle\beta, x\rangle - y)^2$ when the second moments are known. By estimating the square loss gap instead of the loss itself, we avoid the $1/\sqrt{n}$ term, which is critical for achieving $O(T^{2/3}d_{m^*}^{1/3})$ regret.
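To make the estimator concrete, here is a minimal numpy sketch of Algorithm 2 as described above. The construction of $\hat{R}$, the symmetric inverse square root, and the pair normalization follow the description in the text; treat the exact constants and any regularization as assumptions of this sketch rather than the paper's precise implementation.

```python
import numpy as np

def estimate_residual(X, y, Sigma_hat, Sigma1_hat):
    """U-statistic estimate of the square loss gap between the full d-dimensional class
    and its restriction to the first d1 coordinates (sketch of Algorithm 2)."""
    n, d = X.shape
    d1 = Sigma1_hat.shape[0]
    # D = [[Sigma1_hat^{-1}, 0], [0, 0]] maps E[xy] to the (padded) restricted least-squares solution.
    D = np.zeros((d, d))
    D[:d1, :d1] = np.linalg.inv(Sigma1_hat)
    R = np.eye(d) - Sigma_hat @ D
    # Symmetric inverse square root of Sigma_hat (assumes Sigma_hat is well conditioned).
    evals, evecs = np.linalg.eigh(Sigma_hat)
    Sigma_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (Sigma_inv_half @ R @ (X * y[:, None]).T).T   # row s holds Sigma^{-1/2} R x_s y_s
    # Average <z_s, z_t> over distinct pairs s != t: unbiased for ||Sigma^{-1/2} R E[xy]||^2.
    total = Z.sum(axis=0)
    pair_sum = total @ total - np.einsum('ij,ij->', Z, Z)
    return pair_sum / (n * (n - 1))
```

In contrast to the plug-in sketch above, no regression is fit on the labeled samples, which is what makes the $O(\sqrt{d}/n)$ rate in Theorem 2 possible.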
### 3.2 Main result

Equipped with EstimateResidual, we can now sketch the approach behind ModCB in a bit more detail. Recall that the algorithm maintains an index $m$ denoting the current guess for $m^*$. We run Exp4-IX over the induced policy class $\Pi_m$, mixing in some additional uniform exploration (with probability $\mu_t$ at round $t$). We use all of the data to estimate the second moment matrices of all classes, and we pass only the exploration data into the subroutine EstimateResidual. We check if there exists some $i > m$ such that the estimated gap satisfies $\hat{\mathcal{E}}_{m,i} \geq 2\Delta_{i,t}$ and $t \geq T_i^{\min}$, which based on the deviation bound in Theorem 2 implies that $\mathcal{E}_{m,i} > 0$ and thus $m < m^*$. If this is the case, we advance $m$ to the smallest such $i$, and if not, we continue with our current guess. This leads to the following guarantee.

**Theorem 3.** When $C_1$ and $C_2$ are sufficiently large absolute constants and $\epsilon = 1/3$, ModCB guarantees that with probability at least $1 - \delta$,

$$\mathrm{Reg} \leq O\!\left(\frac{1}{\gamma^{3}}\,(Tm^*)^{2/3}(Kd_{m^*})^{1/3}\log(2/\delta)\right). \tag{6}$$

When $\epsilon = 1/4$, ModCB guarantees that with probability at least $1 - \delta$,

$$\mathrm{Reg} \leq O\!\left(\frac{1}{\gamma^{2}}\,K^{1/4}(Tm^*)^{3/4}\log(2/\delta) + \frac{1}{\gamma^{4}}\sqrt{K\,(Tm^*)\,d_{m^*}}\,\log(2/\delta)\right). \tag{7}$$

A few remarks are in order.

- The two stated bounds are incomparable in general. Tracking only dependence on $T$ and $d_{m^*}$, the first is $O(T^{2/3}d_{m^*}^{1/3})$ while the latter is $O(T^{3/4} + \sqrt{Td_{m^*}})$. The former is better when $d_{m^*} \leq T^{1/4}$. There is a more general Pareto frontier that can be explored by choosing $\epsilon \in [1/4, 1/3]$, but no choice for $\epsilon$ dominates the others for all values of $d_{m^*}$.
- Recall that had we known $d_{m^*}$ in advance, we could simply run LinUCB to achieve $O(\sqrt{Td_{m^*}})$ regret. The bound (7) matches this oracle rate when $d_{m^*} > \sqrt{T}$, but otherwise our guarantee is slightly worse than the oracle rate. Nevertheless, both bounds are $o(T)$ whenever the oracle rate is $o(T)$ (that is, when $d_{m^*} = o(T)$), so the algorithm has sublinear regret whenever the optimal model class is learnable. It remains open whether there is a model selection algorithm that can match the oracle rate for all values of $d_{m^*}$ simultaneously.
- We have not optimized dependence on the condition number $\gamma$ or the logarithmic factors.
- If the individual distribution parameters $\{\tau_m\}_{m\in[M]}$ and $\{\gamma_m\}_{m\in[M]}$ are known, the algorithm can be modified slightly so that regret scales in terms of $\tau_{m^*}$ and $\gamma_{m^*}^{-1}$. However, the current model, in which we assume access only to uniform upper and lower bounds on these parameters, is more realistic.
- Additionally, Appendix A contains a validation experiment, where we demonstrate that ModCB compares favorably to LinUCB in simulations.

**Improving the dependence on $m^*$.** Theorem 3 obtains the desired model selection guarantee for linear classes, but the bound includes a polynomial dependence on the optimal index $m^*$. This contrasts the logarithmic dependence on $m^*$ that can be obtained through structural risk minimization in statistical learning (Vapnik, 1992). However, this $\mathrm{poly}(m^*)$ dependence can be replaced by a single $\log(T)$ factor with a simple preprocessing step: Given feature maps $\{\phi_m(\cdot,\cdot)\}_{m\in[M]}$, we construct a new collection of maps $\{\phi'_m(\cdot,\cdot)\}_{m\in[M']}$, where $M' \leq \log T$, as follows. First, for $i = 1,\ldots,\log T$, take $\phi'_i$ to be the largest feature map $\phi_m$ for which $d_m \leq e^i$. Second, remove any duplicates. This preprocessing reduces the number of feature maps to at most $\log(T)$ while ensuring that a map of dimension $O(d_{m^*})$ that contains $\phi_{m^*}$ is always available. Specifically, the preprocessing step yields the following improved regret bounds.

**Theorem 4.** ModCB with preprocessing guarantees that with probability at least $1 - \delta$,

$$\mathrm{Reg} \leq O\!\left(\frac{1}{\gamma^{3}}\,T^{2/3}(Kd_{m^*})^{1/3}\log(2/\delta)\right) \text{ for } \epsilon = 1/3, \qquad \mathrm{Reg} \leq O\!\left(\frac{1}{\gamma^{2}}\,K^{1/4}T^{3/4}\log(2/\delta) + \frac{1}{\gamma^{4}}\sqrt{KTd_{m^*}}\,\log(2/\delta)\right) \text{ for } \epsilon = 1/4.$$
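The preprocessing step is simple to state in code. Here is a small sketch under the assumption that the feature maps are represented by their sorted dimensions $d_1 \leq \cdots \leq d_M$ and that what we need are indices into the original list; the function name is illustrative.

```python
import math

def preprocess_feature_maps(dims, T):
    """Keep at most log T feature maps on an exponential grid of dimensions:
    for each i = 1, ..., log T, retain the largest map with dimension <= e^i,
    then drop duplicates. `dims` is the sorted list [d_1, ..., d_M]."""
    kept = []
    for i in range(1, math.ceil(math.log(T)) + 1):
        candidates = [m for m, d in enumerate(dims) if d <= math.e ** i]
        if candidates:
            kept.append(max(candidates))          # largest map with d_m <= e^i
    return sorted(set(kept))                      # remove duplicates, keep nested order

# Example: preprocess_feature_maps([2, 3, 5, 8, 20, 100, 400], T=10000)
```

Because the retained dimensions grow geometrically, the smallest retained map that contains $\phi_{m^*}$ has dimension at most $e \cdot d_{m^*}$, which is where the $O(d_{m^*})$ guarantee in the preprocessing paragraph comes from.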
**Non-nested feature maps.** As a final variant, we note that the algorithm easily extends to the case where feature maps are not nested. Suppose we have non-nested feature maps $\{\phi_m\}_{m\in[M]}$, where $d_1 \leq d_2 \leq \cdots \leq d_M$; note that the inequality is no longer strict. In this case, we can obtain a nested collection by concatenating $\phi_1,\ldots,\phi_{m-1}$ to the map $\phi_m$ for each $m$. This process increases the dimension of the optimal map from $d_{m^*}$ to at most $d_{m^*}m^*$, so we have the following corollary.

**Corollary 5.** For non-nested feature maps, ModCB with preprocessing guarantees that with probability at least $1 - \delta$,

$$\mathrm{Reg} \leq O\!\left(\frac{1}{\gamma^{3}}\,T^{2/3}(Kd_{m^*}m^*)^{1/3}\log(2/\delta)\right) \text{ for } \epsilon = 1/3, \qquad \mathrm{Reg} \leq O\!\left(\frac{1}{\gamma^{2}}\,K^{1/4}T^{3/4}\log(2/\delta) + \frac{1}{\gamma^{4}}\sqrt{KTd_{m^*}m^*}\,\log(2/\delta)\right) \text{ for } \epsilon = 1/4.$$

### 3.3 Proof sketch

We now sketch the proof of the main theorem, with the full proof deferred to Appendix D. We focus on the case where there are just two classes, so $M = 2$. We only track dependence on $T$ and $d_m$, as this preserves the relevant details but simplifies the argument.

The analysis has two cases depending on whether $f^*$ belongs to $\mathcal{F}_1$ or $\mathcal{F}_2$. First, if $f^* \in \mathcal{F}_1$ then by Lemma 1 we have that $\mathcal{E}_{1,2} = 0$. Further, via Theorem 2, we can guarantee that we never advance to $m = 2$ with high probability. The result then follows from the regret guarantee for Exp4-IX using policy class $\Pi_1$, and by accounting for uniform exploration.

The more challenging case is when $f^* \in \mathcal{F}_2$. Let $\hat{T}$ denote the first round where $m = 2$, or $T$ if the algorithm never advances. We can bound regret as

$$\mathrm{Reg} \leq O\big(T^{1-\epsilon}\big) + O\big(\sqrt{\hat{T}d_1}\big) + \hat{T}\cdot\Delta_{1,2} + O\big(\sqrt{Td_2}\big).$$

The four terms correspond to: (1) uniform exploration with probability $\mu_t \propto t^{-\epsilon}$ in round $t$, (2) the Exp4-IX regret bound for class $\Pi_1$ until time $\hat{T}$, (3) the policy gap between the best policy in $\Pi_1$ and the optimal policy $\pi^* \in \Pi_2$, and (4) the Exp4-IX bound over class $\Pi_2$ until time $T$. The two regret bounds (the second and fourth terms) clearly contribute $O(\sqrt{Td_2})$ to the overall regret, and the first term is controlled by our choice of $\epsilon \in \{1/4, 1/3\}$. We are left to bound the third term.

For this term, observe that in round $\hat{T}-1$, since the algorithm did not advance, we must have $\hat{\mathcal{E}}_{1,2} \leq 2\Delta_{2,\hat{T}-1}$. Appealing to Theorem 2, this implies that $\mathcal{E}_{1,2} \leq O(\hat{\mathcal{E}}_{1,2} + \Delta_{2,\hat{T}-1}) \leq O(\Delta_{2,\hat{T}-1})$. Plugging in the definition of $\Delta_{2,t}$ and applying Lemma 1, this gives

$$\hat{T}\cdot\Delta_{1,2} \leq O\big(\hat{T}\sqrt{K\,\Delta_{2,\hat{T}-1}}\big) = O\big(\sqrt{K}\,T^{(1+\epsilon)/2}d_2^{1/4}\big). \tag{8}$$

This establishes the result. In particular, with $\epsilon = 1/3$ we obtain $O(T^{2/3} + \sqrt{Td_2} + T^{2/3}d_2^{1/4}) \leq O(T^{2/3}d_2^{1/3})$ regret, and with $\epsilon = 1/4$ we obtain $O(T^{3/4} + \sqrt{Td_2} + T^{5/8}d_2^{1/4}) = O(T^{3/4} + \sqrt{Td_2})$. (Note that if $d_2 \leq \sqrt{T}$ then $T^{5/8}d_2^{1/4} \leq T^{3/4}$, while if $d_2 \geq \sqrt{T}$ then $T^{5/8}d_2^{1/4} \leq \sqrt{Td_2}$.)

The sublinear estimation rate for EstimateResidual (Theorem 2) plays a critical role in this argument. With the $O(d/n)$ rate for the naive plug-in estimator, we could at best set $\Delta_{2,t} = d_2/t^{1-\epsilon}$, but this would degrade the dimension dependence in (8) from $d_2^{1/4}$ to $d_2^{1/2}$. Unfortunately, this results in $T^{1-\epsilon} + \sqrt{d_2}\,T^{(1+\epsilon)/2}$ regret, which is not a meaningful model selection result, since there is no choice for $\epsilon \in (0,1)$ for which the exponents on $d_2$ and $T$ sum to one for both terms.

## 4 Discussion

This paper initiates the study of model selection tradeoffs in contextual bandits. We provide the first model selection algorithm for the linear contextual bandit setting, which we show achieves $O(T^{2/3}d_{m^*}^{1/3})$ regret when the optimal model is a $d_{m^*}$-dimensional linear function. This is the first contextual bandit algorithm that adapts to the complexity of the optimal policy class with no prior knowledge, and it raises a number of intriguing questions:

1. Is it possible to achieve $O(\sqrt{Td_{m^*}})$ regret in our setup? This would show that the price for model selection is negligible.
2. Is it possible to generalize our results beyond linear classes? Specifically, given regressor classes $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_M$ and assuming the optimal model $f^*$ belongs to $\mathcal{F}_{m^*}$ for some unknown $m^*$, is there a contextual bandit algorithm that achieves $O(T^{\alpha}\,\mathrm{comp}^{1-\alpha}(\mathcal{F}_{m^*}))$ regret for some $\alpha \in [1/2, 1)$? More ambitiously, can we achieve $O(\sqrt{T\,\mathrm{comp}(\mathcal{F}_{m^*})})$?
3. We have conducted a validation experiment with ModCB (see Appendix A), demonstrating that the algorithm performs favorably in simulations. While this synthetic experiment is encouraging, ModCB may not be immediately useful for practical deployments for several reasons, including its reliance on linear realizability and its computational complexity. Are there more robust algorithmic principles for theoretically-sound and practically-effective model selection in contextual bandits?
4. For what classes $\mathcal{F}$ can we estimate the loss at a sublinear rate, and is this necessary for contextual bandit model selection? Any sublinear guarantee will lead to non-trivial model selection guarantees through a strategy similar to ModCB. Interestingly, it is already known that for certain (e.g., sparse linear) classes, sublinear loss estimation is not possible (Verzelen and Gassiat, 2018). On the other hand, positive results are available for certain nonparametric classes (Brown et al., 2007; Wang et al., 2008).

Model selection is instrumental to the success of ML deployments, yet few positive results exist for partial feedback settings. We believe these questions are technically challenging and practically important, and we are hopeful that positive results of the type we provide here will extend to interactive learning settings beyond contextual bandits.

**Acknowledgements.** We thank Ruihao Zhu for working with us at the early stages of this project, and for many helpful discussions. AK is supported by NSF IIS-1763618. HL is supported by NSF IIS-1755781.

## References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011.

Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert E. Schapire. Contextual bandit learning with predictable rewards. In International Conference on Artificial Intelligence and Statistics, 2012.

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.

Alekh Agarwal, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E. Schapire. Open problem: First-order regret bounds for contextual bandits. In Conference on Learning Theory, 2017a.

Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E. Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, 2017b.

Zeyuan Allen-Zhu, Sébastien Bubeck, and Yuanzhi Li. Make the minority great again: First-order regret bound for contextual bandits. In International Conference on Machine Learning, 2018.

Chamy Allenberg, Peter Auer, László Györfi, and György Ottucsák. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In International Conference on Algorithmic Learning Theory. Springer, 2006.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.

Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best arm with an unknown number of distribution changes. In European Workshop on Reinforcement Learning, 2018.

Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 2002.

Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. arXiv:1704.09011, 2017.
Shai Ben-David, Nicolo Cesa-Bianchi, David Haussler, and Philip M. Long. Characterizations of learnability for classes of (0,...,n)-valued functions. Journal of Computer and System Sciences, 1995.

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, 2011.

Lucien Birgé and Pascal Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 1998.

Lawrence D. Brown, Michael Levine, et al. Variance estimation in nonparametric regression via the difference sequence method. The Annals of Statistics, 2007.

Niladri S. Chatterji, Vidya Muthukumar, and Peter L. Bartlett. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. arXiv:1905.10040, 2019.

Yifang Chen, Chung-Wei Lee, Haipeng Luo, and Chen-Yu Wei. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. In Conference on Learning Theory, 2019.

Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In International Conference on Artificial Intelligence and Statistics, 2019.

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 2011.

Ashok Cutkosky and Kwabena A. Boahen. Online learning without prior information. In Conference on Learning Theory, 2017.

Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. The Journal of Machine Learning Research, 2015.

Victor de la Peña and Evarist Giné. Decoupling: From Dependence to Independence. Springer, 1998.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

Lee H. Dicker. Variance estimation in high-dimensional linear models. Biometrika, 2014.

Dylan J. Foster, Satyen Kale, Mehryar Mohri, and Karthik Sridharan. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, 2017.

Dylan J. Foster, Alekh Agarwal, Miroslav Dudik, Haipeng Luo, and Robert Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, 2018.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.

Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, 2014.

David Haussler and Philip M. Long. A generalization of Sauer's lemma. Journal of Combinatorial Theory, Series A, 1995.

Sampath Kannan, Jamie H. Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Advances in Neural Information Processing Systems, 2018.

Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 2001.

Weihao Kong and Gregory Valiant. Estimating learnability in the sublinear data regime. In Advances in Neural Information Processing Systems, 2018.

Wouter M. Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, 2015.

Akshay Krishnamurthy, Alekh Agarwal, and Miro Dudik. Contextual semibandits via supervised learning oracles. In Advances in Neural Information Processing Systems, 2016.
Akshay Krishnamurthy, Zhiwei Steven Wu, and Vasilis Syrgkanis. Semiparametric contextual bandits. In International Conference on Machine Learning, 2018.

Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. In Conference on Learning Theory, 2019.

John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.

Tor Lattimore. The Pareto regret frontier for bandits. In Advances in Neural Information Processing Systems, 2015.

Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, 2017.

Andrea Locatelli and Alexandra Carpentier. Adaptivity to smoothness in x-armed bandits. In Conference on Learning Theory, 2018.

Gábor Lugosi and Andrew B. Nobel. Adaptive model selection using empirical complexities. Annals of Statistics, 1999.

Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, 2015.

Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. Efficient contextual bandits in non-stationary worlds. In Conference on Learning Theory, 2018.

Thodoris Lykouris, Karthik Sridharan, and Éva Tardos. Small-loss bounds for online learning with partial information. In Conference on Learning Theory, 2018.

Pascal Massart. Concentration inequalities and model selection. Springer, 2007.

Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems, 2013.

Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, 2015.

Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, 2014.

Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. The externalities of exploration and how data diversity helps exploitation. In Conference on Learning Theory, 2018.

Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, 2013.

John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 1998.

Vladimir Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, 1992.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. Cambridge University Press, 2012.

Nicolas Verzelen and Elisabeth Gassiat. Adaptive estimation of high-dimensional signal-to-noise ratios. Bernoulli, 2018.

Lie Wang, Lawrence D. Brown, T. Tony Cai, Michael Levine, et al. Effect of mean on variance function estimation in nonparametric regression. The Annals of Statistics, 2008.