# Model Selection in Contextual Stochastic Bandit

Aldo Pacchiano (UC Berkeley), My Phan (University of Massachusetts), Yasin Abbasi-Yadkori, Julian Zimmert (Google Research), Tor Lattimore, Csaba Szepesvári (DeepMind and University of Alberta)

*Equal contribution.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

We study bandit model selection in stochastic environments. Our approach relies on a master algorithm that selects between candidate base algorithms. We develop a master-base algorithm abstraction that can work with general classes of base algorithms and different types of adversarial master algorithms. Our methods rely on a novel and generic smoothing transformation for bandit algorithms that permits us to obtain optimal $O(\sqrt{T})$ model selection guarantees for stochastic contextual bandit problems, as long as the optimal base algorithm satisfies a high-probability regret guarantee. We show through a lower bound that, even when one of the base algorithms has $O(\log T)$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret in model selection, even asymptotically. Using our techniques, we address model selection in a variety of problems such as misspecified linear contextual bandits [13], linear bandits with unknown dimension [8] and reinforcement learning with unknown feature maps. Our algorithm requires knowledge of the optimal base regret to adjust the master learning rate. We show that without such prior knowledge any master can suffer a regret larger than the optimal base regret.

1 Introduction

In a bandit model selection problem, given a set of base algorithms, the learner aims to adapt in an online fashion to the base that is most suitable for the current environment. Maillard and Munos [15] are perhaps the first to address the bandit model-selection problem, with a variant of the EXP4 master algorithm that works with UCB or EXP3 base algorithms. These results were improved by Agarwal et al. [2], who combine the base algorithms using an online mirror descent master (CORRAL) that sends importance-weighted rewards to the base algorithms, thus requiring each base algorithm to be individually modified to be compatible with the master. For example, to use UCB as a base, we would need to manually re-derive UCB's confidence interval and modify its regret analysis to be compatible with importance-weighted feedback. Instead, we introduce a generic smoothing wrapper that can be applied to base algorithms without substantial modification.

There are works on model selection in settings such as linear bandits with unknown dimension or structure [8, 6]. Apart from strong assumptions, those works are limited to a specific model-selection problem. A general and efficient method to combine multiple base algorithms is missing.

Contributions. We focus on bandit model selection in stochastic environments. Our contributions are as follows:

- We introduce a general "smoothing" wrapper so that any contextual base algorithm can be made compatible with the CORRAL [2] and EXP3.P [5] masters. This is more general than the approach of Agarwal et al. [2], where each base algorithm needs to be individually modified to satisfy a certain stability condition. Our modification of the CORRAL algorithm has another important difference: instead of importance-weighted feedback, the original rewards are sent to the base algorithms.
The resulting model selection strategy can be readily used with almost any base algorithm developed for stochastic environments. When the optimal base regret is known, the CORRAL master achieves optimal regret guarantees. Under certain conditions, when the optimal base regret is unknown, EXP3.P can achieve better performance.
- We demonstrate the generality and effectiveness of our method by showing how it seamlessly improves existing results or addresses open questions in a variety of problems. We show applications in adapting to the misspecification level in contextual linear bandits [13], adapting to the unknown dimension in nested linear bandit classes [8], tuning the data-dependent exploration rate of bandit algorithms, and choosing feature maps in reinforcement learning. Moreover, our master algorithm can simultaneously perform different types of model selection. For example, we show how to choose both the unknown dimension and the unknown misspecification error at the same time. This is in contrast to algorithms that specialize in a specific type of model selection, such as detecting the unknown dimension [8].
- In the stochastic domain, an important question is whether a model selection procedure can inherit the $O(\log T)$ regret of a fast stochastic base algorithm. We show a lower bound for the model selection problem that scales as $\Omega(\sqrt{T})$, which implies that our result is minimax optimal.
- Our master algorithm requires knowledge of the best base's regret to achieve the same regret. We show that this condition is unavoidable in general: there are problems where the regret of the best base scales as $O(T^x)$ for an unknown $x$, while the regret of any master algorithm scales as $\Omega(T^y)$ for some $y > x$.

2 Problem statement

Let $\delta_a$ denote the delta distribution at $a$. For an integer $n$, we use $[n]$ to denote the set $\{1, 2, \dots, n\}$. We consider contextual stochastic bandit problems. Let $\mathcal{A} \subseteq \mathbb{R}^d$ be a set of actions, let $\mathcal{S}$ be the set of all subsets of $\mathcal{A}$, and let $D_{\mathcal{S}}$ be a distribution over $\mathcal{S}$. At time $t$, the learner observes an action set $A_t$ (which could be infinite) sampled from $D_{\mathcal{S}}$. The learner chooses a policy $\pi_t$, which takes an action set $X \in \mathcal{S}$ as input and outputs a distribution over $X$. The learner selects action $a_t \sim \pi_t(A_t)$ and receives a reward $r_t = f(A_t, \delta_{a_t}) + \xi_t$, where $\xi_t$ is 1-sub-Gaussian random noise and $f(X, \pi)$ denotes the expected reward of applying policy $\pi$ to action set $X$.

The fixed-action case $A_t = \mathcal{A}$ and the linear contextual bandit problem with IID contexts are special cases of this setting. For linear contextual bandits, there are $k$ actions and a linearly parameterized policy $\pi_\theta$ maps from the space of $d \times k$ matrices to $[k]$: in round $t$, given context $x_t \in \mathbb{R}^{d \times k}$, $\pi_\theta(x_t) = \arg\max_{i \in [k]} x_{t,i}^\top \theta$, where $x_{t,i}$ denotes the $i$-th column of $x_t$. Letting $i_t = \pi_\theta(x_t)$, the reward is given by $r_t = x_{t,i_t}^\top \theta^\star + \xi_t$, where $\theta^\star \in \mathbb{R}^d$ is an unknown parameter vector.

We are interested in designing an algorithm with small regret, defined as
$$R(T) = \sum_{t=1}^{T} f(A_t, \pi^\star) - f(A_t, \pi_t), \qquad \text{where } \pi^\star = \arg\max_{\pi}\, \mathbb{E}_{A \sim D_{\mathcal{S}}}[f(A, \pi)]. \tag{1}$$

We assume there are $M$ candidate base algorithms and a master algorithm $\mathcal{M}$ that selects one of the base algorithms in each round. Let $\{p^i_1, \dots, p^i_T\}$ be the (random) probabilities that $\mathcal{M}$ chooses base $i$ during the game, and let $\underline{p}^{\,i} = \min_t p^i_t$. If base $B_i$ satisfies a high-probability regret bound $U_i(T, \delta)$ when played in an environment $\mathcal{E}$, we call $\mathcal{E}$ a $U_i$-compatible environment for $B_i$. We do not require $\mathcal{E}$ to be $U_i$-compatible with respect to all input base algorithms. We want to design a master algorithm whose regret satisfies $R(T) \le \tilde O(U_{i^\star}(T, \delta))$; that is, we are interested in competing with the best performing compatible base $i^\star$. For the rest of the paper we use $i^\star$ to denote this optimal base.
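To make the protocol concrete, the following is a minimal Python sketch of a stochastic contextual environment and of the cumulative pseudo-regret $R(T)$ defined above. The linear reward structure, the uniform sampling of action sets, and all names are illustrative assumptions, not part of the paper.

```python
import numpy as np

class LinearContextualEnv:
    """Toy instance of the Section 2 protocol: each round a finite action set
    (rows = feature vectors in R^d) is drawn IID from D_S, and the reward of the
    chosen action is linear in an unknown theta* plus 1-sub-Gaussian noise."""

    def __init__(self, theta_star, k=10, rng=None):
        self.theta_star = np.asarray(theta_star, dtype=float)  # unknown theta*
        self.d = self.theta_star.shape[0]
        self.k = k
        self.rng = rng if rng is not None else np.random.default_rng(0)

    def sample_action_set(self):
        # A_t ~ D_S: here, k feature vectors drawn uniformly from [0, 1]^d.
        return self.rng.uniform(0.0, 1.0, size=(self.k, self.d))

    def pull(self, action_set, arm):
        # r_t = f(A_t, delta_{a_t}) + xi_t with xi_t ~ N(0, 1).
        return action_set[arm] @ self.theta_star + self.rng.normal()

    def pseudo_regret(self, action_set, arm):
        # f(A_t, pi*) - f(A_t, delta_{a_t}) for this round.
        means = action_set @ self.theta_star
        return float(means.max() - means[arm])


def run(env, policy, T):
    """Play `policy` (a callable: action set -> arm index) for T rounds and
    return its cumulative pseudo-regret R(T)."""
    total = 0.0
    for _ in range(T):
        A_t = env.sample_action_set()
        a_t = policy(A_t)
        _ = env.pull(A_t, a_t)          # observed noisy reward
        total += env.pseudo_regret(A_t, a_t)
    return total
```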
3 Main results

We consider the abstraction in Algorithms 1 and 2. Base algorithms only play their chosen action, receive rewards and update their policy when selected by the master. Each base algorithm keeps a counter $s$ tracking the number of times it has been invoked. For any base algorithm $B_j$, $\pi_{s,j}$ is the policy $B_j$ uses at state $s$. Let $s_{t,j}$ denote the state of base $j$ at time $t$. If $t_1 < t_2$ are two consecutive times at which base $j$ is chosen by the master, then the base uses policy $\pi_{s_{t_1,j},\,j}$ at time $t_1$ and policy $\pi_{s_{t_2,j},\,j}$ at times $t_1 + 1, \dots, t_2$, where $s_{t_2,j} = s_{t_1,j} + 1$.

Algorithm 1 Master Algorithm
Input: base algorithms $\{B_j\}_{j=1}^M$
for $t = 1, \dots, T$ do
  Choose base $j_t$; play base $j_t$ and receive feedback $r_t = r_{t,j_t}$ from $B_{j_t}$
  Update the master using $r_t$
end for

Algorithm 2 Base Algorithm $B_j$
Initialize state counter $s = 1$
for $t = 1, \dots, T$ do
  Receive action set $A_t \sim D_{\mathcal{S}}$
  Choose action $a_{t,j} \sim \pi_{s,j}(A_t)$
  if selected by the master then
    Play action $a_{t,j}$
    Receive feedback $r_{t,j} = f(A_t, \delta_{a_{t,j}}) + \xi_t$
    Send $r_{t,j}$ to the master
    Compute $\pi_{s+1,j}$ using $r_{t,j}$
    $s \leftarrow s + 1$
  end if
end for

To analyze the regret of the master with respect to the optimal base $B_{i^\star}$, we add and subtract the terms $\{f(A_t, \pi_{s_{t,i^\star},\,i^\star})\}_{t=1}^T$ and use a regret decomposition similar to the one used by Agarwal et al. [2]:
$$\underbrace{\sum_{t=1}^{T} f(A_t, \pi^\star) - f(A_t, \pi_t)}_{R(T)} \;=\; \underbrace{\sum_{t=1}^{T} f(A_t, \pi_{s_{t,i^\star},\,i^\star}) - f(A_t, \pi_t)}_{\mathrm{I}} \;+\; \underbrace{\sum_{t=1}^{T} f(A_t, \pi^\star) - f(A_t, \pi_{s_{t,i^\star},\,i^\star})}_{\mathrm{II}}. \tag{2}$$

Term I is the regret of the master with respect to the optimal base, and Term II is the regret of the optimal base with respect to the optimal policy $\pi^\star$. The analysis of Term I is largely based on existing analyses of CORRAL and EXP3.P. To bound Term II, we provide a smoothing transformation (Section 5, Algorithm 3) that converts any base algorithm with high-probability cumulative regret bound $U(T, \delta)$ into one with high-probability instantaneous regret bound $u_t = U(t, \delta)/t$ at time $t$, which is non-increasing if $U(\cdot, \delta)$ is concave. Since $u_t$ is non-increasing, Term II is largest when base $i^\star$ is selected the least often ($p^{i^\star}_t = \underline{p}^{\,i^\star}$ for all $t$). In this case base $i^\star$ is played roughly $T\underline{p}^{\,i^\star}$ times and repeats its decisions over intervals of length $1/\underline{p}^{\,i^\star}$, resulting in the following bound:

Lemma 3.1 (informal). If the regret of the optimal base is bounded by $U_{i^\star}(T, \delta)$ with probability at least $1 - \delta$ when it runs alone, then
$$\mathbb{E}[\mathrm{II}] \;\le\; O\!\left(\frac{1}{\underline{p}^{\,i^\star}}\, U_{i^\star}\!\big(T\underline{p}^{\,i^\star}, \delta\big)\log T \;+\; \delta T(\log T + 1)\right).$$

We demonstrate the effectiveness of our smoothing transformation by deriving regret bounds with two master algorithms: CORRAL (introduced by Agarwal et al. [2] and reproduced in Appendix B) and EXP3.P (Theorem 3.3 in [5]), a simple algorithm that ensures each base is picked with at least a (horizon-dependent) constant probability $p$.

Theorem 3.2 (informal version of Theorem 5.3). Suppose $U_{i^\star}(T, \delta) = O(c(\delta)\,T^\alpha)$ for some function $c : \mathbb{R} \to \mathbb{R}$ and constant $\alpha \in [1/2, 1)$. Ignoring the dependence on $M$: when $\alpha$ and $c(\delta)$ are both known, CORRAL achieves the optimal regret $\tilde O(T^\alpha c(\delta))$; when only $\alpha$ is known, CORRAL achieves $\tilde O(T^\alpha c(\delta)^{1/\alpha})$, while EXP3.P achieves $\tilde O\big(T^{\frac{1}{2-\alpha}}(1 + c(\delta))\big)$ without knowledge of $c(\delta)$.

Thus CORRAL has optimal regret when $\alpha$ and $c(\delta)$ are known. However, when $c(\delta)$ is unknown and $c(\delta) \ge T^{\frac{\alpha(1-\alpha)}{2-\alpha}}$ (which is $T^{1/6}$ if $\alpha = 1/2$ and $T^{2/15}$ if $\alpha = 1/3$), EXP3.P has the better regret, because then $\tilde O(T^{\frac{1}{2-\alpha}} c(\delta)) \le \tilde O(T^\alpha c(\delta)^{1/\alpha})$.

Lower bounds. Theorem 6.2 shows that if the regret of the best base is $O(T^x)$, in the worst case a master algorithm that does not know $x$ can have regret $\Omega(T^y)$ with $y > x$. Theorem 6.1 shows that in general it is impossible for any master algorithm to achieve a regret better than $\Omega(\sqrt{T})$, even when the best base has regret $O(\log T)$. When the regret of the best base is $O(\sqrt{T})$, CORRAL with our smoothing therefore achieves the optimal $\tilde O(\sqrt{T})$ rate.
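The master–base abstraction of Algorithms 1 and 2 can be sketched as follows. The master below is an EXP3.P-style exponential-weights learner with forced exploration, used only as a stand-in (the paper's CORRAL master uses a log-barrier mirror-descent update instead); as in the paper, the selected base receives the original reward and only the master's own update is importance weighted. The `Base` interface and all names are assumptions for illustration, reusing the toy environment sketched above.

```python
import numpy as np

class Base:
    """Interface assumed by Algorithm 2: a base keeps its own state counter and
    only advances it when the master actually selects it."""
    def act(self, action_set):                 # returns an arm index
        raise NotImplementedError
    def update(self, action_set, arm, reward):
        raise NotImplementedError

def run_master(env, bases, T, eta=0.1, p_min=None, rng=None):
    """Minimal sketch of the master/base protocol of Algorithms 1-2 with an
    EXP3.P-flavoured master: exponential weights over bases mixed with uniform
    exploration so that every base is chosen with probability at least p_min."""
    rng = rng if rng is not None else np.random.default_rng(0)
    M = len(bases)
    p_min = p_min if p_min is not None else 1.0 / (M * T ** 0.5)  # <= 1/M
    log_w = np.zeros(M)
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        probs = (1.0 - M * p_min) * w / w.sum() + p_min   # P(j_t = j) >= p_min
        j = int(rng.choice(M, p=probs))
        A_t = env.sample_action_set()
        a_t = bases[j].act(A_t)            # only base j_t acts this round
        r_t = env.pull(A_t, a_t)
        bases[j].update(A_t, a_t, r_t)     # original reward, not reweighted
        log_w[j] += eta * r_t / probs[j]   # importance-weighted master update
    return bases
```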
The detailed description of the smoothing procedure and its analysis are postponed to Section 5. First, we show some applications of our main result in the next section. All algorithms presented in the next section are compatible with the smoothing procedure, and all regret bounds are direct applications of Theorem 3.2.

4 Applications

4.1 Misspecified Contextual Linear Bandit

We consider the misspecified linear bandit problem. The learner selects an action $a_t \in A_t$ and receives a reward $r_t$ such that $|\mathbb{E}[r_t] - a_t^\top \theta^\star| \le \epsilon^\star$, where $\theta^\star \in \mathbb{R}^d$ is an unknown parameter vector and $\epsilon^\star$ is the misspecification error. For this problem, [20] and [13] present variants of LinUCB that achieve a high-probability $\tilde O(d\sqrt{T} + \epsilon^\star \sqrt{d}\, T)$ regret bound. Both algorithms require knowledge of $\epsilon^\star$, but [13] show a regret bound of the same order without knowledge of $\epsilon^\star$ for the version of the problem with a fixed action set $A_t = \mathcal{A}$. Their method relies on G-optimal design, which does not work in contextual settings. It is an open question whether it is possible to achieve the above regret without knowing $\epsilon^\star$ for problems with changing action sets. In this section, we show an $\tilde O(d\sqrt{T} + \epsilon^\star \sqrt{d}\, T)$ regret bound for linear bandit problems with changing action sets without knowing $\epsilon^\star$. For problems with fixed action sets, we show an improved regret that matches the lower bound of [12].

Given a constant $E$ such that $|\epsilon^\star| \le E$, we divide the interval $[1, E]$ into an exponential grid $G = [1, 2, 2^2, \dots, 2^{\lceil \log E \rceil}]$. We use $\lceil \log E \rceil$ modified LinUCB bases, from either Zanette et al. [20] or Lattimore et al. [13], with each base algorithm instantiated with a value of $\epsilon$ in the grid.

Theorem 4.1. For the misspecified linear bandit problem described above, the regret of CORRAL with learning rate $\eta = 1/(\sqrt{T}\,d)$, applied with modified LinUCB base algorithms with $\epsilon \in G$, is upper bounded by $\tilde O(d\sqrt{T} + \epsilon^\star \sqrt{d}\, T)$. When the size-$k$ action sets are fixed and $k > d$, the regret of CORRAL with $\eta = 1/(\sqrt{T}\,d)$, applied with one UCB base and one G-optimal base algorithm [13], is upper bounded by $\tilde O\big(\min\{\sqrt{kT},\; d\sqrt{T} + \epsilon^\star \sqrt{d}\, T\}\big)$.

This result matches the following lower bound, which shows that no single algorithm can be simultaneously optimal on both environments:

Lemma 4.2 (Implied by Theorem 24.4 in [12]). Let $R_\nu(T)$ denote the cumulative regret at time $T$ in environment $\nu$. For any algorithm, there exists a 1-dimensional linear bandit environment $\nu_1$ and a $k$-armed bandit environment $\nu_2$ such that $R_{\nu_1}(T)\, R_{\nu_2}(T) \ge T(k-1)e^{-2}$.

Experiment (Figure 1). Let $d = 2$. Consider a contextual bandit problem with $k = 50$ arms, where each arm $j$ has an associated vector $a_j \in \mathbb{R}^d$ sampled uniformly at random from $[0, 1]^d$. We consider two cases: (1) for $\theta \in \mathbb{R}^d$ sampled uniformly at random from $[0, 1]^d$, the reward of arm $j$ at time $t$ is $a_j^\top \theta + \xi_t$, where $\xi_t \sim N(0, 1)$; and (2) there are $k$ parameters $\mu_j$, $j \in [k]$, each sampled uniformly at random from $[0, 10]$, and the reward of arm $j$ at time $t$ is sampled from $N(\mu_j, 1)$. We use CORRAL with learning rate $\eta = 2/(\sqrt{T}\,d)$ and UCB and LinUCB as base algorithms. In case (1) LinUCB performs better, while in case (2) UCB performs better. Each experiment is repeated 500 times.

Figure 1: CORRAL with UCB and LinUCB bases. (a) Arms with linear rewards. (b) Arms with non-linear rewards. Shaded regions denote the standard deviations.
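The grid construction of Section 4.1 is straightforward to set up in code. The sketch below builds the geometric grid over candidate misspecification levels and instantiates one base per grid point; `make_base` is a hypothetical constructor standing in for the modified LinUCB of Zanette et al. [20] or Lattimore et al. [13], whose real interface may differ.

```python
import numpy as np

def epsilon_grid(E):
    """Geometric grid [1, 2, 4, ...] of candidate misspecification levels
    covering [1, E], as in Section 4.1."""
    n = int(np.ceil(np.log2(max(E, 2)))) + 1
    return [2.0 ** i for i in range(n)]

def make_misspecified_bases(E, make_base):
    """One epsilon-aware LinUCB base per grid point. `make_base(eps)` is a
    hypothetical constructor; it is not the actual API of [20] or [13]."""
    return [make_base(eps) for eps in epsilon_grid(E)]

# Example: with E = 64 this creates 7 bases, one per epsilon in {1, 2, ..., 64},
# which are then handed to the CORRAL or EXP3.P master.
```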
4.2 Contextual Bandits with Unknown Dimension

Linear Contextual Bandit. We consider the contextual linear bandit problem studied by Foster et al. [8]. In this problem, each action $a \in A_t$ is a $D$-dimensional feature vector, but only the first $d^\star$ elements of the parameter vector are nonzero. Here, $d^\star$ is unknown and possibly much smaller than $D$. [8] consider the special case where the number of actions $k$ is finite and require a lower bound on the average eigenvalues of the covariance matrices of all actions. We provide the first sublinear regret for this problem when the action set is infinite. Further, we make no eigenvalue assumptions and our regret does not scale with the number of actions $k$.

We use LinUCB with each value of $d \in [1, 2, 2^2, \dots, 2^{\lceil \log D \rceil}]$ as a base algorithm for CORRAL and EXP3.P. We also consider the case when both the optimal dimension $d^\star$ and the misspecification $\epsilon^\star$ are unknown: we use $M = \lceil \log E \rceil \cdot \lceil \log D \rceil$ modified LinUCB bases (see Section 4.1), one for each value of $(\epsilon, d)$ in the grid $[1, 2, 2^2, \dots, 2^{\lceil \log E \rceil}] \times [1, 2, 2^2, \dots, 2^{\lceil \log D \rceil}]$. For the unknown-dimension problem, Foster et al. [8] obtain $\tilde O(T^{2/3} k^{1/3} d^{\star 1/3})$ or $\tilde O(k^{1/4} T^{3/4} + \sqrt{k T d^\star})$ regret for finite action sets, and give no guarantee for infinite action sets or for unknown misspecification. Our approach yields $\tilde O(d^\star T^{2/3})$ with EXP3.P and $\tilde O(d^{\star 2} \sqrt{T})$ with CORRAL, for finite and infinite action sets alike, with analogous guarantees via Theorem 3.2 when $d^\star$ and $\epsilon^\star$ are both unknown.

With our approach, it is possible to combine different types of master and base algorithms, which provides much more flexibility compared to approaches specializing in a specific type of model selection. To the best of our knowledge, this is the first result that provides these types of guarantees.

Nonparametric Contextual Bandit. [9] consider nonparametric stochastic contextual bandits. At time $t$, given a context $x_t \in \mathbb{R}^N$, the learner selects an arm $a_t \in [k]$ and observes the reward $f_{a_t}(x_t) + \xi_t$, where $\xi_t$ is a 1-sub-Gaussian random variable and $f_j$ denotes the mean reward function of arm $j$. It is assumed that the contexts arrive in an IID fashion. [9] obtain an $\tilde O\big(T^{\frac{1+N}{2+N}}\big)$ regret bound for this problem. Similar to Foster et al. [8], we assume that only the first $n^\star$ context features are relevant, for an unknown $n^\star < N$. It is important to find $n^\star$ because the exponent $\frac{1+n^\star}{2+n^\star}$ can be much smaller than $\frac{1+N}{2+N}$. We have a model selection strategy that adapts to this unknown quantity: for each value of $n$ in the grid $[b^0, b^1, b^2, \dots, b^{\lceil \log_b N \rceil}]$ for some $b > 1$, we use the algorithm of Guan and Jiang [9] as a base, and perform model selection with CORRAL and EXP3.P over these base algorithms. This setting is not covered by Foster et al. [8]; EXP3.P attains regret $\tilde O\big(T^{\frac{1+bn^\star}{2+bn^\star} + \frac{1}{3(2+bn^\star)}}\big)$ and CORRAL attains $\tilde O\big(T^{\frac{1+2bn^\star}{2+2bn^\star}}\big)$.

4.3 Tuning the exploration rate of ε-greedy

For a given positive constant $c$, the ε-greedy algorithm pulls the arm with the largest empirical average reward with probability $1 - \epsilon_t$, where $\epsilon_t = c/t$, and otherwise pulls an arm uniformly at random. It can be shown that the optimal value of $\epsilon_t$ is $\min\{1, \frac{5k}{\Delta^2 t}\}$, where $\Delta$ is the smallest gap between the optimal arm and the sub-optimal arms [12]. With this exploration rate, the regret scales as $\tilde O(\sqrt{T})$ for $k = 2$. We would like to find the optimal value of $c$ without knowledge of $\Delta$. We obtain such a result by applying CORRAL to a set of ε-greedy base algorithms, each instantiated with a value of $c$ in the grid $[1, 2, 2^2, \dots, 2^{\lceil \log(kT) \rceil}]$.

Theorem 4.3. The regret of CORRAL using ε-greedy base algorithms defined on the grid above is bounded by $\tilde O(T^{1/2})$ when $k = 2$.

Figure 2: CORRAL with ε-greedy bases with different exploration rates.²

²The shaded areas around UCB and CORRAL show the standard deviation; the shaded areas around the ε-greedy bases show 0.1 of the standard deviation. For small $\epsilon$, ε-greedy has a very high variance because it either commits to the optimal arm or to a sub-optimal arm at the beginning, so plotting the whole standard deviation would make the plot unreadable.

Experiment (Figure 2). Let there be two Bernoulli arms with means 0.5 and 0.45. We use 18 ε-greedy base algorithms differing in their choice of $c$ in the exploration rate $\epsilon_t = c/t$. We take $T = 50{,}000$, $\eta = 20/\sqrt{T}$, and the values of $c$ lie on a geometric grid in $[1, 2T]$. Each experiment is repeated 50 times.
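As an illustration of the exploration-rate tuning in Section 4.3, here is a simple ε-greedy base with rate $\epsilon_t = \min\{1, c/t\}$ and a helper that builds one base per grid value of $c$. It follows the `act`/`update` interface from the earlier sketch; it is a generic implementation for illustration, not the code used in the experiments.

```python
import numpy as np

class EpsGreedyBase:
    """epsilon-greedy base for k arms with exploration rate eps_t = min(1, c/t)."""

    def __init__(self, c, k, rng=None):
        self.c, self.k = c, k
        self.counts = np.zeros(k)
        self.sums = np.zeros(k)
        self.t = 0
        self.rng = rng if rng is not None else np.random.default_rng(0)

    def act(self, action_set=None):
        # The internal counter only advances when the base is actually played.
        self.t += 1
        eps = min(1.0, self.c / self.t)
        if self.rng.random() < eps or self.counts.min() == 0:
            return int(self.rng.integers(self.k))         # explore
        return int(np.argmax(self.sums / self.counts))    # exploit

    def update(self, action_set, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

def make_eps_greedy_bases(k, T):
    """Bases on the geometric grid c in [1, 2, 4, ..., 2^ceil(log2(kT))]."""
    n = int(np.ceil(np.log2(k * T))) + 1
    return [EpsGreedyBase(2.0 ** i, k) for i in range(n)]

# For k = 2 and T = 50,000 this yields 18 bases, matching the count used in the
# experiment above.
```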
4.4 Reinforcement Learning

We consider the case of linear MDPs (see Assumption A in [11] for a definition). The learner has access to multiple feature maps, one of which is aligned with the true dynamics of the MDP.

Theorem 4.4. Let $M = (\mathcal{S}, \mathcal{A}, H, \mathbb{P}, r)$ be a linear MDP parametrized by an unknown feature map $\Phi^\star : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$. Let $\{\Phi_i(s, a)\}_{i=1}^M$ be a family of feature maps with $\Phi_i(s, a) \in \mathbb{R}^d$ satisfying $\Phi^\star \in \{\Phi_i(s, a)\}_{i=1}^M$. The regret of CORRAL with $\eta = M^{1/2} T^{-1/2} d^{-3/2} H^{-3/2}$ using LSVI-UCB base algorithms is $R(T) \le \tilde O\big(\sqrt{M d^3 H^3 T}\big)$.

Figure 3: ε-greedy vs UCRL2 vs PSRL in the RiverSwim environment [19].

We also observe that, in practice, smoothing RL algorithms such as UCRL and PSRL and running a CORRAL master on top of them can lead to improved performance. A longer discussion is in Appendix A.

5 Base Smoothing

5.1 Non-increasing instantaneous regret

We introduce a two-step "smoothing" procedure (Algorithm 3) which, given an algorithm $B_j$ with a concave (in $t$) cumulative high-probability regret bound $U_j(t, \delta)$ (Definition 5.1), constructs a smoothed algorithm $\tilde B_j$ with an instantaneous regret bound $u_j(t, \delta) = U_j(t, \delta)/t$ (Definition 5.2). If $U(t, \delta)$ is concave, $u(t, \delta)$ is non-increasing in $t$.

Algorithm $\tilde B_j$ works as follows. Each round $s$ consists of two steps. In Step 1, we play $B_j$. In Step 2, at state $s$, we pick a time step $q \in \{1, 2, \dots, s\}$ uniformly at random and re-play the policy used by $B_j$ at time $q$. Since the policy of $B_j$ at each round in $\{1, 2, \dots, s\}$ is chosen with probability $1/s$ to be played in Step 2, and $B_j$ satisfies a high-probability upper bound (Definition 5.1), the expected instantaneous regret of Step 2 at round $s$ is at most $U(s, \delta)/s$ with high probability (Lemma D.1), which allows us to control Term II in Eq. (2) via Theorem 5.2 (Appendix F.3). We use superscripts (1) and (2) to distinguish the action sets, policies, actions and rewards of Step 1 and Step 2. Since the instantaneous regret of Step 2 is $1/s$ times the cumulative regret of Step 1, the cumulative regret of Step 2 over $S$ states is bounded by roughly $\sum_{s=1}^S 1/s \approx \log(S)$ times that of Step 1.

Algorithm 3 Smoothed Algorithm $\tilde B_j$
Input: base algorithm $B_j$. Output: a smoothed base $\tilde B_j$.
Let $\pi_{s,j}$ be the policy of $B_j$ in state $s$; let $\pi^{(1)}_{s,j}$ and $\pi^{(2)}_{s,j}$ be the Step 1 and Step 2 policies of $\tilde B_j$ in state $s$. Initialize the state counter $s = 1$.
1: for $t = 1, \dots, T$ do
2:   if selected by the master then
3:     Step 1: Receive action set $A^{(1)}_t \sim D_{\mathcal{S}}$. Set $\pi^{(1)}_{s,j} = \pi_{s,j}$. Play action $a^{(1)}_{t,j} \sim \pi^{(1)}_{s,j}(A^{(1)}_t)$ and receive feedback $r^{(1)}_{t,j} = f(A^{(1)}_t, \delta_{a^{(1)}_{t,j}}) + \xi^{(1)}_t$. Compute $\pi_{s+1,j}$ of $B_j$ using $r^{(1)}_{t,j}$.
       Step 2: Receive action set $A^{(2)}_t \sim D_{\mathcal{S}}$. Sample $q \sim \mathrm{Uniform}(1, \dots, s)$ and set $\pi^{(2)}_{s,j} = \pi_{q,j}$. Play action $a^{(2)}_{t,j} \sim \pi^{(2)}_{s,j}(A^{(2)}_t)$ and receive feedback $r^{(2)}_{t,j} = f(A^{(2)}_t, \delta_{a^{(2)}_{t,j}}) + \xi^{(2)}_t$.
4:     Send the smoothed reward $r^{(2)}_{t,j}$ as the reward of both Step 1 and Step 2 to the master.
5:     $s \leftarrow s + 1$
6:   else
7:     for 2 steps do
8:       Receive action set $A_t \sim D_{\mathcal{S}}$.
9:       Choose action $a_t \sim \pi^{(2)}_{s,j}(A_t)$.
10:    end for
11:  end if
12: end for
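A direct, if memory-inefficient, way to implement Algorithm 3 is to snapshot the base after every update so that a uniformly random past policy can be replayed in Step 2. The sketch below assumes a base exposing `act`/`update` methods and an environment as in the earlier sketches; it is illustrative only, not the authors' implementation.

```python
import copy
import numpy as np

class SmoothedBase:
    """Sketch of the two-step smoothing wrapper (Algorithm 3)."""

    def __init__(self, base, rng=None):
        self.base = base
        self.snapshots = [copy.deepcopy(base)]   # snapshots[i] ~ policy pi_{i+1}
        self.replay = self.snapshots[0]          # current Step-2 policy
        self.rng = rng if rng is not None else np.random.default_rng(0)

    def play_when_selected(self, env):
        # Step 1: play the base's current policy and let it learn.
        A1 = env.sample_action_set()
        a1 = self.base.act(A1)
        r1 = env.pull(A1, a1)
        self.base.update(A1, a1, r1)
        self.snapshots.append(copy.deepcopy(self.base))
        # Step 2: sample q uniformly from the past states and replay pi_q.
        q = int(self.rng.integers(len(self.snapshots) - 1))
        self.replay = self.snapshots[q]
        A2 = env.sample_action_set()
        a2 = self.replay.act(A2)
        r2 = env.pull(A2, a2)
        # Only the smoothed (Step 2) reward is reported to the master,
        # standing in for both steps, as in line 4 of Algorithm 3.
        return r2

    def play_when_not_selected(self, env):
        # Repeat the current Step-2 policy for the two steps of this round.
        for _ in range(2):
            A = env.sample_action_set()
            env.pull(A, self.replay.act(A))
```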
Definition 5.1 ((U, δ, T)-Boundedness). Let $U : \mathbb{R} \times [0, 1] \to \mathbb{R}^+$. We say an algorithm $B$ is $(U, \delta, T)$-bounded if, with probability at least $1 - \delta$, for all rounds $t \in [1, T]$ the cumulative pseudo-regret is bounded above by $U(t, \delta)$: $\sum_{l=1}^{t} f(A_l, \pi^\star) - f(A_l, \pi_l) \le U(t, \delta)$.

Definition 5.2 ((U, δ, T^{(2)})-Smoothness). Let $U : \mathbb{R} \times [0, 1] \to \mathbb{R}^+$. We say a smoothed algorithm $\tilde B$ is $(U, \delta, T^{(2)})$-smooth if, with probability at least $1 - \delta$, for all rounds $t \in [T]$ the conditional expected instantaneous regret of Step 2 is bounded above by $U(t, \delta)/t$:
$$\mathbb{E}_{A_t \sim D_{\mathcal{S}}}\!\left[\bar r^{(2)}_t \,\middle|\, \mathcal{F}_{t-1}\right] \le \frac{U(t, \delta)}{t}, \qquad \forall t \in [T], \tag{3}$$
where $\bar r^{(2)}_t$ denotes the instantaneous regret of Step 2 and $\mathcal{F}_{t-1}$ denotes the sigma algebra of all randomness up to the beginning of round $t$.

In Appendix D we show that several algorithms, such as UCB, LinUCB, ε-greedy and EXP3, are $(U, \delta, T)$-bounded for appropriate functions $U$. In Proposition 5.1 we show that if $B_j$ is bounded, then $\tilde B_j$ is both bounded and smooth:

Proposition 5.1. If $U(t, \delta) > 8\sqrt{T}$ and $B_j$ is $(U, \delta, T)$-bounded, then $\tilde B_j$ is $(6U \log T, \delta, T)$-bounded and $(5U, \delta, T^{(2)})$-smooth.

5.2 Regret Analysis

Term I. Note that we only send the smoothed reward of Step 2 to the master, while the base plays and incurs regret in both Step 1 and Step 2. We show in Appendix E that this does not significantly affect the regret of the master. For CORRAL with learning rate $\eta$,
$$\mathbb{E}[\mathrm{I}] \;\le\; O\!\left(\sqrt{MT} + \frac{M \ln T}{\eta} + T\eta\right) - \frac{\mathbb{E}\big[1/\underline{p}^{\,i^\star}\big]}{40\, \eta \ln T}.$$
For EXP3.P with exploration rate $p$, $\mathbb{E}[\mathrm{I}] \le O\big(\sqrt{MT} + MTp\big)$.

Term II. Term II is the regret of base $i^\star$ when it only updates its state upon being selected. We assume the smoothed base algorithm $\tilde B_{i^\star}$ satisfies the boundedness and smoothness of Definitions 5.1 and 5.2. Note that when a smoothed base repeats its policy while not played, it repeats its next Step 2 policy (Algorithm 3), whose instantaneous regret bound is non-increasing. Since Step 2's conditional instantaneous regret (Definition 5.2) has a non-increasing upper bound, selecting $\tilde B_{i^\star}$ with probability $\underline{p}^{\,i^\star}$ at every time step results in the largest upper bound on its regret, because the base is then updated the least often. In this case the base is updated every $1/\underline{p}^{\,i^\star}$ time steps and the regret upper bound is roughly $\frac{1}{\underline{p}^{\,i^\star}} U_{i^\star}(T\underline{p}^{\,i^\star}, \delta)$.

Theorem 5.2. We have
$$\mathbb{E}[\mathrm{II}] \;\le\; O\!\left(\frac{1}{\underline{p}^{\,i^\star}}\, U_{i^\star}\!\big(T\underline{p}^{\,i^\star}, \delta\big)\log T + \delta T(\log T + 1)\right),$$
where the expectation is over the random variable $\underline{p}^{\,i^\star}$. If $U(t, \delta) = t^\alpha c(\delta)$ for some $\alpha \in [1/2, 1)$, then
$$\mathbb{E}[\mathrm{II}] \;\le\; O\!\left(\mathbb{E}\big[(\underline{p}^{\,i^\star})^{\alpha - 1}\big]\, T^\alpha c(\delta)\log T + \delta T(\log T + 1)\right).$$

Total regret. Adding Term I and Term II gives the following worst-case bounds for CORRAL (maximized over $\underline{p}^{\,i^\star}$, with a suitably chosen $\eta$) and EXP3.P (with a suitably chosen $p$):

Theorem 5.3. Suppose a base algorithm is $(U, \delta, T)$-bounded with $U(T, \delta) = T^\alpha c(\delta)$ for some $\alpha \in [1/2, 1)$, and choose $\delta = 1/T$. Then the regret of EXP3.P with exploration rate $p$ is upper bounded by $\tilde O\big(\sqrt{MT} + MTp + T^\alpha p^{\alpha-1} c(\delta)\big)$. For CORRAL: when $\alpha$ and $c(\delta)$ are both known, a tuned learning rate $\eta$ yields $\tilde O\big(\sqrt{MT} + M^\alpha T^{1-\alpha} + M^{1-\alpha} T^\alpha c(\delta)\big)$; when $\alpha$ is known but $c(\delta)$ is unknown, the tuned $\eta$ yields $\tilde O\big(\sqrt{MT} + M^\alpha T^{1-\alpha} + M^{1-\alpha} T^\alpha c(\delta)^{1/\alpha}\big)$.
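For completeness, the substitution behind the second claim of Theorem 5.2 (ignoring log factors and the low-order $\delta T$ term) is the following one-line computation; combining it with the Term I bounds above gives the totals stated in Theorem 5.3.

```latex
% Term II under U(t,\delta) = t^{\alpha} c(\delta): substitute T\underline{p}^{\,i^\star}
% into U and divide by \underline{p}^{\,i^\star}, as in Lemma 3.1 / Theorem 5.2.
\frac{1}{\underline{p}^{\,i^\star}}\, U\!\big(T\underline{p}^{\,i^\star}, \delta\big)
  \;=\; \frac{\big(T\underline{p}^{\,i^\star}\big)^{\alpha} c(\delta)}{\underline{p}^{\,i^\star}}
  \;=\; \big(\underline{p}^{\,i^\star}\big)^{\alpha-1}\, T^{\alpha}\, c(\delta).
% Since \alpha - 1 < 0 this is largest when \underline{p}^{\,i^\star} is smallest; for EXP3.P
% the forced exploration guarantees \underline{p}^{\,i^\star} \ge p, which produces the
% p^{\alpha-1} T^{\alpha} c(\delta) term of Theorem 5.3.
```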
6 Lower bound

In stochastic environments, algorithms such as UCB can achieve logarithmic regret bounds. Our model selection procedure, however, has $\tilde O(\sqrt{T})$ overall regret. In this section, we show that in general it is impossible to obtain regret better than $\Omega(\sqrt{T})$, even when one base has zero regret.

Theorem 6.1. There exists an algorithm selection problem such that the regret of any master algorithm at any time $T$ is lower bounded by $R(T) = \tilde\Omega(\sqrt{T})$.

Proof sketch. The two base algorithms are constructed such that one base algorithm has zero regret and the gap between the algorithms closes at a rate of $\Theta\big(1/(\sqrt{t}\log t)\big)$. We show that at this rate, any master has a constant probability of misidentifying the optimal algorithm even after observing infinitely many pulls. Hence the regret of the master is of order $\tilde\Omega(\sqrt{T})$.

CORRAL needs knowledge of the best base's regret to achieve the same regret. The following lower bound shows that this requirement is unavoidable:

Theorem 6.2. Let there be two base algorithms where the best base has regret $O(T^x)$ for some $0 < x < 1$. If we do not know $x$ and we do not know the reward of the best arm, then the regret of the master algorithm can be $\Omega(T^y)$ with $y > x$.

Proof sketch. Let there be two base algorithms, and let $R_1$ and $R_2$ be the regrets they incur when called by the model selection strategy. If $R_1 = o(R_2)$, we can construct the bases such that both have zero regret after the learner stops selecting them. Their regrets when running alone are therefore $R_1$ and $R_2$, and the learner has regret of the same order as $R_2$, which is higher than the regret of the better base running alone ($R_1$). If instead $R_1$ and $R_2$ are of the same order, then, since the learner does not know the optimal arm's reward, we create another environment where the optimal arm's reward is different, so that in the new environment the regrets are no longer equal.

Acknowledgments and Disclosure of Funding

Csaba Szepesvári gratefully acknowledges funding from the Canada CIFAR AI Chairs Program, Amii and NSERC.

Broader impact

The work does not present any foreseeable societal consequence.

References

[1] Yasin Abbasi-Yadkori, David Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, 2011.
[2] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E. Schapire. Corralling a band of bandit algorithms. In COLT, volume 65 of Proceedings of Machine Learning Research, pages 12–38. PMLR, 2017.
[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[4] Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: stochastic and adversarial bandits. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2012.
[5] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. CoRR, abs/1204.5721, 2012. URL http://arxiv.org/abs/1204.5721.
[6] Niladri Chatterji, Vidya Muthukumar, and Peter L. Bartlett. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. arXiv, 2019.
[7] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 208–214. PMLR, 2011. URL http://proceedings.mlr.press/v15/chu11a.html.
[8] Dylan Foster, Akshay Krishnamurthy, and Haipeng Luo. Model selection for contextual bandits. In Advances in Neural Information Processing Systems, 2019.
[9] Melody Guan and Heinrich Jiang. Nonparametric stochastic contextual bandits. In AAAI, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16944.
[10] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
[11] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
[12] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[13] Tor Lattimore, Csaba Szepesvári, and Gellért Weisz. Learning with good feature representations in bandits and in RL with a generative model, 2019.
[14] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits, 2017.
[15] Odalric-Ambrym Maillard and Rémi Munos. Adaptive bandits: Towards the best history-dependent strategy. In AISTATS, 2011.
[16] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2701–2710. JMLR.org, 2017.
[17] Yevgeny Seldin, Csaba Szepesvári, Peter Auer, and Yasin Abbasi-Yadkori. Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments. In Proceedings of the Tenth European Workshop on Reinforcement Learning, volume 24 of Proceedings of Machine Learning Research, pages 103–116. PMLR, 2013. URL http://proceedings.mlr.press/v24/seldin12a.html.
[18] Han Shao, Xiaotian Yu, Irwin King, and Michael R. Lyu. Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs. In Advances in Neural Information Processing Systems 31, pages 8420–8429. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8062-almost-optimal-algorithms-for-heavy-tailed-payoffs.pdf.
[19] Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
[20] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error, 2020.