# BooVI: Provably Efficient Bootstrapped Value Iteration

Boyi Liu (Northwestern University; boyiliu2018@u.northwestern.edu), Qi Cai (Northwestern University; qicai2022@u.northwestern.edu), Zhuoran Yang (Princeton University; zy6@princeton.edu), Zhaoran Wang (Northwestern University; zhaoranwang@gmail.com)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Despite the tremendous success of reinforcement learning (RL) with function approximation, efficient exploration remains a significant challenge, both practically and theoretically. In particular, existing theoretically grounded RL algorithms based on upper confidence bounds (UCBs), such as optimistic least-squares value iteration (LSVI), are often incompatible with practically powerful function approximators, such as neural networks. In this paper, we develop a variant of bootstrapped LSVI, namely BooVI, which bridges this gap between practice and theory. Practically, BooVI drives exploration through (re)sampling, making it compatible with general function approximators. Theoretically, BooVI inherits the worst-case $\widetilde O(\sqrt{d^3H^3T})$-regret of optimistic LSVI in the episodic linear setting. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps.

1 Introduction

Reinforcement learning (RL) with function approximation demonstrates significant empirical success in a broad range of applications [e.g., 16, 38, 40, 45]. However, computationally and statistically efficient exploration of large and intricate state spaces remains a major barrier. Practically, the lack of temporally extended exploration [31, 28, 30] in existing RL algorithms, e.g., deterministic policy gradient [39] and soft actor-critic [19], hinders them from solving more challenging tasks, e.g., Montezuma's Revenge in Atari games [42, 17], in a sample-efficient manner. Theoretically, it remains unclear how to design provably efficient RL algorithms with finite sample complexities or regrets that allow for practically powerful function approximators, e.g., neural networks.

There exist two principled approaches to efficient exploration in RL, namely optimism in the face of uncertainty [e.g., 4, 41, 21, 11, 12, 5, 23, 24] and posterior sampling [e.g., 29, 34, 33, 30, 27, 36]. The optimism-based approach is often instantiated by incorporating upper confidence bounds (UCBs) into the estimated (action-)value functions as bonuses, which are used to direct exploration. In the tabular setting, the resulting algorithms, e.g., variants of upper confidence bound reinforcement learning (UCRL) [4, 21, 11, 5], are known to attain the optimal worst-case regret [32]. Beyond the tabular setting, it remains unclear how to construct closed-form UCBs in a principled manner for general function approximators, e.g., neural networks. The only exception is the linear setting [51, 52, 24, 8], where optimistic least-squares value iteration (LSVI) [24] is known to attain a near-optimal (with respect to $T = KH$) worst-case regret. However, the closed-form UCB therein is tailored to linear models [1, 9] rather than fully general-purpose.

On the other hand, the posterior-based approach, which originates from Thompson sampling [43, 37], can be instantiated using randomized (action-)value functions [34, 30, 27, 36].
Unlike optimistic LSVI, the resulting algorithm, namely randomized LSVI, straightforwardly allows for general function approximators, as it only requires injecting random noise into the training data of LSVI. Ideally, such injected random noise does not depend on the particular function approximator. In the tabular setting, randomized LSVI is known to attain near-optimal worst-case and Bayesian regrets [30, 36]. Meanwhile, in the linear setting, randomized LSVI was very recently shown to attain a near-optimal worst-case regret [54], which is worse than that of optimistic LSVI only by a factor of $\sqrt{dH}$. Here $d$ is the feature dimension and $H$ is the episode horizon. However, achieving such a regret requires a nontrivial modification of the injected random noise, which is tailored to linear models. Such a specialized modification diminishes the supposed practical advantage of randomized LSVI. To bridge this gap between practice and theory, we aim to answer the following question:

Can we design an RL algorithm that simultaneously achieves the practical advantage of randomized LSVI and the theoretical guarantee of optimistic LSVI?

In this paper, we propose a variant of bootstrapped LSVI, namely BooVI, which combines the advantages of the optimism-based and posterior-based approaches. The key idea of BooVI is to use posterior sampling to implicitly construct an optimistic version of the estimated (action-)value functions in a data-driven manner. Unlike randomized LSVI, which samples from the posterior only once, e.g., by injecting random noise into the training data of LSVI, BooVI samples from the posterior multiple times, e.g., via the Langevin dynamics [48, 35]. Upon evaluating an action at a state, BooVI ranks the values of the randomized (action-)value functions sampled from the posterior in descending order and returns a top-ranked value, which can be shown to be approximately optimistic. Generally speaking, BooVI corresponds to bootstrapping the noise in the least-squares regression problem of LSVI. As a result, it can be viewed as a parametric-bootstrap counterpart of the nonparametric bootstrap technique used in bootstrapped deep Q-networks (DQNs) [31, 30], which demonstrates significant empirical success in terms of exploration. Compared with existing RL algorithms with function approximation, the advantage of BooVI is twofold.

Practically, BooVI bypasses the explicit construction of the closed-form UCB in optimistic LSVI. Like randomized LSVI, BooVI straightforwardly allows for general function approximators, as it only requires injecting random noise into the training process of LSVI, e.g., via the Langevin dynamics.

Theoretically, BooVI inherits the worst-case $\widetilde O(\sqrt{d^3H^3T})$-regret of optimistic LSVI in the linear setting. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. Such a regret is better than the best known worst-case regret of randomized LSVI by a factor of $\sqrt{dH}$. More importantly, unlike the specialized variant of randomized LSVI studied in [54], BooVI can be applied to the linear setting "as is", without tailoring the injected random noise to linear models.

More Related Work: The idea of using posterior sampling to achieve optimism in a data-driven manner has previously been studied in the linear bandit setting [26] and the tabular setting of RL [3].
In fact, our linear setting of RL covers the linear bandit setting as a special case, where the episode horizon $H$ is set to one, and the tabular setting of RL as another special case, where the feature mapping is the canonical basis of the state and action spaces. Although BooVI is practically applicable to general function approximators, our theoretical guarantee on its worst-case regret applies only to the linear setting. Despite recent progress [49, 22, 13, 15], simultaneously achieving provable computational and statistical efficiency in exploration with general function approximators remains challenging. See, e.g., [2, 18] for recent progress in the contextual bandit setting. Finally, we refer readers to [46, 53, 57, 55] for theories on generalized models of linear MDPs, which all share a similar (approximately) linear structure with the class of linear MDPs considered in the analysis of this paper. We expect our regret bound to stay the same for those settings with exactly linear structure, or to admit an extra $O(\zeta T)$ dependence on $T$ for those settings with approximately linear structure, where $\zeta > 0$ describes the level of nonlinearity in the MDP. We also refer readers to [50] for sharper regret bounds in the tabular setting, which are out of the scope of this paper.

2 Background

In this section, we introduce the general problem setting.

Episodic Markov Decision Process. We consider an episodic Markov decision process represented by a tuple $(\mathcal S, \mathcal A, H, P, r)$, where $\mathcal S$ is a compact state space, $\mathcal A$ is a finite action space with cardinality $A$, $H$ is the number of timesteps in each episode, $P = \{P_h\}_{h\in[H]}$ with $P_h : \mathcal S\times\mathcal S\times\mathcal A \to [0,1]$ for all $h\in[H]$ is the set of transition kernels, and $r = \{r_h\}_{h\in[H]}$ with $r_h : \mathcal S\times\mathcal A \to [0,1]$ for all $h\in[H]$ is the set of reward functions. At each timestep $h\in[H]$ of episode $k$, an agent at state $s_h^k\in\mathcal S$ with policy $\pi^k = \{\pi_h^k\}_{h\in[H]}$, where $\pi_h^k : \mathcal A\times\mathcal S\to\mathbb R$ for all $h\in[H]$, interacts with the environment by first taking an action $a_h^k$ with probability $\pi_h^k(a_h^k\,|\,s_h^k)$ and then receiving the corresponding reward $r_h^k$.

We evaluate the performance of a policy $\pi = \{\pi_h\}_{h\in[H]}$, starting from timestep $h$ and state-action pair $(s,a)$, using its action-value function $Q_h^\pi : \mathcal S\times\mathcal A\to\mathbb R$, which is defined as
$$Q_h^\pi(s,a) = \mathbb E_\pi\Bigl[\sum_{h'=h}^H r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s,\ a_h = a\Bigr] \quad\text{for all } h\in[H].$$
Correspondingly, the value function $V_h^\pi : \mathcal S\to\mathbb R$ of a policy $\pi$ is defined as
$$V_h^\pi(s) = \mathbb E_\pi\Bigl[\sum_{h'=h}^H r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Bigr] \quad\text{for all } h\in[H].$$
Here the expectation $\mathbb E_\pi[\,\cdot\,]$ is taken over the trajectory generated by $\pi$. Also, we let $Q_{H+1}^\pi \equiv 0$ and thus $V_{H+1}^\pi \equiv 0$. Furthermore, we denote by $V_h^*(s)$ the value function corresponding to the optimal policy $\pi^*$. Finally, for notational simplicity, we define $[P_h V](s,a) = \mathbb E_{s'\sim P_h(\cdot\,|\,s,a)}[V(s')]$, where $V : \mathcal S\to\mathbb R$ can be any function.

For any algorithm that generates a sequence of policies $\{\pi^k\}_{k\in[K]}$, we track its performance via the cumulative regret defined by
$$\mathrm{Regret}(K) = \sum_{k=1}^K \bigl[V_1^*(s_1^k) - V_1^{\pi^k}(s_1^k)\bigr],$$
where $K$ is the number of episodes. Here $s_1^k$ is the initial state of episode $k$, which is arbitrarily chosen at the start of the episode.
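For reference, the action-value and value functions defined above satisfy the standard Bellman equations, which underlie the value-iteration updates in the next section; the display below restates these well-known identities and is not specific to this paper:
$$Q_h^\pi(s,a) = r_h(s,a) + [P_h V_{h+1}^\pi](s,a), \qquad V_h^\pi(s) = \sum_{a\in\mathcal A}\pi_h(a\,|\,s)\,Q_h^\pi(s,a),$$
$$Q_h^*(s,a) = r_h(s,a) + [P_h V_{h+1}^*](s,a), \qquad V_h^*(s) = \max_{a\in\mathcal A} Q_h^*(s,a),$$
with $V_{H+1}^\pi \equiv V_{H+1}^* \equiv 0$.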
3 Bootstrapped Value Iteration

In this section, we introduce Bootstrapped Value Iteration (BooVI, Algorithm 1). For notational simplicity, we write $r_h^k = r_h(s_h^k, a_h^k)$ throughout the rest of this paper. Also, in this paper, we write $\max\{\min\{x, y\},\, 0\}$ as $\min\{x, y\}^+$, and denote by $\|\cdot\|$ the $\ell_2$-norm for vectors.

Least-Squares Value Iteration. At timestep $h\in[H]$ of episode $k$, given the estimated action-value function $\widehat Q_{h+1}^k(s_{h+1}^\tau, a)$ for all $\tau\in[k-1]$ and $a\in\mathcal A$, let $\widehat V_{h+1}^k(s_{h+1}^\tau) = \max_{a\in\mathcal A}\widehat Q_{h+1}^k(s_{h+1}^\tau, a)$. Then, Least-Squares Value Iteration (LSVI) updates the parameter $\omega$ of the action-value function via
$$\widehat\omega_h^k \leftarrow \mathop{\mathrm{argmin}}_{\omega\in\mathbb R^d}\ \sum_{\tau\in[k-1]} \bigl(r_h^\tau + \widehat V_{h+1}^k(s_{h+1}^\tau) - Q(s_h^\tau, a_h^\tau; \omega)\bigr)^2 + \lambda\|\omega\|^2, \tag{3.1}$$
where $Q(\cdot,\cdot;\cdot) : \mathcal S\times\mathcal A\times\mathbb R^d\to\mathbb R$ is the parameterization of the action-value function with $\omega\in\mathbb R^d$ being its parameter, and $\lambda\ge 0$ is the regularization parameter. Although the deterministic parameter update of LSVI exploits the historical data well, it has limited ability to address the exploration needs of more challenging tasks.

Bootstrapped Value Iteration. To achieve guided exploration, we introduce BooVI (Algorithm 1), which uses bootstrapping to enforce the optimism of the estimated action-value function. At timestep $h$ of episode $k$, given the current data buffer $\mathcal D_h^k = \{(s_h^\tau, a_h^\tau, r_h^\tau, s_{h+1}^\tau)\}_{\tau\in[k-1]}$, which contains the data collected from the previous $k-1$ episodes, and the current bootstrapped state values $\{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}$, we define
$$y_h^\tau = r_h^\tau + \widetilde V_{h+1}^k(s_{h+1}^\tau), \quad\text{for all } \tau\in[k-1]. \tag{3.2}$$
With $p_0(\omega)$ as the prior of $\omega$ and $p(y_h^\tau\,|\,(s_h^\tau, a_h^\tau), \omega)$ as the likelihood of $y_h^\tau$, the posterior of $\omega$ is given by
$$p\bigl(\omega \,\big|\, \{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k\bigr) \propto p_0(\omega)\prod_{\tau\in[k-1]} p\bigl(y_h^\tau \,\big|\, (s_h^\tau, a_h^\tau), \omega\bigr). \tag{3.3}$$
In BooVI (Algorithm 1), we propose to sample from such a posterior $N_k$ times, which is equivalent to repeatedly sampling a randomized action-value function, since each sample $\omega_h^{k,i}$ corresponds to one randomized action-value function $Q(\cdot,\cdot;\omega_h^{k,i})$. This collection of posterior weights is later used to construct the bootstrapped action-value function in Algorithm 2. To independently sample from the posterior in (3.3), we can consider, e.g., using the Langevin dynamics
$$\omega(t+1) \leftarrow \omega(t) + \frac{\epsilon_t}{2}\,\nabla_\omega \log p\bigl(\omega(t) \,\big|\, \{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k\bigr) + \zeta_t, \tag{3.4}$$
where $\epsilon_t > 0$ is the stepsize and $\zeta_t\sim N(0, \epsilon_t I_d)$. We note here that, after running sufficiently many iterations with suitable choices of stepsizes $\epsilon_t > 0$, the Langevin dynamics gives effectively independent parameter samples from the posterior in (3.3).

To instantiate the connection between the posterior (3.3) and LSVI, we consider a Gaussian prior and likelihood as an example. For the Gaussian prior $\omega\sim N(0, I_d)$ and the Gaussian likelihood $p(y_h^\tau\,|\,(s_h^\tau, a_h^\tau), \omega) \propto \exp\{-(y_h^\tau - Q(s_h^\tau, a_h^\tau; \omega))^2/(2\sigma^2)\}$, the posterior of $\omega$ is given by
$$p\bigl(\omega \,\big|\, \{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k\bigr) \propto \exp\Bigl\{-\frac{1}{2}\|\omega\|^2 - \frac{1}{2\sigma^2}\sum_{\tau\in[k-1]}\bigl(y_h^\tau - Q(s_h^\tau, a_h^\tau; \omega)\bigr)^2\Bigr\}, \tag{3.5}$$
where $\sigma > 0$ is an absolute constant. In this case, the maximum a posteriori (MAP) estimate of $\omega$ coincides with the weight estimate $\widehat\omega_h^k$ obtained by LSVI. This observation suggests that sampling from the posterior distribution $p(\omega\,|\,\{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k)$ allows us both to achieve exploitation based on the available data and to generate noise for randomized exploration. More specifically, the gradient term $\nabla_\omega\log p(\omega(t)\,|\,\cdot)$ in (3.4) takes the form of
$$-\omega(t) + \frac{1}{\sigma^2}\sum_{\tau\in[k-1]}\bigl(y_h^\tau - Q(s_h^\tau, a_h^\tau; \omega(t))\bigr)\,\nabla_\omega Q(s_h^\tau, a_h^\tau; \omega(t)),$$
which can be viewed as using stochastic gradient descent to solve the LSVI update (3.1) with $\lambda = \sigma^2$ and $\widehat V_{h+1}^k$ replaced by $\widetilde V_{h+1}^k$. See also Lemma E.1 for the closed form of the posterior in the linear MDP setting with Gaussian prior and likelihood. We would like to highlight that, although the theoretical guarantee in this paper is built upon the Gaussian prior and likelihood, the choices of prior and likelihood are flexible. For example, we can use an uninformative prior for a generalized linear model, or a Gaussian process prior for the kernel of overparameterized neural networks [6]. Moreover, as suggested by theoretical results on Langevin dynamics [56] and the illustrative experiments in Appendix F, the computational overhead caused by replacing the minimization of (3.1) with posterior sampling is mild.
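To make the sampling step concrete, the following is a minimal sketch (not the authors' implementation) of drawing approximate posterior samples via the Langevin dynamics (3.4), assuming the Gaussian prior and likelihood of (3.5) and the linear parameterization $Q(s,a;\omega) = \langle\phi(s,a),\omega\rangle$. The function name, the inputs `Phi` and `y`, and the stepsize schedule are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def langevin_posterior_samples(Phi, y, sigma=1.0, n_samples=10,
                               n_steps=2000, step=1e-3, rng=None):
    """Approximate samples from the posterior (3.5) via the Langevin dynamics (3.4).

    Phi : (n, d) array of features phi(s_h^tau, a_h^tau), tau = 1, ..., k-1.
    y   : (n,) array of bootstrapped targets y_h^tau = r_h^tau + V~_{h+1}^k(s_{h+1}^tau).
    Assumes the prior N(0, I_d) and Gaussian likelihood with variance sigma^2, so that
    log p(omega | data) = -||omega||^2 / 2 - ||y - Phi omega||^2 / (2 sigma^2) + const.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = Phi.shape
    samples = []
    for _ in range(n_samples):
        omega = np.zeros(d)
        for _ in range(n_steps):
            # Gradient of the log-posterior: -omega + Phi^T (y - Phi omega) / sigma^2.
            grad = -omega + Phi.T @ (y - Phi @ omega) / sigma**2
            # Langevin step: drift (step / 2) * grad plus N(0, step * I_d) noise.
            omega = omega + 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(d)
        samples.append(omega)
    return np.stack(samples)  # (n_samples, d): the weights {omega_h^{k,i}}_{i in [N_k]}
```

Each returned row plays the role of one posterior weight $\omega_h^{k,i}$ used in Line 9 of Algorithm 1 below.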
Algorithm 1 Bootstrapped Value Iteration (BooVI)
1: Require: MDP $(\mathcal S, \mathcal A, H, P, r)$, action-value function parameterization $Q(\cdot,\cdot;\cdot) : \mathcal S\times\mathcal A\times\mathbb R^d\to\mathbb R$, number of episodes $K$, numbers of posterior weights $\{N_k\}_{k\in[K]}$, lower and upper bootstrapping ratios $\alpha, \beta\in(0,1)$.
2: Initialize the data buffer $\mathcal D_h^1 \leftarrow \{\}$ for all $h\in[H]$.
3: For episode $k = 1, \ldots, K$ do
4: Set $N_{k,\alpha} \leftarrow \lceil\alpha N_k\rceil$, $N_{k,\beta} \leftarrow \lfloor\beta N_k\rfloor$, and $\omega_{H+1}^{k,i} \leftarrow 0$ for all $i\in[N_k]$.
5: Sample $n_k$ uniformly from $\{N_{k,\alpha}, N_{k,\alpha}+1, \ldots, N_{k,\beta}\}$.
6: For timestep $h = H, \ldots, 1$ do
7: Generate $\widetilde Q_{h+1}^k(s_{h+1}^\tau, a)$ using Algorithm 2 with weights $\{\omega_{h+1}^{k,i}\}_{i\in[N_k]}$ and parameter $n_k$, for all $a\in\mathcal A$ and $\tau\in[k-1]$.
8: Set $\widetilde V_{h+1}^k(s_{h+1}^\tau) \leftarrow \max_{a\in\mathcal A}\widetilde Q_{h+1}^k(s_{h+1}^\tau, a)$ for all $\tau\in[k-1]$.
9: Independently sample $\{\omega_h^{k,i}\}_{i\in[N_k]}$ from the posterior $p(\omega\,|\,\{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k)$ defined in (3.3), e.g., using the Langevin dynamics in (3.4).
10: End
11: For timestep $h = 1, \ldots, H$ do
12: Generate $\widetilde Q_h^k(s_h^k, a)$ using Algorithm 2 with weights $\{\omega_h^{k,i}\}_{i\in[N_k]}$ and parameter $n_k$, for all $a\in\mathcal A$.
13: Take action $a_h^k \leftarrow \mathrm{argmax}_{a\in\mathcal A}\widetilde Q_h^k(s_h^k, a)$, and observe $r_h^k$ and $s_{h+1}^k$.
14: Update the data buffer $\mathcal D_h^{k+1} \leftarrow \mathcal D_h^k \cup \{(s_h^k, a_h^k, r_h^k, s_{h+1}^k)\}$.
15: End
16: End

The following Algorithm 2 serves as the critical building block of BooVI for enforcing the optimism of the estimated action-value function. At episode $k$ in BooVI, for any $(s,a)\in\mathcal S\times\mathcal A$, Algorithm 2 ranks the estimated action-value functions corresponding to the posterior samples obtained in Line 9 of Algorithm 1 in ascending order. Then, in order to enforce the optimism of the estimated action-value function, Algorithm 2 resamples the $n_k$-th ranked value from the ordered estimated action-value functions in the manner of bootstrapping. Finally, to ensure sufficient optimism of the obtained bootstrapped action-value function, we extrapolate with a tunable parameter $\xi > 1$ (a minimal code sketch of this step is given at the end of this section).

Algorithm 2 Bootstrapping the Action-Value Function
1: Require: Action-value function parameterization $Q(\cdot,\cdot;\cdot) : \mathcal S\times\mathcal A\times\mathbb R^d\to\mathbb R$, posterior samples $\{\omega_h^{k,i}\}_{i\in[N_k]}$, integer $n_k\in[N_k]$, extrapolation parameter $\xi > 1$, and state-action pair $(s,a)$.
2: Compute $Q_h^{k,i}(s,a) \leftarrow Q(s,a;\omega_h^{k,i})$ for all $i\in[N_k]$.
3: Set $\widehat Q_h^k(s,a) \leftarrow (1/N_k)\sum_{i=1}^{N_k} Q_h^{k,i}(s,a)$.
4: Rank $\{Q_h^{k,i}(s,a)\}_{i\in[N_k]}$ in ascending order to obtain $\{Q_h^{k,(i)}(s,a)\}_{i\in[N_k]}$.
5: Output: $\widetilde Q_h^k(s,a) \leftarrow \min\bigl\{(1-\xi)\,\widehat Q_h^k(s,a) + \xi\, Q_h^{k,(n_k)}(s,a),\ H-h+1\bigr\}^+$.

Remark 3.1 (Sample Efficiency and Computational Efficiency). Here we clarify the difference between sample efficiency and computational efficiency. Throughout this paper, sample efficiency refers to the learning efficiency with respect to the total number $T = KH$ of interactions with the environment. Such interactions can often be very resource-demanding. Thus, this paper seeks to provide an algorithm that achieves sample efficiency with respect to the number of such interactions. Although posterior weights are sampled in Algorithm 1, the efficiency of this posterior sampling process is categorized as computational efficiency, since it is purely based on the current data buffer $\mathcal D_h^k$ and does not require any extra interaction with the environment.

Remark 3.2 (Additional Computational Cost). Here we discuss the computational cost of BooVI compared to LSVI. (i) Posterior sampling vs. least-squares minimization: Consider the Langevin dynamics in (3.4). Since posterior sampling replaces, and admits similar steps to, least-squares minimization, the computational cost should not differ significantly between LSVI and BooVI. (ii) Computing the bootstrapped Q-function: To compute the bootstrapped Q-function, at each state-action pair the extra sorting takes $O(N_k\log N_k)$ time, which makes the action selection procedure $O(N_k\log N_k)$ times slower. In addition, keeping $N_k$ posterior weights takes $O(dN_k)$ memory. We note that, although the requirement on $N_k$ in our subsequent analysis is large, we should be able to use a much smaller $N_k$ in practice due to the pessimistic nature of the analysis. See Appendix F for illustrative experimental results.

It is worth mentioning that BooVI is applicable to any specific parameterization $Q(\cdot,\cdot;\cdot) : \mathcal S\times\mathcal A\times\mathbb R^d\to\mathbb R$ of the action-value function and is fully data-driven. We also note here that the design of BooVI does not rely on the Gaussian prior/posterior or the Langevin dynamics sampling technique.
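As referenced above, the following is a minimal sketch of the bootstrapping step in Algorithm 2, assuming the linear parameterization $Q(s,a;\omega) = \langle\phi(s,a),\omega\rangle$; the function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def bootstrapped_q(phi_sa, omegas, n_k, xi, h, H):
    """Algorithm 2 sketch: bootstrapped action-value at one state-action pair.

    phi_sa : (d,) feature vector phi(s, a), assuming Q(s, a; omega) = <phi(s, a), omega>.
    omegas : (N_k, d) posterior weight samples {omega_h^{k,i}}_{i in [N_k]}.
    n_k    : rank index drawn uniformly from {ceil(alpha N_k), ..., floor(beta N_k)}.
    xi     : extrapolation parameter (> 1).
    """
    q_values = omegas @ phi_sa                 # Q_h^{k,i}(s, a) for i = 1, ..., N_k
    q_mean = q_values.mean()                   # hat Q_h^k(s, a)
    q_sorted = np.sort(q_values)               # ascending order statistics Q_h^{k,(i)}(s, a)
    q_ranked = q_sorted[n_k - 1]               # the n_k-th ranked value Q_h^{k,(n_k)}(s, a)
    q_tilde = (1.0 - xi) * q_mean + xi * q_ranked   # extrapolate beyond the ranked value
    return float(np.clip(q_tilde, 0.0, H - h + 1))  # min{., H - h + 1}^+ truncation

# Illustrative use at action-selection time (Line 13 of Algorithm 1):
# a_k = max(range(num_actions),
#           key=lambda a: bootstrapped_q(features[a], omegas, n_k, xi, h, H))
```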
4 Main Results

While BooVI (Algorithm 1) is generally applicable to different parameterizations of the action-value function, we establish its regret for a class of linear MDPs.

Linear Markov Decision Process. We consider a class of MDPs where the transition kernels and the reward functions are linear in the same feature mapping. Specifically, we have the following definition.

Definition 4.1 (Linear MDP). An MDP $(\mathcal S, \mathcal A, H, P, r)$ is called a linear MDP with feature mapping $\phi : \mathcal S\times\mathcal A\to\mathbb R^d$ if, for any $h\in[H]$, there exist $d$ unknown signed measures $\mu_h = (\mu_h^{(1)}, \ldots, \mu_h^{(d)})$ over $\mathcal S$ and an unknown vector $\theta_h\in\mathbb R^d$ such that for any $(s,a)\in\mathcal S\times\mathcal A$, we have
$$P_h(\cdot\,|\,s,a) = \langle\phi(s,a), \mu_h(\cdot)\rangle, \qquad r_h(s,a) = \langle\phi(s,a), \theta_h\rangle. \tag{4.1}$$

See [51, 52, 24] for examples of such a class of linear MDPs. Specifically, examples include tabular MDPs, where $d = |\mathcal S||\mathcal A|$ and the feature mapping is the canonical basis $\phi(s,a) = e_{(s,a)}$, and the simplex feature space, where the feature set $\{\phi(s,a) \,|\, (s,a)\in\mathcal S\times\mathcal A\}$ is a subset of the $d$-dimensional simplex. See also [14, 44, 25] for related discussions on such a linear representation. For notational simplicity, we write $\phi_h^k = \phi(s_h^k, a_h^k)$ throughout the rest of this paper. Based on Definition 4.1, without loss of generality, we make the following assumption.

Assumption 4.2. The MDP $(\mathcal S, \mathcal A, H, P, r)$ is a linear MDP with $\|\phi(s,a)\| \le 1$ for all $(s,a)\in\mathcal S\times\mathcal A$ and $\max\{\|\mu_h(\mathcal S)\|, \|\theta_h\|\} \le \sqrt d$ for all $h\in[H]$.

To motivate the linear parameterization of the action-value function in the subsequent section, we have the following proposition.

Proposition 4.3 (Linear Action-Value Function, Proposition 2.3 in [24]). For a linear MDP, for any policy $\pi$ and any $h\in[H]$, there exists $\omega_h^\pi\in\mathbb R^d$ such that for all $(s,a)\in\mathcal S\times\mathcal A$, we have $Q_h^\pi(s,a) = \langle\phi(s,a), \omega_h^\pi\rangle$.

Regret of BooVI for Linear MDPs. For a linear MDP, by Proposition 4.3, the parameterization $Q(\cdot,\cdot;\cdot) : \mathcal S\times\mathcal A\times\mathbb R^d\to\mathbb R$ in BooVI should take the form $Q(s,a;\omega) = \langle\phi(s,a), \omega\rangle$, where $\omega\in\mathbb R^d$. Furthermore, when the prior and the likelihood are Gaussian, the posterior of $\omega$ is a Gaussian distribution whose mean is the weight update by LSVI in (3.1) with the choice $\lambda = \sigma^2$. For any bootstrapping ratio $q\in(0,1)$, we write $C_q = \Phi^{-1}(q)$ throughout the rest of this paper, where $\Phi(\cdot)$ is the cumulative distribution function of the standard Gaussian. Then we have the following upper bound on the regret of BooVI.

Theorem 4.4. Suppose that Assumption 4.2 holds and that $d\ge 2$. We set $\sigma = 1$ and the extrapolation parameter $\xi = dH$, and fix a failure probability $p\in(0,1]$. Then there exists a sequence of posterior sample sizes $\{N_k\}_{k\in[K]}$, and there exist absolute constants $c_\beta > c_\alpha > 0$, such that if $N_k = O(d^6T^4k/p^4)$ for all $k\in[K]$, $C_\alpha = c_\alpha\sqrt\iota$, and $C_\beta = c_\beta\sqrt\iota$, then BooVI (Algorithm 1) with the Gaussian posterior in (3.5) satisfies
$$\mathrm{Regret}(K) = \widetilde O\bigl(\sqrt{d^3H^3T}\bigr)$$
with probability at least $1 - p$, where $T = KH$ and $\iota = \log(3dT/p)$.

Proof. See Section 5 for a proof sketch.
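Before turning to the proof sketch, we note that in the linear-Gaussian case above, standard conjugacy gives the posterior in closed form (cf. Lemma E.1): with $\Lambda_h^k = \sum_{\tau\in[k-1]}\phi_h^\tau(\phi_h^\tau)^\top + \sigma^2 I_d$ as in (5.5) below, the posterior is $N\bigl(\widehat\omega_h^k, \sigma^2(\Lambda_h^k)^{-1}\bigr)$ with mean $\widehat\omega_h^k = (\Lambda_h^k)^{-1}\sum_{\tau\in[k-1]} y_h^\tau\phi_h^\tau$, the ridge (LSVI) solution with $\lambda = \sigma^2$. The sketch below, with illustrative names, draws exact posterior samples this way instead of running the Langevin dynamics; this shortcut is only available in the linear setting.

```python
import numpy as np

def gaussian_posterior_samples(Phi, y, sigma=1.0, n_samples=10, rng=None):
    """Exact posterior sampling in the linear-Gaussian case (cf. (3.5) and Lemma E.1).

    Phi : (n, d) stacked features phi_h^tau; y : (n,) targets y_h^tau.
    Under the prior N(0, I_d) and likelihood variance sigma^2, the posterior is
    N(omega_hat, sigma^2 Lambda^{-1}) with Lambda = Phi^T Phi + sigma^2 I_d and
    omega_hat = Lambda^{-1} Phi^T y (the LSVI/ridge solution with lambda = sigma^2).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = Phi.shape
    Lambda = Phi.T @ Phi + sigma**2 * np.eye(d)
    omega_hat = np.linalg.solve(Lambda, Phi.T @ y)
    cov = sigma**2 * np.linalg.inv(Lambda)
    return rng.multivariate_normal(omega_hat, cov, size=n_samples)  # (n_samples, d)
```

In practice one would reuse a Cholesky factor of $\Lambda_h^k$ across the $N_k$ draws; the direct inversion here is only for clarity.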
5 Proof Sketch

Notation. At episode $k$, given the number of posterior sample weights $N_k$ and the index $n_k$ sampled in Line 5 of Algorithm 1, we define $C_k = \Phi^{-1}(n_k/N_k)$. At timestep $h$ of episode $k$, further given the posterior weights $\{\omega_h^{k,i}\}_{i\in[N_k]}$ corresponding to the bootstrapped action-value function $\widetilde Q_h^k$ generated by Algorithm 2, we define the bootstrapped value function as
$$\widetilde V_h^k(s) = \max_{a\in\mathcal A}\widetilde Q_h^k(s,a). \tag{5.1}$$
Next, we define the mean bootstrapped action-value function as
$$\bar Q_h^k(s,a) = \min\Bigl\{\mathbb E_\omega\bigl[(1-\xi)\,\widehat Q_h^k(s,a) + \xi\,Q_h^{k,(n_k)}(s,a)\bigr],\ H-h+1\Bigr\}^+, \tag{5.2}$$
where $\mathbb E_\omega[\,\cdot\,]$ is taken over $\omega$ with respect to the posterior $p(\omega\,|\,\{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k)$ defined in (3.3). Correspondingly, we define the mean bootstrapped value function as
$$\bar V_h^k(s) = \max_{a\in\mathcal A}\bar Q_h^k(s,a). \tag{5.3}$$
Also, we denote by
$$\bar\omega_h^k = (\Lambda_h^k)^{-1}\sum_{\tau\in[k-1]} y_h^\tau\,\phi_h^\tau \tag{5.4}$$
the mean of the posterior distribution $p(\omega\,|\,\{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k)$, where $y_h^\tau$ is defined in (3.2). For a fixed failure probability $p\in(0,1]$, we write $\iota = \log(3dT/p)$, where $T = KH$. Finally, we define the matrix $\Lambda_h^k\in\mathbb R^{d\times d}$ by
$$\Lambda_h^k = \sum_{\tau\in[k-1]}\phi_h^\tau(\phi_h^\tau)^\top + \sigma^2 I_d. \tag{5.5}$$

Concentration Events. Before proving Theorem 4.4, we first present two lemmas, each characterizing an event that is invoked throughout the remaining proofs. The first lemma characterizes the concentration behavior of the bootstrapped action-value function $\widetilde Q_h^k$.

Lemma 5.1. Let $\sigma = 1$, $\xi = dH$, and $C_\beta = c_\beta\sqrt\iota$ for some $c_\beta > 0$. For a fixed failure probability $p\in(0,1]$, we define $\mathcal E$ as the event that the corresponding concentration condition on $\widetilde Q_h^k$ holds for all $(s,a,k,h)\in\mathcal S\times\mathcal A\times[K]\times[H]$. Then there exists a sequence of posterior sample sizes $\{N_k\}_{k\in[K]}$ satisfying $N_k = O(d^6T^4k/p^4)$ for all $k\in[K]$ such that $P(\mathcal E) \ge 1 - p/3$.

Proof. See Appendix C for a detailed proof.

The second lemma characterizes the concentration behavior of the mean bootstrapped value function $\bar V_h^k$.

Lemma 5.2. Let $\sigma = 1$, $\xi = dH$, and $C_\beta = c_\beta\sqrt\iota$ for some constant $c_\beta > 0$. For a fixed failure probability $p\in(0,1]$, we define $\mathcal E'$ as the event that the corresponding concentration condition on $\bar V_h^k$ holds for all $(k,h)\in[K]\times[H]$, where $\chi = \log[3(1+c_\beta)dT/p]$. Then we have $P(\mathcal E') \ge 1 - p/3$.

Proof. See Appendix D for a detailed proof.

Model Estimation Error. With the events $\mathcal E$ and $\mathcal E'$ in hand, we proceed to characterize the model estimation error $\Delta_h^k : \mathcal S\times\mathcal A\to\mathbb R$ for all $(k,h)\in[K]\times[H]$, defined as
$$\Delta_h^k(s,a) = \bar Q_h^k(s,a) - \bigl(r_h(s,a) + [P_h\bar V_{h+1}^k](s,a)\bigr), \tag{5.6}$$
which can be understood as the estimation error of the model at timestep $h$ of episode $k$ induced by the mean bootstrapped value function $\bar V_{h+1}^k$. We have the following lemma characterizing the model estimation error $\Delta_h^k$ defined in (5.6).

Lemma 5.3 (Model Estimation Error). Let $d\ge 2$, $\xi = dH$, $\sigma = 1$, and $p\in(0,1]$. Then there exists a sequence of posterior sample sizes $\{N_k\}_{k\in[K]}$, and there exist absolute constants $c_\beta > c_\alpha > 0$ such that, if $N_k = O(d^6T^4k/p^4)$ for all $k\in[K]$, $C_\alpha = c_\alpha\sqrt\iota$, and $C_\beta = c_\beta\sqrt\iota$, then for the model estimation error $\Delta_h^k$ defined in (5.6) we have, under the events $\mathcal E$ and $\mathcal E'$, that
$$0 \le \Delta_h^k(s,a) \le (c_\alpha + c_\beta)\, dH\sqrt\iota\,\sqrt{\phi(s,a)^\top(\Lambda_h^k)^{-1}\phi(s,a)}$$
for all $(s,a,k,h)\in\mathcal S\times\mathcal A\times[K]\times[H]$.

Proof. Under Assumption 4.2, using (4.1) we have $[P_h\bar V_{h+1}^k](s,a) = \int_{\mathcal S}\bar V_{h+1}^k(s')\,\langle\phi(s,a), \mu_h(ds')\rangle$. Then, with a slight abuse of notation, we denote $\bar V_{h+1}^k(\mu_h) = \int_{\mathcal S}\bar V_{h+1}^k(s)\,\mu_h(ds)\in\mathbb R^d$ and write
$$\theta_h + \bar V_{h+1}^k(\mu_h) = (\Lambda_h^k)^{-1}\Lambda_h^k\bigl(\theta_h + \bar V_{h+1}^k(\mu_h)\bigr) = (\Lambda_h^k)^{-1}\Bigl(\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(r_h^\tau + [P_h\bar V_{h+1}^k](s_h^\tau, a_h^\tau)\bigr) + \sigma^2\bigl(\theta_h + \bar V_{h+1}^k(\mu_h)\bigr)\Bigr),$$
where the second equality follows from (5.5). Thus, we have the decomposition
$$\langle\phi, \bar\omega_h^k\rangle - \bigl(r_h + [P_h\bar V_{h+1}^k]\bigr) = u_1 + u_2,$$
where $\bar\omega_h^k$ is defined in (5.4), and
$$u_1 = \Bigl\langle\phi,\ (\Lambda_h^k)^{-1}\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\widetilde V_{h+1}^k(s_{h+1}^\tau) - [P_h\bar V_{h+1}^k](s_h^\tau, a_h^\tau)\bigr)\Bigr\rangle, \qquad u_2 = -\sigma^2\bigl\langle\phi,\ (\Lambda_h^k)^{-1}\bigl(\theta_h + \bar V_{h+1}^k(\mu_h)\bigr)\bigr\rangle.$$
In the sequel, we upper bound $u_1$ and $u_2$ separately.

Upper bounding $u_1$: First, we further decompose $u_1$ as
$$u_1 = \Bigl\langle\phi, (\Lambda_h^k)^{-1}\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\widetilde V_{h+1}^k(s_{h+1}^\tau) - \bar V_{h+1}^k(s_{h+1}^\tau)\bigr)\Bigr\rangle + \Bigl\langle\phi, (\Lambda_h^k)^{-1}\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\bar V_{h+1}^k(s_{h+1}^\tau) - [P_h\bar V_{h+1}^k](s_h^\tau, a_h^\tau)\bigr)\Bigr\rangle.$$
Applying the Cauchy-Schwarz inequality, we obtain
$$|u_1| \le \sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi}\,\Bigl(\Bigl\|\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\widetilde V_{h+1}^k(s_{h+1}^\tau) - \bar V_{h+1}^k(s_{h+1}^\tau)\bigr)\Bigr\|_{(\Lambda_h^k)^{-1}} + \Bigl\|\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\bar V_{h+1}^k(s_{h+1}^\tau) - [P_h\bar V_{h+1}^k](s_h^\tau, a_h^\tau)\bigr)\Bigr\|_{(\Lambda_h^k)^{-1}}\Bigr), \tag{5.7}$$
where for a positive definite matrix $A$ we write $\|x\|_A = \sqrt{x^\top A x}$. Now we proceed to upper bound the two terms on the right-hand side of (5.7).
For the first term, since by Lemma 5.1 we have under event $\mathcal E$ that $|\widetilde V_{h+1}^k(s_{h+1}^\tau) - \bar V_{h+1}^k(s_{h+1}^\tau)| \le \sqrt d/k$, we have
$$\Bigl\|\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\widetilde V_{h+1}^k(s_{h+1}^\tau) - \bar V_{h+1}^k(s_{h+1}^\tau)\bigr)\Bigr\|_{(\Lambda_h^k)^{-1}} \le \frac{\sqrt d}{k}\sum_{\tau\in[k-1]}\bigl\|\phi_h^\tau\bigr\|_{(\Lambda_h^k)^{-1}}. \tag{5.8}$$
By the Cauchy-Schwarz inequality and Lemma D.1 in [24] (see Appendix E), we have $\sum_{\tau\in[k-1]}\|\phi_h^\tau\|_{(\Lambda_h^k)^{-1}} \le \sqrt{(k-1)\sum_{\tau\in[k-1]}(\phi_h^\tau)^\top(\Lambda_h^k)^{-1}\phi_h^\tau} \le \sqrt{(k-1)d}$, taking which into (5.8) gives
$$\Bigl\|\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\widetilde V_{h+1}^k(s_{h+1}^\tau) - \bar V_{h+1}^k(s_{h+1}^\tau)\bigr)\Bigr\|_{(\Lambda_h^k)^{-1}} \le d. \tag{5.9}$$
Taking (5.9) into (5.7) and applying Lemma 5.2 to the second term on the right-hand side of (5.7), we have under the events $\mathcal E$ and $\mathcal E'$ that
$$|u_1| \le \bigl(dH\sqrt\iota + C\,dH\sqrt\chi\bigr)\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi} \tag{5.10}$$
for an absolute constant $C > 0$.

Upper bounding $u_2$: By (4.1), we have $\|\bar V_{h+1}^k(\mu_h)\| \le (H-1)\,\|\mu_h(\mathcal S)\|$. Then, by the Cauchy-Schwarz inequality, we further obtain
$$|u_2| \le \sigma^2\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi}\,\bigl\|\theta_h + \bar V_{h+1}^k(\mu_h)\bigr\|_{(\Lambda_h^k)^{-1}} \le 2H\sqrt d\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi}, \tag{5.11}$$
where the second inequality follows from $\Lambda_h^k \succeq \sigma^2 I_d = I_d$, and from $\max\{\|\mu_h(\mathcal S)\|, \|\theta_h\|\} \le \sqrt d$ in Assumption 4.2 together with the fact that $\bar V_{h+1}^k(s) \le H-h \le H-1$ for all $s\in\mathcal S$.

Combining (5.10) and (5.11), we obtain under the events $\mathcal E$ and $\mathcal E'$ that
$$\bigl|\langle\phi, \bar\omega_h^k\rangle - \bigl(r_h + [P_h\bar V_{h+1}^k]\bigr)\bigr| \le \bigl(dH\sqrt\iota + C\,dH\sqrt\chi + 2H\sqrt d\bigr)\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi} = C_0\,dH\sqrt\iota\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi}, \tag{5.12}$$
where $C_0 > 0$ is a constant. Next, we need to find an absolute constant $c_\beta > 0$ such that $C_0\sqrt\iota = (c_\beta/\sqrt d + 5)\sqrt\iota + C\sqrt{\iota + \log(1+c_\beta)} < c_\beta\sqrt\iota$. Since $d\ge 2$ and $\iota\ge\log 2$, it suffices to pick a $c_\beta > 0$ such that
$$C\sqrt{\log 2 + \log(1+c_\beta)} < \bigl((1 - 1/\sqrt 2)\,c_\beta - 5\bigr)\sqrt{\log 2}, \tag{5.13}$$
which must exist, as the left-hand side grows logarithmically in $c_\beta$ while the right-hand side grows linearly in $c_\beta$. For a $c_\beta > 0$ satisfying (5.13), we pick any $c_\alpha > 0$ such that $C_0 \le c_\alpha < c_\beta$ and let $C_\alpha = c_\alpha\sqrt\iota$. By (5.12) and $r_h + [P_h\bar V_{h+1}^k] \le H-h+1$, we obtain under the events $\mathcal E$ and $\mathcal E'$ that
$$\Delta_h^k(s,a) \ge \min\Bigl\{(c_\alpha - C_0)\,dH\sqrt\iota\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi},\ H-h+1 - \bigl(r_h + [P_h\bar V_{h+1}^k]\bigr)\Bigr\} \ge 0.$$
On the other hand, by (5.12), we also have under the events $\mathcal E$ and $\mathcal E'$ that
$$\Delta_h^k(s,a) \le \langle\phi, \bar\omega_h^k\rangle - \bigl(r_h + [P_h\bar V_{h+1}^k]\bigr) + C_\beta\,\xi\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi} \le (c_\alpha + c_\beta)\,dH\sqrt\iota\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi}.$$
Therefore, we finish the proof of Lemma 5.3.

As consequences of Lemma 5.3, we have the following two lemmas, characterizing the optimism and the cumulative estimation error of the mean bootstrapped value function $\bar V_h^k$, respectively.

Lemma 5.4 (Optimistic Random Value Function). Let $\sigma = 1$ and $\xi = dH$. Then there exists a sequence of posterior sample sizes $\{N_k\}_{k\in[K]}$, and there exist absolute constants $c_\beta > c_\alpha > 0$, such that if $N_k = O(d^6T^4k/p^4)$ for all $k\in[K]$, $C_\alpha = c_\alpha\sqrt\iota$, and $C_\beta = c_\beta\sqrt\iota$, then under the events $\mathcal E$ and $\mathcal E'$ we have $\bar V_1^k(s_1^k) \ge V_1^*(s_1^k)$ for all $k\in[K]$.

Proof. See Appendix D for a detailed proof.

Lemma 5.5 (Cumulative Estimation Error). Let $\sigma = 1$, $\xi = dH$, and $p\in(0,1]$. There exists a sequence of posterior sample sizes $\{N_k\}_{k\in[K]}$, and there exist absolute constants $c_\beta > c_\alpha > 0$, such that if $N_k = O(d^6T^4k/p^4)$ for all $k\in[K]$, $C_\alpha = c_\alpha\sqrt\iota$, and $C_\beta = c_\beta\sqrt\iota$, then under the events $\mathcal E$ and $\mathcal E'$ we have
$$\sum_{k=1}^K\bigl(\bar V_1^k(s_1^k) - V_1^{\pi^k}(s_1^k)\bigr) \le \sqrt{18TH^2\log(3/p)} + 4\bigl(c_\beta/\sqrt d + 5\bigr)\sqrt{dH^2T\iota} + (c_\alpha + c_\beta)\,dH^2\sqrt{2dK}\,\iota$$
with probability at least $1 - p/3$.

Proof. See Appendix D for a detailed proof.

Finally, the regret bound of BooVI for the class of linear MDPs can be established as a consequence of the optimism (Lemma 5.4) and the upper bound on the cumulative estimation error (Lemma 5.5) of the bootstrapped value function.

Proof of Theorem 4.4. First, recall that $\bar V_1^k(s)$ is defined in (5.3). We have the regret decomposition
$$\mathrm{Regret}(K) = \sum_{k=1}^K\bigl[V_1^*(s_1^k) - V_1^{\pi^k}(s_1^k)\bigr] = \sum_{k=1}^K\bigl[V_1^*(s_1^k) - \bar V_1^k(s_1^k)\bigr] + \sum_{k=1}^K\bigl[\bar V_1^k(s_1^k) - V_1^{\pi^k}(s_1^k)\bigr]. \tag{5.14}$$
Applying Lemmas 5.4 and 5.5 to (5.14), we obtain under the events $\mathcal E$ and $\mathcal E'$ that
$$\mathrm{Regret}(K) \le 0 + \sqrt{18TH^2\log(3/p)} + 4\bigl(c_\beta/\sqrt d + 5\bigr)\sqrt{dH^2T\iota} + (c_\alpha + c_\beta)\,dH^2\sqrt{2dK}\,\iota \tag{5.15}$$
with probability at least $1 - p/3$. Finally, by Lemmas 5.1 and 5.2, the events $\mathcal E$ and $\mathcal E'$ hold simultaneously with probability at least $1 - p/3 - p/3 = 1 - 2p/3$. Thus, (5.15) holds with probability at least $1 - p$, which concludes the proof.
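To see where the $\sqrt{d^3H^3T}$ rate in Theorem 4.4 comes from, the following standard elliptical-potential accounting, sketched here under Assumption 4.2 with $\sigma = 1$ and with the exact constants deferred to Lemma 5.5 and Appendix D, bounds the dominant term arising from summing the bound of Lemma 5.3 over all timesteps and episodes:
$$\sum_{h=1}^H\sum_{k=1}^K (c_\alpha + c_\beta)\,dH\sqrt\iota\,\sqrt{(\phi_h^k)^\top(\Lambda_h^k)^{-1}\phi_h^k}
\le (c_\alpha + c_\beta)\,dH\sqrt\iota\,\sum_{h=1}^H\sqrt{K\sum_{k=1}^K(\phi_h^k)^\top(\Lambda_h^k)^{-1}\phi_h^k}$$
$$\le (c_\alpha + c_\beta)\,dH\sqrt\iota\cdot H\sqrt{2dK\iota}
= \sqrt 2\,(c_\alpha + c_\beta)\,\sqrt{d^3H^3T}\,\iota,$$
where the first step is the Cauchy-Schwarz inequality over $k$, the second uses the standard elliptical-potential bound $\sum_{k=1}^K(\phi_h^k)^\top(\Lambda_h^k)^{-1}\phi_h^k \le 2d\log(1+K/d) \le 2d\iota$ (see, e.g., [1]), and the last uses $T = KH$. This is of order $O(\sqrt{d^3H^3T}\,\iota)$, matching the $\widetilde O(\sqrt{d^3H^3T})$ regret of Theorem 4.4.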
6 Conclusion

The cost of collecting data via online experiments is often prohibitive compared to offline procedures such as posterior sampling. Thus, designing online learning algorithms that provably explore the environment in a data-driven manner is essential. Moreover, applicability to general environments is critical, since reinforcement learning tasks are becoming increasingly challenging and complex. We aspire to motivate the design of novel reinforcement learning algorithms that utilize cheaper data to boost online sample efficiency. Our algorithm and analysis serve as a step towards developing generally applicable and provably sample-efficient reinforcement learning. While this paper focuses mainly on the algorithmic and theoretical aspects of the bootstrapping idea in RL, it would be interesting to examine the empirical strength of BooVI in more challenging RL environments. We leave this to future work.

Acknowledgement

Zhaoran Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support. Zhuoran Yang acknowledges Simons Institute (Theory of Reinforcement Learning).

References

[1] ABBASI-YADKORI, Y., PÁL, D. and SZEPESVÁRI, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems.
[2] AGARWAL, A., HSU, D., KALE, S., LANGFORD, J., LI, L. and SCHAPIRE, R. (2014). Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning.
[3] AGRAWAL, S. and JIA, R. (2017). Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. In Advances in Neural Information Processing Systems.
[4] AUER, P. and ORTNER, R. (2006). Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems.
[5] AZAR, M. G., OSBAND, I. and MUNOS, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning.
[6] BERNARDO, J., BERGER, J., DAWID, A., SMITH, A. ET AL. (1998). Regression and classification using Gaussian process priors. Bayesian Statistics 6 475.
[7] BROCKMAN, G., CHEUNG, V., PETTERSSON, L., SCHNEIDER, J., SCHULMAN, J., TANG, J. and ZAREMBA, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
[8] CAI, Q., YANG, Z., JIN, C. and WANG, Z. (2019). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.
[9] CHU, W., LI, L., REYZIN, L. and SCHAPIRE, R. (2011). Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics.
[10] DANI, V., HAYES, T. P. and KAKADE, S. M. (2008). Stochastic linear optimization under bandit feedback. In Conference on Learning Theory.
[11] DANN, C. and BRUNSKILL, E. (2015). Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems.
[12] DANN, C., LATTIMORE, T. and BRUNSKILL, E. (2017). Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems.
[13] DONG, K., PENG, J., WANG, Y. and ZHOU, Y. (2019). √n-regret for learning in Markov decision processes with function approximation and low Bellman rank. arXiv preprint arXiv:1909.02506.
[14] DU, S. S., KAKADE, S. M., WANG, R. and YANG, L. F. (2019). Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016.
[15] DU, S. S., LUO, Y., WANG, R. and ZHANG, H. (2019). Provably efficient Q-learning with function approximation via distribution shift error checking oracle. arXiv preprint arXiv:1906.06321.
[16] DUAN, Y., CHEN, X., HOUTHOOFT, R., SCHULMAN, J. and ABBEEL, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.
[17] ECOFFET, A., HUIZINGA, J., LEHMAN, J., STANLEY, K. O. and CLUNE, J. (2019). Go-Explore: A new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.
[18] FOSTER, D. J., AGARWAL, A., DUDÍK, M., LUO, H. and SCHAPIRE, R. E. (2018). Practical contextual bandits with regression oracles. arXiv preprint arXiv:1803.01088.
[19] HAARNOJA, T., ZHOU, A., ABBEEL, P. and LEVINE, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
[20] HSU, D., KAKADE, S. and ZHANG, T. (2012). A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability 17.
[21] JAKSCH, T., ORTNER, R. and AUER, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11 1563-1600.
[22] JIANG, N., KRISHNAMURTHY, A., AGARWAL, A., LANGFORD, J. and SCHAPIRE, R. E. (2017). Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning.
[23] JIN, C., ALLEN-ZHU, Z., BUBECK, S. and JORDAN, M. I. (2018). Is Q-learning provably efficient? In Advances in Neural Information Processing Systems.
[24] JIN, C., YANG, Z., WANG, Z. and JORDAN, M. I. (2019). Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388.
[25] LATTIMORE, T. and SZEPESVÁRI, C. (2019). Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676.
[26] MAY, B. C., KORDA, N., LEE, A. and LESLIE, D. S. (2012). Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research 13 2069-2106.
[27] OSBAND, I., ASLANIDES, J. and CASSIRER, A. (2018). Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems.
[28] OSBAND, I., BLUNDELL, C., PRITZEL, A. and VAN ROY, B. (2016). Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems.
[29] OSBAND, I., RUSSO, D. and VAN ROY, B. (2013). (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems.
[30] OSBAND, I., RUSSO, D., WEN, Z. and VAN ROY, B. (2017). Deep exploration via randomized value functions. Journal of Machine Learning Research.
[31] OSBAND, I. and VAN ROY, B. (2015). Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300.
[32] OSBAND, I. and VAN ROY, B. (2016). On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732.
[33] OSBAND, I. and VAN ROY, B. (2017). Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning.
[34] OSBAND, I., VAN ROY, B. and WEN, Z. (2014). Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635.
[35] RAGINSKY, M., RAKHLIN, A. and TELGARSKY, M. (2017). Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. arXiv preprint arXiv:1702.03849.
[36] RUSSO, D. (2019). Worst-case regret bounds for exploration via randomized value functions. In Advances in Neural Information Processing Systems.
[37] RUSSO, D. J., VAN ROY, B., KAZEROUNI, A., OSBAND, I. and WEN, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends in Machine Learning 11 1-96.
[38] SILVER, D., HUANG, A., MADDISON, C. J., GUEZ, A., SIFRE, L., VAN DEN DRIESSCHE, G., SCHRITTWIESER, J., ANTONOGLOU, I., PANNEERSHELVAM, V., LANCTOT, M. ET AL. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529 484.
[39] SILVER, D., LEVER, G., HEESS, N., DEGRIS, T., WIERSTRA, D. and RIEDMILLER, M. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning.
[40] SILVER, D., SCHRITTWIESER, J., SIMONYAN, K., ANTONOGLOU, I., HUANG, A., GUEZ, A., HUBERT, T., BAKER, L., LAI, M., BOLTON, A. ET AL. (2017). Mastering the game of Go without human knowledge. Nature 550 354.
[41] STREHL, A. L., LI, L., WIEWIORA, E., LANGFORD, J. and LITTMAN, M. L. (2006). PAC model-free reinforcement learning. In International Conference on Machine Learning.
[42] TAÏGA, A. A., FEDUS, W., MACHADO, M. C., COURVILLE, A. and BELLEMARE, M. G. (2019). Benchmarking bonus-based exploration methods on the Arcade Learning Environment. arXiv preprint arXiv:1908.02388.
[43] THOMPSON, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 285-294.
[44] VAN ROY, B. and DONG, S. (2019). Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910.
[45] WANG, W. Y., LI, J. and HE, X. (2018). Deep reinforcement learning for NLP. In Association for Computational Linguistics.
[46] WANG, Y., WANG, R., DU, S. S. and KRISHNAMURTHY, A. (2019). Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136.
[47] WATKINS, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, University of Cambridge.
[48] WELLING, M. and TEH, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning.
[49] WEN, Z. and VAN ROY, B. (2017). Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research 42 762-782.
[50] XIONG, Z., SHEN, R. and DU, S. S. (2021). Randomized exploration is near-optimal for tabular MDP. arXiv preprint arXiv:2102.09703.
[51] YANG, L. and WANG, M. (2019). Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning.
[52] YANG, L. F. and WANG, M. (2019). Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389.
[53] YANG, Z., JIN, C., WANG, Z., WANG, M. and JORDAN, M. (2020). Provably efficient reinforcement learning with kernel and neural function approximations. Advances in Neural Information Processing Systems 33.
[54] ZANETTE, A., BRANDFONBRENER, D., PIROTTA, M. and LAZARIC, A. (2019). Frequentist regret bounds for randomized least-squares value iteration. arXiv preprint arXiv:1911.00567.
[55] ZANETTE, A., LAZARIC, A., KOCHENDERFER, M. and BRUNSKILL, E. (2020). Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning. PMLR.
[56] ZHANG, Y., LIANG, P. and CHARIKAR, M. (2017). A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory. PMLR.
[57] ZHOU, D., GU, Q. and SZEPESVÁRI, C. (2021). Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory. PMLR.
[58] ZHOU, H., CHEN, J., VARSHNEY, L. R. and JAGMOHAN, A. (2020). Nonstationary reinforcement learning with linear function approximation. arXiv preprint arXiv:2010.04244.