# BooVI: Provably Efficient Bootstrapped Value Iteration

Boyi Liu (Northwestern University; boyiliu2018@u.northwestern.edu), Qi Cai (Northwestern University; qicai2022@u.northwestern.edu), Zhuoran Yang (Princeton University; zy6@princeton.edu), Zhaoran Wang (Northwestern University; zhaoranwang@gmail.com)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Despite the tremendous success of reinforcement learning (RL) with function approximation, efficient exploration remains a significant challenge, both practically and theoretically. In particular, existing theoretically grounded RL algorithms based on upper confidence bounds (UCBs), such as optimistic least-squares value iteration (LSVI), are often incompatible with practically powerful function approximators, such as neural networks. In this paper, we develop a variant of bootstrapped LSVI, namely BooVI, which bridges this gap between practice and theory. Practically, BooVI drives exploration through (re)sampling, making it compatible with general function approximators. Theoretically, BooVI inherits the worst-case $\widetilde O(\sqrt{d^3H^3T})$-regret of optimistic LSVI in the episodic linear setting. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps.

1 Introduction

Reinforcement learning (RL) with function approximation demonstrates significant empirical success in a broad range of applications [e.g., 16, 38, 40, 45]. However, computationally and statistically efficient exploration of large and intricate state spaces remains a major barrier. Practically, the lack of temporally extended exploration [31, 28, 30] in existing RL algorithms, e.g., deterministic policy gradient [39] and soft actor-critic [19], hinders them from solving more challenging tasks, e.g., Montezuma's Revenge in Atari games [42, 17], in a sample-efficient manner. Theoretically, it remains unclear how to design provably efficient RL algorithms with finite sample complexities or regrets that allow for practically powerful function approximators, e.g., neural networks.

There exist two principled approaches to efficient exploration in RL, namely optimism in the face of uncertainty [e.g., 4, 41, 21, 11, 12, 5, 23, 24] and posterior sampling [e.g., 29, 34, 33, 30, 27, 36]. The optimism-based approach is often instantiated by incorporating upper confidence bounds (UCBs) into the estimated (action-)value functions as bonuses, which are used to direct exploration. In the tabular setting, the resulting algorithms, e.g., variants of upper confidence bound reinforcement learning (UCRL) [4, 21, 11, 5], are known to attain the optimal worst-case regret [32]. Beyond the tabular setting, it remains unclear how to construct closed-form UCBs in a principled manner for general function approximators, e.g., neural networks. The only exception is the linear setting [51, 52, 24, 8], where optimistic least-squares value iteration (LSVI) [24] is known to attain a near-optimal (with respect to $T = KH$) worst-case regret. However, the closed-form UCB therein is tailored to linear models [1, 9] rather than fully general-purpose.

On the other hand, the posterior-based approach, which originates from Thompson sampling [43, 37], can be instantiated using randomized (action-)value functions [34, 30, 27, 36].
Unlike optimistic LSVI, the resulting algorithm, namely randomized LSVI, straightforwardly allows for general function approximators, as it only requires injecting random noise into the training data of LSVI. Ideally, such injected random noise does not depend on the particular function approximator. In the tabular setting, randomized LSVI is known to attain near-optimal worst-case and Bayesian regrets [30, 36]. Meanwhile, in the linear setting, randomized LSVI was very recently shown to attain a near-optimal worst-case regret [54], which is worse than that of optimistic LSVI only by a factor of $\sqrt{dH}$. Here $d$ is the feature dimension and $H$ is the episode horizon. However, achieving such a regret requires a nontrivial modification of the injected random noise, which is tailored to linear models. Such a specialized modification diminishes the supposed practical advantage of randomized LSVI. To bridge this gap between practice and theory, we aim to answer the following question:

Can we design an RL algorithm that simultaneously achieves the practical advantage of randomized LSVI and the theoretical guarantee of optimistic LSVI?

In this paper, we propose a variant of bootstrapped LSVI, namely BooVI, which combines the advantages of the optimism-based and posterior-based approaches. The key idea of BooVI is to use posterior sampling to implicitly construct an optimistic version of the estimated (action-)value functions in a data-driven manner. Unlike randomized LSVI, which samples from the posterior only once, e.g., by injecting random noise into the training data of LSVI, BooVI samples from the posterior multiple times, e.g., via the Langevin dynamics [48, 35]. Upon evaluating an action at a state, BooVI ranks the values of the randomized (action-)value functions sampled from the posterior in descending order and returns a top-ranked value, which can be shown to be approximately optimistic. Generally speaking, BooVI corresponds to bootstrapping the noise in the least-squares regression problem of LSVI. As a result, it can be viewed as a parametric-bootstrap counterpart of the nonparametric bootstrap technique used in bootstrapped deep Q-networks (DQNs) [31, 30], which demonstrates significant empirical success in terms of exploration. Compared with existing RL algorithms with function approximation, the advantage of BooVI is twofold.

Practically, BooVI bypasses the explicit construction of the closed-form UCB in optimistic LSVI. Like randomized LSVI, BooVI straightforwardly allows for general function approximators, as it only requires injecting random noise into the training process of LSVI, e.g., via the Langevin dynamics.

Theoretically, BooVI inherits the worst-case $\widetilde O(\sqrt{d^3H^3T})$-regret of optimistic LSVI in the linear setting. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. Such a regret is better than the best known worst-case regret of randomized LSVI by a factor of $\sqrt{dH}$. More importantly, unlike the specialized variant of randomized LSVI studied in [54], BooVI can be applied to the linear setting "as is", without tailoring the injected random noise to linear models.

More Related Work: The idea of using posterior sampling to achieve optimism in a data-driven manner has previously been studied in the linear bandit setting [26] and the tabular setting of RL [3].
In fact, our linear setting of RL covers the linear bandit setting as a special case, where the episode horizon $H$ is set to one, and the tabular setting of RL as another special case, where the feature mapping is the canonical basis of the state and action spaces. Although BooVI is practically applicable to general function approximators, our theoretical guarantee on its worst-case regret applies only to the linear setting. Despite recent progress [49, 22, 13, 15], simultaneously achieving provable computational and statistical efficiency in exploration with general function approximators remains challenging. See, e.g., [2, 18] for recent progress in the contextual bandit setting. Finally, we refer readers to [46, 53, 57, 55] for theories on generalized models of linear MDPs, which all share a similar (approximately) linear structure with the class of linear MDPs considered in the analysis of this paper. We expect our regret bound to stay the same for those settings with exactly linear structure, or to admit an extra $O(\zeta T)$ dependence on $T$ for those settings with approximately linear structure, where $\zeta > 0$ describes the level of nonlinearity in the MDP. We also refer readers to [50] for sharper regret bounds in the tabular setting, which are out of the scope of this paper.

2 Background

In this section, we introduce the general problem setting.

Episodic Markov Decision Process. We consider an episodic Markov decision process represented by a tuple $(\mathcal S, \mathcal A, H, P, r)$, where $\mathcal S$ is a compact state space, $\mathcal A$ is a finite action space with cardinality $A$, $H$ is the number of timesteps in each episode, $P = \{P_h\}_{h\in[H]}$ with $P_h : \mathcal S\times\mathcal S\times\mathcal A \to [0,1]$ for all $h\in[H]$ is the set of transition kernels, and $r = \{r_h\}_{h\in[H]}$ with $r_h : \mathcal S\times\mathcal A \to [0,1]$ for all $h\in[H]$ is the set of reward functions. At each timestep $h\in[H]$ of episode $k$, an agent at state $s_h^k\in\mathcal S$ with policy $\pi^k = \{\pi_h^k\}_{h\in[H]}$, where $\pi_h^k : \mathcal A\times\mathcal S\to\mathbb R$ for all $h\in[H]$, interacts with the environment by first taking an action $a_h^k$ with probability $\pi_h^k(a_h^k\,|\,s_h^k)$ and then receiving the corresponding reward $r_h^k$.

We evaluate the performance of a policy $\pi = \{\pi_h\}_{h\in[H]}$, starting from timestep $h$ and state-action pair $(s,a)$, using its action-value function $Q_h^\pi : \mathcal S\times\mathcal A\to\mathbb R$, which is defined as
$$Q_h^\pi(s,a) = \mathbb E_\pi\Bigl[\sum_{h'=h}^H r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s,\ a_h = a\Bigr] \quad\text{for all } h\in[H].$$
Correspondingly, the value function $V_h^\pi : \mathcal S\to\mathbb R$ of a policy $\pi$ is defined as
$$V_h^\pi(s) = \mathbb E_\pi\Bigl[\sum_{h'=h}^H r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Bigr] \quad\text{for all } h\in[H].$$
Here the expectation $\mathbb E_\pi[\,\cdot\,]$ is taken over the trajectory generated by $\pi$. Also, we let $Q_{H+1}^\pi \equiv 0$ and thus $V_{H+1}^\pi \equiv 0$. Furthermore, we denote by $V_h^*(s)$ the value function corresponding to the optimal policy $\pi^*$. Finally, for notational simplicity, we define $[P_h V](s,a) = \mathbb E_{s'\sim P_h(\cdot\,|\,s,a)}[V(s')]$, where $V : \mathcal S\to\mathbb R$ can be any function.

For any algorithm that generates a sequence of policies $\{\pi^k\}_{k\in[K]}$, we track its performance via the cumulative regret defined by
$$\mathrm{Regret}(K) = \sum_{k=1}^K \bigl[V_1^*(s_1^k) - V_1^{\pi^k}(s_1^k)\bigr],$$
where $K$ is the number of episodes. Here $s_1^k$ is the initial state of episode $k$, which is arbitrarily chosen at the start of the episode.
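For reference, the action-value and value functions defined above satisfy the standard Bellman equations, which underlie the value-iteration updates in the next section; the display below restates these well-known identities and is not specific to this paper:
$$Q_h^\pi(s,a) = r_h(s,a) + [P_h V_{h+1}^\pi](s,a), \qquad V_h^\pi(s) = \sum_{a\in\mathcal A}\pi_h(a\,|\,s)\,Q_h^\pi(s,a),$$
$$Q_h^*(s,a) = r_h(s,a) + [P_h V_{h+1}^*](s,a), \qquad V_h^*(s) = \max_{a\in\mathcal A} Q_h^*(s,a),$$
with $V_{H+1}^\pi \equiv V_{H+1}^* \equiv 0$.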
3 Bootstrapped Value Iteration

In this section, we introduce Bootstrapped Value Iteration (BooVI, Algorithm 1). For notational simplicity, we write $r_h^k = r_h(s_h^k, a_h^k)$ throughout the rest of this paper. Also, in this paper, we write $\max\{\min\{x, y\},\, 0\}$ as $\min\{x, y\}^+$, and denote by $\|\cdot\|$ the $\ell_2$-norm for vectors.

Least-Squares Value Iteration. At timestep $h\in[H]$ of episode $k$, given the estimated action-value function $\widehat Q_{h+1}^k(s_{h+1}^\tau, a)$ for all $\tau\in[k-1]$ and $a\in\mathcal A$, let $\widehat V_{h+1}^k(s_{h+1}^\tau) = \max_{a\in\mathcal A}\widehat Q_{h+1}^k(s_{h+1}^\tau, a)$. Then, Least-Squares Value Iteration (LSVI) updates the parameter $\omega$ of the action-value function via
$$\widehat\omega_h^k \leftarrow \mathop{\mathrm{argmin}}_{\omega\in\mathbb R^d}\ \sum_{\tau\in[k-1]} \bigl(r_h^\tau + \widehat V_{h+1}^k(s_{h+1}^\tau) - Q(s_h^\tau, a_h^\tau; \omega)\bigr)^2 + \lambda\|\omega\|^2, \tag{3.1}$$
where $Q(\cdot,\cdot;\cdot) : \mathcal S\times\mathcal A\times\mathbb R^d\to\mathbb R$ is the parameterization of the action-value function with $\omega\in\mathbb R^d$ being its parameter, and $\lambda\ge 0$ is the regularization parameter. Although the deterministic parameter update of LSVI exploits the historical data well, it has limited ability to address the exploration needs of more challenging tasks.

Bootstrapped Value Iteration. To achieve guided exploration, we introduce BooVI (Algorithm 1), which uses bootstrapping to enforce the optimism of the estimated action-value function. At timestep $h$ of episode $k$, given the current data buffer $\mathcal D_h^k = \{(s_h^\tau, a_h^\tau, r_h^\tau, s_{h+1}^\tau)\}_{\tau\in[k-1]}$, which contains the data collected from the previous $k-1$ episodes, and the current bootstrapped state values $\{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}$, we define
$$y_h^\tau = r_h^\tau + \widetilde V_{h+1}^k(s_{h+1}^\tau), \quad\text{for all } \tau\in[k-1]. \tag{3.2}$$
With $p_0(\omega)$ as the prior of $\omega$ and $p(y_h^\tau\,|\,(s_h^\tau, a_h^\tau), \omega)$ as the likelihood of $y_h^\tau$, the posterior of $\omega$ is given by
$$p\bigl(\omega \,\big|\, \{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k\bigr) \propto p_0(\omega)\prod_{\tau\in[k-1]} p\bigl(y_h^\tau \,\big|\, (s_h^\tau, a_h^\tau), \omega\bigr). \tag{3.3}$$
In BooVI (Algorithm 1), we propose to sample from such a posterior $N_k$ times, which is equivalent to repeatedly sampling a randomized action-value function, since each sample $\omega_h^{k,i}$ corresponds to one randomized action-value function $Q(\cdot,\cdot;\omega_h^{k,i})$. This collection of posterior weights is later used to construct the bootstrapped action-value function in Algorithm 2. To independently sample from the posterior in (3.3), we can consider, e.g., using the Langevin dynamics
$$\omega(t+1) \leftarrow \omega(t) + \frac{\epsilon_t}{2}\,\nabla_\omega \log p\bigl(\omega(t) \,\big|\, \{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k\bigr) + \zeta_t, \tag{3.4}$$
where $\epsilon_t > 0$ is the stepsize and $\zeta_t\sim N(0, \epsilon_t I_d)$. We note here that, after running sufficiently many iterations with suitable choices of stepsizes $\epsilon_t > 0$, the Langevin dynamics gives effectively independent parameter samples from the posterior in (3.3).

To instantiate the connection between the posterior (3.3) and LSVI, we consider a Gaussian prior and likelihood as an example. For the Gaussian prior $\omega\sim N(0, I_d)$ and the Gaussian likelihood $p(y_h^\tau\,|\,(s_h^\tau, a_h^\tau), \omega) \propto \exp\{-(y_h^\tau - Q(s_h^\tau, a_h^\tau; \omega))^2/(2\sigma^2)\}$, the posterior of $\omega$ is given by
$$p\bigl(\omega \,\big|\, \{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k\bigr) \propto \exp\Bigl\{-\frac{1}{2}\|\omega\|^2 - \frac{1}{2\sigma^2}\sum_{\tau\in[k-1]}\bigl(y_h^\tau - Q(s_h^\tau, a_h^\tau; \omega)\bigr)^2\Bigr\}, \tag{3.5}$$
where $\sigma > 0$ is an absolute constant. In this case, the maximum a posteriori (MAP) estimate of $\omega$ coincides with the weight estimate $\widehat\omega_h^k$ obtained by LSVI. This observation suggests that sampling from the posterior distribution $p(\omega\,|\,\{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k)$ allows us both to achieve exploitation based on the available data and to generate noise for randomized exploration. More specifically, the gradient term $\nabla_\omega\log p(\omega(t)\,|\,\cdot)$ in (3.4) takes the form of
$$-\omega(t) + \frac{1}{\sigma^2}\sum_{\tau\in[k-1]}\bigl(y_h^\tau - Q(s_h^\tau, a_h^\tau; \omega(t))\bigr)\,\nabla_\omega Q(s_h^\tau, a_h^\tau; \omega(t)),$$
which can be viewed as using stochastic gradient descent to solve the LSVI update (3.1) with $\lambda = \sigma^2$ and $\widehat V_{h+1}^k$ replaced by $\widetilde V_{h+1}^k$. See also Lemma E.1 for the closed form of the posterior in the linear MDP setting with Gaussian prior and likelihood. We would like to highlight that, although the theoretical guarantee in this paper is built upon the Gaussian prior and likelihood, the choices of prior and likelihood are flexible. For example, we can use an uninformative prior for a generalized linear model, or a Gaussian process prior for the kernel of overparameterized neural networks [6]. Moreover, as suggested by theoretical results on Langevin dynamics [56] and the illustrative experiments in Appendix F, the computational overhead caused by replacing the minimization of (3.1) with posterior sampling is mild.
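To make the sampling step concrete, the following is a minimal sketch (not the authors' implementation) of drawing approximate posterior samples via the Langevin dynamics (3.4), assuming the Gaussian prior and likelihood of (3.5) and the linear parameterization $Q(s,a;\omega) = \langle\phi(s,a),\omega\rangle$. The function name, the inputs `Phi` and `y`, and the stepsize schedule are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def langevin_posterior_samples(Phi, y, sigma=1.0, n_samples=10,
                               n_steps=2000, step=1e-3, rng=None):
    """Approximate samples from the posterior (3.5) via the Langevin dynamics (3.4).

    Phi : (n, d) array of features phi(s_h^tau, a_h^tau), tau = 1, ..., k-1.
    y   : (n,) array of bootstrapped targets y_h^tau = r_h^tau + V~_{h+1}^k(s_{h+1}^tau).
    Assumes the prior N(0, I_d) and Gaussian likelihood with variance sigma^2, so that
    log p(omega | data) = -||omega||^2 / 2 - ||y - Phi omega||^2 / (2 sigma^2) + const.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = Phi.shape
    samples = []
    for _ in range(n_samples):
        omega = np.zeros(d)
        for _ in range(n_steps):
            # Gradient of the log-posterior: -omega + Phi^T (y - Phi omega) / sigma^2.
            grad = -omega + Phi.T @ (y - Phi @ omega) / sigma**2
            # Langevin step: drift (step / 2) * grad plus N(0, step * I_d) noise.
            omega = omega + 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(d)
        samples.append(omega)
    return np.stack(samples)  # (n_samples, d): the weights {omega_h^{k,i}}_{i in [N_k]}
```

Each returned row plays the role of one posterior weight $\omega_h^{k,i}$ used in Line 9 of Algorithm 1 below.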
Algorithm 1 Bootstrapped Value Iteration (BooVI)
1: Require: MDP $(\mathcal S, \mathcal A, H, P, r)$, action-value function parameterization $Q(\cdot,\cdot;\cdot) : \mathcal S\times\mathcal A\times\mathbb R^d\to\mathbb R$, number of episodes $K$, numbers of posterior weights $\{N_k\}_{k\in[K]}$, lower and upper bootstrapping ratios $\alpha, \beta\in(0,1)$.
2: Initialize the data buffer $\mathcal D_h^1 \leftarrow \{\}$ for all $h\in[H]$.
3: For episode $k = 1, \ldots, K$ do
4: Set $N_{k,\alpha} \leftarrow \lceil\alpha N_k\rceil$, $N_{k,\beta} \leftarrow \lfloor\beta N_k\rfloor$, and $\omega_{H+1}^{k,i} \leftarrow 0$ for all $i\in[N_k]$.
5: Sample $n_k$ uniformly from $\{N_{k,\alpha}, N_{k,\alpha}+1, \ldots, N_{k,\beta}\}$.
6: For timestep $h = H, \ldots, 1$ do
7: Generate $\widetilde Q_{h+1}^k(s_{h+1}^\tau, a)$ using Algorithm 2 with weights $\{\omega_{h+1}^{k,i}\}_{i\in[N_k]}$ and parameter $n_k$, for all $a\in\mathcal A$ and $\tau\in[k-1]$.
8: Set $\widetilde V_{h+1}^k(s_{h+1}^\tau) \leftarrow \max_{a\in\mathcal A}\widetilde Q_{h+1}^k(s_{h+1}^\tau, a)$ for all $\tau\in[k-1]$.
9: Independently sample $\{\omega_h^{k,i}\}_{i\in[N_k]}$ from the posterior $p(\omega\,|\,\{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k)$ defined in (3.3), e.g., using the Langevin dynamics in (3.4).
10: End
11: For timestep $h = 1, \ldots, H$ do
12: Generate $\widetilde Q_h^k(s_h^k, a)$ using Algorithm 2 with weights $\{\omega_h^{k,i}\}_{i\in[N_k]}$ and parameter $n_k$, for all $a\in\mathcal A$.
13: Take action $a_h^k \leftarrow \mathrm{argmax}_{a\in\mathcal A}\widetilde Q_h^k(s_h^k, a)$, and observe $r_h^k$ and $s_{h+1}^k$.
14: Update the data buffer $\mathcal D_h^{k+1} \leftarrow \mathcal D_h^k \cup \{(s_h^k, a_h^k, r_h^k, s_{h+1}^k)\}$.
15: End
16: End

The following Algorithm 2 serves as the critical building block of BooVI for enforcing the optimism of the estimated action-value function. At episode $k$ in BooVI, for any $(s,a)\in\mathcal S\times\mathcal A$, Algorithm 2 ranks the estimated action-value functions corresponding to the posterior samples obtained in Line 9 of Algorithm 1 in ascending order. Then, in order to enforce the optimism of the estimated action-value function, Algorithm 2 resamples the $n_k$-th ranked value from the ordered estimated action-value functions in the manner of bootstrapping. Finally, to ensure sufficient optimism of the obtained bootstrapped action-value function, we extrapolate with a tunable parameter $\xi > 1$ (a minimal code sketch of this step is given at the end of this section).

Algorithm 2 Bootstrapping the Action-Value Function
1: Require: Action-value function parameterization $Q(\cdot,\cdot;\cdot) : \mathcal S\times\mathcal A\times\mathbb R^d\to\mathbb R$, posterior samples $\{\omega_h^{k,i}\}_{i\in[N_k]}$, integer $n_k\in[N_k]$, extrapolation parameter $\xi > 1$, and state-action pair $(s,a)$.
2: Compute $Q_h^{k,i}(s,a) \leftarrow Q(s,a;\omega_h^{k,i})$ for all $i\in[N_k]$.
3: Set $\widehat Q_h^k(s,a) \leftarrow (1/N_k)\sum_{i=1}^{N_k} Q_h^{k,i}(s,a)$.
4: Rank $\{Q_h^{k,i}(s,a)\}_{i\in[N_k]}$ in ascending order to obtain $\{Q_h^{k,(i)}(s,a)\}_{i\in[N_k]}$.
5: Output: $\widetilde Q_h^k(s,a) \leftarrow \min\bigl\{(1-\xi)\,\widehat Q_h^k(s,a) + \xi\, Q_h^{k,(n_k)}(s,a),\ H-h+1\bigr\}^+$.

Remark 3.1 (Sample Efficiency and Computational Efficiency). Here we clarify the difference between sample efficiency and computational efficiency. Throughout this paper, sample efficiency refers to the learning efficiency with respect to the total number $T = KH$ of interactions with the environment. Such interactions can often be very resource-demanding. Thus, this paper seeks to provide an algorithm that achieves sample efficiency with respect to the number of such interactions. Although posterior weights are sampled in Algorithm 1, the efficiency of this posterior sampling process is categorized as computational efficiency, since it is purely based on the current data buffer $\mathcal D_h^k$ and does not require any extra interaction with the environment.

Remark 3.2 (Additional Computational Cost). Here we discuss the computational cost of BooVI compared to LSVI. (i) Posterior sampling vs. least-squares minimization: Consider the Langevin dynamics in (3.4). Since posterior sampling replaces, and admits similar steps to, least-squares minimization, the computational cost should not differ significantly between LSVI and BooVI. (ii) Computing the bootstrapped Q-function: To compute the bootstrapped Q-function, at each state-action pair the extra sorting takes $O(N_k\log N_k)$ time, which makes the action selection procedure $O(N_k\log N_k)$ times slower. In addition, keeping $N_k$ posterior weights takes $O(dN_k)$ memory. We note that, although the requirement on $N_k$ in our subsequent analysis is large, we should be able to use a much smaller $N_k$ in practice due to the pessimistic nature of the analysis. See Appendix F for illustrative experimental results.

It is worth mentioning that BooVI is applicable to any specific parameterization $Q(\cdot,\cdot;\cdot) : \mathcal S\times\mathcal A\times\mathbb R^d\to\mathbb R$ of the action-value function and is fully data-driven. We also note here that the design of BooVI does not rely on the Gaussian prior/posterior or the Langevin dynamics sampling technique.
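As referenced above, the following is a minimal sketch of the bootstrapping step in Algorithm 2, assuming the linear parameterization $Q(s,a;\omega) = \langle\phi(s,a),\omega\rangle$; the function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def bootstrapped_q(phi_sa, omegas, n_k, xi, h, H):
    """Algorithm 2 sketch: bootstrapped action-value at one state-action pair.

    phi_sa : (d,) feature vector phi(s, a), assuming Q(s, a; omega) = <phi(s, a), omega>.
    omegas : (N_k, d) posterior weight samples {omega_h^{k,i}}_{i in [N_k]}.
    n_k    : rank index drawn uniformly from {ceil(alpha N_k), ..., floor(beta N_k)}.
    xi     : extrapolation parameter (> 1).
    """
    q_values = omegas @ phi_sa                 # Q_h^{k,i}(s, a) for i = 1, ..., N_k
    q_mean = q_values.mean()                   # hat Q_h^k(s, a)
    q_sorted = np.sort(q_values)               # ascending order statistics Q_h^{k,(i)}(s, a)
    q_ranked = q_sorted[n_k - 1]               # the n_k-th ranked value Q_h^{k,(n_k)}(s, a)
    q_tilde = (1.0 - xi) * q_mean + xi * q_ranked   # extrapolate beyond the ranked value
    return float(np.clip(q_tilde, 0.0, H - h + 1))  # min{., H - h + 1}^+ truncation

# Illustrative use at action-selection time (Line 13 of Algorithm 1):
# a_k = max(range(num_actions),
#           key=lambda a: bootstrapped_q(features[a], omegas, n_k, xi, h, H))
```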
4 Main Results

While BooVI (Algorithm 1) is generally applicable to different parameterizations of the action-value function, we establish its regret for a class of linear MDPs.

Linear Markov Decision Process. We consider a class of MDPs where the transition kernels and the reward functions are linear in the same feature mapping. Specifically, we have the following definition.

Definition 4.1 (Linear MDP). An MDP $(\mathcal S, \mathcal A, H, P, r)$ is called a linear MDP with feature mapping $\phi : \mathcal S\times\mathcal A\to\mathbb R^d$ if, for any $h\in[H]$, there exist $d$ unknown signed measures $\mu_h = (\mu_h^{(1)}, \ldots, \mu_h^{(d)})$ over $\mathcal S$ and an unknown vector $\theta_h\in\mathbb R^d$ such that for any $(s,a)\in\mathcal S\times\mathcal A$, we have
$$P_h(\cdot\,|\,s,a) = \langle\phi(s,a), \mu_h(\cdot)\rangle, \qquad r_h(s,a) = \langle\phi(s,a), \theta_h\rangle. \tag{4.1}$$

See [51, 52, 24] for examples of such a class of linear MDPs. Specifically, examples include tabular MDPs, where $d = |\mathcal S||\mathcal A|$ and the feature mapping is the canonical basis $\phi(s,a) = e_{(s,a)}$, and the simplex feature space, where the feature set $\{\phi(s,a) \,|\, (s,a)\in\mathcal S\times\mathcal A\}$ is a subset of the $d$-dimensional simplex. See also [14, 44, 25] for related discussions on such a linear representation. For notational simplicity, we write $\phi_h^k = \phi(s_h^k, a_h^k)$ throughout the rest of this paper. Based on Definition 4.1, without loss of generality, we make the following assumption.

Assumption 4.2. The MDP $(\mathcal S, \mathcal A, H, P, r)$ is a linear MDP with $\|\phi(s,a)\| \le 1$ for all $(s,a)\in\mathcal S\times\mathcal A$ and $\max\{\|\mu_h(\mathcal S)\|, \|\theta_h\|\} \le \sqrt d$ for all $h\in[H]$.

To motivate the linear parameterization of the action-value function in the subsequent section, we have the following proposition.

Proposition 4.3 (Linear Action-Value Function, Proposition 2.3 in [24]). For a linear MDP, for any policy $\pi$ and any $h\in[H]$, there exists $\omega_h^\pi\in\mathbb R^d$ such that for all $(s,a)\in\mathcal S\times\mathcal A$, we have $Q_h^\pi(s,a) = \langle\phi(s,a), \omega_h^\pi\rangle$.

Regret of BooVI for Linear MDPs. For a linear MDP, by Proposition 4.3, the parameterization $Q(\cdot,\cdot;\cdot) : \mathcal S\times\mathcal A\times\mathbb R^d\to\mathbb R$ in BooVI should take the form $Q(s,a;\omega) = \langle\phi(s,a), \omega\rangle$, where $\omega\in\mathbb R^d$. Furthermore, when the prior and the likelihood are Gaussian, the posterior of $\omega$ is a Gaussian distribution whose mean is the weight update by LSVI in (3.1) with the choice $\lambda = \sigma^2$. For any bootstrapping ratio $q\in(0,1)$, we write $C_q = \Phi^{-1}(q)$ throughout the rest of this paper, where $\Phi(\cdot)$ is the cumulative distribution function of the standard Gaussian. Then we have the following upper bound on the regret of BooVI.

Theorem 4.4. Suppose that Assumption 4.2 holds and that $d\ge 2$. We set $\sigma = 1$ and the extrapolation parameter $\xi = dH$, and fix a failure probability $p\in(0,1]$. Then there exists a sequence of posterior sample sizes $\{N_k\}_{k\in[K]}$, and there exist absolute constants $c_\beta > c_\alpha > 0$, such that if $N_k = O(d^6T^4k/p^4)$ for all $k\in[K]$, $C_\alpha = c_\alpha\sqrt\iota$, and $C_\beta = c_\beta\sqrt\iota$, then BooVI (Algorithm 1) with the Gaussian posterior in (3.5) satisfies
$$\mathrm{Regret}(K) = \widetilde O\bigl(\sqrt{d^3H^3T}\bigr)$$
with probability at least $1 - p$, where $T = KH$ and $\iota = \log(3dT/p)$.

Proof. See Section 5 for a proof sketch.
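Before turning to the proof sketch, we note that in the linear-Gaussian case above, standard conjugacy gives the posterior in closed form (cf. Lemma E.1): with $\Lambda_h^k = \sum_{\tau\in[k-1]}\phi_h^\tau(\phi_h^\tau)^\top + \sigma^2 I_d$ as in (5.5) below, the posterior is $N\bigl(\widehat\omega_h^k, \sigma^2(\Lambda_h^k)^{-1}\bigr)$ with mean $\widehat\omega_h^k = (\Lambda_h^k)^{-1}\sum_{\tau\in[k-1]} y_h^\tau\phi_h^\tau$, the ridge (LSVI) solution with $\lambda = \sigma^2$. The sketch below, with illustrative names, draws exact posterior samples this way instead of running the Langevin dynamics; this shortcut is only available in the linear setting.

```python
import numpy as np

def gaussian_posterior_samples(Phi, y, sigma=1.0, n_samples=10, rng=None):
    """Exact posterior sampling in the linear-Gaussian case (cf. (3.5) and Lemma E.1).

    Phi : (n, d) stacked features phi_h^tau; y : (n,) targets y_h^tau.
    Under the prior N(0, I_d) and likelihood variance sigma^2, the posterior is
    N(omega_hat, sigma^2 Lambda^{-1}) with Lambda = Phi^T Phi + sigma^2 I_d and
    omega_hat = Lambda^{-1} Phi^T y (the LSVI/ridge solution with lambda = sigma^2).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = Phi.shape
    Lambda = Phi.T @ Phi + sigma**2 * np.eye(d)
    omega_hat = np.linalg.solve(Lambda, Phi.T @ y)
    cov = sigma**2 * np.linalg.inv(Lambda)
    return rng.multivariate_normal(omega_hat, cov, size=n_samples)  # (n_samples, d)
```

In practice one would reuse a Cholesky factor of $\Lambda_h^k$ across the $N_k$ draws; the direct inversion here is only for clarity.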
5 Proof Sketch

Notation. At episode $k$, given the number of posterior sample weights $N_k$ and the index $n_k$ sampled in Line 5 of Algorithm 1, we define $C_k = \Phi^{-1}(n_k/N_k)$. At timestep $h$ of episode $k$, further given the posterior weights $\{\omega_h^{k,i}\}_{i\in[N_k]}$ corresponding to the bootstrapped action-value function $\widetilde Q_h^k$ generated by Algorithm 2, we define the bootstrapped value function as
$$\widetilde V_h^k(s) = \max_{a\in\mathcal A}\widetilde Q_h^k(s,a). \tag{5.1}$$
Next, we define the mean bootstrapped action-value function as
$$\bar Q_h^k(s,a) = \min\Bigl\{\mathbb E_\omega\bigl[(1-\xi)\,\widehat Q_h^k(s,a) + \xi\,Q_h^{k,(n_k)}(s,a)\bigr],\ H-h+1\Bigr\}^+, \tag{5.2}$$
where $\mathbb E_\omega[\,\cdot\,]$ is taken over $\omega$ with respect to the posterior $p(\omega\,|\,\{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k)$ defined in (3.3). Correspondingly, we define the mean bootstrapped value function as
$$\bar V_h^k(s) = \max_{a\in\mathcal A}\bar Q_h^k(s,a). \tag{5.3}$$
Also, we denote by
$$\bar\omega_h^k = (\Lambda_h^k)^{-1}\sum_{\tau\in[k-1]} y_h^\tau\,\phi_h^\tau \tag{5.4}$$
the mean of the posterior distribution $p(\omega\,|\,\{\widetilde V_{h+1}^k(s_{h+1}^\tau)\}_{\tau\in[k-1]}, \mathcal D_h^k)$, where $y_h^\tau$ is defined in (3.2). For a fixed failure probability $p\in(0,1]$, we write $\iota = \log(3dT/p)$, where $T = KH$. Finally, we define the matrix $\Lambda_h^k\in\mathbb R^{d\times d}$ by
$$\Lambda_h^k = \sum_{\tau\in[k-1]}\phi_h^\tau(\phi_h^\tau)^\top + \sigma^2 I_d. \tag{5.5}$$

Concentration Events. Before proving Theorem 4.4, we first present two lemmas, each characterizing an event that is invoked throughout the remaining proofs. The first lemma characterizes the concentration behavior of the bootstrapped action-value function $\widetilde Q_h^k$.

Lemma 5.1. Let $\sigma = 1$, $\xi = dH$, and $C_\beta = c_\beta\sqrt\iota$ for some $c_\beta > 0$. For a fixed failure probability $p\in(0,1]$, we define $\mathcal E$ as the event that the corresponding concentration condition on $\widetilde Q_h^k$ holds for all $(s,a,k,h)\in\mathcal S\times\mathcal A\times[K]\times[H]$. Then there exists a sequence of posterior sample sizes $\{N_k\}_{k\in[K]}$ satisfying $N_k = O(d^6T^4k/p^4)$ for all $k\in[K]$ such that $P(\mathcal E) \ge 1 - p/3$.

Proof. See Appendix C for a detailed proof.

The second lemma characterizes the concentration behavior of the mean bootstrapped value function $\bar V_h^k$.

Lemma 5.2. Let $\sigma = 1$, $\xi = dH$, and $C_\beta = c_\beta\sqrt\iota$ for some constant $c_\beta > 0$. For a fixed failure probability $p\in(0,1]$, we define $\mathcal E'$ as the event that the corresponding concentration condition on $\bar V_h^k$ holds for all $(k,h)\in[K]\times[H]$, where $\chi = \log[3(1+c_\beta)dT/p]$. Then we have $P(\mathcal E') \ge 1 - p/3$.

Proof. See Appendix D for a detailed proof.

Model Estimation Error. With the events $\mathcal E$ and $\mathcal E'$ in hand, we proceed to characterize the model estimation error $\Delta_h^k : \mathcal S\times\mathcal A\to\mathbb R$ for all $(k,h)\in[K]\times[H]$, defined as
$$\Delta_h^k(s,a) = \bar Q_h^k(s,a) - \bigl(r_h(s,a) + [P_h\bar V_{h+1}^k](s,a)\bigr), \tag{5.6}$$
which can be understood as the estimation error of the model at timestep $h$ of episode $k$ induced by the mean bootstrapped value function $\bar V_{h+1}^k$. We have the following lemma characterizing the model estimation error $\Delta_h^k$ defined in (5.6).

Lemma 5.3 (Model Estimation Error). Let $d\ge 2$, $\xi = dH$, $\sigma = 1$, and $p\in(0,1]$. Then there exists a sequence of posterior sample sizes $\{N_k\}_{k\in[K]}$, and there exist absolute constants $c_\beta > c_\alpha > 0$ such that, if $N_k = O(d^6T^4k/p^4)$ for all $k\in[K]$, $C_\alpha = c_\alpha\sqrt\iota$, and $C_\beta = c_\beta\sqrt\iota$, then for the model estimation error $\Delta_h^k$ defined in (5.6) we have, under the events $\mathcal E$ and $\mathcal E'$, that
$$0 \le \Delta_h^k(s,a) \le (c_\alpha + c_\beta)\, dH\sqrt\iota\,\sqrt{\phi(s,a)^\top(\Lambda_h^k)^{-1}\phi(s,a)}$$
for all $(s,a,k,h)\in\mathcal S\times\mathcal A\times[K]\times[H]$.

Proof. Under Assumption 4.2, using (4.1) we have $[P_h\bar V_{h+1}^k](s,a) = \int_{\mathcal S}\bar V_{h+1}^k(s')\,\langle\phi(s,a), \mu_h(ds')\rangle$. Then, with a slight abuse of notation, we denote $\bar V_{h+1}^k(\mu_h) = \int_{\mathcal S}\bar V_{h+1}^k(s)\,\mu_h(ds)\in\mathbb R^d$ and write
$$\theta_h + \bar V_{h+1}^k(\mu_h) = (\Lambda_h^k)^{-1}\Lambda_h^k\bigl(\theta_h + \bar V_{h+1}^k(\mu_h)\bigr) = (\Lambda_h^k)^{-1}\Bigl(\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(r_h^\tau + [P_h\bar V_{h+1}^k](s_h^\tau, a_h^\tau)\bigr) + \sigma^2\bigl(\theta_h + \bar V_{h+1}^k(\mu_h)\bigr)\Bigr),$$
where the second equality follows from (5.5). Thus, we have the decomposition
$$\langle\phi, \bar\omega_h^k\rangle - \bigl(r_h + [P_h\bar V_{h+1}^k]\bigr) = u_1 + u_2,$$
where $\bar\omega_h^k$ is defined in (5.4), and
$$u_1 = \Bigl\langle\phi,\ (\Lambda_h^k)^{-1}\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\widetilde V_{h+1}^k(s_{h+1}^\tau) - [P_h\bar V_{h+1}^k](s_h^\tau, a_h^\tau)\bigr)\Bigr\rangle, \qquad u_2 = -\sigma^2\bigl\langle\phi,\ (\Lambda_h^k)^{-1}\bigl(\theta_h + \bar V_{h+1}^k(\mu_h)\bigr)\bigr\rangle.$$
In the sequel, we upper bound $u_1$ and $u_2$ separately.

Upper bounding $u_1$: First, we further decompose $u_1$ as
$$u_1 = \Bigl\langle\phi, (\Lambda_h^k)^{-1}\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\widetilde V_{h+1}^k(s_{h+1}^\tau) - \bar V_{h+1}^k(s_{h+1}^\tau)\bigr)\Bigr\rangle + \Bigl\langle\phi, (\Lambda_h^k)^{-1}\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\bar V_{h+1}^k(s_{h+1}^\tau) - [P_h\bar V_{h+1}^k](s_h^\tau, a_h^\tau)\bigr)\Bigr\rangle.$$
Applying the Cauchy-Schwarz inequality, we obtain
$$|u_1| \le \sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi}\,\Bigl(\Bigl\|\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\widetilde V_{h+1}^k(s_{h+1}^\tau) - \bar V_{h+1}^k(s_{h+1}^\tau)\bigr)\Bigr\|_{(\Lambda_h^k)^{-1}} + \Bigl\|\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\bar V_{h+1}^k(s_{h+1}^\tau) - [P_h\bar V_{h+1}^k](s_h^\tau, a_h^\tau)\bigr)\Bigr\|_{(\Lambda_h^k)^{-1}}\Bigr), \tag{5.7}$$
where for a positive definite matrix $A$ we write $\|x\|_A = \sqrt{x^\top A x}$. Now we proceed to upper bound the two terms on the right-hand side of (5.7).
For the first term, since by Lemma 5.1 we have under event $\mathcal E$ that $|\widetilde V_{h+1}^k(s_{h+1}^\tau) - \bar V_{h+1}^k(s_{h+1}^\tau)| \le \sqrt d/k$, we have
$$\Bigl\|\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\widetilde V_{h+1}^k(s_{h+1}^\tau) - \bar V_{h+1}^k(s_{h+1}^\tau)\bigr)\Bigr\|_{(\Lambda_h^k)^{-1}} \le \frac{\sqrt d}{k}\sum_{\tau\in[k-1]}\bigl\|\phi_h^\tau\bigr\|_{(\Lambda_h^k)^{-1}}. \tag{5.8}$$
By the Cauchy-Schwarz inequality and Lemma D.1 in [24] (see Appendix E), we have $\sum_{\tau\in[k-1]}\|\phi_h^\tau\|_{(\Lambda_h^k)^{-1}} \le \sqrt{(k-1)\sum_{\tau\in[k-1]}(\phi_h^\tau)^\top(\Lambda_h^k)^{-1}\phi_h^\tau} \le \sqrt{(k-1)d}$, taking which into (5.8) gives
$$\Bigl\|\sum_{\tau\in[k-1]}\phi_h^\tau\bigl(\widetilde V_{h+1}^k(s_{h+1}^\tau) - \bar V_{h+1}^k(s_{h+1}^\tau)\bigr)\Bigr\|_{(\Lambda_h^k)^{-1}} \le d. \tag{5.9}$$
Taking (5.9) into (5.7) and applying Lemma 5.2 to the second term on the right-hand side of (5.7), we have under the events $\mathcal E$ and $\mathcal E'$ that
$$|u_1| \le \bigl(dH\sqrt\iota + C\,dH\sqrt\chi\bigr)\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi} \tag{5.10}$$
for an absolute constant $C > 0$.

Upper bounding $u_2$: By (4.1), we have $\|\bar V_{h+1}^k(\mu_h)\| \le (H-1)\,\|\mu_h(\mathcal S)\|$. Then, by the Cauchy-Schwarz inequality, we further obtain
$$|u_2| \le \sigma^2\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi}\,\bigl\|\theta_h + \bar V_{h+1}^k(\mu_h)\bigr\|_{(\Lambda_h^k)^{-1}} \le 2H\sqrt d\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi}, \tag{5.11}$$
where the second inequality follows from $\Lambda_h^k \succeq \sigma^2 I_d = I_d$, and from $\max\{\|\mu_h(\mathcal S)\|, \|\theta_h\|\} \le \sqrt d$ in Assumption 4.2 together with the fact that $\bar V_{h+1}^k(s) \le H-h \le H-1$ for all $s\in\mathcal S$.

Combining (5.10) and (5.11), we obtain under the events $\mathcal E$ and $\mathcal E'$ that
$$\bigl|\langle\phi, \bar\omega_h^k\rangle - \bigl(r_h + [P_h\bar V_{h+1}^k]\bigr)\bigr| \le \bigl(dH\sqrt\iota + C\,dH\sqrt\chi + 2H\sqrt d\bigr)\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi} = C_0\,dH\sqrt\iota\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi}, \tag{5.12}$$
where $C_0 > 0$ is a constant. Next, we need to find an absolute constant $c_\beta > 0$ such that $C_0\sqrt\iota = (c_\beta/\sqrt d + 5)\sqrt\iota + C\sqrt{\iota + \log(1+c_\beta)} < c_\beta\sqrt\iota$. Since $d\ge 2$ and $\iota\ge\log 2$, it suffices to pick a $c_\beta > 0$ such that
$$C\sqrt{\log 2 + \log(1+c_\beta)} < \bigl((1 - 1/\sqrt 2)\,c_\beta - 5\bigr)\sqrt{\log 2}, \tag{5.13}$$
which must exist, as the left-hand side grows logarithmically in $c_\beta$ while the right-hand side grows linearly in $c_\beta$. For a $c_\beta > 0$ satisfying (5.13), we pick any $c_\alpha > 0$ such that $C_0 \le c_\alpha < c_\beta$ and let $C_\alpha = c_\alpha\sqrt\iota$. By (5.12) and $r_h + [P_h\bar V_{h+1}^k] \le H-h+1$, we obtain under the events $\mathcal E$ and $\mathcal E'$ that
$$\Delta_h^k(s,a) \ge \min\Bigl\{(c_\alpha - C_0)\,dH\sqrt\iota\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi},\ H-h+1 - \bigl(r_h + [P_h\bar V_{h+1}^k]\bigr)\Bigr\} \ge 0.$$
On the other hand, by (5.12), we also have under the events $\mathcal E$ and $\mathcal E'$ that
$$\Delta_h^k(s,a) \le \langle\phi, \bar\omega_h^k\rangle - \bigl(r_h + [P_h\bar V_{h+1}^k]\bigr) + C_\beta\,\xi\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi} \le (c_\alpha + c_\beta)\,dH\sqrt\iota\,\sqrt{\phi^\top(\Lambda_h^k)^{-1}\phi}.$$
Therefore, we finish the proof of Lemma 5.3.

As consequences of Lemma 5.3, we have the following two lemmas, characterizing the optimism and the cumulative estimation error of the mean bootstrapped value function $\bar V_h^k$, respectively.

Lemma 5.4 (Optimistic Random Value Function). Let $\sigma = 1$ and $\xi = dH$. Then there exists a sequence of posterior sample sizes $\{N_k\}_{k\in[K]}$, and there exist absolute constants $c_\beta > c_\alpha > 0$, such that if $N_k = O(d^6T^4k/p^4)$ for all $k\in[K]$, $C_\alpha = c_\alpha\sqrt\iota$, and $C_\beta = c_\beta\sqrt\iota$, then under the events $\mathcal E$ and $\mathcal E'$ we have $\bar V_1^k(s_1^k) \ge V_1^*(s_1^k)$ for all $k\in[K]$.

Proof. See Appendix D for a detailed proof.

Lemma 5.5 (Cumulative Estimation Error). Let $\sigma = 1$, $\xi = dH$, and $p\in(0,1]$. There exists a sequence of posterior sample sizes $\{N_k\}_{k\in[K]}$, and there exist absolute constants $c_\beta > c_\alpha > 0$, such that if $N_k = O(d^6T^4k/p^4)$ for all $k\in[K]$, $C_\alpha = c_\alpha\sqrt\iota$, and $C_\beta = c_\beta\sqrt\iota$, then under the events $\mathcal E$ and $\mathcal E'$ we have
$$\sum_{k=1}^K\bigl(\bar V_1^k(s_1^k) - V_1^{\pi^k}(s_1^k)\bigr) \le \sqrt{18TH^2\log(3/p)} + 4\bigl(c_\beta/\sqrt d + 5\bigr)\sqrt{dH^2T\iota} + (c_\alpha + c_\beta)\,dH^2\sqrt{2dK}\,\iota$$
with probability at least $1 - p/3$.

Proof. See Appendix D for a detailed proof.

Finally, the regret bound of BooVI for the class of linear MDPs can be established as a consequence of the optimism (Lemma 5.4) and the upper bound on the cumulative estimation error (Lemma 5.5) of the bootstrapped value function.

Proof of Theorem 4.4. First, recall that $\bar V_1^k(s)$ is defined in (5.3). We have the regret decomposition
$$\mathrm{Regret}(K) = \sum_{k=1}^K\bigl[V_1^*(s_1^k) - V_1^{\pi^k}(s_1^k)\bigr] = \sum_{k=1}^K\bigl[V_1^*(s_1^k) - \bar V_1^k(s_1^k)\bigr] + \sum_{k=1}^K\bigl[\bar V_1^k(s_1^k) - V_1^{\pi^k}(s_1^k)\bigr]. \tag{5.14}$$
Applying Lemmas 5.4 and 5.5 to (5.14), we obtain under the events $\mathcal E$ and $\mathcal E'$ that
$$\mathrm{Regret}(K) \le 0 + \sqrt{18TH^2\log(3/p)} + 4\bigl(c_\beta/\sqrt d + 5\bigr)\sqrt{dH^2T\iota} + (c_\alpha + c_\beta)\,dH^2\sqrt{2dK}\,\iota \tag{5.15}$$
with probability at least $1 - p/3$. Finally, by Lemmas 5.1 and 5.2, the events $\mathcal E$ and $\mathcal E'$ hold simultaneously with probability at least $1 - p/3 - p/3 = 1 - 2p/3$. Thus, (5.15) holds with probability at least $1 - p$, which concludes the proof.
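To see where the $\sqrt{d^3H^3T}$ rate in Theorem 4.4 comes from, the following standard elliptical-potential accounting, sketched here under Assumption 4.2 with $\sigma = 1$ and with the exact constants deferred to Lemma 5.5 and Appendix D, bounds the dominant term arising from summing the bound of Lemma 5.3 over all timesteps and episodes:
$$\sum_{h=1}^H\sum_{k=1}^K (c_\alpha + c_\beta)\,dH\sqrt\iota\,\sqrt{(\phi_h^k)^\top(\Lambda_h^k)^{-1}\phi_h^k}
\le (c_\alpha + c_\beta)\,dH\sqrt\iota\,\sum_{h=1}^H\sqrt{K\sum_{k=1}^K(\phi_h^k)^\top(\Lambda_h^k)^{-1}\phi_h^k}$$
$$\le (c_\alpha + c_\beta)\,dH\sqrt\iota\cdot H\sqrt{2dK\iota}
= \sqrt 2\,(c_\alpha + c_\beta)\,\sqrt{d^3H^3T}\,\iota,$$
where the first step is the Cauchy-Schwarz inequality over $k$, the second uses the standard elliptical-potential bound $\sum_{k=1}^K(\phi_h^k)^\top(\Lambda_h^k)^{-1}\phi_h^k \le 2d\log(1+K/d) \le 2d\iota$ (see, e.g., [1]), and the last uses $T = KH$. This is of order $O(\sqrt{d^3H^3T}\,\iota)$, matching the $\widetilde O(\sqrt{d^3H^3T})$ regret of Theorem 4.4.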
6 Conclusion

The cost of collecting data via online experiments is often prohibitive compared to offline procedures such as posterior sampling. Thus, designing online learning algorithms that provably explore the environment in a data-driven manner is essential. Moreover, applicability to general environments is critical, since reinforcement learning tasks are becoming increasingly challenging and complex. We aspire to motivate the design of novel reinforcement learning algorithms that utilize cheaper data to boost online sample efficiency. Our algorithm and analysis serve as a step towards developing generally applicable and provably sample-efficient reinforcement learning. While this paper focuses mainly on the algorithmic and theoretical aspects of the bootstrapping idea in RL, it would be interesting to examine the empirical strength of BooVI in more challenging RL environments. We leave this to future work.

Acknowledgement

Zhaoran Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support. Zhuoran Yang acknowledges Simons Institute (Theory of Reinforcement Learning).

References

[1] ABBASI-YADKORI, Y., PÁL, D. and SZEPESVÁRI, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems.
[2] AGARWAL, A., HSU, D., KALE, S., LANGFORD, J., LI, L. and SCHAPIRE, R. (2014). Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning.
[3] AGRAWAL, S. and JIA, R. (2017). Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. In Advances in Neural Information Processing Systems.
[4] AUER, P. and ORTNER, R. (2006). Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems.
[5] AZAR, M. G., OSBAND, I. and MUNOS, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning.
[6] BERNARDO, J., BERGER, J., DAWID, A., SMITH, A. ET AL. (1998). Regression and classification using Gaussian process priors. Bayesian Statistics 6 475.
[7] BROCKMAN, G., CHEUNG, V., PETTERSSON, L., SCHNEIDER, J., SCHULMAN, J., TANG, J. and ZAREMBA, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
[8] CAI, Q., YANG, Z., JIN, C. and WANG, Z. (2019). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.
[9] CHU, W., LI, L., REYZIN, L. and SCHAPIRE, R. (2011). Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics.
[10] DANI, V., HAYES, T. P. and KAKADE, S. M. (2008). Stochastic linear optimization under bandit feedback. In Conference on Learning Theory.
[11] DANN, C. and BRUNSKILL, E. (2015). Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems.
[12] DANN, C., LATTIMORE, T. and BRUNSKILL, E. (2017). Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems.
[13] DONG, K., PENG, J., WANG, Y. and ZHOU, Y. (2019). √n-regret for learning in Markov decision processes with function approximation and low Bellman rank. arXiv preprint arXiv:1909.02506.
[14] DU, S. S., KAKADE, S. M., WANG, R. and YANG, L. F. (2019). Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016.
[15] DU, S. S., LUO, Y., WANG, R. and ZHANG, H. (2019). Provably efficient Q-learning with function approximation via distribution shift error checking oracle. arXiv preprint arXiv:1906.06321.
[16] DUAN, Y., CHEN, X., HOUTHOOFT, R., SCHULMAN, J. and ABBEEL, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.
[17] ECOFFET, A., HUIZINGA, J., LEHMAN, J., STANLEY, K. O. and CLUNE, J. (2019). Go-Explore: A new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.
[18] FOSTER, D. J., AGARWAL, A., DUDÍK, M., LUO, H. and SCHAPIRE, R. E. (2018). Practical contextual bandits with regression oracles. arXiv preprint arXiv:1803.01088.
[19] HAARNOJA, T., ZHOU, A., ABBEEL, P. and LEVINE, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
[20] HSU, D., KAKADE, S. and ZHANG, T. (2012). A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability 17.
[21] JAKSCH, T., ORTNER, R. and AUER, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11 1563-1600.
[22] JIANG, N., KRISHNAMURTHY, A., AGARWAL, A., LANGFORD, J. and SCHAPIRE, R. E. (2017). Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning.
[23] JIN, C., ALLEN-ZHU, Z., BUBECK, S. and JORDAN, M. I. (2018). Is Q-learning provably efficient? In Advances in Neural Information Processing Systems.
[24] JIN, C., YANG, Z., WANG, Z. and JORDAN, M. I. (2019). Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388.
[25] LATTIMORE, T. and SZEPESVÁRI, C. (2019). Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676.
[26] MAY, B. C., KORDA, N., LEE, A. and LESLIE, D. S. (2012). Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research 13 2069-2106.
[27] OSBAND, I., ASLANIDES, J. and CASSIRER, A. (2018). Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems.
[28] OSBAND, I., BLUNDELL, C., PRITZEL, A. and VAN ROY, B. (2016). Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems.
[29] OSBAND, I., RUSSO, D. and VAN ROY, B. (2013). (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems.
[30] OSBAND, I., RUSSO, D., WEN, Z. and VAN ROY, B. (2017). Deep exploration via randomized value functions. Journal of Machine Learning Research.
[31] OSBAND, I. and VAN ROY, B. (2015). Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300.
[32] OSBAND, I. and VAN ROY, B. (2016). On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732.
[33] OSBAND, I. and VAN ROY, B. (2017). Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning.
[34] OSBAND, I., VAN ROY, B. and WEN, Z. (2014). Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635.
[35] RAGINSKY, M., RAKHLIN, A. and TELGARSKY, M. (2017). Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. arXiv preprint arXiv:1702.03849.
[36] RUSSO, D. (2019). Worst-case regret bounds for exploration via randomized value functions. In Advances in Neural Information Processing Systems.
[37] RUSSO, D. J., VAN ROY, B., KAZEROUNI, A., OSBAND, I. and WEN, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends in Machine Learning 11 1-96.
[38] SILVER, D., HUANG, A., MADDISON, C. J., GUEZ, A., SIFRE, L., VAN DEN DRIESSCHE, G., SCHRITTWIESER, J., ANTONOGLOU, I., PANNEERSHELVAM, V., LANCTOT, M. ET AL. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529 484.
[39] SILVER, D., LEVER, G., HEESS, N., DEGRIS, T., WIERSTRA, D. and RIEDMILLER, M. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning.
[40] SILVER, D., SCHRITTWIESER, J., SIMONYAN, K., ANTONOGLOU, I., HUANG, A., GUEZ, A., HUBERT, T., BAKER, L., LAI, M., BOLTON, A. ET AL. (2017). Mastering the game of Go without human knowledge. Nature 550 354.
[41] STREHL, A. L., LI, L., WIEWIORA, E., LANGFORD, J. and LITTMAN, M. L. (2006). PAC model-free reinforcement learning. In International Conference on Machine Learning.
[42] TAÏGA, A. A., FEDUS, W., MACHADO, M. C., COURVILLE, A. and BELLEMARE, M. G. (2019). Benchmarking bonus-based exploration methods on the Arcade Learning Environment. arXiv preprint arXiv:1908.02388.
[43] THOMPSON, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 285-294.
[44] VAN ROY, B. and DONG, S. (2019). Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910.
[45] WANG, W. Y., LI, J. and HE, X. (2018). Deep reinforcement learning for NLP. In Association for Computational Linguistics.
[46] WANG, Y., WANG, R., DU, S. S. and KRISHNAMURTHY, A. (2019). Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136.
[47] WATKINS, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, University of Cambridge.
[48] WELLING, M. and TEH, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning.
[49] WEN, Z. and VAN ROY, B. (2017). Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research 42 762-782.
[50] XIONG, Z., SHEN, R. and DU, S. S. (2021). Randomized exploration is near-optimal for tabular MDP. arXiv preprint arXiv:2102.09703.
[51] YANG, L. and WANG, M. (2019). Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning.
[52] YANG, L. F. and WANG, M. (2019). Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389.
[53] YANG, Z., JIN, C., WANG, Z., WANG, M. and JORDAN, M. (2020). Provably efficient reinforcement learning with kernel and neural function approximations. Advances in Neural Information Processing Systems 33.
[54] ZANETTE, A., BRANDFONBRENER, D., PIROTTA, M. and LAZARIC, A. (2019). Frequentist regret bounds for randomized least-squares value iteration. arXiv preprint arXiv:1911.00567.
[55] ZANETTE, A., LAZARIC, A., KOCHENDERFER, M. and BRUNSKILL, E. (2020). Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning. PMLR.
[56] ZHANG, Y., LIANG, P. and CHARIKAR, M. (2017). A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory. PMLR.
[57] ZHOU, D., GU, Q. and SZEPESVÁRI, C. (2021). Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory. PMLR.
[58] ZHOU, H., CHEN, J., VARSHNEY, L. R. and JAGMOHAN, A. (2020). Nonstationary reinforcement learning with linear function approximation. arXiv preprint arXiv:2010.04244.