# Nearly Optimal Reward-Free Reinforcement Learning

Zihan Zhang 1, Simon S. Du 2, Xiangyang Ji 1

[1 Tsinghua University. 2 University of Washington. Correspondence to: Zihan Zhang, Simon S. Du, Xiangyang Ji. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).]

We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases: in the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal; in the planning phase, the agent needs to return a near-optimal policy for arbitrary reward functions. We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment for at most $O\big(\frac{S^2A}{\epsilon^2}\,\mathrm{poly}\log\big(\frac{SAH}{\epsilon}\big)\big)$ episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Notably, our sample complexity scales only logarithmically with $H$, in contrast to all existing results, which scale polynomially with $H$. Furthermore, this bound matches the minimax lower bound up to logarithmic factors. Our results rely on three new techniques: 1) a new sufficient condition for the dataset to plan for an $\epsilon$-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning; 3) constructing an extended MDP to maximize the truncated accumulative rewards efficiently.

1. Introduction

Reinforcement learning (RL) studies the problem in which an agent aims to maximize its accumulative reward by interacting with an unknown environment. A major challenge in RL is exploration, for which the agent needs to strategically visit new states to learn the transition and reward information therein. To carry out efficient exploration, the agent must follow a well-designed adaptive strategy in which it is properly guided by the reward and transition information, rather than relying on trivial random exploration. Provably efficient algorithms have been proposed to help the agent visit new states efficiently under a fixed reward and transition model; see Section 2 for a review. However, in various applications, it is necessary to re-design the reward function to incentivize the agent to learn new desired behaviors (Altman, 1999; Achiam et al., 2017; Tessler et al., 2018; Miryoosefi et al., 2019). To avoid repeatedly invoking the learning algorithm and interacting with the environment, it is desirable to let the agent efficiently explore the environment without the reward signal and collect data based on which it can compute a near-optimal policy for any reward function. The main challenge of this problem is that the agent needs to collect data that sufficiently covers the state space. This problem was previously studied in (Brafman & Tennenholtz, 2003; Hazan et al., 2019; Du et al., 2019). Recently, Jin et al. (2020) formalized the setting and named it reward-free RL. In this setting, the agent first collects a dataset by interacting with the environment, and is then required to compute an $\epsilon$-optimal policy given any proper reward function. Jin et al. (2020) gave a formal theoretical treatment of this setting.
They designed a method which guarantees that, by collecting $\widetilde O\big(\frac{S^2AH^3}{\epsilon^2}\big)$ episodes, the agent is able to output an $\epsilon$-optimal policy, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the planning horizon.[1] They also provided an $\Omega\big(\frac{S^2A}{\epsilon^2}\big)$ lower bound. Recently, Kaufmann et al. (2020) and Ménard et al. (2020) gave tighter sample complexity bounds. We refer the readers to Table 1 for more details.

Unfortunately, it remains open what the fundamental limit of the sample complexity of reward-free RL is. In particular, compared to the $\Omega\big(\frac{S^2A}{\epsilon^2}\big)$ lower bound, all existing upper bounds have a polynomial dependence on $H$. The gap between the upper and lower bounds can be huge for environments with a long horizon. Conceptually, this gap shows that we still lack understanding of whether a long horizon imposes significant hardness on reward-free RL.

[1] Because we consider reward functions whose total reward is bounded by 1 (instead of $H$ as in their paper), we rescale the error $\epsilon$ to $\epsilon/H$ in their bound.

1.1. Our Contribution

In this work, we break the poly(H)-dependency barrier. We design a new algorithm, Staged Sampling + Truncated Planning (SSTP), which enjoys the following sample complexity guarantee.

Theorem 1. For any $\epsilon,\delta\in(0,1)$, there exists an algorithm (SSTP, Algorithm 1) which can compute an $\epsilon$-optimal policy for any reward function that is non-negative and totally bounded by 1, after $O\big(\frac{SA(S+\log(1/\delta))}{\epsilon^2}\,\mathrm{poly}\log\big(\frac{SAH}{\epsilon}\big)\big)$ episodes of exploration, with probability $1-\delta$.

The significance of our theorem is that we match the lower bound of $\Omega\big(\frac{SA(S+\log(1/\delta))}{\epsilon^2}\big)$ up to logarithmic factors in $S$, $A$, $H$, $1/\epsilon$.[2] Importantly, our bound depends only logarithmically on the planning horizon $H$. This is an exponential improvement over existing results, and demonstrates that a long horizon poses little additional difficulty for reward-free RL. Furthermore, our bound only requires the reward to be totally bounded (cf. Assumption 2), in contrast to uniformly bounded (cf. Assumption 1), which is assumed in previous works. See Section 2 for more discussion.

Remark 1. Jin et al. (2020), Kaufmann et al. (2020) and Ménard et al. (2020) studied reward-free exploration in the non-stationary episodic MDP (i.e., the transition model depends on the level $h\in[H]$), where the lower bound on the sample complexity is at least linear in $H$ because the complexity of the MDP is larger (Jin et al., 2018; Zhang et al., 2020c). In this paper, we consider the episodic MDP with a stationary transition, that is, the transition model is independent of the level. This is often considered to be a more realistic model than the non-stationary transition model. Furthermore, for the non-stationary episodic MDP, our algorithm also provides a reward-free sample complexity of $O\big(\frac{SAH(S+\log(1/\delta))}{\epsilon^2}\big)$ up to logarithmic factors, which matches the current best result in (Ménard et al., 2020) up to logarithmic terms (see Section 6 for more discussion).

2. Related Work

We review relevant works in this section. Comparisons between our algorithm and existing ones on reward-free RL are provided in Table 1.

[2] The original lower bound is for the case where the total reward is bounded by $H$; here we scale down the total reward by a factor of $H$.

| Algorithm | Sample Complexity | Non-unif. Reward | Log H |
|---|---|---|---|
| RF-RL-EXPLORE (Jin et al., 2020) | $\widetilde O\big(\frac{H^3S^2A}{\epsilon^2}\log\frac{1}{\delta}+\frac{H^3S^2A}{\epsilon}\big)$ | No | No |
| RF-UCRL (Kaufmann et al., 2020) | $\widetilde O\big(\frac{H^2SA}{\epsilon^2}\big(S+\log\frac{1}{\delta}\big)\big)$ | No | No |
| RF-EXPRESS (Ménard et al., 2020) | $\widetilde O\big(\frac{HSA}{\epsilon^2}\big(S+\log\frac{1}{\delta}\big)\big)$ | No | No |
| SSTP (This Work) | $\widetilde O\big(\frac{SA(S+\log\frac{1}{\delta})}{\epsilon^2}\big)$ | Yes | Yes |
| Lower Bound (Jin et al., 2020) | $\Omega\big(\frac{SA(S+\log\frac{1}{\delta})}{\epsilon^2}\big)$ | - | - |

Table 1. Sample complexity comparisons for state-of-the-art episodic RL algorithms. See Section 2 for discussions on this table.
$\widetilde O(\cdot)$ omits logarithmic factors in $S$, $A$, $H$, $1/\epsilon$ but not $1/\delta$. Sample Complexity: the number of episodes needed to find an $\epsilon$-suboptimal policy. Non-unif. Reward: Yes means the bound holds under Assumption 2 (which allows non-uniformly bounded rewards), and No means the bound only holds under Assumption 1. Log H: whether the sample complexity bound depends logarithmically on $H$ instead of polynomially on $H$.

Reward-dependent exploration. In reward-dependent exploration, the agent aims to learn an $\epsilon$-optimal policy under a fixed reward. Some works assume there is a generative model which can be queried to provide a sample for any state-action pair $(s,a)$ (Kearns & Singh, 1999; Azar et al., 2013; Sidford et al., 2018; Agarwal et al., 2019; Li et al., 2020), and the sample complexity is defined as the number of queries needed to compute an $\epsilon$-optimal policy. In the online setting (Brafman & Tennenholtz, 2003; Kakade, 2003; Dann & Brunskill, 2015; Dann et al., 2017; 2019; Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019; Kaufmann et al., 2020; Zhang et al., 2020c; Wang et al., 2020a; Zhang et al., 2020b), the agent starts from a fixed initial distribution in each episode and collects a trajectory by interacting with the environment. The sample complexity is then the number of episodes necessary to learn an $\epsilon$-optimal policy. The state-of-the-art result by Zhang et al. (2020b) requires $\widetilde O\big(\frac{SA}{\epsilon^2}+\frac{S^2A}{\epsilon}\big)$ episodes.

Reward Assumption and Dependency on H. For the reward, the widely adopted assumption is $r_h\in[0,1]$ for all $h\in[H]$, which implies a total reward $\sum_{h=1}^H r_h\in[0,H]$. However, as argued in (Kakade, 2003; Jiang & Agarwal, 2018), the characterization of sample complexity should be independent of the scaling, i.e., the target suboptimality $\epsilon\in(0,1)$ should be a relative quantity measuring the performance of an algorithm. To this end, we need to scale the total reward to lie within $[0,1]$. The assumption then becomes:

Assumption 1 (Uniformly Bounded Reward). The reward satisfies $r_h\in[0,1/H]$ for all $h\in[H]$.

Compared to Assumption 1, the totally-bounded reward assumption (Assumption 2) is more general. Therefore, any upper bound under Assumption 2 implies an upper bound under Assumption 1. From a practical point of view, because environments under Assumption 2 can have a high one-step reward, Assumption 2 is more natural for environments with sparse rewards, such as the game of Go, which are often considered to be challenging. From a theoretical point of view, it is more complicated to design efficient algorithms under Assumption 2 due to the global structure of the reward.[3] Recent work (Zanette & Brunskill, 2019; Wang et al., 2020a; Zhang et al., 2020b) made essential progress in reward-dependent exploration under Assumption 2 and obtained sample complexity bounds that scale only logarithmically with $H$. Zanette & Brunskill (2019) showed that the main term (with respect to $1/\epsilon^2$) does not depend on $H$. Wang et al. (2020a) proved a sample complexity bound that is polynomial in $S$, $A$, $1/\epsilon$ and only polylogarithmic in $H$, albeit with an exponential computational cost, and later Zhang et al. (2020b) achieved a nearly sharp sample complexity bound with a computationally efficient algorithm. We use some technical ideas from (Zhang et al., 2020b) (cf. Section 4.2). However, because the problem settings are different, we need new techniques to establish a nearly tight sample complexity bound for reward-free exploration.
Reward-Free RL. The main algorithm in (Jin et al., 2020) assigns a non-zero reward to only one state at a time, and utilizes the regret minimization algorithm EULER (Zanette & Brunskill, 2019) to visit each state as often as possible. Since their algorithm only learns one state at a time, their sample complexity bound is not tight with respect to $H$. Kaufmann et al. (2020) proposed RF-UCRL, which achieves a sample complexity of $\widetilde O\big(\frac{SAH^2}{\epsilon^2}\big(S+\log\frac{1}{\delta}\big)\big)$ by building upper confidence bounds for any reward function and any policy, and then acting greedily with respect to them. The later work by Ménard et al. (2020) constructed an exploration bonus of order $\frac{1}{n(s,a)}$ instead of the classical exploration bonus of order $\frac{1}{\sqrt{n(s,a)}}$, where $n(s,a)$ is the visit count of $(s,a)$. Based on this novel bonus, they achieve a sample complexity of $\widetilde O\big(\frac{SAH}{\epsilon^2}\big(S+\log\frac{1}{\delta}\big)\big)$. Recently, these results have been extended to linear function approximation settings (Wang et al., 2020b; Zanette et al., 2020). Reward-free exploration is also related to another setting, reward-agnostic RL, in which $N$ reward functions are considered in the planning phase. Zhang et al. (2020a) provided an algorithm which achieves $\widetilde O\big(\frac{H^3SA\log(N)\log(1/\delta)}{\epsilon^2}\big)$ sample complexity. See Table 1 for a summary.

[3] Under Assumption 2, the reward still satisfies $r_h\in[0,1]$, so if an algorithm enjoys a sample complexity bound under Assumption 1, then scaling this bound up by a factor of $H^2$ (for a PAC bound) also yields a bound under Assumption 2. However, this reduction is highly suboptimal in terms of $H$, so when comparing with existing results, we display their original results and add a column indicating whether the bound is under Assumption 2 or Assumption 1.

3. Preliminaries

Notation. Throughout this paper, we define $[N]$ to be the set $\{1,2,\ldots,N\}$ for $N\in\mathbb Z_+$. We use $\mathbb I[E]$ to denote the indicator function of an event $E$, i.e., $\mathbb I[E]=1$ if $E$ holds and $\mathbb I[E]=0$ otherwise. For notational convenience, we set $\iota=\ln(2/\delta)$ throughout the paper. For two $n$-dimensional vectors $x$ and $y$, we use $xy$ to denote $x^\top y$, use $\mathbb V(x,y)=\sum_i x_iy_i^2-\big(\sum_i x_iy_i\big)^2$, and use $x^2$ to denote the vector $[x_1^2,x_2^2,\ldots,x_n^2]^\top$ for $x=[x_1,x_2,\ldots,x_n]^\top$. For two vectors $x,y$, $x\le y$ denotes $x_i\le y_i$ for all $i\in[n]$ and $x\ge y$ denotes $x_i\ge y_i$ for all $i\in[n]$. We use $\mathbf 1$ to denote the $S$-dimensional all-ones vector $[1,\ldots,1]^\top$ and $\mathbf 1_s$ to denote the $S$-dimensional vector $[0,\ldots,1,\ldots,0]^\top$ whose only non-zero element is in the $s$-th dimension.

Episodic Reinforcement Learning. We first describe the setting for standard episodic RL. A finite-horizon Markov Decision Process (MDP) is a tuple $M=(\mathcal S,\mathcal A,P,R,H,\mu)$. $\mathcal S$ is the finite state space with cardinality $S$. $\mathcal A$ is the finite action space with cardinality $A$. $P:\mathcal S\times\mathcal A\to\Delta(\mathcal S)$ is the transition operator, which takes a state-action pair and returns a distribution over states. For $h=1,2,\ldots,H$, $R_h:\mathcal S\times\mathcal A\to\Delta(\mathbb R)$ is the reward distribution with mean function $r_h:\mathcal S\times\mathcal A\to\mathbb R$. $H\in\mathbb Z_+$ is the planning horizon (episode length). $\mu\in\Delta(\mathcal S)$ is the initial state distribution. $P$, $R$ and $\mu$ are unknown. For notational convenience, we use $P_{s,a}$ and $P_{s,a,s'}$ to denote $P(\cdot|s,a)$ and $P(s'|s,a)$ respectively. A policy $\pi$ chooses an action $a$ based on the current state $s\in\mathcal S$ and the time step $h\in[H]$. Formally, we define $\pi=\{\pi_h\}_{h=1}^H$, where for each $h\in[H]$, $\pi_h:\mathcal S\to\mathcal A$ maps a given state to an action. The policy $\pi$ induces a (random) trajectory $\{s_1,a_1,r_1,s_2,a_2,r_2,\ldots,s_H,a_H,r_H\}$, where $s_1\sim\mu$, $a_1=\pi_1(s_1)$, $r_1\sim R_1(s_1,a_1)$, $s_2\sim P(\cdot|s_1,a_1)$, $a_2=\pi_2(s_2)$, etc.
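To make the interaction protocol concrete, here is a minimal sketch (not from the paper) that simulates one episode of a stationary tabular MDP under a deterministic policy; the arrays `P`, `r`, `mu` and the policy `pi` are hypothetical stand-ins for the unknown quantities defined above, and the mean reward is returned instead of a sample from $R_h$ for simplicity.

```python
import numpy as np

def rollout(P, r, mu, pi, H, rng=None):
    """Simulate one episode of a stationary tabular MDP.

    P  : (S, A, S) array, P[s, a] is the next-state distribution.
    r  : (H, S, A) array of mean rewards r_h(s, a).
    mu : (S,) initial-state distribution.
    pi : (H, S) integer array, pi[h, s] is the action taken at step h+1.
    Returns the trajectory as a list of (s_h, a_h, r_h, s_{h+1}) tuples.
    """
    rng = np.random.default_rng() if rng is None else rng
    S = mu.shape[0]
    s = rng.choice(S, p=mu)
    traj = []
    for h in range(H):
        a = pi[h, s]
        s_next = rng.choice(S, p=P[s, a])
        traj.append((s, a, r[h, s, a], s_next))
        s = s_next
    return traj

# Toy usage: 3 states, 2 actions, horizon 5, uniform dynamics, a fixed policy.
S, A, H = 3, 2, 5
P = np.full((S, A, S), 1.0 / S)
r = np.zeros((H, S, A)); r[-1, 2, :] = 1.0   # total reward bounded by 1 (Assumption 2 below)
mu = np.array([1.0, 0.0, 0.0])
pi = np.zeros((H, S), dtype=int)
print(rollout(P, r, mu, pi, H))
```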
We use $\mathbb E_{\pi,M}[\cdot]$ and $\mathbb P_{\pi,M}[\cdot]$ to denote respectively the expectation and probability under policy $\pi$ with respect to the MDP $M$, and omit $M$ when it is clear from context. The goal of RL is to find a policy $\pi$ maximizing the expected total reward, i.e., $\max_\pi\mathbb E_\pi\big[\sum_{h=1}^H r_h\big]$, where the expectation is over the initial state distribution $\mu$, the transition operator $P$ and the reward distribution $R$. As for scaling, we make the following assumption about the reward. As discussed in Section 2, this is more general than the assumption made in most previous works.

Assumption 2 (Bounded Total Reward). The reward satisfies $r_h\ge 0$ for all $h\in[H]$. Besides, $\sum_{h=1}^H r_h\le 1$ almost surely.

Given a policy $\pi$, a level $h\in[H]$ and a state-action pair $(s,a)\in\mathcal S\times\mathcal A$, the Q-function is defined as
$$Q^\pi_h(s,a)=\mathbb E_\pi\Big[\sum_{h'=h}^H r_{h'}\,\Big|\,s_h=s,a_h=a\Big].$$
Similarly, given a policy $\pi$ and a level $h\in[H]$, the value function of a given state $s\in\mathcal S$ is defined as
$$V^\pi_h(s)=\mathbb E_\pi\Big[\sum_{h'=h}^H r_{h'}\,\Big|\,s_h=s\Big].$$
The Bellman equation then establishes the following identity for a policy $\pi$ and $(s,a,h)\in\mathcal S\times\mathcal A\times[H]$:
$$Q^\pi_h(s,a)=r_h(s,a)+P_{s,a}^\top V^\pi_{h+1}.$$
Throughout the paper, we let $V^\pi_{H+1}(s)=0$ and $Q^\pi_{H+1}(s,a)=0$ for notational simplicity. We use $Q^*_h$ and $V^*_h$ to denote the optimal Q-function and V-function at level $h\in[H]$, which satisfy, for any state-action pair $(s,a)\in\mathcal S\times\mathcal A$, $Q^*_h(s,a)=\max_\pi Q^\pi_h(s,a)$ and $V^*_h(s)=\max_\pi V^\pi_h(s)$.

Dataset. A dataset $\mathcal D=\{(s^k_h,a^k_h,s^k_{h+1})\}_{(h,k)\in[H]\times[K]}$ consists of the trajectories of $K$ episodes. We define $\{N_{s,a,s'}(\mathcal D)\}_{(s,a,s')\in\mathcal S\times\mathcal A\times\mathcal S}$ to be the visitation counts and $\big\{P_{s,a,s'}(\mathcal D)=\frac{N_{s,a,s'}(\mathcal D)}{\sum_{\tilde s}N_{s,a,\tilde s}(\mathcal D)}\big\}_{(s,a,s')\in\mathcal S\times\mathcal A\times\mathcal S}$ to be the empirical transition probabilities computed from $\mathcal D$, where $P_{s,a,s'}(\mathcal D)$ is defined as $\frac{1}{S}$ when $\sum_{\tilde s}N_{s,a,\tilde s}(\mathcal D)=0$. We further define $N_{s,a}(\mathcal D)=\sum_{\tilde s}N_{s,a,\tilde s}(\mathcal D)$ and let $P_{s,a}(\mathcal D)$ be the vector whose $s'$-th entry is $P_{s,a,s'}(\mathcal D)$, for all $(s,a)\in\mathcal S\times\mathcal A$. With the notation defined above, we let $N(\mathcal D)$ and $P(\mathcal D)$ be shorthand for $\{N_{s,a}(\mathcal D)\}_{(s,a)\in\mathcal S\times\mathcal A}$ and $\{P_{s,a}(\mathcal D)\}_{(s,a)\in\mathcal S\times\mathcal A}$ respectively.

Reward-Free Reinforcement Learning. Now we formally describe reward-free RL. Let $\epsilon,\delta\in(0,1)$ be the thresholds for suboptimality and failure probability. Reward-free RL consists of two phases. In the exploration phase, the algorithm collects a dataset $\mathcal D$ by interacting with the environment without reward information; in the planning phase, given any reward function $r$ satisfying Assumption 2, the agent is asked to output an $\epsilon$-optimal policy with probability at least $1-\delta$. The performance of an algorithm is measured by the number of episodes $K$ used in the exploration phase to ensure that the planning phase succeeds.
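As a concrete illustration of the dataset notation just defined (including the uniform fallback for unvisited pairs, as read from the definition above), the following sketch computes $N_{s,a,s'}(\mathcal D)$, $N_{s,a}(\mathcal D)$ and $P_{s,a}(\mathcal D)$ from a list of trajectories; it is an illustrative reading of the definitions, not code from the paper.

```python
import numpy as np

def empirical_model(trajectories, S, A):
    """Visitation counts and empirical transitions from a dataset of trajectories.

    trajectories: iterable of episodes, each a list of (s, a, s_next) triples.
    Returns (N_sas, N_sa, P_hat) with shapes (S, A, S), (S, A), (S, A, S).
    """
    N_sas = np.zeros((S, A, S))
    for episode in trajectories:
        for (s, a, s_next) in episode:
            N_sas[s, a, s_next] += 1
    N_sa = N_sas.sum(axis=2)
    # P_{s,a,s'} = N_{s,a,s'} / N_{s,a}; unvisited pairs default to uniform 1/S.
    P_hat = np.where(N_sa[:, :, None] > 0,
                     N_sas / np.maximum(N_sa[:, :, None], 1),
                     1.0 / S)
    return N_sas, N_sa, P_hat

# Toy usage with S = 3, A = 2.
D = [[(0, 1, 2), (2, 0, 1)], [(0, 1, 1), (1, 0, 2)]]
N_sas, N_sa, P_hat = empirical_model(D, S=3, A=2)
print(N_sa)
print(P_hat[0, 1])   # empirical next-state distribution for the pair (s=0, a=1)
```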
4. Technique Overview

The proposed algorithm has two main components: the sampling phase and the planning phase. At a high level, we first propose a sufficient condition (see Condition 2) for the agent to use the collected samples to learn an $\epsilon$-optimal policy for any reward function satisfying Assumption 2. We then apply a modified version of Rmax (Brafman & Tennenholtz, 2003) in the sampling phase to obtain samples satisfying Condition 2.

4.1. Planning Phase

4.1.1. A Tight Sufficient Condition

To obtain a near-optimal policy for any given reward function, a sufficient condition is to collect $N$ samples for each $(s,a)$ pair, where $N$ is some polynomial function of $S$, $A$ and $1/\epsilon$. However, some $(s,a)$ pairs might be rarely visited under any policy, so it is hard to get enough samples for such pairs. To address this problem, we observe that such state-action pairs contribute little to the accumulative reward. As mentioned in (Jin et al., 2020), if the maximal expected visit count of $(s,a)$ is $\lambda(s,a)$, then $N\lambda(s,a)$ samples of $(s,a)$ are sufficient to compute a good policy. Instead of considering each $(s,a)$ pair one by one, we hope to divide the state-action space into a group of disjoint subsets, such that the maximal expected visit count of each subset is proportional to the minimal visit count within this subset. This leads to a sufficient condition on the dataset used in the planning phase.

Condition 1. Let $K=\lfloor\log_2(2H/\epsilon)\rfloor$. Given the dataset $\mathcal D$, the state-action space $\mathcal S\times\mathcal A$ can be divided into $K+1$ subsets $\mathcal S\times\mathcal A=X_1\cup X_2\cup\ldots\cup X_{K+1}$ such that: (1) for any $1\le i\le K$, $N_{s,a}(\mathcal D)\ge N_i:=\frac{4SH\iota}{2^i\epsilon^2}$ for any $(s,a)\in X_i$; (2) for each $1\le i\le K+1$, it holds that $\sup_\pi\mathbb E_\pi\big[\sum_{h=1}^H\mathbb I[(s_h,a_h)\in X_i]\big]\le\frac{H}{2^{i-1}}$.

The following proposition shows that this condition is sufficient. The proof of Proposition 1 is postponed to Appendix D.

Proposition 1. Suppose Condition 1 holds for the dataset $\mathcal D$. Given any reward function $r$ satisfying Assumption 2, with probability $1-4S^2A(\log_2(T_0H)+2)\delta$, QCOMPUTING$(P(\mathcal D),N(\mathcal D),r)$ (see the corresponding algorithm in the appendix) returns an $\epsilon$-optimal policy.

In previous work on reward-free exploration (Jin et al., 2020; Kaufmann et al., 2020; Ménard et al., 2020), sufficient conditions similar to Condition 1 have been used to prove efficient reward-free exploration. However, to obtain a dataset satisfying Condition 1, the sample complexity is at least polynomial in $H$ in the worst case, which is the main barrier in previous work. We give a simple counter-example to explain why Condition 1 is hard to satisfy without a poly(H) number of episodes. Suppose there is a state $\bar s$ such that $P_{s,a,\bar s}=H^{-2}$ for any other $(s,a)\in\mathcal S\times\mathcal A$, and $P_{\bar s,a,\bar s}=1$ for any action $a$. A direct computation gives $\lambda(\bar s):=\sum_a\lambda(\bar s,a)=\Theta(H^2\cdot H^{-2})=\Theta(1)$. However, the probability that the agent never visits $\bar s$ in $N$ episodes is at least $(1-H^{-1})^N\approx e^{-NH^{-1}}=e^{-\Theta(N\lambda(\bar s)/H)}$. In the case $N\le H$, the expected visit count over $N$ episodes is $N\lambda(\bar s)$, while the empirical visit count can be $0$ with constant probability, which implies that the expected and empirical visit counts can be very different in the $N=o(H)$ regime.

To address this problem, we observe that in the example above, the probability that the agent reaches $\bar s$ is relatively small. If we simply ignore $\bar s$, the regret due to this ignorance is at most $O(H^{-1})=O(\lambda(\bar s)/H)$, instead of the original regret bound of $O(\lambda(\bar s))$. This observation leads to our main novel condition for planning a near-optimal policy given any reward function satisfying Assumption 2, which is one of our key technical contributions.

Condition 2. Recall $K=\lfloor\log_2(2H/\epsilon)\rfloor$. The state-action space $\mathcal S\times\mathcal A$ can be divided into $K+1$ subsets $\mathcal S\times\mathcal A=X_1\cup X_2\cup\ldots\cup X_{K+1}$ such that: (1) $N(s,a)\ge N_i:=\frac{4H(\iota+6S\ln(SAH/\epsilon))}{2^i\epsilon^2}$ for any $(s,a)\in X_i$, for $1\le i\le K$; (2) with $Z_i:=\max\{\min\{\frac{2H}{2^i\epsilon},H\},1\}$ for each $1\le i\le K+1$, it holds that $\sup_\pi\mathbb P_\pi\big[\sum_{h=1}^H\mathbb I[(s_h,a_h)\in X_i]>Z_i\big]\le(K+1)\epsilon$ and $\sup_\pi\mathbb E_\pi\big[\min\{\sum_{h=1}^H\mathbb I[(s_h,a_h)\in X_i],Z_i\}\big]\le\frac{H}{2^{i-1}}$ for each $1\le i\le K+1$.

Under Condition 2, the state-action space is divided into $K+1$ subsets according to visit counts. For the state-action pairs with visit counts in $[N_i,N_{i-1})$, in contrast to the second requirement in Condition 1, we require that the maximal truncated expected visit count be bounded proportionally to their visit counts. Let $E_i$ be the set of trajectories satisfying $\sum_{h=1}^H\mathbb I[(s_h,a_h)\in X_i]>Z_i$. We also require that the probability of $E_i$ be no larger than $(K+1)\epsilon$ for any policy.
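To get a feel for the quantities in Condition 2, the sketch below tabulates $K$, $N_i$ and $Z_i$ for concrete values of $S$, $A$, $H$, $\epsilon$, $\delta$; the constants follow the reconstructed formulas stated above (so they are illustrative rather than authoritative). The point is simply that $N_i$ halves from one group to the next while $Z_i$ drops far below $H$ for large $i$, and only $K+1=O(\log(H/\epsilon))$ groups are ever needed.

```python
import math

def condition2_thresholds(S, A, H, eps, delta):
    """Tabulate K, N_i and Z_i as used in Condition 2 (constants follow the
    reconstruction in the text above and are illustrative only)."""
    K = math.floor(math.log2(2 * H / eps))
    iota = math.log(2 / delta)
    log_term = iota + 6 * S * math.log(S * A * H / eps)
    rows = []
    for i in range(1, K + 2):
        N_i = 4 * H * log_term / (2 ** i * eps ** 2)   # sample requirement for X_i
        Z_i = max(min(2 * H / (2 ** i * eps), H), 1)   # truncation level for X_i
        rows.append((i, N_i, Z_i))
    return K, rows

K, rows = condition2_thresholds(S=10, A=5, H=1000, eps=0.1, delta=0.01)
print("K =", K)
for i, N_i, Z_i in rows[:3] + rows[-2:]:
    print(f"i={i:2d}  N_i={N_i:14.1f}  Z_i={Z_i:8.1f}")
```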
In fact, we directly pay a loss of $\sup_\pi\mathbb P_\pi[E_i]$ by ignoring $E_i$ when computing the value function. On the other hand, $Z_i$ is far smaller than $H$ when $i$ is relatively large, which makes it possible to collect enough samples to satisfy Condition 2. The selection of $Z_i$ is quite non-trivial. On one hand, we need $Z_i$ large enough so that it is possible to ensure $\sup_\pi\mathbb P_\pi\big[\sum_{h=1}^H\mathbb I[(s_h,a_h)\in X_i]>Z_i\big]$ is no larger than $(K+1)\epsilon$ (for example, by choosing $Z_i=H+1$ we can trivially make this probability $0$); on the other hand, we need $Z_i$ small enough to get rid of the polynomial dependence on $H$. One possible solution is to set $Z_i$ to scale linearly with the maximal expected visit count of $X_i$, which plays a crucial role in the analysis.

4.1.2. Planning Using an Auxiliary MDP

Suppose Condition 2 holds for some dataset $\mathcal D$ with the partition $\{X_i\}_{i=1}^{K+1}$. Because Condition 2 only bounds the truncated maximal expected visit counts, standard planning methods do not work directly. The main difficulty is that, to apply the bounds on $\sup_\pi\mathbb E_\pi\big[\min\{\sum_{h=1}^H\mathbb I[(s_h,a_h)\in X_i],Z_i\}\big]$ and $\sup_\pi\mathbb P_\pi[E_i]$, we should set the reward to $0$ once $X_i$ has been visited more than $Z_i$ times in an episode. A naive solution is to encode the visit counts of $\{X_i\}_{i=1}^{K+1}$ into the state space. However, in this approach, the size of the new state space is exponential in $S$, which leads to an exponential computational cost. For this reason, to the best of our knowledge, no existing algorithm can directly learn such a truncated MDP. To address this problem, we consider an auxiliary MDP $\widetilde M=\big(\mathcal S\cup\{s_{\mathrm{end}}\},\mathcal A,\tilde r,\widetilde P,\mu\big)$. Here $s_{\mathrm{end}}$ is an additional absorbing state. The reward function $\tilde r$ is the same as $r$ except for an additional zero reward at $s_{\mathrm{end}}$, and the transition probabilities are given by $\widetilde P_{s,a}=(1-\frac{1}{Z_i})\hat P_{s,a}+\frac{1}{Z_i}\mathbf 1_{s_{\mathrm{end}}}$ for any $(s,a)\in X_i$, and $\widetilde P_{s_{\mathrm{end}},a}=\mathbf 1_{s_{\mathrm{end}}}$ for any $a$. In words, we add an absorbing state to the original MDP such that the agent falls into $s_{\mathrm{end}}$ once it has visited $X_i$ roughly $Z_i$ times in expectation, for some $1\le i\le K+1$. Instead of learning the truncated MDP, we thus consider a soft-truncated MDP, which exponentially reduces the computational cost. For more details, we refer the readers to Section 5.2.

4.2. Sampling Phase

Having identified the sufficient condition, we need to design an algorithm to collect a set of samples satisfying it. We construct the partition $\mathcal S\times\mathcal A=\cup_{i=1}^{K+1}X_i$ by specifying $X_i$ for $i=1,2,\ldots,K+1$ one by one. We divide the learning process into $K$ stages. Take the first stage as an example. At the beginning of the first stage, we assign reward $1$ to all $(s,a)\in\mathcal S\times\mathcal A$ and proceed to learn with this reward. As in Rmax, whenever the visit count of some $(s,a)$ pair becomes equal to or larger than $N_1$, we say this $(s,a)$ is known and set $r(s,a)=0$. We will discuss the problem of regret minimization for this MDP with a time-varying reward function later, and for now simply assume the regret is properly bounded. Defining $X_1$ to be the set of known state-action pairs after the first stage, the first statement in Condition 2 holds trivially for $X_1$. Besides, the length of each stage is properly designed: combining it with the regret bound, we show that the maximal expected visit count of the unknown state-action pairs is properly bounded. Because $X_2\subseteq(X_1)^C$, we learn that the second part of the second statement in Condition 2 holds for $X_2$. We then continue to learn the second subset $X_2$, and so on. Note that in the arguments above we do not introduce $Z_i$, because $Z_i=H$ for the beginning stages by definition.
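The following sketch makes the staged, Rmax-style bookkeeping of this subsection concrete for a single stage; `run_episode` is a hypothetical stand-in for the regret-minimization subroutine (instantiated by TRVRL in Section 5.1), so this is an illustration of the scheme rather than a transcription of the paper's STAGED SAMPLING routine.

```python
import numpy as np

def sampling_stage(N_i, T0, counts, run_episode):
    """One stage of the staged sampling scheme of Section 4.2 (illustrative sketch).

    counts      : (S, A) array of visit counts carried over from earlier stages.
    run_episode : hypothetical oracle; given the current 0/1 exploration reward,
                  it plays one episode and returns the (s, a) pairs it visited.
    Returns X_i (pairs that reached N_i visits during this stage) and the set of
    pairs that are still unknown afterwards.
    """
    unknown_at_start = set(zip(*np.where(counts < N_i)))   # Y_i
    for _ in range(T0):
        # Rmax-style reward: 1 on pairs still below the stage-i threshold, else 0.
        reward = (counts < N_i).astype(float)
        for (s, a) in run_episode(reward):
            counts[s, a] += 1
    still_unknown = set(zip(*np.where(counts < N_i)))      # Y_{i+1}
    X_i = unknown_at_start - still_unknown
    return X_i, still_unknown
```

Running this loop for stages $i=1,\ldots,K$ with shrinking thresholds $N_i$ yields the partition $\{X_i\}_{i=1}^{K+1}$, with $X_{K+1}$ the pairs that never become known.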
In the case $Z_i<H$, there are two major problems.

The regret minimization algorithm. Most regret minimization algorithms, such as (Azar et al., 2017; Zanette & Brunskill, 2019), work in the regime $Z_i=H$, where no truncation occurs. However, in the case $Z_i\ll H$, the regret bounds of these algorithms depend polynomially on $H$. To address this problem, we construct an expanded MDP with truncated cumulative reward (see the definition in Section 5.1), in which the Q-function is strictly bounded by $Z_i$. In this way, we obtain the desired regret bounds. We would like to mention that our algorithm is somewhat similar to the recent work (Zhang et al., 2020b), which addresses the regret minimization problem with a totally-bounded reward function. More precisely, after re-scaling, the reward function in our regret minimization problem is also totally bounded by $1$ and each single reward is bounded by $1/Z_i$. Although the reward function might vary across episodes, we can obtain efficient regret bounds in a way similar to the analysis in (Zhang et al., 2020b).

Bound of $\mathbb P_\pi[E_{i+1}]$. Recall that $E_i$ is the set of trajectories satisfying $\sum_{h=1}^H\mathbb I[(s_h,a_h)\in X_i]>Z_i$. Define $Y_{i+1}:=(\mathcal S\times\mathcal A)/(X_1\cup\ldots\cup X_i)$ for $i\ge1$ and $Y_1:=\mathcal S\times\mathcal A$. It is clear that $Y_i$ is decreasing in $i$. By the upper bound on the regret (see Lemma 3), we show that the maximal truncated expected visit count $\sup_\pi\mathbb E_\pi\big[\min\{\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_i],Z_i\}\big]$ is properly bounded. Noting that $Z_{i+1}\le Z_i$ and $X_{i+1}\subseteq Y_{i+1}\subseteq Y_i$, we have that
$$\sup_\pi\mathbb P_\pi[E_{i+1}]\ \le\ \sup_\pi\mathbb P_\pi\Big[\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_{i+1}]>Z_{i+1}\Big]\ \le\ \sup_\pi\Big(\mathbb P_\pi\Big[\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_{i+1}]>Z_i\Big]+\mathbb P_\pi\Big[\min\Big\{\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_{i+1}],Z_i\Big\}>Z_{i+1}\Big]\Big)$$
$$\le\ \sup_\pi\mathbb P_\pi\Big[\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_i]>Z_i\Big]+\sup_\pi\frac{\mathbb E_\pi\big[\min\{\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_{i+1}],Z_i\}\big]}{Z_{i+1}},\qquad(1)$$
where the last step uses $Y_{i+1}\subseteq Y_i$ and Markov's inequality. By properly choosing the value of $Z_i$, the second term on the RHS of (1) can be bounded by $O(\epsilon)$. Then, by induction, we show that $\sup_\pi\mathbb P_\pi[E_{i+1}]\le K\epsilon$. Noting that $K=\lfloor\log_2(2H/\epsilon)\rfloor$ is a logarithmic term, we can bound the probability $\sup_\pi\mathbb P_\pi[E_{i+1}]$ properly.

Algorithm 1 MAIN ALGORITHM: STAGED SAMPLING + TRUNCATED PLANNING
1: $(\mathcal D,\{X_i\}_{i=1}^{K+1})\leftarrow$ STAGED SAMPLING (Algorithm 2);
2: Given any reward function $r$ satisfying Assumption 2, return TRUNCATED PLANNING$(\mathcal D,\{X_i\}_{i=1}^{K+1},r)$ (Algorithm 3).

Following the arguments above, we set the number of episodes in each stage to be $T_0:=\frac{C_1l_1SA(\iota+6S\ln(SAH/\epsilon))\log_2(H)}{\epsilon^2}$, where $C_1$ is some large enough constant and $l_1$ is a poly-logarithmic term in $(S,A,H,1/\epsilon)$. At the beginning of an episode in the $i$-th stage, we assign reward $1$ to a state-action pair if its visit count is less than $N_i$, and $0$ otherwise. We then apply Algorithm 4 to minimize the regret in each stage, and finally obtain $X_1,X_2,\ldots,X_{K+1}$. For more technical details, we refer the reader to Sections 5.1 and 5.2.

5. Proof of Theorem 1

As in Section 4, our proof consists of two parts, one for the sampling phase and another for the planning phase. We state the main lemmas for these two parts respectively.

Lemma 1. By running Algorithm 2, with probability $1-\big(2K(\log_2(T_0H)+1)\log_2(T_0H)+4S^2A(\log_2(H)+2)\big)\delta$, we can collect a dataset $\mathcal D$ and obtain a partition $\{X_i\}_{i=1}^{K+1}$ such that Condition 2 holds for $\mathcal D$ with $\{X_i\}_{i=1}^{K+1}$. Moreover, Algorithm 2 consumes at most $KT_0=O\big(\frac{SA(\iota+S)}{\epsilon^2}\big)$ episodes, up to poly-logarithmic factors.

Lemma 2. Assuming Condition 2 holds for the collected dataset $\mathcal D$ with partition $\{X_i\}_{i=1}^{K+1}$, with probability $1-4S^2AKT_0H(\log_2(T_0H)+2)\delta$, Algorithm 3 computes an $\epsilon$-optimal policy from these samples for any reward function $r$ satisfying Assumption 2.

Theorem 1 follows by combining Lemma 1 with Lemma 2 and rescaling $\delta$ by a $\mathrm{poly}(S,A,1/\epsilon,\log(H))$ factor. The rest of this section is devoted to the proofs of Lemma 1 and Lemma 2.
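The expanded MDP with truncated cumulative reward mentioned above (and formalized at the start of Section 5.1 below) can be sketched as follows; this is an illustrative, zero-indexed reading of that construction with hypothetical inputs `P` and `unknown`, not the paper's exact definition.

```python
import numpy as np

def expanded_mdp(P, unknown, Z):
    """Build the expanded MDP used for regret minimization within a stage.

    A state (s, z) augments s with a counter z in {1, ..., Z + 1} of how many
    'unknown' pairs have been visited so far in the episode; reward 1 is paid
    for visiting an unknown pair only while z <= Z, which truncates the return.

    P       : (S, A, S) transition probabilities of the original MDP.
    unknown : (S, A) boolean mask of the currently unknown pairs.
    Z       : truncation level Z_i of the current stage.
    Returns (P_exp, r_exp) with shapes (S, Z+1, A, S, Z+1) and (S, Z+1, A).
    """
    S, A, _ = P.shape
    P_exp = np.zeros((S, Z + 1, A, S, Z + 1))
    r_exp = np.zeros((S, Z + 1, A))
    for s in range(S):
        for z in range(1, Z + 2):               # counter values z = 1, ..., Z+1
            zi = z - 1                          # 0-based index of the counter
            for a in range(A):
                if unknown[s, a] and z <= Z:
                    r_exp[s, zi, a] = 1.0
                z_next = min(z + 1, Z + 1) if unknown[s, a] else z
                P_exp[s, zi, a, :, z_next - 1] = P[s, a]
    return P_exp, r_exp
```

Running any standard tabular regret-minimization routine on `(P_exp, r_exp)` then maximizes exactly the truncated visit count $\mathbb E_\pi\big[\min\{\sum_h\mathbb I[(s_h,a_h)\ \text{unknown}],Z\}\big]$, and the optimal value of the expanded MDP is bounded by $Z$ by construction, which is what removes the polynomial dependence on $H$ from the regret bound.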
5.1. Sampling Phase: Proof of Lemma 1

As mentioned in Section 4, we aim to collect samples such that Condition 2 holds. Our algorithm proceeds in $K$ stages, where each stage consists of $T_0$ episodes. Therefore, at most $KT_0=O\big(\frac{SA(\iota+S)}{\epsilon^2}\big)$ episodes (up to poly-logarithmic factors) are needed to run Algorithm 2. In an episode, say the $k$-th episode in the $i$-th stage, we define $Y^k=\{(s,a)\,|\,N^k(s,a)<N_i\}$ to be the set of unknown state-action pairs, where $N^k(s,a)$ is the visit count of $(s,a)$ before the $k$-th episode. In particular, we define $Y_i:=Y^{k(i)}$, where $k(i)$ is the first episode of the $i$-th stage.

Algorithm 2 STAGED SAMPLING
1: Initialize: $\mathcal D\leftarrow\emptyset$, $Y_1\leftarrow\mathcal S\times\mathcal A$
2: for $i=1,2,\ldots,K$ do
3:   $(\mathcal D_i,Y_{i+1})\leftarrow$ TRVRL$(i,Y_i)$;
4:   $\mathcal D\leftarrow\mathcal D\cup\mathcal D_i$;
5:   $X_i\leftarrow Y_i/Y_{i+1}$
6: end for
7: $X_{K+1}\leftarrow Y_{K+1}$;
8: Return $(\mathcal D,\{X_i\}_{i=1}^{K+1})$

Note that $X_i$ is defined as $Y_i/Y_{i+1}$ and $Y_1=\mathcal S\times\mathcal A$, which means $Y_{i+1}=Y_1/(X_1\cup\ldots\cup X_i)=(\mathcal S\times\mathcal A)/(X_1\cup\ldots\cup X_i)$, matching the definition in Section 4.2.

To learn the unknown state-action pairs, we adopt the idea of Rmax by setting the reward function to $r^k(s,a)=\mathbb I\big[(s,a)\in Y^k\big]$. However, by the definition of Condition 2, it suffices to assign reward $1$ only to the first $Z_i$ visits to $Y^k$ in an episode. It thus corresponds to learning a policy maximizing $\mathbb E_\pi\big[\min\{\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y^k],Z_i\}\big]$. To address this learning problem, we consider an expanded MDP $M^k=(\mathcal S^k,\mathcal A^k,P^k,r^k,\mu^k)$ with
$$\mathcal S^k=\mathcal S\times[Z_i+1];\qquad \mathcal A^k=\mathcal A;$$
$$r^k(s,z,a)=\mathbb I\big[(s,a)\in Y^k,\ z\le Z_i\big],\quad\forall(s,z,a)\in\mathcal S^k\times\mathcal A^k;$$
$$P^k(s',z'\,|\,s,z,a)=P(s'|s,a)\,\mathbb I\big[\{z'=z+1\}\cap\{(s,a)\in Y^k\}\ \cup\ \{z'=z\}\cap\{(s,a)\notin Y^k\}\big],\quad\forall(s,a)\in\mathcal S\times\mathcal A,\ z\in[Z_i];$$
$$P^k(s',Z_i+1\,|\,s,Z_i+1,a)=P(s'|s,a),\quad\forall(s,a)\in\mathcal S\times\mathcal A;$$
$$\mu^k(s,z)=\mu(s)\,\mathbb I[z=1].$$
Roughly speaking, a state in $M^k$ not only represents the position in $\mathcal S$, but also records how much exploration reward the agent has collected in the current episode. We then define the pseudo-regret in the $i$-th stage as
$$R_i:=\sum_{k\text{ in stage }i}\Big(V^*_1(M^k)-\sum_{h=1}^H r^k_h\Big),$$
where $V^*_1(M^k)$ denotes the optimal value of $M^k$ and $r^k_h$ is shorthand for $r^k(s^k_h,z^k_h,a^k_h)$. We show that $R_i$ can be bounded in a way similar to (Zhang et al., 2020b).

Lemma 3. For any $1\le i\le K$, by running Algorithm 4 with input $i$, with probability $1-\big(2(\log_2(T_0H)+1)\log_2(T_0H)+4SA(\log_2(Z_i)+2)\big)\delta$, $R_i$ is bounded by
$$R_i\le O\Big(\sqrt{Z_i\,SAT_0(S+\iota)}+Z_i\,SA(S+\iota)\Big).$$
Algorithm 4 and the proof of Lemma 3 are postponed to Appendix B due to space limitations.

Fix $i$. Recall that $k(i)$ is the first episode in the $i$-th stage. We define $u^k:=\sup_\pi\mathbb E_\pi\big[\min\{\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y^k],Z_i\}\big]$, $\bar u_i:=u^{k(i)}$ and $u_i:=u^{\bar k(i)}$, where $\bar k(i)$ is the index of the last episode in the $i$-th stage. Because $r^k$ is non-increasing in $k$, $u^k$ is also non-increasing in $k$. If $u_i>\frac{H}{2^i}$, then by Lemma 3 and the definition of $T_0$ there exist a constant $C_2$ and a poly-logarithmic factor $l_2$ in $(S,A,H,1/\epsilon)$ such that
$$\sum_{k\text{ in stage }i}\sum_{h=1}^H r^k_h\ \ge\ \frac{T_0H}{2^i}-C_2l_2\Big(\sqrt{Z_i\,SAT_0(S+\iota)}+Z_i\,SA(S+\iota)\Big).$$
By choosing $C_1l_1\ge 8C_2l_2$, we have that
$$\sum_{k\text{ in stage }i}\sum_{h=1}^H r^k_h\ \ge\ \frac{T_0H}{2^{i+1}}\ \ge\ \frac{C_1l_1\,SAH(\iota+6S\ln(SAH/\epsilon))}{2^{i+1}\epsilon^2}\ =\ \frac{C_1l_1}{8}\,SAN_i.\qquad(3)$$
By choosing $C_1\ge16$, we learn that $\sum_{k\text{ in stage }i}\sum_{h=1}^H r^k_h>2SAN_i$. On the other hand, each $(s,a)$ can provide at most $N_i$ rewards in the $i$-th stage, which implies that $\sum_{k\text{ in stage }i}\sum_{h=1}^H r^k_h\le 2SAN_i$. This leads to a contradiction. We therefore have $u_i\le\frac{H}{2^i}$. Again because the reward function is non-increasing in $k$, we have that
$$\sup_\pi\mathbb E_\pi\Big[\min\Big\{\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_{i+1}],Z_i\Big\}\Big]\le u_i\le\frac{H}{2^i}.$$
Define $p_i:=\sup_\pi\mathbb P_\pi\big[\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_i]>Z_i\big]$. Then we have that
$$p_{i+1}\ \le\ \sup_\pi\mathbb P_\pi\Big[\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_{i+1}]>Z_i\Big]+\sup_\pi\frac{\mathbb E_\pi\big[\min\{\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_{i+1}],Z_i\}\big]}{Z_{i+1}}\ \le\ p_i+\frac{H/2^i}{Z_{i+1}}\ \le\ p_i+\epsilon.$$
By induction, we can obtain that $p_i\le i\epsilon\le(K+1)\epsilon$ and $\sum_{i=1}^{K+1}p_i\le(K+1)^2\epsilon$.

We claim that Condition 2 holds by defining $X_i=Y_i/Y_{i+1}$ for $1\le i\le K$ and $X_{K+1}=Y_{K+1}$. (1) By the definition of $Y_{i+1}$, for any $(s,a)\in X_i$ we have $N(s,a)\ge 2N_{i+1}=N_i$. (2) By the arguments above, for each $1\le i\le K+1$,
$$\sup_\pi\mathbb P_\pi\Big[\sum_{h=1}^H\mathbb I[(s_h,a_h)\in X_i]>Z_i\Big]\le\sup_\pi\mathbb P_\pi\Big[\sum_{h=1}^H\mathbb I[(s_h,a_h)\in Y_i]>Z_i\Big]=p_i\le(K+1)\epsilon$$
and
$$\sup_\pi\mathbb E_\pi\Big[\min\Big\{\sum_{h=1}^H\mathbb I[(s_h,a_h)\in X_i],Z_i\Big\}\Big]\le u_{i-1}\le\frac{H}{2^{i-1}}.$$
Noting that there are exactly $K$ stages and each stage consists of $T_0=\frac{C_1l_1SA(\iota+6S\ln(SAH/\epsilon))\log_2(H)}{\epsilon^2}$ episodes, we conclude that we can collect a dataset satisfying Condition 2 within $KT_0=O\big(\frac{SA(S+\ln(1/\delta))}{\epsilon^2}\big)$ episodes, up to poly-logarithmic factors.

5.2. Planning Phase: Proof of Lemma 2

Suppose we have a dataset $\mathcal D$ satisfying Condition 2 with partition $\{X_i\}_{i=1}^{K+1}$. Let $\hat P_{s,a}$ and $N(s,a)$ be shorthand for $P_{s,a}(\mathcal D)$ and $N_{s,a}(\mathcal D)$ respectively, i.e., the empirical transition and visit count of $(s,a)$. As mentioned in Section 4, we consider the reward-free auxiliary MDP $\widetilde M=\big(\mathcal S\cup\{s_{\mathrm{end}}\},\mathcal A,\widetilde P,\mu\big)$, whose transition function is given by $\widetilde P_{s,a}=(1-\frac{1}{Z_i})P_{s,a}+\frac{1}{Z_i}\mathbf 1_{s_{\mathrm{end}}}$ for all $(s,a)\in X_i$, and $\widetilde P_{s_{\mathrm{end}},a}=\mathbf 1_{s_{\mathrm{end}}}$ for any $a$. We first show that, for any policy, the value function under $\widetilde M$ is $\widetilde O(\epsilon)$-close to that under $M$.

Lemma 4. For any policy $\pi$ and any reward function $r$ satisfying Assumption 2 with $r(s_{\mathrm{end}},\cdot)=0$, let $V^\pi_1$ and $\widetilde V^\pi_1$ be the value functions under $M$ and $\widetilde M$ respectively. Then $\big|V^\pi_1-\widetilde V^\pi_1\big|\le 4(K+1)^2\epsilon$.

Instead of learning $M$, we aim to learn $\widetilde M$. Let $\hat P_{s,a}$ be the empirical transition computed from the collected samples. As described in Algorithm 3, for each $1\le i\le K+1$ we set $\hat P_{s,a}\leftarrow(1-\frac{1}{Z_i})\hat P_{s,a}+\frac{1}{Z_i}\mathbf 1_{s_{\mathrm{end}}}$ for $(s,a)\in X_i$, and then update the Q-function and value function backward in an optimistic way. The final output policy is the greedy policy induced by this Q-function.

Algorithm 3 TRUNCATED PLANNING
1: Input: the partition $\{X_i\}_{i=1}^{K+1}$; the dataset $\mathcal D$; the reward function $r$.
2: Initialize: $r(s_{\mathrm{end}},a)\leftarrow 0$ and $\hat P(\cdot|s_{\mathrm{end}},a)\leftarrow\mathbf 1_{s_{\mathrm{end}}}$ for all $a\in\mathcal A$; $\hat P\leftarrow\{P_{s,a}(\mathcal D)\}_{(s,a)\in\mathcal S\times\mathcal A}$; $N\leftarrow\{N_{s,a}(\mathcal D)\}_{(s,a)\in\mathcal S\times\mathcal A}$;
3: for $i=1,2,\ldots,K+1$ do
4:   $\hat P_{s,a}\leftarrow(1-\frac{1}{Z_i})\hat P_{s,a}+\frac{1}{Z_i}\mathbf 1_{s_{\mathrm{end}}}$ for any $(s,a)\in X_i$;
5: end for
6: for $(s,a,h)\in\mathcal S\times\mathcal A\times[H]$ do
7:   $Q_h(s,a)\leftarrow 1$;
8: end for
9: for $(a,h)\in\mathcal A\times[H]$ do
10:   $Q_h(s_{\mathrm{end}},a)\leftarrow 0$;
11: end for
12: for $h=H,H-1,\ldots,1$ do
13:   for $(s,a)\in\mathcal S\times\mathcal A$ do
14:     $b_h(s,a)\leftarrow\sqrt{\frac{\mathbb V(\hat P_{s,a},V_{h+1})\,\iota_1}{N(s,a)}}+\frac{29\iota_1}{3N(s,a)}$;  $Q_h(s,a)\leftarrow\min\{r(s,a)+\hat P_{s,a}V_{h+1}+b_h(s,a),\,1\}$;  $V_h(s)\leftarrow\max_a Q_h(s,a)$;
15:   end for
16: end for
17: $\pi_h(s)\leftarrow\arg\max_a Q_h(s,a)$ for all $s,h$;
18: Return $\pi$.

We first verify that the Q-function is optimistic.

Lemma 5. With probability $1-4S^2AKT_0H(\log_2(T_0H)+2)\delta$, $Q_h(s,a)\ge Q^*_h(s,a)$ for any $(s,a,h)$.

Without loss of generality, we assume $\mu=\mathbf 1_{s_1}$. We now bound the gap $V^*_1(s_1)-V^{\pi}_1(s_1)$. Let $\epsilon_1:=\min\{\frac{\epsilon}{T_0H},\frac{\epsilon^3}{H^4}\}$, $\delta_1:=\frac{\epsilon_1\delta}{S}$ and $\iota_1:=\ln(1/\delta_1)=O\big(\ln\frac{SH^4}{\delta\epsilon^3}\big)$.

Lemma 6. With probability $1-4S^2AKT_0H(\log_2(T_0H)+2)\delta$, it holds that
$$V^*_1(s_1)-V^{\pi}_1(s_1)\ \le\ V_1(s_1)-V^{\pi}_1(s_1)\ \le\ O\Big(\sum_{h=1}^H\sum_{(s,a)}w_h(s,a,\pi)\,\beta_h(s,a)\Big),$$
where $w_h(s,a,\pi):=\mathbb E_{\pi,M}\big[\mathbb I[(s_h,a_h)=(s,a)]\big]$ and
$$\beta_h(s,a):=\min\Big\{6\sqrt{\frac{\mathbb V(\hat P_{s,a},V_{h+1})\,\iota_1}{N(s,a)}}+\frac{12\iota_1}{N(s,a)},\ 1\Big\}.$$
Define $w_i(\pi):=\sum_{h=1}^H\sum_{(s,a)\in X_i}w_h(s,a,\pi)$ for $1\le i\le K+1$.

Lemma 7. $w_i(\pi)\le O\big(\frac{HK}{2^i}\big)$ for $1\le i\le K$.

By Lemma 7, we further have:

Lemma 8.
$$\sum_{h=1}^H\sum_{(s,a)}w_h(s,a,\pi)\beta_h(s,a)\ \le\ O\Big(K\sqrt{\epsilon^2+\epsilon^2\sum_{h=1}^H\sum_{(s,a)}w_h(s,a,\pi)\beta_h(s,a)}+K^2\epsilon^2\Big).$$
By Lemma 6 and solving the inequality $x\le O\big(K\sqrt{\epsilon^2+\epsilon^2x}+K^2\epsilon^2\big)$ for $x=\sum_{h,s,a}w_h(s,a,\pi)\beta_h(s,a)$, we learn that $\sum_{h,s,a}w_h(s,a,\pi)\beta_h(s,a)\le O(K\epsilon+K^2\epsilon^2)$. Recall that by Lemma 4 we have $|V^\pi_1-\widetilde V^\pi_1|\le 4(K+1)^2\epsilon$. We then finally conclude that $V^*_1(s_1)-V^\pi_1(s_1)\le O(K^2\epsilon)$. Since $K\le\log_2(2H/\epsilon)$, we finish the proof by rescaling $\epsilon$.

6. Discussions on the Non-Stationary Episodic MDP

We claim that SSTP can also provide a reward-free sample complexity of $O\big(\frac{SAH}{\epsilon^2}(\log(\frac{1}{\delta})+S)\big)$ with a slight modification. Because the analysis for the non-stationary episodic MDP is very similar to the previous analysis, we only point out the major differences between the proofs and omit the details. We also follow the previous notation.

Planning phase. Let $N_h(s,a)$ denote the count of $(s,a,h)$ in the dataset. For the non-stationary episodic MDP, we propose the following sufficient condition for the dataset to plan for a near-optimal policy given any reward function.

Condition 3. Recall $K=\lfloor\log_2(2H/\epsilon)\rfloor$.
The space $\mathcal S\times\mathcal A\times[H]$ can be divided into $K+1$ subsets $\mathcal S\times\mathcal A\times[H]=X_1\cup X_2\cup\ldots\cup X_{K+1}$ such that: (1) $N_h(s,a)\ge N_i=\frac{4H(\iota+6S\ln(SAH/\epsilon))}{2^i\epsilon^2}$ for any $(s,a,h)\in X_i$, for $1\le i\le K$; (2) with $Z_i=\max\{\min\{\frac{2H}{2^i\epsilon},H\},1\}$ for each $1\le i\le K+1$, it holds that $\sup_\pi\mathbb P_\pi\big[\sum_{h=1}^H\mathbb I[(s_h,a_h,h)\in X_i]>Z_i\big]\le(K+1)\epsilon$ and $\sup_\pi\mathbb E_\pi\big[\min\{\sum_{h=1}^H\mathbb I[(s_h,a_h,h)\in X_i],Z_i\}\big]\le\frac{H}{2^{i-1}}$ for each $1\le i\le K+1$.

Because the transition models at different levels can be different, instead of a requirement on $N(s,a)$ we require $N_h(s,a)\ge N_i$ for $(s,a,h)\in X_i$. Following the arguments in the proof of Lemma 2, we can show that under Condition 3 we can compute an $\epsilon$-optimal policy for any reward function satisfying Assumption 2.

Sampling phase. To learn a dataset satisfying Condition 3, in a similar way as in Algorithm 4, we invoke TRVRL to learn $Y_i$ for $1\le i\le K+1$. The major difference is that $HT_0$ episodes are required for each stage. Following the proof of Lemma 3, the regret in the $i$-th stage is then bounded by
$$O\Big(\sqrt{Z_i\,SAH^2T_0(S+\iota)}+Z_i\,SAH(S+\iota)\Big).\qquad(7)$$
Compared to Lemma 3, the bound in (7) has two additional $H$ factors. The first $H$ factor arises because the length of each stage is multiplied by $H$, and the second is due to the structure of the non-stationary episodic MDP. By (7), we can further ensure that $u_i\le\frac{H}{2^i}$ by noting that, when $u_i>\frac{H}{2^i}$,
$$\sum_{k\text{ in stage }i}\sum_{h=1}^H r^k_h\ >\ 2SAHN_i,\qquad(8)$$
which contradicts the fact that $\sum_{k\text{ in stage }i}\sum_{h=1}^H r^k_h\le 2SAHN_i$ (because each $(s,a,h)$ can be visited at most $N_i$ times while providing reward).

Lower bound. The current best lower bound is $\Omega\big(\frac{SA}{\epsilon^2}(H+S+\log(\frac{1}{\delta}))\big)$, obtained by combining the lower bounds in (Jin et al., 2020) and (Zhang et al., 2020c). It remains an open problem whether the $O\big(\frac{SAH}{\epsilon^2}(S+\log(\frac{1}{\delta}))\big)$ upper bound is tight.

7. Conclusion

We give a new algorithm, SSTP, which enjoys a near-optimal sample complexity for reward-free RL. Importantly, we show that the sample complexity depends only logarithmically on the planning horizon. Our algorithm relies on three new technical ideas: 1) a new sufficient condition for the dataset to plan for an $\epsilon$-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning; 3) constructing an extended MDP to maximize the truncated accumulative rewards efficiently. In this way, we can divide the state-action space into different groups according to their maximal possible visit frequencies, which is especially suited to RL with growing batches. Another important future direction is to generalize our algorithm to RL with function approximation. For example, can we obtain a near-optimal sample complexity for reward-free RL with linear function approximation (Wang et al., 2020b; Zanette et al., 2020)?

References

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. arXiv preprint arXiv:1705.10528, 2017.

Agarwal, A., Kakade, S., and Yang, L. F. Model-based reinforcement learning with a generative model is minimax optimal. arXiv preprint arXiv:1906.03804, 2019.

Altman, E. Constrained Markov decision processes, volume 7. CRC Press, 1999.

Azar, M. G., Munos, R., and Kappen, H. J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325-349, 2013.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 263-272. JMLR.org, 2017.

Brafman, R. I. and Tennenholtz, M. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3(Oct):213-231, March 2003. ISSN 1532-4435.
Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2818-2826, 2015.

Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 5717-5727, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.

Dann, C., Li, L., Wei, W., and Brunskill, E. Policy certificates: Towards accountable reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1507-1516, Long Beach, California, USA, 09-15 Jun 2019. PMLR.

Du, S. S., Krishnamurthy, A., Jiang, N., Agarwal, A., Dudík, M., and Langford, J. Provably efficient RL with rich observations via latent state decoding. arXiv preprint arXiv:1901.09018, 2019.

Hazan, E., Kakade, S., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp. 2681-2691, 2019.

Jiang, N. and Agarwal, A. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pp. 3395-3398, 2018.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863-4873, 2018.

Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. Reward-free exploration for reinforcement learning. arXiv preprint arXiv:2002.02794, 2020.

Kakade, S. M. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.

Kaufmann, E., Ménard, P., Domingues, O. D., Jonsson, A., Leurent, E., and Valko, M. Adaptive reward-free exploration. arXiv preprint arXiv:2006.06294, 2020.

Kearns, M. J. and Singh, S. P. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pp. 996-1002, 1999.

Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. Breaking the sample size barrier in model-based reinforcement learning with a generative model. arXiv preprint arXiv:2005.12900, 2020.

Maurer, A. and Pontil, M. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

Ménard, P., Domingues, O. D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. arXiv preprint arXiv:2007.13442, 2020.

Miryoosefi, S., Brantley, K., Daumé III, H., Dudik, M., and Schapire, R. E. Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems, pp. 14093-14102, 2019.

Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pp. 5186-5196, 2018.

Tessler, C., Mankowitz, D. J., and Mannor, S. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074, 2018.

Wang, R., Du, S. S., Yang, L. F., and Kakade, S. M. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020a.

Wang, R., Du, S. S., Yang, L. F., and Salakhutdinov, R. On reward-free reinforcement learning with linear function approximation. arXiv preprint arXiv:2006.11274, 2020b.
Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304-7312, 2019.

Zanette, A., Lazaric, A., Kochenderfer, M. J., and Brunskill, E. Provably efficient reward-agnostic navigation with linear value iteration. arXiv preprint arXiv:2008.07737, 2020.

Zhang, X., Singla, A., et al. Task-agnostic exploration in reinforcement learning. arXiv preprint arXiv:2006.09497, 2020a.

Zhang, Z., Ji, X., and Du, S. S. Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. arXiv preprint arXiv:2009.13503, 2020b.

Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. arXiv preprint arXiv:2004.10019, 2020c.

Zhang, Z., Zhou, Y., and Ji, X. Model-free reinforcement learning: From clipped pseudo-regret to sample complexity. arXiv preprint arXiv:2006.03864, 2020d.