# Task-agnostic Exploration in Reinforcement Learning

Xuezhou Zhang (UW-Madison, xzhang784@wisc.edu) · Yuzhe Ma (UW-Madison, ma234@wisc.edu) · Adish Singla (MPI-SWS, adishs@mpi-sws.org)

**Abstract.** Efficient exploration is one of the main challenges in reinforcement learning (RL). Most existing sample-efficient algorithms assume the existence of a single reward function during exploration. In many practical scenarios, however, there is not a single underlying reward function to guide the exploration, for instance, when an agent needs to learn many skills simultaneously, or when multiple conflicting objectives need to be balanced. To address these challenges, we propose the task-agnostic RL framework: in the exploration phase, the agent first collects trajectories by exploring the MDP without the guidance of a reward function. After exploration, it aims at finding near-optimal policies for $N$ tasks, given the collected trajectories augmented with sampled rewards for each task. We present an efficient task-agnostic RL algorithm, UCBZero, that finds $\varepsilon$-optimal policies for $N$ arbitrary tasks after at most $O(\log(N)H^5SA/\varepsilon^2)$ exploration episodes, where $H$ is the episode length, $S$ is the state space size, and $A$ is the action space size. We also provide an $\Omega(\log(N)H^2SA/\varepsilon^2)$ lower bound, showing that the $\log$ dependency on $N$ is unavoidable. Furthermore, we provide an $N$-independent sample complexity bound for UCBZero in the recently proposed reward-free setting, a statistically easier setting where the ground-truth reward functions are known.

## 1 Introduction

Efficient exploration is widely regarded as one of the main challenges in reinforcement learning (RL). Recent works in theoretical RL have provided near-optimal algorithms in both the model-based [Jaksch et al., 2010, Azar et al., 2017, Zanette and Brunskill, 2019] and model-free [Strehl et al., 2006, Jin et al., 2018, Zhang et al., 2020] paradigms, which learn a near-optimal policy with a sample complexity that matches the information-theoretic lower bound. However, these algorithms are designed to solve a single pre-defined task and rely on the assumption that a well-defined reward function is available during exploration. In such settings, policy optimization can be performed simultaneously with exploration, resulting in an effective exploration-exploitation trade-off.

In many real-world applications, however, a pre-specified reward function is not available during exploration. For instance, in recommendation systems, the reward often consists of multiple conflicting objectives, and the balance between them is tuned via continuous trial and error to encourage the desired behavior [Zheng et al., 2018]; in hierarchical and multi-task RL [Dietterich, 2000, Tessler et al., 2017, Oh et al., 2017], the agent aims at simultaneously learning a set of skills; in robotic navigation [Rimon and Koditschek, 1992, Kretzschmar et al., 2016], the agent needs to navigate not only to one goal location, but to a set of locations in the environment. Motivated by these real-world challenges, we ask: Is it possible to efficiently explore the environment and simultaneously learn a set of potentially conflicting tasks?

To answer this question, we present the task-agnostic RL paradigm. During the exploration phase, the agent collects trajectories from an MDP without the guidance of a pre-specified reward function.
Next, in the policy-optimization phase, it aims at finding near-optimal policies for $N$ tasks, given the collected trajectories augmented with the instantiated rewards sampled from the unknown reward function of each task.

| Setting | No. of tasks | Upper bound | Lower bound |
|---|---|---|---|
| Task-specific RL | $N = 1$ | $O(H^2SA/\varepsilon^2)$ [ORLC] | $\Omega(H^2SA/\varepsilon^2)$ |
| Reward-free RL | $N$-dependent | $O(\log(N)H^5SA/\varepsilon^2)$ [UCBZero] | - |
| Reward-free RL | $N$-independent | $O(H^5S^2A/\varepsilon^2)$ [RFE] | $\Omega(H^2S^2A/\varepsilon^2)$ |
| Task-agnostic RL | $N$-dependent | $O(\log(N)H^5SA/\varepsilon^2)$ [UCBZero] | $\Omega(\log(N)H^2SA/\varepsilon^2)$ |

Table 1: Comparison of sample complexity results in three RL settings. Our results are the [UCBZero] upper bounds and the task-agnostic lower bound.

We emphasize that our framework covers applications like recommendation systems, where stochastic objectives (click rate, dwell time, etc.) are observed along with the transitions, but the ground-truth reward functions are not known. There is a common belief in task-specific RL that estimating the reward function is not a statistical bottleneck. We will show that, in the more challenging task-agnostic setting, the need to estimate rewards makes the problem strictly harder.

To address the task-agnostic RL problem, we present an efficient algorithm, UCBZero, that explores the MDP without the guidance of a pre-specified reward function. The algorithm design adopts the optimism-under-uncertainty principle to perform exploration and uses a conceptually simpler Azuma-Hoeffding-type reward bonus, instead of the Bernstein-Freedman-type bonus that typically achieves better bounds in the task-specific RL framework. The advantage of a Hoeffding-type bonus is that it depends only on the number of visits to the current state-action pair, and not on the reward value. This is the key property that allows it to be used in the task-agnostic setting. The UCBZero algorithm is defined in Alg. 1. Despite its simplicity, UCBZero provably finds $\varepsilon$-optimal policies for $N$ arbitrary tasks simultaneously after at most $O(\log(N)H^5SA/\varepsilon^2)$ exploration episodes. To complement our algorithmic results, we establish a near-matching $\Omega(\log(N)H^2SA/\varepsilon^2)$ lower bound, demonstrating the near-optimality of UCBZero as well as the necessity of the dependency on $N$, which is unique to the task-agnostic setting.

Our contributions:

1. We propose a novel task-agnostic RL framework and present an algorithm, UCBZero, that finds near-optimal policies for all $N$ tasks with small sample complexity. In addition, we provide a near-matching lower bound, demonstrating the near-optimality of our algorithm.
2. We investigate properties of the UCBZero algorithm that shed light on its success. In particular, we show that (i) UCBZero visits all state-action pairs sufficiently often, and (ii) the transitions collected by UCBZero allow one to obtain an accurate estimate of the dynamics model.
3. Lastly, we provide an $N$-independent sample complexity bound for UCBZero in a statistically easier setting studied in a contemporary work [Jin et al., 2020], where all ground-truth reward functions are known. We provide a unified view of the two settings and present a detailed discussion contrasting their statistical challenges.

Table 1 summarizes our key theoretical results.
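To make the two-phase protocol concrete before turning to related work, the following is a minimal Python sketch of the data flow between the phases. The environment and reward-sampler interfaces (`env.reset`, `env.step`, `sample_reward`) are hypothetical placeholders, not part of the formal setting, and next states are stored explicitly for convenience.

```python
from typing import Callable, List, Tuple

Transition = Tuple[int, int, int, int]   # (h, s, a, s_next)

def explore(env, behavior: Callable[[int, int], int], K: int, H: int) -> List[Transition]:
    """Exploration phase: collect K episodes without observing any reward."""
    D: List[Transition] = []
    for _ in range(K):
        s = env.reset()
        for h in range(H):
            a = behavior(h, s)            # e.g., greedy w.r.t. UCBZero's pseudo-Q table
            s_next = env.step(a)          # placeholder env: returns only the next state
            D.append((h, s, a, s_next))
            s = s_next
    return D

def augment(D: List[Transition], sample_reward: Callable[[int, int, int], float]):
    """Policy-optimization input for one task: attach an instantiated reward to every transition."""
    return [(h, s, a, sample_reward(h, s, a), s_next) for (h, s, a, s_next) in D]
```

The point of the sketch is only that no reward information crosses the exploration boundary: the same dataset `D` is augmented once per task afterwards.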
## 2 Related Work

**Sample-efficient algorithms for task-specific RL.** There has been a long line of work designing provably efficient algorithms for the task-specific RL setting. In the model-based setting, UCRL2 [Jaksch et al., 2010] and PSRL [Agrawal and Jia, 2017] estimate the transition dynamics of the MDP, with upper-confidence bounds (UCB) added to encourage exploration. UCBVI [Azar et al., 2017] and later ORLC [Dann et al., 2018] improve the sample complexity bound to match the information-theoretic lower bound $O(H^2SA/\varepsilon^2)$ [Dann and Brunskill, 2015]. In the model-free setting, delayed Q-learning [Strehl et al., 2006] is the first sample-efficient algorithm. UCB-H and UCB-B [Jin et al., 2018], and later UCB-Advantage [Zhang et al., 2020], improve upon previous results and match the information-theoretic lower bound up to log factors and lower-order terms. Our algorithm UCBZero can be viewed as a zero-reward version of the UCB-H algorithm and is shown to achieve the same sample complexity as UCB-H when there is a single task, despite not having access to reward information during exploration. Our results suggest that zero-reward versions of other sample-efficient algorithms, such as UCB-B and UCB-Advantage, can potentially be adapted to the task-agnostic setting to achieve tighter bounds.

**Empirical study of task-agnostic RL.** In the deep reinforcement learning (DRL) community, there has been increasing interest in designing algorithms that can learn without the supervision of a well-designed reward function. Part of the motivation comes from the ability of humans and other mammals to learn a variety of skills without explicit external rewards. Task-agnostic RL has been studied in a variety of settings, including robotic control [Riedmiller et al., 2018, Finn and Levine, 2017], self-supervised model learning [Xie et al., 2018, Xie et al., 2019], general value function approximation (GVFA) [Schaul et al., 2015], multi-task meta-learning [Sæmundsson et al., 2018], etc. Even when one only cares about solving a single task, auxiliary navigation tasks have been shown to help solve problems with sparse rewards, in algorithms such as Hindsight Experience Replay (HER) [Andrychowicz et al., 2017] and Go-Explore [Ecoffet et al., 2019]. Our work provides a theoretical foundation for these empirically successful algorithms.

**Closely related contemporary work.** A closely related contemporary work proposes the reward-free RL setting [Jin et al., 2020]. Their setting differs from ours mainly in that they assume the availability of the true reward functions during the policy-optimization phase, whereas we only assume the availability of sampled rewards on the collected transitions. In the task-specific RL setting, having to estimate the reward is typically not considered a statistical bottleneck, since estimating the reward function boils down to estimating an additional $SA$ parameters, which is negligible compared to the estimation of the transition model of size $S^2A$. However, when there are $N$ tasks, the reward-function parameters have a total size of $SAN$; when $N$ is large, this estimation error becomes non-negligible. Therefore, our setting can be statistically more challenging than the reward-free setting. We provide a thorough comparison between the two frameworks in Section 6.
## 3 Preliminaries

In this paper, we consider a tabular episodic Markov decision process, MDP$(\mathcal{S}, \mathcal{A}, H, P, r)$, where $\mathcal{S}$ is the state space of size $S$, $\mathcal{A}$ is the action space of size $A$, $H$ is the number of steps in each episode, $P$ is the time-dependent transition matrix so that $P_h(\cdot|s,a)$ gives the distribution over the next state if action $a$ is taken from state $s$ at step $h \in [H]$, and $r_h(\cdot|s,a)$ is a stochastic reward function at step $h$ whose support is bounded in $[0,1]$. Without loss of generality, we assume that the MDP has a fixed starting state $s_1$; the general case reduces to this one by adding an additional time step at the beginning of each episode.

A policy $\pi$ is a collection of $H$ functions $\{\pi_h : \mathcal{S} \to \Delta_{\mathcal{A}}\}_{h \in [H]}$, where $\Delta_{\mathcal{A}}$ is the probability simplex over $\mathcal{A}$. We denote by $V^\pi_h : \mathcal{S} \to \mathbb{R}$ the value function at step $h$ under policy $\pi$, and by $Q^\pi_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ the action-value function at step $h$ under policy $\pi$, i.e.

$$V^\pi_h(s) = \mathbb{E}_\pi\Big[\textstyle\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Big], \qquad Q^\pi_h(s,a) = \mathbb{E}_\pi\Big[\textstyle\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a\Big].$$

Since the total reward is finite, there exists an optimal policy $\pi^\star$ that attains the maximum value $V^\star_h(s) = \max_\pi V^\pi_h(s)$ for all $s \in \mathcal{S}$, $h \in [H]$. Denoting $[P_h V_{h+1}](s,a) := \mathbb{E}_{s' \sim P_h(\cdot|s,a)}[V_{h+1}(s')]$, we have the Bellman equation and the Bellman optimality equation:

$$V^\pi_h(s) = Q^\pi_h(s, \pi_h(s)), \qquad Q^\pi_h(s,a) = \mathbb{E}[r_h(s,a)] + [P_h V^\pi_{h+1}](s,a), \qquad (1)$$

$$V^\star_h(s) = \max_{a} Q^\star_h(s,a), \qquad Q^\star_h(s,a) = \mathbb{E}[r_h(s,a)] + [P_h V^\star_{h+1}](s,a), \qquad (2)$$

where we define $V^\pi_{H+1}(s) = 0$ for all $s \in \mathcal{S}$.

In this work, we evaluate algorithms under the sample complexity of exploration framework [Kakade et al., 2003], where the goal is to find an $\varepsilon$-optimal policy $\pi$, i.e. $V^\star_1(s_1) - V^\pi_1(s_1) \le \varepsilon$, with probability at least $1 - p$. An algorithm is said to be PAC-MDP (Probably Approximately Correct in Markov Decision Processes) if, for any $\varepsilon, p$, the sample complexity of finding an $\varepsilon$-optimal policy with probability at least $1 - p$ is $O(\mathrm{poly}(H, S, A, 1/\varepsilon, 1/p))$.

**Task-agnostic RL.** In this paper, we study the task-agnostic RL setting, where learning is performed in two phases. In the exploration phase, the algorithm interacts with the environment for $K$ episodes without the guidance of a reward, and collects a dataset of transitions $\mathcal{D} = \{(s^k_h, a^k_h)\}_{h \in [H], k \in [K]}$. In the policy-optimization phase, for each task $n \in [N]$, let $r^{(n)}_h(\cdot|s,a)$ be the unknown reward function for task $n$; rewards are instantiated on the collected transitions, augmenting the dataset to $\mathcal{D}^{(n)} = \{(s^k_h, a^k_h, r^k_h)\}_{h \in [H], k \in [K]}$, where $r^k_h \sim r^{(n)}_h(\cdot|s^k_h, a^k_h)$. The goal is to find $\varepsilon$-optimal policies for all $N$ tasks, and algorithms are evaluated based on the number of episodes $K$ needed to reliably achieve this goal. We remark that our setting generalizes, and is more challenging than, the standard task-specific RL setting and the reward-free RL setting of the contemporary work [Jin et al., 2020]; importantly, our algorithm and upper bounds apply to both of those settings.
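As a small illustration of the finite-horizon Bellman optimality recursion (2), the sketch below computes $Q^\star$ and $V^\star$ by backward induction when $P$ and the mean rewards are known. This is a planning computation used only for exposition, not part of the learning protocol.

```python
import numpy as np

def optimal_values(P: np.ndarray, r_mean: np.ndarray):
    """Backward induction for a finite-horizon tabular MDP.

    P[h, s, a] is the next-state distribution P_h(.|s, a), shape (H, S, A, S);
    r_mean[h, s, a] is E[r_h(s, a)], shape (H, S, A).
    Returns Q of shape (H, S, A) and V of shape (H + 1, S), with V[H] = 0.
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))                # V[H] is the zero terminal value
    for h in reversed(range(H)):
        Q[h] = r_mean[h] + P[h] @ V[h + 1]  # Q*_h(s,a) = E[r_h] + [P_h V*_{h+1}](s,a)
        V[h] = Q[h].max(axis=-1)            # V*_h(s)   = max_a Q*_h(s,a)
    return Q, V
```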
## 4 UCBZero: Task-agnostic Exploration

**Algorithm 1: UCBZero**

PARAMETERS: number of tasks $N$; $\iota = \log(SAHK/p)$; $b_t = c\sqrt{H^3(\log(N) + \iota)/t}$; $\alpha_t = \frac{H+1}{H+t}$.

*Exploration Phase:*

1: Initialize $\tilde Q_h(s,a) \leftarrow H$, $\tilde N_h(s,a) \leftarrow 0$ for all $(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]$.
2: **for** episode $k = 1, \dots, K$ **do**
3: Receive $s^k_1$.
4: **for** step $h = 1, \dots, H$ **do**
5: Take action $a^k_h \leftarrow \arg\max_a \tilde Q_h(s^k_h, a)$, and observe $s^k_{h+1}$.
6: $t = \tilde N_h(s^k_h, a^k_h) \leftarrow \tilde N_h(s^k_h, a^k_h) + 1$; $\tilde V_{h+1}(s^k_{h+1}) = \min\big(H, \max_a \tilde Q_{h+1}(s^k_{h+1}, a)\big)$.
7: $\tilde Q_h(s^k_h, a^k_h) \leftarrow (1 - \alpha_t)\tilde Q_h(s^k_h, a^k_h) + \alpha_t\big[\tilde V_{h+1}(s^k_{h+1}) + 2b_t\big]$.
8: Return $\mathcal{D} = \{(s^k_h, a^k_h)\}_{h \in [H], k \in [K]}$.

*Policy-Optimization Phase for Task $n \in [N]$:*

INPUTS: task-specific reward-augmented transitions $\mathcal{D}^{(n)} = \{(s^k_h, a^k_h, r^k_h)\}_{h \in [H], k \in [K]}$.

1: Initialize $\Pi \leftarrow \emptyset$, $Q_h(s,a) \leftarrow H$, $N_h(s,a) \leftarrow 0$ for all $(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]$.
2: **for** episode $k = 1, \dots, K$ **do**
3: $\Pi.\mathrm{append}(\pi_k)$, where $\pi_k$ is the greedy policy w.r.t. the current $\{Q_h\}_{h \in [H]}$.
4: **for** step $h = 1, \dots, H$ **do**
5: $t = N_h(s^k_h, a^k_h) \leftarrow N_h(s^k_h, a^k_h) + 1$; $V_{h+1}(s^k_{h+1}) = \min\big(H, \max_a Q_{h+1}(s^k_{h+1}, a)\big)$.
6: $Q_h(s^k_h, a^k_h) \leftarrow (1 - \alpha_t)Q_h(s^k_h, a^k_h) + \alpha_t\big[r^k_h + V_{h+1}(s^k_{h+1}) + b_t\big]$.
7: Return the (non-stationary) stochastic policy equivalent to uniformly sampling from $\Pi$.

We begin by presenting our algorithm, UCBZero, as defined in Alg. 1. In the task-agnostic RL setting, the algorithm needs to handle both the exploration phase and the policy-optimization phase. In the exploration phase, UCBZero maintains a pseudo-Q table, denoted $\tilde Q_h$, that estimates the cumulative UCB bonuses from the current step $h$ onward. By acting greedily with respect to $\tilde Q_h$ at step $h$, the algorithm chooses actions that can lead to under-explored states in future steps $h' > h$, i.e. states with large UCB bonus. In contrast to the original UCB-H algorithm, which incorporates task-specific rewards and thus performs an exploration-exploitation trade-off, UCBZero is a full-exploration algorithm and is therefore able to keep visiting all reachable states throughout the exploration phase. In the policy-optimization phase, UCBZero performs the same optimistic Q-learning update, only with smaller UCB bonuses than those used in the exploration phase. For the sake of theoretical analysis, at the end of the policy-optimization phase, UCBZero outputs a non-stationary stochastic policy equivalent to uniformly sampling from $\{\pi_k\}_{k \in [K]}$, the greedy policies w.r.t. the Q table at the beginning of each episode $k \le K$. We remark that both the exploration phase and the policy-optimization phase of UCBZero are model-free, enjoying smaller space complexity than model-based algorithms, including the RFE algorithm of the contemporary work [Jin et al., 2020].

Next, we present our theoretical results upper-bounding the number of exploration episodes required for UCBZero to find $\varepsilon$-optimal policies for $N$ tasks. Due to space constraints, we discuss the key ideas behind the theoretical results and defer the full proofs to the appendix.
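The following is a minimal Python sketch of the exploration-phase update of Alg. 1. The environment interface (`env.reset`, `env.step`) is a hypothetical placeholder, the absolute constant `c` is left unspecified as in the theory, and the schedules for $\alpha_t$ and $b_t$ follow the parameter line of Alg. 1.

```python
import numpy as np

def ucb_zero_explore(env, S, A, H, K, N, p=0.05, c=1.0):
    """Exploration phase of UCBZero (Alg. 1): optimistic Q-learning with zero reward.

    Returns the dataset D of visited (h, s, a) triples.  The environment
    interface (env.reset() -> s, env.step(a) -> s_next) is a placeholder.
    """
    iota = np.log(S * A * H * K / p)
    Q = np.full((H + 1, S, A), float(H))   # pseudo-Q table, optimistically initialized
    Q[H] = 0.0                             # convention: V_{H+1} = 0
    counts = np.zeros((H, S, A), dtype=int)
    D = []
    for _ in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))                          # act greedily w.r.t. pseudo-Q
            s_next = env.step(a)                                 # no reward is observed
            counts[h, s, a] += 1
            t = counts[h, s, a]
            alpha = (H + 1) / (H + t)                            # learning rate alpha_t
            bonus = c * np.sqrt(H**3 * (np.log(N) + iota) / t)   # Hoeffding-type bonus b_t
            v_next = min(H, Q[h + 1, s_next].max())              # truncated optimistic value
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (v_next + 2 * bonus)
            D.append((h, s, a))
            s = s_next
    return D
```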
### 4.1 Upper-bounding the Sample Complexity of UCBZero

The exploration phase of UCBZero is equivalent to a zero-reward version of the UCB-H algorithm. Our first result shows that, despite not knowing the reward functions during exploration, UCBZero enjoys the same sample complexity as UCB-H in the task-agnostic setting, up to an additional log factor in the number of tasks $N$.

**Notation.** We denote by $Q^k_h, V^k_h, N^k_h$ the $Q_h, V_h, N_h$ functions at the beginning of episode $k$, and similarly by $\tilde Q^k_h, \tilde V^k_h, \tilde N^k_h$ the corresponding pseudo quantities from the exploration phase. We denote by $\pi_k$ and $\tilde\pi_k$ the greedy policies w.r.t. $Q^k_h$ and $\tilde Q^k_h$, respectively. We further use the superscript $(n)$ to denote the corresponding quantities for task $n$, e.g. $Q^{k(n)}_h$ and $Q^{\pi_k(n)}_h$.

**Theorem 1.** There exists an absolute constant $c > 0$ such that, for all $p \in (0, 1)$, if we choose $b_t = c\sqrt{H^3(\log(N) + \iota)/t}$, then with probability at least $1 - p$, it takes Alg. 1 at most

$$O\big((\log N + \iota)H^5SA/\varepsilon^2\big) \qquad (3)$$

exploration episodes to simultaneously return an $\varepsilon$-optimal policy $\hat\pi^{(n)}$ for each of the $N$ tasks.

**Proof Sketch.** The proof of Theorem 1 relies on a key lemma, which relates the estimation error of the Q function on each task, $(Q^{k(n)}_h - Q^{\pi_k(n)}_h)$, to the estimation error of the pseudo-Q function, $(\tilde Q^k_h - \tilde Q^{\pi_k}_h)$. Note that $\tilde Q$ corresponds to a zero-reward MDP, and thus $\tilde Q^{\pi_k}_h = 0$.

**Lemma 2.** There exists an absolute constant $c > 0$ such that, for any $p \in (0, 1)$, if we choose $b_t = c\sqrt{H^3(\log(N) + \iota)/t}$, then with probability at least $1 - p$ the following holds simultaneously for all $(s, a, h, k, n) \in \mathcal{S} \times \mathcal{A} \times [H] \times [K] \times [N]$:

$$(Q^{k(n)}_h - Q^{\pi_k(n)}_h)(s, a) \le (\tilde Q^k_h - \tilde Q^{\pi_k}_h)(s, a). \qquad (4)$$

Lemma 2 shows that the task Q function $Q^{k(n)}_h$ converges at least as fast as $\tilde Q^k_h$. We then have

$$(V^{\star(n)}_1 - V^{\pi_k(n)}_1)(s_1) \overset{(i)}{\le} (V^{k(n)}_1 - V^{\pi_k(n)}_1)(s_1) \le \max_a (Q^{k(n)}_1 - Q^{\pi_k(n)}_1)(s_1, a) \overset{(ii)}{\le} \max_a (\tilde Q^{k}_1 - \tilde Q^{\pi_k}_1)(s_1, a) \overset{(iii)}{=} (\tilde V^{k}_1 - \tilde V^{\pi_k}_1)(s_1),$$

where (i) is due to [Jin et al., 2018, Lemma 4.3], (ii) is due to Lemma 2, and (iii) is due to $\tilde Q^{\pi_k}_1 = 0$. Since UCBZero is mathematically equivalent to a zero-reward version of the UCB-H algorithm, we get

$$\sum_{k=1}^{K}(V^{\star(n)}_1 - V^{\pi_k(n)}_1)(s_1) \le \sum_{k=1}^{K}(\tilde V^{k}_1 - \tilde V^{\pi_k}_1)(s_1) \le O\big(\sqrt{(\log N + \iota)H^5SAK}\big),$$

where the last step is due to [Jin et al., 2018, Theorem 1]. For each task $n$, define $\hat\pi^{(n)}$ to be the non-stationary stochastic policy that uniformly samples a policy from $\{\pi_1, \dots, \pi_K\}$. Then,

$$(V^{\star(n)}_1 - V^{\hat\pi(n)}_1)(s_1) = \frac{1}{K}\sum_{k=1}^{K}(V^{\star(n)}_1 - V^{\pi_k(n)}_1)(s_1) \le O\big(\sqrt{(\log N + \iota)H^5SA/K}\big).$$

This implies that, in order to have $(V^{\star(n)}_1 - V^{\hat\pi(n)}_1)(s_1) \le \varepsilon$, we need at most $K = O\big((\log N + \iota)H^5SA/\varepsilon^2\big)$ exploration episodes.

Theorem 1 indicates that, when $N$ is small, e.g. $N \le O(\mathrm{poly}(H, S, A))$, the sample complexity for UCBZero to simultaneously learn $N$ tasks matches, in its leading terms, the sample complexity of UCB-H for learning a single task, despite the fact that UCBZero does not have access to rewards during exploration. In contrast, a naive application of UCB-H to learn $N$ tasks would require $N$ times the original sample complexity. This exponential improvement in the dependence on $N$ (from linear to logarithmic) is achieved because UCBZero collects transitions in a way that helps with learning all tasks during a single exploration phase. On the other hand, the data collected by UCB-H while guided by a specific task may under-explore regions of the environment that are unimportant to the current task but important to other tasks. Even for the task-specific RL setting, our result implies that, perhaps surprisingly, a full-exploration algorithm can learn as fast as an algorithm that balances exploration and exploitation. It suggests that exploitation does not necessarily help in accelerating learning, and is probably only required for the sake of regret minimization.

### 4.2 Further Insights on the Exploration Phase of UCBZero

In Theorem 1 above, we directly analyzed the sample complexity of UCBZero, but the proof technique provides limited insight into the quality of the dataset $\mathcal{D}$ collected during the exploration phase. Intuitively, $\mathcal{D}$ must cover all reachable states sufficiently often to ensure that one can later learn a near-optimal policy for any unknown reward function. Our next result shows that this is indeed the case for UCBZero.

**Definition 1.** We define the relative reachability between $(s', h')$ and $(s, h)$, $h' < h$, as the maximum probability of reaching $(s, h)$ from $(s', h')$ under some policy. Specifically,

$$\delta_{h', h}(s', s) = \max_\pi \mathbb{P}^\pi(s_h = s \mid s_{h'} = s'). \qquad (5)$$

We also define the reachability of $(s, h)$ as the reachability from $(s_1, 1)$ to $(s, h)$, i.e. $\delta_h(s) = \delta_{1,h}(s_1, s)$.

Intuitively, $\delta_h(s)$ represents the maximum probability of reaching state $s$ at step $h$. If $\delta_h(s)$ is zero or close to zero for some $(s, h)$, it is almost impossible to reach state $s$ at step $h$, and thus $(s, h)$ cannot have much impact on the optimal performance of any task.
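Although the learner never knows $P$, Definition 1 is itself a small planning problem: maximizing the probability of the event $\{s_h = s\}$ is a finite-horizon control problem with terminal reward $\mathbf{1}\{s_h = s\}$, so $\delta_h(s)$ can be computed by backward induction when $P$ is given. A minimal sketch, for illustration only:

```python
import numpy as np

def reachability(P: np.ndarray, s1: int, s_target: int, h_target: int) -> float:
    """delta_h(s): max over policies of P^pi(s_{h_target} = s_target | s_1 = s1).

    P[h, s, a] is the next-state distribution P_h(.|s, a); steps are 0-indexed
    here, so h_target = 0 refers to the starting step.
    """
    H, S, A, _ = P.shape
    W = np.zeros(S)
    W[s_target] = 1.0                      # terminal "reward": being at s_target at step h_target
    for h in reversed(range(h_target)):    # backward induction from h_target - 1 down to 0
        W = (P[h] @ W).max(axis=-1)        # W_h(s) = max_a sum_{s'} P_h(s'|s,a) W_{h+1}(s')
    return float(W[s1])
```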
Our next theorem shows that UCBZero consistently visits all state-action pairs throughout the exploration phase.

**Theorem 3.** With probability $1 - p$, after $K$ episodes of UCBZero, we have $\tilde N^K_h(s, a) \ge \Omega\big(K\,\delta_h(s)^2\big)$ for all $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$.

Theorem 3 shows that UCBZero visits every state-action pair in proportion to the total number of episodes, and that the visitation frequency scales at most quadratically with the reachability of the state. One direct implication of sufficient visitation is that a model-based batch-RL algorithm, such as Tabular Certainty Equivalence (TCE) [Jiang, 2018], can accurately estimate the transition dynamics and thus find near-optimal policies for any reward functions:

**Theorem 4.** With probability $1 - p$, for all $(h, s, a, s') \in [H] \times \mathcal{S} \times \mathcal{A} \times \mathcal{S}$, Alg. 1 induces an estimate $\hat P_h(s'|s,a)$ of $P_h(s'|s,a)$ such that $|\hat P_h(s'|s,a) - P_h(s'|s,a)| \le \varepsilon/\delta_h(s)$, after at most $O(H^5SA\iota/\varepsilon^2)$ exploration episodes.

These insights suggest that the efficiency of UCBZero is likely not a consequence of careful cooperation between the exploration component and the policy-optimization component, both of which use Q-learning with UCB bonuses. Instead, any batch RL algorithm can be applied in the policy-optimization phase to find near-optimal policies.
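As a concrete instance of the batch step referenced in Theorem 4, a certainty-equivalence pipeline estimates the model from an augmented dataset and then plans in the estimate. The sketch below is illustrative only: it assumes the augmented transitions are stored as `(h, s, a, r, s_next)` tuples as in the earlier data-flow sketch, it reuses the `optimal_values` backward-induction routine from the end of Section 3, and the uniform fallback for unvisited triples is an arbitrary choice made for the sketch.

```python
import numpy as np

def certainty_equivalence(D_aug, S: int, A: int, H: int):
    """Tabular certainty equivalence: estimate (P_hat, r_hat) from an augmented
    dataset of (h, s, a, r, s_next) tuples, then plan in the estimated MDP."""
    P_hat = np.zeros((H, S, A, S))
    r_sum = np.zeros((H, S, A))
    n = np.zeros((H, S, A))
    for (h, s, a, r, s_next) in D_aug:
        n[h, s, a] += 1
        r_sum[h, s, a] += r
        P_hat[h, s, a, s_next] += 1
    visited = n > 0
    P_hat[visited] = P_hat[visited] / n[visited][:, None]   # empirical next-state frequencies
    P_hat[~visited] = 1.0 / S                               # arbitrary choice for unvisited triples
    r_hat = np.where(visited, r_sum / np.maximum(n, 1), 0.0)
    Q, V = optimal_values(P_hat, r_hat)                     # backward induction in the estimated MDP
    return Q.argmax(axis=-1), P_hat, r_hat                  # greedy policy, model estimates
```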
## 5 Lower bound

In the standard task-specific RL setting, it is commonly believed that having to estimate reward functions is not a statistical bottleneck for efficient learning, due to the relatively small number of parameters of a reward function compared to the number of parameters of the transition dynamics. However, we show that in the task-agnostic RL setting, the additional $\log N$ factor in the sample complexity of UCBZero is unavoidable, and that this additional complexity is essentially due to the need to accurately estimate $N$ reward functions simultaneously.

**Theorem 5.** Any PAC-MDP algorithm in the task-agnostic RL setting must spend at least $\Omega\big(\log(N)H^2SA/\varepsilon^2\big)$ episodes in the exploration phase.

**Proof Sketch.** Our main intuition is that when $N$ becomes very large, the sample complexity of simultaneously estimating $N$ reward functions eventually out-scales the sample complexity of estimating the shared transition dynamics. As a result, one can equivalently assume that the transition dynamics are already known to the learner, so that solving the MDP under a particular reward function becomes equivalent to solving $SH$ parallel multi-armed bandits, each with $A$ arms. Based on this intuition, we construct an MDP $M$ whose transitions are $P_h(s'|s,a) = 1/S$ for all $(h, s, a, s')$ and are known to the learner. Since the action has no influence on the next state, this MDP is equivalent to a collection of $SH$ multi-armed bandits. Due to the uniform transitions, $\mathbb{P}^\pi_h(s) = 1/S$ for any $(\pi, s, h)$, and so finding an $\varepsilon$-optimal policy amounts to finding an $\varepsilon/H$-optimal arm for each bandit $(s, h)$. We then construct each bandit following the classic lower-bound construction for multi-armed bandits [Mannor and Tsitsiklis, 2004], where the rewards of all arms are Bernoulli with $p = 1/2$, except for the first arm, which is Bernoulli with $p = 1/2 + \varepsilon/2$, and one other arm, which is Bernoulli with $p = 1/2 + \varepsilon$. The classic result shows that such a bandit requires at least $\Omega(AH^2/\varepsilon^2)$ steps to find an $\varepsilon/H$-optimal arm. We extend this result and show that it takes at least $\Omega(\log(N)H^2A/\varepsilon^2)$ steps to find an $\varepsilon/H$-optimal arm simultaneously for $N$ arbitrary reward functions in the task-agnostic bandit setting. As a result, the $HS$ bandits together require at least $\Omega(\log(N)H^3SA/\varepsilon^2)$ steps; since each episode provides only $H$ steps, this translates to at least $\Omega(\log(N)H^2SA/\varepsilon^2)$ episodes.

The lower bound suggests that the bound achieved by UCBZero is tight in $S$, $A$, and $\varepsilon$, up to logarithmic factors and lower-order terms. In the next section, we discuss the connections of our upper and lower bound results to those in the standard task-specific RL setting and the reward-free setting.

## 6 Task-Agnostic vs. Reward-free RL

In this section, we (1) make a thorough comparison with the contemporary work [Jin et al., 2020], (2) provide additional results on the sample complexity of UCBZero in the reward-free setting, and (3) present a unified view of the three RL frameworks. A summary is presented in Table 1.

### 6.1 Summary of the Reward-free RL Setting

A contemporary work [Jin et al., 2020] proposed the reward-free RL framework, which is similar to our task-agnostic RL framework in that learning is decomposed into an exploration phase and a planning phase. In the exploration phase, the learner explores without reward information. In the planning phase, the agent is tasked with computing near-optimal policies for a large collection of given reward functions. The only essential difference between the two frameworks is whether the true reward functions are known (reward-free) or only instantiated rewards are available (task-agnostic). Therefore, our task-agnostic setting is statistically harder than the reward-free setting.

Under the reward-free setting, Jin et al. focus on providing $N$-independent bounds on the sample complexity. Intuitively, this is possible because the only unknown quantity in the problem is the transition dynamics, which are shared across all tasks. Therefore, as long as the algorithm achieves an accurate estimate of the transition dynamics, it can compute near-optimal policies under arbitrary known reward functions. In the $N$-independent regime, Jin et al. provided an $\Omega(H^2S^2A/\varepsilon^2)$ lower bound, which has an additional factor of $S$ compared to the lower bound in the standard task-specific RL setting. The lower-bound construction requires an exponential number of rewards, i.e. $N \ge \exp(S)$, and they emphasized that it remains an open problem whether a smaller $N$-dependent bound is possible when $N$ is finite. In addition, they presented an efficient algorithm, RFE, with sample complexity $O(H^5S^2A/\varepsilon^2)$, matching the lower bound in its dependency on $S$, $A$, and $\varepsilon$. During exploration, the RFE algorithm executes two meta-steps. In the first step, it calls a state-of-the-art PAC-MDP algorithm, EULER [Zanette and Brunskill, 2019], as a subroutine $HS$ times to learn a navigation policy $\pi_{h,s}$ that visits state $s$ at step $h$ as often as possible. In the second step, RFE collects data with policies sampled at random from $\{\pi_{h,s}\}$ to ensure that all $(h, s)$ pairs are visited sufficiently often. While the idea is conceptually simple, the algorithm can be expensive to execute in practice, as the overhead of calling EULER $HS$ times can already be large. In fact, our results indicate that calling UCBZero once suffices to learn all navigation policies $\{\pi_{h,s}\}$ with only an additional $\log(HS)$ factor more samples, as opposed to $HS$ times more samples from calling EULER $HS$ times.

### 6.2 UCBZero in Reward-free RL

Since our task-agnostic setting is strictly more difficult than the reward-free setting, the $O(\log(N)H^5SA/\varepsilon^2)$ upper bound readily applies to the reward-free setting.
Our result immediately implies that a tighter $N$-dependent bound is available in the reward-free setting, answering the open question raised by Jin et al. Compared with the bound achieved by RFE, our bound replaces the extra dependency on $S$ with a logarithmic dependency on $N$. When the number of tasks is not extremely large, e.g. $N \le O(\mathrm{poly}(H, S, A))$, UCBZero achieves a tighter bound than RFE. In the rare case where $N$ is exponentially large or even infinite, the bound in Theorem 1 blows up. Our next result complements Theorem 1 by providing an $N$-independent bound on the sample complexity of UCBZero in the reward-free setting, at the price of an additional factor of $HSA$.

**Theorem 6.** There exists an absolute constant $c > 0$ such that, for all $p \in (0, 1)$, with probability at least $1 - p$, it takes Alg. 1 at most

$$O\big(H^6S^2A^2(\log\tfrac{3H}{\varepsilon} + \iota)/\varepsilon^2\big) \qquad (7)$$

exploration episodes to simultaneously return $\varepsilon$-optimal policies for any number of tasks.

**Proof Sketch.** The proof is based on the observation that if two reward functions are close enough, e.g. $|\mathbb{E}[r_h(s,a)] - \mathbb{E}[r'_h(s,a)]| \le \varepsilon$ for all $(s, a, h)$, then a near-optimal policy for $r$ will also be near-optimal for $r'$. This is sometimes referred to as the simulation lemma [Kearns and Singh, 2002]. As a result, even though there are potentially infinitely many reward functions to optimize, we can divide the whole space of mean rewards $[0,1]^{H \times S \times A}$ into $M^{HSA}$ disjoint sets of the form $\prod_{h,s,a}\big[\tfrac{i_{h,s,a} - 1}{M}, \tfrac{i_{h,s,a}}{M}\big]$. For sufficiently large $M$, one only needs to find a near-optimal policy for one reward function per set to guarantee that this policy is near-optimal for all reward functions in the same set. Therefore, we effectively only need to find near-optimal policies for at most $N = M^{HSA}$ tasks. Plugging this $N$ into Theorem 1 gives the desired result.

This $N$-independent bound is not as sharp as the bound for RFE, likely due to the simplicity of the argument we use. Nevertheless, the key takeaway is that the sample complexity of UCBZero does not scale with $N$ indefinitely. In most practical scenarios, the number of tasks to be learned is small. For example, in the navigation problem, the agent wants to learn to navigate to every $(s, h) \in \mathcal{S} \times [H]$, which corresponds to $N = SH$ tasks. Therefore, the bound achieved by UCBZero is usually the tighter one in practice. What remains an open question is whether a smaller $N$-dependent lower bound exists in the reward-free setting. We note that our lower bound for the task-agnostic setting does not readily transfer to the easier reward-free setting.
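To make the counting in the proof sketch of Theorem 6 explicit (the constant 3 is indicative only, taken from the $\log\frac{3H}{\varepsilon}$ term in (7)): discretizing each mean reward $\mathbb{E}[r_h(s,a)] \in [0,1]$ to accuracy $\varepsilon/(3H)$ uses $M = \lceil 3H/\varepsilon \rceil$ cells per $(h,s,a)$, so

$$N = M^{HSA}, \qquad \log N = HSA \log\Big\lceil\frac{3H}{\varepsilon}\Big\rceil,$$

and substituting this $N$ into the bound of Theorem 1 gives

$$O\big((\log N + \iota)H^5SA/\varepsilon^2\big) = O\big(H^6S^2A^2(\log\tfrac{3H}{\varepsilon} + \iota)/\varepsilon^2\big),$$

which matches (7).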
### 6.3 A Unified View

Lastly, we provide a unified view of the three frameworks: standard task-specific RL, reward-free RL, and task-agnostic RL. An overview of the technical results in each setting is given in Table 1. Task-specific RL aims at learning a near-optimal policy for a single task, with or without knowledge of the true reward function. Reward-free RL aims at learning near-optimal policies for a set of tasks, given knowledge of the true reward functions. Task-agnostic RL aims at learning near-optimal policies for a set of tasks, but without knowledge of the true reward functions. In terms of statistical difficulty, there is a total order among the three: task-specific RL < reward-free RL < task-agnostic RL.

Task-specific RL is well understood, with matching $\Theta(H^2SA/\varepsilon^2)$ upper and lower bounds. Reward-free RL is not fully solved, as there are both $N$-dependent and $N$-independent bounds; the additional dependency on $S$ is proven unavoidable for $N$-independent bounds but appears to be avoidable for $N$-dependent ones, and it remains an open question whether there exists a tighter $N$-dependent lower bound. Task-agnostic RL is the hardest. UCBZero enjoys the same $N$-dependent upper bound in both the reward-free and the task-agnostic setting, and our $N$-dependent lower bound shows that $N$-independent bounds do not exist in the task-agnostic setting. Nevertheless, our upper and lower bounds do not yet match in their $H$ factors, and it remains an open question whether zero-reward versions of other PAC-MDP algorithms from the task-specific RL setting can be adapted to the task-agnostic setting to achieve tighter bounds.

## 7 Conclusions

In this paper, we propose a new task-agnostic RL framework that consists of an exploration phase, where the learner collects data without reward information, and a policy-optimization phase, where the learner computes near-optimal policies for $N$ tasks given instantiated rewards. We present an efficient algorithm, UCBZero, that finds $\varepsilon$-optimal policies for $N$ tasks within $O(\log(N)H^5SA/\varepsilon^2)$ exploration episodes. We also provide a near-matching $\Omega(\log(N)H^2SA/\varepsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.

## Broader Impact

Our work provides a theoretical foundation for empirical studies of multi-task reinforcement learning and unsupervised reinforcement learning.

## References

[Agrawal and Jia, 2017] Agrawal, S. and Jia, R. (2017). Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194.

[Andrychowicz et al., 2017] Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058.

[Azar et al., 2017] Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 263–272.

[Dann and Brunskill, 2015] Dann, C. and Brunskill, E. (2015). Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826.

[Dann et al., 2018] Dann, C., Li, L., Wei, W., and Brunskill, E. (2018). Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056.

[Dietterich, 2000] Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303.

[Ecoffet et al., 2019] Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. (2019). Go-Explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.

[Finn and Levine, 2017] Finn, C. and Levine, S. (2017). Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793.

[Jaksch et al., 2010] Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600.

[Jiang, 2018] Jiang, N. (2018). Notes on tabular methods.

[Jin et al., 2018] Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873.
[Jin et al., 2020] Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. (2020). Reward-free exploration for reinforcement learning. arXiv preprint arXiv:2002.02794.

[Kakade et al., 2003] Kakade, S. M. (2003). On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England.

[Kearns and Singh, 2002] Kearns, M. and Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232.

[Kretzschmar et al., 2016] Kretzschmar, H., Spies, M., Sprunk, C., and Burgard, W. (2016). Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research, 35(11):1289–1307.

[Mannor and Tsitsiklis, 2004] Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648.

[Oh et al., 2017] Oh, J., Singh, S., Lee, H., and Kohli, P. (2017). Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2661–2670.

[Riedmiller et al., 2018] Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. (2018). Learning by playing: solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567.

[Rimon and Koditschek, 1992] Rimon, E. and Koditschek, D. E. (1992). Exact robot navigation using artificial potential functions. Departmental Papers (ESE), page 323.

[Sæmundsson et al., 2018] Sæmundsson, S., Hofmann, K., and Deisenroth, M. P. (2018). Meta reinforcement learning with latent variable Gaussian processes. arXiv preprint arXiv:1803.07551.

[Schaul et al., 2015] Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320.

[Strehl et al., 2006] Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. (2006). PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888.

[Tessler et al., 2017] Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., and Mannor, S. (2017). A deep hierarchical approach to lifelong learning in Minecraft. In Thirty-First AAAI Conference on Artificial Intelligence.

[Xie et al., 2019] Xie, A., Ebert, F., Levine, S., and Finn, C. (2019). Improvisation through physical understanding: Using novel objects as tools with visual foresight. arXiv preprint arXiv:1904.05538.

[Xie et al., 2018] Xie, A., Singh, A., Levine, S., and Finn, C. (2018). Few-shot goal inference for visuomotor learning and planning. arXiv preprint arXiv:1810.00482.

[Zanette and Brunskill, 2019] Zanette, A. and Brunskill, E. (2019). Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210.

[Zhang et al., 2020] Zhang, Z., Zhou, Y., and Ji, X. (2020). Almost optimal model-free reinforcement learning via reference-advantage decomposition. arXiv preprint arXiv:2004.10019.

[Zheng et al., 2018] Zheng, G., Zhang, F., Zheng, Z., Xiang, Y., Yuan, N. J., Xie, X., and Li, Z. (2018). DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pages 167–176.