Autonomous Reinforcement Learning via Subgoal Curricula

Archit Sharma, Abhishek Gupta#, Sergey Levine#, Karol Hausman, Chelsea Finn
Stanford University, Google Brain, #UC Berkeley
{architsh,cbfinn}@stanford.edu  {abhishekunique,slevine,karolhausman}@google.com

Abstract

Reinforcement learning (RL) promises to enable autonomous acquisition of complex behaviors for diverse agents. However, the success of current reinforcement learning algorithms is predicated on an often under-emphasised requirement: each trial needs to start from a fixed initial state distribution. Unfortunately, resetting the environment to its initial state after each trial requires a substantial amount of human supervision and extensive instrumentation of the environment, which defeats the goal of autonomous acquisition of complex behaviors. In this work, we propose Value-accelerated Persistent Reinforcement Learning (VaPRL), which generates a curriculum of initial states such that the agent can bootstrap on the success of easier tasks to efficiently learn harder tasks. The agent also learns to reach the initial states proposed by the curriculum, minimizing the reliance on human interventions in the learning process. We observe that VaPRL reduces the interventions required by three orders of magnitude compared to episodic RL, while outperforming prior state-of-the-art methods for reset-free RL both in terms of sample efficiency and asymptotic performance on a variety of simulated robotics problems¹.

1 Introduction

Reinforcement learning (RL) offers an appealing opportunity to enable autonomous acquisition of complex behaviors for interactive agents. Despite recent RL successes on robots [26, 34, 25, 28, 35, 22, 32, 23, 14], several challenges exist that inhibit wider adoption of reinforcement learning for robotics [48]. One of the major challenges to the autonomy of current reinforcement learning algorithms, particularly in robotics, is the assumption that each trial starts from an initial state drawn from a specific state distribution in the environment. Conventionally, reinforcement learning algorithms assume the ability to arbitrarily sample and reset to states drawn from this distribution, making such algorithms impractical for most real-world setups. Many prior examples of reinforcement learning on real robots have relied on extensive instrumentation of the robotic setup and human supervision to enable environment resets to this initial state distribution. This is accomplished through a human providing the environment reset themselves throughout training [8, 12, 4], scripted behaviors for the robot to reset the environment [28, 39], an additional robot executing scripted behavior to reset the environment [32], or engineered mechanical contraptions [46, 23]. The additional instrumentation of the environment and the creation of scripted behaviors are both time-intensive and often require additional resources such as sensors or even additional robots. The scripted reset behaviors are narrow in application, often designed for just a single task or environment, and their brittleness mandates human oversight of the learning process. Eliminating or minimizing the algorithmic reliance on reset mechanisms can enable more autonomous learning, and in turn allow agents to scale to a broader and harder set of tasks.

¹Code and supplementary videos are available at https://sites.google.com/view/vaprl/home

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
Figure 1: Comparison of the persistent RL setting with the episodic RL setting. Interventions (human or otherwise orchestrated) reset the environment to the initial state distribution after every episode in episodic RL, while the state of the environment persists throughout training in persistent RL. The learned policy is tested starting from the initial state distribution in both settings.

To address these challenges, some recent works have developed reinforcement learning algorithms that can effectively learn with minimal resets to the initial distribution [19, 6, 48, 43, 14]. We provide a formal problem definition that encapsulates and sheds light on the general setting addressed by these prior methods, which we refer to as persistent reinforcement learning in this work. In this problem setting, we disentangle the training and test-time settings such that the test-time objective matches that of the conventional RL setting, but the train-time setting restricts access to the initial state distribution by providing only a low-frequency periodic reset. In this setting, the agent must persistently learn and interact with the environment with minimal human interventions, as shown in Figure 1. Conventional episodic RL algorithms often fail to solve the task entirely in this setting, as shown by Zhu et al. [48] and in Figure 2. This is because these methods rely on the ability to sample the initial state distribution arbitrarily. One solution to this problem is to additionally learn a reset policy that recovers the initial state distribution [19, 6], allowing the agent to repeatedly alternate between practicing the task and practicing the reverse. Unfortunately, not only can solving the task directly from the initial state distribution be hard from an exploration standpoint, but (attempting to) return to the initial state repeatedly can be sample inefficient. In this paper, we propose to instead have the agent reset itself to, and attempt the task from, different initial states along the path to the goal state. In particular, the agent can learn to solve the task from easier starting states that are closer to the goal and bootstrap on these to solve the task from harder states farther away from the goal.

Figure 2: The performance of episodic RL algorithms substantially deteriorates when environment resets are not available.

The main contribution of this work is Value-accelerated Persistent Reinforcement Learning (VaPRL), a goal-conditioned RL method that creates an adaptive curriculum of starting states for the agent to efficiently improve test-time performance while substantially reducing the reliance on extrinsic reset mechanisms. Additionally, we provide a formal description of the persistent RL problem setting to conceptualize our work and prior methods. We benchmark VaPRL on several robotic control tasks in the persistent RL setting against state-of-the-art methods, which either simulate the initial state distribution by learning a reset controller or incrementally grow the state space from which the given task can be solved. Our experiments indicate that using a tailored curriculum generated by VaPRL can be up to 30% more sample-efficient in acquiring task behaviors compared to these prior methods. For the most challenging dexterous manipulation problem, VaPRL provides a 2.5× gain in performance compared to the next best performing method.

2 Related Work
Robot learning. Prior works using reinforcement learning have relied on manually designed controllers or human supervision to enable episodic environment resets, as required by current algorithms. This can be done through human-orchestrated resets [8, 13, 12, 4, 16], which require high-frequency human intervention in robot training. In some cases, it is possible to execute a scripted behavior to reset the environment [28, 32, 47, 39, 46, 1]. However, programming such behaviors is time-intensive for the practitioner, and robot training still requires human oversight as the scripted behaviors are often brittle. Some prior works have designed the environment [35, 7, 22] to bypass the need for a reset mechanism; this is not generally applicable and can require extensive environment design. Some recent works leverage multi-task RL to bypass the need for extrinsic reset mechanisms [15, 14]. Typically, a task-graph uses the current state to decide the next task for the agent, such that only minimal intervention is required during training. However, these task-graphs are specific to a problem and require additional engineering to appropriately decide the next task.

Reset-free reinforcement learning. Constraining access to orchestrated reset mechanisms severely impedes policy learning when using current RL algorithms [48]. Recent works have proposed several algorithms to reduce reliance on extrinsic reset mechanisms by learning a reset controller to retrieve the initial state distribution [19, 6], by learning a perturbation controller [48], or by learning reset skills in adversarial games [43]. These works implicitly define a target state distribution for a reset controller: Han et al. [19] and Eysenbach et al. [6] target a fixed initial state distribution; Zhu et al. [47] target a uniform distribution over the state space as a consequence of the novelty-seeking behavior of the reset controller; and Xu et al. [43] target an adversarial form of initial state distribution to produce a more robust policy. In contrast, our proposed algorithm VaPRL generates a curriculum of starting states tailored to the task and the agent's performance. Our experiments demonstrate that VaPRL outperforms these prior methods in both sample efficiency and absolute performance. Other recent work [29] has considered combining model-based RL with unsupervised skill discovery to solve reset-free learning problems, but it largely focuses on avoiding sink states rather than attempting tasks repeatedly with a curriculum like VaPRL.

Curriculum generation for reinforcement learning. Curriculum generation is a crucial aspect of sample-efficient learning in VaPRL. Prior works have shown that using a curriculum can enable faster learning and improve performance [9, 10, 40, 30, 33, 27]. A task-tailored curriculum can simplify exploration, as it is easier to solve the task from certain states [21, 9], enabling faster progress on the downstream task. In addition to proposing a novel method for curriculum generation, we design it for the persistent RL setting without requiring the ability to reset the environment to arbitrary states, as assumed by prior work.

Persistent vs. lifelong reinforcement learning. Prior reinforcement learning algorithms that reduce the need for oracle resets have relied on the problem setting of lifelong or continual reinforcement learning [41, 24], even when the objective in practice is to learn episodic behaviors.
Both the persistent RL and lifelong learning frameworks transcend the episodic setting for training, promoting more autonomy in reinforcement learning. However, persistent reinforcement learning distinguishes between the training and evaluation objectives, where the evaluation objective matches that of episodic reinforcement learning. While the assumptions of episodic reinforcement learning are hard to realize for real-world training, real-world deployment of policies is often episodic. This is commonly true for robotics, where the assigned tasks are expected to be repetitive but it is hard to orchestrate resets in the training environment. This makes persistent reinforcement learning a suitable framework for modelling robotic learning tasks.

3 Persistent Reinforcement Learning

In this section, we formalize persistent reinforcement learning as an optimization problem. The key insight is to separate the evaluation and training objectives such that the evaluation objective measures the performance of the desired behavior while the training objective enables us to acquire those behaviors, while recognizing that frequent invocation of a reset mechanism is untenable. We first provide a general formulation, and then adapt persistent RL to the goal-conditioned setting.

Definition. Consider a Markov decision process (MDP) $\mathcal{M}_E \triangleq (\mathcal{S}, \mathcal{A}, p, r, \rho, \gamma, H_E)$ [37]. Here, $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $p : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$ denotes the transition dynamics, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ denotes the reward function, $\rho : \mathcal{S} \to \mathbb{R}_{\geq 0}$ denotes the initial state distribution, $\gamma \in [0, 1]$ denotes the discount factor, and $H_E$ denotes the episode horizon. Our objective is to learn a policy $\pi$ that maximizes the episodic expected sum of discounted rewards $J_E(\pi) = \mathbb{E}[\sum_{t=1}^{H_E} \gamma^t r(s_t, a_t)]$, where $s_0 \sim \rho(\cdot)$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim p(\cdot \mid s_t, a_t)$. However, generating samples from the initial state distribution $\rho$ invokes a reset mechanism, which is hard to realize in the real world. We want to construct an MDP $\mathcal{M}_T$ corresponding to our training environment that reduces invocations of the reset mechanism. To reduce such interventions, we consider a training environment $\mathcal{M}_T \triangleq (\mathcal{S}, \mathcal{A}, p, r_t, \bar\rho, \gamma, H_T)$ with episode horizon $H_T \gg H_E$. Naïvely optimizing $r$ can substantially deteriorate the performance of episodic RL algorithms, as shown in Figure 2, where we compare the evaluation performance with $H_E = 200$ when training in environments with $H_T = 200$ (with resets) versus $H_T = 200{,}000$ (without resets). Therefore, it becomes beneficial to consider a surrogate reward function $r_t$ rather than just optimizing $r$ naïvely. As a motivating example, consider a forward-backward controller that alternates between solving the task corresponding to $r$ and recovering the initial state distribution $\rho$. The surrogate reward function corresponding to this approach can be written as:

$$r_t(s_t, a_t) = \begin{cases} r(s_t, a_t) & t \in [1, H_E], [2H_E + 1, 3H_E], \ldots \\ r_\rho(s_t, a_t) & t \in [H_E + 1, 2H_E], \ldots \end{cases} \tag{1}$$

Here, $r_t$ alternates between the task reward $r$ for $H_E$ steps and $r_\rho$ (which encourages initial state distribution recovery) for $H_E$ steps (we assume that the state includes information indicating the reward function being optimized so that the agent can take appropriate actions, for example, one-hot task indicators as is common in multi-task RL), also illustrated in Figure 3 (right). This surrogate reward function allows the agent to repeatedly practice the task, thus using the autonomous interaction more judiciously compared to the naïve approach. Note, this $r_t$ loosely recovers the objectives used in some prior works [19, 6].
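To make the alternation in Equation 1 concrete, the minimal sketch below (ours, not the authors' released code) implements a time-indexed surrogate reward that switches between a task reward and an initial-state-recovery reward every $H_E$ steps; `task_reward`, `reset_reward`, and the horizons are placeholder names.

```python
# Minimal sketch of the forward-backward surrogate reward in Eq. (1).
# `task_reward` and `reset_reward` are placeholder callables standing in
# for r(s, a) and r_rho(s, a); H_E is the evaluation horizon.

def surrogate_reward(t, state, action, task_reward, reset_reward, H_E):
    """Return r_t(s_t, a_t): the task reward on even H_E-length blocks,
    the initial-state-recovery reward on odd blocks."""
    block = (t - 1) // H_E          # 0-indexed block of H_E consecutive steps
    if block % 2 == 0:              # t in [1, H_E], [2H_E + 1, 3H_E], ...
        return task_reward(state, action)
    else:                           # t in [H_E + 1, 2H_E], ...
        return reset_reward(state, action)

# Example: with H_E = 200, steps 1..200 use the task reward, steps 201..400
# use the reset reward, and so on for all H_T >> H_E training steps.
```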
For a general time-dependent surrogate reward function $r_t$, we define the training objective as

$$J_T(\pi) = \mathbb{E}_{s_0 \sim \bar\rho,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t)}\left[\sum_{t=1}^{H_T} \gamma^t r_t(s_t, a_t)\right] \tag{2}$$

where $\bar\rho$ is the initial state distribution at training time (which does not need to match the evaluation-time initial state distribution $\rho$). The persistent RL optimization objective is to maximize $J_T(\pi)$ efficiently under the constraint that $J_E(\arg\max_\pi J_T(\pi)) = \max_\pi J_E(\pi)$. Intuitively, the objective encourages construction of a training environment that can recover the optimal policy for the evaluation environment. The primary design choice is $r_t$, which as shown above leads to different algorithms. Another design choice is $\bar\rho$, which may or may not match $\rho$. Importantly, we do not assume $\bar\rho$ is any easier to sample from than $\rho$. Finally, we note that the formulation discussed here is suitable only for reversible environments. Reversible environments guarantee that the agent can continue to make progress on the task and not get stuck (for example, if the object goes out of reach of the robot's arm). A large class of practical tasks can be considered reversible (door opening, cloth folding, and so on), or the environment can be constructed to enforce reversibility (e.g., adding bounding walls so the object does not go out of reach). A formal definition of reversible environments is provided in Appendix A. In this work, we restrict ourselves to reversible environments and defer a full discussion of persistent RL for environments with irreversible states to future work.

Goal-conditioned persistent reinforcement learning. We adapt the general formulation above to a goal-conditioned [20, 38] instantiation of persistent RL. Consider a goal-conditioned MDP $\mathcal{M}_E \triangleq (\mathcal{S}, \mathcal{A}, \mathcal{G}, p, r, \rho, \gamma, H_E)$, where $\mathcal{G} \subseteq \mathcal{S}$ denotes the goal space. For a goal distribution $p_g : \mathcal{G} \to \mathbb{R}_{\geq 0}$, the evaluation objective is $J_E(\pi) = \mathbb{E}_{g \sim p_g(\cdot)}\mathbb{E}_{\pi(\cdot \mid s, g)}[\sum_{t=1}^{H_E} \gamma^t r(s_t, g)]$ for $\pi : \mathcal{S} \times \mathcal{A} \times \mathcal{G} \to \mathbb{R}_{\geq 0}$. The training objective is then stated as:

$$J_T(\pi) = \mathbb{E}_{s_0 \sim \bar\rho,\; a_t \sim \pi(\cdot \mid s_t, G(s_t, p_g)),\; s_{t+1} \sim p(\cdot \mid s_t, a_t)}\left[\sum_{t=1}^{H_T} \gamma^t \bar r(s_t, G(s_t, p_g))\right] \tag{3}$$

where we assume that $\bar r = r$ remains a goal-reaching objective, but where algorithms instead use a goal generator $G$ to generate a curriculum of goals to practice throughout training (the goal generator may use additional memory, which is not explicitly represented here). The intuition is that, since $H_T \gg H_E$, the algorithm can repeatedly practice reaching various task goals. However, the objective is to learn a policy that can reach task goals from $p_g$ in the test environment, i.e., starting from the initial state distribution $\rho$. This implies the goal generator $G$ should expand the goal space beyond the task goals to improve the policy $\pi$ for the test environment. For example, the goal generator could alternate between task goals ($g \sim p_g$) and the initial state distribution ($s \sim \rho$), which again loosely recovers prior works [19, 6]. This instantiation transforms the problem of finding the right reward function $r_t$ into finding the right curriculum of goals using $G$.

4 Value-Accelerated Persistent Reinforcement Learning

To address the goal-conditioned persistent RL problem, we now describe our proposed algorithm, VaPRL. The key idea in VaPRL is that the agent does not need to return to the initial state distribution between every attempt at the task, and can instead choose to practice from states that facilitate efficient learning. Section 4.1 discusses how to generate this curriculum of initial states.
Figure 3: An overview of the VaPRL algorithm (left) compared to forward-backward RL (right). For VaPRL, the value function gives us a set of states from which the agent can solve the task with some confidence (shaded in green), and VaPRL chooses the state closest to the initial state distribution among them (purple square). In each iteration, the agent can bootstrap on the knowledge of solving the task from a future state (bold green), which simplifies exploration from its current state (broken green line). As the performance of the agent improves, the states commanded by VaPRL move closer to the initial state distribution. This is in contrast to the forward-backward controller, which alternates between the test goals and the initial state distribution over multiple iterations of forward and reverse passes.

Using goal-conditioned RL within VaPRL allows us to use the same policy to solve the task and reach the initial states suggested by the curriculum, in contrast to prior work that learns separate reset and task policies. Section 4.2 describes how careful goal relabeling can be leveraged to efficiently learn this unified goal-reaching policy. We also discuss how VaPRL can effectively use prior data, which often becomes crucial for efficiently solving hard sparse-reward tasks.

4.1 Generating a Curriculum Using the Value Function

Consider the problem of reaching a goal $g \sim p_g$ in the MDP $\mathcal{M}_E$. Learning how to reach the goal $g$ is easier starting from a state $s \in \mathcal{S}$ that is close to $g$, especially when the rewards are sparse. Knowing how to reach the goal $g$ from a state $s$ in turn makes it easier to reach the goal from states in the neighborhood of $s$, enabling us to incrementally move farther away from the goal $g$. Bootstrapping on the success of an easier problem to solve a harder problem motivates the use of a curriculum in reinforcement learning, as illustrated in Figure 3. Following this intuition, we aim to define an increasingly difficult curriculum such that the policy is eventually able to reach the goal $g$ starting from the initial state distribution $\rho$. Our simple scheme is to sample a task goal $g \sim p_g$, run the policy $\pi$ with a subgoal $C(g)$ as input, and then run the policy with the task goal $g$ as input. The main question now becomes: given a goal $g$, how do we select the subgoal $C(g)$ to attempt the goal $g$ from? We propose to set up $C(g)$ as follows:

$$C(g) = \arg\min_{s} X_\rho(s) \quad \text{s.t.} \quad V^\pi(s, g) \geq \epsilon, \tag{4}$$

where $X_\rho$ is a user-specified distance function between the state $s$ and the initial state distribution $\rho$, $V^\pi(s, g) = \mathbb{E}[\sum_{t=1}^{H_E} \gamma^t r(s_t, g) \mid s_1 = s]$ denotes the value function of the policy $\pi$ reaching the goal $g$ from the state $s$, and $\epsilon \in \mathbb{R}$ is some fixed threshold. Here, the value function represents the ability of the policy to reach the goal $g$ from the state $s$.
To see this, consider the case where the discount factor $\gamma = 1$ and $r(s, g) = 1$ when the state reaches the goal $g$ and $0$ otherwise: the value function $V^\pi$ exactly represents the probability of reaching the goal $g$ from state $s$ when following the policy $\pi$. The intuition carries over to $\gamma \in [0, 1)$ as well, where the environment can transition into a terminal state with probability $1 - \gamma$ at every step. For general goal-reaching reward functions, a state $s$ with a higher value under $V^\pi(s, g)$ still represents a greater ability of the policy $\pi$ to reach the goal $g$. Revisiting Equation 4 with this understanding of the value function, the objective $C(g)$ chooses the state closest to the initial state distribution for which the value function $V^\pi(s, g)$ crosses the threshold $\epsilon$. This encourages the curriculum to be closer to the goal state in the early stages of training, when the policy is less effective at reaching the goal. As the policy improves, a larger number of states satisfy the constraint and the curriculum progressively moves closer to the initial state distribution. Eventually, the curriculum converges to the initial state distribution, leading to a policy $\pi$ that optimizes the evaluation objective in the MDP $\mathcal{M}_E$. Following this intuition, we can write the goal generator $G(s_t, p_g)$ as:

$$G(s_t, p_g) = g \quad \text{s.t.} \quad \begin{cases} g \leftarrow C(g_{\text{task}}),\; g_{\text{task}} \sim p_g & \text{if } \operatorname{switch}(s_t, g) = \text{subgoal} \\ g \leftarrow g_{\text{task}} & \text{if } \operatorname{switch}(s_t, g) = \text{task goal} \end{cases} \tag{5}$$

where $\operatorname{switch}(s_t, g_{\text{cur}})$ triggers if the current goal $g_{\text{cur}}$ has been reached or has been in place for $H_E$ steps. For every new goal $g_{\text{task}} \sim p_g$, we first attempt to reach the curriculum subgoal $C(g_{\text{task}})$ (that is, $\operatorname{switch}(s_t, g) = \text{subgoal}$), and then we attempt to reach the goal $g_{\text{task}}$ (that is, $\operatorname{switch}(s_t, g) = \text{task goal}$). This cycle repeats until the environment resets after $H_T$ steps.

Computing the Curriculum Generator $C(g)$. Equation 4 involves a minimization over the state space $\mathcal{S}$, which is intractable in general. While it is possible to construct general solutions by building a generative model $p(s)$ and taking a minimum over the samples generated by it, we opt for a simpler solution: we use the data collected by the policy $\pi$ during training and minimize $C(g)$ over a randomly sampled subset of it by enumeration. If an offline dataset or demonstrations are available, we can also minimize $C(g)$ over this data exclusively. The constrained minimization can similarly be approximated by considering the subset of the data that satisfies the constraint $V^\pi(s, g) \geq \epsilon$ and choosing the state from this subset that minimizes $X_\rho$. If no state satisfies the constraint, $C(g)$ returns the state with the maximum $V^\pi(s, g)$.

Measuring the Initial State Distribution Distance. An important component of curriculum generation is choosing the distance function $X_\rho(s)$, which should reflect the distance of the state from the initial state distribution under the environment dynamics. To factor in the dynamics, we can use the learned goal-conditioned value function as a measure of the shortest distance between a state and a goal [20, 36]. In particular, we can use the negated value $X_\rho(s) = -\mathbb{E}_{s_0 \sim \rho}[V^\pi(s, s_0)]$, so that states from which the policy can more easily return to the initial state distribution are treated as closer to it. While this choice is convenient as we are already estimating $V^\pi(s, g)$, there is an even simpler choice for $X_\rho(s)$ when offline demonstration data is available. Assuming that the trajectories in the provided dataset start from the initial state distribution $\rho$, we can use the timestep index of the state as its distance from the initial state distribution, that is, $X_\rho(s) = \arg_t(s, D)$, where $D$ denotes the offline demonstration set. The step-index distance function encodes the intuition that states which require more steps by the policy are farther away, and it naturally accounts for the dynamics of the environment. Finally, since we are minimizing $X_\rho(s)$ in Equation 4, if there are multiple trajectories reaching the same state or suboptimal loops within a single trajectory, we use the shortest distance to that state.
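The enumeration-based approximation of Equation 4 and the step-index distance described above can be sketched as follows. This is an illustrative implementation under assumed interfaces, not the authors' code: `value_fn(s, g)` stands in for the learned $V^\pi$, `dist_to_init` for $X_\rho$, and the candidate states are drawn from replay or demonstration data.

```python
import numpy as np

# Sketch of the approximate curriculum generator C(g) from Eq. (4): enumerate
# a sampled set of candidate states, keep those whose value V^pi(s, g) crosses
# the threshold epsilon, and return the one closest to the initial state
# distribution under X_rho.

def curriculum_subgoal(goal, candidate_states, value_fn, dist_to_init, epsilon):
    values = np.array([value_fn(s, goal) for s in candidate_states])
    dists = np.array([dist_to_init(s) for s in candidate_states])
    feasible = np.flatnonzero(values >= epsilon)
    if feasible.size > 0:
        # Among states the policy can already solve the task from, pick the
        # one closest to the initial state distribution.
        idx = feasible[np.argmin(dists[feasible])]
    else:
        # No state satisfies the constraint yet: fall back to the
        # highest-value state, as described in the text.
        idx = int(np.argmax(values))
    return candidate_states[idx]

# Step-index distance for demonstration states: states that appear later in a
# demo trajectory (which starts from rho) are treated as farther from rho. If
# a state appears in multiple trajectories, the shortest index is kept.
def demo_step_index_distance(demos):
    # `demos` is a list of trajectories, each a list of hashable states
    # (e.g., tuples).
    index = {}
    for traj in demos:
        for t, s in enumerate(traj):
            index[s] = min(index.get(s, t), t)
    return lambda s: index.get(s, float("inf"))
```

The `demo_step_index_distance` helper assumes demonstration states are hashable; with continuous states one would instead match by nearest neighbor, which we omit for brevity.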
4.2 Relabeling Goals

Figure 4: An illustration of goal relabeling in VaPRL. Every transition in a trajectory is relabeled with a randomly sampled subset of curriculum goals, yielding a large set of relabeled tuples that are added to the replay buffer. This ensures efficient data reuse.

Not only does our policy need to learn how to reach $g \sim p_g$, it also needs to learn how to reach all the goals generated by $C(g)$ over the course of training, causing the effective goal space to grow substantially. However, there is a lot of shared structure in reaching goals, especially those generated by the curriculum: the knowledge of how to reach a goal $g_1$ also conveys meaningful information about how to reach a goal $g_2$. This structure can be leveraged using techniques from goal relabeling [2]. In particular, we relabel every trajectory collected in the training environment with $N$ goals sampled randomly from the set of goals that may become part of the curriculum. If we do not have any prior data, we randomly sample the replay buffer for relabeling goals. If we are given some prior data $D$, we instead sample relabeling goals from $D \cup \{g \sim p_g\}$, that is, from the demonstration states and the task goals. There is a subtle difference between hindsight experience replay (HER) and the goal relabeling strategy we employ: while HER chooses future states from within an episode as goals for relabeling, we exclusively choose states that may be used as goal states in the curriculum, which may not occur in the collected trajectory at all. Since our policy will only be tasked with reaching goals generated by the goal generator $G$, it is advantageous to extract signal specifically for these goals. To summarize, while the goal space has grown, goal relabeling enables us to generate data for the algorithm commensurately, improving sample efficiency.
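As a concrete illustration of this relabeling strategy, the sketch below duplicates a stored transition with $N$ additional goals drawn from the candidate curriculum goals (demonstration states and task goals when available, replay states otherwise). The interfaces are assumptions rather than the authors' implementation; `reward_fn` is a placeholder for the goal-conditioned reward $r(s', g)$.

```python
import random

# Sketch of the relabeling step from Section 4.2: each stored transition is
# duplicated with N additional goals drawn from the set of potential
# curriculum goals, and the reward is recomputed under each relabeled goal.

def relabel_transition(transition, candidate_goals, reward_fn, n_relabel):
    s, a, s_next, g, r = transition
    relabeled = []
    k = min(n_relabel, len(candidate_goals))
    for new_goal in random.sample(candidate_goals, k=k):
        relabeled.append((s, a, s_next, new_goal, reward_fn(s_next, new_goal)))
    return relabeled

# Usage: the original transition plus its relabeled copies are all pushed to
# the replay buffer, so every interaction provides learning signal for many
# of the goals the curriculum may later command.
```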
Algorithm Summary. The outline of VaPRL is given in Algorithm 1. At a high level, VaPRL takes a set of demonstrations as input and adds it to the replay buffer $\mathcal{B}$. These demonstrations are relabelled to generate additional trajectories such that every intermediate state is used as a goal. Next, VaPRL starts collecting data in the training MDP $\mathcal{M}_T$. At every step, VaPRL samples the goal generator $G$ to get the current goal and collects the next transition using the current policy $\pi$. This transition is added to the replay buffer $\mathcal{B}$ along with $N$ relabelled transitions, as described in Section 4.2. The policy $\pi$ and the critic $Q^\pi$ are updated at every step, using any off-policy reinforcement learning algorithm. This loop is repeated for $H_T$ steps, until an extrinsic intervention resets the environment to a state $s \sim \bar\rho$. Note, it is not necessary to initialize the agent close to the goal. Additional details pertaining to the algorithm can be found in Appendix B.

Algorithm 1: Value-Accelerated Persistent Reinforcement Learning (VaPRL)
  Input: initial state(s) $D_\rho$, $N$   // $N$: number of goals for relabeling
  Optional: demonstrations $D$
  Initialize replay buffer $\mathcal{B}$, $\pi(a \mid s, g)$, $Q^\pi(s, a, g)$
  // If demonstrations are given, add them to the replay buffer and relabel
  $\mathcal{B} \leftarrow \mathcal{B} \cup D$; relabel_demos($\mathcal{B}$)
  while not done do
      $s \sim \bar\rho$   // sample initial state (extrinsic reset)
      for $H_T$ steps do
          $g \leftarrow G(s, p_g)$   (Eq. 5)
          $a \sim \pi(\cdot \mid s, g)$, $s' \sim p(\cdot \mid s, a)$
          $\mathcal{B} \leftarrow \mathcal{B} \cup \{(s, a, s', g, r(s', g))\}$
          for $i \leftarrow 1$ to $N$ do
              $g' \sim D \cup p_g$   // if $D = \emptyset$, sample the replay buffer
              $\mathcal{B} \leftarrow \mathcal{B} \cup \{(s, a, s', g', r(s', g'))\}$
          update $\pi$, $Q^\pi$
          $s \leftarrow s'$
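The pieces above can be composed into a training loop. The sketch below is an illustrative Python rendering of Algorithm 1 under assumed interfaces: an `env` exposing `reset()`/`step()`, an off-policy `agent` (e.g., SAC) exposing `act()`/`update()`/`value()`, a goal-conditioned replay `buffer`, and a `reached` predicate implementing the switch condition. `curriculum_subgoal` and `relabel_transition` refer to the earlier sketches; none of these names come from the released code.

```python
import random

def vaprl_train(env, agent, buffer, demo_states, task_goals, reward_fn,
                dist_to_init, reached, H_E, H_T, N, epsilon, num_resets):
    for _ in range(num_resets):                  # one extrinsic reset per outer iteration
        s = env.reset()                          # s ~ rho_bar (training reset)
        task_goal = random.choice(task_goals)    # g_task ~ p_g
        goal = curriculum_subgoal(task_goal, buffer.sample_states(), agent.value,
                                  dist_to_init, epsilon)
        phase, steps_on_goal = "subgoal", 0
        for _ in range(H_T):
            # switch(): move on once the current goal is reached or H_E steps elapse
            if reached(s, goal) or steps_on_goal >= H_E:
                if phase == "subgoal":           # subgoal done -> attempt the task goal
                    goal, phase = task_goal, "task_goal"
                else:                            # task-goal attempt done -> new task goal
                    task_goal = random.choice(task_goals)
                    goal = curriculum_subgoal(task_goal, buffer.sample_states(),
                                              agent.value, dist_to_init, epsilon)
                    phase = "subgoal"
                steps_on_goal = 0
            a = agent.act(s, goal)
            s_next = env.step(a)                 # assumed to return the next state
            transition = (s, a, s_next, goal, reward_fn(s_next, goal))
            buffer.add(transition)
            for t in relabel_transition(transition, demo_states + task_goals,
                                        reward_fn, N):
                buffer.add(t)
            agent.update(buffer)                 # off-policy update at every step
            s, steps_on_goal = s_next, steps_on_goal + 1
```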
Figure 5: Continuous control environments for goal-conditioned persistent RL. (Top left) A table-top rearrangement task, where a gripper is tasked with moving the mug to four potential goal positions; (top right) a sawyer robot learns how to close the door; and (bottom) a high-dimensional dexterous hand attached to a sawyer robot is tasked to pick up a three-pronged object.

5 Experiments

In this section, we study the performance of VaPRL on continuous control environments for goal-conditioned persistent RL and provide ablations and visualizations to isolate the effect of the curriculum. In particular, we aim to answer the following questions:
1. Does VaPRL allow efficient reinforcement learning with minimal episodic resets?
2. How does the scheme for generating a curriculum in VaPRL perform compared to other methods for persistent reinforcement learning?
3. Does VaPRL scale to high-dimensional state and action spaces?
4. What does the generated curriculum look like? Is the curriculum effective?
We next describe the specific choices of environments, evaluation metrics, and comparisons in order to answer the questions above.

Environments. For our experimental evaluation, we consider three continuous control environments, shown in Figure 5. The table-top rearrangement task is a simplified manipulation environment, where a gripper (modelled as a point mass which can attach to the object if it is close to it) is tasked with taking the mug to one of 4 potential goal squares. The evaluation horizon is $H_E = 200$ steps and the training horizon is $H_T = 200{,}000$ steps. This task involves a challenging exploration problem in navigating to objects, picking them up, and dropping them at the right location. The sawyer door closing environment involves using a sawyer robot arm to close a door to a particular target angle [45]. For this environment, we set the evaluation horizon to $H_E = 400$ and the training horizon to $H_T = 200{,}000$ steps. Since environment resets are not freely available, repeatedly practicing the task implicitly requires the agent to also learn how to open the door. The hand manipulation environment, introduced in [14], involves a dexterous hand attached to a sawyer robot. This environment entails a 16-DoF hand mounted on a 6-DoF arm, with the goal of manipulating a three-pronged object, as seen in Figure 5. In particular, the task involves picking up the object from random positions on a table and lifting it to a goal position above the table. This task is particularly challenging since it involves complex contact dynamics with high-dimensional state and action spaces. Additionally, the robot has to learn how to reconfigure the object to diverse locations to simulate the test-time conditions, where the agent is expected to pick up the object from random locations. For this environment, we set the evaluation horizon to $H_E = 400$ and the training horizon to $H_T = 400{,}000$ steps.

Figure 6: Performance of each method on (left) the table-top rearrangement environment, (center) the sawyer door closing environment, and (right) the hand manipulation environment. Plots show learning curves with mean and standard error over 5 random seeds. VaPRL is more sample-efficient and outperforms prior methods.

Environment Setup. For table-top rearrangement and sawyer door closing, we consider a sparse reward function $r(s, g) = \mathbb{I}(s, g)$, which is 1 when the state $s$ is close to the goal position $g$, and 0 otherwise. Since the hand manipulation environment is a substantially more challenging problem, we consider a dense reward function that rewards the hand and the object for being close to the goal position. To aid exploration in table-top rearrangement and sawyer door closing, we provide all the algorithms with a small set of trajectories for each environment (6 per goal: 3 going from the initial state to the goal and the other 3 going in reverse), though we do not assume that the trajectories take the optimal path (for example, the trajectories could come from teleoperation in practice). For the hand manipulation environment, we provide the agent with 10 trajectories demonstrating the pickup task from random positions on the table and 20 trajectories showing how to reposition the object to different locations on the table. For all environments, we report results by evaluating the number of times the policy successfully reaches the goal out of 10 trials in the evaluation environment $\mathcal{M}_E$ (by resetting to a state from the initial state distribution $\rho$ and sampling an appropriate goal from the goal distribution $p_g$), performing intermittent evaluations as training progresses. Note, the training agent does not receive the evaluation experience; it is only used to measure performance on the evaluation environment. Further details about the problem setup, demonstrations, implementation, hyperparameters, and evaluation metrics can be found in the Appendix.

Comparisons. We compare VaPRL to four approaches: (a) a standard off-policy RL algorithm that only trains to reach the goal distribution, such that a new goal $g \sim p_g$ is sampled every $H_E$ steps (labelled naïve RL); (b) a forward-backward controller [19, 6], which alternates between $g \sim p_g$ and $g \sim \rho$ for $H_E$ steps each, as described in Section 3 (labelled FBRL); (c) a perturbation controller [48] that alternates between optimizing a controller to maximize task reward and a controller to maximally perturb the state via task-agnostic exploration (labelled R3L); and (d) RL directly on the evaluation environment, resetting to the initial state distribution after every $H_E$ steps (labelled oracle RL). This oracle is an expected upper bound on the performance of VaPRL, since it has access to episodic resets. We use soft actor-critic [17] as the base RL algorithm for all methods to ensure a fair comparison, although any value-based method would be equally applicable. To emphasize, all the algorithms are provided the same set of demonstrations. Further implementation details can be found in the Appendix.
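For concreteness, the evaluation protocol described above can be sketched as follows; this reflects our reading of the setup rather than the authors' evaluation script, and `eval_env`, `agent`, and `reached` are assumed interfaces. Evaluation experience is never added to the training replay buffer.

```python
import random

# Sketch of intermittent evaluation in the episodic MDP M_E: reset to
# s_0 ~ rho, sample a task goal g ~ p_g, roll out for at most H_E steps,
# and report the success rate over a fixed number of trials.

def evaluate(eval_env, agent, task_goals, reached, H_E, num_trials=10):
    successes = 0
    for _ in range(num_trials):
        s = eval_env.reset()                    # s_0 ~ rho (episodic reset)
        g = random.choice(task_goals)           # g ~ p_g
        for _ in range(H_E):
            s = eval_env.step(agent.act(s, g))  # assumed to return the next state
            if reached(s, g):
                successes += 1
                break
    return successes / num_trials
```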
5.1 Persistent RL Results

The performance of each algorithm on the three evaluation domains is shown in Figure 6. We see that VaPRL outperforms naïve RL, FBRL, and R3L, providing substantial improvements in terms of sample efficiency. For our most challenging domain of hand manipulation, the sample efficiency enables us to reach much better performance within the training budget. The primary difference between the methods is that VaPRL uses a curriculum of starting states progressing from easier to harder states. In contrast, FBRL always attempts to reach the initial state distribution, and R3L uses a perturbation controller to reach novel states in an attempt to cover the entire state space uniformly.

In the table-top rearrangement environment, the curriculum starts the agent close to the goal and then gradually moves the starting states back toward the initial state distribution, trying different intermediate states in the process (discussed in Section 5.3). In the sawyer door closing environment, the agent learns to close the door from intermediate angles, incrementally improving its performance. For the hand manipulation domain, the agent focuses on picking up the object from a particular location, and then incrementally grows the set of locations from which it can complete the pickup. In contrast, FBRL attempts the pickup from random states drawn from the initial state distribution $\rho$, and R3L attempts to find new states to pick up the object from (even though it might not yet be succeeding at picking up the object from previous locations).

Figure 8: Visualization of curricula generated by VaPRL on the table-top rearrangement environment. We plot the step-index distance between the initial state and the curriculum goals generated by $C(g)$ (blue) and the evaluation performance (orange) as training progresses. The distance is normalized to be on the same scale as the success metric, such that a value of 1 corresponds to the test-goal distribution and 0 corresponds to the initial state distribution. We visualize some of the commanded goals $C(g)$ during training, observing that the curriculum gradually progresses from goal states to initial states with a correlated improvement in evaluation performance.

Compared to oracle RL with resets, VaPRL learns to solve the task while requiring 500× fewer environment resets in the door closing environment, and 1000× fewer environment resets in the table-top rearrangement and dexterous hand manipulation environments. This amounts to fewer than 20 total interventions for VaPRL, indicating the substantial autonomy with which the algorithm can run. Surprisingly, in the domains with a sparse reward function (that is, sawyer door closing and table-top rearrangement), VaPRL matches or even outperforms the oracle RL method. In the table-top rearrangement environment, oracle RL does substantially worse than VaPRL. It has been noted in prior work that multi-goal RL problems can converge suboptimally due to issues arising in the optimization [44]. We hypothesize that an appropriate initial state distribution can ameliorate some of these issues. In particular, moving beyond deterministic initial distributions may lead to better downstream performance (also noted in [47]). For the sawyer door closing environment, VaPRL matches the performance of oracle RL. To emphasize, oracle RL trains on the evaluation environment directly, that is, $H_T = H_E$ with the environment resetting to a state $s_0 \sim \rho$. In contrast, VaPRL also learns how to reverse the task it is solving and thus only spends about half of its training samples collecting data for the evaluation task (for example, VaPRL learns how to open the door as well as close it).

5.2 Isolating the Role of the Initial State Distribution

Figure 7: Ablation isolating the effect of the curriculum generated by VaPRL.

We construct an experiment to isolate the effect of the starting state distribution on learning efficiency and downstream performance. In this experiment, the environment resets directly to the state $C(g)$ for VaPRL, such that the policy only has to learn to reach the goals $g \sim p_g$ (labelled VaPRL + reset). Analogously, for FBRL, the environment resets to a state $s_0 \sim \rho$, which is identical to the oracle RL method (labelled oracle RL / FBRL + reset).
For R3L, the environment resets to a state uniformly sampled from the state space (labelled uniform / R3L + reset). We run this ablation on the table-top rearrangement environment, where the episode horizon for all the algorithms is $H_E = 200$. The results in Figure 7 indicate that the starting state distribution induced by the VaPRL curriculum improves learning efficiency, translating into improved performance in the persistent RL setting.

5.3 Visualizations of Generated Curricula

To better understand the curriculum generated by VaPRL, we visualize the sequence of states chosen by Equation 4 as training progresses on the table-top rearrangement environment, shown in Figure 8. As we can observe, the curriculum initially chooses states which are farther away from the initial state and closer to the goal distribution. As training progresses, the curriculum moves towards the initial state distribution. Correspondingly, the evaluation performance starts to improve as the curriculum moves closer to the initial state distribution. Thus, VaPRL can generate an adaptive curriculum for the agent to efficiently improve performance in the evaluation setting.

6 Conclusion

In this work, we propose VaPRL, an algorithm that can efficiently solve reinforcement learning problems with minimal episodic resets by generating a curriculum of starting states. VaPRL is able to reduce the amount of human intervention required in the learning process by a factor of 1000 compared to episodic RL, while outperforming prior methods. In the process, we also formalize the problem setting of persistent RL to understand current algorithms and aid the development of future ones. There are a number of interesting avenues for future work that VaPRL does not currently address. A natural extension is to environments with irreversible states. This setting can likely be addressed by leveraging ideas from the literature on safe reinforcement learning [11, 42, 3, 5]. Another extension is to work with visual state spaces, allowing the algorithm to be more broadly applicable in the real world. These two extensions would be a significant step towards enabling autonomous agents in the real world that minimize the reliance on human interventions.

7 Disclosure of Funding

This work was supported in part by Schmidt Futures and ONR grant N00014-21-1-2685.

References

[1] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett, editors, NeurIPS, 2016.
[2] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba. Hindsight experience replay. arXiv preprint arXiv:1707.01495, 2017.
[3] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. arXiv preprint arXiv:1705.08551, 2017.
[4] Y. Chebotar, K. Hausman, M. Zhang, G. Sukhatme, S. Schaal, and S. Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In International Conference on Machine Learning, pages 703–711. PMLR, 2017.
[5] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. arXiv preprint arXiv:1805.07708, 2018.
[6] B. Eysenbach, S. Gu, J. Ibarz, and S. Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. arXiv preprint arXiv:1711.06782, 2017.
[7] C. Finn and S. Levine. Deep visual foresight for planning robot motion.
In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017.
[8] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 512–519. IEEE, 2016.
[9] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel. Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning, pages 482–495. PMLR, 2017.
[10] C. Florensa, D. Held, X. Geng, and P. Abbeel. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, pages 1515–1528. PMLR, 2018.
[11] J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
[12] A. Ghadirzadeh, A. Maki, D. Kragic, and M. Björkman. Deep predictive policy training using reinforcement learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2351–2358. IEEE, 2017.
[13] S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE, 2017.
[14] A. Gupta, J. Yu, T. Zhao, V. Kumar, A. Rovinsky, K. Xu, T. Devlin, and S. Levine. Reset-free reinforcement learning via multi-task learning: Learning dexterous manipulation behaviors without human intervention. arXiv preprint arXiv:2104.11203, 2021.
[15] S. Ha, P. Xu, Z. Tan, S. Levine, and J. Tan. Learning to walk in the real world with minimal human effort, 2020.
[16] T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, and S. Levine. Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103, 2018.
[17] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
[18] D. Hafner, J. Davidson, and V. Vanhoucke. TensorFlow Agents: Efficient batched reinforcement learning in TensorFlow. arXiv preprint arXiv:1709.02878, 2017.
[19] W. Han, S. Levine, and P. Abbeel. Learning compound multi-step controllers under unknown dynamics. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6435–6442. IEEE, 2015.
[20] L. P. Kaelbling. Learning to achieve goals. In IJCAI, pages 1094–1099. Citeseer, 1993.
[21] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In Proc. 19th International Conference on Machine Learning. Citeseer, 2002.
[22] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
[23] D. Kalashnikov, J. Varley, Y. Chebotar, B. Swanson, R. Jonschkowski, C. Finn, S. Levine, and K. Hausman. MT-Opt: Continuous multi-task robotic reinforcement learning at scale. arXiv preprint arXiv:2104.08212, 2021.
[24] K. Khetarpal, M. Riemer, I. Rish, and D. Precup. Towards continual reinforcement learning: A review and perspectives. arXiv preprint arXiv:2012.13490, 2020.
[25] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[26] N. Kohl and P. Stone.
Policy gradient reinforcement learning for fast quadrupedal locomotion. In IEEE International Conference on Robotics and Automation (ICRA), volume 3, pages 2619–2624. IEEE, 2004.
[27] G. Konidaris and A. Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. Advances in Neural Information Processing Systems, 22:1015–1023, 2009.
[28] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[29] K. Lu, A. Grover, P. Abbeel, and I. Mordatch. Reset-free lifelong learning with skill-space planning. arXiv preprint arXiv:2012.03548, 2020.
[30] T. Matiisen, A. Oliver, T. Cohen, and J. Schulman. Teacher-student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3732–3740, 2019.
[31] T. M. Moldovan and P. Abbeel. Safe exploration in Markov decision processes. arXiv preprint arXiv:1205.4810, 2012.
[32] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar. Deep dynamics models for learning dexterous manipulation. In Conference on Robot Learning, pages 1101–1112. PMLR, 2020.
[33] S. Narvekar and P. Stone. Learning curriculum policies for reinforcement learning. arXiv preprint arXiv:1812.00285, 2018.
[34] A. Y. Ng, H. J. Kim, M. I. Jordan, S. Sastry, and S. Ballianda. Autonomous helicopter flight via reinforcement learning. In NIPS, volume 16. Citeseer, 2003.
[35] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3406–3413. IEEE, 2016.
[36] V. Pong, S. Gu, M. Dalal, and S. Levine. Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081, 2018.
[37] M. L. Puterman. Markov decision processes. Handbooks in Operations Research and Management Science, 2:331–434, 1990.
[38] T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320. PMLR, 2015.
[39] A. Sharma, M. Ahn, S. Levine, V. Kumar, K. Hausman, and S. Gu. Emergent real-world robotic skills via unsupervised off-policy reinforcement learning. arXiv preprint arXiv:2004.12974, 2020.
[40] S. Sukhbaatar, Z. Lin, I. Kostrikov, G. Synnaeve, A. Szlam, and R. Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.
[41] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[42] B. Thananjeyan, A. Balakrishna, S. Nair, M. Luo, K. Srinivasan, M. Hwang, J. E. Gonzalez, J. Ibarz, C. Finn, and K. Goldberg. Recovery RL: Safe reinforcement learning with learned recovery zones. arXiv preprint arXiv:2010.15920, 2020.
[43] K. Xu, S. Verma, C. Finn, and S. Levine. Continual learning of control primitives: Skill discovery via reset-games. arXiv preprint arXiv:2011.05286, 2020.
[44] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn. Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782, 2020.
[45] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100. PMLR, 2020.
[46] A. Zeng, S. Song, J. Lee, A. Rodriguez, and T. Funkhouser. TossingBot: Learning to throw arbitrary objects with residual physics.
IEEE Transactions on Robotics, 36(4):1307–1319, 2020.
[47] H. Zhu, A. Gupta, A. Rajeswaran, S. Levine, and V. Kumar. Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost. In 2019 International Conference on Robotics and Automation (ICRA), pages 3651–3657. IEEE, 2019.
[48] H. Zhu, J. Yu, A. Gupta, D. Shah, K. Hartikainen, A. Singh, V. Kumar, and S. Levine. The ingredients of real-world robotic reinforcement learning. arXiv preprint arXiv:2004.12570, 2020.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] See Section 5.
(b) Did you describe the limitations of your work? [Yes] See Section 3 and Section 6.
(c) Did you discuss any potential negative societal impacts of your work? [No] This work inherits the potential negative societal impacts of reinforcement learning and its applications to robotics. We do not anticipate any additional negative impacts that are unique to this work.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] We will release the code and the environments upon publication.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 5 and the Appendix.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Section 5.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See the Appendix.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] See Section 5.
(b) Did you mention the license of the assets? [Yes] See the Appendix.
(c) Did you include any new assets either in the supplemental material or as a URL? [No]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]