Published as a conference paper at ICLR 2019

COMPETITIVE EXPERIENCE REPLAY

Hao Liu, Alexander Trott, Richard Socher, Caiming Xiong
Salesforce Research
Palo Alto, 94301
lhao499@gmail.com, {atrott, rsocher, cxiong}@salesforce.com

ABSTRACT

Deep learning has achieved remarkable successes in solving challenging reinforcement learning (RL) problems when a dense reward function is provided. However, in sparse reward environments it still often suffers from the need to carefully shape the reward function in order to guide policy optimization. This limits the applicability of RL in the real world, since both reinforcement learning expertise and domain-specific knowledge are required. It is therefore of great practical importance to develop algorithms which can learn from a binary signal indicating successful task completion or other unshaped, sparse reward signals. We propose a novel method called competitive experience replay, which efficiently supplements a sparse reward by placing learning in the context of an exploration competition between a pair of agents. Our method complements the recently proposed hindsight experience replay (HER) by inducing an automatic exploratory curriculum. We evaluate our approach on the tasks of reaching various goal locations in an ant maze and manipulating objects with a robotic arm. Each task provides only binary rewards indicating whether or not the goal is achieved. Our method asymmetrically augments these sparse rewards for a pair of agents each learning the same task, creating a competitive game designed to drive exploration. Extensive experiments demonstrate that this method leads to faster convergence and improved task performance.

1 INTRODUCTION

Recent progress in deep reinforcement learning has achieved very impressive results in domains ranging from playing games (Mnih et al., 2015; Silver et al., 2016; 2017; Open AI, 2018), to high-dimensional continuous control (Schulman et al., 2015; 2016; 2017), and robotics (Levine et al., 2018; Kalashnikov et al., 2018; Andrychowicz et al., 2018b). Despite these successes, in robotic control and many other areas, deep reinforcement learning still suffers from the need to engineer a proper reward function to guide policy optimization (see e.g. Popov et al. 2017; Ng et al. 2006). In robotic control tasks such as stacking bricks, the reward function must be shaped into a complex combination of multiple terms (Popov et al., 2017). Engineering such a reward function for every real-world task is extremely difficult and often impractical, since both reinforcement learning expertise and domain-specific knowledge are required. Learning to perform well in environments with sparse rewards therefore remains a major challenge, and it is of great practical importance to develop algorithms which can learn from a binary signal indicating successful task completion or other unshaped reward signals. When a dense reward function is not available, only a small fraction of the agent's experiences provides a useful gradient signal for policy optimization, leading to substantially higher sample complexity. Providing agents with useful signals to pursue in such sparse reward environments becomes crucial.

This work was done when the author was an intern at Salesforce Research.
In the domain of goal-directed RL, the recently proposed hindsight experience replay (HER) (Andrychowicz et al., 2017) addresses the challenge of learning from sparse rewards by re-labelling visited states as goal states during training. However, this technique continues to suffer from sample inefficiency, ostensibly due to difficulties related to exploration.

In this work, we address these limitations by introducing a method called Competitive Experience Replay (CER). This technique attempts to emphasize exploration by introducing a competition between two agents attempting to learn the same task. Intuitively, agent A (the agent ultimately used for evaluation) receives a penalty for visiting states that the competitor agent (B) also visits; and B is rewarded for visiting states found by A. Our approach maintains the reward from the original task such that exploration is biased towards the behaviors best suited to accomplishing the task goals. We show that this competition between agents can automatically generate a curriculum of exploration and shape an otherwise sparse reward. We jointly train both agents' policies by adopting methods from multi-agent RL. In addition, we propose two versions of CER, independent CER and interact CER, which differ in the state initialization of agent B: whether it is sampled from the initial state distribution or from off-policy samples of agent A, respectively. Whereas HER re-labels samples based on an agent's individual rollout, our method re-labels samples based on inter-agent behavior; as such, the two methods do not interfere with each other algorithmically and are easily combined during training.

We evaluate our method both with and without HER on a variety of reinforcement learning tasks, including navigating an ant agent to reach a goal position and manipulating objects with a robotic arm. For each such task the default reward is sparse, corresponding to a binary indicator of goal completion. Ablation studies show that our method is important for achieving a high success rate and often demonstrates faster convergence. Interestingly, we find that CER and HER are complementary methods and employ both to reach peak efficiency and performance. Furthermore, we observe that, when combined with HER, CER outperforms curiosity-driven exploration.

2 BACKGROUND

Here, we provide an introduction to the relevant concepts for reinforcement learning with sparse reward (Section 2.1), Deep Deterministic Policy Gradient, the backbone algorithm we build on (Section 2.2), and Hindsight Experience Replay (Section 2.3).

2.1 SPARSE REWARD REINFORCEMENT LEARNING

Reinforcement learning considers the problem of finding an optimal policy for an agent that interacts with an uncertain environment and collects reward per action. The goal of the agent is to maximize its cumulative reward. Formally, this problem can be viewed as a Markov decision process over the environment states $s \in S$ and agent actions $a \in A$, with the (unknown) environment dynamics defined by the transition probability $T(s'|s, a)$ and reward function $r(s_t, a_t)$, which yields a reward immediately following the action $a_t$ performed in state $s_t$.

We consider goal-conditioned reinforcement learning from sparse rewards. This constitutes a modification to the reward function such that it depends on a goal $g \in G$, i.e. $r_g : S \times A \times G \to \mathbb{R}$. Every episode starts with sampling a state-goal pair from some distribution $p(s_0, g)$.
Unlike the state, the goal stays fixed for the whole episode. At every time step, an action is chosen according to some policy $\pi$, which is expressed as a function of the state and the goal, $\pi : S \times G \to A$. For generality, our only restriction on $G$ is that it is a subset of $S$. In other words, the goal describes a target state and the task of the agent is to reach that state. Therefore, we apply the following sparse reward function:
$$r_t = r_g(s_t, a_t, g) = \begin{cases} 0, & \text{if } |s_t - g| < \delta \\ -1, & \text{otherwise} \end{cases} \qquad (1)$$
where $g$ is a goal, $|s_t - g|$ is a distance measure, and $\delta$ is a predefined threshold that controls when the goal is considered completed.

Following policy gradient methods, we model the policy as a conditional probability distribution over actions, $\pi_\theta(a|[s, g])$, where $[s, g]$ denotes the concatenation of state $s$ and goal $g$, and $\theta$ are the learnable parameters. Our objective is to optimize $\theta$ with respect to the expected cumulative reward, given by
$$J(\theta) = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi(a|s,g),\, g \sim G}\left[r_g(s, a, g)\right],$$
where $\rho^\pi(s) = \sum_{t=1}^{\infty} \gamma^{t-1} \Pr(s_t = s)$ is the normalized discounted state visitation distribution with discount factor $\gamma \in [0, 1)$. To simplify the notation, we denote $\mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi(a|s,g),\, g \sim G}[\cdot]$ by simply $\mathbb{E}_\pi[\cdot]$ in the rest of the paper. According to the policy gradient theorem (Sutton et al., 1998), the gradient of $J(\theta)$ can be written as
$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\nabla_\theta \log \pi(a|s, g)\, Q^\pi(s, a, g)\right], \qquad (2)$$
where $Q^\pi(s, a, g) = \mathbb{E}_\pi\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_g(s_t, a_t, g) \mid s_1 = s, a_1 = a\right]$, called the critic, denotes the expected return under policy $\pi$ after taking an action $a$ in state $s$, with goal $g$.

2.2 DDPG ALGORITHM

Here, we introduce Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), a model-free RL algorithm for continuous action spaces that serves as our backbone algorithm. Our proposed modifications need not be restricted to DDPG; however, we leave experimentation with other continuous control algorithms to future work. In DDPG, we maintain a deterministic target policy $\mu(s, g)$ and a critic $Q(s, a, g)$, both implemented as deep neural networks (note: we modify the standard notation to accommodate goal-conditioned tasks). To train these networks, episodes are generated by sampling actions from the policy plus some noise, $a \sim \mu(s, g) + \mathcal{N}(0, 1)$. The transition tuple associated with each action, $(s_t, a_t, g_t, r_t, s_{t+1})$, is stored in the so-called replay buffer. During training, transition tuples are sampled from the buffer to perform mini-batch gradient descent on the loss $L$, which encourages the approximated Q-function to satisfy the Bellman equation: $L = \mathbb{E}\left[(Q(s_t, a_t, g_t) - y_t)^2\right]$, where $y_t = r_t + \gamma Q(s_{t+1}, \mu(s_{t+1}, g_t), g_t)$. Similarly, the actor can be updated by mini-batch gradient descent on the loss $J(\theta) = -\mathbb{E}_s\left[Q(s, \mu(s, g), g)\right]$ through the deterministic policy gradient algorithm (Silver et al., 2014),
$$\nabla_\theta J(\theta) = -\mathbb{E}\left[\nabla_\theta \mu(s, g)\, \nabla_a Q(s, a, g)|_{a=\mu(s,g)}\right]. \qquad (3)$$
To make training more stable, the targets $y_t$ are typically computed using a separate target network, whose weights are periodically updated to the current weights of the main network (Lillicrap et al., 2015; Mnih et al., 2013; 2015).

2.3 HINDSIGHT EXPERIENCE REPLAY

Despite numerous advances in the application of deep learning to RL challenges, learning in the presence of sparse rewards remains a major challenge. Intuitively, these algorithms depend on sufficient variability within the encountered rewards and, in many cases, random exploration is unlikely to uncover this variability if goals are difficult to reach. Recently, Andrychowicz et al.
(2017) proposed Hindsight Experience Replay (HER) as a technique to address this challenge. The key insight of HER is that failed rollouts (where no task reward was obtained) can be treated as successful by assuming that a visited state was the actual goal. Basically, HER amounts to a re-labelling strategy. Every episode the agent experiences is stored in the replay buffer twice: once with the original goal pursued in the episode and once with the goal replaced by a future state achieved in the episode, as if the agent had been instructed to reach this state from the beginning. Formally, HER randomly samples a mini-batch of episodes from the buffer; for each episode $(\{s_i\}_{i=1}^T, \{g_i\}_{i=1}^T, \{a_i\}_{i=1}^T, \{r_i\}_{i=1}^T, \{s'_i\}_{i=1}^T)$ and for each state $s_t$ with $1 \le t \le T-1$ in that episode, we randomly choose $s_k$ with $t+1 \le k \le T$, relabel the transition $(s_t, a_t, g_t, r_t, s_{t+1})$ to $(s_t, a_t, s_k, r'_t, s_{t+1})$, and recalculate the reward $r'_t$:
$$r'_t = r_g(s_t, s_k) = \begin{cases} 0, & \text{if } |s_t - s_k| < \delta \\ -1, & \text{otherwise.} \end{cases} \qquad (4)$$

3 METHOD

In this section, we present Competitive Experience Replay (CER) for policy gradient methods (Section 3.1) and describe the application of multi-agent DDPG to enable this technique (Section 3.2).

3.1 COMPETITIVE EXPERIENCE REPLAY

While the re-labelling strategy introduced by HER provides useful rewards for training a goal-conditioned policy, it assumes that learning from arbitrary goals will generalize to the actual task goals. As such, exploration remains a fundamental challenge for goal-directed RL with sparse reward. We propose a re-labelling strategy designed to overcome this challenge. Our method is partly inspired by the success of self-play in learning to play competitive games, where sparse rewards (i.e. win or lose) are common. Rather than train a single agent, we train a pair of agents on the same task and apply an asymmetric reward re-labelling strategy to induce a competition designed to encourage exploration. We call this strategy Competitive Experience Replay (CER).

To implement CER, we learn a policy for each agent, $\pi_A$ and $\pi_B$, as well as a multi-agent critic (see below), taking advantage of methods for decentralized execution and centralized training. During decentralized execution, $\pi_A$ and $\pi_B$ collect DDPG episode rollouts in parallel. Each agent effectively plays a single-player game; however, to take advantage of multi-agent training methods, we arbitrarily pair the rollout from $\pi_A$ with that from $\pi_B$ and store them as a single, multi-agent rollout in the replay buffer $D$. When training on a mini-batch of off-policy samples, we first randomly sample a mini-batch of episodes from $D$ and then randomly sample transitions in each episode. We denote the resulting mini-batch of transitions as $\{(s^i_A, a^i_A, g^i_A, r^i_A, s'^i_A), (s^i_B, a^i_B, g^i_B, r^i_B, s'^i_B)\}_{i=1}^m$, where $m$ is the size of the mini-batch.

Reward re-labelling in CER attempts to create an implicit exploration curriculum by punishing agent A for visiting states that agent B also visited and, for those same states, rewarding agent B. For each A-state $s^i_A$ in a mini-batch of transitions, we check whether any B-state $s^j_B$ in the mini-batch satisfies $|s^i_A - s^j_B| < \delta$ and, if so, re-label $r^i_A$ with $r^i_A - 1$. Conversely, each time such a nearby state pair is found, we increment the associated reward for agent B, $r^j_B$, by $+1$. Each transition for agent A can therefore be penalized only once, whereas no restriction is placed on the extra reward given to a transition for agent B.
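To make the re-labelling rule concrete, the sketch below applies it to one sampled mini-batch. It is only an illustrative re-implementation under our own assumptions (NumPy arrays for states and rewards, and a Euclidean distance as the default distance measure `dist`); it is not the authors' released code.

```python
import numpy as np

def cer_relabel(states_a, rewards_a, states_b, rewards_b, delta, dist=None):
    """Competitive Experience Replay re-labelling (sketch).

    For every A-state that lies within `delta` of any B-state in the
    mini-batch, penalize A once (-1) and reward B once (+1) per match.
    """
    if dist is None:
        # Assumed distance measure; the paper only requires |s_A - s_B| < delta.
        dist = lambda x, y: np.linalg.norm(x - y)

    rewards_a = rewards_a.copy()
    rewards_b = rewards_b.copy()
    n_changed = 0
    for i, s_a in enumerate(states_a):
        matched = False
        for j, s_b in enumerate(states_b):
            if dist(s_a, s_b) < delta:
                rewards_b[j] += 1.0      # B is rewarded for every nearby A-state
                matched = True
        if matched:                      # A is penalized at most once per transition
            rewards_a[i] -= 1.0
            n_changed += 1
    effect_ratio = n_changed / len(states_a)
    return rewards_a, rewards_b, effect_ratio
```

The returned `effect_ratio` tracks the fraction of agent A's rewards changed by CER in the mini-batch, which is one reasonable reading of the effect ratio analyzed in Section 4.3.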
Following training of both agents with the re-labelled rewards, we retain the policy $\pi_A$ for evaluation. Additional implementation details are provided in the appendix (Section D).

We focus on two variations of CER that satisfy the multi-agent self-play requirements: in the first, the policy $\pi_B$ receives its initial state from the task's initial state distribution; in the second, although it is restricted to re-settable environments, $\pi_B$ receives its initial state from a random off-policy sample of $\pi_A$. We refer to these methods as independent-CER and interact-CER, respectively, in the following sections. Importantly, CER re-labels rewards based on inter-agent behavior, whereas HER re-labels rewards based on each individual agent's behavior. As a result, the two methods can be easily combined. In fact, as our experiments demonstrate, CER and HER are complementary and likely address distinct challenges that are both amenable to reward re-labelling.

3.2 GOAL CONDITIONED MULTI-AGENT LEARNING

We extend multi-agent DDPG (MADDPG), proposed by Lowe et al. (2017), for training using CER. MADDPG learns a different policy per agent and a single, centralized critic that has access to the combined states, actions, and goals of all agents. More precisely, consider a game with $N$ agents with policies parameterized by $\theta = \{\theta_1, \ldots, \theta_N\}$, and let $\pi = \{\pi_1, \ldots, \pi_N\}$ be the set of all agent policies. Let $g = [g_1, \ldots, g_N]$ represent the concatenation of each agent's goal, $s = [s_1, \ldots, s_N]$ the concatenated states, and $a = [a_1, \ldots, a_N]$ the concatenated actions. With this notation, we can write the gradient of the expected return for agent $i$, $J(\theta_i) = \mathbb{E}[R_i]$, as:
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_\pi\left[\nabla_{\theta_i} \log \pi_i(a_i|s_i, g_i)\, Q^\pi_i(s, a, g)\right]. \qquad (5)$$
With deterministic policies $\mu = \{\mu_1, \ldots, \mu_N\}$, the gradient becomes:
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_\mu\left[\nabla_{\theta_i} \mu_i(s_i, g_i)\, \nabla_{a_i} Q^\mu_i(s, a, g)|_{a_i=\mu_i(s_i, g_i)}\right]. \qquad (6)$$
The centralized action-value function $Q^\mu_i$, which estimates the expected return for agent $i$, is updated as:
$$L(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[(Q^\mu_i(s, a, g) - y)^2\right], \quad y = r_i + \gamma\, Q^{\mu'}_i(s', a'_1, \ldots, a'_N, g)\big|_{a'_j = \mu'_j(s'_j)}, \qquad (7)$$
where $\mu' = \{\mu_{\theta'_1}, \ldots, \mu_{\theta'_N}\}$ is the set of target policies with delayed parameters $\theta'_i$. In practice, the target parameters are usually soft-updated as $\theta'_i \leftarrow \tau\theta_i + (1-\tau)\theta'_i$, where $\tau$ is a Polyak coefficient. During training, we collect paired rollouts as described above, apply any re-labelling strategies (such as CER or HER), and use the MADDPG algorithm to train both agent policies and the centralized critic, concatenating states, actions, and goals where appropriate. Putting everything together, we summarize the full method in Algorithm 1.

4 EXPERIMENT

Figure 1: Goal conditioned U shaped maze and S shaped maze. For each plot pair, the left illustrates a rendering of the maze type and the right shows the associated success rate throughout training for each method (PPO, DDPG, HER, HER+ind-CER, HER+int-CER, ind-CER, int-CER). Each line shows results averaged over 5 random initializations; the X-axis corresponds to epoch number and shaded regions show standard deviation. All metrics based on CER reference the performance of agent A.

4.1 COMBINING COMPETITIVE EXPERIENCE REPLAY WITH EXISTING METHODS

We start by asking whether CER improves performance and sample efficiency in a sparse reward task. To this end, we constructed two different mazes, an easier U shaped maze and a more difficult S shaped maze (Figure 1).
The goal of the ant agent is to reach the target, marked by a red sphere, whose location is randomly sampled for each new episode. At each step, the agent obtains a reward of 0 if the goal has been achieved and -1 otherwise. Additional details of the ant maze environments are found in Appendix B. An advantage of this environment is that it can be reset to any given state, facilitating comparison between our two proposed variants of CER (int-CER requires the environment to have this feature).

We compare agents trained with HER, both variants of CER, and both variants of CER combined with HER. Since each uses DDPG as a backbone algorithm, we also include results from a policy trained using DDPG alone. To confirm that any difficulties are not simply due to DDPG, we include results from a policy trained using Proximal Policy Optimization (PPO) (Schulman et al., 2017). The results for each maze are shown in Figure 1. DDPG and PPO each perform quite poorly by themselves, likely due to the sparsity of the reward in this task setup. Adding CER to DDPG partially overcomes this limitation in terms of final success rate and, notably, reaches this stronger result with many fewer examples. A similar result is seen when adding HER to DDPG. Importantly, adding both HER and CER offers the best improvement in final success rate without requiring any observable increase in the number of episodes, as compared to each of the other baselines. These results support our hypothesis that existing state-of-the-art methods do not sufficiently address the exploration challenge intrinsic to sparse reward environments. Furthermore, these results show that CER improves both the quality and efficiency of learning in such challenging settings, especially when combined with HER. They also show that int-CER tends to outperform ind-CER. As such, int-CER is preferable where its more restrictive technical requirements can be met.

To examine the efficacy of our method on a broader range of tasks, we evaluate the change in performance when ind-CER is added to HER on the challenging multi-goal sparse reward environments introduced in Plappert et al. (2018). (Note: we would prefer to examine int-CER but are prevented by technical issues related to the environment.) Results for each of the 12 tasks we trained on are illustrated in Figure 3. Our method, when used on top of HER, improves performance wherever it is not already saturated. This is especially true on harder tasks, where HER alone achieves only modest success (e.g., Hand Manipulate Egg Full and Hand Manipulate Pen Rotate). These results further support the conclusion that existing methods often fail to achieve sufficient exploration. Our method, which provides a targeted solution to this challenge, naturally complements HER.

Figure 2: Comparing CER and ICM with and without HER (HER, HER+CER, HER+ICM, CER, ICM) on Hand Reach, Hand Manipulate Pen Rotate, the U maze, and the S maze. Each line shows task success rate averaged over 5 random initializations. X-axis is epoch number; shaded regions denote standard deviation.

4.2 COMPARING CER WITH CURIOSITY-DRIVEN EXPLORATION

CER is designed to encourage exploration but, unlike other methods, uses the behavior of a competitor agent to automatically determine the criteria for exploratory reward.
Numerous methods have been proposed for improving exploration; for example, count-based exploration (Bellemare et al., 2016; Tang et al., 2017), VIME (Houthooft et al., 2016), bootstrapped DQN (Osband et al., 2016), goal exploration processes (Forestier et al., 2017), parameter space noise (Plappert et al., 2017), dynamics curiosity and boredom (Schmidhuber, 1991), and EX2 (Fu et al., 2017). One recent and popular method, the intrinsic curiosity module (ICM) (Pathak et al., 2017; Burda et al., 2018), augments the task reward with a curiosity-driven reward. Much like CER and HER, ICM provides a method to re-label the reward associated with each transition; this reward comes from the error of a jointly trained forward prediction model.

Figure 3: Evaluation of HER with ind-CER across different robotic control environments: Fetch Pick And Place, Fetch Push, Fetch Reach, Fetch Slide, Hand Reach, Hand Manipulate Block Full, Hand Manipulate Egg Full, Hand Manipulate Pen Rotate, Hand Manipulate Block Rotate Z, Hand Manipulate Block Rotate Parallel, Hand Manipulate Egg Rotate, and Hand Manipulate Block Rotate XYZ. Each line shows results averaged over 5 random initializations. X-axis shows epoch number; shaded regions denote standard deviation.

We compare how CER and ICM affect task performance and how they interact with HER across 4 tasks where we were able to implement ICM successfully. Figure 2 shows the results on several robotic control and maze tasks. We observe CER to consistently outperform ICM when each is implemented in isolation. In addition, we observe HER to benefit more from the addition of CER than of ICM. Compared to ICM, CER also tends to improve the speed of learning. These results suggest that the underlying problem is not a lack of diversity in the states being visited but rather the size of the state space that the agent must explore (as also discussed in Andrychowicz et al. (2017)). One interpretation is that, since the two CER agents are both learning the task, the dynamics of their competition encourage more task-relevant exploration. From this perspective, CER provides an automatic curriculum for exploration such that visited states are utilized more efficiently.

4.3 ANALYSIS OF CER

Figure 4 illustrates the success rates of agents A and B as well as the effect ratio, the fraction of mini-batch samples whose reward is changed by CER during training, calculated as $\phi = \frac{N}{M}$, where $N$ is the number of samples changed by CER and $M$ is the number of samples in the mini-batch. We observe a strong correspondence between the success rate and the effect ratio, likely reflecting the influence of CER on the learning dynamics. While a deeper analysis would be required to concretely understand the interplay of these two terms, we point out that CER re-labels a substantial fraction of rewards in the mini-batch. Interestingly, even the relatively small effect ratio observed in the first few epochs is enough to support rapid learning. We speculate that the sampling strategy used in int-CER provides a more targeted re-labelling, leading to the more rapid increase in success rate for agent A.
We observe that agent B reaches a lower level of performance.

Figure 4: Illustration of the automatically generated curriculum between agents A and B on the U shaped Ant Maze task. From left to right: success rate of ind-CER, effect ratio of ind-CER, success rate of int-CER, and effect ratio of int-CER. Each line shows results averaged over 5 random initializations. X-axis is epoch number.

Figure 5: State visitation frequency on the U maze for DDPG, HER, ICM, CER(A), and CER(B). Top row is from after 5 epochs; bottom row from after 35. Blue on the colormap denotes no visitation. The start and goal locations are labeled with the blue and green markers, respectively. ICM and CER are trained with HER also.

The lower performance of agent B likely results from resetting its parameters periodically during early training, which we observe to improve the ultimate performance of agent A (see Section D for details). It is also possible that the reward structure of CER asymmetrically benefits agent A with respect to the underlying task.

To gain additional insight into how each method influences the behaviors learned across training, we visualize the frequency with which each location in the U maze is visited after the 5th and 35th training epochs (Figure 5). Comparing DDPG and HER shows that the latter clearly helps move the state distribution towards the goal, especially during the early parts of training. When CER is added, states near the goal create a clear mode in the visitation distribution. This is never apparent with DDPG alone and only becomes obvious for HER and ICM+HER later in training. CER also appears to focus the visitation of the agent, as it less frequently gets caught along the outer walls. These disparities are emphasized in Figure 6, where we show the difference in the visitation profiles. The left two plots compare CER+HER vs. HER; the right two plots compare agent A vs. agent B from CER+HER. Interestingly, agents A and B exhibit fairly similar aggregate visitation profiles, with the exception that agent A reaches the goal more often later during training. These visitation profiles underscore both the quantitative and qualitative improvements associated with CER.

5 RELATED WORK

Self-play has a long history in this domain of research (Samuel, 1959; Tesauro, 1995; Heess et al., 2017). Silver et al. (2017) use self-play with deep reinforcement learning techniques to master the game of Go, and self-play has even been applied in Dota 5v5 (Open AI, 2018).

Figure 6: State visitation difference on the U maze, comparing CER(A) against HER at the 5th and 35th epochs (left two plots) and CER(A) against CER(B) at the 5th and 35th epochs (right two plots). Each plot illustrates the difference in visitation, with red indicating more visitation by CER(A). Note: here, the CER agent is trained with CER+HER.
Curriculum learning is widely used for training neural networks (see e.g., Bengio et al., 2009; Graves et al., 2017; Elman, 1993; Olsson, 1995). A general framework for automatic task selection is PowerPlay (Schmidhuber, 2013; Srivastava et al., 2013), which proposes an asymptotically optimal way of selecting tasks from a set of tasks via program search and uses a replay buffer to avoid catastrophic forgetting. Florensa et al. (2017) and Held et al. (2017) propose to automatically adjust task difficulty during training. Forestier et al. (2017) study intrinsically motivated goal exploration for robotic learning. Recently, Sukhbaatar et al. (2017) suggest using self-play between two agents, where each agent's reward depends on the time the other agent takes to complete the task, to enable an implicit curriculum. Our work is similar to theirs, but we propose to use sample-based competitive experience replay, which is not only more readily scalable to high-dimensional control but also integrates easily with Hindsight Experience Replay (Andrychowicz et al., 2017).

The method of initializing based on a state not sampled from the initial state distribution has been explored in other works. For example, Ivanovic et al. (2018) propose to create a backwards curriculum for continuous control tasks through learning a dynamics model. Resnick et al. (2018b) and Salimans & Chen (2018) propose to train policies on Pommerman (Resnick et al., 2018a) and the Atari game Montezuma's Revenge by starting each episode from a different point along a demonstration. Recently, Goyal et al. (2018) and Edwards et al. (2018) propose a learned backtracking model to generate traces that lead to high value states in order to obtain higher sample efficiency.

Experience replay was introduced by Lin (1992) and later became a crucial ingredient in learning to master Atari games (Mnih et al., 2013; 2015). Wang et al. (2016) propose truncation with bias correction to reduce the variance of using off-policy data from the buffer and achieve good performance on continuous and discrete tasks. Different approaches have been proposed to combine model-free learning with experience replay (Gu et al., 2016; Feng et al., 2019). Schaul et al. (2015) improve experience replay by assigning priorities to transitions in the buffer so as to utilize samples efficiently. Horgan et al. (2018) further improve experience replay by proposing a distributed RL system in which experiences are shared between parallel workers and accumulated into a central replay memory, with prioritized replay used to update the policy based on the diverse accumulated experiences.

6 CONCLUSION

We introduce Competitive Experience Replay, a new and general method for encouraging exploration through implicit curriculum learning in sparse reward settings. We demonstrate an empirical advantage of our technique when combined with existing methods in several challenging RL tasks. In future work, we aim to investigate richer ways to re-label rewards based on inter-agent samples to further harness multi-agent competition; it would also be interesting to investigate counterfactual inference to re-label off-policy samples more efficiently. We hope that this will facilitate the application of our method to more open-ended environments with even more challenging task structures. In addition, future work will explore integrating our method into approaches more closely related to model-based learning, where adequate exposure to the dynamics of the environment is often crucial.
REFERENCES

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048-5058, 2017.

Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation. arXiv:1808.00177, 2018a.

Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018b.

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471-1479, 2016.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.

Ashley D Edwards, Laura Downs, and James C Davidson. Forward-backward reinforcement learning. arXiv preprint arXiv:1803.10227, 2018.

Jeffrey L Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71-99, 1993.

Yihao Feng, Hao Liu, Jian Peng, and Qiang Liu. Shrinkage-based bias-variance trade-off for deep reinforcement learning, 2019. URL https://openreview.net/forum?id=S1z9ehAqYX.

Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300, 2017.

Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.

Justin Fu, John Co-Reyes, and Sergey Levine. EX2: Exploration with exemplar models for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2577-2587, 2017.

Anirudh Goyal, Philemon Brakel, William Fedus, Timothy Lillicrap, Sergey Levine, Hugo Larochelle, and Yoshua Bengio. Recall traces: Backtracking models for efficient reinforcement learning. arXiv preprint arXiv:1804.00379, 2018.

Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003, 2017.

Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829-2838, 2016.

Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
David Held, Xinyang Geng, Carlos Florensa, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366, 2017.

Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.

Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109-1117, 2016.

Boris Ivanovic, James Harrison, Apoorva Sharma, Mo Chen, and Marco Pavone. BaRC: Backward reachability curriculum for robotic reinforcement learning. arXiv preprint arXiv:1806.06161, 2018.

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421-436, 2018.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293-321, 1992.

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379-6390, 2017.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Andrew Y Ng, Adam Coates, Mark Diel, Varun Ganapathi, Jamie Schulte, Ben Tse, Eric Berger, and Eric Liang. Autonomous inverted helicopter flight via reinforcement learning. In Experimental Robotics IX, pp. 363-372. Springer, 2006.

Roland Olsson. Inductive functional programming using incremental program transformation. Artificial Intelligence, 74(1):55-81, 1995.

Open AI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026-4034, 2016.

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), 2017.

Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration.
arXiv preprint arXiv:1706.01905, 2017.

Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.

Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073, 2017.

Cinjon Resnick, Wes Eldridge, David Ha, Denny Britz, Jakob Foerster, Julian Togelius, Kyunghyun Cho, and Joan Bruna. Pommerman: A multi-agent playground. arXiv preprint arXiv:1809.07124, 2018a.

Cinjon Resnick, Roberta Raileanu, Sanyam Kapoor, Alex Peysakhovich, Kyunghyun Cho, and Joan Bruna. Backplay: "Man muss immer umkehren". arXiv preprint arXiv:1807.06919, 2018b.

Tim Salimans and Richard Chen. Learning Montezuma's Revenge from a single demonstration. https://blog.openai.com/learning-montezumas-revenge-from-a-single-demonstration/, 2018.

Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210-229, 1959.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222-227, 1991.

Jürgen Schmidhuber. PowerPlay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in Psychology, 4:313, 2013.

John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, 2015.

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations (ICLR), 2016.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. Advances in Neural Information Processing Systems, 2017.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.

Rupesh Kumar Srivastava, Bas R Steunebrink, and Jürgen Schmidhuber. First experiments with PowerPlay. Neural Networks, 41:130-136, 2013.

Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.
Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT Press, 1998.

Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2753-2762, 2017.

Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026-5033. IEEE, 2012.

Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.

A ALGORITHM

We summarize the algorithm for HER with CER in Algorithm 1.

Algorithm 1: HER with CER
  Initialize a random process $\mathcal{N}$ for action exploration and set the max episode length to L
  Set CER to ind-CER or int-CER
  for episode = 1 to M do
    Receive initial state $s_A$
    Receive a goal $g_A$ for this episode
    Initialize episode buffer buffer_A
    for t = 1 to L do
      Select action $a_A = \mu_{\theta_A}(s_A) + \mathcal{N}_t$ w.r.t. the current policy and exploration noise
      Execute action $a_A$ and observe reward $r_A$ and new state $s'_A$
      Store $(s_A, a_A, g_A, r_A, s'_A)$ in buffer_A
      $s_A \leftarrow s'_A$
    end for
    Receive initial state $s_B$ (ind-CER), or receive initial state $s_B$ sampled from buffer_A (int-CER)
    Receive a goal $g_B$ for this episode
    Initialize episode buffer buffer_B
    for t = 1 to L do
      Select action $a_B = \mu_{\theta_B}(s_B) + \mathcal{N}_t$ w.r.t. the current policy and exploration noise
      Execute action $a_B$ and observe reward $r_B$ and new state $s'_B$
      Store $(s_B, a_B, g_B, r_B, s'_B)$ in buffer_B
      $s_B \leftarrow s'_B$
    end for
    Concatenate buffer_A and buffer_B and store $(\{s_i\}_{i=1}^T, \{a_i\}_{i=1}^T, \{g_i\}_{i=1}^T, \{r_i\}_{i=1}^T, \{s'_i\}_{i=1}^T)$ in the replay buffer D
    // Optimization based on off-policy samples
    for k = 1 to K do
      // Re-labelling off-policy samples
      Sample a random minibatch of S samples $(s_j, a_j, g_j, r_j, s'_j)$ from D
      Apply the HER strategy to the samples
      Apply the ind-CER or int-CER strategy to the samples
      for agent i = A, B do
        Do one step of optimization based on Eq. (6) and Eq. (7), and update the target networks
      end for
    end for
  end for

B ENVIRONMENT DETAILS

Robotic control environments. The robotic control environments shown in Figure 7 are part of the challenging continuous robotics control suite (Plappert et al., 2018) integrated into OpenAI Gym (Brockman et al., 2016) and based on the MuJoCo simulation engine (Todorov et al., 2012). The Fetch environments are based on the 7-DoF Fetch robotic arm, which has a two-fingered parallel gripper. The Hand environments are based on the Shadow Dexterous Hand, which is an anthropomorphic robotic hand with 24 degrees of freedom. In all tasks, rewards are sparse and binary: the agent obtains a reward of 0 if the goal has been achieved and -1 otherwise. For more details please refer to Plappert et al. (2018).

Ant maze environments. We also construct two maze environments with the re-settable property, rendered in Figure 1, in which walls are placed to construct a U or S shaped maze.
At the beginning of each episode, the agent is always initialized at position (0, 0) and can move in any direction as long as it is not obstructed by a wall. The x- and y-axis locations of the target ball are sampled uniformly from [-5, 20] and [-5, 20], respectively.

Figure 7: Robotics control environments used in our experiments: Hand Manipulate Egg, Hand Manipulate Pen, Hand Reach, Hand Manipulate Block, Fetch Pick And Place, Fetch Reach, Fetch Push, and Fetch Slide.

C COMPETITIVE EXPERIENCE REPLAY WITH DIFFERENT ADVERSARIAL AGENTS

Figure 8: Performance on Fetch Slide (top) and Hand Manipulate Pen Full (bottom) for different batch-size multipliers when training using HER (left columns) or HER+CER (middle, right columns). Middle plots show the performance of agent A and, in the right plots, the performance of agent B is shown for reference. Legend entries denote the batch-size configurations, e.g. HER(B15) and HER + ind-CER(A15B25).

As Andrychowicz et al. (2017) and Andrychowicz et al. (2018a) observe (and we also observe), performance depends on the batch size. We leverage this observation to tune the relative strengths of agents A and B by separately manipulating the batch sizes used for updating each. For simplicity, we control the batch size by changing the number of MPI workers devoted to a particular update. Each MPI worker computes the gradients for a batch size of 256; averaging the gradients from all workers results in an effective batch size of N x 256. For our single-agent baselines, we choose N = 30 workers and, when using CER, a default of N = 15 for each of A and B. In the following, AxBy denotes, for agent A, N = x and, for agent B, N = y. These results suggest that, while a sufficiently large batch size is important for achieving the best performance, the optimal configuration occurs when the batch sizes used for the two agents are balanced. Interestingly, we observe that batch size imbalance adversely affects both agents trained with CER.

D IMPLEMENTATION DETAILS

In the code for HER (Andrychowicz et al., 2017), the authors use MPI to increase the batch size: MPI is used to run rollouts in parallel and to average gradients over all MPI workers. We found that MPI is crucial for good performance, since training for a longer time with a smaller batch size gives sub-optimal performance. This is consistent with the authors' finding in their code that a much larger batch size helps considerably. For each experiment, we provide the per-worker batch sizes in Table 1; note that the effective batch size is multiplied by the number of MPI workers N. In our implementation, we run N rollouts in parallel for each agent and keep a separate optimizer for each. We found that periodically resetting the parameters of agent B in the early stage of training helps agent A more consistently reach a high level of performance. This resetting helps to strike an optimal balance between the influences of HER and CER in training agent A. We also add L2 regularization, following the practice of Andrychowicz et al. (2017).
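As an illustration of the gradient averaging described above (N workers, each computing gradients for a per-worker batch of 256, giving an effective batch size of N x 256), the snippet below shows the standard mpi4py pattern. The variable names are ours and this is only a sketch of the general technique, not the authors' training code.

```python
# Sketch: averaging per-worker gradients so that N workers, each with a
# per-worker batch of 256, behave like one update with an effective batch
# size of N * 256. Assumes mpi4py is installed; `local_grad` stands for a
# flattened gradient vector computed by this worker (name is illustrative).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
num_workers = comm.Get_size()

local_grad = np.zeros(10, dtype=np.float64)   # placeholder gradient for this worker
# ... fill local_grad from this worker's batch of 256 transitions ...

global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)  # sum gradients across workers
global_grad /= num_workers                           # average over workers
```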
For the experiments with neural networks, all parameters are randomly initialized from $\mathcal{N}(0, 0.2)$. We use networks with three hidden layers of size 256 and Adam (Kingma & Ba, 2014) for optimization. Presented results are averaged over 5 random seeds. We summarize the hyperparameters in Table 1.

Hyperparameter name     | U Ant Maze | S Ant Maze | Fetch Control | Hand Control
Buffer size             | 1E5        | 1E6        | 1E6           | 1E6
Batch size              | 128        | 128        | 256           | 256
Max steps of episode    | 50         | 100        | 50            | 100
Reset epochs            | 2          | 2          | 5             | 5
Max reset epochs        | 10         | 10         | 20            | 30
Total epochs            | 50         | 100        | 100/50        | 200
Actor learning rate     | 0.0004     | 0.0004     | 0.001         | 0.001
Critic learning rate    | 0.0004     | 0.0004     | 0.001         | 0.001
Action L2 regularization| 0.01       | 0.01       | 1.00          | 1.00
Polyak                  | 0.95       | 0.95       | 0.95          | 0.95

Table 1: Hyperparameter values used in experiments.
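For reference, the Polyak row in Table 1 corresponds to the soft target-network update from Section 3.2. The minimal sketch below maps the 0.95 value to the weight kept on the old target parameters, which follows the common convention in HER-style implementations; this mapping is our assumption, not a statement about the authors' exact code.

```python
import numpy as np

def soft_update(target_params, main_params, polyak=0.95):
    """Soft (Polyak-averaged) target-network update, sketched with NumPy arrays.

    Assumes the convention target <- polyak * target + (1 - polyak) * main,
    i.e. the 0.95 in Table 1 is the weight kept on the old target parameters.
    """
    return [polyak * t + (1.0 - polyak) * m
            for t, m in zip(target_params, main_params)]

# Example: one critic weight matrix, nudged toward the main network.
target = [np.zeros((4, 4))]
main = [np.ones((4, 4))]
target = soft_update(target, main)   # each entry moves 5% toward the main weights
```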