Published as a conference paper at ICLR 2019

LEARNING SELF-IMITATING DIVERSE POLICIES

Tanmay Gangwani, Dept. of Computer Science, UIUC, gangwan2@uiuc.edu
Qiang Liu, Dept. of Computer Science, UT Austin, lqiang@cs.utexas.edu
Jian Peng, Dept. of Computer Science, UIUC, jianpeng@uiuc.edu

ABSTRACT

The success of popular algorithms for deep reinforcement learning, such as policy gradients and Q-learning, relies heavily on the availability of an informative reward signal at each timestep of the sequential decision-making process. When rewards are only sparsely available during an episode, or rewarding feedback is provided only after episode termination, these algorithms perform sub-optimally due to the difficulty in credit assignment. Alternatively, trajectory-based policy optimization methods, such as the cross-entropy method and evolution strategies, do not require per-timestep rewards, but have been found to suffer from high sample complexity by completely forgoing the temporal nature of the problem. Improving the efficiency of RL algorithms in real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we introduce a self-imitation learning algorithm that exploits and explores well in the sparse and episodic reward settings. We view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. We show that with the Jensen-Shannon divergence, this divergence minimization problem can be reduced to a policy-gradient algorithm with shaped rewards learned from experience replays. Experimental results indicate that our algorithm performs comparably to existing algorithms in environments with dense rewards, and significantly better in environments with sparse and episodic rewards. We then discuss limitations of self-imitation learning, and propose to solve them by using Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies. We demonstrate its effectiveness on a challenging variant of continuous-control MuJoCo locomotion tasks.

1 INTRODUCTION

Deep reinforcement learning (RL) has demonstrated significant applicability and superior performance in many problems outside the reach of traditional algorithms, such as computer and board games (Mnih et al., 2015; Silver et al., 2016), continuous control (Lillicrap et al., 2015), and robotics (Levine et al., 2016). Using deep neural networks as functional approximators, many classical RL algorithms have been shown to be very effective in solving sequential decision problems. For example, a policy that selects actions under certain state observations can be parameterized by a deep neural network that takes the current state observation as input and gives an action or a distribution over actions as output. Value functions that take both state observation and action as inputs and predict expected future reward can also be parameterized as neural networks. In order to optimize such neural networks, policy gradient methods (Mnih et al., 2016; Schulman et al., 2015; 2017a) and Q-learning algorithms (Mnih et al., 2015) capture the temporal structure of the sequential decision problem and decompose it into a supervised learning problem, guided by the immediate and discounted future reward from rollout data.
Unfortunately, when the reward signal becomes sparse or delayed, these RL algorithms may suffer from inferior performance and poor sample efficiency, mainly due to the scarcity of the immediate supervision when training happens in a single-timestep manner. This is known as the temporal credit assignment problem (Sutton, 1984). For instance, consider the Atari Montezuma's Revenge game: a reward is received after collecting certain items or arriving at the final destination in the lowest level, while no reward is received as the agent is trying to reach these goals. The sparsity of the reward makes the neural network training very inefficient and also poses challenges in exploration. It is not hard to see that many real-world problems tend to be of the form where rewards are either only sparsely available during an episode, or the rewards are episodic, meaning that a non-zero reward is only provided at the end of the trajectory or episode.

In addition to policy-gradient and Q-learning, alternative algorithms, such as those for global or stochastic optimization, have recently been studied for policy search. These algorithms do not decompose trajectories into individual timesteps, but instead apply zeroth-order finite-difference gradient or gradient-free methods to learn policies based on the cumulative rewards of the entire trajectory. Usually, trajectory samples are first generated by running the current policy, and then the distribution of policy parameters is updated according to the trajectory returns. The cross-entropy method (CEM, Rubinstein & Kroese (2016)) and evolution strategies (Salimans et al., 2017) are two representative examples. Although their sample efficiency is often not comparable to policy gradient methods when dense rewards are available from the environment, they are more widely applicable in the sparse or episodic reward settings, as they are agnostic to task horizon and only the trajectory-based cumulative reward is needed.

Our contribution is the introduction of a new algorithm based on policy gradients, with the objective of achieving better performance than existing RL algorithms in sparse and episodic reward settings. Using the equivalence between the policy function and its state-action visitation distribution, we formulate policy optimization as a divergence minimization problem between the current policy's visitation and the distribution induced by a set of experience replay trajectories with high returns. We show that with the Jensen-Shannon divergence ($D_{JS}$), this divergence minimization problem can be reduced to a policy-gradient algorithm with shaped, dense rewards learned from these experience replays. This algorithm can be seen as self-imitation learning, in which the expert trajectories in the experience replays are self-generated by the agent during the course of learning, rather than coming from external demonstrations. We combine the divergence minimization objective with the standard RL objective, and empirically show that the shaped, dense rewards significantly help in sparse and episodic settings by improving credit assignment. Following that, we qualitatively analyze the shortcomings of the self-imitation algorithm. Our second contribution is the application of Stein variational policy gradient (SVPG) with the Jensen-Shannon kernel to simultaneously learn multiple diverse policies.
We demonstrate the benefits of this addition to the self-imitation framework by considering difficult exploration tasks with sparse and deceptive rewards.

Related Works. Divergence minimization has been used in various policy learning algorithms. Relative Entropy Policy Search (REPS) (Peters et al., 2010) restricts the loss of information between policy updates by constraining the KL-divergence between the state-action distributions of the old and new policy. Policy search can also be formulated as an EM problem, leading to several interesting algorithms such as RWR (Peters & Schaal, 2007) and PoWER (Kober & Peters, 2009). Here the M-step minimizes a KL-divergence between trajectory distributions, leading to an update rule which resembles return-weighted imitation learning. Please refer to Deisenroth et al. (2013) for a comprehensive exposition. MATL (Wulfmeier et al., 2017) uses adversarial training to bring the state occupancies of a real and a simulated agent close to each other for efficient transfer learning. In Guided Policy Search (GPS, Levine & Koltun (2013)), a parameterized policy is trained by constraining the divergence between the current policy and a controller learned via trajectory optimization.

Learning from Demonstrations (LfD). The objective in LfD, or imitation learning, is to train a control policy to produce a trajectory distribution similar to the demonstrator's. Approaches for self-driving cars (Bojarski et al., 2016) and drone manipulation (Ross et al., 2013) have used human-expert data, along with the Behavioral Cloning algorithm, to learn good control policies. Deep Q-learning has been combined with human demonstrations to achieve performance gains in Atari (Hester et al., 2017) and robotics tasks (Večerík et al., 2017; Nair et al., 2017). Human data has also been used in the maximum entropy IRL framework to learn cost functions under which the demonstrations are optimal (Finn et al., 2016). Ho & Ermon (2016) use the same framework to derive an imitation-learning algorithm (GAIL) which is motivated by minimizing the divergence between the agent's rollouts and external expert demonstrations. Besides humans, other sources of expert supervision include planning-based approaches such as iLQR (Levine et al., 2016) and MCTS (Silver et al., 2016). Our algorithm departs from prior work in forgoing external supervision, and instead uses the past experiences of the learner itself as demonstration data.

Exploration and Diversity in RL. Count-based exploration methods utilize state-action visitation counts N(s, a), and award a bonus to rarely visited states (Strehl & Littman, 2008). In large state-spaces, approximation techniques (Tang et al., 2017) and estimation of pseudo-counts by learning density models (Bellemare et al., 2016; Fu et al., 2017) have been studied. Intrinsic motivation has been shown to aid exploration, for instance by using information gain (Houthooft et al., 2016) or prediction error (Stadie et al., 2015) as a bonus. Hindsight Experience Replay (Andrychowicz et al., 2017) adds additional goals (and corresponding rewards) to a Q-learning algorithm. We also obtain additional rewards, but from a discriminator trained on past agent experiences, to accelerate a policy-gradient algorithm. Prior work has looked at training a diverse ensemble of agents with good exploratory skills (Liu et al., 2017; Conti et al., 2017; Florensa et al., 2017).
To enjoy the benefits of diversity, we incorporate a modification of SVPG (Liu et al., 2017) in our final algorithm. In very recent work, Oh et al. (2018) propose exploiting past good trajectories to drive exploration. Their algorithm buffers (s, a) and the corresponding return for each transition in rolled-out trajectories, and reuses them for training if the stored return value is higher than the current state-value estimate. Our approach presents a different objective for self-imitation based on divergence minimization. With this view, we learn shaped, dense rewards which are then used for policy optimization. We further improve the algorithm with SVPG. Reusing high-reward trajectories has also been explored for program synthesis and semantic parsing tasks (Liang et al., 2016; 2018; Abolafia et al., 2018).

2 MAIN METHODS

We start with a brief introduction to RL in Section 2.1, and then introduce our main self-imitation algorithm in Section 2.2. Section 2.3 further extends our main method to learn multiple diverse policies using Stein variational policy gradient with the Jensen-Shannon kernel.

2.1 REINFORCEMENT LEARNING BACKGROUND

A typical RL setting involves an environment modeled as a Markov Decision Process with an unknown system dynamics model $p(s_{t+1}|s_t, a_t)$ and an initial state distribution $p_0(s_0)$. An agent interacts sequentially with the environment in discrete time-steps using a policy $\pi$ which maps an observation $s_t \in \mathcal{S}$ to either a single action $a_t$ (deterministic policy), or a distribution over the action space $\mathcal{A}$ (stochastic policy). We consider the scenario of stochastic policies over high-dimensional, continuous state and action spaces. The agent receives a per-step reward $r_t(s_t, a_t) \in \mathbb{R}$, and the RL objective involves maximization of the expected discounted sum of rewards, $\eta(\pi_\theta) = \mathbb{E}_{p_0, p, \pi}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$, where $\gamma \in (0, 1]$ is the discount factor. The action-value function is $Q_\pi(s_t, a_t) = \mathbb{E}_{p_0, p, \pi}\big[\sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'})\big]$. We define the unnormalized $\gamma$-discounted state-visitation distribution for a policy $\pi$ by $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s|\pi)$, where $P(s_t = s|\pi)$ is the probability of being in state $s$ at time $t$ when following policy $\pi$ and starting from $s_0 \sim p_0$. The expected policy return $\eta(\pi_\theta)$ can then be written as $\mathbb{E}_{\rho_\pi(s,a)}[r(s, a)]$, where $\rho_\pi(s, a) = \rho_\pi(s)\,\pi(a|s)$ is the state-action visitation distribution. Using the policy gradient theorem (Sutton et al., 2000), we obtain the direction of ascent $\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{\rho_\pi(s,a)}\big[\nabla_\theta \log \pi_\theta(a|s)\, Q_\pi(s, a)\big]$.

2.2 POLICY OPTIMIZATION AS DIVERGENCE MINIMIZATION WITH SELF-IMITATION

Although the policy $\pi(a|s)$ is given as a conditional distribution, its behavior is better characterized by the corresponding state-action visitation distribution $\rho_\pi(s, a)$, which wraps the MDP dynamics and fully determines the expected return via $\eta(\pi) = \mathbb{E}_{\rho_\pi}[r(s, a)]$. Therefore, distance metrics on a policy $\pi$ should be defined with respect to the visitation distribution $\rho_\pi$, and policy search should be viewed as finding policies with good visitation distributions $\rho_\pi$ that yield high reward. Suppose we have access to a good policy $\pi'$; then it is natural to consider finding a $\pi$ such that its visitation distribution $\rho_\pi$ matches $\rho_{\pi'}$. To do so, we can define a divergence measure $D(\rho_\pi, \rho_{\pi'})$ that captures the similarity between the two distributions, and minimize this divergence for policy improvement. Assume there exists an expert policy $\pi_E$, such that policy optimization can be framed as minimizing the divergence $\min_\pi D(\rho_\pi, \rho_{\pi_E})$, that is, finding a policy $\pi$ that imitates $\pi_E$.
In practice, however, we do not have access to any real guiding expert policy. Instead, we can maintain a selected subset $M_E$ of highly-rewarded trajectories from the previous rollouts of policy $\pi$, and optimize the policy $\pi$ to minimize the divergence between $\rho_\pi$ and the empirical state-action pair distribution $\{(s_i, a_i)\}_{M_E}$:

$$\min_\pi \; D\big(\rho_\pi, \{(s_i, a_i)\}_{M_E}\big). \quad (1)$$

Since it is not always possible to explicitly formulate $\rho_\pi$ even with the exact functional form of $\pi$, we generate rollouts from $\pi$ in the environment and obtain an empirical distribution of $\rho_\pi$. To measure the divergence between two empirical distributions, we use the Jensen-Shannon divergence, with the following variational form (up to a constant shift) as exploited in GANs (Goodfellow et al., 2014):

$$D_{JS}(\rho_\pi, \rho_{\pi_E}) = \max_{d,\, d_E} \; \hat{\mathbb{E}}_{\rho_\pi}\Big[\log \frac{d(s,a)}{d(s,a) + d_E(s,a)}\Big] + \hat{\mathbb{E}}_{\rho_{\pi_E}}\Big[\log \frac{d_E(s,a)}{d(s,a) + d_E(s,a)}\Big], \quad (2)$$

where $d(s,a)$ and $d_E(s,a)$ are empirical density estimators of $\rho_\pi$ and $\rho_{\pi_E}$, respectively. Under certain assumptions, we can obtain an approximate gradient of $D_{JS}$ w.r.t. the policy parameters, thus enabling us to optimize the policy.

Gradient Approximation: Let $\rho_\pi(s,a)$ and $\rho_{\pi_E}(s,a)$ be the state-action visitation distributions induced by two policies $\pi$ and $\pi_E$, respectively. Let $d_\pi$ and $d_{\pi_E}$ be the surrogates to $\rho_\pi$ and $\rho_{\pi_E}$, respectively, obtained by solving Equation 2. Then, if the policy $\pi$ is parameterized by $\theta$, the gradient of $D_{JS}(\rho_\pi, \rho_{\pi_E})$ with respect to the policy parameters $\theta$ can be approximated as:

$$\nabla_\theta D_{JS}(\rho_\pi, \rho_{\pi_E}) \approx \hat{\mathbb{E}}_{\rho_\pi(s,a)}\Big[\nabla_\theta \log \pi_\theta(a|s)\, \tilde{Q}_\pi(s,a)\Big], \;\; \text{where}\;\; \tilde{Q}_\pi(s_t, a_t) = \hat{\mathbb{E}}_{\rho_\pi(s,a)}\Big[\sum_{t'=t}^{\infty} \gamma^{t'-t} \log \frac{d_\pi(s_{t'}, a_{t'})}{d_\pi(s_{t'}, a_{t'}) + d_{\pi_E}(s_{t'}, a_{t'})}\Big]. \quad (3)$$

The derivation of the approximation and the underlying assumptions are in Appendix 5.1. Next, we introduce a simple and inexpensive approach to construct the replay memory $M_E$ using high-return past experiences during training. In this way, $\rho_{\pi_E}$ can be seen as a mixture of deterministic policies, each representing a delta point-mass distribution in the trajectory space, or a finite discrete visitation distribution of state-action pairs. At each iteration, we apply the current policy $\pi_\theta$ to sample $b$ trajectories $\{\tau\}_1^b$. We hope to include in $M_E$ the top-$k$ trajectories (or trajectories with returns above a threshold) generated thus far during the training process. For this, we use a priority-queue list for $M_E$ which keeps the trajectories sorted according to the total trajectory reward. The reward for each newly sampled trajectory in $\{\tau\}_1^b$ is compared with the current threshold of the priority-queue, and $M_E$ is updated accordingly. The frequency of updates is impacted by the exploration capabilities of the agent and the stochasticity in the environment. We find that simply sampling noisy actions from Gaussian policies is sufficient for several locomotion tasks (Section 3). To handle more challenging environments, in the next sub-section, we augment our policy optimization procedure to explicitly enhance exploration and produce an ensemble of diverse policies.
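The priority-queue bookkeeping for $M_E$ admits a very small implementation. The sketch below is a minimal illustration under stated assumptions (the class and method names are ours, not from a released implementation): a min-heap keeps the $k$ highest-return trajectories seen so far, and its root serves as the admission threshold for newly sampled rollouts.

```python
import heapq
import itertools

class TopKReplayMemory:
    """Keeps the top-k trajectories seen so far, ranked by total trajectory reward.

    A min-heap is used so that the lowest-return stored trajectory sits at the
    root and acts as the admission threshold for new candidates.
    """

    def __init__(self, capacity=10):
        self.capacity = capacity
        self._heap = []                    # entries: (return, tie_breaker, trajectory)
        self._counter = itertools.count()  # tie-breaker so heapq never compares trajectories

    def add(self, trajectory, traj_return):
        entry = (traj_return, next(self._counter), trajectory)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif traj_return > self._heap[0][0]:        # beats the current threshold
            heapq.heapreplace(self._heap, entry)    # evict the lowest-return trajectory

    def threshold(self):
        return self._heap[0][0] if self._heap else float("-inf")

    def sample_transitions(self):
        """Flatten stored trajectories into (s, a) pairs for discriminator training."""
        return [(s, a) for _, _, traj in self._heap for (s, a) in traj]
```

After every batch of rollouts, add() is called once per new trajectory, so $M_E$ changes only when a trajectory beats the current threshold; sample_transitions() then supplies the "expert" state-action pairs for the $D_{JS}$ optimization.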
In the usual imitation learning framework, expert demonstrations of trajectories from external sources are available as the empirical distribution $\rho_{\pi_E}$ of an expert policy $\pi_E$. In our approach, since the agent learns by treating its own good past experiences as the expert, we can view the algorithm as self-imitation learning from experience replay.

As noted in Equation 3, the gradient estimator of $D_{JS}$ has a form similar to policy gradients, but with the true reward function replaced by the per-timestep reward $\log\big(d_\pi(s,a)/(d_\pi(s,a) + d_{\pi_E}(s,a))\big)$. Therefore, it is possible to interpolate the gradient of $D_{JS}$ and the standard policy gradient; we highlight the benefit of this interpolation below. The net gradient on the policy parameters is:

$$\nabla_\theta \eta(\pi_\theta) = (1-\nu)\, \mathbb{E}_{\rho_\pi(s,a)}\big[\nabla_\theta \log \pi_\theta(a|s)\, Q_r(s,a)\big] \;-\; \nu\, \nabla_\theta D_{JS}(\rho_\pi, \rho_{\pi_E}), \quad (4)$$

where $Q_r$ is the Q-function with true rewards, and $\pi_E$ is the mixture policy represented by the samples in $M_E$. Let $r_\phi(s,a) = d_\pi(s,a)/[d_\pi(s,a) + d_{\pi_E}(s,a)]$. $r_\phi(s,a)$ can be computed using parameterized networks for the densities $d_\pi$ and $d_{\pi_E}$, which are trained by solving the $D_{JS}$ optimization (Eq. 2) using the current policy rollouts and $M_E$; here $\phi$ includes the parameters of $d_\pi$ and $d_{\pi_E}$. Using Equation 3, the interpolated gradient can be further simplified to:

$$\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{\rho_\pi(s,a)}\Big[\nabla_\theta \log \pi_\theta(a|s)\, \big((1-\nu)\, Q_r(s,a) + \nu\, Q_{r_\phi}(s,a)\big)\Big], \quad (5)$$

where $Q_{r_\phi}(s_t, a_t) = \mathbb{E}_{p_0,p,\pi}\big[\sum_{t'=t}^{\infty} \gamma^{t'-t}\, (-\log r_\phi(s_{t'}, a_{t'}))\big]$ is the Q-function calculated using $-\log r_\phi(s,a)$ as the reward. This reward is high in the regions of the $\mathcal{S} \times \mathcal{A}$ space frequented more by the expert than the learner, and low in regions visited more by the learner than the expert. The effective Q in Equation 5 is therefore an interpolation between $Q_r$, obtained with true environment rewards, and $Q_{r_\phi}$, obtained with rewards which are implicitly shaped to guide the learner towards expert behavior. In environments with sparse or deceptive rewards, where the signal from $Q_r$ is weak or sub-optimal, a higher weight on $Q_{r_\phi}$ enables successful learning by imitation. We show this empirically in our experiments. We further find that even in cases with dense environment rewards, the two gradient components can be successfully combined for policy optimization. The complete algorithm for self-imitation is outlined in Appendix 5.2 (Algorithm 1).
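To make the interpolation in Equation 5 concrete, the sketch below shows one way the shaped reward could be produced by a discriminator trained with the log-loss of Equation 2; the PyTorch framing and all names (SelfImitationDiscriminator, shaped_reward, interpolated_reward) are illustrative assumptions rather than the exact implementation. Since the discounted sum is linear in the reward, interpolating the per-step rewards with weight $\nu$ yields the same effective Q as Equation 5.

```python
import torch
import torch.nn as nn

class SelfImitationDiscriminator(nn.Module):
    """Binary classifier over (s, a) pairs. If policy rollouts are labeled 1 and
    replay-memory samples 0, the sigmoid of its output estimates
    r_phi(s,a) = d_pi(s,a) / (d_pi(s,a) + d_piE(s,a))."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))  # logits

    def shaped_reward(self, obs, act):
        # -log r_phi(s,a): large where the replay ("expert") density dominates the
        # current policy's density, small where the policy over-visits.
        with torch.no_grad():
            r_phi = torch.sigmoid(self.forward(obs, act)).clamp(1e-8, 1.0)
            return -torch.log(r_phi)

def interpolated_reward(env_reward, disc, obs, act, nu=0.8):
    """Per-timestep reward whose discounted sum gives the effective Q of Eq. 5."""
    return (1.0 - nu) * env_reward + nu * disc.shaped_reward(obs, act).squeeze(-1)
```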
Limitations of self-imitation. We now elucidate some shortcomings of the self-imitation approach. Since the replay memory $M_E$ is only constructed from the past training rollouts, the quality of the trajectories in $M_E$ hinges on good exploration by the agent. Consider a maze environment where the robot is only rewarded when it arrives at a goal G placed in a far-off corner. Unless the robot reaches G at least once, the trajectories in $M_E$ always have a total reward of zero, and the learning signal from $Q_{r_\phi}$ is not useful. Secondly, self-imitation can lead to sub-optimal policies when there are local minima in the policy optimization landscape; for example, assume the maze has a second goal G' in the opposite direction of G, but with a much smaller reward. With simple exploration, the agent may fill $M_E$ with below-par trajectories leading to G', and the reinforcement from $Q_{r_\phi}$ would drive it further towards G'. Thirdly, stochasticity in the environment may make it difficult to recover the optimal policy just by imitating the past top-$k$ rollouts. For instance, in a 2-armed bandit problem with reward distributions Bernoulli(p) and Bernoulli(p+ε), rollouts from both arms get conflated in $M_E$ during training with high probability, making it hard to imitate the action of picking the arm with the higher expected reward. We propose to overcome these pitfalls by training an ensemble of self-imitating agents, which are explicitly encouraged to visit different, non-overlapping regions of the state-space. This helps to discover useful rewards in sparse settings, avoids deceptive reward traps, and, in environments with reward-stochasticity like the 2-armed bandit, increases the probability of the optimal policy being present in the final trained ensemble. We detail the enhancements next.

2.3 IMPROVING EXPLORATION WITH STEIN VARIATIONAL GRADIENT

One approach to achieve better exploration in challenging cases like the above is to simultaneously learn multiple diverse policies and enforce them to explore different parts of the high-dimensional space. This can be achieved based on the recent work by Liu et al. (2017) on Stein variational policy gradient (SVPG). The idea of SVPG is to find an optimal distribution $q(\theta)$ over the policy parameters $\theta$ which maximizes the expected policy returns, along with an entropy regularization that enforces diversity in the parameter space, i.e.,

$$\max_q \; \mathbb{E}_{\theta \sim q}[\eta(\theta)] + \alpha H(q).$$

Without a parametric assumption on $q$, this is a challenging functional optimization problem. Stein variational gradient descent (SVGD, Liu & Wang (2016)) provides an efficient solution, by approximating $q$ with a delta measure $q = \frac{1}{n}\sum_{i=1}^n \delta_{\theta_i}$, where $\{\theta_i\}_{i=1}^n$ is an ensemble of policies, and iteratively updating $\{\theta_i\}$ with $\theta_i \leftarrow \theta_i + \epsilon\, \Delta\theta_i$, where

$$\Delta\theta_i = \frac{1}{n}\sum_{j=1}^{n}\Big[\nabla_{\theta_j}\eta(\pi_{\theta_j})\, k(\theta_j, \theta_i) + \alpha\, \nabla_{\theta_j} k(\theta_j, \theta_i)\Big], \quad (6)$$

where $k(\theta_j, \theta_i)$ is a positive definite kernel function. The first term in $\Delta\theta_i$ moves the policy to regions with high expected return (exploitation), while the second term creates a repulsion pressure between policies in the ensemble and encourages diversity (exploration). The choice of kernel is critical. Liu et al. (2017) used a simple Gaussian RBF kernel $k(\theta_j, \theta_i) = \exp(-\|\theta_j - \theta_i\|_2^2 / h)$, with the bandwidth $h$ dynamically adapted. This, however, assumes a flat Euclidean distance between $\theta_j$ and $\theta_i$, ignoring the structure of the entities they define, which are probability distributions. A statistical distance, such as $D_{JS}$, serves as a better metric for comparing policies (Amari, 1998; Kakade, 2002). Motivated by this, we propose to improve SVPG using the JS kernel $k(\theta_j, \theta_i) = \exp\big(-D_{JS}(\rho_{\pi_{\theta_j}}, \rho_{\pi_{\theta_i}})/T\big)$, where $\rho_{\pi_\theta}(s, a)$ is the state-action visitation distribution obtained by running policy $\pi_\theta$, and $T$ is the temperature. The second, exploration-promoting term in SVPG involves the gradient of the kernel w.r.t. the policy parameters. With the JS kernel, this requires estimating the gradient of $D_{JS}$, which, as shown in Equation 3, can be obtained using policy gradients with an appropriately trained reward function.
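The ensemble update of Equation 6 can be assembled once the pairwise $D_{JS}$ values and their policy-gradient estimates (Equation 3) are available. The sketch below is a minimal illustration under that assumption; the function and argument names are ours and not from a released implementation.

```python
import numpy as np

def svpg_update_directions(policy_grads, js_matrix, js_grads, alpha, temperature):
    """Stein variational policy gradient step (Equation 6) with the JS kernel
    k(theta_j, theta_i) = exp(-D_JS(rho_j, rho_i) / T).

    policy_grads : list of n flat gradient vectors d eta / d theta_j (here, the
                   interpolated self-imitation gradients of Equation 5).
    js_matrix    : (n, n) array with js_matrix[j, i] ~ D_JS(rho_j, rho_i).
    js_grads     : js_grads[j][i] ~ gradient of D_JS(rho_j, rho_i) w.r.t. theta_j,
                   estimated with policy gradients as in Equation 3.
    Returns a list of n update directions, one per ensemble member.
    """
    n = len(policy_grads)
    kernel = np.exp(-js_matrix / temperature)            # k(theta_j, theta_i)
    directions = []
    for i in range(n):
        delta = np.zeros_like(policy_grads[i])
        for j in range(n):
            # Exploitation: pull theta_i toward high-return regions of theta_j.
            delta += kernel[j, i] * policy_grads[j]
            # Exploration: grad of the kernel w.r.t. theta_j is
            # -(1/T) * k(j, i) * grad_{theta_j} D_JS(rho_j, rho_i).
            delta += alpha * (-kernel[j, i] / temperature) * js_grads[j][i]
        directions.append(delta / n)
    return directions
```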
Figure 1: Learning curves for PPO and Self-Imitation on tasks with episodic rewards. Mean and standard deviation over 5 random seeds are plotted.

Our full algorithm is summarized in Appendix 5.3 (Algorithm 2). In each iteration, we apply the SVPG gradient to each of the policies, where $\nabla_\theta \eta(\pi_\theta)$ in Equation 6 is the interpolated gradient from self-imitation (Equation 5). We also utilize state-value function networks as baselines to reduce the variance in the sampled policy gradients.

3 EXPERIMENTS

Our goal in this section is to answer the following questions: 1) How does self-imitation fare against standard policy gradients under various reward distributions from the environment, namely episodic, noisy and dense? 2) How far does the SVPG exploration go in overcoming the limitations of self-imitation, such as susceptibility to local minima?

We benchmark high-dimensional, continuous-control locomotion tasks based on the MuJoCo physics simulator, by extending the OpenAI Baselines (Dhariwal et al., 2017) framework. Our control policies ($\pi_{\theta_i}$) are modeled as unimodal Gaussians. All feed-forward networks have two layers of 64 hidden units each, with tanh non-linearity. For the policy gradient, we use the clipped-surrogate based PPO algorithm (Schulman et al., 2017b). Further implementation details are in the Appendix.

Table 1: Performance of PPO and Self-Imitation (SI) on tasks with episodic rewards, noisy rewards with masking probability pm, and dense rewards. All runs use 5M timesteps of interaction with the environment. ES performance at 5M timesteps is taken from Salimans et al. (2017). A missing entry denotes that we were unable to obtain the 5M-timestep performance from the paper.

| Task | Episodic: SI (ν=0.8) | Episodic: PPO (ν=0) | Episodic: CEM | Episodic: ES | Noisy (pm=0.9): SI | Noisy (pm=0.9): PPO | Noisy (pm=0.5): SI | Noisy (pm=0.5): PPO | Dense: SI | Dense: PPO |
|---|---|---|---|---|---|---|---|---|---|---|
| Walker | 2996 | 252 | 205 | 1200 | 2276 | 2047 | 3049 | 3364 | 3263 | 3401 |
| Humanoid | 3602 | 532 | 426 | - | 4136 | 1159 | 4296 | 3145 | 3339 | 4149 |
| H-Standup (×10^4) | 18.1 | 4.4 | 9.6 | - | 14.3 | 11.4 | 16.3 | 9.8 | 17.2 | 10 |
| Hopper | 2618 | 354 | 97 | 1900 | 2381 | 2264 | 2137 | 2132 | 2700 | 2252 |
| Swimmer | 173 | 21 | 17 | - | 52 | 37 | 127 | 56 | 106 | 68 |
| Inv. Double Pendulum | 8668 | 344 | 86 | 9000 | 8744 | 8826 | 8926 | 8968 | 8989 | 8694 |

3.1 SELF-IMITATION WITH DIFFERENT REWARD DISTRIBUTIONS

We evaluate the performance of self-imitation with a single agent in this sub-section; the combination with SVPG exploration for multiple agents is discussed in the next. We consider the locomotion tasks in OpenAI Gym under 3 separate reward distributions. Dense refers to the default reward function in Gym, which provides a reward for each simulation timestep. In the episodic reward setting, rather than providing $r(s_t, a_t)$ at each timestep of an episode, we provide $\sum_t r(s_t, a_t)$ at the last timestep of the episode, and zero reward at other timesteps. This is the case for many practical settings where the reward function is hard to design, but scoring each trajectory, possibly by a human (Christiano et al., 2017), is feasible. In the noisy reward setting, we probabilistically mask out each per-timestep reward $r(s_t, a_t)$ in an episode. Reward masking is done independently for every new episode, and therefore the agent receives non-zero feedback at different, albeit only few, timesteps in different episodes. The probability of masking out or suppressing the rewards is denoted by $p_m$.
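The episodic and noisy settings can be reproduced with thin environment wrappers. The sketch below is one plausible implementation under the classic 4-tuple gym step API; the wrapper names are ours, and the noisy wrapper reads the per-episode masking as an independent Bernoulli draw at every timestep, so the surviving reward timesteps differ across episodes.

```python
import gym
import numpy as np

class EpisodicRewardWrapper(gym.Wrapper):
    """Episodic setting: hold back all rewards and pay out their sum at the last
    timestep of the episode, with zero reward at every other step."""

    def __init__(self, env):
        super().__init__(env)
        self._accumulated = 0.0

    def reset(self, **kwargs):
        self._accumulated = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._accumulated += reward
        return obs, (self._accumulated if done else 0.0), done, info

class NoisyRewardWrapper(gym.Wrapper):
    """Noisy setting: each per-timestep reward is suppressed (zeroed) independently
    with probability p_mask, re-drawn at every step of every episode."""

    def __init__(self, env, p_mask=0.9):
        super().__init__(env)
        self.p_mask = p_mask

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if np.random.rand() < self.p_mask:
            reward = 0.0
        return obs, reward, done, info
```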
In Figure 1, we plot the learning curves on three tasks with episodic rewards. Recall that ν is the hyper-parameter controlling the weight distribution between gradients with environment rewards and gradients with the shaped reward from $r_\phi$ (Equation 5). The baseline PPO agents use ν = 0, meaning that the entire learning signal comes from the environment. We compare them with self-imitating (SI) agents using a constant value ν = 0.8. The capacity of $M_E$ is fixed at 10 trajectories. We did not observe our method to be particularly sensitive to the choice of ν and the capacity value; for instance, ν = 1 works equally well. Further ablations on these two hyper-parameters can be found in the Appendix.

In Figure 1, we see that the PPO agents are unable to make any tangible progress on these tasks with episodic rewards, possibly due to difficulty in credit assignment: the lumped rewards at the end of the episode cannot be properly attributed to the individual state-action pairs during the episode. In the case of Self-Imitation, the algorithm has access to shaped rewards for each timestep, derived from the high-return trajectories in $M_E$. This makes credit assignment easier, leading to successful learning even for very high-dimensional control tasks such as Humanoid.

Table 1 summarizes the final performance, averaged over 5 runs with random seeds, under the various reward settings. For the noisy rewards, we compare performance with two different reward-masking values: suppressing each reward $r(s_t, a_t)$ with 90% probability ($p_m = 0.9$), and with 50% probability ($p_m = 0.5$). The density of rewards increases across the reward settings from left to right in Table 1. We find that SI agents (ν = 0.8) achieve a higher average score than the baseline PPO agents (ν = 0) in the majority of the tasks for all the settings. This indicates that not only does self-imitation vastly help when the environment rewards are scant, it can also readily be incorporated with the standard policy gradients via interpolation, for successful learning across reward settings. For completeness, we include the performance of CEM and ES, since these algorithms depend only on the total trajectory rewards and do not exploit the temporal structure. CEM performs poorly in most of the cases. ES, while able to solve the tasks, is sample-inefficient. We include the ES performance from Salimans et al. (2017) after 5M timesteps of training for a fair comparison with our algorithm.

3.2 CHARACTERIZING ENSEMBLE OF DIVERSE SELF-IMITATING POLICIES

We now conduct experiments to show how self-imitation can lead to sub-optimal policies in certain cases, and how the SVPG objective, which trains an ensemble with an explicit $D_{JS}$ repulsion between policies, can improve performance.

Figure 2: SI-independent and SI-interact-JS agents on the Maze environment. (a) SI-independent state-density; (b) SI-interact-JS state-density; (c) SI-independent kernel matrix; (d) SI-interact-JS kernel matrix.

2D-Navigation. Consider a simple Maze environment where the start location of the agent (blue particle) is shown in the figure on the right, along with two regions: the red region is closer to the agent's starting location but has a per-timestep reward of only 1 point if the agent hovers over it; the green region is on the other side of the wall but has a per-timestep reward of 10 points. We run 8 independent, non-interacting, self-imitating (with ν = 0.8) agents on this task. This ensemble is denoted as SI-independent. Figure 2a plots the state-visitation density for SI-independent after training, from which it is evident that the agents get trapped in a local minimum. The red region is relatively easy to explore, and trajectories leading to it fill $M_E$, causing sub-optimal imitation. We contrast this with an instantiation of our full algorithm, which is referred to as SI-interact-JS. It is composed of 8 self-imitating agents which share information for gradient calculation with the SVPG objective (Equation 6). The temperature T = 0.5 is held constant, and the weight on the exploration-facilitating repulsion term (α) is linearly decayed over time. Figure 2b depicts the state-visitation density for this ensemble.

Figure 3: Learning curves for various ensembles on sparse locomotion tasks. Mean and standard deviation over 3 random seeds are plotted.
SI-interact-JS explores wider portions of the maze, with multiple agents reaching the green zone of high reward. Figures 2c and 2d show the kernel matrices for the two ensembles after training. Cell (i, j) in the matrix corresponds to the kernel value $k(\theta_i, \theta_j) = \exp(-D_{JS}(\rho_i, \rho_j)/T)$. For SI-independent, many darker cells indicate that the policies are close to each other (low JS). For SI-interact-JS, which explicitly tries to decrease $k(\theta_i, \theta_j)$, the cells are noticeably lighter, indicating dissimilar policies (high JS). The behavior of PPO-independent (ν = 0) is similar to SI-independent (ν = 0.8) for the Maze task.

Locomotion. To explore the limitations of self-imitation in harder exploration problems in high-dimensional, continuous state-action spaces, we modify 3 MuJoCo tasks as follows: Sparse HalfCheetah, Sparse Hopper and Sparse Ant yield a forward-velocity reward only when the center-of-mass of the corresponding bot is beyond a certain threshold distance. At all timesteps, there is an energy penalty to move the joints, and a survival bonus for bots that can fall over, causing premature episode termination (Hopper, Ant). Figure 3 plots the performance of PPO-independent, SI-independent, SI-interact-JS and SI-interact-RBF (which uses the RBF kernel from Liu et al. (2017) instead of the JS kernel) on these tasks. Each of these 4 algorithms is an ensemble of 8 agents using the same amount of simulation timesteps. The results are averaged over 3 separate runs, where for each run, the best agent from the ensemble after training is selected.

The SI-independent agents rely solely on action-space noise from the Gaussian policy parameterization to find high-return trajectories, which are added to $M_E$ as demonstrations. This is mostly inadequate or slow for sparse environments. Indeed, we find that all demonstrations in $M_E$ for Sparse Hopper are with the bot standing upright (or tilted) and gathering only the survival bonus, as action-space noise alone cannot discover hopping behavior. Similarly, for Sparse HalfCheetah, $M_E$ has trajectories with the bot haphazardly moving back and forth. On the other hand, in SI-interact-JS, the $D_{JS}$ repulsion term encourages the agents to be diverse and explore the state-space much more effectively. This leads to faster discovery of quality trajectories, which then provide good reinforcement through self-imitation, leading to a higher overall score. SI-interact-RBF does not perform as well, suggesting that the JS kernel is better suited for exploration. PPO-independent gets stuck in a local optimum for Sparse Hopper and Sparse HalfCheetah: the bots stand still after training, avoiding the energy penalty. For Sparse Ant, the bot can cross our preset distance threshold using only action-space noise, but learning is slow due to naive exploration.

4 CONCLUSION AND FUTURE WORK

We approached policy optimization for deep RL from the perspective of JS-divergence minimization between the state-action distributions of a policy and its own past good rollouts. This leads to a self-imitation algorithm which improves upon standard policy-gradient methods via the addition of a simple gradient term obtained from implicitly shaped dense rewards. We observe substantial performance gains over the baseline for high-dimensional, continuous-control tasks with episodic and noisy rewards. Further, we discuss the potential limitations of the self-imitation approach, and propose ensemble training with the SVPG objective and JS kernel as a solution.
Through experimentation, we demonstrate the benefits of a self-imitating, diverse ensemble for efficient exploration and avoidance of local minima. An interesting direction for future work is improving our algorithm using the rich literature on exploration in RL. Since ours is a population-based exploration method, techniques for efficient single-agent exploration can be readily combined with it. For instance, parameter-space noise or curiosity-driven exploration can be applied to each agent in the SI-interact-JS ensemble. Secondly, our algorithm for training diverse agents could be used more generally. In Appendix 5.6, we show preliminary results for two cases: a) hierarchical RL, where a diverse group of Swimmer bots is trained for downstream use in a complex Swimming+Gathering task; b) RL without environment rewards, relying solely on diversity as the optimization objective. Further investigation is left for future work.

REFERENCES

Daniel A Abolafia, Mohammad Norouzi, and Quoc V Le. Neural program synthesis with priority queue training. arXiv preprint arXiv:1801.03526, 2018.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048-5058, 2017.

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471-1479, 2016.

Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4302-4310, 2017.

Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560, 2017.

Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329-1338, 2016.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49-58, 2016.

Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.

Justin Fu, John Co-Reyes, and Sergey Levine. Ex2: Exploration with exemplar models for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2574-2584, 2017.
Scott Fujimoto, Herke van Hoof, and Dave Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, et al. Deep Q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565-4573, 2016.

Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109-1117, 2016.

Sham M Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531-1538, 2002.

Jens Kober and Jan R Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems, pp. 849-856, 2009.

Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pp. 1-9, 2013.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334-1373, 2016.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv preprint arXiv:1611.00020, 2016.

Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, and Ni Lao. Memory augmented policy optimization for program synthesis with generalization. arXiv preprint arXiv:1807.02322, 2018.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378-2386, 2016.

Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.

Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.
Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pp. 745-750. ACM, 2007.

Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, pp. 1607-1612. Atlanta, 2010.

Stéphane Ross, Narek Melik-Barkhudarov, Kumar Shaurya Shankar, Andreas Wendel, Debadeepta Dey, J Andrew Bagnell, and Martial Hebert. Learning monocular reactive UAV control in cluttered natural environments. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 1765-1772. IEEE, 2013.

Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo Method, volume 10. John Wiley & Sons, 2016.

Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017a.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309-1331, 2008.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057-1063, 2000.

Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, 1984.

Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2750-2759, 2017.

Matej Večerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.

Markus Wulfmeier, Ingmar Posner, and Pieter Abbeel. Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907, 2017.

5 APPENDIX

5.1 DERIVATION OF GRADIENT APPROXIMATION

Let $d^*_\pi(s, a)$ and $d^*_E(s, a)$ be the exact state-action densities for the current policy ($\pi_\theta$) and the expert, respectively.
Therefore, by definition, we have (up to a constant shift):

$$D_{JS}(\rho_{\pi_\theta}, \rho_{\pi_E}) = \hat{\mathbb{E}}_{\rho_{\pi_\theta}}\Big[\log \frac{d^*_\pi(s,a)}{d^*_\pi(s,a) + d^*_E(s,a)}\Big] + \hat{\mathbb{E}}_{\rho_{\pi_E}}\Big[\log \frac{d^*_E(s,a)}{d^*_\pi(s,a) + d^*_E(s,a)}\Big].$$

Now, $d^*_\pi(s,a)$ is a local surrogate to $\rho_{\pi_\theta}(s,a)$. By approximating it to be constant in an $\epsilon$-ball neighborhood around $\theta$, we get the following after taking the gradient of the above equation w.r.t. $\theta$:

$$\nabla_\theta D_{JS}(\rho_{\pi_\theta}, \rho_{\pi_E}) \approx \nabla_\theta\, \hat{\mathbb{E}}_{\rho_{\pi_\theta}}\Big[\underbrace{\log \frac{d^*_\pi(s,a)}{d^*_\pi(s,a) + d^*_E(s,a)}}_{\tilde{r}(s,a)}\Big] = \hat{\mathbb{E}}_{\rho_{\pi_\theta}(s,a)}\Big[\nabla_\theta \log \pi_\theta(a|s)\, \tilde{Q}_\pi(s,a)\Big], \;\; \text{where}\;\; \tilde{Q}_\pi(s_t, a_t) = \hat{\mathbb{E}}_{\rho_{\pi_\theta}(s,a)}\Big[\sum_{t'=t}^{\infty} \gamma^{t'-t}\, \tilde{r}(s_{t'}, a_{t'})\Big].$$

The last step follows directly from the policy gradient theorem (Sutton et al., 2000). Since we do not have the exact densities $d^*_\pi(s,a)$ and $d^*_E(s,a)$, we substitute them with the optimized density estimators $d_\pi(s,a)$ and $d_E(s,a)$ from the maximization in Equation 2 for computing $D_{JS}$. This gives us the gradient approximation mentioned in Section 2.2. A similar approximation is also used by Ho & Ermon (2016) for Generative Adversarial Imitation Learning (GAIL).

5.2 ALGORITHM FOR SELF-IMITATION

Notation: θ = policy parameters; φ = discriminator parameters; r(s, a) = environment reward

Algorithm 1:
1: θ, φ ← initial parameters
2: ME ← empty replay memory
3: for each iteration do
4:   Generate a batch of trajectories {τ}_1^b, with two rewards for each transition: r1 = r(s, a) and r2 = −log rφ(s, a)
5:   Update ME using the priority-queue threshold
     /* Update policy θ */
6:   for each minibatch do
7:     Calculate g1 = ∇θ η_r1(πθ) with the PPO objective using reward r1
8:     Calculate g2 = ∇θ η_r2(πθ) with the PPO objective using reward r2
9:     Update θ with (1 − ν)g1 + νg2 using ADAM
     /* Update self-imitation discriminator φ */
11:  for each epoch do
12:    s1 ← sample mini-batch of (s, a) from ME
13:    s2 ← sample mini-batch of (s, a) from {τ}_1^b
14:    Update φ with the log-loss objective using s1, s2

5.3 ALGORITHM FOR SELF-IMITATING DIVERSE POLICIES

Notation: θi = policy parameters for rank i; φi = self-imitation discriminator parameters for rank i; ψi = empirical density network parameters for rank i

Algorithm 2:
/* This is run for every rank i ∈ 1 . . . n */
1: θi, φi, ψi ← initial parameters
2: ME ← empty replay memory, local to rank i
3: k(i, j) ← 0, ∀ j ≠ i
4: for each iteration do
5:   Generate a batch of trajectories {τi}_1^b
6:   Update ME using the priority-queue threshold
     /* Update policy θi */
7:   for each minibatch do
8:     Calculate ∇θi η(πθi) using self-imitation (as in Algorithm 1)
9:     MPI send: ∇θi η(πθi) to other ranks
10:    MPI recv: ∇θj η(πθj) from other ranks
11:    Calculate ∇θi k(i, j) using ψi, ψj
12:    Use k(i, j) and lines 8, 10, 11 in SVPG (Equation 6) to get Δθi
13:    Update θi with Δθi using ADAM
     /* Update self-imitation discriminator φi */
15:  for each epoch do
16:    s1 ← sample mini-batch of (s, a) from ME
17:    s2 ← sample mini-batch of (s, a) from {τi}_1^b
18:    Update φi with the log-loss objective using s1, s2
     /* Update state-action visitation network ψi */
20:  MPI send: ψi to other ranks
21:  MPI send: {τi}_1^b to other ranks
22:  MPI recv: ψj from other ranks
23:  MPI recv: {τj}_1^b from other ranks
24:  Update ψi with the log-loss objective using ψj, {τi}_1^b, {τj}_1^b
25:  Update k(i, j)
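The "update the discriminator with the log-loss objective" steps in Algorithms 1 and 2 are standard binary cross-entropy updates. A minimal PyTorch sketch is shown below (names are illustrative; this is a sketch rather than the exact implementation), pairing with the discriminator sketch in Section 2.2: rollout samples are labeled 1 and replay samples 0, so that the sigmoid output estimates $d_\pi/(d_\pi + d_{\pi_E})$.

```python
import torch
import torch.nn.functional as F

def discriminator_update(disc, optimizer, replay_batch, rollout_batch):
    """One log-loss (binary cross-entropy) update of the self-imitation discriminator.

    replay_batch  : (obs, act) tensors sampled from the replay memory M_E
                    (treated as the "expert" distribution, label 0).
    rollout_batch : (obs, act) tensors sampled from the current policy's
                    rollouts (label 1).
    """
    obs_e, act_e = replay_batch
    obs_pi, act_pi = rollout_batch

    logits_e = disc(obs_e, act_e)
    logits_pi = disc(obs_pi, act_pi)

    loss = F.binary_cross_entropy_with_logits(logits_pi, torch.ones_like(logits_pi)) \
         + F.binary_cross_entropy_with_logits(logits_e, torch.zeros_like(logits_e))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```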
5.4 ABLATION STUDIES

We show the sensitivity of self-imitation to ν and to the capacity of $M_E$, denoted by C. The experiments in this subsection are done on the Humanoid and Hopper tasks with episodic rewards. The tables show the average performance over 5 random seeds. For the ablation on ν, C is fixed at 10; for the ablation on C, ν is fixed at 0.8. With episodic rewards, a higher value of ν helps boost performance, since the RL signal from the environment is weak. With ν = 0.8, there isn't a single best choice for C, though all values of C give better results than baseline PPO (ν = 0).

| ν | Humanoid | Hopper |
|---|---|---|
| 0 | 532 | 354 |
| 0.2 | 395 | 481 |
| 0.5 | 810 | 645 |
| 0.8 | 3602 | 2618 |
| 1 | 3891 | 2633 |

| C | Humanoid | Hopper |
|---|---|---|
| 1 | 2861 | 1736 |
| 5 | 2946 | 2415 |
| 10 | 3602 | 2618 |
| 25 | 2667 | 1624 |
| 50 | 4159 | 2301 |

5.5 HYPERPARAMETERS

- Horizon (T) = 1000 (locomotion), 250 (Maze), 5000 (Swimming+Gathering)
- Discount (γ) = 0.99
- GAE parameter (λ) = 0.95
- PPO internal epochs = 5
- PPO learning rate = 1e-4
- PPO mini-batch = 64

5.6 LEVERAGING DIVERSE POLICIES

The diversity-promoting $D_{JS}$ repulsion can be used for various other purposes apart from aiding exploration in the sparse environments considered thus far. First, we consider the paradigm of hierarchical reinforcement learning, wherein multiple sub-policies (or skills) are managed by a high-level policy, which chooses the most apt sub-policy to execute at any given time. In Figure 4, we use the Swimmer environment from Gym and show that diverse skills (movements) can be acquired in a pre-training phase when $D_{JS}$ repulsion is used. The skills can then be used in a difficult downstream task. During pre-training with SVPG, exploitation is done with policy gradients calculated using the norm of the velocity as a dense reward, while the exploration term uses the JS kernel. As before, we compare an ensemble of 8 interacting agents with 8 independent agents. Figures 4a and 4b depict the paths taken by the Swimmer after training with independent and interacting agents, respectively. The latter exhibit variety. Figure 4c shows the downstream task of Swimming+Gathering (Duan et al., 2016), where the bot has to swim and collect the green dots, whilst avoiding the red ones. The utility of pre-training a diverse ensemble is shown in Figure 4d, which plots the performance on this task while training a higher-level categorical manager policy (|A| = 8).

Figure 4: Using diverse agents for hierarchical reinforcement learning. (a) Independent agents' paths. (b) Interacting agents' paths. (c) Swimming+Gathering task. (d) Performance of the manager policy with two different pre-trained ensembles as sub-policies.

Diversity can sometimes also help in learning a skill without any rewards from the environment, as observed by Eysenbach et al. (2018) in recent work. We consider a Hopper task with no rewards, but we do require weak supervision in the form of the length of each trajectory, L. Using policy gradients with L as the reward and $D_{JS}$ repulsion, we see the emergence of hopping behavior within an ensemble of 8 interacting agents. Videos of the acquired skills can be found at https://sites.google.com/site/tesr4t223424.

5.7 PERFORMANCE ON MORE MUJOCO TASKS

Table 2: Extension of Table 1 from Section 3. All runs use 5M timesteps of interaction with the environment.

| Task | Episodic: SI (ν=0.8) | Episodic: PPO (ν=0) | Noisy (pm=0.9): SI | Noisy (pm=0.9): PPO | Noisy (pm=0.5): SI | Noisy (pm=0.5): PPO | Dense: SI | Dense: PPO |
|---|---|---|---|---|---|---|---|---|
| Half-Cheetah | 3686 | -1572 | 3378 | 1670 | 4574 | 2374 | 4878 | 2422 |
| Reacher | -12 | -12 | -12 | -10 | -6 | -6 | -5 | -5 |
| Inv. Pendulum | 977 | 53 | 993 | 999 | 978 | 988 | 969 | 992 |

5.8 ADDITIONAL DETAILS ON SVPG EXPLORATION WITH JS-KERNEL

5.8.1 SVPG FORMULATION

Let the policy be parameterized by θ.
To achieve diverse, high-return policies, we seek the distribution $q^*(\theta)$ which solves the optimization problem

$$\max_q \; \mathbb{E}_{\theta \sim q}[\eta(\theta)] + \alpha H(q),$$

where $H(q) = \mathbb{E}_{\theta \sim q}[-\log q(\theta)]$ is the entropy of $q$. Setting the derivative of this objective to zero yields an energy-based formulation for the optimal policy-parameter distribution:

$$q^*(\theta) \propto \exp\big(\eta(\theta)/\alpha\big).$$

Drawing samples from this posterior using traditional methods such as MCMC is computationally intractable. Stein variational gradient descent (SVGD; Liu & Wang (2016)) is an efficient method for generating samples and also converges to the posterior of the energy-based model. Let $\{\theta_i\}_{i=1}^n$ be the $n$ particles that constitute the policy ensemble. SVGD provides an appropriate direction for perturbing each particle such that the induced KL-divergence between the particles and the target distribution $q^*(\theta)$ is reduced. The perturbation (gradient) for particle $\theta_i$ is given by (please see Liu & Wang (2016) for the derivation):

$$\Delta\theta_i = \frac{1}{n}\sum_{j=1}^{n}\Big[\nabla_{\theta_j} \log q^*(\theta_j)\, k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i)\Big],$$

where $k(\theta_j, \theta_i)$ is a positive definite kernel function. Using $q^*(\theta) \propto \exp(\eta(\theta)/\alpha)$ as the target distribution, and $k(\theta_j, \theta_i) = \exp(-D_{JS}(\rho_{\pi_{\theta_j}}, \rho_{\pi_{\theta_i}})/T)$ as the JS kernel, we get the gradient direction for ascent:

$$\Delta\theta_i = \frac{1}{n}\sum_{j=1}^{n} \exp\big(-D_{JS}(\rho_{\pi_{\theta_j}}, \rho_{\pi_{\theta_i}})/T\big)\Big[\nabla_{\theta_j}\eta(\pi_{\theta_j}) - \frac{\alpha}{T}\, \nabla_{\theta_j} D_{JS}(\rho_{\pi_{\theta_j}}, \rho_{\pi_{\theta_i}})\Big],$$

where $\rho_{\pi_\theta}(s,a)$ is the state-action visitation distribution for policy $\pi_\theta$, and $T$ is the temperature. Also, for our case, $\nabla_{\theta_j}\eta(\pi_{\theta_j})$ is the interpolated gradient from self-imitation (Equation 5).

5.8.2 IMPLEMENTATION DETAILS

The $\nabla_{\theta_j} D_{JS}(\rho_{\pi_{\theta_j}}, \rho_{\pi_{\theta_i}})$ gradient in the above equation is the repulsion factor that pushes $\pi_{\theta_i}$ away from $\pi_{\theta_j}$. Similar repulsion can be achieved by using the gradient $+\nabla_{\theta_i} D_{JS}(\rho_{\pi_{\theta_j}}, \rho_{\pi_{\theta_i}})$; note that this gradient is w.r.t. $\theta_i$ instead of $\theta_j$, and the sign is reversed. Empirically, we find that the latter results in slightly better performance.

Estimation of $\nabla_{\theta_i} D_{JS}(\rho_j, \rho_i)$: This can be done in two ways, using implicit or explicit distributions. In the implicit method, we could train a parameterized discriminator network ($\phi$) using state-action pairs from $\pi_i$ and $\pi_j$ to implicitly approximate the ratio $r^\phi_{ij} = \rho_{\pi_i}(s,a)/[\rho_{\pi_i}(s,a) + \rho_{\pi_j}(s,a)]$. We could then use the policy gradient theorem to obtain the gradient of $D_{JS}$, as explained in Section 2.2. This, however, requires us to learn $O(n^2)$ discriminator networks for a population of size $n$, one for each policy pair $(i, j)$. To reduce the computational and memory resource burden to $O(n)$, we opt for explicit modeling of $\rho_{\pi_i}$. Specifically, we train a network $\rho_{\psi_i}$ to approximate the state-action visitation density for each policy $\pi_i$. The $\rho_{\psi_1}, \ldots, \rho_{\psi_n}$ networks are learned using the $D_{JS}$ optimization (Equation 2), and we can easily obtain the ratio $r_{ij}(s,a) = \rho_{\psi_i}(s,a)/[\rho_{\psi_i}(s,a) + \rho_{\psi_j}(s,a)]$. The agent then uses $\log r_{ij}(s,a)$ as the SVPG exploration reward in the policy gradient theorem.

State-value baselines: We use state-value function networks as baselines to reduce the variance in sampled policy gradients. Each agent $\theta_i$ in a population of size $n$ trains $n + 1$ state-value networks, corresponding to the real environment rewards $r(s,a)$, the self-imitation rewards $-\log r_\phi(s,a)$, and the $n - 1$ SVPG exploration rewards $\log r_{ij}(s,a)$.
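A sketch of how the explicit density networks could be used at run time is given below: one helper forms the SVPG exploration reward $\log r_{ij}(s,a)$, and another evaluates the variational form of Equation 2 on rollout samples to obtain the $D_{JS}$ estimate entering the kernel $k(i,j) = \exp(-D_{JS}/T)$. The assumption that each $\rho_\psi$ network is callable on (s, a) batches and returns positive density values, and all function names, are ours.

```python
import torch

def js_exploration_reward(rho_i, rho_j, obs, act, eps=1e-8):
    """SVPG exploration reward log r_ij(s,a) = log[rho_i / (rho_i + rho_j)].
    High where policy i visits regions that policy j does not; maximizing it
    pushes the two visitation distributions apart."""
    with torch.no_grad():
        d_i = rho_i(obs, act).clamp_min(eps)
        d_j = rho_j(obs, act).clamp_min(eps)
        return torch.log(d_i / (d_i + d_j))

def estimate_js(rho_i, rho_j, batch_i, batch_j, eps=1e-8):
    """Monte-Carlo estimate of D_JS(rho_i, rho_j) (up to a constant shift), using
    the variational form of Equation 2 evaluated on rollout samples of policies
    i and j; used to build the kernel k(i, j) = exp(-D_JS / T)."""
    obs_i, act_i = batch_i
    obs_j, act_j = batch_j
    with torch.no_grad():
        d_ii = rho_i(obs_i, act_i).clamp_min(eps)   # rho_i on i's samples
        d_ji = rho_j(obs_i, act_i).clamp_min(eps)   # rho_j on i's samples
        d_ij = rho_i(obs_j, act_j).clamp_min(eps)   # rho_i on j's samples
        d_jj = rho_j(obs_j, act_j).clamp_min(eps)   # rho_j on j's samples
        term_i = torch.log(d_ii / (d_ii + d_ji)).mean()
        term_j = torch.log(d_jj / (d_ij + d_jj)).mean()
        return term_i + term_j
```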
5.9 COMPARISON TO OH ET AL. (2018)

In this section, we evaluate a recently proposed method for self-imitation learning (SIL; Oh et al. (2018)). The SIL loss function takes the form:

$$\mathcal{L}_{SIL} = \mathbb{E}_{s,a \sim \mathcal{D}}\Big[-\log \pi_\theta(a|s)\,\big(R - V_\theta(s)\big)_+ \;+\; \frac{\beta}{2}\,\big\|\big(R - V_\theta(s)\big)_+\big\|^2\Big].$$

In words, the algorithm buffers (s, a) and the corresponding return R for each transition in rolled-out trajectories, and reuses them for training if the stored return value is higher than the current state-value estimate $V_\theta(s)$. We use the code provided by the authors (https://github.com/junhyukoh/self-imitation-learning). As per our understanding, PPO+SIL does not use a single set of hyper-parameters for all the MuJoCo tasks (Appendix A; Oh et al. (2018)). We follow their methodology and report numbers for the best configuration for each task. This is different from our experiments, since we run all tasks with a single fixed hyper-parameter set (Appendix 5.5), and therefore a direct comparison of the average scores between the two approaches is tricky.

Table 3: Performance of PPO+SIL (Oh et al., 2018) on tasks with episodic rewards, noisy rewards with masking probability pm, and dense rewards. All runs use 5M timesteps of interaction with the environment.

| Task | Dense rewards (Oh et al., 2018) | Episodic rewards | Noisy rewards (pm=0.9) | Noisy rewards (pm=0.5) |
|---|---|---|---|---|
| Walker | 3973 | 257 | 565 | 3911 |
| Humanoid | 3610 | 530 | 1126 | 3460 |
| Humanoid-Standup (×10^4) | 18.9 | 4.9 | 14.9 | 18.8 |
| Hopper | 1983 | 563 | 1387 | 1723 |
| Swimmer | 120 | 17 | 50 | 100 |
| Inverted Double Pendulum | 6250 | 405 | 6563 | 6530 |

Table 3 shows the performance of PPO+SIL on MuJoCo tasks under the various reward distributions explained in Section 3.1: dense, episodic and noisy. We observe that, compared to the dense-rewards setting (default Gym rewards), the performance suffers under the episodic case and when the rewards are masked out with pm = 0.9. Our intuition is as follows. PPO+SIL makes use of the cumulative return (R) from each transition of a past good rollout for the update. When rewards are provided only at the end of the episode, for instance, the cumulative return does not help with the temporal credit assignment problem and hence is not a strong learning signal. Our approach, on the other hand, derives dense, per-timestep rewards using an objective based on divergence minimization. This is useful for credit assignment and, as indicated in Table 1 (Section 3.1), leads to learning good policies even under the episodic and noisy (pm = 0.9) settings.

5.10 COMPARISON TO OFF-POLICY RL (Q-LEARNING)

Our approach makes use of a replay memory $M_E$ to store the past good rollouts of the agent. Off-policy RL methods such as DQN (Mnih et al., 2015) also accumulate agent experience in a replay buffer and reuse it for learning (e.g., by reducing TD-error). In this section, we evaluate the performance of one such recent algorithm, Twin Delayed Deep Deterministic policy gradient (TD3; Fujimoto et al. (2018)), on tasks with episodic and noisy rewards. TD3 builds on DDPG (Lillicrap et al., 2015) and surpasses its performance on all the MuJoCo tasks evaluated by the authors.

Table 4: Performance of TD3 (Fujimoto et al., 2018) on tasks with episodic rewards, noisy rewards with masking probability pm, and dense rewards.

| Task | Dense rewards (Fujimoto et al., 2018) | Episodic rewards | Noisy rewards (pm=0.9) | Noisy rewards (pm=0.5) |
|---|---|---|---|---|
| Walker | 4352 | 189 | 395 | 2417 |
| Hopper | 3636 | 402 | 385 | 1825 |
| Inverted Double Pendulum | 9350 | 363 | 948 | 4711 |
| Swimmer | - | - | - | - |
| Humanoid-Standup | - | - | - | - |
| Humanoid | - | - | - | - |
All runs use 5M timesteps of interaction with the environment.

Table 4 shows that the performance of TD3 suffers appreciably under the episodic and noisy (pm = 0.9) reward settings, indicating that popular off-policy algorithms (DDPG, TD3) do not exploit past experience in a manner that accelerates learning when rewards are scarce during an episode. For 3 of the tasks used in our paper (Swimmer and the high-dimensional Humanoid and Humanoid-Standup), the TD3 code from the authors (https://github.com/sfujim/TD3) is unable to learn a good policy even in the presence of dense rewards (default Gym rewards). These tasks are also not included in the evaluation by Fujimoto et al. (2018).

5.11 COMPARING SVPG EXPLORATION TO A NOVELTY-BASED BASELINE

We run a new exploration baseline, EX2 (Fu et al., 2017), and compare its performance to SI-interact-JS on the hard-exploration MuJoCo tasks considered in Section 3.2. The EX2 algorithm performs implicit estimation of the state density ρ(s) using discriminative modeling, and uses it for novelty-based exploration by adding $-\log \rho(s)$ as a bonus. We used the author-provided code (https://github.com/jcoreyes/ex2) and hyper-parameter settings. TRPO is used as the policy gradient algorithm.

Table 5: Performance of EX2 (Fu et al., 2017) and SI-interact-JS on the hard-exploration MuJoCo tasks from Section 3.2. Sparse HalfCheetah, Sparse Hopper and Sparse Ant use 1M, 1M and 2M timesteps of interaction with the environment, respectively. Results are averaged over 3 separate runs.

| Task | EX2 | SI-interact-JS |
|---|---|---|
| Sparse HalfCheetah | -286 | 769 |
| Sparse Hopper | 1477 | 1949 |
| Sparse Ant | -3.9 | 208 |