Published as a conference paper at ICLR 2017

SAMPLE EFFICIENT ACTOR-CRITIC WITH EXPERIENCE REPLAY

Ziyu Wang, DeepMind, ziyu@google.com
Victor Bapst, DeepMind, vbapst@google.com
Nicolas Heess, DeepMind, heess@google.com
Volodymyr Mnih, DeepMind, vmnih@google.com
Remi Munos, DeepMind, Munos@google.com
Koray Kavukcuoglu, DeepMind, korayk@google.com
Nando de Freitas, DeepMind, CIFAR, Oxford University, nandodefreitas@google.com

ABSTRACT

This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.

1 INTRODUCTION

Realistic simulated environments, where agents can be trained to learn a large repertoire of cognitive skills, are at the core of recent breakthroughs in AI (Bellemare et al., 2013; Mnih et al., 2015; Schulman et al., 2015a; Narasimhan et al., 2015; Mnih et al., 2016; Brockman et al., 2016; Oh et al., 2016). With richer realistic environments, the capabilities of our agents have increased and improved. Unfortunately, these advances have been accompanied by a substantial increase in the cost of simulation. In particular, every time an agent acts upon the environment, an expensive simulation step is conducted. Thus to reduce the cost of simulation, we need to reduce the number of simulation steps (i.e. samples of the environment). This need for sample efficiency is even more compelling when agents are deployed in the real world.

Experience replay (Lin, 1992) has gained popularity in deep Q-learning (Mnih et al., 2015; Schaul et al., 2016; Wang et al., 2016; Narasimhan et al., 2015), where it is often motivated as a technique for reducing sample correlation. Replay is actually a valuable tool for improving sample efficiency and, as we will see in our experiments, state-of-the-art deep Q-learning methods (Schaul et al., 2016; Wang et al., 2016) have been up to this point the most sample efficient techniques on Atari by a significant margin. However, we need to do better than deep Q-learning, because it has two important limitations. First, the deterministic nature of the optimal policy limits its use in adversarial domains. Second, finding the greedy action with respect to the Q function is costly for large action spaces.

Policy gradient methods have been at the heart of significant advances in AI and robotics (Silver et al., 2014; Lillicrap et al., 2015; Silver et al., 2016; Levine et al., 2015; Mnih et al., 2016; Schulman et al., 2015a; Heess et al., 2015). Many of these methods are restricted to continuous domains or to very specific tasks such as playing Go. The existing variants applicable to both continuous and discrete domains, such as the on-policy asynchronous advantage actor critic (A3C) of Mnih et al. (2016), are sample inefficient.

The design of stable, sample efficient actor critic methods that apply to both continuous and discrete action spaces has been a long-standing hurdle of reinforcement learning (RL). We believe this paper is the first to address this challenge successfully at scale.
More specifically, we introduce an actor critic with experience replay (ACER) that nearly matches the state-of-the-art performance of deep Q-networks with prioritized replay on Atari, and substantially outperforms A3C in terms of sample efficiency on both Atari and continuous control domains. ACER capitalizes on recent advances in deep neural networks, variance reduction techniques, the off-policy Retrace algorithm (Munos et al., 2016) and parallel training of RL agents (Mnih et al., 2016). Yet, crucially, its success hinges on innovations advanced in this paper: truncated importance sampling with bias correction, stochastic dueling network architectures, and efficient trust region policy optimization.

On the theoretical front, the paper proves that the Retrace operator can be derived from our proposed truncated importance sampling with bias correction technique.

2 BACKGROUND AND PROBLEM SETUP

Consider an agent interacting with its environment over discrete time steps. At time step $t$, the agent observes the $n_x$-dimensional state vector $x_t \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$, chooses an action $a_t$ according to a policy $\pi(a|x_t)$ and observes a reward signal $r_t \in \mathbb{R}$ produced by the environment. We will consider discrete actions $a_t \in \{1, 2, \ldots, N_a\}$ in Sections 3 and 4, and continuous actions $a_t \in \mathcal{A} \subseteq \mathbb{R}^{n_a}$ in Section 5.

The goal of the agent is to maximize the discounted return $R_t = \sum_{i \ge 0} \gamma^i r_{t+i}$ in expectation. The discount factor $\gamma \in [0, 1)$ trades off the importance of immediate and future rewards. For an agent following policy $\pi$, we use the standard definitions of the state-action and state-only value functions:
$$Q^\pi(x_t, a_t) = \mathbb{E}_{x_{t+1:\infty}, a_{t+1:\infty}}\left[ R_t \mid x_t, a_t \right] \quad \text{and} \quad V^\pi(x_t) = \mathbb{E}_{a_t}\left[ Q^\pi(x_t, a_t) \mid x_t \right].$$
Here, the expectations are with respect to the observed environment states $x_t$ and the actions generated by the policy $\pi$, where $x_{t+1:\infty}$ denotes a state trajectory starting at time $t+1$. We also need to define the advantage function $A^\pi(x_t, a_t) = Q^\pi(x_t, a_t) - V^\pi(x_t)$, which provides a relative measure of value of each action since $\mathbb{E}_{a_t}[A^\pi(x_t, a_t)] = 0$.

The parameters $\theta$ of the differentiable policy $\pi_\theta(a_t|x_t)$ can be updated using the discounted approximation to the policy gradient (Sutton et al., 2000), which, borrowing notation from Schulman et al. (2015b), is defined as:
$$g = \mathbb{E}_{x_{0:\infty}, a_{0:\infty}}\left[ \sum_{t \ge 0} A^\pi(x_t, a_t) \nabla_\theta \log \pi_\theta(a_t|x_t) \right]. \qquad (1)$$
Following Proposition 1 of Schulman et al. (2015b), we can replace $A^\pi(x_t, a_t)$ in the above expression with the state-action value $Q^\pi(x_t, a_t)$, the discounted return $R_t$, or the temporal difference residual $r_t + \gamma V^\pi(x_{t+1}) - V^\pi(x_t)$, without introducing bias. These choices will however have different variance. Moreover, in practice we will approximate these quantities with neural networks, thus introducing additional approximation errors and biases. Typically, the policy gradient estimator using $R_t$ will have higher variance and lower bias, whereas the estimators using function approximation will have higher bias and lower variance. Combining $R_t$ with the current value function approximation to minimize bias while maintaining bounded variance is one of the central design principles behind ACER.

To trade off bias and variance, the asynchronous advantage actor critic (A3C) of Mnih et al. (2016) uses a single trajectory sample to obtain the following gradient approximation:
$$\hat{g}^{\text{a3c}} = \sum_{t \ge 0} \left( \left( \sum_{i=0}^{k-1} \gamma^i r_{t+i} \right) + \gamma^k V^\pi_{\theta_v}(x_{t+k}) - V^\pi_{\theta_v}(x_t) \right) \nabla_\theta \log \pi_\theta(a_t|x_t). \qquad (2)$$
A3C combines both k-step returns and function approximation to trade off variance and bias. We may think of $V^\pi_{\theta_v}(x_t)$ as a policy gradient baseline used to reduce variance.
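To make the k-step estimator in equation (2) concrete, here is a minimal numpy sketch that computes the k-step return advantages for one trajectory segment. It is only an illustration: the function and argument names are ours, not from the paper, and the actual A3C/ACER implementations compute these quantities inside the training graph.

```python
import numpy as np

def kstep_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """k-step return advantages as used in equation (2).

    rewards:         [k] rewards r_t, ..., r_{t+k-1} of one trajectory segment
    values:          [k] critic estimates V(x_t), ..., V(x_{t+k-1})
    bootstrap_value: V(x_{t+k}), used to bootstrap the truncated return
    The returned advantages multiply grad log pi(a_t|x_t) in the policy update.
    """
    k = len(rewards)
    returns = np.empty(k)
    R = bootstrap_value
    for i in reversed(range(k)):          # accumulate discounted returns backwards
        R = rewards[i] + gamma * R
        returns[i] = R
    return returns - np.asarray(values)   # subtracting the baseline V reduces variance
```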
In the following section, we will introduce the discrete-action version of ACER. ACER may be understood as the off-policy counterpart of the A3C method of Mnih et al. (2016). As such, ACER builds on all the engineering innovations of A3C, including efficient parallel CPU computation.

ACER uses a single deep neural network to estimate the policy $\pi_\theta(a_t|x_t)$ and the value function $V^\pi_{\theta_v}(x_t)$. (For clarity and generality, we are using two different symbols to denote the parameters of the policy and value function, $\theta$ and $\theta_v$, but most of these parameters are shared in the single neural network.) Our neural networks, though building on the networks used in A3C, will introduce several modifications and new modules.

3 DISCRETE ACTOR CRITIC WITH EXPERIENCE REPLAY

Off-policy learning with experience replay may appear to be an obvious strategy for improving the sample efficiency of actor-critics. However, controlling the variance and stability of off-policy estimators is notoriously hard. Importance sampling is one of the most popular approaches for off-policy learning (Meuleau et al., 2000; Jie & Abbeel, 2010; Levine & Koltun, 2013). In our context, it proceeds as follows. Suppose we retrieve a trajectory $\{x_0, a_0, r_0, \mu(\cdot|x_0), \ldots, x_k, a_k, r_k, \mu(\cdot|x_k)\}$, where the actions have been sampled according to the behavior policy $\mu$, from our memory of experiences. Then, the importance weighted policy gradient is given by:
$$\hat{g}^{\text{imp}} = \left( \prod_{t=0}^{k} \rho_t \right) \sum_{t=0}^{k} \left( \sum_{i=0}^{k} \gamma^i r_{t+i} \right) \nabla_\theta \log \pi_\theta(a_t|x_t), \qquad (3)$$
where $\rho_t = \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}$ denotes the importance weight. This estimator is unbiased, but it suffers from very high variance as it involves a product of many potentially unbounded importance weights. To prevent the product of importance weights from exploding, Wawrzyński (2009) truncates this product. Truncated importance sampling over entire trajectories, although bounded in variance, could suffer from significant bias.

Recently, Degris et al. (2012) attacked this problem by using marginal value functions over the limiting distribution of the process to yield the following approximation of the gradient:
$$g^{\text{marg}} = \mathbb{E}_{x_t \sim \beta, a_t \sim \mu}\left[ \rho_t \nabla_\theta \log \pi_\theta(a_t|x_t) Q^\pi(x_t, a_t) \right], \qquad (4)$$
where $\mathbb{E}_{x_t \sim \beta, a_t \sim \mu}[\cdot]$ is the expectation with respect to the limiting distribution $\beta(x) = \lim_{t \to \infty} P(x_t = x \mid x_0, \mu)$ with behavior policy $\mu$. To keep the notation succinct, we will replace $\mathbb{E}_{x_t \sim \beta, a_t \sim \mu}[\cdot]$ with $\mathbb{E}_{x_t a_t}[\cdot]$ and ensure we remind readers of this when necessary.

Two important facts about equation (4) must be highlighted. First, note that it depends on $Q^\pi$ and not on $Q^\mu$; consequently, we must be able to estimate $Q^\pi$. Second, we no longer have a product of importance weights, but instead only need to estimate the marginal importance weight $\rho_t$. Importance sampling in this lower dimensional space (over marginals as opposed to trajectories) is expected to exhibit lower variance.

Degris et al. (2012) estimate $Q^\pi$ in equation (4) using lambda returns: $R^\lambda_t = r_t + (1-\lambda)\gamma V(x_{t+1}) + \lambda \gamma \rho_{t+1} R^\lambda_{t+1}$. This estimator requires that we know how to choose $\lambda$ ahead of time to trade off bias and variance. Moreover, when using small values of $\lambda$ to reduce variance, occasional large importance weights can still cause instability.

In the following subsection, we adopt the Retrace algorithm of Munos et al. (2016) to estimate $Q^\pi$. Subsequently, we propose an importance weight truncation technique to improve the stability of the off-policy actor critic of Degris et al. (2012), and introduce a computationally efficient trust region scheme for policy optimization.
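As a small illustration of the quantities involved, the sketch below computes the per-step importance ratios of equation (4), their truncated counterparts used later in the paper, and the running trajectory product that makes the estimator in equation (3) unstable. The names and the default truncation constant (c = 10, the value reported for the Atari experiments) are assumptions of this sketch rather than part of the paper's code.

```python
import numpy as np

def importance_weights(pi_probs, mu_probs, c=10.0):
    """Importance ratios for a retrieved trajectory.

    pi_probs, mu_probs: [T] probabilities of the actions actually taken, under
    the current policy pi and the stored behaviour policy mu respectively.
    Returns the marginal ratios rho_t of equation (4), their truncated version
    min{c, rho_t}, and the running trajectory product that makes equation (3)
    blow up.
    """
    rho = np.asarray(pi_probs, dtype=np.float64) / np.asarray(mu_probs, dtype=np.float64)
    rho_bar = np.minimum(c, rho)       # truncation keeps the variance bounded
    rho_product = np.cumprod(rho)      # product over the trajectory: unbounded variance
    return rho, rho_bar, rho_product
```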
The formulation of ACER for continuous action spaces will require further innovations that are advanced in Section 5.

3.1 MULTI-STEP ESTIMATION OF THE STATE-ACTION VALUE FUNCTION

In this paper, we estimate $Q^\pi(x_t, a_t)$ using Retrace (Munos et al., 2016). (We also experimented with the related tree backup method of Precup et al. (2000) but found Retrace to perform better in practice.) Given a trajectory generated under the behavior policy $\mu$, the Retrace estimator can be expressed recursively as follows (for ease of presentation, we consider only $\lambda = 1$ for Retrace):
$$Q^{\text{ret}}(x_t, a_t) = r_t + \gamma \bar{\rho}_{t+1} \left[ Q^{\text{ret}}(x_{t+1}, a_{t+1}) - Q(x_{t+1}, a_{t+1}) \right] + \gamma V(x_{t+1}), \qquad (5)$$
where $\bar{\rho}_t$ is the truncated importance weight, $\bar{\rho}_t = \min\{c, \rho_t\}$ with $\rho_t = \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}$, $Q$ is the current value estimate of $Q^\pi$, and $V(x) = \mathbb{E}_{a \sim \pi}[Q(x, a)]$. Retrace is an off-policy, return-based algorithm which has low variance and is proven to converge (in the tabular case) to the value function of the target policy for any behavior policy; see Munos et al. (2016).

The recursive Retrace equation depends on the estimate $Q$. To compute it, in discrete action spaces, we adopt a convolutional neural network with two heads that outputs the estimate $Q_{\theta_v}(x_t, a_t)$, as well as the policy $\pi_\theta(a_t|x_t)$. This neural representation is the same as in (Mnih et al., 2016), with the exception that we output the vector $Q_{\theta_v}(x_t, \cdot)$ instead of the scalar $V_{\theta_v}(x_t)$. The estimate $V_{\theta_v}(x_t)$ can be easily derived by taking the expectation of $Q_{\theta_v}$ under $\pi_\theta$.

To approximate the policy gradient $g^{\text{marg}}$, ACER uses $Q^{\text{ret}}$ to estimate $Q^\pi$. As Retrace uses multi-step returns, it can significantly reduce bias in the estimation of the policy gradient. (An alternative to Retrace here is Q(λ) with off-policy corrections (Harutyunyan et al., 2016), which we discuss in more detail in Appendix B.) To learn the critic $Q_{\theta_v}(x_t, a_t)$, we again use $Q^{\text{ret}}(x_t, a_t)$ as a target in a mean squared error loss and update its parameters $\theta_v$ with the following standard gradient:
$$\left( Q^{\text{ret}}(x_t, a_t) - Q_{\theta_v}(x_t, a_t) \right) \nabla_{\theta_v} Q_{\theta_v}(x_t, a_t). \qquad (6)$$
Because Retrace is return-based, it also enables faster learning of the critic. Thus the purpose of the multi-step estimator $Q^{\text{ret}}$ in our setting is twofold: to reduce bias in the policy gradient, and to enable faster learning of the critic, hence further reducing bias.
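A minimal numpy sketch of the backward recursion in equation (5) is given below. It follows the same backward pass as the pseudo-code in Appendix A (with λ = 1); the function and argument names are ours, and a batched implementation would live inside the training graph rather than loop in Python.

```python
import numpy as np

def retrace_targets(rewards, q_values, values, rho, bootstrap_value,
                    c=1.0, gamma=0.99):
    """Backward computation of the Retrace targets in equation (5), lambda = 1.

    rewards:  [T] rewards r_t
    q_values: [T] critic estimates Q(x_t, a_t)
    values:   [T] estimates V(x_t) = E_{a~pi} Q(x_t, a)
    rho:      [T] marginal importance ratios pi(a_t|x_t) / mu(a_t|x_t)
    bootstrap_value: V(x_T), or 0.0 if x_T is terminal
    c: truncation constant; the pseudo-code in Appendix A uses min{1, rho}.
    """
    T = len(rewards)
    targets = np.empty(T)
    q_ret = bootstrap_value
    for t in reversed(range(T)):
        q_ret = rewards[t] + gamma * q_ret
        targets[t] = q_ret
        # truncated off-policy correction carried one step further back
        q_ret = min(c, rho[t]) * (q_ret - q_values[t]) + values[t]
    return targets
```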
3.2 IMPORTANCE WEIGHT TRUNCATION WITH BIAS CORRECTION

The marginal importance weights in equation (4) can become large, thus causing instability. To safeguard against high variance, we propose to truncate the importance weights and introduce a correction term via the following decomposition of $g^{\text{marg}}$:
$$g^{\text{marg}} = \mathbb{E}_{x_t a_t}\left[ \rho_t \nabla_\theta \log \pi_\theta(a_t|x_t) Q^\pi(x_t, a_t) \right]
= \mathbb{E}_{x_t}\left[ \mathbb{E}_{a_t}\left[ \bar{\rho}_t \nabla_\theta \log \pi_\theta(a_t|x_t) Q^\pi(x_t, a_t) \right] + \mathbb{E}_{a \sim \pi}\left[ \left[ \frac{\rho_t(a) - c}{\rho_t(a)} \right]_+ \nabla_\theta \log \pi_\theta(a|x_t) Q^\pi(x_t, a) \right] \right], \qquad (7)$$
where $\bar{\rho}_t = \min\{c, \rho_t\}$ with $\rho_t = \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}$ as before. We have also introduced the notation $\rho_t(a) = \frac{\pi(a|x_t)}{\mu(a|x_t)}$, and $[x]_+ = x$ if $x > 0$ and zero otherwise. We remind readers that the above expectations are with respect to the limiting state distribution under the behavior policy: $x_t \sim \beta$ and $a_t \sim \mu$.

The clipping of the importance weight in the first term of equation (7) ensures that the variance of the gradient estimate is bounded. The correction term (second term in equation (7)) ensures that our estimate is unbiased. Note that the correction term is only active for actions such that $\rho_t(a) > c$. In particular, if we choose a large value for $c$, the correction term only comes into effect when the variance of the original off-policy estimator of equation (4) is very high. When this happens, our decomposition has the nice property that the truncated weight in the first term is at most $c$, while the correction weight $\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_+$ in the second term is at most 1.

We model $Q^\pi(x_t, a)$ in the correction term with our neural network approximation $Q_{\theta_v}(x_t, a)$. This modification results in what we call the truncation with bias correction trick, in this case applied to the function $\nabla_\theta \log \pi_\theta(a_t|x_t) Q^\pi(x_t, a_t)$:
$$\hat{g}^{\text{marg}} = \mathbb{E}_{x_t}\left[ \mathbb{E}_{a_t}\left[ \bar{\rho}_t \nabla_\theta \log \pi_\theta(a_t|x_t) Q^{\text{ret}}(x_t, a_t) \right] + \mathbb{E}_{a \sim \pi}\left[ \left[ \frac{\rho_t(a) - c}{\rho_t(a)} \right]_+ \nabla_\theta \log \pi_\theta(a|x_t) Q_{\theta_v}(x_t, a) \right] \right]. \qquad (8)$$
Equation (8) involves an expectation over the stationary distribution of the Markov process. We can however approximate it by sampling trajectories $\{x_0, a_0, r_0, \mu(\cdot|x_0), \ldots, x_k, a_k, r_k, \mu(\cdot|x_k)\}$ generated from the behavior policy $\mu$. Here, the terms $\mu(\cdot|x_t)$ are the policy vectors. Given these trajectories, we can compute the off-policy ACER gradient:
$$\hat{g}^{\text{acer}}_t = \bar{\rho}_t \nabla_\theta \log \pi_\theta(a_t|x_t) \left[ Q^{\text{ret}}(x_t, a_t) - V_{\theta_v}(x_t) \right] + \mathbb{E}_{a \sim \pi}\left[ \left[ \frac{\rho_t(a) - c}{\rho_t(a)} \right]_+ \nabla_\theta \log \pi_\theta(a|x_t) \left[ Q_{\theta_v}(x_t, a) - V_{\theta_v}(x_t) \right] \right]. \qquad (9)$$
In the above expression, we have subtracted the classical baseline $V_{\theta_v}(x_t)$ to reduce variance. It is interesting to note that, when $c = \infty$, (9) recovers the (off-policy) policy gradient up to the use of Retrace. When $c = 0$, (9) recovers an actor critic update that depends entirely on Q estimates. In the continuous control domain, (9) also generalizes Stochastic Value Gradients if $c = 0$ and the reparametrization trick is used to estimate its second term (Heess et al., 2015).

3.3 EFFICIENT TRUST REGION POLICY OPTIMIZATION

The policy updates of actor-critic methods do often exhibit high variance. Hence, to ensure stability, we must limit the per-step changes to the policy. Simply using smaller learning rates is insufficient, as they cannot guard against the occasional large updates while maintaining a desired learning speed. Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) provides a more adequate solution. Schulman et al. (2015a) approximately limit the difference between the updated policy and the current policy to ensure safety. Despite the effectiveness of their TRPO method, it requires repeated computation of Fisher-vector products for each update. This can prove to be prohibitively expensive in large domains.

In this section we introduce a new trust region policy optimization method that scales well to large problems. Instead of constraining the updated policy to be close to the current policy (as in TRPO), we propose to maintain an average policy network that represents a running average of past policies and forces the updated policy to not deviate far from this average.

We decompose our policy network in two parts: a distribution $f$, and a deep neural network that generates the statistics $\phi_\theta(x)$ of this distribution. That is, given $f$, the policy is completely characterized by the network $\phi_\theta$: $\pi(\cdot|x) = f(\cdot|\phi_\theta(x))$. For example, in the discrete domain, we choose $f$ to be the categorical distribution with a probability vector $\phi_\theta(x)$ as its statistics. The probability vector is of course parameterised by $\theta$. We denote the average policy network as $\phi_{\theta_a}$ and update its parameters $\theta_a$ softly after each update to the policy parameter $\theta$: $\theta_a \leftarrow \alpha \theta_a + (1 - \alpha)\theta$.
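Before describing the trust region step itself, it may help to see equation (9) spelled out for a discrete policy. The sketch below computes per-action gradient weights; the function name, the vectorized form, and the default c = 10 are assumptions of this illustration, not the paper's implementation.

```python
import numpy as np

def acer_discrete_policy_weights(pi, mu, action, q_ret, q_values, c=10.0):
    """Per-action weights for the ACER gradient of equation (9).

    pi:       [Na] probabilities pi_theta(.|x_t) of the current policy
    mu:       [Na] behaviour-policy probabilities mu(.|x_t) stored in replay
    action:   index of the sampled action a_t
    q_ret:    scalar Retrace target Q^ret(x_t, a_t)
    q_values: [Na] critic outputs Q_theta_v(x_t, .)
    Multiplying weights[a] by grad log pi(a|x_t) and summing over a gives
    the sample-based estimator in equation (9).
    """
    pi = np.asarray(pi, dtype=np.float64)
    mu = np.asarray(mu, dtype=np.float64)
    q_values = np.asarray(q_values, dtype=np.float64)
    v = float(np.dot(pi, q_values))              # V(x_t) = E_{a~pi} Q(x_t, a)
    rho = pi / mu                                # marginal importance ratios
    weights = np.zeros_like(pi)
    # truncated term: only the action actually taken contributes
    weights[action] += min(c, rho[action]) * (q_ret - v)
    # bias-correction term: expectation over pi of [1 - c/rho(a)]_+ (Q - V)
    weights += pi * np.maximum(0.0, 1.0 - c / rho) * (q_values - v)
    return weights
```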
Consider, for example, the ACER policy gradient as defined in equation (9), but with respect to $\phi$:
$$\hat{g}^{\text{acer}}_t = \bar{\rho}_t \nabla_{\phi_\theta(x_t)} \log f(a_t|\phi_\theta(x_t)) \left[ Q^{\text{ret}}(x_t, a_t) - V_{\theta_v}(x_t) \right] + \mathbb{E}_{a \sim \pi}\left[ \left[ \frac{\rho_t(a) - c}{\rho_t(a)} \right]_+ \nabla_{\phi_\theta(x_t)} \log f(a|\phi_\theta(x_t)) \left[ Q_{\theta_v}(x_t, a) - V_{\theta_v}(x_t) \right] \right]. \qquad (10)$$
Given the averaged policy network, our proposed trust region update involves two stages. In the first stage, we solve the following optimization problem with a linearized KL divergence constraint:
$$\underset{z}{\text{minimize}} \ \ \frac{1}{2} \left\| \hat{g}^{\text{acer}}_t - z \right\|_2^2 \quad \text{subject to} \quad \nabla_{\phi_\theta(x_t)} D_{KL}\left[ f(\cdot|\phi_{\theta_a}(x_t)) \,\|\, f(\cdot|\phi_\theta(x_t)) \right]^T z \le \delta. \qquad (11)$$
Since the constraint is linear, the overall optimization problem reduces to a simple quadratic programming problem, the solution of which can be easily derived in closed form using the KKT conditions. Letting $k = \nabla_{\phi_\theta(x_t)} D_{KL}\left[ f(\cdot|\phi_{\theta_a}(x_t)) \,\|\, f(\cdot|\phi_\theta(x_t)) \right]$, the solution is:
$$z^* = \hat{g}^{\text{acer}}_t - \max\left\{ 0, \frac{k^T \hat{g}^{\text{acer}}_t - \delta}{\|k\|_2^2} \right\} k. \qquad (12)$$
This transformation of the gradient has a very natural form. If the constraint is satisfied, there is no change to the gradient with respect to $\phi_\theta(x_t)$. Otherwise, the update is scaled down in the direction of $k$, thus effectively lowering the rate of change between the activations of the current policy and the average policy network.

[Figure 1: ACER improvements in sample (LEFT) and computation (RIGHT) complexity on Atari. On each plot, the median of the human-normalized score across all 57 Atari games is presented for 4 ratios of replay, with 0 replay corresponding to on-policy A3C. The colored solid and dashed lines represent ACER with and without trust region updating respectively. The environment steps are counted over all threads. The gray curve is the original DQN agent (Mnih et al., 2015) and the black curve is one of the Prioritized Double DQN agents from Schaul et al. (2016). Axes: million environment steps (left) and hours of wall clock time (right) versus median human-normalized score.]

In the second stage, we take advantage of back-propagation. Specifically, the updated gradient with respect to $\phi_\theta$, that is $z^*$, is back-propagated through the network to compute the derivatives with respect to the parameters. The parameter updates for the policy network follow from the chain rule: $\frac{\partial \phi_\theta(x)}{\partial \theta} z^*$.

The trust region step is carried out in the space of the statistics of the distribution $f$, and not in the space of the policy parameters. This is done deliberately so as to avoid an additional back-propagation step through the policy network. We would like to remark that the algorithm advanced in this section can be thought of as a general strategy for modifying the backward messages in back-propagation so as to stabilize the activations.

Instead of a trust region update, one could alternatively add an appropriately scaled KL cost to the objective function, as proposed by Heess et al. (2015). This approach, however, is less robust to the choice of hyper-parameters in our experience.

The ACER algorithm results from a combination of the above ideas, with the precise pseudo-code appearing in Appendix A. A master algorithm (Algorithm 1) calls ACER on-policy to perform updates and propose trajectories. It then calls the ACER off-policy component to conduct several replay steps. When on-policy, ACER effectively becomes a modified version of A3C where Q instead of V baselines are employed and trust region optimization is used.
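Since equation (12) is the heart of the cheap trust region step, a minimal sketch of the projection is given below (plain numpy, flattened vectors). The small epsilon in the denominator is a numerical-safety detail of ours, not from the paper; in practice the returned z* replaces the original gradient with respect to φθ(xt) in the subsequent back-propagation.

```python
import numpy as np

def trust_region_adjust(g, k, delta=1.0, eps=1e-8):
    """Closed-form solution z* of equation (12).

    g:     flattened ACER gradient with respect to the statistics phi_theta(x_t)
    k:     flattened gradient of KL[f(.|phi_avg) || f(.|phi)] w.r.t. phi_theta(x_t)
    delta: trust region constraint
    If the linearised constraint k^T g <= delta already holds, g is returned
    unchanged; otherwise g is scaled down along the direction of k.
    """
    g = np.asarray(g, dtype=np.float64)
    k = np.asarray(k, dtype=np.float64)
    scale = max(0.0, (float(np.dot(k, g)) - delta) / (float(np.dot(k, k)) + eps))
    return g - scale * k
```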
4 RESULTS ON ATARI

We use the Arcade Learning Environment of Bellemare et al. (2013) to conduct an extensive evaluation. We deploy one single algorithm and network architecture, with fixed hyper-parameters, to learn to play 57 Atari games given only raw pixel observations and game rewards. This task is highly demanding because of the diversity of the games and the high-dimensional pixel-level observations.

Our experimental setup uses 16 actor-learner threads running on a single machine with no GPUs. We adopt the same input pre-processing and network architecture as Mnih et al. (2015). Specifically, the network consists of a convolutional layer with 32 8×8 filters with stride 4, followed by another convolutional layer with 64 4×4 filters with stride 2, followed by a final convolutional layer with 64 3×3 filters with stride 1, followed by a fully-connected layer of size 512. Each of the hidden layers is followed by a rectifier nonlinearity. The network outputs a softmax policy and Q values.

When using replay, we add to each thread a replay memory that is up to 50,000 frames in size. The total amount of memory used across all threads is thus similar in size to that of DQN (Mnih et al., 2015). For all Atari experiments, we use a single learning rate adopted from an earlier implementation of A3C without further tuning. We do not anneal the learning rates over the course of training as in Mnih et al. (2016). We otherwise adopt the same optimization procedure as in Mnih et al. (2016). Specifically, we adopt entropy regularization with weight 0.001, discount the rewards with γ = 0.99, and perform updates every 20 steps (k = 20 in the notation of Section 2). In all our experiments with experience replay, we use importance weight truncation with c = 10. We consider training ACER both with and without trust region updating as described in Section 3.3. When trust region updating is used, we use δ = 1 and α = 0.99 for all experiments.

To compare different agents, we adopt as our metric the median of the human-normalized score over all 57 games. The normalization is calculated such that, for each game, human scores and random scores are mapped to 1 and 0 respectively. The normalized score for a given game at time t is computed as the average normalized score over the past 1 million consecutive frames encountered until time t. For each agent, we plot its cumulative maximum median score over time. The result is summarized in Figure 1.

The four colors in Figure 1 correspond to four replay ratios (0, 1, 4 and 8), with a ratio of 4 meaning that we use the off-policy component of ACER 4 times after using the on-policy component (A3C). That is, a replay ratio of 0 means that we are using A3C. The solid and dashed lines represent ACER with and without trust region updating respectively. The gray and black curves are the original DQN (Mnih et al., 2015) and the Prioritized Replay agent of Schaul et al. (2016) respectively.

As shown on the left panel of Figure 1, replay significantly increases data efficiency. We observe that when using the trust region optimizer, the average reward as a function of the number of environment steps increases with the ratio of replay. This increase has diminishing returns, but with enough replay, ACER can match the performance of the best DQN agents. Moreover, it is clear that the off-policy actor critics (ACER) are much more sample efficient than their on-policy counterpart (A3C).

The right panel of Figure 1 shows that ACER agents perform similarly to A3C when measured by wall clock time. Thus, in this case, it is possible to achieve better data-efficiency without necessarily compromising on computation time. In particular, ACER with a replay ratio of 4 is an appealing alternative to either the prioritized DQN agent or A3C.
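Before moving on to continuous control, here is a sketch of the shared policy/Q network described at the start of this section. The paper does not tie itself to a particular framework, so this PyTorch version, its class name, and the assumption of 4 stacked 84×84 input frames (the pre-processing of Mnih et al., 2015) are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcerAtariNet(nn.Module):
    """Illustrative sketch of the two-headed Atari network (policy + Q)."""

    def __init__(self, num_actions):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc = nn.Linear(64 * 7 * 7, 512)
        self.policy_head = nn.Linear(512, num_actions)  # softmax policy logits
        self.q_head = nn.Linear(512, num_actions)       # Q(x, .) for every action

    def forward(self, x):
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = F.relu(self.conv3(h))
        h = F.relu(self.fc(h.flatten(start_dim=1)))
        pi = F.softmax(self.policy_head(h), dim=-1)
        q = self.q_head(h)
        v = (pi * q).sum(dim=-1, keepdim=True)          # V(x) = E_pi[Q(x, .)]
        return pi, q, v
```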
5 CONTINUOUS ACTOR CRITIC WITH EXPERIENCE REPLAY

Retrace requires estimates of both Q and V, but we cannot easily integrate over Q to derive V in continuous action spaces. In this section, we propose a solution to this problem in the form of a novel representation for RL, as well as the modifications necessary for trust region updating.

5.1 POLICY EVALUATION

Retrace provides a target for learning $Q_{\theta_v}$, but not for learning $V_{\theta_v}$. We could use importance sampling to compute $V_{\theta_v}$ given $Q_{\theta_v}$, but this estimator has high variance.

We propose a new architecture which we call Stochastic Dueling Networks (SDNs), inspired by the Dueling networks of Wang et al. (2016), which is designed to estimate both $V^\pi$ and $Q^\pi$ off-policy while maintaining consistency between the two estimates. At each time step, an SDN outputs a stochastic estimate $\widetilde{Q}_{\theta_v}$ of $Q^\pi$ and a deterministic estimate $V_{\theta_v}$ of $V^\pi$, such that
$$\widetilde{Q}_{\theta_v}(x_t, a_t) \sim V_{\theta_v}(x_t) + A_{\theta_v}(x_t, a_t) - \frac{1}{n} \sum_{i=1}^{n} A_{\theta_v}(x_t, u_i), \quad u_i \sim \pi_\theta(\cdot|x_t), \qquad (13)$$
where $n$ is a parameter; see Figure 2. The two estimates are consistent in the sense that $\mathbb{E}_{a \sim \pi(\cdot|x_t)}\left[ \mathbb{E}_{u_{1:n} \sim \pi(\cdot|x_t)}\left[ \widetilde{Q}_{\theta_v}(x_t, a) \right] \right] = V_{\theta_v}(x_t)$. Furthermore, we can learn about $V^\pi$ by learning $\widetilde{Q}_{\theta_v}$. To see this, assume we have learned $Q^\pi$ perfectly, such that $\mathbb{E}_{u_{1:n} \sim \pi(\cdot|x_t)}\left[ \widetilde{Q}_{\theta_v}(x_t, a_t) \right] = Q^\pi(x_t, a_t)$; then $V_{\theta_v}(x_t) = \mathbb{E}_{a \sim \pi(\cdot|x_t)}\left[ \mathbb{E}_{u_{1:n} \sim \pi(\cdot|x_t)}\left[ \widetilde{Q}_{\theta_v}(x_t, a) \right] \right] = \mathbb{E}_{a \sim \pi(\cdot|x_t)}\left[ Q^\pi(x_t, a) \right] = V^\pi(x_t)$. Therefore, a target on $\widetilde{Q}_{\theta_v}(x_t, a_t)$ also provides an error signal for updating $V_{\theta_v}$.

[Figure 2: A schematic of the Stochastic Dueling Network. In the drawing, $u_1, \ldots, u_n$ are assumed to be samples from $\pi_\theta(\cdot|x_t)$. This schematic illustrates the concept of SDNs but does not reflect the real sizes of the networks used.]

In addition to SDNs, we also construct the following novel target for estimating $V^\pi$:
$$V^{\text{target}}(x_t) = \min\left\{ 1, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)} \right\} \left( Q^{\text{ret}}(x_t, a_t) - Q_{\theta_v}(x_t, a_t) \right) + V_{\theta_v}(x_t). \qquad (14)$$
The above target is also derived via the truncation and bias correction trick; for more details, see Appendix D.

Finally, when estimating $Q^{\text{ret}}$ in continuous domains, we implement a slightly different formulation of the truncated importance weights, $\bar{\rho}_t = \min\left\{ 1, \left( \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)} \right)^{\frac{1}{d}} \right\}$, where $d$ is the dimensionality of the action space. Although not essential, we have found this formulation to lead to faster learning.

5.2 TRUST REGION UPDATING

To adopt the trust region updating scheme (Section 3.3) in the continuous control domain, one simply has to choose a distribution $f$ and a gradient specification $\hat{g}^{\text{acer}}_t$ suitable for continuous action spaces. For the distribution $f$, we choose Gaussian distributions with fixed diagonal covariance and mean $\phi_\theta(x)$.

To derive $\hat{g}^{\text{acer}}_t$ in continuous action spaces, consider the ACER policy gradient for the stochastic dueling network, but with respect to $\phi$:
$$g^{\text{acer}}_t = \mathbb{E}_{x_t}\left[ \mathbb{E}_{a_t}\left[ \bar{\rho}_t \nabla_{\phi_\theta(x_t)} \log f(a_t|\phi_\theta(x_t)) \left( Q^{\text{opc}}(x_t, a_t) - V_{\theta_v}(x_t) \right) \right] + \mathbb{E}_{a \sim \pi}\left[ \left[ \frac{\rho_t(a) - c}{\rho_t(a)} \right]_+ \left( \widetilde{Q}_{\theta_v}(x_t, a) - V_{\theta_v}(x_t) \right) \nabla_{\phi_\theta(x_t)} \log f(a|\phi_\theta(x_t)) \right] \right]. \qquad (15)$$
In the above definition, we use $Q^{\text{opc}}$ instead of $Q^{\text{ret}}$. Here, $Q^{\text{opc}}(x_t, a_t)$ is the same as Retrace with the exception that the truncated importance ratio is replaced with 1 (Harutyunyan et al., 2016). Please refer to Appendix B for an expanded discussion of this design choice.
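Concretely, Q^opc can be obtained from the same backward recursion as the earlier Retrace sketch, with the trace no longer cut by the truncated importance ratio. The sketch below is illustrative only; the names are ours.

```python
import numpy as np

def opc_targets(rewards, q_values, values, bootstrap_value, gamma=0.99):
    """Q(lambda) with off-policy corrections (Q^opc).

    Same backward recursion as the Retrace sketch in Section 3.1, except the
    trace is not cut by the truncated importance ratio (it is fixed to 1).
    rewards:  [T] rewards r_t
    q_values: [T] critic estimates of Q(x_t, a_t)
    values:   [T] estimates V(x_t)
    bootstrap_value: V(x_T), or 0.0 for a terminal state
    """
    T = len(rewards)
    targets = np.empty(T)
    q_opc = bootstrap_value
    for t in reversed(range(T)):
        q_opc = rewards[t] + gamma * q_opc
        targets[t] = q_opc
        q_opc = (q_opc - q_values[t]) + values[t]   # importance ratio replaced by 1
    return targets
```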
Given an observation $x_t$, we can sample $a'_t \sim \pi_\theta(\cdot|x_t)$ to obtain the following Monte Carlo approximation:
$$\hat{g}^{\text{acer}}_t = \bar{\rho}_t \nabla_{\phi_\theta(x_t)} \log f(a_t|\phi_\theta(x_t)) \left( Q^{\text{opc}}(x_t, a_t) - V_{\theta_v}(x_t) \right) + \left[ \frac{\rho_t(a'_t) - c}{\rho_t(a'_t)} \right]_+ \left( \widetilde{Q}_{\theta_v}(x_t, a'_t) - V_{\theta_v}(x_t) \right) \nabla_{\phi_\theta(x_t)} \log f(a'_t|\phi_\theta(x_t)). \qquad (16)$$
Given $f$ and $\hat{g}^{\text{acer}}_t$, we apply the same steps as detailed in Section 3.3 to complete the update. The precise pseudo-code of the ACER algorithm for continuous action spaces is presented in Appendix A.

[Figure 3: (TOP) Screen shots of the continuous control tasks. (BOTTOM) Performance of different methods (Trust-TIS, Trust-A3C, TIS, ACER and A3C) on Walker2d (9-DoF/6-dim. actions), Fish (13-DoF/5-dim. actions), Cartpole (2-DoF/1-dim. actions), Humanoid (27-DoF/21-dim. actions), Reacher3 (3-DoF/3-dim. actions) and Cheetah (9-DoF/6-dim. actions); episode rewards versus million steps. ACER outperforms all other methods and shows clear gains for the higher-dimensionality tasks (humanoid, cheetah, walker and fish). The proposed trust region method by itself improves the two baselines (truncated importance sampling and A3C) significantly.]

6 RESULTS ON MUJOCO

We evaluate our algorithms on 6 continuous control tasks, all of which are simulated using the MuJoCo physics engine (Todorov et al., 2012). For descriptions of the tasks, please refer to Appendix E.1. Briefly, the tasks with action dimensionality in brackets are: cartpole (1D), reacher (3D), cheetah (6D), fish (5D), walker (6D) and humanoid (21D). These tasks are illustrated in Figure 3.

To benchmark ACER for continuous control, we compare it to its on-policy counterpart both with and without trust region updating. We refer to these two baselines as A3C and Trust-A3C. Additionally, we also compare to a baseline with replay where we truncate the importance weights over trajectories as in (Wawrzyński, 2009). For a detailed description of this baseline, please refer to Appendix E. Again, we run this baseline both with and without trust region updating, and refer to these choices as Trust-TIS and TIS respectively. Last but not least, we refer to our proposed approach with SDNs and trust region updating as simply ACER. All five setups are implemented in the asynchronous A3C framework.

All the aforementioned setups share the same network architecture that computes the policy and state values. We maintain an additional small network that computes the stochastic A values in the case of ACER. We use n = 5 (using the notation in equation (13)) in all SDNs. Instead of mixing on-policy and replay learning as done in the Atari domain, ACER for continuous actions is entirely off-policy, with experiences generated from the simulator (4 times on average). When using replay, we add to each thread a replay memory that is 5,000 frames in size and perform updates every 50 steps (k = 50 in the notation of Section 2). The rate of the soft updating (α as in Section 3.3) is set to 0.995 in all setups involving trust region updating. The truncation threshold c is set to 5 for ACER.

For all setups, we sample the learning rates log-uniformly in the range $[10^{-4}, 10^{-3.3}]$. We use diagonal Gaussian policies with fixed diagonal covariances, where the diagonal standard deviation is set to 0.3.
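Since f here is a diagonal Gaussian with fixed standard deviation 0.3, the statistics φθ(x) are simply the mean, and the log-density terms entering equations (15)-(16) and the trust region step have simple closed forms. The sketch below is only an illustration; the function names are ours.

```python
import numpy as np

SIGMA = 0.3  # fixed diagonal standard deviation used in these experiments

def gaussian_logp(a, mean, sigma=SIGMA):
    """log f(a | phi) for a diagonal Gaussian whose statistics phi are the mean."""
    a = np.asarray(a, dtype=np.float64)
    mean = np.asarray(mean, dtype=np.float64)
    d = a.shape[-1]
    return (-0.5 * np.sum((a - mean) ** 2, axis=-1) / sigma ** 2
            - d * np.log(sigma) - 0.5 * d * np.log(2.0 * np.pi))

def grad_logp_wrt_mean(a, mean, sigma=SIGMA):
    """Gradient of log f(a | phi) with respect to the mean statistics phi_theta(x).

    This is the building block of the statistics-space gradients in equations
    (15) and (16), and the quantity the trust region projection acts on.
    """
    return (np.asarray(a) - np.asarray(mean)) / sigma ** 2
```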
For setups involving trust region updating, we also sample δ uniformly in the range [0.1, 2]. With all setups, we use 30 sampled hyper-parameter settings. The empirical results for all continuous control tasks are shown Figure 3, where we show the mean and standard deviation of the best 5 out of 30 hyper-parameter settings over which we searched 3. For sensitivity analyses with respect to the hyper-parameters, please refer to Figures 5 and 6 in the Appendix. In continuous control, ACER outperforms the A3C and truncated importance sampling baselines by a very significant margin. Here, we also find that the proposed trust region optimization method can result in huge improvements over the baselines. The high-dimensional continuous action policies are much harder to optimize than the small discrete action policies in Atari, and hence we observe much higher gains for trust region optimization in the continuous control domains. In spite of the improvements brought in by trust region optimization, ACER still outperforms all other methods, specially in higher dimensions. 6.1 ABLATIONS To further tease apart the contributions of the different components of ACER, we conduct an ablation analysis where we individually remove Retrace / Q(λ) off-policy correction, SDNs, trust region, and truncation with bias correction from the algorithm. As shown in Figure 4, Retrace and offpolicy correction, SDNs, and trust region are critical: removing any one of them leads to a clear deterioration of the performance. Truncation with bias correction did not alter the results in the Fish and Walker2d tasks. However, in Humanoid, where the dimensionality of the action space is much higher, including truncation and bias correction brings a significant boost which makes the originally kneeling humanoid stand. Presumably, the high dimensionality of the action space increases the variance of the importance weights which makes truncation with bias correction important. For more details on the experimental setup please see Appendix E.4. 7 THEORETICAL ANALYSIS Retrace is a very recent development in reinforcement learning. In fact, this work is the first to consider Retrace in the policy gradients setting. For this reason, and given the core role that Retrace plays in ACER, it is valuable to shed more light on this technique. In this section, we will prove that Retrace can be interpreted as an application of the importance weight truncation and bias correction trick advanced in this paper. Consider the following equation: Qπ(xt, at) = Ext+1at+1 [rt + γρt+1Qπ(xt+1, at+1)] . (17) If we apply the weight truncation and bias correction trick to the above equation we obtain Qπ(xt, at) = Ext+1at+1 rt + γ ρt+1Qπ(xt+1, at+1) + γ E a π + Qπ(xt+1, a) (18) By recursively expanding Qπ as in Equation (18), we can represent Qπ(x, a) as: Qπ(x, a) = Eµ rt + γ E b π + Qπ(xt+1, b) The expectation Eµ is taken over trajectories starting from x with actions generated with respect to µ. When Qπ is not available, we can replace it with our current estimate Q to get a return-based 3 For videos of the policies learned with ACER, please see: https://www.youtube.com/watch?v= Nmbe QYo Vv5g&list=PLkm HIkhl Fji Tlvwx Ens JMs3v7se R5HSP-. Published as a conference paper at ICLR 2017 Fish Walker2d Humanoid No Trust Region 0 10 20 30 40 50 60 70 80 12 Episode Rewards (13-Do F/5-dim. Actions) 0 20 40 60 80 100 120 25 (9-Do F/6-dim. Actions) 0 10 20 30 40 50 60 70 80 (27-Do F/21-dim. 
Actions) 0 10 20 30 40 50 60 70 80 12 Episode Rewards 0 20 40 60 80 100 120 0 10 20 30 40 50 60 70 80 No Retrace nor Off-Policy Corr. 0 10 20 30 40 50 60 70 80 12 Episode Rewards 0 20 40 60 80 100 120 0 10 20 30 40 50 60 70 80 No Truncation & Bias Corr. 0 10 20 30 40 50 60 70 80 Million Steps Episode Rewards 0 20 40 60 80 100 120 Million Steps 0 10 20 30 40 50 60 70 80 Million Steps Figure 4: Ablation analysis evaluating the effect of different components of ACER. Each row compares ACER with and without one component. The columns represents three control tasks. Red lines, in all plots, represent ACER whereas green lines ACER with missing components. This study indicates that all 4 components studied improve performance where 3 are critical to success. Note that the ACER curve is of course the same in all rows. esitmate of Qπ. This operation also defines an operator: BQ(x, a) = Eµ rt + γ E b π + Q(xt+1, b) In the following proposition, we show that B is a contraction operator with a unique fixed point Qπ and that it is equivalent to the Retrace operator. Proposition 1. The operator B is a contraction operator such that BQ Qπ γ Q Qπ and B is equivalent to Retrace. The above proposition not only shows an alternative way of arriving at the same operator, but also provides a different proof of contraction for Retrace. Please refer to Appendix C for the regularization conditions and proof of the above proposition. Finally, B, and therefore Retrace, generalizes both the Bellman operator T π and importance sampling. Specifically, when c = 0, B = T π and when c = , B recovers importance sampling; see Appendix C. Published as a conference paper at ICLR 2017 8 CONCLUDING REMARKS We have introduced a stable off-policy actor critic that scales to both continuous and discrete action spaces. This approach integrates several recent advances in RL in a principle manner. In addition, it integrates three innovations advanced in this paper: truncated importance sampling with bias correction, stochastic dueling networks and an efficient trust region policy optimization method. We showed that the method not only matches the performance of the best known methods on Atari, but that it also outperforms popular techniques on several continuous control problems. The efficient trust region optimization method advanced in this paper performs remarkably well in continuous domains. It could prove very useful in other deep learning domains, where it is hard to stabilize the training process. ACKNOWLEDGMENTS We are very thankful to Marc Bellemare, Jascha Sohl-Dickstein, and S ebastien Racaniere for proofreading and valuable suggestions. M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 47:253 279, 2013. G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Open AI Gym. ar Xiv preprint 1606.01540, 2016. T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. In ICML, pp. 457 464, 2012. Anna Harutyunyan, Marc G Bellemare, Tom Stepleton, and Remi Munos. Q (λ) with off-policy corrections. ar Xiv preprint ar Xiv:1602.04951, 2016. N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In NIPS, 2015. T. Jie and P. Abbeel. On a connection between importance sampling and the likelihood ratio policy gradient. In NIPS, pp. 1000 1008, 2010. S. Levine and V. Koltun. Guided policy search. In ICML, 2013. S. Levine, C. Finn, T. 
Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. ar Xiv preprint ar Xiv:1504.00702, 2015. T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ar Xiv:1509.02971, 2015. L.J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3):293 321, 1992. N. Meuleau, L. Peshkin, L. P. Kaelbling, and K. Kim. Off-policy policy search. Technical report, MIT AI Lab, 2000. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540): 529 533, 2015. V. Mnih, A. Puigdom enech Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ar Xiv:1602.01783, 2016. R. Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare. Safe and efficient off-policy reinforcement learning. ar Xiv preprint ar Xiv:1606.02647, 2016. K. Narasimhan, T. Kulkarni, and R. Barzilay. Language understanding for text-based games using deep reinforcement learning. In EMNLP, 2015. Published as a conference paper at ICLR 2017 J. Oh, V. Chockalingam, S. P. Singh, and H. Lee. Control of memory, active perception, and action in Minecraft. In ICML, 2016. D. Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In ICML, pp. 759 766, 2000. T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In ICLR, 2016. J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, 2015a. J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. ar Xiv:1506.02438, 2015b. D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014. D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484 489, 2016. R. S. Sutton, D. Mcallester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pp. 1057 1063, 2000. E. Todorov, T. Erez, and Y. Tassa. Mu Jo Co: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pp. 5026 5033, 2012. Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. Dueling network architectures for deep reinforcement learning. In ICML, 2016. P. Wawrzy nski. Real-time reinforcement learning by sequential actor critics and experience replay. Neural Networks, 22(10):1484 1497, 2009. Published as a conference paper at ICLR 2017 A ACER PSEUDO-CODE FOR DISCRETE ACTIONS Algorithm 1 ACER for discrete actions (master algorithm) // Assume global shared parameter vectors θ and θv. // Assume ratio of replay r. repeat Call ACER on-policy, Algorithm 2. n Possion(r) for i {1, , n} do Call ACER off-policy, Algorithm 2. end for until Max iteration or time reached. 
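Algorithm 1 is short enough to sketch directly in Python. The sketch below is illustrative only: `acer_step` stands in for Algorithm 2 (an on-policy call generates a trajectory, stores it in replay and performs one update; an off-policy call replays a stored trajectory), and the Poisson sampling corresponds to the `n Possion(r)` line above (i.e. n ∼ Poisson(r)).

```python
import numpy as np

def acer_master_loop(acer_step, replay_ratio=4, max_iterations=100000, rng=None):
    """Illustrative Python version of Algorithm 1 (one learner thread).

    acer_step(on_policy): stands in for Algorithm 2.
    replay_ratio: the Poisson rate r; a ratio of 0 recovers A3C.
    """
    rng = rng or np.random.default_rng()
    for _ in range(max_iterations):
        acer_step(on_policy=True)             # generate data and update on-policy
        n = rng.poisson(replay_ratio)         # n ~ Poisson(r)
        for _ in range(n):
            acer_step(on_policy=False)        # replay (off-policy) updates
```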
Algorithm 2 ACER for discrete actions Reset gradients dθ 0 and dθv 0. Initialize parameters θ θ and θ v θv. if not On-Policy then Sample the trajectory {x0, a0, r0, µ( |x0), , xk, ak, rk, µ( |xk)} from the replay memory. else Get state x0 end if for i {0, , k} do Compute f( |φθ (xi)), Qθ v(xi, ) and f( |φθa(xi)). if On-Policy then Perform ai according to f( |φθ (xi)) Receive reward ri and new state xi+1 µ( |xi) f( |φθ (xi)) end if ρi min n 1, f(ai|φθ (xi)) µ(ai|xi) o . end for ( 0 for terminal xk P a Qθ v(xk, a)f(a|φθ (xk)) otherwise for i {k 1, , 0} do Qret ri + γQret a Qθ v(xi, a)f(a|φθ (xi)) Computing quantities needed for trust region updating: g min {c, ρi(ai)} φθ (xi) log f(ai|φθ (xi))(Qret Vi) + f(a|φθ (xi)) φθ (xi) log f(a|φθ (xi))(Qθ v(xi, ai) Vi) k φθ (xi)DKL [f( |φθa(xi) f( |φθ (xi)] Accumulate gradients wrt θ : dθ dθ + φθ (xi) θ g max n 0, k T g δ Accumulate gradients wrt θ v: dθv dθv + θ v(Qret Qθ v(xi, a))2 Update Retrace target: Qret ρi Qret Qθ v(xi, ai) + Vi end for Perform asynchronous update of θ using dθ and of θv using dθv. Updating the average policy network: θa αθa + (1 α)θ B Q(λ) WITH OFF-POLICY CORRECTIONS Given a trajectory generated under the behavior policy µ, the Q(λ) with off-policy corrections estimator (Harutyunyan et al., 2016) can be expressed recursively as follows: Qopc(xt, at) = rt + γ[Qopc(xt+1, at+1) Q(xt+1, at+1)] + γV (xt+1). (21) Notice that Qopc(xt, at) is the same as Retrace with the exception that the truncated importance ratio is replaced with 1. Published as a conference paper at ICLR 2017 Algorithm 3 ACER for Continuous Actions Reset gradients dθ 0 and dθv 0. Initialize parameters θ θ and θ v θv. Sample the trajectory {x0, a0, r0, µ( |x0), , xk, ak, rk, µ( |xk)} from the replay memory. for i {0, , k} do Compute f( |φθ (xi)), Vθ v(xi), e Qθ v(xi, ai), and f( |φθa(xi)). Sample a i f( |φθ (xi)) ρi f(ai|φθ (xi)) µ(ai|xi) and ρ i f(a i|φθ (xi)) µ(a i|xi) ci min n 1, (ρi) 1 d o . end for ( 0 for terminal xk Vθ v(xk) otherwise for i {k 1, , 0} do Qret ri + γQret Qopc ri + γQopc Computing quantities needed for trust region updating: g min {c, ρi} φθ (xi) log f(ai|φθ (xi)) Qopc(xi, ai) Vθ v(xi) + ( e Qθ v(xi, a i) Vθ v(xi)) φθ (xi) log f(a i|φθ (xi)) k φθ (xi)DKL [f( |φθa(xi) f( |φθ (xi)] Accumulate gradients wrt θ: dθ dθ + φθ (xi) θ g max n 0, k T g δ Accumulate gradients wrt θ v: dθv dθv + (Qret e Qθ v(xi, ai)) θ v e Qθ v(xi, ai) dθv dθv + min {1, ρi} Qret(xt, ai) e Qθ v(xt, ai) θ v Vθ v(xi) Update Retrace target: Qret ci Qret e Qθ v(xi, ai) + Vθ v(xi) Update Retrace target: Qopc Qopc e Qθ v(xi, ai) + Vθ v(xi) end for Perform asynchronous update of θ using dθ and of θv using dθv. Updating the average policy network: θa αθa + (1 α)θ Because of the lack of the truncated importance ratio, the operator defined by Qopc is only a contraction if the target and behavior policies are close to each other (Harutyunyan et al., 2016). Q(λ) with off-policy corrections is therefore less stable compared to Retrace and unsafe for policy evaluation. Qopc, however, could better utilize the returns as the traces are not cut by the truncated importance weights. As a result, Qopc could be used efficiently to estimate Qπ in policy gradient (e.g. in Equation (16)). In our continuous control experiments, we have found that Qopc leads to faster learning. C RETRACE AS TRUNCATED IMPORTANCE SAMPLING WITH BIAS CORRECTION For the purpose of proving proposition 1, we assume our environment to be a Markov Decision Process (X, A, γ, P, r). We restrict X to be a finite state space. 
For notational simplicity, we also restrict A to be a finite action space. P : X, A X defines the state transition probabilities and r : X, A [ RMAX, RMAX] defines a reward function. Finally, γ [0, 1) is the discount factor. Published as a conference paper at ICLR 2017 Proof of proposition 1. First we show that B is a contraction operator. |BQ(x, a) Qπ(x, a)| + (Q(xt+1, b) Qπ(xt+1, b)) + |Q(xt+1, b) Qπ(xt+1, b)| ! γ(1 Pt+1) sup b |Q(xt+1, b) Qπ(xt+1, b)| Where Pt+1 = 1 E b π h ρt+1(b) c = E b µ[ ρt+1(b)]. The last inequality in the above equation is due to H older s inequality. (22) sup x,b |Q(x, b) Qπ(x, b)| Eµ ! γ(1 Pt+1) = sup x,b |Q(x, b) Qπ(x, b)| Eµ = sup x,b |Q(x, b) Qπ(x, b)| Eµ t 0 γt+1 t+1 Y = sup x,b |Q(x, b) Qπ(x, b)| (γC (C 1)) where C = P t 0 γt Qt i=1 ρi . Since C P0 t=0 γt Qt i=1 ρi = 1, we have that γC (C 1) γ. Therefore, we have shown that B is a contraction operator. Now we show that B is the same as Retrace. By apply the trunction and bias correction trick, we have E b π[Q(xt+1, b)] = E b µ[ ρt+1(b)Q(xt+1, b)] + E b π + Q(xt+1, b) By adding and subtracting the two sides of Equation (23) inside the summand of Equation (20), we have BQ(x, a) = Eµ rt + γ E b π + Q(xt+1, b) +γ E b π[Q(xt+1, b)] γ E b µ[ ρt+1(b)Q(xt+1, b)] γ E b π + Q(xt+1, b) # ! rt + γ E b π[Q(xt+1, b)] γ E b µ[ ρt+1(b)Q(xt+1, b)] ! rt + γ E b π[Q(xt+1, b)] γ ρt+1Q(xt+1, at+1) ! rt + γ E b π[Q(xt+1, b)] Q(xt, at) + Q(x, a) = RQ(x, a) Published as a conference paper at ICLR 2017 In the remainder of this appendix, we show that B generalizes both the Bellman operator and importance sampling. First, we reproduce the definition of B: BQ(x, a) = Eµ rt + γ E b π + Q(xt+1, b) When c = 0, we have that ρi = 0 i. Therefore only the first summand of the sum remains: BQ(x, a) = Eµ rt + γ E b π (Q(xt+1, b)) . In this case B = T . When c = , the compensation term disappears and ρi = ρi i: BQ(x, a) = Eµ ! rt + γ E b π (0 Q(xt+1, b)) In this case B is the same operator defined by importance sampling. D DERIVATION OF V target By using the truncation and bias correction trick, we can derive the following: V π(xt) = E a µ min 1, π(a|xt) Qπ(xt, a) + E a π + Qπ(xt+1, a) We, however, cannot use the above equation as a target as we do not have access to Qπ. To derive a target, we can take a Monte Carlo approximation of the first expectation in the RHS of the above equation and replace the first occurrence of Qπ with Qret and the second with our current neural net approximation Qθv(xt, ): V target pre (xt) := min 1, π(at|xt) Qret(xt, at) + E a π + Qθv(xt, a) Through the truncation and bias correction trick again, we have the following identity: E a π[Qθv(xt, a)] = E a µ min 1, π(a|xt) Qθv(xt, a) + E a π + Qθv(xt, a) Adding and subtracting both sides of Equation (25) to the RHS of (24) while taking a Monte Carlo approximation, we arrive at V target(xt): V target(xt) := min 1, π(at|xt) Qret(xt, at) Qθv(xt, at) + Vθv(xt). E CONTINUOUS CONTROL EXPERIMENTS E.1 DESCRIPTION OF THE CONTINUOUS CONTROL PROBLEMS Our continuous control tasks were simulated using the Mu Jo Co physics engine (Todorov et al. (2012)). For all experiments we considered an episodic setup with an episode length of T = 500 steps and a discount factor of 0.99. Cartpole swingup This is an instance of the classic cart-pole swing-up task. It consists of a pole attached to a cart running on a finite track. The agent is required to balance the pole near the center of the track by applying a force to the cart only. 
An episode starts with the pole at a random angle and zero velocity. A reward zero is given except when the pole is approximately upright (within 5 deg) and the track approximately in the center of the track ( 0.05) for a track length of 2.4. The observations include position and velocity of the cart, angle and angular velocity of the pole. a sine/cosine of the angle, the position of the tip of the pole, and Cartesian velocities of the pole. The dimension of the action space is 1. Published as a conference paper at ICLR 2017 Reacher3 The agent needs to control a planar 3-link robotic arm in order to minimize the distance between the end effector of the arm and a target. Both arm and target position are chosen randomly at the beginning of each episode. The reward is zero except when the tip of the arm is within 0.05 of the target, where it is one. The 8-dimensional observation consists of the angles and angular velocity of all joints as well as the displacement between target and the end effector of the arm. The 3-dimensional action are the torques applied to the joints. Cheetah The Half-Cheetah (Wawrzy nski (2009); Heess et al. (2015)) is a planar locomotion task where the agent is required to control a 9-Do F cheetah-like body (in the vertical plane) to move in the direction of the x-axis as quickly as possible. The reward is given by the velocity along the x-axis and a control cost: r = vx + 0.1 a 2. The observation vector consists of the z-position of the torso and its x, z velocities as well as the joint angles and angular velocities. The action dimension is 6. Fish The goal of this task is to control a 13-Do F fish-like body to swim to a random target in 3D space. The reward is given by the distance between the head of the fish and the target, a small penalty for the body not being upright, and a control cost. At the beginning of an episode the fish is initialized facing in a random direction relative to the target. The 24-dimensional observation is given by the displacement between the fish and the target projected onto the torso coordinate frame, the joint angles and velocities, the cosine of the angle between the z-axis of the torso and the world z-axis, and the velocities of the torso in the torso coordinate frame. The 5-dimensional actions control the position of the side fins and the tail. Walker The 9-Do F planar walker is inspired by (Schulman et al. (2015a)) and is required to move forward along the x-axis as quickly as possible without falling. The reward consists of the x-velocity of the torso, a quadratic control cost, and terms that penalize deviations of the torso from the preferred height and orientation (i.e. terms that encourage the walker to stay standing and upright). The 24-dimensional observation includes the torso height, velocities of all Do Fs, as well as sines and cosines of all body orientations in the x-z plane. The 6-dimensional action controls the torques applied at the joints. Episodes are terminated early with a negative reward when the torso exceeds upper and lower limits on its height and orientation. Humanoid The humanoid is a 27 degrees-of-freedom body with 21 actuators (21 action dimensions). It is initialized lying on the ground in a random configuration and the task requires it to achieve a standing position. The reward function penalizes deviations from the height of the head when standing, and includes additional terms that encourage upright standing, as well as a quadratic action penalty. 
The 94-dimensional observation contains information about joint angles and velocities and several derived features reflecting the body's pose.

E.2 UPDATE EQUATIONS OF THE BASELINE TIS

The baseline TIS uses the following update equations. Updates to the policy:
$$\min\left\{ c, \prod_{i=0}^{k} \rho_{t+i} \right\} \left( \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V_{\theta_v}(x_{t+k}) - V_{\theta_v}(x_t) \right) \nabla_\theta \log \pi_\theta(a_t|x_t),$$
and updates to the value:
$$\min\left\{ c, \prod_{i=0}^{k} \rho_{t+i} \right\} \left( \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V_{\theta_v}(x_{t+k}) - V_{\theta_v}(x_t) \right) \nabla_{\theta_v} V_{\theta_v}(x_t).$$
The baseline Trust-TIS is appropriately modified according to the trust region update described in Section 3.3.

E.3 SENSITIVITY ANALYSIS

In this section, we assess the sensitivity of ACER to hyper-parameters. In Figures 5 and 6, we show, for each game, the final performance of our ACER agent versus the choice of learning rate and the trust region constraint δ respectively. Note that, as we are doing random hyper-parameter search, each learning rate is associated with a random δ and vice versa. It is therefore difficult to tease out the effect of either hyper-parameter independently.

We observe, however, that ACER is not very sensitive to the hyper-parameters overall. In addition, smaller δ's do not seem to adversely affect the final performance, while larger δ's do in domains of higher action dimensionality. Similarly, smaller learning rates perform well, while bigger learning rates tend to hurt final performance in domains of higher action dimensionality.

[Figure 5: Log learning rate vs. cumulative rewards in all the continuous control tasks for ACER. The plots show the final performance after training for all 30 log learning rates considered. Note that each learning rate is associated with a different δ as a consequence of random search over hyper-parameters.]

[Figure 6: Trust region constraint (δ) vs. cumulative rewards in all the continuous control tasks for ACER. The plots show the final performance after training for all 30 trust region constraints (δ) searched over. Note that each δ is associated with a different learning rate as a consequence of random search over hyper-parameters.]

E.4 EXPERIMENTAL SETUP OF ABLATION ANALYSIS

For the ablation analysis, we use the same experimental setup as in the continuous control experiments while removing one component at a time.

To evaluate the effectiveness of Retrace/Q(λ) with off-policy correction, we replace both with importance sampling based estimates (following Degris et al. (2012)), which can be expressed recursively: $R_t = r_t + \gamma \rho_{t+1} R_{t+1}$.

To evaluate the Stochastic Dueling Networks, we replace them with two separate networks: one computing the state values and the other the Q values.
Given $Q^{\text{ret}}(x_t, a_t)$, the naive way of estimating the state values is to use the following update rule:
$$\rho_t \left( Q^{\text{ret}}(x_t, a_t) - V_{\theta_v}(x_t) \right) \nabla_{\theta_v} V_{\theta_v}(x_t).$$
The above update rule, however, suffers from high variance. We consider instead the following update rule:
$$\min\{1, \rho_t\} \left( Q^{\text{ret}}(x_t, a_t) - V_{\theta_v}(x_t) \right) \nabla_{\theta_v} V_{\theta_v}(x_t),$$
which has markedly lower variance. We update our Q estimates as before.

To evaluate the effects of the truncation and bias correction trick, we change our $c$ parameter (see equation (16)) to $\infty$ so as to use pure importance sampling.
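To make the ablation variants above concrete, the sketch below computes the importance-sampling return used when Retrace/Q(λ) is removed, and the weight applied in the two value-update rules. It is illustrative only: the names are ours, and how the bootstrap is handled at the truncation point is an assumption of this sketch rather than something specified in the appendix.

```python
import numpy as np

def ablation_is_returns(rewards, rho, bootstrap_value, gamma=0.99):
    """Importance-sampling return R_t = r_t + gamma * rho_{t+1} * R_{t+1}
    used in the ablation that removes Retrace / Q(lambda) with off-policy
    corrections."""
    T = len(rewards)
    returns = np.empty(T)
    R = bootstrap_value                                 # value estimate at truncation
    for t in reversed(range(T)):
        rho_next = rho[t + 1] if t + 1 < T else 1.0     # no correction on the bootstrap
        returns[t] = rewards[t] + gamma * rho_next * R
        R = returns[t]
    return returns

def value_update_weight(rho_t, truncate=True):
    """Weight on (Q^ret - V) * grad V: rho_t for the naive rule,
    min{1, rho_t} for the lower-variance rule adopted above."""
    return min(1.0, rho_t) if truncate else rho_t
```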