# Multi-Task Deep Reinforcement Learning with PopArt

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, Hado van Hasselt (DeepMind)

The reinforcement learning (RL) community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at a time, and each new task requires training a brand new agent instance. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks at once. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to solve. Such tasks appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent's updates, so that all tasks have a similar impact on the learning dynamics. This resulted in state-of-the-art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly, our method learned a single trained policy - with a single set of weights - that exceeds median human performance. To our knowledge, this was the first time a single agent surpassed human-level performance on this multi-task domain. The same approach also demonstrated state-of-the-art performance on a set of 30 tasks in the 3D reinforcement learning platform DeepMind Lab.

## Introduction

In recent years, the field of deep reinforcement learning (RL) has enjoyed many successes. Deep RL agents have been applied to board games such as Go (Silver et al. 2016) and chess (Silver et al. 2017), continuous control tasks (Lillicrap et al. 2016; Duan et al. 2016), classic video games such as Atari (Mnih et al. 2015; Hessel et al. 2018; Gruslys et al. 2018; Schulman et al. 2015; 2017; Bacon, Harb, and Precup 2017), and 3D first-person environments (Mnih et al. 2016; Jaderberg et al. 2016). While the results are impressive, they were achieved one task at a time, with each new task requiring a brand new agent instance to be trained from scratch.

Multi-task and transfer learning remain important open problems in deep RL. At least four different strains of multi-task reinforcement learning have been explored in the literature: off-policy learning of many predictions about the same stream of experience (Schmidhuber 1990; Sutton et al. 2011; Jaderberg et al. 2016), continual learning in a sequence of tasks (Ring 1994; Thrun 1996; 2012; Rusu et al. 2016), distillation of task-specific experts into a single shared model (Parisotto, Ba, and Salakhutdinov 2015; Rusu et al. 2015; Schmitt et al. 2018; Teh et al. 2017), and parallel learning of multiple tasks at once (Sharma and Ravindran 2017; Caruana 1998). We focus on the latter.
Parallel multi-task learning has recently achieved remarkable success in enabling a single system to learn a large number of diverse tasks. The Importance Weighted Actor-Learner Architecture, henceforth IMPALA (Espeholt et al. 2018), achieved a 59.7% median human normalised score across 57 Atari games, and a 49.4% mean human normalised score across 30 DeepMind Lab levels. These results are the state of the art for multi-task RL, but they are far from the human-level performance demonstrated by deep RL agents on the same domains when trained on each task individually.

Part of why multi-task learning is much harder than single-task learning is that a balance must be found between the needs of multiple tasks that compete for the limited resources of a single learning system (for instance, its limited representation capacity). We observed that the naive transposition of common RL algorithms to the multi-task setting may not perform well in this respect. More specifically, the saliency of a task for the agent increases with the scale of the returns observed in that task, and these may differ arbitrarily across tasks. This affects value-based algorithms such as Q-learning (Watkins 1989), as well as policy-based algorithms such as REINFORCE (Williams 1992).

The problem of scaling individual rewards appropriately is not novel, and has often been addressed through reward clipping (Mnih et al. 2015). This heuristic changes the agent's objective: e.g., if all rewards are non-negative, the algorithm optimises the frequency of rewards rather than their cumulative sum. If the two objectives are sufficiently well aligned, clipping can be effective. However, the scale of returns also depends on the sparsity of rewards. This implies that, even with reward clipping, in a multi-task setting the magnitude of updates can still differ significantly between tasks, causing some tasks to have a larger impact on the learning dynamics than other, equally important, ones. Note that both the sparsity and the magnitude of rewards collected in an environment are inherently non-stationary, because the agent is learning to actively maximise the total amount of rewards it can collect. These non-stationary learning dynamics make it impossible to normalise the learning updates a priori, even if we were willing to pour significant domain knowledge into the design of the algorithm.

To summarise, in IMPALA the magnitude of updates resulting from experience gathered in each environment depends on: 1) the scale of rewards, 2) the sparsity of rewards, and 3) the competence of the agent. In this paper we use PopArt normalisation (van Hasselt et al. 2016) to derive an actor-critic update invariant to these factors, enabling large performance improvements in parallel multi-task agents. We demonstrate this on the Atari-57 benchmark, where a single agent achieved a median normalised score of 110%, and on DmLab-30, where it achieved a mean score of 72.8%.

## Background

Reinforcement learning (RL) is a framework for learning and decision-making under uncertainty (Sutton and Barto 2018). A learning system - the agent - must learn to interact with the environment it is embedded in, so as to maximise a scalar reward signal. The RL problem is often formalised as a Markov decision process (Bellman 1957): a tuple $(\mathcal{S}, \mathcal{A}, p, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are finite sets of states and actions, $p$ denotes the dynamics, such that $p(r, s' \mid s, a)$ is the probability of observing reward $r$ and next state $s'$ when executing action $a$ in state $s$, and $\gamma \in [0, 1]$ discounts future rewards.
The policy maps states $s \in \mathcal{S}$ to probability distributions over actions $\pi(A \mid S = s)$, thus specifying the behaviour of the agent. The return $G_t = R_{t+1} + \gamma R_{t+2} + \dots$ is the $\gamma$-discounted sum of rewards collected by an agent from state $S_t$ onward under policy $\pi$. We define action values and state values as $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ and $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, respectively. The agent's objective is to find a policy that maximises such values.

In multi-task reinforcement learning, a single agent must learn to master $N$ different environments $\mathcal{T} = \{D_i = (\mathcal{S}_i, \mathcal{A}_i, p_i, \gamma)\}_{i=1}^{N}$, each with its own distinct dynamics (Brunskill and Li 2013). Particularly interesting is the case in which the action space and transition dynamics are at least partially shared. For instance, the environments might follow the same physical rules, while the sets of interconnected states and obtainable rewards differ. We may formalise this as a single larger MDP, whose state space is $\mathcal{S} = \{\{(s_j, i)\}_{s_j \in \mathcal{S}_i}\}_{i=1}^{N}$. The task index $i$ may be latent, or may be exposed to the agent's policy. In this paper, we use the task index at training time, for the value estimates used to compute the policy updates, but not at testing time: our algorithm returns a single general policy $\pi(A \mid S)$ which is only a function of the individual environment's state $S$ and is not conditioned directly on the task index $i$. This is more challenging than the standard multi-task learning setup, which typically allows conditioning the model on the task index even at evaluation (Romera-Paredes et al. 2013; Collobert and Weston 2008), because our agents will need to infer what task to solve purely from the stream of raw observations and/or early rewards in the episode.

### Actor-critic

In our experiments, we use an actor-critic algorithm to learn a policy $\pi_\eta(A \mid S)$ and a value estimate $v_\theta(s)$, which are both outputs of a deep neural network. We update the agent's policy using the REINFORCE-style stochastic gradient $(G_t - v_\theta(S_t)) \nabla_\eta \log \pi_\eta(A_t \mid S_t)$ (Williams 1992), where $v_\theta(S_t)$ is used as a baseline to reduce variance. In addition, we use a multi-step return $G^v_t$ that bootstraps on the value estimates after a limited number of transitions, both to reduce variance further and to allow us to update the policy before $G_t$ fully resolves at the end of an episode. The value function $v_\theta(S)$ is instead updated to minimise the squared loss with respect to the (truncated and bootstrapped) return, giving the updates

$$\Delta\theta \propto -\nabla_\theta \big(G^v_t - v_\theta(S_t)\big)^2 \propto \big(G^v_t - v_\theta(S_t)\big)\,\nabla_\theta v_\theta(S_t), \tag{1}$$

$$\Delta\eta \propto \big(G^\pi_t - v_\theta(S_t)\big)\,\nabla_\eta \log \pi_\eta(A_t \mid S_t), \tag{2}$$

where $G^v_t$ and $G^\pi_t$ are stochastic estimates of $v_\pi(S_t)$ and $q_\pi(S_t, A_t)$, respectively. Note how both updates depend linearly on the scale of the returns, which, as previously argued, depends on the scale and sparsity of rewards, as well as on the agent's competence.
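As a concrete illustration of Equations 1 and 2, the following is a minimal sketch of the corresponding per-step losses in plain NumPy; the array names (`returns_v`, `returns_pi`, `values`, `logp_actions`) are illustrative assumptions, not names taken from the paper's implementation.

```python
import numpy as np

def actor_critic_losses(returns_v, returns_pi, values, logp_actions):
    """Per-step actor-critic losses whose gradients match Equations 1 and 2.

    returns_v    : bootstrapped value targets G^v_t,   shape [T]
    returns_pi   : bootstrapped policy targets G^pi_t, shape [T]
    values       : value predictions v_theta(S_t),     shape [T]
    logp_actions : log pi_eta(A_t | S_t),              shape [T]
    """
    # Critic: squared error between target and prediction (Equation 1).
    value_loss = np.mean((returns_v - values) ** 2)
    # Actor: REINFORCE with the value baseline (Equation 2); in an autodiff
    # framework the advantage would be wrapped in a stop-gradient.
    advantages = returns_pi - values
    policy_loss = -np.mean(advantages * logp_actions)
    return value_loss, policy_loss
```

Both losses scale linearly with the magnitude of the returns, which is precisely the sensitivity that the normalisation introduced below removes.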
### Efficient multi-task learning in simulation

We use the IMPALA agent architecture (Espeholt et al. 2018), proposed for reinforcement learning in simulated environments. In IMPALA the agent is distributed across multiple threads, processes, or machines. Several actors run on CPU, each generating rollouts of experience consisting of a fixed number of interactions (100 in our experiments) with its own copy of the environment, and then enqueueing the rollouts in a shared queue. Actors receive the latest copy of the network's parameters from the learner before each rollout. A single GPU learner processes rollouts from all actors, in batches, and updates a deep network. The network is a deep convolutional ResNet (He et al. 2015), followed by an LSTM recurrent layer (Hochreiter and Schmidhuber 1997). Policy and values are all linear functions of the LSTM's output. Despite the large network used for estimating the policy $\pi_\eta$ and the values $v_\theta$, the decoupled nature of the agent allows data to be processed very efficiently, in the order of hundreds of thousands of frames per second (Espeholt et al. 2018).

The setup easily supports the multi-task setting: we simply assign different environments to each of the actors and then run the single policy $\pi(A \mid S)$ on each of them. The data in the queue can also easily be labelled with the task id, if useful at training time. Note that an efficient implementation of IMPALA is available open source (www.github.com/deepmind/scalable_agent), and that, while we use this agent for our experiments, our approach can be applied to other data-parallel multi-task agents (e.g. A3C).

### Off-policy corrections

Because we use a distributed queue-based learning setup, the data consumed by the learning algorithm might be slightly off-policy, as the policy parameters change between acting and learning. We can use importance sampling corrections $\rho_t = \pi(A_t \mid S_t) / \mu(A_t \mid S_t)$ to compensate for this (Precup, Sutton, and Singh 2000). In particular, we can write the n-step return as

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n v(S_{t+n}) = v(S_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t}\, \delta_k, \qquad \text{where } \delta_t = R_{t+1} + \gamma v(S_{t+1}) - v(S_t),$$

and then apply appropriate importance sampling corrections to each error term (Sutton et al. 2014) to get

$$G_t = v(S_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t} \Big(\prod_{i=t}^{k} \rho_i\Big)\, \delta_k.$$

This is unbiased, but has high variance. To reduce variance, we can further clip most of the importance-sampling ratios, e.g., as $c_t = \min(1, \rho_t)$. This leads to the v-trace return (Espeholt et al. 2018)

$$G^{\text{v-trace}}_t = v(S_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t} \Big(\prod_{i=t}^{k-1} c_i\Big)\, \rho_k\, \delta_k. \tag{3}$$

A very similar target was proposed for the ABQ($\zeta$) algorithm (Mahmood 2017), where the product $\rho_t \lambda_t$ was considered and the trace parameter $\lambda_t$ was then chosen adaptively so as to lead to exactly the behaviour $c_t = \rho_t \lambda_t = \min(1, \rho_t)$. This shows that this form of clipping does not impair the validity of the off-policy corrections, in the same sense that bootstrapping in general does not change the semantics of a return. The returns used by the value and policy updates defined in Equations 1 and 2 are then

$$G^v_t = G^{\text{v-trace}}_t \quad \text{and} \quad G^\pi_t = R_{t+1} + \gamma\, G^{\text{v-trace}}_{t+1}. \tag{4}$$

This is the same algorithm as used by Espeholt et al. (2018) in the experiments on the IMPALA architecture.

## Adaptive normalisation

In this section we use PopArt normalisation (van Hasselt et al. 2016), which was introduced for value-based RL, to derive a scale-invariant algorithm for actor-critic agents. For simplicity, we first consider the single-task setting, and then extend it to the multi-task setting (the focus of this work).

### Scale invariant updates

In order to normalise both the baseline and the policy gradient updates, we first parameterise the value estimate $v_{\mu,\sigma,\theta}(s)$ as the linear transformation of a suitably normalised value prediction $n_\theta(s)$. We further assume that the normalised value prediction is itself the output of a linear function, for instance the last fully connected layer of a deep neural network:

$$v_{\mu,\sigma,\theta}(s) = \sigma\, n_\theta(s) + \mu = \sigma\, \big(\mathbf{w}^\top f_{\theta \setminus \{\mathbf{w}, b\}}(s) + b\big) + \mu, \qquad \text{with } n_\theta(s) = \mathbf{w}^\top f_{\theta \setminus \{\mathbf{w}, b\}}(s) + b. \tag{5}$$

As proposed by van Hasselt et al., $\mu$ and $\sigma$ can be updated so as to track the mean and standard deviation of the values. First and second moments can be estimated online as

$$\mu_t = (1-\beta)\,\mu_{t-1} + \beta\, G^v_t, \qquad \nu_t = (1-\beta)\,\nu_{t-1} + \beta\, (G^v_t)^2, \tag{6}$$

and then used to derive the estimated standard deviation as $\sigma_t = \sqrt{\nu_t - \mu_t^2}$. Note that the fixed decay rate $\beta$ determines the horizon used to compute the statistics.
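A minimal sketch of the statistics tracking in Equation 6, assuming scalar value targets; the class and attribute names are illustrative, while the default decay $\beta = 3 \times 10^{-4}$ and the clipping range for $\sigma$ match the values reported in the implementation notes later in the paper.

```python
import numpy as np

class PopArtStats:
    """Online first and second moments of the value targets (Equation 6)."""

    def __init__(self, beta=3e-4, sigma_min=1e-4, sigma_max=1e6):
        self.beta = beta
        self.mu = 0.0   # running estimate of E[G^v]
        self.nu = 1.0   # running estimate of E[(G^v)^2]
        self.sigma_min, self.sigma_max = sigma_min, sigma_max

    @property
    def sigma(self):
        # sigma = sqrt(nu - mu^2), clipped to avoid numerical issues.
        return float(np.clip(np.sqrt(max(self.nu - self.mu ** 2, 0.0)),
                             self.sigma_min, self.sigma_max))

    def update(self, target):
        # Exponential moving averages with fixed decay beta (Equation 6).
        self.mu = (1.0 - self.beta) * self.mu + self.beta * target
        self.nu = (1.0 - self.beta) * self.nu + self.beta * target ** 2

    def normalise(self, target):
        # Normalised target used in the scale-invariant updates below.
        return (target - self.mu) / self.sigma
```

On its own, normalising the targets like this would make them non-stationary; the output-preserving update introduced next compensates for that.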
We can then use the normalised value estimate $n_\theta(S)$ and the statistics $\mu$ and $\sigma$ to normalise the actor-critic loss, both in its value and policy components; this results in the scale-invariant updates

$$\Delta\theta \propto \Big(\frac{G^v_t - \mu}{\sigma} - n_\theta(S_t)\Big)\,\nabla_\theta n_\theta(S_t), \tag{7}$$

$$\Delta\eta \propto \Big(\frac{G^\pi_t - \mu}{\sigma} - n_\theta(S_t)\Big)\,\nabla_\eta \log \pi_\eta(A_t \mid S_t). \tag{8}$$

If we optimise the new objective naively, we are at risk of making the problem harder: the normalised targets for the values are non-stationary, since they depend on the statistics $\mu$ and $\sigma$. The PopArt normalisation algorithm prevents this by updating the last layer of the normalised value network so as to preserve the unnormalised value estimates $v_{\mu,\sigma,\theta}$ under any change in the statistics $\mu \to \mu'$ and $\sigma \to \sigma'$:

$$\mathbf{w}' = \frac{\sigma}{\sigma'}\,\mathbf{w}, \qquad b' = \frac{\sigma b + \mu - \mu'}{\sigma'}. \tag{9}$$

This extends PopArt's scale-invariant updates to the actor-critic setting, and can help make the tuning of hyperparameters easier, but it is not sufficient to tackle the challenging multi-task RL setting that we are interested in in this paper. For this, a single pair of normalisation statistics is not enough.

### Scale invariant updates for multi-task learning

Let $D_i$ be an environment in some finite set $\mathcal{T} = \{D_i\}_{i=1}^{N}$, and let $\pi(A \mid S)$ be a task-agnostic policy, which takes a state $S$ from any of the environments $D_i$ and maps it to a probability distribution over the shared action space $\mathcal{A}$. Consider now a multi-task value function $v(S)$ with $N$ outputs, one for each task. We can use for $v$ the same parametrisation as in Equation 5, but with vectors of statistics $\mu, \sigma \in \mathbb{R}^N$ and a vector-valued function $n_\theta(s) = (n^1_\theta(s), \dots, n^N_\theta(s))$:

$$v_{\mu,\sigma,\theta}(S) = \sigma \odot n_\theta(S) + \mu = \sigma \odot \big(W f_{\theta \setminus \{W, b\}}(S) + b\big) + \mu, \tag{10}$$

where the products are element-wise and $W$ and $b$ denote the parameters of the last fully connected layer in $n_\theta(s)$. Given a rollout $\{S_{i,k}, A_k, R_{i,k}\}_{k=t}^{t+n}$, generated under the task-agnostic policy $\pi_\eta(A \mid S)$ in environment $D_i$, we can adapt the updates in Equations 7 and 8 to provide scale-invariant updates also in the multi-task setting:

$$\Delta\theta \propto \Big(\frac{G^{v,i}_t - \mu_i}{\sigma_i} - n^i_\theta(S_t)\Big)\,\nabla_\theta n^i_\theta(S_t), \tag{11}$$

$$\Delta\eta \propto \Big(\frac{G^{\pi,i}_t - \mu_i}{\sigma_i} - n^i_\theta(S_t)\Big)\,\nabla_\eta \log \pi_\eta(A_t \mid S_t), \tag{12}$$

where the targets $G^{\cdot,i}_t$ use the value estimates for environment $D_i$ for bootstrapping. For each rollout, only the $i$-th head of the value net is updated, while the same policy network is updated irrespective of the task, using the appropriate rescaling for the updates to the parameters $\eta$. As in the single-task case, when updating the statistics $\mu$ and $\sigma$ we also need to update $W$ and $b$ to preserve the unnormalised outputs,

$$\mathbf{w}'_i = \frac{\sigma_i}{\sigma'_i}\,\mathbf{w}_i, \qquad b'_i = \frac{\sigma_i b_i + \mu_i - \mu'_i}{\sigma'_i}, \tag{13}$$

where $\mathbf{w}_i$ is the $i$-th row of matrix $W$, and $\mu_i$, $\sigma_i$, $b_i$ are the $i$-th elements of the corresponding parameter vectors. Note that in all updates only the values, but not the policy, are conditioned on the task index; this ensures that the resulting agent can be run in a fully task-agnostic way, since the values are only used to reduce the variance of the policy updates at training time and are not needed for action selection.
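A minimal sketch of this multi-task machinery, assuming a NumPy last layer with one value head per task; all names are illustrative, and this is a sketch rather than the paper's open-source implementation. The class bundles the per-task statistics of Equation 6, the scale-invariant error term of Equation 11, and the output-preserving update of Equation 13.

```python
import numpy as np

class MultiTaskPopArtHead:
    """Last linear layer of the multi-task value net with PopArt statistics."""

    def __init__(self, num_tasks, feature_dim, beta=3e-4,
                 sigma_min=1e-4, sigma_max=1e6):
        self.W = np.zeros((num_tasks, feature_dim))   # one row w_i per task
        self.b = np.zeros(num_tasks)
        self.mu = np.zeros(num_tasks)                 # per-task first moments
        self.nu = np.ones(num_tasks)                  # per-task second moments
        self.beta = beta
        self.sigma_min, self.sigma_max = sigma_min, sigma_max

    def sigma(self, i):
        return float(np.clip(np.sqrt(max(self.nu[i] - self.mu[i] ** 2, 0.0)),
                             self.sigma_min, self.sigma_max))

    def normalised_value(self, features, i):
        # n^i_theta(s): the normalised value head for the rollout's task (Eq. 10).
        return float(self.W[i] @ features + self.b[i])

    def value_error(self, target, features, i):
        # Scale-invariant error of Eq. 11: (G^{v,i}_t - mu_i)/sigma_i - n^i(s).
        return (target - self.mu[i]) / self.sigma(i) - self.normalised_value(features, i)

    def update_statistics(self, target, i):
        # Moment updates (Eq. 6), followed by the output-preserving update
        # (Eq. 13) so that the unnormalised value sigma_i * n^i(s) + mu_i
        # is left unchanged.
        old_mu, old_sigma = self.mu[i], self.sigma(i)
        self.mu[i] = (1 - self.beta) * self.mu[i] + self.beta * target
        self.nu[i] = (1 - self.beta) * self.nu[i] + self.beta * target ** 2
        new_sigma = self.sigma(i)
        self.W[i] *= old_sigma / new_sigma
        self.b[i] = (old_sigma * self.b[i] + old_mu - self.mu[i]) / new_sigma
```

In a training step one would first compute the actor-critic updates from `value_error` and then call `update_statistics` with the same targets, matching the ordering (updates 11, then 6, then 13) described in the implementation notes.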
Table 1: Summary of results: aggregate scores for IMPALA and PopArt-IMPALA. We report the median human normalised score for Atari-57, and the mean capped human normalised score for DmLab-30. In Atari, Random and Human refer to whether the trained agent is evaluated with random starts or human starts. In DmLab-30, the test score includes evaluation on the held-out levels.

| Agent | Atari-57 (Random) | Atari-57 (Human) | Atari-57 unclipped (Random) | Atari-57 unclipped (Human) | DmLab-30 (Train) | DmLab-30 (Test) |
|---|---|---|---|---|---|---|
| IMPALA | 59.7% | 28.5% | 0.3% | 1.0% | 60.6% | 58.4% |
| PopArt-IMPALA | 110.7% | 101.5% | 107.0% | 93.7% | 73.5% | 72.8% |

## Experiments

We evaluated our approach on two challenging multi-task benchmarks, Atari-57 and DmLab-30, based on Atari and DeepMind Lab respectively, both introduced by Espeholt et al. (2018). We also consider a new benchmark, consisting of the same 57 Atari games as Atari-57, but with the original unclipped reward scheme. We demonstrate state-of-the-art performance on all three benchmarks. In all cases, to meaningfully aggregate scores across many diverse tasks, we normalise scores based on the scores achieved by a human player and by a random agent on that same task, as is common in the literature (van Hasselt, Guez, and Silver 2016). All experiments use population-based training (PBT) to adapt hyperparameters as training progresses (Jaderberg et al. 2017).

Atari-57 is a collection of 57 classic Atari 2600 games. The Arcade Learning Environment (ALE; Bellemare et al. 2013) exposes them as RL environments. Most prior work has focused on training agents for individual games (Mnih et al. 2015; Hessel et al. 2018; Gruslys et al. 2018; Schulman et al. 2015; 2017; Bacon, Harb, and Precup 2017). Multi-task learning on this platform has not been as successful, due to the large number of environments, their inconsistent dynamics, and their very different reward structures. Prior work on multi-task RL in the ALE has therefore focused on smaller subsets of games (Rusu et al. 2015; Sharma and Ravindran 2017). Atari has a particularly diverse reward structure. Consequently, it is a perfect domain to fully assess how well our agents can deal with extreme differences in the scale of returns. We therefore train all agents both with and without reward clipping, to compare how performance degrades as returns get more diverse in the unclipped version of the environment. In both cases, at the end of training, we test agents both with random starts (Mnih et al. 2015) and human starts (Nair et al. 2015); aggregate results are reported in Table 1 accordingly.

DmLab-30 is a benchmark consisting of 30 different visually rich, partially observable RL environments (Beattie et al. 2016). This benchmark has strong internal consistency (all levels are played with a first-person camera in a 3D environment with consistent dynamics). However, the tasks themselves are quite diverse, and were designed to test distinct skills in RL agents: among them navigation, memory, planning, laser-tagging, and language grounding. The levels can also differ visually in non-trivial ways, as they include both natural environments and maze-like levels. Two levels (rooms collect good objects and rooms exploit deferred effects) have held-out test versions; therefore Table 1 reports both train and test aggregate scores. We observed that the original IMPALA agent suffers from an artificial bottleneck in performance, because some of the tasks cannot be solved with the action set available to the agent. As a first step, we therefore fix this issue by equipping it with a larger action set, resulting in a stronger IMPALA baseline than reported in the original paper. We also run multiple independent PBT experiments, to assess the variability of results across multiple replications.

### Atari-57 results

Figures 1 and 2 show the median human normalised performance across the entire set of 57 Atari games in the ALE, when training agents with and without reward clipping, respectively. The curves are plotted as a function of the total number of frames seen by each agent. PopArt-IMPALA (orange line) achieved a median performance of 110% with reward clipping and a median performance of 101% in the unclipped version of Atari-57.
Recall that here we are measuring the median performance of a single trained agent across all games, rather than the median over the performance of a set of individually trained agents, as has been more common in the Atari domain. To our knowledge, both agents are the first to surpass median human performance across the entire set of 57 Atari games. The IMPALA agent (blue line) performed much worse: the baseline barely reached 60% with reward clipping, and its median performance is close to 0% in the unclipped setup. The large decrease in the performance of the baseline IMPALA agent once clipping was removed is in stark contrast with what we observed for PopArt-IMPALA, which achieved almost the same performance in the two training regimes.

Since the level-specific value predictions used by multi-task PopArt effectively increase the capacity of the network, we also ran an additional experiment to disentangle the contribution of the increased network capacity from the contribution of the adaptive normalisation. For this purpose, we trained a second baseline that used level-specific value predictions but did not use PopArt to adaptively normalise the learning updates. Experiments showed that such a MultiHead-IMPALA agent (pink line) actually performed slightly worse than the original IMPALA, both with and without clipping, confirming that the performance boost of PopArt-IMPALA is indeed due to the adaptive rescaling.

Figure 1: Atari-57 (reward clipping): Median human normalised score across all Atari levels, as a function of the total number of frames seen by the agents across all levels. We compare PopArt-IMPALA to IMPALA and to an additional baseline, MultiHead-IMPALA, which uses task-specific value predictions but no adaptive normalisation. All three agents are trained with the clipped reward scheme.

We highlight that in our experiments a single instance of multi-task PopArt-IMPALA processed the same amount of frames as a collection of 57 expert DQN agents (57 × 200M = 1.14 × 10^10), while achieving better performance. Despite the large CPU requirements, on a cloud service, training multi-task PopArt-IMPALA can also be competitive in terms of cost, since it exceeds the performance of a vanilla DQN in just 2.5 days, with a smaller GPU footprint.

### Normalisation statistics

It is insightful to observe the different normalisation statistics across games, and how they adapt during the course of training. Figure 3 (top row) plots the shift $\mu$ for a selection of Atari games, in the unclipped training regime. The scale $\sigma$ is visualised within the same figure by shading the area in the range $[\mu - \sigma, \mu + \sigma]$. We observed that the statistics differed by orders of magnitude across the different games: in crazy climber the shift exceeded 2500, while in bowling it never went above 15. The adaptivity of the proposed normalisation emerged clearly in crazy climber and qbert, where the statistics spanned multiple orders of magnitude during training. The bottom row of Figure 3 shows the corresponding agent's undiscounted episode return: it followed the same patterns as the statistics (with differences in magnitude due to discounting). Note how the statistics even tracked the instabilities in the agent's performance, as in qbert, ensuring that appropriate scaling was preserved throughout the experiment.
### DmLab-30 results

Figure 4 shows, as a function of the total number of frames processed by each agent, the mean human normalised performance across all 30 DeepMind Lab levels, where each level's score is capped at 100%. For all agents, we ran three independent PBT experiments. In Figure 4 we plot the learning curves for each experiment and, for each agent, fill in the area between the best and worst of these three experiments.

Figure 2: Atari-57 (unclipped): Median human normalised score across all Atari levels, as a function of the total number of frames seen by the agents across all levels. We here compare the same set of agents as in Figure 1, but now all agents are trained without reward clipping. The approximately flat lines corresponding to the baselines mean no learning at all on at least 50% of the games.

Figure 3: Normalisation statistics: Top: learned statistics, without reward clipping, for four distinct Atari games (breakout, crazy_climber, qbert, seaquest). The shaded region is $[\mu - \sigma, \mu + \sigma]$. Bottom: the corresponding undiscounted returns.

Compared to the original paper, our IMPALA baseline used a richer action set, which included more possible horizontal rotations (10 and 60 in the corresponding dimension of the native DeepMind Lab action space) and vertical rotations (10). Fine-grained horizontal control is useful on lasertag levels, while vertical rotations are necessary for a few psychlab levels. Note that this new baseline (solid blue in Figure 4) performed much better than the original IMPALA agent, which we also trained and report for completeness (dashed blue). Including PopArt normalisation (in orange) on top of our baseline resulted in largely improved scores. Note how the agents achieved clearly separated performance levels, with the new action set dominating the original paper's one, and with PopArt-IMPALA dominating IMPALA in all three replications of the experiment.

Figure 4: DmLab-30: Mean capped human normalised score of IMPALA (blue) and PopArt-IMPALA (orange) across the DmLab-30 benchmark, as a function of the number of frames (summed across all levels). The shaded region is bounded by the best and worst run among 3 PBT experiments. For reference, we also plot the performance of IMPALA with the limited action set from the original paper (dashed).

We also explore the combination of the proposed PopArt-IMPALA agent with pixel control (Jaderberg et al. 2016), to further improve data efficiency and to make training IMPALA-like agents on large multi-task benchmarks cheaper and more practical. Pixel control is an unsupervised auxiliary task introduced to help learning good state representations. As shown in Figure 5, the combination of PopArt-IMPALA with pixel control (red line) matched the final performance of the vanilla PopArt-IMPALA (orange line) with a fraction of the data (about 2B frames). This came on top of the large improvement in data efficiency already provided by PopArt, meaning that the pixel-control-augmented PopArt-IMPALA required less than 1/10th of the data to match our own IMPALA baseline's performance (and 1/30th of the frames to match the originally published IMPALA).
Importantly, since both PopArt and pixel control add only a very small computational cost, this improvement in data efficiency directly translated into a large reduction in the cost of training IMPALA agents on large multi-task benchmarks. Note, finally, that other orthogonal advances in deep RL could also be combined to further improve performance, similarly to what was done by Rainbow (Hessel et al. 2018) in the context of value-based reinforcement learning.

Figure 5: DmLab-30 (with pixel control): Mean capped human normalised score of PopArt-IMPALA with pixel control (red) across the DmLab-30 benchmark, as a function of the total number of frames across all tasks. The shaded region is bounded by the best and worst run among 3 PBT experiments. Dotted lines mark the points where Pixel-PopArt-IMPALA matches PopArt-IMPALA and the two IMPALA baselines.

## Implementation notes

We implemented all agents in TensorFlow. For each batch of rollouts processed by the learner, we averaged the $G^v_t$ targets within each rollout, and for each rollout in the batch we performed one online update of PopArt's normalisation statistics, with decay $\beta = 3 \times 10^{-4}$. Note that $\beta$ did not require any tuning. To prevent numerical issues, we clipped the scale $\sigma$ to the range $[0.0001, 10^6]$. We did not backpropagate gradients into $\mu$ and $\sigma$, which were exclusively updated as in Equation 6. The weights $W$ of the last layer of the value function were updated according to Equations 13 and 11. Note that we first applied the actor-critic updates (11), then updated the statistics (6), and finally applied the output-preserving updates (13). For more just-in-time rescaling of the updates we could invert this order, but this was not necessary.

In all experiments we used population-based training (PBT) to adapt hyperparameters during the course of training (Jaderberg et al. 2017). As in the IMPALA paper, we used PBT to tune the learning rate, the entropy cost, the optimiser's epsilon, and, in the Atari experiments, the max gradient norm. In Atari-57 we used populations of 24 instances, in DmLab-30 just 8 instances. For the other hyperparameters we used the values from Espeholt et al. (2018).

In this paper we propose a scale-invariant actor-critic algorithm that enables significantly improved performance in multi-task reinforcement learning settings. Being able to acquire knowledge about a wide range of facts and skills has long been considered an essential feature for an RL agent to demonstrate flexible intelligent behaviour (Sutton et al. 2011; Degris and Modayil 2012; Legg and Hutter 2007; Schmidhuber 2013). To ask our algorithms to be capable of mastering multiple tasks is therefore a natural step as we progress towards increasingly powerful agents. The widespread adoption of deep learning in RL is quite timely in this regard, since sharing parts of a neural network across multiple tasks is also a powerful way of building robust representations. This is particularly important for RL, because rewards on individual tasks can be sparse, and therefore sharing representations across tasks can be vital to bootstrap learning. Several agents (Jaderberg et al. 2016; Lample and Chaplot 2016; Shelhamer et al. 2016; Mirowski et al. 2016) demonstrated this by improving performance on a single external task by learning off-policy about auxiliary tasks defined on the same stream of experience (e.g.
pixel control, immediate reward prediction, or auto-encoding). Multi-task learning, as considered in this paper, where we get to execute in parallel the policies learned for each task, has potential additional benefits, including deep exploration (Osband et al. 2016) and policy composition (Mankowitz et al. 2018; Todorov 2009; Fernández and Veloso 2006). By learning on-policy about tasks, it may also be easier to scale to much more diverse tasks: if we only learn about some task off-policy, from experience generated while pursuing a very different one, we might never observe any reward.

A limitation of our approach is that it can be complicated, and expensive, to implement parallel learning outside of simulation. However, recent work on parallel training of real-world robots (Levine et al. 2016) suggests that this is not necessarily an insurmountable obstacle if sufficient resources are available. Adoption of parallel multi-task RL has up to now been fairly limited. That the scaling issues considered in this paper may have been a factor in this limited adoption is suggested by the wider use of this kind of learning in supervised settings (Johnson et al. 2017; Lu et al. 2016; Misra et al. 2016; Hashimoto et al. 2016), where loss functions are naturally well scaled (e.g. cross entropy), or can easily be scaled thanks to the stationarity of the training distribution. We therefore hope and believe that the work presented here can enable more research on multi-task RL.

We also believe that PopArt's adaptive normalisation can be combined with other research in multi-task RL that previously did not scale as effectively to large numbers of diverse tasks. Among the ideas that may be fruitfully combined we highlight policy distillation (Parisotto, Ba, and Salakhutdinov 2015; Rusu et al. 2015; Schmitt et al. 2018; Teh et al. 2017) and active sampling of the task distribution the agent trains on (Sharma and Ravindran 2017). The combination of PopArt-IMPALA with forms of active sampling might be particularly promising, since it may allow more efficient use of the parallel data generation, focusing it on the tasks most amenable to learning. Elastic weight consolidation (Kirkpatrick et al. 2017) and other work from the continual learning literature (Ring 1994; McClelland, McNaughton, and O'Reilly 1995) might also be adapted to parallel learning setups to reduce interference (McCloskey and Cohen 1989; French 1999) among tasks.

## References

Bacon, P.; Harb, J.; and Precup, D. 2017. The option-critic architecture. AAAI Conference on Artificial Intelligence.

Beattie, C.; Leibo, J. Z.; Teplyashin, D.; Ward, T.; Wainwright, M.; Küttler, H.; Lefrancq, A.; Green, S.; Valdés, V.; Sadik, A.; Schrittwieser, J.; Anderson, K.; York, S.; Cant, M.; Cain, A.; Bolton, A.; Gaffney, S.; King, H.; Hassabis, D.; Legg, S.; and Petersen, S. 2016. DeepMind Lab. CoRR abs/1612.03801.

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. JAIR.

Bellman, R. 1957. A Markovian decision process. Journal of Mathematics and Mechanics.

Brunskill, E., and Li, L. 2013. Sample complexity of multi-task reinforcement learning. CoRR abs/1309.6821.

Caruana, R. 1998. Multitask learning. In Learning to Learn. Kluwer Academic Publishers.

Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.

Degris, T., and Modayil, J. 2012. Scaling-up knowledge for a cognizant robot. In AAAI Spring Symposium: Designing Intelligent Robots.
Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. In ICML.

Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Doron, Y.; Firoiu, V.; Harley, T.; Dunning, I.; Legg, S.; and Kavukcuoglu, K. 2018. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML.

Fernández, F., and Veloso, M. 2006. Probabilistic policy reuse in a reinforcement learning agent. In AAMAS.

French, R. M. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences.

Gruslys, A.; Azar, M. G.; Bellemare, M. G.; and Munos, R. 2018. The Reactor: A sample-efficient actor-critic architecture. ICLR.

Hashimoto, K.; Xiong, C.; Tsuruoka, Y.; and Socher, R. 2016. A joint many-task model: Growing a neural network for multiple NLP tasks. CoRR abs/1611.01587.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

Hessel, M.; Modayil, J.; van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M. G.; and Silver, D. 2018. Rainbow: Combining improvements in deep reinforcement learning. AAAI Conference on Artificial Intelligence.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.

Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397.

Jaderberg, M.; Dalibard, V.; Osindero, S.; Czarnecki, W. M.; Donahue, J.; Razavi, A.; Vinyals, O.; Green, T.; Dunning, I.; Simonyan, K.; Fernando, C.; and Kavukcuoglu, K. 2017. Population based training of neural networks. CoRR abs/1711.09846.

Johnson, M.; Schuster, M.; Le, Q. V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F. B.; Wattenberg, M.; Corrado, G.; Hughes, M.; and Dean, J. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5.

Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwińska, A.; Hassabis, D.; Clopath, C.; Kumaran, D.; and Hadsell, R. 2017. Overcoming catastrophic forgetting in neural networks. PNAS.

Lample, G., and Chaplot, D. S. 2016. Playing FPS games with deep reinforcement learning. CoRR abs/1609.05521.

Legg, S., and Hutter, M. 2007. Universal intelligence: A definition of machine intelligence. Minds and Machines.

Levine, S.; Pastor, P.; Krizhevsky, A.; and Quillen, D. 2016. Learning hand-eye coordination for robotic grasping with large-scale data collection. In ISER.

Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. In ICLR.

Lu, Y.; Kumar, A.; Zhai, S.; Cheng, Y.; Javidi, T.; and Feris, R. S. 2016. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. CoRR abs/1611.05377.

Mahmood, A. 2017. Incremental Off-policy Reinforcement Learning Algorithms. Ph.D. Dissertation, University of Alberta.

Mankowitz, D. J.; Žídek, A.; Barreto, A.; Horgan, D.; Hessel, M.; Quan, J.; Oh, J.; van Hasselt, H.; Silver, D.; and Schaul, T. 2018. Unicorn: Continual learning with a universal, off-policy agent. CoRR abs/1802.08294.

McClelland, J. L.; McNaughton, B. L.; and O'Reilly, R. C. 1995. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review.
McCloskey, M., and Cohen, N. J. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation.

Mirowski, P.; Pascanu, R.; Viola, F.; Soyer, H.; Ballard, A. J.; Banino, A.; Denil, M.; Goroshin, R.; Sifre, L.; Kavukcuoglu, K.; Kumaran, D.; and Hadsell, R. 2016. Learning to navigate in complex environments. CoRR abs/1611.03673.

Misra, I.; Shrivastava, A.; Gupta, A.; and Hebert, M. 2016. Cross-stitch networks for multi-task learning. CoRR abs/1604.03539.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature.

Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In ICML.

Nair, A.; Srinivasan, P.; Blackwell, S.; Alcicek, C.; Fearon, R.; De Maria, A.; Panneershelvam, V.; Suleyman, M.; Beattie, C.; Petersen, S.; Legg, S.; Mnih, V.; Kavukcuoglu, K.; and Silver, D. 2015. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296.

Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep exploration via bootstrapped DQN. In NIPS.

Parisotto, E.; Ba, L. J.; and Salakhutdinov, R. 2015. Actor-mimic: Deep multitask and transfer reinforcement learning. CoRR abs/1511.06342.

Precup, D.; Sutton, R. S.; and Singh, S. P. 2000. Eligibility traces for off-policy policy evaluation. In ICML.

Ring, M. 1994. Continual learning in reinforcement environments.

Romera-Paredes, B.; Aung, H.; Bianchi-Berthouze, N.; and Pontil, M. 2013. Multilinear multitask learning. In ICML.

Rusu, A. A.; Colmenarejo, S. G.; Gülçehre, Ç.; Desjardins, G.; Kirkpatrick, J.; Pascanu, R.; Mnih, V.; Kavukcuoglu, K.; and Hadsell, R. 2015. Policy distillation. CoRR abs/1511.06295.

Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive neural networks. CoRR abs/1606.04671.

Schmidhuber, J. 1990. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In IJCNN.

Schmidhuber, J. 2013. PowerPlay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. In Frontiers in Psychology.

Schmitt, S.; Hudson, J. J.; Žídek, A.; Osindero, S.; Doersch, C.; Czarnecki, W. M.; Leibo, J. Z.; Küttler, H.; Zisserman, A.; Simonyan, K.; and Eslami, S. M. A. 2018. Kickstarting deep reinforcement learning. CoRR abs/1803.03835.

Schulman, J.; Levine, S.; Moritz, P.; Jordan, M. I.; and Abbeel, P. 2015. Trust region policy optimization. CoRR abs/1502.05477.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.

Sharma, S., and Ravindran, B. 2017. Online multi-task learning using active sampling. CoRR abs/1702.06053.

Shelhamer, E.; Mahmoudieh, P.; Argus, M.; and Darrell, T. 2016. Loss is its own reward: Self-supervision for reinforcement learning. CoRR abs/1612.07307.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.
Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T. P.; Simonyan, K.; and Hassabis, D. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR abs/1712.01815.

Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P. M.; White, A.; and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In AAMAS.

Sutton, R. S.; Mahmood, A. R.; Precup, D.; and van Hasselt, H. 2014. A new Q(λ) with interim forward view and Monte Carlo equivalence. In ICML.

Teh, Y. W.; Bapst, V.; Czarnecki, W. M.; Quan, J.; Kirkpatrick, J.; Hadsell, R.; Heess, N.; and Pascanu, R. 2017. Distral: Robust multitask reinforcement learning. CoRR abs/1707.04175.

Thrun, S. 1996. Is learning the n-th thing any easier than learning the first? In NIPS.

Thrun, S. 2012. Explanation-Based Neural Network Learning: A Lifelong Learning Approach. Springer.

Todorov, E. 2009. Compositionality of optimal control laws. In NIPS. Curran Associates, Inc.

van Hasselt, H.; Guez, A.; Hessel, M.; Mnih, V.; and Silver, D. 2016. Learning values across many orders of magnitude. In NIPS.

van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence.

Watkins, C. J. C. H. 1989. Learning from Delayed Rewards. Ph.D. Dissertation, King's College, Cambridge, England.

Williams, R. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.