# Distral: Robust Multitask Reinforcement Learning

Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, Razvan Pascanu
DeepMind, London, UK

Abstract

Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a distilled policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust to hyperparameter settings and more stable, attributes that are critical in deep reinforcement learning.

1 Introduction

Deep Reinforcement Learning is an emerging subfield of Reinforcement Learning (RL) that relies on deep neural networks as function approximators to scale RL algorithms to complex and rich environments. One key work in this direction was the introduction of DQN [21], which is able to play many games in the ATARI suite [1] at above human performance. However, the agent requires a fairly large amount of time and data to learn effective policies, and the learning process itself can be quite unstable, even with innovations introduced to improve wall clock time, data efficiency, and robustness by changing the learning algorithm [27, 33] or by improving the optimizer [20, 29]. A different approach was introduced by [12, 19, 14], whereby data efficiency is improved by training additional auxiliary tasks jointly with the RL task.

With the success of deep RL has come interest in increasingly complex tasks and a shift in focus towards scenarios in which a single agent must solve multiple related problems, either simultaneously or sequentially. Due to the large computational cost, making progress in this direction requires robust algorithms which do not rely on task-specific algorithmic design or extensive hyperparameter tuning. Intuitively, solutions to related tasks should facilitate learning since the tasks share common structure, and thus one would expect that individual tasks should require less data or achieve higher asymptotic performance. Indeed this intuition has long been pursued in the multitask and transfer-learning literature [2, 31, 34, 5]. Somewhat counter-intuitively, however, this is often not the result encountered in practice, particularly in the RL domain [26, 23].
Instead, the multitask and transfer learning scenarios are frequently found to pose additional challenges to existing methods. Instead of making learning easier, it is often observed that training on multiple tasks can negatively affect performance on the individual tasks, and additional techniques have to be developed to counteract this [26, 23]. It is likely that gradients from other tasks behave as noise, interfering with learning, or, in another extreme, that one of the tasks dominates the others. In this paper we develop an approach for multitask and transfer RL that allows effective sharing of behavioural structure across tasks, giving rise to several algorithmic instantiations. In addition to some instructive illustrations on a grid world domain, we provide a detailed analysis of the resulting algorithms via comparisons to A3C [20] baselines on a variety of tasks in a first-person, visually rich, 3D environment. We find that the Distral algorithms learn faster and achieve better asymptotic performance, are significantly more robust to hyperparameter settings, and learn more stably than multitask A3C baselines.

2 Distral: Distill and Transfer Learning

Figure 1: Illustration of the Distral framework.

We propose a framework for simultaneous reinforcement learning of multiple tasks which we call Distral. Figure 1 provides a high level illustration involving four tasks. The method is founded on the notion of a shared policy (shown in the centre) which distills (in the sense of Bucila et al. [4] and Hinton et al. [11]) common behaviours or representations from task-specific policies [26, 23]. Crucially, the distilled policy is then used to guide task-specific policies via regularization using a Kullback-Leibler (KL) divergence. The effect is akin to a shaping reward which can, for instance, overcome random walk exploration bottlenecks. In this way, knowledge gained in one task is distilled into the shared policy, then transferred to other tasks.

2.1 Mathematical framework

In this section we describe the mathematical framework underlying Distral. We consider a multitask RL setting with n tasks, where for simplicity we assume an infinite horizon with discount factor γ (the method can be easily generalized to other scenarios such as undiscounted finite horizon). We assume that the action space A and state space S are the same across tasks; we use a ∈ A to denote actions and s ∈ S to denote states. The transition dynamics p_i(s'|s, a) and reward functions R_i(a, s) are different for each task i. Let π_i be the task-specific stochastic policies. The dynamics and policies give rise to joint distributions over state and action trajectories starting from some initial state, which we also denote by π_i by an abuse of notation. Our mechanism for linking policy learning across tasks is to optimise an objective which consists of expected returns and policy regularizations. We designate π_0 to be the distilled policy, which we believe will capture agent behaviour that is common across the tasks. We regularize each task policy π_i towards the distilled policy using the γ-discounted KL divergence $\mathbb{E}_{\pi_i}\big[\sum_{t\ge 0} \gamma^t \log\frac{\pi_i(a_t|s_t)}{\pi_0(a_t|s_t)}\big]$. In addition, we also use a γ-discounted entropy regularization to further encourage exploration.
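Before writing down the full objective (given next), it may help to see how these two regularization terms can be estimated from a sampled trajectory. The sketch below is a minimal NumPy illustration, not the paper's implementation; the function name and array layout are assumptions made for this example.

```python
import numpy as np

def discounted_regularizers(logp_task, logp_distilled, gamma):
    """Monte Carlo estimates, from one sampled trajectory, of the
    gamma-discounted KL term       E[ sum_t gamma^t (log pi_i - log pi_0) ]
    and the discounted negative entropy  E[ sum_t gamma^t log pi_i ].

    logp_task:      log pi_i(a_t|s_t) for the actions taken, shape [T]
    logp_distilled: log pi_0(a_t|s_t) for the same actions,  shape [T]
    """
    logp_task = np.asarray(logp_task)
    logp_distilled = np.asarray(logp_distilled)
    discounts = gamma ** np.arange(len(logp_task))
    kl_term = np.sum(discounts * (logp_task - logp_distilled))
    neg_entropy_term = np.sum(discounts * logp_task)
    return kl_term, neg_entropy_term

# For a trajectory with rewards r_t, the regularized return would then be
#   sum_t gamma^t r_t  -  c_KL * kl_term  -  c_Ent * neg_entropy_term.
```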
The resulting objective to be maximized is:

$$
J(\pi_0, \{\pi_i\}_{i=1}^n) = \sum_i \mathbb{E}_{\pi_i}\Big[\sum_{t\ge 0} \gamma^t R_i(a_t, s_t) - c_{\mathrm{KL}}\,\gamma^t \log\frac{\pi_i(a_t|s_t)}{\pi_0(a_t|s_t)} - c_{\mathrm{Ent}}\,\gamma^t \log \pi_i(a_t|s_t)\Big]
$$
$$
= \sum_i \mathbb{E}_{\pi_i}\Big[\sum_{t\ge 0} \gamma^t R_i(a_t, s_t) + \frac{\gamma^t \alpha}{\beta}\log \pi_0(a_t|s_t) - \frac{\gamma^t}{\beta}\log \pi_i(a_t|s_t)\Big] \qquad (1)
$$

where c_KL, c_Ent ≥ 0 are scalar factors which determine the strengths of the KL and entropy regularizations, α = c_KL/(c_KL + c_Ent) and β = 1/(c_KL + c_Ent). The log π_0(a_t|s_t) term can be thought of as a reward shaping term which encourages actions that have high probability under the distilled policy, while the entropy term −log π_i(a_t|s_t) encourages exploration. In the above we used the same regularization costs c_KL, c_Ent for all tasks. It is easy to generalize to task-specific costs; this can be important if tasks differ substantially in their reward scales and in the amount of exploration needed, although it does introduce additional hyperparameters that are expensive to optimize.

2.2 Soft Q Learning and Distillation

A range of optimization techniques in the literature can be applied to maximize the above objective, which we expand on below. To build intuition for how the method operates, we start with the simple case of a tabular representation and an alternating maximization procedure which optimizes over π_i given π_0 and over π_0 given π_i. With π_0 fixed, (1) decomposes into separate maximization problems for each task, each an entropy-regularized expected return with redefined (regularized) reward R'_i(a, s) := R_i(a, s) + (α/β) log π_0(a|s). It can be optimized using soft Q learning [10], also known as G learning [7], which is based on the following softened Bellman updates for the state and action values (see also [25, 28, 22]):

$$
V_i(s_t) = \frac{1}{\beta}\log \sum_{a_t} \pi_0^{\alpha}(a_t|s_t)\,\exp\!\big[\beta Q_i(a_t, s_t)\big] \qquad (2)
$$
$$
Q_i(a_t, s_t) = R_i(a_t, s_t) + \gamma \sum_{s_{t+1}} p_i(s_{t+1}|s_t, a_t)\, V_i(s_{t+1}) \qquad (3)
$$

The Bellman updates are softened in the sense that the usual max operator over actions for the state values V_i is replaced by a soft-max at inverse temperature β, which hardens into a max operator as β → ∞. The optimal policy π_i is then a Boltzmann policy at inverse temperature β:

$$
\pi_i(a_t|s_t) = \pi_0^{\alpha}(a_t|s_t)\, e^{\beta Q_i(a_t,s_t) - \beta V_i(s_t)} = \pi_0^{\alpha}(a_t|s_t)\, e^{\beta A_i(a_t,s_t)} \qquad (4)
$$

where A_i(a, s) = Q_i(a, s) − V_i(s) is a softened advantage function. Note that the softened state values V_i(s) act as the log normalizers in the above. The distilled policy π_0 can be interpreted as a policy prior, a perspective well known in the literature on RL as probabilistic inference [32, 13, 25, 7]. However, unlike in past works, it is raised to a power α ≤ 1. This softens the effect of the prior π_0 on π_i, and is the result of the additional entropy regularization beyond the KL divergence. Also unlike past works, we will learn π_0 instead of hand-picking it (typically as a uniform distribution over actions). In particular, notice that the only terms in (1) depending on π_0 are:

$$
\sum_i \mathbb{E}_{\pi_i}\Big[\sum_{t\ge 0} \frac{\gamma^t \alpha}{\beta}\log \pi_0(a_t|s_t)\Big] \qquad (5)
$$

which is simply a log likelihood for fitting a model π_0 to a mixture of γ-discounted state-action distributions, one for each task i under policy π_i. A maximum likelihood (ML) estimator can be derived from state-action visitation frequencies under roll-outs in each task, with the optimal ML solution given by the mixture of state-conditional action distributions. Alternatively, in the non-tabular case, stochastic gradient ascent can be employed, which leads precisely to an update which distills the task policies π_i into π_0 [4, 11, 26, 23]. Note however that in our case the distillation step is derived naturally from a KL regularized objective on the policies.
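As a concrete illustration of the first half of the alternating scheme, the sketch below implements the softened Bellman updates (2)-(3) and the Boltzmann policy (4) for a single tabular task with a fixed distilled policy. It is a minimal sketch under assumed dense array representations, not the paper's code; the function name and argument layout are illustrative.

```python
import numpy as np

def soft_q_iteration(R, P, pi0, alpha, beta, gamma, n_iters=500):
    """Tabular softened Bellman updates (eqs. 2-3) for one task, with the
    distilled policy pi0 acting as a prior raised to the power alpha.

    R:   rewards,          shape [S, A]
    P:   transition probs, shape [S, A, S]
    pi0: distilled policy, shape [S, A]
    Returns soft state values V, action values Q, and the Boltzmann policy (eq. 4).
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R + gamma * np.einsum('sap,p->sa', P, V)                       # eq. (3)
        V = np.log(np.sum(pi0**alpha * np.exp(beta * Q), axis=1)) / beta   # eq. (2)
    advantage = Q - V[:, None]
    pi_i = pi0**alpha * np.exp(beta * advantage)                            # eq. (4)
    pi_i /= pi_i.sum(axis=1, keepdims=True)   # V is the log normalizer; renormalize for safety
    return V, Q, pi_i
```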
Another difference from [26, 23] and from prior works on the use of distillation in deep learning [4, 11] is that the distilled policy is fed back in to improve the task policies when they are next optimized, and serves as a conduit through which common and transferable knowledge is shared across the task policies.

It is worthwhile to pause here and consider the effect of the extra entropy regularization. First suppose that there is no extra entropy regularization, α = 1, and consider the simple scenario of only n = 1 task. Then (5) is maximized when the distilled policy π_0 and the task policy π_1 are equal, and the KL regularization term is 0. Thus the objective reduces to an unregularized expected return, and so the task policy π_1 converges to a greedy policy which locally maximizes expected returns. Another way to view this line of reasoning is that the alternating maximization scheme is equivalent to trust-region methods like natural gradient or TRPO [24, 29], which use a KL ball centred at the previous policy and which are understood to converge to greedy policies.

If α < 1, there is an additional entropy term in (1). So even with π_0 = π_1 and KL(π_1‖π_0) = 0, the objective (1) will no longer be maximized by greedy policies. Instead (1) reduces to an entropy-regularized expected return with entropy regularization factor β′ = β/(1 − α) = 1/c_Ent, so that the optimal policy is of the Boltzmann form with inverse temperature β′ [25, 7, 28, 22]. In conclusion, by including the extra entropy term, we can guarantee that the task policy will not turn greedy, and we can control the amount of exploration by adjusting c_Ent appropriately.

This additional control over the amount of exploration is essential when there is more than one task. To see this, imagine a scenario where one of the tasks is easier and is solved first, while the other tasks are harder with much sparser rewards. Without the entropy term, and before rewards in the other tasks are encountered, both the distilled policy and all the task policies can converge to the one that solves the easy task. Further, because this policy is greedy, it may not explore the other tasks sufficiently to even encounter rewards, leading to sub-optimal behaviour. For single-task RL, the use of entropy regularization was recently popularized by Mnih et al. [20] to counter premature convergence to greedy policies, which can be particularly severe when doing policy gradient learning. This carries over to our multitask scenario as well, and is the reason for the additional entropy regularization.
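As a quick check of the claim β′ = 1/c_Ent (a one-line worked step, using only the definitions α = c_KL/(c_KL + c_Ent) and β = 1/(c_KL + c_Ent)): substituting π_0 = π_1 into the per-step regularization of (1) gives

$$
\frac{\alpha}{\beta}\log\pi_1(a_t|s_t) - \frac{1}{\beta}\log\pi_1(a_t|s_t)
= -\frac{1-\alpha}{\beta}\,\log\pi_1(a_t|s_t)
= -\,c_{\mathrm{Ent}}\,\log\pi_1(a_t|s_t),
\qquad \text{since } \frac{1-\alpha}{\beta} = c_{\mathrm{Ent}} = \frac{1}{\beta'} .
$$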
2.3 Policy Gradient and a Better Parameterization

The above method alternates between maximization of the distilled policy π_0 and the task policies π_i, and is reminiscent of the EM algorithm [6] for learning latent variable models, with π_0 playing the role of the parameters, while π_i plays the role of the posterior distributions over the latent variables. Going beyond the tabular case, when both π_0 and π_i are parameterized by, say, deep networks, such an alternating maximization procedure can be slower than simply optimizing (1) with respect to the task and distilled policies jointly by stochastic gradient ascent. In this case the gradient update for π_i is simply given by policy gradient with an entropic regularization [20, 28], and can be carried out within a framework like advantage actor-critic [20]. A simple parameterization of policies would be to use a separate network for each task policy π_i, and another one for the distilled policy π_0.

An alternative parameterization, which we argue can result in faster transfer, can be obtained by considering the form of the optimal Boltzmann policy (4). Specifically, consider parameterizing the distilled policy using a network with parameters θ_0,

$$
\hat\pi_0(a_t|s_t) = \frac{\exp(h_{\theta_0}(a_t|s_t))}{\sum_{a'} \exp(h_{\theta_0}(a'|s_t))} \qquad (6)
$$

and estimating the soft advantages using another network with parameters θ_i:

$$
\hat A_i(a_t|s_t) = f_{\theta_i}(a_t|s_t) - \frac{1}{\beta}\log \sum_{a} \hat\pi_0^{\alpha}(a|s_t)\exp(\beta f_{\theta_i}(a|s_t)) \qquad (7)
$$

We use hat notation to denote parameterized approximators of the corresponding quantities. (In practice we do not actually use these as advantage estimates; instead we use (8) to parameterize a policy which is optimized by policy gradients.) The policy for task i then becomes parameterized as

$$
\hat\pi_i(a_t|s_t) = \hat\pi_0^{\alpha}(a_t|s_t)\exp(\beta \hat A_i(a_t|s_t)) = \frac{\exp(\alpha h_{\theta_0}(a_t|s_t) + \beta f_{\theta_i}(a_t|s_t))}{\sum_{a'} \exp(\alpha h_{\theta_0}(a'|s_t) + \beta f_{\theta_i}(a'|s_t))} \qquad (8)
$$

This can be seen as a two-column architecture for the policy, with one column being the distilled policy, and the other being the adjustment required to specialize to task i.

Figure 2: Depiction of the different algorithms and baselines. On the left are two of the Distral algorithms and on the right are the three A3C baselines. Entropy is drawn in brackets as it is optional and only used for KL+ent 2col and KL+ent 1col.

Given the parameterization above, we can now derive the policy gradients. The gradient with respect to the task-specific parameters θ_i is given by the standard policy gradient theorem [30],

$$
\nabla_{\theta_i} J = \mathbb{E}_{\hat\pi_i}\Big[\sum_{t\ge 1} \nabla_{\theta_i}\log\hat\pi_i(a_t|s_t)\Big(\sum_{u\ge 1}\gamma^u R_i^{\mathrm{reg}}(a_u, s_u)\Big)\Big]
= \mathbb{E}_{\hat\pi_i}\Big[\sum_{t\ge 1} \nabla_{\theta_i}\log\hat\pi_i(a_t|s_t)\Big(\sum_{u\ge t}\gamma^u R_i^{\mathrm{reg}}(a_u, s_u)\Big)\Big] \qquad (9)
$$

where R_i^reg(a, s) = R_i(a, s) + (α/β) log π̂_0(a|s) − (1/β) log π̂_i(a|s) is the regularized reward. Note that the partial derivative of the entropy in the integrand has expectation E_{π̂_i}[∇_{θ_i} log π̂_i(a_t|s_t)] = 0 because of the log-derivative trick. If a value baseline is estimated, it can be subtracted from the regularized returns as a control variate. The gradient with respect to θ_0 is more interesting:

$$
\nabla_{\theta_0} J = \sum_i \mathbb{E}_{\hat\pi_i}\Big[\sum_{t\ge 1} \nabla_{\theta_0}\log\hat\pi_i(a_t|s_t)\Big(\sum_{u\ge t}\gamma^u R_i^{\mathrm{reg}}(a_u, s_u)\Big)
+ \frac{\alpha}{\beta}\sum_{t\ge 1}\gamma^t \sum_{a'_t}\big(\hat\pi_i(a'_t|s_t) - \hat\pi_0(a'_t|s_t)\big)\nabla_{\theta_0} h_{\theta_0}(a'_t|s_t)\Big] \qquad (10)
$$

Note that the first term is the same as for the policy gradient of θ_i. The second term tries to match the probabilities under the task policy π̂_i and under the distilled policy π̂_0. The second term would not be present if we simply parameterized π_i using the architecture (8) but did not use a KL regularization for the policy. The presence of the KL regularization gets the distilled policy to learn to be the centroid of all task policies, in the sense that the second term is zero when π̂_0 matches the task policies π̂_i (averaged over tasks and their state visitations), and it helps to transfer information quickly across tasks and to new tasks.
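To make equations (6)-(8) and the regularized reward concrete, here is a small NumPy sketch of the two-column parameterization. It is an illustrative sketch, not the paper's TensorFlow implementation, and the function names are made up for this example.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def two_column_policy(h0_logits, fi_logits, alpha, beta):
    """Two-column parameterization: the distilled policy (eq. 6) uses the
    h column only, while the task policy (eq. 8) uses logits
    alpha * h_{theta_0} + beta * f_{theta_i}."""
    pi0 = softmax(h0_logits)                              # eq. (6)
    pi_i = softmax(alpha * h0_logits + beta * fi_logits)  # eq. (8)
    return pi0, pi_i

def regularized_reward(r, logp0_a, logpi_a, alpha, beta):
    """Per-step regularized reward R_i^reg = R_i + (alpha/beta) log pi_0
    - (1/beta) log pi_i, used inside the policy gradients (9) and (10)."""
    return r + (alpha / beta) * logp0_a - (1.0 / beta) * logpi_a
```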
2.4 Other Related Works

The centroid and star-shaped structure of Distral is reminiscent of ADMM [3], elastic-averaging SGD [35] and hierarchical Bayes [9]. A crucial difference, however, is that while ADMM, EASGD and hierarchical Bayes operate in the space of parameters, in Distral the distilled policy learns to be the centroid in the space of policies. We argue that this is semantically more meaningful, and may contribute to the observed robustness of Distral by stabilizing learning. Indeed, in our experiments we find that the absence of the KL regularization significantly affects the stability of the algorithm. Another related line of work is guided policy search [17, 18, 15, 16], which focuses on single tasks and uses trajectory optimization (corresponding to the task policies here) to guide the learning of a policy (corresponding to the distilled policy π_0 here). This contrasts with Distral, which is a multitask setting, where a learnt π_0 is used to facilitate transfer by sharing common task-agnostic behaviours, and where the main outcome of the approach is instead the task policies.

Our approach is also reminiscent of recent work on option learning [8], but with a few important differences. We focus on using deep neural networks as flexible function approximators, and apply our method to rich 3D visual environments, while Fox et al. [8] considered only the tabular case. We argue for the importance of an additional entropy regularization besides the KL regularization. This leads to an interesting twist in the mathematical framework, allowing us to separately control the amounts of transfer and of exploration. On the other hand, Fox et al. [8] focused on the interesting problem of learning multiple options (distilled policies here). Their approach treats the assignment of tasks to options as a clustering problem, which is not easily extended beyond the tabular case.

3 Algorithms

The framework we just described allows for a number of possible algorithmic instantiations, arising as combinations of objectives, algorithms and architectures, which we describe below and summarize in Table 1 and Figure 2.

|           | h_θ0          | f_θi        | α h_θ0 + β f_θi |
|-----------|---------------|-------------|-----------------|
| α = 0     | multitask A3C | A3C         | A3C 2col        |
| α = 1     |               | KL 1col     | KL 2col         |
| 0 < α < 1 |               | KL+ent 1col | KL+ent 2col     |

Table 1: The seven different algorithms evaluated in our experiments. Each column describes a different architecture, with the column headings indicating the logits for the task policies. The rows define the relative amount of KL vs entropy regularization loss, with the first row comprising the A3C baselines (no KL loss).

KL divergence vs entropy regularization: With α = 0, we get a purely entropy-regularized objective which does not couple and transfer across tasks [20, 28]. With α = 1, we get a purely KL regularized objective, which does couple and transfer across tasks, but might prematurely stop exploration if the distilled and task policies become similar and greedy. With 0 < α < 1 we get both terms.

Alternating vs joint optimization: We have the option of jointly optimizing both the distilled policy and the task policies, or optimizing one while keeping the other fixed. Alternating optimization leads to algorithms that resemble policy distillation/actor-mimic [23, 26], but are iterative in nature, with the distilled policy feeding back into task policy optimization. Also, soft Q learning can be applied to each task, instead of policy gradients. While alternating optimization can be slower, evidence from policy distillation/actor-mimic indicates it might learn more stably, particularly for tasks which differ significantly.

Separate vs two-column parameterization: Finally, the task policy can be parameterized to use the distilled policy (8) or not. If using the distilled policy, behaviour distilled into the distilled policy is immediately available to the task policies, so transfer can be faster. However, if the process of transfer occurs too quickly, it might interfere with effective exploration of individual tasks.

From this spectrum of possibilities we consider four concrete instances which differ in the underlying network architecture and distillation loss, identified in Table 1. In addition, we compare against three A3C baselines.
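The rows of Table 1 correspond to different choices of the regularization costs. A small helper makes the mapping from (c_KL, c_Ent) to (α, β) explicit; the specific cost values below are purely illustrative and are not the paper's hyperparameters.

```python
def distral_coefficients(c_kl, c_ent):
    """Map regularization costs to the objective's (alpha, beta):
    alpha = c_KL / (c_KL + c_Ent),  beta = 1 / (c_KL + c_Ent)."""
    total = c_kl + c_ent
    return c_kl / total, 1.0 / total

# Rows of Table 1 expressed as choices of (c_KL, c_Ent); values are illustrative.
VARIANT_COSTS = {
    "A3C baselines (alpha = 0)":        dict(c_kl=0.0,   c_ent=0.01),
    "KL 1col / KL 2col (alpha = 1)":    dict(c_kl=0.01,  c_ent=0.0),
    "KL+ent 1col / 2col (0 < alpha < 1)": dict(c_kl=0.005, c_ent=0.005),
}

for name, costs in VARIANT_COSTS.items():
    alpha, beta = distral_coefficients(**costs)
    print(f"{name}: alpha={alpha:.2f}, beta={beta:.1f}")
```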
In initial experiments we explored two variants of A3C: the original method [20] and the variant of Schulman et al. [28] which uses entropy-regularized returns. We did not find significant differences between the two variants in our setting, and chose to report only the original A3C results for clarity in Section 4. Further algorithmic details are provided in the Appendix.

4 Experiments

We demonstrate the various algorithms derived from our framework, first using alternating optimization with soft Q learning and policy distillation on a set of simple grid world tasks. Then all seven algorithms are evaluated on three sets of challenging RL tasks in partially observable 3D environments.

4.1 Two room grid world

To give better intuition for the role of the distilled behaviour policy, we considered a set of tasks in a grid world domain with two rooms connected by a corridor (see Figure 3) [8]. Each task is distinguished by a different randomly chosen goal location, and each MDP state consists of the map location, the previous action and the previous reward. A Distral agent is trained using only the KL regularization and an optimization algorithm which alternates between soft Q learning and policy distillation (a tabular sketch of this alternating scheme is given below, after the Figure 3 caption). Each soft Q learning iteration learns from a rollout of length 10. To determine the benefit of the distilled policy, we compared the Distral agent to one which learns a separate policy for each task with soft Q learning. The learning curves are shown in Figure 3 (left). We see that the Distral agent is able to learn significantly faster than the single-task agents. Figure 3 (right) visualizes the distilled policy (probability of next action given position and previous action), demonstrating that the agent has learnt a policy which guides it to move consistently in the same direction through the corridor in order to reach the other room. This allows the agent to reach the other room faster and helps exploration if the agent is shown new test tasks. In Fox et al. [8] two separate options are learnt, while here we learn a single distilled policy which conditions on more past information (previous action and reward).

Figure 3: Left: Learning curves on the two room grid world. The Distral agent (blue) learns faster, converges towards better policies, and demonstrates more stable learning overall. Center: Examples of tasks. Green is the goal position, which is uniformly sampled for each task. The starting position is uniformly sampled at the beginning of each episode. Right: Depiction of the learned distilled policy π_0 in the corridor only, conditioned on the previous action being left/right and no previous reward. Sizes of arrows depict probabilities of actions. Note that up/down actions have negligible probabilities. The model learns to preserve the direction of travel in the corridor.
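For completeness, here is a minimal tabular sketch of the alternating scheme used in this experiment: a soft Q learning step per task (as in the earlier sketch, with α = 1 since only the KL regularization is used) followed by an ML distillation step that refits π_0 to the γ-discounted state-action counts. The function names and the count smoothing are assumptions made for this illustration, not details from the paper.

```python
import numpy as np
from collections import defaultdict

def distill_tabular(trajectories, n_actions, gamma, smoothing=1e-3):
    """ML distillation step: refit pi_0(a|s) to the gamma^t-weighted
    state-action counts pooled over roll-outs from all task policies,
    yielding the centroid (mixture) policy.

    trajectories: iterable of trajectories, each a list of (state, action) pairs.
    Returns a dict mapping state -> action distribution.
    """
    counts = defaultdict(lambda: np.full(n_actions, smoothing))
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            counts[s][a] += gamma ** t
    return {s: c / c.sum() for s, c in counts.items()}

# Alternating optimization, at pseudocode level (helpers are hypothetical):
#   for it in range(num_iterations):
#       rollouts = []
#       for i in range(n_tasks):
#           _, _, pi_i = soft_q_iteration(R[i], P[i], pi0, alpha, beta, gamma)
#           rollouts += collect_rollouts(pi_i, env[i], length=10)
#       pi0 = distill_tabular(rollouts, n_actions, gamma)   # update pi_0
```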
4.2 Complex Tasks

To assess Distral under more challenging conditions, we use a complex first-person, partially observed 3D environment with a variety of visually rich RL tasks. All agents were implemented with a distributed Python/TensorFlow code base, using 32 workers for each task, and learnt using asynchronous RMSProp. The network columns contain convolutional layers and an LSTM, and are uniform across experiments and algorithms. We tried three values for the entropy cost β and three learning rates. Four runs for each hyperparameter setting were used. All other hyperparameters were fixed to the single-task A3C defaults and, for the KL+ent 1col and KL+ent 2col algorithms, α was fixed at 0.5.

Mazes
In the first experiment, each of the n = 8 tasks is a different maze containing randomly placed rewards and a goal object. Figure 4.A1 shows the learning curves for all seven algorithms. Each curve is produced by averaging over all 4 runs and 8 tasks, and selecting the best settings for β and the learning rate (as measured by the area under the learning curves). The Distral algorithms learn faster and achieve better final performance than all three A3C baselines. The two-column algorithms learn faster than the corresponding single-column ones. The Distral algorithms without entropy learn faster but achieve lower final scores than those with entropy, which we believe is due to insufficient exploration towards the end of learning. We found that both multitask A3C and two-column A3C can learn well on some runs, but are generally unstable: some runs did not learn well, while others learned initially and then suffered degradation later. We believe this is due to negative interference across tasks, which does not happen for the Distral algorithms. The stability of the Distral algorithms also increases their robustness to hyperparameter selection. Figure 4.A2 shows the final achieved average returns for all 36 runs of each algorithm, sorted in decreasing order. We see that Distral algorithms have a significantly higher proportion of runs achieving good returns, with KL+ent_2col being the most robust.

Distral algorithms, along with multitask A3C, use a distilled or common policy which can be applied on all tasks. Panels B1 and B2 in Figure 4 summarize the performance of the distilled policies. Algorithms that use two columns (KL_2col and KL+ent_2col) obtain the best performance, because policy gradients are also directly propagated through the distilled policy in those cases. Moreover, panel B2 reveals that Distral algorithms exhibit greater stability compared to traditional multitask A3C. We also observe that the KL algorithms have better-performing distilled policies than the KL+ent ones. We believe this is because the additional entropy regularization allows task policies to diverge more substantially from the distilled policy. This suggests that annealing the entropy term or increasing the KL term throughout training could improve the distilled policy performance, if that is of interest.

Navigation
We experimented with n = 4 navigation and memory tasks. In contrast to the previous experiment, these tasks use random maps which are procedurally generated on every episode. The first task features reward objects which are randomly placed in a maze, and the second task requires the agent to return these objects to its start position. The third task has a single goal object which must be repeatedly found from different start positions, and in the fourth task doors are randomly opened and closed to force novel path-finding. Hence, these tasks are more involved than the previous navigation tasks.

Figure 4: Panels A1, C1, D1 show task-specific policy performance (averaged across all tasks) for the maze, navigation and laser-tag tasks, respectively. The x-axes are total numbers of training environment steps per task. Panel B1 shows the mean scores obtained with the distilled policies (A3C has no distilled policy, so it is represented by the performance of an untrained network). For each algorithm, results for the best set of hyperparameters (based on the area under the curve) are reported. The bold line is the average over 4 runs, and the colored area the average standard deviation over the tasks. Panels A2, B2, C2, D2 show the corresponding final performances for the 36 runs of each algorithm, ordered from best to worst (9 hyperparameter settings and 4 runs).
The panels C1 and C2 of Figure 4 summarize the results. We observe again that the Distral algorithms yield better final results while having greater stability (Figure 4.C2). The top-performing algorithms are, again, the two-column Distral algorithms (KL_2col and KL+ent_2col).

Laser-tag
In the final set of experiments, we use n = 8 laser-tag levels. These tasks require the agent to learn to tag bots controlled by a built-in AI, and they differ substantially: fixed versus procedurally generated maps, fixed versus procedural bots, and complexity of agent behaviour (e.g. learning to jump in some tasks). Corresponding to this greater diversity, we observe (see panels D1 and D2 of Figure 4) that the best baseline is the A3C algorithm trained independently on each task. Among the Distral algorithms, the single-column variants perform better, especially initially, as they are able to learn task-specific features separately. We again observe the early plateauing phenomenon for algorithms that do not possess an additional entropy term. While not significantly better than the A3C baseline on these tasks, the Distral algorithms clearly outperform multitask A3C.

Discussion
Considering the three different sets of complex 3D experiments, we argue that the Distral algorithms are promising solutions to the multitask deep RL problem. Distral can perform significantly better than A3C baselines when tasks have sufficient commonalities for transfer (maze and navigation), while still being competitive with A3C when there is less transfer possible. In terms of specific algorithmic proposals, the additional entropy regularization is important in encouraging continued exploration, while two-column architectures generally allow faster transfer (but can affect performance when there is little transfer, due to task interference). The computational cost of Distral algorithms is at most twice that of the corresponding A3C algorithms, as each agent needs to process two network columns instead of one. In practice, however, the runtimes are only slightly higher than for A3C, because the cost of simulating the environments is significant and is the same whether training is single-task or multitask.

5 Conclusion

We have proposed Distral, a general framework for distilling and transferring common behaviours in multitask reinforcement learning. In experiments we showed that the resulting algorithms learn quicker, produce better final performances, and are more stable and robust to hyperparameter settings. We have found that Distral significantly outperforms the standard way of using shared neural network parameters for multitask or transfer reinforcement learning. Two ideas in Distral might be worth re-emphasizing here. We observe that distillation arises naturally as one half of an optimization procedure when using KL divergences to regularize the output of task models towards a distilled model. The other half corresponds to using the distilled model as a regularizer for training the task models. Another observation is that parameters in deep networks do not typically have any semantic meaning by themselves, so instead of regularizing networks in parameter space, it is worthwhile considering regularizing networks in a more semantically meaningful space, e.g. the space of policies.
We would like to end with a discussion of the various difficulties faced by multitask RL methods. The first is that of positive transfer: when there are commonalities across tasks, how does the method achieve this transfer and lead to better learning speed and better performance on new tasks in the same family? This is the core aim of Distral, where the commonalities are exhibited in terms of shared common behaviours. The second is that of task interference, where the differences among tasks adversely affect agent performance by interfering with exploration and with the optimization of network parameters. This is the core aim of the policy distillation and actor-mimic works [26, 23]. As in these works, Distral also learns a distilled policy, but this policy is further used to regularize the task policies to facilitate transfer, which means that Distral algorithms can still be affected by task interference. It would be interesting to explore ways to allow Distral (or other methods) to automatically balance between increasing task transfer and reducing task interference. Other possible directions of future research include: combining Distral with techniques which use auxiliary losses [12, 19, 14], exploring the use of multiple distilled policies or latent variables in the distilled policy to allow for more diversity of behaviours, exploring continual learning settings where tasks are encountered sequentially, and exploring ways to adaptively adjust the KL and entropy costs to better control the amounts of transfer and exploration. Finally, theoretical analyses of Distral and other KL regularization frameworks for deep RL would help improve our understanding of these recent methods.

References

[1] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, June 2013.
[2] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In JMLR: Workshop on Unsupervised and Transfer Learning, 2012.
[3] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), January 2011.
[4] Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proc. of the Int'l Conference on Knowledge Discovery and Data Mining (KDD), 2006.
[5] Rich Caruana. Multitask learning. Machine Learning, 28(1):41-75, July 1997.
[6] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1-38, 1977.
[7] R. Fox, A. Pakman, and N. Tishby. Taming the noise in reinforcement learning via soft updates. In Uncertainty in Artificial Intelligence (UAI), 2016.
[8] Roy Fox, Michal Moshkovitz, and Naftali Tishby. Principled option learning in Markov decision processes. In European Workshop on Reinforcement Learning (EWRL), 2016.
[9] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis, volume 2. Chapman & Hall/CRC, Boca Raton, FL, USA, 2014.
[10] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
[11] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. NIPS Deep Learning Workshop, 2014.
[12] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. Int'l Conference on Learning Representations (ICLR), 2016.
[13] Hilbert J. Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159-182, 2012.
[14] Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. Association for the Advancement of Artificial Intelligence (AAAI), 2017.
[15] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Adv. in Neural Information Processing Systems (NIPS), pages 1071-1079, 2014.
[16] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1-40, 2016.
[17] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Adv. in Neural Information Processing Systems (NIPS), pages 207-215, 2013.
[18] Sergey Levine and Vladlen Koltun. Learning complex neural network policies with trajectory optimization. In Int'l Conference on Machine Learning (ICML), pages 829-837, 2014.
[19] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J. Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell. Learning to navigate in complex environments. Int'l Conference on Learning Representations (ICLR), 2016.
[20] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Int'l Conference on Machine Learning (ICML), 2016.
[21] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[22] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. arXiv:1702.08892, 2017.
[23] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In Int'l Conference on Learning Representations (ICLR), 2016.
[24] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. Int'l Conference on Learning Representations (ICLR), 2014.
[25] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems (RSS), 2012.
[26] Andrei A. Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In Int'l Conference on Learning Representations (ICLR), 2016.
[27] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2015.
[28] J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-learning. arXiv:1704.06440, 2017.
[29] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Int'l Conference on Machine Learning (ICML), 2015.
[30] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Adv. in Neural Information Processing Systems (NIPS), volume 99, pages 1057-1063, 1999.
[31] Matthew E. Taylor and Peter Stone. An introduction to inter-task transfer for reinforcement learning. AI Magazine, 32(1):15-34, 2011.
[32] Marc Toussaint, Stefan Harmeling, and Amos Storkey. Probabilistic inference for solving (PO)MDPs. Technical Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.
[33] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. Association for the Advancement of Artificial Intelligence (AAAI), 2016.
[34] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Adv. in Neural Information Processing Systems (NIPS), 2014.
[35] Sixin Zhang, Anna Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Adv. in Neural Information Processing Systems (NIPS), 2015.