# Self-Supervised Mixture-of-Experts by Uncertainty Estimation

Zhuobin Zheng,1,2 Chun Yuan,2 Xinrui Zhu,1,2 Zhihui Lin,1,2 Yangyang Cheng,1,2 Cheng Shi,1,2 Jiahui Ye1,3
1Department of Computer Science and Technologies, Tsinghua University, Beijing, China
2Graduate School at Shenzhen, Tsinghua University, Shenzhen, China
3Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, Shenzhen, China
{zhengzb16, zhuxr17, lin-zh14, cheng-yy13, shic17, yejh16}@mails.tsinghua.edu.cn, yuanc@sz.tsinghua.edu.cn

Abstract

Learning related tasks in various domains and transferring exploited knowledge to new situations is a significant challenge in Reinforcement Learning (RL). However, most RL algorithms are data inefficient and fail to generalize in complex environments, limiting their adaptability and applicability in multi-task scenarios. In this paper, we propose Self-Supervised Mixture-of-Experts (SUM), an effective algorithm driven by predictive uncertainty estimation for multi-task RL. SUM utilizes a multi-head agent with shared parameters as experts to learn a series of related tasks simultaneously by Deep Deterministic Policy Gradient (DDPG). Each expert is extended by predictive uncertainty estimation on known and unknown states to enhance its Q-value evaluation capacity against overfitting and its overall generalization ability. These enable the agent to capture and diffuse common knowledge across different tasks, improving sample efficiency in each task and the effectiveness of expert scheduling across multiple tasks. Instead of the task-specific design of common MoEs, a self-supervised gating network is adopted to determine a potential expert to handle each interaction from unseen environments; it is calibrated completely by the uncertainty feedback from the experts without explicit supervision. To alleviate imbalanced expert utilization, the crux of MoE, optimization is accomplished via decayed-masked experience replay, which encourages both diversification and specialization of experts during different periods. We demonstrate that our approach learns faster and achieves better performance by efficient transfer and robust generalization, outperforming several related methods on extended OpenAI Gym MuJoCo multi-task environments.

1 Introduction

Reinforcement Learning (RL) (Sutton and Barto 1998) trains an agent to solve sequential decision-making problems through trial-and-error interactions with the environment. With the significant advance of deep neural networks as effective function approximators, deep RL has been scaled up and has demonstrated success in various complex domains such as the game of Go (Silver et al. 2017), video games (Mnih et al. 2015) and robotic control tasks (Gu et al. 2017). However, most typical RL methods are sample inefficient, requiring substantial experience and large computational cost before obtaining acceptable behaviors, especially when solving multiple related problems. One effective direction for improving data efficiency is multi-task learning (Caruana 1997), which shares similar characteristics and transfers reusable representations across multiple tasks. Intuitively, by applying multi-task strategies, algorithms can accelerate the convergence and improve the performance of each task w.r.t.
single-task learning while requiring less training data overall. This further renders the agent applicable to complex real-world environments with robust generalization in different situations (Finn et al. 2017).

However, in practice, when RL is combined with multi-task techniques such as shared structure, the learning process tends to be unstable due to the different complexity and reward schemes between tasks (Teh et al. 2017). We ascribe this negative effect to the interference caused by noisy gradients from different domains, which misleads the training of vanilla multi-task approaches. Another issue is that deep RL agents commonly suffer from overfitting (Zhang et al. 2018), which negatively affects transfer performance. As deep RL algorithms are increasingly employed in critical problems such as autonomous control, healthcare, finance, and safety (Amodei et al. 2016), it is crucial to understand the generalization and adaptation abilities of trained agents before real-world deployment. Besides, the crux of multi-task RL is not only exploiting reusable representations but, moreover, transferring them across domains effectively.

Apart from techniques for mitigating overfitting, quantifying the uncertainty of specific tasks brings a more intuitive understanding of the generalization capacity of deep models. In this case, multi-task RL agents with well-calibrated predictive uncertainty estimation can avert overconfident incorrect predictions (Lakshminarayanan, Pritzel, and Blundell 2017) and accomplish multiple tasks by robust generalization and efficient transfer.

In this paper, we propose Self-Supervised Mixture-of-Experts (SUM), an effective approach for multi-task RL aiming to improve both data efficiency and generalization capacity across multiple tasks by effective knowledge sharing and expert scheduling. Based on multi-head DDPG (Zheng et al. 2018), SUM extends the model to a Mixture-of-Experts (MoE) architecture consisting of an agent and a self-supervised gating network, which is suitable for the multi-task environment. The agent is constructed by experts, including multiple actor heads for exploration and multiple critic heads for evaluation. To counteract the adverse effects of overfitting, the experts are enhanced by uncertainty estimation on known and unknown states to improve Q-value evaluation and generalize robustly across multiple tasks. When interacting with tasks from an unknown distribution, the gating network determines a potential expert according to all of the experts' uncertainty estimates of the state. It leverages this uncertainty feedback as self-supervision and is optimized without explicit supervision. Furthermore, we exploit decayed-masked experience replay to improve both early diversification and late specialization of experts in different stages. We present empirical experiments analyzing our algorithm on a series of continuous control tasks in extended MuJoCo environments (Henderson et al. 2017).

The contributions of this work include:

- We introduce uncertainty estimation to the critic of DDPG for enhanced Q-value evaluation, which alleviates overfitting in an individual task and improves robust generalization ability across multiple tasks.
- We extend multi-head DDPG into an MoE architecture with a gating network self-supervised completely by uncertainty estimates from the experts without extra supervision, which improves data efficiency and performance substantially by effective knowledge sharing and expert scheduling.
- To tackle imbalanced expert utilization, we exploit decayed-masked experience replay to motivate the experts to concentrate on different objectives during training.
- We evaluate our approach via extensive experiments from different perspectives, including uncertainty enhancement, generalization ability, and multi-task performance.

2 Related Work

In recent years, much research has focused on developing multi-task RL algorithms under the transfer learning paradigm (Lazaric 2012). A primary series of approaches is based on policy distillation, usually constructed as a student-teacher architecture, including (Rusu et al. 2015; Parisotto, Ba, and Salakhutdinov 2015; Teh et al. 2017). Other attempts focus on effective representation reuse from different perspectives, such as shared abstractions of the state-action space (Borsa, Graepel, and Shawe-Taylor 2016) and progressive networks (Rusu et al. 2016).

Uncertainty estimation has become increasingly prevalent as an effective method to reduce the risk of overfitting (Zhang et al. 2018) and to understand an agent's generalization ability. Furthermore, with the advance of reliable uncertainty estimation, algorithms can tackle difficult issues specific to RL, including balancing exploration and exploitation (Dearden, Friedman, and Russell 1998) and avoiding collisions in unknown control tasks (Kahn et al. 2017). However, for uncertainty in multi-task RL, there are relatively few methods available, mostly based on Bayesian approaches (Lazaric and Ghavamzadeh 2010; Wilson et al. 2007). Unlike these works, we exploit an auxiliary extension of the neural network (Nix and Weigend 1994) to quantify the predictive uncertainty in the critic of DDPG.

Mixture-of-Experts (MoE) (Jacobs et al. 1991; Jordan and Jacobs 1994) is an effective ensemble learning approach that uses a gating function to specialize models, alleviating overfitting and improving performance on complex tasks (Shazeer et al. 2017). This paper presents a novel framework based on the MoE architecture that tackles multi-task RL with two components: multi-head DDPG (Zheng et al. 2018) as the experts and a self-supervised gating function for expert scheduling.

Self-supervised learning enables learning without explicit supervision by exploiting unlabeled data to provide intrinsic representations as self-supervision. It has been prevalently pursued in computer vision (Donahue, Krähenbühl, and Darrell 2017; Doersch and Zisserman 2017) and RL (Shelhamer et al. 2016; Pathak et al. 2017). In this work, we are the first to investigate the effectiveness of optimizing a gating function via uncertainty estimation in a completely self-supervised manner, instead of common supervised approaches.

3 Background

3.1 Preliminary and Notation

In the reinforcement learning setup, tasks are modeled as a Markov decision process (MDP), where an agent in a state $s_t$ interacts with an environment $E$ by applying an action $a_t$ and receives a reward $r_t$ together with a new state $s_{t+1}$ at each discrete time step $t$. The MDP consists of a state space $S$, an action space $A$, a state transition function $P : S \times A \to S$, a reward function $R : S \times A \to \mathbb{R}$ and a discount factor $\gamma \in (0, 1]$, formalized as $\langle S, A, P, R, \gamma \rangle$. We define the overall return from the state $s_t$ as the discounted cumulative reward $R_t = r_t + \sum_{i=t+1}^{T} \gamma^{i-t} R(s_i, a_i)$. The agent aims to find an optimal policy $\pi : S \to A$, which can be stochastic or deterministic, to maximize the expected return from the initial state, $R^\pi(s_0) = \mathbb{E}_{r_{i \ge 0},\, s_{i \ge 0} \sim E,\, a_{i \ge 0} \sim \pi}[R_0]$.
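As a concrete illustration of the return $R_t$ defined above, the following Python sketch (our own illustrative helper, not code from the paper) computes the discounted return for every step of a finished rollout:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute R_t for every step t of one rollout, i.e. r_t + sum_{i>t} gamma^(i-t) r_i."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Iterate backwards so each return reuses the return of the following step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three steps with gamma = 0.9 -> R_0 = 1 + 0.9*2 + 0.81*3 = 5.23
print(discounted_return([1.0, 2.0, 3.0], gamma=0.9))
```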
Another key concept is the action value, also called the Q-value, which is an estimate of the expected discounted return for selecting action $a_t$ in state $s_t$ and thereafter following policy $\pi$:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{r_{i \ge t},\, s_{i > t} \sim E,\, a_{i > t} \sim \pi}[R_t \mid s_t, a_t]. \quad (1)$$

3.2 Deep Deterministic Policy Gradient (DDPG)

This work builds upon DDPG (Lillicrap et al. 2016), a model-free off-policy algorithm for high-dimensional continuous action domains. DDPG utilizes an actor-critic architecture, including a Q-value function (the critic $Q$) optimized by temporal difference (TD) in the policy evaluation step,

$$L(\theta^Q) = \frac{1}{n} \sum_i \left( r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'}) - Q(s_i, a_i|\theta^Q) \right)^2, \quad (2)$$

and a policy function (the actor $\mu$) updated by the policy gradient (Silver et al. 2014) in the policy improvement step,

$$\nabla_{\theta^\mu} J \approx \frac{1}{n} \sum_i \nabla_a Q(s_i, a|\theta^Q)\big|_{a=\mu(s_i|\theta^\mu)}\, \nabla_{\theta^\mu} \mu(s_i|\theta^\mu), \quad (3)$$

where $Q'$, $\mu'$ are target networks that slowly track $Q$, $\mu$ respectively in each update step,

$$\theta^{Q'} \leftarrow (1-\tau)\theta^{Q'} + \tau\theta^Q, \qquad \theta^{\mu'} \leftarrow (1-\tau)\theta^{\mu'} + \tau\theta^\mu, \quad (4)$$

with $\tau \in (0, 1]$. Benefiting from experience replay and target networks, DDPG is trainable on off-policy data with stability, which brings significant sample efficiency.

Figure 1: Left: Structure of SUM. When the agent receives a state from an unknown environment, the gating network (red) outputs a scheduling vector expecting a specific expert (actor head) with the highest activation to interact. Given the same state, each actor head (blue) produces an action together with the respective Q-value and Q-variance from the paired critic head (green). Based on this information, a transformer (purple) generates a ground-truth scheduling vector to select an expert to interact. Besides, experiences with decayed masks are stored for training both the experts and the gating function. Right: Mechanism of the multi-head critic evaluating during training and the multi-head actor activated by the gating network during testing.

4 Self-Supervised Mixture-of-Experts

In this work, we develop Self-Supervised Mixture-of-Experts (SUM) for multi-task RL (see Figure 1). For clarity of explanation, we first describe Uncertainty-Enhanced Multi-head DDPG as the basic experts of the MoE. Following that, we show how the Self-Supervised Gating Network is trained and used for expert specialization and scheduling.

4.1 Uncertainty-Enhanced Multi-head DDPG

Though vanilla DDPG (Lillicrap et al. 2016) shows impressive performance in continuous control tasks, it is susceptible to the randomness of complex environments, which may lead to data inefficiency and poor generalization. To counteract these adverse effects, we introduce predictive uncertainty estimation into DDPG to capture the uncertainty of states from known and unknown tasks. Specifically, following (Nix and Weigend 1994), which has been demonstrated effective in quantifying uncertainty, we extend the critic network to generate two values in the final layer, corresponding to the predicted mean Q-value $Q(s_i, a_i)$ and the Q-variance $\sigma^2(s_i, a_i)$ (shown in Figure 1). By treating the observed Q-value as a sample from a Gaussian distribution with the predicted mean and variance, the critic is optimized by minimizing the negative log-likelihood (NLL) criterion,

$$L(\theta^Q) = \frac{1}{n} \sum_i \left( \frac{\log \sigma^2(s_i, a_i)}{2} + \frac{(y_i - Q(s_i, a_i))^2}{2\sigma^2(s_i, a_i)} \right), \quad (5)$$

where $y_i$ is the target Q-value and $n$ is the mini-batch size.
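As a minimal sketch of how such a mean-variance critic and the NLL objective of Eq. (5) might be implemented, consider the following PyTorch example. The class and function names (`UncertaintyCritic`, `nll_critic_loss`, `soft_update`) are ours, the layer sizes follow the experimental setup reported later, and the log-variance clamp is our own numerical-safety choice rather than part of the paper:

```python
import torch
import torch.nn as nn

class UncertaintyCritic(nn.Module):
    """One critic head that outputs both a Q-value and a log-variance,
    in the spirit of the mean-variance network of Nix and Weigend (1994)."""
    def __init__(self, state_dim, action_dim, hidden=(256, 256, 128)):
        super().__init__()
        dims = (state_dim + action_dim,) + hidden
        self.body = nn.Sequential(*[layer
                                    for i in range(len(hidden))
                                    for layer in (nn.Linear(dims[i], dims[i + 1]),
                                                  nn.LeakyReLU())])
        self.q_head = nn.Linear(hidden[-1], 1)       # predicted mean Q(s, a)
        self.logvar_head = nn.Linear(hidden[-1], 1)  # predicted log sigma^2(s, a)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.q_head(h), self.logvar_head(h)

def nll_critic_loss(q_pred, log_var, q_target):
    """Negative log-likelihood of Eq. (5): the target Q is treated as a sample
    from N(Q(s, a), sigma^2(s, a))."""
    log_var = torch.clamp(log_var, -10.0, 10.0)  # numerical safety (our choice)
    return (0.5 * log_var + 0.5 * (q_target - q_pred).pow(2) / log_var.exp()).mean()

def soft_update(target_net, net, tau=0.001):
    """Target-network update of Eq. (4): theta' <- (1 - tau) * theta' + tau * theta."""
    for p_t, p in zip(target_net.parameters(), net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```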
With this enhancement, the critic is supervised towards producing not only accurate but also reliable Q-values with well-calibrated uncertainty estimates, which is robust to task shift. Besides, uncertainty-enhanced DDPG can detect and avert overfitting in advance through techniques like early stopping. In multi-task scenarios, using only the Q-value from a critic trained on a specific task is neither sufficient nor reliable for estimating Q-values in different or unseen tasks. However, a critic with auxiliary uncertainty estimation can improve the understanding of generalization and adaptation abilities in unknown domains, which further benefits multi-task performance. Notably, this is a general extension for estimating uncertainty in the environment, requiring few modifications to value-based RL algorithms.

Towards the goal of multi-task learning, we contribute predictive uncertainty estimation to SOUP (Zheng et al. 2018), an algorithm that combines multi-head bootstrapped DDPG with self-adaptive confidence to alleviate spotty Q-values in single-task RL. However, this rough state-based confidence strategy lacks crucial relevant information, such as the actions selected by the respective actors and the intermediate representations in the critics, which hampers its capacity for accurate evaluation, especially in multi-task domains with different complexity and reward schemes. Instead of the crude external uncertainty approximator in SOUP, we directly extend each critic of multi-head DDPG with an auxiliary output unit for the Q-variance. Since this unit has access to all of the crucial information, it has greater potential to yield well-calibrated uncertainty estimation. Note that, unlike SOUP, which selects an action simply by the maximum Q-value, our approach takes not only the Q-value but mainly the Q-variance into consideration, selecting potential actions more effectively and reasonably.

Robust Generalization. Well-calibrated predictive uncertainty estimation assists the agent in robustly averting overconfident incorrect predictions. The multi-head architecture mimics the Deep Ensembles technique (Lakshminarayanan, Pritzel, and Blundell 2017), averaging predictions over multiple models, which significantly improves uncertainty quality and robustness by variance reduction. In this case, the critics can produce more accurate Q-value estimates by ensembling, improving the generalization ability against overfitting. Furthermore, when applied to unknown tasks, highly specialized experts with relatively lower uncertainty show reliably higher confidence in tackling specific states, which benefits effective expert scheduling and enhances the overall generalization capacity across multiple tasks.

4.2 Self-Supervised Gating Network

The core of MoE is an effective gating function that performs expert specialization during training and expert scheduling during testing. We propose a self-supervised gating network calibrated completely by uncertainty estimates from the experts in an end-to-end manner without explicit supervision.

Self-Supervised Training. When the MoE agent with $K$ heads $\{Q, \mu\}_{1,\dots,K}$ interacts with the environment given $s_t$, the gating network $G$, parameterized by $\theta^G$, generates a gating value as a scheduling vector $G(s_t)$, which denotes the different degrees of expected preference for each expert to interact in the current state $s_t$. Note that each actor head in multi-head DDPG is considered an expert.
Following that, each actor head produces an action candidate $a_t^k$ while its paired critic head produces the respective Q-value $Q_k$ and Q-variance $\sigma_k^2$, which represents how uncertain the expert is about its evaluation of the action $a_t^k$. Based on the uncertainty estimates $\sigma^2$, we construct the ground-truth softmax gating $g^*(s_t)$ (Shazeer et al. 2017) as self-supervision:

$$g^*(s_t) = \mathrm{Softmax}(H(s_t)). \quad (6)$$

$H(s_t)$ is a one-hot vector only when one expert specializes and masters exploiting much more reward in $s_t$, i.e., it concurrently has the discriminatively highest Q-value (at least twice that of the others) and the reliably lowest Q-variance; otherwise,

$$H(s_t)_k = \begin{cases} 1, & \sigma_k^2 \in \mathrm{KeepTopX}(\sigma^2, x), \\ 0, & \text{otherwise}, \end{cases} \quad (7)$$

where $\mathrm{KeepTopX}(v, x)$ keeps only the top $x$ values in $v$ (we set $x = 2$). In this case, $H(s_t)$ encourages the gating network to alleviate overfitting by activating experts with relatively high uncertainty, which have more potential to explore and exploit more reward in the state $s_t$ than experts with lower uncertainty, which may be stuck in local optima. In practice, $H(s_t)$ results in a one-hot vector in most cases (85%). The action $a_t$ chosen to interact with $s_t$, receiving a new state $s_{t+1}$ and a reward $r_t$, is given by

$$a_t = \{a_t^k \mid k = \arg\max_k\, g^*(s_t)\}. \quad (8)$$

A new transition $(s_t, a_t, s_{t+1}, r_t, m_t)$ is stored in the replay buffer for experience replay, where $m_t = \{m_t^k \mid m_t^k = g_k^*(s_t)\}$ represents the probabilities of this transition being trained on by the experts. During training, a mini-batch of $n$ samples is collected to optimize both the experts, as in vanilla DDPG, and the gating function, via the Mean Squared Error (MSE)

$$L(\theta^G) = \frac{1}{n} \sum_i \left(g^*(s_i) - G(s_i)\right)^2, \qquad g^*(s_i) = m_i. \quad (9)$$

Algorithm 1: Self-Supervised Mixture-of-Experts
Input: number of heads K, number of environments N, batch size n
Initialize: critic and actor networks with K heads {θ^Q_k, θ^µ_k}_{k=1}^K and copy weights to the target networks; the gating network θ^G; the replay buffer R; multi-task environments {E_i}_{i=1}^N
for episode e = 1, ..., E_max do
    Initialize a Gaussian noise process N for action exploration
    Get state s_0 from a randomly selected environment E_i
    for step t = 1, ..., T_max do
        Generate a scheduling vector G(s_t)
        K actor heads output actions
        K critic heads output Q-values and Q-variances
        Generate the target scheduling vector g*(s_t) by (6)
        Select action a_t according to (8) and apply N
        Execute a_t, then observe reward r_t and state s_{t+1}
        Assign mask m_t = g*(s_t)
        Store transition (s_t, a_t, s_{t+1}, r_t, m_t) in R
        Randomly select the k-th actor and critic heads
        Randomly sample a batch of n transitions
        Update the paired heads µ_k, Q_k according to (3) and (5)
        Update the k-th target networks according to (4)
        Decay x^t_DM according to (10)
        Update the gating network according to (9)
    end for
end for

A more detailed procedure is given in Figure 1 and Algorithm 1.

Why Uncertainty Estimation as Self-Supervision. Different from common supervised MoEs, which demand large task-specific datasets with rich annotations (e.g., classification), exploiting self-supervision from raw data is significant for accomplishing supervision-starved tasks without human intervention. Well-calibrated uncertainty estimation has been demonstrated effective for understanding generalization capacity under task shift and is simple to implement (Lakshminarayanan, Pritzel, and Blundell 2017). It is intuitively suitable for MoE in the multi-task RL domain, where Q-value evaluation alone may be inaccurate and insufficient for decision-making across MDPs. In this situation, uncertainty estimation as a reliable criterion encourages appropriate expert specialization and scheduling, driven by the experts' confidence in a completely self-supervised manner.
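The gating target of Eqs. (6)-(8) can be assembled from the experts' Q-values and Q-variances roughly as follows. This is a sketch under our reading of the text (in particular, the "twice the Q-value of the others and lowest variance" dominance test), and the function names are hypothetical:

```python
import numpy as np

def keep_top_x(values, x):
    """Indices of the x largest entries of `values` (the KeepTopX of Eq. (7))."""
    return set(np.argsort(values)[-x:])

def gating_target(q_values, q_variances, x=2):
    """Build the self-supervision target g*(s_t) of Eqs. (6)-(7) from the K experts."""
    q = np.asarray(q_values, dtype=np.float64)
    var = np.asarray(q_variances, dtype=np.float64)
    k_best = int(np.argmax(q))
    others = np.delete(q, k_best)
    # One expert "masters" s_t: clearly highest Q (here: at least twice the
    # runner-up) and, at the same time, the lowest Q-variance.
    dominant = (q[k_best] >= 2.0 * others.max()) and (k_best == int(np.argmin(var)))
    if dominant:
        h = np.zeros_like(q)
        h[k_best] = 1.0                      # one-hot H(s_t)
    else:
        top = keep_top_x(var, x)             # favour the most uncertain experts
        h = np.array([1.0 if k in top else 0.0 for k in range(len(q))])
    e = np.exp(h - h.max())
    return e / e.sum()                       # g*(s_t) = Softmax(H(s_t)), Eq. (6)

# The acting expert of Eq. (8) is then simply the argmax of the target:
# k = int(np.argmax(gating_target(q_values, q_variances)))
```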
Decayed Mask Experience Replay (DMER). The crux of MoE is balancing expert utilization between diversification and specialization (Shazeer et al. 2017). We notice that the gating network sometimes tends to converge to a situation where it only produces large weights for the same few experts, which hampers the overall multi-task performance. We therefore exploit decayed-masked experience replay, designed to train multiple RL experts diversely. Concretely, we replace $x$ in Eq. (7) with $x_{\mathrm{DM}}^t$, which is initialized to the number of heads $K$ and decays throughout training:

$$x_{\mathrm{DM}}^{t+1} = x_{\mathrm{DM}}^{t}\, \lambda_n, \qquad \lambda_n = \prod_{i=1}^{n} \lambda, \quad (10)$$

where $\lambda$ is the decay rate. In the early period, experts are trained on the entire replay buffer to acquire common behaviors for tackling basic tasks, which leads to diversification and alleviates the risk of imbalanced capacity and even single-expert domination. As $x_{\mathrm{DM}}^t$ decreases, experts equipped with these fundamental skills are provided different experiences according to the diverse masks generated from their uncertainty estimates, and they specialize in different directions.

Data Efficiency. The bootstrap technique has been demonstrated effective for deep exploration (Osband et al. 2016). DMER applies a masking mechanism similar to bootstrap, which induces diversity for efficient exploration, though to varying degrees during different stages, from early diversification to late specialization. This breaks the limitation of the common task-specific design in which experts are only trained on assigned tasks with a risk of overfitting. Besides, under an architecture with multiple experts sharing parameters, common knowledge across related tasks can be diffused and transferred quickly, which significantly improves data efficiency and the overall multi-task performance.
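One way the mask bookkeeping and the decay of Eq. (10) could be wired into a replay buffer is sketched below, assuming a multiplicative per-step decay and probability-weighted sampling from the stored masks; the class name and both of these choices are our interpretation, not the released implementation:

```python
import numpy as np

class DecayedMaskReplay:
    """Sketch of decayed-masked experience replay (DMER). Each transition is
    stored with the scheduling target m_t = g*(s_t) as its mask, and the top-x
    threshold used when building that target shrinks over training, so early
    experiences are shared by (almost) all experts while later ones are routed
    to the few most uncertain heads."""

    def __init__(self, num_heads, decay=0.9997, capacity=100000):
        self.x_dm = float(num_heads)   # x_DM: starts at K, decays towards 1
        self.decay = decay
        self.capacity = capacity
        self.buffer = []               # entries: (s, a, s_next, r, mask)

    def current_x(self):
        # Value to plug into KeepTopX of Eq. (7) when the mask is built.
        return max(1, int(round(self.x_dm)))

    def store(self, state, action, next_state, reward, mask):
        self.buffer.append((state, action, next_state, reward, np.asarray(mask)))
        if len(self.buffer) > self.capacity:
            self.buffer.pop(0)
        self.x_dm *= self.decay        # decay step of Eq. (10), once per stored step

    def sample_for_head(self, k, batch_size, rng=np.random):
        # m_t "represents the probabilities of this transition trained by the
        # experts": head k replays each transition with probability ~ mask[k].
        weights = np.array([t[-1][k] for t in self.buffer], dtype=np.float64)
        probs = weights / weights.sum()
        idx = rng.choice(len(self.buffer), size=batch_size, p=probs)
        return [self.buffer[i] for i in idx]
```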
5 Experiments

We evaluate our approach on the continuous control environment MuJoCo (Todorov, Erez, and Tassa 2012) and its multi-task extension (Henderson et al. 2017) (see Figure 2). In most environments, a specific robot is rewarded for moving forward as fast as possible. We demonstrate the effectiveness and analyze the performance of SUM through three sets of experiments:

1. We compare the impacts of uncertainty enhancement in the single-task environment.
2. We further examine the generalization ability in tackling multiple related tasks in difficult situations.
3. We evaluate the data efficiency and expert utilization of our model when learning multiple tasks simultaneously.

Figure 2: Illustrations of basic (top) and extended (bottom) locomotion tasks on MuJoCo: (a) Hopper, (b) Walker2D, (c) HalfCheetah, (d) Humanoid, (e) HopperWall, (f) HopperStairs, (g) HumanoidWall and (h) HumanoidStandup.

In all cases, we use fully-connected networks (see Figure 1), where hidden layer and head layer sizes are denoted by (N, M). Unless otherwise stated, we adopt the network structure and common hyperparameters of (Zheng et al. 2018): (256, 256, 128) for the critic and (256, 128) for the actor, with Leaky ReLU activations. The gating network is (256, 128) with a softmax layer and is updated with a learning rate of 1e-4. These networks are trained by Adam (Kingma and Ba 2015) with a batch size n = 1024. Besides, we fix the decay rate for DMER at λ = 0.9997. Figures 3, 5 and 6 depict the averaged return as lines and the standard deviation (std) of the return as shaded areas over 10k episodes, while Tables 1 and 2 tabulate the mean and std of the cumulative reward across 20 sample rollouts.

Figure 3: Performance of models with different numbers of heads on the Hopper and Walker environments. The shaded area denotes the mean ± the standard deviation. SUM (K = 3, 5) with uncertainty enhancement outperforms the other models in both reward and learning speed by reliably accurate Q-value evaluation.

Figure 4: Generalization in different steps on the Hopper and Walker environments. DK and SK denote DDPG and SUM with K heads. The bar plots indicate that SUM achieves robust generalization with a smaller gap between the test and training performances, which demonstrates that uncertainty enhancement is effective for alleviating overfitting.

5.1 Uncertainty Enhancement

Accurate Q-value Evaluation in Individual Tasks. Figure 3 shows that in both environments, equipped with a single head (K = 1), SUM achieves slightly better performance, and faster, than vanilla DDPG, since SUM is optimized by NLL, which captures uncertainty, instead of MSE. Namely, training with the extra uncertainty objective motivates the critic to be optimized towards producing not only accurate but also reliable Q-values with well-calibrated uncertainty estimates. The auxiliary supervision further stabilizes training, with better performance and fewer oscillations. This positive impact is magnified markedly as the number of expert heads increases, as shown by the comparison between multi-head DDPG and SUM with K = 3 heads. In this case, the shared architecture mimics an ensemble combination, which further achieves high-quality uncertainty estimation and reduces variance for accurate Q-values.

| Half Cheetah | TRPO After Training | TRPO Fully Trained | Δ | SUM After Training | SUM Fully Trained | Δ | SUM Together |
|---|---|---|---|---|---|---|---|
| Small Foot | 898.51 ± 363.85 | 2003.46 ± 933.59 | 55% | 1441 ± 597 | 3460 ± 460 | 58% | 3748 ± 478 |
| Big Foot | 1997.73 ± 101.36 | 2211.92 ± 65.81 | 10% | 1532 ± 580 | 2392 ± 421 | 36% | 2754 ± 389 |
| Small Leg | 1494.03 ± 310.11 | 2327.16 ± 702.69 | 36% | 2243 ± 516 | 4233 ± 322 | 47% | 4175 ± 357 |
| Big Leg | 2101.74 ± 95.98 | 2269.78 ± 95.57 | 7% | 2234 ± 415 | 3147 ± 347 | 29% | 3264 ± 303 |
| Small Thigh | 1672.22 ± 110.11 | 2555.16 ± 96.80 | 35% | 2041 ± 538 | 3346 ± 382 | 39% | 4143 ± 291 |
| Big Thigh | 2345.88 ± 381.33 | 2424.95 ± 94.19 | 3% | 1805 ± 586 | 2465 ± 475 | 27% | 2218 ± 514 |
| Small Torso | 1845.20 ± 86.03 | 2294.72 ± 109.20 | 20% | 1816 ± 425 | 2870 ± 245 | 37% | 2548 ± 387 |
| Big Torso | 2620.46 ± 297.88 | 2686.13 ± 97.96 | 2% | 2809 ± 551 | 3478 ± 423 | 19% | 3737 ± 375 |

Table 1: Average and standard deviation (µ ± σ) of reward across a set of 20 sample rollouts on modified Half Cheetah tasks. The percentage (Δ) represents the ratio of average change between After Training and Fully Trained (learned in order). Compared with TRPO (Henderson et al. 2017), SUM (K = 3) achieves an overall improvement in performance, attacking catastrophic forgetting effectively. Together shows the reward obtained by SUM after training on all tasks simultaneously (without order).

Robust Generalization against Overfitting. To demonstrate the effectiveness of uncertainty enhancement in tackling overfitting, we measure the generalization capacity in individual tasks by the gap between the test and training performances.
Figure 4 reports the averaged performance, evaluated after training in the same environments but with different random seeds. Though multi-head DDPG sometimes performs better during training, it suffers from overfitting during testing under different randomness, since overconfident Q-values mislead the agent into local optima. After training by NLL with uncertainty estimation, SUM generalizes robustly against overfitting, tackling both tasks with less loss in performance when tested. Concretely, DDPG generalizes with performance decreased by a factor α ∈ [0.32, 0.51], while SUM decreases by α ∈ [0.05, 0.28] throughout training.

5.2 Generalization in Difficult Situations

In this experiment, we focus on the generalization ability of SUM in attacking catastrophic forgetting and handling unseen tasks. The environments are a series of Half Cheetah variant tasks with modified body parts, which demand a common and robust behavior adaptive to any variant.

Generalization against Catastrophic Forgetting. Due to the susceptibility of RL agents to different reward schemes and catastrophic forgetting, it is difficult to achieve overall satisfying performance across all tasks simultaneously. We train SUM consecutively on each environment in the order listed in Table 1. After training on a specific task, we evaluate SUM immediately on that environment across 20 sample rollouts, denoted by After Training (having seen all the previous environments). We repeat this procedure and finally evaluate the reward across 20 sample rollouts on each environment, denoted by Fully Trained (having seen all the environments). We measure the generalization capacity against catastrophic forgetting by the gap between After Training and Fully Trained on all tasks, as in (Henderson et al. 2017). To ensure comparability, we adopt the same network with hidden layers (100, 50, 25).

Table 1 shows the difference between TRPO and SUM in attacking catastrophic forgetting when learning in order. We ascribe this gap to the difficulty of tackling task shift with only one single policy at risk of overfitting, which may limit the improvement of performance over multiple tasks. As an auxiliary criterion, uncertainty estimation works by capturing the generalization of the different experts for effective expert scheduling. With this enhancement, SUM outperforms TRPO not only with substantially higher reward but also with robust generalization against catastrophic forgetting. Moreover, it is also feasible for SUM to learn concurrently, without any order, and achieve a satisfying performance, which means SUM can be trained on samples from different tasks with varying reward schemes simultaneously (shown as Together). In this situation, SUM can capture each expert's uncertainty estimates of states from various domains. On the one hand, experts are trained with a shared structure for efficient knowledge sharing.
On the other hand, the gating network, self-supervised by uncertainty estimation, can determine the most reliably potential expert to tackle the task.

Generalization in Unseen Environments. Table 2 shows that, when tested on unseen environments, SUM can generalize robustly based on the common knowledge learned before. In particular, Half Cheetah Wall-v0 rewards the agent differently, for stepping over a wall, and common multi-task agents tested on this unseen environment always get stuck in front of the wall. However, SUM trained on the modified-body variant tasks achieves performance competitive with single-task learning. Utilizing self-supervised expert scheduling, experts with different potential for tackling each state are allocated more appropriately and effectively.

| Env. Train | Reward (3M) | Env. Test | Reward |
|---|---|---|---|
| Small Foot | 5628 ± 139 | Small Leg | 5298 ± 288 |
| Big Foot | 4487 ± 289 | Big Leg | 5083 ± 134 |
| Small Thigh | 5025 ± 137 | Small Torso | 5112 ± 207 |
| Big Thigh | 5644 ± 171 | Big Torso | 5391 ± 167 |
| Small Foot | 5058 ± 490 | Big Foot | 4326 ± 333 |
| Small Leg | 5822 ± 367 | Big Leg | 5741 ± 132 |
| Small Thigh | 5346 ± 436 | Big Thigh | 5681 ± 325 |
| Small Torso | 5432 ± 258 | Big Torso | 5837 ± 360 |
| Wall | 4304 ± 521 | Wall | 4017 ± 501 |
| All Envs | 3525 ± 614 | Wall | 3955 ± 451 |

Table 2: Performance of SUM (K = 3) tested on unseen tasks after 3M training steps on the training environment. The last row shows that performance on a different environment, Wall-v0, after training on all modified-body tasks, is competitive with that of an agent trained on the individual Wall-v0 task.

5.3 Multi-task Performance

We conduct two groups of multi-task experiments: Hopper, which learns to hop on flat ground, over walls, and up stairs, and Humanoid, which learns to stand up, walk, and pass a wall (see Figure 2). We compare SUM with SOUP (Zheng et al. 2018), an approach based on multi-head DDPG with a confidence strategy, to clearly emphasize the strengths of self-supervised expert scheduling and DMER. Figures 5 and 6 show that SUM not only accelerates training but also achieves more cumulative reward through data efficiency. Under the multi-head MoE architecture, common representations and knowledge are shared and diffused quickly, which boosts learning with fewer experiences. Furthermore, the experts are well generalized for scheduling, which improves performance.

Figure 5: Performance of models with K = 3 heads on learning multiple Hopper tasks (Hopper, Hopper Wall, Hopper Stairs) simultaneously, without order and without any prior knowledge (i.e., the distribution of all tasks). SUM with DMER outperforms the others in both reward and learning speed.

Figure 6: Performance of models with K = 5 heads on learning multiple Humanoid tasks (Humanoid Standup, Humanoid, Humanoid Wall, Humanoid Standup Run Wall) concurrently, without order and without any prior knowledge (i.e., the distribution of all tasks). SUM with DMER outperforms the others in both reward and learning speed.

Expert Utilization and Specialization. Note that Standup Run Wall is an extremely challenging task, since one of its component environments, Humanoid Standup, provides denser rewards via a different reward scheme that makes a robot stand up as fast as possible. Though we apply reward scaling techniques, SUM and SOUP suffer from imbalanced expert utilization, dominated by the expert that is always trained with rewards from Humanoid Standup.

Figure 7: Expert utilization of different models during testing, shown by the use frequencies over 1K steps. DMER motivates relatively more even utilization and better specialization.
To counteract these adverse effects, SUM solves the problem on the strength of decayed mask experience replay (DMER). In the early period, DMER encourages experts to acquire basic behaviors with equal access to the overall experiences, exploring in various directions. In this case, unlike the ensemble evaluation in SOUP, which risks single-head and individual-task domination, no expert can dominate the training in SUM with DMER. In the late period, samples are masked according to the uncertainty estimates from the experts, which works similarly to the bootstrap technique, limiting each expert to adapting to fewer, diverse environments. This further leads to expert specialization for efficient expert scheduling in tackling multiple tasks.

Figure 7 illustrates that SOUP suffers from single-head domination. Though SUM avoids this issue, imbalanced expert utilization still hampers its performance. Taking full advantage of DMER, SUM can motivate an expert to specialize in one or two tasks, averting domination. Figure 7 also shows that SUM learns to accomplish each task by activating several heads more frequently and tackles the Standup Run Wall task with balanced utilization and robust generalization.

6 Conclusion

In this paper, we propose SUM, an effective algorithm for multi-task reinforcement learning, improving both data efficiency in each task and generalization ability across multiple tasks. SUM utilizes multi-head DDPG as experts enhanced by predictive uncertainty estimation, which improves generalization capacity against overfitting. A self-supervised gating network trained by uncertainty feedback from the experts is exploited to achieve efficient expert scheduling, which improves data efficiency and performance across multiple tasks. Moreover, to alleviate imbalanced expert utilization, we adopt decayed mask experience replay to motivate early diversification and late specialization. To the best of our knowledge, our approach is the first work to investigate the effectiveness of exploiting predictive uncertainty estimation in multi-task reinforcement learning in an end-to-end self-supervised manner. We demonstrate the effectiveness, performance and robust generalization ability of our algorithm on extended MuJoCo multi-task environments, especially in difficult situations.

7 Acknowledgments

This work is supported by the NSFC project (No. U1833101), the Shenzhen Foundational Research Project (No. JCYJ20160428182137473) and the Joint Research Center of Tencent & Tsinghua University.

References

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv:1606.06565.
Borsa, D.; Graepel, T.; and Shawe-Taylor, J. 2016. Learning shared representations in multi-task reinforcement learning. arXiv:1603.02041.
Caruana, R. 1997. Multitask learning. Machine Learning 28(1).
Dearden, R.; Friedman, N.; and Russell, S. 1998. Bayesian Q-learning. In AAAI/IAAI.
Doersch, C., and Zisserman, A. 2017. Multi-task self-supervised visual learning. In ICCV.
Donahue, J.; Krähenbühl, P.; and Darrell, T. 2017. Adversarial feature learning. In ICLR.
Finn, C.; Yu, T.; Fu, J.; Abbeel, P.; and Levine, S. 2017. Generalizing skills with semi-supervised reinforcement learning. In ICLR.
Gu, S.; Holly, E.; Lillicrap, T.; and Levine, S. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In ICRA.
Henderson, P.; Chang, W.-D.; Shkurti, F.; Hansen, J.; Meger, D.; and Dudek, G. 2017. Benchmark environments for multitask learning in continuous domains. arXiv:1708.04352.
Jacobs, R. A.; Jordan, M. I.; Nowlan, S. J.; and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Computation 3(1).
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6(2).
Kahn, G.; Villaflor, A.; Pong, V.; Abbeel, P.; and Levine, S. 2017. Uncertainty-aware reinforcement learning for collision avoidance. arXiv:1702.01182.
Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
Lakshminarayanan, B.; Pritzel, A.; and Blundell, C. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS.
Lazaric, A., and Ghavamzadeh, M. 2010. Bayesian multi-task reinforcement learning. In ICML.
Lazaric, A. 2012. Transfer in reinforcement learning: a framework and a survey.
Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. In ICLR.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540).
Nix, D. A., and Weigend, A. S. 1994. Estimating the mean and variance of the target probability distribution. In Proceedings of the 1994 IEEE International Conference on Neural Networks (IEEE World Congress on Computational Intelligence), volume 1.
Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep exploration via bootstrapped DQN. In NIPS.
Parisotto, E.; Ba, J. L.; and Salakhutdinov, R. 2015. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv:1511.06342.
Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. In ICML.
Rusu, A. A.; Colmenarejo, S. G.; Gulcehre, C.; Desjardins, G.; Kirkpatrick, J.; Pascanu, R.; Mnih, V.; Kavukcuoglu, K.; and Hadsell, R. 2015. Policy distillation. arXiv:1511.06295.
Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive neural networks. arXiv:1606.04671.
Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; and Dean, J. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR.
Shelhamer, E.; Mahmoudieh, P.; Argus, M.; and Darrell, T. 2016. Loss is its own reward: Self-supervision for reinforcement learning. arXiv:1612.07307.
Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In ICML.
Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature 550(7676).
Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction, volume 1.
Teh, Y.; Bapst, V.; Czarnecki, W. M.; Quan, J.; Kirkpatrick, J.; Hadsell, R.; Heess, N.; and Pascanu, R. 2017. Distral: Robust multitask reinforcement learning. In NIPS.
Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In IROS.
Wilson, A.; Fern, A.; Ray, S.; and Tadepalli, P. 2007. Multi-task reinforcement learning: A hierarchical Bayesian approach. In ICML.
Zhang, C.; Vinyals, O.; Munos, R.; and Bengio, S. 2018. A study on overfitting in deep reinforcement learning. arXiv:1804.06893.
Zheng, Z.; Yuan, C.; Lin, Z.; Cheng, Y.; and Wu, H. 2018. Self-adaptive double bootstrapped DDPG. In IJCAI.