# Toward Robust Long Range Policy Transfer

Wei-Cheng Tseng¹, Jin-Siang Lin¹, Yao-Min Feng¹, Min Sun¹²³
¹National Tsing Hua University, ²Appier Inc., Taiwan, ³MOST Joint Research Center for AI Technology and All Vista Healthcare, Taiwan
{weichengtseng, linjinsiang, yaominlouis}@gapp.nthu.edu.tw, sunmin@ee.nthu.edu.tw

## Abstract

Humans can master a new task within a few trials by drawing upon skills acquired through prior experience. To mimic this capability, hierarchical models that combine primitive policies learned from prior tasks have been proposed. However, these methods fall short of humans' range of transferability. We propose a method that leverages the hierarchical structure to alternately train the combination function and adapt the set of diverse primitive policies, so as to efficiently produce a range of complex behaviors on challenging new tasks. We also design two regularization terms to improve the diversity and utilization rate of the primitives in the pre-training phase. We demonstrate that our method outperforms other recent policy transfer methods by combining and adapting these reusable primitives in tasks with continuous action spaces. The experimental results further show that our approach provides a broader transferring range. The ablation study also shows that the regularization terms are critical for long range policy transfer. Finally, we show that our method consistently outperforms other methods when the quality of the primitives varies.

## 1 Introduction

Reinforcement learning (RL) has achieved success in various applications, such as game playing (Brockman et al. 2016; Silver et al. 2017; Mnih et al. 2015), robotics control (Tassa et al. 2018), molecule design (You et al. 2018), and computer system optimization (Mao et al. 2019a,b). Typically, researchers use RL to solve each task independently and from scratch, which leads to low sample efficiency. Moreover, compared with humans, the transferability of RL is limited: humans can learn to solve complex continuous problems (where both the state space and action space are continuous) efficiently by utilizing prior knowledge. In this work, we want agents to efficiently solve complex continuous problems by exploiting prior experience that provides structured exploration based on effective representations.

To this end, we formulate transfer learning in RL as follows. We train a policy with one of the RL optimization strategies on the pre-training task. Then, we intend to leverage that policy to master the transferring task. However, transfer learning in RL may face some fundamental problems. First, unlike supervised learning, the transitions and trajectories are sampled during the training phase based on the interacting policy (Rothfuss et al. 2019). Since the reward distributions differ between the pre-training task and the transferring task, directly finetuning the pre-trained policy on transferring tasks may make the agent perform biased structured exploration and get stuck in many low-reward trajectories. Second, dynamics shifts between pre-training and transferring tasks may induce the pre-trained policy to perform unstructured exploration (Clavera et al. 2019; Nachum et al. 2020). Although domain randomization (Tobin et al. 2017; Nachum et al. 2020) in the pre-training phase may mitigate this problem, we prefer the pre-trained policies to gradually fit the transferring tasks.
Some methods intend to limit the dependency between the pre-training policy and task-specific information (Goyal et al. 2019; Galashov et al. 2019) by using an information bottleneck (Alemi et al. 2017) and variational inference (Kingma and Welling 2013). That way, the pre-training policy does not overfit to a specific task and can be transferred to other tasks. Other methods achieve task transfer by embedding tasks into a latent distribution (Merel et al. 2019; Hausman et al. 2018); however, the latent distribution should be smooth and contain a diverse set of tasks for the transferred behavior to perform well. Some works propose a hierarchical policy (Frans et al. 2018; Bacon, Harb, and Precup 2017; Peng et al. 2019), which contains a combination function that controls how to select or combine a set of primitives. Given a set of task-agnostic primitives, those works acquire a new selection or combination strategy that controls the primitives to master the transferring task. We find that the hierarchical architecture has the potential to enable a better transferring range in continuous control problems.

We propose a transfer learning method for RL. Our pre-training method leverages the existing hierarchical structure of a policy consisting of a combination function and a set of primitive policies. We also design objectives to encourage the set of primitives to be diverse and more evenly utilized in the pre-training tasks. Notice that we do not use reference data, since we expect our method to be generally applicable to all tasks; in many cases, such as flying creatures (Won, Park, and Lee 2018), the Laikago robot¹, or the D'Kitty robot², reference data is hard to obtain. During the transferring phase, we alternately train the combination function and the primitive policies. This procedure makes training not only stable but also flexible in exploration. When training the combination function while freezing the primitives in the transferring phase, it exploits the benefit of the hierarchical structure, which abstracts the exploration space. When training the primitives while fixing the combination function, the primitives can be adapted to the transferring task. In our experiments, we demonstrate that training the hierarchical policy with our method significantly increases sample efficiency compared to previous work (Peng et al. 2019). Moreover, our method provides a better transferring range. We also provide an ablation study to discuss the effectiveness of our regularization terms. Finally, we show that under different resource constraints on training the pre-training policy, our method still outperforms other methods. The source code is available to the public³.

Figure 1: (a) Our motivating example for RL transfer. The green ball represents the target position, which is sampled from the task's goal distribution. The goal directions of the pre-training task and the transferring task are quite different. (b) The hierarchical policy architecture.

¹http://www.unitree.cc/e/action/Show Info.php?classid=6&id=1
²https://www.trossenrobotics.com/d-kitty.aspx
³https://weichengtseng.github.io/project website/aaai21

## 2 Preliminaries

We consider a multi-task RL framework for transfer learning, consisting of a set of pre-training tasks and transferring tasks. An agent is trained from scratch on the pre-training tasks. Then, it applies the skills acquired during pre-training to the transferring tasks.
Our objective is to obtain and leverage a set of reusable skills learned from the pre-training tasks so that the agent can explore efficiently and be more effective at the subsequent transferring tasks.

We denote $s$ as a state, $a$ as an action, $r$ as a reward, and $\tau$ as a trajectory consisting of states and actions. Each task is represented by a dynamics model $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$ and a reward function $r_t = r(s_t, a_t, g)$, where $g$ is the task-specific goal, such as a target location that the agent intends to reach or a terrain that the agent needs to traverse. In multi-task RL, goals $\{g\}$ are sampled from a distribution $p(g)$. Given a goal $g$, a trajectory $\tau = \{s_0, a_0, s_1, \ldots, s_T\}$ with time horizon $T$ is sampled from a policy $\pi(a|s, g)$. Our objective is to learn an optimal policy $\pi$ that maximizes its expected return

$$J(\pi) = \mathbb{E}_{g \sim p(g),\, \tau \sim p_\pi(\tau|g)}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

over the distribution of goals $p(g)$ and trajectories $p_\pi(\tau|g)$, where $\gamma \in [0, 1]$ is the discount factor. The probability of the trajectory $\tau$ is calculated as follows:

$$p_\pi(\tau|g) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t, g) \tag{1}$$

where $p(s_0)$ is the probability of the initial state $s_0$. In transfer learning, although the state and action spaces are the same, the goal distributions, reward functions, and dynamics models of the pre-training and transferring tasks may differ. The difference between the pre-training and transferring tasks is referred to as the range of transfer. Note that a successful transfer cannot be expected for totally unrelated tasks. We consider the scenario where the pre-training tasks allow the agent to learn information relevant to the subsequent transferring tasks, but may not cover the entire set of skills that are useful for those tasks.

## 3 Method

We first describe a motivating example in Sec. 3.1. Then we introduce our method in Sec. 3.2. Finally, we show how to apply our method to an existing hierarchical policy in Sec. 3.3.

### 3.1 A Motivating Example

Let's consider a model-free RL scenario. In the pre-training task, an ant needs to reach a goal position to receive a reward, and the goal position is sampled from a half-circle in front of the ant. In the transferring task, the distribution of goal positions is changed to a small arc behind the ant that does not overlap with the goal positions of the pre-training task (see Fig. 1a). Intuitively, this is a challenging transferring scenario. There are two straightforward methods to tackle this problem. One is to directly finetune the pre-trained policy on the transferring task. However, the goal distribution in the pre-training phase biases the pre-trained policy toward moving forward, which conflicts with the goal distribution of the transferring task that requires moving backward. Therefore, direct finetuning may corrupt the information learned from the pre-training task, which is known as catastrophic forgetting. The other is to train a new policy from scratch on the transferring task. Training from scratch can eventually reach a good solution, but it may need many trials. Our proposed method aims to address the drawbacks of these two basic methods.
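To make the preliminaries concrete, the following is a minimal, hypothetical sketch of sampling a trajectory under a goal-conditioned policy and Monte-Carlo estimating the expected return $J(\pi)$ from Eq. (1). The `env`, `policy`, and `goal_sampler` interfaces are illustrative assumptions for this sketch and are not part of the paper's released code.

```python
# Sketch only: `env.reset(goal)` returns s_0, `env.step(a)` returns (s_{t+1}, r_t, done),
# and `policy.sample(s, g)` draws a ~ pi(a | s, g). These interfaces are assumed.

def rollout(env, policy, goal, horizon, gamma=0.99):
    """Sample tau = {s_0, a_0, ..., s_T} for one goal and compute its discounted return."""
    s = env.reset(goal)
    trajectory, ret = [s], 0.0
    for t in range(horizon):
        a = policy.sample(s, goal)        # a_t ~ pi(a_t | s_t, g)
        s, r, done = env.step(a)          # s_{t+1} ~ p(. | s_t, a_t), r_t = r(s_t, a_t, g)
        trajectory += [a, s]
        ret += (gamma ** t) * r           # accumulate sum_t gamma^t r_t
        if done:
            break
    return trajectory, ret

def estimate_return(env, policy, goal_sampler, horizon, n_episodes=100):
    """Monte-Carlo estimate of J(pi) = E_{g ~ p(g), tau ~ p_pi(tau|g)}[sum_t gamma^t r_t]."""
    returns = [rollout(env, policy, goal_sampler(), horizon)[1] for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```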
Algorithm 1: Full Algorithm

- Pre-training phase:
  - Initialize $\theta_{1:k}$ and $\phi$.
  - Let $J_{RL}(\theta_{1:k}, \phi)$ be the objective function of some specific RL optimization.
  - While not converged: train the combination function $\phi$ and the primitives $\theta_{1:k}$ with $J_{pre}(\theta_{1:k}, \phi) = J_{RL}(\theta_{1:k}, \phi) + \alpha J(\pi_{1:k}) + \beta J(w_{1:k})$.
- Transferring phase:
  - Reinitialize $\phi$.
  - While not converged:
    - Disable the gradients of the primitives; enable the gradients of the combination function.
    - For $i = 1 : p$: train the combination function $\phi$ with $J_{transfer}(\theta_{1:k}, \phi) = J_{RL}(\theta_{1:k}, \phi)$.
    - Disable the gradients of the combination function; enable the gradients of the primitives.
    - For $i = 1 : p$: train the primitives $\theta_{1:k}$ with $J_{transfer}(\theta_{1:k}, \phi) = J_{RL}(\theta_{1:k}, \phi)$.

### 3.2 Our Method

We describe our method in this section. Our policy architecture contains: 1) a set of $k$ primitive policies $\pi_{\theta_1}(a|s, g), \pi_{\theta_2}(a|s, g), \ldots, \pi_{\theta_k}(a|s, g)$ with parameters $\theta_{1:k}$, where each primitive is an independent policy that outputs an action distribution conditioned on $s$ and $g$; and 2) a combination function $C_\phi(s, g)$ with parameters $\phi$ that outputs weights $w_{1:k}$, where $w_i$ specifies the importance of primitive $\pi_{\theta_i}$. The composition function $F(\pi_{1:k}, w_{1:k})$ specifies how to combine the primitives with the weights $w_{1:k}$. Typically, the larger the weight $w_i$, the more primitive $\pi_{\theta_i}$ contributes.

During the pre-training phase, the hierarchical policy is trained end-to-end with an off-the-shelf RL optimization method. Our goal is to learn a set of primitives such that the combination function can compose them to form a complete behavior. However, not all primitives learned in pre-training are equally good for long range transfer. We identify two issues to be addressed. First, the diversity of the primitives is essential to improve transferability. Without encouraging diversity, some of the primitives may learn similar behaviors that better fit the pre-training tasks but hurt transferability. Second, when more primitives are introduced to compose more complex behaviors and improve transferability, the utilization rate of each primitive varies more. In some extreme cases, some primitives may seldom or never be used in the pre-training phase, so these primitives are not updated by RL optimization. Hence, it is important to encourage the utilization rates of the primitives to be more evenly distributed.

To mitigate these two issues, we propose two regularization terms. The first regularization term separates the distributions of the primitives from each other so that the primitives become diverse. An intuitive way to measure the difference between two probability distributions is the KL divergence. Therefore, we use the average KL divergence over all pairs of primitives as our regularization term, which we call Diversity Regularization (DR):

$$J(\pi_{\theta_{1:k}}) = \frac{1}{k(k-1)} \sum_{i \neq j} D_{KL}\left(\pi_{\theta_i} \,\|\, \pi_{\theta_j}\right)$$
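The paper describes the architecture abstractly: primitives $\pi_{\theta_i}$, a combination function $C_\phi$, and a composition function $F$ whose exact form is given later in Sec. 3.3 (not included in this excerpt). The sketch below is one plausible PyTorch instantiation, assuming diagonal-Gaussian primitives and, purely as a placeholder for $F$, a weight-averaged Gaussian; it is not the authors' implementation, and all class and layer names are illustrative.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class Primitive(nn.Module):
    """One primitive pi_theta_i(a | s, g): an independent Gaussian policy (assumed form)."""
    def __init__(self, obs_dim, goal_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + goal_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, s, g):
        mu = self.net(torch.cat([s, g], dim=-1))
        return Normal(mu, self.log_std.exp())

class HierarchicalPolicy(nn.Module):
    """Combination function C_phi(s, g) produces weights w_{1:k} over k primitives."""
    def __init__(self, obs_dim, goal_dim, act_dim, k, hidden=64):
        super().__init__()
        self.primitives = nn.ModuleList(
            [Primitive(obs_dim, goal_dim, act_dim, hidden) for _ in range(k)])
        self.combiner = nn.Sequential(nn.Linear(obs_dim + goal_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, k))

    def forward(self, s, g):
        w = torch.softmax(self.combiner(torch.cat([s, g], dim=-1)), dim=-1)  # w_{1:k}
        dists = [p(s, g) for p in self.primitives]
        # Placeholder for F(pi_{1:k}, w_{1:k}): a weight-averaged Gaussian. The paper's
        # actual composition function is defined in Sec. 3.3 and may differ.
        mu = sum(w[..., i:i + 1] * d.mean for i, d in enumerate(dists))
        std = sum(w[..., i:i + 1] * d.stddev for i, d in enumerate(dists))
        return Normal(mu, std), w, dists
```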
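The alternating transfer phase of Algorithm 1 can be mirrored with simple parameter freezing. In the sketch below, `rl_update(policy)` is a hypothetical stand-in for one update step of whichever RL optimizer defines $J_{RL}$ (it is assumed to update only parameters with gradients enabled), and the `HierarchicalPolicy` class refers to the sketch above.

```python
def transfer(policy, rl_update, p, n_rounds):
    """Sketch of the transferring phase of Algorithm 1 (assumed interfaces)."""
    # Reinitialize the combination function phi before transferring.
    for m in policy.combiner:
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()

    for _ in range(n_rounds):                      # "while not converged" in practice
        # Train the combination function phi with the primitives theta_{1:k} frozen.
        policy.primitives.requires_grad_(False)
        policy.combiner.requires_grad_(True)
        for _ in range(p):
            rl_update(policy)
        # Train the primitives theta_{1:k} with the combination function fixed.
        policy.combiner.requires_grad_(False)
        policy.primitives.requires_grad_(True)
        for _ in range(p):
            rl_update(policy)
```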
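The DR term can be estimated from the per-state action distributions of the primitives. The sketch below assumes diagonal-Gaussian primitives (as in the architecture sketch above), for which `torch.distributions` provides a closed-form KL divergence; the authors' exact estimator may differ. The resulting scalar would be added to the RL objective with weight $\alpha$, as in $J_{pre}$ of Algorithm 1.

```python
import torch
from torch.distributions import kl_divergence

def diversity_regularization(dists):
    """Average pairwise KL over k primitive action distributions at a batch of states:
    (1 / (k(k-1))) * sum_{i != j} KL(pi_theta_i || pi_theta_j)."""
    k, terms = len(dists), []
    for i in range(k):
        for j in range(k):
            if i != j:
                # For diagonal Gaussians, kl_divergence is element-wise per action
                # dimension: sum over dimensions, then average over the batch.
                terms.append(kl_divergence(dists[i], dists[j]).sum(-1).mean())
    return torch.stack(terms).mean()   # mean over k(k-1) terms = sum / (k(k-1))
```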