# Task Transfer by Preference-Based Cost Learning

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu
Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, National Lab for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China; Tencent AI Lab, Shenzhen, Guangdong, China
{jmx16, maxj14}@mails.tsinghua.edu.cn, {fcsun, hpliu}@tsinghua.edu.cn, hwenbing@126.com
(These two authors contributed equally.)

## Abstract

The goal of task transfer in reinforcement learning is to migrate the action policy of an agent from a source task to a target task. Despite their success on robotic action planning, current methods mostly rely on two requirements: exactly-relevant expert demonstrations or an explicitly-coded cost function on the target task, both of which are inconvenient to obtain in practice. In this paper, we relax these two strong conditions by developing a novel task transfer framework in which expert preference is applied as guidance. In particular, we alternate the following two steps: first, experts apply pre-defined preference rules to select demonstrations relevant to the target task; second, based on the selection result, we learn the target cost function and trajectory distribution simultaneously via an enhanced Adversarial MaxEnt IRL, and generate more trajectories from the learned target distribution for the next round of preference selection. We provide a theoretical analysis of the distribution learning and of the convergence of the proposed algorithm. Extensive simulations on several benchmarks further verify the effectiveness of the proposed method.

Figure 1: Problem statement and method overview. As an example, we want to transfer a multi-joint robot from moving in arbitrary directions (basic task) to moving forward (target task). Our preference-based task transfer framework iterates between two steps: 1. querying the expert for preference-based selection; 2. learning the distribution and cost simultaneously from the selected samples, performing policy optimization, and re-generating more samples that follow the same distribution as the selected ones.

## 1 Introduction

Imitation learning has become an incredibly convenient scheme to teach robots skills for specific tasks (Wang et al. 2017; Pathak et al. 2018; Yu et al. 2018; Stadie, Abbeel, and Sutskever 2017; Sermanet et al. 2018; Edmonds et al. 2017). It is often achieved by showing the robot various expert trajectories of state-action pairs. Existing imitation methods like MAML (Finn, Abbeel, and Levine 2017) and One-Shot Imitation Learning (Duan et al. 2017) require perfect demonstrations, in the sense that the experts should perform exactly as they expect the robot to. However, this requirement may not always hold, since collecting exactly-relevant demonstrations is resource-consuming. One possible relaxation is to assume that the expert performs a basic task that is related but not necessarily identical to the target task (sharing some common features, parts, etc.). This relaxation, at the very least, reduces the human effort of collecting demonstrations and enriches the diversity of the demonstrations for task transfer. For example, in Figure 1, the
expert demonstrations contain agent movements along arbitrary directions, while the desired target is to move along only one specified direction. Clearly, learning the target action policy from such relaxed expert demonstrations does not come for free; more advanced strategies are required to transfer the action policy from the demonstrations to the target task.

The work of (De Gemmis et al. 2009) suggests that using expert preference as a supervisory signal can achieve nearly optimal learning results. Here, preference refers to the highly abstract evaluation rules or choice tendencies of a human when comparing and selecting among data samples. Indeed, the preference mechanism has been applied in many other scenarios, such as complex task learning (Wirth et al. 2017), policy updating (Christiano et al. 2017), and policy optimization combined with Inverse Reinforcement Learning (IRL) (Wirth and Fürnkranz 2013), to name a few. However, previous preference-based methods mainly focus on learning the utility function behind each comparison, and the distribution of trajectories is never studied. This is inadequate for task transfer. The importance of modeling the distribution comes from two aspects: 1. learning the trajectory distribution plays a critical role in preference-based selection, which will be discussed later; 2. with the distribution, it is more convenient to provide a theoretical analysis of the efficiency and stability of the task transfer algorithm (see Section 3.4).

In this work, we approach task transfer by utilizing expert preference in a principled way. We first model preference selection as rejection sampling, where a hidden cost is proposed to compute the acceptance probability. After selection, we learn the distribution of the target trajectories based on the preferred demonstrations. Since the candidate demonstrations would usually be insufficient after selection, we augment them with samples from the currently learned trajectory distribution and perform preference selection and distribution learning iteratively. The distribution here acts as the knowledge on which we perform the transfer. Our theoretical derivations prove that the preference improves after each iteration and that the target distribution eventually converges.

As the core of our framework, the trajectory distribution and cost learning are based on, but advance, Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL) (Ziebart et al. 2008) and its adversarial version (Finn et al. 2016). The MaxEnt IRL framework models the trajectory distribution as the exponential of an explicitly-coded cost function. Nevertheless, in MaxEnt IRL, computing the partition function requires MCMC or iterative optimization, which is time-consuming and numerically unstable. Adversarial MaxEnt IRL avoids computing the partition function by casting the whole IRL problem into the optimization of a generator and a discriminator. Although Adversarial MaxEnt IRL is more flexible, it does not deliver any explicit form of the cost function, which is crucial for downstream applications and policy learning. Our method enhances the original Adversarial MaxEnt IRL by redefining the samples from the trajectory level to the state-action level and deriving the cost function from the outputs of the discriminator and generator.
With the cost function, we can optimize the generator with any off-the-shelf reinforcement learning method, and the optimal generator can then be used as a policy on the target task.

To summarize, our key contributions are as follows:

1. We propose to perform imitation learning from related but not exactly-relevant demonstrations by making use of expert preference-based selection.
2. We enhance the Adversarial MaxEnt IRL framework to learn the trajectory distribution and cost function simultaneously.
3. We provide theoretical analyses that guarantee the convergence of the proposed task transfer framework.

Considerable experimental evaluations demonstrate that our method obtains results comparable to algorithms that require accurate demonstrations or costs.

## 2 Preliminaries

This section reviews fundamental concepts and introduces work related to our method. We first provide the key notations used in this paper.

**Notations.** To model the action decision procedure of an agent, we use a Markov Decision Process (MDP) without reward, $(S, A, \mathcal{T}, \gamma, \mu)$, where $S$ denotes the set of states that can be acquired from the environment; $A$ denotes the set of actions controlled by the agent; $\mathcal{T} = p(s' \mid s, a)$ denotes the transition probability from state $s$ to $s'$ under action $a$; $\gamma \in [0, 1)$ is a discount factor; $\mu$ is the distribution of the initial state $s_0$; and $\pi(a \mid s)$ defines the policy. A trajectory is given by a sequence of state-action pairs $\tau_i = \{(s^{(i)}_0, a^{(i)}_0), (s^{(i)}_1, a^{(i)}_1), \dots\}$. We define the cost function parameterized by $\theta$ over a state-action pair as $c_\theta(a, s)$, and the cost over a trajectory as $C_\theta(\tau_i) = \sum_t c_\theta(a^{(i)}_t, s^{(i)}_t)$, where $t$ is the time step. A trajectory set is formed by $n$ expert demonstrations, i.e., $B = \{\tau_i\}_{i=1}^n$.

### 2.1 MaxEnt IRL

Given a demonstration set $B$, Inverse Reinforcement Learning (IRL) (Ng, Russell, and others 2000) seeks to learn the optimal parameters $\theta$ of the cost function $C_\theta(\tau_i)$. The solution may not be unique when the demonstrations are insufficient. MaxEnt IRL (Ziebart 2010; Boularias, Kober, and Peters 2011) handles this ambiguity by training the parameters to maximize the entropy over trajectories, leading to the optimization problem

$$
\max_{p}\; -\sum_{\tau} p(\tau)\log p(\tau) \quad \text{s.t.}\quad \mathbb{E}_{p(\tau)}[C_\theta(\tau_i)] = \mathbb{E}_{p_E(\tau)}[C_\theta(\tau_i)],\;\; \forall \tau_i \in B,\qquad \sum_i p(\tau_i) = 1,\quad p(\tau_i) \ge 0. \tag{1}
$$

Here $p(\tau)$ is the distribution of trajectories, $p_E(\tau)$ is the probability of an expert trajectory, and $\mathbb{E}[\cdot]$ computes the expectation. The optimal $p(\tau)$ is derived to be the Boltzmann distribution associated with the cost $C_\theta(\tau)$, namely

$$
p(\tau) = \frac{1}{Z}\exp(-C_\theta(\tau)). \tag{2}
$$

Here $Z$ is the partition function, given by the integral of $\exp(-C_\theta(\tau))$ over all trajectories.

### 2.2 Generative Adversarial Networks

Generative Adversarial Networks (GANs) provide a framework to model a generator $G$ and a discriminator $D$ simultaneously. $G$ generates samples $x = G(z)$ from noise $z \sim \mathcal{N}(0, I)$, while $D$ takes $x$ as input and outputs a likelihood value $D(x) \in [0, 1]$ indicating whether $x$ is sampled from the underlying data distribution or from the generator (Goodfellow et al. 2014):

$$
\begin{aligned}
\min_D\; L_D &= \mathbb{E}_{x \sim p_{\text{data}}}\big[-\log D(x)\big] + \mathbb{E}_{z \sim \mathcal{N}(0,I)}\big[-\log(1 - D(G(z)))\big],\\
\min_G\; L_G &= \mathbb{E}_{z \sim \mathcal{N}(0,I)}\big[-\log D(G(z))\big] + \mathbb{E}_{z \sim \mathcal{N}(0,I)}\big[\log(1 - D(G(z)))\big].
\end{aligned} \tag{3}
$$

The generator loss $L_G$, the discriminator loss $L_D$, and the optimization goals are defined in (3). Here $L_G$ is modified to be the sum of the logarithmic confusion term and the opposite of $D$'s loss, so that the training signal is preserved even when generated samples are easily classified by the discriminator.
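For reference, the following is a minimal PyTorch sketch of the two objectives in Eq. (3), assuming `d_real` and `d_fake` are tensors of discriminator outputs on data samples and generated samples, respectively; the small `eps` only guards the logarithms and is not part of the formulation.

```python
import torch

def discriminator_loss(d_real, d_fake, eps=1e-8):
    # L_D = E_data[-log D(x)] + E_z[-log(1 - D(G(z)))]
    return -(torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean())

def generator_loss(d_fake, eps=1e-8):
    # L_G = E_z[-log D(G(z))] + E_z[log(1 - D(G(z)))]: the first ("confusion") term
    # keeps a useful gradient even when D easily rejects the generated samples.
    return (-torch.log(d_fake + eps) + torch.log(1.0 - d_fake + eps)).mean()
```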
## 3 Methodology

Our preference-based task transfer framework consists of two iterative sub-procedures: 1) querying expert preference and constructing a selected set of trajectory samples; 2) learning the trajectory distribution and cost function from this sample set and re-generating more samples for the next episode. Starting from the demonstrations of the basic task, the learned trajectory distribution and cost function are improved continuously. Finally, with the learned cost function, we can derive a policy for the target task. The following sections cover the modeling and analysis of the two steps above. In Section 3.1, we introduce the hidden-cost model for expert preference-based selection. In Section 3.2, we present our enhanced Adversarial MaxEnt IRL for distribution and cost learning. We then combine these two components into a preference-based task transfer framework (Section 3.3) and provide a theoretical analysis of it (Section 3.4).

### 3.1 Preference-Based Sampling and the Hidden Cost Model

The main idea of our task transfer framework is to transfer the trajectory distribution through sample selection. Different from other transfer learning algorithms, the selection in our method depends only on the preference provided by experts rather than on any numerical quantities. The expert preference here can consist of abstract notions of, or rules about, the performance of agents on the target task, which are hard to formalize directly as cost functions or to provide numerically. In our preference-based cost learning framework, however, we only require experts to choose their most preferred samples from the set generated in the last step, and we use the selection result as guidance for migrating the distribution from the current policy toward the target task policy.

We migrate the distribution by preference-based selection of samples in the current set. For this to work, the agent should be able to generate feasible trajectories on the target task, which requires the probability of a trajectory under the current task to be non-zero whenever its probability under the target task is non-zero, and there should exist a finite value $M \in (0, \infty)$ (which indicates the expected number of rejections made before a sample is accepted) such that

$$
M\, p(\tau \in B_i) > p(\tau \in B_{tar}) \quad \text{for all } \tau, \tag{4}
$$

where $B_i$ and $B_{tar}$ are the feasible trajectory sets of the current task and the target task, respectively. In the previous section, we have shown that, under MaxEnt IRL, the expert trajectories are assumed to be sampled from a Boltzmann distribution with the negative cost function as energy. For an arbitrary trajectory $\tau$, we therefore have

$$
p(\tau \in B_i) = p_i(\tau) = \frac{\exp(-C_i(\tau))}{Z_i} \propto \exp(-C_i(\tau)), \qquad
p(\tau \in B_{tar}) = p_{tar}(\tau) = \frac{\exp(-C_{tar}(\tau))}{Z_{tar}} \propto \exp(-C_{tar}(\tau)), \tag{5}
$$

where $C_i$ and $C_{tar}$ are the ground-truth costs over a trajectory for the current and target task, while $c_i$ and $c_{tar}$ are the corresponding per-step cost functions. During selection, we suppose that the expert intends to keep the trajectories $\tau$ that have a lower cost value on the target task, which means the preference selection procedure can be seen as rejection sampling over the set $B_i$ with acceptance probability

$$
p_{sel}(\tau) = \frac{p_{tar}(\tau)}{M\, p_i(\tau)} = \frac{Z_i}{M Z_{tar}} \exp\big(C_i(\tau) - C_{tar}(\tau)\big) \propto \exp\big(-C_{tar}(\tau) + C_i(\tau)\big). \tag{6}
$$

We define the gap between the target cost and the current cost as the hidden cost $c_h(s, a) = c_{tar}(s, a) - c_i(s, a)$ and, over a trajectory, $C_h(\tau) = C_{tar}(\tau) - C_i(\tau)$. Thus we can view $C_h$ as a latent factor, or formally a negative utility function (Wirth et al. 2016), that indicates the preference and at the same time indicates the gap between the target distribution and the current distribution.
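Viewed this way, preference selection is ordinary rejection sampling over the candidate set. The snippet below is a minimal Python sketch of this view, under the assumption that a callable `hidden_cost(tau)` returning $C_h(\tau) = C_{tar}(\tau) - C_i(\tau)$ is available (in our experiments it is emulated from ground-truth costs; with a human expert it is only implicit in the choices) and that the partition constants in (6) are absorbed into the bound `M`.

```python
import numpy as np

def preference_select(trajectories, hidden_cost, M=10.0, rng=np.random.default_rng(0)):
    """Preference selection as rejection sampling (Eqs. 4-6): each trajectory is
    accepted with probability proportional to exp(-C_h(tau)), capped at 1."""
    kept, dropped = [], []
    for tau in trajectories:
        accept_prob = min(1.0, np.exp(-hidden_cost(tau)) / M)
        (kept if rng.random() < accept_prob else dropped).append(tau)
    return kept, dropped
```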
A lower expectation of $C_h$ over the sample set indicates a greater acceptance probability and, in turn, that the current distribution is more similar to the target one. After each step, by reintroducing the acceptance rate, the probability of a sample being present in the set after the $i$-th selection becomes

$$
p_{i+1}(\tau) = p(\text{selected}(\tau) \mid \tau)\, p_i(\tau)
= \frac{Z_i}{M Z_{tar}} \exp\big(C_i(\tau) - C_{tar}(\tau)\big)\cdot \frac{1}{Z_i}\exp(-C_i(\tau))
\propto \exp(-C_{tar}(\tau)). \tag{7}
$$

With preference-based sample selection, the trajectory distribution is thus expected to eventually approach that of the target task. The convergence analysis is provided in Section 3.4.

### 3.2 Enhanced Adversarial MaxEnt IRL for Distribution and Cost Learning

In the previous section, we introduced how preference-based sample selection works in our task transfer framework. However, since task transfer is an iterative process, we need to generate more samples with the same distribution as the selected sample set to keep it selectable by experts. Additionally, a cost function needs to be extracted from the selected demonstrations in order to optimize policies. With our enhanced Adversarial MaxEnt IRL, we tackle both problems by learning the trajectory distribution and an unbiased cost function simultaneously.

Adversarial MaxEnt IRL (Finn et al. 2016) is a recently proposed GAN-based IRL algorithm that explicitly recovers the trajectory distribution from demonstrations. We enhance it to meet the requirements of our task transfer framework. Our enhancement is twofold: we redefine the GAN from the trajectory level to the state-action-pair level so as to extract a cost function that can be directly used for policy optimization; and, although the GAN no longer works directly on trajectories, we prove that the generator can still serve as a sampler of the trajectory distribution of the demonstrations.

We first briefly review the main ideas of Adversarial MaxEnt IRL. In this algorithm, demonstrations are supposed to be drawn from the Boltzmann distribution (2), and the optimization target can be regarded as Maximum Likelihood Estimation (MLE) over the trajectory set $B$:

$$
\min_\theta\; L_{cost} = \mathbb{E}_{\tau \sim B}\big[-\log p_\theta(\tau)\big]. \tag{8}
$$

The optimization in (8) can be cast as the optimization of a GAN (Goodfellow et al. 2014; Finn et al. 2016), where the discriminator takes the form

$$
D(\tau) = \frac{p(\tau)}{p(\tau) + G(\tau)} = \frac{\frac{1}{Z}\exp(-C(\tau))}{\frac{1}{Z}\exp(-C(\tau)) + G(\tau)}. \tag{9}
$$

Finn et al. showed that, when the model is trained to optimality, the generator $G$ is an optimal sampler of the trajectory distribution $p(\tau) = \exp(-C(\tau))/Z$. However, we still cannot extract a closed-form cost function from the model. We therefore enhance it to meet our requirements. Since the cost function should be defined on each state-action pair, we first modify the input of the model in (9) from a trajectory to a state-action pair:

$$
D(s, a) = \frac{\frac{1}{Z}\exp(-c(s, a))}{\frac{1}{Z}\exp(-c(s, a)) + G(s, a)}. \tag{10}
$$

The connection between the accurate cost $c(s, a)$ and the outputs $D(s, a)$, $G(s, a)$ of the GAN can then be established:

$$
\tilde{c}(s, a) := c(s, a) + \log Z = \log(1 - D(s, a)) - \log D(s, a) - \log G(s, a). \tag{11}
$$

Here we define $\tilde{c}(s, a) = c(s, a) + \log Z$ as a cost estimator, while $c(s, a)$ is the accurate cost function. Since the partition function $Z$ is a constant while the cost function is fixed, it does not affect policy optimization, which means that $\tilde{c}$ can be directly integrated into common policy optimization algorithms as an unbiased cost function.
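The sketch below illustrates how the estimator in (11) can be evaluated in practice, assuming the discriminator output $D(s, a)$ and the generator (policy) log-density $\log G(s, a)$ are available as tensors; the small constant only guards the logarithms.

```python
import torch

def cost_estimator(d_sa, log_g_sa, eps=1e-8):
    # c_tilde(s, a) = c(s, a) + log Z
    #               = log(1 - D(s, a)) - log D(s, a) - log G(s, a)   (Eq. 11)
    # The constant log Z does not affect policy optimization.
    return torch.log(1.0 - d_sa + eps) - torch.log(d_sa + eps) - log_g_sa
```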
Note that, after this modification, several issues need to be addressed. First, since the GAN is no longer defined on trajectories, the equivalence between Guided Cost Learning and GAN training needs to be re-verified; we discuss this in Section 3.4. Moreover, it is not straightforward whether $G(s, a)$ is a sampler of the distribution of the demonstrations. We now show that, when $G$ is trained to optimality, the distribution of trajectories sampled from it is exactly the distribution $p(\tau)$ of the demonstrations.

**Assumption 1.** The environment is stationary.

**Lemma 1.** Suppose we have an expert policy $\pi_E(a \mid s)$ that produces demonstrations $B$, and a trajectory $\tau = \{(s_0, a_0), (s_1, a_1), \dots\}$ is sampled from $\pi_E$. Then $\tau$ has the same probability as if drawn from $p(\tau)$, provided that Assumption 1 holds ($p(\tau)$ is the trajectory distribution of $B$).

*Proof.* We first introduce the environment model $p_e(s' \mid s, a)$ and the state distribution $p_s(s)$. In reinforcement learning, the environment is essentially a conditional distribution over state transitions $(s', s, a)$. Thus the probability of a given trajectory $\tau = \{(s_0, a_0), (s_1, a_1), \dots\}$ is

$$
p(\tau) = p_s(s_0) \prod_{t=0} \pi_E(a_t \mid s_t)\, p_e(s_{t+1} \mid s_t, a_t). \tag{12}
$$

Now we sample a trajectory $\tau$ with $\pi_E$ by executing rollouts. Under Assumption 1, the environment model $p_e$ used when sampling $\tau$ from $\pi_E$ is the same as when sampling the demonstrations $B$, while the state distribution $p_s(s') = \iint p_e(s' \mid s, a)\,\mathrm{d}s\,\mathrm{d}a$ is also identical. Therefore, the probability of sampling $\tau$ can be derived as

$$
q(\tau) = p_s(s_0) \prod_{t=0} \pi_E(a_t \mid s_t)\, p_e(s_{t+1} \mid s_t, a_t). \tag{13}
$$

It is obvious that $p(\tau) = q(\tau)$.

**Lemma 2** (Goodfellow et al. 2014). The global minimum of the discriminator objective (3) is achieved when $p_G = p_{\text{data}}$.

For a GAN defined at the state-action level, Lemma 2 gives $p_G = p_{\text{data}} = \pi_E$, where $\pi_E$ is the expert policy that produces the demonstrations. Then, by Lemma 1, a trajectory sampled with $G(s, a)$ has the same density as $p(\tau)$, which means that $G(s, a)$ can still serve as a sampler of the trajectory distribution of the demonstrations.

We formulate the minimization of the generator loss as a policy optimization problem. We regard the unbiased cost estimator $\tilde{c}(s, a)$ as the cost function instead of $L_G$ in (3), and $G$ as a policy $\pi$. The policy objective is thus

$$
L_\pi = \mathbb{E}_{(s,a) \sim \pi}\big[\log(1 - D(s, a)) - \log D(s, a)\big] + H(\pi), \quad \text{where } H(\pi) = \mathbb{E}_{(s,a) \sim \pi}\big[-\log \pi(a \mid s)\big]. \tag{14}
$$

This is quite similar to the generator objective used by GAIL (Ho and Ermon 2016), but with an extra entropy penalty. We compare the cost learning performance of our method and GAIL in Section 4.

### 3.3 Preference-Based Task Transfer

The entire task transfer framework is given in Algorithm 1, which combines the hidden-cost model for preference-based selection with the enhanced Adversarial MaxEnt IRL for distribution and cost learning. With this framework, a well-trained policy on the basic task can be transferred to the target task without accurate demonstrations or costs. Compared to Section 3.2, we adopt a stop condition with $\epsilon$ and $M$ that indicates the termination of the loop, and an extra selection constraint that we observed to be helpful for stability in preliminary experiments. In practice, the parameters of $G_{\phi_i}$ and $D_{\omega_i}$ can be directly inherited from $G_{\phi_{i-1}}$ and $D_{\omega_{i-1}}$ when $i > 1$. Compared to initializing from scratch, this converges faster in each iteration while the results remain the same.

**Algorithm 1** Preference-based task transfer via Adversarial MaxEnt IRL

Input: demonstration set $B_0$ on the basic task; stop indicator $\epsilon$; maximum number of episodes $M$; preference rules, or an emulator that provides selection results.
Output: transferred policy $\pi_t$.

Initialize: $i \leftarrow 0$; generator $G_{\phi_0}$, discriminator $D_{\omega_0}$.
1: repeat
2: &ensp; $i \leftarrow i + 1$
3: &ensp; for step $s$ in $\{1, \dots, N\}$ do
4: &ensp;&ensp; sample trajectory $\tau$ from $G_{\phi_i}$
5: &ensp;&ensp; update $D_{\omega_i}$ with the binary classification error in (3) to distinguish demonstrations $\tau_E \in B_{i-1}$ from samples $\tau$
6: &ensp;&ensp; update $G_{\phi_i}$ using any policy optimization method with respect to $L_\pi$ in (14)
7: &ensp; end for
8: &ensp; sample with $G_{\phi_i}$ and collect the candidate set $\bar{B}_i$
9: &ensp; query for preference to select trajectories in $\bar{B}_i$, obtaining retained samples $B_i$ and dropped samples $\hat{B}_i$, and guarantee that $|B_i|$ is no more than half of $|\bar{B}_i|$
10: &ensp; randomly sample $\beta|B_i|$ trajectories from $\hat{B}_i$ and put them back into $B_i$
11: until $|B_i|/|\bar{B}_i| < \epsilon$ or $i = M$
12: return $\pi_t \leftarrow G_{\phi_i}$
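To make the loop concrete, the following is a condensed, illustrative Python sketch of a single episode of Algorithm 1. The objects and helpers used here (`policy`, `disc`, `demos.sample_batch`, `sample_trajectories`, `ppo_update`, `query_preference`) are hypothetical placeholders rather than parts of any particular library, and trajectories are assumed to be represented as tensors of state-action pairs.

```python
import torch

def run_episode(policy, disc, disc_opt, demos, env, n_steps=50, n_candidates=64):
    bce = torch.nn.BCELoss()
    for _ in range(n_steps):
        # Step 4: sample state-action pairs from the current generator/policy.
        gen_sa = sample_trajectories(policy, env)
        demo_sa = demos.sample_batch(len(gen_sa))
        # Step 5: update D with the binary classification error of Eq. (3).
        d_demo, d_gen = disc(demo_sa), disc(gen_sa)
        d_loss = bce(d_demo, torch.ones_like(d_demo)) + bce(d_gen, torch.zeros_like(d_gen))
        disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()
        # Step 6: update G with any policy optimizer, using the cost estimator
        # of Eq. (11) as the per-step cost (objective L_pi of Eq. (14)).
        with torch.no_grad():
            d_sa = disc(gen_sa)
            cost = torch.log(1 - d_sa + 1e-8) - torch.log(d_sa + 1e-8) - policy.log_prob(gen_sa)
        ppo_update(policy, gen_sa, reward=-cost)
    # Steps 8-9: generate a fresh candidate set and query the expert preference.
    candidates = [sample_trajectories(policy, env) for _ in range(n_candidates)]
    retained, dropped = query_preference(candidates)
    return retained, dropped
```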
### 3.4 Theoretical Analysis

In this section, we discuss how our framework learns the distribution from trajectories in each episode and finally transfers the cost function to the target task. Recall the core of our framework: transferring the trajectory distribution from $p_0$ to $p_{tar}$. There is a finite loop in this process, during which we query for preference with $p_{sel}(\tau) \propto \exp\{-C_h(\tau)\}$ and improve the distribution $p_i$ in each episode $i$. If the distribution improves monotonically and the improvement can be maintained, we can guarantee the convergence of our method, which means that $p_{tar}$ can be learned. Then the cost function $c_i$ learned together with $p_i$ also approaches the cost of the target task, $c_{tar}$. This intuition is formalized as follows.

**Proposition 1.** Given a finite set of trajectories $B$ sampled from a distribution $p(\tau)$ and an expert with selection probability (6), the negative hidden cost over a trajectory, $\mathbb{E}_{\tau \sim p}[-C_h(\tau)]$, improves monotonically after each selection.

This proposition can be proved with some elementary derivations; here we only provide a proof sketch. Since all trajectories in $B$ are sampled from the corresponding distribution $p(\tau)$, the expected cost can be estimated. Note that we use a normalized selection probability $p_{sel}(\tau) = \exp(-C_h(\tau))/Z$. The estimates of the expectation before and after the selection are

$$
\mathbb{E}_{\tau \sim p}[-C_h(\tau)] \approx \frac{1}{|B|} \sum_i \big(-C_h(\tau_i)\big), \qquad
\mathbb{E}_{\tau \sim p'}[-C_h(\tau)] \approx \frac{\sum_i p_{sel}(\tau_i)\big(-C_h(\tau_i)\big)}{\sum_i p_{sel}(\tau_i)}.
$$

Obviously, trajectories after selection cannot be regarded as samples drawn from $p$; here we use $p'$, which can be regarded as an improved $p$. Under a linear expansion of the cost, $\mathbb{E}_{\tau \sim p'}[-C_h(\tau)] \ge \mathbb{E}_{\tau \sim p}[-C_h(\tau)]$ can be proved. Thus the expected negative hidden cost over a trajectory improves monotonically.

We then need to re-verify whether the proposed state-action-level GAN in our enhanced Adversarial MaxEnt IRL is still equivalent to Guided Cost Learning (Finn, Levine, and Abbeel 2016):

**Theorem 1.** Suppose we have demonstrations $B = \{\tau_0, \tau_1, \dots\}$ and a GAN with generator $G_\phi(s, a)$ and discriminator $D_\omega(s, a)$. Then, when the generator loss $L_G = \mathbb{E}_{\tau \sim q}\big[\log(1 - D_\omega(s, a)) - \log D_\omega(s, a)\big]$ is minimized, the sampler loss in Guided Cost Learning (Finn, Levine, and Abbeel 2016), $L_{sampler} = D_{KL}\big(q(\tau) \,\|\, \exp(-C(\tau))/Z\big)$, is also minimized. Here $q(\tau)$ is the learned trajectory distribution and $G_\phi$ is the corresponding sampler.

Since $L_{sampler}$ is minimized along with $L_G$, when the adversarial training ends, an optimal sampler of $q(\tau)$ is obtained. We then need to show that $B$ is drawn from $q(\tau)$:

**Theorem 2.** Under the same settings as Theorem 1, when the discriminator loss is minimized, the cost loss in Guided Cost Learning, $L_{cost} = \mathbb{E}_{\tau \sim B}[C_\theta(\tau)] + \log \mathbb{E}_{\tau \sim G}\big[\exp(-C_\theta(\tau))/q(\tau)\big]$, is also minimized. Thus the learned cost $C_\theta(\tau)$ is optimal for $B$. Referring to Theorem 1, $B$ is drawn from $q(\tau)$.

In Theorem 2, MaxEnt IRL is regarded as an MLE of (2), where the unknown partition function $Z$ needs to be estimated.
Therefore, training a state-action-level GAN is still equivalent to maximizing the likelihood of the trajectory distribution, and we can thus learn the optimal cost function and distribution under the current trajectory set $B$ at the same time.

With Proposition 1, we can start from an arbitrary trajectory distribution $p_0$ and a trajectory set $B_0$ drawn from it. We then define a trajectory distribution iteration as $p_{i+1}(\tau) \propto p_i(\tau)\exp\{-C_h(\tau)\}$ (Haarnoja et al. 2017). The expected negative hidden cost over a trajectory, $\mathbb{E}_{\tau \sim p_i}[-C_h(\tau)]$, then improves monotonically in each episode. With Theorems 1 and 2, by strictly recovering the improved distribution as $p_{i+1}$ from the trajectory set (after selection), our algorithm is guaranteed to carry the improvement of the expected cost over a trajectory into the next episode. Under certain regularity conditions (Haarnoja et al. 2017), $p_i$ converges to $p^*$. For trajectories sampled from the target distribution $p_{tar}(\tau)$, the corresponding selection probability (6) approaches 1. Thus $p_{tar}(\tau) = \exp(-C_h(\tau) - C_b(\tau))/Z$ is a fixed point of this iteration when the iteration starts from $p_b(\tau) = \exp(-C_b(\tau))/Z$. Since every non-optimal distribution can be improved in this way, the learned distribution converges to $p_{tar}(\tau)$ in the limit.

As we have shown before, given a limited set of demonstrated trajectories $B$ sampled from an arbitrary trajectory distribution $p(\tau)$, an optimal cost $c(a, s)$ can be extracted through the enhanced Adversarial MaxEnt IRL proposed in Section 3.2. Therefore, the target cost $c_{tar}$ can also be learned from the transferred distribution $p_{tar}$.

Figure 2: Results of distribution learning under four MuJoCo environments. Here the demonstrations are provided by an expert policy (PPO) under a known cost function. We compare the average cost value among trajectories generated by an oracle (an ideal policy that always obtains the maximum return), PPO (the sample generator, acting as the expert), GAIL (a state-of-the-art IRL algorithm), and our distribution learning method. The results show that our method finally achieves nearly the same performance as the expert. As discussed in Section 4.1, this verifies that our method can learn the distribution from demonstrations.

Figure 3: Results of cost learning and task transfer. We compare the average returns among an oracle (an ideal policy that always obtains the maximum return), an expert policy trained with the cost of the target task, and our method. The results show that our algorithm adapts to the new task efficiently within 4-6 episodes and achieves nearly the same performance as the expert.

## 4 Experiments

We evaluate our algorithm on several control tasks in the MuJoCo (Todorov, Erez, and Tassa 2012) physics simulator, with pre-defined ground-truth cost functions $c_b(s, a)$ on the basic task and $c_{tar}(s, a)$ on the target task in each experiment; $C_b(\tau)$ and $C_{tar}(\tau)$ are the accumulated costs over a trajectory $\tau$ for the basic and target task, respectively. All initial demonstrations are generated by a well-trained PPO policy using $c_b$, and during the transfer process, preference is given by an emulator with the negative utility function (i.e., the hidden cost over a trajectory) $C_h = C_{tar} - C_b$. The selection probability follows the definition in (6); a minimal sketch of such an emulator is given at the end of Section 4.2. For performance evaluation, we use the averaged return with respect to $c_{tar}(s, a)$ as the criterion.

### 4.1 Overview

In the experiments, we mainly want to answer three questions:

1. During the task transfer procedure, can our method recover the trajectory distribution from demonstrations in each episode?
2. Starting from a basic task, can our method finally transfer to the target task and learn its cost function?
3. On the same task transfer problem, can our method (based on preference only) obtain a policy with performance comparable to other task transfer algorithms (based on accurate costs or demonstrations)?

To answer the first question, we need to functionally verify the distribution learning part of our method. Since our enhanced Adversarial MaxEnt IRL is built upon MaxEnt IRL, the recovered trajectory distribution can be reflected by a cost function, and the trajectories we learn are treated as being generated by the optimal policy under that cost. Intuitively, given the expert trajectories $\tau_{\text{PPO}}$ generated by PPO and their corresponding cost $C_{tar}(\tau_{\text{PPO}})$, if we can train a policy that generates trajectories $\tau$ with a similar average $C_{tar}(\tau)$, we believe that the trajectory distribution has been recovered.

To answer the second question, we evaluate the complete preference-based task transfer algorithm under several customized environments and tasks. In each environment, we transfer the current policy for the basic task to the target one. During the transfer process, expert preference (emulated by computer) is given only as a selection result, while no information about the cost or the selection rule is available to the agent. We also train an expert policy with PPO and $c_{tar}(s, a)$ for comparison. In each episode, we generate $\tau_i$ using our learned policy and record $C_{tar}(\tau_i)$. If the average $C_{tar}(\tau_i)$ finally approaches $C_{tar}(\tau_{\text{PPO}})$, we can verify that our method learns the cost function of the target task.

To answer the third question, we compare our method with MAML (Finn, Abbeel, and Levine 2017), a task transfer algorithm that requires the accurate $c_{tar}$. We use the averaged cost on the target task in each episode (we consider a gradient step in MAML equivalent to an episode in our method) for evaluation, to see whether the result of our method is comparable.

### 4.2 Environments and Tasks

Here we outline the specifications of the environments and tasks in our experiments:

- **Hopper, Walker2d, Humanoid, and Reacher:** These environments and tasks are taken directly from OpenAI Gym (Brockman et al. 2016) without customization. Since they are only used for functionally verifying our distribution learning part and comparing with the original GAIL algorithm, there are no transfer settings.
- **Mountain Car, Two Peaks → One Peak:** In this environment, there are two peaks for the agent to climb. The basic task is to make the vehicle go higher, while the target task is to climb to one specified peak.
- **Reacher, Two Targets → Center of Targets:** In this environment, the agent needs to control a 2-DOF manipulator to reach specified targets. For the basic task, there are two targets and the agent can reach either of them, while in the target task the agent is expected to reach the central position between the two targets.
- **Half-Cheetah, Arbitrary → Backward:** In this environment, the agent needs to control a multi-joint (6) robot to move forward or backward. Both directions are acceptable in the basic task, while only moving backward is expected in the target task.
- **Ant, Arbitrary → Single:** This environment extends the Half-Cheetah environment in two aspects: first, there are more joints (8) to control; second, the robot can move in arbitrary directions. In the basic task, any direction is allowed, while only one specified direction is expected in the target task.
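
Throughout the transfer experiments, the expert is emulated from the ground-truth costs as described at the beginning of this section. The snippet below is a minimal sketch of such an emulator; the per-step cost callables `c_basic` and `c_target`, the normalization of $\exp(-C_h)$ over the candidate set, and the retain-at-most-half constraint of Algorithm 1 are assumptions about one reasonable implementation, not a prescription.

```python
import numpy as np

def emulated_preference(trajs, c_basic, c_target, rng=np.random.default_rng(0)):
    """Emulated expert: score each trajectory by C_h = C_tar - C_b computed from
    ground-truth per-step costs, then keep at most half of the candidates with
    selection probability proportional to exp(-C_h), cf. Eq. (6)."""
    c_h = np.array([sum(c_target(s, a) - c_basic(s, a) for s, a in tau) for tau in trajs])
    logits = -c_h - np.max(-c_h)                   # stabilize the exponential
    p_sel = np.exp(logits) / np.exp(logits).sum()
    keep = rng.choice(len(trajs), size=len(trajs) // 2, replace=False, p=p_sel)
    keep_set = set(int(j) for j in keep)
    retained = [trajs[j] for j in keep_set]
    dropped = [trajs[j] for j in range(len(trajs)) if j not in keep_set]
    return retained, dropped
```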
### 4.3 Distribution and Cost Learning

We first address the question of whether our method can recover the trajectory distribution from demonstrations during the task transfer procedure. The experimental results are shown in Figure 2. All the selected control tasks are equipped with high-dimensional continuous state and action spaces, which can be challenging for common IRL algorithms. We find that our method achieves nearly the same final performance as the expert (PPO) that provides the demonstrations, indicating that our method can recover the trajectory distribution. Moreover, compared with other state-of-the-art IRL methods such as GAIL, our method learns a better trajectory distribution and a cost function more efficiently.

### 4.4 Preference-Based Task Transfer

In Figure 3, we demonstrate the transfer results on two environments. The transfer in the Reacher environment is more difficult than in the Mountain Car toy environment. The reason may be that the latter can be clustered easily, since there are only two actual goals that a trajectory may reach, and the target goal (to reach one specified peak) is exactly one of them. In the Reacher environment, although the demonstrations for the basic task still seem easy to cluster, the target task cannot be directly derived from any one of the clusters. In both transfer experiments, the adapted policies produced by our algorithm show nearly the same performance as the experts trained directly on the two target tasks. As the transferred policy is trained with the learned cost function, we conclude that our algorithm can transfer to the target task by learning the target cost function. In our experiments, we find that fewer than 10 episodes, with fewer than 100 queries per episode, are sufficient to reach the desired performance. Another potential improvement of our method is to apply computable rules to simulate the human selection and further reduce the querying time.

Figure 4: Results of the comparison with other methods. We evaluate our algorithm under the transfer environments introduced by (Finn, Abbeel, and Levine 2017). Among the baselines, MAML requires the accurate $c_{tar}$ when transferring; Pretrained means pre-training a policy on a basic task using Behavior Cloning (Ross, Gordon, and Bagnell 2011) and then fine-tuning; Random means optimizing a policy from randomly initialized weights. The results show that our method obtains a policy with performance comparable to MAML and the other baselines.

### 4.5 Comparison with Other Methods

We compare our method with state-of-the-art task transfer algorithms, including MAML (Finn, Abbeel, and Levine 2017). Results are shown in Figure 4. The Half-Cheetah environment is quite similar to Mountain Car in its limited moving directions; however, its state and action space dimensions are much higher, which increases the difficulty of trajectory distribution and cost learning. Ant is the most difficult of all the environments: due to its unrestricted moving directions, the demonstrations on the basic task are highly entangled. The results illustrate that our method achieves performance comparable to methods that require the accurate cost of the target task on the testing environments. Note that, although on a hard environment like Ant our method may run for more episodes than MAML, the results remain convincing given that our algorithm depends only on preference.
## 5 Conclusion

In this paper, we present an algorithm that transfers policies by learning the cost function of the target task using only expert-provided preference selection results. By modeling preference-based selection as rejection sampling and utilizing an enhanced Adversarial MaxEnt IRL to directly recover the trajectory distribution and cost function from the selection results, our algorithm can efficiently transfer policies from a related but not exactly-relevant basic task to the target one, while a theoretical analysis of convergence is provided at the same time. Compared to other task transfer methods, our algorithm can handle scenarios in which acquiring accurate demonstrations or cost functions from experts is inconvenient. Our results achieve task transfer performance comparable to other methods that depend on accurate costs or demonstrations. Future work could focus on a quantitative evaluation of the improvement of the transferred cost function; an upper bound on the total number of operating episodes could also be analyzed.

## Acknowledgment

This research work was jointly supported by the Natural Science Foundation major international cooperation project (Grant No. 61621136008) and the National Natural Science Foundation of China (Grant No. 61327809). Professor Fuchun Sun (fcsun@tsinghua.edu.cn) is the corresponding author of this paper, and we would like to thank Tao Kong, Chao Yang, and Professor Chongjie Zhang for their generous help and insightful advice.

## References

Boularias, A.; Kober, J.; and Peters, J. 2011. Relative entropy inverse reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym.
Christiano, P. F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (NIPS).
De Gemmis, M.; Iaquinta, L.; Lops, P.; Musto, C.; Narducci, F.; and Semeraro, G. 2009. Preference learning in recommender systems. Preference Learning.
Duan, Y.; Andrychowicz, M.; Stadie, B.; Jonathan Ho, O.; Schneider, J.; Sutskever, I.; Abbeel, P.; and Zaremba, W. 2017. One-shot imitation learning. In Advances in Neural Information Processing Systems (NIPS).
Edmonds, M.; Gao, F.; Xie, X.; Liu, H.; Qi, S.; Zhu, Y.; Rothrock, B.; and Zhu, S.-C. 2017. Feeling the force: Integrating force and pose for fluent discovery through imitation learning to open medicine bottles. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML).
Finn, C.; Christiano, P.; Abbeel, P.; and Levine, S. 2016. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852.
Finn, C.; Levine, S.; and Abbeel, P. 2016. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning (ICML).
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS).
Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML).
Ho, J., and Ermon, S. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NIPS).
Ng, A. Y.; Russell, S. J.; et al. 2000. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML).
Pathak, D.; Mahmoudieh, P.; Luo, M.; Agrawal, P.; Chen, D.; Shentu, F.; Shelhamer, E.; Malik, J.; Efros, A. A.; and Darrell, T. 2018. Zero-shot visual imitation. In International Conference on Learning Representations (ICLR).
Ross, S.; Gordon, G.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Sermanet, P.; Lynch, C.; Chebotar, Y.; Hsu, J.; Jang, E.; Schaal, S.; Levine, S.; and Brain, G. 2018. Time-contrastive networks: Self-supervised learning from video. In IEEE International Conference on Robotics and Automation (ICRA).
Stadie, B. C.; Abbeel, P.; and Sutskever, I. 2017. Third-person imitation learning. In International Conference on Learning Representations (ICLR).
Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE.
Wang, Z.; Merel, J. S.; Reed, S. E.; de Freitas, N.; Wayne, G.; and Heess, N. 2017. Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems (NIPS).
Wirth, C., and Fürnkranz, J. 2013. A policy iteration algorithm for learning from preference-based feedback. In International Symposium on Intelligent Data Analysis. Springer.
Wirth, C.; Fürnkranz, J.; Neumann, G.; et al. 2016. Model-free preference-based reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI).
Wirth, C.; Akrour, R.; Neumann, G.; and Fürnkranz, J. 2017. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research (JMLR).
Yu, T.; Finn, C.; Dasari, S.; Xie, A.; Zhang, T.; Abbeel, P.; and Levine, S. 2018. One-shot imitation from observing humans via domain-adaptive meta-learning. In Robotics: Science and Systems (RSS).
Ziebart, B. D.; Maas, A. L.; Bagnell, J. A.; and Dey, A. K. 2008. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI).
Ziebart, B. D. 2010. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Ph.D. Dissertation, Carnegie Mellon University.