# Multi-task Hierarchical Adversarial Inverse Reinforcement Learning

Jiayu Chen 1, Dipesh Tamboli 2, Tian Lan 3, Vaneet Aggarwal 1 2 4

1School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, USA. 2Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA. 3Department of Electrical and Computer Engineering, George Washington University, Washington DC 20052, USA. 4Department of Computer Science and AI Initiative, King Abdullah University of Science and Technology, Thuwal 23955, KSA. Correspondence to: Jiayu Chen. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Multi-task Imitation Learning (MIL) aims to train a policy capable of performing a distribution of tasks based on multi-task expert demonstrations, which is essential for general-purpose robots. Existing MIL algorithms suffer from low data efficiency and poor performance on complex long-horizon tasks. We develop Multi-task Hierarchical Adversarial Inverse Reinforcement Learning (MH-AIRL) to learn hierarchically-structured multi-task policies, which is more beneficial for compositional tasks with long horizons and has higher expert data efficiency through identifying and transferring reusable basic skills across tasks. To realize this, MH-AIRL effectively synthesizes context-based multi-task learning, AIRL (an IL approach), and hierarchical policy learning. Further, MH-AIRL can be applied to demonstrations without task or skill annotations (i.e., state-action pairs only), which are more accessible in practice. Theoretical justifications are provided for each module of MH-AIRL, and evaluations on challenging multi-task settings demonstrate superior performance and transferability of the multi-task policies learned with MH-AIRL as compared to SOTA MIL baselines.

1. Introduction

The generalist robot, which can autonomously perform a wide range of tasks, is one of the essential targets of robotic learning. As an important approach, Imitation Learning (IL) enables the agent to learn policies based on expert demonstrations and is especially effective for problems where it is difficult to discover task solutions autonomously through Reinforcement Learning (RL). To train a general-purpose agent, Multi-task/Meta Imitation Learning (MIL) algorithms (Finn et al., 2017b; Deisenroth et al., 2014; Singh et al., 2020) have been proposed to learn a parameterized policy that is a function of both the current observation and the task and is capable of performing a range of tasks following a particular distribution. The key insight of these algorithms is that successful control for one task can be informative for other related tasks. However, a critical challenge for them is to acquire enough data for the agent to generalize broadly across tasks. Typically, a large number of demonstrations are required for each task in that distribution, and the required amount increases with task difficulty. Moreover, the learned multi-task policy cannot be transferred to tasks out of that distribution (Yu et al., 2019; Ghasemipour et al., 2019), which limits its general use.

Hierarchical Imitation Learning (HIL) has the potential to reduce the required demonstrations. In HIL, the agent learns a two-level policy, which can be modeled with the option framework (Sutton et al., 1999), from the expert data.
Specifically, the low-level policies (i.e., skills) are designated to accomplish certain subtasks in a complex task, while the high-level policy is for scheduling the switch among the skills to solve the entire task. For multi-task settings, learning a hierarchical policy enables the agent to identify basic skills that can be useful in solving a distribution of tasks and to transfer them across tasks during training. In this case, each skill can be trained with demonstrations from different tasks rather than limited to a single one, and, with the shared skills, an agent mainly needs to update its high-level policy rather than learning an entire policy for each task. The expert data efficiency is significantly improved since demonstrations from different tasks are reused for learning skills and the burden of multi-task policy learning becomes lower. Further, in RL and IL, hierarchies exhibit a number of benefits, including better performance on long-horizon complex tasks (Florensa et al., 2017; Jing et al., 2021) and the possibility of skill transfer between distinct tasks (Andreas et al., 2017).

In this paper, we propose MH-AIRL to introduce hierarchies to MIL. As discussed above, such hierarchies can improve expert data efficiency so that the agent can achieve superior performance based on a limited number of demonstrations. Further, basic skills can be extracted from the learned policies and reused in out-of-distribution tasks for better transferability (i.e., addressing the core concern of multi-task learning). For example, it enables locomotion skills to be reused for multiple goal-achieving tasks of the same robot agent, yet in distinct scenarios. Different from previous Multi-task Hierarchical IL (MHIL) algorithms (Fox et al., 2019; Yu et al., 2018a; Gao et al., 2022; Bian et al., 2022), MH-AIRL is context-based and thus can be applied to demonstrations without any (skill or task) annotations, which are more accessible in practice. To this end, we extend both the multi-task learning and imitation learning modules (i.e., the core components of MIL) with the option framework (i.e., the hierarchical learning module). For multi-task learning, we condition the learned policy on a Hierarchical Latent Context Structure, where the task code and skill segmentation serve as the global and local context variables, respectively. To enforce the causal relationship between the learned policy and the latent variables, we start from the definitions of mutual information and directed information and derive an easier-to-handle lower bound for each of them, which serve as the optimization objectives. For imitation learning, we propose H-AIRL, which redefines AIRL (Fu et al., 2017), a SOTA IL algorithm, in an extended state-action space to enable our algorithm to recover a hierarchical policy (rather than a monolithic one) from expert trajectories. Finally, an actor-critic framework, HPPO, is proposed to synthesize the optimization of the three modules above.

The contributions are as follows: (1) Our work presents the first MHIL algorithm based on demonstrations without any (skill or task) annotations, i.e., state-action pairs only. This greatly generalizes the applicability of our algorithm and reduces the cost of building expert datasets. (2) The newly-proposed H-AIRL and HPPO can be independently used for Hierarchical IL and RL, respectively. They are shown to achieve improved performance over SOTA HIL and HRL baselines.
(3) We provide theoretical proofs and an ablation study for each algorithm module, and show the superiority of our algorithm through comparisons with SOTA baselines on a series of challenging multi-task settings from Mujoco (Todorov et al., 2012) and D4RL (Fu et al., 2020).

2. Related Work

Machine Learning has found successful applications across a wide array of sectors, such as transportation (Al-Abbasi et al., 2019; Chen et al., 2021; Luo et al., 2022; Ma et al., 2020), manufacturing (Peddireddy et al., 2021; Fu et al., 2021), networking (Balachandran et al., 2014; Geng et al., 2023), robotics (Gao et al., 2022; Gonzalez et al., 2023), etc. In the field of robotics, one of the key objectives is developing a generalist robot capable of executing a multitude of tasks with human-like precision. To achieve this, multi-task robotic learning proves to be a highly effective methodology. In this section, we succinctly delineate Multi-task IL and Multi-task HIL, illustrating the contributions and significance of our research in this evolving field.

Multi-task/Meta IL algorithms have been proposed to learn a parameterized policy, which is capable of performing a range of tasks following a particular distribution, from a mixture of expert demonstrations. Based on the meta/multi-task learning techniques used, current MIL algorithms can be categorized as gradient-based or context-based. Gradient-based MIL, such as (Finn et al., 2017b; Yu et al., 2018b), integrates a gradient-based meta learning algorithm, MAML (Finn et al., 2017a), with supervised IL to train a policy that can be quickly adapted to a new task with a one-step gradient update. Context-based MIL, such as (Ghasemipour et al., 2019; Yu et al., 2019), learns a latent variable to represent the task context and trains a policy conditioned on this task context variable. Thus, given the corresponding task variable, the policy can be directly applied to a new task setting. However, these algorithms do not make use of the option framework to learn a hierarchical policy like ours. In Section 5.1, we compare our algorithm with MIL baselines from both categories and show that it achieves better performance on a wide range of challenging long-horizon tasks.

Multi-task HIL aims at recovering a multi-task hierarchical policy based on expert demonstrations from a distribution of tasks, which synthesizes the advantages of Multi-task IL and HIL. We summarize previous studies in this area below. The algorithms proposed in (Fox et al., 2019) and (Duminy et al., 2021) are limited to a certain type of robot. They provide predefined subtask decompositions, like picking and placing dishes, to simplify hierarchical learning, and have access to segmented expert demonstrations. In contrast, our algorithm is proposed to automatically discover a hierarchical policy from unsegmented demonstrations, and the discovered policy should capture the subtask structure of the demonstrations without supervision. In (Yu et al., 2018a), the robot first learns a series of primitive skills from corresponding demonstrations, and then learns to compose the learned primitives into multi-stage skills to complete a task. Thus, they predefine the types of skills and provide demonstrations corresponding to each skill. Also, in their setting, each new task has to be a sequence of predefined skills. A very recent work (Gao et al., 2022) integrates MAML and the option framework for MHIL.
Like (Bian et al., 2022) and (Devin et al., 2019), this algorithm can be applied to demonstrations without the skill annotations, but these demonstrations have to be categorized by the task, in accordance with the requirements of MAML. Consequently, our research introduces the first MHIL algorithm that relies on demonstrations devoid of task or skill annotations. This makes it significantly more practical for real-world applications.

3. Background

In this section, we introduce Adversarial Inverse Reinforcement Learning (AIRL), Context-based Meta Learning, and the One-step Option Framework, corresponding to the three components of our algorithm: IL, multi-task learning, and hierarchical policy learning, respectively. They are based on the Markov Decision Process (MDP), denoted by M = (S, A, P, µ, R, γ), where S is the state space, A is the action space, P : S × A × S → [0, 1] is the transition function (with the shorthand P_{S_{t+1}|S_t, A_t} ≜ P(S_{t+1}|S_t, A_t)), µ : S → [0, 1] is the distribution of the initial state, R : S × A → ℝ is the reward function, and γ ∈ (0, 1] is the discount factor.

3.1. Adversarial Inverse Reinforcement Learning

While there are several other ways to perform IL, such as supervised imitation (e.g., Behavioral Cloning (BC) (Pomerleau, 1991)) and occupancy matching (e.g., GAIL (Ho & Ermon, 2016)), we adopt Inverse Reinforcement Learning (IRL) because it uses not only the expert data but also self-exploration of the agent with the recovered reward function for further improvement (Ng & Russell, 2000; Wang et al., 2021). Comparisons with BC- and GAIL-based algorithms will be provided in Section 5. IRL aims to infer an expert's reward function from demonstrations, based on which the expert's policy can be recovered. Maximum Entropy IRL (Ziebart et al., 2008) solves IRL as a maximum likelihood estimation (MLE) problem shown as Equation 1, where τ_E ≜ (S_0, A_0, ..., S_T) denotes the expert trajectory and Z_ϑ is the partition function, which can be calculated with Z_ϑ = Σ_{τ_E} P̂_ϑ(τ_E).

$$\max_{\vartheta} \mathbb{E}_{\tau_E}\left[\log P_\vartheta(\tau_E)\right] = \max_{\vartheta} \mathbb{E}_{\tau_E}\left[\log \frac{\widehat{P}_\vartheta(\tau_E)}{Z_\vartheta}\right], \quad \widehat{P}_\vartheta(\tau_E) = \mu(S_0) \prod_{t=0}^{T-1} P_{S_{t+1}|S_t, A_t} \exp\big(R_\vartheta(S_t, A_t)\big) \tag{1}$$

Since Z_ϑ is intractable for problems with a large state-action space, the authors of (Fu et al., 2017) propose AIRL to solve this MLE problem in a sample-based manner, through alternately training a discriminator f_ϑ and a policy network π in an adversarial setting. The discriminator is trained by minimizing the cross-entropy loss between the expert demonstrations τ_E and the samples τ generated by π:

$$\min_{\vartheta} \sum_{t=0}^{T-1} \left[-\mathbb{E}_{\tau_E}\left[\log D^t_\vartheta\right] - \mathbb{E}_{\tau}\left[\log\big(1 - D^t_\vartheta\big)\right]\right] \tag{2}$$

Here, D^t_ϑ = D_ϑ(S_t, A_t) = exp(f_ϑ(S_t, A_t)) / [exp(f_ϑ(S_t, A_t)) + π(A_t|S_t)]. Meanwhile, the policy π is trained with RL using the reward function defined as log D^t_ϑ − log(1 − D^t_ϑ). It is shown that, at optimality, f_ϑ can serve as the recovered reward function R_ϑ and π is the recovered expert policy.
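As a concrete reference for the adversarial objective above, the following is a minimal numerical sketch of the AIRL discriminator and policy reward, computed in log-space for stability. It is an illustration rather than the authors' implementation; `f_value` and `log_pi` stand for f_ϑ(S_t, A_t) and log π(A_t|S_t), assumed to be evaluated elsewhere.

```python
import numpy as np

def airl_log_discriminator(f_value, log_pi):
    """log D = f - logsumexp(f, log_pi), since D = exp(f) / (exp(f) + pi)."""
    return f_value - np.logaddexp(f_value, log_pi)

def airl_reward(f_value, log_pi):
    """Policy reward log D - log(1 - D); algebraically this equals f - log_pi."""
    log_d = airl_log_discriminator(f_value, log_pi)
    log_one_minus_d = log_pi - np.logaddexp(f_value, log_pi)
    return log_d - log_one_minus_d

# Example: a single (S_t, A_t) pair with f = 1.2 and pi(A_t | S_t) = 0.3
print(airl_reward(1.2, np.log(0.3)))  # ~2.40, i.e., 1.2 - log(0.3)
```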
3.2. Context-based Meta Learning

We consider the Meta IRL setting: given a distribution of tasks P(T), each task sampled from P(T) has a corresponding MDP, and all of them share the same S and A but may differ in µ, P, and R. The goal is to train a flexible policy π on a set of training tasks sampled from P(T), which can be quickly adapted to unseen test tasks sampled from the same distribution. As a representative, context-based Meta IRL algorithms (Ghasemipour et al., 2019; Yu et al., 2019) introduce the latent task variable C, which provides an abstraction of the corresponding task T, so each task can be represented with its distinctive components conditioned on C, i.e., (µ(S_0|C), P(S′|S, A, C), R(S, A|C)). These algorithms learn a context-conditioned policy π(A|S, C) from the multi-task expert data, through IRL and by maximizing the mutual information (Cover, 1999) between the task variable C and the trajectories from π(A|S, C). Thus, given C for a new task, the corresponding π(A|S, C) can be directly applied. Context-based methods can adopt off-policy data, making them better aligned with the goal of our work, i.e., learning from demonstrations. Thus, we choose context-based Meta IRL as our base algorithm. Given expert trajectories sampled from a distribution of tasks (i.e., C ∼ prior(·)) and assuming that the demonstrative trajectories of each task are from a corresponding expert policy π_E(τ_E|C), context-based Meta IRL recovers both the task-conditioned reward function R_ϑ(S, A|C) and policy π(A|S, C) by solving an MLE problem:

$$\max_{\vartheta}\ \mathbb{E}_{C \sim \mathrm{prior}(\cdot),\, \tau_E \sim \pi_E(\cdot|C)}\left[\log P_\vartheta(\tau_E|C)\right], \quad P_\vartheta(\tau_E|C) \propto \mu(S_0|C) \prod_{t=0}^{T-1} P_{S_{t+1}|S_t, A_t, C}\, \exp\big(R_\vartheta(S_t, A_t|C)\big) \tag{3}$$

where P_{S_{t+1}|S_t, A_t, C} ≜ P(S_{t+1}|S_t, A_t, C). Like Equation 1, this can be efficiently solved through AIRL. We provide the AIRL framework to solve Equation 3 in Appendix A.1.

3.3. One-step Option Framework

As proposed in (Sutton et al., 1999), an option Z ∈ 𝒵 can be described with three components: an initiation set I_Z ⊆ S, an intra-option policy π_Z(A|S) : S × A → [0, 1], and a termination function β_Z(S) : S → [0, 1]. An option Z is available in state S if and only if S ∈ I_Z. Once the option is taken, actions are selected according to π_Z until the option terminates stochastically according to β_Z, i.e., the termination probability at the current state. A new option will be activated by a high-level policy π_𝒵(Z|S) : S × 𝒵 → [0, 1] once the previous option terminates. In this way, π_𝒵(Z|S) and π_Z(A|S) constitute a hierarchical policy for a certain task. Hierarchical policies tend to have superior performance on complex long-horizon tasks which can be broken down into a series of subtasks (Chen et al., 2022a;b;c;d).

The one-step option framework (Li et al., 2021) is proposed to learn the hierarchical policy without the extra need to specify the exact initiation and termination condition of each option, i.e., I_Z and β_Z. First, it assumes that each option is available at each state, i.e., I_Z = S, ∀Z ∈ 𝒵. Second, it drops β_Z by redefining the high-level and low-level (i.e., intra-option) policies as π_θ(Z|S, Z′) (Z′: the option in the last timestep) and π_ϕ(A|S, Z), respectively, and implementing them as end-to-end neural networks with the Multi-Head Attention (MHA) mechanism (Vaswani et al., 2017), which enables it to temporally extend options in the absence of the termination function. Intuitively, if Z′ still fits S, π_θ(Z|S, Z′) will assign a larger attention weight to Z′ and thus has a tendency to continue with it; otherwise, a new option with better compatibility will be sampled. Then, the option is sampled at each timestep rather than only after the previous one terminates. With this simplified framework, we only need to train the hierarchical policy, i.e., π_θ and π_ϕ, of which the structure design with MHA is in Appendix A.2.
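To illustrate how the one-step option framework is executed, the sketch below rolls out the two policies with the option re-sampled at every step, so no termination function is needed. The `env` interface is gym-style and the `sample` methods are assumed wrappers around π_θ and π_ϕ; none of these names come from the paper's code.

```python
def rollout_one_step_option(env, high_policy, low_policy, horizon=1000):
    """Collect one trajectory under the one-step option framework."""
    s = env.reset()
    z = None  # dummy initial option, analogous to the dummy action A_{-1}
    trajectory = []
    for _ in range(horizon):
        z = high_policy.sample(s, z)   # Z ~ pi_theta(. | S, Z_prev): continue or switch the option
        a = low_policy.sample(s, z)    # A ~ pi_phi(. | S, Z): act with the chosen skill
        s_next, _, done, _ = env.step(a)
        trajectory.append((s, z, a))
        s = s_next
        if done:
            break
    return trajectory
```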
4. Proposed Approach

In this section, we propose Multi-task Hierarchical AIRL (MH-AIRL) to learn a multi-task hierarchical policy from a mixture of expert demonstrations. First, the learned policy is multi-task by conditioning on the task context variable C. Given C ∼ prior(·), the policy can be directly applied to complete the corresponding task. In practice, we can usually model a class of tasks by specifying the key parameters of the system and their distributions (i.e., prior(C)), including the properties of the agent (e.g., mass and size), the circumstance (e.g., friction and layout), and the task setting (e.g., location of the goals). In this case, directly recovering a policy that is applicable to a class of tasks is quite meaningful. Second, for complex long-horizon tasks which usually contain subtasks, learning a monolithic policy to represent a structured activity can be challenging and inevitably requires more demonstrations. In contrast, a hierarchical policy can make full use of the subtask structure and has the potential for better performance. Moreover, the learned low-level policies can be transferred as basic skills to out-of-distribution tasks for better transferability, while the monolithic policy learned with previous Meta IL algorithms cannot. In Section 4.1 and 4.2, we extend context-based Meta Learning and AIRL with the option framework, respectively. In Section 4.3, we synthesize the three algorithm modules and propose an actor-critic framework for optimization.

4.1. Hierarchical Latent Context Structure

As mentioned in Section 3.2, the current task for the agent is encoded with the task variable C, which serves as the global context since it is consistent throughout the episode. As mentioned in Section 3.3, at each step, the hierarchical policy agent will first decide on its option choice Z using π_θ and then select the primitive action based on the low-level policy π_ϕ corresponding to Z. In this case, the learned policy should be additionally conditioned on Z besides the task code C, and the option choice is specific to each timestep t ∈ {0, ..., T}, so we view the option choices Z_{0:T} as the local latent contexts. C and Z_{0:T} constitute a hierarchical latent context structure shown as Figure 1. Moreover, real-world tasks are often compositional, so the agent needs to reason about the subtask at hand while dealing with the global task. Z_{0:T} and C provide a hierarchical embedding, which enhances the expressiveness of the policy trained with MH-AIRL, compared with context-based Meta IL which only employs the task context. In this section, we define the mutual and directed information objectives to enhance the causal relationship between the hierarchical policy and the global & local context variables which the policy should condition on, as an extension of context-based Meta IL with the one-step option model.

Context-based Meta IL algorithms establish a connection between the policy and the task variable C, so that the policy can be adapted among different task modes according to the task context. This can be realized through maximizing the mutual information between the trajectory generated by the policy and the corresponding C, i.e., I(X_{0:T}; C), where X_{0:T} = (X_0, ..., X_T) = ((A_{-1}, S_0), ..., (A_{T-1}, S_T)) = τ and A_{-1} is a dummy variable. On the other hand, the local latent variables Z_{0:T} have a directed causal relationship with the trajectory X_{0:T}, shown as the probabilistic graphical model in Figure 1.
As discussed in (Massey et al., 1990; Sharma et al., 2019), this kind of connection can be established by maximizing the directed information (a.k.a., causal information) flow from the trajectory to the latent factors of variation, i.e., I(X_{0:T} → Z_{0:T}). In our multi-task framework, we maximize the conditional directed information I(X_{0:T} → Z_{0:T}|C), since for each task c, the corresponding I(X_{0:T} → Z_{0:T}|C = c) should be maximized. Directly optimizing the mutual or directed information objective is computationally infeasible, so we instead maximize their variational lower bounds as follows. (Please refer to Appendix B.1 and B.2 for the definitions of mutual and directed information and the derivations of their lower bounds. For simplicity, we use X^T to represent X_{0:T}, and so on.)

$$\mathcal{L}_{MI} \triangleq H(C) + \mathbb{E}_{X^T, Z^T, C}\left[\log P_\psi(C|X_{0:T})\right],$$
$$\mathcal{L}_{DI} \triangleq \sum_{t=1}^{T}\left[\mathbb{E}_{X^t, Z^t, C}\left[\log P_\omega(Z_t|X_{0:t}, Z_{0:t-1}, C)\right] + H(Z_t|X_{0:t-1}, Z_{0:t-1}, C)\right] \tag{4}$$

where H(·) denotes the entropy, and P_ψ and P_ω are variational estimates of the posteriors P(C|X_{0:T}) and P(Z_t|X_{0:t}, Z_{0:t-1}, C), which cannot be calculated directly. P_ψ and P_ω are implemented as neural networks, H(C) is constant, and H(Z_t|X_{0:t-1}, Z_{0:t-1}, C) is the entropy of the output of the high-level policy network (Appendix B.1), so L_MI and L_DI can be computed in real time. Moreover, the expectations over X^t, Z^t, C in L_MI and L_DI can be estimated in a Monte-Carlo manner (Sutton & Barto, 2018): C ∼ prior(·), (X_{0:t}, Z_{0:t}) ∼ P_{θ,ϕ}(·|C), where P_{θ,ϕ}(X_{0:t}, Z_{0:t}|C) is calculated by (see Appendix B.1):

$$P_{\theta,\phi}(X_{0:t}, Z_{0:t}|C) = \mu(S_0|C)\prod_{i=1}^{t}\left[\pi_\theta(Z_i|S_{i-1}, Z_{i-1}, C)\,\pi_\phi(A_{i-1}|S_{i-1}, Z_i, C)\,P_{S_i|S_{i-1}, A_{i-1}, C}\right] \tag{5}$$

Figure 1. Illustration of the hierarchical latent context structure and its implementation with the one-step option model.

Combining Equation 4 and 5, we can get the objectives with respect to π_θ and π_ϕ, i.e., the hierarchical policy defined in the one-step option model. By maximizing L_MI and L_DI, the connection between the policy and the hierarchical context structure can be established and enhanced. In L_MI and L_DI, we also introduce two variational posteriors, P_ψ and P_ω, and update them together with π_θ and π_ϕ. An analogy of our learning framework with the Variational Autoencoder (VAE) (Kingma & Welling, 2014) is provided in Appendix B.3, which offers another perspective for understanding the proposed objectives.
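As a small illustration of how the bounds in Equation 4 are estimated from rollouts, the sketch below combines per-step quantities from a single sampled trajectory; the inputs are assumed to be log-probabilities and entropies already evaluated by the posterior and policy networks, and the names are illustrative.

```python
def info_lower_bounds(log_post_c, log_post_z, high_policy_entropy):
    """Single-trajectory Monte-Carlo estimate of L_MI and L_DI in Equation 4
    (up to the constant H(C), which does not affect the gradients).

    log_post_c               : log P_psi(C | X_{0:T}) for the sampled task code C
    log_post_z[t-1]          : log P_omega(Z_t | X_{0:t}, Z_{0:t-1}, C), for t = 1..T
    high_policy_entropy[t-1] : entropy of pi_theta(. | S_{t-1}, Z_{t-1}, C), for t = 1..T
    """
    l_mi = log_post_c
    l_di = sum(lp + h for lp, h in zip(log_post_z, high_policy_entropy))
    return l_mi, l_di
```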
4.2. Hierarchical AIRL

In this section, we consider how to recover the task-conditioned hierarchical policy from a mixture of expert demonstrations {(X^E_{0:T}, Z^E_{0:T}, C^E)}. Current algorithms, like AIRL (Fu et al., 2017) or Meta AIRL (Ghasemipour et al., 2019; Yu et al., 2019), cannot be directly adopted since they do not take the local latent codes Z^E_{0:T} into consideration. Thus, we propose a novel hierarchical extension of AIRL, denoted as H-AIRL, as a solution, which is also part of our contributions. Further, it is usually difficult to annotate the local and global latent codes, i.e., Z^E_{0:T} and C^E, of an expert trajectory X^E_{0:T}, so we also propose an Expectation-Maximization (EM) adaptation of H-AIRL to learn the multi-task hierarchical policy based on only the unstructured expert trajectories {X^E_{0:T}}.

First, we define the task-conditioned hierarchical policy. When observing a state S_t at timestep t ∈ {0, ..., T-1} during a certain task C, the agent needs first to decide on its option choice based on S_t and its previous option choice Z_t using the high-level policy π_θ(Z_{t+1}|S_t, Z_t, C), and then decide on the action with the corresponding low-level policy π_ϕ(A_t|S_t, Z_{t+1}, C). Thus, the task-conditioned hierarchical policy can be acquired with the chain rule as:

$$\pi_\theta(Z_{t+1}|S_t, Z_t, C)\,\pi_\phi(A_t|S_t, Z_{t+1}, C) = \pi_{\theta,\phi}(Z_{t+1}, A_t|S_t, Z_t, C) = \pi_{\theta,\phi}(\widetilde{A}_t|\widetilde{S}_t, C) \tag{6}$$

where the first equality holds because of the one-step Markov assumption (i.e., π_ϕ(A_t|S_t, Z_t, Z_{t+1}, C) = π_ϕ(A_t|S_t, Z_{t+1}, C)), and S̃_t ≜ (S_t, Z_t) and Ã_t ≜ (Z_{t+1}, A_t) denote the extended state and action space, respectively. Next, by substituting (S_t, A_t) with (S̃_t, Ã_t) and τ_E with the hierarchical trajectory (X_{0:T}, Z_{0:T}) in Equation 3, we get the MLE problem shown as Equation 7, from which we can recover the task-conditioned hierarchical reward function and policy. The derivation is in Appendix C.1.

$$\max_{\vartheta}\ \mathbb{E}_{C,\,(X^T, Z^T) \sim \pi_E(\cdot|C)}\left[\log P_\vartheta(X^T, Z^T|C)\right],\quad P_\vartheta(X_{0:T}, Z_{0:T}|C) \propto \widehat{P}_\vartheta(X_{0:T}, Z_{0:T}|C) = \mu(S_0|C)\prod_{t=0}^{T-1} P_{S_{t+1}|S_t, A_t, C}\, \exp\big(R_\vartheta(S_t, Z_t, Z_{t+1}, A_t|C)\big) \tag{7}$$

Equation 7 can be efficiently solved with the adversarial learning framework shown as Equation 8, where C, C^E ∼ prior(·), (X^E_{0:T}, Z^E_{0:T}) ∼ π_E(·|C^E), and (X_{0:T}, Z_{0:T}) ∼ π_{θ,ϕ}(·|C). At optimality, we can recover the hierarchical policy of the expert as π_{θ,ϕ} with these objectives, of which the justification is provided in Appendix C.2.

$$\min_{\vartheta}\ \mathbb{E}_{C^E,\,(X^E_{0:T}, Z^E_{0:T})}\Big[-\sum_{t=0}^{T-1}\log D_\vartheta(\widetilde{S}^E_t, \widetilde{A}^E_t|C^E)\Big] - \mathbb{E}_{C,\,(X_{0:T}, Z_{0:T})}\Big[\sum_{t=0}^{T-1}\log\big(1 - D_\vartheta(\widetilde{S}_t, \widetilde{A}_t|C)\big)\Big],$$
$$\max_{\theta,\phi}\ \mathcal{L}_{IL} = \mathbb{E}_{C,\,(X_{0:T}, Z_{0:T})}\Big[\sum_{t=0}^{T-1} R^t_{IL}\Big] \tag{8}$$

where the reward function R^t_IL = log D^t_ϑ − log(1 − D^t_ϑ) and D^t_ϑ = D_ϑ(S̃_t, Ã_t|C) = exp(f_ϑ(S̃_t, Ã_t|C)) / [exp(f_ϑ(S̃_t, Ã_t|C)) + π_{θ,ϕ}(Ã_t|S̃_t, C)].

In practice, the unstructured expert data {X^E_{0:T}}, i.e., trajectories only, is more accessible. In this case, we can view the latent contexts as hidden variables in a hidden Markov model (HMM) (Eddy, 1996), shown as Figure 1, and adopt an EM-style adaptation of our algorithm, where we use the variational posteriors introduced in Section 4.1 to sample the corresponding C^E, Z^E_{0:T} for each X^E_{0:T}. In the E step, we sample the global and local latent codes with C^E ∼ P_ψ(·|X^E_{0:T}) and Z^E_{0:T} ∼ P_ω(·|X^E_{0:T}, C^E), where P_ψ and P_ω represent the posterior networks for C and Z_{0:T}, respectively, with the parameters ψ and ω, i.e., the old parameters before being updated in the M step. Then, in the M step, we optimize the hierarchical policy and posteriors with Equation 4 and 8. Note that the expert data used in the first term of Equation 8 should be replaced with the (X^E_{0:T}, Z^E_{0:T}, C^E) collected in the E step. With this adaptation, we can get the solution of the original MLE problem (Equation 7), i.e., the recovered expert policy π_{θ,ϕ}, with only unstructured expert data, which is proved in Appendix C.3.
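The discriminator update of Equation 8 reduces to a standard binary cross-entropy in log-space once f_ϑ and log π_{θ,ϕ} have been evaluated on the extended state-action pairs. Below is a minimal PyTorch-style sketch; the tensor names are illustrative, not the authors' code.

```python
import torch

def h_airl_discriminator_loss(f_exp, logpi_exp, f_gen, logpi_gen):
    """Cross-entropy loss of Equation 8, where
       f_*     = f_vartheta(S~_t, A~_t | C) on expert / generated samples,
       logpi_* = log pi_{theta,phi}(A~_t | S~_t, C) on the same samples,
    all given as 1-D tensors over the per-step samples in a batch."""
    log_d_exp = f_exp - torch.logaddexp(f_exp, logpi_exp)        # log D on expert data
    log_1md_gen = logpi_gen - torch.logaddexp(f_gen, logpi_gen)  # log(1 - D) on policy data
    return -(log_d_exp.mean() + log_1md_gen.mean())

# The per-step policy reward used in the max_{theta,phi} step is then
# R_IL = log D - log(1 - D) = f_vartheta - log pi_{theta,phi}.
```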
4.3. Overall Framework

In Section 4.1, we propose L_MI(θ, ϕ, ψ) and L_DI(θ, ϕ, ω) to establish the causal connection between the policy and the hierarchical latent contexts. Then, in Section 4.2, we propose H-AIRL to recover the hierarchical policy from multi-task expert demonstrations, where the policy is trained with the objective L_IL(θ, ϕ). In this section, we introduce our method to update the hierarchical policy and posteriors with these objectives, and describe the overall algorithm framework. Detailed derivations of ∇_{θ,ϕ,ψ}L_MI, ∇_{θ,ϕ,ω}L_DI, and ∇_{θ,ϕ}L_IL are in Appendix D.1, D.2, and D.3, respectively.

First, the variational posteriors P_ψ and P_ω can be updated with the gradients shown in Equation 9 through Stochastic Gradient Descent (SGD) (Bottou, 2010).

$$\nabla_\psi \mathcal{L}_{MI} = \mathbb{E}_{C, X^T, Z^T}\left[\nabla_\psi \log P_\psi(C|X_{0:T})\right], \quad \nabla_\omega \mathcal{L}_{DI} = \sum_{t=1}^{T}\mathbb{E}_{C, X^t, Z^t}\left[\nabla_\omega \log P_\omega(Z_t|X^t, Z^{t-1}, C)\right] \tag{9}$$

Next, the gradients with respect to θ and ϕ, i.e., the hierarchical policy, are computed based on the overall objective:

$$\mathcal{L} = \alpha_1 \mathcal{L}_{MI} + \alpha_2 \mathcal{L}_{DI} + \alpha_3 \mathcal{L}_{IL} \tag{10}$$

where α_{1:3} are the weights (only the ratios among α_{1:3} matter) and are fine-tuned as hyperparameters. Based on L, we can get the unbiased gradient estimators with respect to θ and ϕ (derivations are in Appendix D.4):

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{C, X^T, Z^T}\Big[\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(Z_t|S_{t-1}, Z_{t-1}, C)\,\big(Ret_t - b_{high}(S_{t-1}, Z_{t-1}|C)\big)\Big],$$
$$\nabla_\phi \mathcal{L} = \mathbb{E}_{C, X^T, Z^T}\Big[\sum_{t=1}^{T}\nabla_\phi \log \pi_\phi(A_{t-1}|S_{t-1}, Z_t, C)\,\big(Ret_t - b_{low}(S_{t-1}, Z_t|C)\big)\Big] \tag{11}$$

$$Ret_t = \alpha_1 \log P_\psi(C|X_{0:T}) + \sum_{i=t}^{T}\left[\alpha_2 \log \frac{P_\omega(Z_i|X^i, Z^{i-1}, C)}{\pi_\theta(Z_i|S_{i-1}, Z_{i-1}, C)} + \alpha_3 R^{i-1}_{IL}\right] \tag{12}$$

Ret_t represents the return at timestep t, while b_high and b_low are the baseline terms for training π_θ and π_ϕ, respectively. Further, we claim that the advantage functions for training π_θ and π_ϕ are given by Ret_t − b_high(S_{t-1}, Z_{t-1}|C) and Ret_t − b_low(S_{t-1}, Z_t|C), respectively, based on which we can optimize the hierarchical policy via off-the-shelf RL algorithms. In our implementation, we adopt PPO (Schulman et al., 2017) to train π_θ and π_ϕ with their corresponding advantage functions. This forms a novel Hierarchical RL (HRL) algorithm, HPPO, which has shown superiority over RL and HRL baselines in our experiments. In Appendix D.5, we provide the overall algorithm as Algorithm 1 and illustrate the interactions among the networks in MH-AIRL in Figure 5.
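To make the policy update concrete, the sketch below computes Ret_t of Equation 12 for a single trajectory by backward accumulation; subtracting the learned baselines then gives the advantages fed to PPO. The argument names are illustrative, assuming the per-step terms have already been evaluated by the posterior, policy, and discriminator networks.

```python
def hppo_returns(log_post_c, log_ratio_z, r_il, a1, a2, a3):
    """Compute Ret_t (Equation 12) for one trajectory of length T.

    log_post_c       : log P_psi(C | X_{0:T})
    log_ratio_z[i-1] : log [P_omega(Z_i | X_{0:i}, Z_{0:i-1}, C) / pi_theta(Z_i | S_{i-1}, Z_{i-1}, C)]
    r_il[i-1]        : the AIRL reward R_IL^{i-1}, for i = 1..T
    a1, a2, a3       : the weights alpha_1, alpha_2, alpha_3 of Equation 10
    """
    T = len(r_il)
    rets, tail = [0.0] * T, 0.0
    for t in range(T, 0, -1):            # accumulate the tail sum backwards
        tail += a2 * log_ratio_z[t - 1] + a3 * r_il[t - 1]
        rets[t - 1] = a1 * log_post_c + tail
    return rets                          # rets[t-1] is Ret_t for t = 1..T

# Advantages for PPO: Ret_t - b_high(S_{t-1}, Z_{t-1} | C) for pi_theta,
#                     Ret_t - b_low(S_{t-1}, Z_t | C)      for pi_phi.
```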
5. Evaluation and Main Results

MH-AIRL is proposed to learn a multi-task hierarchical policy from a mixture of (unstructured) expert demonstrations. The learned policy can be adopted for any task sampled from a distribution of tasks. In this section: (1) We provide an ablation study with respect to the three main components of our algorithm: context-based multi-task/meta learning, option/hierarchical learning, and imitation learning. (2) We show that hierarchical policy learning can significantly improve the agent's performance on challenging long-horizon tasks. (3) Through qualitative and quantitative results, we show that our algorithm can capture the subtask structure within the expert demonstrations and that the learned basic skills for the subtasks (i.e., options) can be transferred to tasks not within the task distribution to aid learning, for better transferability.

The evaluation is based on three Mujoco (Todorov et al., 2012) locomotion tasks and the Kitchen task from the D4RL benchmark (Fu et al., 2020). All of them have continuous state and action spaces and contain compositional subtask structures, which makes them long-horizon and a lot more challenging. To be specific: (1) In Half Cheetah-Multi Vel, the goal velocity v is controlled by a 1-dim Gaussian context variable. The Half Cheetah agent is required to speed up to v/2 first, then slow down to 0, and finally achieve v. (2) In Walker-Rand Param, the Walker agent must achieve the goal velocity 4 in three stages, i.e., [2, 0, 4]. Meanwhile, the mass of the agent changes among different tasks, which is controlled by an 8-dim Gaussian context variable. (3) In Ant-Multi Goal, a 3D Ant agent needs to reach a certain goal, which is different in each task and controlled by a 2-dim Gaussian context variable (polar coordinates). Moreover, the agent must go through certain subgoals. For example, if the goal is (x, y) and |x| > |y|, the agent must go along [(0, 0), (x, 0), (x, y)]. (4) In Kitchen-Multi Seq, there are seven different subtasks, like manipulating the microwave, kettle, cabinet, switch, burner, etc. Each task requires the sequential completion of four specific subtasks. Twenty-four permutations are chosen, giving 24 tasks, each of which is sampled with the same probability and indexed by a discrete context variable (input as a one-hot vector).

Figure 2. (a) Multi-stage Mujoco locomotion tasks, where (1)-(3) show the Ant, Half Cheetah, and Walker agents, respectively. (d) The Kitchen task. (b)(c)(e)(f) Comparison results of MH-AIRL with SOTA Meta Imitation Learning baselines on the four challenging tasks (Half Cheetah-Multi Vel, Walker-Rand Param, Ant-Multi Goal, and Kitchen-Multi Seq).

Note that the states of the robot agents only contain their original states (defined by Mujoco or D4RL) and the task context variable, and do not include the actual task information, like the goal (velocity) and subgoal list. The task information is randomly generated by a parametric model whose parameter is used as the context variable (i.e., the Gaussian vectors mentioned above). The mapping between context variables and true task information is unknown to the learning agent. This makes the learning problem more challenging and our algorithm more general, since a vector of standard normal variables can be used to encode multiple types of task information.
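As an illustration only, the snippet below shows one way such a parametric task model could look for Ant-Multi Goal: a 2-dim standard-normal context is mapped to a goal through polar coordinates, and the subgoal list follows the rule described above. The scale factor and the exact mapping are hypothetical, not taken from the paper; only the context vector is observed by the agent.

```python
import numpy as np

def sample_ant_multigoal_task(radius_scale=5.0, rng=None):
    """Hypothetical task generator in the spirit of Ant-Multi Goal."""
    rng = rng or np.random.default_rng()
    c = rng.standard_normal(2)                     # 2-dim Gaussian context variable
    r, angle = radius_scale * abs(c[0]), np.pi * c[1]
    x, y = r * np.cos(angle), r * np.sin(angle)
    if abs(x) > abs(y):                            # subgoal rule described in the text
        subgoals = [(0.0, 0.0), (x, 0.0), (x, y)]
    else:                                          # symmetric case (assumed)
        subgoals = [(0.0, 0.0), (0.0, y), (x, y)]
    return c, (x, y), subgoals                     # the agent only observes c
```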
These scenarios are designed to evaluate our algorithm on a wide range of multi-task setups. First, the agent needs to adapt across different reward functions in (1) and (3) since the rewarding state changes, and adjust across different transition functions in (2) since the mass change will influence the robotic dynamics. Next, different from (1)-(3), discrete context variables are adopted in (4), and (4) provides more realistic and challenging robotic tasks for evaluation. The expert data for the Mujoco tasks are from expert agents trained with an HRL algorithm (Zhang & Whiteson, 2019) and specifically-designed rewards, while for the Kitchen task we use the human demonstrations provided by (Gupta et al., 2019). Note that the demonstrations (state-action pairs only) do not include the rewards, task or option variables. Codes for reproducing all the results are at https://github.com/LucasCJYSDL/Multi-task-Hierarchical-AIRL.

5.1. Effect of Hierarchical Learning

In this part, we evaluate whether the use of options can significantly improve learning for challenging compound multi-task settings. We compare MH-AIRL with SOTA Meta Imitation Learning (MIL) baselines, which also aim to train a policy that can be quickly adapted to a class of related tasks but do not adopt options in learning. Context-based MIL, such as PEMIRL (Yu et al., 2019) and SMILE (Ghasemipour et al., 2019), learns a context-conditioned policy that can be applied to any task from the class given the task variable. In contrast, the policy learned with gradient-based MIL, such as MAML-IL (Finn et al., 2017b) which integrates MAML (Finn et al., 2017a) (a commonly-adopted Meta Learning algorithm) and Behavioral Cloning (BC), has to be updated with gradients calculated from trajectories of the new task before being applied. We select PEMIRL, SMILE, and MAML-IL from the two major categories of MIL as our baselines. All the algorithms are trained with the same expert data, and evaluated on the same set of test tasks (not contained in the demonstrations). Note that, unlike the others, MAML-IL requires expert data of each test task besides the task variable when testing and requires the expert demonstrations to be categorized by task when training, which may limit its use in practical scenarios. Our algorithm is trained based on unstructured demonstrations and is only provided with the task context variable for testing.

In Figure 2, we record the change of the episodic reward (i.e., the sum of rewards for each step in an episode) on the test tasks as the number of training samples increases. The training is repeated 5 times with different random seeds for each algorithm, of which the mean and standard deviation are shown as the solid line and shaded area, respectively. Our algorithm outperforms the baselines in all tasks, and the improvement is more significant as the task difficulty goes up (i.e., in Ant & Kitchen), which shows the effectiveness of hierarchical policy learning, especially in complex tasks. MAML-IL makes use of more expert information in both training and testing, but its performance gets worse on more challenging tasks. This may be because it is based on BC, which is a supervised learning algorithm prone to compounding errors (Ross et al., 2011).

5.2. Ablation Study

We proceed to show the effectiveness of the IL and context-based multi-task learning components through an ablation study. We propose two ablated versions of our algorithm: (1) MH-GAIL, a variant obtained by replacing the AIRL component of MH-AIRL with GAIL (Ho & Ermon, 2016) (another commonly-used IL algorithm), of which the details are in Appendix E.2. (2) H-AIRL, a version that does not consider the task context C, which means P_ψ (i.e., the posterior for C) is not adopted, L_MI is eliminated from Equation 10, and the other networks do not use C as input. H-AIRL can be viewed as a newly-proposed HIL algorithm since it integrates the option framework and IL. To be more convincing, we also use two SOTA HIL algorithms, Option-GAIL (Jing et al., 2021) and DI-GAIL (Sharma et al., 2019), as baselines. The training with the HIL algorithms is based on the same multi-task expert data as ours. In Appendix E.1, we provide plots of the change of episodic rewards on the test tasks. The training with each algorithm is repeated 5 times with different random seeds. For each algorithm, we compute the average episodic reward after the learning converges in each of the 5 runs, and record the mean and standard deviation in Table 1 as the convergence performance.

Table 1. Numeric results of the ablation study (mean ± standard deviation of episodic reward over 5 runs).

| | Half Cheetah-Multi Vel | Walker-Rand Param | Ant-Multi Goal | Kitchen-Multi Seq |
|---|---|---|---|---|
| Expert | 376.55 ± 11.12 | 399.95 ± 1.43 | 1593.17 ± 40.91 | 400.00 ± 0.00 |
| MH-AIRL (ours) | 292.79 ± 15.99 | 357.59 ± 12.10 | 1530.82 ± 15.18 | 352.59 ± 15.12 |
| MH-GAIL (ours) | 211.32 ± 52.74 | 268.92 ± 49.29 | 1064.78 ± 180.28 | 212.13 ± 25.25 |
| H-AIRL (ours) | 126.85 ± 21.92 | 225.48 ± 12.87 | 533.80 ± 40.69 | 83.97 ± 10.95 |
| Option-GAIL | 44.89 ± 51.95 | 132.01 ± 54.75 | 383.05 ± 13.52 | 204.73 ± 56.41 |
| DI-GAIL | 56.77 ± 49.76 | 225.22 ± 14.01 | 328.06 ± 19.89 | 131.79 ± 53.29 |

First, we can see that our algorithm performs the best on all tasks over the ablations, showing the effectiveness of all the main modules of our algorithm. Second, MH-GAIL performs better than the HIL baselines, showing the necessity of including the context-based multi-task learning component. Without this component, HIL algorithms can only learn an average policy for a class of tasks from the mixture of multi-task demonstrations.
Last, H-AIRL, the newly-proposed HIL algorithm, performs better than the SOTA HIL baselines on the Mujoco tasks. A comprehensive empirical study of H-AIRL is provided in (Chen et al., 2022e).

5.3. Analysis of the Learned Hierarchical Policy

In this section, we conduct case studies to analyze whether the learned hierarchical policy can capture the subtask structure in the demonstrations, and whether the learned options can be transferred to tasks out of the task distribution. Capturing the subtask structures in real-life tasks can be essential for (multi-task) policy learning, because: (1) It is more human-like to split a complex task into more manageable subtasks to learn separately and then synthesize these skills to complete the whole task. (2) In some circumstances, the basic skills learned from one task setting can be reused in other task settings, so the agent only needs to update its high-level policy over the same skill set, significantly lowering the learning difficulty. We test our algorithm on Mujoco-Multi Goal (Figure 3(a)), where the agent is required to achieve a goal corresponding to the task variable (2-dim Gaussian). The expert demonstrations include 100 goal locations in the Cell, and the expert agent only moves horizontally or vertically. We test the learned hierarchical policy on 8 sparsely distributed goal locations, of which the trajectories are shown in Figure 3(d). We can see: (1) Four options (labeled with different colors) are discovered based on the demonstrations, each of which corresponds to a particular forward direction (green: up, yellow: down, etc.). These options are shared among the tasks. (2) The agent knows how to switch among the options to complete the tasks in stages (i.e., horizontal and vertical) with the learned high-level policy. Thus, our algorithm can effectively capture the compositional structure within the tasks and leverage it in the multi-task policy learning, which explains its superior performance. More analysis results of the learned hierarchical policy on Half Cheetah-Multi Vel and Walker-Rand Param are in Appendix E.3.

Figure 3. (a) The environment (Point Cell-Multi Goal) for multi-task learning with MH-AIRL. (d) Visualization of the learned hierarchical policy in (a). (b)(c) New task settings (Point Room and Point Maze) for evaluating the options learned in (a). (e)(f) Comparison results on (b)(c) between our proposed HRL algorithm (i.e., HPPO) initialized with the transferred options (i.e., HPPO-init) and other SOTA HRL and RL baselines.

Next, previous Meta/Multi-task Learning algorithms can learn a policy for a class of tasks whose contexts follow a certain distribution, but the learned policy cannot be transferred as a whole to tasks out of this class. In contrast, our algorithm recovers a hierarchical policy, of which the low-level part can be reused as basic skills for new tasks not necessarily in the same class, resulting in substantially improved transferability of the learned policy. To show this, we reuse the options discovered in Point Cell as the initialization of the low-level part of the hierarchical policy for the goal-achieving tasks in the new scenarios Point Room and Point Maze (Figure 3(b) and 3(c)). In each scenario, we select 4 challenging goals (starting from the center point) for evaluation, which are labeled as red points in the figure.
Unlike the other evaluation tasks, we provide the agent with sparse reward signals (a positive reward for reaching the goal only) instead of expert data, so they are RL rather than IL tasks. We use HPPO, proposed in Section 4.3, and initialize it with the transferred options (i.e., HPPO-init). To be more convincing, we use two other SOTA HRL and RL algorithms, DAC (Zhang & Whiteson, 2019) and PPO (Schulman et al., 2017), as baselines. In Figure 3(e) and 3(f), we plot the episodic reward change in the training process of each algorithm, where the solid line and shadow represent the mean and standard deviation of the performance across the 4 different goals in each scenario. We can see that the reuse of options significantly accelerates the learning process and that the newly proposed HRL algorithm performs much better than the baselines. Note that the other algorithms are trained for more episodes since they do not adopt the transferred options. We show that, in scenarios for which we do not have expert data or dense rewards, we can make use of the basic skills learned from expert demonstrations for similar task scenarios to effectively aid the learning, which provides a way to bridge IL and RL.

6. Conclusion and Discussion

In this paper, we propose MH-AIRL to learn a hierarchical policy that can be adopted to perform a class of tasks, based on a mixture of multi-task unannotated expert data. We evaluate our algorithm on a series of challenging robotic multi-task settings. The results show that the multi-task hierarchical policies trained with MH-AIRL perform significantly better than the monolithic policies learned with SOTA Multi-task/Meta IL baselines. Further, with MH-AIRL, the agent can capture the subtask structures in each task and form a skill for each subtask. The basic skills can be reused for different tasks in that distribution to improve the expert data efficiency, and can even be transferred to more distinct tasks out of the distribution to solve long-timescale sparse-reward RL problems. The primary limitation of our study is the inherent complexity of the overall framework, which comprises five networks as depicted in Figure 5. This complexity arises from our algorithm's integration of AIRL, context-based Meta IL, and the option framework. This amalgamation introduces certain challenges in the training process, particularly in determining the optimal number of training iterations for each network within each learning episode. After careful fine-tuning, we established a training iteration ratio of 1:3:10 for the discriminator, hierarchical policy, and variational posteriors, respectively. Despite this complexity, our evaluations across a wide variety of tasks utilized a consistent set of hyperparameters, showing the robustness of our approach.

References

Al-Abbasi, A. O., Ghosh, A., and Aggarwal, V. Deeppool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 20(12):4714-4727, 2019.

Andreas, J., Klein, D., and Levine, S. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 166-175, 2017.

Balachandran, A., Aggarwal, V., Halepovic, E., Pang, J., Seshan, S., Venkataraman, S., and Yan, H. Modeling web quality-of-experience on cellular networks. In Proceedings of the 20th annual international conference on Mobile computing and networking, pp.
213 224, 2014. Bian, X., Maldonado, O. M., and Hadfield, S. SKILL-IL: disentangling skill and knowledge in multitask imitation learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 7060 7065. IEEE, 2022. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics, pp. 177 186. Springer, 2010. Chen, J., Umrawal, A. K., Lan, T., and Aggarwal, V. Deepfreight: A model-free deep-reinforcement-learning-based algorithm for multi-transfer freight delivery. In Proceedings of the 31st International Conference on Automated Planning and Scheduling, pp. 510 518, 2021. Chen, J., Aggarwal, V., and Lan, T. ODPP: A unified algorithm framework for unsupervised option discovery based on determinantal point process. Co RR, abs/2212.00211, 2022a. Chen, J., Chen, J., Lan, T., and Aggarwal, V. Multi-agent covering option discovery based on kronecker product of factor graphs. IEEE Transactions on Artificial Intelligence, pp. 1 13, 2022b. doi: 10.1109/TAI.2022.3195818. Chen, J., Chen, J., Lan, T., and Aggarwal, V. Scalable multi-agent covering option discovery based on kronecker graphs. In Advances in Neural Information Processing Systems, volume 35, pp. 30406 30418, 2022c. Chen, J., Haliem, M., Lan, T., and Aggarwal, V. Multi-agent deep covering option discovery. Co RR, abs/2210.03269, 2022d. Chen, J., Lan, T., and Aggarwal, V. Option-aware adversarial inverse reinforcement learning for robotic control. ar Xiv preprint ar Xiv:2210.01969, 2022e. Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems 28, pp. 2980 2988, 2015. Cover, T. M. Elements of information theory. John Wiley & Sons, 1999. Deisenroth, M. P., Englert, P., Peters, J., and Fox, D. Multitask policy search for robotics. In IEEE International Conference on Robotics and Automation, pp. 3876 3881. IEEE, 2014. Devin, C., Geng, D., Abbeel, P., Darrell, T., and Levine, S. Compositional plan vectors. In Advances in Neural Information Processing Systems 32, pp. 14963 14974, 2019. Duminy, N., Nguyen, S. M., Zhu, J., Duhaut, D., and Kerdreux, J. Intrinsically motivated open-ended multi-task learning using transfer learning to discover task hierarchy. Co RR, abs/2102.09854, 2021. Eddy, S. R. Hidden markov models. Current Opinion in Structural Biology, 6(3):361 365, 1996. ISSN 0959440X. Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1126 1135, 2017a. Finn, C., Yu, T., Zhang, T., Abbeel, P., and Levine, S. Oneshot visual imitation learning via meta-learning. In Proceedings of the 1st Annual Conference on Robot Learning, volume 78, pp. 357 368, 2017b. Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. In Proceedings of the 5th International Conference on Learning Representations. Open Review.net, 2017. Fox, R., Berenstein, R., Stoica, I., and Goldberg, K. Multitask hierarchical imitation learning for home automation. In Proceedings of the 15th IEEE International Conference on Automation Science and Engineering, pp. 1 8. IEEE, 2019. Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. Co RR, abs/1710.11248, 2017. 
Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4RL: datasets for deep data-driven reinforcement learning. Co RR, abs/2004.07219, 2020. Fu, X., Peddireddy, D., Aggarwal, V., and Jun, M. B.-G. Improved dexel representation: A 3-d cnn geometry descriptor for manufacturing cad. IEEE Transactions on Industrial Informatics, 18(9):5882 5892, 2021. Multi-task Hierarchical Adversarial Inverse Reinforcement Learning Galvin, D. Three tutorial lectures on entropy and counting. ar Xiv preprint ar Xiv:1406.7872, 2014. Gao, C., Jiang, Y., and Chen, F. Transferring hierarchical structures with dual meta imitation learning. In Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pp. 762 773, 2022. Geng, N., Bai, Q., Liu, C., Lan, T., Aggarwal, V., Yang, Y., and Xu, M. A reinforcement learning framework for vehicular network routing under peak and average constraints. IEEE Transactions on Vehicular Technology, 2023. Ghasemipour, S. K. S., Gu, S., and Zemel, R. S. Smile: Scalable meta inverse reinforcement learning through context-conditional policies. In Advances in Neural Information Processing Systems 32, pp. 7879 7889, 2019. Gonzalez, G., Balakuntala, M., Agarwal, M., Low, T., Knoth, B., Kirkpatrick, A. W., Mc Kee, J., Hager, G., Aggarwal, V., Xue, Y., et al. Asap: A semi-autonomous precise system for telesurgery during communication delays. IEEE Transactions on Medical Robotics and Bionics, 5(1):66 78, 2023. Gupta, A., Kumar, V., Lynch, C., Levine, S., and Hausman, K. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. In Proceedings of the 3rd Annual Conference on Robot Learning, volume 100, pp. 1025 1037, 2019. Higgins, I., Matthey, L., Pal, A., Burgess, C. P., Glorot, X., Botvinick, M. M., Mohamed, S., and Lerchner, A. betavae: Learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations, 2017. Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, pp. 4565 4573, 2016. Jensen, J. L. W. V. Sur les fonctions convexes et les in egalit es entre les valeurs moyennes. Acta mathematica, 30(1):175 193, 1906. Jing, M., Huang, W., Sun, F., Ma, X., Kong, T., Gan, C., and Li, L. Adversarial option-aware hierarchical imitation learning. In Proceedings of the 38th International Conference on Machine Learning, pp. 5097 5106, 2021. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014. Kosiorek, A. R., Sabour, S., Teh, Y. W., and Hinton, G. E. Stacked capsule autoencoders. In Advances in Neural Information Processing Systems 32, pp. 15486 15496, 2019. Li, C., Song, D., and Tao, D. The skill-action architecture: Learning abstract action embeddings for reinforcement learning. In Submissions of the 9th International Conference on Learning Representations, 2021. Luo, X., Ma, X., Munden, M., Wu, Y.-J., and Jiang, Y. A multisource data approach for estimating vehicle queue length at metered on-ramps. Journal of Transportation Engineering, Part A: Systems, 148(2):04021117, 2022. Ma, X., Karimpour, A., and Wu, Y.-J. Statistical evaluation of data requirement for ramp metering performance assessment. Transportation Research Part A: Policy and Practice, 141:248 261, 2020. ISSN 0965-8564. Mangal, S., Joshi, P., and Modak, R. LSTM vs. GRU vs. bidirectional RNN for script generation. 
Co RR, abs/1908.04332, 2019. Massey, J. et al. Causality, feedback and directed information. In Proc. Int. Symp. Inf. Theory Applic.(ISITA-90), pp. 303 305, 1990. Ng, A. Y. and Russell, S. Algorithms for inverse reinforcement learning. In Proceedings of the 7th International Conference on Machine Learning, pp. 663 670. Morgan Kaufmann, 2000. Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pp. 278 287. Morgan Kaufmann, 1999. Peddireddy, D., Fu, X., Shankar, A., Wang, H., Joung, B. G., Aggarwal, V., Sutherland, J. W., and Jun, M. B.-G. Identifying manufacturability and machining processes using deep 3d convolutional networks. Journal of Manufacturing Processes, 64:1336 1348, 2021. ISSN 1526-6125. Pomerleau, D. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1): 88 97, 1991. Ross, S., Gordon, G. J., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15, pp. 627 635, 2011. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. Co RR, abs/1707.06347, 2017. Sharma, M., Sharma, A., Rhinehart, N., and Kitani, K. M. Directed-info GAIL: learning hierarchical policies from unsegmented demonstrations using directed information. In Proceedings of the 7th International Conference on Learning Representations. Open Review.net, 2019. Multi-task Hierarchical Adversarial Inverse Reinforcement Learning Singh, A., Jang, E., Irpan, A., Kappler, D., Dalal, M., Levine, S., Khansari, M., and Finn, C. Scalable multitask imitation learning with autonomous improvement. In 2020 IEEE International Conference on Robotics and Automation, pp. 2167 2173. IEEE, 2020. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Sutton, R. S., Precup, D., and Singh, S. P. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112: 181 211, 1999. Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033. IEEE, 2012. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998 6008, 2017. Wang, P., Liu, D., Chen, J., Li, H., and Chan, C. Decision making for autonomous driving via augmented adversarial inverse reinforcement learning. In IEEE International Conference on Robotics and Automation, pp. 1036 1042. IEEE, 2021. Yu, L., Yu, T., Finn, C., and Ermon, S. Meta-inverse reinforcement learning with probabilistic context variables. In Advances in Neural Information Processing Systems 32, pp. 11749 11760, 2019. Yu, T., Abbeel, P., Levine, S., and Finn, C. One-shot hierarchical imitation learning of compound visuomotor tasks. Co RR, abs/1810.11043, 2018a. Yu, T., Finn, C., Xie, A., Dasari, S., Zhang, T., Abbeel, P., and Levine, S. One-shot imitation from observing humans via domain-adaptive meta-learning. ar Xiv preprint ar Xiv:1802.01557, 2018b. Zhang, S. and Whiteson, S. DAC: the double actor-critic architecture for learning options. 
In Advances in Neural Information Processing Systems 32, pp. 2010 2020, 2019. Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pp. 1433 1438, 2008. Multi-task Hierarchical Adversarial Inverse Reinforcement Learning A. Appendix on the Background and Related Works A.1. AIRL Framework to Solve Equation 3 For each task C, we need to recover the task-specific reward function Rϑ(S, A|C) and policy π(A|S, C) based on the corresponding expert trajectories τE πE( |C) which can be solved by AIRL as mentioned in Section 3.1. Thus, we have the following objective functions for training, which is a simple extension of AIRL (Ghasemipour et al., 2019; Yu et al., 2019): EτE πE( |C) t=0 log Dϑ(St, At|C) t=0 log(1 Dϑ(St, At|C)) t=0 log Dϑ(St, At|C) log(1 Dϑ(St, At|C)) where Dϑ(S, A|C) = exp(fϑ(S, A|C))/[exp(fϑ(S, A|C)) + π(A|S, C)]. A.2. Implementation of the Hierarchical Policy in the One-step Option Model In this section, we give out the detailed structure design of the hierarchical policy introduced in Section 3.3, i.e., πθ(Z|S, Z ) and πϕ(A|S, Z), which is proposed in (Li et al., 2021). This part is not our contribution, so we only provide the details for the purpose of implementation. As mentioned in Section 3.3, the structure design is based on the Multi-Head Attention (MHA) mechanism (Vaswani et al., 2017). An attention function can be described as mapping a query, i.e., q Rdk, and a set of key-value pairs, i.e., K = [k1 kn]T Rn dk and V = [v1 vn]T Rn dv, to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. To be specific: Attention(q, K, V ) = " exp(q ki) Pn j=1 exp(q kj) vi where q, K, V are learnable parameters, exp(q ki) Pn j=1 exp(q kj) represents the attention weight that the model should pay to item i. In MHA, the query and key-value pairs are first linearly projected h times to get h different queries, keys and values. Then, an attention function is performed on each of these projected versions of queries, keys and values in parallel to get h outputs which are then be concatenated and linearly projected to acquire the final output. The whole process can be represented as Equation 16, where W q i Rdk dk, W K i Rdk dk, W V i Rdv dv, W O Rndv dv are the learnable parameters. MHA(q, K, V ) = Concat(head1, , headh)W O, headi = Attention(q W q i , KW K i , V W V i ) (16) In this work, the option is represented as an N-dimensional one-hot vector, where N denotes the total number of options to learn. The high-level policy πθ(Z|S, Z ) has the structure shown as: q = linear(Concat[S, W T C Z ]), dense Z = MHA(q, WC, WC), Z Categorical( |dense Z) (17) WC RN E is the option context matrix of which the i-th row represents the context embedding of the option i. WC is also used as the key and value matrix for the MHA, so dk = dv = E in this case. Note that WC is only updated in the MHA module. Intuitively, πθ(Z|S, Z ) attends to all the option context embeddings in WC according to S and Z . If Z still fits S, πθ(Z|S, Z ) will assign a larger attention weight to Z and thus has a tendency to continue with it; otherwise, a new skill with better compatibility will be sampled. 
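The following PyTorch-style sketch illustrates the high-level policy structure just described (Equation 17); the final linear layer mapping the attention output to option logits, as well as all layer names and sizes, are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    """Sketch of Equation 17: a query built from the state and the previous
    option's context embedding attends over the option context matrix W_C,
    which serves as both keys and values."""

    def __init__(self, state_dim, num_options, embed_dim, num_heads=1):
        super().__init__()
        self.W_C = nn.Parameter(torch.randn(num_options, embed_dim))  # option context matrix
        self.query = nn.Linear(state_dim + embed_dim, embed_dim)
        self.mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out = nn.Linear(embed_dim, num_options)  # maps dense_Z to option logits (assumed)

    def forward(self, s, z_prev_onehot):
        # q = linear(concat[S, W_C^T Z'])
        q = self.query(torch.cat([s, z_prev_onehot @ self.W_C], dim=-1)).unsqueeze(1)
        keys = values = self.W_C.unsqueeze(0).expand(s.shape[0], -1, -1)
        dense_z, _ = self.mha(q, keys, values)        # attend over all option embeddings
        return torch.distributions.Categorical(logits=self.out(dense_z.squeeze(1)))
```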
As for the low-level policy πϕ(A|S, Z), it has the following structure: dense A = MLP(S, W T C Z), A Categorical/Gaussian( |dense A) (18) where MLP represents a multilayer perceptron, A follows a categorical distribution for the discrete case or a gaussian distribution for the continuous case. The context embedding corresponding to Z, i.e., W T C Z, instead of Z only, is used as input of πϕ since it can encode multiple properties of the option Z (Kosiorek et al., 2019). Multi-task Hierarchical Adversarial Inverse Reinforcement Learning B. Appendix on the Hierarchical Latent context Structure B.1. A Lower Bound of the Directed Information Objective In this section, we give out the derivation of a lower bound of the directed information from the trajectory sequence X0:T to the local latent context sequence Z0:T conditioned on the global latent context C, i.e., I(X0:T Z0:T |C) as follows: I(X0:T Z0:T |C) = t=1 [I(X0:t; Zt|Z0:t 1, C)] t=1 [H(Zt|Z0:t 1, C) H(Zt|X0:t, Z0:t 1, C)] t=1 [H(Zt|X0:t 1, Z0:t 1, C) H(Zt|X0:t, Z0:t 1, C)] t=1 [H(Zt|X0:t 1, Z0:t 1, C)+ X0:t,C, Z0:t 1 P(X0:t, Z0:t 1, C) X Zt P(Zt|X0:t, Z0:t 1, C) log P(Zt|X0:t, Z0:t 1, C)] In Equation 19, I(V ar1; V ar2|V ar3) denotes the conditional mutual information, H(V ar1|V ar2) denotes the conditional entropy, and the inequality holds because of the basic property related to conditional entropy: increasing conditioning cannot increase entropy (Galvin, 2014). H(Zt|X0:t 1, Z0:t 1, C) is the entropy of the high-level policy πθ(Zt|St 1, Zt 1), where the other variables in X0:t 1, Z0:t 1 are neglected due to the one-step Markov assumption, and more convenient to obtain. Further, the second term in the last step can be processed as follows: X Zt P(Zt|X0:t, Z0:t 1, C) log P(Zt|X0:t, Z0:t 1, C) Zt P(Zt|X0:t, Z0:t 1, C) log P(Zt|X0:t, Z0:t 1, C) Pω(Zt|X0:t, Z0:t 1, C) + log Pω(Zt|X0:t, Z0:t 1, C) = DKL(P( |X0:t, Z0:t 1, C)||Pω( |X0:t, Z0:t 1, C)) + X Zt P(Zt|X0:t, Z0:t 1, C) log Pω(Zt|X0:t, Z0:t 1, C) Zt P(Zt|X0:t, Z0:t 1, C) log Pω(Zt|X0:t, Z0:t 1, C) where DKL( ) denotes the Kullback-Leibler (KL) Divergence which is non-negative (Cover, 1999), Pω(Zt|X0:t, Z0:t 1, C) is a variational estimation of the posterior distribution of Zt given X0:t and Z0:t 1, i.e., P(Zt|X0:t, Z0:t 1, C), which is modeled as a recurrent neural network with the parameter set ω in our work. Based on Equation 19 and 20, we can obtain a lower bound of I(X0:T Z0:T |C) denoted as LDI: X0:t,C, Z0:t P(X0:t, Z0:t, C) log Pω(Zt|X0:t, Z0:t 1, C) + H(Zt|X0:t 1, Z0:t 1, C)] (21) Note that the joint distribution P(X0:t, Z0:t, C) has a recursive definition as follows: P(X0:t, Z0:t, C) = prior(C)P(X0:t, Z0:t|C) = prior(C)P(Xt|X0:t 1, Z0:t, C)P(Zt|X0:t 1, Z0:t 1, C)P(X0:t 1, Z0:t 1|C) (22) P(X0, Z0|C) = P((S0, A 1), Z0|C) = µ(S0|C) (23) where µ(S0|C) denotes the distribution of the initial states for task C. Equation 23 holds because A 1 and Z0 are dummy variables which are only for simplifying notations and never executed and set to be constant across different tasks. 
Based on Multi-task Hierarchical Adversarial Inverse Reinforcement Learning Equation 22 and 23, we can get: P(X0:t, Z0:t, C) = prior(C)µ(S0|C) i=1 P(Zi|X0:i 1, Z0:i 1, C)P(Xi|X0:i 1, Z0:i, C) = prior(C)µ(S0|C) i=1 P(Zi|X0:i 1, Z0:i 1, C)P((Si, Ai 1)|X0:i 1, Z0:i, C) = prior(C)µ(S0|C) i=1 P(Zi|X0:i 1, Z0:i 1, C)P(Ai 1|X0:i 1, Z0:i, C)P(Si|Si 1, Ai 1, C) = prior(C)µ(S0|C) i=1 πθ(Zi|Si 1, Zi 1, C)πϕ(Ai 1|Si 1, Zi, C)P(Si|Si 1, Ai 1, C) In Equation 24, prior(C) is the known prior distribution of the task context C, P(Si|Si 1, Ai 1, C) is the transition dynamic of task C, P(Zi|X0:i 1, Z0:i 1, C) and P(Ai 1|X0:i 1, Z0:i, C) can be replaced with πθ and πϕ, respectively, due to the one-step Markov assumption. To sum up, we can adopt the high-level policy, low-level policy and variational posterior to get an estimation of the lower bound of the directed information objective through Monte Carlo sampling (Sutton & Barto, 2018) according to Equation 21 and 24, which can then be used to optimize the three networks. B.2. A Lower Bound of the Mutual Information Objective In this section, we give out the derivation of a lower bound of the mutual information between the trajectory sequence X0:T and its corresponding task context C, i.e., I(X0:T ; C). I(X0:T ; C) = H(C) H(C|X0:T ) X0:T P(X0:T ) X C P(C|X0:T ) log P(C|X0:T ) X0:T P(X0:T ) X C P(C|X0:T ) log P(C|X0:T ) Pψ(C|X0:T ) + X X0:T ,C P(X0:T , C) log Pψ(C|X0:T ) X0:T P(X0:T )DKL(P( |X0:T ||Pψ( |X0:T )) + X X0:T ,C P(X0:T , C) log Pψ(C|X0:T ) X0:T ,C P(X0:T , C) log Pψ(C|X0:T ) C prior(C) X X0:T P(X0:T |C) log Pψ(C|X0:T ) C prior(C) X X0:T ,Z0:T P(X0:T , Z0:T |C) log Pψ(C|X0:T ) In Equation 25, H( ) denotes the entropy, prior(C) denotes the known prior distribution of the task context C, P(X0:T , Z0:T |C) can be calculated with Equation 24 by setting t = T, and Pψ(C|X0:T ) is a variational estimation of the posterior distribution P(C|X0:T ) which is implemented as a recurrent neural network with the parameter set ψ. Note that the inequality holds because the KL-Divergence, i.e., DKL( ), is non-negative. B.3. The Analogy with the VAE Framework Variational Autoencoder (VAE) (Kingma & Welling, 2014) learns a probabilistic encoder Pη(V |U) and decoder Pξ(U|V ) which map between data U and latent variables V by optimizing the evidence lower bound (ELBO) on the marginal distribution Pξ(U), assuming the prior distributions P U( ) and P V ( ) over the data and latent variables respectively. The authors of (Higgins et al., 2017) extend the VAE approach by including a parameter β to control the capacity of the latent V , Multi-task Hierarchical Adversarial Inverse Reinforcement Learning Figure 4. The analogy of our learning framework with the VAE structure. of which the ELBO is: max η,ξ E U P U( ) V Pη( |U) log Pξ(U|V ) βDKL(Pη(V |U)||P V (V )) The first term can be viewed as the reconstruction accuracy of the data U from V , and the second term works as a regularizer for the distribution of the latent variables V , where DKL denotes the KL Divergence (Cover, 1999). VAE can efficiently solve the posterior inference problem for datasets with continuous latent variables where the true posterior is intractable, through fitting an approximate inference model Pξ (i.e., the variational posterior). The variational lower bound, i.e., ELBO, can be straightforwardly optimized using standard stochastic gradient methods, e.g., SGD (Bottou, 2010). 
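As a concrete reference for Equation 26, the following is a minimal sketch of the β-VAE loss under common assumptions: a Gaussian encoder with a standard-normal prior over V and a Gaussian reconstruction term. The function and argument names are illustrative and not tied to any particular implementation.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(decoder, encoder_mu, encoder_logvar, u, beta=1.0):
    """Sketch of the beta-VAE objective (Equation 26), assuming a Gaussian
    encoder P_eta(V|U) = N(mu, sigma^2) with a standard-normal prior over V,
    and a Gaussian reconstruction likelihood; returns a loss to minimize."""
    # Reparameterized sample V ~ P_eta(.|U)
    std = torch.exp(0.5 * encoder_logvar)
    v = encoder_mu + std * torch.randn_like(std)
    # Reconstruction term: log P_xi(U|V), here a Gaussian log-likelihood up to constants.
    recon = -F.mse_loss(decoder(v), u, reduction="sum")
    # KL(P_eta(V|U) || N(0, I)) in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1 + encoder_logvar - encoder_mu.pow(2) - encoder_logvar.exp())
    # ELBO = reconstruction - beta * KL; negate it for gradient descent.
    return -(recon - beta * kl)
```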
As shown in Figure 4, the optimization of $L_{MI}$ (Equation 4) can be viewed as using $\pi_\theta$ and $\pi_\phi$ as the encoder and $P_\psi$ as the decoder and then minimizing the reconstruction error of $C$ from $X_{0:T}$, while the regularizer term in Equation 26 is neglected (i.e., $\beta = 0$). As for the optimization of $L_{DI}$ (Equation 4), at each timestep $t$, $\pi_\phi$ and $P_\omega$ form a conditional VAE between $Z_t$ and $X_t$, which is conditioned on the history information and task code, i.e., $(X_{0:t-1}, Z_{0:t-1}, C)$, with the prior distribution of $Z_t$ provided by $\pi_\theta$. Compared with the VAE objective (i.e., Equation 26), $\pi_\phi$ and $P_\omega$ in $L_{DI}$ work as the encoder and decoder respectively; $\pi_\theta$ provides the prior, which corresponds to $P^U(\cdot)$. Both $P_\psi$ and $P_\omega$ use sequential data as input and thus are implemented with RNNs. The variational posterior for the task code, i.e., $P_\psi(C|X_{0:T})$, takes the trajectory $X_{0:T}$ as input and is implemented as a bidirectional GRU (Mangal et al., 2019) to make sure that both the beginning and end of the trajectory are equally important. On the other hand, the variational posterior for the local latent code, i.e., $P_\omega(Z_t|X_{0:t}, Z_{0:t-1}, C)$, is modeled as $P_\omega(Z_t|X_t, Z_{t-1}, C, h_{t-1})$, where $h_{t-1}$ is the internal hidden state of an RNN. $h_{t-1}$ is recursively maintained along the time series using the GRU rule, i.e., $h_{t-1} = \mathrm{GRU}(X_{t-1}, Z_{t-2}, h_{t-2})$, to embed the history information in the trajectory, i.e., $X_{0:t-1}$ and $Z_{0:t-2}$. Note that RNN-based posteriors have been used and justified for processing sequential data (Chung et al., 2015).

C. Appendix on Hierarchical AIRL

C.1. Derivation of the MLE Objective

In Equation 27, $Z_0$ is a dummy variable which is assigned before the episode begins and never executed. It is implemented as a constant across different episodes, so we have $P(S_0, Z_0|C) = P(S_0|C) = \mu(S_0|C)$, where $\mu(\cdot|C)$ denotes the initial state distribution for task $C$. On the other hand, we have $P(S_{t+1}, Z_{t+1}|S_t, Z_t, Z_{t+1}, A_t, C) = P(Z_{t+1}|S_t, Z_t, Z_{t+1}, A_t, C)P(S_{t+1}|S_t, Z_t, Z_{t+1}, A_t, C) = P(S_{t+1}|S_t, A_t, C)$, since the transition dynamic $P$ is irrelevant to the local latent codes $Z$ and only related to the task context $C$.
$$P_\vartheta(X_{0:T}, Z_{0:T}|C) \propto \mu(\tilde{S}_0|C)\prod_{t=0}^{T-1} P(\tilde{S}_{t+1}|\tilde{S}_t, \tilde{A}_t, C)\exp(R_\vartheta(\tilde{S}_t, \tilde{A}_t|C))$$
$$= P(S_0, Z_0|C)\prod_{t=0}^{T-1} P(S_{t+1}, Z_{t+1}|S_t, Z_t, Z_{t+1}, A_t, C)\exp(R_\vartheta(S_t, Z_t, Z_{t+1}, A_t|C))$$
$$= \mu(S_0|C)\prod_{t=0}^{T-1} P(S_{t+1}|S_t, A_t, C)\exp(R_\vartheta(S_t, Z_t, Z_{t+1}, A_t|C)) \tag{27}$$

C.2. Justification of the Objective Function Design in Equation 8

In this section, we prove that by optimizing the objective functions shown in Equation 8, we can get the solution of the MLE problem shown as Equation 7, i.e., the task-conditioned hierarchical reward function and policy of the expert. In Appendix A of (Fu et al., 2017), the authors show that the discriminator objective (the first equation in 8) is equivalent to the MLE objective (Equation 7), where $f_\vartheta$ serves as $R_\vartheta$, when $D_{KL}(\pi(\tau)||\pi_E(\tau))$ is minimized. The same conclusion can be acquired by simply replacing $\{S_t, A_t, \tau\}$ with $\{(S_t, Z_t), (Z_{t+1}, A_t), (X_{0:T}, Z_{0:T})\}$, i.e., the extended definitions of the state, action and trajectory, in the original proof, which we do not repeat here.
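For implementation purposes, the extended-space discriminator and the cross-entropy objective (the first equation in 8) can be computed as in the sketch below. The log-space formulation and the function names are our own choices, introduced for numerical stability; this is only a sketch of the construction described above.

```python
import torch

def hairl_discriminator_logits(f_value, log_pi_high, log_pi_low):
    """Sketch of the H-AIRL discriminator on the extended space, assuming
    f_value = f_theta(S, Z, Z', A | C) and
    log pi_{theta,phi}(Z', A | S, Z, C) = log_pi_high + log_pi_low.
    Returns (log D, log(1 - D)), where D = exp(f) / (exp(f) + pi)."""
    log_pi = log_pi_high + log_pi_low
    # log(exp(f) + pi) computed stably via logsumexp.
    log_denom = torch.logsumexp(torch.stack([f_value, log_pi]), dim=0)
    log_d = f_value - log_denom
    log_one_minus_d = log_pi - log_denom
    return log_d, log_one_minus_d

def discriminator_loss(log_d_expert, log_one_minus_d_policy):
    """Cross-entropy form of the discriminator objective: maximize
    E_expert[log D] + E_policy[log(1 - D)], returned here as a loss."""
    return -(log_d_expert.mean() + log_one_minus_d_policy.mean())
```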
Then, we only need to prove that EC [DKL(πθ,ϕ(X0:T , Z0:T |C)||πE(X0:T , Z0:T |C))] can be minimized through the second equation in 8: max θ,ϕ EC prior( ),(X0:T ,Z0:T ) πθ,ϕ( |C) = max θ,ϕ E C,X0:T ,Z0:T t=0 log Dϑ(St, Zt, Zt+1, At|C) log(1 Dϑ(St, Zt, Zt+1, At|C)) = max θ,ϕ E C,X0:T ,Z0:T t=0 fϑ(St, Zt, Zt+1, At|C) log πθ,ϕ(Zt+1, At|St, Zt, C) = max θ,ϕ E C,X0:T ,Z0:T t=0 fϑ(St, Zt, Zt+1, At|C) log(πθ(Zt+1|St, Zt, C)πϕ(At|St, Zt+1, C)) = max θ,ϕ E C,X0:T ,Z0:T log QT 1 t=0 exp(fϑ(St, Zt, Zt+1, At|C)) QT 1 t=0 πθ(Zt+1|St, Zt, C)πϕ(At|St, Zt+1, C) max θ,ϕ E C,X0:T ,Z0:T log QT 1 t=0 exp(fϑ(St, Zt, Zt+1, At|C))/ZC ϑ QT 1 t=0 πθ(Zt+1|St, Zt, C)πϕ(At|St, Zt+1, C) Note that ZC ϑ = P X0:T ,Z0:T b Pϑ(X0:T , Z0:T |C) (defined in Equation 7) is the normalized function parameterized with ϑ, so the introduction of ZC ϑ will not influence the optimization with respect to θ and ϕ and the equivalence at the last step holds. Also, the second equality shows that the task-conditioned hierarchical policy is recovered by optimizing an Multi-task Hierarchical Adversarial Inverse Reinforcement Learning entropy-regularized policy objective where fϑ serves as Rϑ. Further, we have: max θ,ϕ E C,X0:T ,Z0:T log QT 1 t=0 exp(fϑ(St, Zt, Zt+1, At|C))/ZC ϑ QT 1 t=0 πθ(Zt+1|St, Zt, C)πϕ(At|St, Zt+1, C) = max θ,ϕ E C,X0:T ,Z0:T log µ(S0|C) QT 1 t=0 P(St+1|St, At, C) QT 1 t=0 exp(fϑ(St, Zt, Zt+1, At|C))/ZC ϑ µ(S0|C) QT 1 t=0 P(St+1|St, At, C) QT 1 t=0 πθ(Zt+1|St, Zt, C)πϕ(At|St, Zt+1, C) = max θ,ϕ EC prior( ),(X0:T ,Z0:T ) πθ,ϕ( |C) log πE(X0:T , Z0:T |C) πθ,ϕ(X0:T , Z0:T |C) = max θ,ϕ EC prior( ) [ DKL(πθ,ϕ(X0:T , Z0:T |C)||πE(X0:T , Z0:T |C))] min θ,ϕ EC prior( ) [DKL(πθ,ϕ(X0:T , Z0:T |C)||πE(X0:T , Z0:T |C))] where the second equality holds because of the definition of πE (Equation 7 with fϑ serving as Rϑ) and πθ,ϕ (Equation 34). C.3. Justification of the EM-style Adaption Given only a dataset of expert trajectories, i.e., DE {X0:T }, we can still maximize the likelihood estimation EX0:T DE [log Pϑ(X0:T )] through an EM-style adaption: (We use X0:T , C, Z0:T instead of XE 0:T , CE, ZE 0:T for simplicity.) EX0:T DE [log Pϑ(X0:T )] = EX0:T DE C,Z0:T Pϑ(X0:T , C, Z0:T ) Pϑ(X0:T , C, Z0:T ) Pϑ(C, Z0:T |X0:T ) Pϑ(C, Z0:T |X0:T ) log E(C,Z0:T ) Pϑ( |X0:T ) Pϑ(X0:T , C, Z0:T ) Pϑ(C, Z0:T |X0:T ) E(C,Z0:T ) Pϑ( |X0:T ) log Pϑ(X0:T , C, Z0:T ) Pϑ(C, Z0:T |X0:T ) = EX0:T DE,C Pψ( |X0:T ),Z0:T Pω( |X0:T ,C) log Pϑ(X0:T , C, Z0:T ) Pϑ(C, Z0:T |X0:T ) = EX0:T ,C,Z0:T [log Pϑ(X0:T , C, Z0:T )] EX0:T ,C,Z0:T log Pϑ(C, Z0:T |X0:T ) = EX0:T ,C,Z0:T [log Pϑ(X0:T , Z0:T |C)] EX0:T ,C,Z0:T log prior(C) + log Pϑ(C, Z0:T |X0:T ) where we adopt the Jensen s inequality (Jensen, 1906) in the 4-th step. Also, we note that Pψ,ω(C, Z0:T |X0:T ) provides a posterior distribution of (C, Z0:T ), which corresponds to the generating process led by the hierarchical policy. As justified in C.2, the hierarchical policy is trained with the reward function parameterized with ϑ. Thus, the hierarchical policy is a function of ϑ, and the network Pψ,ω corresponding to the hierarchical policy provides a posterior distribution related to the parameter set ϑ, i.e., (C, Z0:T ) Pϑ( |X0:T ) C Pψ( |X0:T ), Z0:T Pω( |X0:T , C), due to which the 5-th step holds. Note that ϑ, ψ, ω denote the parameters ϑ, ψ, ω before being updated in the M step. In the second equality of Equation 30, we introduce the sampled global and local latent codes in the E step as discussed in Section 4.2. 
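Concretely, the E step described above can be sketched as follows, where `posterior_task` and `posterior_option` are assumed wrappers around the RNN-based posteriors $P_\psi$ and $P_\omega$ from Appendix B.3; their interfaces are illustrative assumptions rather than the released implementation.

```python
import torch

@torch.no_grad()
def e_step(expert_traj, posterior_task, posterior_option):
    """Sketch of the E step: label an unannotated expert trajectory with
    sampled latent codes, C^E ~ P_psi(.|X_{0:T}) and then
    Z^E_t ~ P_omega(.|X_t, Z_{t-1}, C, h_{t-1}) step by step."""
    # Global latent code: one sample from the bidirectional-GRU posterior.
    task_dist = posterior_task(expert_traj)        # distribution over task codes
    c = task_dist.sample()
    # Local latent codes: sampled recursively, with the history embedded in
    # the recurrent hidden state maintained by posterior_option.
    z_prev, hidden = posterior_option.initial_state()
    options = []
    for x_t in expert_traj:                        # x_t corresponds to (S_t, A_{t-1})
        option_dist, hidden = posterior_option(x_t, z_prev, c, hidden)
        z_prev = option_dist.sample()
        options.append(z_prev)
    return c, options
```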
Then, in the M step, we optimize the objectives shown in Equation 4 and 8 for iterations, by replacing the samples in the first term of Equation 8 with (X0:T , C, Z0:T ) collected in the E step. This is equivalent to solve the MLE problem: maxϑ EX0:T DE,C Pψ( |X0:T ),Z0:T Pω( |X0:T ,C) [log Pϑ(X0:T , Z0:T |C)], which is to maximize a lower bound of the original objective, i.e., EX0:T DE [log Pϑ(X0:T )], as shown in the last step of Equation 30. Thus, the original objective can be optimized through this EM procedure. Note that the second term in the last step is a function of the old parameter ϑ so that it can be overlooked when optimizing with respect to ϑ. Multi-task Hierarchical Adversarial Inverse Reinforcement Learning C.4. State-only Adaption of H-AIRL In AIRL (Fu et al., 2017), they propose a two-component design for the discriminator as follows: fϑ,ζ(St, St+1) = gϑ(St) + γhζ(St+1) hζ(St) (31) where γ is the discount factor in MDP. Based on fϑ,ζ(St, St+1), they can further get Dϑ,ζ(St, St+1) which is used in Equation 2 for AIRL training. As proved in (Fu et al., 2017), gϑ, hζ and fϑ,ζ can recover the true reward, value and advantage function, respectively, under deterministic environments with a state-only ground truth reward. With this state-only design, the recovered reward function is disentangled from the dynamics of the environment in which it was trained, so that it can be directly transferred to environments with different transition dynamics, i.e., P, for the policy training. Moreover, the additional shaping term hζ helps mitigate the effects of unwanted shaping on the reward approximator gϑ (Ng et al., 1999). This design can also be adopted to H-AIRL (Equation 8) by redefining Equation 31 on the extended state space (first defined in Section 4.2): fϑ,ζ(e St, e St+1|C) = gϑ(e St|C) + γhζ(e St+1|C) hζ(e St|C) = gϑ(St, Zt|C) + γhζ(St+1, Zt+1|C) hζ(St, Zt|C) (32) In this way, we can recover a hierarchical reward function conditioned on the task context C, i.e., gϑ(St, Zt|C), which avoids unwanted shaping and is robust enough to be directly applied in a new task with different dynamic transition distribution from prior(C). The proof can be done by simply replacing the state S in the original proof (Appendix C of (Fu et al., 2017)) with its extended definition e S, so we don t repeat it here. D. The Proposed Actor-Critic Algorithm for Training D.1. Gradients of the Mutual Information Objective Term The objective function related to the mutual information: C prior(C) X X0:T ,Z0:T P(X0:T , Z0:T |C) log Pψ(C|X0:T ) (33) After introducing the one-step Markov assumption to Equation 24, we can calculate P(X0:T , Z0:T |C) as Equation 34, where πθ and πϕ represent the hierarchical policy in the one-step option framework. P(X0:T , Z0:T |C) = µ(S0|C) t=1 πθ(Zt|St 1, Zt 1, C)πϕ(At 1|St 1, Zt, C)P(St|St 1, At 1, C) (34) First, the gradient with respect to ψ is straightforward as Equation 35, which can be optimized as a standard likelihood maximization problem. C prior(C) X X0:T ,Z0:T P(X0:T , Z0:T |C) ψ log Pψ(C|X0:T ) (35) Now we give out the derivation of θLMI: C prior(C) X X0:T ,Z0:T θPθ,ϕ(X0:T , Z0:T |C) log Pψ(C|X0:T ) C prior(C) X X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) θ log Pθ,ϕ(X0:T , Z0:T |C) log Pψ(C|X0:T ) = E C,X0:T , Z0:T [ θ log Pθ,ϕ(X0:T , Z0:T |C) log Pψ(C|X0:T )] = E C,X0:T , Z0:T t=1 θ log πθ(Zt|St 1, Zt 1, C) log Pψ(C|X0:T ) where the last equality holds because of Equation 34. 
With similar derivation as above, we have: ϕLMI = E C,X0:T , Z0:T t=1 ϕ log πϕ(At 1|St 1, Zt, C) log Pψ(C|X0:T ) Multi-task Hierarchical Adversarial Inverse Reinforcement Learning D.2. Gradients of the Directed Information Objective Term Next, we give out the derivation of the gradients related to the directed information objective term, i.e., LDI. We denote the two terms in Equation 21 as LDI 1 and LDI 2 respectively. Then, we have θ,ϕLDI = θ,ϕLDI 1 + θ,ϕLDI 2 . The derivations are as follows: C prior(C) X X0:t,Z0:t θPθ,ϕ(X0:t, Z0:t|C) log Pω(Zt|X0:t, Z0:t 1, C) C prior(C) X X0:t,Z0:t Pθ,ϕ(X0:t, Z0:t|C) i=1 θ log πθ(Zi|Si 1, Zi 1, C) log P t ω C prior(C) X Xt+1:T , Zt+1:T Pθ,ϕ(X0:T , Z0:T |C) i=1 θ log πθ(Zi|Si 1, Zi 1, C) log P t ω C prior(C) X X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) i=1 θ log πθ(Zi|Si 1, Zi 1, C) log P t ω C prior(C) X X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) t=1 log P t ω i=1 θ log πθ(Zi|Si 1, Zi 1, C) C prior(C) X X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) i=1 θ log πθ(Zi|Si 1, Zi 1, C) t=i log P t ω = E C,X0:T , Z0:T i=1 θ log πθ(Zi|Si 1, Zi 1, C) t=i log Pω(Zt|X0:t, Z0:t 1, C) = E C,X0:T , Z0:T t=1 θ log πθ(Zt|St 1, Zt 1, C) i=t log Pω(Zi|X0:i, Z0:i 1, C) where P t ω = Pω(Zt|X0:t, Z0:t 1, C) for simplicity. The second equality in Equation 38 holds following the same derivation in Equation 36. Then, the gradient related to LDI 2 is: t=1 H(Zt|X0:t 1, Z0:t 1, C) C prior(C) X X0:t 1,Z0:t Pθ,ϕ(X0:t 1, Z0:t|C) log P(Zt|X0:t 1, Z0:t 1, C)] C prior(C) X X0:t 1,Z0:t Pθ,ϕ(X0:t 1, Z0:t|C) log πθ(Zt|St 1, Zt 1, C)] C prior(C) X X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) t=1 log πθ(Zt|St 1, Zt 1, C)] C prior(C) X X0:T ,Z0:T θPθ,ϕ(X0:T , Z0:T |C) t=1 log πθ(Zt|St 1, Zt 1, C)+ C prior(C) X X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) t=1 θ log πθ(Zt|St 1, Zt 1, C)] Multi-task Hierarchical Adversarial Inverse Reinforcement Learning = E C,X0:T , Z0:T t=1 θ log πθ(Zt|St 1, Zt 1, C) i=1 log πθ(Zi|Si 1, Zi 1, C) + 1 = E C,X0:T , Z0:T t=1 θ log πθ(Zt|St 1, Zt 1, C) i=t log πθ(Zi|Si 1, Zi 1, C) The third equality holds because we adopt the one-step Markov assumption, i.e., the conditional probability distribution of a random variable depends only on its parent nodes in the probabilistic graphical model (shown as Figure 1). The fourth equality holds out of similar derivation as steps 2-4 in Equation 38. The last equality can be obtained with Equation 46 in the next section, where we prove that any term which is from PT i=1 log πθ(Zi|Si 1, Zi 1, C) + 1 and not a function of Zt will not influence the gradient calculation in Equation 39 and 40. With similar derivations, we have: ϕLDI 1 = E C,X0:T , Z0:T t=1 ϕ log πϕ(At 1|St 1, Zt, C) i=t log Pω(Zi|X0:i, Z0:i 1, C) ϕLDI 2 = E C,X0:T , Z0:T t=1 ϕ log πϕ(At 1|St 1, Zt, C) i=t log πθ(Zi|Si 1, Zi 1, C) As for the gradient with respect to ω, it can be computed with: ωLDI = ωLDI 1 = C prior(C) X X0:t,Z0:t Pθ,ϕ(X0:t, Z0:t|C) ω log Pω(Zt|X0:t, Z0:t 1, C) (43) Still, for each timestep t, it s a standard likelihood maximization problem and can be optimized through SGD. D.3. 
Gradients of the Imitation Learning Objective Term We consider the imitation learning objective term LIL, i.e., the trajectory return shown as: C prior(C) X X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) i=0 RIL(Si, Zi, Zi+1, Ai|C) (44) Following the similar derivation with Equation 36, we can get: θLIL = E C,X0:T , Z0:T t=1 θ log πθ(Zt|St 1, Zt 1, C) i=0 RIL(Si, Zi, Zi+1, Ai|C) Further, we note that for each t {1, , T}, i < t 1, we have: E C,X0:T , Z0:T [ θ log πθ(Zt|St 1, Zt 1, C)RIL(Si, Zi, Zi+1, Ai|C)] C prior(C) X X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) θ log πθ(Zt|St 1, Zt 1, C)RIL(Si, Zi, Zi+1, Ai|C) C prior(C) X X0:t 1, Z0:t Xt:T , Zt+1:T Pθ,ϕ(X0:T , Z0:T |C) θ log πθ(Zt|St 1, Zt 1, C)Ri IL C prior(C) X X0:t 1, Z0:t Pθ,ϕ(X0:t 1, Z0:t|C) θ log πθ(Zt|St 1, Zt 1, C)Ri IL Multi-task Hierarchical Adversarial Inverse Reinforcement Learning C prior(C) X X0:t 1, Z0:t 1 Pθ,ϕ(X0:t 1, Z0:t 1|C) X Zt πθ(Zt|St 1, Zt 1, C) θ log πθ(Zt|St 1, Zt 1, C)Ri IL C prior(C) X X0:t 1, Z0:t 1 Pθ,ϕ(X0:t 1, Z0:t 1|C)Ri IL X Zt πθ(Zt|St 1, Zt 1, C) θ log πθ(Zt|St 1, Zt 1, C) C prior(C) X X0:t 1, Z0:t 1 Pθ,ϕ(X0:t 1, Z0:t 1|C)Ri IL X Zt θπθ(Zt|St 1, Zt 1, C) C prior(C) X X0:t 1, Z0:t 1 Pθ,ϕ(X0:t 1, Z0:t 1|C)Ri IL θ X Zt πθ(Zt|St 1, Zt 1, C) C prior(C) X X0:t 1, Z0:t 1 Pθ,ϕ(X0:t 1, Z0:t 1|C)RIL(Si, Zi, Zi+1, Ai|C) θ1 = 0 where Ri IL = RIL(Si, Zi, Zi+1, Ai|C) for simplicity. We use the law of total probability in the third equality, which we also use in the later derivations. The fifth equality holds because i < t 1 and RIL(Si, Zi, Zi+1, Ai|C) is irrelevant to Zt. Based on Equation 45 and 46, we have: θLIL = E C,X0:T , Z0:T t=1 θ log πθ(Zt|St 1, Zt 1, C) i=t 1 RIL(Si, Zi, Zi+1, Ai|C) With similar derivations, we can obtain: ϕLIL = E C,X0:T , Z0:T t=1 ϕ log πϕ(At 1|St 1, Zt, C) i=t 1 RIL(Si, Zi, Zi+1, Ai|C) D.4. The Overall Unbiased Gradient Estimator To sum up, the gradients with respect to θ and ϕ can be computed with θ,ϕL = θ,ϕ(α1LMI + α2LDI + α3LIL), where α1:3 > 0 are the weights for each objective term and fine-tuned as hyperparameters. Combining Equation (36, 38, 39, 48) and Equation (37, 41, 42, 49), we have the actor-critic learning framework shown as Equation 11, except for the baseline terms, bhigh and blow. Further, we claim that Equation 11 provides unbiased estimation of the gradients with respect to θ and ϕ. 
We proof this by showing that E h PT t=1 θ log πt θbhigh(St 1, Zt 1|C) i = E h PT t=1 ϕ log πt ϕblow(St 1, Zt|C) i = 0, as follows: E C,X0:T , Z0:T t=1 θ log πθ(Zt|St 1, Zt 1, C)bhigh(St 1, Zt 1|C) C prior(C) X X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) t=1 θ log πθ(Zt|St 1, Zt 1, C)bhigh(St 1, Zt 1|C) X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) θ log πθ(Zt|St 1, Zt 1, C)bhigh(St 1, Zt 1|C) X0:t 1,Z0:t Pθ,ϕ(X0:t 1, Z0:t|C) θ log πθ(Zt|St 1, Zt 1, C)bhigh(St 1, Zt 1|C) Multi-task Hierarchical Adversarial Inverse Reinforcement Learning X X0:t 1,Z0:t Pθ,ϕ(X0:t 1, Z0:t|C) θ log πθ(Zt|St 1, Zt 1, C)bhigh(St 1, Zt 1|C) X0:t 1, Z0:t 1 Pθ,ϕ(X0:t 1, Z0:t 1|C) X Zt πθ(Zt|St 1, Zt 1, C) θ log πθ(Zt|St 1, Zt 1, C)bhigh(St 1, Zt 1|C) X0:t 1,Z0:t 1 Pθ,ϕ(X0:t 1, Z0:t 1|C)bhigh(St 1, Zt 1|C) X Zt θπθ(Zt|St 1, Zt 1, C) X0:t 1,Z0:t 1 Pθ,ϕ(X0:t 1, Z0:t 1|C)bhigh(St 1, Zt 1|C) θ1 = 0 E C,X0:T , Z0:T t=1 ϕ log πϕ(At 1|St 1, Zt, C)blow(St 1, Zt|C) C prior(C) X X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) t=1 ϕ log πϕ(At 1|St 1, Zt, C)blow(St 1, Zt|C) X0:T ,Z0:T Pθ,ϕ(X0:T , Z0:T |C) ϕ log πϕ(At 1|St 1, Zt, C)blow(St 1, Zt|C) X0:t,Z0:t Pθ,ϕ(X0:t, Z0:t|C) ϕ log πϕ(At 1|St 1, Zt, C)blow(St 1, Zt|C) X0:t 1,Z0:t Pθ,ϕ(X0:t 1, Z0:t|C) X Xt Pϕ(Xt|X0:t 1, Z0:t, C) ϕ log πϕ(At 1|St 1, Zt, C)blow(St 1, Zt|C) X0:t 1,Z0:t Pθ,ϕ(X0:t 1, Z0:t|C) X At 1 πϕ(At 1|St 1, Zt, C) ϕ log πϕ(At 1|St 1, Zt, C)blow(St 1, Zt|C) X St P(St|St 1, At 1, C) X0:t 1,Z0:t Pθ,ϕ(X0:t 1, Z0:t|C)blow(St 1, Zt|C) X At 1 πϕ(At 1|St 1, Zt, C) ϕ log πϕ(At 1|St 1, Zt, C) X0:t 1,Z0:t Pθ,ϕ(X0:t 1, Z0:t|C)blow(St 1, Zt|C) X At 1 ϕπϕ(At 1|St 1, Zt, C) X0:t 1,Z0:t Pθ,ϕ(X0:t 1, Z0:t|C)blow(St 1, Zt|C) ϕ X At 1 πϕ(At 1|St 1, Zt, C) X0:t 1, Z0:t Pθ,ϕ(X0:t 1, Z0:t|C)blow(St 1, Zt|C) ϕ1 = 0 Multi-task Hierarchical Adversarial Inverse Reinforcement Learning Algorithm 1 Multi-task Hierarchical Adversarial Inverse Reinforcement Learning (MH-AIRL) 1: Input: Prior distribution of the task variable prior(C), expert demonstrations {XE 0:T } (If the task or option annotations, i.e., {CE} or {ZE 0:T }, are provided, the corresponding estimation in Step 6 is not required.) 2: Initialize the hierarchical policy πθ and πϕ, discriminator fϑ, posteriors for the task context Pψ and option choice Pω 3: for each training episode do 4: Generate M trajectories {(C, X0:T , Z0:T )} by sampling the task C prior( ) and then exploring it with πθ and πϕ 5: Update Pψ and Pω by minimizing LMI and LDI (Eq. 9) using SGD with {(C, X0:T , Z0:T )} 6: Estimate the expert global and local latent codes with Pψ and Pω, i.e., CE Pψ( |XE 0:T ), ZE 0:T Pω( |XE 0:T , CE) 7: Update fϑ by minimizing the cross entropy loss in Eq. 8 based on {(C, X0:T , Z0:T )} and {(CE, XE 0:T , ZE 0:T )} 8: Train πθ and πϕ by HPPO, i.e., Eq. 11, based on {(C, X0:T , Z0:T )} and fϑ which defines Dϑ and RIL 9: end for D.5. Illustrations of Interactions among Networks in MH-AIRL Figure 5. Interactions among the five networks in our learning system. There are in total five networks to learn in our system: the high-level policy πθ, low-level policy πϕ, discriminator fϑ, variational posteriors for the task context Pψ and option context Pω. Algorithm 1 shows in details how to coordinate their training process. To be more intuitive, we provide Figure 5 for illustrating the interactions among them. 
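To summarize how the five networks interact, the skeleton below mirrors Algorithm 1 at a high level. Every helper method used here (`rollout`, `update`, `infer`, `reward`, etc.) is an assumed interface introduced purely for illustration, and the real training loop involves additional details (mini-batching, PPO epochs, baselines) omitted in this sketch.

```python
def train_mh_airl(prior, policy, discriminator, post_task, post_option,
                  expert_trajs, num_episodes, num_rollouts, alpha=(1.0, 1.0, 1.0)):
    """Skeleton of Algorithm 1 (MH-AIRL); all helpers are assumed interfaces."""
    a1, a2, a3 = alpha
    for _ in range(num_episodes):
        # Step 4: generate M trajectories (C, X_{0:T}, Z_{0:T}) with the hierarchical policy.
        batch = [policy.rollout(task=prior.sample()) for _ in range(num_rollouts)]

        # Step 5: update the posteriors P_psi and P_omega by minimizing L_MI and L_DI (Eq. 9).
        post_task.update(batch)
        post_option.update(batch)

        # Step 6 (E step): label the expert data with estimated global/local latent codes.
        labeled_expert = [(post_task.infer(x), post_option.infer(x), x) for x in expert_trajs]

        # Step 7: update the discriminator f_theta with the cross-entropy loss in Eq. 8.
        discriminator.update(generated=batch, expert=labeled_expert)

        # Step 8: combine the three reward streams and update the policy with HPPO (Eq. 11).
        rewards = [a1 * post_task.reward(traj)         # R_MI
                   + a2 * post_option.reward(traj)     # R_DI
                   + a3 * discriminator.reward(traj)   # R_IL
                   for traj in batch]
        policy.update(batch, rewards)
```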
$P_\psi$ and $P_\omega$ are trained with the trajectories (i.e., $\{(C, X_{0:T}, Z_{0:T})\}$) generated by the hierarchical policy $\pi_{\theta,\phi}$, and they provide the reward signals $R^{MI}_{0:T}$ and $R^{DI}_{0:T}$ for training $\pi_{\theta,\phi}$, which are defined as $\alpha_1 \log P_\psi(C|X_{0:T})$ and $\alpha_2 \log \frac{P_\omega(Z_i|X_i, Z_{i-1}, C)}{\pi_\theta(Z_i|S_{i-1}, Z_{i-1}, C)}$ ($i \in \{1, \dots, T\}$) in Equation 11, respectively. On the other hand, the discriminator $f_\vartheta$ is trained to distinguish the expert demonstrations $\{(C^E, X^E_{0:T}, Z^E_{0:T})\}$ from the generated samples $\{(C, X_{0:T}, Z_{0:T})\}$, where $C^E$ and $\{Z^E_{0:T}\}$ can be estimated with $P_\psi$ and $P_\omega$ if not provided. Then, the AIRL reward term $R^{IL}_{0:T}$ can be obtained based on the output of $f_\vartheta$. Finally, the hierarchical policy $\pi_{\theta,\phi}$ can be trained by maximizing the return defined with $R^{MI}_{0:T}$, $R^{DI}_{0:T}$, and $R^{IL}_{0:T}$ (i.e., Eq. 11).

E. Appendix on Evaluation Results

E.1. Plots of the Ablation Study

(a) HalfCheetah-MultiVel (b) Walker-RandParam (c) Ant-MultiGoal (d) Kitchen-MultiSeq

Figure 6. Comparison results of MH-AIRL with the ablated versions (MH-GAIL & H-AIRL) and SOTA Hierarchical Imitation Learning (HIL) baselines (Option-GAIL & DI-GAIL) on the four evaluation tasks. Our algorithm outperforms the baselines in all the tasks, especially in the more challenging ones (Ant & Kitchen). MH-GAIL performs better than the other baselines, which do not contain the multi-task learning component. H-AIRL, an ablation and HIL algorithm, has better performance than the other SOTA HIL baselines on the MuJoCo tasks.

E.2. Implementation Details of MH-GAIL

MH-GAIL is a variant of our algorithm obtained by replacing the AIRL component with GAIL. Similar to Section 4.2, we need to provide an extension of GAIL with the one-step option model in order to learn a hierarchical policy. The extension method follows Option-GAIL (Jing et al., 2021), which is one of our baselines. MH-GAIL also uses an adversarial learning framework that contains a discriminator $D_\vartheta$ and a hierarchical policy $\pi_{\theta,\phi}$, for which the objectives are as follows:
$$\max_{\vartheta}\ \mathbb{E}_{C \sim \mathrm{prior}(\cdot),\, (S, A, Z, Z') \sim \pi_E(\cdot|C)}\left[\log(1 - D_\vartheta(S, A, Z, Z'|C))\right] + \mathbb{E}_{C \sim \mathrm{prior}(\cdot),\, (S, A, Z, Z') \sim \pi_{\theta,\phi}(\cdot|C)}\left[\log D_\vartheta(S, A, Z, Z'|C)\right],$$
$$\max_{\theta,\phi} L_{IL} = \max_{\theta,\phi}\ \mathbb{E}_{C \sim \mathrm{prior}(\cdot),\, (X_{0:T}, Z_{0:T}) \sim \pi_{\theta,\phi}(\cdot|C)}\sum_{t=0}^{T-1} R^t_{IL}, \quad R^t_{IL} = -\log D_\vartheta(S_t, A_t, Z_{t+1}, Z_t|C),$$
where $(S, A, Z, Z')$ denotes $(S_t, A_t, Z_{t+1}, Z_t)$, $t \in \{0, \dots, T-1\}$. It can be observed that the definition of $R^t_{IL}$ has changed. Moreover, the discriminator $D_\vartheta$ in MH-GAIL is trained as a binary classifier to distinguish the expert demonstrations (labeled as 0) from the generated samples (labeled as 1), and it does not have a specially-designed structure like the discriminator $D_\vartheta$ in MH-AIRL, which is defined with $f_\vartheta$ and $\pi_{\theta,\phi}$, so it cannot recover the expert reward function.

E.3. Analysis of the Learned Hierarchical Policy on HalfCheetah-MultiVel and Walker-RandParam

First, we randomly select 6 task contexts for HalfCheetah-MultiVel and visualize the recovered hierarchical policy as the velocity change of each episode in Figure 7(a). It can be observed that the agent automatically discovers two options (Option 1: blue, Option 2: orange) and adopts Option 1 for the acceleration phase ($0 \rightarrow v/2$ or $0 \rightarrow v$) and Option 2 for the deceleration phase ($v/2 \rightarrow 0$). This shows that MH-AIRL can capture the compositional structure within the tasks very well and transfer the learned basic skills to boost multi-task policy learning.

(a) Results on HalfCheetah-MultiVel (b) Results on Walker-RandParam

Figure 7.
(a) Velocity change of the HalfCheetah agent in the test tasks with different goal velocities, where the agent adopts Option 1 (blue) and Option 2 (orange) when increasing and decreasing the speed, respectively. (b) For Walker-RandParam, the basic skills must adapt to the task setting, so the learning performance would drop without conditioning the low-level policy (i.e., the option) on the task context.

Second, we note that, in some circumstances, the basic skills need to be conditioned on the task context. For the MuJoCo MultiGoal/MultiVel tasks, the basic skills (e.g., Option 2: decreasing the velocity) can be directly transferred among the tasks in the class, and the agent only needs to adjust its high-level policy according to the task variable (e.g., adopting Option 2 when achieving v/2). However, for tasks like Walker-RandParam, the skills need to adapt to the tasks, since the mass of the agent changes and so do the control dynamics. As shown in Figure 7(b), the learning performance would drop without conditioning the low-level policy (i.e., the option) on the task context, i.e., for the ablation MH-AIRL-no-cnt.
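This design choice can be illustrated with a small sketch of the low-level policy input: the full model concatenates the task context to the state and option embedding, while the MH-AIRL-no-cnt ablation leaves it out. All dimensions, layer sizes, and names below are illustrative assumptions, not the actual network configuration.

```python
import torch
import torch.nn as nn

class LowLevelPolicy(nn.Module):
    """Sketch contrasting the two variants: with condition_on_task=True the
    option/skill can adapt to the task (e.g., to a changed agent mass), while
    condition_on_task=False corresponds to the MH-AIRL-no-cnt ablation."""

    def __init__(self, state_dim, embed_dim, context_dim, action_dim, condition_on_task=True):
        super().__init__()
        self.condition_on_task = condition_on_task
        in_dim = state_dim + embed_dim + (context_dim if condition_on_task else 0)
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.Tanh(), nn.Linear(128, action_dim))

    def forward(self, state, option_embed, task_context):
        parts = [state, option_embed]
        if self.condition_on_task:
            parts.append(task_context)               # skills conditioned on the task variable
        return self.net(torch.cat(parts, dim=-1))    # e.g., the mean of a Gaussian action
```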