# Adversarial Option-Aware Hierarchical Imitation Learning

Mingxuan Jing¹, Wenbing Huang¹, Fuchun Sun¹,², Xiaojian Ma³, Tao Kong⁴, Chuang Gan⁵, Lei Li⁴

¹Department of Computer Science and Technology, Tsinghua University, Beijing, China; ²THU-Bosch JCML Center; ³University of California, Los Angeles, USA; ⁴Bytedance AI Lab, Beijing, China; ⁵MIT-IBM Watson AI Lab, USA. Correspondence to: Fuchun Sun.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

It has been a challenge to learn skills for an agent from long-horizon, unannotated demonstrations. Existing approaches such as Hierarchical Imitation Learning (HIL) are prone to compounding errors or suboptimal solutions. In this paper, we propose Option-GAIL, a novel method for learning skills over long horizons. The key idea of Option-GAIL is to model the task hierarchy with options and to train the policy via generative adversarial optimization. In particular, we propose an Expectation-Maximization (EM)-style algorithm: an E-step that samples the options of the expert conditioned on the currently learned policy, and an M-step that updates the low- and high-level policies of the agent simultaneously to minimize the newly proposed option-occupancy measurement between the expert and the agent. We theoretically prove the convergence of the proposed algorithm. Experiments show that Option-GAIL consistently outperforms its counterparts across a variety of tasks.

1. Introduction

Hierarchical Imitation Learning (HIL) has shown promise in acquiring long-term skills directly from demonstrations (Le et al., 2018; Yu et al., 2018; Byrne & Russon, 1998; Sharma et al., 2019; Hamidi et al., 2015). It captures the sub-task decomposition inherent in long-horizon tasks, enjoying richer expressivity for characterizing complex decision-making than canonical Imitation Learning (IL). In general, HIL designs micro-policies that accomplish the specific control of each sub-task, and employs a macro-policy that schedules the switching among micro-policies. Such a two-level decision-making process is usually formulated by an Option model (Sutton et al., 1999) or a goal-based framework (Le et al., 2018). Although some works (Fei et al., 2020) have assumed the help of sub-task segmentation annotations, this paper focuses on learning from unsegmented demonstrations for broader practicality.

Plenty of HIL methods have been proposed, including Behavior-Cloning (BC)-based and Inverse-Reinforcement-Learning (IRL)-based approaches. In the first avenue, Zhang & Paschalidis (2020) and Le et al. (2018) customize BC to hierarchical modeling (Daniel et al., 2016), which, unfortunately, is prone to compounding errors (Ross et al., 2011) when demonstrations are limited in practice. In contrast, IRL has the potential to avoid compounding errors through agent self-exploration. However, addressing IRL upon the Option model is by no means trivial, considering the temporal coupling of the high-level and low-level policies. The works by Sharma et al. (2018) and Lee & Seo (2020) relax this challenge by training the two-level policies separately. Nevertheless, this two-stage approach potentially leads to sub-optimal solutions in the absence of training collaboration between the two stages.
To overcome the aforementioned issues, this work proposes Option-GAIL, a novel hierarchical IRL framework that is theoretically elegant, free of compounding errors, and end-to-end trainable. Basically, it is built upon GAIL (Ho & Ermon, 2016) with two key enhancements: 1) replacing the Markov Decision Process (MDP) with the Option model; 2) augmenting the occupancy measurement matching with options. Note that GAIL is a popular IL method that casts policy learning upon an MDP into an occupancy measurement matching problem. It is therefore natural to replace the MDP with the Option model for hierarchical modeling. Besides, in GAIL, mimicking an expert policy is equivalent to matching its occupancy measurement. This equivalence holds when the policy corresponds one-to-one to its induced occupancy measurement, which, however, is not valid in our hierarchical context. As we will show in the paper, the policy of HIL is related to options, and the one-to-one correspondence only exists with regard to the option-extended occupancy measurement rather than the traditional occupancy measurement without options. Hence, it is indispensable to involve options in the matching goal, which is our second enhancement. Notably, the option switching is inherently constrained in our model, without extra regularizers such as the mutual information used in Sharma et al. (2018) to encourage the correlation between state-action pairs and options.

Training Option-GAIL is not straightforward, especially when the expert options are unavailable. We thereby propose an EM-like learning algorithm to remedy this training difficulty. Specifically, in the E-step, we apply a Viterbi method (Viterbi, 1967) to infer the expert options conditioned on the agent policy and the expert states/actions. In the M-step, with the expert options provided, we optimize both the high- and low-level policies to minimize the discrepancy between the option-occupancy measurements of the expert and the agent. More specifically, the M-step is implemented as a min-max game: maximizing a discriminator to estimate the discrepancy, and updating the policy to minimize the discriminative cost via a hierarchical RL method (Zhang & Whiteson, 2019). Notably, the convergence of the proposed EM algorithm is theoretically guaranteed if certain mild conditions hold.

In summary, our main contributions include:

- We propose Option-GAIL, a theoretically-grounded framework that integrates hierarchical modeling with option-occupancy measurement matching. It is proved that Option-GAIL ensures the discrepancy minimization between the policies of the demonstrator and the imitator.
- We propose an EM-like algorithm to train the parameters of Option-GAIL end-to-end. This method alternates option inference and policy updating, and is proved to converge eventually.
- We evaluate our proposed method on several robotic locomotion and manipulation tasks against state-of-the-art HIL/IL baselines. The results demonstrate that our approach attains both dramatically faster convergence and better final performance than the counterparts. A complete set of ablation studies further verifies the validity of each component we propose.

2. Related Works

There have been plenty of works on HIL, which, in general, can be categorized into two classes: Hierarchical Behavior Cloning (H-BC) and Hierarchical Inverse Reinforcement Learning (H-IRL).
Hierarchical BC. In this avenue, HIL is an extension of Behavior Cloning (BC) (Pomerleau, 1988; Atkeson & Schaal, 1997). As a result, the missing sub-task information needs to be provided or inferred to ensure the learnability of hierarchical policies. For example, Le et al. (2018) require predefined sub-goals when learning goal-based policies; Kipf et al. (2019) try to alleviate the requirement of sub-task information by formulating policy learning as an unsupervised trajectory soft-segmentation problem; Daniel et al. (2016) and Zhang & Paschalidis (2020) employ the Baum-Welch algorithm (Baum et al., 1972) upon the option framework to infer the latent options from demonstrations, and directly optimize both the high- and low-level policies. Despite its easy implementation, behavior cloning is vulnerable to compounding errors (Ross et al., 2011) when demonstrations are limited, while our method enjoys better sample efficiency thanks to the IRL formulation.

Hierarchical IRL. In contrast to the BC-based approaches, Hierarchical IRL avoids compounding errors by taking advantage of the agent's self-exploration and reward reconstruction. Following Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016), the works by Sharma et al. (2018) and Lee & Seo (2020) realize HIL by introducing a regularizer into the original IRL problem and maximizing the directed information between generated trajectories and options. However, the high- and low-level policies are trained separately in these approaches: the high-level policy is learned with behavior cloning and remains fixed during the GAIL-based low-level policy learning. As we report in the experiments, such a two-stage paradigm could lead to potentially poor convergence compared to our end-to-end training fashion. Henderson et al. (2018) propose an end-to-end HIL method without pre-training. However, it adopts a Mixture-of-Experts (MoE) strategy rather than the canonical option framework. Therefore, an option is exclusively determined by the corresponding state, ignoring its relation to the option of the previous step (Figure 1). On the contrary, our method conducts option inference and policy optimization that are strictly amenable to the option dynamics (Sutton et al., 1999), thus delineating the hierarchy of sub-tasks in a more holistic manner.

3. Preliminaries

This section briefly introduces the preliminary knowledge and notation used in our work.

Markov Decision Process: A Markov Decision Process (MDP) is a 6-element tuple $\langle S, A, P^a_{s,s'}, R^a_s, \mu_0, \gamma \rangle$, where $(S, A)$ denote the continuous/discrete state-action spaces; $P^a_{s,s'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the transition probability of the next state $s' \in S$ given the current state $s \in S$ and action $a \in A$, determined by the environment; $R^a_s = \mathbb{E}[r_t \mid s_t = s, a_t = a]$ returns the expected reward from the task when taking action $a$ in state $s$; $\mu_0(s)$ denotes the initial state distribution and $\gamma \in [0, 1)$ is a discounting factor. The effectiveness of a policy $\pi(a|s)$ is evaluated by its expected infinite-horizon reward: $\eta(\pi) = \mathbb{E}_{s_0 \sim \mu_0,\, a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim P^{a_t}_{s_t,\cdot}}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$.
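As a small, concrete companion to the formula above, the sketch below (our own illustration, not code from the paper) estimates $\eta(\pi)$ by Monte-Carlo rollouts; a classic Gym-style `env`/`policy` interface is assumed.

```python
import numpy as np

def estimate_eta(env, policy, gamma=0.99, n_rollouts=10, max_steps=1000):
    """Monte-Carlo estimate of eta(pi) = E[sum_t gamma^t * r_t].

    Assumes a classic Gym-style API: env.reset() -> s and
    env.step(a) -> (s', r, done, info); policy(s) returns an action.
    """
    returns = []
    for _ in range(n_rollouts):
        s = env.reset()
        ret, discount = 0.0, 1.0
        for _ in range(max_steps):
            a = policy(s)
            s, r, done, _ = env.step(a)
            ret += discount * r
            discount *= gamma
            if done:
                break
        returns.append(ret)
    return float(np.mean(returns))
```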
The Option Framework: Options $O = \{1, \ldots, K\}$ are introduced for modeling the policy-switching procedure in long-term tasks (Sutton et al., 1999). An option model is defined as a tuple $\mathcal{O} = \langle S, A, \{I_o, \pi_o, \beta_o\}_{o \in O}, \pi_O(o|s), P^a_{s,s'}, R^a_s, \mu_0, \gamma \rangle$, where $S, A, P^a_{s,s'}, R^a_s, \mu_0, \gamma$ are defined the same as in the MDP; $I_o \subseteq S$ denotes an initial set of states from which an option can be activated; $\beta_o(s) = P(b = 1 \mid s)$ is a termination function that decides whether the current option should be terminated in a given state $s$; $\pi_o(a|s)$ is the intra-option policy that determines an action for a given state within an option $o$; a new option is activated in the call-and-return style by an inter-option policy $\pi_O(o|s)$ once the last option terminates.

Generative Adversarial Imitation Learning (GAIL): For simplicity, we denote $(x_0, \ldots, x_n)$ as $x_{0:n}$ throughout this paper. Given expert demonstrations on a specified task, $D_{\text{demo}} = \{\tau_E = (s_{0:T}, a_{0:T})\}$, imitation learning aims at finding a policy $\pi$ that best reproduces the expert's behavior, without access to the real reward. Ho & Ermon (2016) and Ghasemipour et al. (2020) cast the original maximum-entropy inverse reinforcement learning problem into an occupancy measurement (Syed et al., 2008) matching problem:

$$\arg\min_\pi D_f\big(\rho_\pi(s, a) \,\|\, \rho_{\pi_E}(s, a)\big), \quad (1)$$

where $D_f$ computes an f-divergence, and $\rho_\pi(s, a)$ and $\rho_{\pi_E}(s, a)$ are the occupancy measurements of the agent and the expert, respectively. By introducing a generative adversarial structure (Goodfellow et al., 2014), GAIL minimizes the discrepancy by alternately optimizing the policy and the estimated discrepancy. To be specific, a discriminator $D_\theta(s, a): S \times A \mapsto (0, 1)$ parameterized by $\theta$ is maximized, and then the policy is updated to minimize the overall discrepancy along each exploration trajectory. Such an optimization process can be cast into $\max_\theta \min_\pi \mathbb{E}_\pi[\log D_\theta(s, a)] + \mathbb{E}_{\pi_E}[\log(1 - D_\theta(s, a))]$.

4. Our Framework: Option-GAIL

In this section, we provide the details of the proposed Option-GAIL. Our goal is imitation learning from demonstrations of long-horizon tasks consisting of small sub-tasks. We first introduce the necessary assumptions to facilitate our analyses, and follow up by characterizing the motivation and formulation of Option-GAIL. We then provide the EM-like algorithm towards the solution and conclude this section with theoretical analyses on the convergence of our algorithm.

4.1. Assumptions and Definitions

Figure 1. The probabilistic graph of the one-step option model. The shaded region marks the group of nodes that induce our option-occupancy measurement.

We have introduced the full option framework for modeling the switching procedure in hierarchical tasks in Section 3. However, it is inconvenient to directly learn the policy of this framework due to the difficulty of determining the initial set $I_o$ and the break condition $\beta_o$. Here, we introduce the following assumption, given which the original option framework can be equivalently converted into the one-step option model that is easier to deal with.

Assumption 1 (Valid Option Switching) We assume that: 1) each state is switchable: $I_o = S, \forall o \in O$; 2) each switching is effective: $P(o_t = o_{t-1} \mid \beta_{o_{t-1}}(s_{t-1}) = 1) = 0$.

Assumption 1 asserts that each state is switchable for each option, and once a switch happens, it switches to a different option with probability 1. Such an assumption usually holds in practice without sacrificing model expressiveness. Now, we define the following one-step model.

Definition 1 (One-step Option) We define $\mathcal{O}_{\text{one-step}} = \langle S, A, O^+, \pi_H, \pi_L, P^a_{s,s'}, R^a_s, \mu_0, \gamma \rangle$, where $O^+ = O \cup \{\#\}$ consists of all options plus a special initial option class satisfying $o_{-1} \equiv \#$, $\beta_\#(s) \equiv 1$. Besides, $\mu_0(s, o) \doteq \mu_0(s)\, 1_{o=\#}$, where $1_{x=y}$ is the indicator function, equal to 1 iff $x = y$ and 0 otherwise. The high- and low-level policies are defined as:

$$\pi_H(o \mid s, o') \doteq \beta_{o'}(s)\, \pi_O(o \mid s) + \big(1 - \beta_{o'}(s)\big)\, 1_{o=o'}, \qquad \pi_L(a \mid s, o) \doteq \pi_o(a \mid s). \quad (2)$$
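To make Equation 2 concrete, the following is a minimal sketch (our own illustration under assumed tabular interfaces, not the authors' code) of how the one-step high-level policy $\pi_H$ could be sampled from a termination function $\beta$ and an inter-option policy $\pi_O$.

```python
import numpy as np

def sample_high_level(rng, s, o_prev, beta, pi_O, init_option="#"):
    """Sample o_t ~ pi_H(. | s_t, o_{t-1}) as defined in Equation 2.

    beta(o, s)  -> termination probability of option o in state s
    pi_O(s)     -> probability vector over the K options (inter-option policy)
    init_option -> the special '#' symbol used before the first step
    """
    # The initial option '#' always terminates (beta_# = 1), so a fresh option
    # is drawn from pi_O at the very first step.
    terminate = True if o_prev == init_option else rng.random() < beta(o_prev, s)
    if terminate:
        probs = pi_O(s)                        # switch: draw a new option
        return int(rng.choice(len(probs), p=probs))
    return o_prev                              # otherwise keep the current option


# Toy usage with K = 3 options (illustrative values only):
rng = np.random.default_rng(0)
beta = lambda o, s: 0.1                        # constant termination probability
pi_O = lambda s: np.array([0.5, 0.3, 0.2])     # state-independent for the toy example
o0 = sample_high_level(rng, np.zeros(2), "#", beta, pi_O)
```

Note that the "effective switching" part of Assumption 1 is a property of the supplied $\pi_O$; the sketch itself does not enforce it.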
It can be derived that $\mathcal{O}_{\text{one-step}} = \mathcal{O}$ under Assumption 1 (the proofs of all theorems are provided in the Appendix). This equivalence is beneficial, as we can characterize the switching behavior by only looking at $\pi_H$ and $\pi_L$, without the need to specify the exact beginning/breaking condition of each option. We denote $\bar\pi \doteq (\pi_H, \pi_L)$ and $\bar\Pi = \{\bar\pi\}$ as the set of stationary policies. We will employ the one-step option in the remainder of our paper. Note that a concurrent paper (Li et al., 2021) provides rigorous notions and formulations that further discuss the relationship between the full option model and the one-step option model.

We also provide the definition of the option-occupancy measurement by borrowing the notion of the occupancy measurement $\rho(s, a)$ from the MDP setting (Syed et al., 2008).

Definition 2 (Option-Occupancy Measurement) We define the option-occupancy measurement as $\rho_{\bar\pi}(s, a, o, o') \doteq \mathbb{E}_{\bar\pi}\big[\sum_{t=0}^{\infty} \gamma^t 1_{(s_t=s,\, a_t=a,\, o_t=o,\, o_{t-1}=o')}\big]$.

The measurement $\rho_{\bar\pi}(s, a, o, o')$ can be interpreted as the (discounted) distribution of the state-action-option tuples generated by the policies $\pi_H$ and $\pi_L$ under given $\mu_0$ and $P^a_{s,s'}$. According to the Bellman flow constraint, one can easily obtain that $\rho_{\bar\pi}(s, a, o, o')$ belongs to the feasible set of the affine constraint $\mathcal{D} = \{\rho(s, a, o, o') \geq 0;\ \sum_{a,o} \rho(s, a, o, o') = \mu_0(s, o') + \gamma \sum_{s', a', o''} P^{a'}_{s', s}\, \rho(s', a', o', o'')\}$.

4.2. Problem Formulation

Now we focus on the imitation task with long-term demonstrations. Note that GAIL is no longer suitable for this scenario, since it is hard to capture the hierarchy of sub-tasks with an MDP. Upon GAIL, a natural idea is to model the long-term tasks via the one-step Option model instead of the MDP, and to learn the policy by minimizing the discrepancy of the occupancy measurements between expert and agent, namely,

$$\min_{\bar\pi} D_f\big(\rho_{\bar\pi}(s, a) \,\|\, \rho_{\bar\pi_E}(s, a)\big). \quad (3)$$

While this idea is straightforward and has never been explored before, we claim that it causes solution ambiguity: we cannot be sure that $\bar\pi = \bar\pi_E$ even if we obtain the optimal solution $\rho_{\bar\pi}(s, a) = \rho_{\bar\pi_E}(s, a)$ of Equation 3. Intuitively, for hierarchical tasks, the action we take depends not only on the current state we observe but also on the option we have selected. By Definition 1, the hierarchical policy involves the current state, current action, last-step option and current option, which motivates us to leverage the option-occupancy measurement instead of the conventional occupancy measurement to depict the discrepancy between expert and agent. Actually, we have a one-to-one correspondence between $\bar\Pi$ and $\mathcal{D}$.

Theorem 1 (Bijection) Each $\rho \in \mathcal{D}$ is the option-occupancy measurement of the following policy:

$$\pi_H(o \mid s, o') = \frac{\sum_a \rho(s, a, o, o')}{\sum_{a,o} \rho(s, a, o, o')}, \qquad \pi_L(a \mid s, o) = \frac{\sum_{o'} \rho(s, a, o, o')}{\sum_{a,o'} \rho(s, a, o, o')}, \quad (4)$$

and $\bar\pi = (\pi_H, \pi_L)$ is the only policy whose option-occupancy measurement is $\rho$.

With Theorem 1, optimizing the option policy is equivalent to optimizing its induced option-occupancy measurement, since $\rho_{\bar\pi}(s, a, o, o') = \rho_{\bar\pi_E}(s, a, o, o') \Leftrightarrow \bar\pi = \bar\pi_E$. Our hierarchical imitation learning problem becomes:

$$\min_{\bar\pi} D_f\big(\rho_{\bar\pi}(s, a, o, o') \,\|\, \rho_{\bar\pi_E}(s, a, o, o')\big). \quad (5)$$

Figure 2. Illustration of the different convergence properties of the optimization problems defined in Equation 3 and Equation 5. The options are not explicitly constrained by Equation 3.
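To make Definition 2 and the objective in Equation 5 more tangible, here is a minimal sketch (our own illustration, not from the paper) of an empirical, discounted-count estimate of the option-occupancy measurement from rollouts; discretized, hashable states and actions are an assumption made purely for illustration.

```python
from collections import defaultdict

def estimate_option_occupancy(trajectories, gamma=0.99):
    """Empirical (unnormalized) estimate of rho(s, a, o, o') from Definition 2.

    Each trajectory is a list of (s, a, o, o_prev) tuples, where o_prev is '#'
    at t = 0. States, actions and options are assumed hashable (e.g. discretized).
    """
    rho = defaultdict(float)
    for traj in trajectories:
        discount = 1.0
        for (s, a, o, o_prev) in traj:
            rho[(s, a, o, o_prev)] += discount
            discount *= gamma
    # Average over trajectories so the estimate approximates the expectation.
    n = max(len(trajectories), 1)
    return {k: v / n for k, v in rho.items()}
```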
Note that the optimization problem defined in Equation 5 implies that in Equation 3, but not vice versa: (1) since $\rho_{\bar\pi}(s, a) = \sum_{o,o'} \rho_{\bar\pi}(s, a, o, o')$, we can easily obtain $\rho_{\bar\pi}(s, a, o, o') = \rho_{\bar\pi_E}(s, a, o, o') \Rightarrow \rho_{\bar\pi}(s, a) = \rho_{\bar\pi_E}(s, a)$; (2) as $\mathbb{E}_{q(o,o')}\big[D_f\big(\rho_{\bar\pi}(s, a, o, o') \,\|\, \rho_{\bar\pi_E}(s, a, o, o')\big)\big] \geq \mathbb{E}_{q(o,o')}\big[D_f\big(\rho_{\bar\pi}(s, a) \,\|\, \rho_{\bar\pi_E}(s, a)\big)\big]$ for any option distribution $q(o, o')$, addressing the problem defined in Equation 5 addresses an upper bound of that defined in Equation 3. Indeed, we will show in Section 4.6 that decreasing the goal in Equation 5 also decreases that of Equation 3 under certain conditions. Figure 2 depicts the relationship between Equation 5 and Equation 3. Yet, it is nontrivial to tackle Equation 5, particularly owing to the unobserved options in the expert demonstrations. Here, we propose an EM-like algorithm to address it, which will be detailed in Sections 4.3 and 4.4.

4.3. Option-Occupancy Measurement Matching

In this section, we assume the expert options are given and train the hierarchical policy $\bar\pi$ to minimize the goal defined in Equation 5. We denote the option-extended expert demonstrations as $\bar D_{\text{demo}} = \{\bar\tau = (s_{0:T}, a_{0:T}, o_{-1:T})\}$. Inspired by GAIL (Ho & Ermon, 2016), rather than calculating the exact value of the option-occupancy measurement, we propose to estimate the discrepancy by virtue of adversarial learning. We define a parametric discriminator $D_\theta(s, a, o, o'): S \times A \times O \times O^+ \mapsto (0, 1)$. If specifying the f-divergence as the Jensen-Shannon divergence, Equation 5 can be converted into a min-max game:

$$\min_{\bar\pi} \max_\theta\ \mathbb{E}_{\bar D_{\text{demo}}}\big[\log\big(1 - D_\theta(s, a, o, o')\big)\big] + \mathbb{E}_{D_{\bar\pi}}\big[\log D_\theta(s, a, o, o')\big]. \quad (6)$$

The inner loop trains $D_\theta(s, a, o, o')$ with the expert demonstrations $\bar D_{\text{demo}}$ and the samples $D_{\bar\pi}$ generated by self-exploration. For the outer loop, a hierarchical reinforcement learning (HRL) method should be used to minimize the discrepancy:

$$\bar\pi = \arg\min_{\bar\pi}\ \mathbb{E}_{\bar\pi}\big[c(s, a, o, o')\big] - \lambda H(\bar\pi), \quad (7)$$

where $c(s, a, o, o') = \log D_\theta(s, a, o, o')$ and the causal entropy $H(\bar\pi) = \mathbb{E}_{\bar\pi}[-\log \pi_H - \log \pi_L]$ serves as a policy regularizer with $\lambda \in [0, 1]$. Notice that the cost function is related to the options, which is slightly different from many HRL problems with an option-agnostic cost/reward (Bacon et al., 2017; Zhang & Whiteson, 2019). To deal with this case, we reformulate the option-reward and optimize Equation 7 using a similar idea to Zhang & Whiteson (2019). DAC characterizes the option model as two-level MDPs. Here we borrow its theoretical result by re-interpreting the reward used by each MDP: for the high-level MDP, we define the state $s^H_t \doteq (s_t, o_{t-1})$, the action $a^H_t \doteq o_t$, and the reward $R^H((s, o'), o) \doteq \sum_a \pi_L(a \mid s, o)\, c(s, a, o, o')$. For the low-level MDP, we denote $s^L_t \doteq (s_t, o_t)$, $a^L_t \doteq a_t$, and $R^L((s, o), a) = \sum_{o'} c(s, a, o, o')\, p_{\pi_H}(o' \mid s, o)$ with the posterior probability $p_{\pi_H}(o' \mid s, o) = \pi_H(o \mid s, o')\, \frac{p(o' \mid s)}{p(o \mid s)}$. Other terms, including the initial state distributions $\mu^H_0$ and $\mu^L_0$ and the state-transition dynamics $P^H$ and $P^L$, are defined similarly to Zhang & Whiteson (2019). The HRL task in Equation 7 can thus be separated into two non-hierarchical ones with augmented MDPs $M^H = (S^H, A^H, P^H, R^H, \mu^H_0, \gamma)$ and $M^L = (S^L, A^L, P^L, R^L, \mu^L_0, \gamma)$, whose action decisions depend on $\pi_H$ and $\pi_L$, respectively. These two non-hierarchical problems can be solved alternately by typical reinforcement learning methods like PPO (Schulman et al., 2017), using the immediate imitation reward $r_t = -c(s_t, a_t, o_t, o_{t-1})$ in practice. By alternating the inner loop and the outer loop, we finally derive a $\bar\pi$ that addresses Equation 5.
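For illustration only, the sketch below (our own PyTorch-style code, not the authors' implementation; the network sizes and one-hot option encoding are assumptions) shows an option-aware discriminator for the inner loop of Equation 6 and the induced imitation reward $r_t = -\log D_\theta(s_t, a_t, o_t, o_{t-1})$ used in the outer loop.

```python
import torch
import torch.nn as nn

class OptionDiscriminator(nn.Module):
    """D_theta(s, a, o, o'): state-action pair plus one-hot current/previous
    options, mapped to a probability in (0, 1)."""
    def __init__(self, s_dim, a_dim, n_options, hidden=256):
        super().__init__()
        in_dim = s_dim + a_dim + 2 * (n_options + 1)  # +1 slot for the '#' option
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, o_onehot, o_prev_onehot):
        x = torch.cat([s, a, o_onehot, o_prev_onehot], dim=-1)
        return torch.sigmoid(self.net(x))

def discriminator_loss(D, agent_batch, expert_batch):
    """Inner loop of Equation 6, written as a loss to minimize over theta:
    D is pushed toward 1 on agent samples and toward 0 on expert samples,
    so the agent cost c = log D shrinks as the agent resembles the expert."""
    d_agent = D(*agent_batch)
    d_expert = D(*expert_batch)
    return -(torch.log(d_agent + 1e-8).mean()
             + torch.log(1.0 - d_expert + 1e-8).mean())

def imitation_reward(D, s, a, o_onehot, o_prev_onehot):
    """r_t = -c(s, a, o, o') = -log D_theta(s, a, o, o'), fed to the HRL learner."""
    with torch.no_grad():
        return -torch.log(D(s, a, o_onehot, o_prev_onehot) + 1e-8)
```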
4.4. Expert Option Inference

So far, we have supposed that the expert options are provided, which, however, is usually not the case in practice. It thus demands a method to infer the options from the observed data (states and actions). Basically, given a policy, the options are taken to be those that maximize the likelihood of the observed state-action sequence, according to the principle of Maximum Likelihood Estimation (MLE). We approximate the expert policy with the $\bar\pi$ learned by the method above. With states and actions observed, the option model degenerates to a Hidden Markov Model (HMM); therefore, the Viterbi method (Viterbi, 1967) is applicable for expert option inference. We call this method Option-Viterbi, as it specializes the Viterbi algorithm to the option model. Given the currently learned $\pi_H, \pi_L$ and $\tau = (s_{0:T}, a_{0:T}) \in D_{\text{demo}}$, Option-Viterbi generates the most probable values of $o_{-1:T}$ that maximize the likelihood of the whole trajectory:

$$\arg\max_{o_{-1:T}} P(s_{0:T}, a_{0:T}, o_{-1:T}) = \arg\max_{o_{0:T}} \log P(a_{0:T}, o_{0:T} \mid s_{0:T}, o_{-1} = \#) = \arg\max_{o_T} \hat\alpha_T(o_T).$$

Here, a maximum forward message $\hat\alpha_t(o_t)$ is introduced to indicate the maximum probability of the partial trajectory up to time $t$ given $o_t$, and it can be calculated recursively as in Equation 9:

$$\hat\alpha_t(o_t) = \max_{o_{0:t-1}} \log P(a_{0:t}, o_t \mid s_{0:t}, o_{0:t-1}, o_{-1} = \#) = \max_{o_{t-1}} \hat\alpha_{t-1}(o_{t-1}) + \log \pi_H(o_t \mid s_t, o_{t-1}) + \log \pi_L(a_t \mid s_t, o_t), \quad (9)$$

$$\hat\alpha_0(o_0) = \log \pi_L(a_0 \mid s_0, o_0) + \log \pi_H(o_0 \mid s_0, \#). \quad (10)$$

Clearly, Option-Viterbi has a computational complexity of $O(TK^2)$ for a $T$-step trajectory with $K$ options. By back-tracing the $o_{t-1}$ that induces the maximum of $\hat\alpha_t(o_t)$ at each time step, the option-extended expert trajectories $\bar D_{\text{demo}}$ can finally be acquired. Although the above option inference process implies that the expert demonstrations are assumed to be generated by a hierarchical policy of the same form as the agent's, our method is still applicable to a non-hierarchical expert, given the fact that a non-hierarchical expert can be imitated by a hierarchical agent (with fewer options activated).
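For concreteness, the following is a minimal sketch of the Option-Viterbi recursion in Equations 9 and 10 with back-tracing (our own illustration assuming tabular per-step log-probabilities; the authors' implementation may differ).

```python
import numpy as np

def option_viterbi(log_pi_H, log_pi_L):
    """Most probable option sequence o_{0:T-1} for one demonstration.

    log_pi_H[t] : (K+1, K) array, log pi_H(o_t | s_t, o_{t-1}); row K is o_{t-1} = '#'.
    log_pi_L[t] : (K,) array,     log pi_L(a_t | s_t, o_t).
    Runs in O(T K^2) time, matching the complexity stated in the paper.
    """
    T, K = len(log_pi_L), log_pi_L[0].shape[0]
    alpha = np.full((T, K), -np.inf)    # maximum forward messages (Eq. 9)
    back = np.zeros((T, K), dtype=int)  # argmax over o_{t-1} for back-tracing

    alpha[0] = log_pi_L[0] + log_pi_H[0][K]        # base case (Eq. 10): o_{-1} = '#'
    for t in range(1, T):
        # scores[o_prev, o] = alpha_{t-1}(o_prev) + log pi_H(o | s_t, o_prev)
        scores = alpha[t - 1][:, None] + log_pi_H[t][:K, :]
        back[t] = np.argmax(scores, axis=0)
        alpha[t] = scores[back[t], np.arange(K)] + log_pi_L[t]

    # Back-trace the option sequence from the best final option.
    options = np.zeros(T, dtype=int)
    options[-1] = int(np.argmax(alpha[-1]))
    for t in range(T - 1, 0, -1):
        options[t - 1] = back[t, options[t]]
    return options
```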
4.5. Summary of Option-GAIL

We give a brief overview of our proposed Option-GAIL in Algorithm 1. With the expert-provided demonstrations $D_{\text{demo}} = \{\tau_E = (s_{0:T}, a_{0:T})\}$ and an initial policy $\bar\pi$, our method alternates policy optimization and option inference for sufficiently many iterations.

Algorithm 1 Option-GAIL
    Data: Expert trajectories $D_{\text{demo}} = \{\tau_E = (s_{0:T}, a_{0:T})\}$
    Input: Initial policy $\bar\pi^0 = (\pi^0_H, \pi^0_L)$
    Output: Learned policy $\bar\pi$
    for $n = 0 \ldots N$ do
        E-step: Infer the expert options with $\bar\pi^n$ by Equation 9 and obtain $\bar D_{\text{demo}}$; sample trajectories with $\bar\pi^n$ and obtain $D_{\bar\pi}$.
        M-step: Update $\bar\pi^{n+1}$ by Equation 6.
    end

4.6. Convergence Analysis

Algorithm 1 describes how our method is computed, but it remains unclear whether it converges in the end. To enable the analysis, we first define a surrogate distribution of the options, $q(o_{-1:T})$. We denote by $Q_n$ the expected objective of Equation 5 at iteration $n$ of Algorithm 1, namely, $Q_n = \mathbb{E}_{q(o_{-1:T})}\big[D_f\big(\rho_{\bar\pi}(s, a, o, o') \,\|\, \rho_{\bar\pi_E}(s, a, o, o')\big)\big]$. We immediately have the following convergence theorem for Algorithm 1.

Theorem 2 (Convergence) Algorithm 1 only decreases the objective, i.e., $Q_{n+1} \leq Q_n$, if the following two conditions hold: (1) $q(o_{-1:T})$, our option sampling strategy in the E-step, is equal to the posterior of the options given the current policy $\bar\pi^n$, i.e. $q(o_{-1:T}) = P_{\bar\pi^n}(o_{-1:T} \mid s_{0:T}, a_{0:T})$; (2) the M-step decreases the objective in Equation 5 at each iteration.

The proof is devised by generalizing the traditional EM algorithm to the f-divergence minimization problem. We provide the details in the Appendix. Since $Q_n \geq 0$, Theorem 2 confirms that Algorithm 1 will converge eventually. The first condition in Theorem 2 guarantees that the objective of Equation 3 is equal to $Q_n$ after the E-step; thus Algorithm 1 also helps minimize Equation 3. Besides, if the global minimum is achieved, we have $Q_n = 0 \Leftrightarrow \rho_{\bar\pi}(s, a, o, o') = \rho_{\bar\pi_E}(s, a, o, o') \Leftrightarrow \bar\pi = \bar\pi_E$. In our implementation, as mentioned in Section 4.4, we adopt the maximizing sampling process instead of the posterior sampling required by Theorem 2, since we find that it still maintains convergence in our experiments while reducing the computational complexity by sampling only one option sequence per trajectory.

5. Experiments

We evaluate our model on four widely used robot learning benchmarks with locomotion and manipulation tasks. We first compare our model against counterparts that do not employ a hierarchical structure or self-exploration. Then we provide an ablation study to understand the impact of each component of our model.

5.1. Environments

Four environments are used for evaluation:

Hopper-v2 and Walker2d-v2: Hopper-v2 and Walker2d-v2 are two standardized continuous-control locomotion environments implemented in OpenAI Gym (Brockman et al., 2016) with the MuJoCo (Todorov et al., 2012) physics simulator. On these two tasks, the robot should move in the desired direction by moving its legs periodically. We use expert demonstrations containing 1,000 time-steps for learning on the Hopper-v2 environment and 5,000 time-steps for learning on the Walker2d-v2 environment. Both expert demonstrations are generated by a policy trained with PPO (Schulman et al., 2017).

AntPush-v0: AntPush-v0 is a navigation/locomotion task proposed by Nachum et al. (2018), where an Ant robot is required to navigate into a room that is blocked by a movable obstacle, as shown in Figure 3. Specifically, the robot should first navigate to and push away the obstacle and then go into the blocked room. For better comparison of the learning performance, we slightly extend the original binary reward to $r_t = -\|pos_t - tar\|_2^2 + 100 \cdot 1_{pos_t = tar}$, where $pos_t$ is the position of the robot and $tar$ is the location of the blocked room. We use expert demonstrations containing 50,000 time-steps for learning in this environment, where the demonstration policy is trained with DAC (Zhang & Whiteson, 2019) on our modified reward.

CloseMicrowave2: CloseMicrowave2 is a more challenging robot manipulation environment in RLBench (James et al., 2020). A 6-DoF robot arm with a gripper is required to reach and close the door of a microwave by controlling increments of the joint positions of the robot arm, as shown in Figure 3. The reward is defined as $r_t = -\theta_t + 100 \cdot 1_{\theta_t = 0}$, where $\theta_t$ denotes the rotation angle of the door. We use expert demonstrations containing 1,000 time-steps for learning in this environment, generated by a handcrafted feedback controller.
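As an aside, the two shaped rewards above could be written roughly as follows (our own sketch; the success tests `reached` and `closed` and their tolerances are assumptions, since the exact thresholds are not specified here).

```python
import numpy as np

def antpush_reward(pos, tar, tol=0.5):
    """Shaped AntPush-v0 reward: negative squared distance to the target room,
    plus a bonus of 100 once the robot is (approximately) inside it."""
    reached = np.linalg.norm(pos - tar) < tol      # assumed success test
    return -float(np.sum((pos - tar) ** 2)) + (100.0 if reached else 0.0)

def close_microwave_reward(theta, tol=1e-2):
    """Shaped CloseMicrowave2 reward: negative door angle, plus a bonus of 100
    once the door is (approximately) fully closed."""
    closed = abs(theta) < tol                      # assumed success test
    return -float(theta) + (100.0 if closed else 0.0)
```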
5.2. Comparative Evaluations

To illustrate the effectiveness of the proposed method, we compare against several popular imitation learning baselines, including: 1) supervised Behavior Cloning (BC) (Pomerleau, 1988), which uses neither a hierarchical structure nor self-exploration; 2) GAIL (Ho & Ermon, 2016), which uses self-exploration without a hierarchical structure; 3) Hierarchical Behavior Cloning (H-BC) (Zhang & Paschalidis, 2020), which builds upon the Option structure but optimizes both the high- and low-level policies by directly maximizing the probability of the observed expert trajectories without any self-exploration. This baseline can also be regarded as optimizing Equation 9 with forward-backward messages; 4) a variant of GAIL, denoted GAIL-HRL, that updates the hierarchical policies according to Equation 3, where the immediate imitation reward $r_t = -\log D_\theta(s_t, a_t)$ is used instead. GAIL-HRL can be regarded as a simplified version of our Option-GAIL that does not use options in the occupancy measurement matching, and it is designed to justify the necessity of involving options during the whole occupancy measurement matching procedure.

We employ the same demonstration trajectories, network structures, and option numbers on each environment for fair comparison. Specifically, we allow 4 available option classes for all environments; a Multi-Layer Perceptron (MLP) with hidden sizes (64, 64) implements the policies of both levels on Hopper-v2, Walker2d-v2, and AntPush-v0, and one with hidden sizes (128, 128) on CloseMicrowave2; the discriminator is realized by an MLP with hidden sizes (256, 256) on all environments. For methods that do not use self-exploration, we train the policy for 100 epochs and then evaluate it by the average reward-sum; for methods that rely on self-exploration, we update the policy and record the average reward-sum every 4,096 environment steps, and report the maximum average reward-sum over the whole training period.

Table 1. Comparative results. All results are measured by the maximum average reward-sum, averaged over different trials.

| | Hopper-v2 | Walker2d-v2 | AntPush-v0 | CloseMicrowave2 |
| --- | --- | --- | --- | --- |
| Demos $(s, a) \times T$ | $(\mathbb{R}^{11}, \mathbb{R}^{3}) \times$ 1k | $(\mathbb{R}^{17}, \mathbb{R}^{6}) \times$ 5k | $(\mathbb{R}^{107}, \mathbb{R}^{8}) \times$ 50k | $(\mathbb{R}^{101}, \mathbb{R}^{8}) \times$ 1k |
| Demo Reward | 3656.17 ± 0.00 | 5005.80 ± 36.18 | 116.60 ± 14.07 | 149.68 ± 16.29 |
| BC | 275.93 ± 31.09 | 589.17 ± 90.81 | 4.60 ± 2.72 | 26.03 ± 2.33 |
| GAIL | 535.29 ± 7.19 | 2787.87 ± 2234.46 | 54.82 ± 4.81 | 39.14 ± 12.88 |
| H-BC | 970.91 ± 72.69 | 3093.56 ± 107.11 | 89.23 ± 1.37 | 89.54 ± 15.44 |
| GAIL-HRL | 3697.40 ± 1.14 | 3687.63 ± 982.99 | 20.53 ± 6.89 | 56.95 ± 25.74 |
| Ours | 3700.42 ± 1.70 | 4836.85 ± 100.09 | 95.00 ± 2.70 | 100.74 ± 21.33 |

Figure 3. Comparison of learning performance on the four environments. The environments and task designs are illustrated at the top of the figure, with more details provided in Section 5.1. We compare the maximum average reward-sums vs. exploration steps on the different environments. The solid line indicates the average performance over several trials with different random seeds, while the shaded area indicates the range of the maximum average reward-sums over the different trials.

The evaluation results are displayed in Figure 3. Our method attains faster convergence and better eventual performance than all baselines in general, as illustrated in Figure 3. For example, on CloseMicrowave2, which is clearly a long-horizon task, our Option-GAIL converges to a remarkably larger reward than all others.
Besides, by introducing the option-based hierarchical structure, Option-GAIL, H-BC and GAIL-HRL are superior to the counterparts that use a single-level policy, namely BC and GAIL. When demonstrations are limited, for instance on Hopper-v2, Walker2d-v2, and CloseMicrowave2, GAIL-like methods including Option-GAIL and GAIL-HRL outperform their BC-like counterpart H-BC, probably thanks to the reduction of compounding errors via self-exploration. On AntPush-v0, the advantage of our method over H-BC is reduced. We conjecture this is because H-BC is data-hungry and becomes better when sufficient demonstrations are fed. Compared with GAIL, with the help of hierarchical modeling, our method can successfully imitate the behavior of the expert with fewer self-explorations.

Figure 4. Visualization of the options activated at each step, learned respectively by our proposed method (blue) and GAIL-HRL (gray) on Walker2d-v2. Demo denotes the options inferred from the expert, and Sample denotes the options used by the agent during self-exploration. The effectiveness of our proposed method in regularizing the option switching is evident from the consistent switching tendencies between Demo and Sample.

To examine whether our proposed method actually regularizes the option switching between expert and agent, Figure 4 illustrates the options inferred from the expert and those generated by the agent on Walker2d-v2, with more examples deferred to the Appendix. We observe that in our method the expert's options are consistent with the agent's, while for GAIL-HRL a dramatic conflict exists between expert and agent. This result confirms the benefit of our method, since we explicitly minimize the discrepancy of the option-extended occupancy measurement between expert and agent in Equation 5.

5.3. Ablations and Analysis

In this section, we perform an in-depth analysis of each component of our model.

Figure 5. Visualization of the state-actions and options on the Hopper-v2 environment by t-SNE (Maaten & Hinton, 2008). The dots and squares indicate the state-action pairs generated by the agent and the expert at each time step, respectively, and the gray lines connect time-adjacent points. The color indicates the option activated at each time step. Arrows point to the mismatched option switching behaviors between the agent and the expert demonstration under GAIL-HRL.

The necessity of using option-extended occupancy: We geometrically assess the difference between our Option-GAIL and GAIL-HRL on Walker2d-v2. To do so, we visualize the trajectories of the expert (squares) and the agent (dots) as well as their options (colors) at each time step. Different colors indicate the different options used by the low-level policy, and the expert's options are inferred with the agent's policy. The visualizations in Figure 5 show that the agent's trajectory under Option-GAIL almost overlaps with the demonstration, whereas the one under GAIL-HRL sometimes tracks a wrong direction with respect to the expert.
Moreover, in Option-GAIL, the options of the expert and the agent always switch in accordance with each other, while GAIL-HRL occasionally derives inconsistent options (highlighted by an arrow) between the expert and the agent. This phenomenon once again verifies the necessity of performing the option-extended occupancy measurement matching in Equation 5. Conducting the occupancy measurement matching in Equation 3 instead gives rise to ambiguities in the learned policies and eventually harms the learning performance.

Figure 6. Ablations on individual components of our proposed method. The curve on the left compares the maximum average reward-sum within a given range of exploration steps. The options used by the agent and inferred from the expert are presented on the right.

Ablations on individual components: To verify the rationality of the model design, we test different variants of our method in Figure 6. 1) Random $o$. In this case, we generate the expert's options at random, without option inference. As observed, the random options mislead policy learning and thus yield worse performance. We further display the option switching in the right sub-figure of Figure 6. The high-level policy tends to be conservative about switching, since the sampled options always remain unchanged, making our model less expressive for fitting the expert's hierarchical policy. 2) No $o_{t-1}$. We simply omit $o_{t-1}$ in Equation 6 and implement the discriminator with only $(s_t, a_t, o_t)$. Clearly, without the option information from the last step, the high-level policy is not fully regularized and cannot capture the option dynamics, thereby delivering a performance drop compared to the full-option version. 3) Pretrain $\pi_H$. We pre-train the high-level policy using H-BC for 50 epochs and then fix it during the subsequent learning procedure, which can be regarded as a version of the two-stage method by Sharma et al. (2018). By examining the results in Figure 6, this ablation shows that it is indeed necessary to learn the high- and low-level policies simultaneously in an end-to-end manner. From the optimization perspective, the two-level policies are coupled with each other and are improved more effectively under an alternating training fashion. In summary, all the results here support our design choices.

Figure 7. Ablations on the total number of available option classes. Left: how different values of $|o|$ influence the maximum average reward-sum within a given range of exploration steps on the Hopper-v2 environment. Right: actual option transitions for different $|o|$.

Ablations on the number of available option classes: Throughout the above experiments, we set the number of option classes $|o| = 4$ and find that it works generally. In fact, changing $|o|$ does not essentially change our performance as long as $|o|$ is sufficiently large, since our algorithm automatically inactivates redundant options (similar sub-task switchings are clustered into the same option class). To illustrate such robustness, we evaluate our method with $|o| \in \{2, \ldots, 6\}$ on the Hopper-v2 environment. As observed from Figure 7, when $|o| \geq 3$, all variants share similar convergence behavior and option transitions: the number of activated options is observed to be 3 in all cases. When $|o| = 2$, the convergence becomes worse, probably due to the limited expressivity.

6. Discussion and Conclusion

In this paper, we have presented Option-GAIL, a theoretically-grounded framework that integrates hierarchical modeling with option-occupancy measurement matching. We further provide an EM-like algorithm for training the policies of Option-GAIL with provably guaranteed training convergence.
Compared to existing HIL methods, Option-GAIL can regularize both the high- and low-level policies to better fit the hierarchical behavior of the expert. Experiments on robotic locomotion and manipulation benchmarks demonstrate the effectiveness of the proposed Option-GAIL over other counterparts; Option-GAIL works particularly well for tasks consisting of evidently separable sub-tasks. Our method is generally powerful and can be enhanced by absorbing external knowledge. For example, a future direction is to consider a long-horizon plan as prior knowledge when inferring options.

Acknowledgement

This research was primarily funded by the National Key R&D Program of China (Grant No. 20191241501). It was also partially funded by the THU-Bosch JCML Center (Grant No. 20193000122). Meanwhile, this work is jointly sponsored by the CAAI MindSpore Open Fund, the China Postdoctoral Science Foundation (Grant No. 2020M670337), and the National Natural Science Foundation of China (Grant No. 62006137). We would like to thank Chao Yang and Yu Luo for their generous help and insightful advice.

References

Atkeson, C. G. and Schaal, S. Robot learning from demonstration. In Proc. Int. Conf. Mach. Learn., 1997.

Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Proc. AAAI Conf. Artificial Intell., 2017.

Baum, L. E. et al. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3(1):1–8, 1972.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Byrne, R. W. and Russon, A. E. Learning by imitation: A hierarchical approach. Behavioral and Brain Sciences, 21(5):667–684, 1998.

Daniel, C., Van Hoof, H., Peters, J., and Neumann, G. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 104(2-3):337–357, 2016.

Fei, C., Wang, B., Zhuang, Y., Zhang, Z., Hao, J., Zhang, H., Ji, X., and Liu, W. Triple-GAIL: A multi-modal imitation learning framework with generative adversarial nets. In Proc. Int. Joint Conf. Artificial Intell., 2020.

Ghasemipour, S. K. S., Zemel, R., and Gu, S. A divergence minimization perspective on imitation learning methods. In Proc. Conf. Robot Learn., 2020.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Proc. Advances in Neural Inf. Process. Syst., 2014.

Hamidi, M., Tadepalli, P., Goetschalckx, R., and Fern, A. Active imitation learning of hierarchical policies. In Proc. Int. Joint Conf. Artificial Intell., 2015.

Henderson, P., Chang, W.-D., Bacon, P.-L., Meger, D., Pineau, J., and Precup, D. OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. In Proc. AAAI Conf. Artificial Intell., 2018.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In Proc. Advances in Neural Inf. Process. Syst., 2016.

James, S., Ma, Z., Arrojo, D. R., and Davison, A. J. RLBench: The robot learning benchmark & learning environment. IEEE Robot. Autom. Letters, 5(2):3019–3026, 2020.

Kipf, T., Li, Y., Dai, H., Zambaldi, V., Sanchez-Gonzalez, A., Grefenstette, E., Kohli, P., and Battaglia, P. CompILE: Compositional imitation learning and execution. In Proc. Int. Conf. Mach. Learn., 2019.

Le, H., Jiang, N., Agarwal, A., Dudik, M., Yue, Y., and Daumé III, H. Hierarchical imitation and reinforcement learning. In Proc. Int. Conf. Mach. Learn., 2018.
Lee, S.-H. and Seo, S.-W. Learning compound tasks without task-specific knowledge via imitation and self-supervised learning. In Proc. Int. Conf. Mach. Learn., 2020.

Li, C., Song, D., and Tao, D. The skill-action architecture: Learning abstract action embeddings for reinforcement learning. ICLR submission, 2021.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res., 9(Nov):2579–2605, 2008.

Nachum, O., Gu, S. S., Lee, H., and Levine, S. Data-efficient hierarchical reinforcement learning. In Proc. Advances in Neural Inf. Process. Syst., 2018.

Pomerleau, D. A. ALVINN: An autonomous land vehicle in a neural network. In Proc. Advances in Neural Inf. Process. Syst., 1988.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. Int. Conf. Artificial Intell. & Stat., 2011.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sharma, M., Sharma, A., Rhinehart, N., and Kitani, K. M. Directed-Info GAIL: Learning hierarchical policies from unsegmented demonstrations using directed information. In Proc. Int. Conf. Learn. Representations, 2018.

Sharma, P., Pathak, D., and Gupta, A. Third-person visual imitation learning via decoupled hierarchical controller. In Proc. Advances in Neural Inf. Process. Syst., 2019.

Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proc. Int. Conf. Mach. Learn., 2008.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots & Systems, 2012.

Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.

Yu, T., Abbeel, P., Levine, S., and Finn, C. One-shot hierarchical imitation learning of compound visuomotor tasks. arXiv preprint arXiv:1810.11043, 2018.

Zhang, S. and Whiteson, S. DAC: The double actor-critic architecture for learning options. In Proc. Advances in Neural Inf. Process. Syst., 2019.

Zhang, Z. and Paschalidis, I. Provable hierarchical imitation learning via EM. arXiv preprint arXiv:2010.03133, 2020.