# Domain Adaptive Imitation Learning

Kuno Kim¹, Yihong Gu², Jiaming Song¹, Shengjia Zhao¹, Stefano Ermon¹

¹Department of Computer Science, Stanford University. ²Department of Computer Science, Tsinghua University. Correspondence to: Kuno Kim.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

We study the question of how to imitate tasks across domains with discrepancies such as embodiment, viewpoint, and dynamics mismatch. Many prior works require paired, aligned demonstrations and an additional RL step that requires environment interactions. However, paired, aligned demonstrations are seldom obtainable, and RL procedures are expensive. We formalize the Domain Adaptive Imitation Learning (DAIL) problem, a unified framework for imitation learning in the presence of viewpoint, embodiment, and dynamics mismatch. Informally, DAIL is the process of learning how to perform a task optimally, given demonstrations of the task in a distinct domain. We propose a two-step approach to DAIL: alignment followed by adaptation. In the alignment step we execute a novel unsupervised MDP alignment algorithm, Generative Adversarial MDP Alignment (GAMA), to learn state and action correspondences from unpaired, unaligned demonstrations. In the adaptation step we leverage the correspondences to zero-shot imitate tasks across domains. To describe when DAIL is feasible via alignment and adaptation, we introduce a theory of MDP alignability. We experimentally evaluate GAMA against baselines in embodiment, viewpoint, and dynamics mismatch scenarios where aligned demonstrations don't exist, and show the effectiveness of our approach.

## 1. Introduction

Humans possess an astonishing ability to recognize latent structural similarities between behaviors in related but distinct domains, and to learn new skills from cross-domain demonstrations alone. Not only are we capable of learning from third-person observations that have no obvious correspondence to our internal self representations (Stadie et al., 2017; Liu et al., 2018; Sermanet et al., 2018), but we are also capable of imitating experts with different embodiments (Gupta et al., 2017; Rizzolatti & Craighero, 2004; Liu et al., 2020) in foreign environments (Liu et al., 2020) - e.g. an infant is able to imitate visuomotor skills by watching adults with different biomechanics (Jones, 2009) acting in environments distinct from their playroom. Previous work in neuroscience (Marshall & Meltzoff, 2015) and robotics (Kuniyoshi & Inoue, 1993; Kuniyoshi et al., 1994) has recognized the pitfalls of exact behavioral cloning in the presence of domain discrepancies and posited that the effectiveness of the human imitation learning mechanism hinges on the ability to learn structure-preserving domain correspondences. These correspondences enable the learner to internalize cross-domain demonstrations and produce a reconstruction of the behavior in the self domain. Consider a young child who has learned to associate (or "align") his internal body map with the limbs of an adult. When the adult demonstrates running, the child is able to internalize the demonstration and reproduce the behavior. Recently, separate solutions have been proposed for imitation learning across three main types of domain discrepancies: dynamics (Liu et al., 2020), embodiment (Gupta et al., 2017), and viewpoint (Liu et al., 2018; Sermanet et al., 2018) mismatch.
Most works (Liu et al., 2018; Sermanet et al., 2018; Gupta et al., 2017) require paired, time-aligned demonstrations to obtain state correspondences, as well as an RL step involving environment interactions (see Figure 1a). Unfortunately, paired, aligned demonstrations are seldom obtainable and RL procedures are expensive. In this work we formalize the Domain Adaptive Imitation Learning (DAIL) problem - a unified framework for imitation learning across domains with dynamics, embodiment, and/or viewpoint mismatch. Informally, DAIL is the process of learning how to perform a task optimally in a self domain, given demonstrations of the task in a distinct expert domain. We propose a two-step approach to DAIL: alignment followed by adaptation. In the alignment step we execute a novel unsupervised MDP alignment algorithm, Generative Adversarial MDP Alignment (GAMA), to learn state and action maps from unpaired, unaligned demonstrations. In the adaptation step we leverage the learned state and action maps to zero-shot imitate tasks across domains without an extra RL step. To shed light on when DAIL can be solved by alignment and adaptation, we introduce a theory of MDP alignability. We conduct experiments with a variety of domains that have dynamics, embodiment, and viewpoint mismatch and demonstrate significant gains on learning from unpaired data.

Figure 1. DAIL Pipeline. (a) Inputs: illustration of paired, aligned vs. unpaired, unaligned demonstrations in the alignment task set $D_{x,y}$ containing tasks $T_1, T_2, \ldots$ (b) Alignment phase: we learn state and action maps $f, g$ between the self ($x$) and expert ($y$) domains from unpaired, unaligned demonstrations by minimizing a distribution matching loss and an imitation loss on a composite policy $\hat{\pi}_x$. (c) Adaptation phase: adapt the expert domain policy $\pi_{y,T}$ or demonstrations to obtain a self domain policy $\hat{\pi}_{x,T}$.

The primary contributions of this work are as follows:

1. We propose an unsupervised MDP alignment algorithm that succeeds at DAIL from unpaired, unaligned demonstrations, removing the need for costly paired, aligned data.
2. We achieve zero-shot imitation, thereby removing a costly RL procedure involving environment interactions.
3. We propose a unifying theoretical framework for imitation learning across domains with dynamics, embodiment, and/or viewpoint mismatch.

## 2. Domain Adaptive Imitation Learning

An infinite horizon Markov Decision Process (MDP) $M \in \mathbb{M}$ with deterministic dynamics is a tuple $(S, A, P, \rho, R)$, where $\mathbb{M}$ is the set of all MDPs, $S$ is the state space, $A$ is the action space, $P : S \times A \to S$ is a (deterministic) transition function, $R : S \times A \to \mathbb{R}$ is the reward function, and $\rho$ is the initial state distribution. A domain is an MDP without the reward, i.e. $(S, A, P, \rho)$. Intuitively, a domain fully characterizes the embodied agent and the environment dynamics, but not the desired behavior. A task $T$ is a label for an MDP corresponding to a high level description of optimal behavior, such as "walking"; $T$ is analogous to category labels for images. An MDP with domain $x$ for task $T$ is denoted by $M_{x,T} = (S_x, A_x, P_x, \rho_x, R_{x,T})$, where $R_{x,T}$ is a reward function encapsulating the behavior labeled by $T$. For example, different reward functions are needed to realize the "walking" behavior in two morphologically different humanoids.
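For concreteness, the following is a minimal Python sketch (our own illustration, not code from the paper) of how the objects defined above might be represented in a finite, tabular setting; all names are ours.

```python
# Minimal sketch of the objects above for a finite domain with deterministic
# dynamics; illustrative only, all names are ours.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State, Action = int, int  # tabular setting for concreteness

@dataclass
class Domain:
    """A domain (S, A, P, rho): embodiment and dynamics, but no task."""
    states: List[State]
    actions: List[Action]
    P: Dict[Tuple[State, Action], State]   # deterministic transition function
    rho: Dict[State, float]                # initial state distribution

@dataclass
class MDP:
    """An MDP M_{x,T}: a domain plus the task-specific reward R_{x,T}."""
    domain: Domain
    reward: Callable[[State, Action], float]

# A demonstration of length H is a sequence of (state, action) tuples, and a
# demonstration set D_{M_{x,T}} is a collection of such trajectories.
Demonstration = List[Tuple[State, Action]]
DemonstrationSet = List[Demonstration]
```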
A (stationary) policy for $M_{x,T}$ is a map $\pi_{x,T} : S_x \to B(A_x)$, where $B(A_x)$ is the set of probability measures on $A_x$, and an optimal policy $\pi^*_{x,T} \in \arg\max_{\pi_x} J(\pi_x)$ achieves the highest policy performance $J(\pi_x) = \mathbb{E}_{\pi_x}[\sum_{t=0}^{\infty} \gamma^t R_{x,T}(s^{(t)}_x, a^{(t)}_x)]$, where $0 < \gamma < 1$ is a discount factor. A demonstration $\tau_{M_{x,T}}$ of length $H$ for an MDP $M_{x,T}$ is a sequence of state-action tuples $\tau_{M_{x,T}} = \{(s^{(t)}_x, a^{(t)}_x)\}_{t=1}^{H}$, and $D_{M_{x,T}} = \{\tau^{(k)}\}_{k=1}^{K}$ is a set of demonstrations. Let $M_{x,T}, M_{y,T}$ be self and expert MDPs for a target task $T$. Given expert domain demonstrations $D_{M_{y,T}}$, Domain Adaptive Imitation Learning (DAIL) aims to determine an optimal self domain policy $\pi^*_{x,T}$ without access to the reward function $R_{x,T}$.

In this work we propose to first solve an MDP alignment problem and then leverage the alignments to zero-shot imitate expert domain demonstrations. Like prior work (Gupta et al., 2017; Liu et al., 2018; Sermanet et al., 2018), we assume the availability of an alignment task set $D_{x,y} = \{(D_{M_{x,T_i}}, D_{M_{y,T_i}})\}_{i=1}^{N}$ containing demonstrations for $N$ tasks $\{T_i\}_{i=1}^{N}$ from both the self and expert domain. $D_{x,y}$ could, for example, contain both robot ($x$) and human ($y$) demonstrations for a set of primitive tasks such as walking, running, and jumping. Unlike prior work, demonstrations are unpaired and unaligned, i.e. $(s^{(t)}_x, s^{(t)}_y)$ may not be a valid state correspondence (see Figure 1(a)). Paired, time-aligned cross domain data is expensive and may not even exist when task execution rates differ or there is systematic embodiment mismatch between the domains. For example, a child can imitate an adult running, but not achieve the same speed. Our setup emulates a natural setting in which humans compare how they perform tasks to how other agents perform the same tasks in order to find structural similarities and identify domain correspondences. We now proceed to introduce a theoretical framework that explains how and when the DAIL problem can be solved by MDP alignment followed by adaptation.

## 3. Alignable MDPs

Let $\Pi^*_M$ be the set of all optimal policies for MDP $M$. We define an occupancy measure (Syed et al., 2008) $q_\pi : S \times A \to \mathbb{R}$ for a policy $\pi$ executed in MDP $M$ as $q_\pi(s, a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^t \Pr(s^{(t)} = s; \pi, M)$. We further define the optimality function $O_{M_x} : S_x \times A_x \to \{0, 1\}$ for an MDP $M_x$ as $O_{M_x}(s_x, a_x) = 1$ if $\exists \pi \in \Pi^*_{M_x}$ such that $(s_x, a_x) \in \mathrm{supp}(q_\pi)$, and $O_{M_x}(s_x, a_x) = 0$ otherwise. We are now ready to formalize MDP reductions: a class of structure preserving maps between MDPs.

Definition 1. An MDP reduction from $M_x = (S_x, A_x, P_x, \rho_x, R_x)$ to $M_y = (S_y, A_y, P_y, \rho_y, R_y)$ is a tuple $r = (\phi, \psi)$ where $\phi : S_x \to S_y$, $\psi : A_x \to A_y$ are maps that preserve:

1. (optimality) $\forall (s_x, a_x, s_y, a_y) \in S_x \times A_x \times S_y \times A_y$:
$$O_{M_y}(\phi(s_x), \psi(a_x)) = 1 \Rightarrow O_{M_x}(s_x, a_x) = 1 \quad (1)$$
$$O_{M_y}(s_y, a_y) = 1 \Rightarrow \phi^{-1}(s_y) \neq \emptyset,\ \psi^{-1}(a_y) \neq \emptyset \quad (2)$$

2. (dynamics) $\forall (s_x, a_x, s_y, a_y) \in S_x \times A_x \times S_y \times A_y$ such that $O_{M_y}(s_y, a_y) = 1$, $s_x \in \phi^{-1}(s_y)$, $a_x \in \psi^{-1}(a_y)$:
$$P_y(s_y, a_y) = \phi(P_x(s_x, a_x)) \quad (3)$$

where we define $\phi^{-1}(s_y) = \{s_x \mid \phi(s_x) = s_y\}$ and $\psi^{-1}(a_y) = \{a_x \mid \psi(a_x) = a_y\}$. Furthermore, $r$ is an MDP permutation if and only if $\phi, \psi$ are bijective.

In words, Eq. 1 states that only optimal state-action pairs in $x$ can be mapped to optimal state-action pairs in $y$, and Eq. 2 states that $r$ must be surjective on the set of optimal state-action pairs in $y$. These properties imply that a policy in $M_x$ should have more flexibility choosing optimal actions than one in $M_y$. Eq. 3 states that a reduction must preserve (deterministic) dynamics. We use the notation $M_x \preceq_{\phi,\psi} M_y$ to denote that $(\phi, \psi)$ is a reduction from $M_x$ to $M_y$, and the shorthand $M_x \preceq M_y$ to denote that $M_x$ reduces to $M_y$.
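To make Definition 1 concrete, here is a small Python sketch (ours, not the authors' code) that checks the three conditions for finite MDPs with deterministic dynamics; the optimality functions `O_x`, `O_y`, transition functions `P_x`, `P_y`, and candidate maps `phi`, `psi` are assumed to be given as plain dictionaries.

```python
# Sketch of checking Definition 1 on small finite MDPs with deterministic dynamics.
# All argument names are illustrative assumptions, not the paper's API.
def is_reduction(phi, psi, S_x, A_x, S_y, A_y, O_x, O_y, P_x, P_y):
    # Eq. (1): a pair whose image is optimal in y must itself be optimal in x.
    for s_x in S_x:
        for a_x in A_x:
            if O_y[(phi[s_x], psi[a_x])] == 1 and O_x[(s_x, a_x)] != 1:
                return False
    # Eq. (2): every optimal (s_y, a_y) must have nonempty preimages under phi, psi.
    for s_y in S_y:
        for a_y in A_y:
            if O_y[(s_y, a_y)] == 1:
                if not any(phi[s_x] == s_y for s_x in S_x):
                    return False
                if not any(psi[a_x] == a_y for a_x in A_x):
                    return False
    # Eq. (3): dynamics must be preserved on preimages of optimal pairs.
    for s_y in S_y:
        for a_y in A_y:
            if O_y[(s_y, a_y)] != 1:
                continue
            for s_x in (s for s in S_x if phi[s] == s_y):
                for a_x in (a for a in A_x if psi[a] == a_y):
                    if phi[P_x[(s_x, a_x)]] != P_y[(s_y, a_y)]:
                        return False
    return True
```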
To gain an intuitive understanding of MDP reductions, picture the execution trace of an optimal policy as a directed graph with colored edges, in which the nodes correspond to states visited by an optimal policy and the colored edges correspond to actions taken. An MDP reduction from $M_x$ to $M_y$ homomorphs the execution graph of an optimal policy in $M_x$ to an execution graph of an optimal policy in $M_y$. Figure 2 shows an example of a valid reduction from $M_x$ to $M_y$: states 1, 2 in $S_x$ are mapped (merged) to state a in $S_y$, and the blue and red actions in $A_x$ are mapped to the green action in $A_y$. Intuitively, if $M_x \preceq_{\phi,\psi} M_y$, then $(\phi, \psi)$ compresses $M_x$ by merging all optimal state-action pairs that have identical dynamics properties.

Figure 2. MDP Reduction Example between execution traces in $M_x$ (Left) and $M_y$ (Right), where $M_x \preceq M_y$. States correspond to nodes and actions to colors. The shown reduction merges nodes in dotted boxes to their corner label, e.g. $\phi(1) = \phi(2) = a$, and both blue and red actions in $M_x$ to the green action in $M_y$.

Definition 2. Two MDPs $M_x, M_y$ are alignable if and only if $M_x \preceq M_y$ or $M_y \preceq M_x$.

Definition 2 states that MDPs are alignable if a reduction exists between them, meaning that they share structure. We use $\Gamma(M_x, M_y) = \{(\phi, \psi) \mid M_x \preceq_{\phi,\psi} M_y\}$ to denote the set of all valid reductions from $M_x$ to $M_y$. Reductions have a particularly useful property: they adapt policies across alignable MDPs. Consider a state map $f : S_x \to S_y$, an inverse action map $g : A_y \to A_x$, and a composite policy $\hat{\pi}_x = g \circ \pi_y \circ f$ (see Figure 1(b)). In words, $\hat{\pi}_x$ maps a self state to an expert state via $f$, simulates the expert's action choice for the mapped state via $\pi_y$, then chooses a self action that corresponds to the simulated expert action with $g$. The following theorem holds for $\hat{\pi}_x$.

Theorem 1. Let $M_x, M_y$ be MDPs satisfying Assumption 1 (see Appendix F), let $M_x \preceq_{\phi,\psi} M_y$, and let $\pi^*_y$ be optimal in $M_y$. Then, for all $g : A_y \to A_x$ such that $\psi(g(a_y)) = a_y$ for all $a_y \in \{a_y \mid \exists s_y \in S_y \text{ s.t. } O_{M_y}(s_y, a_y) = 1\}$, it holds that $\hat{\pi}_x = g \circ \pi^*_y \circ \phi$ is optimal in $M_x$.

Theorem 1 states that state and action maps $(f, g^{-1})$ chosen to be a reduction can adapt optimal policies between alignable MDPs. From here onwards we interchangeably refer to $(f, g)$ as "alignments".
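The composite policy in Theorem 1 is simple to implement once $f$, $g$, and an expert policy are available; the sketch below (our illustration, with hypothetical names, assuming a deterministic expert policy) shows the zero-shot adaptation $\hat{\pi}_x = g \circ \pi_y \circ f$ from Figure 1(b).

```python
# Sketch of the composite policy from Theorem 1 / Figure 1(b): map a self state
# into the expert domain with f, query the expert policy there, and map its
# action back with g. Names are illustrative, not the authors' implementation.
def make_adapted_policy(f, g, expert_policy):
    """f: S_x -> S_y, g: A_y -> A_x, expert_policy: S_y -> A_y (pi_{y,T})."""
    def adapted_policy(s_x):
        s_y_hat = f(s_x)                  # corresponding expert-domain state
        a_y_hat = expert_policy(s_y_hat)  # expert's action choice there
        return g(a_y_hat)                 # corresponding self-domain action
    return adapted_policy

# Usage: zero-shot adaptation of an expert policy for a new target task T,
# reusing the alignments (f, g) learned on the alignment task set.
# pi_x_hat = make_adapted_policy(f, g, pi_y_T)
# action = pi_x_hat(current_self_state)
```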
We now show how the DAIL problem can be solved by first solving an MDP alignment problem followed by an adaptation step.

Definition 3. Two MDP pairs $(M_x, M_y), (M_{x'}, M_{y'}) \in \mathbb{M}^2$ are jointly alignable if $\Gamma(M_x, M_y) \cap \Gamma(M_{x'}, M_{y'}) \neq \emptyset$.

In words, two MDP pairs are jointly alignable if there exists a shared reduction. We define an equivalence class $[(M_x, M_y)] = \{(M_{x'}, M_{y'}) \mid (M_{x'}, M_{y'}) \text{ jointly alignable with } (M_x, M_y)\}$ of MDP pairs that share reductions. Overloading notation, we write $\Gamma(\{(M_{x,i}, M_{y,i})\}_{i=1}^{N}) = \bigcap_{i=1}^{N} \Gamma(M_{x,i}, M_{y,i})$ for the set of shared reductions. We now formally state the MDP alignment problem: let $(M_{x,T}, M_{y,T})$ be an MDP pair for a target task $T$. Given an alignment task set $D_{x,y} = \{(D_{M_{x,T_i}}, D_{M_{y,T_i}})\}_{i=1}^{N}$ comprising unpaired, unaligned demonstrations for MDP pairs $\{(M_{x,T_i}, M_{y,T_i})\}_{i=1}^{N} \subseteq [(M_{x,T}, M_{y,T})]$, determine $(\phi, \psi) \in \Gamma(\{(M_{x,T_i}, M_{y,T_i})\}_{i=1}^{N})$ such that $(\phi, \psi) \in \Gamma(M_{x,T}, M_{y,T})$.

As shown in Figure 3, with more MDP pairs there are likely to be fewer joint alignments $|\Gamma(\{(M_{x,T_i}, M_{y,T_i})\}_{i=1}^{N})|$ and, as a result, a $(\phi, \psi) \in \Gamma(\{(M_{x,T_i}, M_{y,T_i})\}_{i=1}^{N})$ is more likely to "generalize" to an MDP pair for a new target task $(M_{x,T}, M_{y,T})$ in the equivalence class. Analogously, in a standard supervised learning problem, more training data is likely to shrink the set of models that perform optimally on the training set but poorly on the test set. We can then use $(\phi, \psi)$ for DAIL: given cross domain demonstrations $D_{M_{y,T}}$ for the target task $T$, learn an expert domain policy $\pi_{y,T}$ and adapt it into the self domain using $(\phi, \psi)$ according to Theorem 1.

Figure 3. MDP Alignment Problem. Blue/red and green regions denote sets of MDP alignments for two alignment tasks and the target task, respectively. White hatches cover the solution set to the MDP alignment problem, which is the intersection of all sets.

We can now assess when domains with embodiment and viewpoint mismatch have meaningful state correspondences, i.e. MDP reductions, thus allowing for domain adaptive imitation. The states of a human expert with more degrees of freedom than a robot imitator can be merged into the robot states if the task only requires the robot's degrees of freedom and the execution traces share structure, e.g. the traces are both cycles. However, if the task requires degrees of freedom possessed only by the human, the robot cannot find meaningful correspondences, and also cannot imitate the task. Two MDPs for different viewpoints of an agent performing a task are MDP permutations, since there is a one-to-one correspondence between states and actions at the same timestep in the execution trace of an optimal policy.

## 4. Learning MDP Reductions

We now derive objectives that can be optimized to learn MDP reductions. We propose distribution matching and policy performance maximization. We first define the distributions to be matched.

Definition 4. Let $M_x, M_y$ be two MDPs and let $\hat{\pi}_x = g \circ \pi_y \circ f$ for $f : S_x \to S_y$, $g : A_y \to A_x$, and policy $\pi_y$. The co-domain policy execution process $\mathcal{P}_{\hat{\pi}_x} = \{\hat{s}^{(t)}_y, \hat{a}^{(t)}_y\}_{t \geq 0}$ is realized by running $\hat{\pi}_x$ in $M_x$, i.e.:
$$\hat{s}^{(t)}_y = f(s^{(t)}_x), \quad \hat{a}^{(t)}_y \sim \pi_y(\cdot \mid \hat{s}^{(t)}_y), \quad a^{(t)}_x = g(\hat{a}^{(t)}_y), \quad s^{(t+1)}_x = P_x(s^{(t)}_x, a^{(t)}_x) \quad \forall t \geq 0 \quad (4)$$

The target distribution $\sigma_{\pi_y}$ is over transitions uniformly sampled from execution traces of $\pi_y$ in $M_y$, and the proxy distribution $\sigma^{x \to y}_{\hat{\pi}_x}$ is over cross domain transitions uniformly sampled from realizations of $\mathcal{P}_{\hat{\pi}_x}$, i.e. from running $\hat{\pi}_x$ in $M_x$:
$$\sigma_{\pi_y}(s_y, a_y, s'_y) = \Pr\big(s^{(t)}_y = s_y,\ a^{(t)}_y = a_y,\ s^{(t+1)}_y = s'_y;\ \pi_y, M_y\big) \quad (5)$$
$$\sigma^{x \to y}_{\hat{\pi}_x}(s_y, a_y, s'_y) = \Pr\big(\hat{s}^{(t)}_y = s_y,\ \hat{a}^{(t)}_y = a_y,\ \hat{s}^{(t+1)}_y = s'_y;\ \mathcal{P}_{\hat{\pi}_x}\big) \quad (6)$$

We now propose two concrete objectives to optimize for: 1. optimality of $\hat{\pi}_x$ in $M_x$, and 2. $\sigma^{x \to y}_{\hat{\pi}_x} = \sigma_{\pi_y}$. In other words, we seek to learn $f, g$ that match distributions over transition tuples in domain $y$ while maximizing policy performance in domain $x$. The former (distribution matching) captures the dynamics preservation property from Eq. 3 and the latter (performance maximization) captures the optimal policy preservation property from Eqs. 1, 2. The following theorem uncovers the connection between our objectives and MDP reductions.

Theorem 2. Let $M_x, M_y$ be MDPs satisfying Assumption 1 (see Appendix F). If $M_x \preceq M_y$, then there exist $f : S_x \to S_y$, $g : A_y \to A_x$, and an optimal covering policy $\pi_y$ (see Appendix F) that satisfy objectives 1 and 2. Conversely, if there exist $f : S_x \to S_y$, an injective map $g : A_y \to A_x$, and an optimal covering policy $\pi_y$ satisfying objectives 1 and 2, then $M_x \preceq M_y$ and $\exists (\phi, \psi) \in \Gamma(M_x, M_y)$ such that $f = \phi$ and $\psi(g(a_y)) = a_y$ for all $a_y \in A_y$.

Theorem 2 states that if two MDPs are alignable, then objectives 1 and 2 can be satisfied. Conversely, if objectives 1 and 2 can be satisfied for two MDPs with a state map $f$ and an injective action map $g$, then the MDPs must be alignable and all solutions $(f, g)$ are reductions. While Theorem 2 requires that MDPs be alignable to guarantee identifiability of solutions obtained by optimizing objectives 1 and 2, our experiments will also run on MDPs that are only "weakly" alignable, i.e. Eqs. 1, 2, 3 do not hold exactly but the MDPs intuitively share structure.
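As an illustration of Definition 4 (a sketch under our own naming, assuming deterministic maps and a deterministic expert policy), the proxy distribution can be approximated by rolling out the composite policy in $M_x$ and recording the induced transitions in domain-$y$ coordinates:

```python
# Sketch of sampling from the proxy distribution sigma^{x->y}_{pi_hat_x} of
# Definition 4 by running the composite policy in the self domain M_x.
# P_x(s_x, a_x) returns the next self state (deterministic dynamics) and
# rho_x() samples an initial self state; both are assumed, not from the paper.
def sample_proxy_transitions(f, g, expert_policy, P_x, rho_x, horizon, n_rollouts):
    transitions = []  # tuples (s_y_hat, a_y_hat, s_y_hat_next) in domain-y coordinates
    for _ in range(n_rollouts):
        s_x = rho_x()
        for _ in range(horizon):
            s_y_hat = f(s_x)                  # Eq. (4): map the self state into domain y
            a_y_hat = expert_policy(s_y_hat)  # simulate the expert's action choice
            s_x_next = P_x(s_x, g(a_y_hat))   # step the *self* dynamics with g(a_y_hat)
            transitions.append((s_y_hat, a_y_hat, f(s_x_next)))
            s_x = s_x_next
    return transitions

# Uniformly sampling from `transitions` approximates sigma^{x->y}_{pi_hat_x};
# the target sigma_{pi_y} is estimated analogously from traces of pi_y in M_y
# (in practice, from the expert domain demonstrations).
```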
In the next section, we derive a simple algorithm to learn MDP reductions.

## 5. Generative Adversarial MDP Alignment

Building on Theorem 2, we propose the following training objective for aligning MDPs:
$$\min_{f,g}\ -J(\hat{\pi}_x) + \lambda\, d\big(\sigma^{x \to y}_{\hat{\pi}_x},\ \sigma_{\pi_y}\big) \quad (7)$$
where $J(\hat{\pi}_x)$ is the performance of $\hat{\pi}_x$, $d$ is a distance metric between distributions, and $\lambda > 0$ is a Lagrange multiplier. In practice, we found that injectivity of $g$ is unnecessary to enforce in continuous domains. We now present an instantiation of this framework: Generative Adversarial MDP Alignment (GAMA). Recall that we are given an alignment task set $D_{x,y} = \{(D_{M_{x,T_i}}, D_{M_{y,T_i}})\}_{i=1}^{N}$. In the alignment step, we learn $\pi_{y,T_i}$ for all $T_i$ and parameterized state and action maps $f_{\theta_f} : S_x \to S_y$, $g_{\theta_g} : A_y \to A_x$ that compose $\hat{\pi}_{x,T_i} = g_{\theta_g} \circ \pi_{y,T_i} \circ f_{\theta_f}$. To match $\sigma^{x \to y}_{\hat{\pi}_{x,T_i}}$ with $\sigma_{\pi_{y,T_i}}$, we employ adversarial training (Goodfellow et al., 2014) in which a separate discriminator $D^i_{\theta_D}$ per task is trained to distinguish between "real" transitions $(s_y, a_y, s'_y) \sim \sigma_{\pi_{y,T_i}}$ and "fake" transitions $(\hat{s}_y, \hat{a}_y, \hat{s}'_y) \sim \sigma^{x \to y}_{\hat{\pi}_{x,T_i}}$, where $\hat{s}_y = f_{\theta_f}(s_x)$, $\hat{a}_y = \pi_{y,T_i}(\hat{s}_y)$, $\hat{s}'_y = f_{\theta_f}(P^x_{\theta_P}(s_x, g_{\theta_g}(\hat{a}_y)))$, and $P^x_{\theta_P}$ is a fitted model of the $x$ domain dynamics (see Figure 1(b)). The generator, consisting of $f_{\theta_f}, g_{\theta_g}$, is trained to fool the discriminator while maximizing policy performance. The distribution matching gradients are backpropagated through the learned dynamics, $\pi_{y,T_i}$ is learned by Imitation Learning (IL) on $D_{M_{y,T_i}}$, and the policy performance objective on $\hat{\pi}_{x,T_i}$ is achieved by IL on $D_{M_{x,T_i}}$. In this work, we use behavioral cloning (Pomerleau, 1991) for IL. We aim to find a saddle point of the following objective over $\{f, g\} \cup \{D^i_{\theta_D}\}_{i=1}^{N}$:
$$\min_{\theta_f, \theta_g}\ \max_{\{\theta^i_D\}}\ \sum_{i=1}^{N} \Big( \mathbb{E}_{D_{M_{x,T_i}}}\!\big[ D_{KL}\big(\pi_{x,T_i}(\cdot|s_x)\,\|\,\hat{\pi}_{x,T_i}(\cdot|s_x)\big) \big] + \mathbb{E}_{\sigma_{\pi_{y,T_i}}}\!\big[ \log D^i_{\theta_D}(s_y, a_y, s'_y) \big] + \mathbb{E}_{\sigma^{x \to y}_{\hat{\pi}_{x,T_i}}}\!\big[ \log\big(1 - D^i_{\theta_D}(\hat{s}_y, \hat{a}_y, \hat{s}'_y)\big) \big] \Big) \quad (8)$$
where $D_{KL}$ is the KL-divergence. We provide the execution flow of GAMA in Algorithm 1.

**Algorithm 1** Generative Adversarial MDP Alignment (GAMA)

- **input:** alignment task set $D_{x,y} = \{(D_{M_{x,T_i}}, D_{M_{y,T_i}})\}_{i=1}^{N}$ of unpaired trajectories; fitted expert policies $\pi_{y,T_i}$
- **while** not done **do**:
  - **for** $i = 1, \ldots, N$ **do**:
    - Sample $(s_x, a_x, s'_x) \sim D_{M_{x,T_i}}$, $(s_y, a_y, s'_y) \sim D_{M_{y,T_i}}$ and store in buffers $B^i_x, B^i_y$
    - **for** $j = 1, \ldots, M$ **do**:
      - Sample a mini-batch from $B^i_x, B^i_y$
      - Update the dynamics model with $\hat{\mathbb{E}}_{x,T_i}\big[\nabla_{\theta_P} \| P^x_{\theta_P}(s_x, a_x) - s'_x \|^2\big]$
      - Update the discriminator with $\hat{\mathbb{E}}_{y,T_i}\big[\nabla_{\theta_D} \log D^i_{\theta_D}(s_y, a_y, s'_y)\big] + \hat{\mathbb{E}}_{x,T_i}\big[\nabla_{\theta_D} \log(1 - D^i_{\theta_D}(\hat{s}_y, \hat{a}_y, \hat{s}'_y))\big]$
      - Update the alignments $(f_{\theta_f}, g_{\theta_g})$ with gradients $\hat{\mathbb{E}}_{x,T_i}\big[\nabla_{\theta_f} \log D^i_{\theta_D}(\hat{s}_y, \hat{a}_y, \hat{s}'_y)\big] + \hat{\mathbb{E}}_{x,T_i}\big[\nabla_{\theta_f} (\hat{\pi}_{x,T_i}(s_x) - a_x)^2\big]$ and $\hat{\mathbb{E}}_{x,T_i}\big[\nabla_{\theta_g} \log D^i_{\theta_D}(\hat{s}_y, \hat{a}_y, \hat{s}'_y)\big] + \hat{\mathbb{E}}_{x,T_i}\big[\nabla_{\theta_g} (\hat{\pi}_{x,T_i}(s_x) - a_x)^2\big]$

In the adaptation step, we are given expert demonstrations $D_{M_{y,T}}$ of a new target task $T$, from which we fit an expert domain policy $\pi_{y,T}$ that is composed with the learned alignments to construct an adapted self policy $\hat{\pi}_{x,T} = g_{\theta_g} \circ \pi_{y,T} \circ f_{\theta_f}$. We also experiment with a demonstration adaptation method which additionally trains an inverse state map $f^{-1} : S_y \to S_x$, adapts demonstrations $D_{M_{y,T}}$ into the self domain via $f^{-1}, g$, and applies behavioral cloning on the adapted demonstrations (see Figure 1(c)). Notably, our entire procedure requires neither paired, aligned demonstrations nor an RL step.
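To make the updates in Algorithm 1 concrete, below is a minimal PyTorch-style sketch (our reconstruction, not the released implementation) of one inner-loop update for a single alignment task $T_i$. The names (`f`, `g`, `disc`, `dyn`, `pi_y`, `gama_update`) and module interfaces are assumptions; the expert policy is treated as deterministic and frozen, and the optimizers are standard `torch.optim` instances.

```python
# Minimal sketch of one GAMA inner-loop update for a single alignment task T_i.
import torch
import torch.nn.functional as F

def gama_update(f, g, disc, dyn, pi_y,
                s_x, a_x, s_x_next,      # mini-batch from D_{M_{x,T_i}}
                s_y, a_y, s_y_next,      # mini-batch from D_{M_{y,T_i}}
                opt_fg, opt_disc, opt_dyn, lam=1.0):
    # 1. Fit the self-domain dynamics model P^x (used to backprop through s'_y_hat).
    opt_dyn.zero_grad()
    F.mse_loss(dyn(s_x, a_x), s_x_next).backward()
    opt_dyn.step()

    def fake_transition():
        # "Fake" expert-domain transition (s_y_hat, a_y_hat, s'_y_hat) induced by
        # the composite policy pi_hat_x = g o pi_y o f and the learned dynamics.
        s_y_hat = f(s_x)
        a_y_hat = pi_y(s_y_hat)
        s_y_next_hat = f(dyn(s_x, g(a_y_hat)))
        return torch.cat([s_y_hat, a_y_hat, s_y_next_hat], dim=-1)

    # 2. Discriminator: separate real expert transitions from mapped self transitions.
    opt_disc.zero_grad()
    real_logits = disc(torch.cat([s_y, a_y, s_y_next], dim=-1))
    fake_logits = disc(fake_transition().detach())
    disc_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    disc_loss.backward()
    opt_disc.step()

    # 3. Alignment maps (f, g): fool the discriminator (distribution matching) while
    #    behaviorally cloning the self demonstrations with pi_hat_x = g o pi_y o f.
    opt_fg.zero_grad()
    fake_logits = disc(fake_transition())
    adv_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    bc_loss = F.mse_loss(g(pi_y(f(s_x))), a_x)
    (lam * adv_loss + bc_loss).backward()
    opt_fg.step()
```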
**Related Works:** Closely related to DAIL, the field of cross domain transfer learning in the context of RL has explored approaches that use state maps to exploit cross domain demonstrations in a pretraining procedure for a new target task for which a self domain reward function is available. Canonical Correlation Analysis (CCA) (Hotelling, 1936) finds invertible projections into a basis in which data from different domains are maximally correlated. These projections can then be composed to obtain a direct correspondence map between states. (Ammar et al., 2015; Joshi & Chowdhary, 2018) have utilized an unsupervised manifold alignment (UMA) algorithm which finds a linear map between states with similar local geometric properties. UMA assumes the existence of hand crafted features along with a distance metric between them. This family of work commonly uses a linear state map to define a time-step wise transfer reward and executes an RL step on the new task. Similar to our work, these works use an alignment task set of unpaired, unaligned trajectories to compute the state map. Unlike these works, we learn maps that preserve MDP structure, use deep neural network state and action maps, and achieve zero-shot transfer to the new task without an RL step.

More recent work in transfer learning across embodiment (Gupta et al., 2017) and viewpoint (Liu et al., 2018; Sermanet et al., 2018) mismatch obtains state correspondences from an alignment task set comprising paired, time-aligned demonstrations and uses them to learn a state map or a state encoder to a domain invariant feature space. In contrast to this family of prior work, our approach learns both state and action maps from unpaired, unaligned demonstrations. Also, we remove the need for additional environment interactions and an expensive RL procedure on the target task by leveraging the action map for zero-shot imitation. (Stadie et al., 2017) have shown promise in using a domain confusion loss and generative adversarial imitation learning (Ho & Ermon, 2016) for learning across small viewpoint mismatch without an alignment task set, but their approach fails in dealing with large viewpoint differences. Unlike (Stadie et al., 2017), we leverage the alignment task set to succeed at imitating across larger viewpoint mismatch and do not require an RL procedure. Some recent works (Liu et al., 2020) have proposed matching only state occupancy measures for imitation across dynamics mismatch. We compare our method to an appropriate distillation of such methods. MDP homomorphisms (Ravindran & Barto, 2002) have been explored with the aim of compressing state and action spaces to facilitate planning. In a similar vein, related works have proposed MDP similarity metrics based on bisimulation methods (Ferns et al., 2004) and Boltzmann machine reconstruction error (Ammar et al., 2014).
While conceptually related to our MDP alignability theory, these works have not proposed scalable procedures to discover the homomorphisms and have not drawn connections to domain adaptive learning.

## 6. Experiments

Our experiments were designed to answer the following questions: (1) Can GAMA uncover MDP reductions? (2) Can the learned alignments $(f_{\theta_f}, g_{\theta_g})$ be leveraged to succeed at DAIL? We propose two metrics to measure DAIL performance. First, alignment complexity: the number of MDP pairs, i.e. the number of tasks, in the alignment task set needed to learn alignments that enable zero-shot imitation, given ample cross domain demonstrations for the target tasks. Second, adaptation complexity: the amount of cross domain demonstrations for the target tasks needed to successfully imitate tasks in the self domain without querying the target task reward function, given a sufficiently large alignment task set. Note that we include experiments with MDP pairs that are not perfectly alignable, yet intuitively share structure, to show the general applicability of GAMA for DAIL.

Figure 5. Adaptation Complexity (panels: D-R2P, D-R2R, E-R2P, E-R2R, V-R2R, V-R2W). Notably, the adaptation complexity of GAMA is close to that of the Self-Demo baseline. Baselines fail at DAIL, mostly due to failing the alignment step. Results are averaged across 5 runs.

We experiment with environments which are extensions of OpenAI Gym (Brockman et al., 2016): pen, cart, reach2, reach3, reach2-tp, snake3, and snake4 denote the pendulum, cartpole, 2-link reacher, 3-link reacher, third person 2-link reacher, 3-link snake, and 4-link snake environments, respectively. (self domain) ↔ (expert domain) specifies an MDP pair in the alignment task set. Model architectures and environment details are further described in Appendices B, C, and D. We study two ablations of GAMA and compare against the following baselines:

- **GAMA - Policy Adapt (GAMA-PA):** learns alignments by Algorithm 1, fits an expert policy $\pi_{y,T}$ to $D_{M_{y,T}}$ for a new target task $T$, and zero-shot adapts $\pi_{y,T}$ to the self domain via $\hat{\pi}_{x,T} = g_{\theta_g} \circ \pi_{y,T} \circ f_{\theta_f}$.
- **GAMA - Demonstration Adapt (GAMA-DA):** trains $f^{-1}$ in addition to Algorithm 1, adapts $D_{M_{y,T}}$ into the self domain via $(f^{-1}, g)$, and fits a self domain policy on the adapted demonstrations (see the sketch after this list).
- **Self Demonstrations (Self-Demo):** we behavioral clone on self domain demonstrations of the target task. This baseline sets an "upper bound" for the adaptation complexity.
- **Canonical Correlation Analysis (CCA)** (Hotelling, 1936): finds invertible linear transformations to a space where domain data are maximally correlated, given unpaired, unaligned demonstrations.
- **Unsupervised Manifold Alignment (UMA)** (Ammar et al., 2015): finds a map between states that have similar local geometries from unpaired, unaligned demonstrations.
- **Invariant Features (IF)** (Gupta et al., 2017): finds invertible maps onto a feature space given state pairings. Dynamic Time Warping (Muller, 2007) is used to obtain the pairings.
- **Imitation from Observation (IfO)** (Liu et al., 2018): learns a state map conditioned on a cross domain observation given state pairings. Dynamic Time Warping (Muller, 2007) is used to obtain the pairings.
- **Third Person Imitation Learning (TPIL)** (Stadie et al., 2017): simultaneously learns a domain agnostic feature space while matching distributions in that feature space.
- **State-Alignment based Imitation Learning (SAIL)** (Liu et al., 2020): distribution matching imitation learning with a state occupancy matching objective.
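As referenced in the GAMA-DA item above, demonstration adaptation is a small amount of code once the inverse state map is available; the sketch below is illustrative only, and `fit_bc` stands in for any behavioral cloning routine (it is not a function from the paper).

```python
# Sketch of the GAMA-DA variant: adapt expert-domain demonstrations into the self
# domain with the learned inverse state map f_inv : S_y -> S_x and action map
# g : A_y -> A_x, then behavioral-clone a self-domain policy on the result.
def adapt_demonstrations(demos_y, f_inv, g):
    """demos_y: list of expert-domain trajectories [(s_y, a_y), ...]."""
    return [[(f_inv(s_y), g(a_y)) for (s_y, a_y) in traj] for traj in demos_y]

def gama_da(demos_y_target, f_inv, g, fit_bc):
    adapted = adapt_demonstrations(demos_y_target, f_inv, g)
    return fit_bc(adapted)   # self-domain policy pi_hat_{x,T} fit on adapted demos
```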
### 6.1. MDP Alignment Evaluation

Figure 4 visualizes the learned state map $f_{\theta_f}$ for several MDP pairs. The pen ↔ pen alignment task (Figure 4, Top Left) and the reach2 ↔ reach2-tp task (Table 1) exemplify MDP pairs that are permutations. Similarly, the pen ↔ cart alignment task (Figure 4, Top Right) has a reduction that maps the pendulum's angle and angular velocity to those of the pole, as the cart's position and velocity are redundant state dimensions once an optimal policy has been learned. Refer to Figure 7 in the Appendix for an explanation of the poor UMA performance.

Figure 4. MDP Alignment Visualization. The state maps learned by GAMA and two representative baselines - CCA and IF - are shown for pen ↔ pen (Top Left), pen ↔ cart (Top Right), snake4 ↔ snake3 (Bottom Left), and reach2 ↔ reach3 (Bottom Right). See Appendix E for more baselines. GAMA is able to recover MDP permutations for the alignable pairs pen ↔ pen, pen ↔ cart and to find meaningful correspondences between the "weakly alignable" pairs snake4 ↔ snake3, reach2 ↔ reach3. See https://youtu.be/l0tc1JCN_1M for videos.

Table 1 presents quantitative evaluations of these simple alignment maps. For pen ↔ pen and reach2 ↔ reach2-tp we record the average $\ell_2$ loss between the learned state map prediction and the ground truth permutation. For pen ↔ cart, we do the same on the dimensions for the pole's angle and angular velocity. Both Figure 4 and Table 1 show that GAMA is able to learn simple reductions while baselines mostly fail to do so. The key reason behind this performance gap is that many baselines (Gupta et al., 2017; Liu et al., 2018) obtain state maps from time-aligned demonstration data using Dynamic Time Warping (DTW). However, the considered alignment tasks contain unaligned demonstrations with diverse starting states, up to 2x differences in demonstration lengths, and varying task execution rates. We see that GAMA also outperforms baselines that learn from unaligned demonstrations (Hotelling, 1936; Ammar et al., 2015) by learning maps that preserve MDP structure with more flexible neural network function approximators.

Table 1. MDP Alignment Performance. Mean $\ell_2$ loss between the learned state map predictions and the ground truth permutation. On average, GAMA has 17.3× lower loss than the best baseline. Results are averaged across 5 seeds.

| | GAMA (ours) | CCA | UMA | IF | IfO | Random |
|---|---|---|---|---|---|---|
| pen ↔ pen | 0.057 ± 0.017 | 0.72 ± 0.25 | >100 | 2.50 ± 1.08 | 2.24 ± 0.82 | >100 |
| pen ↔ cart | 0.178 ± 0.051 | 3.92 ± 3.77 | >100 | 1.62 ± 0.52 | 3.31 ± 1.2 | >100 |
| reach2 ↔ reach2-tp | 0.092 ± 0.043 | 10.14 ± 5.31 | >100 | 12.41 ± 3.12 | 5.12 ± 2.41 | >100 |

For snake4 ↔ snake3 and reach2 ↔ reach3, the MDPs may not be perfectly alignable, yet they intuitively share structure. From Figure 4 (Bottom Left) we see that GAMA identically matches two adjacent joint angles of snake4 to the two joint angles of snake3, and the periodicity of the snake's wiggle is preserved. On reach2 ↔ reach3, GAMA learns a state map that matches the first joint angles and states that have similar extents of contraction.
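The Table 1 metric can be computed as in the short sketch below (our own, with hypothetical helper names); `dims` restricts the comparison to a subset of state dimensions, e.g. the pole's angle and angular velocity for pen ↔ cart.

```python
# Sketch of the alignment evaluation: mean l2 error between the learned state
# map's predictions and the ground-truth permutation on a set of sampled states.
import numpy as np

def alignment_l2_error(f, states_x, true_permutation, dims=None):
    """f: learned state map S_x -> S_y; true_permutation: ground-truth map."""
    pred = np.stack([f(s) for s in states_x])
    target = np.stack([true_permutation(s) for s in states_x])
    if dims is not None:               # e.g. dims = [pole_angle_idx, pole_ang_vel_idx]
        pred, target = pred[:, dims], target[:, dims]
    return float(np.mean(np.linalg.norm(pred - target, axis=-1)))
```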
### 6.2. DAIL Performance

We evaluate DAIL performance on six problems that span embodiment, viewpoint, and dynamics mismatch scenarios. See Appendix D for further details on each problem.

- **Dynamics-Reach2Reach (D-R2R):** Self domain is reach2 and expert domain is reach2 with isotropic Gaussian noise injected into the dynamics. We use the robot's joint level state-action space. The $N$ alignment tasks are reaching for $N$ goals, and the target tasks are reaching for 12 new goals placed maximally far away from the alignment task goals.
- **Dynamics-Reach2Push (D-R2P):** Same as D-R2R except the target task is pushing a block to a goal location.
- **Embodiment-Reach2Reach (E-R2R):** Self domain is reach2 and expert is reach3. The rest is the same as D-R2R.
- **Embodiment-Reach2Push (E-R2P):** Self domain is reach2 and expert is reach3. The rest is the same as D-R2P.
- **Viewpoint-Reach2Reach (V-R2R):** Self domain is reach2 and expert domain is reach2-tp1, which has the same "third person" view state space as that in (Stadie et al., 2017) with a 30° planar offset. We use the robot's joint level state-action space. The alignment/target tasks are the same as in D-R2R.
- **Viewpoint-Reach2Write (V-R2W):** Self domain is reach2 and expert domain is reach2-tp2, which has a different "third person" view state space with a 180° axial offset. We use the robot's joint level state-action space. The $N$ alignment tasks are reaching for $N$ goals and the target task is tracing letters as fast as possible. The transfer task differs from the alignment tasks in two key aspects: the end effector must draw a straight line from a letter's vertex to vertex, and it must not slow down at the vertices in order to trace.

Figure 6. Alignment Complexity (panels: D-R2P, D-R2R, E-R2P, E-R2R, V-R2R, V-R2W). Baselines cannot perform zero-shot imitation. The Pretrain baseline shows the zero-shot performance of a policy directly pretrained on the self domain alignment tasks, when possible. Results are averaged across 5 runs.

Alignment complexity on the six problems is shown in Figure 6. GAMA (light blue) is able to learn alignments that enable zero-shot imitation on the target task, showing clear gains over a simple pretraining procedure (orange) on the self domain MDPs in the alignment task set. Other baselines require an additional expensive RL step and thus cannot zero-shot imitate. Figure 5 shows the adaptation complexity. Notably, GAMA (light blue) produces adapted demonstrations of similar usefulness as self demonstrations (olive green). Most baselines fail to learn alignments from unpaired, unaligned demonstrations and as a result fail at DAIL. TPIL succeeds at V-R2R, but fails at V-R2W, which has a significantly larger viewpoint mismatch than V-R2R. SAIL outperforms GAMA and even the Self-Demo baseline, but it is important to note that SAIL uses the self domain environment simulator, unlike GAMA and Self-Demo.

### 6.3. DAIL with Visual Inputs

The non-visual environment experiments in the previous section demonstrate the limitations of the time-alignment assumptions made in prior work without confounding variables such as the difficulty of optimization in high-dimensional spaces. In this section, we introduce two more variants of our method, GAMA-PA-vis and GAMA-DA-vis, which demonstrate that GAMA scales to higher dimensional, visual environments with 64 × 64 × 3 image states. Specifically, we train a deep spatial autoencoder on the alignment task set to learn an encoder with the architecture from (Levine et al., 2016), then apply GAMA on the (learned) latent space. Results are shown in Figures 6 and 5. We see that the alignment and adaptation complexity of GAMA-PA-vis (dark blue, solid) and GAMA-DA-vis (dark blue, dotted) are both similar to those of GAMA-DA (light blue, solid) and GAMA-PA (light blue, dotted), and better than baselines trained with the robot's joint-level representation.
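A minimal sketch of the visual variant is shown below (ours; an arbitrary pretrained `encoder` module stands in for the deep spatial autoencoder's encoder, and `f`, `g`, `pi_y` are the alignment maps and expert policy learned on the latent space). Image states are encoded into a low-dimensional latent space, and the composite policy operates on those latents.

```python
# Sketch of GAMA-PA-vis at adaptation time: encode 64x64x3 image states with a
# frozen pretrained encoder, then run the latent-space composite policy.
import torch

def make_visual_adapted_policy(encoder, f, g, pi_y):
    """encoder: images -> latents (frozen); f, g, pi_y as in the non-visual setting."""
    encoder.eval()
    def policy(image_state):                         # image_state: (3, 64, 64) tensor
        with torch.no_grad():
            z_x = encoder(image_state.unsqueeze(0))  # latent self-domain "state"
            return g(pi_y(f(z_x))).squeeze(0)        # composite policy on latents
    return policy
```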
## 7. Discussion and future work

We have formalized Domain Adaptive Imitation Learning, which encompasses prior work in transfer learning across embodiment (Gupta et al., 2017) and viewpoint differences (Stadie et al., 2017; Liu et al., 2018), along with a practical algorithm that can be applied to both scenarios. We now point out directions for future work. Our MDP alignability theory is a first step towards formalizing possible shared structures that enable cross domain imitation. While we have shown that GAMA empirically works well even when MDPs are not perfectly alignable, upcoming works may explore relaxing the conditions for MDP alignability to develop a theory that covers an even wider range of real world MDPs. Future works may also try applying GAMA in the imitation from observations scenario, i.e. when actions are not available, by aligning observations with GAMA and applying methods from (Sermanet et al., 2018; Liu et al., 2018). Finally, we hope to see future works develop principled ways to design a minimal alignment task set, which is analogous to designing a minimal training set for supervised learning.

## Acknowledgements

This research was supported by Sony, NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), and AFOSR (FA9550-19-1-0024).

## References

Ammar, H. B., Eaton, E., Ruvolo, P., and Taylor, M. E. An automated measure of MDP similarity for transfer in reinforcement learning. 2014.

Ammar, H. B., Eaton, E., Ruvolo, P., and Taylor, M. E. Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment. 2015.

Billingsley, P. Convergence of Probability Measures. 1968.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Ferns, N., Panangaden, P., and Precup, D. Metrics for finite Markov decision processes. In UAI, 2004.

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. International Conference on Robotics and Automation, 2015.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Gupta, A., Devin, C., Liu, Y. X., Abbeel, P., and Levine, S. Learning invariant feature spaces to transfer skills with reinforcement learning. International Conference on Learning Representations, 2017.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565-4573, 2016.

Hotelling, H. Relations between two sets of variates. Biometrika, 28, 1936.

Jones, S. S. The development of imitation in infancy. Philos Trans R Soc Lond B Biol Sci., 364:2325-2335, 2009.

Joshi, G. and Chowdhary, G. Cross-domain transfer in reinforcement learning using target apprentice. 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, December 2014.

Kuniyoshi, Y. and Inoue, H. Qualitative recognition of ongoing human action sequences. International Joint Conference on Artificial Intelligence, 1993.

Kuniyoshi, Y., Inaba, M., and Inoue, H. Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Trans. Robot. Autom., 10:799-822, 1994.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334-1373, 2016.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Liu, F., Ling, Z., Mu, T., and Su, H. State alignment-based imitation learning. International Conference on Learning Representations, 2020.

Liu, Y., Gupta, A., Abbeel, P., and Levine, S. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374, 2018.

Marshall, P. J. and Meltzoff, A. N. Body maps in the infant brain. Trends Cogn Sci., 19:499-505, 2015.

Muller, M. Dynamic time warping. Information Retrieval for Music and Motion, pp. 69-84, 2007.

Ortner, R. Combinations and mixtures of optimal policies in unichain Markov decision processes are optimal. arXiv preprint arXiv:0508319, 2005.

Pomerleau, D. A. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88-97, 1991. ISSN 0899-7667.

Ravindran, B. and Barto, A. G. Model minimization in hierarchical reinforcement learning. In SARA, 2002.
Rizzolatti, G. and Craighero, L. The mirror neuron system. Annual Review of Neuroscience, 27:169-192, 2004.

Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., and Levine, S. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2018.

Stadie, B., Abbeel, P., and Sutskever, I. Third person imitation learning. In ICLR, 2017.

Syed, U., Bowling, M., and Schapire, R. E. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, pp. 1032-1039. ACM, July 2008. ISBN 9781605582054. doi: 10.1145/1390156.1390286.

Wang, T., Liao, R., Ba, J., and Fidler, S. NerveNet: Learning structured policy with graph neural networks. International Conference on Learning Representations, 2018.