# Cross-domain Imitation from Observations

Dripta S. Raychaudhuri\*¹  Sujoy Paul\*²  Jeroen van Baar³  Amit K. Roy-Chowdhury¹

\*Equal contribution. Work done partially at UCR. ¹University of California, Riverside. ²Google Research. ³Mitsubishi Electric Research Laboratories. Correspondence to: Dripta S. Raychaudhuri.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

**Abstract.** Imitation learning seeks to circumvent the difficulty in designing proper reward functions for training agents by utilizing expert behavior. With environments modeled as Markov Decision Processes (MDPs), most of the existing imitation algorithms are contingent on the availability of expert demonstrations in the same MDP as the one in which a new imitation policy is to be learned. In this paper, we study how to imitate tasks when discrepancies exist between the expert and agent MDPs. These discrepancies across domains could include differing dynamics, viewpoint, or morphology; we present a novel framework to learn correspondences across such domains. Importantly, in contrast to prior works, we use unpaired and unaligned trajectories containing only states in the expert domain to learn this correspondence. We utilize a cycle-consistency constraint on both the state space and a domain-agnostic latent space to accomplish this. In addition, we enforce consistency on the temporal position of states via a normalized position estimator function to align the trajectories across the two domains. Once this correspondence is found, we can directly transfer demonstrations from one domain to the other and use them for imitation. Experiments across a wide variety of challenging domains demonstrate the efficacy of our approach.

## 1. Introduction

Humans possess the innate ability to quickly pick up a new behavior by simply observing others performing the same skill. Not only are we able to learn from demonstrations coming from a third-person point of view, but we are also capable of imitating experts who are morphologically different or have different embodiments, as evidenced by a child imitating an adult with different biomechanics (Jones, 2009). Previous works in neuroscience (Rizzolatti & Craighero, 2004; Marshall & Meltzoff, 2015) have attributed this to the human capacity of learning structure-preserving domain correspondences via an invariant feature space (Umiltà et al., 2008), which allows us to reconstruct the observed behavior in the self-domain.

Figure 1. Problem overview. Cross-domain Imitation from Observation (xDIO) entails learning from experts which are different from the agent. Here, the expert is a 4-legged Ant, while the agent is a HalfCheetah. We learn a domain transformation function from unpaired, unaligned, state-only trajectories from a set of proxy tasks and utilize it to imitate the expert on the given inference task.

While imitation learning algorithms (Ho & Ermon, 2016; Ross et al., 2011) are successful, to some extent, in endowing autonomous agents with this ability to imitate expert behavior, they impose the somewhat unrealistic requirement that the demonstrations must come from the same domain, whether that be first-person viewpoint, the same morphology or similar dynamics.
The question then arises: can we perform imitation learning which can overcome all such domain discrepancies? Prior work on bridging domain disparities in imitation learning has focused on each of these differences in isolation: morphology (Gupta et al., 2017), dynamics (Gangwani & Peng, 2019) and viewpoint mismatch (Stadie et al., 2017; Sharma et al., 2019; Liu et al., 2018). These works (Gupta et al., 2017; Liu et al., 2018; Sharma et al., 2019) utilize paired, time-aligned demonstrations from both domains, on a set of proxy tasks, to first build a correspondence map across the domains and then perform an extra reinforcement learning (RL) step for learning the final policy on the given task. This limits their applicability, since paired demonstrations are rarely available and RL procedures are expensive.

Recently, (Kim et al., 2020) proposed a general framework which can perform imitation across a wide array of such discrepancies from unpaired, unaligned demonstrations. However, they require expert actions, such as the exact kinematic forces, in order to learn a domain correspondence, and they assume availability of an expert policy which is utilized in an interactive learning setting. This is distinctly different from how humans imitate: we are capable of learning behaviors solely from observations/states, without access to the underlying actions. Furthermore, continuously querying the expert might be onerous in several situations. Thus, we require a mechanism for learning policies from observations alone, where the expert demonstrations can originate in a domain which is different from the agent domain and access to the expert is limited. We define this setting as Cross-domain Imitation from Observation (xDIO).

In this work, we propose a novel framework to tackle the xDIO problem, encompassing morphological, viewpoint and dynamics mismatch. We follow a two-step approach (see Fig. 1), where we first learn a transformation across the domains using the proxy tasks (Gupta et al., 2017), followed by a transfer process and subsequent learning of the policy. Importantly, in contrast to previous work, we use unpaired and unaligned trajectories containing only states in the expert domain to learn this transformation. Additionally, we do not assume any access to the expert policy or the expert domain beyond the given demonstrations. To learn the state correspondences, we jointly minimize a divergence between the transition distributions in the state space, as well as in the latent space, between the expert and the agent proxy task trajectories, while learning to translate between the two domains with the unpaired data via cycle-consistency (Zhu et al., 2017). However, solely learning with such state cycle-consistency may only result in local alignment and lead to difficulties in optimizing for complex environments. Thus, to impose global alignment, we enforce additional consistency on the temporal position of states across the two domains. This ensures that when a state is mapped from one domain to the other, the degree of completion associated with being in that state remains unchanged. Having learnt this mapping on the proxy tasks, we transfer demonstrations for a new inference task from the expert to the agent domain, which are subsequently utilized to learn a policy via imitation.
Experiments on a wide array of domains that encompass dynamics, morphological and viewpoint mismatch demonstrate the feasibility of learning domain correspondences from unpaired and unaligned state-only demonstrations. The primary contributions of this work are as follows:

1. We propose an algorithm for cross-domain imitation learning by learning transformations across domains, modeled as Markov Decision Processes (MDPs), from unpaired, unaligned, state-only demonstrations, thereby ameliorating the need for costly paired, aligned data.
2. Unlike previous work, we do not utilize any costly RL procedure, nor do we require interactive querying of an expert policy.
3. We adopt multiple tasks in the MuJoCo physics engine (Todorov et al., 2012) and show that our framework can find correspondences and align two domains across different viewpoints, dynamics and morphologies.

## 2. Related Works

**Imitation learning.** Imitation learning (Schaal, 1999) uses a set of expert demonstrations to learn a policy which successfully mimics the expert. A common approach is behavioral cloning (BC) (Pomerleau, 1989; Bojarski et al., 2016), which amounts to learning to mimic the expert demonstrations via supervised learning. Inverse reinforcement learning (IRL) is another approach, where one seeks to learn a reward function that explains the demonstrated actions (Ho & Ermon, 2016; Abbeel & Ng, 2004; Ziebart et al., 2008). Recent works (Torabi et al., 2018; Yang et al., 2019; Paul et al., 2019) extend imitation learning to state-only demonstrations, where expert actions are not observed; this opens up the possibility of using imitation in robotics and learning from weak-supervision sources such as videos. Unlike these approaches, our work tackles the problem of imitation from state-only demonstrations coming from a different domain.

**Domain transfer in reinforcement learning.** Transfer in the reinforcement learning setting has been attempted by a wide array of works (Taylor & Stone, 2009). (Ammar & Taylor, 2011) manually define a common state space between MDPs and use it to learn a mapping between states. Unsupervised manifold alignment is used in (Ammar et al., 2015) to learn a linear map between states with similar local geometric properties. However, they assume the existence of hand-crafted features along with a distance metric between them, which limits applicability. Recent works in transfer learning across mismatches in embodiment (Gupta et al., 2017) and viewpoint (Liu et al., 2018; Sharma et al., 2019) obtain state correspondences from a proxy task set comprising paired, time-aligned demonstrations, and use them to learn a state map or a state encoder to a domain-invariant feature space. (Kim et al., 2020) proposed a framework which can learn a map across domains from unpaired, unaligned demonstrations. However, they require expert actions to train the framework, along with access to an online expert. Furthermore, most of these approaches (Gupta et al., 2017; Liu et al., 2018) utilize an RL step which incurs additional computational cost. In contrast to these methods, our approach learns an MDP structure-preserving state map from unpaired, unaligned demonstrations without requiring access to expert actions, additional RL, or online experts.
Table 1. Comparison to prior work using attributes demonstrated in the paper. xDIO satisfies all the criteria desired in a holistic domain adaptive imitation framework.

| Method | Unpaired trajectories | Only states | No online expert | No RL |
|---|---|---|---|---|
| IF (Gupta et al., 2017) | ✗ | ✓ | ✓ | ✗ |
| DAIL (Kim et al., 2020) | ✓ | ✗ | ✗ | ✗ |
| Ours | ✓ | ✓ | ✓ | ✓ |

Viewpoint-agnostic imitation has also been tackled in (Stadie et al., 2017), where a combination of adversarial learning (Ho & Ermon, 2016) and domain confusion (Tzeng et al., 2014) is used to learn a policy without a proxy set. However, it fails to account for large variations in viewpoint, as well as sub-optimal trajectories from the expert domain. From a theoretical perspective, our approach aligns with the objective of MDP homomorphisms (Ravindran, 2004). Similar ideas are explored in learning MDP similarity metrics via bisimulation (Ferns et al., 2011) and Boltzmann machine reconstruction error (Ammar et al., 2014). However, these works find homomorphisms within an MDP and do not provide ways to discover homomorphisms across MDPs.

**Cycle-consistency.** Our work draws inspiration from the literature on cycle-consistency (Zhu et al., 2017; Hoffman et al., 2018; Smith et al., 2019). CycleGAN (Zhu et al., 2017) introduced cycle-consistency to learn bidirectional transformations between domains via Generative Adversarial Networks (Goodfellow et al., 2014) for unpaired image-to-image translation. This was extended to domain adaptation in (Hoffman et al., 2018). Similar techniques are applied in sim-to-real transfer (Ho et al., 2020; Gamrian & Goldberg, 2019). Recently, (Rao et al., 2020) proposed RL-CycleGAN to perform sim-to-real transfer by adding extra supervision from the Q-value function. Unlike these works, which are restricted to visual alignment, we propose to learn alignments across differing dynamics/morphology.

## 3. Problem Setting

Before formally defining the xDIO problem, we first lay the groundwork in terms of notation. Following (Kim et al., 2020), we define a domain as a tuple $(S, A, P, P_0)$, where $S$ denotes the state space, $A$ is the action space, $P$ is the dynamics or transition function, and $P_0$ is the initial distribution over the states. Given an action $a \in A$, the next state is governed by the transition dynamics as $s' \sim P(s' \mid s, a)$. An infinite-horizon Markov Decision Process (MDP) is defined subsequently by adding a reward function $r: S \times A \to \mathbb{R}$ and a discount factor $\gamma \in [0, 1]$ to the domain tuple. Thus, while the domain typifies only the agent morphology and the dynamics, augmenting the domain with a reward and discount factor describes an MDP for a particular task. We define an MDP in some domain $x$ for a task $T$ as $\mathcal{M}^T_x = \langle S_x, A_x, P_x, r^T_x, \gamma^T_x, P_{0x} \rangle$. A policy is a map $\pi^T_x : S_x \to \mathcal{B}(A_x)$, where $\mathcal{B}$ is the set of all probability measures on $A_x$. A trajectory corresponding to the task $T$ in domain $x$ is a sequence of states $\eta_{\mathcal{M}^T_x} = \{s^0_x, s^1_x, \ldots, s^{H_\eta}_x\}$, where $H_\eta$ denotes the length of the trajectory. We denote by $\mathcal{D}_{\mathcal{M}^T_x} = \{\eta^i_{\mathcal{M}^T_x}\}_{i=1}^N$ a set of such trajectories. In our work, we consider two domains, expert and agent, indicated by $\mathcal{M}^T_E$ and $\mathcal{M}^T_A$ respectively. The objective of xDIO is to learn an optimal policy $\pi^T_A$ in the agent domain, given state-only demonstrations $\mathcal{D}_{\mathcal{M}^T_E}$ in the expert domain. In this paper, we propose to first learn a transformation $\psi : S_E \to S_A$ between the domains and then leverage $\psi$ to imitate from the expert demonstrations.
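To make the notation concrete, the following is a minimal sketch of how state-only trajectories and the proxy-task alignment dataset introduced in the next paragraph could be represented. The container names (`Trajectory`, `ProxyTaskData`, `AlignmentDataset`) are illustrative assumptions and not part of the paper.

```python
# Minimal data containers for the xDIO setting (illustrative only; the paper
# does not prescribe an implementation). A trajectory is a sequence of states,
# and the alignment dataset pairs expert/agent trajectory sets per proxy task.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Trajectory:
    states: np.ndarray            # shape (H + 1, state_dim); no actions are stored

    def transitions(self):
        """Return consecutive state pairs (s_t, s_{t+1}) used for alignment."""
        return list(zip(self.states[:-1], self.states[1:]))

@dataclass
class ProxyTaskData:
    expert: List[Trajectory]      # unpaired, unaligned expert trajectories for proxy task j
    agent: List[Trajectory]       # agent trajectories for the same proxy task

@dataclass
class AlignmentDataset:
    proxy_tasks: List[ProxyTaskData]   # D = {(D_E^j, D_A^j)} for j = 1..M
    inference_expert: List[Trajectory] # expert-only demonstrations for the inference task T
```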
Following prior work (Gupta et al., 2017; Liu et al., 2018; Kim et al., 2020), we assume access to a dataset consisting of expert-agent trajectories for $M$ different proxy tasks: $\mathcal{D} = \{(\mathcal{D}_{\mathcal{M}^{T_j}_E}, \mathcal{D}_{\mathcal{M}^{T_j}_A})\}_{j=1}^{M}$. Proxy tasks encompass simple primitive skills in both domains and are different from the inference task $T$, for which we wish to learn the policy. We relax certain assumptions made in previous work which are critical for real-world applications. Firstly, the trajectories derived from proxy tasks are not paired, i.e., time-aligned trajectories do not exist in $\mathcal{D}$. This is crucial in real-world cases, as the tasks may not be executed at the same rate in different domains. Secondly, expert actions are not observed: such actions are difficult to obtain in various scenarios, such as videos of humans performing some task. Finally, we train in an offline fashion and do not require any expert policy for interactive querying to guide the learning process, beyond the provided demonstrations. Table 1 explicitly details how our setting differs from the ones tackled in the literature.

Once the domain transformation function $\psi$ is learnt, we use it to translate the expert-domain trajectories $\mathcal{D}_{\mathcal{M}^T_E}$ for the inference task $T$ to the agent domain, obtaining $\hat{\mathcal{D}}_{\mathcal{M}^T_A}$. An inverse dynamics model $I_A : S_A \times S_A \to A_A$ is then learnt to augment these translated trajectories with actions, similar to (Torabi et al., 2018). These are subsequently used to learn the policy $\pi^T_A$ via behavioral cloning.

## 4. Methodology

A crucial characteristic of a good domain transformation $\psi$ lies in MDP dynamics preservation. In our framework, we enforce this from both the local and global perspectives. For local alignment, we aim to ensure that optimal state transitions in $\mathcal{M}^T_E$ map to optimal transitions in $\mathcal{M}^T_A$. Our proposed method achieves this local alignment by matching the state-transition distributions defined for the true and transferred trajectories on the proxy tasks in an adversarial manner, while maintaining cycle-consistency. A latent space is learned via a mutual information objective to preserve only task-specific information. On the other hand, a learned temporal position function aims to enforce consistency on the temporal position of the states across the two domains to ensure global alignment. In the following subsections, we describe each of these components in more detail.

### 4.1. Local alignment via distribution matching

**State cycle-consistency.** We seek to map optimal transitions in the expert domain to the agent domain, and propose to learn the domain transformation $\psi$ such that the state-transition distribution is matched over the trajectories derived from the proxy tasks. We utilize adversarial training to accomplish this. Given unpaired samples $\{(s^t_E, s^{t+1}_E)\} \sim \mathcal{D}_{\mathcal{M}^{T_j}_E}$ and $\{(s^t_A, s^{t+1}_A)\} \sim \mathcal{D}_{\mathcal{M}^{T_j}_A}$ drawn from the $j$-th proxy task, the function $\psi$ is learned in an adversarial manner with a discriminator $D^j_A$, where $\psi$ tries to map $(s^t_E, s^{t+1}_E)$ onto the distribution of $(s^t_A, s^{t+1}_A)$, while $D^j_A$ tries to distinguish translated samples $(\psi(s^t_E), \psi(s^{t+1}_E))$ from real samples $(s^t_A, s^{t+1}_A)$:

$$\min_{\psi}\max_{D^j_A}\; \mathcal{L}^j_{adv} = \mathbb{E}_{(s^t_A, s^{t+1}_A)\sim\mathcal{D}_{\mathcal{M}^{T_j}_A}}\Big[\log D^j_A(s^t_A, s^{t+1}_A)\Big] + \mathbb{E}_{(s^t_E, s^{t+1}_E)\sim\mathcal{D}_{\mathcal{M}^{T_j}_E}}\Big[\log\big(1 - D^j_A(\psi(s^t_E), \psi(s^{t+1}_E))\big)\Big] \tag{1}$$

Solely optimizing this adversarial loss can lead to the model mapping the same set of states to any random permutation of states in the agent domain, since any such mapping can induce an output distribution that matches the agent state-transition distribution.
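The following is a minimal PyTorch-style sketch of the transition-level adversarial objective in Equation 1, written direction-neutrally so the same helper can serve both $\psi$ with $D^j_A$ and $\phi$ with $D^j_E$. The MLP architecture, the non-saturating BCE formulation and all sizes are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of the transition-matching adversarial loss of Eq. (1) for one proxy task.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)

def adversarial_losses(state_map, disc, s_src, s_src_next, s_tgt, s_tgt_next):
    """`disc` separates real target-domain transitions from source transitions
    translated by `state_map`, which in turn tries to fool it."""
    bce = nn.BCEWithLogitsLoss()
    real = torch.cat([s_tgt, s_tgt_next], dim=-1)
    fake = torch.cat([state_map(s_src), state_map(s_src_next)], dim=-1)

    real_logits = disc(real)
    fake_logits = disc(fake.detach())
    # Discriminator: label real transitions 1 and translated transitions 0.
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    # State map (generator): make translated transitions look real.
    gen_logits = disc(fake)
    g_loss = bce(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss

# Example instantiation for the expert-to-agent direction (dimensions assumed):
# psi = MLP(expert_state_dim, agent_state_dim)
# disc_A = MLP(2 * agent_state_dim, 1)
```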
Following (Zhu et al., 2017), we introduce cycle-consistency as a means to control this undesired effect. We learn another state map in the opposite direction, $\phi : S_A \to S_E$, by optimizing an adversarial loss $\min_{\phi}\max_{D^j_E} \mathcal{L}^j_{adv}$ with a discriminator $D^j_E$. Cycle-consistency is then imposed as:

$$\min_{\psi,\phi}\; \mathcal{L}^j_{cyc} = \mathbb{E}_{s_E\sim\mathcal{D}_{\mathcal{M}^{T_j}_E}}\Big[\big\|\phi\circ\psi(s_E) - s_E\big\|_2^2\Big] + \mathbb{E}_{s_A\sim\mathcal{D}_{\mathcal{M}^{T_j}_A}}\Big[\big\|\psi\circ\phi(s_A) - s_A\big\|_2^2\Big] \tag{2}$$

**Domain invariant latent space.** To incentivize $\psi, \phi$ to generalize beyond the proxy tasks, we use an encoder-decoder structure for the transformation function $\psi$. Concretely, $\psi = D_E \circ E_E$, where $E_E : S_E \to Z$ represents an encoder which maps a state in the expert domain to a domain-agnostic latent space $Z$, while $D_E : Z \to S_A$ represents the decoding function. $\phi = D_A \circ E_A$ is defined similarly via the same latent space $Z$. Prior work (Gupta et al., 2017) has explored learning such invariant spaces, but uses paired data from both domains, which is a very strong and often unrealistic assumption, as explained above. Inspired by work based on information-theoretic objectives (Eysenbach et al., 2018; Wan et al., 2020), we learn the latent space by minimizing the mutual information between the domain and the latent transitions:

$$\min_{E_E, E_A}\; I\big(d;\, (z^t, z^{t+1})\big) \tag{3}$$

where $(z^t, z^{t+1})$ denotes an encoded transition from either of the domains. Minimizing the mutual information between the domain ($d \in \{E, A\}$) and the encoded latent transition for the same proxy task will result in a latent space which encodes the task-specific information and filters out the domain-specific nuances. Note that we can decompose the mutual information term as $I(d; (z^t, z^{t+1})) = H(d) - H(d \mid (z^t, z^{t+1}))$, where $H(\cdot)$ denotes the entropy. Thus, our objective in Equation 3 reduces to just maximizing the conditional entropy $H(d \mid (z^t, z^{t+1}))$. Due to the intractability of this expression (Alemi et al., 2016; Poole et al., 2019), we optimize a variational lower bound $\mathbb{E}_{d,\,(s^t_d, s^{t+1}_d)\sim\mathcal{D}_{\mathcal{M}^{T_j}_d}}\big[\log q^j\big(d \mid (z^t, z^{t+1})\big)\big]$ instead, where $q^j$ denotes a variational distribution which approximates the true posterior. Here, $q^j$ is parameterized as a discriminator which outputs the probability that the generated transition comes from domain $d$ for the $j$-th proxy task. Maximizing this objective over the encoder parameters ensures that the discriminator is maximally confused and the latent transitions for the task, coming from both domains, are well aligned. The overall objective is as follows:

$$\min_{q^j}\;\max_{E_E, E_A}\; \mathcal{L}^j_{MI} = \mathbb{E}_{d,\,(s^t_d, s^{t+1}_d)\sim\mathcal{D}_{\mathcal{M}^{T_j}_d}}\Big[\log q^j\big(d \mid (z^t, z^{t+1})\big)\Big] \tag{4}$$

Additionally, we enforce consistency in the latent embedding to further constrain the learnt mapping:

$$\min_{\psi,\phi}\; \mathcal{L}^j_{z} = \mathbb{E}_{s_E\sim\mathcal{D}_{\mathcal{M}^{T_j}_E}}\Big[\big\|E_A\circ\psi(s_E) - E_E(s_E)\big\|_2^2\Big] + \mathbb{E}_{s_A\sim\mathcal{D}_{\mathcal{M}^{T_j}_A}}\Big[\big\|E_E\circ\phi(s_A) - E_A(s_A)\big\|_2^2\Big] \tag{5}$$
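Below is a sketch of how the cycle-consistency (Equation 2) and latent-consistency (Equation 5) terms could be computed, assuming $\psi$ and $\phi$ are factored through the shared latent space as encoder-decoder pairs as described above. Mean-squared error stands in for the squared $\ell_2$ norms, and all module names are illustrative.

```python
# Sketch of the cycle-consistency (Eq. 2) and latent-consistency (Eq. 5) terms,
# with psi = dec_E(enc_E(.)) mapping expert -> agent and phi = dec_A(enc_A(.))
# mapping agent -> expert through the shared latent space Z.
import torch.nn.functional as F

def consistency_losses(enc_E, dec_E, enc_A, dec_A, s_E, s_A):
    psi_sE = dec_E(enc_E(s_E))      # expert state translated into the agent domain
    phi_sA = dec_A(enc_A(s_A))      # agent state translated into the expert domain

    # Eq. (2): translating forth and back should reconstruct the original state.
    l_cyc = F.mse_loss(dec_A(enc_A(psi_sE)), s_E) + \
            F.mse_loss(dec_E(enc_E(phi_sA)), s_A)

    # Eq. (5): a state and its translation should map to the same latent code.
    l_z = F.mse_loss(enc_A(psi_sE), enc_E(s_E)) + \
          F.mse_loss(enc_E(phi_sA), enc_A(s_A))
    return l_cyc, l_z
```

The domain-confusion term of Equation 4 can be implemented analogously with a small per-task classifier $q^j$ acting on concatenated latent transitions $(z^t, z^{t+1})$.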
In order to constrain the mapping to maintain temporal semantics for a task, Cross-domain Imitation from Observations p HKk QUH4BAc AQOcg Qqog Tpo Agzuw TN4BW/ag/aiv Wsfs9SMltbsgz+mf X4Bh R2vhg= x+AWGYru E=s E ˆs A QVs Zv25jg/j15n LR2yu5e2Tnd LVUORn YUw Dr YAFv ABfug Amqg Dpo Agzvw BF7Aq3Vv PVtv1vuwd MIa9ay BX7A+Pg Fbi K7ds A s0 d Vw2Tsv6z Umhcp HKk QUH4BAc AQOcg Qqog Tpo Agzuw TN4BW/ag/aiv Wsfs9SMltbsgz+mf X4Bh R2vhg= TJ+TEKn0Sxsq WNGSm/p7Ia KT1OAps Z0TNUC96U/E/r5Oa8Mr Pu Ex Sg5LNF4Wp ICYm079Jnytk Rowto Uxeyth Q6o Mzadkg3BW3x5m TPqt5F1b07r9Su8zi Kc ATHc Aoe XEINbq EODWAwg Gd4h Td HOC/Ou/Mxby04+cwh/IHz+QM+to3FPz Inference task adaptation Local and global alignment via proxy tasks Figure 2. Framework overview. An illustration of our MDP correspondence learning framework. We perform local alignment via state-transition distribution matching and cycle-consistency in the state space using Lj adv and Lj cyc, as well as in a learnt latent space using Lj z and Lj MI(only proxy task is j shown here). The inverse cycle from agent to expert is omitted here for clarity. Global alignment is performed via consistency on the temporal position of states across the two domains, using the pre-trained position estimators P j A, P j E in Lj pos. Further improvement is obtained via inference task adaptation using Lj pos inf and Lj cyc inf - this prevents overfitting to the proxy tasks and makes the learnt transformation more robust and well-conditioned to the target data. we enforce additional consistency on the temporal position of states across the two domains. We encode the temporal position of a state by computing a normalized score of proximity to the terminal state in the trajectory. Each state is assigned a value of 1 if they are terminating goal states and 0 otherwise. These discrete values are then exponentially weighted by a discount factor γ (0, 1) to obtain a continuous estimate of the state temporal position. Using these temporal encodings, we pretrain temporal position estimators P j E, P j A in a supervised fashion by optimizing a squared error loss as follows: min P j E Eη D M Tj E P j E(st E) γHη t 2 (6) P j A is learnt in a similar fashion by optimizing Equation 6 with respect to the agent trajectories. These estimators are subsequently used to enforce temporal preservation as: min ψ,φ Lj pos = Es E D M Tj E h P j A ψ(s E) P j E(s E) 2 2 i + Es A D M Tj A h P j E φ(s A) P j A(s A) 2 2 i . (7) Our temporal position estimators may be interpreted as state value functions: trajectories are from a greedy optimal policy with reward 1 for terminal states, and 0 otherwise. 4.3. Inference task adaptation As discussed in Section 3, we are provided with the stateonly trajectories DMT E on solely the expert domain for the inference task T . We propose to use these trajectories during the learning process as additional regularization, referred to as inference task adaptation. First, we enforce cycle consistency on the states: min ψ,φ Lcyc inf = Es E DMT E φ ψ(s E) s E 2 2 . (8) In addition, we also enforce temporal preservation in the latent space. Concretely, we first train a position estimator P T E by optimizing Equation 6 on the given trajectories as discussed in Section 4.2. We use the trained position estimator, along with a latent space position predictor Pz to enforce temporal preservation by: min EE,Pz Lpos inf = Es E DMT E Pz EE(s E) P T E (s E) 2 2 . 4.4. 
### 4.4. Optimization

Given the alignment dataset $\mathcal{D}$ containing trajectories from the $M$ proxy tasks, we first pre-train the temporal position estimators $\{(P^j_E, P^j_A)\}_{j=1}^M$ using Equation 6. This is followed by adversarial training of the state maps $\psi, \phi$, where we use separate discriminators on the state space and latent space for each proxy task. The full objective is then:

$$\min_{\psi,\phi}\;\max_{\{D^j_E\},\{D^j_A\},\{q^j\}}\; \mathcal{L} = \lambda_1\big(\mathcal{L}^j_{adv}(D^j_A) + \mathcal{L}^j_{adv}(D^j_E)\big) + \lambda_2\big(\mathcal{L}^j_{cyc} + \mathcal{L}^j_{z}\big) + \lambda_3\,\mathcal{L}^j_{pos} - \lambda_4\,\mathcal{L}^j_{MI} + \lambda_5\big(\mathcal{L}_{cyc\text{-}inf} + \mathcal{L}_{pos\text{-}inf}\big) \tag{10}$$

where $\{\lambda_i\}_{i=1}^{5}$ denote hyper-parameters which control the contribution of each loss term. A pictorial description of the overall framework is shown in Figure 2.

### 4.5. Imitation from observation

We use the learned $\psi$ to map the states in the inference-task expert demonstrations $\mathcal{D}_{\mathcal{M}^T_E}$ to the agent domain. Given the set of transferred state-only demonstrations $\hat{\mathcal{D}}_{\mathcal{M}^T_A}$, we can use any imitation-from-observation algorithm to learn the final policy. In this work, we follow the Behavioral Cloning from Observation (BCO) approach proposed in (Torabi et al., 2018). BCO entails learning an inverse dynamics model $I_A : S_A \times S_A \to A_A$ to infer the missing action information. First, we collect a dataset of state-action triplets $\mathcal{P} = \{(s^t_A, a^t_A, s^{t+1}_A)\}$ by random exploration. The inverse model is subsequently estimated by Maximum Likelihood Estimation (MLE) on the observed transitions in $\mathcal{P}$. Assuming a Gaussian distribution over actions, this reduces to minimizing an $\ell_2$ loss as follows:

$$\sum_{(s^t_A,\, a^t_A,\, s^{t+1}_A)\in\mathcal{P}} \big\|a^t_A - I_A(s^t_A, s^{t+1}_A)\big\|_2^2 \tag{11}$$

Next, the learnt inverse model is used to augment $\hat{\mathcal{D}}_{\mathcal{M}^T_A}$ with agent-specific actions. Finally, these action-augmented trajectories are used to learn the final policy $\pi^T_A$ via behavioral cloning. Note that our correspondence learning framework is agnostic to the imitation-from-observation algorithm used for learning the agent policy.
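The following is a compact sketch of the BCO-style step described above: fit an inverse dynamics model on randomly collected agent transitions (Equation 11) and then label the $\psi$-translated demonstrations with inferred actions for behavioral cloning. Function names and the plain mean-squared-error regression are illustrative assumptions.

```python
# Sketch of the BCO-style step of Section 4.5 (Eq. 11 plus action labeling).
import torch
import torch.nn.functional as F

def fit_inverse_dynamics(inv_model, optimizer, s, a, s_next, epochs=10):
    """inv_model: (s_t, s_{t+1}) -> predicted action; trained with an l2 loss
    on randomly collected agent transitions."""
    for _ in range(epochs):
        pred = inv_model(torch.cat([s, s_next], dim=-1))
        loss = F.mse_loss(pred, a)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return inv_model

@torch.no_grad()
def label_transferred_demos(inv_model, transferred_states):
    """transferred_states: (H + 1, state_dim) tensor of psi-translated expert states.
    Returns (state, action) pairs for behavioral cloning."""
    s, s_next = transferred_states[:-1], transferred_states[1:]
    actions = inv_model(torch.cat([s, s_next], dim=-1))
    return s, actions
```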
Algorithm 1: Learn domain transformation ψ

    Input: proxy task set {(D_{M^Tj_E}, D_{M^Tj_A})} for j = 1..M, inference task trajectories D_{M^T_E}
    while not done do
        for j = 1, ..., M do                                   // global and local alignment
            Sample (s_E, s'_E) ~ D_{M^Tj_E}, (s_A, s'_A) ~ D_{M^Tj_A} and store in buffers B^j_E, B^j_A
            for i = 1, ..., N do
                Sample mini-batch i from B^j_E, B^j_A
                Update D^j_E, D^j_A by maximizing L^j_adv(D^j_E) and L^j_adv(D^j_A) respectively
                Update q^j by minimizing L^j_MI
                Update ψ, φ by minimizing λ1(L^j_adv(D^j_A) + L^j_adv(D^j_E)) + λ2(L^j_cyc + L^j_z) + λ3 L^j_pos − λ4 L^j_MI
            end for
        end for
        Sample (s_E, s'_E) ~ D_{M^T_E} and store in buffer B^{M+1}_E    // inference task adaptation
        for i = 1, ..., N do
            Sample mini-batch i from B^{M+1}_E
            Update P_z by minimizing L_pos-inf
            Update ψ, φ by minimizing L_cyc-inf + L_pos-inf
        end for
    end while
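For reference, the following is a compressed Python rendering of the proxy-task update in Algorithm 1, reusing the `adversarial_losses` and `consistency_losses` helpers sketched earlier. The latent domain-confusion update for $q^j$ (Equation 4) and the inference-task adaptation step are omitted for brevity, and the optimizer grouping and λ values are placeholders rather than the paper's settings.

```python
# One Algorithm-1-style update for proxy task j, assuming the helper functions
# sketched in Section 4. `nets` is a dict of modules; opt_maps optimizes the
# encoders/decoders, opt_disc the two state-space discriminators.
import torch.nn.functional as F

def proxy_task_update(nets, opt_maps, opt_disc, batch_E, batch_A,
                      lam1=1.0, lam2=10.0, lam3=1.0):
    (s_E, s_E_next), (s_A, s_A_next) = batch_E, batch_A
    psi = lambda x: nets['dec_E'](nets['enc_E'](x))   # expert -> agent state map
    phi = lambda x: nets['dec_A'](nets['enc_A'](x))   # agent -> expert state map

    # Adversarial transition matching in both directions (Eq. 1).
    dA, gA = adversarial_losses(psi, nets['disc_A'], s_E, s_E_next, s_A, s_A_next)
    dE, gE = adversarial_losses(phi, nets['disc_E'], s_A, s_A_next, s_E, s_E_next)

    # Cycle and latent consistency (Eqs. 2 and 5).
    l_cyc, l_z = consistency_losses(nets['enc_E'], nets['dec_E'],
                                    nets['enc_A'], nets['dec_A'], s_E, s_A)

    # Temporal position preservation (Eq. 7); P_E, P_A are pre-trained estimators
    # that are kept out of opt_maps and therefore never updated here.
    l_pos = F.mse_loss(nets['P_A'](psi(s_E)), nets['P_E'](s_E)) + \
            F.mse_loss(nets['P_E'](phi(s_A)), nets['P_A'](s_A))

    # Update the state maps first, then the discriminators.
    map_loss = lam1 * (gA + gE) + lam2 * (l_cyc + l_z) + lam3 * l_pos
    opt_maps.zero_grad(); map_loss.backward(); opt_maps.step()
    opt_disc.zero_grad(); (dA + dE).backward(); opt_disc.step()
```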
## 5. Experiments

In this section, we analyze the efficacy of our proposed method on the xDIO task. We adopt MuJoCo (Todorov et al., 2012) as the experimental test-bed and evaluate on several cross-domain tasks, along with a thorough ablation study of the different modules in our overall framework. Implementation details are presented in the supplementary materials. Code and videos are available at: https://driptarc.github.io/xdio.html.

We use a total of 7 environments derived from OpenAI Gym (Brockman et al., 2016): 2-link Reacher, 3-link Reacher, Friction-modified 2-link Reacher, Third-person 2-link Reacher, 4-legged Ant, 6-legged Ant and HalfCheetah. We use the joint-level state-action space for all environments. These are used to construct six cross-domain tasks:

**Dynamics-Reacher2Reacher (D-R2R):** The agent domain is the 2-link Reacher and the expert domain is the Friction-modified 2-link Reacher, created by doubling the friction coefficient of the former. The proxy tasks are reaching for M goals and the inference tasks are reaching for 4 new goals, placed maximally far away from the proxy goals. See the supplementary for more details on goal placement.

**Viewpoint-Reacher2Reacher (V-R2R):** The agent domain is the 2-link Reacher and the expert domain is the Third-person 2-link Reacher, which has a third-person-view state space with a 180° planar offset. Tasks are the same as D-R2R.

**Viewpoint-Reacher2Writer (V-R2W):** The agent domain is the 2-link Reacher and the expert domain is the Third-person 2-link Reacher. The proxy tasks are the same as D-R2R and the inference task is tracing a letter on a plane as fast as possible (Kim et al., 2020). The inference task differs from the proxy tasks in two key aspects: the end effector must draw a straight line from vertex to vertex of the letter, and it must not slow down at the vertices.

**Morphology-Reacher2Reacher (M-R2R):** The agent domain is the 2-link Reacher, while the expert domain is the 3-link Reacher. Otherwise the same as D-R2R.

**Morphology-Ant2Ant (M-A2A):** The agent domain is the 4-legged Ant, while the expert domain is the 6-legged Ant. Otherwise the same as D-R2R.

**Morphology-Ant2Cheetah (M-A2C):** The agent domain is the HalfCheetah, while the expert domain is the 4-legged Ant. Otherwise the same as D-R2R.

Figure 4. Cross-domain tasks. The different morphologically mismatched tasks (M-R2R, M-A2A, M-A2C; expert and agent shown for each) used in our experiments.

### 5.2. Baselines

We compare our framework to other methods which are able to learn state correspondences from unpaired and unaligned demonstrations without access to expert actions: Canonical Correlation Analysis (Hotelling, 1992), Invariant Features (Gupta et al., 2017) and CycleGAN (Zhu et al., 2017).

**Canonical Correlation Analysis (CCA)** (Hotelling, 1992) finds invertible linear transformations to a space where domain data are maximally correlated, given unpaired, unaligned demonstrations.

**Invariant Features (IF)** learns state maps via a domain-agnostic space from paired and aligned demonstrations; we use Dynamic Time Warping (Müller, 2007) on the learned latent space to compute the pairings from the unpaired data.

**CycleGAN** learns the state correspondence via adversarial learning with an additional cycle-consistency on state reconstruction.

For all the baselines, we follow a similar procedure for learning the final policy: the correspondence is learnt through the proxy tasks and then used to transfer trajectories for policy training via BCO. Reported results are averaged across 10 runs. Experts on the Reacher tasks are trained using PPO (Schulman et al., 2017), while those for Ant/Cheetah are trained using A3C (Mnih et al., 2016).

### 5.3. Cross-domain imitation performance

We compare imitation policies learnt by our framework against those learnt using the baselines in Table 2. As may be observed, the proposed method achieves near-expert performance across all the cross-domain tasks encompassing viewpoint, dynamics and morphological mismatch. On the other hand, the baselines consistently fail to generalize across the same tasks.

Table 2. Cross-domain imitation performance of the policy learnt on transferred trajectories for the inference tasks. All rewards are normalized by expert performance on the corresponding task.

| Method | V-R2R | V-R2W | D-R2R | M-R2R | M-A2A | M-A2C |
|---|---|---|---|---|---|---|
| IF | 0.32 ± 0.10 | 0.57 ± 0.20 | 0.48 ± 0.30 | 0.61 ± 0.23 | 0.09 ± 0.08 | 0.00 ± 0.00 |
| CCA | 0.16 ± 0.27 | 0.86 ± 0.30 | 0.47 ± 0.20 | 0.16 ± 0.13 | 0.30 ± 0.30 | 0.75 ± 0.50 |
| CycleGAN | 0.17 ± 0.10 | 0.72 ± 0.16 | 0.13 ± 0.02 | 0.12 ± 0.06 | 0.22 ± 0.20 | 0.80 ± 0.28 |
| Ours | 0.95 ± 0.03 | 0.93 ± 0.01 | 0.99 ± 0.02 | 0.96 ± 0.07 | 0.78 ± 0.08 | 1.00 ± 0.00 |

There are two key reasons which can be hypothesized for this poor performance. Firstly, IF requires time-aligned trajectories, and the alignment, when done by algorithms like DTW rather than by human intervention, may not be good enough, given that our experiments involve diverse starting states, up to 1.5× differences in demonstration lengths (shown in Figure 6), and varying task execution rates. Secondly, the baselines which learn from unpaired data (CCA and CycleGAN) also fail due to the lack of a mechanism to preserve MDP task characteristics, which is taken care of in our method via temporal order preservation and domain alignment.

Figure 6. Trajectory length distributions. Distributions of trajectory lengths of the proxy tasks used for M-A2C (x-axis: length of trajectory; legend: 4-legged Ant, HalfCheetah).

Figure 7 illustrates the learnt state maps for some of the cross-domain tasks. The proposed framework translates the expert states in a manner that preserves task semantics.

Figure 7. Visualization of domain transformations. State maps learned by our framework and the baselines on the M-R2R task. Our framework is able to map the end effector in a manner which preserves task semantics.

**Varying the number of demonstrations.** Given an adequate set of proxy tasks, we experiment by varying the number of cross-domain demonstrations required for training the policy on the inference task. To serve as an upper bound on performance, we imitate from agent-domain demonstrations on the inference task, drawn from an expert, and denote this as the Self-demo baseline. As shown in Figure 3, our framework produces transferred demonstrations of equal effectiveness to the self-demonstrations. This clearly demonstrates the effectiveness of our framework.

Figure 3. Adaptation complexity. Performance of the learned policy as the number of cross-domain demonstrations is varied. Our framework consistently performs better than the baselines and achieves results close to Self-demo.

**Varying the number of proxy tasks.** The number of proxy tasks plays a vital role in learning the correspondence across the domains. We perform experiments by varying the number of proxy tasks in the alignment set needed to learn the state map for imitation, given sufficient cross-domain demonstrations for the inference tasks. The results are shown in Figure 5. In general, more proxy tasks equate to better domain alignment, as the solution space over possible state maps is constrained and the learnt mapping generalizes better to the inference tasks.

Figure 5. Alignment complexity. Performance of the learned policy as the number of proxy tasks is varied. Notably, even with a reduced number of proxy tasks, our method outperforms the baselines in most cases.

### 5.4. Ablation study

We perform a set of ablation studies by removing each piece of the framework, demonstrating the importance of including each component. The results are shown in Table 3.

Table 3. Ablation study on each module's contribution to final policy performance.

| Method | V-R2R | V-R2W | D-R2R | M-R2R | M-A2A | M-A2C |
|---|---|---|---|---|---|---|
| Ours | 0.95 ± 0.05 | 0.93 ± 0.00 | 0.99 ± 0.02 | 0.96 ± 0.07 | 0.78 ± 0.08 | 1.00 ± 0.00 |
| - w/o inference adaptation | 0.81 ± 0.11 | 0.88 ± 0.03 | 0.74 ± 0.22 | 0.78 ± 0.11 | 0.46 ± 0.12 | 0.78 ± 0.23 |
| - w/o L_MI | 0.60 ± 0.30 | 0.92 ± 0.03 | 0.76 ± 0.30 | 0.67 ± 0.34 | 0.28 ± 0.20 | 0.80 ± 0.21 |
| - w/o temporal preservation | 0.64 ± 0.31 | 0.84 ± 0.00 | 0.70 ± 0.32 | 0.72 ± 0.32 | 0.36 ± 0.50 | 0.43 ± 0.50 |

We begin by excluding inference task adaptation.
This leads to a small drop in performance across all tasks, reinforcing the need for adapting on the inference task to incorporate the new state distribution it introduces. Notably, even without adaptation, the performance on almost all the tasks exceeds that of the baselines. Removing the mutual information objective leads to a similar drop in performance across all tasks. Excluding temporal position preservation also reduces performance, demonstrating the significance of preserving task semantics via global alignment, which cycle-consistency alone fails to ensure.

## 6. Conclusion

In this paper, we present a novel framework to tackle the xDIO task by learning a state map across domains using both local and global alignment. Local alignment is performed via transition distribution matching and cycle-consistency in both the state and latent space, while global alignment is enforced via the idea of temporal position preservation. While previous approaches rely on paired data and expert actions, we provide a general framework that can learn the mapping from unpaired, unaligned demonstrations without expert actions. We demonstrate the efficacy of our approach on multiple cross-domain tasks encompassing dynamics, viewpoint and morphological mismatch. Our future work will concentrate on extending our method to learning correspondences from random trajectories, thus mitigating the need for proxy tasks.

## Acknowledgements

This work was partially supported by Mitsubishi Electric Research Labs and US National Institute of Food and Agriculture Award No. 2021-67022-33453 through the National Robotics Initiative.

## References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, pp. 1, 2004.

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

Ammar, H. B. and Taylor, M. E. Reinforcement learning transfer via common subspaces. In International Workshop on Adaptive and Learning Agents, pp. 21-36. Springer, 2011.

Ammar, H. B., Eaton, E., Taylor, M. E., Mocanu, D. C., Driessens, K., Weiss, G., and Tuyls, K. An automated measure of MDP similarity for transfer in reinforcement learning. 2014.

Ammar, H. B., Eaton, E., Ruvolo, P., and Taylor, M. Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

Ferns, N., Panangaden, P., and Precup, D. Bisimulation metrics for continuous Markov decision processes. SIAM Journal on Computing, 40(6):1662-1714, 2011.
Gamrian, S. and Goldberg, Y. Transfer learning for related reinforcement learning tasks via image-to-image translation. In International Conference on Machine Learning, pp. 2063-2072. PMLR, 2019.

Gangwani, T. and Peng, J. State-only imitation with transition dynamics mismatch. In International Conference on Learning Representations, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in Neural Information Processing Systems, 27:2672-2680, 2014.

Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. Learning invariant feature spaces to transfer skills with reinforcement learning. In International Conference on Learning Representations. OpenReview.net, 2017.

Ho, D., Rao, K., Xu, Z., Jang, E., Khansari, M., and Bai, Y. RetinaGAN: An object-aware approach to sim-to-real transfer. arXiv preprint arXiv:2011.03148, 2020.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565-4573, 2016.

Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A., and Darrell, T. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pp. 1989-1998. PMLR, 2018.

Hotelling, H. Relations between two sets of variates. In Breakthroughs in Statistics, pp. 162-190. Springer, 1992.

Jones, S. S. The development of imitation in infancy. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1528):2325-2335, 2009.

Kim, K., Gu, Y., Song, J., Zhao, S., and Ermon, S. Domain adaptive imitation learning. In International Conference on Machine Learning, pp. 5286-5295. PMLR, 2020.

Liu, Y., Gupta, A., Abbeel, P., and Levine, S. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1118-1125. IEEE, 2018.

Marshall, P. J. and Meltzoff, A. N. Body maps in the infant brain. Trends in Cognitive Sciences, 19(9):499-505, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937. PMLR, 2016.

Müller, M. Dynamic time warping. Information Retrieval for Music and Motion, pp. 69-84, 2007.

Paul, S., Vanbaar, J., and Roy-Chowdhury, A. Learning from trajectories via subgoal discovery. In Advances in Neural Information Processing Systems, pp. 8411-8421, 2019.

Pomerleau, D. A. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pp. 305-313, 1989.

Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., and Tucker, G. On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171-5180, 2019.

Rao, K., Harris, C., Irpan, A., Levine, S., Ibarz, J., and Khansari, M. RL-CycleGAN: Reinforcement learning aware simulation-to-real. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11157-11166, 2020.

Ravindran, B. An algebraic approach to abstraction in reinforcement learning. PhD thesis, University of Massachusetts at Amherst, 2004.

Rizzolatti, G. and Craighero, L. The mirror-neuron system. Annual Review of Neuroscience, 27:169-192, 2004.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627-635, 2011.
Schaal, S. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233-242, 1999.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sharma, P., Pathak, D., and Gupta, A. Third-person visual imitation learning via decoupled hierarchical controller. In Advances in Neural Information Processing Systems, pp. 2597-2607, 2019.

Smith, L., Dhawan, N., Zhang, M., Abbeel, P., and Levine, S. AVID: Learning multi-stage tasks via pixel-level translation of human videos. arXiv preprint arXiv:1912.04443, 2019.

Stadie, B. C., Abbeel, P., and Sutskever, I. Third person imitation learning. In International Conference on Learning Representations. OpenReview.net, 2017.

Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033. IEEE, 2012.

Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.

Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

Umiltà, M., Intskirveli, I., Grammont, F., Rochat, M., Caruana, F., Jezzini, A., Gallese, V., Rizzolatti, G., et al. When pliers become fingers in the monkey motor system. Proceedings of the National Academy of Sciences, 105(6):2209-2213, 2008.

Wan, M., Gangwani, T., and Peng, J. Mutual information based knowledge transfer under state-action dimension mismatch. arXiv preprint arXiv:2006.07041, 2020.

Yang, C., Ma, X., Huang, W., Sun, F., Liu, H., Huang, J., and Gan, C. Imitation learning from observations by minimizing inverse dynamics disagreement. In Advances in Neural Information Processing Systems, pp. 239-249, 2019.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223-2232, 2017.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433-1438. Chicago, IL, USA, 2008.