# CEIL: Generalized Contextual Imitation Learning

Jinxin Liu1,2, Li He1, Yachen Kang1,2, Zifeng Zhuang1,2, Donglin Wang1,4, Huazhe Xu3,5,6
1Westlake University 2Zhejiang University 3Tsinghua University 4Westlake Institute for Advanced Study 5Shanghai Qi Zhi Institute 6Shanghai AI Lab
(Equal contributions. Corresponding author: Donglin Wang.)
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

In this paper, we present ContExtual Imitation Learning (CEIL), a general and broadly applicable algorithm for imitation learning (IL). Inspired by the formulation of hindsight information matching, we derive CEIL by explicitly learning a hindsight embedding function together with a contextual policy that uses the hindsight embeddings. To achieve the expert matching objective for IL, we advocate optimizing a contextual variable such that it biases the contextual policy towards mimicking expert behaviors. Beyond the typical learning from demonstrations (LfD) setting, CEIL is a generalist that can be effectively applied to multiple settings, including: 1) learning from observations (LfO), 2) offline IL, 3) cross-domain IL (mismatched experts), and 4) one-shot IL. Empirically, we evaluate CEIL on the popular MuJoCo tasks (online) and the D4RL dataset (offline). Compared to prior state-of-the-art baselines, we show that CEIL is more sample-efficient in most online IL tasks and achieves better or competitive performance in offline tasks.

1 Introduction

Imitation learning (IL) allows agents to learn from expert demonstrations. Initially developed within a supervised learning paradigm [58, 63], IL can be extended and reformulated with a general expert matching objective, which aims to produce policies whose trajectories have low distributional distance to the expert demonstrations [30]. This formulation allows IL to be extended to various new settings: 1) online IL, where interactions with the environment are allowed; 2) learning from observations (LfO), where expert actions are absent; 3) offline IL, where agents learn from limited expert data and a fixed dataset of sub-optimal and reward-free experience; 4) cross-domain IL, where the expert demonstrations come from another domain (i.e., environment) with different transition dynamics; and 5) one-shot IL, which expects to recover expert behaviors when only one expert trajectory is observed for a new IL task.

Modern IL algorithms introduce various designs or mathematical principles to cater to the expert matching objective in a specific scenario. For example, the LfO setting requires particular considerations for the absent expert actions, e.g., learning an inverse dynamics function [5, 65]. Besides, out-of-distribution issues in offline IL require specialized modifications to the learning objective, such as introducing additional policy/value regularization [32, 72]. However, such a methodology, designing an individual formulation for each IL setting, makes it difficult to scale a specific IL algorithm to more complex tasks beyond its original IL setting; e.g., online IL methods often suffer severe performance degradation in offline IL settings. Furthermore, realistic IL tasks are often not confined to a particular IL setting but consist of a mixture of them. For example, we may have access
to both demonstrations and observation-only data in offline robot tasks; however, it could require significant effort to adapt several specialized methods to leverage such mixed/hybrid data. Hence, a problem naturally arises: How can we accommodate the various design requirements of different IL settings with a general and practically ready-to-deploy IL formulation?

Table 1: A coarse summary of IL methods demonstrating 1) the different expert data modalities they can handle (learning from demonstrations or observations), 2) the disparate task settings they consider (learning from online environment interactions or a pre-collected offline static dataset), 3) the specific cross-domain setting they assume (the transition dynamics of the learning environment and of the expert behaviors are different), and 4) the unique one-shot merit they desire (the learned policy is capable of one-shot transfer to new imitation tasks). We highlight that our contextual imitation learning (CEIL) method can naturally be applied to all of the above IL settings.

| Setting | LfD | LfO | Online | Offline | Cross-domain | One-shot |
|---|---|---|---|---|---|---|
| S-on-LfD [9, 13, 21, 30, 38, 52, 57, 61, 77] | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
| S-on-LfO [7, 54, 65, 66, 75] | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| S-off-LfD [19, 32, 33, 39, 55, 70, 72, 73] | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ |
| S-off-LfO [78] | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
| C-on-LfD [18, 69, 79] | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ |
| C-on-LfO [20, 25, 26, 48, 59, 60] | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ |
| C-off-LfD [34] | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ |
| C-off-LfO [56, 68] | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ |
| S-on/off-LfO [28] | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| Online one-shot [14, 16, 40] | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ |
| Offline one-shot [24, 71] | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ |
| CEIL (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Hindsight information matching, a task-relabeling paradigm in reinforcement learning (RL), views control tasks as analogous to a general sequence modeling problem, with the goal of producing a sequence of actions that induces high returns [12]. Its generality and simplicity enable it to be extended to both online and offline settings [17, 42]. In its original RL context, an agent directly uses the known extrinsic rewards to bias the hindsight information towards task-related behaviors. However, when we attempt to retain its generality in IL tasks, how to bias the hindsight information towards expert behaviors remains a significant barrier, as the extrinsic rewards are missing.

To design a general IL formulation and tackle the above problems, we propose ContExtual Imitation Learning (CEIL), which readily incorporates the hindsight information matching principle within a bi-level expert matching objective. In the inner-level optimization, we explicitly learn a hindsight embedding function to deal with the challenge of unknown rewards. In the outer-level optimization, we perform IL expert matching by inferring an optimal embedding (i.e., hindsight embedding biasing), replacing the naive reward biasing used in hindsight relabeling. Intuitively, we find that such a bi-level objective results in a spectrum of expert matching objectives from the embedding space to the trajectory space. To shed light on the applicability and generality of CEIL, we instantiate CEIL in various IL settings, including online/offline IL, LfD/LfO, cross-domain IL, and one-shot IL.

In summary, this paper makes the following contributions: 1) We propose ContExtual Imitation Learning (CEIL), a bi-level expert matching objective that inherits the spirit of hindsight information matching and decouples the learned policy into a contextual policy and an optimal embedding. 2) CEIL exhibits high generality and adaptability and can be instantiated over a range of IL tasks.
3) Empirically, we conduct extensive analyses showing that CEIL is more sample-efficient in online IL and achieves better or competitive results in offline IL tasks.

2 Related Work

Recent advances in decision-making have led to rapid progress in IL settings (Table 1), from typical learning from demonstrations (LfD) to learning from observations (LfO) [7, 9, 35, 54, 62, 66], from online IL to offline IL [11, 15, 33, 53, 73], and from single-domain IL to cross-domain IL [34, 48, 56, 68]. Targeting a specific IL setting, individual works have shown an impressive ability to solve that exact setting. However, it is hard for them to retain their performance in new, unanticipated IL settings. In light of this, it is tempting to consider how to design a general and broadly applicable IL method. Indeed, a number of prior works have studied subsets of the above IL settings, such as offline LfO [78], cross-domain LfO [48, 60], and cross-domain offline IL [56]. While such works demonstrate the feasibility of tackling multiple IL settings, they still rely on standard online/offline RL algorithmic advances to improve performance [25, 32, 44, 47, 50, 51, 55, 72, 76]. Our objective diverges from these works, as we strive to minimize the reliance on the RL pipeline by replacing it with a simple supervision objective, thus avoiding the dependence on the choice of RL algorithm.

Our approach to IL is most closely related to prior hindsight information-matching methods [2, 8, 24, 49], in that both learn a contextual policy and use a contextual variable to guide policy improvement. However, these prior methods typically require additional mechanisms to work well, such as extrinsic rewards in online RL [4, 42, 64] or a handcrafted target return in offline RL [12, 17]. Our method does not require explicit handling of these components. By explicitly learning an embedding space for both expert and suboptimal behaviors, we can bias the contextual policy with an inferred optimal embedding (contextual variable), thus avoiding the explicit reward biasing used in prior works.

Our method also differs from most prior offline transformer-based RL/IL algorithms that explicitly model long sequences of transitions [10, 12, 31, 36, 43, 71]. We find that simple fully-connected networks can also elicit useful embeddings and guide expert behaviors when conditioned on a well-calibrated embedding. In the context of the recently proposed prompt-tuning paradigm in large language or multi-modal tasks [27, 45, 74], our method can be interpreted as a combination of IL and prompt tuning, with the main motivation that we tune the prompt (the optimal contextual variable) with an expert matching objective in IL settings.

3 Background

Before discussing our method, we briefly introduce the background for IL, including learning from demonstrations (LfD), learning from observations (LfO), online IL, offline IL, and cross-domain settings in Section 3.1, and introduce hindsight information matching in Section 3.2.

3.1 Imitation Learning

Consider a control task formulated as a discrete-time Markov decision process (MDP) M = {S, A, T, r, γ, p0}, where S is the state (observation) space, A is the action space, T : S × A × S → R is the transition dynamics function, r : S × A → R is the reward function, γ is the discount factor, and p0 is the distribution of initial states. Throughout, we use environment and MDP interchangeably, and state and observation interchangeably.
The goal in a reinforcement learning (RL) control task is to learn a policy πθ(a|s) maximizing the expected sum of discounted rewards
$$\mathbb{E}_{\pi_\theta(\tau)}\Big[\textstyle\sum_{t=0}^{T-1}\gamma^t r(s_t, a_t)\Big],$$
where τ := {s0, a0, ..., s_{T−1}, a_{T−1}} denotes the trajectory and the induced trajectory distribution is
$$\pi_\theta(\tau) = p_0(s_0)\,\pi_\theta(a_0|s_0)\prod_{t=1}^{T-1}\pi_\theta(a_t|s_t)\,\mathcal{T}(s_t|s_{t-1}, a_{t-1}).$$

In IL, the ground-truth reward function (i.e., r in M) is not observed. Instead, we have access to a set of demonstrations (or observations) {τ | τ ∼ πE(τ)} collected by an unknown expert policy πE(a|s). The goal of IL is to recover a policy that matches the corresponding expert policy. From the mathematical perspective, IL achieves the plain expert matching objective by minimizing the divergence between the trajectory distributions of the learner and the expert:
$$\min_{\pi_\theta} D\big(\pi_\theta(\tau), \pi_E(\tau)\big), \tag{1}$$
where D is a distance measure. Meanwhile, we emphasize that the given expert data {τ | τ ∼ πE(τ)} may not contain the corresponding expert actions. Thus, in this work, we consider two IL cases, where the given expert data τ consists of a set of state-action demonstrations {(st, at, st+1)} (learning from demonstrations, LfD), or a set of state-only transitions {(st, st+1)} (learning from observations, LfO). When it is clear from context, we abuse notation and use πE(τ) to denote both demonstrations in LfD and observations in LfO for simplicity.

Besides, we can also divide IL settings into two orthogonal categories: online IL and offline IL. In online IL, the learning policy πθ can interact with the environment and generate online trajectories τ ∼ πθ(τ). In offline IL, the agent cannot interact with the environment but has access to an offline static dataset {τ | τ ∼ πβ(τ)}, collected by some unknown (sub-optimal) behavior policies πβ. By leveraging the offline data {πβ(τ)} ∪ {πE(τ)} without any interaction with the environment, the goal of offline IL is to learn a policy that recovers the expert behaviors (demonstrations or observations) generated by πE. Note that, in contrast to the typical offline RL problem [46], the offline data {πβ(τ)} in offline IL does not contain any reward signal.

Cross-domain IL. Beyond the above two IL axes (online/offline and LfD/LfO), we can also divide IL into 1) single-domain IL and 2) cross-domain IL, where 1) single-domain IL assumes that the expert behaviors are collected in the same MDP in which the learning policy is to be trained, and 2) cross-domain IL studies how to imitate expert behaviors when discrepancies exist between the expert and the learning MDPs (e.g., differing transition dynamics or morphologies).

3.2 Hindsight Information Matching

In typical goal-conditioned RL problems, hindsight experience replay (HER) [3] proposes to leverage the rich repository of failed experiences by replacing the desired (true) goals of training trajectories with the achieved goals of the failed experiences: Alg(πθ; g, τg) → Alg(πθ; f_HER(τg), τg), where the learner Alg(πθ; ·, ·) can be any RL method, τg ∼ πθ(τg|g) denotes the trajectory generated by a goal-conditioned policy πθ(at|st, g), and f_HER denotes a pre-defined (hindsight information extraction) function, e.g., returning the last state of trajectory τg. HER can also be applied to (single-goal) reward-driven online/offline RL tasks by setting the return (sum of discounted rewards) of a trajectory as an implicit goal for that trajectory.
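To make the two relabeling choices above concrete, the short sketch below shows a hindsight-goal extractor in the spirit of f_HER and a discounted-return extractor in the spirit of f_R; the dictionary-based trajectory format and all helper names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def f_her(traj):
    """Hindsight goal extraction: return the last achieved state of a
    (possibly failed) trajectory, as in HER."""
    return traj["states"][-1]

def relabel(traj):
    """Replace the desired goal with the achieved (hindsight) goal, so that
    Alg(pi_theta; g, tau_g) is trained as Alg(pi_theta; f_HER(tau_g), tau_g)."""
    return {**traj, "goal": f_her(traj)}

def f_return(traj, gamma=0.99):
    """Return-as-hindsight-information: the discounted return of a trajectory,
    used as the conditioning variable of the return-conditioned policy below."""
    rewards = np.asarray(traj["rewards"], dtype=np.float64)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))
```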
Thus, we can reformulate the (single-goal) reward-driven RL task, learning a policy πθ(at|st) that maximizes the return, as a multi-goal RL task, learning a return-conditioned policy πθ(at|st, ·) that maximizes the following log-likelihood:
$$\max_{\pi_\theta}\ \mathbb{E}_{D(\tau)}\big[\log \pi_\theta(a\,|\,s, f_R(\tau))\big], \tag{2}$$
where f_R(τ) denotes the return of trajectory τ. At test time, we can then condition the contextual policy πθ(a|s, ·) on a desired target return. In offline RL, the empirical distribution D(τ) in Equation 2 can naturally be set to the offline data distribution; in online RL, D(τ) can be set to the replay/experience buffer, which is updated and biased towards trajectories with high expected returns. Intuitively, biasing the sampling distribution D(τ) towards higher returns yields an implicit policy improvement operation. However, such an operator is non-trivial to obtain in the IL problem, where we do not have access to a pre-defined function f_R(τ) that can bias the learning policy towards recovering the given expert data {πE(τ)} (demonstrations or observations).

In this section, we formulate IL as a bi-level optimization problem, which will allow us to derive our method, contextual imitation learning (CEIL). Instead of attempting to train the learning policy πθ(a|s) with the plain expert matching objective (Equation 1), our approach introduces an additional contextual variable z for a contextual IL policy πθ(a|s, ·). The main idea of CEIL is to learn a contextual policy πθ(a|s, z) and an optimal contextual variable z* such that the given expert data (demonstrations in LfD or observations in LfO) can be recovered by the learned z*-conditioned policy πθ(a|s, z*). We begin by describing the overall framework of CEIL in Section 4.1, then make a connection between CEIL and the plain expert matching objective in Section 4.2, which leads to a practical implementation under various IL settings in Section 4.3.

4.1 Contextual Imitation Learning (CEIL)

Motivated by hindsight information matching in online/offline RL (Section 3.2), we propose to learn a general hindsight embedding function fϕ, which encodes a trajectory τ (with window size T) into a latent variable z ∈ Z, where |Z| ≪ T|S|. Formally, we learn the embedding function fϕ and a corresponding contextual policy πθ(a|s, z) by minimizing the trajectory self-consistency loss:
$$\pi_\theta^*, f_\phi^* = \arg\min_{\pi_\theta, f_\phi} \mathbb{E}_{D(\tau)}\big[-\log \pi_\theta(\tau\,|\,f_\phi(\tau))\big] = \arg\min_{\pi_\theta, f_\phi} \mathbb{E}_{\tau\sim D(\tau)}\,\mathbb{E}_{(s,a)\sim\tau}\big[-\log \pi_\theta(a\,|\,s, f_\phi(\tau))\big], \tag{3}$$
where, in the online setting, we sample trajectories τ from the buffer D(τ), known as the experience replay buffer in online RL; in the offline setting, we sample trajectories τ directly from the given offline data.

If we can ensure that the learned contextual policy πθ and the embedding function fϕ are accurate on the empirical data D(τ), then we can convert the IL policy optimization objective (Equation 1) into a bi-level expert matching objective:
$$\min_{z^*}\ D\big(\pi_\theta(\tau\,|\,z^*), \pi_E(\tau)\big), \tag{4}$$
$$\text{s.t.}\ \ \pi_\theta^*, f_\phi^* = \arg\min_{\pi_\theta, f_\phi} \mathbb{E}_{D(\tau)}\big[-\log \pi_\theta(\tau\,|\,f_\phi(\tau))\big] - R(f_\phi), \quad \text{and} \quad z^* \in f_\phi\big(\mathrm{supp}(D)\big), \tag{5}$$
where R(fϕ) is an added regularization over the embedding function (we will elaborate on it later), and supp(D) denotes the support of the trajectory distribution, {τ | D(τ) > 0}. Here fϕ is employed to map the trajectory space to the latent variable space Z. Intuitively, by optimizing Equation 4, we expect the induced trajectory distribution of the learned πθ(a|s, z*) to match that of the expert.
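The following sketch illustrates the inner-level objective, i.e., the trajectory self-consistency loss of Equation 3. The fully-connected architectures, the latent dimension, and the unit-variance Gaussian policy (which reduces the negative log-likelihood to a squared error) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HindsightEncoder(nn.Module):
    """f_phi: maps a (flattened) trajectory window to a latent embedding z."""
    def __init__(self, traj_dim, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim, 256), nn.ReLU(),
                                 nn.Linear(256, z_dim))

    def forward(self, traj):                      # traj: (batch, traj_dim)
        return self.net(traj)

class ContextualPolicy(nn.Module):
    """pi_theta(a | s, z): predicts the action mean from the state and the latent z."""
    def __init__(self, state_dim, action_dim, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

def self_consistency_loss(policy, encoder, states, actions, trajs):
    """Equation 3: condition the policy on the embedding of the trajectory window
    each (s, a) pair was drawn from, and regress the actions (a unit-variance
    Gaussian policy turns the negative log-likelihood into a squared error)."""
    z = encoder(trajs)                            # (batch, z_dim)
    return ((policy(states, z) - actions) ** 2).mean()
```

In practice, fϕ would be applied to fixed-length trajectory windows sampled from the buffer D(τ) (online) or from the offline dataset, matching the sampling scheme described after Equation 3.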
However, in the offline IL setting, the contextual policy cannot interact with the environment. If we directly optimize the expert matching objective (Equation 4), such an objective can easily exploit generalization errors in the contextual policy model to infer a mistakenly overestimated z* that achieves a low expert-matching loss but does not preserve the trajectory self-consistency (Equation 3). Therefore, we formalize CEIL as a bi-level optimization problem where, in Equation 5, we explicitly constrain the inferred z* to lie in the (fϕ-mapped) support of the training trajectory distribution.

At a high level, CEIL decouples the learning policy into two parts: an expressive contextual policy πθ(a|s, ·) and an optimal contextual variable z*. Comparing CEIL with the plain expert matching objective min_{πθ} D(πθ(τ), πE(τ)) in Equation 1, we highlight two merits: 1) CEIL's expert matching loss (Equation 4) does not account for updating πθ and is only incentivized to update the low-dimensional latent variable z*, which enjoys efficient parameter learning similar to prompt tuning in large language models [74], and 2) we learn πθ by simply performing supervised regression (Equation 5), which is more stable than vanilla inverse-RL/adversarial-IL methods.

4.2 Connection to the Plain Expert Matching Objective

To gain more insight into Equation 4, which captures the quality of IL (the degree of similarity to the expert data), we define D(·, ·) as the sum of the reverse KL and forward KL divergences, i.e., D(q, p) = DKL(q‖p) + DKL(p‖q), where DKL(p‖q) := E_{p(x)}[log p(x)/q(x)] denotes the (forward) KL divergence. It is well known that the reverse KL is mode-seeking while the forward KL is mode-covering [37]; for analysis purposes, we weight both terms equally. We then derive an alternative form for Equation 4:
$$\arg\min_{z^*} D\big(\pi_\theta(\tau|z^*), \pi_E(\tau)\big) = \arg\max_{z^*}\ \underbrace{I(z^*; \tau_E) - I(z^*; \tau_\theta)}_{J_{MI}} - \underbrace{D\big(\pi_\theta(\tau), \pi_E(\tau)\big)}_{J_D}, \tag{6}$$
where I(x; y) denotes the mutual information (MI) between x and y, which measures the predictive power of y on x (or vice versa), the latent variables are defined as τE := τ ∼ πE(τ) and τθ := τ ∼ p(z*)πθ(τ|z*), and πθ(τ) = E_{z*}[πθ(τ|z*)]. Intuitively, the second term J_D on the RHS of Equation 6 is similar to the plain expert matching objective in Equation 1, except that here we optimize a latent variable z* over this objective. Regarding the MI term J_MI, we can interpret its maximization as an implicit policy improvement, which incentivizes the optimal latent variable z* to have high predictive power for the expert data τE and low predictive power for the non-expert data τθ.

Further, we can rewrite the MI term (J_MI in Equation 6) in terms of the learned embedding function fϕ, yielding an approximate embedding inference objective J_MI(fϕ):
$$J_{MI} = \mathbb{E}_{\pi_E(z^*, \tau_E)}\big[\log p(z^*|\tau_E)\big] - \mathbb{E}_{\pi_\theta(z^*, \tau_\theta)}\big[\log p(z^*|\tau_\theta)\big] \approx \mathbb{E}_{p(z^*)\pi_E(\tau_E)\pi_\theta(\tau_\theta|z^*)}\big[-\|z^* - f_\phi(\tau_E)\|^2 + \|z^* - f_\phi(\tau_\theta)\|^2\big] =: J_{MI}(f_\phi),$$
where we approximate the logarithmic predictive power of z* on τ with −‖z* − fϕ(τ)‖², taking advantage of the embedding function fϕ learned in Equation 5.
By maximizing J_MI(fϕ), the learned optimal z* is induced to converge towards the embeddings of the expert data while avoiding trivial solutions (as shown in Figure 1). Intuitively, J_MI(fϕ) can also be thought of as an instantiation of a contrastive loss, which manifests two facets we consider significant in IL: 1) the "anchor" variable z* is unknown and must be estimated (in the sense of the triplet contrastive loss, which enforces the anchor-positive distance to be smaller than the anchor-negative distance, z* in J_MI(fϕ) plays the role of the anchor), and 2) it is necessary to ensure that the estimated z* lies in the support set of the training distribution, as specified by the support constraint in Equation 5.

Figure 1: During learning, the distance between z* and fϕ(τE) decreases rapidly (green lines). Meanwhile, as the policy πθ(·|·, z*) gets better (blue lines), fϕ(τθ) gradually approaches z* (red lines).

In summary, by comparing J_MI(fϕ) and J_D, we can observe that J_MI(fϕ) encourages expert matching in the embedding space, while J_D encourages expert matching in the original trajectory space. In the next section, we will see that such an embedding-level expert matching objective naturally lends itself to cross-domain IL settings.

4.3 Practical Implementation

In this section, we describe how to convert the bi-level IL problem above (Equations 4 and 5) into a feasible online/offline IL objective and discuss practical implementation details in the LfO, offline IL, cross-domain IL, and one-shot IL settings (see more details in Appendix 9.3; our code will be released at https://github.com/wechto/Generalized CEIL). As shown in Algorithm 1, CEIL alternates between solving the bi-level problem with respect to the support constraint (Line 3 for online IL or Line 7 for offline IL), the trajectory self-consistency loss (Line 5), and the optimal embedding inference (Line 6).

Algorithm 1 Training CEIL: Online or Offline IL Setting
Require: Expert demonstrations {πE(τ)}, an empty buffer D for online IL or reward-free offline data D for offline IL, training iterations K, and batch size N.
1: Initialize the contextual policy πθ(a|s, ·), the embedding function fϕ(z|τ), and the latent variable z*.
2: for k = 1, ..., K do
3:   (Online only) Run policy πθ(a|s, z*) in the environment and store the experience in buffer D.
4:   Sample a batch of N trajectories {τ} from D for online IL or from the offline data for offline IL.
5:   Learn πθ and fϕ over the sampled batch using the trajectory self-consistency loss (Equation 3).
6:   Update z* and fϕ over the sampled batch by maximizing J_MI(fϕ) − αJ_D (Equation 8).
7:   (Offline only) Update z* by minimizing R(z*) (Equation 7), eliminating the offline OOD issues.
8: end for
Return: the learned contextual policy πθ(a|s, ·) and the optimal latent variable z*.

To satisfy the support constraint in Equation 5, for online IL (Line 3), we directly roll out the z*-conditioned policy πθ(a|s, z*) in the environment; for offline IL (Line 7), we minimize a simple regularization over z* (in other words, the offline support constraint in Equation 5 is achieved by minimizing R(z*)), bearing a close resemblance to the one used in TD3+BC [23]:
$$R(z^*) = \min\Big(\big\|z^* - \bar f_\phi(\tau_E)\big\|^2,\ \big\|z^* - \bar f_\phi(\tau_D)\big\|^2\Big), \quad \tau_E := \tau \sim \pi_E(\tau),\ \tau_D := \tau \sim D(\tau), \tag{7}$$
where we apply a stop-gradient operation to f̄ϕ. To ensure that the optimal embedding inference (max_{z*} J_MI(fϕ) − J_D) retains the flexibility of seeking z* across different instances of fϕ, we jointly update the optimal embedding z* and the embedding function fϕ with
$$\max_{z^*, f_\phi}\ J_{MI}(f_\phi) - \alpha J_D, \tag{8}$$
where α controls the weight on J_D.
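As a companion to Algorithm 1, the sketch below shows one possible form of the outer-level updates (Lines 6–7): the contrastive surrogate of J_MI(fϕ) and the offline regularizer R(z*) of Equation 7. It reuses the HindsightEncoder from the earlier sketch, treats z* as a learnable vector, drops the −αJ_D term (α = 0, as in the cross-domain setting), and uses batch-mean reductions; all of these are assumptions made for illustration.

```python
import torch

def jmi_surrogate(z_star, encoder, expert_traj, policy_traj):
    """Contrastive surrogate of J_MI(f_phi): pull z* towards expert embeddings,
    push it away from embeddings of the learner's own (non-expert) rollouts."""
    d_expert = ((z_star - encoder(expert_traj)) ** 2).sum(-1).mean()
    d_policy = ((z_star - encoder(policy_traj)) ** 2).sum(-1).mean()
    return -d_expert + d_policy

def offline_regularizer(z_star, encoder, expert_traj, offline_traj):
    """R(z*) from Equation 7: keep z* inside the embedding support of the data,
    with a stop-gradient on f_phi (hence the no_grad block)."""
    with torch.no_grad():
        z_expert, z_offline = encoder(expert_traj), encoder(offline_traj)
    d_expert = ((z_star - z_expert) ** 2).sum(-1).mean()
    d_offline = ((z_star - z_offline) ** 2).sum(-1).mean()
    return torch.minimum(d_expert, d_offline)

def outer_step(z_star, encoder, optimizer, expert_traj, policy_traj,
               offline_traj=None):
    """Maximize J_MI(f_phi) (Line 6) and, offline only, minimize R(z*) (Line 7);
    the two steps are merged here purely for brevity."""
    loss = -jmi_surrogate(z_star, encoder, expert_traj, policy_traj)
    if offline_traj is not None:
        loss = loss + offline_regularizer(z_star, encoder, expert_traj, offline_traj)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```

Here z_star would be created as torch.zeros(z_dim, requires_grad=True) and optimized jointly with the encoder parameters, e.g., via torch.optim.Adam([z_star, *encoder.parameters()], lr=3e-4).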
LfO. In the LfO setting, as expert actions are missing, we apply our expert matching objective only over the observations. Note that even though the expert data contains no actions in LfO, we can still leverage the large number of suboptimal actions present in the online/offline data D(τ). Thus, we can learn the contextual policy πθ(a|s, z) using the buffer data in online IL or the offline data in offline IL, owing to the fact that we do not directly use the plain expert matching objective to update πθ.

Cross-domain IL. Cross-domain IL considers the case in which the expert's and the learning agent's MDPs are different. Due to the domain shift, the plain idea of minimizing J_D may not be a sufficient proxy for the expert matching objective, as there may not exist any trajectory (in the learning MDP) that matches the given expert data. Thus, we can set α (the weight of J_D) to 0. Further, to make the embedding function fϕ useful for guiding expert matching in the latent space (i.e., maximizing J_MI(fϕ)), we encourage fϕ to capture task-relevant embeddings and ignore domain-specific factors. To do so, we generate a set of pseudo-random transitions {τ'_E} by independently sampling trajectories from the expert data {πE(τE)} and adding random noise to the sampled trajectories, i.e., τ'_E = τE + noise. Then, we couple each trajectory τ in {τE} ∪ {τ'_E} with a label n ∈ {0, 1} indicating whether it is noised, which gives a new set {(τ, n)} with τ ∈ {τE} ∪ {τ'_E} and n ∈ {0, 1}. Thus, we can set the regularization R(fϕ) in Equation 5 to be:
$$R(f_\phi) = I\big(f_\phi(\tau); n\big). \tag{9}$$
Intuitively, maximizing R(fϕ) encourages the embeddings to be domain-agnostic and task-relevant: fϕ(τE) has high predictive power over expert data (n = 0) and low predictive power over noised data (n = 1) (see the illustrative sketch at the end of this subsection).

One-shot IL. Benefiting from the separate design of the contextual policy learning and the optimal embedding inference, CEIL also enjoys another advantage: one-shot generalization to new IL tasks. For a new IL task, given the corresponding expert data τ_new, we can use the learned embedding function fϕ to generate a corresponding latent embedding z_new. Conditioning on such an embedding, we can directly roll out πθ(a|s, z_new) to recover the one-shot expert behavior.
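To close this subsection, the hedged sketch below illustrates the cross-domain regularization of Equation 9 and the one-shot inference step. Estimating I(fϕ(τ); n) with a small classification head is an assumption made purely for illustration (the text above specifies the mutual-information objective, not a particular estimator), and the window-averaged one-shot embedding is likewise an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_noised_copies(expert_traj, noise_std=0.1):
    """tau'_E = tau_E + noise, paired with labels n (0: clean expert, 1: noised)."""
    noised = expert_traj + noise_std * torch.randn_like(expert_traj)
    trajs = torch.cat([expert_traj, noised], dim=0)
    labels = torch.cat([torch.zeros(len(expert_traj)), torch.ones(len(noised))]).long()
    return trajs, labels

class NoiseHead(nn.Module):
    """Predicts n from f_phi(tau); its cross-entropy acts as a simple
    classifier-based surrogate for I(f_phi(tau); n)."""
    def __init__(self, z_dim=16):
        super().__init__()
        self.linear = nn.Linear(z_dim, 2)

    def forward(self, z):
        return self.linear(z)

def cross_domain_reg_loss(encoder, head, expert_traj):
    """Minimizing this cross-entropy (i.e., maximizing the surrogate of R(f_phi))
    makes the embeddings predictive of the noise label, encouraging f_phi to keep
    task-relevant structure rather than domain-specific noise."""
    trajs, labels = make_noised_copies(expert_traj)
    return F.cross_entropy(head(encoder(trajs)), labels)

def one_shot_latent(encoder, new_expert_traj):
    """One-shot IL: embed the single new demonstration and condition the learned
    contextual policy on z_new, i.e., roll out pi_theta(a | s, z_new)."""
    with torch.no_grad():
        return encoder(new_expert_traj).mean(dim=0)
```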
5 Experiments

In this section, we conduct experiments across a variety of IL problem domains: single/cross-domain IL, online/offline IL, and LfD/LfO settings. By arranging and combining these IL domains, we obtain 8 IL tasks in all: S-on-LfD, S-on-LfO, S-off-LfD, S-off-LfO, C-on-LfD, C-on-LfO, C-off-LfD, and C-off-LfO, where S/C denotes single/cross-domain IL, on/off denotes online/offline IL, and LfD/LfO denotes learning from demonstrations/observations, respectively. Moreover, we also verify the scalability of CEIL in the challenging one-shot IL setting.

Our experiments are conducted in four popular MuJoCo environments: Hopper-v2 (Hop.), HalfCheetah-v2 (Hal.), Walker2d-v2 (Wal.), and Ant-v2 (Ant.). In the single-domain IL setting, we train a SAC policy in each environment and use the learned expert policy to collect expert trajectories (demonstrations/observations). To investigate the cross-domain IL setting, we assume the two domains (the learning MDP and the expert-data-collecting MDP) have the same state space and action space but different transition dynamics. To achieve this, we modify the torso length of the MuJoCo agents (see details in Appendix 9.2). Then, for each modified agent, we train a separate expert policy and collect expert trajectories. For the offline IL setting, we directly take the reward-free D4RL [22] data as the offline dataset, replacing the online rollout experience of the online IL setting.

5.1 Evaluation Results

To demonstrate the versatility of the CEIL idea, we collect 20 expert trajectories (demonstrations in LfD or state-only observations in LfO) for each environment and compare CEIL to GAIL [30], AIRL [21], SQIL [61], IQ-Learn [28], ValueDICE [41], GAIfO [66], ORIL [78], DemoDICE [39], and SMODICE [56] (see their implementation details in Appendix 9.4). Note that these baseline methods cannot be applied to all the IL task settings (S/C-on/off-LfD/LfO), so we only provide experimental comparisons with compatible baselines in each IL setting.

Online IL. In Figure 2, we provide the return (cumulative reward) curves of our method and the baselines on the 4 online IL settings: S-on-LfD (top-left), S-on-LfO (top-right), C-on-LfD (bottom-left), and C-on-LfO (bottom-right). As can be seen, CEIL quickly achieves expert-level performance in S-on-LfD. When extended to S-on-LfO, CEIL also yields better sample efficiency than the baselines. Further, considering the more complex cross-domain settings, we can see that baselines such as SQIL and IQ-Learn (in C-on-LfD and C-on-LfO) suffer from the domain mismatch, leading to performance degradation at late stages of training, while CEIL still achieves robust performance.

Figure 2: Return curves on 4 online IL settings: (a) S-on-LfD, (b) S-on-LfO, (c) C-on-LfD, and (d) C-on-LfO, where the shaded area represents a 95% confidence interval over 30 trials. Note that baselines cannot be applied to all the IL task settings, thus we only provide comparisons with compatible baselines (two separate legends).
Offline IL. Next, we evaluate CEIL on the other 4 offline IL settings: S-off-LfD, S-off-LfO, C-off-LfD, and C-off-LfO. In Table 2, we provide the normalized returns of our method and the baseline methods on the reward-free D4RL [22] medium (m), medium-replay (mr), and medium-expert (me) datasets. We can observe that CEIL achieves a significant improvement over the baseline methods in both the S-off-LfD and S-off-LfO settings. Compared to the state-of-the-art offline baselines, CEIL also shows competitive results on the challenging cross-domain offline IL settings (C-off-LfD and C-off-LfO).

Table 2: Normalized scores (averaged over 30 trials for each task) on 4 offline IL settings: S-off-LfD, S-off-LfO, C-off-LfD, and C-off-LfO. Scores within two points of the maximum score are highlighted.

| Setting | Method | Hop-m | Hop-mr | Hop-me | Hal-m | Hal-mr | Hal-me | Wal-m | Wal-mr | Wal-me | Ant-m | Ant-mr | Ant-me | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S-off-LfD | ORIL (TD3+BC) | 50.9 | 22.1 | 72.7 | 44.7 | 30.2 | 87.5 | 47.1 | 26.7 | 102.6 | 46.5 | 31.4 | 61.9 | 624.3 |
| S-off-LfD | SQIL (TD3+BC) | 32.6 | 60.6 | 25.5 | 13.2 | 25.3 | 14.4 | 25.6 | 15.6 | 8.0 | 63.6 | 58.4 | 44.3 | 387.1 |
| S-off-LfD | IQ-Learn | 21.3 | 19.9 | 24.9 | 5.0 | 7.5 | 7.5 | 22.3 | 19.6 | 18.5 | 38.4 | 24.3 | 55.3 | 264.5 |
| S-off-LfD | ValueDICE | 73.8 | 83.6 | 50.8 | 1.9 | 2.4 | 3.2 | 24.6 | 26.4 | 44.1 | 79.1 | 82.4 | 75.2 | 547.5 |
| S-off-LfD | DemoDICE | 54.8 | 32.7 | 65.4 | 42.8 | 37.0 | 55.6 | 68.1 | 39.7 | 95.0 | 85.6 | 69.0 | 108.8 | 754.6 |
| S-off-LfD | SMODICE | 56.1 | 28.7 | 68.0 | 42.7 | 37.7 | 66.9 | 66.2 | 40.7 | 58.2 | 87.4 | 69.9 | 113.4 | 735.9 |
| S-off-LfD | CEIL (ours) | 110.4 | 103.0 | 106.8 | 40.0 | 30.3 | 63.9 | 118.6 | 110.8 | 117.0 | 126.3 | 122.0 | 114.3 | 1163.5 |
| S-off-LfO | ORIL (TD3+BC) | 43.4 | 25.7 | 73.0 | 44.9 | 2.4 | 81.8 | 58.9 | 16.8 | 78.2 | 33.7 | 29.6 | 67.1 | 555.4 |
| S-off-LfO | SMODICE | 54.5 | 26.4 | 73.7 | 42.7 | 37.9 | 66.2 | 60.6 | 38.5 | 70.9 | 85.7 | 68.3 | 116.3 | 741.7 |
| S-off-LfO | CEIL (ours) | 54.2 | 51.4 | 90.4 | 43.5 | 40.1 | 47.7 | 78.5 | 20.5 | 110.0 | 97.0 | 67.8 | 120.5 | 821.7 |
| C-off-LfD | ORIL (TD3+BC) | 52.8 | 27.6 | 46.5 | 38.3 | 8.0 | 74.0 | 25.3 | 28.4 | 26.3 | 26.0 | 17.6 | 11.9 | 382.6 |
| C-off-LfD | SQIL (TD3+BC) | 34.4 | 19.1 | 11.4 | 19.2 | 25.1 | 19.9 | 15.8 | 16.5 | 8.8 | 21.8 | 23.2 | 21.2 | 236.2 |
| C-off-LfD | IQ-Learn | 37.3 | 35.4 | 25.9 | 27.4 | 27.1 | 31.2 | 27.7 | 22.2 | 31.7 | 63.7 | 63.3 | 55.8 | 448.8 |
| C-off-LfD | ValueDICE | 22.0 | 18.3 | 18.9 | 14.0 | 11.7 | 8.7 | 11.5 | 10.0 | 8.6 | 24.1 | 21.4 | 19.2 | 188.4 |
| C-off-LfD | DemoDICE | 52.9 | 15.2 | 77.2 | 42.8 | 38.9 | 53.8 | 58.4 | 26.4 | 77.8 | 87.8 | 69.3 | 114.9 | 715.6 |
| C-off-LfD | SMODICE | 55.4 | 21.4 | 71.2 | 42.7 | 38.0 | 64.6 | 68.4 | 34.2 | 80.4 | 87.4 | 70.4 | 115.7 | 749.7 |
| C-off-LfD | CEIL (ours) | 58.4 | 39.8 | 81.6 | 42.6 | 38.3 | 46.6 | 76.5 | 21.1 | 81.1 | 91.6 | 88.0 | 115.3 | 780.9 |
| C-off-LfO | ORIL (TD3+BC) | 55.5 | 18.2 | 55.5 | 40.6 | 2.9 | 73.0 | 26.9 | 19.4 | 22.7 | 11.2 | 21.3 | 10.8 | 358.0 |
| C-off-LfO | SMODICE | 53.7 | 18.3 | 64.2 | 42.6 | 38.0 | 63.0 | 68.9 | 37.5 | 60.7 | 87.5 | 75.1 | 115.0 | 724.4 |
| C-off-LfO | CEIL (ours) | 44.7 | 44.2 | 48.2 | 42.4 | 36.5 | 46.9 | 76.2 | 31.7 | 77.0 | 95.9 | 71.0 | 112.7 | 727.3 |

Figure 3: Ablating (a, b) the number of expert demonstrations and (c, d) the trajectory window size.

One-shot IL. Then, we explore CEIL on one-shot IL tasks, where we expect CEIL to adapt its behavior to new IL tasks given only one trajectory per task (with mismatched MDPs; see Appendix 9.2).
We first pre-train an embedding function and a contextual policy in the training domain (online/offline IL), then infer a new contextual variable and evaluate it on the new task. To facilitate comparison with the baselines, we similarly pre-train a policy network (using each baseline) and run BC on top of the pre-trained policy using the provided demonstration. Consequently, such a baseline+BC procedure cannot be applied to the (one-shot) LfO tasks. The results in Table 3 show that baseline+BC struggles to transfer its expertise to new tasks. Benefiting from the hindsight framework, CEIL shows better one-shot transfer performance on 7 out of 8 one-shot LfD tasks and retains higher scalability and generality for both one-shot LfD and LfO tasks.

Table 3: Normalized results on one-shot IL, where CEIL shows prominent transferability.

| Pre-training | Method | Hop. | Hal. | Wal. | Ant. |
|---|---|---|---|---|---|
| Online | SQIL | 16.8 | 1.1 | 3.5 | 4.2 |
| Online | IQ-Learn | 4.6 | 0.2 | 1.7 | 7.5 |
| Online | CEIL (LfD) | 29.9 | 2.5 | 31.7 | 20.5 |
| Online | CEIL (LfO) | 17.8 | 3.2 | 5.6 | 29.7 |
| Offline | ORIL | 14.7 | 0.2 | 6.9 | 17.4 |
| Offline | SQIL | 7.4 | 0.8 | 4.6 | 12.5 |
| Offline | IQ-Learn | 18.8 | 1.2 | 4.0 | 19.3 |
| Offline | DemoDICE | 76.5 | -0.5 | -0.1 | 19.5 |
| Offline | SMODICE | 78.0 | 1.1 | 8.1 | 24.6 |
| Offline | CEIL (LfD) | 85.6 | 5.6 | 67.1 | 24.3 |
| Offline | CEIL (LfO) | 72.2 | 5.1 | 70.0 | 19.4 |

5.2 Analysis of CEIL

Hybrid IL settings. In the real world, many IL tasks do not correspond to one specific IL setting, and instead consist of a hybrid of several IL settings, each of which passes a portion of task-relevant information to the IL agent. For example, we can provide the agent with both demonstrations and state-only observations and, in some cases, cross-domain demonstrations (S-LfD + S-LfO + C-LfD). To examine the versatility of CEIL, we collect a separate expert trajectory for each of the four offline IL settings and study CEIL's performance under hybrid IL settings. As shown in Table 4, by adding new expert behaviors on top of LfD, even when they carry relatively less supervision (e.g., actions are absent in LfO), CEIL can still improve its performance.

Table 4: Normalized results of CEIL, showing that CEIL can consistently digest useful (task-relevant) information and boost its performance, even under a hybrid of offline IL settings.

| Hybrid offline IL settings | Hop. | Hal. | Wal. | Ant. |
|---|---|---|---|---|
| S-LfD | 29.4 | 69.9 | 42.8 | 84.9 |
| S-LfD + S-LfO | 30.4 | 68.6 | 42.3 | 91.6 |
| S-LfD + S-LfO + C-LfD | 30.7 | 71.7 | 42.9 | 89.2 |
| S-LfD + S-LfO + C-LfD + C-LfO | 58.6 | 79.6 | 43.7 | 98.0 |

Varying the number of demonstrations. In Figure 3 (a, b), we study the effect of the number of expert demonstrations on CEIL's performance. Empirically, we reduce the number of training demonstrations from 20 to 1 and report the normalized returns at 1M training steps. We can observe that, across both online and offline (D4RL *-medium) IL settings, CEIL shows more robust performance with respect to different numbers of demonstrations compared to the baseline methods.

Varying the trajectory window size. Next, we assess the effect of the trajectory window size (i.e., the length of the trajectory τ used by the embedding function fϕ in Equation 3). In Figure 3 (c, d), we ablate the window size in 4 LfD IL instantiations. We can see that across a range of window sizes, CEIL remains stable and achieves expert-level performance.
Figure 4: Ablation studies on the optimization of fϕ (ablating fϕ) and the objective J_MI (ablating J_MI) in the S-on-LfD and C-on-LfD settings, where the shaded area represents 95% CIs over 5 trials. See ablation results for offline IL tasks in Table 5.

Table 5: Ablation studies on the optimization of fϕ (ablating fϕ) and the objective J_MI (ablating J_MI), where scores (averaged over 5 trials for each task) within two points of the maximum score are highlighted.

| Setting | Method | Hop-m | Hop-mr | Hop-me | Hal-m | Hal-mr | Hal-me | Wal-m | Wal-mr | Wal-me | Ant-m | Ant-mr | Ant-me | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S-off-LfD | CEIL (ablating fϕ) | 97.9 | 92.5 | 99.3 | 41.3 | 30.3 | 66.7 | 103.6 | 88.1 | 114.4 | 97.6 | 98.4 | 100.7 | 1030.8 |
| S-off-LfD | CEIL (ablating J_MI) | 83.2 | 89.0 | 98.7 | 27.1 | 28.3 | 53.5 | 107.4 | 68.0 | 75.6 | 116.9 | 97.8 | 105.9 | 951.4 |
| S-off-LfD | CEIL | 110.4 | 103.0 | 106.8 | 40.0 | 30.3 | 63.9 | 118.6 | 110.8 | 117.0 | 126.3 | 122.0 | 114.3 | 1163.5 |
| S-off-LfO | CEIL (ablating fϕ) | 51.5 | 41.1 | 83.3 | 43.8 | 40.1 | 63.7 | 76.3 | 20.3 | 103.0 | 78.0 | 52.5 | 105.5 | 759.2 |
| S-off-LfO | CEIL (ablating J_MI) | 54.3 | 44.9 | 84.7 | 42.2 | 39.9 | 51.6 | 77.4 | 22.7 | 94.0 | 92.1 | 67.9 | 118.4 | 792.0 |
| S-off-LfO | CEIL | 54.2 | 51.4 | 90.4 | 43.5 | 40.1 | 47.7 | 78.5 | 20.5 | 110.0 | 97.0 | 67.8 | 120.5 | 821.7 |

Ablation studies on the optimization of fϕ and the objective J_MI. In Figure 4 and Table 5, we carry out ablation experiments on the fϕ loss and on J_MI in both online and offline IL settings. We can see that ablating the fϕ loss (optimizing fϕ with Equation 5 only) does degrade performance in both online and offline IL tasks, demonstrating the effectiveness of also optimizing with Equation 8. Intuitively, Equation 8 encourages the embedding function to be task-relevant, which is why we also use the expert matching loss to update fϕ. We can also see that ablating J_MI leads to degraded performance, further verifying the effectiveness of our expert matching objective in the latent space.

6 Conclusion

In this paper, we present CEIL, a novel and general imitation learning framework applicable to a wide range of IL settings, including C/S-on/off-LfD/LfO and few-shot IL settings. This is achieved by explicitly decoupling the imitation policy into 1) a contextual policy, learned with the self-supervised hindsight information matching objective, and 2) a latent variable, inferred by optimizing the IL expert matching objective. Compared to prior baselines, our results show that CEIL is more sample-efficient in most online IL tasks and achieves better or competitive performance in offline tasks.

Limitations and future work. Our primary aim in this work is to develop a simple and scalable IL method, and we believe that CEIL makes an important step in that direction. Admittedly, we also find some limitations of CEIL: 1) Offline results generally outperform online results, especially in the LfO setting. The main reason is that CEIL lacks explicit exploration bounds; future work could therefore explore the exploration ability of online CEIL. 2) The trajectory self-consistency loss cannot be applied to cross-embodiment agents when the two embodiments/domains have different state or action spaces. In such a cross-embodiment setting, a typical approach is to serialize states/actions from different modalities into a flat sequence of tokens.
We also remark that CEIL is compatible with such a tokenization approach and is thus suitable for IL tasks with different action/state spaces. We therefore encourage future exploration of generalized IL methods across different embodiments.

Acknowledgments and Disclosure of Funding

We sincerely thank the anonymous reviewers for their insightful suggestions. This work was supported by the National Science and Technology Innovation 2030 - Major Project (Grant No. 2022ZD0208800), and NSFC General Program (Grant No. 62176215).

References

[1] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34:29304–29320, 2021.
[2] Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.
[3] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in Neural Information Processing Systems, 30, 2017.
[4] Kai Arulkumaran, Dylan R Ashley, Jürgen Schmidhuber, and Rupesh K Srivastava. All you need is supervised learning: From imitation learning to meta-RL with upside down RL. arXiv preprint arXiv:2202.11960, 2022.
[5] Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022.
[6] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
[7] Damian Boborzi, Christoph-Nikolas Straehle, Jens S Buchner, and Lars Mikelsons. Imitation learning by state-only distribution matching. arXiv preprint arXiv:2202.04332, 2022.
[8] David Brandfonbrener, Alberto Bietti, Jacob Buckman, Romain Laroche, and Joan Bruna. When does return-conditioned supervised learning work for offline reinforcement learning? arXiv preprint arXiv:2206.01079, 2022.
[9] Daniel S Brown, Wonjoon Goo, and Scott Niekum. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning, pages 330–359. PMLR, 2020.
[10] Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan, et al. UniMASK: Unified inference in sequential decision problems. arXiv preprint arXiv:2211.10869, 2022.
[11] Jonathan Chang, Masatoshi Uehara, Dhruv Sreenivas, Rahul Kidambi, and Wen Sun. Mitigating covariate shift in imitation learning via offline data with partial coverage. Advances in Neural Information Processing Systems, 34:965–979, 2021.
[12] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021.
[13] Robert Dadashi, Léonard Hussenot, Matthieu Geist, and Olivier Pietquin. Primal Wasserstein imitation learning. arXiv preprint arXiv:2006.04678, 2020.
[14] Christopher R Dance, Julien Perez, and Théo Cachet.
Conditioned reinforcement learning for few-shot imitation. In International Conference on Machine Learning, pages 2376–2387. PMLR, 2021.
[15] Branton DeMoss, Paul Duckworth, Nick Hawes, and Ingmar Posner. Ditto: Offline imitation learning with world models. arXiv preprint arXiv:2302.03086, 2023.
[16] Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. Advances in Neural Information Processing Systems, 30, 2017.
[17] Scott Emmons, Benjamin Eysenbach, Ilya Kostrikov, and Sergey Levine. RvS: What is essential for offline RL via supervised learning? arXiv preprint arXiv:2112.10751, 2021.
[18] Arnaud Fickinger, Samuel Cohen, Stuart Russell, and Brandon Amos. Cross-domain imitation learning via optimal transport. arXiv preprint arXiv:2110.03684, 2021.
[19] Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. In Conference on Robot Learning, pages 158–168. PMLR, 2022.
[20] Tim Franzmeyer, Philip HS Torr, and João F Henriques. Learn what matters: Cross-domain imitation learning with task-relevant embeddings. arXiv preprint arXiv:2209.12093, 2022.
[21] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
[22] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
[23] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132–20145, 2021.
[24] Hiroki Furuta, Yutaka Matsuo, and Shixiang Shane Gu. Generalized decision transformer for offline hindsight information matching. arXiv preprint arXiv:2111.10364, 2021.
[25] Tanmay Gangwani and Jian Peng. State-only imitation with transition dynamics mismatch. arXiv preprint arXiv:2002.11879, 2020.
[26] Tanmay Gangwani, Yuan Zhou, and Jian Peng. Imitation learning from observations under transition model disparity. arXiv preprint arXiv:2204.11446, 2022.
[27] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
[28] Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, and Stefano Ermon. IQ-Learn: Inverse soft-Q learning for imitation. Advances in Neural Information Processing Systems, 34:4028–4039, 2021.
[29] Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, and Stuart Russell. imitation: Clean imitation learning implementations, 2022.
[30] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29, 2016.
[31] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34:1273–1286, 2021.
[32] Firas Jarboui and Vianney Perchet. Offline inverse reinforcement learning. arXiv preprint arXiv:2106.05068, 2021.
[33] Daniel Jarrett, Ioana Bica, and Mihaela van der Schaar. Strictly batch imitation learning by energy-based distribution matching.
Advances in Neural Information Processing Systems, 33:7354–7365, 2020.
[34] Shengyi Jiang, Jingcheng Pang, and Yang Yu. Offline imitation learning with a misspecified simulator. Advances in Neural Information Processing Systems, 33:8510–8520, 2020.
[35] Kshitij Judah, Alan Fern, Prasad Tadepalli, and Robby Goetschalckx. Imitation learning with demonstrations and shaping rewards. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
[36] Yachen Kang, Diyuan Shi, Jinxin Liu, Li He, and Donglin Wang. Beyond reward: Offline preference-guided policy optimization. arXiv preprint arXiv:2305.16217, 2023.
[37] Liyiming Ke, Sanjiban Choudhury, Matt Barnes, Wen Sun, Gilwoo Lee, and Siddhartha Srinivasa. Imitation learning as f-divergence minimization. In International Workshop on the Algorithmic Foundations of Robotics, pages 313–329. Springer, 2020.
[38] Liyiming Ke, Sanjiban Choudhury, Matt Barnes, Wen Sun, Gilwoo Lee, and Siddhartha Srinivasa. Imitation learning as f-divergence minimization. In Algorithmic Foundations of Robotics XIV: Proceedings of the Fourteenth Workshop on the Algorithmic Foundations of Robotics, pages 313–329. Springer International Publishing, 2021.
[39] Geon-Hyeong Kim, Seokin Seo, Jongmin Lee, Wonseok Jeon, Hyeong Joo Hwang, Hongseok Yang, and Kee-Eung Kim. DemoDICE: Offline imitation learning with supplementary imperfect demonstrations. In International Conference on Learning Representations, 2022.
[40] Kuno Kim, Yihong Gu, Jiaming Song, Shengjia Zhao, and Stefano Ermon. Domain adaptive imitation learning. In International Conference on Machine Learning, pages 5286–5295. PMLR, 2020.
[41] Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. arXiv preprint arXiv:1912.05032, 2019.
[42] Aviral Kumar, Xue Bin Peng, and Sergey Levine. Reward-conditioned policies. arXiv preprint arXiv:1912.13465, 2019.
[43] Yao Lai, Jinxin Liu, Zhentao Tang, Bin Wang, HAO Jianye, and Ping Luo. Chipformer: Transferable chip placement via offline decision transformer. ICML, 2023. URL https://openreview.net/pdf?id=j0miEWtw87.
[44] Youngwoon Lee, Andrew Szot, Shao-Hua Sun, and Joseph J Lim. Generalizable imitation learning from observation via inferring goal proximity. Advances in Neural Information Processing Systems, 34:16118–16130, 2021.
[45] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[46] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
[47] Yunzhu Li, Jiaming Song, and Stefano Ermon. InfoGAIL: Interpretable imitation learning from visual demonstrations. Advances in Neural Information Processing Systems, 30, 2017.
[48] Fangchen Liu, Zhan Ling, Tongzhou Mu, and Hao Su. State alignment-based imitation learning. arXiv preprint arXiv:1911.10947, 2019.
[49] Jinxin Liu, Donglin Wang, Qiangxing Tian, and Zhengyu Chen. Learn goal-conditioned policy with intrinsic motivation for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7558–7566, 2022.
[50] Jinxin Liu, Hongyin Zhang, and Donglin Wang. DARA: Dynamics-aware reward augmentation in offline reinforcement learning. arXiv preprint arXiv:2203.06662, 2022.
[51] Jinxin Liu, Ziqi Zhang, Zhenyu Wei, Zifeng Zhuang, Yachen Kang, Sibo Gai, and Donglin Wang.
Beyond OOD state actions: Supported cross-domain offline reinforcement learning. arXiv preprint arXiv:2306.12755, 2023.
[52] Minghuan Liu, Tairan He, Minkai Xu, and Weinan Zhang. Energy-based imitation learning. arXiv preprint arXiv:2004.09395, 2020.
[53] Minghuan Liu, Hanye Zhao, Zhengyu Yang, Jian Shen, Weinan Zhang, Li Zhao, and Tie-Yan Liu. Curriculum offline imitating learning. Advances in Neural Information Processing Systems, 34:6266–6277, 2021.
[54] YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1118–1125. IEEE, 2018.
[55] Yicheng Luo, Zhengyao Jiang, Samuel Cohen, Edward Grefenstette, and Marc Peter Deisenroth. Optimal transport for offline imitation learning. arXiv preprint arXiv:2303.13971, 2023.
[56] Yecheng Jason Ma, Andrew Shen, Dinesh Jayaraman, and Osbert Bastani. SMODICE: Versatile offline imitation learning via state occupancy matching. arXiv e-prints, arXiv:2202, 2022.
[57] Tianwei Ni, Harshit Sikchi, Yufei Wang, Tejus Gupta, Lisa Lee, and Ben Eysenbach. f-IRL: Inverse reinforcement learning via state marginal matching. In Conference on Robot Learning, pages 529–551. PMLR, 2021.
[58] Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
[59] Yiwen Qiu, Jialong Wu, Zhangjie Cao, and Mingsheng Long. Out-of-dynamics imitation learning from multimodal demonstrations. In Conference on Robot Learning, pages 1071–1080. PMLR, 2023.
[60] Dripta S Raychaudhuri, Sujoy Paul, Jeroen Vanbaar, and Amit K Roy-Chowdhury. Cross-domain imitation from observations. In International Conference on Machine Learning, pages 8902–8912. PMLR, 2021.
[61] Siddharth Reddy, Anca D Dragan, and Sergey Levine. SQIL: Imitation learning via reinforcement learning with sparse rewards. arXiv preprint arXiv:1905.11108, 2019.
[62] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
[63] Stefan Schaal. Learning from demonstration. Advances in Neural Information Processing Systems, 9, 1996.
[64] Rupesh Kumar Srivastava, Pranav Shyam, Filipe Mutz, Wojciech Jaśkowski, and Jürgen Schmidhuber. Training agents using upside-down reinforcement learning. arXiv preprint arXiv:1912.02877, 2019.
[65] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.
[66] Faraz Torabi, Garrett Warnell, and Peter Stone. Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158, 2018.
[67] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
[68] Luca Viano, Yu-Ting Huang, Parameswaran Kamalaruban, Craig Innes, Subramanian Ramamoorthy, and Adrian Weller. Robust learning from observation with model misspecification. arXiv preprint arXiv:2202.06003, 2022.
[69] Tianyu Wang, Nikhil Karnwal, and Nikolay Atanasov. Latent policies for adversarial imitation learning. arXiv preprint arXiv:2206.11299, 2022.
[70] Haoran Xu, Xianyuan Zhan, Honglei Yin, and Huiling Qin.
Discriminator-weighted offline imitation learning from suboptimal demonstrations. In International Conference on Machine Learning, pages 24725–24742. PMLR, 2022.
[71] Mengdi Xu, Yikang Shen, Shun Zhang, Yuchen Lu, Ding Zhao, Joshua Tenenbaum, and Chuang Gan. Prompting decision transformer for few-shot policy generalization. In International Conference on Machine Learning, pages 24631–24645. PMLR, 2022.
[72] Sheng Yue, Guanbo Wang, Wei Shao, Zhaofeng Zhang, Sen Lin, Ju Ren, and Junshan Zhang. CLARE: Conservative model-based reward learning for offline inverse reinforcement learning. arXiv preprint arXiv:2302.04782, 2023.
[73] Wenjia Zhang, Haoran Xu, Haoyi Niu, Peng Cheng, Ming Li, Heming Zhang, Guyue Zhou, and Xianyuan Zhan. Discriminator-guided model-based offline imitation learning. In Conference on Robot Learning, pages 1266–1276. PMLR, 2023.
[74] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
[75] Zhuangdi Zhu, Kaixiang Lin, Bo Dai, and Jiayu Zhou. Off-policy imitation learning from observations. Advances in Neural Information Processing Systems, 33:12402–12413, 2020.
[76] Zifeng Zhuang, Kun Lei, Jinxin Liu, Donglin Wang, and Yilang Guo. Behavior proximal policy optimization. arXiv preprint arXiv:2302.11312, 2023.
[77] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
[78] Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. Offline learning from demonstrations and unlabeled experience. arXiv preprint arXiv:2011.13885, 2020.
[79] Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gomez Colmenarejo, David Budden, Serkan Cabi, Misha Denil, Nando de Freitas, and Ziyu Wang. Task-relevant adversarial imitation learning. In Conference on Robot Learning, pages 247–263. PMLR, 2021.

7 Additional Derivation

(Repeat from the main paper.) To gain more insight into Equation 4, which captures the quality of IL (the degree of similarity to the expert data), we define D(·, ·) as the sum of the reverse KL and forward KL divergences, i.e., D(q, p) = DKL(q‖p) + DKL(p‖q), and derive an alternative form for Equation 4:
$$\arg\min_{z^*} D\big(\pi_\theta(\tau|z^*), \pi_E(\tau)\big) = \arg\max_{z^*}\ \underbrace{I(z^*; \tau_E) - I(z^*; \tau_\theta)}_{J_{MI}} - \underbrace{D\big(\pi_\theta(\tau), \pi_E(\tau)\big)}_{J_D},$$
where I(x; y) denotes the mutual information (MI) between x and y, which measures the predictive power of y on x (or vice versa), the latent variables are defined as τE := τ ∼ πE(τ) and τθ := τ ∼ p(z*)πθ(τ|z*), and πθ(τ) = E_{z*}[πθ(τ|z*)]. Below is our derivation:
$$\begin{aligned}
\min_{z^*} D\big(\pi_\theta(\tau|z^*), \pi_E(\tau)\big)
&= \min_{z^*} \mathbb{E}_{p(z^*)}\Big[ D_{KL}\big(\pi_\theta(\tau|z^*)\,\|\,\pi_E(\tau)\big) + D_{KL}\big(\pi_E(\tau)\,\|\,\pi_\theta(\tau|z^*)\big) \Big] \\
&= \min_{z^*} \mathbb{E}_{p(z^*)\pi_\theta(\tau|z^*)}\big[\log \pi_\theta(\tau|z^*) - \log \pi_E(\tau)\big] + \mathbb{E}_{p(z^*)\pi_E(\tau)}\big[\log \pi_E(\tau) - \log \pi_\theta(\tau|z^*)\big] \\
&= \min_{z^*} \mathbb{E}_{p(z^*)\pi_\theta(\tau|z^*)}\Big[\log \tfrac{p(z^*|\tau)\,\pi_\theta(\tau)}{p(z^*)} - \log \pi_E(\tau)\Big] + \mathbb{E}_{p(z^*)\pi_E(\tau)}\Big[\log \pi_E(\tau) - \log \tfrac{p(z^*|\tau)\,\pi_\theta(\tau)}{p(z^*)}\Big] \\
&= \min_{z^*} \mathbb{E}_{p(z^*)\pi_\theta(\tau|z^*)}\Big[\log \tfrac{p(z^*|\tau)}{p(z^*)} + \log \tfrac{\pi_\theta(\tau)}{\pi_E(\tau)}\Big] - \mathbb{E}_{p(z^*)\pi_E(\tau)}\Big[\log \tfrac{p(z^*|\tau)}{p(z^*)} + \log \tfrac{\pi_\theta(\tau)}{\pi_E(\tau)}\Big] \\
&= \max_{z^*}\ I(z^*; \tau_E) - I(z^*; \tau_\theta) - D\big(\pi_\theta(\tau), \pi_E(\tau)\big),
\end{aligned}$$
where τE := τ ∼ πE(τ) and τθ := τ ∼ p(z*)πθ(τ|z*).

8 More Comparisons and Ablation Studies

8.1 Offline Comparison on D4RL Expert Domain Dataset

In Table 6, we provide the normalized returns of our method and the baseline methods on the reward-free D4RL [22] expert dataset. Consistently, we can observe that CEIL achieves a significant improvement over the baseline methods in both the S-off-LfD and S-off-LfO settings.
Compared to the state-of-the-art offline IL baselines, CEIL also shows competitive results in the challenging cross-domain offline IL settings (C-off-LfD and C-off-LfO).

8.2 Generalizability on Cross-domain Offline IL Settings

In the standard cross-domain IL setting, the goal is to extract expert-relevant information from the mismatched expert demonstrations/observations (expert domain) and to mimic such expert behaviors in the training environment (training domain). Thus, we validate the performance of the learned policy in the training environment (i.e., the environment where the offline data was collected). Here, we also study the generalizability of the learned policy by evaluating it in the expert environment (i.e., the environment where the mismatched expert data was collected). We provide the normalized scores (evaluated in the expert domain) in Table 7. We find that across a range of cross-domain offline IL tasks, CEIL consistently demonstrates better (zero-shot) generalizability compared to the baselines.

8.3 Ablating the Cross-domain Regularization

We now conduct ablation studies to evaluate the importance of the cross-domain regularization in Equation 9 (in the main paper). In Figure 5, we provide the performance improvement when we ablate the cross-domain regularization in the two cross-domain offline IL settings (C-off-LfD and C-off-LfO). We find that in 26 out of 32 cross-domain tasks, ablating the regularization causes performance to decrease (negative performance improvement), thus verifying the benefits of encouraging task-relevant embeddings.

Table 6: Normalized scores (averaged over 30 trials for each task) on the D4RL expert dataset. Scores within two points of the maximum score are highlighted. hop: Hopper-v2. hal: HalfCheetah-v2. wal: Walker2d-v2. ant: Ant-v2.

Setting      Method             hop     hal     wal     ant     sum
                                expert  expert  expert  expert
S-off-LfD    ORIL (TD3+BC)      97.5    91.8    14.5    76.8    280.6
             SQIL (TD3+BC)      25.5    14.4    8.0     44.3    92.1
             IQ-Learn           37.3    9.9     46.6    85.9    179.7
             ValueDICE          65.6    2.9     28.2    90.5    187.1
             DemoDICE           107.3   87.1    104.8   114.2   413.3
             SMODICE            111.0   93.5    108.2   122.0   434.7
             CEIL               106.0   96.0    115.6   117.8   435.4
S-off-LfO    ORIL (TD3+BC)      64.2    92.1    12.2    44.3    212.8
             SMODICE            111.3   93.7    108.0   122.0   435.0
             CEIL               103.3   96.8    110.0   126.4   436.5
C-off-LfD    ORIL (TD3+BC)      24.4    78.3    29.3    32.1    164.1
             SQIL (TD3+BC)      12.2    19.9    8.8     21.2    62.0
             IQ-Learn           25.9    31.2    31.7    55.8    144.6
             ValueDICE          18.6    9.8     8.3     22.3    59.0
             DemoDICE           111.5   88.7    107.9   122.5   430.6
             SMODICE            111.1   93.8    108.2   120.9   434.0
             CEIL               105.8   97.1    108.6   112.2   423.7
C-off-LfO    ORIL (TD3+BC)      22.5    76.6    11.2    28.2    138.6
             SMODICE            111.2   93.7    108.1   117.7   430.7
             CEIL               113.0   90.1    108.7   125.2   437.0

Table 7: Normalized scores (evaluated in the expert domain, averaged over 30 trials for each task) on two cross-domain offline IL settings: C-off-LfD and C-off-LfO. Scores within two points of the maximum score are highlighted. m: medium. mr: medium-replay. me: medium-expert. e: expert.
Setting      Method             Hopper-v2                     HalfCheetah-v2                sum
                                m      mr     me     e        m      mr     me     e
C-off-LfD    ORIL (TD3+BC)      74.7   16.7   45.0   21.4     2.2    0.8    -0.3   -2.2    158.3
             SQIL (TD3+BC)      33.6   21.6   14.5   14.5     18.2   7.5    20.9   20.9    151.8
             IQ-Learn           11.8   9.7    17.1   17.1     7.7    7.8    9.5    9.5     90.2
             ValueDICE          49.5   24.2   55.7   49.3     32.2   32.9   38.7   28.7    311.2
             DemoDICE           83.2   31.5   81.6   28.5     0.9    -1.1   -1.7   -2.4    220.6
             SMODICE            80.1   26.1   78.0   54.3     2.8    -1.0   1.0    -2.3    239.1
             CEIL               87.4   74.3   81.2   82.4     44.0   30.4   25.0   17.1    441.9
C-off-LfO    ORIL (TD3+BC)      62.3   18.7   57.0   28.2     0.2    1.1    -0.3   -2.3    165.0
             SMODICE            77.6   22.5   80.2   71.0     2.0    -0.9   0.8    -2.3    250.9
             CEIL               56.4   58.6   56.7   65.2     5.5    36.5   5.0    5.0     288.7

Setting      Method             Walker2d-v2                   Ant-v2                        sum
                                m      mr     me     e        m      mr     me     e
C-off-LfD    ORIL (TD3+BC)      22.0   24.5   23.9   33.1     16.0   18.6   2.5    0.4     141.0
             SQIL (TD3+BC)      32.4   14.9   10.3   10.3     71.4   63.6   60.1   60.1    323.1
             IQ-Learn           8.4    5.0    10.2   10.2     19.4   18.4   16.1   16.1    103.8
             ValueDICE          31.7   21.9   22.9   27.7     70.5   68.5   69.3   68.5    380.9
             DemoDICE           12.8   31.5   12.9   86.9     15.7   24.2   2.3    1.4     187.7
             SMODICE            43.6   16.1   62.0   85.3     23.7   22.9   2.3    -5.9    249.9
             CEIL               102.8  94.8   101.9  100.7    82.0   77.0   76.4   79.8    715.3
C-off-LfO    ORIL (TD3+BC)      22.4   15.2   17.8   12.6     13.6   20.7   5.5    -6.2    101.6
             SMODICE            42.4   17.0   55.5   88.7     15.7   22.6   2.5    -6.3    238.1
             CEIL               67.9   12.0   68.4   50.8     31.7   57.0   18.0   -1.9    304.0

Figure 5: Normalized performance improvement (left: C-off-LfD, right: C-off-LfO) when we ablate the cross-domain regularization (Equation 9 in the main paper) in cross-domain IL settings. We can observe the general trend (in 26 out of 32 tasks) that ablating the cross-domain regularization causes a negative performance improvement. hop: Hopper-v2. hal: HalfCheetah-v2. wal: Walker2d-v2. ant: Ant-v2. m: medium. me: medium-expert. mr: medium-replay. e: expert.

Figure 6: Aggregate median, IQM, mean, and optimality gap (in terms of normalized return) over 16 offline IL tasks, shown separately for (a) S-off-LfD, (b) S-off-LfO, (c) C-off-LfD, and (d) C-off-LfO. Higher median, higher IQM, higher mean, and lower optimality gap are better. The shaded bar shows 95% stratified bootstrap confidence intervals. We can see that CEIL achieves consistently better performance across a wide range of offline IL settings.

Figure 7: Return curves in Walker2d-v2 (from left to right: S-on-LfD, C-on-LfD, S-on-LfO, and C-on-LfO), where the shaded area represents a 95% confidence interval over 30 trials. The compared methods are ValueDICE and IQ-Learn for S/C-on-LfD, and AIRL (state-only), GAIfO, and IQ-Learn for S/C-on-LfO.
We can see that CEIL consistently achieves expert-level performance in the LfD tasks (S-on-LfD and C-on-LfD). Due to the lack of explicit exploration in the online LfO settings, CEIL exhibits a drastic performance degradation (in S-on-LfO and C-on-LfO) under the same number of environment interaction steps.

8.4 Aggregate Results

Following Agarwal et al. [1], we report the aggregate statistics (over 16 offline IL tasks) in Figure 6. We find that CEIL provides consistently competitive performance across a range of offline IL settings (S-off-LfD, S-off-LfO, C-off-LfD, and C-off-LfO) and outperforms the prior offline baselines.

8.5 Varying the Number of Expert Trajectories

As a complement to the experimental results in the main paper, we further compare the performance of CEIL and the baselines on more tasks when we vary the number of expert trajectories. Considering the offline IL settings, we provide the results in Table 8 for 5, 10, 15, and 20 expert trajectories, respectively.

Table 8: Normalized scores (averaged over 30 trials for each task) when we vary the number of expert demonstrations (#5, #10, #15, and #20). Scores within two points of the maximum score are highlighted.

Offline IL setting  Method            Hopper-v2             HalfCheetah-v2        Walker2d-v2           Ant-v2                sum
                                      m      mr     me      m      mr     me      m      mr     me      m      mr     me
S-off-LfD #5        ORIL (TD3+BC)     42.1   26.7   51.2    45.1   2.7    79.6    44.1   22.9   38.3    25.6   24.5   6.0     408.8
                    SQIL (TD3+BC)     45.2   27.4   5.9     14.5   15.7   11.8    12.2   7.2    13.6    20.6   23.6   -5.7    192.0
                    IQ-Learn          17.2   15.4   21.7    6.4    4.8    6.2     13.1   10.6   5.1     22.8   27.2   18.7    169.2
                    ValueDICE         59.8   80.1   72.6    2.0    0.9    1.2     2.8    0.0    7.4     27.3   32.7   30.2    316.9
                    DemoDICE          50.2   26.5   63.7    41.9   38.7   59.5    66.3   38.8   101.6   82.8   68.8   112.4   751.2
                    SMODICE           54.1   34.9   64.7    42.6   38.4   63.8    62.2   40.6   55.4    86.0   69.7   112.4   724.7
                    CEIL              94.5   45.1   80.8    45.1   43.3   33.9    103.1  81.1   99.4    99.8   101.4  85.0    912.5
S-off-LfD #10       ORIL (TD3+BC)     42.0   21.6   53.4    45.0   2.1    82.1    44.1   27.4   80.4    47.3   24.0   44.9    514.1
                    SQIL (TD3+BC)     50.0   34.2   7.4     8.8    10.9   8.2     20.0   15.2   9.7     35.3   36.2   11.9    247.6
                    IQ-Learn          11.3   18.6   20.1    4.1    6.5    6.6     18.3   12.8   12.2    30.7   53.9   23.7    218.7
                    ValueDICE         56.0   64.1   54.2    -0.2   2.6    2.4     4.7    4.0    0.9     31.4   72.3   49.5    341.8
                    DemoDICE          53.6   25.8   64.9    42.1   36.9   60.6    64.7   36.1   100.2   87.4   67.1   114.3   753.5
                    SMODICE           55.6   30.3   66.6    42.6   38.0   66.0    64.5   44.6   53.8    86.9   69.5   113.4   731.8
                    CEIL              113.2  53.0   96.3    64.0   43.6   44.0    120.4  82.3   104.2   119.3  70.0   90.1    1000.4
S-off-LfD #15       ORIL (TD3+BC)     38.9   22.3   46.8    44.7   1.9    83.8    37.9   4.2    69.9    59.4   22.3   12.4    444.6
                    SQIL (TD3+BC)     42.8   44.4   5.2     6.8    17.1   9.1     16.9   13.5   6.9     21.2   17.2   12.6    213.6
                    IQ-Learn          14.6   8.2    29.3    4.0    3.4    5.1     7.3    14.5   11.4    54.2   15.2   61.6    228.6
                    ValueDICE         66.3   58.3   53.6    2.3    2.3    1.2     5.2    -0.1   17.0    45.2   72.0   74.3    397.8
                    DemoDICE          52.2   29.6   67.3    41.9   37.6   58.1    66.4   42.9   103.5   86.6   68.3   114.3   768.7
                    SMODICE           55.9   25.7   72.7    42.5   37.6   66.4    67.0   43.2   55.1    86.7   69.7   118.2   740.6
                    CEIL              116.4  56.7   103.7   80.4   43.0   43.8    120.3  84.8   103.8   126.8  87.0   90.6    1057.3
S-off-LfD #20       ORIL (TD3+BC)     50.9   22.1   72.7    44.7   30.2   87.5    47.1   26.7   102.6   46.5   31.4   61.9    624.3
                    SQIL (TD3+BC)     32.6   60.6   25.5    13.2   25.3   14.4    25.6   15.6   8.0     63.6   58.4   44.3    387.1
                    IQ-Learn          21.3   19.9   24.9    5.0    7.5    7.5     22.3   19.6   18.5    38.4   24.3   55.3    264.5
                    ValueDICE         73.8   83.6   50.8    1.9    2.4    3.2     24.6   26.4   44.1    79.1   82.4   75.2    547.5
                    DemoDICE          54.8   32.7   65.4    42.8   37.0   55.6    68.1   39.7   95.0    85.6   69.0   108.8   754.6
                    SMODICE           56.1   28.7   68.0    42.7   37.7   66.9    66.2   40.7   58.2    87.4   69.9   113.4   735.9
                    CEIL (ours)       110.4  103.0  106.8   40.0   30.3   63.9    118.6  110.8  117.0   126.3  122.0  114.3   1163.5
We find that when varying the number of expert trajectories, CEIL still obtains higher scores than the baselines, which is consistent with the findings in Figure 3 of the main paper.

8.6 Limitation (Failure Modes in the Online LfO Setting)

Meanwhile, we find that in the online LfO settings, CEIL's performance deteriorates severely on a few tasks, as shown in Figure 7 (Walker2d). In LfD settings (both single-domain and cross-domain IL), CEIL consistently achieves expert-level performance, but when migrating to LfO settings, its performance collapses under the same number of environment interactions. We believe that this is due to the lack of expert actions in LfO settings, which causes the agent to remain in a collapsed state region and therefore degrades performance. Thus, we believe that improving the online exploration ability is a promising direction for future research.

9 Implementation Details

9.1 Imitation Learning Tasks

In our paper, we conduct experiments across a variety of IL problem domains: single/cross-domain IL, online/offline IL, and LfD/LfO IL settings. By arranging and combining these IL domains, we obtain 8 IL tasks in all: S-on-LfD, S-on-LfO, S-off-LfD, S-off-LfO, C-on-LfD, C-on-LfO, C-off-LfD, and C-off-LfO, where S/C denotes single/cross-domain IL, on/off denotes online/offline IL, and LfD/LfO denotes learning from demonstrations/observations, respectively.

S-on-LfD. We have access to a limited number of expert demonstrations and an online interactive training environment. The goal of S-on-LfD is to learn an optimal policy that mimics the provided demonstrations in the training environment.

S-on-LfO. We have access to a limited number of expert observations (state-only demonstrations) and an online interactive training environment. The goal of S-on-LfO is to learn an optimal policy that mimics the provided observations in the training environment.

S-off-LfD. We have access to a limited number of expert demonstrations and a large amount of pre-collected offline (reward-free) data. The goal of S-off-LfD is to learn an optimal policy that mimics the provided demonstrations in the environment in which the offline data was collected. Note that here the environment that was used to collect the expert demonstrations and the environment that was used to collect the offline data are the same environment.

S-off-LfO. We have access to a limited number of expert observations and a large amount of pre-collected offline (reward-free) data. The goal of S-off-LfO is to learn an optimal policy that mimics the provided observations in the environment in which the offline data was collected. Note that here the environment that was used to collect the expert observations and the environment that was used to collect the offline data are the same environment.

C-on-LfD. We have access to a limited number of expert demonstrations and an online interactive training environment. The goal of C-on-LfD is to learn an optimal policy that mimics the provided demonstrations in the training environment. Note that here the environment that was used to collect the expert demonstrations and the online training environment are not the same environment.

C-on-LfO. We have access to a limited number of expert observations (state-only demonstrations) and an online interactive training environment. The goal of C-on-LfO is to learn an optimal policy that mimics the provided observations in the training environment.
Note that here the environment that was used to collect the expert observations and the online training environment are not the same environment.

C-off-LfD. We have access to a limited number of expert demonstrations and a large amount of pre-collected offline (reward-free) data. The goal of C-off-LfD is to learn an optimal policy that mimics the provided demonstrations in the environment in which the offline data was collected. Note that here the environment that was used to collect the expert demonstrations and the environment that was used to collect the offline data are not the same environment.

C-off-LfO. We have access to a limited number of expert observations and a large amount of pre-collected offline (reward-free) data. The goal of C-off-LfO is to learn an optimal policy that mimics the provided observations in the environment in which the offline data was collected. Note that here the environment that was used to collect the expert observations and the environment that was used to collect the offline data are not the same environment.

Figure 8: MuJoCo environments and our modified versions. From left to right: Ant-v2, HalfCheetah-v2, Hopper-v2, Walker2d-v2, our modified Ant-v2, our modified HalfCheetah-v2, our modified Hopper-v2, and our modified Walker2d-v2.

9.2 Online IL Environments, Offline IL Datasets, and One-shot Tasks

Our experiments are conducted in four popular MuJoCo environments (Figure 8): Hopper-v2, HalfCheetah-v2, Walker2d-v2, and Ant-v2. For offline IL tasks, we take the standard (reward-free) D4RL dataset [22] (medium, medium-replay, medium-expert, and expert domains) as the offline dataset. For cross-domain (online/offline) IL tasks, we collect the expert behaviors (demonstrations or observations) in a modified MuJoCo environment. Specifically, we change the height of the agent's torso (as shown in Figure 8). We refer the reader to our code submission, which includes our modified MuJoCo assets. For one-shot IL tasks, we train the policy only in the single-domain IL settings (S-on-LfD, S-on-LfO, S-off-LfD, and S-off-LfO). Then we collect only one expert trajectory in the modified MuJoCo environment and roll out the fine-tuned/inferred policy in the modified environment to test the one-shot performance.

Collecting expert behaviors. In our implementation, we use the publicly available rlkit implementation of SAC (https://github.com/rail-berkeley/rlkit) to learn an expert policy and use the learned policy to collect expert behaviors (demonstrations in LfD or observations in LfO).

9.3 CEIL Implementation Details

Trajectory self-consistency loss. To learn the embedding function $f_\phi$ and a corresponding contextual policy $\pi_\theta(a|s,z)$, we minimize the following trajectory self-consistency loss:

$$\pi_\theta^*, f_\phi^* = \arg\min_{\pi_\theta, f_\phi}\ \mathbb{E}_{\tau_{1:T}\sim D(\tau_{1:T})}\,\mathbb{E}_{(s,a)\sim\tau_{1:T}}\big[-\log \pi_\theta\big(a \mid s, f_\phi(\tau_{1:T})\big)\big],$$

where $\tau_{1:T}$ denotes a trajectory segment with a window size of $T$. In the online setting, we sample the trajectory $\tau$ from the experience replay buffer $D(\tau)$; in the offline setting, we sample the trajectory $\tau$ directly from the given offline data $D(\tau)$. Meanwhile, if we can access the expert actions (i.e., in LfD settings), we also incorporate the expert demonstrations into the empirical expectation (i.e., storing the expert demonstrations in the online/offline experience $D(\tau)$). In our implementation, we use a 4-layer MLP (with ReLU activations) to encode the trajectory $\tau_{1:T}$ and a 4-layer MLP (with ReLU activations) to predict the action, respectively.
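To make the self-consistency objective above concrete, the following PyTorch-style sketch shows one way it could be implemented. It is a minimal illustration under our own assumptions (fixed-length segments flattened into a single vector, a diagonal-Gaussian policy head, and the module names `TrajectoryEncoder` / `ContextualPolicy`), not a reproduction of the released code.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=512, layers=4):
    # 4-layer MLP with ReLU activations, matching the description above.
    mods, d = [], in_dim
    for _ in range(layers - 1):
        mods += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    return nn.Sequential(*mods, nn.Linear(d, out_dim))

class TrajectoryEncoder(nn.Module):
    """f_phi: maps a (flattened) trajectory segment tau_{1:T} to an embedding z."""
    def __init__(self, obs_dim, act_dim, window, z_dim=16):
        super().__init__()
        self.net = mlp(window * (obs_dim + act_dim), z_dim)

    def forward(self, segment):            # segment: (B, T * (obs_dim + act_dim))
        return self.net(segment)

class ContextualPolicy(nn.Module):
    """pi_theta(a | s, z): diagonal-Gaussian policy conditioned on state and embedding."""
    def __init__(self, obs_dim, act_dim, z_dim=16):
        super().__init__()
        self.net = mlp(obs_dim + z_dim, 2 * act_dim)

    def log_prob(self, s, z, a):
        mean, log_std = self.net(torch.cat([s, z], dim=-1)).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
        return dist.log_prob(a).sum(-1)

def self_consistency_loss(encoder, policy, segment, states, actions):
    # states/actions: one (s, a) pair sampled from each segment in the batch.
    # Negative log-likelihood of the sampled actions, conditioned on the
    # hindsight embedding of the very same segment.
    z = encoder(segment)
    return -policy.log_prob(states, z, actions).mean()
```

In LfD settings, the expert demonstrations would simply be added to the buffer from which these segments are sampled, as described above.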
To regularize the learning of the encoder function $f_\phi$, we additionally introduce a decoder network (a 4-layer MLP with ReLU activations) $\pi'_\theta(s'|s, f_\phi(\tau_{1:T}))$ to predict the next states:

$$\min_{\pi'_\theta, f_\phi}\ \mathbb{E}_{\tau_{1:T}\sim D(\tau_{1:T})}\,\mathbb{E}_{(s,a,s')\sim\tau_{1:T}}\big[-\log \pi'_\theta\big(s' \mid s, f_\phi(\tau_{1:T})\big)\big].$$

Further, to circumvent the issue of "posterior collapse" [67], we encourage learning quantized latent embeddings. In a similar spirit to VQ-VAE [67], we incorporate ideas from vector quantization (VQ) and introduce the following regularization:

$$\min_{f_\phi}\ \big\|\mathrm{sg}[z_e(\tau_{1:T})] - e\big\|_2^2 + \big\|z_e(\tau_{1:T}) - \mathrm{sg}[e]\big\|_2^2,$$

where $e$ is a dictionary of vector-quantization embeddings (we set the size of this embedding dictionary to 4096), $z_e(\tau_{1:T})$ is defined as the nearest dictionary embedding to $f_\phi(\tau_{1:T})$, and $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator.

Outer-level embedding inference. In Section 4.2 (main paper), we approximate $\mathcal{J}_{\mathrm{MI}}$ with

$$\mathcal{J}_{\mathrm{MI}}(f_\phi) \approx \mathbb{E}_{p(z^*)\pi_E(\tau_E)\pi_\theta(\tau_\theta|z^*)}\big[-\|z^* - f_\phi(\tau_E)\|^2 + \|z^* - f_\phi(\tau_\theta)\|^2\big],$$

where we replace the mutual information with $-\|z^* - f_\phi(\tau)\|^2$ by leveraging the learned embedding function $f_\phi$. Empirically, we find that we can ignore the second loss $\|z^* - f_\phi(\tau_\theta)\|^2$ and directly conduct the outer-level embedding inference with

$$\max_{z^*, f_\phi}\ \mathbb{E}_{p(z^*)\pi_E(\tau_E)}\big[-\|z^* - f_\phi(\tau_E)\|^2\big].$$

Meanwhile, this simplification makes the support constraint ($R(z^*)$ in Equation 7 in the main paper) for the offline OOD issue naturally satisfied, since $\max_{z^*} \mathbb{E}_{p(z^*)\pi_E(\tau_E)}\big[-\|z^* - f_\phi(\tau_E)\|^2\big]$ and $\min_{z^*} R(z^*)$ are equivalent.

Cross-domain IL regularization. To encourage $f_\phi$ to capture task-relevant embeddings and ignore domain-specific factors, we set the regularization $R(f_\phi)$ in Equation 5 to $R(f_\phi) = I(f_\phi(\tau); n)$, where we couple each trajectory $\tau$ in $\{\tau_E\} \cup \{\tau_{E'}\}$ with a label $n \in \{0, 1\}$ indicating whether it is noised. In our implementation, we apply MINE [6] to estimate the mutual information and conduct the encoder regularization. Specifically, we estimate $I(z; n)$ with

$$\hat{I}(z; n) := \sup_{\delta}\ \mathbb{E}_{p(z,n)}\big[f_\delta(z, n)\big] - \log \mathbb{E}_{p(z)p(n)}\big[\exp\big(f_\delta(z, n)\big)\big],$$

and regularize the encoder $f_\phi$ with $\min_{f_\phi} \hat{I}(f_\phi(\tau); n)$, where we model $f_\delta$ with a 4-layer MLP (using ReLU activations).
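To illustrate how this MINE-based regularizer can be estimated in practice, here is a rough PyTorch-style sketch of the Donsker-Varadhan bound; the statistics-network architecture, the label-shuffling trick for the product of marginals, the alternating update, and the coefficient `lambda_reg` are our own assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MINEStatistics(nn.Module):
    """f_delta(z, n): statistics network for the Donsker-Varadhan bound."""
    def __init__(self, z_dim=16, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, n):               # z: (B, z_dim), n: (B, 1) with entries in {0, 1}
        return self.net(torch.cat([z, n], dim=-1)).squeeze(-1)

def mi_lower_bound(stat_net, z, n):
    """I_hat(z; n) = E_p(z,n)[f_delta] - log E_p(z)p(n)[exp(f_delta)]."""
    joint = stat_net(z, n).mean()
    # Shuffling the labels gives approximate samples from the product of marginals.
    n_shuffled = n[torch.randperm(n.shape[0])]
    batch_size = torch.tensor(float(n.shape[0]))
    marginal = torch.logsumexp(stat_net(z, n_shuffled), dim=0) - torch.log(batch_size)
    return joint - marginal

# Alternating update sketch: f_delta ascends the bound to tighten the MI estimate,
# while the encoder f_phi descends it so that the embedding carries as little
# information as possible about the domain label n.
# mi = mi_lower_bound(stat_net, encoder(segment), domain_label)
# stat_loss = -mi                             # update f_delta
# enc_loss = base_loss + lambda_reg * mi      # update f_phi (lambda_reg is assumed)
```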
Hyper-parameters. In Table 9, we list the hyper-parameters used in our experiments. For the size of the embedding dictionary, we selected it from the range [512, 1024, 2048, 4096]; we found 4096 to attain good performance almost uniformly across IL tasks, and thus selected it as the default. For the size of the embedding dimension, we tried four values [4, 8, 16, 32] and selected 16 as the default. For the trajectory window size, we tried five values [2, 4, 8, 16, 32] but did not observe a significant difference in performance across these values, so we selected 2 as the default value. For the learning rate scheduler, we tried the default PyTorch scheduler and CosineAnnealingWarmRestarts, and found that CosineAnnealingWarmRestarts gave better results (thus we selected it). The other hyper-parameters are consistent with the default values of most RL implementations, e.g., a learning rate of 3e-4 and an MLP policy.

Table 9: CEIL hyper-parameters.

Parameter                                    Value
size of the embedding dictionary             4096
size of the embedding dimension              16
trajectory window size                       2
encoder: optimizer                           Adam
encoder: learning rate                       3e-4
encoder: learning rate scheduler             CosineAnnealingWarmRestarts(T_0=1000, T_mult=1, eta_min=1e-5)
encoder: number of hidden layers             4
encoder: number of hidden units per layer    512
encoder: nonlinearity                        ReLU
policy: optimizer                            Adam
policy: learning rate                        3e-4
policy: learning rate scheduler              CosineAnnealingWarmRestarts(T_0=1000, T_mult=1, eta_min=1e-5)
policy: number of hidden layers              4
policy: number of hidden units per layer     512
policy: nonlinearity                         ReLU
decoder: optimizer                           Adam
decoder: learning rate                       3e-4
decoder: learning rate scheduler             CosineAnnealingWarmRestarts(T_0=1000, T_mult=1, eta_min=1e-5)
decoder: number of hidden layers             4
decoder: number of hidden units per layer    512
decoder: nonlinearity                        ReLU

Table 10: Baseline methods and their code-bases.

Baselines               Code-bases
GAIL, GAIfO, AIRL       https://github.com/HumanCompatibleAI/imitation
SAIL                    https://github.com/FangchenLiu/SAIL
IQ-Learn, SQIL          https://github.com/Div99/IQ-Learn
ValueDICE               https://github.com/google-research/google-research/tree/master/value_dice
DemoDICE                https://github.com/KAIST-AILab/imitation-dice
SMODICE, ORIL           https://github.com/JasonMa2016/SMODICE

9.4 Baselines Implementation Details

We summarize the code-bases of our baseline implementations in Table 10 and describe each baseline as follows:

Generative Adversarial Imitation Learning (GAIL). GAIL [30] is a GAN-based online LfD method that trains a policy (generator) to confuse a discriminator trained to distinguish between generated transitions and expert transitions. While the goal of the discriminator is to maximize the objective below, the policy is optimized via an RL algorithm to match the expert occupancy measure (i.e., to minimize the objective below):

$$\mathcal{J}(\pi, D) = \mathbb{E}_{\pi}\big[\log D(s, a)\big] + \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s, a)\big)\big] - \lambda H(\pi).$$

We use the implementation by Gleave et al. [29] (see Table 10), which introduces two modifications with respect to the original paper: 1) a higher output of the discriminator represents better (more expert-like) behavior, and 2) PPO is used to optimize the policy instead of TRPO.

Generative Adversarial Imitation from Observations (GAIfO). GAIfO [66] is an online LfO method that applies the principle of GAIL and utilizes a state-only discriminator to judge whether the generated trajectory matches the expert trajectory in terms of states. The objective of GAIfO is as follows:

$$\mathcal{J}(\pi, D) = \mathbb{E}_{\pi}\big[\log D(s, s')\big] + \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s, s')\big)\big] - \lambda H(\pi).$$

Based on the implementation of GAIL, we implement GAIfO by changing the input of the discriminator to state transitions.
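As a concrete reference point for these adversarial objectives, a minimal PyTorch-style sketch of the discriminator update shared by GAIL and GAIfO might look as follows. It adopts the flipped convention noted above (a higher output means more expert-like), and the network sizes, function names, and reward shaping are our own assumptions; the PPO policy update is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """D(x): for GAIL x = (s, a); for GAIfO x = (s, s')."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)     # logits

def discriminator_loss(disc, policy_batch, expert_batch):
    # Binary cross-entropy form of the adversarial objective: push D towards 0
    # on policy samples and towards 1 on expert samples.
    policy_logits = disc(policy_batch)
    expert_logits = disc(expert_batch)
    return (F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits))
            + F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits)))

# The policy is then trained with PPO on a learned reward such as
#   r(x) = -F.logsigmoid(-disc(x)),
# which grows as the discriminator becomes more convinced that x is expert-like.
```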
Adversarial Inverse Reinforcement Learning (AIRL). AIRL [21] is an online LfD/LfO method using an adversarial learning framework similar to GAIL. It modifies the form of the discriminator to explicitly disentangle the task-relevant information from the transition dynamics. To make the policy more generalizable and less sensitive to dynamics, AIRL proposes to learn a parameterized reward function using the output of the discriminator:

$$f_{\theta,\phi}(s, a, s') = g_\theta(s, a) + \lambda h_\phi(s') - h_\phi(s), \qquad D_{\theta,\phi}(s, a, s') = \frac{\exp\big(f_{\theta,\phi}(s, a, s')\big)}{\exp\big(f_{\theta,\phi}(s, a, s')\big) + \pi(a|s)}.$$

Similarly to GAIL, we use the code provided by Gleave et al. [29], and the RL algorithm is also PPO.

State Alignment-based Imitation Learning (SAIL). SAIL [48] is an online LfO method capable of solving cross-domain tasks. SAIL aims to minimize the divergence between the policy rollout and the expert trajectory from both local and global perspectives: 1) locally, a KL divergence between the policy's action and the action predicted by a state planner together with an inverse dynamics model, and 2) globally, a Wasserstein divergence of state occupancy between the policy and the expert. The policy is optimized using:

$$\mathcal{J}(\pi) = D_W\big(\pi(s)\,\|\,\pi_E(s)\big) + \lambda\, D_{KL}\big(\pi(\cdot|s_t)\,\|\,\pi_E(\cdot|s_t)\big) = \mathbb{E}_{\pi(s_t,a_t,s_{t+1})}\big[D(s_{t+1})\big] - \mathbb{E}_{\pi_E(s)}\big[D(s)\big] + \lambda\, D_{KL}\big(\pi(\cdot|s_t)\,\|\,g_{\mathrm{inv}}(\cdot \mid s_t, f(s_t))\big),$$

where $D$ is a state-based discriminator trained via $\mathcal{J}(D) = \mathbb{E}_{\pi_E}[D(s)] - \mathbb{E}_{\pi}[D(s)]$, $f$ is the pre-trained VAE-based state planner, and $g_{\mathrm{inv}}$ is the inverse dynamics model trained by supervised regression. In the online setting, we use the official implementation published by the authors (see Table 10), where SAIL is optimized using PPO with the reward definition

$$r(s_t, s_{t+1}) = \frac{1}{T}\Big(D(s_{t+1}) - \mathbb{E}_{\pi_E(s)}\big[D(s)\big]\Big).$$

Besides, we further implement SAIL in the offline setting by using TD3+BC [23] to maximize the reward defined above. In our experiments, we empirically find that SAIL is computationally expensive. While SAIL is able to learn tasks in the typical IL setting (S-on-LfD), our early experiments found that SAIL (TD3+BC), even with heavy hyper-parameter tuning, failed in the offline setting. This indicates that SAIL is rather sensitive to the dataset composition, which coincides with the observations reported in Ma et al. [56]. Thus, we do not include SAIL in our comparison results.

Soft-Q Imitation Learning (SQIL). SQIL [61] is a simple but effective single-domain LfD algorithm that is easy to implement with both online and offline Q-learning algorithms. The main idea of SQIL is to give a sparse reward (+1) only to the expert transitions and a zero reward (0) to the experiences in the replay buffer. The Q-function of SQIL is updated using the squared soft Bellman error:

$$\delta^2(\mathcal{D}, r) := \frac{1}{|\mathcal{D}|}\sum_{(s,a,s')\in\mathcal{D}}\bigg(Q(s, a) - \Big(r + \gamma \log \sum_{a'\in A}\exp\big(Q(s', a')\big)\Big)\bigg)^2.$$

The overall objective of the Q-function is to maximize

$$\mathcal{J}(Q) = -\delta^2(\mathcal{D}_E, 1) - \delta^2(\mathcal{D}_\pi, 0).$$

In our experiments, the online imitation policy is optimized using SAC, which is also used in the original paper. To make a fair comparison among the offline IL baselines, the offline policy is optimized via TD3+BC.

Offline Reinforced Imitation Learning (ORIL). ORIL [78] is an offline single-domain IL method that solves both LfD and LfO tasks. To relax the hard-label assumption (like the sparse-reward labels used in SQIL), ORIL treats the experiences stored in the replay buffer as unlabeled data that could potentially include both successful and failed trajectories. More specifically, ORIL aims to train a reward function to distinguish between the expert and the suboptimal data without explicitly knowing the negative labels. By incorporating positive-unlabeled (PU) learning, the objective of the reward model can be written as follows (for the LfD setting):

$$\mathcal{J}(R) = \eta\, \mathbb{E}_{\pi_E(s,a)}\big[\log R(s, a)\big] + \mathbb{E}_{\pi(s,a)}\big[\log\big(1 - R(s, a)\big)\big] - \eta\, \mathbb{E}_{\pi_E(s,a)}\big[\log\big(1 - R(s, a)\big)\big],$$

where $\eta$ is the relative proportion of the expert data; we set it to 0.5 throughout our experiments. In the original paper, the policy learning algorithm of ORIL is Critic Regularized Regression (CRR), while in this paper we implement ORIL using TD3+BC for fair comparisons. Besides, we adapt ORIL to the LfO setting by learning a state-only reward function:

$$\mathcal{J}(R) = \eta\, \mathbb{E}_{\pi_E(s,s')}\big[\log R(s, s')\big] + \mathbb{E}_{\pi(s,s')}\big[\log\big(1 - R(s, s')\big)\big] - \eta\, \mathbb{E}_{\pi_E(s,s')}\big[\log\big(1 - R(s, s')\big)\big].$$
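For illustration, the PU-learning reward objective above (LfD variant) could be implemented roughly as follows in PyTorch; the sigmoid output head, the clamping constant, and the network width are our own assumptions.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """R(s, a) in (0, 1): probability-like score that a transition is expert-like."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def oril_pu_loss(reward, expert_s, expert_a, unlab_s, unlab_a, eta=0.5, eps=1e-6):
    # Negative of the PU-learning objective J(R) above: the unlabeled replay data
    # is treated as negative, and the eta-weighted expert-as-negative term
    # corrects the resulting bias.
    r_e = reward(expert_s, expert_a).clamp(eps, 1 - eps)
    r_u = reward(unlab_s, unlab_a).clamp(eps, 1 - eps)
    j = (eta * torch.log(r_e).mean()
         + torch.log(1 - r_u).mean()
         - eta * torch.log(1 - r_e).mean())
    return -j  # minimize the negative objective
```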
Inverse soft-Q learning (IQ-Learn). IQ-Learn [28] is an IRL-based method that can solve IL tasks in the online/offline and LfD/LfO settings. It proposes to directly learn a Q-function from demonstrations, avoiding the intermediate step of reward learning. Unlike GAIL, which optimizes a min-max objective defined in the reward-policy space, IQ-Learn solves the expert matching problem directly in the policy-Q space. The Q-function is trained to maximize the objective

$$\mathbb{E}_{\pi_E(s,a,s')}\big[Q(s, a) - \gamma V^\pi(s')\big] - \mathbb{E}_{\pi(s,a,s')}\big[Q(s, a) - \gamma V^\pi(s')\big] - \psi(r),$$

where $V^\pi(s) := \mathbb{E}_{a\sim\pi(\cdot|s)}\big[Q(s, a) - \log\pi(a|s)\big]$ and $\psi(r)$ is a regularization term calculated over the expert distribution. The policy is then learned by SAC. We use the code provided in the official IQ-Learn repository (see Table 10) and reproduce the online LfD results reported in the original paper. For online tasks, we empirically find that penalizing the Q-value on the initial states gives the best and most stable performance. The learning objective of the Q-function for the online tasks is:

$$\mathcal{J}(Q) = \mathbb{E}_{\pi_E(s,a,s')}\big[Q(s, a) - \gamma V^\pi(s')\big] - (1 - \gamma)\,\mathbb{E}_{\rho_0}\big[V^\pi(s_0)\big] - \psi(r).$$

In the offline setting, we find that using the above objective easily leads to an overfitting issue, causing collapsed performance. Thus, we follow the instructions provided in the paper and only penalize the expert samples:

$$\mathcal{J}(Q) = \mathbb{E}_{\pi_E(s,a,s')}\big[Q(s, a) - \gamma V^\pi(s')\big] - \mathbb{E}_{\pi_E(s,a,s')}\big[V^\pi(s) - \gamma V^\pi(s')\big] - \psi(r) = \mathbb{E}_{\pi_E(s,a,s')}\big[Q(s, a) - V^\pi(s)\big] - \psi(r).$$

Imitation Learning via Off-Policy Distribution Matching (ValueDICE). ValueDICE [41] is a DICE-based (DICE refers to stationary DIstribution Correction Estimation) LfD algorithm that minimizes the divergence of state-action distributions between the policy and the expert. In contrast to the state-conditional distribution of actions $\pi(\cdot|s)$ used in the above methods, the state-action distribution $d^\pi(s, a): S\times A \to [0, 1]$, which is in one-to-one correspondence with the policy, is defined as

$$d^\pi(s, a) := (1 - \gamma)\sum_{t=0}^{\infty}\gamma^t\,\Pr\big(s_t = s, a_t = a \mid s_0 \sim \rho_0,\ a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim P(\cdot \mid s_t, a_t)\big).$$

Thus, the plain expert matching objective can be reformulated and expressed in the Donsker-Varadhan representation:

$$\mathcal{J}(\pi) = -D_{KL}\big(d^\pi(s, a)\,\|\,d^{\pi_E}(s, a)\big) = \min_{x: S\times A \to \mathbb{R}}\ \log \mathbb{E}_{(s,a)\sim d^{\pi_E}}\big[\exp\big(x(s, a)\big)\big] - \mathbb{E}_{(s,a)\sim d^\pi}\big[x(s, a)\big].$$

The objective above can be expanded further by defining $x(s, a) = v(s, a) - \mathcal{B}^\pi v(s, a)$ and using the zero-reward Bellman operator $\mathcal{B}^\pi$ to derive the following (adversarial) objective:

$$\mathcal{J}_{\mathrm{DICE}}(\pi, v) = \log \mathbb{E}_{(s,a)\sim d^{\pi_E}}\Big[\exp\big(v(s, a) - \mathcal{B}^\pi v(s, a)\big)\Big] - (1 - \gamma)\,\mathbb{E}_{s_0\sim\rho_0,\, a_0\sim\pi(\cdot|s_0)}\big[v(s_0, a_0)\big].$$

We use the official TensorFlow implementation (see Table 10) in our experiments. In the online setting, the collected rollouts are used as an additional replay regularization. The overall objective in the online setting is:

$$\mathcal{J}^{\mathrm{mix}}_{\mathrm{DICE}}(\pi, v) = -D_{KL}\big((1-\alpha)d^\pi(s, a) + \alpha d^{RB}(s, a)\,\big\|\,(1-\alpha)d^{\pi_E}(s, a) + \alpha d^{RB}(s, a)\big) = \log \mathbb{E}_{(s,a)\sim d^{\mathrm{mix}}}\Big[\exp\big(v(s, a) - \mathcal{B}^\pi v(s, a)\big)\Big] - (1-\alpha)(1-\gamma)\,\mathbb{E}_{s_0\sim\rho_0,\, a_0\sim\pi(\cdot|s_0)}\big[v(s_0, a_0)\big] - \alpha\,\mathbb{E}_{(s,a)\sim d^{RB}}\big[v(s, a) - \mathcal{B}^\pi v(s, a)\big],$$

where $d^{\mathrm{mix}} := (1-\alpha)d^{\pi_E} + \alpha d^{RB}$ and $\alpha$ is a non-negative regularization coefficient (we set $\alpha$ to 0.1 following the specification in the paper). In the offline setting, ValueDICE only differs in the source of the sampled data: we change the online replay buffer to the offline pre-collected dataset.
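To make the basic (non-mixture) DICE objective more tangible, here is a rough PyTorch-style sketch of a J_DICE estimator; treating the log-expectation as a batch log-mean-exp, assuming the policy returns reparameterized action samples, and the network shape are all our own simplifications rather than the official implementation.

```python
import torch
import torch.nn as nn

class Nu(nn.Module):
    """v(s, a): the value-like function optimized in the DICE objective."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def value_dice_loss(nu, policy, expert_s, expert_a, expert_next_s, init_s, gamma=0.99):
    # J_DICE(pi, v) = log E_expert[exp(v(s, a) - gamma * v(s', a' ~ pi))]
    #                 - (1 - gamma) * E_s0[v(s0, a0 ~ pi)]
    a_next = policy(expert_next_s)   # assumed differentiable (reparameterized) sample
    a0 = policy(init_s)
    residual = nu(expert_s, expert_a) - gamma * nu(expert_next_s, a_next)
    # Batch log-mean-exp approximates the log-expectation term.
    log_exp_term = torch.logsumexp(residual, dim=0) - torch.log(
        torch.tensor(float(residual.shape[0])))
    linear_term = (1.0 - gamma) * nu(init_s, a0).mean()
    j_dice = log_exp_term - linear_term
    # v is trained to minimize J_DICE, while the policy maximizes it (adversarial).
    return j_dice
```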
Offline Imitation Learning with Supplementary Imperfect Demonstrations (DemoDICE). DemoDICE [39] is a DICE-based offline LfD method that assumes access to an offline dataset collected by a behavior policy $\pi_\beta$. Using this supplementary dataset, the expert matching objective of DemoDICE is instantiated over ValueDICE as

$$D_{KL}\big(d^\pi(s, a)\,\|\,d^{\pi_E}(s, a)\big) + \alpha\, D_{KL}\big(d^\pi(s, a)\,\|\,d^{\pi_\beta}(s, a)\big),$$

where $\alpha$ is a positive weight for the constraint. The above optimization objective can be transformed into three tractable components: 1) a reward function $r(s, a)$ derived by pre-training a binary discriminator $D: S\times A \to [0, 1]$:

$$r(s, a) = -\log\Big(\frac{1}{D^*(s, a)} - 1\Big), \qquad D^*(s, a) = \arg\max_{D}\ \mathbb{E}_{d^{\pi_E}}\big[\log D(s, a)\big] + \mathbb{E}_{d^{\pi_\beta}}\big[\log\big(1 - D(s, a)\big)\big];$$

2) a value function optimization objective:

$$\mathcal{J}(v) = (1 - \gamma)\,\mathbb{E}_{s\sim\rho_0}\big[v(s)\big] + (1 + \alpha)\,\log \mathbb{E}_{(s,a)\sim d^{\pi_\beta}}\Big[\exp\big(r(s, a) + \mathbb{E}_{s'\sim P(s,a)}[v(s')] - v(s)\big)\Big];$$

and 3) a policy optimization step:

$$\mathcal{J}(\pi) = \mathbb{E}_{(s,a)\sim d^{\pi_\beta}}\big[v^*(s, a)\,\log\pi(a|s)\big], \qquad v^* = \arg\min_{v}\ \mathcal{J}(v).$$

We report the offline results using the official TensorFlow implementation (see Table 10).

State Matching Offline DIstribution Correction Estimation (SMODICE). SMODICE [56] proposes to solve offline IL tasks in the LfO and cross-domain settings and optimizes the following state-occupancy objective: $D_{KL}(d^\pi(s)\,\|\,d^{\pi_E}(s))$. To incorporate the offline dataset, SMODICE derives an f-divergence regularized state-occupancy objective:

$$\mathbb{E}_{d^\pi(s)}\Big[\log\frac{d^{\pi_\beta}(s)}{d^{\pi_E}(s)}\Big] + D_f\big(d^\pi(s, a)\,\|\,d^{\pi_\beta}(s, a)\big).$$

Intuitively, the first term can be interpreted as matching the offline states towards the expert states, while the second regularization term constrains the policy to stay close to the offline state-action occupancy distribution. Similarly, we can divide the objective into three steps: 1) deriving a state-based reward by learning a state-based discriminator:

$$r(s) = -\log\Big(\frac{1}{D^*(s)} - 1\Big), \qquad D^*(s) = \arg\max_{D}\ \mathbb{E}_{d^{\pi_E}}\big[\log D(s)\big] + \mathbb{E}_{d^{\pi_\beta}}\big[\log\big(1 - D(s)\big)\big];$$

2) learning a value function using the learned reward:

$$\mathcal{J}(v) = (1 - \gamma)\,\mathbb{E}_{s\sim\rho_0}\big[v(s)\big] + \log \mathbb{E}_{(s,a)\sim d^{\pi_\beta}}\Big[f_\star\big(r(s) + \mathbb{E}_{s'\sim P(s,a)}[v(s')] - v(s)\big)\Big];$$

and 3) training the policy via weighted regression:

$$\mathcal{J}(\pi) = \mathbb{E}_{(s,a)\sim d^{\pi_\beta}}\Big[f_\star\big(r(s) + \mathbb{E}_{s'\sim P(s,a)}[v^*(s')] - v^*(s)\big)\,\log\pi(a|s)\Big], \qquad v^* = \arg\min_{v}\ \mathcal{J}(v),$$

where $f_\star$ is the Fenchel conjugate of the f-divergence (please refer to Ma et al. [56] for more details). We conduct experiments using the official PyTorch implementation (see Table 10), where the f-divergence used is the $\chi^2$-divergence. On the LfD tasks, we change the input of the discriminator to state-action pairs.
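As a final illustration, the state-based discriminator and the induced reward used by SMODICE (and, with state-action inputs, the analogous DemoDICE reward) can be sketched as follows in PyTorch; the network sizes and the logit shortcut are our own choices, and the downstream value/policy updates are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateDiscriminator(nn.Module):
    """D(s): trained to output ~1 on expert states and ~0 on offline states."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)  # logits

def discriminator_loss(disc, expert_s, offline_s):
    # E_expert[log D(s)] + E_offline[log(1 - D(s))], written as a BCE to be minimized.
    e_logits, o_logits = disc(expert_s), disc(offline_s)
    return (F.binary_cross_entropy_with_logits(e_logits, torch.ones_like(e_logits))
            + F.binary_cross_entropy_with_logits(o_logits, torch.zeros_like(o_logits)))

def smodice_reward(disc, s):
    # r(s) = -log(1/D*(s) - 1) = log(D*/(1 - D*)), which is exactly the
    # discriminator logit when D = sigmoid(logit).
    return disc(s)
```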