Published as a conference paper at ICLR 2022

TRAIL: NEAR-OPTIMAL IMITATION LEARNING WITH SUBOPTIMAL DATA

Mengjiao Yang (UC Berkeley, Google Brain; sherryy@google.com), Sergey Levine (UC Berkeley, Google Brain), Ofir Nachum (Google Brain)

ABSTRACT

The aim in imitation learning is to learn effective policies by utilizing near-optimal expert demonstrations. However, high-quality demonstrations from human experts can be expensive to obtain in large numbers. On the other hand, it is often much easier to obtain large quantities of suboptimal or task-agnostic trajectories, which are not useful for direct imitation but can nevertheless provide insight into the dynamical structure of the environment, showing what could be done in the environment even if not what should be done. We ask: is it possible to utilize such suboptimal offline datasets to facilitate provably improved downstream imitation learning? In this work, we answer this question affirmatively and present training objectives that use offline datasets to learn a factored transition model whose structure enables the extraction of a latent action space. Our theoretical analysis shows that the learned latent action space can boost the sample efficiency of downstream imitation learning, effectively reducing the need for large near-optimal expert datasets through the use of auxiliary non-expert data. To learn the latent action space in practice, we propose TRAIL (Transition-Reparametrized Actions for Imitation Learning), an algorithm that learns an energy-based transition model contrastively and uses the transition model to reparametrize the action space for sample-efficient imitation learning. We evaluate the practicality of our objective through experiments on a set of navigation and locomotion tasks. Our results verify the benefits suggested by our theory and show that TRAIL is able to improve baseline imitation learning by up to 4x in performance.

1 INTRODUCTION

Imitation learning uses expert demonstration data to learn sequential decision-making policies (Schaal, 1999). Such demonstrations, often produced by human experts, can be costly to obtain in large numbers. On the other hand, practical application domains, such as recommendation (Afsar et al., 2021) and dialogue (Jiang et al., 2021) systems, provide large quantities of offline data generated by suboptimal agents. Since the offline data is suboptimal in performance, using it directly for imitation learning is infeasible. While some prior works have proposed using suboptimal offline data for offline reinforcement learning (RL) (Kumar et al., 2019; Wu et al., 2019; Levine et al., 2020), this requires reward information, which may be unavailable or infeasible to compute from suboptimal data (Abbeel & Ng, 2004). Nevertheless, suboptimal offline datasets should, conceptually, contain useful information about the environment, if only we could distill that information into a form that aids downstream imitation learning.

One approach to leveraging suboptimal offline datasets is to use the offline data to extract a lower-dimensional latent action space, and then perform imitation learning on an expert dataset using this latent action space. If the latent action space is learned properly, one may hope that performing imitation learning in the latent space reduces the need for large quantities of expert data.
While a number of prior works have studied similar approaches in the context of hierarchical imitation and RL (Parr & Russell, 1998; Dietterich et al., 1998; Sutton et al., 1999; Kulkarni et al., 2016; Vezhnevets et al., 2017; Nachum et al., 2018a; Ajay et al., 2020; Pertsch et al., 2020; Hakhamaneshi et al., 2021), such methods typically focus on the theoretical and practical benefits of temporal abstraction by extracting temporally extended skills from data or experience. That is, the main benefit of these approaches is that the latent action space operates at a lower temporal frequency than the original environment action space. We instead focus directly on the question of action representation: instead of learning skills that provide temporal abstraction, we aim to directly reparameterize the action space in a way that provides for more sample-efficient downstream imitation without the need to reduce control frequency. Unlike learning temporal abstractions, action reparametrization does not have to rely on any hierarchical structure in the offline data, and can therefore utilize highly suboptimal datasets (e.g., with random actions).

[Figure 1: The TRAIL framework. Pretraining learns a factored transition model TZ ∘ φ and an action decoder πα on Doff. Downstream imitation learns a latent policy πZ on Dπ* with expert actions reparametrized by φ. During inference, πZ and πα are combined to sample an action.]

Aiming for a provably-efficient approach to utilizing highly suboptimal offline datasets, we use first principles to derive an upper bound on the quality of an imitation-learned policy involving three terms corresponding to (1) action representation learning and (2) action decoder learning on a suboptimal offline dataset, and finally (3) behavioral cloning (i.e., max-likelihood learning of latent actions) on an expert demonstration dataset. The first term in our bound immediately suggests a practical offline training objective based on a transition dynamics loss using a factored transition model. We show that under specific factorizations (e.g., low-dimensional or linear), one can guarantee improved sample efficiency on the expert dataset. Crucially, our mathematical results avoid the potential shortcomings of temporal skill extraction, as our bound is guaranteed to hold even when there is no temporal abstraction in the latent action space.

We translate these mathematical results into an algorithm that we call Transition-Reparametrized Actions for Imitation Learning (TRAIL). As shown in Figure 1, TRAIL consists of a pretraining stage (corresponding to the first two terms in our bound) and a downstream imitation learning stage (corresponding to the last term in our bound). During the pretraining stage, TRAIL uses an offline dataset to learn a factored transition model and a paired action decoder. During the downstream imitation learning stage, TRAIL first reparametrizes expert actions into the latent action space according to the learned transition model, and then learns a latent policy via behavioral cloning in the latent action space. During inference, TRAIL uses the imitation-learned latent policy and action decoder in conjunction to act in the environment.
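To make the two-stage structure in Figure 1 concrete, the following is a minimal sketch of the TRAIL pipeline in PyTorch. The network sizes, the use of in-batch negatives for the contrastive transition loss, and the fixed-variance (mean-squared-error) simplifications of the action decoder and latent policy are illustrative assumptions, not the paper's reference implementation.

```python
"""Minimal sketch of the two-stage TRAIL pipeline: pretraining on D_off,
latent behavioral cloning on D_pi*, and decoding at inference. All shapes
and function names are illustrative assumptions."""
import math
import torch
import torch.nn as nn

S_DIM, A_DIM, Z_DIM = 17, 6, 8  # assumed dimensions, for illustration only

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.SiLU(), nn.Linear(256, out))

phi = mlp(S_DIM + A_DIM, Z_DIM)      # action representation phi(s, a)
psi = mlp(S_DIM, Z_DIM)              # next-state embedding psi(s') for the EBM T_Z
decoder = mlp(S_DIM + Z_DIM, A_DIM)  # action decoder pi_alpha(a | s, z), mean only
latent_policy = mlp(S_DIM, Z_DIM)    # latent policy pi_Z(z | s), mean only

# Stage 1: pretraining on suboptimal D_off = {(s, a, s')}.
def pretrain_step(s, a, s_next, opt):
    z = phi(torch.cat([s, a], -1))
    z_next = psi(s_next)
    # Contrastive transition loss: pull phi(s, a) toward psi(s'), and treat the
    # other next states in the batch as samples from rho (in-batch negatives).
    pos = 0.5 * ((z - z_next) ** 2).sum(-1)
    neg = torch.logsumexp(-0.5 * torch.cdist(z, z_next) ** 2, dim=-1) - math.log(len(s))
    # Decoding loss for pi_alpha; MSE plays the role of a fixed-variance Gaussian
    # log-likelihood, and detach() keeps it from shaping phi.
    decode = ((decoder(torch.cat([s, z.detach()], -1)) - a) ** 2).mean()
    loss = (pos + neg).mean() + decode
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 2: latent behavioral cloning on expert D_pi* = {(s, a)} with phi frozen.
def latent_bc_step(s, a, opt):
    with torch.no_grad():
        z_target = phi(torch.cat([s, a], -1))   # reparametrized expert action
    loss = ((latent_policy(s) - z_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Inference: choose a latent action, then decode it into an environment action.
def act(s):
    with torch.no_grad():
        z = latent_policy(s)
        return decoder(torch.cat([s, z], -1))
```

In this sketch, one optimizer over the φ, ψ, and decoder parameters drives the pretraining step, and a separate optimizer over the latent policy drives downstream cloning.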
In practice, TRAIL parametrizes the transition model as an energy-based model (EBM) for flexibility and trains the EBM with a contrastive loss. The EBM enables the low-dimensional factored transition model referenced by our theory, and we also show that one can recover the linear transition model in our theory by approximating the EBM with random Fourier features (Rahimi et al., 2007). To summarize, our contributions include (i) a provably beneficial objective for learning action representations without temporal abstraction and (ii) a practical algorithm for optimizing the proposed objective by learning an EBM or linear transition model. An extensive evaluation on a set of navigation and locomotion tasks demonstrates the effectiveness of the proposed objective. TRAIL's empirical success compared to a variety of existing methods suggests that the benefit of learning single-step action representations has been overlooked by previous temporal skill extraction methods. Additionally, TRAIL significantly improves behavioral cloning even when the offline dataset is unimodal or highly suboptimal (e.g., obtained from a random policy), whereas temporal skill extraction methods lead to degraded performance in these scenarios. Lastly, we show that TRAIL, without using reward labels, can perform similarly to or better than offline reinforcement learning (RL) with orders of magnitude less expert data, suggesting new ways for offline learning of sequential decision-making policies.

2 RELATED WORK

Learning action abstractions is a long-standing topic in the hierarchical RL literature (Parr & Russell, 1998; Dietterich et al., 1998; Sutton et al., 1999; Kulkarni et al., 2016; Nachum et al., 2018a). A large body of work has proposed online skill discovery as a means to improve exploration and sample complexity in online RL. For instance, Eysenbach et al. (2018); Sharma et al. (2019); Gregor et al. (2016); Warde-Farley et al. (2018); Liu et al. (2021) propose to learn a diverse set of skills by maximizing an information-theoretic objective. Online skill discovery is also commonly seen in hierarchical frameworks that learn a continuous space (Vezhnevets et al., 2017; Hausman et al., 2018; Nachum et al., 2018a; 2019) or a discrete set of lower-level policies (Bacon et al., 2017; Stolle & Precup, 2002; Peng et al., 2019), upon which higher-level policies are trained to solve specific tasks. Different from these works, we focus on learning action representations offline from a fixed suboptimal dataset to accelerate imitation learning.

Aside from online skill discovery, offline skill extraction focuses on learning temporally extended action abstractions from a fixed offline dataset. Methods for offline skill extraction generally involve maximum likelihood training of some latent variable model on the offline data, followed by downstream planning (Lynch et al., 2020), imitation learning (Kipf et al., 2019; Ajay et al., 2020; Hakhamaneshi et al., 2021), offline RL (Ajay et al., 2020; Zhou et al., 2020), or online RL (Fox et al., 2017; Krishnan et al., 2017; Shankar & Gupta, 2020; Shankar et al., 2019; Singh et al., 2020; Pertsch et al., 2020; 2021; Wang et al., 2021) in the induced latent action space.
Among these works, those that provide a theoretical analysis attribute the benefit of skill extraction predominantly to increased temporal abstraction, as opposed to the learned action space being any easier to learn from than the raw action space (Ajay et al., 2020; Nachum et al., 2018b). Unlike these methods, our analysis focuses on the advantage of a lower-dimensional reparametrized action space, agnostic to temporal abstraction. Our method also applies to offline data that is highly suboptimal (e.g., contains random actions) and potentially unimodal (e.g., without diverse skills to be extracted), settings that have been considered challenging by previous work (Ajay et al., 2020).

While we focus on reducing the complexity of the action space through the lens of action representation learning, there exists a disjoint body of work that focuses on accelerating RL with state representation learning (Singh et al., 1995; Ren & Krogh, 2002; Castro & Precup, 2010; Gelada et al., 2019; Zhang et al., 2020; Arora et al., 2020; Nachum & Yang, 2021), some of which propose to extract a latent state space from a learned dynamics model. Analogous to our own derivations, these works attribute the benefit of representation learning to a smaller latent state space reduced from a high-dimensional input state space (e.g., images). Lastly, there exist model-based approaches that utilize offline data to learn the transition dynamics, which in turn accelerates imitation (Chang et al., 2021; Rafailov et al., 2021). These works differ from our focus on using the offline data to learn a latent action space.

3 PRELIMINARIES

In this section, we introduce the problem statements for imitation learning and learning-based control, and define relevant notation.

Markov decision process. Consider an MDP (Puterman, 1994) M := ⟨S, A, R, T, µ, γ⟩, consisting of a state space S, an action space A, a reward function R : S × A → ℝ, a transition function T : S × A → Δ(S)¹, an initial state distribution µ ∈ Δ(S), and a discount factor γ ∈ [0, 1). A policy π : S → Δ(A) interacts with the environment starting at an initial state s0 ∼ µ. An action at ∼ π(st) is sampled and applied to the environment at each step t ≥ 0. The environment produces a scalar reward R(st, at) and transitions into the next state st+1 ∼ T(st, at). Note that we are specifically interested in the imitation learning setting, where the rewards produced by R are unobserved by the learner. The state visitation distribution dπ(s) induced by a policy π is defined as dπ(s) := (1 − γ) Σ_{t=0}^∞ γ^t Pr[st = s | π, M]. We relax the notation and use (s, a) ∼ dπ to denote s ∼ dπ, a ∼ π(s).

¹Δ(X) denotes the simplex over a set X.

Learning goal. Imitation learning aims to recover an expert policy π* with access to only a fixed set of samples from the expert: Dπ* = {(si, ai)}_{i=1}^n with si ∼ dπ* and ai ∼ π*(si). One approach to imitation learning is to learn a policy π that minimizes some discrepancy between π and π*. In our analysis, we use the total variation (TV) divergence between state visitation distributions, Diff(π, π*) = DTV(dπ ∥ dπ*), to measure the discrepancy between π and π*. Our bounds can be easily modified to apply to other divergence measures, such as the Kullback-Leibler (KL) divergence or the difference in expected future returns.
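As a small worked example of the definitions above, the snippet below computes dπ for an assumed random tabular MDP, both via the closed form (1 − γ) µᵀ(I − γPπ)⁻¹ and via the truncated series; the MDP itself is arbitrary and purely illustrative.

```python
"""Worked example: the discounted state visitation distribution
d_pi(s) = (1 - gamma) * sum_t gamma^t Pr[s_t = s] on a toy tabular MDP."""
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

T = rng.dirichlet(np.ones(S), size=(S, A))   # T[s, a] is a distribution over s'
mu = rng.dirichlet(np.ones(S))               # initial state distribution
pi = rng.dirichlet(np.ones(A), size=S)       # pi[s] is a distribution over actions

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) T(s'|s, a).
P_pi = np.einsum("sa,sat->st", pi, T)

# Closed form: d_pi = (1 - gamma) * mu^T (I - gamma * P_pi)^{-1}.
d_pi = (1 - gamma) * mu @ np.linalg.inv(np.eye(S) - gamma * P_pi)

# Cross-check against the truncated series definition.
d_series, state_dist = np.zeros(S), mu.copy()
for t in range(2000):
    d_series += (1 - gamma) * gamma**t * state_dist
    state_dist = state_dist @ P_pi

assert np.allclose(d_pi, d_series, atol=1e-6)
print("d_pi:", np.round(d_pi, 4), "sums to", round(d_pi.sum(), 6))
```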
Behavioral cloning (BC) (Pomerleau, 1989) solves the imitation learning problem by learning π from Dπ* via a maximum likelihood objective JBC(π) := E_{(s,a)∼(dπ*,π*)}[−log π(a|s)], which optimizes an upper bound of Diff(π, π*) defined above (Ross & Bagnell, 2010; Nachum & Yang, 2021):

Diff(π, π*) ≤ γ/(1 − γ) · √( ½ E_{dπ*}[DKL(π*(s) ∥ π(s))] ) = γ/(1 − γ) · √( const(π*) + ½ JBC(π) ).

BC with suboptimal offline data. The standard BC objective (i.e., direct max-likelihood on Dπ*) can struggle to attain good performance when the amount of expert demonstrations is limited (Ross et al., 2011; Tu et al., 2021). We assume access to an additional suboptimal offline dataset Doff = {(si, ai, s'i)}_{i=1}^m, where the suboptimality is a result of (i) suboptimal action samples ai ∼ UnifA and (ii) lack of reward labels. We use (s, a, s') ∼ doff as a shorthand for simulating finite sampling from Doff via si ∼ doff, ai ∼ UnifA, s'i ∼ T(si, ai), where doff is an unknown offline state distribution. We assume doff sufficiently covers the expert distribution; i.e., dπ*(s) > 0 implies doff(s) > 0 for all s ∈ S. The uniform sampling of actions in Doff is largely for mathematical convenience; in theory it can be replaced with any distribution uniformly bounded from below by η > 0, in which case our derived bounds are scaled by 1/(|A|η). This work focuses on how to utilize such a suboptimal Doff to provably accelerate BC.

4 NEAR-OPTIMAL IMITATION LEARNING WITH REPARAMETRIZED ACTIONS

In this section, we provide a provably-efficient objective for learning action representations from suboptimal data. Our initial derivations (Theorem 1) apply to general policies and latent action spaces, while our subsequent result (Theorem 3) provides improved bounds for specialized settings with continuous latent action spaces. Finally, we present our practical method TRAIL for action representation learning and downstream imitation learning.

4.1 PERFORMANCE BOUND WITH REPARAMETRIZED ACTIONS

Despite Doff being highly suboptimal (e.g., with random actions), the large set of (s, a, s') tuples from Doff reveals the transition dynamics of the environment, which a latent action space should support. Under this motivation, we propose to learn a factored transition model T := TZ ∘ φ from the offline dataset Doff, where φ : S × A → Z is an action representation function and TZ : S × Z → Δ(S) is a latent transition model. Intuitively, good action representations should enable good imitation learning. We formalize this intuition in the theorem below by establishing a bound on the quality of a learned policy based on (1) an offline pretraining objective for learning φ and TZ, (2) an offline decoding objective for learning an action decoder πα, and (3) a downstream imitation learning objective for learning a latent policy πZ with respect to latent actions determined by φ.

Theorem 1. Consider an action representation function φ : S × A → Z, a factored transition model TZ : S × Z → Δ(S), an action decoder πα : S × Z → Δ(A), and a tabular latent policy πZ : S → Δ(Z). Define the transition representation error as

JT(TZ, φ) := E_{(s,a)∼doff}[DKL(T(s, a) ∥ TZ(s, φ(s, a)))],

the action decoding error as

JDE(πα, φ) := E_{(s,a)∼doff}[−log πα(a|s, φ(s, a))],

and the latent behavioral cloning error as

JBC,φ(πZ) := E_{(s,a)∼(dπ*,π*)}[−log πZ(φ(s, a)|s)].
Then the TV divergence between the state visitation distributions of πα ∘ πZ : S → Δ(A) and π* can be bounded as

Diff(πα ∘ πZ, π*) ≤ C1 √( ½ E_{(s,a)∼doff}[DKL(T(s, a) ∥ TZ(s, φ(s, a)))] )   [Pretraining, term (1); the inner expectation is JT(TZ, φ)]
  + C2 √( ½ E_{s∼doff}[max_{z∈Z} DKL(πα*(s, z) ∥ πα(s, z))] )   [Pretraining, term (2); ≈ const(doff, φ) + JDE(πα, φ)]
  + C3 √( ½ E_{s∼dπ*}[DKL(π*,Z(s) ∥ πZ(s))] )   [Downstream imitation, term (3); = const(π*, φ) + JBC,φ(πZ)]

where C1 = γ|A|(1 − γ)⁻¹(1 + Dχ²(dπ* ∥ doff)^{1/2}), C2 = γ(1 − γ)⁻¹(1 + Dχ²(dπ* ∥ doff)^{1/2}), C3 = γ(1 − γ)⁻¹, πα* is the optimal action decoder for the data distribution doff and φ:

πα*(a|s, z) = doff(s, a) · 1[z = φ(s, a)] / Σ_{a'∈A} doff(s, a') · 1[z = φ(s, a')],

and π*,Z is the marginalization of π* onto Z according to φ:

π*,Z(z|s) := Σ_{a∈A, z=φ(s,a)} π*(a|s).

Theorem 1 essentially decomposes the imitation learning error into (1) a transition-based representation error JT, (2) an action decoding error JDE, and (3) a latent behavioral cloning error JBC,φ. Notice that only (3) requires expert data Dπ*; (1) and (2) are trained on the large offline dataset Doff. By choosing |Z| smaller than |A|, fewer demonstrations are needed to achieve small error in JBC,φ compared to vanilla BC with JBC. The Pearson χ² divergence term Dχ²(dπ* ∥ doff) in C1 and C2 accounts for the difference in state visitation between the expert and offline data. In the case where dπ* differs too much from doff, known as the distribution shift problem in offline RL (Levine et al., 2020), the errors from JT and JDE are amplified and terms (1) and (2) in Theorem 1 dominate. Otherwise, as JT → 0 and πα, φ → arg min JDE, optimizing πZ in the latent action space is guaranteed to optimize π in the original action space.

Sample complexity. To formalize the intuition that a smaller latent action space |Z| < |A| leads to more sample-efficient downstream behavioral cloning, we provide the following theorem in the tabular action setting. First, assume access to an oracle latent action representation function φorcl := OPTφ(Doff), which yields pretraining errors ϵ(1)(φorcl) and ϵ(2)(φorcl) in Theorem 1. For downstream behavioral cloning, we consider learning a tabular πZ on Dπ* with n expert samples. We can bound the expected difference between a latent policy πφorcl,Z with respect to φorcl and π* as follows.

Theorem 2. Let φorcl := OPTφ(Doff) and πφorcl,Z be the latent BC policy with respect to φorcl. We have

E_{Dπ*}[Diff(πφorcl,Z, π*)] ≤ ϵ(1)(φorcl) + ϵ(2)(φorcl) + C3 · O(√(|S||Z|/n)),

where C3 is the same as in Theorem 1.

We can contrast this bound with its form in the vanilla BC setting, for which |Z| = |A| and both ϵ(1)(φorcl) and ϵ(2)(φorcl) are zero. We can expect an improvement in sample complexity from reparametrized actions when the errors ϵ(1) and ϵ(2) are small and |Z| < |A|.
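The following toy tabular example makes the quantities in Theorems 1 and 2 concrete. It builds an assumed random MDP, picks a hand-made φ that merges actions into |Z| < |A| latents, evaluates JT, JDE, and JBC,φ under the definitions above, and then checks that estimating the marginalized expert policy in the smaller latent space converges faster in TV error than estimating π* itself. Everything here is an assumed toy setup, not the paper's experiments.

```python
"""Toy tabular illustration of Theorem 1's error terms and Theorem 2's
sample-efficiency effect."""
import numpy as np

rng = np.random.default_rng(0)
S, A, Z, EPS = 6, 8, 2, 1e-12
T = rng.dirichlet(np.ones(S), size=(S, A))      # true dynamics T(s'|s, a)
d_off = rng.dirichlet(np.ones(S))               # offline state distribution d_off
d_exp = rng.dirichlet(np.ones(S))               # expert visitation d_pi*
pi_exp = rng.dirichlet(np.ones(A), size=S)      # expert policy pi*(a|s)
phi = (np.arange(A) % Z)[None, :].repeat(S, 0)  # phi(s, a): merge actions into Z latents

# Latent model T_Z(s'|s, z): averaging T over actions mapped to z minimizes J_T under Unif_A.
T_Z = np.array([[T[s][phi[s] == z].mean(0) for z in range(Z)] for s in range(S)])
kl = lambda p, q: np.sum(p * (np.log(p + EPS) - np.log(q + EPS)))

J_T = sum(d_off[s] / A * kl(T[s, a], T_Z[s, phi[s, a]]) for s in range(S) for a in range(A))

# Optimal decoder pi_alpha*(a|s, z) under (d_off, Unif_A) is uniform over {a : phi(s, a) = z}.
J_DE = sum(d_off[s] / A * -np.log(1.0 / np.sum(phi[s] == phi[s, a]))
           for s in range(S) for a in range(A))

pi_exp_Z = np.array([np.bincount(phi[s], weights=pi_exp[s], minlength=Z) for s in range(S)])
J_BC = sum(d_exp[s] * pi_exp[s, a] * -np.log(pi_exp_Z[s, phi[s, a]] + EPS)
           for s in range(S) for a in range(A))
print(f"J_T={J_T:.3f}  J_DE={J_DE:.3f}  J_BC,phi at pi*_Z: {J_BC:.3f}")

# Theorem 2 intuition: estimating pi*_Z (support Z) needs fewer samples than pi* (support A).
def tv_errors(n, trials=300):
    err_a = err_z = 0.0
    for _ in range(trials):
        for s in range(S):
            acts = rng.choice(A, size=n, p=pi_exp[s])
            emp_a = np.bincount(acts, minlength=A) / n
            emp_z = np.bincount(phi[s, acts], minlength=Z) / n
            err_a += d_exp[s] * 0.5 * np.abs(emp_a - pi_exp[s]).sum()
            err_z += d_exp[s] * 0.5 * np.abs(emp_z - pi_exp_Z[s]).sum()
    return err_a / trials, err_z / trials

for n in (5, 20, 80):
    ea, ez = tv_errors(n)
    print(f"n={n:3d}  E[TV] for BC in A: {ea:.3f}   for latent BC in Z: {ez:.3f}")
```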
4.2 LINEAR TRANSITION MODELS WITH DETERMINISTIC LATENT POLICY

Theorem 1 introduced the notion of a latent expert policy π*,Z, and minimizes the KL divergence between π*,Z and a tabular latent policy πZ. However, it is not immediately clear, in the case of continuous latent actions, how to ensure that the latent policy πZ is expressive enough to capture any π*,Z. In this section, we provide guarantees for recovering stochastic expert policies with a continuous latent action space under a linear transition model. Consider a continuous latent space Z ⊆ ℝ^d and a deterministic latent policy πθ(s) = θ_s for some θ ∈ ℝ^{d×|S|}, where θ_s denotes the column of θ corresponding to state s. While a deterministic θ in general cannot capture a stochastic π*, we show that under a linear transition model TZ(s'|s, φ(s, a)) = w(s')ᵀφ(s, a), there always exists a deterministic policy πθ : S → ℝ^d such that θ_s matches π*,Z(s) for all s ∈ S. This means that our scheme for offline pretraining paired with downstream imitation learning can provably recover any expert policy π* with a deterministic πθ, regardless of whether π* is stochastic.

Theorem 3. Let φ : S × A → Z for some Z ⊆ ℝ^d and suppose there exists w : S → ℝ^d such that TZ(s'|s, φ(s, a)) = w(s')ᵀφ(s, a) for all s, s' ∈ S, a ∈ A. Let πα : S × Z → Δ(A) be an action decoder, π* : S → Δ(A) be any policy in M, and πθ : S → ℝ^d be a deterministic latent policy for some θ ∈ ℝ^{d×|S|}. Then,

Diff(πα ∘ πθ, π*) ≤ ϵ(1)(TZ, φ) + ϵ(2)(πα, φ) + C4 ∥∇θ E_{s∼dπ*, a∼π*(s)}[(θ_s − φ(s, a))²]∥₁,   (4)

where C4 = ¼|S| ∥w∥∞, and ϵ(1) and ϵ(2) correspond to the first and second (pretraining) terms in the bound of Theorem 1; the third term corresponds to downstream imitation.

By replacing term (3) in Theorem 1, which corresponds to behavioral cloning in the latent action space, with term (4) in Theorem 3, which is a convex function unbounded in all directions, we are guaranteed that πθ is provably optimal regardless of the form of π* and π*,Z. Note that the downstream imitation learning objective implied by term (4) is simply the mean squared error between the actions θ_s chosen by πθ and the reparameterized actions φ(s, a) appearing in the expert dataset.

4.3 TRAIL: REPARAMETRIZED ACTIONS AND IMITATION LEARNING IN PRACTICE

In this section, we describe our learning framework, Transition-Reparametrized Actions for Imitation Learning (TRAIL). TRAIL consists of two training stages: pretraining and downstream behavioral cloning. During pretraining, TRAIL learns TZ and φ by minimizing JT(TZ, φ) = E_{(s,a)∼doff}[DKL(T(s, a) ∥ TZ(s, φ(s, a)))]. Also during pretraining, TRAIL learns πα and φ by minimizing JDE(πα, φ) := E_{(s,a)∼doff}[−log πα(a|s, φ(s, a))]. TRAIL parametrizes πα as a multivariate Gaussian distribution. Depending on whether TZ is defined according to Theorem 1 or Theorem 3, we obtain either TRAIL EBM or TRAIL linear.

TRAIL EBM for Theorem 1. In the tabular action setting that corresponds to Theorem 1, to ensure that the factored transition model TZ is flexible enough to capture complex (e.g., multi-modal) transitions in the offline dataset, we propose to use an energy-based model (EBM) to parametrize TZ(s'|s, φ(s, a)):

TZ(s'|s, φ(s, a)) ∝ ρ(s') exp(−∥φ(s, a) − ψ(s')∥²/2),   (5)

where ρ is a fixed distribution over S and ψ : S → Z is a function of s'. In our implementation we set ρ to be the distribution of s' in doff, which enables a practical learning objective for TZ, minimizing E_{(s,a)∼doff}[DKL(T(s, a) ∥ TZ(s, φ(s, a)))] in Theorem 1 via a contrastive loss:

E_doff[−log TZ(s'|s, φ(s, a))] = const(doff) + E_doff[ ½∥φ(s, a) − ψ(s')∥² + log E_{s̃∼ρ}[exp{−½∥φ(s, a) − ψ(s̃)∥²}] ].

During downstream behavioral cloning, TRAIL EBM learns a latent Gaussian policy πZ by minimizing JBC,φ(πZ) = E_{(s,a)∼(dπ*,π*)}[−log πZ(φ(s, a)|s)] with φ fixed. During inference, TRAIL EBM first samples a latent action according to z ∼ πZ(s), and then decodes the latent action using a ∼ πα(s, z) to act in the environment. Figure 1 describes this process pictorially.

TRAIL linear for Theorem 3. In the continuous latent action setting that corresponds to Theorem 3, we propose TRAIL linear, an approximation of TRAIL EBM that enables learning the linear transition models required by Theorem 3. Specifically, we first learn f, g that parameterize an energy-based transition model T̃(s'|s, a) ∝ ρ(s') exp{−∥f(s, a) − g(s')∥²/2} using the same contrastive loss as above (replacing φ and ψ by f and g), and then apply random Fourier features (Rahimi et al., 2007) to recover φ(s, a) = cos(Wf(s, a) + b), where W is a d × k matrix with entries sampled from a unit Gaussian and b is a vector with entries sampled uniformly from [0, 2π]. W and b are implemented as an untrainable neural network layer on top of f. This results in an approximately linear transition model,

T̃(s'|s, a) ∝ ρ(s') exp{−∥f(s, a) − g(s')∥²/2} ≈ ψ(s')ᵀφ(s, a).

During downstream behavioral cloning, TRAIL linear learns a deterministic policy πθ in the continuous latent action space determined by φ by minimizing ∥∇θ E_{s∼dπ*, a∼π*(s)}[(θ_s − φ(s, a))²]∥₁ with φ fixed. During inference, TRAIL linear first determines the latent action according to z = πθ(s), and then decodes the latent action using a ∼ πα(s, z) to act in the environment.
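The random Fourier feature step can be sanity-checked numerically: with W drawn from a unit Gaussian and b uniform on [0, 2π], the inner product of cosine features approximates the Gaussian kernel exp(−∥f(s, a) − g(s')∥²/2). The dimensions and the √(2/d) scaling below are standard RFF choices (Rahimi et al., 2007) stated as assumptions, not the paper's exact implementation.

```python
"""Sketch of the RFF approximation behind TRAIL linear: cosine features turn
the EBM's Gaussian kernel into an (approximately) linear transition model
T(s'|s, a) ≈ psi(s')^T phi(s, a). Toy embeddings, illustrative only."""
import numpy as np

rng = np.random.default_rng(4)
k, d = 16, 4096                     # k: dim of f(s, a), g(s'); d: number of random features

W = rng.normal(size=(d, k))         # rows ~ N(0, I), matching a unit-bandwidth Gaussian kernel
b = rng.uniform(0.0, 2.0 * np.pi, size=d)

def rff(x):
    """phi(x) = sqrt(2/d) * cos(W x + b), so phi(x)^T phi(y) ≈ exp(-||x - y||^2 / 2)."""
    return np.sqrt(2.0 / d) * np.cos(W @ x + b)

f_sa = rng.normal(size=k) * 0.3     # stand-in for a learned embedding f(s, a)
g_s = rng.normal(size=k) * 0.3      # stand-in for a learned embedding g(s')

exact = np.exp(-0.5 * np.sum((f_sa - g_s) ** 2))
approx = rff(f_sa) @ rff(g_s)
print(f"Gaussian kernel: {exact:.4f}   RFF approximation: {approx:.4f}")
```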
[Figure 2: Tasks for our empirical evaluation (cartpole-swingup, cheetah-run, fish-swim, walker-stand, walker-walk, humanoid-run, ant, antmaze-medium, antmaze-large). We include the challenging AntMaze navigation tasks from D4RL (Fu et al., 2020) and low (1-DoF) to high (21-DoF) dimensional locomotion tasks from the DeepMind Control Suite (Tassa et al., 2018).]

[Figure 3: Average success rate (%) over 4 seeds of TRAIL EBM (Theorem 1) and the temporal skill extraction methods SkiLD (Pertsch et al., 2021), SPiRL (Pertsch et al., 2020), and OPAL (Ajay et al., 2020) pretrained on suboptimal Doff (antmaze-medium-diverse, antmaze-medium-play, antmaze-large-diverse, antmaze-large-play; 10 expert trajectories each). Baseline BC corresponds to direct behavioral cloning of expert Dπ* without latent actions.]

5 EXPERIMENTAL EVALUATION

We now evaluate TRAIL on a set of navigation and locomotion tasks (Figure 2). Our evaluation is designed to study how well TRAIL can improve imitation learning with limited expert data by leveraging available suboptimal offline data. We evaluate the improvement attained by TRAIL over vanilla BC, and additionally compare TRAIL to previously proposed temporal skill extraction methods. Since there is no existing benchmark for imitation learning with suboptimal offline data, we adapt existing datasets for offline RL, which contain suboptimal data, and augment them with a small amount of expert data for downstream imitation learning.

5.1 EVALUATING NAVIGATION WITHOUT TEMPORAL ABSTRACTION

Description and Baselines. We start our evaluation on the AntMaze task from D4RL (Fu et al., 2020), which has been used as a testbed by recent works on temporal skill extraction for few-shot imitation (Ajay et al., 2020) and RL (Ajay et al., 2020; Pertsch et al., 2020; 2021). We compare TRAIL to OPAL (Ajay et al., 2020), SkiLD (Pertsch et al., 2021), and SPiRL (Pertsch et al., 2020), all of which use an offline dataset to extract temporally extended (length t = 10) skills to form a latent action space for downstream learning. SkiLD and SPiRL are originally designed only for downstream RL, so we modify them to support downstream imitation learning as described in Appendix C.
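A minimal sketch of this adaptation, assuming Gaussian latent policies and skill priors and using the KL weight of 0.1 from Appendix C, is given below; the function and argument names are our own, not the SPiRL or SkiLD codebases.

```python
"""Sketch: downstream behavioral cloning for skill-prior baselines, where the
latent BC loss is regularized by a KL term toward the frozen skill prior."""
import torch
import torch.distributions as D

def skill_bc_loss(latent_policy, skill_prior, s, z_target, kl_weight=0.1):
    """latent_policy(s) and skill_prior(s) are assumed to return Normal distributions."""
    pi_z = latent_policy(s)      # q(z | s) being trained via BC on reparametrized skills
    prior_z = skill_prior(s)     # frozen prior p(z | s) from offline skill extraction
    nll = -pi_z.log_prob(z_target).sum(-1).mean()
    kl = D.kl_divergence(pi_z, prior_z).sum(-1).mean()
    return nll + kl_weight * kl

if __name__ == "__main__":
    # Toy usage with dummy state-independent Gaussians of skill dimension 8.
    mk = lambda s: D.Normal(torch.zeros(s.shape[0], 8), torch.ones(s.shape[0], 8))
    s, z = torch.randn(4, 17), torch.randn(4, 8)
    print(skill_bc_loss(mk, mk, s, z).item())
```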
While a number of other works have also proposed to learn primitives for hierarchical imitation (Kipf et al., 2019; Hakhamaneshi et al., 2021) and RL (Fox et al., 2017; Krishnan et al., 2017; Shankar et al., 2019; Shankar & Gupta, 2020; Singh et al., 2020), we chose OPAL, SkiLD, and SPiRL for comparison because they are the most recent works in this area with reported results suggesting that these methods are state-of-the-art, especially in learning from suboptimal offline data based on D4RL. To construct the suboptimal and expert datasets, we follow the protocol in Ajay et al. (2020), which uses the full diverse or play D4RL AntMaze datasets as the suboptimal offline data and a set of n = 10 expert trajectories (navigating from one corner of the maze to the opposite corner) as the expert data. The diverse and play datasets are suboptimal for the corner-to-corner navigation task, as they only contain data that navigates to random or fixed locations different from those used in task evaluation.

Implementation Details. For TRAIL, we parameterize φ(s, a) and ψ(s') using separate feed-forward neural networks (see details in Appendix C) and train the transition EBM via the contrastive objective described in Section 4.3. We parametrize both the action decoder πα and the latent policy πZ as multivariate Gaussian distributions with neural-network-approximated means and variances. For the temporal skill extraction methods, we implement the trajectory encoder using a bidirectional RNN and parametrize the skill prior, latent policy, and action decoder as Gaussians, following Ajay et al. (2020). We adapt SPiRL and SkiLD for imitation learning by including the KL divergence term between the latent policy and the skill prior during downstream behavioral cloning (see details in Appendix C). We searched over the extent of temporal abstraction and found t = 10 to work best, as reported in these papers' maze experiments. We also experimented with a version of vanilla BC pretrained on the suboptimal data and fine-tuned on expert data for fair comparison, which did not show a significant difference from directly training vanilla BC on expert data.

Results. Figure 3 shows the average performance of TRAIL in terms of task success rate (out of 100%) compared to the prior methods. Since all of the prior methods are proposed in terms of temporal abstraction, we evaluate them both with the default temporal abstraction, t = 10, and without temporal abstraction, corresponding to t = 1. Note that TRAIL uses no temporal abstraction. We find that on the simpler antmaze-medium task, TRAIL trained on a single-step transition model performs similarly to the set of temporal skill extraction methods with t = 10. However, these skill extraction methods experience a degradation in performance when temporal abstraction is removed (t = 1).
This corroborates the existing theory in these works (Ajay et al., 2020), which attributes their benefits predominantly to temporal abstraction rather than to producing a latent action space that is easier to learn from. Meanwhile, TRAIL is able to excel without any temporal abstraction. These differences become even more pronounced on the harder antmaze-large tasks. We see that TRAIL maintains significant improvements over vanilla BC, whereas temporal skill extraction fails to achieve good performance even with t = 10. These results suggest that TRAIL attains significant improvement specifically from utilizing the suboptimal data to learn suitable action representations, rather than simply from providing temporal abstraction. Of course, this does not mean that temporal abstraction is never helpful. Rather, our results serve as evidence that suboptimal data can be useful for imitation learning not just by providing temporally extended skills, but by actually reformulating the action space to make imitation learning easier and more efficient.

5.2 EVALUATING LOCOMOTION WITH HIGHLY SUBOPTIMAL OFFLINE DATA

Description. The performance of TRAIL trained on a single-step transition model in the previous section suggests that learning single-step latent action representations can benefit a broader set of tasks for which temporal abstraction may not be helpful, e.g., when the offline data is highly suboptimal (with near-random actions) or unimodal (collected by a single stationary policy). In this section, we consider a Gym-MuJoCo task from D4RL using the same 8-DoF quadruped ant robot as in the previously evaluated navigation task. We first learn action representations from the medium, medium-replay, or random datasets, and imitate from 1% or 2.5% of the expert datasets from D4RL. The medium dataset represents data collected from a mediocre stationary policy (exhibiting unimodal behavior), and the random dataset is collected by a randomly initialized policy and is hence highly suboptimal.

Implementation Details. For this task, we additionally train a linear version of TRAIL by approximating the transition EBM using random Fourier features (Rahimi et al., 2007) and learn a deterministic latent policy following Theorem 3. Specifically, we use separate feed-forward networks to parameterize f(s, a) and g(s'), and extract action representations using φ(s, a) = cos(Wf(s, a) + b), where W and b are untrainable, randomly initialized variables as described in Section 4.3. Different from TRAIL EBM, which parametrizes πZ as a Gaussian, TRAIL linear parametrizes the deterministic πθ using a feed-forward neural network.

[Figure 4: Average rewards (over 4 seeds) of TRAIL EBM (Theorem 1), TRAIL linear (Theorem 3), and baseline methods when using a variety of unimodal (ant-medium), low-quality (ant-medium-replay), and random (ant-random) offline datasets Doff paired with a smaller expert dataset Dπ* (either 10k or 25k expert transitions).]

Results. Our results are shown in Figure 4. Both the EBM and linear versions of TRAIL consistently improve over baseline BC, whereas temporal skill extraction methods generally lead to worse performance regardless of the extent of abstraction, likely due to the degenerate effect (i.e., latent skills being ignored by a flexible action decoder) resulting from unimodal offline datasets, as discussed in Ajay et al. (2020). Surprisingly, TRAIL achieves a significant performance boost even when latent actions are learned from the random dataset, suggesting the benefit of learning action representations from transition models when the offline data is highly suboptimal.
Additionally, the linear variant of TRAIL performs slightly better than the EBM variant when the expert sample size is small (i.e., 10k), suggesting the benefit of learning deterministic latent policies following Theorem 3 when the environment is effectively approximated by a linear transition model.

[Figure 5: Average task rewards (over 4 seeds) of TRAIL EBM (Theorem 1), TRAIL linear (Theorem 3), and OPAL (other temporal methods are included in Appendix D) pretrained on the bottom 80% of the RL Unplugged datasets (cartpole-swingup, cheetah-run, fish-swim, walker-stand, walker-walk, humanoid-run), followed by behavioral cloning in the latent action space on 1/10 of the top 20% of the RL Unplugged datasets, following the setup in Zolna et al. (2020). Baseline BC achieves low rewards due to the small expert sample size. Dotted lines denote the performance of CRR (Wang et al., 2020), an offline RL method trained on the full RL Unplugged datasets with reward labels.]

5.3 EVALUATION ON DEEPMIND CONTROL SUITE

Description. Having observed the improvement TRAIL brings to behavioral cloning on AntMaze and MuJoCo Ant, we ask how TRAIL performs on a wider spectrum of locomotion tasks with various degrees of freedom. We consider 6 locomotion tasks from the DeepMind Control Suite (Tassa et al., 2018), ranging from simple (e.g., 1-DoF cartpole-swingup) to complex (e.g., 21-DoF humanoid-run). Following the setup in Zolna et al. (2020), we take 1/10 of the trajectories whose episodic reward is among the top 20% of the open-source RL Unplugged datasets (Gulcehre et al., 2020) as expert demonstrations (see the numbers of expert trajectories in Appendix C), and the bottom 80% of RL Unplugged as the suboptimal offline data. For completeness, we additionally include a comparison to Critic Regularized Regression (CRR) (Wang et al., 2020), an offline RL method with competitive performance on these tasks. CRR is trained on the full RL Unplugged datasets (i.e., the combined suboptimal and expert datasets) with reward labels.

Results. Figure 5 shows the comparison results. TRAIL outperforms temporal extraction methods on both low-dimensional (e.g., cartpole-swingup) and high-dimensional (e.g., humanoid-run) tasks. Additionally, TRAIL performs similarly to or better than CRR on 4 out of the 6 tasks despite not using any reward labels, and only slightly worse on humanoid-run and walker-walk. To test the robustness of TRAIL when the offline data is highly suboptimal, we further reduce the size and quality of the offline data to the bottom 5% of the original RL Unplugged datasets. As shown in Figure 6 in Appendix D, the performance of temporal skill extraction declines on fish-swim, walker-stand, and walker-walk due to this change in offline data quality, whereas TRAIL maintains the same performance as when the bottom 80% of the data was used, suggesting that TRAIL is more robust to low-quality offline data. This set of results suggests a promising direction for offline learning of sequential decision-making policies: learn latent actions from abundant low-quality data, then perform behavioral cloning in the latent action space on scarce high-quality data. Notably, compared to offline RL, this approach is applicable to settings where data quality cannot be easily expressed through a scalar reward.
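For concreteness, the sketch below shows how such an expert/suboptimal split could be constructed from reward-labeled trajectories; the loader-free data format and the random subsampling of the top trajectories are our assumptions, not the RL Unplugged interface or the exact protocol of Zolna et al. (2020).

```python
"""Sketch of the Section 5.3 data split: rank trajectories by episodic reward,
keep 1/10 of the top 20% as expert demonstrations, and use the bottom 80%
(with rewards discarded) as the suboptimal pretraining data."""
import numpy as np

def split_offline_dataset(trajectories, expert_frac=0.2, expert_keep=0.1, seed=0):
    """trajectories: assumed list of (episode_return, trajectory) pairs."""
    rng = np.random.default_rng(seed)
    ranked = sorted(trajectories, key=lambda tr: tr[0], reverse=True)
    n_top = int(len(ranked) * expert_frac)
    top, bottom = ranked[:n_top], ranked[n_top:]
    keep = rng.choice(n_top, size=max(1, int(n_top * expert_keep)), replace=False)
    expert = [top[i][1] for i in keep]        # small expert set, rewards no longer needed
    suboptimal = [tr[1] for tr in bottom]     # large suboptimal set, no reward labels
    return expert, suboptimal

# Toy usage with fake (return, trajectory-id) pairs.
fake = [(float(r), f"traj_{i}") for i, r in enumerate(np.random.default_rng(1).normal(size=100))]
expert, suboptimal = split_offline_dataset(fake)
print(len(expert), "expert trajectories;", len(suboptimal), "suboptimal trajectories")
```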
6 CONCLUSION We have derived a near-optimal objective for learning a latent action space from suboptimal offline data that provably accelerates downstream imitation learning. To learn this objective in practice, we propose transition-reparametrized actions for imitation learning (TRAIL), a two-stage framework that first pretrains a factored transition model from offline data, and then uses the transition model to reparametrize the action space prior to behavioral cloning. Our empirical results suggest that TRAIL can improve imitation learning drastically, even when pretrained on highly suboptimal data (e.g., data from a random policy), providing a new approach to imitation learning through a combination of pretraining on task-agnostic or suboptimal data and behavioral cloning on limited expert datasets. That said, our approach to action representation learning is not necessarily specific to imitation learning, and insofar as the reparameterized action space simplifies downstream control problems, it could also be combined with reinforcement learning in future work. More broadly, studying how learned action reparameterization can accelerate various facets of learning-based control represents an exciting future direction, and we hope that our results provide initial evidence of such a potential. Published as a conference paper at ICLR 2022 ACKNOWLEDGMENTS We thank Dale Schuurmans and Bo Dai for valuable discussions. We thank Justin Fu, Anurag Ajay, and Konrad Zolna for assistance in setting up evaluation tasks. REFERENCES Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1, 2004. Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pp. 22 31. PMLR, 2017. M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ar Xiv preprint ar Xiv:2101.06286, 2021. Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. Opal: Offline primitive discovery for accelerating offline reinforcement learning. ar Xiv preprint ar Xiv:2010.13611, 2020. Sanjeev Arora, Simon Du, Sham Kakade, Yuping Luo, and Nikunj Saunshi. Provable representation learning for imitation learning via bi-level optimization. In International Conference on Machine Learning, pp. 367 376. PMLR, 2020. Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. Daniel Berend and Aryeh Kontorovich. On the convergence of the empirical distribution. ar Xiv preprint ar Xiv:1205.6711, 2012. Pablo Castro and Doina Precup. Using bisimulation for policy transfer in mdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, 2010. Jonathan Chang, Masatoshi Uehara, Dhruv Sreenivas, Rahul Kidambi, and Wen Sun. Mitigating covariate shift in imitation learning via offline data with partial coverage. Advances in Neural Information Processing Systems, 34, 2021. Thomas G Dietterich et al. The maxq method for hierarchical reinforcement learning. In ICML, volume 98, pp. 118 126. Citeseer, 1998. Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. ar Xiv preprint ar Xiv:1802.06070, 2018. Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. 
Multi-level discovery of deep options. ar Xiv preprint ar Xiv:1703.08294, 2017. Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. ar Xiv preprint ar Xiv:2004.07219, 2020. Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G Bellemare. Deepmdp: Learning continuous latent space models for representation learning. In International Conference on Machine Learning, pp. 2170 2179. PMLR, 2019. Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. ar Xiv preprint ar Xiv:1611.07507, 2016. Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gomez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, et al. Rl unplugged: Benchmarks for offline reinforcement learning. ar Xiv e-prints, pp. ar Xiv 2006, 2020. Kourosh Hakhamaneshi, Ruihan Zhao, Albert Zhan, Pieter Abbeel, and Michael Laskin. Hierarchical few-shot imitation with skill transition models. ar Xiv preprint ar Xiv:2107.08981, 2021. Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018. Published as a conference paper at ICLR 2022 Haoming Jiang, Bo Dai, Mengjiao Yang, Tuo Zhao, and Wei Wei. Towards automatic evaluation of dialog systems: A model-free off-policy evaluation approach. ar Xiv preprint ar Xiv:2102.10242, 2021. Thomas Kipf, Yujia Li, Hanjun Dai, Vinicius Zambaldi, Alvaro Sanchez-Gonzalez, Edward Grefenstette, Pushmeet Kohli, and Peter Battaglia. Compile: Compositional imitation learning and execution. In International Conference on Machine Learning, pp. 3418 3428. PMLR, 2019. Sanjay Krishnan, Roy Fox, Ion Stoica, and Ken Goldberg. Ddco: Discovery of deep continuous options for robot learning from demonstrations. In Conference on robot learning, pp. 418 437. PMLR, 2017. Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in neural information processing systems, 29:3675 3683, 2016. Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. ar Xiv preprint ar Xiv:1906.00949, 2019. Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Jinxin Liu, Donglin Wang, Qiangxing Tian, and Zhengyu Chen. Learn goal-conditioned policy with intrinsic motivation for deep reinforcement learning. ar Xiv preprint ar Xiv:2104.05043, 2021. Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. In Conference on Robot Learning, pp. 1113 1132. PMLR, 2020. Ofir Nachum and Mengjiao Yang. Provable representation learning for imitation with contrastive fourier features. ar Xiv preprint ar Xiv:2105.12272, 2021. Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. ar Xiv preprint ar Xiv:1805.08296, 2018a. Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-optimal representation learning for hierarchical reinforcement learning. ar Xiv preprint ar Xiv:1810.01257, 2018b. Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, and Vikash Kumar. 
Multi-agent manipulation via locomotion using hierarchical sim2real. ar Xiv preprint ar Xiv:1908.05224, 2019. Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. Advances in neural information processing systems, pp. 1043 1049, 1998. Xue Bin Peng, Michael Chang, Grace Zhang, Pieter Abbeel, and Sergey Levine. Mcp: Learning composable hierarchical control with multiplicative compositional policies. ar Xiv preprint ar Xiv:1905.09808, 2019. Karl Pertsch, Youngwoon Lee, and Joseph J Lim. Accelerating reinforcement learning with learned skill priors. ar Xiv preprint ar Xiv:2010.11944, 2020. Karl Pertsch, Youngwoon Lee, Yue Wu, and Joseph J Lim. Guided reinforcement learning with learned skills. In Self-Supervision for Reinforcement Learning Workshop-ICLR 2021, 2021. Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Technical report, CARNEGIE-MELLON UNIV PITTSBURGH PA ARTIFICIAL INTELLIGENCE AND PSYCHOLOGY ..., 1989. Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994. Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, and Chelsea Finn. Visual adversarial imitation learning using variational models. Advances in Neural Information Processing Systems, 34, 2021. Published as a conference paper at ICLR 2022 Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In NIPS, volume 3, pp. 5. Citeseer, 2007. Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, 2017. Zhiyuan Ren and Bruce H Krogh. State aggregation in markov decision processes. In Proceedings of the 41st IEEE Conference on Decision and Control, 2002., volume 4, pp. 3819 3824. IEEE, 2002. St ephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661 668. JMLR Workshop and Conference Proceedings, 2010. St ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627 635. JMLR Workshop and Conference Proceedings, 2011. Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3 (6):233 242, 1999. Tanmay Shankar and Abhinav Gupta. Learning robot skills with temporal variational inference. In International Conference on Machine Learning, pp. 8624 8633. PMLR, 2020. Tanmay Shankar, Shubham Tulsiani, Lerrel Pinto, and Abhinav Gupta. Discovering motor programs by recomposing demonstrations. In International Conference on Learning Representations, 2019. Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. ar Xiv preprint ar Xiv:1907.01657, 2019. Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning. ar Xiv preprint ar Xiv:2011.10024, 2020. Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Reinforcement learning with soft state aggregation. Advances in neural information processing systems, pp. 361 368, 1995. Martin Stolle and Doina Precup. Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pp. 212 223. Springer, 2002. Richard S Sutton, Doina Precup, and Satinder Singh. 
Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181 211, 1999. Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. ar Xiv preprint ar Xiv:1801.00690, 2018. Stephen Tu, Alexander Robey, and Nikolai Matni. Closing the closed-loop distribution shift in safe imitation learning. ar Xiv preprint ar Xiv:2102.09161, 2021. Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pp. 3540 3549. PMLR, 2017. Xiaofei Wang, Kimin Lee, Kourosh Hakhamaneshi, Pieter Abbeel, and Michael Laskin. Skill preferences: Learning to extract and execute robotic skills from human feedback. ar Xiv preprint ar Xiv:2108.05382, 2021. Ziyu Wang, Alexander Novikov, Konrad Zolna, Jost Tobias Springenberg, Scott Reed, Bobak Shahriari, Noah Siegel, Josh Merel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. ar Xiv preprint ar Xiv:2006.15134, 2020. David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and Volodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. ar Xiv preprint ar Xiv:1811.11359, 2018. Published as a conference paper at ICLR 2022 Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. ar Xiv preprint ar Xiv:1911.11361, 2019. Amy Zhang, Rowan Mc Allister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. ar Xiv preprint ar Xiv:2006.10742, 2020. Wenxuan Zhou, Sujay Bajracharya, and David Held. Plas: Latent action space for offline reinforcement learning. ar Xiv preprint ar Xiv:2011.07213, 2020. Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. Offline learning from demonstrations and unlabeled experience. ar Xiv preprint ar Xiv:2011.13885, 2020. Published as a conference paper at ICLR 2022 A PROOFS FOR FOUNDATIONAL LEMMAS Lemma 4. If π1 and π2 are two policies in M and dπ1(s) and dπ2(s) are the state visitation distributions induced by policy π1 and π2 where dπ(s) := (1 γ) P t=0 γt Pr [st = s|π, M]. Define Diff(π2, π1) = DTV(dπ2 dπ1) then Diff(π2, π1) γ 1 γ Errdπ1(π1, π2, T ), (6) Errdπ1(π1, π2, T ) := 1 Es dπ1,a1 π1(s),a2 π2(s)[T (s |s, a1) T (s |s, a2)] . (7) is the TV-divergence between T π1 dπ1 and T π2 dπ1. Proof. Following similar derivations in Achiam et al. (2017); Nachum et al. (2018b), we express DTV(dπ2 dπ1) in linear operator notation: Diff(π2, π1) = DTV(dπ2 dπ1) = 1 21|(1 γ)(I γT Π2) 1µ (1 γ)(I γT Π1) 1µ|, (8) where Π1, Π2 are linear operators S S A such that Πiν(s, a) = πi(a|s)ν(s) and 1 is an all ones row vector of size |S|. Notice that dπ1 may be expressed in this notation as (1 γ)(I γT Π1) 1µ. We may re-write the above term as 1 21|(1 γ)(I γT Π2) 1((I γT Π1) (I γT Π2))(I γT Π1) 1µ| 21|(I γT Π2) 1(T Π2 T Π1)dπ1|. (9) Using matrix norm inequalities, we bound the above by 2 (I γT Π2) 1 1, 1|(T Π2 T Π1)dπ1|. (10) Since T Π2 is a stochastic matrix, (I γT Π2) 1 1, P t=0 γt T Π2 1, = (1 γ) 1. Thus, we bound the above by γ 2(1 γ)1|(T Π2 T Π1)dπ1| = γ 1 γ Errdπ1(π1, π2, T ), (11) and so we immediately achieve the desired bound in equation 6. 
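As a quick numerical spot-check of Lemma 4, the snippet below evaluates both sides of the bound on an assumed random tabular MDP, computing the visitation distributions in closed form; the MDP and policies are arbitrary toy choices.

```python
"""Spot-check of Lemma 4: D_TV(d_pi2 || d_pi1) <= gamma/(1-gamma) * Err_{d_pi1}(pi1, pi2, T),
where Err is the TV divergence between the one-step next-state distributions under d_pi1."""
import numpy as np

rng = np.random.default_rng(5)
S, A, gamma = 6, 3, 0.8
T = rng.dirichlet(np.ones(S), size=(S, A))
mu = rng.dirichlet(np.ones(S))
pi1 = rng.dirichlet(np.ones(A), size=S)
pi2 = rng.dirichlet(np.ones(A), size=S)

def visitation(pi):
    P = np.einsum("sa,sat->st", pi, T)
    return (1 - gamma) * mu @ np.linalg.inv(np.eye(S) - gamma * P)

d1, d2 = visitation(pi1), visitation(pi2)
lhs = 0.5 * np.abs(d1 - d2).sum()                 # Diff(pi2, pi1) = D_TV(d_pi2 || d_pi1)

next1 = np.einsum("s,sa,sat->t", d1, pi1, T)      # next-state distribution under (d_pi1, pi1)
next2 = np.einsum("s,sa,sat->t", d1, pi2, T)      # next-state distribution under (d_pi1, pi2)
err = 0.5 * np.abs(next1 - next2).sum()           # Err_{d_pi1}(pi1, pi2, T)

rhs = gamma / (1 - gamma) * err
print(f"D_TV(d_pi2, d_pi1) = {lhs:.4f}  <=  gamma/(1-gamma) * Err = {rhs:.4f}")
assert lhs <= rhs + 1e-9
```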
The divergence bound above relies on the true transition model T which is not available to us. We now introduce an approximate transition model T to proxy Errdπ1(π1, π2, T ). Lemma 5. For π1 and π2 two policies in M and any transition model T ( |s, a) we have, Errdπ1(π1, π2, T ) |A|E(s,a) (dπ1,Unif A)[DTV(T (s, a) T (s, a))] + Errdπ1(π1, π2, T ). (12) Errdπ1(π1, π2, T ) = 1 Es dπ1,a1 π1(s),a2 π2(s)[T (s |s, a1) T (s |s, a2)] (13) a A Es dπ1[T (s |s, a)π1(a|s) T (s, a)π2(a|s)] a A Es dπ1[(T (s |s, a) T (s |s, a))(π1(a|s) π2(a|s)) + T (s |s, a)(π1(a|s) π2(a|s))] a A Es dπ1[(T (s |s, a) T (s |s, a))(π1(a|s) π2(a|s))] + Errdπ1(π1, π2, T ) (16) a A Es dπ1[ (T (s |s, a) T (s |s, a))(π1(a|s) π2(a|s)) ] + Errdπ1(π1, π2, T ) (17) |A|E(s,a) (dπ1,Unif A)[DTV(T (s |s, a) T (s |s, a)|] + Errdπ1(π1, π2, T ), (18) Published as a conference paper at ICLR 2022 and we arrive at the inequality as desired where the last step comes from DTV(T (s, a) T (s, a)) = 1 2 P s S |T (s |s, a) T (s |s, a)|. Now we introduce a representation function φ : S A Z and show how the error above may be reduced when T (s, a) = TZ(s, φ(s, a)): Lemma 6. Let φ : S A Z for some space Z and suppose there exists TZ : S Z (S) such that T (s, a) = TZ(s, φ(s, a)) for all s S, a A. Then for any policies π1, π2, Errdπ1(π1, π2, T )] Es dπ1[DTV(π1,Z π2,Z)], (19) where πk,Z(z|s) is the marginalization of πk onto Z: πk,Z(z|s) := X a A,z=φ(s,a) πk(a|s) (20) for all z Z, k {1, 2}. Es dπ1,a1 π1(s),a2 π2(s)[T (s |s, a1) T (s |s, a2)] (21) s S,a A TZ(s |s, φ(s, a))π1(a|s)dπ1(s) X s S,a A TZ(s |s, φ(s, a))π2(a|s)dπ1(s) s S,z Z TZ(s |s, z) X a A, φ(s,a)=z π1(a|s)dπ1(s) X s S,z Z TZ(s |s, z) X a A, φ(s,a)=z π2(a|s)dπ1(s) s S,z Z TZ(s |s, z)π1,Z(z|s)dπ1(s) X s S,z Z TZ(s |s, z)π2,Z(z|s)dπ1(s) z Z TZ(s |s, z)(π1,Z(z|s) π2,Z(z|s)) s S TZ(s |s, z) |π1,Z(z|s) π2,Z(z|s)| z Z |π1,Z(z|s) π2,Z(z|s)| = Es dπ1 [DTV(π1,Z π2,Z)] , (25) and we arrive at the inequality as desired. Lemma 7. Let d (S, A) be some state-action distribution, φ : S A Z, and πZ : S (Z). Denote πα as the optimal action decoder for d, φ: πα (a|s, z) = d(s, a) 1[z = φ(s, a)] P a A d(s, a ) 1[z = φ(s, a )], and πα ,Z as the marginalization of πα πZ onto Z: πα ,Z(z|s) := X a A,z=φ(s,a) (πα πZ)(a|s) = X a A,z=φ(s,a) z Z πα (a|s, z)πZ( z|s). Then we have πα ,Z(z|s) = πZ(z|s) (26) for all z Z and s S. Published as a conference paper at ICLR 2022 πα ,Z(z|s) = X a A,z=φ(s,a) z Z πα (a|s, z)πZ( z|s) (27) a A,z=φ(s,a) d(s, a) 1[ z = φ(s, a)] P a A d(s, a ) 1[ z = φ(s, a )]πZ( z|s) (28) a A,z=φ(s,a) d(s, a) 1[z = φ(s, a)] P a A d(s, a ) 1[z = φ(s, a )]πZ(z|s) (29) = πZ(z|s) X a A,z=φ(s,a) d(s, a) 1[z = φ(s, a)] P a A d(s, a ) 1[z = φ(s, a )] (30) = πZ(z|s), (31) and we have the desired equality. Lemma 8. Let πZ : S (Z) be a latent policy in Z and πα : S Z A be an action decoder, πα,Z be the marginalization of πα πZ onto Z: πα,Z(z|s) := X a A,z=φ(s,a) (πα πZ)(a|s) = X a A,z=φ(s,a) z Z πα(a|s, z)πZ( z|s). Then for any s S we have DTV(πZ(s) πα,Z(s)) max z Z DTV(πα (s, z) πα(s, z)), (32) where πα is the optimal action decoder defined in Lemma 7 (and this holds for any choice of d from Lemma 7). DTV(πZ(s) πα,Z(s)) (33) z Z |πZ(z|s) πα,Z(z|s)| (34) a A,z=φ(s,a) z Z πα(a|s, z)πZ( z|s) a A,z=φ(s,a) z Z (πα(a|s, z) πα (a|s, z) + πα (a|s, z)) πZ( z|s) a A,z=φ(s,a) z Z (πα(a|s, z) πα (a|s, z)) πZ( z|s) (by Lemma 7) (37) a A,z=φ(s,a) |πα(a|s, z) πα (a|s, z)| a A |πα(a|s, z) πα (a|s, z)| =E z πZ(s) [DTV(πα(s, z) , πα (s, z))] (40) max z Z DTV(πα(s, z) πα (s, z)), (41) (42) and we have the desired inequality. 
Lemma 9. Let π1,Z be the marginalization of π1 onto Z as defined in Lemma 6, and let πZ, πα, πα,Z be as defined in Lemma 8, and let πα ,Z be as defined in Lemma 7. For any s S we have DTV(π1,Z(s) πα,Z(s)) max z Z DTV(πα(s, z) πα (s, z)) + DTV(π1,Z(s) πZ(s)). (43) Published as a conference paper at ICLR 2022 Proof. The desired inequality is achieved by plugging the inequality from Lemma 8 into the following triangle inequality: DTV(π1,Z(s) πα,Z(s)) DTV(πZ(s) πα,Z(s)) + DTV(π1,Z(s) πZ(s)). (44) Our final lemma will be used to translate on-policy bounds to off-policy. Lemma 10. For two distributions ρ1, ρ2 (S) with ρ1(s) > 0 ρ2(s) > 0, we have, Eρ1[h(s)] (1 + Dχ2(ρ1 ρ2) 1 2 ) q Eρ2[h(s)2]. (45) Proof. The lemma is a straightforward consequence of Cauchy-Schwartz: Eρ1[h(s)] = Eρ2[h(s)] + (Eρ1[h(s)] Eρ2[h(s)]) (46) = Eρ2[h(s)] + X ρ1(s) ρ2(s) ρ2(s) 1 2 ρ2(s) 1 2 h(s) (47) Eρ2[h(s)] + (ρ1(s) ρ2(s))2 s S ρ2(s)h(s)2 ! 1 = Eρ2[h(s)] + Dχ2(ρ1 ρ2) 1 2 q Eρ2[h(s)2]. (49) Finally, to get the desired bound, we simply note that the concavity of the square-root function implies Eρ2[h(s)] Eρ2[ p Eρ2[h(s)2]. B PROOFS FOR MAJOR THEOREMS B.1 PROOF OF THEOREM 1 Proof. Let π2 := πα πZ, we have π2,Z(z|s) = πα,Z(z|s) = P a A,φ(s,a)=z(πα πZ)(z|s). By plugging the result of Lemma 9 into Lemma 6, we have Errdπ1(π1, π2, T )] Es dπ1 max z Z DTV(πα (s, z) πα(s, z)) + DTV(π1,Z(s) πZ(s)) . (50) By plugging this result into Lemma 5, we have Errdπ1(π1, π2, T ) |A|E(s,a) (dπ1,Unif A)[DTV(T (s, a) T (s, a))] (51) + Es dπ1 max z Z DTV(πα (s, z) πα(s, z)) (52) + Es dπ1 [DTV(π1,Z(s) πZ(s))] . (53) By further plugging this result into Lemma 4 and let π1 = π , we have: Diff(πα πZ, π ) γ|A| 1 γ E(s,a) (dπ1,Unif A)[DTV(T (s, a) TZ(s, φ(s, a))] + γ 1 γ Es dπ [max z Z DTV(πα (s, z) πα(s, z))] + γ 1 γ Es dπ [DTV(π ,Z(s) πZ(s))]. (54) Finally, by plugging in the off-policy results of Lemma 10 to the bound in Equation 54 and by applying Pinsker s inequality DTV(T (s, a) TZ(s, φ(s, a)))2 1 2DKL(T (s, a) TZ(s, φ(s, a))), we have Diff(πα πZ, π ) C1 1 2 E(s,a) doff [DKL(T (s, a) TZ(s, φ(s, a)))] | {z } = JT(TZ, φ) 1 2 Es doff[max z Z DKL(πα (s, z) πα(s, z))] | {z } const(doff, φ) + JDE(πα, φ) 1 2 Es dπ [DKL(π ,Z(s) πZ(s))] | {z } = const(π , φ) + JBC,φ(πZ) Published as a conference paper at ICLR 2022 where C1 = γ|A|(1 γ) 1(1 + Dχ2(dπ doff) 1 2 ), C2 = γ(1 γ) 1(1 + Dχ2(dπ doff) 1 2 ), and C3 = γ(1 γ) 1. Since the maxz Z is not tractable in practice, we approximate Es doff[maxz Z DKL(πα (s, z) πα(s, z))] using E(s,a) doff[DKL(πα (s, φ(s, a)) πα(s, φ(s, a)))], which reduces to JDE(πα, φ) with additional constants. We now arrive at the desired off-policy bound in Theorem 1. B.2 PROOF OF THEOREM 2 Lemma 11. Let ρ ({1, . . . , k}) be a distribution with finite support. Let ˆρn denote the empirical estimate of ρ from n i.i.d. samples X ρ. Then, En[DTV(ρ ˆρn)] 1 Proof. The first inequality is Lemma 8 in Berend & Kontorovich (2012) while the second inequality is due to the concavity of the square root function. Lemma 12. Let D := {(si, ai)}n i=1 be i.i.d. samples from a factored distribution x(s, a) := ρ(s)π(a|s) for ρ (S), π : S (A). Let ˆρ be the empirical estimate of ρ in D and ˆπ be the empirical estimate of π in D. Then, ED[Es ρ[DTV(π(s) ˆπ(s))]] Proof. Let ˆx be the empirical estimate of x in D. 
Lemma 12. Let $D := \{(s_i, a_i)\}_{i=1}^n$ be i.i.d. samples from a factored distribution $x(s,a) := \rho(s)\pi(a|s)$ for $\rho\in\Delta(S)$, $\pi: S\to\Delta(A)$. Let $\hat\rho$ be the empirical estimate of $\rho$ in $D$ and $\hat\pi$ be the empirical estimate of $\pi$ in $D$. Then,
$$\mathbb{E}_D\big[\mathbb{E}_{s\sim\rho}[D_{\mathrm{TV}}(\pi(s)\,\|\,\hat\pi(s))]\big] \le \frac{1}{2}\sqrt{\frac{|S|}{n}} + \frac{1}{2}\sqrt{\frac{|S||A|}{n}}.$$

Proof. Let $\hat x$ be the empirical estimate of $x$ in $D$. We have,
$$\mathbb{E}_{s\sim\rho}[D_{\mathrm{TV}}(\pi(s)\,\|\,\hat\pi(s))] = \tfrac{1}{2}\sum_{s,a}\rho(s)\,\big|\pi(a|s) - \hat\pi(a|s)\big| \quad (58)$$
$$= \tfrac{1}{2}\sum_{s,a}\rho(s)\,\Big|\frac{x(s,a)}{\rho(s)} - \frac{\hat x(s,a)}{\hat\rho(s)}\Big| \quad (59)$$
$$\le \tfrac{1}{2}\sum_{s,a}\rho(s)\,\Big|\frac{\hat x(s,a)}{\rho(s)} - \frac{\hat x(s,a)}{\hat\rho(s)}\Big| + \tfrac{1}{2}\sum_{s,a}\rho(s)\,\Big|\frac{\hat x(s,a)}{\rho(s)} - \frac{x(s,a)}{\rho(s)}\Big| \quad (60)$$
$$= \tfrac{1}{2}\sum_{s,a}\rho(s)\,\Big|\frac{\hat x(s,a)}{\rho(s)} - \frac{\hat x(s,a)}{\hat\rho(s)}\Big| + D_{\mathrm{TV}}(x\,\|\,\hat x) \quad (61)$$
$$= \tfrac{1}{2}\sum_{s}\rho(s)\,\Big|\frac{1}{\rho(s)} - \frac{1}{\hat\rho(s)}\Big|\sum_a\hat x(s,a) + D_{\mathrm{TV}}(x\,\|\,\hat x) \quad (62)$$
$$= \tfrac{1}{2}\sum_{s}\rho(s)\,\Big|\frac{1}{\rho(s)} - \frac{1}{\hat\rho(s)}\Big|\,\hat\rho(s) + D_{\mathrm{TV}}(x\,\|\,\hat x) \quad (63)$$
$$= D_{\mathrm{TV}}(\rho\,\|\,\hat\rho) + D_{\mathrm{TV}}(x\,\|\,\hat x). \quad (64)$$
Finally, the bound in the lemma is achieved by application of Lemma 11 to each of the two TV divergences.

To prove Theorem 2, we first rewrite Theorem 1 as
$$\mathrm{Diff}(\pi_Z, \pi^*) \le \Delta^{(1)}(\phi) + \Delta^{(2)}(\phi) + C_3\,\mathbb{E}_{s\sim d^{\pi^*}}\big[D_{\mathrm{TV}}(\pi^*_Z(s)\,\|\,\pi_Z(s))\big], \quad (65)$$
where $\Delta^{(1)}$ and $\Delta^{(2)}$ denote the first two terms in the bound of Theorem 1 and $C_3 = \frac{\gamma}{1-\gamma}$. The result in Theorem 2 is then derived by setting $\phi = \phi_{\mathrm{orcl}}$ and $\pi_Z := \pi_{\phi_{\mathrm{orcl}},Z}$ and using the result of Lemma 12.

Note that the above sample analysis can be extended to the continuous latent action space characterized by Theorem 3 as follows.

Theorem 13. Let $\phi_{\mathrm{orcl}} := \mathrm{OPT}_\phi(D^{\mathrm{off}})$ and let $\pi_{\mathrm{orcl},\theta}$ be the latent BC policy with respect to $\phi_{\mathrm{orcl}}$. Let $d$ be the dimension of the continuous latent actions and let $\|\phi\|_\infty$ be the $\ell_\infty$ norm of $\phi_{\mathrm{orcl}}(s,a)$ over all $s, a$. We have
$$\mathbb{E}_{D^{\pi^*}}\big[\mathrm{Diff}(\pi_{\phi_{\mathrm{orcl}},\theta}, \pi^*)\big] \le \Delta^{(1)}(\phi_{\mathrm{orcl}}) + \Delta^{(2)}(\phi_{\mathrm{orcl}}) + C_4\,d\,\|\phi\|_\infty\sqrt{\frac{2|S|}{n+1}},$$
where $\Delta^{(1)}$, $\Delta^{(2)}$, and $C_4$ are the same as in Theorem 3.

Proof. We use $\mu\in\mathbb{R}^{d\times|S|}$ to denote the optimal setting of $\theta$, which yields a zero $\ell_1$-norm of $\nabla_\theta\,\mathbb{E}_{s\sim d^{\pi^*},\, a\sim\pi^*}[(\theta_s - \phi(s,a))^2]$; i.e.,
$$\mu_s = \mathbb{E}_{a\sim\pi^*(s)}[\phi(s,a)]. \quad (66)$$
According to Theorem 3, we want to bound the $\ell_1$-norm of $\nabla_\theta\,\mathbb{E}_{s\sim d^{\pi^*},\, a\sim\pi^*}[(\theta_s - \phi(s,a))^2]$ evaluated at the approximate solution $\hat\mu\in\mathbb{R}^{d\times|S|}$ with respect to the finite dataset $D^{\pi^*}$; i.e.,
$$\hat\mu_s = \mathbb{E}_{a\sim D^{\pi^*}(\cdot|s)}[\phi(s,a)], \quad (67)$$
with the convention that $\hat\mu_s = 0$ if $s$ does not appear in $D^{\pi^*}$. To this end, we have the following derivation, which uses $\mathbb{E}_n$ to denote the expectation over realizations of $\hat\mu$ due to $n$-size draws of the target dataset $D^{\pi^*}$:
$$\mathbb{E}_n\Big[\Big\|\nabla_\theta\,\mathbb{E}_{s\sim d^{\pi^*},\, a\sim\pi^*}\big[(\theta_s - \phi(s,a))^2\big]\Big|_{\theta=\hat\mu}\Big\|_1\Big] = \mathbb{E}_n\big[\mathbb{E}_{s\sim d^{\pi^*}}[\|\hat\mu_s - \mathbb{E}_{a\sim\pi^*}[\phi(s,a)]\|_1]\big] \quad (68)$$
$$= \mathbb{E}_n\big[\mathbb{E}_{s\sim d^{\pi^*}}[\|\hat\mu_s - \mu_s\|_1]\big] \quad (69)$$
$$= \mathbb{E}_{s\sim d^{\pi^*}}\big[\mathbb{E}_n[\|\hat\mu_s - \mu_s\|_1]\big]. \quad (70)$$
We now split up the inner expectation based on the number of times $k$ that $s$ appears in $D^{\pi^*}$:
$$\mathbb{E}_{s\sim d^{\pi^*}}\big[\mathbb{E}_n[\|\hat\mu_s - \mu_s\|_1]\big] = \mathbb{E}_{s\sim d^{\pi^*}}\Big[\sum_{k=0}^n \Pr[\mathrm{count}(s) = k]\,\mathbb{E}_k[\|\hat\mu_s - \mu_s\|_1]\Big] \quad (72)$$
$$\le \sqrt{\mathbb{E}_{s\sim d^{\pi^*}}\Big[\sum_{k=0}^n \Pr[\mathrm{count}(s) = k]\,\mathbb{E}_k[\|\hat\mu_s - \mu_s\|_1]^2\Big]}, \quad (73)$$
where $\mathbb{E}_k$ denotes the expectation over realizations of $\hat\mu_s$ over $k$-size draws of $a\sim\pi^*(s)$. By standard combinatorics, we know
$$\Pr[\mathrm{count}(s) = k] = \binom{n}{k}\,d^{\pi^*}(s)^k\,(1 - d^{\pi^*}(s))^{n-k}. \quad (74)$$
Furthermore, for $k = 0$, we have
$$\mathbb{E}_k[\|\hat\mu_s - \mu_s\|_1]^2 = \|\mu_s\|_1^2 \le d^2\,\|\phi\|_\infty^2, \quad (75)$$
while for $k > 0$, since $\mathbb{E}_k[\hat\mu_s] = \mu_s$, we have
$$\mathbb{E}_k[\|\hat\mu_s - \mu_s\|_1]^2 \le d\,\mathbb{E}_k\big[\|\hat\mu_s - \mu_s\|_2^2\big] = d\,\mathrm{Var}_k[\hat\mu_s] \le \frac{d^2\,\|\phi\|_\infty^2}{k} \le \frac{2d^2\,\|\phi\|_\infty^2}{k+1}. \quad (76)$$
Combining Equations 74, 75, and 76, we have for any $k \ge 0$
$$d^{\pi^*}(s)\,\Pr[\mathrm{count}(s) = k]\,\mathbb{E}_k[\|\hat\mu_s - \mu_s\|_1]^2 \le \frac{2d^2\,\|\phi\|_\infty^2}{k+1}\binom{n}{k}\,d^{\pi^*}(s)^{k+1}(1 - d^{\pi^*}(s))^{n-k} = \frac{2d^2\,\|\phi\|_\infty^2}{n+1}\binom{n+1}{k+1}\,d^{\pi^*}(s)^{k+1}(1 - d^{\pi^*}(s))^{n-k}, \quad (77)$$
and so by the binomial theorem,
$$\sum_{k=0}^n d^{\pi^*}(s)\,\Pr[\mathrm{count}(s) = k]\,\mathbb{E}_k[\|\hat\mu_s - \mu_s\|_1]^2 \le \frac{2d^2\,\|\phi\|_\infty^2}{n+1}. \quad (78)$$
Plugging the above into Equations 72 and 73 (summing over all $s\in S$) we deduce
$$\mathbb{E}_{s\sim d^{\pi^*}}\big[\mathbb{E}_n[\|\hat\mu_s - \mu_s\|_1]\big] \le d\,\|\phi\|_\infty\sqrt{\frac{2|S|}{n+1}}, \quad (79)$$
and we have the convergence rate as desired.

B.3 PROOF OF THEOREM 3

Proof. The gradient term in Theorem 3 with respect to a specific column $\theta_s$ of $\theta$ may be expressed as
$$\nabla_{\theta_s}\,\mathbb{E}_{\tilde s\sim d^\pi,\, a\sim\pi(\tilde s)}\big[(\theta_{\tilde s} - \phi(\tilde s, a))^2\big] = -2\,\mathbb{E}_{a\sim\pi(s)}\big[d^\pi(s)\,\phi(s,a)\big] + 2\,d^\pi(s)\,\theta_s = -2\,\mathbb{E}_{a\sim\pi(s)}\big[d^\pi(s)\,\phi(s,a)\big] + 2\,\mathbb{E}_{z=\theta_s}\big[d^\pi(s)\,z\big], \quad (80)$$
so that, multiplying by $w(s')^\top$ and using the linear factorization of Theorem 3,
$$w(s')^\top\nabla_{\theta_s}\,\mathbb{E}_{\tilde s\sim d^\pi,\, a\sim\pi(\tilde s)}\big[(\theta_{\tilde s} - \phi(\tilde s, a))^2\big] = -2\,\mathbb{E}_{a\sim\pi(s)}\big[d^\pi(s)\,T(s'|s,a)\big] + 2\,\mathbb{E}_{z=\theta_s}\big[d^\pi(s)\,w(s')^\top z\big]. \quad (81)$$
Summing over $s\in S$, we have:
$$\sum_{s\in S} w(s')^\top\nabla_{\theta_s}\,\mathbb{E}_{\tilde s\sim d^\pi,\, a\sim\pi(\tilde s)}\big[(\theta_{\tilde s} - \phi(\tilde s, a))^2\big] = 2\,\mathbb{E}_{s\sim d^\pi,\, a\sim\pi(s),\, z=\theta_s}\big[-T(s'|s,a) + T_Z(s'|s,z)\big]. \quad (82)$$
Thus, we have:
$$\mathrm{Err}_{d^\pi}(\pi, \pi_\theta, T) = \tfrac{1}{2}\sum_{s'\in S}\Big|\mathbb{E}_{s\sim d^\pi,\, a\sim\pi(s),\, z=\theta_s}\big[-T(s'|s,a) + T_Z(s'|s,z)\big]\Big| = \tfrac{1}{4}\sum_{s'\in S}\Big|\sum_{s\in S} w(s')^\top\nabla_{\theta_s}\,\mathbb{E}_{\tilde s\sim d^\pi,\, a\sim\pi(\tilde s)}\big[(\theta_{\tilde s} - \phi(\tilde s, a))^2\big]\Big| \le \frac{|S|\,\|w\|_\infty}{4}\,\Big\|\nabla_\theta\,\mathbb{E}_{s\sim d^\pi,\, a\sim\pi(s)}\big[(\theta_s - \phi(s,a))^2\big]\Big\|_1. \quad (83)$$
Then, by combining Lemmas 4, 5, and 10 and applying Equation 83 (as opposed to Lemma 6 as in the tabular case), we arrive at the desired bound in Theorem 3.

C EXPERIMENT DETAILS

C.1 ARCHITECTURE

We parametrize $\phi$ as a fully connected neural network with two hidden layers of 256 units each. A Swish (Ramachandran et al., 2017) activation function is applied to the output of each hidden layer. We use embedding size 64 for AntMaze and 256 for Ant and all DeepMind Control Suite (DMC) tasks after sweeping values of 64, 256, and 512, though we found TRAIL to be relatively robust to the latent dimension size as long as it is not too small (i.e., at least 64). The latent skills in temporal skill extraction require a much smaller dimension size, e.g., 8 or 10 as reported by Ajay et al. (2020) and Pertsch et al. (2021). We tried increasing the latent skill size for these works during evaluation, but found the reported value of 8 to work best. We additionally experimented with different extents of temporal abstraction in skill extraction, but found the previously reported value of t = 10 to also work best. We implement the trajectory encoder in OPAL, SkiLD, and SPiRL using a bidirectional LSTM with hidden dimension 256. We use β = 0.1 for the KL regularization term in the β-VAE of OPAL (as reported), and we also use 0.1 as the weight for the KL divergence terms of SPiRL and SkiLD.

C.2 TRAINING AND EVALUATION

During pretraining, we use the Adam optimizer with learning rate 0.0003 for 200k iterations with batch size 256 for all methods that require pretraining. During downstream behavioral cloning, the learned action representations are fixed, but the action decoder is fine-tuned on the expert data as suggested by Ajay et al. (2020). Behavioral cloning for all methods, including vanilla BC, is trained with learning rate 0.0001 for 1M iterations. We experimented with decaying the downstream BC learning rate by a factor of 3 at the 200k-iteration boundary for all methods, and found that when the expert sample size is small, decaying the learning rate can prevent overfitting for all methods. The reported results use learning rate decay on AntMaze and no learning rate decay on the other environments for all methods. During the downstream behavioral cloning stage, we evaluate the latent policy combined with the action decoder every 10k steps by executing $\pi_\alpha \circ \pi_Z$ in the environment for 10 episodes and computing the average total return. Each method is run with 4 seeds, where each seed corresponds to one set of action representations and the downstream imitation learning result on that set of representations. We report the mean and standard error for all methods in the bar and line figures.
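To make the setup in C.1 and C.2 concrete, the following PyTorch sketch assembles the pieces described above: a two-hidden-layer, 256-unit Swish MLP for $\phi$, an action decoder, a latent policy, and the Adam learning rates from C.2. It is a simplified illustration rather than the actual TRAIL implementation; the module names, dimensions, and helper functions (mlp, pretrain_step, bc_step) are ours, the decoder and latent BC losses are written as fixed-variance Gaussian (squared-error) likelihoods, the latent policy is a deterministic stand-in for $\pi_Z$, and the paper's contrastively trained energy-based transition model is replaced by a plain latent forward-model regression.

```python
import torch
import torch.nn as nn

def mlp(inp, out, hidden=256):
    # Two hidden layers of 256 units with Swish (SiLU) activations, as described in C.1.
    return nn.Sequential(
        nn.Linear(inp, hidden), nn.SiLU(),
        nn.Linear(hidden, hidden), nn.SiLU(),
        nn.Linear(hidden, out),
    )

obs_dim, act_dim, z_dim = 29, 8, 64              # illustrative AntMaze-like sizes; z_dim = 64 per C.1

phi = mlp(obs_dim + act_dim, z_dim)              # action representation phi(s, a)
decoder = mlp(obs_dim + z_dim, act_dim)          # action decoder (mean of a fixed-variance Gaussian pi_alpha)
latent_pi = mlp(obs_dim, z_dim)                  # deterministic stand-in for the latent policy pi_Z
forward_model = mlp(obs_dim + z_dim, obs_dim)    # stand-in for T_Z(s'|s, z); the paper uses an EBM trained contrastively

pretrain_opt = torch.optim.Adam(
    list(phi.parameters()) + list(decoder.parameters()) + list(forward_model.parameters()),
    lr=3e-4,                                      # pretraining learning rate from C.2
)
bc_opt = torch.optim.Adam(latent_pi.parameters(), lr=1e-4)  # downstream BC learning rate from C.2

def pretrain_step(s, a, s_next):
    """One pretraining step on suboptimal data: transition loss (J_T stand-in) plus decoder loss (J_DE)."""
    z = phi(torch.cat([s, a], dim=-1))
    j_t = ((forward_model(torch.cat([s, z], dim=-1)) - s_next) ** 2).mean()
    j_de = ((decoder(torch.cat([s, z], dim=-1)) - a) ** 2).mean()
    loss = j_t + j_de
    pretrain_opt.zero_grad(); loss.backward(); pretrain_opt.step()
    return loss.item()

def bc_step(s, a):
    """Downstream BC step on expert data: fit the latent policy to reparametrized expert actions phi(s, a)."""
    with torch.no_grad():
        z_target = phi(torch.cat([s, a], dim=-1))  # representations are frozen downstream, as stated in C.2
    j_bc = ((latent_pi(s) - z_target) ** 2).mean()
    bc_opt.zero_grad(); j_bc.backward(); bc_opt.step()
    return j_bc.item()
```

At inference time, actions would be produced by composing the two learned pieces, i.e., predicting a latent action from the latent policy at the current state and decoding it with the action decoder, matching the $\pi_\alpha \circ \pi_Z$ evaluation protocol described above.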
C.3 MODIFICATION TO SKILD AND SPIRL

Since SkiLD (Pertsch et al., 2021) and SPiRL (Pertsch et al., 2020) are originally designed for RL rather than imitation learning, we replace the downstream RL algorithms of SkiLD and SPiRL with behavioral cloning plus regularization (but keep skill extraction the same as in the original methods). Specifically, for SkiLD, we apply a KL regularization term between the latent policy and the skill prior learned on the suboptimal offline dataset during pretraining, and another KL regularization term between the latent policy and a learned skill posterior on the expert data, as done in the original paper, during downstream behavioral cloning. We do not need to train the binary classifier that SkiLD uses to decide which regularizer to apply, because in the imitation learning setting we already know which actions are expert and which are suboptimal. For SPiRL, we apply the KL divergence between the latent policy and the skill prior extracted from the offline data (i.e., the red term in Algorithm 1 of Pertsch et al. (2020)) as an additional term in latent behavioral cloning.

C.4 DATASET DETAILS

AntMaze. For the expert data in AntMaze, we use the goal-reaching expert policies trained by Ajay et al. (2020) (expert here means that the agent is trained to navigate from one corner of the maze to the opposite corner) to collect n = 10 trajectories. For the suboptimal data in AntMaze, we use the full D4RL datasets antmaze-large-diverse-v0, antmaze-medium-play-v0, antmaze-medium-diverse-v0, and antmaze-large-play-v0.

Ant. For the expert data in Ant, we use a small set of expert trajectories selected by taking either the first 10k or 25k transitions from ant-expert-v0 in D4RL, corresponding to roughly 10 and 25 expert trajectories, respectively. For the suboptimal data in Ant, we use the full D4RL datasets ant-medium-v0, ant-medium-replay-v0, and ant-random-v0.

RL Unplugged. For the DeepMind Control Suite (Tassa et al., 2018) set of tasks, we use the RL Unplugged (Gulcehre et al., 2020) datasets. For the expert data, we take 1/10 of the trajectories whose episodic reward is among the top 20% of the open-source RL Unplugged datasets, following the setup in Zolna et al. (2020). For the suboptimal data, we use the bottom 80% of the RL Unplugged dataset. Table 1 records the total number of trajectories available in RL Unplugged for each task (80% of which are used as suboptimal data) and the number of expert trajectories used in our evaluation.

Task               # Total    # Dπ*
cartpole-swingup        40         2
cheetah-run            300         3
fish-swim              200         1
humanoid-run          3000        53
walker-stand           200         4
walker-walk            200         6

Table 1: Total number of trajectories from the RL Unplugged (Gulcehre et al., 2020) locomotion tasks used to train CRR (Wang et al., 2020), and the number of expert trajectories used to train TRAIL. The bottom 80% of # Total is used to learn action representations for TRAIL.

D ADDITIONAL EMPIRICAL RESULTS

D.1 ADDITIONAL BASELINES FOR RL UNPLUGGED

Figure 6: Average task rewards (over 4 seeds) of TRAIL EBM (Theorem 1), TRAIL linear (Theorem 3), and OPAL, SkiLD, and SPiRL trained on the bottom 80% (top) and bottom 5% (bottom) of the RL Unplugged datasets, followed by behavioral cloning in the latent action space. Baseline BC achieves low rewards due to the small expert sample size. Dotted lines denote the performance of CRR (Wang et al., 2020) trained on the full dataset with reward labels.

D.2 FRANKAKITCHEN RESULTS

Figure 7: Average rewards (over 4 seeds) of TRAIL EBM (Theorem 1), TRAIL linear (Theorem 3), and baseline methods pretrained on kitchen-mixed and kitchen-partial from D4RL to imitate kitchen-complete.
TRAIL linear without temporal abstraction performs slightly better than SkiLD and OPAL with temporal abstraction over 10 steps.

D.3 DISCRETE MAZE RESULTS

Figure 8: Average task rewards (over 4 seeds) of TRAIL EBM (Theorem 1) and vanilla BC (right) in a discrete four-room maze environment (left), where an agent is randomly placed in the maze and tries to reach the target T. TRAIL, which learns a discrete latent action space of size 4 from the original discrete action space of size 12 on 500 uniform-random trajectories of length 20, shows a clear benefit over vanilla BC on expert data.

We conduct an additional evaluation in an environment with tabular state and action spaces. As shown in Figure 8, an agent is randomly placed in a four-room environment, and the task is to navigate to the target T. The task reward is 1 at T and 0 elsewhere. There are 12 discrete actions, corresponding to rotating clockwise by 90, 180, 270, or 360 degrees, rotating counterclockwise by 90, 180, 270, or 360 degrees, moving forward by 1 or 2 grid cells, and moving backward by 1 or 2 grid cells (the action space is artificially enlarged as suggested by a reviewer). TRAIL is pretrained on 500 trajectories of length 20 with uniform action selection. The expert demonstrations always navigate to the target T from a random starting location. TRAIL's latent action dimension is set to 4. We see that TRAIL with a smaller latent action space offers clear benefits over vanilla BC.

E ABLATION STUDY

[Figure 9 panels: expert data = 10 trajectories; suboptimal D_off = antmaze-large-diverse, antmaze-medium-diverse, antmaze-medium-play, antmaze-large-play.]

Figure 9: Ablation study on action decoder finetuning, latent dimension size, and pretraining baseline BC on suboptimal data in the AntMaze environment. TRAIL with the default embedding dimension of 64 and action decoder finetuning corresponds to the second row. Other dimension sizes (256 and 512) lead to worse performance. Finetuning the action decoder on the expert data has some small benefit. Pretraining BC on suboptimal data before finetuning on expert data does not lead to significantly better performance.

[Figure 10 panels: expert data = ant-expert 25k or 10k transitions; suboptimal D_off = ant-random, ant-medium-replay, ant-medium.]

Figure 10: Ablation study on latent dimension size in the Ant environment. TRAIL is generally robust to the choice of latent action dimension (64, 256, 512) for the Ant task.

[Figure 11 panels: expert data = ant-expert 25k or 10k transitions; suboptimal D_off = ant-random, ant-medium-replay, ant-medium.]

Figure 11: Ablation study on finetuning the action decoder in the Ant environment. Finetuning the action decoder leads to a slight benefit.

F VISUALIZATION OF LATENT ACTIONS

Figure 12: PCA and t-SNE visualizations of the random, medium-replay, medium, and expert D4RL Ant datasets. Without action representation learning (left), the distinction between expert and suboptimal actions is not obvious. The latent actions learned by TRAIL (right), on the other hand, make the expert latent actions more visually separable from the suboptimal actions.
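For reference, a visualization in the spirit of Figure 12 can be produced with a few lines of scikit-learn. The sketch below is purely illustrative and not the paper's plotting code: the arrays stand in for raw actions (or TRAIL latent actions) from two datasets, and all names and sizes are ours.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-in data: replace with raw actions or TRAIL latent actions of real datasets.
expert_actions = np.random.randn(1000, 8)         # e.g., expert D4RL Ant actions
random_actions = np.random.randn(1000, 8) + 2.0   # e.g., random-policy actions

def embed_2d(x, method="pca"):
    """Project actions (or latent actions) to 2D for visualization."""
    if method == "pca":
        return PCA(n_components=2).fit_transform(x)
    return TSNE(n_components=2, init="pca", perplexity=30).fit_transform(x)

points = np.concatenate([expert_actions, random_actions], axis=0)
labels = np.array([0] * len(expert_actions) + [1] * len(random_actions))

for method in ("pca", "tsne"):
    xy = embed_2d(points, method)
    plt.figure()
    plt.scatter(xy[labels == 0, 0], xy[labels == 0, 1], s=2, label="expert")
    plt.scatter(xy[labels == 1, 0], xy[labels == 1, 1], s=2, label="random")
    plt.legend()
    plt.title(f"{method.upper()} of actions")
    plt.savefig(f"actions_{method}.png")
```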