# FAST IMITATION VIA BEHAVIOR FOUNDATION MODELS

Published as a conference paper at ICLR 2024

Matteo Pirotta*, Andrea Tirinzoni* & Ahmed Touati*
Fundamental AI Research at Meta
{pirotta,tirinzoni,atouati}@meta.com

Alessandro Lazaric & Yann Ollivier
Fundamental AI Research at Meta
{lazaric,yol}@meta.com

ABSTRACT

Imitation learning (IL) aims at producing agents that can imitate any behavior given a few expert demonstrations. Yet existing approaches require many demonstrations and/or running (online or offline) reinforcement learning (RL) algorithms for each new imitation task. Here we show that recent RL foundation models based on successor measures can imitate any expert behavior almost instantly with just a few demonstrations and no need for RL or fine-tuning, while accommodating several IL principles (behavioral cloning, feature matching, reward-based, and goal-based reductions). In our experiments, imitation via RL foundation models matches, and often surpasses, the performance of SOTA offline IL algorithms, and produces imitation policies from new demonstrations within seconds instead of hours.

1 INTRODUCTION

The objective of imitation learning (Schaal, 1996, IL) is to develop agents that can imitate any behavior from a few demonstrations. For instance, a cooking robot may learn how to prepare a new recipe from a single demonstration provided by an expert chef. A virtual character may learn to play different sports in a virtual environment from just a few videos of athletes performing the real sports. Imitation learning algorithms have achieved impressive results in challenging domains such as autonomous car driving (Bühler et al., 2020; Zhou et al., 2020; George et al., 2018), complex robotic tasks (Nair et al., 2017; Lioutikov et al., 2017; Zhang et al., 2018; Peng et al., 2020; Mandi et al., 2022; Pertsch et al., 2022; Haldar et al., 2023), navigation tasks (Hussein et al., 2018; Shou et al., 2020), cache management (Liu et al., 2020), and virtual character animation (Zhang et al., 2023; Peng et al., 2018; Wagener et al., 2023). Despite these achievements, existing approaches (see Sec. 2 for a detailed review) suffer from several limitations: for any new behavior to imitate, they often require several demonstrations, extensive interaction with the environment, running complex reinforcement learning routines, or knowing in advance the family of behaviors to be imitated.

In this paper, we tackle these limitations by leveraging behavior foundation models (BFMs)¹ to accurately solve imitation learning tasks from few demonstrations. To achieve this objective, we want our BFM to have the following properties: 1) When pre-training the BFM, no prior knowledge or demonstrations of the behaviors to be imitated are available, and only a dataset of unsupervised transitions/trajectories is provided; 2) The BFM should accurately solve any imitation task without any additional samples on top of the demonstrations, and without solving any complex reinforcement learning (RL) problem. This means that the computation needed to return the imitation policy (i.e., the inference time) should be minimal; 3) Since many different ways to formalize the imitation learning problem have been proposed (e.g., behavior cloning, apprenticeship learning, waypoint imitation), we also want a BFM that is compatible with different imitation learning settings. Our main contributions can be summarized as follows.
We leverage recent advances in BFMs based on successor measures, notably the forward-backward (FB) framework (Touati et al., 2023; Touati & Ollivier, 2021), to build BFMs that can be used to solve any imitation task and satisfy the three properties above. We focus on FB for its demonstrated performance at zero-shot reinforcement learning compared to other approaches (Touati et al., 2023). We refer to the set of resulting algorithms as FB-IL. We test FB-IL algorithms across environments from the DeepMind Control Suite (Tassa et al., 2018a) with multiple imitation tasks, using different IL principles and settings. We show that not only do FB-IL algorithms perform on par with or better than the corresponding state-of-the-art offline imitation learning baselines (Fig. 1), they also solve imitation tasks within a few seconds, which is three orders of magnitude faster than offline IL methods that need to run full RL routines to compute an imitation policy (Fig. 2). Furthermore, FB-IL methods perform better than other BFM methods, while being able to implement a much wider range of imitation principles.

*Joint first author, alphabetical order. Joint last author, alphabetical order.
¹The term Behavior emphasizes that the model aims at controlling an agent in a dynamical environment. This avoids confusion with widely used foundation models for images, videos, motions, and language. See Yang et al. (2023) for an extensive review of the latter for decision making.

Figure 1: Imitation score (ratio between the cumulative return of the algorithm and the cumulative return of the expert) averaged over domains, tasks, and repetitions, for a single expert demonstration. FB-IL methods reach SOTA performance with a fraction of the test-time computation needed by offline IL baselines, and they perform better than other pre-trained behavior foundation models, while implementing a wider set of IL principles. Notice that goal-based methods, such as GOAL-TD3 and GOALFB, work well in this experiment but have more restricted applicability (Sec. 4.4). (Bars grouped as Offline IL, BFM IL, and FB-IL (ours); y-axis: Imitation Score (%).)

2 RELATED WORK

Figure 2: Time for computing an imitation policy from a single demonstration for a subset of offline IL baselines and FB-IL methods, averaged over all environments and tasks. (Offline IL: BC 3h 14m, TD3-IL 7h 3m, DemoDICE 12h 59m. FB-IL: BCFB 1m, ERFB < 5s, BBELLFB 4m.)

While a thorough literature review and classification is beyond the scope of this work, we recall some of the most popular formulations of IL, each of which will be implemented via BFMs in Sect. 4. Behavioral Cloning (Pomerleau, 1988; Bain & Sammut, 1995, BC) aims at directly reproducing the expert policy by maximizing the likelihood of the expert actions under the trained imitation policy. While this is the simplest approach to IL, it needs access to expert actions, it may suffer from compounding errors caused by covariate shift (Ross et al., 2011), and it often requires many demonstrations to learn an accurate imitation policy. Variants of the formulation include regularized BC (e.g., Piot et al., 2014; Xu et al., 2022) and BC from observations only (e.g., Torabi et al., 2018). Another simple but effective IL principle is to design (e.g., Ciosek, 2022; Reddy et al., 2020) or infer a reward (e.g., Zolna et al., 2020; Luo et al., 2023; Kostrikov et al., 2020) from the demonstrations, then use it to train an RL agent.
For instance, SQIL (Reddy et al., 2020) assigns a reward of 1 to expert samples and 0 to non-expert samples obtained either from an offline dataset or online from the environment. Other methods (e.g., Ho & Ermon, 2016; Zolna et al., 2020; Kostrikov et al., 2020; Kim et al., 2022b;a; Ma et al., 2022) learn a discriminator to infer a reward separating expert from non-expert samples. OTR (Luo et al., 2023) uses optimal transport to compute a distance between expert and non-expert transitions that is used as a reward. Finally, other approaches frame IL as a goal-conditioned task (e.g., Ding et al., 2019; Lee et al., 2021), and leverage advances in goal-oriented RL by using goals extracted from expert trajectories.

Many imitation learning algorithms can be formally derived through the lens of either Apprenticeship Learning (AL) (e.g., Abbeel & Ng, 2004; Syed & Schapire, 2007; Ziebart et al., 2008; Syed et al., 2008; Ho et al., 2016; Ho & Ermon, 2016; Garg et al., 2021; Shani et al., 2022; Viano et al., 2022; Al-Hafez et al., 2023; Sikchi et al., 2023) or Distribution Matching (DM) (e.g., Kostrikov et al., 2020; Kim et al., 2022a; Zhu et al., 2020; Kim et al., 2022b; Ma et al., 2022; Yu et al., 2023). AL looks for a policy that matches or outperforms the expert for any possible reward function in a known class. If the reward is linearly representable w.r.t. a set of features, a sufficient condition is to find a policy whose successor features match those of the expert. DM approaches directly aim to minimize some f-divergence between the stationary distribution of the learned policy and that of the expert. The main limitation of these approaches is that they need to solve a new (online or offline) RL problem for each imitation task from scratch. This often makes their sample and computational complexity prohibitive.

Brandfonbrener et al. (2023) pre-train inverse dynamics representations from multitask demonstrations, which can be efficiently fine-tuned with BC to solve some IL tasks with reduced sample complexity. Masked trajectory models (Carroll et al., 2022; Liu et al., 2022; Wu et al., 2023) pre-train transformer-based architectures using random masking of trajectories and can perform waypoint-conditioned imitation if provided with sufficiently curated expert datasets at pre-training. In a similar setting, Reuss et al. (2023) use pre-trained goal-conditioned policies based on diffusion models for waypoint-conditioned imitation. Wagener et al. (2023) pre-train autoregressive architectures with multiple experts, but focus on trajectory completion rather than full imitation learning. One-shot IL (e.g., Duan et al., 2017; Finn et al., 2017; Yu et al., 2018; Zhao et al., 2022; Chang & Gupta, 2023) uses meta-learning to provide fast adaptation to a new demonstration at test time. This requires carefully curated datasets at train time with access to several expert demonstrations. Task-conditioned approaches (e.g., James et al., 2018; Kobayashi et al., 2019; Dasari & Gupta, 2020; Dance et al., 2021) can solve IL by (meta-)learning a model conditioned on reward, task or expert embeddings by accessing privileged information at train time.²

²A few papers (e.g., Peng et al., 2022; Juravsky et al., 2023) have used expert trajectories to speed up the learning of task-conditioned policies. Their objective is not IL but task generalization and/or compositionality.

3 PRELIMINARIES

Markov decision processes.
Let $\mathcal{M} = (S, A, P, \gamma)$ be a reward-free Markov decision process (MDP), where $S$ is the state space, $A$ is the action space, $P(\mathrm{d}s'|s,a)$ is the probability measure on $s' \in S$ defining the stochastic transition to the next state obtained by taking action $a$ in state $s$, and $0 < \gamma < 1$ is a discount factor (Sutton & Barto, 2018). Given $(s_0, a_0) \in S \times A$ and a policy $\pi\colon S \to \mathrm{Prob}(A)$, we denote by $\Pr(\cdot \mid s_0, a_0, \pi)$ and $\mathbb{E}[\cdot \mid s_0, a_0, \pi]$ the probabilities and expectations under state-action sequences $(s_t, a_t)_{t \ge 0}$ starting at $(s_0, a_0)$ and following policy $\pi$ in the environment, defined by sampling $s_t \sim P(\mathrm{d}s_t|s_{t-1}, a_{t-1})$ and $a_t \sim \pi(\mathrm{d}a_t|s_t)$. We define $P_\pi(\mathrm{d}s'|s) := \int_a P(\mathrm{d}s'|s,a)\,\pi(\mathrm{d}a|s)$, the state transition probabilities induced by $\pi$. Given a reward function $r\colon S \to \mathbb{R}$, the $Q$-function of $\pi$ for $r$ is $Q^\pi_r(s_0,a_0) := \sum_{t\ge 0} \gamma^t\, \mathbb{E}[r(s_{t+1})\mid s_0,a_0,\pi]$. The optimal $Q$-function is $Q^\star_r(s,a) := \sup_\pi Q^\pi_r(s,a)$. (For simplicity, we assume the reward only depends on $s_{t+1}$ instead of on the full triplet $(s_t,a_t,s_{t+1})$, but this is not essential.)

For each policy $\pi$ and each $s_0 \in S$, $a_0 \in A$, the successor measure $M^\pi(s_0,a_0,\cdot)$ over $S$ describes the cumulated discounted time spent at each state $s_{t+1}$ if starting at $(s_0,a_0)$ and following $\pi$, namely,

$M^\pi(s_0,a_0,X) := \sum_{t\ge 0} \gamma^t\, \Pr(s_{t+1} \in X \mid s_0,a_0,\pi), \quad \forall X \subset S.$  (1)

The forward-backward (FB) framework. The FB framework (Touati & Ollivier, 2021) learns a tractable representation of successor measures that provides approximate optimal policies for any reward. Let $\mathbb{R}^d$ be a representation space, and let $\rho$ be an arbitrary distribution over states, typically the distribution of states in the training set. FB learns two maps $F\colon S \times A \times \mathbb{R}^d \to \mathbb{R}^d$ and $B\colon S \to \mathbb{R}^d$, and a set of parametrized policies $(\pi_z)_{z \in \mathbb{R}^d}$, such that

$\begin{cases} M^{\pi_z}(s_0,a_0,X) \approx \int_X F(s_0,a_0,z)^\top B(s)\,\rho(\mathrm{d}s), & \forall s_0 \in S,\ a_0 \in A,\ X \subset S,\ z \in \mathbb{R}^d, \\ \pi_z(s) \approx \arg\max_a F(s,a,z)^\top z, & \forall (s,a) \in S \times A,\ z \in \mathbb{R}^d. \end{cases}$  (2)

We recall some properties of FB that will be leveraged to derive FB-based imitation methods. In the following, we use the short forms

$\mathrm{Cov}\,B := \mathbb{E}_{s\sim\rho}[B(s)B(s)^\top], \qquad M^\pi(s) := \mathbb{E}_{a\sim\pi(s)}\,M^\pi(s,a), \qquad F(s,z) := \mathbb{E}_{a\sim\pi_z(s)}\,F(s,a,z).$  (3)

Proposition 1 (Touati & Ollivier (2021)). Assume (2) holds exactly. Then the following holds. First, for any reward function $r\colon S \to \mathbb{R}$, let

$z_r = \mathbb{E}_{s\sim\rho}[r(s)B(s)].$  (4)

Then $\pi_{z_r}$ is optimal for $r$, i.e., $\pi_{z_r} \in \arg\max_\pi Q^\pi_r(s,a)$. Moreover, $Q^\star_r(s,a) = F(s,a,z_r)^\top z_r$. Finally, for each policy $\pi_z$ and each $(s_0,a_0) \in S \times A$, $F(s_0,a_0,z) \in \mathbb{R}^d$ are the successor features associated to the state embedding $\varphi(s) = (\mathrm{Cov}\,B)^{-1}B(s)$, i.e.,

$F(s_0,a_0,z) = \mathbb{E}\Big[\sum_{t\ge 0} \gamma^t \varphi(s_{t+1}) \,\Big|\, s_0,a_0,\pi_z\Big].$  (5)

In practice, the properties in Prop. 1 only hold approximately, as $F^\top B$ is a rank-$d$ model of the successor measures, $\pi_z$ may not be the exact greedy policy, and all of them are learned from samples. Moreover, (4) expresses $z_r$ as an expectation over states from the training distribution $\rho$. If sampling from a different distribution $\rho'$ at test time, an approximate formula is (Touati & Ollivier, 2021, B.5):

$z_r = \big(\mathbb{E}_{s\sim\rho}\,B(s)B(s)^\top\big)\big(\mathbb{E}_{s\sim\rho'}\,B(s)B(s)^\top\big)^{-1}\,\mathbb{E}_{s\sim\rho'}[r(s)B(s)].$  (6)
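To make Proposition 1 concrete, the sketch below shows how a pre-trained FB model can be turned into a policy for a new reward via (4). It is a minimal PyTorch-style illustration; the `fb.B`/`fb.act` interface and the function name are our own assumptions, not the paper's actual code.

```python
import torch

def zero_shot_policy_for_reward(fb, reward_fn, states_from_rho):
    """Instantiate a policy for a new reward with a pretrained FB model (Prop. 1, eq. (4)).

    Assumptions (hypothetical interface): fb.B(states) returns (n, d) backward
    embeddings, fb.act(state, z) samples from the policy pi_z; reward_fn maps a
    batch of states to their rewards; states_from_rho is a batch of states
    sampled from the training distribution rho.
    """
    with torch.no_grad():
        B = fb.B(states_from_rho)                     # (n, d) embeddings B(s)
        r = reward_fn(states_from_rho).unsqueeze(-1)  # (n, 1) rewards r(s)
        z_r = (r * B).mean(dim=0)                     # z_r = E_rho[r(s) B(s)], eq. (4)
    return lambda s: fb.act(s, z_r)                   # pi_{z_r}, near-optimal for r
```

The imitation methods of Sec. 4 all build on this one-line estimate of $z$, replacing the explicit reward with quantities computed from expert demonstrations.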
Pre-training an FB model can be done from a non-curated, offline dataset of trajectories or transitions, thus fulfilling property 1) above. Training is done via the measure-valued Bellman equation satisfied by successor measures. We refer to (Touati et al., 2023) for a full description of FB training. FB belongs to a wider class of methods based on successor features (e.g., Borsa et al. (2018)). Many of our imitation algorithms still make sense with other methods in this class, see App. A.6. We focus on FB as it has demonstrated better performance for zero-shot reinforcement learning within this family (Touati et al., 2023).

4 FORWARD-BACKWARD METHODS FOR IMITATION LEARNING

We consider the standard imitation learning problem, where we have access to a few expert trajectories $\tau = (s_0, s_1, \ldots, s_{\ell(\tau)})$, each of length $\ell(\tau)$, generated by some unknown expert policy $\pi_e$, and no reward function is available. In general, we do not need access to the expert actions, except for behavioral cloning. We denote by $\mathbb{E}_\tau$ the empirical average over the expert trajectories $\tau$ and by $\rho_e$ the empirical distribution of states visited by the expert trajectories.³

³We give each expert trajectory the same weight in $\rho_e$ independently of its length, so $\rho_e$ corresponds to first sampling a trajectory, then sampling a state in that trajectory.

We now describe several IL methods based on a pre-trained FB model. These run only from demonstration data, without solving any complex RL problem at test time (property 2)). Some methods just require a near-instantaneous forward pass through $B$ at test time, while others require a gradient descent over the small-dimensional parameter $z$. The latter is still much faster than solving a full RL problem, as shown in Fig. 2. At imitation time, we assume access to the functions $F$, $B$, the matrix $\mathrm{Cov}\,B$, and the policies $\pi_z$, but we do not reuse the unsupervised dataset used for FB training. To illustrate how FB can accommodate different IL principles, we present the methods in loose groups by the underlying IL principle.

4.1 BEHAVIORAL CLONING

In case actions are available in the expert trajectories, we can directly implement the behavioral cloning principle using the policies $(\pi_z)_z$ returned by the FB model. Each policy $\pi_z$ defines a probability distribution on state-action sequences given the initial state $s_0$, namely $\Pr(a_0,s_1,a_1,\ldots \mid s_0,\pi_z) = \prod_{t\ge 0} \pi_z(a_t|s_t)\,P(s_{t+1}|s_t,a_t)$. We look for the $\pi_z$ for which the expert trajectories are most likely, by minimizing the loss

$\mathcal{L}_{BC}(z) := -\mathbb{E}_\tau \ln \Pr(a_0,s_1,a_1,\ldots \mid s_0,\pi_z) = -\mathbb{E}_\tau \sum_t \ln \pi_z(a_t|s_t) + \mathrm{cst},$  (7)

where the constant absorbs the environment transition probabilities $P(\mathrm{d}s_{t+1}|s_t,a_t)$, which do not depend on $z$. Since we have access to $\pi_z(a|s)$, this can be optimized over $z$ given the expert trajectories, leading to the behavior cloning-FB (BCFB) approach.

Since the FB policies $(\pi_z)_z$ are trained to be approximately optimal for some reward, we expect FB (and BFMs in general) to provide a convenient bias to identify policies, instead of performing BC among the set of all (optimal or not) policies.
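As an illustration, a minimal sketch of the BCFB optimization over $z$ is given below, assuming the pre-trained model exposes a differentiable log-density $\log \pi_z(a|s)$ (e.g., through a Gaussian policy head); the `fb.log_prob` and `fb.act` names are hypothetical, not the paper's code.

```python
import torch

def bcfb(fb, expert_states, expert_actions, z0, steps=500, lr=1e-2):
    """BCFB: gradient descent on eq. (7), i.e. minimize -sum_t log pi_z(a_t|s_t) over z.

    Assumptions: fb.log_prob(s, a, z) returns log pi_z(a|s) for a batch of expert
    transitions; z0 is a warm-start latent (e.g., the ERFB estimate of Sec. 4.2).
    """
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = -fb.log_prob(expert_states, expert_actions, z).mean()  # eq. (7) up to cst
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()   # use fb.act(s, z) as the imitation policy
```

Note that only the $d$-dimensional vector $z$ is optimized; the networks of the pre-trained model stay frozen, which is what makes the procedure fast.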
4.2 REWARD-BASED IMITATION LEARNING

Existing reward-based IL methods require running RL algorithms to optimize an imitation policy based on a reward function specifically built to mimic the expert's behavior. Leveraging FB models, we can avoid solving an RL problem at test time, and directly obtain the imitation policy via a simple forward pass of the $B$ model. Indeed, as mentioned in Sec. 3, FB models can recover a (near-optimal) policy for any reward function $r$ by setting $z = \mathbb{E}_{s\sim\rho}[r(s)B(s)]$. Depending on the specific reward function, we obtain the following algorithms to estimate a $z$, after which we just use $\pi_z$.

First, consider the case of $r(\cdot) = \rho_e(\cdot)/\rho(\cdot)$ as in (Kim et al., 2022b;a; Ma et al., 2022). This yields

$z = \mathbb{E}_{s\sim\rho}[r(s)B(s)] = \mathbb{E}_{\rho_e}[B] = \mathbb{E}_\tau\Big[\tfrac{1}{\ell(\tau)}\sum_{t\ge 0} B(s_{t+1})\Big]$  (8)

which amounts to using the FB formula (4) for $z$ by just putting a reward at every state visited by the expert. We refer to this as empirical reward via FB (ERFB). Similarly, the reward $r(\cdot) = \rho_e(\cdot)/(\rho(\cdot) + \rho_e(\cdot))$ used in (Reddy et al., 2020; Zolna et al., 2020) leads to regularized empirical reward via FB (RERFB), derived from (6) in App. A.1:

$z = \mathrm{Cov}(B)\,\big(\mathrm{Cov}(B) + \mathbb{E}_{s\sim\rho_e}[B(s)B(s)^\top]\big)^{-1}\,\mathbb{E}_{\rho_e}[B].$  (9)

Even though these reward functions are defined via the distribution $\rho$ of the unsupervised dataset, this can be instantiated using only the pre-trained FB model, with no access to the unsupervised dataset, and no need to train a discriminator.

Note that (8) and (9) are independent of the order of states in the expert trajectory. This was not a problem in our setup, because the states themselves carry dynamical information (speed variables). If this proves limiting in some environment, it can easily be circumvented by training successor measures over visited transitions $(s_t, s_{t+1})$ rather than just states $s_{t+1}$, namely, training the FB model with $B(s_t, s_{t+1})$. A similar trick is applied, e.g., in (Zhu et al., 2020; Kim et al., 2022a).
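Both estimators reduce to a few matrix operations on backward embeddings of the expert states, with no learning involved. A minimal sketch, assuming access to a pre-trained $B$ network and to the matrix $\mathrm{Cov}\,B$ stored at pre-training time (the `fb.B` and `fb.cov_B` names are illustrative assumptions):

```python
import torch

@torch.no_grad()
def erfb(fb, expert_trajectories):
    """ERFB, eq. (8): average B over the states visited by each expert trajectory,
    then average over trajectories (equal weight per trajectory)."""
    per_traj = [fb.B(traj_states).mean(dim=0) for traj_states in expert_trajectories]
    return torch.stack(per_traj).mean(dim=0)

@torch.no_grad()
def rerfb(fb, expert_trajectories):
    """RERFB, eq. (9): regularized estimate for the reward rho_e / (rho + rho_e)."""
    B_e = torch.cat([fb.B(traj_states) for traj_states in expert_trajectories])  # (n, d)
    cov_e = B_e.t() @ B_e / B_e.shape[0]          # E_{rho_e}[B(s) B(s)^T]
    b_e = erfb(fb, expert_trajectories)           # E_{rho_e}[B(s)]
    # z = Cov(B) (Cov(B) + E_{rho_e}[B B^T])^{-1} E_{rho_e}[B]
    return fb.cov_B @ torch.linalg.solve(fb.cov_B + cov_e, b_e)
```

In both cases the imitation policy is simply $\pi_z$ with the returned $z$, which is why these methods run in seconds.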
4.3 DISTRIBUTION MATCHING AND FEATURE MATCHING

Apprenticeship learning and distribution matching are popular ways to provide a formal definition of IL as the problem of imitating the expert's visited states. We take a unified perspective on these two categories and derive several FB-IL methods starting from the saddle-point formulation of IL common to many AL and DM methods.

Let $\rho_0$ be an arbitrary initial distribution over $S$. For any reward $r$ and policy $\pi$, the expected discounted cumulated return of $\pi$ is equal to $\langle \mathbb{E}_{s_0\sim\rho_0} M^\pi(s_0),\, r\rangle$ by definition of $M^\pi$. Consequently, the AL criterion of minimizing the worst-case performance gap between $\pi$ and the expert can be seen as a measure of divergence between successor measures:

$\inf_\pi \sup_{r\in\mathcal{R}} \mathbb{E}_{s_0\sim\rho_0}\big[\langle M^{\pi_e}(s_0), r\rangle - \langle M^\pi(s_0), r\rangle\big] = \inf_\pi \big\| \mathbb{E}_{s_0\sim\rho_0} M^{\pi_e}(s_0) - \mathbb{E}_{s_0\sim\rho_0} M^\pi(s_0)\big\|_{\mathcal{R}^*}$  (10)

where $\mathcal{R}$ is any class of reward functions, and $\|\cdot\|_{\mathcal{R}^*}$ the resulting dual seminorm. Since FB directly models $M^\pi$, it can directly tackle (10) by finding the policy $\pi_z$ that minimizes the loss

$\mathcal{L}^{\mathcal{R}^*}(z) := \big\| \mathbb{E}_{s_0\sim\rho_0} M^{\pi_z}(s_0) - \mathbb{E}_{s_0\sim\rho_0} M^{\pi_e}(s_0)\big\|^2_{\mathcal{R}^*}.$  (11)

In practice, instead of (11), we consider the loss

$\mathcal{L}^{\mathcal{R}^*}(z) := \mathbb{E}_{s_0\sim\rho_e} \big\| M^{\pi_z}(s_0) - M^{\pi_e}(s_0)\big\|^2_{\mathcal{R}^*}.$  (12)

This is a stricter criterion than (11), as it requires the successor measure of the imitation policy and the expert policy to be similar for any $s$ observed along expert trajectories. This avoids undesirable effects from averaging successor measures over $\rho_0$, which may erase too much information about the policy (e.g., take $S = \{s_1, s_2\}$ where one policy swaps $s_1$ and $s_2$ and the other policy does nothing: on average over the starting point, the two policies have the same occupation measure). This increases robustness in our experiments (see App. E.6).

We can derive a wide range of algorithms depending on the choice of $\mathcal{R}$, how we estimate $M^{\pi_e}$ from expert demonstrations, and how we leverage FB models to estimate $M^{\pi_z}$. For instance, our algorithms can be extended to the KL divergence between the distributions (App. A.5).

Successor feature matching. A popular choice for $\mathcal{R}$ is to consider rewards linear in a given feature basis (Abbeel & Ng, 2004). Here we can leverage the FB property of estimating optimal policies for rewards in the linear span of $B$ (Touati et al., 2023). Taking $\mathcal{R}_B := \{r = w^\top B,\ w \in \mathbb{R}^d,\ \|w\|_2 \le 1\}$ in (10) yields the seminorm $\|m\|_{B^*} := \sup_{r\in\mathcal{R}_B} \int r(s)\, m(\mathrm{d}s) = \big\|\int B(s)\, m(\mathrm{d}s)\big\|_2$ and the loss

$\mathcal{L}^{B^*}(z) := \mathbb{E}_{s_0\sim\rho_e}\Big\| \int B(s)\, M^{\pi_z}(s_0, \mathrm{d}s) - \int B(s)\, M^{\pi_e}(s_0, \mathrm{d}s)\Big\|^2_2$  (13)

namely, the averaged features $B$ of states visited under $\pi_z$ and $\pi_e$ should match. This can be computed by using the FB model for $M^{\pi_z}$ and the expert trajectories for $M^{\pi_e}$, as follows.

Theorem 2. Assume that the FB successor feature property (5) holds. Then the loss (13) satisfies

$\mathcal{L}^{B^*}(z) = \mathbb{E}_{s_t\sim\rho_e}\,\mathbb{E}\Big[\big\| (\mathrm{Cov}\,B)\, F(s_t, z) - \sum_{k\ge 0}\gamma^k B(s_{t+k+1})\big\|^2\Big] + \mathrm{cst}.$  (14)

This can be estimated by sampling a segment $(s_t, s_{t+1}, \ldots)$ starting at a random time $t$ on an expert trajectory. Then we can perform gradient descent over $z$. We refer to this method as FMFB.

Distribution matching. If $\mathcal{R}$ is restricted to the span of some features, we only get a seminorm on successor measures (any information not in the features is lost). Instead, one can take $\mathcal{R} = L^2(\rho)$, which provides a full norm $\|M^{\pi_e} - M^\pi\|_{L^2(\rho)^*}$ on visited state distributions: this matches state distributions instead of features. This can be instantiated with FB (App. A.3), but the final loss is very similar to (67), as FB neglects features outside of $B$ anyway. We refer to this method as DMFB.

Bellman residual minimization for distribution matching. An alternative approach is to identify the best imitation policy or its stationary distribution via the Bellman equations they satisfy. This is to distribution matching what TD is to direct Monte Carlo estimation of $Q$-functions. The successor measure of a policy $\pi$ satisfies the measure-valued Bellman equation $M^\pi(s_t, \mathrm{d}s') = P_\pi(s_t,\mathrm{d}s') + \gamma \int_{s_{t+1}} P_\pi(s_t, \mathrm{d}s_{t+1})\, M^\pi(s_{t+1}, \mathrm{d}s')$, or more compactly $M^\pi = P_\pi + \gamma P_\pi M^\pi$ (Blier et al., 2021). So the successor measure of the expert policy satisfies $M^{\pi_e} = P_{\pi_e} + \gamma P_{\pi_e} M^{\pi_e}$. Therefore, if we want to find a policy $\pi_z$ that behaves like $\pi_e$, $M^{\pi_z}$ should approximately satisfy the Bellman equation for $P_{\pi_e}$, namely, $M^{\pi_z} \approx P_{\pi_e} + \gamma P_{\pi_e} M^{\pi_z}$. Thus, we can look for a policy $\pi_z$ whose Bellman gaps for $\pi_e$ are small. This leads to the loss

$\mathcal{L}^{\mathcal{R}^*}_{\mathrm{Bell}}(z) := \big\| M^{\pi_z} - P_{\pi_e} - \gamma P_{\pi_e}\bar{M}^{\pi_z}\big\|^2_{\mathcal{R}^*}$  (15)

where the bar above $M$ on the right denotes a stop-grad operator, as usual for deep $Q$-learning.

The method we call BBELLFB uses the seminorm $\|\cdot\|_{B^*}$ in (15). This amounts to minimizing Bellman gaps of $Q$-functions for all rewards linearly spanned by $B$. With the FB model, the loss (15) with this norm takes a tractable form allowing for gradient descent over $z$ (App., Thm. 4):

$\mathcal{L}^{B^*}_{\mathrm{Bell}}(z) = \mathbb{E}_{s_t\sim\rho_e,\ s_{t+1}\sim P_{\pi_e}(\cdot|s_t)}\Big[ -2 F(s_t,z)^\top (\mathrm{Cov}\,B)\, B(s_{t+1}) + \big(F(s_t,z) - \gamma \bar{F}(s_{t+1},z)\big)^\top (\mathrm{Cov}\,B)^2 \big(F(s_t,z) - \gamma \bar{F}(s_{t+1},z)\big)\Big] + \mathrm{cst}.$  (16)
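Both (14) and (16) are plain functions of the low-dimensional vector $z$, so imitation again reduces to a short gradient descent. The sketch below estimates the feature-matching loss (14) on a single demonstration; the Bellman-gap loss (16) is optimized the same way by swapping the loss computation. The `fb.F_pi`, `fb.B`, `fb.cov_B` interface is an assumption on our part, and truncating the discounted sum at the end of the trajectory is an approximation.

```python
import torch

def fmfb_loss(fb, expert_states, z, gamma=0.98, n_starts=32):
    """Monte-Carlo estimate of the feature-matching loss (14) on one expert trajectory.

    expert_states: (T, state_dim) tensor of states s_0..s_{T-1} from one demonstration.
    fb.F_pi(s, z): expected forward embedding F(s, z) under pi_z (hypothetical interface).
    """
    T = expert_states.shape[0]
    B_all = fb.B(expert_states).detach()                     # (T, d) embeddings B(s_t)
    losses = []
    for t in torch.randint(0, T - 1, (n_starts,)).tolist():  # random start times
        k = torch.arange(T - 1 - t, dtype=torch.float32)
        disc = (gamma ** k).unsqueeze(-1)                    # gamma^k, k = 0, 1, ...
        target = (disc * B_all[t + 1:]).sum(dim=0)           # sum_k gamma^k B(s_{t+k+1})
        pred = fb.cov_B @ fb.F_pi(expert_states[t], z)       # (Cov B) F(s_t, z)
        losses.append(((pred - target) ** 2).sum())
    return torch.stack(losses).mean()

# Gradient descent over z, warm-started at the ERFB estimate z0 (as in Sec. 5):
# z = z0.clone().requires_grad_(True); opt = torch.optim.Adam([z], lr=1e-2)
# for _ in range(500):
#     loss = fmfb_loss(fb, expert_states, z)
#     opt.zero_grad(); loss.backward(); opt.step()
```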
The norm from $\mathcal{R} = L^2(\rho)$ in (15) yields a loss similar to the one used during FB training (indeed, FB is trained via a similar Bellman equation with $\pi_z$ instead of $\pi_e$). The final loss only differs from (16) by $\mathrm{Cov}\,B$ factors, so we report it in App. A.4 (Thm. 5). We call this method FBLOSSFB.

Relationship between IL principles: loss bounds. Any method that provides a policy close to $\pi_e$ will provide state distributions close to that of $\pi_e$ as a result, so we expect a relationship between the losses from different approaches. Indeed, the Bellman gap loss bounds the distribution matching loss (12), and the BC loss bounds the KL version of (12). This is formalized in Thms. 7 and 8 (App. A.7).

4.4 IMITATING NON-STATIONARY BEHAVIORS: GOAL-BASED IMITATION

While most IL methods are designed to imitate stationary behaviors, we can leverage FB models to imitate non-stationary behaviors. Consider the case where only a single expert demonstration $\tau$ is available. At each time step $t$, we can use the FB method to reach a state $s_{t+k}$ slightly ahead of $s_t$ in the expert trajectory, where $k \ge 0$ is a small, fixed integer. Namely, we place a single reward at $s_{t+k}$, use the FB formula (4) to obtain the $z_t$ corresponding to this reward, $z_t := B(s_{t+k})$, and use the policy $\pi_{z_t}$. We call this method GOALFB. This is related to settings such as tracking (e.g., Wagener et al., 2023; Winkler et al., 2022), waypoint imitation (e.g., Carroll et al., 2022; Chang & Gupta, 2023; Shi et al., 2023), or goal-based IL (e.g., Liu et al., 2022; Reuss et al., 2023). GOALFB leverages the possibility to change the reward in real time with FB.

A clear advantage is its ability to reproduce behaviors that do not correspond to optimizing a (Markovian) reward function, such as cycling over some states, or non-stationary behaviors. GOALFB may have advantages even in the stationary case, as it may mitigate approximation errors from the policies or the representation of $M^\pi$: by selecting a time-varying $z$, the policy can adapt over time and avoid deviations in the execution of long behaviors through a stationary policy. However, goal-based IL is limited to copying one single expert trajectory, by reproducing the same state sequence. The behavior cannot necessarily be extended past the end of the expert trajectory, and no reusable policy is extracted.
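A minimal sketch of a GOALFB rollout is given below, assuming a gym-style `reset`/`step` environment interface and the same hypothetical FB interface as in the previous sketches:

```python
import torch

@torch.no_grad()
def goalfb_rollout(env, fb, expert_states, k=10):
    """Track a single expert demonstration by re-targeting z at every step (Sec. 4.4).

    At time t the reward is a point mass on the state k steps ahead in the
    demonstration, so z_t = B(s^e_{t+k}) by eq. (4), and we act with pi_{z_t}.
    """
    T = expert_states.shape[0]
    s = env.reset()
    for t in range(T):
        goal = expert_states[min(t + k, T - 1)]    # clamp near the end of the demo
        z_t = fb.B(goal.unsqueeze(0)).squeeze(0)   # z_t = B(s^e_{t+k})
        s, _, done, _ = env.step(fb.act(s, z_t))
        if done:
            break
```

Because $z_t$ changes at every step, no single stationary policy is produced; this is exactly the limitation discussed above.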
5 EXPERIMENTS

In this section, we evaluate FB-IL against the objectives stated in the introduction:

Property 2. We verify whether an FB model pre-trained on one specific environment is able to imitate a wide range of tasks with access to only a few demonstrations and without solving any RL problem.

Property 3. We assess the generality of FB-IL by considering a variety of imitation learning principles and settings.

Protocol and baselines. We evaluate IL methods on 21 tasks in 4 domains (Maze, Walker, Cheetah, Quadruped) from (Touati et al., 2023). We use the standard reward-based evaluation protocol for IL. For each task, we train expert policies using TD3 (Fujimoto et al., 2018) on a task-specific reward function (Tassa et al., 2018a). We use the expert policies to generate 200 trajectories for each task to be used for IL. In our first series of experiments, the IL algorithms are provided with a single expert demonstration (see App. E.4 for the effect of additional demonstrations). Each experiment (i.e., each algorithm-task pair) was repeated with 20 random seeds. We report the cumulated reward achieved by the IL policy, computed using the ground-truth task-specific reward and averaged over 1000 episodes starting from the same initial distribution used to collect the expert demonstrations. For each environment, we train an FB model using only unsupervised samples generated using RND (Burda et al., 2019). We repeat the FB pre-training 10 times, and report performance averaged over the resulting models (variance is reported in App. E.3). For FB-IL methods that require a gradient descent over $z$ (BCFB, BBELLFB, and FMFB), we use a warm start with $z_0$ set to the ERFB estimate (8), which can be computed with forward passes on $B$ only. GOALFB is run with a lookahead window $k = 10$.

First (Section 5.1), we compare FB-IL to standard offline IL algorithms trained on each specific imitation task, using the same unsupervised and expert samples as FB-IL. For behavioral cloning approaches, we use vanilla BC. For reward-based IL, we include SQIL (Reddy et al., 2020) (which is originally online but can easily be adapted offline; SQIL balances sampling in the update step and runs SAC); TD3-IL (where we merge all samples in the replay buffer and use TD3 instead of SAC); ORIL (Zolna et al., 2020) from state-action demonstrations and from states only; and OTR (Luo et al., 2023) using TD3 as the offline RL subroutine. For AL and DM IL, we use DEMODICE (Kim et al., 2022b) and IQLEARN (Garg et al., 2021). See App. B for details.

Next (Section 5.2), we also include alternative behavior foundation models beyond FB, pre-trained for each environment on the same unsupervised samples as FB. GOAL-TD3 pre-trains goal-conditioned policies $\pi(a|s,g)$ on the unsupervised dataset using TD3 with Hindsight Experience Replay (Andrychowicz et al., 2017). At test time, it can implement goal-based IL, i.e., at each time step $t$ it selects the policy $\pi(a_t|s_t, s^e_{t+k})$, where the goal $s^e_{t+k}$ corresponds to a state $k$ steps ahead in the expert trajectory. (Despite its simplicity, we did not find this algorithm proposed in the literature.) Next, GOAL-GPT (Liu et al., 2022) pre-trains a goal-conditioned, transformer-based auto-regressive policy $\pi(a_t \mid (s_t,g),(s_{t-1},g),\ldots,(s_{t-h+1},g))$ with $g = s_{t+k}$, which predicts the next action based on the last $h$ states and the state $k$ steps in the future used as the goal of the policy. MASKDP (Liu et al., 2022) uses a bidirectional transformer to reconstruct trajectories with randomly masked states and actions. Both models can be used to perform goal-based IL. We adapt DIAYN (Eysenbach et al., 2018) to pre-train a set of policies $(\pi_z)$ with $z \in \mathbb{R}^d$ and a skill decoder $\varphi\colon S \to \mathbb{R}^d$ predicting which policy is more likely to reach a specific state. (This requires online interaction during pre-training.) It can be used to implement behavioral cloning as in (7), a method similar to ERFB, and goal-based IL by selecting $z_t = \varphi(s_{t+k})$. See App. C for extra details.

Figure 3: Imitation score (i.e., ratio between the cumulative return of the algorithm and that of the expert) for each domain, averaged over tasks and repetitions, for a single expert demonstration. (One panel per domain, e.g., DMC Cheetah and DMC Quadruped; y-axis: Imitation Score (%); methods grouped by IL principle: BC, Reward, FM/DM, Goal-based; legend: Offline IL, BFM IL, FB-IL (ours).)

Additional results. App. E.1 contains detailed results with additional baselines and FB-IL variants. App. E.2 ablates over our warm-start strategy for optimization-based FB-IL methods.
App. E.4 studies the influence of the number of expert trajectories, with FB methods being the least sensitive and BC methods the most. App. E.5 tests the methods under a shift between the distribution of the initial states for imitation at test time and that of the expert trajectories: the overall picture is largely unchanged from Fig. 1, although the slight lead of goal-based methods disappears. App. E.3 shows that performance is not very sensitive to variations of the pretrained FB foundation model (estimated across 10 random seeds for FB training), thus confirming the robustness of the overall approach.

5.1 COMPARISON TO OFFLINE IL BASELINES

Fig. 3 compares the performance of FB-IL methods and offline baselines grouped by IL principle. For ease of presentation, we report the performance averaged over the tasks of each environment. Overall, FB-IL methods perform on par with or better than each of the baselines implementing the same IL principle, consistently across domains and IL principles. In addition, FB-IL is able to recover the imitation policy in a few seconds, almost three orders of magnitude faster than the baselines, which need to be re-trained for each expert demonstration (Tab. 2). This confirms that FB models are effective BFMs for solving a wide range of imitation learning tasks with few demonstrations and minimal compute.

As expected, BC baselines perform poorly with only one expert trajectory. BCFB has much stronger performance, confirming that the set $(\pi_z)$ contains good imitation policies for a large majority of tasks and that they can be recovered by behavioral cloning from even a single demonstration. Reward-based FB-IL methods ERFB (8) and RERFB (9) achieve consistent performance across all environments and perform on par with or even better than the baselines sharing the same implicit reward function. This shows that FB models are effective at recovering near-optimal policies from rewards. On the other hand, reward-based offline IL baselines display significant variance in their performance across environments (e.g., ORIL from state-action demonstrations completely fails in maze tasks).

The baselines derived from distribution matching and apprenticeship learning perform poorly in almost all the domains and tasks. This may be because they implement conservative offline RL algorithms that are strongly biased towards the unsupervised data and fail at imitating the expert demonstrations.⁴ On the other hand, FB-IL variants achieve good results (about 4 times the performance of DICE) in all domains except maze, where they lag behind other FB-IL methods. In general, this shows that FB models are effective in implementing different IL principles even when offline baselines struggle.

⁴In App. G we confirm this intuition by showing that the baselines in this category achieve much better performance when the unsupervised dataset contains expert samples (e.g., D4RL data). Unfortunately, this requires curating the dataset for each expert and it would not allow solving multiple tasks in the same environment.

5.2 COMPARISON TO OTHER BFM METHODS

The BFM methods reported in Fig. 3 display a trade-off between generality and performance.
DIAYN pre-trained policies and discriminator can be used to implement a wide range of imitation learning principles (except for distribution matching), but its performance does not match the corresponding top offline baselines and it is worse than FB-IL across all domains. Methods based on masked-trajectory models can implement a goal-based reduction of IL and work better than DIAYN. Finally, GOAL-TD3 performs best among the other BFM methods and is a close second to GOALFB. Nonetheless, as discussed in Sect. 4.4, all goal-based reduction methods are more limited in their applicability, since they can only use a single expert demonstration, cannot generalize beyond the expert trajectory, and do not produce a policy reusable over the whole space.

5.3 WAYPOINT IMITATION LEARNING

Figure 4: Imitating a sequence of yoga poses in the walker domain. (left) Example of a sequence. (right) Total reward averaged over 1000 randomly-generated sequences and 10 pretrained BFMs, plus/minus 95% confidence interval: GOALFB 49.53 ± 4.06, GOAL-TD3 50.76 ± 6.52, GOAL-GPT 17.58 ± 0.82. The reward of a trajectory is computed by summing the (normalized) rewards of the poses expected at each time step. See App. F for additional details.

We consider non-realizable and non-stationary experts by generating demonstrations as the concatenation of yoga poses from (Mendonca et al., 2021), implicitly assuming that the expert policy can instantaneously switch between any two poses. We keep each pose fixed for 100 steps and generate trajectories of 1000 steps. In this case, no imitation policy can perfectly reproduce the sequence of poses, and only goal-based IL algorithms can be applied, since all other IL methods assume stationary expert policies. We evaluate the same pre-trained models used in the previous section. Fig. 4 shows that GOALFB matches the performance of GOAL-TD3 and outperforms GOAL-GPT. This confirms that even in this specific case, FB-IL is competitive with other BFM models that are specialized (and limited) to goal-reaching tasks, whereas the same pre-trained FB model can be used to implement a wide range of imitation learning principles. GOAL-GPT's poor performance may be because the algorithm tries to reproduce trajectories in the training dataset rather than learning the optimal way to reach goals. We refer to App. F for a qualitative evaluation of the imitated behaviors.

6 CONCLUSION

Behavior foundation models offer a new alternative for imitation learning, reducing by orders of magnitude the time needed to produce an imitation policy from new task demonstrations. This comes at the cost of pretraining an environment-specific (but task-agnostic) foundation model. BFMs can be used concurrently with a number of imitation learning design principles, and reach state-of-the-art performance when evaluated on the ground-truth task reward. One theoretical limitation is that, due to imperfections in the underlying BFM, one may not recover optimal performance even with infinite expert demonstrations. This can be mitigated by increasing the BFM capacity, by improving the training data, or by fine-tuning the BFM at test time, which we leave to future work.

REFERENCES

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, volume 69 of ACM International Conference Proceeding Series. ACM, 2004.

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C. Courville, and Marc G. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In NeurIPS, pp. 29304–29320, 2021.
Firas Al-Hafez, Davide Tateo, Oleg Arenz, Guoping Zhao, and Jan Peters. LS-IQ: implicit reward regularization for inverse reinforcement learning. In ICLR. Open Review.net, 2023. Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob Mc Grew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NIPS, pp. 5048 5058, 2017. Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. Co RR, abs/1607.06450, 2016. Michael Bain and Claude Sammut. A framework for behavioural cloning. In Machine Intelligence 15, pp. 103 129. Oxford University Press, 1995. LΓ©onard Blier, Corentin Tallec, and Yann Ollivier. Learning successor states and goal-dependent values: A mathematical viewpoint. ar Xiv preprint ar Xiv:2101.07123, 2021. Diana Borsa, AndrΓ© Barreto, John Quan, Daniel Mankowitz, RΓ©mi Munos, Hado van Hasselt, David Silver, and Tom Schaul. Universal successor features approximators. ar Xiv preprint ar Xiv:1812.07626, 2018. David Brandfonbrener, Ofir Nachum, and Joan Bruna. Inverse dynamics pretraining learns good representations for multitask imitation, 2023. Andreas BΓΌhler, Adrien Gaidon, Andrei Cramariuc, Rares Ambrus, Guy Rosman, and Wolfram Burgard. Driving through ghosts: Behavioral cloning with false positives. pp. 5431 5437, 10 2020. doi: 10.1109/IROS45743.2020.9340639. Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. URL https://openreview. net/forum?id=H1l JJn R5Ym. Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew J. Hausknecht, Anca D. Dragan, and Sam Devlin. Uni[mask]: Unified inference in sequential decision problems. In Neur IPS, 2022. Matthew Chang and Saurabh Gupta. One-shot visual imitation via attributed waypoints and demonstration augmentation. In ICRA, pp. 5055 5062. IEEE, 2023. Jongwook Choi, Archit Sharma, Honglak Lee, Sergey Levine, and Shixiang Shane Gu. Variational empowerment as representation learning for goal-conditioned reinforcement learning. In International Conference on Machine Learning, pp. 1953 1963. PMLR, 2021. Kamil Ciosek. Imitation learning by reinforcement learning. In ICLR. Open Review.net, 2022. Christopher R. Dance, Julien Perez, and ThΓ©o Cachet. Demonstration-conditioned reinforcement learning for few-shot imitation. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 2376 2387. PMLR, 2021. Sudeep Dasari and Abhinav Gupta. Transformers for one-shot visual imitation. In Co RL, volume 155 of Proceedings of Machine Learning Research, pp. 2071 2084. PMLR, 2020. Yiming Ding, Carlos Florensa, Pieter Abbeel, and Mariano Phielipp. Goal-conditioned imitation learning. In Neur IPS, pp. 15298 15309, 2019. Published as a conference paper at ICLR 2024 Yan Duan, Marcin Andrychowicz, Bradly C. Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In NIPS, pp. 1087 1098, 2017. Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. Co RR, abs/1802.06070, 2018. URL http: //arxiv.org/abs/1802.06070. Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. 
In Co RL, volume 78 of Proceedings of Machine Learning Research, pp. 357 368. PMLR, 2017. Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: datasets for deep data-driven reinforcement learning. Co RR, abs/2004.07219, 2020. Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In ICML, volume 80 of Proceedings of Machine Learning Research, pp. 1582 1591. PMLR, 2018. Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, and Stefano Ermon. Iq-learn: Inverse soft-q learning for imitation. In Neur IPS, pp. 4028 4039, 2021. Laurent George, Thibault Buhet, Γ‰milie Wirbel, Gaetan Le-Gall, and Xavier Perrotton. Imitation learning for end to end vehicle longitudinal control with forward camera. Co RR, abs/1812.05841, 2018. URL http://arxiv.org/abs/1812.05841. Siddhant Haldar, Jyothish Pari, Anant Rai, and Lerrel Pinto. Teach a robot to fish: Versatile imitation from one minute of demonstrations, 2023. Steven Hansen, Will Dabney, Andre Barreto, Tom Van de Wiele, David Warde-Farley, and Volodymyr Mnih. Fast task inference with variational intrinsic successor features. 2020. Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, pp. 4565 4573, 2016. Jonathan Ho, Jayesh K. Gupta, and Stefano Ermon. Model-free imitation learning with policy optimization. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pp. 2760 2769. JMLR.org, 2016. Ahmed Hussein, Eyad Elyan, Mohamed Medhat Gaber, and Chrisina Jayne. Deep imitation learning for 3d navigation tasks. Neural Comput. Appl., 29(7):389 404, apr 2018. ISSN 0941-0643. doi: 10. 1007/s00521-017-3241-z. URL https://doi.org/10.1007/s00521-017-3241-z. Stephen James, Michael Bloesch, and Andrew J. Davison. Task-embedded control networks for few-shot imitation learning. In Co RL, volume 87 of Proceedings of Machine Learning Research, pp. 783 795. PMLR, 2018. Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. PADL: language-directed physicsbased character control. Co RR, abs/2301.13868, 2023. Geon-Hyeong Kim, Jongmin Lee, Youngsoo Jang, Hongseok Yang, and Kee-Eung Kim. Lobsdice: Offline learning from observation via stationary distribution correction estimation. In Neur IPS, 2022a. Geon-Hyeong Kim, Seokin Seo, Jongmin Lee, Wonseok Jeon, Hyeong Joo Hwang, Hongseok Yang, and Kee-Eung Kim. Demodice: Offline imitation learning with supplementary imperfect demonstrations. In ICLR. Open Review.net, 2022b. Kyoichiro Kobayashi, Takato Horii, Ryo Iwaki, Yukie Nagai, and Minoru Asada. Situated GAIL: multitask imitation using task-conditioned adversarial inverse reinforcement learning. Co RR, abs/1911.00238, 2019. Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. In ICLR. Open Review.net, 2020. Published as a conference paper at ICLR 2024 Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, and Pieter Abbeel. Cic: Contrastive intrinsic control for unsupervised skill discovery. ar Xiv preprint ar Xiv:2202.00161, 2022. Youngwoon Lee, Andrew Szot, Shao-Hua Sun, and Joseph J. Lim. Generalizable imitation learning from observation via inferring goal proximity. In Neur IPS, pp. 16118 16130, 2021. Rudolf Lioutikov, Gerhard Neumann, Guilherme Maeda, and Jan Peters. Learning movement primitive libraries through probabilistic segmentation. The International Journal of Robotics Research, 36(8):879 894, 2017. doi: 10.1177/0278364917713116. 
URL https://doi.org/ 10.1177/0278364917713116. Evan Zheran Liu, Milad Hashemi, Kevin Swersky, Parthasarathy Ranganathan, and Junwhan Ahn. An imitation learning approach for cache replacement. In Proceedings of the 37th International Conference on Machine Learning, ICML 20. JMLR.org, 2020. Fangchen Liu, Hao Liu, Aditya Grover, and Pieter Abbeel. Masked autoencoding for scalable and generalizable decision making. In Neur IPS, 2022. Hao Liu and Pieter Abbeel. Aps: Active pretraining with successor features. In International Conference on Machine Learning, pp. 6736 6747. PMLR, 2021. Yicheng Luo, Zhengyao Jiang, Samuel Cohen, Edward Grefenstette, and Marc Peter Deisenroth. Optimal transport for offline imitation learning. In ICLR. Open Review.net, 2023. Yecheng Jason Ma, Andrew Shen, Dinesh Jayaraman, and Osbert Bastani. Versatile offline imitation from observations and examples via regularized state-occupancy matching. In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 14639 14663. PMLR, 2022. Zhao Mandi, Homanga Bharadhwaj, Vincent Moens, Shuran Song, Aravind Rajeswaran, and Vikash Kumar. Cacti: A framework for scalable multi-task multi-scene visual imitation learning, 12 2022. Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, and Deepak Pathak. Discovering and achieving goals via world models. In Neur IPS, pp. 24379 24391, 2021. Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2146 2153. IEEE Press, 2017. doi: 10.1109/ICRA.2017.7989247. URL https://doi.org/10.1109/ICRA. 2017.7989247. Xue Peng, Erwin Coumans, Tingnan Zhang, Tsang-Wei Lee, Jie Tan, and Sergey Levine. Learning agile robotic locomotion skills by imitating animals. 07 2020. doi: 10.15607/RSS.2020.XVI.064. Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Exampleguided deep reinforcement learning of physics-based character skills. ACM Trans. Graph., 37(4), jul 2018. ISSN 0730-0301. doi: 10.1145/3197517.3201311. URL https://doi.org/10. 1145/3197517.3201311. Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. ASE: large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Trans. Graph., 41 (4):94:1 94:17, 2022. Karl Pertsch, Ruta Desai, Vikash Kumar, Franziska Meier, Joseph J. Lim, Dhruv Batra, and Akshara Rai. Cross-domain transfer via semantic skill imitation, 2022. Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted and reward-regularized classification for apprenticeship learning. In AAMAS, pp. 1249 1256. IFAAMAS/ACM, 2014. Dean Pomerleau. ALVINN: an autonomous land vehicle in a neural network. In NIPS, pp. 305 313. Morgan Kaufmann, 1988. Published as a conference paper at ICLR 2024 Siddharth Reddy, Anca D. Dragan, and Sergey Levine. SQIL: imitation learning via reinforcement learning with sparse rewards. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. URL https: //openreview.net/forum?id=S1x Kd24tw B. Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. In Robotics: Science and Systems, 2023. StΓ©phane Ross, Geoffrey J. Gordon, and Drew Bagnell. 
A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 15 of JMLR Proceedings, pp. 627 635. JMLR.org, 2011. Stefan Schaal. Learning from demonstration. In NIPS, pp. 1040 1046. MIT Press, 1996. Lior Shani, Tom Zahavy, and Shie Mannor. Online apprenticeship learning. In AAAI, pp. 8240 8248. AAAI Press, 2022. Lucy Xiaoyang Shi, Archit Sharma, Tony Z. Zhao, and Chelsea Finn. Waypoint-based imitation learning for robotic manipulation. Co RR, abs/2307.14326, 2023. Zhenyu Shou, Xuan Di, Jieping Ye, Hongtu Zhu, Hua Zhang, and Robert Hampshire. Optimal passenger-seeking policies on e-hailing platforms using markov decision process and imitation learning. Transportation Research Part C: Emerging Technologies, 111:91 113, 2020. ISSN 0968090X. doi: https://doi.org/10.1016/j.trc.2019.12.005. URL https://www.sciencedirect. com/science/article/pii/S0968090X18316577. Harshit Sikchi, Qinqing Zheng, Amy Zhang, and Scott Niekum. Dual rl: Unification and new methods for reinforcement and imitation learning, 2023. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd. html. Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In NIPS, pp. 1449 1456. Curran Associates, Inc., 2007. Umar Syed, Michael H. Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In ICML, volume 307 of ACM International Conference Proceeding Series, pp. 1032 1039. ACM, 2008. Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. Deepmind control suite. Co RR, abs/1801.00690, 2018a. URL http://arxiv.org/abs/ 1801.00690. Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. Deepmind control suite. Co RR, abs/1801.00690, 2018b. Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In JΓ©rΓ΄me Lang (ed.), Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp. 4950 4957. ijcai.org, 2018. doi: 10.24963/ ijcai.2018/687. URL https://doi.org/10.24963/ijcai.2018/687. Ahmed Touati and Yann Ollivier. Learning one representation to optimize all rewards. In Neur IPS, pp. 13 23, 2021. Ahmed Touati, JΓ©rΓ©my Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In ICLR. Open Review.net, 2023. Luca Viano, Angeliki Kamoutsi, Gergely Neu, Igor Krawczuk, and Volkan Cevher. Proximal point imitation learning. In Neur IPS, 2022. Published as a conference paper at ICLR 2024 Nolan Wagener, Andrey Kolobov, Felipe Vieira Frujeri, Ricky Loynd, Ching-An Cheng, and Matthew Hausknecht. Mocapact: A multi-task dataset for simulated humanoid control, 2023. Alexander W. Winkler, Jungdam Won, and Yuting Ye. Questsim: Human motion tracking from sparse sensors with simulated avatars. In SIGGRAPH Asia, pp. 2:1 2:8. ACM, 2022. Philipp Wu, Arjun Majumdar, Kevin Stone, Yixin Lin, Igor Mordatch, Pieter Abbeel, and Aravind Rajeswaran. Masked trajectory models for prediction, representation, and control, 2023. Haoran Xu, Xianyuan Zhan, Honglei Yin, and Huiling Qin. Discriminator-weighted offline imitation learning from suboptimal demonstrations. 
In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 24725–24742. PMLR, 2022.

Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, and Dale Schuurmans. Foundation models for decision making: Problems, methods, and opportunities. arXiv preprint arXiv:2303.04129, 2023.

Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, and Lerrel Pinto. Don't change the algorithm, change the data: Exploratory data for offline reinforcement learning. arXiv preprint arXiv:2201.13425, 2022.

Lantao Yu, Tianhe Yu, Jiaming Song, Willie Neiswanger, and Stefano Ermon. Offline imitation learning with suboptimal demonstrations via relaxed distribution matching. In AAAI, pp. 11016–11024. AAAI Press, 2023.

Tianhe Yu, Chelsea Finn, Sudeep Dasari, Annie Xie, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. In Robotics: Science and Systems, 2018.

Haotian Zhang, Ye Yuan, Viktor Makoviychuk, Yunrong Guo, Sanja Fidler, Xue Bin Peng, and Kayvon Fatahalian. Learning physically simulated tennis skills from broadcast videos. ACM Trans. Graph., 42(4), jul 2023. ISSN 0730-0301. doi: 10.1145/3592408. URL https://doi.org/10.1145/3592408.

Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5628–5635, 2018. doi: 10.1109/ICRA.2018.8461249.

Mandi Zhao, Fangchen Liu, Kimin Lee, and Pieter Abbeel. Towards more generalizable one-shot visual imitation learning. In ICRA, pp. 2434–2444. IEEE, 2022.

Yang Zhou, Rui Fu, Chang Wang, and Ruibin Zhang. Modeling car-following behaviors and driving styles with generative adversarial imitation learning. Sensors (Basel, Switzerland), 20, 09 2020. doi: 10.3390/s20185034.

Zhuangdi Zhu, Kaixiang Lin, Bo Dai, and Jiayu Zhou. Off-policy imitation learning from observations. In NeurIPS, 2020.

Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, pp. 1433–1438. AAAI Press, 2008.

Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Çaglar Gülçehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott E. Reed. Offline learning from demonstrations and unlabeled experience. CoRR, abs/2011.13885, 2020. URL https://arxiv.org/abs/2011.13885.

APPENDIX

Table of Contents

A Additional Results and Proofs
  A.1 Derivation of the Expression for FB with Reward $\rho_e/(\rho + \rho_e)$
  A.2 Proof of Theorem 2
  A.3 Distribution Matching with FB: The Feature Matching Loss with $\mathcal{R} = L^2(\rho)$
  A.4 Derivation of the Bellman Gap Loss for Distribution Matching
  A.5 KL Distribution Matching via FB
  A.6 Using Universal Successor Features instead of FB for Imitation
  A.7 The Bellman Gap Loss and the Behavioral Cloning Loss Bound Distribution Matching Losses
B Offline IL Baselines
C BFM Baselines
  C.1 DIAYN
  C.2 GOAL-GPT
  C.3 MASKDP
  C.4 GOAL-TD3
D Experimental Setup
  D.1 Environments
  D.2 Datasets and Expert Trajectories
  D.3 Architectures
  D.4 Hyperparameters
E Imitation Learning Experiments
  E.1 Detailed View of the Results with K = 1 Expert Demonstrations
  E.2 Warm Start for FB-IL Methods
  E.3 FB Model Quality
  E.4 Effect of the Number of Demonstrations
  E.5 Generalization to Different Starting Point Distributions
  E.6 Distribution Matching: Matching the Successor Measure on Average vs at each State
F Waypoint Imitation Learning Experiment
  F.1 Detailed Protocol
  F.2 Algorithms and Hyperparameters
  F.3 Qualitative Evaluation
G Baseline Results on D4RL
A ADDITIONAL RESULTS AND PROOFS

A.1 DERIVATION OF THE EXPRESSION FOR FB WITH REWARD $\rho_e/(\rho + \rho_e)$

In FB theory, the expression $z = \mathbb{E}_{s\sim\rho}[r(s)B(s)]$ assumes that we sample rewards $r(s)$ for states $s$ following the original training distribution $\rho$. This is not always possible or desirable. In general, if we have access to rewards $r(s)$ for states $s$ sampled from some distribution $\rho'$, then the expression for $z$ becomes (Touati & Ollivier, 2021, B.5)

$z = (\mathrm{Cov}\,B)\,\big(\mathbb{E}_{s\sim\rho'}\,B(s)B(s)^\top\big)^{-1}\,\mathbb{E}_{s\sim\rho'}[r(s)B(s)]$  (17)

which reduces to $z = \mathbb{E}_{s\sim\rho}[r(s)B(s)]$ when $\rho = \rho'$. Taking $r(s) := \frac{\rho_e(\mathrm{d}s)}{\rho(\mathrm{d}s) + \rho_e(\mathrm{d}s)}$ and setting $\rho' := \frac{1}{2}(\rho + \rho_e)$ yields

$z = (\mathrm{Cov}\,B)\,\Big(\tfrac{1}{2}\,\mathrm{Cov}\,B + \tfrac{1}{2}\,\mathbb{E}_{s\sim\rho_e}\,B(s)B(s)^\top\Big)^{-1} \int \frac{\rho_e(\mathrm{d}s)}{\rho(\mathrm{d}s) + \rho_e(\mathrm{d}s)}\, B(s)\, \tfrac{1}{2}\big(\rho(\mathrm{d}s) + \rho_e(\mathrm{d}s)\big)$  (18)

$= (\mathrm{Cov}\,B)\,\Big(\tfrac{1}{2}\,\mathrm{Cov}\,B + \tfrac{1}{2}\,\mathbb{E}_{s\sim\rho_e}\,B(s)B(s)^\top\Big)^{-1} \int \tfrac{1}{2}\,\rho_e(\mathrm{d}s)\, B(s)$

$= (\mathrm{Cov}\,B)\,\Big(\mathrm{Cov}\,B + \mathbb{E}_{s\sim\rho_e}\,B(s)B(s)^\top\Big)^{-1}\,\mathbb{E}_{s\sim\rho_e}[B(s)]$  (20)

The choice $\rho' = \frac{1}{2}(\rho + \rho_e)$ is equivalent to computing $z$ from a theoretical dataset that would mix the unsupervised and expert datasets. Note, though, that the final expression does not explicitly involve the unsupervised dataset, beyond the covariance matrix $\mathrm{Cov}\,B$.

A.2 PROOF OF THEOREM 2

The loss is

$\mathcal{L}^{B^*}(z) = \mathbb{E}_{s_t\sim\rho_e}\Big\|\int B(s)\,M^{\pi_z}(s_t,\mathrm{d}s) - \int B(s)\,M^{\pi_e}(s_t,\mathrm{d}s)\Big\|^2_2$  (21)

$= \mathbb{E}_{s_t\sim\rho_e}\Big\|\int B(s)\,M^{\pi_z}(s_t,\mathrm{d}s)\Big\|^2_2 - 2\,\mathbb{E}_{s_t\sim\rho_e}\Big(\int B(s)\,M^{\pi_z}(s_t,\mathrm{d}s)\Big)^\top\Big(\int B(s)\,M^{\pi_e}(s_t,\mathrm{d}s)\Big) + \mathrm{cst}$  (22)

where the constant does not depend on $z$. Under the successor feature property (5) of FB, we have, for all $s \in S$,

$\int B(s')\,M^{\pi_z}(s,\mathrm{d}s') = \mathbb{E}\Big[\sum_{t\ge 0}\gamma^t B(s_{t+1})\,\Big|\, s_0 = s, \pi_z\Big] = (\mathrm{Cov}\,B)\,F(s,z),$  (24)

where the first equality follows from the definition of the successor measure $M^{\pi_z}(s,\mathrm{d}s')$, and the second one from (5). Therefore, the loss $\mathcal{L}^{B^*}$ rewrites as

$\mathcal{L}^{B^*}(z) = \mathbb{E}_{s_t\sim\rho_e}\big\|(\mathrm{Cov}\,B)\,F(s_t,z)\big\|^2_2 - 2\,\mathbb{E}_{s_t\sim\rho_e}\big((\mathrm{Cov}\,B)\,F(s_t,z)\big)^\top\Big(\int B(s)\,M^{\pi_e}(s_t,\mathrm{d}s)\Big) + \mathrm{cst}.$  (25)

Note that, for the derivation above, we do not use the full $M^{\pi_z} = F^\top B\,\rho$ property (Equation 2).
We only use that this property holds when integrated against 𝐡. For this, it is enough that (Cov 𝐡)𝐹be the successor features of 𝐡, which holds by Proposition 1. This is a weaker requirement than the full successor measure equality π‘€πœ‹π‘§= 𝐹 𝐡𝜌. Now, in expectation over the expert trajectories (𝑠0, 𝑠1, . . . , 𝑠𝑑, . . .), we have 𝐡(𝑠 )π‘€πœ‹π‘’(𝑠𝑑, d𝑠 ) = E π‘˜ 0 π›Ύπ‘˜π΅(𝑠𝑑+π‘˜+1) | 𝑠𝑑, πœ‹π‘’ Published as a conference paper at ICLR 2024 by definition of the successor measure π‘€πœ‹π‘’. So E𝑠𝑑 πœŒπ‘’((Cov 𝐡)𝐹(𝑠𝑑, 𝑧)) ( 𝐡(𝑠)π‘€πœ‹π‘’(𝑠𝑑, d𝑠) ) ((Cov 𝐡)𝐹(𝑠𝑑, 𝑧)) E π‘˜ 0 π›Ύπ‘˜π΅(𝑠𝑑+π‘˜+1) | 𝑠𝑑, πœ‹π‘’ ((Cov 𝐡)𝐹(𝑠𝑑, 𝑧)) π‘˜ 0 π›Ύπ‘˜π΅(𝑠𝑑+π‘˜+1) | 𝑠𝑑, πœ‹π‘’ by properties of conditional expectations. So the loss can be rewritten as ℒ𝐡 (𝑧) = E𝑠𝑑 πœŒπ‘’E (Cov 𝐡)𝐹(𝑠𝑑, 𝑧) 2 2 2((Cov 𝐡)𝐹(𝑠𝑑, 𝑧)) π‘˜ 0 π›Ύπ‘˜π΅(𝑠𝑑+π‘˜+1) + cst | 𝑠𝑑, πœ‹π‘’ (Cov 𝐡)𝐹(𝑠𝑑, 𝑧) π‘˜ 0 π›Ύπ‘˜π΅(𝑠𝑑+π‘˜+1) + cst | 𝑠𝑑, πœ‹π‘’ for a new constant that does not depend on 𝑧. This ends the proof. A.3 DISTRIBUTION MATCHING WITH FB: THE FEATURE MATCHING LOSS WITH β„›= 𝐿2(𝜌) A general norm on measures is the dual 𝐿2-norm, which corresponds to taking β„›= 𝐿2(𝜌) in the distribution matching criterion (10). Explicitly, 𝑀 𝐿2(𝜌) := sup π‘Ÿ, π‘Ÿ 𝐿2(𝜌) 1 π‘Ÿ(𝑠)𝑀(d𝑠) = 𝑀/𝜌 𝐿2(𝜌) . (30) Since π‘Ÿ(𝑠)π‘€πœ‹(𝑠0, d𝑠) is the expected total return of policy πœ‹starting at 𝑠0 for reward π‘Ÿ, this norm is the worst-case optimality gap on unit-norm rewards. Namely, we compare the worst-case difference of two measures on all unit-norm reward functions. The resulting loss in (12) is ℒ𝐿2(𝜌) (𝑧) := E𝑠 πœŒπ‘’ π‘€πœ‹π‘§(𝑠) π‘€πœ‹π‘’(𝑠) 2 𝐿2(𝜌) (31) This loss is tractable thanks to the following result. Theorem 3. Assume that the FB model (2) holds. Then the quantity E𝑠𝑑 πœŒπ‘’πΉ(𝑠𝑑, 𝑧) (Cov 𝐡)𝐹(𝑠𝑑, 𝑧) 2 E𝑠𝑑 πœŒπ‘’πΉ(𝑠𝑑, 𝑧) π‘˜ 0 π›Ύπ‘˜π΅(𝑠𝑑+π‘˜+1) (32) is an unbiased estimate of the loss ℒ𝐿2(𝜌) (𝑧), up to an additive constant that does not depend on 𝑧. So to optimize this loss, we just have to compute the discounted average of features 𝐡of states along the expert trajectory starting at each visited state 𝑠𝑑. Then we can perform stochastic gradient descent with respect to 𝑧. The matrix Cov 𝐡is computed during pretraining. Comparison of the 𝐿2(𝜌) loss and the feature matching loss. The loss (32) coincides with the feature matching loss (67) when Cov 𝐡= Id. Standard FB training weakly enforces Cov 𝐡 Id by an auxiliary loss on Cov 𝐡 Id (Touati et al., 2023), so we do not expect a significant difference between (32) and (67). Actually, the 𝐿2 loss (32) could be considered as a canonical version of the feature matching loss (67) that first orthonormalizes the features. The 𝐿2 loss is defined just by the 𝐿2(𝜌) norm without Published as a conference paper at ICLR 2024 any reference to a feature basis, so it is invariant by linear transformations within the features, while the feature matching loss is not. Compared to feature matching, Theorem 3 requires stronger properties of FB than Theorem 2. Indeed, Theorem 2 only requires that the FB equation 𝐹 𝐡𝜌= 𝑀hold on the span of 𝐡(and would hold similarly for other successor feature models), while Theorem 3 really requires 𝐹 𝐡𝜌= 𝑀as measures. Proof of Theorem 3. The proof is as follows. We have 𝑀 2 𝐿2(𝜌)* = (𝑀(d𝑠 )/𝜌(d𝑠 ))2 𝜌(d𝑠 ). 
Therefore, with starting point 𝑠, we have π‘€πœ‹π‘§(𝑠) π‘€πœ‹π‘’(𝑠) 2 𝐿2(𝜌)* = (π‘€πœ‹π‘§(𝑠, d𝑠 )/𝜌(d𝑠 ))2𝜌(d𝑠 ) 2 (π‘€πœ‹π‘§(𝑠, d𝑠 )/𝜌(d𝑠 ))(π‘€πœ‹π‘’(𝑠, d𝑠 )/𝜌(d𝑠 ))𝜌(d𝑠 ) + cst = (π‘€πœ‹π‘§(𝑠, d𝑠 )/𝜌(d𝑠 ))2𝜌(d𝑠 ) 2 (π‘€πœ‹π‘§(𝑠, d𝑠 )/𝜌(d𝑠 )) π‘€πœ‹π‘’(𝑠, d𝑠 ) + cst (33) where the constant term does not depend on 𝑠. If the FB model (2) holds, we have π‘€πœ‹π‘§(𝑠, d𝑠 )/𝜌(d𝑠 ) = 𝐹(𝑠, 𝑧) 𝐡(𝑠 ). Therefore, π‘€πœ‹π‘§(𝑠) π‘€πœ‹π‘’(𝑠) 2 𝐿2(𝜌)* = (𝐹(𝑠, 𝑧) 𝐡(𝑠 ))2𝜌(d𝑠 ) 2 𝐹(𝑠, 𝑧) 𝐡(𝑠 )π‘€πœ‹π‘’(𝑠, d𝑠 ) (34) = 𝐹(𝑠, 𝑧) (Cov 𝐡)𝐹(𝑠, 𝑧) 2𝐹(𝑠, 𝑧) 𝐡(𝑠 )π‘€πœ‹π‘’(𝑠, d𝑠 ) (35) Now, in expectation over the expert trajectories (𝑠0, 𝑠1, . . . , 𝑠𝑑, . . .), we have π‘˜ 0 π›Ύπ‘˜π΅(𝑠𝑑+π‘˜+1) | 𝑠𝑑, πœ‹π‘’ = 𝐡(𝑠 )π‘€πœ‹π‘’(𝑠𝑑, d𝑠 ) (36) by definition of the successor measure π‘€πœ‹π‘’. So π‘€πœ‹π‘§(𝑠𝑑) π‘€πœ‹π‘’(𝑠𝑑) 2 𝐿2(𝜌)* = 𝐹(𝑠𝑑, 𝑧) (Cov 𝐡)𝐹(𝑠𝑑, 𝑧) 2𝐹(𝑠𝑑, 𝑧) E π‘˜ 0 π›Ύπ‘˜π΅(𝑠𝑑+π‘˜+1) | 𝑠𝑑, πœ‹π‘’ which proves the theorem by taking expectations with respect to 𝑠𝑑in the expert trajectories. This proves Theorem 3. A.4 DERIVATION OF THE BELLMAN GAP LOSS FOR DISTRIBUTION MATCHING Here we prove the claims in Section 4.3 about the Bellman gap losses. π‘€πœ‹π‘’satisfies the Bellman equation π‘€πœ‹π‘’= π‘ƒπœ‹π‘’+ π›Ύπ‘ƒπœ‹π‘’π‘€πœ‹π‘’. So we can look for the π‘€πœ‹π‘§for which this Bellman equation is best satisfied. This results in the loss (15). We deal in turn with the seminorm 𝐡 from the choice β„›= {π‘Ÿ= 𝑀 𝐡, 𝑀 R𝑑, 𝑀 2 1} (Thm. 4), and with the full norm from β„›= 𝐿2(𝜌) (Thm. 5). The latter loss turns out to be similar to the loss used for FB training (Touati et al., 2023, 5.2), but using transitions from πœ‹π‘’instead of πœ‹π‘§, and working with 𝑀(𝑠, d𝑠 ) instead of 𝑀(𝑠, π‘Ž, d𝑠 ). Theorem 4. Assume the FB successor feature property (5) holds. Let 𝑠𝑑 𝑆. Let 𝑀and 𝐹denote the stop-grad operator over 𝑀and 𝐹(still evaluated at 𝑀= 𝑀and 𝐹= 𝐹). Then the following quantities have equal gradients with respect to 𝑧: The 𝐡 -norm of the Bellman gap, π‘€πœ‹π‘§ π‘ƒπœ‹π‘’ π›Ύπ‘ƒπœ‹π‘’ π‘€πœ‹π‘§ 2 𝐡 at state 𝑠𝑑. Published as a conference paper at ICLR 2024 The expectation over transitions 𝑠𝑑 𝑠𝑑+1 from policy πœ‹π‘’, of the quantity 2𝐹(𝑠𝑑, 𝑧) (Cov 𝐡)𝐡(𝑠𝑑+1) + (𝐹(𝑠𝑑, 𝑧) 𝛾 𝐹(𝑠𝑑+1, 𝑧)) (Cov 𝐡)2(𝐹(𝑠𝑑, 𝑧) 𝛾 𝐹(𝑠𝑑+1, 𝑧)) (38) Theorem 5. Assume the FB model π‘€πœ‹π‘§(𝑠, d𝑠 ) = 𝐹(𝑠, 𝑧) 𝐡(𝑠 )𝜌(d𝑠 ) holds. Let 𝑠𝑑 𝑆. Let 𝑀and 𝐹denote the stop-grad operator over 𝑀and 𝐹(still evaluated at 𝑀= 𝑀and 𝐹= 𝐹). Then the following quantities have equal gradients with respect to 𝑧: The 𝐿2(𝜌) -norm of the Bellman gap, π‘€πœ‹π‘§ π‘ƒπœ‹π‘’ π›Ύπ‘ƒπœ‹π‘’ π‘€πœ‹π‘§ 2 𝐿2(𝜌) at state 𝑠𝑑. The expectation over transitions 𝑠𝑑 𝑠𝑑+1 from policy πœ‹π‘’, of the quantity 2𝐹(𝑠𝑑, 𝑧) 𝐡(𝑠𝑑+1) + (𝐹(𝑠𝑑, 𝑧) 𝛾 𝐹(𝑠𝑑+1, 𝑧)) (Cov 𝐡)(𝐹(𝑠𝑑, 𝑧) 𝛾 𝐹(𝑠𝑑+1, 𝑧)) (39) This is also the FB training loss from (Touati et al., 2023, 5.2) with transitions from πœ‹π‘’. The proofs are essentially the derivation used to obtain the FB loss in Touati et al. (2023). In the proof, we freely use the matrix and operator notation from (Blier et al., 2021) for kernels like 𝑃(𝑠, d𝑠 ) and 𝑀(𝑠, d𝑠 ), which amounts to ordinary matrix multiplication when 𝑆is finite. 
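Before turning to the proofs, here is a minimal sketch of how the Bellman-gap objective of Theorem 5 (Equation 39) could be minimized over z at test time, assuming access to pretrained F and B networks, the matrix Cov B estimated during pretraining, and a batch of consecutive expert states. The function names, the action-marginalized F(s, z) interface, and the optimizer settings are illustrative assumptions, not the paper's released implementation.

```python
import torch

def infer_z_bellman_gap(F_net, B_net, cov_B, expert_s, expert_s_next,
                        gamma=0.98, steps=2000, lr=1e-2):
    """Sketch: gradient descent on z for the Bellman-gap loss of Theorem 5 (Eq. 39).

    Assumed interfaces (illustrative): F_net(states, zs) -> (batch, d) is the forward
    network with the action marginalized out, B_net(states) -> (batch, d) is the
    backward network, cov_B is the (d, d) matrix Cov B estimated during pretraining,
    and expert_s / expert_s_next are consecutive expert states.
    """
    d = cov_B.shape[0]
    z = torch.randn(d, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    with torch.no_grad():
        B_next = B_net(expert_s_next)              # B(s_{t+1}); no gradient needed
    for _ in range(steps):
        zs = z.expand(expert_s.shape[0], d)
        F_t = F_net(expert_s, zs)                  # F(s_t, z), keeps the gradient w.r.t. z
        with torch.no_grad():
            F_next = F_net(expert_s_next, zs)      # stop-gradient target, as in Eq. 39
        delta = F_t - gamma * F_next               # Bellman residual in the embedding space
        loss = (-2.0 * (F_t * B_next).sum(-1)
                + ((delta @ cov_B) * delta).sum(-1)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()
```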
For Theorem 4, remember that 𝑀(𝑠𝑑) 𝐡 = sup 𝑀 R𝑑, 𝑀 2 1 𝑀(𝑠𝑑, d𝑠)𝑀 𝐡(𝑠) (40) = sup 𝑀 R𝑑, 𝑀 2 1 𝑀 𝑀(𝑠𝑑, d𝑠)𝐡(𝑠) (41) 𝑀(𝑠𝑑, d𝑠)𝐡(𝑠) 2 (42) Therefore, the norm of the Bellman gap above is π‘€πœ‹π‘§(𝑠𝑑, ) π‘ƒπœ‹π‘’(𝑠𝑑, ) π›Ύπ‘ƒπœ‹π‘’(𝑠𝑑, ) π‘€πœ‹π‘§( , ) 2 𝐡 π‘€πœ‹π‘§(𝑠𝑑, d𝑠)𝐡(𝑠) π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑)𝐡(𝑠𝑑+1) 𝛾 π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑) π‘€πœ‹π‘§(𝑠𝑑+1, d𝑠)𝐡(𝑠) Now, if the FB successor feature relation (5) holds, we have π‘€πœ‹π‘§(𝑠𝑑, d𝑠 )𝐡(𝑠) = E [ π‘˜ 0π›Ύπ‘˜π΅(𝑠𝑑+π‘˜+1)|𝑠𝑑, πœ‹π‘§ ] (44) = (Cov 𝐡)𝐹(𝑠𝑑, 𝑧) (45) via (5). Therefore, the norm above is = (Cov 𝐡)𝐹(𝑠𝑑, 𝑧) π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑)𝐡(𝑠𝑑+1) 𝛾 π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑)(Cov 𝐡) 𝐹(𝑠𝑑+1, 𝑧) Now, as usual in deep 𝑄-learning, thanks to the stop-grad operator on 𝐹(𝑠𝑑+1, 𝑧), we can take the expectation over 𝑠𝑑+1 𝑃(d𝑠𝑑+1|𝑠𝑑) outside of the norm, because the difference is a constant that has no gradients with respect to 𝑧. Therefore, the above has the same gradients as E𝑠𝑑+1 π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑) (Cov 𝐡)𝐹(𝑠𝑑, 𝑧) 𝐡(𝑠𝑑+1) 𝛾(Cov 𝐡) 𝐹(𝑠𝑑+1, 𝑧) 2 2 (47) and rearranging yields the result. For Theorem 5, let us start with the norm of the Bellman gap of π‘€πœ‹π‘§at 𝑠𝑑, π‘€πœ‹π‘§(𝑠𝑑) π‘ƒπœ‹π‘’( |𝑠𝑑) π›Ύπ‘ƒπœ‹π‘’( |𝑠𝑑) π‘€πœ‹π‘§( , ) 2 𝐿2(𝜌) . (48) Published as a conference paper at ICLR 2024 Since 𝑀 𝐿2(𝜌) = 𝑀/𝜌 𝐿2(𝜌) we have 𝑀1 𝑀2 2 𝐿2(𝜌) = 𝑀1 2 𝐿2(𝜌) + 𝑀2 2 𝐿2(𝜌) 2 (𝑀1/𝜌) d𝑀2 by a simple computation. Therefore, the norm of the Bellman gap above is = π‘€πœ‹π‘§(𝑠𝑑) 2 𝐿2(𝜌) + cst 2 π‘€πœ‹π‘§(𝑠𝑑, d𝑠𝑑+1) 𝜌(d𝑠𝑑+1) π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑) 𝑠 π‘€πœ‹π‘§(𝑠𝑑, d𝑠 ) 𝑠𝑑+1 π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑) π‘€πœ‹π‘§(𝑠𝑑+1, d𝑠 ) where the constant absorbs all terms not containing π‘€πœ‹π‘§. Now, if π‘€πœ‹π‘§(𝑠, d𝑠 ) = 𝐹(𝑠, 𝑧) 𝐡(𝑠 )𝜌(d𝑠 ), we have π‘€πœ‹π‘§(𝑠𝑑) 2 𝐿2(𝜌) = π‘€πœ‹π‘§(𝑠𝑑)/𝜌) 2 𝐿2(𝜌) (50) = (𝐹(𝑠𝑑, 𝑧) 𝐡(𝑠 ))2𝜌(d𝑠 ) (51) = 𝐹(𝑠𝑑, 𝑧) 𝐡(𝑠 )𝐡(𝑠 ) 𝐹(𝑠𝑑, 𝑧)𝜌(d𝑠 ) (52) = 𝐹(𝑠𝑑, 𝑧) (Cov 𝐡)𝐹(𝑠𝑑, 𝑧) (53) and likewise π‘€πœ‹π‘§(𝑠𝑑, d𝑠𝑑+1) 𝜌(d𝑠𝑑+1) π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑) = E𝑠𝑑+1 π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑) 𝐹(𝑠𝑑, 𝑧) 𝐡(𝑠𝑑+1) (54) 𝑠 π‘€πœ‹π‘§(𝑠𝑑, d𝑠 ) 𝑠𝑑+1 π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑) π‘€πœ‹π‘§(𝑠𝑑+1, d𝑠 ) = E𝑠𝑑+1 π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑) 𝐹(𝑠𝑑, 𝑧) 𝐡(𝑠 ) 𝐹(𝑠𝑑+1, 𝑧) 𝐡(𝑠 )𝜌(d𝑠 ) = E𝑠𝑑+1 π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑) 𝐹(𝑠𝑑, 𝑧) ( 𝐡(𝑠 ) 𝐡(𝑠 ) 𝜌(d𝑠 ) ) 𝐹(𝑠𝑑+1, 𝑧) = E𝑠𝑑+1 π‘ƒπœ‹π‘’(d𝑠𝑑+1|𝑠𝑑) 𝐹(𝑠𝑑, 𝑧) (Cov 𝐡) 𝐹(𝑠𝑑+1, 𝑧) (55) since 𝐡= 𝐡because 𝐡does not depend on 𝑧. Now, the sum of the first and third term is 𝐹(𝑠𝑑, 𝑧) (Cov 𝐡)𝐹(𝑠𝑑, 𝑧) 2𝛾𝐹(𝑠𝑑, 𝑧) (Cov 𝐡) 𝐹(𝑠𝑑+1, 𝑧), which has the same gradients as (𝐹(𝑠𝑑, 𝑧) 𝛾 𝐹(𝑠𝑑+1, 𝑧)) (Cov 𝐡)(𝐹(𝑠𝑑, 𝑧) 𝛾 𝐹(𝑠𝑑+1, 𝑧)). Collecting yields the result. A.5 KL DISTRIBUTION MATCHING VIA FB Here we show how our approach can be extended to another divergence between π‘€πœ‹π‘’and π‘€πœ‹π‘§. Inspired from the generic loss (12), consider the following loss ℒ𝐾𝐿(𝑧) := E𝑠𝑑 πœŒπ‘’KL(π‘€πœ‹π‘’(𝑠) || π‘€πœ‹π‘§(𝑠)) (56) where KL(𝑝|| π‘ž) := 𝑝ln 𝑝/π‘ž 𝑝+ π‘žis the generalized KL divergence between measures 𝑝 and π‘ž. 5 This can be estimated from expert trajectories, similarly to Theorem 2. Theorem 6. Assume that the FB model (2) holds. Then the quantity π‘˜ 0 π›Ύπ‘˜ln ( 𝐹(𝑠𝑑, 𝑧) 𝐡(𝑠𝑑+π‘˜+1) ) (57) is an unbiased estimate of the loss ℒ𝐾𝐿(𝑧), up to an additive constant that does not depend on 𝑧. 5This extends the usual KL divergence to measures which may not sum to 1, which is necessary because the model 𝑀 𝐹 𝐡may not be normalized. This is the Bregman divergence associated with the convex function 𝑝 𝑝ln 𝑝. 
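A minimal sketch of how the estimate in Theorem 6 could be computed from one expert trajectory and minimized over z by gradient descent is given below; the network interfaces, the clamping of F^⊤B to positive values, and the per-start-state loop are illustrative assumptions rather than the paper's implementation.

```python
import torch

def kl_imitation_loss(F_net, B_net, z, expert_states, gamma=0.98, eps=1e-6):
    """Sketch of the estimate of Theorem 6 for a single expert trajectory.

    For every start index t we accumulate -sum_k gamma^k log(F(s_t, z)^T B(s_{t+k+1})),
    clamping the inner product at eps because the estimate only makes sense when the
    model F^T B is positive.  F_net and B_net are assumed pretrained; expert_states
    is a (T, state_dim) tensor of consecutive expert states.
    """
    T = expert_states.shape[0]
    with torch.no_grad():
        B_all = B_net(expert_states)          # (T, d); B does not depend on z
    loss = 0.0
    for t in range(T - 1):
        f_t = F_net(expert_states[t:t + 1], z.unsqueeze(0)).squeeze(0)   # F(s_t, z), shape (d,)
        inner = (B_all[t + 1:] @ f_t).clamp_min(eps)   # F(s_t, z)^T B(s_{t+k+1}), k >= 0
        discounts = gamma ** torch.arange(inner.shape[0], dtype=inner.dtype)
        loss = loss - (discounts * torch.log(inner)).sum()
    return loss / (T - 1)   # to be minimized over z, e.g. with torch.optim.Adam([z])
```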
Published as a conference paper at ICLR 2024 This estimate only makes sense if the learn model π‘€πœ‹π‘§ 𝐹 𝐡only produces positive values for 𝐹 𝐡. The proof is as follows. Let 𝑠0 be any state. Then KL(π‘€πœ‹π‘’(𝑠0) || π‘€πœ‹π‘§(𝑠0)) = ln π‘€πœ‹π‘’(𝑠0, d𝑠 ) π‘€πœ‹π‘§(𝑠0, d𝑠 ) π‘€πœ‹π‘’(𝑠0, d𝑠 ) π‘€πœ‹π‘’(𝑠0, d𝑠 ) + π‘€πœ‹π‘§(𝑠0, d𝑠 ) = ln π‘€πœ‹π‘’(𝑠0, d𝑠 ) π‘€πœ‹π‘§(𝑠0, d𝑠 ) π‘€πœ‹π‘’(𝑠0, d𝑠 ) (59) because both π‘€πœ‹π‘’and π‘€πœ‹π‘§are successor measures, so they both integrate to 1/(1 𝛾). Plugging in the model π‘€πœ‹π‘§(𝑠0, d𝑠 ) = 𝐹(𝑠0, 𝑧) 𝐡(𝑠 )𝜌(d𝑠 ), this equals = ln π‘€πœ‹π‘’(𝑠0, d𝑠 ) 𝐹(𝑠0, 𝑧) 𝐡(𝑠 )𝜌(d𝑠 ) π‘€πœ‹π‘’(𝑠0, d𝑠 ) (60) = ln(𝐹(𝑠0, 𝑧) 𝐡(𝑠 )) π‘€πœ‹π‘’(𝑠0, d𝑠 ) + cst (61) where the constant does not depend on 𝑧. Now, given 𝑠0 and any function 𝑓, we have 𝑓(𝑠 )π‘€πœ‹π‘’(𝑠0, d𝑠 ) = E [ 𝑑 0 𝛾𝑑𝑓(𝑠𝑑+1) | 𝑠0, πœ‹π‘’ ] . This proves the claim up to replacing 𝑠0 with a state 𝑠𝑑 πœŒπ‘’. A.6 USING UNIVERSAL SUCCESSOR FEATURES INSTEAD OF FB FOR IMITATION The FB method belongs to a wider class of methods based on successor features or measures. Many of the imitation algorithms described here still make sense with other successor feature methods, as we now briefly describe. 6 We recall the universal successor feature framework (Borsa et al., 2018). This assumes access to state features πœ™: 𝑆 R𝑑trained according to some external criterion. Then Borsa et al. (2018) train a family of policies (πœ‹π‘§), together with the successor features πœ“of πœ™: { πœ“(𝑠0, π‘Ž0, 𝑧) E [ 𝑑 0π›Ύπ‘‘πœ™(𝑠𝑑+1) | 𝑠0, π‘Ž0, πœ‹π‘§ ] , πœ‹π‘§(𝑠) arg maxπ‘ŽπΉ(𝑠, π‘Ž, 𝑧) 𝑧. (62) At test time when faced with a reward function π‘Ÿ, USFs perform linear regression of the reward on the features: π‘§π‘Ÿ:= (E𝑠 𝜌 πœ™(𝑠)πœ™(𝑠) ) 1 E𝑠 𝜌 [π‘Ÿ(𝑠)πœ™(𝑠)] (63) where 𝜌 is the data distribution at test time. Then the policy πœ‹π‘§π‘Ÿis used. The second equation in (62) is identical to FB, but the first equation is less constrained (πœ“= πœ™= 0 would be a solution if no external criterion is applied to πœ™, while 𝐹= 𝐡= 0 is not a solution of (2) as 𝐹 𝐡must represent successor measures): it is analogous to (5), which is weaker than (2). The equation (63) for π‘§π‘Ÿis analogous to (4); beware that the 𝑧variable in FB and the 𝑧variable in USFs differ by a Cov 𝐡or (Cov πœ™) factor. Imitation learning via USFs. All of the FB-IL methods in this paper can be extended from FB to other USF models, except distribution matching: as USFs don t represent the successor measure itself, they can do feature matching but not distribution matching. We now list the corresponding losses and formulas. The derivations are omitted, since they are very similar to FB due to the analogies in the equations discussed above. For any distribution 𝜌, we abbreviate CovπœŒπœ™:= E𝑠 𝜌[πœ™(𝑠)πœ™(𝑠) ]. 6We have focused on FB as it has demonstrated better performance for zero-shot reinforcement learning (Touati et al., 2023), and is arguably better founded theoretically, by training 𝐹and 𝐡with a single criterion with no risk of representation collapse. Published as a conference paper at ICLR 2024 For behavioral cloning, the loss (7) is unchanged: identify the most likely πœ‹π‘§given the trajectory, then use πœ‹π‘§. This applies to any method that pre-trains a family of policies (πœ‹π‘§)𝑧. 
For reward-based methods, plugging the reward π‘Ÿ( ) = πœŒπ‘’( )/𝜌( ) (Kim et al., 2022b;a; Ma et al., 2022) into (63) with 𝜌 := 𝜌yields 𝑧= (CovπœŒπœ™) 1 EπœŒπ‘’[πœ™] = (CovπœŒπœ™) 1 E𝜏 [ 1 β„“(𝜏) 𝑑 0 πœ™(𝑠𝑑+1) ] . (64) Similarly, the reward π‘Ÿ( ) = πœŒπ‘’( )/(𝜌( ) + πœŒπ‘’( )) used in (Reddy et al., 2020; Zolna et al., 2020) leads to 𝑧= (CovπœŒπœ™+ CovπœŒπ‘’πœ™) 1 EπœŒπ‘’[πœ™]. (65) by using 𝜌 := 1 2(𝜌+ πœŒπ‘’) in (63). Feature matching can be done by starting with the generic loss (12) and choosing β„›:= {π‘Ÿ= 𝑀 πœ™, 𝑀 R𝑑, 𝑀 2 1}, similarly to FB. The resulting loss β„’πœ™ (𝑧) := E𝑠0 πœŒπ‘’ πœ™(𝑠)π‘€πœ‹π‘§(𝑠0, d𝑠) πœ™(𝑠)π‘€πœ‹π‘’(𝑠0, d𝑠) 2 2 (66) can be estimated by similar derivations as in Theorem 2, leading to the practical estimate β„’πœ™ (𝑧) = E𝑠𝑑 πœŒπ‘’E [ πœ“(𝑠𝑑, 𝑧) 𝑑 0 π›Ύπ‘˜πœ™(𝑠𝑑+π‘˜+1) 2 ] + cst. (67) where we have abbreviated πœ“(𝑠𝑑, 𝑧) := Eπ‘Žπ‘‘ πœ‹π‘§(𝑠𝑑) πœ“(𝑠𝑑, π‘Žπ‘‘, 𝑧). Namely, this finds a 𝑧that matches the successor features πœ“(𝑠𝑑, 𝑧) to the empirical successor features computed along the expert trajectory starting at 𝑠𝑑. Bellman gap matching can be done with the seminorm associated with β„›= {π‘Ÿ= 𝑀 πœ™, 𝑀 R𝑑, 𝑀 2 1}, but not with the full norm from β„›= 𝐿2(𝜌) (which requires FB). This will minimize the norm of Bellman gaps for reward functions in the span of πœ™, using that π‘€πœ‹π‘§πœ™= πœ“if the USF model (62) holds. Explicitly, starting with the Bellman gap loss (15), π‘€πœ‹π‘§ π‘ƒπœ‹π‘’ π›Ύπ‘ƒπœ‹π‘’ π‘€πœ‹π‘§ 2 β„› with transitions from the expert trajectories, and plugging in this choice of β„›together with the USF property π‘€πœ‹π‘§πœ™= πœ“, leads to the loss β„’πœ™ Bell(𝑧) = E𝑠𝑑 πœŒπ‘’ [ 2πœ“(𝑠𝑑, 𝑧) πœ™(𝑠𝑑+1) + (πœ“(𝑠𝑑, 𝑧) 𝛾 πœ“(𝑠𝑑+1, 𝑧)) (πœ“(𝑠𝑑, 𝑧) 𝛾 πœ“(𝑠𝑑+1, 𝑧)) ] + cst (68) analogous to (16). Finally, waypoint imitation can be done in USFs by selecting a goal state 𝑠𝑑+π‘˜, and putting a single reward at 𝑠𝑑+π‘˜in the USF formula (63), which yields 𝑧𝑑= (CovπœŒπœ™) 1 πœ™(𝑠𝑑+π‘˜) (69) analogously to Sec. 4.4, then using πœ‹π‘§π‘‘at time 𝑑. A.7 THE BELLMAN GAP LOSS AND THE BEHAVIORAL CLONING LOSS BOUND DISTRIBUTION MATCHING LOSSES Any method that provides a policy close to πœ‹π‘’will provide state distributions close to π‘‘πœ‹π‘’as a result, so we expect a relationship between the losses from different approaches. Indeed, the Bellman gap loss (15) bounds the distribution matching loss (12), and the BC loss bounds the KL version of (12). This is formalized in Theorems 7 and 8, respectively. Theorem 7 is analogous to the bound between Bellman gaps and errors on the 𝑄-function. Theorem 7. Let πœŒπ‘’be a stationary distribution of the expert policy πœ‹π‘’. Then the Bellman gap loss (15) on successor measures bounds both the feature or distribution matching loss (12) and the original feature or distribution matching loss (11). Namely, for any choice of β„›, E𝑠0 πœŒπ‘’π‘€πœ‹π‘§(𝑠0) E𝑠0 πœŒπ‘’π‘€πœ‹π‘’(𝑠0) 2 β„› E𝑠0 πœŒπ‘’ π‘€πœ‹π‘§(𝑠0, ) π‘€πœ‹π‘’(𝑠0, ) 2 β„› 1 (1 𝛾)2 E𝑠 πœŒπ‘’ π‘€πœ‹π‘§(𝑠) π‘ƒπœ‹π‘’(𝑠) π›Ύπ‘ƒπœ‹π‘’π‘€πœ‹π‘§(𝑠) 2 β„› . (70) Published as a conference paper at ICLR 2024 Theorem 8. Let πœŒπ‘’be a stationary distribution over states and state-actions of the expert policy πœ‹π‘’. Then E𝑠 πœŒπ‘’KL((1 𝛾)π‘€πœ‹π‘’(𝑠) || (1 𝛾)π‘€πœ‹π‘§(𝑠)) 1 1 𝛾E𝑠 πœŒπ‘’KL(πœ‹π‘’( |𝑠) || πœ‹π‘§( |𝑠)) (71) = 1 1 𝛾E(𝑠,π‘Ž) πœŒπ‘’ln πœ‹π‘§(π‘Ž|𝑠) + cst (72) where the constant does not depend on 𝑧. 
Proof of Theorem 7. The proof is as follows. The first inequality is by convexity of the norm. For the second one, we have π‘€πœ‹π‘§(𝑠0) π‘€πœ‹π‘’(𝑠0) = π‘€πœ‹π‘§(𝑠0, ) (Id π›Ύπ‘ƒπœ‹π‘’) 1π‘ƒπœ‹π‘’(𝑠0, ) (73) = (Id π›Ύπ‘ƒπœ‹π‘’) 1 ((Id π›Ύπ‘ƒπœ‹π‘’)π‘€πœ‹π‘§(𝑠0, ) π‘ƒπœ‹π‘’(𝑠0, )) (74) and therefore E𝑠0 πœŒπ‘’ π‘€πœ‹π‘§(𝑠0, ) π‘€πœ‹π‘’(𝑠0, ) 2 β„› (75) 𝑠 (Id π›Ύπ‘ƒπœ‹π‘’) 1(𝑠0, d𝑠) ((Id π›Ύπ‘ƒπœ‹π‘’)π‘€πœ‹π‘§(𝑠, ) π‘ƒπœ‹π‘’(𝑠, )) and since (Id π›Ύπ‘ƒπœ‹π‘’) 1 is the successor measure of πœ‹π‘’, it is a measure with total mass 1/(1 𝛾), so the integral under (Id π›Ύπ‘ƒπœ‹π‘’) 1 can be rewritten as an expectation under (1 𝛾)(Id π›Ύπ‘ƒπœ‹π‘’) 1: = 1 (1 𝛾)2 E𝑠0 πœŒπ‘’ E𝑠 (1 𝛾)(Id π›Ύπ‘ƒπœ‹π‘’) 1(𝑠0,d𝑠) [(Id π›Ύπ‘ƒπœ‹π‘’)π‘€πœ‹π‘§(𝑠, ) π‘ƒπœ‹π‘’(𝑠, )] 2 β„› 1 (1 𝛾)2 E𝑠0 πœŒπ‘’E𝑠 (1 𝛾)(Id π›Ύπ‘ƒπœ‹π‘’) 1(𝑠0,d𝑠) π‘€πœ‹π‘§(𝑠, ) π‘ƒπœ‹π‘’(𝑠, ) π›Ύπ‘ƒπœ‹π‘’π‘€πœ‹π‘§(𝑠, ) 2 β„› (by convexity) = 1 (1 𝛾)2 E𝑠 πœŒπ‘’ π‘€πœ‹π‘§(𝑠, ) π‘ƒπœ‹π‘’(𝑠, ) π›Ύπ‘ƒπœ‹π‘’π‘€πœ‹π‘§(𝑠, ) 2 β„› (79) where the last equality uses that πœŒπ‘’is a stationary distribution of πœ‹π‘’, which implies that the marginal distribution of 𝑠in the above is (1 𝛾) 𝑠0 πœŒπ‘’(d𝑠0)(Id π›Ύπ‘ƒπœ‹π‘’) 1(𝑠0, d𝑠) = (1 𝛾) 𝑠0 πœŒπ‘’(d𝑠0) 𝑑=0 𝛾𝑑(π‘ƒπœ‹π‘’)𝑑(𝑠0, d𝑠) = (1 𝛾) 𝑑=0 𝛾𝑑 𝑠0 πœŒπ‘’(d𝑠0)(π‘ƒπœ‹π‘’)𝑑(𝑠0, d𝑠) = (1 𝛾) 𝑑=0 π›Ύπ‘‘πœŒπ‘’(d𝑠) = πœŒπ‘’(d𝑠) since πœŒπ‘’is invariant under π‘ƒπœ‹π‘’. Proof of Theorem 8 (the BC loss is an upper bound of the KL successor measure matching loss). The proof goes as follows. Denote by πœ‹π‘ 0,𝑑the probability distribution of the state 𝑠𝑑when starting at 𝑠0 and following policy πœ‹. Denote by πœ‹π‘ 0,0:𝑑the probability distribution of the trajectory (𝑠0, π‘Ž0, 𝑠1, π‘Ž1, . . . , π‘Žπ‘‘ 1, 𝑠𝑑) when starting at 𝑠0 and following policy πœ‹. Denote by E𝑑the expectation under a random variable 𝑑with geometric distribution of parameter 1 𝛾starting at 1. By definition, (1 𝛾)π‘€πœ‹(𝑠0) = Eπ‘‘πœ‹π‘ 0,𝑑. (80) Then, KL((1 𝛾)π‘€πœ‹π‘’(𝑠0) || (1 𝛾)π‘€πœ‹(𝑠0)) = KL ( Eπ‘‘πœ‹π‘ 0,𝑑 𝑒 || Eπ‘‘πœ‹π‘ 0,𝑑) (81) E𝑑KL ( πœ‹π‘ 0,𝑑 𝑒 || πœ‹π‘ 0,𝑑) (82) E𝑑KL ( πœ‹π‘ 0,0:𝑑 𝑒 || πœ‹π‘ 0,0:𝑑) (83) = E𝑑E [ ln πœ‹π‘’(𝑠0, π‘Ž0, . . . , 𝑠𝑑) πœ‹(𝑠0, π‘Ž0, . . . , 𝑠𝑑) | 𝑠0, πœ‹π‘’ π‘˜=0 ln πœ‹π‘’(π‘Žπ‘˜|π‘ π‘˜) ln πœ‹(π‘Žπ‘˜|π‘ π‘˜) | 𝑠0, πœ‹π‘’ Published as a conference paper at ICLR 2024 Now, when integrated over 𝑠0 πœŒπ‘’with πœŒπ‘’a stationary distribution of πœ‹π‘’, then each state π‘ π‘˜is itself distributed according to πœ‹π‘’. Thus, the above is equal to E𝑠0 πœŒπ‘’KL((1 𝛾)π‘€πœ‹π‘’(𝑠0) || (1 𝛾)π‘€πœ‹(𝑠0)) (E𝑑𝑑) E𝑠 πœŒπ‘’KL(πœ‹π‘’( |𝑠) || πœ‹( |𝑠)) (86) where (E𝑑𝑑) accounts for there being 𝑑terms in the sum. Under the law of 𝑑, we have E𝑑𝑑= 1 1 𝛾. Finally, E𝑠 πœŒπ‘’KL(πœ‹π‘’( |𝑠) || πœ‹π‘§( |𝑠)) is the behavior cloning loss up to the entropy of πœ‹π‘’, which does not depend on 𝑧. B OFFLINE IL BASELINES In this section, we provide more information about the baselines we considered in the paper. Refer to App. E for a complete view of the results. In App. G we confirm that our implementation of the baselines matches results reported in the literature. Soft Q Imitation Learning (Reddy et al., 2020, SQIL) is a simple imitation algorithm that can be implemented with little modification to any standard RL algorithm. 
The idea is to provide a constant reward r = 1 to expert transitions, and a constant reward r = 0 to any transition generated by the RL algorithm. SQIL was introduced and shown to perform well in the online setting. In our experiments we use a straightforward offline adaptation of the algorithm: we provide r = 1 to expert transitions and r = 0 to the offline unsupervised transitions, and use SAC as the RL algorithm. We use balanced sampling as defined in the original paper, i.e., the batch comprises an equal number of expert and unsupervised transitions.

Since SQIL underperformed in several tasks, we introduced TD3 Imitation Learning (TD3IL). TD3IL uses a {0, 1} reward for unsupervised and expert samples as in SQIL, but uses TD3 as the offline RL algorithm and does not use balanced sampling. We tested different ways to construct the batch provided to the agent during training, including sampling expert transitions in proportion to their share of the unsupervised transitions and sampling a fixed ratio of expert transitions. We found the fixed-ratio strategy to be consistent across different numbers of expert trajectories, so we used this approach in the experiments.

We also tested TD3 with a soft variant of the {0, 1} reward used by SQIL and TD3IL. In particular, Luo et al. (2023) suggest using optimal transport to compute a distance between expert and non-expert transitions, which is then used as a reward function. This reward can subsequently be used with any RL algorithm; for the latter, we used TD3 in our experiments since it proved to be the most consistent. We call this algorithm OTRTD3.

Offline Reinforced Imitation Learning (Zolna et al., 2020, ORIL) trains a discriminator D(·) to separate samples from expert and unsupervised trajectories. Then, it trains an offline RL agent by annotating transitions with the learned reward function r(·) = log(D(·) + 1). We pretrained the discriminator using a gradient penalty, but we did not use positive-unlabeled learning since it did not improve performance in our tests. We trained three variants of the discriminator, using the observation, (observation, action), and (observation, next observation), respectively.

Similarly to ORIL, *DICE algorithms (Kim et al., 2022b; Ma et al., 2022; Kim et al., 2022a) use a discriminator to reconstruct the reward function. The main difference is that they aim to reconstruct a reward function of the form r(·) = ρ_e(·)/(ρ_e(·) + ρ(·)), while ORIL targets a reward r(·) = ρ_e(·)/ρ(·). We use the same discriminator training procedure, but we relabel the transitions using the parametric reward function r(·) = log(D(·)) − log(1 − D(·)). These methods leverage a regularized RL approach for training the policy.

IQ-Learn (Garg et al., 2021) is a non-adversarial IL method based on the MaxEnt formulation (Ziebart et al., 2008). Similarly to ValueDICE (Kostrikov et al., 2020), the idea of IQ-Learn is to transform the problem over rewards into a problem over Q-functions. Unlike *DICE algorithms, IQ-Learn does not require explicitly training a discriminator to recover a reward function.

Discriminator-Weighted Behavioral Cloning (Xu et al., 2022, DWBC) approaches offline IL through weighted behavior cloning, where the weights are provided by a discriminator. This is a way of leveraging unsupervised samples in BC, on top of expert samples. As suggested in the paper, we use a discriminator conditioned on the policy learned through weighted BC.
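To make the discriminator-based reward relabeling used by ORIL and the *DICE baselines above concrete, here is a minimal sketch: the plain logistic loss (omitting the gradient penalty mentioned above) and the log D − log(1 − D) relabeling follow the description in this section, but the network architecture, function names, and training details are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_discriminator(expert_s, unsup_s, hidden=1024, steps=5000, lr=1e-4, batch=256):
    """Sketch: logistic discriminator separating expert states from unsupervised states."""
    D = nn.Sequential(nn.Linear(expert_s.shape[1], hidden), nn.ReLU(),
                      nn.Linear(hidden, hidden), nn.ReLU(),
                      nn.Linear(hidden, 1))
    opt = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        e = expert_s[torch.randint(len(expert_s), (batch,))]
        u = unsup_s[torch.randint(len(unsup_s), (batch,))]
        logits = D(torch.cat([e, u])).squeeze(-1)
        labels = torch.cat([torch.ones(batch), torch.zeros(batch)])
        loss = bce(logits, labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return D

def dice_style_reward(D, states):
    """Relabel states with r = log D - log(1 - D); for a sigmoid head this is just the logit."""
    with torch.no_grad():
        return D(states).squeeze(-1)
```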
C BFM BASELINES

In this section we provide a description of the behavior foundation models used in our experiments.

C.1 DIAYN

DIAYN (Eysenbach et al., 2018) is a skill discovery algorithm, commonly used in unsupervised RL. It learns a set of parametrized policies (π_z)_{z ∈ R^d} by maximising the mutual information I(s; z) between the states produced by policy π_z and the latent skill z, drawn from a prior distribution of skills z ∼ p(z). To obtain a tractable approximation of I(s; z), Eysenbach et al. (2018) introduce an approximate skill posterior (skill encoder) q(z | s) and use the following variational lower bound:

$I(s; z) = H(z) - H(z \mid s)$   (87)
$\phantom{I(s; z)} = H(z) + \mathbb{E}_{p(z)}\,\mathbb{E}_{s\sim d^{\pi_z}}[\log p(z \mid s)]$   (88)
$\phantom{I(s; z)} \ge H(z) + \mathbb{E}_{p(z)}\,\mathbb{E}_{s\sim d^{\pi_z}}[\log q(z \mid s)]$   (89)

where H(z) and H(z | s) are respectively the entropy of the prior distribution p(z) and the entropy of the conditional skill distribution p(z | s). The latter is approximated by the skill encoder q(z | s). Similarly to Hansen et al. (2020), we model the latent space as a d-dimensional hypersphere {z : ‖z‖_2 = 1} and p(z) as the uniform distribution on the hypersphere. Since latent variables live on the hypersphere, we model the skill encoder as a von Mises-Fisher distribution with a scale parameter of 1:

$q(z \mid s) \propto \exp(z^\top \phi(s))$   (90)

where φ : S → R^d is a feature map restricted to the unit hypersphere, i.e., ‖φ(s)‖ = 1 for all s ∈ S. In practice, we train both the feature map φ and the policy π_z online: the feature map φ is learned by maximizing the log-likelihood objective max_φ E_{p(z)} E_{s∼d^{π_z}}[z^⊤φ(s)], while π_z is trained to maximize the intrinsic reward r_DIAYN(s, z) = z^⊤φ(s) using a z-conditioned TD3.

Our first attempt at training DIAYN online in its vanilla form was unsuccessful and led to near-zero performance. This is consistent with the findings of prior work (Laskin et al., 2022) that DIAYN is not able to learn diverse skills on DMC environments due to the absence of resets when the agent falls. Therefore, we decided to incorporate two components in our training that boost performance:

Posterior Hindsight Experience Replay (Choi et al., 2021, P-Her): instead of considering the actual z that generated the state s, it consists in relabelling z for some states s_t with a sample drawn from the skill encoder q(z | s_{t+k}), where s_{t+k} is a future state in the trajectory of s_t. In practice, we set z = φ(s_{t+k}), which is equal to the mean of the distribution.

Exploration bonus: to incentivize the agent to learn diverse skills, we add a k-nearest-neighbors-based entropy reward similarly to Liu & Abbeel (2021): r_explore(s) = Σ_{z_i ∈ kNN(φ(s))} ‖φ(s) − z_i‖_2, where z_i = φ(s_i) for a mini-batch of states {s_i}. The policy is then trained to maximise the reward r(s, z) = r_DIAYN(s, z) + λ r_explore(s), where λ is an exploration coefficient.

DIAYN for Imitation Learning. We devise three different IL methods based on a pre-trained DIAYN model:

Behavioral cloning (BCDIAYN): we look for the policy π_z that best fits the expert trajectories in terms of its likelihood, by minimizing the loss

$-\sum_t \ln \pi_z(a_t \mid s_t)$   (91)

Mutual-information maximization (ERDIAYN): we infer the latent variable z by maximizing the mutual information I(s; z). In practice, we maximize instead the (tractable) variational lower bound.
$\max_z\, \mathbb{E}_{s\sim\rho_e}[\log q(z \mid s)] = \max_z\, \mathbb{E}_{s\sim\rho_e}[z^\top \phi(s)] = \max_z\, z^\top \mathbb{E}_{s\sim\rho_e}[\phi(s)]$   (92)

The above maximization problem admits a closed-form solution: z = E_{s∼ρ_e}[φ(s)], which consists simply in averaging the features of the expert states. This final formula is similar to (8) for ERFB, hence our notation ERDIAYN. Note that this approach was proposed by Eysenbach et al. (2018) for discrete skills (see their Appendix G); here we report the variant for continuous spaces.

Goal-based Imitation (GOALDIAYN): consider one single trajectory τ. At each time step t, we can use the DIAYN pretrained behaviors to reach a state s_{t+k} slightly ahead of s_t in the expert trajectory. Specifically, we set z_t = φ(s_{t+k}), which corresponds to the mean of the skill encoder distribution for s_{t+k}, and use the action given by π_{z_t}(s_t).

C.2 GOAL-GPT

GOAL-GPT (Liu et al., 2022) is a goal-conditioned, transformer-based auto-regressive model π, trained offline using a behavior cloning objective. At train time, given an offline dataset of trajectories, we sample sub-trajectories {s_t, a_t, . . . , s_{t+k}} consisting of k contiguous state-action pairs, we relabel the last state as a goal g = s_{t+k}, and then we minimize the following objective:

$\mathbb{E}_\tau\big[-\ln \pi(a_t, \ldots, a_{t+k} \mid (s_t, g), \ldots, (s_{t+k}, g))\big] = \mathbb{E}_\tau\Big[-\textstyle\sum_{i=0}^{k} \ln \pi(a_{t+i} \mid (s_t, g), \ldots, (s_{t+i}, g))\Big]$

The last equality holds since the model uses causal attention masking. We can use GOAL-GPT to perform goal-based imitation. Given one single expert trajectory τ = (s^e_1, . . . , s^e_T), we divide the trajectory into segments of equal length k. For the segment (s^e_t, . . . , s^e_{t+k}), we set the goal g to the last expert state, g = s^e_{t+k}, and we execute the k actions predicted by the model, π(a_{t+i} | (s_t, g), . . . , (s_{t+i}, g)) for all i ∈ {1, . . . , k}. Here (s_t, . . . , s_{t+i}) is the history of the last i states generated while interacting with the environment to imitate the expert's trajectory.

C.3 MASKDP

Masked Decision Prediction (Liu et al., 2022, MASKDP) is a self-supervised pretraining method. Unlike the autoregressive action prediction used in GOAL-GPT, it applies a masked autoencoder to state-action trajectories: states and actions are randomly masked and the model is trained to reconstruct the missing data. It uses an encoder-decoder architecture h, where both encoder and decoder are bidirectional transformers. The model is trained to reconstruct a sub-trajectory given a masked view of itself, i.e., τ̂ = h(masked(τ)) ≈ τ. At train time, given a sub-trajectory τ of state-action pairs, we apply random masking on states and actions independently. The encoder processes only the unmasked states and actions. The decoder operates on the whole sub-trajectory of both visible and masked encoded states and actions, replacing each masked element with a shared learned vector (called a mask token). The overall model is learned end-to-end using a reconstruction loss (mean squared error).

We can use MASKDP to perform goal-based imitation as follows: given one single expert trajectory τ = (s^e_1, . . . , s^e_T), at each time step t, we use the MASKDP model to predict the actions necessary to reach the goal s^e_{t+k}. To this end, we feed the model with the masked sequence (s_t, _, _, . . . , _, s^e_{t+k}) of length k, where _ denotes a missing element; then, we execute only the first action predicted by the model.
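All of the goal-based imitation procedures described here (GOALDIAYN, GOAL-GPT, MASKDP, and GOAL-TD3 below) share the same test-time loop: at each step, the goal is set to the expert state k steps ahead of the current time index. A minimal sketch under assumed environment and policy interfaces:

```python
def goal_based_imitation(env, goal_policy, expert_states, k=10, horizon=1000):
    """Sketch of the shared goal-based imitation loop at test time.

    Assumed interfaces (illustrative): goal_policy(obs, goal) -> action wraps any
    goal-conditioned BFM (e.g. a z-conditioned policy with z derived from the goal,
    or a goal-conditioned actor); expert_states is the expert demonstration as a
    sequence of states; env follows a gym-like reset()/step() API.
    """
    obs = env.reset()
    trajectory = [obs]
    for t in range(horizon):
        # track the expert state k steps ahead of the current time index,
        # clipped to the last state of the demonstration
        goal = expert_states[min(t + k, len(expert_states) - 1)]
        action = goal_policy(obs, goal)
        obs, reward, done, info = env.step(action)
        trajectory.append(obs)
        if done:
            break
    return trajectory
```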
C.4 GOAL-TD3

For GOAL-TD3, we pre-train offline goal-conditioned policies π(a | s, g) using the sparse reward r(s, g) = 1{‖s − g‖_2 ≤ ε} for some small value of ε. Training uses Hindsight Experience Replay (Andrychowicz et al., 2017, HER). At test time, GOAL-TD3 can implement goal-based IL: at each time step t, we select the policy π(a_t | s_t, s^e_{t+k}), where the goal s^e_{t+k} is the state k steps ahead of s_t in the expert trajectory.

Figure 5: Maze, Walker, Cheetah and Quadruped environments used in our experiments.

D EXPERIMENTAL SETUP

In this section we provide additional information about our experiments.

D.1 ENVIRONMENTS

All the environments considered in this paper are based on the DeepMind Control Suite (Tassa et al., 2018b). In total, we have 4 environments and 21 tasks.

Point-mass-maze: a 2-dimensional continuous maze with four rooms. The states are 4-dimensional vectors consisting of the positions and velocities of the point mass (x, y, v_x, v_y), and the actions are 2-dimensional vectors. The initial state location is sampled uniformly from the top-left room. We consider 8 tasks: four goal-reaching tasks (reach_top_left, reach_top_right, reach_bottom_left and reach_bottom_right) consist in reaching a goal in the middle of each room, described by its (x, y) coordinates; two looping tasks encourage the agent to navigate through all the rooms while drawing two different shapes (a square shape close to the maze's borders for the square task, a diamond shape for the diamond task); the fast_slow task encourages the agent to loop around the maze while maintaining a large velocity when moving horizontally and a small velocity when moving vertically; finally, reach_bottom_left_long consists in reaching the bottom-left room following the long path (by penalizing the agent for moving counterclockwise).

Walker: a planar walker. States are 24-dimensional vectors consisting of the positions and velocities of the robot joints, and actions are 6-dimensional vectors. We consider 5 different tasks: the walker_stand reward is a combination of terms encouraging an upright torso and some minimal torso height, while the walker_walk and walker_run rewards include a component encouraging some minimal forward velocity. The walker_flip reward is a combination of terms encouraging an upright torso and some minimal angular momentum, while the walker_spin reward encourages only some minimal angular momentum without constraining the torso.

Cheetah: a running planar biped. States are 17-dimensional vectors consisting of the positions and velocities of the robot joints, and actions are 6-dimensional vectors. We consider 4 different tasks: the cheetah_walk and cheetah_run rewards are linearly proportional to the forward velocity up to some desired value: 2 m/s for walk and 10 m/s for run. Similarly, the cheetah_walk_backward and cheetah_run_backward rewards encourage reaching some minimal backward velocities.

Quadruped: a four-legged ant navigating in 3D space. States and actions are 78-dimensional and 12-dimensional vectors, respectively. We consider 4 tasks: the quadruped_stand reward encourages an upright torso; quadruped_walk and quadruped_run include a term encouraging some minimal torso velocities; quadruped_walk also includes a term encouraging some minimal height of the center of mass; quadruped_jump encourages some minimal height.

D.2 DATASETS AND EXPERT TRAJECTORIES

We used standard unsupervised datasets for the four domains, generated by Random Network Distillation (RND).
They can be downloaded following the instructions in the github repository of Yarats et al. (2022) (https://github.com/denisyarats/exorl). We use 5000 trajectories for the locomotion tasks (walker, cheetah and quadruped) and 10000 trajectories for point-mass-maze.

Expert trajectories for the 21 tasks are generated by task-specific trained TD3 agents. We train TD3 online for the locomotion tasks. We train TD3 offline for the navigation tasks, since the maze's unsupervised dataset has good coverage for learning the optimal behaviors.

D.3 ARCHITECTURES

Forward-Backward: The backward representation network B(s) is a feedforward neural network with two hidden layers, each with 256 units, that takes a state as input and outputs a d-dimensional embedding. The output of B can be either L2-projected onto the sphere of radius √d or batch-normalized. Using Batchnorm makes FB invariant to reward translation since it ensures E_{s∼ρ}[B(s)] = 0. For the forward network F(s, a, z), we first preprocess (s, a) and (s, z) separately by two feedforward networks, each with a single hidden layer (with 1024 units), into a 512-dimensional space. We then concatenate the two outputs and pass them into two heads of feedforward networks (each with one hidden layer of 1024 units) that output a d-dimensional vector. For the policy network π(s, z), we first preprocess s and (s, z) separately by two feedforward networks with a single hidden layer (with 1024 units) into a 512-dimensional space. We then concatenate the two outputs and pass them into another single-hidden-layer feedforward network (with 1024 units) that outputs a d_A-dimensional vector, to which we apply a Tanh activation since the action space is [−1, 1]^{d_A}. For all the architectures, we apply layer normalization (Ba et al., 2016) and a Tanh activation in the first layer in order to standardize the states and actions. We use ReLU for the remaining layers. We also pre-normalize z, i.e., z ← √d · z/‖z‖_2, in the input of F, π and ψ. For the maze environment, we added an additional hidden layer after the preprocessing (for both the policy and forward networks), as it helped improve the results.

Imitation Learning Baselines: For all IL baselines, we use feedforward neural networks with two hidden layers, each with 1024 units, to represent the actor, critic and discriminator networks. We apply layer normalization and a Tanh activation in the first layer in order to standardize the states and actions.

BFMs: For DIAYN, we have three networks. The policy network π(s, z) is similar to that of FB: it preprocesses s and (s, z) separately by two feedforward networks with a single hidden layer (with 1024 units) into a 512-dimensional space, then concatenates the two outputs and passes them into another two-hidden-layer feedforward network (with 1024 units) that outputs a d_A-dimensional vector, to which a Tanh activation is applied since the action space is [−1, 1]^{d_A}. The critic network Q(s, a, z) is similar to F in FB but with a scalar output: it preprocesses (s, a) and (s, z) separately by two feedforward networks with a single hidden layer (with 1024 units) into a 512-dimensional space, then concatenates the two outputs and passes them into two heads of feedforward networks (each with two hidden layers of 1024 units) that output a scalar. The skill encoder network φ(s) is a feedforward neural network with two hidden layers, each with 256 units, that takes a state as input and outputs a d-dimensional embedding.
The output of φ is L2-projected onto the unit hypersphere.

For GOAL-TD3, we use the same architectures for both the policy and critic networks as DIAYN, and we replace the latent skill z by a goal state.

GOAL-GPT uses an auto-regressive transformer architecture with causal attention masking and 5 attention layers. Each attention layer has 4 attention heads with 256 hidden dimensions. MASKDP uses an encoder-decoder architecture. Both encoder and decoder are bidirectional transformers with respectively 3 and 2 attention layers. Each attention layer has 4 attention heads with 256 hidden dimensions. More details about the architectures of both GOAL-GPT and MASKDP can be found in Liu et al. (2022).

D.4 HYPERPARAMETERS

Table 1: Hyperparameters used for FB pretraining.
Hyperparameter | Walker | Cheetah | Quadruped | Maze
Representation dimension | 100 | 50 | 50 | 100
Batch size | 2048 | 2048 | 1024 | 1024
Discount factor γ | 0.98 | 0.98 | 0.98 | 0.99
Optimizer | Adam | Adam | Adam | Adam
Learning rate of F | 10^-4 | 10^-4 | 10^-4 | 10^-4
Learning rate of B | 10^-4 | 10^-4 | 10^-4 | 10^-6
Learning rate of π | 10^-4 | 10^-4 | 10^-4 | 10^-6
Normalization of B | L2 | None | L2 | Batchnorm
Momentum for target networks | 0.99 | 0.99 | 0.99 | 0.99
Stddev for policy smoothing | 0.2 | 0.2 | 0.2 | 0.2
Truncation level for policy smoothing | 0.3 | 0.3 | 0.3 | 0.3
Regularization weight for orthonormality | 1 | 1 | 1 | 1

Table 2: Hyperparameters used for IL baselines.
Hyperparameter | Walker | Cheetah | Quadruped | Maze
Common
Batch size | 512 | 512 | 512 | 512
Discount factor γ | 0.98 | 0.98 | 0.98 | 0.99
Optimizer | Adam | Adam | Adam | Adam
Learning rate | 10^-4 | 10^-4 | 10^-4 | 10^-4
Momentum for target networks | 0.99 | 0.99 | 0.99 | 0.99
Stddev for policy smoothing | 0.2 | 0.2 | 0.2 | 0.2
Truncation level for policy smoothing | 0.3 | 0.3 | 0.3 | 0.3
Discriminator's training steps | 5×10^5 | 5×10^5 | 5×10^5 | 5×10^5
ORIL
Gradient penalty | 2 | 2 | 10 | 10
Positive-unlabeled coefficient | 0 | 0 | 0 | 0
SMODICE
Gradient penalty | 2 | 10 | 2 | 10
Divergence | χ² | χ² | χ² | χ²
LOBSDICE
Gradient penalty | 10 | 10 | 2 | 10
Divergence | KL | KL | KL | KL
DEMODICE
Gradient penalty | 10 | 10 | 2 | 10
Divergence | KL | KL | KL | KL
TD3IL
Fixed sampling ratio | 0.002 | 0.002 | 0.002 | 0.002

Table 3: Hyperparameters used for DIAYN and GOAL-TD3.
Hyperparameter | Walker | Cheetah | Quadruped | Maze
Common
Batch size | 512 | 512 | 512 | 512
Discount factor γ | 0.98 | 0.98 | 0.98 | 0.99
Optimizer | Adam | Adam | Adam | Adam
Momentum for target networks | 0.99 | 0.99 | 0.99 | 0.99
Stddev for policy smoothing | 0.2 | 0.2 | 0.2 | 0.2
Truncation level for policy smoothing | 0.3 | 0.3 | 0.3 | 0.3
DIAYN
Actor learning rate | 10^-5 | 10^-5 | 10^-5 | 10^-5
Critic learning rate | 10^-4 | 10^-4 | 10^-4 | 10^-4
Skill encoder learning rate | 10^-4 | 10^-4 | 10^-4 | 10^-4
Exploration coefficient | 1 | 1 | 0.1 | 1
P-Her ratio | 0.5 | 0.5 | 0.5 | 0.5
Latent dimension | 50 | 25 | 25 | 25
GOAL-TD3
Actor learning rate | 10^-4 | 10^-5 | 10^-4 | 10^-5
Critic learning rate | 10^-4 | 10^-5 | 10^-4 | 10^-4
HER ratio | 1 | 0.5 | 0.75 | 1

Table 4: Hyperparameters used for GOAL-GPT and MASKDP.
Hyperparameter | Walker | Cheetah | Quadruped | Maze
Common
Batch size | 512 | 512 | 512 | 512
Optimizer | Adam | Adam | Adam | Adam
Learning rate | 10^-4 | 10^-4 | 10^-4 | 10^-4
MASKDP
Context size | 64 | 64 | 64 | 64
GOAL-GPT
Context size | 16 | 16 | 32 | 32

Hyperparameter finetuning. We finetune the hyperparameters of the FB models for each domain by performing hyperparameter sweeps over the batch size, learning rates, representation dimension, orthonormality regularization and normalization of the backward output. We evaluate each model using its reward-based performance on downstream tasks for each domain.
We select the hyperparameters that lead to the best averaged performance over each domain's tasks and over three random seeds. We re-train the final FB models with the selected hyperparameters for 10 different random seeds.

For the imitation learning baselines, we finetune each baseline's hyperparameters for each domain by performing a hyperparameter sweep on a representative task (e.g., walker_walk for walker and cheetah_walk for cheetah). We did not run hyperparameter sweeps for each baseline and task, since this would have been too intensive given the number of setups. For the other BFMs (DIAYN, GOAL-TD3, GOAL-GPT and MASKDP), we finetune each of them by performing a hyperparameter sweep on their goal-based imitation performance on a representative task for each domain. We select the hyperparameters that lead to the best averaged performance over three random seeds. We re-train the final BFM models with the selected hyperparameters for 10 different random seeds. This results in one hyperparameter tuning per domain (not per task), both for the BFMs and for the IL baselines. Importantly, note that the BFM tuning is shared by all algorithms using a given BFM model (e.g., ERFB, BBELLFB, . . . for FB), while the IL baselines have a separate tuning per algorithm.

E IMITATION LEARNING EXPERIMENTS

In this section we report the complete set of results for the standard IL protocol described in Sect. 5. We also present the additional experiments we conducted to assess the performance of FB-IL methods and baselines.

E.1 DETAILED VIEW OF THE RESULTS WITH K = 1 EXPERT DEMONSTRATIONS

We start by providing results for each task. Tab. 5, 6 and 7 contain the cumulative reward obtained by the IL policy recovered by each algorithm. The protocol is the same as in Sect. 5. We further report the performance of the expert agent; expert performance is computed over the 250 trajectories we generated for the IL experiment. In Tab. 8 we also report the average time required by the methods to compute the IL policy.
Algorithm fast slow loop reach bottom left reach bottom left long reach bottom right reach top left reach top right square BC 451.7(49.4) 506.9(38.7) 346.9(45.7) 263.4(51.4) 167.5(33.4) 476.3(65.2) 452.6(56.0) 757.8(49.0) DWBC 327.9(42.2) 634.4(51.1) 273.9(64.0) 91.7(100.4) 120.5(29.8) 453.3(66.9) 205.1(43.6) 587.0(44.6) ORIL 4.6(4.6) 42.5(5.7) 0.0(0.0) 191.7(132.7) 0.0(0.0) 532.6(37.3) 0.0(0.0) 143.5(2.1) ORIL-O 144.8(3.9) 524.1(6.5) 756.1(13.0) 244.4(32.6) 298.1(47.1) 952.2(5.0) 785.6(10.6) 390.8(28.4) ORIL-OO 148.9(6.6) 488.9(9.0) 715.3(25.2) 218.1(33.3) 237.0(52.8) 950.4(10.8) 740.7(23.7) 417.2(30.4) OTRTD3 121.7(16.4) 675.4(21.2) 486.8(57.4) 361.8(34.6) 274.0(51.5) 401.5(57.0) 549.8(53.8) 640.0(17.8) SQIL 700.0(54.7) 741.0(54.9) 696.2(37.8) 533.9(51.1) 548.5(40.7) 926.0(18.3) 739.9(27.0) 694.5(60.5) TD3IL 104.9(28.9) 752.2(36.8) 830.5(0.2) 11.1(5.7) 668.8(35.3) 963.4(1.7) 829.6(0.6) 244.3(55.5) DEMODICE 107.4(10.9) 432.2(12.7) 20.9(2.5) 928.3(95.5) 19.9(1.3) 111.2(4.6) 26.2(2.5) 416.4(13.4) SMODICE 117.1(3.1) 442.7(3.1) 105.2(3.0) 1340.2(37.2) 88.0(1.9) 85.9(1.3) 115.5(2.7) 296.7(3.6) LOBSDICE 86.7(1.4) 409.6(2.3) 15.0(0.2) 2048.2(33.8) 12.2(0.3) 96.0(1.1) 16.5(0.5) 299.9(3.7) IQLEARN 48.0(15.3) 148.2(16.5) 0.0(0.0) 196.8(73.7) 0.1(0.1) 42.1(11.4) 0.2(0.2) 114.4(15.7) BCDIAYN 16.0(1.7) 142.3(7.8) 545.8(21.3) 224.6(23.6) 211.4(16.9) 737.5(20.9) 584.6(21.7) 116.1(4.5) ERDIAYN 0.4(0.0) 87.2(7.3) 517.7(18.4) 374.8(23.0) 33.8(7.3) 568.5(26.0) 559.9(20.7) 46.2(2.9) GOALDIAYN 215.2(7.6) 388.6(8.5) 580.8(17.9) 47.8(27.1) 299.4(17.1) 585.5(26.3) 619.1(16.2) 414.3(14.5) GOAL-TD3 824.5(2.6) 800.9(7.9) 781.5(3.1) 473.6(12.3) 335.4(16.3) 948.6(1.3) 760.1(5.6) 904.6(1.2) MASKDP 692.4(16.0) 763.8(12.1) 592.0(12.9) 461.5(12.9) 293.8(12.3) 913.2(5.6) 654.8(9.3) 821.3(11.1) GOAL-GPT 710.8(3.5) 843.2(4.6) 713.3(2.6) 606.8(2.7) 471.6(14.5) 665.0(8.2) 716.6(2.5) 840.8(1.7) BCFB 392.8(9.0) 846.8(5.9) 734.5(11.4) 156.4(19.0) 548.5(12.0) 962.4(2.3) 758.1(8.6) 873.4(7.0) ERFB 347.1(4.1) 697.6(10.1) 817.4(0.8) 77.3(8.1) 527.7(13.1) 956.5(1.1) 819.5(1.0) 771.9(5.5) RERFB 284.8(8.9) 637.8(15.6) 776.1(5.2) 347.7(39.7) 570.6(11.2) 945.5(2.9) 760.1(7.4) 718.4(11.7) BBELLFB 236.2(8.7) 648.8(15.0) 495.8(19.7) 53.9(21.7) 313.3(14.3) 776.5(15.7) 412.4(18.3) 445.1(13.2) FMFB 221.2(8.3) 598.6(11.8) 356.5(20.6) 129.9(35.3) 178.8(13.5) 754.0(15.4) 432.0(20.0) 413.9(11.6) FBLOSSFB 219.4(8.4) 638.0(14.0) 377.0(19.2) 20.7(35.4) 375.5(15.0) 824.9(13.6) 297.3(18.3) 473.5(14.3) DMFB 229.8(7.9) 569.1(13.3) 251.5(17.2) 163.8(41.3) 277.7(16.2) 746.9(19.2) 271.5(18.1) 447.8(14.2) GOALFB 572.9(3.3) 818.1(3.3) 799.7(1.2) 543.9(9.9) 617.6(4.1) 954.6(1.0) 793.7(1.7) 885.5(1.8) Expert 935.5(0.6) 912.4(0.5) 813.4(1.3) 666.4(1.2) 710.3(1.0) 949.3(1.1) 813.0(1.2) 952.9(0.8) Table 5: Cumulative reward for each task in the maze environment, averaged over repetitions. Experiments are done with 𝐾= 1 expert demonstrations. Standard deviation is reported in parenthesis. E.2 WARM START FOR FB-IL METHODS As mentioned in the text, some FB methods require a gradient descent over the policy parameter 𝑧, and we initialized the gradient descent with a warm start , setting the initial guess 𝑧0 to the value (8) used in ERFB, which can be readily computed. Fig. 7 illustrates the performance of BCFB and BBELLFB with and without ERFB warm-start averaged over all environments and tasks. While BBELLFB is relatively robust, BCFB performs poorly when the policy embedding 𝑧is optimized starting from a random point on the unit sphere. 
While BCFB benefits from warm-start, notice that the initial value 𝑧0 obtained from ERFB is not stationary and the BC loss keeps decreasing over iterations and eventually converges to a different policy πœ‹π‘§. This can lead to policies with significantly different behavior as illustrated in Fig. 6. While BCFB without warm start tries to imitate the expert and fails in reproducing the trajectory, the policy returned by ERFB reaches the goal but takes a different path w.r.t. the expert. On the other hand, BCFB with warm start successfully shifts the initial ERFB policy to better replicate the expert actions and eventually reproduce its trajectory. Published as a conference paper at ICLR 2024 Walker Algorithm flip run spin stand walk BC 28.3(1.8) 26.3(1.4) 21.5(3.3) 166.7(6.0) 27.2(1.0) DWBC 57.2(2.0) 27.6(1.6) 190.3(13.7) 191.0(11.5) 39.0(2.9) ORIL 648.0(23.5) 320.2(24.1) 900.1(38.4) 886.9(33.5) 567.3(60.3) ORIL-O 143.0(20.7) 167.4(22.6) 9.0(0.8) 349.1(8.1) 114.7(16.1) ORIL-OO 92.2(7.9) 122.5(20.8) 9.3(0.7) 335.7(4.7) 72.8(5.8) OTRTD3 98.3(4.8) 27.4(1.1) 183.5(15.7) 285.8(18.6) 64.5(5.7) SQIL 561.0(47.6) 309.1(33.1) 967.5(5.3) 807.2(50.6) 681.4(56.6) TD3IL 673.7(23.8) 293.4(22.8) 973.0(4.5) 875.5(40.2) 736.0(43.6) DEMODICE 245.1(1.1) 87.0(0.2) 474.2(8.1) 389.7(0.6) 195.7(0.5) SMODICE 244.3(0.5) 89.5(0.1) 487.5(0.8) 387.0(0.5) 196.2(0.3) LOBSDICE 243.6(0.4) 89.1(0.1) 481.9(0.8) 387.3(0.5) 195.1(0.4) IQLEARN 34.5(2.6) 23.1(0.6) 57.6(10.1) 139.8(6.1) 26.9(1.5) BCDIAYN 36.6(0.6) 23.5(0.6) 17.5(0.6) 195.6(4.4) 33.3(0.8) ERDIAYN 77.0(2.8) 25.9(1.0) 89.8(4.8) 151.7(7.2) 104.8(6.6) GOALDIAYN 104.6(3.2) 26.7(0.7) 451.7(9.3) 145.9(6.7) 131.5(5.5) GOAL-TD3 593.2(3.3) 216.8(1.4) 859.8(1.5) 908.5(2.3) 865.3(2.2) MASKDP 65.7(2.4) 67.6(1.0) 116.0(4.5) 162.5(1.2) 32.2(0.3) GOAL-GPT 253.4(0.2) 102.9(0.1) 375.1(0.4) 406.0(1.1) 214.8(0.3) BCFB 579.7(6.6) 262.2(2.5) 793.6(13.7) 742.9(9.3) 891.5(2.9) ERFB 552.0(5.0) 281.2(2.1) 814.4(17.3) 721.7(8.8) 836.3(5.7) RERFB 553.1(7.7) 343.9(5.2) 900.3(8.9) 672.5(11.5) 735.4(10.8) BBELLFB 642.6(3.8) 322.1(4.0) 896.0(12.1) 720.7(9.6) 789.2(7.9) FMFB 606.5(3.8) 228.1(6.1) 836.4(13.8) 706.3(9.8) 808.9(7.6) FBLOSSFB 643.3(3.7) 282.7(5.0) 910.6(11.3) 706.1(10.9) 811.1(7.0) DMFB 614.1(3.8) 256.0(5.4) 824.6(15.7) 694.3(10.3) 832.8(4.9) GOALFB 593.6(3.2) 275.4(1.9) 903.7(2.2) 715.4(8.6) 820.3(4.9) Expert 977.2(0.8) 845.7(0.7) 986.2(0.4) 987.5(0.6) 978.7(0.8) Table 6: Cumulative reward for each task in the walker environment, averaged over repetitions. Experiments are done with 𝐾= 1 expert demonstrations. Standard deviation is reported in parenthesis. 
Cheetah Quadruped Algorithm run run backward walk walk backward jump run stand walk BC 62.2(1.8) 88.6(10.5) 237.0(31.4) 675.3(35.8) 156.2(12.6) 197.5(19.3) 429.8(42.7) 295.3(39.3) DWBC 63.8(1.6) 88.6(8.9) 247.9(26.6) 646.5(41.5) 161.1(14.4) 195.6(20.3) 444.5(49.0) 292.0(39.8) ORIL 200.0(9.8) 365.6(5.8) 655.2(47.1) 952.7(9.5) 801.0(28.3) 571.4(13.6) 953.6(20.6) 763.4(35.2) ORIL-O 9.6(6.6) 31.0(20.8) 421.8(56.1) 823.6(23.1) 660.7(5.1) 454.1(37.8) 966.4(4.5) 555.8(23.9) ORIL-OO 103.5(16.7) 243.5(30.2) 459.9(55.7) 820.1(36.4) 677.5(14.7) 449.3(18.1) 958.2(3.4) 584.3(35.0) OTRTD3 91.2(3.8) 107.5(15.1) 515.4(36.3) 681.6(19.6) 363.2(47.5) 31.6(3.3) 277.1(31.3) 61.6(6.3) SQIL 194.7(16.9) 187.8(22.8) 120.4(38.8) 569.1(65.9) 266.1(57.2) 473.5(17.3) 361.3(33.9) 273.2(42.2) TD3IL 214.1(10.2) 347.9(5.1) 861.3(35.9) 968.8(3.2) 808.2(27.8) 552.6(9.7) 855.1(54.8) 813.8(17.5) DEMODICE 39.6(0.6) 54.4(1.0) 197.3(1.9) 250.9(3.3) 154.1(3.7) 113.8(2.5) 238.1(2.9) 114.1(1.9) SMODICE 43.1(0.5) 56.9(0.8) 205.5(2.1) 256.7(3.1) 139.0(2.5) 97.8(2.1) 205.9(4.1) 109.4(2.0) LOBSDICE 47.2(1.5) 53.6(1.0) 234.9(4.4) 225.2(3.8) 172.2(2.8) 126.4(2.4) 242.7(5.6) 123.6(1.6) IQLEARN 46.5(2.2) 85.2(8.5) 320.9(39.8) 772.5(21.9) 99.8(6.3) 71.3(6.8) 242.7(10.5) 237.0(6.0) BCDIAYN 1.0(0.0) 0.8(0.0) 4.9(0.2) 4.7(0.2) 153.4(3.8) 169.0(1.7) 272.3(3.5) 147.0(1.7) ERDIAYN 1.6(0.0) 1.6(0.1) 8.2(0.2) 7.6(0.3) 180.7(2.5) 148.0(1.7) 243.9(5.1) 124.0(2.3) GOALDIAYN 17.2(1.2) 51.3(3.9) 376.5(8.2) 499.5(10.5) 147.4(3.6) 125.2(2.4) 196.0(4.7) 157.7(4.5) GOAL-TD3 83.5(2.3) 171.3(3.4) 779.8(5.6) 653.1(7.3) 732.4(5.5) 426.6(4.0) 946.6(1.3) 760.7(4.1) MASKDP 63.5(1.0) 49.4(1.0) 348.7(7.9) 831.4(8.5) 150.1(3.7) 90.2(2.6) 299.0(6.7) 161.0(4.3) GOAL-GPT 49.2(0.7) 97.4(1.3) 431.8(11.0) 675.2(5.4) 543.0(3.0) 362.0(0.5) 638.7(3.6) 320.9(1.2) BCFB 176.6(7.3) 184.5(6.7) 646.9(21.0) 629.4(24.7) 784.2(5.7) 332.4(9.5) 967.3(2.1) 800.2(5.7) ERFB 303.4(3.5) 207.4(5.7) 888.7(6.5) 885.9(12.0) 798.5(4.4) 437.3(9.5) 971.1(0.4) 703.4(7.7) RERFB 252.0(3.7) 228.2(4.0) 845.8(6.3) 854.8(16.3) 632.2(16.9) 393.8(10.9) 864.4(13.1) 649.6(10.5) BBELLFB 305.0(2.7) 231.8(5.6) 674.9(15.8) 829.0(12.5) 629.8(12.4) 338.3(10.0) 941.3(5.4) 762.9(10.1) FMFB 290.9(3.2) 210.3(6.8) 468.1(22.1) 834.4(15.7) 534.1(14.9) 241.0(11.0) 782.3(16.4) 612.1(11.4) FBLOSSFB 299.3(3.0) 237.7(4.5) 632.4(18.4) 821.6(15.1) 643.8(10.7) 323.0(10.0) 935.3(6.3) 768.5(9.9) DMFB 291.1(3.1) 206.9(7.1) 473.4(22.1) 836.9(15.6) 543.7(15.0) 246.4(11.2) 771.1(17.6) 607.3(11.2) GOALFB 301.4(3.5) 205.7(5.5) 874.8(7.1) 887.4(13.2) 776.6(4.4) 465.5(7.8) 949.8(1.3) 768.6(5.3) Expert 910.9(0.1) 627.9(1.0) 992.3(0.0) 989.9(0.0) 894.1(1.6) 934.0(1.5) 971.6(1.3) 965.7(1.4) Table 7: Cumulative reward for each task in the cheetah and quadruped environments, averaged over repetitions. Experiments are done with 𝐾= 1 expert demonstrations. Standard deviation is reported in parenthesis. E.3 FB MODEL QUALITY In our experiments we averaged the performance over 10 FB models. To evaluate the robustness of the pre-trained steps we report in Fig. 8 the fraction of models 𝐹(𝜏) with a combined normalized score above a certain threshold 𝜏. Let 𝑀= 10 be the number of FB models and π‘₯π‘šbe the score of model π‘š averaged over environments, tasks, FB-IL methods, and repetitions. 
then F(τ) = (1/M) Σ_{m=1}^{M} 1[x_m > τ] (Agarwal et al., 2021). We can notice that all the drop in probability is concentrated towards high values (between 0.5 and 0.7), denoting a small variability in the performance of the FB models.

Table 8: Average time for generating an imitation learning policy once provided a set of expert demonstrations. This clearly shows the advantage of BFMs, which only need to infer one or multiple policies without explicit training.
Offline IL baselines: BC 3h14m | DWBC 9h32m | ORIL 7h45m | ORIL-O 6h56m | ORIL-OO 7h12m | OTRTD3 4h30m | SQIL 10h18m | TD3IL 7h3m | DEMODICE 12h59m | SMODICE 6h35m | LOBSDICE 6h28m | IQLEARN 13h17m
BFM baselines: BCDIAYN 1m | ERDIAYN <5s | GOALDIAYN <5s | GOAL-TD3 <5s | MASKDP <5s | GOAL-GPT <5s
FB-IL methods: BCFB 1m | ERFB <5s | RERFB <5s | BBELLFB 4m | FMFB 3m | FBLOSSFB 4m | DMFB 3m | GOALFB <5s

Figure 6: Examples of trajectories in the maze with different FB-IL methods (panels include (a) BCFB without warm start and (c) BCFB with warm start). The expert trajectory is reported in red while the IL trajectory is reported in green.

Figure 7: Imitation score averaged over domains, tasks and repetitions for BCFB and BBELLFB with and without the ERFB warm start, using K = 1 expert demonstrations.

Figure 8: Performance profiles on combined tasks and FB-IL methods (x-axis: average reward τ; y-axis: F(τ), the fraction of FB models with average reward > τ; one curve per domain: Cheetah, Maze, Quadruped, Walker). Shaded regions show pointwise 95% confidence bands based on percentile bootstrap with stratified sampling.

E.4 EFFECT OF THE NUMBER OF DEMONSTRATIONS

We investigate whether the performance of the IL methods can be improved by increasing the number of expert demonstrations. In Fig. 9 we report the aggregate performance for all the tested algorithms. As expected, BC (and similarly DWBC) improves significantly with the number of expert trajectories. However, even with 100 trajectories (corresponding to 1M expert transitions), BC only marginally outperforms FB-IL methods using a single trajectory. Similarly, DEMODICE and IQ-LEARN enjoy a significant improvement, but they are still unable to match the performance of FB-IL methods. Other baselines are practically unaffected by the number of expert trajectories, given that the bottleneck is mostly the inefficient usage of unsupervised samples. FB-IL methods are overall stable, achieving very good results already with a single trajectory. We think that increasing the number of trajectories does not help pre-trained models since we are limited by the approximation errors of the model. We may be able to overcome this limit by allowing fine-tuning of the model, but this is outside the scope of the paper. Notice that many approaches (GOALDIAYN, MASKDP, GOAL-TD3, GOAL-GPT, GOALFB) are not able to deal with multiple expert trajectories.

E.5 GENERALIZATION TO DIFFERENT STARTING POINT DISTRIBUTIONS

We investigate how well imitation algorithms generalize when the demonstrations are collected under a different initial state distribution than the one used for evaluation.
Specifically, we repeat the experiments of Section 5 using 200 new expert trajectories of 1000 steps, generated by the same TD3-based expert policies of Section 5 but changing the initial state distributions as follows:
For the walker, quadruped, and cheetah domains, we initialize each trajectory directly in the expected long-term stationary behavior (i.e., a walking position for the walk task, a running one for run, etc.). Concretely, we achieved that by taking, for each domain and task, the state observed at step 500 of each of the 200 trajectories used in Section 5 (which we made sure to be representative of the desired stationary behavior) and by rolling out 1000 steps using the expert policy starting from each of these 200 new initial states.
For the maze domain, we randomly initialize the agent position in the upper-right room instead of the original upper-left one.
Then, we use these new demonstrations to test all FB-IL methods as well as the best-performing baselines from each IL algorithm family. Everything else (protocol, hyperparameters, etc.) is exactly as in Section 5. In particular, each imitation policy is tested on the original initial state distribution of each domain (i.e., a random position and orientation for walker, cheetah, and quadruped, and a random position in the upper-left room for maze), which is now very different from the one used to generate the expert demonstrations. The main challenge is that the demonstrations now only show the desired behavior (e.g., how to walk) but not how to reach that behavior (e.g., how to move to a walking position when lying on the floor).
The results are shown in Figure 10 and Figure 11. In Figure 10, we report the average performance loss that each IL method suffers due to distribution shift (computed as the ratio between the imitation score obtained using the modified demonstrations in this section, and the one using the original demonstrations of Section 5). Overall, methods in the BC and FM/DM classes seem to suffer the largest performance loss. Reward-based and goal-based methods are the least impacted by distribution shift, with the former being almost unaffected. Overall, all the FB-IL methods consistently achieve good performance even under distribution shift (with a performance drop between 2% and 22% depending on the method), which is not the case for some baselines (e.g., those in the BC and FM/DM classes). While goal-based methods slightly outperform the task-specific TD3IL baseline in our base setup (Fig. 1), this is not the case anymore in the generalization setup, with the best pretrained method now around 93% of the top non-pretrained performance (Fig. 10). Still, the overall picture in Figs. 1 and 10 is largely the same.
Figure 9: Aggregate normalized score of IL algorithms with different numbers of expert demonstrations (K = 1, 10, 50, 100). Results are averaged over domains, tasks and repetitions.
E.6 DISTRIBUTION MATCHING: MATCHING THE SUCCESSOR MEASURE ON AVERAGE VS AT EACH STATE

The classical distribution matching loss (10) amounts to estimating the divergence between the overall occupation measures of a policy and the expert policy,

$$\mathcal{L}_{\mathcal{R}}(z) := \Big\| \mathbb{E}_{s_0\sim\rho_0} M^{\pi_z}(s_0) - \mathbb{E}_{s_0\sim\rho_0} M^{\pi_e}(s_0) \Big\|^2_{\mathcal{R}}, \qquad (94)$$

as $\mathbb{E}_{s_0\sim\rho_0} M^{\pi}(s_0)$ is the cumulative discounted measure over all states visited by $\pi$ when starting at $s_0\sim\rho_0$.

Figure 10: Generalization experiment: (top) average imitation score; (bottom) ratio between the average imitation score with distribution shift (i.e., using the demonstrations with modified initial state distribution, as described in the first paragraph of Appendix E.5) and without (i.e., using the demonstrations of Section 5). Imitation scores are computed by averaging over all domains, tasks, and repetitions for a single expert demonstration.

Figure 11: Generalization experiment: imitation score for each domain averaged over tasks and repetitions for a single expert demonstration. The imitation score is the ratio between the cumulative return of the algorithm and the cumulative return of the expert.

But as mentioned in Section 4.3, we have chosen the loss

$$\mathcal{L}_{\mathcal{R}}(z) := \mathbb{E}_{s_0\sim\rho_e} \Big\| M^{\pi_z}(s_0) - M^{\pi_e}(s_0) \Big\|^2_{\mathcal{R}} \qquad (95)$$

instead. This compares the successor measures of the expert and $\pi_z$ at each point separately. This is a stricter criterion. For instance, take $S$ to be a cycle $S = \{1, \ldots, n\}$, and take an expert policy that moves to the right on the cycle, $s \mapsto s + 1 \bmod n$. If $\rho_0$ is uniform, the overall occupation measure $\mathbb{E}_{s_0\sim\rho_0} M^{\pi_e}(s_0)$ of that policy is uniform too, and is the same as for a policy that stays in place. Putting $\mathbb{E}_{s_0}$ outside the norm is a simple way to solve this problem. Another way would be to consider distributions over state transitions $(s_t, s_{t+1})$ instead of just states $s_t$, as done, e.g., in (Zhu et al., 2020; Kim et al., 2022a), but this requires changing the foundation model.
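This cycle example is easy to verify numerically. A minimal sketch (not part of the paper's experiments), using the fact that the successor measure of a policy with transition matrix $P$ is $(I - \gamma P)^{-1}$ (here including the $t = 0$ term):

```python
import numpy as np

n, gamma = 5, 0.9
P_right = np.roll(np.eye(n), 1, axis=1)   # expert: s -> s + 1 (mod n)
P_stay = np.eye(n)                        # lazy policy: s -> s

def successor_measure(P):
    # M(s0, s) = sum_{t>=0} gamma^t P(s_t = s | s_0), i.e. (I - gamma P)^{-1}.
    return np.linalg.inv(np.eye(n) - gamma * P)

M_right, M_stay = successor_measure(P_right), successor_measure(P_stay)
rho0 = np.full(n, 1.0 / n)  # uniform initial distribution

# Averaged occupation measures (expectation inside the norm, as in Eq. 94): identical.
print(np.allclose(rho0 @ M_right, rho0 @ M_stay))  # True
# Per-state successor measures (expectation outside the norm, as in Eq. 95): different.
print(np.allclose(M_right, M_stay))                # False
```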
Figure 12: Difference between matching the successor measure at each state, i.e., using the loss $\mathbb{E}_{s_0\sim\rho_e}\,\|M^{\pi_z}(s_0) - M^{\pi_e}(s_0)\|^2_B$, and matching it on average w.r.t. the initial state, i.e., using the loss $\|\mathbb{E}_{s_0\sim\rho_e} M^{\pi_z}(s_0) - \mathbb{E}_{s_0\sim\rho_e} M^{\pi_e}(s_0)\|^2_B$ or $\|\mathbb{E}_{s_0\sim\rho_0} M^{\pi_z}(s_0) - \mathbb{E}_{s_0\sim\rho_0} M^{\pi_e}(s_0)\|^2_B$.

In Figure 12, we report the difference in overall performance between three such variants: putting $\mathbb{E}_{s_0\sim\rho_0}$ inside the norm as in classical distribution matching, putting $\mathbb{E}_{s_0\sim\rho_e}$ inside the norm (thus widening the initial states to all states from which we can estimate successor measures from the demos), and putting $\mathbb{E}_{s_0\sim\rho_e}$ outside the norm (our main choice). The norm chosen is $\|\cdot\|_B$. The variant with $\mathbb{E}_{s_0\sim\rho_e}$ inside underperforms, while the variant with $\mathbb{E}_{s_0\sim\rho_0}$ inside only works well in the presence of our warm start initialization $z_0 = \mathbb{E}_{\rho_e}[B]$. This is not surprising: indeed, with $\mathbb{E}_{s_0\sim\rho_0}$ inside the norm, the warm start is the only way the algorithm can incorporate information from the whole expert trajectory. Without warm start, it only gets information from the earliest part of the trajectory. More precisely, with the expectation $\mathbb{E}_{s_0\sim\rho_0}$ inside the norm, the loss in Thm. 2 becomes

$$\mathcal{L}_B(z) = \mathbb{E}_{s_0\sim\rho_0}\,\Big\| (\operatorname{Cov} B)\, F(s_0, z) - \mathbb{E}\Big[ \sum_{k\ge 0} \gamma^k B(s_{k+1}) \,\Big|\, s_0, \pi_e \Big] \Big\|_2^2 + \mathrm{cst} \qquad (96)$$

from which it is clear that:
1. The model $F$ only matters through $F(s_0, z)$.
2. Along the expert trajectories, states far from the initial state distribution $\rho_0$ are discounted, so the later sections of the expert trajectories are largely discarded.

On the other hand, the version with $\mathbb{E}_{s_0\sim\rho_e}$ outside does not rely on the warm start to incorporate information from the whole expert trajectory: it demonstrates greater robustness to the initialization of $z$, for a negligible difference in performance compared to the $\mathbb{E}_{s_0\sim\rho_0}$-inside-with-warm-start variant. It also corresponds to a finer theoretical criterion for identifying policies, as explained above.

F WAYPOINT IMITATION LEARNING EXPERIMENT

We consider imitating a sequence of yoga poses from the Robo Yoga benchmark (Mendonca et al., 2021). The Robo Yoga benchmark defines 12 different robot positions for the walker and quadruped domains of the Deep Mind Control Suite. Here we focus on walker and generate expert demonstrations by concatenating different poses, keeping each fixed for 100 time steps. Figure 13 shows an example of such a demonstration. Note that the expert is both non-stationary, as the underlying task (i.e., reaching the pose) changes over time, and non-realizable, as the transitions between poses are instantaneous and thus not physically attainable. For this reason, we compare only goal-based IL methods (GOALFB, GOAL-TD3, and GOAL-GPT), which are the only ones capable of dealing with non-stationarity.

Figure 13: Example of a sequence of yoga poses to imitate. From left to right: lie front, legs up, stand up, lie back, kneel, head stand, bridge.

F.1 DETAILED PROTOCOL

We evaluate each goal-based IL method on 1000 sequences of poses. Each sequence is built by first sampling 10 out of the 12 yoga poses without replacement and then building a state trajectory of 1000 steps where each of the 10 poses is kept fixed for 100 consecutive steps. For each of the 1000 test sequences, each method is tested starting from a state randomly generated from the standard initial state distribution of the walker domain. Each generated trajectory is scored in terms of the rewards of the pose sequence it intends to imitate. More precisely, each pose in the Robo Yoga benchmark is associated with a reward function

$$r_g(s_{t+1}) = \begin{cases} 1 & \text{if } \|s_{t+1} - g\| \le 0.55, \\ 0 & \text{otherwise}, \end{cases}$$

where $g$ is the state corresponding to the pose. Letting $(g_t)_{t\ge 0}$ denote a sequence of poses to imitate, the score we compute for the corresponding trajectory $(s_t)_{t\ge 0}$ produced by an IL method is $\sum_{t=0}^{1000} r_{g_{t+1}}(s_{t+1})$. This number is then normalized by the total reward achieved by TD3 trained offline on each pose separately. Each number in Figure 4 is obtained by averaging the scores obtained by the IL method over the 1000 test sequences and 10 pretrained BFMs, plus/minus the standard deviation of the average score of each of the 10 BFMs divided by $\sqrt{10}$.
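A minimal sketch of this scoring procedure (illustrative only: array shapes and helper names are ours, and the normalization by the per-pose TD3 return is omitted):

```python
import numpy as np

def pose_reward(state, goal, radius=0.55):
    # r_g(s) = 1 if ||s - g|| <= 0.55, 0 otherwise.
    return float(np.linalg.norm(np.asarray(state) - np.asarray(goal)) <= radius)

def build_goal_sequence(poses, rng, n_poses=10, hold=100):
    # Sample 10 of the 12 yoga poses without replacement, hold each for 100 steps.
    idx = rng.choice(len(poses), size=n_poses, replace=False)
    return np.repeat(np.asarray(poses)[idx], hold, axis=0)   # shape (1000, state_dim)

def sequence_score(states, goals):
    # states: trajectory (s_0, ..., s_T) produced by the IL method;
    # goals: the 1000-step pose sequence it intends to imitate.
    # Unnormalized score: sum over t of r_{g_{t+1}}(s_{t+1}).
    return sum(pose_reward(s, g) for s, g in zip(states[1:], goals))
```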
F.2 ALGORITHMS AND HYPERPARAMETERS

GOALFB. We pretrain 10 FB models (with 10 different random seeds) as described in App. D.3, using the same hyperparameters reported in Table 1 with only three modifications: we add an extra hidden layer to the $F$ and actor networks, we reduce the learning rate of $B$ to $10^{-6}$, and we train for $3 \times 10^6$ gradient steps. At test time, we imitate pose sequences by using a lookahead of 1. That is, at time $t$ of the produced trajectory we play $a_t = \pi_{z_t}(s_t)$ with $z_t = B(s^{\mathrm{demo}}_{t+1})$, where $s_t$ is the current state and $s^{\mathrm{demo}}_{t+1}$ is the state one step ahead in the demonstration (i.e., the next pose to imitate).

Goal-TD3 and Goal-GPT. We use the same 10 pretrained models considered in the experiments of Sect. 5.1 and 5.2. See App. C and D for the detailed training protocol and hyperparameters. At test time, for GOAL-TD3 we use the same lookahead of 1 as for GOALFB. On the other hand, for GOAL-GPT we use a lookahead of 16, as we found a lookahead of 1 to work poorly. This is likely because GOAL-GPT is pretrained with a context length of 16, and thus tries to reach a goal state 16 steps ahead with an autoregressive (i.e., history-dependent) policy. Setting the lookahead to 1 essentially implies that we execute a Markovian policy, as the history is reset at every step.

F.3 QUALITATIVE EVALUATION

Figure 14 shows an example of a trajectory generated by GOALFB when imitating the pose sequence of Figure 13.

Figure 14: Imitating the sequence of yoga poses from Figure 13 with GOALFB. The agent learns how to quickly transition between each pose despite not being demonstrated how to do so.

Besides being able to reproduce each of the seven poses, GOALFB is capable of transitioning between them despite never being shown how to do so. This is because the underlying FB model, from purely unsupervised pre-training, has potentially learned how to reach each pose from any other state in the data.

G BASELINE RESULTS ON D4RL

In this section, we check that our baseline implementations achieve performance comparable to the results reported in the literature. For this, we report the evaluation of a few baselines on the standard D4RL benchmark (Fu et al., 2020). We consider the same setting as in (Luo et al., 2023). Similarly to (Luo et al., 2023), we use IQL as the learning algorithm for ORIL and OTR, which we found to perform better than TD3 on this particular benchmark. Tab. 9 shows that our implementations achieve performance comparable to the results in the literature, confirming their correctness.7 This also shows that our setting is particularly challenging for *DICE algorithms, which are intrinsically tied to conservative updates.

Table 9: Normalized score for imitation on the D4RL benchmark with K = 1 expert demonstration (standard deviation in parentheses).

                               ORIL (IQL)   ORIL-O (IQL)  ORIL-OO (IQL)  OTR (IQL)    DEMODICE    SMODICE     LOBSDICE
halfcheetah-medium-expert-v2   70.7 (1.5)   83.0 (2.7)    78.5 (2.7)     70.3 (2.8)   43.9 (0.2)  59.6 (0.5)  69.9 (0.7)
halfcheetah-medium-replay-v2   35.0 (0.3)   34.2 (0.3)    35.1 (0.5)     35.0 (0.1)   32.0 (0.2)  34.0 (0.2)  35.1 (0.2)
halfcheetah-medium-v2          42.7 (0.1)   42.4 (0.1)    42.6 (0.1)     42.5 (0.1)   42.6 (0.1)  42.7 (0.1)  42.5 (0.1)
walker2d-medium-expert-v2      103.1 (2.6)  102.8 (4.0)   105.4 (3.0)    100.1 (4.4)  99.8 (1.4)  107.9 (0.2) 106.2 (1.3)
walker2d-medium-replay-v2      32.1 (1.3)   43.4 (1.5)    42.1 (2.4)     50.5 (0.9)   36.3 (1.5)  20.2 (1.1)  37.4 (1.2)
walker2d-medium-v2             64.7 (2.6)   68.3 (2.0)    62.1 (2.5)     65.3 (4.6)   68.8 (1.4)  53.0 (1.3)  56.9 (1.9)

7 It is likely that better results can be obtained through hyper-parameter tuning (which we did not perform in these experiments).