# Imitation by Predicting Observations

Andrew Jaegle, Yury Sulsky, Arun Ahuja, Jake Bruce, Rob Fergus, Greg Wayne (DeepMind)

**Abstract.** Imitation learning enables agents to reuse and adapt the hard-won expertise of others, offering a solution to several key challenges in learning behavior. Although it is easy to observe behavior in the real world, the underlying actions may not be accessible. We present a new method for imitation solely from observations that achieves comparable performance to experts on challenging continuous control tasks, while also exhibiting robustness in the presence of observations unrelated to the task. Our method, which we call FORM (for Future Observation Reward Model), is derived from an inverse RL objective and imitates using a model of expert behavior learned by generative modelling of the expert's observations, without needing ground-truth actions. We show that FORM performs comparably to a strong baseline IRL method (GAIL) on the DeepMind Control Suite benchmark, while outperforming GAIL in the presence of task-irrelevant features.

## 1. Introduction

The goal of imitation is to learn to produce behavior that matches that of an expert on unseen data, given demonstrations of the expert's behavior (Abbeel & Ng, 2004; Osa et al., 2018). The field of imitation learning offers tools for learning behavior when programmed rewards cannot be provided, or when rewards can only be partially or sparsely specified. Imitation learning has been at the heart of several breakthroughs in building AI agents (Pomerleau, 1989; Abbeel et al., 2010; Silver et al., 2016; Vinyals et al., 2019; OpenAI et al., 2019), allowing agents to learn even when faced with hard exploration problems (Gulcehre et al., 2020).

There is widespread evidence that imitation (among other forms of social learning) is a core mechanism by which humans and other animals learn to acquire a sophisticated behavioral repertoire (Tomasello, 1996; Laland, 2008; Byrne, 2009; Huber et al., 2009). While most algorithms for imitation learning assume that demonstrations contain the actions the expert executed, animals must imitate without directly observing what actions the expert took (i.e. without knowing exactly what commands were issued to produce the observable changes). In the context of machine learning, solving the problem of imitation from observation is a key step towards the tantalizing possibility of learning behavior from unlabeled and easy-to-collect data, such as raw video footage of human activity. Many recent algorithms for imitation have focused on imitation in very small data regimes, but the challenge in imitating from these abundant sources of data is not primarily one of quantity. The challenge is rather how to learn models for imitation that are general enough to learn and generalize from data that depicts a rich (and unknown) reward structure. In this work, we show how predictive generative models can be used to learn a general reward model from observations alone.

Current state-of-the-art approaches to imitation (including imitation from observation) pose learning as an adversarial game: a classifier estimates the probability that a state is visited by the expert or imitator, and the policy seeks to maximize the classifier error (Merel et al., 2017; Torabi et al., 2019a).
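For concreteness, the sketch below illustrates this adversarial recipe: a classifier is trained to separate expert observations from imitator observations, and the imitator is rewarded when the classifier mistakes its observations for the expert's. This is our own minimal illustration, not the implementation of any cited method; `obs_dim` and the network sizes are placeholders.

```python
# Minimal sketch of a GAIL-style adversarial imitation reward.
# Illustrative only: obs_dim and network sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim = 24  # placeholder observation dimensionality

# Discriminator outputs the logit of p(expert | x).
discriminator = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

def discriminator_loss(expert_obs, imitator_obs):
    """Train the classifier to label expert observations 1, imitator observations 0."""
    logits_e = discriminator(expert_obs)
    logits_i = discriminator(imitator_obs)
    return (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e))
            + F.binary_cross_entropy_with_logits(logits_i, torch.zeros_like(logits_i)))

def adversarial_reward(obs):
    """Reward the policy for observations the classifier believes are expert-like."""
    with torch.no_grad():
        return F.logsigmoid(discriminator(obs)).squeeze(-1)
```

The policy and the classifier are trained against each other, so the reward signal changes as the classifier updates.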
Because these methods are based on matching the expert's occupancy using a fixed dataset of demonstrations, they tend to be very sensitive to the precise details of the demonstrations and to the representation used. This property makes learning with adversarial methods difficult when using raw, noisy observations without extensive tuning and careful use of strong forms of regularization (Peng et al., 2019), domain or task knowledge (Zolna et al., 2020), or a combination of behavioral cloning and careful representation design (Abramson et al., 2020).

In this work, we introduce the future observation reward model (FORM) (see Figure 1), which addresses the problem of imitation from observation while exhibiting both (1) generality and expressiveness, by coupling predictive generative models with inverse RL (IRL), and (2) improved robustness, by foregoing an adversarial formulation. In FORM, the imitator tries to match the probability of observation sequences in the expert data. It does so using a learned generative model of expert observation sequences and a learned generative model of its own observation sequences. In other words, FORM casts the problem of learning from demonstrations as a sequence prediction problem, using a generative model of expert sequences to guide RL.

*Figure 1. FORM learns to imitate expert behavior using sequences of internal state observations, without access to the expert's actions. Visualizations of agent behavior (top) and reward curves for a single episode (bottom) are shown after 0 (left), 50k (middle), and 5M update steps. FORM imitates using two learned models: both the demonstrator model (trained offline) and the imitator model (trained online) log-likelihoods track the unseen task reward as the imitation agent learns. Agent behavior is shown as images, but we use lower-dimensional internal state observations in this work.*

Because FORM builds separate models of expert and imitator sequences, rather than using a single classifier to discriminate expert and imitator states, it is less prone to focus on irrelevant differences between the expert and imitator demonstrations. The structure of the FORM objective makes it theoretically straightforward to optimize using standard policy optimization tools and, as we show, empirically competitive on the DeepMind Control Suite continuous control benchmark domain. This stands in contrast to adversarial methods, such as Generative Adversarial Imitation Learning (GAIL), whose objectives are known to be ill-posed (without additional regularization) and challenging to optimize both in theory and in practice (Arjovsky et al., 2017; Gulrajani et al., 2017; Mescheder et al., 2017). This property makes it difficult to apply adversarial techniques to imitation in settings with even small differences between the expert and imitator settings (Zolna et al., 2020).

Robustness to distractors is an important part of behavior learning, as recently illustrated by Stone et al. (2021) in the context of RL with image background distractors. These situations are common in practice: the lab environment where expert data is collected for a robot will be quite different from the one where the robot might be deployed. While it may be possible to collect a large number of demonstrations, it is impossible to exhaustively sample all possible sources of differences between the two domains (such as the surface texture, robot physical parameters, or environment appearance).
These differences confound the signal that must be imitated, creating the risk that spurious dependencies between the two are learned. As we will show, FORM exhibits greater robustness than a well-tuned adversarial imitation method, GAIL from Observation (GAIfO) (Torabi et al., 2019a), in the presence of task-independent features.

We make the following technical contributions in this work:

1. We derive the FORM reward from an objective for inverse reinforcement learning from observations. We show that this reward can be maximized using generative models of expert and imitator behavior with standard policy optimization techniques.
2. We develop a practical algorithm for imitation learning using the FORM reward and demonstrate that it performs competitively with a well-tuned GAIfO model on the DeepMind Control Suite benchmark.
3. We show that FORM is more robust than GAIfO in the presence of extraneous, task-irrelevant features, which simulate domain shift between expert and imitator settings.

## 2. Background and related work

**RL, IRL, and imitation.** Reinforcement learning is concerned with learning a policy that maximizes the expected return, given as the expected sum of all future discounted rewards (Sutton & Barto, 2018), which are typically observed. In imitation learning, on the other hand, we are not given a reward function, but we do have access to demonstrations produced by a demonstrator (or expert) policy, which maximizes some (unobserved) expected return. IRL has the related goal of recovering the unobserved reward function from expert behavior. IRL offers a general formula for imitation: estimate the reward function underlying the demonstration data (a reward model) and maximize this reward by RL (Ng & Russell, 2000), possibly iterating multiple times until convergence. Alternative approaches to imitation, such as behavioral cloning (BC) (Pomerleau, 1989) or BC from observations (BCO) (Torabi et al., 2018), typically have difficulty producing reliable behavior away from configurations seen in the expert demonstrations. This is because small errors in predicting actions or mimicking short-term agent behavior accumulate over long behavioral timescales.[^1] IRL methods like FORM avoid this problem: because they perform RL on a learned reward, they can learn through experience to recover from mistakes by focusing on the long-term consequences of each action.

[^1]: The standard solution to this problem for BC assumes access to an expert policy that can be repeatedly queried (Ross et al., 2011), which is not always feasible.

**GAIL and occupancy-based IRL.** Most contemporary IRL-based approaches to imitation, as exemplified by GAIL, use a strategy of state-action occupancy matching, typically by casting imitation as an adversarial game and learning a classifier to discriminate states and actions sampled uniformly from the expert demonstrations from those encountered by the imitator (Ho & Ermon, 2016; Torabi et al., 2019a; Fu et al., 2018; Kostrikov et al., 2019; Ghasemipour et al., 2019). In contrast, rather than classifying states as belonging to the expert or imitator, FORM learns to imitate using separate generative models of expert and imitator behavior. This means that FORM is built on predictive models of the form $p(x_t \mid x_{t-1})$, where the $x$ are observations, rather than a single model of the form $p(\text{expert} \mid x)$ that tries to classify observations as generated by the expert or not.
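To make the contrast concrete, the sketch below shows how a per-step imitation reward can be computed from two learned conditional density models rather than from a classifier. This is our own illustration under simplifying assumptions (diagonal-Gaussian effect models, placeholder sizes), not the paper's implementation; the log-likelihood-difference reward is one natural per-step choice consistent with the objective developed in Section 3.

```python
# Sketch of a reward built from two effect models p(x_t | x_{t-1}):
# one fit offline to expert observation sequences, one fit online to
# the imitator's own sequences. Architectural details are placeholders.
import torch
import torch.nn as nn

class EffectModel(nn.Module):
    """Diagonal-Gaussian model of p(x_t | x_{t-1})."""

    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * obs_dim),  # mean and log-std of x_t
        )

    def log_prob(self, prev_obs: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        mean, log_std = self.net(prev_obs).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        return dist.log_prob(obs).sum(dim=-1)

obs_dim = 24  # placeholder
demonstrator_model = EffectModel(obs_dim)  # trained offline on expert sequences
imitator_model = EffectModel(obs_dim)      # trained online on imitator sequences

def likelihood_ratio_reward(prev_obs, obs):
    """Reward transitions that the demonstrator model finds likely,
    relative to the imitator's own model."""
    with torch.no_grad():
        return (demonstrator_model.log_prob(prev_obs, obs)
                - imitator_model.log_prob(prev_obs, obs))
```

Because each model is only ever asked to explain its own data, neither is trained to exploit incidental differences between the expert and imitator observation distributions.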
FORM's objective is similar in spirit to classical feature-matching and maximum-entropy formulations of imitation (Ng & Russell, 2000; Abbeel & Ng, 2004; Ziebart et al., 2008), while also providing a fully probabilistic interpretation and making minimal assumptions about the environment (the FORM objective does not require an MDP or deterministic transitions).

**Other related methods for imitation.** Other recent work has used generative models in the context of imitation learning: this work typically retains GAIL's occupancy-based perspective (Baram et al., 2016; Jarrett et al., 2020; Liu et al., 2021) or introduces a generative model to provide a heuristic reward (Yu et al., 2020). Unlike FORM, which uses effect models (see Figure 2) that are suitable for imitation from observations, this work models quantities that are useful primarily in conjunction with actions (modeling state-action densities and/or dynamics models for GAIL augmentation). Other recently proposed methods learn reward models either purely or partially offline (Kostrikov et al., 2020; Jarrett et al., 2020; Arenz & Neumann, 2020). This approach leans on the presence of actions in the demonstrator data. Although FORM's demonstrator effect model is learned offline, FORM's online phase is essential to the process of distilling an effect model (which doesn't use actions) into a policy (which does).

*Figure 2. FORM's demonstrator and imitator models are effect models: generative models $p(x_t \mid x_{t-1})$ of the change in observation (observed) produced by a policy $\pi$ in an environment with transition dynamics $p_T$ (unobserved). The models used in model-based RL are usually of the form $p(x_t \mid x_{t-1}, a_{t-1})$ and aim to model transition dynamics rather than the full distribution of outcomes given a policy.*

**Learning to act from observations.** Many methods have been proposed for imitation from observation (Torabi et al., 2019b), but most methods that do so using IRL are based around GAIL (Wang et al., 2017; Torabi et al., 2019a; Sun et al., 2019). Recent work has obtained interesting results using solutions based around tracking or matching trajectories in learned feature spaces (Peng et al., 2018; Merel et al., 2019), by matching imitator actions to learned models of expert trajectories (interpreted as inverse models of the expert action) (Schmeckpeper et al., 2020; Zhu et al., 2020; Edwards et al., 2018; Pathak et al., 2018), and by learning to match features in learned invariant spaces (Sermanet et al., 2017). Finally, we note that much recent work has observed that the structure of observation sequences can be exploited to generate behavior, whether in the context of language modeling (Brown et al., 2020), 3D navigation (Dosovitskiy & Koltun, 2017), few-shot planning (Rybkin et al., 2019), or value-based RL (Edwards et al., 2020). FORM uses generative models of future observations to exploit this property of observation transitions and connects it to inverse reinforcement learning to produce a practical algorithm for imitation.

## 3. Approach

### 3.1. Inverse reinforcement learning from observations

Our goal is to learn a policy that produces behavior like an expert (or demonstrator) by IRL, using only observations. Historically, the IRL procedure has been framed as matching the expected distribution over states and actions (or their features) along the imitator and demonstrator paths (Ng & Russell, 2000; Abbeel & Ng, 2004; Ziebart et al., 2008).
As also noted in Arenz & Neumann (2020), we can express this as a divergence minimization problem:

$$\min_\theta \; D_{\mathrm{KL}}\!\left[\,p^I_\theta(\tau)\,\|\,p^D(\tau)\,\right], \tag{1}$$

where $\tau = \{x_0, a_0, x_1, a_1, \dots, a_{T-2}, x_{T-1}\}$ is a trajectory consisting of actions $A = \{a_0, \dots, a_{T-2}\}$ and states $X = \{x_0, \dots, x_{T-1}\}$. We use $x$ rather than $o$ (for observation) or $s$ (for state) because FORM does not assume that its inputs are Markovian: FORM applies to generic observations, but speaking in terms of states simplifies the comparison to other methods (like GAIL) that assume Markovian states are given or inferred. $p^D(\tau)$ is the distribution over trajectories induced by the demonstrator's policy and the environment dynamics, while $p^I_\theta(\tau)$ is the corresponding distribution induced by an imitator with learnable parameters $\theta$.

In imitation learning from observation, the imitator must reason about the demonstrator's behavior without supervised access to the expert's actions (its control signals). Accordingly, we focus on distributions over observation sequences, which amounts to integrating out the imitator's actions:

$$p^I_\theta(X) = \int_A p^I_\theta(\tau)\, dA = \int_A p^I_\theta(A, X)\, dA = \prod_{t \ge 0} p^I_\theta(x_t \mid x_{0:t-1}),$$

where the $t = 0$ factor denotes the initial observation distribution.
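Under this factorization, the observation-level version of the objective in Eq. (1) decomposes into per-step log-likelihood terms. The following is a brief sketch of this decomposition using only standard identities (our own expansion, not the paper's full derivation); for simplicity it assumes both distributions share the initial observation distribution, so that term cancels.

```latex
% Sketch: chain-rule decomposition of the observation-level KL objective.
% Symbols follow Eq. (1); the t = 0 term is assumed shared and cancels.
\begin{align*}
D_{\mathrm{KL}}\!\left[p^I_\theta(X)\,\|\,p^D(X)\right]
  &= \mathbb{E}_{X \sim p^I_\theta}\!\left[\log p^I_\theta(X) - \log p^D(X)\right] \\
  &= \mathbb{E}_{X \sim p^I_\theta}\!\left[\sum_{t \ge 1}
       \log p^I_\theta(x_t \mid x_{0:t-1}) - \log p^D(x_t \mid x_{0:t-1})\right].
\end{align*}
```

Minimizing this expectation amounts to maximizing an expected sum of per-step terms of the form $\log p^D(x_t \mid x_{0:t-1}) - \log p^I_\theta(x_t \mid x_{0:t-1})$, which suggests how a per-step imitation reward can be read off from the two learned observation models (the expectation itself depends on $\theta$, a point the derivation of the FORM reward must handle).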