# Planning from Pixels Using Inverse Dynamics Models

Published as a conference paper at ICLR 2021

Keiran Paster
Department of Computer Science, University of Toronto, Vector Institute
keirp@cs.toronto.edu

Sheila A. McIlraith & Jimmy Ba
Department of Computer Science, University of Toronto, Vector Institute
{sheila, jba}@cs.toronto.edu

ABSTRACT

Learning task-agnostic dynamics models in high-dimensional observation spaces can be challenging for model-based RL agents. We propose a novel way to learn latent world models by learning to predict sequences of future actions conditioned on task completion. These task-conditioned models adaptively focus modeling capacity on task-relevant dynamics, while simultaneously serving as an effective heuristic for planning with sparse rewards. We evaluate our method on challenging visual goal completion tasks and show a substantial increase in performance compared to prior model-free approaches.

1 INTRODUCTION

Deep reinforcement learning has proven to be a powerful and effective framework for solving a diversity of challenging decision-making problems (Silver et al., 2017a; Berner et al., 2019). However, these algorithms are typically trained to maximize a single reward function, ignoring information that is not directly relevant to the task at hand. This way of learning is in stark contrast to how humans learn (Tenenbaum, 2018). Without being prompted by a specific task, humans can still explore their environment, practice achieving imaginary goals, and in so doing learn about the dynamics of the environment. When subsequently presented with a novel task, humans can utilize this learned knowledge to bootstrap learning, a property we would like our artificial agents to have.

In this work, we investigate one way to bridge this gap by learning world models (Ha & Schmidhuber, 2018) that enable the realization of previously unseen tasks. By modeling the task-agnostic dynamics of an environment, an agent can make predictions about how its own actions may affect the environment state without the need for additional samples from the environment. Prior work has shown that by using powerful function approximators to model environment dynamics, training an agent entirely within its own world model can result in large gains in sample efficiency (Ha & Schmidhuber, 2018). However, learning world models that are both accurate and general has largely remained elusive, with these models experiencing many performance issues in the multi-task setting. The main reason for poor performance is the so-called planning horizon dilemma (Wang et al., 2019): accurately modeling dynamics over a long horizon is necessary to accurately estimate rewards, but performance is often poor when planning over long sequences due to the accumulation of errors. These modeling errors are especially prevalent in high-dimensional observation spaces, where loss functions that operate on pixels may focus model capacity on task-irrelevant features (Kaiser et al., 2020). Recent work (Hafner et al., 2020; Schrittwieser et al., 2019) has attempted to side-step these issues by learning a world model in a latent space and propagating gradients over multiple time-steps.
While these methods are able to learn accurate world models and achieve high performance on benchmark tasks, their representations are usually trained with task-specific information such as rewards, encouraging the model to focus on tracking task-relevant features but compromising their ability to generalize to new tasks.

In this work, we propose to learn powerful, latent world models that can predict environment dynamics when planning for a distribution of tasks. The main contributions of our paper are three-fold: we propose to learn a latent world model conditioned on a goal; we train our latent representation to model inverse dynamics (sequences of actions that take the agent from one state to another) rather than training it to capture information about reward; and we show that by combining our inverse dynamics model and a prior over action sequences, we can quickly construct plans that maximize the probability of reaching a goal state.

We evaluate our world model on a diverse distribution of challenging visual goals in Atari games and the DeepMind Control Suite (Tassa et al., 2018) to assess both its accuracy and sample efficiency. We find that when planning in our latent world model, our agent outperforms prior model-free methods across most tasks, while providing an order of magnitude better sample efficiency on some tasks.

Figure 1: The network architecture for the inverse dynamics model used in GLAMOR. ResNets are used to encode state features and an LSTM predicts the action sequence.
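To make the architecture in Figure 1 concrete, the following is a minimal PyTorch-style sketch of a goal-conditioned inverse dynamics model: a shared ResNet encoder embeds the current frame and the goal frame, and an LSTM predicts the action sequence connecting them. The sizes taken from Figure 5 (state size 512, LSTM hidden dim 64, 1 layer) are from the paper; the use of torchvision's resnet18, the start-token embedding, and all other details are our own assumptions, and the sketch omits the separate action-prior head.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # encoder choice is an assumption


class InverseDynamicsModel(nn.Module):
    """Sketch of a GLAMOR-style inverse dynamics model (cf. Figure 1).

    Encodes the current state and the goal state with a shared ResNet, then
    rolls an LSTM over previously taken actions to predict logits for the
    next action in the sequence. Sizes follow Figure 5 where stated; the
    rest is illustrative.
    """

    def __init__(self, n_actions: int, state_size: int = 512, lstm_hidden: int = 64):
        super().__init__()
        self.encoder = resnet18(num_classes=state_size)      # shared state/goal encoder
        self.action_emb = nn.Embedding(n_actions + 1, lstm_hidden)  # +1 for a <start> token
        self.lstm = nn.LSTM(input_size=lstm_hidden + 2 * state_size,
                            hidden_size=lstm_hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(lstm_hidden, n_actions)         # per-step action logits

    def forward(self, state, goal, prev_actions):
        # state, goal: (B, 3, H, W) images; prev_actions: (B, k) action indices
        z = torch.cat([self.encoder(state), self.encoder(goal)], dim=-1)  # (B, 2 * state_size)
        a = self.action_emb(prev_actions)                                  # (B, k, lstm_hidden)
        z = z.unsqueeze(1).expand(-1, a.size(1), -1)                       # repeat features per step
        h, _ = self.lstm(torch.cat([a, z], dim=-1))
        return self.head(h)  # (B, k, n_actions): logits for each action in the sequence
```

The action prior described later (a model of actions given only the initial state) could be parameterized analogously by simply dropping the goal encoding; we omit it here for brevity.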
2 RELATED WORK

Model-based RL has typically focused on learning powerful forward dynamics models, which are trained to predict the next state given the current state and action. In works such as Kaiser et al. (2020), these models are trained to predict the next state in observation space, often by minimizing L2 distance. While the performance of these algorithms in the low-data regime is often strong, they can struggle to reach the asymptotic performance of model-free methods (Hafner et al., 2020). An alternative approach is to learn a forward model in a latent space, which may be able to avoid modeling irrelevant features and better optimize for long-term consistency. These latent spaces can be trained to maximize mutual information with the observations (Hafner et al., 2020; 2019) or even task-specific quantities like the reward, value, or policy (Schrittwieser et al., 2019). Using a learned forward model, there are several ways that an agent could create a policy.

While forward dynamics models map a state and action to the next state, an inverse dynamics model maps two subsequent states to an action. Inverse dynamics models have been used in various ways in sequential decision making. In exploration, inverse dynamics serves as a way to learn representations of the controllable aspects of the state (Pathak et al., 2017). In imitation learning, inverse dynamics models can be used to map a sequence of states to the actions needed to imitate the trajectory (Pavse et al., 2019). Christiano et al. (2016) use inverse dynamics models to translate actions taken in a simulated environment to the real world.

Recently, there has been an emergence of work (e.g., Ghosh et al., 2020; Schmidhuber, 2019; Srivastava et al., 2019) highlighting the relationship between imitation learning and reinforcement learning. Specifically, rather than learning to map states and actions to reward, as is typical in reinforcement learning, Srivastava et al. (2019) train a model to predict actions given a state and an outcome, which could be the amount of reward the agent is to collect within a certain amount of time. Ghosh et al. (2020) use a similar idea, predicting actions conditioned on an initial state, a goal state, and the amount of time left to achieve the goal. As explored in Appendix A.1, these methods are perhaps the nearest neighbors to our algorithm.

In our paper, we tackle a visual goal-completion task due to its generality and ability to generate tasks with no domain knowledge. Reinforcement learning with multiple goals has been studied since Kaelbling (1993). Most agents that are trained to achieve multiple goals are trained with off-policy reinforcement learning combined with a form of hindsight relabeling (Andrychowicz et al., 2017), where trajectories that do not achieve the desired goal are relabeled as successful trajectories that achieve the goal that was actually reached. Andrychowicz et al. (2017) use value-based reinforcement learning with a reward based on the Euclidean distance between physical objects, which is only possible with access to an object-oriented representation of the state. In environments with high-dimensional observation spaces, goal-achievement rewards are more difficult to design. Nair et al. (2018) use a VAE (Kingma & Welling, 2014) trained on observations to construct a latent space and use distances in that latent space as a reward. These distances, however, may contain features that are uncontrollable or irrelevant. Warde-Farley et al. (2019) attempt to solve this issue by framing the goal-achievement task as maximizing the mutual information between the goal and the achieved state, $I(s_g; s_T)$. Our method differs from these approaches since we aim simply to maximize an indicator reward $\mathbb{1}(s_T = s_g)$ and do not explicitly learn a value or Q-function.

3.1 PROBLEM FORMULATION

Reinforcement learning is a framework in which an agent acts in an unknown environment and adapts based on its experience. We model the problem with an MDP, defined as the tuple $(S, A, T, R, \gamma)$. $S$ is a set of states; $A$ is a set of actions; the transition probability function $T : S \times A \times S \to [0, 1]$ defines the probability of the environment transitioning from state $s$ to $s'$ given that the agent acts with action $a$; the reward function $R : S \times A \times S \to \mathbb{R}$ maps a state-action transition to a real number; and $0 \leq \gamma \leq 1$ is the discount factor, which controls how much an agent should prefer rewards sooner rather than later. An agent acts in the MDP with a policy $\pi : S \times A \to [0, 1]$, which determines the probability of the agent taking action $a$ while in state $s$. The expected return of a policy is

$$J_{\mathrm{RL}}(\pi) = \mathbb{E}_{\tau \sim P(\tau \mid \pi)}\Big[\sum_t \gamma^t R(s_t, a_t, s_{t+1})\Big],$$

that is, the average discounted future reward over trajectories $\tau = \{(s_t, a_t)\}_{t=1}^{T}$ of states and actions sampled from the policy. A reinforcement learning agent's objective is to find the optimal policy $\pi^* = \arg\max_\pi J_{\mathrm{RL}}(\pi)$ that maximizes the expected return.

In goal-conditioned reinforcement learning, an agent's objective is to find a policy that maximizes this return over the distribution of goals $g \sim p(g)$ when acting with a policy that is now also conditioned on $g$. In our work, $g \in S$ and we consider goal-achievement rewards of the form $R_g(s) = \mathbb{1}(s = g)$. Additionally, we consider a trajectory to be complete when any $R_g(s_t) = 1$ and denote this time-step $t = T$. With these rewards, an optimal goal-achieving agent maximizes

$$J(\pi) = \mathbb{E}_{g \sim p(g)}\big[\mathbb{E}_{s_T \sim p(s_T \mid \pi_g)}[\gamma^T R_g(s_T)]\big]. \qquad (2)$$

Note that, unlike prior works, we consider both the probability of goal achievement and the length of the trajectory $T$ in our objective.
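As a worked illustration of this objective (our own sketch, not from the paper), the following shows how $J(\pi)$ could be estimated by Monte Carlo rollouts with the indicator reward $R_g(s) = \mathbb{1}(s = g)$ and termination at the first time-step $T$ that reaches the goal. The `env`, `policy`, and `goal_dist` interfaces are hypothetical stand-ins.

```python
import numpy as np


def estimate_goal_return(env, policy, goal_dist, gamma=0.99,
                         n_episodes=100, max_steps=200):
    """Monte Carlo estimate of J(pi) = E_g E[gamma^T * 1(s_T = g)].

    An episode ends as soon as the goal is reached (time-step T), so each
    rollout contributes either gamma^T (success) or 0 (failure).
    `env`, `policy`, and `goal_dist` are hypothetical interfaces.
    """
    returns = []
    for _ in range(n_episodes):
        g = goal_dist.sample()            # g ~ p(g), a goal observation
        s = env.reset()
        ret = 0.0
        for t in range(1, max_steps + 1):
            a = policy.act(s, g)          # goal-conditioned policy pi(a | s, g)
            s, _, done, _ = env.step(a)
            if np.array_equal(s, g):      # indicator reward R_g(s) = 1(s = g)
                ret = gamma ** t          # discounting favors reaching g quickly
                break
        returns.append(ret)
    return float(np.mean(returns))
```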
3.2 PLANNING

We consider the problem of finding an optimal action sequence $a_1, \ldots, a_{k-1}$ to maximize the expected return $J(s, g, a_1, \ldots, a_{k-1})$:

$$J(s, g, a_1, \ldots, a_{k-1}) = \mathbb{E}_{s_k \sim p(s_k \mid s, a_1, \ldots, a_{k-1})}[\gamma^k R_g(s_k)] = \gamma^k\, p(s_k = g \mid s, a_1, \ldots, a_{k-1}). \qquad (3)$$

Thus, the optimal action sequence is found by solving the following optimization problem:

$$a^*_1, \ldots, a^*_{k-1} = \arg\max_{a_1, \ldots, a_{k-1}} \gamma^k\, p(s_k = g \mid s_1, a_1, \ldots, a_{k-1}). \qquad (4)$$

Even with access to a perfect model of $p(s_k = g \mid s_1, a_1, \ldots, a_{k-1})$, solving this optimization may be difficult. In many environments, the number of action sequences that reach the goal is vastly outnumbered by the number of action sequences that do not. Without a heuristic or reward shaping, there is little hope of solving this problem in a reasonable amount of time.

3.3 GLAMOR: GOAL-CONDITIONED LATENT ACTION MODELS FOR RL

Inspired by sequence modeling in NLP, we propose to rewrite Equation 4 in a way that permits factoring across the actions in the action sequence. By factoring, planning in our model can use the heuristic search algorithms that enable sampling high-quality language sequences that are hundreds of tokens long. First, note that, by Bayes' rule,

$$p(s_k = g \mid s_1, a_1, \ldots, a_{k-1}) = p(s_k = g \mid s_1)\, \frac{p(a_1, \ldots, a_{k-1} \mid s_1, s_k = g)}{p(a_1, \ldots, a_{k-1} \mid s_1)},$$

and factoring both action-sequence terms autoregressively gives

$$p(s_k = g \mid s_1, a_1, \ldots, a_{k-1}) = p(s_k = g \mid s_1) \prod_{t=1}^{k-1} \frac{p(a_t \mid s_1, s_k = g, a_1, \ldots, a_{t-1})}{p(a_t \mid s_1, a_1, \ldots, a_{t-1})}.$$

The numerator terms define a goal-conditioned inverse dynamics model and the denominator terms define an action prior; since $p(s_k = g \mid s_1)$ does not depend on the actions, the optimization in Equation 4 requires only these two learned per-action factors.

A.1 COMPARISON TO GCSL

Proposition. If there is goal interference, i.e., there exist goals $g, g'$ with $p(g) > 0$, $p(g') > 0$, and $p(s = g \mid a^*(g)) > p(s = g \mid a^*(g')) > 0$ (here $a^*(g)$ denotes the optimal action for goal $g$), then one-step GCSL does not converge to an optimal policy.

As an example, consider a game with state space $S = \{1, 2, 3, 4, 5, 6\}$ and action space {roll fair die, roll loaded die}. Rolling the fair die will result in the state being uniformly set to the number that the die lands on, while the loaded die will always land on 1. The optimal goal-achieving policy is evident: the agent should roll the loaded die with probability one only when the goal is to land on 1; otherwise the agent should always roll the fair die. In this example, GCSL will fail to converge but GLAMOR will find the correct policy.

Proof. We consider a one-step MDP with goal distribution $p(g)$. In this MDP, we assume a deterministic initial state and denote the state the agent transitions to after taking action $a$ as $s$. For simplicity, we assume the policy $\pi_\theta$ is powerful enough to perfectly match its training distribution when trained with maximum likelihood. Therefore, GCSL consists of the following iterated steps:

1. Gather trajectories with $\pi_t$. This gives an empirical distribution of goal-conditioned actions $p_{t+1}(a \mid s = g) \propto p(s = g \mid a)\, p_t(a)$.
2. Distill $p_{t+1}(a \mid s = g)$ into $\pi_{t+1}(a \mid g)$.

The update to the policy $\pi$ at each iteration depends both on the relative likelihood of transitioning to the goal under the environment dynamics and on the probability of taking action $a$, $p_t(a)$. Clearly, $p_t(a)$ is a function of the previous policy $\pi_t$ and the goal distribution $p(g)$:

$$p_t(a) = \sum_{g'} p(g')\, \pi_t(a \mid g'). \qquad (11)$$

Assuming the distillation is exact, the policy evolves as

$$\pi_{t+1}(a \mid g) \propto p(s = g \mid a) \sum_{g'} p(g')\, \pi_t(a \mid g'). \qquad (12)$$

We then analyze the ratio of the probability of a sub-optimal action $a$ to the probability of the optimal action: $\pi_t(a \mid g) / \pi_t(a^*(g) \mid g)$. If there is goal interference, goals $g, g'$ exist such that $p(g')\, p(s = g \mid a^*(g')) > 0$ and $p(s = g \mid a^*(g)) > p(s = g \mid a^*(g')) > 0$. Assume $\pi_{t-1}$ is optimal. Then,

$$\frac{\pi_t(a^*(g') \mid g)}{\pi_t(a^*(g) \mid g)} = \frac{p(s = g \mid a^*(g'))\, \sum_{g''} p(g'')\, \pi_{t-1}(a^*(g') \mid g'')}{p(s = g \mid a^*(g))\, \sum_{g''} p(g'')\, \pi_{t-1}(a^*(g) \mid g'')} \qquad (13)$$

$$\geq \frac{p(s = g \mid a^*(g'))\, p(g')}{p(s = g \mid a^*(g))}, \qquad (14)$$

which is strictly positive, so $\pi_t$ assigns non-zero probability to the sub-optimal action $a^*(g')$ for goal $g$. Therefore, if $\pi_{t-1}$ is optimal, $\pi_t$ will again be sub-optimal and the policy will never converge to an optimal policy.
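To illustrate the argument, here is a small self-contained simulation (our own construction, not from the paper) of the fair/loaded die example. It iterates the one-step GCSL update from Equations (11) and (12) and shows that the probability of rolling the loaded die when the goal is 1 decays toward zero, whereas the optimal choice for that goal, $\arg\max_a p(s = 1 \mid a)$, which is the quantity GLAMOR's planner targets, is to roll the loaded die with probability one.

```python
import numpy as np

# One-step die MDP: action 0 = roll fair die, action 1 = roll loaded die.
# dyn[a, s] = p(next state is s + 1 | action a).
dyn = np.array([
    [1 / 6] * 6,               # fair die: uniform over {1, ..., 6}
    [1.0, 0, 0, 0, 0, 0],      # loaded die: always lands on 1
])
p_goal = np.full(6, 1 / 6)     # goals sampled uniformly

pi = np.full((6, 2), 0.5)      # pi[g, a] = pi(a | goal g + 1), start uniform

for t in range(500):
    p_a = p_goal @ pi          # Eq. (11): p_t(a) = sum_g' p(g') pi_t(a | g')
    new_pi = dyn.T * p_a       # Eq. (12): pi_{t+1}(a | g) proportional to p(s = g | a) p_t(a)
    pi = new_pi / new_pi.sum(axis=1, keepdims=True)

print("one-step GCSL: pi(loaded | goal = 1) =", round(pi[0, 1], 4))   # decays toward 0
print("optimal:       pi(loaded | goal = 1) = 1.0   (argmax_a p(s = 1 | a))")
```

Because hindsight relabeling re-weights actions by how often the current policy happens to reach each outcome, the self-imitation iteration keeps drifting away from the loaded die for goal 1 and never reaches the optimal goal-conditioned policy.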
A.2 CAUSALLY CORRECT MODELS

Rezende et al. (2020) explore the connection between model-based reinforcement learning and causality. A model is causally correct if the learned distribution $q_\theta(x)$ matches the true distribution $p(x)$ with respect to a set of interventions. In model-based RL, a model is trained to predict some aspect of an environment. In order to use the learned model to predict the effect of a new policy in the environment, the model must be causally correct with respect to changes in the policy. Rezende et al. (2020) show that some partial models, including MuZero, are not causally correct with respect to action-sequence interventions.

As an example, consider training a model to predict whether a sequence of actions wins the game of Simon Says. The model is trained to predict $p(\text{win} \mid s_0, a_1, \ldots, a_k)$ on data produced by some training policy. If this training policy is good at the game, the conditional probability $p(\text{win} \mid s_0, a_1, \ldots, a_k) = 1$ for all action sequences, even though using the action sequence blindly in the real game would result in a much lower win rate. This happens because the true data-generating process actually depends on the intermediate states $s_1, \ldots, s_k$, which the training policy has access to. By conditioning on an action sequence $s_0, a_1, \ldots, a_k$ without modeling the intermediate states, these states become confounding variables. In order to predict the effect of taking a certain action sequence on the environment, what we really want to find is $p(\text{win} \mid s_0, \mathrm{do}(a_1, \ldots, a_k))$.

To learn causally correct models with GLAMOR, we opt to simplify the data-generating process by using training policies that are independent of intermediate states. While this may hurt training in some stochastic environments, we find that in our multi-task setting with relabeling, using non-reactive exploration has a negligible effect.

A.3 EXPERIMENTAL DETAILS

Figure 5 shows the hyperparameters that were used to train our method.

| Hyper-parameter | Value |
| --- | --- |
| optimizer | AdamW |
| weight-decay | 0.01 |
| normalization | GroupNorm |
| learning-rate | 5e-4 |
| replay-ratio | 4 |
| eps-steps | 3e5 |
| eps-final | 0.1 |
| min-steps-learn | 5e4 |
| buffer size | 1e6 |
| policy trials | 50 |
| state size | 512 |
| clip-p-actions | -3.15 |
| lstm-hidden-dim | 64 |
| lstm-layers | 1 |
| train tasks | 1000 |

Figure 5: Hyperparameters used to train GLAMOR.

While our method is substantially simpler than value-based methods like DISCERN, there are still a few important hyperparameters. We found that tuning the replay ratio is important to balance sample efficiency against over-fitting in our models. We also find that GLAMOR works best with a large replay buffer and a large model. Finally, we found that avoiding action sequences which have too low a probability, similar to tricks used in beam search in NLP (Holtzman et al., 2020), increases the performance of our planner. To achieve this, we introduce a hyperparameter clip-p-actions and only expand an action sequence if the immediate log-probability under the inverse dynamics model is above this value.
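The following is a minimal sketch of a sampling-based planner of this flavor; it is our own illustration, not the paper's implementation. It assumes two learned components of the kind described in Section 3.3, a goal-conditioned inverse dynamics model `inv_dyn` and an action prior `prior` (both hypothetical callables returning per-action log-probabilities), samples `policy_trials` action sequences, skips expansions whose immediate log-probability under the inverse dynamics model falls below clip-p-actions, and ranks sequences by the sum of per-step log-ratios, a surrogate for the goal-reaching probability up to an action-independent term.

```python
import math
import random


def plan(inv_dyn, prior, s1, goal, n_actions, horizon=20,
         policy_trials=50, clip_p_actions=-3.15):
    """Sample-and-rank planner sketch (illustrative, not the paper's code).

    inv_dyn(s1, goal, prefix) -> list of log p(a | s1, goal, a_<t) per action
    prior(s1, prefix)         -> list of log p(a | s1, a_<t) per action
    """
    best_seq, best_score = None, -math.inf
    for _ in range(policy_trials):
        prefix, score = [], 0.0
        for _ in range(horizon):
            logp_inv = inv_dyn(s1, goal, prefix)
            logp_prior = prior(s1, prefix)
            # Only expand with actions that are not too unlikely under the
            # inverse dynamics model (the clip-p-actions trick above).
            allowed = [a for a in range(n_actions) if logp_inv[a] >= clip_p_actions]
            if not allowed:
                break
            weights = [math.exp(logp_inv[a]) for a in allowed]
            a = random.choices(allowed, weights=weights, k=1)[0]
            score += logp_inv[a] - logp_prior[a]   # per-step log-ratio score
            prefix.append(a)
        if prefix and score > best_score:
            best_seq, best_score = prefix, score
    return best_seq
```

With `policy_trials=1`, this reduces to sampling a single action sequence and acting on it, which is the degenerate planner examined in Appendix A.4.2.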
A.4 ABLATIONS

In order to find which parts of GLAMOR contribute to its performance, we ran an ablation study.

A.4.1 TRAINING POLICY

In Figure 6, we vary the way we construct the open-loop action sequence that is followed to collect training data. We vary it in two ways: the amount of compute used to create the plan and whether we consider the action prior. While performance is poor when using only one planning sample, GLAMOR seems to work well on Pong with as few as 5 samples. More interestingly, disabling the action prior during training seems to increase the variance of GLAMOR significantly. Without the action prior, the planning procedure will select action sequences that may have shown up more often in the training data simply due to the training policy. We hypothesize that this effect can significantly hurt exploration and performance.

[Figure 6 plots: one panel each for 1, 5, 10, and 50 planning trials; x-axis Agent Steps (Millions), y-axis Goals Achieved; curves compare No Action Prior vs. With Action Prior.]
Figure 6: In this experiment, we test how changing the training policy affects performance in Pong. As the amount of compute used in the planner during training increases, so does the performance of the evaluated agent. Using the action prior decreases the variance of the agent's performance.

We also test the performance of both GLAMOR and GCSL when using a uniform training policy in Figure 7. We found that GLAMOR performs very well even when trained on completely off-policy data, while GCSL struggles.

[Figure 7 plots: x-axis Agent Steps (Millions), y-axis Goals Achieved; curves compare GCSL and GLAMOR (Ours) under default and random training policies.]
Figure 7: When training agents with off-policy data collected with a random policy, GLAMOR outperforms GCSL and can achieve most goals.

A.4.2 PLANNER

We also evaluate whether the planner is necessary at test time to achieve strong goal-achievement performance. To test this, we ran an experiment where the planner simply takes one trajectory sample and takes the first action (note that this is exactly equivalent to lowering the planning compute in our implementation). Figure 8 shows that while the agent still, surprisingly, achieves many goals in this setting, using the planner results in a stronger policy. We interpret this result as showing that the heuristic search guided by the factored inverse dynamics and action prior is strong, and that additional compute simply chooses a plan among already good options.

[Figure 8 plots: x-axis Agent Steps (Millions), y-axis Goals Achieved; curves compare 1 Search Trial vs. 50 Search Trials.]
Figure 8: Goal achievement rates for DM Control and Atari. Searching for a high-scoring action sequence results in more goals achieved. However, even when using compute equivalent to a model-free agent, GLAMOR performs remarkably well.

A.5 ADDITIONAL FIGURES

In Figure 9 and Figure 10, we plot the learning curves of GLAMOR, GCSL, and DISCERN trained for 5M agent steps. GLAMOR learns quickly compared to the other algorithms and performs better asymptotically, with the exception of a few environments (Ms Pacman, point-mass). Figure 11 and Figure 12 show the states achieved by GLAMOR attempting to reach a specific goal. As noted by Warde-Farley et al. (2019), the manipulator environment is difficult and no algorithm learned to
achieve goals within 5M steps. The low achievement rate on the finger task is due to the agent's inability to reliably control the angle of the spinner.

[Figure 9 plots: one panel per Atari game (including Montezuma Revenge and Private Eye); x-axis Agent Steps (Millions), y-axis Goals Achieved; curves: DISCERN, GCSL, GLAMOR (Ours).]
Figure 9: Training Curves for Atari Tasks. GLAMOR achieves more goals (often with many fewer steps) than both GCSL and DISCERN.

[Figure 10 plots: panels for ball in cup/catch, cartpole/balance, manipulator/bring ball, pendulum/swingup, point mass/easy, and reacher/hard; x-axis Agent Steps (Millions), y-axis Goals Achieved; curves: DISCERN, GCSL, GLAMOR (Ours).]
Figure 10: Training Curves for Control Tasks. GLAMOR achieves more goals (often with many fewer steps) than both GCSL and DISCERN.

Figure 11: Goal states (above) and states achieved by the fully trained GLAMOR agent (below), averaged over 5 trials for each Atari game tested. Variance comes from environment and planning stochasticity. Note that on most games, GLAMOR learns to control the positions of both directly and indirectly controllable objects in the frame.

Figure 12: Goal states (above) and states achieved by the fully trained agent (below), averaged over 5 trials for each control task tested. Variance comes from planning stochasticity since the environment dynamics are deterministic. In most environments, GLAMOR learns to control the agent's state to match the visually specified goal. Note that in the finger environment, GLAMOR learns to control the position of the finger despite not often being able to control the angle of the spinner.