# Learning Task Informed Abstractions

Xiang Fu*¹, Ge Yang*¹², Pulkit Agrawal¹², Tommi Jaakkola¹

*Equal contribution. ¹MIT CSAIL, ²IAIFI. Correspondence to: Xiang Fu.

**Abstract.** Current model-based reinforcement learning methods struggle when operating from complex visual scenes due to their inability to prioritize task-relevant features. To mitigate this problem, we propose learning Task Informed Abstractions (TIA) that explicitly separate reward-correlated visual features from distractors. For learning TIA, we introduce the formalism of the Task Informed MDP (TiMDP), which is realized by training two models that learn visual features via cooperative reconstruction while one model is adversarially dissociated from the reward signal. Empirical evaluation shows that TIA leads to significant performance gains over state-of-the-art methods on many visual control tasks where natural and unconstrained visual distractions pose a formidable challenge. Project page: https://xiangfu.co/tia

## 1. Introduction

Consider the results of a simple experiment reported in Figure 1. We train a state-of-the-art model-based reinforcement learning algorithm (Hafner et al., 2020) to solve two versions of the Cheetah Run task (Tassa et al., 2018): one with a simple background and the other with a visually complex background (Zhang et al., 2021). For each version we train three model variants whose world models contain 0.5× (small), 1× (medium), and 2× (large) the number of parameters of the original model. Performance with the simple background is only marginally affected by model capacity, indicating that even the smallest model has sufficient capacity to learn task-relevant features. With the complex background, performance is much worse but increases monotonically with model size. Because the amount of task-relevant information is unchanged between the simple and complex background variants, these results demonstrate that excess model capacity is devoted to representing background information when learning from complex visual inputs. The background conveys no information about the task. It therefore interferes with the learning of task-relevant information by consuming model capacity. Here, relevant refers to the features needed to predict the optimal actions, whereas irrelevant refers to everything else that makes up the observation.

Figure 1: Comparison of the performance of a state-of-the-art model-based RL algorithm, Dreamer, on two versions of Cheetah Run with vs. without visual distraction. Performance is reported for three models of increasing size (0.5×, 1×, 2× of the original Dreamer). Results show that even the smallest model has sufficient capacity to capture task-relevant features when observations are distractor-free (gray), but when the scene is complex (red), task-irrelevant features consume most of the model capacity. Error bars indicate one standard deviation.

There are two main components of a model-based learner: (i) a forward dynamics model that predicts future events resulting from executing a sequence of actions from the current state, and (ii) a reward predictor for evaluating possible future states. The policy's performance depends critically on the prediction accuracy of the forward model, which is intimately tied to the feature space in which the future is predicted.
Similar to the complex background version of Cheetah Run, visual observations obtained in the real world are full of irrelevant information. Therefore, if there is no bias towards learning task-relevant features, the model will try to predict all the information in the observation. In such a scenario, spurious features unnecessarily increase the sample complexity and necessitate the use of much larger models than necessary. Sometimes they can also cause training to fail.

How to learn good representations has been a major focus in model-based deep reinforcement learning. A popular choice is to reconstruct the input observations (Kingma & Welling, 2014; Kingma et al., 2014; Watter et al., 2015; Hafner et al., 2020). Often these features are encouraged to be disentangled (Bengio, 2013; Higgins et al., 2016; 2017) to identify distinct factors of variation. Since disentanglement simply re-formats the space, the disentangled feature space would still contain irrelevant information and does not address the central problem of learning task-relevant features. As our analysis shows, incorporating a reward prediction loss is insufficient for producing a feature space that contains only task-relevant information. Reward supervision alone has also been shown to be insufficient for feature learning (Yarats et al., 2019). For instance, just knowing the center of mass of a humanoid moving forward is sufficient to predict the reward, whereas knowledge of the full-body pose might be necessary to predict the optimal action. In a nutshell, reconstruction captures too much information, whereas reward prediction captures too little. Several works attempt to combine these two training signals (Hafner et al., 2020; Oh et al., 2017) but struggle to learn in complex visual scenarios.

Since the goal of the agent is to maximize the expected return, predicting the value function instead of the one-step reward may aid in learning all the relevant information (Silver et al., 2017; Oh et al., 2017; Schrittwieser et al., 2020). However, because the value function is often learned via bootstrapping, it may provide an unstable training signal.

These challenges in learning task-relevant representations inspired several works to investigate feature learning methods that neither rely on reconstruction nor solely depend on rewards. One line of work biases the learned features to capture only the controllable parts of the environment, either using an inverse model that predicts actions from a pair of states (Agrawal et al., 2015; Jayaraman & Grauman, 2015; Agrawal et al., 2016; Pathak et al., 2017; see the sketch below) or using metrics such as empowerment (Klyubin et al., 2005; Gregor et al., 2016). To understand their shortcoming, consider the scenario of an arm pushing an object. Here both the arm and the object are controllable. While it is easy to capture the part that is directly controllable (e.g., the arm), capturing all controllable features (i.e., the arm and the object) without imposing a reconstruction loss is non-trivial. Another idea that has shown promise is the bisimulation metric (Ferns et al., 2011; Zhang et al., 2021). Because supervision in bisimulation comes solely from rewards, it is subject to the same issues mentioned earlier. Another possibility is to use contrastive learning (Chen et al., 2020; Oord et al., 2018), but without additional constraints, these methods may not distinguish between relevant and irrelevant features.
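For concreteness, here is a minimal sketch of the inverse-model idea referenced above: features are trained so that the action can be recovered from a pair of consecutive encoded observations. The encoder, layer sizes, and continuous-action MSE loss are illustrative assumptions and are not taken from any of the cited works.

```python
# Minimal sketch of inverse-dynamics feature learning (illustrative only).
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.encoder = encoder                      # maps o_t to a feature vector phi_t
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),             # predicts a_t from (phi_t, phi_{t+1})
        )

    def loss(self, obs_t, obs_tp1, action_t):
        phi_t, phi_tp1 = self.encoder(obs_t), self.encoder(obs_tp1)
        pred_action = self.head(torch.cat([phi_t, phi_tp1], dim=-1))
        # The only supervision is the agent's own action, so the features are
        # shaped by what is needed to explain that action.
        return nn.functional.mse_loss(pred_action, action_t)
```

Because the only supervision is the agent's own action, such features tend to capture the directly controllable parts of the scene (the arm in the example above) while potentially missing other task-relevant factors (the pushed object).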
The preceding discussion illustrates the fundamental challenge in learning task-relevant features: some objectives (e.g., reconstruction) capture too much information, whereas others (e.g., rewards, inverse models, empowerment) capture too little. Empirically, we find that a weighted loss function that combines these objectives does not lead to task-relevant features (see Figure 1). In this work, we revisit feature learning by combining image reconstruction and reward prediction, but propose to explicitly explain away irrelevant features by constructing a cooperative two-player game between two models. These models, dubbed the task and distractor models, learn the task-relevant features ($s^+_t$) and the task-irrelevant features ($s^-_t$) of the observation ($o_t$), respectively. Similar to prior work, we force the task model to learn task-relevant features ($s^+_t$) by predicting the reward. But unlike past work, we also force the distractor model to learn task-irrelevant features ($s^-_t$) via adversarial dissociation from the reward signal. Both models, however, cooperate to reconstruct $o_t$ by maximizing $p(o_t \mid s^+_t, s^-_t)$. Our method implements a Markov decision process (MDP) with a specific factored structure, which we call the Task Informed MDP (TiMDP) (see Figure 2b). It is worth noting that the TiMDP is structurally similar to the relaxed block MDP formulation (Zhang et al., 2020) in partitioning the state space into two separate components. However, Zhang et al. (2020) neither propose a practical method for segregating relevant information nor provide any experimental validation of their framework in the context of learning from complex visual inputs.

We evaluate our method on a custom Many World environment, a suite of control tasks that specifically test the robustness of learning to visual distractions (Zhang et al., 2021), and Atari games. The results convincingly demonstrate that our method, which we call Task Informed Abstractions (TIA), successfully learns relevant features and outperforms existing state-of-the-art methods.

## 2. Preliminaries

A Markov Decision Process is represented as the tuple $\langle S, O, A, T, r, \gamma, \rho_0 \rangle$, where $O$ is a high-dimensional observation space, $A$ is the space of actions, $S$ is the state space, $\rho_0$ is the initial state distribution, and $r : S \to \mathbb{R}$ is the scalar reward. The goal of RL is to learn a policy $\pi(a \mid s)$ that maximizes the cumulative reward $J_\pi = \mathbb{E}\big[\sum_t \gamma^{t-1} r_t\big]$ discounted by $\gamma$.

Our primary contribution lies in the method for learning forward dynamics and is agnostic to the specific choice of model-based algorithm. We choose to build upon the state-of-the-art method Dreamer (Hafner et al., 2020). The main components of this model are:

- Representation model: $p_\theta(s_t \mid o_t, s_{t-1}, a_{t-1})$
- Observation model: $q_\theta(o_t \mid s_t)$
- Transition model: $q_\theta(s_t \mid s_{t-1}, a_{t-1})$
- Reward model: $q_\theta(r_t \mid s_t)$

**Model Learning.** Dreamer (Hafner et al., 2020) forecasts in a feature representation of images learned via supervision from three signals: (a) image reconstruction $\mathcal{J}^t_O \doteq \ln q(o_t \mid s_t)$, (b) reward prediction $\mathcal{J}^t_R \doteq \ln q(r_t \mid s_t)$, and (c) dynamics regularization $\mathcal{J}^t_D \doteq -\beta\,\mathrm{KL}\big[\,p(s_t \mid s_{t-1}, a_{t-1}, o_t)\,\big\|\,q(s_t \mid s_{t-1}, a_{t-1})\,\big]$. The overall objective is

$$\mathcal{J}_{\text{Dreamer}} \doteq \mathbb{E}_\tau\Big[\sum_t \big(\mathcal{J}^t_O + \mathcal{J}^t_R + \mathcal{J}^t_D\big)\Big],$$

optimized over the agent's experience $\tau$. To achieve competitive performance on Atari, a few modifications are required; these are incorporated in the variant DreamerV2 (Hafner et al., 2021).
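To make the three training signals concrete, the following is a minimal sketch of the per-step objective $\mathcal{J}^t_O + \mathcal{J}^t_R + \mathcal{J}^t_D$ written with PyTorch distributions. It is not Dreamer's released implementation; the decoder modules and the convention of passing the posterior and prior in as distribution objects are illustrative assumptions.

```python
# Sketch of a per-step Dreamer-style world-model objective (illustrative only).
import torch
import torch.distributions as D

def dreamer_step_loss(posterior: D.Distribution,   # p(s_t | s_{t-1}, a_{t-1}, o_t)
                      prior: D.Distribution,       # q(s_t | s_{t-1}, a_{t-1})
                      obs_decoder,                 # latent -> Normal over pixels (assumed)
                      reward_decoder,              # latent -> Normal over reward (assumed)
                      obs: torch.Tensor,
                      reward: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    s_t = posterior.rsample()                      # reparameterized latent sample

    # (a) image reconstruction: J_O = ln q(o_t | s_t)
    j_obs = obs_decoder(s_t).log_prob(obs).sum()

    # (b) reward prediction: J_R = ln q(r_t | s_t)
    j_rew = reward_decoder(s_t).log_prob(reward).sum()

    # (c) dynamics regularization: J_D = -beta * KL[posterior || prior]
    j_dyn = -beta * D.kl_divergence(posterior, prior).sum()

    # The model maximizes J_O + J_R + J_D, i.e. minimizes its negation.
    return -(j_obs + j_rew + j_dyn)
```

In practice, this per-step term would be summed over sequences drawn from the agent's experience $\tau$ and optimized jointly over all world-model parameters.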
**Policy Learning.** Dreamer uses the learned forward dynamics model to train a policy with the actor-critic formulation below:

$$\text{Action model: } a_\tau \sim q_\phi(a_\tau \mid s_\tau), \qquad \text{Value model: } v_\psi(s_\tau) \approx \mathbb{E}_{q(\cdot \mid s_\tau)}\Big[\sum_{\tau=t}^{t+H} \gamma^{\tau - t} r_\tau\Big]. \tag{3}$$

The action model is trained to maximize cumulative rewards over a fixed horizon $H$. Both the action and value models are learned using imagined rollouts from the learned dynamics. We refer the reader to (Hafner et al., 2020) for more details.

## 3. Learning Task Informed Abstractions

**Task Informed MDP.** In many real-world problems, the state space of the MDP cannot be directly accessed but needs to be inferred from high-dimensional sensory observations. Figure 2a shows the graphical model describing this common scenario. To explicitly segregate task-relevant and task-irrelevant factors, we propose to model the latent embedding space $S$ with two components: a task-relevant component $S^+$ and a task-irrelevant component $S^-$. We assume that the reward is fully determined by the task-relevant component, $r : S^+ \to \mathbb{R}$, and that the task-irrelevant component contains no information about the reward: $\mathrm{MI}(r_t; s^-_t) = 0$ for all $t$. In the most general case, $s^-_{t+1}$ can depend on $s^+_t$ and $s^+_{t+1}$ can depend on $s^-_t$. However, in many realistic scenarios the task-relevant and distractor features evolve independently (e.g., cars on the road vs. leaves flickering in the wind) and thus follow factored dynamics (Guestrin et al., 2003; Pitis et al., 2020). Such a situation greatly simplifies model learning. For this reason we further incorporate this factored structure into our formulation through the assumption $p(s_{t+1} \mid s_t, a_t) = p(s^+_{t+1} \mid s^+_t, a_t)\, p(s^-_{t+1} \mid s^-_t, a_t)$. The resulting MDP, which we call the Task Informed MDP (TiMDP), is illustrated in Figure 2b. Note that both $S^+$ and $S^-$ generate the observation $O$, and both forward models $p(s^+_{t+1} \mid s^+_t, a_t)$ and $p(s^-_{t+1} \mid s^-_t, a_t)$ admit the agent's actions. For clarity, we summarize the assumptions we have made in Table 1.

Figure 2: (a) The graphical model of an MDP. (b) Task Informed MDP (TiMDP). The state space decomposes into two components: $s^+_t$ captures the task-relevant features, whereas $s^-_t$ captures the task-irrelevant features. The cross-terms between $s^{+/-}$ are removed by imposing a factored MDP assumption. The red arrow indicates an adversarial loss that discourages $s^-$ from picking up reward-relevant information.

Table 1: Assumptions for TiMDP

| TiMDP | Details |
| --- | --- |
| $o_t = f(s^+_t, s^-_t)$ | Both $S^{-/+}$ contribute to $O$ |
| $r : S^+ \to \mathbb{R}$ | $r$ only depends on $S^+$ |
| $\mathrm{MI}(r_t; s^-_t) = 0$ | $S^-$ does not inform the task |
| $p(s_{t+1} \mid s_t, a_t) = p(s^+_{t+1} \mid s^+_t, a_t)\, p(s^-_{t+1} \mid s^-_t, a_t)$ (dynamics) | $s^{+/-}_{t+1}$ has no dependency on $s^{-/+}_t$ |

$$
\begin{aligned}
\mathcal{J}_T &\doteq \mathbb{E}_p\Big[\sum_t \big(\mathcal{J}^t_{O_j} + \mathcal{J}^t_R + \mathcal{J}^t_D\big)\Big], &
\mathcal{J}_S &\doteq \mathbb{E}_p\Big[\sum_t \big(\mathcal{J}^t_{O_j} + \mathcal{J}^t_{O_s} + \mathcal{J}^t_{R_{\mathrm{adv}}} + \mathcal{J}^t_{D_s}\big)\Big], \\
\mathcal{J}^t_{O_j} &\doteq \ln q(o_t \mid s^+_t, s^-_t), &
\mathcal{J}^t_{O_s} &\doteq \lambda_{O_s} \ln q(o_t \mid s^-_t), \\
\mathcal{J}^t_R &\doteq \ln q(r_t \mid s^+_t), &
\mathcal{J}^t_{R_{\mathrm{adv}}} &\doteq -\lambda_{R_{\mathrm{adv}}} \max_{q} \ln q(r_t \mid s^-_t), \\
\mathcal{J}^t_D &\doteq -\beta\,\mathrm{KL}\big[\,p(s^+_t \mid s^+_{t-1}, a_{t-1}, o_t)\,\big\|\,q(s^+_t \mid s^+_{t-1}, a_{t-1})\,\big], &
\mathcal{J}^t_{D_s} &\doteq -\beta\,\mathrm{KL}\big[\,p(s^-_t \mid s^-_{t-1}, a_{t-1}, o_t)\,\big\|\,q(s^-_t \mid s^-_{t-1}, a_{t-1})\,\big].
\end{aligned}
\tag{4}
$$

Our method involves learning two models. One model captures the task-relevant state component $s^+_t$, which we call the task model. The other model captures the task-irrelevant state component $s^-_t$, which we call the distractor model.
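Before detailing each term, the sketch below gives a concrete (and necessarily simplified) reading of Equation (4), assembling the per-step task and distractor objectives with PyTorch distributions. All module names (`joint_decoder`, `dis_only_decoder`, `task_reward_head`, `adv_reward_head`) and calling conventions are illustrative assumptions, not the released TIA implementation.

```python
# Sketch of the per-step TIA objectives J_T and J_S from Equation (4) (illustrative only).
import torch
import torch.distributions as D

def tia_step_losses(post_task, prior_task,     # posterior/prior over s+_t
                    post_dis, prior_dis,       # posterior/prior over s-_t
                    joint_decoder,             # (s+, s-) -> Normal over pixels (assumed)
                    dis_only_decoder,          # s-       -> Normal over pixels (assumed)
                    task_reward_head,          # s+       -> Normal over reward (assumed)
                    adv_reward_head,           # s-       -> Normal over reward (trained separately)
                    obs, reward,
                    beta=1.0, lambda_os=1.0, lambda_radv=1.0):
    s_task = post_task.rsample()               # sample s+_t
    s_dis = post_dis.rsample()                 # sample s-_t

    j_oj = joint_decoder(s_task, s_dis).log_prob(obs).sum()          # J_Oj: joint reconstruction
    j_os = lambda_os * dis_only_decoder(s_dis).log_prob(obs).sum()   # J_Os: distractor-only reconstruction
    j_r = task_reward_head(s_task).log_prob(reward).sum()            # J_R: reward prediction from s+
    # J_Radv: the adversarial head approximates max_q ln q(r_t | s-_t); its own
    # parameters are held fixed here while gradients flow into the distractor model.
    j_radv = -lambda_radv * adv_reward_head(s_dis).log_prob(reward).sum()

    j_d = -beta * D.kl_divergence(post_task, prior_task).sum()       # J_D
    j_ds = -beta * D.kl_divergence(post_dis, prior_dis).sum()        # J_Ds

    j_task = j_oj + j_r + j_d                  # J_T, maximized w.r.t. the task model
    j_dis = j_oj + j_os + j_radv + j_ds        # J_S, maximized w.r.t. the distractor model
    return -j_task, -j_dis                     # negate to obtain minimization losses
```

In this reading, $\mathcal{J}_T$ updates only the task-model parameters and $\mathcal{J}_S$ only the distractor-model parameters, while the adversarial reward head is fit separately in an interleaved inner loop, as described in the reward-dissociation procedure below.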
The learning objectives for these two models are denoted by $\mathcal{J}_T$ and $\mathcal{J}_S$ (task and distractor, respectively), and are expanded in Equation (4). A visual illustration is provided in Figure 3a. We explain each component in the following section.

Figure 3: Components of Task Informed Abstraction learning. (a) Learning Task Informed World Models: from the dataset of past experience, TIA uses the reward to factor the MDP into a task-relevant world model and a task-irrelevant one. (b) Policy learning only unrolls in $S^+$: only the forward dynamics in $s^+_t$ is used during policy learning. The policy is trained using back-propagation through time. Note that the images are shown just for demonstration purposes and are not generated during policy learning.

**Reward Dissociation** for the distractor model is accomplished via the adversarial objective $\mathcal{J}^t_{R_{\mathrm{adv}}}$. This is a minimax setup in which we interleave optimizing the distractor model's reward prediction head (for multiple iterations per training step) with the training of the distractor model. While the reward prediction head is trained to minimize the reward prediction loss $-\ln q(r_t \mid s^-_t)$, the distractor model maximizes this loss so as to exclude reward-correlated information from its learned features (Ganin & Lempitsky, 2015). The reward prediction loss is computed as $-\ln \mathcal{N}(r_t;\, \hat{r}_t, 1)$, where $\mathcal{N}(\cdot\,; \mu, \sigma^2)$ denotes the Gaussian likelihood and $\hat{r}_t$ is the predicted reward.

**Cooperative Reconstruction.** By jointly reconstructing the image, the distractor model, which is biased towards capturing task-irrelevant information, enables the task model to focus on task-relevant features. We implement joint reconstruction through the objective $\mathcal{J}^t_{O_j}$. Starting with a sequence of observations and actions {o[