# Entropic Desired Dynamics for Intrinsic Control

Steven Hansen, Guillaume Desjardins, Kate Baumli, David Warde-Farley, Nicolas Heess, Simon Osindero, Volodymyr Mnih (DeepMind). Correspondence to stevenhansen@deepmind.com.

An agent might be said, informally, to have mastery of its environment when it has maximised the effective number of states it can reliably reach. In practice, this often means maximizing the number of latent codes that can be discriminated from future states under some short time horizon (e.g. [15]). By situating these latent codes in a globally consistent coordinate system, we show that agents can reliably reach more states in the long term while still optimizing a local objective. A simple instantiation of this idea, Entropic Desired Dynamics for Intrinsic Control (EDDICT), assumes fixed additive latent dynamics, which results in tractable learning and an interpretable latent space. Compared to prior methods, EDDICT's globally consistent codes allow it to be far more exploratory, as demonstrated by improved state coverage and increased unsupervised performance on hard-exploration games such as Montezuma's Revenge.

## 1 Introduction

Endowing reinforcement learning agents with the ability to learn effectively from unsupervised interaction with the environment, i.e. without access to an extrinsic reward signal, has the potential to make reinforcement learning practical in settings where the tasks the agent will face are initially unknown or where task feedback is expensive. The natural question is: what should the agent learn in the absence of extrinsic rewards? One appealing guiding principle is maximizing the number of states the agent can reach and to which it can reliably return. Intrinsic control methods have shown promise in this direction. By maximizing the mutual information between a latent code $z$ and future states reached by a policy conditioned on this code, intrinsic control methods learn to map latent codes to behaviors from which the code can be inferred.

One major limitation of such approaches is that the latent codes $z$ are usually sampled from a fixed prior distribution $p(z)$. Using a fixed prior means that such approaches are unable to learn codes that correspond to states that cannot be reached within the time horizon $T$, since any code can be sampled in any state. Simply increasing the time horizon $T$ does not solve the problem, since it leads to a sparser learning signal. Learning a state-dependent prior has proven difficult and has been shown to lead to fewer learned codes/goal states [15]. This inability to learn how to reach distant states limits the usefulness of such intrinsic control approaches.

We propose to sidestep this limitation by replacing the fixed code distribution $p(z)$ with a fixed dynamics model over codes $p(z_t \mid z_{t-1})$. Our algorithm, Entropic Desired Dynamics for Intrinsic Control (EDDICT), learns to map sequences of latent codes sampled from this dynamics model to behaviors for which the state transition dynamics in the environment match the latent code dynamics. EDDICT learns to map each $z_t$ to a state that is reachable from the state corresponding to $z_{t-1}$, allowing it to reach states much farther than the time horizon $T$ using sequences of codes $z$. We show that even highly constrained latent dynamics (i.e. additive noise) are sufficient both to interpret latent codes in terms of their corresponding locations in state space and to encourage exploratory behavior to a far greater extent than prior methods.
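To make the contrast concrete, the sketch below compares i.i.d. sampling from a fixed prior $p(z)$ with a simple additive code dynamics in which successive codes form a random walk. The Gaussian form, dimensionality, and `step_scale` are illustrative assumptions for this sketch, not the settings used by EDDICT.

```python
import numpy as np

def sample_codes_fixed_prior(num_codes, dim, rng):
    """Baseline: codes drawn i.i.d. from a fixed prior p(z), as in prior intrinsic-control work."""
    return rng.normal(size=(num_codes, dim))

def sample_codes_additive_dynamics(num_codes, dim, rng, step_scale=0.1):
    """Sketch of fixed additive code dynamics, assumed Gaussian here:
    p(z_t | z_{t-1}) = N(z_{t-1}, step_scale^2 I), so successive codes form a
    random walk in a shared, globally consistent coordinate system."""
    z = np.zeros(dim)
    codes = []
    for _ in range(num_codes):
        z = z + step_scale * rng.normal(size=dim)  # z_t = z_{t-1} + additive noise
        codes.append(z)
    return np.stack(codes)

rng = np.random.default_rng(0)
print(sample_codes_fixed_prior(3, 2, rng))        # codes unrelated across time
print(sample_codes_additive_dynamics(3, 2, rng))  # codes drift gradually away from z_0 = 0
```

Because each code is a small offset from its predecessor, a long sequence of codes can describe targets far beyond what any single short-horizon code could, which is the behavior the abstract attributes to globally consistent codes.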
Figure 1: Graphical models for various priors and posteriors of interest. Circles denote random variables which are observed (shaded) or latent (white), with diamonds denoting deterministic quantities. (a) Prior over a particular trajectory consisting of two sub-trajectories $\{\tau_0, \tau_1\}$ and auxiliary variables $\{z_0, z_1\}$. (b) Posterior inference with independent codes, as in prior work. (c) Naive posterior inference for the sub-trajectory $\{z_1, \tau_1\}$, conditioned on the past. (d) Posterior inference with hindsight. Despite $z_0$ being observed, we infer $z_1$ based on the most likely code $z_0$ to have generated $\tau_0$, using the variational reverse predictor (dashed line).

Our environment is a special case of a Markov Decision Process (MDP) without rewards or terminal signals: $\mathcal{M} : (\mathcal{S}, \mathcal{A}, P, P_0)$. $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s_{t+1} \mid s_t, a_t)$ the conditional distribution representing the state transition dynamics when taking action $a_t \in \mathcal{A}$ from state $s_t \in \mathcal{S}$, and $P_0(s)$ the initial state distribution. For simplicity, we present our method in the episodic setting with episodes of length $T = MK$, but relax this assumption in practice. Agents interact with the environment according to a policy $\pi_\theta(a \mid s)$ with parameters $\theta$, yielding trajectories $\tau = [s_0, a_0, s_1, a_1, \ldots, s_T]$ distributed as $p_{\pi_\theta}(\tau) = P_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$. It will be useful for us to segment a given trajectory $\tau$ into sub-trajectories of length $K$, with $\tau = [s_0, \tau_0, \tau_1, \ldots]$ and $\tau_i = [a_{iK}, s_{iK+1}, a_{iK+1}, \ldots, a_{(i+1)K-1}, s_{(i+1)K}]$. Note that $\tau_i$ is defined to include $s_{(i+1)K}$, but not the state $s_{iK}$ from which $a_{iK}$ was sampled. With a slight abuse of notation and denoting $\tau_{-1} := s_0$, we rewrite $p_{\pi_\theta}(\tau) = P_0(s_0) \prod_{i=0}^{M-1} p_{\pi_\theta}(\tau_i \mid \tau_{i-1})$, with $M$ the number of sub-trajectories per episode and $p_{\pi_\theta}(\tau_i \mid \tau_{i-1}) = \prod_{t=iK}^{(i+1)K-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$.

Hierarchical agents sample a high-level goal or latent variable $z \sim p(z)$ every $K$ steps, and interact with the environment via a parametric conditional low-level policy $\pi_\theta(a \mid s, z)$, which can be thought of as a fixed-duration option [42]. Composing $p(z)$, $\pi_\theta(a \mid s, z)$ and the transition dynamics yields an augmented trajectory $\Lambda = [s_0, z_0, \tau_0, z_1, \tau_1, \ldots]$, whose distribution decomposes as $p_{\pi_\theta}(\Lambda) = P_0(s_0) \prod_{i=0}^{M-1} p(z_i)\, p_{\pi_\theta}(\tau_i \mid \tau_{i-1}, z_i)$, with $p_{\pi_\theta}(\tau_i \mid \tau_{i-1}, z_i)$ defined analogously to $p_{\pi_\theta}(\tau_i \mid \tau_{i-1})$ but with the conditional policy $\pi_\theta(a_t \mid s_t, z_i)$. To simplify exposition, we index sequences at the timescale of sub-trajectories using index $i$, e.g. $[z_i, \tau_i, z_{i+1}, \tau_{i+1}, \ldots]$, and reserve index $t$ for indexing sequences at the granular timescale of actions, e.g. $[s_t, a_t, s_{t+1}, a_{t+1}, \ldots]$. Concretely, indexing by $i$ should be interpreted as $iK$, as in $s_i := s_{iK}$. For a general sequence $x = [x_0, x_1, \ldots]$, we define $x$
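To make this notation concrete, the sketch below rolls out a hierarchical agent in a toy reward-free MDP, resampling a code every $K$ steps and collecting the sub-trajectories $\tau_i$ that make up the augmented trajectory $\Lambda$. The toy environment, the additive code dynamics, and the code-seeking policy are hypothetical stand-ins for illustration only, not EDDICT's actual components.

```python
import numpy as np

class ToyEnv:
    """Reward-free toy MDP: the state is a 2-D position, actions are bounded displacements."""
    def reset(self):
        self.s = np.zeros(2)
        return self.s.copy()

    def step(self, a):
        self.s = self.s + np.clip(a, -1.0, 1.0)
        return self.s.copy()

def hierarchical_rollout(env, policy, next_code, M, K, rng):
    """Builds the augmented trajectory Lambda = [s_0, z_0, tau_0, z_1, tau_1, ...]:
    every K steps a new code z_i is drawn, and the code-conditioned low-level policy
    acts for K steps, producing sub-trajectory tau_i (which contains s_{(i+1)K} but
    not the state s_{iK} from which its first action was taken)."""
    s = env.reset()
    z = np.zeros(2)
    Lambda = [s]
    for i in range(M):
        z = next_code(z, rng)                 # z_i ~ p(z_i | z_{i-1})
        tau_i = []
        for _ in range(K):
            a = policy(s, z, rng)             # a_t ~ pi_theta(a_t | s_t, z_i)
            s = env.step(a)                   # s_{t+1} ~ P(s_{t+1} | s_t, a_t)
            tau_i.append((a, s))
        Lambda += [z, tau_i]
    return Lambda

rng = np.random.default_rng(0)
additive_code = lambda z, rng: z + 0.5 * rng.normal(size=z.shape)          # fixed additive code dynamics
code_seeking = lambda s, z, rng: (z - s) + 0.1 * rng.normal(size=s.shape)  # toy policy steering toward z
Lambda = hierarchical_rollout(ToyEnv(), code_seeking, additive_code, M=3, K=4, rng=rng)
```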