# Entropic Desired Dynamics for Intrinsic Control

Steven Hansen, Guillaume Desjardins, Kate Baumli, David Warde-Farley, Nicolas Heess, Simon Osindero, Volodymyr Mnih (DeepMind). Correspondence to stevenhansen@deepmind.com.

An agent might be said, informally, to have mastery of its environment when it has maximised the effective number of states it can reliably reach. In practice, this often means maximizing the number of latent codes that can be discriminated from future states under some short time horizon (e.g. [15]). By situating these latent codes in a globally consistent coordinate system, we show that agents can reliably reach more states in the long term while still optimizing a local objective. A simple instantiation of this idea, Entropic Desired Dynamics for Intrinsic Control (EDDICT), assumes fixed additive latent dynamics, which results in tractable learning and an interpretable latent space. Compared to prior methods, EDDICT's globally consistent codes allow it to be far more exploratory, as demonstrated by improved state coverage and increased unsupervised performance on hard-exploration games such as Montezuma's Revenge.

## 1 Introduction

Endowing reinforcement learning agents with the ability to learn effectively from unsupervised interaction with the environment, i.e. without access to an extrinsic reward signal, has the potential to make reinforcement learning practical in settings where the tasks the agent will face are initially unknown or where task feedback is expensive. The natural question is: what should the agent learn in the absence of extrinsic rewards? One appealing guiding principle is maximizing the number of states the agent can reach and to which it can reliably return. Intrinsic control methods have shown promise in this direction. By maximizing the mutual information between a latent code $z$ and future states reached by a policy conditioned on this code, intrinsic control methods learn to map latent codes to behaviors from which the code can be inferred.

One major limitation of such approaches is that the latent codes $z$ are usually sampled from a fixed prior distribution $p(z)$. Using a fixed prior means that such approaches are unable to learn codes that correspond to states that cannot be reached within the time horizon $T$, since any code can be sampled in any state. Simply increasing the time horizon $T$ does not solve the problem, since it leads to a sparser learning signal. Learning a state-dependent prior has proven difficult and has been shown to lead to fewer learned codes/goal states [15]. This inability to learn how to reach distant states limits the usefulness of such intrinsic control approaches.

We propose to sidestep this limitation by replacing the fixed code distribution $p(z)$ with a fixed dynamics model over codes $p(z_t \mid z_{t-1})$. Our algorithm, Entropic Desired Dynamics for Intrinsic Control (EDDICT), learns to map sequences of latent codes sampled from this dynamics model to behaviors for which the state transition dynamics in the environment match the latent code dynamics. EDDICT learns to map each $z_t$ to a state that is reachable from the state corresponding to $z_{t-1}$, allowing it to reach states much farther than the time horizon $T$ using sequences of codes $z$. We show that even highly constrained latent dynamics (i.e. additive noise) are sufficient both to interpret latent codes in terms of their corresponding locations in state space and to encourage exploratory behavior to a far greater extent than prior methods.
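To make the contrast concrete, the sketch below compares i.i.d. sampling from a fixed prior $p(z)$ with a simple additive code dynamics in which successive codes form a random walk. The Gaussian form, dimensionality, and `step_scale` are illustrative assumptions for this sketch, not the settings used by EDDICT.

```python
import numpy as np

def sample_codes_fixed_prior(num_codes, dim, rng):
    """Baseline: codes drawn i.i.d. from a fixed prior p(z), as in prior intrinsic-control work."""
    return rng.normal(size=(num_codes, dim))

def sample_codes_additive_dynamics(num_codes, dim, rng, step_scale=0.1):
    """Sketch of fixed additive code dynamics, assumed Gaussian here:
    p(z_t | z_{t-1}) = N(z_{t-1}, step_scale^2 I), so successive codes form a
    random walk in a shared, globally consistent coordinate system."""
    z = np.zeros(dim)
    codes = []
    for _ in range(num_codes):
        z = z + step_scale * rng.normal(size=dim)  # z_t = z_{t-1} + additive noise
        codes.append(z)
    return np.stack(codes)

rng = np.random.default_rng(0)
print(sample_codes_fixed_prior(3, 2, rng))        # codes unrelated across time
print(sample_codes_additive_dynamics(3, 2, rng))  # codes drift gradually away from z_0 = 0
```

Because each code is a small offset from its predecessor, a long sequence of codes can describe targets far beyond what any single short-horizon code could, which is the behavior the abstract attributes to globally consistent codes.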
Figure 1: Graphical models for various priors and posteriors of interest. Circles denote random variables which are observed (shaded) or latent (white), with diamonds denoting deterministic quantities. (a) Prior over a particular trajectory consisting of two sub-trajectories $\{\tau_0, \tau_1\}$ and auxiliary variables $\{z_0, z_1\}$. (b) Posterior inference with independent codes, as in prior work. (c) Naive posterior inference for the sub-trajectory $\{z_1, \tau_1\}$, conditioned on the past. (d) Posterior inference with hindsight. Despite $z_0$ being observed, we infer $z_1$ based on the most likely code $z_0$ to have generated $\tau_0$, using the variational reverse predictor (dashed line).

Our environment is a special case of a Markov Decision Process (MDP) without rewards or terminal signals: $\mathcal{M} : (\mathcal{S}, \mathcal{A}, P, P_0)$. $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s_{t+1} \mid s_t, a_t)$ the conditional distribution representing the state transition dynamics when taking action $a_t \in \mathcal{A}$ from state $s_t \in \mathcal{S}$, and $P_0(s)$ the initial state distribution. For simplicity, we present our method in the episodic setting with episodes of length $T = MK$, but relax this assumption in practice. Agents interact with the environment according to a policy $\pi_\theta(a \mid s)$ with parameters $\theta$, yielding trajectories $\tau = [s_0, a_0, s_1, a_1, \ldots, s_T]$ distributed as $p_{\pi_\theta}(\tau) = P_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$. It will be useful for us to segment a given trajectory $\tau$ into sub-trajectories of length $K$, with $\tau = [s_0, \tau_0, \tau_1, \ldots]$ and $\tau_i = [a_{iK}, s_{iK+1}, a_{iK+1}, \ldots, a_{(i+1)K-1}, s_{(i+1)K}]$. Note that $\tau_i$ is defined to include $s_{(i+1)K}$, but not the state $s_{iK}$ from which $a_{iK}$ was sampled. With a slight abuse of notation and denoting $\tau_{-1} := s_0$, we rewrite $p_{\pi_\theta}(\tau) = P_0(s_0) \prod_{i=0}^{M-1} p_{\pi_\theta}(\tau_i \mid \tau_{i-1})$, with $M$ the number of sub-trajectories per episode and $p_{\pi_\theta}(\tau_i \mid \tau_{i-1}) = \prod_{t=iK}^{(i+1)K-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$.

Hierarchical agents sample a high-level goal or latent variable $z \sim p(z)$ every $K$ steps, and interact with the environment via a parametric conditional low-level policy $\pi_\theta(a \mid s, z)$, which can be thought of as a fixed-duration option [42]. Composing $p(z)$, $\pi_\theta(a \mid s, z)$ and the transition dynamics yields an augmented trajectory $\Lambda = [s_0, z_0, \tau_0, z_1, \tau_1, \ldots]$, whose distribution decomposes as $p_{\pi_\theta}(\Lambda) = P_0(s_0) \prod_{i=0}^{M-1} p(z_i)\, p_{\pi_\theta}(\tau_i \mid \tau_{i-1}, z_i)$, with $p_{\pi_\theta}(\tau_i \mid \tau_{i-1}, z_i)$ defined analogously to $p_{\pi_\theta}(\tau_i \mid \tau_{i-1})$ but with the conditional policy $\pi_\theta(a_t \mid s_t, z_i)$. To simplify exposition, we index sequences at the timescale of sub-trajectories using index $i$, e.g. $[z_i, \tau_i, z_{i+1}, \tau_{i+1}, \ldots]$, and reserve index $t$ for indexing sequences at the granular timescale of actions, e.g. $[s_t, a_t, s_{t+1}, a_{t+1}, \ldots]$. Concretely, indexing by $i$ should be interpreted as $iK$, as in $s_i := s_{iK}$. For a general sequence $x = [x_0, x_1, \ldots]$, we define $x$
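To make this notation concrete, the sketch below rolls out a hierarchical agent in a toy reward-free MDP, resampling a code every $K$ steps and collecting the sub-trajectories $\tau_i$ that make up the augmented trajectory $\Lambda$. The toy environment, the additive code dynamics, and the code-seeking policy are hypothetical stand-ins for illustration only, not EDDICT's actual components.

```python
import numpy as np

class ToyEnv:
    """Reward-free toy MDP: the state is a 2-D position, actions are bounded displacements."""
    def reset(self):
        self.s = np.zeros(2)
        return self.s.copy()

    def step(self, a):
        self.s = self.s + np.clip(a, -1.0, 1.0)
        return self.s.copy()

def hierarchical_rollout(env, policy, next_code, M, K, rng):
    """Builds the augmented trajectory Lambda = [s_0, z_0, tau_0, z_1, tau_1, ...]:
    every K steps a new code z_i is drawn, and the code-conditioned low-level policy
    acts for K steps, producing sub-trajectory tau_i (which contains s_{(i+1)K} but
    not the state s_{iK} from which its first action was taken)."""
    s = env.reset()
    z = np.zeros(2)
    Lambda = [s]
    for i in range(M):
        z = next_code(z, rng)                 # z_i ~ p(z_i | z_{i-1})
        tau_i = []
        for _ in range(K):
            a = policy(s, z, rng)             # a_t ~ pi_theta(a_t | s_t, z_i)
            s = env.step(a)                   # s_{t+1} ~ P(s_{t+1} | s_t, a_t)
            tau_i.append((a, s))
        Lambda += [z, tau_i]
    return Lambda

rng = np.random.default_rng(0)
additive_code = lambda z, rng: z + 0.5 * rng.normal(size=z.shape)          # fixed additive code dynamics
code_seeking = lambda s, z, rng: (z - s) + 0.1 * rng.normal(size=s.shape)  # toy policy steering toward z
Lambda = hierarchical_rollout(ToyEnv(), code_seeking, additive_code, M=3, K=4, rng=rng)
```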