# Improving Intrinsic Exploration with Language Abstractions

Jesse Mu1, Victor Zhong2,3, Roberta Raileanu3, Minqi Jiang3,4, Noah Goodman1, Tim Rocktäschel4*, Edward Grefenstette4,5*

1Stanford University, 2University of Washington, 3Meta AI, 4University College London, 5Cohere

Work done while at Meta AI. Correspondence to muj@stanford.edu

Reinforcement learning (RL) agents are particularly hard to train when rewards are sparse. One common solution is to use intrinsic rewards to encourage agents to explore their environment. However, recent intrinsic exploration methods often use state-based novelty measures which reward low-level exploration and may not scale to domains requiring more abstract skills. Instead, we explore language as a general medium for highlighting relevant abstractions in an environment. Unlike previous work, we evaluate whether language can improve over existing exploration methods by directly extending (and comparing to) competitive intrinsic exploration baselines: AMIGo (Campero et al., 2021) and NovelD (Zhang et al., 2021). These language-based variants outperform their non-linguistic forms by 47–85% across 13 challenging tasks from the MiniGrid and MiniHack environment suites.

## 1 Introduction

A central challenge in reinforcement learning (RL) is designing agents that can solve complex, long-horizon tasks with sparse rewards. In the absence of extrinsic rewards, one popular solution is to provide intrinsic rewards for exploration [34, 35, 42, 43]. This invariably leads to the challenging question: how should one measure exploration? One common answer is that an agent should be rewarded for attaining novel states in the environment, but naive measures of novelty have limitations. For example, consider an agent that starts in the kitchen of a large house and must make an omelet. Simple state-based exploration will reward an agent for visiting every room in the house, but a more effective strategy would be to stay put and use the stove. Moreover, like kitchens with different-colored appliances, states can look cosmetically different but have the same underlying semantics, and thus are not truly novel. Together, these constitute two fundamental challenges for intrinsic exploration: first, how can we reward true progress in the environment over meaningless exploration? Second, how can we tell when a state is not just superficially, but semantically novel?

Fortunately, humans are equipped with a powerful tool for solving both problems: language. As a cornerstone of human intelligence, language has strong priors over the features and behaviors needed for exploration and skill acquisition. It also describes a rich and compositional set of meaningful behaviors as simple as directions (e.g. move left) and as abstract as conjunctions of high-level tasks (e.g. retrieve the ring and defeat the wizard) that can categorize and unify many possible world states.

Our aim is to see whether language abstractions can improve existing state-based exploration methods in RL. While language-guided exploration methods exist in the literature [3, 5, 12, 13, 21–24, 31, 44, 51, 53], we make two key contributions over prior work. First, existing methods assume access to a high-level linguistic instruction for reward shaping, or otherwise assume that any intermediate language annotations encountered are always helpful for learning. Instead, we study settings without instructions, with more diverse intermediate messages (Figure 1) that may or may not be useful, but may nonetheless be a more effective measure of novelty than raw states.
Second, past work often compares only to vanilla RL, while ignoring competitive intrinsic exploration baselines. This leaves the true utility of language over simpler state-based exploration unclear. To remedy this issue, we conduct a controlled evaluation of the effect of language on competitive approaches to exploration by extending two recent, state-of-the-art methods: AMIGo [7], where a teacher proposes intermediate location-based goals for a student, and NovelD [54], which rewards an agent for visiting novel regions of the state space. Building upon these methods, we propose L-AMIGo, where the teacher proposes goals expressed via language instead of coordinates, and L-NovelD, a variant of NovelD with an additional exploration bonus for visiting linguistically-novel states.

Figure 1: Language conveys meaningful environment abstractions. Language state annotations in the MiniGrid KeyCorridorS4R3 [8] and MiniHack Wand of Death (Hard) [41] tasks.

Across 13 challenging, procedurally-generated, sparse-reward tasks in the MiniGrid [8] and MiniHack [41] environment suites, we show that language-parameterized exploration methods outperform their non-linguistic counterparts by 47–85%, especially in more abstract tasks with larger state and action spaces. We also show that language improves the interpretability of the training process, either by developing a natural curriculum of semantic goals (in L-AMIGo) or by allowing us to visualize the most novel language during training (in L-NovelD). Finally, we show when and where the fine-grained compositional semantics of the language improves agent exploration, when compared to non-compositional baselines.

## 2 Related Work

Exploration in RL. Exploration has a long history in RL, from ε-greedy [48] or count-based exploration [4, 29, 30, 32, 47, 50] to intrinsic motivation [33–35] and curiosity-based learning [42]. More recently, deep neural networks have been used to measure novelty with changes in state representations [6, 40, 54] or prediction errors in world models [1, 36, 46]. Another long tradition generates curricula of intrinsic goals to encourage learning [7, 13, 14, 18–20, 37–39]. In this paper, we explore the potential benefit of language for these approaches to exploration.

Language for Exploration. The observation that language-guided exploration can improve RL is not new: language has been used to shape policies [24, 51] and rewards [3, 5, 21–23, 31, 44, 53] and set intrinsic goals [12, 13]. Crucially, our work differs from prior work in two ways: first, instead of reward shaping with high-level instructions, we use noisier, intermediate language annotations for exploration; second, we directly extend and compare to competitive intrinsic exploration baselines. L-AMIGo, our variant of AMIGo with language goals, is similar to the IMAGINE agent of Colas et al. [13], which also sets intrinsic language goals.
However, IMAGINE is built for instruction following, and requires a perfectly compositional space of language goals, which the agent tries to explore so that it can complete novel goals at test time. Instead, we make no assumptions on the language and explore to maximize extrinsic reward, using an alternative goal difficulty metric to measure progress. Meanwhile, reward shaping and inverse RL methods [3, 5, 21–24, 31, 44, 51, 53] reward an agent for actions associated with linguistic descriptions, but again are primarily designed for instruction following, where an extrinsic goal is available to help shape intermediate rewards. In our setting, however, we have no high-level extrinsic goals, only low-level intermediate language annotations. Extrinsic reward shaping methods such as LEARN [22] could be naively applied by simply doing reward shaping with every intermediate language annotation, and a few of these methods are designed for low-level language subgoals [24, 31]. However, a shared assumption of these approaches is that language is always helpful: either because we have expert-curated messages (as in Harrison et al. [24]), or because we have goal descriptions that let us identify subgoals relevant to the extrinsic goal (as in ELLA; Mirchandani et al. [31]). In our tasks, however, most language is unhelpful for progress in the environment, and we have no extrinsic goals. Consequently, past methods reduce to simply giving a fixed reward for every intermediate message encountered, which (we will show) fails to learn.

Figure 2: L-AMIGo. (a) AMIGo (Campero et al.). (b) The L-AMIGo teacher first predicts achievable goals, then samples a goal. (i–iii) L-AMIGo teacher training steps: updating the goal set G, training the grounding network, and training the policy network.

Finally, work concurrent to ours by Tam et al. [49] tackles similar ideas in photorealistic environments that permit transfer from foundation models. Instead, we explore domain-specific symbolic games with built-in language where such models are not readily available. A final distinguishing contribution of our work is that prior work often neglects non-linguistic exploration baselines. For example, Harrison et al. [24] and LEARN [22] compare to vanilla RL only; ELLA [31] compares to LEARN and RIDE [40], with limited improvements over RIDE. Prior work can thus be summarized as showing that linguistic rewards improve over extrinsic rewards alone. Instead, we provide novel evidence that linguistic rewards improve upon state-based intrinsic rewards, using the same exploration methods and challenging tasks typical of recent work in RL.

## 3 Problem Statement

We explore RL in the setting of an augmented Markov Decision Process (MDP) defined as the tuple $(S, A, T, R, \gamma, L)$, where $S$ and $A$ are the state and action spaces, $T : S \times A \to S$ is the environment transition dynamics, $R : S \times A \to \mathbb{R}$ is the extrinsic reward function, where $r_t = R(s_t, a_t)$ is the reward obtained at time $t$ by taking action $a_t$ in state $s_t$, and $\gamma$ is the discount factor. To add language, we assume access to an annotator $L$ that produces language descriptions for states: $\ell_t = L(s_t)$, such as those in Figure 1.
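To make this setup concrete, below is a minimal sketch (ours, not the authors' code) of how an oracle annotator and the augmented training reward fit together; the `Annotator` protocol and the names `r_extrinsic`, `r_intrinsic`, and `lam` are illustrative assumptions.

```python
from typing import Optional, Protocol


class Annotator(Protocol):
    """Oracle annotator L: maps a state to a true description of that state,
    or None when the state has no description (the null description)."""
    def __call__(self, state) -> Optional[str]: ...


def augmented_reward(r_extrinsic: float, r_intrinsic: float, lam: float) -> float:
    """Training-time reward r_t^+ = r_t + lambda * r_t^i; at evaluation time
    only the extrinsic reward r_t matters."""
    return r_extrinsic + lam * r_intrinsic
```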
Note that not every state needs a description (which we model with a null description $\varnothing$) and the set of descriptions need not be known ahead of time.² We ultimately seek a policy that maximizes the expected discounted (extrinsic) reward $R_t = \mathbb{E}\big[\sum_{k=0}^{H} \gamma^k r_{t+k}\big]$, where $H$ is the finite time horizon. During training, however, we maximize an augmented reward $r_t^{+} = r_t + \lambda r_t^{i}$, where $r_t^{i}$ is an intrinsic reward and $\lambda$ is a scaling hyperparameter.

Like past work [26, 31, 53] we make the simplifying assumption of access to an oracle language annotator $L$ provided by the environment. Note that the annotator is oracle in that it always outputs messages that are true of the current state, but not oracle in that it indiscriminately outputs messages that are not necessarily relevant to the extrinsic goal. Many modern RL environments are pre-equipped with language, including NetHack/MiniHack [28, 41], text-based games [15, 45, 52], and in fact most video games in general. In the absence of an oracle annotator, one common approach is to learn an annotator model from a dataset of language-annotated states [3, 22, 31], though such datasets are often generated from oracles that are simply run offline instead [31]. Concurrent work [49] uses pretrained foundation models to automatically provide annotations in 3D environments, though such models are not readily available in the symbolic 2D games we explore. Since this idea has been well-proven, we assume oracle access to $L$; as an example, one could straightforwardly adapt the annotator model trained on BabyAI by Mirchandani et al. [31] to our setting.

²For presentational simplicity, the annotator here outputs a single description per state, but in practice we allow an annotator to produce multiple descriptions: e.g. in MiniGrid, open the door and open the red door describe the same state. This requires two minor changes in the equations, described in Footnotes 4 and 5.

## 4 L-AMIGo

We now describe our approach to jointly training a student and a goal-proposing teacher, extending AMIGo [7] to arbitrary language goals.

### 4.1 Adversarially Motivated Intrinsic Goals (AMIGo)

AMIGo [7] augments an RL student policy with goals generated by a teacher, which provide intrinsic reward when completed (Figure 2a). The idea is that the teacher should propose intermediate goals that start simple but grow harder, encouraging the agent to explore its environment.

Student. Formally, the student is a goal-conditioned policy parameterized as $\pi_S(a_t \mid s_t, g_t; \theta_S)$, where $g_t$ is the goal provided by the teacher, and the student receives an intrinsic reward $r_t^{i}$ of 1 only if the teacher's goal at that timestep is completed. The student receives a goal from the teacher either at the beginning of an episode, or mid-episode if the previous goal has been completed.

Teacher. Separately, AMIGo trains an adversarial teacher policy $\pi_T(g_t \mid s_0; \theta_T)$ to propose goals to the student given its initial state. The teacher is trained with a reward $r_t^{T}$ that depends on a difficulty threshold $t^{*}$: the teacher is given a positive reward of $+\alpha$ for proposing goals that take the student more than $t^{*}$ timesteps to complete, and $-\beta$ for goals that are completed sooner, or never completed within the finite time horizon. To encourage proposing harder and harder goals that promote exploration, $t^{*}$ is increased linearly throughout training: whenever the student completes 10 goals in a row under the current difficulty threshold, $t^{*}$ is increased by 1, up to some tunable maximum difficulty.
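As a reference point before the language extension, the sketch below illustrates the teacher reward rule and the linear difficulty schedule just described. It is a simplified reading of the rule, with `alpha`, `beta`, and the threshold variable names chosen by us rather than taken from the released AMIGo code.

```python
def teacher_reward(goal_completed: bool, steps_taken: int, t_star: int,
                   alpha: float, beta: float) -> float:
    """+alpha if the student needed more than t_star steps to reach the goal,
    -beta if it finished sooner or never completed the goal."""
    if goal_completed and steps_taken > t_star:
        return alpha
    return -beta


class DifficultySchedule:
    """Raise t_star by 1 whenever the student completes 10 goals in a row under
    the current threshold, up to a tunable maximum difficulty."""
    def __init__(self, t_star: int = 1, max_t_star: int = 100):
        self.t_star, self.max_t_star, self.streak = t_star, max_t_star, 0

    def update(self, completed_under_threshold: bool) -> int:
        self.streak = self.streak + 1 if completed_under_threshold else 0
        if self.streak >= 10 and self.t_star < self.max_t_star:
            self.t_star += 1
            self.streak = 0
        return self.t_star
```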
Finally, to encourage intermediate goals that are aligned with the extrinsic goal, the teacher is also rewarded with the extrinsic reward when the student attains it. The teacher is updated separately from the student, at different time intervals. Formally, its training data consists of batches of $(s_0, g_t, r_t^{T})$ tuples collected from student trajectories with nonzero $r_t^{T}$, where $s_0$ is the initial state of the student's trajectory and $g_t$ is the goal that led to reward $r_t^{T}$.

The original paper [7] implements AMIGo for MiniGrid only, where the goals $g_t$ are $(x, y)$ coordinates to be reached. The student gets the goal embedded directly in the $M \times N$ environment, and the teacher is a dimensionality-preserving convolutional network which encodes the student's $M \times N$ environment into an $M \times N$ distribution over coordinates, from which a single goal is selected.

### 4.2 Extension to L-AMIGo

Student. The L-AMIGo student is a policy conditioned not on $(x, y)$ goals, but on language goals $\ell_t$: $\pi_S(a_t \mid s_t, \ell_t; \theta_S)$. Given the goal $\ell_t$, the student is now rewarded if it reaches a state with the language description $\ell_t$, i.e. if $\ell_t = L(s_t)$.³ Typically this student will encode the goal with a learned language model and concatenate the goal representation with its state representation.

Teacher. The L-AMIGo teacher now selects goals from the set of possible language descriptions in the environment. Because the possible goals are initially unknown, the teacher maintains a running set of goals $G$ that is updated as the student encounters new state descriptions (Figure 2i). This move to language creates a challenge: not only must the teacher choose a goal to propose, it must also determine which goals are achievable at all. For example, the goal go to the red door only makes sense in environments with red doors. In L-AMIGo, these tasks are factorized into a policy network, which produces the distribution over goals given a student's state, and a grounding network, which predicts the probability that a goal is achievable in the first place (Figure 2b):

$$\pi_T(\ell_t \mid s_t; \theta_T) \propto p_{\text{ground}}(\ell_t \mid s_t; \theta_T)\, p_{\text{policy}}(\ell_t \mid s_t; \theta_T) \tag{1}$$
$$p_{\text{ground}}(\ell_t \mid s_t; \theta_T) = \sigma\!\left(f(\ell_t; \theta_T) \cdot h_{\text{ground}}(s_t; \theta_T)\right) \tag{2}$$
$$p_{\text{policy}}(\ell_t \mid s_t; \theta_T) \propto f(\ell_t; \theta_T) \cdot h_{\text{policy}}(s_t; \theta_T) \tag{3}$$

³We can treat language goals and state descriptions equivalently, even if the wordings are slightly different across environments. In MiniGrid, messages (e.g. go to the red door) look like goals but can also be interpreted as state descriptions: [in this state, you have] go[ne] to the red door. In MiniHack, messages are description-like (e.g. you kill the minotaur!), but imagine the teacher's goal as [reach a state where] you kill the minotaur!

Equation 3 describes the policy network as producing a probability for a goal by computing the dot product between the goal and state representations $f(\ell_t; \theta_T)$ and $h_{\text{policy}}(s_t; \theta_T)$, normalizing over possible goals; this policy is learned identically to the standard AMIGo teacher (Figure 2iii). Equation 2 specifies the grounding network as predicting whether a goal is achievable in an environment, by applying the sigmoid function to the dot product between the goal representation $f(\ell_t; \theta_T)$ and a (possibly separate) state representation $h_{\text{ground}}(s_t; \theta_T)$. Given an oracle grounding classifier, which outputs only 0 or 1, this is equivalent to restricting the teacher to proposing only goals that are achievable in a given environment. In practice, however, we learn the classifier online (Figure 2ii).
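A minimal sketch of the factorized teacher in Equations 1–3, assuming the goal encoder f and the two state heads have already produced fixed-size embeddings; the function and variable names are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def teacher_goal_distribution(goal_embs: torch.Tensor,  # [G, d]: f(l; theta_T) for each goal in G
                              h_policy: torch.Tensor,   # [d]: policy-head state embedding
                              h_ground: torch.Tensor    # [d]: grounding-head state embedding
                              ) -> torch.Tensor:
    p_ground = torch.sigmoid(goal_embs @ h_ground)       # Eq. 2: per-goal achievability
    p_policy = F.softmax(goal_embs @ h_policy, dim=0)    # Eq. 3: dot products normalized over G
    unnorm = p_ground * p_policy                         # Eq. 1: combine the two factors
    return unnorm / unnorm.sum()                         # renormalize to a distribution over G

# Sampling a goal to propose to the student:
# probs = teacher_goal_distribution(goal_embs, h_policy, h_ground)
# goal_idx = torch.multinomial(probs, num_samples=1)
```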
Given the initial state $s_0$ of an episode, we ask the grounding network to predict the first language description encountered along this trajectory: $\ell_{1\text{st}} = L(s_{t'})$, where $t'$ is the minimum $t$ for which $L(s_t) \neq \varnothing$. This is formalized as a multilabel binary cross-entropy loss,

$$\mathcal{L}_{\text{ground}}(s_0, \ell_{1\text{st}}) = -\log\!\left(p_{\text{ground}}(\ell_{1\text{st}} \mid s_0; \theta_T)\right) - \frac{1}{|G| - 1} \sum_{\ell' \neq \ell_{1\text{st}}} \log\!\left(1 - p_{\text{ground}}(\ell' \mid s_0; \theta_T)\right), \tag{4}$$

where the second term noisily generates negative samples of (start state, unachieved description) pairs based on the set of descriptions $G$ known to the teacher at the time, similar to contrastive learning.⁴ Note that since $G$ is updated during training, Equation 4 grows to include more terms over time.

⁴If multiple first descriptions are found, the teacher predicts 1 for each such description, and 0 for all others.

To summarize, training the teacher involves three steps: (1) updating the running set of descriptions seen in the environment, (2) learning the policy network based on whether the student achieved goals proposed by the teacher, and (3) learning the grounding network by predicting descriptions encountered from initial states. Algorithm S1 in Appendix A describes how L-AMIGo trains in an asynchronous actor-critic framework, where the student and teacher are jointly trained from batches of experience collected from separate actor threads, as used in our experiments (see Section 6).

## 5 L-NovelD

Next, we describe NovelD [54], which extends simpler tabular [47] or pseudo-count-based [4, 6] intrinsic exploration methods, and our language variant, L-NovelD. Instead of simply rewarding an agent for rare states, NovelD rewards agents for transitioning from states with low novelty to states with higher novelty. Zhang et al. [54] show that NovelD surpasses Random Network Distillation [6], another popular exploration method, on a variety of tasks including MiniGrid and Atari.

### 5.1 NovelD

NovelD defines the reward $r_t^{i}$ to be the difference in novelty between state $s_t$ and the previous state $s_{t-1}$:

$$r_t^{i} = \text{NovelD}_s(s_t, s_{t-1}) \triangleq \underbrace{\max\!\left(N(s_t) - \alpha N(s_{t-1}),\, 0\right)}_{\text{Term 1 (NovelD)}} \cdot \underbrace{\mathbb{1}\!\left(N_e(s_t) = 1\right)}_{\text{Term 2 (ERIR)}} \tag{5}$$

In the first (NovelD) term, $N(s_t)$ is the novelty of state $s_t$; the term measures the increase in novelty between successive states, clipped at 0 so the agent is not penalized for moving back to less novel states. $\alpha$ is a hyperparameter that scales the average magnitude of the reward. The second term is the Episodic Restriction on Intrinsic Reward (ERIR): a constraint that the agent only receives reward when encountering a state for the first time in an episode. $N_e(s_t)$ is an episodic state counter that tracks exact state visitation counts, as defined by $(x, y)$ coordinates.

Measuring novelty with RND. In smaller MDPs it is possible to track exact state visitation counts, in which case the novelty is typically the inverse square root of the visitation count [47]. However, in larger environments where states are rarely revisited, we (like NovelD) use the popular Random Network Distillation (RND) [6] technique as an approximate novelty measure. Specifically, the novelty of a state is measured by the prediction error of a state embedding network that is trained jointly with the agent to match the output of a fixed, random target network. The intuition is that states which the RND network has been trained on will have lower prediction error than novel states.
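For concreteness, here is a small sketch of the RND novelty measure and the NovelD bonus of Equation 5, under the simplifying assumptions that observations are flat vectors and that the predictor is trained elsewhere on the same prediction-error loss; the network sizes and the default `alpha` are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class RNDNovelty(nn.Module):
    """Novelty as the prediction error of a trained predictor network against a
    fixed, randomly initialized target network (Random Network Distillation)."""
    def __init__(self, obs_dim: int, emb_dim: int = 128):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(obs_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim))
        self.target, self.predictor = mlp(), mlp()
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target stays fixed and random

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Higher error = less-visited (more novel) input; the same quantity is
        # minimized as the predictor's training loss.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)


def noveld_bonus(novelty_t: float, novelty_prev: float,
                 first_visit_this_episode: bool, alpha: float = 0.5) -> float:
    """Eq. 5: clipped increase in novelty, gated by the episodic restriction."""
    return max(novelty_t - alpha * novelty_prev, 0.0) * float(first_visit_this_episode)
```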
### 5.2 Extension to L-NovelD

Our incorporation of language is simple: we add an additional exploration bonus based on novelty defined over the language descriptions of states:

$$\text{NovelD}_\ell(\ell_t, \ell_{t-1}) \triangleq \max\!\left(N(\ell_t) - \alpha N(\ell_{t-1}),\, 0\right) \cdot \mathbb{1}\!\left(N_e(\ell_t) = 1\right). \tag{6}$$

This bonus is identical to standard NovelD: $N(\ell)$ is the novelty of the description as measured by a separately parameterized RND network encoding the description,⁵ and $N_e(\ell_t) = 1$ when the language description has been encountered for the first time this episode. We keep the original NovelD exploration bonus, as language rewards may be sparse and a basic navigation bonus can encourage the agent to reach language-annotated states. The final intrinsic reward for L-NovelD is

$$r_t^{i} = \text{L-NovelD}(s_t, s_{t-1}, \ell_t, \ell_{t-1}) \triangleq \text{NovelD}_s(s_t, s_{t-1}) + \lambda_\ell\, \text{NovelD}_\ell(\ell_t, \ell_{t-1}) \tag{7}$$

where $\lambda_\ell$ controls the trade-off between Equations 5 and 6. One might ask why we do not simply include the language description as an input into the RND network, along with the state. While this can work in some cases, decoupling the state and language novelties allows us to precisely control the trade-off between the two, with a hyperparameter that can be tuned to different tasks. In contrast, a combined input obfuscates the relative contributions of state and language to the overall novelty. Appendix F.2 has ablations showing that (1) combining the state and language inputs or (2) using the language novelty term alone leads to worse performance.

## 6 Experiments

We evaluate L-AMIGo, AMIGo, L-NovelD, and NovelD, implemented in the TorchBeast [27] implementation of IMPALA [17], a common asynchronous actor-critic method. Besides vanilla IMPALA, we also compare to a naive (fixed) message reward given for any message in the environment, which is similar to doing extrinsic reward shaping for all messages (e.g. LEARN [22]; also [3, 5, 21, 23, 44, 53]) or to prior approaches that assume messages are always helpful (Harrison et al. [24], ELLA [31]); see Appendix B for more discussion of this baseline and its equivalences to prior work.⁶ We run each model 5 times across 13 tasks within two challenging procedurally-generated RL environments, MiniGrid [8] and MiniHack [41], and adapt baseline models provided for both environments [7, 41]; for full model, training, and hyperparameter details, see Appendix C.

### 6.1 Environments

MiniGrid. Following Campero et al. [7], we evaluate on the most challenging tasks in MiniGrid [8], which involve navigation and manipulation tasks in gridworlds: KeyCorridorS{3,4,5}R3 (Figure 1) and ObstructedMaze_{1Dl,2Dlhb,1Q}. These tasks involve picking up a ball in a locked room, with the key to the door hidden in boxes or other rooms and the door possibly obstructed. The suffix indicates the size of the environment, in increasing order. See Appendix G.1 for more details.

To add language, we use the complementary BabyAI platform [9], which provides a grammar of 652 possible messages, involving goto, open, pickup, and putnext commands applied to a variety of objects qualified by type (e.g. box, door) and/or color (e.g. red, blue). The oracle language annotator emits a message when the corresponding action is completed. On average, only 6 to 12 messages (1–2% of all 652) are needed to complete each task (see Appendix G.1 for all messages). Note that since BabyAI messages are not included in the original environment from which we adapt baseline agents [7], none of our MiniGrid agents encode language observations directly into the state.
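To give a feel for the shape of this message space, the snippet below enumerates a toy BabyAI-style grammar; it is illustrative only (the color/object lists and templates are our guesses, not the exact BabyAI grammar that yields 652 messages).

```python
from itertools import product

COLORS = ["", "red ", "green ", "blue ", "purple ", "yellow ", "grey "]
OBJECTS = ["door", "key", "ball", "box"]
object_phrases = [f"the {c}{o}" for c, o in product(COLORS, OBJECTS)]

# goto / open / pickup messages name one (possibly color-qualified) object...
messages = [f"{cmd} {obj}" for cmd, obj in product(["go to", "open", "pick up"], object_phrases)]
# ...while putnext messages mention two distinct objects.
messages += [f"put {a} next to {b}" for a, b in product(object_phrases, object_phrases) if a != b]

print(len(messages))  # several hundred strings in this toy grammar (BabyAI defines 652)
```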
While it can be tempting and beneficial to encode language into the state in this way, one a priori benefit of using language solely for exploration is that language is only needed during training, not evaluation. Regardless, see Appendix E for additional experiments with MiniGrid agents that encode language into the state representation; while this boosts the performance of baseline models, the experiments show that language-augmented exploration methods still outperform non-linguistic ones.

⁵For multiple messages, we average the NovelD bonus of each message.
⁶For an implementation of a message reward with simple novelty-based decay, see the message-only L-NovelD ablation results in Appendix F.2, which underperforms full L-NovelD and L-AMIGo.

Figure 3: Training curves. Mean extrinsic reward (± std err) across 5 independent runs for each model (IMPALA, AMIGo, L-AMIGo, NovelD, L-NovelD, naive reward) and each of the 13 MiniGrid and MiniHack tasks. In general, linguistic variants outperform their non-linguistic forms.

MiniHack. MiniHack [41] is a suite of procedurally-generated tasks of varying difficulty set in the roguelike game NetHack [28]. MiniHack contains a diverse action space beyond simple MiniGrid-esque navigation, including planning, inventory management, tool use, and combat. These actions cannot be expressed by $(x, y)$ positions, but are instead captured by in-game messages (Figure 1). We evaluate our methods on a representative suite of tasks of varying difficulty: River, Wand of Death (WoD)-{Medium,Hard}, Quest-{Easy,Medium}, and MultiRoom-{N2,N4}-Extreme. For space reasons, we describe the WoD-Hard environment here, but defer full descriptions of tasks (and messages) to Appendices G.2 and H.

In WoD-Hard, depicted in Figure 1, the agent must learn to use a Wand of Death, which can zap and kill enemies. This involves a complex sequence of actions: the agent must find the wand, pick it up, choose to zap an item, select the wand in the inventory, and finally choose the zapping direction (towards the minotaur which is pursuing the player). It must then proceed past the minotaur to the goal to receive reward. Taking these actions out of order (e.g. trying to zap with nothing in the inventory, or selecting something other than the wand) has no effect.

It is difficult to enumerate all MiniHack messages, as they are hidden in low-level game code with many edge cases. As an estimate, we can examine expert policies: agents which have solved WoD tasks encounter around 60 messages, of which only 5–10 (8–16%) are needed for successful trajectories, including inventory (f - a metal wand, what do you want to zap?, in what direction?) and combat (You kill the minotaur!, Welcome to level 2.) messages; most are irrelevant (e.g. picking up and throwing stones) or nonsensical (There is nothing to pick up, That is a silly thing to zap). In the other tasks, only 8–18% of the hundreds of unique messages are needed for success (Appendix G.2). Unlike the MiniGrid environments, we adapt baseline models from [41], which already encode the in-game message into the state representation.
Despite this, as we will show, using language as an explicit target for exploration outperforms using language as a state feature alone.

## 7 Results

Figure 3 shows training curves for AMIGo, NovelD, their language variants, and the IMPALA and naive message reward baselines. Following Agarwal et al. [2], we summarize these results with the interquartile mean (IQM) of all methods in Figure 4, with bootstrapped 95% confidence intervals constructed from 5k samples per model/environment combination.⁷,⁸ We come to the following conclusions:

⁷See Appendix D for full numeric tables and area under the curve (AUC)/probability of improvement plots.
⁸See Appendix F for ablation studies of L-AMIGo's grounding network and the components of L-NovelD.

Figure 4: Aggregate performance. Interquartile mean (IQM) of AMIGo, L-AMIGo, NovelD, and L-NovelD across tasks, per suite (MiniGrid, MiniHack), and overall. Dot is median; error bars are 95% bootstrapped CIs.

Figure 5: One-hot performance. L-AMIGo and L-NovelD compared to variants with one-hot, non-compositional goals. Plot elements same as Figure 4.

Linguistic exploration outperforms non-linguistic exploration. Both algorithms, L-AMIGo and L-NovelD, outperform their non-linguistic counterparts. Despite variance across runs and environments, averaged across all environments (Overall) we see a statistically significant improvement of L-AMIGo over AMIGo (.27 absolute, 47% relative) and of L-NovelD over NovelD (.35 absolute, 85% relative). In some tasks, Figure 3 shows that L-AMIGo and L-NovelD reach the same asymptotic performance as their non-linguistic versions, but with better sample efficiency and stability (e.g. KeyCorridorS3R3 L-AMIGo, Quest-Easy L-NovelD; see Appendix D.3 AUC plots). Lastly, the failure of the naive message reward shows that indiscriminate reward shaping fails in tasks with sufficiently diverse language; instead, some notion of novelty or difficulty is needed to make progress.

Linguistic exploration excels in larger environments. Our tasks include sequences of environments with the same underlying dynamics, but larger state spaces and thus more challenging exploration problems. In general, larger environments result in bigger improvements of linguistic over non-linguistic exploration, since the space of messages remains relatively constant even as the state space grows. For example, there is no difference in ultimate performance between language and non-language variants on KeyCorridorS3R3, yet the gap grows as the environment size grows to KeyCorridorS5R3, especially for L-NovelD. A similar trend can be seen in the WoD tasks, where AMIGo actually outperforms L-AMIGo in WoD-Medium, but is unable to learn at all in WoD-Hard.

### 7.1 Interpretability

One auxiliary benefit of our language-based methods is that the language states and goals can provide insight into an agent's training and exploration process. We demonstrate how L-AMIGo and L-NovelD agents can be interpreted in Figure 6.

Emergent L-AMIGo Curricula.
Campero et al. [7] showed that AMIGo teachers produce an interpretable curriculum, with initially easy $(x, y)$ goals located next to the student's start location, and later harder goals referencing distant locations behind doors. In L-AMIGo, we can see a similar curriculum emerge through the proportion of language goals proposed by the teacher throughout training. In the KeyCorridorS4R3 environment (Figure 6a), the teacher first proposes the generic goal open the (any) door before proposing goals naming specific colored doors (open the ⟨color⟩ door, where ⟨color⟩ is a color). Next, the agent discovers keys, and the teacher proposes pick[ing] up the key and putting it in certain locations. Finally, the teacher and student converge on the extrinsic goal pick up the ball.

Due to the complexity of the WoD-Hard environment, the teacher's curriculum is more exploratory (Figure 6c). The teacher proposes useless goals at first, such as finding staircases and slings. At one point, the teacher proposes throwing stones at the minotaur (an ineffective strategy) before devoting more time to wand actions (you see here a wand, the wand glows and fades). Eventually, as the student grows more competent, the teacher begins proposing goals that involve directly killing the minotaur (you kill the minotaur, welcome to experience level 2) before converging on the message you see a minotaur corpse, the final message needed to complete the episode.

L-NovelD Message Novelty. Similarly, L-NovelD allows for interpretation by examining the messages with the highest intrinsic reward as training progresses. In KeyCorridorS4R3 (Figure 6b), the novelty of easy goals such as open the door decreases fastest, while the novelty of the true extrinsic goal (pick up the ball) and even rarer actions (put the key next to the ball) remains high throughout training. In WoD-Hard (Figure 6d), messages vary widely in novelty: simple and nonsensical messages like that is a silly thing to zap and it's a wall quickly plummet, while the most novel messages correspond to rarer states that require killing the minotaur (you have trouble lifting a minotaur corpse).

Figure 6: Interpretation of language-guided exploration. For the KeyCorridorS4R3 and WoD-Hard environments, shown are curricula of goals proposed by the L-AMIGo teacher ((a) KeyCorridorS4R3, (c) WoD-Hard) and the intrinsic reward of messages over training, with some example messages labeled, for L-NovelD ((b) KeyCorridorS4R3, (d) WoD-Hard).

### 7.2 Do semantics matter?
Language not only denotes meaningful features in the world; its lexical and compositional semantics also explain how actions and states relate to each other. For example, in L-AMIGo, an agent might more easily go to the red door if it already knows how to go to the yellow door. Similarly, in L-NovelD, training the RND network on the message go to the yellow door could lower the novelty of similar messages like go to the red door, which might encourage exploration of semantically broader states. While our primary focus is not on whether agents can generalize to new language instructions or states, we are still interested in whether these semantics improve exploration for extrinsic rewards.

To test this hypothesis, in Figure 5 we run one-hot variants of L-AMIGo and L-NovelD where the semantics of the language annotations are hidden: each message is replaced with a one-hot identifier (e.g. go to the red door → 1, go to the blue door → 2) but otherwise functions identically to the original message. We make two observations. (1) One-hot goals actually perform quite competitively, demonstrating that the primary benefit of language in these tasks is to abstract over the state space, rather than to provide fine-grained semantic relations between states. (2) Nevertheless, L-AMIGo is able to exploit semantics, with a significant improvement (.20 absolute, 32% relative) in aggregate performance over one-hot goals, in contrast to L-NovelD, which shows no significant difference. We leave for future work a more in-depth investigation into what kinds of environments and models might benefit more from language semantics.

## 8 Discussion

The key insight of this paper is that language, even if noisy and often unrelated to the goal, is a more abstract, efficient, and interpretable space for exploration than raw state representations. To support this, we have presented variants of two popular state-of-the-art exploration methods, L-AMIGo and L-NovelD, that outperform their non-linguistic counterparts by 47–85% across 13 language-annotated tasks in the challenging MiniGrid and MiniHack environment suites.

Despite their success here, our models have some limitations. First, as is common in work like ours, it will be important to relax the reliance on oracle language annotations, perhaps by using learned state description models [31, 49]. Furthermore, L-AMIGo specifically cannot handle tasks such as the full NetHack game, which have unbounded language spaces and many redundant goals (e.g. go to/approach/arrive at the door), since it selects a single goal which must be achieved verbatim. An exciting extension to L-AMIGo would propose abstract goals (e.g. kill [any] monster or find a new item), possibly in a continuous semantic space, that can be satisfied by multiple messages. More general extensions include better understanding when and why language semantics benefits exploration (Section 7.2) and using pretrained models to imbue the models with semantics beforehand [49]. Additionally, although the agents in this work are able to explore even when not all language is useful, we must take caution in adversarial settings where the language is completely unrelated to the extrinsic task (and thus useless) or even describes harmful behaviors. Future work should measure how robust these methods are to the noisiness and quality of the language.
Nevertheless, the success of L-AMIGo and L-NovelD demonstrates the power of even noisy language in these domains, underscoring the importance of abstract and semantically meaningful measures of exploration in RL.

## Acknowledgments and Disclosure of Funding

We thank Heinrich Küttler, Mikael Henaff, Andy Shih, Alex Tamkin, and anonymous reviewers for constructive comments and feedback, and Mikayel Samvelyan for help with MiniHack. JM is supported by an Open Philanthropy AI Fellowship.

## References

[1] J. Achiam and S. Sastry. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732, 2017.
[2] R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In Advances in Neural Information Processing Systems, 2021.
[3] D. Bahdanau, F. Hill, J. Leike, E. Hughes, A. Hosseini, P. Kohli, and E. Grefenstette. Learning to understand goal specifications by modelling reward. In International Conference on Learning Representations, 2018.
[4] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016.
[5] V. Blukis, Y. Terme, E. Niklasson, R. A. Knepper, and Y. Artzi. Learning to map natural language instructions to physical quadcopter control using simulated flight. In Proceedings of the 3rd Conference on Robot Learning, 2019.
[6] Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2018.
[7] A. Campero, R. Raileanu, H. Küttler, J. B. Tenenbaum, T. Rocktäschel, and E. Grefenstette. Learning with AMIGo: Adversarially motivated intrinsic goals. In International Conference on Learning Representations, 2021.
[8] M. Chevalier-Boisvert, L. Willems, and S. Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.
[9] M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. In International Conference on Learning Representations, 2019.
[10] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.
[11] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2016.
[12] C. Colas, A. Akakzia, P.-Y. Oudeyer, M. Chetouani, and O. Sigaud. Language-conditioned goal generation: A new approach to language grounding for RL. arXiv preprint arXiv:2006.07043, 2020.
[13] C. Colas, T. Karch, N. Lair, J.-M. Dussoux, C. Moulin-Frier, P. Dominey, and P.-Y. Oudeyer. Language as a cognitive tool to imagine goals in curiosity-driven exploration. In Advances in Neural Information Processing Systems, 2020.
[14] C. Colas, T. Karch, O. Sigaud, and P.-Y. Oudeyer. Autotelic agents with intrinsically motivated goal-conditioned reinforcement learning: A short survey. arXiv preprint arXiv:2012.09830, 2020.
[15] M.-A. Côté, A. Kádár, X. Yuan, B. Kybartas, T. Barnes, E. Fine, J. Moore, M. Hausknecht, L. El Asri, M. Adada, W. Tay, and A. Trischler.
TextWorld: A learning environment for text-based games. In Workshop on Computer Games, pages 41–75, 2018.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
[17] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, pages 1407–1416, 2018.
[18] M. Fang, T. Zhou, Y. Du, L. Han, and Z. Zhang. Curriculum-guided hindsight experience replay. In Advances in Neural Information Processing Systems, 2019.
[19] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel. Reverse curriculum generation for reinforcement learning. In Proceedings of the 1st Conference on Robot Learning, pages 482–495, 2017.
[20] S. Forestier, R. Portelas, Y. Mollard, and P.-Y. Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.
[21] J. Fu, A. Korattikara, S. Levine, and S. Guadarrama. From language to goals: Inverse reinforcement learning for vision-based instruction following. In International Conference on Learning Representations, 2019.
[22] P. Goyal, S. Niekum, and R. J. Mooney. Using natural language for reward shaping in reinforcement learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), 2019.
[23] P. Goyal, S. Niekum, and R. J. Mooney. PixL2R: Guiding reinforcement learning using natural language by mapping pixels to rewards. In Proceedings of the 4th Conference on Robot Learning, pages 485–497, 2020.
[24] B. Harrison, U. Ehsan, and M. O. Riedl. Guiding reinforcement learning exploration using natural language. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, pages 1956–1958, 2018.
[25] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[26] Y. Jiang, S. Gu, K. Murphy, and C. Finn. Language as an abstraction for hierarchical deep reinforcement learning. In Advances in Neural Information Processing Systems, 2019.
[27] H. Küttler, N. Nardelli, T. Lavril, M. Selvatici, V. Sivakumar, T. Rocktäschel, and E. Grefenstette. TorchBeast: A PyTorch platform for distributed RL. arXiv preprint arXiv:1910.03552, 2019.
[28] H. Küttler, N. Nardelli, A. H. Miller, R. Raileanu, M. Selvatici, E. Grefenstette, and T. Rocktäschel. The NetHack learning environment. In Advances in Neural Information Processing Systems, 2020.
[29] M. C. Machado, M. G. Bellemare, and M. Bowling. Count-based exploration with the successor representation. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 5125–5133, 2020.
[30] J. Martin, S. N. Sasikumar, T. Everitt, and M. Hutter. Count-based exploration in feature space for reinforcement learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 2471–2478, 2017.
[31] S. Mirchandani, S. Karamcheti, and D. Sadigh. ELLA: Exploration through learned language abstraction. In Advances in Neural Information Processing Systems, 2021.
[32] G. Ostrovski, M. G.
Bellemare, A. Oord, and R. Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, pages 2721–2730, 2017.
[33] P.-Y. Oudeyer and F. Kaplan. How can we define intrinsic motivation? In Proceedings of the 8th International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, 2008.
[34] P.-Y. Oudeyer and F. Kaplan. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2009.
[35] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
[36] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning, pages 2778–2787, 2017.
[37] R. Portelas, C. Colas, K. Hofmann, and P.-Y. Oudeyer. Teacher algorithms for curriculum learning of deep RL in continuously parameterized environments. In Proceedings of the 4th Conference on Robot Learning, pages 835–853, 2020.
[38] R. Portelas, C. Colas, L. Weng, K. Hofmann, and P.-Y. Oudeyer. Automatic curriculum learning for deep RL: A short survey. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) Survey Track, pages 4819–4825, 2020.
[39] S. Racaniere, A. K. Lampinen, A. Santoro, D. P. Reichert, V. Firoiu, and T. P. Lillicrap. Automated curricula through setter-solver interactions. In International Conference on Learning Representations, 2020.
[40] R. Raileanu and T. Rocktäschel. RIDE: Rewarding impact-driven exploration for procedurally-generated environments. In International Conference on Learning Representations, 2020.
[41] M. Samvelyan, R. Kirk, V. Kurin, J. Parker-Holder, M. Jiang, E. Hambro, F. Petroni, H. Küttler, E. Grefenstette, and T. Rocktäschel. MiniHack the planet: A sandbox for open-ended reinforcement learning research. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021.
[42] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222–227, 1991.
[43] J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
[44] E. Schwartz, G. Tennenholtz, C. Tessler, and S. Mannor. Language is power: Representing states using natural language in reinforcement learning. arXiv preprint arXiv:1910.02789, 2019.
[45] M. Shridhar, X. Yuan, M.-A. Côté, Y. Bisk, A. Trischler, and M. Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021.
[46] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
[47] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
[48] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.
[49] A. C. Tam, N. C. Rabinowitz, A. K. Lampinen, N. A. Roy, S. C. Chan, D. Strouse, J. X. Wang, A. Banino, and F. Hill.
Semantic exploration from language abstractions and pretrained representations. arXiv preprint arXiv:2204.05080, 2022.
[50] H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
[51] T. Tasrin, M. S. A. Nahian, H. Perera, and B. Harrison. Influencing reinforcement learning through natural language guidance. arXiv preprint arXiv:2104.01506, 2021.
[52] J. Urbanek, A. Fan, S. Karamcheti, S. Jain, S. Humeau, E. Dinan, T. Rocktäschel, D. Kiela, A. Szlam, and J. Weston. Learning to speak and act in a fantasy text adventure game. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 673–683, 2019.
[53] N. Waytowich, S. L. Barton, V. Lawhern, and G. Warnell. A narration-based reward shaping approach using grounded natural language commands. arXiv preprint arXiv:1911.00497, 2019.
[54] T. Zhang, H. Xu, X. Wang, Y. Wu, K. Keutzer, J. E. Gonzalez, and Y. Tian. NovelD: A simple yet effective exploration criterion. In Advances in Neural Information Processing Systems, 2021.

## Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] In Section 3 we discuss limitations of an oracle annotator, and in Section 8 we discuss broader limitations.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] In Section 8 we discuss caution in adversarial settings where the language signal is useless for learning or, worse, encourages harmful behavior.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Code is included with the supplementary material and will be made public upon acceptance, with a link in Appendix C (currently anonymized).
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In Appendices B, C, G.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We follow best practices as suggested by Agarwal et al. [2] for reporting IQM with 95% bootstrapped CIs in Figures 4, 5, S2, S3, S6, S7, and Table S1. Standard error bars are shown in the main training curves in Figure 3.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Appendix C.3.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] Yes, we cited Campero et al. [7], Zhang et al. [54], Samvelyan et al. [41], Chevalier-Boisvert et al. [8], and Chevalier-Boisvert et al. [9], and the codebases we have used are included and credited in the supplementary material.
(b) Did you mention the license of the assets?
[Yes] Yes, code licenses are included in the supplementary material.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] See checklist 3(a).
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] No human data.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] No human data.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]