Learning to Follow Instructions in Text-Based Games

Mathieu Tuli, Andrew C. Li, Pashootan Vaezipoor, Toryn Q. Klassen, Scott Sanner, Sheila A. McIlraith
University of Toronto, Toronto, Canada
Vector Institute for Artificial Intelligence, Toronto, Canada
Schwartz Reisman Institute for Technology and Society, Toronto, Canada
{mathieutuli,andrewli,pashootan,toryn,sheila}@cs.toronto.edu, ssanner@mie.utoronto.ca

Abstract

Text-based games present a unique class of sequential decision making problems in which agents interact with a partially observable, simulated environment via actions and observations conveyed through natural language. Such observations typically include instructions that, in a reinforcement learning (RL) setting, can directly or indirectly guide a player towards completing reward-worthy tasks. In this work, we study the ability of RL agents to follow such instructions. We conduct experiments showing that the performance of state-of-the-art text-based game agents is largely unaffected by the presence or absence of such instructions, and that these agents are typically unable to execute tasks to completion. To further study and address the task of instruction following, we equip RL agents with an internal structured representation of natural language instructions in the form of Linear Temporal Logic (LTL), a formal language that is increasingly used for temporally extended reward specification in RL. Our framework both supports and highlights the benefit of understanding the temporal semantics of instructions and of measuring progress towards the achievement of such temporally extended behaviour. Experiments with 500+ games in TextWorld demonstrate the superior performance of our approach.

1 Introduction

Building AI agents that can understand natural language is an important and longstanding problem in AI. In recent years, instrumented text-based game (TBG) engines have served as compelling environments for studying a variety of tasks related to language understanding, affordance extraction, memory, and sequential decision making (e.g., Côté et al., 2018; Adhikari et al., 2020; Liu et al., 2022). They provide a simulated, partially observable environment where an agent can navigate and interact with environment objects, receiving observations and issuing commands via natural language. TextWorld (Côté et al., 2018) is a TBG learning environment for training reinforcement learning (RL) agents. Successful play requires language understanding, effective navigation, memory, and an ability to follow instructions embedded within the text. Instructions may or may not be directly bound to reward but can guide an RL agent towards completing tasks and collecting reward.

In this paper we study instruction following in text-based games and propose an approach that advances the previous state of the art. To this end, we employ the state-of-the-art model-free TBG RL agent called GATA (Graph Aided Transformer Agent) (Adhikari et al., 2020), which operates in the TextWorld environment. GATA has made significant advances in performance by augmenting TBG agents with long-term memory, a critical component of effective game play. Despite GATA's improvement over previous baselines, our experiments (see Figure 1) show that GATA's performance is largely unaffected by the presence or absence of instructions, leading us to conclude that GATA is not effectively following instructions.
We also find that while GATA agents are able to garner reward, they are not typically successful in completing tasks, an important vulnerability for the deployment of such techniques in environments where partial completion of tasks can be unsafe.

Figure 1: Comparison of GATA performance when trained with instructions (GATAD) versus when instructions are stripped from environment observations (GATAD-S). Agents were trained with 20 or 100 games, at increasing levels of task difficulty (level 1 vs. level 2). Note that normalized game point performance (solid blocks) and rate of success (hashed blocks) are largely unchanged whether instructions are present or absent. A low success (i.e., task completion) rate is also seen in level 2.

To further study and address the task of instruction following, we equip GATA with an internal structured representation of natural language instructions specified in Linear Temporal Logic (LTL) (Pnueli, 1977), a formal language that is increasingly used for temporally extended goal and preference specification in symbolic planning, and for reward specification and other purposes in RL (e.g., Bacchus & Kabanza, 2000; Baier & McIlraith, 2006; Baier et al., 2009; Patrizi et al., 2011; Camacho & McIlraith, 2019; Littman et al., 2017; Toro Icarte et al., 2018a,b; Camacho et al., 2019; Leon et al., 2020; Kuo et al., 2020; Vaezipoor et al., 2021). LTL also provides a mechanism to monitor progress towards the completion of instructions. Our framework both supports and highlights the benefit of understanding the temporal semantics of instructions and of measuring progress towards the achievement of temporally extended behaviour. We perform experiments that illustrate the superior performance of our TBG agent and its ability to follow instructions.

Contributions of this work include:
- Experiments that expose the lack of instruction following and the low task completion rate of a state-of-the-art TBG agent.
- An approach to the study and deployment of instruction following in TBG environments via exploitation of a formal language, LTL, which provides well-defined semantics and supports a measure of progress towards the satisfaction of instructions.
- An augmentation to an existing state-of-the-art architecture for TBGs that equips a TBG agent with instruction-following capabilities.
- Comprehensive experiments and insights that study our and others' approaches to instruction following, and that highlight the superior performance of our proposed approach.

2 Background

In this section we introduce TextWorld, the TBG engine that we use, together with the Cooking domain that we employ in our experiments. We also overview Linear Temporal Logic, which (as described in section 1) we use in our approach as an internal representation for instructions.

2.1 Text-Based Games: TextWorld

Text-based games are partially observable multi-turn games where the environment and the player's action choices are represented textually. In this work, we use TextWorld (Côté et al., 2018) as our text-based game engine.
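To make the interaction loop such an engine exposes concrete, here is a minimal sketch of driving a generated game from Python. The game file path and the commands are made up, and the exact TextWorld API (the `textworld.start` entry point, the `(game_state, reward, done)` return signature, and the `feedback` attribute) is recalled from TextWorld's documentation and may differ across versions, so treat this as illustrative rather than definitive.

```python
import textworld

# Load a generated game file (path is hypothetical) and start an environment.
env = textworld.start("tw_games/cooking.z8")
game_state = env.reset()        # initial (partial) textual observation
print(game_state.feedback)      # e.g., a room description plus any instructions

# Issue natural language commands; the engine returns feedback, reward, and a done flag.
for command in ["examine cookbook", "take red apple", "slice red apple with knife"]:
    game_state, reward, done = env.step(command)
    print(command, "->", reward, done)
    if done:
        break
```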
A text-based game can be viewed as a (discrete-time) partially observable Markov decision process (POMDP) $\langle S, T, A, O, \Omega, R, \gamma \rangle$ (Côté et al., 2018), where $S$ is the environment's state space, $A$ is the action space, $T(s_{t+1} \mid s_t, a_t)$ with $s_t, s_{t+1} \in S$ and $a_t \in A$ is the conditional transition probability between states $s_t$ and $s_{t+1}$ given action $a_t$, $O$ is the set of (partial) observations that the agent receives, $\Omega(o_t \mid s_t, a_{t-1})$ is the set of conditional observation probabilities, $R : S \times A \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1]$ is the discount factor. An agent's goal is to learn an optimal policy $\pi^*(a \mid o)$ (or a policy that conditions on historical observations or on some internal memory) that maximizes the expected discounted return. In this work, we focus on the choice-based variant of games, similar to previous works (Adhikari et al., 2020; Narasimhan et al., 2015): the action space $A$ is a list of possible commands, and at each time step $t$ in the game the agent must select an action $a_t \in C_t$ from the current subset of permissible actions $C_t \subseteq A$.

2.1.1 Environment Setting

We focus on the TextWorld Cooking domain, popularized by Adhikari et al. (2020) and Microsoft's First TextWorld Problems: A Language and Reinforcement Learning Challenge (FTWP) (Trischler et al., 2019). The game tasks agents with gathering and preparing various cooking ingredients described by an in-game recipe that must first be found. Game points (rewards) are earned for each of (1) collecting a required ingredient, (2) performing a preparatory step (some cutting or cooking action) on an ingredient as required by the recipe, (3) preparing the meal once all of the ingredients have been prepared, and (4) eating the meal. The game's partial observations can contain instructions that guide the agent towards the completion of tasks, but not all instructions correspond directly to rewards. The game first instructs the agent to examine a cookbook, which elicits a recipe to be followed. The act of examining the cookbook returns no reward, but following its recipe will return reward. See Appendix C for more details. Success is determined by whether the recipe is fully completed and the meal eaten. Preparing ingredients can also involve collecting certain tools (e.g., a knife). The game may also involve navigation: the agent may need to navigate to the kitchen or to find certain ingredients.

2.2 Linear Temporal Logic (LTL)

Linear Temporal Logic (LTL) (Pnueli, 1977) is a formal language (a propositional logical language with temporal modalities) that can be used to describe properties of trajectories. We will use LTL to specify instructions. LTL formulas are constructed from propositional variables (e.g., player-has-carrot), connectives from propositional logic (e.g., $\wedge$, $\neg$), and two temporal operators: $\bigcirc$ (NEXT) and $\mathcal{U}$ (UNTIL). Formally, we define the syntax of LTL per Baier & Katoen (2008) as

$$\varphi ::= p \mid \neg\varphi \mid \varphi \wedge \psi \mid \bigcirc\varphi \mid \varphi\,\mathcal{U}\,\psi$$

where $p \in P$ for some finite set of propositional symbols $P$.

Satisfaction of an LTL formula is determined by a sequence of truth assignments $\sigma = \langle \sigma_0, \sigma_1, \sigma_2, \ldots \rangle$ for $P$, where $p \in \sigma_i$ iff proposition $p \in P$ holds at time step $i$. Formally, $\sigma$ satisfies $\varphi$ at time $i \geq 0$, denoted as $\sigma, i \models \varphi$, under the following conditions:

- $\sigma, i \models p$ iff $p \in \sigma_i$, where $p \in P$
- $\sigma, i \models \neg\varphi$ iff $\sigma, i \not\models \varphi$
- $\sigma, i \models (\varphi \wedge \psi)$ iff $\sigma, i \models \varphi$ and $\sigma, i \models \psi$
- $\sigma, i \models \bigcirc\varphi$ iff $\sigma, i+1 \models \varphi$
- $\sigma, i \models \varphi\,\mathcal{U}\,\psi$ iff there exists $j \geq i$ such that $\sigma, j \models \psi$, and $\sigma, k \models \varphi$ for all $k \in [i, j)$

A sequence $\sigma$ is then said to satisfy $\varphi$ iff $\sigma, 0 \models \varphi$.
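To make these semantics concrete, the following minimal sketch represents formulas as nested tuples and checks satisfaction over a finite trace of truth assignments. It is not part of the paper's implementation (which instead relies on progression and the Spot library), and the finite-trace treatment, where nothing holds past the end of the trace, is an illustrative simplification of the semantics above.

```python
from typing import FrozenSet, List, Tuple, Union

# A formula is a proposition string or a tuple: ("not", f), ("and", f, g),
# ("next", f), ("until", f, g), or the derived ("eventually", f).
Formula = Union[str, Tuple]

def holds(trace: List[FrozenSet[str]], i: int, phi: Formula) -> bool:
    """Check sigma, i |= phi over a finite trace of truth assignments."""
    if i >= len(trace):              # past the end of the finite trace: nothing holds
        return False
    if isinstance(phi, str):         # atomic proposition
        return phi in trace[i]
    op = phi[0]
    if op == "not":
        return not holds(trace, i, phi[1])
    if op == "and":
        return holds(trace, i, phi[1]) and holds(trace, i, phi[2])
    if op == "next":
        return holds(trace, i + 1, phi[1])
    if op == "until":                # phi[2] holds at some j >= i, phi[1] holds until then
        return any(holds(trace, j, phi[2]) and
                   all(holds(trace, k, phi[1]) for k in range(i, j))
                   for j in range(i, len(trace)))
    if op == "eventually":           # syntactic sugar for true U phi
        return any(holds(trace, j, phi[1]) for j in range(i, len(trace)))
    raise ValueError(f"unknown operator {op}")

# Example: "eventually the player has the carrot".
trace = [frozenset(), frozenset({"player-at-kitchen"}), frozenset({"player-has-carrot"})]
print(holds(trace, 0, ("eventually", "player-has-carrot")))  # True
```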
Any LTL formula can be defined in terms of $p \in P$, $\neg$ (negation), $\wedge$ (and), $\bigcirc$ (NEXT), and $\mathcal{U}$ (UNTIL). From these operators, we can also define the Boolean operators $\vee$ (or) and $\rightarrow$ (implication), and the temporal operators $\square$ (ALWAYS) and $\lozenge$ (EVENTUALLY), where $\sigma, 0 \models \square\varphi$ if $\varphi$ always holds in $\sigma$, and $\sigma, 0 \models \lozenge\varphi$ if $\varphi$ holds at some point in $\sigma$.

2.2.1 LTL Progression

LTL formulas can also be progressed along a sequence of truth assignments (Bacchus & Kabanza, 2000; Toro Icarte et al., 2018b). In other words, as an agent acts in the environment, the resulting truth assignments can be used to update the formula to reflect what has been satisfied. The updated formula then reflects the parts of the original formula that remain to be satisfied, or whether the formula has been violated or satisfied outright. The progression operator $\mathrm{prog}(\sigma_i, \varphi)$ is defined as follows.

Definition 2.1. For LTL formula $\varphi$, truth assignment $\sigma_i$ over $P$, and $p \in P$, $\mathrm{prog}(\sigma_i, \varphi)$ is defined as

$$
\begin{aligned}
\mathrm{prog}(\sigma_i, p) &= \begin{cases} \text{true} & \text{if } p \in \sigma_i \\ \text{false} & \text{otherwise} \end{cases} \\
\mathrm{prog}(\sigma_i, \neg\varphi) &= \neg\,\mathrm{prog}(\sigma_i, \varphi) \\
\mathrm{prog}(\sigma_i, \varphi_1 \wedge \varphi_2) &= \mathrm{prog}(\sigma_i, \varphi_1) \wedge \mathrm{prog}(\sigma_i, \varphi_2) \\
\mathrm{prog}(\sigma_i, \bigcirc\varphi) &= \varphi \\
\mathrm{prog}(\sigma_i, \varphi_1\,\mathcal{U}\,\varphi_2) &= \mathrm{prog}(\sigma_i, \varphi_2) \vee \bigl(\mathrm{prog}(\sigma_i, \varphi_1) \wedge \varphi_1\,\mathcal{U}\,\varphi_2\bigr)
\end{aligned}
$$

In the context of TextWorld, the progression operator can be applied at every step in the episode to update the LTL instruction fed to the agent. To do so, it is necessary to have a labelling function that can indicate when propositions are true as the agent acts during an episode (e.g., to detect that player-has-carrot is true when the player has the carrot). We discuss how this labelling occurs in section 4, and give an example of how progression works in Appendix D.

3 Following Instructions with GATA

In order to evaluate the effectiveness of state-of-the-art text-based game agents at following instructions, we conducted experiments on the Cooking domain using the state-of-the-art model-free RL agent for TextWorld, GATA (Adhikari et al., 2020). GATA uses a transformer variant of the popular LSTM-DQN (Narasimhan et al., 2015) combined with a dynamic belief graph that is updated during game-play. The aim is to use this belief graph as long-term memory to improve action selection by modelling the underlying game dynamics (Adhikari et al., 2020). Formally, given the POMDP, GATA attempts to learn an optimal policy $\pi^*(a \mid o, g)$ where $g$ is the belief graph.

While GATA's belief graph can capture goal relations (e.g., apple-needs-cut), it turns out that agents trained to condition on observations and the GATA belief graph alone largely ignore in-game instructions. We tested a GATA agent on levels 1 and 2 in the Cooking domain, after training on either the 20-game or 100-game training set, and found that in none of those settings was the cookbook examined more than 15% of the time (3/20 testing games). In short, the GATA agent usually doesn't observe what the recipe is for the current game, meaning it has no way of knowing what the actual goal of the game is (except eventually from the rewards it gets and when the episode ends). We further investigate how GATA agents fail to follow instructions by training these agents using modified game observations that have their instructions stripped (specifically, instructions directing the agent to examine the cookbook, the recipe text within the cookbook, and instructions to grab a knife if attempting to cut an ingredient without first holding the knife were removed from observations).
This has two effects: (1) the agent no longer receives text-based instructions about what the goal is or what it should do; and (2) GATA's belief state will no longer capture goal relations like "needs". The results of this experiment are in Figure 1, and demonstrate that GATA's performance remains largely unchanged. This suggests that GATA is (here at least) (a) not exploiting text-based instructions that would lead it to success and (b) not even exploiting the goal-related relations in its own belief state.

The results in Figure 1 also show a drop in GATA's performance when moving from level 1 to level 2 in the Cooking domain, where the games' complexity is increased by just one added ingredient-preparation step in the recipes (see Table 1 for more details on the levels). GATA has difficulty fully completing tasks on level 2 games, where its success rate is roughly half its achieved normalized game points (only the latter metric was used by Adhikari et al. (2020)). Given these insights, we wish to further study and address instruction following in TBGs. In the next section, we propose using LTL and demonstrate how existing work can be easily augmented.

4 An Approach to Following Instructions

We now investigate a mechanism for both studying and advancing the ability of an RL agent to follow instructions. We do so by translating instructions to an internal structured representation of language in the form of LTL, a formal language that is increasingly being used for reward specification in RL agents (Vaezipoor et al., 2021; Leon et al., 2020; Kuo et al., 2020; Camacho et al., 2019; Toro Icarte et al., 2018b). We describe how to augment the GATA architecture with these LTL instructions and how to monitor progress towards their completion.

4.1 Generating and Representing LTL Instructions for TextWorld

We use three types of LTL instructions for the Cooking domain. The first instruction identifies the need to examine the cookbook, and is the formula (NEXT cookbook-is-examined). This instruction simply states that the agent should examine the cookbook (i.e., make cookbook-is-examined true) in the next step of the game.

The second instruction is the actual recipe that gets elicited from the cookbook. We format this instruction to be order-invariant and incomplete. Order-invariance allows the agent to complete the instructions in any order, though it is still constrained by any ordering that the TextWorld engine may enforce. Incompleteness simply refers to the fact that not every single action required to complete the recipe is encoded (e.g., grabbing a knife before slicing a carrot, or opening the fridge). The agent must still learn to do these things to accomplish its tasks, but is not directly instructed to do so. Assuming the recipe requires that predicates p1, p2, ..., pn be true, the cookbook instructions are modelled as (EVENTUALLY p1) ∧ (EVENTUALLY p2) ∧ ... ∧ (EVENTUALLY pn). For example, in the Cooking domain, this instruction might be the conjunction (EVENTUALLY apple-in-player) ∧ (EVENTUALLY meal-in-player) ∧ (EVENTUALLY meal-is-consumed).

The third and final type of instruction identifies the need to navigate to the kitchen, and is defined as (EVENTUALLY player-at-kitchen). This instruction comes prior to the first two described above, but is only used in games with navigation (see Table 1). We build a simple LTL translator that generates these instructions from the textual observations, similar to the goal generator used in Liu et al. (2022).
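To illustrate how such instructions are built and then updated as predicates become true, here is a minimal sketch restricted to the conjunction-of-EVENTUALLY form above. The predicate names are illustrative, and the real system generates formulas from parsed observations and progresses them with the Spot engine; the simplified progress function below is just the special case of Definition 2.1 for this restricted class, where a satisfied EVENTUALLY conjunct drops out and an unsatisfied one carries over unchanged.

```python
from typing import List, Set

def recipe_to_ltl(predicates: List[str]) -> List[str]:
    """Model a recipe as a conjunction of EVENTUALLY goals, one per predicate.

    The returned list of pending 'EVENTUALLY p' conjuncts stands for
    (EVENTUALLY p1) AND ... AND (EVENTUALLY pn).
    """
    return [f"EVENTUALLY {p}" for p in predicates]

def progress(conjuncts: List[str], true_props: Set[str]) -> List[str]:
    """Progress the conjunction given the propositions the labelling marks true.

    For this restricted class, prog(sigma, EVENTUALLY p) = true when p is in sigma
    (the conjunct drops out) and EVENTUALLY p otherwise (it carries forward).
    """
    return [c for c in conjuncts if c.split()[-1] not in true_props]

# Recipe: collect an apple, prepare the meal, eat the meal.
phi = recipe_to_ltl(["apple-in-player", "meal-in-player", "meal-is-consumed"])

# Step: the belief state reports the apple has been taken.
phi = progress(phi, {"apple-in-player"})
print(phi)              # ['EVENTUALLY meal-in-player', 'EVENTUALLY meal-is-consumed']
print(len(phi) == 0)    # False: the instruction is satisfied only once nothing remains
```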
TextWorld's observations are easily parsed to extract the goal information already contained within them, which we then formalize and keep track of using LTL. We provide examples of these observations and more details in Appendix E. Note that these observations are only used to generate the instruction itself; subsequently, LTL progression is used, with the GATA belief state as our labelling function, to monitor the completion of instruction steps and to update the instructions that remain to be addressed.

One possible criticism of such an LTL translator is its reliance on domain knowledge. While not the main focus of this paper, a complementary research problem that has begun to be explored is to automatically translate natural language instructions to LTL (e.g., Dzifcak et al., 2009; Finucane et al., 2010; Wang et al., 2020). Traditionally, such approaches have required large corpora of training data or hard-coded rules, and were restricted to a specific domain. However, pretrained large language models such as GPT-3 introduce the potential for a general natural-language-to-LTL translation scheme with minimal domain-specific adaptation (Hahn et al., 2022; Huang et al., 2022; Brohan et al., 2022). We explore this prospect by applying GPT-3 to TextWorld in subsection 5.5.

Finally, we note that in this work, GATA provides the domain-dependent vocabulary for describing properties of state (e.g., carrot-is-chopped), while our LTL augmentation provides the domain-independent temporal modalities (i.e., NEXT, EVENTUALLY, etc.) and the logical connectives for composing those properties of state into the instructions we use. In this way, our technique is very generalizable, limited only by the recognizable properties of state (which in our case are provided by GATA) and the instructions that can be extracted during game-play.

4.2 LTL Augmented Rewards and Episode Termination

We can also define an augmented reward function $R_{\text{LTL}}(s, a, \varphi)$, where $\varphi$ is an LTL formula, that rewards the agent for completing instructions. Given a labelling function $L : S \times A \to 2^{P}$ that assigns truth values to the propositions in $P$,

$$
R_{\text{LTL}}(s, a, \varphi) = R(s, a) + \begin{cases} 1 & \text{if } \mathrm{prog}(L(s, a), \varphi) = \text{true} \\ -1 & \text{if } \mathrm{prog}(L(s, a), \varphi) = \text{false} \\ 0 & \text{otherwise.} \end{cases}
$$

In other words, a bonus reward is given for every LTL instruction the agent satisfies and a penalty is given if the agent fails to complete an instruction. We perform an ablative study on the effect of this reward in Appendix H.3.2. We henceforth refer to this modified reward function as the LTL reward. The maximum bonus reward an agent can receive is 2 if there is no navigation task, and 3 otherwise.

Further, because we wish to satisfy instructions, we can also use the instructions to modify episode termination. That is, if our LTL instruction is violated, we have arrived in a terminal state, even if TextWorld has not indicated so. We perform an ablative study on the effect of this LTL-based termination in Appendix H.3.2.

4.3 LTL-GATA Model Architecture

We build a model similar to GATA's original architecture, augmented to include the LTL encoding of instructions and their progression according to the observed system state. We dub this model LTL-GATA and describe it in detail below. Figure 2 depicts a single episode-step interaction of LTL-GATA with TextWorld, and Figure 3 depicts the model itself. Additional details can be found in Appendix F.

Figure 2: An example of a single step in an episode of TextWorld. The game environment returns an observation $o_t$ and action candidate set $C_t$ in response to action $a_{t-1}$.
In turn, the agent's graph updater (GATA) updates its belief graph $g_t$ in response to both $o_t$ and $g_{t-1}$. Next, $g_t$ and $o_t$ update the LTL instructions: $\varphi_t$ is generated from $o_t$ after the cookbook is examined, and thereafter $\varphi_{t-1}$ is progressed to $\varphi_t$ at each time step. The policy network selects action $a_t$ from $C_t$ conditioned on $o_t$, $\varphi_t$, and $g_t$, and the cycle repeats.

Graph Updater: We use the original GATA-GTP model (Adhikari et al., 2020), which generates a discrete belief graph as a list of triplets of the form (object, relationship, object). It is composed of two sub-components: (a) the belief state updater, which generates $g_t$ from observation $o_t$ and the graph $g_{t-1}$; and (b) the graph encoder, which encodes the current graph into a vector as $\mathrm{GE}(g_t) = g'_t \in \mathbb{R}^D$ for some latent dimension $D$. The graph encoder is a relational graph convolutional network (R-GCN) (Schlichtkrull et al., 2018) using basis regularization (Schlichtkrull et al., 2018) and highway connections (Srivastava et al., 2015). We refer the reader to Adhikari et al. (2020) for more details.

LTL Updater: The LTL updater generates and progresses LTL instructions. The LTL instructions defining the need to arrive at the kitchen and to examine the cookbook are generated from the initial observation $o_0$. The subsequent instruction defining the recipe is generated from game observation $o_t$, as described in subsection 4.1, when the action "examine cookbook" is executed at time $t$. For the truth assignments (i.e., the labelling function $L$), we leverage GATA's highly accurate belief state from the graph updater. We use the Spot engine (Duret-Lutz et al., 2016) to perform the progression.

Text Encoders: For encoding the action choices $C_t$ and observations $o_t$, as well as the LTL instructions $\varphi_t$, we use a simplified version of the Transformer architecture presented by Vaswani et al. (2017). This is the same architecture used by Adhikari et al. (2020). LTL instructions are encoded directly as a string. For example, the LTL formula $\varphi$ = (EVENTUALLY p1) ∧ (EVENTUALLY p2), where p1 = pepper-in-player and p2 = pepper-is-cut, has the string representation str($\varphi$) = "eventually player_has_pepper and eventually pepper_is_cut". We format each predicate as a single token, and we show in Appendix H.3.1 that our method is robust to the predicate format. For some input string $v$ of length $\ell$, the text encoder outputs a single vector $\mathrm{TE}(v) = v' \in \mathbb{R}^D$ of dimension $D$, which is the same latent dimension as the graph encoder.

Action Selector: The action selector is a 2-layer multi-layer perceptron (MLP). The encoded state vectors $\mathrm{TE}(o_t) = o'_t \in \mathbb{R}^D$, $\mathrm{TE}(\varphi_t) = \varphi'_t \in \mathbb{R}^D$, and $\mathrm{GE}(g_t) = g'_t \in \mathbb{R}^D$ are concatenated to form the agent's final state representation $z_t = [o'_t; \varphi'_t; g'_t] \in \mathbb{R}^{3D}$. In contrast to Adhikari et al. (2020), we concatenate features rather than use the bi-directional attention-based aggregator; this reduced the model's complexity and worked just as well experimentally. This vector is then repeated $n_c$ times and concatenated with the encoded action choices $C'_t \in \mathbb{R}^{n_c \times D}$, where $n_c$ is the number of action choices. The resulting input matrix is fed to the MLP, which returns a vector of Q-values $q_c \in \mathbb{R}^{n_c}$, one per action.

Training: Formally, for belief state $g$ and LTL instruction $\varphi$, LTL-GATA aims to learn an optimal policy $\pi^*(a \mid o, g, \varphi)$. To learn this policy, we implement Double DQN (DDQN) (Van Hasselt et al., 2016) with the reward function and termination criteria discussed in subsection 4.2. We use a prioritized experience replay buffer (Schaul et al., 2016).
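As a sketch of the action selector just described, the snippet below concatenates the three encoded state vectors, repeats the result across the candidate actions, and scores each candidate with a two-layer MLP to produce the Q-values that DDQN trains. The hidden size, dimensions, and module structure are illustrative assumptions; the transformer text encoder and R-GCN graph encoder are abstracted away as precomputed vectors.

```python
import torch
import torch.nn as nn

class ActionSelector(nn.Module):
    """Two-layer MLP scoring each candidate action against z_t = [o'_t; phi'_t; g'_t]."""

    def __init__(self, d: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * d, hidden),   # state (3D) concatenated with one action encoding (D)
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_vec, ltl_vec, graph_vec, action_vecs):
        # obs_vec, ltl_vec, graph_vec: shape (D,); action_vecs: shape (n_c, D)
        z = torch.cat([obs_vec, ltl_vec, graph_vec], dim=-1)            # (3D,)
        z = z.unsqueeze(0).expand(action_vecs.size(0), -1)              # repeat n_c times -> (n_c, 3D)
        q = self.mlp(torch.cat([z, action_vecs], dim=-1)).squeeze(-1)   # (n_c,) Q-values
        return q

# Example with D = 64 and 5 candidate actions; the greedy action is the argmax Q-value.
D, n_c = 64, 5
selector = ActionSelector(D)
q_values = selector(torch.randn(D), torch.randn(D), torch.randn(D), torch.randn(n_c, D))
action_index = int(q_values.argmax())
```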
Refer to Appendix G.2 for further details.

5 Experiments

Our experimental assessment was designed both to understand how well GATA was exploiting observational instructions, as discussed in section 3, and to assess the instruction-following performance of our proposed approach relative to this state of the art (not only in terms of game points but also successful completion). We additionally strove to assess features of our approach (such as monitoring instruction progress) that contributed to its performance, as well as general challenges to text-based game playing that limited its performance (such as navigation).[1]

Figure 3: LTL-GATA's policy model. The model chooses action $a_t \in C_t$ conditioned on the state $z_t = [o'_t; \varphi'_t; g'_t]$. The action selector chooses $a_t$ based on the predicted Q-values.

Table 1: Cooking Levels

Level | Recipe Size | Rooms | Max Score | Need {Grab, Cut, Cook}
0     | 1           | 1     | 3         | {✓, ✗, ✗}
1     | 1           | 1     | 4         | {✓, ✓, ✗}
2     | 1           | 1     | 5         | {✓, ✓, ✓}
3     | 1           | 9     | 3         | {✓, ✗, ✗}

5.1 Experimental Setup

Games. To have as fair a comparison with Adhikari et al. (2020) as possible, we reused the sets of games they had generated. For the training games, they created two sets: one that contains 20 unique games per level and another that contains 100 unique games per level. The validation and testing sets each have 20 unique games per level. The levels we chose to use in our assessment are shown in Table 1. Note that in our assessment we omit Levels 4 and 5. Level 4 is an augmentation of Level 3 that adds more ingredients; at this level, both GATA and LTL-GATA suffer from the navigation issues we discuss later with respect to Level 3. As we wanted to focus on instruction following and not navigation, we omitted this level and chose to use Level 0 instead. Level 5 is simply a random combination of all levels, so it is omitted for similar reasons.

Hyper-parameters. We replicate all but three hyper-parameters from Adhikari et al. (2020): (1) we use a batch size of 200 instead of 64 when training on the 100-game set, (2) for level 3, we use Boltzmann action selection, and (3) we use Adam (Kingma & Ba, 2015) with a learning rate of 0.0003 instead of RAdam (Liu et al., 2020) with a learning rate of 0.001. These changes boosted performance for all models. See Appendix H.1 for more details.

Baselines. We compare against (1) TDQN (Adhikari et al., 2020), the transformer variant of the LSTM-DQN (Narasimhan et al., 2015) model, (2) GATAC, and (3) GATAD. GATAC is GATA's best-performing model (GATA-COC), which uses a continuous graph-updater pre-trained using contrastive observation classification. GATAD is a similarly performant model (GATA-GTP) that uses a discrete graph-updater pre-trained with ground-truth graphs from the FTWP dataset. Finally, we note that we found a few issues with GATA's original code[2] and have since fixed them (see Appendix H.4.1). For comparison, we include the original paper's GATA models, labelled GATAC-P and GATAD-P.

[1] Our code for the experiments can be found at https://github.com/MathieuTuli/LTL-GATA
[2] https://github.com/xingdi-eric-yuan/GATA-public, released under the open-source MIT License.

Measuring performance. We measure performance using two metrics: normalized accumulated game points and game success rate. We report results averaged over 3 seeds for each experiment. Previous works compared using only the normalized accumulated game points; however, this may sometimes be misleading: an agent could earn 3/4 = 0.75 points on all games but never actually succeed on any. In contrast, measuring the success rate alongside the normalized game points allows for a more complete analysis of the agent's ability to play and complete these games.
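Both metrics are simple functions of per-game outcomes; the sketch below, with made-up episode results, shows the gap the text describes between partial credit and outright success.

```python
def normalized_points(scores, max_scores):
    """Average of per-game (points earned / maximum points available)."""
    return sum(s / m for s, m in zip(scores, max_scores)) / len(scores)

def success_rate(scores, max_scores):
    """Fraction of games finished completely (all available points earned)."""
    return sum(s == m for s, m in zip(scores, max_scores)) / len(scores)

# Hypothetical results on four level-2 games (maximum score 5 each):
scores, max_scores = [5, 3, 4, 3], [5, 5, 5, 5]
print(normalized_points(scores, max_scores))  # 0.75 -- looks respectable
print(success_rate(scores, max_scores))       # 0.25 -- but only one game was completed
```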
5.2 LTL-GATA Compared to Baselines

Consistently high performance with 20 training games. We see from Figure 4 that LTL-GATA exhibits consistently high performance across levels as compared to the baselines when trained on the 20-game set. In particular, LTL-GATA maintains its performance on level 2, where the game's slight increase in complexity causes large performance drop-offs in other methods. Our agent can easily complete the added task and maintains performance similar to that on level 1.

Figure 4: Testing scores across various levels and on both the 20 (top) and 100 (bottom) game training sets. We select the top-performing models (per seed) on the validation set during training, apply those models on the test set, and report the average scores.

Figure 5: (a) A comparison of GATAD performance when given the Option to examine the cookbook vs. when it is Required to examine the cookbook. (b) A comparison of LTL-GATA with (Prog) and without (No Prog) LTL progression.

Large performance gains with 100 training games. We see from Figure 4 that LTL-GATA gains considerable performance when trained on 100 games. With the added games, our agent is exposed to more predicates and can generalize better to the testing set. Future work may look at how to achieve this kind of generalization without having to expose the agent to more predicates.

Success rate and normalized game points. Looking at the performance of GATA on level 2, it becomes apparent why measuring success is important. Although GATA achieves almost 0.4 normalized points, its actual success rate is near 0 for the original GATA models, and is only 60% of the normalized points for the fixed models, averaged across both training sets. In contrast, LTL-GATA exhibits both high normalized points and a high success rate, where the average success rate across both training sets is 82% of the normalized points.

Competitive performance on level 3. Level 3 introduces the added challenge of navigation. LTL-GATA outperforms GATA on this level as well, but not to the degree of previous levels. Inspecting testing trajectories, it becomes evident that both LTL-GATA and the GATA methods struggle with navigation on this level, and have difficulty even navigating to the kitchen in the first place. Exploring at test time to find items and rooms in an unknown environment is a major challenge built into text-based games. Hypothetically, LTL could contribute to addressing this challenge. LTL could be used to dictate strategy and/or to simply track such exploration (e.g., for remembering which rooms have been previously visited). LTL might also be used to encode learned navigation instructions (e.g.,
"find the blue door, go through it, then turn right"). We do not pursue this line of research here, but it is an interesting direction for future work.

5.3 Does LTL Progression Matter?

We show in Figure 5b that the use of progression is critical to performance: LTL-GATA without progression incurs a large performance drop-off, falling below the performance of the baselines as well. Without progression, the LTL instruction does not reflect the changes brought about by the agent's actions. This appears to confuse the agent considerably, as demonstrated by its performance drop-off.

5.4 Forcing GATA to Examine the Cookbook

Because LTL-GATA is always tasked with examining the cookbook, we ask whether a similar tasking for GATA improves performance. We experiment with GATAD by forcing the agent to examine the cookbook on the first step of the episode. Forcing GATA to examine the cookbook elicits goal relations like (apple, needs, cut) in the belief state. We show, however, in Figure 5a that GATA does not improve when given the cookbook. This shows that GATA cannot make use of the information elicited from the cookbook, continuing to ignore important instructions. Even with goal relations present in its belief state, GATA fails to properly attend to this information. This highlights the benefits of the formalized representation of instructions used by LTL-GATA.

5.5 On Automatic Translation: Natural Language Instructions to LTL

While LTL-GATA relies on a handcrafted LTL translator to provide initial instructions from text observations, we investigate the potential of automating this step using pretrained large language models. This is not a central focus of the paper; rather, we include this exploration as a proof of concept that the use of LTL is not a barrier to broad deployment of the work presented here. To this end, we evaluate whether GPT-3 (Brown et al., 2020) can few-shot learn to translate TextWorld observations to LTL, given only six examples and without additional training. We experiment with two GPT-3 models from OpenAI: Ada (the fastest model) and Davinci (the most powerful model). We perform few-shot translation by constructing prompts that contain six example translations, followed by the natural language observation to translate (the test case). The examples remain fixed for all test cases and follow the form "NL: <natural language observation>. LTL: <LTL formula>". The test case follows the form "NL: <natural language observation>. LTL:", where the model must complete the prompt, thereby performing a translation. We consider a response that exactly matches the ground-truth LTL formula as absolutely correct, a response that is otherwise correct except for parentheses and spaces as almost correct, and all other responses as incorrect. Further details and examples can be found in Appendix H.6.

Out of 234 test cases, Davinci translated 93.2% absolutely correctly and another 5.6% almost correctly, with only 1.3% of examples incorrect. Davinci displayed an impressive ability to generalize to unseen adjectives (e.g., is_grilled), nouns (e.g., carrot), and compositions of formulas. Unfortunately, the weaker model, Ada, translated 100% of examples incorrectly. We found that Ada commonly hallucinated nonsensical new words and predicates, such as ingredient_is_salt_is_diced or banana_pork_chop_in_player, leading to erroneous translations.
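The sketch below shows the shape of such a few-shot prompt. The example NL/LTL pairs are invented for illustration (the actual prompts are given in Appendix H.6), and the call to OpenAI's GPT-3-era Completion endpoint with the base davinci engine is an assumption about the setup rather than a record of it.

```python
import openai  # assumes the GPT-3-era openai library and an API key in the environment

# Six fixed example translations would go here; two illustrative ones are shown for brevity.
EXAMPLES = [
    ("You need to take the red apple.", "eventually red_apple_in_player"),
    ("Slice the carrot, then roast it.", "eventually carrot_is_sliced and eventually carrot_is_roasted"),
]

def build_prompt(observation: str) -> str:
    """Fixed few-shot examples followed by the test case, which the model must complete."""
    shots = "\n".join(f"NL: {nl} LTL: {ltl}" for nl, ltl in EXAMPLES)
    return f"{shots}\nNL: {observation} LTL:"

prompt = build_prompt("You need to dice the purple potato and fry it.")
response = openai.Completion.create(   # signature assumed from the GPT-3-era Completion API
    engine="davinci", prompt=prompt, max_tokens=64, temperature=0.0, stop="\n")
print(response["choices"][0]["text"].strip())
```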
6 Related Work

Text-based games. In this work we equip a text-based deep RL agent with formalized LTL instructions, building on previous works that employed belief graphs for solving text-based games. Adhikari et al. (2020) focused on supervised (i.e., translation) and self-supervised learned mechanisms to construct such belief graphs, whereas Ammanabrolu & Hausknecht (2020), Yin & May (2019b), and Ammanabrolu & Riedl (2019) employed rule-based methods. At a larger scope, there is a host of other work on playing text-based games using deep reinforcement learning (Hausknecht et al., 2020; Zahavy et al., 2018; Jain et al., 2020; Yin & May, 2019a). Yuan et al. (2018) used count-based memory to shape the reward and improve exploration and generalization in a simple domain. Narasimhan et al. (2015) and He et al. (2016) proposed variations of an LSTM-based model, from which the TDQN model used in this work is built. In recently published work, Liu et al. (2022) took a model-based approach, focusing on object-oriented dynamics. However, these works do not address the role and representation of instructions that defines our work. Kimura et al. (2021) do employ a neuro-symbolic RL method using Logical Neural Networks; however, it does not focus on instructions, it operates over all logical facts of the environment, and it is applied to a simpler domain.

Instruction following and Linear Temporal Logic. Vaezipoor et al. (2021) trained an RL agent to follow various LTL instructions in both discrete and continuous action-space visual environments. They used R-GCNs to learn representations of the LTL instructions and also employed LTL progression. Their model showed good generalization performance on unseen instructions similar to, and much larger than, those observed during training. However, in contrast to the work presented here, they relied on ground-truth labelling functions and operated in fully observable settings, while we use GATA's learned belief graphs, in a partially observable setting, to evaluate the truth or falsity of propositions and to progress formulae. We further distinguish ourselves from this work by opting to train the LTL semantics end-to-end using a transformer rather than an R-GCN. Works using LTL for reward specification (Leon et al., 2020; Kuo et al., 2020; Camacho et al., 2019; Toro Icarte et al., 2018b; Littman et al., 2017) or advice (Toro Icarte et al., 2018a) in RL agents exist; however, they focus on neither text-based environments nor partially observable ones.

7 Conclusion

We studied the ability of RL agents to follow instructions in text-based games using TextWorld. We conducted experiments to show how current state-of-the-art model-free agents largely fail to exploit instructions and do not typically complete prescribed tasks. We then showed how LTL can be used to construct internal structured representations for state augmentation that result in large performance improvements and more reliable instruction following and task completion. Experiments showed that monitoring instruction progress was critical to these gains. Our method inherits limitations in dealing with navigation and unseen games from prior work, but these concerns are somewhat orthogonal to our focus on instruction following. Furthermore, we can consider the broader impact of this work by relating it to the critical need for good instruction following in safety-oriented domains such as autonomous transport or health care. We would like to suggest that work towards building better language agents should also emphasize the importance of completing instructions.
To illustrate, for an agent to help a person half-way across a street, or to start but not finish a medical operation, may be worse than for it to do nothing at all. To that end, we have proposed using (game) success rate as a metric for future work, and demonstrated that LTL-GATA is very successful in the games it plays, relative to the state of the art. Overall, we intend this paper to highlight the importance of studying instruction following in environments like TextWorld that act as proxies for the general class of problems dealing with language understanding and human-machine interaction.

Finally, in follow-on work we would like to explore more complex text-based games such as the Jericho environment (Hausknecht et al., 2020). These games involve a number of distinct challenges, including exploration, navigation, puzzle solving, language understanding, and instruction following. In this vein, we'd like to see whether LTL can be exploited to capture (learned) domain-specific strategic advice, or memory, to tackle both navigation and exploration challenges. We'd like to further explore seamless ways to exploit the merits of natural language together with the benefits afforded by the compositional syntax and semantics of formal languages such as LTL. To this end, further advancing our explorations in translating natural language to LTL is of interest and import, for this and a diversity of other applications in and outside RL.

Acknowledgements

We thank the NeurIPS reviewers for their constructive feedback, and also the reviewers from the Wordplay: When Language Meets Games workshop at NAACL 2022, where a preliminary version of this paper appeared (Tuli et al., 2022). We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada CIFAR AI Chairs Program, and Microsoft Research. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute for Artificial Intelligence (www.vectorinstitute.ai/partners). Finally, we thank the Schwartz Reisman Institute for Technology and Society for providing a rich multi-disciplinary research environment.

References

Adhikari, A., Yuan, X., Côté, M.-A., Zelinka, M., Rondeau, M.-A., Laroche, R., Poupart, P., Tang, J., Trischler, A., and Hamilton, W. Learning dynamic belief graphs to generalize on text-based games. Advances in Neural Information Processing Systems (NeurIPS 2020), 33:3045–3057, 2020.
Ammanabrolu, P. and Hausknecht, M. J. Graph constrained reinforcement learning for natural language action spaces. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=B1x6w0EtwH.
Ammanabrolu, P. and Riedl, M. O. Playing text-adventure games with graph-based deep reinforcement learning. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 3557–3565. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1358. URL https://doi.org/10.18653/v1/n19-1358.
Bacchus, F. and Kabanza, F. Using temporal logics to express search control knowledge for planning. Artificial Intelligence, 116(1-2):123–191, 2000.
Baier, C. and Katoen, J. Principles of Model Checking. MIT Press, 2008.
Baier, J. and McIlraith, S. Planning with temporally extended goals using heuristic search. In Proceedings of the 16th International Conference on Automated Planning and Scheduling (ICAPS06), pp. 342–345, June 2006.
Baier, J. A., Bacchus, F., and McIlraith, S. A. A heuristic search approach to planning with temporally extended preferences. Artificial Intelligence, 173(5-6):593–618, 2009.
Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., et al. Do as I can, not as I say: Grounding language in robotic affordances. In 6th Annual Conference on Robot Learning, 2022.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Camacho, A. and McIlraith, S. A. Strong fully observable non-deterministic planning with LTL and LTLf goals. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), pp. 5523–5531, 2019.
Camacho, A., Toro Icarte, R., Klassen, T. Q., Valenzano, R. A., and McIlraith, S. A. LTL and beyond: Formal languages for reward function specification in reinforcement learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pp. 6065–6073, 2019.
Côté, M., Kádár, Á., Yuan, X., Kybartas, B., Barnes, T., Fine, E., Moore, J., Hausknecht, M. J., Asri, L. E., Adada, M., Tay, W., and Trischler, A. TextWorld: A learning environment for text-based games. In Cazenave, T., Saffidine, A., and Sturtevant, N. R. (eds.), Computer Games - 7th Workshop, CGW 2018, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers, volume 1017 of Communications in Computer and Information Science, pp. 41–75. Springer, 2018. doi: 10.1007/978-3-030-24337-1_3. URL https://doi.org/10.1007/978-3-030-24337-1_3.
Duret-Lutz, A., Lewkowicz, A., Fauchille, A., Michaud, T., Renault, E., and Xu, L. Spot 2.0: A framework for LTL and ω-automata manipulation. In Proceedings of the 14th International Symposium on Automated Technology for Verification and Analysis (ATVA), pp. 122–129. Springer, 2016.
Dzifcak, J., Scheutz, M., Baral, C., and Schermerhorn, P. What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA), pp. 4163–4168. IEEE, 2009.
Finucane, C., Jing, G., and Kress-Gazit, H. LTLMoP: Experimenting with language, temporal logic and robot control. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1988–1993. IEEE, 2010.
Hahn, C., Schmitt, F., Tillman, J. J., Metzger, N., Siber, J., and Finkbeiner, B. Formal specifications from natural language. arXiv preprint arXiv:2206.01962, 2022.
Hausknecht, M., Ammanabrolu, P., Côté, M.-A., and Yuan, X. Interactive fiction games: A colossal adventure. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7903–7910, 2020.
He, J., Chen, J., He, X., Gao, J., Li, L., Deng, L., and Ostendorf, M. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016. doi: 10.18653/v1/p16-1153. URL https://doi.org/10.18653/v1/p16-1153.
Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022.
Jain, V., Fedus, W., Larochelle, H., Precup, D., and Bellemare, M. G. Algorithmic improvements for deep reinforcement learning applied to interactive fiction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 4328–4336, 2020.
Kimura, D., Ono, M., Chaudhury, S., Kohita, R., Wachi, A., Agravante, D. J., Tatsubori, M., Munawar, A., and Gray, A. Neuro-symbolic reinforcement learning with first-order logic. In Moens, M., Huang, X., Specia, L., and Yih, S. W. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 3505–3511. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.283. URL https://doi.org/10.18653/v1/2021.emnlp-main.283.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
Kuo, Y., Katz, B., and Barbu, A. Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of LTL formulas. In IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020, Las Vegas, NV, USA, October 24, 2020 - January 24, 2021, pp. 5604–5610. IEEE, 2020. doi: 10.1109/IROS45743.2020.9341325. URL https://doi.org/10.1109/IROS45743.2020.9341325.
Leon, B. G., Shanahan, M., and Belardinelli, F. Systematic generalisation through task temporal logic and deep reinforcement learning. arXiv preprint arXiv:2006.08767, 2020.
Littman, M. L., Topcu, U., Fu, J., Jr., C. L. I., Wen, M., and MacGlashan, J. Environment-independent task specifications via GLTL. CoRR, abs/1704.04341, 2017. URL http://arxiv.org/abs/1704.04341.
Liu, G., Adhikari, A., Farahmand, A.-m., and Poupart, P. Learning object-oriented dynamics for planning from text. In ICLR 2022, April 2022. URL https://www.microsoft.com/en-us/research/publication/learning-object-oriented-dynamics-for-planning-from-text/.
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rkgz2aEKDr.
Narasimhan, K., Kulkarni, T. D., and Barzilay, R. Language understanding for text-based games using deep reinforcement learning. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y. (eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 1–11. The Association for Computational Linguistics, 2015. doi: 10.18653/v1/d15-1001. URL https://doi.org/10.18653/v1/d15-1001.
Patrizi, F., Lipovetzky, N., Giacomo, G. D., and Geffner, H. Computing infinite plans for LTL goals using a classical planner. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), pp. 2003–2008. IJCAI/AAAI, 2011. doi: 10.5591/978-1-57735-516-8/IJCAI11-334. URL https://doi.org/10.5591/978-1-57735-516-8/IJCAI11-334.
Pnueli, A. The temporal logic of programs. In Proceedings of the 18th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 46–57. IEEE, 1977.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In Bengio, Y. and LeCun, Y. (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.05952.
Schlichtkrull, M., Kipf, T. N., Bloem, P., Berg, R. v. d., Titov, I., and Welling, M. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Springer, 2018.
Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
Toro Icarte, R., Klassen, T. Q., Valenzano, R., and McIlraith, S. A. Advice-based exploration in model-based reinforcement learning. In Proceedings of the 31st Canadian Conference on Artificial Intelligence (CCAI), pp. 72–83. Springer, 2018a.
Toro Icarte, R., Klassen, T. Q., Valenzano, R., and McIlraith, S. A. Teaching multiple tasks to an RL agent using LTL. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 452–461, 2018b.
Trischler, A., Côté, M.-A., and Lima, P. First TextWorld Problems, the competition: Using text-based games to advance capabilities of AI agents. Microsoft Research Blog, 2019. URL https://www.microsoft.com/en-us/research/blog/first-textworld-problems-the-competition-using-text-based-games-to-advance-capabilities-of-ai-agents/.
Tuli, M., Li, A., Vaezipoor, P., Klassen, T. Q., Sanner, S., and McIlraith, S. A. Instruction following in text-based games. In Wordplay: When Language Meets Games Workshop @ NAACL 2022, 2022.
Vaezipoor, P., Li, A. C., Toro Icarte, R. A., and McIlraith, S. A. LTL2Action: Generalizing LTL instructions for multi-task RL. In International Conference on Machine Learning, pp. 10497–10508. PMLR, 2021.
Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
Wang, C., Ross, C., Kuo, Y., Katz, B., and Barbu, A. Learning a natural-language to LTL executable semantic parser for grounded robotics. In Kober, J., Ramos, F., and Tomlin, C. J. (eds.), 4th Conference on Robot Learning, CoRL 2020, 16-18 November 2020, Virtual Event / Cambridge, MA, USA, volume 155 of Proceedings of Machine Learning Research, pp. 1706–1718. PMLR, 2020. URL https://proceedings.mlr.press/v155/wang21g.html.
Yin, X. and May, J. Comprehensible context-driven text game playing. In 2019 IEEE Conference on Games (CoG), pp. 1–8. IEEE, 2019a.
Yin, X. and May, J. Learn how to cook a new recipe in a new house: Using map familiarization, curriculum learning, and bandit feedback to learn families of text-based adventure games. arXiv preprint arXiv:1908.04777, 2019b.
Yuan, X., Côté, M., Sordoni, A., Laroche, R., des Combes, R. T., Hausknecht, M. J., and Trischler, A. Counting to explore and generalize in text-based games. CoRR, abs/1806.11525, 2018. URL http://arxiv.org/abs/1806.11525.
Zahavy, T., Haroush, M., Merlis, N., Mankowitz, D. J., and Mannor, S. Learn what not to learn: Action elimination with deep reinforcement learning. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 3566–3577, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/645098b086d2f9e1e0e939c27f9f2d6f-Abstract.html.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] Limitations are described in both section 5 and section 7.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] A broader impact is considered in section 7, and further discussion can be found in Appendix I.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] All code can be found at https://github.com/MathieuTuli/LTL-GATA.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Hyper-parameters are discussed briefly in section 5 and in full detail in Appendices G and H.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Error bars can be found in every results figure.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] This information can be found in Appendix H.2.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] The footnotes in section 5 include links to the original assets, and in-text citations give credit throughout the paper.
(b) Did you mention the license of the assets? [Yes] The footnotes in section 5 include links to the original assets.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] The only new asset in this work is our code, which we provide here: https://github.com/MathieuTuli/LTL-GATA.
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]