# Pre-Trained Language Models for Interactive Decision-Making

Shuang Li¹, Xavier Puig¹, Chris Paxton², Yilun Du¹, Clinton Wang¹, Linxi Fan², Tao Chen¹, De-An Huang², Ekin Akyürek¹, Anima Anandkumar²,³, Jacob Andreas¹, Igor Mordatch⁴, Antonio Torralba¹, Yuke Zhu²,⁵

¹MIT, ²Nvidia, ³Caltech, ⁴Google Brain, ⁵UT Austin

Junior authors are ordered based on contributions and senior authors are ordered alphabetically. Correspondence to: Shuang Li. Project page: https://shuangli-project.github.io/Pre-Trained-Language-Models-for-Interactive-Decision-Making. Part of this work was done during Shuang's internship at NVIDIA. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. Next, we integrate an active data gathering procedure in which agents iteratively interact with the environment, relabel past failed experiences with new goals, and update their policies in a self-supervised loop. Active data gathering further improves combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and LM-based weight initialization are both important for generalization. Surprisingly, however, the format of the policy input encoding (e.g., as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.

1 Introduction

Language models (LMs) play a key role in machine learning approaches to natural language processing tasks [9]. This includes tasks that are not purely linguistic and require nontrivial planning and reasoning capabilities [24, 13]: for example, instruction following, vision-language navigation, and visual question answering. Indeed, some of these tasks are so distant from language modeling that one can ask whether pre-trained LMs can be used as a general framework even for tasks that involve no language at all. If so, how might these capabilities be accessed in a model trained only to process and generate natural language strings?
Figure 1: Environments (left): Different environments have different types of observations and goals (e.g., a VirtualHome graph-based partial observation with goal predicates such as Inside(pancake, stove):2, and a grid-based partial observation with the language goal "put the green box next to the purple box"). Our approach (right): We use pre-trained LMs as a general framework for interactive decision-making by converting policy inputs into sequential data. Such a method enables effective combinatorial generalization to novel tasks.

In this paper, we study these questions through the lens of embodied decision-making, investigating the effectiveness of LM pre-training as a general framework for learning policies across a variety of environments. We propose LID, a framework that uses Pre-Trained Language Models for Interactive Decision-Making. As shown in Figure 1 (right), we encode the inputs to a policy, including observations, goals, and history, as a sequence of embeddings. These embeddings are passed to a policy network initialized with the parameters of a pre-trained LM, which is fine-tuned to predict actions. This framework is broadly applicable, accommodating goals and environment states represented as natural language strings, image patches, or scene graphs.

We find that imitation learning using pre-trained LMs as policy initializers improves in-domain performance and enables strong generalization to novel tasks. For i.i.d. training and evaluation tasks, this approach yields 20% more successful policies than other baseline methods in VirtualHome [31]. For combinatorial generalization to out-of-distribution tasks, i.e., tasks involving new combinations of goals, states, or objects, LM pre-training confers even more benefits: it improves task completion rates by 43.6% for novel tasks (see Figure 3). These results hold for a variety of environment representations: encoding states as natural language strings, when possible, improves the data-efficiency of training, but even LMs fine-tuned on random environment encodings generalize combinatorially to new goals and states when trained on large enough datasets.

We further examine how our method may be used in environments where expert data is not available and agents must instead actively gather data. To do this, we integrate an Active Data Gathering (ADG) procedure into pre-trained LMs, as shown in Figure 2. Our proposed approach to ADG consists of three parts. First, exploration collects trajectories using a mix of random actions and actions generated by the current policy. Exploration alone is insufficient in this high-dimensional problem, and most of the trajectories will fail to achieve the end goal. A key insight is that even the failed trajectories contain useful sub-trajectories that solve certain sub-goals, and we relabel these goals in a hindsight relabeling stage. The relabeled goal describes what was achieved in the extracted sub-trajectory. The policy update stage samples relabeled trajectories to update the policy. The active data gathering procedure allows us to train the LM-policy without pre-collected expert data. It also outperforms reinforcement learning (RL) methods on embodied decision-making tasks and enables more effective generalization to novel tasks. Finally, we investigate why LID contributes to generalization.
We hypothesize three possible causes for the effectiveness of LM-based policy initialization: (1) the use of language-based input encodings, and more generally LMs' ability to reason about natural language strings; (2) the sequential structure of transformer inputs, in contrast to the fixed-size observations used by most policy architectures; and (3) task-general inductive bias conferred by weight initialization with LM pre-training.

We investigate (1) by encoding the policy inputs as different types of sequences. Different input encoding schemes have only a negligible impact on performance: the effectiveness of language modeling is not limited to natural strings, but in fact extends to arbitrary sequential encodings. We study (2) by encoding observations with a single vector embedding, thereby removing their sequential structure. This operation significantly degrades the model's performance on novel tasks. Finally, we investigate (3) by learning the parameters of the policy from scratch. The success rate after removing the pre-trained LM weights drops by 11.2%, indicating that LM pre-training provides a useful inductive bias for sequence processing even when the sequences are not natural language strings.

To summarize, our work has four main contributions. First, we propose to use pre-trained LMs as a general scaffold for interactive decision-making across a variety of environments by converting all policy inputs into sequential data. Second, we demonstrate that language modeling improves combinatorial generalization in policy learning: initializing a policy with a pre-trained LM substantially improves out-of-distribution performance on novel tasks. Third, we integrate an active data gathering procedure into the proposed approach to further enable policy learning in environments without pre-collected expert data. Finally, we perform several analyses to explain the generalization capabilities of pre-trained LMs, finding that natural strings are not needed to benefit from LM pre-training, but that sequential input encoding and weight pre-training are important.

Together, these results point to the effectiveness of pre-trained LMs as a general-purpose framework for promoting structured generalization in interactive decision-making.

2 Related Work

In recent years, word and sentence representations from pre-trained LMs [29, 9, 33] have become ubiquitous in natural language processing [49, 30]. Some of the most successful applications of pre-training lie at the boundary of natural language processing and other domains, as in instruction following [13] and language-guided image retrieval [22].

Learning representations of language. From nearly the earliest days of the field, natural language processing researchers observed that representations of words derived from distributional statistics in large text corpora serve as useful features for downstream tasks [8, 11]. The earliest versions of these representation learning schemes focused on isolated word forms [25, 28]. However, recent years have seen a number of techniques for training (masked or autoregressive) language models to produce contextualized word representations (which incorporate information from neighboring words in sentences and paragraphs) via a variety of masked-word prediction objectives [9, 47].

Applications of pre-trained LMs. LMs can be fine-tuned to perform language processing tasks other than language modeling by casting those tasks as word-prediction problems.
Successful uses of representations from pre-trained models include syntactic parsing [19] and language-to-code translation [45]; successful adaptations of LM prediction heads include machine translation [49], sentiment classification [6], and style transfer [18]. A number of tasks integrate language and other modalities, including visual question answering and image captioning [48]. Recent work finds that image representations can be injected directly into LMs' embedding layers [42].

Policy learning and LMs. Traditional policy learning methods, such as PPO [37], DQN [27], DDPG [21], and A3C [26], perform well on Atari, OpenAI Gym [5], and MuJoCo [41] tasks. Some of them may fail to solve more challenging tasks in embodied environments [31, 39]. Several recent papers [36, 17, 15] propose to use LMs for policy learning. Frozen Pretrained Transformer (FPT) [23] demonstrates that pre-trained LMs require very little fine-tuning to match the performance of task-specific models on several image classification and numerical sequence processing tasks. Semi-Supervised Skill Learning with Latent Language ((SL)³) [38] shows that LMs can serve as an effective backbone for hierarchical policies that express plans as natural language strings [2, 4]. In this paper, we focus on building a general framework for decision-making tasks using pre-trained LMs, even when language is not provided as an input or output.

3 Decision-Making and Language Modeling

3.1 POMDPs and Policy Learning

We explore the application of LMs to general sequential decision-making tasks in partially observed environments. These tasks may be formalized as partially observable Markov decision processes (POMDPs). A POMDP is defined by a set of states, a set of observations, a set of actions, and a transition model $T(s_{t+1} \mid s_t, a_t)$ that maps the current state and action to the next state. Importantly, in a POMDP setting, the observation $o_t$ only captures a portion of the underlying state $s_t$, and an optimal decision-making strategy (a policy) must incorporate both the current observation and the history of previous observations and actions. In our experiments, policies are parametric models $\phi(a_t \mid g, h_t, o_t)$ that output the probability of an action given the goal $g$, the history $h_t = \{o_1, a_1, \ldots, o_{t-1}, a_{t-1}\}$, and the partial observation $o_t$ of the current state $s_t$.

In Figure 1 (right), we show a high-level overview of the proposed method. We first convert all policy inputs into a sequence and provide them as input to a transformer encoder. Representations from this encoder are then passed to a task-specific decoder that predicts actions. We collect a dataset of $N$ training trajectories $\mathcal{D} = \{d^i\}_{i=1}^{N}$, where each trajectory consists of a goal and a sequence of observations and actions: $d^i = \{g^i, o^i_1, a^i_1, \ldots, o^i_{T_i}, a^i_{T_i}\}$, where $T_i$ is the length of the trajectory. We then train the policy to maximize the probability of the demonstrated actions $a^i = \{a^i_1, \ldots, a^i_{T_i}\}$ across trajectories using the cross-entropy loss:
$$\hat{\phi} = \arg\min_{\phi} \; -\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log \phi\left(a^i_t \mid g^i, h^i_t, o^i_t\right).$$

3.2 Language Models as Policy Initializers

Our experiments focus on autoregressive, transformer-based LMs [43]. These models are trained to fit a distribution over a text sequence $y = \{y_i\}_{i=1}^{n}$ via the chain rule $p(y) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_1, \ldots, y_{i-1})$. Each term on the right-hand side is parameterized by a transformer network, which accepts the conditioned tokens as input. Each token passes through a learned embedding layer $F$, then the full conditioned sequence is fed into the LM.
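As a concrete illustration of this factorization (not part of the original paper), the sketch below scores a templated goal string under an off-the-shelf GPT-2 using the Hugging Face transformers library; the specific string and the library choice are our own assumptions.

```python
# Illustrative only: score a string under GPT-2's autoregressive factorization
# p(y) = p(y_1) * prod_i p(y_i | y_<i). Assumes the `transformers` and `torch`
# packages are installed; this is not code from the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "put two apples inside the fridge"
ids = tokenizer(text, return_tensors="pt").input_ids      # (1, n) token ids
with torch.no_grad():
    out = model(ids, labels=ids)                          # labels are shifted internally
# out.loss is the mean negative log-likelihood over the n-1 predicted tokens
total_log_prob = -out.loss.item() * (ids.shape[1] - 1)
print(f"log p(y) = {total_log_prob:.2f}")
```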
In our work, we use a standard LM, GPT-2, to process the input sequence rather than to predict future tokens. Both POMDP decision-making and language modeling are naturally framed as sequence prediction tasks, where successive words or actions/observations are predicted based on a sequence of previous words or actions/observations. This suggests that pre-trained LMs can be used to initialize POMDP policies by fine-tuning them to model high-reward or expert trajectories, as described below.

We evaluate the effectiveness of pre-trained LMs in solving decision-making tasks across environments, using BabyAI [16] and VirtualHome [31]. While both environments feature complex goals, the nature of these goals, as well as the state and action sequences that accomplish them, differ substantially across environments (Figure 1 (left)).

4.1 Policy Network

We first examine whether pre-trained LMs provide effective initializers when states and action histories are represented as natural language strings. We encode the inputs to the policy, including observations, goals, and action histories, as sequences of words. These word sequences are passed to the LM (using its pre-trained word embedding layer $F$) to obtain contextualized token representations. Token representations are averaged and used to predict actions. We design a policy network following the general policy framework shown in Figure 1.

Environment encodings in VirtualHome. In VirtualHome, each goal consists of a sequence of predicates and multiplicities and is translated into a templated English sentence (e.g., Inside(apple, fridge):2 becomes "put two apples inside the fridge"). To encode the agent's partial observation, we extract a list of currently visible objects, their states (e.g., "open", "clean"), and their 3D world coordinates. We use a fully-connected layer to encode the 3D information and generate a feature representation of each object in the observation. To encode the history, we store information about all previous actions and convert them into templated English sentences (e.g., "I have put the plate on the kitchen table and the apple inside the fridge").

Environment encodings in BabyAI. The observation by default is a 7×7 grid. We convert the observation into 7×7 text descriptions, e.g., "purple ball", "grey wall", "open door", and combine them into a long sentence. We then convert the history actions into text descriptions, e.g., "turn left" and "go forward". We combine the language instruction (without modification) with the observation and history text descriptions and feed them to the pre-trained LM.

We note that the policy network described above does not strictly require that these encodings take the form of natural language strings; other encodings of the environment as a sequence also work (see Section 7). This framework could also be generalized to support pixel-based observations using discretization schemes like the one employed in the Vision Transformer [10].

Action prediction. We pool the LM outputs into a context representation that is used to predict the next action. During training, we maximize the probabilities of demonstrated actions. During inference, we select the valid action with the highest probability. See Appendix C.1 for details.
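The policy just described can be sketched in a few lines of PyTorch. The snippet below is a simplified illustration rather than the paper's released implementation: the class name, the action-set size, and the choice to concatenate goal, history, and observation into a single text string are assumptions made for brevity (the actual model, for instance, encodes 3D object coordinates with a fully-connected layer).

```python
# Hypothetical sketch of an LM-initialized policy: text-encoded
# (goal, history, observation) -> logits over a discrete action set.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2TokenizerFast

class LMPolicy(nn.Module):
    def __init__(self, num_actions: int):
        super().__init__()
        self.lm = GPT2Model.from_pretrained("gpt2")        # pre-trained weights as initialization
        self.action_head = nn.Linear(self.lm.config.n_embd, num_actions)

    def forward(self, input_ids, attention_mask):
        hidden = self.lm(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        # average-pool token representations into a single context vector
        context = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.action_head(context)                   # logits over the action set

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token                  # GPT-2 has no pad token by default
policy = LMPolicy(num_actions=100)                         # action-set size is illustrative

batch = tokenizer(
    ["goal: put two apples inside the fridge. history: walk to kitchen. observation: apple, fridge (closed)."],
    return_tensors="pt", padding=True)
logits = policy(batch.input_ids, batch.attention_mask)
loss = nn.functional.cross_entropy(logits, torch.tensor([7]))  # 7 = index of the demonstrated action (illustrative)
loss.backward()                                            # a behavior-cloning optimizer step would follow
```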
VirtualHome and BabyAI have quite different observation spaces, action spaces, and goal spaces; however, we show that embedding policy inputs as sequences and using the pre-trained LM as a policy initializer enables effective generalization to novel tasks in both environments. We note that LID is not limited to VirtualHome and BabyAI, but is straightforwardly applicable to other embodied environments, such as ALFRED [40] and iGibson [39].

4.2 Training

We first examine LID through imitation learning on data collected by experts in Section 4.2.1. We then show that integrating an active data gathering procedure into LID enables policy learning without expert data in Section 4.2.2. We use VirtualHome as an example to explain the data gathering.

4.2.1 Policy Learning with Expert Data

The policy model is first initialized from a pre-trained LM and then fine-tuned on data collected by experts. We build on the VirtualHome environment to collect a set of expert trajectories using regression planning [20] and create a VirtualHome-Imitation Learning dataset. Given a task described by goal predicates, the planner generates an action sequence to accomplish this task (see Appendix E.1). The planner has access to privileged information, such as the pre-conditions and effects of each action, allowing an agent to robustly perform tasks in partially observable environments and generate expert trajectories for training and evaluation.

4.2.2 Policy Learning with Active Data Gathering

Figure 2: LID with the active data gathering procedure. By iteratively repeating exploration, hindsight relabeling, and policy updates, LID with active data gathering can learn an effective policy without using pre-collected expert data.

Collecting expert data is sometimes challenging. It may require privileged information about the environment or human annotations, which can be time-consuming and difficult to scale. A promising way to scale up supervision is Hindsight Experience Replay (HER) [3], which allows agents to learn from orders of magnitude more data without supervision. However, existing HER methods [12] focus on simple tasks with small state/action spaces and full observability. They cannot tackle more complicated embodied decision-making tasks that require nontrivial planning, reasoning, or natural language understanding. LID with active data gathering (LID-ADG) can be used to solve tasks in such environments.

As shown in Figure 2, LID-ADG consists of three stages: exploration, hindsight relabeling, and policy update. The key idea is to gradually improve the task success rate by having the agent iteratively explore the environment, relabel failed samples, and update its policy using imitation learning.

In the exploration stage, we first randomly sample a goal and an initial state. We then use a mix of random actions and actions generated by the current policy $\phi(a_t \mid g, h_t, o_t)$ to obtain the next action. We repeat this process until the episode ends. We collect $M$ trajectories and store them in the replay buffers. The generated actions in the early stages rarely complete the given task.
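A minimal sketch of this exploration stage is shown below. It is illustrative only: the environment interface (reset, step, valid_actions), the policy.act helper, and the epsilon mixing ratio are assumptions, not details taken from the paper.

```python
# Illustrative sketch of the exploration stage of LID-ADG (not the paper's code).
import random

def explore_episode(env, policy, epsilon=0.3, max_steps=50):
    """Roll out one episode, mixing random actions with current-policy actions."""
    goal, obs = env.reset()                  # sample a goal and an initial state
    history, trajectory = [], []
    for _ in range(max_steps):
        if random.random() < epsilon:
            action = random.choice(env.valid_actions(obs))   # random exploration
        else:
            action = policy.act(goal, history, obs)          # current policy phi(a_t | g, h_t, o_t)
        next_obs, done = env.step(action)
        trajectory.append((obs, action))
        history.append((obs, action))
        obs = next_obs
        if done:
            break
    return goal, trajectory   # stored in the replay buffer even when the goal is not achieved

# Collect M trajectories for later hindsight relabeling and policy updates, e.g.:
# buffer = [explore_episode(env, policy) for _ in range(M)]
```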
However, even these failed trajectories contain useful sub-trajectories that solve certain sub-goals. In the hindsight relabeling stage, we extract useful sub-trajectories and relabel a goal $g'$ for each of them. We design a goal relabel function $f_l$ that generates a goal based on the sequence of observations and actions using hand-designed templates. In practice, we implement the goal relabel function as a program (see Appendix E.2). The hindsight relabeling stage allows sample-efficient learning by reusing the failure cases.

During the policy update stage, the agent samples data from the replay buffers and updates its policy network $\phi$. By interleaving exploration, hindsight relabeling, and policy updates, LID-ADG can gradually improve the policy without requiring pre-collected expert data.

In embodied environments with large action spaces, sparse rewards, and long-horizon planning, RL methods often struggle to obtain stable policy gradients during training. Our method enables sample-efficient learning from sparse rewards by relabeling new goals for the failed samples in which the agent did not achieve the original goal. In addition, LID-ADG leverages the stability of supervised learning in the policy update stage, enabling it to outperform RL approaches on a wide range of decision-making tasks.

5 Experiment Setup

We evaluate the proposed method and baselines on VirtualHome and BabyAI.

5.1 VirtualHome

VirtualHome is a 3D embodied environment featuring partial observability, large action spaces, and long time horizons. We evaluate policies' performance along three axes: (1) performance on in-distribution tasks; (2) generalization to novel scenes; and (3) generalization to novel tasks.

In-Distribution. The predicate types and their counts in the goal are randomly sampled from the same distribution as the training data. The objects are initially placed in the environment according to common-sense layouts (e.g., plates appear inside the kitchen cabinets rather than the bathtub).

Novel Scenes. The objects are placed in random positions in the initial environment without common-sense constraints (e.g., apples may appear inside the dishwasher).

Novel Tasks. The components of all goal predicates are never seen together during training (e.g., both plates and fridges appear in training goals, but Inside(plate, fridge) only appears in the test set). See Appendix F for more details.

We evaluate the success rates of different methods on each test set. A given episode is scored as successful if the policy completes its entire goal within the environment's maximum allowed number of steps. On each of the 3 test subsets, we use 5 different random seeds and test 100 tasks under each seed; thus 1500 examples are used to evaluate each model.

5.2 BabyAI

BabyAI is a 2D grid-world environment for instruction following. Observations in BabyAI are 7×7×3 grids describing a partial, local, egocentric view of the state of the environment. We evaluate the methods on four representative tasks: GoToRedBall, GoToLocal, PickupLoc, and PutNextLocal. Performing well on the test set requires the models to generalize to new environment layouts and goals, resulting in new combinations of tasks not seen in training. For each method, we compute success rates over 500 episodes on each task.

6 Experiments

We first show results of the proposed method and baselines on embodied decision-making tasks using expert data in Section 6.1. We then show results when using actively gathered data in Section 6.2.
6.1 Embodied Decision-Making with a Pre-Trained Language Model (LID)

6.1.1 Results on VirtualHome

Figure 3: Comparison of the proposed method and baselines on VirtualHome (success rates on the In-Distribution, Novel Scenes, and Novel Tasks test sets). All methods are trained on expert data using imitation learning. MLP-1, MLP, and LSTM are baselines that do not use the pre-trained LM. The proposed method, LID-Text (Ours), outperforms all baselines.

Table 1: Success rates on BabyAI tasks. All methods are trained on offline expert data using imitation learning. LID-Text (Ours) outperforms BabyAI-Ori, the method used in the original paper [16].

| Task | Method | 100 demos | 500 demos | 1K demos | 5K demos | 10K demos |
|---|---|---|---|---|---|---|
| GoToRedBall | BabyAI-Ori [16] | 81.0 | 96.0 | 99.0 | 99.5 | 99.9 |
| GoToRedBall | LID-Text (Ours) | 93.9 | 99.4 | 99.7 | 100.0 | 100.0 |
| GoToLocal | BabyAI-Ori [16] | 55.9 | 84.3 | 98.6 | 99.9 | 99.8 |
| GoToLocal | LID-Text (Ours) | 64.6 | 97.9 | 99.0 | 99.5 | 99.5 |
| PickupLoc | BabyAI-Ori [16] | 28.0 | 58.0 | 93.3 | 97.9 | 99.8 |
| PickupLoc | LID-Text (Ours) | 28.7 | 73.4 | 99.0 | 99.6 | 99.8 |
| PutNextLocal | BabyAI-Ori [16] | 14.3 | 16.8 | 43.4 | 81.2 | 97.7 |
| PutNextLocal | LID-Text (Ours) | 11.1 | 93.0 | 93.2 | 98.9 | 99.9 |

We evaluate the following methods:

LID-Text (Ours) is the proposed method, which converts all environment inputs into text descriptions. The pre-trained LM is fine-tuned for decision-making (conditioned on goals, observations, and histories) as described in Section 4.1.

Recurrent Network. We compare our method with a recurrent baseline that uses an LSTM [14] to encode the history information. The hidden representation from the last timestep, together with the goal and current observation, is used to predict the next action.

MLP and MLP-1. We perform additional comparisons with baselines that use neither recurrent networks nor pre-trained LMs. MLP and MLP-1 take the goal, histories, and the current observation as input and feed them to a multilayer perceptron (MLP) to predict actions. MLP-1 has three additional average-pooling layers that average the features of tokens in the goal, history actions, and current observation, respectively, before the MLP layer.

Quantitative results. Each method is trained on 20K demos from the VirtualHome-Imitation Learning dataset and then evaluated on the three test subsets: In-Distribution, Novel Scenes, and Novel Tasks. In Figure 3, LID-Text (Ours), which initializes the policy with a pre-trained LM, achieves higher success rates than the other methods. This difference is most pronounced in the Novel Tasks setting, where test tasks require combinatorial generalization across goals never seen during training. Here, LID-Text (Ours) dramatically improves upon all baselines (by 43.6%). Such combinatorial generalization is necessary for constructing general-purpose agents, but it is often difficult for existing approaches. Our results suggest that pre-trained LMs can serve as a computational backbone for combinatorial generalization.

6.1.2 Results on BabyAI

We use the standard training and test data provided by [16]. In BabyAI, performing well on unseen test tasks with new environment layouts and goals requires combinatorial reasoning. In Table 1, we report the success rates of models trained on different numbers of demos. BabyAI-Ori [16] is the method used in the original paper. LID-Text (Ours) is the proposed method, which converts policy inputs into a text sequence.
Given enough training data, i.e., 10K demos, both methods achieve high success rates, but LID-Text (Ours) outperforms BabyAI-Ori with less training data, indicating that the proposed method improves sample efficiency when generalizing to novel tasks.

6.2 Pre-Trained Language Model with Active Data Gathering (LID-ADG)

We compare LID-ADG, the proposed LM framework for decision-making using actively gathered data (Section 4.2.2), to a variety of baselines that do not use pre-collected expert data on VirtualHome.

Random. The agent selects the next action randomly from the valid action space at that state.

Goal-Object. The agent randomly selects an object that is both in the goal and in the valid action space to interact with. For example, given the goal Inside(apple, fridge):1, this baseline might choose "grab apple", "open fridge", or other actions containing "apple" or "fridge".

Online RL. We compare with PPO [37], one of the most commonly used online RL methods. For a fair comparison, we equip PPO with the same main policy network as the proposed method. Our implementation is based on Stable-Baselines3 [35].

Hindsight Experience Replay. We compare with DQN+HER as used in [3], modifying its main policy network to be the same as the proposed method's.

Table 2: Comparison of methods that do not use expert data on VirtualHome (success rates, mean ± std over 5 seeds). LID-ADG (Ours) is the only successful approach.

| Method | In-Distribution | Novel Scenes | Novel Tasks |
|---|---|---|---|
| Random | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| Goal-Object | 0.8 ± 0.5 | 0.0 ± 0.0 | 0.4 ± 0.4 |
| PPO | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| DQN+HER | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| LID-ADG (Ours) | 46.7 ± 2.7 | 32.2 ± 3.3 | 25.5 ± 4.1 |

Table 3: The proposed method with active data gathering, LID-ADG (Ours), can be used as a policy initializer for online RL or a data provider for offline RL (success rates, mean ± std over 5 seeds).

| Method | In-Distribution | Novel Scenes | Novel Tasks |
|---|---|---|---|
| LID-ADG (Ours) | 46.7 ± 2.7 | 32.2 ± 3.3 | 25.5 ± 4.1 |
| PPO (LID-ADG Init) | 53.7 ± 3.5 | 30.2 ± 3.4 | 27.8 ± 2.7 |
| DT (LID-ADG Data) | 42.4 ± 1.5 | 21.6 ± 2.48 | 16.8 ± 1.0 |

Quantitative results. We compare LID-ADG with the baselines on VirtualHome in Table 2. Each experiment is performed 5 times with different random seeds. The Random baseline is always at 0, indicating that the tasks in VirtualHome cannot be solved by a random policy. Goal-Object is better than Random because Goal-Object has access to the objects in the goal and samples actions from a much smaller action space. The online RL baseline, PPO, fails to solve VirtualHome tasks, which feature partial observability, large state/action spaces, and long horizons. DQN+HER works well on simple tasks in 2D environments, but it likewise cannot tackle VirtualHome tasks, which require nontrivial planning and reasoning. LID-ADG does not require expert data and can solve complicated tasks in 3D embodied environments that cannot easily be solved with RL. (Note that the results of LID-Text in Figure 3 and the results of LID-ADG in Table 2 are not directly comparable because the difficulty levels of the evaluated tasks differ; see Appendix F for more details.)

Policy initializer and data provider. LID-ADG can further be used to initialize the weights for fine-tuning RL policies and to gather data for offline learning. As shown in Table 2, directly training RL, e.g., PPO, fails to solve tasks in VirtualHome. However, after using the policy trained by LID-ADG to initialize the PPO policy, we can effectively learn an interactive policy with good performance. In Table 3, PPO (LID-ADG Init) is initialized from LID-ADG and further fine-tuned to solve the tasks in VirtualHome. After initialization, PPO improves its success rate by 53.7% in the In-Distribution setting (see the PPO results in Tables 2 and 3). In addition, LID-ADG can provide data for offline learning: LID-ADG saves the relabeled data in replay buffers.
We train a Decision Transformer (DT) [7] on the data collected by LID-ADG; see DT (LID-ADG Data) in Table 3.

7 Analysis: Understanding the Sources of Generalization

The pre-trained LM policy, fine-tuned on either expert data or actively gathered data, exhibits effective combinatorial generalization. Is this simply because LMs are effective models of relations between natural language descriptions of states and actions [1], or because they provide a more general framework for combinatorial generalization in decision-making? We hypothesize and investigate three possible factors to understand the sources of such combinatorial generalization. We use policies trained on the expert data as an example to explain the experiments.

7.1 Input Encoding Scheme

We first hypothesize that converting environment inputs into natural language contributes to combinatorial generalization, since the LMs are trained on language data. We explore the role of natural language by investigating three alternative ways of encoding policy inputs to our model without using natural language strings: two in VirtualHome and one in BabyAI. The BabyAI results are in Appendix A.

Index encoding in VirtualHome. Rather than natural language strings, LID-Index (Ours) converts policy inputs into integer indices. LID-Index (Ours) retains the discrete, serial format of the goal, history, and observation, but replaces each word with an integer and replaces the embedding layer from the pre-trained LM with a new embedding layer trained from scratch. For example, "grab apple" is mapped to (5, 3) based on the positions of "grab" and "apple" in the vocabulary set.

Unnatural string encoding in VirtualHome. LID-Unnatural (Ours) replaces the natural language tokens (e.g., converting the goal On(fork, table):1 to "put one fork on the table") with random ones (e.g., converting On(fork, table) to "brought wise character trees fine yet"). This is done by randomly permuting the entire vocabulary, mapping each token to a new token. Such a permutation breaks the semantic information in natural strings.

Table 4: Success rates of policies trained with different input encodings in the Novel Tasks setting on VirtualHome (mean ± std over 5 seeds). The text encoding is the most sample-efficient, but all models converge to similar performance given sufficient training data.

| Method | 100 demos | 500 demos | 1K demos | 5K demos | 10K demos | 20K demos |
|---|---|---|---|---|---|---|
| LID-Text (Ours) | 8.8 ± 1.4 | 22.2 ± 1.7 | 26.8 ± 1.0 | 46.0 ± 1.0 | 58.2 ± 1.2 | 58.2 ± 1.6 |
| LID-Index (Ours) | 6.4 ± 0.6 | 18.0 ± 3.8 | 18.8 ± 1.0 | 45.5 ± 2.1 | 54.6 ± 0.8 | 57.8 ± 0.9 |
| LID-Unnatural (Ours) | 6.8 ± 1.3 | 18.6 ± 2.1 | 27.0 ± 1.1 | 47.2 ± 1.7 | 55.8 ± 0.8 | 58.8 ± 0.9 |

LID-Index (Ours) and LID-Unnatural (Ours) have the same policy network as LID-Text (Ours), and all are fine-tuned on the expert data. The results, averaged over 5 random seeds in the Novel Tasks setting, are reported in Table 4. Given little training data, e.g., 100 demos, all models perform poorly, with success rates lower than 10%. LID-Text (Ours) achieves higher success rates than LID-Index (Ours) and LID-Unnatural (Ours) as the dataset size increases, e.g., LID-Text (Ours) is around 4% higher than LID-Index (Ours) and LID-Unnatural (Ours) with 500 training demos. When the training dataset is further enlarged, e.g., to 20K demos, the success rates of all approaches converge to similar values.
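To make the unnatural-string encoding above concrete, the sketch below (our illustration, not the released code) builds one fixed random bijection over the LM vocabulary and applies it to every tokenized policy input.

```python
# Illustrative sketch of the unnatural-string encoding: a fixed random
# permutation of the LM vocabulary applied to tokenized policy inputs.
import random

def build_vocab_permutation(vocab_size: int, seed: int = 0) -> list[int]:
    perm = list(range(vocab_size))
    random.Random(seed).shuffle(perm)        # one fixed bijection, reused everywhere
    return perm

def unnatural_encode(token_ids: list[int], perm: list[int]) -> list[int]:
    # Each token id is mapped to its permuted counterpart, destroying
    # natural-language semantics while keeping the discrete sequential format.
    return [perm[t] for t in token_ids]

# Example usage (GPT-2's vocabulary has 50257 tokens):
# perm = build_vocab_permutation(50257)
# unnatural_ids = unnatural_encode(tokenizer("put one fork on the table").input_ids, perm)
```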
These results indicate that the effectiveness of pre-trained LMs in compositional generalization is not unique to natural language strings, but can be leveraged from arbitrary encodings, although adapting the model to arbitrary encodings may require more training data.

7.2 Sequential Input Representation

Table 5: Experiments on sequential inputs and weight initialization (success rates, mean ± std). Fine-tuning the pre-trained weights and the use of sequential input encodings are both important for combinatorial generalization.

| Method | In-Distribution | Novel Tasks |
|---|---|---|
| LID-Text (Ours) | 87.6 ± 1.9 | 58.2 ± 2.3 |
| No-Seq | 74.0 ± 2.3 | 2.0 ± 0.6 |
| No-Pretrain | 90.8 ± 2.0 | 47.0 ± 2.8 |
| No-FT | 51.2 ± 4.5 | 17.0 ± 2.9 |

Next, we explore whether generalization requires the sequential processing mechanisms of transformer-based LMs, i.e., whether the LM-pre-trained policy remains effective when the input encoding is not sequential. No-Seq encodes the goal as a single vector by averaging all goal embeddings; history and observation features are obtained in the same way. All features are then sent to the pre-trained LM to predict actions. As shown in Table 5, removing sequential structure significantly hurts performance on Novel Tasks. No-Seq achieves good performance on test tasks that are close to the training tasks, but cannot generalize well to more challenging unseen tasks. Thus, combinatorial generalization in pre-trained LMs may be attributed in part to transformers' ability to process sequential input representations effectively.

7.3 Favorable Weight Initialization

Finally, we investigate whether the favorable weight initialization from LM pre-training enables effective generalization of the proposed model. No-Pretrain does not initialize the policy using the pre-trained LM, but instead trains the policy on the expert data from scratch. In Table 5, we find that the model without pre-trained weights can still fit the in-domain data and thus performs well in the In-Distribution setting. However, its success rate is 11.2% lower than the proposed model in the Novel Tasks setting, indicating that the pre-trained weights are important for effective generalization, but not necessary for effective data fitting. We further test a baseline, No-FT, that keeps the pre-trained weights of the language model but freezes them while training the rest of the model on our expert data. Freezing the pre-trained weights without fine-tuning significantly hurts performance in both settings, suggesting that fine-tuning the transformer weights is essential for effective combinatorial generalization.

Together, these results suggest that sequential input representations (vs. fixed-dimensional feature vectors) and favorable weight initialization are both important for generalization, whereas the input encoding scheme (e.g., a natural language string vs. an arbitrary encoding scheme) has little influence. These results point to the potential broader applicability of pre-trained LMs as a computational backbone for compositional embodied decision-making, where arbitrary inputs, such as language, images, or grids, may be converted to sequential encodings.
Figure 4: Qualitative results of our model on VirtualHome and BabyAI. We only show a sub-trajectory in each example to save space. The interacted objects are labeled with green bounding boxes.

Figure 5: Failure cases caused by grounding errors and policy errors. The interacted objects are labeled with green bounding boxes.

8 Qualitative Results

In Figure 4, we show examples of LID-Text (Ours) completing tasks in VirtualHome and BabyAI: two successful examples from VirtualHome in the In-Distribution and Novel Tasks settings, and two successful examples from BabyAI on the GoToLocal and PickupLoc tasks. We only show short trajectories or extract a sub-trajectory to save space.

Failure case analysis. In Figure 5, we show failure cases of the proposed method. We observed two main types: grounding errors and policy errors. In failures caused by grounding errors, the agent interacts with a wrong object that is not related to the given goal, e.g., the agent puts cutlets instead of the salmon inside the fridge. In failures caused by policy errors, the agent cannot find the target objects or does not interact with them. The proposed method, which converts policy inputs into sequential encodings and feeds them to a general LM framework, can accomplish decision-making tasks efficiently; however, there are still challenging tasks that the policy fails to accomplish. Larger LMs, e.g., GPT-3 [6], may improve the success rate on those challenging tasks.

9 Conclusion and Broader Impact

In this paper, we introduced LID, a general approach to sequential decision-making that converts goals, histories, and observations into sequences and processes them using a policy initialized with a pre-trained LM. We integrated an active data gathering procedure into the proposed method to enable policy learning without expert data. Our analysis showed that sequential input representation and favorable weight initialization both contribute to generalization, while the input encoding scheme has little influence. One drawback of the active data gathering is that it relies on hand-designed rules for task relabeling. More generally, a potential disadvantage of the proposed approach is that biases of the pre-trained LMs may influence its behavior, and further study of LID-based models' bias is required before they may be deployed in sensitive downstream applications.
Nevertheless, our results demonstrate that LID enables effective combinatorial generalization across different environments, and highlight the promise of LM pre-training for more general decision-making problems.

References

[1] P. Ammanabrolu and M. O. Riedl. Playing text-adventure games with graph-based deep reinforcement learning. arXiv preprint arXiv:1812.01628, 2018.
[2] J. Andreas and D. Klein. Learning with latent language. In North American Association for Computational Linguistics, 2022.
[3] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba. Hindsight experience replay. arXiv preprint arXiv:1707.01495, 2017.
[4] A. P. Jacob, M. Lewis, and J. Andreas. Multitasking inhibits semantic drift. In North American Association for Computational Linguistics, 2021.
[5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
[6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[7] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345, 2021.
[8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[11] S. T. Dumais. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1):188-230, 2004.
[12] D. Ghosh, A. Gupta, A. Reddy, J. Fu, C. Devin, B. Eysenbach, and S. Levine. Learning to reach goals via iterated supervised learning. arXiv preprint arXiv:1912.06088, 2019.
[13] F. Hill, S. Mokra, N. Wong, and T. Harley. Human instruction-following with deep reinforcement learning via transfer-learning from text. arXiv preprint arXiv:2005.09382, 2020.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[15] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022.
[16] D. Y.-T. Hui, M. Chevalier-Boisvert, D. Bahdanau, and Y. Bengio. BabyAI 1.1, 2020.
[17] E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991-1002. PMLR, 2022.
[18] N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.
[19] N. Kitaev, S. Cao, and D. Klein. Multilingual constituency parsing with self-attention and pre-training. arXiv preprint arXiv:1812.11760, 2018.
[20] R. E. Korf. Planning as search: A quantitative approach. Artificial Intelligence, 33(1):65-88, 1987.
[21] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[22] J. Lu, D. Batra, D. Parikh, and S. Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
[23] K. Lu, A. Grover, P. Abbeel, and I. Mordatch. Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247, 2021.
[24] A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, and D. Batra. Improving vision-and-language navigation with image-text pairs from the web. In European Conference on Computer Vision, pages 259-274. Springer, 2020.
[25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[26] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937. PMLR, 2016.
[27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[28] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.
[29] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
[30] E. A. Platanios, A. Pauls, S. Roy, Y. Zhang, A. Kyte, A. Guo, S. Thomson, J. Krishnamurthy, J. Wolfe, J. Andreas, et al. Value-agnostic conversational semantic parsing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3666-3681, 2021.
[31] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba. VirtualHome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8494-8502, 2018.
[32] X. Puig, T. Shu, S. Li, Z. Wang, J. B. Tenenbaum, S. Fidler, and A. Torralba. Watch-and-Help: A challenge for social perception and human-AI collaboration. arXiv preprint arXiv:2010.09890, 2020.
[33] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. 2018.
[34] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[35] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1-8, 2021.
[36] M. Reid, Y. Yamada, and S. S. Gu. Can Wikipedia help offline reinforcement learning? arXiv preprint arXiv:2201.12122, 2022.
[37] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[38] P. Sharma, A. Torralba, and J. Andreas. Skill induction and planning with latent language. In Association for Computational Linguistics, 2022.
[39] B. Shen, F. Xia, C. Li, R. Martín-Martín, L. Fan, G. Wang, S. Buch, C. D'Arpino, S. Srivastava, L. P. Tchapmi, et al. iGibson: A simulation environment for interactive tasks in large realistic scenes. arXiv preprint arXiv:2012.02924, 2020.
[40] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740-10749, 2020.
[41] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033. IEEE, 2012.
[42] M. Tsimpoukelli, J. Menick, S. Cabi, S. Eslami, O. Vinyals, and F. Hill. Multimodal few-shot learning with frozen language models. arXiv preprint arXiv:2106.13884, 2021.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[44] J. Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 37-42, Florence, Italy, July 2019. Association for Computational Linguistics.
[45] B. Wang, R. Shin, X. Liu, O. Polozov, and M. Richardson. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. arXiv preprint arXiv:1911.04942, 2019.
[46] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
[47] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[48] Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, and H. Takemura. BERT representations for video question answering. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1556-1565, 2020.
[49] J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, and T.-Y. Liu. Incorporating BERT into neural machine translation. arXiv preprint arXiv:2002.06823, 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 9.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 9.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] In the supplemental material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 6 and Appendix C.2.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Section 6.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix C.2.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] See Section 9.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]