Published as a conference paper at ICLR 2024

LEARNING ADAPTIVE PLANNING REPRESENTATIONS WITH NATURAL LANGUAGE GUIDANCE

Lionel Wong1 Jiayuan Mao1* Pratyusha Sharma1* Zachary S. Siegel2 Jiahai Feng3 Noa Korneev4 Joshua B. Tenenbaum1 Jacob Andreas1
1MIT 2Princeton University 3UC Berkeley 4Microsoft

ABSTRACT

Effective planning in the real world requires not only world knowledge, but the ability to leverage that knowledge to build the right representation of the task at hand. Decades of hierarchical planning techniques have used domain-specific temporal action abstractions to support efficient and accurate planning, almost always relying on human priors and domain knowledge to decompose hard tasks into smaller subproblems appropriate for a goal or set of goals. This paper describes Ada (Action Domain Acquisition), a framework for automatically constructing task-specific planning representations using task-general background knowledge from language models (LMs). Starting with a general-purpose hierarchical planner and a low-level goal-conditioned policy, Ada interactively learns a library of planner-compatible high-level action abstractions and low-level controllers adapted to a particular domain of planning tasks. On two language-guided interactive planning benchmarks (Mini Minecraft and ALFRED Household Tasks), Ada strongly outperforms other approaches that use LMs for sequential decision-making, offering more accurate plans and better generalization to complex tasks.

1 INTRODUCTION

People make complex plans over long timescales, flexibly adapting what we know about the world in general to govern how we act in specific situations. To make breakfast in the morning, we might convert a broad knowledge of cooking and kitchens into tens of fine-grained motor actions in order to find, crack, and fry a specific egg; to achieve a complex research objective, we might plan a routine over days or weeks that begins with the low-level actions necessary to ride the subway to work. The problem of adapting general world knowledge to support flexible long-term planning is one of the unifying challenges of AI. While decades of research have developed representations and algorithms for solving restricted and shorter-term planning problems, generalized and long-horizon planning remains a core, outstanding challenge for essentially all AI paradigms, including classical planning (Erol et al., 1994), reinforcement learning (Sutton et al., 1999), and modern generative AI (Wang et al., 2023a).

How do humans solve this computational challenge? A growing body of work in cognitive science suggests that people construct hierarchical, problem-specific representations of their actions and environment to suit their goals, tailoring how they represent, remember, and reason about the world to plan efficiently for a particular set of tasks (e.g., Ho et al., 2022). In AI, a large body of work has studied hierarchical planning using domain-specific temporal abstractions that progressively decompose high-level goals into sequences of abstract actions, which eventually bottom out in low-level control. An extensive body of work has explored how to plan using these hierarchical action spaces, including robotic task-and-motion planning (TAMP) systems (Garrett et al., 2021) and hierarchical RL frameworks (Sutton et al., 1999). However, identifying a set of abstract actions that are relevant and useful for achieving any given set of goals remains the central bottleneck in general.
Intuitively, useful high-level actions must satisfy many different criteria: they should enable time-efficient high-level planning, correspond to feasible low-level action sequences, and compose and generalize to new tasks. Despite efforts to learn high-level actions automatically in both classical planning (Nejati et al., 2006) and RL formulations (Dietterich, 2000), most state-of-the-art robotics and planning systems rely on human expertise to hand-engineer new planning representations for each new domain (Ahn et al., 2022).

Asterisk indicates equal contribution. Correspondence to zyzzyva@mit.edu. Code for this paper will be released at: https://github.com/CatherineWong/llm-operators

Figure 1: We solve complex planning tasks specified in language and grounded in interactive environments by jointly learning a library of symbolic high-level action abstractions and modular low-level controllers associated with each abstraction. Our system leverages background information in language as a prior to propose useful action abstractions, then uses a hierarchical planning framework to verify and ground them. [Panels: (a) grounded planning tasks in Mini Minecraft and ALFRED Household (e.g., "Bring a hot egg to the table", "Place chilled wine in the cabinet", "Craft a bed"); (b) learning a library of grounded actions with a hierarchical planning framework, via (i) proposing symbolic action abstractions such as chill-object-1, (ii) grounding them with bi-level planning over a low-level policy search, and (iii) retaining a verified grounded action library.]

In this paper, we introduce Action Domain Acquisition (Ada), a framework for using background knowledge from language (conveyed via language models) as an initial source of task-relevant domain knowledge. Ada uses language models (LMs) in an interactive planning loop to assemble a library of composable, hierarchical actions tailored to a given environment and task space. Each action consists of two components: (1) a high-level abstraction represented as a symbolic planning operator (Fikes & Nilsson, 1971) that specifies preconditions and action effects as sets of predicates; and (2) a low-level controller that can achieve the action's effects by predicting a sequence of low-level actions with a neural network or local search procedure. We study planning in a multitask reinforcement learning framework, in which agents interact with their environments to solve collections of tasks of varying complexity. Through interaction, Ada incrementally builds a library of actions, ensuring at each step that learned high-level actions compose to produce valid abstract plans and realizable low-level trajectories.

We evaluate Ada (Fig. 1) on two benchmarks, Mini Minecraft and ALFRED (Shridhar et al., 2020). We compare this approach against three baselines that leverage LMs for sequential decision-making in other ways: to parse linguistic goals into formal specifications that are solved directly by a planner (as in Liu et al. (2023)),
to directly predict sequences of high-level subgoals (as in Ahn et al. (2022)), and to predict libraries of actions defined in general imperative code (as in Wang et al. (2023a)). In both domains, we show that Ada learns action abstractions that allow it to solve dramatically more tasks on each benchmark than these baselines, and that these abstractions compose to enable efficient and accurate planning in complex, unseen tasks.

2 PROBLEM FORMULATION

We assume access to an environment ⟨X, U, T⟩, where X is the (raw) state space, U is the (low-level) action space (e.g., robot commands), and T is a deterministic transition function T : X × U → X. We also have a set of features (or "predicates") P that define an abstract state space S: each abstract state s ∈ S is composed of a set of objects and their features. For example, a simple scene that contains bread on a table could be encoded as an abstract state with two objects A and B, and atoms {bread(A), table(B), on(A, B)}. We assume the mapping from environmental states to abstract states Φ : X → S is given and fixed (though see Migimatsu & Bohg, 2022 for how it might be learned).

In addition to the environment, we have a collection of tasks t. Each t is described by a natural language instruction ℓt, corresponding to a goal predicate gt (which is not directly observed). In this paper, we assume that goal predicates may be defined in terms of abstract states, i.e., gt : S → {T, F}. Our goal is to build an agent that, given the initial state x0 ∈ X and the natural language instruction ℓt, can generate a sequence of low-level actions {u1, u2, ..., uH} ∈ U^H such that gt(Φ(xH)) is true (where xH is the terminal state obtained by sequentially applying {ui} to x0). The agent receives a reward signal only upon achieving the goal specified by gt.

Given a very large number of interactions, a sufficiently expressive reflex policy could, in principle, learn a mapping from low-level states to low-level actions conditioned on the language instruction, π(u | x; ℓt). However, for very long horizons H and large state spaces (e.g., composed of many objects and compositional goals), such algorithms can be highly inefficient or effectively infeasible. The key idea behind our approach is to use natural language descriptions ℓt to bootstrap a high-level action space A over the abstract state space S to accelerate learning and planning.

Figure 2: Representation for our (a) task input (initial state x0 and instruction ℓt, e.g., "Bring a hot egg"), (b) the bi-level planning and execution pipeline used at inference time (abstract state and abstract goal, high-level planner, abstract plan {ai}, and low-level plan {ui} produced by controllers π(u | x; a) : X × A → U), and (c) the abstract state and action representation (predicates P such as receptacleType, objectType, and isHot, and operators A such as HeatObject).

Figure 3: The overall framework.
Given task environment states and descriptions, at each iteration we first propose candidate abstract actions (operators) A′i, then use bi-level planning and execution to solve tasks. We add operators to the operator library based on the execution results.

Formally, our approach learns a library of high-level actions (operators) A. As illustrated in Fig. 2b, each a ∈ A is a tuple ⟨name, args, pre, eff, controller⟩. name is the name of the action; args is a list of variables, usually denoted by ?x, ?y, etc.; pre is a precondition formula defined over the variables args and the features P; and eff is the effect, which is also defined in terms of args and P. Finally, controller : X → U is a low-level policy associated with the action. The semantics of the preconditions and effects is: for any state x such that pre(Φ(x)) holds, executing controller starting in x (for an indefinite number of steps) will yield a state x′ such that eff(Φ(x′)) holds (Lifschitz, 1986). In this framework, A defines a partial, abstract world model of the underlying state space. As shown in Fig. 2b, given the set of high-level actions and a parse of the instruction ℓt into a first-order logic formula, we can leverage symbolic planners (e.g., Helmert, 2006) to first compute a high-level plan {a1, ..., aK} ∈ A^K that achieves the goal ℓt symbolically, and then refine the high-level plan into a low-level plan with the action controllers. This bi-level planning approach decomposes long-horizon planning problems into several short-horizon problems. Furthermore, it can also leverage the compositionality of high-level actions A to generalize to longer plans.

3 ACTION ABSTRACTIONS FROM LANGUAGE

As illustrated in Fig. 3, our framework, Action Domain Acquisition (Ada), learns action abstractions iteratively as it attempts to solve tasks. Our algorithm is given a dataset of tasks and their corresponding language descriptions, the feature set P, and optionally an initial set of high-level action operators A0. At each iteration i, we first use a large language model (LLM) to propose a set of novel high-level action definitions A′i based on the features P and the language goals {ℓt} (Section 3.1). Next, we use an LLM to also translate each language instruction ℓt into a symbolic goal description Ft, and use a bi-level planner to compute a low-level plan to accomplish ℓt (Section 3.2). Then, based on the planning and execution results, we score each operator in A′i and add to the verified library the ones that have yielded successful execution results (Section 3.4). To accelerate low-level planning, we simultaneously learn local subgoal-conditioned policies (i.e., the controllers for each operator; Section 3.3). Algorithm 1 summarizes the overall framework.

A core goal of our approach is to adapt the initial action abstractions proposed from an LLM prior into a set of useful operators A that permit efficient and accurate planning on a dataset of tasks and, ideally, that generalize to future tasks. While language provides a key initial prior, our formulation refines and verifies the operator library to adapt it to a given planning procedure and environment (similar to other action-learning formulations like Silver et al., 2021). This ensures not only that the learned operators respect the dynamics of the environment, but also that their grain of abstraction fits the capacity of the controller, trading off between fast high-level planning and efficient low-level control conditioned on each abstraction.
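The operator tuple above maps naturally onto a small data structure. Below is a minimal Python sketch of one possible representation; the class, field, and helper names are ours for illustration and do not come from the released implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

# Illustrative aliases: raw environment states and low-level actions are left abstract here.
RawState = dict
LowLevelAction = str

# A lifted or ground atom, e.g. ("isHot", ("?o",)) or ("isHot", ("egg-1",)).
Atom = Tuple[str, Tuple[str, ...]]


@dataclass(frozen=True)
class Operator:
    """One learned high-level action: a PDDL-style symbolic abstraction plus the
    low-level controller that is expected to achieve its effects."""
    name: str                                   # e.g. "heat-object"
    args: Tuple[str, ...]                       # typed variables, e.g. ("?l", "?r", "?o")
    pre: FrozenSet[Atom]                        # conjunction over predicates in P and args
    eff: FrozenSet[Atom]                        # conjunction over predicates in P and args
    controller: Callable[[RawState, FrozenSet[Atom]], LowLevelAction]

    def ground(self, binding: Dict[str, str]) -> Tuple[FrozenSet[Atom], FrozenSet[Atom]]:
        """Substitute concrete objects for variables; the grounded effect is the
        subgoal handed to low-level search (Section 3.2)."""
        def substitute(atoms: FrozenSet[Atom]) -> FrozenSet[Atom]:
            return frozenset(
                (pred, tuple(binding.get(v, v) for v in objs)) for pred, objs in atoms
            )
        return substitute(self.pre), substitute(self.eff)
```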
Algorithm 1: Action Abstraction Learning from Language
Input: Dataset of tasks and their language descriptions {ℓt}
Input: Predicate set P
Input: Optionally, an initial set of abstract operators A0 (or A0 = ∅)
1: Initialize subgoal-conditioned policy πθ.
2: for i = 1, 2, ..., M do
3:   A′i ← Ai−1 ∪ ProposeOperatorDefinitions(P, {ℓt})   // Section 3.1
4:   for each unsolved task j: (x0^(j), ℓt^(j)) do
5:     u ← BiLevelPlan(A′i, ℓt^(j), π)                   // Section 3.2
6:     result(j) ← Execute(x0^(j), u)                    // Execute the plan
7:   θ ← UpdateSubgoalPolicy(θ, result)                  // Section 3.3
8:   Ai ← ScoreAndFilter(A′i, result)                    // Section 3.4
9: return AM

[Figure 4 shows the two-stage pipeline: (a) Stage 1 prompts the LLM with example language-to-plan decompositions (e.g., ";; Bake a bread and bring it to the table. (pick-up bread) (place bread oven) (bake bread) ...") and the target instruction "Bring a hot egg to the table" with its abstract state (on(egg, table), cold(egg), open(microwave), ...), yielding a decomposition that includes the undefined step (heat-object kitchen microwave egg); (b) Stage 2 extracts the undefined name heat-object and prompts the LLM with example operator definitions to produce:

(:action heat-object
 :parameters (?l ?r ?o)
 :precondition (and (receptacleType ?r MicrowaveType)
                    (atLocation ?l)
                    (receptacleAtLocation ?r ?l)
                    (holds ?o))
 :effect (and (isHot ?o))) ]

Figure 4: Our two-stage prompting method for generating candidate operator definitions. (a) Given a task instruction, we first prompt an LLM to generate a candidate symbolic task decomposition. (b) We then extract undefined operator names that appear in the sequences and prompt an LLM to generate symbolic definitions.

3.1 OPERATOR PROPOSAL: A′i ← Ai−1 ∪ ProposeOperatorDefinitions(P, {ℓt})

At each iteration i, we use a pretrained LLM to extend the previous operator library Ai−1 with a large set of candidate operator definitions proposed by the LLM based on the task language descriptions and the environment features P. This yields an extended candidate library A′i, where each a ∈ A′i is a tuple ⟨name, args, pre, eff⟩: name is a human-readable action name and args, pre, eff form a PDDL operator definition. We employ a two-stage prompting strategy: symbolic task decomposition followed by symbolic operator definition.

Example. Fig. 4 shows a concrete example. Given a task instruction (Bring a hot egg to the table) and the abstract state description, we first prompt the LLM to generate an abstract task decomposition, which may contain operator names that are undefined in the current operator library. Next, we extract the names of those undefined operators and prompt the LLM to generate the actual symbolic operator definitions, in this case the new heat-object operator.

Symbolic task decomposition. For a given task ℓt and an initial state x0, we first translate the raw state x0 into a symbolic description Φ(x0). To constrain the length of the state description, we only include unary features in the abstract state (i.e., only object categories and properties). We then present a few-shot prompt to the LLM and query it to generate a proposed task decomposition conditioned on the language description ℓt.
It generates a sequence of named high-level actions and their arguments, which can explicitly include high-level actions that are not yet defined in the current action library. We then extract all the operator names proposed across tasks as the candidate high-level operators. Note that while in principle we might use the LLM-proposed task decomposition itself as a high-level plan, we find empirically that this is less accurate and efficient than a formal planner.

Symbolic operator definition. With the proposed operator names and their usage examples (i.e., the actions and their arguments in the proposed plans), we then few-shot prompt the LLM to generate candidate operator definitions in the PDDL format (argument types, and pre/postconditions defined over features in P). We also post-process the generated operator definitions to remove feature names not present in P and to correct syntactic errors. We describe implementation details for our syntax correction strategy in the appendix.

3.2 GOAL PROPOSAL AND PLANNING: result(j) ← Execute(x0^(j), BiLevelPlan(A′i, ℓt^(j), π))

At each iteration i, we then attempt to BiLevelPlan for unsolved tasks in the dataset. This step attempts to find and execute a low-level action sequence {u1, u2, ..., uH} ∈ U^H for each task, using the proposed operators in A′i, that satisfies the unknown goal predicate gt for that task. This provides the environment reward signal for action learning. Our BiLevelPlan has three steps.

Symbolic goal proposal: As defined in Sec. 2, each task is associated with a queryable but unknown goal predicate gt that can be represented as a first-order logic formula ft over symbolic features in P. Our agent only has access to a linguistic task description ℓt, so we use a few-shot prompted LLM to predict candidate goal formulas F′t conditioned on ℓt and the features P.

High-level planning: Given each candidate goal formula f′t ∈ F′t, the initial abstract problem state s0, and the current candidate operator library A′, we search for a high-level plan PA = {(a1, ō1), ..., (aK, ōK)} as a sequence of high-level actions from A′ concretized with object arguments ō, such that executing the action sequence would satisfy f′t according to the operator definitions. This is a standard symbolic PDDL planning formulation; we use an off-the-shelf symbolic planner, Fast Downward (Helmert, 2006), to find high-level plans.

Low-level planning and environment feedback: We then search for a low-level plan as a sequence of low-level actions {u1, u2, ..., uH} ∈ U^H, conditioned on the high-level plan structure. Each concretized action tuple (ai, ōi) ∈ PA defines a local subgoal sgi, namely the operator postcondition parameterized by the object arguments ōi. For each (ai, ōi) ∈ PA, we therefore search for a sequence of low-level actions ui1, ui2, ... that satisfies the local subgoal sgi. We search with a fixed budget per subgoal, and fail early if we are unable to satisfy the local subgoal sgi. If we successfully find a complete sequence of low-level actions satisfying all local subgoals sgi in PA, we execute all low-level actions and query the hidden goal predicate gt to determine the environment reward. We implement a basic learning procedure to simultaneously learn subgoal-conditioned controllers over time (described in Section 3.3), but our formulation is general and supports many hierarchical planning schemes (such as sampling-based low-level planners (LaValle, 1998) or RL algorithms).
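The three steps of BiLevelPlan can be summarized in a short sketch. The callables passed in below are stand-ins for the components described above (the LLM goal translator, the off-the-shelf symbolic planner, and the budgeted per-subgoal search); their names and signatures are illustrative, not the released interface.

```python
def bi_level_plan(operators, instruction, abstract_state, policy, budget,
                  propose_goals, symbolic_plan, low_level_search):
    """Sketch of BiLevelPlan (Section 3.2).

    propose_goals(instruction)            -> candidate FOL goal formulas (LLM)
    symbolic_plan(ops, state, goal)       -> [(operator, binding), ...] or None (Fast Downward)
    low_level_search(subgoal, policy, n)  -> [low-level actions] or None (budgeted search)
    """
    for goal_formula in propose_goals(instruction):
        high_level_plan = symbolic_plan(operators, abstract_state, goal_formula)
        if high_level_plan is None:
            continue                                   # try the next candidate goal
        low_level_actions = []
        for operator, binding in high_level_plan:
            _, subgoal = operator.ground(binding)      # operator effects define the local subgoal
            segment = low_level_search(subgoal, policy, budget)
            if segment is None:
                break                                  # fail early: subgoal unreachable in budget
            low_level_actions.extend(segment)
        else:
            # All subgoals satisfied: return the plan for execution and goal verification.
            return goal_formula, high_level_plan, low_level_actions
    return None
```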
3.3 LOW-LEVEL LEARNING AND GUIDED SEARCH: θ ← UpdateSubgoalPolicy(θ, result)

The sequence of subgoals sgi corresponding to high-level plans PA already restricts the local low-level planning horizon. However, we further learn subgoal-conditioned low-level policies π(u | x; sg) from environment feedback during training to accelerate low-level planning. To exploit shared structure across subgoals, we learn a single shared controller for all operators, conditioned on x ∈ X and conjunctions of predicates in sg. To maximize learning during training, we use a hindsight goal relabeling scheme (Andrychowicz et al., 2017), supervising on all conjunctions of predicates in the state as we roll out low-level search. While the shared controller could be learned as a supervised neural policy, we find that our learned operators sufficiently restrict the search to permit learning an even simpler count-based model mapping X × sg → u ∈ U. We provide additional details in the Appendix.

3.4 SCORING LLM OPERATOR PROPOSALS: Ai ← ScoreAndFilter(A′i, result)

Finally, we update the learned operator library Ai to retain candidate operators that were useful and successful in bi-level planning. Concretely, we estimate the accuracy of each operator candidate a′i ∈ A′i across the bi-level plan executions as s/b, where b counts the total number of times a′i appeared in a high-level plan and s counts the successful executions of the corresponding low-level action sequence that achieved the subgoal associated with a′i. We retain operators if b > τb and s/b > τr, where τb, τr are hyperparameters. Note that this scoring procedure learns whether operators are accurate and support low-level planning independently of whether the LLM-predicted goals f′t matched the true unknown goal predicates gt.
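The scoring rule is simple enough to state directly in code. The following is a minimal sketch under an assumed record format for execution results; the real system's bookkeeping may differ.

```python
from collections import Counter

def score_and_filter(candidates, executions, tau_b, tau_r):
    """Sketch of ScoreAndFilter (Section 3.4). Each execution record is assumed to be a
    list of (operator_name, subgoal_succeeded) pairs, one per high-level action in the
    executed plan; this record format is illustrative, not the released interface."""
    appearances, successes = Counter(), Counter()
    for record in executions:
        for operator_name, subgoal_succeeded in record:
            appearances[operator_name] += 1            # b: times the operator appeared in a plan
            if subgoal_succeeded:
                successes[operator_name] += 1          # s: times its grounded subgoal was achieved
    # Retain operators that were used often enough (b > tau_b) and reliably succeeded (s/b > tau_r).
    return [
        op for op in candidates
        if appearances[op.name] > tau_b
        and successes[op.name] / appearances[op.name] > tau_r
    ]
```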
4 EXPERIMENTS

Domains. We evaluate our approach on two language-specified planning benchmarks: Mini Minecraft and ALFRED (Shridhar et al., 2020). Mini Minecraft (Fig. 5, top) is a procedurally-generated Minecraft-like benchmark (Chen et al., 2021; Luo et al., 2023) on a 2D grid world that requires complex, extended planning. The agent can use tools to mine resources and craft objects. The ability to create new objects that themselves permit new actions yields an enormous action space at each time step (>2000 actions) and very long-horizon tasks (26 high-level steps for the most complex task, without path-planning). ALFRED (Fig. 5, bottom) is a household planning benchmark of human-annotated but formally verifiable tasks defined over a simulated Unity environment (Shridhar et al., 2020). The tasks include object rearrangements and tasks involving object states such as heating and cleaning. Ground-truth high-level plans in the ALFRED benchmark compose 5-10 high-level operators, and low-level action trajectories have on average 50 low-level actions. There are over 100 objects that the agent can interact with in each interactive environment. See the Appendix for details.

Figure 5: Top: (a) The Mini Minecraft environment, showing an intermediate step towards crafting a bed (example task: "Craft a bed"). (b) An operator (craft-bed) proposed by an LLM and verified by our algorithm through planning and execution; its preconditions require the agent to be at a work station with wood planks and wool in the inventory, and its effects add a bed to the inventory. (c) The low-level crafting actions involved (e.g., mining iron ore with a pickaxe to obtain an iron ingot). Bottom: (a) The ALFRED household environment, with example tasks such as "Wash the dirty bowl before putting the bowl on the counter", "Put chilled wine in the cabinet", "Warm a plate and place it on the table", and "Place a cold potato slice in the oven". (b) Example operators proposed by the LLM and verified by our algorithm: CoolObject, whose preconditions require holding the object at a fridge's location and whose effect marks it cool, and SliceObject, whose preconditions require holding a knife at the sliceable object's location and whose effect marks it sliced. These operators are composed to solve the cold potato slice task.

Experimental setup. We evaluate in an iterative continual learning setting; except on the compositional evaluations, we learn from n=2 iterations through all (randomly ordered) tasks and report final accuracy on those tasks. All experiments and baselines use GPT-3.5. For each task, at each iteration, we sample n=4 initial goal proposals, n=4 initial task decompositions, and n=3 operator definition proposals for each operator name. We report best-of accuracy, scoring a task as solved if verification passes on at least one of the proposed goals. For Minecraft, we set the motion planning budget for each subgoal to 1000 nodes. For ALFRED, which requires a slow Unity simulation, we set it to 50 nodes. Additional temperature and sampling details are in the Appendix.

We evaluate on three Mini Minecraft benchmark variations to test how our approach generalizes to complex, compositional goals. In the simplest Mining benchmark, all goals involve mining a target item from an appropriate initial resource with an appropriate tool (e.g., mining iron from iron ore with an axe). In the harder Crafting benchmark, goals involve crafting a target artifact (e.g., a bed), which may require mining a few target resources. The most challenging Compositional benchmark combines mining and crafting tasks, in environments that only begin with raw resources and two starting tools (axe and pickaxe). Agents may need to compose multiple skills to obtain other downstream resources (see Fig. 5 for an example). To test action generalization, we report evaluation on the Compositional benchmark using only actions learned previously in the Mining and Crafting benchmarks.
We similarly evaluate on an ALFRED benchmark of Simple and Compositional tasks drawn from the original task distribution in Shridhar et al. (2020). This distribution contains simple tasks that require picking up an object and placing it in a new location; tasks that require picking up an object, applying a single household skill to it, and moving it to a new location (e.g., Put a clean apple on the dining table); and compositional tasks that require multiple skills (e.g., Place a hot sliced potato on the counter). We use a random subset of n=223 tasks, selected from an initial 250 that we manually filter to remove completely misspecified goals (which omit any mention of the target object or skill).

Mini Minecraft (n=3)      | LLM Predicts?  | Mining        | Crafting      | Compositional
Low-level Planning Only   | Goal           | 31% (σ=0.0%)  | 9% (σ=0.0%)   | 9% (σ=0.0%)
Subgoal Prediction        | Sub-goals      | 33% (σ=1.6%)  | 36% (σ=5.6%)  | 6% (σ=1.7%)
Code Policy Prediction    | Sub-policies   | 15% (σ=1.2%)  | 39% (σ=3.2%)  | 10% (σ=1.7%)
Ada (Ours)                | Goal+Operators | 100% (σ=0.0%) | 100% (σ=7.5%) | 100% (σ=4.1%)

ALFRED (n=3 replications) | LLM Predicts?  | Original (Simple + Compositional Tasks)
Low-level Planning Only   | Goal           | 21% (σ=1.0%)
Subgoal Prediction        | Sub-goals      | 2% (σ=0.4%)
Code Policy Prediction    | Sub-policies   | 2% (σ=0.9%)
Ada (Ours)                | Goal+Operators | 79% (σ=0.9%)

Table 1: (Top) Results on Mini Minecraft. Our algorithm successfully recovers all intermediate operators for mining and crafting, which enable generalization to more compositional tasks (which use up to 26 operators) without any additional learning. (Bottom) Results on ALFRED. Our algorithm recovers all required household operators, which generalize to more complex compositional tasks. All results report mean performance and standard deviation from n=3 random replications for all models.

Baselines. We compare our method to three baselines for language-guided planning. Low-level Planning Only uses an LLM to predict only the symbolic goal specification, conditioned on the high-level predicates and linguistic goal, then uses the low-level planner to search directly for actions that satisfy that goal. This baseline implements a model like LLM+P (Liu et al., 2023), which uses LLMs to translate linguistic goals into planning-compatible formal specifications, then attempts to plan directly towards these with no additional representation learning. Subgoal Prediction uses an LLM to predict a sequence of high-level subgoals (as PDDL pre/postconditions with object arguments), conditioned on the high-level predicates, the task goal, and the initial environment state. This baseline implements a model like SayCan (Ahn et al., 2022), which uses LLMs to directly predict a goal and a sequence of decomposed formal subgoal representations, then applies low-level planning over these formal subgoals. Code Policy Prediction uses an LLM to predict the definitions of a library of imperative local code policies in Python (with cases and control flow) over an imperative API that can query state and execute low-level actions. Then, as Fast Downward planning is no longer applicable, we also use the LLM to predict the function call sequences with arguments for each task. This baseline implements a model like Voyager (Wang et al., 2023a), which uses an LLM to predict a library of skills implemented as imperative code for solving individual tasks.
Like Voyager, we verify the individual code skills during interactive planning, but do not use a more global learning objective that attempts to learn a concise or non-redundant library.

4.1 RESULTS

What action libraries do we learn? Fig. 5 shows example operators learned in each domain (Appendix A.3 contains the full libraries of operators learned on both domains from a randomly sampled run of the n=3 replications). In Mini Minecraft, we manually inspect the library and find that we learn operators that correctly specify the appropriate tools, resources, and outputs for all intermediate mining actions (on Mining) and crafting actions (on Crafting), allowing perfect direct generalization to the Compositional tasks without any additional training on these complex tasks. In ALFRED, we compare the learned libraries from all runs to the ground-truth operator library hand-engineered in Shridhar et al. (2020). The ground-truth operator set contains 8 distinct operators corresponding to different compositional skills (e.g., Slicing, Heating, Cleaning, Cooling). Across all replications, the model reliably recovers semantically identical (same predicate preconditions and postconditions) definitions for all of these ground-truth operators, except for a single operator that is defined disjunctively (the ground-truth Slice skill specifies either of two types of knives), which we occasionally learn as two distinct operators or only recover with one of these two types. We also inspect the learning trajectory and find that, through the interactive learning loop, we successfully reject many initially proposed operator definitions sampled from the language model that turn out to be redundant (which would make high-level planning inefficient), inaccurate (including a priori reasonable proposals that do not fit the environment specifications, such as proposing to clean objects with just a towel, when our goal verifiers require washing them with water in a sink), or underspecified (such as those that omit key preconditions, yielding under-decomposed high-level task plans that make low-level planning difficult).

Do these actions support complex planning and generalization? Table 1 shows quantitative results from n=3 randomly-initialized replications of all models, to account for random noise in sampling from the language model and stochasticity in the underlying environment (ALFRED). On Minecraft, where goal specification is completely clear due to the synthetic language, we solve all tasks in each evaluation variation, including the challenging Compositional setting: the action libraries learned from simpler mining/crafting tasks generalize completely to complex tasks that require crafting all intermediate resources and tools from scratch. On ALFRED, we vastly outperform all other baselines, demonstrating that the learned operators are much more effective for planning and compose generalizably to more complex tasks. We qualitatively find that failures on ALFRED occur for several reasons. One is goal misspecification, when the LLM does not successfully recover the formal goal predicate (often due to ambiguity in human language), though we find that, on average, 92% of the time the ground truth goal appears as one of the top-4 goals translated by the LLM.
We also find failures due to low-level policy inaccuracy, when the learned policies fail to account for low-level, often geometric details of the environment (e.g., the learned policies are not sufficiently precise to place a tall bottle on an appropriately tall shelf). More rarely, we see planning failures caused by slight operator overspecification (e.g., the Slice case discussed above, in which we do not recover the specific disjunction over the possible knives that can be used to slice). Both operator and goal specification errors could be addressed in principle by sampling more (and more diverse) proposals.

How does our approach compare to using the LLM to predict just goals, or to predict task sequences? As shown in Table 1, our approach vastly outperforms the Low-level Planning Only baseline on both domains, demonstrating the value of the action library for longer-horizon planning. We also find a substantial improvement over the Subgoal Prediction baseline. While the LLM frequently predicts important high-level aspects of the task subgoal structure (as it does to propose operator definitions), it frequently struggles to robustly sequence these subgoals and to predict appropriate concrete object groundings that correctly obey the initial problem conditions or the changing environment state. These errors accumulate over the planning horizon, reflected in decreasing accuracy on the compositional Minecraft tasks (on ALFRED, this baseline struggles to solve any more than the basic pick-and-place tasks, as the LLM struggles to predict subgoals that accurately track whether objects are in appliances or whether the agent's single gripper is already full with an existing tool).

How does our approach compare to using the LLM to learn and predict plans using imperative code libraries? Somewhat surprisingly, we find that the Code Policy Prediction baseline performs unevenly and often very poorly on our benchmarks. (We include additional results in A.2.1 showing that our model also dramatically outperforms this baseline using GPT-4 as the base LLM.) We find several key reasons for the poor performance of this baseline relative to our model, each of which validates the key conceptual contributions of our approach. First, the baseline relies on the LLM as the planner: as the skills are written as general Python functions rather than in any planner-specific representation, we cannot use an optimized planner like Fast Downward. As with Subgoal Prediction, we find that the LLM is not a consistent or accurate planner. While it retrieves generally relevant skills from the library for each task, it often struggles to sequence them accurately or to predict appropriate arguments given the initial problem state. Second, we find that imperative code is in general less suited as a hierarchical planning representation for these domains than the high-level PDDL and low-level local policy search representation we use in our model, because it must use control flow to account for environment details that would otherwise be handled by local search relative to a high-level PDDL action. Finally, our model specifically frames the library learning objective around learning a compact library of skills that enables efficient planning, whereas our Voyager re-implementation (as in Wang et al. (2023a)) simply grows a library of skills which are individually executable and can be used to solve individual, shorter tasks. Empirically, as with the original model in Wang et al.
(2023a), this baseline learns hundreds of distinct code definitions on these datasets, which makes it harder to plan accurately and to generalize to more complex tasks. Taken together, these challenges support our overarching library learning objective for hierarchical planning.

5 RELATED WORK

Planning for language goals. A large body of recent work attempts to use LLMs to solve planning tasks specified in language. One approach is to directly predict action sequences (Huang et al., 2022; Valmeekam et al., 2022; Silver et al., 2022; Wang et al., 2023b), but this has yielded mixed results, as LLMs can struggle to generalize or to produce correct plans as problems grow more complex. To combat this, one line of work has explored structured and iterative prompting regimes (e.g., chain-of-thought and feedback) (Mu et al., 2023; Silver et al., 2023; Zhu et al., 2023). Increasingly, other neuro-symbolic work uses LLMs to predict formal goal or action representations that can be verified or solved with symbolic planners (Song et al., 2023; Ahn et al., 2022; Xie et al., 2023; Arora & Kambhampati, 2023). These approaches leverage the benefits of a known planning domain model; our goal in this paper is to leverage language models to learn this domain model. Another line of research uses LLMs to generate formal planning domain models for specific problems (Liu et al., 2023) and subsequently uses classical planners to solve the task. However, these approaches do not consider generating grounded or hierarchical actions in an environment, and they do not learn a library of operators that can be reused across different tasks. More broadly, we share the broad goal of building agents that can understand language and execute actions to achieve goals (Tellex et al., 2011; Misra et al., 2017; Nair et al., 2022). See also Luketina et al. (2019) and Tellex et al. (2020).

Learning planning domain and action representations from language. Another line of work focuses on learning latent action representations from language (Corona et al., 2021; Andreas et al., 2017; Jiang et al., 2019; Sharma et al., 2022; Luo et al., 2023). Our work differs in that we learn a planning-compatible action abstraction from LLMs, instead of relying on human demonstrations and annotated step-by-step instructions. The more recent Wang et al. (2023a) adopts a similar overall problem specification, learning libraries of actions as imperative code-based policies. Our results show that learning planning abstractions enables better integration with hierarchical planning and, as a result, better performance and generalization to more complex problems. Other recent work (Nottingham et al., 2023) learns an environment model from interactive experience, represented as a task dependency graph; we seek to learn a richer state transition model (which represents the effects of actions), decomposed as operators that can be formally composed to verifiably satisfy arbitrarily complex new goals. Guan et al. (2024), published concurrently, seeks to learn PDDL representations; we show how these can be grounded hierarchically.

Language and code. In addition to Wang et al. (2023a), a growing body of work in program synthesis learns lifted program abstractions that compress longer existing or synthesized programs (Bowers et al., 2023; Ellis et al., 2023; Wong et al., 2021; Cao et al., 2023). These approaches (including Wang et al.
(2023a)) generally learn libraries defined over imperative and functional programming languages, such as LISP and Python. Our work is closely inspired by these and seeks to learn representations suited specifically to solving long-range planning problems.

Hierarchical planning abstractions. The hierarchical planning knowledge that we learn from LLMs and from interaction with the environment is related to hierarchical task networks (Erol et al., 1994; Nejati et al., 2006), hierarchical goal networks (Alford et al., 2016), abstract PDDL domains (Konidaris et al., 2018; Bonet & Geffner, 2020; Chitnis et al., 2021; Asai & Muise, 2020; Mao et al., 2022; 2023), and domain control knowledge (de la Rosa & McIlraith, 2011). Most of these approaches require manually specified hierarchical planning abstractions; others learn them from demonstrations or interactions. By contrast, we leverage human language to guide the learning of such abstractions.

6 DISCUSSION AND FUTURE WORK

Our evaluations suggest a powerful role for language within AI systems that form complex, long-horizon plans: language is a rich source of background knowledge about the right action abstractions for everyday planning domains, containing broad human priors about environments, task decompositions, and potential future goals. A core goal of this paper was to demonstrate how to integrate this knowledge into the search, grounding, and verification toolkits developed in hierarchical planning. We leave open many possible extensions for future work. Key limitations of our current framework point towards important directions for further integrating LMs and hierarchical planning to scale our approach: here, we build on an existing set of pre-defined symbolic predicates for initially representing the environment state; we do not yet tackle fine-grained, geometric motor planning; and we use a general LLM (rather than one fine-tuned for extended planning). Future work might tackle these problems by asking how else linguistic knowledge and increasingly powerful or multimodal LLMs could be integrated here: to propose useful named predicates over initial perceptual inputs (e.g., images) (Migimatsu & Bohg, 2022); or to speed planning by bootstrapping hierarchical planning abstractions using the approach here, and then progressively transferring planning to another model, including an LLM, to later compose and use the learned representations.

Acknowledgement. We thank anonymous reviewers for their valuable comments. We gratefully acknowledge support from ONR MURI grant N00014-16-1-2007; from the Center for Brains, Minds, and Machines (CBMM, funded by NSF STC award CCF-1231216); from NSF grant 2214177; from NSF grants CCF-2217064 and IIS-2212310; from Air Force Office of Scientific Research (AFOSR) grant FA9550-22-1-0249; from ONR MURI grant N00014-22-1-2740; from ARO grant W911NF-23-1-0034; from the MIT-IBM Watson AI Lab; from the MIT Quest for Intelligence; from Intel; and from the Boston Dynamics Artificial Intelligence Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

REFERENCES

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I Can, Not as I Say: Grounding Language in Robotic Affordances. arXiv:2204.01691, 2022.
Ron Alford, Vikas Shivashankar, Mark Roberts, Jeremy Frank, and David W Aha. Hierarchical Planning: Relating Task and Goal Decomposition with Task Sharing. In IJCAI, 2016.

Jacob Andreas, Dan Klein, and Sergey Levine. Modular Multitask Reinforcement Learning with Policy Sketches. In ICML, 2017.

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. In NeurIPS, 2017.

Daman Arora and Subbarao Kambhampati. Learning and Leveraging Verifiers to Improve Planning Capabilities of Pre-trained Language Models. arXiv:2305.17077, 2023.

Masataro Asai and Christian Muise. Learning Neural-Symbolic Descriptive Planning Models via Cube-Space Priors: The Voyage Home (to STRIPS). In IJCAI, 2020.

Blai Bonet and Hector Geffner. Learning First-Order Symbolic Representations for Planning from the Structure of the State Space. In ECAI, 2020.

Matthew Bowers, Theo X Olausson, Lionel Wong, Gabriel Grand, Joshua B Tenenbaum, Kevin Ellis, and Armando Solar-Lezama. Top-Down Synthesis for Library Learning. PACMPL, 7(POPL):1182–1213, 2023.

David Cao, Rose Kunkel, Chandrakana Nandi, Max Willsey, Zachary Tatlock, and Nadia Polikarpova. Babble: Learning Better Abstractions with E-graphs and Anti-unification. PACMPL, 7(POPL):396–424, 2023.

Valerie Chen, Abhinav Gupta, and Kenneth Marino. Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning. In ICLR, 2021.

Rohan Chitnis, Tom Silver, Joshua B Tenenbaum, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. GLIB: Efficient Exploration for Relational Model-Based Reinforcement Learning via Goal-Literal Babbling. In AAAI, 2021.

Rodolfo Corona, Daniel Fried, Coline Devin, Dan Klein, and Trevor Darrell. Modular Networks for Compositional Instruction Following. In NAACL-HLT, 2021.

Tomás de la Rosa and Sheila McIlraith. Learning Domain Control Knowledge for TLPlan and Beyond. In ICAPS 2011 Workshop on Planning and Learning, 2011.

Thomas G Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. JAIR, 13:227–303, 2000.

Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. DreamCoder: Growing Generalizable, Interpretable Knowledge with Wake-Sleep Bayesian Program Learning. Philosophical Transactions of the Royal Society, 381(2251):20220050, 2023.

Kutluhan Erol, James Hendler, and Dana S Nau. HTN Planning: Complexity and Expressivity. In AAAI, 1994.

Richard E Fikes and Nils J Nilsson. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artif. Intell., 2(3-4):189–208, 1971.

Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Integrated Task and Motion Planning. Ann. Rev. Control Robot. Auton. Syst., 4:265–293, 2021.

Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-Based Task Planning. In NeurIPS, 2024.

Malte Helmert. The Fast Downward Planning System. JAIR, 26:191–246, 2006.

Mark K Ho, David Abel, Carlos G Correa, Michael L Littman, Jonathan D Cohen, and Thomas L Griffiths. People Construct Simplified Mental Representations to Plan.
Nature, 606(7912):129–136, 2022.

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. In ICML, 2022.

Yiding Jiang, Shixiang Shane Gu, Kevin P Murphy, and Chelsea Finn. Language as an Abstraction for Hierarchical Deep Reinforcement Learning. In NeurIPS, 2019.

George Konidaris, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning. JAIR, 61:215–289, 2018.

Steven LaValle. Rapidly-Exploring Random Trees: A New Tool for Path Planning. Research Report 9811, 1998.

Vladimir Lifschitz. On the Semantics of STRIPS. In Workshop on Reasoning about Actions and Plans, 1986.

Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency. arXiv:2304.11477, 2023.

Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. A Survey of Reinforcement Learning Informed by Natural Language. In IJCAI, 2019.

Zhezheng Luo, Jiayuan Mao, Jiajun Wu, Tomás Lozano-Pérez, Joshua B Tenenbaum, and Leslie Pack Kaelbling. Learning Rational Subgoals from Demonstrations and Instructions. In AAAI, 2023.

Jiayuan Mao, Tomás Lozano-Pérez, Joshua B. Tenenbaum, and Leslie Pack Kaelbling. PDSketch: Integrated Domain Programming, Learning, and Planning. In NeurIPS, 2022.

Jiayuan Mao, Tomás Lozano-Pérez, Joshua B. Tenenbaum, and Leslie Pack Kaelbling. What Planning Problems Can a Relational Neural Network Solve? In NeurIPS, 2023.

Toki Migimatsu and Jeannette Bohg. Grounding Predicates through Actions, 2022.

Dipendra Misra, John Langford, and Yoav Artzi. Mapping Instructions and Visual Observations to Actions with Reinforcement Learning. In EMNLP, 2017.

Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. EmbodiedGPT: Vision-Language Pre-training via Embodied Chain of Thought. arXiv:2305.15021, 2023.

Suraj Nair, Eric Mitchell, Kevin Chen, Silvio Savarese, and Chelsea Finn. Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation. In CoRL, 2022.

Negin Nejati, Pat Langley, and Tolga Konik. Learning Hierarchical Task Networks by Observation. In ICML, 2006.

Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hannaneh Hajishirzi, Sameer Singh, and Roy Fox. Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling. In ICML, 2023.

Pratyusha Sharma, Antonio Torralba, and Jacob Andreas. Skill Induction and Planning with Latent Language. In ACL, 2022.

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In CVPR, 2020.

Tom Silver, Rohan Chitnis, Joshua Tenenbaum, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Learning Symbolic Operators for Task and Motion Planning. In IROS, 2021.

Tom Silver, Varun Hariprasad, Reece S Shuttleworth, Nishanth Kumar, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. PDDL Planning with Pretrained Large Language Models. In NeurIPS Foundation Models for Decision Making Workshop, 2022.
Tom Silver, Soham Dan, Kavitha Srinivas, Joshua B Tenenbaum, Leslie Pack Kaelbling, and Michael Katz. Generalized Planning in PDDL Domains with Pretrained Large Language Models. arXiv:2305.11014, 2023.

Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. In ICCV, 2023.

Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artif. Intell., 112(1-2):181–211, 1999.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew Walter, Ashis Banerjee, Seth Teller, and Nicholas Roy. Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation. In AAAI, 2011.

Stefanie Tellex, Nakul Gopalan, Hadas Kress-Gazit, and Cynthia Matuszek. Robots That Use Language. Annual Review of Control, Robotics, & Autonomous Systems, 3:25–55, 2020.

Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change). In NeurIPS Foundation Models for Decision Making Workshop, 2022.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291, 2023a.

Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents, 2023b.

Catherine Wong, Kevin M Ellis, Joshua Tenenbaum, and Jacob Andreas. Leveraging Language to Learn Program Abstractions and Search Heuristics. In ICML, 2021.

Yaqi Xie, Chen Yu, Tongyao Zhu, Jinbin Bai, Ze Gong, and Harold Soh. Translating Natural Language to Planning Goals with Large-Language Models. arXiv:2302.05128, 2023.

Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory, 2023.

APPENDIX

We will release a complete code repository containing our full algorithm implementation, all baselines, and benchmark tasks. Here, we provide additional details on our implementation choices.

A.1 BENCHMARKS

Mini Minecraft (Fig. 5, top) is a procedurally-generated Minecraft-like benchmark (Chen et al., 2021; Luo et al., 2023) that requires complex, extended planning. The environment places an agent on a 2D map containing various resources, tools, and crafting stations. The agent can use appropriate tools to mine new items from raw resources (e.g., use an axe to obtain wood from trees), or collect resources into an inventory to craft new objects (e.g., combining sticks and iron ingots to craft a sword, which itself can be used to obtain feathers from a chicken). The ability to create new objects that themselves permit new actions yields an enormous action space at each time step (>2000 actions, considering different combinations of items to use) and very long-horizon tasks (26 steps for the most complex task, even without path-planning). The provided environment predicates allow querying object types and inventory contents. Low-level actions allow the agent to move and apply tools to specific resources.
To focus on complex crafting, we provide a low-level move-to action to move directly to specified locations. Linguistic goal specifications are synthetically generated from a simple grammar over craftable objects and resources (e.g., Craft a sword, Mine iron ore).

ALFRED (Fig. 5, bottom) is a household planning benchmark of human-annotated but formally verifiable tasks defined over a simulated Unity environment (Shridhar et al., 2020). The interactive environment places an agent in varying 3D layouts, each containing appliances and dozens of household objects. The provided environment includes predicates for querying object types, object and agent locations, and classifiers over object states (e.g., whether an object is hot or on). Low-level actions enable the agent to pick up and place objects, apply tools to other objects, and open, close, and turn on appliances. As specified in Shridhar et al. (2020), ground-truth high-level plans in the ALFRED benchmark compose 5-10 high-level operators, and low-level action trajectories have on average 50 low-level actions. There are over 100 objects that the agent can interact with in each interactive environment. As with Minecraft, we provide a low-level method to move the agent directly to specified locations. While ALFRED is typically used to evaluate detailed instruction following, we focus on a goal-only setting that only uses the goal specifications. The human-annotated goals introduce ambiguity, underspecification, and errors with respect to the ground-truth verifiable tasks (e.g., people refer to tables without specifying if they mean the side table, dining table, or desk; to a light when there are multiple distinct lamps; or to a cabbage when they want lettuce).

A.2 ADDITIONAL METHODS IMPLEMENTATION DETAILS

A.2.1 LLM PROMPTING

We use gpt-3.5-turbo-16k for all experiments and baselines. Here, we describe the contents of the LLM few-shot prompts used in our method in more detail.

Symbolic Task Decomposition. For all unsolved tasks, at each iteration, we sample a set of symbolic task decompositions as sequences of named high-level actions and their arguments. We construct a few-shot prompt consisting of the following components:

1. A brief natural language header (";;;; Given natural language goals, predict a sequence of PDDL actions");
2. A sequence of example (ℓt, PA) tuples containing linguistic goals and example task decompositions. To avoid biasing the language model in advance, we provide example task decompositions for similar, constructed tasks that do not use any of the skills that need to be learned in our two domains. For example, on ALFRED, these example task decompositions are for example tasks (bake a potato and put it in the fridge; place a baked, grated apple on top of the dining table; place a plate in a full sink; and pick up a laptop and then carry it over to the desk lamp, then restart the desk lamp), and our example task decompositions suggest the named operators BakeObject, GrateObject, FillObject, and RestartObject, none of which appear in the actual training set.
3. At iterations > 0, we also provide a sequence of (ℓt, PA) tuples randomly sampled from any solved tasks and their discovered high-level plans. This means that the few-shot prompt better represents the true task distribution over successive iterations.

In our experiments, we prompt with temperature=1.0 and draw n=4 task decomposition samples per unsolved task.
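For concreteness, the sketch below shows one way the task-decomposition prompt above could be assembled. The exact header strings, formatting, and helper names here are illustrative assumptions, not the released prompts.

```python
def build_decomposition_prompt(instruction, unary_atoms, seed_examples, solved_examples):
    """Sketch of the few-shot task-decomposition prompt (Section A.2.1).

    unary_atoms:     [(predicate, object), ...] unary features of the abstract state
    seed_examples:   [(language goal, example decomposition), ...] hand-written, out-of-domain
    solved_examples: [(language goal, discovered decomposition), ...] added at iterations > 0
    """
    parts = [";;;; Given natural language goals, predict a sequence of PDDL actions"]
    for goal, decomposition in list(seed_examples) + list(solved_examples):
        parts.append(f";; {goal}\n{decomposition}")
    # Only unary features (object categories and properties) keep the state description short.
    state = " ".join(f"({pred} {obj})" for pred, obj in unary_atoms)
    parts.append(f";; {instruction}\n;; State: {state}")
    return "\n".join(parts)
```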
Symbolic Operator Definition. For all unsolved tasks, at each iteration, we sample proposed operator definitions consisting of ⟨args, pre, eff⟩ conditioned on all undefined operator names that appear in the proposed task decompositions. For each operator name, we construct a few-shot prompt consisting of the following components:

1. A brief natural language header ("You are a software engineer who will be writing planning operators in the PDDL planning language. These operators are based on the following PDDL domain definition.");
2. The full vocabulary of high-level environment predicates P, as well as valid named argument values (e.g., object types);
3. A sequence of example ⟨name, args, pre, eff⟩ operator definitions demonstrating the PDDL definition format. As with task decomposition, we do not provide example definitions for any of the operators that we wish to learn on our dataset;
4. At iterations > 0, as many of the validated ⟨name, args, pre, eff⟩ operators defined in the current library as possible (including newly learned operators). When operators share structural patterns, this means that few-shot prompting also better reflects the true operator structure over successive iterations.

In our experiments, we prompt with temperature=1.0 and draw n=3 operator definition samples per undefined operator name. However, in our pilot experiments, we find that sampling directly from the token probabilities defined by this few-shot prompt does not produce sufficiently diverse definitions for each operator name. We instead directly prompt the LLM to produce up to N distinct operator definitions sequentially. We also find that GPT-3.5 frequently produces syntactically invalid operator proposals: proposed operators often include invented predicates and object types that are not defined in the environment vocabulary, do not obey the predicate typing rules, or do not have the correct number and types of arguments. While this might improve with finetuned or larger LLMs, we instead implement a simple post-processing heuristic to correct operators with syntactic errors, or to reject operators altogether: as operator pre- and postconditions are represented as conjunctions of predicates, we remove any invalid predicates (predicates that are invented or that specify invalid arguments); we collect all arguments named across the remaining predicates and use the ground-truth typing to produce the final args; and we reject any operators that have zero valid postcondition predicates. This post-processing procedure frequently leaves operators underspecified (e.g., the resulting operators are now missing necessary preconditions, which were partially generated but syntactically incorrect in the proposal); we allow our full operator learning algorithm to verify and reject these operators.
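As a concrete illustration of this heuristic, the sketch below filters a proposed operator's predicates against the environment vocabulary and ground-truth typing, rebuilds its argument list, and rejects proposals with no valid effects. The Predicate/Operator containers, the arity-based vocabulary, and the variable-to-type map are simplifying assumptions for illustration, not the released code.

```python
# Hypothetical sketch of the syntactic post-processing heuristic for proposed operators.
# Data structures and helper names are illustrative, not the released implementation.
from dataclasses import dataclass

@dataclass
class Predicate:
    name: str
    args: list            # argument variable names, e.g. ["?target"]

@dataclass
class Operator:
    name: str
    args: list            # (variable, type) pairs
    preconditions: list   # conjunction of Predicates
    effects: list         # conjunction of Predicates

def is_valid(pred, predicate_vocab, variable_types):
    """A predicate survives only if its name is in the environment vocabulary,
    it has the declared arity, and every argument has a known ground-truth type."""
    if pred.name not in predicate_vocab:
        return False
    if len(pred.args) != predicate_vocab[pred.name]:
        return False
    return all(v in variable_types for v in pred.args)

def postprocess_operator(proposal, predicate_vocab, variable_types):
    """Drop invalid predicates, rebuild the typed argument list from what survives,
    and reject the operator outright if no valid effect predicates remain."""
    pre = [p for p in proposal.preconditions if is_valid(p, predicate_vocab, variable_types)]
    eff = [p for p in proposal.effects if is_valid(p, predicate_vocab, variable_types)]
    if not eff:
        return None  # zero valid postcondition predicates -> reject
    variables = sorted({v for p in pre + eff for v in p.args})
    typed_args = [(v, variable_types[v]) for v in variables]
    return Operator(proposal.name, typed_args, pre, eff)
```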
Symbolic Goal Proposal. Finally, as described in Section 3.2, we also use an LLM to propose a set of candidate goal definitions F_t for each task, given as FOL formulas defined over the environment predicates P. Our prompting technique is very similar to that used in the rest of our algorithm. For each task, we construct a few-shot prompt consisting of the following components:

1. A brief natural language header ("You are a software engineer who will be writing goal definitions for a robot in the PDDL planning language.");
2. The full vocabulary of high-level environment predicates P, as well as valid named argument values (e.g., object types);
3. A sequence of example (l_t, f_t) pairs of linguistic goals and FOL goal formulas. During training, unlike in the previous prompts (where including the ground-truth operators would solve the learning problem), we do sample an initial set of goal definitions from the training distribution as our initial example supervision; we set this supervision to a randomly sampled fraction (0.1) of the training distribution;
4. At iterations > 0, we also include (l_t, f_t) examples from successfully solved tasks.

In our experiments, we prompt with temperature=1.0 and draw n=4 goal definition samples per unsolved task. As with the operator proposals, we find that sampling directly from the token probabilities defined by this few-shot prompt does not produce sufficiently diverse definitions for each linguistic goal to correct for ambiguity in the human language (e.g., to cover the multiple concrete Table types that a person might mean when referring to a table). We therefore again directly prompt the LLM to produce up to N distinct goal definitions sequentially. We also post-process proposed goals using the same syntactic criterion, removing invalid predicates from the FOL formula and rejecting any empty goals.

Mini Minecraft              LLM Predicts?        Library?        Mining   Crafting   Compositional
Code as Policies            Policy Prediction    Sub-policies    12%      37%        11%
Ours                        Goal+Operators                       100%     100%       100%

ALFRED (n=3 replications)   LLM Predicts?        Library?        Original (Simple + Compositional Tasks)
Code as Policies            Policy Prediction    Sub-policies    11%
Ours                        Goal+Operators                       70%

Table 2: Results with GPT-4 as the LLM backbone. On both Mini Minecraft (top) and ALFRED (bottom), our algorithm recovers all required operators, which generalize to more complex compositional tasks. Switching to GPT-4 does not impact the performance trends observed across the Code as Policies (Voyager) baseline and our method.

A.2.2 POLICY LEARNING AND GUIDED LOW-LEVEL SEARCH

Concretely, we implement our policy-guided low-level action search as follows. We maintain a dictionary D that maps subgoals (conjunctions of atoms) to sets of candidate low-level action trajectories. When planning for a new subgoal sg, if D contains matching entries, we prioritize trying the candidate low-level trajectories stored in D; otherwise, we fall back to a brute-force breadth-first search (BFS) over all possible action trajectories. To populate D, during the BFS we compute the difference in the environment state before and after the agent executes each sampled trajectory t; this state difference can be viewed as a subgoal sg achieved by executing t. Rather than directly adding (sg, t) as a key-value pair to D, we lift the trajectory and the state change by replacing the concrete objects in sg and t with variables. Note that we update D with every sampled trajectory in the BFS, even if it does not achieve the subgoal specified in the BFS search. When the low-level search receives a subgoal sg, we again lift it by replacing objects with variables and try to match it against entries in D. If D contains multiple trajectories for a given lifted subgoal, we track how often each trajectory has succeeded for that subgoal and prioritize trajectories with the most successes.
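To make this procedure concrete, below is a minimal, self-contained sketch of the lifted trajectory cache and its interaction with a brute-force BFS. The state representation (sets of ground atoms), the actions_fn/step_fn environment interface, and the canonical-variable lifting scheme are simplifying assumptions for illustration; in particular, this sketch only lifts objects that appear in the subgoal or state difference itself, and is not the released implementation.

```python
# Hypothetical sketch of the policy-guided low-level search with a lifted trajectory cache.
# The environment interface and all helper names are illustrative assumptions.
from collections import defaultdict, deque

def lift(atoms, trajectory):
    """Replace concrete objects appearing in `atoms` with canonical variables ?v0, ?v1, ...
    in both the atoms (a subgoal or state difference) and the action trajectory."""
    objects = sorted({tok for atom in atoms for tok in atom[1:]})
    mapping = {obj: f"?v{i}" for i, obj in enumerate(objects)}
    rename = lambda item: tuple(mapping.get(t, t) for t in item)
    return tuple(sorted(rename(a) for a in atoms)), tuple(rename(a) for a in trajectory)

class GuidedSearch:
    def __init__(self, actions_fn, step_fn):
        self.actions_fn = actions_fn   # state -> iterable of grounded low-level actions
        self.step_fn = step_fn         # (state, action) -> next state
        # lifted subgoal -> {lifted trajectory: success count}
        self.cache = defaultdict(lambda: defaultdict(int))

    def _record(self, before, after, trajectory):
        """Store any observed state difference, even if it is not the searched-for subgoal."""
        diff = set(after) - set(before)
        if diff and trajectory:
            key, traj = lift(diff, trajectory)
            self.cache[key].setdefault(traj, 0)

    def _ground(self, lifted_traj, subgoal):
        """Invert the canonical variable naming against the current subgoal's objects."""
        objects = sorted({tok for atom in subgoal for tok in atom[1:]})
        mapping = {f"?v{i}": obj for i, obj in enumerate(objects)}
        return tuple(tuple(mapping.get(t, t) for t in action) for action in lifted_traj)

    def solve(self, state, subgoal, max_depth=3):
        """Return a trajectory achieving `subgoal` (a set of atoms) from `state`, or None."""
        key, _ = lift(subgoal, ())
        # (1) Try cached candidates first, ordered by how often they have succeeded before.
        for lifted_traj, _ in sorted(self.cache[key].items(), key=lambda kv: -kv[1]):
            grounded, s = self._ground(lifted_traj, subgoal), state
            for action in grounded:
                s = self.step_fn(s, action)
            if subgoal <= s:
                self.cache[key][lifted_traj] += 1
                return grounded
        # (2) Otherwise fall back to brute-force BFS, populating the cache as we go.
        frontier = deque([(state, ())])
        while frontier:
            s, traj = frontier.popleft()
            if subgoal <= s:
                if traj:
                    k, t = lift(subgoal, traj)
                    self.cache[k][t] += 1
                return traj
            if len(traj) >= max_depth:
                continue
            for action in self.actions_fn(s):
                s2 = self.step_fn(s, action)
                self._record(state, s2, traj + (action,))
                frontier.append((s2, traj + (action,)))
        return None
```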
A.3 EXPERIMENTS Learned Operator Libraries on Minecraft The following shows the full PDDL domain definition including the initial provided vocabulary of symbolic environment constants and predicates, initial pick and place operators and example operator, and all ensuing learned operators combined from the Mining and Crafting benchmarks. 1 ( define ( domain crafting world v20230404 teleport ) 2 ( : requirements : s t r i p s ) 3 ( : types 6 inventory 7 object type 9 ( : constants Published as a conference paper at ICLR 2024 10 Key object type 11 Work Station object type 12 Pickaxe object type 13 Iron Ore Vein object type 14 Iron Ore object type 15 Iron Ingot object type 16 Coal Ore Vein object type 17 Coal object type 18 Gold Ore Vein object type 19 Gold Ore object type 20 Gold Ingot object type 21 Cobblestone Stash object type 22 Cobblestone object type 23 Axe object type 24 Tree object type 25 Wood object type 26 Wood Plank object type 27 Stick object type 28 Sword object type 29 Chicken object type 30 Feather object type 31 Arrow object type 32 Shears object type 33 Sheep object type 34 Wool object type 35 Bed object type 36 Boat object type 37 Sugar Cane Plant object type 38 Sugar Cane object type 39 Paper object type 40 Bowl object type 41 Potato Plant object type 42 Potato object type 43 Cooked Potato object type 44 Beetroot Crop object type 45 Beetroot object type 46 Beetroot Soup object type 48 Hypothetical object type 49 Trash object type 51 ( : predicates 52 ( t i l e u p ?t1 t i l e ?t2 t i l e ) 53 ( tile down ?t1 t i l e ?t2 t i l e ) 54 ( t i l e l e f t ?t1 t i l e ?t2 t i l e ) 55 ( t i l e r i g h t ?t1 t i l e ?t2 t i l e ) 57 ( agent at ? t t i l e ) 58 ( object at ?x object ? t t i l e ) 59 ( inventory holding ? i inventory ?x object ) 60 ( inventory empty ? i inventory ) 62 ( object of type ?x object ?ot object type ) 65 ( : action move to 66 : parameters (? t1 t i l e ?t2 t i l e ) 67 : precondition ( and ( agent at ?t1 ) ) 68 : e f f e c t ( and ( agent at ?t2 ) ( not ( agent at ?t1 ) ) ) 70 ( : action pick up 71 : parameters (? i inventory ?x object ? t t i l e ) 72 : precondition ( and ( agent at ? t ) ( object at ?x ? t ) ( inventory empty ? i ) ) Published as a conference paper at ICLR 2024 73 : e f f e c t ( and ( inventory holding ? i ?x ) ( not ( object at ?x ? t ) ) ( not ( inventory empty ? i ) ) ) 75 ( : action place down 76 : parameters (? i inventory ?x object ? t t i l e ) 77 : precondition ( and ( agent at ? t ) ( inventory holding ? i ?x ) ) 78 : e f f e c t ( and ( object at ?x ? t ) ( not ( inventory holding ? i ?x ) ) ( inventory empty ? i ) ) 80 ( : action mine iron ore 81 : parameters (? t o o l i n v inventory ? t a r g e t i n v inventory ?x object ? t o o l object ? target object ? t t i l e ) 82 : precondition ( and 83 ( agent at ? t ) 84 ( object at ?x ? t ) 85 ( object of type ?x Iron Ore Vein ) 86 ( inventory holding ? t o o l i n v ? t o o l ) 87 ( object of type ? t o o l Pickaxe ) 88 ( inventory empty ? t a r g e t i n v ) 89 ( object of type ? target Hypothetical ) 91 : e f f e c t ( and 92 ( not ( inventory empty ? t a r g e t i n v ) ) 93 ( inventory holding ? t a r g e t i n v ? target ) 94 ( not ( object of type ? target Hypothetical ) ) 95 ( object of type ? target Iron Ore ) 98 ( : action mine wood 2 99 : parameters (? t t i l e ?x object ? t o o l i n v inventory ? t o o l object ? t a r g e t i n v inventory ? target object ) 101 : precondition ( and 102 ( agent at ? t ) 103 ( object at ?x ? 
t ) 104 ( object of type ?x Tree ) 105 ( inventory holding ? t o o l i n v ? t o o l ) 106 ( object of type ? t o o l Axe ) 107 ( inventory empty ? t a r g e t i n v ) 108 ( object of type ? target Hypothetical ) 110 : e f f e c t ( and 111 ( not ( inventory empty ? t a r g e t i n v ) ) 112 ( inventory holding ? t a r g e t i n v ? target ) 113 ( not ( object of type ? target Hypothetical ) ) 114 ( object of type ? target Wood) 117 ( : action mine wool1 0 118 : parameters (? t t i l e ?x object ? t o o l i n v inventory ? t o o l object ? t a r g e t i n v inventory ? target object ) 120 : precondition ( and 121 ( agent at ? t ) 122 ( object at ?x ? t ) 123 ( object of type ?x Sheep ) 124 ( inventory holding ? t o o l i n v ? t o o l ) 125 ( object of type ? t o o l Shears ) 126 ( inventory empty ? t a r g e t i n v ) 127 ( object of type ? target Hypothetical ) 129 : e f f e c t ( and 130 ( not ( inventory empty ? t a r g e t i n v ) ) 131 ( inventory holding ? t a r g e t i n v ? target ) 132 ( not ( object of type ? target Hypothetical ) ) Published as a conference paper at ICLR 2024 133 ( object of type ? target Wool ) 136 ( : action mine potato 0 137 : parameters (? t t i l e ?x object ? t a r g e t i n v inventory ? target object ) 139 : precondition ( and 140 ( agent at ? t ) 141 ( object at ?x ? t ) 142 ( object of type ?x Potato Plant ) 143 ( inventory empty ? t a r g e t i n v ) 144 ( object of type ? target Hypothetical ) 146 : e f f e c t ( and 147 ( not ( inventory empty ? t a r g e t i n v ) ) 148 ( inventory holding ? t a r g e t i n v ? target ) 149 ( not ( object of type ? target Hypothetical ) ) 150 ( object of type ? target Potato ) 153 ( : action mine sugar cane 2 154 : parameters (? t t i l e ?x object ? t o o l i n v inventory ? t o o l object ? t a r g e t i n v inventory ? target object ) 156 : precondition ( and 157 ( agent at ? t ) 158 ( object at ?x ? t ) 159 ( object of type ?x Sugar Cane Plant ) 160 ( inventory holding ? t o o l i n v ? t o o l ) 161 ( object of type ? t o o l Axe ) 162 ( inventory empty ? t a r g e t i n v ) 163 ( object of type ? target Hypothetical ) 165 : e f f e c t ( and 166 ( not ( inventory empty ? t a r g e t i n v ) ) 167 ( inventory holding ? t a r g e t i n v ? target ) 168 ( not ( object of type ? target Hypothetical ) ) 169 ( object of type ? target Sugar Cane ) 172 ( : action mine beetroot 1 173 : parameters (? t t i l e ?x object ? t o o l i n v inventory ? t o o l object ? t a r g e t i n v inventory ? target object ) 175 : precondition ( and 176 ( agent at ? t ) 177 ( object at ?x ? t ) 178 ( object of type ?x Beetroot Crop ) 179 ( inventory holding ? t o o l i n v ? t o o l ) 180 ( inventory empty ? t a r g e t i n v ) 181 ( object of type ? target Hypothetical ) 183 : e f f e c t ( and 184 ( not ( inventory empty ? t a r g e t i n v ) ) 185 ( inventory holding ? t a r g e t i n v ? target ) 186 ( not ( object of type ? target Hypothetical ) ) 187 ( object of type ? target Beetroot ) 190 ( : action mine feather 1 191 : parameters (? t t i l e ?x object ? t o o l i n v inventory ? t o o l object ? t a r g e t i n v inventory ? target object ) 193 : precondition ( and Published as a conference paper at ICLR 2024 194 ( agent at ? t ) 195 ( object at ?x ? t ) 196 ( object of type ?x Chicken ) 197 ( inventory holding ? t o o l i n v ? t o o l ) 198 ( object of type ? t o o l Sword ) 199 ( inventory empty ? t a r g e t i n v ) 200 ( object of type ? target Hypothetical ) 202 : e f f e c t ( and 203 ( not ( inventory empty ? 
t a r g e t i n v ) ) 204 ( inventory holding ? t a r g e t i n v ? target ) 205 ( not ( object of type ? target Hypothetical ) ) 206 ( object of type ? target Feather ) 209 ( : action mine cobblestone 2 210 : parameters (? t t i l e ?x object ? t o o l i n v inventory ? t o o l object ? t a r g e t i n v inventory ? target object ) 212 : precondition ( and 213 ( agent at ? t ) 214 ( object at ?x ? t ) 215 ( object of type ?x Cobblestone Stash ) 216 ( inventory holding ? t o o l i n v ? t o o l ) 217 ( object of type ? t o o l Pickaxe ) 218 ( inventory empty ? t a r g e t i n v ) 219 ( object of type ? target Hypothetical ) 221 : e f f e c t ( and 222 ( not ( inventory empty ? t a r g e t i n v ) ) 223 ( inventory holding ? t a r g e t i n v ? target ) 224 ( not ( object of type ? target Hypothetical ) ) 225 ( object of type ? target Cobblestone ) 228 ( : action mine gold ore1 2 229 : parameters (? t t i l e ?x object ? t o o l i n v inventory ? t o o l object ? t a r g e t i n v inventory ? target object ) 231 : precondition ( and 232 ( agent at ? t ) 233 ( object at ?x ? t ) 234 ( object of type ?x Gold Ore Vein ) 235 ( inventory holding ? t o o l i n v ? t o o l ) 236 ( object of type ? t o o l Pickaxe ) 237 ( inventory empty ? t a r g e t i n v ) 238 ( object of type ? target Hypothetical ) 240 : e f f e c t ( and 241 ( not ( inventory empty ? t a r g e t i n v ) ) 242 ( inventory holding ? t a r g e t i n v ? target ) 243 ( not ( object of type ? target Hypothetical ) ) 244 ( object of type ? target Gold Ore ) 247 ( : action mine coal1 0 248 : parameters (? t t i l e ?x object ? t o o l i n v inventory ? t o o l object ? t a r g e t i n v inventory ? target object ) 250 : precondition ( and 251 ( agent at ? t ) 252 ( object at ?x ? t ) 253 ( object of type ?x Coal Ore Vein ) 254 ( inventory holding ? t o o l i n v ? t o o l ) 255 ( object of type ? t o o l Pickaxe ) Published as a conference paper at ICLR 2024 256 ( inventory empty ? t a r g e t i n v ) 257 ( object of type ? target Hypothetical ) 259 : e f f e c t ( and 260 ( not ( inventory empty ? t a r g e t i n v ) ) 261 ( inventory holding ? t a r g e t i n v ? target ) 262 ( not ( object of type ? target Hypothetical ) ) 263 ( object of type ? target Coal ) 266 ( : action mine beetroot1 0 267 : parameters (? t t i l e ?x object ? t a r g e t i n v inventory ? target object ) 269 : precondition ( and 270 ( agent at ? t ) 271 ( object at ?x ? t ) 272 ( object of type ?x Beetroot Crop ) 273 ( inventory empty ? t a r g e t i n v ) 274 ( object of type ? target Hypothetical ) 276 : e f f e c t ( and 277 ( not ( inventory empty ? t a r g e t i n v ) ) 278 ( inventory holding ? t a r g e t i n v ? target ) 279 ( not ( object of type ? target Hypothetical ) ) 280 ( object of type ? target Beetroot ) 283 ( : action craft wood plank 284 : parameters (? ingredientinv1 inventory ? t a r g e t i n v inventory ? s t a t i o n object ? ingredient1 object ? targe t object ? t t i l e ) 285 : precondition ( and 286 ( agent at ? t ) 287 ( object at ? s t a t i o n ? t ) 288 ( object of type ? s t a t i o n Work Station ) 289 ( inventory holding ? ingredientinv1 ? ingredient1 ) 290 ( object of type ? ingredient1 Wood) 291 ( inventory empty ? t a r g e t i n v ) 292 ( object of type ? target Hypothetical ) 294 : e f f e c t ( and 295 ( not ( inventory empty ? t a r g e t i n v ) ) 296 ( inventory holding ? t a r g e t i n v ? target ) 297 ( not ( object of type ? target Hypothetical ) ) 298 ( object of type ? 
target Wood Plank ) 299 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 300 ( inventory empty ? ingredientinv1 ) 301 ( not ( object of type ? ingredient1 Wood) ) 302 ( object of type ? ingredient1 Hypothetical ) 305 ( : action craft arrow 306 : parameters (? ingredientinv1 inventory ? ingredientinv2 inventory ? t a r g e t i n v inventory ? s t a t i o n object ? ingredient1 object ? ingredient2 object ? target object ? t t i l e ) 307 : precondition ( and 308 ( agent at ? t ) 309 ( object at ? s t a t i o n ? t ) 310 ( object of type ? s t a t i o n Work Station ) 311 ( inventory holding ? ingredientinv1 ? ingredient1 ) 312 ( object of type ? ingredient1 Stick ) 313 ( inventory holding ? ingredientinv2 ? ingredient2 ) 314 ( object of type ? ingredient2 Feather ) 315 ( inventory empty ? t a r g e t i n v ) 316 ( object of type ? target Hypothetical ) Published as a conference paper at ICLR 2024 318 : e f f e c t ( and 319 ( not ( inventory empty ? t a r g e t i n v ) ) 320 ( inventory holding ? t a r g e t i n v ? target ) 321 ( not ( object of type ? target Hypothetical ) ) 322 ( object of type ? target Arrow ) 323 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 324 ( inventory empty ? ingredientinv1 ) 325 ( not ( object of type ? ingredient1 Stick ) ) 326 ( object of type ? ingredient1 Hypothetical ) 327 ( not ( inventory holding ? ingredientinv2 ? ingredient2 ) ) 328 ( inventory empty ? ingredientinv2 ) 329 ( not ( object of type ? ingredient2 Feather ) ) 330 ( object of type ? ingredient2 Hypothetical ) 333 ( : action craft beetroot soup 0 334 : parameters (? t t i l e ? s t a t i o n object ? ingredientinv1 inventory ? ingredient1 object ? ingredientinv2 inventory ? ingredient2 object ? t a r g e t i n v inventory ? target object ) 336 : precondition ( and 337 ( agent at ? t ) 338 ( object at ? s t a t i o n ? t ) 339 ( object of type ? s t a t i o n Work Station ) 340 ( inventory holding ? ingredientinv1 ? ingredient1 ) 341 ( object of type ? ingredient1 Beetroot ) 342 ( inventory holding ? ingredientinv2 ? ingredient2 ) 343 ( object of type ? ingredient2 Bowl ) 344 ( inventory empty ? t a r g e t i n v ) 345 ( object of type ? target Hypothetical ) 347 : e f f e c t ( and 348 ( not ( inventory empty ? t a r g e t i n v ) ) 349 ( inventory holding ? t a r g e t i n v ? target ) 350 ( not ( object of type ? target Hypothetical ) ) 351 ( object of type ? target Beetroot Soup ) 352 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 353 ( inventory empty ? ingredientinv1 ) 354 ( not ( object of type ? ingredient1 Beetroot ) ) 355 ( object of type ? ingredient1 Hypothetical ) 356 ( not ( inventory holding ? ingredientinv2 ? ingredient2 ) ) 357 ( inventory empty ? ingredientinv2 ) 358 ( not ( object of type ? ingredient2 Bowl ) ) 359 ( object of type ? ingredient2 Hypothetical ) 362 ( : action craft paper 0 363 : parameters (? t t i l e ? s t a t i o n object ? ingredientinv1 inventory ? ingredient1 object ? t a r g e t i n v inventory ? target object ) 365 : precondition ( and 366 ( agent at ? t ) 367 ( object at ? s t a t i o n ? t ) 368 ( object of type ? s t a t i o n Work Station ) 369 ( inventory holding ? ingredientinv1 ? ingredient1 ) 370 ( object of type ? ingredient1 Sugar Cane ) 371 ( inventory empty ? t a r g e t i n v ) 372 ( object of type ? target Hypothetical ) 374 : e f f e c t ( and 375 ( not ( inventory empty ? t a r g e t i n v ) ) 376 ( inventory holding ? t a r g e t i n v ? target ) 377 ( not ( object of type ? 
target Hypothetical ) ) 378 ( object of type ? target Paper ) Published as a conference paper at ICLR 2024 379 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 380 ( inventory empty ? ingredientinv1 ) 381 ( not ( object of type ? ingredient1 Sugar Cane ) ) 382 ( object of type ? ingredient1 Hypothetical ) 385 ( : action craft shears2 2 386 : parameters (? t t i l e ? s t a t i o n object ? ingredientinv1 inventory ? ingredient1 object ? t a r g e t i n v inventory ? target object ) 388 : precondition ( and 389 ( agent at ? t ) 390 ( object at ? s t a t i o n ? t ) 391 ( object of type ? s t a t i o n Work Station ) 392 ( inventory holding ? ingredientinv1 ? ingredient1 ) 393 ( object of type ? ingredient1 Gold Ingot ) 394 ( inventory empty ? t a r g e t i n v ) 395 ( object of type ? target Hypothetical ) 397 : e f f e c t ( and 398 ( not ( inventory empty ? t a r g e t i n v ) ) 399 ( inventory holding ? t a r g e t i n v ? target ) 400 ( not ( object of type ? target Hypothetical ) ) 401 ( object of type ? target Shears ) 402 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 403 ( inventory empty ? ingredientinv1 ) 404 ( not ( object of type ? ingredient1 Gold Ingot ) ) 405 ( object of type ? ingredient1 Hypothetical ) 408 ( : action craft bowl 1 409 : parameters (? t t i l e ? s t a t i o n object ? ingredientinv1 inventory ? ingredient1 object ? ingredientinv2 inventory ? ingredient2 object ? t a r g e t i n v inventory ? target object ) 411 : precondition ( and 412 ( agent at ? t ) 413 ( object at ? s t a t i o n ? t ) 414 ( object of type ? s t a t i o n Work Station ) 415 ( inventory holding ? ingredientinv1 ? ingredient1 ) 416 ( object of type ? ingredient1 Wood Plank ) 417 ( inventory holding ? ingredientinv2 ? ingredient2 ) 418 ( object of type ? ingredient2 Wood Plank ) 419 ( inventory empty ? t a r g e t i n v ) 420 ( object of type ? target Hypothetical ) 422 : e f f e c t ( and 423 ( not ( inventory empty ? t a r g e t i n v ) ) 424 ( inventory holding ? t a r g e t i n v ? target ) 425 ( not ( object of type ? target Hypothetical ) ) 426 ( object of type ? target Bowl ) 427 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 428 ( inventory empty ? ingredientinv1 ) 429 ( not ( object of type ? ingredient1 Wood Plank ) ) 430 ( object of type ? ingredient1 Hypothetical ) 431 ( not ( inventory holding ? ingredientinv2 ? ingredient2 ) ) 432 ( inventory empty ? ingredientinv2 ) 433 ( not ( object of type ? ingredient2 Wood Plank ) ) 434 ( object of type ? ingredient2 Hypothetical ) 437 ( : action craft boat 0 438 : parameters (? t t i l e ? s t a t i o n object ? i n g r e d i e n t i n v inventory ? ingredient object ? t a r g e t i n v inventory ? target object ) Published as a conference paper at ICLR 2024 440 : precondition ( and 441 ( agent at ? t ) 442 ( object at ? s t a t i o n ? t ) 443 ( object of type ? s t a t i o n Work Station ) 444 ( inventory holding ? i n g r e d i e n t i n v ? ingredient ) 445 ( object of type ? ingredient Wood Plank ) 446 ( inventory empty ? t a r g e t i n v ) 447 ( object of type ? target Hypothetical ) 449 : e f f e c t ( and 450 ( not ( inventory empty ? t a r g e t i n v ) ) 451 ( inventory holding ? t a r g e t i n v ? target ) 452 ( not ( object of type ? target Hypothetical ) ) 453 ( object of type ? target Boat ) 454 ( not ( inventory holding ? i n g r e d i e n t i n v ? ingredient ) ) 455 ( inventory empty ? i n g r e d i e n t i n v ) 456 ( not ( object of type ? 
ingredient Wood Plank ) ) 457 ( object of type ? ingredient Hypothetical ) 460 ( : action craft cooked potato 1 461 : parameters (? t t i l e ? s t a t i o n object ? ingredientinv1 inventory ? ingredient1 object ? ingredientinv2 inventory ? ingredient2 object ? t a r g e t i n v inventory ? target object ) 463 : precondition ( and 464 ( agent at ? t ) 465 ( object at ? s t a t i o n ? t ) 466 ( object of type ? s t a t i o n Work Station ) 467 ( inventory holding ? ingredientinv1 ? ingredient1 ) 468 ( object of type ? ingredient1 Potato ) 469 ( inventory holding ? ingredientinv2 ? ingredient2 ) 470 ( object of type ? ingredient2 Coal ) 471 ( inventory empty ? t a r g e t i n v ) 472 ( object of type ? target Hypothetical ) 474 : e f f e c t ( and 475 ( not ( inventory empty ? t a r g e t i n v ) ) 476 ( inventory holding ? t a r g e t i n v ? target ) 477 ( not ( object of type ? target Hypothetical ) ) 478 ( object of type ? target Cooked Potato ) 479 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 480 ( inventory empty ? ingredientinv1 ) 481 ( not ( object of type ? ingredient1 Potato ) ) 482 ( object of type ? ingredient1 Hypothetical ) 483 ( not ( inventory holding ? ingredientinv2 ? ingredient2 ) ) 484 ( inventory empty ? ingredientinv2 ) 485 ( not ( object of type ? ingredient2 Coal ) ) 486 ( object of type ? ingredient2 Hypothetical ) 489 ( : action craft gold ingot 1 490 : parameters (? t t i l e ? s t a t i o n object ? ingredientinv1 inventory ? ingredient1 object ? ingredientinv2 inventory ? ingredient2 object ? t a r g e t i n v inventory ? target object ) 492 : precondition ( and 493 ( agent at ? t ) 494 ( object at ? s t a t i o n ? t ) 495 ( object of type ? s t a t i o n Work Station ) 496 ( inventory holding ? ingredientinv1 ? ingredient1 ) 497 ( object of type ? ingredient1 Gold Ore ) 498 ( inventory holding ? ingredientinv2 ? ingredient2 ) 499 ( object of type ? ingredient2 Coal ) 500 ( inventory empty ? t a r g e t i n v ) Published as a conference paper at ICLR 2024 501 ( object of type ? target Hypothetical ) 503 : e f f e c t ( and 504 ( not ( inventory empty ? t a r g e t i n v ) ) 505 ( inventory holding ? t a r g e t i n v ? target ) 506 ( not ( object of type ? target Hypothetical ) ) 507 ( object of type ? target Gold Ingot ) 508 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 509 ( inventory empty ? ingredientinv1 ) 510 ( not ( object of type ? ingredient1 Gold Ore ) ) 511 ( object of type ? ingredient1 Hypothetical ) 512 ( not ( inventory holding ? ingredientinv2 ? ingredient2 ) ) 513 ( inventory empty ? ingredientinv2 ) 514 ( not ( object of type ? ingredient2 Coal ) ) 515 ( object of type ? ingredient2 Hypothetical ) 518 ( : action c r a f t s t i c k 0 519 : parameters (? t t i l e ? s t a t i o n object ? ingredientinv1 inventory ? ingredient1 object ? t a r g e t i n v inventory ? target object ) 521 : precondition ( and 522 ( agent at ? t ) 523 ( object at ? s t a t i o n ? t ) 524 ( object of type ? s t a t i o n Work Station ) 525 ( inventory holding ? ingredientinv1 ? ingredient1 ) 526 ( object of type ? ingredient1 Wood Plank ) 527 ( inventory empty ? t a r g e t i n v ) 528 ( object of type ? target Hypothetical ) 530 : e f f e c t ( and 531 ( not ( inventory empty ? t a r g e t i n v ) ) 532 ( inventory holding ? t a r g e t i n v ? target ) 533 ( not ( object of type ? target Hypothetical ) ) 534 ( object of type ? target Stick ) 535 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 536 ( inventory empty ? 
ingredientinv1 ) 537 ( not ( object of type ? ingredient1 Wood Plank ) ) 538 ( object of type ? ingredient1 Hypothetical ) 541 ( : action craft sword 0 542 : parameters (? t t i l e ? s t a t i o n object ? ingredientinv1 inventory ? ingredient1 object ? ingredientinv2 inventory ? ingredient2 object ? t a r g e t i n v inventory ? target object ) 544 : precondition ( and 545 ( agent at ? t ) 546 ( object at ? s t a t i o n ? t ) 547 ( object of type ? s t a t i o n Work Station ) 548 ( inventory holding ? ingredientinv1 ? ingredient1 ) 549 ( object of type ? ingredient1 Stick ) 550 ( inventory holding ? ingredientinv2 ? ingredient2 ) 551 ( object of type ? ingredient2 Iron Ingot ) 552 ( inventory empty ? t a r g e t i n v ) 553 ( object of type ? target Hypothetical ) 555 : e f f e c t ( and 556 ( not ( inventory empty ? t a r g e t i n v ) ) 557 ( inventory holding ? t a r g e t i n v ? target ) 558 ( not ( object of type ? target Hypothetical ) ) 559 ( object of type ? target Sword ) 560 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 561 ( inventory empty ? ingredientinv1 ) 562 ( not ( object of type ? ingredient1 Stick ) ) Published as a conference paper at ICLR 2024 563 ( object of type ? ingredient1 Hypothetical ) 564 ( not ( inventory holding ? ingredientinv2 ? ingredient2 ) ) 565 ( inventory empty ? ingredientinv2 ) 566 ( not ( object of type ? ingredient2 Iron Ingot ) ) 567 ( object of type ? ingredient2 Hypothetical ) 570 ( : action craft bed 1 571 : parameters (? t t i l e ? s t a t i o n object ? ingredientinv1 inventory ? ingredient1 object ? ingredientinv2 inventory ? ingredient2 object ? t a r g e t i n v inventory ? target object ) 573 : precondition ( and 574 ( agent at ? t ) 575 ( object at ? s t a t i o n ? t ) 576 ( object of type ? s t a t i o n Work Station ) 577 ( inventory holding ? ingredientinv1 ? ingredient1 ) 578 ( object of type ? ingredient1 Wood Plank ) 579 ( inventory holding ? ingredientinv2 ? ingredient2 ) 580 ( object of type ? ingredient2 Wool ) 581 ( inventory empty ? t a r g e t i n v ) 582 ( object of type ? target Hypothetical ) 584 : e f f e c t ( and 585 ( not ( inventory empty ? t a r g e t i n v ) ) 586 ( inventory holding ? t a r g e t i n v ? target ) 587 ( not ( object of type ? target Hypothetical ) ) 588 ( object of type ? target Bed) 589 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 590 ( inventory empty ? ingredientinv1 ) 591 ( not ( object of type ? ingredient1 Wood Plank ) ) 592 ( object of type ? ingredient1 Hypothetical ) 593 ( not ( inventory holding ? ingredientinv2 ? ingredient2 ) ) 594 ( inventory empty ? ingredientinv2 ) 595 ( not ( object of type ? ingredient2 Wool ) ) 596 ( object of type ? ingredient2 Hypothetical ) 599 ( : action c r a f t i r o n i n g o t 2 600 : parameters (? t t i l e ? s t a t i o n object ? ingredientinv1 inventory ? ingredient1 object ? ingredientinv2 inventory ? ingredient2 object ? t a r g e t i n v inventory ? target object ) 602 : precondition ( and 603 ( agent at ? t ) 604 ( object at ? s t a t i o n ? t ) 605 ( object of type ? s t a t i o n Work Station ) 606 ( inventory holding ? ingredientinv1 ? ingredient1 ) 607 ( object of type ? ingredient1 Iron Ore ) 608 ( inventory holding ? ingredientinv2 ? ingredient2 ) 609 ( object of type ? ingredient2 Coal ) 610 ( inventory empty ? t a r g e t i n v ) 611 ( object of type ? target Hypothetical ) 613 : e f f e c t ( and 614 ( not ( inventory empty ? t a r g e t i n v ) ) 615 ( inventory holding ? t a r g e t i n v ? 
target ) 616 ( not ( object of type ? target Hypothetical ) ) 617 ( object of type ? target Iron Ingot ) 618 ( not ( inventory holding ? ingredientinv1 ? ingredient1 ) ) 619 ( inventory empty ? ingredientinv1 ) 620 ( not ( object of type ? ingredient1 Iron Ore ) ) 621 ( object of type ? ingredient1 Hypothetical ) 622 ( not ( inventory holding ? ingredientinv2 ? ingredient2 ) ) 623 ( inventory empty ? ingredientinv2 ) Published as a conference paper at ICLR 2024 624 ( not ( object of type ? ingredient2 Coal ) ) 625 ( object of type ? ingredient2 Hypothetical ) Learned Operator Libraries on ALFRED The following shows the full PDDL domain definition including the initial provided vocabulary of symbolic environment constants and predicates, initial pick and place operators, and all ensuing learned operators. 1 ( define ( domain a l f r e d ) 2 ( : requirements : adl 4 ( : types 5 agent loc at io n receptacle object rtype otype 7 ( : constants 8 Candle Type otype 9 Shower Glass Type otype 10 CDType otype 11 Tomato Type otype 12 Mirror Type otype 13 Scrub Brush Type otype 14 Mug Type otype 15 Toaster Type otype 16 Painting Type otype 17 Cell Phone Type otype 18 Ladle Type otype 19 Bread Type otype 20 Pot Type otype 21 Book Type otype 22 Tennis Racket Type otype 23 Butter Knife Type otype 24 Shower Door Type otype 25 Key Chain Type otype 26 Baseball Bat Type otype 27 Egg Type otype 28 Pen Type otype 29 Fork Type otype 30 Vase Type otype 31 Cloth Type otype 32 Window Type otype 33 Pencil Type otype 34 Statue Type otype 35 Light Switch Type otype 36 Watch Type otype 37 Spatula Type otype 38 Paper Towel Roll Type otype 39 Floor Lamp Type otype 40 Kettle Type otype 41 Soap Bottle Type otype 42 Boots Type otype 43 Towel Type otype 44 Pillow Type otype 45 Alarm Clock Type otype 46 Potato Type otype 47 Chair Type otype 48 Plunger Type otype 49 Spray Bottle Type otype 50 Hand Towel Type otype 51 Bathtub Type otype 52 Remote Control Type otype 53 Pepper Shaker Type otype 54 Plate Type otype Published as a conference paper at ICLR 2024 55 Basket Ball Type otype 56 Desk Lamp Type otype 57 Footstool Type otype 58 Glassbottle Type otype 59 Paper Towel Type otype 60 Credit Card Type otype 61 Pan Type otype 62 Toilet Paper Type otype 63 Salt Shaker Type otype 64 Poster Type otype 65 Toilet Paper Roll Type otype 66 Lettuce Type otype 67 Wine Bottle Type otype 68 Knife Type otype 69 Laundry Hamper Lid Type otype 70 Spoon Type otype 71 Tissue Box Type otype 72 Bowl Type otype 73 Box Type otype 74 Soap Bar Type otype 75 House Plant Type otype 76 Newspaper Type otype 77 Cup Type otype 78 Dish Sponge Type otype 79 Laptop Type otype 80 Television Type otype 81 Stove Knob Type otype 82 Curtains Type otype 83 Blinds Type otype 84 Teddy Bear Type otype 85 Apple Type otype 86 Watering Can Type otype 87 Sink Type otype 89 Arm Chair Type rtype 90 Bed Type rtype 91 Bathtub Basin Type rtype 92 Dresser Type rtype 93 Safe Type rtype 94 Dining Table Type rtype 95 Sofa Type rtype 96 Hand Towel Holder Type rtype 97 Stove Burner Type rtype 98 Cart Type rtype 99 Desk Type rtype 100 Coffee Machine Type rtype 101 Microwave Type rtype 102 Toilet Type rtype 103 Counter Top Type rtype 104 Garbage Can Type rtype 105 Coffee Table Type rtype 106 Cabinet Type rtype 107 Sink Basin Type rtype 108 Ottoman Type rtype 109 Toilet Paper Hanger Type rtype 110 Towel Holder Type rtype 111 Fridge Type rtype 112 Drawer Type rtype 113 Side Table Type rtype 114 Shelf Type rtype 115 Laundry Hamper Type rtype 118 ; ; Predicates defined on t h i s domain . 
Note the types f o r each predicate . Published as a conference paper at ICLR 2024 119 ( : predicates 120 ( at Location ?a agent ? l l o c a t i o n ) 121 ( receptacle At Location ? r receptacle ? l l o c a t i o n ) 122 ( object At Location ?o object ? l l o c a t i o n ) 123 ( in Receptacle ?o object ? r receptacle ) 124 ( receptacle Type ? r receptacle ? t rtype ) 125 ( object Type ?o object ? t otype ) 126 ( holds ?a agent ?o object ) 127 ( holds Any ?a agent ) 128 ( holds Any Receptacle Object ?a agent ) 130 ( openable ? r receptacle ) 131 ( opened ? r receptacle ) 132 ( is Clean ?o object ) 133 ( cleanable ?o object ) 134 ( is Hot ?o object ) 135 ( heatable ?o object ) 136 ( is Cool ?o object ) 137 ( coolable ?o object ) 138 ( toggleable ?o object ) 139 ( is Toggled ?o object ) 140 ( s l i c e a b l e ?o object ) 141 ( i s S li ce d ?o object ) 143 ( : action Pickup Object Not In Receptacle 144 : parameters (?a agent ? l l o c a t i o n ?o object ) 145 : precondition ( and 146 ( at Location ?a ? l ) 147 ( object At Location ?o ? l ) 148 ( not ( holds Any ?a ) ) 149 ( f o r a l l 150 (? re receptacle ) 151 ( not ( in Receptacle ?o ?re ) ) 154 : e f f e c t ( and 155 ( not ( object At Location ?o ? l ) ) 156 ( holds ?a ?o ) 157 ( holds Any ?a ) 161 ( : action Put Object In Receptacle 162 : parameters (?a agent ? l l o c a t i o n ?ot otype ?o object ? r receptacle ) 163 : precondition ( and 164 ( at Location ?a ? l ) 165 ( receptacle At Location ? r ? l ) 166 ( object Type ?o ?ot ) 167 ( holds ?a ?o ) 168 ( not ( holds Any Receptacle Object ?a ) ) 170 : e f f e c t ( and 171 ( in Receptacle ?o ? r ) 172 ( not ( holds ?a ?o ) ) 173 ( not ( holds Any ?a ) ) 174 ( object At Location ?o ? l ) 178 ( : action Pickup Object In Receptacle 179 : parameters (?a agent ? l l o c a t i o n ?o object ? r receptacle ) 180 : precondition ( and 181 ( at Location ?a ? l ) Published as a conference paper at ICLR 2024 182 ( object At Location ?o ? l ) 183 ( in Receptacle ?o ? r ) 184 ( not ( holds Any ?a ) ) 186 : e f f e c t ( and 187 ( not ( object At Location ?o ? l ) ) 188 ( not ( in Receptacle ?o ? r ) ) 189 ( holds ?a ?o ) 190 ( holds Any ?a ) 194 ( : action Rinse Object 2 195 : parameters (? toolreceptacle receptacle ?a agent ? l lo ca ti on ?o object ) 197 : precondition ( and 198 ( receptacle Type ? toolreceptacle Sink Basin Type ) 199 ( at Location ?a ? l ) 200 ( receptacle At Location ? toolreceptacle ? l ) 201 ( object At Location ?o ? l ) 202 ( cleanable ?o ) 204 : e f f e c t ( and 205 ( is Clean ?o ) 209 ( : action Turn On Object 2 210 : parameters (?a agent ? l l o c a t i o n ?o object ) 212 : precondition ( and 213 ( at Location ?a ? l ) 214 ( object At Location ?o ? l ) 215 ( toggleable ?o ) 217 : e f f e c t ( and 218 ( is Toggled ?o ) 222 ( : action Cool Object 0 223 : parameters (? toolreceptacle receptacle ?a agent ? l lo ca ti on ?o object ) 225 : precondition ( and 226 ( receptacle Type ? toolreceptacle Fridge Type ) 227 ( at Location ?a ? l ) 228 ( receptacle At Location ? toolreceptacle ? l ) 229 ( holds ?a ?o ) 231 : e f f e c t ( and 232 ( is Cool ?o ) 235 ( : action Slice Object 1 236 : parameters (? t o o l o b j e c t object ?a agent ? l l o c a t i o n ?o object ) 238 : precondition ( and 239 ( object Type ? t o o l o b j e c t Butter Knife Type ) 240 ( at Location ?a ? l ) 241 ( object At Location ?o ? l ) 242 ( s l i c e a b l e ?o ) 243 ( holds ?a ? 
t o o l o b j e c t ) Published as a conference paper at ICLR 2024 245 : e f f e c t ( and 246 ( i s S li ce d ?o ) 249 ( : action Slice Object 0 250 : parameters (? t o o l o b j e c t object ?a agent ? l l o c a t i o n ?o object ) 252 : precondition ( and 253 ( object Type ? t o o l o b j e c t Knife Type ) 254 ( at Location ?a ? l ) 255 ( object At Location ?o ? l ) 256 ( s l i c e a b l e ?o ) 257 ( holds ?a ? t o o l o b j e c t ) 259 : e f f e c t ( and 260 ( i s S li ce d ?o ) 263 ( : action Microwave Object 0 264 : parameters (? toolreceptacle receptacle ?a agent ? l lo ca ti on ?o object ) 266 : precondition ( and 267 ( receptacle Type ? toolreceptacle Microwave Type ) 268 ( at Location ?a ? l ) 269 ( receptacle At Location ? toolreceptacle ? l ) 270 ( holds ?a ?o ) 272 : e f f e c t ( and 273 ( is Hot ?o )