# Agent Planning with World Knowledge Model

Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

Zhejiang University; National University of Singapore, NUS-NCS Joint Lab; Alibaba Group; Zhejiang Key Laboratory of Big Data Intelligent Computing
{shuofei,zhangningyu}@zju.edu.cn

Equal Contribution. Corresponding Author. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the real physical world. Imitating humans' mental world knowledge model, which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper we introduce a parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop a WKM, providing prior task knowledge to guide global planning and dynamic state knowledge to assist local planning. Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our method can achieve superior performance compared to various strong baselines. Other interesting findings include: 1) our instance-level task knowledge can generalize better to unseen tasks, 2) a weak WKM can guide strong agent model planning, and 3) unified WKM training has promising potential for further development. The code is available at https://github.com/zjunlp/WKM.

Figure 1: Traditional agent planning vs. agent planning with world knowledge model.

1 Introduction

The remarkable advances in Large Language Models (LLMs) have driven rapid progress on various natural language processing tasks [25, 16, 28, 47, 60, 33]. Recently, multiple attempts that directly exploit LLMs as agent models to address physical-world planning tasks have demonstrated promising achievements [54, 57, 56, 34, 38, 64, 44]. However, as most state-of-the-art LLMs are autoregressive models trained with next-token prediction, they lack the ability to essentially understand the real world, leading them to generate hallucinatory actions and perform brainless trial-and-error in the environment, as shown in Figure 1(a). In contrast to LLMs, humans possess a mental knowledge model of the physical world [1, 18, 17, 30]. When facing a specific task, they first briefly rehearse the entire process in mind using their rich prior knowledge rather than performing mindless actions. We call this kind of knowledge global task knowledge (a.k.a. environment/task commonsense). In addition, during the task procedure, the mental world knowledge model constantly maintains a kind of local state knowledge, representing humans' cognition of the current world state. For example, imagine you are in a room and your task is to put a clean egg in the microwave.
The task knowledge may refer to "The egg is most likely in the fridge ... The workflows are: 1) locate and take the egg; 2) clean the egg using the sinkbasin ...". The state knowledge possibly refers to "My task is to ... I have found and taken the egg ... Next I should ...". The absence of world knowledge can lead to blind trial-and-error in the early planning stages when environmental information is limited. Conversely, in later stages when information is redundant, it can easily result in a confused cognition of the current world state and lead to hallucinatory actions.

The process by which humans handle planning tasks reminds us to develop a parametric World Knowledge Model (WKM) to facilitate agent planning. As humans typically acquire knowledge from expertise and practical experience, we build the WKM based on knowledge learned from both expert and explored trajectories. Specifically, we first steer the agent model to synthesize task knowledge from the comparison between expert and sampled trajectories. Then we prompt it to summarize state knowledge for each planning step from expert trajectories and combine the previous and next actions to build a state knowledge base. Lastly, we integrate the generated knowledge into expert trajectories and train a WKM. The agent model needs to be retrained to adapt to the task knowledge. Note that our agent and knowledge models are both trained with LoRA [12], sharing the same backbone. During the planning phase, we use the WKM to provide global prior task knowledge and maintain local dynamic state knowledge for the agent model, as shown in Figure 1(b). The task knowledge will be concatenated in natural language form following the specific task to guide the agent model's trial-and-error. At each planning step, to prevent the occurrence of hallucinatory actions, we utilize the generated state knowledge as the query to conduct kNN retrieval from the pre-built state knowledge base. We then use the constraints from the previous action, the probabilities of the retrieved next actions, and the probabilities from the agent model to make a weighted prediction for the next action.

We evaluate our method on three real-world simulated planning tasks: ALFWorld [41], WebShop [53], and ScienceWorld [50], with three state-of-the-art open-source LLMs: Mistral-7B [16], Gemma-7B [24], and Llama-3-8B [25]. Empirical results demonstrate that our method achieves superior performance compared to various strong baselines on both seen and unseen tasks. Moreover, further analytical results show that 1) our WKM can effectively reduce blind trial-and-error and hallucinatory actions, 2) our model-generated instance-level knowledge can generalize better to unseen tasks, 3) weak-guide-strong is feasible, 4) a multi-task unified WKM possesses strong potential, and 5) explicit state knowledge can hurt the performance of agent planning.

2 Preliminaries

We mainly focus on interactive tasks with partial observations from environments. Following the task formulation in [44], the problem can be viewed as a Partially Observable Markov Decision Process (POMDP): (U, S, A, O, T). The instruction space U defines the task and its corresponding regulations. S is the state space, O is the observation space, and A is the action space. T : S × A → S defines the transition function, which we assume to be given by the environments. Note that U, A, and O are subspaces of the natural language space in language agent scenarios.
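To ground this formulation, here is a minimal Python sketch (not from the paper) of how the POMDP components might be typed for a language agent; the `LanguageEnv` interface, its method names, and the `Trajectory` container are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Protocol

# In language agent scenarios, instructions, actions, and observations are all
# natural-language strings (subspaces of the natural language space).
Instruction = str
Action = str
Observation = str


class LanguageEnv(Protocol):
    """Illustrative environment interface: the transition function T stays hidden
    behind step(), which returns the next observation for an executed action."""

    def reset(self, instruction: Instruction) -> Observation: ...
    def step(self, action: Action) -> tuple[Observation, bool]: ...  # (o_t, done)
    def reward(self) -> float: ...  # task completion rate r(u, tau) in [0, 1]


@dataclass
class Trajectory:
    """Historical trajectory h_t = (u, a_0, o_0, a_1, o_1, ..., a_t, o_t)."""
    instruction: Instruction
    actions: list[Action] = field(default_factory=list)
    observations: list[Observation] = field(default_factory=list)
```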
Based on the above, the historical trajectory h_t, consisting of a list of actions and observations up to time t, can be represented as:

$$h_t = (u, a_0, o_0, a_1, o_1, \ldots, a_t, o_t), \qquad (1)$$

where u ∈ U is the task instruction and a ∈ A, o ∈ O are the actions and observations. Given a task, the language agent with parameters θ serves as the policy model π_θ responsible for generating the action a_{t+1} based on h_t at each time step t + 1:

$$a_{t+1} \sim \pi_\theta(\cdot \mid h_t). \qquad (2)$$

Specifically, a_0 ~ π_θ(·|u) is generated according to the task instruction u. The whole trajectory τ concludes when the task is completed or exceeds the maximum time steps. Then the generation of the entire trajectory with time length n can be modeled as:

$$\pi_\theta(\tau \mid u) = \prod_{t=0}^{n-1} \pi_\theta(a_{t+1} \mid h_t)\, \pi_\theta(a_0 \mid u). \qquad (3)$$

Ultimately, the final reward r(u, τ) ∈ [0, 1] representing the task completion rate is calculated. Note that we follow a ReAct-style [54] trajectory that includes rationales before each action. We use a to represent the action with rationales for convenience.
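As a concrete illustration of Eqs. (1)-(3), the sketch below rolls out one episode with a policy standing in for π_θ, accumulating the history h_t and collecting the final reward; it reuses the illustrative `LanguageEnv` and `Trajectory` types from the sketch above, and the `Policy` alias is likewise an assumption.

```python
from typing import Callable

# Stand-in for the policy pi_theta: maps the history h_t to the next action,
# i.e. a_{t+1} ~ pi_theta(. | h_t) from Eq. (2). Assumed, not from the paper.
Policy = Callable[[Trajectory], Action]


def rollout(env: LanguageEnv, policy: Policy, instruction: Instruction,
            max_steps: int = 40) -> tuple[Trajectory, float]:
    """Run one episode and return the trajectory tau and the reward r(u, tau)."""
    traj = Trajectory(instruction=instruction)
    env.reset(instruction)
    for _ in range(max_steps):             # stop when done or max time steps exceeded
        action = policy(traj)              # a_0 ~ pi_theta(. | u) when the history is empty
        observation, done = env.step(action)
        traj.actions.append(action)
        traj.observations.append(observation)
        if done:
            break
    return traj, env.reward()              # r(u, tau) in [0, 1]
```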
Figure 2: Overview of our WKM. We train a world knowledge model on the knowledge synthesized by the agent model itself from both expert and explored trajectories, providing prior task knowledge to guide global planning and dynamic state knowledge to assist local planning.

3 World Knowledge Model

The world knowledge model serves as humans' mental cognition of the physical environment, more intricate than the word knowledge model that LLM-powered agent models are trained to be [61, 10, 52, 13]. Our "world" here refers to the simulated environment of the task. Based on the static environment of the task and the dynamic changes during interaction with the agent, we define world knowledge as a combination of prior global knowledge and dynamic local knowledge, corresponding respectively to the blind trial-and-error problem in global planning and the hallucinatory action issue in local planning of traditional agent models.

To attain precise and efficient agent planning, we develop a parametric WKM to simulate the mental WKM of humans. As shown in Figure 2, we steer the agent model to self-synthesize task knowledge from the comparison of expert and sampled trajectories (§3.1). Then we prompt the agent model to self-summarize state knowledge based on historical behavior and construct a state knowledge base (§3.2). The generated knowledge is integrated into the expert trajectories for training the WKM. After the training process (§3.3), we augment the agent model with the world knowledge model to achieve effective and accurate planning (§3.4).

3.1 Task Knowledge Synthesis

The task knowledge serves as the prior knowledge to guide the agent model's global planning and prevent it from falling into blind trial-and-error.

Experienced Agent Exploration. We primarily acquire task knowledge through the comparison of preference trajectories (chosen vs. rejected). In order to improve the quality of rejected trajectories and obtain more targeted task knowledge, we employ an experienced agent for exploration. Firstly, we train a vanilla language model with expert trajectories from the training set to obtain an experienced agent. Subsequently, the experienced agent explores the training set tasks again to generate rejected trajectories. Our purpose is to extract superior task knowledge that cannot be acquired solely through supervised fine-tuning on chosen trajectories, thus further boosting the agent's capabilities.

Self Knowledge Synthesis. With the expert trajectories as the chosen ones and the trajectories sampled from the experienced agent as the rejected ones, we prompt the agent model itself to synthesize the task knowledge. Supposing K is the task knowledge space:

$$\kappa \sim \pi_\theta(\cdot \mid \rho_{\text{TaskKnow}}, u, \tau_w, \tau_l), \qquad (4)$$

where κ ∈ K is the task knowledge, ρ_TaskKnow stands for the prompt instructing the task knowledge extraction, and τ_w, τ_l are the chosen and rejected trajectories respectively. Note that given the same task u, τ_w and τ_l always satisfy r(u, τ_w) = 1 ≥ r(u, τ_l). Even when r(u, τ_w) = r(u, τ_l), we still consider trajectories sampled from the experienced agent as rejected ones. This is because expert trajectories often have shorter step lengths, enabling the agent to learn more knowledge of efficient planning. For detailed prompts of task knowledge synthesis, please refer to Appendix I.1.

3.2 State Knowledge Summarization

The state knowledge serves as the dynamic knowledge to constrain the agent model's local planning and prevent it from generating hallucinatory actions. We prompt the agent model to self-summarize state knowledge at each planning step based on the expert trajectories to guarantee quality. For detailed prompts of state knowledge summarization, please refer to Appendix I.2.
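To make this self-summarization step concrete, the sketch below queries the agent model for state knowledge at step t of an expert trajectory; the prompt wording and the `generate`/`summarize_state` helpers are hypothetical stand-ins (the actual prompt is given in Appendix I.2), and the `Trajectory` type is reused from the earlier sketch.

```python
# Illustrative stand-in for the state-knowledge prompt rho_StateKnow; the real
# prompt used by the authors is given in Appendix I.2.
STATE_KNOW_PROMPT = (
    "Given the task and the interaction history so far, summarize the current world "
    "state: what the task is, what has been done, and what should be done next.\n"
    "Task: {instruction}\nHistory:\n{history}\nState knowledge:"
)


def summarize_state(generate, traj: Trajectory, t: int) -> str:
    """Self-summarize state knowledge s_t from the expert history up to step t,
    i.e. s_t ~ pi_theta(. | rho_StateKnow, h_t). `generate` is any text-completion
    callable backed by the agent model (an assumption, not a fixed API)."""
    history = "\n".join(
        f"Agent: {a}\nObs: {o}"
        for a, o in zip(traj.actions[: t + 1], traj.observations[: t + 1])
    )
    return generate(STATE_KNOW_PROMPT.format(instruction=traj.instruction,
                                             history=history))
```

Iterating this over every step of every expert trajectory yields the state knowledge that feeds the (a_t, s_t, a_{t+1}) triplets of the state knowledge base described next.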
Supposing the prompt used to summarize state knowledge is ρ_StateKnow and the state knowledge s ∈ S is a part of the state space S, the generation of state knowledge at time t can be represented as:

$$s_t \sim \pi_\theta(\cdot \mid \rho_{\text{StateKnow}}, h_t). \qquad (5)$$

State Knowledge Base Construction. To avoid confusion caused by excessive additional information, instead of explicitly concatenating the state knowledge to the context, we construct a state knowledge base for retrieval (we analyze in §4.3 how explicit state knowledge may affect the performance of the agent model). We combine the state knowledge s_t with the previous action a_t and next action a_{t+1} from the expert trajectory to form an action-state-action triplet (a_t, s_t, a_{t+1}). After iterating through all expert trajectories, we obtain a state knowledge base $B = \{(s, a_{\text{pre}}, a_{\text{next}})^{(i)}\}_{i=1}^{|B|}$, where a_pre = a_t, a_next = a_{t+1}, and |B| is the size of the state knowledge base.

3.3 Model Training

We integrate the generated world knowledge into expert trajectories and train a world knowledge model. The agent model needs to be re-trained to adapt to the incorporation of task knowledge. Note that our agent model and knowledge model are both trained with LoRA, sharing the same backbone. We list examples of training data for both the agent model and the WKM in Appendix E.

Agent Model Training. Given the expert trajectory dataset $D = \{(u, \kappa, \tau_w)^{(i)}\}_{i=1}^{|D|}$ with task knowledge κ generated in §3.1, we train the agent model to follow the task knowledge to generate actions. In an auto-regressive manner, the loss of the agent model can be formulated as:

$$L_{\text{agent}}(\pi_\theta) = -\mathbb{E}_{\tau_w \sim D}\left[\pi_\theta(\tau_w \mid u, \kappa)\right]. \qquad (6)$$

Supposing X = (x_1, x_2, ..., x_{|X|}) is the token sequence of the trajectory τ_w, we have:

$$\pi_\theta(\tau_w \mid u, \kappa) = \sum_{j=1}^{|X|} \mathbb{1}(x_j \in A)\, \log \pi_\theta(x_j \mid u, \kappa, x_{<j})