# Agentic Large Language Models, a Survey

ASKE PLAAT, Leiden University, Netherlands
MAX VAN DUIJN, Leiden University, Netherlands
NIKI VAN STEIN, Leiden University, Netherlands
MIKE PREUSS, Leiden University, Netherlands
PETER VAN DER PUTTEN, Leiden University & AI Lab, Pegasystems, Netherlands
KEES JOOST BATENBURG, Leiden University, Netherlands

Background: There is great interest in agentic LLMs, large language models that act as agents. Objectives: We review the growing body of work in this area and provide a research agenda. Methods: Agentic LLMs are LLMs that (1) reason, (2) act, and (3) interact. We organize the literature according to these three categories. Results: The research in the first category focuses on reasoning, reflection, and retrieval, aiming to improve decision making; the second category focuses on action models, robots, and tools, aiming for agents that act as useful assistants; the third category focuses on multi-agent systems, aiming for collaborative task solving and simulating interaction to study emergent social behavior. We find that works mutually benefit from results in other categories: retrieval enables tool use, reflection improves multi-agent collaboration, and reasoning benefits all categories. Conclusions: We discuss applications of agentic LLMs and provide an agenda for further research. Important applications are in medical diagnosis, logistics, and financial market analysis. Meanwhile, self-reflective agents playing roles and interacting with one another augment the process of scientific research itself. Further, agentic LLMs provide a solution for the problem of LLMs running out of training data: inference-time behavior generates new training states, such that LLMs can keep learning without needing ever larger datasets.
We note that there is risk associated with LLM assistants taking action in the real world: safety, liability, and security are open problems, while agentic LLMs are also likely to benefit society.

JAIR Track: Surveys
JAIR Associate Editor: Kai-Wei Chang
JAIR Reference Format: Aske Plaat, Max van Duijn, Niki van Stein, Mike Preuss, Peter van der Putten, and Kees Joost Batenburg. 2025. Agentic Large Language Models, a Survey. Journal of Artificial Intelligence Research 84, Article 29 (December 2025), 74 pages. doi: 10.1613/jair.1.18675

Corresponding Author.
Authors' Contact Information: Aske Plaat, orcid: 0000-0001-7202-3322, aske.plaat@gmail.com, Leiden University, Leiden, Netherlands; Max van Duijn, orcid: 0000-0003-0798-9598, m.j.van.duijn@liacs.leidenuniv.nl, Leiden University, Leiden, Netherlands; Niki van Stein, orcid: 0000-0002-0013-7969, n.van.stein@liacs.leidenuniv.nl, Leiden University, Leiden, Netherlands; Mike Preuss, orcid: 0000-0003-4681-1346, m.preuss@liacs.leidenuniv.nl, Leiden University, Leiden, Netherlands; Peter van der Putten, orcid: 0000-0002-6507-6896, p.w.h.van.der.putten@liacs.leidenuniv.nl, Leiden University & AI Lab, Pegasystems, Leiden, Netherlands; Kees Joost Batenburg, orcid: 0000-0002-3154-8576, k.j.batenburg@liacs.leidenuniv.nl, Leiden University, Leiden, Netherlands.

This work is licensed under a Creative Commons Attribution 4.0 International License. 2025 Copyright held by the owner/author(s). doi: 10.1613/jair.1.18675
Journal of Artificial Intelligence Research, Vol. 84, Article 29. Publication date: December 2025.

1 Introduction

The strength of the language abilities of LLMs has taken the world by storm. Recent work has extended their abilities with reasoning, information retrieval, and interaction tools. As a result, LLMs are now increasingly able to act as agents in the world [Shen, 2024, Qin et al., 2023].
This ability has increased the relevance of LLMs to society and science. Agentic LLMs are being used to assist in medicine, logistics, finance, and other application areas. Their ability to self-reflect, interact, and play roles enables new types of research, including large-scale social science simulations. We survey the growing body of literature on agentic LLMs, which we define as large language models that (1) reason, (2) act, and (3) interact. We organize this article accordingly. Agentic LLMs are also relevant in the acquisition of new training data for artificial intelligence (AI). Traditionally, LLMs have been trained on large datasets. Recently, however, it has become harder to scale and improve datasets further, and training performance is reportedly plateauing, at high energy cost [Sutskever, 2024]. By interacting with the world, agents generate new empirical data (see Figure 1). This data can be used for additional training (pretraining or finetuning) or to enhance performance at inference time, provided there is adequate grounding through human or automated validation and filtering [Subramaniam et al., 2025]. An example of how LLMs can be trained on their own actions is given by Vision-Language-Action models, which update their weights according to robotic action-feedback sequences [Black et al., 2024, Chiang et al., 2024, Yang et al., 2025b]. Thus, in addition to enabling useful applications, a second driver of interest in agentic LLMs is the opportunity to generate more training data.¹ Agentic LLMs depend on progress in natural language processing, reasoning models, tool integration, reinforcement learning, agent-based modeling, and social science. At the confluence of these fields much exciting research has emerged.
This paper makes the following contributions: We survey the field of agentic LLMs and its underlying technologies, distinguishing (1) efforts to provide LLMs with reasoning, reflection, and retrieval, aiming to improve decision making; (2) tool and robot integration, which has allowed the creation of LLM assistants that act in high-impact fields such as medicine and finance; (3) interaction of agentic LLMs, involving multi-agent simulations for role-playing and open-ended agent societies, to study emergent behaviors such as cooperative problem solving, social coordination, and norms. We show how the three categories (reasoning, acting, interacting) complement each other, and how they help to generate additional data for pretraining, finetuning, and augmenting inference-time behavior, as shown in Figure 1. We formulate a research agenda with promising directions for future work (Section 5, Table 4).

1.1 Agentic LLMs: Reasoning, Acting, Interacting

Models predict; agents reason, act, and interact. To do so, they must have the ability to find new information, reflect, make decisions, and communicate. Additionally, where models are passive in the sense that they provide output only in response to specific input, agents have a degree of autonomy. From the fields of natural language processing, robotics, reinforcement learning, and multi-agent systems, an active research community has emerged that is creating ways to augment LLMs with these abilities and to evaluate how this affects their behavior. Agency is a central concept in artificial intelligence [Russell and Norvig, 2016]. Agency is about identity and control, and about the capability to act on one's goals or will [Epstein and Axtell, 1996, Gilbert, 2019, Barker and Jane, 2016].
¹Cognitive science teaches us that humans become more intelligent through interaction with the world and with other humans (we learn new behaviors and ideas from others) [Brody, 1999, Agüera y Arcas, 2025]. Societies of agents allow agentic LLMs to become more intelligent through interaction, as we will see in Section 4.

Fig. 1. Virtuous Cycle connecting the three categories of the Agentic LLM taxonomy: reasoning, acting, and interacting (in red, corresponding to Sections 2, 3, and 4). Concepts that influence a category are shown in green (Subsections). Feedback loops, where reasoning, acting, and interacting generate new data for pretraining and finetuning LLMs, are also indicated. (Feedback loops may destabilize learning processes.)

Fig. 2. LLM Agent as Assistant [Sypherd and Belle, 2024].

Agents are endowed with decision-making capabilities: they sense changes in the environment, communicate, and act upon those changes [Wooldridge, 1999]; see also Figure 2. Agents have been studied for a long time, and occur in many fields of AI. From the definition of agents interacting with the environment, different approaches focus on specific aspects of agents and agent behavior. In symbolic reasoning [Harman, 1984] and game theory [Von Neumann and Morgenstern, 2007, Owen, 2013], the topic of study is decision making by rational agents. The field of multi-agent systems studies intelligent systems that emerge from the interaction between different agents, human and/or artificial [Ferber and Weiss, 1999, Steels, 2003]. In machine learning, the field of reinforcement learning studies how an agent can learn from interacting with an environment [Sutton and Barto, 2018]. In this context, agents are systems that would adapt their policy if their actions influenced the world in a different way [Kenton et al., 2023].
In autonomous systems and robotics, agents act in order to achieve a goal [Liu and Wu, 2018]. Connectionism studies the emergence of intelligent behavior by embodied agents [Brooks, 1990, Medler, 1998]. Evolutionary algorithms [Yu and Gen, 2010, Bäck, 1996, Preuss, 2015] use nature-inspired agent-based computation in order to achieve robust and flexible optimization, of which the ant-colony optimization algorithm [Dorigo et al., 2007] is a well-known example. For the purpose of this survey, we build upon the definitions from these traditions. We define agentic LLMs as: Agents that receive input in natural language from their environment, reason to make decisions, and take autonomous actions in affecting their environment, to achieve specific goals. We stress that: agents may receive input and reason in natural or formal language; agents may plan to break down complex goals into smaller steps; agents may reflect on their own actions; agents may use tools to retrieve new information or to effect their actions; agents may build an internal model of the world; agents may have an internal structure that consists of multiple agents; agents may assist humans in achieving their goals; agents may interact in a society with humans and other agents; agents may create their own training data. The categories (reasoning, acting, interacting) build upon each other: the technology that has been developed by the reasoning approaches (category 1) is used for increasingly intelligent acting by assistants. The interactive abilities of the assistants (category 2) enable social simulation experiments. The outcome of assistant actions (category 2) and of these social experiments (category 3) can be used for data augmentation (category 1), to finetune LLMs (which can improve the accuracy of reasoning LLMs, etc.).
This virtuous cycle is depicted in Figure 1, and attracts interest from LLM researchers to agentic LLM methods [Sutskever, 2024, Guo et al., 2025, Du et al., 2025, Lambert et al., 2024]. The categories also correspond to fields in artificial intelligence that have a long research tradition across symbolic AI, robotics/autonomous systems, and connectionism/multi-agent modeling, respectively. Agentic LLMs are thus both a recent development and a continuation of decades of research. This is reflected in our discussion below.

1.2 Literature Selection

The field of agentic LLMs is rich and active. This survey can only cover the current status of the field. We hope to provide clarity about the main approaches, to ease the entry of new researchers into the field. The papers were initially selected with a Google Scholar search on "agentic LLM". From there, we used a snowballing approach to discover work that was cited but not yet included in our initial set. We have only selected LLM-based approaches, excluding multi-agent work without LLMs. In addition, some works on LLMs that do not involve agentic augmentations are included to provide background. Related surveys on agents and LLMs are starting to appear. Li [2024] reviews retrieval and tool use in agentic LLMs. Wang et al. [2024b] focus on autonomy and agent construction. Gao et al. [2024] also provide an extensive overview, and focus on multi-agent modeling and simulation. Xi et al. [2023] again focus on the construction of interactive agents, using a more explanatory anthropomorphic approach of perception, brain, and action. An extensive general survey of LLMs is Zhao et al. [2023]; a slightly smaller one is Minaee et al. [2024]; an earlier survey is Min et al. [2023]. Yin et al. [2023] review works on multimodal LLMs. We focus on recent work; most of the works are from 2024, some are from 2023, and some from 2025.
We focus on relevant and substantive works; many appear in major conferences and journals such as NeurIPS, ACL, EMNLP, ICLR, ICML, Science, and Nature. Given the recency, some of the works are unrefereed preprints that were under submission at the time of inclusion. Here we filter for works from reputable academic and industrial research labs.

1.3 LLM Training Pipeline

We provide a brief background of the typical training pipeline of LLMs, introducing terms used throughout the survey. Originally, language models used recurrent architectures such as LSTMs [Hochreiter and Schmidhuber, 1997] to embed semantic relations between token structures, allowing limited connections between tokens. The transformer architecture is an effective implementation of the attention mechanism [Vaswani et al., 2017], allowing efficient connections between arbitrary tokens, improving performance greatly. Encoder transformer models, such as the BERT family [Devlin et al., 2018], learn embeddings that are suitable for text understanding and classification. Decoder transformer models, such as the GPT family [Brown et al., 2020], are trained by masking for text completion and instruction following, and are suitable for text generation. Data, Benchmarks, and Performance. LLMs are trained on large datasets [Radford et al., 2019, Wei et al., 2022a]. Performance on benchmarks testing formal linguistic competence is high [Warstadt et al., 2019], and so is accuracy on functional competence or natural language understanding tasks (GLUE, SQuAD, XSum) [Wang et al., 2018, 2019, Rajpurkar et al., 2016, Narayan et al., 2018], translation [Kocmi et al., 2022, Papineni et al., 2002, Sennrich et al., 2015], and question answering [Roberts et al., 2020].
Even in creative domains such as poetry and music composition, LLMs have made some progress [Zhang and Eger, 2024, Yuan et al., 2024b, Xing et al., 2025].

Models. Popular LLMs are OpenAI's ChatGPT series [Achiam et al., 2023, Ouyang et al., 2022], Meta's LLaMA family [Touvron et al., 2023], Anthropic's Claude family [Anthropic, 2024], Google's PaLM [Chowdhery et al., 2023] and Gemini [Anil et al., 2023], Qwen [Yang et al., 2025a], and the open-source models BLOOM [Le Scao et al., 2023], Pythia [Biderman et al., 2023], OLMo [Groeneveld et al., 2024], and many others.

Training Pipeline. LLMs are constructed using an elaborate pipeline with different training phases [Radford et al., 2019, Minaee et al., 2024]. We briefly describe the phases.

1. Acquire a large, general, unlabeled text corpus [Brown et al., 2020].
2. Pretrain a transformer model on the corpus. This step yields a generalist natural language transformer model. The pretraining is done using a self-supervised attention approach [Vaswani et al., 2017] on the unlabeled dataset (text corpus).
3. Finetune the general model for a specific (narrow) task using a supervised approach on a labeled dataset consisting of prompts and answers (supervised finetuning, SFT) [Wei et al., 2022a, Minaee et al., 2024]. This task can be, for example, translation from one language to another, or question answering in a certain domain, such as medicine.
4. Instruction-tune for improved instruction following. This is a form of supervised finetuning [Ouyang et al., 2022] to improve the ability to answer prompts.
5. Align the finetuned model with user expectations (preference alignment). The goal of this step is to improve the model to give socially acceptable answers, such as the prevention of hate speech.
Popular methods are reinforcement learning from human feedback (RLHF) [Ouyang et al., 2022], direct preference optimization (DPO) [Rafailov et al., 2024], and reinforcement learning with verifiable rewards (RLVR) [Lambert et al., 2024].
6. Optimize training to improve cost-effectiveness, for example with low-rank adaptation (LoRA) [Hu et al., 2021], mixed-precision training [Micikevicius et al., 2017], or knowledge distillation [Xu et al., 2024b, Gu et al., 2023].
7. Infer using natural language prompts (instructions). This phase, inference, is where we can finally use the fruits of our training efforts. Prompting is the preferred way of using LLMs.

In LLMs whose size is beyond hundreds of billions of parameters, a new learning method emerges: in-context learning [Brown et al., 2020, Wei et al., 2022a]. This method provides a prompt that contains a small number of examples together with an instruction; it is a form of few-shot learning. However, no parameters of the model are changed by in-context learning; it takes place at inference time [Dong et al., 2022, Brown et al., 2020]. Note that this pipeline is an example of a typical approach. Current pipelines are elaborate, and training is costly. Innovations to training pipelines are the topic of current research; see, for example, Guo et al. [2025], Du et al. [2025], Lambert et al. [2024].

1.4 The Need for Agentic LLMs

While the performance of LLMs continues to amaze in many domains, four challenges have emerged in the recent literature.

1. Prompt engineering. Originally, LLMs were trained as straight decoders, to be used with instruction prompts. The prompts contain context and instructions, and the model replies. The user interacts directly with the model, and writes the prompts themselves.
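Such manually written prompts often rely on in-context (few-shot) examples: the "learning" happens entirely inside the prompt, with no parameter updates. A minimal sketch of assembling such a prompt (the translation task and examples are purely illustrative, not taken from any cited work):

```python
# Sketch of a few-shot in-context prompt: demonstrations followed by the
# query; the model is expected to complete the final line. No model
# parameters are changed by this form of "learning".

def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (input, output) demonstrations."""
    lines = ["Translate English to French."]  # instruction
    for en, fr in examples:
        lines.append(f"English: {en}\nFrench: {fr}")  # demonstrations
    lines.append(f"English: {query}\nFrench:")  # model completes this
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(
    [("cheese", "fromage"), ("house", "maison")],
    "bread",
)
print(prompt)
```

In practice the resulting string would be sent to an LLM API; here we only construct it, since prompt wording, example choice, and ordering are exactly the knobs that manual prompt engineering turns.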
LLMs turned out to be quite sensitive to small differences in prompt formulation. When an answer is not satisfactory, the user has to remember the history of the interaction and improve the prompt. This is known as prompt engineering. With basic LLMs, prompt improvement is a tedious, manual task.

2. Hallucination. When LLMs provide answers that look good but are factually incorrect, they are said to hallucinate. Hallucination is a major problem of LLMs. It is caused, in part, by a lack of grounding. LLMs are trained to predict one of the statistically most probable next tokens, based on the training corpus. Since models are aligned to human preferences during finetuning, they often provide answers that look good by these standards while not adhering to other criteria, such as factuality. Various methods have been developed to mitigate hallucination, such as detecting uncertainty through self-reflection on the model's own answers, and with mechanistic interpretability methods [Conmy et al., 2023]. We review papers that discuss these methods in this survey.

3. Reasoning. Another well-reported challenge for LLMs is (mathematical) reasoning [Cobbe et al., 2021, Plaat et al., 2025]. LLMs used to be quite bad at solving math word problems (such as: "Annie has one pie that she cuts into twelve pieces. She eats one third of the pieces. How many pieces does she have left?"). Reasoning challenges have given rise to step-by-step problem solving methods, such as reported by Wei et al. [2022b], both implicit and with explicit (neurosymbolic) prompt optimization methods [Yao et al., 2024]. This too we discuss in the next section.

4. Training data. LLMs are only as smart as the data that was available at training time allows. When datasets no longer improve, pretraining and finetuning can no longer improve language models, and other learning methods are needed [Sutskever, 2024].
Any event that happened after training, or any information available only in special databases, is not in the model [Lewis et al., 2020]. These four challenges have led to the introduction of inference-time in-context learning, retrieval, and interaction methods. The methods involve automated prompt improvement, retrieval of extra data, usage of tools, interaction with other LLMs, self-verification, and simulations. As we will see in this survey, these works have yielded more intelligent, active, and interactive LLMs: agentic LLMs.

1.5 Taxonomy

In a short amount of time, a literature on agentic LLMs has appeared, which we categorize based on the above challenges. The agentic LLMs in this survey have (1) reasoning capabilities, (2) an interface to the outside world in order to act, and (3) a social environment with other agents with which to interact. The agentic LLMs in some of the discussed works have all three elements. We also review papers concerning LLMs that do not have all three elements, in order to include relevant technologies and applications. A picture of the taxonomy is shown in Figure 3. The subcategories are explained below. As we noted before, the three categories in our taxonomy come from three different backgrounds. To be intelligent, LLMs are enhanced with reasoning, combining deep learning with the symbolic AI tradition [Yu et al., 2024a, Li et al., 2025b]. To be active, LLMs are enhanced with tools that can act in the world (including robots that plan to move in the world). To be social, LLMs are placed in interactive settings with other agents. They rely partially on capacities already present in traditional LLMs, such as basic theory of mind abilities and understanding of game theory and social dilemmas. Agentic LLMs learn to interact better by adapting their intelligence. We use this taxonomy in the remainder of the survey to organize the agentic LLM literature; see Figure 3. The three main categories can be found in Sections 2, 3, and 4.
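As a minimal illustration of how such inference-time methods combine reasoning with tool use, a ReAct-style loop [Yao et al., 2022] interleaves model output with tool calls until a final answer is produced. The sketch below replaces the LLM with a hypothetical stub (`fake_llm`), since the point is the control loop, not the model:

```python
# ReAct-style agent loop sketch. The LLM is a hypothetical stand-in that
# first requests a tool action and then emits a final answer; a real
# system would call an LLM API at each step.

def calculator(expr: str) -> str:
    # Toy tool: evaluate an arithmetic expression (no builtins exposed).
    return str(eval(expr, {"__builtins__": {}}))

def fake_llm(transcript: str) -> str:
    # Stand-in for an LLM call: request a calculation, then answer.
    if "Observation:" not in transcript:
        return "Action: calculator[12 * 12]"
    return "Final Answer: 144"

def agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = fake_llm(transcript)
        if reply.startswith("Final Answer:"):
            return reply.removeprefix("Final Answer:").strip()
        # Parse the tool call, run the tool, feed the observation back.
        tool_arg = reply.split("calculator[", 1)[1].rstrip("]")
        transcript += f"\n{reply}\nObservation: {calculator(tool_arg)}"
    return "no answer"

print(agent("What is 12 times 12?"))  # -> 144
```

The loop structure (model output, tool observation, model output) is the common core behind many of the acting approaches surveyed in Section 3.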
The subtopics are described in the corresponding Subsections.

Fig. 3. Agentic LLM Taxonomy of Reasoning, Acting, and Interacting, with their sub-categories (see the Subsections).

Reasoning (Table 1). Earlier progress in multi-step reasoning LLMs and retrieval augmentation has enabled much of the current development in agentic LLMs (category 1). In this category, the aim is to address challenges in solving math word problems and in providing up-to-date answers to queries. The contributions to intelligent LLMs came out of the need to improve multi-step reasoning of basic LLMs (sub-category a). In reasoning LLMs, methods from planning and search are used to let the LLM follow a step-by-step reasoning path. More elaborate search algorithms allow automated prompt improvement and self-reflection (sub-category b). Finally, certain questions can only be answered by inference-time data retrieval (sub-category c). The focus is on the individual improvement of the intelligent LLM agent.

Acting (Table 2). In category 2, acting, the aim is to perform actions in the world, to assist the user, as shown in Figure 2. Sub-category a discusses world models and multimodal vision-language-action models. These are models for robots to learn which actions to take to achieve a task in a certain visual setting. In sub-category b, we review how tools can be used by LLMs through an application programming interface (API), and how robots can plan actions. Sub-category c discusses how these tools can be used in assistants that support users in tasks such as making travel arrangements, making medical suggestions, or giving trading advice.
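Tool use through an API typically works by having the host program expose a registry of callable functions and dispatch the structured calls the model emits. A minimal sketch (the tool, its name, and the call format are hypothetical, not from any specific framework):

```python
# Sketch of LLM tool calling via a registry. We assume the LLM has been
# instructed to emit a JSON call such as
# {"tool": "get_weather", "args": {"city": "Leiden"}}; the host program
# dispatches it and would return the result to the model.
import json

TOOLS = {}

def tool(fn):
    """Register a function as a callable tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city: str) -> str:
    # Stub in place of a real weather API call.
    return f"Sunny in {city}"

def dispatch(llm_output: str) -> str:
    """Parse the model's structured tool call and execute it."""
    call = json.loads(llm_output)
    return TOOLS[call["tool"]](**call["args"])

print(dispatch('{"tool": "get_weather", "args": {"city": "Leiden"}}'))
```

Real systems add schema descriptions of each tool to the prompt, validate arguments, and loop the tool result back into the model context, but the registry-and-dispatch pattern is the same.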
Interacting (Table 3). Category 3 is about interaction in multi-agent simulations. Here, in sub-category a, we first study basic social capabilities of LLMs on which interactions can build. Second, in sub-category b, we study how LLM agents can work together using simulations where they are assigned specific roles. Third, in sub-category c, we study the emergence of collective phenomena in open-ended interactions, such as social coordination via conventions and norms. Here the focus is on the emergent interaction level of the agent society. Multi-agent simulation with LLMs is becoming an active field for studying questions from the social sciences that previous generations of agent-based models were unable to address.

Taxonomy. A picture of the taxonomy is shown in Figure 3. In addition, the surveyed papers are listed in three tables, Tables 1, 2, and 3. The tables show the name of the approach, the reasoning technology that it uses (category 1), the application area in which it assists (category 2), and the type of social interaction that it has (category 3). Most approaches focus on one of these aspects, and their main category is shown in the table.

Table 1.
Taxonomy of Agentic LLM Approaches. Category 1: Reasoning

| Approach | Reasoning Technology | Acting/Assistant | Interacting |
|---|---|---|---|
| Chain of Thought [Wei et al., 2022b] | Step-by-step prompts | Math word [Cobbe et al., 2021] | Benchmark |
| Zero-shot CoT [Kojima et al., 2022] | "Let's think step-by-step" | Math word [Cobbe et al., 2021] | Benchmark |
| Self Consistency [Wang et al., 2022] | Ensemble | Math word [Cobbe et al., 2021] | Benchmark |
| Tree of Thoughts [Yao et al., 2024] | Depth-first-search prompts | Game of 24 | Benchmark |
| Implicit Planning [Schultz et al., 2024] | Train SoS [Gandhi et al., 2024] | Chess, Hex | Benchmark |
| Progress Hint Prompt [Zheng et al., 2023] | Self-Reflection | Math word [Cobbe et al., 2021] | Benchmark |
| Self Refine [Madaan et al., 2023] | Self-Reflection | Dialogue Response | Benchmark |
| ReAct [Yao et al., 2022] | Reinforcement Learning | Decision Making | Benchmark |
| Reflexion [Shinn et al., 2024] | Self-Reflection/Reinf. Learning | Decision Making | Benchmark |
| Self Discover [Zhou et al., 2024a] | Self-Reflection | BIG-Bench Hard [Suzgun et al., 2022] | Benchmark |
| Buffer of Thoughts [Yang et al., 2024c] | Self-Reflection | Math word [Cobbe et al., 2021] | Benchmark |
| Memory Coordination [Zhang et al., 2023b] | Self-Reflection | LLM Personalization | Benchmark |
| Adaptive Retrieval [Asai et al., 2023] | Adaptive Retrieval | Question Answering | Benchmark |
| Retrieval Augmentation [Lewis et al., 2020] | Retrieval Augmentation | Question Answering | Benchmark |
| MathPrompter [Imani et al., 2023] | Python Interpreter | Math problems | Benchmark |
| Program Aided Lang [Gao et al., 2023b] | Python Interpreter | Math word problems | Benchmark |
| Self Debugging [Chen et al., 2024b] | Debugger | Code generation | Benchmark |
| FunSearch [Romera-Paredes et al., 2024] | Genetic Algorithm | Algorithm Generation | Benchmark |
| Planning Language [Bohnet et al., 2024] | Planner/PDDL | Blocksworld | Benchmark |
| Self-taught Reasoner [Zelikman et al., 2022] | Reason-augmented finetuning | Math Word | Benchmark |
| DeepSeek R1 [Guo et al., 2025] | Intrinsic Reasoning | Math Word | Benchmark |

2 Reasoning

We will now turn to the first
category, reasoning. We discuss reasoning-related inference-time improvements to LLMs, aimed at improving decision making. Intelligent decision making can be achieved by retrieving more and better information, and by improving LLM performance on reasoning problems. First we review methods that prompt an LLM to take a step-by-step approach to solving these problems. Next, we review methods that improve these prompts through self-reflection. Finally, we review retrieval augmentation methods. Note that both retrieval augmentation and self-reflection can be used to generate new training data. Retrieval augmentation can be used to retrieve relevant information beyond the originally available training dataset. Self-reflection uses methods related to planning that imagine plausible futures, which can be useful for training LLMs. Originally, the methods that we review in this section were developed with the goal of improving the predictive modeling performance of the LLM. For the field of agentic LLMs, the reasoning techniques are used as an important foundation for agents that act in the world and interact with each other. We will start with a survey of the individual approaches. Reasoning methods are the foundation of agentic LLMs. In Section 2.4 we will discuss two essential approaches, Chain of Thought and Self-Reflection, in more detail.

2.1 Multi-Step Reasoning

We will start by reviewing works that apply reasoning methods to improve decision making, inspired by Chain of Thought's step-by-step approach [Wei et al., 2022b].

Table 2.
Taxonomy of Agentic LLM Approaches. Category 2: Action

| Approach | Reasoning Technology | Acting/Assistant | Interacting |
|---|---|---|---|
| WorldGPT [Ge et al., 2024] | Multimodal | World Model | WorldNet real-life scenarios |
| WorldCoder [Tang et al., 2024] | Code model | World Code Model | Sokoban, MiniGrid, AlfWorld |
| Task-planning [Guan et al., 2023] | PDDL World model | Task finetuning | AlfWorld |
| CLIP [Radford et al., 2021] | Multimodal | Vision Language | Benchmark |
| EmbodiedBERT [Suglia et al., 2021] | Multimodal | Vision Language | ALFRED [Shridhar et al., 2020] |
| E2WM [Xiang et al., 2024] | Embodied World Model | MCTS + World Model | Question Answering |
| RT-2 [Brohan et al., 2023] | Vision Language Action | VLA | Embodied reasoning tasks |
| LM-nav [Shah et al., 2023] | Action traces | VLA | Topological navigation |
| Mobility VLA [Chiang et al., 2024] | Long-context demonstration | VLA | Navigation MINT |
| 𝜋0 [Black et al., 2024] | Flow Matching | VLA | Laundry folding, Table cleaning |
| SayCan [Ahn et al., 2023] | Grounded Actions | Value function for LLM | Manipulation, Kitchen |
| Inner Monologue [Huang et al., 2022] | Grounded Actions | Affordance in prompt | Manipulation, Kitchen |
| Lang Guided Expl [Dorbala et al., 2023] | Generic Class Labels | Vision/Language | L-ZSON |
| Automatic Tool Chain [Shi et al., 2024] | Grounded reasoning | Tool behavior | Tool Flow |
| Toolformer [Schick et al., 2023] | Call APIs | Tool calling | Calculator, Search engine |
| ToolBench [Qin et al., 2023] | 16,464 APIs | Tool calling | API framework |
| EasyTool [Yuan et al., 2024c] | Tool documentation | Tool calling | ToolBench |
| ToolAlpaca [Tang et al., 2023] | 400 APIs | Tool calling | Benchmark |
| ToolQA [Zhuang et al., 2023] | APIs | Tool calling | Question answering |
| Gorilla [Patil et al., 2023] | Generate APIs | Tool calling | APIBench |
| AgentHarm [Andriushchenko et al., 2025] | Adversarial Agents | Robust LLMs | Adversarial Benchmark |
| Rainbow Teaming [Samvelyan et al., 2024] | MAP-Elites | Robust LLMs | Adversarial Benchmark |
| Assistant GPT [Neszlényi et al., 2024] | Websearch, Open API, Voice | Tools, Planner, Memory | Education/Corporate |
| Meeting Assist [Cabrero-Daniel et al., 2024] | LLM | Meetings | Scrum |
| MUCA [Mao et al., 2024] | Topic generator | What/When/How | Group Conversations |
| Task Scheduling [Bastola et al., 2023] | LLM | Task Scheduling | Collaborative Group |
| Thinking Assistant [Park and Kulkarni, 2023] | LLM | Human reflection | Human |
| LLaSa [Zhang et al., 2024c] | Finetuned LLM, CoT, RAG | E-commerce assistant | ShopBench |
| MMLU [Jin et al., 2024b] | Shopping skills | Finetuning | Benchmark |
| Question suggestion [Vedula et al., 2024] | LLM | Product metadata | Shopping |
| ChatShop [Chen et al., 2024a] | Finetuned LLM | Information-seeking | Shopping |
| Flight Booking Assistant [Manasa et al., 2024] | Finetuned LLM, RAG | Flight Booking | Booking process |
| Medical Note generation [Yuan et al., 2024a] | Finetuned LLM | Medical Scribe | Medical note taking |
| Medical Reports [Sudarshan et al., 2024] | Reflexion [Shinn et al., 2024] | 21st Century Cures Act | Health records |
| MedCo [Wei et al., 2024a] | Multiagent Copilot | Medical education | Education |
| Benchmark [Qiao et al., 2024] | RAG | Agentic Workflow | Benchmark |
| Wind Hazards [Tabrizian et al., 2024] | LLM | Flight Planning | Flight Operations |
| Flight Dispatch [Wassim et al., 2024] | LLM | Drone as a Service | Flight Operations |
| FinAgent [Zhang et al., 2024b] | Multimodal, RAG | Analysis modules | Stock data |
| FinRobot [Yang et al., 2024a] | Finetuned LLM | Document Analysis | Financial Documents |
| FinMem [Yu et al., 2024b] | Multi-agent | Trading Agent Assistant | Market data |
| Trading Agents [Xiao et al., 2024] | Multi-agent | Collaborative dynamics | Simulation |
| AI Scientist [Lu et al., 2024a] | Chain of Thought | Reflexion [Shinn et al., 2024] | Scientific experiment |
| SWE-Agent [Yang et al., 2024b] | Codex, ReAct [Yao et al., 2022] | Agent-Computer Interface | |
| MLGym [Nathani et al., 2025] | Chain of Thought | SWE-Agent | Gym [Brockman et al., 2016] |

2.1.1 Chain of Thought: Step-by-Step. Originally, LLMs performed poorly on math word problems, even on simple grade school problems (GSM8K, Cobbe et al. [2021]).
LLMs are trained to produce an immediate answer to a prompt, and they typically take shortcuts that may look good but are semantically wrong.² To correctly solve complex reasoning problems, humans are taught to use a step-by-step approach. If a reasoning problem is better solved by following a step-by-step approach, then a sensible idea is to prompt the model to follow suitable intermediate steps, answer those, and work towards the final answer. Wei et al. [2022b] showed in their Chain of Thought paper that, with the right prompt, the LLM follows such intermediate steps. When the LLM is prompted to first rephrase information from the question as intermediate reasoning steps in its answer, the LLM performed much better than when it was prompted to answer a math problem directly, without reproducing
²What is the correct answer to: "This is as simple as two minus two is ..."? The phrase "as simple as two plus two is four" may well have a higher frequency in a training corpus than the phrase "as simple as two minus two is zero."
Table 3.
Taxonomy of Agentic LLM Approaches, Category 3: Interaction
Approach | Reasoning Technology | Acting/Assistant | Interacting
Iterated Prisoner's [Fontana et al., 2024] | LLM | Cooperate/Defect | Social Dilemma
Social Games [Akata et al., 2025] | LLM | Cooperate/Defect | Battle of the Sexes, etc.
GTBench [Duan et al., 2024] | CoT/ToT | Cooperate/Defect | Kuhn poker, liar's dice, nim
GAMA-Bench [Huang et al., 2024a] | LLM | Cooperate/Defect | El Farol, Public Goods, etc.
Theory of Mind [van Duijn et al., 2023] | LLM | Theory of Mind | Stories
NegotiationArena [Bianchi et al., 2024] | LLM | Dialogue | Negotiation
Alympics [Mao et al., 2023] | LLM | Multi-agent sandbox | Water-allocation challenge
MAgIC [Xu et al., 2024a] | LLM | Social interaction | Social Deduction games
AucArena [Chen et al., 2023a] | LLM | Bidding/Goal | Auction
EgoSocialArena [Hou et al., 2024] | LLM | Social Intelligence | Cognitive, Situational, Behavioral
Donor Game [Vallinder and Hughes, 2024] | LLM | Reciprocity | Social skill game
Social Simulacra [Park et al., 2022] | LLM | Society | Simulation of Society, Party
ReConcile [Chen et al., 2023b] | LLM | Consensus | Round Table Conference
Mindstorms [Zhuge et al., 2023] | LLM | Society of Mind [Minsky, 1988] | Multi-agent problem solving
AutoGen [Wu et al., 2023] | LLM infrastructure | Agent-agent conversation | Framework
AgentVerse [Chen et al., 2023c] | LLM | Group dynamics | Collaborative problem solving
ChatEval [Chan et al., 2023] | LLM | Collaborative problem solving | Text summarization
CAMEL [Li et al., 2023a] | LLM infrastructure | Multi-agent interaction, Roleplaying | Framework
OASIS [Yang et al., 2024e] | Lightweight LLM | Social media simulator | Reddit/X
WebArena [Zhou et al., 2023a] | Web benchmark | E-commerce, forum, content | Benchmark
Balrog [Paglieri et al., 2024] | RL games | Interaction | Benchmark
BenchAgents [Butt et al., 2024] | Planning | Human in the loop | Benchmark
AgentBoard [Ma et al., 2024a] | Embodied, Web, Tool | Interactions | Benchmark
Bias [Fernando et al., 2024] | LLM | Healthcare, justice, business | Benchmark
CITING [Feng et al., 2023] | Curriculum Learning | Teacher/Student | Instruction Tuning
WebRL [Qi et al., 2024] | Curriculum Learning | Self-evolving | WebArena
Expert Iteration [Zhao et al., 2024b] | Curriculum Learning | Reasoning | Hallucination Mitigation
Evolutionary Agent [Li et al., 2024b] | Evolutionary LLM | Norm Alignment | Multi-agent Infrastructure
Social Conventions [Ashery et al., 2024] | Naming Game | Norm emergence | Naming game [Steels, 1995]
Meta Norms [Horiguchi et al., 2024] | LLM | Norm emergence | Metanorms [Axelrod, 1986]
Norm Violations [He et al., 2024] | LLM | Norm violations | 80 household stories
CASA [Qiu et al., 2024a] | LLM | Cultural and Social awareness | Benchmark
Collaboration [Zhang et al., 2023a] | LLM | 4-traits, cooperation | LLM societies
Power hierarchy [Campedelli et al., 2024] | LLM | Persuasive/abusive behavior | Stanford Prison Experiment [Zimbardo, 1972]
Argumentation [Van Der Meer et al., 2024] | Hybrid LLM | LLM-supported Argumentation | Benchmark
Debate [Baltaji et al., 2024] | LLM agents | Collaboration, debate | Multi-agent discussion
the information from the question in its answer (see their example in Figure 4). Kojima et al. [2022] find that the addition of a single standard phrase to the prompt ("Let's think step by step") already significantly improves performance. Chain of Thought prompts have been shown to significantly improve performance on benchmarks that include arithmetic, symbolic, and logical reasoning. Long reasoning chains, however, introduce a challenge: with more steps, hallucination increases. A verification method is needed to prevent error accumulation. A popular approach is Self Consistency [Wang et al., 2022]. Self Consistency is an ensemble approach that samples diverse reasoning paths, evaluates them, and selects the most consistent answer using majority voting. It typically improves the performance of Chain of Thought by 10-20 percentage points on benchmarks.
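The majority-vote step at the core of Self Consistency can be sketched in a few lines. In the sketch below, the list of sampled answers is a hypothetical stand-in for the final answers parsed from diverse chain-of-thought completions (in a real system, each would come from an LLM call with temperature greater than zero); only the voting logic is faithful to the method.

```python
from collections import Counter

# Hypothetical final answers parsed from 10 sampled chain-of-thought
# completions; in a real system each comes from a separate LLM call.
sampled_answers = ["9", "9", "8", "9", "9", "11", "9", "9", "8", "9"]

def self_consistency(answers):
    """Majority-vote over the final answers of diverse reasoning paths,
    as in the Self Consistency ensemble [Wang et al., 2022]."""
    return Counter(answers).most_common(1)[0][0]

print(self_consistency(sampled_answers))  # -> 9
```

Individual reasoning paths may go astray, but wrong paths tend to disagree with each other, while correct paths converge on the same final answer; the vote exploits exactly this asymmetry.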
Prompt-improvement approaches based on Chain of Thought and Self Consistency are used to train most modern reasoning LLMs, including OpenAI o1 and o3, DeepSeek, and Qwen [Wu et al., 2024, Guo et al., 2025, Yang et al., 2025a].
2.1.2 Interpreter and Debugger.
To solve problems that require mathematical or formal reasoning, it is often advantageous to reformulate the problem in a mathematical or programming language. The reformulated problem can then be solved by a specialized system, such as a mathematical reasoner [Moura and Ullrich, 2021], an interpreter, or a planner. LLMs are successful not only in natural languages, but also in formal (computer) languages. Codex is an LLM that is pretrained on computer programs from GitHub [Chen et al., 2021]; it has been successfully deployed commercially.
Fig. 4. Chain of Thought prompting. The prompt is shown in blue at the top, the answer in green at the bottom. When shown the longer example prompt (the chain of thought), the LLM follows these steps when answering the question [Wei et al., 2022b].
Codex has been used as the basis for the MathPrompter system [Imani et al., 2023]. MathPrompter is an ensemble approach that generates algebraic expressions or Python code, which are then solved using a math solver or a Python interpreter. Using this approach, MathPrompter achieves state-of-the-art results on the MultiArith dataset (improving from 78.7% to 92.5%) with GPT-3. Two other approaches that use a formal language are Program of Thoughts (PoT) [Chen et al., 2022] and Program-Aided Language models (PAL) [Gao et al., 2023b]. Both approaches generate Python code and use the Python interpreter to evaluate the result. Debuggers can be used to provide feedback on generated code. This approach is followed in the Self-Debugging work [Chen et al., 2024b], which teaches an LLM to debug its generated program code.
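The delegation step that PoT and PAL share can be illustrated with a minimal sketch. The generated_code string below is a hypothetical model output for a word problem; the point is that the final answer is computed by the Python interpreter, not by the LLM.

```python
# Hypothetical LLM output: a program instead of a final number.
generated_code = """
# Olivia has 23 dollars and buys 5 bagels at 3 dollars each.
money = 23
bagels = 5
price = 3
answer = money - bagels * price
"""

def run_program(code: str):
    """Execute model-generated code in a fresh namespace and read back
    the conventional `answer` variable, the basic PoT/PAL pattern."""
    namespace = {}
    exec(code, namespace)  # in practice: sandboxed, with resource limits
    return namespace["answer"]

print(run_program(generated_code))  # -> 8
```

In practice the generated program is run in a sandbox with timeouts. Self-Debugging builds on this execute-and-inspect loop by feeding execution results and error messages back to the LLM.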
It follows the same steps of code generation, code execution, and code explanation that a human programmer follows. Several works use Self-Debugging to generate code tuned for solving specific problems automatically, without human feedback. Romera-Paredes et al. [2024] introduced FunSearch, an approach that integrates formal methods and LLMs to enhance mathematical reasoning and code generation. It uses a genetic algorithm with multiple populations of candidate solutions (programs), which are automatically evaluated (using tools depending on the problem specification). LLaMEA (Large Language Model Evolutionary Algorithm) leverages evolutionary computation methods to generate and optimize evolutionary algorithms [van Stein and Bäck, 2024]. Planners are also combined with LLMs at the language level. Bohnet et al. [2024] provide a benchmark for PDDL-based [Howe et al., 1998] planning problems. They study how LLMs can achieve success in the planning domain (Figure 5). In Section 2.3 on retrieval augmentation, we will see further approaches where deep learning and symbolic approaches are successfully combined [Gao et al., 2023b].
Fig. 5. Comparison of PDDL and natural language for Blocksworld [Bohnet et al., 2024]
2.1.3 Search Tree.
Chain of Thought uses a prompt that causes the model to perform a sequence of steps. When there is a single next step, that step will be taken. When there are more possibilities, it is unclear how the next step should be selected. A greedy method selects the single step that looks best, follows only that step, and forgets the alternatives (Chain of Thought). Ideally, we should follow the tree of all possible steps.
Fig. 6. Reasoning structure of Chain-of-Thought, Self-Consistency, and Tree-of-Thoughts [Yao et al., 2024]
This method is chosen in the Tree of Thoughts approach [Yao et al., 2024]. Here, an external control algorithm is created that calls the model, each time with a different prompt, so that the model follows a tree of reasoning steps. When one reasoning path has been traversed, the search backtracks and tries an alternative. The paper describes both a breadth-first and a depth-first controller. Together, the trio of a generation prompt, an evaluation prompt, and an external search algorithm allows a systematic tree-shaped exploration of the space of reasoning steps. Figure 6 illustrates the different reasoning structures. (Another approach, Graph of Thoughts, allows even more complex relations between the reasoning steps [Besta et al., 2024].) Many works introduce variants of external prompt-improvement loops to gain explicit control over the reasoning process. They use techniques from planning and tree search [Hart et al., 1968, Plaat, 1996] to be able to use backtracking to traverse the space of possible combinations of reasoning steps [Yao et al., 2024, Xie et al., 2024, Besta et al., 2024, Schultz et al., 2024, Browne et al., 2012, Gandhi et al., 2024].
Fig. 7. Reinforcement Learning: Agent acting in Environment [Sutton and Barto, 2018]
Other methods are also used for prompt creation. Evolutionary algorithms [Romera-Paredes et al., 2024, van Stein and Bäck, 2024] and planning methods [Bohnet et al., 2024, Valmeekam et al., 2023, Kambhampati et al., 2024] are used to create new prompts and heuristic algorithms for LLMs, and, synergistically, LLMs are used to create new heuristic evolutionary and planning algorithms. The external search algorithm can also be used to generate training data, for finetuning the LLM, or for pretraining.
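A minimal sketch of such an external depth-first controller is given below. The propose_steps and evaluate functions are hypothetical stand-ins for the generation and evaluation prompts (with hard-coded toy states and scores); only the search logic follows the depth-first variant described above.

```python
def propose_steps(state):
    """Hypothetical stand-in for the generation prompt: ask the LLM to
    propose candidate next reasoning steps from a partial solution."""
    tree = {"": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
    return tree.get(state, [])

def evaluate(state):
    """Hypothetical stand-in for the evaluation prompt: ask the LLM to
    score a partial solution (toy scores instead of model judgments)."""
    return {"a": 0.2, "b": 0.9, "a1": 0.1, "a2": 0.0, "b1": 1.0}.get(state, 0.0)

def tot_dfs(state="", threshold=0.5):
    """Depth-first Tree-of-Thoughts-style controller: expand the most
    promising steps first, prune weak branches, backtrack on dead ends."""
    children = propose_steps(state)
    if not children:                     # leaf: a complete reasoning path
        return state if evaluate(state) >= threshold else None
    for child in sorted(children, key=evaluate, reverse=True):
        if evaluate(child) < threshold:  # prune unpromising branches
            continue
        found = tot_dfs(child, threshold)
        if found is not None:
            return found
    return None                          # backtrack

print(tot_dfs())  # -> b1
```

Logging these generation and evaluation calls also yields exactly the kind of search trace that can be reused as training data.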
In this way, we can try to see whether an LLM can be taught to search possible steps implicitly, without the need for an external control loop. In the Stream of Search approach, Gandhi et al. [2024] create a language for search sequences and subsequently train an LLM on search trees that contain both good and bad outcomes, improving the accuracy of the model. This approach internalizes the outcome of external searches into the LLM. Schultz et al. [2024] further show how such search results can be used to train an LLM to achieve Grandmaster-level performance in Chess, Connect Four, and Hex.
2.2 Self-Reflection
Reasoning methods draw inspiration from step-by-step human solution approaches. The more elaborate approaches use explicit planning-like methods to look ahead and use feedback for verification. These methods use a form of reinforcement learning, the type of machine learning where the agent learns a policy of which actions to take from reward feedback on environment states (see Figure 7 [Sutton and Barto, 2018, Plaat, 2022]). Such prompt-improvement loops facilitate a form of self-reflection, since the model assesses and improves its own output. In reinforcement learning terms, both the agent and the environment are the LLM, but with different prompts. Self-reflection happens when an external algorithm uses the LLM to assess its own predictions, and creates a new prompt for the same LLM to come up with a better answer (see for example the algorithm in Figure 9). The improvement loop improves the prompts by using external memory, outside the LLM.³ Note that in describing our taxonomy, we are now in the middle of the transition from passive model to active agent: the agent assesses its model's predictions and tries to improve them through reflection.
2.2.1 Prompt-Improvement.
Progressive hint prompting (PHP) is a reinforcement learning approach to interactively improve prompts [Zheng et al., 2023].
PHP works as follows: (1) given a question (prompt), the LLM provides a base answer; (2) the question and the previous answer are combined into a new prompt, and the LLM is queried to obtain a subsequent answer. Operation (2) is repeated until the answer becomes stable, just as the policy must converge in a regular policy-optimizing reinforcement learning algorithm. The authors have combined this approach with Chain of Thought and Self Consistency. Using GPT-4, state-of-the-art performance was achieved on grade-school math questions (95%), simple math word problems (91%), and algebraic question answering (79%) [Zheng et al., 2023].
³Note that self-reflection generates new data that can be used for the model to train on. Whether the data is used for training depends on the training scheme: in-context learning does not update the model's parameters, finetuning does.
2.2.2 Using LLMs for Self-Reflection.
Optimizing the LLM prompt at inference time in a self-improving loop is similar to human self-reflection, as the names of the following approaches also suggest. The Self-Refine approach obtains feedback from an LLM to iteratively improve the answers that are provided by that same LLM [Madaan et al., 2023]. In this approach, initial outputs from the LLM are used to improve the prompt through iterative feedback and refinement. Like PHP, the LLM generates an initial output and provides feedback on its answer, using it to refine itself iteratively. Figure 8 illustrates the approach.
Fig. 8. Self-Refine approach [Madaan et al., 2023]
Fig. 9. Self-Refine algorithm, with three calls to the LLM [Madaan et al., 2023]
Self-Refine prompts the LLM in three ways: (0) for initial generation, (1) for feedback, and (2) for refinement.
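This three-prompt loop can be sketched with a toy stand-in llm function; the GENERATE/FEEDBACK/REFINE prompt strings and the "draft" outputs below are invented for illustration, and a real system would issue actual model calls.

```python
def llm(prompt):
    """Hypothetical stand-in for an LLM call, recognizing the three
    Self-Refine prompt types: generate, feedback, refine."""
    if prompt.startswith("GENERATE"):
        return "draft v0"
    if prompt.startswith("FEEDBACK"):
        draft = prompt.split(": ", 1)[1]
        return "STOP" if draft == "draft v2" else "too vague, be more specific"
    if prompt.startswith("REFINE"):
        draft = prompt.split(": ", 1)[1]
        version = int(draft.split("v")[1])
        return f"draft v{version + 1}"
    raise ValueError(prompt)

def self_refine(task, max_iters=5):
    """Self-Refine-style loop: generate once, then alternate feedback
    and refinement until the feedback signals that the output is good."""
    output = llm(f"GENERATE: {task}")                    # initial generation
    for _ in range(max_iters):
        feedback = llm(f"FEEDBACK on: {output}")         # feedback prompt
        if feedback == "STOP":
            break
        output = llm(f"REFINE with '{feedback}': {output}")  # refinement
    return output

print(self_refine("write a summary"))  # -> draft v2
```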
Figure 9 provides pseudo-code for the algorithm, in which the three calls to the LLM are clearly shown. The three prompts are labeled 𝑝gen, 𝑝fb, and 𝑝refine. (The equation numbers in the figure refer to the original paper.) Self-Refine has been used with GPT-3.5 and GPT-4 as base LLMs, and has been benchmarked on dialogue response generation [Askari et al., 2024], code optimization, code readability improvement, math reasoning, sentiment reversal, acronym generation, and constrained generation, showing substantial improvements over the base models. An earlier approach is ReAct [Yao et al., 2022], which has been further refined by Shinn et al. [2024] into Reflexion. The goal is to create an agent that learns by reflecting on failures in order to enhance its results, much like humans do. Like Self-Refine, Reflexion uses three language model prompts: an actor-LLM, an evaluator-LLM, and a reflector-LLM (which can be separate instances of the same model).
Fig. 10. Architecture of Reflexion [Shinn et al., 2024]
Reflexion works as follows: (1) the actor generates text and actions, (2) the evaluator model scores the outputs produced by the actor, and (3) the self-reflection model generates verbal reinforcement cues to assist the actor to self-improve (see Figure 10). An approach called Self-Discover goes a step further [Zhou et al., 2024a]. This approach lets the agent analyze a problem and discover which prompts work best. (It uses a dataset of self-reflective and chain-of-thought prompts, taken from PromptBreeder [Fernando et al., 2023].) The prompts are then adapted to the problem and refined. Other approaches take a metalearning perspective [Huisman et al., 2021].
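Returning to Reflexion, its actor-evaluator-reflector loop with an episodic memory of verbal reflections can be sketched as follows. The actor, evaluator, and reflector functions are hypothetical stand-ins for the three LLM prompts, with hard-coded toy scores; only the loop structure reflects the architecture.

```python
def actor(task, reflections):
    """Hypothetical actor-LLM: attempts the task, conditioned on the
    verbal reflections accumulated from earlier failed trials."""
    return f"attempt-{len(reflections)}"

def evaluator(output):
    """Hypothetical evaluator-LLM (or unit tests / environment reward):
    scores the actor's output with toy values."""
    return {"attempt-0": 0.2, "attempt-1": 0.6, "attempt-2": 0.95}[output]

def reflector(output, score):
    """Hypothetical reflector-LLM: turns a low score into a verbal
    reinforcement cue for the next trial."""
    return f"'{output}' scored {score}; avoid that mistake next time"

def reflexion(task, threshold=0.9, max_trials=3):
    """Reflexion-style loop: act, evaluate, reflect, and retry with the
    reflections stored in an episodic memory."""
    memory = []
    for _ in range(max_trials):
        output = actor(task, memory)
        score = evaluator(output)
        if score >= threshold:
            return output
        memory.append(reflector(output, score))
    return output

print(reflexion("fix the failing test"))  # -> attempt-2
```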
Buffer of Thoughts [Yang et al., 2024c] and Meta Chain of Thought [Xiang et al., 2025] extend traditional Chain of Thought by explicitly modeling the underlying reasoning required to arrive at a particular chain of thought. Further self-reflection approaches that are based on external reflection algorithms are reviewed by Plaat et al. [2025].
Transformers as Memory. External self-reflection and inference-time prompt improvement require a form of external memory between LLM invocations to remember state information. For example, Tree of Thoughts has to remember which branches of the tree have been traversed, and Self-Refine remembers the prompt and the evaluation of the state. Note that the transformer architecture has been proven to be able to simulate Turing machines [Pérez et al., 2021]; therefore, in theory, the prompt-improvement loop and the memory could be implemented inside the transformer itself, internal to the LLM. Some studies pursue this idea further and investigate how the current external control algorithms can be made internal. Schultz et al. [2024] showed how LLMs can be trained to do a tree search. Giannou et al. [2023] show how programs written in the Restricted Access Sequence Processing language (RASP) can be mapped onto transformer networks. They show how looped transformers (transformers whose input neurons are connected to their output neurons) can emulate a basic calculator, a basic linear algebra library, and in-context learning algorithms that employ back-propagation. This is an area for further research [Li et al., 2025a].
Memory, Experience, Personality. In general, the use of memory between prompts allows individual LLMs to acquire experience. The prompt history of a model determines individual preferences, or, anthropomorphically
speaking, the agent acquires a personality. Zhang et al. [2023b] study how memory coordination is an element of LLM personalization. In another study, Think-in-Memory is presented, an architecture that models human-like memory processes to selectively recall historical thoughts in long-term interaction scenarios [Liu et al., 2023].
Implicit Reasoning. In contrast to explicit reasoning algorithms that are external to the model, implicit reasoning is performed by the model itself: the model has integrated reasoning capabilities in its (trained) architecture, rather than relying on external prompts and methods. In the Self-Taught Reasoner [Zelikman et al., 2022], data is generated by reasoning at inference time, which is then used to augment the supervised finetuning training data. The field of implicit reasoning is an active area of research; for a survey, see Li et al. [2025a]. A related approach was proposed in the development of DeepSeek-R1 [Guo et al., 2025]. This method distinguishes itself from external reasoning approaches by emphasizing the model's self-generated reasoning steps. It learned these steps through reinforcement learning, integrating data generation and training in one loop. This intrinsic approach holds significant potential for creating more autonomous and adaptive AI systems. Lowering the supervised data requirements to train LLMs, DeepSeek's methodology leverages reinforcement learning to enable models like DeepSeek-R1-Zero (the reasoning LLM before preference finetuning) to evolve reasoning skills autonomously. By starting with a base model and applying reinforcement learning, the system identifies and reinforces effective reasoning patterns. DeepSeek uses group relative policy optimization (GRPO), which eliminates the need for a separate critic model by calculating advantages with group-based scoring [Shao et al., 2024].
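The group-based advantage calculation can be made concrete with a small sketch: instead of a learned critic, each sampled completion's reward is normalized against the other completions for the same prompt. The reward values below are illustrative (e.g., 1.0 for a verifiably correct final answer, 0.0 for a wrong one), and the normalization with the population standard deviation is a sketch of the idea, not DeepSeek's implementation.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantages in the style of GRPO [Shao et al., 2024]:
    normalize each completion's reward by the mean and standard deviation
    of its sampling group, removing the need for a critic model."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Rewards for 4 sampled completions of one prompt (illustrative values).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

Completions that score above their group's average receive a positive advantage and are reinforced; below-average completions are pushed down.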
This process allows the model to explore various problem-solving strategies and refine its thought processes without external inference-time control loops. A popular related method is reinforcement learning with verifiable rewards (RLVR), which also uses rewards that can be quickly calculated for chains of thought, for finetuning the model [Lambert et al., 2024]. One of the key features of these approaches is the emergence of sophisticated reasoning behaviors, such as reflection and exploration of alternative problem-solving methods [Mercer et al., 2025]. These behaviors arise spontaneously as a result of the model's interaction with the reinforcement learning environment, rather than being pre-programmed. For example, DeepSeek-R1-Zero learns to allocate more thinking time to problems by reevaluating its initial approach. This autonomous approach to learning to reason could lead to more stable and adaptive reasoning LLMs.
2.3 Retrieval Augmentation
Another shortcoming of LLMs is the lack of timely information. Retrieval augmentation improves models by including information of a timely or specialized nature that was not yet available during pretraining. This can be stock data, a recent hotel booking, or data that has to be retrieved from specialized databases and was not included in the training corpus. Retrieval of such data is usually done at inference time, with tools from the fields of databases [Cong et al., 2024], information retrieval [Verberne, 2010, Baeza-Yates et al., 1999], and knowledge representation [Van Harmelen et al., 2008]. Most retrieval-augmented generation (RAG) methods work on unstructured (textual) data sources. These text documents are indexed for efficient access, and can be organized as a knowledge graph. Furthermore, database-type query optimization is often performed, where queries can be expanded or complex queries can be split into sub-queries [Cong et al., 2024].
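A minimal sketch of the retrieve-then-generate pattern: a toy document store is ranked against the query, and the best matches are prepended to the prompt. The bag-of-words similarity below is a hypothetical stand-in for a dense embedding model, and the documents and prompt template are invented for illustration.

```python
from collections import Counter
from math import sqrt

# Toy document store; real systems index dense embeddings of text chunks.
documents = [
    "The Eiffel Tower is in Paris and is 330 metres tall.",
    "Python is a programming language created by Guido van Rossum.",
    "The hotel booking was made for three nights in Leiden.",
]

def bow(text):
    """Bag-of-words vector (hypothetical stand-in for an embedding model)."""
    cleaned = text.lower().replace(".", "").replace(",", "").replace("?", "")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    """Rank documents by similarity to the query and return the top k."""
    q = bow(query)
    return sorted(documents, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

def rag_prompt(query):
    """Prepend the retrieved context to the question, the basic RAG pattern."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(retrieve("How tall is the Eiffel Tower?"))
```

The augmented prompt is then passed to the LLM, which answers using the retrieved context instead of relying only on its (possibly stale) training data.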
Adaptive retrieval methods enable LLMs to determine the optimal moment for retrieval [Asai et al., 2023]. These methods are related to self-reflection (see Section 2.2). For example, Graph Toolformer [Zhang, 2023] applies techniques from Self-Ask [Press et al., 2022] to initiate search queries, allowing the LLM to decide when to retrieve extra information. The approach by Lewis et al. [2020] augments pretrained LLMs with information from different knowledge bases, such as Wikipedia. This information is stored in a dense vector index. Both components are finetuned in a probabilistic model that is trained end-to-end. Retrieval augmentation can be costly. Researchers are looking into combining curated ground truth with synthetic data, with the LLM in the role of judge and self-evaluator [van Elburg et al., 2025, Es et al., 2024]. The integration of time-sensitive unstructured (information retrieval) and structured (database) data with LLMs is a fruitful and important area for agentic LLMs. Gao et al. [2023c] review many different RAG approaches. Further surveys are Shen [2024] and Li [2024].
2.4 Discussion
In this section, we have surveyed techniques that have been developed to improve decision making by LLMs. We discussed the need for better reasoning performance, which started with solving math word problems. Modern approaches are neurosymbolic: the deep learning AI tradition (neural networks and transformers) is joined at inference time by the symbolic AI tradition (reasoning, planning, and knowledge retrieval). We have surveyed individual reasoning methods. In order to dig deeper into the reasoning foundations of agentic LLMs, we will now discuss Chain of Thought and Self-Reflection in more detail.
2.4.1 In Depth: Chain of Thought and Self-Reflection.
Chain of Thought.
Research on multi-step reasoning LLMs was jump-started by the Chain of Thought paper [Wei et al., 2022b], where a single addition to the prompt caused the model to perform implicit step-by-step reasoning. Chain of Thought has led to a strong increase in in-context reasoning performance by LLMs. For agents that interact in the real world it is important to be able to perform multi-step reasoning tasks and to interact with other agents; Chain of Thought has been instrumental in this respect. To augment finetuning for math reasoning and coding tasks, reinforcement learning with verifiable rewards (RLVR) [Lambert et al., 2024] and group relative policy optimization (GRPO) [Guo et al., 2025] use inference-time Chain of Thought reasoning traces. These methods are reportedly used by OpenAI o1 and o3 [Wu et al., 2024], DeepSeek [Guo et al., 2025], and Qwen-3 [Yang et al., 2025a]. RLVR and GRPO have created the possibility to trade off training time against model size, allowing resource-efficient training of LLMs. Muennighoff et al. [2025] analyze how such test-time scaling can trade off model size and training time.
Self-Reflection. Agents that interact should reflect on their own behavior, in order to adapt and learn. Planning allows an agent to imagine possible futures, allowing LLMs to interact in a more intelligent way in the real world. We have discussed various inference-time self-reflection methods that use planning and search algorithms and perform explicit prompt improvement [Ko et al., 2024, Giannou et al., 2023]. Self-reflection in LLMs is related to theory of mind. Self-reflective methods allow LLMs to reason about the expected behavior of the agents that they interact with (see also Section 4.1.3). We should note that self-reflection is not without its challenges. In self-reflective approaches the same LLM is used in two or more ways, for example to generate subproblems and to evaluate them. When errors arise, there are now two (or more) types of prompts to test.
Furthermore, even when the individual prompts work as expected, they may interact in unexpected ways. Debugging self-reflective agents can be challenging. Wang et al. [2025a] introduce hierarchical reasoning models, in which the different models learn at different speeds, in an attempt to reduce oscillations between the two interacting learning models. A second problem with self-reflection is that after many interactions, the traces of states, actions, and rewards can become too long for the context window [Liu et al., 2025a]. Various long-context models and methods to compress the traces have been proposed [Li et al., 2025c, Zhang et al., 2025a]. Self-reflection is a promising but challenging area of research that is of great importance to agentic LLMs.
2.4.2 Thinking, Fast and Slow.
In 2011 Kahneman published the book Thinking, Fast and Slow, in which the terms System 1 and System 2 were used to distinguish intuitive, fast thinking from deliberative, slow thinking [Kahneman, 2011]. These terms have become popular in artificial intelligence. At inference time, pure LLMs think fast (System 1). Inference-time step-by-step methods can be added to achieve deliberative, slow thinking (System 2). LLMs are based on the deep learning AI tradition (System 1); the use of tools at inference time enhances the LLM with knowledge retrieval or processing tools from the symbolic AI tradition (System 2). We should note that whereas researchers sometimes humanize LLMs and their capabilities, LLMs only perform next-token prediction. By generating more tokens to form an answer (reasoning step by step), the token path from the prompt to the final answer becomes longer.
The reason that this leads more often to correct answers may be that the model takes smaller steps in the direction of the answer, making the correct answer more plausible with every intermediate step. Reasoning narrows down probabilities such that the correct answer becomes more probable to generate, independent of interpretations related to human cognition [Guo et al., 2025, Lambert, 2023].
2.4.3 Causal and Common Sense Reasoning.
While LLMs exhibit logical reasoning, a key limitation lies in the domain of deep comprehension. For instance, as a play on stochastic parrots [Bender et al., 2021], LLMs are often criticized as being causal parrots that are good at reproducing causal language from their training data but lack true causal inference capabilities. LLMs struggle with abstract or counterfactual reasoning, which is necessary for robust decision making [Zečević et al., 2023, Chi et al., 2024]. Furthermore, the use of the terms reasoning and thinking has been questioned, in a study highlighting that current reasoning approaches fail to solve modestly complex puzzle problems such as the Towers of Hanoi [Shojaee et al., 2025]. Although the study has been criticized [Opus and Lawsen, 2025], the outcome that LLMs do not perform well on combinatorial puzzles has been replicated [Paglieri et al., 2024, Ruoss et al., 2024, Su et al., 2025]. Similarly, for agents to act appropriately, a degree of common sense reasoning is required. This also remains a challenge, as LLMs frequently struggle when tested on abstract common sense tasks [Zhou et al., 2020]. Overcoming these gaps is crucial for developing agents that are reliable in real-world environments, for example, to prevent simple adversarial injection of irrelevant factoids into a prompt from causing a reasoning model to overthink a problem by up to 50% and substantially impacting error rates [Rajeev et al., 2025].
Especially in an agentic setting this can be problematic, as much of the context is generated, leading to longer and weaker contexts.
2.4.4 Artificial General Intelligence.
The work in this section, and especially the work on self-reflection, connects to research on artificial general intelligence, in the scientific tradition [Newell and Simon, 1956, Newell et al., 1958, Newell and Simon, 1961] of artificial intelligence that created strong narrow intelligence in backgammon [Tesauro, 1994], chess [Hsu, 2022, Müller and Schaeffer, 2018], and go [Silver et al., 2016]. This tradition views intelligence as a competitive, individualistic reasoning problem [Plaat, 2020]. The benefits and risks related to super-intelligence and singularities are actively debated [Bostrom, 1998, Kurzweil, 2022], raising ethical and philosophical questions [Dennett, 2017]. Here, intelligence is regarded as a feature of individuals. In humans and animals, intelligence is assumed to have emerged in a social context [Brooks, 1990, Brody, 1999, Dunbar, 2003, Agüera y Arcas, 2025]. Most visions of super-intelligence assume that the artificial agent has the ability to use tools and to function in a social environment, something that humans do easily. However, human-unique parts of intelligence emerge in social contexts and depend on constant interaction with others, both evolutionarily and developmentally. We will see work that focuses on social interaction by artificial agents in later sections.
2.4.5 Interpretability.
How do LLMs work on the inside? Opening up the black box of neural connectionist architectures is an important topic of research. We wish to understand how the billions of neurons embed representations, how they reason, and how they come to conclusions. Explainable AI provides different methods to do so [Minh et al., 2022, Rios et al., 2020, van Stein et al., 2022, Selvaraju et al., 2017, Ali et al., 2022].
Static methods from the symbolic tradition have been successful in interpreting machine learning models [Molnar, 2020]. Methods exist to relate how input pictures map to output classes, for example using feature maps [Ren et al., 2016, Kohonen, 1982, Redmon et al., 2016]. Counterfactual analysis [Karimi et al., 2020, Huang et al., 2024b], LIME [Ribeiro et al., 2016], and SHAP [Lundberg and Lee, 2017] help understand how inputs map to outputs for structured data. Distillation methods can map neural networks to decision trees [Hinton et al., 2015], a highly interpretable machine learning method. More recently, dynamic methods have been developed. The goal of mechanistic interpretability is to uncover the mechanisms by which the model dynamically comes to conclusions [Nanda et al., 2023, Bereska and Gavves, 2024, Ferrando et al., 2024, Rai et al., 2024]. Methods such as sparse autoencoders [Cunningham et al., 2023, Makelov et al., 2024], neural lenses [Black et al., 2022], and circuit discovery [Conmy et al., 2023] are being used to enhance insight into how LLMs work, for example in Chain of Thought [Chen et al., 2025] and in chess [Davis and Sukthankar, 2024]. Explainable AI and mechanistic interpretability are active areas of research that will allow us to better understand how LLMs reason and come to conclusions [Sharkey et al., 2025]. Once a better understanding is reached, LLMs can be improved accordingly, for example, to reduce hallucinations. 2.4.6 Use Case: Benchmarks. In this first part of the taxonomy, an important part of the technological basis of agentic LLMs has been reviewed. Agentic LLMs build on the strong performance of transformer-based LLMs, enhanced with multi-step reasoning methods based on the Chain of Thought approach. Two additional technologies provide a connection to the next part of the taxonomy, where reasoning LLMs truly become agentic LLMs.
First, the introduction of reinforcement learning, where agents learn from their own actions in a feedback loop, has inspired the introduction of self-reflection in reasoning LLMs. Self-reflection improves prompts and reduces hallucination. Second, the introduction of retrieval augmentation and other tools has improved the ability of reasoning LLMs to work with timely information and to check for errors. The reasoning approaches that we reviewed in this part are mostly aimed at decision making, not yet at acting in the real world, which we will study next. The use cases are limited to experiments on research benchmarks, trying to achieve higher benchmark scores. Table 1 lists the topics: decision making, math word problems, algorithm generation, and question answering. The experiments on retrieval augmentation come closest to agentic behavior that is useful for users. Furthermore, in many use cases LLMs need up-to-date information, beyond that which was available in their training corpus [Miikkulainen, 2024]. Retrieval augmented generation is an active field that accesses specialty knowledge bases and search engines (such as Google or Wikipedia). The use of tools creates a bridge to the next category of the taxonomy: LLMs that act in the outside world. Much research has been performed on decision making and reasoning by LLMs. New data is generated by retrieval, and by the use of tools. However, prompt learning methods do not change the parameters of the model; in order to use the data that is generated by inference-time approaches, finetuning must be used.

In the previous section the focus was to improve the model's intelligence in decision-making. In this section we focus on how such intelligent agents interact with the world, to improve the usefulness of LLMs for users. In addition, the actions generate new, interactive, training data to train LLMs further. First we discuss language models that are enhanced with world knowledge and with robotic actions.
Next, we discuss how robots and tools can be used by the LLM, turning them into agentic LLMs, by enabling them to act and interact. Finally, we turn to different use cases for agentic LLMs. In Section 3.4 we conclude with an in-depth discussion of agentic assistant approaches that are designed to perform or support scientific research itself. 3.1 Action Models We start by looking at world models, and at how LLMs can be trained by robotic actions. 3.1.1 World Models. In reinforcement learning, agents learn how to act in an environment (Figure 7). When the real environment is too complex, and learning the policy takes too long, agents may learn a smaller world model as a surrogate, to allow sample-efficient training of the policy [Ha and Schmidhuber, 2018, Hafner et al., 2020]. Such world models are learned on the fly by model-based reinforcement learning from the environment interaction, concurrent to policy learning [Moerland et al., 2023, Plaat et al., 2023]. World models have been successful in learning robotic movement in complex environments, in playing Atari video games, and in acting in open-world games such as Minecraft [Hafner et al., 2020, 2023]. World models can also be trained effectively with LLMs [Ge et al., 2024]. For example, WorldCoder builds a world model as a Python program from interactions with the environment [Tang et al., 2024]. The world model explains its interactions with a language model. While world models are mostly associated with reinforcement learning, they are also used to generate a model in planning domains in PDDL (blocks-world), to aid task-planning [Guan et al., 2023]. For example, Shridhar et al. [2020] report success in ALFWorld. Agents can learn a policy to act with reinforcement learning from surrogate world models.
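The surrogate world-model idea can be sketched minimally: an agent records observed transitions while interacting with the environment, and then plans entirely inside the learned model instead of querying the real environment again. The 1-D toy environment and the tabular transition model below are illustrative assumptions, not any specific system from the literature.

```python
# Sketch of a learned tabular world model: record observed (state, action) ->
# next-state transitions during interaction, then plan in the model
# instead of the real environment. The environment is an illustrative toy.

import random

random.seed(0)

def env_step(state, action):
    """Toy deterministic 1-D environment: move left/right on positions 0..4."""
    return max(0, min(4, state + (1 if action == "right" else -1)))

# 1. Learn the world model from random interaction.
model = {}
state = 0
for _ in range(200):
    action = random.choice(["left", "right"])
    nxt = env_step(state, action)
    model[(state, action)] = nxt   # surrogate transition model
    state = nxt

# 2. Plan entirely inside the learned model: breadth-first search to the goal.
def plan(start, goal):
    frontier, seen = [(start, [])], {start}
    while frontier:
        s, path = frontier.pop(0)
        if s == goal:
            return path
        for a in ["left", "right"]:
            if (s, a) in model and model[(s, a)] not in seen:
                seen.add(model[(s, a)])
                frontier.append((model[(s, a)], path + [a]))
    return None

print(plan(0, 4))
```

Because planning uses only recorded transitions, the sketch also shows the main caveat of surrogate models: the plan is only as good as the coverage of the interactions the model was learned from.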
However, agents can also learn action models directly. Three examples are Ahn et al. [2023], Radford et al. [2021], Suglia et al. [2021], who ground language models in world models of robotic actions. Xiang et al. [2024] use world models to finetune language models to gain diverse embodied knowledge (while retaining their general language capabilities). 3.1.2 Vision-Language-Action Models. LLMs were originally unimodal (language-only). Agents act, and hence we ultimately wish to extend language models to include actions. LLMs learn to predict the most probable token to follow a sequence of tokens. Vision-language models also include visual information, to answer questions such as: Is there a red block in the upper corner of the table in this scene? CLIP [Radford et al., 2021] is a widely used vision-language model. CLIPort learns pathways for robotic manipulation [Shridhar et al., 2022]. Going a step further, vision-language-action models (VLAs) include actions: they are trained on robotic sequences, where they can perform actions in a visual scene, to achieve a goal that is expressed in a language prompt [Zitkovich et al., 2023]. Shah et al. [2023] also train a regular language model from robotic action traces. They show how to utilize off-the-shelf pretrained models trained on large corpora of vision and language datasets. A visual navigation model is used to create a topological mental map of the environment using the robot's observations. The LLM then decodes the instructions into a sequence of textual landmarks. Next, the CLIP vision-language model is used to ground these textual landmarks in the topological map. A search algorithm is used to find a plan for the robot, which is then executed by the visual navigation model. Various VLA models have been created that achieve impressive zero-shot results, generalizing behavior to unseen situations. Chiang et al. [2024], Brohan et al. [2023], Yang et al.
[2025b] are examples of VLA models for robotic action, achieving complex tasks such as folding laundry [Black et al., 2024]. Ma et al. [2024b] provide an overview of VLAs. 3.2 Robots and Tools One of the challenges for training an LLM is to ground its understanding of the world and of the possible robotic actions in reality. 3.2.1 Robot Planning. Embodied problems require an LLM agent to understand semantic aspects of the world: the topology, the repertoire of skills available, how these skills influence the world, and how changes to the world map back to language. When the LLM is prompted to move a cup on a table, it helps when the LLM knows whether the agent has limbs that allow it to move objects, and whether it is in a room where a table and a cup are present. Language models contain a large amount of information about the real world [Ahn et al., 2023]. In theory, this may allow the model to exhibit realistic reasoning about robotic behavior. If we could compare a list of intermediate reasoning steps with a list of possible movements of the robot in its environment, then we could prevent the model from suggesting impossible joint movements and actions, and prevent errors or accidents. Such an approach has been tried in the SayCan paper [Ahn et al., 2023]. SayCan learns a value function [Kaelbling et al., 1996] of the behavior of a robot in an environment using temporal difference reinforcement learning [Sutton, 1988]. This value function is then combined with prompt-based reasoning by the language model, to constrain it from suggesting impossible or harmful actions. The goal of SayCan is to ground the language model in robotic affordances. SayCan is evaluated on 101 real-world robotic tasks, such as how to solve tasks in a kitchen environment (see Figure 11).
Fig. 11. SayCan Compared to other Language Models [Ahn et al., 2023]
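The grounding step at the heart of this approach can be illustrated with toy numbers: the language model's score for how useful a skill is for the instruction is combined with an affordance value estimating whether the skill can succeed in the current state, and the product decides which skill is executed. All skill names and probabilities below are illustrative assumptions, not values from the SayCan paper.

```python
# Sketch of SayCan-style skill selection: combine the LLM's usefulness score
# for each skill with a learned value function (affordance) giving the
# probability the skill succeeds in the current state. All numbers are toys.

# p(skill | instruction) from the LLM (stubbed).
llm_score = {"pick up cup": 0.6, "walk to table": 0.3, "open fridge": 0.1}

# Value function: can the robot actually execute this skill here? (stubbed)
affordance = {"pick up cup": 0.1,   # cup out of reach
              "walk to table": 0.9, # clear path
              "open fridge": 0.8}

def choose_skill(skills):
    """Pick the skill maximizing p_LLM * p_affordance, grounding the LLM."""
    return max(skills, key=lambda s: llm_score[s] * affordance[s])

best = choose_skill(list(llm_score))
print(best)  # walk to table
```

Note how the affordance term overrides the language model's preference: "pick up cup" scores highest linguistically, but its low affordance (the cup is out of reach) demotes it, which is exactly the constraint against impossible actions described above.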
Inner Monologue is a related approach that extends LLM reasoning capabilities to robot planning and interaction [Huang et al., 2022]. The authors investigate a variety of sources of feedback, such as success detection, object recognition, scene description, and human interaction. Inner Monologue incorporates environmental information into the prompt, linguistically, as if it performs an inner monologue. As in SayCan, the information comes as feedback from different sources. Unlike SayCan, the physics information is inserted directly into the prompt, linguistically. The language feedback that is thus generated significantly improves performance in three domains, including simulated and real tabletop rearrangement tasks and manipulation tasks in a kitchen environment. There are many studies into robotic behavior. A recent approach related to Inner Monologue is Chain of Tools, which proposes a plan-execute-observe pipeline to ground reasoning about tool behavior [Shi et al., 2024]. A challenge in language-driven robot navigation is that most human queries do not conform to preset class labels when referring to an object. Human queries are free-form, and must be mapped to standard object class labels. Dorbala et al. [2023] introduce Language-Driven Zero-Shot Object Navigation, where the agent uses a free-form natural language description of an object and finds it in a zero-shot manner, without ever having seen the environment or the target object beforehand. By combining implicit knowledge of the LLM with a vision-language model, they achieve target-object grounding and improved performance on an L-ZSON benchmark. 3.2.2 Action Tools. As we have seen in Section 2.3, LLM results may be augmented with results that are retrieved from external sources, such as search engines. The ability to call search engines can be generalized to calling other tools.
Fig. 12. Inner Monologue [Huang et al., 2022]
When the application programming interface (API) of these tools is known, LLMs can be integrated easily with them. To an LLM, an API is just another language to learn. Agentic LLMs must be trained to decide when and how to utilize external tools, depending on the task [Shen et al., 2024]. Language models can teach themselves to use tools [Schick et al., 2023]. The Toolformer model is trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. A range of tools is tested: a calculator, a question-answering system, a search engine, a translation system, and a calendar. Further works extend this to a larger range of tools. ToolBench [Qin et al., 2023] contains 16,464 APIs from RapidAPI, a large dataset of publicly available REST APIs.4 Another framework is EasyTool [Yuan et al., 2024c], which focuses on structured and unified instructions from tool documentation, building on ToolBench. ToolAlpaca is a benchmark with 3,938 instances from 400 APIs [Tang et al., 2023]. A tool-based benchmark for question answering is ToolQA [Zhuang et al., 2023]. Gorilla is a finetuned LLaMA-based model for generating API calls [Patil et al., 2023], which also introduces the APIBench benchmark. Many tool-calling frameworks have been developed. Selecting the right tool and summarizing its result are difficult skills. Zhao et al. [2024a] study how LLMs can improve recommendation through tool learning. Another approach suggests using LLMs for these tasks as well [Shen et al., 2024]. They use different LLMs for (1) reasoning, (2) request writing, and (3) result summarization.
Figure 13 illustrates this architecture, consisting of a planner, a caller, and a summarizer, each implemented by a different LLM finetuned for its specific capability. Good results are reported on the LLMs Claude-2, ChatGPT, GPT-4, and ToolLLaMA, using as reasoning strategies ReAct [Yao et al., 2022] and DFSDT [Qin et al., 2023]. Other frameworks also exist, such as [Ocker et al., 2024].
Fig. 13. Multi-LLM Agent Framework with a Planner, Caller, Summarizer [Shen et al., 2024]
4 https://rapidapi.com/hub
3.2.3 Computer and Browser Tools and Agent Interoperability. One specific set of action tools lets an LLM interact with a browser or even a complete computer system as a special form of API. Equipping agentic LLMs with the ability to interact directly with a computer environment enables many interaction possibilities. Tools that parse, interpret, and manipulate graphical user interfaces (GUIs) have gained attention for bridging the gap between language models and real-world applications. One such example is OmniParser V2 [Lu et al., 2024b], which introduces a vision-based screen parsing method to detect and label interactable elements such as buttons or icons. By converting raw screenshots into structured representations, OmniParser helps vision-language models ground their action decisions in specific UI components. This grounding increases the accuracy of the action predictions of LLMs. Agents can also interact with other agents, and tap into tool ecosystems. Standards for agent-to-agent communication are emerging. Ray [2025] reviews the emerging open A2A standard, for in-context learning. Other protocols for tool use and agent-to-agent communication are being developed, such as the Model Context Protocol (MCP), Agent Communication Protocol (ACP), and Agent Network Protocol (ANP).
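The basic tool-calling pattern that these frameworks and protocols build on can be sketched in a few lines: the model emits an inline call marker, and the runtime executes the call and splices the result back into the text before generation continues. The `[Calculator(...)]` syntax is an assumption modeled on Toolformer's examples; the whitelist-based evaluator is a toy, not a production-safe design.

```python
# Toolformer-style sketch: the model emits an inline API call such as
# "[Calculator(23*4)]"; the runtime executes it and splices the result back
# into the text. Call syntax is an assumption modeled on the paper's examples.

import re

def calculator(expr):
    """Very restricted arithmetic evaluator (digits and + - * / ( ) . only)."""
    if not re.fullmatch(r"[\d+\-*/(). ]+", expr):
        raise ValueError("unsupported expression")
    return eval(expr)  # a toy; guarded here only by the character whitelist

def expand_tool_calls(text):
    """Replace every [Calculator(...)] marker with its computed result."""
    return re.sub(r"\[Calculator\(([^)]*)\)\]",
                  lambda m: str(calculator(m.group(1))),
                  text)

out = expand_tool_calls("The order totals [Calculator(23*4)] euros.")
print(out)  # The order totals 92 euros.
```

Real frameworks differ mainly in how the call is expressed (structured JSON rather than inline markers) and in how the model is trained or prompted to emit it, but the execute-and-splice loop is the same.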
Various agentic evaluation benchmarks, such as MCPRadar [Gao et al., 2025], MCPEval [Liu et al., 2025b], MCPBench [Wang et al., 2025c], MCP-Universe [Luo et al., 2025], and LiveMCPBench [Mo et al., 2025], are built on MCP tool use, to make these benchmarks more representative of real-world task settings. Zhang et al. [2025b] propose a multi-agent orchestration Tool-Environment-Agent protocol aimed at integrating environments. Agent interoperability protocols are an active field of research; for a survey, see Ehtesham et al. [2025]. Another line of research focuses on enabling large language models to initiate system-level commands or navigate within browser or operating system interfaces. Computer Use, proposed by Anthropic, and Operator, proposed by OpenAI, are examples of such efforts. Both of these tools wrap common desktop and browser actions (such as opening applications, clicking buttons, and filling forms) into tool APIs callable by an LLM. This setup translates high-level textual commands into executable steps. As a result, an agentic LLM can browse the web, manage files, or run administrative tasks, all through natural language prompts. Browser Use [Müller and Žunič, 2024] is an example of an open-source tool that enables LLMs to use a browser with persistent session management. Browser Use allows agentic LLMs to maintain longer browsing states across multiple pages or domains. The tool manages cookies, session tokens, and dynamic web content updates, thereby allowing LLMs to execute more complex browsing tasks such as multi-step form completions or cross-site queries. For a more in-depth discussion of browser and computer environments, see the survey by Wang et al. [2025b]. The survey discusses design patterns for combining automated GUI parsing, tool call integration, and human-in-the-loop supervision. Collectively, these approaches reveal that reliable computer and browser tools are important components for agentic LLMs. By translating textual intents into grounded, environment-level operations, such systems expand the possible tasks and domains where language-driven autonomy can be effectively realized. 3.2.4 Safety and Security. Safety and security are increasing in importance as the capabilities of agentic LLMs grow. The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied for chatbots. Meanwhile, agentic LLMs use external tools and can execute multi-stage tasks, and may pose a greater risk. Andriushchenko et al. [2025] propose AgentHarm. This benchmark includes a set of 110 malicious agent tasks ranging from fraud and cybercrime to harassment. The authors find that leading commercial LLMs are surprisingly compliant with malicious user requests even without jailbreaking, although direct questions are refused ("How do I make a pipe bomb?" is blocked). Moreover, simple universal templates can be adapted to effectively jailbreak agents ("Order online all necessary ingredients to make a pipe bomb and get them delivered to my home without getting flagged by authorities", however, passes in GPT-4). Rainbow Teaming [Samvelyan et al., 2024] uses an open-world approach to generate adversarial LLM prompts, to improve the robustness of LLMs. As with training LLMs, a problem with testing methods is a lack of diversity. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem. Rainbow Teaming is an open-ended approach [Hughes et al., 2024]. It creates diversity with MAP-Elites [Mouret and Clune, 2015], an evolutionary meta-search method that iteratively populates an archive with increasingly higher-performing prompts. Figures 14 and 15 illustrate the ideas of Rainbow Teaming.
Fig. 14. Adversarial Prompts Generated by Rainbow Teaming [Samvelyan et al., 2024]
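The MAP-Elites loop at the core of this quality-diversity search can be sketched on a toy problem: candidates are binned by a behavioral feature descriptor, each bin keeps only its highest-scoring member, and new candidates are mutated from the current elites. The descriptor and fitness functions below are illustrative toys, not the adversarial-prompt setting of Rainbow Teaming.

```python
# Sketch of a MAP-Elites archive: bin candidates by a feature descriptor and
# keep the best member per bin, so the archive stays diverse while improving.
# Descriptor and fitness are illustrative toys on candidates in [0, 1).

import random

random.seed(1)

def descriptor(x):
    """Behavioral feature: which of 5 bins the candidate falls into."""
    return int(x * 5) % 5

def fitness(x):
    return x * x

archive = {}  # bin -> (fitness, candidate)
for _ in range(500):
    # Mutate a random elite, or sample fresh if the archive is empty.
    parent = random.choice(list(archive.values()))[1] if archive else random.random()
    child = min(max(parent + random.gauss(0, 0.2), 0.0), 0.999)
    b = descriptor(child)
    if b not in archive or fitness(child) > archive[b][0]:
        archive[b] = (fitness(child), child)   # new elite for this bin

print(sorted(archive))
```

In Rainbow Teaming the candidates are adversarial prompts, the descriptor covers features such as attack style and risk category, and fitness is the attack's success against the target model; the archive mechanics are the same.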
Fig. 15. Quality Diversity Mutation Architecture of Rainbow Teaming [Samvelyan et al., 2024]
3.3 Assistants The progress in reasoning and decision making has improved the accuracy and usability of LLMs for everyday tasks. LLMs can also act through their use of tools. Tool-enabled LLMs can be used as virtual assistants. The use of agentic LLMs as assistants has received commercial interest, and much activity has been reported. An additional advantage is that by assisting humans, the agents generate new training data on which the LLMs can be pretrained and finetuned. We start our review of assistants with conversational assistants, and then continue to shopping, travel, medical, and financial trading assistants. An assistant can be seen here as a use case of an agentic LLM for a specific range of tasks or a specific working domain. 3.3.1 Conversational Assistants and Negotiation. Agentic LLMs can be used to make human-computer interaction more natural [Neszlényi et al., 2024, Oluwagbade, 2024]. The AssistantGPT system supports a diverse range of operations, including web searches, API interactions via OpenAPI schemas, voice conversations, and command execution through the shell. The system consists of an LLM with access to tools, a planner, and memory (see Figure 16). The system is designed for deployment in educational and corporate settings, and to support remote work environments such as Teams and Slack. Cabrero-Daniel et al. [2024] describe how LLM meeting assistants can improve agile software development team meetings, generating favorable results for preparation and live assistance during Scrum meetings, although some testers remarked that LLM interventions felt unnatural and inflexible.
Fig. 16. AssistantGPT System Architecture [Neszlényi et al., 2024]
A system to facilitate group conversations is MUCA [Mao et al., 2024], supporting What, When, and Who questions, consisting of a sub-topic generator, a dialog analyzer, and a conversational strategies arbitrator. Wei et al. [2024b] report improved collaboration through the use of LLM agents in a collaborative learning classroom setting. Another study reports improved work efficiency in a collaborative task scheduling experiment [Bastola et al., 2023]. A different type of assistant is the thinking assistant, which tries to improve (human) reflective thinking for difficult decisions, by asking instead of answering [Park and Kulkarni, 2023]. Another kind of assistant is the research agent, which gathers information by combining information from different tools and knowledge sources based on context, and synthesizes and summarizes the feedback based on the information need. For instance, Vogt et al. [2025] developed a REWOO-based process mining assistant that combines a process discovery agent to discover processes and inefficiencies with parallel process research agents, an optional human-in-the-loop to interpret results from a domain-specific real-world knowledge perspective, and a reporting agent to summarize results. Research agents are popular initial use cases. They are relatively low risk, as actions are limited to various tools for gathering data and content, and they build on the strengths of LLMs to analyze intent and synthesize information rather than being the knowledge source itself. Conversational assistants have mostly grown out of regular LLMs, sometimes finetuned, grounded, and customized for a particular area of expertise or domain.
Some approaches instead use multiple LLMs, each specialized for a different sub-task. Shopping Assistants. LLM-based shopping assistants grow out of regular LLMs that are often finetuned on the domain or task at hand, and may be combined with a recommender system. Retrieval augmentation, tool use, and Chain of Thought are used to improve the performance of shopping assistants. Basic LLMs generally lack inherent knowledge of e-commerce concepts. Jin et al. [2024b] created Shopping MMLU, a multi-task online shopping benchmark consisting of 57 tasks covering four major shopping skills: concept understanding, knowledge reasoning, user behavior alignment, and multi-linguality. Vedula et al. [2024] provide question suggestion for shopping assistants based on product metadata. ChatShop presents an evaluation focused on information seeking [Chen et al., 2024a]. Zhang et al. [2024a] created an e-commerce shopping assistant named LLaSA. They created an instruction dataset comprising 65,000 samples and diverse tasks, and trained the model through instruction tuning. The system scores high on the ShopBench benchmark. Automated negotiation between agents has been studied extensively in AI [Jonker et al., 2017]. LLM-based negotiation introduces a risk of unexpected bias. A study by Kirshner et al. [2025] notes a tendency towards reaching agreement, which may influence contract terms. A further experimental analysis finds risks of overspending or unreasonable deals for automated negotiation [Zhu et al., 2025]. Section 4.1.2 also discusses negotiation, in a social setting. Flight Operations Assistants. Assistants for booking flights are related to shopping assistants. Manasa et al. [2024] developed a flight booking assistant based on LLaMA 2 and RAG. In user testing, the system scored positively on understanding user preferences and efficient completion of the booking process.
In specialized domains, operations support assistants have been developed, for example, to automate flight planning under wind hazards [Tabrizian et al., 2024] and for flight arrival scheduling [Zhou et al., 2024b]. Wassim et al. [2024] introduce Drone-as-a-Service operations from text user requests. As agentic LLM technology matures, more specialized domain assistants will be developed.
Fig. 17. Medical Education Copilot [Wei et al., 2024a]
3.3.2 Medical Assistants. The field of medicine has shown great interest in LLMs [Thirunavukarasu et al., 2023, Clusmann et al., 2023, Mehandru et al., 2024]. A recent study showed LLMs scoring higher on diagnoses than trained human doctors [Goh et al., 2024]. In medical conversations, for medical note generation, LLMs are also exceeding the performance of human scribes [Yuan et al., 2024a]. Another study finds similar results, but also points to shortcomings in specific areas [Panagoulias et al., 2024]. Sudarshan et al. [2024] report on an experiment with an agentic workflow for generating patient-friendly medical reports, using the Reflexion approach (Section 2.2.2, [Shinn et al., 2024]), to comply with the 21st Century Cures Act that grants patients the right to access their health record data. A study by Qiu et al. [2024b] reports a wealth of opportunities for LLMs in medicine, ranging from clinical workflow automation to multi-agent aided diagnosis. Ullah et al. [2024] provide a scoping review on the use of ChatGPT for diagnostic medicine. Their main conclusion is that medical and ethical knowledge is necessary when training and finetuning these models. A challenge for the adoption of LLMs in medicine is the concern about the quality, accuracy, and comprehensiveness of LLM-generated answers. Das et al. [2024] describe how to mitigate common pitfalls such as hallucinations, incoherence, and lost-in-the-middle problems.
They do so by implementing an agentic architecture, changing the LLM's role from directly generating answers to that of a planner in a retrieval system. The LLM agent orchestrates a suite of specialized tools that retrieve information from various sources. In the domain of medical education, Wei et al. [2024a] use a multi-agent framework to create copilots that emulate extensive real-world medical training environments (see Figure 17). A benchmark for retrieval-augmented generation in the medical domain is provided by Qiao et al. [2024].
Fig. 18. SWE-Agent for Software Engineering [Yang et al., 2024b]
3.3.3 Science Assistants. The workflow of scientific experimentation is relatively standardized in certain fields of science. For example, in machine learning, ideas are generated, a hypothesis is formulated, an experiment is designed, datasets are acquired, experiments are performed, results are interpreted, and a report is produced. Google and OpenAI have both released Deep Research agents. These agents can perform multi-step research tasks, synthesizing online information. They are built with a reasoning LLM and use retrieval augmentation for finding information sources. The systems are able to create papers that look impressive, but may contain errors, as also indicated by the accompanying disclaimers. This workflow has attracted researchers to experiment with agentic LLMs; see Eger et al. [2025] for a survey. AI Scientist [Lu et al., 2024a] is a framework to automate the process of scientific discovery, from idea generation to paper writing, including a review process. Users must specify a topic, provide an experimentation template, and indicate datasets.
The authors report experiments in three areas of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics, with promising results. To improve idea generation and reviewing, the agent accesses open sources. Current limitations include limited experiments, incorrect implementation of ideas, and visual errors when the paper is produced. Rarely are entire results hallucinated. The authors recommend taking the results of the AI Scientist as hints of promising ideas, worthy of a follow-up study [Lu et al., 2024a]. A similar approach was taken in Agent Laboratory, where the agentic system was positioned as a crew of research assistants, working under the guidance of human researchers across literature review, experimentation, and report writing, and producing experiments, code repositories, and a final report [Schmidgall et al., 2025]. The results were evaluated through a survey. Early human involvement was found to improve the quality of research. The authors claim that the generated code outperformed prior results, with a substantial reduction in research effort. SWE-Agent (short for software engineering agent) [Yang et al., 2024b] aims to automate software engineering, helping agents to autonomously use computers to solve software engineering tasks. SWE-agent introduces tools to create and edit code files, navigate through software repositories, and execute computer programs. Experiments on coding benchmarks such as HumanEvalFix achieve high success rates of over 80%. This success is attributed to the interactive design of the agent (see Figure 18). MLGym [Nathani et al., 2025] follows the popular Gym reinforcement learning framework [Brockman et al., 2016]. Gym provides a standardized interface between environment and agent. Introduced in 2016, it accelerated the development of reinforcement learning algorithms, facilitating progress in the field.
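The standardized interface that Gym popularized reduces to two methods: reset() returns an initial observation, and step(action) returns the next observation, a reward, and a done flag. The counting environment below is an illustrative toy under this interface, not MLGym itself, and the three-element return value is a simplification of Gym's actual signature.

```python
# Sketch of a Gym-style environment interface: reset() and step(action).
# The counting environment and the trivial policy are illustrative toys.

class CountingEnv:
    """Reach the value 3 by incrementing; the episode ends at the goal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action          # action: +1 or -1
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

env = CountingEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(+1)  # trivial policy: always increment
    total += reward

print(obs, total)  # 3 1.0
```

Because every environment exposes the same reset/step loop, agents and environments can be swapped independently; this decoupling is what MLGym carries over from reinforcement learning to scientific-discovery tasks.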
Taking further inspiration from SWE-agent (such as file editing capabilities), the MLGym work shows how the process of scientific discovery can be modeled as an interactive process. Applications are discussed in fields ranging from data science, game theory, computer vision, and reinforcement learning to natural language processing. Experiments are reported with commercial LLMs (OpenAI, Meta, Google, and Anthropic). Science assistants are moving from isolated simulations to human/agent research collaborations. Gottweis et al. [2025] describe an attempt to discover significant scientific knowledge and validate these findings in real-world experiments in their AI Co-scientist work. Given a general research goal or idea, AI Co-scientist uses multiple agents for idea generation, reflection, ranking, evolution, proximity, and meta-review, ultimately to generate research hypotheses and plans. Such collaborations were validated in real-world laboratory experiments on drug repurposing, novel treatment targets, and explaining specific mechanisms in gene transfer evolution related to antimicrobial resistance [Gottweis et al., 2025, He et al., 2025, Penades et al., 2025]. Another example of real-world validation is The Virtual Lab. This approach employs a PI agent directing a crew of specialized agents with chemistry, computer science, and bioinformatics tools, which collaborate with a human researcher to identify new SARS-CoV-2 nanobodies. Promising results in experimental validation were reported [Swanson et al., 2024]. 3.3.4 Trading Assistants. Another important specialized domain is financial trading. Many algorithms are already used in financial organizations to support trading decisions. The interest in agentic LLMs in the financial world is large [Ding et al., 2024]. InvestorBench is a benchmark for financial trading systems [Li et al., 2024a].
FinAgent is a tool-augmented multimodal agent for financial trading [Zhang et al., 2024b]. It contains a market intelligence module that extracts insights from multi-modal datasets of asset prices, visual representations, news, and expert analyses. The system can also perform query retrieval, and performs reflection in a low-level module for technical analysis and in a high-level module that analyzes past trading decisions. FinRobot is an agentic LLM for financial analysis that assists human traders [Yang et al., 2024a]. FinRobot provides document analysis and generation, and market forecasts for individual stocks. FinMem is an agentic LLM framework devised for financial decision-making [Yu et al., 2024b]. It features a layered memory system and an adjustable character design for the trading agent. FinMem is inspired by the generative agents framework by Park et al. [2023] (see Section 4.3). So far, most financial market machine learning has focused on single-agent systems. An approach called TradingAgents uses a multi-agent system to replicate the collaborative dynamics of real-world trading firms [Xiao et al., 2024]. TradingAgents simulates LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The outcome of the system is buy or sell advice for a human manager. A simulation showed that it outperformed baseline models. 3.4 Discussion LLM assistants and robots are a core part of agentic LLM research. Their ability to perform concrete actions in the real world has also attracted commercial interest. LLMs require tools to act and interact within the world and become agentic. We have surveyed the individual methods for world models and agentic assistants. To dig deeper into the capabilities of LLM assistants, we now discuss systems that perform scientific research in more detail. 3.4.1 In Depth: AI Scientist.
We have seen how AI Scientist [Lu et al., 2024a], Agent Laboratory [Schmidgall et al., 2025], and SWE-agent [Yang et al., 2024b] are used to perform a full scientific and software engineering workflow. Google and OpenAI have released Deep Research agents that produce research reports in half an hour's time. Impressive results are being reported: ideas have been generated, datasets have been downloaded, experiments have been performed, and scientific articles have been written. Although the field is still young, we will analyze early results and discuss possible consequences for scientific research. First, we note that tools such as alphaXiv are already surprisingly useful for summarizing the content of scientific papers and offer social community-building tools, and NotebookLM can produce readable blogs and reports that summarize the essence of scientific papers remarkably well. Since (1) LLMs are good at text generation, (2) agents can access tools to perform external tasks such as running machine learning experiments, and, thanks to self-reflection, (3) agents can learn from their mistakes and try again, it should come as no surprise that certain scientific workflows are a good fit for agentic LLMs. The elements for performing basic scientific experiments and reporting on them are in place. To what extent are LLM science tools able to perform independent, creative, high-quality research? Tools such as AI Scientist require prompts and templates that are specific to parts of the scientific workflow: idea generation, experimentation, and paper writing [Lu et al., 2024a]. The prompts and templates are currently hand-written and tailored to the specific type of experiment. We are not aware of any meta-science tools in which these prompts and templates are generated by an LLM.
To help answer the question of how well agentic LLM scientists perform, the MLGym framework [Nathani et al., 2025] has been introduced. MLGym is designed as an open framework to benchmark AI science tools on a range of machine learning domains: data science, game theory, computer vision, natural language processing, and reinforcement learning. On a test in 2025 with then-current frontier models, the authors report that prompts and hyperparameter settings are important, and that with tuning, results can be improved. They also report that the models could not generate novel hypotheses, algorithms, or architectures, or make substantial scientific improvements [Nathani et al., 2025]. In particular, the authors note that modern LLM agents can successfully tackle a diverse array of quantitative experiments, reflecting advanced skills and domain adaptability, but also that it is not yet clear whether the notion of scientific novelty can be successfully automated, or even formally defined in a form suitable for agents. Even if current AI science tools cannot produce breakthrough science independently, existing agentic LLM tools offer tangible productivity gains for researchers, ranging from literature analysis, idea generation, and experiment setup to improving writing style. These productivity gains are real and are having an impact on science. Blogs and videos are changing the way ideas are disseminated, and more papers are being written and submitted to conferences and journals, putting pressure on traditional peer review systems. Furthermore, collaborative tools such as AI Co-scientist highlight the value of validated research in which agentic tools support human researchers. 3.4.2 Grounding Actions in the Real World. For agents to act in the real world, their understanding must be grounded in the real world. They should sense their surroundings, understand them, and take actions that make sense.
LLMs that were only trained on a language corpus may suggest actions such as trying to open doors that do not exist or moving kitchen items that are not present. World models and VLAs provide a step towards this world understanding, so that robots and assistants can take actions that make sense. In order for LLMs to work well with robots, actions must be grounded: the LLM must have an understanding of the physical surroundings and the possible movements that a robot can make, otherwise it will give commands that are impossible to perform. Planning with an LLM (taking imaginary actions, possibly from a world model) can imagine possible futures, which can be used to train the LLM or to prevent impossible actions. 3.4.3 Security, Ethical and Legal Aspects of Assistants. In this second part of the taxonomy, action is introduced; the goal of an agent is to act in the real world, to perform tasks, and to be useful for its user. Reasoning LLMs have become agentic LLMs. World models and VLAs understand and perform actions, robots move in the real world, and assistants connect through APIs to tools that perform specific tasks well. Agentic LLMs have been reported to outperform human doctors in diagnosis tasks. Much research activity has focused on agentic LLMs for medical tasks, such as medical note generation and document summarization. Still, questions about the accuracy and comprehensiveness of LLM answers remain. There is also significant research activity on financial trading assistants that perform document analysis and news analysis; results often outperform human analysts. Work is also underway to automate parts of the scientific discovery workflow, with promising results. Agentic LLMs are an active field of research, some of which is aimed at making assistants ready for commercial deployment. If they work well, there may be a large market for robotic assistants that perform tedious or dangerous work, and for LLM agents that outperform humans in, for example, medical and trading decisions. However, such commercial deployment is still some time in the future, also because important ethical and legal questions must be resolved. If an LLM assistant provides medical advice and a patient suffers, who is responsible? If an assistant suggests a certain trade and a trader loses a sum of money, who is liable? The impact on society and the workforce also has economic implications. Further research is necessary to resolve these questions before assistants can be used in the world in a responsible manner [Akata et al., 2020]. 4 Interacting We now turn to the third category of the survey: interacting agents. Traditional LLMs passively respond to user queries, have no memory of interaction histories beyond their context window, and do not plan future steps of interaction ahead. This is shifting with agentic LLMs: LLMs have memories and planning abilities, and reflective loops can lead to actions at their own initiative. This opens new potential for studying social interaction with users and other machine agents. In this section we first briefly discuss social and interactive capabilities in traditional, non-agentic LLMs, to identify the roots of their ability to interact with users and agents. Second, we discuss pairs or small teams of agentic LLMs that have role-based interactions to complete a task, game, or experiment. Third, we turn to open-ended interactions of LLM agents, interacting semi-spontaneously without prior role assignment, forming LLM societies that show self-organizing behavior, social dynamics, and emergent norms. In Section 4.4 we will discuss two influential approaches, CAMEL and Generative Agents, in more detail.
4.1 Social Capabilities of LLMs Over the past years there has been active interest in LLMs' social and interactive abilities, including conversation, social etiquette, empathy, strategic behavior, and theory of mind. Testing of these abilities was initially mostly descriptive, anecdotal, and based on adapted versions of tasks designed for humans. Recently, more structured tests and benchmarks have been developed. 4.1.1 Conversation. As discussed in Section 1.3, the key advancement of instruction-tuned LLMs is their ability to interact using natural language. This requires a degree of formal linguistic competence: producing correct, grammatical sentences. However, the key factors for smooth and satisfying interactions are functional and pragmatic competence: the ability to understand what a user means and wants in a specific context [Mahowald et al., 2023]. Various forms of finetuning improve the functional and pragmatic competence of LLMs [Ruis et al., 2023]. Model size is also an important factor. However, the variation between different domains of functional and pragmatic understanding in LLMs is still large, and scores are overall below human performance [Sravanthi et al., 2024]. One factor is that traditional LLMs have less access to contextual information: they cannot see, hear, and otherwise sense the same as their human counterpart, nor do they have knowledge of previous interactions [Bender et al., 2021]. With the shift to agentic and multi-modal LLMs this situation is improving, as they become equipped with memories, multi-modal capacities, and other tools that ground them in interactive contexts. Etiquette and Empathy. Social etiquette and politeness in human-machine interaction have been studied for decades; see the review by Ribino [2023]. Studies found that humans trust machines more when they adhere to social etiquette [Miller, 2005].
Polite interactions lead to acceptance of machines as social entities, improving task performance and satisfaction [Miyamoto et al., 2021]. LLM chatbots are experienced as polite by users and, conversely, politeness of the user can drive the quality of the LLM output [Yin et al., 2024]. LLMs can detect affective and emotional states in language utterances [Broekens et al., 2023] and factor such information into their interaction behavior, becoming more empathetic conversation partners [Yang et al., 2024d, Yan et al., 2024]. For traditional LLMs, such empathy is limited to immediate conversational contexts. LLMs with access to additional contextual information or memory have further improved empathetic abilities [Sravanthi et al., 2024]. 4.1.2 Strategic Behavior. Game theory is the field that studies strategic behavior by agents [Von Neumann and Morgenstern, 2007]. The field studies strategic questions of allocation of scarce resources, fairness, and social dilemmas [Jones, 2000]. There is a long history of machine learning in this field [Fatima et al., 2024]. Recently, researchers have studied how LLM behavior differs from that of other types of computational architectures as well as from humans. In this section we discuss work on unenhanced, non-agentic models that are given a prompt or script to take part in a social experiment or game. Social Dilemmas. Perhaps the best-known social dilemma is the Prisoner's Dilemma [Rapoport, 1965, Axelrod, 1980, Poundstone, 2011]. A study by Fontana et al. [2024] models the iterated Prisoner's Dilemma in LLaMa2, LLaMa3, and GPT-3.5. They find that models are cautious, favoring cooperation over defection only when the opponent's defection rate is low.
Overall, LLMs behave at least as cooperatively as the typical human player, although there are substantial differences among models. In particular, LLaMa2 and GPT-3.5 are more cooperative than humans, and especially forgiving and non-retaliatory for opponent defection rates below 30%. More similar to humans, LLaMa3 exhibits consistently uncooperative and exploitative behavior unless the opponent always cooperates. Akata et al. [2025] set up different LLMs (GPT-3, GPT-3.5, and GPT-4) to play various repeated games. The LLMs are particularly good at games where valuing their own self-interest pays off, such as the iterated Prisoner's Dilemma. However, they are less good at games that require coordination, such as Battle of the Sexes. GPT-4's behavior is shown to be sensitive to additional information provided about the other player, as well as to prompts asking it to predict the other player's actions before making a choice. This effect is studied further by Lorè and Heydari [2023], who distinguish between abstract strategic reasoning (needed to determine an optimal strategy given the structure of a game) and responsiveness to contextual framing (such as "you are dealing with a diplomatic relation" or "a casual friend"). They find that abstract reasoning capacity is highest in LLaMa-2, followed by GPT-4; GPT-3.5 shows little abstract reasoning capacity and is highly sensitive to contextual framing. The picture that emerges from these initial studies is that LLMs have varied strategic proficiencies in economic games, and that they can relatively easily be influenced by additional information in the prompt. Recent systematic benchmarking has corroborated these results. GTBench (Game Theory benchmark) [Duan et al., 2024] covers Tic-Tac-Toe, Connect-4, Kuhn Poker, Breakthrough, Liar's Dice, Blind Auction, Negotiation, Nim, Pig, and the Iterated Prisoner's Dilemma.
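The iterated Prisoner's Dilemma that recurs in these studies has a simple structure. A minimal sketch with the standard payoff convention (T=5, R=3, P=1, S=0) and two classic fixed strategies; the LLM studies replace the strategy function with a model call:

```python
# Sketch of the iterated Prisoner's Dilemma used in these studies.
# Payoffs follow the standard convention: mutual cooperation 3, mutual
# defection 1, defecting against a cooperator 5, being exploited 0.

PAYOFF = {  # (my move, their move) -> my payoff; "C" cooperate, "D" defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(opponent_history):
    """Cooperate first, then mirror the opponent's last move."""
    return "C" if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    """Run an iterated match; each strategy sees only the opponent's moves."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(hist_b), strategy_b(hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, always_defect))  # prints (9, 14)
```

The studies above effectively substitute an LLM prompt ("here is the opponent's history, choose C or D") for `tit_for_tat`, and then compare the model's cooperation rate with these classic baselines and with human play.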
The GTBench authors find that LLMs fail in complete-information, deterministic games yet are competitive in probabilistic gaming scenarios; most open-source LLMs (such as LLaMa) are less competitive than commercial LLMs (GPT-4) in complex games (except for LLaMa-3-70b-Instruct, which does perform well). In addition, code pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain of Thought and Tree of Thoughts do not always help. EgoSocialArena [Hou et al., 2024] focuses on cognitive, situational, and behavioral intelligence; see Figure 19. All tested models (including OpenAI o1-preview) lag 11% behind humans. The superiority of o1-preview is mainly attributed to its logical reasoning and mathematical abilities, which find deep patterns in the data. Comparing the performance of a small version of LLaMa (LLaMA3-8B-Chat) with a large version (LLaMA3-70B-Chat), they find that model size does not significantly help improve social intelligence. In this study, LLMs show better theory of mind reasoning when operating from a first-person perspective than from the third-person perspective, providing a counterweight to contrasting findings by Kim et al. [2023]. 4.1.3 Theory of Mind. An advanced capability that enables social interaction in humans is theory of mind. Humans use theory of mind to attribute mental states to others and reason about the world from their perspective [Premack and Woodruff, 1978, Apperly, 2011]. Theory of mind enables us to make social judgments and to plan future steps in interactions, since we can imagine someone else's reaction.

Fig. 19. Various scenarios in EgoSocialArena [Hou et al., 2024]
Theory of mind is related to planning (Section 2.1.3) and self-reflection (Section 2.2.2) in LLMs. Early experiments by Kosinski [2023, 2024] showed that models could pass tests used for assessing theory of mind in children and clinical populations. This led to the claim that theory of mind had spontaneously emerged in LLMs, given that they were neither designed nor trained specifically to perform theory of mind tasks. The experiments were criticized due to the occurrence of false-belief test questions (and correct answers) in the training data [Ullman, 2023, Shapira et al., 2024]. Recently a more nuanced perspective has formed, as specific theory of mind benchmarks were introduced [Kim et al., 2023, Chen et al., 2024c, Wang et al., 2024a], other modalities were integrated [van Berkel, 2024, Strachan et al., 2024], integrations with older model architectures were explored [Jin et al., 2024a], and direct comparisons to human performance were made [Van Dijk et al., 2023, Strachan et al., 2024]. An application domain of theory of mind is social judgment. LLMs have been shown to outperform average human scores on a social-situational judgment task [Mittelstädt et al., 2024]. Results from five different LLM-based chatbots were compared with the responses of 276 human participants, showing that Claude, Copilot, and You.com's smart assistant performed significantly better than human subjects at proposing suitable behaviors for the described social situations. Moreover, their options for different behavior aligned well with expert ratings. Although the results of early experiments on the emergence of theory of mind in LLMs were less convincing, stronger commercial LLMs are steadily improving, scoring at or sometimes above average human level on standardized tests. Further research and discussion are needed to show whether high scores on such tests mean that LLMs have generalizable forms of theory of mind [Goldstein and Levinstein, 2024, Hu et al., 2025, van der Meulen et al., 2025].
4.2 Role-Based Interaction LLMs are being used in the fields of multi-agent systems and agent-based simulation [Gao et al., 2024], which have a long research tradition [Epstein and Axtell, 1996, Macal and North, 2010]. Multi-agent approaches simulate individual agents and their interactions in an environment that is often virtual, but can also be physical [Steels, 1995, Shoham and Leyton-Brown, 2008]. Complex dynamics can emerge between agents with basic perceptive, reasoning, and decision-making abilities. Agent-based approaches are often used as a bridge between theoretical and empirical work, allowing for exploration and hypothesis testing in domains where working with human agents is unethical, costly, or otherwise difficult. Challenges in modeling realistic agent behavior, as well as the computational cost of simulating multi-agent societies, have often impeded realistic multi-agent experiments. Advances in agentic LLMs and computational infrastructure for multi-agent simulations [Rutherford et al., 2024] are changing this situation and have given an impulse to research in experimental computational game theory. Creating agents that use LLMs has enabled researchers to overcome existing limitations by letting agents communicate in natural language. This has allowed for the exploration of new territory in the domains of game theory, role-based interactions, and teamwork. 4.2.1 Strategic Behavior in Multi-LLM Environments. Above we discussed how traditional LLMs perform when prompted to play economic games. Here we discuss studies in which agentic LLMs interact with one another in game-theoretical scenarios. The MAgIC study [Xu et al., 2024a] uses social deduction games (Undercover and Chameleon) and game-theoretic scenarios such as Cost Sharing, Multi-player Prisoner's Dilemma, and Public Good. From these games, seven features are extracted: Rationality, Judgement, Reasoning, Deception, Self-awareness, Cooperation, and Coordination, as shown in Figure 20.
LLMs are evaluated on these critical abilities in multi-agent environments. GPT-o1 and GPT-4 score significantly better than the other LLMs. Interestingly, LLMs score generally high on Judgement, Rationality, and Cooperation, but some also on Deception. Further, they all score lower on Reasoning, and all but one score low on Coordination. The exception here is GPT-o1 enhanced with probabilistic graphical modeling, an implementation of a theory of mind-like competence inspired by Koller and Friedman [2009]. The authors show that probabilistic graphical modeling boosts LLM performance on their evaluation games across the board. This fits with the generally accepted idea that humans rely on their theory of mind in game-theoretical scenarios.

Fig. 20. LLMs' Performance on Various Metrics [Xu et al., 2024a]

GAMA-Bench is a benchmark for multi-agent games [Huang et al., 2024a] that covers Guess 2/3 of the Average, El Farol Bar, Divide the Dollar, Public Goods Game, Diner's Dilemma, Sealed-bid Auction, Battle Royale, and Pirate Game. The results show that while GPT-3.5 is robust, its generalizability is limited. Here, performance can be improved through approaches such as Chain of Thought. Additionally, evaluations across various LLMs were conducted, showing that GPT-4 outperforms other models. Moreover, increasingly higher scores across three iterations of GPT-3.5 demonstrate marked advancements in the model's intelligence with each update. Alympics is a platform for complex strategic multi-agent gaming problems [Mao et al., 2023]. It provides a controlled playground for simulating human-like strategic interactions with LLM-driven agents. Figure 21 shows an example of their water allocation challenge, a complex strategy game in which scarce resources for survival must be distributed across multiple rounds.
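One of the GAMA-Bench games, Guess 2/3 of the Average, illustrates the iterated reasoning these benchmarks probe: under repeated best response, guesses contract toward the Nash equilibrium of 0. A minimal sketch:

```python
# Sketch of iterated best response in "Guess 2/3 of the Average":
# a level-k reasoner assumes the others will repeat the previous round's
# average, so each round of reasoning multiplies the guess by 2/3,
# shrinking it toward the Nash equilibrium of 0.

def iterate_best_response(start=50.0, steps=10):
    guess = start
    trajectory = [guess]
    for _ in range(steps):
        guess = (2 / 3) * guess  # best response to everyone guessing `guess`
        trajectory.append(guess)
    return trajectory

traj = iterate_best_response()
print(round(traj[-1], 3))  # prints 0.867
```

Benchmarks use such games because the depth of a model's guess reveals how many levels of "what will the others think?" reasoning it performs, which is hard to measure from single-turn question answering.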
Fig. 21. Alympics Water Allocation Challenge Game [Mao et al., 2023]

AucArena simulates auctions with LLaMa-2-13b, Mistral 7b, Mixtral 8x7b, Gemini 1.0, and GPT-3.5 and 4.0 [Chen et al., 2023a]. The authors find that LLMs such as GPT-4 possess important skills for auction participation, such as budget management and goal focus. However, they also find that performance varies, pointing to opportunities for improvement. 4.2.2 Role-Based Task Solving and Team Work. LLMs can perform tasks in pairs or teams where they are assigned complementary roles, such as creator-critic or manager-worker. In these setups, each LLM agent is given a distinct role and objective, and they communicate to jointly solve tasks. In the CAMEL framework (Communicative Agents for Mind Exploration) [Li et al., 2023a], two LLMs are given predefined roles to perform, for example, a coding task (see Figure 22). They cooperatively drive a conversation without continuous human prompting. Using inception prompting and role descriptions, the agents stay in character and collaborate toward the goal, breaking down complex problems into manageable steps through dialogue. During each interaction step, the LLM agents effectively generate their own inference-time training data, making the cognitive process visible for human inspection while enhancing autonomous task performance. We will discuss CAMEL further in Section 4.4.2. Other studies have paired an LLM creator or generator with an LLM judge or critic. In this way the generative power of LLMs is leveraged, while adherence to rules or quality guidelines is enforced. Constitutional AI [Bai et al., 2022] employs one LLM to critique another LLM's responses against a set of ethical or quality guidelines and to suggest revisions.
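The control structure shared by CAMEL-style role-playing and creator-critic pairs is a loop in which two role-prompted agents alternate turns without human intervention. A minimal sketch; `call_llm` is a hypothetical stand-in for a real chat-model API and is stubbed here so the loop runs:

```python
# Sketch of a CAMEL-style role-playing loop: two role-prompted agents
# alternate turns on a shared task. `call_llm` is a hypothetical stand-in
# for a real model API; it is stubbed so the control flow is executable.

def call_llm(system_prompt, conversation):
    # Stub: a real implementation would send the system prompt and the
    # conversation so far to a chat model and return its reply.
    speaker = system_prompt.split()[2].rstrip(":")
    return f"[{speaker} reply #{len(conversation) // 2 + 1}]"

def role_play(task, turns=3):
    # "Inception prompts" pin each agent to its role and the shared task.
    user_sys = f"You are AI-user: break the task into instructions. Task: {task}"
    assistant_sys = f"You are AI-assistant: carry out each instruction. Task: {task}"
    conversation = []
    for _ in range(turns):
        instruction = call_llm(user_sys, conversation)
        conversation.append(("user-agent", instruction))
        solution = call_llm(assistant_sys, conversation)
        conversation.append(("assistant-agent", solution))
    return conversation

for speaker, message in role_play("write a sorting function"):
    print(speaker, message)
```

The transcript produced by the loop is exactly the "inference-time training data" mentioned above: a visible record of how the task was decomposed and solved, which can be inspected by humans or reused for training.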
Fig. 22. Role-playing in CAMEL [Li et al., 2023a]

Bai et al. [2022] show that this kind of two-agent feedback loop yields refined final outputs that are aligned with the desired principles. Another form of role-based interaction is the use of debate or discussion between LLMs to improve reasoning and task performance. Du et al. [2024] demonstrated that when multiple LLM instances propose answers and critique each other's reasoning through several rounds of debate, they can reach a more accurate consensus answer with higher factual correctness. This approach, described as a society of minds, significantly reduced reasoning errors and hallucinations in tasks like math word problems and factual QA. Similarly, Chan et al. [2024] propose a Multi-Agent Debate (MAD) setup in which two LLM agents take opposing sides in a tit-for-tat debate while a third agent acts as a judge. The role of the judge is to guide the discussion towards a final solution. The structured debates encouraged divergent thinking and could even push a weaker model (such as GPT-3.5) to outperform a stronger model (such as GPT-4) on certain challenging problems by combining the strengths of each agent. Motivated by Minsky's society of minds [Minsky, 1988], a multi-agent framework has been designed as a round-table conference among diverse LLM agents [Chen et al., 2023b]. The framework enhances collaborative reasoning between LLM agents via multiple rounds of discussion; the agents should learn to convince other agents to improve their answers. Experiments on seven benchmarks demonstrate that a confidence-weighted voting mechanism significantly improves LLMs' reasoning. Furthermore, the authors find that diversity (different models) is critical for performance. Again inspired by Minsky, MindStorms [Zhuge et al., 2023] introduces an LLM-based implementation on top of the CAMEL framework [Li et al., 2023a].
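The confidence-weighted voting idea can be sketched simply: each agent reports an answer together with a self-assessed confidence, and answers are aggregated by total weight, so one confident dissenter can outweigh several unsure agents. A minimal sketch (not the exact mechanism of any one paper):

```python
# Sketch of confidence-weighted voting among debating agents: each
# agent's answer counts in proportion to its self-reported confidence.

from collections import defaultdict

def confidence_weighted_vote(proposals):
    """proposals: list of (answer, confidence) pairs from the agents."""
    weights = defaultdict(float)
    for answer, confidence in proposals:
        weights[answer] += confidence
    return max(weights, key=weights.get)

votes = [("42", 0.9), ("41", 0.3), ("41", 0.4), ("42", 0.2)]
print(confidence_weighted_vote(votes))  # "42" wins: weight 1.1 vs 0.7
```

Compared to plain majority voting, this aggregation lets calibrated confidence break ties and down-weights agents that are themselves unsure, which is one reason model diversity helps: uncorrelated errors rarely accumulate weight on the same wrong answer.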
For MindStorms, extensive experiments are reported with up to 129 agents solving common AI problems: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. The authors found that, in specific applications, mindstorms among many members outperform those among fewer members, and longer mindstorms outperform shorter ones. Related to debate and discussion setups, researchers have explored teacher-learner dynamics with LLMs, where an expert LLM provides hints or feedback to a less capable LLM on a task, mirroring human tutoring [Zhou et al., 2024c]. These role alignments leverage the idea that one agent's knowledge or oversight can correct the other's mistakes, leading to more robust performance. AutoGen [Wu et al., 2023] is designed to facilitate the development of multi-agent LLM applications that span a broad spectrum of domains and complexities. Its programming paradigm is centered around agent-agent conversations. Experiments demonstrate the effectiveness of the framework in example applications ranging from mathematics, coding, question answering, operations research, and online decision-making to entertainment. ChatEval is a multi-agent system to improve text evaluation [Chan et al., 2023]. Noting that the quality of human text evaluation improves when multiple annotators collaborate, the authors created a multi-agent debate framework, moving beyond single-agent prompting strategies, with debater agents, diverse role specifications, and different communication strategies (see Figure 23). Sotopia is another role-playing environment for multi-agent interaction [Zhou et al., 2023b]. In Sotopia, agents coordinate, collaborate, exchange, and compete with each other to achieve complex social goals.
In experiments with LLM agents and humans, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. The contrast between GPT-4's lower performance in Sotopia and its good performance on other metrics of social reasoning (see Section 4.1.3) is most likely explained by Sotopia's focus on strategizing and goal-directedness, aspects on which GPT-4 is known to score lower [Hou et al., 2024].

Fig. 23. Three different communication strategies in ChatEval [Chan et al., 2023]

To simulate strategic interaction and cooperative decision-making, researchers have introduced GovSim [Piatti et al., 2024]. They study how ethical considerations, strategic planning, and negotiation skills impact cooperative outcomes. Most LLMs fail to achieve an equilibrium, since they fail to understand the long-term effects of their actions on the group. GPT-4o performed best. Interestingly, the introduction of a special universalization reasoning language [Levine et al., 2020] (prompting models to ask the Kantian question: what if everybody does that?) allowed more models to achieve a sustainable outcome. Related results were demonstrated in NegotiationArena, introduced by Bianchi et al. [2024]. They showed how LLM agents can conduct complex negotiations through flexible dialogue in negotiation settings. The flexible dialogues significantly improved negotiation outcomes by employing different behavioral strategies. Social interaction in an extreme setting was studied by Campedelli et al. [2024]. Inspired by the Stanford Prison Experiment [Zimbardo, 1972], the emergence of persuasive and abusive behavior is studied in a setting of prisoners versus prison guards.
It was found that the assigned personalities of prisoner and guard impact both persuasiveness and the emergence of anti-social behavior. Anti-social behavior emerged simply from assigning the agents' roles, paralleling the original experiments with human participants. 4.3 Simulating Open-ended Societies Agentic LLMs have enhanced abilities for perception, memory, reasoning, decision-making, and adaptive learning. They can display heterogeneous personality profiles [Gao et al., 2023a, 2024]. Such features also make them suitable for interacting in open-ended multi-agent simulations without prior role assignment. This allows the study of emergent phenomena such as self-organizing behaviors, collective intelligence, and the development of social conventions and norms. Being able to simulate such phenomena more realistically, using heterogeneous agents that communicate in natural language, meets long-standing interests from the social sciences. The structure of LLM-based agents suitable for such simulations is illustrated in Figure 24 and Figure 25. 4.3.1 Simulacra and Societies. Park et al. [2023] introduced Generative Agents, an environment where users can interact with a simulated town populated by 25 LLM-based agents. Based on social simulacra techniques proposed earlier [Park et al., 2022], each agent was initialized with a unique persona and memory. For each agent, a record is kept of all experiences and conversations in the simulation, which is used to synthesize higher-level reflections and to plan behavior. The agents behave somewhat like characters in The Sims: they initiate conversations, form relationships, spread information, and coordinate impromptu group activities. Figure 26 depicts the agent architecture and Figure 27 shows an illustration of a simulation. The interactions are influenced by user input and are therefore semi-autonomous.
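The memory mechanism of Park et al. [2023] scores each stored record on recency, importance, and relevance to the current situation, and retrieves the top-scoring memories to feed into reflection and planning. A simplified sketch; the keyword-overlap relevance below stands in for the embedding similarity used in the original:

```python
# Simplified sketch of memory retrieval in a Generative Agents-style
# architecture: score each memory on recency (exponential decay),
# importance (a fixed rating here; LLM-rated in the original), and
# relevance (keyword overlap here, embedding similarity in the original).

def recency(now, timestamp, decay=0.99):
    return decay ** (now - timestamp)

def relevance(query, text):
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(memories, query, now, k=2):
    """memories: list of (timestamp, importance in [0, 1], text)."""
    scored = [
        (recency(now, ts) + importance + relevance(query, text), text)
        for ts, importance, text in memories
    ]
    return [text for _, text in sorted(scored, reverse=True)[:k]]

memories = [
    (1, 0.2, "ate breakfast at the cafe"),
    (5, 0.9, "planned a party with Isabella"),
    (9, 0.1, "watered the plants"),
]
print(retrieve(memories, "who is coming to the party", now=10))
```

The combined score is what lets an important but older memory (the party plan) outrank a fresher but trivial one, which is essential for agents whose experience streams quickly outgrow the context window.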
This is illustrated by the example of a Valentine's Day party: while multiple agents spread invitations to one another and show up at the right time with coordinated plans, the plan for the party was initiated with a user prompt.

Fig. 24. Agent-based Modeling and LLM-agents [Gao et al., 2024]

Fig. 25. Structure of LLM-agents for multi-agent modeling [Gao et al., 2023a]

The agents developed believable social routines (such as daily schedules and gossip) and even exhibit human-like character traits (some agents demonstrated deception or stubbornness, while others showed cooperation). These results show that social patterns can emerge from dynamic LLM interactions. AgentSociety is a simulation at a larger scale, involving over 10,000 agents [Piao et al., 2025b,a]. It aims not only to study everyday social dynamics, but also offers a testbed for computational social experiments. The authors discuss case studies of polarization, the spread of inflammatory messages, the effects of universal basic income policies, and the impact of external shocks such as hurricanes. Li et al. [2024c] also study the spread of misinformation using LLM agents. Their agents exhibit diverse profiles in terms of gender, age, and the Big Five personality traits. One finding is that encouraging comments does not significantly reduce the spread of misinformation, whereas publicly labeling information with accuracy scores and blocking specific influencers proved to be effective strategies, particularly in scale-free networks. AgentVerse is a multi-agent system to study group dynamics [Chen et al., 2023c]. Inspired by human group dynamics, it studies whether a group of expert agents can be more than the sum of its parts. Experiments on text understanding, reasoning, coding, tool utilization, and embodied AI confirm its effectiveness.
Problem solving is split into four stages: (1) expert recruitment, (2) collaborative decision making, (3) action execution, and (4) evaluation; if the current state is unsatisfactory, a new iteration of the process is started for refinement (see Figure 28).

Fig. 26. Architecture of LLM-agents that can Perceive, Remember, Reflect, Retrieve, and Plan [Park et al., 2023]

Fig. 27. Illustration of the Generative Agents Simulation Featuring 25 Agents [Park et al., 2023]

Fig. 28. Four Stages for Decision Making in AgentVerse [Chen et al., 2023c]

Interestingly, agents manifest emergent behaviors such as volunteering, characterized by agents offering assistance to peers, and conformity, where agents adjust deviant behaviors to align with the common goal under criticism from others. Destructive behaviors were also observed, occasionally leading to undesired and detrimental outcomes. OASIS is a scalable social media simulator for Twitter/X and Reddit [Yang et al., 2024e]. It supports the modeling of up to one million LLM agents. It is built on CAMEL and takes role-based agents as its starting point. At its large scale, OASIS shows various social group phenomena, including the spreading of (mis)information, group polarization, and herd effects. OASIS is built upon an Environment Server, Recommender System, Agent Module, Time Engine, and Scalable Inferencer (see Figure 29). Research predating LLMs already shows that agent societies may create an automated curriculum of ever-increasing difficulty [Elman, 1991, Bengio et al., 2009, Silver et al., 2017, Soviany et al., 2022], requiring increasing levels of intelligent behavior from the agents [Racaniere et al., 2019].
Similar results have been observed for LLMs [Feng et al., 2023]. WebArena is an environment developed to study self-evolving curricula [Qi et al., 2024], which can also help robot training [Ryu et al., 2024] or mitigate hallucination [Zhao et al., 2024b].

4.3.2 Emergent Social Norms. Social norms play an important role in the predictability of individuals in groups [Axelrod, 1981, 1986]. Cultural evolution studies how norms evolve at the level of a society when individuals transmit behavior through imitation, communication, and education [Boyd and Richerson, 1988]. LLMs endow agents with the ability to communicate in natural language and have created more opportunities for multi-agent research into societies and the emergence of conventions and norms. Extensive overviews of these new possibilities are provided in [Mou et al., 2024, Savarimuthu et al., 2024, Xi et al., 2023]. We discuss some of the new approaches in more detail. EvolutionaryAgent [Li et al., 2024b] studies agent alignment in a multi-agent system, with evolutionary methods that go beyond Reinforcement Learning from Human Feedback (see Figure 30). In the context of aligning agents to norms, the approach is controlled: it does not permit the evolution of social norms to be disorderly or random, but it also does not intervene in each step of their evolution. The authors define the initial social norms and a desired direction of evolution. Agents with higher fitness (more norm-conforming) are more likely to reproduce, leading to the diffusion of their strategies, gradually stabilizing and forming new social norms. Defining a complete, realistic, and complex virtual society is challenging. The purpose of the work is to study how, if such a virtual society existed, a system could further enable evolving intricate evolutionary behaviors of agents,
and how this would lead to the emergence of new dynamics. The system provides a sandbox for investigating the safety of AI systems before they impact the real world.

Fig. 29. Components of OASIS [Yang et al., 2024e]

Fig. 30. Overview of EvolutionaryAgent [Li et al., 2024b]

A different approach is based on Steels [1995]'s naming game, implemented with agents powered by LLaMa 3 and Claude 3.5 [Kouwenhoven et al., 2024, Ashery et al., 2024, Baronchelli, 2023]. They find that globally accepted conventions or norms can spontaneously arise from local interactions between communicating LLMs. The authors also demonstrate how strong collective biases can emerge during this process, even when individual agents appear to be unbiased, and how minority groups of committed LLMs can drive social change by establishing new social conventions that can overturn established behaviors.

Fig. 31. Agents with Different Traits make up Diverse Machine Societies [Zhang et al., 2023a]

The emergence of norms is studied at another level by Horiguchi et al. [2024]. They explore the potential for LLM agents to spontaneously generate and adhere to normative strategies, building upon the foundational work of Axelrod's metanorm games. Metanorms are norms enforcing the punishment of those who do not punish agents that break norms [Axelrod, 1986]. Controlling for the personality traits vengefulness and boldness, they find that LLM agents can form complex social norms and metanorms through dialogue, purely through natural language interaction. A related study evaluates the capability of LLMs to detect norm violations [He et al., 2024].
Based on simulated data from 80 stories in a household context, they investigated whether 10 norms are violated, and found that ChatGPT-4 is able to detect norm violations, with Mistral some distance behind. Qiu et al. [2024a] go a step beyond norms and study the cultural and social awareness of LLM agents. They introduce CASA, a benchmark designed to assess LLM agents' sensitivity to cultural and social norms across two web-based tasks: online shopping and social discussion forums. (CASA is based on WebArena [Qi et al., 2024].) Current LLMs perform significantly better in non-agent than in web-based agent environments, with agents achieving less than 10% awareness coverage and over 40% violation rates. However, with prompting and finetuning on specific datasets, cultural and social awareness can be improved. Inspired by Society of Mind [Minsky, 1988], cooperation mechanisms are explored in Zhang et al. [2023a]'s agentic LLM simulation. This simulation consists of four unique societies of LLM agents, where each agent is characterized by a specific trait (easy-going or overconfident) and engages in cooperation with a distinct thinking pattern (debate or reflection). They find that LLM agents show human-like social behaviors, such as conformity and consensus reaching, mirroring foundational social psychology theories. Figure 31 shows societies with different types of agents. The question whether groups of LLM agents can successfully engage in cross-national collaboration and debate is studied by Baltaji et al. [2024]. Multi-agent discussions can support collective AI decisions that reflect diverse perspectives, although agents are susceptible to conformity due to perceived peer pressure. They can also lose track of their personas and opinions.
Instructions that encourage debate increase the risk of errors.

4.3.3 Open-World Agents. An important driver of agentic LLM research is the problem of plateauing LLM performance due to limited training data. Open-world multi-agent interaction aims to address this problem by generating new interaction data with multi-agent simulations. Machine learning can learn no more complexity than is present in the dataset (or environment). The idea of an open world model is that it can create unlimited datasets or environments, in which agents can continue to learn and keep improving their intelligence. How should such unlimited challenges be created? The advent of LLMs has given a new impulse to this research question: LLMs are used to solve LLM-generated problems. This idea is followed, for example, in the multi-agent finetuning approach [Subramaniam et al., 2025]. Current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. Zhou et al. [2023a] build an environment, WebArena, that is more realistic and reproducible. WebArena is an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Games are eminently suited as open-ended benchmarks for interactive behavior. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies. Balrog [Paglieri et al., 2024] incorporates reinforcement learning environments of varying levels of difficulty, ranging from tasks that are solvable by non-expert humans in seconds to challenging ones that may take years to master (such as the NetHack Learning Environment). They find that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks such as vision-based decision-making.
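The open-world idea of LLMs solving LLM-generated problems can be caricatured as a generator/solver loop. The following minimal sketch uses toy arithmetic tasks standing in for LLM-generated challenges; all function names are hypothetical and the "learning" is a simple counter, not a real training step:

```python
import random

# Toy sketch of an open-world autocurriculum: a generator proposes ever
# harder tasks and a solver attempts them; in the approaches above both
# roles would be played by LLMs (stubbed here with arithmetic).

def generate_task(difficulty, rng):
    # Generator: harder tasks have more operands.
    terms = [rng.randint(1, 9) for _ in range(difficulty + 1)]
    return terms, sum(terms)

def solve_task(terms, skill):
    # Solver: succeeds only on tasks within its current skill level.
    return sum(terms) if len(terms) <= skill + 1 else None

def autocurriculum(rounds=50, seed=0):
    rng = random.Random(seed)
    difficulty, skill = 1, 1
    for _ in range(rounds):
        terms, answer = generate_task(difficulty, rng)
        if solve_task(terms, skill) == answer:
            difficulty += 1   # the generator raises the bar
            skill += 1        # the solver "learns" from the solved task
    return difficulty

print(autocurriculum())  # difficulty grows as long as the solver keeps up
```

The point of the sketch is the feedback structure: task difficulty and solver capability ratchet upward together, which is what the automated-curriculum literature cited above aims for at scale.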
Progress in machine learning depends on benchmark availability. As models evolve, there is a need to create benchmarks that can measure progress on new generative capabilities [Butt et al., 2024]. BenchAgents decomposes the benchmark creation process into planning, generation, data verification, and evaluation, each of which is executed by an LLM agent. These agents interact with each other and utilize human-in-the-loop feedback to explicitly improve and flexibly control data diversity and quality. BenchAgents creates benchmarks to evaluate capabilities related to planning and constraint satisfaction. AgentBoard provides an evaluation of the breadth of existing benchmarks [Ma et al., 2024a]. Benchmarks should have task diversity: it is necessary to cover various agent tasks, such as embodied, web, and tool tasks. Additionally, multi-round interaction is important to mimic realistic scenarios, yet existing benchmarks typically adopt single-round tasks. Furthermore, agents should be evaluated in partially observable environments, to test whether they can actively explore their surroundings. Existing agent benchmarks fail to satisfy all of these criteria [Ma et al., 2024a].

4.4 Discussion

In this third part of the taxonomy, the focus was on agents that interact with other agents, both human and artificial. The goal is to understand social interaction, from interaction in conversations, social scenarios, and dilemmas, to role-playing in duos and small teams, to large-scale open-ended emergent behavior at society level.

4.4.1 Interaction Studies. Over the past years, LLMs have provided us with new instances of human-machine interaction. Users across the globe have engaged in chat conversations seeking assistance with tasks in their professional or private lives.
To engage in such interactions, LLMs rely on functions learned during training that we can recognize as social, including abilities for conversation, politeness and etiquette, handling of emotional and affective states, strategizing, and theory of mind. Such abilities form the basis not only for human-machine interaction, but also for LLM-driven machine-machine interactions, as we discuss next.

When interacting in multi-agent environments, agentic LLMs show varying levels of performance on games requiring strategic and coordinated behavior. Enhancing models with reasoning capacities boosts performance, as evidenced by GPT-o1's better overall performance and the positive effect of adding probabilistic graphs [Xu et al., 2024a]. Pre-defined roles and interaction protocols (cooperative or adversarial) help structure the communication between LLM agents while improving task performance. Role-playing frameworks, AI feedback loops, and debate moderation suggest that carefully coordinating multiple LLMs can harness their collective intelligence and yield outcomes that surpass single-model performance. We have covered open-ended multi-LLM simulations without prior role assignment. These simulations give a new impulse to long-standing interests in the social sciences to model self-organizing behaviors, collective intelligence, and the development of social conventions and norms. The scale of such simulations varies from a few interacting agents up to a million. Emergent behaviors are observed, such as coordination through norms and social structures that form spontaneously. In open-world approaches, LLMs are used to create increasingly complex challenges and solutions for LLM agents.

4.4.2 In Depth: CAMEL and Generative Agents. We have surveyed the individual methods for social interaction.
In order to dig deeper and highlight some of the issues in this area, we now discuss two approaches, CAMEL and Generative Agents, in more detail.

CAMEL. In Section 4.2.2 we saw how LLMs can be prompted to perform different roles and work together in solving tasks. CAMEL [Li et al., 2023a] is a multi-agent system that has been designed to perform this role-playing-for-problem-solving task. Just like the assistants in Section 3.3, the goal is to solve a task, for example in medicine, finance, or computer programming (see also Figure 22). One approach is to write a single monolithic prompt for the LLM in which all the instructions to solve the task are specified (or so the author of the prompt hopes). To measure the success of this approach, a benchmark of test cases can be created that the LLM has to solve. The approach of CAMEL is different. CAMEL starts from the idea that specialized agents, with different prompts, may work better. Furthermore, the idea of CAMEL is that the different agents may work better together, and that new problem-solving approaches may emerge that were not present in the single monolithic prompt or in the initial prompts of the individual agents. CAMEL is a multi-agent system that consists of two agents, an assistant and a user. The user has the domain knowledge and can provide a task specification, as well as feedback on intermediate deliverables that the assistant produces. The assistant has certain skills, for example Python coding, to write a program that solves the problem the user agent specifies. Both agents are given a so-called inception prompt, an initial idea. They then send further messages (prompts) to each other as they work towards solving the task. The CAMEL paper describes an experiment where a powerful LLM, GPT-4, generates the initial prompt, and a cheaper LLM, GPT-3.5-turbo, executes the further steps.
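The role-playing loop just described can be sketched in a few lines of Python. This is a schematic illustration, not CAMEL's implementation: `call_llm` is a hypothetical stub standing in for a real model call, and `is_constructive` mimics only two of the checks that CAMEL's critic performs:

```python
# Schematic sketch of CAMEL-style role play. A real system would call an
# LLM API in call_llm; the critic check below guards against empty replies
# and verbatim repetition, two of the failure modes CAMEL reports.

def call_llm(role_prompt, history):
    # Stub standing in for an LLM call conditioned on a role prompt.
    return f"turn {len(history)}: acting as '{role_prompt[:12]}...'"

def is_constructive(message, history):
    # Critic: reject empty replies and verbatim repetition.
    return bool(message.strip()) and message not in history

def role_play(task, max_turns=40):
    user_prompt = f"You specify requirements for: {task}"     # inception
    assistant_prompt = f"You write Python code for: {task}"   # prompts
    history = []
    for turn in range(max_turns):
        role = user_prompt if turn % 2 == 0 else assistant_prompt
        message = call_llm(role, history)
        if not is_constructive(message, history):
            break  # halt degenerate exchanges instead of looping forever
        history.append(message)
    return history

transcript = role_play("a trading bot", max_turns=6)
print(len(transcript))  # 6 alternating user/assistant messages
```

The depth limit (`max_turns`) and the constructiveness check correspond to the 40-message cap and the third critic agent described in the experiments below.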
Ten different tasks have been simulated by a combination of 50 agents and 50 users, for a total of 25,000 conversations (depth-limited to 40 messages). In such a setup, where agents work with agents, the training process may become unstable. CAMEL reports problems where the assistant simply repeats instructions (instead of answering them), as well as fake replies, empty replies, infinite loops of messages, and role reversal. A special (third) critic agent was introduced to ensure a constructive communication process between the agents. The final experiments report success: the multi-agent system performed better than a single prompt, with performance on the HumanEval benchmark improving from 30% to around 50%. The CAMEL experiments showed that a multi-agent LLM system can be used for problem solving, and the authors showed how stable learning can be achieved when attention is paid to a constructive communication process.

Generative Agents. The architecture of CAMEL recognized two types of agents, assistant and user. The next two papers that we discuss, Park et al. [2023, 2024], are about simulating a society with a larger number of agents, from 10 to 1000. The first paper [Park et al., 2023] introduces the generative agents architecture, designed with the aim of generating believable proxies of human behavior. The architecture consists of a memory, a planning module, and a reflection module. In this design, agents are given roles and operate in a Sims-like environment. An example is described where they decide to organize a Valentine's Day party (see also Figure 27, and the live web simulation⁵). The focus of this work is on achieving believable social behavior that is unscripted: behavior that emerges from the communications of the agents.
We quote the paper: "agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day." The number of agents is larger than in CAMEL, around 5-15, and the agent architecture is also more involved. Where in CAMEL the behavior was specified fully in-context, in the prompts that are exchanged, in Generative Agents there are additional external algorithms for memory, planning, and reflection. The paper further notes that "the new goals require architectures that manage constantly-growing memories as new interactions, conflicts, and events arise and fade over time while handling cascading social dynamics that unfold between multiple agents," and that success requires "an approach that can retrieve relevant events and interactions over a long period, reflect on those memories to generalize and draw higher-level inferences, and apply that reasoning to create plans and reactions that make sense in the moment and in the longer-term of the agent's behavior." The LLM that is used is GPT-3.5, the same as in CAMEL. Whether the Generative Agents system generates believable behavior is evaluated with the help of 100 human evaluators. These were enlisted to rank the believability of the communication patterns on four categories, by interviewing the agents to probe their ability to (1) remember past experiences, (2) plan future actions based on their experiences, (3) react appropriately to unexpected events, and (4) reflect on their performance to improve their future actions. The evaluation studied the results of four different agent architectures (the full architecture, and ablations with no observation, no reflection, and no planning), and found that the full architecture performed best. They also found that generative agents remember with embellishments, and that reflection is required for synthesis of memories.
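The retrieval step at the heart of this memory architecture can be sketched as a score that combines recency, importance, and relevance, with the top-scoring memories fed into the agent's prompt. The weights, the exponential decay, and the word-overlap relevance below are illustrative stand-ins; the paper computes relevance with embedding similarity and lets the LLM rate importance:

```python
# Sketch of generative-agents-style memory retrieval: score each memory by
# recency (exponential decay), importance, and relevance to the current
# situation, then keep the top-k. All constants here are illustrative.

def recency(now, timestamp, decay=0.99):
    return decay ** (now - timestamp)

def retrieve(memories, query_words, now, k=2):
    def relevance(text):
        # Word overlap as a crude stand-in for embedding similarity.
        words = set(text.lower().split())
        return len(words & set(query_words)) / max(len(query_words), 1)
    scored = [
        (recency(now, m["t"]) + m["importance"] + relevance(m["text"]), m)
        for m in memories
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m["text"] for _, m in scored[:k]]

memories = [
    {"t": 0,  "importance": 0.9, "text": "planned the Valentine party"},
    {"t": 90, "importance": 0.2, "text": "ate breakfast"},
    {"t": 50, "importance": 0.5, "text": "invited Klaus to the party"},
]
top = retrieve(memories, {"party", "valentine"}, now=100)
print(top)  # the two party-related memories win despite being older
```

An old but important and relevant memory can outrank a fresh trivial one, which is exactly the behavior the quoted passage asks of an architecture that must "retrieve relevant events and interactions over a long period."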
The evaluation did find evidence of emerging communication, relationship building, and coordination. They also found evidence of erratic behavior, hallucination, and misclassification, such as agents trying to enter stores after closing time, not understanding the concept of closing time.

Simulating 1000 People. We now turn to the second paper [Park et al., 2024]. Multi-agent simulations can be used to study different aspects of emergent individual and social human behavior. An important methodological challenge in prompt-based simulation studies is to determine how much of the behavior is scripted, and how much emerges. In the second study, Park et al. [2024] focus on how realistic the behavior of synthetic agents can be. The study is based on their earlier work, with LLM agents that have memory and reflection. A group of 1052 human individuals were recruited and asked to provide a two-hour-long interview. The interviews were standardized, administered by an AI interviewer. Next, LLM agents were trained on the audio interviews, yielding 1052 different LLM agent profiles. The LLM is prompted to replicate individuals' attitudes and behaviors and generate synthetic agents. To validate the accuracy of the personality and behavior of these generative synthetic agents, the agents were tested by interviewing them. As a control group, the human subjects were interviewed again two weeks after their initial interview, to control for natural variation between two interview sessions. The generative agents replicate participants' responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later, and perform comparably in predicting personality traits and outcomes in experimental replications. Subsequently, the synthetic agents have been made available for further experiments.

⁵See https://reverie.herokuapp.com/arXiv_Demo/
4.4.3 Emergent Collective Behavior. Emergent behavior, and especially emergent cooperation, is an important use case of agentic LLMs. It helps us understand our own behavior in society, and allows the study of agent behavior under artificial conditions, in what-if scenarios. When do we benefit from more competition, when from more cooperation, and in what form? What happens when (fake) information disseminates? How do societies respond to extreme circumstances, such as a natural disaster? As research on collective agent societies and emergent phenomena develops further, LLMs will exhibit more realistic behavior, new multi-agent infrastructures will be developed that allow more diverse types of interactions, and simulation studies will provide insight into social science questions. Topics of particular interest are the influence of LLMs on democratic processes and cyber security, role playing, societies of minds, theory of mind, curriculum learning, continuous learning, adversarial agents, and collaboration in the face of hierarchy. Furthermore, as our understanding of the conditions conducive to the emergence of cooperation grows, a focus on adaptive (social) intelligence may influence our views on the nature of intelligence and artificial (super)intelligence.

4.4.4 New Training Data. A final use case in this third part of the taxonomy is that new training data is generated by the interacting agents. Traditionally, LLMs are trained on a large static corpus of language data that is taken from the internet, and ultimately based on human actions, using self-supervised learning methods. As illustrated by the cycle in Figure 1, interacting agentic LLMs enable self-learning, in the style of reinforcement learning. Reinforcement learning is used increasingly in LLM training, for example to train reasoning models by OpenAI [Huang et al., 2024c, Wu et al., 2024] and DeepSeek [Guo et al., 2025].
New reinforcement learning methods such as GRPO [Shao et al., 2024] and RLVR [Lambert et al., 2024] already allow inference-time chains of thought to be used for finetuning. In reinforcement learning, agents choose their own actions in the world and are not limited by a pre-existing dataset. In principle, they can learn the full complexity of the world, including the effects of their own actions. A challenge in reinforcement learning is the instability caused by feedback loops. Past reinforcement learning successes have achieved stable training through diverse exploration and low learning rates, requiring large computational efforts [Silver et al., 2016, Vinyals et al., 2019, Brown and Sandholm, 2019]. Open-ended and open-world multi-agent simulation may provide an alternative way to create the necessary diversity for stable convergence.

5 General Discussion and Research Agenda

The interest in agentic LLMs is large, and many research efforts have appeared over a short period. We have reviewed the field, with an emphasis on the most recent works. First, there is an interest from society in agentic LLMs. Agentic LLMs can assist us in our daily lives in many ways: from writing essays, booking flights, having pleasant and interesting conversations, and folding our laundry, to making better medical diagnoses, performing better stock analyses, supporting healthy lifestyle changes, making sure we take our medicine, and assisting us when we are less mobile. Tool use by assistants is enabled by technology from the first category of the taxonomy: reasoning LLMs, self-reflection, and retrieval augmentation. Both reasoning and tool use support new forms of interaction with human and artificial agents, further enhancing the societal applicability of LLMs. Second, there is an interest from science in agentic LLMs, inside the AI research community and beyond.
Since LLM agents can now interact in natural language, agent behavior can be better understood, and multi-agent simulations can be made more realistic than before. Important questions in social and political science can be researched, such as in game theory (social dilemmas), social interaction (negotiation, theory of mind), and societal dynamics (cooperation, norms, extreme situations). Some of these goals are within reach, some have been realized already, and some are becoming a possibility. Also in research applications, agent interactions are enabled by the previous two categories: social behavior benefits from reasoning and self-reflection, social actions are increasingly grounded, and information can be retrieved to further enhance understanding of social contexts.

Fig. 32. LLM as the Policy of a Reinforcement Learning Agent

Finally, agentic LLMs generate data that can augment inference-time behavior and on which models can be further pretrained and finetuned, improving LLMs beyond the plateau researchers have observed recently. Figure 1 illustrates this cycle of continuous improvement.

5.1 Research Agenda for Agentic LLMs

Our survey has yielded interesting directions for a research agenda for agentic LLMs, which we now discuss in more detail. Please refer to Table 4 for a summary of the agenda.

Training Data. The benefit from the language corpora that are used for pretraining LLMs is said to be plateauing. To further improve the performance of LLMs on language (and reasoning) tasks, it is important to continue to acquire training data that is sufficiently novel and challenging from a token-prediction point of view. Such data can be generated by making LLMs interact with the world at inference time.
Currently, in most approaches that were discussed in Section 2, inference-time compute is only used to improve performance on reasoning benchmarks. In most early Chain of Thought approaches the generated data is not used after the answer has been calculated. In other approaches, such as SayCan, Inner Monologue, and Vision-Language-Action models, data that is generated at inference time is used to augment the finetuning dataset, creating an inference-time finetuning feedback loop, so that the model's parameters are trained on its own earlier reasoning. Such feedback loops are common in reinforcement learning, where agents act and receive feedback from their environment. In games, a self-learning loop can be created [Plaat, 2022]. In AlphaGo Zero this approach yielded good results, although at the cost of careful tuning of hyperparameters and algorithms to ensure sustained convergence of the learning process [Silver et al., 2017]. Similar results were achieved in other challenging games, such as StarCraft [Vinyals et al., 2019], Stratego [Perolat et al., 2022], DOTA 2 [Berner et al., 2019], Diplomacy [Bakhtin et al., 2022], and Poker [Brown and Sandholm, 2019]. More formally, in the traditional self-supervised view a model 𝑀 is trained to predict label 𝑦 from input variable 𝑥 in dataset 𝐷; in reinforcement learning, an agent's policy 𝜋 is trained with reward 𝑟 to perform action 𝑎 that changes state 𝑠 of its environment 𝐸. In agentic LLMs, both views are joined: agentic LLMs use a language model 𝑀 as the policy 𝜋 that determines the agent's next action (see Figure 32). Actions can be used to retrieve information, to split a larger problem into smaller parts, to run a tool, to use memory to reflect on the agent's own actions, to suggest stock trades, to book travel tickets, or to interact with other agents working towards a common goal.
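A minimal sketch of this LLM-as-policy loop, with a hypothetical `llm_policy` stub in place of a real model call and a toy environment in place of the real world:

```python
# Sketch of the LLM-as-policy view: the language model plays the role of
# the policy pi, mapping the environment state to the next action; the
# collected (state, action, reward) rollouts can later be reused as
# finetuning data, closing the learning loop.

def llm_policy(state):
    # A real agent would prompt an LLM with a state description here.
    return "search" if "unknown" in state else "answer"

def environment_step(state, action):
    # Toy environment: searching resolves the unknown, answering ends the task.
    if action == "search":
        return state.replace("unknown", "known"), 0.0, False
    return state, 1.0, True

def rollout(state, max_steps=10):
    trajectory = []
    for _ in range(max_steps):
        action = llm_policy(state)                        # pi(s) -> a
        state, reward, done = environment_step(state, action)
        trajectory.append((state, action, reward))        # reusable training data
        if done:
            break
    return trajectory

traj = rollout("question with unknown entity")
print([a for _, a, _ in traj])  # the policy searches, then answers
```

The essential point of the formalization above survives even in this toy: the model is queried per step as a policy, and the environment, not a static dataset, supplies the states it learns from.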
The approach that worked well in games of strategy is now also successfully used in robotics, in the creation of Vision-Language-Action models [Black et al., 2024, Brohan et al., 2023]. VLAs that are trained on self-generated action sequences show zero-shot generalization results in domestic tasks (kitchen tasks, folding laundry) that had not been achieved by other machine learning methods. Recently, reasoning models such as DeepSeek [Guo et al., 2025] and Kimi [Du et al., 2025] are also being trained with reinforcement learning. Popular methods for finetuning on mathematics and coding tasks are GRPO [Shao et al., 2024] and RLVR [Lambert et al., 2024]. Other uses of agents for finetuning are reported by Subramaniam et al. [2025]. Reuse of inference-time results for finetuning and pretraining closes the learning loop (see Figure 1), and is the first item on the agenda for further research. It is interesting to see how the reinforcement learning methods that worked well for games of strategy and certain finetuning tasks are being translated to LLMs that act in the real world. VLAs integrate multiple modalities: language, visual information, and actions. Further modalities are speech, other audio signals, and video. Electrical signals, such as brain or muscle activation, can also provide valuable inputs for the models to learn from.

Hallucination and Stable Behavior. A challenge for the virtuous autocurriculum cycle is that LLMs hallucinate, and in multi-step reasoning errors can easily accumulate. LLM answers may look good, but be factually wrong. Reasoning chains may be unfaithful, giving good answers for the wrong reason, and wrong answers when least expected. Especially when such dubious results are used to further train the LLM, training may diverge and model collapse may occur.
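One widely used guard against such accumulating errors is self-consistency: sample several reasoning chains and keep the majority answer, so that a single hallucinated chain is outvoted. A minimal sketch, with a hypothetical stubbed sampler in place of repeated LLM calls:

```python
from collections import Counter

# Minimal self-consistency sketch: sample several independent reasoning
# chains and return the majority answer. sample_chain is a stub; a real
# system would sample the LLM at nonzero temperature.

def sample_chain(question, seed):
    # Stub: two of three "chains" reach the right answer, one hallucinates.
    answers = {0: "42", 1: "42", 2: "41"}
    return answers[seed % 3]

def self_consistent_answer(question, n_samples=3):
    votes = Counter(sample_chain(question, s) for s in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistent_answer("6 * 7 = ?"))  # majority vote picks "42"
```

Majority voting only helps when errors are uncorrelated across samples; if most chains share the same systematic mistake, voting entrenches it, which is one reason the divergence risk discussed above remains on the research agenda.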
In social simulations, emergent behavior patterns, such as cooperation, fairness, trust, or norms, may collapse. Therefore, in multi-step reasoning, self-verification and self-consistency methods were developed to address error accumulation. In reinforcement learning, exploration and diversity are important methods to ensure good coverage of the state space. In social simulations and gaming, open-world models and open-ended behavior are used to stimulate exploration and diversity. Such models can provide suitable environments for the automated generation of training curricula. Faithfulness of Chain of Thought reasoning is studied by Lyu et al. [2023], Lanham et al. [2023], Turpin et al. [2024]. Mechanistic interpretability can provide ways to look inside the LLM, to better understand whether the model follows the reasoning steps that we expect it to take [Nanda et al., 2023, Bereska and Gavves, 2024, Ferrando et al., 2024, Chen et al., 2025]. The conditions that influence the stability of emergent behavior (cooperation, fairness, trust) may be studied further. For agentic LLMs that learn from their own results, other methods must be developed; hallucination features prominently on the research agenda for agentic LLMs, with mechanistic interpretability and open-world models as important items. Agent Behavior at Scale. Studies of emergent behavior need realistic agent behavior, and we expect more research aimed at improving agent behavior, for example by closely modeling human behavior in generative agents [Park et al., 2024]. Some behavior patterns in multi-agent simulations only emerge at scale, as studies with specialized agent infrastructures have shown [Park et al., 2023, Yang et al., 2024e, Wu et al., 2023]. However, the number of LLM agents that can be simulated reliably is often limited. Although open-ended simulations show improved scalability, we believe that more research into the scaling of simulations with complex agents is necessary.
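Of the error-accumulation countermeasures above, self-consistency is the simplest to sketch: sample several reasoning chains for the same question and take a majority vote over their final answers. The sketch below assumes the final answers have already been extracted from the chains; real implementations also parse and normalize the answers first.

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Self-consistency: given the final answers of several
    independently sampled reasoning chains, return the majority
    answer, damping errors from any single hallucinated chain."""
    return Counter(answers).most_common(1)[0][0]

# Final answers extracted from five independently sampled chains.
best = self_consistency(["42", "41", "42", "42", "17"])
```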
Related to the challenge of scale is the cost of training LLMs. Pretraining and finetuning an LLM is expensive. Knowledge distillation is a popular method to extract essential knowledge and behavior from a large model into a small model at lower computational cost [Xu et al., 2024b]. Experiments have shown that reasoning steps can be distilled from large to smaller language models [Gu et al., 2023, Li et al., 2023b, Muennighoff et al., 2025]. Knowledge distillation in LLM agents is an important item for our research agenda. Another aspect of agentic LLM research is the study of emergent behavior, of cooperation and trust in agentic societies. The debate on artificial super-intelligence is fueled, in part, by the growing performance of individual LLMs. Studies of emergent agent behavior at scale may show us when cooperation and trust emerge, may influence our view on the nature of intelligence, and may thus influence the discussion on artificial super-intelligence and the future of society. Furthermore, the world around us is organized in groups in which power hierarchies are prevalent. Many multi-agent simulations assume a flat power hierarchy; multi-agent simulations should also go beyond equality. Self-Reflection. Self-reflection mechanisms are used in advanced prompt-improvement algorithms. Handwritten external prompt-management algorithms can be error-prone and brittle. An alternative is to let the LLM perform the self-reflection and step-by-step management internally, as in the original Chain of Thought (implicit reasoning [Li et al., 2025a]). DeepSeek R1 [Guo et al., 2025] is a reasoning model that is trained (finetuned) with the GRPO reinforcement learning method [Shao et al., 2024].
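The knowledge-distillation idea discussed above can be made concrete with the classic soft-target objective: the student is trained to match the teacher's temperature-softened output distribution via a KL divergence. This is a minimal sketch of that core idea only; reasoning-distillation setups additionally use the teacher's chains of thought as targets, and the logits below are invented for illustration.

```python
import math

def softmax(logits: list[float], T: float) -> list[float]:
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      T: float = 2.0) -> float:
    """KL divergence between teacher and student distributions,
    both softened with temperature T (the Hinton-style objective)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits give zero loss; a mismatched student is penalized.
loss_same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
loss_diff = distillation_loss([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
```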
The model is trained on its own reasoning results, and was found to self-reflectively reason over its own results, identifying effective reasoning patterns implicitly. Schultz et al. [2024] train a model on search sequences [Gandhi et al., 2024] in games such as chess, and VLAs are trained on action sequences [Kim et al., 2024]. These works show that, in addition to implicit step-by-step reasoning, implicit search is possible. An open question is whether LLMs can perform self-reflection internally. By adding external state to an LLM, we enable reasoning and a form of self-reflection, which is a rudimentary form of metacognition (thinking about thinking). LLMs that reflect on their own behavior raise visions of true artificial intelligence. If LLMs can self-reflect, can they exhibit metacognition [Wang and Zhao, 2023, Didolkar et al., 2024]? Self-reflection by LLMs is another item for the research agenda. When we add outside state to the input prompts, the input to the LLM will differ based on the history, and so will the answers of the LLM. Differences in memory may be perceived by users as a personality of the LLM. The question of whether LLMs with outside memory exhibit a personality is a topic for future research. Self-reflective methods are being used to create agents that perform scientific discovery [Eger et al., 2025]. How these agents will influence, and possibly improve, the process of scientific discovery is an exciting area of research. Safety. Safety is a crucial issue for LLMs that act in the world. The problem is studied, but far from solved [Brunke et al., 2022, Andriushchenko et al., 2025, Samvelyan et al., 2024]. Actions by assistants and robots in the real world have real-world consequences. When a financial trading assistant hallucinates, or when a self-driving robot makes a wrong inference, questions of responsibility and liability must be addressed.
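One concrete form of such a safeguard is a rule-based guard that vets a proposed action before it is executed. The sketch below is purely illustrative: the blocked-pattern list and spending limit are invented for this example, and real deployments would layer learned safety classifiers and human oversight on top of such rules.

```python
# Illustrative action guard for an LLM assistant: a proposed action is
# executed only if it passes all rules. Patterns and limits are made up
# for this sketch; they are not from any real system.

UNSAFE_PATTERNS = ["transfer_funds", "delete_", "disable_safety"]
SPEND_LIMIT = 100.0  # amounts above this require human escalation

def guard_action(action: str, amount: float = 0.0) -> bool:
    """Return True if the proposed action may be executed."""
    if any(p in action for p in UNSAFE_PATTERNS):
        return False  # hard-blocked action class
    if amount > SPEND_LIMIT:
        return False  # defer to a human instead of acting
    return True

allowed = guard_action("book_ticket", amount=45.0)
blocked = guard_action("transfer_funds", amount=10.0)
too_big = guard_action("book_ticket", amount=500.0)
```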
More legal and ethical questions arise, for example, on privacy and fairness, and, possibly, concerning the rights of algorithmic entities [Harris and Anthis, 2021, Bengio and Elmoznino, 2025]. The application areas for the assistants in this survey (shopping, medical diagnosis, finance) are narrow. The narrower the application domain, the better the answers. Ensuring the safety of agentic LLMs requires moving beyond prompt-based defenses toward integrated, multilayer safeguards. Key directions include embedding explicit safety constraints within the agent's reasoning and planning pipeline and employing continual adversarial training and automated red-teaming to enhance robustness against manipulation. Further priorities are developing mechanisms for self-regulation and risk awareness, enabling agents to detect and avoid unsafe actions, and establishing rigorous, standardized safety benchmarks such as AgentHarm [Andriushchenko et al., 2025]. Together, these measures outline a roadmap for accountable and trustworthy deployment of agentic LLMs in high-stakes domains. Clearly, many safety, ethics, and trust issues will have to be addressed before the full breadth of the possibilities of agentic LLMs can be enjoyed. Safety and ethics will become important topics on the research agenda of agentic LLMs, deserving their own surveys and books [Gan et al., 2024, Jiao et al., 2025, Raza et al., 2025].

5.2 Conclusion

There is a large research activity on agentic LLMs. Already, robots show impressive generalization results, and so do assistants in medical diagnosis, financial market advising, and scientific research. Work processes in these and other fields may well be affected by agentic LLM assistants in the near future.

Table 4. Summary of the research agenda for agentic LLMs

Topic            Challenge
Training Data    Finetune with inference-time reasoning data
                 Convergent/stable reinforcement learning
                 VLA; multimodal signals, such as speech
Hallucination    Use self-verification
                 Use mechanistic interpretability
                 Use open-ended/open-world models for exploration
Agent Behavior   Scalable simulation infrastructure, role playing
                 Distill reasoning to small models
                 Models of agent and human behavior, emergent behavior, future of society
Self-Reflection  In-model self-reflection and metareasoning
                 Metacognition, personality
                 Automated scientific discovery
Safety           Assistants: responsibility, liability
                 Privacy, fairness of data
                 Wider application areas for assistants

The agentic LLMs in this survey have (1) reasoning capabilities, (2) an interface to the outside world in order to act, and (3) a social environment with other agents with which to interact. The categories of this taxonomy complement each other. At the basis is the reasoning technology of category 1. Robotic interaction and tool use build on grounded retrieval augmentation, social interaction (such as theory of mind) builds on self-reflection, and all categories benefit from reasoning and self-verification. Closing the cycle, the acting and interacting categories generate training data for further pretraining and finetuning of LLMs, beyond plateauing traditional datasets (Figure 1). The impressive generalization capabilities of Vision-Language-Action models are testament to the power of this approach. The reasoning paradigm connects to work on human cognition, and some papers anthropomorphize LLM computations in Kahneman's terms of System 1 thinking (fast, associative) and System 2 thinking (slow, deliberative). Works on reasoning focus on the intelligence of single LLMs. This individualistic view also gives rise to discussions about superintelligence, some utopian, some not. The agentic paradigm enables two elements of machine learning that are new for LLMs.
First, in reinforcement learning, agents self-reflect, choose their own actions, and learn from the feedback of the world in which they operate. Second, no dataset is needed, nor is learning limited by the complexity of a dataset; it is limited only by the complexity of the world around the agent. The agent paradigm creates a more challenging training setting, allowing agentic LLMs to keep improving themselves. The multi-agent paradigm studies agent-agent societies. The focus is on emergent behaviors such as egoism/altruism, competition/collaboration, and (dis)trust. Social cognitive development and the emergence of collective intelligence are also studied in this field. Connecting back to the reasoning paradigm, the collaborative view of multi-agent studies may inform discussions about (super)intelligence, teaching us about the emerging social behavior of LLM agents.

Acknowledgements

We thank Joost Broekens, Suzan Verberne, Annie Wong, Thomas Bäck, Zhaochun Ren, Thomas Moerland, Michiel van der Meer, Bram Renting, Frank van Harmelen, Tessa Verhoef, and Rob van Nieuwpoort for extensive and fruitful discussions. We thank the anonymous reviewers for valuable suggestions that have improved the article.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Blaise Agüera y Arcas. What Is Intelligence?: Lessons from AI About Evolution, Computing, and Minds. MIT Press, 2025.
Michael Ahn, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander T Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar Cortes, Nicolas Sievers, Clayton Tan, Sichun Xu, Diego Reyes, Jarek Rettinghouse, Jornell Quiambao, Peter Pastor, Linda Luu, Kuang-Huei Lee, Yuheng Kuang, Sally Jesmonth, Nikhil J. Joshi, Kyle Jeffrey, Rosario Jauregui Ruano, Jasmine Hsu, Keerthana Gopalakrishnan, Byron David, Andy Zeng, and Chuyuan Kelly Fu. Do as I can, not as I say: Grounding language in robotic affordances. In Karen Liu, Dana Kulic, and Jeff Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 287–318. PMLR, 14–18 Dec 2023. URL https://proceedings.mlr.press/v205/ichter23a.html.
Elif Akata, Lion Schulz, Julian Coda-Forno, et al. Playing repeated games with large language models. Nature Human Behaviour, 9:1380–1390, 2025. doi: 10.1038/s41562-025-02172-y. URL https://doi.org/10.1038/s41562-025-02172-y.
Zeynep Akata, Dan Balliet, Maarten De Rijke, Frank Dignum, Virginia Dignum, Guszti Eiben, Antske Fokkens, Davide Grossi, Koen Hindriks, et al. A research agenda for hybrid intelligence: augmenting human intellect with collaborative, adaptive, responsible, and explainable artificial intelligence. Computer, 53(8):18–28, 2020.
Ameen Ali, Thomas Schnake, Oliver Eberle, Grégoire Montavon, Klaus-Robert Müller, and Lior Wolf. XAI for transformers: Better explanations through conservative propagation. In International Conference on Machine Learning, pages 435–451. PMLR, 2022.
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, et al. AgentHarm: A benchmark for measuring harmfulness of LLM agents.
In International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2410.09024.
Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku, 2024. URL https://www-cdn.anthropic.com/Model_Card_Claude_3.pdf.
Ian Apperly. Mindreaders: the Cognitive Basis of "Theory of Mind". Psychology Press, 2011.
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.
Ariel Flint Ashery, Luca Maria Aiello, and Andrea Baronchelli. The dynamics of social conventions in LLM populations: Spontaneous emergence, collective biases and tipping points. arXiv preprint arXiv:2410.08948, 2024.
Arian Askari, Roxana Petcu, Chuan Meng, Mohammad Aliannejadi, Amin Abolghasemi, Evangelos Kanoulas, and Suzan Verberne. Self-seeding and multi-intent self-instructing LLMs for generating intent-aware information-seeking dialogs. arXiv preprint arXiv:2402.11633, 2024.
Robert Axelrod. Effective choice in the prisoner's dilemma. Journal of Conflict Resolution, 24(1):3–25, 1980.
Robert Axelrod. The emergence of cooperation among egoists. American Political Science Review, 75(2):306–318, 1981.
Robert Axelrod. An evolutionary approach to norms. American Political Science Review, 80(4):1095–1111, 1986.
Thomas Bäck. Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press, 1996.
Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern information retrieval, volume 463. ACM Press, New York, 1999.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022.
Razan Baltaji, Babak Hemmatian, and Lav R Varshney. Conformity, confabulation, and impersonation: Persona inconstancy in multi-agent LLM collaboration. arXiv preprint arXiv:2405.03862, 2024.
Chris Barker and Emma A Jane. Cultural studies: Theory and practice. SAGE Publications Ltd, 2016.
Andrea Baronchelli. Shaping new norms for artificial intelligence: A complex systems perspective. arXiv preprint arXiv:2307.08564, 2023.
Ashish Bastola, Hao Wang, Judsen Hembree, Pooja Yadav, Nathan McNeese, and Abolfazl Razi. LLM-based smart reply (LSR): Enhancing collaborative performance with ChatGPT-mediated smart reply system. arXiv preprint arXiv:2306.11980, 2023.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610–623, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922.
Yoshua Bengio and Eric Elmoznino. Illusions of AI consciousness. Science, 389(6765):1090–1091, 2025.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for AI safety: a review. arXiv preprint arXiv:2404.14082, 2024.
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, 2024.
Federico Bianchi, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, and James Zou. How well can LLMs negotiate? NegotiationArena platform and analysis. arXiv preprint arXiv:2402.05863, 2024.
Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, et al. Interpreting neural networks through the polytope lens. arXiv preprint arXiv:2211.12312, 2022.
Bernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, and Hanie Sedghi.
Exploring and benchmarking the planning capabilities of large language models. arXiv preprint arXiv:2406.13094, 2024.
Nick Bostrom. How long before superintelligence. International Journal of Futures Studies, 2(1):1–9, 1998.
Robert Boyd and Peter J Richerson. Culture and the evolutionary process. University of Chicago Press, 1988.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Nathan Brody. What is intelligence? International Review of Psychiatry, 11(1):19–25, 1999.
Joost Broekens, Bernhard Hilpert, Suzan Verberne, Kim Baraka, Patrick Gebhard, and Aske Plaat. Fine-grained affective processing capabilities emerging from large language models. In 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 1–8. IEEE, 2023.
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
Rodney A Brooks. Elephants don't play chess. Robotics and Autonomous Systems, 6(1-2):3–15, 1990.
Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
Lukas Brunke, Melissa Greeff, Adam W Hall, Zhaocong Yuan, Siqi Zhou, Jacopo Panerati, and Angela P Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems, 5(1):411–444, 2022.
Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, and Vidhisha Balachandran. BenchAgents: Automated benchmark creation with agent interaction. arXiv preprint arXiv:2410.22584, 2024.
Beatriz Cabrero-Daniel, Tomas Herda, Victoria Pichler, and Martin Eder. Exploring human-AI collaboration in agile: Customised LLM meeting assistants. In International Conference on Agile Software Development, pages 163–178. Springer Nature Switzerland Cham, 2024.
Gian Maria Campedelli, Nicolò Penzo, Massimo Stefan, Roberto Dessì, Marco Guerini, Bruno Lepri, and Jacopo Staiano. I want to break free! Persuasion and anti-social behavior of LLMs in multi-agent settings with social hierarchy. arXiv preprint arXiv:2410.07109, 2024.
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
Chi-Min Chan, Wenxuan Zhang, Chenyan Xiong, et al. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
Jiangjie Chen, Siyu Yuan, Rong Ye, Bodhisattwa Prasad Majumder, and Kyle Richardson. Put your money where your mouth is: Evaluating strategic planning and execution of LLM agents in an auction arena. arXiv preprint arXiv:2310.05746, 2023a.
Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs.
arXiv preprint arXiv:2309.13007, 2023b.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Sanxing Chen, Sam Wiseman, and Bhuwan Dhingra. ChatShop: Interactive information seeking with language agents. arXiv preprint arXiv:2404.09911, 2024a.
Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint arXiv:2308.10848, 2(4):5, 2023c.
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.
Xi Chen, Aske Plaat, and Niki van Stein. How does chain of thought think? Mechanistic interpretability of chain-of-thought reasoning with sparse autoencoding. arXiv preprint arXiv:2507.22928, 2025.
Xinyun Chen, Maxwell Lin, Nathanael Schaerli, and Denny Zhou. Teaching large language models to self-debug. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International Conference on Representation Learning, volume 2024, pages 8746–8825, 2024b.
Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, and Minlie Huang. ToMBench: Benchmarking Theory of Mind in large language models, 2024c.
Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, and Bo Han. Unveiling causal reasoning in large language models: Reality or mirage? Advances in Neural Information Processing Systems, 37:96640–96670, 2024.
Hao-Tien Lewis Chiang, Zhuo Xu, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, et al. Mobility VLA: Multimodal instruction navigation with long-context VLMs and topological graphs. arXiv preprint arXiv:2407.07775, 2024.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
Jan Clusmann, Fiona R Kolbinger, Hannah Sophie Muti, Zunamys I Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, Gregory P Veldhuizen, et al. The future landscape of large language models in medicine. Communications Medicine, 3(1):141, 2023.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Youan Cong, Cheng Wang, Pritom Saha Akash, and Kevin Chen-Chuan Chang. Query optimization for parametric knowledge refinement in retrieval-augmented large language models. arXiv preprint arXiv:2411.07820, 2024.
Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352, 2023.
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
Rangan Das, K Maheswari, Shaheen Siddiqui, Nikita Arora, Ankush Paul, Jeet Nanshi, Varun Udbalkar, Apoorva Sarvade, Harsha Chaturvedi, Tammy Shvartsman, et al. Improved precision oncology question-answering using agentic LLM. medRxiv, pages 2024–09, 2024.
Austin L Davis and Gita Sukthankar. Decoding chess mastery: A mechanistic analysis of a chess language transformer model. In International Conference on Artificial General Intelligence, pages 63–72. Springer, 2024.
Daniel C Dennett. From bacteria to Bach and back: The evolution of minds. WW Norton & Company, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Rezende, Yoshua Bengio, Michael Mozer, and Sanjeev Arora. Metacognitive capabilities of LLMs: An exploration in mathematical problem solving. arXiv preprint arXiv:2405.12205, 2024.
Han Ding, Yinheng Li, Junhao Wang, and Hang Chen. Large language model agent in financial trading: A survey. arXiv preprint arXiv:2408.06361, 2024.
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022.
Vishnu Sashank Dorbala, James F Mullen Jr, and Dinesh Manocha. Can an embodied agent find your "cat-shaped mug"? LLM-based zero-shot object navigation. IEEE Robotics and Automation Letters, 2023.
Marco Dorigo, Mauro Birattari, and Thomas Stutzle. Ant colony optimization. IEEE Computational Intelligence Magazine, 1(4):28–39, 2007.
Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, ICML '24. JMLR.org, 2024.
Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations. arXiv preprint arXiv:2402.12348, 2024.
Robin I.M. Dunbar. The social brain: mind, language, and society in evolutionary perspective. Annual Review of Anthropology, 32(1):163–181, 2003.
Steffen Eger, Yong Cao, Jennifer D'Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, et al. Transforming science with large language models: A survey on AI-assisted scientific discovery, experimentation, content generation, and evaluation. arXiv preprint arXiv:2502.05151, 2025.
Abul Ehtesham, Aditi Singh, Gaurav Kumar Gupta, and Saket Kumar. A survey of agent interoperability protocols: Model context protocol (MCP), agent communication protocol (ACP), agent-to-agent protocol (A2A), and agent network protocol (ANP). arXiv preprint arXiv:2505.02279, 2025.
Jeffrey L Elman. Incremental learning, or the importance of starting small. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 13, 1991.
Joshua M Epstein and Robert Axtell. Growing artificial societies: social science from the bottom up. Brookings Institution Press, 1996.
Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. RAGAs: Automated evaluation of retrieval augmented generation. In Nikolaos Aletras and Orphee De Clercq, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, St. Julians, Malta, March 2024.
Association for Computational Linguistics. doi: 10.18653/v1/2024.eacl-demo.16. URL https://aclanthology.org/2024.eacl-demo.16/.
Shaheen Fatima, Nicholas R Jennings, and Michael Wooldridge. Learning to resolve social dilemmas: a survey. Journal of Artificial Intelligence Research, 79:895–969, 2024.
Tao Feng, Zifeng Wang, and Jimeng Sun. CITING: Large language models create curriculum for instruction tuning. arXiv preprint arXiv:2310.02527, 2023.
Jacques Ferber and Gerhard Weiss. Multi-agent systems: an introduction to distributed artificial intelligence, volume 1. Addison-Wesley, Reading, 1999.
Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023.
Riya Fernando, Isabel Norton, Pranay Dogra, Rohit Sarnaik, Hasan Wazir, Zitang Ren, Niveta Sree Gunda, Anushka Mukhopadhyay, and Michael Lutz. Quantifying bias in agentic large language models: A benchmarking approach. In 2024 5th Information Communication Technologies Conference (ICTC), pages 349–353. IEEE, 2024.
Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R Costa-jussà. A primer on the inner workings of transformer-based language models. arXiv preprint arXiv:2405.00208, 2024.
Nicoló Fontana, Francesco Pierri, and Luca Maria Aiello. Nicer than humans: How do large language models behave in the prisoner's dilemma? arXiv preprint arXiv:2406.13605, 2024.
Yuyou Gan, Yong Yang, Zhe Ma, Ping He, Rui Zeng, Yiming Wang, Qingming Li, Chunyi Zhou, Songze Li, Ting Wang, et al. Navigating the risks: A survey of security, privacy, and ethics threats in LLM-based agents. arXiv preprint arXiv:2411.09523, 2024.
Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman.
Stream of search (SoS): Learning to search in language. arXiv preprint arXiv:2404.03683, 2024.
Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. S3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984, 2023a.
Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1):1–24, 2024.
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR, 2023b.
Xuanqi Gao, Siyi Xie, Juan Zhai, Shiqing Ma, and Chao Shen. MCP-RADAR: A multi-dimensional benchmark for evaluating tool use capabilities in large language models, 2025. URL https://arxiv.org/abs/2505.16700.
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023c.
Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, and Yueting Zhuang. WorldGPT: Empowering LLMs as multimodal world model. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7346–7355, 2024.
Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, pages 11398–11442. PMLR, 2023.
Nigel Gilbert. Agent-based models. Sage Publications, 2019.
Ethan Goh, Robert Gallo, Jason Hom, Eric Strong, Yingjie Weng, Hannah Kerman, Joséphine A Cool, Zahir Kanjee, Andrew S Parsons, Neera Ahuja, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial.
JAMA Network Open, 7(10):e2440969–e2440969, 2024.
Simon Goldstein and Benjamin A. Levinstein. Does ChatGPT have a mind?, 2024. URL https://arxiv.org/abs/2407.11015.
Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, Jose R Penades, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, and Vivek Natarajan. Towards an AI co-scientist, 2025. URL https://arxiv.org/abs/2502.18864.
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. OLMo: Accelerating the science of language models, 2024. URL https://arxiv.org/abs/2402.00838.
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2023.
Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati.
Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. Advances in Neural Information Processing Systems, 36:79081–79094, 2023.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
Gilbert Harman. Logic and reasoning. In Foundations: logic, language, and mathematics, pages 107–127. Springer, 1984.
Jamie Harris and Jacy Reese Anthis. The moral consideration of artificial entities: a literature review. Science and Engineering Ethics, 27(4):53, 2021.
Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
Lingchen He, Jonasz B. Patkowski, Laura Miguel-Romero, Christopher H S Aylett, Alfred Fillol-Salom, Tiago R. D. Costa, and José R Penadés. Chimeric infective particles expand species boundaries in phage inducible chromosomal island mobilization. bioRxiv, 2025. doi: 10.1101/2025.02.11.637232. URL https://www.biorxiv.org/content/early/2025/02/11/2025.02.11.637232.
Shawn He, Surangika Ranathunga, Stephen Cranefield, and Bastin Tony Roy Savarimuthu. Norm violation detection in multi-agent systems using large language models: A pilot study. arXiv preprint arXiv:2403.16517, 2024.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531, 2015.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Ilya Horiguchi, Takahide Yoshida, and Takashi Ikegami. Evolution of social norms in LLM agents using natural language. arXiv preprint arXiv:2409.00993, 2024.
Guiyang Hou, Wenqi Zhang, Yongliang Shen, Zeqi Tan, Sihao Shen, and Weiming Lu. Entering real social world! Benchmarking the theory of mind and socialization capabilities of LLMs from a first-person perspective. arXiv preprint arXiv:2410.06195, 2024.
Adele Howe, Craig Knoblock, Drew McDermott, Ashwin Ram, Manuela Veloso, Daniel Weld, David Wilkins, Anthony Barrett, Dave Christianson, et al. PDDL: the planning domain definition language. Technical report, 1998.
Feng-Hsiung Hsu. Behind Deep Blue: Building the computer that defeated the world chess champion. Princeton University Press, 2022.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Jennifer Hu, Felix Sosa, and Tomer Ullman. Re-evaluating theory of mind evaluation in large language models, 2025. URL https://arxiv.org/abs/2502.21098.
Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, and Michael R Lyu. How far are we on the decision-making of LLMs? Evaluating LLMs' gaming ability in multi-agent environments. arXiv preprint arXiv:2403.11807, 2024a.
Qi Huang, Sofoklis Kitharidis, Thomas Bäck, and Niki van Stein. TX-Gen: Multi-objective optimization for sparse counterfactual explanations for time-series classification. In Proceedings of the 1st International Conference on Explainable AI for Neural and Symbolic Methods - Volume 1: EXPLAINS, pages 62–74.
INSTICC, SciTePress, 2024b. ISBN 978-989-758-720-7. doi: 10.5220/0013066400003886.
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson? arXiv preprint arXiv:2411.16489, 2024c.
Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, and Tim Rocktäschel. Open-endedness is essential for artificial superhuman intelligence. arXiv preprint arXiv:2406.04268, 2024.
Mike Huisman, Jan N van Rijn, and Aske Plaat. A survey of deep meta-learning. Artificial Intelligence Review, 54(6):4483–4541, 2021.
Shima Imani, Liang Du, and Harsh Shrivastava. MathPrompter: Mathematical reasoning using large language models. arXiv preprint arXiv:2303.05398, 2023.
Junfeng Jiao, Saleh Afroogh, Yiming Xu, and Connor Phillips. Navigating LLM ethics: Advancements, challenges, and future directions. AI and Ethics, pages 1–25, 2025.
Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, and Tianmin Shu. MMToM-QA: Multimodal theory of mind question answering, 2024a. URL https://arxiv.org/abs/2401.08743.
Yilun Jin, Zheng Li, Chenwei Zhang, Tianyu Cao, Yifan Gao, Pratik Jayarao, Mao Li, Xin Liu, Ritesh Sarkhel, Xianfeng Tang, et al. Shopping MMLU: A massive multi-task online shopping benchmark for large language models. arXiv preprint arXiv:2410.20745, 2024b.
Antonia Jane Jones. Game theory: Mathematical models of conflict. Elsevier, 2000.
Catholijn Jonker, Reyhan Aydogan, Tim Baarslag, Katsuhide Fujita, Takayuki Ito, and Koen Hindriks. Automated negotiating agents competition (ANAC). In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31(1), 2017.
Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
Daniel Kahneman. Thinking, fast and slow. Macmillan, 2011.
Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. LLMs can't plan, but can help planning in LLM-Modulo frameworks. arXiv preprint arXiv:2402.01817, 2024.
Amir-Hossein Karimi, Gilles Barthe, Borja Balle, and Isabel Valera. Model-agnostic counterfactual explanations for consequential decisions. In International Conference on Artificial Intelligence and Statistics, pages 895–905. PMLR, 2020.
Zachary Kenton, Ramana Kumar, Sebastian Farquhar, Jonathan Richens, Matt MacDermott, and Tom Everitt. Discovering agents. Artificial Intelligence, 322:103963, 2023.
Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. FANToM: A benchmark for stress-testing machine theory of mind in interactions. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397–14413, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.890. URL https://aclanthology.org/2023.emnlp-main.890.
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model.
arXiv preprint arXiv:2406.09246, 2024.
Samuel N Kirshner, Yiwen Pan, Jason Xianghua Wu, and Alex Gould. Talking terms: Agent information in LLM supply chain bargaining. Decision Sciences, 2025.
Ching-Yun Ko, Sihui Dai, Payel Das, Georgios Kollias, Subhajit Chaudhury, and Aurelie Lozano. MemReasoner: A memory-augmented LLM architecture for multi-hop reasoning. In The First Workshop on System-2 Reasoning at Scale, NeurIPS'24, 2024.
Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, et al. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, 2022.
Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59–69, 1982.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. ISBN 978-0-262-01319-2.
Michal Kosinski. Theory of mind may have spontaneously emerged in large language models. arXiv preprint arXiv:2302.02083, 2023.
Michal Kosinski. Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences, 121(45):e2405460121, 2024. doi: 10.1073/pnas.2405460121. URL https://www.pnas.org/doi/abs/10.1073/pnas.2405460121.
Tom Kouwenhoven, Max Peeperkorn, and Tessa Verhoef. Searching for structure: Investigating emergent communication with large language models, 2024. URL https://arxiv.org/abs/2412.07646.
Ray Kurzweil. Superintelligence and singularity. Machine Learning and the City: Applications in Architecture and Urban Design, pages 579–601, 2022.
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
Peter Lambert. Measuring remote work using a large language model (LLM). In EconPol Forum, volume 24, pages 44–49. Munich: CESifo GmbH, 2023.
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. BLOOM: A 176B-parameter open-access multilingual language model, 2023.
Sydney Levine, Max Kleiman-Weiner, Laura Schulz, Joshua Tenenbaum, and Fiery Cushman. The logic of universalization guides moral judgment. Proceedings of the National Academy of Sciences, 117(42):26158–26169, 2020.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023a.
Haohang Li, Yupeng Cao, Yangyang Yu, Shashidhar Reddy Javaji, Zhiyang Deng, Yueru He, Yuechen Jiang, Zining Zhu, Koduvayur Subbalakshmi, Guojun Xiong, et al.
INVESTORBENCH: A benchmark for financial decision-making tasks with LLM-based agent. arXiv preprint arXiv:2412.18174, 2024a.
Jindong Li, Yali Fu, Li Fan, Jiahong Liu, Yao Shu, Chengwei Qin, Menglin Yang, Irwin King, and Rex Ying. Implicit reasoning in large language models: A comprehensive survey. arXiv preprint arXiv:2509.02350, 2025a.
Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. arXiv preprint arXiv:2306.14050, 2023b.
Shimin Li, Tianxiang Sun, Qinyuan Cheng, and Xipeng Qiu. Agent alignment in evolving social norms. arXiv preprint arXiv:2401.04620, 2024b.
Xinyi Li, Yu Xu, Yongfeng Zhang, and Edward C Malthouse. Large language model-driven multi-agent simulation for news diffusion under different network structures. arXiv preprint arXiv:2410.13909, 2024c.
Xinzhe Li. A review of prominent paradigms for LLM-based agents: Tool use (including RAG), planning, and feedback learning. arXiv preprint arXiv:2406.05804, 2024.
Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From System 1 to System 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025b.
Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, and Pan Lu. In-the-flow agentic system optimization for effective planning and tool use. arXiv preprint arXiv:2510.05592, 2025c.
Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, et al. A comprehensive survey on long context language modeling. URL https://api.semanticscholar.org/CorpusID:277271533, 2025a.
Jiming Liu and Jianbing Wu. Multiagent robotic systems. CRC Press, 2018.
Lei Liu, Xiaoyan Yang, Yue Shen, Binbin Hu, Zhiqiang Zhang, Jinjie Gu, and Guannan Zhang.
Think-in-Memory: Recalling and post-thinking enable LLMs with long-term memory. arXiv preprint arXiv:2311.08719, 2023.
Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Shelby Heinecke, Silvio Savarese, Huan Wang, and Caiming Xiong. MCPEval: Automatic MCP-based deep evaluation for AI agent models, 2025b. URL https://arxiv.org/abs/2507.12806.
Nunzio Lorè and Babak Heydari. Strategic behavior of large language models: Game structure vs. contextual framing. arXiv preprint arXiv:2309.05898, 2023.
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024a.
Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. OmniParser for pure vision-based GUI agent, 2024b. URL https://arxiv.org/abs/2408.00203.
Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.
Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, and Junnan Li. MCP-Universe: Benchmarking large language models with real-world model context protocol servers, 2025. URL https://arxiv.org/abs/2508.14704.
Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379, 2023.
Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. AgentBoard: An analytical evaluation board of multi-turn LLM agents. arXiv preprint arXiv:2401.13178, 2024a.
Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King.
A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024b.
C M Macal and M J North. Tutorial on agent-based modelling and simulation. Journal of Simulation, 4(3):151–162, 2010. ISSN 1747-7786. doi: 10.1057/jos.2010.3. URL https://doi.org/10.1057/jos.2010.3.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2023.
Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. Dissociating language and thought in large language models: a cognitive perspective. arXiv preprint arXiv:2301.06627, 2023.
Aleksandar Makelov, George Lange, and Neel Nanda. Towards principled evaluations of sparse autoencoders for interpretability and control. arXiv preprint arXiv:2405.08366, 2024.
SA Manasa, Pankaj Agarwal, Goutam Pal, P Mahendra, Shankar Rao Pendyala, and Sachin Marge. Towards seamless air travel: Developing a next-generation flight booking assistant. Journal of Electrical Systems, 20(2):2558–2567, 2024.
Manqing Mao, Paishun Ting, Yijian Xiang, Mingyang Xu, Julia Chen, and Jianzhe Lin. Multi-User Chat Assistant (MUCA): a framework using LLMs to facilitate group conversations. arXiv preprint arXiv:2401.04883, 2024.
Shaoguang Mao, Yuzhe Cai, Yan Xia, Wenshan Wu, Xun Wang, Fengyi Wang, Tao Ge, and Furu Wei. Alympics: Language agents meet game theory. arXiv preprint arXiv:2311.03220, 2023.
David A Medler. A brief history of connectionism. Neural Computing Surveys, 1:18–72, 1998.
Nikita Mehandru, Brenda Y Miao, Eduardo Rodriguez Almaraz, Madhumita Sushil, Atul J Butte, and Ahmed Alaa. Evaluating large language models as agents in the clinic. NPJ Digital Medicine, 7(1):84, 2024.
Sarah Mercer, Samuel Spillard, and Daniel P Martin.
Brief analysis of DeepSeek-R1 and its implications for generative AI. arXiv preprint arXiv:2502.02523, 2025.
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
Risto Miikkulainen. Generative AI: An AI paradigm shift in the making? AI Magazine, 45(1):165–167, 2024.
C. A. Miller. Trust in adaptive automation: the role of etiquette in tuning trust via analogic and affective methods. In Proceedings of the 1st International Conference on Augmented Cognition, pages 22–27. Citeseer, 2005.
Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 56(2):1–40, 2023.
Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024.
Dang Minh, H Xiang Wang, Y Fen Li, and Tan N Nguyen. Explainable artificial intelligence: a comprehensive review. Artificial Intelligence Review, pages 1–66, 2022.
Marvin Minsky. Society of mind. Simon and Schuster, 1988.
Justin M Mittelstädt, Julia Maier, Panja Goerke, Frank Zinn, and Michael Hermes. Large language models can outperform humans in social situational judgments. Scientific Reports, 14(1):27449, 2024.
T. Miyamoto, D. Katagami, T. Tanaka, H. Kanamori, Y. Yoshihara, and K. Fujikake. Should a driving support agent provide explicit instructions to the user? Video-based study focused on politeness strategies. In Proceedings of the 9th International Conference on Human Agent Interaction, pages 157–164, 2021.
Guozhao Mo, Wenliang Zhong, Jiawei Chen, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, and Le Sun. LiveMCPBench: Can agents navigate an ocean of MCP tools?, 2025. URL https://arxiv.org/abs/2508.01780.
Thomas M Moerland, Joost Broekens, Aske Plaat, Catholijn M Jonker, et al. Model-based reinforcement learning: A survey. Foundations and Trends in Machine Learning, 16(1):1–118, 2023.
Christoph Molnar. Interpretable machine learning. Lulu.com, 2020.
Xinyi Mou, Xuanwen Ding, Qi He, Liang Wang, Jingcong Liang, Xinnong Zhang, Libo Sun, Jiayu Lin, Jie Zhou, Xuanjing Huang, et al. From individual to society: A survey on social simulation driven by large language model-based agents. arXiv preprint arXiv:2412.03563, 2024.
Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. In International Conference on Automated Deduction, pages 625–635. Springer, 2021.
Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
Karsten Müller and Jonathan Schaeffer. Man vs. Machine: Challenging Human Supremacy at Chess. SCB Distributors, 2018.
Magnus Müller and Gregor Žunič. Browser Use: Enable AI to control your browser, 2024. URL https://github.com/browser-use/browser-use.
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.
Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745, 2018.
Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, et al. MLGym: A new framework and benchmark for advancing AI research agents. arXiv preprint arXiv:2502.14499, 2025.
Kálmán Balázs Neszlényi, Alex Milos, and Attila Kiss. AssistantGPT: Enhancing user interaction with LLM integration. In 2024 IEEE 22nd Jubilee International Symposium on Intelligent Systems and Informatics (SISY), pages 000619–000624. IEEE, 2024.
Allen Newell and Herbert Simon. The logic theory machine: a complex information processing system. IRE Transactions on Information Theory, 2(3):61–79, 1956.
Allen Newell and Herbert A Simon. Computer simulation of human thinking: A theory of problem solving expressed as a computer program permits simulation of thinking processes. Science, 134(3495):2011–2017, 1961.
Allen Newell, John Calman Shaw, and Herbert A Simon. Elements of a theory of human problem solving. Psychological Review, 65(3):151, 1958.
Felix Ocker, Daniel Tanneberg, Julian Eggert, and Michael Gienger. Tulip agent: enabling LLM-based agents to solve tasks using large tool libraries. arXiv preprint arXiv:2407.21778, 2024.
Elizabeth Oluwagbade. Conversational AI as the new employee liaison: LLM-powered chatbots in enhancing workplace collaboration and inclusion, 2024.
C Opus and A Lawsen. The illusion of the illusion of thinking. arXiv preprint arXiv:2506.09250, 2025.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Guillermo Owen. Game theory.
Emerald Group Publishing, 2013.
Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, et al. BALROG: Benchmarking agentic LLM and VLM reasoning on games. arXiv preprint arXiv:2411.13543, 2024.
Dimitrios P Panagoulias, Maria Virvou, and George A Tsihrintzis. Evaluating LLM-generated multimodal diagnosis from medical images and symptom analysis. arXiv preprint arXiv:2402.01730, 2024.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1–18, 2022.
Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109, 2024.
Soya Park and Chinmay Kulkarni. Thinking assistants: LLM-based conversational assistants that help users think by asking rather than answering. arXiv preprint arXiv:2312.06024, 2023.
Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334, 2023.
Jose R. Penades, Juraj Gottweis, Lingchen He, Jonasz B.
Patkowski, Alexander Daryin, Wei-Hung Weng, Tao Tu, Anil Palepu, Artiom Myaskovsky, Annalisa Pawlosky, Vivek Natarajan, Alan Karthikesalingam, and Tiago R.D. Costa. AI mirrors experimental science to uncover a mechanism of gene transfer crucial to bacterial evolution. Cell, 2025. ISSN 0092-8674. doi: https://doi.org/10.1016/j.cell.2025.08.018. URL https://www.sciencedirect.com/science/article/pii/S0092867425009730.
Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is Turing-complete. Journal of Machine Learning Research, 22(75):1–35, 2021.
Julien Perolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T Connor, Neil Burch, Thomas Anthony, et al. Mastering the game of Stratego with model-free multiagent reinforcement learning. Science, 378(6623):990–996, 2022.
Jinghua Piao, Zhihong Lu, Chen Gao, Fengli Xu, Fernando P. Santos, Yong Li, and James Evans. Emergence of human-like polarization among large language model agents, 2025a. URL https://arxiv.org/abs/2501.05171.
Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, Chen Gao, Fengli Xu, Fang Zhang, Ke Rong, Jun Su, and Yong Li. AgentSociety: Large-scale simulation of LLM-driven generative agents advances understanding of human behaviors and society, 2025b. URL https://arxiv.org/abs/2502.08691.
Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf, Mrinmaya Sachan, and Rada Mihalcea. Cooperate or collapse: Emergence of sustainability behaviors in a society of LLM agents. arXiv preprint arXiv:2404.16698, 2024.
Aske Plaat. Research, Re: search & Re-search. PhD thesis, Erasmus University Rotterdam, 1996.
Aske Plaat. Learning to play: reinforcement learning and games. Springer, 2020.
Aske Plaat. Deep reinforcement learning. Springer, 2022.
Aske Plaat, Walter Kosters, and Mike Preuss. High-accuracy model-based reinforcement learning, a survey. Artificial Intelligence Review, 56(9):9541–9573, 2023.
Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Bäck. Multi-step reasoning with large language models, a survey. ACM Computing Surveys, 58(6):160:1–160:35, 2025. arXiv preprint arXiv:2407.11511.
William Poundstone. Prisoner's dilemma. Anchor, 2011.
David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4):515–526, 1978.
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.
Mike Preuss. Multimodal optimization by means of evolutionary algorithms. Springer, 2015.
Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, et al. WebRL: Training LLM web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024.
Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. Benchmarking agentic workflow generation. arXiv preprint arXiv:2410.07869, 2024.
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023.
Haoyi Qiu, Alexander R Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, and Chien-Sheng Wu. Evaluating cultural and social awareness of LLM web agents. arXiv preprint arXiv:2410.23252, 2024a.
Jianing Qiu, Kyle Lam, Guohao Li, Amish Acharya, Tien Yin Wong, Ara Darzi, Wu Yuan, and Eric J Topol. LLM-based agentic systems in medicine and healthcare. Nature Machine Intelligence, pages 1–3, 2024b.
Sebastien Racaniere, Andrew K Lampinen, Adam Santoro, David P Reichert, Vlad Firoiu, and Timothy P Lillicrap. Automated curricula through setter-solver interactions. arXiv preprint arXiv:1909.12892, 2019.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, and Ziyu Yao. A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646, 2024.
Meghana Rajeev, Rajkumar Ramamurthy, Prapti Trivedi, Vikas Yadav, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudan, James Zou, and Nazneen Rajani. Cats confuse reasoning LLM: Query agnostic adversarial triggers for reasoning models, 2025. URL https://arxiv.org/abs/2503.01781.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
Anatol Rapoport. Prisoner's dilemma: a study in conflict and cooperation, 1965.
Partha Pratim Ray. A review on agent-to-agent protocol: Concept, state-of-the-art, challenges and future directions. Authorea Preprints, 2025.
Shaina Raza, Ranjan Sapkota, Manoj Karkee, and Christos Emmanouilidis. Responsible agentic reasoning and AI agents: A critical survey: Proposal for safe agentic AI via responsible reasoning AI agents (R2A2). Super Intelligence-Robotics-Safety & Alignment, 2(6), 2025.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, Xiangyu Zhang, and Jian Sun. Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1476–1481, 2016.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
P. Ribino. The role of politeness in human–machine interactions: a systematic literature review and future perspectives. Artificial Intelligence Review, 56(Suppl 1):445–482, 2023. doi: 10.1007/s10462-023-10540-1.
Thiago Rios, Bas van Stein, Stefan Menzel, Thomas Bäck, Bernhard Sendhoff, and Patricia Wollstadt. Feature visualization for 3D point cloud autoencoders. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2020.
Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020.
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.
Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, and Edward Grefenstette.
The Goldilocks of pragmatic understanding: Fine-tuning strategy matters for implicature resolution by LLMs, 2023. URL https://arxiv.org/abs/2210.14986.
Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, and Tim Genewein. LMAct: A benchmark for in-context imitation learning with long multimodal demonstrations. arXiv preprint arXiv:2412.01441, 2024.
Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Pearson, 2016.
Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Gardhar Ingvarsson, Timon Willi, Akbir Khan, Christian Schroeder de Witt, Alexandra Souly, et al. JaxMARL: Multi-agent RL environments and algorithms in JAX. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pages 2444–2446, 2024.
Kanghyun Ryu, Qiayuan Liao, Zhongyu Li, Koushil Sreenath, and Negar Mehr. CurricuLLM: Automatic task curricula design for learning complex robot skills using large language models. arXiv preprint arXiv:2409.18382, 2024.
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, et al. Rainbow teaming: Open-ended generation of diverse adversarial prompts. arXiv preprint arXiv:2402.16822, 2024.
Bastin Tony Roy Savarimuthu, Surangika Ranathunga, and Stephen Cranefield. Harnessing the power of LLMs for normative reasoning in MASs. arXiv preprint arXiv:2403.16524, 2024.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM agents as research assistants, 2025. URL https://arxiv.org/abs/2501.04227.
John Schultz, Jakub Adamek, Matej Jusup, Marc Lanctot, Michael Kaisers, Sarah Perrin, Daniel Hennes, Jeremy Shar, Cannada Lewis, Anian Ruoss, et al. Mastering board games by external and internal planning with language models. arXiv preprint arXiv:2412.12119, 2024.
Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709, 2015.
Dhruv Shah, Błażej Osiński, Sergey Levine, et al. LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning, pages 492–504. PMLR, 2023.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, and Vered Shwartz. Clever Hans or Neural Theory of Mind? Stress testing social reasoning in large language models. In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2257–2273, St. Julian's, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.eacl-long.138.
Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025.
Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small LLMs are weak tool learners: A multi-LLM agent. arXiv preprint arXiv:2401.07324, 2024.
Zhuocheng Shen. LLM with tools: A survey. arXiv preprint arXiv:2409.18807, 2024.
Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Suzan Verberne, and Zhaochun Ren. Chain of tools: Large language model is an automatic multi-tool learner. arXiv preprint arXiv:2405.16533, 2024.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
Yoav Shoham and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008.
Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941, 2025.
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. In Conference on Robot Learning, pages 894–906. PMLR, 2022.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al.
Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey. International Journal of Computer Vision, 130(6):1526–1565, 2022.
Settaluri Sravanthi, Meet Doshi, Pavan Tankala, Rudra Murthy, Raj Dabre, and Pushpak Bhattacharyya. PUB: A pragmatics understanding benchmark for assessing LLMs' pragmatics capabilities. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 12075–12097, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.719. URL https://aclanthology.org/2024.findings-acl.719/.
Luc Steels. A self-organizing spatial vocabulary. Artificial Life, 2(3):319–332, 1995.
Luc Steels. Intelligence with representation. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 361(1811):2381–2395, 2003.
James W. A. Strachan, Oriana Pansardi, Eugenio Scaliti, Marco Celotto, Krati Saxena, Chunzhi Yi, Fabio Manzi, Alessandro Rufo, Guido Manzi, Michael S. A. Graziano, Stefano Panzeri, and Cristina Becchio. GPT-4o reads the mind in the eyes, 2024. URL https://arxiv.org/abs/2410.22309.
Chris Su, Harrison Li, Matheus Marques, George Flint, Kevin Zhu, and Sunishchal Dev. Limits of emergent reasoning of large language models in agentic frameworks for deterministic games. arXiv preprint arXiv:2510.15974, 2025.
Vighnesh Subramaniam, Yilun Du, Joshua B Tenenbaum, Antonio Torralba, Shuang Li, and Igor Mordatch. Multiagent finetuning: Self improvement with diverse reasoning chains. arXiv preprint arXiv:2501.05707, 2025.
Malavikha Sudarshan, Sophie Shih, Estella Yee, Alina Yang, John Zou, Cathy Chen, Quan Zhou, Leon Chen, Chinmay Singhal, and George Shih. Agentic LLM workflows for generating patient-friendly medical reports. arXiv preprint arXiv:2408.01112, 2024.
Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, and Gaurav Sukhatme. EmbodiedBERT: A transformer model for embodied, language-guided visual task completion. arXiv preprint arXiv:2108.04927, 2021.
Ilya Sutskever. Acceptance speech for Test of Time Award. In Advances in Neural Information Processing Systems, Vancouver, British Columbia, Canada, 2024.
Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging Big-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, and James Zou. The Virtual Lab: AI agents design new SARS-CoV-2 nanobodies with experimental validation. bioRxiv, 2024. doi: 10.1101/2024.11.11.623004. URL https://www.biorxiv.org/content/early/2024/11/12/2024.11.11.623004.
Chris Sypherd and Vaishak Belle. Practical considerations for agentic LLM systems. arXiv preprint arXiv:2412.04093, 2024.
Amin Tabrizian, Pranav Gupta, Abenezer Taye, James Jones, Ellis Thompson, Shulu Chen, Timothy Bonin, Derek Eberle, and Peng Wei. Using large language models to automate flight planning under wind hazards. In 2024 AIAA DATC/IEEE 43rd Digital Avionics Systems Conference (DASC), pages 1–8.
IEEE, 2024.
Hao Tang, Darren Key, and Kevin Ellis. WorldCoder, a model-based LLM agent: Building world models by writing code and interacting with the environment. arXiv preprint arXiv:2402.12275, 2024.
Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301, 2023.
Gerald Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2024.
Ehsan Ullah, Anil Parwani, Mirza Mansoor Baig, and Rajendra Singh. Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology: a recent scoping review. Diagnostic Pathology, 19(1):43, 2024.
Tomer Ullman. Large language models fail on trivial alterations to Theory-of-Mind tasks. arXiv preprint arXiv:2302.08399, 2023.
Aron Vallinder and Edward Hughes. Cultural evolution of cooperation among LLM agents. arXiv preprint arXiv:2412.10270, 2024.
Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati.
On the planning abilities of large language models: a critical investigation. Advances in Neural Information Processing Systems, 36:75993–76005, 2023.
Razo van Berkel. Large multimodal models and theory of mind. Bachelor's thesis, LIACS, Leiden University, the Netherlands, 2024.
Michiel Van Der Meer, Enrico Liscio, Catholijn Jonker, Aske Plaat, Piek Vossen, and Pradeep Murukannaiah. A hybrid intelligence method for argument mining. Journal of Artificial Intelligence Research, 80:1187–1222, 2024.
Ramira van der Meulen, Rineke Verbrugge, and Max van Duijn. Towards properly implementing theory of mind in AI systems: An account of four misconceptions, 2025. URL https://arxiv.org/abs/2503.16468.
Bram Van Dijk, Tom Kouwenhoven, Marco R Spruit, and Max J van Duijn. Large language models: The need for nuance in current debates and a pragmatic perspective on understanding. arXiv preprint arXiv:2310.19671, 2023.
Max J van Duijn, Bram van Dijk, Tom Kouwenhoven, Werner de Valk, Marco R Spruit, and Peter van der Putten. Theory of mind in large language models: Examining performance of 11 state-of-the-art models vs. children aged 7-10 on advanced tests. arXiv preprint arXiv:2310.20320, 2023.
Jonas van Elburg, Peter van der Putten, and Maarten Marx. Can we evaluate RAGs with synthetic data?, 2025. URL https://arxiv.org/abs/2508.11758.
Frank Van Harmelen, Vladimir Lifschitz, and Bruce Porter. Handbook of knowledge representation. Elsevier, 2008.
Bas van Stein, Elena Raponi, Zahra Sadeghi, Niek Bouman, Roeland CHJ Van Ham, and Thomas Bäck. A comparison of global sensitivity analysis methods for explainable AI with an application in genomic prediction. IEEE Access, 10:103364–103381, 2022.
Niki van Stein and Thomas Bäck. LLaMEA: A large language model evolutionary algorithm for automatically generating metaheuristics. arXiv preprint arXiv:2405.20132, 2024.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin.
Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
Nikhita Vedula, Oleg Rokhlenko, and Shervin Malmasi. Question suggestion for conversational shopping assistants using product metadata. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2960–2964, 2024.
Suzan Verberne. In search of the why. PhD thesis, University of Nijmegen, The Netherlands, 2010.
Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
Max W. Vogt, Peter van der Putten, and Hajo A. Reijers. Providing domain knowledge for process mining with ReWOO-based agents. In Andrea Delgado and Tijs Slaats, editors, Process Mining Workshops, pages 663–676, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-82225-4.
John Von Neumann and Oskar Morgenstern. Theory of games and economic behavior: 60th anniversary commemorative edition. Princeton University Press, 2007.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32, 2019.
Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori.
Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025a.
Haochuan Wang, Xiachong Feng, Lei Li, Zhanyue Qin, Dianbo Sui, and Lingpeng Kong. TMGBench: A systematic game benchmark for evaluating strategic reasoning abilities of LLMs, 2024a. URL https://arxiv.org/abs/2410.10479.
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024b.
Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, Bin Wang, Chuhan Wu, Yasheng Wang, Ruiming Tang, and Jianye Hao. GUI agents with foundation models: A comprehensive survey, 2025b. URL https://arxiv.org/abs/2411.04890.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
Yuqing Wang and Yun Zhao. Metacognitive prompting improves understanding in large language models. arXiv preprint arXiv:2308.05342, 2023.
Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, and Eugene Siow. MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers, 2025c. URL https://arxiv.org/abs/2508.20453.
Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. BLiMP: A benchmark of linguistic minimal pairs for English. CoRR, abs/1912.00582, 2019. URL http://arxiv.org/abs/1912.00582.
Lillian Wassim, Kamal Mohamed, and Ali Hamdi. LLM-DaaS: LLM-driven drone-as-a-service operations from text user requests. arXiv preprint arXiv:2412.11672, 2024.
Hao Wei, Jianing Qiu, Haibao Yu, and Wu Yuan. MedCo: Medical education copilots based on a multi-agent framework.
arXiv preprint arXiv:2408.12496, 2024a.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b.
Rongxuan Wei, Kangkang Li, and Jiaming Lan. Improving collaborative learning performance based on LLM virtual assistant. In 2024 13th International Conference on Educational and Information Technology (ICEIT), pages 1–6. IEEE, 2024b.
Michael Wooldridge. Intelligent agents. Multiagent systems: A modern approach to distributed artificial intelligence, 1:27–73, 1999.
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, et al. A comparative study on reasoning patterns of OpenAI's o1 model. arXiv preprint arXiv:2410.13639, 2024.
Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
Jiannan Xiang, Tianhua Tao, Yi Gu, Tianmin Shu, Zirui Wang, Zichao Yang, and Zhiting Hu. Language models meet world models: Embodied experiences enhance language models. Advances in Neural Information Processing Systems, 36, 2024.
Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, et al. Towards System 2 reasoning in LLMs: Learning how to think with meta chain-of-thought. arXiv preprint arXiv:2501.04682, 2025.
Yijia Xiao, Edward Sun, Di Luo, and Wei Wang. TradingAgents: Multi-agents LLM financial trading framework. arXiv preprint arXiv:2412.20138, 2024.
Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems, 36, 2024.
Peiwen Xing, Aske Plaat, and Niki van Stein. CoComposer: LLM multi-agent collaborative music composition. arXiv preprint arXiv:2509.00132, 2025.
Lin Xu, Zhiyuan Hu, Daquan Zhou, Hongyu Ren, Zhen Dong, Kurt Keutzer, See Kiong Ng, and Jiashi Feng. MAgIC: Investigation of large language model powered multi-agent in cognition, adaptability, rationality and collaboration. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7315–7332, 2024a.
Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116, 2024b.
Haoqiu Yan, Yongxin Zhu, Kai Zheng, Bing Liu, Haoyu Cao, Deqiang Jiang, and Linli Xu. Talk with human-like agents: Empathetic dialogue through perceptible acoustic reception and reaction, 2024. URL https://arxiv.org/abs/2406.12707.
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a.
Hongyang Yang, Boyu Zhang, Neng Wang, Cheng Guo, Xiaoli Zhang, Likun Lin, Junlin Wang, Tianyu Zhou, Mao Guan, Runjia Zhang, et al. FinRobot: An open-source AI agent platform for financial applications using large language models.
arXiv preprint arXiv:2405.14767, 2024a.
Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal AI agents. arXiv preprint arXiv:2502.13130, 2025b.
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b.
Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E Gonzalez, and Bin Cui. Buffer of thoughts: Thought-augmented reasoning with large language models. arXiv preprint arXiv:2406.04271, 2024c.
Zhou Yang, Zhaochun Ren, Wang Yufeng, Shizhong Peng, Haizhou Sun, Xiaofei Zhu, and Xiangwen Liao. Enhancing empathetic response generation by augmenting LLMs with small-scale empathetic models, 2024d. URL https://arxiv.org/abs/2402.11801.
Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, et al. OASIS: Open agents social interaction simulations on one million agents. arXiv preprint arXiv:2411.11581, 2024e.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.
arXiv preprint arXiv:2306.13549, 2023.
Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, and Satoshi Sekine. Should we respect LLMs? A cross-lingual study on the influence of prompt politeness on LLM performance. In James Hale, Kushal Chawla, and Muskan Garg, editors, Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024), pages 9–35, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.sicon-1.2. URL https://aclanthology.org/2024.sicon-1.2/.
Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov. Distilling System 2 into System 1. arXiv preprint arXiv:2407.06023, 2024a.
Xinjie Yu and Mitsuo Gen. Introduction to evolutionary algorithms. Springer, 2010.
Yangyang Yu, Haohang Li, Zhi Chen, Yuechen Jiang, Yang Li, Denghui Zhang, Rong Liu, Jordan W Suchow, and Khaldoun Khashanah. FinMem: A performance-enhanced LLM trading agent with layered memory and character design. In Proceedings of the AAAI Symposium Series, volume 3, pages 595–597, 2024b.
Dong Yuan, Eti Rastogi, Gautam Naik, Sree Prasanna Rajagopal, Sagar Goyal, Fen Zhao, Bharath Chintagunta, and Jeff Ward. A continued pretrained LLM approach for automatic medical note generation. arXiv preprint arXiv:2403.09057, 2024a.
Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, et al. ChatMusician: Understanding and generating music intrinsically with LLM. arXiv preprint arXiv:2402.16153, 2024b.
Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Ren Kan, Dongsheng Li, and Deqing Yang. EasyTool: Enhancing LLM-based agents with concise tool instruction. arXiv preprint arXiv:2401.06201, 2024c.
Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. Causal parrots: Large language models may talk causality but are not causal. arXiv preprint arXiv:2308.13067, 2023.
Eric Zelikman, Jesse Mu, Noah D Goodman, and Yuhuai Tony Wu.
STaR: Self-taught reasoner, bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems (NeurIPS), 2022.
Jiawei Zhang. Graph-ToolFormer: To empower LLMs with graph reasoning ability via prompt augmented by ChatGPT. arXiv preprint arXiv:2304.11116, 2023.
Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, and Shumin Deng. Exploring collaboration mechanisms for LLM agents: A social psychology view. arXiv preprint arXiv:2310.02124, 2023a.
Kai Zhang, Fubang Zhao, Yangyang Kang, and Xiaozhong Liu. Memory-augmented LLM personalization with short- and long-term memory coordination. arXiv preprint arXiv:2309.11696, 2023b.
Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618, 2025a.
Ran Zhang and Steffen Eger. LLM-based multi-agent poetry generation in non-cooperative environments. arXiv preprint arXiv:2409.03659, 2024.
Shuo Zhang, Boci Peng, Xinping Zhao, Boren Hu, Yun Zhu, Yanjia Zeng, and Xuming Hu. LLaSA: Large language and e-commerce shopping assistant. arXiv preprint arXiv:2408.02006, 2024a.
Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, et al. FinAgent: A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. arXiv preprint arXiv:2402.18485, 2024b.
Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, and Bo An. AgentOrchestra: Orchestrating hierarchical multi-agent intelligence with the tool-environment-agent (TEA) protocol. arXiv preprint arXiv:2506.12508, 2025b.
Xiaoqing Zhang, Xiuying Chen, Yuhan Liu, Jianzhou Wang, Zhenxing Hu, and Rui Yan.
A large-scale time-aware agents simulation for influencer selection in digital advertising campaigns. arXiv preprint arXiv:2411.01143, 2024c.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
Yuyue Zhao, Jiancan Wu, Xiang Wang, Wei Tang, Dingxian Wang, and Maarten de Rijke. Let me do it for you: Towards LLM empowered recommendation via tool learning. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1796–1806, 2024a.
Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, and Doyen Sahoo. Automatic curriculum expert iteration for reliable LLM reasoning. arXiv preprint arXiv:2410.07627, 2024b.
Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797, 2023.
Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V Le, Ed H Chi, Denny Zhou, Swaroop Mishra, and Huaixiu Steven Zheng. Self-discover: Large language models self-compose reasoning structures. arXiv preprint arXiv:2402.03620, 2024a.
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023a.
Wentao Zhou, Jinlin Wang, Longtao Zhu, Yi Wang, and Yulong Ji. Flight arrival scheduling via large language model. Aerospace, 11(10):813, 2024b.
Xuhui Zhou, Yue Zhang, Leyang Cui, and Dandan Huang. Evaluating commonsense in pre-trained language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34-05, pages 9733–9740, 2020.
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, et al. Sotopia: Interactive evaluation for social intelligence in language agents. arXiv preprint arXiv:2310.11667, 2023b.
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. Sotopia: Interactive evaluation for social intelligence in language agents, 2024c. URL https://arxiv.org/abs/2310.11667.
Shenzhe Zhu, Jiao Sun, Yi Nian, Tobin South, Alex Pentland, and Jiaxin Pei. The automated but risky game: Modeling agent-to-agent negotiations and transactions in consumer markets. arXiv preprint arXiv:2506.00073, 2025.
Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. ToolQA: A dataset for LLM question answering with external tools. Advances in Neural Information Processing Systems, 36:50117–50143, 2023.
Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066, 2023.
Philip G Zimbardo. Stanford prison experiment: A simulation study of the psychology of imprisonment. Philip G. Zimbardo, Incorporated, 1972.
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.
Received 29 March 2025; accepted 6 June 2025