# evaluating_language_model_agency_through_negotiations__25cefb04.pdf Published as a conference paper at ICLR 2024 EVALUATING LANGUAGE MODEL AGENCY THROUGH NEGOTIATIONS Tim R. Davidson1 Veniamin Veselovsky1 Michal Kosinski2 Robert West1 1EPFL, 2Stanford University We introduce an approach to evaluate language model (LM) agency using negotiation games. This approach better reflects real-world use cases and addresses some of the shortcomings of alternative LM benchmarks. Negotiation games enable us to study multi-turn, and cross-model interactions, modulate complexity, and side-step accidental evaluation data leakage. We use our approach to test six widely used and publicly accessible LMs, evaluating performance and alignment in both self-play and cross-play settings. Noteworthy findings include: (i) only closed-source models tested here were able to complete these tasks; (ii) cooperative bargaining games proved to be most challenging to the models; and (iii) even the most powerful models sometimes lose to weaker opponents.1 1 INTRODUCTION Recent language models (LMs) show a remarkable emergent ability to engage in agent-like behavior (Andreas, 2022). This has led to an outburst of commercial efforts to create LM-powered agents capable of completing tasks that require extensive interactive reasoning (Toews, 2022; Tobin et al., 2023; Spataro, 2023; Pinsky, 2023). A future where AI agents are broadly adopted by consumers, companies, and organizations to perform tasks with increasing levels of autonomy, seems both plausible and near (Mok, 2023). As LMs become more integrated into our society, there is an urgent need to reliably evaluate their performance and alignment. Despite the notable paradigm shift toward dynamic applications of LMs, their evaluation methods have remained predominantly static (Liang et al., 2023; Srivastava et al., 2023; Zhong et al., 2023). This is problematic, as static benchmarks poorly capture LMs ability to act as agents and fail to consider realistic economic constraints. Moreover, the improvement of static benchmarks is limited by several factors. Firstly, as the development of many LMs is shrouded in secrecy, it is challenging to ascertain whether models have been exposed to benchmarks in their training data (Zanella-Béguelin et al., 2020). While one could address this issue by keeping benchmarks secret, this would reduce the validity, integrity, and transparency of the assessment process (He, 2023; Open AI, 2023). Instead, one could employ dynamic benchmarks, where tasks are dynamically generated each time a model is tested. Secondly, static benchmarks tend to quickly become obsolete. The ever-increasing breadth of LM-based applications requires an ever-expanding suite of tests, while their growing performance demands constantly increasing benchmarks difficulty to keep them challenging. To ensure scalability, benchmarks should thus co-evolve with the LMs they are designed to test (Perez et al., 2022; 2023). This can be achieved, for example, by pitching LMs against each other in game-like tasks. Thirdly, as LMs are trained on text generated by many authors, their performance on a given task often depends on which context or persona they are trying to emulate (Nardo, 2023; Wolf et al., 2023). Past research has addressed this peculiarity by employing diverse prompting strategies designed to Equal contribution, correspondence to tim.davidson@epfl.ch. For an updated list of authors, please see https://arxiv.org/abs/2401.04536 and cite accordingly. 1We release our framework as an open-source library allowing other scholars and the OSS community to conveniently replicate and extend our findings. Our code and link to generated data are made available here: https://github.com/epfl-dlab/LAMEN. Published as a conference paper at ICLR 2024 Figure 1.1: Annotated example of a structured negotiation between two agents. boost performance on specific tasks (Wei et al., 2022; Yao et al., 2023). Yet, our limited understanding is particularly problematic in the context of harmful behaviors, which may occur only occasionally during open-ended tasks like extended dialogue (Perez et al., 2023). In addition to single-turn tasks focused solely on performance, one could thus employ multi-turn tasks engaging models in longer interactions to improve insight into models behavior (Holtzman et al., 2023) Finally, the future will likely bring not only more human-to-machine interactions but also an explosion of machine-to-machine interactions (Zhuge et al., 2023). The outcomes of the latter interactions are difficult to model using only static tasks and self-play (Silver et al., 2016; Lanctot et al., 2017; Gleave et al., 2020) and could be tested more effectively using tasks requiring cross-model interactions. Many of the aforementioned limitations of static benchmarks stem from LMs shift toward agent-like behavior. Hence, we might expect to find solutions to these limitations in fields historically concerned with (multi-) agent systems (Silver et al., 2016; Brown & Sandholm, 2018; Vinyals et al., 2019). To make problems in these fields tractable, classic agents are generally designed for specific tasks using curated training data and operate in environments with restricted input and output parameters. Such restrictions allow for a more narrow view of evaluation and alignment. In contrast, the coming wave of turn-key, general-purpose LM agents signals a phase transition: Instead of controlled optimization using curated data, LMs capabilities emerge from vast samples of text of varying quality. This lack of control intertwines the issues of performance and alignment; the same random process responsible for creating desired capabilities can bring about highly harmful behaviors (Roose, 2023; Perrigo, 2023). Yet, a singular focus on mitigating the latter might adversely affect the former (Bai et al., 2022a; Chen et al., 2023). Disjoint evaluation of either thus seems insufficient. In this work, we advocate for evaluating LMs using dynamic, co-evolving benchmarks that allow for multi-turn, and cross-model interaction. Specifically, we propose the use of structured negotiations to jointly assess LM alignment and performance. We test our approach on publicly available models from viz., Anthropic, Cohere, Google, Meta, and Open AI and show that negotiation games, built of relatively simple segments, can be made arbitrarily complex and lend themselves well to analyzing alignment. Moreover, as negotiating is ubiquitous in our society, negotiation games provide a realistic and ecologically valid assessment context. Finally, negotiations are naturally defined as multi-agent tasks and involve multiple rounds of interaction. We release an open-source library and all data generated during this project ( LAMEN transcripts) allowing other scholars and the OSS community to conveniently replicate and extend our findings. Published as a conference paper at ICLR 2024 You are an advisor representing the best interests of the landlord. GAME PROTOCOL RULES AGENT ROLE NEXT NOTE / MESSAGE NEGOTIATION DIALOGUE INITIALIZATION CONTEXT Issue 1 Issue 2 Issue 2 Issue 2 Issue n: Payoffs: {...} Rental agreement Figure 1.2: Structural diagram representing a negotiation game. A negotiation is initialized through a Game setting, a set of Issues, negotiation protocol rules, and agent role descriptions. LM agents are recursively prompted to generate notes/messages using the initialization context and past dialogue as inputs. A negotiation game ends when a completion criterion is met. 2 DEFINING STRUCTURED NEGOTIATION GAMES We define a structured negotiation as two agents playing a Game according to some negotiation protocol. Games are defined by a negotiation setting, e.g., A landlord and a tenant are negotiating a rental agreement , one or more Issues the agents must resolve, e.g., You have to negotiate the monthly rent amount , and a payoff table pertaining to each Issue. Payoff tables contain the range of negotiation values, the amount of payoff each value provides, and the relative importance of the Issues to each of the agents. Additionally, the negotiation protocol outlines negotiation rules, e.g., Only make offers using values in your payoff tables and termination conditions, e.g., The maximum number of rounds is reached . The agents are parameterized by language models, combining the Game, Issues, and protocol descriptions with an agent-specific role as the initialization contexts. Crucially, the payoff tables are not shared between the agents.2 A negotiation unfolds as agents take turns generating a private mental note , followed by a public message directed to the other negotiating party. At each turn, agents generate their next note and message in response to a priori fixed prompts. During the Game, agents have access to their initialization context, the history of public messages, and a limited history of their most recent notes. The goal of each agent is to maximize the overall payoff across all Issues, weighted by their relative importance. Failure to reach an agreement on one or more Issues results in a total payoff of zero. A structural diagram of the process is depicted in Figure 1.2. 2.1 TYPES OF NEGOTIATION ISSUES Figure 2.1: Payoff curves of two agents playing a variety of Games: (a) for a single-issue distributive Game agents have opposing interests, while for (b) single-issue compatible Games, agents interests are aligned, (c) displays a mixture Game with the two types of Issues, and (d) a two-issue integrative distributive Game, where agents value each Issue differently creating opportunities for trade-offs. Distributive vs. Compatible Issues. As illustrated in Figure 2.1 we distinguish between distributive Issues, where a fixed amount of payoff must be divided among players with opposing interests, and compatible Issues, where both players interests are aligned. Consider Alice and Bob sharing a pizza. If both of them are hungry, dividing the slices between them represents a distributive Issue (Panel a): 2The Games, Issues, negotiation protocol rules, agent roles, and prompts used can be found in Appendix E. Published as a conference paper at ICLR 2024 Alice s gain is Bob s loss. If both of them enjoy cheese, deciding on the amount of cheese to put on the pizza represents a compatible Issue (Panel b). Simultaneously deciding on the number of slices and the amount of cheese represents a mixture Game (Panel c). Integrative vs. Non-integrative Games. In the real world, payoffs are rarely symmetric. Two people are virtually never equally hungry or equally enjoy cheese. Thus, the payoff functions for both distributive and compatible Issues are typically asymmetric (Panel d): The payoff for one person s loss (e.g., of a slice of pizza) does not equal the payoff gained by another person. This asymmetry enables parties to increase their combined payoffs by discovering synergies between different Issues under negotiation. Hence, we additionally differentiate between integrative and non-integrative Games. In integrative Games, players have different preference weights over a set of Issues, e.g., Alice is much more hungry than Bob, so her payoff table puts a much higher value on each of the additional slices she gets. On the other hand, Bob loves a thick crust, while Alice has a slight preference for a thin one. Here, Bob could trade some of the slices for a thicker crust. In non-integrative Games, players have the same Issue preferences. While most past research on games in machine learning has focused on competitive, pure-conflict games, most real-world scenarios contain cooperative elements (Dafoe et al., 2020). Concretely, LM agents that fail to cooperate might succeed in securing a higher payoff than their opponent, while failing to maximize the combined potential payoffs achievable in the game. We are thus both interested in the ability to secure a higher payoff for oneself, as well as the combined value of payoffs for both agents. 2.2 TASK COMPLEXITY AND CO-EVOLVING BENCHMARKS Task Complexity. As discussed before, static benchmarks tend to become outdated as LMs improve. In contrast, negotiation games co-evolve with LMs. Moreover, besides payoff tables and the types of Issues, the complexity of negotiation games depends on the number of Issues under negotiation and the negotiation setting, ranging from negotiating a rental agreement to a corporate merger. Co-Evolving Benchmarks: Self-Play and Cross-Play. Self-play, or a Game where an LM agent negotiates against a different instance of itself, provides a useful internal benchmark that does not require external dependencies. However, self-play provides limited insight into the transportability of results (Pearl & Bareinboim, 2011). For example, performance can be overestimated if other models take advantage of a model s willingness to cooperate or underestimated if a model performs poorly against itself but can outperform other, weaker models. 2.3 STATE-OF-MIND CONSISTENCY From a safety and reliability perspective, it is important to assess the consistency of an agent s state of mind. Thus, we designed our negotiation protocol to capture both internal and external metrics of faithfulness. In natural language processing (NLP), faithfulness is a concept used to describe how accurately a model s reasoning explains its answers/actions. Faithfulness has become a topic of increasing interest for alignment efforts (Jacovi & Goldberg, 2020; He et al., 2022; Lanham et al., 2023; Turpin et al., 2023; Hu & Clune, 2023), as high degrees of faithfulness could increase our trust in model predictions and accelerate the safe integration of LM agents into critical systems. To measure internal faithfulness, agents are asked to summarize acceptable offers for each Issue in their mental notes. For example, we prompt the model playing Alice to state the number of pizza slices she would agree to receive. If Alice makes an offer to Bob for fewer slices than she stated as acceptable, we register this as an instance of internal unfaithfulness. For structured negotiations, a causal model for making offers is influenced by an agent s payoff table and ability, the perception of the other agent s payoff table and ability, and the negotiation rules. Typically, only the negotiation rules and an agent s own payoff table are observed during a negotiation. To successfully negotiate an agreement, it is thus important to estimate the opposing party s payoff table, also known as theory of mind (To M) inference (Premack & Woodruff, 1978; Lee et al., 2019; Kosinski, 2023). To assess LMs ability to predict the other party s payoff table, we Published as a conference paper at ICLR 2024 prompt an agent before making their next offer, to generate acceptable offers from the perspective of the other agent, e.g., we ask the model playing Alice how many slices of pizza she believes Bob would accept to receive. If Alice offers more slices than she believes Bob would settle for in the following turn, we register this as an instance of external unfaithfulness. 3 EXPERIMENTAL SETUP In this Section, we discuss implementation-specific details and explain the reasoning behind the conducted experiments. We refer to Appendix B for additional specifics on architectural decisions. 3.1 CONTROLLING FOR BIAS LMs are trained on large amounts of text generated by humans. There are various ways in which this could lead to biased negotiation outcomes, i.e., giving one agent some unfair advantage over another agent. In this section, we discuss two biases we actively control for. Additional bias considerations are discussed in Appendix A.2. Intra-Game Biases. First, the Game, Issues, and negotiation protocol descriptions used to initialize an LM agent might contain language that unintentionally benefits one side. For example, mentioning specific geographical locations or employer/employee relations could lead to cultural (Brett & Gelfand, 2006) or power-balance biases (Schaerer et al., 2020). Secondly, the anchoring bias can give an advantage to the agent initiating the negotiation (Galinsky & Mussweiler, 2001). We control for these biases by having each agent play both sides and both starting positions and then taking the average over outcomes. Agent-Persona Biases. As discussed in Section 1, persona-steering is an open problem for LMs. We attempt to minimize the effect of specific persona biases by stating that agents are representatives of the respective negotiating parties and removing any mention of gender (Bowles et al., 2022). 3.2 PERFORMANCE FACTORS Several architectural decisions can have a non-trivial influence on LM agents performance in our setup. Two settings of primary interest are the length of notes/messages and the note history available. We provide details on the hyperparameter values used for our experiments in Appendix A.3. Compute Capacity. LM agents are restricted as to how many words they can use per note or message. Providing insufficient generative capacity limits the complexity of plans that can be made and messages that can be shared. On the other hand, providing too much capacity might lead to hallucinations (Maynez et al., 2020) and has practical cost considerations. Memory. To generate the next note or message, LM agents can access their memory of previous notes, messages, and opponent messages. In practice, not all LMs have enough prompt-context capacity to represent the entire negotiation history. Park et al. (2022) solve this by implementing a memory module , that retrieves the most relevant memories for the task at hand. In this work, we focus instead on varying the available note history to minimize adding additional components. 3.3 BENCHMARKS We evaluate several public state-of-the-art models, viz., Open AI s gpt-3.5 and gpt-4, Google s chatbison, Anthropic s claude-2, Cohere s command and command-light, and Meta s LLa MA 2 models. Before subjecting models to extensive self-play experiments, we first test if they pass the minimum qualifying requirements. Models are tasked to self-play a single-issue distributive Game ten times and require at least one successful agreement to proceed (see next section). Self-Play and Cross-Play. For self-play, models negotiate against independent instances of themselves. Because self-play outcomes are symmetric3, we are primarily interested in agreement rates 3The same model plays both sides and starting positions after which results are aggregated and averaged. Published as a conference paper at ICLR 2024 and the ability to maximize cooperative opportunities. For cross-play, each model plays against the other qualifying models. Cross-play performance is of particular interest for measuring model robustness, as messages generated by opponents will likely be out-of-distribution. For both self-play and cross-play, we investigate the ability to reach agreements, follow instructions, and stay faithful. 3.4 COMPLETION CRITERIA AND EVALUATION METRICS Agents are instructed to signal agreement by using a public message stating a hard-coded agreement phrase. Unfortunately, even if both parties use the agreement phrase, internal states might differ making the agreement invalid. We thus register a soft agreement ( ) if agents internal states align and a hard agreement if in addition the agreement phrase is used by both parties ( ).4 The agreement rate is the ratio of games that result in a soft/hard agreement. A negotiation ends when both agents use the agreement phrase or a maximum of 10 rounds is reached. We report normalized total payoffs (U) and normalized payoffs for games that end in agreement (U ), where U, U [0,1]. Alignment metrics of interest are internal and external faithfulness as defined in Section 2.3, and the ability to follow instructions. Instruction-following is crucial for safe deployment and to ensure LM agents can carry out tasks effectively. We measure instruction-following behavior of staying within the maximum number of words allowed to generate notes/messages (note/msg instruct) and the ability to correctly format internal offer indications using valid JSON (format instruct). All alignment metrics are reported as fractions between 0 and 1, with 1 indicating a perfect score. 3.5 NEGOTIATION GAMES EVALUATED We experiment with Games including one or two Issues.5 Game complexity increases as we (i) move from one to two Issues, (ii) mix distributive and compatible Issues, and finally (iii) introduce integrative preference weights. For games with a compatible Issue or integrative preference weights, cooperative bargaining opportunities arise, i.e., both agents can obtain more than U = 0.5. We refer to Appendix A for a detailed overview and discussion on determining default settings and debiasing ablations. Average results and standard errors are reported over 25+ runs for each model, except for gpt-4, which has just over 15 runs on average due to high costs. gpt-4 results are therefore marked with an asterisk (*). After running qualifier experiments all models except LLa MA 2 models advanced. Upon qualitative inspection of command-light self-play results, we opted to exclude this model from cross-play indicated by a dagger ( ) (Examples are provided in Appendix G). 4.1 SELF-PLAY Summaries for alignment metrics, agreement rates, and the average number of rounds are reported in Table 1. We found that gpt-4 had superior faithfulness and instruction-following metrics, but ranks near the bottom for agreement rate and requires the most rounds on average. claude-2 and command consistently fail to follow note/message word limit restrictions. gpt-3.5 proves the most efficient at self-play. All models succeed reasonably well in following the formatting instructions. Table 1: Summary of average self-play metrics. Higher is better except for Avg. Rounds. int. faithful ext. faithful note instruct msg instruct format instruct soft ( ) hard ( ) Avg. Rounds chat-bison 0.79 0.02 0.61 0.03 0.83 0.02 0.99 0.00 0.98 0.00 0.19 0.04 0.10 0.03 9.40 0.19 claude-2 0.79 0.02 0.77 0.03 0.08 0.01 0.09 0.01 0.96 0.00 0.61 0.05 0.26 0.05 8.53 0.28 command 0.85 0.02 0.76 0.05 0.23 0.05 0.42 0.03 0.92 0.02 0.36 0.08 0.18 0.08 7.93 0.49 command-light 0.84 0.04 0.78 0.04 0.20 0.03 0.40 0.03 0.91 0.04 0.49 0.08 0.22 0.07 8.23 0.40 gpt-4 0.91 0.01 0.92 0.03 1.00 0.00 1.00 0.00 1.00 0.00 0.28 0.07 0.19 0.05 9.58 0.17 gpt-3.5 0.91 0.01 0.85 0.02 0.74 0.02 0.78 0.04 0.98 0.00 0.46 0.05 0.40 0.05 6.34 0.18 4See Appendix A.1 for the agreement phrase used and additional discussion. 5Some models failed to complete Games with more than two Issues. Published as a conference paper at ICLR 2024 Table 2: Self-play results for negotiation games with a single Issue (1) and two Issues (2), where indicates soft agreement rate and U, U , total/agreed-only normalized payoffs respectively. Note that the underlying optimization problem increases in difficulty as we go down each section. (1) Single Issue Distributive Compatible U U U U chat-bison 0.35 0.00 0.18 0.00 0.50 0.46 0.18 0.44 0.19 0.92 0.04 claude-2 0.88 0.00 0.44 0.00 0.50 0.75 0.00 0.46 0.07 0.61 0.09 command 0.10 0.10 0.05 0.05 0.50 0.60 0.20 0.45 0.11 0.78 0.08 command-light 0.46 0.21 0.23 0.11 0.50 0.35 0.15 0.28 0.10 0.82 0.08 gpt-4 0.75 0.08 0.38 0.04 0.50 0.58 0.08 0.57 0.08 0.99 0.01 gpt-3.5 0.53 0.03 0.26 0.01 0.50 0.69 0.14 0.54 0.11 0.78 0.00 (2) Non-Integrative Distributive Mixture U U U U chat-bison 0.12 0.01 0.06 0.00 0.50 0.25 0.13 0.15 0.06 0.65 0.10 claude-2 0.60 0.03 0.30 0.01 0.50 0.56 0.06 0.32 0.04 0.57 0.01 command 0.40 0.20 0.20 0.10 0.50 0.12 0.13 0.09 0.09 0.75 command-light 0.58 0.08 0.29 0.04 0.50 0.80 0.20 0.48 0.08 0.60 0.04 gpt-4 0.35 0.05 0.18 0.03 0.50 0.44 0.22 0.33 0.17 0.72 0.03 gpt-3.5 0.43 0.28 0.21 0.14 0.50 0.38 0.08 0.25 0.06 0.64 0.01 (2) Integrative Distributive Mixture U U U U chat-bison 0.19 0.19 0.10 0.10 0.52 0.06 0.06 0.03 0.04 0.52 claude-2 0.68 0.18 0.36 0.09 0.53 0.00 0.60 0.03 0.33 0.04 0.55 0.04 command 0.35 0.15 0.19 0.08 0.56 0.01 0.42 0.08 0.27 0.09 0.63 0.08 command-light 0.30 0.10 0.15 0.09 0.45 0.15 0.40 0.00 0.23 0.01 0.57 0.01 gpt-4 0.05 0.05 0.03 0.03 0.52 0.33 0.11 0.22 0.09 0.63 0.06 gpt-3.5 0.35 0.27 0.18 0.13 0.55 0.05 0.46 0.08 0.26 0.05 0.56 0.01 Single-Issue Games. As single-issue, distributive Games are zero-sum, games ending in agreement always report U = 0.5 during self-play. Hence, the agreement rate is the only metric of interest. We note claude-2 posts the highest agreement rate with chat-bison and command at the bottom. gpt-4 appears more skilled in finding competitive agreement, whereas command displays the inverse. For compatible Issues, the challenge is to discover that the agents interests are aligned. command, claude-2, and gpt-3.5 have the highest agreement rates, but converge to mediocre agreements upon reaching agreement. In contrast, gpt-4 has a worse agreement rate but near-perfect payoffs upon reaching an agreement. This would indicate that when gpt-4 reaches an agreement it does so by maximizing the interest alignment, while command, claude-2, and gpt-3.5 do not. Two-Issue Games. While relative agreement-rate rankings approximately hold, adding an additional Issue reduces agreement rates across all models. Recall that for integrative Games, agents have different Issue preferences. This provides opportunities to increase the overall payoffs through trade-offs but also complicates To M inference. We note that models struggle to cooperate, barely increasing their integrative distributive payoffs to U > 0.5. We further note that gpt-4 continues to excel in optimizing agreements for Games involving compatible Issues but with a low agreement rate. 4.2 CROSS-PLAY Average cross-play metrics are reported in Table 3. Several issues stand out when comparing crossplay results with the self-play results in Table 1. First, the models that performed best in self-play (e.g., gpt-4, chat-bison), appear to nudge instruction-following metrics upwards for the lesser models at the expense of their own performance, aligning with concurrent work by Zhou et al. (2024). Table 3: Summary of average cross-play metrics. Higher is better except for Avg. Rounds. int. faithful note instruct msg instruct format instruct (soft) (hard) Avg. Rounds chat-bison 0.85 0.01 0.71 0.03 0.76 0.04 0.97 0.01 0.43 0.03 0.18 0.03 8.88 0.17 claude-2 0.83 0.01 0.37 0.03 0.41 0.02 0.97 0.00 0.50 0.02 0.21 0.02 8.74 0.15 command 0.87 0.01 0.49 0.03 0.59 0.03 0.95 0.01 0.46 0.03 0.20 0.02 8.51 0.15 gpt-4 0.88 0.01 0.78 0.02 0.81 0.02 0.98 0.00 0.42 0.03 0.25 0.02 8.81 0.14 gpt-3.5 0.90 0.01 0.66 0.03 0.72 0.03 0.97 0.01 0.48 0.03 0.34 0.03 7.60 0.12 Published as a conference paper at ICLR 2024 Secondly, the average agreement rate increases significantly for all models except the previously best performing claude-2. This is paired with a decrease in average rounds needed to reach an agreement, offset by a slight increase for gpt-3.5. These results provide promising evidence that strong LMs could serve as effective teachers for weaker models. Table 4: Cross-play results for negotiation games with a single Issue (1) and two Issues (2), where indicates soft agreement rate and U, U , total/agreed-only normalized payoffs respectively. (1) Single Issue Competitive Cooperative U U U U chat-bison 0.49 0.04 0.22 0.02 0.45 0.03 0.62 0.07 0.50 0.06 0.81 0.03 claude-2 0.55 0.04 0.29 0.03 0.52 0.02 0.57 0.03 0.44 0.02 0.78 0.03 command 0.52 0.05 0.23 0.03 0.45 0.06 0.55 0.07 0.44 0.05 0.80 0.03 gpt-4 0.59 0.05 0.27 0.03 0.46 0.02 0.50 0.05 0.43 0.04 0.87 0.03 gpt-3.5 0.57 0.05 0.34 0.04 0.61 0.05 0.65 0.05 0.52 0.05 0.80 0.03 (2) Two Issues Competitive Cooperative U U U U chat-bison 0.31 0.04 0.16 0.03 0.49 0.04 0.38 0.05 0.21 0.03 0.57 0.03 claude-2 0.48 0.07 0.24 0.03 0.52 0.03 0.46 0.03 0.25 0.02 0.55 0.02 command 0.44 0.08 0.21 0.04 0.47 0.02 0.42 0.04 0.23 0.03 0.56 0.02 gpt-4 0.42 0.06 0.22 0.04 0.49 0.05 0.33 0.04 0.21 0.03 0.62 0.03 gpt-3.5 0.38 0.07 0.19 0.04 0.52 0.04 0.43 0.04 0.26 0.02 0.61 0.02 Cross-play performance is reported in Table 4. The results were grouped into single-issue and twoissue, cooperative and competitive Games.6 Cooperative Games consist of those with opportunities for cooperation, e.g., through compatible-issue coordination or integrative bargaining. Competitive Games consist of pure conflict, distributive-only Issues with no integration. The overall strongest negotiator is gpt-3.5, leading almost every category. claude-2, which excelled in finding agreements during self-play, sees a drop in relative agreement-rate ranking for the cross-play negotiations. This highlights the usefulness of benchmarking against other models to evaluate robustness. While chatbison still has the worst average performance, its results are much closer to the other models than during self-play. Continuing the behavior observed during self-play, gpt-4 performs strongly in the cooperative, agreed-only payoffs category for cross-play as well. Perhaps surprisingly, gpt-4 ranks near the bottom in many other categories. 5 LIMITATIONS AND ETHICAL CONSIDERATIONS Costs. Except for the open-source LLa MA 2 models, all models studied in this work are only accessible through paid APIs. This financially constrained the number of experiments we could perform, hampering our ability to reduce confidence intervals further. Researchers interested in benchmarking their models through cross-play will depend on third parties. This might prove prohibitively expensive. An alternative could be to test against cheaper models and use latent-ability frameworks like the ELO rating system to extrapolate ranking results (Elo & Sloan, 1978; Boubdir et al., 2023). Prompts and Settings. We sought to engineer prompts with minimal adverse effects across all models. However, a set of prompts likely exists that would be more beneficial for each model. We tried to alleviate this by running all models with a temperature of 0.2 and averaging results over many runs. Similarly, we took great care in selecting reasonable, unbiased default settings for our architecture. Appendix A presents more results and discussion on this matter. Ethical Considerations. Deploying LM agents in our society has both considerable risk and upside. We hope that this work and the open-sourcing of our code can contribute to tracking evolving LM agency, expose risks such as unfaithful tendencies, and accelerate safety research. At the same time, we are aware that malicious actors might use our framework to only select for negotiation ability. 6Head-to-head cross-play results are available in Appendix D. Published as a conference paper at ICLR 2024 6 RELATED WORK Language Model Evaluation. Evaluating language models is currently one of the most pressing problems in NLP. Opaque training datasets make it difficult to detect data contamination, which can lead to deceptive evaluation metrics, e.g., on competitive programming (He, 2023; Josifoski et al., 2023), eroding public trust. Competing corporate interests and fear of data leakage further reduce the release of evaluation datasets. For example, even the LM open-source champion Meta did not reveal or share any of the data used to train their LLa MA 2 models (Touvron et al., 2023). The use of crowdsourcing platforms, traditionally the go-to source for collecting large, human-annotated datasets, has also come under scrutiny due to crowd workers increased use of LMs (Veselovsky et al., 2023b;a). To combat the decrease in human-annotated data sets, evaluation research has increasingly started looking at utilizing LMs for self-correction. Examples span using LMs to rank model outputs (Dettmers et al., 2023; Kwon et al., 2023), red teaming (Perez et al., 2022), and alignment (Lee et al., 2023; Bai et al., 2022b; Gulcehre et al., 2023; Wang et al., 2023). Our work falls under this category as LMs are used to evaluate themselves through selfand cross-play negotiations. LM-Based Agents. There s been a recent explosion in efforts exploring the agent potential of LMs (Andreas, 2022); Adding the ability to use external tools (Yao et al., 2022; Schick et al., 2023), bootstrapping LM agency using specialized LM agents as building blocks (Nakajima, 2023; Team, 2023; Zhuge et al., 2023; Qian et al., 2023), or even simulating entire LM-agent societies (Park et al., 2023). Yet other works explore the use of LMs as add-on layers to improve interactive perception for RL-based robotics (Ahn et al., 2022). We refer to (Xi et al., 2023) for a comprehensive overview of further research into LM-agent potential. In contrast, we do not focus on creating or enhancing LM agents, but rather on providing a useful framework to evaluate innate LM agency. AI Agents Negotiating. Creating AI agents to play negotiation-based games has long been a subject of interest (Oliver, 1996; Lau et al., 2006; Lopes et al., 2008; Jonker et al., 2012; Gratch et al., 2015; Baarslag et al., 2017). Due to the lack of natural language understanding, past works were limited to modeling environments with restricted, standardized inputs and outputs. To provide additional optimization structure, various works started to propose hybrid architectures combining ideas from RL and LMs (Lewis et al., 2017; He et al., 2018; Bakker et al., 2019; Gray et al., 2021). With recent advances in LMs, there has been a surge in works exploring the use of LMs in negotiations. Most of these investigate few-shot or single-issue negotiations (Guo, 2023; Brookins & De Backer, 2023; Fu et al., 2023), whereas we are interested in LM agent behavior over extended periods on arbitrarily complex games. Additionally, we aim to jointly evaluate alignment and performance. 7 CONCLUSION Society evolves around interactions; countless human exchanges of information to advance a plethora of objectives. The advent of language model agents is monumental, in that it presents the first time in our history that non-human interactions enter society. Yet, whereas human interactions are governed by a shared understanding of motivations and common sense, the drivers and limitations of LMs are largely unknown. This research aims to shed light on these blind spots. We emphasize the importance of tasks that mirror real-world deployment, jointly assess alignment and performance, and offer resistance against evaluation data leakage. We underscore the shortcomings of static LM benchmarks in meeting these criteria and propose negotiation games as a promising alternative approach. We used our approach to evaluate publicly accessible, state-of-the-art LMs on several negotiation games. At the time of writing, only closed models are capable of completing our tasks. We expect this will soon change. Our analysis further showed that current LMs struggle to find cooperative bargaining opportunities or solve Games with multiple Issues. Surprisingly, while superior in faithfulness and instruction-following, the most powerful model, gpt-4, underperformed in negotiation outcomes. Given the variety of possible interactions, much more work is needed to safely integrate LMs into society. We believe negotiation games form a fertile testing ground and encourage the community to explore several natural extensions in future work, e.g., how human negotiation biases carry over to LM agents, allowing access to external tools, or the effect of repeated games on decision-making. We hope that by open-sourcing our framework, we can convince more researchers from all disciplines to contribute toward better evaluation benchmarks for this unprecedented new paradigm. Published as a conference paper at ICLR 2024 ACKNOWLEDGMENTS The authors would like to thank Nicola De Cao, Caglar Gulcehre, Manoel Horta Ribeiro, Andrew Leber, and Boi Faltings for helpful discussions, and Galaxia Wu for consulting on graphic design. Robert West s lab is partly supported by grants from the Swiss National Science Foundation (200021_185043, TMSGI2_211379), Swiss Data Science Center (P22_08), H2020 (952215), Google, and Microsoft. We also gratefully acknowledge compute support from the Microsoft Accelerate Foundation Model Academic Research program. Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. ar Xiv preprint ar Xiv:2204.01691, 2022. Jacob Andreas. Language models as agent models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp. 5769 5779. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.FINDINGS-EMNLP.423. URL https://doi.org/ 10.18653/v1/2022.findings-emnlp.423. Tim Baarslag, Michael Kaisers, Enrico Gerding, Catholijn M Jonker, and Jonathan Gratch. When will negotiation agents be able to represent us? the challenges and opportunities for autonomous negotiators. IJCAI, 2017. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova Das Sarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. ar Xiv preprint ar Xiv:2204.05862, 2022a. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron Mc Kinnon, et al. Constitutional ai: Harmlessness from ai feedback. ar Xiv preprint ar Xiv:2212.08073, 2022b. Jasper Bakker, Aron Hammond, Daan Bloembergen, and Tim Baarslag. Rlboa: A modular reinforcement learning framework for autonomous negotiating agents. In AAMAS, pp. 260 268, 2019. Robert Bontempo and Shanto Iyengar. Rio copa: A negotiation simulation. Columbia Caseworks, 2008. Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, and Marzieh Fadaee. Elo uncovered: Robustness and best practices in language model evaluation. EMNLP, GEM Workshop, 2023. Hannah Riley Bowles, Bobbi Thomason, and Inmaculada Macias-Alonso. When gender matters in organizational negotiations. Annual Review of Organizational Psychology and Organizational Behavior, 9:199 223, 2022. Jeanne M Brett and Michele J Gelfand. A cultural analysis of the underlying assumptions of negotiation theory. In Negotiation theory and research, pp. 173 201. Psychology Press, 2006. Philip Brookins and Jason Matthew De Backer. Playing games with gpt: What can we learn about a large language model from canonical strategic games? Available at SSRN 4493398, 2023. Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418 424, 2018. Lingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt s behavior changing over time? ar Xiv preprint ar Xiv:2307.09009, 2023. Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Neur IPS, 30, 2017. Published as a conference paper at ICLR 2024 Contributors to Wikimedia projects. Ultimatum game - Wikipedia, September 2023. URL https://en.wikipedia.org/w/index.php?title=Ultimatum_game&oldid= 1173609026. [Online; accessed 28. Sep. 2023]. Allan Dafoe, Edward Hughes, Yoram Bachrach, Tantum Collins, Kevin R Mc Kee, Joel Z Leibo, Kate Larson, and Thore Graepel. Open problems in cooperative ai. Neur IPS, Cooperative AI Workshop, 2020. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Neur IPS, 2023. Arpad E Elo and Sam Sloan. The rating of chessplayers: Past and present. Arco Pub., 1978. ISBN 0668047216 9780668047210. URL http://www.amazon.com/Rating-Chess Players-Past-Present/dp/0668047216. Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback. ar Xiv preprint ar Xiv:2305.10142, 2023. Adam D Galinsky and Thomas Mussweiler. First offers as anchors: The role of perspective-taking and negotiator focus. Journal of personality and social psychology, 81(4):657, 2001. Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. In ICLR, 2020. URL https://openreview. net/forum?id=HJg EMp VFw B. Jonathan Gratch, David De Vault, Gale M Lucas, and Stacy Marsella. Negotiation as a challenge problem for virtual humans. In Intelligent Virtual Agents: 15th International Conference, IVA 2015, Delft, The Netherlands, August 26-28, 2015, Proceedings 15, pp. 201 215. Springer, 2015. Jonathan Gray, Adam Lerer, Anton Bakhtin, and Noam Brown. Human-level performance in no-press diplomacy via equilibrium search. ICLR, 2021. Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling. ar Xiv preprint ar Xiv:2308.08998, 2023. Fulin Guo. Gpt agents in game theory experiments. ar Xiv preprint ar Xiv:2305.05516, 2023. Hangfeng He, Hongming Zhang, and Dan Roth. Rethinking with retrieval: Faithful large language model inference. ar Xiv preprint ar Xiv:2301.00303, 2022. He He, Derek Chen, Anusha Balakrishnan, and Percy Liang. Decoupling strategy and generation in negotiation dialogues. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun ichi Tsujii (eds.), EMNLP, pp. 2333 2343, Brussels, Belgium, October-November 2018. ACL. doi: 10.18653/v1/ D18-1256. URL https://aclanthology.org/D18-1256. Horace He. Horace He on X, September 2023. URL https://twitter.com/c HHillee/ status/1635790330854526981. [Online; accessed 27. Sep. 2023]. Ari Holtzman, Peter West, and Luke Zettlemoyer. Generative models as a complex systems science: How can we make sense of large language model behavior? ar Xiv preprint ar Xiv:2308.00189, 2023. Shengran Hu and Jeff Clune. Thought Cloning: Learning to think while acting by imitating human thinking. Neur IPS, 2023. Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? ACL, 2020. Catholijn M Jonker, Koen V Hindriks, Pascal Wiggers, and Joost Broekens. Negotiating agents. AI Magazine, 33(3):79 79, 2012. Martin Josifoski, Lars Klein, Maxime Peyrard, Yifei Li, Saibo Geng, Julian Paul Schnitzler, Yuxing Yao, Jiheng Wei, Debjit Paul, and Robert West. Flows: Building blocks of reasoning and collaborating ai. ar Xiv preprint ar Xiv:2308.01285, 2023. Published as a conference paper at ICLR 2024 Michal Kosinski. Theory of mind may have spontaneously emerged in large language models. ar Xiv preprint ar Xiv:2302.02083, 2023. Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. Reward design with language models. ICLR, 2023. Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. Neur IPS, 30, 2017. Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. ar Xiv preprint ar Xiv:2307.13702, 2023. Raymond YK Lau, Maolin Tang, On Wong, Stephen W Milliner, and Yi-Ping Phoebe Chen. An evolutionary learning approach for adaptive negotiation agents. International journal of intelligent systems, 21(1):41 72, 2006. Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. ar Xiv preprint ar Xiv:2309.00267, 2023. Minha Lee, Gale Lucas, Johnathan Mell, Emmanuel Johnson, and Jonathan Gratch. What s on your virtual mind? mind perception in human-agent negotiations. In Proceedings of the 19th ACM international conference on intelligent virtual agents, pp. 38 45, 2019. Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? end-to-end learning of negotiation dialogues. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel (eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2443 2453, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1259. URL https://aclanthology.org/D17-1259. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue WANG, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri S. Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Andrew Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/ forum?id=i O4LZib Eq W. Featured Certification, Expert Certification. Fernando Lopes, Michael Wooldridge, and Augusto Q Novais. Negotiation among autonomous computational agents: principles, analysis and challenges. Artificial Intelligence Review, 29:1 44, 2008. Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan Mc Donald. On faithfulness and factuality in abstractive summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1906 1919. Association for Computational Linguistics, July 2020. Aaron Mok. We ll all have AI assistants soon, Google AI cofounder says. Business Insider, September 2023. URL https://www.businessinsider.com/google-deepmindcofounder-mustafa-suleyman-everyone-will-have-ai-assistant-20239?r=US&IR=T. Yohei Nakajima. babyagi, September 2023. URL https://github.com/yoheinakajima/ babyagi. [Online; accessed 28. Sep. 2023]. Cleo Nardo. The waluigi effect (mega-post). Less Wrong, 2023. Published as a conference paper at ICLR 2024 Jim R Oliver. A machine-learning approach to automated negotiation and prospects for electronic commerce. Journal of management information systems, 13(3):83 112, 1996. Open AI. Gpt-4 technical report, 2023. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Neur IPS, 2022. Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Social simulacra: Creating populated prototypes for social computing systems. In UIST, pp. 1 18, 2022. Joon Sung Park, Joseph C. O Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In In the 36th Annual ACM Symposium on User Interface Software and Technology (UIST 23), UIST 23, New York, NY, USA, 2023. Association for Computing Machinery. Judea Pearl and Elias Bareinboim. Transportability of causal and statistical relations: A formal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, pp. 247 254, 2011. Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat Mc Aleese, and Geoffrey Irving. Red teaming language models with language models. EMNLP, 2022. Ethan Perez, Sam Ringer, Kamil e Lukoši ut e, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. ACL, 2023. Billy Perrigo. The New AI-Powered Bing Is Threatening Users. That s No Laughing Matter. Time, February 2023. URL https://time.com/6256529/bing-openai-chatgptdanger-alignment. Yury Pinsky. Bard can now connect to your Google apps and services. Google, September 2023. URL https://blog.google/products/bard/google-bard-newfeatures-update-sept-2023. David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind? Behavioral and brain sciences, 1(4):515 526, 1978. Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. ar Xiv preprint ar Xiv:2307.07924, 2023. Kevin Roose. Why a Conversation With Bing s Chatbot Left Me Deeply Unsettled. New York Times, February 2023. ISSN 0362-4331. URL https://www.nytimes.com/2023/02/ 16/technology/bing-chatbot-microsoft-chatgpt.html. Michael Schaerer, Laurel Teo, Nikhil Madan, and Roderick I Swaab. Power and negotiation: Review of current evidence and future directions. Current opinion in psychology, 33:47 51, 2020. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. 37th Conference on Neural Information Processing Systems, 2023. David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. Jared Spataro. Introducing Microsoft 365 Copilot your copilot for work - The Official Microsoft Blog. Official Microsoft Blog, May 2023. URL https://blogs.microsoft.com/blog/ 2023/03/16/introducing-microsoft-365-copilot-your-copilot-forwork. Published as a conference paper at ICLR 2024 Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. Auto GPT Team. Auto GPT, September 2023. URL https://github.com/Significant Gravitas/Auto GPT. [Online; accessed 28. Sep. 2023]. William Thomson. Chapter 35 Cooperative models of bargaining, volume 2, pp. 1237 1284. Elsevier, 1994. ISBN 978-0-444-89427-4. doi: 10.1016/S1574-0005(05)80067-0. URL https: //linkinghub.elsevier.com/retrieve/pii/S1574000505800670. Michael Tobin, Redd Brown, Subrat Patnaik, and Bloomberg. A.I. is the star of earnings calls as mentions skyrocket 77% with companies saying they ll use for everything from medicine to cybersecurity. Fortune, March 2023. URL https: //fortune.com/2023/03/01/a-i-earnings-calls-mentions-skyrocketcompanies-say-search-cybersecurity-medicine-customer-service. Rob Toews. A Wave Of Billion-Dollar Language AI Startups Is Coming. Forbes, March 2022. URL https://www.forbes.com/sites/robtoews/2022/03/27/a-waveof-billion-dollar-language-ai-startups-is-coming. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ar Xiv preprint ar Xiv:2307.09288, 2023. Miles Turpin, Julian Michael, Ethan Perez, and Samuel R Bowman. Language models don t always say what they think: Unfaithful explanations in chain-of-thought prompting. Neur IPS, 2023. Veniamin Veselovsky, Manoel Horta Ribeiro, Philip Cozzolino, Andrew Gordon, David Rothschild, and Robert West. Prevalence and prevention of large language model use in crowd work. ar Xiv preprint ar Xiv:2310.15683, 2023a. Veniamin Veselovsky, Manoel Horta Ribeiro, and Robert West. Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. AAAI, 2023b. Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350 354, 2019. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. ACL, 2023. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824 24837, 2022. Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. ar Xiv preprint ar Xiv:2304.11082, 2023. Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. ar Xiv preprint ar Xiv:2309.07864, 2023. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. ICLR, 2023. Published as a conference paper at ICLR 2024 Santiago Zanella-Béguelin, Lukas Wutschitz, Shruti Tople, Victor Rühle, Andrew Paverd, Olga Ohrimenko, Boris Köpf, and Marc Brockschmidt. Analyzing information leakage of updates to natural language models. In Proceedings of the 2020 ACM SIGSAC conference on computer and communications security, pp. 363 375, 2020. Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. ar Xiv preprint ar Xiv:2304.06364, 2023. Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, et al. Sotopia: Interactive evaluation for social intelligence in language agents. ICLR, 2024. Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. Neur IPS, Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models, 2023. Published as a conference paper at ICLR 2024 A.1 SOFT AND HARD AGREEMENTS Table 5: Summary of agreement conditions. Agreement? Phrase Notes YES YES NO NO Two agents play a negotiation game for N number of rounds. The goal of the negotiation is to come to the best possible agreement before the maximum number of rounds is reached. If no agreement is reached, a score of zero is recorded. A negotiation is ended before the maximum number of rounds is reached when both parties write a public message stating a hard-coded agreement phrase. In our case: "We agree on all issues." Unfortunately, even if both parties use the agreement phrase this does not necessarily imply a valid agreement has been reached. For example, the public offers recorded so far may indicate the two parties are far from an agreement. Alternatively, two parties can be in perfect agreement but at least one of the two fails to utter the agreement phrase. The former is difficult to solve in an automated fashion: It would involve extracting offers from free text at each step and correctly propagating results forward. At this point, this proved too error-prone. Instead, we opt to say two parties have reached a soft agreement if their mental notes indicate that they prefer the same acceptable offers. One can imagine, that in an API setting such agreements could be automatically matched using, e.g., a hash-matching scheme. Because hard agreements rely on the utterance of an agreement phrase, they highly correlate with an LM agent s ability to follow instructions. This skill varies widely among models at the time of this work. We register a hard agreement if internal states align and both parties use the agreement phrase. We expect that soft agreement metrics will be replaced by hard agreement metrics as LM agency advances. A.2 BIAS CONSIDERATIONS Persona Mixtures. Language models are pre-trained on text samples from a large number of different authors. Hence, we can think of LMs as a superposition of various personas (Nardo, 2023; Wolf et al., 2023). Furthermore, it was empirically shown that current LMs modify their generative behavior based on their prompt context (Andreas, 2022). This presents a tricky situation from an evaluation perspective, as we never quite know which persona mixture is responsible for generating responses. In this work, we limit our exploration to two settings: (i) we do not provide any persona description, (ii) we explicitly state that the LM agent is an expert or novice negotiator. Our goal here is to measure if there is a significant difference in performance between the average, expert, and novice initialization. RLHF Effects. Reinforcement learning from human feedback (RLHF) is used by most SOTA LMs to better align LMs generative behavior with human preferences and values (Christiano et al., 2017; Ouyang et al., 2022; Bai et al., 2022b). However, for the task of negotiations, positive traits such as increased honesty might be undesired. To reduce potential RLHF artifacts, we format negotiation dialogues as transcripts. A Human message data class is then used to orchestrate negotiations by presenting the ongoing dialogue transcript and prompting for the next step (details in Appendix A.3). Inter-Game Bias. As touched upon in Section 2.2, the context of a particular negotiation setting might be important. Similar to the real world, Games and Issues with different descriptions can be perceived as harder/easier by different agents, even if the underlying payoff matrices are the same. This could indeed lead to inter-game bias. During the development of this project, we indeed experimented with various Game settings, e.g., corporate mergers, loan agreements, etc. However, Published as a conference paper at ICLR 2024 we did not perform a systematic experimental sweep to establish guidelines on how important this factor is to our metrics of interest. A.3 DEFAULT SETTINGS Figure A.1: Data generative process of a negotiation: zj are unobserved inputs for opposing party messages m j; ci the fixed context factors used to initialize agent ai; and note nt i, message mt i are generated at step t. Structured negotiation games are parameterized by a series of variables that can strongly influence results, see Section 3.2. To pick default parameters for selfand cross-play experiments we ran an extensive series of experiments using gpt-3.5 and a more limited sweep on gpt-4, chatbison, and claude-2 to reduce the possibility of overfitting. First, we study the role of computational capacity by setting a maximum message and note size. We provide prompt-level restrictions on the maximum number of words allowed. We restrict the search space to maximum note and message lengths of 32, 64, and 128. Secondly, we vary memory access of previous notes and messages as an input context to generate new notes and messages. Specifically, we perform a limited sweep measuring the result of having no dialogue history, only the most recent note/message, or all dialogue history available. In the equations below, we vary a,b,c,d {0,1, 1}, where 1 indicates the entire history is used. nt i = ai(nt a:t 1 i ,mt b:t 1 i ,m t j ,qn), (1) mt i = ai(nt c:t i ,mt d:t 1 i ,m t j ,qm), (2) where nt i,mt i are the note and message of agent i at time t, qn,qm the prompts used to generate notes and messages, and ai the LM agent i initialized using the Game, Issues, negotiation protocol, and agent role descriptions. Lastly, we repeat experiments using two conversation formats. The first variation presents the negotiation dialogue as a transcript. The advantage here is that only two roles are required to advance the negotiations. The Human or System role can be used to show instructions, prompts, and the ongoing transcript, whereas the AI role is taken by the LM agent. Additionally, the LM agent never directly negotiates against a Human, minimizing potential RLHF influences. The second variation is a direct dialogue between the AI and Human message role, where opposing agents appear as human . This form requires three message roles as explained in Section B, making it impossible to run on all models studied. Finally, all experiments were repeated at least 25+ times at a temperature of 0.2 to control for possible stochasticity between model generations. The sole exception is gpt-4 due to cost constraints, as mentioned at the start of Section 4. Memory and Computation Capacity. varying message history, b,d: limiting the message history as a context input quickly caused negotiations to diverge. Since missing messages would appear as gaps in the transcript, the LM agent was unable to recover. We therefore include the entire message history, i.e., b,d = 1. varying note history for notes, a: Notes do not strictly contain new information about the negotiation, only the agent s reflections (see Figure A.1). Performing ablative experiments, we found that not having access to any past notes to write the current note resulted in the highest agreement rate. This might be because the agent is confused by the various past reflections when writing a new reflection. Hence, a = 0 for all experiments. varying note history for messages, c: We had different considerations for the case of using notes as an input to writing public messages. Following findings from chain-of-thought prompting by Wei et al. (2022) and reasoning before action (Yao et al., 2022), we expected Published as a conference paper at ICLR 2024 notes to positively influence negotiations. While this was observed for having access to the single most recent note compared to not having access to any notes, having access to all notes was worse than seeing just a single note. Hence, c = 1 for all experiments. varying note and message length: the smallest note and message length of 32 words led to frequent instruction-following breaks and less fluent conversation. It also led to slightly worse agreement rates. The difference in agreement rate between restricting the number of words to 64 or 128 was minimal. Measuring realized message and note lengths for the 128 variant, we found most experiments stayed well below this restriction, averaging between 50 to 80 words. Because increasing the word restriction mostly seemed to lead to excess token usage, we opted to have 64 as a default length. Upon further inspection of later results, we realized that a better alternative is likely to vary the word restriction with the number of Issues negotiated over in a Game. We unfortunately were not able to rigorously test this hypothesis and thus leave this to future work. Dialogue vs. Transcript. Using gpt-3.5, we recorded a 5% increase (0.44 to 0.49) in agreement rate when switching from transcriptto dialogue-style negotiations. While this increase is not insignificant, we opted to use the transcript format for a fair comparison between models from all providers. Effect of Showing Negotiation-Round Numbers. gpt-4 often failed to converge when the current and remaining negotiation-round numbers were not displayed. Qualitative inspection revealed gpt-4 would quickly reach an approximate agreement, e.g., offering $900 of rent vs. the other agent offering $1000, then continue towards a middle point, e.g., $925, ,$942. This is an optimal strategy if no clear termination condition is known. Yet, we were hesitant to display negotiation-round numbers as this would open up the possibility for a subgame perfect Nash Equilibrium (Contributors to Wikimedia projects, 2023). Empirically, we found that gpt-4 s agreement rate increased by almost 50% when we showed the negotiation-round number. For this reason, we decided to show the negotiation-round number in the text generation prompts for the models. In contrast, for gpt-3.5 we found that showing the negotiation-round number had little effect on convergence. Table 6: Default parameters used for self-play and cross-play experiments across all models Parameter Value Note input, note memory 0 Note input, message memory -1 Message input, note memory 1 Message input, message memory -1 Note max words 64 Message max words 64 Dialogue style False Show rounds True A.4 QUALIFIERS Before subjecting models to extensive self-play, we first test if they pass the minimum qualifying requirements. Models are tasked to self-play a single distributive Game against an independent copy of the same LM ten times. At least one successful agreement is required to proceed. Specifically, models play variants of the rental agreement Game (Appendix E using the default parameters described in Appendix A.3. We debiased experiments via our usual procedure and chose a temperature of 0.2. Except for the open-source LLa MA 2 models, all other models were able to qualify. The LLa MA 2 models had difficulty maintaining their respective roles and following instructions, quickly diverging. This behavior got worse as models got smaller. We present a short dialogue transcript snippet below as an example: Published as a conference paper at ICLR 2024 gain more on another. Offers: Issue 1: Rent $1400. Issue 2: Security Deposit $1000. Issue 3: Lease Duration 2 years. Issue 4: Renewal Option 1 year. Issue 5: Pet Fee $500. Issue 6: Utilities Included? Yes. (End Transcript)","{ rent : [ $500 , $600 , $700 , $800 , $900 , $1000 , $1100 , $1200 , $1300 , $1400 , $1500 ]}",20230926_101940,meta-llama/Llama-2-7b-chat-hf,inprogress,0.0,0.0 You,1,1,", -20.0 Issue 2: Security Deposit $1200, -10.0 Issue 3: Lease Duration 3 years, -20.0 Issues still open: Issue 4, Issue 5, Issue 6 ","ase Duration 3 years. Issue 4: Pet Fee $500. Issue 5: Utilities included? Yes/No. Your Turn! ====================================================================== [msg] Other Party: Issue 1: Rent $1500. Issue 2: Security Deposit $1200. Issue 3: Lease Duration 4 years. What do you say? Acceptable Offer: $1300 Landlord: I want to charge you $1400 for rent this month. Tenant: I am willing to pay up to $1300. A.5 EXTRACTING OFFERS FROM TEXTS During negotiations, agents make offers using natural language. Each of these offers maps to a corresponding payoff in the agent s payoff table. We distinguish between two types of offers: 1. Agents are instructed to format acceptable offers for each Issue in JSON format for each mental note 2. Free-form offers that come up during conversation using public messages. Given the various ways an agent can bring up an offer, e.g., "rent of $25" vs. "pay twenty-five dollars per month", or "0 days of subletting" v. "no subletting", it is infeasible to parse all possible payoff labels from free text using regex variants. We therefore use gpt-3.5 with a selection of well-crafted examples to extract relevant offers in the desired format. As agents are not always able to follow formatting instructions, we take a two-step approach to retrieve acceptable offers from the notes. First, we attempt to use regex for offer extractions. When this fails, we fall back on using gpt-3.5. We compute our instruction-following metric for offer formatting (format instruct) by measuring the fraction of times LM-extraction is required. B API CONSIDERATIONS An overview of the current message roles available for major providers is shown in Table 7. To enable direct dialogue format, at least three roles are required. Two for the agents, e.g., the AI role Published as a conference paper at ICLR 2024 and the Human role, and one to orchestrate negotiations, e.g., sharing negotiation rules, descriptions, and note/message prompts. As seen below, only Open AI and Meta models support this negotiation format. Google s limited System role does not. Table 7: Summary of API roles available for LM providers. Google only supports System role messages in a limited form. Provider Human AI System Anthropic Cohere Google * Meta Open AI C ADDITIONAL ABLATIONS C.1 VISIBILITY AND AGENT STABILITY Visibility. Negotiations often require reasoning under incomplete, hidden information. To test how information visibility affects agreement rates we ran limited gpt-3.5 experiments across three visibility levels: 1. Agents see the other agent s title (Representative {Landlord, Tenant}); 2. Agents see the other agent s title and the payoffs; 3. Agents see the other agent s ability, where we either provide no additional details (default) or describe an agent as an awful or expert-level negotiator. We then conduct self-play negotiations for each setting, playing singleto three-issue non-integrative distributive Games (162 runs). Table 8: Soft agreement rate results for varying visibility levels using gpt-3.5 level 1 level 2 level 3 0.40 0.425 0.345 Agent Ability. Inspired by the work of Nardo (2023), we imposed three different internal and external descriptions on the agents as a proxy for persona effects on performance. In Table 9 we report the average normalized total payoff for a de-biased agent and in Table 10 we demonstrate the cross-play of these agents. We find a subtle increase in overall performance when the agent is an Expert and a similar drop for Awful negotiator. It is worth noting that the standard errors overlap across descriptions. These preliminary findings suggest that exploring the effect of imposed ability on negotiating performance can be a fruitful future direction. Table 9: Average normalized total payoffs for agents with specific internal descriptions. Awful 0.5 0.05 Expert 0.52 0.04 No description 0.51 0.04 Published as a conference paper at ICLR 2024 Table 10: Average normalized total payoffs of agent (source) when playing against agent (target). Payoffs are shown for the source agent. target Awful Expert No description source Awful 0.5 0.08 0.51 0.09 0.51 0.09 Expert 0.51 0.08 0.51 0.07 0.55 0.08 No description 0.49 0.07 0.52 0.05 0.53 0.09 D HEAD-TO-HEAD RESULTS FOR CROSS-PLAY EXPERIMENTS Table 11: Head-to-head normalized agreed-only payoffs U for games completed by soft agreement ( ), broken down by Game type (competitive or cooperative). Competitive Games consist of nonintegrative, distributive Issues, whereas cooperative Games consist of at least one compatible Issue or have cooperative bargaining opportunities through integration. chat-bison claude-2 command gpt-3.5 gpt-4 Competitive chat-bison 0.45 0.04 0.46 0.05 0.50 0.05 0.47 0.07 claude-2 0.55 0.04 0.52 0.04 0.50 0.06 0.53 0.02 command 0.54 0.05 0.48 0.04 0.32 0.08 0.51 0.03 gpt-3.5-turbo 0.50 0.05 0.50 0.06 0.68 0.08 0.57 0.06 gpt-4 0.53 0.07 0.47 0.02 0.49 0.03 0.43 0.06 Cooperative chat-bison - 0.59 0.04 0.70 0.04 0.57 0.07 0.69 0.06 claude-2 0.58 0.04 - 0.63 0.04 0.60 0.04 0.62 0.06 command 0.62 0.05 0.59 0.05 - 0.59 0.05 0.70 0.05 gpt-3.5-turbo 0.64 0.05 0.63 0.03 0.64 0.04 - 0.75 0.06 gpt-4 0.65 0.09 0.71 0.04 0.70 0.06 0.69 0.07 - Table 12: Head-to-head normalized agreed-only payoffs U for games completed by hard agreement ( ), broken down by Game type (competitive or cooperative). Competitive Games consist of non-integrative, distributive Issues, whereas cooperative Games consist of at least one compatible Issue or have cooperative bargaining opportunities through integration. chat-bison claude-2 command gpt-3.5 gpt-4 Competitive chat-bison 0.46 0.53 0.13 0.57 0.08 0.45 0.04 claude-2 0.54 0.46 0.05 0.52 0.10 0.45 0.03 command 0.47 0.13 0.54 0.05 0.23 0.07 0.49 0.06 gpt-3.5-turbo 0.43 0.08 0.48 0.10 0.77 0.07 0.58 0.05 gpt-4 0.55 0.04 0.55 0.03 0.51 0.06 0.42 0.05 Cooperative chat-bison 0.77 0.09 0.72 0.13 0.56 0.08 0.78 0.07 claude-2 0.63 0.22 0.71 0.06 0.58 0.05 0.66 0.07 command 0.73 0.11 0.63 0.09 0.55 0.07 0.74 0.06 gpt-3.5-turbo 0.66 0.05 0.64 0.05 0.66 0.04 0.75 0.06 gpt-4 0.87 0.05 0.76 0.06 0.81 0.05 0.69 0.07 Published as a conference paper at ICLR 2024 Table 13: Head-to-head agreement rate of soft agreement ( ) games. chat-bison claude-2 command gpt-4 gpt-3.5 Competitive chat-bison 0.34 0.07 0.37 0.04 0.40 0.09 0.50 0.09 claude-2 0.34 0.07 0.56 0.08 0.56 0.06 0.60 0.04 command 0.37 0.04 0.56 0.08 0.61 0.04 0.37 0.13 gpt-4 0.40 0.09 0.56 0.06 0.61 0.04 0.44 0.11 gpt-3.5 0.50 0.09 0.60 0.04 0.37 0.13 0.44 0.11 chat-bison claude-2 command gpt-4 gpt-3.5 Cooperative chat-bison 0.48 0.07 0.39 0.11 0.32 0.06 0.57 0.07 claude-2 0.48 0.07 0.51 0.05 0.46 0.06 0.50 0.05 command 0.39 0.11 0.51 0.05 0.37 0.07 0.54 0.05 gpt-4 0.32 0.06 0.46 0.06 0.37 0.07 0.33 0.08 gpt-3.5 0.57 0.07 0.50 0.05 0.54 0.05 0.33 0.08 Table 14: Head-to-head agreement rate of hard agreement ( ) games. chat-bison claude-2 command gpt-4 gpt-3.5 Competitive chat-bison 0.10 0.10 0.06 0.02 0.30 0.09 0.25 0.04 claude-2 0.10 0.10 0.15 0.03 0.32 0.07 0.41 0.08 command 0.06 0.02 0.15 0.03 0.32 0.07 0.27 0.10 gpt-4 0.30 0.09 0.32 0.07 0.32 0.07 0.34 0.11 gpt-3.5 0.25 0.04 0.41 0.08 0.27 0.10 0.34 0.11 chat-bison claude-2 command gpt-4 gpt-3.5 Cooperative chat-bison 0.07 0.03 0.10 0.05 0.15 0.04 0.42 0.07 claude-2 0.07 0.03 0.19 0.04 0.22 0.05 0.30 0.05 command 0.10 0.05 0.19 0.04 0.16 0.05 0.35 0.05 gpt-4 0.15 0.04 0.22 0.05 0.16 0.05 0.32 0.08 gpt-3.5 0.42 0.07 0.30 0.05 0.35 0.05 0.32 0.08 E NEGOTIATION RULES, PROMPTS, GAME, AND ISSUE DESCRIPTIONS E.1 NEGOTIATION RULES The following negotiation rules are adapted from Bontempo & Iyengar (2008). rules_prompt: "Never forget the following negotiation rules:" rules: - Your total payoff is the sum of your payoffs on all issues. Higher payoffs are better than lower payoffs. - A valid agreement occurs only when all issues are decided. Partial agreements result in a total payoff to you of zero. - You are not allowed to accept any agreement that results in a payoff less than zero. - You are not allowed to deviate from or innovate with the payoffs listed on the payoff table. In other words, you cannot change your payoffs. - No side payments are allowed. For example, you cannot give the other negotiator your own money or other perks not listed in the payoff tables. - You may describe issues and elaborate on them as you see fit. However, you are not allowed to invent additional issues. - Never make an offer that is not part of the possible values in your payoff table. Published as a conference paper at ICLR 2024 E.2 PROMPTS USED Note Prompt. Use the following strategy to compose a mental note to order your thoughts: 1. Remember the negotiation rules and your payoff tables 2. Reflect on the negotiations transcript so far 3. For all issues, think about strategies to maximize your total payoff Your note can not exceed {max_len} words. Finally, for each of the issues write what you believe to be an acceptable offer. acceptable offer format: json{ "issue_name_0": "", "issue_name_1": "", ... } Message Prompt. Your negotiating partner is sitting across from you. Formulate a response to your negotiating partner using the following strategy: 1. Reflect on the negotiations transcript so far 2. Remember the negotiation rules and your payoff tables 3. Try to further the state of negotiations on at least one issue Your note can not exceed {max_len} words. E.3 GAMES AND ISSUES name: generic-rental-agreement description: A landlord and a prospective tenant are negotiating a rental agreement. sides: - You are an advisor representing the best interests of the landlord. Your main goal is to negotiate the best possible agreement for the landlord based on the information in the payoff tables. The numbers in the payoff tables show how valuable each outcome is to you. You can trust that the payoffs assigned to the different options in your table are accurate. - You are an advisor representing the best interests of the tenant. Your main goal is to negotiate the best possible agreement for the tenant based on the information in the payoff tables. The numbers in the payoff tables show how valuable each outcome is to you. You can trust that the payoffs assigned to the different options in your table are accurate. parties: - Landlord - Tenant name: rent issue_type: distributive descriptions: - You have to negotiate the monthly rent amount. - You have to negotiate the monthly rent amount. payoffs: - [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] - [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0] payoff_labels: Published as a conference paper at ICLR 2024 - ["$500", "$600", "$700", "$800","$900", "$1000", "$1100", "$1200", "$1300", "$1400", " $1500"] - ["$500", "$600", "$700", "$800","$900", "$1000", "$1100", "$1200", "$1300", "$1400", " name: duration issue_type: compatible descriptions: - You have to negotiation the duration of the rental agreement. - You have to negotiation the duration of the rental agreement. payoffs: - [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] - [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] payoff_labels: - ["6 months", "9 months", "12 months", "15 months", "18 months", "21 months", "24 months ", "27 months", "30 months", "33 months", "36 months"] - ["6 months", "9 months", "12 months", "15 months", "18 months", "21 months", "24 months ", "27 months", "30 months", "33 months", "36 months"] name: deposit issue_type: distributive descriptions: - You have to negotiate the security deposit amount - You have to negotiate the security deposit amount payoffs: - [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] - [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0] payoff_labels: - ["$0", "$250", "$500", "$750","$1000", "$1250", "$1500", "$1750", "$2000", "$2250", " $2500"] - ["$0", "$250", "$500", "$750","$1000", "$1250", "$1500", "$1750", "$2000", "$2250", " name: subletting issue_type: distributive descriptions: - You have to negotiate how many days a year the apartment may be sublet each year. - You have to negotiate how many days a year the apartment may be sublet each year. payoffs: - [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0] - [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] payoff_labels: - ["0 days", "1 day", "2 days", "3 days", "4 days", "5 days", "6 days", "7 days", "8 days ", "9 days", "10 days"] - ["0 days", "1 day", "2 days", "3 days", "4 days", "5 days", "6 days", "7 days", "8 days ", "9 days", "10 days"] F OPTIMAL SCORING F.1 DISTRIBUTIVE ISSUES Given two sets of preference weights a,b of the same cardinality and allocations x such that: a,b,x Rk [0,1] (3) i ai = i bi = 1 (4) We are interested in the following constraint maximization problem: max xi f(a,b,x) = a x+b (1 x) (5) Published as a conference paper at ICLR 2024 a x = b (1 x) (6) Here x is used to allocate which piece of the proverbial pie each side receives from underlying Issues γ. Depending on the values of a,b, there might exist multiple solutions x . For example, imagine a simple, non-integrative Game with an even number of k-Issues where ai = bi. The two parties can both split all Issue allocations equally, i.e., xi x, xi = 0.5, or set k/2 of the xi to 1 and the remaining xj to 0. Both solutions will satisfy our constraint optimization. In our setup, our primary interest is in Game optimal behavior, not Issue optimal behavior. That is, to solve the maximum obtainable equilibrium value, it suffices to compute the aggregate solution value of equation 5, then divide by two to satisfy the constraint of equation 6. There will exist values xi x to realize this payoff division. We can solve equation 5 as follows: 1 if ai > bi 0 if ai < bi 0.5 otherwise (7) max xi f(a,b,x) = f(a,b, ˆx) = i max(ai,bi) (8) It follows that the lower bound of equation 8 occurs when a = b, i.e., no beneficial trade-offs are feasible as both parties have the exact same preference ordering. In this case, the best both parties can do is each achieve 0.5. Conversely, the upper bound is found when preferences exactly misalign, e.g., a = [0,1] and b = [1,0], in this case, both parties can reach 1. F.2 MIXTURE OF DISTRIBUTIVE AND COMPATIBLE ISSUES The solution of the previous section unfortunately does not hold when compatible Issues are involved. Recall that a compatible Issue is one where both parties interests are aligned. That is, the constraints specified in equation 6 no longer hold, as both parties benefit from xi = 1 for these Issues. Take the following example, let a = [1,2],b = [2,1] represent preference orderings over two Issues, γ1,γ2, with γ1 = compatible, and γ2 = distributive. Both sides are incentivized to maximize γ1 in the same direction, leaving only γ2 to be divided. Since γ2 is worth 2 to a and only 1 to b, it is not clear what an optimal score means in this situation. Various solutions with different assumptions and trade-offs have been proposed in the literature for this cooperative bargaining problem. We recommend (Thomson, 1994) for an extensive review. G TRANSCRIPTS AND EXAMPLES Please take a look at https://github.com/epfl-dlab/LAMEN for a link to all negotiation transcripts generated during this project. Command-light Self-play. An example of a suspect negotiation transcript for command-light. As we ran various experiments in parallel for command-light, some API-specific caching might have influenced results. We plan to reach out to Cohere directly to investigate this further. [note] Mental note: 1. Remember the negotiation rules and payoff tables: no side payments, partial agreements result in a total payoff of zero, and the landlord cannot accept any agreement that results in a payoff less than zero. 2. Reflect on the negotiations transcript so far, in particular the last offer made by the tenant. 3. For all issues, think about strategies to maximize the total payoff: the landlord should accept the tenant s offer to maximize their payoff, as accepting this offer is better than counteroffering, which would result in a total payoff of 0. Acceptable offer: Published as a conference paper at ICLR 2024 { "rent": $1,425, "duration": 32 months } ====================================================================== [note] Mental note: 1. Remember the negotiation rules and payoff tables. 2. Reflect on the negotiations transcript so far. 3. For all issues, think about strategies to maximize the total payoff. Possible offer: { "rent": $1,425, "duration": 32 months } Remember, partial agreements result in a total payoff of zero, and the landlord cannot accept any agreement that results in a payoff less than zero. The landlord should accept this offer rather than counteroffer, since a counteroffer would result in a total payoff of 0 for both parties, which would not comply with the rules. ====================================================================== [note] Mental note: 1. Remember the negotiation rules and payoff tables. 2. Reflect on the negotiations transcript so far. 3. For all issues, think about strategies to maximize total payoff. Possible offer: { "rent": $1,425, "duration": 32 months } ====================================================================== [note] Mental note: 1. Remember the negotiation rules and payoff tables. 2. Reflect on the negotiations transcript so far. 3. For all issues, think about strategies to maximize total payoff. Possible offer: { "rent": $1,425, "duration": 32 months }