# reevaluating_openended_evaluation_of_large_language_models__de4d5e4d.pdf

Published as a conference paper at ICLR 2025

RE-EVALUATING OPEN-ENDED EVALUATION OF LARGE LANGUAGE MODELS

Siqi Liu, Ian Gemp, Luke Marris, Georgios Piliouras, Nicolas Heess, Marc Lanctot
Google DeepMind, London, UK
{liusiqi,imgemp,marris,gpil,heess,lanctot}@google.com

Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provide insights into the competitive landscape of LLM development.

1 INTRODUCTION

We can only improve what we measure, yet measuring the performance of Large Language Models (LLMs) has become an elusive endeavor owing to their breadth and depth of capabilities. Real-world benchmarks are costly to curate, increasingly requiring feedback from human domain experts (Hendrycks et al., 2021; Rein et al., 2023). Synthetic benchmarks can help, but their relevance to real-world performance is less clear (Zhang et al., 2024; Hsieh et al., 2024). An even more vexing challenge of static benchmarks is that of test set contamination, a phenomenon difficult to prevent despite efforts (Golchin & Surdeanu, 2024; Balloccu et al., 2024; Palavalli et al., 2024). Enumerating skills of interest with narrowly defined static benchmarks seems to be an uphill battle from the outset, as frontier models become generally capable.

An emerging trend in LLM evaluation is therefore to rely on open-ended evaluation systems, a notable example being the LMSYS Chatbot Arena (Chiang et al., 2024). In such a system, users submit prompts of interest, and each model is assigned an Elo score (Elo, 1978) based on how it compares to the other models across all prompts. In contrast to static benchmarks, this open-ended approach enjoys liveness, diversity and scale, which has made it an important reference in LLM development. Despite an intuitive sense of progress, issues around redundancy, bias and quality of crowdsourced data have been raised (Chiang et al., 2024; Ahuja et al., 2023; Li et al., 2024b). Several recent studies reverted to centralized curation for quality (Taori et al., 2023; Lee et al., 2024; White et al., 2024). Increasing commercial efforts have been invested in private and proprietary evaluation too.

Perhaps this tension between quality and open-endedness is to be expected in LLM evaluation. Biases, redundancies and quality issues in the prompt distribution can affect Elo ratings, as they reflect performance on average. This, along with other identified deficiencies of the Elo system (Balduzzi et al., 2018; Bertrand et al., 2023; Lanctot et al., 2023), raises crucial questions for LLM development: how does an Elo-based open-ended evaluation system affect model development today, and how can we mitigate its drawbacks, if any, in the future? In this paper, we provide an empirical simulation-based investigation of the former and lean on game theory for a solution to the latter.

The connection between evaluation and game theory needs unpacking. Consider a set of agents and a set of tasks. A naive approach to evaluation would rank agents by their average performance over tasks, propagating biases and redundancies in the task set. A game-theoretic approach (Balduzzi et al., 2018) would instead treat evaluation as an agent-vs-task game where the agent (task) player chooses one of its agents (tasks) and is rewarded (penalized) by the agent's performance on the task. This game-theoretic perspective accomplishes two goals simultaneously. First, it lets the evaluation system designers express their goals in players' objectives: here, Balduzzi et al. (2018) evaluate agents under adversarial task selection. Second, a game-theoretic equilibrium decides which actions are played during evaluation: quality and redundancies in players' action sets do not matter. It is in this sense that game theory is well suited for evaluation systems that are open-ended.

Equal contribution.

Applying game theory to LLM evaluation however has its own challenges. Indeed, the decision of Balduzzi et al. (2018) in comparing agents under adversarial task selection was guided by theoretical benefits. In 2-player zero-sum games, approximating a Nash equilibrium (NE, Nash et al. (1950)) is computationally tractable. NEs are also interchangeable in this setting as playing any NE guarantees zero exploitability. Beyond this setting, both benefits are lost: approximating NEs is computationally hard in the worst case (Daskalakis et al., 2006) and despite recent progress important challenges remain (Gemp et al., 2022; 2024). Equilibrium selection in this generalised setting remains a longstanding challenge too (Harsanyi & Selten, 1988; Rinott & Scarsini, 2000). For instance, driving on either side of the road is an equilibrium, but it is unclear which equilibrium should be used for evaluation. Past attempts at game-theoretic evaluation have therefore been restricted to the 2-player zero-sum settings when LLM evaluation calls for at least 3 players (e.g., model-vs-model-vs-prompt).

In this paper, we make several contributions that lead up to our equilibrium rating framework:

1. We show, via a simulated example (Section 1.1), the risk of models specializing in a few skills, at the expense of others, as they maximise their Elo ratings. Similarly, popular practice in prompt selection further reinforces this trend;
2. We introduce novel equilibrium solution concepts for N-player general-sum games that are unique and clone-invariant, a pre-requisite for our equilibrium rating method (Section 3);
3. We show our method scales to a real-world LLM evaluation dataset (Section 4.2) and provide ratings that are invariant to redundancy and correspond to our intuition in the sense of risk-dominance (Harsanyi & Selten, 1988), with empirical evidence (Appendix F.4);
4. We provide examples of analyzing these equilibrium structures of the game, drawing insights into the competitive landscape of LLM evaluation (Section 4.3).

1.1 ELO RATING IMPROVEMENT PATH: A SIMULATED EXAMPLE

With models continually improving their Elo ratings in systems such as LMSYS Chatbot Arena (Li et al., 2024a), it is worth asking if higher Elo scores translate to meaningful progress across skills of interest. This is difficult to answer from real-world data: we cannot replicate LLM development at scale nor can we disentangle factors driving model development besides maximizing leaderboard ratings. A synthetic example can provide insights in a controlled setting.

Consider $S$ orthogonal skills of interest, $M$ models and $P$ prompts, with each prompt a probability vector $p \in \Delta^{S}$ over the skills and each model a vector $m \in \mathbb{R}^{S}_{+}$ representing its competencies in each skill. We can then define the utility of selecting model $m_i$ when compared to model $m_j$ on prompt $p_k$ as $u_m(p_k, m_i, m_j) = p_k^{\top}(m_i - m_j)$ with $i, j \in [M]$ and $k \in [P]$. A less common but equally valid question is what the utility, if any, should be for selecting a prompt. We follow a similar definition to Li et al. (2024b) and define the utility of choosing prompt $p_k$ as $u_p(p_k, m_i, m_j) = |u_m(p_k, m_i, m_j)|$. The separability of a prompt is then $\frac{1}{M^2}\sum_{ij} u_p(p_k, m_i, m_j)$, consistent with the prompt selection criterion used in offline benchmarks such as arena-hard-v0.1.
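The two utilities and the separability score can be sketched in a few lines of NumPy. This is a minimal illustration under the definitions above, not the authors' code; the function names and the toy vectors are assumptions made for the example.

```python
import numpy as np

def model_utility(p_k, m_i, m_j):
    """u_m(p_k, m_i, m_j) = p_k^T (m_i - m_j)."""
    return float(p_k @ (m_i - m_j))

def separability(p_k, models):
    """Mean of |u_m| over all ordered model pairs, i.e. (1/M^2) * sum_ij."""
    scores = models @ p_k                      # per-model score on this prompt
    diffs = scores[:, None] - scores[None, :]  # u_m for every pair (i, j)
    return float(np.abs(diffs).mean())

# Toy instance (illustrative values only): S = 2 skills, M = 2 models.
models = np.array([[1.0, 0.0],    # model 0 is only good at skill 1
                   [0.0, 1.0]])   # model 1 is only good at skill 2
p_skill1 = np.array([1.0, 0.0])   # prompt that tests only skill 1
p_mixed = np.array([0.5, 0.5])    # prompt that tests both skills equally
```

In this toy instance the single-skill prompt separates the two models while the mixed prompt does not, which is the behaviour that separability-based prompt selection rewards.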

We now observe how this system evolves with rating-maximizing players. Consider two settings: a) the initial prompts setting, where the set of prompts is fixed but the set of models expands; and b) the additional prompts setting, where prompt and model players alternate to introduce new prompts and models. We use a simple evolutionary process for our simulation (see Appendix F.1 for pseudocode). Let $P_t$ and $M_t$ be the number of prompts and models at iteration $t$, with the $P_0$ initial prompts and $M_0$ initial models sampled from $\mathrm{Dirichlet}(\mathbf{1}_{1:S})$. We introduce a model at each iteration, which is a sum of improvement vectors sampled from $\mathrm{Dirichlet}(\mathbf{1}_{1:S})$, such that the new addition receives the highest rating according to the rating method used (i.e., Elo or our equilibrium-based method). In the additional prompts setting, a best-of-64 prompt is added at each iteration, selected by separability when Elo ratings are used, and by equilibrium rating otherwise.

[Figure 1 panels. Left: rating and submit loop. Center: model skill entropy and prompt skill entropy against iteration, for Elo and Equilibrium (Ours), in the initial prompts and additional prompts settings. Right: initial models, initial prompts, equilibrium models, Elo prompts and equilibrium prompts.]

Figure 1: (Left) We simulate the effect of the rating method on model development with users submitting highly rated models (and prompts) iteratively. (Center) We show how model and prompt skill entropy evolves under different rating methods over 32 trials. (Right) We show an example sequence of models and prompts maximising their respective ratings. Darker indicates higher value.

Figure 1 (Center) shows our findings. Let $H(\bar p_t)$ and $H(\bar m_t)$ be the prompt and model skill entropy at iteration $t$, with $\bar p_t = \frac{1}{P_t}\sum_{i}^{P_t} p_i$, $\bar m_t = \frac{1}{M_t}\sum_{i}^{M_t} m_i$ and $H$ the Shannon entropy. The Elo rating method leads to a consistent decline in skill entropy: the sequence of models improves along specific skill dimensions that are over-represented in the fixed set of initial prompts (dashed). Adding prompts with high separability further reinforces this trend in both model and prompt skill entropy (solid). We offer an intuitive explanation. Improvement on the Elo ratings or the separability metric reflects improvements against the average. At iteration $t$, the expected utility to model $m_i$ is given by $u_m(\bar p_t, m_i, \bar m_t)$, with its gradient defined by $\bar p_t$; improving on the most prevalent skill in $\bar p_t$ therefore leads to the steepest ascent in model utility. Similarly, the gradient for a prompt vector $p_k$ is defined by the absolute deviation of the model vectors along each skill dimension, $\frac{1}{M_t}\sum_{i}^{M_t} |m_i - \bar m_t|$. Prompts that target the skill dimension with the highest spread averaged across all model pairs are therefore the most highly rated. Figure 1 (Right) illustrates this phenomenon from a single trial. The Elo-maximising sequence of models specialises in skill 1 due to prompt redundancy while the separability-maximising sequence of prompts remains focused on skill 1, at the expense of others.
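The skill entropy of a population can be sketched as the Shannon entropy of its mean vector. The sketch below is illustrative only: the function name is hypothetical, and renormalising the mean so that it sums to one is an assumption made here so that the same function applies to model vectors, which are not probability vectors.

```python
import numpy as np

def skill_entropy(vectors):
    """Shannon entropy (in nats) of the mean vector, renormalised to sum to 1."""
    mean = np.asarray(vectors, dtype=float).mean(axis=0)
    q = mean / mean.sum()        # assumption: renormalise before taking entropy
    q = q[q > 0]                 # treat 0 * log 0 as 0
    return float(-(q * np.log(q)).sum())

# Two toy prompt populations over S = 2 skills (illustrative values only).
balanced = [[0.5, 0.5], [0.5, 0.5]]   # both skills equally represented
skewed = [[1.0, 0.0], [1.0, 0.0]]     # every prompt targets skill 1
```

A population that covers both skills evenly attains the maximum entropy log 2, while a population concentrated on one skill has zero entropy, which is the direction in which the Elo-maximising path drifts.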

The underlying challenge, one that we address, is to propose a practical rating method that compares models and prompts in a way that is intuitive and robust to redundancies. Figure 1 (Center) suggests that maximising equilibrium ratings preserves skill entropy. Indeed, Figure 1 (Right) shows models focusing on different skills across iterations. As prompts are no longer measured against the average model pairs, they also remain diverse. In both cases, ratings are computed at a game-theoretic equilibrium distribution, instead of the uniform. We now present our equilibrium rating framework.

2 BACKGROUND

Normal-form game. A normal-form game is a tuple $(N, A, u)$ where $N = \{1, \dots, n\}$ is a finite set of players indexed by $i$, $A = (A_1, \dots, A_n)$ is a tuple of strategy (action) sets, and $u = (u_1, \dots, u_n)$ is a tuple of utility functions, with $A_i$ and $u_i : A \to \mathbb{R}$ being player $i$'s strategy set and utility respectively. Let $a = (a_1, \dots, a_n) \in A$ with $a_i \in A_i$ for all $i$ denote a strategy profile. We allow strategy profiles to be selected randomly according to a distribution $x \in \Delta(A)$ over joint actions. Let $x_i$ denote the marginal distribution over player $i$'s strategy set $A_i$, i.e., $x$ with all players $j \neq i$ marginalized out. Likewise, let $x_{-i}$ denote the distribution with player $i$ marginalized out. We call $x$ a pure strategy if it places all mass on a single action profile and mixed otherwise. Each player's utility function is naturally extended to randomized strategy profiles by considering its expected value $u_i(x) = \mathbb{E}_{a \sim x}[u_i(a)]$. Similarly, let $u_i(x_i, x_{-i}) = \mathbb{E}_{a_i \sim x_i,\, a_{-i} \sim x_{-i}}[u_i(a)]$.

Coarse Correlated Equilibrium (CCE) and Nash Equilibrium (NE). An equilibrium is a strategy profile $x$ from which no player has an incentive to unilaterally deviate. Define player $i$'s incentive to deviate to $x_i' \in \Delta(A_i)$ unilaterally as $\mathrm{regret}_i(x_i', x) = u_i(x_i', x_{-i}) - u_i(x)$, where $\Delta(A_i)$ is the simplex over $A_i$. Then player $i$'s maximum regret for deviating from $x$ is defined as:

$$\max_{x_i' \in \Delta(A_i)} \big[\mathrm{regret}_i(x_i', x)\big] = \max_{x_i' \in \Delta(A_i)} \big[u_i(x_i', x_{-i})\big] - u_i(x). \tag{1}$$

The profile $x$ is an approximate Coarse Correlated Equilibrium (Aumann, 1974; 1987) ($\epsilon$-CCE) iff $\forall i$, $\max_{x_i' \in \Delta(A_i)} [\mathrm{regret}_i(x_i', x)] \le \epsilon$. If $x$ can be factorized into player marginals such that players cannot correlate, i.e., $x = \bigotimes_{i=1}^{n} x_i$, then $x$ is also an $\epsilon$-NE. NEs are a subset of CCEs.
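The maximum-regret condition in Equation (1) can be checked numerically for a small two-player game. The sketch below is an illustration under the definitions above, with hypothetical function and variable names; it uses a road-side coordination game as the toy example.

```python
import numpy as np

def max_regrets(x, u1, u2):
    """Maximum regret of each player for deviating from joint distribution x.

    x, u1 and u2 all have shape (|A_1|, |A_2|) and x sums to 1.
    u_i is linear in the deviation x_i', so the maximum over the simplex
    is attained at a pure action and scanning pure deviations suffices.
    """
    x2 = x.sum(axis=0)           # x_{-1}: player 1 marginalised out
    x1 = x.sum(axis=1)           # x_{-2}: player 2 marginalised out
    r1 = (u1 @ x2).max() - (x * u1).sum()
    r2 = (x1 @ u2).max() - (x * u2).sum()
    return float(r1), float(r2)

# Coordination game: both players receive 1 if they pick the same side.
coord = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
correlated = np.array([[0.5, 0.0],       # always coordinated, side chosen by a coin
                       [0.0, 0.5]])
miscoordinated = np.array([[0.0, 1.0],   # players always pick opposite sides
                           [0.0, 0.0]])

cce_regrets = max_regrets(correlated, coord, coord)
bad_regrets = max_regrets(miscoordinated, coord, coord)
```

A joint distribution is an $\epsilon$-CCE when both returned values are at most $\epsilon$. The correlated profile has negative maximum regret for both players, so it is an exact CCE, whereas the miscoordinated profile leaves each player with a regret of 1.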

Equilibrium Selection. Games can have many equilibria. Additional criteria are often introduced to make their selection unique. The set of CCEs is always convex, and so any strictly convex objective function, such as negative Shannon entropy, can be used to select a unique equilibrium.

In contrast, the set of NEs need not be convex. However, several solutions have been proposed to solve for unique Nash equilibria in general-sum games (Harsanyi & Selten, 1988). One of them, the limiting logit equilibrium (LLE), was originally defined by McKelvey & Palfrey (1995) along with their introduction of quantal response (logit) equilibria (QREs), which satisfy the following fixed point equation for all players $i \in N$:

$$x_i = \mathrm{softmax}\Big(\frac{1}{\tau} \nabla_{x_i} u_i\Big) \tag{2}$$

where $\nabla_{x_i} u_i$ is the gradient of $u_i$ w.r.t. $x_i$. QREs are defined by a temperature parameter $\tau$ and can be interpreted as the Nash equilibria of a game with payoffs perturbed by $\mathrm{Gumbel}(0, \tau)$ noise. Computing the LLE involves tracing a continuum of QREs, starting at temperature $\tau = \infty$ (corresponding to the uniform strategy profile) and ending at the LLE in the limit of $\tau \to 0$. The LLE is unique in all games except a 0-measure set (McKelvey & Palfrey, 1995; Goeree et al., 2003). Another reason to solve for an LLE is that it falls into the family of homotopy methods (Herings & Peeters, 2010), which were shown to select risk-dominant equilibria in some general settings, a Nobel prize winning result of Harsanyi & Selten (1988). Empirically, LLEs have also been shown to approximate human play in games (McKelvey & Palfrey, 1995; Goeree et al., 2003).
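To make the fixed point in Equation (2) concrete, the sketch below approximates logit QREs of a two-player game at a decreasing sequence of temperatures, warm-starting each solve from the previous one. It uses plain damped fixed-point iteration, which is an assumption made for illustration and not a homotopy tracer; such iteration is not guaranteed to converge in general games. The payoff matrix and all names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def logit_qre(u1, u2, tau, x1, x2, steps=200, eta=0.5):
    """Damped iteration x_i <- (1 - eta) * x_i + eta * softmax(grad_i / tau)."""
    for _ in range(steps):
        g1 = u1 @ x2             # gradient of u_1 w.r.t. x_1
        g2 = u2.T @ x1           # gradient of u_2 w.r.t. x_2
        x1 = (1 - eta) * x1 + eta * softmax(g1 / tau)
        x2 = (1 - eta) * x2 + eta * softmax(g2 / tau)
    return x1, x2

# Prisoner's-dilemma-style payoffs (illustrative): action 0 = cooperate, 1 = defect.
u1 = np.array([[3.0, 0.0],
               [5.0, 1.0]])
u2 = u1.T                        # symmetric game

x1 = x2 = np.full(2, 0.5)        # tau = infinity corresponds to uniform play
path = []
for tau in [100.0, 10.0, 1.0, 0.1]:      # anneal the temperature
    x1, x2 = logit_qre(u1, u2, tau, x1, x2)
    path.append((tau, x1.copy(), x2.copy()))
```

At high temperature the strategies stay close to uniform, and as the temperature falls they concentrate on the dominant action, which mirrors the path from the uniform profile towards the limiting equilibrium described above.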

We now describe our rating method in terms of gamification, equilibrium solving and its selection. In gamification, we endow prompt and model players with utility functions, partly inspired by prior works, such that actions played at an equilibrium reflect our intuition. We note that our specific gamification defines an N-player general-sum game where equilibrium solving and selection require more careful consideration. For equilibrium solving, we build on existing methods for approximating NEs and CCEs, reformulated to accommodate entropy-based techniques that select unique equilibria, and explain why ratings derived from these equilibria remain vulnerable to manipulation in the face of redundant actions. We then propose a family of algorithms based on a novel kernelized entropy that select unique equilibria yet are also robust to redundant actions. Finally, for a given equilibrium solution $x$, we define the rating of an action $a_i$ to be $\mathrm{regret}_i(a_i, x)$.

3.1 GAMIFICATION: EVALUATION VIA A GAME BETWEEN MODELS AND PROMPTS

We study a 3-player general-sum game in our experiments. Consider a prompt player with actions $a_p \in A_p$, the set of prompts, and a king player and a rebel player each with actions $a_m \in A_m$, the set of models. Let $u_k(a_p, a_m, a_m') \in \{-1, -1/2, 0, 1/2, 1\}$ be the utility function to the king player representing a preference towards the king model response $a_m$ over the rebel model response $a_m'$ on a prompt $a_p$. The prompt player is rewarded for separating the models, with $u_p(a_p, a_m, a_m') = |u_k(a_p, a_m, a_m')|$. The rebel player receives $u_r(a_p, a_m, a_m') = -u_k(a_p, a_m, a_m')$ except for when $a_m = a_m'$, in which case $u_r(a_p, a_m, a_m') = -1$. This asymmetry discourages the same model being played by both model players deterministically with a prompt player indifferent over its actions. We refer to this game as king-of-the-hill as it favours the king player, leaving the rebel player to mount its best resistance without relying on some of the best models that the king player may choose. We refer to the king player ratings as the model ratings in our results.
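The three utility tensors of the king-of-the-hill game can be tabulated from a single preference tensor. The sketch below is a hedged illustration: the function name and the toy tensor are hypothetical, and the sign conventions (the rebel receiving the negated king utility, and a payoff of -1 when both players pick the same model) follow the reading of the definitions above.

```python
import numpy as np

def king_of_the_hill(pref):
    """Build (u_p, u_k, u_r) from pref[p, m, m'] in {-1, -1/2, 0, 1/2, 1}.

    pref[p, m, m'] is the king's utility when king model m meets rebel
    model m' on prompt p.
    """
    u_k = np.asarray(pref, dtype=float)
    u_p = np.abs(u_k)                    # prompt player: rewarded for separation
    u_r = -u_k                           # rebel opposes the king ...
    same = np.arange(u_k.shape[1])
    u_r[:, same, same] = -1.0            # ... and is penalised for copying it
    return u_p, u_k, u_r

# One prompt, two models: model 0 beats model 1 (illustrative values only).
pref = [[[0.0, 1.0],
         [-1.0, 0.0]]]
u_p, u_k, u_r = king_of_the_hill(pref)
```

With $|A_p|$ prompts and $|A_m|$ models, each tensor has shape $(|A_p|, |A_m|, |A_m|)$, indexed by prompt, king model and rebel model.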

Given a collection of prompts and models, the utility function can be tabulated with $|A_p| \times |A_m|^2$ pairwise preference ratings. We query a gemini-1.5-pro-api-0514 judge for preference ratings similar to Zheng et al. (2023); Verga et al. (2024); Dubois et al. (2024a;b); Chiang & Lee (2023); Liu et al. (2023). We caveat that our results could therefore suffer from self-preference (Panickssery et al., 2024) and should not be viewed as an objective assessment of frontier LLMs.

3.2 EQUILIBRIUM SOLVING

For an instance of the evaluation game, we can compute different equilibrium solutions $x$ which then define ratings. Here we present two options as they are unique, scalable and lead to intuitive, invariant ratings when combined with a selection criterion that we describe in Section 3.3.

Nash Equilibrium (NE). While LLE computation (Turocy, 2005) is typically formulated as solving a differential equation that evolves the temperature $\tau$ towards 0 while obeying the logit constraint in Equation (2), i.e., $x_i = \mathrm{softmax}(\frac{1}{\tau}\nabla_{x_i} u_i)$ for all $i$, this is also equivalent to satisfying the constraint $x_i = \arg\max_{z_i \in \Delta(A_i)} u_i(z_i, x_{-i}) + \tau S(z_i)$ where $S(z_i)$ is the Shannon entropy of $z_i$. In this work, we choose another condition

$$x_i = \arg\max_{z_i \in \Delta(A_i)} \Big\{ u_i^{\tau}(z_i, x_{-i}) \overset{\text{def}}{=} u_i(z_i, x_{-i}) - \tau D_{\mathrm{KL}}(z_i \,\|\, t_i) \Big\} \tag{3}$$

which is equivalent in the case where the target strategy $t_i$ is set to player $i$'s uniform strategy. Using this definition of $u_i^{\tau}(z_i, x_{-i})$, we can define a loss function as in Gemp et al. (2022) such that $\arg\min_x \mathcal{L}^{\tau}(x)$ is a QRE at temperature $\tau$:

$$\mathcal{L}^{\tau}(x) = \sum_i u_i^{\tau}(\mathrm{BR}_i, x_{-i}) - u_i^{\tau}(x_i, x_{-i}) \tag{4}$$

where player $i$'s best response is $\mathrm{BR}_i = \mathrm{softmax}(\frac{1}{\tau}\nabla_{x_i} u_i + \log(t_i))$. By annealing $\tau$ from a high value and successively re-solving for the global minimum of $\mathcal{L}^{\tau}$, we can approximately trace the QRE continuum to the LLE. In Section 3.3, we explore non-uniform $t_i$ to achieve clone-invariance.
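The loss in Equation (4) can be evaluated directly once each player's payoff gradient is known. The sketch below is a minimal illustration for a normal-form game, where $u_i(z_i, x_{-i}) = z_i^{\top} \nabla_{x_i} u_i$; the helper names and the matching-pennies example are assumptions made here, and the targets are assumed strictly positive so that the logarithm is defined. It only evaluates the loss and does not perform the annealed minimisation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(z, t):
    """KL(z || t), treating 0 * log 0 as 0; t must be strictly positive."""
    mask = z > 0
    return float((z[mask] * np.log(z[mask] / t[mask])).sum())

def qre_loss(x, grads, targets, tau):
    """Sum over players of u^tau_i(BR_i, x_-i) - u^tau_i(x_i, x_-i)."""
    total = 0.0
    for x_i, g_i, t_i in zip(x, grads, targets):
        br_i = softmax(g_i / tau + np.log(t_i))   # KL-regularised best response
        u_br = br_i @ g_i - tau * kl(br_i, t_i)
        u_x = x_i @ g_i - tau * kl(x_i, t_i)
        total += u_br - u_x
    return float(total)

# Matching pennies (illustrative): the uniform profile is a QRE at any tau.
u1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
u2 = -u1
uniform = np.full(2, 0.5)

def loss_at(x1, x2, tau=1.0):
    grads = [u1 @ x2, u2.T @ x1]          # payoff gradients of players 1 and 2
    return qre_loss([x1, x2], grads, [uniform, uniform], tau)

loss_uniform = loss_at(uniform, uniform)
loss_skewed = loss_at(np.array([0.9, 0.1]), uniform)
```

The loss is zero exactly when every player already plays its regularised best response, and positive otherwise, which is why its minimiser is a QRE at the given temperature.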

Coarse Correlated Equilibrium (CCE). Solving for a unique CCE is computationally easier than solving for an NE, as the problem is convex (Equation (1)). Therefore any strictly convex objective can be used to uniquely select an equilibrium. For example, maximum entropy would be a suitable default criterion following the principle of maximum entropy. However, as we show in Section 3.3, a different target formulation is necessary for clone-invariance. As such, we opt for maximum relative entropy to a target joint $t = \bigotimes_{i=1}^{n} t_i$ to allow for non-uniform target joint distributions. A number of off-the-shelf solvers (Domahidi et al., 2013) and frameworks (Diamond & Boyd, 2016) can be used to compute solutions to this problem. We used a particularly efficient dual-space, gradient-based algorithm described in Appendix A for scaling.

3.3 INVARIANT EQUILIBRIUM SELECTION

There may be many NEs and CCEs (McLennan & Park, 1999; Sturmfels, 2002; McLennan, 2005). Some equilibria exhibit sparse or heavily skewed strategy profiles (see examples in Appendix F.4). Intuitively, these equilibria are risky in the sense of risk dominance: playing one such equilibrium when other players do not would be a costly mistake. Our goal is to propose a selection procedure that, along with our equilibrium solving algorithms, approximates a clone-invariant equilibrium.

Shannon entropy plays a key role in several equilibrium selection approaches; however, its definition is vulnerable to redundancy in games. Consider a game with 2 distinct actions $A$ and $B$ per player and introduce $b - 1$ clones of $B$ into player 1's action set. The maximum entropy strategy for player 1 in the new game is uniform across their actions with mass $\frac{1}{1+b}$ on each, but this induces a distribution that places $\frac{b}{1+b}$ cumulative mass on the cloned action $B$. From Section 3.2, the maximum Shannon entropy profile defines the precise starting point for tracing the path of QREs towards the LLE. This starting point is sensitive to clones. Hence, if we compute the LLE using the uniform distribution in this new game, we will effectively start from the $(A, B)$ mixed strategy $(\frac{1}{1+b}, \frac{b}{1+b})$ rather than the desired mixed strategy $(1/2, 1/2)$, and hence will not necessarily arrive at the LLE of the original game.
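The arithmetic of the cloning example is easy to verify with exact fractions. The helper below is a hypothetical illustration of the argument above, where `b` counts all copies of $B$ in the new game (the original plus its clones).

```python
from fractions import Fraction

def induced_mass(b):
    """Uniform play over A plus b copies of B: return (mass on A, mass on B)."""
    per_action = Fraction(1, 1 + b)      # maximum Shannon entropy = uniform
    return per_action, b * per_action

no_clones = induced_mass(1)      # original game: A and a single B
with_clones = induced_mass(3)    # B appears three times in total
```

Without clones the induced strategy is the desired $(1/2, 1/2)$, whereas with three copies of $B$ the uniform strategy places $3/4$ of its mass on $B$, so the starting point of the QRE path shifts.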

Desired properties. A clone-invariant entropy definition should be:

P1. Real-valued, finite, and non-negative for any distribution $x$;
P2. Have a well-defined gradient for any $x$ in the interior of the simplex;
P3. Its maximizers should form a convex set. In the case of duplicate strategies (clones), the maximizers should form precisely the set of distributions which arbitrarily distribute a mass of $\frac{1}{c}$ across each of the $c$ sets of clones. In addition, they should achieve an entropy value which is equal to the entropy of the system with clones removed;
P4. Amenable to efficient estimation and flexible to re-interpretation of redundancy.

Note that P3 resolves the issue with Shannon entropy that we highlighted above. P1 is necessary for a reasonable measure of information content. P2 is necessary for gradient-based optimization, and P4 is practically helpful for efficient implementation and adaptation to bespoke game settings. We now introduce affinity entropy $H_a^p : \Delta \to \mathbb{R}$, a generalized Tsallis entropy (Tsallis, 1988) that recognises similar or redundant strategies. Its derivation from the above axioms can be found in Appendix B.

Definition 1 (Affinity Entropy $H_a^p$).

$$H_a^p(x) = \frac{1}{p}\Big[1 - \mathbf{1}^{\top}\big(U^{(p)} x\big)^{p+1}\Big] \tag{5}$$

with entropic-index parameter $p \in (0, 1]$, $U^{(p)} = K\Lambda_p^{-1}$, $K$ a similarity kernel with entries in $[0, 1]$ with 1 indicating two strategies are clones, and $\Lambda_p$ a diagonal matrix containing the $(p+1)$-norms of the columns of $K$ on its diagonal.

Theorem 1. Affinity entropy $H_a^p$ satisfies all desiderata P1-P4.

In experiments, we define a similarity kernel $K^{(i)}$ for each player $i$ with entries $K^{(i)}_{\alpha\beta}$ given by

$$D^{(i)}_{\alpha\beta} = \mathbb{E}_{a \sim U(A)}\Big[\big(u_i(\alpha, a_{-i}) - u_i(\beta, a_{-i})\big)^2\Big] \tag{6}$$

$$K^{(i)}_{\alpha\beta} = \exp\big(-D^{(i)}_{\alpha\beta} / (2\sigma)^2\big) \tag{7}$$

where $D$ measures the strategic dissimilarity between player $i$'s strategies $\alpha$ and $\beta$, and $K$ is simply a radial basis function (RBF) kernel under the metric $D$. Note that $D^{(i)}_{\alpha\beta}$ is zero iff two strategies $\alpha$ and $\beta$ achieve exactly the same utility for player $i$ irrespective of the actions chosen by other players in the game. It should also be clear from the definition how one might Monte-Carlo estimate $D$. To select for an NE or a CCE, we set $t = \arg\max_x H_a^{p=1}(x)$ in Equation (3) and Equation (8) respectively.
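The kernel of Equations (6) and (7) and the affinity entropy of Equation (5) can be sketched together to check clone-invariance on a toy example. This is an illustrative sketch, not the authors' implementation: the function names, the payoff table and the bandwidth `sigma=0.05` are assumptions, and the expectation in Equation (6) is computed exactly by averaging over all co-player profiles rather than by Monte-Carlo sampling.

```python
import numpy as np

def dissimilarity(u_i):
    """D[a, b] = mean over co-player profiles of (u_i(a, .) - u_i(b, .))^2.

    u_i has player i's own action on axis 0; all other axes are co-players.
    """
    flat = u_i.reshape(u_i.shape[0], -1)
    diff = flat[:, None, :] - flat[None, :, :]
    return (diff ** 2).mean(axis=-1)

def rbf_kernel(D, sigma):
    """K = exp(-D / (2 * sigma)^2), as in Equation (7)."""
    return np.exp(-D / (2.0 * sigma) ** 2)

def affinity_entropy(x, K, p=1.0):
    """H_a^p(x) = (1/p) * [1 - 1^T (K Lambda_p^{-1} x)^(p+1)]."""
    col_norms = (K ** (p + 1)).sum(axis=0) ** (1.0 / (p + 1))
    U = K / col_norms            # divides column j of K by its (p+1)-norm
    return float((1.0 - ((U @ x) ** (p + 1)).sum()) / p)

# Player with actions A, B and a clone B' (rows), against 2 co-player actions.
u_i = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 1.0]])
K = rbf_kernel(dissimilarity(u_i), sigma=0.05)

h_split = affinity_entropy(np.array([0.5, 0.25, 0.25]), K)   # B mass split over clones
h_lumped = affinity_entropy(np.array([0.5, 0.5, 0.0]), K)    # B mass on one clone
h_no_clone = affinity_entropy(np.array([0.5, 0.5]), np.eye(2))  # clone removed
```

All three values agree (0.5 for $p = 1$), which matches P3: the entropy depends only on how mass is split between the sets of clones, not on how it is distributed within a set.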

We use the same hyper-parameters for equilibrium solving in all results (see Appendix F.2). For evaluation on real-world prompts, we consider the arena-hard-v0.1 dataset with 500 prompts, selected to separate frontier LLMs, as well as responses from many candidate LLMs. We consider responses from 17 LLMs in particular and queried gemini-1.5-pro-api-0514 for 8 pairwise preference ratings on each prompt for each model pair. See Appendix F.3 for more details.

4.1 EQUILIBRIUM RATING IMPROVEMENT PATH: A SIMULATED EXAMPLE

Recall from Figure 1 that, contrary to the Elo improvement path, maximizing equilibrium ratings led to models (and prompts) improving across skills. We inspect the equilibrium improvement path and offer our interpretation. Figure 2 (Right) shows that the shifts in focus between skills by the model player coincide with transitions in the NE prompts, or prompts weighted by their NE strategies (shown in Figure 2 (Center)). Similarly, to gain support under an NE, new prompts must highlight a skill dimension along which equilibrium models are better differentiated (Figure 2 (Left)). In sum, equilibrium prompts separate equilibrium models. This dynamic encourages exploration of new skill dimensions and incentivises models to be well-rounded across skills.

[Figure 2 panels: prompts (initial and additional), prompt Nash equilibria, and initial prompts, indexed by iteration 0 to 18.]

Figure 2: We inspect the model improvement path induced by NE ratings as shown in Figure 1 (Right). (Left) shows the sequence of additional prompts added at each iteration. Each prompt is the best-of-64 samples according to their NE ratings. (Center) shows the sequence of prompt player NEs. Each row defines a distribution over prompts. (Right) shows the equilibrium-weighted prompt skills and the sequence of king player models. Recall that prompts and models are non-negative vectors over skills; darker indicates higher focus or capability in each skill.

4.2 INVARIANT EVALUATION

We now turn to arena-hard-v0.1 and show that candidate LLMs' equilibrium ratings are invariant to redundancies when their Elo ratings are not. In this experiment, we introduce prompts targeted at bringing down the rating of a certain action (in this case, a model). Specifically, let $\bar u_k(a_k) = \frac{1}{|A_m|}\sum_{a_r} u_k(\cdot, a_k, a_r)$ be the vector of expected king player payoffs when playing action $a_k$ against a randomly chosen rebel model on each prompt. We can then sample prompts adversarial to $a_k$ from $\mathrm{softmax}(-\lambda \bar u_k(a_k))$ and add them to the prompt set. Figure 3 reports the king model rankings under different methods with $a_k =$ gemini-1.5-pro-api-0514 and $\lambda = 10$.
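The adversarial sampling step can be sketched as follows. This is an illustration under the definitions above, assuming the king utilities are stored as a tensor indexed by prompt, king model and rebel model; the function name, the toy tensor and the random seed are hypothetical.

```python
import numpy as np

def adversarial_prompt_distribution(u_k, king, lam=10.0):
    """softmax(-lam * u_bar), with u_bar[p] the mean over rebels of u_k[p, king, :]."""
    u_bar = u_k[:, king, :].mean(axis=1)   # expected king payoff per prompt
    logits = -lam * u_bar                  # low payoff => high sampling weight
    logits -= logits.max()                 # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Three prompts, two models (illustrative): model 0 wins prompt 0, ties on
# prompt 1 and loses prompt 2 against model 1.
u_k = np.array([[[0.0, 1.0], [-1.0, 0.0]],
                [[0.0, 0.0], [0.0, 0.0]],
                [[0.0, -1.0], [1.0, 0.0]]])

probs = adversarial_prompt_distribution(u_k, king=0)
rng = np.random.default_rng(0)
extra = rng.choice(len(probs), size=5, p=probs)   # indices of prompts to duplicate
```

Prompts on which the targeted model does worst receive almost all of the sampling mass, so the duplicated prompts are redundant copies of its weakest cases.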

Our first observation is that without redundant adversarial prompts, our proposed equilibrium rankings of LLMs are fairly consistent with their Elo rankings, with a few models moving up or down one or two positions. This deserves attention. Out of a multiplicity of equilibria, the NE and CCE we selected led to rankings that correspond to our intuition. Indeed, we show in Appendix F.4 that the NE we select is risk-dominant among 128 mixed-strategy NEs of this game. Second, the Elo ratings can be arbitrarily influenced by redundancy, with the top-ranked model falling through the ranks. Equilibrium rankings remain invariant. In fact, while we lose the invariance guarantee with near-redundant prompts, we show in Appendix F.5 that models' equilibrium rankings degrade gracefully. Third, the CCE ratings show the top-3 models tying for first place: correlating models with prompts affects the competitive landscape, which we inspect in Section 4.3. Lastly, solving for a unique equilibrium is not sufficient for invariant ratings. We show in Figure 3 (Right) that using Shannon's entropy for tracing the QRE continuum or for selecting a max-entropy CCE would not lead to invariant ratings. For completeness, we provide a detailed breakdown of our equilibrium ratings in terms of action ratings and marginals for each player in Appendix F.5.

[Figure 3 panels include CCE Ratings and Elo Ratings. X-axis: number of redundant prompts adversarial to gemini-1.5-pro-api-0514. Models shown: Qwen1.5-72B-Chat, claude-3-5-sonnet-20240620, claude-3-opus-20240229, claude-3-sonnet-20240229, command-r-plus, gemini-1.5-flash-api-0514, gemini-1.5-pro-api-0514, gemma-2-27b-it-0625, gemma-2-9b-it-0625, gpt-3.5-turbo-0125, gpt-4-0125-preview, gpt-4-turbo-2024-04-09, gpt-4o-2024-05-13, llama-3-70b-chat-hf, llama-3-8b-chat-hf, mistral-large-2402.]

Figure 3: We introduce an increasing number of redundant copies of prompts adversarial to gemini-1.5-pro-api-0514 and show model rankings under each method. Models at the same rank are grouped in grey and ordered alphabetically. (Right) We show equilibrium rankings under NE(-a) and CCE(-a) selected using Shannon's entropy instead of the affinity entropy. Dotted lines connecting different rating panels indicate continuity in the labeling. For instance, gemini-1.5-pro-api-0514 consistently ranks first under our NE and CCE ratings, despite the introduction of up to 500 redundant adversarial prompts. However, its ranking suffers significantly under the Elo ratings as soon as 250 adversarial prompts have been introduced.

4.3 INTERPRETING EQUILIBRIUM SOLUTIONS

Besides rankings, the equilibrium solutions can surface interpretable insights. We share two examples using NE and CCE solutions respectively from ratings shown in Section 4.2.

Nash Equilibrium Prompts. We have shown that equilibrium ratings are intuitive and invariant to redundancy. A follow-up question is which actions are highly rated and which actions affect other players' ratings (i.e., have positive support at the NE).

Recall that the prompt player utility $u_p(a_p, a_k, a_r) = |u_k(a_p, a_k, a_r)|$ reflects the extent to which a prompt separates the pair of responses from models $a_k$ and $a_r$. The prompt player's equilibrium rating is then $\mathrm{regret}(a_p, x) = \mathbb{E}_{a_k \sim x_k,\, a_r \sim x_r}[u_p(a_p, a_k, a_r)]$ with $x_k$, $x_r$ the NE strategies of the king and rebel player respectively. By definition, prompts that are highly rated under NE ratings separate models played at the NE. In other words, while the Elo ratings reflect the strength of an action on average, equilibrium ratings reflect the strength of actions at the selected equilibrium.
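Given the king and rebel equilibrium strategies, the prompt ratings as written above reduce to a single tensor contraction. The sketch below is illustrative, with hypothetical names and toy values; it assumes the prompt utilities are stored as a tensor indexed by prompt, king model and rebel model.

```python
import numpy as np

def prompt_ratings(u_p, x_k, x_r):
    """rating[p] = E_{a_k ~ x_k, a_r ~ x_r} u_p[p, a_k, a_r]."""
    return np.einsum("pkr,k,r->p", u_p, x_k, x_r)

# Two prompts, two models (illustrative): prompt 0 separates the two models,
# prompt 1 separates nothing.
u_p = np.array([[[0.0, 1.0], [1.0, 0.0]],
                [[0.0, 0.0], [0.0, 0.0]]])

# If king and rebel play different models, prompt 0 is rated highly ...
ratings_distinct = prompt_ratings(u_p, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# ... but if both play the same model, no prompt can separate them.
ratings_same = prompt_ratings(u_p, np.array([1.0, 0.0]), np.array([1.0, 0.0]))
```

The same prompt can therefore be rated differently depending on which models are played at the equilibrium, in contrast to separability averaged over all model pairs.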

We can now illustrate these phenomena using the same game investigated in the second column of Figure 3, with 250 redundant prompts added to the game. First, we show in Figure 4 (Top) the king-vs-rebel payoff matrices induced by 6 sample prompts, with increasing equilibrium prompt ratings. Prompts with low ratings tend to fail to differentiate performant models (i.e., the top-left block of each heatmap). Second, we can ask which prompts we should expect to have support at an equilibrium. Figure 4 (Bottom) shows that, empirically, highly rated prompts are played more often at the equilibrium we select. This implies that the model ratings are heavily influenced by a small subset of prompts that separate frontier models. We note that this correlation is not guaranteed, following our discussion in Section 4.2 on redundant actions. Indeed, our final observation is that prompts that are clones of other prompts tend to receive lower probability mass than their ratings would suggest. In fact, since we have introduced 250 redundant prompts explicitly, we can highlight in gray the prompts that are indeed redundant: many of these prompts enjoy high ratings, but significantly lower mass. In other words, equilibrium ratings reflect quality of an action in isolation while equilibrium mass further takes into account redundancy of an action with respect to other actions. This observation is even clearer in games studied in Appendix D-E.

[Figure 4 panels. Top: king-vs-rebel payoffs for example prompts of increasing NE ratings (left to right), with rows and columns ordered by the king model NE ratings of Figure 3. Bottom: x-axis is prompt (ordered by NE ratings).]

Figure 4: Highly rated prompts generally have high support under the NE. Redundant prompts (gray bands) receive identical ratings but notably lower support. In sum, equilibrium ratings reflect separability of each prompt with respect to the model equilibrium strategies in isolation, whereas equilibrium support of each prompt further accounts for its redundancy with respect to other prompts. (Top) We show the king-vs-rebel payoffs induced by example prompts. Green indicates king-player winning and red losing. Highly rated prompts tend to discriminate between strong models (top-left corners). (Bottom) We show the NE supports and ratings of all prompts, ordered by their NE ratings.

Marginal rating contribution by co-player action. With ratings derived from underlying equilibria, we can decompose the rating of each action into a sum of marginal contributions from each co-player's actions. Recall from Equation (1) that the rating of an action $a'_i$ is its $\mathrm{regret}_i(a'_i, x) = \sum_{a} x(a)\,[u_i(a'_i, a_{-i}) - u_i(a)]$. We can decompose the rating of player $i$'s action $a'_i$ into a weighted sum of each of player $j$'s contributions, with $\delta(a'_i, a_j, x) = \sum_{a_{-j}} x(a)\,[u_i(a'_i, a_j, a_{-i,-j}) - u_i(a_j, a_{-j})]$ the marginal contribution of $a_j$ to $a'_i$'s equilibrium rating. Note that $\mathrm{regret}_i(a'_i, x) = \sum_{a_j} \delta(a'_i, a_j, x)$. The marginal contribution $\delta(a'_i, a_j, x)$ therefore explains $a_j$'s contribution to player $i$'s decision to not deviate.
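The decomposition above can be checked numerically. The following is a minimal NumPy sketch, assuming a hypothetical 3-player payoff tensor and an arbitrary joint distribution (random numbers, not an equilibrium and not data from the paper). The names `u_i`, `x`, `delta` and `regret` are illustrative only. It shows that summing the marginal contributions over the co-player's actions recovers the regret of each deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_i, n_j, n_k = 3, 4, 2                      # hypothetical action counts for players i, j, k
u_i = rng.normal(size=(n_i, n_j, n_k))       # u_i[a_i, a_j, a_k]: payoff to player i
x = rng.random(size=(n_i, n_j, n_k))
x /= x.sum()                                 # arbitrary joint distribution x(a)

def deviation_gain(dev):
    """u_i(a'_i, a_j, a_k) - u_i(a) for every joint action a."""
    return u_i[dev][None, :, :] - u_i

def regret(dev):
    """regret_i(a'_i, x): expected gain of always playing a'_i."""
    return (x * deviation_gain(dev)).sum()

def delta(dev):
    """delta(a'_i, a_j, x) for every a_j: sum over all actions except a_j."""
    return (x * deviation_gain(dev)).sum(axis=(0, 2))

contribs = np.array([delta(d) for d in range(n_i)])   # shape (n_i, n_j)
regrets = np.array([regret(d) for d in range(n_i)])
```

Each row of `contribs` splits one deviation's rating across the co-player's actions, so grouping columns (for example by model family) only requires summing the relevant entries.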

Recall from Figure 3 that several models tied for first place under the CCE profile but are fully differentiated under NE. We can now leverage the marginal contribution analysis to understand the mechanism underlying this phenomenon. Figure 5 shows the CCE king model ratings decomposed from the perspective of the rebel player. In other words, we ask which rebel models contribute most positively or negatively to each king model's CCE rating. For clarity of presentation, we focus on the top 5 models and we group rebel models into families of models if they share the same naming prefix. The contribution of each family of models is therefore the sum of the contributions by models within each family $F$, or $\sum_{a_r \in F} \delta(a'_k, a_r, x)$, with $a'_k$ a king model and $a_r$ a rebel model.

We make several remarks. First, all 3 top-ranked king models benefit the most when compared against rebel models in their own model family: the GPT family (Achiam et al., 2023) of models contributes positively to the ratings of gpt-4o-2024-05-13 and gpt-4-turbo-2024-04-09. Similarly, gemini-1.5-flash-api-0514, the only other model in the Gemini family (Team et al., 2023), improves gemini-1.5-pro-api-0514's rating the most. We speculate that this can be a result of model developers selecting models to release based on favourable comparisons to their earlier or smaller models. Second, all top-ranked models remain vulnerable to open-weight models such as the Mistral (Jiang et al., 2023) and Llama (Dubey et al., 2024) families of models. More fine-grained analysis may shed light on the prompts on which these losses tend to occur.


[Figure 5 plot. Rows: gemini-1.5-pro-api-0514, gpt-4o-2024-05-13, gpt-4-turbo-2024-04-09, claude-3-5-sonnet-20240620, gpt-4-0125-preview, each shown with its king rating. x-axis: contribution by family of models to king ratings. Legend (family of models): Qwen1.5, claude, command, gemini, gemma, gpt, llama, mistral.]

Figure 5: The CCE joint distribution can surface insights in the comparison data. Each bar represents a model family $F$ and its width corresponds to $\sum_{a_r \in F} \delta(a'_k, a_r, x)$, with $a'_k$ a king player model choice and $a_r$ a rebel model belonging to the family $F$. A model's family is determined by its model name prefix. For brevity, we show the king model rating breakdown for the top 5 models.

We caveat that our results are in part derived from the preference ratings of a gemini-1.5-pro-api-0514 model and may not reflect the true dynamics of real-world LLM development. Nevertheless, the interpretability offered by the game-theoretic equilibria further distinguishes game-theoretic evaluation from the prior works discussed in Section 5.

5 RELATED WORKS

There is a rich body of literature studying rating methods with applications in Chess, Go, Tennis and video games. One family of probabilistic methods follows the Bradley-Terry model and predicts pairwise win probabilities from ratings. A widely used example is Elo (Elo, 1978), with extensions such as Bayes-Elo, mElo and Elo-MMR (Coulom, 2008; Balduzzi et al., 2018; Ebtekar & Liu, 2021; Vadori & Savani, 2024) capturing temporal variation, cyclicality and ordinal ranks in data. Elo ratings can typically be computed efficiently as regression problems, although they are vulnerable to redundancy. A separate line of work draws from Social Choice (or Voting) Theory (SCT, Sen (1977); Lanctot et al. (2023)), which also studies independence of clones: rankings should be invariant to redundant candidates (e.g., LLMs) being added. However, invariance to redundancy in votes (e.g., prompts) is in direct opposition to the spirit of social choice theory. In this sense, SCT provides partial (one-sided) clone invariance, which we argue is insufficient for open-ended, LMSYS-style evaluation. Finally, game-theoretic evaluation has been previously studied in Balduzzi et al. (2018) and Marris et al. (2022b), where full clone-invariance is guaranteed in the 2-player zero-sum setting. Our method generalises the approaches in these works to N-player general-sum settings, with practical equilibrium solving and selection algorithms based on our novel affinity entropy definition. Other approaches have been concurrently developed that avoid the equilibrium selection dilemma, and hence obviate the use of entropy (Marris et al., 2025).

6 CONCLUSIONS

We studied the effect of maximizing Elo ratings in the context of open-ended evaluation and showed that its sensitivity to redundancy could bias model (and prompt) selection. We then proposed an equilibrium rating framework, with practical equilibrium solving and selection algorithms that can scale to real-world LLM evaluation. We show our method to provide intuitive and robust rankings of models (and prompts), with interpretable structures.

We see several exciting future directions. First, although our methods can scale to tens of thousands of prompts and tens of models on commodity hardware, scaling further would be challenging. Tabulating the evaluation payoff tensor with pairwise preference ratings can be costly too. Research into alternative solution concepts, or into how we could leverage their equilibrium structure for analysis (e.g., prompt and model pruning), is also promising. Finally, while we target LLM evaluation in particular, our methodology can be applied more generally to other domains. For instance, our rating methods could evaluate multi-modal model generation capabilities (Jiang et al., 2024) or analyse game dynamics for video game development (Pendurkar & Chow, 2023).


Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, et al. Mega: Multilingual evaluation of generative ai. arXiv preprint arXiv:2303.12528, 2023.

Robert J Aumann. Subjectivity and correlation in randomized strategies. Journal of mathematical Economics, 1(1):67–96, 1974.

Robert J Aumann. Correlated equilibrium as an expression of bayesian rationality. Econometrica: Journal of the Econometric Society, pp. 1–18, 1987.

David Balduzzi, Karl Tuyls, Julien Perolat, and Thore Graepel. Re-evaluating evaluation. Advances in Neural Information Processing Systems, 31, 2018.

Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms. arXiv preprint arXiv:2402.03927, 2024.

Quentin Bertrand, Wojciech Marian Czarnecki, and Gauthier Gidel. On the limitations of the elo, real-world games are transitive, not additive. In International Conference on Artificial Intelligence and Statistics, pp. 2905–2921. PMLR, 2023.

Cheng-Han Chiang and Hung-Yi Lee. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15607–15631, 2023.

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024.

Rémi Coulom. Whole-history rating: A bayesian rating system for players of time-varying strength. In International conference on computers and games, pp. 113–124. Springer, 2008.

Constantinos Daskalakis, Aranyak Mehta, and Christos Papadimitriou. A note on approximate nash equilibria. In International Workshop on Internet and Network Economics, pp. 297–306. Springer, 2006.

Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), pp. 3071–3076, 2013.

Konstantinos Drakakis, UCD CASL, and Barak A Pearlmutter. On the calculation of the l2 → l1 induced matrix norm. International Journal of Algebra, 3(5):231–240, 2009.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris Mc Connell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab Al Badawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan


Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur C elebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, 
Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzm an, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena 
Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan Mc Phie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli,


Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, V ıtor Albiero, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary De Vito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024a.

Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024b.

Aram Ebtekar and Paul Liu. Elo-mmr: A rating system for massive multiplayer competitions. In Proceedings of the Web Conference 2021, pp. 1772–1784, 2021.

Arpad E. Elo. The Rating of Chessplayers, Past and Present. Arco Pub., New York, 1978. ISBN 0668047216 9780668047210. URL http://www.amazon.com/Rating-Chess-Players-Past-Present/dp/0668047216.

Ian Gemp, Rahul Savani, Marc Lanctot, Yoram Bachrach, Thomas Anthony, Richard Everett, Andrea Tacchetti, Tom Eccles, and János Kramár. Sample-based approximation of nash in large many-player games via gradient descent. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pp. 507–515, 2022.

Ian Gemp, Luke Marris, and Georgios Piliouras. Approximating nash equilibria in normal-form games via stochastic optimization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=cc8h3I3V4E.

Jacob K Goeree, Charles A Holt, and Thomas R Palfrey. Risk averse behavior in generalized matching pennies games. Games and Economic Behavior, 45(1):97–113, 2003.

Shahriar Golchin and Mihai Surdeanu. Time travel in LLMs: Tracing data contamination in large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2Rwq6c3tvr.

John C Harsanyi and Reinhard Selten. A general theory of equilibrium selection in games. MIT Press Books, 1, 1988.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.

P Jean-Jacques Herings and Ronald Peeters. Homotopy methods to compute equilibria in game theory. Economic Theory, 42(1):119–156, 2010.


Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models, 2024. URL https://arxiv.org/abs/2406.04485.

Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Marc Lanctot, Kate Larson, Yoram Bachrach, Luke Marris, Zun Li, Avishkar Bhoopchand, Thomas Anthony, Brian Tanner, and Anna Koop. Evaluating agents using social choice theory. arXiv preprint arXiv:2312.03121, 2023.

Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Teufel, Marco Bellagente, et al. Holistic evaluation of text-to-image models. Advances in Neural Information Processing Systems, 36, 2024.

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline, 2024a.

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The arena-hard pipeline, April 2024b. URL https://lmsys.org/blog/2024-04-19-arena-hard/.

Siqi Liu, Luke Marris, Georgios Piliouras, Ian Gemp, and Nicolas Heess. Nfgtransformer: Equivariant representation learning for normal-form games. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=4YESQqIys7.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2511–2522, 2023.

Luke Marris, Ian Gemp, Thomas Anthony, Andrea Tacchetti, Siqi Liu, and Karl Tuyls. Turbocharging solution concepts: Solving NEs, CEs and CCEs with neural equilibrium solvers. CoRR, abs/2210.09257, 2022a. doi: 10.48550/ARXIV.2210.09257. URL https://arxiv.org/abs/2210.09257.

Luke Marris, Marc Lanctot, Ian Gemp, Shayegan Omidshafiei, Stephen McAleer, Jerome Connor, Karl Tuyls, and Thore Graepel. Game theoretic rating in n-player general-sum games with equilibria, 2022b. URL https://arxiv.org/abs/2210.02205.

Luke Marris, Siqi Liu, Ian Gemp, Georgios Piliouras, and Marc Lanctot. Deviation ratings: A general, clone-invariant rating method. 2025. URL https://arxiv.org/abs/2502.11645.

Richard D McKelvey and Thomas R Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, 1995.

Andrew McLennan. The expected number of nash equilibria of a normal form game. Econometrica, 73(1):141–174, 2005.

Andrew McLennan and In-Uck Park. Generic 4 × 4 two person games have at most 15 nash equilibria. Games and Economic Behavior, 26(1):111–130, 1999.

John F Nash et al. Non-cooperative games. Annals of Mathematics, 1950.


Medha Palavalli, Amanda Bertsch, and Matthew R. Gormley. A taxonomy for data contamination in large language models, 2024. URL https://arxiv.org/abs/2407.08716.

Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076, 2024.

Sumedh Pendurkar and Chris Chow. Bilevel entropy based mechanism design for balancing meta in video games. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, 2023.

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark, 2023.

Yosef Rinott and Marco Scarsini. On the number of pure strategy nash equilibria in random games. Games and Economic Behavior, 33(2):274–293, 2000.

Amartya Sen. Social choice theory: A re-examination. Econometrica: journal of the Econometric Society, pp. 53–89, 1977.

Bernd Sturmfels. Solving systems of polynomial equations. Number 97. American Mathematical Soc., 2002.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Constantino Tsallis. Possible generalization of boltzmann-gibbs statistics. Journal of statistical physics, 52:479–487, 1988.

Theodore L Turocy. A dynamic homotopy interpretation of the logistic quantal response equilibrium correspondence. Games and Economic Behavior, 51(2):243–263, 2005.

Nelson Vadori and Rahul Savani. Ordinal potential-based player rating. In International Conference on Artificial Intelligence and Statistics, pp. 118–126. PMLR, 2024.

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024.

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, et al. Livebench: A challenging, contamination-free llm benchmark. arXiv preprint arXiv:2406.19314, 2024.

Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. ∞Bench: Extending long context evaluation beyond 100K tokens. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15262–15277, Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.814.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 46595–46623. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.


A COMPUTING THE MAXIMUM RELATIVE ENTROPY CCE

A maximum relative entropy CCE, which minimises the distance of the log-joint, $\log(x(a))$, to a target log-joint, $t(a) \in \mathbb{R}^{|\mathcal{A}|}$, can be computed using gradient descent. We formulate the problem in dual space (Marris et al., 2022a) with dual parameters, $\alpha_i(a'_i) \in \mathbb{R}^{|\mathcal{A}_i|}\ \forall i$, defined as functions, $\alpha_i(a'_i) = \operatorname{softplus}(\theta_i(a'_i))\ \forall i$, of learned parameters, $\theta_i(a'_i) \in \mathbb{R}^{|\mathcal{A}_i|}\ \forall i$. Let $l_\theta(a)$ be a logit term used to construct the loss function.

$$l_\theta(a) = -\sum_i \sum_{a'_i} \alpha_i(a'_i)\,\big[u_i(a'_i, a_{-i}) - u_i(a)\big] + t(a) \tag{8}$$

Minimizing the loss function, $\min_\theta L_\theta$ with $L_\theta = \log\big[\sum_a \exp[l_\theta(a)]\big]$, converges to optimal dual variables, $\alpha^*_i(a'_i) = \operatorname{softplus}(\theta^*_i(a'_i))\ \forall i$. The loss is convex, deterministic, and unconstrained. Therefore many optimization algorithms are suitable. The primal joint can be simply recovered from the optimal logit term, $x_{\theta^*}(a) = \operatorname{softmax}[l_{\theta^*}(a)]$.
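The dual procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, under several assumptions that are not from the paper: a hypothetical 2-player game of Chicken with payoffs scaled to [0, 1], a uniform target ($t(a) = 0$, so the result is a maximum entropy CCE), plain gradient descent with a fixed step size of 1.0, and 20000 iterations. The gradient of the log-sum-exp loss with respect to each dual variable is minus the expected deviation gain under the current joint, and the chain rule through softplus contributes a sigmoid factor.

```python
import numpy as np

# Hypothetical 2-player game of Chicken; action 0 = dare, 1 = chicken.
u0 = np.array([[0.0, 7.0], [2.0, 6.0]]) / 7.0    # u0[a0, a1], payoff to player 0
u1 = u0.T                                         # symmetric game: u1[a0, a1]
n0, n1 = u0.shape

# Deviation gains u_i(a'_i, a_-i) - u_i(a), indexed [a'_i, a0, a1].
gain0 = u0[:, None, :] - u0[None, :, :]
gain1 = u1.T[:, :, None] - u1[None, :, :]
# One row per (player, deviation), one column per joint action.
A = np.concatenate([gain0.reshape(n0, -1), gain1.reshape(n1, -1)])
t = np.zeros(n0 * n1)                             # target log-joint (uniform)

def joint(theta):
    """x_theta = softmax(l_theta), with l_theta = t - sum alpha * gain."""
    alpha = np.logaddexp(0.0, theta)              # softplus keeps alpha >= 0
    logits = t - alpha @ A
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.zeros(A.shape[0])
for _ in range(20000):
    x_flat = joint(theta)
    # dL/dalpha = -(A x); chain rule through softplus multiplies by sigmoid.
    theta += 1.0 * (A @ x_flat) / (1.0 + np.exp(-theta))

x = joint(theta).reshape(n0, n1)                  # approximate max-entropy CCE
max_gain = (A @ x.ravel()).max()                  # largest expected deviation gain
```

A non-positive (or negligibly positive) `max_gain` indicates that no player benefits from a unilateral deviation, which is the CCE condition. Any other first-order or quasi-Newton optimizer could replace the fixed-step loop.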

B AFFINITY ENTROPY

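Consider defining a modified Tsallis entropy $H^p_a$ with temperature parameter $p \in (0, 1]$ as:

$$H^p_a(x) = \frac{1}{p}\Big[1 - z^\top z\Big] = \frac{1}{p}\Big[1 - \sum_i \big(U^{(p)}_i x\big)^{p+1}\Big] \tag{9}$$

where $z = (U^{(p)} x)^{\frac{p+1}{2}}$, with the power taken element-wise. Note that this definition recovers the standard definition of Tsallis entropy when $U^{(p)}$ is the identity matrix.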

Remark. $U^{(p)}_{ij} \ge 0$ for all entries for $H^p_a$ to be real-valued.

$U^{(p)}_{ij}$ must be non-negative for every $i, j$; otherwise, there exists $x = e_j$, where $e_j$ is a standard-basis vector, such that $U^{(p)}_i x < 0$ and $(U^{(p)}_i x)^{p+1}$ is not real for $p \in (0, 1)$.

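Remark. The $(p+1)$-norm of each column of $U^{(p)}$ must be less than or equal to 1 for $H^p_a$ to be non-negative for any $x \in \Delta$.

We need $z^\top z \le 1$ for $p \in (0, 1]$ and any $x \in \Delta$. Equivalently, we require $(z^\top z)^{\frac{1}{p+1}} \le 1$ for $p \in (0, 1]$.

Note $(z^\top z)^{\frac{1}{p+1}} = \big[\sum_i (U^{(p)}_i x)^{p+1}\big]^{\frac{1}{p+1}} = \|U^{(p)} x\|_{p+1}$. Therefore, we require

$$1 \ge \sup_{x \in \Delta} \|U^{(p)} x\|_{p+1} \tag{10}$$

$$= \sup_{\|x\|_1 = 1} \|U^{(p)} x\|_{p+1} \quad \text{for } U^{(p)} \ge 0 \tag{11}$$

$$= \|U^{(p)}\|_{1,p+1} \tag{12}$$

$$= \max_j \|U^{(p)}_{\cdot,j}\|_{p+1} \quad \text{by Drakakis et al. (2009).} \tag{13}$$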

Remark. Among all admissible $U^{(p)}$, defining $U^{(p)}$ such that its columns have exactly unit $(p+1)$-norm achieves $\min_{U^{(p)}} \min_{x \in \Delta} H^p_a(x)$.

This follows from the previous remark and is desirable for the sake of a tight definition of entropy. Intuitively, by the conditions set thus far, $U^{(p)} = 0$ is admissible. Yet, this gives a loose definition of entropy where $H^p_a = 1/p$. It turns out that this intuition is required in the limit as $p \to 0$.

Remark. $U^{(p)}$ must be precisely column stochastic for $H^p_a$ to remain finite in the limit of $p \to 0$.

In the limit $p \to 0$, the denominator of $H^p_a$ goes to zero; for the limit to remain finite, so that L'Hôpital's rule applies, the numerator must go to zero as well. In this limit $z^\top z \to \sum_i U^{(p)}_i x = \mathbf{1}^\top U^{(p)} x$, so the numerator goes to $1 - \mathbf{1}^\top U^{(p)} x$. Therefore,

$$\forall x \in \Delta^{d-1}: \quad 1 - \mathbf{1}^\top U^{(p)} x = 0. \tag{14}$$

Finite distributions only obey a single equality constraint, that is $x^\top \mathbf{1} = 1$; therefore it must be the case that $\mathbf{1}^\top U^{(p)} = \mathbf{1}^\top$, i.e., $U^{(p)}$ is column stochastic.

Remark. $H^p_a$ is concave in $x$.

Let $y_i = U^{(p)}_i x$. Then each element of the sum, $y_i^{p+1}$, is a convex function in $y_i$, which itself is a linear transformation of $x$. Therefore, $\sum_i (U^{(p)}_i x)^{p+1}$ is convex in $x$. Hence $H^p_a$ is concave in $x$.

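Remark. The gradients $\nabla_x H^p_a$ are well-defined.

Recall (9), then:

$$\frac{\partial H^p_a}{\partial x_j} = -\frac{p+1}{p} \sum_i \big(U^{(p)}_i x\big)^{p}\, U^{(p)}_{ij} \tag{15}$$

$$\nabla_x H^p_a = -\frac{p+1}{p} \big(U^{(p)}\big)^\top \big(U^{(p)} x\big)^{p} \tag{16}$$

which is well-defined for any choice of $U^{(p)}_{ij} \ge 0$ for all $i, j$.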
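The closed-form gradient in (16) can be sanity-checked against central finite differences. The sketch below assumes a hypothetical random non-negative matrix whose columns are rescaled to unit $(p+1)$-norm, and an arbitrary interior point of the simplex; the function names `entropy` and `entropy_grad` are illustrative.

```python
import numpy as np

def entropy(x, U, p):
    """H^p_a(x) = (1 - sum_i (U_i x)^(p+1)) / p, as in (9)."""
    return (1.0 - ((U @ x) ** (p + 1)).sum()) / p

def entropy_grad(x, U, p):
    """Closed-form gradient from (16): -((p + 1) / p) U^T (U x)^p."""
    return -((p + 1) / p) * U.T @ ((U @ x) ** p)

rng = np.random.default_rng(0)
p = 0.7
U = rng.random((4, 4))                                   # hypothetical non-negative matrix
U /= (U ** (p + 1)).sum(axis=0) ** (1.0 / (p + 1))       # unit (p+1)-norm columns
x = rng.random(4)
x /= x.sum()                                             # interior point of the simplex

eps = 1e-6
numeric = np.array([
    (entropy(x + eps * e, U, p) - entropy(x - eps * e, U, p)) / (2 * eps)
    for e in np.eye(4)
])
analytic = entropy_grad(x, U, p)
```

Agreement between `numeric` and `analytic` is only a numerical check at one point, not a proof, but it is a cheap guard against sign or exponent slips when implementing the entropy in an optimizer.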

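Remark. $H^p_a$ is well-defined in the limit as $p \to 0$, i.e., Shannon affinity entropy is well-defined.

It is known that Shannon entropy can be recovered from Tsallis entropy in the limit as $p \to 0$. We repeat that derivation here and use L'Hôpital's rule. The derivative of the denominator is 1, hence we find the limit is given by the finite derivative of the numerator:

$$\frac{d[p H^p_a]}{dp} = \frac{d}{dp}\Big[1 - \sum_i y_i^{p+1}\Big] \tag{17}$$

$$= -\frac{d}{dp}\Big[\sum_i e^{(p+1)\log(y_i)}\Big] \tag{18}$$

$$= -\sum_i \Big[\log(y_i) + (p+1)\frac{1}{y_i}\frac{dy_i}{dp}\Big] e^{(p+1)\log(y_i)}. \tag{19}$$

In the limit $p \to 0$, the derivative evaluates to

$$\frac{d[p H^p_a]}{dp}\Big|_{p=0} = -\sum_i \Big[e^{(p+1)\log(y_i)} \log(y_i)\Big]\Big|_{p=0} - (p+1)\sum_i \Big[\frac{1}{y_i}\frac{dy_i}{dp}\, e^{(p+1)\log(y_i)}\Big]\Big|_{p=0} \tag{20}$$

$$= -\sum_i y_i \log(y_i) - \sum_i \frac{dy_i}{dp}\Big|_{p=0} \tag{21}$$

$$= S(U^{(0)} x) - \sum_i \frac{dy_i}{dp}\Big|_{p=0}, \tag{22}$$

where $S$ denotes Shannon entropy and $y = U^{(0)} x$ at $p = 0$.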

Remark. Let $K$ be a similarity matrix between actions with non-negative entries and positive column-sums. Then $U^{(p)} = K \operatorname{diag}\big(1 / (\mathbf{1}^\top K^{p+1})^{1/(p+1)}\big)$, with powers taken element-wise, satisfies the conditions stated above for $U^{(p)}$.
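As a concrete illustration of this construction, the sketch below builds $U^{(p)}$ from a small hypothetical similarity matrix (the values of `K`, `x` and `p` are made up for the example) and checks the conditions from the preceding remarks: non-negative entries, unit $(p+1)$-norm columns, column-stochasticity at $p = 0$, and recovery of the standard Tsallis entropy when $K$ is the identity.

```python
import numpy as np

def affinity_matrix(K, p):
    """U^(p) = K diag(1 / (1^T K^(p+1))^(1/(p+1))), powers taken element-wise."""
    col_norm = (K ** (p + 1)).sum(axis=0) ** (1.0 / (p + 1))
    return K / col_norm                       # divides column j by its (p+1)-norm

def affinity_entropy(x, U, p):
    """H^p_a(x) = (1 - sum_i (U_i x)^(p+1)) / p."""
    return (1.0 - ((U @ x) ** (p + 1)).sum()) / p

def tsallis_entropy(x, p):
    """Standard Tsallis entropy with the same parameterisation."""
    return (1.0 - (x ** (p + 1)).sum()) / p

p = 0.5
K = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.1],
              [0.0, 0.1, 1.0]])               # hypothetical similarity matrix
x = np.array([0.2, 0.3, 0.5])

U = affinity_matrix(K, p)
col_norms = (U ** (p + 1)).sum(axis=0) ** (1.0 / (p + 1))
col_sums_at_zero = affinity_matrix(K, 0.0).sum(axis=0)
h_affinity = affinity_entropy(x, U, p)
h_identity = affinity_entropy(x, affinity_matrix(np.eye(3), p), p)
```

Because similar actions share mass through `U @ x`, `h_affinity` is lower than the Tsallis entropy of `x` would suggest when near-duplicates are present, which is the behaviour the affinity entropy is designed to have.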

Remark. Under the above choice of $U^{(p)}$, Shannon affinity entropy $S_a = H^{p \to 0}_a$ can be derived as:

$$S_a(x) = S(U^{(0)} x) + \sum_j \Big[\sum_i U^{(0)}_{ij} \log(K_{ij}) - \log\Big(\sum_i K_{ij}\Big)\Big] x_j. \tag{23}$$

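The necessary $y_i$ term can be rewritten and its derivative (evaluated at $p = 0$) can be derived as follows:

$$y_i = U^{(p)}_i x = \sum_j K_{ij} \Big(\sum_{i'} K^{p+1}_{i'j}\Big)^{-\frac{1}{p+1}} x_j \tag{24}$$

$$= \sum_j K_{ij} x_j \Big(\sum_{i'} K^{p+1}_{i'j}\Big)^{-\frac{1}{p+1}} \tag{25}$$

$$= \sum_j K_{ij} x_j\, e^{-\frac{1}{p+1} \log\left(\sum_{i'} K^{p+1}_{i'j}\right)} \tag{26}$$

$$= \sum_j K_{ij} x_j\, e^{-\frac{1}{p+1} \log \sum_{i'} e^{(p+1)\log(K_{i'j})}} \tag{27}$$

$$\frac{dy_i}{dp} = \sum_j K_{ij} x_j\, e^{-\frac{1}{p+1} \log \sum_{i'} e^{(p+1)\log(K_{i'j})}} \Big[\frac{1}{(p+1)^2} \log \sum_{i'} e^{(p+1)\log(K_{i'j})} \tag{28}$$

$$\qquad - \frac{1}{p+1} \frac{1}{\sum_{i'} e^{(p+1)\log(K_{i'j})}} \sum_{i'} \log(K_{i'j})\, e^{(p+1)\log(K_{i'j})}\Big] \tag{29}$$

$$= \sum_j K_{ij} x_j \Big(\sum_{i'} K^{p+1}_{i'j}\Big)^{-\frac{1}{p+1}} \Big[\frac{1}{(p+1)^2} \log\Big(\sum_{i'} K^{p+1}_{i'j}\Big) \tag{30}$$

$$\qquad - \frac{1}{p+1} \frac{1}{\sum_{i'} K^{p+1}_{i'j}} \sum_{i'} \log(K_{i'j})\, K^{p+1}_{i'j}\Big] \tag{31}$$

$$= \sum_j \Big[\frac{1}{(p+1)^2} \log\Big(\sum_{i'} K^{p+1}_{i'j}\Big) - \frac{1}{p+1} \sum_{i'} \big(U^{(p)}_{i'j}\big)^{p+1} \log(K_{i'j})\Big] U^{(p)}_{ij} x_j \tag{32}$$

$$\frac{dy_i}{dp}\Big|_{p=0} = \sum_j \Big[\log\Big(\sum_{i'} K_{i'j}\Big) - \sum_{i'} U^{(0)}_{i'j} \log(K_{i'j})\Big] U^{(0)}_{ij} x_j \tag{33}$$

where we define $K_{ij} \log(K_{ij}) = 0$ if $K_{ij} = 0$ (which implies $(U^{(p)}_{ij})^{p+1} \log(K_{ij}) = 0$ if $K_{ij} = 0$).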

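Plugging this back into the second term in the formula for Shannon affinity entropy, we find

\begin{align}
\sum_i \frac{dy_i}{dp}\Big|_{p=0} &= \sum_i \sum_j \Big[\log\Big(\sum_{i'} K_{i'j}\Big) - \sum_{i'} U^{(0)}_{i'j}\log(K_{i'j})\Big] U^{(0)}_{ij} x_j \tag{34} \\
&= \sum_j \Big[\log\Big(\sum_{i'} K_{i'j}\Big) - \sum_{i'} U^{(0)}_{i'j}\log(K_{i'j})\Big] x_j \sum_i U^{(0)}_{ij} \tag{35} \\
&= \sum_j \Big[\log\Big(\sum_{i'} K_{i'j}\Big) - \sum_{i'} U^{(0)}_{i'j}\log(K_{i'j})\Big] x_j \tag{36}
\end{align}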

completing the claim.
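The limit can also be checked numerically. The sketch below assumes the affinity entropy has the Tsallis form $H^p_a(x) = \frac{1}{p}\big[1 - \sum_i (U^{(p)}_i x)^{p+1}\big]$ and compares it at a small $p$ against the closed form $S(U^{(0)}x) - \sum_j \big[\log \sum_i K_{ij} - \sum_i U^{(0)}_{ij}\log K_{ij}\big] x_j$. The kernel, the distribution and all function names are hypothetical.

```python
import math


def normalise_kernel(K, p):
    """U^(p): divide each column of K by its (p+1)-norm (powers elementwise)."""
    rows, cols = len(K), len(K[0])
    norms = [sum(K[i][j] ** (p + 1) for i in range(rows)) ** (1.0 / (p + 1))
             for j in range(cols)]
    return [[K[i][j] / norms[j] for j in range(cols)] for i in range(rows)]


def affinity_entropy(K, x, p):
    """Assumed Tsallis-style form: (1/p) * (1 - sum_i (U_i x)^(p+1))."""
    U = normalise_kernel(K, p)
    y = [sum(U[i][j] * x[j] for j in range(len(x))) for i in range(len(U))]
    return (1.0 - sum(v ** (p + 1) for v in y)) / p


def shannon_affinity_entropy(K, x):
    """Closed-form p -> 0 limit, using the convention 0 * log(0) = 0."""
    U0 = normalise_kernel(K, 0.0)
    rows = len(K)
    y = [sum(U0[i][j] * x[j] for j in range(len(x))) for i in range(rows)]
    shannon = -sum(v * math.log(v) for v in y if v > 0.0)
    correction = 0.0
    for j in range(len(x)):
        col_sum = sum(K[i][j] for i in range(rows))
        inner = sum(U0[i][j] * math.log(K[i][j]) for i in range(rows) if K[i][j] > 0.0)
        correction += (math.log(col_sum) - inner) * x[j]
    return shannon - correction


K = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 1.0]]
x = [0.2, 0.5, 0.3]
small_p_value = affinity_entropy(K, x, 1e-6)
limit_value = shannon_affinity_entropy(K, x)
```

The two values should agree up to an error of order $p$; the comparison is a sanity check of the derivation rather than a proof.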

Remark. In the case of duplicate strategies (clones), the maximizers of $H^p_a$ form precisely the set of distributions which arbitrarily distribute a mass of $\frac{1}{C}$ across each of the $C$ sets of clones.

Consider the case of exact clones, i.e., $K$ is block diagonal (w.l.o.g.) with blocks of ones. Let there be $C$ clone groups each of size $n_c$ for $c \in \{1, \ldots, C\}$. Let $c(i)$ map an action $i$ to its clone set. In this case, it can be shown that $U^{(p)}_{ij} = n_{c(i)}^{-\frac{1}{p+1}}$ if $c(i) = c(j)$, otherwise $U^{(p)}_{ij} = 0$. Note that the gradient of entropy w.r.t. $x$ must be proportional to the ones vector for $x$ to be a maximizer in the interior of the simplex. Let $x = [\frac{1}{C}x_1, \ldots, \frac{1}{C}x_C]$ with each $x_c \in \mathbb{R}^{n_c}$ w.l.o.g. We will show that the set of maximizers of $H^p_a$ is necessarily the set of $x$ where each $x_c \in \Delta^{n_c - 1}$. For $x$ to be a maximizer, the gradient must be equal to the ones vector multiplied by a scalar $d \in \mathbb{R}$:


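\begin{align}
\forall j \quad \frac{\partial H^p_a(x)}{\partial x_j} &= -\frac{p+1}{p}\sum_i \big(U^{(p)}_i x\big)^p U^{(p)}_{ij} \tag{37} \\
&= -\frac{p+1}{p}\sum_i \Big(\sum_k U^{(p)}_{ik} x_k\Big)^p U^{(p)}_{ij} \tag{38} \\
&= -\frac{p+1}{p}\sum_i \Big(\frac{1}{C} n_{c(i)}^{-\frac{1}{p+1}} \mathbf{1}^\top x_{c(i)}\Big)^p U^{(p)}_{ij} \tag{39} \\
&= -\frac{p+1}{p}\sum_{i : c(i) = c(j)} \Big(\frac{1}{C} n_{c(j)}^{-\frac{1}{p+1}} \mathbf{1}^\top x_{c(j)}\Big)^p n_{c(j)}^{-\frac{1}{p+1}} \tag{40} \\
&= -\frac{p+1}{p}\, n_{c(j)}\, n_{c(j)}^{-\frac{p+1}{p+1}} \Big(\frac{1}{C}\mathbf{1}^\top x_{c(j)}\Big)^p \tag{41} \\
&= -\frac{p+1}{p}\Big(\frac{1}{C}\mathbf{1}^\top x_{c(j)}\Big)^p = d. \tag{42}
\end{align}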

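We also require $x \in \Delta$, which implies

\begin{align}
x_j \ge 0 &\implies x_{c(j)} \ge 0 \tag{43} \\
1 &= \sum_c \frac{1}{C}\mathbf{1}^\top_{n_c} x_c \tag{44} \\
&= \sum_c \Big(-\frac{p}{p+1} d\Big)^{1/p} = C\Big(-\frac{p}{p+1} d\Big)^{1/p} \implies d = -\frac{p+1}{p} C^{-p}. \tag{45}
\end{align}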

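Finally, we know from (42)

\begin{align}
\Big(\frac{1}{C}\mathbf{1}^\top x_{c(j)}\Big)^p &= -d\,\frac{p}{p+1} = C^{-p} \tag{46} \\
\implies \mathbf{1}^\top x_{c(j)} &= 1 \tag{47}
\end{align}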

proving the claim.

Remark. In the case of duplicate strategies (clones), the maximizers of $H^p_a$ achieve an entropy value which is equal to the Tsallis entropy of the system with clones removed.

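If we evaluate the max entropy distribution we find

\begin{align}
H^p_a(x) &= \frac{1}{p}\Big[1 - \sum_i \big(U^{(p)}_i x\big)^{p+1}\Big] \tag{48} \\
&= \frac{1}{p}\Big[1 - \sum_i \Big(\frac{1}{C} n_{c(i)}^{-\frac{1}{p+1}} \mathbf{1}^\top x_{c(i)}\Big)^{p+1}\Big] \tag{49} \\
&= \frac{1}{p}\Big[1 - \sum_c n_c \Big(\frac{1}{C} n_c^{-\frac{1}{p+1}} \mathbf{1}^\top x_c\Big)^{p+1}\Big] \tag{50} \\
&= \frac{1}{p}\Big[1 - \sum_c n_c \Big(\frac{1}{C} n_c^{-\frac{1}{p+1}}\Big)^{p+1}\Big] \tag{51} \\
&= \frac{1}{p}\Big[1 - \sum_c n_c\, n_c^{-1} \Big(\frac{1}{C}\Big)^{p+1}\Big] \tag{52} \\
&= \frac{1}{p}\Big[1 - \sum_c \Big(\frac{1}{C}\Big)^{p+1}\Big] \tag{53}
\end{align}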

which is precisely the Tsallis entropy of the uniform distribution over C distinct clones.
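A small numerical check of the two clone remarks, again assuming the Tsallis-style form of $H^p_a$ used in (48). The clone group sizes and the test distributions below are hypothetical; the sketch only verifies that any distribution placing mass $1/C$ on each clone group attains the Tsallis entropy of the uniform distribution over $C$ items, regardless of how mass is split inside a group.

```python
def normalise_kernel(K, p):
    """U^(p): divide each column of K by its (p+1)-norm (powers elementwise)."""
    rows, cols = len(K), len(K[0])
    norms = [sum(K[i][j] ** (p + 1) for i in range(rows)) ** (1.0 / (p + 1))
             for j in range(cols)]
    return [[K[i][j] / norms[j] for j in range(cols)] for i in range(rows)]


def affinity_entropy(K, x, p):
    """Assumed form: (1/p) * (1 - sum_i (U_i x)^(p+1))."""
    U = normalise_kernel(K, p)
    y = [sum(U[i][j] * x[j] for j in range(len(x))) for i in range(len(U))]
    return (1.0 - sum(v ** (p + 1) for v in y)) / p


def clone_kernel(group_sizes):
    """Block-diagonal similarity matrix with one block of ones per clone group."""
    n = sum(group_sizes)
    K = [[0.0] * n for _ in range(n)]
    start = 0
    for size in group_sizes:
        for i in range(start, start + size):
            for j in range(start, start + size):
                K[i][j] = 1.0
        start += size
    return K


def tsallis_uniform(C, p):
    """Tsallis entropy of the uniform distribution over C distinct items."""
    return (1.0 - C * (1.0 / C) ** (p + 1)) / p


K = clone_kernel([1, 3, 2])                      # C = 3 clone groups
x_even = [1/3, 1/9, 1/9, 1/9, 1/6, 1/6]          # mass 1/3 per group, even split
x_skew = [1/3, 1/3, 0.0, 0.0, 0.05, 1/3 - 0.05]  # mass 1/3 per group, uneven split
x_off = [0.6, 0.1, 0.1, 0.1, 0.05, 0.05]         # group masses 0.6, 0.3, 0.1
```

The third distribution does not spread mass evenly across groups and should score strictly lower, consistent with the characterisation of the maximizers.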

C INTEGRALS OVER SIMPLEX

It is possible to derive a closed-form result for the dis-similarity kernel in (6) by appealing to known results of integrals of polynomial functions over the simplex.


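Let $T^d = \{(x_1, \ldots, x_d) : x_i \ge 0, \sum_{i=1}^d x_i \le 1\}$ be the standard simplex in $\mathbb{R}^d$. Let $\nu_i > 0$ for all $i$, then

$$\int_{T^d} x_1^{\nu_1 - 1} \cdots x_d^{\nu_d - 1} (1 - x_1 - \ldots - x_d)^{\nu_0 - 1}\, dx = \frac{\prod_{i=0}^d \Gamma(\nu_i)}{\Gamma\big(\sum_{i=0}^d \nu_i\big)}. \tag{54}$$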

Proposition C.1. From player $i$'s perspective, the expected dis-similarity between two actions $p$ and $q$ under a uniform distribution over all opponent joint strategy profiles $x_{-i}$ is equal to

$$D^{(i)}_{pq} = \frac{1}{(d_i + 1)(d_i + 2)}\Big[\big\|U^{(i)}_p - U^{(i)}_q\big\|^2 + \big(\mathbf{1}^\top (U^{(i)}_p - U^{(i)}_q)\big)^2\Big] \tag{55}$$

where $U^{(i)}$ is a $|A_i| \times |A_{-i}|$ matrix where each entry $U^{(i)}_{a_i, a_{-i}}$ is the expected utility for player $i$ playing action $a_i$ against the background joint action $a_{-i}$. $U^{(i)}_{a_i}$ indicates an entire row of the matrix. The integer $d_i = \prod \cdots$

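Proof. Recall (54) and $\Gamma(n) = (n-1)!$ for $n \in \mathbb{N}$. Let $r_p = \sum_w U_{pw} x_w$ be the rating for the $p$-th action under an opponent strategy profile $x_{-i}$.

Then we want to compute $\mathbb{E}_{x_{-i} \sim \mathrm{Dir}(1)}[(r_p - r_q)^2]$. Recall the volume of the simplex is $\frac{1}{d!}$.

\begin{align}
\mathbb{E}_{x_{-i} \sim \mathrm{Dir}(1)}[(r_p - r_q)^2] &= \frac{\int_{T^d} (r_p - r_q)^2\, dx_{-i}}{\int_{T^d} dx_{-i}} \tag{56} \\
&= d! \int_{T^d} (r_p - r_q)^2\, dx_{-i} \tag{57} \\
&= d! \int_{T^d} \Big(\sum_w U^{(i)}_{pw} x_w - \sum_w U^{(i)}_{qw} x_w\Big)^2 dx_{-i} \tag{58} \\
&= d! \int_{T^d} \Big[\Big(\sum_w \sum_y U^{(i)}_{pw} U^{(i)}_{py} x_w x_y\Big) + \Big(\sum_w \sum_y U^{(i)}_{qw} U^{(i)}_{qy} x_w x_y\Big) \\
&\qquad - 2\Big(\sum_w \sum_y U^{(i)}_{pw} U^{(i)}_{qy} x_w x_y\Big)\Big] dx_{-i} \tag{59--60} \\
&= d! \sum_w \sum_y \Big[U^{(i)}_{pw} U^{(i)}_{py} \underbrace{\int_{T^d} x_w x_y\, dx_{-i}}_{\frac{2}{(d+2)!} \text{ if } w = y \text{ else } \frac{1}{(d+2)!}} + U^{(i)}_{qw} U^{(i)}_{qy} \int_{T^d} x_w x_y\, dx_{-i} \\
&\qquad - 2 U^{(i)}_{pw} U^{(i)}_{qy} \int_{T^d} x_w x_y\, dx_{-i}\Big] \tag{61--62} \\
&= \frac{d!}{(d+2)!}\Big[\sum_w \big(U^{(i)2}_{pw} + U^{(i)2}_{qw} - 2 U^{(i)}_{pw} U^{(i)}_{qw}\big) \\
&\qquad + \sum_w \sum_y \big(U^{(i)}_{pw} U^{(i)}_{py} + U^{(i)}_{qw} U^{(i)}_{qy} - 2 U^{(i)}_{pw} U^{(i)}_{qy}\big)\Big] \tag{63--64} \\
&= \frac{1}{(d+1)(d+2)}\Big[\sum_w \big(U^{(i)}_{pw} - U^{(i)}_{qw}\big)^2 + \Big(\sum_w U^{(i)}_{pw} - \sum_w U^{(i)}_{qw}\Big)^2\Big] \tag{65} \\
&= \frac{1}{(d+1)(d+2)}\Big[\big\|U^{(i)}_p - U^{(i)}_q\big\|^2 + \big(\mathbf{1}^\top (U^{(i)}_p - U^{(i)}_q)\big)^2\Big]. \tag{66}
\end{align}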
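The closed form can be sanity-checked against direct numerical integration in the smallest non-trivial case. The sketch below assumes that $d$ is the dimension of $T^d$, i.e. one less than the number of opponent joint actions, as the Dirichlet integral (54) requires. The payoff rows are hypothetical. With two opponent joint actions, a uniformly distributed profile is $(t, 1 - t)$ with $t$ uniform on $[0, 1]$, so the expectation reduces to a one-dimensional integral.

```python
def expected_dissimilarity(U_p, U_q):
    """Closed form of (55)/(66); assumes d = (number of opponent joint actions) - 1."""
    d = len(U_p) - 1
    diff = [a - b for a, b in zip(U_p, U_q)]
    squared_norm = sum(v * v for v in diff)
    squared_sum = sum(diff) ** 2
    return (squared_norm + squared_sum) / ((d + 1) * (d + 2))


def quadrature_two_actions(U_p, U_q, n_grid=20000):
    """Midpoint-rule estimate of E[(r_p - r_q)^2] for opponent profile (t, 1 - t)."""
    a = U_p[0] - U_q[0]
    b = U_p[1] - U_q[1]
    total = 0.0
    for k in range(n_grid):
        t = (k + 0.5) / n_grid
        total += (a * t + b * (1.0 - t)) ** 2
    return total / n_grid


# Hypothetical expected-utility rows for two of player i's actions.
U_p = [1.0, -0.5]
U_q = [0.2, 0.3]
closed_form = expected_dissimilarity(U_p, U_q)
numerical = quadrature_two_actions(U_p, U_q)
```

For larger action sets the same closed form applies, whereas direct integration becomes expensive, which is the practical motivation for the proposition.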

Proposition C.2. From player $i$'s perspective, the expected dis-similarity between two actions $p$ and $q$ under a uniform distribution over all factorize-able opponent strategy profiles $x_{-i} = \prod_{j \ne i} x_j$ is equal to

$$D^{(i)}_{pq} = \Big(\prod_{j \ne i} \frac{1}{(d_j + 1)(d_j + 2)}\Big) \sum_{a_{-i}} \sum_{a'_{-i}} \big(u_i(p, a_{-i}) - u_i(q, a_{-i})\big)\big(u_i(p, a'_{-i}) - u_i(q, a'_{-i})\big)\, 2^{\#_{a = a'}} \tag{67--68}$$

where the integer $d_i = |A_i|$ and $\#_{a = a'} = \sum_{j \ne i} \mathbb{1}[a_j = a'_j]$ indicates the number of action matches between two opponent profiles.

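Proof. Let $r_p = \sum_{a_{-i} \in A_{-i}} u_i(p, a_{-i}) \prod_{j \ne i} x_{j, a_j}$ be the rating for the $p$-th action under an opponent profile $x_{-i} = \prod_{j \ne i} x_j$. Let $dx_{-i}$ be a shorthand for $\prod_{j \ne i} dx_j$. Likewise, let $\int_{T^{d_{-i}}}$ be a shorthand for the iterated integral $\int_{T^{d_1}} \cdots \int_{T^{d_{i-1}}} \int_{T^{d_{i+1}}} \cdots$ over every opponent's simplex.

Then we want to compute $\mathbb{E}_{x_j \sim \mathrm{Dir}(1)\, \forall j \ne i}[(r_p - r_q)^2]$. Recall the volume of a simplex is $\frac{1}{d!}$. Writing $\delta(a_{-i}) = u_i(p, a_{-i}) - u_i(q, a_{-i})$ for brevity,

\begin{align}
&\mathbb{E}_{x_j \sim \mathrm{Dir}(1)\, \forall j \ne i}[(r_p - r_q)^2] \tag{69} \\
&= \frac{\int_{T^{d_{-i}}} (r_p - r_q)^2\, dx_{-i}}{\int_{T^{d_{-i}}} dx_{-i}} \tag{70} \\
&= \Big(\prod_{j \ne i} d_j!\Big) \int_{T^{d_{-i}}} (r_p - r_q)^2\, dx_{-i} \tag{71} \\
&= \Big(\prod_{j \ne i} d_j!\Big) \int_{T^{d_{-i}}} \Big(\sum_{a_{-i} \in A_{-i}} u_i(p, a_{-i}) \prod_{j \ne i} x_{j, a_j} - \sum_{a_{-i} \in A_{-i}} u_i(q, a_{-i}) \prod_{j \ne i} x_{j, a_j}\Big)^2 dx_{-i} \tag{72} \\
&= \Big(\prod_{j \ne i} d_j!\Big) \int_{T^{d_{-i}}} \Big(\sum_{a_{-i} \in A_{-i}} \prod_{j \ne i} x_{j, a_j}\, \delta(a_{-i})\Big)^2 dx_{-i} \tag{73} \\
&= \Big(\prod_{j \ne i} d_j!\Big) \int_{T^{d_{-i}}} \sum_{a_{-i} \in A_{-i}} \sum_{a'_{-i} \in A_{-i}} \Big(\prod_{j \ne i} x_{j, a_j}\Big)\Big(\prod_{j \ne i} x_{j, a'_j}\Big)\, \delta(a_{-i})\, \delta(a'_{-i})\, dx_{-i} \tag{74--75} \\
&= \Big(\prod_{j \ne i} d_j!\Big) \int_{T^{d_{-i}}} \sum_{a_{-i}} \sum_{a'_{-i}} \delta(a_{-i})\, \delta(a'_{-i}) \prod_{j \ne i} x_{j, a_j} x_{j, a'_j}\, dx_{-i} \tag{76--77} \\
&= \Big(\prod_{j \ne i} d_j!\Big) \sum_{a_{-i}} \sum_{a'_{-i}} \delta(a_{-i})\, \delta(a'_{-i}) \prod_{j \ne i} \underbrace{\int_{T^{d_j}} x_{j, a_j} x_{j, a'_j}\, dx_j}_{\frac{2}{(d_j + 2)!} \text{ if } a_j = a'_j \text{ else } \frac{1}{(d_j + 2)!}} \tag{78--79} \\
&= \Big(\prod_{j \ne i} d_j! \Big/ \prod_{j \ne i} (d_j + 2)!\Big) \sum_{a_{-i}} \sum_{a'_{-i}} \delta(a_{-i})\, \delta(a'_{-i})\, 2^{\#_{a = a'}} \tag{80--81} \\
&= \Big(\prod_{j \ne i} \frac{1}{(d_j + 1)(d_j + 2)}\Big) \sum_{a_{-i}} \sum_{a'_{-i}} \delta(a_{-i})\, \delta(a'_{-i})\, 2^{\#_{a = a'}}. \tag{82--83}
\end{align}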

[Figure 6 plot: P1 ratings, and contribution by action to P1 ratings, for the player 2 actions Paper, Rock-1, Rock-2 and Scissors.]

Figure 6: We visualise the marginal NE rating contributions of each player 2 action to each player 1 action. We show that a) all actions receive zero ratings and b) the rating of each action is interpretable and corresponds to our intuition.

D WARMUP: GAME-THEORETIC RANKING OF ROCK-PAPER-SCISSORS

We provide a demonstration of game-theoretic ranking on the classic 2-player, 3-action zero-sum Rock-Paper-Scissors game. Balduzzi et al. (2018) proposed rating actions under the max-entropy Nash equilibrium of the game. In that case, each action receives a rating of zero. If we duplicate the Rock action, for example, the ratings remain zero under the max-entropy NE. Our proposed LLE-based approach returns the same ratings.

| | Rock | Paper | Scissors |
|---|---|---|---|
| Rock | 0, 0 | −1, 1 | 1, −1 |
| Paper | 1, −1 | 0, 0 | −1, 1 |
| Scissors | −1, 1 | 1, −1 | 0, 0 |

| | Rock1 | Rock2 | Paper | Scissors |
|---|---|---|---|---|
| Rock1 | 0, 0 | 0, 0 | −1, 1 | 1, −1 |
| Rock2 | 0, 0 | 0, 0 | −1, 1 | 1, −1 |
| Paper | 1, −1 | 1, −1 | 0, 0 | −1, 1 |
| Scissors | −1, 1 | −1, 1 | 1, −1 | 0, 0 |

Figure 7: Rock-Paper-Scissors (RPS) Game and RPS Game with Duplicate Rock Action.

In Figure 6, we show that the equilibrium underlying the scalar ratings reflects the incentive structure of the game: player 1 does not wish to deviate to the Paper action precisely because doing so would lead to losses against the Scissors action despite wins against the two Rock actions.
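The zero ratings can be reproduced with a few lines of exact arithmetic. The sketch below treats an action's rating as its expected payoff against the opponent's equilibrium strategy, which is a simplifying assumption about the rating definition. The payoff matrices use the standard win = 1, loss = −1 convention, and the function name is hypothetical.

```python
from fractions import Fraction as F

# Row player's payoffs; columns are the opponent's actions.
RPS = [[0, -1, 1],     # Rock     vs Rock, Paper, Scissors
       [1, 0, -1],     # Paper
       [-1, 1, 0]]     # Scissors

RPS_DUP = [[0, 0, -1, 1],    # Rock1    vs Rock1, Rock2, Paper, Scissors
           [0, 0, -1, 1],    # Rock2
           [1, 1, 0, -1],    # Paper
           [-1, -1, 1, 0]]   # Scissors


def action_ratings(payoff, opponent_strategy):
    """Expected payoff of each row action against a mixed opponent strategy."""
    return [sum(F(u) * q for u, q in zip(row, opponent_strategy)) for row in payoff]


uniform = [F(1, 3)] * 3
ratings = action_ratings(RPS, uniform)

# With a duplicated Rock, the equilibrium keeps total mass 1/3 on Rock.
even_split = [F(1, 6), F(1, 6), F(1, 3), F(1, 3)]
uneven_split = [F(1, 12), F(1, 4), F(1, 3), F(1, 3)]
ratings_even = action_ratings(RPS_DUP, even_split)
ratings_uneven = action_ratings(RPS_DUP, uneven_split)
```

Both splits of the Rock mass give the same ratings, which illustrates why cloning an action leaves the ratings unchanged in this zero-sum game.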

E VULNERABILITY OF STANDARD SHANNON ENTROPY

Prior work has shown max-entropy Nash equilibrium (equivalently max-entropy (C)CE) to be invariant to clones in 2-player zero-sum games (Balduzzi et al., 2018). We include a simple experiment here to illustrate why max-entropy Nash equilibrium becomes vulnerable to redundancy in the N-player general-sum setting.

Chicken Game. Consider the 2-player 2-action general-sum Chicken game. Let players receive 0 if they both swerve. If one player swerves while the other goes straight, the one who swerves receives −1 and the other 1. If both go straight, then they both receive −12. This game has three NEs. Two are pure, in which one player goes straight and the other swerves. The third is symmetric and is the max-entropy NE of this game; each player swerves with probability 11/12. Both straight and swerve have an expected payoff of −1/12 under this NE. If we duplicate the straight action, the original max-entropy NE becomes the min-entropy NE! The other two NEs, representing each player swerving while the other goes straight, now have higher entropy. The player that swerves rates their swerve and straight actions as −1 and −12 respectively. The player that goes straight rates their swerve and straight actions as 0 and 1 respectively, demonstrating that the max-entropy NE solution concept is not invariant to clones in the general-sum setting.
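The numbers quoted for the Chicken game can be verified with exact fractions. The sketch below assumes the payoff signs implied by the description (swerving against a straight driver costs 1, a crash costs 12) and, as a simplification, treats an action's rating as its expected payoff against the co-player's equilibrium strategy. The function name is hypothetical.

```python
from fractions import Fraction as F

# Row player's payoffs; index 0 = swerve, index 1 = straight.
CHICKEN = [[F(0), F(-1)],     # swerve   vs (swerve, straight)
           [F(1), F(-12)]]    # straight vs (swerve, straight)


def expected_payoffs(payoff, opponent_strategy):
    """Expected payoff of each row action against a mixed opponent strategy."""
    return [sum(u * q for u, q in zip(row, opponent_strategy)) for row in payoff]


# Symmetric mixed NE: each player swerves with probability 11/12.
mixed = [F(11, 12), F(1, 12)]
mixed_ratings = expected_payoffs(CHICKEN, mixed)

# Pure NEs: one player swerves while the other goes straight.
swerver_ratings = expected_payoffs(CHICKEN, [F(0), F(1)])   # co-player goes straight
straight_ratings = expected_payoffs(CHICKEN, [F(1), F(0)])  # co-player swerves
```

Both actions earn the same expected payoff under the mixed strategy, which is the indifference condition that makes it an equilibrium; the pure equilibria yield the asymmetric ratings described above.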

The story in the max-entropy CCE setting is more nuanced. We find that although the CCE ratings change under the addition of clones, the ratio of the ratings of the two actions remains stable. Further investigation is necessary to understand whether max-entropy CCE ratings are equivariant (robust up to affine transformations of the ratings) to cloned actions.

By contrast, we show in Figure 9 that all actions would receive zero ratings under our proposed equilibrium ratings. In other words, our equilibrium selection procedure continues to select the mixed-strategy NE in the original game, unaffected by the additional redundant straight action. Further, the widths of the bars are interpretable: deviating to the Swerve action is a safe option without major risk or reward, whereas deviating to one of the Straight actions can lead to high rewards but also catastrophic losses.

| | Swerve | Straight |
|---|---|---|
| Swerve | 0, 0 | −1, 1 |
| Straight | 1, −1 | −12, −12 |

| | Swerve | Straight | Straight |
|---|---|---|---|
| Swerve | 0, 0 | −1, 1 | −1, 1 |
| Straight | 1, −1 | −12, −12 | −12, −12 |
| Straight | 1, −1 | −12, −12 | −12, −12 |

Figure 8: Chicken Game and Chicken Game with Duplicate Straight Actions.

[Figure 9 plot: P1 ratings, and contribution by action to P1 ratings, for the player 2 actions Straight-1, Straight-2 and Swerve.]

Figure 9: We visualise the marginal NE rating contributions of each player 2 action to each player 1 action. We show that a) all actions receive zero ratings and b) the rating of each action is interpretable and corresponds to our intuition.

F EXPERIMENTS

F.1 SIMULATED MODEL AND PROMPT IMPROVEMENT PATH

Algorithm 1 describes our simulated model and prompt improvement procedure. At each iteration, we add a new prompt and a model following an evolutionary procedure. We require all prompts to be probability distributions over skill dimensions. We model a transitive dimension for models by representing each model vector as a sum of probability vectors over skills. A new model is added to the set of models $A_m$ if and only if it becomes top-ranked according to the rating function $r$. A new prompt is added as long as it is the best of the $P'$ sampled prompts; it does not have to be top-ranked.

F.2 EQUILIBRIUM-SOLVING HYPER-PARAMETERS

We use the same set of hyper-parameters for all our experiments. For affinity-entropy $H^p_a(x)$, we use $p = 1$ and set the kernel variance to 1e-6. To solve for a max affinity-entropy distribution we use gradient descent. The max affinity-entropy distribution is then used in NE and CCE solving.

For NE solving using LLE approximation, we initialize the temperature to $\tau = 1.0$, which is annealed exponentially with a decay rate of 0.95 every 250 gradient updates if and only if the exploitability in the annealed game $L^\tau(x)$ (Equation (4)) is at most 1e-5. We set the terminal temperature to $\tau = $ 1e-2. We terminate the equilibrium solving early if we have found an $\epsilon$-NE with $\epsilon = $ 1e-3. For CCE solving, the optimization problem is convex and we minimize Equation 8 directly. For gradient descent, we use an Adam optimizer (Kingma, 2014) with a fixed learning rate of 1e-2 for all steps (maximizing affinity-entropy and equilibrium solving).
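One way to read the annealing rule is sketched below. The function names are hypothetical, and two details are assumptions rather than statements from the text: the decay is applied only on update counts that are multiples of 250, and the terminal temperature acts as a lower bound on $\tau$.

```python
def anneal_temperature(tau, step, exploitability, decay=0.95, period=250,
                       exploit_threshold=1e-5, tau_min=1e-2):
    """Return the temperature to use after gradient update number `step`."""
    # Decay only at the end of each period, and only if the annealed game
    # has been solved to within the exploitability threshold.
    if step % period == 0 and exploitability <= exploit_threshold:
        tau = max(tau * decay, tau_min)
    return tau


def should_stop(nash_gap, epsilon=1e-3):
    """Early-termination test: an epsilon-NE of the original game has been found."""
    return nash_gap <= epsilon


# Simulate 100 periods in which the exploitability condition always holds.
tau = 1.0
history = []
for step in range(1, 250 * 100 + 1):
    tau = anneal_temperature(tau, step, exploitability=0.0)
    if step % 250 == 0:
        history.append(tau)

# If the annealed game is not yet solved well enough, the temperature is held.
held = anneal_temperature(0.5, 250, exploitability=1e-3)
```

Under these assumptions the temperature reaches its terminal value after 90 successful decays, since $0.95^{90} < 0.01 < 0.95^{89}$.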

F.3 THE arena-hard-v0.1 EVALUATION DATA

We evaluate our method on the arena-hard-v0.1 dataset (Li et al., 2024b) with 500 prompts and 17 competing models. The set of prompts as well as model responses are downloaded from the LMSYS data repository (https://huggingface.co/spaces/lmsys/arena-hard-browser), with the exception of gemini-1.5-pro-api-0514 and gemini-1.5-flash-api-0514. As we need to tabulate the payoff tensor for all model pairs, we sampled 8 preference ratings using gemini-1.5-pro-api-0514 for each model pair, with 4 samples for each permutation to account for potential position bias of the LLM rater. Pairwise model utility is averaged over all rating samples.
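The averaging over both presentation orders can be illustrated as follows. The score encoding is an assumption made for this sketch only: each rating is taken to be a number in [−1, 1] that is positive when the rater prefers the response shown first. The function name is hypothetical.

```python
def pairwise_utility(ratings_a_first, ratings_b_first):
    """Average utility of model A against model B over both presentation orders.

    ratings_a_first: scores with A's response shown first (positive favours A).
    ratings_b_first: scores with B's response shown first (positive favours B).
    """
    # Re-express every score from A's point of view before averaging.
    from_a = list(ratings_a_first) + [-r for r in ratings_b_first]
    return sum(from_a) / len(from_a)


# A rater with a pure position bias and no quality preference:
# it always favours whichever response is shown first by 0.5.
biased_only = pairwise_utility([0.5] * 4, [0.5] * 4)

# A rater that genuinely prefers A by 0.4 on top of a 0.2 position bias.
a_better = pairwise_utility([0.6] * 4, [-0.2] * 4)
```

Sampling both permutations equally often makes a constant position bias cancel in the average, which is the purpose of the 4 + 4 split described above.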


Algorithm 1 Evolutionary model and prompt selection procedure

Let $K$ be the number of orthogonal skill dimensions.
Let $r : A_p \times A_m \to r_p, r_m$ be a rating function assigning a scalar rating to each action.
Let $P_0, M_0$ be the number of initial prompts and models.
Let $P', M'$ be the number of sampled candidate prompts and models at each iteration.

$A^0_p \sim \mathrm{Dirichlet}(\mathbf{1}_{1:K}, P_0)$ ▷ $P_0$ sampled initial prompts.
$A^0_m \sim \mathrm{Dirichlet}(\mathbf{1}_{1:K}, M_0)$ ▷ $M_0$ sampled initial models.

for $t \in [1, \ldots]$ do
    if additional prompts then ▷ If adding new prompts.
        $A'_p \sim \mathrm{Dirichlet}(\mathbf{1}_{1:K}, P')$ ▷ Sampling $P'$ candidate prompts.
        $r_p, \_ \leftarrow r(A'_p \cup A_p, A_m)$
        $A_p \leftarrow A_p \cup \{A'_p[\arg\max r_p[:P']]\}$ ▷ Add best-of-$P'$ prompt.
    end if
    $A'_m \leftarrow 0$
    while true do
        $m \leftarrow \mathrm{Dirichlet}(\mathbf{1}_{1:K}, M')$ ▷ Sampling $M'$ model improvement vectors.
        $\_, r_m \leftarrow r(A_p, \{A'_m + m\} \cup A_m)$ ▷ Evaluate improved candidate models.
        $A'_m \leftarrow A'_m + m[\arg\max r_m[:M']]$
        if $\arg\max r_m[:M'] = \arg\max r_m$ then
            $A_m \leftarrow \{A'_m[\arg\max r_m]\} \cup A_m$ ▷ Add a new top-ranked model.
            break
        end if
    end while
end for
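A runnable sketch of the loop structure of Algorithm 1 is given below. It is not the procedure used in the experiments: the rating function `rate` is a hypothetical stand-in (average dot-product score) for the equilibrium rating function $r$, the prompt branch is executed at every iteration, and all names and default sizes are illustrative assumptions.

```python
import random


def sample_dirichlet(k, rng):
    """Draw one sample from a flat Dirichlet over k skill dimensions."""
    g = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(g)
    return [v / total for v in g]


def score(prompt, model):
    """Hypothetical payoff: how well a model's skills cover a prompt."""
    return sum(p * m for p, m in zip(prompt, model))


def rate(prompts, models):
    """Stand-in for r: mean score per model, negated mean score per prompt."""
    r_m = [sum(score(p, m) for p in prompts) / len(prompts) for m in models]
    r_p = [-sum(score(p, m) for m in models) / len(models) for p in prompts]
    return r_p, r_m


def evolve(K=3, P0=2, M0=2, P1=4, M1=4, iterations=5, seed=0):
    rng = random.Random(seed)
    prompts = [sample_dirichlet(K, rng) for _ in range(P0)]
    models = [sample_dirichlet(K, rng) for _ in range(M0)]
    for _ in range(iterations):
        # Add the best of P1 candidate prompts (it need not be top-ranked).
        candidates = [sample_dirichlet(K, rng) for _ in range(P1)]
        r_p, _ = rate(candidates + prompts, models)
        prompts.append(candidates[max(range(P1), key=lambda i: r_p[i])])
        # Accumulate improvement vectors until a candidate is top-ranked.
        base = [0.0] * K
        while True:
            deltas = [sample_dirichlet(K, rng) for _ in range(M1)]
            improved = [[b + d for b, d in zip(base, delta)] for delta in deltas]
            _, r_m = rate(prompts, improved + models)
            best = max(range(M1), key=lambda i: r_m[i])
            base = improved[best]
            if r_m[best] == max(r_m):
                models.append(base)
                break
    return prompts, models


prompts, models = evolve()
```

Because each accepted improvement adds a probability vector to the candidate, its stand-in rating grows steadily and the inner loop terminates once it overtakes every existing model.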

Table 1: Prompt and king actions that each define 16 pure-strategy Nash equilibria: any rebel action except the model played by the king player is a pure-strategy NE.

| Prompt | King |
|---|---|
| Can you implement a python tool that is intended to ru... | gemini-1.5-pro-api-0514 |
| Hi. I have this URL which I can paste in my Microsoft ... | gemini-1.5-pro-api-0514 |
| Please provide a simple RESPONSE to the following PROM... | claude-3-5-sonnet-20240620 |
| Take on the rol eof an Gherkin expert. Can you improve... | claude-3-5-sonnet-20240620 |
| Write a small python function that get all the links o... | gemini-1.5-flash-api-0514 |

F.4 RISK-DOMINANT EQUILIBRIA

Our king-of-the-hill evaluation game admits a multitude of Nash equilibria, 80 of which are pure-strategy NEs (see Table 1). Additionally, we computed 128 mixed-strategy NEs with exploitability at most 1e-2 that each derive a distinct set of ratings. In particular, one of the 128 mixed-strategy NEs is pre-computed by our NE solving and selection procedure by tracing the QRE continuum, which we refer to as the 0-th equilibrium, or $x^0$.

A longstanding challenge in game theory is that of equilibrium selection. Suppose that every player knows that there are many equilibria in the game. Each player must then confront the following question during play: out of all equilibria, which equilibrium strategy should I play and, relatedly, which equilibrium would each of my co-players play? This is critical, as miscoordinating could lead to arbitrarily bad outcomes, despite each player playing one of its equilibrium strategies. For instance, everyone driving on the right or everyone driving on the left are two valid equilibria, but miscoordinating would be devastating.

It is for this reason that the notion of risk-dominance of Harsanyi & Selten (1988) is critically important: the Nobel-prize winning theorem suggests that players would each iterate on their prior beliefs over which equilibria their co-players would play and choose the one that is the least risky when players miscoordinate under such priors. Here, we show empirically that our solution concept leads to risk-dominant equilibria as suggested by Herings & Peeters (2010). To do so, we simultaneously minimize the exploitability of several profiles in parallel with a regularizer that maximizes the $L_2$ rating differences between any two profiles by gradient descent as in Liu et al. (2024). This yields an additional 127 NEs with exploitability at most 1e-2 that we analyze in Figure 10.

Figure 10 (Top) shows the 128 mixed-strategy NEs with distinct model ratings. Figure 10 (Center) shows the expected payoffs to player $i$ when it plays its $p$-th equilibrium strategy $x^p_i$ while the other players uniformly choose one of theirs, or $\mathbb{E}_{q \sim \pi_u}[u_i(x^p_i, x^q_{-i})]$ with $\pi_u$ a uniform distribution over the 128 equilibria. In yellow, we show the sum of per-player expected payoffs. We confirm that many NEs are indeed risky, as their stability relies heavily on all players coordinating on the same equilibrium. Figure 10 (Bottom) takes things one step further and follows the intuition of risk dominance more closely. Starting from a uniform prior belief over player $i$'s choice of equilibria, $\pi^0_i = \pi_u$, each player iterates their beliefs over other players' choices of equilibrium based on the expected payoff of them playing each equilibrium.

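Specifically, we let

$$\pi^{t+1}_i = \operatorname{softmax}\Big(\log \pi^t_i + \eta\, \mathbb{E}_{\forall j \ne i,\; t(j) \sim \pi_j}\Big[u_i\big(\ldots, x^{t(i-1)}_{i-1}, x^{t(i+1)}_{i+1}, \ldots\big)\Big]\Big) \tag{84}$$

with $\eta = $ 1e-2 the step-size and we compute the expected payoffs to player $i$ when playing its $k$-th equilibrium at $T = 10{,}000$ as

$$\mathbb{E}_{\forall j \ne i,\; e(j) \sim \pi^T_j}\Big[u_i\big(\ldots, x^{e(i-1)}_{i-1}, x^k_i, x^{e(i+1)}_{i+1}, \ldots\big)\Big] \tag{85}$$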

Ordered by the sum of expected payoffs for all players, we observe that the Nash equilibrium our procedure selects (equilibrium x0) is the least risky among 128 mixed-strategy NEs of the game, without any player being particularly worse off than others even when players miscoordinate.
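The belief iteration can be illustrated on a toy symmetric two-player game. The sketch below is not the king-of-the-hill game: it uses a stag-hunt payoff table, chosen here only because it has two equilibria with a well-known risk-dominant one, and it tracks a single shared belief, which is valid by symmetry. The update keeps unnormalised log-beliefs, which is equivalent to $\operatorname{softmax}(\log \pi^t + \eta\, \mathbb{E}[u])$. All names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def iterate_beliefs(payoff, eta=1e-2, steps=10_000):
    """payoff[k][e]: payoff to a player using equilibrium k while the co-player uses e."""
    n = len(payoff)
    logits = [0.0] * n  # uniform prior over equilibria
    for _ in range(steps):
        belief = softmax(logits)
        # Expected payoff of committing to each equilibrium under current beliefs.
        expected = [sum(payoff[k][e] * belief[e] for e in range(n)) for k in range(n)]
        logits = [l + eta * u for l, u in zip(logits, expected)]
    return softmax(logits)


# Equilibrium 0: both hunt stag (payoff 4, but 0 if the co-player defects to hare).
# Equilibrium 1: both hunt hare (payoff 3 regardless of the co-player).
STAG_HUNT = [[4.0, 0.0],
             [3.0, 3.0]]
final_belief = iterate_beliefs(STAG_HUNT)
```

Starting from uniform beliefs, mass moves to the hare equilibrium, which pays less when coordinated but is safer under miscoordination; this is the sense of "least risky" used above.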

F.5 INVARIANT EVALUATION

We show in Figure 11 the effect of introducing near-redundant adversarial prompts on the equilibrium ratings. While our invariance property is limited to exact clones, our results show that our approach yields rankings that degrade gracefully in this approximate case, even with 1,000 adversarial prompts. The Elo rating system suffers from such bias in the data, much as in the exact case (Figure 3).

In Figure 12 we provide a detailed breakdown of our NE and CCE ratings results (without redundant adversarial prompts). We show the actions of each player ranked by their equilibrium ratings and by their support under the equilibrium marginal distribution.

[Figure 10 plots, x-axis "equilibrium" in all panels. Top: "King model ratings under 128 example mixed-strategy NE (exploitability <= 1e-2)", y-axis "Rating", one series per evaluated model. Center: "Expected per-player payoff under uniform coordination priors over 128 Nash equilibria", y-axis "Expected payoffs". Bottom: "Expected per-player payoff under optimized coordination priors over 128 Nash equilibria", y-axis "Expected payoffs".]

Figure 10: From top to bottom: a) we show the distinct king player action ratings derived from 128 mixed-strategy NEs of the king-of-the-hill game. All NEs have exploitability at most $\epsilon \le $ 1e-2; b) we show the expected payoff to each player under uniform priors over their 128 equilibria; yellow circles show the sum of expected per-player payoffs; c) we show the same analysis as in b) but the expectation is taken under optimized equilibrium priors. Equilibrium 0 (rightmost) is the LLE our NE solving procedure selects.

[Figure 11 plots: model rankings under "CCE Ratings" and "Elo Ratings" as the number of redundant adversarial prompts increases, one series per evaluated model.]

Figure 11: We introduce an increasing number of redundant copies of prompts adversarial to gemini-1.5-pro-api-0514 with noise sampled from Uniform(−0.01, 0.01) applied to their payoffs. Equilibrium ratings with a clone-invariant selection procedure degrade gracefully under noisy redundancy while the Elo ratings become incrementally skewed. Models at the same rank (with an absolute rating difference of at most 1e-4) are grouped in grey and ordered alphabetically. We caveat that the specific rankings reported are subject to the LLM preference model used, which in this case may exhibit a self-preference for the Gemini family of models.

[Figure 12 plots: per-player bar charts of "ratings" and "marginals" over prompts and models, in two groups titled "Max affinity-entropy NE ratings and marginals" and "Max affinity-entropy CCE ratings and marginals".]

Figure 12: We show actions of each player ranked by their rating and equilibrium support under NE (Top) and CCE (Bottom) profiles respectively.