# Risk Aware Benchmarking of Large Language Models

Apoorva Nitsure¹, Youssef Mroueh¹, Mattia Rigotti¹, Kristjan Greenewald¹ ², Brian Belgodere¹, Mikhail Yurochkin¹ ², Jiri Navratil¹, Igor Melnyk¹, Jarret Ross¹

¹IBM Research  ²MIT-IBM Watson AI Lab

Abstract

We propose a distributional framework for benchmarking socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative testing based on first- and second-order stochastic dominance of real random variables. We show that the second-order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a metrics portfolio for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. The statistical significance of our tests is backed theoretically by an asymptotic analysis via central limit theorems, instantiated in practice via a bootstrap variance estimate. We use our framework to compare various large language models with regard to risks related to drifting from instructions and outputting toxic content.

1. Introduction

Foundation models such as large language models (LLMs) have shown remarkable capabilities, redefining the field of artificial intelligence. At the same time, they present pressing and challenging socio-technical risks regarding the trustworthiness of their outputs and their alignment with human values and ethics (Bommasani et al., 2021). Evaluating LLMs is therefore a multi-dimensional problem, where those risks are benchmarked across diverse tasks and domains (Chang et al., 2023).
Correspondence to: Apoorva Nitsure, Youssef Mroueh. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

In order to quantify these risks, (Liang et al., 2022; Wang et al., 2023; Huang et al., 2023; Sun et al., 2024) proposed benchmarks of automatic metrics for probing the trustworthiness of LLMs. These metrics include accuracy, robustness, fairness, toxicity of the outputs, etc. Human evaluation benchmarks can be even more nuanced, and are often employed when tasks surpass the scope of standard metrics. Notable benchmarks based on human and automatic evaluations include, among others, Chatbot Arena (Zheng et al., 2023), HELM (Bommasani et al., 2023), MosaicML's Eval, Open LLM Leaderboard (Wolf, 2023), and BIG-bench (Srivastava et al., 2022), each catering to specific evaluation areas such as chatbot performance, knowledge assessment, and domain-specific challenges. Traditional metrics, however, sometimes do not correlate well with human judgments. Aiming for a better alignment with human judgments, some approaches utilize ChatGPT/GPT-4 for natural language generation evaluations (Liu et al., 2023; Zhang et al., 2023; Hada et al., 2023).

A comprehensive evaluation of LLMs requires addressing the following critical considerations:

1. Interpretability. Evaluation of foundation models is multi-dimensional in nature, and multiple metrics benchmark the models on different socio-technical dimensions that probe the trustworthiness of their outputs and their adherence to shared values and ethics. It is critical to establish an aggregate-level measure to facilitate the interpretation and effective communication of the evaluation results.

2. Risk Aware Benchmarking. In natural language (and other) applications, metrics quantify important guardrails such as a model's toxicity, safety, or robustness.
Therefore, a comprehensive evaluation framework must incorporate risk-aware benchmarking. This entails ranking models based on the assessment of failure modes and tail statistics¹, providing a nuanced understanding of potential pitfalls.

¹ I.e., understanding and quantifying low-probability, high-risk events.

Figure 1: (a) Quantiles, (b) Tail Value at Risk (TVaR), of the metrics portfolio of an LLM, showing that TVaR (second-order stochastic dominance) more clearly ranks the models than the quantiles alone (first-order stochastic dominance). (c) Ranking of models using Relative First and Second Stochastic Dominance of portfolios (R-FSD, R-SSD @P) versus ranking of models using Relative First and Second Stochastic Dominance of ChatGPT evaluation scores, and ranking by Mean Win Rate (MWR) on the metrics portfolio. The portfolio in this plot uses an independent copula aggregation. Note that (1) the metrics portfolio successfully approximates the ChatGPT evaluation, since the @P rankings largely agree with the @ChatGPT rankings; (2) the R-SSD rankings outperform the MWR baseline.

3. Statistical Significance. Evaluating machine learning models is intimately connected to statistical significance testing (SST), although this framework is still underutilized: (Dror et al., 2018) reports that almost 50% of ACL papers miss SST indicators. With the ever-increasing parametric complexity of LLMs, obtaining reliable SST in evaluating foundation models becomes ever more urgent.

We propose in this paper an evaluation framework that offers a principled solution and an efficient implementation addressing each of these challenges. Our main contributions are:

1. Interpretable Metrics-Portfolio (Section 4). Drawing inspiration from econometrics and mathematical finance, we define a metrics-portfolio for aggregating metrics.
This portfolio uses the notion of copula to normalize and aggregate metrics, yielding a single interpretable number assessing each output of an LLM. A higher value of the portfolio is preferable. We illustrate in Figure 1, panels (a) and (b), summary statistics of the metrics portfolio aggregating a total of 8 automatic metrics computed using 5K samples from the MixInstruct dataset (Jiang et al., 2023). In panel (c) we show that model ranking based on our metrics-portfolio aligns with human evaluation proxies such as ChatGPT (please refer to Appendix B for details of how the ChatGPT score is computed).

2. Risk Aware Benchmarking via Second Order Stochastic Dominance (Section 2). Stochastic orders define partial orders on random variables and play a vital role in econometrics and mathematical finance for comparing and selecting portfolios. We propose using stochastic orders to select LLMs based on their metrics portfolios. A portfolio dominates in the First Order Stochastic Dominance (FSD) sense if it has higher quantiles for all percentiles. However, in Figure 1 (panel (a)), the quantiles of the metrics-portfolios of the LLMs don't provide a clear ordering. Instead, we propose the use of Second Order Stochastic Dominance (SSD), where a portfolio dominates if it has higher Tail Values at Risk (TVaR, also known as Conditional Value at Risk) for all percentiles. TVaR, illustrated in Figure 1 (panel (b)), represents normalized integrated quantiles, assessing the risks of low values in the portfolio. A small TVaR corresponds to a fat left tail in the distribution of the portfolio, identifying risky LLMs as those with the lowest TVaR. For example, Flan-T5 emerges as the riskiest model in our running example.

3. Statistical Significance via Dominance Tests (Section 3). Armed with these notions of stochastic dominance, we define statistics that benchmark the relative dominance of a model's portfolio over another's (R-FSD and R-SSD in panel (c) of Figure 1).
We subject these statistics to an asymptotic analysis, proving central limit theorems that provide the foundation for hypothesis testing with false discovery rate control. We then perform stochastic dominance hypothesis tests between all pairs of models. Having adjusted the confidence level of these tests, we aggregate the pairwise rankings into a single rank via rank aggregation techniques such as the Borda algorithm (de Borda, 1781). The resulting ranks, depicted in panel (c) of Figure 1, highlight that the portfolio of automatic metrics (@P) leads to a similar ranking to the ChatGPT score (@ChatGPT) for both the first and second stochastic orders. To underscore the importance of risk-aware benchmarking, we present the ranking of the metrics-portfolio produced by the ubiquitous Mean Win Rate (MWR) used in LLM benchmarks (Liang et al., 2022) (last column in panel (c)). Flan-T5 ranks close to last with all other orders, but ranks 6 with MWR. This highlights that MWR is risky for ranking LLMs, as it does not take into account failure modes of the model, and we caution practitioners of its pitfalls.

2. Stochastic Dominance

We first review notions of stochastic dominance and their relation to downside risk measures and risk-averse preference modeling. We use the notation of the seminal paper of (Ogryczak & Ruszczynski, 2002), and assume that the random variables are standardized so that larger outcomes are preferable. Throughout this section, the reader can think of the random variable X as a metric evaluating the performance of model A on a specific test set. Likewise, Y represents the evaluation of model B. We defer the definition of the metrics portfolio to Section 4. In a multi-metric evaluation, as explained in the introduction, X and Y represent portfolios of evaluations of models A and B respectively.
2.1. First and Second Order Dominance and Mean-Risk Models

First Order Stochastic Dominance. The first-order stochastic dominance (FSD) between real-valued random variables uses the right-continuous cumulative distribution function (CDF) as a performance function. Specifically, for a real random variable $X$, define the first performance function $F^{(1)}_X : \mathbb{R} \to [0,1]$ as the CDF: $F^{(1)}_X(\eta) = \mathbb{P}(X \le \eta)$, $\eta \in \mathbb{R}$. The FSD of $X$ over $Y$ is defined as follows:

$$X \succeq_{\mathrm{FSD}} Y \iff F^{(1)}_X(\eta) \le F^{(1)}_Y(\eta), \ \forall \eta \in \mathbb{R}, \quad (1)$$

which intuitively means that for all outcomes $\eta$, the probability of observing an outcome smaller than $\eta$ is lower for $X$ than for $Y$. An equivalent definition can be expressed using the quantile function $F^{(-1)}_X$ (see e.g. (Ogryczak & Ruszczynski, 2002)):

$$X \succeq_{\mathrm{FSD}} Y \iff F^{(-1)}_X(p) \ge F^{(-1)}_Y(p), \ \forall p \in (0,1], \quad (2)$$

where $F^{(-1)}_X : [0,1] \to \mathbb{R}$ is the left-continuous inverse of $F^{(1)}_X$: $F^{(-1)}_X(p) = \inf\{\eta : F^{(1)}_X(\eta) \ge p\}$ for $p \in (0,1]$. We focus on this definition as it is more computationally and notationally friendly, since the quantile function is always supported on $[0,1]$.

Second Order Stochastic Dominance. The second-order stochastic dominance (SSD) is defined via the second performance function $F^{(2)}_X : \mathbb{R} \to \mathbb{R}_+$ that measures the area under the CDF: $F^{(2)}_X(\eta) = \int_{-\infty}^{\eta} F^{(1)}_X(x)\,dx$, yielding:

$$X \succeq_{\mathrm{SSD}} Y \iff F^{(2)}_X(\eta) \le F^{(2)}_Y(\eta), \ \forall \eta \in \mathbb{R}. \quad (3)$$

Note that FSD implies SSD, hence SSD is a finer notion of dominance. While FSD implies that $X$ is preferred to $Y$ by any utility-maximizing agent preferring larger outcomes², (Ogryczak & Ruszczynski, 2002) showed that SSD implies that $X$ is preferred to $Y$ by any risk-averse agent preferring larger outcomes.³ Similarly to FSD, SSD can be measured with quantile functions by introducing the second quantile function, also known as the integrated quantile, $F^{(-2)}_X : (0,1] \to \mathbb{R}$:

$$F^{(-2)}_X(p) = \int_0^p F^{(-1)}_X(t)\,dt, \quad p \in (0,1]. \quad (4)$$
Similarly to the FSD case, an equivalent, more computationally friendly definition can be expressed in terms of the second quantile function (a proof of this equivalence can be found in Theorem 3.2 of (Ogryczak & Ruszczynski, 2002)):

$$X \succeq_{\mathrm{SSD}} Y \iff F^{(-2)}_X(p) \ge F^{(-2)}_Y(p), \ \forall p \in (0,1]. \quad (5)$$

This equivalence is not straightforward and is due to Fenchel duality between $F^{(2)}$ and $F^{(-2)}$. Using $p = 1$ we see that SSD implies $\mu_X \ge \mu_Y$, where $\mu_X$ and $\mu_Y$ are the means of $X$ and $Y$.

Mean Risk Models (MRM). As noted earlier, SSD is linked to risk-aware benchmarking via the second performance function $F^{(2)}(\cdot)$ measuring expected shortfall, and the negative second quantile function $-F^{(-2)}(p)$, which is an assessment of expected losses given outcomes lower than the $p$-quantile.

Definition 2.1 (Mean Risk Models). A mean risk model of a random variable $X$ consists of the pair $(\mu_X, r_X)$, where $\mu_X$ is the mean of $X$ and $r_X$ is a functional that measures the risk of the random outcome $X$. The consistency of a mean risk model with SSD is defined as follows:

² I.e., having an increasing utility function.
³ I.e., having an increasing and concave utility function.

| Name | Risk measure | $\alpha$-consistency with SSD |
|---|---|---|
| Standard deviation | $\sigma_X = \sqrt{\mathbb{E}(X - \mu_X)^2}$ | not consistent |
| Absolute semi-deviation | $\bar{\delta}_X = \mathbb{E}(\mu_X - X)_+$ | 1-consistent |
| Negative Tail Value at Risk | $-\mathrm{TVaR}_X(p) = -F^{(-2)}_X(p)/p$ | 1-consistent for all $p \in (0,1]$ |
| Mean absolute deviation from a quantile | $h_X(p) = \mu_X - F^{(-2)}_X(p)/p$ | 1-consistent for all $p \in (0,1]$ |
| Gini tail | $\Gamma_X = 2\int_0^1 (\mu_X p - F^{(-2)}_X(p))\,dp$ | 1-consistent |

Table 1: Risk models and their $\alpha$-consistency with SSD.
Unfortunately this model is not consistent with the SSD and has several limitations as it implies Gaussianity of the outcomes or a quadratic utility function. We give in Table 1 risk measurements and their α consistency (proofs in (Ogryczak & Ruszczynski, 2002)). Note that in contrast FSD is only consistent with the Mean-Va R risk model (Mean-Value at Risk) for all p [0, 1]. Va R does not provide a refined tail assessment. 2.2. Relaxations of Stochastic Dominance Recalling the definitions of FSD and SSD in Equations (2) and (5), in the finite-sample regime it is hard to test for these relations as one needs to show the infinite-sample quantile or second quantile properties hold uniformly over all p (0, 1]. This difficulty motivated the relaxation of stochastic dominance to an almost stochastic dominance pioneered by (Leshno & Levy, 2002). These relaxations were revisited for the first order by (Alvarez-Esteban et al., 2014) who later proposed an optimal transportation approach to assess almost first stochastic order (Del Barrio et al., 2018). Almost FSD (ε-FSD) Following (Leshno & Levy, 2002), (Del Barrio et al., 2018) relaxed FSD (Equation (2)) via the violation ratio of FSD. X ε FSD Y if and only if: εW2(FX, FY ) = R 1 0 (F ( 1) Y (t) F ( 1) X (t))2 +dt W2 2(FX, FY ) ε, where W2 is the Wasserstein -2 distance between FX and FY .This ratio corresponds to a measure of the area of violation of the FSD dominance of X on Y . Note that 0 εW2(FX, FY ) 1, with value 0 if X FSD Y and 1 if Y FSD X. For ε (0, 1 2], Figure 4a in Appendix G illustrates ε-FSD, dashed areas represent the violation set. Almost SSD (ε-SSD) We define ε-SSD, for ε (0, 1 2), by relaxing Equation (5) as follows: X ε SSD Y if and only if εIQ(FX, FY ) = R 1 0 (F ( 2) Y (t) F ( 2) X (t))2 +dt d2 IQ(FX, FY ) ε, where d IQ is the L2 distance between the Integrated Quantiles (F ( 2)). This ratio corresponds to a measure of the area of violation of the SSD dominance of X on Y . 
Figure 4b in Appendix G illustrates the second order; dashed areas represent the violation set of SSD of $X$ over $Y$. Appendix D gives a more detailed account of almost stochastic dominance.

2.3. Relative Stochastic Dominance

In the remainder of the paper, we refer to the FSD violation ratio $\varepsilon_{W_2}(F_X, F_Y)$ as $\varepsilon^{(1)}(F_X, F_Y)$ and to the SSD violation ratio $\varepsilon_{IQ}(F_X, F_Y)$ as $\varepsilon^{(2)}(F_X, F_Y)$. One of the shortcomings of almost stochastic dominance is the need to fix a threshold $\varepsilon$ on the violation ratio. When comparing two random variables, setting a threshold is a viable option. Nevertheless, when one needs to rank multiple variables $X_1, \ldots, X_k$ (considering all pairwise comparisons), setting a single threshold that would lead to a consistent relative stochastic dominance among the $k$ variables becomes challenging. To alleviate this issue, we draw inspiration from relative similarity and dependence tests (Bounliphone et al., 2016a;b) that circumvent the need for a threshold via relative pairwise testing. For $\ell \in \{1, 2\}$ (i.e., for FSD or SSD) we consider all pairwise violation ratios $\varepsilon^{(\ell)}_{ij} = \varepsilon^{(\ell)}(F_{X_i}, F_{X_j})$ for $i, j \in \{1, \ldots, k\}$, $i \ne j$, noting that $\varepsilon^{(\ell)}_{ij} + \varepsilon^{(\ell)}_{ji} = 1$. Let $F = (F_{X_1}, \ldots, F_{X_k})$. We define the one-versus-all violation ratio of the dominance of $X_i$ over all other variables $X_j$, $j \ne i$:

$$\varepsilon^{(\ell)}_i(F) = \frac{1}{k-1} \sum_{j \ne i} \varepsilon^{(\ell)}_{ij}.$$

We then define relative stochastic dominance for both orders, R-FSD and R-SSD respectively:

$$X_{i_1} \succeq_{\text{R-FSD}} X_{i_2} \succeq_{\text{R-FSD}} \cdots \succeq_{\text{R-FSD}} X_{i_k} \iff \varepsilon^{(1)}_{i_1}(F) \le \cdots \le \varepsilon^{(1)}_{i_k}(F),$$

and

$$X_{i_1} \succeq_{\text{R-SSD}} X_{i_2} \succeq_{\text{R-SSD}} \cdots \succeq_{\text{R-SSD}} X_{i_k} \iff \varepsilon^{(2)}_{i_1}(F) \le \cdots \le \varepsilon^{(2)}_{i_k}(F).$$

In this definition of relative stochastic dominance, the most dominating model is the one with the lowest one-versus-all violation ratio, and to test for relative dominance of $X_i$ over $X_j$ we can look at the following statistic:

$$\Delta\varepsilon^{(\ell)}_{ij}(F) = \varepsilon^{(\ell)}_i(F) - \varepsilon^{(\ell)}_j(F), \quad (9)$$

and we have the following threshold-free tests for relative order:⁴

$$X_i \succeq_{\text{R-FSD}} X_j \iff \Delta\varepsilon^{(1)}_{ij}(F) \le 0, \quad (10)$$

$$X_i \succeq_{\text{R-SSD}} X_j \iff \Delta\varepsilon^{(2)}_{ij}(F) \le 0. \quad (11)$$
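The one-versus-all ratios and the induced relative order can be sketched as follows (function names are hypothetical, assuming numpy; `eps_violation` is a grid-based estimate of $\varepsilon^{(\ell)}$ on samples):

```python
import numpy as np

def eps_violation(x, y, order=2, grid=2_000):
    """FSD (order=1) or SSD (order=2) violation ratio of X over Y, estimated on a grid."""
    ps = (np.arange(grid) + 0.5) / grid
    qx = np.quantile(np.asarray(x, float), ps)
    qy = np.quantile(np.asarray(y, float), ps)
    if order == 2:                      # integrate quantiles to get F^{(-2)}
        qx, qy = np.cumsum(qx) / grid, np.cumsum(qy) / grid
    d = qy - qx
    denom = np.sum(d ** 2)
    return float(np.sum(np.clip(d, 0.0, None) ** 2) / denom) if denom > 0 else 0.0

def relative_order(samples, order=2):
    """Order k models by their one-versus-all violation ratio (lowest = most dominating)."""
    k = len(samples)
    one_vs_all = np.array([
        np.mean([eps_violation(samples[i], samples[j], order) for j in range(k) if j != i])
        for i in range(k)
    ])
    return [int(i) for i in np.argsort(one_vs_all)]
```

The returned list starts with the most dominating model; in practice, each pairwise comparison in the loop would additionally be subjected to the significance tests of Section 3.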
3. Testing for Almost and Relative Stochastic Dominance

Given empirical samples from $F_X$ and $F_Y$, we perform statistical testing of the almost and relative stochastic dominance of $X$ over $Y$ given empirical estimates of the statistics from Sections 2.2 and 2.3. A key ingredient for quantifying the statistical significance of such tests is a central limit theorem guaranteeing that the centered empirical statistic is asymptotically Gaussian in the limit of infinite sample size. Given $n$ samples from $F_X$ (respectively $m$ from $F_Y$), we denote by $F^n_X$ and $F^m_Y$ the corresponding empirical distributions.

For $\varepsilon_0$-FSD, (Del Barrio et al., 2018) studied the hypothesis test of $H_0 : X \not\succeq_{\varepsilon_0\text{-FSD}} Y$ versus the alternative $H_a : X \succeq_{\varepsilon_0\text{-FSD}} Y$. Using (2), this amounts to the following null hypothesis:

$$H_0 : \varepsilon_{W_2}(F_X, F_Y) > \varepsilon_0.$$

(Del Barrio et al., 2018) showed the asymptotic normality of the empirical statistic, and (Del Barrio et al., 2018; Ulmer et al., 2022) propose to reject $H_0$ with a confidence level $1 - \alpha$ if:

$$\varepsilon_{W_2}(F^n_X, F^m_Y) \le \varepsilon_0 + \sqrt{\tfrac{m+n}{mn}}\,\sigma(F_X, F_Y)\,\Phi^{-1}(\alpha), \quad (12)$$

where $\Phi^{-1}$ is the quantile function of a standard normal.

⁴ For comparing $k = 2$ random variables, these R-FSD and R-SSD tests reduce to 0.5-FSD and 0.5-SSD absolute tests, respectively.

For the tests we propose below, we assume the following structure on the underlying CDFs to derive the corresponding central limit theorems (CLTs).

Assumption 1 (Regularity). Let the CDF $F$ be supported on the interval $[-M, M]$ for some constant $M$, and have pdf $f$ such that $f'(p)/f^3(p)$ is bounded for almost every $p$ for which $f(p) > 0$ (i.e., all $p$ in the support of $f$).

$\varepsilon$-SSD Testing. Similarly to $\varepsilon$-FSD, using the definition in (5) we propose to test for $\varepsilon_0$-SSD with the following null hypothesis:

$$H_0 : \varepsilon_{IQ}(F_X, F_Y) > \varepsilon_0.$$

Supposing Assumption 1 holds for $F_X, F_Y$ and assuming $\frac{n}{n+m} \to \lambda$ for some $\lambda$, we state a central limit theorem for the second order statistic (Theorem 3.1, proved in Appendix J.1).
Theorem 3.1 (Central Limit Theorem for $\varepsilon$-SSD). Assume that $F_X, F_Y$ are supported on intervals⁵ in $[-M, M]$, and have pdfs $f_X, f_Y$ such that $f_X'(p)/f_X^3(p)$ and $f_Y'(p)/f_Y^3(p)$ are bounded almost everywhere on the supports of $f_X$ and $f_Y$ respectively. Assume we have $n$ samples from $F_X$ and $m$ samples from $F_Y$, with $n, m \to \infty$ such that $\frac{n}{n+m} \to \lambda$ for some $\lambda \in (0,1)$. Then

$$\sqrt{\tfrac{mn}{m+n}}\,\big(\varepsilon_{IQ}(F^n_X, F^m_Y) - \varepsilon_{IQ}(F_X, F_Y)\big) \to \mathcal{N}\big(0, \sigma^2_\lambda(F_X, F_Y)\big),$$

where

$$\sigma^2_\lambda(F_X, F_Y) = \frac{1}{d^8_{IQ}(F_X, F_Y)}\Big[(1-\lambda)\,\mathrm{Var}(v_X(U)) + \lambda\,\mathrm{Var}(v_Y(U))\Big],$$

for $U \sim \mathrm{Unif}[0,1]$,

$$v_Y(t) = \frac{2}{f_Y(F_Y^{-1}(t))}\int_t^1 \big(F^{(-2)}_X(p) - F^{(-2)}_Y(p)\big)_+\,dp, \quad v_X(t) = \frac{2}{f_X(F_X^{-1}(t))}\int_t^1 \big(F^{(-2)}_X(p) - F^{(-2)}_Y(p)\big)_-\,dp.$$

Similarly to (12), Theorem 3.1 suggests rejecting $H_0$ with confidence $1 - \alpha$ if:

$$\varepsilon_{IQ}(F^n_X, F^m_Y) \le \varepsilon_0 + \sqrt{\tfrac{m+n}{mn}}\,\sigma_\lambda(F_X, F_Y)\,\Phi^{-1}(\alpha),$$

where (for the same reasons as in the FSD case) $\sigma^2_\lambda$ is given by the central limit theorem.

⁵ The intervals for $F_X$ and $F_Y$ need not coincide.

Relative Stochastic Dominance Testing. We turn now to the relative stochastic dominance that we introduced in (10) and (11) for the first and second orders. Given $n$ samples from each of $k$ random variables $(X_1, \ldots, X_k)$, let $F = (F_1, \ldots, F_k)$ be the marginals of the $X_i$ and $F_n = (F_{1n}, \ldots, F_{kn})$ denote the empirical marginals. To test for R-FSD (resp. R-SSD) of $X_{i_1}$ over $X_{i_2}$ we propose to test the following null hypothesis:

$$H_0 : \Delta\varepsilon^{(\ell)}_{i_1 i_2}(F) > 0, \quad \ell = 1 \text{ or } 2.$$

Assuming that each $F_i$ satisfies Assumption 1, we state in Appendix H a central limit theorem for the relative second order statistic (Theorem H.3, proved in Appendix J.2). A similar result holds for the relative first order statistic, which we omit for brevity. Theorem H.3 suggests rejecting $H_0$ with confidence $1 - \alpha$ if:

$$\Delta\varepsilon^{(2)}_{i_1 i_2}(F_n) \le \frac{1}{\sqrt{n}}\,\sigma_{\mathrm{relative}}(F)\,\Phi^{-1}(\alpha),$$

where $\sigma^2_{\mathrm{relative}}(F)$ is given by the central limit theorem (a similar test exists for R-FSD).
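The rejection rules above share one template: compare the empirical statistic to its threshold ($\varepsilon_0$ or 0) shifted by the estimated standard deviation times $\Phi^{-1}(\alpha)$. A minimal stdlib-only sketch of this decision, with the variance estimated by a bootstrap heuristic as done in practice (the function and `stat_fn` names are hypothetical, not the paper's code):

```python
import random
from statistics import NormalDist, pstdev

def dominance_test(stat_fn, x, y, eps0=0.25, alpha=0.05, n_boot=1000, seed=0):
    """Reject H0 (no eps0-dominance of X over Y) at confidence 1 - alpha when the
    empirical statistic falls below eps0 + sigma * Phi^{-1}(alpha), with sigma
    estimated by bootstrap resampling. Note Phi^{-1}(alpha) < 0 for alpha < 0.5.
    For all-pairs multi-testing over k models, alpha would first be
    Bonferroni-corrected: alpha / (k * (k - 1) / 2)."""
    rng = random.Random(seed)
    stat = stat_fn(x, y)                       # e.g. a violation-ratio estimate in [0, 1]
    boots = [stat_fn(rng.choices(x, k=len(x)), rng.choices(y, k=len(y)))
             for _ in range(n_boot)]           # resample both sides with replacement
    return stat <= eps0 + pstdev(boots) * NormalDist().inv_cdf(alpha)
```

A `True` return means the dominance of X over Y is declared significant at level alpha; the same skeleton applies to the relative statistics with `eps0=0`.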
Bootstrapping Heuristic. While the CLT above provides an asymptotic value for the variance, in practice (as in the ASO framework of (Ulmer et al., 2022)) we estimate the variance with a bootstrapping heuristic (Efron & Tibshirani, 1993). This estimate is non-asymptotic and hence should often be more accurate than the asymptotic value. Proving the consistency of the bootstrap for functions of quantiles is generally nontrivial (Shao & Tu, 2012), but recall that the stochastic ordering can be defined in terms of either quantiles or CDFs. In Appendix K we provide a bootstrap consistency proof for the absolute statistics based on the CDF, leaving the quantile-based proof for future work.

Multi-Testing Algorithm. Algorithm 1, given in Appendix C, summarizes the multi-testing setup for both relative and almost (absolute) FSD and SSD. The main idea behind Algorithm 1 is to turn multi-testing into pairwise testing, i.e., testing for stochastic dominance between all pairs of models using relative (or absolute) FSD or SSD. In order to ensure that this multi-testing has a confidence level $1 - \alpha$, we correct each individual test's confidence level by dividing $\alpha$ by the number of all pairs (Bonferroni, 1936). Then, in order to combine the pairwise rankings into a single rank, we use a simple Borda count (de Borda, 1781) rank aggregation algorithm.

4. Distributional Risk Aware Benchmarking of Foundation Models

Setup. In this section we consider the multi-metric evaluation setup of a foundation model $A : \mathcal{X} \to \mathcal{O}$, using $N$ metrics $m_i : \mathcal{O} \to \mathbb{R}$, $i = 1, \ldots, N$, where the $m_i$ are real-valued functions evaluated on a test set $\mathcal{D}$. Without loss of generality, assume that each of the metrics is standardized such that higher values of $m_i$ correspond to more desirable model performance. We model the observed values of each metric $m_i$ as a continuous random variable $M_i$ with unknown CDF $F_{M_i}$.
For a model $A : \mathcal{X} \to \mathcal{O}$ and a data sample $X \sim \mathcal{D}$, we describe the evaluation of model $A$ with $m_i$ via the following random variable $M_i$:

$$M_i \mid A, X := m_i(A(X)), \quad X \sim \mathcal{D}, \ i = 1, \ldots, N,$$

where the randomness arises from the data sampling procedure $X \sim \mathcal{D}$ and (if applicable) the stochasticity of the model $A$, for example if the model uses sampling.

Metrics Portfolio Aggregation and Selection using Stochastic Dominance. Let $\lambda = (\lambda_1, \ldots, \lambda_N)$ be a probability vector that represents the importance of the metrics $m_i$ to the model's end user. Inspired by the portfolio optimization literature, we model the user's return from a model as a portfolio of metrics $m_i$ evaluated on a test set $\mathcal{D}$. Following (Ulan et al., 2021; Belgodere et al., 2023), we define this portfolio as an independent copula, which forms a weighted geometric mean of the CDFs:

$$R_A(X) = \exp\Big(\sum_{i=1}^{N} \lambda_i \log F_{M_i}\big(m_i(A(X))\big)\Big). \quad (15)$$

Note that (15) normalizes the metrics using the CDF of the metric $M_i$, eliminating the issue of differing dynamic ranges. This CDF should be formed by pooling together the evaluations on all samples and from all models being compared, to ensure that the various $R_A$ are comparable. The CDF normalization is monotonic, hence it preserves the order of each metric and allows us to aggregate the metrics in probability space using a simple weighted geometric mean. Computing $R_A(X)$ for all test samples $X$, we can therefore characterize the distribution of the metrics portfolio of the model $A$. To compare two models it is enough to compare their corresponding portfolios; specifically, model $A$ is preferred to model $B$ using $\varepsilon$- or R-SSD:

$$R_A(X) \succeq_{\varepsilon\text{- or R-SSD}} R_B(X). \quad (16)$$

Similar tests can be performed for FSD. Note that the portfolio aggregation in (15) does not take into account the dependencies and correlations between the metrics. To alleviate this, we also explore using the empirical copula (Rüschendorf, 1976) as a means of aggregating the metrics, as follows:

$$R^c_A(X) = \hat{C}\big(F_{M_1}(m_1(A(X))), \ldots, F_{M_N}(m_N(A(X)))\big), \quad (17)$$
where $\hat{C}$ is the empirical copula. Given $n$ samples $X_\ell$, $\ell = 1, \ldots, n$, the empirical copula is given by

$$\hat{C}(u_1, \ldots, u_N) = \frac{1}{n}\sum_{j=1}^{n} \prod_{i=1}^{N} \mathbf{1}\big\{F_{M_i}\big(m_i(A(X_j))\big) \le u_i\big\}.$$

5. Experiments

Toxicity. Following (Gehman et al., 2020), we use toxic prompts (toxicity > 0.8, which gives 10K prompts) and non-toxic prompts (toxicity < 0.2, from which we randomly sample 10K). We sample from each model 10 completions per prompt using nucleus sampling (top-$p$ sampling with $p = 0.9$ and a temperature of 1). This procedure yields a dataset of 200K sentence completions per model. We evaluate the toxicity of these generations using the Perspective API, on the following toxicity metrics ($N = 6$ metrics): Toxicity, Severe Toxicity, Identity Attack, Insult, Profanity, and Threat. Following Liang et al. (2022), we evaluate the toxicity of generated completions only and refer to this as Gen Only evaluation. In order to also give the context of the completion, we prepend the model generation with the prompt and evaluate the full sentence using the Perspective API; we refer to this as Prompt+Gen. The polarity of all toxicity metrics is unified so that high values refer to non-toxic content (we use log probabilities of Perspective API outputs).

Evaluation Protocol and Baselines. We evaluate each of the use cases (instruction following and toxicity) using the following stochastic dominance tests: (1) $\varepsilon$-FSD (corresponding to the ASO evaluation of (Ulmer et al., 2022)) for $\varepsilon = 0.08, 0.25, 0.4$; (2) our proposed $\varepsilon$-SSD using the same values of $\varepsilon$; (3) our relative stochastic dominance R-FSD and R-SSD tests; (4) the mean risk models described in Table 3; and (5) the ranking produced by the Mean Win Rate (MWR) used by LLM leaderboards such as HELM (Liang et al., 2022).
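For concreteness, the independent-copula portfolio of Equation (15) with equal weights $\lambda_i = 1/N$ can be sketched as follows (a minimal numpy sketch; the dict layout and function name are illustrative, not the paper's code, and CDFs are pooled across all compared models as prescribed in Section 4):

```python
import numpy as np

def metrics_portfolio(scores, weights=None):
    """Independent-copula portfolio R_A(X): weighted geometric mean of CDF-normalized
    metrics, with CDFs formed by pooling evaluations from all compared models.
    scores: dict mapping model name -> (n_samples, N) array, higher = better."""
    models = list(scores)
    pooled = np.concatenate([scores[m] for m in models], axis=0)
    n_pool, n_metrics = pooled.shape
    w = np.full(n_metrics, 1.0 / n_metrics) if weights is None else np.asarray(weights)
    sorted_cols = [np.sort(pooled[:, i]) for i in range(n_metrics)]
    portfolios = {}
    for m in models:
        # empirical pooled CDF of metric i, evaluated at this model's scores
        cdf = np.stack([np.searchsorted(sorted_cols[i], scores[m][:, i], side="right") / n_pool
                        for i in range(n_metrics)], axis=1)
        portfolios[m] = np.exp((w * np.log(np.clip(cdf, 1e-12, 1.0))).sum(axis=1))
    return portfolios
```

Model selection then amounts to comparing the returned portfolio samples of two models with the FSD/SSD tests of Section 3.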
As noted in Section 4, we either perform these tests on a metrics portfolio (referred to as test @P(IC) when using the independent copula given in Equation (15), and test @P(EC) when using the empirical copula given in Equation (17)), or on a per-metric basis, leading to $N$ rankings of the models that we reduce to a single ranking via Rank Aggregation (RA) (Pihur et al., 2009); we refer to this as RA(test @M). In this naming convention, test takes values in {MWR, $\varepsilon$-FSD, $\varepsilon$-SSD, R-FSD, R-SSD, Mean Risk Model ($\mu_X - r_X$)}, where $r_X$ is a chosen risk from Table 3. We perform all our statistical tests with a significance level $\alpha = 0.05$, and use 1000 bootstrap iterations.

Figure 2: (a) Asymptotic rank stability. On the MixInstruct dataset, we compute the ranking resulting from each ranking method using varying sample sizes from 100 to 5K, repeating each experiment 5 times. For each method, we report the Kendall-Tau similarity between the rank at each sample size and the corresponding asymptotic rank at 5K samples. Relative SSD on the independent copula portfolio P(IC) is more stable in sample size than rank aggregation of all mean risk models, and more stable than MWR on the portfolio. The empirical (dependent) copula portfolio P(EC) does not have favorable asymptotics w.r.t. P(IC), since it suffers from the curse of dimensionality. (b) Rank similarity to the R-SSD @ChatGPT rank. Using the same setup as in (a), we plot the similarity to the R-SSD @ChatGPT rank at 5K samples. MWR is inconsistent with the ChatGPT rank, while R-SSD @P(IC), R-SSD @P(EC), and RA(MRM @P(IC)) have a Kendall-Tau similarity between 0.7 and 0.75. Interestingly, the dependent copula (EC) captures the ChatGPT rank better than the independent copula (IC), hinting at the favorable role of the metric dependencies.

Efficient Implementation. We compare the computational complexity of our implementation for computing all stochastic orders to that of the Deep-Significance package (deepsig, 2022), which implements $\varepsilon$-FSD in the ASO framework (Ulmer et al., 2022), on the task of comparing models on the MixInstruct dataset (sample size 5K, $k = 12$ models). Using the Deep-Significance implementation of MULTI-ASO in (Ulmer et al., 2022) for $\varepsilon = 0.25$ with just 3 bootstrap iterations⁶, the test completes in 15min50s (averaged over 7 runs). Our code for relative and absolute testing performs all tests at once and relies on caching, vectorization, and multi-threading of the operations. Our code completes all tests in an average of just 17.7s with 1000 bootstraps. Experiments were run on a CPU machine with 128 AMD cores, of which 2 were used.

MixInstruct Results and Analysis. In Figure 2 we depict the asymptotics of the ranks resulting from our tests as a function of the sample size. In Figure 2 (a), we see that R-SSD with the independent copula portfolio aggregation P(IC) has favorable asymptotics compared to R-SSD with the dependent empirical copula P(EC). Indeed, the empirical copula estimation suffers from the curse of dimensionality.
On the other hand, we see in Figure 2 (b) that R-SSD with P(EC) captures better than P(IC) the ranks resulting from R-SSD with the ChatGPT score. In other words, the dependent copula agrees more with the human evaluation proxy that is ChatGPT. Note that the EC is expensive to compute and requires on average 1.5 hours on 5K samples, whereas the IC requires only 0.87 seconds. When compared with the Mean Win Rate (MWR) used in LLM leaderboards such as HELM (Liang et al., 2022), we see that MWR neither has good asymptotics nor agrees with the ChatGPT rankings, regardless of the aggregation technique used. This is due to the fact that MWR only counts wins and does not take into account how fat the left tail of the distribution of the benchmarked metric is, possibly leading to over-evaluation of risky models. Remarkably, the R-SSD ordering agrees with the rank aggregation of all (consistent) mean risk models, confirming the theoretical link between second order dominance and risk-averse decision making. The dependent copula EC with R-SSD leads to a better agreement with the ChatGPT R-SSD ranking than the MRM models. Finally, Tables 4 and 5 in Appendix L give additional results on R-FSD and the rank aggregation of all metrics, and how they compare to $\varepsilon$-FSD and $\varepsilon$-SSD.

⁶ Limited to 3 for computational reasons.

Toxicity Results and Analysis. Table 2 shows the results of our tests on the combined set of toxic and non-toxic prompts. Ablation studies on the individual sets are given in Table 6 in Appendix L.4.
| All Combined (Toxic + Non-Toxic Prompts) | Llama 2 7b | Llama 2 13b | Llama 2 70b | MosaicML MPT 30b | Tiiuae Falcon 40b |
|---|---|---|---|---|---|
| RA(R-FSD @M) (Gen Only) | 2 | 3 | 5 | 1 | 4 |
| R-FSD @P(IC) (Gen Only) | 2 | 3 | 5 | 1 | 4 |
| RA(R-SSD @M) (Gen Only) | 2 | 3 | 5 | 1 | 4 |
| R-SSD @P(IC) (Gen Only) | 2 | 3 | 5 | 1 | 4 |
| RA(R-FSD @M) (Prompt + Gen) | 3 | 4 | 5 | 1 | 2 |
| R-FSD @P(IC) (Prompt + Gen) | 3 | 4 | 5 | 1 | 2 |
| RA(R-SSD @M) (Prompt + Gen) | 3 | 4 | 5 | 1 | 2 |
| R-SSD @P(IC) (Prompt + Gen) | 3 | 4 | 5 | 1 | 2 |

Table 2: Toxicity ranking using an independent copula portfolio aggregation of Perspective API metrics.

We make a few observations. First, overall, the portfolio with the independent copula approach agrees well with the rank aggregation of per-metric rankings. The portfolio is more computationally efficient, as it only needs to run the stochastic dominance test on the portfolio, rather than running $N$ tests and aggregating them via rank aggregation. An ablation study on the empirical copula in Appendix L shows that it leads to a similar ranking as the independent copula. Secondly, on this dataset R-FSD and R-SSD agree, with a few exceptions. Interestingly, when comparing models on model generation only, MosaicML MPT stands out on toxic prompts, Llama 2 7b stands out on non-toxic prompts, and MosaicML MPT stands out on the combined set. On the combined set, we see for the Llama family that increased model size increases the toxicity of generations. This is in line with findings in the recent TrustLLM benchmark (Sun et al., 2024).

6. Conclusion

In this paper we introduced a distributional framework for risk-aware benchmarking and comparison of foundation models based on multi-metric evaluations. Our framework has potential beyond the applications presented here, being applicable wherever a statistically significant ranking of assets for decision making is needed.
We believe our tools for training models to be risk averse can be of significant use to practitioners and serve as a stepping stone towards solving the AI alignment problem.

Impact Statement

This paper presents a risk-aware framework for benchmarking LLMs. Given the stochastic nature of LLM generation and the presence of multiple metrics to be evaluated, our work offers a solution that gives rise to 1) a sound aggregation of the metrics via the copula method; 2) a risk-aware evaluation that, thanks to the use of stochastic orders, takes into account tail events of misalignment and not only average behavior; and 3) a quantification of the uncertainty of the evaluation via statistical significance testing. The potential societal consequences of our work fall under AI governance, as it allows a rigorous certification of the compliance of LLMs with a multitude of safeguards and dimensions.

References

Pedro C. Alvarez-Esteban, Eustasio del Barrio, Juan A. Cuesta-Albertos, and Carlos Matrán. A contamination model for approximate stochastic order: extended version. arXiv preprint arXiv:1412.1920, 2014.

Brian Belgodere, Pierre Dognin, Adam Ivankay, Igor Melnyk, Youssef Mroueh, Aleksandra Mojsilovic, Jiri Navratil, Apoorva Nitsure, Inkit Padhi, Mattia Rigotti, Jerret Ross, Yair Schiff, Radhika Vedpathak, and Richard A. Young. Auditing and generating synthetic data with controllable trust trade-offs, 2023.

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Rishi Bommasani, Percy Liang, and Tony Lee. Holistic evaluation of language models. Annals of the New York Academy of Sciences, 2023.

C. E. Bonferroni. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R. Istituto superiore di scienze economiche e commerciali di Firenze. Seeber, 1936.
URL https://books.google.com/books?id=3CY-HQAACAAJ.

Wacha Bounliphone, Eugene Belilovsky, Matthew Blaschko, Ioannis Antonoglou, and Arthur Gretton. A test of relative similarity for model selection in generative models. Proceedings of ICLR 2016, 2016a.

Wacha Bounliphone, Eugene Belilovsky, Arthur Tenenhaus, Ioannis Antonoglou, Arthur Gretton, and Matthew B. Blaschko. Fast non-parametric tests of relative dependency and similarity. arXiv preprint arXiv:1611.05740, 2016b.

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.

Jean-Charles de Borda. Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences, 1781.

deepsig. DeepSignificance. https://github.com/Kaleidophon/deep-significance, 2022.

Eustasio Del Barrio, Juan A. Cuesta-Albertos, and Carlos Matrán. An optimal transportation approach for assessing almost stochastic order. The Mathematics of the Uncertain: A Tribute to Pedro Gil, pp. 33-44, 2018.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. The hitchhiker's guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1383-1392, 2018.

Rotem Dror, Segev Shlomov, and Roi Reichart. Deep dominance: How to properly compare deep neural models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2773-2785, 2019.

B. Efron and R. Tibshirani. An Introduction to the Bootstrap, 1993.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings, 2020. URL https://api.semanticscholar.org/CorpusID:221878771.
Alexander A. Gushchin and Dmitriy A. Borzykh. Integrated quantile functions: properties and applications. Modern Stochastics: Theory and Applications, 4(4):285-314, 2017.

Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. Are large language model-based evaluators the solution to scaling up multilingual evaluation? arXiv preprint arXiv:2309.07462, 2023.

Yue Huang, Qihui Zhang, Lichao Sun, et al. TrustGPT: A benchmark for trustworthy and responsible large language models. arXiv preprint arXiv:2306.11507, 2023.

Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.

Moshe Leshno and Haim Levy. Preferred by all and preferred by most decision makers: Almost stochastic dominance. Management Science, 48(8):1074-1085, 2002.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models, 2022.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.

Wlodzimierz Ogryczak and Andrzej Ruszczynski.
Dual stochastic dominance and related mean-risk models. SIAM Journal on Optimization, 13(1):60-78, 2002.

Vasyl Pihur, Susmita Datta, and Somnath Datta. RankAggreg, an R package for weighted rank aggregation. BMC Bioinformatics, 10:62, 2009. URL https://api.semanticscholar.org/CorpusID:206970248.

Ludger Rüschendorf. Asymptotic distributions of multivariate rank order statistics. The Annals of Statistics, 4(5):912-923, 1976.

Jun Shao and Dongsheng Tu. The Jackknife and Bootstrap. Springer Science & Business Media, 2012.

Edwin D. Simpson. Statistical significance testing for natural language processing, 2021.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, William Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yong Chen, and Yue Zhao. TrustLLM: Trustworthiness in large language models, 2024.

L. Y. Tzeng, Rachel Huang, and P. T. Shih. Revisiting almost second-degree stochastic dominance.
Management Science, 59:1250-1254, 2013. doi: 10.2307/23443939.

Maria Ulan, Welf Löwe, Morgan Ericsson, and Anna Wingkvist. Copula-based software metrics aggregation. Software Quality Journal, 29(4):863-899, 2021. URL https://doi.org/10.1007/s11219-021-09568-9.

Dennis Ulmer, Christian Hardmeier, and Jes Frellsen. deep-significance: Easy and meaningful statistical significance testing in the age of neural networks. arXiv preprint arXiv:2204.06815, 2022.

Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, and Bo Li. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models, 2023.

Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.

Haopeng Zhang, Xiao Liu, and Jiawei Zhang. SummIt: Iterative text summarization via ChatGPT. arXiv preprint arXiv:2305.14835, 2023.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.

Supplementary Material

A. Ablation Studies

Metrics Aggregation Versus Portfolio. For the portfolio approach, computing the ranking using FSD and SSD, including the portfolio computation, on 5K samples with 5 bootstrap samples, has a mean execution time of 32.01 ± 4.51 s. For the FSD and SSD ranking computation over all metrics, followed by rank aggregation using the Pearson distance, the execution time is 254.99 ± 16.76 s.
On the other hand, we observe on the mix-instruct dataset a consistency of ranks between these two approaches (FSD or SSD on the portfolio, versus FSD or SSD on all metrics followed by rank aggregation), as quantified by the Kendall-tau similarity between the ranks:
1. KendallTau(R-SSD@P(IC), RA(R-SSD@M)) = 0.848
2. KendallTau(R-FSD@P(IC), RA(R-FSD@M)) = 0.878
3. KendallTau(R-SSD@P(EC), RA(R-SSD@M)) = 0.848
4. KendallTau(R-FSD@P(EC), RA(R-SSD@M)) = 0.848
We see that these two approaches lead to similar ranks, while the portfolio approach leads to a 7x speedup when using IC portfolios.

B. Transforming Discrete Relative ChatGPT Scores to Absolute Real-Valued Scores

We follow (Jiang et al., 2023) in mapping discrete ChatGPT scores to real-valued ones. Note that the ChatGPT scores for comparing models A and B are discrete, taking one of four options: A is better, B is better, both are equally good, both are equally bad. Given $m$ models, we construct for each prompt sample $\ell = 1 \dots N$ an $m \times m$ comparison matrix with ChatGPT:
$$X_{\ell,ij} = +1,\ X_{\ell,ji} = -1 \text{ if model } i \text{ is better}; \quad X_{\ell,ij} = -1,\ X_{\ell,ji} = +1 \text{ if model } j \text{ is better};$$
$$X_{\ell,ij} = X_{\ell,ji} = +0.5 \text{ if models } i \text{ and } j \text{ are equally good}; \quad X_{\ell,ij} = X_{\ell,ji} = -0.5 \text{ if they are equally bad}.$$
Then each model defines the following scalar score at each sample $\ell$:
$$s_{\ell,i} = \sum_{j=1}^{m} \big(X_{\ell,ij} - X_{\ell,ji}\big).$$
Hence we have a distribution of ChatGPT scores for each model:
$$\hat p_i = \frac{1}{N}\sum_{\ell=1}^{N} \delta_{s_{\ell,i}}, \quad i = 1 \dots m.$$
Although the scores $s_{\ell,i}$ take even integer values between $-2m$ and $2m$ inclusive, we treat the support as continuous and consider the following kernel density estimator with a Gaussian kernel of width $\sigma$:
$$\hat p_i^{(\sigma)}(t) = \frac{1}{N\sigma}\sum_{\ell=1}^{N} \varphi\Big(\frac{t - s_{\ell,i}}{\sigma}\Big), \quad t \in \mathbb{R}, \ i = 1 \dots m,$$
where $\varphi$ is the standard normal density. In Figure 3 below we plot the ChatGPT score kernel density estimates for two models, open-assistant and flan-t5.

Figure 3: ChatGPT score densities for two models; open-assistant has clearly higher scores than the flan-t5 model.
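The score construction of Appendix B can be sketched in a few lines of NumPy. This is an illustrative implementation of the formulas above, not the paper's code; the function names are ours.

```python
import numpy as np

def pairwise_to_scores(X):
    """Turn per-sample pairwise comparison matrices into scalar model scores.

    X has shape (N, m, m): X[l, i, j] encodes the judge's verdict on sample l
    (+1/-1 for a win/loss of model i against model j, +/-0.5 when both are
    'equally good/bad'), following the construction above. Returns scores s
    of shape (N, m) with s[l, i] = sum_j (X[l, i, j] - X[l, j, i]).
    """
    return (X - X.transpose(0, 2, 1)).sum(axis=2)

def score_density(scores_i, t, sigma=1.0):
    """Gaussian kernel density estimate of one model's score distribution at points t."""
    t = np.atleast_1d(t)[:, None]                        # shape (T, 1)
    z = (t - scores_i[None, :]) / sigma                  # shape (T, N)
    phi = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)   # standard normal density
    return phi.mean(axis=1) / sigma
```

For example, with m = 2 models and a single sample where model 1 beats model 2, the resulting scores are (2, -2), and the kernel density estimate integrates to one over a wide enough grid.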
C. Multi-Testing Algorithm for Relative and Almost Stochastic Dominance

Our multi-testing algorithm for relative and almost stochastic dominance is detailed in Algorithm 1. In a nutshell, it consists of the following steps:
1. For the evaluation of each model, compute summary statistics, i.e., quantiles and integrated quantiles.
2. For all pairs of models, compute the statistics of the absolute and relative tests by computing violation ratios.
3. Compute the variance of these statistics via bootstrapping.
4. Perform the hypothesis test for all pairs of models, with a confidence level corrected for the number of pairs.
5. Aggregate the pairwise outcomes into a rank using the Borda algorithm, which ranks the models by their number of wins in the stochastic dominance tests performed above.

D. Absolute or Almost Stochastic Dominance

Almost FSD ($\varepsilon$-FSD). Following (Leshno & Levy, 2002), (Del Barrio et al., 2018) relaxed FSD via the violation ratio of FSD:

Definition D.1 (FSD Violation Ratio (Del Barrio et al., 2018)). For $F_X \neq F_Y$ define the violation ratio:
$$\varepsilon_{W_2}(F_X, F_Y) = \frac{\int_{A_0^{(1)}} \big(F_X^{(-1)}(t) - F_Y^{(-1)}(t)\big)^2\,dt}{\int_0^1 \big(F_X^{(-1)}(t) - F_Y^{(-1)}(t)\big)^2\,dt} = \frac{\int_0^1 \big(F_Y^{(-1)}(t) - F_X^{(-1)}(t)\big)_+^2\,dt}{W_2^2(F_X, F_Y)},$$
where $A_0^{(1)} = \{ t \in (0,1) : F_Y^{(-1)}(t) > F_X^{(-1)}(t) \}$ is the violation set of the relation $X \succeq_{FSD} Y$, and $W_2$ is the Wasserstein-2 distance.

Algorithm 1 Stochastic Order Multi-testing (relative and absolute)
1: Input: $F_1, \dots, F_k$, the $k$ models we want to rank, with corresponding empirical measures $p_1 = \frac{1}{n}\sum_{i=1}^n \delta_{x_i^1}, \dots, p_k = \frac{1}{n}\sum_{i=1}^n \delta_{x_i^k}$; threshold $\tau$.
2: Input: desired stochastic order $\in \{1, 2\}$, number of bootstraps $B$, number of comparisons $m = k^2$, significance level $\alpha$.
3: {Cache the bootstrap samples and their statistics}
4: for j = 1 to k do
5:   $p_j^0 \leftarrow p_j$
6:   {Get quantiles and integrated quantiles}
7:   $Q_{0,j} \leftarrow$ GETQUANTILES($p_j$)
8:   $IQ_{0,j} \leftarrow$ GETINTEGRATEDQUANTILES($p_j$)
9:   for b = 1 to B do
10:    {Get quantiles and integrated quantiles}
11:    $p_j^b \leftarrow$ RESAMPLEWITHREPLACEMENT($p_j$, n) {using quantiles and uniform draws}
12:    $Q_{b,j} \leftarrow$ GETQUANTILES($p_j^b$)
13:    $IQ_{b,j} \leftarrow$ GETINTEGRATEDQUANTILES($p_j^b$)
14:   end for
15: end for
16: {Compute all violation ratios}
17: $\varepsilon_{b,i,j} \leftarrow$ COMPUTEVIOLATIONRATIOS($F_i^b$, $F_j^b$, order) for $b = 0 \dots B$, $i, j = 1 \dots k$, $i \neq j$ {ratio of the Q or IQ area where $j > i$ over the total area}
18: $\varepsilon_{b,i,i} \leftarrow 0$, $\forall b, i$
19: {Compute the sum statistics}
20: for b = 0 to B do
21:   for i = 1 to k do
22:     $\bar\varepsilon_b^i \leftarrow \frac{1}{k-1}\sum_j \varepsilon_{b,i,j}$
23:   end for
24: end for
25: {Compute the relative statistics}
26: $\bar\varepsilon_b^{i,j} \leftarrow \bar\varepsilon_b^i - \bar\varepsilon_b^j$, $\forall b, i, j$
27: {Compute the bootstrap variance}
28: for i = 1 to k do
29:   for j = 1 to k do
30:     $\sigma_{ij} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B \big(\bar\varepsilon_b^{i,j} - \mathrm{MEAN}_b(\bar\varepsilon_b^{i,j})\big)^2}$
31:     $\sigma_{ij}^{abs} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B \big(\varepsilon_{b,i,j} - \mathrm{MEAN}_b(\varepsilon_{b,i,j})\big)^2}$
32:   end for
33: end for
34: {Compute the test}
35: $\mathrm{Win}_{ij} = \mathrm{Win}_{ij}^{abs} = 0$
36: for i = 1 to k do
37:   for j = 1 to k do
38:     if $i \neq j$ and $\bar\varepsilon_0^{i,j} - \frac{1}{\sqrt{n}}\sigma_{ij}\Phi^{-1}(\alpha/k^2) \le 0$ then
39:       $\mathrm{Win}_{ij} = 1$ {with confidence level $1 - \alpha/k^2$}
40:     end if
41:     if $i \neq j$ and $\varepsilon_{0,i,j} - \frac{1}{\sqrt{n}}\sigma_{ij}^{abs}\Phi^{-1}(\alpha/k^2) \le \tau$ then
42:       $\mathrm{Win}_{ij}^{abs} = 1$ {with confidence level $1 - \alpha/k^2$}
43:     end if
44:   end for
45: end for
rank = BORDA(Win) {with confidence level $1 - \alpha$}
rank$^{abs}$ = BORDA(Win$^{abs}$) {with confidence level $1 - \alpha$}
46: Return rank, rank$^{abs}$

Algorithm 2 COMPUTEVIOLATIONRATIOS($F_a$, $F_b$, order)
if order = 1 then
  Return $\varepsilon_{W_2}(F_a, F_b)$ as in Definition D.1
else if order = 2 then
  Return $\varepsilon_{IQ}(F_a, F_b)$ as in Definition D.2
end if

Note that $0 \le \varepsilon_{W_2}(F_X, F_Y) \le 1$, with value 0 if $X \succeq_{FSD} Y$ and 1 if $Y \succeq_{FSD} X$. For $\varepsilon \in (0, \frac{1}{2}]$, the relaxed FSD can therefore be defined as follows:
$$X \succeq_{\varepsilon\text{-}FSD} Y \iff \varepsilon_{W_2}(F_X, F_Y) \le \varepsilon. \tag{18}$$
Figure 4a in Appendix G illustrates $\varepsilon$-FSD; the dashed areas represent the violation set.
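The FSD violation ratio of Definition D.1 is easy to approximate from samples. The sketch below is our own illustrative implementation (not the paper's code), discretizing both empirical quantile functions on a uniform grid over (0, 1).

```python
import numpy as np

def fsd_violation_ratio(x, y, grid=1000):
    """Empirical FSD violation ratio eps_{W2}(F_X, F_Y) of Definition D.1.

    Approximates the quantile functions of the samples x and y on a uniform
    grid and returns the mass of the violation set {F_Y^{(-1)} > F_X^{(-1)}}
    relative to the squared Wasserstein-2 distance: 0 when X first-order
    dominates Y, 1 in the reverse case, and strictly in between otherwise.
    """
    t = (np.arange(grid) + 0.5) / grid
    qx = np.quantile(x, t)   # empirical quantile function of X
    qy = np.quantile(y, t)   # empirical quantile function of Y
    num = (np.clip(qy - qx, 0.0, None) ** 2).sum()
    den = ((qx - qy) ** 2).sum()
    return num / den
```

The almost-dominance decision (18) then declares X dominant over Y at level ε whenever the returned ratio is at most ε.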
Almost SSD ($\varepsilon$-SSD). Note that the original definition of $\varepsilon$-FSD of X over Y in (Leshno & Levy, 2002) is an $L_1$ definition and uses the CDFs rather than quantiles:
$$\int \big(F_X(x) - F_Y(x)\big)_+\,dx \le \varepsilon \int \big|F_X(x) - F_Y(x)\big|\,dx.$$
(Tzeng et al., 2013) gave a similar $L_1$ definition for $\varepsilon$-SSD using the second performance function $F^{(2)}(\cdot)$. According to (Tzeng et al., 2013), X dominates Y in the $\varepsilon$-SSD sense if
$$\int_{-\infty}^{+\infty} \big(F_X^{(2)}(x) - F_Y^{(2)}(x)\big)_+\,dx \le \varepsilon \int_{-\infty}^{+\infty} \big|F_X^{(2)}(x) - F_Y^{(2)}(x)\big|\,dx.$$
Following (Del Barrio et al., 2018), we redefine $\varepsilon$-SSD using second quantiles and an $L_2$ definition; this eases the analysis, and practically the integration is over $(0, 1]$ rather than $(-\infty, \infty)$. We define the SSD violation ratio as follows:

Definition D.2 (SSD Violation Ratio). For $F_X \neq F_Y$ define the violation ratio:
$$\varepsilon_{IQ}(F_X, F_Y) = \frac{\int_{A_0^{(2)}} \big(F_X^{(-2)}(t) - F_Y^{(-2)}(t)\big)^2\,dt}{\int_0^1 \big(F_X^{(-2)}(t) - F_Y^{(-2)}(t)\big)^2\,dt} = \frac{\int_0^1 \big(F_Y^{(-2)}(t) - F_X^{(-2)}(t)\big)_+^2\,dt}{d_{IQ}^2(F_X, F_Y)},$$
where $A_0^{(2)} = \{ t \in (0,1) : F_Y^{(-2)}(t) > F_X^{(-2)}(t) \}$ is the violation set of the relation $X \succeq_{SSD} Y$, and $d_{IQ}$ is the $L_2$ distance between the integrated quantiles $F^{(-2)}$.

We are now ready to define $\varepsilon$-SSD: for $\varepsilon \in (0, \frac{1}{2}]$,
$$X \succeq_{\varepsilon\text{-}SSD} Y \iff \varepsilon_{IQ}(F_X, F_Y) \le \varepsilon. \tag{19}$$
Figure 4b in Appendix G illustrates the second order; the dashed areas represent the violation set of SSD of X on Y. Integrated quantiles fully characterize one-dimensional distributions, as can be seen from Theorem I.1, stated and proved in Appendix I.

E. Related Works on Stochastic Dominance

Stochastic Dominance. In (Dror et al., 2018; 2019; Ulmer et al., 2022; Simpson, 2021) a distributional assessment of models based on stochastic dominance was proposed to overcome the limitations of the ubiquitous mean-variance risk model used in machine learning. (Ulmer et al., 2022) used first order almost stochastic dominance and advocated selecting a model A over B based on a metric $m_i$ if: $M_i|A, X \succeq_{\varepsilon\text{-}FSD} M_i|B, X$. We expand this to the Relative-FSD.
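The second-order quantities used throughout these appendices, the integrated quantile $F^{(-2)}$, the SSD violation ratio of Definition D.2, and (as one example of the mean-risk measures tabulated in Appendix F) the Gini tail, can be sketched empirically as follows. This is an illustrative grid approximation with our own function names, not the paper's implementation.

```python
import numpy as np

def integrated_quantile(x, grid=1000):
    """Integrated quantile F^{(-2)}(t) = int_0^t F^{(-1)}(p) dp on a uniform grid over (0, 1)."""
    t = (np.arange(grid) + 0.5) / grid
    q = np.quantile(x, t)              # empirical quantile function F^{(-1)}
    return t, np.cumsum(q) / grid      # running Riemann sums of F^{(-1)}

def ssd_violation_ratio(x, y, grid=1000):
    """Empirical SSD violation ratio eps_IQ(F_X, F_Y) of Definition D.2."""
    _, iqx = integrated_quantile(x, grid)
    _, iqy = integrated_quantile(y, grid)
    num = (np.clip(iqy - iqx, 0.0, None) ** 2).sum()
    den = ((iqx - iqy) ** 2).sum()
    return num / den

def gini_tail(x, grid=1000):
    """Gini tail Gamma_X = 2 int_0^1 (mu_X * p - F_X^{(-2)}(p)) dp (a 1-consistent risk measure)."""
    t, iq = integrated_quantile(x, grid)
    return 2.0 * np.mean(np.mean(x) * t - iq)
```

As a sanity check, shifting a sample up by a constant makes it second-order dominant (ratio 0), and a degenerate (constant) distribution has essentially zero Gini tail, since it carries no left-tail risk.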
In natural language (and other) applications, it is often crucial to mitigate the risk of outputs with low metrics, especially when those metrics quantify important socio-technical guardrails such as a model's toxicity, safety, or robustness. Unfortunately, the first stochastic ordering does not capture an assessment of the left-tail behavior of $M_i|A, X$ and $M_i|B, X$, and hence does not provide a risk-aware benchmarking (Ogryczak & Ruszczynski, 2002). To alleviate this issue, we instead consider the second order stochastic ordering, and use our second order almost or relative stochastic dominance tests introduced in Section 3 for selecting a model A if: $M_i|A, X \succeq_{\varepsilon \text{ or } R\text{-}SSD} M_i|B, X$.

F. Supplementary Discussions

F.1. Mean-Risk Models

Table 3: Risk models and their $\alpha$-consistency with SSD.

Name                                      Risk measure                                                $\alpha$-consistency with SSD
Standard deviation                        $\sigma_X = \sqrt{E(X - \mu_X)^2}$                          not consistent
Absolute semideviation                    $\bar\delta_X = E(\mu_X - X)_+$                             1-consistent
Negative tail value-at-risk               $\mathrm{TVAR}_X(p) = \frac{F_X^{(-2)}(p)}{p}$              1-consistent for all $p \in (0, 1]$
Mean absolute deviation from a quantile   $h_X(p) = \mu_X - \frac{F_X^{(-2)}(p)}{p}$                  1-consistent for all $p \in (0, 1]$
Gini tail                                 $\Gamma_X = 2\int_0^1 (\mu_X p - F_X^{(-2)}(p))\,dp$        1-consistent

Note that several risks in Table 3 use the second quantile function as part of a benchmarking of the left tails of the outcomes.

F.2. $\delta$-Consistency of Gini-Risk Models with $\varepsilon$-SSD

We relax the definition of $\alpha$-consistency of mean-risk models with SSD to $(\alpha, \delta)$-consistency with $\varepsilon$-SSD as follows:

Definition F.1 ($(\alpha, \delta)$-consistency of MRM with $\varepsilon$-SSD). A mean-risk model $(\mu_X, r_X)$ is $(\alpha, \delta)$-consistent with $\varepsilon$-SSD if there exist $\alpha, \delta > 0$ such that
$$X \succeq_{\varepsilon\text{-}SSD} Y \implies \mu_X - \alpha r_X + \delta \ge \mu_Y - \alpha r_Y.$$

It is easy to see that the Mean-Gini tail MRM of X and Y is consistent with their $\varepsilon$-SSD:

Proposition F.2. The Mean-Gini Tail MRM is $(1,\ 2\varepsilon^{1/2} d_{IQ}(F_X, F_Y))$-consistent with $\varepsilon$-SSD.

Proof of Proposition F.2.
$$\begin{aligned}
\mu_X - \Gamma_X &= \mu_X - 2\int_0^1 \big(\mu_X p - F_X^{(-2)}(p)\big)\,dp = 2\int_0^1 F_X^{(-2)}(p)\,dp \\
&= 2\int_0^1 \big(F_X^{(-2)}(p) - F_Y^{(-2)}(p) + F_Y^{(-2)}(p)\big)\,dp \\
&= \mu_Y - \Gamma_Y + 2\int_{A_0^{(2)}} \big(F_X^{(-2)}(p) - F_Y^{(-2)}(p)\big)\,dp + 2\int_{[0,1]\setminus A_0^{(2)}} \big(F_X^{(-2)}(p) - F_Y^{(-2)}(p)\big)\,dp \\
&\ge \mu_Y - \Gamma_Y - 2\int_{A_0^{(2)}} \big|F_X^{(-2)}(p) - F_Y^{(-2)}(p)\big|\,dp \\
&= \mu_Y - \Gamma_Y - 2\int_0^1 \big(F_Y^{(-2)}(p) - F_X^{(-2)}(p)\big)_+\,dp \\
&\ge \mu_Y - \Gamma_Y - 2\Big(\int_0^1 \big(F_Y^{(-2)}(p) - F_X^{(-2)}(p)\big)_+^2\,dp\Big)^{1/2} \quad \text{(Cauchy-Schwarz)} \\
&\ge \mu_Y - \Gamma_Y - 2\varepsilon^{1/2} d_{IQ}(F_X, F_Y) \quad \text{(by the assumption } X \succeq_{\varepsilon\text{-}SSD} Y\text{)}.
\end{aligned}$$

F.3. Rank Aggregation

Given N ranks $\pi_i$, $i = 1 \dots N$, represented as permutations in $S_k$, the rank aggregation of (Pihur et al., 2009) solves the following problem:
$$\pi^* = \arg\min_{\pi \in S_k} \sum_{i=1}^N \alpha_i\, d(\pi, \pi_i),$$
where $\alpha_i \ge 0$, $\sum_{i=1}^N \alpha_i = 1$ represent the importance of each ranking, and $d$ is a distance between permutations. (Pihur et al., 2009) offer multiple choices of distance, such as Pearson or Kendall's tau. We fixed the distance to Pearson throughout our experiments.

F.4. Mean Win Rate and CDF Normalizers in the Portfolio

To unpack the notation in (15), consider a distribution $\mathcal{A}$ on the space of models. For a sample $X \sim D_i$ and a model $A \sim \mathcal{A}$, the normalization of the metric $m_i(\cdot)$ through its CDF can be written as follows:
$$F_{M_i}\big(m_i(A(X))\big) = E_{B \sim \mathcal{A}}\, E_{Y \sim D_i}\, 1_{m_i(B(Y)) \le m_i(A(X))}. \tag{20}$$
Hence, for a model A, on each evaluated sample the CDF normalizer computes a soft ranking of the evaluation of model A with metric $m_i$ on sample X, with respect to all models and all samples.

Remark F.3 (Mean Win Rate).
Note that in LLM leaderboards such as HELM and Hugging Face, the performance of a model A evaluated with a metric $m_i$ is summarized via a Mean Win Rate (MWR), either aggregated at the model level by comparing expected metrics,
$$\mathrm{MWR}_{A,M_i} = E_{B \sim \mathcal{A}}\, 1_{E_{X \sim D_i}[m_i(B(X))] \le E_{X \sim D_i}[m_i(A(X))]}, \tag{21}$$
or aggregated at the sample level, marginalizing over models with a max:
$$\mathrm{MWR}_{A,M_i} = E_{X \sim D_i}\, 1_{\max_{B \neq A} m_i(B(X)) \le m_i(A(X))}. \tag{22}$$
Contrasting (20), (21) and (22), we see that instead of looking at the MWR summary statistic, which does not allow one to consider all order statistics and relative orderings, nor the risks of tail events, we consider a full distributional benchmarking in the metrics portfolio approach.

Figure 4: (a) An example of almost first order stochastic dominance: plots of the quantile functions of U and V; the dashed area is the violation set of the first order stochastic dominance of U on V. (b) An example of almost second order stochastic dominance: plots of the integrated quantile functions; the dashed area is the violation set of the second order stochastic dominance of X on Y.

H. Central Limit Theorems

H.1. CLT for $\varepsilon$-SSD

Theorem H.1 (Central Limit Theorem for $\varepsilon$-SSD). Assume that $F_X, F_Y$ are supported on intervals$^a$ in $[-M, M]$ and have pdfs $f_x, f_y$ such that $f_x'(p)/f_x^3(p)$ and $f_y'(p)/f_y^3(p)$ are bounded almost everywhere on the supports of $f_x$ and $f_y$ respectively. Assume we have n samples from $F_X$ and m samples from $F_Y$, with $n, m \to \infty$ such that $\frac{n}{n+m} \to \lambda$ for some $\lambda$. Then
$$\sqrt{\frac{mn}{m+n}}\big(\varepsilon_{IQ}(F_X^n, F_Y^m) - \varepsilon_{IQ}(F_X, F_Y)\big) \rightsquigarrow N\big(0, \sigma_\lambda^2(F_X, F_Y)\big),$$
where
$$\sigma_\lambda^2(F_X, F_Y) = \frac{1}{d_{IQ}^8(F_X, F_Y)}\big[(1-\lambda)\,\mathrm{Var}(v_X(U)) + \lambda\,\mathrm{Var}(v_Y(U))\big],$$
for $U \sim \mathrm{Unif}[0,1]$,
$$v_Y(t) = 2\,\frac{1}{f_y(F_Y^{-1}(t))}\int_t^1 \big(F_X^{(-2)}(p) - F_Y^{(-2)}(p)\big)_+\,dp, \qquad v_X(t) = 2\,\frac{1}{f_x(F_X^{-1}(t))}\int_t^1 \big(F_X^{(-2)}(p) - F_Y^{(-2)}(p)\big)_-\,dp.$$

$^a$ The intervals for $F_X$ and for $F_Y$ need not coincide.

Remark H.2 (Non-independent samples).
Theorem H.1 assumes that the n-sample from $F_X$ is independent of the m-sample from $F_Y$. Consider instead the setting where there are n samples from $F_X$ and from $F_Y$ that are dependent (e.g., X and Y are evaluations of different models applied to the same data). A general dependence structure can be described as follows. Suppose $(X, Y)$ has marginals $X \sim F_X$, $Y \sim F_Y$, with some unknown dependence structure (optionally described by the copula $C_{XY}(u_x, u_y) = \Pr(F_X(X) \le u_x, F_Y(Y) \le u_y)$). Let $(U_x, U_y) = (F_X(X), F_Y(Y)) \sim C_{XY}$. Note that $U_x$ and $U_y$ have marginals equal to $\mathrm{Unif}([0,1])$, but $U_x$ and $U_y$ may be dependent. Hence the variances in each term of the decomposition (24) in the appendix cannot simply be added. Instead, one should modify the result of Theorem H.1 to use
$$\sigma_\lambda^2(F_X, F_Y) = \frac{1}{d_{IQ}^8(F_X, F_Y)}\,\mathrm{Var}\big[v_X(U_x) + v_Y(U_y)\big].$$

H.2. CLT for Relative Statistics

We focus here on presenting the Central Limit Theorem for SSD. The relative FSD has a similar form and we omit its statement here.

Theorem H.3 (Central Limit Theorem for Relative SSD). Assume $F_{1n}, \dots, F_{kn}$ are available and independent, and each $F_i$ satisfies the conditions of Theorem H.1. Then
$$\sqrt{n}\big(\hat{\bar\varepsilon}^{(2)}_{i_1,i_2}(F_n) - \bar\varepsilon^{(2)}_{i_1,i_2}(F)\big) \rightsquigarrow N\Big(0, \frac{1}{(k-1)^2}\sum_{i=1}^k \sigma_i^2(i_1, i_2)\Big),$$
where
$$\sigma_{i_1}^2(i_1,i_2) = \mathrm{Var}\Big(\frac{2 v^{(1)}_{i_1 i_2}(U_{i_1})}{d_{IQ}^4(F_{i_1},F_{i_2})} + \sum_{j \neq i_1,i_2} \frac{v^{(1)}_{i_1 j}(U_{i_1})}{d_{IQ}^4(F_{i_1},F_j)}\Big),$$
$$\sigma_{i_2}^2(i_1,i_2) = \mathrm{Var}\Big(\frac{2 v^{(2)+}_{i_1 i_2}(U_{i_2})}{d_{IQ}^4(F_{i_1},F_{i_2})} - \sum_{j \neq i_1,i_2} \frac{v^{(1)}_{i_2 j}(U_{i_2})}{d_{IQ}^4(F_{i_2},F_j)}\Big),$$
$$\sigma_j^2(i_1,i_2) = \mathrm{Var}\Big(\frac{v^{(2)+}_{i_1 j}(U_j)}{d_{IQ}^4(F_{i_1},F_j)} - \frac{v^{(2)+}_{i_2 j}(U_j)}{d_{IQ}^4(F_{i_2},F_j)}\Big), \quad j \neq i_1, i_2,$$
for $U_i \sim \mathrm{Unif}([0,1])$ all independent, and
$$v^{(1)}_{ij}(t) = 2\,\frac{dF_i^{-1}(t)}{dt}\int_t^1 \big(F_i^{(-2)}(p) - F_j^{(-2)}(p)\big)_-\,dp, \qquad v^{(2)+}_{ij}(t) = 2\,\frac{dF_j^{-1}(t)}{dt}\int_t^1 \big(F_i^{(-2)}(p) - F_j^{(-2)}(p)\big)_+\,dp.$$

Remark H.4 (Dependent samples). If the $F_{in}$ are dependent, an expression similar to that of Remark H.2 for the absolute testing case also holds here. The statement is omitted.

I. Proof of Theorem I.1

Theorem I.1 ($d_{IQ}$ is a metric).
$d_{IQ}$ is a metric on the space of univariate distributions with continuous CDFs; moreover, it metrizes the weak topology.

Proof. First, we show that $d_{IQ}(F, G) = 0$ if and only if $F = G$. The forward direction is obvious. For the reverse direction, if $d_{IQ}(F, G) = 0$, then $F^{(-2)}(t) = G^{(-2)}(t)$ a.e. By the continuity of integrated quantiles, this implies $F^{(-2)} = G^{(-2)}$ everywhere. Then, since $F^{(-1)}(t)$ is simply the derivative of $F^{(-2)}(t)$ with respect to $t$,$^7$ $F^{(-1)} = G^{(-1)}$ everywhere by differentiating both sides of $F^{(-2)}(t) = G^{(-2)}(t)$. Hence $F = G$, since distributions are uniquely determined by their quantile functions. The triangle inequality follows from the triangle inequality of the $L_2$ norm, since $\sqrt{\int_0^1 (F^{(-2)}(t) - G^{(-2)}(t))^2\,dt} = \|F^{(-2)} - G^{(-2)}\|_{L_2([0,1])}$. Hence $d_{IQ}$ is a metric. By Theorem 10 in (Gushchin & Borzykh, 2017), we know that a random variable $X_{(i)}$ converges weakly to $X$ (with CDF $F_{(i)}$) if and only if $F^{(-2)}_{(i)}$ converges uniformly to $F^{(-2)}$. Hence $d_{IQ}$ must metrize weak convergence.

$^7$ This follows because $F^{(-2)}$ is the integral of the finite-valued quantile function $F^{(-1)}(t)$.

J. Proofs of Central Limit Theorems

J.1. Absolute Testing: Proof of Theorem H.1

Note that for $U_i$ and $V_i$ an n-sample and an m-sample, respectively, from $\mathrm{Unif}([0,1])$, we can obtain $X_i, Y_i$ as $X_i = F^{-1}(U_i)$, $Y_i = G^{-1}(V_i)$. Let $H_{n,1}$ and $H_{m,2}$ be the empirical d.f.s of the $U_i$ and $V_i$ respectively. We have
$$F_n^{-1}(t) = F^{-1}(H_{n,1}^{-1}(t)), \qquad F_n^{(-2)}(t) = \int_0^t F_n^{-1}(p)\,dp = \int_0^t F^{-1}(H_{n,1}^{-1}(p))\,dp.$$
We are interested in
$$\varepsilon_{IQ}(F_n, G_m) = \frac{\int_{A_0} \big(F_n^{(-2)}(t) - G_m^{(-2)}(t)\big)^2\,dt}{d_{IQ}^2(F_n, G_m)}, \qquad A_0 = \{ t \in (0,1) : G_m^{(-2)}(t) > F_n^{(-2)}(t) \},$$
where $A_0$ is the violation set. It is shown in (Gushchin & Borzykh, 2017) (Theorem 10 therein) that integrated quantiles converge uniformly, i.e., $F_n^{(-2)}(t) \to F^{(-2)}(t)$. As an immediate consequence, we have $\varepsilon_{IQ}(F_n, G_m) \to \varepsilon_{IQ}(F, G)$ a.s.
We apply the following decomposition and bound the two terms separately:
$$\varepsilon_{IQ}(F_n, G_m) - \varepsilon_{IQ}(F, G) = \big(\varepsilon_{IQ}(F_n, G_m) - \varepsilon_{IQ}(F, G_m)\big) + \big(\varepsilon_{IQ}(F, G_m) - \varepsilon_{IQ}(F, G)\big). \tag{23}$$
We derive asymptotic normality of these terms for $G_m$; the proof for $F_n$ is identical by symmetry. We introduce the statistics
$$S_m = \int_0^1 \big(F^{(-2)}(t) - G_m^{(-2)}(t)\big)^2\,dt, \quad S_m^+ = \int_0^1 \big(F^{(-2)}(t) - G_m^{(-2)}(t)\big)_+^2\,dt, \quad S_m^- = \int_0^1 \big(F^{(-2)}(t) - G_m^{(-2)}(t)\big)_-^2\,dt.$$
The nonrandom $S, S^+, S^-$ are defined similarly, with $G$ instead of $G_m$. Next, set
$$T_m = \sqrt{m}(S_m - S), \qquad T_m^+ = \sqrt{m}(S_m^+ - S^+), \qquad T_m^- = \sqrt{m}(S_m^- - S^-).$$

Theorem J.1. Assume that G is supported on an interval that is a subset of $[-M, M]$, and has pdf g such that $g'(p)/g^3(p)$ is bounded almost everywhere on the support of g. Then
$$T_m = \alpha_{m,2}(v) + o_P(1), \qquad T_m^+ = \alpha_{m,2}(v^+) + o_P(1), \qquad T_m^- = \alpha_{m,2}(v^-) + o_P(1),$$
where we define $\alpha_{m,2}(t) = \sqrt{m}\,(t - H_{m,1}^{-1}(t))$ and $\alpha_{m,2}(v) = \int_0^1 v(t)\,\alpha_{m,2}(t)\,dt$, and
$$v(t) = 2\,\frac{1}{g(G^{-1}(t))}\int_t^1 \big(F^{(-2)}(p) - G^{(-2)}(p)\big)\,dp,$$
$$v^+(t) = 2\,\frac{1}{g(G^{-1}(t))}\int_t^1 \big(F^{(-2)}(p) - G^{(-2)}(p)\big)_+\,dp, \qquad v^-(t) = 2\,\frac{1}{g(G^{-1}(t))}\int_t^1 \big(F^{(-2)}(p) - G^{(-2)}(p)\big)_-\,dp.$$

Proof. We begin with $T_m$. Note that$^8$
$$T_m = \sqrt{m}(S_m - S) = \sqrt{m}\int_0^1 \Big[\big(F^{(-2)}(t) - G_m^{(-2)}(t)\big)^2 - \big(F^{(-2)}(t) - G^{(-2)}(t)\big)^2\Big]\,dt$$
$$= \sqrt{m}\int_0^1 \big[2F^{(-2)}(t) - G_m^{(-2)}(t) - G^{(-2)}(t)\big]\big(G^{(-2)}(t) - G_m^{(-2)}(t)\big)\,dt$$
$$= 2\sqrt{m}\int_0^1 \big[F^{(-2)}(t) - G^{(-2)}(t)\big]\big(G^{(-2)}(t) - G_m^{(-2)}(t)\big)\,dt + o_P(1)$$
$$= 2\sqrt{m}\int_0^1 \big[F^{(-2)}(t) - G^{(-2)}(t)\big]\Big(\int_0^t G^{(-1)}(p) - G^{(-1)}(H_{m,1}^{-1}(p))\,dp\Big)\,dt + o_P(1).$$
Let us integrate by parts:
$$\int_0^1 \big[F^{(-2)}(t) - G^{(-2)}(t)\big]\Big(\int_0^t G^{(-1)}(p) - G^{(-1)}(H_{m,1}^{-1}(p))\,dp\Big)\,dt$$
$$= \Big(\int_0^1 F^{(-2)}(t) - G^{(-2)}(t)\,dt\Big)\Big(\int_0^1 G^{(-1)}(t) - G^{(-1)}(H_{m,1}^{-1}(t))\,dt\Big) - \int_0^1 \Big(\int_0^t F^{(-2)}(p) - G^{(-2)}(p)\,dp\Big)\big[G^{(-1)}(t) - G^{(-1)}(H_{m,1}^{-1}(t))\big]\,dt$$
$$= \int_0^1 \Big(\int_t^1 F^{(-2)}(p) - G^{(-2)}(p)\,dp\Big)\big[G^{(-1)}(t) - G^{(-1)}(H_{m,1}^{-1}(t))\big]\,dt$$
$$= \int_0^1 \Big(\int_t^1 F^{(-2)}(p) - G^{(-2)}(p)\,dp\Big)\frac{dG^{(-1)}(t)}{dt}\big(t - H_{m,1}^{-1}(t)\big)\,dt + O_P\Big(\int_0^1 \Big|\int_t^1 F^{(-2)}(p) - G^{(-2)}(p)\,dp\Big|\big(t - H_{m,1}^{-1}(t)\big)^2\,dt\Big)$$
$$= \int_0^1 \Big(\int_t^1 F^{(-2)}(p) - G^{(-2)}(p)\,dp\Big)\frac{dG^{(-1)}(t)}{dt}\big(t - H_{m,1}^{-1}(t)\big)\,dt + o_P(1).$$
In the penultimate step we have used a first-order Taylor expansion of $G^{(-1)}$ via the assumption that $\frac{d^2 G^{(-1)}(t)}{dt^2} = -\frac{g'(G^{(-1)}(t))}{g^3(G^{(-1)}(t))}$ is bounded almost everywhere, and in the final step we have noted that
$$\sqrt{m}\int_0^1 \Big|\int_t^1 F^{(-2)}(p) - G^{(-2)}(p)\,dp\Big|\big(t - H_{m,1}^{-1}(t)\big)^2\,dt \le 2M\sqrt{m}\int_0^1 \big(t - H_{m,1}^{-1}(t)\big)^2\,dt = o_P(1),$$
since the supports of F and G lie in $[-M, M]$ and $\int_0^1 (t - H_{m,1}^{-1}(t))^2\,dt = O_P(1/m)$. We then have $T_m = \alpha_{m,2}(v) + o_P(1)$, where $\alpha_{m,2}(t) = \sqrt{m}\,(t - H_{m,1}^{-1}(t))$ and $\alpha_{m,2}(v) = \int_0^1 v(t)\,\alpha_{m,2}(t)\,dt$, with
$$v(t) = 2\,\frac{dG^{(-1)}(t)}{dt}\int_t^1 \big(F^{(-2)}(p) - G^{(-2)}(p)\big)\,dp.$$
Similarly, $T_m^+ = \alpha_{m,2}(v^+) + o_P(1)$ and $T_m^- = \alpha_{m,2}(v^-) + o_P(1)$, with
$$v^+(t) = 2\,\frac{dG^{(-1)}(t)}{dt}\int_t^1 \big(F^{(-2)}(p) - G^{(-2)}(p)\big)_+\,dp, \qquad v^-(t) = 2\,\frac{dG^{(-1)}(t)}{dt}\int_t^1 \big(F^{(-2)}(p) - G^{(-2)}(p)\big)_-\,dp.$$

$^8$ Convergence here is uniform convergence of the integrated quantiles.

Corollary J.2. Assume that G is supported on an interval in $[-M, M]$ and has pdf g such that $g'(p)/g^3(p)$ is bounded almost everywhere on the support of g. Then, as $m \to \infty$,
$$\sqrt{m}\big(\epsilon_{IQ}(F, G_m) - \epsilon_{IQ}(F, G)\big) \rightsquigarrow N(0, \sigma^2),$$
and if additionally $n \to \infty$,
$$\sqrt{m}\big(\epsilon_{IQ}(F_n, G_m) - \epsilon_{IQ}(F_n, G)\big) \rightsquigarrow N(0, \sigma^2),$$
where, for $U \sim \mathrm{Unif}([0,1])$,
$$\sigma^2 = \frac{\mathrm{Var}(v^+(U))}{d_{IQ}^8(F, G)}.$$

Proof. Note that by Theorem J.1,
$$\sqrt{m}\big(\epsilon_{IQ}(F, G_m) - \epsilon_{IQ}(F, G)\big) = \sqrt{m}\Big(\frac{S_m^+}{S_m} - \frac{S^+}{S}\Big) = \frac{\sqrt{m}}{S\,S_m}\big(S\,S_m^+ - S^+\,S_m\big),$$
which behaves asymptotically like $\alpha_{m,2}(v^+)$ terms, since $S_m \to S$ a.s. Recalling the definition of $\alpha_{m,2}$ yields asymptotic normality with zero mean, as in (Del Barrio et al., 2018), with variance as calculated in the corollary statement. The case of $\sqrt{m}(\epsilon_{IQ}(F_n, G_m) - \epsilon_{IQ}(F_n, G))$ follows similarly, since the integrated quantiles converge weakly as $F_n \to F$.

Continuing with the main proof, recalling (23) and using Corollary J.2, along with the asymptotic independence of the two terms and the fact that $\frac{n}{n+m} \to \lambda$, we have
$$\sqrt{\frac{mn}{m+n}}\big(\varepsilon_{IQ}(F_n, G_m) - \varepsilon_{IQ}(F, G)\big) = \sqrt{1-\lambda}\,\sqrt{n}\big(\varepsilon_{IQ}(F_n, G_m) - \varepsilon_{IQ}(F, G_m)\big) + \sqrt{\lambda}\,\sqrt{m}\big(\varepsilon_{IQ}(F, G_m) - \varepsilon_{IQ}(F, G)\big) + o_P(1) \tag{24}$$
$$\rightsquigarrow N\big(0, \sigma_\lambda^2(F, G)\big), \qquad \sigma_\lambda^2(F, G) = \frac{1}{d_{IQ}^8(F, G)}\big[(1-\lambda)\,\mathrm{Var}(v_F(U)) + \lambda\,\mathrm{Var}(v_G(U))\big].$$
Here we have defined
$$v_G(t) = 2\,\frac{1}{g(G^{-1}(t))}\int_t^1 \big(F^{(-2)}(p) - G^{(-2)}(p)\big)_+\,dp, \qquad v_F(t) = 2\,\frac{1}{f(F^{-1}(t))}\int_t^1 \big(F^{(-2)}(p) - G^{(-2)}(p)\big)_-\,dp.$$

J.2.
Relative Testing: Proof of Theorem H.3

Recall that
$$\bar\varepsilon^{i_1,i_2}_{IQ}(F) = \bar\epsilon^{i_1}_{IQ}(F) - \bar\epsilon^{i_2}_{IQ}(F) = \frac{1}{k-1}\Big(\sum_{j \neq i_1}\epsilon^{i_1 j}_{IQ} - \sum_{j \neq i_2}\epsilon^{i_2 j}_{IQ}\Big) = \frac{1}{k-1}\Big(2\epsilon^{i_1 i_2}_{IQ} - 1 + \sum_{j \neq i_1,i_2}\big(\epsilon^{i_1 j}_{IQ} - \epsilon^{i_2 j}_{IQ}\big)\Big).$$
For compactness, let us introduce the differencing notation $\phi(\cdot)|_{F_n \to F} = \phi(F_n) - \phi(F)$. We seek a CLT for $\sqrt{n}\big(\hat{\bar\varepsilon}^{i_1,i_2}_{IQ}(F_n) - \bar\varepsilon^{i_1,i_2}_{IQ}(F)\big)$. Using the uniform convergence of integrated quantiles, this fluctuation decomposes into three groups of terms: a term I collecting the fluctuation in $F_{i_1,n}$ (through $2\epsilon_{IQ}(\cdot, F_{i_2})$ and $\sum_{j \neq i_1,i_2}\epsilon_{IQ}(\cdot, F_j)$, differenced over $F_{i_1,n} \to F_{i_1}$), a term II collecting the fluctuation in $F_{i_2,n}$ (through $2\epsilon_{IQ}(F_{i_1}, \cdot)$ and $\sum_{j \neq i_1,i_2}\epsilon_{IQ}(\cdot, F_j)$, differenced over $F_{i_2,n} \to F_{i_2}$), and a term III collecting the remaining fluctuations,
$$\frac{\sqrt{n}}{k-1}\sum_{j \neq i_1,i_2}\big(\epsilon_{IQ}(F_{i_1}, \cdot) - \epsilon_{IQ}(F_{i_2}, \cdot)\big)\big|_{F_{j,n} \to F_j}.$$
Note that I, II, and each term in the sum in III are all independent. Define
$$v^{(1)}_{ij}(t) = 2\,\frac{dF_i^{-1}(t)}{dt}\int_t^1 \big(F_i^{(-2)}(p) - F_j^{(-2)}(p)\big)\,dp, \qquad v^{(2)}_{ij}(t) = 2\,\frac{dF_j^{-1}(t)}{dt}\int_t^1 \big(F_i^{(-2)}(p) - F_j^{(-2)}(p)\big)\,dp,$$
and $v^{(1)+}_{ij}, v^{(2)+}_{ij}$ similarly with $(\cdot)_+$ in the integrand. Then, by the proof of Corollary J.2, each term in III converges:
$$\frac{\sqrt{n}}{k-1}\big(\epsilon_{IQ}(F_{i_1}, \cdot) - \epsilon_{IQ}(F_{i_2}, \cdot)\big)\big|_{F_{j,n} \to F_j} \rightsquigarrow \frac{1}{k-1}\,\alpha_{m,j}\Big(\frac{v^{(2)+}_{i_1 j}}{d_{IQ}^4(F_{i_1},F_j)} - \frac{v^{(2)+}_{i_2 j}}{d_{IQ}^4(F_{i_2},F_j)}\Big) \rightsquigarrow N\Big(0, \frac{1}{(k-1)^2}\sigma_j^2(i_1,i_2)\Big), \quad j \neq i_1, i_2,$$
with
$$\sigma_j^2(i_1,i_2) = \mathrm{Var}\Big(\frac{v^{(2)+}_{i_1 j}(U)}{d_{IQ}^4(F_{i_1},F_j)} - \frac{v^{(2)+}_{i_2 j}(U)}{d_{IQ}^4(F_{i_2},F_j)}\Big), \quad j \neq i_1, i_2,$$
and $U \sim \mathrm{Unif}([0,1])$.$^9$ Similarly, for I and II,
$$I \rightsquigarrow N\Big(0, \frac{1}{(k-1)^2}\sigma_{i_1}^2(i_1,i_2)\Big), \qquad II \rightsquigarrow N\Big(0, \frac{1}{(k-1)^2}\sigma_{i_2}^2(i_1,i_2)\Big),$$
$$\sigma_{i_1}^2(i_1,i_2) = \mathrm{Var}\Big(\frac{2 v^{(1)}_{i_1 i_2}(U)}{d_{IQ}^4(F_{i_1},F_{i_2})} + \sum_{j \neq i_1,i_2}\frac{v^{(1)}_{i_1 j}(U)}{d_{IQ}^4(F_{i_1},F_j)}\Big), \qquad \sigma_{i_2}^2(i_1,i_2) = \mathrm{Var}\Big(\frac{2 v^{(2)+}_{i_1 i_2}(U)}{d_{IQ}^4(F_{i_1},F_{i_2})} - \sum_{j \neq i_1,i_2}\frac{v^{(1)}_{i_2 j}(U)}{d_{IQ}^4(F_{i_2},F_j)}\Big).$$
Putting everything together via independence,
$$\sqrt{n}\big(\hat{\bar\varepsilon}^{i_1,i_2}_{IQ}(F_n) - \bar\varepsilon^{i_1,i_2}_{IQ}(F)\big) \rightsquigarrow N\Big(0, \frac{1}{(k-1)^2}\sum_{i=1}^k \sigma_i^2(i_1,i_2)\Big).$$

K. Consistency of Bootstrapping

In this section, we consider the relaxation measure using the CDFs:$^{10}$
$$\epsilon_\ell(F_X, F_Y) = \frac{\int \big(F_Y^{(\ell)}(t) - F_X^{(\ell)}(t)\big)_+^2\,dt}{\int \big(F_Y^{(\ell)}(t) - F_X^{(\ell)}(t)\big)^2\,dt}.$$
Note that we can relax FSD as follows:
$$Y \succeq_{\varepsilon\text{-}FSD} X \iff \epsilon_1(F_X, F_Y) \le \varepsilon. \tag{25}$$
Similarly, we can relax SSD as follows:
$$Y \succeq_{\varepsilon\text{-}SSD} X \iff \epsilon_2(F_X, F_Y) \le \varepsilon.$$
(26)

We will prove bootstrap consistency for $\ell = 1$ (approximate first-order dominance); the proof for $\ell = 2$ (approximate second-order dominance) is similar. We seek to show that the bootstrapped variance $\mathrm{Var}^*\big(\hat\varepsilon_1(F^{*n}_X, F^{*m}_Y)\big)$ is an asymptotically consistent estimator of $\mathrm{Var}\big(\hat\varepsilon_1(F^{n}_X, F^{m}_Y)\big)$, i.e. their ratio goes to 1:

$$\frac{\mathrm{Var}^*\big(\hat\varepsilon_1(F^{*n}_X, F^{*m}_Y)\big)}{\mathrm{Var}\big(\hat\varepsilon_1(F^{n}_X, F^{m}_Y)\big)} \xrightarrow{p} 1.$$

Note that we can write this ratio as

$$\frac{\mathrm{Var}^*\big(T(F^{*n}_X, F^{*m}_Y)\big)}{\mathrm{Var}\big(T(F^{n}_X, F^{m}_Y)\big)}, \qquad T(F_X, F_Y) = \frac{\int \big(F_Y(t) - F_X(t)\big)_+^2\,dt}{\int \big(F_Y(t) - F_X(t)\big)^2\,dt}.$$

9 This $U \sim \mathrm{Unif}([0,1])$ is drawn simply for this variance calculation and is not dependent on anything outside of this equation.
10 The result using quantiles as described in the main text is less straightforward and is left for future work.

Consider the metric created by the sup norm $\rho_\infty(F, G) = \|F - G\|_\infty = \sup_x |F(x) - G(x)|$. Note that $T$ is continuously $\rho_\infty$-Fréchet differentiable in both arguments, due to the differentiability of the function $(\cdot)_+^2$ and integration. Specifically,

$$D_{1,(F_X,F_Y)}(G_X) = \frac{1}{\big(\int (F_Y(t) - F_X(t))^2\,dt\big)^2}\Big[\int \big(F_Y(t) - F_X(t)\big)_+^2\,dt \int 2\big(F_Y(t) - F_X(t)\big)G_X(t)\,dt - \int \big(F_Y(t) - F_X(t)\big)^2\,dt \int 2\big(F_Y(t) - F_X(t)\big)_+ G_X(t)\,dt\Big],$$

and similarly for $D_{2,(F_X,F_Y)}(G_Y)$.
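As a quick numerical sanity check of the derivative formula above (our own illustration, not the authors' code), the sketch below compares $D_{1,(F_X,F_Y)}(G_X)$ against a finite-difference quotient of $T$ on a discretization grid. The two logistic CDFs and the perturbation direction $G_X$ are illustrative choices; they cross at $t = 0$ so that both the numerator and denominator of $T$ are non-trivial.

```python
import numpy as np

# Grid discretization of the integrals defining T and D1.
t = np.linspace(-8.0, 8.0, 40001)
dt = t[1] - t[0]
FX = 1.0 / (1.0 + np.exp(-t))          # logistic CDF, scale 1
FY = 1.0 / (1.0 + np.exp(-t / 2.0))    # logistic CDF, scale 2 (crosses FX at t = 0)
G = np.exp(-t ** 2)                    # smooth perturbation direction G_X

def T(FX, FY):
    diff = FY - FX
    return np.sum(np.maximum(diff, 0.0) ** 2) / np.sum(diff ** 2)

def D1(FX, FY, G):
    # Quotient rule applied to T = N / D, perturbing the first argument.
    diff = FY - FX
    N = np.sum(np.maximum(diff, 0.0) ** 2) * dt       # numerator of T
    D = np.sum(diff ** 2) * dt                        # denominator of T
    dN = -np.sum(2.0 * np.maximum(diff, 0.0) * G) * dt  # d/d(eps) of N at FX + eps*G
    dD = -np.sum(2.0 * diff * G) * dt                   # d/d(eps) of D at FX + eps*G
    return (dN * D - N * dD) / D ** 2

eps = 1e-6
fd = (T(FX + eps * G, FY) - T(FX, FY)) / eps          # finite-difference quotient
an = D1(FX, FY, G)                                    # analytic Frechet derivative
print(fd, an)
```

By symmetry of the two logistic CDFs, $T(F_X, F_Y) = 1/2$ here, and the analytic and finite-difference derivatives agree up to discretization and step-size error.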
Since $T$ is continuously differentiable, by the definition of continuous Fréchet differentiability we can write (see Chapter 2 in (Shao & Tu, 2012)) the following:

$$T(F^{*n}_X, F^{*m}_Y) - T(F^{n}_X, F^{m}_Y) = D_{1,(F^n_X, F^m_Y)}(F^{*n}_X - F^{n}_X) + D_{2,(F^n_X, F^m_Y)}(F^{*m}_Y - F^{m}_Y) + \big(\rho_\infty(F^{*n}_X, F^{n}_X) + \rho_\infty(F^{*m}_Y, F^{m}_Y)\big)\,\epsilon^*_{n,m}$$
$$T(F^{*n}_X, F^{m}_Y) - T(F^{n}_X, F^{m}_Y) = D_{1,(F^n_X, F^m_Y)}(F^{*n}_X - F^{n}_X) + \rho_\infty(F^{*n}_X, F^{n}_X)\,\epsilon^*_{n}$$
$$T(F^{n}_X, F^{*m}_Y) - T(F^{n}_X, F^{m}_Y) = D_{2,(F^n_X, F^m_Y)}(F^{*m}_Y - F^{m}_Y) + \rho_\infty(F^{*m}_Y, F^{m}_Y)\,\epsilon^*_{m}$$
$$T(F^{n}_X, F^{m}_Y) - T(F_X, F_Y) = D_{1,(F_X, F_Y)}(F^{n}_X - F_X) + D_{2,(F_X, F_Y)}(F^{m}_Y - F_Y) + \big(\rho_\infty(F^{n}_X, F_X) + \rho_\infty(F^{m}_Y, F_Y)\big)\,\epsilon_{n,m}$$
$$T(F^{n}_X, F_Y) - T(F_X, F_Y) = D_{1,(F_X, F_Y)}(F^{n}_X - F_X) + \rho_\infty(F^{n}_X, F_X)\,\epsilon_{n}$$
$$T(F_X, F^{m}_Y) - T(F_X, F_Y) = D_{2,(F_X, F_Y)}(F^{m}_Y - F_Y) + \rho_\infty(F^{m}_Y, F_Y)\,\epsilon_{m}$$

where $\epsilon^*_{n,m}, \epsilon^*_n, \epsilon^*_m, \epsilon_{n,m}, \epsilon_n, \epsilon_m \to 0$ as $n, m \to \infty$. Hence, combining terms,

$$T(F^{*n}_X, F^{*m}_Y) - T(F^{n}_X, F^{m}_Y) = \big(T(F^{*n}_X, F^{m}_Y) - T(F^{n}_X, F^{m}_Y)\big) + \big(T(F^{n}_X, F^{*m}_Y) - T(F^{n}_X, F^{m}_Y)\big) + o_p(n^{-1/2} + m^{-1/2}),$$
$$T(F^{n}_X, F^{m}_Y) - T(F_X, F_Y) = \big(T(F^{n}_X, F_Y) - T(F_X, F_Y)\big) + \big(T(F_X, F^{m}_Y) - T(F_X, F_Y)\big) + o_p(n^{-1/2} + m^{-1/2}).$$

Hence, assuming independence of the $n$-sample and the $m$-sample and of their respective bootstrap resamplings,

$$\frac{\mathrm{Var}^*\big(T(F^{*n}_X, F^{*m}_Y)\big)}{\mathrm{Var}\big(T(F^{n}_X, F^{m}_Y)\big)} \xrightarrow{a.s.} \frac{\mathrm{Var}^*\big(T(F^{n}_X, F^{*m}_Y)\big) + \mathrm{Var}^*\big(T(F^{*n}_X, F^{m}_Y)\big)}{\mathrm{Var}\big(T(F_X, F^{m}_Y)\big) + \mathrm{Var}\big(T(F^{n}_X, F_Y)\big)},$$

i.e. we add the variances. We have now reduced the task to the one-sided setting where the bootstrap is only done in one argument of $T$. Hence we can apply Theorem 3.10 of (Shao & Tu, 2012), which states that for $\rho_\infty$-Fréchet differentiable functions of a CDF, the bootstrap variance estimator is asymptotically consistent if the support is bounded (more general results can be stated but are omitted for simplicity). Applying this separately to each of the two variances, we have the following.

Proposition K.1. Suppose $F_X$, $F_Y$ have support contained in $[-M, M]$ for some $M > 0$, and $F^n_X$, $F^m_Y$ arise from independent samples. Then

$$\frac{\mathrm{Var}^*\big(\hat\varepsilon_1(F^{*n}_X, F^{*m}_Y)\big)}{\mathrm{Var}\big(\hat\varepsilon_1(F^{n}_X, F^{m}_Y)\big)} \xrightarrow{a.s.} 1.$$
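In practice, the bootstrap variance above is obtained by resampling each sample with replacement and recomputing the statistic. A minimal sketch (illustrative only: the Beta score distributions, sample sizes, grid, and number of resamples are our own choices; bounded support on $[0,1]$ matches the assumption of Proposition K.1):

```python
import numpy as np

rng = np.random.default_rng(0)

def eps1(x, y, grid):
    """Empirical relaxation statistic eps_1 (approximate FSD) on a grid."""
    Fx = np.searchsorted(np.sort(x), grid, side="right") / x.size
    Fy = np.searchsorted(np.sort(y), grid, side="right") / y.size
    diff = Fy - Fx
    return np.sum(np.maximum(diff, 0.0) ** 2) / np.sum(diff ** 2)

# Bounded-support metric samples, as required by Proposition K.1.
x = rng.beta(2.0, 2.0, size=500)   # scores of model X on [0, 1]
y = rng.beta(2.5, 2.0, size=500)   # scores of model Y on [0, 1]
grid = np.linspace(0.0, 1.0, 2001)

stat = eps1(x, y, grid)

# Bootstrap variance: resample the two samples independently.
B = 500
boots = np.array([
    eps1(rng.choice(x, x.size, replace=True),
         rng.choice(y, y.size, replace=True), grid)
    for _ in range(B)
])
print(stat, boots.var(ddof=1))
```

Here `boots.var(ddof=1)` is the plug-in variance estimate whose asymptotic consistency Proposition K.1 guarantees; it is what calibrates the significance threshold of the relaxed-dominance test.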
L. Additional Experimental Results

L.1. Statistical Significance on Synthetic Data

We examine the statistical properties of our tests as a function of sample size. We purposely design synthetic score distributions to represent challenging problems with large overlap between the distributions and a considerable violation ratio, but where one would still like to have an ordering among the variables. For this we consider two Gaussian distributions, one with mean µ = 0 and standard deviation σ = 1, and the other with mean µ = 0.5 and standard deviation σ = 2. In the top panels of Figure 5 we show the PDF, CDF, and integrated quantile function of these two Gaussians, illustrating the large violation ratio. The orange distribution can be calculated to be 0.2-FSD and 0.45-SSD over the blue distribution. Note that these two ε values are not comparable, due to the differences in the definitions. In Figure 5, we conduct experiments illustrating the power of our tests for the absolute tests of the hypotheses H0,FSD = 0.45-FSD and H0,SSD = 0.45-SSD. We also use our relative tests, which in this two-variable case (as noted in the main text) are equivalent to testing H0,FSD = 0.5-FSD and H0,SSD = 0.5-SSD. The bottom left panel in Figure 5 shows the True Positive Rate for the different types of tests that we developed: relative test with quantile function, relative test with Integrated Quantile Function, absolute test with quantile function, and absolute test with Integrated Quantile Function. As expected, all tests quickly converge towards a True Positive Rate of 1.0 as the sample size grows.
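The violation ratios for these two Gaussians can be approximated numerically. The sketch below estimates the CDF-based relaxations ε1 and ε2 of Eqs. (25)-(26) from large samples; note that these CDF-based values differ by definition from the quantile-based 0.2-FSD and 0.45-SSD figures quoted above, and the truncation grid is our own choice (the ε2 denominator grows with the truncation window, since the integrated-CDF gap tends to a nonzero constant).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
blue = rng.normal(0.0, 1.0, n)      # mu = 0.0, sigma = 1.0
orange = rng.normal(0.5, 2.0, n)    # mu = 0.5, sigma = 2.0

grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]

def ecdf(s, grid):
    return np.searchsorted(np.sort(s), grid, side="right") / s.size

def eps_ell(FX, FY, order):
    # order 1: CDFs (FSD relaxation); order 2: integrated CDFs F^(2) (SSD relaxation).
    if order == 2:
        FX, FY = np.cumsum(FX) * dx, np.cumsum(FY) * dx
    diff = FY - FX
    return np.sum(np.maximum(diff, 0.0) ** 2) / np.sum(diff ** 2)

Fb, Fo = ecdf(blue, grid), ecdf(orange, grid)
# How far the orange model is from dominating the blue one, to first and second order.
e1, e2 = eps_ell(Fb, Fo, 1), eps_ell(Fb, Fo, 2)
print(e1, e2)
```

Both relaxations come out strictly between 0 and 1: the heavier left tail of the orange Gaussian prevents exact dominance, yet the violation is far from total, which is exactly the regime the tests are designed for.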
[Figure 5 panels: PDF, CDF, and integrated quantile function of the two Gaussians (µ = 0.5, σ = 2.0 vs. µ = 0.0, σ = 1.0), and True Positive Rate vs. sample size for the four tests: relative test Q, relative test IQ, absolute test Q, absolute test IQ.]

Figure 5: True Positive Rate vs sample size for Gaussian distributions. We compute the True Positive Rate of our stochastic dominance methods on the test distributions in the top panels for different sample sizes. Decisions are made using a confidence threshold of α = 0.05 and τ = 0.45 (for the absolute tests), and rates are computed over 1000 repetitions of the tests. Note that the FSD and SSD curves should not be compared, due to differences in the underlying hypotheses.

Figure 6: True Positive Rate vs sample size for Lognormal distributions generated as X = e^(µ+σZ), where Z is a standard Gaussian variable. We compute the True Positive Rate of our stochastic dominance tests as in Fig. 5, but in this case we examine the True Positive Rate for heavy-tailed distributions, exemplified by Lognormal distributions.

L.2. Mix-Instruct

Results for the Mix-Instruct data are shown in Figures 7 and 8, as well as in Table 5.

L.3. Toxicity

Toxicity results are given in Table 6.

L.4. Ablation Study on Toxicity: Independent versus Empirical Copula Portfolio

When comparing EC and IC portfolio aggregation using R-SSD to rank the LLMs, we see in Figures 9 and 10 that the two aggregation approaches lead to the same ranking.
While the computational complexity of IC is linear in the number of points, that of EC is quadratic. Given that the two aggregations yield the same ranking, IC is the more efficient aggregation technique.

L.5. Fat Left Tails of Metrics and Inconsistency of Mean-Variance with SSD

When the evaluated metrics have fat tails, the Mean-Variance ranking can be inconsistent with SSD. See Table 7.

| | Open assistant | koala 7b | alpaca | llama | flan-t5 | stablelm | Vicuna | Dolly (v2) | Moss | ChatGLM 6b | mpt-7b | mpt-7b instruct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean Win Rates | | | | | | | | | | | | |
| RA(MWR @M) | 1 | 6 | 2 | 8 | 5 | 7 | 3 | 10 | 9 | 4 | 11 | 12 |
| MWR @P(IC) | 1 | 5 | 2 | 7 | 6 | 8 | 3 | 9 | 10 | 4 | 11 | 12 |
| Relative FSD | | | | | | | | | | | | |
| RA(R-FSD @M) | 1 | 6 | 2 | 5 | 8 | 11 | 4 | 10 | 7 | 3 | 9 | 12 |
| R-FSD @P(IC) | 1 | 6 | 2 | 5 | 11 | 10 | 4 | 8 | 7 | 3 | 9 | 12 |
| R-FSD @ChatGPT | 1 | 7 | 3 | 4 | 12 | 11 | 2 | 8 | 5 | 6 | 9 | 10 |
| Relative SSD | | | | | | | | | | | | |
| RA(R-SSD @M) | 1 | 7 | 2 | 5 | 12 | 10 | 4 | 9 | 6 | 3 | 8 | 11 |
| R-SSD @P(IC) | 1 | 6 | 3 | 5 | 12 | 11 | 4 | 7 | 8 | 2 | 9 | 10 |
| R-SSD @ChatGPT | 1 | 8 | 3 | 4 | 11 | 12 | 2 | 7 | 5 | 6 | 9 | 10 |
| Mean-Risk Models | | | | | | | | | | | | |
| RA(µX − ΓX) @M | 1 | 7 | 2 | 5 | 12 | 11 | 4 | 9 | 6 | 3 | 8 | 10 |
| RA(µX − ρX) @P(IC) | 1 | 6 | 3 | 5 | 12 | 11 | 4 | 7 | 8 | 2 | 9 | 10 |

Table 4: Rankings of models on following instructions according to all tests, with the top 3 ranks highlighted. We see that SSD and Mean-Risk models are consistent. Note that RA(µX − ρX) @P(IC) denotes the aggregation of the rankings produced by (µX − ρX) @P(IC) for each risk measure ρX in Table 3.

Figure 7: Radar plot of mean-risk models of the portfolio on Mix-Instruct data. Note that the outer models are indeed the ones preferred by SSD in Table 5.

Figure 8: Empirical CDF and TVaR for the portfolio on Mix-Instruct data.

Figure 9: IC versus EC Portfolio Aggregation on Toxicity. Ranking of models using 40K samples, with Independent and Empirical Copula portfolios with R-SSD. We see that the two aggregation methods lead to similar results.
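The IC versus EC comparison in Figures 9 and 10 can be sketched as follows. This is our own illustration of the complexity argument, not the paper's implementation: we assume each sample's portfolio score is the copula evaluated at its marginal CDF transforms, so the independent copula (IC) takes a product of per-metric rank transforms (near-linear time), while the empirical copula (EC) needs all pairwise comparisons (quadratic time). The data and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-prompt scores for one model on d metrics (higher = better).
n, d = 1000, 3
scores = rng.normal(size=(n, d))

def marginal_ranks(scores):
    # Marginal CDF transform F_j(x_ij) via ranks -- O(n log n) per metric.
    return (scores.argsort(axis=0).argsort(axis=0) + 1) / scores.shape[0]

def ic_portfolio(scores):
    # Independent-copula aggregation: product of marginal CDF transforms.
    return marginal_ranks(scores).prod(axis=1)

def ec_portfolio(scores):
    # Empirical-copula aggregation: joint empirical CDF at each sample -- O(n^2) comparisons.
    below = (scores[None, :, :] <= scores[:, None, :]).all(axis=2)
    return below.mean(axis=1)

ic, ec = ic_portfolio(scores), ec_portfolio(scores)
print(np.corrcoef(ic, ec)[0, 1])  # agreement between the two portfolio scores
```

With independent metrics the two scores nearly coincide, mirroring the observation that IC and EC aggregation produce the same model ranking while IC is far cheaper.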
| | Open assistant | koala 7b | alpaca | llama | flan-t5 | stablelm | Vicuna | Dolly (v2) | Moss | ChatGLM 6b | mpt-7b | mpt-7b instruct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean Win Rates | | | | | | | | | | | | |
| RA(MWR @M) | 1 | 6 | 2 | 8 | 5 | 7 | 3 | 10 | 9 | 4 | 11 | 12 |
| MWR @P(IC) | 1 | 5 | 2 | 7 | 6 | 8 | 3 | 9 | 10 | 4 | 11 | 12 |
| Relative FSD | | | | | | | | | | | | |
| RA(R-FSD @M) | 1 | 6 | 2 | 5 | 8 | 11 | 4 | 10 | 7 | 3 | 9 | 12 |
| R-FSD @P(IC) | 1 | 6 | 2 | 5 | 11 | 10 | 4 | 8 | 7 | 3 | 9 | 12 |
| Relative SSD | | | | | | | | | | | | |
| RA(R-SSD @M) | 1 | 7 | 2 | 5 | 12 | 10 | 4 | 9 | 6 | 3 | 8 | 11 |
| R-SSD @P(IC) | 1 | 6 | 3 | 5 | 12 | 11 | 4 | 7 | 8 | 2 | 9 | 10 |
| R-SSD @ChatGPT | 1 | 8 | 3 | 4 | 11 | 12 | 2 | 7 | 5 | 6 | 9 | 10 |
| Absolute FSD | | | | | | | | | | | | |
| ε-FSD @P(IC), ε = 0.08 | 1 | 6 | 2 | 5 | 10 | 11 | 4 | 7 | 8 | 3 | 9 | 12 |
| ε-FSD @P(IC), ε = 0.25 | 1 | 6 | 2 | 5 | 12 | 10 | 4 | 7 | 8 | 3 | 9 | 11 |
| ε-FSD @P(IC), ε = 0.4 | 1 | 6 | 2 | 5 | 12 | 10 | 4 | 8 | 7 | 3 | 9 | 11 |
| Absolute SSD | | | | | | | | | | | | |
| ε-SSD @P(IC), ε = 0.08 | 1 | 6 | 3 | 5 | 12 | 11 | 4 | 7 | 8 | 2 | 9 | 10 |
| ε-SSD @P(IC), ε = 0.25 | 1 | 6 | 3 | 5 | 12 | 11 | 4 | 8 | 7 | 2 | 9 | 10 |
| ε-SSD @P(IC), ε = 0.4 | 1 | 6 | 3 | 5 | 12 | 11 | 4 | 7 | 8 | 2 | 9 | 10 |
| Mean-Risk Models | | | | | | | | | | | | |
| RA(µX − ρX) @P(IC) | 1 | 6 | 3 | 5 | 12 | 11 | 4 | 7 | 8 | 2 | 9 | 10 |

Table 5: Mix-Instruct extended results.

| Scenario | Llama 2 7b | Llama 2 13b | Llama 2 70b | MosaicML MPT 30b | Tiiuae Falcon 40b |
|---|---|---|---|---|---|
| Toxic Prompts | | | | | |
| RA(R-FSD @M) (Gen Only) | 3 | 2 | 4 | 1 | 5 |
| R-FSD @P(IC) (Gen Only) | 2 | 3 | 4 | 1 | 5 |
| RA(R-SSD @M) (Gen Only) | 3 | 2 | 4 | 1 | 5 |
| R-SSD @P(IC) (Gen Only) | 3 | 2 | 4 | 1 | 5 |
| RA(R-FSD @M) (Prompt + Gen) | 2 | 3 | 1 | 4 | 5 |
| R-FSD @P(IC) (Prompt + Gen) | 2 | 3 | 1 | 4 | 5 |
| RA(R-SSD @M) (Prompt + Gen) | 2 | 3 | 1 | 4 | 5 |
| R-SSD @P(IC) (Prompt + Gen) | 2 | 3 | 1 | 4 | 5 |
| Non-Toxic Prompts | | | | | |
| RA(R-FSD @M) (Gen Only) | 1 | 2 | 4 | 3 | 5 |
| R-FSD @P(IC) (Gen Only) | 1 | 2 | 3 | 4 | 5 |
| RA(R-SSD @M) (Gen Only) | 1 | 2 | 3 | 4 | 5 |
| R-SSD @P(IC) (Gen Only) | 1 | 2 | 3 | 4 | 5 |
| RA(R-FSD @M) (Prompt + Gen) | 3 | 2 | 4 | 1 | 5 |
| R-FSD @P(IC) (Prompt + Gen) | 1 | 2 | 4 | 3 | 5 |
| RA(R-SSD @M) (Prompt + Gen) | 1 | 2 | 3 | 4 | 5 |
| R-SSD @P(IC) (Prompt + Gen) | 1 | 2 | 4 | 3 | 5 |
| All Combined (Toxic + Non-Toxic Prompts) | | | | | |
| RA(R-FSD @M) (Gen Only) | 2 | 3 | 5 | 1 | 4 |
| R-FSD @P(IC) (Gen Only) | 2 | 3 | 5 | 1 | 4 |
| RA(R-SSD @M) (Gen Only) | 2 | 3 | 5 | 1 | 4 |
| R-SSD @P(IC) (Gen Only) | 2 | 3 | 5 | 1 | 4 |
| RA(R-FSD @M) (Prompt + Gen) | 3 | 4 | 5 | 1 | 2 |
| R-FSD @P(IC) (Prompt + Gen) | 3 | 4 | 5 | 1 | 2 |
| RA(R-SSD @M) (Prompt + Gen) | 3 | 4 | 5 | 1 | 2 |
| R-SSD @P(IC) (Prompt + Gen) | 3 | 4 | 5 | 1 | 2 |

Table 6:
Toxicity ranking extended results.

Figure 10: IC versus EC Portfolio Aggregation on Toxicity. Ranking of models using 20K samples, with Independent and Empirical Copula portfolios with R-SSD. We see that the two aggregation methods lead to similar results.

Figure 11: Identity Attack metric distribution computed on the Prompt+Generation outputs of highly toxic prompts.

| Scenario | Llama 2 7b | Llama 2 13b | Llama 2 70b | MosaicML MPT 30b | Tiiuae Falcon 40b |
|---|---|---|---|---|---|
| Non-Toxic Prompts | | | | | |
| Identity Attack Metric (Gen evaluation) | | | | | |
| Mean − Sigma | 1 | 3 | 4 | 2 | 5 |
| Mean − Gamma | 2 | 3 | 4 | 1 | 5 |
| Mean − nTVaR | 2 | 3 | 4 | 1 | 5 |
| SSD | 2 | 3 | 4 | 1 | 5 |
| Threat Metric (Prompt + Gen evaluation) | | | | | |
| Mean − Sigma | 1 | 3 | 2 | 4 | 5 |
| Mean − Gamma | 1 | 2 | 3 | 5 | 4 |
| Mean − nTVaR | 1 | 2 | 3 | 5 | 4 |
| SSD | 1 | 2 | 3 | 5 | 4 |

Table 7: Inconsistency of Mean − Sigma on toxicity metrics with SSD and other mean-risk models. This is due to the fact that the evaluated metric may have a fat left tail; see Figures 11 and 13.

Figure 12: Threat metric distribution computed on the Prompt+Generation outputs of less toxic prompts.

Figure 13: Identity Attack metric distribution computed on the Generation outputs of less toxic prompts.
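The inconsistency in Table 7 reflects a classical fact: Mean − Sigma penalizes upside and downside dispersion alike, so it need not respect stochastic dominance. A minimal synthetic counterexample (our own construction, using a heavy right tail rather than the heavy left tails of Figures 11 and 13, but exhibiting the same mechanism):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Model A: score always 0.  Model B: the same score, plus a rare large bonus.
a = np.zeros(n)
b = np.where(rng.random(n) < 0.01, 100.0, 0.0)

# B first-order (hence second-order) dominates A: no quantile of B is lower.
q = np.linspace(0.0, 1.0, 101)
fsd = bool(np.all(np.quantile(b, q) >= np.quantile(a, q)))

mean_sigma = lambda s: s.mean() - s.std()
print(fsd, mean_sigma(a), mean_sigma(b))  # Mean - Sigma nevertheless prefers A
```

Although B stochastically dominates A, the rare large payoff inflates the standard deviation enough that Mean − Sigma ranks B below A; SSD-based ranking avoids this pathology because it only penalizes downside dispersion.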