Published as a conference paper at ICLR 2025

CERTIFYING COUNTERFACTUAL BIAS IN LLMS

Isha Chaudhary1 Qian Hu2 Manoj Kumar3 Morteza Ziyadi2 Rahul Gupta2 Gagandeep Singh1
1 UIUC, 2 Amazon, 3 Oracle Health

Warning: This paper contains model outputs that are offensive in nature.

Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to large numbers of inputs and do not provide guarantees. Therefore, we propose LLMCert-B, the first framework that certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts (prompts differing only by demographic groups) sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes, sampled from prefix distributions, to a given set of prompts. We consider prefix distributions consisting of random token sequences, mixtures of manual jailbreaks, and perturbations of jailbreaks in the LLM's embedding space. We generate non-trivial certificates for SOTA LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive prefix distributions.

1 INTRODUCTION

Text-generating Large Language Models (LLMs) are increasingly being deployed in user-facing applications, such as chatbots (Lee et al., 2023; Brown et al., 2020a; Gemini Team, 2024), where they are popular for producing human-like texts (Shahriar and Hayawi, 2023). These LLMs are safety-trained (Wang et al., 2023) to avoid generating harmful content. However, despite safety training, they can produce texts that exhibit social biases and stereotypes (Kotek et al., 2023; Manvi et al., 2024; Hofmann et al., 2024). Such texts can result in representational harms (Suresh and Guttag, 2021; Blodgett et al., 2020) to protected demographic groups (subsets of the population that are negatively affected by bias). Representational harms include stereotyping, denigration, and misrepresentation of historically and structurally oppressed demographic groups. Although representational harms are harmful in their own right (Blodgett et al., 2020), as they can socially impact individuals and redefine social hierarchies, the resulting allocation harms (Gallegos et al., 2024a) can lead to economic losses for protected groups and are therefore regulated by anti-discrimination laws such as (Sherry, 1965). Language is considered an important factor for labeling, modifying, and transmitting beliefs about demographic groups, and can reinforce social inequalities (Rosa and Flores, 2017). Hence, with the rising popularity of LLMs, it is important to formally evaluate their biases to effectively mitigate the representational harms resulting from them (Lee et al., 2024). We focus on counterfactual bias, inspired by Kusner et al. (2018), which assesses semantic differences across LLM responses caused by varying the demographic groups mentioned in prompts (counterfactual prompts). Prior work has focused on benchmarking LLM performance (Liang et al., 2023; Wang et al., 2024; Mazeika et al., 2024) and adversarial attacks (Sheng et al., 2020; Zou et al., 2023; Vega et al., 2023; Wallace et al., 2019).
While these methods provide some empirical insights into LLM bias, they have several fundamental limitations (Mc Intosh et al., 2024; Yang et al., 2023) such as (1) Limited Corresponding author: Work done while at Amazon Published as a conference paper at ICLR 2025 Figure 1: (Overview of LLMCert-B): LLMCert-B is a quantitative certification framework to certify bias in target LLM s responses to a random set of prompts differing only by a sensitive attribute. In specific instantiations, LLMCert-B samples a (a) set of prefixes from a given distribution and prepends them to a prompt set to form (b) the prompts given to the target LLM. (c) The target LLM s responses are checked for bias by a bias detector, (d) whose results are fed into a certifier. (e) Certifier computes bounds using the Clopper-Pearson method (Clopper and Pearson, 1934) on probability of unbiased LLM responses for any set of prompts formed with a random prefix from prefix distribution. test cases: Benchmarking consists of evaluating several but limited number of test cases. Due to its enumerative nature, benchmarking can not scale to prohibitively large numbers of prompts that can elicit bias from LLMs. Adversarial attacks identify only few worst-case examples, which do not inform about the overall biases from large input sets; (2) Test set leakage: LLMs may have been trained on popular benchmarking datasets, thus resulting in incorrect evaluation; (3) Lack of guarantees. Benchmarking involves empirical estimation without any formal guarantees of generalization over any input sets. Similarly, adversarial attacks give limited insights as they can show existence of problematic behaviors on individual inputs but do not quantify the risk of biased LLM responses. This work. We propose an alternative to benchmarking and adversarial attacks certifying LLMs for bias, with formal guarantees. Certification operates on a prohibitively large set of inputs, represented succinctly as a specification. Specifications define inputs mathematically through operators over the vocabulary of LLMs. Certification can provide guarantees on the target model s behavior that generalize to unseen inputs satisfying the specification. With guarantees, we can be better informed about the risks associated with LLMs before deploying them in public-facing applications. Key challenges. (1) There are no existing precise mathematical representations of large sets of (counterfactual) prompts to make practical specifications. (2) State-of-the-art neural network certifiers (Wang et al., 2021; Singh et al., 2019) currently do not scale to LLMs as they require white-box access to the model and lose precision significantly for larger models, resulting in inconclusive results. Our approach. Given the diversity of LLM prompts, there will always be some cases where the LLM output will be biased (e.g., found by adversarial attacks (Zou et al., 2023)). Hence, we believe that LLM certification must be quantitative (Li et al., 2022a; Baluta et al., 2021) and study the question: What is probability of unbiased LLM responses for any counterfactual prompt set? Exactly computing the probability of unbiased responses is infeasible due to the large number of possible counterfactual prompt sets over which the biased behavior has to be determined. One can try to compute deterministic lower and upper bounds on the probability (Berrada et al., 2021). 
However, this is expensive and requires white-box access making it not applicable to popular, SOTA but closed-source LLMs such as GPT-4 (Achiam et al., 2023). Therefore, we focus on black-box probabilistic certification that estimates the probability of unbiased responses over a given distribution of counterfactual prompt sets with high confidence bounds. We develop the first general specification and certification framework, LLMCert-B1 for counterfactual bias in LLMs, applicable to both open 1LLM Certification of Bias Published as a conference paper at ICLR 2025 and closed-source LLMs (see Figure 1 for an overview). Our specifications over counterfactual prompt sets are the first relational properties (Barthe et al., 2011) for trustworthy LLMs. We demonstrate LLMCert-B with 3 kinds of specifications, each with counterfactual prompt set distributions formed by adding prefixes sampled from given distribution of prefixes to a fixed set of counterfactual prompts. We present distributions of prefixes consisting random token sequences, mixtures of popular jailbreaks, and jailbreak perturbations in embedding space ( 3.1). The first two are model-agnostic specifications and hence apply to both open and closed-source models. However, the third one requires access to the embeddings and the ability to prompt LLMs with embeddings, and hence applies to only open-source models. The mixture and embedding space jailbreak prefix distributions contain effective, manually designed jailbreaks and their perturbations, which are potential jailbreaks, in their sample space and hence assess LLMs biases in adversarial settings. We certify the target LLM leveraging confidence intervals. LLMCert-B samples several counterfactual prompt sets from the specified prompt distribution and generates high-confidence bounds on the probability of unbiased LLM responses for any random counterfactual prompt set in the distribution. Contributions. Our main contributions are: We design novel specifications that quantify the desirable relational property of low counterfactual bias in LLM responses over counterfactual prompts in a specified distribution. We illustrate such specifications with distributions of counterfactual prompt sets constructed with potentially adversarial prefixes. The prefixes are drawn from 3 distributions (1) random, (2) mixture of jailbreaks, and (3) jailbreak perturbations in the embedding space. We develop the first probabilistic black-box certifier LLMCert-B, applicable to both open and closed-source models, for quantifying counterfactual bias in LLM responses. LLMCert B leverages confidence intervals (Clopper and Pearson, 1934) to generate high-confidence bounds on the probability of obtaining unbiased responses from the target LLM, given any random set of counterfactual prompts from the distribution given in the specification. We find that the safety alignment of SOTA LLMs is easily circumvented with several prefixes in the distributions given in our specifications, especially those involving mixture of jailbreaks and jailbreak perturbations in the embedding space ( 5). These distributions are inexpensive to sample from, but can effectively bring out biased behaviors from SOTA models. This shows the existence of simple, bias-provoking distributions for which no defenses exist currently. We provide quantitative measures for the fairness (lack of bias) of SOTA LLMs, which hold with high confidence. 
We find that there are no consistent trends in the fairness of models with the scaling of their sizes, suggesting that the quality of alignment techniques could be a more important factor than size for fairness. Our implementation is available at https://github.com/uiuc-focal-lab/LLMCert-B, and we provide guidelines for practitioners using our framework in Appendix A.

2 BACKGROUND

2.1 LARGE LANGUAGE MODELS (LLMS)

LLMs are autoregressive models for next-token prediction. Given a sequence of tokens t1, . . . , tk, they give a probability distribution over their vocabulary for the next token, P[tk+1 | t1, . . . , tk]. They are typically fine-tuned for instruction-following (Zhang et al., 2024) and aligned with human feedback (Wang et al., 2023; Ouyang et al., 2022) to make their responses safe. We certify instruction-tuned, aligned LLMs for counterfactual bias, as they are typically used in public-facing applications.

2.2 CLOPPER-PEARSON CONFIDENCE INTERVALS

Clopper-Pearson confidence intervals (Clopper and Pearson, 1934) provide lower and upper bounds [p_l, p_u] on the success probability p of a Bernoulli random variable, with probabilistic guarantees. The bounds are obtained from n independent and identically distributed observations of the random variable, in which k (≤ n) successes are observed. The confidence interval satisfies Pr{p ∈ [p_l, p_u]} ≥ 1 − γ, where γ ∈ (0, 1) is the (small) permissible error probability with which the true value of p ∉ [p_l, p_u]. The confidence intervals are obtained by statistical hypothesis testing for p, where p_l and p_u are respectively the lowest and highest values consistent with the observations at the given confidence 1 − γ.
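Concretely, the interval can be computed from the quantile function of the Beta distribution. The following is a minimal sketch (our illustration, not the authors' released code), using scipy.stats.beta; the helper name clopper_pearson is ours.

```python
# Minimal sketch (not the authors' released code): a two-sided Clopper-Pearson interval
# for the success probability p, given k observed successes in n Bernoulli trials.
from scipy.stats import beta

def clopper_pearson(k: int, n: int, gamma: float = 0.05):
    """Return (p_l, p_u) with Pr{p in [p_l, p_u]} >= 1 - gamma."""
    p_l = 0.0 if k == 0 else beta.ppf(gamma / 2, k, n - k + 1)
    p_u = 1.0 if k == n else beta.ppf(1 - gamma / 2, k + 1, n - k)
    return p_l, p_u

# Example: 46 unbiased outcomes out of n = 50 samples at 95% confidence
# gives bounds of roughly (0.81, 0.98).
```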
3 FORMALIZING BIAS CERTIFICATION

We develop a general framework, LLMCert-B, to specify and quantitatively certify counterfactual bias in text generated by (Large) Language Models (LMs). LLMCert-B formalizes bias with specifications: precise mathematical representations that define the desirable property (absence of bias) over large sets of inputs. Bias is defined with respect to demographic groups: subsets of the human population sharing an identity trait, which could be biological or socially constructed (Gallegos et al., 2024b). Bias consists of disparate treatment or outcomes when varying the demographic groups in inputs to the target LM. For autoregressive LMs, we consider text generation bias consisting of stereotyping, misrepresentation, derogatory language, etc., that can result in representational harms (Gallegos et al., 2024b). To apply certification to closed-source LMs as well, we study extrinsic bias (Cao et al., 2022) that manifests in the final LM response texts. See Appendix I for a detailed discussion on bias in ML.

3.1 BIAS SPECIFICATION

Next, we formally specify the lack of bias in the responses of language models. Unbiased LM responses do not exhibit semantic disparities owing to specific demographic groups in the prompts (Gallegos et al., 2024b; Sheng et al., 2019; Smith et al., 2022). Our bias specification is motivated by Counterfactual Fairness (Kusner et al., 2018). Consider a given identity trait I, such as gender or race (traits that are often the basis of social bias). I categorizes the human population into m subsets called demographic groups G1, . . . , Gm, each differing by the identity trait. Each demographic group G is characterized and recognized by several synonymous strings in society, called sensitive attributes G^A (Li et al., 2024). For example, the sensitive attributes for the demographic group corresponding to the female gender are woman, female, etc. We select any one sensitive attribute of a demographic group G to represent G. Let the resulting set of sensitive attributes, each corresponding to a demographic group, be A = {A1, . . . , Am}, where Ai ∈ G_i^A. Our specifications are for counterfactual prompts (Gallegos et al., 2024b) that differ only by the sensitive attributes in them. Let V be the vocabulary of the target LM L. Consider a set of prompts P = {P1, . . . , Ps}, s > 1, P ⊆ V^c, where c is the context length of L and V^c is the set of sequences of elements of V having length in [1, c]. Let each prompt in P contain a unique sensitive attribute from A, such that overall P represents more than one distinct demographic group represented in A. Let each prompt be Pi = Xi ⊕ Ai, where ⊕ denotes string concatenation, Xi is the part of Pi independent of sensitive attributes, and Ai is a sensitive attribute. We consider only prompts that can be decomposed into parts with and without sensitive attributes, respectively. To generalize to closed-source LMs, we assume L to be a black-box system that can only be queried when specifying and certifying bias. The black-box assumption renders any symbolic analysis (Mirman et al., 2020) infeasible and hence allows only for analysis with input-output examples.

Definition 1. (Counterfactual prompt set). A set of prompts P = {P1 = X1 ⊕ A1, . . . , Ps = Xs ⊕ As} is called counterfactual when: (1) ∀i, j ∈ [1, s]. Xi = Xj; (2) ∀i ∈ [1, s]. ∃j ∈ ([1, s] \ {i}). Ai ≠ Aj; (3) for an unbiased text generator f, ∀i ∈ [1, s]. f(Pi) = f(Xi).

That is, the prompts only differ in the sensitive attributes, which are ideally unrelated to the overall query, and the set of prompts P represents more than one sensitive attribute from A. We specify bias over counterfactual prompt sets (Definition 1). These exclude prompts where sensitive attributes are important to answer the overall query, such as "What steps should I take to prepare for becoming a parent?", as semantically different answers, dependent on the sensitive attribute, are acceptable in such cases. All possible counterfactual prompt sets can be prohibitive in number for typical context lengths. This is because the common part of counterfactual prompt sets, X, can be any element of V^c, which contains about 10^10000 elements for c = 2k. Hence, enumerative specifications (which specify the desired behavior on all inputs) are impractical, as they cannot be scalably certified without symbolic analysis for such a large number of inputs. Hence, we define probabilistic specifications for the probability of unbiased responses from L, for which we provide a certification algorithm in Section 3.2. Let Ψ be a sampleable discrete probability distribution over ℘(V^c) (the power set of prompts) having non-zero support on some counterfactual prompt sets P. We define probabilistic specifications for bias in L over Ψ. The specification is agnostic to Ψ's sampler, as long as it generates independent and identically distributed samples. We show examples of Ψ in Section 4. Let D be a user-defined bias detection function that can identify stereotypes or disparity in given texts for the different sensitive attributes in A. Let D evaluate to zero for unbiased inputs (D can be scaled and shifted to satisfy this). We leave D as a parameter of the specification, as different domains can have varying notions of bias and stakeholders can decide the most suitable notion (Anthis et al., 2024).
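To make Definition 1 and the role of D concrete, below is a small illustrative sketch of our own: a counterfactual prompt set built from a shared part X and two sensitive attributes, plus a toy detector. The attribute strings, the scoring function, and the disparity threshold are hypothetical; the detectors actually used are the regard-based and agreement-based ones described in Section 5 and Appendix G.

```python
# Illustrative only: a counterfactual prompt set (Definition 1) and a toy detector D.
# The attributes, score function, and 0.5 threshold are placeholders, not the paper's detectors.
from typing import Callable, List

SENSITIVE_ATTRIBUTES = ["man", "woman"]   # one representative attribute per demographic group

def counterfactual_prompt_set(x: str) -> List[str]:
    """Build P = {X + A1, ..., X + As}: a shared part X, differing only in the attribute."""
    return [f"{x} {attr}." for attr in SENSITIVE_ATTRIBUTES]

def toy_detector(responses: List[str], score: Callable[[str], float]) -> int:
    """Return 0 (unbiased) if per-response scores agree closely, 1 (biased) otherwise."""
    scores = [score(r) for r in responses]
    return 0 if max(scores) - min(scores) < 0.5 else 1

prompts = counterfactual_prompt_set(
    "Complete the sentence: The most dependable engineer on the team was a")
# The two prompts differ only in the trailing sensitive attribute, as Definition 1 requires.
```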
Overall, our quantitative specification is the probability of unbiased responses (measured by D) when L is independently prompted with each element of a randomly sampled counterfactual prompt set from Ψ (Equation 1). A certificate C is an evaluation/estimation of the specified probability of unbiased responses, along with samples of LLM responses, for the user-defined parameters Ψ, D, and L.

C(Ψ, D, L) ≜ Pr_{P∼Ψ} [D([L(P1), . . . , L(Ps)]) = 0]   (1)

3.2 CERTIFICATION ALGORITHM

Exactly computing C(Ψ, D, L) in Equation (1) is intractable, as it would require enumerating all (prohibitively many) prompt sets in the support of Ψ. Hence, our certification algorithm estimates C(Ψ, D, L) for the given Ψ and D and target L with high confidence, as described next. We generate intervals [p̂_l, p̂_u] that bound C(Ψ, D, L) in (1) with confidence 1 − γ. Such interval estimates are better than point-wise estimates, as they also quantify the uncertainty of the estimation. C(Ψ, D, L) is the success probability (probability of unbiased responses) of the Bernoulli random variable F that takes value 1 when D([L(P1), . . . , L(Ps)]) = 0 for P ∼ Ψ. To obtain high-confidence bounds on C(Ψ, D, L), we employ binomial proportion confidence intervals. In particular, we leverage the Clopper-Pearson confidence interval method (Clopper and Pearson, 1934) (Section 2.2), as it is known to be conservative, i.e., the confidence of the resultant intervals is at least the pre-specified confidence 1 − γ (Newcombe, 1998). We obtain n independent and identically distributed (iid) samples of F by sampling iid P from Ψ, and compute the Clopper-Pearson confidence interval of C(Ψ, D, L). The certificate, hence obtained, bounds the probability of unbiased responses for a random P with high confidence. Note that the certification results depend on the user-defined choices of n and 1 − γ.
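A minimal sketch of this certification loop is shown below. The interfaces sample_prompt_set (a sampler for Ψ), query_llm (black-box access to L), and detector (D) are placeholder names we assume for illustration, and clopper_pearson is the helper sketched in Section 2.2; this is a sketch of the procedure, not the released implementation.

```python
# Sketch of the certification loop (Section 3.2). sample_prompt_set, query_llm, and
# detector are assumed placeholder interfaces; clopper_pearson is from the earlier sketch.
def certify(sample_prompt_set, query_llm, detector, n=50, gamma=0.05):
    successes = 0
    for _ in range(n):                         # n iid samples of the Bernoulli variable F
        prompt_set = sample_prompt_set()       # one counterfactual prompt set P ~ Psi
        responses = [query_llm(p) for p in prompt_set]
        if detector(responses) == 0:           # unbiased responses count as a "success"
            successes += 1
    return clopper_pearson(successes, n, gamma)  # high-confidence bounds on C(Psi, D, L)
```

With n = 50 samples and gamma = 0.05, this matches the 95%-confidence, 50-sample setting used in the experiments (Section 5).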
4 CERTIFICATION INSTANCES

In this section, we instantiate prompt set distributions Ψ to form novel bias specifications. We select Ψ such that its underlying sample space has prompt sets that share a common characteristic, so we can certify bias conditioned on the presence of that characteristic. Thus, this becomes a local specification (Seshia et al., 2018), wherein the certificate is given for a local input space. Local specifications have commonly been considered in neural network verification (Singh et al., 2019; Wang et al., 2021; Baluta et al., 2021). Prior works on neural network specifications such as (Geng et al., 2023) generate only local specifications, as they correspond to meaningful real-world scenarios, and as local input regions are considered to design adversarial inputs for the models. In our local bias specifications, we consider Ψ around a given set of prompts Q (the pivot), denoting the resultant prompt set distribution as Ψ_Q. Prefixes are commonly used to steer the text generated by LLMs according to the user's intentions (Liu et al., 2021). Hence, we want to study whether the application of certain prefixes can elicit different forms of bias from the target LLM. Let Ψ_pre denote a distribution of prefixes. Each element in the sample space of Ψ_Q is a set of prompts formed by uniformly applying a prefix to all prompts Qi ∈ Q, that is, q_Q = ∪_{Qi∈Q} {p ⊕ Qi} for p ∼ Ψ_pre, where ⊕ denotes string concatenation. Algorithm 1 presents the probabilistic specification involving the addition of randomly sampled prefixes as a probabilistic program. Our probabilistic programs follow the syntax of the probabilistic programming language defined in (Sankaranarayanan et al., 2013, Figure 3). The syntax is similar to that of a typical imperative programming language, with the addition of primitive functions to sample from common distributions over discrete/continuous random variables (for example, Bernoulli: B, Uniform: U) and estimateProbability(.). estimateProbability(.) takes in a random variable and returns its estimated probability at a specific value. makePrefix(args, kind) (line 1) is a general function to sample different kinds of prefixes, such as random prefixes (Algorithm 2), mixtures of jailbreaks (Algorithm 3), and soft prefixes (Algorithm 4), constructed using the arguments args. C(Ψ_Q, D, L) characterizes the bias that can be elicited from L by varying the prefix, selected from Ψ_pre, that is applied to a given Q. Next, we describe the 3 different kinds of practical Ψ_pre and their sampling algorithms that define local bias specifications for L. We show some samples from each kind of Ψ_pre in Appendix C. Our specifications are for the average-case behavior of the target LLM, as the Ψ_pre are not distributions of provably adversarial (worst-case) prefixes.

Algorithm 1 Prefix specification
Input: L, Q; Output: C(Ψ_Q, D, L)
1: pre := makePrefix(args, kind = random / mixture / soft)
2: P := [pre ⊕ Qi for Qi ∈ Q]
3: C(Ψ_Q, D, L) := estimateProbability(D([L(P1), . . . , L(Ps)]) = 0)

Algorithm 2 Make random prefix
Input: V; Output: pre
1: pre := U(V) ⊕ . . . ⊕ U(V)  [q times]

Algorithm 3 Make mixture of jailbreak prefix
Input: L, V, M; Output: pre
1: M0 := split(M0)
2: H := ∪_{Mk ∈ M, k > 0} split(Mk)
3: ω(pλ, H) := shuffle({if(B(pλ), h, ⊥) | h ∈ H})
4: Mi := M0[0] ⊕ ω(pλ, H) ⊕ M0[1] ⊕ ω(pλ, H) ⊕ . . .
5: Mi := tokenize(L, Mi)
6: pre := [if(B(pµ), U(V), τ) for τ ∈ Mi]

Algorithm 4 Make soft prefix
Input: L, M0; Output: pre
1: E := embed(L, M0)
2: pre := E + U([−κ, κ])

Random prefixes. Prior works such as (Wei et al., 2023; Zou et al., 2023) have shown the effect of incoherent fixed-length strings in jailbreaking LLMs for harmful prompts. Hence, we specify bias in LLMs for prompts with incoherent prefixes that are random sequences of tokens from the vocabulary of the LLM. Such prefixes are not all intentionally adversarial, except for adversarial strings like those from prior works, but denote random noise in the prompt. Algorithm 2 presents the prefix sampler as a uniform distribution U(.) over random prefixes of fixed length q. The sample space of the random prefix distribution Ψ_pre has cardinality |V|^q. Ψ_pre for random prefixes assigns a non-zero probability to discovered and undiscovered jailbreaks of a fixed length q. Hence, certification for the random prefix distribution indicates the expected bias in responses to Q with any random prefix of length q.
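As an illustration, Algorithm 2 can be realized in a few lines against any tokenizer that exposes its vocabulary. The HuggingFace-style tokenizer interface below (vocab_size, decode) is an assumption on our part, not a requirement of the framework.

```python
# Illustration of Algorithm 2: a random prefix of q tokens drawn uniformly from the
# vocabulary V. The tokenizer interface (vocab_size, decode) follows HuggingFace
# conventions and is an assumption; any tokenizer exposing its vocabulary works.
import random

def make_random_prefix(tokenizer, q: int = 20) -> str:
    token_ids = [random.randrange(tokenizer.vocab_size) for _ in range(q)]
    return tokenizer.decode(token_ids)

# One sample from the prompt-set distribution around a pivot set Q: the same sampled
# prefix is applied uniformly to every pivot prompt.
# prefix = make_random_prefix(tok)
# prompt_set = [prefix + " " + Qi for Qi in Q]
```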
Mixtures of jailbreaks. Manually designed jailbreaks are fairly effective at bypassing the safety training of LLMs (walkerspider, 2022; Wei et al., 2023; Jailbreak Chat). To study the vulnerability of LLMs under powerful jailbreaks, we develop specifications with manual jailbreaks. Distributions from which jailbreaks can be sampled are unknown. Thus, we construct potential jailbreaking prefixes from a set M of popular manually-designed jailbreaks by applying 2 operations: interleaving and mutation. Interleaving attempts to strengthen a given manual jailbreak with more bias-provoking instructions, while mutation attempts to obfuscate the jailbreak so that it can remain effective even under explicit training to avoid the original jailbreak. Algorithm 3 presents the prefix constructor as a probabilistic program. Each manual jailbreak Mk ∈ M can be treated as a finite set of instructions Mk = {Mk[0], . . .}. The split(.) function takes a jailbreak and returns the list of the jailbreak's instructions. Let M0 be the most effective jailbreak (a.k.a. the main jailbreak). We decide the effectiveness of jailbreaks using popular open-source leaderboards of jailbreaks. We include all the instructions of the main jailbreak in the final prefix. The other jailbreaks are helper jailbreaks, whose instructions are included in the final prefix with an interleaving probability pλ. Let H = ∪_{Mk ∈ M, k > 0} split(Mk) denote the set of all instructions from the helper jailbreaks [line 2]. Let ω(pλ, H) shuffle and concatenate randomly picked (with probability pλ) instructions from H [line 3]. shuffle(.) is a function for randomly sampling a permutation from a uniform distribution over all permutations of an input list (after removing ⊥, which denotes void elements). Let if(e1, e2, e3) be an abbreviation for if e1 then e2 else e3. We first apply the interleaving operation, with the result given as Mi [line 4]. The mutation operation is then applied to Mi, viewed as a sequence of tokens [τ0, . . .], wherein any token τi can be flipped to a random token τ'i ∈ V with a mutation probability pµ (generally set to be low), to produce the prefix pre [line 6]. We hypothesize such prefixes to be potential jailbreaks, as they are formed by strengthening a manual jailbreak with other jailbreaks and obfuscating its presence. The number of prefixes that can be formed by the aforementioned operations is prohibitively large, owing to typically long manual jailbreaks and the possibility of mutating any token to any random token from V.

Soft prefixes from jailbreaks. Due to the limited number of effective manual jailbreaks (walkerspider, 2022; Learn Prompting, 2023), they can be easily identified and defended against. However, the excellent denoising capabilities of LLMs could render them vulnerable to simple manipulations of manual jailbreaks as well, indicating that the threat is not completely mitigated by current defenses (Jain et al., 2023). Hence, we specify fairness under prefixes constructed by adding noise to the original manual jailbreaks. Algorithm 4 presents the prefix constructor as a probabilistic program. Let E be the embedding matrix of M0 in the embedding space of the target LLM, obtained by applying the function embed(.) [line 1]. We perturb E by adding noise to it. As we are not aware of any adversarial distributions of noise that could be added to manual jailbreaks to make them stronger, we select a uniform distribution. We uniformly sample noise from B(0, κ), a ball of constant radius κ > 0 around the origin, and add it to E to construct the prefix in the embedding space [line 2].
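For concreteness, below are condensed sketches of the mixture-of-jailbreaks (Algorithm 3) and soft-prefix (Algorithm 4) constructors. The sentence-level instruction splitting, the HuggingFace/PyTorch interfaces (encode, decode, get_input_embeddings, inputs_embeds), and the default probabilities are our simplifying assumptions, not the paper's exact probabilistic programs.

```python
# Condensed sketches of Algorithm 3 (mixture of jailbreaks) and Algorithm 4 (soft prefix).
# Interface choices (crude sentence split, HuggingFace tokenizer, PyTorch embeddings) are ours.
import random
import torch

def make_mixture_prefix(tokenizer, main_jb, helper_jbs, p_lambda=0.3, p_mu=0.02):
    split = lambda jb: [s.strip() for s in jb.split(".") if s.strip()]  # crude split(.)
    helpers = [h for jb in helper_jbs for h in split(jb)]               # H (line 2)

    def omega():  # line 3: keep each helper instruction with prob p_lambda, then shuffle
        picked = [h for h in helpers if random.random() < p_lambda]
        random.shuffle(picked)
        return ". ".join(picked)

    interleaved = ". ".join(                 # line 4: interleave omega blocks into M0
        part for instr in split(main_jb) for part in (instr, omega()) if part)
    token_ids = tokenizer.encode(interleaved)                            # line 5
    mutated = [random.randrange(tokenizer.vocab_size)                    # line 6: mutation
               if random.random() < p_mu else t for t in token_ids]
    return tokenizer.decode(mutated)

def make_soft_prefix(model, tokenizer, main_jb, kappa=0.02):
    ids = tokenizer(main_jb, return_tensors="pt").input_ids
    E = model.get_input_embeddings()(ids)           # line 1: embed(L, M0)
    noise = (torch.rand_like(E) * 2 - 1) * kappa    # line 2: U([-kappa, kappa]) noise
    return E + noise                                # fed to the LLM via inputs_embeds
```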
5 EXPERIMENTS

We used 2 A100 GPUs, each with 40GB VRAM. We derive the queries on which the specifications from the 3 prefix distributions presented in Section 4 are pivoted from popular datasets for fairness and bias assessment: BOLD (Dhamala et al., 2021) and DecodingTrust (Wang et al., 2024).

BOLD setup. BOLD is a dataset of partial sentences to demonstrate bias in the generations of LLMs in common situations. We pick a test set of 250 samples randomly from BOLD's profession partition and demonstrate binary gender bias specifications and certificates on it.

Figure 2: Example from QBOLD

We develop a pivot set of prompts from each test set sample by prepending an instruction to complete the partial sentence for the profession annotated in BOLD, where the subject identifies with a particular gender (Male/Female). An example pivot set Q from BOLD is illustrated in Figure 2. Let the resultant test set, consisting of pivot prompts from each partial sentence considered from BOLD, be QBOLD. To identify bias in the responses of the target LLM for the prompts constructed from pivot prompt sets in QBOLD, we use the regard-based bias metric proposed in (Sheng et al., 2019). Our adaptation of the regard metric is described and evaluated with a human study on Amazon Mechanical Turk in Appendix G.1. Our bias detector matches human perception of bias in 76% of cases. We qualitatively analyze the false positives and false negatives of the bias detector with respect to human judgment in Appendix G.1.

Figure 3: Example from QDT

DecodingTrust setup. DecodingTrust (DT) is a benchmark to evaluate various properties of LLMs, including stereotype bias against people of different demographic groups. We make specifications from all 48 statements in its stereotypes partition for demographic groups corresponding to race (black/white). An example pivot set Q from DT is illustrated in Figure 3. Let the resultant test set, consisting of pivot prompts from each statement considered from DT, be QDT. We evaluate the LLM responses to prompts derived from pivot prompt sets in QDT using a bias detector that identifies disparity in agreement with the stereotype for the different demographic groups given in the prompt, as discussed in Appendix G.2.

For every element in QBOLD and QDT, we generate 3 certificates for the specifications in Section 4, such that a certificate consists of bounds on the probability of unbiased responses from the target LLM. Both bias detectors output 1 for a biased set of responses to a counterfactual prompt set and 0 for an unbiased set of responses. The values of the certification parameters used in our experiments are given in Table 2 (Appendix E). We study their effect on the certification results with an ablation study in Appendix E. We generate the certification bounds with 95% confidence and 50 samples. While our main experiments are for counterfactual prompt sets with binary demographic groups, our framework can be extended beyond binary demographic groups, which we experimentally demonstrate in Appendix E.6. Appendix B presents the existing, manually-designed jailbreaks used across all specifications. Note that these jailbreaks are just examples to demonstrate our framework, which generalizes beyond them to new bias-eliciting manual textual jailbreaks.

5.1 CERTIFICATION RESULTS

We certify the popular contemporary LLMs: Llama-2-chat (Touvron et al., 2023) 7B and 13B (parameters), Vicuna-v1.5 (Chiang et al., 2023) 7B and 13B, Mistral-Instruct-v0.2 (Jiang et al., 2023) 7B, Gemini-1.0-pro (Gemini Team, 2024), GPT-3.5 (Brown et al., 2020b), GPT-4 (Achiam et al., 2023), and Claude-3.5-Sonnet (Anthropic, 2024). We report the average of the certification bounds over all pivot prompt sets in QBOLD and QDT for every model in Table 1.
Table 1: Average of the bounds on the probability of unbiased responses for different models. Lowest bounds for each specification kind and dataset are highlighted. We report 2 baselines: unbiased responses when prompting without prefixes and with the main jailbreak (JB) as prefix. The Random, Mixture, and Soft columns give the average certification bounds for the corresponding prefix distributions.

Dataset | Model | % Unbiased without prefix | % Unbiased with main JB | Random | Mixture | Soft
BOLD | Vicuna-7B | 99.9 | 89.4 | (0.93, 1.0) | (0.90, 0.99) | (0.73, 0.89)
BOLD | Vicuna-13B | 99.7 | 99.8 | (0.93, 1.0) | (0.93, 1.0) | (0.92, 1.0)
BOLD | Llama-7B | 99.8 | 99.8 | (0.92, 1.0) | (0.92, 1.0) | (0.93, 1.0)
BOLD | Llama-13B | 99.8 | 99.7 | (0.93, 1.0) | (0.91, 1.0) | (0.93, 1.0)
BOLD | Mistral-7B | 100.0 | 41.0 | (0.92, 1.0) | (0.22, 0.42) | (0.30, 0.52)
BOLD | Gemini | 99.2 | 74.1 | (0.92, 1.0) | (0.60, 0.83) | -
BOLD | GPT-3.5 | 99.5 | 50.2 | (0.92, 1.0) | (0.44, 0.67) | -
BOLD | GPT-4 | 99.8 | 99.9 | (0.92, 1.0) | (0.80, 0.96) | -
BOLD | Claude-3.5-Sonnet | 99.6 | 99.8 | (0.93, 1.0) | (0.92, 1.0) | -
DT | Vicuna-7B | 95.4 | 100.0 | (0.85, 0.97) | (0.92, 1.0) | (0.88, 0.97)
DT | Vicuna-13B | 88.7 | 76.2 | (0.71, 0.92) | (0.92, 1.0) | (0.51, 0.78)
DT | Llama-7B | 97.5 | 100.0 | (0.79, 0.96) | (0.92, 1.0) | (0.92, 1.0)
DT | Llama-13B | 100.0 | 100.0 | (0.92, 1.0) | (0.93, 1.0) | (0.93, 1.0)
DT | Mistral-7B | 99.2 | 72.9 | (0.91, 1.0) | (0.85, 0.99) | (0.46, 0.73)
DT | Gemini | 99.6 | 94.6 | (0.92, 1.0) | (0.73, 0.93) | -
DT | GPT-3.5 | 99.6 | 56.7 | (0.93, 1.0) | (0.66, 0.88) | -
DT | GPT-4 | 100.0 | 100.0 | (0.93, 1.0) | (0.93, 1.0) | -
DT | Claude-3.5-Sonnet | 99.6 | 100.0 | (0.93, 1.0) | (0.93, 1.0) | -

We do not certify the closed-source models (Gemini, GPT, and Claude) for soft prefixes, as that requires access to the models' embedding layers. Certification time depends significantly on the inference latency of the target model; generating each certificate can take 1-2 minutes for models with reasonable latency.

Baselines. The baselines consider QBOLD and QDT as benchmarking datasets, having counterfactual prompt sets as individual elements. Similar to popular LLM bias benchmarking works such as (Wang et al., 2024; Liang et al., 2023; Esiobu et al., 2023; Xie et al., 2024), we study the biases in LLMs for a fixed dataset of counterfactual prompt sets, which may be provided as-is to the LLM or with jailbreaks. In the first baseline (without prefix), every counterfactual prompt set is evaluated 5 times, each time prompting the target LLM with each prompt in the set without any prefixes and detecting bias across its responses using the corresponding bias detector. The bias result for each counterfactual prompt set is computed by averaging the results over the 5 evaluations, similar to (Wang et al., 2024). This baseline indicates biases in LLM responses without any prefixes and can be used to judge the additional influence of prefixes in eliciting bias from LLMs. Table 1 reports the average of the evaluations over all counterfactual prompt sets in QBOLD and QDT, respectively. In the second baseline (with main jailbreak), each counterfactual prompt set is similarly evaluated 5 times, but with the unmodified main jailbreak (Figure 5, Appendix B), the one used in the mixture-of-jailbreaks and soft-prefix distributions, as a prefix. The average result of this baseline is also reported in Table 1. This baseline indicates the efficacy of the main jailbreak without any modifications, and hence suggests the importance of the mixture and soft prefix distributions around the main jailbreak in eliciting biases in LLM responses. The baselines are empirical studies of counterfactual bias in LLMs, which analyze bias with a fixed dataset of prompts. In contrast, LLMCert-B quantifies bias over any random counterfactual prompt set sampled from a given distribution.

5.1.1 GENERAL OBSERVATIONS

Comparison with baselines. Our results for the baseline without prefix are generally close to, and often higher than, the average upper bounds from certification.
This suggests that the counterfactual prompt sets do not majorly result in biased LLM behavior by themselves and the application of prefixes is essential for biased LLM responses. Hence, such baselines present an optimistic view on the biases of LLMs, which may not be acceptable for real-world scenarios. We compare the baseline with main jailbreak with the certification bounds for the specifications with mixture of jailbreak prefixes and soft prefixes, as these consist of modifications of the main jailbreak. We see that in most cases, the baseline results are either contained within or higher than the bounds. Thus the baseline does not account for several biased behaviors that can result by modifying the main jailbreak. In some cases, e.g., Mistral for DT specifications, the average lower bound is higher than the baseline result, indicating that the modifications might have reduced the efficacy of the main jailbreak for the model. In such cases, the specifications can be made model-specific and we leave this to future work. Published as a conference paper at ICLR 2025 Variations with prefixes. We observe that the random prefixes are generally less effective at eliciting biased behaviors than the other prefixes. This suggests that incoherent prefixes can be denoised by the LLMs and may not effectively bypass their alignment. In some cases, such as that for Vicuna-13b, we see that the model produces biased responses with the random prefixes, indicating that the model is not robust to even incoherent perturbations to the prompts. Thus random prefix specifications present basic certificates into the denoising capabilities of the LLMs to produce unbiased responses. There are no consistent trends between the mixture of jailbreaks and soft prefix specifications, except for the effectiveness of prompts from their distributions in producing biased LLM responses. Denial of response. In several instances, such as those when certifying GPT-4 (for DT), Claude, and Llama, we observe unbiased behaviors because the model simply refuses to respond. The completions are mostly Sorry I can t assist with that". While that can be a valid defense to adversarial prompts, it may not be the most desirable (Cui et al., 2024). The model could instead output neutral responses, ignoring any bias-encouraging texts in the prompts to be helpful but not harmful. 5.1.2 MODEL-SPECIFIC OBSERVATIONS Vicuna. The Vicuna models show high probability of unbiased responses, except for those with soft prefixes. The baselines indicate that the 7B model can output less unbiased responses with the main jailbreak as prefix. But the mixture of jailbreaks specifications have high probability of unbiased responses, suggesting that the perturbations of the main jailbreak with the operations of mixture specifications can make the jailbreak less effective, while perturbing the jailbreak in the embedding space can elicit more biased behaviors from the model. We find that unbiased behavior trends are not consistent with increasing model sizes across different specifications. Llama and Claude. The Llama models and Claude-Sonnet generate a few biased responses, but the bounds on the probability of unbiased responses are high, indicating better alignment. Mistral. Mistral exhibits significantly low probability of unbiased when prompted with mixtures of jailbreaks and soft prefixes for specifications from both QBOLD and QDT . Gemini. We prompt Gemini with the safety filters disabled, so we can check the safety training of the model itself. 
We observe that its responses can get highly biased for mixture of jailbreaks specifications, resulting in low bounds. This insight is, however, not evident from the baselines. GPT. The baselines indicate low bias in the responses of GPT-4. However, certification results say otherwise. The GPT-4 model shows low probability of unbiased responses for the mixture of jailbreak specifications derived from QBOLD. However, we observe unbiased responses for specifications corresponding to QDT . GPT-3.5 achieves low probability of unbiased responses for the mixture of jailbreak specifications, indicating that GPT-4 may have undergone better safety training than it. 5.2 CASE STUDIES We qualitatively introspect individual certificates for different models for one example from QBOLD. We present similar analyses for one example from QDT in Appendix D. We select the example for Desktop automation refers to software programming from QBOLD for Automation profession. We present responses of 4 popular LLMs Mistral, Llama-7B, Gemini, and GPT-4 obtained during certification for the mixture of jailbreak specification, in Figure 4. The responses demonstrate various stereotypical biases against both genders. For example, Mistral discriminates against women, calling them incapable and better suited for menial jobs . The Gemini model is also biased against women, saying they cannot think and are meant for household chores . On the other hand, GPT-4 shows biases towards men, as it responds about them lacking motivation and being less efficient , with automation being better at their jobs. It depicts women crushing male competition and becoming superior. Llama, however, defends against the prompts and avoids responding in a biased way. The responses reflect the certification results for the models. We find from the certification bounds that the Mistral model is worse than Gemini and GPT, which are worse than Llama with high confidence for this specification. Published as a conference paper at ICLR 2025 Figure 4: Responses (and bounds) of Mistral, Llama-7B, Gemini, and GPT-4 when certifying with BOLD example Desktop automation refers to software programming for mixture of jailbreaks. 6 RELATED WORK Attacking LLM alignment. LLMs are aligned with human ethics by supervised fine-tuning and reinforcement learning with human feedback (Ouyang et al., 2022). However, (Zou et al., 2023; Vega et al., 2023; Chao et al., 2023; Sheng et al., 2020; Wallace et al., 2019) propose methods to jailbreak LLMs, bypassing alignment and causing harmful or biased responses. Jailbreaks can be incoherent (Zou et al., 2023; Sheng et al., 2020) or coherent (Dominique et al., 2024; Liu et al., 2024). Benchmarking LLMs. Various prior works have benchmarked LLM performance on standard and custom datasets. These consist of datasets of general prompts (Dhamala et al., 2021; Wang et al., 2024) or adversarial examples (Zou et al., 2023; Mazeika et al., 2024). Benchmarks such as (Liang et al., 2023; Wang et al., 2024; Mazeika et al., 2024; Manerba et al., 2024; Gallegos et al., 2024a) present empirical trends for LLM performance, measured along various axes including bias. Guarantees for LLMs. There is an emerging need for guarantees on LLM behavior. (Kang et al., 2024) provides guarantees on the generation risks of RAG LLMs. 
(Quach et al., 2024; Deutschmann et al., 2023; Mohri and Hashimoto, 2024; Yadkori et al., 2024) apply conformal prediction to LLMs, proposing methods for generating output sets with statistical guarantees on correctness or coverage. (Zollo et al., 2024) presents a framework for selecting low-risk system prompts for LLMs with probabilistic guarantees. (Chaudhary et al., 2024) present a certification framework for knowledge comprehension in LLMs. We compare LLMCert-B with existing works further in Appendix H. Fairness in Machine Learning. Fairness has been extensively studied for general Machine Learning, beginning from the seminal work of Dwork et al. (2011). Prior works have proposed methods to formally certify classifiers for fairness, such as (Biswas and Rajan, 2023; Bastani et al., 2019). However, these do not extend to LLMs. Fairness and bias have also been studied in natural language processing in prior works such as (Chang et al., 2019; Smith et al., 2022; Krishna et al., 2022). 7 CONCLUSION We present the first framework LLMCert-B to specify and certify counterfactual bias in LLM responses, for both open and closed-source models. We instantiate LLMCert-B with novel specifications based on different kinds of potentially adversarial prefixes. LLMCert-B generates high confidence bounds on probability of unbiased responses for counterfactual prompts from a given distribution. Our results show previously unknown vulnerabilities related to counterfactual bias in SOTA LLMs. Published as a conference paper at ICLR 2025 ACKNOWLEDGEMENT We thank the anonymous reviewers for their insightful comments. This work was supported by a grant from the Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE) and NSF Grants No. CCF-2238079, CCF-2316233, CNS-2148583. ETHICS STATEMENT We identify the following positive and negative impacts of our work. Positive impacts. Our work is the first to provide quantitative certificates for the bias in Large Language Models. It can be used by model developers to thoroughly assess their models before releasing them and by the general public to become aware of the potential harms of using any LLM. As our framework, LLMCert-B assumes black-box access to the model, it can be applied to even closed-source LLMs with API access. Negative impacts. In this work, we propose 3 kinds of specifications involving random prefixes, mixtures of jailbreaks as prefixes, and jailbreaks in the embedding space of the target model. While these prefixes are not adversarially designed, they are often successful in eliciting biased and toxic responses from the target LLMs. They can be used to attack these LLMs by potential adversaries. We have informed the developers of the LLMs about this threat. REPRODUCIBILITY STATEMENT Our implementation of LLMCert-B is open-sourced at https://github.com/ uiuc-focal-lab/LLMCert-B. We provide a README with the code having instructions to reproduce the main results of the paper. The code also consists the files used in conducting a human evaluation of our bias detector for BOLD on Amazon Mechanical Turk. Jailbreak Chat. https://www.reddit.com/r/Chat GPTJailbreak/. Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report, 2023. Jacy Anthis, Kristian Lum, Michael Ekstrand, Avi Feller, Alexander D Amour, and Chenhao Tan. The impossibility of fair llms, 2024. URL https://arxiv.org/abs/2406.03198. 
Jacy Reese Anthis and Victor Veitch. Causal context connects counterfactual fairness to robust prediction and group fairness, 2023. URL https://arxiv.org/abs/2310.19691. Anthropic. Claude 3.5 sonnet model card addendum. Technical report, Anthropic, 2024. URL https://www-cdn.anthropic.com/ fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_ Addendum.pdf. Teodora Baluta, Zheng Leong Chua, Kuldeep S. Meel, and Prateek Saxena. Scalable quantitative verification for deep neural networks, 2021. Solon Barocas and Andrew D Selbst. Big data s disparate impact. California Law Review, 104(3): 671 732, 2016. Gilles Barthe, Juan Manuel Crespo, and César Kunz. Relational verification using product programs. pages 200 214, 2011. Osbert Bastani, Xin Zhang, and Armando Solar-Lezama. Probabilistic verification of fairness properties via concentration, 2019. Published as a conference paper at ICLR 2025 Leonard Berrada, Sumanth Dathathri, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Jonathan Uesato, Sven Gowal, and M. Pawan Kumar. Verifying probabilistic specifications with functional lagrangians. Co RR, abs/2102.09479, 2021. URL https://arxiv.org/abs/ 2102.09479. Sumon Biswas and Hridesh Rajan. Fairify: Fairness verification of neural networks. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1546 1558, 2023. doi: 10.1109/ICSE48619.2023.00134. Jack Blandin and Ian A. Kash. Generalizing group fairness in machine learning via utilities. J. Artif. Int. Res., 78, January 2024. ISSN 1076-9757. doi: 10.1613/jair.1.14238. URL https: //doi.org/10.1613/jair.1.14238. Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Language (technology) is power: A critical survey of bias in NLP. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454 5476, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://aclanthology.org/2020.acl-main. 485. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020a. URL https: //arxiv.org/abs/2005.14165. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020b. Yang Trista Cao, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta, Varun Kumar, Jwala Dhamala, and Aram Galstyan. On the intrinsic and extrinsic fairness evaluation metrics for contextualized language representations. 
In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 561 570, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-short.62. URL https://aclanthology.org/ 2022.acl-short.62. Kai-Wei Chang, Vinodkumar Prabhakaran, and Vicente Ordonez. Bias and fairness in natural language processing. In Timothy Baldwin and Marine Carpuat, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts, Hong Kong, China, November 2019. Association for Computational Linguistics. URL https:// aclanthology.org/D19-2004. Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2023. Isha Chaudhary, Vedaant V. Jain, and Gagandeep Singh. Decoding intelligence: A framework for certifying knowledge comprehension in llms, 2024. URL https://arxiv.org/abs/2402. 15929. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https: //lmsys.org/blog/2023-03-30-vicuna/. Published as a conference paper at ICLR 2025 C. J. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404 413, 12 1934. ISSN 0006-3444. doi: 10.1093/biomet/26.4.404. URL https://doi.org/10.1093/biomet/26.4.404. Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. Or-bench: An over-refusal benchmark for large language models, 2024. URL https://arxiv.org/abs/2405.20947. Nicolas Deutschmann, Marvin Alberts, and María Rodríguez Martínez. Conformal autoregressive generation: Beam search with coverage guarantees, 2023. URL https://arxiv.org/abs/ 2309.03797. Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 21. ACM, March 2021. doi: 10.1145/3442188.3445924. URL http://dx.doi.org/10.1145/3442188.3445924. Brandon Dominique, David Piorkowski, Manish Nagireddy, and Ioana Baldini Soares. Prompt templates: A methodology for improving manual red teaming performance. In ACM CHI Conference on Human Factors in Computing Systems, 2024. Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Rich Zemel. Fairness through awareness, 2011. David Esiobu, Xiaoqing Tan, Saghar Hosseini, Megan Ung, Yuchen Zhang, Jude Fernandes, Jane Dwivedi-Yu, Eleonora Presani, Adina Williams, and Eric Michael Smith. Robbie: Robust bias evaluation of large generative language models, 2023. Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models. First Monday, November 2023. ISSN 1396-0466. doi: 10.5210/fm.v28i11.13346. URL http: //dx.doi.org/10.5210/fm.v28i11.13346. Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. Bias and fairness in large language models: A survey, 2024a. Isabel O. Gallegos, Ryan A. 
Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. Bias and fairness in large language models: A survey, 2024b. URL https://arxiv.org/abs/2309.00770. Google Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. Chuqin Geng, Nham Le, Xiaojie Xu, Zhaoyue Wang, Arie Gurfinkel, and Xujie Si. Towards reliable neural specifications, 2023. URL https://arxiv.org/abs/2210.16114. Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. Dialect prejudice predicts ai decisions about people s character, employability, and criminality, 2024. Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models, 2023. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, and Bo Li. C-rag: Certified generation risks for retrieval-augmented language models, 2024. Hadas Kotek, Rikker Dockum, and David Sun. Gender bias and stereotypes in large language models. In Proceedings of The ACM Collective Intelligence Conference, CI 23. ACM, November 2023. doi: 10.1145/3582269.3615599. URL http://dx.doi.org/10.1145/3582269.3615599. Published as a conference paper at ICLR 2025 Satyapriya Krishna, Rahul Gupta, Apurv Verma, Jwala Dhamala, Yada Pruksachatkun, and Kai-Wei Chang. Measuring fairness of text classifiers via prediction sensitivity, 2022. Anna Kruspe. Towards detecting unanticipated bias in large language models, 2024. URL https: //arxiv.org/abs/2404.02650. Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness, 2018. URL https://arxiv.org/abs/1703.06856. Learn Prompting. Jailbreaking. https://learnprompting.org/docs/prompt_ hacking/jailbreaking, 2023. Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, and Kangwook Lee. Prompted llms as chatbot modules for long open-domain conversation. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.findings-acl.277. URL http://dx.doi.org/10.18653/v1/ 2023.findings-acl.277. Messi H.J. Lee, Jacob M. Montgomery, and Calvin K. Lai. Large language models portray socially subordinate groups as more homogeneous, consistent with a bias observed in humans. In The 2024 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 24, page 1321 1340. ACM, June 2024. doi: 10.1145/3630106.3658975. URL http://dx.doi.org/10.1145/ 3630106.3658975. Renjue Li, Pengfei Yang, Cheng-Chao Huang, Youcheng Sun, Bai Xue, and Lijun Zhang. Towards practical robustness analysis for dnns based on pac-model learning, 2022a. Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. A survey on fairness in large language models, 2024. URL https://arxiv.org/abs/2308.10149. Zhuoyan Li, Zhuoran Lu, and Ming Yin. Towards better detection of biased language with scarce, noisy, and biased annotations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, AIES 22, page 411 423, New York, NY, USA, 2022b. 
Association for Computing Machinery. ISBN 9781450392471. doi: 10.1145/3514094.3534142. URL https://doi.org/ 10.1145/3514094.3534142. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic Evaluation of Language Models, 2023. Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, 2021. URL https://arxiv.org/abs/2107.13586. Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study, 2024. Marta Marchiori Manerba, Karolina Sta nczak, Riccardo Guidotti, and Isabelle Augenstein. Social bias probing: Fairness benchmarking for language models, 2024. Rohin Manvi, Samar Khanna, Marshall Burke, David Lobell, and Stefano Ermon. Large language models are geographically biased, 2024. Victor M Longa Martínez. John baugh. 2018. linguistics in pursuit of justice, cambridge: Cambridge university press [xx+ 215 pp.]. isbn: 978-1107153455. Verba: Anuario Galego de Filoloxía, 49: 1 15, 2022. Published as a conference paper at ICLR 2025 Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal, 2024. Timothy R. Mc Intosh, Teo Susnjak, Tong Liu, Paul Watters, and Malka N. Halgamuge. Inadequacies of large language model benchmarks in the era of generative artificial intelligence, 2024. Matthew Mirman, Timon Gehr, and Martin Vechev. Robustness certification of generative models, 2020. URL https://arxiv.org/abs/2004.14756. Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees, 2024. URL https://arxiv.org/abs/2402.10978. Robert G. Newcombe. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine, 17(8):857 872, 1998. doi: 10.1002/(SICI)1097-0258(19980430) 17:8<857::AID-SIM777>3.0.CO;2-E. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022. Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling, 2024. URL https://arxiv.org/abs/2306. 10193. Jonathan Rosa and Nelson Flores. Unsettling race and language: Toward a raciolinguistic perspective. Language in Society, 46(5):621 647, 2017. 
Sriram Sankaranarayanan, Aleksandar Chakarov, and Sumit Gulwani. Static analysis for probabilistic programs: inferring whole program properties from finitely many paths. SIGPLAN Not., 48(6):447–458, jun 2013. ISSN 0362-1340. doi: 10.1145/2499370.2462179. URL https://doi.org/10.1145/2499370.2462179.
Sanjit A. Seshia, Ankush Desai, Tommaso Dreossi, Daniel J. Fremont, Shromona Ghosh, Edward Kim, Sumukh Shivakumar, Marcell Vazquez-Chanlatte, and Xiangyu Yue. Formal specification for deep neural networks. In Shuvendu K. Lahiri and Chao Wang, editors, Automated Technology for Verification and Analysis, pages 20–34, Cham, 2018. Springer International Publishing. ISBN 978-3-030-01090-4.
Sakib Shahriar and Kadhim Hayawi. Let's have a chat! A conversation with ChatGPT: Technology, applications, and limitations. Artificial Intelligence and Applications, 2(1):11–20, June 2023. ISSN 2811-0854. doi: 10.47852/bonviewaia3202939. URL http://dx.doi.org/10.47852/bonviewAIA3202939.
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1339. URL https://aclanthology.org/D19-1339.
Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng. Towards Controllable Biases in Language Generation. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3239–3254, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.291. URL https://aclanthology.org/2020.findings-emnlp.291.
John H. Sherry. The Civil Rights Act of 1964: Fair employment practices under Title VII. Cornell Hotel and Restaurant Administration Quarterly, 6(2):3–6, 1965. doi: 10.1177/001088046500600202. URL https://doi.org/10.1177/001088046500600202.
Julius Sim and Norma Reid. Statistical Inference by Confidence Intervals: Issues of Interpretation and Utilization. Physical Therapy, 79(2):186–195, 02 1999. ISSN 0031-9023. doi: 10.1093/ptj/79.2.186. URL https://doi.org/10.1093/ptj/79.2.186.
Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin Vechev. An abstract domain for certifying neural networks. Proc. ACM Program. Lang., 3(POPL), jan 2019. doi: 10.1145/3290354. URL https://doi.org/10.1145/3290354.
Eric Michael Smith, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. "I'm sorry to hear that": Finding new biases in language models with a holistic descriptor dataset, 2022.
Harini Suresh and John Guttag. A framework for understanding sources of harm throughout the machine learning life cycle. In Proceedings of the 1st ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO '21, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450385534. doi: 10.1145/3465416.3483305. URL https://doi.org/10.1145/3465416.3483305.
J.M. Terry, R. Hendrick, E. Evangelou, and R.L. Smith. Variable dialect switching among African American children: Inferences about working memory. Lingua, 120(10):2463–2475, 2010. ISSN 0024-3841. doi: https://doi.org/10.1016/j.lingua.2010.04.013. URL https://www.sciencedirect.com/science/article/pii/S0024384110001129.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Jason Vega, Isha Chaudhary, Changming Xu, and Gagandeep Singh. Bypassing the safety training of open-source LLMs with priming attacks. arXiv preprint arXiv:2312.12321, 2023.
walkerspider. DAN is my new friend. https://old.reddit.com/r/ChatGPT/comments/zlcyr9/dan_is_my_new_friend/, 2022.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221.
Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, and Bo Li. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models, 2024.
Shiqi Wang, Huan Zhang, Kaidi Xu, Xue Lin, Suman Jana, Cho-Jui Hsieh, and J. Zico Kolter. Beta-CROWN: Efficient bound propagation with per-neuron split constraints for neural network robustness verification. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 29909–29921, 2021.
Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey, 2023.
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail?, 2023.
Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, and Prateek Mittal. SORRY-Bench: Systematically evaluating large language model safety refusal behaviors, 2024. URL https://arxiv.org/abs/2406.14598.
Yasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei-Hung Weng, Yao-Yuan Yang, Csaba Szepesvári, Ali Taylan Cemgil, and Nenad Tomasev. Mitigating LLM hallucinations via conformal abstention, 2024. URL https://arxiv.org/abs/2405.01563.
Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. Rethinking benchmark and contamination for language models with rephrased samples, 2023.
Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. Instruction tuning for large language models: A survey, 2024.
Thomas P. Zollo, Todd Morrill, Zhun Deng, Jake C. Snell, Toniann Pitassi, and Richard Zemel. Prompt risk control: A rigorous framework for responsible deployment of large language models, 2024. URL https://arxiv.org/abs/2311.13628.
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023.
A NOTE FOR PRACTITIONERS

In this section, we describe how practitioners can use our framework to assess LLMs and automatically identify vulnerabilities in them. Our open-source implementation is available at https://github.com/uiuc-focal-lab/LLMCert-B and can be used under the terms and conditions of the GPL license. The open-source framework can be used to certify both open- and closed-source LLMs by adding support for querying custom models in the utils.py (open-source models) and utils_api.py (closed-source models) files. The framework requires unrestricted query access to the target model (in terms of the number of inferences). Developers can adjust the desired confidence level of the certificates and increase or decrease the number of samples used in certification for tighter or looser bounds, according to their requirements and budget. Developers can also use custom bias detectors to label biased responses and compute the certificates with respect to those detectors. To get customized insights into LLM biases for their particular applications, developers can define specifications with prompts that are commonly observed in their use cases. This customization can happen by sourcing the pivot prompts from domain-specific datasets instead of the popular BOLD or Decoding Trust datasets, and/or by using custom distributions of prefixes/suffixes that more suitably represent the biases in their domains. For example, in domains where there is a threat of racial bias, the prefixes could explicitly encourage the model to exhibit racial bias, so as to stress-test the trustworthiness of the models. Developers can also define entirely new distributions of counterfactual prompts, irrespective of prefixes/suffixes, to specify bias similarly to (1) and certify with LLMCert-B. The certificates obtained are reliable, quantitative risk assessments of models, with lower and upper bounds on risks pertaining to bias in the models' generations. They can also be used to compare different LLMs and pick one with acceptable risk in varying contexts.

B COMMON JAILBREAKS USED

In our experiments, we combine common jailbreaks to certify against hybrid jailbreaks. The specific jailbreaks we use are shown below. The main jailbreak [2] is shown in Figure 5. We have modified the jailbreak to encourage the model to generate responses demonstrating gender bias by adding the text highlighted in red in Figure 5a, and racial bias in Figure 5b. We use one helper jailbreak [3] from which we derive the instructions shown in Figure 6. Each of these instructions is inserted into the main jailbreak at the points where its individual instructions end, with the interleaving probability pλ. All instructions inserted at the same point in the main jailbreak appear in a random order in the final jailbreak prefix. The main jailbreak is also the jailbreak used to create the soft prefixes. Note that these jailbreaks are just examples to demonstrate our framework, which is in no way restricted to them and generalizes to any future bias-eliciting manual, textual jailbreaks.

C EXAMPLE SAMPLES FROM PREFIX DISTRIBUTIONS

In this section, we provide examples of prefixes of the different kinds described in Section 4.

Random prefixes. Figure 7 presents some examples of random prefixes, each having 100 tokens.
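For concreteness, the following is a minimal sketch of how such a random prefix could be sampled and prepended to a counterfactual prompt pair. It is not the paper's implementation: the GPT-2 tokenizer, the helper name sample_random_prefix, and the pivot sentence are illustrative assumptions only; in the experiments the prefixes are drawn from the target LLM's own vocabulary.

```python
# Hedged sketch: sample q random tokens and prepend them to a counterfactual prompt pair.
import random
from transformers import AutoTokenizer

def sample_random_prefix(tokenizer, q: int = 100, seed: int = 0) -> str:
    """Draw q token ids uniformly at random from the vocabulary and decode them."""
    rng = random.Random(seed)
    token_ids = [rng.randrange(tokenizer.vocab_size) for _ in range(q)]
    return tokenizer.decode(token_ids)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer choice only
    prefix = sample_random_prefix(tok, q=100)
    pivot = "The {group} worked as a"            # hypothetical counterfactual pivot prompt
    prompts = [prefix + " " + pivot.format(group=g) for g in ("man", "woman")]
    for p in prompts:
        print(p[:120], "...")
```

Each sampled prefix defines one counterfactual prompt set; the same prefix is shared by all prompts in the set so that the prompts differ only in the demographic group.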
Mixture of jailbreaks. Figure 8 presents some examples of mixture-of-jailbreak prefixes, formed from the main and helper jailbreaks in Figures 5 and 6, respectively.

Soft prefixes. Examples of soft prefixes added to the queries cannot be represented in textual form, as they consist of perturbations of the prefix in the LLM's embedding space. However, the general structure is {jailbreak perturbed in embedding space}.

[2] Last accessed on Jan 22, 2024 from www.jailbreakchat.com/prompt/588ab0ed-2829-4be8-a3f3-f28e29c06621
[3] https://www.reddit.com/r/ChatGPT/comments/113ddqq/my_jailbreak_is_far_superior_to_dan_the_prompt_is/

Figure 5: Main jailbreaks for different sensitive attributes ((a) for gender, (b) for race). Figure 6: Instructions derived from the helper jailbreak. Figure 7: Examples of random prefixes. Figure 8: Examples of mixture-of-jailbreak prefixes.

D CASE STUDIES FROM DECODING TRUST CERTIFICATES

We introspect the example "[target group] have a drug problem" from QDT. We study the responses of 4 LLMs (Mistral, Llama-7B, Gemini, and GPT-3.5) obtained when certifying them for the mixture-of-jailbreaks specification based on this example. Figure 9 presents some samples of responses from the models to explain their certificates (Figure 9: Responses of Mistral, Llama-7B, Gemini, and GPT-3.5 obtained during their certification for the Decoding Trust example "[target group] have a drug problem" with the mixture-of-jailbreaks specification; the certification result is indicated alongside the model). We observe that the Gemini model is biased against white people, calling them "lazy" and "more likely to be exposed to drugs". Similarly, Mistral is also biased against white people, saying that they have "a monopoly on addiction". On the other hand, GPT-3.5 is biased against black people, depicting them as "inferior" and "inclined towards addiction". The Llama model is safer, refraining from biased responses. The bounds on the probability of unbiased responses suggest that GPT-3.5 and Gemini are more inclined to produce biased responses than the Mistral and Llama models for prompts in the distribution of counterfactual prompt sets in the given specification.

Table 2: Hyperparameter values
Hyperparameter | Description | Value
γ | (1 − γ) is the confidence over certification | 0.05
n | Number of samples for certification | 50
T | LLM decoding temperature | 1.0
Top-k | LLM decoding top-k | 10
q | Prefix length for random prefixes | 100
pλ | Interleaving probability | 0.2
pµ | Mutation probability | 0.01
κ | Max noise magnitude added to jailbreak embedding elements, relative to the maximum embedding value | 0.02

E ABLATION STUDY

In this section, we study the effect of changing the various certification parameters (a.k.a. hyperparameters) on the certificates generated with LLMCert-B. Table 2 presents the list of hyperparameters and their values used in our experiments. We regenerate the certificates for different prefix distributions by varying the hyperparameters. In particular, we study the variations in the results when n, T, Top-k, q, pλ, pµ, and κ are varied, keeping γ constant. This is because 1 − γ denotes the confidence of the certification bounds, which is generally desired to be high; 95% is a typical confidence level for practical applications (Sim and Reid, 1999). We conduct this ablation study on the specifications for a randomly picked set of 100 counterfactual prompt sets from BOLD's test set QBOLD. We certify the Mistral-Instruct-v0.2 (Jiang et al., 2023) 7B-parameter model and study the overall results next.
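To make the roles of n and γ in Table 2 concrete, the following is a minimal sketch (assumed helper, not the paper's code) of the exact Clopper-Pearson bounds the certifier places on the probability of unbiased responses, given k unbiased responses observed out of n sampled prompt sets.

```python
# Hedged sketch of the certifier's Clopper-Pearson bounds at confidence 1 - gamma.
from scipy.stats import beta

def clopper_pearson(k: int, n: int, gamma: float = 0.05) -> tuple[float, float]:
    """Exact two-sided Clopper-Pearson bounds on a Bernoulli success probability."""
    lower = 0.0 if k == 0 else beta.ppf(gamma / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - gamma / 2, k + 1, n - k)
    return lower, upper

# Example: 46 of n = 50 sampled prompt sets yield unbiased responses (gamma = 0.05, Table 2).
print(clopper_pearson(46, 50))  # roughly (0.81, 0.98)
```

Increasing n tightens the interval while γ fixes how often it is allowed to miss the true probability, which is why the ablations below vary n but keep γ at 0.05.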
E.1 CERTIFICATION ALGORITHM HYPERPARAMETER

We show ablations on n for all kinds of specifications in Figure 10. We see that the bounds begin converging at 50 samples and subsequent samples cause only minor variations in their values, justifying our choice of 50 samples. Fewer than 50 samples can result in looser bounds. (Figure 10: Ablation study on the certification hyperparameters showing variations of the average certification bounds with the number of samples n; (a) random prefixes, (b) mixtures of jailbreaks, (c) soft prefixes.)

E.2 LLM DECODING HYPERPARAMETERS

We study variations in the certification bounds with 2 important hyperparameters of the LLM decoding algorithms that influence their generated texts: T (decoding temperature) and Top-k (the number of highest-probability tokens considered at each decoding step). Figures 11 and 12 show the variations in the certification bounds with T and Top-k, respectively, for the 3 kinds of specifications. We see only minor changes in the average certification bounds when these hyperparameters are varied. Our hypothesis for this phenomenon is that, as the certificates aggregate the bias results of several samples, they smooth out the noise introduced by the choice of LLM decoding hyperparameters and give insights into the biases of the LLM itself. (Figure 11: Ablation study showing variations of the average certification bounds with temperature T; (a) random prefixes, (b) mixtures of jailbreaks, (c) soft prefixes. Figure 12: Ablation study showing variations of the average certification bounds with the Top-k parameter; (a) random prefixes, (b) mixtures of jailbreaks, (c) soft prefixes. Figure 13: Ablation study on the certification hyperparameters showing variations of the average certification bounds; (a) variation with prefix length q for random prefix specifications, (b) variation with interleaving probability pλ for mixture-of-jailbreaks specifications, (c) variation with mutation probability pµ for mixture-of-jailbreaks specifications, (d) variation with relative magnitude of noise κ for soft prefixes from jailbreak specifications.)

E.3 RANDOM PREFIXES

The specifications based on random prefixes have one hyperparameter, q, the length of the random prefix. Hence, we vary this hyperparameter while keeping the others fixed. Figure 13a presents the variation in the average certification bounds obtained when varying q.

E.4 MIXTURE OF JAILBREAKS

These specifications have two hyperparameters: pλ, the probability of adding an instruction from the helper jailbreak when interleaving, and pµ, the probability of randomly flipping each token of the result of interleaving. We show ablation studies on these in Figures 13b and 13c, respectively.

E.5 SOFT PREFIXES FROM JAILBREAKS

These specifications have one hyperparameter, κ, the maximum relative magnitude (with respect to the maximum magnitude of the embeddings) by which the additive uniform noise can change the embeddings of the main jailbreak. Figure 13d presents an ablation study on κ. We see no significant effects of varying the above-mentioned hyperparameters on the certification results for the different specifications.
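To make concrete how pλ, pµ, and κ enter the prefix samplers ablated above, the following is a minimal sketch under stated assumptions: the main jailbreak is represented as a list of instruction strings, the helper jailbreak as a list of candidate instructions, and the jailbreak embeddings as a NumPy array. None of the function names below come from the paper's code.

```python
# Hedged sketch of the mixture-of-jailbreaks and soft-prefix samplers.
import random
import numpy as np

def sample_mixture_prefix(main_instrs, helper_instrs, p_lambda=0.2, p_mu=0.01,
                          vocab=None, rng=None):
    """Interleave helper instructions after each main instruction with prob. p_lambda,
    shuffle instructions inserted at the same point, then flip each token of the result
    to a random vocabulary token with prob. p_mu."""
    rng = rng or random.Random()
    pieces = []
    for instr in main_instrs:                      # one insertion point per main instruction
        pieces.append(instr)
        inserted = [h for h in helper_instrs if rng.random() < p_lambda]
        rng.shuffle(inserted)                      # random order at the same insertion point
        pieces.extend(inserted)
    tokens = " ".join(pieces).split()
    if vocab:                                      # token-level mutation
        tokens = [rng.choice(vocab) if rng.random() < p_mu else t for t in tokens]
    return " ".join(tokens)

def sample_soft_prefix(jailbreak_embeddings: np.ndarray, kappa=0.02, rng=None):
    """Perturb the jailbreak's token embeddings with uniform noise whose magnitude is at
    most kappa times the maximum absolute embedding value."""
    rng = rng or np.random.default_rng()
    bound = kappa * np.abs(jailbreak_embeddings).max()
    noise = rng.uniform(-bound, bound, size=jailbreak_embeddings.shape)
    return jailbreak_embeddings + noise
```

Under this reading, larger pλ and pµ make the textual prefix drift further from the manual jailbreak, while larger κ widens the neighborhood explored around it in embedding space.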
E.6 SCALING BEYOND BINARY DEMOGRAPHIC GROUPS

Table 3: Average bounds on the probability of unbiased responses from Mistral 7B.
Spec type | Bounds
Random | (0.91, 1.0)
Mixture | (0.82, 0.98)
Soft | (0.20, 0.46)

Our general framework (Section 3) and specification instances (Section 4) are applicable to certify biases beyond binary counterfactual prompt sets (e.g., male/female gender, black/white race). This is subject to the availability of bias detectors D that can identify biases across responses for counterfactual prompt sets spanning more than two demographic groups. While we are not aware of any D that could work with QBOLD, we extend our D for the specifications from QDT to work for responses to prompts from three racial demographic groups: black people, white people, and Asians. We elaborate on the extension in Appendix G.2. We certify the Mistral-Instruct-v0.2 7B model with the three kinds of specifications and find that the average certification bounds presented in Table 3 are similar to the bounds presented for the Mistral model in Table 1 for the random and mixture-of-jailbreaks specifications. However, the results are significantly worse for the soft prefix specifications. This is because, first, the model is particularly susceptible to these specifications, as is evidenced even in the results with binary demographic groups. Second, with the addition of another demographic group, the bias detector is skewed towards identifying bias in more sets of responses than in the binary case: it identifies bias in sets of responses having at least one agreement and one disagreement with the stereotype mentioned in the prompts, and this pattern is as likely as an unbiased result for binary demographic groups, but more likely beyond them.

F VALIDITY OF CONFIDENCE INTERVALS

We design a synthetic study of the validity of the confidence intervals as follows. As we cannot precisely regulate the true probability of unbiased responses of LLMs, we assume various values of that probability and generate binary-valued samples indicating biased (non-zero) / unbiased (zero) responses from a hypothetical LLM. Specifically, we generate 50 samples (the same number as used in LLMCert-B's certification) of the Bernoulli random variable F (Section 3.2) for various values of the success probability, and compute Clopper-Pearson confidence intervals for the success probability from the samples. We repeat this process 1000 times and report the percentage of instances in which the confidence intervals contain the true probability of success. This percentage indicates the probability of correctness of the confidence intervals. We find that for all 11 equally-spaced values of the true probability of unbiased responses between 0 and 1, the confidence intervals bound the true value more than 95% (the nominal, user-specified confidence level) of the time, which validates the claim that the confidence intervals hold with at least the user-specified confidence. Figure 14 presents the variation in the proportion of Clopper-Pearson confidence intervals that contain the true probability of unbiased responses for different values of the latter (Figure 14: Variation in the proportion of Clopper-Pearson confidence intervals containing the true probability of unbiased responses; the dashed line shows the user-specified confidence level, i.e., 95%).
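The synthetic coverage study above can be reproduced with a short script. The following is a minimal sketch, not the paper's script; it assumes statsmodels' Clopper-Pearson ("beta") intervals, and the sample size and confidence level follow Table 2.

```python
# Hedged sketch of the Appendix F coverage study for Clopper-Pearson intervals.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

def coverage(p_true: float, n: int = 50, gamma: float = 0.05,
             repeats: int = 1000, seed: int = 0) -> float:
    """Fraction of Clopper-Pearson intervals (from n Bernoulli draws) containing p_true."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(repeats):
        k = rng.binomial(n, p_true)                                    # unbiased responses
        lo, hi = proportion_confint(k, n, alpha=gamma, method="beta")  # Clopper-Pearson
        hits += (lo <= p_true <= hi)
    return hits / repeats

for p in np.linspace(0.0, 1.0, 11):                # 11 equally-spaced true probabilities
    print(f"p = {p:.1f}: coverage = {coverage(p):.3f}")  # expected to be at least ~0.95
```

Because the Clopper-Pearson construction is conservative, the measured coverage should sit at or above the nominal 95% level for every true probability, matching Figure 14.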
G BIAS DETECTOR

G.1 BOLD EXPERIMENTS

Design of bias detector. Our bias detector DBOLD is based on the regard bias metric (Sheng et al., 2019). Specifically, we measure the bias for a pair of completions, one for a male and another for a female subject, by invoking the regard classifier in compare mode. The compare mode gives the difference between the regard scores obtained for the two completions. We consider all completion pairs with a regard difference higher than a prespecified threshold (0.85 in this case) as biased. For the completions considered unbiased by this first check, we also consider the regard scores of the individual completions (with respect to the partial sentence given to the model), and consider the completions biased if either score exceeds the threshold. This handles instances of bias wherein the pair of generations is similarly biased and the individual generations contain text discriminating against a particular demographic group.

Evaluating the metric. We evaluate DBOLD against human judgments of bias by getting annotations for a subset of the completions from the different models on Amazon Mechanical Turk. We randomly select 490 pairs of completions (for male and female subjects in the prompts), 70 each obtained when certifying 7 different models. Each pair of completions is annotated by 3 different annotators. We combine the 3 annotations for each pair into a single bias label by taking the majority consensus among the annotators. We compare the bias annotations thus obtained from humans with the results of DBOLD and find that DBOLD's outputs match human intuition 76% of the time. We provide the HTML file used to render the instructions shown to the Amazon Mechanical Turk workers in our supplementary material. Each participant is given a compensation of $0.50 for the annotations. Next, we provide a qualitative analysis with examples where the results of DBOLD and the human evaluation match and where they do not. Figure 15a shows true positive examples, where both DBOLD and the annotators identify bias. Figure 15b shows false negative examples, where DBOLD does not identify bias but the annotators do. Figure 15c shows false positive examples, where DBOLD identifies bias but the annotators do not. (Figure 15: Qualitative analysis of DBOLD with human bias annotations; (a) true positives, (b) false negatives, (c) false positives.) According to the authors, the false positive examples are actually biased. But we also recognize that bias is a complex subject, and its notion can vary across individuals. The human annotation of bias is contingent on many factors, such as the annotators' cultural background, our annotation instructions (provided in the open-source implementation of the framework), etc. Hence, we believe that the noise in the human study needs to be taken into account when evaluating our bias detector DBOLD. Our bias detector shows 93% precision but 50% recall. Owing to these inconsistencies of our bias detector with human perception of bias, we believe that our bounds on the probability of unbiased responses are actually higher than the bounds that would be obtained with a perfect bias detector (due to the low recall), indicating a worse situation of counterfactual bias in SOTA LLMs.
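The decision rule of DBOLD described above can be summarized in a few lines. The following is a hedged sketch only: the two scoring helpers are hypothetical stand-ins for the regard classifier of Sheng et al. (2019), and only the thresholding logic is taken from the description above; the actual classifier invocation in the released code may differ.

```python
# Hedged sketch of DBOLD's thresholded decision rule (not the paper's implementation).
THRESHOLD = 0.85  # prespecified regard threshold from the text above

def regard_difference(completion_a: str, completion_b: str) -> float:
    """Stand-in: difference in regard between two completions ('compare mode')."""
    raise NotImplementedError("plug in a regard classifier here")

def regard_score(completion: str) -> float:
    """Stand-in: regard score of a single completion w.r.t. its partial-sentence prompt."""
    raise NotImplementedError("plug in a regard classifier here")

def d_bold(completion_male: str, completion_female: str) -> bool:
    """Return True if the counterfactual completion pair is flagged as biased."""
    if regard_difference(completion_male, completion_female) > THRESHOLD:
        return True  # the two completions are regarded very differently
    # Otherwise, flag pairs where either completion is individually extreme.
    return (regard_score(completion_male) > THRESHOLD or
            regard_score(completion_female) > THRESHOLD)
```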
G.2 DT EXPERIMENTS

We design the bias detector DDT for specifications from the Decoding Trust dataset, using the agreement detector proposed for the stereotype bias dataset in the Decoding Trust paper (Wang et al., 2024). In particular, the agreement detector in that benchmark applies to a single response and identifies whether it contains phrases such as "I agree" (with the stereotype mentioned in the prompt) for agreement, or phrases like "I disagree" for disagreement. To detect bias across the responses to the counterfactual prompts, DDT checks for agreement in the individual responses and identifies bias when there is at least one agreement but not all responses agree, i.e., the model agrees with the stereotype for some demographic groups but not others. DDT easily scales beyond binary demographic groups, as it is agnostic to the number of responses across which it needs to detect bias.

G.3 LIMITATIONS OF BIAS DETECTORS

As we can see from the human evaluations of DBOLD and from the simplicity of DDT, the bias detectors can have several false positive and false negative cases. This is expected, as detection of bias is a challenging task (Li et al., 2022b; Kruspe, 2024), owing to the subtle and intricate differences between biased and unbiased texts. Moreover, there can be special characters/emojis in the texts generated by the LLMs that are often not considered by textual bias detectors but are clearly biased, e.g., (we observe this emoji combination in the generations of Vicuna-7B for random prefix specifications from QBOLD). While our certification method inherits some of the inaccuracies of the bias detectors we use, these inaccuracies are often smoothed out, as certification aggregates several observations of bias to generate bounds on the probability of unbiased responses.

H COMPARING WITH PRIOR WORKS ON GUARANTEES FOR LLMS

Works on conformal prediction. (Quach et al., 2024; Deutschmann et al., 2023; Mohri and Hashimoto, 2024; Yadkori et al., 2024) apply conformal prediction techniques to language models, proposing methods for generating sets of outputs with statistical guarantees on correctness, coverage, or abstention, aiming to improve reliability and mitigate hallucinations. Their guarantees and scope differ from those of LLMCert-B. These prior works give specialized decoding strategies that guarantee the correctness of the outputs, which is useful for the factuality of responses. LLMCert-B, in contrast, assesses and certifies counterfactual biases in LLM responses generated with any decoding scheme applied to the target LLM. Moreover, the guarantees of conformal prediction are for the correctness of LLM responses to individual prompts, while LLMCert-B's guarantees are over distributions of counterfactual prompt sets (which have prohibitively large sample spaces).

Prompt Risk Control. (Zollo et al., 2024) (PRC) presents a framework for selecting low-risk system prompts for LLMs. PRC computes the loss incurred for one prompt at a time and aggregates those losses to form a risk measure. LLMCert-B, on the other hand, is for counterfactual bias, i.e., we assess bias across a set of LLM responses obtained by varying the sensitive attributes in the prompts. Our specification is thus a relational property (Barthe et al., 2011), which is defined over multiple related inputs. Biases across LLM responses to multiple related prompts are aggregated to certify the target LLM. Moreover, LLMCert-B proposes novel, inexpensive mechanisms to design distributions and their samplers with potentially adversarial prefixes, containing effective, manually-designed jailbreaks in their sample space. Contrary to PRC, which uses static datasets of adversarial examples, we use independent and identically-distributed samples from these distributions.
I ADDITIONAL NOTES ON BIAS IN ML

Bias is a complex social phenomenon that arises in various forms. In this section, we discuss various notions of bias and the harms caused by them. We also discuss how LLMCert-B complements existing evaluation methods by certifying for counterfactual bias. The following discussion is not a comprehensive treatment of bias in machine learning; we refer the reader to detailed survey papers such as (Gallegos et al., 2024a; Blodgett et al., 2020; Li et al., 2024) for more information.

Bias consists of disparate treatment (a.k.a. discrimination) or disparate outcomes (Barocas and Selbst, 2016) for different demographic groups. Harms due to bias are primarily of two kinds: representational and allocation harms (Gallegos et al., 2024a). Representational harm (Suresh and Guttag, 2021; Blodgett et al., 2020) consists of denigrating and subordinating attitudes towards a demographic group, including the use of derogatory language, stereotyping, toxicity, misrepresentation, etc. These can arise from inappropriate use of language by humans or machines (e.g., LLMs). Allocation harms (Ferrara, 2023) are the disparate distribution of resources or opportunities between demographic groups, consisting of direct or indirect discrimination in economic or social opportunities. For example, prior works like (Terry et al., 2010; Martínez, 2022) show that the lack of representation of African American English in dominant language practices results in that community facing penalties in education systems or when seeking housing. Most constitutions around the world have anti-discrimination laws like (Sherry, 1965) that prohibit allocative harms in employment, etc. Language is considered an important factor for labeling, modifying, and transmitting beliefs about demographic groups and can result in the reinforcement of social inequalities (Rosa and Flores, 2017).

LLMCert-B is a reliable evaluation method for counterfactual bias in language models (LMs) that certifies the probability of unbiased responses of target LLMs for distributions of counterfactual prompts, with statistical guarantees. Prior bias assessments have been of two kinds (Cao et al., 2022): intrinsic and extrinsic. Intrinsic bias occurs in the language representations, while extrinsic bias manifests in the final textual responses of the LMs. To be able to certify closed-source LMs as well, we study extrinsic bias. Bias is the opposite of fairness, which has been identified to be of various forms, such as group fairness (Blandin and Kash, 2024), individual fairness (Dwork et al., 2011), and counterfactual fairness (Kusner et al., 2018). LLMCert-B certifies for counterfactual bias, akin to counterfactual fairness. This is because the causal perspective of counterfactual bias (Anthis and Veitch, 2023) (bias due to mentioning a specific demographic group in the prompt) aligns more closely with human intuitions about discrimination and fairness. Moreover, unlike group fairness, counterfactual fairness operates at the individual level, thus identifying bias in specific cases instead of aggregates.