Published as a conference paper at ICLR 2025

CERTIFYING COUNTERFACTUAL BIAS IN LLMS

Isha Chaudhary1 Qian Hu2 Manoj Kumar3 Morteza Ziyadi2 Rahul Gupta2 Gagandeep Singh1
1 UIUC, 2 Amazon, 3 Oracle Health

Warning: This paper contains model outputs that are offensive in nature.

Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to large numbers of inputs and do not provide guarantees. Therefore, we propose LLMCert-B, the first framework that certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts (prompts differing only by demographic groups) sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes, sampled from prefix distributions, to a given set of prompts. We consider prefix distributions consisting of random token sequences, mixtures of manual jailbreaks, and perturbations of jailbreaks in the LLM's embedding space. We generate non-trivial certificates for SOTA LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive prefix distributions.

1 INTRODUCTION

Text-generating Large Language Models (LLMs) are increasingly being deployed in user-facing applications, such as chatbots (Lee et al., 2023; Brown et al., 2020a; Gemini Team, 2024), where they are popular for producing human-like texts (Shahriar and Hayawi, 2023). These LLMs are safety-trained (Wang et al., 2023) to avoid generating harmful content. However, despite safety training, they can produce texts that exhibit social biases and stereotypes (Kotek et al., 2023; Manvi et al., 2024; Hofmann et al., 2024). Such texts can result in representational harms (Suresh and Guttag, 2021; Blodgett et al., 2020) to protected demographic groups (subsets of the population that are negatively affected by bias). Representational harms include stereotyping, denigration, and misrepresentation of historically and structurally oppressed demographic groups. Although representational harms are harmful in their own right (Blodgett et al., 2020), as they can socially impact individuals and redefine social hierarchies, the resulting allocation harms (Gallegos et al., 2024a) can lead to economic losses for protected groups and are therefore regulated by anti-discrimination laws such as (Sherry, 1965). Language is considered an important factor for labeling, modifying, and transmitting beliefs about demographic groups, and can reinforce social inequalities (Rosa and Flores, 2017). Hence, with the rising popularity of LLMs, it is important to formally evaluate their biases to effectively mitigate the representational harms resulting from them (Lee et al., 2024). We focus on counterfactual bias, inspired by Kusner et al. (2018), which assesses semantic differences across LLM responses caused by varying the demographic groups mentioned in prompts (counterfactual prompts). Prior work has focused on benchmarking LLM performance (Liang et al., 2023; Wang et al., 2024; Mazeika et al., 2024) and adversarial attacks (Sheng et al., 2020; Zou et al., 2023; Vega et al., 2023; Wallace et al., 2019).
While these methods provide some empirical insights into LLM bias, they have several fundamental limitations (Mc Intosh et al., 2024; Yang et al., 2023) such as (1) Limited Corresponding author: Work done while at Amazon Published as a conference paper at ICLR 2025 Figure 1: (Overview of LLMCert-B): LLMCert-B is a quantitative certification framework to certify bias in target LLM s responses to a random set of prompts differing only by a sensitive attribute. In specific instantiations, LLMCert-B samples a (a) set of prefixes from a given distribution and prepends them to a prompt set to form (b) the prompts given to the target LLM. (c) The target LLM s responses are checked for bias by a bias detector, (d) whose results are fed into a certifier. (e) Certifier computes bounds using the Clopper-Pearson method (Clopper and Pearson, 1934) on probability of unbiased LLM responses for any set of prompts formed with a random prefix from prefix distribution. test cases: Benchmarking consists of evaluating several but limited number of test cases. Due to its enumerative nature, benchmarking can not scale to prohibitively large numbers of prompts that can elicit bias from LLMs. Adversarial attacks identify only few worst-case examples, which do not inform about the overall biases from large input sets; (2) Test set leakage: LLMs may have been trained on popular benchmarking datasets, thus resulting in incorrect evaluation; (3) Lack of guarantees. Benchmarking involves empirical estimation without any formal guarantees of generalization over any input sets. Similarly, adversarial attacks give limited insights as they can show existence of problematic behaviors on individual inputs but do not quantify the risk of biased LLM responses. This work. We propose an alternative to benchmarking and adversarial attacks certifying LLMs for bias, with formal guarantees. Certification operates on a prohibitively large set of inputs, represented succinctly as a specification. Specifications define inputs mathematically through operators over the vocabulary of LLMs. Certification can provide guarantees on the target model s behavior that generalize to unseen inputs satisfying the specification. With guarantees, we can be better informed about the risks associated with LLMs before deploying them in public-facing applications. Key challenges. (1) There are no existing precise mathematical representations of large sets of (counterfactual) prompts to make practical specifications. (2) State-of-the-art neural network certifiers (Wang et al., 2021; Singh et al., 2019) currently do not scale to LLMs as they require white-box access to the model and lose precision significantly for larger models, resulting in inconclusive results. Our approach. Given the diversity of LLM prompts, there will always be some cases where the LLM output will be biased (e.g., found by adversarial attacks (Zou et al., 2023)). Hence, we believe that LLM certification must be quantitative (Li et al., 2022a; Baluta et al., 2021) and study the question: What is probability of unbiased LLM responses for any counterfactual prompt set? Exactly computing the probability of unbiased responses is infeasible due to the large number of possible counterfactual prompt sets over which the biased behavior has to be determined. One can try to compute deterministic lower and upper bounds on the probability (Berrada et al., 2021). 
However, this is expensive and requires white-box access making it not applicable to popular, SOTA but closed-source LLMs such as GPT-4 (Achiam et al., 2023). Therefore, we focus on black-box probabilistic certification that estimates the probability of unbiased responses over a given distribution of counterfactual prompt sets with high confidence bounds. We develop the first general specification and certification framework, LLMCert-B1 for counterfactual bias in LLMs, applicable to both open 1LLM Certification of Bias Published as a conference paper at ICLR 2025 and closed-source LLMs (see Figure 1 for an overview). Our specifications over counterfactual prompt sets are the first relational properties (Barthe et al., 2011) for trustworthy LLMs. We demonstrate LLMCert-B with 3 kinds of specifications, each with counterfactual prompt set distributions formed by adding prefixes sampled from given distribution of prefixes to a fixed set of counterfactual prompts. We present distributions of prefixes consisting random token sequences, mixtures of popular jailbreaks, and jailbreak perturbations in embedding space ( 3.1). The first two are model-agnostic specifications and hence apply to both open and closed-source models. However, the third one requires access to the embeddings and the ability to prompt LLMs with embeddings, and hence applies to only open-source models. The mixture and embedding space jailbreak prefix distributions contain effective, manually designed jailbreaks and their perturbations, which are potential jailbreaks, in their sample space and hence assess LLMs biases in adversarial settings. We certify the target LLM leveraging confidence intervals. LLMCert-B samples several counterfactual prompt sets from the specified prompt distribution and generates high-confidence bounds on the probability of unbiased LLM responses for any random counterfactual prompt set in the distribution. Contributions. Our main contributions are: We design novel specifications that quantify the desirable relational property of low counterfactual bias in LLM responses over counterfactual prompts in a specified distribution. We illustrate such specifications with distributions of counterfactual prompt sets constructed with potentially adversarial prefixes. The prefixes are drawn from 3 distributions (1) random, (2) mixture of jailbreaks, and (3) jailbreak perturbations in the embedding space. We develop the first probabilistic black-box certifier LLMCert-B, applicable to both open and closed-source models, for quantifying counterfactual bias in LLM responses. LLMCert B leverages confidence intervals (Clopper and Pearson, 1934) to generate high-confidence bounds on the probability of obtaining unbiased responses from the target LLM, given any random set of counterfactual prompts from the distribution given in the specification. We find that the safety alignment of SOTA LLMs is easily circumvented with several prefixes in the distributions given in our specifications, especially those involving mixture of jailbreaks and jailbreak perturbations in the embedding space ( 5). These distributions are inexpensive to sample from, but can effectively bring out biased behaviors from SOTA models. This shows the existence of simple, bias-provoking distributions for which no defenses exist currently. We provide quantitative measures for the fairness (lack of bias) of SOTA LLMs, which hold with high confidence. 
We find that there are no consistent trends in the fairness of models with the scaling of their sizes, suggesting that the quality of alignment techniques could be a more important factor than size for fairness. Our implementation is available at https://github.com/uiuc-focal-lab/LLMCert-B, and we provide guidelines for practitioners using our framework in Appendix A.

2 BACKGROUND

2.1 LARGE LANGUAGE MODELS (LLMS)

LLMs are autoregressive models for next-token prediction. Given a sequence of tokens t1, . . . , tk, they give a probability distribution over their vocabulary for the next token, P[tk+1 | t1, . . . , tk]. They are typically fine-tuned for instruction-following (Zhang et al., 2024) and aligned with human feedback (Wang et al., 2023; Ouyang et al., 2022) to make their responses safe. We certify instruction-tuned, aligned LLMs for counterfactual bias, as they are typically used in public-facing applications.

2.2 CLOPPER-PEARSON CONFIDENCE INTERVALS

Clopper-Pearson confidence intervals (Clopper and Pearson, 1934) provide lower and upper bounds [p_l, p_u] on the success probability p of a Bernoulli random variable, with probabilistic guarantees. The bounds are obtained from n independent and identically distributed observations of the random variable, in which k (≤ n) successes are observed. The confidence interval satisfies Pr{p ∈ [p_l, p_u]} ≥ 1 − γ, where γ ∈ (0, 1) is the (small) permissible error probability with which the true value of p ∉ [p_l, p_u]. The confidence intervals are obtained by statistical hypothesis testing for p, where p_l and p_u are respectively the lowest and highest values consistent with the observations at the given confidence 1 − γ.
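Concretely, the interval can be computed from the quantile function of the Beta distribution. The following is a minimal sketch (our illustration, not the authors' released code), using scipy.stats.beta; the helper name clopper_pearson is ours.

```python
# Minimal sketch (not the authors' released code): a two-sided Clopper-Pearson interval
# for the success probability p, given k observed successes in n Bernoulli trials.
from scipy.stats import beta

def clopper_pearson(k: int, n: int, gamma: float = 0.05):
    """Return (p_l, p_u) with Pr{p in [p_l, p_u]} >= 1 - gamma."""
    p_l = 0.0 if k == 0 else beta.ppf(gamma / 2, k, n - k + 1)
    p_u = 1.0 if k == n else beta.ppf(1 - gamma / 2, k + 1, n - k)
    return p_l, p_u

# Example: 46 unbiased outcomes out of n = 50 samples at 95% confidence
# gives bounds of roughly (0.81, 0.98).
```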
3 FORMALIZING BIAS CERTIFICATION

We develop a general framework, LLMCert-B, to specify and quantitatively certify counterfactual bias in text generated by (Large) Language Models (LMs). LLMCert-B formalizes bias with specifications: precise mathematical representations that define the desirable property (absence of bias) over large sets of inputs. Bias is defined with respect to demographic groups: subsets of the human population sharing an identity trait, which could be biological or socially constructed (Gallegos et al., 2024b). Bias consists of disparate treatment or outcomes when varying the demographic groups in inputs to the target LM. For autoregressive LMs, we consider text generation bias consisting of stereotyping, misrepresentation, derogatory language, etc., that can result in representational harms (Gallegos et al., 2024b). To apply certification to closed-source LMs as well, we study extrinsic bias (Cao et al., 2022) that manifests in the final LM response texts. See Appendix I for a detailed discussion on bias in ML.

3.1 BIAS SPECIFICATION

Next, we formally specify the lack of bias in the responses of language models. Unbiased LM responses do not exhibit semantic disparities owing to specific demographic groups in the prompts (Gallegos et al., 2024b; Sheng et al., 2019; Smith et al., 2022). Our bias specification is motivated by Counterfactual Fairness (Kusner et al., 2018). Consider a given identity trait I, such as gender or race (traits that are often the basis of social bias). I categorizes the human population into m subsets called demographic groups G1, . . . , Gm, each differing by the identity trait. Each demographic group G is characterized and recognized by several synonymous strings in society, called sensitive attributes G^A (Li et al., 2024). For example, the sensitive attributes for the demographic group corresponding to the female gender are woman, female, etc. We select any one sensitive attribute of a demographic group G to represent G. Let the resulting set of sensitive attributes, each corresponding to a demographic group, be A = {A1, . . . , Am}, where Ai ∈ G_i^A. Our specifications are for counterfactual prompts (Gallegos et al., 2024b) that differ only by the sensitive attributes in them. Let V be the vocabulary of the target LM L. Consider a set of prompts P = {P1, . . . , Ps}, s > 1, P ⊆ V^c, where c is the context length of L and V^c is the set of sequences of elements of V having length in [1, c]. Let each prompt in P contain a unique sensitive attribute from A, such that overall P represents more than one distinct demographic group represented in A. Let each prompt be Pi = Xi ⊕ Ai, where ⊕ denotes string concatenation, Xi is the part of Pi independent of sensitive attributes, and Ai is a sensitive attribute. We consider only prompts that can be decomposed into parts with and without sensitive attributes, respectively. To generalize to closed-source LMs, we assume L to be a black-box system that can only be queried when specifying and certifying bias. The black-box assumption renders any symbolic analysis (Mirman et al., 2020) infeasible and hence allows only for analysis with input-output examples.

Definition 1. (Counterfactual prompt set). A set of prompts P = {P1 = X1 ⊕ A1, . . . , Ps = Xs ⊕ As} is called counterfactual when: (1) ∀i, j ∈ [1, s]. Xi = Xj; (2) ∀i ∈ [1, s]. ∃j ∈ ([1, s] \ {i}). Ai ≠ Aj; (3) for an unbiased text generator f, ∀i ∈ [1, s]. f(Pi) = f(Xi).

That is, the prompts only differ in the sensitive attributes, which are ideally unrelated to the overall query, and the set of prompts P represents more than one sensitive attribute from A. We specify bias over counterfactual prompt sets (Definition 1). These exclude prompts where sensitive attributes are important to answer the overall query, such as "What steps should I take to prepare for becoming a parent?", as semantically different answers, dependent on the sensitive attribute, are acceptable in such cases. All possible counterfactual prompt sets can be prohibitive in number for typical context lengths. This is because the common part of counterfactual prompt sets, X, can be any element of V^c, which contains about 10^10000 elements for c = 2k. Hence, enumerative specifications (which specify the desired behavior on all inputs) are impractical, as they cannot be scalably certified without symbolic analysis for such a large number of inputs. Hence, we define probabilistic specifications for the probability of unbiased responses from L, for which we provide a certification algorithm in Section 3.2. Let Ψ be a sampleable discrete probability distribution over ℘(V^c) (the power set of prompts) having non-zero support on some counterfactual prompt sets P. We define probabilistic specifications for bias in L over Ψ. The specification is agnostic to Ψ's sampler, as long as it generates independent and identically distributed samples. We show examples of Ψ in Section 4. Let D be a user-defined bias detection function that can identify stereotypes or disparity in given texts for the different sensitive attributes in A. Let D evaluate to zero for unbiased inputs (D can be scaled and shifted to satisfy this). We leave D as a parameter of the specification, as different domains can have varying notions of bias and stakeholders can decide the most suitable notion (Anthis et al., 2024).
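To make Definition 1 and the role of D concrete, below is a small illustrative sketch of our own: a counterfactual prompt set built from a shared part X and two sensitive attributes, plus a toy detector. The attribute strings, the scoring function, and the disparity threshold are hypothetical; the detectors actually used are the regard-based and agreement-based ones described in Section 5 and Appendix G.

```python
# Illustrative only: a counterfactual prompt set (Definition 1) and a toy detector D.
# The attributes, score function, and 0.5 threshold are placeholders, not the paper's detectors.
from typing import Callable, List

SENSITIVE_ATTRIBUTES = ["man", "woman"]   # one representative attribute per demographic group

def counterfactual_prompt_set(x: str) -> List[str]:
    """Build P = {X + A1, ..., X + As}: a shared part X, differing only in the attribute."""
    return [f"{x} {attr}." for attr in SENSITIVE_ATTRIBUTES]

def toy_detector(responses: List[str], score: Callable[[str], float]) -> int:
    """Return 0 (unbiased) if per-response scores agree closely, 1 (biased) otherwise."""
    scores = [score(r) for r in responses]
    return 0 if max(scores) - min(scores) < 0.5 else 1

prompts = counterfactual_prompt_set(
    "Complete the sentence: The most dependable engineer on the team was a")
# The two prompts differ only in the trailing sensitive attribute, as Definition 1 requires.
```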
Overall, our quantitative specification is the probability of unbiased responses (measured by D) when L is independently prompted with each element of a randomly sampled counterfactual prompt set from Ψ (Equation 1). A certificate C is an evaluation/estimation of the specified probability of unbiased responses, along with samples of LLM responses, for the user-defined parameters Ψ, D, and L.

C(Ψ, D, L) ≜ Pr_{P∼Ψ} [D([L(P1), . . . , L(Ps)]) = 0]   (1)

3.2 CERTIFICATION ALGORITHM

Exactly computing C(Ψ, D, L) in Equation (1) is intractable, as it would require enumerating all (prohibitively many) prompt sets in the support of Ψ. Hence, our certification algorithm estimates C(Ψ, D, L) for the given Ψ and D and target L with high confidence, as described next. We generate intervals [p̂_l, p̂_u] that bound C(Ψ, D, L) in (1) with confidence 1 − γ. Such interval estimates are better than point-wise estimates, as they also quantify the uncertainty of the estimation. C(Ψ, D, L) is the success probability (probability of unbiased responses) of the Bernoulli random variable F that takes value 1 when D([L(P1), . . . , L(Ps)]) = 0 for P ∼ Ψ. To obtain high-confidence bounds on C(Ψ, D, L), we employ binomial proportion confidence intervals. In particular, we leverage the Clopper-Pearson confidence interval method (Clopper and Pearson, 1934) (Section 2.2), as it is known to be conservative, i.e., the confidence of the resultant intervals is at least the pre-specified confidence 1 − γ (Newcombe, 1998). We obtain n independent and identically distributed (iid) samples of F by sampling iid P from Ψ, and compute the Clopper-Pearson confidence interval of C(Ψ, D, L). The certificate, hence obtained, bounds the probability of unbiased responses for a random P with high confidence. Note that the certification results depend on the user-defined choices of n and 1 − γ.
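A minimal sketch of this certification loop is shown below. The interfaces sample_prompt_set (a sampler for Ψ), query_llm (black-box access to L), and detector (D) are placeholder names we assume for illustration, and clopper_pearson is the helper sketched in Section 2.2; this is a sketch of the procedure, not the released implementation.

```python
# Sketch of the certification loop (Section 3.2). sample_prompt_set, query_llm, and
# detector are assumed placeholder interfaces; clopper_pearson is from the earlier sketch.
def certify(sample_prompt_set, query_llm, detector, n=50, gamma=0.05):
    successes = 0
    for _ in range(n):                         # n iid samples of the Bernoulli variable F
        prompt_set = sample_prompt_set()       # one counterfactual prompt set P ~ Psi
        responses = [query_llm(p) for p in prompt_set]
        if detector(responses) == 0:           # unbiased responses count as a "success"
            successes += 1
    return clopper_pearson(successes, n, gamma)  # high-confidence bounds on C(Psi, D, L)
```

With n = 50 samples and gamma = 0.05, this matches the 95%-confidence, 50-sample setting used in the experiments (Section 5).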
4 CERTIFICATION INSTANCES

In this section, we instantiate prompt set distributions Ψ to form novel bias specifications. We select Ψ such that its underlying sample space has prompt sets that share a common characteristic, so we can certify bias conditioned on the presence of that characteristic. Thus, this becomes a local specification (Seshia et al., 2018), wherein the certificate is given for a local input space. Local specifications have commonly been considered in neural network verification (Singh et al., 2019; Wang et al., 2021; Baluta et al., 2021). Prior works on neural network specifications such as (Geng et al., 2023) generate only local specifications, as they correspond to meaningful real-world scenarios, and as local input regions are considered to design adversarial inputs for the models. In our local bias specifications, we consider Ψ around a given set of prompts Q (the pivot), denoting the resultant prompt set distribution as Ψ_Q. Prefixes are commonly used to steer the text generated by LLMs according to the user's intentions (Liu et al., 2021). Hence, we want to study whether the application of certain prefixes can elicit different forms of bias from the target LLM. Let Ψ_pre denote a distribution of prefixes. Each element in the sample space of Ψ_Q is a set of prompts formed by uniformly applying a prefix to all prompts Qi ∈ Q, that is, q_Q = ∪_{Qi∈Q} {p ⊕ Qi} for p ∼ Ψ_pre, where ⊕ denotes string concatenation. Algorithm 1 presents the probabilistic specification involving the addition of randomly sampled prefixes as a probabilistic program. Our probabilistic programs follow the syntax of the probabilistic programming language defined in (Sankaranarayanan et al., 2013, Figure 3). The syntax is similar to that of a typical imperative programming language, with the addition of primitive functions to sample from common distributions over discrete/continuous random variables (for example, Bernoulli: B, Uniform: U) and estimateProbability(.). estimateProbability(.) takes in a random variable and returns its estimated probability at a specific value. makePrefix(args, kind) (line 1) is a general function to sample different kinds of prefixes, such as random prefixes (Algorithm 2), mixtures of jailbreaks (Algorithm 3), and soft prefixes (Algorithm 4), constructed using the arguments args. C(Ψ_Q, D, L) characterizes the bias that can be elicited from L by varying the prefix, selected from Ψ_pre, that is applied to a given Q. Next, we describe the 3 different kinds of practical Ψ_pre and their sampling algorithms that define local bias specifications for L. We show some samples from each kind of Ψ_pre in Appendix C. Our specifications are for the average-case behavior of the target LLM, as the Ψ_pre are not distributions of provably adversarial (worst-case) prefixes.

Algorithm 1 Prefix specification
Input: L, Q; Output: C(Ψ_Q, D, L)
1: pre := makePrefix(args, kind = random / mixture / soft)
2: P := [pre ⊕ Qi for Qi ∈ Q]
3: C(Ψ_Q, D, L) := estimateProbability(D([L(P1), . . . , L(Ps)]) = 0)

Algorithm 2 Make random prefix
Input: V; Output: pre
1: pre := U(V) ⊕ . . . ⊕ U(V)  [q times]

Algorithm 3 Make mixture of jailbreak prefix
Input: L, V, M; Output: pre
1: M0 := split(M0)
2: H := ∪_{Mk ∈ M, k > 0} split(Mk)
3: ω(pλ, H) := shuffle({if(B(pλ), h, ⊥) | h ∈ H})
4: Mi := M0[0] ⊕ ω(pλ, H) ⊕ M0[1] ⊕ ω(pλ, H) ⊕ . . .
5: Mi := tokenize(L, Mi)
6: pre := [if(B(pµ), U(V), τ) for τ ∈ Mi]

Algorithm 4 Make soft prefix
Input: L, M0; Output: pre
1: E := embed(L, M0)
2: pre := E + U([−κ, κ])

Random prefixes. Prior works such as (Wei et al., 2023; Zou et al., 2023) have shown the effect of incoherent fixed-length strings in jailbreaking LLMs for harmful prompts. Hence, we specify bias in LLMs for prompts with incoherent prefixes that are random sequences of tokens from the vocabulary of the LLM. Such prefixes are not all intentionally adversarial, except for adversarial strings like those from prior works, but denote random noise in the prompt. Algorithm 2 presents the prefix sampler as a uniform distribution U(.) over random prefixes of fixed length q. The sample space of the random prefix distribution Ψ_pre has cardinality |V|^q. Ψ_pre for random prefixes assigns a non-zero probability to discovered and undiscovered jailbreaks of a fixed length q. Hence, certification for the random prefix distribution indicates the expected bias in responses to Q with any random prefix of length q.
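As an illustration, Algorithm 2 can be realized in a few lines against any tokenizer that exposes its vocabulary. The HuggingFace-style tokenizer interface below (vocab_size, decode) is an assumption on our part, not a requirement of the framework.

```python
# Illustration of Algorithm 2: a random prefix of q tokens drawn uniformly from the
# vocabulary V. The tokenizer interface (vocab_size, decode) follows HuggingFace
# conventions and is an assumption; any tokenizer exposing its vocabulary works.
import random

def make_random_prefix(tokenizer, q: int = 20) -> str:
    token_ids = [random.randrange(tokenizer.vocab_size) for _ in range(q)]
    return tokenizer.decode(token_ids)

# One sample from the prompt-set distribution around a pivot set Q: the same sampled
# prefix is applied uniformly to every pivot prompt.
# prefix = make_random_prefix(tok)
# prompt_set = [prefix + " " + Qi for Qi in Q]
```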
Mixtures of jailbreaks. Manually designed jailbreaks are fairly effective at bypassing the safety training of LLMs (walkerspider, 2022; Wei et al., 2023; Jailbreak Chat). To study the vulnerability of LLMs under powerful jailbreaks, we develop specifications with manual jailbreaks. Distributions from which jailbreaks can be sampled are unknown. Thus, we construct potential jailbreaking prefixes from a set M of popular manually-designed jailbreaks by applying 2 operations: interleaving and mutation. Interleaving attempts to strengthen a given manual jailbreak with more bias-provoking instructions, while mutation attempts to obfuscate the jailbreak so that it can remain effective even under explicit training to avoid the original jailbreak. Algorithm 3 presents the prefix constructor as a probabilistic program. Each manual jailbreak Mk ∈ M can be treated as a finite set of instructions Mk = {Mk[0], . . .}. The split(.) function takes a jailbreak and returns the list of the jailbreak's instructions. Let M0 be the most effective jailbreak (a.k.a. the main jailbreak). We decide the effectiveness of jailbreaks using popular open-source leaderboards of jailbreaks. We include all the instructions of the main jailbreak in the final prefix. The other jailbreaks are helper jailbreaks, whose instructions are included in the final prefix with an interleaving probability pλ. Let H = ∪_{Mk ∈ M, k > 0} split(Mk) denote the set of all instructions from the helper jailbreaks [line 2]. Let ω(pλ, H) shuffle and concatenate randomly picked (with probability pλ) instructions from H [line 3]. shuffle(.) is a function for randomly sampling a permutation from a uniform distribution over all permutations of an input list (after removing ⊥, which denotes void elements). Let if(e1, e2, e3) be an abbreviation for if e1 then e2 else e3. We first apply the interleaving operation, with the result given as Mi [line 4]. The mutation operation is then applied to Mi, viewed as a sequence of tokens [τ0, . . .], wherein any token τi can be flipped to a random token τ'i ∈ V with a mutation probability pµ (generally set to be low), to produce the prefix pre [line 6]. We hypothesize such prefixes to be potential jailbreaks, as they are formed by strengthening a manual jailbreak with other jailbreaks and obfuscating its presence. The number of prefixes that can be formed by the aforementioned operations is prohibitively large, owing to typically long manual jailbreaks and the possibility of mutating any token to any random token from V.

Soft prefixes from jailbreaks. Due to the limited number of effective manual jailbreaks (walkerspider, 2022; Learn Prompting, 2023), they can be easily identified and defended against. However, the excellent denoising capabilities of LLMs could render them vulnerable to simple manipulations of manual jailbreaks as well, indicating that the threat is not completely mitigated by current defenses (Jain et al., 2023). Hence, we specify fairness under prefixes constructed by adding noise to the original manual jailbreaks. Algorithm 4 presents the prefix constructor as a probabilistic program. Let E be the embedding matrix of M0 in the embedding space of the target LLM, obtained by applying the function embed(.) [line 1]. We perturb E by adding noise to it. As we are not aware of any adversarial distributions of noise that could be added to manual jailbreaks to make them stronger, we select a uniform distribution. We uniformly sample noise from B(0, κ), a ball of constant radius κ > 0 around the origin, and add it to E to construct the prefix in the embedding space [line 2].
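For concreteness, below are condensed sketches of the mixture-of-jailbreaks (Algorithm 3) and soft-prefix (Algorithm 4) constructors. The sentence-level instruction splitting, the HuggingFace/PyTorch interfaces (encode, decode, get_input_embeddings, inputs_embeds), and the default probabilities are our simplifying assumptions, not the paper's exact probabilistic programs.

```python
# Condensed sketches of Algorithm 3 (mixture of jailbreaks) and Algorithm 4 (soft prefix).
# Interface choices (crude sentence split, HuggingFace tokenizer, PyTorch embeddings) are ours.
import random
import torch

def make_mixture_prefix(tokenizer, main_jb, helper_jbs, p_lambda=0.3, p_mu=0.02):
    split = lambda jb: [s.strip() for s in jb.split(".") if s.strip()]  # crude split(.)
    helpers = [h for jb in helper_jbs for h in split(jb)]               # H (line 2)

    def omega():  # line 3: keep each helper instruction with prob p_lambda, then shuffle
        picked = [h for h in helpers if random.random() < p_lambda]
        random.shuffle(picked)
        return ". ".join(picked)

    interleaved = ". ".join(                 # line 4: interleave omega blocks into M0
        part for instr in split(main_jb) for part in (instr, omega()) if part)
    token_ids = tokenizer.encode(interleaved)                            # line 5
    mutated = [random.randrange(tokenizer.vocab_size)                    # line 6: mutation
               if random.random() < p_mu else t for t in token_ids]
    return tokenizer.decode(mutated)

def make_soft_prefix(model, tokenizer, main_jb, kappa=0.02):
    ids = tokenizer(main_jb, return_tensors="pt").input_ids
    E = model.get_input_embeddings()(ids)           # line 1: embed(L, M0)
    noise = (torch.rand_like(E) * 2 - 1) * kappa    # line 2: U([-kappa, kappa]) noise
    return E + noise                                # fed to the LLM via inputs_embeds
```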
5 EXPERIMENTS

We used 2 A100 GPUs, each with 40GB VRAM. We derive the queries on which the specifications from the 3 prefix distributions presented in Section 4 are pivoted from popular datasets for fairness and bias assessment: BOLD (Dhamala et al., 2021) and DecodingTrust (Wang et al., 2024).

BOLD setup. BOLD is a dataset of partial sentences to demonstrate bias in the generations of LLMs in common situations. We pick a test set of 250 samples randomly from BOLD's profession partition and demonstrate binary gender bias specifications and certificates on it.

Figure 2: Example from QBOLD

We develop a pivot set of prompts from each test set sample by prepending an instruction to complete the partial sentence for the profession annotated in BOLD, where the subject identifies with a particular gender (Male/Female). An example pivot set Q from BOLD is illustrated in Figure 2. Let the resultant test set, consisting of pivot prompts from each partial sentence considered from BOLD, be QBOLD. To identify bias in the responses of the target LLM for the prompts constructed from pivot prompt sets in QBOLD, we use the regard-based bias metric proposed in (Sheng et al., 2019). Our adaptation of the regard metric is described and evaluated with a human study on Amazon Mechanical Turk in Appendix G.1. Our bias detector matches human perception of bias in 76% of cases. We qualitatively analyze the false positives and false negatives of the bias detector with respect to human judgment in Appendix G.1.

Figure 3: Example from QDT

DecodingTrust setup. DecodingTrust (DT) is a benchmark to evaluate various properties of LLMs, including stereotype bias against people of different demographic groups. We make specifications from all 48 statements in its stereotypes partition for demographic groups corresponding to race (black/white). An example pivot set Q from DT is illustrated in Figure 3. Let the resultant test set, consisting of pivot prompts from each statement considered from DT, be QDT. We evaluate the LLM responses to prompts derived from pivot prompt sets in QDT using a bias detector that identifies disparity in agreement with the stereotype for the different demographic groups given in the prompt, as discussed in Appendix G.2.

For every element in QBOLD and QDT, we generate 3 certificates for the specifications in Section 4, such that a certificate consists of bounds on the probability of unbiased responses from the target LLM. Both bias detectors output 1 for a biased set of responses to a counterfactual prompt set and 0 for an unbiased set of responses. The values of the certification parameters used in our experiments are given in Table 2 (Appendix E). We study their effect on the certification results with an ablation study in Appendix E. We generate the certification bounds with 95% confidence and 50 samples. While our main experiments are for counterfactual prompt sets with binary demographic groups, our framework can be extended beyond binary demographic groups, which we experimentally demonstrate in Appendix E.6. Appendix B presents the existing, manually-designed jailbreaks used across all specifications. Note that these jailbreaks are just examples to demonstrate our framework, which generalizes beyond them to new bias-eliciting manual textual jailbreaks.

5.1 CERTIFICATION RESULTS

We certify the popular contemporary LLMs: Llama-2-chat (Touvron et al., 2023) 7B and 13B (parameters), Vicuna-v1.5 (Chiang et al., 2023) 7B and 13B, Mistral-Instruct-v0.2 (Jiang et al., 2023) 7B, Gemini-1.0-pro (Gemini Team, 2024), GPT-3.5 (Brown et al., 2020b), GPT-4 (Achiam et al., 2023), and Claude-3.5-Sonnet (Anthropic, 2024). We report the average of the certification bounds over all pivot prompt sets in QBOLD and QDT for every model in Table 1.
Table 1: Average of the bounds on the probability of unbiased responses for different models. Lowest bounds for each specification kind and dataset are highlighted. We report 2 baselines: unbiased responses when prompting without prefixes and with the main jailbreak (JB) as prefix. The Random, Mixture, and Soft columns give the average certification bounds for the corresponding prefix distributions.

Dataset | Model | % Unbiased without prefix | % Unbiased with main JB | Random | Mixture | Soft
BOLD | Vicuna-7B | 99.9 | 89.4 | (0.93, 1.0) | (0.90, 0.99) | (0.73, 0.89)
BOLD | Vicuna-13B | 99.7 | 99.8 | (0.93, 1.0) | (0.93, 1.0) | (0.92, 1.0)
BOLD | Llama-7B | 99.8 | 99.8 | (0.92, 1.0) | (0.92, 1.0) | (0.93, 1.0)
BOLD | Llama-13B | 99.8 | 99.7 | (0.93, 1.0) | (0.91, 1.0) | (0.93, 1.0)
BOLD | Mistral-7B | 100.0 | 41.0 | (0.92, 1.0) | (0.22, 0.42) | (0.30, 0.52)
BOLD | Gemini | 99.2 | 74.1 | (0.92, 1.0) | (0.60, 0.83) | -
BOLD | GPT-3.5 | 99.5 | 50.2 | (0.92, 1.0) | (0.44, 0.67) | -
BOLD | GPT-4 | 99.8 | 99.9 | (0.92, 1.0) | (0.80, 0.96) | -
BOLD | Claude-3.5-Sonnet | 99.6 | 99.8 | (0.93, 1.0) | (0.92, 1.0) | -
DT | Vicuna-7B | 95.4 | 100.0 | (0.85, 0.97) | (0.92, 1.0) | (0.88, 0.97)
DT | Vicuna-13B | 88.7 | 76.2 | (0.71, 0.92) | (0.92, 1.0) | (0.51, 0.78)
DT | Llama-7B | 97.5 | 100.0 | (0.79, 0.96) | (0.92, 1.0) | (0.92, 1.0)
DT | Llama-13B | 100.0 | 100.0 | (0.92, 1.0) | (0.93, 1.0) | (0.93, 1.0)
DT | Mistral-7B | 99.2 | 72.9 | (0.91, 1.0) | (0.85, 0.99) | (0.46, 0.73)
DT | Gemini | 99.6 | 94.6 | (0.92, 1.0) | (0.73, 0.93) | -
DT | GPT-3.5 | 99.6 | 56.7 | (0.93, 1.0) | (0.66, 0.88) | -
DT | GPT-4 | 100.0 | 100.0 | (0.93, 1.0) | (0.93, 1.0) | -
DT | Claude-3.5-Sonnet | 99.6 | 100.0 | (0.93, 1.0) | (0.93, 1.0) | -

We do not certify the closed-source models (Gemini, GPT, and Claude) for soft prefixes, as that requires access to the models' embedding layers. Certification time depends significantly on the inference latency of the target model; generating each certificate can take 1-2 minutes for models with reasonable latency.

Baselines. The baselines consider QBOLD and QDT as benchmarking datasets, having counterfactual prompt sets as individual elements. Similar to popular LLM bias benchmarking works such as (Wang et al., 2024; Liang et al., 2023; Esiobu et al., 2023; Xie et al., 2024), we study the biases in LLMs for a fixed dataset of counterfactual prompt sets, which may be provided as-is to the LLM or with jailbreaks. In the first baseline (without prefix), every counterfactual prompt set is evaluated 5 times, each time prompting the target LLM with each prompt in the set without any prefixes and detecting bias across its responses using the corresponding bias detector. The bias result for each counterfactual prompt set is computed by averaging the results over the 5 evaluations, similar to (Wang et al., 2024). This baseline indicates biases in LLM responses without any prefixes and can be used to judge the additional influence of prefixes in eliciting bias from LLMs. Table 1 reports the average of the evaluations over all counterfactual prompt sets in QBOLD and QDT, respectively. In the second baseline (with main jailbreak), each counterfactual prompt set is similarly evaluated 5 times, but with the unmodified main jailbreak (Figure 5, Appendix B), the one used in the mixture-of-jailbreaks and soft-prefix distributions, as a prefix. The average result of this baseline is also reported in Table 1. This baseline indicates the efficacy of the main jailbreak without any modifications, and hence suggests the importance of the mixture and soft prefix distributions around the main jailbreak in eliciting biases in LLM responses. The baselines are empirical studies of counterfactual bias in LLMs, which analyze bias with a fixed dataset of prompts. In contrast, LLMCert-B quantifies bias over any random counterfactual prompt set sampled from a given distribution.

5.1.1 GENERAL OBSERVATIONS

Comparison with baselines. Our results for the baseline without prefix are generally close to, and often higher than, the average upper bounds from certification.
This suggests that the counterfactual prompt sets do not majorly result in biased LLM behavior by themselves and the application of prefixes is essential for biased LLM responses. Hence, such baselines present an optimistic view on the biases of LLMs, which may not be acceptable for real-world scenarios. We compare the baseline with main jailbreak with the certification bounds for the specifications with mixture of jailbreak prefixes and soft prefixes, as these consist of modifications of the main jailbreak. We see that in most cases, the baseline results are either contained within or higher than the bounds. Thus the baseline does not account for several biased behaviors that can result by modifying the main jailbreak. In some cases, e.g., Mistral for DT specifications, the average lower bound is higher than the baseline result, indicating that the modifications might have reduced the efficacy of the main jailbreak for the model. In such cases, the specifications can be made model-specific and we leave this to future work. Published as a conference paper at ICLR 2025 Variations with prefixes. We observe that the random prefixes are generally less effective at eliciting biased behaviors than the other prefixes. This suggests that incoherent prefixes can be denoised by the LLMs and may not effectively bypass their alignment. In some cases, such as that for Vicuna-13b, we see that the model produces biased responses with the random prefixes, indicating that the model is not robust to even incoherent perturbations to the prompts. Thus random prefix specifications present basic certificates into the denoising capabilities of the LLMs to produce unbiased responses. There are no consistent trends between the mixture of jailbreaks and soft prefix specifications, except for the effectiveness of prompts from their distributions in producing biased LLM responses. Denial of response. In several instances, such as those when certifying GPT-4 (for DT), Claude, and Llama, we observe unbiased behaviors because the model simply refuses to respond. The completions are mostly Sorry I can t assist with that". While that can be a valid defense to adversarial prompts, it may not be the most desirable (Cui et al., 2024). The model could instead output neutral responses, ignoring any bias-encouraging texts in the prompts to be helpful but not harmful. 5.1.2 MODEL-SPECIFIC OBSERVATIONS Vicuna. The Vicuna models show high probability of unbiased responses, except for those with soft prefixes. The baselines indicate that the 7B model can output less unbiased responses with the main jailbreak as prefix. But the mixture of jailbreaks specifications have high probability of unbiased responses, suggesting that the perturbations of the main jailbreak with the operations of mixture specifications can make the jailbreak less effective, while perturbing the jailbreak in the embedding space can elicit more biased behaviors from the model. We find that unbiased behavior trends are not consistent with increasing model sizes across different specifications. Llama and Claude. The Llama models and Claude-Sonnet generate a few biased responses, but the bounds on the probability of unbiased responses are high, indicating better alignment. Mistral. Mistral exhibits significantly low probability of unbiased when prompted with mixtures of jailbreaks and soft prefixes for specifications from both QBOLD and QDT . Gemini. We prompt Gemini with the safety filters disabled, so we can check the safety training of the model itself. 
We observe that its responses can get highly biased for mixture of jailbreaks specifications, resulting in low bounds. This insight is, however, not evident from the baselines. GPT. The baselines indicate low bias in the responses of GPT-4. However, certification results say otherwise. The GPT-4 model shows low probability of unbiased responses for the mixture of jailbreak specifications derived from QBOLD. However, we observe unbiased responses for specifications corresponding to QDT . GPT-3.5 achieves low probability of unbiased responses for the mixture of jailbreak specifications, indicating that GPT-4 may have undergone better safety training than it. 5.2 CASE STUDIES We qualitatively introspect individual certificates for different models for one example from QBOLD. We present similar analyses for one example from QDT in Appendix D. We select the example for Desktop automation refers to software programming from QBOLD for Automation profession. We present responses of 4 popular LLMs Mistral, Llama-7B, Gemini, and GPT-4 obtained during certification for the mixture of jailbreak specification, in Figure 4. The responses demonstrate various stereotypical biases against both genders. For example, Mistral discriminates against women, calling them incapable and better suited for menial jobs . The Gemini model is also biased against women, saying they cannot think and are meant for household chores . On the other hand, GPT-4 shows biases towards men, as it responds about them lacking motivation and being less efficient , with automation being better at their jobs. It depicts women crushing male competition and becoming superior. Llama, however, defends against the prompts and avoids responding in a biased way. The responses reflect the certification results for the models. We find from the certification bounds that the Mistral model is worse than Gemini and GPT, which are worse than Llama with high confidence for this specification. Published as a conference paper at ICLR 2025 Figure 4: Responses (and bounds) of Mistral, Llama-7B, Gemini, and GPT-4 when certifying with BOLD example Desktop automation refers to software programming for mixture of jailbreaks. 6 RELATED WORK Attacking LLM alignment. LLMs are aligned with human ethics by supervised fine-tuning and reinforcement learning with human feedback (Ouyang et al., 2022). However, (Zou et al., 2023; Vega et al., 2023; Chao et al., 2023; Sheng et al., 2020; Wallace et al., 2019) propose methods to jailbreak LLMs, bypassing alignment and causing harmful or biased responses. Jailbreaks can be incoherent (Zou et al., 2023; Sheng et al., 2020) or coherent (Dominique et al., 2024; Liu et al., 2024). Benchmarking LLMs. Various prior works have benchmarked LLM performance on standard and custom datasets. These consist of datasets of general prompts (Dhamala et al., 2021; Wang et al., 2024) or adversarial examples (Zou et al., 2023; Mazeika et al., 2024). Benchmarks such as (Liang et al., 2023; Wang et al., 2024; Mazeika et al., 2024; Manerba et al., 2024; Gallegos et al., 2024a) present empirical trends for LLM performance, measured along various axes including bias. Guarantees for LLMs. There is an emerging need for guarantees on LLM behavior. (Kang et al., 2024) provides guarantees on the generation risks of RAG LLMs. 
(Quach et al., 2024; Deutschmann et al., 2023; Mohri and Hashimoto, 2024; Yadkori et al., 2024) apply conformal prediction to LLMs, proposing methods for generating output sets with statistical guarantees on correctness or coverage. (Zollo et al., 2024) presents a framework for selecting low-risk system prompts for LLMs with probabilistic guarantees. (Chaudhary et al., 2024) present a certification framework for knowledge comprehension in LLMs. We compare LLMCert-B with existing works further in Appendix H. Fairness in Machine Learning. Fairness has been extensively studied for general Machine Learning, beginning from the seminal work of Dwork et al. (2011). Prior works have proposed methods to formally certify classifiers for fairness, such as (Biswas and Rajan, 2023; Bastani et al., 2019). However, these do not extend to LLMs. Fairness and bias have also been studied in natural language processing in prior works such as (Chang et al., 2019; Smith et al., 2022; Krishna et al., 2022). 7 CONCLUSION We present the first framework LLMCert-B to specify and certify counterfactual bias in LLM responses, for both open and closed-source models. We instantiate LLMCert-B with novel specifications based on different kinds of potentially adversarial prefixes. LLMCert-B generates high confidence bounds on probability of unbiased responses for counterfactual prompts from a given distribution. Our results show previously unknown vulnerabilities related to counterfactual bias in SOTA LLMs. Published as a conference paper at ICLR 2025 ACKNOWLEDGEMENT We thank the anonymous reviewers for their insightful comments. This work was supported by a grant from the Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE) and NSF Grants No. CCF-2238079, CCF-2316233, CNS-2148583. ETHICS STATEMENT We identify the following positive and negative impacts of our work. Positive impacts. Our work is the first to provide quantitative certificates for the bias in Large Language Models. It can be used by model developers to thoroughly assess their models before releasing them and by the general public to become aware of the potential harms of using any LLM. As our framework, LLMCert-B assumes black-box access to the model, it can be applied to even closed-source LLMs with API access. Negative impacts. In this work, we propose 3 kinds of specifications involving random prefixes, mixtures of jailbreaks as prefixes, and jailbreaks in the embedding space of the target model. While these prefixes are not adversarially designed, they are often successful in eliciting biased and toxic responses from the target LLMs. They can be used to attack these LLMs by potential adversaries. We have informed the developers of the LLMs about this threat. REPRODUCIBILITY STATEMENT Our implementation of LLMCert-B is open-sourced at https://github.com/ uiuc-focal-lab/LLMCert-B. We provide a README with the code having instructions to reproduce the main results of the paper. The code also consists the files used in conducting a human evaluation of our bias detector for BOLD on Amazon Mechanical Turk. Jailbreak Chat. https://www.reddit.com/r/Chat GPTJailbreak/. Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report, 2023. Jacy Anthis, Kristian Lum, Michael Ekstrand, Avi Feller, Alexander D Amour, and Chenhao Tan. The impossibility of fair llms, 2024. URL https://arxiv.org/abs/2406.03198. 
Jacy Reese Anthis and Victor Veitch. Causal context connects counterfactual fairness to robust prediction and group fairness, 2023. URL https://arxiv.org/abs/2310.19691. Anthropic. Claude 3.5 sonnet model card addendum. Technical report, Anthropic, 2024. URL https://www-cdn.anthropic.com/ fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_ Addendum.pdf. Teodora Baluta, Zheng Leong Chua, Kuldeep S. Meel, and Prateek Saxena. Scalable quantitative verification for deep neural networks, 2021. Solon Barocas and Andrew D Selbst. Big data s disparate impact. California Law Review, 104(3): 671 732, 2016. Gilles Barthe, Juan Manuel Crespo, and César Kunz. Relational verification using product programs. pages 200 214, 2011. Osbert Bastani, Xin Zhang, and Armando Solar-Lezama. Probabilistic verification of fairness properties via concentration, 2019. Published as a conference paper at ICLR 2025 Leonard Berrada, Sumanth Dathathri, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Jonathan Uesato, Sven Gowal, and M. Pawan Kumar. Verifying probabilistic specifications with functional lagrangians. Co RR, abs/2102.09479, 2021. URL https://arxiv.org/abs/ 2102.09479. Sumon Biswas and Hridesh Rajan. Fairify: Fairness verification of neural networks. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1546 1558, 2023. doi: 10.1109/ICSE48619.2023.00134. Jack Blandin and Ian A. Kash. Generalizing group fairness in machine learning via utilities. J. Artif. Int. Res., 78, January 2024. ISSN 1076-9757. doi: 10.1613/jair.1.14238. URL https: //doi.org/10.1613/jair.1.14238. Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Language (technology) is power: A critical survey of bias in NLP. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454 5476, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://aclanthology.org/2020.acl-main. 485. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020a. URL https: //arxiv.org/abs/2005.14165. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020b. Yang Trista Cao, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta, Varun Kumar, Jwala Dhamala, and Aram Galstyan. On the intrinsic and extrinsic fairness evaluation metrics for contextualized language representations. 
In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 561 570, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-short.62. URL https://aclanthology.org/ 2022.acl-short.62. Kai-Wei Chang, Vinodkumar Prabhakaran, and Vicente Ordonez. Bias and fairness in natural language processing. In Timothy Baldwin and Marine Carpuat, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts, Hong Kong, China, November 2019. Association for Computational Linguistics. URL https:// aclanthology.org/D19-2004. Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2023. Isha Chaudhary, Vedaant V. Jain, and Gagandeep Singh. Decoding intelligence: A framework for certifying knowledge comprehension in llms, 2024. URL https://arxiv.org/abs/2402. 15929. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https: //lmsys.org/blog/2023-03-30-vicuna/. Published as a conference paper at ICLR 2025 C. J. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404 413, 12 1934. ISSN 0006-3444. doi: 10.1093/biomet/26.4.404. URL https://doi.org/10.1093/biomet/26.4.404. Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. Or-bench: An over-refusal benchmark for large language models, 2024. URL https://arxiv.org/abs/2405.20947. Nicolas Deutschmann, Marvin Alberts, and María Rodríguez Martínez. Conformal autoregressive generation: Beam search with coverage guarantees, 2023. URL https://arxiv.org/abs/ 2309.03797. Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 21. ACM, March 2021. doi: 10.1145/3442188.3445924. URL http://dx.doi.org/10.1145/3442188.3445924. Brandon Dominique, David Piorkowski, Manish Nagireddy, and Ioana Baldini Soares. Prompt templates: A methodology for improving manual red teaming performance. In ACM CHI Conference on Human Factors in Computing Systems, 2024. Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Rich Zemel. Fairness through awareness, 2011. David Esiobu, Xiaoqing Tan, Saghar Hosseini, Megan Ung, Yuchen Zhang, Jude Fernandes, Jane Dwivedi-Yu, Eleonora Presani, Adina Williams, and Eric Michael Smith. Robbie: Robust bias evaluation of large generative language models, 2023. Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models. First Monday, November 2023. ISSN 1396-0466. doi: 10.5210/fm.v28i11.13346. URL http: //dx.doi.org/10.5210/fm.v28i11.13346. Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. Bias and fairness in large language models: A survey, 2024a. Isabel O. Gallegos, Ryan A. 
Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. Bias and fairness in large language models: A survey, 2024b. URL https://arxiv.org/abs/2309.00770. Google Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. Chuqin Geng, Nham Le, Xiaojie Xu, Zhaoyue Wang, Arie Gurfinkel, and Xujie Si. Towards reliable neural specifications, 2023. URL https://arxiv.org/abs/2210.16114. Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. Dialect prejudice predicts ai decisions about people s character, employability, and criminality, 2024. Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models, 2023. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, and Bo Li. C-rag: Certified generation risks for retrieval-augmented language models, 2024. Hadas Kotek, Rikker Dockum, and David Sun. Gender bias and stereotypes in large language models. In Proceedings of The ACM Collective Intelligence Conference, CI 23. ACM, November 2023. doi: 10.1145/3582269.3615599. URL http://dx.doi.org/10.1145/3582269.3615599. Published as a conference paper at ICLR 2025 Satyapriya Krishna, Rahul Gupta, Apurv Verma, Jwala Dhamala, Yada Pruksachatkun, and Kai-Wei Chang. Measuring fairness of text classifiers via prediction sensitivity, 2022. Anna Kruspe. Towards detecting unanticipated bias in large language models, 2024. URL https: //arxiv.org/abs/2404.02650. Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness, 2018. URL https://arxiv.org/abs/1703.06856. Learn Prompting. Jailbreaking. https://learnprompting.org/docs/prompt_ hacking/jailbreaking, 2023. Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, and Kangwook Lee. Prompted llms as chatbot modules for long open-domain conversation. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.findings-acl.277. URL http://dx.doi.org/10.18653/v1/ 2023.findings-acl.277. Messi H.J. Lee, Jacob M. Montgomery, and Calvin K. Lai. Large language models portray socially subordinate groups as more homogeneous, consistent with a bias observed in humans. In The 2024 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 24, page 1321 1340. ACM, June 2024. doi: 10.1145/3630106.3658975. URL http://dx.doi.org/10.1145/ 3630106.3658975. Renjue Li, Pengfei Yang, Cheng-Chao Huang, Youcheng Sun, Bai Xue, and Lijun Zhang. Towards practical robustness analysis for dnns based on pac-model learning, 2022a. Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. A survey on fairness in large language models, 2024. URL https://arxiv.org/abs/2308.10149. Zhuoyan Li, Zhuoran Lu, and Ming Yin. Towards better detection of biased language with scarce, noisy, and biased annotations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, AIES 22, page 411 423, New York, NY, USA, 2022b. 
Association for Computing Machinery. ISBN 9781450392471. doi: 10.1145/3514094.3534142. URL https://doi.org/ 10.1145/3514094.3534142. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic Evaluation of Language Models, 2023. Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, 2021. URL https://arxiv.org/abs/2107.13586. Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study, 2024. Marta Marchiori Manerba, Karolina Sta nczak, Riccardo Guidotti, and Isabelle Augenstein. Social bias probing: Fairness benchmarking for language models, 2024. Rohin Manvi, Samar Khanna, Marshall Burke, David Lobell, and Stefano Ermon. Large language models are geographically biased, 2024. Victor M Longa Martínez. John baugh. 2018. linguistics in pursuit of justice, cambridge: Cambridge university press [xx+ 215 pp.]. isbn: 978-1107153455. Verba: Anuario Galego de Filoloxía, 49: 1 15, 2022. Published as a conference paper at ICLR 2025 Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal, 2024. Timothy R. Mc Intosh, Teo Susnjak, Tong Liu, Paul Watters, and Malka N. Halgamuge. Inadequacies of large language model benchmarks in the era of generative artificial intelligence, 2024. Matthew Mirman, Timon Gehr, and Martin Vechev. Robustness certification of generative models, 2020. URL https://arxiv.org/abs/2004.14756. Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees, 2024. URL https://arxiv.org/abs/2402.10978. Robert G. Newcombe. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine, 17(8):857 872, 1998. doi: 10.1002/(SICI)1097-0258(19980430) 17:8<857::AID-SIM777>3.0.CO;2-E. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022. Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling, 2024. URL https://arxiv.org/abs/2306. 10193. Jonathan Rosa and Nelson Flores. Unsettling race and language: Toward a raciolinguistic perspective. Language in Society, 46(5):621 647, 2017. 
Sriram Sankaranarayanan, Aleksandar Chakarov, and Sumit Gulwani. Static analysis for probabilistic programs: inferring whole program properties from finitely many paths. SIGPLAN Not., 48(6):447–458, jun 2013. ISSN 0362-1340. doi: 10.1145/2499370.2462179. URL https://doi.org/10.1145/2499370.2462179.
Sanjit A. Seshia, Ankush Desai, Tommaso Dreossi, Daniel J. Fremont, Shromona Ghosh, Edward Kim, Sumukh Shivakumar, Marcell Vazquez-Chanlatte, and Xiangyu Yue. Formal specification for deep neural networks. In Shuvendu K. Lahiri and Chao Wang, editors, Automated Technology for Verification and Analysis, pages 20–34, Cham, 2018. Springer International Publishing. ISBN 978-3-030-01090-4.
Sakib Shahriar and Kadhim Hayawi. Let's have a chat! A conversation with ChatGPT: Technology, applications, and limitations. Artificial Intelligence and Applications, 2(1):11–20, June 2023. ISSN 2811-0854. doi: 10.47852/bonviewaia3202939. URL http://dx.doi.org/10.47852/bonviewAIA3202939.
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1339. URL https://aclanthology.org/D19-1339.
Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng. Towards Controllable Biases in Language Generation. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3239–3254, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.291. URL https://aclanthology.org/2020.findings-emnlp.291.
John H. Sherry. The Civil Rights Act of 1964: Fair employment practices under Title VII. Cornell Hotel and Restaurant Administration Quarterly, 6(2):3–6, 1965. doi: 10.1177/001088046500600202. URL https://doi.org/10.1177/001088046500600202.
Julius Sim and Norma Reid. Statistical Inference by Confidence Intervals: Issues of Interpretation and Utilization. Physical Therapy, 79(2):186–195, 02 1999. ISSN 0031-9023. doi: 10.1093/ptj/79.2.186. URL https://doi.org/10.1093/ptj/79.2.186.
Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin Vechev. An abstract domain for certifying neural networks. Proc. ACM Program. Lang., 3(POPL), jan 2019. doi: 10.1145/3290354. URL https://doi.org/10.1145/3290354.
Eric Michael Smith, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. "I'm sorry to hear that": Finding new biases in language models with a holistic descriptor dataset, 2022.
Harini Suresh and John Guttag. A framework for understanding sources of harm throughout the machine learning life cycle. In Proceedings of the 1st ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO '21, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450385534. doi: 10.1145/3465416.3483305. URL https://doi.org/10.1145/3465416.3483305.
J.M. Terry, R. Hendrick, E. Evangelou, and R.L. Smith. Variable dialect switching among African American children: Inferences about working memory. Lingua, 120(10):2463–2475, 2010. ISSN 0024-3841. doi: https://doi.org/10.1016/j.lingua.2010.04.013. URL https://www.sciencedirect.com/science/article/pii/S0024384110001129.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Jason Vega, Isha Chaudhary, Changming Xu, and Gagandeep Singh. Bypassing the safety training of open-source LLMs with priming attacks. arXiv preprint arXiv:2312.12321, 2023.
walkerspider. DAN is my new friend. https://old.reddit.com/r/ChatGPT/comments/zlcyr9/dan_is_my_new_friend/, 2022.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221.
Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, and Bo Li. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models, 2024.
Shiqi Wang, Huan Zhang, Kaidi Xu, Xue Lin, Suman Jana, Cho-Jui Hsieh, and J. Zico Kolter. Beta-CROWN: Efficient bound propagation with per-neuron split constraints for neural network robustness verification. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 29909–29921, 2021.
Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey, 2023.
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail?, 2023.
Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, and Prateek Mittal. SORRY-Bench: Systematically evaluating large language model safety refusal behaviors, 2024. URL https://arxiv.org/abs/2406.14598.
Yasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei-Hung Weng, Yao-Yuan Yang, Csaba Szepesvári, Ali Taylan Cemgil, and Nenad Tomasev. Mitigating LLM hallucinations via conformal abstention, 2024. URL https://arxiv.org/abs/2405.01563.
Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. Rethinking benchmark and contamination for language models with rephrased samples, 2023.
Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. Instruction tuning for large language models: A survey, 2024.
Thomas P. Zollo, Todd Morrill, Zhun Deng, Jake C. Snell, Toniann Pitassi, and Richard Zemel. Prompt risk control: A rigorous framework for responsible deployment of large language models, 2024. URL https://arxiv.org/abs/2311.13628.
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023.
A NOTE FOR PRACTITIONERS

In this section, we describe how practitioners can use our framework to assess LLMs and automatically identify vulnerabilities in them. Our open-source implementation is available at https://github.com/uiuc-focal-lab/LLMCert-B and can be used under the terms and conditions of the GPL license. The open-source framework can be used to certify both open- and closed-source LLMs by adding support for querying custom models in the utils.py (open-source models) and utils_api.py (closed-source models) files. The framework requires unrestricted query access to the target model (in terms of the number of inferences). Developers can adjust the desired confidence level of the certificates and increase or decrease the number of samples used in certification for tighter or looser bounds, according to their requirements and budget. Developers can also use custom bias detectors to label biased responses and compute the certificates with respect to those detectors. To get customized insights into LLM biases for their particular applications, developers can define specifications with prompts that are commonly observed in their use cases. This customization can happen by sourcing the pivot prompts from domain-specific datasets instead of the popular BOLD or Decoding Trust datasets, and/or by using custom distributions of prefixes/suffixes that more suitably represent the biases in their domains. For example, in domains where there is a threat of racial bias, the prefixes could explicitly encourage the model to exhibit racial bias, so as to stress-test the trustworthiness of the models. Developers can also define entirely new distributions of counterfactual prompts, irrespective of prefixes/suffixes, to specify bias similarly to (1) and certify with LLMCert-B. The certificates obtained are reliable, quantitative risk assessments of models, with lower and upper bounds on risks pertaining to bias in the models' generations. They can also be used to compare different LLMs and pick one with acceptable risk in varying contexts.

B COMMON JAILBREAKS USED

In our experiments, we combine common jailbreaks to certify against hybrid jailbreaks. The specific jailbreaks we use are shown below. The main jailbreak [2] is shown in Figure 5. We have modified the jailbreak to encourage the model to generate responses demonstrating gender bias by adding the text highlighted in red in Figure 5a, and racial bias in Figure 5b. We use one helper jailbreak [3] from which we derive the instructions shown in Figure 6. Each of these instructions is inserted into the main jailbreak at the points where its individual instructions end, with the interleaving probability pλ. All instructions inserted at the same point in the main jailbreak appear in a random order in the final jailbreak prefix. The main jailbreak is also the jailbreak used to create the soft prefixes. Note that these jailbreaks are just examples to demonstrate our framework, which is in no way restricted to them and generalizes to any future bias-eliciting manual, textual jailbreaks.

C EXAMPLE SAMPLES FROM PREFIX DISTRIBUTIONS

In this section, we provide examples of prefixes of the different kinds described in Section 4.

Random prefixes. Figure 7 presents some examples of random prefixes, each having 100 tokens.
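For concreteness, the following is a minimal sketch of how such a random prefix could be sampled and prepended to a counterfactual prompt pair. It is not the paper's implementation: the GPT-2 tokenizer, the helper name sample_random_prefix, and the pivot sentence are illustrative assumptions only; in the experiments the prefixes are drawn from the target LLM's own vocabulary.

```python
# Hedged sketch: sample q random tokens and prepend them to a counterfactual prompt pair.
import random
from transformers import AutoTokenizer

def sample_random_prefix(tokenizer, q: int = 100, seed: int = 0) -> str:
    """Draw q token ids uniformly at random from the vocabulary and decode them."""
    rng = random.Random(seed)
    token_ids = [rng.randrange(tokenizer.vocab_size) for _ in range(q)]
    return tokenizer.decode(token_ids)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer choice only
    prefix = sample_random_prefix(tok, q=100)
    pivot = "The {group} worked as a"            # hypothetical counterfactual pivot prompt
    prompts = [prefix + " " + pivot.format(group=g) for g in ("man", "woman")]
    for p in prompts:
        print(p[:120], "...")
```

Each sampled prefix defines one counterfactual prompt set; the same prefix is shared by all prompts in the set so that the prompts differ only in the demographic group.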
Mixture of jailbreaks. Figure 8 presents some examples of mixture-of-jailbreak prefixes, formed from the main and helper jailbreaks in Figures 5 and 6, respectively.

Soft prefixes. Examples of soft prefixes added to the queries cannot be represented in textual form, as they consist of perturbations of the prefix in the LLM's embedding space. However, the general structure is {jailbreak perturbed in embedding space}.

[2] Last accessed on Jan 22, 2024 from www.jailbreakchat.com/prompt/588ab0ed-2829-4be8-a3f3-f28e29c06621
[3] https://www.reddit.com/r/ChatGPT/comments/113ddqq/my_jailbreak_is_far_superior_to_dan_the_prompt_is/

Figure 5: Main jailbreaks for different sensitive attributes ((a) for gender, (b) for race). Figure 6: Instructions derived from the helper jailbreak. Figure 7: Examples of random prefixes. Figure 8: Examples of mixture-of-jailbreak prefixes.

D CASE STUDIES FROM DECODING TRUST CERTIFICATES

We introspect the example "[target group] have a drug problem" from QDT. We study the responses of 4 LLMs (Mistral, Llama-7B, Gemini, and GPT-3.5) obtained when certifying them for the mixture-of-jailbreaks specification based on this example. Figure 9 presents some samples of responses from the models to explain their certificates (Figure 9: Responses of Mistral, Llama-7B, Gemini, and GPT-3.5 obtained during their certification for the Decoding Trust example "[target group] have a drug problem" with the mixture-of-jailbreaks specification; the certification result is indicated alongside the model). We observe that the Gemini model is biased against white people, calling them "lazy" and "more likely to be exposed to drugs". Similarly, Mistral is also biased against white people, saying that they have "a monopoly on addiction". On the other hand, GPT-3.5 is biased against black people, depicting them as "inferior" and "inclined towards addiction". The Llama model is safer, refraining from biased responses. The bounds on the probability of unbiased responses suggest that GPT-3.5 and Gemini are more inclined to produce biased responses than the Mistral and Llama models for prompts in the distribution of counterfactual prompt sets in the given specification.

Table 2: Hyperparameter values
Hyperparameter | Description | Value
γ | (1 − γ) is the confidence over certification | 0.05
n | Number of samples for certification | 50
T | LLM decoding temperature | 1.0
Top-k | LLM decoding top-k | 10
q | Prefix length for random prefixes | 100
pλ | Interleaving probability | 0.2
pµ | Mutation probability | 0.01
κ | Max noise magnitude added to jailbreak embedding elements, relative to the maximum embedding value | 0.02

E ABLATION STUDY

In this section, we study the effect of changing the various certification parameters (a.k.a. hyperparameters) on the certificates generated with LLMCert-B. Table 2 presents the list of hyperparameters and their values used in our experiments. We regenerate the certificates for different prefix distributions by varying the hyperparameters. In particular, we study the variations in the results when n, T, Top-k, q, pλ, pµ, and κ are varied, keeping γ constant. This is because 1 − γ denotes the confidence of the certification bounds, which is generally desired to be high; 95% is a typical confidence level for practical applications (Sim and Reid, 1999). We conduct this ablation study on the specifications for a randomly picked set of 100 counterfactual prompt sets from BOLD's test set QBOLD. We certify the Mistral-Instruct-v0.2 (Jiang et al., 2023) 7B-parameter model and study the overall results next.
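To make the roles of n and γ in Table 2 concrete, the following is a minimal sketch (assumed helper, not the paper's code) of the exact Clopper-Pearson bounds the certifier places on the probability of unbiased responses, given k unbiased responses observed out of n sampled prompt sets.

```python
# Hedged sketch of the certifier's Clopper-Pearson bounds at confidence 1 - gamma.
from scipy.stats import beta

def clopper_pearson(k: int, n: int, gamma: float = 0.05) -> tuple[float, float]:
    """Exact two-sided Clopper-Pearson bounds on a Bernoulli success probability."""
    lower = 0.0 if k == 0 else beta.ppf(gamma / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - gamma / 2, k + 1, n - k)
    return lower, upper

# Example: 46 of n = 50 sampled prompt sets yield unbiased responses (gamma = 0.05, Table 2).
print(clopper_pearson(46, 50))  # roughly (0.81, 0.98)
```

Increasing n tightens the interval while γ fixes how often it is allowed to miss the true probability, which is why the ablations below vary n but keep γ at 0.05.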
E.1 CERTIFICATION ALGORITHM HYPERPARAMETER

We show ablations on n for all kinds of specifications in Figure 10. We see that the bounds begin converging at 50 samples and subsequent samples cause only minor variations in their values, justifying our choice of 50 samples. Fewer than 50 samples can result in looser bounds. (Figure 10: Ablation study on the certification hyperparameters showing variations of the average certification bounds with the number of samples n; (a) random prefixes, (b) mixtures of jailbreaks, (c) soft prefixes.)

E.2 LLM DECODING HYPERPARAMETERS

We study variations in the certification bounds with 2 important hyperparameters of the LLM decoding algorithms that influence their generated texts: T (decoding temperature) and Top-k (the number of highest-probability tokens considered at each decoding step). Figures 11 and 12 show the variations in the certification bounds with T and Top-k, respectively, for the 3 kinds of specifications. We see only minor changes in the average certification bounds when these hyperparameters are varied. Our hypothesis for this phenomenon is that, as the certificates aggregate the bias results of several samples, they smooth out the noise introduced by the choice of LLM decoding hyperparameters and give insights into the biases of the LLM itself. (Figure 11: Ablation study showing variations of the average certification bounds with temperature T; (a) random prefixes, (b) mixtures of jailbreaks, (c) soft prefixes. Figure 12: Ablation study showing variations of the average certification bounds with the Top-k parameter; (a) random prefixes, (b) mixtures of jailbreaks, (c) soft prefixes. Figure 13: Ablation study on the certification hyperparameters showing variations of the average certification bounds; (a) variation with prefix length q for random prefix specifications, (b) variation with interleaving probability pλ for mixture-of-jailbreaks specifications, (c) variation with mutation probability pµ for mixture-of-jailbreaks specifications, (d) variation with relative magnitude of noise κ for soft prefixes from jailbreak specifications.)

E.3 RANDOM PREFIXES

The specifications based on random prefixes have one hyperparameter, q, the length of the random prefix. Hence, we vary this hyperparameter while keeping the others fixed. Figure 13a presents the variation in the average certification bounds obtained when varying q.

E.4 MIXTURE OF JAILBREAKS

These specifications have two hyperparameters: pλ, the probability of adding an instruction from the helper jailbreak when interleaving, and pµ, the probability of randomly flipping each token of the result of interleaving. We show ablation studies on these in Figures 13b and 13c, respectively.

E.5 SOFT PREFIXES FROM JAILBREAKS

These specifications have one hyperparameter, κ, the maximum relative magnitude (with respect to the maximum magnitude of the embeddings) by which the additive uniform noise can change the embeddings of the main jailbreak. Figure 13d presents an ablation study on κ. We see no significant effects of varying the above-mentioned hyperparameters on the certification results for the different specifications.
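To make concrete how pλ, pµ, and κ enter the prefix samplers ablated above, the following is a minimal sketch under stated assumptions: the main jailbreak is represented as a list of instruction strings, the helper jailbreak as a list of candidate instructions, and the jailbreak embeddings as a NumPy array. None of the function names below come from the paper's code.

```python
# Hedged sketch of the mixture-of-jailbreaks and soft-prefix samplers.
import random
import numpy as np

def sample_mixture_prefix(main_instrs, helper_instrs, p_lambda=0.2, p_mu=0.01,
                          vocab=None, rng=None):
    """Interleave helper instructions after each main instruction with prob. p_lambda,
    shuffle instructions inserted at the same point, then flip each token of the result
    to a random vocabulary token with prob. p_mu."""
    rng = rng or random.Random()
    pieces = []
    for instr in main_instrs:                      # one insertion point per main instruction
        pieces.append(instr)
        inserted = [h for h in helper_instrs if rng.random() < p_lambda]
        rng.shuffle(inserted)                      # random order at the same insertion point
        pieces.extend(inserted)
    tokens = " ".join(pieces).split()
    if vocab:                                      # token-level mutation
        tokens = [rng.choice(vocab) if rng.random() < p_mu else t for t in tokens]
    return " ".join(tokens)

def sample_soft_prefix(jailbreak_embeddings: np.ndarray, kappa=0.02, rng=None):
    """Perturb the jailbreak's token embeddings with uniform noise whose magnitude is at
    most kappa times the maximum absolute embedding value."""
    rng = rng or np.random.default_rng()
    bound = kappa * np.abs(jailbreak_embeddings).max()
    noise = rng.uniform(-bound, bound, size=jailbreak_embeddings.shape)
    return jailbreak_embeddings + noise
```

Under this reading, larger pλ and pµ make the textual prefix drift further from the manual jailbreak, while larger κ widens the neighborhood explored around it in embedding space.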
E.6 SCALING BEYOND BINARY DEMOGRAPHIC GROUPS

Table 3: Average bounds on the probability of unbiased responses from Mistral 7B.
Spec type | Bounds
Random | (0.91, 1.0)
Mixture | (0.82, 0.98)
Soft | (0.20, 0.46)

Our general framework (Section 3) and specification instances (Section 4) are applicable to certify biases beyond binary counterfactual prompt sets (e.g., male/female gender, black/white race). This is subject to the availability of bias detectors D that can identify biases across responses for counterfactual prompt sets spanning more than two demographic groups. While we are not aware of any D that could work with QBOLD, we extend our D for the specifications from QDT to work for responses to prompts from three racial demographic groups: black people, white people, and Asians. We elaborate on the extension in Appendix G.2. We certify the Mistral-Instruct-v0.2 7B model with the three kinds of specifications and find that the average certification bounds presented in Table 3 are similar to the bounds presented for the Mistral model in Table 1 for the random and mixture-of-jailbreaks specifications. However, the results are significantly worse for the soft prefix specifications. This is because, first, the model is particularly susceptible to these specifications, as is evidenced even in the results with binary demographic groups. Second, with the addition of another demographic group, the bias detector is skewed towards identifying bias in more sets of responses than in the binary case: it identifies bias in sets of responses having at least one agreement and one disagreement with the stereotype mentioned in the prompts, and this pattern is as likely as an unbiased result for binary demographic groups, but more likely beyond them.

F VALIDITY OF CONFIDENCE INTERVALS

We design a synthetic study of the validity of the confidence intervals as follows. As we cannot precisely regulate the true probability of unbiased responses of LLMs, we assume various values of that probability and generate binary-valued samples indicating biased (non-zero) / unbiased (zero) responses from a hypothetical LLM. Specifically, we generate 50 samples (the same number as used in LLMCert-B's certification) of the Bernoulli random variable F (Section 3.2) for various values of the success probability, and compute Clopper-Pearson confidence intervals for the success probability from the samples. We repeat this process 1000 times and report the percentage of instances in which the confidence intervals contain the true probability of success. This percentage indicates the probability of correctness of the confidence intervals. We find that for all 11 equally-spaced values of the true probability of unbiased responses between 0 and 1, the confidence intervals bound the true value more than 95% (the nominal, user-specified confidence level) of the time, which validates the claim that the confidence intervals hold with at least the user-specified confidence. Figure 14 presents the variation in the proportion of Clopper-Pearson confidence intervals that contain the true probability of unbiased responses for different values of the latter (Figure 14: Variation in the proportion of Clopper-Pearson confidence intervals containing the true probability of unbiased responses; the dashed line shows the user-specified confidence level, i.e., 95%).
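The synthetic coverage study above can be reproduced with a short script. The following is a minimal sketch, not the paper's script; it assumes statsmodels' Clopper-Pearson ("beta") intervals, and the sample size and confidence level follow Table 2.

```python
# Hedged sketch of the Appendix F coverage study for Clopper-Pearson intervals.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

def coverage(p_true: float, n: int = 50, gamma: float = 0.05,
             repeats: int = 1000, seed: int = 0) -> float:
    """Fraction of Clopper-Pearson intervals (from n Bernoulli draws) containing p_true."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(repeats):
        k = rng.binomial(n, p_true)                                    # unbiased responses
        lo, hi = proportion_confint(k, n, alpha=gamma, method="beta")  # Clopper-Pearson
        hits += (lo <= p_true <= hi)
    return hits / repeats

for p in np.linspace(0.0, 1.0, 11):                # 11 equally-spaced true probabilities
    print(f"p = {p:.1f}: coverage = {coverage(p):.3f}")  # expected to be at least ~0.95
```

Because the Clopper-Pearson construction is conservative, the measured coverage should sit at or above the nominal 95% level for every true probability, matching Figure 14.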
G BIAS DETECTOR

G.1 BOLD EXPERIMENTS

Design of bias detector. Our bias detector DBOLD is based on the regard bias metric (Sheng et al., 2019). Specifically, we measure the bias for a pair of completions, one for a male and another for a female subject, by invoking the regard classifier in compare mode. The compare mode gives the difference between the regard scores obtained for the two completions. We consider all completion pairs with a regard difference higher than a prespecified threshold (0.85 in this case) as biased. For the completions considered unbiased by this first check, we also consider the regard scores of the individual completions (with respect to the partial sentence given to the model), and consider the completions biased if either score exceeds the threshold. This handles instances of bias wherein the pair of generations is similarly biased and the individual generations contain text discriminating against a particular demographic group.

Evaluating the metric. We evaluate DBOLD against human judgments of bias by getting annotations for a subset of the completions from the different models on Amazon Mechanical Turk. We randomly select 490 pairs of completions (for male and female subjects in the prompts), 70 each obtained when certifying 7 different models. Each pair of completions is annotated by 3 different annotators. We combine the 3 annotations for each pair into a single bias label by taking the majority consensus among the annotators. We compare the bias annotations thus obtained from humans with the results of DBOLD and find that DBOLD's outputs match human intuition 76% of the time. We provide the HTML file used to render the instructions shown to the Amazon Mechanical Turk workers in our supplementary material. Each participant is given a compensation of $0.50 for the annotations. Next, we provide a qualitative analysis with examples where the results of DBOLD and the human evaluation match and where they do not. Figure 15a shows true positive examples, where both DBOLD and the annotators identify bias. Figure 15b shows false negative examples, where DBOLD does not identify bias but the annotators do. Figure 15c shows false positive examples, where DBOLD identifies bias but the annotators do not. (Figure 15: Qualitative analysis of DBOLD with human bias annotations; (a) true positives, (b) false negatives, (c) false positives.) According to the authors, the false positive examples are actually biased. But we also recognize that bias is a complex subject, and its notion can vary across individuals. The human annotation of bias is contingent on many factors, such as the annotators' cultural background, our annotation instructions (provided in the open-source implementation of the framework), etc. Hence, we believe that the noise in the human study needs to be taken into account when evaluating our bias detector DBOLD. Our bias detector shows 93% precision but 50% recall. Owing to these inconsistencies of our bias detector with human perception of bias, we believe that our bounds on the probability of unbiased responses are actually higher than the bounds that would be obtained with a perfect bias detector (due to the low recall), indicating a worse situation of counterfactual bias in SOTA LLMs.
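The decision rule of DBOLD described above can be summarized in a few lines. The following is a hedged sketch only: the two scoring helpers are hypothetical stand-ins for the regard classifier of Sheng et al. (2019), and only the thresholding logic is taken from the description above; the actual classifier invocation in the released code may differ.

```python
# Hedged sketch of DBOLD's thresholded decision rule (not the paper's implementation).
THRESHOLD = 0.85  # prespecified regard threshold from the text above

def regard_difference(completion_a: str, completion_b: str) -> float:
    """Stand-in: difference in regard between two completions ('compare mode')."""
    raise NotImplementedError("plug in a regard classifier here")

def regard_score(completion: str) -> float:
    """Stand-in: regard score of a single completion w.r.t. its partial-sentence prompt."""
    raise NotImplementedError("plug in a regard classifier here")

def d_bold(completion_male: str, completion_female: str) -> bool:
    """Return True if the counterfactual completion pair is flagged as biased."""
    if regard_difference(completion_male, completion_female) > THRESHOLD:
        return True  # the two completions are regarded very differently
    # Otherwise, flag pairs where either completion is individually extreme.
    return (regard_score(completion_male) > THRESHOLD or
            regard_score(completion_female) > THRESHOLD)
```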
G.2 DT EXPERIMENTS

We design the bias detector DDT for specifications from the Decoding Trust dataset, using the agreement detector proposed for the stereotype bias dataset in the Decoding Trust paper (Wang et al., 2024). In particular, the agreement detector in that benchmark applies to a single response and identifies whether it contains phrases such as "I agree" (with the stereotype mentioned in the prompt) for agreement, or phrases like "I disagree" for disagreement. To detect bias across the responses to the counterfactual prompts, DDT checks for agreement in the individual responses and identifies bias when there is at least one agreement but not all responses agree, i.e., the model agrees with the stereotype for some demographic groups but not others. DDT easily scales beyond binary demographic groups, as it is agnostic to the number of responses across which it needs to detect bias.

G.3 LIMITATIONS OF BIAS DETECTORS

As we can see from the human evaluations of DBOLD and from the simplicity of DDT, the bias detectors can have several false positive and false negative cases. This is expected, as detection of bias is a challenging task (Li et al., 2022b; Kruspe, 2024), owing to the subtle and intricate differences between biased and unbiased texts. Moreover, there can be special characters/emojis in the texts generated by the LLMs that are often not considered by textual bias detectors but are clearly biased, e.g., (we observe this emoji combination in the generations of Vicuna-7B for random prefix specifications from QBOLD). While our certification method inherits some of the inaccuracies of the bias detectors we use, these inaccuracies are often smoothed out, as certification aggregates several observations of bias to generate bounds on the probability of unbiased responses.

H COMPARING WITH PRIOR WORKS ON GUARANTEES FOR LLMS

Works on conformal prediction. (Quach et al., 2024; Deutschmann et al., 2023; Mohri and Hashimoto, 2024; Yadkori et al., 2024) apply conformal prediction techniques to language models, proposing methods for generating sets of outputs with statistical guarantees on correctness, coverage, or abstention, aiming to improve reliability and mitigate hallucinations. Their guarantees and scope differ from those of LLMCert-B. These prior works give specialized decoding strategies that guarantee the correctness of the outputs, which is useful for the factuality of responses. LLMCert-B, in contrast, assesses and certifies counterfactual biases in LLM responses generated with any decoding scheme applied to the target LLM. Moreover, the guarantees of conformal prediction are for the correctness of LLM responses to individual prompts, while LLMCert-B's guarantees are over distributions of counterfactual prompt sets (which have prohibitively large sample spaces).

Prompt Risk Control. (Zollo et al., 2024) (PRC) presents a framework for selecting low-risk system prompts for LLMs. PRC computes the loss incurred for one prompt at a time and aggregates those losses to form a risk measure. LLMCert-B, on the other hand, is for counterfactual bias, i.e., we assess bias across a set of LLM responses obtained by varying the sensitive attributes in the prompts. Our specification is thus a relational property (Barthe et al., 2011), which is defined over multiple related inputs. Biases across LLM responses to multiple related prompts are aggregated to certify the target LLM. Moreover, LLMCert-B proposes novel, inexpensive mechanisms to design distributions and their samplers with potentially adversarial prefixes, containing effective, manually-designed jailbreaks in their sample space. Contrary to PRC, which uses static datasets of adversarial examples, we use independent and identically-distributed samples from these distributions.
I ADDITIONAL NOTES ON BIAS IN ML

Bias is a complex social phenomenon that arises in various forms. In this section, we discuss various notions of bias and the harms caused by them. We also discuss how LLMCert-B complements existing evaluation methods by certifying for counterfactual bias. The following discussion is not a comprehensive treatment of bias in machine learning; we refer the reader to detailed survey papers such as (Gallegos et al., 2024a; Blodgett et al., 2020; Li et al., 2024) for more information.

Bias consists of disparate treatment (a.k.a. discrimination) or disparate outcomes (Barocas and Selbst, 2016) for different demographic groups. Harms due to bias are primarily of two kinds: representational and allocation harms (Gallegos et al., 2024a). Representational harm (Suresh and Guttag, 2021; Blodgett et al., 2020) consists of denigrating and subordinating attitudes towards a demographic group, including the use of derogatory language, stereotyping, toxicity, misrepresentation, etc. These can arise from inappropriate use of language by humans or machines (e.g., LLMs). Allocation harms (Ferrara, 2023) are the disparate distribution of resources or opportunities between demographic groups, consisting of direct or indirect discrimination in economic or social opportunities. For example, prior works like (Terry et al., 2010; Martínez, 2022) show that the lack of representation of African American English in dominant language practices results in that community facing penalties in education systems or when seeking housing. Most constitutions around the world have anti-discrimination laws like (Sherry, 1965) that prohibit allocative harms in employment, etc. Language is considered an important factor for labeling, modifying, and transmitting beliefs about demographic groups and can result in the reinforcement of social inequalities (Rosa and Flores, 2017).

LLMCert-B is a reliable evaluation method for counterfactual bias in language models (LMs) that certifies the probability of unbiased responses of target LLMs for distributions of counterfactual prompts, with statistical guarantees. Prior bias assessments have been of two kinds (Cao et al., 2022): intrinsic and extrinsic. Intrinsic bias occurs in the language representations, while extrinsic bias manifests in the final textual responses of the LMs. To be able to certify closed-source LMs as well, we study extrinsic bias. Bias is the opposite of fairness, which has been identified to be of various forms, such as group fairness (Blandin and Kash, 2024), individual fairness (Dwork et al., 2011), and counterfactual fairness (Kusner et al., 2018). LLMCert-B certifies for counterfactual bias, akin to counterfactual fairness. This is because the causal perspective of counterfactual bias (Anthis and Veitch, 2023) (bias due to mentioning a specific demographic group in the prompt) aligns more closely with human intuitions about discrimination and fairness. Moreover, unlike group fairness, counterfactual fairness operates at the individual level, thus identifying bias in specific cases instead of aggregates.