# ADEPT: A DEbiasing PrompT Framework

Ke Yang1, Charles Yu2, Yi R. Fung2, Manling Li2, Heng Ji2
1Tsinghua University, 2University of Illinois Urbana-Champaign
yang-k19@mails.tsinghua.edu.cn, {ctyu2,yifung2,manling2,hengji}@illinois.edu

## Abstract

Several existing works have proven that finetuning is an applicable approach for debiasing contextualized word embeddings. Similarly, discrete prompts with semantic meanings have been shown to be effective in debiasing tasks. With unfixed mathematical representations at the token level, continuous prompts usually surpass discrete ones at providing a pre-trained language model (PLM) with additional task-specific information. Despite this, relatively few efforts have been made to debias PLMs by prompt tuning with continuous prompts compared to its discrete counterpart. Furthermore, for most debiasing methods that alter a PLM's original parameters, a major problem is the need to not only decrease the bias in the PLM, but also ensure that the PLM does not lose its representation ability. Finetuning methods typically have a hard time maintaining this balance, as they tend to aggressively remove the meanings of attribute words (such as the words that shape our concepts of male and female for gender), which also leads to an unstable and unpredictable training process. In this paper, we propose ADEPT, a method to debias PLMs using prompt tuning while maintaining the delicate balance between removing biases and ensuring representation ability.¹ To achieve this, we propose a new training criterion inspired by manifold learning and equip it with an explicit debiasing term to optimize prompt tuning. In addition, we conduct several experiments with regard to the reliability, quality, and quantity of a previously proposed attribute training corpus in order to obtain a clearer prototype of a certain attribute, which indicates the attribute's position and relative distances to other words on the manifold. We evaluate ADEPT on several widely acknowledged debiasing benchmarks and downstream tasks, and find that it achieves competitive results while maintaining (and in some cases even improving) the PLM's representation ability. We further visualize words' correlations before and after debiasing a PLM, and give some possible explanations for the visible effects.

¹The code and data are publicly available at https://github.com/EmpathYang/ADEPT.

## Introduction

Natural Language Processing (NLP) tools are widely used today to perform reasoning and prediction by efficiently condensing the semantic meanings of a token, a sentence, or a document. As more powerful NLP models have been developed, many real-world tasks have been automated by the application of these NLP systems. However, a great number of fields and tasks have a high demand for fairness and equality: legal information extraction (Rabelo et al. 2022), resume filtering (Abdollahnejad, Kalman, and Far 2021), and general language assistants (Askell et al. 2021), to name a few. Unfortunately, in the pursuit of the most competitive results, practitioners often blindly apply PLMs, obtaining strong performance at the unseen cost of introducing bias into the process. An ideal NLP tool's decision or choice should not impose harms on a person based on their background (Blodgett et al. 2020), but many studies (Caliskan, Bryson, and Narayanan 2017; Mayfield et al.
2019) have found that biases exist and occur throughout the NLP lifecycle. Thus, it is increasingly important that PLMs can be debiased to enable applications that may be inadvertently influenced by a PLM's implicit stereotypes.

Debiasing, if treated as a special case of downstream tasks, can be tackled through finetuning. Typically, a finetuning debiasing method puts forward specific loss terms to guide a PLM to remove biases in itself (Kaneko and Bollegala 2021). Prompt tuning (Li and Liang 2021; Liu et al. 2021b; Lester, Al-Rfou, and Constant 2021) is one of the more promising methods for transfer learning with large PLMs, and its general success (Raffel et al. 2020) suggests applications toward debiasing as well. Prompt tuning, whose role is similar to that of finetuning, refers to freezing all the parameters of the original PLM and only training an additional section of parameters (called a "prompt") for the downstream task. Here, a prompt is a set of tokens, often added as a prefix to the input for the task, that acts as task-specific complementary information.

All PLM debiasing methods must overcome a major hurdle of imbalance. Methods that are imbalanced do not adequately balance eliminating biases in a PLM with maintaining its representation ability. Some existing methods are prone to be destructive, whether "destructive" refers to decreasing a word/sentence embedding's projection on a linear bias subspace (Liang et al. 2020), or to completely removing the semantic meanings of attribute words (e.g., man, male; and woman, female) from all neutral words (e.g., engineer, scientist; and teacher, librarian) (Kaneko and Bollegala 2021). If a debiasing framework focuses only on the PLM's debiasing task and pays no attention to preserving the model's useful properties, it may destroy the PLM's computational structure and counteract the benefits of pretraining altogether. As an extreme example, a randomly initialized model is expected to be completely unbiased.

Figure 1: An illustration of how ADEPT is used for debiasing and for downstream tasks. (a) While debiasing, ADEPT only trains the prompt parameters and keeps the base model frozen. (b) When performing downstream tasks, ADEPT conditions either the base model alone or both the prompt and the base model.

In this paper, we propose ADEPT (Figure 1), a debiasing algorithm which implements prompt tuning to debias PLMs and makes the following contributions:

- We are the first to exploit prompt tuning in the debiasing space.
- We introduce a novel debiasing criterion, which often enables the debiased model to perform better than the original one in downstream tasks.
- We show that ADEPT is more effective at mitigating biases on a word embedding manifold than other methods which operate on a linear bias subspace.
- We show methods for improving prototypes for contextualized word embeddings that are generated via aggregation.
- Our prompt tuning approach has the inherent advantage of saving computing and storage resources. In our experiments, we achieve strong results by training prompts with less than 1% of the parameters of the PLM, as opposed to finetuning approaches which train the whole model (a rough parameter count is sketched below). Furthermore, because prompt tuning trains only the prompt and never touches the PLM's original parameters during training, the base model maintains its robustness.
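To put the "less than 1%" figure in perspective, here is a back-of-the-envelope count. The prompt length of 40 virtual tokens and the BERT-base parameter budget are illustrative assumptions, not the paper's exact configuration:

```python
# Rough comparison of trainable parameters: continuous prompt vs. full finetuning.
# Assumptions (illustrative only): a BERT-base backbone with ~110M parameters,
# hidden size 768, and a continuous prompt of 40 virtual tokens.

PLM_PARAMS = 110_000_000   # approximate parameter count of BERT-base
HIDDEN_SIZE = 768          # BERT-base hidden dimension
PROMPT_LENGTH = 40         # number of virtual prompt tokens (assumed)

# Each virtual token is one trainable vector of size HIDDEN_SIZE.
prompt_params = PROMPT_LENGTH * HIDDEN_SIZE

print(f"prompt parameters:   {prompt_params:,}")                 # 30,720
print(f"fraction of the PLM: {prompt_params / PLM_PARAMS:.4%}")  # roughly 0.03%
```

Even prefix-tuning variants that learn a prefix for every layer stay far below 1% of the backbone under these assumptions.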
## Related Work

### Debiasing Methods

#### Word Embeddings

Static word embeddings, the foundational building blocks of neural language models, have been a prime target for criticism. In light of artifacts from their training process leading to the encoding of stereotypes, many efforts have been made to mitigate the correlations stored within static embeddings (Mikolov, Yih, and Zweig 2013; Bolukbasi et al. 2016; Caliskan, Bryson, and Narayanan 2017; Mikolov et al. 2013; Manzini et al. 2019). However, most modern PLMs employ contextualized word embeddings, spreading the potentially biased representations of words across various contexts.

#### Discrete Prompts

Solaiman and Dennison (2021) propose PALMS with Values-Targeted Datasets, which finetunes large-scale PLMs on a predetermined set of social values in order to reduce PLMs' biases and toxicity. Askell et al. (2021) use a hand-designed prompt of more than 4,600 words as a stronger baseline for the helpfulness, harmlessness, and honesty principles of a general language assistant. Schick, Udupa, and Schütze (2021) encourage a model to generate biased text and then discard its undesired behaviors using this internal knowledge. In general, discrete prompts debias PLMs in the form of debiasing descriptions. Since crafting discrete prompts manually requires domain knowledge and professional expertise, and the effectiveness of hand-crafted prompts cannot be guaranteed beforehand, we aim to improve the performance of debiasing prompts by replacing them with continuous ones, which can be optimized with standard techniques such as gradient descent.

#### Finetuning Setting

Kaneko and Bollegala (2021) propose a finetuning method for debiasing PLMs. It defines a special loss for the debiasing task which takes both a PLM's debiasing results and its expressiveness into account. Their experiments show that token-level debiasing across all layers of the PLM produces the best performance. They further conduct experiments on MNLI tasks and find that the debiased model preserves semantic information. As this work also makes an effort to maintain a PLM's expressiveness while debiasing, we take their debiased model as our baseline.

### Prompt Tuning

"Prompt" usually has two connotations. One is text with natural semantics, which is fed into the language model together with the original input as additional information. The other is a set of prefixed, continuous, trainable vectors inserted into a PLM, which usually do not have semantic meanings. Because this set of continuous vectors serves the same function as a discrete prompt, such as providing the PLM with extra hints for solving a problem, it is also called a prompt (or prefix). Li and Liang (2021), Liu et al. (2021b), and Lester, Al-Rfou, and Constant (2021) propose prompt tuning (or prefix-tuning, p-tuning) as a lightweight alternative to finetuning for performing downstream tasks. This approach conditions a large-scale PLM by freezing its original parameters and training only the added continuous prompt.
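Concretely, a minimal sketch of this general prompt-tuning recipe (a trainable soft prompt prepended to the input embeddings of a frozen BERT encoder) might look as follows. The backbone name, prompt length, and initialization scale are illustrative assumptions; this is not the authors' released implementation:

```python
# Minimal prompt-tuning sketch: only the soft-prompt matrix is trainable,
# while every parameter of the pre-trained encoder stays frozen.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SoftPromptEncoder(nn.Module):
    def __init__(self, model_name="bert-base-uncased", prompt_length=40):
        super().__init__()
        self.plm = AutoModel.from_pretrained(model_name)
        for p in self.plm.parameters():          # freeze the entire PLM
            p.requires_grad = False
        hidden = self.plm.config.hidden_size
        # The only trainable parameters: prompt_length continuous token vectors.
        self.prompt = nn.Parameter(torch.randn(prompt_length, hidden) * 0.02)

    def forward(self, input_ids, attention_mask):
        batch = input_ids.size(0)
        tok_emb = self.plm.embeddings.word_embeddings(input_ids)   # (B, L, H)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)    # (B, P, H)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)        # prepend prompt
        prompt_mask = torch.ones(batch, self.prompt.size(0),
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.plm(inputs_embeds=inputs_embeds, attention_mask=mask)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SoftPromptEncoder()
batch = tokenizer(["The engineer finished her design."], return_tensors="pt")
out = model(batch["input_ids"], batch["attention_mask"])
print(out.last_hidden_state.shape)   # (1, prompt_length + sequence_length, hidden)
```

During debiasing, an optimizer would be given only `model.prompt`, so the frozen base model can always be recovered unchanged.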
Algorithm 1: ADEPT, a debiasing algorithm for contextualized word embeddings.
Input: a Pre-trained Language Model (PLM)
Output: $\Phi_{prompt}$ for debiasing the PLM
ADEPT:
1: Prepare a PLM $M_{\Theta}$ with parameters $\Theta$.
2: Suppose a bias has $d$ attributes. Define a neutral word tuple $W^{neutral}$ and attribute word tuples $W^{a(i)} = (w^{a(i)}_{1}, \ldots, w^{a(i)}_{g})$, each with $g$ one-to-one words.
3: Collect sentences $S^{neutral}$ and $\{S^{a(i)}\}_{i=1}^{d}$.
4: Initialize parameters $\Phi_{prompt}$.
5: for epoch in $1, \ldots, epoch_{max}$ do
6: Calculate prototypes of the neutral words: $E^{neutral} = M'_{\Theta}(S^{neutral})$, where $M'_{\Theta}$ denotes $M_{\Theta}$ conditioned on $\Phi_{prompt}$.
7: Calculate prototypes of attributes: $E^{a(i)} = M'_{\Theta}(S^{a(i)})$, $e^{a(i)} = \mathrm{aver}(E^{a(i)})$.
8: Calculate distances between attribute words and neutral words: $P^{a(i)} = \mathrm{Distance}(E^{neutral} \mid e^{a(i)})$.
9: Calculate loss of bias: $L_{bias} = \sum_{i,j \in \{1,\ldots,d\},\, i<j} \mathrm{JS}(P^{a(i)} \,\|\, P^{a(j)})$.
...
end for
return $\Phi_{prompt}$

While our experiments consider gender bias with $d = 2$ attributes², the formulation applies to biases with $d > 2$ as well. We then collect sentences based on the word tuples. $S^{neutral}$ (or $S^{a(i)}$) denotes sentences that contain at least one word in $W^{neutral}$ (or $W^{a(i)}$, respectively). Instead of creating template-based sentences using the words from $\{W^{neutral}\} \cup \{W^{a(i)}\}_{i=1}^{d}$, we scrape natural sentences from a corpus (possibly distinct from and/or smaller than the PLM's pretraining corpus) to obtain a diverse word distribution that aligns better with the real world.

²We hold the opinion that gender identity need not be restricted to the binary choice of male or female. However, for the purposes of experimentation and following prior studies, we adopt this binary setting.

### Calculate Prototypes of Neutral Words/Attributes

To gain insight into a model's view of different groups, we seek prototypes of neutral words and attributes. To obtain these prototypes, we extract embeddings for each word. For a word from $W^{x}$ ($x$ is neutral or $a(i)$ for some $i$), we fetch the associated sentence from $S^{x}$ and feed it into $M'_{\Theta}$. Then, we extract the hidden state for the word from each layer of the forward pass. For PLMs adopting WordPiece embeddings such as BERT (Devlin et al. 2018), if a word has several sub-tokens, we average the sub-tokens' hidden states to obtain the word's hidden state. For each word tuple's sentences $S^{x}$, we extract the set of embeddings $E^{x}$. For attribute words, we follow the procedure from Bommasani, Davis, and Cardie (2020) and average the embeddings $E^{a(i)}$ to get a single embedding $e^{a(i)}$ that more closely resembles a static embedding, as opposed to a contextualized embedding. By the law of large numbers, we expect this simple linear computation to reduce the context's linear influence on each attribute word. This process can be summarized as:

$$E^{neutral} = M'_{\Theta}(S^{neutral}), \qquad E^{a(i)} = M'_{\Theta}(S^{a(i)}) \tag{2}$$

$$e^{a(i)} = \mathrm{aver}(E^{a(i)}) \tag{3}$$

Thus we take $E^{neutral} = [e^{neutral}_{1}, e^{neutral}_{2}, \ldots]$ as the prototypes of the neutral words and $e^{a(i)}$ as the prototype of an attribute.

### Define Tuning Loss

We treat word embeddings as being distributed on a manifold and design the loss adhering to the criterion that pairwise attribute words should look alike compared to neutral words on the manifold. We first design $L_{bias}$ with the intention of pushing pairwise attribute words closer together on the manifold, which corresponds to decreasing biases in a PLM. $p_{n_j|a(i)}$ quantifies the degree to which attribute $a(i)$'s information can be restored from the neutral word $n_j$ in $M'_{\Theta}$:

$$p_{n_j|a(i)} = \frac{\exp\left(-\frac{\|e^{a(i)} - e^{neutral}_{j}\|^{2}}{2\rho^{2}}\right)}{\sum_{n_k \in W^{neutral}} \exp\left(-\frac{\|e^{a(i)} - e^{neutral}_{k}\|^{2}}{2\rho^{2}}\right)} \tag{4}$$

where $\rho$ is a hyperparameter. We can interpret Equation 4 in this way: (1) Consider a Gaussian distribution centered at the prototype of attribute $a(i)$, namely $e^{a(i)}$, with covariance matrix $\rho^{2}$ times the identity. The prototype of the neutral word $n_j$, namely $e^{neutral}_{j}$, then appears under this distribution with probability proportional to $\exp\left(-\frac{\|e^{a(i)} - e^{neutral}_{j}\|^{2}}{2\rho^{2}}\right)$, the numerator. (2) The denominator sums this probability over all $n_k \in W^{neutral}$ and acts as the normalization factor. (3) Equation 4 thus quantifies how much information about $e^{a(i)}$ we can restore from $e^{neutral}_{j}$. Similar equations have been used in other contexts (Parzen 1962; Hinton and Roweis 2002).
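As an illustration of Equations 3 and 4, the sketch below builds an attribute prototype by averaging contextualized embeddings and converts distances to the neutral-word prototypes into a probability distribution. Random vectors stand in for real PLM hidden states, and the dimensions are illustrative:

```python
# Sketch of the prototype distance distribution (Equation 4); rho is the same
# hyperparameter as in the text.
import numpy as np

def attribute_to_neutral_distribution(e_attr, E_neutral, rho=1.0):
    """p_{n_j|a(i)}: softmax over negative squared distances from the attribute
    prototype e_attr to every neutral-word prototype in E_neutral."""
    sq_dists = np.sum((E_neutral - e_attr) ** 2, axis=1)   # ||e^{a(i)} - e_j||^2
    logits = -sq_dists / (2.0 * rho ** 2)
    logits -= logits.max()                                 # numerical stability only
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy usage: 5 neutral-word prototypes and one attribute prototype in R^768.
rng = np.random.default_rng(0)
E_neutral = rng.normal(size=(5, 768))           # E^neutral = [e_1, e_2, ...]
E_attr_sentences = rng.normal(size=(100, 768))  # contextualized embeddings of one attribute
e_attr = E_attr_sentences.mean(axis=0)          # e^{a(i)} = aver(E^{a(i)}), Eq. 3

P_attr = attribute_to_neutral_distribution(e_attr, E_neutral, rho=1.0)
print(P_attr, P_attr.sum())                     # a probability distribution over neutral words
```

Subtracting the maximum logit before exponentiating leaves the distribution unchanged while avoiding overflow for large distances.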
$P^{a(i)}$ denotes the distances from attribute $a(i)$ to all neutral words. $P^{a(i)} = [p_{n_1|a(i)}, p_{n_2|a(i)}, \ldots]$ is a list of values calculated from $e^{a(i)}$ and $E^{neutral}$. Therefore, we summarize it as:

$$P^{a(i)} = \mathrm{Distance}(E^{neutral} \mid e^{a(i)}) \tag{5}$$

We define $L_{bias}$ as:

$$L_{bias} = \sum_{i,j \in \{1,\ldots,d\},\, i<j} \mathrm{JS}\left(P^{a(i)} \,\|\, P^{a(j)}\right) \tag{6}$$

where $\mathrm{JS}(\cdot \,\|\, \cdot)$ is a symmetric divergence between two distributions, so that every pair of attributes is encouraged to induce similar distributions over the neutral words.
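To make the bias term concrete, the sketch below sums a divergence between the attribute distributions $P^{a(i)}$ over all pairs $i < j$. The specific choice of Jensen-Shannon divergence is an assumption here; any symmetric divergence between the pairwise distributions fits the criterion described above, and this is not the authors' released implementation:

```python
# Sketch of the pairwise bias loss over attribute distance distributions.
import itertools
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def bias_loss(attribute_distributions):
    """L_bias: sum of divergences over all attribute pairs i < j."""
    return sum(js_divergence(p_i, p_j)
               for p_i, p_j in itertools.combinations(attribute_distributions, 2))

# Toy usage with d = 2 attributes and 5 neutral words: identical distributions
# give zero loss; the further apart the distributions, the larger the loss.
P_a1 = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
P_a2 = np.array([0.1, 0.1, 0.1, 0.3, 0.4])
print(bias_loss([P_a1, P_a2]))
print(bias_loss([P_a1, P_a1]))  # -> 0.0
```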