# Automatically Neutralizing Subjective Bias in Text

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Reid Pryzant,¹ Richard Diehl Martinez,¹ Nathan Dass,¹ Sadao Kurohashi,² Dan Jurafsky,¹ Diyi Yang³
¹Stanford University {rpryzant, rdm, ndass, jurafsky}@stanford.edu
²Kyoto University kuro@i.kyoto-u.ac.jp
³Georgia Institute of Technology diyi.yang@cc.gatech.edu

## Abstract

Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity (introducing attitudes via framing, presupposing truth, and casting doubt) remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view ("neutralizing" biased text). We also offer the first parallel corpus of biased language. The corpus contains 180,000 sentence pairs and originates from Wikipedia edits that removed various framings, presuppositions, and attitudes from biased sentences. Last, we propose two strong encoder-decoder baselines for the task. A straightforward yet opaque CONCURRENT system uses a BERT encoder to identify subjective words as part of the generation process. An interpretable and controllable MODULAR algorithm separates these steps, using (1) a BERT-based classifier to identify problematic words and (2) a novel join embedding through which the classifier can edit the hidden states of the encoder. Large-scale human evaluation across four domains (encyclopedias, news headlines, books, and political speeches) suggests that these algorithms are a first step towards the automatic identification and reduction of bias.

## 1 Introduction

Writers and editors of texts like encyclopedias, news, and textbooks strive to avoid biased language. Yet bias remains ubiquitous. 62% of Americans believe their news is biased (Gallup 2018), and bias is the single largest source of distrust in the media (Foundation 2018).

Figure 1: Example output from our MODULAR algorithm. "Exposed" is a factive verb that presupposes the truth of its complement (that McCain is unprincipled). Replacing "exposed" with "described" neutralizes the headline because it conveys a similar main-clause proposition (someone is asserting McCain is unprincipled) but no longer introduces the author's subjective bias via presupposition.

This work presents data and algorithms for automatically reducing bias in text. We focus on a particular kind of bias: inappropriate subjectivity ("subjective bias"). Subjective bias occurs when language that should be neutral and fair is skewed by feeling, opinion, or taste (whether consciously or unconsciously). In practice, we identify subjective bias via the method of Recasens, Danescu-Niculescu-Mizil, and Jurafsky (2013): using Wikipedia's neutral point of view (NPOV) policy (https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view). This policy is a set of principles that includes avoiding stating opinions as facts and preferring nonjudgemental language. For example, a news headline like "John McCain exposed as an unprincipled politician" (Figure 1) is biased because the verb "expose" is a factive verb that presupposes the truth of its complement; a non-biased sentence would use a verb like "describe" so as not to presuppose the subjective opinion of the writer.
"Pilfered" in "the gameplay is pilfered from DDR" (Table 1) subjectively frames the shared gameplay as a kind of theft. "His" in "a lead programmer usually spends his career" again introduces a biased and subjective viewpoint (that all programmers are men) through presupposition.

We aim to debias text by suggesting edits that would make it more neutral. This contrasts with prior research, which has debiased representations of text by removing dimensions of prejudice from word embeddings (Bolukbasi et al. 2016; Gonen and Goldberg 2019) and the hidden states of predictive models (Zhao et al. 2018; Das, Dantcheva, and Bremond 2018). To avoid overloading the definition of "debias", we refer to our kind of text debiasing as neutralizing that text. Figure 1 gives an example.

We introduce the Wiki Neutrality Corpus (WNC). This is a new parallel corpus of 180,000 biased and neutralized sentence pairs along with contextual sentences and metadata. The corpus was harvested from Wikipedia edits that were designed to ensure texts had a neutral point of view. WNC is the first parallel corpus of biased language.

| Source | Target | Subcategory |
| --- | --- | --- |
| A new downtown is being developed which will bring back... | A new downtown is being developed which its promoters hope will bring back... | Epistemological |
| The authors' exposé on nutrition studies | The authors' statements on nutrition studies | Epistemological |
| He started writing books revealing a vast world conspiracy | He started writing books alleging a vast world conspiracy | Epistemological |
| Go is the deepest game in the world. | Go is one of the deepest games in the world. | Framing |
| Most of the gameplay is pilfered from DDR. | Most of the gameplay is based on DDR. | Framing |
| Jewish forces overcome Arab militants. | Jewish forces overcome Arab forces. | Framing |
| A lead programmer usually spends his career mired in obscurity. | Lead programmers often spend their careers mired in obscurity. | Demographic |
| The lyrics are about mankind's perceived idea of hell. | The lyrics are about humanity's perceived idea of hell. | Demographic |
| Marriage is a holy union of individuals. | Marriage is a personal union of individuals. | Demographic |

Table 1: Samples from our new corpus. 500 sentence pairs are annotated with subcategory information (Column 3).

We also define the task of neutralizing subjectively biased text. This task shares many properties with tasks like detecting framing or epistemological bias (Recasens, Danescu-Niculescu-Mizil, and Jurafsky 2013), or veridicality assessment/factuality prediction (Saurí and Pustejovsky 2009; Marneffe, Manning, and Potts 2012; Rudinger, White, and Van Durme 2018; White et al. 2018). Our new task extends these detection/classification problems into a generation task: generating more neutral text with otherwise similar meaning.

Finally, we propose a pair of novel sequence-to-sequence algorithms for this neutralization task. Both methods leverage denoising autoencoders and a token-weighted loss function. First, an interpretable and controllable MODULAR algorithm breaks the problem into (1) detection and (2) editing, using (1) a BERT-based detector to explicitly identify problematic words and (2) a novel join embedding through which the detector can modify an editor's hidden states. This paradigm advances an important human-in-the-loop approach to bias understanding and generative language modeling.
Second, an easy-to-train and easy-to-use but more opaque CONCURRENT system uses a BERT encoder to identify subjectivity as part of the generation process.

Large-scale human evaluation suggests that, while not without flaws, our algorithms can identify and reduce bias in encyclopedias, news, books, and political speeches, and do so better than state-of-the-art style transfer and machine translation systems. This work represents an important first step towards automatically managing bias in the real world. We release data and code to the public (https://github.com/rpryzant/neutralizing-bias).

## 2 Wiki Neutrality Corpus (WNC)

The Wiki Neutrality Corpus consists of aligned sentences pre- and post-neutralization by English Wikipedia editors (Table 1). We used regular expressions to crawl 423,823 Wikipedia revisions between 2004 and 2019 where editors provided NPOV-related justification (Zanzotto and Pennacchiotti 2010; Recasens, Danescu-Niculescu-Mizil, and Jurafsky 2013; Yang et al. 2017). To maximize the precision of bias-related changes, we ignored revisions where:

- more than a single sentence was changed;
- the edit was minimal (character Levenshtein distance < 4);
- the edit was maximal (more than half of the words changed);
- more than half of the words were proper nouns;
- the edit fixed spelling or grammatical errors;
- the edit added references or hyperlinks; or
- the edit changed non-literary elements like tables or punctuation.

| Data | Sentence pairs | Total words | Seq length (mean) | # revised words (mean) |
| --- | --- | --- | --- | --- |
| Biased-full | 181,496 | 10.2M | 28.21 | 4.05 |
| Biased-word | 55,503 | 2.8M | 26.22 | 1.00 |
| Neutral | 385,639 | 17.4M | 22.58 | 0.00 |

Table 2: Corpus statistics.

We align sentences in the pre- and post-edit text by computing a sliding window (size k = 5) of pairwise BLEU (Papineni et al. 2002) between sentences and matching the sentences with the highest score (Faruqui et al. 2018; Tiedemann 2008); a sketch of this step is given at the end of this section. Last, we discarded pairs whose length ratios were beyond the 95th percentile (Pryzant et al. 2017). Corpus statistics are given in Table 2. The final data are (1) a parallel corpus of 180k biased sentences and their neutral counterparts, and (2) 385k neutral sentences that were adjacent to a revised sentence at the time of editing but were not changed by the editor. Note that following Recasens, Danescu-Niculescu-Mizil, and Jurafsky (2013), the neutralizing experiments in Section 4 focus on the subset of WNC where the editor modified or deleted a single word in the source text (Biased-word in Table 2).

Table 1 also gives a categorization of these sample pairs using a slight extension of the typology of Recasens, Danescu-Niculescu-Mizil, and Jurafsky (2013). They defined framing bias as the use of subjective words or phrases linked with a particular point of view (using words like "best" or "deepest", or using "pilfered from" instead of "based on"), and epistemological bias as linguistic features that subtly (often via presupposition) modify the believability of a proposition. We add to their two a third kind of subjective bias that also occurs in our data, which we call demographic bias: text with presuppositions about particular genders, races, or other demographic categories (like presupposing that all programmers are male).

| Subcategory | Percent |
| --- | --- |
| Epistemological | 25.0 |
| Framing | 57.7 |
| Demographic | 11.7 |
| Noise | 5.6 |

Table 3: Proportion of bias subcategories in Biased-full.

The dataset does not include labels for these categories, but we hand-labeled a random sample of 500 examples to estimate the distribution of the 3 types. Table 3 shows that while framing bias is most common, all types of bias are represented in the data, including instances of demographic bias.
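For concreteness, the sliding-window alignment and length-ratio filter described above can be illustrated with a short script. The following is a minimal sketch rather than the released pipeline: it assumes NLTK's `sentence_bleu` for scoring, and the length-ratio cutoff shown is an illustrative stand-in for the corpus-derived 95th-percentile threshold.

```python
# Minimal sketch of sliding-window BLEU alignment between pre- and post-edit
# sentences (window size k = 5), with a simple length-ratio filter.
# Assumes NLTK is installed; thresholds here are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def align_sentences(pre_sents, post_sents, k=5):
    """Match each pre-edit sentence to its best post-edit sentence
    within a window of +/- k positions, scored by pairwise BLEU."""
    pairs = []
    for i, src in enumerate(pre_sents):
        lo, hi = max(0, i - k), min(len(post_sents), i + k + 1)
        best_j, best_score = None, -1.0
        for j in range(lo, hi):
            score = sentence_bleu([post_sents[j]], src, smoothing_function=smooth)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            pairs.append((src, post_sents[best_j], best_score))
    return pairs

def length_ratio_ok(src, tgt, max_ratio=1.5):
    """Discard pairs with extreme length ratios; the corpus uses the
    95th percentile of the ratio distribution, approximated here."""
    ratio = max(len(src), len(tgt)) / max(1, min(len(src), len(tgt)))
    return ratio <= max_ratio

# Usage: sentences are lists of tokens.
# pre  = [["the", "gameplay", "is", "pilfered", "from", "ddr"], ...]
# post = [["the", "gameplay", "is", "based", "on", "ddr"], ...]
# aligned = [(s, t) for s, t, _ in align_sentences(pre, post)
#            if length_ratio_ok(s, t)]
```

Scoring only a window of ±k candidate sentences keeps the alignment roughly linear in document length, which matters at the scale of 423,823 crawled revisions.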
### 2.1 Dataset Properties

We take a closer look at WNC to identify characteristics of subjective bias on Wikipedia.

**Topic.** We use the Wikimedia Foundation's categorization models (Asthana and Halfaker 2018) to bucket articles from WNC into a 44-category ontology (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory), then compare the proportions of NPOV-driven edits across categories. Subjectively biased edits are most prevalent in the history, politics, philosophy, sports, and language categories. They are least prevalent in the meteorology, science, landforms, broadcasting, and arts categories. This suggests that there is a relationship between a text's topic and the realization of bias. We use this observation to guide our model design in Section 3.1.

**Tenure.** We group editors into newcomers (less than a month of experience) and experienced editors (more than a month). We find that newcomers are less likely to perform neutralizing edits (15% in WNC) compared to other edits (34% in a random sample of 685k edits). This difference is significant (χ² test, p = 0.001), suggesting that the complexity of neutralizing text is typically reserved for more senior editors, which helps explain the performance of human evaluators in Section 6.1.

## 3 Methods for Neutralizing Text

We propose the task of neutralizing text, in which the algorithm is given an input sentence and must produce an output sentence whose meaning is as similar as possible to the input but with the subjective bias removed. We propose two algorithms for this task, each with its own benefits. A MODULAR algorithm enables human control and interpretability. A CONCURRENT algorithm is simple to train and operate.

We adopt the following notation:

- $s = [w^s_1, \ldots, w^s_n]$ is a source sequence of subjectively biased text.
- $t = [w^t_1, \ldots, w^t_m]$ is a target sequence, the neutralized version of $s$.

Figure 2: The detection module uses discrete features $f_i$ and BERT embedding $b_i$ to calculate logit $y_i$.

### 3.1 MODULAR

The first algorithm we propose is MODULAR. It has two stages: BERT-based detection and LSTM-based editing. We pretrain a model for each stage and then combine them into a joint system for end-to-end fine-tuning on the overall neutralizing task. We proceed to describe each module.

**Detection Module.** The detection module is a neural sequence tagger that estimates $p_i$, the probability that each input word $w^s_i$ is subjectively biased (Figure 2).

Module description. Each $p_i$ is calculated according to

$$p_i = \sigma(b_i\, W^b + e_i\, W^e + b) \qquad (1)$$

where:

- $b_i \in \mathbb{R}^{b}$ represents $w^s_i$'s semantic meaning. It is a contextualized word vector produced by BERT, a transformer encoder that has been pre-trained as a masked language model (Devlin et al. 2019). To leverage the bias-topic relationship uncovered in Section 2.1, we prepend a token indicating the article's topic category to $s$. The word vectors for these tokens are learned from scratch.
- $e_i$ represents expert features of bias proposed by Recasens, Danescu-Niculescu-Mizil, and Jurafsky (2013):

$$e_i = \operatorname{ReLU}(f_i\, W^{in}) \qquad (2)$$

where $W^{in} \in \mathbb{R}^{f \times h}$ is a matrix of learned parameters and $f_i$ is a vector of discrete features (such as lexicons of hedges, factives, assertives, implicatives, and subjective words; see code release).
- $W^b \in \mathbb{R}^{b}$, $W^e \in \mathbb{R}^{h}$, and $b \in \mathbb{R}$ are learnable parameters.

Module pre-training. We train this module using diffs (https://github.com/paulgb/simplediff) between the source and target text. A label $p^*_i$ is 1 if $w^s_i$ was deleted or modified as part of the neutralizing process, and 0 if the associated word was unchanged during editing, i.e. it occurs in both the source and target text. The loss is calculated as the average negative log likelihood of the labels:

$$-\frac{1}{n}\sum_{i=1}^{n}\Big[\, p^*_i \log p_i + (1 - p^*_i)\log(1 - p_i)\,\Big]$$
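To make Equations 1 and 2 concrete, a detection module of this shape can be expressed in a few lines of PyTorch. This is a sketch under stated assumptions rather than the released implementation: it assumes the Hugging Face `transformers` BERT encoder, and `feats` stands in for the precomputed discrete expert-feature vectors (the prepended topic token is assumed to already be part of `input_ids`).

```python
# Sketch of the detection module (Eqs. 1-2): BERT embeddings b_i plus
# expert features f_i -> per-token bias probability p_i.
# Assumes the Hugging Face `transformers` package; feature extraction
# for f_i is not shown.
import torch
import torch.nn as nn
from transformers import BertModel

class BiasDetector(nn.Module):
    def __init__(self, n_feats, hidden=128, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        b_dim = self.bert.config.hidden_size          # dimension of b_i
        self.W_in = nn.Linear(n_feats, hidden)        # Eq. 2: e_i = ReLU(f_i W_in)
        self.W_b = nn.Linear(b_dim, 1, bias=False)    # Eq. 1: b_i W^b
        self.W_e = nn.Linear(hidden, 1, bias=True)    # Eq. 1: e_i W^e + b

    def forward(self, input_ids, attention_mask, feats):
        # b_i: contextualized BERT vectors, shape (batch, seq_len, b_dim)
        b = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask,
                      return_dict=True).last_hidden_state
        e = torch.relu(self.W_in(feats))              # (batch, seq_len, hidden)
        logits = (self.W_b(b) + self.W_e(e)).squeeze(-1)
        return torch.sigmoid(logits)                  # p_i per token

# Pre-training reduces to binary cross-entropy (e.g. nn.BCELoss()) between
# the returned probabilities and the diff-derived 0/1 labels p*_i.
```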
Figure 3: The MODULAR system uses join embedding $v$ to reconcile the detector's predictions with an encoder-decoder architecture. The greater a word's probability, the more of $v$ is mixed into that word's hidden state.

**Editing Module.** The editing module takes a subjective source sentence $s$ and is trained to edit it into a more neutral complement $t$.

Module description. This module is based on a sequence-to-sequence neural machine translation model (Luong, Pham, and Manning 2015). A bi-LSTM encoder turns $s$ into a sequence of hidden states $H = (h_1, \ldots, h_n)$ (Hochreiter and Schmidhuber 1997). Next, an LSTM decoder generates text one token at a time by repeatedly attending to $H$ and producing probability distributions over the vocabulary. We also add two mechanisms from the summarization literature (See, Liu, and Manning 2017). The first is a copy mechanism, where the model's final output for timestep $i$ becomes a weighted combination of the predicted vocabulary distribution and the attentional distribution from that timestep. The second is a coverage mechanism, which incorporates the sum of previous attention distributions into the final loss function to discourage the model from re-attending to a word and repeating itself.

Module pre-training. We pre-train the decoder as a language model of neutral text using the neutral portion of WNC (Section 2). Doing so expresses a data-driven prior about how target sentences should read. We accomplish this with a denoising autoencoder objective (Hill, Cho, and Korhonen 2016), maximizing the conditional log probability $\log p(x \mid \tilde{x})$ of reconstructing a sequence $x$ from a corrupted version of itself $\tilde{x} = C(x)$, where $C$ is a noise model. Our $C$ is similar to that of Lample et al. (2018): we slightly shuffle $x$ such that $x_i$'s index in $\tilde{x}$ is randomly selected from $[i - k, i + k]$, and we then drop words with probability $p$. For our experiments, we set $k = 3$ and $p = 0.25$.

**Final System.** Once the detection and editing modules have been pre-trained, we join them and fine-tune them together as an end-to-end system for translating $s$ into $t$. This is done with a novel join embedding mechanism that lets the detector control the editor (Figure 3). The join embedding is a vector $v \in \mathbb{R}^{h}$ that we add to each encoder hidden state in the editing module. This operation is gated by the detector's output probabilities $p = (p_1, \ldots, p_n)$. Note that the same $v$ is applied across all timesteps:

$$h'_i = h_i + p_i\, v \qquad (3)$$

We proceed to condition the decoder on the new hidden states $H' = (h'_1, \ldots, h'_n)$, which have varying amounts of $v$ in them. Intuitively, $v$ enriches the hidden states of words that the detector identified as subjective. This tells the decoder what language should be changed and what is safe to be copied during the neutralization process. Error signals are allowed to flow backwards into both the encoder and the detector, creating an end-to-end system from the two modules.

To fine-tune the parameters of the joint system, we use a token-weighted loss function that scales the loss on neutralized words (i.e. words unique to $t$) by a factor of $\alpha$:

$$-\sum_{i=1}^{m} \lambda(w^t_i, s)\, \log p(w^t_i \mid s, w^t_{<i})$$

where $\lambda(w^t_i, s) = \alpha$ if $w^t_i$ does not appear in $s$ and $1$ otherwise.
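As a rough illustration of Equation 3 and the token-weighted objective, the sketch below shows the gating and the loss in isolation. It is illustrative only: tensor names such as `encoder_states` and `tgt_in_src` are assumptions made for the example, the α value is a placeholder, and the real system wires these pieces into the full attentional encoder-decoder described above.

```python
# Sketch of the join embedding (Eq. 3) and the token-weighted loss.
# encoder_states: (batch, src_len, h) hidden states from the bi-LSTM encoder
# p:              (batch, src_len)    detector probabilities p_i
# v:              (h,)                learned join embedding shared by all timesteps
import torch

def apply_join_embedding(encoder_states, p, v):
    # h'_i = h_i + p_i * v -- more of v is mixed into words the detector
    # considers subjective (in practice v would be an nn.Parameter).
    return encoder_states + p.unsqueeze(-1) * v

def token_weighted_nll(log_probs, target_ids, tgt_in_src, alpha=1.5):
    """Negative log likelihood where tokens unique to the target t
    (tgt_in_src == 0) are up-weighted by alpha.
    log_probs:  (batch, tgt_len, vocab) decoder log-probabilities
    target_ids: (batch, tgt_len)        gold target tokens
    tgt_in_src: (batch, tgt_len)        1 if the token also occurs in s
    The alpha value here is a placeholder, not the tuned setting."""
    nll = -log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    weights = torch.where(tgt_in_src.bool(),
                          torch.ones_like(nll),
                          torch.full_like(nll, alpha))
    return (weights * nll).mean()
```

Because the gate is a plain elementwise mix, gradients from the token-weighted loss flow through $p_i$ back into the detector, which is what makes the two pre-trained modules trainable as one end-to-end system.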