# Evaluating Commonsense in Pre-Trained Language Models

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Xuhui Zhou,1 Yue Zhang,2 Leyang Cui,2,3 Dandan Huang2
1University of Washington 2School of Engineering, Westlake University 3Zhejiang University
xuhuizh@uw.edu, {yue.zhang, cuileyang, huangdandan}@westlake.edu.cn
(Work done while at Westlake University.)

## Abstract

Contextualized representations trained over large raw text data have given remarkable improvements for NLP tasks including question answering and reading comprehension. There have been works showing that syntactic, semantic and word sense knowledge are contained in such representations, which explains why they benefit such tasks. However, relatively little work has been done investigating the commonsense knowledge contained in contextualized representations, which is crucial for human question answering and reading comprehension. We study the commonsense ability of GPT, BERT, XLNet, and RoBERTa by testing them on seven challenging benchmarks, finding that language modeling and its variants are effective objectives for promoting models' commonsense ability, while bi-directional context and larger training sets are bonuses. We additionally find that current models do poorly on tasks that require more inference steps. Finally, we test the robustness of the models by making dual test cases, which are correlated so that the correct prediction of one sample should lead to the correct prediction of the other. Interestingly, the models show confusion on these test cases, which suggests that they learn commonsense at the surface level rather than at a deep level. We publicly release a test set, named CATs, for future research.

## Introduction

Contextualized representations trained over large-scale text data have given remarkable improvements to a wide range of NLP tasks, including natural language inference (Bowman et al. 2015), question answering (Rajpurkar, Jia, and Liang 2018) and reading comprehension (Lai et al. 2017). Given new state-of-the-art results that approach or surpass human performance on several benchmark datasets, it is an interesting question what types of knowledge are learned in pre-trained contextualized representations, in order to better understand how they benefit the NLP problems above. There has been work investigating the nature of syntactic (Liu et al. 2019a), semantic (Liu et al. 2019a) and word sense (Kim et al. 2019) knowledge contained in such contextualized representations, in particular BERT (Devlin et al. 2019), showing that such knowledge can be effectively learned via language model (LM) pre-training over large-scale data.

Commonsense knowledge spans a huge portion of human experience, encompassing knowledge about the spatial, physical, social, temporal, and psychological aspects of typical everyday life (Liu and Singh 2004). Intuitively, such knowledge is at least as useful as semantic and syntactic knowledge in natural language inference, reading comprehension and coreference resolution. For example, the word *it* in the sentence *the dog cannot cross the street because it is too X* can refer to three different entities when the word X is *timid*, *wide* and *dark*, respectively, and resolving such ambiguity can require that a system has relevant commonsense knowledge beyond the sentence level.
However, relatively little work has been conducted on systematically evaluating the nature of commonsense knowledge learned in contextualized representations. We fill this gap by evaluating five state-of-the-art contextualized embedding models on seven commonsense benchmarks. The models include off-the-shelf embeddings (from https://github.com/huggingface/transformers) from GPT (Radford and Sutskever 2018), GPT2 (Radford et al. 2019), BERT (Devlin et al. 2019), XLNet (Yang et al. 2019) and RoBERTa (Liu et al. 2019b), and the benchmarks include Conjunction Acceptability, Sense Making (Wang et al. 2019), the Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2012), SWAG (Zellers et al. 2018), HellaSwag (Zellers et al. 2019), Sense Making with Reasoning (Wang et al. 2019), and Argument Reasoning Comprehension (Habernal et al. 2018).

We evaluate the commonsense knowledge contained in the above models by unifying the form of all the datasets and comparing LM perplexities on positive and negative samples (i.e., sentences that make sense and those that do not make sense, respectively). The commonsense covered by our data spans a wide range of subjects, from physical world knowledge to social conventions, and from scientific domains to daily life scenes. We further categorize the tasks by difficulty level, namely the number of inference steps necessary for making sense. We reframe the datasets in order to conduct both word- and sentence-level testing. For word-level testing, negative samples are drawn by replacing words in positive samples; Table 1 gives examples of reframed test instances for each task.

Table 1: Examples of reframed test instances corresponding to each of our test tasks (✓ marks the positive sample, ✗ a negative sample). The key word is bolded in token-level tasks; the bracketed connectives in the ARCT rows show the logic flow and are replaced by natural language in the actual test data.

Token-level tasks:

- **CA**
  - ✓ They broadcast an announcement, **but** a subway came into the station and I couldn't hear it.
  - ✗ They broadcast an announcement, **before** a subway came into the station and I couldn't hear it.
- **WSC**
  - ✓ The trophy doesn't fit into the brown suitcase because **the trophy** is too large.
  - ✗ The trophy doesn't fit into the brown suitcase because **the suitcase** is too large.
- **SM**
  - ✓ money can be used for buying **cars**
  - ✗ money can be used for buying **stars**

Sentence-level tasks:

- **SMR**
  - ✓ he put an elephant into the fridge (because) an elephant is much bigger than a fridge.
  - ✗ he put an elephant into the fridge (because) elephants are usually gray...
  - ✗ he put an elephant into the fridge (because) an elephant cannot eat a fridge.
- **SWAG**
  - ✓ Someone unlocks the door and they go in. Someone leads the way in.
  - ✗ Someone unlocks the door and they go in. Someone opens the door and walks out.
  - ✗ Someone unlocks the door and they go in. Someone walks out of the driveway.
  - ✗ Someone unlocks the door and they go in. Someone walks next to someone and sits on a pew.
- **HellaSwag**
  - Context: A carved pumpkin with a light in it glows on a counter. Supplies for carving are then shown. A woman cuts the top off the pumpkin, emptying the seeds.
  - ✗ she cuts down all the pieces and dumps them in a trash bin in the end.
  - ✓ she then carves the traced lines to cut out the design.
  - ✗ she tapes the top shut as the continue carving the pumpkin.
- **ARCT**
  - ✓ People can choose not to use Google [and since] other search engines don't redirect to Google [therefore] Google is not a harmful monopoly
  - ✗ People can choose not to use Google [and since] all other search engines redirect to Google [therefore] Google is not a harmful monopoly

We are concerned with nouns, verbs, adjectives, adverbs, pronouns and conjunctions, which reflect different aspects of commonsense.
For example, while verbs such as *buy*, *throw*, *sell* are relatively more associated with event knowledge, conjunctions such as *because*, *but*, *so* are more associated with logical reasoning. For sentence-level testing, negative examples are drawn by replacing a full subsentence (such as a clause) with irrelevant or conflicting content. Sentence-level tests are more concerned with commonsense inference.

From the results we have four salient observations. First, the pre-trained models give consistently better performance than random baselines, which demonstrates that language model pre-training is useful for learning commonsense knowledge. Second, models based on bi-directional contexts, such as BERT, XLNet and RoBERTa, are stronger in learning commonsense knowledge than those based on uni-directional contexts, such as GPT and GPT2. Third, more commonsense knowledge can be learned from larger training sets, which conforms well to intuition. Fourth, the models have a certain degree of commonsense reasoning ability; however, as the number of necessary inference steps increases, model performance drops, which shows that commonsense is still a big challenge that is not completely solved by pre-trained contextualized language models (LMs).

Finally, we further test the robustness of the five models by making dual test samples. Here a dual test sample is built by adding, deleting and replacing words in a test sample, or swapping two words in the sample, thereby resulting in a closely related test case. In theory, a model equipped with relevant commonsense should give consistent predictions on a pair of dual test cases. However, we find that none of the models is able to reach such consistency. Instead, the models are confused by the modification, tending to give the same prediction over a pair of dual samples even though they may have different gold labels. This further reveals that the commonsense contained in the pre-trained models may remain at a surface level, without deep semantic comprehension. We publicly release our datasets, named commonsense ability tests (CATs), and the test script at GitHub (https://github.com/XuhuiZhou/CATS).

## Tasks for Evaluating Commonsense

Commonsense ability can be broadly divided into two categories. First, a model with commonsense ability should have basic knowledge about the world, for example, *water always goes down*. Second, it should have the ability to reason over commonsense knowledge, such as *water always goes down because there is gravity on the earth* and *if you are injured, you should go to the hospital*. To comprehensively test different models' commonsense ability, we synthesize six challenging tasks by taking positive and negative samples from existing benchmarks, and further introduce a new task called Conjunction Acceptability (CA).

We reframe all the tasks into sentence-scoring tasks by substitution or concatenation. For example, we create positive and negative samples by replacing a pronoun in the sentence of a WSC question with the candidates, obtaining a test instance as in Table 2.

Table 2: Example of reframing a WSC question. Note that there can be additional negative samples.

- Original: Paul tried to call George on the phone, but he wasn't successful. Who is he?
  - Candidates: A. Paul (correct) B. George
- Reframed:
  - A. Paul tried to call George on the phone, but Paul wasn't successful. (Positive sample)
  - B. Paul tried to call George on the phone, but George wasn't successful. (Negative sample)
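Concretely, each reframed sentence can be scored with a pre-trained LM, and the candidate with the lower perplexity (higher average log-likelihood) is taken as the sensible one. The sketch below is a minimal illustration using the HuggingFace transformers library mentioned above; the GPT-2 checkpoint, the per-token normalization, and the helper names are assumptions made for illustration, not the released CATs test script.

```python
# Minimal sketch of perplexity-based sentence scoring with a pre-trained LM.
# Illustrative only: the checkpoint ("gpt2"), the per-token normalization and
# the helper names are assumptions, not the authors' released test script.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_score(sentence: str) -> float:
    """Average token log-likelihood of a sentence (higher = more plausible)."""
    ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=ids makes the model return the LM cross-entropy loss,
        # i.e. the mean negative log-likelihood per token.
        loss = model(ids, labels=ids).loss
    return -loss.item()

def predict(candidates):
    """Return the index of the candidate sentence with the highest score."""
    scores = [sentence_score(c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Reframed WSC instance from Table 2: index 0 is the positive sample.
instance = [
    "Paul tried to call George on the phone, but Paul wasn't successful.",
    "Paul tried to call George on the phone, but George wasn't successful.",
]
print(predict(instance))  # a commonsense-aware LM should output 0
```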
A model is asked to score the sentences, and we pick the sentence with the highest score as its prediction for a test instance. Below we introduce the data sources and reframed tasks in detail (the correct answer is bolded).

### Sense Making (SM)

Introduced by Wang et al. (2019), this task tests whether a model can differentiate sense-making and non-sense-making statements. Given a pair of statements (i.e., a test instance), it requires the model to choose the more sensible statement. One example is: **I work 8 hours a day** / I work 25 hours a day. This task conforms to our evaluation schema without a change. More examples are shown in the SM section of Table 1. The statements typically differ only in one key word, which covers nouns, verbs, adjectives, and adverbs.

### Winograd Schema Challenge (WSC)

The Winograd Schema Challenge (WSC) dataset (Levesque, Davis, and Morgenstern 2012) consists of 273 instances of the pronoun resolution problem. Each instance contains a sentence with a pronoun referring to one of the nouns; the original question is to pick the correct noun. For our task, we transform the test as shown in Table 2. More examples are shown in the WSC section of Table 1. WSC is recognized as one of the most difficult commonsense datasets.

### Conjunction Acceptability (CA)

As stated by LoBue and Yates (2011), logic-based commonsense knowledge is an important part of world knowledge in addition to content-based knowledge. We aim to probe a model's ability to understand logic relations in language by extracting 189 positive samples from the WSC dataset and manually replacing the conjunction with another conjunction to obtain a negative sample. We pair the positive and negative samples to obtain a test instance. For example, The lawyer asked the witness a question, and the witness was reluctant to answer it / **The lawyer asked the witness a question, but the witness was reluctant to answer it**. More examples are shown in the CA section of Table 1. This task uses *because*, *before*, *when*, *but*, and *and* to correspond to the Cause and Effect, Preconditions, Simultaneous Conditions, Contradiction, and Addition logic relations, respectively. It is complementary to the other token-level tasks, which focus more on content-based knowledge.

### SWAG

SWAG (Zellers et al. 2018) is a dataset with multiple-choice questions about grounded situations. It probes a model's understanding of the relationship between two physical scenes. With the help of adversarial filtering (AF), Zellers et al. created a sufficiently large number of questions automatically. For example, given *On stage, a woman takes a seat at the piano. She*, the question is to choose among the following candidates: A. sits on a bench as her sister plays with the doll B. smiles with someone as the music plays C. is in the crowd, watching the dancers **D. nervously sets her fingers on the keys**. We obtain a positive or negative sample by concatenating the context and a candidate together (e.g., On stage, a woman takes a seat at the piano. She nervously sets her fingers on the keys). There are one positive sample and three negative samples in a SWAG test instance. More examples are shown in the SWAG section of Table 1. By forcing the model to predict the next action, the task requires inductive reasoning and temporal reasoning.

### HellaSwag

HellaSwag (Zellers et al. 2019) is an augmented version of SWAG with the same data format as SWAG, more inference steps and higher data quality.
While HellaSwag also includes data from WikiHow, we choose only the instances coming from ActivityNet to make the results comparable to the original SWAG dataset.

### Sense Making with Reasoning (SMR)

Sense Making with Reasoning (Wang et al. 2019) focuses on identifying the reason why a statement is against commonsense. A model needs to understand that a specific statement (e.g., *a can is usually made of gold*) is against commonsense and to choose the reason behind this from three candidates (e.g., gold is too bright to make cans, **gold is too soft to make cans** and gold is too expensive to make cans). We make a positive or negative sample by concatenating the statement and a candidate reason together. For each test instance in SMR, there is one positive sample and two negative samples. More examples are shown in the SMR section of Table 1. This task is intuitively difficult since it requires a model to have deeper knowledge together with higher-level inference, which belongs to abductive reasoning.

### Argument Reasoning Comprehension Task (ARCT)

Similar to SMR, Habernal et al. (2018) propose the ARCT dataset to test a model's abductive reasoning ability. Its domain lies in social topics such as search engines and LGBT rights, which differs from daily-routine scenarios. For example, given a reason R: *I find the idea that it is a sin to be born or live a life at all to be preposterous* and a claim C: *Christians have created a harmful atmosphere for gays*, the task is to pick the correct warrant W from two candidates: A. being gay isn't considered a sin **B. being gay is considered a sin**, where R and W together entail C. We make a positive or negative sample by concatenating the reason, candidate warrant and claim together (e.g., I find the idea that it is a sin to be born or live a life at all to be preposterous and since being gay is considered a sin, Christians have created a harmful atmosphere for gays). A test instance in ARCT contains a pair of positive and negative samples. More examples are shown in the ARCT section of Table 1. We further break this task into two variants: ARCT1 represents the original dataset, and ARCT2 represents an augmented dataset built by adding negations to the original instances to alleviate the statistical cues in the dataset (Niven and Kao 2019).

We integrate the above test sets into a commonsense ability test (CATs) benchmark, released for future research.

## Pre-trained Models

We take six contextualized representation models that give state-of-the-art performance on NLP benchmarks such as GLUE (Wang et al. 2018) and SQuAD (Rajpurkar, Jia, and Liang 2018). Off-the-shelf models are taken. Below we give the detailed settings.

### GPT

GPT (Radford and Sutskever 2018) is a uni-directional transformer LM trained on 800M tokens of BookCorpus (Zhu et al. 2015). Given a text sequence $\mathbf{x} = [x_1, \ldots, x_T]$, GPT works in a way similar to a conventional auto-regressive (AR) LM:

$$\max_\theta \log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})$$
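The sum in this objective can be computed token by token from the model's output logits. The sketch below makes the factorization explicit; it is purely illustrative and uses the GPT-2 checkpoint as a stand-in for GPT, since both are auto-regressive transformer LMs.

```python
# Sketch of the auto-regressive factorization above: the sequence
# log-likelihood is the sum of per-token conditionals log p(x_t | x_<t).
# Illustrative only; the "gpt2" checkpoint stands in for the GPT model.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def ar_log_likelihood(sentence: str) -> float:
    ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits            # shape: (1, T, vocab_size)
    # Logits at position t-1 parameterize p(x_t | x_<t), so shift by one.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:].unsqueeze(1)         # tokens x_2, ..., x_T
    token_log_probs = log_probs.gather(1, targets).squeeze(1)
    return token_log_probs.sum().item()       # sum_t log p(x_t | x_<t)
```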