Published as a conference paper at ICLR 2020

ABDUCTIVE COMMONSENSE REASONING

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, Yejin Choi
Allen Institute for AI, Seattle, WA, USA; Facebook AI, Seattle, WA, USA; Paul G. Allen School of Computer Science & Engineering, WA, USA
{chandrab,ronanlb,chaitanyam,keisukes}@allenai.org
{arih,hannahr,dougd}@allenai.org
{yejin}@cs.washington.edu
{scottyih}@fb.com

ABSTRACT

Abductive reasoning is inference to the most plausible explanation. For example, if Jenny finds her house in a mess when she returns from work, and remembers that she left a window open, she can hypothesize that a thief broke into her house and caused the mess, as the most plausible explanation. While abduction has long been considered to be at the core of how people interpret and read between the lines in natural language (Hobbs et al., 1988), there has been relatively little research in support of abductive natural language inference and generation. We present the first study that investigates the viability of language-based abductive reasoning. We introduce a challenge dataset, ART, that consists of over 20k commonsense narrative contexts and 200k explanations. Based on this dataset, we conceptualize two new tasks: (i) Abductive NLI, a multiple-choice question answering task for choosing the more likely explanation, and (ii) Abductive NLG, a conditional generation task for explaining given observations in natural language. On Abductive NLI, the best model achieves 68.9% accuracy, well below human performance of 91.4%. On Abductive NLG, the current best language generators struggle even more, as they lack reasoning capabilities that are trivial for humans. Our analysis leads to new insights into the types of reasoning that deep pre-trained language models fail to perform despite their strong performance on the related but more narrowly defined task of entailment NLI, pointing to interesting avenues for future research.

1 INTRODUCTION

"The brain is an abduction machine, continuously trying to prove abductively that the observables in its environment constitute a coherent situation."
Jerry Hobbs, ACL 2013 Lifetime Achievement Award1

Abductive reasoning is inference to the most plausible explanation for incomplete observations (Peirce, 1965a). Figure 1 illustrates an example. Given the incomplete observations about the world that O1: "Jenny cleaned her house and went to work, leaving the window just a crack open." and sometime later O2: "When Jenny returned home, she saw her house was a mess.", we can hypothesize different potential explanations and reason about which is the most likely. We can readily rule out H3 since it fails to justify the observation O2. While H1 and H2 are both plausible, the most likely explanation based on commonsense is H1, as H2 is somewhat implausible given O1.

One crucial observation Peirce makes about abductive reasoning is that abduction is "the only logical operation which introduces any new ideas", which contrasts with other types of inference such as entailment, which focuses on inferring only such information that is already provided in the premise.

* Work done while at AI2.
1 The full transcript of his award speech is available at https://www.mitpressjournals.org/doi/full/10.1162/COLI_a_00171
[Figure 1 contents: O1: "Jenny cleaned her house and went to work, leaving the window just a crack open." O2: "When Jenny returned home she saw that her house was a mess!" Candidate hypotheses: H1: "A thief broke into the house by pulling open the window." The thief got into the house through the window and rifled through Jenny's things, which made a mess; Jenny left an insecure opening to her house, so this is likely to follow O1. H2: "It was a breezy day and a large bird flew into the house." The bird got stuck inside the house, flew around while trying to escape, and made a mess; somewhat unlikely, since if the window was just a crack open, a large bird is unlikely to get in. H3: "At work, she opened her window and the wind blew her papers everywhere." Fails to justify O2: although wind caused a mess, the event happened at Jenny's workplace.]

Figure 1: Example of Abductive Reasoning. Given observations O1 and O2, the αNLI task is to select the most plausible explanatory hypothesis. Since the number of hypotheses is massive in any given situation, we make a simplifying assumption in our ART dataset to only choose between a pair of explanations.

Abductive reasoning has long been considered to be at the core of understanding narratives (Hobbs et al., 1988), reading between the lines (Norvig, 1987; Charniak & Shimony, 1990), reasoning about everyday situations (Peirce, 1965b; Andersen, 1973), and counterfactual reasoning (Pearl, 2002; Pearl & Mackenzie, 2018). Despite the broad recognition of its importance, however, the study of abductive reasoning in narrative text has very rarely appeared in the NLP literature, in large part because most previous work on abductive reasoning has focused on formal logic, which has proven too rigid to generalize to the full complexity of natural language.

In this paper, we present the first study to investigate the viability of language-based abductive reasoning. This shift from logic-based to language-based reasoning draws inspiration from a significant body of work on language-based entailment (Bowman et al., 2015; Williams et al., 2018b), language-based logic (Lakoff, 1970; MacCartney & Manning, 2007), and language-based commonsense reasoning (Mostafazadeh et al., 2016; Zellers et al., 2018). In particular, we investigate the use of natural language as the representation medium, and probe deep neural models on language-based abductive reasoning.

More concretely, we propose Abductive Natural Language Inference (αNLI) and Abductive Natural Language Generation (αNLG) as two novel reasoning tasks in narrative contexts.2 We formulate αNLI as a multiple-choice task to support easy and reliable automatic evaluation: given a context, the task is to choose the more likely explanation from a given pair of hypothesis choices. We also introduce a new challenge dataset, ART, that consists of 20K narratives accompanied by over 200K explanatory hypotheses.3,4 We then establish comprehensive baseline performance based on state-of-the-art NLI and language models. The best baseline for αNLI, based on BERT, achieves 68.9% accuracy, with a considerable gap compared to human performance of 91.4% (§5.2). The best generative model, based on GPT2, performs well below human performance on the αNLG task (§5.2). Our analysis leads to insights into the types of reasoning that deep pre-trained language models fail to perform despite their strong performance on the closely related but different task of entailment NLI, pointing to future research directions.

2 αNLI and αNLG are pronounced as alpha-NLI and alpha-NLG, respectively.
3 ART: Abductive Reasoning in narrative Text.
4 Data available to download at http://abductivecommonsense.xyz
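Before the formal task definition in the next section, the following is a minimal sketch of how a single two-choice αNLI instance (the Figure 1 example, restricted to hypotheses H1 and H3) might be represented; the field names are illustrative assumptions and not the actual ART release schema.

```python
# A hypothetical representation of one αNLI instance, built from the Figure 1
# example; field names are illustrative, not the actual ART schema.
instance = {
    "obs1": "Jenny cleaned her house and went to work, leaving the window just a crack open.",
    "obs2": "When Jenny returned home, she saw her house was a mess.",
    "hyp1": "A thief broke into the house by pulling open the window.",
    "hyp2": "At work, she opened her window and the wind blew her papers everywhere.",
    "label": 1,  # 1-indexed choice of the more plausible hypothesis (here: the thief, hyp1)
}
```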
2 TASK DEFINITION

Abductive Natural Language Inference: We formulate αNLI as a multiple-choice problem consisting of a pair of observations as context and a pair of hypothesis choices. Each instance in ART is defined as follows:

O1: The observation at time t1.
O2: The observation at time t2 > t1.
h+: A plausible hypothesis that explains the two observations O1 and O2.
h-: An implausible (or less plausible) hypothesis for observations O1 and O2.

Given the observations and a pair of hypotheses, the αNLI task is to select the most plausible explanation (hypothesis).

Abductive Natural Language Generation: αNLG is the task of generating a valid hypothesis h+ given the two observations O1 and O2. Formally, the task requires maximizing P(h+ | O1, O2).

3 MODELS FOR ABDUCTIVE COMMONSENSE REASONING

3.1 ABDUCTIVE NATURAL LANGUAGE INFERENCE

A Probabilistic Framework for αNLI: A distinct feature of the αNLI task is that it requires jointly considering all available observations and their commonsense implications to identify the correct hypothesis. Formally, the αNLI task is to select the hypothesis h* that is most probable given the observations:

    h^* = \arg\max_{h^i} P(H = h^i \mid O_1, O_2)    (1)

Rewriting the objective using Bayes Rule conditioned on O1, we have:

    P(h^i \mid O_1, O_2) \propto P(O_2 \mid h^i, O_1)\, P(h^i \mid O_1)    (2)

We formulate a set of probabilistic models for αNLI that make various independence assumptions on Equation 2, starting from a simple baseline that ignores the observations entirely and building up to a fully joint model. These models are depicted as Bayesian Networks in Figure 2.

[Figure 2 panels: a) Hypothesis-Only, b) First Observation Only, c) Second Observation Only, d) Linear Chain, e) Fully Connected.]

Figure 2: Illustration of the graphical models described in the probabilistic framework. The Fully Connected model can, in theory, combine information from both available observations.

Hypothesis Only: Our simplest model makes the strong assumption that the hypothesis is entirely independent of both observations, i.e. (H ⊥ O1, O2), in which case we simply aim to maximize the marginal P(H).

First (or Second) Observation Only: Our next two models make weaker assumptions: that the hypothesis depends on only one of the observations, either the first O1 or the second O2.

Linear Chain: Our next model uses both observations, but considers each observation's influence on the hypothesis independently, i.e. it does not combine information across the observations. Formally, the model assumes that the three variables O1, H, O2 form a linear Markov chain, where the second observation is conditionally independent of the first, given the hypothesis (i.e. (O1 ⊥ O2 | H)). Under this assumption, we aim to maximize a somewhat simpler objective than Equation 2:

    h^* = \arg\max_{h^i} P(O_2 \mid h^i)\, P(h^i \mid O_1)  where  O_1 \perp O_2 \mid H    (3)

Fully Connected: Finally, our most sophisticated model jointly models all three random variables as in Equation 2, and can in principle combine information across both observations to choose the correct hypothesis.
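To make the factorizations concrete, the following is a minimal sketch, not the paper's actual baselines (which are discriminative classifiers built on pre-trained encoders such as BERT), of how each independence assumption turns into a hypothesis-scoring rule when the conditional probabilities are estimated with an off-the-shelf GPT-2 language model. The model choice, the prompt concatenation, and the function names are assumptions made for illustration.

```python
# Sketch: scoring a pair of hypotheses under the graphical models of Figure 2,
# with conditionals estimated by an off-the-shelf GPT-2 LM (an assumption made
# for illustration; the paper's baselines are discriminative, not generative).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(text: str, prefix: str = "") -> float:
    """Sum of log P(token | preceding tokens) over the tokens of `text`, given `prefix`."""
    prefix_ids = tokenizer.encode(prefix) if prefix else []
    text_ids = tokenizer.encode(" " + text)
    input_ids = torch.tensor([prefix_ids + text_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    total = 0.0
    for pos, tok in enumerate(text_ids, start=len(prefix_ids)):
        if pos == 0:
            continue  # the very first token has no left context to condition on
        total += log_probs[0, pos - 1, tok].item()
    return total

def score(o1: str, h: str, o2: str, variant: str) -> float:
    if variant == "hypothesis_only":          # maximize P(H)
        return log_prob(h)
    if variant == "first_observation_only":   # maximize P(H | O1)
        return log_prob(h, prefix=o1)
    if variant == "second_observation_only":  # maximize P(H | O2) (illustrative ordering)
        return log_prob(h, prefix=o2)
    if variant == "linear_chain":             # maximize P(O2 | H) * P(H | O1), as in Eq. 3
        return log_prob(o2, prefix=h) + log_prob(h, prefix=o1)
    # "fully_connected": maximize P(O2 | H, O1) * P(H | O1), as in Eq. 2
    return log_prob(o2, prefix=o1 + " " + h) + log_prob(h, prefix=o1)

def choose(o1: str, o2: str, h1: str, h2: str, variant: str = "fully_connected") -> str:
    return h1 if score(o1, h1, o2, variant) >= score(o1, h2, o2, variant) else h2
```

The `choose` helper simply returns whichever hypothesis receives the higher score under the selected factorization, mirroring the argmax in Equations 1 and 3.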
[Figure 3 diagram: word embeddings for the observation tokens are combined with COMeT commonsense embeddings to form commonsense-aware contextualized word representations, which are fed to the Transformer.]

Figure 3: Overview of an αNLG model that integrates commonsense representations obtained from COMeT (Bosselut et al., 2019) with GPT2. Each observation is input to the COMeT model to obtain nine embeddings, each associated with one commonsense inference type.

To help illustrate the subtle distinction between how the Linear Chain and Fully Connected models consider both observations, consider the following example. Let observation O1: "Carl went to the store desperately searching for flour tortillas for a recipe." and O2: "Carl left the store very frustrated." Then consider two distinct hypotheses, an incorrect h1: "The cashier was rude" and the correct h2: "The store had corn tortillas, but not flour ones." For this example, a Linear Chain model could arrive at the wrong answer, because it reasons about the observations separately: taking O1 in isolation, both h1 and h2 seem plausible next events, albeit each a priori unlikely. And for O2 in isolation, i.e. in the absence of O1, as for a randomly drawn shopper, the h1 explanation of a rude cashier seems a much more plausible explanation of Carl's frustration than are the details of the store's tortilla selection. Combining these two separate factors leads the Linear Chain to select h1 as the more plausible explanation. It is only by reasoning about Carl's goal in O1 jointly with his frustration in O2, as in the Fully Connected model, that we arrive at the correct answer h2 as the more plausible explanation.

In our experiments, we encode the different independence assumptions in the best performing neural network model. For the hypothesis-only and single-observation models, we can enforce the independencies by simply restricting the inputs of the model to only the relevant variables. On the other hand, the Linear Chain model takes all three variables as input, but we restrict the form of the model to enforce the conditional independence. Specifically, we learn a discriminative classifier:

    P_{\text{Linear Chain}}(h \mid O_1, O_2) \propto e^{\phi(O_1, h) + \phi'(h, O_2)}

where \phi and \phi' are neural networks that produce scalar values.

3.2 ABDUCTIVE NATURAL LANGUAGE GENERATION

Given h^+ = \{w^h_1, \ldots, w^h_l\}, O_1 = \{w^{o_1}_1, \ldots, w^{o_1}_m\} and O_2 = \{w^{o_2}_1, \ldots, w^{o_2}_n\} as sequences of tokens, the αNLG task can be modeled as

    P(h^+ \mid O_1, O_2) = \prod_i P(w^h_i \mid w^h_{<i}, O_1, O_2)
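As a concrete companion to this factorization, here is a minimal sketch of conditional hypothesis generation with a plain GPT-2 decoder. The prompt format and decoding settings are illustrative assumptions, and the sketch does not include the COMeT commonsense embeddings that the full model in Figure 3 injects.

```python
# Sketch: sampling h+ from P(h+ | O1, O2), factorized token by token, using an
# off-the-shelf GPT-2 model. The prompt format below is an assumption for
# illustration, not the paper's input encoding, and COMeT embeddings are omitted.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def generate_hypothesis(o1: str, o2: str, max_new_tokens: int = 40) -> str:
    prompt = f"Observation 1: {o1} Observation 2: {o2} Explanation:"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_length=input_ids.shape[1] + max_new_tokens,
            do_sample=True,
            top_p=0.9,                           # nucleus sampling over P(w_i | w_<i, O1, O2)
            pad_token_id=tokenizer.eos_token_id,
        )
    # Return only the newly generated continuation, i.e. the hypothesis h+.
    return tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)

print(generate_hypothesis(
    "Jenny cleaned her house and went to work, leaving the window just a crack open.",
    "When Jenny returned home, she saw her house was a mess.",
))
```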