# Local Explanation of Dialogue Response Generation

Yi-Lin Tuan1, Connor Pryor2, Wenhu Chen1, Lise Getoor2, William Yang Wang1
1 University of California, Santa Barbara
2 University of California, Santa Cruz
{ytuan, wenhuchen, william}@cs.ucsb.edu, {cfpryor, getoor}@ucsc.edu

In comparison to the interpretation of classification models, the explanation of sequence generation models is an equally important problem, yet it has seen little attention. In this work, we study model-agnostic explanations of a representative text generation task: dialogue response generation. Dialogue response generation is challenging because of its open-ended sentences and multiple acceptable responses. To gain insight into the reasoning process of a generation model, we propose a new method, local explanation of response generation (LERG), that regards explanations as the mutual interaction of segments in the input and output sentences. LERG views sequence prediction as uncertainty estimation of a human response and then creates explanations by perturbing the input and calculating the certainty change over the human response. We show that LERG adheres to desired properties of explanation for text generation, including unbiased approximation, consistency, and cause identification. Empirically, our results show that our method consistently improves over other widely used methods on the proposed automatic and human evaluation metrics for this new task by 4.4-12.8%. Our analysis demonstrates that LERG can extract both explicit and implicit relations between input and output segments.[1]

[1] Our code is available at https://github.com/Pascalson/LERG.

## 1 Introduction

As we use machine learning models in daily tasks, such as medical diagnostics [6, 19] and speech assistants [31], being able to trust the predictions being made has become increasingly important. To understand the underlying reasoning process of complex machine learning models, a sub-field of explainable artificial intelligence (XAI) [2, 17, 36] called local explanation has shown promising results [35]. Local explanation methods [27, 39] often approximate an underlying black-box model by fitting an interpretable proxy, such as a linear model or a tree, around the neighborhood of individual predictions. These methods have the advantage of being model-agnostic and locally interpretable.

Traditionally, off-the-shelf local explanation frameworks, such as the Shapley value in game theory [38] and the learning-based Local Interpretable Model-agnostic Explanation (LIME) [35], have been shown to work well on classification tasks with a small number of classes. In particular, there has been work on image classification [35], sentiment analysis [8], and evidence selection for question answering [32]. However, to the best of our knowledge, there has been little work studying explanations over models with sequential output and a large class size at each time step. An attempt by [1] aims at explaining machine translation by aligning the sentences in the source and target languages. Nonetheless, unlike translation, where it is possible to find almost all word alignments between the input and output sentences, many text generation tasks are not alignment-based. We further explore explanations over sequences that contain implicit and indirect relations between the input and output utterances.
In this paper, we study explanations over a set of representative conditional text generation models: dialogue response generation models [45, 55]. These models typically aim to produce an engaging and informative [3, 24] response to an input message. The open-ended sentences and multiple acceptable responses in dialogues pose two major challenges: (1) an exponentially large output space and (2) the implicit relations between the input and output texts. For example, the open-ended prompt "How are you today?" could lead to multiple responses depending on the user's emotion, situation, social skills, expressions, etc. A simple answer such as "Good. Thank you for asking." does not have an explicit alignment to words in the input prompt. Even though this alignment does not exist, it is clear that "good" is the key response to "how are you". To find such crucial corresponding parts in a dialogue, we propose to extract explanations that can answer the question: which parts of the response are influenced the most by parts of the prompt?

To obtain such explanations, we introduce LERG, a novel yet simple method that extracts sorted importance scores for every input-output segment pair from a dialogue response generation model. We view the sequence prediction as the uncertainty estimation of one human response and find a linear proxy that simulates the certainty that one input segment contributes to an output segment. We further derive two optimization variations of LERG: one is learning-based [35] and the other is a derived optimum similar to the Shapley value [38]. To theoretically verify LERG, we propose that an ideal explanation of text generation should adhere to three properties: unbiased approximation, intra-response consistency, and cause identification. To the best of our knowledge, our work is the first to explore explanation of dialogue response generation while maintaining all three properties.

To verify that the explanations are both faithful (the explanation fully depends on the model being explained) [2] and interpretable (the explanation is understandable by humans) [14], we conduct comprehensive automatic evaluations and a user study. We evaluate the necessity and sufficiency of the extracted explanation to the generation model by measuring the perplexity change when salient input segments are removed (necessity) and the perplexity when only salient segments remain (sufficiency). In our user study, we present annotators with only the most salient parts of an input and ask them to select the most appropriate response from a set of candidates. Empirically, our proposed method consistently outperforms baselines on both automatic metrics and human evaluation. Our key contributions are:

- We propose a novel local explanation method for dialogue response generation (LERG).
- We propose a unified formulation that generalizes local explanation methods towards sequence generation and show that our method adheres to the desired properties for explaining conditional text generation.
- We build a systematic framework to evaluate explanations of response generation, including automatic metrics and a user study.

## 2 Local Explanation

Local explanation methods aim to explain predictions of an arbitrary model by interpreting the neighborhood of individual predictions [35]. They can be viewed as training a proxy that adds up the contributions of input features to a model's prediction [27].
More formally, given an example with input features $x = \{x_i\}_{i=1}^{M}$ and the corresponding prediction $y$ with probability $f(x) = P_\theta(Y = y|x)$ (the classifier is parameterized by $\theta$), we denote the contribution of each input feature $x_i$ as $\phi_i \in \mathbb{R}$ and the concatenation of all contributions as $\phi = [\phi_1, ..., \phi_M]^T \in \mathbb{R}^M$. Two popular local explanation methods are the learning-based Local Interpretable Model-agnostic Explanations (LIME) [35] and the game-theoretic Shapley value [38].

LIME interprets a complex classifier $f$ by locally approximating a linear classifier around a given prediction $f(x)$. The explanation model that LIME uses adheres to:

$$\xi(x) = \arg\min_{\varphi} \left[ \mathcal{L}(f, \varphi, \pi_x) + \Omega(\varphi) \right], \quad (1)$$

where we sample a perturbed input $\tilde{x}$ from $\pi_x(\tilde{x}) = \exp(-D(x, \tilde{x})^2 / \sigma^2)$, taking $D(x, \tilde{x})$ as a distance function and $\sigma$ as the width. $\Omega$ is the model complexity of the proxy $\varphi$. The objective of $\xi(x)$ is to find the simplest $\varphi$ that can approximate the behavior of $f$ around $x$. When using a linear classifier $\phi$ as the $\varphi$ that minimizes $\Omega(\varphi)$ [35], we can formulate the objective function as:

$$\phi = \arg\min_{\phi} \; \mathbb{E}_{\tilde{x} \sim \pi_x} \left( P_\theta(Y = y | \tilde{x}) - \phi^T \tilde{z} \right)^2, \quad (2)$$

where $\tilde{z} \in \{0, 1\}^M$ is a simplified feature vector of $\tilde{x}$ obtained by a mapping function $h$ such that $\tilde{z} = h(x, \tilde{x}) = \{\mathbb{1}(x_i \in \tilde{x})\}_{i=1}^{M}$. The optimization minimizes the classification error in the neighborhood of $x$ sampled from $\pi_x$. Therefore, using LIME, we can find an interpretable linear model that approximates any complex classifier's behavior around an example $x$.

The Shapley value takes the input features $x = \{x_i\}_{i=1}^{M}$ as $M$ independent players who cooperate to achieve a benefit in a game [38]. The Shapley value computes how much each player $x_i$ contributes to the total received benefit:

$$\varphi_i = \sum_{\tilde{x} \subseteq x \setminus \{x_i\}} \frac{|\tilde{x}|! \, (|x| - |\tilde{x}| - 1)!}{|x|!} \left[ P_\theta(Y = y | \tilde{x} \cup \{x_i\}) - P_\theta(Y = y | \tilde{x}) \right]. \quad (3)$$

To reduce the computational cost, instead of computing all combinations, we can find surrogates $\phi_i$ proportional to $\varphi_i$ and rewrite the above equation as an expectation over $\tilde{x}$ sampled from $P(\tilde{x})$:

$$\phi_i = \frac{|x|}{|x| - 1} \varphi_i = \mathbb{E}_{\tilde{x} \sim P(\tilde{x})} \left[ P_\theta(Y = y | \tilde{x} \cup \{x_i\}) - P_\theta(Y = y | \tilde{x}) \right], \; \forall i, \quad (4)$$

where $P(\tilde{x}) = \frac{1}{(|x| - 1) \binom{|x| - 1}{|\tilde{x}|}}$ is the perturb function. We can also transform the above formulation into an argmin:

$$\phi_i = \arg\min_{\phi_i} \; \mathbb{E}_{\tilde{x} \sim P(\tilde{x})} \left( \left[ P_\theta(Y = y | \tilde{x} \cup \{x_i\}) - P_\theta(Y = y | \tilde{x}) \right] - \phi_i \right)^2. \quad (5)$$

[Figure 1: The motivation of local explanation for dialogue response generation: (a) controllable dialogue models, (b) explanation of a classifier, (c) our concept; (c) = (a) + (b).]

## 3 Local Explanation for Dialogue Response Generation

We aim to explain a model's response prediction given a dialogue history, one example at a time, and call this the local explanation of dialogue response generation. We focus on local explanation to obtain a more fine-grained understanding of the model's behavior.

### 3.1 Task Definition

As depicted in Figure 1, we draw inspiration from the notions of controllable dialogue generation models (Figure 1a) and local explanation in sentiment analysis (Figure 1b). The former uses a concept from predefined classes as the relation between the input text and the response; the latter finds the features that correspond to positive or negative sentiment. We propose to find parts within the input and output texts that are related by an underlying intent (Figure 1c).

We first define the notation for dialogue response generation, which aims to predict a response $y = y_1 y_2 ... y_N$ given an input message $x = x_1 x_2 ... x_M$, where $x_i$ is the $i$-th token of sentence $x$ with length $M$ and $y_j$ is the $j$-th token of sentence $y$ with length $N$.
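To make the sampled-perturbation recipe of Eqs. (4)-(5) concrete in this notation, the sketch below estimates a contribution score for each input token $x_i$ and output position $j$ by perturbing the input message and measuring the change in the model's probability of the response token. This is only an illustrative Monte-Carlo sketch under simplifying assumptions, not the LERG estimator itself; the `model_prob` interface, the subset-sampling scheme, and the per-position gain are hypothetical choices introduced here for clarity.

```python
import random
from typing import Callable, List

# Hypothetical interface: model_prob(x_tokens, y_tokens, j) returns the model's
# probability P_theta(y_j | x_tokens, y_<j). Any autoregressive dialogue model
# that exposes token-level probabilities could implement it.
ModelProb = Callable[[List[str], List[str], int], float]

def perturbation_attribution(model_prob: ModelProb,
                             x: List[str],
                             y: List[str],
                             num_samples: int = 64,
                             seed: int = 0) -> List[List[float]]:
    """Monte-Carlo estimate in the spirit of Eq. (4): for each input token x_i
    and output position j, average the gain
    P(y_j | x_tilde + {x_i}, y_<j) - P(y_j | x_tilde, y_<j)
    over random subsets x_tilde of the remaining input tokens."""
    rng = random.Random(seed)
    M, N = len(x), len(y)
    phi = [[0.0] * N for _ in range(M)]
    for i in range(M):
        others = [k for k in range(M) if k != i]
        for _ in range(num_samples):
            # Sample a subset size uniformly, then a subset of that size,
            # loosely mimicking the two-stage perturb function P(x_tilde).
            size = rng.randrange(len(others) + 1)
            subset = sorted(rng.sample(others, size))
            x_tilde = [x[k] for k in subset]
            x_tilde_plus = [x[k] for k in sorted(subset + [i])]
            for j in range(N):
                gain = model_prob(x_tilde_plus, y, j) - model_prob(x_tilde, y, j)
                phi[i][j] += gain / num_samples
    return phi  # phi[i][j] approximates the contribution of x_i to y_j
```

In practice, `model_prob` could wrap the token-level probabilities $P_\theta(y_j \mid \tilde{x}, y_{<j})$ of a fine-tuned GPT-2 or DialoGPT model, the kind of generator examined in Section 4.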
To solve this task, a typical sequence-to-sequence model $f$ parameterized by $\theta$ produces a sequence of probability masses [45]. The probability of $y$ given $x$ can then be computed as the product over the sequence:

$$P_\theta(y|x) = P_\theta(y_1|x) \, P_\theta(y_2|x, y_1) \cdots P_\theta(y_N|x, y_{<N}).$$

Property 3: cause identification. To ensure that the explanation model sorts different input features by their importance to the results, if

$$g(y_j \mid \tilde{x} \cup \{x_i\}) > g(y_j \mid \tilde{x} \cup \{x_{i'}\}), \quad \forall \tilde{x} \subseteq x \setminus \{x_i, x_{i'}\}, \quad (16)$$

then $\Phi_{ij} > \Phi_{i'j}$, where $\Phi_{ij}$ denotes the estimated contribution of input segment $x_i$ to output segment $y_j$.

We prove that our proposed method adheres to all three properties in Appendix B. Meanwhile, the Shapley value follows Properties 2 and 3, while LIME follows Property 3 when an optimized solution exists. These properties also demonstrate that our method approximates the text generation process while sorting out the important segments in both the input and output texts, which is why it can serve as an explanation of any sequential generative model.

## 4 Experiments

Explanation is notoriously hard to evaluate, even for digit and sentiment classification, which are generally more intuitive than explaining response generation. For digit classification (MNIST), explanations often mark the key curves that identify the digit; for sentiment analysis, explanations often mark the positive and negative words in the text. Unlike these settings, we focus on identifying the key parts in both input messages and their responses, so an explanation must include the interactions of the input and output features. To evaluate the defined explanations, we quantify the necessity and sufficiency of explanations with respect to a model's uncertainty over a response. We evaluate these aspects by answering the following questions:

- necessity: How is the model influenced after removing the explanations?
- sufficiency: How does the model perform when only the explanations are given?

Furthermore, we conduct a user study to judge human understanding of the explanations and to gauge how trustworthy the dialogue agents are.

[Figure 2: The explanation results of a GPT model fine-tuned on DailyDialog.]
[Figure 3: The explanation results of fine-tuned DialoGPT.]

### 4.1 Dataset, Models, Methods

We evaluate our method on chit-chat dialogues because of their more complex and realistic conversations. We specifically select and study a popular conversational dataset, DailyDialog [25], because its dialogues are based on daily topics and contain few uninformative responses. Due to the large variation of topics, the open-ended nature of the conversations, and the informative responses within this dataset, explaining dialogue response generation models trained on DailyDialog is challenging but accessible. We fine-tune a GPT-based language model [33, 47] and DialoGPT [55] on DailyDialog by minimizing the following loss function:

$$\mathcal{L}(\theta) = -\sum_j \log P_\theta(y_j \mid x, y_{<j}).$$
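As a minimal sketch of this objective, the snippet below computes $-\sum_j \log P_\theta(y_j \mid x, y_{<j})$ for a single (history, response) pair with a HuggingFace causal language model, masking history tokens out of the loss so that only response tokens contribute. The choice of the `gpt2` checkpoint, the EOS-token separator, and the rescaling from a mean to a sum are illustrative assumptions, not the authors' exact training setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: any GPT-style causal LM; the paper fine-tunes GPT and DialoGPT
# on DailyDialog, but the tokenization details here are illustrative choices.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def response_nll(history: str, response: str) -> torch.Tensor:
    """Return -sum_j log P_theta(y_j | x, y_<j) for one (history, response) pair.
    History tokens get label -100 so they are ignored by the loss."""
    x_ids = tokenizer.encode(history + tokenizer.eos_token)
    y_ids = tokenizer.encode(response + tokenizer.eos_token)
    input_ids = torch.tensor([x_ids + y_ids])
    labels = torch.tensor([[-100] * len(x_ids) + y_ids])
    out = model(input_ids=input_ids, labels=labels)
    # HuggingFace returns the mean NLL over unmasked tokens; rescale to a sum.
    return out.loss * len(y_ids)

loss = response_nll("How are you today?", "Good. Thank you for asking.")
loss.backward()  # a gradient step of fine-tuning would follow
```

Averaging this quantity over a training batch and taking optimizer steps corresponds to the fine-tuning described above, and exponentiating the per-token mean yields the kind of perplexity measure used in the necessity and sufficiency evaluations.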