# Neural Response Generation with Dynamic Vocabularies

Yu Wu, Wei Wu, Dejian Yang, Can Xu, Zhoujun Li
State Key Lab of Software Development Environment, Beihang University, Beijing, China
Microsoft Research, Beijing, China
Authors are supported by the Adept Mind Scholarship.
{wuyu,dejianyang,lizj}@buaa.edu.cn, {wuwei,can.xu}@microsoft.com

We study response generation for open domain conversation in chatbots. Existing methods assume that words in responses are generated from an identical vocabulary regardless of their inputs, which not only makes them vulnerable to generic patterns and irrelevant noise, but also causes a high cost in decoding. We propose a dynamic vocabulary sequence-to-sequence (DVS2S) model which allows each input to possess its own vocabulary in decoding. In training, vocabulary construction and response generation are jointly learned by maximizing a lower bound of the true objective with a Monte Carlo sampling method. In inference, the model dynamically allocates a small vocabulary for an input with the word prediction model, and conducts decoding only with the small vocabulary. Because of the dynamic vocabulary mechanism, DVS2S eludes many generic patterns and irrelevant words in generation, and enjoys efficient decoding at the same time. Experimental results on both automatic metrics and human annotations show that DVS2S can significantly outperform state-of-the-art methods in terms of response quality, while requiring only 60% of the decoding time of the most efficient baseline.

## Introduction

Together with the rapid growth of social conversation data on the Internet, there has been a surge of interest in building chatbots for open domain conversation with data-driven approaches. Existing methods are either retrieval based (Yan, Song, and Wu 2016; Wu et al. 2017) or generation based (Vinyals and Le 2015; Ritter, Cherry, and Dolan 2011; Shang, Lu, and Li 2015). Recently, generation-based approaches have become popular in both academia and industry, and a common practice is to learn a response generation model within an encoder-decoder framework (a.k.a., a sequence-to-sequence model) from large scale conversation data. The mainstream implementation of the encoder-decoder framework uses neural networks, because they are powerful at capturing complicated semantic and syntactic relations between messages and responses and are end-to-end learnable. On top of this architecture, various models have been proposed to tackle the notorious safe reply problem (Xing et al. 2016; Li et al. 2015); to take conversation history into consideration (Sordoni et al. 2015; Serban et al. 2016; 2017); and to bias responses toward some specific persona or emotions (Li et al. 2016a; Zhou et al. 2017).

Although existing work has made great progress on generating proper responses, it all assumes a static vocabulary in decoding, that is, the same large set of words is used to generate responses regardless of inputs. The assumption, however, is a simplification of the real scenario, as proper responses to a specific input (either a message or a conversation context) may only relate to a small specific set of words, and the sets of words can differ from input to input.
As a result, the assumption may cause problems in practice: (1) words that are semantically far from the current conversation also take part in decoding. These words may bias the process of generation and increase the probability of irrelevant and generic responses when some of them appear very frequently in the entire data set; (2) the decoding process becomes unnecessarily slow, because one has to estimate a probability distribution over the entire static vocabulary when decoding each word of a response. More seriously, to suppress irrelevant and generic responses, state-of-the-art methods have to either complicate their decoders (Xing et al. 2016; Mou et al. 2016) or append a heavy post-processing procedure after decoding (Li et al. 2015), which further deteriorates efficiency. These problems are widespread in existing methods, but have not drawn enough attention yet.

In this paper, we aim to achieve high quality response generation and fast decoding at the same time. Our idea is to dynamically allocate a vocabulary for each input at the decoding stage. The vocabulary is small, as it only covers words that are useful for forming relevant and informative responses to the input and filters most irrelevant words out. Because response decoding for each input only focuses on its own relevant words, the process can be conducted efficiently without loss of response quality.

We formulate the idea as a dynamic vocabulary sequence-to-sequence (DVS2S) model. The model defines a dynamic vocabulary in decoding through a multivariate Bernoulli distribution (Dai et al. 2013) over the entire vocabulary and factorizes the generation probability as the product of a vocabulary generation probability conditioned on the input and a response generation probability conditioned on both the input and the vocabulary. DVS2S follows the encoder-decoder framework. In encoding, an input is transformed into a sequence of hidden vectors. In decoding, the model first estimates the multivariate Bernoulli distribution using the hidden vectors given by the encoder, and then selects words to form a vocabulary for the decoder according to the distribution. Responses are generated using only the selected words. Vocabulary construction and response generation are jointly learned from training data, and thus, in parameter learning, errors in response prediction can be backpropagated to vocabulary formation and used to calibrate word selection. In training, as target vocabularies can only be partially observed from data, we treat them as a latent variable and optimize a lower bound of the true objective through a Monte Carlo sampling method.

We conduct an empirical study using the data in (Xing et al. 2016), and compare DVS2S with state-of-the-art generation methods using extensive automatic evaluation metrics and human judgment. In terms of automatic evaluation, DVS2S achieves a 3% gain on BLEU-1 and a 5% gain on Embedding Average (Liu et al. 2016) over the best performing baseline. In human evaluation, DVS2S significantly outperforms the baseline methods, which is consistent with the automatic evaluation results. Moreover, the model also achieves a 6% gain on the metric of distinct-1 over the best baseline model, indicating that it can generate more informative and diverse responses.
On top of the significant improvement in response quality, DVS2S saves 40% of the decoding time compared to the most efficient baseline in the same running environment.

Our contributions in this paper are threefold: (1) proposal of changing the static vocabulary mechanism to a dynamic vocabulary mechanism in response generation for chatbots; (2) proposal of a dynamic vocabulary sequence-to-sequence model and derivation of a learning approach that can jointly optimize word selection and response generation; (3) empirical verification of the effectiveness and efficiency of the proposed model on large scale conversation data.

## Related Work

Recent years have witnessed remarkable success in open domain response generation for chatbots. In a single-turn scenario, Ritter et al. (Ritter, Cherry, and Dolan 2011) formulated response generation as a machine translation problem by regarding messages and responses as a source language and a target language respectively. Owing to their success on machine translation, sequence-to-sequence (S2S) models (Bahdanau, Cho, and Bengio 2014) have been widely used in response generation recently. For instance, Vinyals et al. (Vinyals and Le 2015) and Shang et al. (Shang, Lu, and Li 2015) applied S2S with attention to this task. To address the generic response issue of the standard S2S model, Li et al. (Li et al. 2015) presented a maximum mutual information objective function, and Mou et al. (Mou et al. 2016) and Xing et al. (Xing et al. 2016) incorporated external knowledge into the S2S model. Shao et al. (Shao et al. 2017) proposed a target attention neural conversation model to generate long and diverse responses. Reinforcement learning (Li et al. 2016b) and adversarial learning (Li et al. 2017; Xu et al. 2017) techniques have also been exploited to enhance the existing models. In a multi-turn scenario, Sordoni et al. (Sordoni et al. 2015) compressed context information into a vector and injected the vector into response generation. Serban et al. (Serban et al. 2016) adopted a hierarchical recurrent structure to model multi-turn conversations. As an extension of the model, latent variables were introduced to model the one-to-many relation in conversation (Serban et al. 2017; Zhao, Zhao, and Eskenazi 2017).

In this work, we focus on an important but less explored problem: vocabulary selection in decoding. We propose changing the widely used static vocabulary decoder in both single-turn and multi-turn generation to a dynamic vocabulary decoder, and derive an approach to jointly learn vocabulary construction and response generation from data. The proposed method can improve response quality and at the same time speed up the decoding process. Before us, some work in machine translation has already exploited dynamic vocabularies (L'Hostis, Grangier, and Auli 2016; Jean et al. 2015; Mi, Wang, and Ittycheriah 2016). These works often treat vocabulary construction and translation as two separate steps. The same practice, however, cannot be easily transplanted to conversation, as there are no clear one-to-one translation relations in responding. To maintain response quality while improving efficiency in conversation, we propose joint learning of vocabulary construction and response generation so that the two can supervise each other. As far as we know, we are the first to explore the application of dynamic vocabularies in response generation for open domain conversation.

## Problem Formalization

Suppose that we have a data set $D = \{(X_i, Y_i)\}_{i=1}^{N}$, where $Y_i$ is a response to an input $X_i$.
Here $X_i$ can be either a message or a message with several previous turns as a context. As a first step, we assume $X_i$ is a message in this work, and leave verifying the same technique on context-based response generation to future work. $\forall i$, $X_i$ corresponds to a target vocabulary (i.e., the vocabulary in decoding) $T_i = (t_{i,1}, \ldots, t_{i,|V|})$ sampled from a multivariate Bernoulli distribution $(\beta_{i,1}, \ldots, \beta_{i,|V|})$, where $|V|$ is the size of the entire vocabulary $V$ and $t_{i,j} \in \{0, 1\}$, $1 \le j \le |V|$. $t_{i,j} = 1$ means that the $j$-th word $w_j$ in $V$ is selected for generating responses for $X_i$; otherwise the word will not be used in generation. $\beta_{i,j} = p(t_{i,j} = 1)$ is the probability of the $j$-th word being selected, which is parameterized by a function $f(X_i)$. The generation probability of $Y_i$ given $X_i$ is formulated as $p(Y_i|X_i) = p(Y_i|T_i, X_i)\, p(T_i|X_i)$. Our goal is to learn a word selection model $f(X)$ (corresponding to $p(T|X)$) and a response generation model $g(X, T)$ (corresponding to $p(Y|T, X)$) by maximizing the log-likelihood $\sum_{i=1}^{N} \log[p(Y_i|X_i)]$ of $D$. Thus, given a new message $X'$, we can estimate its target vocabulary $T'$ with $f(X')$ and generate a response $Y'$ using $g(X', T')$.

In the following sections, we first introduce our DVS2S model (i.e., $g(X, T)$) by assuming that $T$ is given. Then we present how to sample $T$ with the use of $f(X)$. Finally, we show how to jointly learn $f(X)$ and $g(X, T)$ from $D$.

## Dynamic Vocabulary Sequence-to-Sequence Model

Figure 1 illustrates the architecture of our dynamic vocabulary sequence-to-sequence (DVS2S) model. DVS2S is built in an encoder-decoder framework (Sutskever, Vinyals, and Le 2014) with an attention mechanism (Bahdanau, Cho, and Bengio 2014). For each input, it equips the decoder with a specific vocabulary that consists of useful words sampled from the entire vocabulary according to a distribution, and performs response generation with that vocabulary.

[Figure 1: Architecture of DVS2S. We translate a Chinese example into English and show the generation process of the word "lab". Words in blue squares refer to those chosen by the dynamic vocabulary; otherwise they are in black squares.]

Specifically, given a message $X = (x_1, x_2, \ldots, x_t)$ where $x_i$ is the embedding of the $i$-th word, the encoder exploits a bidirectional recurrent neural network with gated recurrent units (biGRU) (Chung et al. 2014) to transform $X$ into hidden vectors $h = (h_1, h_2, \ldots, h_t)$. A biGRU comprises a forward GRU that reads a sentence in its order and a backward GRU that reads the sentence in reverse order. The forward GRU encodes the sentence into hidden vectors $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_t)$ by

$$
\begin{aligned}
z_i &= \sigma(W_z x_i + U_z \overrightarrow{h}_{i-1}),\\
r_i &= \sigma(W_r x_i + U_r \overrightarrow{h}_{i-1}),\\
\tilde{h}_i &= \tanh\big(W_h x_i + U_h (r_i \odot \overrightarrow{h}_{i-1})\big),\\
\overrightarrow{h}_i &= z_i \odot \tilde{h}_i + (1 - z_i) \odot \overrightarrow{h}_{i-1},
\end{aligned}
\tag{1}
$$

where $z_i$ and $r_i$ are an update gate and a reset gate respectively, $\overrightarrow{h}_0 = 0$, and $W_z$, $W_h$, $W_r$, $U_z$, $U_r$, $U_h$ are parameters. The backward hidden state $\overleftarrow{h}_i$ is obtained similarly. Then $\forall i \in [1, t]$, $h_i$ is the concatenation of $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$.
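The encoder is a standard biGRU. As a minimal sketch of how Equation (1) can be realized, assuming PyTorch; the module name `MessageEncoder` and the dimensions are illustrative, not from the paper:

```python
# A minimal sketch of the biGRU encoder in Equation (1), assuming PyTorch.
# The module name MessageEncoder and the dimensions are illustrative.
import torch
import torch.nn as nn

class MessageEncoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional GRU: forward and backward states are concatenated,
        # so each h_i has size 2 * hid_dim, matching the concatenation of
        # the forward and backward hidden vectors.
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, message_ids: torch.Tensor):
        # message_ids: (batch, message_len) word indices of the input message X
        x = self.embedding(message_ids)   # (batch, len, emb_dim) word embeddings x_i
        h, _ = self.gru(x)                # (batch, len, 2 * hid_dim) hidden vectors h_i
        h_t = h[:, -1, :]                 # last hidden state, used later in Equation (8)
        return h, h_t

# Toy usage: two messages of length 7 over a 30k-word vocabulary.
encoder = MessageEncoder(vocab_size=30000)
ids = torch.randint(0, 30000, (2, 7))
h, h_t = encoder(ids)
```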
The decoder takes $h = (h_1, h_2, \ldots, h_t)$ as input and generates a response with a language model equipped with an attention mechanism. When generating the $i$-th word $y_i$, the decoder estimates a word distribution $\hat{y}_i$ by

$$
\hat{y}_i = l(y_{i-1}, c_i, h'_i, T),
\tag{2}
$$

where $c_i$ is a context vector formed by the attention mechanism, $h'_i$ is the $i$-th hidden state of the decoder, and $y_{i-1}$ is the $(i-1)$-th word of the response. Specifically, the decoder also exploits a GRU to encode $y_{i-1}$ into $h'_i$, whose initial state is the last hidden vector of the encoder. $c_i$ is a linear combination of $\{h_1, \ldots, h_t\}$, which is formulated as

$$
c_i = \sum_{j=1}^{t} \alpha_{i,j} h_j,
\tag{3}
$$

where $\alpha_{i,j}$ is given by

$$
\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{t} \exp(e_{i,k})},
\tag{4}
$$

$$
e_{i,j} = v^{\top} \tanh(W_\alpha [h_j; h'_i]).
\tag{5}
$$

$W_\alpha$ and $v$ are parameters, and $[\,\cdot\,;\,\cdot\,]$ means concatenation of the two arguments. $l(y_{i-1}, c_i, h'_i, T)$ is a $|T|$-dimensional probability distribution, where $|T| = \sum_{k=1}^{|V|} t_k$. $\forall t_k \in T$, if $t_k = 1$, then the corresponding element in $l(y_{i-1}, c_i, h'_i, T)$ is defined by

$$
p(y_i = w_k) = \frac{\exp(s(w_k))}{\sum_{t_j \in T,\, t_j = 1} \exp(s(w_j))},
\tag{6}
$$

where $s(w_k)$ is given by

$$
s(w_k) = W_{w_k}[y_{i-1}; h'_{i-1}; c_i] + b_{w_k}, \quad \forall t_k \in T.
\tag{7}
$$

$W_{w_k}$ and $b_{w_k}$ are parameters. Equations (6) and (7) are called the projection operation.

The time complexity of decoding with DVS2S is $O(len_r \cdot m \cdot p + len_r \cdot len_m \cdot m^2 + len_r \cdot (m+p) \cdot |T| + m \cdot |V|)$ (GRU + attention + projection + vocabulary construction), while the time complexity of decoding with the existing methods is at least $O(len_r \cdot m \cdot p + len_r \cdot len_m \cdot m^2 + len_r \cdot (m+p) \cdot |V|)$ (GRU + attention + projection), where $len_r$ is the length of the generated response, $len_m$ is the length of the message, $m$ is the hidden state size of the decoder, and $p$ is the embedding size of target words. In practice, $|V|$ is much larger than the other factors, so the cost of decoding in existing methods is dominated by $len_r \cdot (m+p) \cdot |V|$ (i.e., the time complexity of projection). DVS2S reduces it to $len_r \cdot (m+p) \cdot |T|$ in Equations (6) and (7). Since $|T|$ is much smaller than $|V|$ and $len_r$ is usually much larger than 1, $len_r \cdot (m+p) \cdot |T| + m \cdot |V|$ is much smaller than $len_r \cdot (m+p) \cdot |V|$. Therefore, DVS2S enjoys a faster decoding process than the existing methods (a conclusion that is also verified in our experiments).

## Dynamic Vocabulary Construction

In this section, we elaborate on dynamic vocabulary construction for $X$. We define $\bar{T} = \{w_k \in V \mid t_k \in T,\ t_k = 1\}$ and $I(w)$ the index of word $w$ in $V$. $\bar{T}$ is equivalent to $T$. Remember that $T$ is a variable sampled from a multivariate Bernoulli distribution, which is a joint distribution of $|V|$ independent Bernoulli distributions. Each Bernoulli distribution depicts the probability of a word $w$ from $V$ being selected into $\bar{T}$ and is parameterized by $\beta_{I(w)}$. We make this assumption because there does not exist a clear one-to-one relationship between words in a message and words in its proper responses, and we have to treat $T$ as a latent variable in training, as useful words for forming a proper response to a message can only be partially observed in training data.

$\bar{T} = \bar{T}_c \cup \bar{T}_f$, where $\bar{T}_c$ refers to content words and $\bar{T}_f$ refers to function words. Function words guarantee grammatical correctness and fluency of responses. Therefore, there should not be a large variance in $\bar{T}_f$ over different messages. We collect words appearing more than 10 times in the training data, exclude nouns, verbs, adjectives, and adverbs from them, and use the remaining ones to form a function word set $V_f$ of $V$. $\forall w \in V_f$, we define $\beta_{I(w)} = 1$. Thus, $\bar{T}_f = V_f$ regardless of inputs. In other words, all function words are always sampled in the construction of $\bar{T}$. Content words, on the other hand, express the semantics of responses, and thus should be highly related to the input message. Let $V_c = V \setminus V_f$ be the full content word set; then $\forall c \in V_c$, we parameterize $\beta_{I(c)}$ as

$$
\beta_{I(c)} = \sigma(W_c h_t + b_c),
\tag{8}
$$

where $\sigma$ is a sigmoid function, $h_t$ is the last hidden state of the encoder, and $W_c$ and $b_c$ are parameters. In the construction of $\bar{T}$, $\bar{T}_c$ is sampled from $V_c$ based on $\{\beta_{I(c)} \mid c \in V_c\}$.
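A minimal sketch of this word selection step, assuming PyTorch. Packing the per-word parameters of Equation (8) into a single linear layer and the precomputed function-word index set `func_idx` are illustrative assumptions, not details given in the paper:

```python
# A minimal sketch of dynamic vocabulary construction (Equation (8)), assuming PyTorch.
import torch
import torch.nn as nn

class DynamicVocab(nn.Module):
    def __init__(self, vocab_size: int, enc_dim: int, func_idx: torch.Tensor):
        super().__init__()
        self.selector = nn.Linear(enc_dim, vocab_size)  # rows play the role of W_c, b_c
        is_func = torch.zeros(vocab_size, dtype=torch.bool)
        is_func[func_idx] = True                        # function words: beta = 1
        self.register_buffer("is_func", is_func)

    def forward(self, h_t: torch.Tensor):
        # h_t: (batch, enc_dim) last hidden state of the encoder
        beta = torch.sigmoid(self.selector(h_t))        # selection probabilities beta_I(w)
        beta = beta.masked_fill(self.is_func, 1.0)      # always keep function words
        mask = torch.bernoulli(beta)                    # sample T from the multivariate Bernoulli
        return beta, mask                               # mask[i, j] = 1  ->  word j is in T_i

# Toy usage (enc_dim matches 2 * hid_dim of the encoder sketch; the first
# 500 word ids stand in for the function word set V_f).
vocab = DynamicVocab(vocab_size=30000, enc_dim=512, func_idx=torch.arange(500))
h_t = torch.randn(2, 512)   # stands in for the encoder's last hidden state
beta, mask = vocab(h_t)
```

During decoding, the sampled mask would restrict the softmax in Equation (6) to the selected words.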
How to allocate a proper $\bar{T}$ to $X$ is key to the success of DVS2S. $\bar{T}$ should cover enough words to generate relevant, informative, and fluent responses for $X$, but cannot be too large, for the sake of cost control in decoding. To make sure that we can sample such a $\bar{T}$ with high probability, we jointly learn vocabulary construction and response generation from training data, as described in the next section.

## Model Training

With a latent variable $T$, the objective of learning can be written as

$$
\sum_{i=1}^{N} \log\big(p(Y_i|X_i)\big) = \sum_{i=1}^{N} \log\Big(\sum_{T_i} p(Y_i|T_i, X_i)\, p(T_i|X_i)\Big).
\tag{9}
$$

Equation (9) is difficult to optimize, as the logarithm lies outside the summation. Hence, we instead maximize a variational lower bound of $\sum_{i=1}^{N} \log[p(Y_i|X_i)]$, which is given by

$$
\sum_{i=1}^{N} \sum_{T_i} p(T_i|X_i) \log p(Y_i|T_i, X_i)
= \sum_{i=1}^{N} \sum_{T_i} \Big[\prod_{j=1}^{|V|} p(t_{i,j}|X_i)\Big] \sum_{l=1}^{|Y_i|} \log p(y_{i,l} \mid y_{i,<l}, T_i, X_i),
\tag{10}
$$

where $y_{i,l}$ is the $l$-th word of $Y_i$ and $y_{i,<l}$ denotes the words preceding it.
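Since summing over all $2^{|V_c|}$ possible vocabularies in Equation (10) is intractable, the lower bound is estimated with Monte Carlo samples of $T$. The sketch below shows one common way to realize such an estimate with a score-function (REINFORCE-style) surrogate loss, assuming PyTorch; this particular estimator and the helper `log_p_response` are assumptions for illustration, not the paper's exact derivation:

```python
# Sketch of one training step using a Monte Carlo estimate of the lower bound
# in Equation (10). The score-function surrogate is a standard way to pass
# gradients through the discrete sample T; it is an illustrative assumption.
import torch

def training_step(beta: torch.Tensor, log_p_response) -> torch.Tensor:
    """
    beta:           (batch, |V|) word-selection probabilities from Equation (8)
    log_p_response: callable mask -> (batch,) values of log p(Y | T, X) under the decoder
    """
    mask = torch.bernoulli(beta).detach()               # sample T ~ p(T | X)
    # log p(T | X) under the multivariate Bernoulli
    log_p_T = (mask * torch.log(beta + 1e-10)
               + (1.0 - mask) * torch.log(1.0 - beta + 1e-10)).sum(dim=1)
    log_p_Y = log_p_response(mask)                      # log p(Y | T, X)
    # The gradient of this surrogate matches the sampled estimate of the lower
    # bound's gradient w.r.t. both decoder and word-selection parameters.
    surrogate = log_p_Y + log_p_Y.detach() * log_p_T
    return -surrogate.mean()

# Toy usage with random selection probabilities and a stand-in decoder likelihood.
beta = torch.sigmoid(torch.randn(2, 30000, requires_grad=True))
loss = training_step(beta, lambda m: -m.sum(dim=1) / 30000.0)
loss.backward()
```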