# Coherent Dialogue with Attention-Based Language Models

Hongyuan Mei, Johns Hopkins University (hmei@cs.jhu.edu); Mohit Bansal, UNC Chapel Hill (mbansal@cs.unc.edu); Matthew R. Walter, TTI-Chicago (mwalter@ttic.edu)

We model coherent conversation continuation via RNN-based dialogue models equipped with a dynamic attention mechanism. Our attention-RNN language model dynamically increases the scope of attention on the history as the conversation continues, as opposed to standard attention (or alignment) models with a fixed input scope in a sequence-to-sequence model. This allows each generated word to be associated with the most relevant words in its corresponding conversation history. We evaluate the model on two popular dialogue datasets, the open-domain Movie Triples dataset and the closed-domain Ubuntu Troubleshoot dataset, and achieve significant improvements over the state-of-the-art and baselines on several metrics, including complementary diversity-based metrics, human evaluation, and qualitative visualizations. We also show that a vanilla RNN with dynamic attention outperforms more complex memory models (e.g., LSTM and GRU) by allowing for flexible, long-distance memory. We promote further coherence via topic modeling-based reranking.

## Introduction

Automatic conversational models (Winograd 1971), also known as dialogue systems, are of great importance to a large variety of applications, ranging from open-domain entertaining chatbots to goal-oriented technical support agents. An increasing amount of research has recently been done to build purely data-driven dialogue systems that learn from large corpora of human-to-human conversations, without using hand-crafted rules or templates. While most work in this area formulates dialogue modeling in a sequence-to-sequence framework (similar to machine translation) (Ritter, Cherry, and Dolan 2011; Shang, Lu, and Li 2015; Vinyals and Le 2015; Sordoni et al. 2015; Li et al. 2016a; Dušek and Jurčíček 2016), some more recent work (Serban et al. 2016; Luan, Ji, and Ostendorf 2016) instead trains a language model over the entire dialogue as one single sequence. In our work, we empirically demonstrate that a language model is better suited to dialogue modeling, as it learns how the conversation evolves as information progresses. Sequence-to-sequence models, on the other hand, learn only how the most recent dialogue response is generated. Such models are better suited to converting the same information from one modality to another, e.g., in machine translation and image captioning.

We improve the coherence of such neural dialogue language models by developing a generative dynamic attention mechanism that allows each generated word to choose which related words it wants to align to in the increasing conversation history (including the previous words in the response being generated). Neural attention (or alignment) has proven very successful for various sequence-to-sequence tasks by associating salient items in the source sequence with the generated item in the target sequence (Mnih et al. 2014; Bahdanau, Cho, and Bengio 2015; Xu et al. 2015; Mei, Bansal, and Walter 2016a; Parikh et al. 2016). However, such attention models are limited to a fixed scope of history, corresponding to the input source sequence.
In contrast, we introduce a dynamic attention mechanism to a recurrent neural network (RNN) language model in which the scope of attention increases as the recurrence operation progresses from the start through the end of the conversation. The dynamic attention model promotes coherence of the generated dialogue responses (continuations) by favoring the generation of words that have syntactic or semantic associations with salient words in the conversation history. Our simple model shows significant improvements over state-of-the-art models and baselines on several metrics (including complementary diversity-based metrics, human evaluation, and qualitative visualizations) for the open-domain Movie Triples and closed-domain Ubuntu Troubleshoot datasets. Our vanilla RNN model with dynamic attention outperforms more complex memory models (e.g., LSTM and GRU) by allowing for long-distance and flexible memory. We also present several visualizations to intuitively understand what the attention model is learning. Finally, we also explore a complementary LDA-based method to re-rank the outputs of the soft alignment-based coherence method, further improving performance on the evaluation benchmarks.[1]

[1] arXiv version with appendices: https://arxiv.org/abs/1611.06997

Figure 1: Comparing RNN language models to RNN sequence-to-sequence models, with and without attention: (a) RNN seq2seq (encoder-decoder) model; (b) RNN language model; (c) attention seq2seq (encoder-decoder) model; (d) attention language model.

## Related Work

A great deal of attention has been paid to developing data-driven methods for natural language dialogue generation. Conventional statistical approaches tend to rely extensively on hand-crafted rules and templates, require interaction with humans or simulated users to optimize parameters, or produce conversation responses in an information retrieval fashion. Such properties prevent training on the large human conversational corpora that are becoming increasingly available, or fail to produce novel natural language responses. Ritter, Cherry, and Dolan (2011) formulate dialogue response generation as a statistical phrase-based machine translation problem, which requires no explicit hand-crafted rules.

The recent success of RNNs in statistical machine translation (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2015) has inspired the application of such models to the field of dialogue modeling. Vinyals and Le (2015) and Shang, Lu, and Li (2015) employ an RNN to generate responses in human-to-human conversations by treating the conversation history as one single temporally ordered sequence. In such models, the distant relevant context in the history is difficult to recall. Some efforts have been made to overcome this limitation. Sordoni et al. (2015) separately encode the most recent message and all the previous context using a bag-of-words representation, which is decoded using an RNN. This approach equates the distance of each word in the generated output to all the words in the conversation history, but loses the temporal information of the history. Serban et al. (2016) design a hierarchical model that stacks an utterance-level RNN on a token-level RNN, where the utterance-level RNN reduces the number of computational steps between utterances.
Wen et al. (2015) and Wen et al. (2016) improve spoken dialog systems via multi-domain and semantically conditioned neural networks on dialog act representations and explicit slot-value formulations.

Our work explores the ability of recurrent neural network language models (Bengio et al. 2003; Mikolov 2010) to interpret and generate natural language conversations while still maintaining a relatively simple architecture. We show that a language model approach outperforms the sequence-to-sequence model at dialogue modeling. Recently, Tran, Bisazza, and Monz (2016) demonstrated that the neural attention mechanism can improve the effectiveness of a neural language model. We propose an attention-based neural language model for dialogue modeling that learns how a conversation evolves as a whole, rather than only how the most recent response is generated, and that also reduces the number of computations between the current recurrence step and the distant relevant context in the conversation history. The attention mechanism in our model has the additional benefit of favoring words that have a semantic association with salient words in the conversation history, which promotes the coherence of the topics in the continued dialogue. This is important when conversation participants inherently want to maintain the topic of the discussion. Some past studies have equated coherence with propositional consistency (Goldberg 1983), while others see it as a summary impression (Sanders 1983). Our work falls in the category of viewing coherence as topic continuity (Crow 1983; Sigman 1983).

Similar objectives, i.e., generating dialogue responses with certain properties, have been addressed recently, such as promoting response diversity (Li et al. 2016a), enhancing personal consistency (Li et al. 2016b), and improving specificity (Yao et al. 2016). Concurrent with this work, Luan, Ji, and Ostendorf (2016) improve topic consistency by feeding the learned LDA-based topic representations into the model. We show that the simple attention neural language model significantly outperforms such a design. Furthermore, we suggest an LDA-based re-ranker complementary to soft neural attention that further promotes topic coherence.

## The Model

### RNN Seq2Seq and Language Models

Recurrent neural networks have been successfully used both in sequence-to-sequence models (RNN-Seq2Seq, Fig. 1a) (Sutskever, Vinyals, and Le 2014) and in language models (RNN-LM, Fig. 1b) (Bengio et al. 2003; Mikolov 2010). We first discuss language models for dialogue, which is the primary focus of our work, then briefly introduce the sequence-to-sequence model, and lastly discuss the use of attention methods in both models.

The RNN-LM models a sentence as a sequence of tokens $\{w_0, w_1, \ldots, w_T\}$ with a recurrence function

$$h_t = f(h_{t-1}, w_{t-1}) \tag{1}$$

and an output (softmax) function

$$P(w_t = v_j \mid w_{0:t-1}) = \frac{\exp g(h_t, v_j)}{\sum_{i} \exp g(h_t, v_i)}, \tag{2}$$

where the recurrent hidden state $h_t \in \mathbb{R}^d$ encodes all the tokens up to $t-1$ and is used to compute the probability of generating $v_j \in V$ as the next token from the vocabulary $V$. The functions $f$ and $g$ are typically defined as

$$f(h_{t-1}, w_{t-1}) = \tanh\bigl(H h_{t-1} + P E_{w_{t-1}}\bigr) \tag{3a}$$

$$g(h_t, v_j) = O_{v_j}^{\top} h_t, \tag{3b}$$

where $H \in \mathbb{R}^{d \times d}$ is the recurrence matrix, $E_{w_{t-1}}$ is the column of the word embedding matrix $E \in \mathbb{R}^{d_e \times |V|}$ that corresponds to $w_{t-1}$, $P \in \mathbb{R}^{d \times d_e}$ projects the word embedding into the space of the same dimension $d$ as the hidden units, and $O \in \mathbb{R}^{d \times |V|}$ is the output word embedding matrix with column vector $O_{v_j}$ corresponding to $v_j$.
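To make Eqns. (1)-(3) concrete, the following is a minimal NumPy sketch of a single forward step of the vanilla RNN-LM. The dimensions, random parameter values, and token ids are illustrative placeholders, not the settings or trained weights of the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, d_e = 1000, 64, 32                    # vocabulary, hidden, and embedding sizes (illustrative)

H = rng.normal(scale=0.1, size=(d, d))      # recurrence matrix
E = rng.normal(scale=0.1, size=(d_e, V))    # input word embeddings (one column per word)
P = rng.normal(scale=0.1, size=(d, d_e))    # embedding-to-hidden projection
O = rng.normal(scale=0.1, size=(d, V))      # output word embeddings (one column per word)

def rnn_lm_step(h_prev, w_prev):
    """One recurrence step (Eqn. 1/3a) followed by the softmax output (Eqn. 2/3b)."""
    h_t = np.tanh(H @ h_prev + P @ E[:, w_prev])   # h_t = tanh(H h_{t-1} + P E_{w_{t-1}})
    scores = O.T @ h_t                             # g(h_t, v_j) = O_{v_j}^T h_t for every v_j
    probs = np.exp(scores - scores.max())          # softmax, shifted for numerical stability
    probs /= probs.sum()
    return h_t, probs

h = np.zeros(d)
for w in [3, 17, 42]:                              # toy token ids
    h, p_next = rnn_lm_step(h, w)
print(p_next.shape, p_next.sum())                  # (1000,) 1.0
```

In practice the parameters would of course be learned by back-propagating the log-likelihood objective given next (Eqn. 4), rather than drawn at random.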
We train the RNN-LM, i.e., estimate the parameters $H$, $P$, $E$, and $O$, by maximizing the log-likelihood on a set of natural language training sentences of size $N$,

$$\sum_{n=1}^{N} \sum_{t=0}^{T_n} \log P\bigl(w_t^{(n)} \mid w_{0:t-1}^{(n)}\bigr). \tag{4}$$

Since the entire architecture is differentiable, the objective can be optimized by back-propagation.

When dialogue is formulated as a sequence-to-sequence task, the RNN-Seq2Seq model can be used in order to predict a target sequence $w^T_{0:L} = \{w^T_0, w^T_1, \ldots, w^T_L\}$ given an input source sequence $w^S_{0:M} = \{w^S_0, w^S_1, \ldots, w^S_M\}$. In such settings, an encoder RNN represents the input as a sequence of hidden states $h^S_{0:M} = \{h^S_0, h^S_1, \ldots, h^S_M\}$, and a separate decoder RNN then predicts the target sequence token-by-token given the encoder hidden states $h^S_{0:M}$.

### Attention in RNN-Seq2Seq Models

There are several ways by which to integrate the sequence of hidden states $h^S_{0:M}$ in the decoder RNN. An attention mechanism (Fig. 1c) has proven to be particularly effective for various related tasks in machine translation, image caption synthesis, and language understanding (Mnih et al. 2014; Bahdanau, Cho, and Bengio 2015; Xu et al. 2015; Mei, Bansal, and Walter 2016a). The attention module takes as input the encoder hidden state sequence $h^S_{0:M}$ and the decoder hidden state $h^T_{l-1}$ at each step $l-1$, and returns a context vector $z_l$ computed as a weighted average of the encoder hidden states $h^S_{0:M}$:

$$\beta_{lm} = b^{\top} \tanh\bigl(W h^T_{l-1} + U h^S_m\bigr) \tag{5a}$$

$$\alpha_{lm} = \exp(\beta_{lm}) \Big/ \sum_{m'=0}^{M} \exp(\beta_{lm'}) \tag{5b}$$

$$z_l = \sum_{m=0}^{M} \alpha_{lm} h^S_m, \tag{5c}$$

where the parameters $W \in \mathbb{R}^{d \times d}$, $U \in \mathbb{R}^{d \times d}$, and $b \in \mathbb{R}^{d}$ are jointly learned with the other model parameters. The context vector $z_l$ is then used as an extra input to the decoder RNN at step $l$, together with $w^T_{0:l-1}$, to predict the next token $w^T_l$.

### Attention in RNN-LM

We develop an attention-RNN language model (A-RNN-LM), as illustrated in Figure 1d, and describe how it can be used in the context of dialogue modeling. We then describe its advantages compared to the use of attention in sequence-to-sequence models.

As with the RNN-LM, the model first encodes the input into a sequence of hidden states up to word $t-1$ (Eqn. 1). Given a representation $\{r_0, r_1, \ldots, r_{t-1}\}$ of the tokens up to $t-1$ (which we define shortly), the attention module computes the context vector $z_t$ at step $t$ as a weighted average of $r_{0:t-1}$:

$$\beta_{ti} = b^{\top} \tanh\bigl(W h_{t-1} + U r_i\bigr) \tag{6a}$$

$$\alpha_{ti} = \exp(\beta_{ti}) \Big/ \sum_{i'=0}^{t-1} \exp(\beta_{ti'}) \tag{6b}$$

$$z_t = \sum_{i=0}^{t-1} \alpha_{ti} r_i \tag{6c}$$

We then use the context vector $z_t$ together with the hidden state $h_t$ to predict the output at time $t$:

$$g(h_t, z_t, v_j) = O_{v_j}^{\top} \bigl(O_h h_t + O_z z_t\bigr) \tag{7a}$$

$$P(w_t = v_j \mid w_{0:t-1}) = \frac{\exp g(h_t, z_t, v_j)}{\sum_i \exp g(h_t, z_t, v_i)}, \tag{7b}$$

where $O_h \in \mathbb{R}^{d \times d}$ and $O_z \in \mathbb{R}^{d \times d_z}$ project $h_t$ and $z_t$, respectively, into the same space of dimension $d$.
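The following is a minimal NumPy sketch of the dynamic attention step in Eqns. (6)-(7): it scores every token representation seen so far against the previous hidden state, normalizes the scores into attention weights, and mixes the representations into a context vector used by the output softmax. All shapes and parameter values are illustrative assumptions rather than the trained A-RNN-LM.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_r, V = 64, 96, 1000                    # hidden size, representation size, vocab size (illustrative)

W  = rng.normal(scale=0.1, size=(d, d))     # applied to h_{t-1} (Eqn. 6a)
U  = rng.normal(scale=0.1, size=(d, d_r))   # applied to each r_i (Eqn. 6a)
b  = rng.normal(scale=0.1, size=d)
O  = rng.normal(scale=0.1, size=(d, V))     # output word embeddings
Oh = rng.normal(scale=0.1, size=(d, d))     # projects h_t (Eqn. 7a)
Oz = rng.normal(scale=0.1, size=(d, d_r))   # projects z_t (Eqn. 7a)

def attention_step(h_prev, h_t, R):
    """R is the (t x d_r) matrix stacking the token representations r_0 .. r_{t-1}."""
    beta = np.tanh(h_prev @ W.T + R @ U.T) @ b   # Eqn. 6a: one score per history token
    alpha = np.exp(beta - beta.max())
    alpha /= alpha.sum()                         # Eqn. 6b: attention weights over the history
    z_t = alpha @ R                              # Eqn. 6c: context vector
    scores = O.T @ (Oh @ h_t + Oz @ z_t)         # Eqn. 7a
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                         # Eqn. 7b
    return alpha, probs

# Toy history of 5 token representations; the scope of R grows with every generated token.
R = rng.normal(size=(5, d_r))
alpha, probs = attention_step(rng.normal(size=d), rng.normal(size=d), R)
print(alpha.round(2), probs.shape)
```

Note that, unlike the seq2seq attention of Eqn. (5), the number of rows of `R` keeps growing as the conversation continues, which is the dynamic-scope property the model relies on.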
There are multiple benefits of using an attention-RNN language model for dialogue, which are empirically supported by our experimental results. First, a complete dialogue is usually composed of multiple turns. A language model over the entire dialogue is expected to better learn how a conversation evolves as a whole, unlike a sequence-to-sequence model, which only learns how the most recent response is generated and is better suited to translation-style tasks that transform the same information from one modality to another. Second, compared to LSTM models, an attention-based RNN-LM also allows for gapped context and a flexible combination of the conversation history for every individual generated token, while maintaining low model complexity. Third, attention models yield interpretable results: we visualize the learned attention weights, showing how attention chooses the salient words from the dialogue history that are important for generating each new word. Such a visualization is typically harder for the hidden states and gates of conventional LSTM and RNN language models.

With an attention mechanism, there are multiple options for defining the token representations $r_{0:t-1}$. The original attention model introduced by Bahdanau, Cho, and Bengio (2015) uses the hidden units $h_{0:t-1}$ as the token representations $r_{0:t-1}$. Recent work (Mei, Bansal, and Walter 2016a; 2016b) has demonstrated that performance can be improved by using multiple abstractions of the input, e.g., $r_i = (E_{w_i}^{\top}, h_i^{\top})^{\top}$, which is what we use in this work.

### LDA-based Re-Ranking

While the trained attention-RNN dialogue model generates natural language continuations of a conversation that maintain topic concentration through token association, some dialogue-level topic supervision can help to encourage generations that are more topic-aware. Such supervision is not commonly available, so we use unsupervised methods to learn document-level latent topics, and employ the learned topic model to select the best continuation based on document-level topic matching.

We choose Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003; Blei and Lafferty 2009) due to its demonstrated ability to learn a distribution over latent topics given a collection of documents. This generative model assumes that documents $\{w_{0:T_n}\}_{n=1}^{N}$ arise from $K$ topics, each of which is defined as a distribution over a fixed vocabulary of terms, which forms a graphical structure $L$ that can be learned from the training data. The topic representation $\hat{\theta}$ of a (possibly unseen) dialogue $w_{0:T}$ can then be estimated with the learned topic structure $L$ as $\hat{\theta}(w_{0:T}) = L(w_{0:T})$. Given a set of generated continuations $\{c_m\}_{m=1}^{M}$ for each unseen dialogue $w_{0:T}$, the topic representations of the dialogue and its continuations are $\hat{\theta}(w_{0:T}) = L(w_{0:T})$ and $\hat{\theta}(c_m) = L(c_m)$, respectively. We employ a matching score $S_m = S\bigl(\hat{\theta}(w_{0:T}), \hat{\theta}(c_m)\bigr)$ to compute the similarity between $\hat{\theta}(w_{0:T})$ and each $\hat{\theta}(c_m)$. In the end, a weighted score is computed as $\tilde{S}_m = \lambda S_m + (1-\lambda)\,\ell(c_m \mid w_{0:T})$, where $\lambda \in [0, 1]$ and $\ell(c_m \mid w_{0:T})$ is the conditional log-likelihood of the continuation $c_m$. The hyper-parameters $K$ and $\lambda$ are tuned on a development set.

Concurrent with our work, Luan, Ji, and Ostendorf (2016) use the learned topic representation $\hat{\theta}$ of the given conversation as an extra feature in a language model to enhance the topic coherence of the generation. As we show in the Results section, our model significantly outperforms this approach.
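As a rough sketch of the re-ranking step above, the snippet below uses scikit-learn's `LatentDirichletAllocation` as a stand-in for the topic structure $L$ and cosine similarity as the matching score $S$. The toy corpus, candidate continuations, log-likelihood values, $K$, and $\lambda$ are all invented for illustration and are not the paper's actual data, topic model, or settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "training dialogues" standing in for the real corpus.
corpus = [
    "my video is choppy after the upgrade",
    "try updating the graphics driver",
    "the sound card stopped working on boot",
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)   # K = 2 topics (illustrative)

def topic_vec(text):
    """theta_hat(text): document-topic proportions under the learned model."""
    return lda.transform(vec.transform([text]))[0]

def rerank(history, candidates, loglik, lam=0.5):
    """Combine topic match and model log-likelihood: lambda * S_m + (1 - lambda) * l(c_m | history)."""
    theta_h = topic_vec(history)
    scores = []
    for c, ll in zip(candidates, loglik):
        theta_c = topic_vec(c)
        sim = theta_h @ theta_c / (np.linalg.norm(theta_h) * np.linalg.norm(theta_c))  # cosine S_m
        scores.append(lam * sim + (1 - lam) * ll)
    return candidates[int(np.argmax(scores))]

history = "i have really choppy streaming video any way to fix that"
candidates = ["what video card do you have", "i do not know"]
print(rerank(history, candidates, loglik=[-4.2, -3.9]))    # hypothetical log-likelihoods
```

In a real setup the log-likelihoods would come from the trained A-RNN and $\lambda$ would be tuned on the development set, as described above.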
## Experimental Setup

### Dataset

We train and evaluate the models on two large natural language dialogue datasets, Movie Triples (pre-processed by Serban et al. (2016)) and Ubuntu Troubleshoot (pre-processed by Luan, Ji, and Ostendorf (2016)). The dialogue within each of these datasets consists of a sequence of utterances (turns), each of which is a sequence of tokens (words).[2] The arXiv version's appendix provides the statistics for these two datasets.

[2] Following Luan, Ji, and Ostendorf (2016), we randomly sample nine utterances as negative examples of the last utterance for each conversation in Ubuntu Troubleshoot for the development set.

### Evaluation Metrics

For the sake of comparison, we closely follow previous work and adopt several standard (and complementary) evaluation metrics: perplexity (PPL), word error rate (WER), recall@N, BLEU, and the diversity-based Distinct-1. We provide further discussion of the various metrics and their advantages in the arXiv version's appendix.

On the Movie Triples dataset, we use PPL and WER, as is done in previous work. Following Serban et al. (2016), we adopt two versions of each metric: i) PPL, the word-level perplexity over the entire dialogue conversation; ii) PPL@L, the word-level perplexity over the last utterance of the conversation; iii) WER; and iv) WER@L (defined similarly). On the Ubuntu dataset, we follow previous work and use PPL and recall@N. Recall@N (Manning et al. 2008) evaluates a model by measuring how often the model ranks the correct dialogue continuation within the top N given 10 candidates. Additionally, we also employ the BLEU score (Papineni et al. 2001) to evaluate the quality of the generations produced by the models. Following Luan, Ji, and Ostendorf (2016), we perform model selection using PPL on the development set, and perform the evaluation on the test set using the other metrics. We also present an evaluation using the Distinct-1 metric (proposed by Li et al. (2016a)) to measure the ability of the A-RNN to promote diversity in the generations, because typical neural dialogue models generate generic, safe responses (technically appropriate but not informative, e.g., "i don't know"). Finally, we also present a preliminary human evaluation.
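As a small illustration of how recall@N can be computed, the sketch below assumes each evaluation example pairs the true continuation with nine sampled negatives and that some scoring function (e.g., the model's conditional log-likelihood) is available; the scorer and the example here are placeholders, not the paper's models or data.

```python
def recall_at_n(score_fn, examples, n):
    """examples: list of (history, true_continuation, [negative_continuations]).
    score_fn(history, continuation) returns a model score, e.g. a log-likelihood.
    Counts how often the true continuation is ranked within the top n of its 10 candidates."""
    hits = 0
    for history, true_cont, negatives in examples:
        candidates = [true_cont] + negatives
        ranked = sorted(candidates, key=lambda c: score_fn(history, c), reverse=True)
        if true_cont in ranked[:n]:
            hits += 1
    return hits / len(examples)

# Toy usage with a dummy scorer that prefers larger word overlap with the history.
def dummy_score(history, cont):
    return len(set(history.split()) & set(cont.split()))

examples = [("how do i fix my video driver", "update the driver", ["hello there"] * 9)]
print(recall_at_n(dummy_score, examples, n=1))   # 1.0 for this toy example
```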
### Training Details

For the Movie Triples dataset, we follow the same procedure as Serban et al. (2016) and first pretrain on the large Q-A SubTle dataset (Ameixa et al. 2014), which contains 5.5M question-answer pairs, from which we randomly sample 20,000 pairs as the held-out set, and then fine-tune on the target Movie Triples dataset. We perform early stopping according to the PPL score on the held-out set. We train the models for both the Movie Triples and Ubuntu Troubleshoot datasets using Adam (Kingma and Ba 2015) for optimization in RNN back-propagation. The arXiv version's appendix provides additional training details, including the hyperparameter settings.

## Results and Analysis

### Primary Dialogue Modeling Results

In this section, we compare the performance on several metrics of our attention-based RNN-LM with RNN baselines and state-of-the-art models on the two benchmark datasets.

Table 1: Results on the Movie Triples test set. The HRED results are from Serban et al. (2016).

| Model | PPL | PPL@L | WER | WER@L |
|---|---|---|---|---|
| RNN | 27.09 | 26.67 | 64.10% | 64.07% |
| HRED | 26.81 | 26.31 | 63.93% | 63.91% |
| A-RNN | 25.52 | 23.46 | 61.58% | 60.15% |

Table 2: Ubuntu Troubleshoot PPL and recall@N, with LSTM and LDA-CONV results from Luan et al. (2016).

| Model | PPL | recall@1 | recall@2 |
|---|---|---|---|
| RNN | 56.16 | 11% | 22% |
| LSTM | 54.93 | 12% | 22% |
| LDA-CONV | 51.13 | 13% | 24% |
| A-RNN | 45.38 | 17% | 30% |

Table 1 reports PPL and WER results on the Movie Triples test set, while Table 2 compares different models on Ubuntu Troubleshoot in terms of PPL on the development set and recall@N (N = 1 and 2) on the test set (following what previous work reports). In the tables, RNN is the plain vanilla RNN language model (RNN-LM), as defined in The Model section, and LSTM is an LSTM-RNN language model, i.e., an RNN-LM with LSTM memory units. A-RNN refers to our main model as defined in the Attention in RNN-LM section. HRED in Table 1 is the hierarchical neural dialogue model proposed by Serban et al. (2016).[3] LDA-CONV in Table 2 is the model proposed by Luan, Ji, and Ostendorf (2016), which integrates learned LDA topic proportions into an LSTM language model in order to promote topic concentration in the generations.

[3] We compare to their best-performing model version, which adopts a bidirectional gated-unit RNN (GRU).

Both tables demonstrate that the attention-RNN-LM (A-RNN) model achieves the best results reported to date on these datasets in terms of all evaluation metrics. It improves the ability of an RNN-LM to model continuous dialogue conversations, while keeping the model architecture simple.

We also evaluate the effectiveness of the RNN-LM and RNN-Seq2Seq models on both the Movie Triples and Ubuntu Troubleshoot development sets. As shown in Table 3, the RNN language model yields lower perplexity than the RNN sequence-to-sequence model on both datasets. Hence, we present all primary results on our primary A-RNN attention-based RNN language model.[4]

[4] Experiments also demonstrate significant improvements for the Attention-RNN-LM over the Attention-RNN-Seq2Seq.

Table 3: RNN-LM vs. RNN-Seq2Seq (PPL)

| Model | Movie Triples | Ubuntu Troubleshoot |
|---|---|---|
| RNN-Seq2Seq | 35.10 | 104.61 |
| RNN-LM | 27.81 | 56.16 |

### Generation Diversity Results

Next, we investigate the ability of the A-RNN to promote diversity in the generations, compared to that of the vanilla RNN, using the Distinct-1 metric proposed by Li et al. (2016a). Distinct-1 is computed as the number of distinct unigrams in the generation scaled by the total number of generated tokens. Table 5 shows that our attention-based RNN language model (A-RNN) yields much more diversity in its generations as compared to the vanilla RNN baseline.

Table 5: Generation Diversity Results (Distinct-1): A-RNN vs. RNN

| Model | Movie Triples | Ubuntu Troubleshoot |
|---|---|---|
| RNN | 0.0004 | 0.0007 |
| A-RNN | 0.0028 | 0.0104 |
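Distinct-1, as defined above, reduces to a few lines of code; a minimal sketch over tokenized generations follows (the example outputs are invented, not model generations):

```python
def distinct_1(generations):
    """Number of distinct unigrams across all generations, divided by the total token count."""
    tokens = [tok for gen in generations for tok in gen.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

print(distinct_1(["i don't know", "i don't know"]))                             # low diversity: 3/6 = 0.5
print(distinct_1(["what video card do you have", "try sudo apt-get update"]))   # higher diversity: 1.0
```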
### Topic Coherence Results

Next, we investigate the ability of the different models to promote topic coherence in the generations in terms of BLEU score. In addition to the RNN and A-RNN models, we consider T-A-RNN, a method that incorporates LDA-based topic information into an A-RNN model, following the approach of Luan, Ji, and Ostendorf (2016). We also evaluate our LDA-based re-ranker, A-RNN-RR, which re-ranks according to the score $\tilde{S}_m = \lambda S_m + (1-\lambda)\,\ell(c_m \mid w_{0:T})$, where we compute the log-likelihood $\ell(c_m \mid w_{0:T})$ based upon a trained A-RNN model and validate the weight $\lambda$ on the development set. We also consider a method that combines the T-A-RNN model with an LDA-based re-ranker (T-A-RNN-RR).[5] Table 4 reports the resulting BLEU scores for each of these methods on the development and test sets of the Ubuntu Troubleshoot dataset.

[5] Since Luan, Ji, and Ostendorf (2016) do not publish BLEU scores or implementations of their models, we cannot compare with LDA-CONV on BLEU. Instead, we demonstrate the effect of adding the key component of LDA-CONV on top of the A-RNN.

Table 4: BLEU score on Ubuntu Troubleshoot

| Model | dev BLEU | test BLEU |
|---|---|---|
| RNN | 0.1846 | 0.1692 |
| A-RNN | 0.2702 | 0.3713 |
| T-A-RNN | 0.2908 | 0.3128 |
| A-RNN-RR | 0.4696 | 0.4279 |
| T-A-RNN-RR | 0.4895 | 0.3971 |

We make the following observations based upon these results: (1) the A-RNN performs substantially better than the RNN with regards to BLEU; (2) using our LDA-based re-ranker further improves the performance by a significant amount (A-RNN vs. A-RNN-RR); (3) as opposed to our LDA-based re-ranker, adopting the LDA design of Luan, Ji, and Ostendorf (2016) yields only marginal improvements on the development set and does not generalize well to the test set (A-RNN vs. T-A-RNN and A-RNN-RR vs. T-A-RNN-RR). Also, our LDA re-ranker results in substantial improvements even on top of their topic-based model (T-A-RNN vs. T-A-RNN-RR).

### Preliminary Human Evaluation

In addition to multiple automatic metrics, we also report a preliminary human evaluation. On each dataset, we manually evaluate the generations of both the A-RNN and RNN models on 100 examples randomly sampled from the test set. For each example, we randomly shuffle the two response generations, anonymize the model identity, and ask a human annotator to choose which response generation is more topically coherent based on the conversation history. As Table 6 shows, the A-RNN model wins substantially more often than the RNN model.

Table 6: Human Evaluation: A-RNN vs. RNN

| | Movie Triples | Ubuntu Troubleshoot |
|---|---|---|
| Not distinguishable | 48% | 74% |
| RNN wins | 6% | 5% |
| A-RNN wins | 46% | 21% |

### Qualitative Analysis

Next, we qualitatively evaluate the effectiveness of our A-RNN model through visualizations of the attention and outputs on both datasets.

Figure 2: A visualization of attention on the (a) Movie Triples and (b) Ubuntu Troubleshoot datasets, showing which words in the conversation history are being aligned to for each generated response word. Shaded intervals indicate the strength with which the corresponding words in the conversation history and response are attended to when generating the bolded word in the response. We show this for two generated words in the same response (left and right columns).

Figure 2 provides a visualization of the attention for a subset of the words in the generation for the two datasets. The last line in both Figure 2a and Figure 2b presents the generated response, and we highlight in bold two output words (one on the left and one on the right) for two time steps. For each highlighted generated word, we visualize the attention weights for words in the conversation history (i.e., words in the preceding turns and those previously generated in the output response), where darker shades indicate larger attention weights. As the figure indicates, the attention mechanism helps learn a better RNN language model that promotes topic coherence, by learning to associate the currently generated word with informative context words in the conversation history.

Figure 3: Examples of responses generated by the A-RNN.

(a) Movie Triples
History: A: person , but this time i got the money . / R: if you give it to me , i might forget i found you .
A-RNN: A: i'll give it to you .
RNN: A: i don't know .
Reference: A: i don't have it with me . tell person .

(b) Ubuntu Troubleshoot
History: A: i have really choppy streaming video in 9.10, any way to fix that? / A: any way to fix choppy streaming video in 9.10 ? / R: what kind of video
A-RNN: A: what video card do you have ?
RNN: A: what are you trying to do ?
Reference: A: what cpu do you have

As shown in Figure 3a, the A-RNN generates meaningful and topically coherent responses on the Movie Triples dataset. In comparison, the vanilla RNN tends to produce generic answers, such as "i don't know". Similarly, the A-RNN follows up with useful questions on the Ubuntu Troubleshoot dataset (Fig. 3b).
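As a hedged illustration of the kind of heatmap shown in Figure 2, the sketch below renders attention weights over the conversation history with matplotlib; the words and the weight matrix are invented stand-ins, not weights read out of the trained model.

```python
import numpy as np
import matplotlib.pyplot as plt

history = ["any", "way", "to", "fix", "choppy", "streaming", "video", "?"]
generated = ["what", "video", "card"]
# Invented attention weights: rows = generated words, columns = history words, each row sums to 1.
attn = np.array([
    [0.05, 0.05, 0.05, 0.30, 0.10, 0.10, 0.30, 0.05],
    [0.02, 0.02, 0.02, 0.05, 0.15, 0.20, 0.50, 0.04],
    [0.03, 0.03, 0.03, 0.10, 0.15, 0.15, 0.45, 0.06],
])

fig, ax = plt.subplots(figsize=(6, 2))
ax.imshow(attn, cmap="Greys", aspect="auto")     # darker cells correspond to larger attention weights
ax.set_xticks(range(len(history)))
ax.set_xticklabels(history, rotation=45, ha="right")
ax.set_yticks(range(len(generated)))
ax.set_yticklabels(generated)
ax.set_xlabel("conversation history")
ax.set_ylabel("generated word")
fig.tight_layout()
plt.show()
```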
## Conclusion

We present an attention-RNN dialogue language model that continuously increases the scope of attention as the conversation progresses (which distinguishes it from standard attention with a fixed scope in sequence-to-sequence models) to promote topic coherence, such that each generated word can be associated with its most related words in the conversation history. We evaluate this simple model on two large dialogue datasets (Movie Triples and Ubuntu Troubleshoot), and achieve the best results reported to date on multiple dialogue metrics (including complementary diversity-based metrics), performing better than gate-based RNN memory models. We also promote topic concentration by adopting LDA-based re-ranking, further improving performance.

## Acknowledgments

We thank Iulian Serban, Yi Luan, and the anonymous reviewers for sharing their datasets and for their helpful discussion. We thank NVIDIA Corporation for donating the GPUs used in this research.

## References

Ameixa, D.; Coheur, L.; Fialho, P.; and Quaresma, P. 2014. Luke, I am your father: Dealing with out-of-domain requests by using movies subtitles. In Intelligent Virtual Agents, 13-21. Springer.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research.

Blei, D. M., and Lafferty, J. D. 2009. Topic models. Text Mining: Classification, Clustering, and Applications 10(71):34.

Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research.

Crow, B. 1983. Topic shifts in couples' conversations. Conversational Coherence: Form, Structure, and Strategy, 136-156.

Dušek, O., and Jurčíček, F. 2016. A context-aware natural language generator for dialogue systems. In Proceedings of SIGDIAL.

Goldberg, J. 1983. A move towards describing conversational coherence. Conversational Coherence: Form, Structure, and Strategy, 25-45.

Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2016a. A diversity-promoting objective function for neural conversation models. In NAACL.

Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2016b. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.

Luan, Y.; Ji, Y.; and Ostendorf, M. 2016. LSTM based conversation models. arXiv preprint arXiv:1603.09457.

Manning, C. D.; Raghavan, P.; Schütze, H.; et al. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press.

Mei, H.; Bansal, M.; and Walter, M. R. 2016a. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Proceedings of the National Conference on Artificial Intelligence (AAAI).

Mei, H.; Bansal, M.; and Walter, M. R. 2016b. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT).

Mikolov, T. 2010. Recurrent neural network based language model. In Proceedings of Interspeech.

Mnih, V.; Heess, N.; Graves, A.; and Kavukcuoglu, K. 2014. Recurrent models of visual attention. In Advances in Neural Information Processing Systems (NIPS).

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2001. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 311-318.

Parikh, A. P.; Täckström, O.; Das, D.; and Uszkoreit, J. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.
Ritter, A.; Cherry, C.; and Dolan, W. B. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Sanders, R. 1983. Tools for cohering discourse and their strategic utilization. Conversational Coherence: Form, Structure, and Strategy, 67-80.

Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural networks. In Proceedings of the National Conference on Artificial Intelligence (AAAI).

Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Sigman, S. 1983. Some multiple constraints placed on conversational topics. Conversational Coherence: Form, Structure, and Strategy.

Sordoni, A.; Galley, M.; Auli, M.; Brockett, C.; Ji, Y.; Mitchell, M.; Nie, J.-Y.; Gao, J.; and Dolan, B. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS).

Tran, K.; Bisazza, A.; and Monz, C. 2016. Recurrent memory network for language modeling. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT).

Vinyals, O., and Le, Q. 2015. A neural conversational model. In ICML Deep Learning Workshop.

Wen, T.-H.; Gašić, M.; Mrkšić, N.; Su, P.-H.; Vandyke, D.; and Young, S. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In EMNLP.

Wen, T.-H.; Gašić, M.; Mrkšić, N.; Rojas-Barahona, L. M.; Su, P.-H.; Vandyke, D.; and Young, S. 2016. Multi-domain neural network language generation for spoken dialogue systems. In NAACL.

Winograd, T. 1971. Procedures as a representation for data in a computer program for understanding natural language. Technical report, DTIC Document.

Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.

Yao, K.; Peng, B.; Zweig, G.; and Wong, K.-F. 2016. An attentional neural conversation model with improved specificity. arXiv preprint arXiv:1606.01292.