# Zero-Resource Knowledge-Grounded Dialogue Generation

Linxiao Li (Peking University, lilinxiao@pku.edu.cn); Can Xu (Microsoft STCA, caxu@microsoft.com); Wei Wu (Meituan, wuwei19850318@gmail.com); Yufan Zhao (Microsoft STCA, yufzhao@microsoft.com); Xueliang Zhao (Peking University, xl.zhao@pku.edu.cn); Chongyang Tao (Peking University, chongyangtao@pku.edu.cn)

While neural conversation models have shown great potential for generating informative and engaging responses by introducing external knowledge, learning such a model often requires knowledge-grounded dialogues that are difficult to obtain. To overcome the data challenge and reduce the cost of building a knowledge-grounded dialogue system, we explore the problem under a zero-resource setting by assuming no context-knowledge-response triples are needed for training. To this end, we propose representing the knowledge that bridges a context and a response and the way that the knowledge is expressed as latent variables, and devise a variational approach that can effectively estimate a generation model from a dialogue corpus and a knowledge corpus that are independent of each other. Evaluation results on three benchmarks of knowledge-grounded dialogue generation indicate that our model can achieve comparable performance with state-of-the-art methods that rely on knowledge-grounded dialogues for training, and exhibits a good generalization ability over different topics and different datasets.

## 1 Introduction

Recent years have witnessed rapid progress on learning a dialogue generation model for open domain human-machine conversation [40, 34, 50, 1]. Though such models in advanced neural architectures [39] are capable of replying with natural and smooth responses with regard to conversation history, people can still feel a clear gap when they converse with the systems, compared with conversation with humans.
One primary reason is that existing dialogue systems lack the necessary knowledge and thus cannot go deep with humans when they dive into a specific topic. To bridge the gap, researchers have begun to study how to ground open domain dialogues in external knowledge, which could be obtained either from structured knowledge bases [26, 38] or from unstructured documents [10, 55, 16].

In this work, we study document-grounded dialogue generation, in which a response is synthesized with regard to a conversation context associated with a few sentences from external documents. While the documents serve as content sources and hint response generation with knowledge, collecting enough dialogues that are naturally grounded on documents for model training is not trivial. Although some benchmarks built upon crowd-sourcing have been released by recent papers [55, 10, 16], the small training size makes the generation models generalize badly on unseen topics [10], and the cost of building such data also prevents the technology proved on the benchmarks from being transferred to new domains and new languages. A very recent paper [52] attempts to tackle the data challenge under a low-resource assumption; however, reliance on the expensive knowledge-grounded dialogues is still not fully removed.

Notes: Work done during the internship at Microsoft STCA. Corresponding author: Can Xu (caxu@microsoft.com). 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

In this paper, we make one step further by exploring knowledge-grounded dialogue generation under a zero-resource setting, where no context-knowledge-response triples (e.g., those obtained from crowd-sourcing) are assumed available in training.
Apparently, such an assumption raises even bigger challenges for learning, but our effort will allow developers to build a knowledge-grounded dialogue system from independent dialogues (e.g., context-response pairs collected from Reddit) and knowledge resources (e.g., wiki articles), and thus can greatly reduce the cost of building such systems and enhance transferability of the technology. Since knowledge-grounded dialogues are absent in training, we introduce two latent variables that represent the knowledge for grounding and the rate of grounding (i.e., how much knowledge is used in responding) respectively. The generation process is then formalized within a probabilistic framework and optimized via variational inference [19]. To take advantage of the recent breakthrough on pretraining for natural language tasks, we build the probabilistic models on the basis of a pre-trained language model. Instead of using generative models, we propose instantiating the posterior with a retrieval model whereby the search space of knowledge is restrained within a few relevant candidates. Thus, we can circumvent the tedious sampling steps and have a more stable learning process. In addition to the objectives in generalized EM, we also devise a knowledge selection loss and a mutual information loss with the former to learn how to tailor long knowledge input to meet the capacity constraint of the pre-trained language model and the latter to effectively estimate the latent grounding rate in variational inference. We conduct experiments with benchmarks of knowledge-grounded dialogue generation that are constructed by crowd-sourcing. Evaluation results in terms of both automatic metrics and human judgment indicate that our model not only achieves comparable performance with the state-of-the-art model that is learned from crowd-sourced training sets, but also exhibits a good generalization ability over different topics and different datasets. 
Our contributions are four-fold: (1) exploration of knowledge-grounded dialogue generation under a zero-resource setting; (2) proposal of a double latent variable model that depicts not only the knowledge connecting a context and a response but also the way that the knowledge is expressed; (3) proposal of a variational learning approach; and (4) empirical verification of the effectiveness of the proposed approach on three benchmarks of knowledge-grounded dialogue generation.

Given $\mathcal{D}_{cov} = \{(C_i, R_i)\}_{i=1}^{n}$ as a dialogue corpus and $\mathcal{K}_{kg} = \{K_j\}_{j=1}^{m}$ as a knowledge base, where $\forall i \in \{1,\dots,n\}$, $C_i$ refers to a dialogue context with $R_i$ a response, and $\forall j \in \{1,\dots,m\}$, $K_j$ denotes a piece of knowledge (e.g., a sentence in Wikipedia), we aim to learn a model $p(R \mid C, K)$ from $\mathcal{D}_{cov}$ and $\mathcal{K}_{kg}$ without any oracles (e.g., the crowd-workers in existing benchmarks) indicating the collation of a dialogue and the related knowledge. Thus, for a new context $C$ associated with external knowledge $K$ (e.g., obtained from a retrieval model as in [10]), one can generate a response $R$ following $p(R \mid C, K)$.

### 2.1 Zero-Resource Learning Framework

Figure 1: Graphical model of the proposed approach. Solid lines mean that there exist links in both the probabilistic graph and the neural graph, while dotted lines mean that links only exist in the neural graph.

Figure 1 gives the graphical model of our approach. The model depicts the dependency among four variables: dialogue context $C$, response $R$, latent knowledge $Z_k$, and grounding rate $Z_\alpha$, where $Z_k$ bridges $C$ and $R$ under the control of $Z_\alpha$. Basically, $Z_\alpha$ indicates how much knowledge in $Z_k$ is carried by $R$ according to $C$. Hence, the variable endows our method with the flexibility that responses at various levels of knowledge (e.g., from a short reply that simply catches up with the context to an informative statement that delivers necessary content for continuing the discussion) can be modeled in a unified framework.
More advantages credited to $Z_\alpha$ include (1) in training, the model guarded by $Z_\alpha$ becomes more robust with regard to the noise in the inferred $Z_k$; and (2) in prediction, the model can automatically control the way of knowledge expression and thus can be easily adapted to different scenarios without much extra effort.

The general objective of learning can be formulated as

$$\mathcal{L}(\theta) = \mathbb{E}_{(C,R) \sim \mathcal{D}_{cov}}[\log p_\theta(R \mid C)]. \tag{1}$$

By approximating the true posterior with a variational posterior $q(Z_k, Z_\alpha \mid C, R)$, we optimize the marginal log-likelihood in Eq. 1 with the Generalized EM method [4]:

$$\text{E-step:}\quad \arg\min_{q}\; D_{KL}(q(Z_k) \,\|\, p(Z_k \mid C, R)) + D_{KL}(q(Z_\alpha) \,\|\, p(Z_\alpha \mid C, R)), \tag{2}$$

$$\text{M-step:}\quad \arg\max_{p}\; \mathbb{E}_{Z_\alpha \sim q(Z_\alpha)} \mathbb{E}_{Z_k \sim q(Z_k)} \log p(R \mid C, Z_k, Z_\alpha) - D_{KL}(q(Z_k) \,\|\, p(Z_k \mid C)) - D_{KL}(q(Z_\alpha) \,\|\, p(Z_\alpha \mid C, Z_k)), \tag{3}$$

where $q(Z_k)$, $q(Z_\alpha)$, and $q(Z_k, Z_\alpha)$ stand for $q(Z_k \mid C, R)$, $q(Z_\alpha \mid C, R)$, and $q(Z_k, Z_\alpha \mid C, R)$, respectively, and $D_{KL}(\cdot \,\|\, \cdot)$ refers to the Kullback-Leibler divergence. Detailed derivations are presented in the supplementary material.

### 2.2 Neural Parameterization

$q(Z_k)$ & $p(Z_k \mid C, R)$: normally, $q(Z_k)$ and $p(Z_k \mid C, R)$ can be specified as neural generative models (e.g., within the VAE framework [9, 47]). However, learning generative posteriors often requires sampling from a large space, which is slow and inaccurate. It is also difficult to approximate the intractable $p(Z_k \mid C, R)$, which could enlarge the gap between the E-step and the M-step. Motivated by these issues, we instead define $q(Z_k)$ with a retrieval model. Formally, $q(Z_k)$ is calculated as

$$q(Z_k \mid C, R) = \frac{\exp F(C, R, Z_k)}{\sum_{K' \in S(R)} \exp F(C, R, K')}, \tag{4}$$

where $S(R)$ denotes the inference of the latent knowledge that is made up of the top-$l$ results retrieved from $\mathcal{K}_{kg}$ by a relevance model $\mathrm{rel}(\cdot, \cdot)$ with $R$ as a query, and $F(\cdot, \cdot, \cdot)$ is a 3-layer transformer that maps $(C, R, Z_k)$ to a matching score. Since $S(R)$ is enumerable, $p(Z_k \mid C, R)$ in Eq. 2 can be calculated by

$$p(Z_k \mid C, R) = \frac{p(Z_k, R \mid C)}{p(R \mid C)} = \frac{p(Z_k \mid C)\, p(R \mid C, Z_k)}{\sum_{K' \in S(R)} p(K' \mid C)\, p(R \mid C, K')}, \tag{5}$$

where $p(Z_k \mid C)$ and $p(R \mid C, Z_k)$ will be detailed later.

$p(R \mid C, Z_k, Z_\alpha)$: we adopt UNILM [11] as the backbone of $p(R \mid C, Z_k, Z_\alpha)$.
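As a numerical illustration of Eq. 4 and Eq. 5, the following minimal sketch computes a retrieval posterior as a softmax over matching scores and a true posterior by Bayes' rule over an enumerable candidate set, then measures the KL term of the E-step. The function names and all numbers are hypothetical placeholders; in the paper the scores come from the transformer $F$, the selection model, and UNILM, none of which is reproduced here.

```python
import math

def retrieval_posterior(scores):
    """Eq. 4: softmax over matching scores F(C, R, K') for K' in S(R)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def true_posterior(prior, likelihood):
    """Eq. 5: p(Zk | C, R) by Bayes' rule over the enumerable set S(R).

    prior[i] stands for p(K_i | C); likelihood[i] stands for p(R | C, K_i).
    The prior does not need to sum to one because Eq. 5 renormalizes.
    """
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def kl_divergence(q, p, eps=1e-12):
    """D_KL(q || p) for two discrete distributions on the same support."""
    return sum(qi * math.log((qi + eps) / (pi + eps))
               for qi, pi in zip(q, p) if qi > 0)

# Hypothetical values for l = 3 retrieved candidates.
scores = [2.0, 0.5, -1.0]         # matching scores F(C, R, K')
prior = [0.6, 0.3, 0.1]           # p(y = 1 | C, K'), used as p(K' | C)
likelihood = [0.02, 0.01, 0.001]  # p(R | C, K') from the generator

q = retrieval_posterior(scores)
p = true_posterior(prior, likelihood)
e_step_loss = kl_divergence(q, p)  # first KL term of Eq. 2
```

Because $S(R)$ holds only a handful of candidates, both distributions are exact sums over the set, which is what lets the E-step avoid sampling.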
Note that UNILM can be replaced by other pre-trained language models such as GPT-2 [28]. Here, we mix $Z_k$ with random noise $Z$ sampled from $\mathcal{K}_{kg}$. This is to simulate the real scenario at test time, where the useful knowledge is often enclosed with a lot of irrelevant candidates (e.g., in Wizard of Wikipedia [10], each context is associated with 61 knowledge candidates and only one of them is selected by the crowd-worker for responding). Then $p(R \mid C, Z_k, Z_\alpha)$ is defined by $\mathrm{UNILM}(I)$ with $I$ given by

$$I = [\mathrm{CLS}]\,[Z_\alpha]\; c_1 \dots c_{l_c}\; [\mathrm{SEP}]\; S_1 \dots S_{l_k}\; [\mathrm{SEP}]\; r_1 \dots r_{l_r}\; [\mathrm{SEP}], \tag{6}$$

where $(c_1, \dots, c_{l_c})$ denotes the utterance sequence in context $C$, $(r_1, \dots, r_{l_r})$ denotes the word sequence in response $R$, and $(S_1, \dots, S_{l_k})$ denotes the sentence sequence of $Z_k \cup Z$.

One practical issue is that large-scale pre-trained language models such as UNILM often set a constraint on the maximum number of tokens they can handle (e.g., 512 tokens in UNILM), which forces us to shorten $I$ before feeding it to the models. To this end, we devise a knowledge selection model which is formalized as a binary classifier $p(y \mid C, Z)$ with $\mathrm{CLS}(\mathrm{UNILM}(I'))$ as input, where $\mathrm{CLS}(\cdot)$ returns the vector corresponding to the [CLS] token, and $I' = [\mathrm{CLS}]\; c_1 \dots c_{l_c}\; [\mathrm{SEP}]\; z_1 \dots z_{l_z}\; [\mathrm{SEP}]$ with $Z = (z_1, \dots, z_{l_z})$. Then, sentences in $Z_k \cup Z$ are fed to $I$ one by one in descending order of $p(y \mid C, Z)$ until the capacity constraint is reached. For simplicity, we define $p(Z_k \mid C)$ in Eq. 5 as $p(y = 1 \mid C, Z_k)$, and define $p(R \mid C, Z_k)$ in Eq. 5 with $p(R \mid C, Z_k, Z_\alpha)$ by dropping $[Z_\alpha]$ in Eq. 6.

$p(Z_\alpha \mid C, Z_k)$: we define $p(Z_\alpha \mid C, Z_k)$ as $\sigma(\mathrm{CLS}(F(I^{p}_{Z_\alpha})))$, where $\sigma(\cdot)$ is a sigmoid function, $F(\cdot)$ is a 3-layer transformer, and $I^{p}_{Z_\alpha} = e[\mathrm{CLS}]\, e[Z_\alpha]\, e[c_1] \dots e[c_n]\, e[\mathrm{SEP}]\, e[S_1] \dots e[S_k]$ with $e[\cdot]$ the summation of the token embedding, the position embedding, and the segment embedding given by the embedding layer of $\mathrm{UNILM}(I)$.

$q(Z_\alpha)$: $q(Z_\alpha)$ is specified as $\mathrm{Sim}(R, Z_k)$ with $\mathrm{Sim}(\cdot, \cdot)$ a similarity function of sentence pairs.

### 2.3 Learning Details

Besides Eq. 2 and Eq.
3, two extra objectives are also included in learning in order to explicitly optimize knowledge selection and enhance the learning of $Z_\alpha$.

Knowledge Selection Loss: the knowledge selection model $p(y \mid C, Z)$ is optimized by differentiating $Z_{pos}$ from $Z_{neg}$, where $Z_{pos}$ corresponds to the maximum $\mathrm{Sim}(R, Z)$ with $Z \in S(R)$, and $Z_{neg}$ is randomly sampled from $\mathcal{K}_{kg}$. The loss function can be formulated as

$$\mathcal{L}_{ks} = -\log(p(y = 1 \mid C, Z_{pos})) - \log(p(y = 0 \mid C, Z_{neg})). \tag{7}$$

Mutual Information Loss: although the posterior $q(Z_\alpha)$ is deterministic, we still observe that it is hard to encode the information of knowledge expression into $Z_\alpha$ through learning, which is a phenomenon similar to the posterior collapse problem in [5, 51]. To mitigate the problem, we directly impose the association of $Z_\alpha$ and $R$ given $Z_k$ by a mutual information loss defined as

$$I(Z_\alpha, R) = \mathbb{E}_{p(Z_\alpha, R)} \log \frac{p(Z_\alpha, R)}{p(Z_\alpha)\, p(R)}.$$

Since direct optimization of $I(Z_\alpha, R)$ is intractable, we instead propose maximizing a lower bound via variational information maximization [6], which can be formulated as

$$I(Z_\alpha, R) \geq \mathbb{E}_{p(Z_\alpha)} \mathbb{E}_{p(R \mid Z_\alpha)} \log q_\phi(Z_\alpha \mid R). \tag{8}$$

In order to optimize Eq. 8, we need to make the generation of response tokens differentiable. Recall that the probability distribution of token $w_t$ is calculated as

$$p_t = \mathrm{softmax}(W[H_t]) \in \mathbb{R}^{v}, \tag{9}$$

where $H_t$ is the hidden state of $w_t$ in $p(R \mid C, Z_k, Z_\alpha)$, and $W \in \mathbb{R}^{h \times v}$ are trainable parameters with $h$ the size of $H_t$ and $v$ the vocabulary size. Though one can estimate the gradient of $\mathbb{E}_{p(R \mid Z_\alpha)}$ with the REINFORCE algorithm [43], such an approach often suffers from high variance. Therefore, we instead exploit the Gumbel-softmax reparametrization trick [17] as a low-variance approximation of sampling from the categorical distribution $p_t$:

$$\sum_{i=1}^{V} e(w_i)\, \mathrm{softmax}((p_t + \xi)/\tau)_i, \tag{10}$$

where $\xi$ is an independent noise sampled from the Gumbel distribution, $\tau$ is the temperature (i.e., a hyper-parameter), and $e(w_i)$ is the embedding of $w_i$ in $p(R \mid C, Z_k, Z_\alpha)$. Although this gradient estimator is biased, we find that it works well in practice.
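The relaxation in Eq. 10 can be illustrated with a small standard-library sketch. It follows the equation as printed, adding Gumbel noise to $p_t$ directly; many implementations add the noise to log-probabilities instead, and that variant is not shown. The toy vocabulary of three tokens, the two-dimensional embeddings, and the function names are assumptions made for illustration only.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so that both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_softmax_embedding(p_t, embeddings, tau, rng):
    """Soft token embedding in the form of Eq. 10.

    p_t        : list of v token scores at step t
    embeddings : list of v embedding vectors e(w_i)
    tau        : temperature; smaller values give sharper weights
    """
    noisy = [(p + sample_gumbel(rng)) / tau for p in p_t]
    weights = softmax(noisy)
    dim = len(embeddings[0])
    mixed = [sum(w * e[d] for w, e in zip(weights, embeddings))
             for d in range(dim)]
    return weights, mixed

# Hypothetical toy setup: 3 tokens, 2-dimensional embeddings.
rng = random.Random(0)
p_t = [0.7, 0.2, 0.1]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, mixed = gumbel_softmax_embedding(p_t, embeddings, tau=0.1, rng=rng)
```

The output is a convex combination of embeddings rather than a hard token, so a downstream scorer such as $q_\phi(Z_\alpha \mid R)$ can pass gradients back through the weights.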
We set $\tau = 0.1$ based on the results on validation and fix the value in all the experiments. The learning algorithm is summarized in Algorithm 1.

Algorithm 1: Optimization Algorithm

- Input: dialogue corpus $\mathcal{D}_{cov}$, knowledge corpus $\mathcal{K}_{kg}$, pre-trained UNILM, threshold $\lambda$, and maximum step $M$.
- Construct a relevance model $\mathrm{rel}(\cdot, \cdot)$ based on $\mathcal{K}_{kg}$.
- For $m = 1$ to $M$:
  - Sample a mini-batch $(C_i, R_i)$ from $\mathcal{D}_{cov}$ and retrieve $S(R_i)$ with $\mathrm{rel}(\cdot, \cdot)$.
  - Sample $t$ from uniform(0, 1).
  - If $t < \lambda$: update the parameters of the model based on Eq. 7.
  - Otherwise:
    - Estimate $p(Z_k \mid C, R)$ based on Eq. 5.
    - Update the parameters of the model based on Eq. 2 (E-step).
    - Sample $Z_k$ from $q(Z_k \mid C, R)$.
    - Update the parameters of the model based on Eq. 3 and Eq. 8 (M-step).
- Return $p(y \mid C, Z)$, $p(Z_\alpha \mid C, Z)$, and $p(R \mid C, Z, Z_\alpha)$.

### 2.4 Knowledge-grounded Response Generation Model

After learning from $\mathcal{D}_{cov}$ and $\mathcal{K}_{kg}$, we define the response generation model $p(R \mid C, K)$ at test time as $p(R \mid C, Z, Z_\alpha)$, where we rank $K = \{K_i\}$ according to $\{p(y \mid C, K_i)\}$ and fill $Z$ with the ranked sequence until reaching the capacity constraint of UNILM, and $Z_\alpha$ is predicted by $p(Z_\alpha \mid C, Z)$.

## 3 Experiments

We test the proposed method on benchmarks of knowledge-grounded dialogue generation, including Wizard of Wikipedia (Wizard) [10], Topical-Chat (TC) [16], and CMU Document Grounded Conversations (CMU_DoG) [55].

### 3.1 Experimental Setup

Training Data: we build the knowledge corpus with a Wikipedia dump,[^3] where text is extracted with an open source tool[^4] and split into sentences using NLTK.[^5] In total, there are 5,972,585 articles and 77,152,626 sentences. On average, each sentence contains 27.4 words. The dialogue corpus is constructed from the Reddit Conversation Corpus cleaned by [12].
We merge the training/validation/test sets in the original data, and extract a subset by the following rules: (1) the length of the response falls in (10, 50); (2) the proportion of unique non-stop words in the response falls in (0.25, 0.6); (3) the proportion of unique words in the response is larger than 0.5; (4) $\mathrm{Sim}(R, \bar{K}) \geq 0.1$ where $\bar{K} = \arg\max_{K \in S(R)} \mathrm{Sim}(R, K)$; and (5) the length of $\bar{K}$ in (4) is longer than 10. These rules remove responses that are too short, too long, too generic, or in an extreme chat style, and thus help ensure the quality of training. Automatic evaluation metrics are also sensitive to the length of generated responses, and our model suffers from the length inconsistency between training and testing. Instead of adjusting the length distribution of the training data, we drop the ending token for short responses during training to approximate the maximum average length of the benchmarks (24 in our experiment). After the pre-processing, the subset is randomly split into a training set and a validation set with 842,521 and 2,737 dialogues respectively. On average, each dialogue (with the last turn as the response and other turns as the context) contains 3.1 utterances in both sets, and the average length of the utterances is 16.0 in training and 16.1 in validation. Note that the validation set is used for model selection and thus we do not access any data point in the benchmarks before evaluation.

Test Data: all the benchmarks are built with crowd-sourcing on Amazon Mechanical Turk (AMT), and are split into training sets, validation sets, and test sets by the data owners. In Wizard and CMU_DoG, knowledge is obtained from Wikipedia, while in TC, besides wiki articles, Washington Post articles and Reddit fun facts are also utilized as the knowledge sources. Unlike CMU_DoG, which focuses on the movie domain, both Wizard and TC cover a wide range of topics from multiple domains.
Various configurations are set up to simulate conversation scenarios in the real world. In Wizard, a wizard tells an apprentice about what he/she learns from the knowledge about a specific topic. In addition to wizard-apprentice conversations, CMU_DoG also contains conversations between two workers who know the background documents and try to discuss the content in depth. In TC, participants play symmetric and asymmetric roles according to the knowledge they can access under 5 settings. In Wizard and TC, the test sets are further split into Seen/Frequent and Unseen/Rare, where the former contains topics frequently appearing in the training sets and the latter contains topics infrequently or never appearing in the training sets. For Wizard, we follow [10] and conduct pre-processing with the code published on ParlAI.[^6] For CMU_DoG, we use the version shared at https://github.com/lizekang/ITDD. For TC, we utilize the data published in the open source project https://github.com/alexa/alexa-prize-topical-chat-dataset/. More details of the benchmarks are shown in the supplementary material.

[^3]: http://wikipedia.c3sl.ufpr.br/enwiki/20191120/
[^4]: https://github.com/attardi/wikiextractor/wiki
[^5]: https://www.nltk.org/
[^6]: https://github.com/facebookresearch/ParlAI/blob/master/projects/wizard_of_wikipedia

Baselines: the following models are selected as baselines: (1) MTASK-RF [15]: an early model that also realizes knowledge-grounded conversation without crowd-sourced knowledge-grounded dialogues.
To make a fair comparison, we implement the model by strictly following the details in [15], but replace the Twitter data, the Foursquare data, and the Twitter handles used to connect the Twitter conversation and the Foursquare facts with the Reddit data, the Wikipedia data, and an aggregate of the topics in the three benchmarks; (2) Transformer Memory Network (TMN) [10][^6]: a transformer architecture augmented by a knowledge memory, which is published along with the Wizard data; (3) Incremental Transformer with Deliberation Decoder (ITDD) [24][^7]: an encoder-decoder architecture where the encoder incrementally represents multi-turn dialogues and knowledge, and the decoder conducts response decoding in two passes similar to the deliberation network in machine translation; (4) Sequential Knowledge Transformer (SKT) [18][^8]: a sequential latent variable model with state-of-the-art performance on knowledge selection. Since human labels that indicate ground-truth knowledge are crucial to the performance of the model and are only provided in the Wizard data, we implement SKT with heuristics on Topical-Chat and CMU_DoG (pseudo supervision created by selecting GT-knowledge using $\mathrm{Sim}(\cdot, \cdot)$ with the response); (5) Disentangle Response Decoder (DRD) [52]: a model that exploits pre-training techniques to tackle the low-resource challenge in knowledge-grounded dialogue generation. We choose the one whose parameters are fine-tuned on the full training data of the benchmarks, as the model exhibits state-of-the-art performance on Wizard according to [52]. We name our model ZRKGC,[^9] standing for zero-resource knowledge-grounded conversation model.

Evaluation Methods: following [10], we choose perplexity (PPL) [37] and unigram F1 as the automatic metrics, where F1 is calculated with the code shared at https://github.com/facebookresearch/ParlAI/blob/master/parlai/core/metrics.py. Besides, we also examine the performance of the models with human annotations.
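For readers unfamiliar with the metric, unigram F1 is the harmonic mean of token-level precision and recall between a generated response and the reference. The sketch below is a simplified illustration and not the ParlAI implementation cited above: it assumes lowercasing and whitespace tokenization only, whereas the evaluation code used in the paper may normalize text differently.

```python
from collections import Counter

def unigram_f1(hypothesis, reference):
    """Harmonic mean of unigram precision and recall (simplified)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, comparing "the cat sat" against the reference "the cat" gives precision 2/3 and recall 1, hence an F1 of 0.8.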
Since human labor is expensive, manual judgment is applied to Wizard only. Following [52], we randomly sample 500 examples from Test Seen and Test Unseen, and recruit 3 well-educated native speakers as annotators. To each annotator, an example is presented with a context, the associated external knowledge,[^10] and model responses (top 1 in beam search) that are randomly shuffled to hide their sources. The annotators then judge the quality of the responses from three aspects, including fluency, context coherence, and knowledge relevance, and assign a score in {0, 1, 2} (representing "bad", "fair", and "good") to each response for each aspect. Each response receives 3 scores per aspect, and the agreement among the annotators is measured via Fleiss' kappa [14].

### 3.2 Implementation Details

We index the sentences in the knowledge corpus with the open source Lucene.Net,[^11] employ the internal ranker of Lucene (basically a BM25 model [31]) as $\mathrm{rel}(\cdot, \cdot)$, and set the number of retrieved candidates (i.e., $l$) as 10. The function $\mathrm{Sim}(\cdot, \cdot)$ in Section 2.2 is defined as Bleu-2 [27]. We choose UNILM Base (110M) and implement the model with the code in https://github.com/microsoft/unilm. We find that replacing $D_{KL}(q(Z_\alpha) \,\|\, p(Z_\alpha \mid C, Z_k))$ in Eq. 3 with a mean squared error in optimization can enhance model performance, probably because $Z_\alpha$ is a continuous variable. The model is trained with a batch size of 10, a maximum input length of 256, and a maximum output length of 40. The threshold $\lambda$ and the maximum step $M$ in Algorithm 1 are set as 0.2 and 100,000 respectively. The learning rate is set as 0.00003 and the warmup step is set as 1000. In training, we evaluate the model every 5,000 steps on the validation set with unigram F1 [10] as the metric. The training procedure is terminated once F1 begins to drop. To draw a fair comparison, we keep the same evaluation procedure as the existing models. During test time, we exploit beam search with a beam size of 5.
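The capacity-constrained use of knowledge described in Sections 2.2 and 2.4 (rank candidates by the selection score, then fill the input until the token budget is reached) can be sketched as follows. This is a hedged illustration: `pack_knowledge` and its whitespace token counter are hypothetical stand-ins, and the real system would count tokens with the UNILM tokenizer and also reserve room for the context.

```python
def pack_knowledge(sentences, scores, budget,
                   count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-scoring sentences that fit in the budget.

    sentences : candidate knowledge sentences
    scores    : selection scores, standing in for p(y = 1 | C, K_i)
    budget    : number of tokens available for knowledge
    """
    ranked = sorted(zip(scores, sentences),
                    key=lambda pair: pair[0], reverse=True)
    kept, used = [], 0
    for _, sentence in ranked:
        cost = count_tokens(sentence)
        if used + cost > budget:
            break  # stop at the first sentence that would exceed the budget
        kept.append(sentence)
        used += cost
    return kept

# Hypothetical candidates and scores.
candidates = ["a b c", "d e", "f g h i"]
selection_scores = [0.2, 0.9, 0.5]
selected = pack_knowledge(candidates, selection_scores, budget=6)
```

Stopping at the first overflow keeps the input a strict prefix of the ranking; skipping the overflowing sentence and trying shorter ones would be an alternative design that is not described in the paper.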
We apply the knowledge selection module $p(y \mid C, Z)$ to select $K$ knowledge sentences from all $M$ knowledge sentences ($M \geq K$) to meet the capacity constraint of UNILM (e.g., 256 in our setting).

[^7]: https://github.com/lizekang/ITDD
[^8]: https://github.com/bckim92/sequential-knowledge-transformer
[^9]: Dataset and codes are publicly available at https://github.com/nlpxucan/ZRKGC
[^10]: For ease of labeling, only the ground-truth knowledge is shown to the annotators in Wizard.
[^11]: http://lucenenet.apache.org

Table 1: Automatic evaluation results.

| Models | Wizard Seen PPL | Wizard Seen F1 | Wizard Unseen PPL | Wizard Unseen F1 | Topical Freq PPL | Topical Freq F1 | Topical Rare PPL | Topical Rare F1 | CMU_DoG PPL | CMU_DoG F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| MTASK-RF [15] | 65.4 | 13.1 | 67.7 | 12.3 | 51.3 | 12.6 | 51.6 | 12.5 | 67.2 | 10.5 |
| TMN [10] | 66.5 | 15.9 | 103.6 | 14.3 | 30.3 | 16.5 | 52.1 | 14.6 | 75.2 | 9.9 |
| ITDD [24] | 17.8 | 16.2 | 44.8 | 11.4 | 21.4 | 15.8 | 24.7 | 14.0 | 26.0 | 10.4 |
| SKT [18] | 52.0 | 19.3 | 81.4 | 16.1 | 25.1 | 17.0 | 35.6 | 14.8 | 41.9 | 9.6 |
| DRD [52] | 19.4 | 19.3 | 23.0 | 17.9 | 25.9 | 14.8 | 28.0 | 15.1 | 54.4 | 10.7 |
| ZRKGC | 40.4 | 18.7 | 41.5 | 18.6 | 44.2 | 16.6 | 42.0 | 16.8 | 53.5 | 12.5 |

Table 2: Human evaluation results.

| Models | Wizard Seen Fluency | Wizard Seen Coherence | Wizard Seen KG Relevance | Wizard Seen Kappa | Wizard Unseen Fluency | Wizard Unseen Coherence | Wizard Unseen KG Relevance | Wizard Unseen Kappa |
|---|---|---|---|---|---|---|---|---|
| DRD [52] | 1.72 | 1.65 | 1.12 | 0.62 | 1.60 | 1.57 | 1.14 | 0.66 |
| ZRKGC | 1.79 | 1.73 | 1.16 | 0.61 | 1.71 | 1.70 | 1.18 | 0.69 |

### 3.3 Evaluation Results

Table 1 reports the evaluation results on automatic metrics. In terms of F1, though ZRKGC does not access any training examples in the benchmarks, it still outperforms MTASK-RF, TMN, and ITDD, and achieves a comparable performance with DRD on all the test sets, indicating that the model can effectively learn how to leverage external knowledge for response generation through the variational approach. Moreover, unlike the baselines, there is almost no difference for ZRKGC between Test Seen and Test Unseen, which reveals the good generalization ability of the model as an advantage of the zero-resource approach: the model is not influenced by specific training data, and thus performs stably over different topics.
We further investigate the generalization ability of ZRKGC by comparing it with DRD trained on different benchmarks. Figure 2 shows the results. Interestingly, when we transfer the DRD model trained on one benchmark to another benchmark, there is always a significant performance drop. ZRKGC, on the other hand, is always comparable with the best DRD model on each of the benchmarks, indicating that the model generalizes well not only over different topics but also over different datasets. In other words, DRD may fail in practice due to the discrepancy between training and test, but ZRKGC does not suffer from the issue. ZRKGC is worse than ITDD and DRD in terms of PPL, because PPL is calculated with ground-truth responses in the test sets, and therefore models learned by fitting the same or a similar distribution (e.g., both the training data and the test data are constructed by AMT workers) are more advantageous on the metric. We provide richer results with more metrics in the supplementary material.

Figure 2: Generalization ability over different datasets. (Test sets: WoW Seen, WoW Unseen, TC Freq, TC Rare, CMU_DoG; models compared: ZRKGC, DRD_Wizard, DRD_Topical, DRD_CMU_DoG.)

Table 2 compares ZRKGC with DRD using human judgment. All kappa values exceed 0.6, indicating substantial agreement among the annotators. We observe that responses from ZRKGC are more fluent and more contextually coherent than those from DRD, thanks to the pre-trained language model. Both models are awkward in terms of properly bringing knowledge into responses, which sheds light on the direction for future effort. Cases for a closer inspection are shown in the supplementary material.

### 3.4 Discussions

Retrieval posterior vs. generative posterior. We first investigate how the retrieval posterior defined by Eq. 4 matters in learning. To this end, we alternatively implement ZRKGC with a generative posterior (i.e., $q(Z_k)$) that is defined in a sequence-to-sequence form based on UNILM,[^12] and check the trajectories of Eq.
3 (i.e., the evidence lower bound (ELBO)) in training under {retrieval, generative} × {GEM, ELBO}, where GEM means that the model is learned via generalized EM, and ELBO means that optimization is conducted only by Eq. 3. Figure 4 illustrates the trajectories. We can see that with the retrieval posterior, we achieve a tighter ELBO by generalized EM, which means that by optimizing with the E-step, the objective in the M-step moves closer to the true objective. Since we have to resort to high-variance sampling steps to approximate the KL terms as well as the true posterior (i.e., $p(Z_k \mid C, R)$) when the generative posterior is used, optimizing with GEM leads to an even worse ELBO than directly executing the M-step. Results in Table 3 also demonstrate that there is a dramatic performance drop (i.e., F1) on the test sets when the retrieval posterior is replaced by a generative posterior (i.e., -retrieval posterior). Moreover, we also observe an obvious drop (i.e., F1) when $S(R)$ in Eq. 4 is squeezed to $K_{top} = \arg\max_{K \in S(R)} \mathrm{Sim}(R, K)$ (i.e., -parameterized posterior), indicating the effect of the neural parameterization in Eq. 4.

[^12]: $p(Z_k \mid C)$ in Eq. 3 is also defined in a generative form.

Table 3: Ablation study.

| Models | Wizard Seen PPL | Wizard Seen F1 | Wizard Unseen PPL | Wizard Unseen F1 | Topical Freq PPL | Topical Freq F1 | Topical Rare PPL | Topical Rare F1 | CMU_DoG PPL | CMU_DoG F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| ZRKGC | 40.4 | 18.7 | 41.5 | 18.6 | 44.2 | 16.6 | 42.0 | 16.8 | 53.5 | 12.5 |
| -$Z_\alpha$ | 31.1 | 18.5 | 32.0 | 18.4 | 34.2 | 13.9 | 33.2 | 14.4 | 53.2 | 10.8 |
| -mulinfo | 40.9 | 18.1 | 41.9 | 18.0 | 42.6 | 14.2 | 39.1 | 15.2 | 65.7 | 11.7 |
| -retrieval posterior | 35.4 | 16.2 | 36.1 | 16.0 | 39.6 | 13.7 | 36.8 | 14.6 | 52.7 | 10.1 |
| -parameterized posterior | 39.3 | 17.2 | 40.8 | 17.0 | 44.7 | 13.2 | 41.6 | 14.6 | 50.5 | 10.8 |
| -knowledge selection | 44.2 | 18.3 | 45.9 | 17.9 | 45.5 | 14.6 | 43.5 | 14.9 | 53.8 | 12.0 |

Figure 3: Controllability study w.r.t. knowledge expression. (Panels: (a) Wizard Seen, (b) Wizard Unseen; x-axis: $Z_\alpha$ from 0.1 to 0.9; curves: ZRKGC and -mulinfo.)
Figure 4: Trajectories of ELBO in training on Wizard. (Curves: R-posterior GEM, R-posterior ELBO, G-posterior GEM, G-posterior ELBO; x-axis from 0 to 4500.)

Impact of $Z_\alpha$ and impact of the mutual information loss. Then we study the effect of $Z_\alpha$ in modeling response generation and the effect of the mutual information loss on learning. First, according to the results in Table 3, both removal of $Z_\alpha$ (i.e., ZRKGC becomes a single latent variable model) and removal of the mutual information loss (i.e., -mulinfo) cause a performance drop (i.e., F1), indicating that $Z_\alpha$ is useful to ZRKGC and the mutual information loss can enhance the usefulness of the factor. Recall that $Z_\alpha$ is designed to model knowledge expression and the mutual information loss is designed to effectively learn the factor from data. Thus, we also want to check if one can control the extent of knowledge expression by varying $Z_\alpha$ in ZRKGC. Figure 3a and Figure 3b illustrate the comparison between the full ZRKGC and ZRKGC-mulinfo on Test Seen and Test Unseen respectively, in which $Z_\alpha$ is fixed in generation and is increased from 0.1 to 0.9 with 0.1 as the step size, and $\mathrm{Sim}(R, Z_k)$ is employed as the metric with $R$ the generated response and $Z_k$ the ground-truth knowledge.[^13] We can see that the gap between the grounding rate of generation and the value of $Z_\alpha$ we set before generation is smaller in the full model than in the ablated model when $Z_\alpha > 0.2$, indicating that with the mutual information loss, $Z_\alpha$ can effectively encode the information of knowledge expression through the variational learning approach. Note that $Z_\alpha$ becomes weak in ZRKGC when it exceeds 0.5. This is because data with such grounding rates are sparse in training.

[^13]: For the sake of the controllability study, we make sure that the ground-truth knowledge annotated by humans is involved in generation.

Impact of the knowledge selection loss. Finally, we explore the role of the knowledge selection loss.
Our knowledge selection model mainly serves to shorten the input sequence of knowledge candidates, while previous work [18] focuses on selecting the top-1 knowledge. This difference explains why, according to the results in Table 3, the performance drop is not significant when the knowledge selection module is replaced with a random selection module.

## 4 Related Work

End-to-end response generation for open domain dialogues is inspired by the successful application of neural sequence-to-sequence models to machine translation [37, 39]. On top of the basic architecture [36, 40], various extensions have been made to tackle the safe response problem [22, 44, 51, 46]; to model dialogue history for multi-turn conversation [33, 35]; to control attributes of responses [45, 53, 48, 41, 32]; and to bias responses to some specific personas [23, 49]. Recently, grounding open domain dialogues in external knowledge has been emerging as an important topic in research on human-machine conversation [54, 18, 25, 52]. In this work, we study the problem by reducing the demanding training environment to an extreme where only dialogues and documents as a knowledge base are required. To the best of our knowledge, we are the first to show that a model learned under such a zero-resource setting can achieve comparable performance on benchmarks with the models learned from the expensive knowledge-grounded dialogues constructed by crowd-sourcing.

Unsupervised learning and learning from zero resource have attracted widespread attention in natural language generation tasks. In machine translation, typical methods include pivot-based NMT [13, 29, 7], combination of NMT and SMT [21, 30], creation of pseudo pairs with back translation [2], and adversarial training [20].
In unsupervised abstractive summarization, Wang & Lee [42] exploit adversarial training to make the summary human-readable; Chu & Liu [8] exploit the mean of the representations from an auto-encoder for multiple documents to decode a summary; and Baziotis et al. [3] propose a differentiable auto-encoder optimized by reconstructing the input document from the generated summary. Our method is similar to variational back-translation. Instead of directly training a (context, response)-to-knowledge backward generation model, we take the variational posterior of the latent knowledge as the backward model to learn the knowledge-grounded dialogue model. Both SKT [18] and PostKS [25] leverage latent variables for knowledge selection. Besides optimization using generalized EM, our model introduces another variable Zα to dynamically adapt to candidates of different quality, while SKT and PostKS assume that the ground-truth (GT) knowledge always exists in their candidates.

5 Conclusions

We explore knowledge-grounded dialogue generation under a zero-resource setting by proposing a double latent variable model and a variational learning approach. Evaluation results on benchmarks of the task indicate that our model can achieve comparable performance with state-of-the-art methods and exhibits a superior generation ability over different topics and datasets.

Broader Impact

Endowing a dialogue system with knowledge is definitely an important step towards human-like conversational AI, which AI researchers have dreamed of for years, especially when such a technology becomes cheaper and more transferable. More importantly, research on knowledge-grounded dialogue generation could fundamentally change the experience of human-machine interaction, as a system will be able to evolve along with the external knowledge base being maintained and updated.
This may shed light on efforts to build interfaces that allow people to acquire information in a more natural way (i.e., through conversation), rather than just typing a query in a search box and browsing the blue links. However, we should not forget the other side of the coin. Apart from the well-known issues in end-to-end conversation models trained from large naturally-occurring datasets [50], a knowledge base may also be deliberately tailored and bring biased content to dialogues, just as biased content posted by content creators on the Web is promoted by a search engine. To prevent the technology from being abused for disinformation, we look forward to more research effort being devoted to fake/biased/offensive content detection, and at the same time, encourage developers to carefully choose the content for building the knowledge base of their dialogue system. After all, good external content can regulate the behavior of a dialogue model in response generation, and help the model overcome its intrinsic drawbacks inherited from the malicious or biased content hidden in the large-scale dialogues obtained from social media for training.

References

[1] D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.
[2] M. Artetxe, G. Labaka, E. Agirre, and K. Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.
[3] C. Baziotis, I. Androutsopoulos, I. Konstas, and A. Potamianos. Seq^3: Differentiable sequence-to-sequence-to-sequence autoencoder for unsupervised abstractive sentence compression. arXiv preprint arXiv:1904.03651, 2019.
[4] C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
[5] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
[6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172-2180, 2016.
[7] Y. Chen, Y. Liu, Y. Cheng, and V. O. Li. A teacher-student framework for zero-resource neural machine translation. arXiv preprint arXiv:1705.00753, 2017.
[8] E. Chu and P. J. Liu. Meansum: a neural model for unsupervised multi-document abstractive summarization. arXiv preprint arXiv:1810.05739, 2018.
[9] C. Corro and I. Titov. Differentiable perturb-and-parse: Semi-supervised parsing with a structured variational autoencoder. arXiv preprint arXiv:1807.09875, 2018.
[10] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston. Wizard of wikipedia: Knowledge-powered conversational agents. In ICLR, 2019.
[11] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H.-W. Hon. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042-13054, 2019.
[12] N. Dziri, E. Kamalloo, K. W. Mathewson, and O. Zaiane. Augmenting neural response generation with context-aware topical attention. arXiv preprint arXiv:1811.01063, 2018.
[13] O. Firat, B. Sankaran, Y. Al-Onaizan, F. T. Y. Vural, and K. Cho. Zero-resource translation with multilingual neural machine translation. arXiv preprint arXiv:1606.04164, 2016.
[14] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378, 1971.
[15] M. Ghazvininejad, C. Brockett, M.-W. Chang, B. Dolan, J. Gao, W.-t. Yih, and M. Galley. A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[16] K. Gopalakrishnan, B. Hedayatnia, Q. Chen, A. Gottardi, S. Kwatra, A. Venkatesh, R. Gabriel, D. Hakkani-Tür, and A. A. AI. Topical-chat: Towards knowledge-grounded open-domain conversations. Proc. Interspeech 2019, pages 1891-1895, 2019.
[17] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
[18] B. Kim, J. Ahn, and G. Kim. Sequential latent knowledge selection for knowledge-grounded dialogue. arXiv preprint arXiv:2002.07510, 2020.
[19] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[20] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.
[21] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755, 2018.
[22] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055, 2015.
[23] J. Li, M. Galley, C. Brockett, G. Spithourakis, J. Gao, and B. Dolan. A persona-based neural conversation model. In ACL, pages 994-1003, 2016.
[24] Z. Li, C. Niu, F. Meng, Y. Feng, Q. Li, and J. Zhou. Incremental transformer with deliberation decoder for document grounded conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 12-21, 2019.
[25] R. Lian, M. Xie, F. Wang, J. Peng, and H. Wu. Learning to select knowledge for response generation in dialog systems. arXiv preprint arXiv:1902.04911, 2019.
[26] S. Moon, P. Shah, A. Kumar, and R. Subba. Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845-854, 2019.
[27] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311-318. Association for Computational Linguistics, 2002.
[28] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[29] S. Ren, W. Chen, S. Liu, M. Li, M. Zhou, and S. Ma. Triangular architecture for rare language translation. arXiv preprint arXiv:1805.04813, 2018.
[30] S. Ren, Z. Zhang, S. Liu, M. Zhou, and S. Ma. Unsupervised neural machine translation with smt as posterior regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 241-248, 2019.
[31] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al. Okapi at trec-3. Nist Special Publication Sp, 109:109, 1995.
[32] A. See, S. Roller, D. Kiela, and J. Weston. What makes a good conversation? how controllable attributes affect human judgments. arXiv preprint arXiv:1902.08654, 2019.
[33] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, pages 3776-3784, 2016.
[34] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau. End-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776-3784, 2016.
[35] I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. C. Courville, and Y. Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295-3301, 2017.
[36] L. Shang, Z. Lu, and H. Li. Neural responding machine for short-text conversation. In ACL, pages 1577-1586, 2015.
[37] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104-3112, 2014.
[38] Y.-L. Tuan, Y.-N. Chen, and H.-y. Lee. Dykgchat: Benchmarking dialogue generation grounding on dynamic knowledge graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1855-1865, 2019.
[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998-6008, 2017.
[40] O. Vinyals and Q. Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
[41] Y. Wang, C. Liu, M. Huang, and L. Nie. Learning to ask questions in open-domain conversational systems with typed decoders. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2193-2203, 2018.
[42] Y.-S. Wang and H.-Y. Lee. Learning to encode text as human-readable summaries using generative adversarial networks. arXiv preprint arXiv:1810.02851, 2018.
[43] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229-256, 1992.
[44] C. Xing, W. Wu, J. Liu, Y. Huang, M. Zhou, and W.-Y. Ma. Topic aware neural response generation. In AAAI, pages 3351-3357, 2017.
[45] C. Xu, W. Wu, C. Tao, H. Hu, M. Schuerman, and Y. Wang. Neural response generation with meta-words. arXiv preprint arXiv:1906.06050, 2019.
[46] C. Xu, W. Wu, and Y. Wu. Towards explainable and controllable open domain dialogue generation with dialogue acts. arXiv preprint arXiv:1807.07255, 2018.
[47] P. Yin, C. Zhou, J. He, and G. Neubig. Structvae: Tree-structured latent variable models for semi-supervised semantic parsing. arXiv preprint arXiv:1806.07832, 2018.
[48] R. Zhang, J. Guo, Y. Fan, Y. Lan, J. Xu, and X. Cheng. Learning to control the specificity in neural response generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1108-1117, 2018.
[49] S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243, 2018.
[50] Y. Zhang, S. Sun, M. Galley, Y.-C. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536, 2019.
[51] T. Zhao, R. Zhao, and M. Eskenazi. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960, 2017.
[52] X. Zhao, W. Wu, C. Tao, C. Xu, D. Zhao, and R. Yan. Low-resource knowledge-grounded dialogue generation. arXiv preprint arXiv:2002.10348, 2020.
[53] H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu. Emotional chatting machine: Emotional conversation generation with internal and external memory. arXiv preprint arXiv:1704.01074, 2017.
[54] H. Zhou, T. Young, M. Huang, H. Zhao, J. Xu, and X. Zhu. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, pages 4623-4629, 2018.
[55] K. Zhou, S. Prabhumoye, and A. W. Black. A dataset for document grounded conversations. arXiv preprint arXiv:1809.07358, 2018.