Conversational Model Adaptation via KL Divergence Regularization

Juncen Li,1 Ping Luo,2,3 Fen Lin,1 Bo Chen1
1WeChat Search Application Department, Tencent, China. {juncenli}@tencent.com, {felicialin,jennychen}@tencent.com
2Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China. {luop}@ics.ict.ac.cn
3University of Chinese Academy of Sciences, Beijing 100049, China.

In this study we formulate the problem of conversational model adaptation, where we aim to build a generative conversational model for a target domain based on a limited amount of dialogue data from that target domain and some existing dialogue models from related source domains. This model facilitates the fast building of a chatbot platform, where a new vertical chatbot with only a small amount of conversation data can be supported by other related, mature chatbots. Previous studies on model adaptation and transfer learning mostly focus on classification and recommendation problems; however, how these models work for conversation generation is still unexplored. To this end, we leverage a KL divergence (KLD) regularization to adapt the existing conversational models. Specifically, it employs the KLD to measure the distance between the source and target domains. Adding the KLD as a regularizer to the objective function allows the proposed method to utilize the information from the source domains effectively. We also evaluate the performance of this adaptation model on the online chatbots in the WeChat public-accounts platform, using both the BLEU metric and human judgement. The experiments empirically show that the proposed method visibly improves these evaluation metrics.

Introduction

Recently, end-to-end neural systems have made great progress on various natural language tasks, such as machine translation (Cho et al.
2014b; Sutskever, Vinyals, and Le 2014), question answering (Yin et al. 2016), and dialog systems (Gu et al. 2016; Serban et al. 2016; Sordoni et al. 2015). In general, most of these systems consist of multi-layer RNN networks with a large number of parameters; thus, they need a large amount of text data for training. In previous studies of conversation modeling, dialogue data is usually collected from social platforms (such as Weibo (Wang et al. 2013)) or transformed from public data (Banchs and Li 2012), which usually covers various vertical domains.

The background of this study is the building of a chatbot platform, which contains various chatbots for diverse vertical domains, such as entertainment, sports, and religion. To build a vertical chatbot we usually need enough dialogue text from the corresponding vertical domain for end-to-end training. However, when a new chatbot has been online for only a short time, the data accumulated for training are very limited. Thus, we need a solution to this cold-start problem for new chatbots in a chatbot platform.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

To address this data-sparsity issue, we formulate the problem of conversational model adaptation. Specifically, the building of a new chatbot leverages not only a limited amount of training data from the target domain, but also some existing conversational models. Here, the existing models we consider are trained on data from open social platforms (Wang et al. 2013) (considered as the source domain); thus they may contain enough language patterns to support free chat. However, even though the source domain contains enough data for training a conversational model, its data distribution might be dissimilar from that of the target domain, as shown in our quantitative study on the data distributions of the source and target domains (detailed later).
Thus, adding the source-domain data directly into the target domain may cause a large drift in the data distribution and result in a conversational model that loses the domain-dependent characteristics of the target domain.

In this paper, we propose a KL divergence (KLD) regularization method for conversational model adaptation. Specifically, we first build a model on the huge amount of dialogue data from a social platform. Then, the small amount of target-domain data is used to adapt this pre-trained model to the target domain via KLD regularization. This joint regularization framework prevents the dialogue system from overfitting to the target domain while simultaneously making use of the information from the source domain. To evaluate the effectiveness of our method, we use both objective and subjective measures. Experimental results show that our method visibly improves over the existing models without model adaptation.

[Figure 1: Comparison between the source and target domains. (a) Input post comparison; (b) Response comparison. Each panel plots the probability density of the similarity value u.]

In summary, our contributions are three-fold:
- To the best of our knowledge, we are the first to propose the problem of conversational model adaptation, where the building of a new vertical-domain dialogue system is supported by both a small amount of target-domain data and a large amount of source-domain data.
- We develop a KLD regularization method to adapt conversational models, and a careful evaluation demonstrates its effectiveness.
- We stress that the proposed adaptation framework is agnostic to the underlying method for conversational modeling. In other words, it can easily accommodate any end-to-end dialogue system, such as memory networks (Sukhbaatar et al.
2015), the neural encoder-decoder model (Cho et al. 2014b), and so on.

Quantitative Study of Similarity between Source and Target Domains

In this problem we are given two corpora: the source-domain corpus $D_s$ and the target-domain corpus $D_t$. Each corpus is a set of post-response pairs; namely, $D = \{(x, y) \mid y \text{ is the response to post } x\}$. Specifically, $D_s$ is collected from Tencent Weibo and includes 1,903,512 pairs, and $D_t$ is given by a third-party company and includes 15,315 pairs.

To study the similarity between the data distributions of $D_s$ and $D_t$, we first define the maximum cosine similarity between a sentence $s$ and a corpus $D$ as follows:

$$\mathrm{MCS}(s, D) = \max_{s' \in D} \phi(s, s') \quad (1)$$

where $\phi$ calculates the similarity between two sentences. To compute this similarity, we represent each sentence as a paragraph vector (Le and Mikolov 2014) and apply the cosine similarity to the two vectors. A sentence $s$ is closer to the corpus $D$ when its maximum cosine similarity is higher.

We further define the similarity between two corpora $S_1$ and $S_2$ (sets of sentences) as a distribution over $u$:

$$\mathrm{Sim}(u; S_1, S_2) = \frac{1}{|S_1|} \sum_{s_i \in S_1} \delta\big(u = \mathrm{MCS}(s_i, S_2)\big) \quad (2)$$

where $u$ is a user-specified value in $[-1, 1]$ and $\delta$ is the Kronecker delta function. In other words, the similarity between $S_1$ and $S_2$ can be represented by the empirical distribution $\mathrm{Sim}(u; S_1, S_2)$.

With this definition, we design the following process to compare the data distributions of $D_s$ and $D_t$, considering the post sentences and the response sentences separately. Let $P_s$ and $P_t$ include all the input posts in the source and target domains, respectively. We randomly sample a subset $P_s'$ of $P_s$ such that $|P_s'| = |P_t|$. Then, we calculate the following two distributions:

$$\mathrm{Sim}(u; P_s', P_s \setminus P_s') \qquad \mathrm{Sim}(u; P_t, P_s \setminus P_s') \quad (3)$$

Here, $\mathrm{Sim}(u; P_s', P_s \setminus P_s')$ calculates the similarities between the posts in the sampled source subset $P_s'$ and the remaining set $P_s \setminus P_s'$.
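As a rough sketch, the two measures in Eqs. (1)-(2) might be computed as follows. A toy bag-of-words embedding stands in for the paragraph vectors used in the paper, and the names `embed`, `mcs`, and `sim_distribution` are illustrative, not from the original work; the empirical distribution over u is approximated by a histogram.

```python
import numpy as np
from collections import Counter

def embed(sentence, vocab):
    """Toy bag-of-words vector; the paper uses paragraph vectors instead."""
    counts = Counter(sentence.split())
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0.0 or nb == 0.0 else float(a @ b) / (na * nb)

def mcs(sentence, corpus, vocab):
    """Eq. (1): maximum cosine similarity between a sentence and a corpus."""
    v = embed(sentence, vocab)
    return max(cosine(v, embed(s, vocab)) for s in corpus)

def sim_distribution(s1, s2, vocab, bins=20):
    """Eq. (2): empirical distribution of MCS(s_i, S2) over sentences in S1,
    approximated by a density-normalized histogram over u in [-1, 1]."""
    values = [mcs(s, s2, vocab) for s in s1]
    hist, edges = np.histogram(values, bins=bins, range=(-1.0, 1.0), density=True)
    return hist, edges
```

Comparing the histograms returned for the sampled source subset and for the target posts, both measured against the remaining source posts, reproduces the kind of curves plotted in Fig. 1.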
$\mathrm{Sim}(u; P_t, P_s \setminus P_s')$ measures the similarities between the posts in the target domain $P_t$ and the source-domain set $P_s \setminus P_s'$. Finally, we compare these two distributions in order to compare the data distributions of $P_s$ and $P_t$. As shown in Fig. 1(a), the two distributions are very similar.

Similarly, we apply the same process to the response sentences in $D_s$ and $D_t$, calculating the following two distributions:

$$\mathrm{Sim}(u; R_s', R_s \setminus R_s') \qquad \mathrm{Sim}(u; R_t, R_s \setminus R_s') \quad (4)$$

where the set $R$ includes all the response sentences in the corresponding domain. As shown in Fig. 1(b), these two distributions are quite different; thus, responses in the target domain are dissimilar to responses in the source domain.

To conclude, we observe that the input posts are similar between the source and target domains. In other words, chatbot users may ask similar questions to two chatbots from related vertical domains. However, the responses might differ according to each chatbot's expertise. Therefore, the dissimilarity between the target and source domains is mainly reflected in their responses.

Background and Preliminaries

While this paper introduces the adaptation method as applied to the RNN-based encoder-decoder for simplicity, the method is suitable for most conversational models.

RNN-based Encoder-Decoder

The RNN-based encoder-decoder can be expressed as a model maximizing the likelihood of the output sequence given an input sequence. Suppose that we have a corpus $D = \{(x^i, y^i) \mid y^i \text{ is the response to post } x^i\}$, where $x^i$ and $y^i$ are two sequences of tokens. An RNN-based encoder-decoder is typically trained to maximize the log-likelihood:

$$\sum_{i=1}^{N} \log p(y^i \mid x^i) \quad (5)$$

where $N$ is the number of samples in the corpus. An RNN-based encoder-decoder mainly consists of two parts, an encoder and a decoder, both implemented as RNNs. The encoder converts the input sequence $x^i = (x^i_1, \ldots, x^i_{T_i})$ to a fixed-length context vector $c^i$, i.e.,

$$h^i_t = f(x^i_t, h^i_{t-1}), \qquad c^i = \psi(h^i_1, \ldots, h^i_{T_i}) \quad (6)$$

where $T_i$ is the length of the input sequence $x^i$, $h^i_t$ is the encoder hidden state at time $t$, $f$ is a non-linear function, and $\psi$ summarizes the hidden states. The context vector $c^i$ is used by the decoder to generate the output sequence $y^i = (y^i_1, \ldots, y^i_{T'_i})$. There are different ways to use $c^i$. Sutskever, Vinyals, and Le (2014) used $c^i$ as the initial hidden state $s^i_0$ of the decoder, computing the decoder hidden states as:

$$s^i_t = f(y^i_{t-1}, s^i_{t-1}) \quad (7)$$

where $s^i_t$ is the decoder hidden state at time $t$. In contrast, Cho et al. (2014b) argued that feeding $c^i$ into the input at every step helps the decoder RNN make use of the context information and improves performance:

$$s^i_t = f(y^i_{t-1}, s^i_{t-1}, c^i) \quad (8)$$

With the hidden state $s^i_t$, the target symbol $y^i_t$ at time $t$ can be predicted by: p(y^i_t | y^i_