# Unsupervised Abstractive Dialogue Summarization for Tete-a-Tetes

Xinyuan Zhang,¹ Ruiyi Zhang,² Manzil Zaheer,³ Amr Ahmed³
¹ASAPP  ²Duke University  ³Google Research
xzhang@asapp.com, ryzhang@cs.duke.edu, manzilzaheer@google.com, amra@google.com

## Abstract

High-quality dialogue-summary paired data is expensive to produce and domain-sensitive, making abstractive dialogue summarization a challenging task. In this work, we propose the first unsupervised abstractive dialogue summarization model for tete-a-tetes (SuTaT). Unlike standard text summarization, a dialogue summarization method should consider the multi-speaker scenario where the speakers have different roles, goals, and language styles. In a tete-a-tete, such as a customer-agent conversation, SuTaT aims to summarize for each speaker by modeling the customer utterances and the agent utterances separately while retaining their correlations. SuTaT consists of a conditional generative module and two unsupervised summarization modules. The conditional generative module contains two encoders and two decoders in a variational autoencoder framework in which the dependencies between the two latent spaces are captured. With the same encoders and decoders, two unsupervised summarization modules equipped with sentence-level self-attention mechanisms generate summaries without using any annotations. Experimental results show that SuTaT is superior on unsupervised dialogue summarization for both automatic and human evaluations, and is capable of dialogue classification and single-turn conversation generation.

## Introduction

Tete-a-tetes, conversations between two participants, have been widely studied as an important component of dialogue analysis. For instance, tete-a-tetes between customers and agents contain information for contact centers to understand the problems of customers and improve the solutions provided by agents. However, it is time-consuming for others to track the progress by going through long and sometimes uninformative utterances. Automatically summarizing a tete-a-tete into a shorter version while retaining its main points can save a vast amount of human resources and has a number of potential real-world applications.

Customer: I am looking for the Hamilton Lodge in Cambridge.
Agent: Sure, it is at 156 Chesterton Road, postcode cb41da.
Customer: Please book it for 2 people, 5 nights beginning on Tuesday.
Agent: Done. Your reference number is qnvdz4rt.
Customer: Thank you, I will be there on Tuesday!
Agent: Is there anything more I can assist you with today?
Customer: Thank you! That's everything I needed.
Agent: You are welcome. Any time.

Customer Summary: i would like to book a hotel in cambridge on tuesday .
Agent Summary: i have booked you a hotel . the reference number is qnvdz4rt . can i help you with anything else ?

Table 1: An example of SuTaT generated summaries.

Summarization models can be categorized into two classes: extractive and abstractive. Extractive methods select sentences or phrases from the input text, while abstractive methods attempt to generate novel expressions, which requires an advanced ability to paraphrase and condense information. Despite being easier, extractive summarization is often not preferred for dialogues because of its limited capability to capture highly dependent conversation histories and produce coherent discourses.
Therefore, abstractively summarizing dialogues has attracted recent research interest (Goo and Chen 2018; Pan et al. 2018; Yuan and Yu 2019; Liu et al. 2019). However, existing abstractive dialogue summarization approaches fail to address two main problems. First, a dialogue is carried out between multiple speakers, and each of them has different roles, goals, and language styles. Taking the example of a contact center, customers aim to describe their problems while agents aim to provide solutions, which leads them to have different semantic contents and choices of vocabulary. Most existing methods process dialogue utterances as in text summarization, without accommodating the multi-speaker scenario. Second, high-quality annotated data is not readily available in the dialogue summarization domain and can be very expensive to produce. Topic descriptions or instructions are commonly used as gold references, but these are too general and lack any information about the speakers. Moreover, some methods use auxiliary information such as dialogue acts (Goo and Chen 2018), semantic scaffolds (Yuan and Yu 2019), and key point sequences (Liu et al. 2019) to help with summarization, adding more burden on data annotation. To our knowledge, no previous work has focused on unsupervised deep learning for abstractive dialogue summarization.

We propose SuTaT, an unsupervised abstractive dialogue summarization approach specifically for tete-a-tetes. In this paper, we use the example of an agent and a customer to represent the two speakers in a tete-a-tete for better understanding. In addition to summarization, SuTaT can also be used for dialogue classification and single-turn conversation generation.

To accommodate the two-speaker scenario, SuTaT processes the utterances of a customer and an agent separately in a conditional generative module. Inspired by Zhang et al. (2019), where two latent spaces are contained in one variational autoencoder (VAE) framework, the conditional generative module includes two encoders to map a customer utterance and the corresponding agent utterance into two latent representations, and two decoders to reconstruct the utterances jointly. Separate encoders and decoders enable SuTaT to model the differences in language styles and vocabularies between customer utterances and agent utterances. The dependencies between the two latent spaces are captured by making the agent latent variable conditioned on the customer latent variable. Compared to using two standard autoencoders that learn deterministic representations for input utterances, using the VAE-based conditional generative module to learn variational distributions gives the model more expressive capacity and more flexibility to find the correlation between the two latent spaces.

The same encoders and decoders from the conditional generative module are used in two unsupervised summarization modules to generate customer summaries and agent summaries. Unlike MeanSum (Chu and Liu 2019), where the combined multi-document representation is simply computed by averaging the encoded input texts, SuTaT employs a sentence-level self-attention mechanism (Vaswani et al. 2017) to highlight more significant utterances and neglect uninformative ones. We also incorporate a mechanism for copying factual details from the source text, which has proven useful in supervised summarization (See, Liu, and Manning 2017).
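For illustration, the sketch below shows one way such sentence-level self-attention pooling over encoded utterances could be implemented in PyTorch. The choice of `nn.MultiheadAttention`, the number of heads, and the final mean pooling are our own assumptions for a minimal example, not details taken from the paper.

```python
import torch
import torch.nn as nn


class SentenceLevelSelfAttention(nn.Module):
    """Re-weights the encoded utterances of one speaker and pools them into a
    single combined representation, instead of plain averaging as in MeanSum."""

    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, utterance_embs, padding_mask=None):
        # utterance_embs: (batch, n_utterances, d_model), one row per utterance
        attended, weights = self.attn(
            utterance_embs, utterance_embs, utterance_embs,
            key_padding_mask=padding_mask,
        )
        # combine the re-weighted utterance embeddings into one vector
        pooled = attended.mean(dim=1)
        return pooled, weights  # weights indicate which utterances mattered


# toy usage: 8 encoded customer utterances of dimension 256 for one dialogue
pooling = SentenceLevelSelfAttention(d_model=256)
dialogue = torch.randn(1, 8, 256)
summary_repr, attn_weights = pooling(dialogue)
```

In this reading, the attention weights play the role of highlighting informative utterances, and the pooled vector serves as the input from which the summary representation is drawn.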
Dialogue summaries are usually written from the third-person point of view, but SuTaT simplifies this problem by keeping the summaries consistent with the utterances in their pronouns. Table 1 shows an example of SuTaT generated summaries.

Experiments are conducted on two dialogue datasets: MultiWOZ (Budzianowski et al. 2018) and Taskmaster (Byrne et al. 2019). We assume access only to the utterances in the datasets, without any annotations such as dialogue acts, descriptions, or instructions. Both automatic and human evaluations show that SuTaT outperforms other unsupervised baseline methods on dialogue summarization. We further demonstrate the capability of SuTaT on dialogue classification with generated summaries and on single-turn conversation generation.

## Methodology

SuTaT consists of a conditional generative module and two unsupervised summarization modules. Let $X = \{x_1, \ldots, x_n\}$ denote a set of customer utterances and $Y = \{y_1, \ldots, y_n\}$ denote the set of agent utterances in the same dialogue. Our aim is to generate a customer summary and an agent summary for the utterances in $X$ and $Y$. Figure 1 shows the entire architecture of SuTaT.

[Figure 1: Block diagram of SuTaT. Architectures connected by a blue dashed line are the same. The red arrow represents the conditional relationship between the two latent spaces.]

Given a customer utterance $x$ and its consecutive agent utterance $y$, the conditional generative module embeds them with two encoders, obtains latent variables $z_x$ and $z_y$ from the variational latent spaces, and then reconstructs the utterances from $z_x$ and $z_y$ with two decoders. In the latent space, the agent latent variable is conditioned on the customer latent variable; during decoding, the generated customer utterances are conditioned on the generated agent utterances. This design resembles how a tete-a-tete is carried out: the agent's responses and the customer's requests are dependent on each other.

The encoded utterances of a dialogue are the inputs of the unsupervised summarization modules. We employ a sentence-level self-attention mechanism on the utterance embeddings to highlight the more informative ones and combine the weighted embeddings. A summary representation is drawn from the low-variance latent space using the combined utterance embedding, and is then decoded into a summary with the same decoder and a partial copy mechanism. The whole process does not require any annotations from the data.

### Conditional Generative Module

We build the conditional generative module in a SIVAE-based framework (Zhang et al. 2019) to capture the dependencies between the two latent spaces. The goal of the module is to train two encoders and two decoders for customer utterances $x$ and agent utterances $y$ by maximizing the evidence lower bound

$$
\begin{aligned}
\mathcal{L}_{\mathrm{gen}} ={} & \mathbb{E}_{q(z_x \mid x)}\!\left[\log p(x \mid y, z_x)\right] - \mathrm{KL}\!\left[q(z_x \mid x)\,\|\,p(z_x)\right] \\
& + \mathbb{E}_{q(z_y \mid y, z_x)}\!\left[\log p(y \mid z_y)\right] - \mathrm{KL}\!\left[q(z_y \mid y, z_x)\,\|\,p(z_y \mid z_x)\right] \\
\le{} & \log p(x, y),
\end{aligned} \tag{1}
$$

where $q(\cdot)$ is the variational posterior distribution that approximates the true posterior distribution. The lower bound includes two reconstruction losses and two Kullback-Leibler (KL) divergences between the priors and the variational posteriors. By assuming the priors and posteriors to be Gaussian, we can apply the reparameterization trick (Kingma and Welling 2014) and compute the KL divergences in closed form. Here $q(z_x \mid x)$, $q(z_y \mid y, z_x)$, $p(x \mid y, z_x)$, and $p(y \mid z_y)$ represent the customer encoder, agent encoder, customer decoder, and agent decoder, respectively. The correlation between the two latent spaces is captured by making the agent latent variable $z_y$ conditioned on the customer latent variable $z_x$. We define the customer prior $p(z_x)$ to be a standard Gaussian $\mathcal{N}(0, I)$, and the agent prior $p(z_y \mid z_x)$ to be a Gaussian $\mathcal{N}(\mu, \Sigma)$ whose mean and covariance are functions of $z_x$:

$$
\mu = \mathrm{MLP}_{\mu}(z_x), \qquad \Sigma = \mathrm{MLP}_{\Sigma}(z_x).
$$

This process resembles how a tete-a-tete at a contact center is carried out: the response of the agent is conditioned on what the customer says.
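For concreteness, the following is a minimal PyTorch sketch of how the objective in Eq. (1) could be computed. The encoder and decoder interfaces, the layer sizes, and the concatenation $[e_y; z_x]$ for the agent posterior are our own assumptions for illustration; they are not the authors' released implementation.

```python
import torch
import torch.nn as nn


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=-1,
    )


def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (Kingma and Welling 2014)."""
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()


class ConditionalGenerativeModule(nn.Module):
    """Two encoders and two decoders; z_y is conditioned on z_x, as in Eq. (1)."""

    def __init__(self, enc_x, enc_y, dec_x, dec_y, d_hidden, d_latent):
        super().__init__()
        # placeholder networks: encoders return utterance embeddings e_x, e_y;
        # decoders return (token-level NLL, decoded sequence) given their inputs
        self.enc_x, self.enc_y = enc_x, enc_y
        self.dec_x, self.dec_y = dec_x, dec_y
        # posterior q(z_x | x) from e_x
        self.mu_x = nn.Linear(d_hidden, d_latent)
        self.logvar_x = nn.Linear(d_hidden, d_latent)
        # posterior q(z_y | y, z_x) from the concatenation [e_y ; z_x] (assumed)
        self.mu_y = nn.Linear(d_hidden + d_latent, d_latent)
        self.logvar_y = nn.Linear(d_hidden + d_latent, d_latent)
        # conditional prior p(z_y | z_x) = N(MLP_mu(z_x), MLP_sigma(z_x))
        self.prior_mu = nn.Linear(d_latent, d_latent)
        self.prior_logvar = nn.Linear(d_latent, d_latent)

    def forward(self, x, y):
        e_x, e_y = self.enc_x(x), self.enc_y(y)

        # q(z_x | x) against the standard Gaussian prior p(z_x) = N(0, I)
        mu_x, logvar_x = self.mu_x(e_x), self.logvar_x(e_x)
        z_x = reparameterize(mu_x, logvar_x)
        kl_x = gaussian_kl(mu_x, logvar_x,
                           torch.zeros_like(mu_x), torch.zeros_like(logvar_x))

        # q(z_y | y, z_x) against the conditional prior p(z_y | z_x)
        e_y_zx = torch.cat([e_y, z_x], dim=-1)
        mu_y, logvar_y = self.mu_y(e_y_zx), self.logvar_y(e_y_zx)
        z_y = reparameterize(mu_y, logvar_y)
        kl_y = gaussian_kl(mu_y, logvar_y,
                           self.prior_mu(z_x), self.prior_logvar(z_x))

        # reconstruction: decode y from z_y, then x conditioned on the decoded y and z_x
        nll_y, y_hat = self.dec_y(y, z_y)
        nll_x, _ = self.dec_x(x, y_hat, z_x)

        # negative of the lower bound in Eq. (1), to be minimized
        return (nll_x + nll_y + kl_x + kl_y).mean()
```

The two KL terms correspond directly to the two divergences in Eq. (1), and the two negative log-likelihood terms correspond to the reconstruction expectations.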
**Encoding.** Given a customer utterance sequence $x = \{w_1, \ldots, w_t\}$, we first encode it into an utterance embedding $e_x$ using a bidirectional LSTM (Graves, Jaitly, and Mohamed 2013) or a Transformer encoder (Vaswani et al. 2017). The Bi-LSTM takes the hidden states $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$ as contextual representations by processing the sequence in both directions,

$$
\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(w_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(w_i, \overleftarrow{h}_{i+1}).
$$

The Transformer encoder produces contextual representations that have the same dimensions as the word embeddings,

$$
\{\tilde{w}_1, \ldots, \tilde{w}_t\} = \mathrm{TransEnc}(\{w_1, \ldots, w_t\}).
$$

The customer utterance embedding $e_x$ is obtained by averaging over the contextual representations. Similarly, we obtain the agent utterance embedding $e_y$. The customer latent variable $z_x$ is first sampled from $q(z_x \mid x) = \mathcal{N}(\mu_x, \Sigma_x)$ using $e_x$; then the agent latent variable $z_y$ is sampled from $q(z_y \mid y, z_x) = \mathcal{N}(\mu_y, \Sigma_y)$ using $e_y$ and $z_x$. The Gaussian parameters $\mu_x$, $\Sigma_x$, $\mu_y$, and $\Sigma_y$ are computed with separate linear projections,

$$
\begin{aligned}
\mu_x &= \mathrm{Linear}_{\mu_x}(e_x), & \mu_y &= \mathrm{Linear}_{\mu_y}([e_y; z_x]), \\
\Sigma_x &= \mathrm{Linear}_{\Sigma_x}(e_x), & \Sigma_y &= \mathrm{Linear}_{\Sigma_y}([e_y; z_x]).
\end{aligned}
$$

**Decoding.** We first decode $z_y$ into the agent utterance from $p(y \mid z_y)$ using an LSTM (Sutskever, Vinyals, and Le 2014) or a Transformer decoder (Vaswani et al. 2017). The decoded sequence and the latent variable $z_x$ are then used in $p(x \mid y, z_x)$ to generate the customer utterance. In the LSTM decoder,

$$
v^{(i)}_y = \mathrm{LSTM}(y_{i-1}, z_y, v^{(i-1)}_y), \qquad v^{(i)}_x = \mathrm{LSTM}(x_{i-1}, [z_x; y], v^{(i-1)}_x),
$$

while in the Transformer decoder, $v^{(i)}_y = \mathrm{TransDec}(y$