# Unsupervised Abstractive Dialogue Summarization for Tete-a-Tetes

Xinyuan Zhang,¹ Ruiyi Zhang,² Manzil Zaheer,³ Amr Ahmed³
¹ASAPP  ²Duke University  ³Google Research
xzhang@asapp.com, ryzhang@cs.duke.edu, manzilzaheer@google.com, amra@google.com

## Abstract

High-quality dialogue-summary paired data is expensive to produce and domain-sensitive, making abstractive dialogue summarization a challenging task. In this work, we propose the first unsupervised abstractive dialogue summarization model for tete-a-tetes (SuTaT). Unlike standard text summarization, a dialogue summarization method should consider the multi-speaker scenario where the speakers have different roles, goals, and language styles. In a tete-a-tete, such as a customer-agent conversation, SuTaT aims to summarize for each speaker by modeling the customer utterances and the agent utterances separately while retaining their correlations. SuTaT consists of a conditional generative module and two unsupervised summarization modules. The conditional generative module contains two encoders and two decoders in a variational autoencoder framework in which the dependencies between the two latent spaces are captured. With the same encoders and decoders, two unsupervised summarization modules equipped with sentence-level self-attention mechanisms generate summaries without using any annotations. Experimental results show that SuTaT is superior on unsupervised dialogue summarization for both automatic and human evaluations, and is capable of dialogue classification and single-turn conversation generation.

## Introduction

Tete-a-tetes, conversations between two participants, have been widely studied as an important component of dialogue analysis. For instance, tete-a-tetes between customers and agents contain information for contact centers to understand the problems of customers and improve the solutions provided by agents. However, it is time-consuming for others to track the progress by going through long and sometimes uninformative utterances. Automatically summarizing a tete-a-tete into a shorter version while retaining its main points can save a vast amount of human resources and has a number of potential real-world applications.

Customer: I am looking for the Hamilton Lodge in Cambridge.
Agent: Sure, it is at 156 Chesterton Road, postcode cb41da.
Customer: Please book it for 2 people, 5 nights beginning on Tuesday.
Agent: Done. Your reference number is qnvdz4rt.
Customer: Thank you, I will be there on Tuesday!
Agent: Is there anything more I can assist you with today?
Customer: Thank you! That's everything I needed.
Agent: You are welcome. Any time.

Customer Summary: i would like to book a hotel in cambridge on tuesday .
Agent Summary: i have booked you a hotel . the reference number is qnvdz4rt . can i help you with anything else ?

Table 1: An example of SuTaT generated summaries.

Summarization models can be categorized into two classes: extractive and abstractive. Extractive methods select sentences or phrases from the input text, while abstractive methods attempt to generate novel expressions, which requires an advanced ability to paraphrase and condense information. Despite being easier, extractive summarization is often not preferred for dialogues because of its limited capability to capture highly dependent conversation histories and produce coherent discourses.
Therefore, abstractively summarizing dialogues has attracted recent research interest (Goo and Chen 2018; Pan et al. 2018; Yuan and Yu 2019; Liu et al. 2019). However, existing abstractive dialogue summarization approaches fail to address two main problems. First, a dialogue is carried out between multiple speakers, and each of them has different roles, goals, and language styles. Taking the example of a contact center, customers aim to describe their problems while agents aim to provide solutions, which leads them to have different semantic contents and choices of vocabulary. Most existing methods process dialogue utterances as in text summarization, without accommodating the multi-speaker scenario. Second, high-quality annotated data is not readily available in the dialogue summarization domain and can be very expensive to produce. Topic descriptions or instructions are commonly used as gold references, but these are too general and lack any information about the speakers. Moreover, some methods use auxiliary information such as dialogue acts (Goo and Chen 2018), semantic scaffolds (Yuan and Yu 2019), and key point sequences (Liu et al. 2019) to help with summarization, adding more burden on data annotation. To our knowledge, no previous work has focused on unsupervised deep learning for abstractive dialogue summarization.

We propose SuTaT, an unsupervised abstractive dialogue summarization approach specifically for tete-a-tetes. In this paper, we use the example of an agent and a customer to represent the two speakers in a tete-a-tete for better understanding. In addition to summarization, SuTaT can also be used for dialogue classification and single-turn conversation generation.

To accommodate the two-speaker scenario, SuTaT processes the utterances of a customer and an agent separately in a conditional generative module. Inspired by Zhang et al. (2019), where two latent spaces are contained in one variational autoencoder (VAE) framework, the conditional generative module includes two encoders to map a customer utterance and the corresponding agent utterance into two latent representations, and two decoders to reconstruct the utterances jointly. Separate encoders and decoders enable SuTaT to model the differences in language styles and vocabularies between customer utterances and agent utterances. The dependencies between the two latent spaces are captured by making the agent latent variable conditioned on the customer latent variable. Compared to using two standard autoencoders that learn deterministic representations for input utterances, using the VAE-based conditional generative module to learn variational distributions gives the model more expressive capacity and more flexibility to find the correlation between the two latent spaces.

The same encoders and decoders from the conditional generative module are used in two unsupervised summarization modules to generate customer summaries and agent summaries. Unlike MeanSum (Chu and Liu 2019), where the combined multi-document representation is simply computed by averaging the encoded input texts, SuTaT employs a sentence-level self-attention mechanism (Vaswani et al. 2017) to highlight more significant utterances and neglect uninformative ones. We also incorporate a mechanism for copying factual details from the source text, which has proven useful in supervised summarization (See, Liu, and Manning 2017).
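For illustration, the sketch below shows one way such sentence-level self-attention pooling over encoded utterances could be implemented in PyTorch. The choice of `nn.MultiheadAttention`, the number of heads, and the final mean pooling are our own assumptions for a minimal example, not details taken from the paper.

```python
import torch
import torch.nn as nn


class SentenceLevelSelfAttention(nn.Module):
    """Re-weights the encoded utterances of one speaker and pools them into a
    single combined representation, instead of plain averaging as in MeanSum."""

    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, utterance_embs, padding_mask=None):
        # utterance_embs: (batch, n_utterances, d_model), one row per utterance
        attended, weights = self.attn(
            utterance_embs, utterance_embs, utterance_embs,
            key_padding_mask=padding_mask,
        )
        # combine the re-weighted utterance embeddings into one vector
        pooled = attended.mean(dim=1)
        return pooled, weights  # weights indicate which utterances mattered


# toy usage: 8 encoded customer utterances of dimension 256 for one dialogue
pooling = SentenceLevelSelfAttention(d_model=256)
dialogue = torch.randn(1, 8, 256)
summary_repr, attn_weights = pooling(dialogue)
```

In this reading, the attention weights play the role of highlighting informative utterances, and the pooled vector serves as the input from which the summary representation is drawn.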
Dialogue summaries are usually written from the third-person point of view, but SuTaT simplifies this problem by keeping the summaries consistent with the utterances in their pronouns. Table 1 shows an example of SuTaT generated summaries.

Experiments are conducted on two dialogue datasets: MultiWOZ (Budzianowski et al. 2018) and Taskmaster (Byrne et al. 2019). We assume access only to the utterances in the datasets, without any annotations such as dialogue acts, descriptions, or instructions. Both automatic and human evaluations show that SuTaT outperforms other unsupervised baseline methods on dialogue summarization. We further demonstrate the capability of SuTaT on dialogue classification with generated summaries and on single-turn conversation generation.

## Methodology

SuTaT consists of a conditional generative module and two unsupervised summarization modules. Let $X = \{x_1, \ldots, x_n\}$ denote a set of customer utterances and $Y = \{y_1, \ldots, y_n\}$ denote the set of agent utterances in the same dialogue. Our aim is to generate a customer summary and an agent summary for the utterances in $X$ and $Y$. Figure 1 shows the entire architecture of SuTaT.

[Figure 1: Block diagram of SuTaT. Architectures connected by a blue dashed line are the same. The red arrow represents the conditional relationship between the two latent spaces.]

Given a customer utterance $x$ and its consecutive agent utterance $y$, the conditional generative module embeds them with two encoders, obtains latent variables $z_x$ and $z_y$ from the variational latent spaces, and then reconstructs the utterances from $z_x$ and $z_y$ with two decoders. In the latent space, the agent latent variable is conditioned on the customer latent variable; during decoding, the generated customer utterances are conditioned on the generated agent utterances. This design resembles how a tete-a-tete is carried out: the agent's responses and the customer's requests are dependent on each other.

The encoded utterances of a dialogue are the inputs of the unsupervised summarization modules. We employ a sentence-level self-attention mechanism on the utterance embeddings to highlight the more informative ones and combine the weighted embeddings. A summary representation is drawn from the low-variance latent space using the combined utterance embedding, and is then decoded into a summary with the same decoder and a partial copy mechanism. The whole process does not require any annotations from the data.

### Conditional Generative Module

We build the conditional generative module in a SIVAE-based framework (Zhang et al. 2019) to capture the dependencies between the two latent spaces. The goal of the module is to train two encoders and two decoders for customer utterances $x$ and agent utterances $y$ by maximizing the evidence lower bound

$$
\begin{aligned}
\mathcal{L}_{\mathrm{gen}} ={} & \mathbb{E}_{q(z_x \mid x)}\!\left[\log p(x \mid y, z_x)\right] - \mathrm{KL}\!\left[q(z_x \mid x)\,\|\,p(z_x)\right] \\
& + \mathbb{E}_{q(z_y \mid y, z_x)}\!\left[\log p(y \mid z_y)\right] - \mathrm{KL}\!\left[q(z_y \mid y, z_x)\,\|\,p(z_y \mid z_x)\right] \\
\le{} & \log p(x, y),
\end{aligned} \tag{1}
$$

where $q(\cdot)$ is the variational posterior distribution that approximates the true posterior distribution. The lower bound includes two reconstruction losses and two Kullback-Leibler (KL) divergences between the priors and the variational posteriors. By assuming the priors and posteriors to be Gaussian, we can apply the reparameterization trick (Kingma and Welling 2014) and compute the KL divergences in closed form. Here $q(z_x \mid x)$, $q(z_y \mid y, z_x)$, $p(x \mid y, z_x)$, and $p(y \mid z_y)$ represent the customer encoder, agent encoder, customer decoder, and agent decoder, respectively. The correlation between the two latent spaces is captured by making the agent latent variable $z_y$ conditioned on the customer latent variable $z_x$. We define the customer prior $p(z_x)$ to be a standard Gaussian $\mathcal{N}(0, I)$, and the agent prior $p(z_y \mid z_x)$ to be a Gaussian $\mathcal{N}(\mu, \Sigma)$ whose mean and covariance are functions of $z_x$:

$$
\mu = \mathrm{MLP}_{\mu}(z_x), \qquad \Sigma = \mathrm{MLP}_{\Sigma}(z_x).
$$

This process resembles how a tete-a-tete at a contact center is carried out: the response of the agent is conditioned on what the customer says.
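For concreteness, the following is a minimal PyTorch sketch of how the objective in Eq. (1) could be computed. The encoder and decoder interfaces, the layer sizes, and the concatenation $[e_y; z_x]$ for the agent posterior are our own assumptions for illustration; they are not the authors' released implementation.

```python
import torch
import torch.nn as nn


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=-1,
    )


def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (Kingma and Welling 2014)."""
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()


class ConditionalGenerativeModule(nn.Module):
    """Two encoders and two decoders; z_y is conditioned on z_x, as in Eq. (1)."""

    def __init__(self, enc_x, enc_y, dec_x, dec_y, d_hidden, d_latent):
        super().__init__()
        # placeholder networks: encoders return utterance embeddings e_x, e_y;
        # decoders return (token-level NLL, decoded sequence) given their inputs
        self.enc_x, self.enc_y = enc_x, enc_y
        self.dec_x, self.dec_y = dec_x, dec_y
        # posterior q(z_x | x) from e_x
        self.mu_x = nn.Linear(d_hidden, d_latent)
        self.logvar_x = nn.Linear(d_hidden, d_latent)
        # posterior q(z_y | y, z_x) from the concatenation [e_y ; z_x] (assumed)
        self.mu_y = nn.Linear(d_hidden + d_latent, d_latent)
        self.logvar_y = nn.Linear(d_hidden + d_latent, d_latent)
        # conditional prior p(z_y | z_x) = N(MLP_mu(z_x), MLP_sigma(z_x))
        self.prior_mu = nn.Linear(d_latent, d_latent)
        self.prior_logvar = nn.Linear(d_latent, d_latent)

    def forward(self, x, y):
        e_x, e_y = self.enc_x(x), self.enc_y(y)

        # q(z_x | x) against the standard Gaussian prior p(z_x) = N(0, I)
        mu_x, logvar_x = self.mu_x(e_x), self.logvar_x(e_x)
        z_x = reparameterize(mu_x, logvar_x)
        kl_x = gaussian_kl(mu_x, logvar_x,
                           torch.zeros_like(mu_x), torch.zeros_like(logvar_x))

        # q(z_y | y, z_x) against the conditional prior p(z_y | z_x)
        e_y_zx = torch.cat([e_y, z_x], dim=-1)
        mu_y, logvar_y = self.mu_y(e_y_zx), self.logvar_y(e_y_zx)
        z_y = reparameterize(mu_y, logvar_y)
        kl_y = gaussian_kl(mu_y, logvar_y,
                           self.prior_mu(z_x), self.prior_logvar(z_x))

        # reconstruction: decode y from z_y, then x conditioned on the decoded y and z_x
        nll_y, y_hat = self.dec_y(y, z_y)
        nll_x, _ = self.dec_x(x, y_hat, z_x)

        # negative of the lower bound in Eq. (1), to be minimized
        return (nll_x + nll_y + kl_x + kl_y).mean()
```

The two KL terms correspond directly to the two divergences in Eq. (1), and the two negative log-likelihood terms correspond to the reconstruction expectations.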
**Encoding.** Given a customer utterance sequence $x = \{w_1, \ldots, w_t\}$, we first encode it into an utterance embedding $e_x$ using a bidirectional LSTM (Graves, Jaitly, and Mohamed 2013) or a Transformer encoder (Vaswani et al. 2017). The Bi-LSTM takes the hidden states $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$ as contextual representations by processing the sequence in both directions,

$$
\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(w_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(w_i, \overleftarrow{h}_{i+1}).
$$

The Transformer encoder produces contextual representations that have the same dimensions as the word embeddings,

$$
\{\tilde{w}_1, \ldots, \tilde{w}_t\} = \mathrm{TransEnc}(\{w_1, \ldots, w_t\}).
$$

The customer utterance embedding $e_x$ is obtained by averaging over the contextual representations. Similarly, we obtain the agent utterance embedding $e_y$. The customer latent variable $z_x$ is first sampled from $q(z_x \mid x) = \mathcal{N}(\mu_x, \Sigma_x)$ using $e_x$; then the agent latent variable $z_y$ is sampled from $q(z_y \mid y, z_x) = \mathcal{N}(\mu_y, \Sigma_y)$ using $e_y$ and $z_x$. The Gaussian parameters $\mu_x$, $\Sigma_x$, $\mu_y$, and $\Sigma_y$ are computed with separate linear projections,

$$
\begin{aligned}
\mu_x &= \mathrm{Linear}_{\mu_x}(e_x), & \mu_y &= \mathrm{Linear}_{\mu_y}([e_y; z_x]), \\
\Sigma_x &= \mathrm{Linear}_{\Sigma_x}(e_x), & \Sigma_y &= \mathrm{Linear}_{\Sigma_y}([e_y; z_x]).
\end{aligned}
$$

**Decoding.** We first decode $z_y$ into the agent utterance from $p(y \mid z_y)$ using an LSTM (Sutskever, Vinyals, and Le 2014) or a Transformer decoder (Vaswani et al. 2017). The decoded sequence and the latent variable $z_x$ are then used in $p(x \mid y, z_x)$ to generate the customer utterance. In the LSTM decoder,

$$
v^{(i)}_y = \mathrm{LSTM}(y_{i-1}, z_y, v^{(i-1)}_y), \qquad v^{(i)}_x = \mathrm{LSTM}(x_{i-1}, [z_x; y], v^{(i-1)}_x),
$$

while in the Transformer decoder, $v^{(i)}_y = \mathrm{TransDec}(y$