Dialogue Rewriting via Skeleton-Guided Generation

Chunlei Xin1,3, Hongyu Lin1,*, Shan Wu1,3, Xianpei Han1,2, Bo Chen1,5,6, Wen Dai4, Shuai Chen4, Bin Wang4, Le Sun1,2,*

1Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
2State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
3University of Chinese Academy of Sciences, Beijing, China
4Xiaomi AI Lab, Xiaomi Inc., Beijing, China
5School of Information Engineering, Minzu University of China, Beijing, China
6National Language Resources Monitoring and Research Center for Minority Languages, Beijing, China
{chunlei2021, hongyu, wushan2018, xianpei, chenbo, sunle}@iscas.ac.cn
{daiwen, chenshuai3, wangbin11}@xiaomi.com

Abstract

Dialogue rewriting aims to transform multi-turn, context-dependent dialogues into well-formed, context-independent text that most NLP systems can consume. Previous dialogue rewriting benchmarks and systems assume a fluent and informative utterance to rewrite. Unfortunately, dialogue utterances from real-world systems are frequently noisy and contain various kinds of errors that can make them almost uninformative. In this paper, we first present the Real-world Dialogue Rewriting Corpus (RealDia), a new benchmark to evaluate how well current dialogue rewriting systems can deal with real-world noisy and uninformative dialogue utterances. RealDia contains annotated multi-turn dialogues from real scenes with ASR errors, spelling errors, redundancies and other kinds of noise that are ignored by previous dialogue rewriting benchmarks. We show that previous dialogue rewriting approaches are neither effective nor data-efficient in resolving RealDia. This paper then presents the Skeleton-Guided Rewriter (SGR), which resolves the task of dialogue rewriting via a skeleton-guided generation paradigm.
Experiments show that RealDia is a much more challenging benchmark for real-world dialogue rewriting, and that SGR can effectively resolve the task, outperforming previous approaches by a large margin.

Introduction

Dialogue is the primary mechanism for human interaction, and it is multi-turn, context-dependent, and frequently informal (Pangaro and Dubberly 2014). People tend to produce brief, fragmented utterances in dialogues rather than the longer, complete sentences found in normal documents (Carbonell 1983). Therefore, many utterances in dialogues can only be understood when they are placed in the entire dialogue context. Unfortunately, most NLP systems are designed for well-formed, context-independent texts, which makes them ill-suited to informal and context-dependent dialogues.

To narrow the gap, current studies mostly focus on developing dialogue-specialized paradigms, which expand specific downstream tasks from single-sentence inputs to multi-turn dialogue inputs and design complicated mechanisms that take richer context information into consideration (Wu et al. 2017; Zhang et al. 2019; Wang et al. 2021). However, dialogue-specialized models are usually hard to design, and it is time-consuming and unacceptably costly to construct them for various downstream tasks. Therefore, how to more effectively deal with noisy, heavily context-dependent utterances in dialogues is a critical challenge for dialogue understanding. Recently, the task of dialogue rewriting was proposed to resolve this challenge.

* Corresponding authors.
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: An example showing how dialogue rewriting handles multi-turn dialogues. Key information restored from the dialogue is marked in blue, and the automatic speech recognition error and its correction are in red.
Dialogue rewriting transforms inter-dependent and informal utterances in multi-turn dialogues into well-formed, context-independent and semantically complete sentences, so that dialogue utterances can be seamlessly restructured into well-formed inputs for current NLP systems. For example, given the utterance "Its manufacturer?" (S2 in Figure 1), dialogue rewriting models need to distill the relevant information "the car" and "the max horsepower", and output the sentence "What is the manufacturer of the car with the max horsepower?", which accurately expresses the speaker's intent of the original utterance under the dialogue context.

The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)

Figure 2: Overall architecture of the proposed Skeleton-Guided Rewriter. The sentence that needs to be rewritten is shown in bold and the golden reference is marked in red. (The figure shows the original dialogue S1–S4, the Skeleton Extractor producing skeleton spans C1: "car", C2: "max MPG", C3: "manufacturer", the Skeleton-Guided Generator producing rewritten candidates, and the Skeleton-aware Reranker scoring them, with "What is the manufacturer of the car with the max MPG" selected.)

Previous dialogue rewriting benchmarks assume a fluent and informative utterance to rewrite. However, dialogue utterances from real-world systems are frequently informal, noisy, and contain various kinds of errors that can make them uninformative.
For example, many dialogues transcribed from speech contain automatic speech recognition (ASR) errors, and dialogues from user textual inputs frequently contain spelling errors. For the instance in Figure 1, the ASR error "main factor" can make the utterance incomprehensible and uninformative, and it is impossible to directly obtain the correct user intent by rewriting this utterance alone. As shown in Table 1, previous benchmarks either filter out dialogues with errors and only consider fluent and informative dialogues (Su et al. 2019; Elgohary, Peskov, and Boyd-Graber 2019; Quan et al. 2019), or ignore the errors and allow the output to still contain them (Pan et al. 2019). Unfortunately, the former choice does not match the nature of real-world dialogue rewriting, while the latter fails to meet the requirements of downstream NLP systems. Consequently, current dialogue rewriting benchmarks and systems cannot fully meet the requirements of dialogue rewriting in real-world applications.

In this paper, we present the Real-world Dialogue Rewriting Corpus (RealDia), a new dialogue rewriting benchmark collected from real-world dialogues. Compared with previous dialogue rewriting benchmarks, RealDia annotates open-domain, multi-turn dialogues from real scenes with ASR errors, spelling errors, redundancies and other kinds of noise. Dialogue rewriting models need not only to deal with ellipsis and co-reference, as in previous dialogue rewriting benchmarks, but also to infer the underlying semantics of uninformative utterances based on the dialogue context. Furthermore, model outputs should be natural language expressions with high fluency, high coverage and high consistency that can meet the input standards of various downstream NLP systems. Such requirements pose remarkable challenges to previous dialogue rewriting systems, as they are designed on the assumption of rewriting fluent and informative utterances.
As a result, previous approaches cannot achieve desirable performance on the real-world dialogues in RealDia.

To this end, we further propose the Skeleton-Guided Rewriter (SGR), which can effectively and efficiently resolve real-world dialogue rewriting using a skeleton-guided generation framework. The main idea behind SGR is to ask the model to pay more attention to critical information (i.e., the skeleton) across the entire dialogue session, which reduces the dependence on the utterance itself and thereby addresses the uninformative utterance challenge. Figure 2 shows the overall architecture of SGR, which contains three critical steps: 1) Dialogue Skeleton Extraction, which identifies the key information that needs to be covered in the rewritten sentence; 2) Skeleton-Guided Generation, which generates fluent candidate sentences under the guidance of the skeleton; 3) Skeleton-aware Reranking, which selects the best candidate by measuring the fluency, coverage and semantic consistency of each candidate with the original dialogue.

We evaluate the effectiveness of SGR on both RealDia and previous dialogue rewriting benchmarks. Experiments show that previous approaches cannot effectively solve the task of real-world dialogue rewriting on RealDia because of their inherent informative utterance assumption. By leveraging the skeleton across the entire dialogue, SGR can simultaneously resolve ellipsis, co-reference, redundancies and various kinds of errors, and therefore achieves state-of-the-art performance on both RealDia and previous dialogue rewriting benchmarks.

Generally speaking, the main contributions of this paper can be summarized as follows:

- We construct RealDia, a new dialogue rewriting benchmark collected from real-world dialogues to evaluate how well current systems can rewrite real-world noisy and uninformative dialogue utterances.
To the best of our knowledge, RealDia is the first dialogue rewriting benchmark that considers uninformative utterances and various kinds of errors in real-world dialogues.

- We design the Skeleton-Guided Rewriter (SGR), a skeleton-guided generation framework that can effectively and efficiently rewrite uninformative utterances in real-world dialogues. To the best of our knowledge, SGR is the first work that attempts to leverage a skeleton-guided generation framework to better resolve the uninformative utterance challenge in dialogue rewriting.

- Experiments show that RealDia is much more challenging than previous dialogue rewriting benchmarks, and SGR achieves state-of-the-art performance on both RealDia and previous benchmarks. This demonstrates the necessity of RealDia and the effectiveness of SGR.

Dataset       Fluent Utterances  Informative Utterances  Legal Sentence as Result
TASK          Yes                Yes                     Yes
CANARD        Yes                Yes                     Yes
REWRITE       Yes                Yes                     Yes
Restoration   No Limit           No Limit                No Limit
RealDia       No Limit           No Limit                Yes

Table 1: Comparison between RealDia and existing dialogue rewriting benchmarks. Previous benchmarks either only collected fluent and informative dialogues, or ignored noise and allowed outputs to remain illegal.

Related Work

Dialogue rewriting aims to transform inter-dependent and informal utterances in multi-turn dialogues into well-formed, context-independent and semantically complete sentences. Along this line, Su et al. (2019) collect a rewriting dataset for co-reference resolution and information completion in multi-turn dialogues and propose a pointer-based rewriter. Pan et al. (2019) collect the Restoration-200K dataset and propose a cascaded pick-and-combine model. Quan et al. (2019) construct a dataset with both ellipsis and co-reference annotation and propose an end-to-end generative resolution model. Liu et al. (2020) formulate incomplete utterance rewriting as a semantic segmentation task. Mele et al.
(2021) propose adaptive utterance rewriting strategies for better conversational information retrieval. Hao et al. (2021) propose a tagging-based approach that predicts edit actions to rewrite incomplete utterances, based on which Jin et al. (2022) propose a hierarchical context tagger to expand coverage and shrink the search space. Recently, Si, Zeng, and Chang (2022) propose a query-enhanced network, which consists of a query template construction module and an edit operation scoring network. Inoue et al. (2022) jointly optimize picking important tokens and generating rewritten utterances.

Although dialogue rewriting has recently attracted wide attention, previous dialogue rewriting benchmarks assume a fluent and informative utterance to rewrite, and therefore the main issues they focus on are omissions and co-references. Unfortunately, in real-world applications, dialogues are severely noisy and contain various kinds of errors that can make their utterances uninformative. Previous benchmarks, as shown in Table 1, either filtered out dialogues with errors (e.g., TASK (Quan et al. 2019), CANARD (Elgohary, Peskov, and Boyd-Graber 2019) and REWRITE (Su et al. 2019)), or ignored this issue and allowed references to retain the errors (e.g., Restoration-200K (Pan et al. 2019)). However, the former choice does not match dialogues from the real world, because these errors are almost everywhere in real-world dialogues, while the latter fails to meet the requirements of downstream NLP systems for sentences with high fluency, consistency and coverage. To this end, this paper presents RealDia, which is constructed from real-world multi-turn dialogues and therefore contains the vast majority of challenges we would face when dealing with real-world dialogues.
Our experiments show that the performance of current state-of-the-art dialogue rewriting approaches drops dramatically on RealDia, which demonstrates that rewriting real-world dialogues is a challenging task that requires further study.

Benchmark Construction

Given a dialogue D and an incomplete utterance S in D, dialogue rewriting aims to transform S into another sentence S′ that accurately expresses the speaker's intent of S within the dialogue context. Without loss of generality, we assume that the last utterance in the dialogue is the one to be rewritten, because a dialogue system cannot know what the speaker will say in the future. Formally, given a dialogue D = (S1, ..., Sn) containing n utterances, dialogue rewriting aims to generate a dialogue-independent sentence S′ = (x′1, ..., x′l) such that

Intent(Sn | D) = Intent(S′). (1)

Here Intent(·) is a function that depends on downstream applications and is difficult to obtain directly. Instead of measuring the intent consistency between S′ and Sn in dialogue D, we create a golden reference S* for each case and evaluate the consistency between S′ and S* using automatic and manual evaluation criteria.

We construct the Real-world Dialogue Rewriting Corpus (RealDia) from a dialogue corpus provided by a large-scale Chinese Internet company, which contains multi-turn dialogues between users and a widely-used online chatting system in real scenes. Users can interact with the system using either speech or typed input, and therefore the (transcribed) dialogues contain various spelling and ASR errors. To ensure annotation quality, 4 annotators who hold degrees in Computational Linguistics were hired to annotate references. Before annotation, we manually collected dialogues whose last utterance cannot clearly express the user's intention unless placed in the entire dialogue context.
The annotators were then asked to create a reference sentence that fully expresses the same intention as the last utterance in the dialogue. All references were double-checked to guarantee quality. Finally, RealDia contains 1,000 annotated dialogues that need to be rewritten, each manually annotated with a reference. We randomly sample 700/100/200 dialogues as the train/dev/test sets. Besides, we also provide 20,000 dialogues without annotation to support future research.

To clearly identify the challenges of RealDia, we randomly sampled 200 dialogues and analyzed the noise appearing in them. We find that in addition to the 69.5% of cases containing NP-ellipsis, 40.5% of cases contain ASR errors or spelling errors. VP-ellipsis, co-references and redundancies are also common. These results show the divergence between RealDia and the previous dialogue rewriting benchmarks discussed above, and also demonstrate that RealDia is a far more challenging benchmark.

Skeleton-Guided Dialogue Rewriting

This section describes the Skeleton-Guided Rewriter (SGR), which leverages pre-trained text generation models for high fluency and extracts dialogue skeletons for high coverage and consistency to resolve real-world dialogue rewriting. Specifically, as shown in Figure 2, SGR conducts dialogue rewriting in three critical steps: 1) Dialogue Skeleton Extraction, which extracts the critical information across the entire dialogue context to resolve the uninformative utterance problem; 2) Skeleton-Guided Candidate Generation, which generates fluent rewritten candidates under the guidance of the extracted skeleton; 3) Skeleton-aware Reranking, which selects the best rewritten sentence by measuring the fluency, coverage and semantic consistency of the candidates with the original dialogue. In the following, we describe these steps and show how each component can be learned with minimum supervision.
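The three-step pipeline can be sketched as follows. This is an illustrative sketch only: the helper names (`sgr_rewrite`, `coverage`) are ours, the three components are injected as plain callables instead of the learned RoBERTa tagger and T5 generator, and realizing the coverage feature as the fraction of matched skeleton spans is an assumption, not the paper's exact scoring function.

```python
# Illustrative sketch of the SGR pipeline (hypothetical helper names).
# The extractor, generator, and scorer stand in for learned models.

def sgr_rewrite(dialogue, extract_skeleton, generate_candidates, score):
    """Rewrite the last utterance of `dialogue` (a list of utterance strings)."""
    skeleton = extract_skeleton(dialogue)                 # step 1: skeleton spans
    candidates = generate_candidates(dialogue, skeleton)  # step 2: candidate sentences
    # Step 3: rerank candidates; `score` combines fluency, coverage, consistency.
    return max(candidates, key=lambda c: score(c, dialogue, skeleton))

def coverage(candidate, skeleton):
    """Fraction of skeleton spans that appear in the candidate: one plausible
    realization of the 'coverage' feature, assumed here for illustration."""
    if not skeleton:
        return 1.0
    return sum(span in candidate for span in skeleton) / len(skeleton)
```

With the Figure 2 example, a coverage-only scorer already prefers "What is the manufacturer of the car with the max MPG" over a candidate that drops "manufacturer", since the former covers all three extracted spans.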
Dialogue Skeleton Extraction

The Dialogue Skeleton Extractor aims to extract the skeleton, which contains the critical information that needs to be covered in the rewritten sentence. For the example in Figure 2, given a multi-turn dialogue, the Dialogue Skeleton Extractor is expected to extract the critical information "car", "max MPG" and "manufacturer". The insight behind the Dialogue Skeleton Extractor is to disentangle critical information extraction from sentence generation, so that the critical information in the dialogue history can be better identified to rewrite the uninformative utterance. Furthermore, learning to extract skeletons is much more data-efficient than directly learning to generate rewritten sentences, and therefore much less training data is required for model learning.

Specifically, we formulate skeleton extraction as a token-level sequence labeling problem. Given a dialogue D = (S1, ..., Sn) and the utterance Sn to be rewritten, we first build the input as:

D_E = [S1, S2, ..., [SEP], Sn], (2)

where [SEP] is a special token indicating the start of the last utterance that needs to be rewritten. Then we use RoBERTa (Liu et al. 2019) to encode the input into hidden representations. After that, the representation h_j of the j-th token in D is sent to a binary classifier, where label 1 indicates a skeleton token that needs to be covered in the rewritten result, and 0 otherwise. We combine consecutive skeleton tokens into spans and filter out stop words to collect the dialogue skeleton C = (c1, c2, ..., ck).

Learning. To train the Dialogue Skeleton Extractor, we label the tokens in a dialogue by comparing the dialogue D with its golden reference S*: the i-th token is labeled yi = 1 if it appears in S* and 0 otherwise.
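This distant labeling heuristic and the span-collection step can be sketched as follows. It is a minimal word-level sketch under stated assumptions: the paper's extractor operates on RoBERTa subword tokens, and the function names and the tiny stop-word list here are illustrative, not from the paper.

```python
# Weak supervision for the skeleton extractor: a dialogue token gets label 1
# iff it also appears in the golden reference (word-level simplification).
STOP_WORDS = {"the", "is", "of", "a", "with"}  # illustrative stop-word list

def label_tokens(dialogue_tokens, reference_tokens):
    """yi = 1 if the i-th dialogue token appears in the reference, else 0."""
    ref = set(reference_tokens)
    return [1 if tok in ref else 0 for tok in dialogue_tokens]

def collect_skeleton(dialogue_tokens, labels):
    """Merge consecutive label-1 tokens into spans and drop stop words."""
    spans, current = [], []
    for tok, y in zip(dialogue_tokens, labels):
        if y == 1 and tok not in STOP_WORDS:
            current.append(tok)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

For the dialogue turn "How about the car with the max MPG" and a reference containing "the car with the max MPG", this yields the spans "car" and "max mpg", which then serve as the skeleton C for the generator.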
Based on the token labels, we train the Dialogue Skeleton Extractor by optimizing the cross-entropy loss:

L = - Σ_i [y_i log P_i + (1 - y_i) log(1 - P_i)], (3)

where P_i is the predicted probability that the i-th token is a skeleton token.

Skeleton-Guided Candidate Generation

Given a dialogue D = (S1, ..., Sn) and its skeleton spans C = (c1, c2, ..., ck), the Skeleton-Guided Generator generates rewritten candidates S′ = (x′1, ..., x′l) under the guidance of the skeleton C. Because current large-scale pre-trained generation models such as T5 (Raffel et al. 2020) can effectively generate fluent natural language sentences, we build the Skeleton-Guided Generator by directly leveraging the T5-based encoder-decoder architecture and further incorporating the skeleton to improve the coverage and consistency of model outputs. Inspired by recent advances in prompt mechanisms for text generation (Brown et al. 2020; Zou et al. 2021) and skeleton-based generation models (Xu et al. 2018; Cai et al. 2019; Su et al. 2021), we transform the skeleton into prompt-style guidance, which is expected to guide the generator to pay more attention to the critical information in the dialogue and thus resolve the uninformative utterance challenge. Formally, given a dialogue D and its skeleton C = (c1, c2, ..., ck), the input of the Skeleton-Guided Generator is:

D_G = [c1, [SEP], ..., ck, [CLS], D], (4)

where [SEP] is used for segmentation and [CLS] indicates the beginning of a dialogue. Then D_G is fed into a T5-based encoder-decoder architecture to generate the rewritten sentence S′ as:

P(S′ | D_G) = Π_t P(y′_t | D_G, y′_<t),

... (> 0.7), which indicates that all three features are crucial and need to be considered together.

Case Analysis

To demonstrate and compare the effect of different methods, Figure 4 shows two cases containing the outputs of different systems. We can see that:

1. RealDia is challenging for previous dialogue rewriting approaches.
First, the baselines frequently generate illegal sentences with low fluency and grammar errors, such as omissions, repetitions, and word-order errors. In these cases, the outputs of T-Ptr-λ contain many word-order errors, and PAC gets stuck in repetitive loops. In addition, in case 1, T-Ptr-λ, PAC, and RAST generate illegal sentences, and the subject is omitted from the output of RUN.

Figure 4: Two example dialogues and their references from RealDia, as well as the rewritten sentences generated by different models. Errors are marked in red.

Dialogue:
  Case 1: lù 录一只小猫咪 (How to record a kitten); nù 怒 (How to irritate)
  Case 2: S1: 中国象棋的队 (Chinese chess team); S2: 双方各有几种棋子 (How many kinds of pieces are there on both sides)
Reference:
  Case 1: nù 怒一只小猫咪 (How to irritate a kitten)
  Case 2: S′: 中国象棋双方各有几种棋子 (How many kinds of pieces are there on both sides of Chinese chess)
T-Ptr-λ:
  Case 1: jī 激 (How the kitten how #error)
  Case 2: S′: 中国国的中各几种棋子 (How many kinds of chess pieces are there in China #error)
PAC:
  Case 1: 怎么小猫咪怎么 nù 怒 (How the kitten how to irritated #repetition)
  Case 2: S′: 中国象棋双方各几种棋子子子子子 (How many kinds of pieces #repetition are there on both sides of Chinese chess)
BERT+RUN:
  Case 1: nù 怒 (One #ellipsis how to irritate)
  Case 2: S′: 中国象棋的双方各有几种棋子 (How many kinds of pieces are there on both sides of Chinese chess)
BERT-L+RAST:
  Case 1: nù 怒 (How to record a kitten how to irritate)
  Case 2: S′: 中国象棋双方各有几种棋子 (How many kinds of pieces are there on both sides of Chinese chess)
JET:
  Case 1: lù 录一只猫咪 (How to record a kitten)
  Case 2: S′: 中国象棋的各有几种棋子 (How many kinds of Chinese chess pieces are there on each side of #ellipsis)
T5:
  Case 1: lù 录一只猫咪的表情 (How to record the expression of a kitten)
  Case 2: S′: 中国象棋的三大队各有几种棋子 (How many kinds of Chinese chess pieces are there in each of the three teams)
SGR (Ours):
  Case 1: nù 怒一只小猫咪 (How to irritate a kitten)
  Case 2: S′: 中国象棋的两队各有几种棋子 (How many kinds of pieces are there on both sides of Chinese chess)
Second, it is hard for the baselines to distinguish between noise and critical information in dialogues due to their inherent informative utterance assumption. RAST, T5 and JET fail to correct the ASR error in the original dialogue in case 1, while SGR corrects the transcription error and generates fluent utterances.

2. Compared with previous state-of-the-art approaches, T5 can guarantee the fluency of its outputs but may have consistency and coverage issues. Under the guidance of the skeleton, SGR improves the consistency and coverage of the rewritten sentences. In these cases, the outputs of T5 contain hallucinated content ("expression" and "three teams") that does not appear in the dialogue history, while SGR correctly generates fluent utterances with high consistency and high coverage based on the dialogue context.

Conclusion

Previous dialogue rewriting benchmarks and systems assume a fluent and informative utterance to rewrite, which is inconsistent with real-world dialogue rewriting application scenarios. In this paper, we first presented the Real-world Dialogue Rewriting Corpus (RealDia), a new benchmark to evaluate how well current dialogue rewriting systems can deal with real-world noisy and uninformative dialogue utterances. Then we presented the Skeleton-Guided Rewriter (SGR), which resolves the task via a skeleton-guided generation paradigm. Experiments on RealDia and previous benchmarks show that rewriting real-world noisy dialogues is challenging, and that SGR achieves state-of-the-art performance on both RealDia and previous benchmarks.

Acknowledgments

We sincerely thank the reviewers for their insightful comments and valuable suggestions. This work is supported by the Natural Science Foundation of China (No. U1936207, 62122077, 62106251 and 61906182).

References

Brown, T.
B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020.

Cai, D.; Wang, Y.; Bi, W.; Tu, Z.; Liu, X.; Lam, W.; and Shi, S. 2019. Skeleton-to-Response: Dialogue Generation Guided by Retrieval Memory. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 1219–1228. Minneapolis, Minnesota: Association for Computational Linguistics.

Carbonell, J. G. 1983. Discourse Pragmatics and Ellipsis Resolution in Task-Oriented Natural Language Interfaces. In 21st Annual Meeting of the Association for Computational Linguistics, 164–168. Cambridge, Massachusetts, USA: Association for Computational Linguistics.

Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z.; Wang, S.; and Hu, G. 2021. Pre-Training With Whole Word Masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 3504–3514.

Dziri, N.; Madotto, A.; Zaïane, O.; and Bose, A. J. 2021. Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2197–2214. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.

Elgohary, A.; Peskov, D.; and Boyd-Graber, J. 2019. Can You Unpack That? Learning to Rewrite Questions-in-Context.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 5918–5924. Hong Kong, China: Association for Computational Linguistics.

Hao, J.; Song, L.; Wang, L.; Xu, K.; Tu, Z.; and Yu, D. 2021. RAST: Domain-Robust Dialogue Rewriting as Sequence Tagging. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 4913–4924. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.

Inoue, S.; Liu, T.; Nguyen, S.; and Nguyen, M.-T. 2022. Enhance Incomplete Utterance Restoration by Joint Learning Token Extraction and Text Generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3149–3158. Seattle, United States: Association for Computational Linguistics.

Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; and Fung, P. 2022. Survey of Hallucination in Natural Language Generation. CoRR, abs/2202.03629.

Jin, L.; Song, L.; Jin, L.; Yu, D.; and Gildea, D. 2022. Hierarchical Context Tagging for Utterance Rewriting. In AAAI.

Lin, C.-Y.; and Hovy, E. 2002. Manual and automatic evaluation of summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization, 45–51. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics.

Liu, Q.; Chen, B.; Lou, J.-G.; Zhou, B.; and Zhang, D. 2020. Incomplete Utterance Rewriting as Semantic Segmentation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2846–2857. Online: Association for Computational Linguistics.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692.

Loshchilov, I.; and Hutter, F. 2019.
Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations. OpenReview.net.

Mele, I.; Muntean, C. I.; Nardini, F. M.; Perego, R.; Tonellotto, N.; and Frieder, O. 2021. Adaptive utterance rewriting for conversational search. Information Processing & Management, 58(6): 102682.

Müller, R.; Kornblith, S.; and Hinton, G. E. 2019. When does label smoothing help? In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 4696–4705.

Pan, Z.; Bai, K.; Wang, Y.; Zhou, L.; and Liu, X. 2019. Improving Open-Domain Dialogue Systems via Multi-Turn Incomplete Utterance Restoration. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 1824–1833. Hong Kong, China: Association for Computational Linguistics.

Pangaro, P.; and Dubberly, H. 2014. What is Conversation? How Can We Design for Effective Conversation? In Driving Desired Futures, 144–159. Birkhäuser.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Köpf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 8024–8035.

Quan, J.; Xiong, D.; Webber, B.; and Hu, C. 2019. GECOR: An End-to-End Generative Ellipsis and Co-reference Resolution Model for Task-Oriented Dialogue.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 4547–4557. Hong Kong, China: Association for Computational Linguistics.

Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140): 1–67.

Raunak, V.; Menezes, A.; and Junczys-Dowmunt, M. 2021. The Curious Case of Hallucinations in Neural Machine Translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1172–1183. Online: Association for Computational Linguistics.

Si, S.; Zeng, S.; and Chang, B. 2022. Mining Clues from Incomplete Utterance: A Query-enhanced Network for Incomplete Utterance Rewriting. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4839–4847. Seattle, United States: Association for Computational Linguistics.

Su, H.; Shen, X.; Zhang, R.; Sun, F.; Hu, P.; Niu, C.; and Zhou, J. 2019. Improving Multi-turn Dialogue Modelling with Utterance ReWriter. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 22–31. Florence, Italy: Association for Computational Linguistics.

Su, Y.; Wang, Y.; Cai, D.; Baker, S.; Korhonen, A.; and Collier, N. 2021. PROTOTYPE-TO-STYLE: Dialogue Generation With Style-Aware Editing on Retrieval Memory. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 2152–2161.

Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826. IEEE Computer Society.
Wang, X.; Zhang, H.; Zhao, S.; Zou, Y.; Chen, H.; Ding, Z.; Cheng, B.; and Lan, Y. 2021. FCM: A Fine-grained Comparison Model for Multi-turn Dialogue Reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2021, 4284–4293. Punta Cana, Dominican Republic: Association for Computational Linguistics.

Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Le Scao, T.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. Online: Association for Computational Linguistics.

Wu, Y.; Wu, W.; Xing, C.; Zhou, M.; and Li, Z. 2017. Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 496–505. Vancouver, Canada: Association for Computational Linguistics.

Xu, J.; Ren, X.; Zhang, Y.; Zeng, Q.; Cai, X.; and Sun, X. 2018. A Skeleton-Based Model for Promoting Coherence Among Sentences in Narrative Story Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4306–4315. Brussels, Belgium: Association for Computational Linguistics.

Zhang, R.; Yu, T.; Er, H.; Shim, S.; Xue, E.; Lin, X. V.; Shi, T.; Xiong, C.; Socher, R.; and Radev, D. 2019. Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 5338–5349. Hong Kong, China: Association for Computational Linguistics.

Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2020.
BERTScore: Evaluating Text Generation with BERT. In 8th International Conference on Learning Representations. OpenReview.net.

Zou, X.; Yin, D.; Zhong, Q.; Yang, H.; Yang, Z.; and Tang, J. 2021. Controllable Generation from Pre-Trained Language Models via Inverse Prompting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD '21, 2450–2460. New York, NY, USA: Association for Computing Machinery. ISBN 9781450383325.