# End-to-End Transition-Based Online Dialogue Disentanglement

Hui Liu¹, Zhan Shi¹, Jia-Chen Gu², Quan Liu³, Si Wei³ and Xiaodan Zhu¹
¹Ingenuity Labs Research Institute & ECE, Queen's University, Canada
²University of Science and Technology of China, Hefei, China
³State Key Laboratory of Cognitive Intelligence, iFLYTEK Research, Hefei, China
{hui.liu, 18zs11, xiaodan.zhu}@queensu.ca, gujc@mail.ustc.edu.cn, {siwei, quanliu}@iflytek.com

## Abstract

Dialogue disentanglement aims to separate intermingled messages into detached sessions. Existing research focuses on two-step architectures, in which a model first retrieves the relationships between pairs of messages and then divides the message stream into separate clusters. Almost all existing work puts significant effort into selecting features for message-pair classification and clustering, while ignoring the semantic coherence within each session. In this paper, we introduce the first end-to-end transition-based model for online dialogue disentanglement. Our model captures the sequential information of each session as the online algorithm proceeds through a dialogue. The coherence within a session is hence modeled as messages are sequentially added to their best-matching sessions. Meanwhile, the research field still lacks data for studying end-to-end dialogue disentanglement, so we construct a large-scale dataset by extracting coherent dialogues from online movie scripts. We evaluate our model on both the dataset we developed and the publicly available Ubuntu IRC dataset [Kummerfeld et al., 2019]. The results show that our model significantly outperforms the existing algorithms. Further experiments demonstrate that our model better captures the sequential semantics and obtains more coherent disentangled sessions.¹

## 1 Introduction

With the development of social networks, online group chat channels have achieved great success and become increasingly popular. The popularity of messaging applications such as Slack² and Facebook Messenger³ has driven a rapid increase in group conversation messages. When users enter a group chat channel, they are confronted with a burst of messages that tend to relate to different topics. These entangled messages mix all topics together without revealing the structure of the conversation, making it difficult for users to find the topics they are interested in. Automatic dialogue disentanglement helps by separating entangled messages into different sessions, making it easier for users to find useful information.

¹ https://github.com/layneins/e2e-dialo-disentanglement
² https://slack.com/
³ https://www.messenger.com/

Speaker A: Are you going to have dinner with me?
Speaker B: So anyone got time for this issue tomorrow?
Speaker C: No I'll need to write the project report
Speaker D: Yes sure. See you later.
Speaker C: But I can do it tonight.
Speaker C: There is a problem with our project.

Figure 1: An example of dialogue disentanglement. By considering session history Line 31, it is easier for an algorithm to recognize Line 34 as a response to Line 33. If an algorithm only considers message pairs but not session coherence, Line 34 could be regarded as responding to Line 32, resulting in a wrong disentanglement result.

Existing solutions for dialogue disentanglement focus on the relationship between two messages.
Most of them adopt a two-step architecture: first predicting the relationship between two messages and then separating the message stream into clusters according to the predicted relationships. There are different ways to implement such a two-step architecture. Some early work uses handcrafted features to train a classifier that captures the global or local coherence of message pairs [Elsner and Charniak, 2010; Elsner and Charniak, 2011]. With the rapid development of deep learning, recent work builds neural models to predict the relationship between two messages [Mehri and Carenini, 2017; Jiang et al., 2018]. These works consider whether two messages are in the same session or whether one message is replying to the other. Based on the predicted relationship, a clustering algorithm is adopted to separate the messages.

There are apparent weaknesses in the two-step methods. In the relationship retrieval step, these methods try to predict the relationship between a message pair but ignore the context of the dialogue and the sessions that have already been detached. Messages in dialogues tend to be short and simple, and may appear coherent with many preceding messages if the context is not considered. An example is given in Figure 1. Simply predicting relationships between message pairs ignores useful session semantics and coherence. In the clustering step, existing methods need a meticulously picked clustering algorithm, and some require a great deal of human involvement. These weaknesses make such methods less flexible for tackling the disentanglement task.

To solve the above problems, we propose the first end-to-end model for online dialogue disentanglement. We formulate dialogue disentanglement as a session state transition problem, where the number of sessions is dynamically maintained. Different from previous methods, our model captures the sequential information of the dialogue as well as that of the disentangled sessions. When a message is being processed, two actions can be applied to it: 1) categorize the message into the session that best matches its contextual semantics, or 2) build a new session and initialize the state of the new session with the current message. Through this state transition process, our model focuses on the semantic match between sessions and messages, which proves to be more effective and efficient in our experiments.

To imitate real-life situations, we solve the dialogue disentanglement task in an online manner. To overcome the limitations brought by online training, we propose two learning strategies: teacher-student learning and decision sampling. Both strategies are combined with our model to further improve the performance.

Due to the lack of public benchmarks, we construct a new large-scale dataset from movie scripts for studying dialogue disentanglement. The new dataset contains more than 30,000 intermingled dialogues. We perform experiments on both this newly proposed dataset and the publicly available Ubuntu IRC dataset [Kummerfeld et al., 2019]. Results show that our model significantly outperforms the existing two-step methods. Further experiments demonstrate our model's ability to capture the semantics of the disentangled sessions.

Our main contributions are summarized as follows:

- We propose the first end-to-end model for the online dialogue disentanglement task, together with two learning strategies.
  The model captures the sequential information of the disentangled sessions and considers the semantic match between messages and sessions.
- We release a large-scale dataset for studying dialogue disentanglement. The dataset contains more than 30,000 intermingled dialogues.
- We conduct experiments on two datasets. Results demonstrate that our model significantly outperforms the previous methods by better capturing the semantics of the dialogues and sessions.

## 2 Related Work

Recent work on dialogue disentanglement has mostly adopted a two-step approach: first determining the relationship between two messages, and then separating the messages into clusters. Some previous work predicts whether two messages are in the same session. Elsner and Charniak [2010] use handcrafted features to represent a message and use them as the input to train a classifier, and Jiang et al. [2018] adopt deep learning methods and use a CNN model as the classifier. Meanwhile, other work targets retrieving the reply-to relationship between messages. Chen et al. [2017] predict the reply-to relationship based on text similarity and latent semantic transferability. Guo et al. [2018] and Mehri and Carenini [2017] both adopt an RNN to predict whether one message is replying to the other.

Apart from building models to extract the relationship between messages, previous work has also put effort into designing effective clustering algorithms, including carefully selecting thresholds for the relationship predicted in the first step. Shen et al. [2006] run extensive experiments to explore the influence of the threshold used for clustering. Jiang et al. [2018] propose a novel similarity ranking method to avoid extensively exploring the setting of the threshold.

Some tasks share properties with dialogue disentanglement, such as topic detection and tracking (TDT) [Allan, 2012] and streaming news clustering [Miranda et al., 2018]. However, TDT and streaming news clustering focus on tracking topics, which is very different from dialogue disentanglement. For example, the content within a detected topic is less sequential, so modelling the coherence of the content is less of a concern, and chronological order is less important there (e.g., event locations are among the most prominent features in many top TDT models). This is very different from dialogue disentanglement, where the disentangled sessions are still conversations and the sequential information in the sessions is critical for the models.

Transition-based methods have been widely used in many sequence prediction tasks [Chen and Manning, 2014; Dyer et al., 2015; Zhang et al., 2016]. Recently, neural-network-based transition models have been studied and have achieved good results on different tasks. To our knowledge, our work is the first to propose transition-based neural networks for dialogue disentanglement.

## 3 Task and Notations

We formulate the task of end-to-end online dialogue disentanglement as a session state transition problem for each message in a dialogue stream. There are two actions for each message: 1) initialize a new session; 2) update an existing session. Since it is an online task, all action decisions are made without seeing subsequent messages. The input is a dialogue $D$ that contains $n$ messages $[u_1, u_2, \ldots, u_n]$. Our goal is to separate the $n$ messages into $K$ sessions $[S_1, S_2, \ldots, S_K]$, where $K$ is unknown to the model.
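To make the formulation concrete, here is a minimal Python sketch of the online transition loop, assuming a hypothetical `choose_action` decision function as a stand-in for the learned model: each incoming message either updates an existing session or initializes a new one, and no decision may look at later messages.

```python
from typing import Callable, List

def disentangle_online(
    messages: List[str],
    choose_action: Callable[[str, List[List[str]]], int],
) -> List[int]:
    """Assign each message to a session index in a single left-to-right pass.

    choose_action(u_i, sessions) returns an index j with 0 <= j <= len(sessions);
    j == len(sessions) means "initialize a new session", any smaller j means
    "update the existing session sessions[j]".
    """
    sessions: List[List[str]] = []  # the session set S = [S_1, ..., S_{z(i)}]
    labels: List[int] = []          # session index assigned to each message u_i
    for u in messages:              # online: decisions never see later messages
        j = choose_action(u, sessions)
        if j == len(sessions):      # action 1: initialize a new session with u_i
            sessions.append([u])
        else:                       # action 2: update an existing session with u_i
            sessions[j].append(u)
        labels.append(j)
    return labels
```

An oracle `choose_action` that reads off the gold labels would reproduce the target sessions $[S_1, \ldots, S_K]$; the model described in Section 4 replaces it with a learned scoring step over the candidate actions.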
For a given message $u_i$, there exists a session set $\mathcal{S}$ which contains $z(i)$ sessions $[S_1, S_2, \ldots, S_{z(i)}]$, where $z(i)$ is a function indicating the number of existing sessions when $u_i$ is being processed. Our model needs to decide whether $u_i$ belongs to any session in $\mathcal{S}$. If $u_i$ belongs to $S_j \in \mathcal{S}$, then $S_j$ is updated with $u_i$. If $u_i$ does not belong to any existing session, the model builds a new session $S_{z(i)+1}$ and treats $u_i$ as the first message of $S_{z(i)+1}$. Then $S_{z(i)+1}$ is added to $\mathcal{S}$.

## 4 Method

In this section, we first introduce our framework for end-to-end online dialogue disentanglement. Moreover, to eliminate the potential drawback caused by the online training process, we propose two learning strategies that can be combined with our model to further improve the performance.

Figure 2: Figure (a) is the overall architecture of our proposed model. Figure (b) is an example from $u_4$ to $u_6$, illustrating how the session state encoder (SSE) updates the state. SSE computes a dot product between $u_4$ and all the elements in the candidate action set, and predicts that $u_4$ belongs to $S_2$. The state of $S_2$ is then updated from $s^2_1$ to $s^2_2$. When $u_5$ is being processed, SSE predicts that $u_5$ belongs to a new session, so the mask of $S_3$ is removed and $u_5$ is used to initialize the state of $S_3$ as $s^3_1$.

### 4.1 Model

Given a sequence of messages $[u_1, u_2, \ldots, u_n]$, the goal of our model is to learn a probability distribution that factorizes over the per-message decisions, $\prod_{i=1}^{n} P(y_i \mid y_{<i}, \ldots)$, where $y_i$ is the action taken for message $u_i$.
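The scoring step suggested by Figure 2, matching a message against the candidate action set by dot product while masking session slots that have not been opened yet, could be sketched roughly as below. This is an illustrative PyTorch sketch with assumed tensor shapes and a hypothetical learnable `new_session_vec` embedding, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def candidate_action_distribution(
    msg_vec: torch.Tensor,         # (d,) encoding of the current message u_i
    session_states: torch.Tensor,  # (max_sessions, d) state vectors of all session slots
    active_mask: torch.Tensor,     # (max_sessions,) bool, True for slots already opened
    new_session_vec: torch.Tensor, # (d,) assumed embedding for the "new session" action
) -> torch.Tensor:
    """Score the candidate actions for one message by dot product.

    Candidates are the states of the z(i) active sessions plus one extra
    "open a new session" candidate; slots that have not been opened yet are
    masked out. Returns a probability distribution over the candidate actions.
    """
    candidates = torch.cat([session_states, new_session_vec.unsqueeze(0)], dim=0)
    scores = candidates @ msg_vec                          # (max_sessions + 1,)
    mask = torch.cat([active_mask, torch.ones(1, dtype=torch.bool)])
    scores = scores.masked_fill(~mask, float("-inf"))      # hide unopened slots
    return F.softmax(scores, dim=0)                        # distribution over actions
```

In the Figure 2 example, the highest-scoring candidate for $u_4$ is the slot of $S_2$, whose state is then updated from $s^2_1$ to $s^2_2$; for $u_5$ the new-session candidate wins, so the slot of $S_3$ is unmasked and initialized from $u_5$.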