# Towards Discourse-Aware Document-Level Neural Machine Translation

Xin Tan, Longyin Zhang, Fang Kong and Guodong Zhou

School of Computer Science and Technology, Soochow University, China
{xtan9, lyzhang9}@stu.suda.edu.cn, {kongfang, gdzhou}@suda.edu.cn

## Abstract

Current document-level neural machine translation (NMT) systems have achieved remarkable progress by exploiting document context. Nevertheless, discourse information, which has proven effective in many NLP tasks, is ignored in most previous work. In this work, we aim to incorporate the coherence information hidden within the RST-style discourse structure into machine translation. To this end, we propose a document-level NMT system enhanced with discourse-aware document context, named Disco2NMT. Specifically, Disco2NMT models document context based on discourse dependency structures through a hierarchical architecture. We first convert the RST tree of an article into a dependency structure and then build a graph convolutional network (GCN) upon the segmented EDUs under the guidance of the RST dependencies to capture discourse-aware context for NMT incorporation. We conduct experiments on document-level English-German and English-Chinese translation tasks across three domains (TED, News, and Europarl). Experimental results show that our Disco2NMT model significantly surpasses both context-agnostic and context-aware baseline systems on multiple evaluation metrics.

## 1 Introduction

With the maturity of sentence-level neural machine translation (NMT), document-level NMT has been drawing more attention and has achieved significant progress in recent years. Reviewing previous work on document-level NMT, context-aware models have made substantial progress by extracting contextual dependencies among sentences to help translate each entire article [Zhang et al., 2018; Voita et al., 2018; Miculicich et al., 2018; Maruf et al., 2019; Tan et al., 2019; Ma et al., 2020; Sugiyama and Yoshinaga, 2021]. However, existing studies tend to model broad-brush document context from the plain sentences of a document, capturing relatively shallow cohesion information, while the deeper coherence information hidden within each article is ignored.

[Figure 1: An illustration of the commonality that bridges the source and its corresponding target articles. The figure shows a source article segmented into EDUs e1-e4, linked by a "consequence" rhetorical relation.]

Rhetorical Structure Theory (RST) [Mann and Thompson, 1988] is one of the most common and influential discourse theories; it describes the coherence structure of a document and captures the logical relationships between clauses and sentences in a text [Hobbs, 1979]. As a universal theory of discourse structure, RST analysis has been applied to many different languages, such as English [Carlson et al., 2001], German [Stede and Neumann, 2014], and Chinese [Li et al., 2014b]. In view of this, the rhetorical structure shared across two languages serves as an essential commonality between them. Fig. 1 illustrates the importance of rhetorical structure for document-level NMT.
It shows that this commonality serves as a bridge between the source and target languages in document-level machine translation: the coherence of the source can be used to promote the coherence of the target. Although previous studies have explored various ways of leveraging document context to improve document-level machine translation, research on discourse structure remains rare. To the best of our knowledge, [Chen et al., 2020] is the first attempt to incorporate discourse information into document-level NMT. However, they only employ embedded RST tree paths to enhance sentence representations, limiting the discourse information to shallow features, while the more direct rhetorical relations between discourse units are largely ignored.

In this research, we explore applying RST discourse structure to document-level NMT. To achieve this, we propose the Disco2NMT model, which incorporates RST graphs into a document-level NMT model based on a hierarchical attention network. In order to model both intra- and inter-sentence discourse relations, we segment the sentences of each article into the finer granularity of elementary discourse units (EDUs) and model document context over these EDUs. For discourse structure, we convert the RST constituency tree into a dependency graph to avoid the long-range error propagation problem of the original tree. On this basis, we encode the discourse dependencies with a graph convolutional network (GCN) [Kipf and Welling, 2016] to capture intra- and inter-sentence interactions. Furthermore, we incorporate the extracted discourse-aware context into a Transformer-based document-level NMT system, guiding the system to generate more coherent translations. It is worth mentioning that this work is the first attempt to verify the effectiveness of discourse dependency structure in document-level NMT.

We perform experiments on document-level translation tasks with different language pairs (English-German and English-Chinese) and domains (TED, News, and Europarl). Experiments show that our proposed Disco2NMT outperforms competitive document-level NMT systems and generates more coherent translations.

## 2 Background

**Document-level NMT.** Different from sentence-level NMT systems, which translate sentences separately, document-level NMT systems generally translate each source sentence $X_i = (x_1, x_2, \ldots, x_n)$ within a document $D = (X_1, \ldots, X_N)$ into the target sentence $Y_i = (y_1, y_2, \ldots, y_n)$ while taking contextual information $C$ into consideration. The training criterion for the document-level NMT model is to maximize the conditional log-likelihood:

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \sum_{j=1}^{n} \log p\left(y_j^i \mid y_{<j}^i, X_i, C; \theta\right)$$
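To make this objective concrete, the following is a minimal sketch of computing the document-level log-likelihood above, assuming a generic encoder-decoder `model` and a pre-computed context representation `ctx`; both names are placeholders for illustration, not the Disco2NMT architecture itself.

```python
import torch
import torch.nn.functional as F

def doc_log_likelihood(model, src_sents, tgt_sents, ctx):
    """log p(Y_i | X_i, C) summed over the N sentences of one document.

    src_sents, tgt_sents: lists of (batch, length) token-id tensors;
    each target sentence starts with BOS and ends with EOS.
    """
    total = 0.0
    for X, Y in zip(src_sents, tgt_sents):        # i = 1..N
        logits = model(X, Y[:, :-1], ctx)         # teacher forcing, shifted input
        log_probs = F.log_softmax(logits, dim=-1)
        # sum over j of log p(y_j | y_<j, X, C)
        total = total + log_probs.gather(-1, Y[:, 1:].unsqueeze(-1)).sum()
    return total                                  # maximized during training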
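For the discourse side described in Section 1, the sketch below illustrates the two key steps: percolating nucleus heads to turn an RST constituency tree into dependency arcs between EDUs, and applying one GCN layer [Kipf and Welling, 2016] over EDU vectors under those arcs. The class and function names (`RSTNode`, `to_dependencies`, `DiscourseGCNLayer`) and the exact head-percolation rule are illustrative assumptions, not the paper's released implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import torch
import torch.nn as nn

@dataclass
class RSTNode:
    nuclearity: str = "N"               # "N" (nucleus) or "S" (satellite)
    edu_id: Optional[int] = None        # set for leaves (EDUs) only
    children: List["RSTNode"] = field(default_factory=list)

def to_dependencies(node: RSTNode) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (head EDU id, list of (dependent, head) arcs) for a subtree."""
    if node.edu_id is not None:          # leaf: an EDU heads itself
        return node.edu_id, []
    arcs, child_heads = [], []
    for child in node.children:
        head, sub_arcs = to_dependencies(child)
        arcs.extend(sub_arcs)
        child_heads.append((child.nuclearity, head))
    # The subtree head is the head of the leftmost nucleus child;
    # the heads of all other children depend on it.
    subtree_head = next(h for nuc, h in child_heads if nuc == "N")
    for _, h in child_heads:
        if h != subtree_head:
            arcs.append((h, subtree_head))
    return subtree_head, arcs

class DiscourseGCNLayer(nn.Module):
    """One GCN layer over EDU vectors (rows indexed by EDU id)."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, edu_vecs: torch.Tensor, arcs) -> torch.Tensor:
        n = edu_vecs.size(0)
        adj = torch.eye(n)                       # self-loops
        for dep, head in arcs:                   # treat arcs as undirected
            adj[dep, head] = adj[head, dep] = 1.0
        deg = adj.sum(dim=1, keepdim=True)       # row-normalize by degree
        return torch.relu(self.linear((adj / deg) @ edu_vecs))
```

Under this rule a satellite EDU always attaches to the head of its sister nucleus, mirroring the head-percolation procedures commonly used to derive discourse dependencies from RST constituency trees.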