# Lexical-Constraint-Aware Neural Machine Translation via Data Augmentation

Guanhua Chen¹, Yun Chen², Yong Wang¹ and Victor O.K. Li¹
¹The University of Hong Kong
²Shanghai University of Finance and Economics
{ghchen, wangyong, vli}@eee.hku.hk, yunchen@sufe.edu.cn
∗Corresponding author.

## Abstract

Leveraging lexical constraints is extremely important in domain-specific machine translation and interactive machine translation. Previous studies mainly focus on extending the beam search algorithm or augmenting the training corpus by replacing source phrases with their corresponding target translations. These methods either suffer from heavy computation costs during inference or depend on the quality of a bilingual dictionary that is pre-specified by the user or constructed with statistical machine translation. In response to these problems, we present a conceptually simple and empirically effective data augmentation approach for lexically constrained neural machine translation. Specifically, we construct constraint-aware training data by first randomly sampling phrases of the reference as constraints, and then packing them into the source sentence with a separation symbol. Extensive experiments on several language pairs demonstrate that our approach achieves superior translation results over existing systems, improving the translation of constrained sentences without hurting unconstrained ones.

## 1 Introduction

Lexically constrained translation [Hokamp and Liu, 2017; Post and Vilar, 2018; Luong et al., 2015; Song et al., 2019], the task of imposing pre-specified words and phrases in the translation output (see Figure 1), has practical significance in many applications. These include domain-specific machine translation, where lexicons can be domain terminology extracted from an in-domain dictionary [Arthur et al., 2016], and interactive machine translation, where lexicons can be provided by humans after reading a system's initial output [Koehn, 2009; Cheng et al., 2016]. In phrase-based statistical machine translation [Koehn et al., 2003], it is relatively easy to incorporate these kinds of manual interventions. However, in the paradigm of neural machine translation (NMT) [Bahdanau et al., 2015; Vaswani et al., 2017], the task of lexically constrained translation is not trivial.

Figure 1: A simple example of lexically constrained machine translation from Chinese to English. Source: 该频道也很快做出了道歉并发布了声明. Reference: "The network was also quick to apologize and released a statement." System outputs: "the channel also made an apology and issued a statement."; "the channel also made a quick apology and issued a statement."; "the network also made a quick apology and issued a statement." The first translation is unconstrained, whereas the second and third have one and two additional constraints imposed, respectively.

As a result, a number of authors have explored methods for lexically constrained NMT. These methods can be roughly divided into two broad categories: hard and soft. In the hard category, all constraints are guaranteed to appear in the output sentence. These methods achieve this by designing novel decoding algorithms, without modification to the NMT model or the training process. Hokamp and Liu [2017] propose the grid beam search (GBS) decoding algorithm. Post and Vilar [2018] speed up GBS by presenting the dynamic beam allocation (DBA) algorithm. However, the computational complexity of such decoding algorithms is still much higher than that of conventional beam search.
Another direction achieves lexically constrained translation by modifying the NMT model's training process. These methods are soft, i.e., they cannot ensure that all constraints appear in the translation output. Luong et al. [2015] use placeholder tags to substitute rare words on both the source and target sides according to a bilingual dictionary during training. The model then learns to translate constrained words by translating placeholder tags. Song et al. [2019], Dinu et al. [2019] and Wang et al. [2019] propose data augmentation methods to train the NMT model. They construct synthetic parallel sentences by either replacing the corresponding source words with the constraint or appending the constraint right after the corresponding source words. However, a bilingual dictionary is essential at both training and inference time. Therefore, their performance relies heavily on the quality of the bilingual dictionary, and they cannot translate with non-consecutive constraints, i.e., constraints whose corresponding source words are non-consecutive.

In this paper, we propose a LExical-Constraint-Aware (LeCA) NMT model that packs the constraints and the source sentence together without using a bilingual dictionary. During training, we sample pseudo constraints from the reference and construct a constraint-aware synthetic parallel corpus by appending the constraints after the source sentence with a separation symbol. The motivation is to make the model learn to utilize the lexical constraints automatically, without pre-specified aligned source words. During inference, the source is modified in the same way as a preprocessing step. By training on a mixture of the original and synthetic corpora, the model performs well in the constrained case while maintaining its performance in the unconstrained case.

We evaluate the proposed model on WMT De-En and NIST Zh-En news translation in both directions with different types of lexical constraints. Similar to previous work [Song et al., 2019; Dinu et al., 2019], our approach cannot guarantee that all constraints are generated in the output, but experiments show that the copy success rate is high: 96.4%–99.5% for the WMT De-En task and 89.6%–98.4% for the NIST Zh-En task. In addition, our model improves over the code-switching baselines for all constraint types, and the improvement is more than 3.5 BLEU points for reference constraints and interactive constraints.¹

¹ Our code is available at https://github.com/ghchen18/leca.
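As a concrete illustration of this training-data construction, the following is a minimal sketch that samples random phrases from the reference as pseudo constraints and packs them onto the source with a separation symbol. The sampling hyperparameters (number of constraints, phrase length) and the token strings `<sep>` and `<eos>` are illustrative assumptions; the released code linked above is the authoritative implementation.

```python
import random
from typing import List, Tuple

SEP, EOS = "<sep>", "<eos>"

def make_constraint_aware_pair(
    source: List[str],
    reference: List[str],
    max_constraints: int = 3,
    max_phrase_len: int = 3,
) -> Tuple[List[str], List[str]]:
    """Sample up to `max_constraints` non-overlapping phrases from the
    reference as pseudo constraints and pack them onto the source as
        source <sep> c1 <sep> c2 ... <sep> cN <eos>.
    The target side is left unchanged."""
    n = random.randint(0, max_constraints)
    starts = sorted(random.sample(range(len(reference)), k=min(n, len(reference))))
    constraints, used_until = [], 0
    for s in starts:
        if s < used_until:          # skip overlapping samples
            continue
        length = random.randint(1, max_phrase_len)
        phrase = reference[s:s + length]
        constraints.append(phrase)
        used_until = s + len(phrase)

    packed = list(source)
    for phrase in constraints:      # sampled phrases kept in reference order
        packed += [SEP] + phrase
    packed.append(EOS)
    return packed, list(reference)

# Hypothetical usage with a toy sentence pair.
src = "该 频道 也 很快 做出 了 道歉 并 发布 了 声明".split()
ref = "the network was also quick to apologize and released a statement .".split()
print(make_constraint_aware_pair(src, ref)[0])
```

Training then proceeds on a mixture of such packed pairs and the original parallel data, as described above.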
## 2 Related Work

Recent work on lexically constrained NMT can be loosely clustered into two categories: hard and soft. The hard category ensures that all constraints appear in the translation output. In contrast, the soft category cannot make such a guarantee.

**Hard lexically constrained translation.** Hokamp and Liu [2017] propose the grid beam search (GBS) algorithm for incorporating lexical constraints at decoding time. The constraints are forced to be present in the translations. Post and Vilar [2018] propose a faster algorithm than grid beam search, namely dynamic beam allocation (DBA). The decoding complexity is reduced from $O(|T| k N_C)$ to $O(|T| k)$ (where $|T|$ is the target sentence length, $k$ is the beam size and $N_C$ is the number of constraints) by grouping hypotheses with the same number of satisfied constraints into banks and dynamically dividing a fixed-size beam across these banks at each time step.

One problem with these methods is that they copy the lexical constraints to the output in exactly the given form, which makes them unsuitable for decoding with noisy constraints, for example, constraints with an incorrect morphological form. Another drawback is that the decoding speed is significantly reduced compared with standard beam search.

**Soft lexically constrained translation.** Song et al. [2019] create a synthetic code-switching corpus to augment the training data for NMT. The code-switching corpus is built by replacing the corresponding source phrase with the target constraint according to a bilingual dictionary. By training on a mixture of the original and synthetic parallel corpora, the model learns to translate code-switching source sentences, which are used at both training and inference time. Concurrently, Dinu et al. [2019] propose a similar approach for translating with given terminology constraints. The corresponding target terminology in the dictionary is used to replace the source terminology or is appended right after it. Although these methods translate quickly, they both use a bilingual dictionary to construct the training data, so their performance relies heavily on the quality of the bilingual dictionary. In addition, at inference time, the model fails when constraints do not appear in the bilingual dictionary or when the corresponding source phrases are non-consecutive.

Different from the above approaches, we propose a lexical-constraint-aware Transformer model that requires no bilingual dictionary, obtained by simply packing the constraints and the source sentence together with a separating symbol. Instead of specifying the aligned source words for each constraint, our model learns to utilize the constraints automatically, without explicit alignment information. This simple approach performs constrained NMT with better results when accuracy, copy success rate and decoding speed are considered jointly.

## 3 Approach

### 3.1 Problem Statement

Suppose $X = (x_1, x_2, \ldots, x_S)$ is the source sentence with length $S$ and $Y = (y_1, y_2, \ldots, y_T)$ is the target sentence with length $T$. In conventional neural machine translation, the model is trained with the Maximum log-Likelihood Estimation (MLE) method. The conditional probability of $Y$ is calculated as follows:

$$p(Y \mid X; \theta) = \prod_{t=1}^{T} p(y_t \mid y_{0:t-1}, x_{1:S}; \theta). \tag{1}$$

When lexical constraints are given, the neural machine translation problem can be defined as

$$p(Y \mid X, C; \theta) = \prod_{t=1}^{T} p(y_t \mid y_{0:t-1}, x_{1:S}, C; \theta), \tag{2}$$

where $C = (C_1, C_2, \ldots, C_N)$ are the provided lexical constraints that are expected to appear in the translation, and $N$ is the number of constraints in $C$. Different from Hokamp and Liu [2017] and Post and Vilar [2018], these constraints are not forced to appear in the target sequence, i.e., hard constraints are not considered in our setting. The constraints are given as suggestions in a soft manner.

In real applications such as domain adaptation via terminology, we go through the source sentence and provide constraints according to a terminology dictionary. The constraints are therefore likely to be ordered according to their corresponding source phrases in the source sentence, which can differ from the order in which they appear in the reference. In this sense, the input constraints are disordered.
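To illustrate this inference-time setting, the following is a minimal sketch of collecting target-side constraints by scanning a source sentence against a terminology dictionary. The dictionary format, the longest-match strategy and the example entries are illustrative assumptions, not part of the paper.

```python
from typing import Dict, List, Tuple

def extract_constraints(
    source_tokens: List[str],
    term_dict: Dict[Tuple[str, ...], List[str]],
    max_phrase_len: int = 4,
) -> List[List[str]]:
    """Scan the source left-to-right and collect the target translations of
    any terminology entries found.  The resulting constraints follow source
    order, which may differ from their order in the reference."""
    constraints = []
    i = 0
    while i < len(source_tokens):
        matched = False
        # Prefer the longest terminology match starting at position i.
        for length in range(min(max_phrase_len, len(source_tokens) - i), 0, -1):
            phrase = tuple(source_tokens[i:i + length])
            if phrase in term_dict:
                constraints.append(term_dict[phrase])
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return constraints

# Hypothetical usage: a tiny German->English terminology dictionary.
term_dict = {
    ("Bundeskanzlerin",): ["Federal", "Chancellor"],
    ("Europäische", "Union"): ["European", "Union"],
}
src = "die Bundeskanzlerin traf die Europäische Union".split()
print(extract_constraints(src, term_dict))
# [['Federal', 'Chancellor'], ['European', 'Union']]
```

The extracted constraints follow source order, which is exactly the disordered setting described above.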
The performance of constrained machine translation is evaluated in three aspects.

1. **Accuracy.** BLEU [Papineni et al., 2002] is used to evaluate the correctness of the translation.
2. **Coverage.** The copy success rate (CSR) is used to measure the percentage of constraints that are successfully generated in the translation.
3. **Decoding speed.** The time-averaged number of generated tokens during inference is used to indicate the time complexity of the approach.

### 3.2 LExical-Constraint-Aware (LeCA) NMT

Given a triple of source sentence, constraints and target sentence $\langle X, C, Y \rangle$, we define the training loss of the LeCA model as

$$\mathcal{L} = \sum \log p(Y \mid X, C; \theta) = \sum \log p(Y \mid \hat{X}; \theta). \tag{3}$$

In the above equation, $\hat{X}$ is a pseudo source sentence, which is constructed by packing the source sentence $X$ and each constraint $C_i$ in the constraint set $C$ together with a separating symbol $\langle\text{sep}\rangle$ (see the example in Figure 2):

$$\hat{X} = \big(X, \langle\text{sep}\rangle, C_1, \langle\text{sep}\rangle, C_2, \ldots, \langle\text{sep}\rangle, C_N, \langle\text{eos}\rangle\big), \tag{4}$$

where $\langle\text{eos}\rangle$ is the end-of-sentence token. To model such a pseudo source sentence $\hat{X}$, we modify the input representation of the encoder to differentiate the source sentence from each constraint, and we add a pointer network to strengthen copying by locating source-side constraints.

**Input representation.** Inspired by BERT [Devlin et al., 2019], we add a learned segment embedding to each token at the encoder embedding layer of the LeCA model. The encoder embedding layer is composed of three components: token embedding, positional embedding and segment embedding, as shown in Figure 2. We differentiate the source sentence and each constraint in three ways. First, we separate them with a special symbol ($\langle\text{sep}\rangle$). Second, the positional index of each constraint starts from the same number, which is larger than the maximum source sentence length. Finally, we use different segment embeddings for the source sentence and each constraint. For a given token, its input embedding is constructed by summing the corresponding token, positional and segment embeddings.

Figure 2: Transformer encoder embedding layer. A special symbol $\langle\text{sep}\rangle$ and an additional learned segment embedding are added to distinguish the source and each constraint. The positional index of each constraint starts from a large enough number. The encoder input is the sum of the token, positional and segment embeddings. These modifications help LeCA better learn to translate with constraints.

**Pointer network.** Following Gulcehre et al. [2016] and Song et al. [2019], we add a pointer network to strengthen copying by locating source-side constraints in the LeCA model (LeCA+Ptr). At decoding time step $t$, the final token probability $p(y_t \mid y_{<t}, \hat{X}; \theta)$ combines the decoder's generation distribution with a copy distribution over the positions of $\hat{X}$, so that constraint tokens can be copied directly into the output.
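The following is a minimal PyTorch-style sketch of an encoder embedding layer along the lines of the input representation described above: token, positional and segment embeddings are summed, constraint positions restart from a fixed offset, and the source and each constraint receive different segment ids. The offset value, the per-constraint segment-id scheme and the module interface are illustrative assumptions; the released code should be consulted for the exact implementation.

```python
import torch
import torch.nn as nn


class ConstraintAwareEmbedding(nn.Module):
    """Token + positional + segment embeddings for a packed input of the
    form `source <sep> c1 <sep> c2 ... <eos>` (cf. Figure 2)."""

    def __init__(self, vocab_size: int, d_model: int, sep_id: int,
                 max_src_len: int = 256, max_constraints: int = 16):
        super().__init__()
        self.sep_id = sep_id
        self.max_constraints = max_constraints
        # Assumed "large enough" starting index for constraint positions.
        self.pos_offset = max_src_len
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(2 * max_src_len, d_model)
        # Segment 0 = source sentence, segment i = i-th constraint (assumption).
        self.seg = nn.Embedding(max_constraints + 1, d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) ids of the packed pseudo source sentence.
        batch, seq_len = tokens.shape
        seg_ids = torch.zeros_like(tokens)
        pos_ids = torch.zeros_like(tokens)
        for b in range(batch):
            seg, pos = 0, 0  # source tokens: segment 0, positions 0, 1, 2, ...
            for i in range(seq_len):
                if tokens[b, i].item() == self.sep_id:
                    # Entering the next constraint: bump the segment id and
                    # restart the positional index from the fixed offset.
                    seg = min(seg + 1, self.max_constraints)
                    pos = self.pos_offset
                seg_ids[b, i] = seg
                pos_ids[b, i] = pos
                pos += 1
        # Encoder input is the sum of the three embeddings.
        return self.tok(tokens) + self.pos(pos_ids) + self.seg(seg_ids)
```

Restarting the constraint positions from an offset larger than any source position, together with the distinct segment embeddings, gives the encoder an explicit signal separating the source sentence from the appended constraints.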