Published as a conference paper at ICLR 2019

OPTIMAL COMPLETION DISTILLATION FOR SEQUENCE LEARNING

Sara Sabour, William Chan, Mohammad Norouzi
{sasabour, williamchan, mnorouzi}@google.com
Google Brain

ABSTRACT

We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence to sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pretraining or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we define a target distribution that puts an equal probability on the first token of each optimal suffix. OCD achieves state-of-the-art performance on end-to-end speech recognition, on both the Wall Street Journal and Librispeech datasets, achieving 9.3% and 4.5% word error rates, respectively.

1 INTRODUCTION

Recent advances in natural language processing and speech recognition hinge on the development of expressive neural network architectures for sequence to sequence (seq2seq) learning (Sutskever et al., 2014; Bahdanau et al., 2015). Such encoder-decoder architectures are adopted in both machine translation (Bahdanau et al., 2015; Wu et al., 2016; Hassan et al., 2018) and speech recognition systems (Chan et al., 2016; Bahdanau et al., 2016a; Chiu et al., 2017), achieving impressive performance above traditional multi-stage pipelines (Koehn et al., 2007; Povey et al., 2011). Improving the building blocks of seq2seq models can fundamentally advance machine translation and speech recognition, and positively impact other domains such as image captioning (Xu et al., 2015), parsing (Vinyals et al., 2015), summarization (Rush et al., 2015), and program synthesis (Zhong et al., 2017).

To improve the key components of seq2seq models, one can either design better architectures or develop better learning algorithms. Recent architectures using convolution (Gehring et al., 2017) and self-attention (Vaswani et al., 2017) have proved to be useful, especially for facilitating efficient training. On the other hand, despite many attempts to mitigate the limitations of Maximum Likelihood Estimation (MLE) (Ranzato et al., 2016; Wiseman and Rush, 2016; Norouzi et al., 2016; Bahdanau et al., 2017; Leblond et al., 2018), MLE is still considered the dominant approach for training seq2seq models. Current alternative approaches require pre-training or joint optimization with conditional log-likelihood. They are difficult to implement and require careful tuning of new hyper-parameters (e.g. mixing ratios). In addition, alternative approaches typically do not offer a substantial performance improvement over a well-tuned MLE baseline, especially when label smoothing (Pereyra et al., 2017; Edunov et al., 2018) and scheduled sampling (Bengio et al., 2015) are used.

In this paper, we borrow ideas from search-based structured prediction (Daumé et al., 2009; Ross et al., 2011) and policy distillation (Rusu et al., 2016) and develop an efficient algorithm for optimizing seq2seq models based on edit distance.¹ Our key observation is that given an arbitrary prefix (e.g. a partial sequence generated by sampling from the model), we can exactly and efficiently identify all of the suffixes that result in a minimum total edit distance vs. the ground truth target.
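To make this observation concrete, the following Python sketch recovers the set of optimal next tokens for a single sampled prefix by maintaining one row of the edit-distance dynamic program. This is an illustrative sketch rather than the authors' released code; the function name, the character-level inputs, and the `</s>` end-of-sequence symbol are assumptions made here for the example.

```python
def optimal_next_tokens(prefix, target, eos="</s>"):
    """Return every next token that keeps the minimum achievable total edit
    distance to `target` after the partial sequence `prefix` (illustrative
    helper; not the paper's implementation)."""
    m = len(target)
    # d[j] = edit distance between `prefix` and the first j tokens of `target`.
    d = list(range(m + 1))  # row for the empty prefix
    for c in prefix:
        prev, d[0] = d[0], d[0] + 1
        for j in range(1, m + 1):
            cur = min(d[j] + 1,                      # insert c
                      d[j - 1] + 1,                  # delete target[j-1]
                      prev + (c != target[j - 1]))   # substitute / match
            prev, d[j] = d[j], cur
    best = min(d)
    # Aligning the prefix with target[:j] at minimum cost makes target[j] an
    # optimal extension; once the whole target is consumed, end-of-sequence is optimal.
    return {target[j] if j < m else eos for j in range(m + 1) if d[j] == best}
```

For instance, with the target "SUNDAY", the prefix "SU" admits the single optimal extension "N", whereas the prefix "SUD" admits {"N", "D", "A"}: each of these can still be completed to a sequence within edit distance 1 of the target. The OCD target distribution at that step places equal probability on each such token.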
Our training procedure, called Optimal Completion Distillation (OCD), is summarized as follows:

1. We always train on prefixes generated by sampling from the model that is being optimized.
2. For each generated prefix, we identify all of the optimal suffixes that result in a minimum total edit distance vs. the ground truth target, using an efficient dynamic programming algorithm.
3. We teach the model to optimally extend each generated prefix by maximizing the average log probability of the first token of each optimal suffix identified in step 2.

¹Edit distance between two sequences u and v is the minimum number of insertion, deletion, and substitution edits required to convert u to v and vice versa.

The proposed OCD algorithm is efficient, straightforward to implement, and has no tunable hyper-parameters of its own. Our key contributions include:

- We propose OCD, a stand-alone algorithm for optimizing seq2seq models based on edit distance. OCD is scalable to real-world datasets with long sequences and large vocabularies, and consistently outperforms Maximum Likelihood Estimation (MLE) by a large margin.
- Given a target sequence of length m and a generated sequence of length n, we present an O(nm) algorithm that identifies all of the optimal extensions for each prefix of the generated sequence.
- We demonstrate the effectiveness of OCD on end-to-end speech recognition using attention-based seq2seq models. On the Wall Street Journal dataset, OCD achieves a Character Error Rate (CER) of 3.1% and a Word Error Rate (WER) of 9.3% without language model rescoring, outperforming all prior work (Table 4). On Librispeech, OCD achieves state-of-the-art WERs of 4.5% on test-clean and 13.3% on test-other (Table 5).

2 BACKGROUND: SEQUENCE LEARNING WITH MLE

Given a dataset of input-output pairs $\mathcal{D} \equiv \{(x, y^*)_i\}_{i=1}^N$, we are interested in learning a mapping $x \to y^*$ from an input $x$ to a target output sequence $y^* \in \mathcal{Y}$. Let $\mathcal{Y}$ denote the set of all sequences of tokens from a finite vocabulary $V$ with variable but finite lengths. Often, learning a mapping $x \to y^*$ is formulated as optimizing the parameters of a conditional distribution $p_\theta(y \mid x)$. The final sequence prediction under the probabilistic model $p_\theta$ is then performed by exact or approximate inference (e.g. via beam search) as

$$\hat{y} \equiv \operatorname*{argmax}_{y \in \mathcal{Y}} \; p_\theta(y \mid x)~. \tag{1}$$

Similar to the use of log loss for supervised classification, the standard approach to optimizing the parameters $\theta$ of the conditional probabilistic model entails maximizing a conditional log-likelihood objective, $O_{\mathrm{MLE}}(\theta) = \mathbb{E}_{(x, y^*) \sim p_{\mathcal{D}}} \log p_\theta(y^* \mid x)$. This approach to learning the parameters is called Maximum Likelihood Estimation (MLE) and is commonly used in sequence to sequence learning. Sutskever et al. (2014) propose the use of recurrent neural networks (RNNs) for autoregressive seq2seq modeling to tractably optimize $O_{\mathrm{MLE}}(\theta)$. An autoregressive model estimates the conditional probability of the target sequence given the source one token at a time, often from left to right. A special end-of-sequence token is appended to the end of every target sequence to handle variable lengths. The conditional probability of $y^*$ given $x$ is decomposed via the chain rule as

$$p_\theta(y^* \mid x) \equiv \prod_{t=1}^{|y^*|} p_{\theta,t}\big(y^*_t \mid y^*_{<t},\, x\big)~.$$
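As a point of reference for the discussion above, here is a small Python sketch of the teacher-forcing MLE objective implied by this factorization. It is not the paper's implementation; the interface (a matrix of per-step log-probabilities under teacher forcing and a list of ground-truth token ids) and the toy vocabulary are assumptions made here for illustration.

```python
import numpy as np

def mle_loss(step_log_probs, target_ids):
    """Negative conditional log-likelihood -log p_theta(y* | x) under the
    chain-rule factorization (illustrative sketch, not the paper's code).

    step_log_probs: [T, V] array; row t holds log p_theta(. | y*_{<t}, x),
                    i.e. the model's log-probabilities after teacher forcing
                    the first t ground-truth tokens.
    target_ids:     length-T list of ground-truth token ids, ending in EOS.
    """
    steps = np.arange(len(target_ids))
    return -float(step_log_probs[steps, target_ids].sum())

# Toy usage with vocabulary {0: "a", 1: "b", 2: "</s>"} and target "a b </s>".
step_log_probs = np.log(np.array([[0.7, 0.2, 0.1],
                                  [0.1, 0.8, 0.1],
                                  [0.2, 0.1, 0.7]]))
print(mle_loss(step_log_probs, [0, 1, 2]))  # -(log 0.7 + log 0.8 + log 0.7)
```

Maximizing the log-likelihood thus reduces to a sequence of per-step classification problems conditioned on the ground-truth prefix; this is the teacher-forcing setup that OCD replaces with prefixes sampled from the model itself.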
Suppose the path P crosses row i at a cell (i, k). Since the operations in (7) are non-decreasing, the edit distance along the path cannot decrease, so D_edit([ỹ ...

[Figure: edit-distance tables between prefixes of sampled sequences and prefixes of the target "SUNDAY".]

APPENDIX B EXPOSURE BIAS

A key limitation of teacher forcing for sequence learning stems from the discrepancy between the training and test objectives: one trains the model using the conditional log-likelihood $O_{\mathrm{CLL}}$, but evaluates the quality of the model using the empirical reward $O_{\mathrm{ER}}$. Unlike teacher forcing and Scheduled Sampling (SS), policy gradient approaches (e.g. Ranzato et al. (2016); Bahdanau et al. (2017)) and OCD aim to optimize the empirical reward objective (4) on the training set. We illustrate the four different training strategies (MLE, SS, Policy Gradient, and OCD) in Figure B.1.

The drawback of policy gradient techniques is twofold: 1) they cannot easily incorporate ground truth sequence information except through the reward function, and 2) they have difficulty reducing the variance of the gradients to perform proper credit assignment. Accordingly, most policy gradient approaches (Ranzato et al., 2016; Bahdanau et al., 2017; Wu et al., 2016) pre-train the model using teacher forcing. By contrast, the OCD method proposed in this paper defines an optimal completion policy $\pi^*_t$ for any off-policy prefix by incorporating the ground truth information. OCD then optimizes a token-level log-loss, which alleviates the credit assignment problem. Finally, training is much more stable, and we require neither initialization with nor joint optimization alongside MLE.

There is an intuitive notion of exposure bias (Ranzato et al., 2016) discussed in the literature as a limitation of teacher forcing. We formalize this notion as follows. One can think of the optimization of the log loss (3) in an autoregressive model as a classification problem, where the input to the classifier is a tuple (s, y