# Sequence Generation with Mixed Representations

Lijun Wu*1,2, Shufang Xie*1, Yingce Xia1, Yang Fan3, Tao Qin1, Jianhuang Lai2, Tie-Yan Liu1

*Equal contribution. 1Microsoft Research, Beijing, China; 2Sun Yat-sen University, Guangzhou, China; 3University of Science and Technology of China, Hefei, China. Correspondence to: Lijun Wu, Yingce Xia. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Tokenization is the first step of many natural language processing (NLP) tasks and plays an important role for neural NLP models. Tokenization methods such as byte-pair encoding and Sentence Piece, which can greatly reduce the large vocabulary size and deal with out-of-vocabulary words, have been shown to be effective and are widely adopted for sequence generation tasks. While various tokenization methods exist, there is no common acknowledgement of which one is the best. In this work, we propose to leverage the mixed representations from different tokenizers for sequence generation tasks, so as to take advantage of each individual tokenization method. Specifically, we introduce a new model architecture to incorporate mixed representations and a co-teaching algorithm to better utilize the diversity of different tokenization methods. Our approach achieves significant improvements on neural machine translation tasks with six language pairs, as well as on an abstractive summarization task.

1. Introduction

Natural language processing (NLP) has achieved great success with deep neural networks in recent years (Deng & Liu, 2018; Zhang et al., 2015; Deng et al., 2013; Wu et al., 2016; Hassan et al., 2018). For neural NLP models, tokenization, which chops a raw sequence up into pieces, is the first step and plays the most important role in text preprocessing. Previously, tokenization was usually performed at the word level (Arppe et al., 2005; Bahdanau et al., 2014), which splits a raw sentence by spaces and applies language-specific rules to punctuation marks, or at the character level (Kim et al., 2016; Lee et al., 2017), which directly segments words into individual characters. However, previous works revealed that such tokenization methods have their own drawbacks (Ling et al., 2015; Cherry et al., 2018; Luong & Manning, 2016) and often lead to inaccurate results.

Table 1. Tokenization results of a German sentence (Ori stands for original) produced by the BPE, Word Piece (WP) and Sentence Piece (SP) tokenizers. @@ and ▁ represent the boundaries of subwords.

Ori: und diese einfachen themen sind eigentlich keine komplexen wissenschaftlichen zusammenhänge
BPE: und diese einfachen themen sind eigentlich keine komplex@@ en wissenschaftlichen zusammen@@ hän@@ ge
WP: und diese einfachen themen sind eigentlich keine komplexe n wissenschaftlichen zusammen h änge
SP: und diese einfachen them en sind eigentlich keine komplexen wissenschaft lichen zusammenh änge

Recently, several tokenization methods that split sentences into subword units according to the statistics of consecutive characters have been shown to be very effective. Byte-pair encoding (BPE) (Sennrich et al., 2015) constructs the vocabulary based on subword frequency, and word-level tokenization should be applied before using BPE.
Sentence Piece (SP)1 (Kudo, 2018; Kudo & Richardson, 2018) and Word Piece (WP) (Schuster & Nakajima, 2012) tokenizers leverage language models to build the vocabulary and can be applied to raw text without word tokenization. Table 1 shows the outputs of the three tokenizers on an example German sentence; they differ considerably from each other. For example, BPE segments the word komplexen into komplex and en, WP segments it into komplexe and n, while SP keeps it as is.

1 To be concrete, we refer to the unigram language model in (Kudo, 2018) as Sentence Piece.

Despite the effectiveness of these various tokenizers, there is no common acknowledgement of which approach is the best across different tasks. Taking machine translation as an example, preliminary experiments (see Table 2) show that no specific tokenizer universally performs best across different datasets.

Table 2. Performance of the BPE, Word Piece (WP) and Sentence Piece (SP) tokenizers on different IWSLT translation tasks. Details of the experimental settings are given in Section 4.1.

| Dataset | BPE | SP | WP |
| --- | --- | --- | --- |
| German→English | 34.84 | 34.77 | 34.91 |
| English→German | 28.80 | 28.45 | 28.71 |
| English→Romanian | 24.56 | 24.67 | 24.63 |

Considering that each tokenization method has its own strengths and unique characteristics, instead of inventing another tokenizer, can we leverage these different tokenization methods to build better sequence learning models?

In this work, we leverage multiple tokenizers in a single model. We focus on sequence generation tasks, whose backbone is the encoder-decoder network. We propose a new model architecture and a co-teaching algorithm: (1) The new architecture incorporates different tokenizers into a unified model to obtain mixed representations. Specifically, two encoders are used to process the differently tokenized input sequences, and each layer of one encoder interacts with the previous layer of the other encoder to generate the mixed representations. Correspondingly, there are two decoders that generate sequences in the two tokenized formats, both taking the mixed representations from the encoders as inputs. In this way, the model benefits from the representations enriched by combining different tokenizers. (2) Beyond the model structure, we propose a co-teaching algorithm to further utilize the diversity of outputs from different tokenizers. On the decoder side, our model can generate sentences in different tokenization formats, so we let each decoder iteratively learn, round by round, from the other decoder, which uses a different tokenizer. In particular, our approach works well in the low-resource setting, since mixed representations from different tokenizers, rather than a single one, greatly increase data diversity, while frequency-based and language-model-based methods each have their own limitations when data is scarce.

We evaluate our model and training algorithm on two standard sequence generation tasks, machine translation and abstractive summarization. On six translation language pairs and 12 translation tasks in total, such as English↔German, English↔Dutch and English↔Romanian, our approach outperforms the baselines by more than 1.0 BLEU point. For abstractive summarization, experimental results also show consistent improvements over using only one tokenizer. Our code is provided at https://github.com/apeterswu/fairseq_mix.
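For concreteness, the following minimal sketch shows how the same corpus can be segmented with BPE and Sentence Piece using the open-source subword-nmt and sentencepiece packages. The file name train.de, the merge count and the vocabulary size are assumptions that mirror the settings reported in Section 4.1; the script is our illustration, not the authors' preprocessing pipeline.

```python
import sentencepiece as spm
from subword_nmt.apply_bpe import BPE
from subword_nmt.learn_bpe import learn_bpe

sentence = ("und diese einfachen themen sind eigentlich keine komplexen "
            "wissenschaftlichen zusammenhänge")

# BPE: learn merge operations on word-tokenized text, then apply them.
with open("train.de", encoding="utf-8") as fin, \
        open("bpe.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, num_symbols=10000)        # ~10k merges, as in Section 4.1
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line(sentence))                  # subwords carry "@@" continuation marks

# Sentence Piece: a unigram LM segmentation learned directly on raw text.
spm.SentencePieceTrainer.train(input="train.de", model_prefix="sp_de",
                               vocab_size=12000, model_type="unigram")
sp = spm.SentencePieceProcessor(model_file="sp_de.model")
print(sp.encode(sentence, out_type=str))           # pieces are prefixed with "▁"
```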
2. Related Work

In this section, we first introduce the background of several tokenization approaches and then review recent works that leverage different tokenizers.

2.1. Tokenization Approaches

We describe the details of the different tokenization approaches here. The BPE tokenizer (Sennrich et al., 2015) initializes the vocabulary with all characters and builds the final vocabulary by iteratively merging frequent n-gram character sequences. Similarly, Word Piece (Schuster & Nakajima, 2012) also constructs the vocabulary from characters; different from BPE, Word Piece forms a new subword according to the n-gram likelihood on the training data instead of the next most frequent pair. Sentence Piece (Kudo, 2018; Kudo & Richardson, 2018) is based on the assumption that all subword occurrences are independent and that a tokenized sequence is produced by the product of the subword occurrence probabilities; it therefore selects subwords and builds the subword dictionary according to word occurrences and the loss of each subword. Both Word Piece and Sentence Piece leverage language models to build their vocabularies.

2.2. Leveraging Tokenization Approaches

Besides inventing new tokenizers to handle rare and unseen words for text-based sequence learning, several recent works focus on leveraging existing tokenizers from different aspects. Srinivasan et al. (2019) propose to use BPE with multiple merge operations to build the vocabulary on the target side of a multitask learning model. Pan et al. (2020) introduce a new morphological subword tokenization method that combines morpheme segmentation and BPE for morphologically rich languages, but only on the source side. Moreover, Wang et al. (2019b) focus on the multilingual machine translation setting and design a multilingual lexicon encoding framework named soft decoupled encoding, which represents a word by its spelling through characters and by its semantic meaning through a shared latent embedding. Provilkov et al. (2019) stochastically corrupt the segmentation procedure of BPE and produce multiple segmentations within the BPE framework. Our work differs from the above in that we leverage existing subword tokenization methods effectively on both the source and the target languages. In addition, we propose a co-teaching algorithm to better leverage the different tokenization methods.

3. Architecture and Co-Teaching Algorithm

We first introduce our proposed architecture in Section 3.1 and the co-teaching algorithm in Section 3.2. In this work, we mainly describe mixed representations obtained from the BPE and SP tokenizers; it is easy to extend our algorithm to any other two types of tokenizers.

3.1. Architecture

Our model is based on the Transformer (Vaswani et al., 2017). The overall architecture of our model is shown in Figure 1.

Figure 1. The overall framework of our proposed model. The important components are mix-attention (Mix-Attn in the figure), SP-enc-attention (SP-Attn) and BPE-enc-attention (BPE-Attn). For mix-attention, the query (Q) input is drawn with red lines, while the blue dashed lines stand for the key (K) and value (V) used in the attention computation. For SP-enc-attention and BPE-enc-attention, K and V are also drawn with blue dashed lines.
Here we only show one decoder, since the other one has the same components except for the differently tokenized inputs. Note that the residual connections and the layer normalization after each component are the same as in the standard Transformer; we omit them to save space.

Different from the Transformer, there are two N-layer encoders and two N-layer decoders in our model, and the two encoders are mixed through attention layers.

Encoder  For each input sentence x, we first process it with the two tokenizers, BPE and SP, and obtain x_bpe and x_sp respectively. There are two embeddings in our model, one for BPE sequences (denoted as E_bpe) and one for SP sequences (denoted as E_sp). We leverage two encoders, one starting from E_bpe (denoted as enc_bpe) and the other from E_sp (denoted as enc_sp). Two types of attention layers are used in each encoder: the self-attention layer, which also exists in the standard Transformer (Vaswani et al., 2017), and the mix-attention layer that we propose, which fuses the BPE representations and the SP representations. They are shown as Self-Attn and Mix-Attn in Figure 1, respectively.

Mathematically, let $H^l_{bpe}$ and $H^l_{sp}$ be the representations in the $l$-th layer of enc_bpe and enc_sp respectively, $l \in [N]$. In particular, let $H^0_{bpe}$ and $H^0_{sp}$ denote the embedding representations of x_bpe and x_sp. For any $l \in [N]$, $H^l_{bpe}$ and $H^l_{sp}$ are obtained as follows:

$$H^l_{bpe} = \mathrm{attn}^{bpe}_{s}(H^{l-1}_{bpe}, H^{l-1}_{bpe}, H^{l-1}_{bpe}) + \mathrm{attn}^{bpe}_{m}(H^{l-1}_{bpe}, H^{l-1}_{sp}, H^{l-1}_{sp}); \tag{1}$$

$$H^l_{sp} = \mathrm{attn}^{sp}_{s}(H^{l-1}_{sp}, H^{l-1}_{sp}, H^{l-1}_{sp}) + \mathrm{attn}^{sp}_{m}(H^{l-1}_{sp}, H^{l-1}_{bpe}, H^{l-1}_{bpe}); \tag{2}$$

where $\mathrm{attn}^{bpe}_{s}$ and $\mathrm{attn}^{bpe}_{m}$ denote the self-attention layer and the mix-attention layer in enc_bpe, and $\mathrm{attn}^{sp}_{s}$ and $\mathrm{attn}^{sp}_{m}$ denote the corresponding layers in enc_sp. Each attention layer $\mathrm{attn}(Q, K, V)$ is implemented as follows:

$$\mathrm{attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{(QW_Q)(KW_K)^\top}{\sqrt{d}}\right)(VW_V), \tag{3}$$

where Q, K, V denote the query, key and value respectively, d is the embedding dimension, and $W_Q$, $W_K$, $W_V$ are the parameters to be learned. Compared with the standard Transformer, in each encoder (and, later, each decoder) we only revise the attention layers. For simplicity, we do not include the formulations of the feed-forward layers, layer normalization (Ba et al., 2016) or residual connections in the main text; details can be found in our code.

From Eqn.(1), we can see that $H^l_{bpe}$ is the sum of two components: one depends only on the output of the previous layer of enc_bpe, and the other depends on the $(l-1)$-th layer of enc_sp. In this way, for each subword obtained by BPE, the representation is enriched by that from Sentence Piece; similarly, subwords obtained by Sentence Piece are enriched by BPE. Therefore, $H^l_{bpe}$ and $H^l_{sp}$ are mixed representations for each individual subword, enhanced to provide more comprehensive contexts. The final representations of the encoders are $H^N_{bpe}$ and $H^N_{sp}$, output by the last layer.
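The following PyTorch sketch illustrates one layer of the two mixed encoders, i.e., Eqns.(1)-(2). It is a simplified illustration rather than the authors' fairseq implementation: it uses torch.nn.MultiheadAttention, omits the feed-forward sub-layer, and reuses one attention module per branch to mirror the parameter sharing described in Section 4.1.

```python
import torch
import torch.nn as nn


class MixedEncoderLayer(nn.Module):
    """One layer of the two mixed encoders (Eqns. (1)-(2)), feed-forward omitted."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # Self-attention and mix-attention of the same branch share parameters
        # (Section 4.1), so a single module per branch is reused for both roles.
        self.attn_bpe = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_sp = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_bpe = nn.LayerNorm(d_model)
        self.norm_sp = nn.LayerNorm(d_model)

    def forward(self, h_bpe, h_sp):
        # Eqn. (1): self-attention on the BPE branch plus attention into the SP branch.
        self_bpe, _ = self.attn_bpe(h_bpe, h_bpe, h_bpe)
        mix_bpe, _ = self.attn_bpe(h_bpe, h_sp, h_sp)
        # Eqn. (2): the symmetric update for the SP branch.
        self_sp, _ = self.attn_sp(h_sp, h_sp, h_sp)
        mix_sp, _ = self.attn_sp(h_sp, h_bpe, h_bpe)
        # Residual connection and layer norm, as in the standard Transformer.
        h_bpe = self.norm_bpe(h_bpe + self_bpe + mix_bpe)
        h_sp = self.norm_sp(h_sp + self_sp + mix_sp)
        return h_bpe, h_sp


# Toy usage: two differently tokenized views of the same sentence batch.
layer = MixedEncoderLayer()
h_bpe = torch.randn(2, 11, 256)   # (batch, BPE length, d_model)
h_sp = torch.randn(2, 9, 256)     # (batch, SP length, d_model)
h_bpe, h_sp = layer(h_bpe, h_sp)
```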
Decoder  The mixed representations output by the encoders benefit the decoding process. There are two decoders in our architecture, dec_bpe and dec_sp, which generate sequences in the BPE and SP tokenizations respectively. To fuse the representations on the decoder side, we let the encoder-decoder attention layers attend to the outputs of both enc_bpe and enc_sp. That is, for each decoder, we introduce two additional attention layers: a BPE-enc-attention layer, which bridges the output of enc_bpe to the decoder, and an SP-enc-attention layer, which bridges enc_sp and the decoder. Let $S^l_{bpe}$ and $S^l_{sp}$ be the representations in the $l$-th layer of dec_bpe and dec_sp, and let $S^0_{bpe}$ and $S^0_{sp}$ be the embeddings of the sentences y_bpe and y_sp. The two decoders work as follows:

$$\tilde{S}^l_{bpe} = \mathrm{attn}^{bpe}_{s}(S^{l-1}_{bpe}, S^{l-1}_{bpe}, S^{l-1}_{bpe}), \tag{4}$$

$$\tilde{S}^l_{sp} = \mathrm{attn}^{sp}_{s}(S^{l-1}_{sp}, S^{l-1}_{sp}, S^{l-1}_{sp}), \tag{5}$$

$$S^l_{bpe} = \mathrm{attn}^{bpe}_{bpe}(\tilde{S}^l_{bpe}, H^N_{bpe}, H^N_{bpe}) + \mathrm{attn}^{bpe}_{sp}(\tilde{S}^l_{bpe}, H^N_{sp}, H^N_{sp}), \tag{6}$$

$$S^l_{sp} = \mathrm{attn}^{sp}_{sp}(\tilde{S}^l_{sp}, H^N_{sp}, H^N_{sp}) + \mathrm{attn}^{sp}_{bpe}(\tilde{S}^l_{sp}, H^N_{bpe}, H^N_{bpe}), \tag{7}$$

where all the attn's are implemented as in Eqn.(3) with different parameters: the superscript indicates which decoder the attention layer belongs to, and the subscripts s, bpe and sp indicate self-attention, attending to the outputs of enc_bpe, and attending to the outputs of enc_sp, respectively. The two decoders first use self-attention layers to process the outputs $S^{l-1}_{bpe}$ and $S^{l-1}_{sp}$ of the previous layers, obtaining $\tilde{S}^l_{bpe}$ and $\tilde{S}^l_{sp}$. Then dec_bpe and dec_sp use $\tilde{S}^l_{bpe}$ and $\tilde{S}^l_{sp}$ as queries to attend to both $H^N_{bpe}$ and $H^N_{sp}$, the encoder representations of the input sequence tokenized in the two ways. In this way, $S^l_{bpe}$ and $S^l_{sp}$ leverage both the output $H^N_{bpe}$ of the BPE encoder and the output $H^N_{sp}$ of the SP encoder, which gives the corresponding decoder enriched representations during the decoding process.

After the final layers of dec_bpe and dec_sp output $S^N_{bpe}$ and $S^N_{sp}$, they are mapped through a linear transformation followed by a softmax. Based on the output probabilities, the BPE decoder generates the output sentence tokenized in BPE format, and the SP decoder generates the output sentence tokenized in SP format.

In our proposed method, we mix the two tokenized sequences, one produced by BPE and the other by SP, by summing their representations. In Section 5.1, we discuss an alternative, the stacked representation, in which the self-attention layer and the mix-attention layer are stacked sequentially instead of added in parallel; experimental results show that adding up the representations is better. For the decoders, we could in principle also add a mix-attention component as used in the encoders; however, this would make beam search decoding quite complex when selecting the top-k beams, so we leave it as future work.
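A matching PyTorch sketch of one dec_bpe layer, Eqns.(4) and (6), is given below (dec_sp is symmetric). Again this is a simplified illustration, not the authors' fairseq code: the causal mask is supplied by the caller, the feed-forward sub-layer and incremental decoding are omitted, and a single cross-attention module is reused for BPE-enc-attention and SP-enc-attention as a stand-in for the parameter sharing described in Section 4.1.

```python
import torch
import torch.nn as nn


class MixedDecoderLayer(nn.Module):
    """One layer of dec_bpe (Eqns. (4) and (6)); dec_sp is symmetric."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One cross-attention module reused for BPE-enc-attention and SP-enc-attention.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, s, h_bpe, h_sp, causal_mask=None):
        # Eqn. (4): (masked) self-attention over the decoder states.
        s_tilde, _ = self.self_attn(s, s, s, attn_mask=causal_mask)
        s_tilde = self.norm1(s + s_tilde)
        # Eqn. (6): attend to both encoder outputs with s_tilde as the query, then add.
        from_bpe, _ = self.cross_attn(s_tilde, h_bpe, h_bpe)
        from_sp, _ = self.cross_attn(s_tilde, h_sp, h_sp)
        return self.norm2(s_tilde + from_bpe + from_sp)


# Toy usage with encoder outputs H^N_bpe and H^N_sp of matching dimension.
dec_layer = MixedDecoderLayer()
s = torch.randn(2, 7, 256)        # (batch, target length, d_model)
s = dec_layer(s, torch.randn(2, 11, 256), torch.randn(2, 9, 256))
```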
3.2. Co-Teaching Algorithm

The two encoders interact through the mix-attention layers to generate mixed representations, while the two decoders work independently. To better utilize the diversity brought by the different tokenizers, we further propose a co-teaching algorithm, by which the two decoders teach each other.

Denote the original bilingual data as $D = \{(x^i, y^i)\}_{i=1}^{M}$, where M is the number of training sentence pairs. For simplicity, define $[M] = \{1, 2, \dots, M\}$. After applying the BPE and SP tokenizers to the data pairs in D, we obtain a dataset of four-element tuples, $D_{mix} = \{(x^i_{bpe}, y^i_{bpe}, x^i_{sp}, y^i_{sp})\}_{i=1}^{M}$, where $x^i_{bpe}, y^i_{bpe}$ are the outputs of the BPE tokenizer and $x^i_{sp}, y^i_{sp}$ are obtained with the SP tokenizer. Let $\theta^e$ denote the parameters of the encoders (including both enc_bpe and enc_sp), and let $\theta^d_{bpe}$ and $\theta^d_{sp}$ be the parameters of the decoder for BPE and the decoder for SP respectively.

Given any $(x^i_{bpe}, y^i_{bpe}, x^i_{sp}, y^i_{sp})$ in $D_{mix}$, let $P(y^i_{bpe} \mid x^i_{bpe}, x^i_{sp}; \theta^e, \theta^d_{bpe})$ denote the probability that $y^i_{bpe}$ is generated by the decoder dec_bpe, given the mixed representation provided by $x^i_{bpe}$ and $x^i_{sp}$. Similarly, $P(y^i_{sp} \mid x^i_{bpe}, x^i_{sp}; \theta^e, \theta^d_{sp})$ is the probability that $y^i_{sp}$ is generated by the decoder dec_sp. Both encoders are involved in the computation of either probability. In our proposed co-teaching algorithm, the loss function consists of two parts.

(Part 1) Negative log-likelihood loss: Following the common practice in sequence-to-sequence learning, we use the negative log-likelihood loss for model training. Since we have two decoders with different tokenizers, given $D_{mix}$ the loss is

$$\ell_{nll} = -\sum_{i=1}^{M} \log P(y^i_{bpe} \mid x^i_{bpe}, x^i_{sp}; \theta^e, \theta^d_{bpe}) - \sum_{i=1}^{M} \log P(y^i_{sp} \mid x^i_{bpe}, x^i_{sp}; \theta^e, \theta^d_{sp}). \tag{8}$$

From Eqn.(8), we can see that no matter which decoder is used, both encoders are updated, since their mixed representations are used by each decoder.

(Part 2) Co-teaching loss: Co-teaching is implemented in a distillation manner (Kim et al., 2016; Hinton et al., 2015). Let $\theta^e_{t-1}$, $\theta^d_{bpe,t-1}$ and $\theta^d_{sp,t-1}$ denote the parameters obtained at epoch $t-1$, where t is a positive integer. With these parameters, we first generate samples as follows:

(i) For each $i \in [M]$, sample $\hat{y}^i_{bpe} \sim P(\cdot \mid x^i_{bpe}, x^i_{sp}; \theta^e_{t-1}, \theta^d_{bpe,t-1})$; detokenize $\hat{y}^i_{bpe}$ into the raw sentence and re-tokenize it with SP to get $\hat{y}^i_{bpe \to sp}$.

(ii) For each $i \in [M]$, sample $\hat{y}^i_{sp} \sim P(\cdot \mid x^i_{bpe}, x^i_{sp}; \theta^e_{t-1}, \theta^d_{sp,t-1})$; detokenize $\hat{y}^i_{sp}$ into the raw sentence and re-tokenize it with BPE to get $\hat{y}^i_{sp \to bpe}$.

The co-teaching loss is defined on the above data:

$$\ell_{co} = -\sum_{i=1}^{M} \log P(\hat{y}^i_{sp \to bpe} \mid x^i_{bpe}, x^i_{sp}; \theta^e, \theta^d_{bpe}) - \sum_{i=1}^{M} \log P(\hat{y}^i_{bpe \to sp} \mid x^i_{bpe}, x^i_{sp}; \theta^e, \theta^d_{sp}). \tag{9}$$

Co-teaching is thus conducted by letting each decoder generate new samples that teach the other decoder. In this way, the two decoders interact with each other through the data, and the diversity from the different tokenizers is leveraged. The overall training loss is

$$\ell_{all} = \ell_{nll} + \ell_{co}. \tag{10}$$

We obtain $\theta^e_t, \theta^d_{bpe,t}, \theta^d_{sp,t} = \arg\min \ell_{all}$. Then we go back to (Part 2), re-sample the data using the latest model, and optimize Eqn.(10) again, until convergence. To stabilize training, we can first optimize Eqn.(8) until convergence and then introduce the co-teaching loss.
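The sketch below summarizes one co-teaching round in code form. All helpers (prev_model.sample, model.nll, detok, bpe_tok, sp_tok) are hypothetical wrappers around the model and the tokenizers, introduced purely for illustration; the actual training loop lives in the authors' fairseq_mix repository.

```python
def co_teaching_epoch(model, prev_model, optimizer, d_mix, bpe_tok, sp_tok, detok):
    """One co-teaching round: generate with the previous-epoch parameters (theta_{t-1}),
    then optimize l_all = l_nll + l_co (Eqns. (8)-(10)).  d_mix is a list of
    (x_bpe, y_bpe, x_sp, y_sp) tuples."""
    # Steps (i)/(ii): sample with the frozen previous-round model and re-tokenize
    # each sampled sentence with the *other* tokenizer.
    teach_bpe, teach_sp = [], []
    for x_bpe, y_bpe, x_sp, y_sp in d_mix:
        y_hat_bpe = prev_model.sample(x_bpe, x_sp, decoder="bpe")   # hypothetical API
        y_hat_sp = prev_model.sample(x_bpe, x_sp, decoder="sp")
        teach_sp.append(sp_tok(detok(y_hat_bpe)))    # y_hat_{bpe->sp}, teaches dec_sp
        teach_bpe.append(bpe_tok(detok(y_hat_sp)))   # y_hat_{sp->bpe}, teaches dec_bpe

    # Optimize the combined loss on the original references plus the teaching targets.
    for (x_bpe, y_bpe, x_sp, y_sp), y_tb, y_ts in zip(d_mix, teach_bpe, teach_sp):
        loss = (model.nll(x_bpe, x_sp, y_bpe, decoder="bpe")        # l_nll, Eqn. (8)
                + model.nll(x_bpe, x_sp, y_sp, decoder="sp")
                + model.nll(x_bpe, x_sp, y_tb, decoder="bpe")       # l_co, Eqn. (9)
                + model.nll(x_bpe, x_sp, y_ts, decoder="sp"))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```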
4. Experiments

To evaluate our proposed model and training algorithm, we conduct experiments on two standard sequence generation tasks: machine translation and abstractive summarization. Machine translation aims to translate sentences from a source language into a target language, while the goal of abstractive summarization is to summarize a long document with a few sentences.

4.1. Machine Translation

We introduce the detailed settings for the machine translation tasks and report the experimental results in this section.

4.1.1. Experimental Settings

Data  We conduct experiments on standard translation tasks with multiple language pairs: English↔German (En↔De for short), English↔Dutch (En↔Nl), English↔Polish (En↔Pl), English↔Portuguese-Brazil (En↔Pt-br), English↔Turkish (En↔Tr) and English↔Romanian (En↔Ro). These benchmark datasets all come from the widely acknowledged IWSLT-2014 machine translation competition2 (Cettolo et al., 2014). Each of them contains only 100k-200k sentence pairs, which can be viewed as a low-resource setting to some degree; as introduced before, our approach works well in low-resource learning tasks. For each language pair, we work on bidirectional translation, i.e., En→X and X→En where X is the other language, resulting in 12 machine translation tasks. Following (Bahdanau et al., 2016; Wu et al., 2017; 2018; Ott et al., 2019), we preprocess the datasets with commonly acknowledged operations using the Moses toolkit3 (Koehn et al., 2007), such as lowercasing and length filtering. The resulting datasets contain about 160k/7k/7k pairs for the training/validation/test sets of the En↔De task, 180k/4.7k/1.1k for En↔Ro, 170k/4.5k/1.1k for En↔Nl, 175k/4.5k/1.2k for En↔Pt-br, 181k/4.7k/1.2k for En↔Pl, and 160k/4.5k/1k for En↔Tr.

Tokenization  We adopt the BPE4 (Sennrich et al., 2015) and Sentence Piece5 (Kudo, 2018) tokenizers introduced previously. The number of merge operations for BPE is 10k, resulting in a subword vocabulary of nearly 10k for all language pairs. For Sentence Piece, we extract a 12k subword vocabulary for all language pairs. These vocabularies are built on the joint source and target sentences.

Model  Since our model is based on the standard Transformer (Vaswani et al., 2017) architecture, we keep most hyperparameters consistent with the Transformer configurations. Specifically, we adopt the transformer_iwslt_de_en configuration for the standard Transformer baseline models: the embedding dimension is 512, the size of the feed-forward layer is 1024, and each encoder and decoder contains 6 layers. We share the source and target embeddings since we have a joint dictionary for each language pair. The dropout rate (Srivastava et al., 2014) is 0.3 and the weight decay is 0.0001 for all experiments. The model is optimized with the Adam optimizer (Kingma & Ba, 2014), and the learning rate schedule is the default one used in Vaswani et al. (2017). Label smoothing (Pereyra et al., 2017) is used with weight 0.1. To control the number of model parameters, we set the embedding size of our model to 256 instead of the 512 used in the baseline model. We also share the parameters of the mix-attention and self-attention components, as well as those of the BPE-enc-attention and SP-enc-attention components; that is, in Eqns.(1)-(2) and Eqns.(4)-(7), all layers with the same superscript are shared. The other configurations are kept the same as the baseline settings. Since the mixed representations are obtained by adding two attention branches, to better balance the contributions of the two inputs we apply the drop-path trick, following Larsson et al. (2016) and Zhu et al. (2020), in both the encoder and the decoder, with drop values 0.2 and 0.3. The implementation is based on the Fairseq toolkit6 (Ott et al., 2019).

2 https://wit3.fbk.eu/mt.php?release=2014-01
3 https://github.com/moses-smt/mosesdecoder/tree/master/scripts
4 https://github.com/rsennrich/subword-nmt
5 https://github.com/google/sentencepiece
6 https://github.com/pytorch/fairseq
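The drop-path trick mentioned above can be read as FractalNet-style branch dropping applied to the two attention branches that are summed in Eqns.(1)-(2) and (6)-(7). The sketch below is our own interpretation under that reading, not the authors' exact implementation; the probability p stands in for the reported drop values of 0.2 and 0.3.

```python
import random

import torch


def drop_path_sum(self_out: torch.Tensor, mix_out: torch.Tensor,
                  p: float = 0.2, training: bool = True) -> torch.Tensor:
    """Sum two attention branches, randomly dropping one of them during training."""
    if not training or p <= 0.0:
        return self_out + mix_out
    keep_self = random.random() >= p
    keep_mix = random.random() >= p
    if not (keep_self or keep_mix):        # never drop both branches
        keep_self = True
    out = torch.zeros_like(self_out)
    if keep_self:
        out = out + self_out
    if keep_mix:
        out = out + mix_out
    return out
```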
Table 3. Machine translation results of our model and the standard Transformer on the IWSLT-2014 translation datasets. Transformer 512 and Transformer 256 refer to the baseline models with embedding dimensions 512 and 256; our model uses embedding dimension 256. Results significantly better than Transformer 512 (p-value less than 0.05; Koehn, 2004) are marked in bold in the original table.

| Model | En→De BPE | En→De SP | De→En BPE | De→En SP | En→Ro BPE | En→Ro SP | Ro→En BPE | Ro→En SP | En→Nl BPE | En→Nl SP | Nl→En BPE | Nl→En SP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Transformer 512 | 28.80 | 28.45 | 34.84 | 34.77 | 24.56 | 24.67 | 32.07 | 31.72 | 29.23 | 29.48 | 33.33 | 33.08 |
| Transformer 256 | 28.56 | 28.24 | 34.49 | 34.39 | 23.97 | 24.27 | 30.74 | 31.11 | 29.16 | 29.28 | 32.80 | 33.02 |
| Our model | 28.96 | 28.88 | 35.51 | 35.25 | 25.11 | 24.86 | 32.11 | 32.21 | 29.85 | 29.97 | 33.82 | 33.96 |
| + co-teaching | 29.93 | 29.71 | 36.41 | 36.31 | 25.98 | 25.68 | 33.12 | 32.89 | 31.16 | 31.11 | 34.45 | 34.61 |

| Model | En→Pt-br BPE | En→Pt-br SP | Pt-br→En BPE | Pt-br→En SP | En→Pl BPE | En→Pl SP | Pl→En BPE | Pl→En SP | En→Tr BPE | En→Tr SP | Tr→En BPE | Tr→En SP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Transformer 512 | 38.72 | 38.86 | 43.52 | 43.78 | 13.08 | 12.77 | 17.85 | 17.86 | 16.32 | 16.07 | 25.06 | 25.49 |
| Transformer 256 | 38.44 | 38.53 | 42.57 | 42.88 | 12.73 | 12.73 | 17.33 | 17.12 | 15.62 | 15.93 | 24.74 | 24.90 |
| Our model | 39.19 | 39.08 | 43.98 | 43.44 | 13.46 | 13.21 | 18.07 | 17.79 | 16.49 | 16.69 | 25.84 | 25.68 |
| + co-teaching | 40.31 | 40.16 | 44.19 | 44.24 | 14.08 | 13.88 | 18.26 | 18.44 | 18.06 | 18.29 | 27.06 | 26.92 |

Evaluation  We use the beam search algorithm (Sutskever et al., 2014; Medress et al., 1977) to generate the translation results, with beam width 5 and length penalty 1.0. We report both BPE and SP results when testing the effectiveness of our model architecture and the co-teaching algorithm. The evaluation metric is the commonly used tokenized BLEU score (Papineni et al., 2002), computed with the multi-bleu.perl script7.

7 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

4.1.2. Results

We report the BLEU results in Table 3. The Transformer baseline models are trained with embedding size 512 (denoted as Transformer 512) and 256 (denoted as Transformer 256). Our model uses embedding size 256, which ensures that its size does not exceed that of Transformer 512; for example, on the En↔De tasks, the numbers of parameters of Transformer 512, Transformer 256 and our model are 37M, 13M and 27M respectively. As shown in Table 3, the Transformer 512 baseline with more parameters leads to better results than Transformer 256, as expected. With our proposed model, we achieve consistent improvements across all 12 translation tasks, for both the BPE and the SP tokenizer. Compared with Transformer 256, our proposed model architecture outperforms the baseline by a non-trivial margin; for example, on De→En, En→Ro and Pt-br→En we obtain more than a 1.0 BLEU score improvement. Compared with Transformer 512, both our BPE-decoded and our SP-decoded results outperform the corresponding baselines by about 0.5 points on most tasks. These improvements demonstrate the effectiveness of the mixed representations from two tokenizers and of our model architecture.

We then apply co-teaching to our model. On En↔De, we found that the generated sequences in BPE format and in SP format are of similar quality, with the BPE tokenizer leading to slightly better results; we report both the BPE and the SP results in the table. On 11 out of 12 translation tasks, co-teaching brings more than a 1.0 BLEU improvement, and on En↔Tr we achieve an additional improvement of more than 1.5 points. That is, our model architecture benefits further from co-teaching.
Comparison with existing works  In Table 4 we summarize recent results of several existing works on the widely acknowledged IWSLT De→En translation benchmark, in order to give a clear comparison with our approach. These works vary along multiple aspects, e.g., training algorithm design (Wang et al., 2019a), model architecture design (Lu et al., 2019; Wu et al., 2019) and data augmentation (Zhu et al., 2019). From the table, we can observe that our approach outperforms all the previous works, which again demonstrates that our method is very effective. In particular, our approach surpasses the BERT-fused NMT model (Zhu et al., 2020), which incorporates the large-scale pre-trained BERT language model. Our method is complementary to these previous works, and we will further explore how to combine them in the future.

Table 4. Comparison with existing works on the IWSLT De→En translation task.

| Approach | BLEU |
| --- | --- |
| Transformer 512 (Vaswani et al., 2017) | 34.84 |
| Adversarial MLE (Wang et al., 2019a) | 35.18 |
| Dynamic Conv (Wu et al., 2019) | 35.20 |
| Macaron (Lu et al., 2019) | 35.40 |
| Multi-Agent Dual Learning (Wang et al., 2019c) | 35.56 |
| Joint Self-Attention (Fonollosa et al., 2019) | 35.70 |
| Soft Contextual Data Aug (Zhu et al., 2019) | 35.78 |
| BERT-fused NMT (Zhu et al., 2020) | 36.11 |

4.2. Abstractive Summarization

Text summarization is another widely acknowledged sequence generation task. There are two types of summarization: extractive summarization, where the summary is obtained by extracting sentences from the article, and abstractive summarization, where the summary is generated from scratch. In this work, we focus on neural abstractive summarization.

Data  We conduct the summarization experiments on a benchmark dataset, the Gigaword summarization dataset. The corpus is constructed from a subset of the Gigaword corpus (Graff & Cieri, 2003). As described in Rush et al. (2015), the first sentence of an article serves as the source input and the headline as the target output. The preprocessing is identical to previous works (Shen et al., 2016; Rush et al., 2015). On average, each input article sentence contains about 31.4 words, and the corresponding summarization/headline consists of 8.3 words. The standard splits of article-headline pairs for the training, validation and test sets contain 3.8M, 190k and 1,951 samples respectively.

Tokenization  Similar to the machine translation tasks, we utilize the BPE and SP tokenizers to process the data. The vocabulary sizes of the BPE-tokenized corpus and the SP-tokenized corpus are 29,236 and 30,020 respectively.

Model  We choose the base Transformer architecture (the transformer_base configuration) as the baseline. Each encoder and decoder is a stack of 6 blocks. We set the embedding dimension of our model to 256; the baseline models use 512 and 256. The remaining hyperparameters are the same as those for the machine translation tasks (see Section 4.1).

Table 5. Results of abstractive summarization.

| BPE | R-1 | R-2 | R-L |
| --- | --- | --- | --- |
| Transformer 256 | 35.1 | 17.4 | 32.5 |
| Transformer 512 | 35.5 | 17.5 | 32.7 |
| Our model | 35.7 | 17.8 | 33.0 |
| + co-teaching | 36.0 | 18.4 | 33.4 |

| SP | R-1 | R-2 | R-L |
| --- | --- | --- | --- |
| Transformer 256 | 34.2 | 17.2 | 32.1 |
| Transformer 512 | 34.5 | 17.5 | 32.2 |
| Our model | 34.7 | 17.6 | 32.5 |
| + co-teaching | 35.1 | 18.1 | 33.0 |

Evaluation  In the decoding process, we use beam search with width 4 and length penalty 1.0, where repeated trigrams are blocked (Paulus et al., 2017). The generation quality is evaluated by ROUGE (Lin, 2004) F1 scores with an open-source script8. Specifically, we report the unigram ROUGE-1 (R-1) and bigram ROUGE-2 (R-2) overlap to assess informativeness, and the longest common subsequence ROUGE-L (R-L) score to assess fluency.

8 https://github.com/pltrdy/pyrouge
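The paper scores summaries with a pyrouge-based script (footnote 8). For a quick check without that setup, the pip package rouge-score computes the same three F1 metrics, though its tokenization and stemming options mean the numbers are not guaranteed to match the pyrouge script exactly. A minimal illustration with made-up strings:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "australia launches new trade mission to china"      # made-up example
hypothesis = "australia opens trade mission to china"            # made-up example
for name, score in scorer.score(reference, hypothesis).items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")
```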
Results  The results are shown in Table 5. When decoding into BPE format, our model boosts the Transformer 256 baseline by 0.6, 0.4 and 0.5 points in terms of R-1, R-2 and R-L. After applying co-teaching, we obtain further gains: the three metrics improve by another 0.3, 0.6 and 0.4 points. Our method also outperforms the Transformer 512 baseline. Similar improvements are observed when using the SP tokenizer. These results demonstrate that our method generally works across different sequence generation tasks.

5. Ablation Study and Analysis

To better understand our model structure and training algorithm, we provide ablation studies and analyses of different aspects. The studies are conducted on the IWSLT-2014 En↔De machine translation tasks.

5.1. Stacked Representation

As discussed before, in the encoders of our proposed model the mixed representations are obtained by adding the outputs of the mix-attention components and the self-attention components. Different methods can be adopted to mix the representations of the two tokenizers; here we present an alternative approach, the stacked representation.

The model structure for the stacked representation is shown in Figure 2. The mix-attention components process the output of the self-attention components, instead of processing the input in parallel with them. The decoder is revised accordingly: in dec_bpe, self-attention, SP-enc-attention and BPE-enc-attention are applied sequentially, and in dec_sp, self-attention, BPE-enc-attention and SP-enc-attention are applied sequentially.9

Figure 2. The model framework for the stacked representation. Different from the mixed representations, the components mix-attention (Mix-Attn in the figure), SP-enc-attention (SP-Attn) and BPE-enc-attention (BPE-Attn) are stacked above the self-attention component instead of being added in parallel. The residual connections and layer normalization after each component are not shown in the figure; they are identical to those in the Transformer.

We evaluate the model with stacked representations, without co-teaching. The results on the En↔De translations are reported in Table 6. The stacked representations bring results comparable to the Transformer baseline, both of which are worse than our mixed representations. We suspect that the sequentially stacked version may underrate the representation from the self-attention itself, and that the overweighted mix-attention confuses the final representation.

Table 6. Results of the stacked representations on En↔De translations.

| Model | En→De BPE | En→De SP | De→En BPE | De→En SP |
| --- | --- | --- | --- | --- |
| Transformer 512 | 28.80 | 28.45 | 34.84 | 34.77 |
| Transformer 256 | 28.56 | 28.24 | 34.49 | 34.39 |
| Stacked | 28.53 | 28.29 | 35.19 | 34.61 |
| Ours | 28.96 | 28.88 | 35.51 | 35.25 |

9 We also evaluated the reversed order, i.e., mix-attention first and then the self-attention component; the result is similar.
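In code form, the two variants compared here differ only in how the attention calls of Eqns.(1)-(2) are combined; the sketch below uses generic attn_self and attn_mix callables purely for illustration.

```python
def parallel_mix(h_own, h_other, attn_self, attn_mix):
    # Mixed representation used in our main model: the two branches are added.
    return attn_self(h_own, h_own, h_own) + attn_mix(h_own, h_other, h_other)


def stacked_mix(h_own, h_other, attn_self, attn_mix):
    # Stacked variant: mix-attention consumes the self-attention output.
    s = attn_self(h_own, h_own, h_own)
    return attn_mix(s, h_other, h_other)
```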
5.2. Co-Teaching with One Tokenizer

Although our co-teaching algorithm interacts between the two types of tokenized sequences, it can also be applied to a single type. That is, given a pre-trained model (a standard Transformer), we use it to forward-translate the data and then continue tuning the model on the newly generated data as well as the original data, iteratively. We conduct experiments on En↔De translations; the results are reported in Table 7. From the results, we observe that co-teaching achieves significant improvements over the baseline models, which demonstrates that co-teaching generally works across different settings. For En→De translation with Sentence Piece, co-teaching with one tokenized input obtains a 29.28 BLEU score, while ours with two tokenized inputs obtains 29.93; similar results hold for the other settings. This shows the effectiveness of our interactive co-teaching between the two tokenization schemes.

Table 7. Results of co-teaching with only one type of tokenization on En↔De translations.

| Model | En→De BPE | En→De SP | De→En BPE | De→En SP |
| --- | --- | --- | --- | --- |
| Transformer 512 | 28.80 | 28.45 | 34.84 | 34.77 |
| + co-teaching | 29.67 | 29.28 | 36.10 | 35.71 |

5.3. Other Tokenizations

The previous experiments are all conducted with the BPE and SP subword tokenizers. Besides these two, we also investigate the Word Piece (WP) tokenizer in this subsection. We replace the SP tokenizer with WP and set the subword dictionary size to be near 10k. The other configurations and training details are the same as those used in Section 4.1.

Table 8. Results of our model with the BPE and WP tokenizers on En↔De translations.

| Model | En→De BPE | En→De WP | De→En BPE | De→En WP |
| --- | --- | --- | --- | --- |
| Transformer 512 | 28.80 | 28.71 | 34.84 | 34.91 |
| Our model | 28.98 | 29.15 | 35.23 | 35.24 |

The results are presented in Table 8. Our model architecture also achieves better performance than the baselines with the BPE and WP tokenizers, which again demonstrates that leveraging multiple tokenizers to build mixed representations helps improve translation quality.

6. Conclusion and Future Work

In this paper, we propose to generate sequences with mixed representations by leveraging different subword tokenization methods. Specifically, we introduce a new model structure to incorporate mixed representations from different tokenization methods, and a co-teaching algorithm to better utilize the diversity and the advantages of each individual tokenization method. The effectiveness of our approach is verified on the machine translation task and on an abstractive summarization application. For future work, there are many possible directions. First, we will apply our algorithm to more sequence learning applications, such as text classification and natural language understanding. Second, we will further improve our model architecture, for example by designing two interactive decoders and adaptively determining which decoder to use for inference when a sentence comes in. Third, we will study how to extend our model to more subword tokenization methods.

References

Arppe, A., Carlson, L., Lindén, K., Piitulainen, J. O., Suominen, M., Vainio, M., Westerlund, H., and Yli-Jyrä, A. M. Inquiries into words, constraints and contexts: Festschrift in the honour of Kimmo Koskenniemi on his 60th birthday. CSLI Publications, 2005.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.
Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., and Federico, M. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, volume 57, 2014.

Cherry, C., Foster, G., Bapna, A., Firat, O., and Macherey, W. Revisiting character-based neural machine translation with capacity and compression. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4295-4305, 2018.

Deng, L. and Liu, Y. Deep learning in natural language processing. Springer, 2018.

Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., et al. Recent advances in deep learning for speech research at Microsoft. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8604-8608. IEEE, 2013.

Fonollosa, J. A., Casas, N., and Costa-jussà, M. R. Joint source-target self attention with locality constraints. arXiv preprint arXiv:1905.06596, 2019.

Graff, D. and Cieri, C. English Gigaword, Linguistic Data Consortium, 2003.

Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M., et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Kim, Y., Jernite, Y., Sontag, D., and Rush, A. M. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Koehn, P. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 388-395, 2004.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pp. 177-180, 2007.

Kudo, T. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959, 2018.

Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.

Larsson, G., Maire, M., and Shakhnarovich, G. FractalNet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.

Lee, J., Cho, K., and Hofmann, T. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365-378, 2017.

Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74-81, Barcelona, Spain, July 2004. Association for Computational Linguistics.

Ling, W., Trancoso, I., Dyer, C., and Black, A. W. Character-based neural machine translation. arXiv preprint arXiv:1511.04586, 2015.

Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., Wang, L., and Liu, T.-Y. Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762, 2019.
Luong, M.-T. and Manning, C. D. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1054-1063, 2016.

Medress, M. F., Cooper, F. S., Forgie, J. W., Green, C., Klatt, D. H., O'Malley, M. H., Neuburg, E. P., Newell, A., Reddy, D., Ritea, B., et al. Speech understanding systems: Report of a steering committee. Artificial Intelligence, 9(3):307-316, 1977.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 48-53, 2019.

Pan, Y., Li, X., Yang, Y., and Dong, R. Morphological word segmentation on agglutinative languages for neural machine translation. arXiv preprint arXiv:2001.01589, 2020.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318. Association for Computational Linguistics, 2002.

Paulus, R., Xiong, C., and Socher, R. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.

Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.

Provilkov, I., Emelianenko, D., and Voita, E. BPE-dropout: Simple and effective subword regularization. arXiv preprint arXiv:1910.13267, 2019.

Rush, A. M., Chopra, S., and Weston, J. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 379-389, 2015.

Schuster, M. and Nakajima, K. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149-5152. IEEE, 2012.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

Shen, S., Zhao, Y., Liu, Z., Sun, M., et al. Neural headline generation with sentence-wise optimization. arXiv preprint arXiv:1604.01904, 2016.

Srinivasan, T., Sanabria, R., and Metze, F. Multitask learning for different subword segmentations in neural machine translation. arXiv preprint arXiv:1910.12368, 2019.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Wang, D., Gong, C., and Liu, Q. Improving neural language modeling via adversarial training. In International Conference on Machine Learning, pp. 6555-6565, 2019a.

Wang, X., Pham, H., Arthur, P., and Neubig, G. Multilingual neural machine translation with soft decoupled encoding. arXiv preprint arXiv:1902.03499, 2019b.
Wang, Y., Xia, Y., He, T., Tian, F., Qin, T., Zhai, C. X., and Liu, T.-Y. Multi-agent dual learning. In 7th International Conference on Learning Representations, ICLR 2019, 2019c.

Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430, 2019.

Wu, L., Zhao, L., Qin, T., Lai, J., and Liu, T.-Y. Sequence prediction with unlabeled data by reward function learning. In IJCAI, 2017.

Wu, L., Tian, F., Zhao, L., Lai, J., and Liu, T.-Y. Word attention for sequence to sequence text understanding. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Zhang, X., Zhao, J., and LeCun, Y. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649-657, 2015.

Zhu, J., Gao, F., Wu, L., Xia, Y., Qin, T., Zhou, W., Cheng, X., and Liu, T.-Y. Soft contextual data augmentation for neural machine translation. arXiv preprint arXiv:1905.10523, 2019.

Zhu, J., Xia, Y., Wu, L., He, D., Qin, T., Zhou, W., Li, H., and Liu, T.-Y. Incorporating BERT into neural machine translation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hyl7ygStwB.