# consecutive_decoding_for_speechtotext_translation__4df6303e.pdf

Consecutive Decoding for Speech-to-text Translation

Qianqian Dong, 1,2 Mingxuan Wang, 3 Hao Zhou, 3 Shuang Xu, 1 Bo Xu, 1,2 Lei Li 3

1 Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, 100190, China 2 School of Artiﬁcial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China 3 Byte Dance AI Lab, China {dongqianqian2016, shuang.xu, xubo}@ia.ac.cn, {wangmingxuan.89, zhouhao.nlp, lileilab}@bytedance.com

Speech-to-text translation (ST), which directly translates the source language speech to the target language text, has attracted intensive attention recently. However, the combination of speech recognition and machine translation in a single model poses a heavy burden on the direct cross-modal crosslingual mapping. To reduce the learning difﬁculty, we propose COn Secutive Transcription and Translation (COSTT), an integral approach for speech-to-text translation. The key idea is to generate source transcript and target translation text with a single decoder. It beneﬁts the model training so that additional large parallel text corpus can be fully exploited to enhance the speech translation training. Our method is veriﬁed on three mainstream datasets, including Augmented Libri Speech English-French dataset, TED English-German dataset, and TED English-Chinese dataset. Experiments show that our proposed COSTT outperforms the previous state-ofthe-art methods. The code is available at https://github.com/ dqqcasia/st.

Introduction Speech translation (ST) aims at translating from source language speech into the target language text. Traditionally, it is realized by cascading an automatic speech recognition (ASR) and a machine translation (MT) (Sperber et al. 2017, 2019b; Zhang et al. 2019; Beck, Cohn, and Haffari 2019; Cheng et al. 2019). Recently, end-to-end ST has attracted much attention due to its appealing properties, such as lower latency, smaller model size, and less error accumulation (Liu et al. 2019, 2018; Weiss et al. 2017; B erard et al. 2018; Duong et al. 2016; Jia et al. 2019). Although end-to-end systems are very promising, cascaded systems still dominate practical deployment in industry. The possible reasons are: a) Most research work compared cascaded and end-to-end models under identical data situations. However, in practice, the cascaded system can beneﬁt from the accumulating independent speech recognition or machine translation data, while the end-to-end system still suffers from the lack of end-to-end corpora. b) Despite the advantage of reducing error accumulation, the end-

The work was done while QD was a research intern at Byte Dance AI Lab. Copyright c 2021, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

to-end system has to integrate multiple complex deep learning tasks into a single model to solve the task, which introduces heavy burden for the cross-modal and cross-lingual mapping. Therefore, it is still an open problem whether endto-end models or cascaded models are generally stronger. We argue that a desirable ST model should take advantages of both end-to-end and cascaded models and acquire the practically acceptable capabilities as follows: a) it should be end-to-end to avoid error accumulation; b) it should be ﬂexible enough to leverage large-scale independent ASR or MT data. At present, few existing end-to-end models can meet all these goals. Most studies resort to pre-training or multitask learning to bridge the beneﬁts of cascaded and end-to-end models (Bansal et al. 2019; Sung et al. 2019; Sperber et al. 2019a). A de-facto framework usually initializes the ST model with the encoder trained from ASR data (i.e. source audio and source text pairs) and then ﬁne-tunes on a speech translation dataset to make the cross-lingual translation. However, it is still challenging for these methods to leverage the bilingual MT data, due to the lack of intermediate text translating stage. Our idea is motivated by two motivating insights from ASR and MT models. a) A branch of ASR models has intermediate steps to extract acoustic feature and decode phonemes, before emitting transcription; and b) Speech translation can beneﬁt from decoding the source speech transcription in addition to the target translation text. We propose COSTT, a uniﬁed speech translation framework with consecutive decoding for jointly modeling speech recognition and translation. COSTT consists of two phases, an acoustic-semantic modeling phase (AS) and a transcriptiontranslation modeling phase (TT). The AS phase accepts the speech features and generates compressed acoustic representations. For TT phases, we jointly model both the source and target text in a single shared decoder, which directly generates the speech text sequence and the translation sequence at one pass. This architecture is closer to cascaded translation while maintaining the beneﬁts of end-to-end models. The combination of the AS and the ﬁrst-part output of the TT phase serves as an ASR model; the TT phase alone serves as an MT model; while the whole makes an end-to-end speech translation by ignoring the ﬁrst-part of TT output. Simple and effective, COSTT is powerful enough to cover the advantage of ASR, MT, and ST models simultane-

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

ously. The contributions of this paper are as follows: 1) We propose COSTT, a uniﬁed training framework with consecutive decoding which bridges the beneﬁts of both cascaded and end-to-end models. 2) As a beneﬁt of explicit multi-phase modeling, COSTT facilitates the use of parallel bilingual text corpus, which is difﬁcult for traditional end-to-end ST models. 3) COSTT achieves state-of-the-art results on three popular benchmark datasets.

Related Work For speech translation, there are two main research paradigms, the cascaded system and the end-to-end model (Jan et al. 2018, 2019). Cascaded ST For cascaded system, the most concerned point is how to avoid early decisions, relieve error propagation and better integrate the separately trained ASR and MT modules. To relieve the problem of error propagation and tighter couple cascaded systems: a) Robust translation models (Cheng et al. 2018, 2019) introduce synthetic ASR errors and ASR related features into the source side of MT corpora. b) Techniques such as domain adaptation (Liu et al. 2003; F ugen 2008), re-segmention (Matusov, Mauser, and Ney 2006), punctuation restoration (F ugen 2008), disﬂuency detection (Fitzgerald, Hall, and Jelinek 2009; Wang et al. 2018; Dong et al. 2019) and so on, are proposed to provide the translation model with well-formed and domain matched text inputs. c) Many efforts turn to strengthen the tight integration between the ASR output and the MT input, such as the n-best translation, lattices and confusion nets and so on (Sperber and Paulik 2020). End-to-end ST On the other hand, a paradigm shift towards end-to-end system is emerging to alleviate the drawbacks of cascaded systems. B erard et al. (2016); Duong et al. (2016) have given the ﬁrst proof of the potential of end-to-end speech-to-text translation, which has attracted intensive attentions recently (Vila et al. 2018; Salesky et al. 2018; Salesky, Sperber, and Waibel 2019; Di Gangi, Negri, and Turchi 2019; Bahar, Bieschke, and Ney 2019; Di Gangi et al. 2019; Inaguma et al. 2020). Many works have proved that pre-training then transferring (Weiss et al. 2017; B erard et al. 2018; Bansal et al. 2019; Stoian, Bansal, and Goldwater 2020) and multi-task learning (Vydana et al. 2020) can signiﬁcantly improve the performance of end-to-end models. The two-pass decoding (Sung et al. 2019) and attention-passing (Anastasopoulos and Chiang 2018; Sperber et al. 2019a) techniques are proposed to handle the relatively deeper relationships and alleviate error propagation in end-to-end models. Many data augmentation techniques (Jia et al. 2019; Bahar et al. 2019; Pino et al. 2019) are proposed to utilize external ASR and MT corpora. Many semi-supervised training (Wang et al. 2020a) methods bring great gain to end-to-end models, such as knowledge distillation (Liu et al. 2019), modality agnostic meta-learning (Indurthi et al. 2019), model adaptation (Di Gangi et al. 2020) and so on. Curriculum learning (Kano, Sakti, and Nakamura 2017; Wang et al. 2020b) is proposed to improve performance of ST models. Liu et al. (2020); Liu, Spanakis, and Niehues (2020) optimize the decoding strategy to achieve

low-latency end-to-end speech translation. (Chuang et al. 2020; Salesky and Black 2020; Salesky, Sperber, and Black 2019) explore additional features to enhance end-to-end models. Due to the scarcity of data resource, how to efﬁciently utilize ASR and MT parallel data is a big problem for ST, especially in the end-to-end situation. However, existing end-toend methods mostly resort to ordinary pretraining or multitask learning to integrate external ASR resources, which may face the issue of catastrophic forgetting and modal mismatch. And it is still challenging for previous methods to leverage external bilingual MT data efﬁciently.

Proposed COSTT Approach

The detailed framework of our method is shown in Figure 1. To be speciﬁc, the speech translation model accepts the original audio feature as input and outputs the target text sequence. We divide our method into two phases, including the acoustic-semantic modeling phase (AS) and the transcription-translation modeling phase (TT). Firstly, the AS phase accepts the speech features, outputs the acoustic representation, and encodes the shrunk acoustic representation into semantic representation. In this work, the smallgrained unit, phonemes are selected as the acoustic modeling unit. Then, the TT phase accepts the AS s representation and consecutively outputs source transcription and target translation text sequences with a single shared decoder.

Problem Formulation The speech translation corpus usually contains speech-transcription-translation triples. We add phoneme sequences to make up quadruples, denoted as S = {(x, u, z, y)} (More details about the data preparation can be seen in the experimental settings). Specially, x = (x1, ..., x Tx) is a sequence of acoustic features. u = (u1, ..., u Tu), z = (z1, ..., z Tz), and y = (y1, ..., y Ty) represents the corresponding phoneme sequence in source language, transcription in source language and the translation in target language respectively. Meanwhile, A = {(z , y )} represents the external text translation corpus, which can be utilized for pre-training the decoder. Usually, the amount of end-to-end speech translation corpus is much smaller than that of text translation, i.e. |S| |A|.

Acoustic-Semantic Modeling

The acoustic-semantic modeling phase takes the input of low-level audio features x and outputs a series of vectors h AS corresponding to the phoneme sequence u in the source language. Different from the general sequence-to-sequence models, two modiﬁcations are introduced. Firstly, in order to preserve more acoustic information, we introduce the supervision signal of the connectionist temporal classiﬁcation (CTC) loss function, a scalable, end-to-end approach to monotonic sequence transduction (Graves et al. 2006; Salazar, Kirchhoff, and Huang 2019). Secondly, since the length of audio features is much larger than that of source phoneme (Tx Tu), we introduce a shrinking method

Acoustic-Semantic

Target-Translation Source-Transcription

Sampling Transformer Blocks

Transformer Blocks

Transcription-Translation

Consecutive Decoder

Multi-Head Cross Attention

Multi-Head Self Attention

see you <st> à plus <eos>

<asr> see you <st> à plus S - IY1 ŏ UW1 -

Acoustic Unit Shrinking

Multi-Head Self Attention

Self-Attention with CTC

Figure 1: Overview of the proposed COSTT. It consists of two phases, an acoustic-semantic modeling phase (AS) and a transcription-translation phase (TT). During AS phase, CTC loss is adopted supervised by phoneme labels corresponding to source-text. The TT phase decodes source-text and target-text in a single sequence consecutively.

which can skip the blank-dominated steps to reduce the encoded sequence length.

Self-Attention with CTC General preprocessing includes down-sampling and linear layers. Down-sampling refers to the dimensionality reduction processing of the input audio features in the time and frequency domains. In order to simplify the network, we adopt manual dimensionality reduction, that is, a method of sampling one frame every three frames. The linear layer maps the length of the frequency domain feature of the audio feature to the preset network hidden layer size. After preprocessing, multiple Transformer blocks are stacked for acoustic feature extraction.

ˆh AS = Attention(Linear(Down-sample(x))) (1)

Finally, the softmax operator is applied to the result of the afﬁne transformation to obtain the probability of the phoneme sequence. CTC loss is adopted to accelerate the convergence of acoustic modeling. CTC assumes Tu Tx, and deﬁnes an intermediate alphabet V = V {blank}. A path π is deﬁned as a Tx-length sequence of intermediate labels π = (π1, ..., πTx) V Tx. And a many-to-one mapping is deﬁned from paths to output sequences by removing blank symbols and consecutively repeated labels. The conditional probability of a given labelling u V Tu can be modeled by marginalizing over all paths corresponding to it. The distribution over the set V Tx of path π is deﬁned by the probability of a sequence of conditionally-independent outputs, which can be calculated non-autoregressively.

pctc(u|x) = X

π B 1(u) p(π|ˆh AS) (2)

p(π|ˆh AS) =

t=1 p(πt|ˆh AS) (3)

And p(πt|ˆh AS) is computed by applying the softmax function to logits. Finally, the objective training function during AS phase is deﬁned as:

LAS = log pctc(u|x) (4)

Acoustic Unit Shrinking The shrinking layer aims at reducing the potential blank frames, and repeated frames. The details can be seen in the sub-ﬁgure of Figure 1. The method is mainly founded on the studies of Chen et al. (2016); Yi, Wang, and Xu (2019). We adopt the implementation by removing the blank frames and averaging the repeated frames. Without the interruption of blank and repeated frames, the language modeling ability should be better in theory. Blank frames can be detected according to the spike characteristics of CTC probability distribution.

h AS = Shrink(ˆh AS, pctc(u|x)) (5)

Then, similarly, after shrinking, multiple Transformer blocks are stacked to extract higher-level semantic representations and result in the ﬁnal output h AS.

h AS = Attention(h AS) (6)

Transcription-Translation Modeling

We jointly model the transcription and translation generation in a single shared decoder, which takes the acoustic representation h AS as the input and generates the source text z

and target text y. This TT phase is stacked with T Transformer blocks, consisting of multi-head attention layers and feed-forward networks.

h T T = Transformer([z, y], h AS) (7)

As shown in Figure 1, the decoder output is the tandem result of the transcription and translation sequences, joined by the task identiﬁcator token ( <asr> for recognition and <st> for translation), marked as [z, y]. That is to say, the model is able to continuously predict the transcription sequence and the translation sequence. The training objective of the TT phase is the cross entropy between prediction sequence and target sequence.

LTT = log p([z, y]|x) (8)

Compared with the multi-task learning method, consecutive decoding can make prediction from easy (transcription) to hard (translation) tasks, alleviating the decoding pressure. For example, when predicting the translation sequence, since the corresponding transliteration sequence has been decoded, that is, the intermediate recognition result of the known speech translation and the source of information for decoding, the translation sequence can be improved.

Pre-train the Consecutive Decoder Generally, it is straightforward to use ASR corpus to improve the performance of ST systems, but is non-trivial to utilize MT corpus. Taking advantage of the structure of consecutive decoding, we propose a method to enhance the performance of ST systems by means of external MT paired data. Inspired by translation language modeling (TLM) in XLM (Lample and Conneau 2019), we use a masked loss function to pre-train TT phase. Speciﬁcally, we use external data in A to pre-train the parameters of the TT part. Different from the end-toend training stage, there is no audio feature as input during pre-training, so cross-attention cannot attend to the output of the previous AS phase. We use an all-zero constant, marked as h ASblank to substitute the encoded representations (h AS) from TT phase to be consistent with ﬁne-tuning. When calculating the objective function, we mask the loss for prediction of the recognition result, and make the decoder predicts the translation sequence when aware of the input of the transcription sequence. The translation loss of the TT phase during pre-training only includes the masked cross entropy:

i=1 log p(yi|z, y<i) (9)

Joint Learning Apart from pre-training, we exploit joint learning to integrate our uniﬁed ST model. The total training objective is as follows: L = αLAS + (1 α)LTT (10)

Where α is a tunable parameter to balance the objectives of different phases. We design different training algorithms for our method training from scratch (seen in Algorithm 1) and training with

Algorithm 1 COSTT without pre-training

1: # training from scratch (θ0 AS θ1 AS, θ0 T T θ1 T T ) 2: while not converged do 3: supervised training ST with (x, u, y) S 4: end while 5: return ST with θ1 AS, θ1 T T

Algorithm 2 COSTT with pre-training

1: # pre-training Con Dec (θ0 AS θ0 AS, θ0 T T θ1 T T ) 2: while not converged do 3: CE loss guided supervised training Con Dec with (z , y ) A 4: end while 5: # pre-training AM (θ0 AS θ1 AS, θ1 T T θ1 T T ) 6: while not converged do 7: CTC loss guided supervised training AM with (x, u) S 8: end while 9: # ﬁne-tuning ST (θ1 AS θ2 AS, θ1 T T θ2 T T ) 10: while not converged do 11: Supervised training ST with (x, u, z, y) S 12: end while 13: return ST with θ2 AS, θ2 T T

pre-training the consecutive decoder (seen in Algorithm 2). Algorithm 2 is determined after many attempts to better avoid the catastrophic forgetting of pre-trained knowledge.

Experiments Dataset and Preprocessing We conduct experiments on three popular publicly available datasets, including Augmented Libri Speech English French dataset (Kocabiyikoglu, Besacier, and Kraif 2018), IWSLT2018 English-German dataset (Jan et al. 2018) and TED English-Chinese dataset (Liu et al. 2019).

Augmented Libri Speech Dataset Augmented Libri Speech is built by automatically aligning e-books in French with English utterances of Libri Speech. The dataset includes quadruplets: source audio ﬁles in English, transcriptions in English, translations in French from the alignment of e-books, and augmented translation references via Google Translate. We experiment on the 100 hours clean train set for training, with 2 hours development set and 4 hours test set, corresponding to 47,271, 1071, and 2048 utterances respectively.

IWSLT2018 English-German Dataset IWSLT2018 English-German is the KIT speech translation corpus, which is built by automatically aligning English audios with SRT transcripts for English and German from TED. The raw data, including long wave ﬁles, English transcriptions, and the corresponding German translations, are segmented with timestamps. It should be noted that some transcriptions are not aligned with the corresponding audio well. Noisy data is

Method Enc Pre-train (speech data) Dec Pre-train (text data) BLEU

MT system Transformer MT (Liu et al. 2019) - - 22.91

Base setting LSTM ST (B erard et al. 2018) 12.90 +pre-train+multitask (B erard et al. 2018) 13.40 LSTM ST+pre-train (Inaguma et al. 2020) 16.68 Transformer+pre-train (Liu et al. 2019) 14.30 +knowledge distillation (Liu et al. 2019) 17.02 TCEN-LSTM (Wang et al. 2020a) 17.05 Transformer+ASR pre-train (Wang et al. 2020b) 15.97 +curriculum pre-train (Wang et al. 2020b) 17.66 COSTT without pre-training 17.83

Expanded setting LSTM+pre-train+Spec Augment (Bahar et al. 2019) (236h) 17.00 Multi-task+pre-train (Inaguma et al. 2019) (472h) 17.60 Transformer+ASR pre-train (Wang et al. 2020b) (960h) 16.90 +curriculum pre-train (Wang et al. 2020b) (960h) 18.01 COSTT with pre-training (100h) (1M) 18.23

Table 1: Performance on Augmented Librispeech English-French test set. COSTT achieves the best performance in both the base and the expanded setting.

harmful to models performance, which can be avoided by data ﬁltering, re-alignment and re-segmentation (Liu et al. 2018). In this paper, we directly use the original data as training data to verify our method, with a size of 272 hours and 171,121 segmentations. We use dev2010 as validation set, and tst2013 as test set, corresponding to 653 and 793 utterances respectively 1.

TED English-Chinese Dataset English-Chinese TED is crawled from TED website2 and released by Liu et al. (2019) as a benchmark for speech translation from English audio to Chinese text. Following the previous work (Liu et al. 2019), we use dev2010 as development set and tst2015 as test set. The raw long audio is segmented based on timestamps for complete semantic information. Finally, we get 524 hours train set, 1.5 hours validation set and 2.5 hours test set, corresponding to 308,660, 835 and 1223 utterances respectively.

WMT Machine Translation Corpus We use WMT143 English-to-French and English-to-German training data as the external MT parallel corpus ( A) in the expanded experimental setting for reproducibility. We pre-processed all of the data of speciﬁc language pairs, and ﬁltered sentence pairs whose total length exceeds 500. We shufﬂed the data and randomly selected a subset of 1 million for the following experiments and analysis.

1We use the data segmentation in the iwslt corpus/parallel folder. 2https://www.ted.com 3https://www.statmt.org/wmt14/translation-task.html

Experimental Setup

Our acoustic features are 80-dimensional log-Mel ﬁlterbanks extracted with a step size of 10ms and window size of 25ms and extended with mean sub-traction and variance normalization. The features are stacked with 5 frames to the right. For all source language text data, we lower case all the texts, tokenize and remove the punctuation to make the data more consistent with the output of ASR. For target French and German text data, we lower case all the texts, tokenize and apply normalize punctuations with the Moses scripts4. For target Chinese text data, we use the raw released segmented results. For English-French and English German datasets, we apply BPE5 (Sennrich, Haddow, and Birch 2016) to the combination of source and target text to obtain shared subword units. And for English-Chinese dataset, we apply BPE to the source text and target text respectively. The number of merge operations in BPE is set to 8k for all datasets. In order to simplify, we use the open-source grapheme to phoneme tool6 to map the transcription to the phoneme sequence (An example in Table 3). The alphabet of labels V includes the union of subword vocabulary and phoneme vocabulary, plus a few special symbols (including <asr> , <st> and blank ). For English-French and English-German corpora, we report case-insensitive BLEU scores by multi-bleu.pl7 script for the evaluation of translation. And for English-Chinese corpus, we report character-level BLEU scores. We use word error rates (WER) and phoneme error rates (PER) to eval-

4https://github.com/moses-smt/mosesdecoder 5https://github.com/rsennrich/subword-nmt 6https://github.com/Kyubyong/g2p 7https://github.com/moses-smt/mosesdecoder/scripts/generic/ multi-bleu.perl

Method Enc Pre-train (speech data) Dec Pre-train (text data) tst2013

MT system RNN MT (Inaguma et al. 2020) - - 24.90

Base setting ESPnet (Inaguma et al. 2020) 12.50 +enc pre-train 13.12 +enc dec pre-train 13.54 Transformer+ASR pre-train (Wang et al. 2020b) 15.35 +curriculum pre-train (Wang et al. 2020b) 16.27 COSTT without pre-training 16.30

Expanded setting Multi-task+pre-train (Inaguma et al. 2019) (472h) 14.60 CL-fast* (Kano, Sakti, and Nakamura 2017) (479h) 14.33 TCEN-LSTM (Wang et al. 2020a) (479h) (40M) 17.67 Transformer+curriculum pre-train (Wang et al. 2020b) (479h) (4M) 18.15

COSTT with pre-training (272h) (1M) 18.63

Table 2: Performance on English-German TED test sets. *: re-implemented by Wang et al. (2020b). COSTT achieves the best performance in both the base and the expanded setting.

uate the prediction accuracy of transcription and phoneme sequences, respectively.

speech 135-19215-0118.wav phonemes Y UW1 <space> M AH1 S T <space> M EY1 K <space> AH0 <space> D R IY1 M <space> W ER1 L <space> ER0 AW1 N D <space> DH AH0 <space> B R AY1 D transcription you must make a dream whirl around the bride translation il faudrait faire tourbillonner un songe autour de l epous ee .

Table 3: An example of the speech-phoneme-transcriptiontranslation quadruples. Phonemes can be converted from the transcription text.

We use a similar hyperparameter setting with the base Transformer model (Vaswani et al. 2017). The number of transformer blocks is set to 12 and 6 for the acousticsemantic (AS) phase and the transcription-translation (TT) phase, respectively. And phoneme supervision is added to the middle layer of AS phase for all datasets. Spec Augment strategy (Park et al. 2019) is adopted to avoid overﬁtting with frequency masking (F = 30, m F = 2) and time masking (T = 40, m T = 2). All samples are batched together with 20000-frame features by an approximate feature sequence length during training. We train our models on 1 NVIDIA V100 GPU with a maximum number of 400k training steps. We use the greedy search decoding strategy for our experimental settings. The maximum decoding length is set to 500 for our models with consecutive decoding and 250 for other methods on all datasets. α in Equation 10 is set to 0.5 for all datasets (We have searched the value of α using a step of 0.2.). We design different workﬂows for our method training

from scratch and training with pre-training the consecutive decoder. More details are in the results. The ﬁnal model is averaged on the last 10 checkpoints.

Results Baselines We compare with systems in different settings:

Base setting: ST models are trained with only ST triple corpus.

Expanded setting: ST models are trained with ST triple corpus augmented with external ASR and MT corpus. In the context of expanded setting, Bahar et al. (2019) apply the Spec Augment (Park et al. 2019) with a total of 236 hours of speech for ASR pre-training. Inaguma et al. (2019) combine three ST datasets of 472 hours training data to train a multilingual ST model. Wang et al. (2020a) introduce an additional 207 hours ASR corpus and 40M parallel data from WMT18 to enhance the ST. We mainly explored additional MT data in this work.

MT system: Text translation models are trained with manual transcribed transcription-translation pairs, which can be regarded as the upper bound of speech translation tasks.

Main Results We conduct experiments on three public datasets.

Librispeech English-French For En-Fr experiments, we compared the performance with existing end-to-end methods in Table 1. Clearly, COSTT outscored the previous best

Method Enc Pre-train (speech data) Dec Pre-train (text data) BLEU

MT system Transformer MT (Liu et al. 2019) - - 27.08

Base setting Transformer+pre-train (Liu et al. 2019) 16.80 +knowledge distillation (Liu et al. 2019) 19.55 Multi-task+pre-train* (Inaguma et al. 2019)(re-implemented) 20.45 COSTT without pre-training 21.12

Table 4: Performance on English-Chinese TED test set. COSTT achieves the best performance in both the base and the expanded setting.

Method BLEU

En Fr Pipeline 17.58 COSTT 17.83

En De Pipeline 15.38 COSTT 16.30

En Zh Pipeline 21.36 COSTT 21.12

Table 5: COSTT versus cascaded systems on Augmented Librispeech En-Fr test set, IWSLT2018 En-De tst2013 set and TED En-Zh tst2015 set. Pipeline systems consist of separate ASR and MT models trained independently.

results in the base setting and the expanded setting, respectively. We achieved better results than a knowledge distillation baseline in which an MT model was introduced to teach the ST model (Liu et al. 2019). We also exceeded Wang et al. (2020a,b), even though they used more external ASR and MT data. Different from previous work, COSTT can make full use of the machine translation corpus. With an additional 1 million sentence pairs, we achieve +0.4 BLEU score improvements (17.83 v.s. 18.23). This proposal promises great potential for the application of the COSTT. In a nutshell, simple yet effective, COSTT achieves the best performance in this benchmark dataset in terms of BLEU.

IWSLT2018 English-German For En-De experiments, we compared the performance with existing end-to-end methods in Table 2. Unlike that of Librispeech English French, this dataset is noisy, and the transcriptions do not align well with the corresponding audios. As a result, there is a wide gap between the performance of the ST system and the upper bound of the ST (MT). We suppose it would be more beneﬁcial to carry out data ﬁltering. Overall, our method had +0.5 BLEU score advantage as compared to previous competitors on tst2013 in the expanded setting. This trend is consistent with that in the Librispeech dataset.

TED English-Chinese For En-Zh experiments, we compared the performance with existing end-to-end methods in Table 4. COSTT outperformed the previous best methods obviously by more than 0.7 BLEU in the base setting. Espe-

cially, COSTT exceeded the Transformer-based ST model augmented by knowledge distillation with a big margin, proving the validity of our uniﬁed framework.

Comparison with Cascaded Systems In Table 5, we compare the performance of our E2E models with the cascaded systems. It shows that E2E models are outstanding or comparable on all En Fr/De/Zh tasks, proving our method s capacity for joint optimization of the separate ASR and MT tasks in a model.

Ablation Study We use an ablation study to evaluate the importance of different modules in our methods. The results in Table 6 show that all the methods adapted are positive for the model performance, and the beneﬁts of different parts can be superimposed. Models with consecutive decoding are able to predict both the recognition and translation, for which we also report WER and PER to evaluate the performance of different modeling phase. It has been proved that consecutive decoding brings the gain of 1 BLEU compared with the base model and pre-training decoder can bring improvements to all three metrics.

BLEU WER PER

COSTT 18.23 14.60 10.30 w/o PT Dec 17.51 15.30 11.90 w/o CD 16.57 - - w/o Shrink 16.40 - - w/o AS loss * 15.48 - - w/o AS loss 11.24 - -

Table 6: Beneﬁts of each component in COSTT on En-Fr test set. PT Dec stands for pre-training the successive decoder. CD represents using the consecutive decoder. * means using ASR pre-training as initialization.

Effects of Pre-training Figure 2 shows the convergence curve on the English-French validation set of the two training algorithms in Algorithm 1 and Algorithm 2. It proves that COSTT with pre-training the consecutive decoder can get a better initialization and converge better beneﬁting from our ﬂexible model structure.

Error Range 0 1 2 3 4 5 6 7 8 9 Probability 0.32 0.66 0.83 0.91 0.95 0.97 0.98 0.99 0.99 0.99

Table 7: Statistics of the absolute error between the length of shrunk acoustic unit and the length of the gold phoneme sequence.

0 50 100 150 200 250 300 Training Step (k)

COSTT without pre-training COSTT with pre-training

Figure 2: BLEU scores on Augmented Librispeech validation set for COSTT with and without pre-training on extra parallel MT corpus. Notice that the full COSTT with MT pre-training does improve the performance.

Parameters of ST systems The parameter sizes of different systems are shown in Table 8. The pipeline system needs a separate ASR model and MT model, so its parameters are doubled. Our method COSTT only needs the same parameters as the vanilla end-to-end model, but it can achieve superior performance thanks to the consecutive decoding mechanism.

Model Params Pipeline 110M E2E 55M COSTT (12 L) 55M COSTT (18 L) 76M

Table 8: Statistics of parameters of different ST systems. E2E: the vanilla end-to-end ST system.

Effects of Shrinking Mechanism In order to verify whether the shrinking mechanism has achieved the expected effect, we collected the sequence length of the encoded hidden layer before and after shrinking and the length distribution of the gold phoneme sequence. As shown in Figure 3, the sequence length of the shrunk acoustic unit and the distribution of phoneme length are almost the same. According to statistics in Table 7, for more than 90% of the samples, the absolute error between the length of the shrunk acoustic unit and the length of the gold

phoneme sequence is within 3. Moreover, the length of the shrunk acoustic unit is signiﬁcantly reduced compared to the length of the original acoustic feature. The results show that the shrinking mechanism can detect blank frames and repeated frames well, while reducing the computational resources and preventing memory overﬂow.

0 200 400 600 800 1000 0

8000 raw acoustic features shrunk acoustic units gold phoneme sequences

Figure 3: Length distribution of the raw acoustic features, the shrunk acoustic units and the gold phoneme sequences on English-French training set

Effects of Layers after Shrinking As mentioned in previous sections, our model stacks additional Transformer blocks after the shrinking operation. We have conducted simpliﬁed experiments on the English French dataset with a vanilla speech translation model without consecutive decoding to demonstrate the importance of the additional encoding layers after shrinking. The output of the encoded layer uses the CTC loss as the supervision, and we use the subword of transcriptions in the source language as the acoustic labels. Results can be seen in Table 9. The experimental results show that directly inputting the shrunk encoded output to the decoder will cause performance loss. And stacking additional encoding layers after shrinking can bring signiﬁcant performance improvements. We conjecture that there is a lack of semantic encoding modules between acoustic encoding and linguistic decoding. In addition, the relationship between the hidden states after shrinking has changed a lot, and an additional network structure is required to re-extract high-level encoded features.

Case Study on English-French The cases in Table 10 shows that COSTT has obvious structural advantages in solving missed translation, mistranslation, and fault tolerance. For instance: #1, the base model missed the translation of yes in the audio, whereas our method produced a completely correct translation. After listening to the original audio, it is suspected that the miss-

Encoder Shrinking Layers after Shrinking BLEU

6 - 12.70 6 - 11.34 6 6 16.46

Table 9: BLEU scores on Augmented Librispeech test set for different model conﬁgurations. The number represents the layers of Transformer block contained in the corresponding module.

ing translation is due to an unusual pause between doctor and yes . #2, the base model mistranslated the aboard in the audio into vers l avant ( forward in English), yet our method could correctly translate it into a bord based on the correct transcription prediction. The reason for the mistranslation may be that the audio clips are pronounced similarly, thus confusing the translation model. #3, the base model translated erroneously most of the content, and our model also predicted today in the audio as to day . However, in the end, our method was able to predict the translation result completely and correctly.

Speech #1 766-144485-0043.wav Transcript said the doctor yes Target dit le docteur , oui . Base ST dit le docteur . COSTT <asr> said the doctor yes <ast> dit le docteur , oui .

Speech #2 2488-36617-0066.wav Transcript i rushed aboard Target je me pr ecipitai a bord. Base ST je me pr ecipitai vers l avant . COSTT <asr> i rushed aboard <ast> je me pr ecipitai a bord .

Speech #3 766-144485-0098.wav Transcript is there any news today Target y a-t-il des nouvelles aujourd hui ? Base ST est-ce que j ai d ej a utilis e aujourd hui ? COSTT <asr> is there any news to day <ast> y a-til des nouvelles aujourd hui ?

Table 10: Examples of speech translation generated by COSTT and the baseline ST model. Words in bold highlight the difference. Words underlined, as generated by COSTT, contributes to the improved translation results.

Compared with 3-stage Pipeline

In the case study of Table 10, we have listed some examples of errors in transcription recognition, but COSTT can still correctly predict the translation sequence, which proves that COSTT can solve the error propagation problem to some extent. In a pipeline system that includes the phoneme stage, the phoneme recognition error will also lead to error propagation. But in COSTT, the phoneme sequence is only the intermediate supervision used during training, not necessary during inferring. Moreover, end-to-end training can al-

leviate the error propagation between different stages. We believe that the more stages, the greater the advantage of our method. We have built a 3-stage system consisting of acoustics-to-phoneme (A2P), phoneme-to-transcript (P2T), and transcript-to-translation (T2T) stages. A2P is a phoneme recognition model based on the CTC loss function, which uses phoneme error rate (PER) to evaluate performance (the lower the better). Both P2T and T2T use the sequence-tosequence model based on Transformer and BLEU is the evaluation criterion (the higher the better). The performance of each module is shown in Table 11. The performance of different systems in Table 12 prove that with the increase of stages, the problem of error propagation becomes more and more serious, which shows the beneﬁts of the COSTT method.

Stage A2P (PER) P2T (BLEU) T2T (BLEU) Performance 10.30 92.08 21.51

Table 11: Performance of each module of our 3-stage Pipeline.

System BLEU

3-stage Pipeline 12.22 2-stage Pipeline 17.58 COSTT 18.23

Table 12: COSTT versus 2-stage Pipeline and 3-stage Pipeline on Augmentated Librispeech En-Fr test set.

Conclusion We propose COSTT, a novel and uniﬁed training framework for jointly speech recognition and speech translation. We use the consecutive decoding strategy to realize the sequential prediction of the transcription and translation sequences, which is more in line with human cognitive principles. By pre-training the decoder, we can directly make better use of the parallel data of MT. Additionally, CTC auxiliary loss, and shrinking operation strategies are adopted to enhance our method beneﬁting from the ﬂexible structure. Experimental results prove the effectiveness of our framework and it has great prospects for promoting the application of speech translation.

Acknowledgements We would like to thank the anonymous reviewers for their valuable comments. We would also like to thank Rong Ye, Chengqi Zhao, Cheng Yi and Chi Han for their useful suggestion and help with our work. This work was supported by the Key Programs of the Chinese Academy of Sciences under Grant No.ZDBS-SSW-JSC006-2.

References Anastasopoulos, A.; and Chiang, D. 2018. Tied Multitask Learning for Neural Speech Translation. In Proc. of the 2018

Conference of the North American Chapter of the Association for Computational Linguistics.

Bahar, P.; Bieschke, T.; and Ney, H. 2019. A comparative study on end-to-end speech to text translation. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 792 799. IEEE.

Bahar, P.; Zeyer, A.; Schl uter, R.; and Ney, H. 2019. On using specaugment for end-to-end speech translation. In Proc. of the 2019 International Workshop on Spoken Language Translation.

Bansal, S.; Kamper, H.; Livescu, K.; Lopez, A.; and Goldwater, S. 2019. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 58 68.

Beck, D.; Cohn, T.; and Haffari, G. 2019. Neural Speech Translation using Lattice Transformations and Graph Networks. In Proc. of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (Text Graphs-13), 26 31.

B erard, A.; Besacier, L.; Kocabiyikoglu, A. C.; and Pietquin, O. 2018. End-to-end automatic speech translation of audiobooks. In Proc. ICASSP 2018, 6224 6228. IEEE.

B erard, A.; Pietquin, O.; Servan, C.; and Besacier, L. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. NIPS Workshop on End-to-End Learning for Speech and Audio Processing .

Chen, Z.; Zhuang, Y.; Qian, Y.; and Yu, K. 2016. Phone synchronous speech recognition with ctc lattices. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(1): 90 101.

Cheng, Q.; Fang, M.; Han, Y.; Huang, J.; and Duan, Y. 2019. Breaking the Data Barrier: Towards Robust Speech Translation via Adversarial Stability Training. In Proc. of the 2019 International Workshop on Spoken Language Translation.

Cheng, Y.; Tu, Z.; Meng, F.; Zhai, J.; and Liu, Y. 2018. Towards Robust Neural Machine Translation. In Proc. of ACL 2018, 1756 1766.

Chuang, S.-P.; Sung, T.-W.; Liu, A. H.; and Lee, H.-y. 2020. Worse WER, but Better BLEU? Leveraging Word Embedding as Intermediate in Multitask End-to-End Speech Translation. In Proc. of ACL 2020.

Di Gangi, M. A.; Negri, M.; Cattoni, R.; Roberto, D.; and Turchi, M. 2019. Enhancing transformer for end-to-end speech-to-text translation. In Machine Translation Summit XVII, 21 31. European Association for Machine Translation.

Di Gangi, M. A.; Negri, M.; and Turchi, M. 2019. Adapting transformer to end-to-end spoken language translation. In Proc. Interspeech 2019, 1133 1137. Proc. Interspeech 2019.

Di Gangi, M. A.; Nguyen, V.-N.; Negri, M.; and Turchi, M. 2020. Instance-based Model Adaptation for Direct Speech Translation. In Proc. ICASSP 2020, 7914 7918. IEEE.

Dong, Q.; Wang, F.; Yang, Z.; Chen, W.; Xu, S.; and Xu, B. 2019. Adapting translation models for transcript disﬂuency detection. In Proc. of the AAAI Conference on Artiﬁcial Intelligence, volume 33, 6351 6358.

Duong, L.; Anastasopoulos, A.; Chiang, D.; Bird, S.; and Cohn, T. 2016. An attentional model for speech translation without transcription. In NAACL, 949 959.

Fitzgerald, E.; Hall, K.; and Jelinek, F. 2009. Reconstructing False Start Errors in Spontaneous Speech Text. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), 255 263.

F ugen, C. 2008. A system for simultaneous translation of lectures and speeches. Ph.D. thesis, Verlag nicht ermittelbar.

Graves, A.; Fern andez, S.; Gomez, F.; and Schmidhuber, J. 2006. Connectionist temporal classiﬁcation: labelling unsegmented sequence data with recurrent neural networks. In ICML, 369 376. ACM.

Inaguma, H.; Duh, K.; Kawahara, T.; and Watanabe, S. 2019. Multilingual end-to-end speech translation. In Proc. of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop.

Inaguma, H.; Kiyono, S.; Duh, K.; Karita, S.; Yalta, N.; Hayashi, T.; and Watanabe, S. 2020. ESPnet-ST: All-in-One Speech Translation Toolkit. In Proc. of ACL 2020: System Demonstrations.

Indurthi, S.; Han, H.; Lakumarapu, N. K.; Lee, B.; Chung, I.; Kim, S.; and Kim, C. 2019. Data Efﬁcient Direct Speechto-Text Translation with Modality Agnostic Meta-Learning. Proc. ICASSP 2019 .

Jan, N.; Cattoni, R.; Sebastian, S.; Cettolo, M.; Turchi, M.; and Federico, M. 2018. The iwslt 2018 evaluation campaign. In International Workshop on Spoken Language Translation, 2 6.

Jan, N.; Cattoni, R.; Sebastian, S.; Negri, M.; Turchi, M.; Elizabeth, S.; Ramon, S.; Loic, B.; Lucia, S.; and Federico, M. 2019. The IWSLT 2019 evaluation campaign. In 16th International Workshop on Spoken Language Translation 2019.

Jia, Y.; Johnson, M.; Macherey, W.; Weiss, R. J.; Cao, Y.; Chiu, C.-C.; Ari, N.; Laurenzo, S.; and Wu, Y. 2019. Leveraging weakly supervised data to improve end-to-end speechto-text translation. In Proc. ICASSP 2019, 7180 7184. IEEE.

Kano, T.; Sakti, S.; and Nakamura, S. 2017. Structured Based Curriculum Learning for End-to-End English Japanese Speech Translation. Proc. Interspeech 2017 .

Kocabiyikoglu, A. C.; Besacier, L.; and Kraif, O. 2018. Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation. In Proc. of the 2018 International Conference on Language Resources and Evaluation.

Lample, G.; and Conneau, A. 2019. Cross-lingual Language Model Pretraining. Advances in Neural Information Processing Systems (Neur IPS) .

Liu, D.; Liu, J.; Guo, W.; Xiong, S.; Ma, Z.; Song, R.; Wu, C.; and Liu, Q. 2018. The USTC-NEL Speech Translation system at IWSLT 2018. In Proc. of the 2018 International Workshop on Spoken Language Translation. Liu, D.; Spanakis, G.; and Niehues, J. 2020. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection. Proc. Interspeech 2020 3620 3624. Liu, F.-H.; Gu, L.; Gao, Y.; and Picheny, M. 2003. Use of statistical N-gram models in natural language generation for machine translation. In Proc. ICASSP 2003, volume 1, I I. IEEE. Liu, Y.; Xiong, H.; Zhang, J.; He, Z.; Wu, H.; Wang, H.; and Zong, C. 2019. End-to-End Speech Translation with Knowledge Distillation. Proc. Interspeech 2019 1128 1132. Liu, Y.; Zhang, J.; Xiong, H.; Zhou, L.; He, Z.; Wu, H.; Wang, H.; and Zong, C. 2020. Synchronous speech recognition and speech-to-text translation with interactive decoding. In Proc. of the 2020 AAAI Conference on Artiﬁcial Intelligence. Matusov, E.; Mauser, A.; and Ney, H. 2006. Automatic sentence segmentation and punctuation prediction for spoken language translation. In International Workshop on Spoken Language Translation (IWSLT) 2006. Park, D. S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E. D.; and Le, Q. V. 2019. Spec Augment: A Simple Data Augmentation Method for Automatic Speech Recognition. Proc. Interspeech 2019 . Pino, J.; Puzon, L.; Gu, J.; Ma, X.; Mc Carthy, A. D.; and Gopinath, D. 2019. Harnessing indirect training data for end-to-end automatic speech translation: Tricks of the trade. In Proc. of the 2019 International Workshop on Spoken Language Translation. Salazar, J.; Kirchhoff, K.; and Huang, Z. 2019. Selfattention networks for connectionist temporal classiﬁcation in speech recognition. In Proc. ICASSP 2019, 7115 7119. IEEE. Salesky, E.; and Black, A. W. 2020. Phone Features Improve Speech Translation. In Proc. of ACL 2020. Salesky, E.; Burger, S.; Niehues, J.; and Waibel, A. 2018. Towards ﬂuent translations from disﬂuent speech. In 2018 IEEE Spoken Language Technology Workshop (SLT), 921 926. IEEE. Salesky, E.; Sperber, M.; and Black, A. W. 2019. Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation. In Proc. of ACL 2019. Salesky, E.; Sperber, M.; and Waibel, A. 2019. Fluent Translations from Disﬂuent Speech in End-to-End Speech Translation. In Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2786 2792. Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proc. of ACL 2016, 1715 1725.

Sperber, M.; Neubig, G.; Niehues, J.; and Waibel, A. 2017. Neural Lattice-to-Sequence Models for Uncertain Inputs. In Proc. of the 2017 Conference on Empirical Methods in Natural Language Processing. Sperber, M.; Neubig, G.; Niehues, J.; and Waibel, A. 2019a. Attention-Passing Models for Robust and Data-Efﬁcient End-to-End Speech Translation. TACL 7: 313 325. Sperber, M.; Neubig, G.; Pham, N.-Q.; and Waibel, A. 2019b. Self-Attentional Models for Lattice Inputs. In Proc. of ACL 2019. Sperber, M.; and Paulik, M. 2020. Speech Translation and the End-to-End Promise: Taking Stock of Where We Are. In Proc. of ACL 2020, 7409 7421. Stoian, M. C.; Bansal, S.; and Goldwater, S. 2020. Analyzing ASR pretraining for low-resource speech-to-text translation. In Proc. ICASSP 2020, 7909 7913. IEEE. Sung, T.-W.; Liu, J.-Y.; Lee, H.-y.; and Lee, L.-s. 2019. Towards End-to-end Speech-to-text Translation with Two-pass Decoding. In Proc. ICASSP 2019, 7175 7179. IEEE. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS, 5998 6008. Vila, L. C.; Escolano, C.; Fonollosa, J. A.; and Costa-juss a, M. R. 2018. End-to-End Speech Translation with the Transformer. In Iber SPEECH, 60 63. Vydana, H. K.; Karaﬁ at, M.; Zmolikova, K.; Burget, L.; and Cernocky, H. 2020. Jointly Trained Transformers models for Spoken Language Translation. ar Xiv preprint ar Xiv:2004.12111 . Wang, C.; Wu, Y.; Liu, S.; Yang, Z.; and Zhou, M. 2020a. Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation. In Proc. of the AAAI Conference on Artiﬁcial Intelligence, volume 34, 9161 9168. Wang, C.; Wu, Y.; Liu, S.; Zhou, M.; and Yang, Z. 2020b. Curriculum Pre-training for End-to-End Speech Translation. In Proc. of ACL 2020, 3728 3738. Wang, F.; Chen, W.; Yang, Z.; Dong, Q.; Xu, S.; and Xu, B. 2018. Semi-supervised disﬂuency detection. In Proc. of the 27th International Conference on Computational Linguistics, 3529 3538. Weiss, R. J.; Chorowski, J.; Jaitly, N.; Wu, Y.; and Chen, Z. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech. Proc. Interspeech 2017 2625 2629. Yi, C.; Wang, F.; and Xu, B. 2019. ECTC-DOCD: An Endto-end Structure with CTC Encoder and OCD Decoder for Speech Recognition. Proc. Interspeech 2019 4420 4424. Zhang, P.; Ge, N.; Chen, B.; and Fan, K. 2019. Lattice Transformer for Speech Translation. In Proc. of ACL 2019.