# Mu2SLAM: Multitask, Multilingual Speech and Language Models

Yong Cheng¹, Yu Zhang¹, Melvin Johnson¹, Wolfgang Macherey¹, Ankur Bapna¹

¹Google Research, Google LLC, USA. Correspondence to: Yong Cheng.

## Abstract

We present Mu2SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu2SLAM trains the speech-text models with a sequence-to-sequence masked denoising objective similar to T5 on the decoder and a masked language modeling (MLM) objective on the encoder, for both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu2SLAM establishes a new state-of-the-art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On VoxPopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, despite using a relatively weaker Transformer decoder. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TyDi QA, paving the way towards a single model for all speech and text understanding tasks.

## 1. Introduction

The recent rapid developments in NLP have witnessed the tremendous success of moving towards unified text models for both understanding and generation tasks across hundreds of languages, evolving into numerous pre-trained models: from encoder-only models focusing on text understanding (Devlin et al., 2019; Devlin, 2018), to decoder-only models (Radford et al., 2018; Chowdhery et al., 2022) and encoder-decoder models (Song et al., 2019; Lewis et al., 2019; Raffel et al., 2020; Xue et al., 2020) for both understanding and generation. Speech pre-training methods have shown a similar trend towards unified models, from the dominant encoder-only models (Baevski et al., 2020; Hsu et al., 2021; Babu et al., 2021; Bapna et al., 2021; 2022) to generative models trained on cross-modal speech and text data, exemplified by recent work on decoder-only models (Borsos et al., 2022) and encoder-decoder models (Ao et al., 2021; Chen et al., 2022; Sainath et al., 2022; Zhou et al., 2022; Zhang et al., 2022b; Tang et al., 2022).

Although these works have achieved impressive performance, they only consider partial aspects of unified modeling for speech and text. First, except for SLAM and mSLAM (Bapna et al., 2021; 2022), most of them focus only on speech-related tasks, treating text data as auxiliary input while ignoring evaluation on text-related benchmarks, which leaves the effects of interference and capacity dilution unmeasured. Second, few studies investigate multilingual modeling with both speech and text (Bapna et al., 2022; Chen et al., 2022), which limits their ability to leverage cross-lingual transfer to enrich joint speech-text representations.
Third, multi-task learning has demonstrated the effectiveness of inductive transfer for improving model generalization, yet it remains understudied in speech-text pre-training (Tang et al., 2022; Zhang et al., 2022b; Chen et al., 2022), where the use of labeled data in pre-training is explicitly differentiated through customized networks and losses. Fourth, prior speech-text models rely on modality-specific blocks and losses to reach high performance (Bapna et al., 2022; Chen et al., 2022; Tang et al., 2022; Zhang et al., 2022b;a), which somewhat violates the principle of a unified model that uses one model for all tasks, thus undermining the language and modality transfer needed to learn general shared speech-text representations.

In this work, we propose a multi-task multilingual pre-training method based on an encoder-decoder model, called Mu2SLAM. The speech-text model is jointly pre-trained on a set of different tasks involving unlabeled speech, unlabeled text, labeled speech-text (ASR and AST), and labeled text-text (MT). We scale the number of languages in both speech and text to more than 100, covering the majority of mainstream spoken languages. To make it simple to extend our pre-training to more data, we unify the pre-training losses for unlabeled and labeled data by defining a masked language modeling (MLM) loss on the encoder (Devlin et al., 2019), a similar T5 loss on the decoder (Song et al., 2019; Raffel et al., 2020) and an alignment loss used only for labeled data. To enforce sharing and take full advantage of model capacity across speech and text, we minimize the number of modality-specific layers in our model design, with only a conventional CNN block used to extract speech representations, which pushes speech-text models further towards unified models.

As our pre-training method inherits the idea of BERT (Devlin et al., 2019) of reconstructing masked tokens from their unmasked context, the artificial token [MASK] used in pre-training is absent from the labeled data seen in fine-tuning (Yang et al., 2019). This discrepancy between pre-training and fine-tuning hinders the model from being adequately optimized on downstream applications. To alleviate this issue, we propose gradual fine-tuning: we continue training the model on the mixture of labeled sets before turning to a specific task. To further boost performance on speech-text tasks during fine-tuning, we propose noisy fine-tuning, which perturbs the decoder inputs (Cheng et al., 2019) in addition to the speech augmentation applied in the encoder (Park et al., 2019); a minimal sketch of such a decoder-input perturbation is shown below.

Extensive experimental results on the multilingual CoVoST AST (Wang et al., 2021b), VoxPopuli ASR (Wang et al., 2021a) and XTREME benchmarks show that our jointly pre-trained speech-text models achieve competitive results on both speech and text tasks. More specifically, Mu2SLAM establishes a new SOTA for models trained on public datasets on CoVoST, improving over the previous best results by up to 1.9 BLEU points on xx-en and 1.1 BLEU points on en-xx. On VoxPopuli ASR, our model with a Transformer decoder matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, even though the RNN-T decoder is more favorable for ASR tasks. On the multilingual text benchmark XTREME, Mu2SLAM outperforms mSLAM by more than 6% on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TyDi QA.
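As a concrete illustration of the decoder-side perturbation used in noisy fine-tuning, the NumPy sketch below randomly replaces a fraction of the decoder input tokens while the training targets remain clean. The function name `perturb_decoder_inputs`, the noise ratio, the special-token ids and the uniform replacement distribution are illustrative assumptions; the exact perturbation used in the paper (following Cheng et al., 2019) may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_decoder_inputs(target_ids, vocab_size, noise_ratio=0.1,
                           special_ids=(0, 1, 2)):
    """Randomly replace a fraction of the (shifted) target tokens fed to the
    decoder with tokens sampled from the vocabulary; the loss is still
    computed against the clean targets."""
    noisy = np.array(target_ids, copy=True)
    for i in range(len(noisy)):
        if noisy[i] in special_ids:                 # keep PAD/BOS/EOS intact
            continue
        if rng.random() < noise_ratio:
            noisy[i] = rng.integers(3, vocab_size)  # sample a random non-special id
    return noisy

# Example: the decoder consumes a perturbed prefix, the loss targets stay clean.
clean_targets = [1, 57, 802, 13, 944, 2]            # 1 = BOS, 2 = EOS (assumed ids)
decoder_inputs = perturb_decoder_inputs(clean_targets, vocab_size=32000,
                                        noise_ratio=0.2)
print(decoder_inputs)
```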
In our analyses, we conduct ablation studies to gain further insight into which combinations of supervised datasets matter most during pre-training and fine-tuning. We also vary the noise ratio to investigate the effect of noisy fine-tuning on different speech translation directions. These results demonstrate that Mu2SLAM is the first truly multi-modal speech and text model capable of performing a wide variety of understanding and generation tasks for speech and text, attaining competitive results with uni-modal text models and vastly improving over speech-only models.

[Figure 1: diagram showing the four data types (speech-only, text-only, speech-text, text-text) and the MLM, T5, translation and alignment losses applied on the shared encoder-decoder.]

Figure 1. An overview of Mu2SLAM. A loss ℓu is used to train on speech-only and text-only data by computing a masked language modeling (MLM) loss on the encoder and a similar T5 loss on the decoder. The supervised speech-text and text-text data share a pre-training loss comprising forward and backward ℓp terms and an alignment loss ℓa between different languages or modalities; ℓp consists of a translation loss from input to target and an MLM loss on the encoder. Our speech-text models are pre-trained with Lu on unlabeled data and Lp on labeled data. In practice, we incorporate an additional CTC loss for ASR.

## 2. Approach

We propose a multi-task multilingual pre-training method, Mu2SLAM, for speech and text, aiming to pre-train speech-text models on arbitrary tasks related to speech and/or text. The speech and text data can be cast into two types: unlabeled data without supervised labels and labeled data usually accompanied by human-annotated labels. As Figure 1 shows, we consider four types of data, i.e., speech-only, text-only, speech-text and text-text. The main idea is to unify these training examples into the sequence-to-sequence format and apply similar optimization objectives on the encoder and decoder. The losses on unlabeled data (Lu) and labeled data (Lp) are combined to pre-train the speech-text models.

### 2.1. Model Architecture

Mu2SLAM is based on an encoder-decoder backbone model. For speech inputs, we follow mSLAM (Bapna et al., 2022) in converting an acoustic feature sequence of 80-dimensional log Mel spectrograms into a sequence of latent speech representations via a CNN block. The CNN block, consisting of two 2D-convolutional layers with strides (2, 2), also acts as a sub-sampling mechanism with a 4x reduction in the sequence length dimension. A subsequent linear projection layer maps the dimension of the latent speech representations to that of the encoder stack; we denote the resulting speech representations as S. A text input t simply goes through a token embedding layer to be transformed into a sequence of embeddings. To specify the language and modality, we add language and modality embeddings to the word embeddings or speech representations S, in addition to the conventional positional embeddings. The speech and text representations are then fed into a shared multi-modal encoder-decoder model; a shape-level sketch of this input pipeline is given below.
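The following NumPy sketch traces the shapes through the speech input pipeline just described (4x sub-sampling, projection to the encoder width, then language/modality/positional embeddings). The random projection and embedding tables stand in for learned parameters, and the model width of 1024, the language count of 128, and the function names are assumptions for illustration, not values stated in this section.

```python
import numpy as np

D_MODEL = 1024          # assumed encoder width; not specified in this section
rng = np.random.default_rng(0)

def subsample_speech(log_mel, stride=2):
    """Stand-in for the two strided 2D convolutions: each halves the time
    axis, giving the 4x sequence-length reduction described above."""
    x = log_mel[::stride]                                # first conv, stride 2 in time
    x = x[::stride]                                      # second conv, stride 2 in time
    proj = rng.standard_normal((x.shape[-1], D_MODEL)) * 0.02  # placeholder projection
    return x @ proj                                      # (T / 4, D_MODEL)

def add_type_embeddings(x, lang_id, modality_id, num_langs=128, num_modalities=2):
    """Add language, modality and positional embeddings; all three tables are
    placeholders with random weights."""
    lang_emb = rng.standard_normal((num_langs, D_MODEL)) * 0.02
    mod_emb = rng.standard_normal((num_modalities, D_MODEL)) * 0.02
    pos_emb = rng.standard_normal((x.shape[0], D_MODEL)) * 0.02
    return x + lang_emb[lang_id] + mod_emb[modality_id] + pos_emb

speech = rng.standard_normal((400, 80))   # 400 frames of 80-dim log Mel features
S = add_type_embeddings(subsample_speech(speech), lang_id=5, modality_id=0)
print(S.shape)                            # (100, 1024): ready for the shared encoder
```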
We prefer a deep encoder with 24 Conformer layers (Gulati et al., 2020) (a similar encoder to mSLAM) and a shallow decoder with 6 Transformer layers (Vaswani et al., 2017), which favors faster inference while maintaining competitive quality (Kasai et al., 2020).

### 2.2. Speech Tokenization

The basis of the proposed speech-text pre-training approach is to treat the speech inputs as an additional language, which requires a speech tokenizer to quantize the continuous speech representations S = (s1, s2, ..., sN) into discrete ids z = (z1, z2, ..., zN). To this end, each speech representation vector s is independently mapped to a discrete id z by finding its nearest neighbour in the speech codebook G:

$$z = \arg\min_i \lVert G_i - s \rVert. \qquad (1)$$

In mSLAM, the parameters of the speech tokenizer are learned from scratch with a contrastive loss (Baevski et al., 2020) over a speech-only encoder. For simplicity, we directly use the pre-trained speech tokenizer from mSLAM and keep it frozen during our model training.

### 2.3. Pre-training Objectives

In this paper, we have four different training sets related to speech and/or text: a speech-only set Ds, a text-only set Dt, a speech-text set Dst and a text-text set Dtt. We want to unify the pre-training losses for unlabeled and labeled data, which makes our pre-training method easily extensible to more datasets.

**Losses on unlabeled data.** Given an unlabeled training example x = (x1, x2, ..., xN), we first use it as a source-target pair (x, x) for sequence-to-sequence model training. We then randomly construct a 0/1 masking vector m sampled from a prior distribution and apply it to the source x, replacing token $x_i$ with a [MASK] token if $m_i = 1$; the corrupted source is denoted $x^m$. For the target x, we employ the complementary mask $\bar m$, setting $x_i$ to the [MASK] token if $m_i = 0$, and denote the result $x^{\bar m}$. Finally, to enable the model to predict the masked source tokens on both the encoder and the decoder, the loss $\ell_u(x^m, x^{\bar m}; \theta)$ on the pseudo pair $(x^m, x^{\bar m})$ is computed as

$$\ell_u(x^m, x^{\bar m}; \theta) = \sum_{i:\, m_i=1} \log P(x_i \mid x^m; \theta_e) + \sum_{i:\, m_i=1} \log P(x_i \mid x^{\bar m}_{<i}, x^m; \theta),$$

where $\theta_e$ denotes the encoder parameters.
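To make the masking scheme concrete, here is a minimal NumPy sketch of how a pseudo pair $(x^m, x^{\bar m})$ can be built from an unlabeled example. The MASK id, the masking probability and the function name are illustrative assumptions; the prior distribution over m used in the paper is not specified here.

```python
import numpy as np

MASK_ID = 3                     # assumed id of the artificial [MASK] token
rng = np.random.default_rng(0)

def make_denoising_pair(token_ids, mask_prob=0.5):
    """Build the (x^m, x^mbar) pair used by the unlabeled-data loss: positions
    with m_i = 1 are masked on the encoder side and kept as decoder targets,
    while positions with m_i = 0 are kept on the encoder side and masked on
    the decoder side (the complementary mask)."""
    x = np.array(token_ids)
    m = rng.random(len(x)) < mask_prob        # 0/1 masking vector m
    x_m = np.where(m, MASK_ID, x)             # corrupted encoder input x^m
    x_mbar = np.where(m, x, MASK_ID)          # complementary decoder side x^mbar
    return x_m, x_mbar, m

x = [17, 204, 9, 88, 301, 46]
x_m, x_mbar, m = make_denoising_pair(x)
# The encoder MLM term predicts x_i at positions with m_i = 1 from x^m;
# the decoder term predicts the same positions autoregressively given x^m
# and the partially masked target x^mbar.
print(x_m, x_mbar, m)
```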