Published as a conference paper at ICLR 2025

SCALING SPEECH-TEXT PRE-TRAINING WITH SYNTHETIC INTERLEAVED DATA

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang
Tsinghua University; Zhipu.AI
Code & Models: https://github.com/THUDM/GLM-4-Voice

ABSTRACT

Speech language models (Speech LMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing Speech LMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability as LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans with a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training approach yields discrete speech tokens with strong semantic preservation even at lower frame rates (e.g., 12.5Hz), while still maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text tokens), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving accuracy on spoken question answering tasks from the previous SOTA of 13% (Moshi) to 31%.
We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that achieves performance comparable to existing baselines in both conversational abilities and speech quality, even while operating exclusively in the speech domain.

Figure 1: (Left) The performance on Spoken QA continuously improves as the amount of synthetic interleaved data increases, significantly surpassing the previous SOTA (Moshi). (Right) The pipeline for synthesizing interleaved speech-text data.

*Equal contribution. Email: {zah22,zx-du20,liumd24}@mails.tsinghua.edu.cn. Work was done when ML and LZ interned at Zhipu.AI. Corresponding authors: YD and JT.

1 INTRODUCTION

Large language models (LLMs) have significantly advanced natural language processing, demonstrating capabilities beyond traditional language tasks. Trained on vast internet corpora, they exhibit emergent abilities such as instruction following (Ouyang et al., 2022), logical reasoning (Wei et al., 2022), and tool utilization (Schick et al., 2023). These advancements have enabled applications such as interactive chatbots and personalized digital assistants. However, an ideal AI assistant should not rely solely on text. Voice-based interaction offers a more natural and intuitive interface for human-AI interaction. Traditional voice-based systems combine Automatic Speech Recognition (ASR), LLMs, and Text-to-Speech (TTS) models in a cascading manner. This approach, however, suffers from information loss during the ASR and TTS stages, limiting its ability to capture and express the rich nuances of speech.
Speech language models (Speech LMs) have emerged as a promising approach for building general-purpose voice assistants capable of processing speech input and output end-to-end. Several methods have been explored to construct Speech LMs. Lakhotia et al. (2021) proposed unsupervised learning on speech corpora using discrete semantic tokens. Hassid et al. (2023) improved performance by initializing from pre-trained language models, while Moshi (Défossez et al., 2024) utilized large-scale training on private speech data. However, a key challenge remains: the scarcity of speech data compared to text data. While text corpora like FineWeb (Penedo et al., 2024) offer 15 trillion high-quality tokens, large unsupervised speech datasets like VoxPopuli (Wang et al., 2021) provide only 400K hours of speech, equivalent to 36 billion tokens at 25Hz. This disparity limits the scalability and performance of Speech LMs relative to LLMs.

A straightforward idea to address this limitation is to synthesize speech from text pre-training corpora using TTS models. However, this approach faces three major challenges. First, the lower information density of speech tokens leads to significant token expansion, drastically reducing training efficiency. Second, synthesizing speech for large-scale text corpora is computationally expensive. Third, training on pure speech data fails to align with the text modality, preventing the model from leveraging the capabilities of existing LLMs. Recently, Nguyen et al. (2024) explored the use of interleaved speech-text data for training. This approach improves alignment between the speech and text modalities, leading to better speech language modeling performance. However, their method requires parallel speech-text datasets to construct the interleaved data, which significantly limits its scalability for large-scale pre-training.
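The data gap quoted above is simple arithmetic; a quick sanity check using the numbers stated in the text (400K hours of speech, 25 tokens per second, a ~15T-token text corpus):

```python
# 400K hours of speech tokenized at 25 tokens/sec vs. a ~15T-token text corpus.
speech_hours = 400_000
tokens_per_second = 25              # 25Hz tokenizer, as stated above
speech_tokens = speech_hours * 3600 * tokens_per_second

text_tokens = 15e12                 # FineWeb, ~15 trillion tokens
ratio = text_tokens / speech_tokens
# speech_tokens is 36 billion; the text corpus is over 400x larger.
```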
In this paper, we propose a novel approach to scaling speech-text pre-training by synthesizing interleaved speech-text data from text corpora. The interleaved data is generated by sampling text spans and converting them into speech tokens using a text-to-token model. This efficient process bypasses the need to generate actual speech, enabling large-scale pre-training without relying on extensive speech datasets. Inspired by Du et al. (2024), we train the tokenizer in a supervised manner using ASR models and datasets. Experiments with sampling rates from 6.25Hz to 50Hz revealed trade-offs between semantic retention, model efficiency, speech reconstruction quality, and pre-training performance. We selected 12.5Hz as the optimal rate for balancing these factors. To synthesize large-scale interleaved data, we used existing TTS datasets to train a text-to-token model, generating 600B tokens of interleaved speech-text data and expanding the pre-training to 1 trillion tokens. Finally, through fine-tuning on speech dialogue data, we developed an end-to-end spoken chatbot operating entirely in the speech domain.

The main contributions of this paper are as follows:

- We propose a novel method to effectively synthesize high-quality interleaved speech-text data from text corpora, addressing data limitation challenges in speech-text pre-training.
- We design a Speech LM architecture featuring a 12.5Hz single-codebook speech tokenizer trained in a supervised manner, along with a flow-matching based decoder for speech reconstruction, achieving both robust semantic preservation and high-quality speech synthesis.
- We scale our pre-training to 1 trillion tokens using synthesized interleaved speech-text data, significantly advancing capabilities in speech language modeling and spoken question answering.
- We develop an end-to-end spoken chatbot by fine-tuning pre-trained models with speech dialogue data, achieving competitive performance in conversational abilities and speech quality while operating exclusively in the speech domain.

Figure 2: Overview of our method. First, we train a text-to-token model to construct interleaved speech-text data. Training of the speech language model consists of two stages: in stage I, the model is pre-trained with synthetic speech-text interleaved data; in stage II, the model is fine-tuned with a speech dialogue dataset.

2 OUR APPROACH

Current approaches for building Speech LMs typically fall into two categories. One method (Fang et al., 2024; Défossez et al., 2024) uses the language model to process speech input but outputs embeddings for an additional non-autoregressive (NAR) model that generates the speech tokens, which limits the modeling capacity and potentially reduces the upper bound of performance. The other method (Xie & Wu, 2024) uses inconsistent audio representations for input and output, leading to misalignment between the input and output modalities.
In this section, we present our approach for developing an end-to-end spoken chatbot using a unified speech-text modeling framework. Our method integrates a supervised speech tokenizer, a technique for synthesizing interleaved speech-text data, and a two-stage training process to extend pre-trained language models to the speech domain. This comprehensive approach enables us to leverage large-scale text data for speech modeling, effectively aligning the speech and text modalities within a single model.

2.1 SPEECH TOKENIZATION

Supervised Speech Tokenizer. Previous discrete speech tokenizers are either trained with reconstruction/adversarial objectives on the speech waveform (Wang et al., 2023; Chen et al., 2024) or with self-supervised learning on automatically discovered acoustic units (Hsu et al., 2021). Following recent advances in text-to-speech synthesis (Du et al., 2024), we train the discrete speech tokenizer by fine-tuning a pre-trained automatic speech recognition (ASR) model with an additional pooling layer and a vector quantization layer in the middle of the encoder. The pooling layer is a 1D average pooling operator of window size k, which reduces the sampling rate by a factor of k. The vector quantization layer approximates the continuous intermediate representations in the encoder with the closest vectors in the codebook; the selected codebook indices are used as the speech tokens. The codebook vectors are learned with an exponential moving average (EMA), and we add a commitment loss to restrict the volume of the continuous representations before quantization. To overcome codebook collapse, we apply the random restart trick (Dhariwal et al., 2020) to reset vectors whose mean usage falls below a certain threshold. We also adapt the Whisper architecture to support streaming inference, which is important to reduce latency for online speech interaction.
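A minimal sketch of the vector-quantized bottleneck described above, combining nearest-neighbour lookup, EMA codebook updates, a commitment loss, and the random-restart trick. All hyperparameters (decay, restart threshold) are illustrative choices, not the paper's actual values, and the straight-through gradient trick used in real training is omitted from this framework-agnostic numpy version:

```python
import numpy as np

def ema_vq_step(frames, codebook, ema_count, ema_sum,
                decay=0.99, restart_threshold=1e-3, rng=None):
    """One training step of an EMA vector-quantized bottleneck (sketch).

    frames: (N, dim) encoder outputs; codebook: (K, dim).
    Updates the EMA statistics and codebook in place and returns
    (indices, quantized, commitment_loss).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Nearest-neighbour assignment under squared L2 distance.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)
    quantized = codebook[indices]
    # Commitment loss pulls encoder outputs toward their assigned codes.
    commit = ((frames - quantized) ** 2).mean()
    # EMA updates of per-code usage counts and running vector sums.
    onehot = np.eye(codebook.shape[0])[indices]
    ema_count[:] = decay * ema_count + (1 - decay) * onehot.sum(0)
    ema_sum[:] = decay * ema_sum + (1 - decay) * (onehot.T @ frames)
    codebook[:] = ema_sum / ema_count[:, None]
    # Random restart: reinitialize codes whose usage share is too low.
    dead = ema_count / ema_count.sum() < restart_threshold
    if dead.any():
        codebook[dead] = frames[rng.integers(0, len(frames), int(dead.sum()))]
        ema_count[dead] = 1.0
        ema_sum[dead] = codebook[dead]
    return indices, quantized, commit
```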
We replace the convolution layer before the encoder Transformer with a causal convolution layer (van den Oord et al., 2016). We also replace the bidirectional attention in the encoder with block causal attention: the input audio is divided into segments of equal length, and positions within a segment attend to all positions in the current and previous segments, but not to positions in following segments. Empirically, we set the segment interval to 2 seconds (100 tokens before the average pooling). We find this can match the ASR performance of bidirectional attention. For more details about speech tokenizer training, please refer to Appendix B.1.

Table 1: Speech Reconstruction Results. We evaluate semantic retention with Word Error Rate (WER) and reconstruction quality with ViSQOL (Hines et al., 2015) and MOSNet (Lo et al., 2019) for different speech tokenizers across various frame rates. The baseline results are independently evaluated by us.

| Model | Frame Rate | Bitrate (bps) | Causal | LibriSpeech WER | ViSQOL | MOSNet |
|---|---|---|---|---|---|---|
| Ground Truth | - | - | - | 4.62 | - | 3.27 |
| RVQGAN | 75Hz | 1.50K | ✗ | - | 1.74 | 2.74 |
| SemantiCodec | 50Hz | 1.30K | ✗ | - | 2.43 | 3.12 |
| SpeechTokenizer | 50Hz | 1.50K | ✗ | - | 1.53 | 2.67 |
| SpeechTokenizer | 50Hz | 4.00K | ✗ | - | 3.07 | 3.10 |
| SpiRit-LM-Base | 25Hz | 225.0 | ✗ | 11.66 | - | - |
| SpiRit-LM-Expressive | 38.5Hz | 307.0 | ✗ | 10.60 | - | - |
| Moshi (Mimi) | 12.5Hz | 1.10K | ✓ | 8.36 | 2.82 | 2.89 |
| Ours | 50Hz | 600 | ✓ | 6.24 | 2.67 | 3.38 |
| Ours | 25Hz | 300 | ✓ | 6.80 | 2.60 | 3.33 |
| Ours | 12.5Hz | 175 | ✓ | 8.43 | 2.52 | 3.39 |
| Ours | 6.25Hz | 100 | ✓ | 14.41 | 2.34 | 3.24 |

Speech Decoder. Given discrete speech tokens, we synthesize speech through the speech decoder. We follow the decoder architecture of CosyVoice (Du et al., 2024), which consists of a speech token encoder, a conditional flow matching model (Mehta et al., 2024), and a HiFi-GAN vocoder (Kong et al., 2020). The speech token encoder converts a sequence of discrete tokens into a sequence of contextual vectors with a Transformer encoder.
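The block causal attention pattern described for the tokenizer can be expressed as a simple boolean mask; this is a generic sketch of the attention pattern, not the paper's actual implementation:

```python
import numpy as np

def block_causal_mask(num_tokens, block_size):
    """Block causal attention: position i may attend to position j iff
    j's segment is at or before i's segment (so attention within a
    segment is bidirectional). Returns a boolean (num_tokens, num_tokens)
    matrix where True means "may attend"."""
    segments = np.arange(num_tokens) // block_size
    return segments[:, None] >= segments[None, :]

# With 2-second segments at 100 tokens per segment (before pooling),
# tokens 0-99 see only segment 0, tokens 100-199 see segments 0-1, etc.
mask = block_causal_mask(num_tokens=300, block_size=100)
```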
To facilitate streaming synthesis of speech, we adapt the speech token encoder to use the same block causal attention as the speech tokenizer. The flow matching model generates Mel spectrograms conditioned on the speech token representations. Finally, the generated Mel spectrograms are converted into speech waveforms through the HiFi-GAN vocoder (Kong et al., 2020). To train the speech decoder, we use the unsupervised speech data described in Section 2.3.1, which covers a variety of speakers. For more details about speech decoder training, please refer to Appendix B.2.

We evaluate the content preservation and quality of speech generated by our speech decoder on LibriSpeech (Panayotov et al., 2015). The results are shown in Table 1. We measure content preservation by the Word Error Rate (WER) between the transcription produced by the ASR model of Nguyen et al. (2023) and the true transcription. For speech quality, following Défossez et al. (2024), we compute ViSQOL (Hines et al., 2015) and MOSNet (Lo et al., 2019) scores of the reconstructed speech. Our tokenizer performs well across various sampling rates, with the 12.5Hz variant offering an optimal balance between efficiency and quality: it maintains a high quality score (MOSNet 3.39) and content preservation (WER 8.43) at a significantly reduced bitrate (175 bps). Our ablation study on sampling rates during pre-training (cf. Section 3.3.2) shows that lower rates improve performance, but gains plateau at 12.5Hz. Based on these results, we select the 12.5Hz variant for our subsequent experiments.

2.2 SYNTHESIZE INTERLEAVED SPEECH-TEXT DATA

Interleaved speech-text data consists of sequences in which speech and text tokens are interleaved at the word level. For example, a sequence might look like: "Today is ⟨speech tokens⟩ day ⟨speech tokens⟩", where some word spans have been replaced by their corresponding speech tokens.
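For a single-codebook tokenizer, bitrate is just the frame rate times the bits per token; the codebook sizes below are inferred from the bitrates in Table 1, not stated directly in this section:

```python
import math

# Bitrate of a single-codebook tokenizer: frame_rate * log2(codebook_size).
def bitrate_bps(frame_rate_hz, codebook_size):
    return frame_rate_hz * math.log2(codebook_size)

# Backing out the implied bits per token from the Table 1 bitrates:
implied_bits = {rate: bps / rate
                for rate, bps in [(50, 600), (25, 300), (12.5, 175), (6.25, 100)]}
# → 12, 12, 14, and 16 bits per token, i.e. codebook sizes 4096 to 65536.
```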
We hypothesize that training on interleaved speech-text data encourages the model to learn an alignment between speech and text, facilitating the transfer of text-based knowledge to speech representations. Previous methods for creating interleaved speech-text data rely on aligned speech-text parallel datasets (Nguyen et al., 2024), which are challenging to obtain. We propose a novel and efficient approach for constructing interleaved speech-text data using existing text datasets. The process consists of two main steps. First, we train a text-to-token model that directly converts text into corresponding speech tokens, eliminating the need to synthesize actual speech. This avoids the error accumulation associated with text-to-speech-to-token pipelines and significantly improves synthesis efficiency, making it practical and scalable for large-scale data generation. Next, we sample text spans from existing text datasets and transform them into speech spans using the trained text-to-token model. This enables the efficient and scalable creation of interleaved speech-text data without requiring aligned speech-text parallel datasets.

Text-to-Token Model. We train a 1.5B-parameter text-to-token model based on a standard Transformer architecture to convert text into corresponding speech tokens. While these tokens can be further synthesized into actual speech using our speech decoder, this step is unnecessary for constructing interleaved speech-text data. To prepare the training data, we first tokenize speech from text-to-speech datasets into discrete speech tokens. The text-to-token model is then trained to predict these speech token sequences from the input text. The training objective is to minimize the negative log-likelihood of the predicted speech tokens conditioned on the corresponding text input:

$$\mathcal{L} = -\sum_{j=1}^{|a_i|} \log P(a_{i,j} \mid T_i, a_{i,<j})$$

where $T_i$ denotes the input text of the $i$-th example and $a_{i,j}$ its $j$-th speech token.
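The two-step construction above can be sketched as follows. The span-sampling distribution, span ratio, and the toy text-to-token stand-in are all illustrative assumptions, not the paper's actual settings:

```python
import random

def make_interleaved(text_tokens, text_to_speech_tokens, span_ratio=0.5,
                     mean_span_len=10, rng=None):
    """Sketch of step 2: walk over a text sequence, sample word-level spans,
    and replace some of them with speech tokens from the text-to-token model.
    `text_to_speech_tokens` is a stand-in for that model."""
    if rng is None:
        rng = random.Random(0)
    out, i = [], 0
    while i < len(text_tokens):
        # Sample a span length (illustrative exponential distribution).
        span = min(max(1, int(rng.expovariate(1 / mean_span_len))),
                   len(text_tokens) - i)
        chunk = text_tokens[i:i + span]
        if rng.random() < span_ratio:
            out.extend(text_to_speech_tokens(chunk))  # speech span
        else:
            out.extend(chunk)                         # text span kept as-is
        i += span
    return out

# Toy "text-to-token model": tag each word as a pseudo speech token.
fake_t2t = lambda words: [f"<sp:{w}>" for w in words]
mixed = make_interleaved("all nlp tasks are generation tasks".split(), fake_t2t)
```

In the real pipeline the text-to-token model emits several speech tokens per word (token expansion), whereas this toy stand-in maps one word to one pseudo token.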