# Open-Source Conversational AI with SpeechBrain 1.0

Mirco Ravanelli1,2,5, Titouan Parcollet4,6, Adel Moumen3, Sylvain de Langen3, Cem Subakan7,2,1, Peter Plantinga2, Yingzhi Wang8, Pooneh Mousavi1,2, Luca Della Libera1,2, Artem Ploujnikov5,2, Francesco Paissan9,14, Davide Borra10, Salah Zaiem11, Zeyu Zhao12, Shucong Zhang4, Georgios Karakasidis12, Sung-Lin Yeh12, Pierre Champion13, Aku Rouhe14,18, Rudolf Braun20, Florian Mai19, Juan Zuluaga-Gomez20,21, Seyed Mahed Mousavi15, Andreas Nautsch3, Ha Nguyen3, Xuechen Liu17, Sangeet Sagar16, Jarod Duret3, Salima Mdhaffar3, Gaëlle Laperrière3, Mickael Rouvier3, Renato De Mori3,22, Yannick Estève3

1Concordia University, 2Mila-Quebec AI Institute, 3Avignon University, 4Samsung AI Center Cambridge, 5Université de Montréal, 6University of Cambridge, 7Laval University, 8Zaion, 9Fondazione Bruno Kessler, 10University of Bologna, 11Télécom Paris, 12University of Edinburgh, 13Inria, 14Aalto University, 15University of Trento, 16Saarland University, 17National Institute of Informatics - Tokyo, 18Silo AI, 19KU Leuven, 20Idiap, 21EPFL, 22McGill University

SpeechBrain (https://speechbrain.github.io/) is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete recipes of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

Keywords: Conversational AI, open-source, speech processing, deep learning.

© 2024 Mirco Ravanelli, Titouan Parcollet, et al. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.

## 1. Introduction

Conversational AI is experiencing extraordinary progress, with Large Language Models (LLMs) and speech assistants rapidly evolving and becoming widely adopted in the daily lives of millions of users (McTear, 2021). However, this rapid evolution poses a challenge to a fundamental pillar of science: reproducibility. Replicating recent findings is often difficult or impossible for many researchers due to limited access to data, computational resources, or code (Kapoor and Narayanan, 2023). The open-source community is making a remarkable collective effort to mitigate this reproducibility crisis, yet many contributors release only pre-trained models, a practice known as open-weight release (Liesenfeld and Dingemanse, 2024). While this is a step forward, the data and algorithms used to train these models very often remain undisclosed. We helped address this problem by releasing SpeechBrain (Ravanelli et al., 2021), a PyTorch-based open-source toolkit designed to accelerate research in speech, audio, and text processing.
We ensure replicability by releasing pre-trained models for various tasks and providing the recipes to train them from scratch, conveniently including all necessary algorithms and code. A few other open-source toolkits, like NeMo (Kuchaiev et al., 2019) and ESPnet (Watanabe et al., 2018), also support multiple Conversational AI tasks, each excelling in different applications. A more detailed discussion of the related toolkits can be found in Appendix A. This paper introduces SpeechBrain 1.0, a remarkable milestone resulting from years of collaboration between the core development team and our community volunteers. We outline key technical updates supporting novel learning methods, LLM integration, advanced decoding strategies, and new models, tasks, and modalities. We also present a new benchmark repository designed to facilitate model comparisons across tasks.

## 2. Overview of SpeechBrain

Figure 1: SpeechBrain architecture overview (the training script, the hyperparams.yaml file declaring model, training, optimizer, checkpointer, and metric arguments, and the data manifest files with ID, audio path, duration, transcript, speaker, and segments are combined through a Brain subclass and its fit()/evaluate() methods).

Since its launch in March 2021, SpeechBrain has grown rapidly and emerged as one of the most popular toolkits for speech processing. It is downloaded 2.5 million times monthly, is used in 2,200 repositories, and has 8.6k GitHub stars and 154 contributors. Despite its constant evolution, we remain faithful to the original design principles. We prioritized replicability by releasing both training recipes and pre-trained models. Moreover, 95% of our recipes use freely available data and include comprehensive training logs, checkpoints, and other essential information. We made SpeechBrain easy to use by providing comprehensive documentation, examples, and tutorials. Our modular architecture facilitates easy integration or modification of modules. We built it on standard PyTorch interfaces (e.g., torch.nn.Module, torch.optim, torch.utils.data.Dataset), enabling seamless integration with the PyTorch ecosystem (Rouhe et al., 2022). It is released under the Apache 2.0 license.

### 2.1 Architecture Overview

Training a model with SpeechBrain involves combining the training script, the hyperparameter file, and the data manifest files, as depicted in Figure 1. First, users specify the data for training, validation, and testing using CSV or JSON files. These formats are supported because they allow flexible and intuitive declaration of input files and annotations. Next, users design a model and define its hyperparameters using a modified YAML format known as HyperPyYAML. This format enables complex yet elegant parameter configurations, defining objects and their associated arguments. Finally, users write the training script, which orchestrates all the steps needed to train the model. The training procedure is integrated into a single Python script that uses a specialized Brain class designed to make the process intuitive and standardized.
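The listing below is a minimal sketch of this workflow, in the spirit of the toolkit's minimal examples: a Brain subclass defines the forward pass and the objective, and fit() runs the training loop. The toy linear model and the random in-memory data are placeholders; in real recipes the modules, optimizer, and other objects are instantiated from a HyperPyYAML file and the data come from the CSV/JSON manifests. Method and argument names reflect our reading of the public API and may differ slightly across versions.

```python
import torch
import speechbrain as sb

class ToyBrain(sb.Brain):
    def compute_forward(self, batch, stage):
        # Forward pass: apply the model declared in self.modules to the input.
        return self.modules.model(batch["input"])

    def compute_objectives(self, predictions, batch, stage):
        # Objective/loss used for both training and evaluation.
        return torch.nn.functional.l1_loss(predictions, batch["target"])

# Placeholder module and data; recipes normally build these from hyperparams.yaml
# and from the data manifest files described above.
model = torch.nn.Linear(in_features=10, out_features=10)
brain = ToyBrain(
    modules={"model": model},
    opt_class=lambda params: torch.optim.SGD(params, lr=0.1),
)
data = [{"input": torch.rand(10, 10), "target": torch.rand(10, 10)}]
brain.fit(range(10), data)  # epoch counter + training data
```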
Our toolkit natively implements popular models, efficient sequence-to-sequence learning, data handling, distributed training, beam search decoding, evaluation metrics, and data augmentation, across more than 200 training recipes for widely used research datasets and more than 100 pretrained models.

| Modality | Tasks and Techniques |
| --- | --- |
| Audio | Vocoding, Audio Augmentation, Feature Extraction, Sound Event Detection, Beamforming |
| Speech | Speech Recognition, Enhancement, Separation, Text-to-Speech, Speaker Recognition, Speech-to-Speech Translation, Spoken Language Understanding, Voice Activity Detection, Diarization, Emotion Recognition, Emotion Diarization, Language Identification, Self-Supervised Training, Metric Learning, Forced Alignment |
| Text | LM Training, LLM Fine-Tuning, Dialogue Modeling, Response Generation, Grapheme-to-Phoneme |
| EEG | Motor Imagery, P300, SSVEP Classification |

Table 1: Summary of the technology supported by SpeechBrain 1.0.

## 3. Recent Developments

SpeechBrain now supports a wide array of tasks; please refer to Table 1 for a complete list as of October 2024. The main improvements in SpeechBrain 1.0 include:

Learning Modalities: We expanded support for emerging deep learning modalities. For continual learning, we implemented rehearsal-, architecture-, and regularization-based approaches (Della Libera et al., 2023). For interpretability, we developed both post-hoc and by-design methods, including Post-hoc Interpretation via Quantization (Paissan et al., 2023), Listen to Interpret (Parekh et al., 2022), Activation Map Thresholding (AMT) for Focal Networks (Della Libera et al., 2024), and Listenable Maps for Audio Classifiers (Paissan et al., 2024). We also implemented audio generation with standard and latent diffusion techniques, along with DiffWave (Kong et al., 2020b), a diffusion-based vocoder. Lastly, efficient fine-tuning strategies were introduced for faster inference with speech self-supervised models (Zaiem et al., 2023a). We implemented wav2vec 2.0 SSL pretraining from scratch as described by Baevski et al. (2020b). This enabled the efficient training of a 1-billion-parameter SSL model for French on 14,000 hours of speech using over 100 A100 GPUs, showcasing the scalability of SpeechBrain (Parcollet et al., 2024). We also released the first open-source implementation of the BEST-RQ model (Whetten et al., 2024).

Models and Tasks: We developed several new models and expanded support for various tasks. For speech recognition, we introduced new alternatives to the Transformer architecture, such as HyperConformer (Mai et al., 2023) and Branchformer (Peng et al., 2022b), along with a streamable Conformer Transducer. We implemented the Stabilised Light Gated Recurrent Unit (Moumen and Parcollet, 2023), an improved version of the Light GRU for more efficient learning (Ravanelli et al., 2018). We now support models based on discrete audio tokens (e.g., discrete wav2vec 2.0, HuBERT, WavLM, EnCodec, DAC, and SpeechTokenizer), which form the basis of modern multimodal LLMs (Mousavi et al., 2024a); a conceptual sketch of this kind of tokenization is given after this paragraph. Additionally, we introduced technology for Speech Emotion Diarization (Wang et al., 2023). To improve usability and flexibility, we refactored the speech augmentation techniques (Ravanelli and Omologo, 2014, 2015).
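To give a rough intuition of what these discrete "semantic" tokens are, the snippet below quantizes frame-level self-supervised features with k-means and replaces each frame with its cluster index. This is a conceptual sketch only: the random features, the feature dimension, and the vocabulary size are placeholder assumptions, and it does not reproduce SpeechBrain's actual tokenizer implementations.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for frame-level features from an SSL encoder such as HuBERT or WavLM:
# shape [n_frames, feat_dim]. In practice these come from a pretrained model.
train_features = np.random.randn(2000, 768).astype("float32")

# Learn a small codebook (a vocabulary of 128 "audio tokens") over the training frames.
codebook = KMeans(n_clusters=128, n_init=10, random_state=0).fit(train_features)

# Tokenize a new utterance: one integer token per frame.
utterance_features = np.random.randn(120, 768).astype("float32")
tokens = codebook.predict(utterance_features)
print(tokens[:10])  # cluster indices in [0, 127]
```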
In terms of new modalities, SpeechBrain 1.0 now supports electroencephalographic (EEG) signal processing (Borra et al., 2024). Supporting EEG aligns with our long-term goal of enabling natural human-machine conversation, including for those who cannot speak. Thanks to deep learning, the technologies used for speech and EEG processing are becoming increasingly similar, simplifying their integration in a single toolkit. SpeechBrain 1.0 takes a step in this direction by supporting EEG tasks such as motor imagery, P300, and SSVEP classification with EEGNet (Lawhern et al., 2018), ShallowConvNet (Schirrmeister et al., 2017b), and EEGConformer (Song et al., 2023).

Decoding Strategies: We improved the beam search algorithms for speech recognition and translation. The redesign separates scoring from search, which simplifies the code and allows easy integration of various scorers, including n-gram language models and custom heuristics. Additionally, we support pure CTC training, RNN-T latency-controlled beam search (Jain et al., 2019), batch and GPU decoding (Kim et al., 2017), and N-best hypothesis output with neural language model rescoring (Salazar et al., 2019). We also offer an interface to Kaldi2 (k2) for search based on Finite State Transducers (FSTs) (Kang et al., 2023) and to KenLM for fast language model rescoring (Heafield, 2011).

Integration with LLMs: LLMs are crucial in modern Conversational AI. We enhanced our interfaces with popular models such as GPT-2 (Radford et al., 2019) and Llama 2/3 (Touvron et al., 2023), enabling easy fine-tuning for tasks such as dialogue modeling and response generation (Mousavi et al., 2024c). We also implemented LTU-AS (Gong et al., 2023), a speech LLM designed to jointly understand audio and speech. Additionally, LLMs can be used to rescore the n-best hypotheses produced by speech recognizers (Tur et al., 2024), as illustrated in the sketch below.
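To make the rescoring idea concrete, here is a hedged sketch that re-ranks an ASR n-best list with an off-the-shelf causal LM from Hugging Face transformers. It is not the SpeechBrain or ProGRes implementation: GPT-2, the length-normalized log-likelihood, and the interpolation weight are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def lm_log_likelihood(text: str) -> float:
    # Average token log-likelihood under the LM (length-normalized).
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)
    return -out.loss.item()  # loss is the mean negative log-likelihood per token

def rescore(nbest, lm_weight=0.5):
    # nbest: list of (hypothesis, asr_score) pairs; higher scores are better.
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_log_likelihood(h[0]))

nbest = [("the cat sat on the mat", -4.2), ("the cat sad on the mat", -4.0)]
print(rescore(nbest)[0])
```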
Benchmarks: We launched a new benchmark repository to facilitate community standardization across various areas of broad interest. Currently, we host four benchmarks: CL-MASR for multilingual ASR continual learning (Della Libera et al., 2023), MP3S for speech self-supervised models with customizable probing heads (Zaiem et al., 2023b), DASB for discrete audio token assessment (Mousavi et al., 2024b), and SpeechBrain-MOABB (Borra et al., 2024), built on MOABB (Aristimunha et al., 2024) and MNE (Gramfort et al., 2014), for evaluating EEG models.

## 4. Conclusion and Future Work

We presented SpeechBrain 1.0, a significant advancement in the evolution of the SpeechBrain project. We outlined the main updates, including novel learning modalities, models, tasks, and decoding strategies, alongside our benchmarking initiatives. For an overview of further improvements, please visit the project website. Looking ahead, we plan to keep serving our community with advances in large-scale, small-footprint, and multimodal models. In particular, we plan to fully support the training of multimodal large language models (MLLMs) that integrate text, speech, and audio processing tasks into a single unified foundation model.

## Acknowledgment

We would like to thank our sponsors: Hugging Face, Samsung AI Center Cambridge, Baidu, OVHcloud, ViaDialog, and Naver Labs Europe. A special thank you to all the contributors who made SpeechBrain 1.0 possible. We thank the Torchaudio team (Hwang et al., 2023) for helpful discussions and support. We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Digital Research Alliance of Canada (alliancecan.ca), and the Amazon Research Award (ARA). We also thank Jean Zay GENCI-IDRIS for their support with computing (Grant 2024-A0161015099 and Grant 2022-A0111012991), and the LIAvignon Partnership Chair in AI.

## References

B. Aristimunha, I. Carrara, P. Guetschel, S. Sedlar, P. Rodrigues, J. Sosulski, D. Narayanan, E. Bjareholt, Q. Barthelemy, R. Kobler, R. T. Schirrmeister, E. Kalunga, L. Darmet, C. Gregoire, A. Abdul Hussain, R. Gatti, V. Goncharenko, J. Thielen, T. Moreau, Y. Roy, V. Jayaram, A. Barachant, and S. Chevallier. Mother of all BCI Benchmarks, 2024. URL https://github.com/NeuroTechX/moabb.

A. Baevski, H. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), 2020a.

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), 2020b.

D. Borra, F. Paissan, and M. Ravanelli. SpeechBrain-MOABB: An open-source Python library for benchmarking deep neural networks applied to EEG signals. Computers in Biology and Medicine, 182, 2024.

H. Bredin. pyannote.audio 2.1 speaker diarization pipeline: Principle, benchmark, and recipe. In Proceedings of Interspeech, 2023.

L. Della Libera, P. Mousavi, S. Zaiem, C. Subakan, and M. Ravanelli. CL-MASR: A continual learning benchmark for multilingual ASR. CoRR, abs/2310.16931, 2023.

L. Della Libera, C. Subakan, and M. Ravanelli. Focal modulation networks for interpretable sound classification. In Proceedings of the ICASSP Workshop on Explainable AI for Speech and Audio (XAI-SA), 2024.

B. Desplanques, J. Thienpondt, and K. Demuynck. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In Proceedings of Interspeech, 2020.

Y. Gong, A. H. Liu, H. Luo, L. Karlinsky, and J. Glass. Joint audio and speech understanding. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.

A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. NeuroImage, 86:446–460, 2014.

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang. Conformer: Convolution-augmented Transformer for speech recognition. In Proceedings of Interspeech, 2020.

K. Heafield. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT), 2011.

J. Hwang, M. Hira, C. Chen, X. Zhang, Z. Ni, G. Sun, P. Ma, R. Huang, V. Pratap, Y. Zhang, A. Kumar, C.-Y. Yu, C. Zhu, C. Liu, J. Kahn, M. Ravanelli, P. Sun, S. Watanabe, Y. Shi, Y. Tao, R. Scheibler, S. Cornell, S. Kim, and S. Petridis. TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.

M. Jain, K. Schubert, J. Mahadeokar, C. Yeh, K. Kalgaonkar, A. Sriram, C. Fuegen, and M. L. Seltzer. RNN-T for latency controlled ASR with improved beam search. CoRR, abs/1911.01629, 2019.

W. Kang, L. Guo, F. Kuang, L. Lin, M. Luo, Z. Yao, X. Yang, P. Zelasko, and D. Povey. Fast and parallel decoding for transducer. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
S. Kapoor and A. Narayanan. Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 2023.

S. Kim, T. Hori, and S. Watanabe. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4835–4839, 2017.

J. Kong, J. Kim, and J. Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), 2020a.

Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. CoRR, abs/2009.09761, 2020b.

O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V. Lavrukhin, J. Cook, P. Castonguay, M. Popova, J. Huang, and J. M. Cohen. NeMo: A toolkit for building AI applications using neural modules. CoRR, abs/1909.09577, 2019.

V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance. EEGNet: A compact convolutional neural network for EEG-based brain-computer interfaces. Journal of Neural Engineering, 15(5), 2018.

C. Li, L. Yang, W. Wang, and Y. Qian. SkiM: Skipping memory LSTM for low-latency real-time continuous speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

A. Liesenfeld and M. Dingemanse. Rethinking open source generative AI: Open washing and the EU AI Act. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, 2024.

Y. Luo and N. Mesgarani. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019.

Y. Luo, Z. Chen, and T. Yoshioka. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

F. Mai, J. Zuluaga-Gomez, T. Parcollet, and P. Motlicek. HyperConformer: Multi-head HyperMixer for efficient speech recognition. In Proceedings of Interspeech, 2023.

M. McTear. Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2021.

A. Moumen and T. Parcollet. Stabilising and accelerating light gated recurrent units for automatic speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

P. Mousavi, J. Duret, S. Zaiem, L. Della Libera, A. Ploujnikov, C. Subakan, and M. Ravanelli. How should we extract discrete audio tokens from self-supervised models? In Proceedings of Interspeech, 2024a.

P. Mousavi, L. Della Libera, J. Duret, A. Ploujnikov, C. Subakan, and M. Ravanelli. DASB - Discrete Audio and Speech Benchmark. CoRR, abs/2406.14294, 2024b.

S. M. Mousavi, G. Roccabruna, S. Alghisi, M. Rizzoli, M. Ravanelli, and G. Riccardi. Are LLMs robust for spoken dialogues? In Proceedings of the International Workshop on Spoken Dialogue Systems Technology (IWSDS), 2024c.

F. Paissan, C. Subakan, and M. Ravanelli. Posthoc interpretation via quantization. CoRR, abs/2303.12659, 2023.

F. Paissan, M. Ravanelli, and C. Subakan. Listenable maps for audio classifiers. In Proceedings of the International Conference on Machine Learning (ICML), 2024.
T. Parcollet, H. Nguyen, S. Evain, M. Zanon Boito, A. Pupier, S. Mdhaffar, H. Le, S. Alisamir, N. Tomashenko, M. Dinarelli, S. Zhang, A. Allauzen, M. Coavoux, Y. Estève, M. Rouvier, J. Goulian, B. Lecouteux, F. Portet, S. Rossato, F. Ringeval, D. Schwab, and L. Besacier. LeBenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of French speech. Computer Speech & Language, 86:101622, 2024.

J. Parekh, S. Parekh, P. Mozharovskyi, F. d'Alché-Buc, and G. Richard. Listen to Interpret: Post-hoc interpretability for audio networks with NMF. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), 2022.

Y. Peng, S. Dalmia, I. Lane, and S. Watanabe. Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In Proceedings of the International Conference on Machine Learning (ICML), 2022a.

Y. Peng, S. Dalmia, I. R. Lane, and S. Watanabe. Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In Proceedings of the International Conference on Machine Learning (ICML), 2022b.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.

M. Ravanelli and M. Omologo. On the selection of the impulse responses for distant-speech recognition based on contaminated speech training. In Proceedings of Interspeech, 2014.

M. Ravanelli and M. Omologo. Contaminated speech training methods for robust DNN-HMM distant speech recognition. In Proceedings of Interspeech, 2015.

M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92–102, 2018.

M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, et al. SpeechBrain: A general-purpose speech toolkit. CoRR, abs/2106.04624, 2021.

Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.

A. Rouhe, M. Ravanelli, T. Parcollet, and P. Plantinga. A SpeechBrain for everything: State of the PyTorch ecosystem for speech technologies. Interspeech Tutorial Presentation, September 2022.

J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff. Masked language model scoring. CoRR, abs/1910.14659, 2019.

R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 2017a.

R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11):5391–5420, 2017b.

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
Y. Song, Q. Zheng, B. Liu, and X. Gao. EEG Conformer: Convolutional transformer for EEG decoding and visualization. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 31:710–719, 2023.

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.

A. D. Tur, A. Moumen, and M. Ravanelli. ProGRes: Prompted generative rescoring on ASR n-best. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), 2024.

Y. Wang, M. Ravanelli, and A. Yacoubi. Speech Emotion Diarization: Which emotion appears when? In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.

S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai. ESPnet: End-to-end speech processing toolkit. In Proceedings of Interspeech, 2018.

R. Whetten, T. Parcollet, M. Dinarelli, and Y. Estève. Open implementation and study of BEST-RQ for speech processing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.

S.-W. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-T. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-Y. Lee. SUPERB: Speech Processing Universal Performance Benchmark. In Proceedings of Interspeech, 2021.

S. Zaiem, R. Algayres, T. Parcollet, E. Slim, and M. Ravanelli. Fine-tuning strategies for faster inference using speech self-supervised models: A comparative study. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSP), 2023a.

S. Zaiem, Y. Kemiche, T. Parcollet, S. Essid, and M. Ravanelli. Speech self-supervised representation benchmarking: Are we doing it right? In Proceedings of Interspeech, 2023b.

## Appendix A. Related Toolkits

Several open-source toolkits for Conversational AI have been developed in recent years, with NeMo (https://github.com/NVIDIA/NeMo) (Kuchaiev et al., 2019) and ESPnet (https://github.com/espnet/espnet) being the most relevant to SpeechBrain. While all of these toolkits share the common goal of making Conversational AI more accessible, each is designed with a different structure and for specific use cases, meaning the best toolkit depends on the particular task and user needs. NeMo, for instance, is industry-focused and offers ready-to-use solutions, but may provide less flexibility for extensive customization compared to SpeechBrain, which is more research-oriented. ESPnet also supports various tasks with competitive performance, but SpeechBrain stands out for its comprehensive documentation, beginner-friendly tutorials, simplicity, and lightweight design with fewer dependencies.
Another related toolkit is k2 (https://github.com/k2-fsa/k2) (Kang et al., 2023), which integrates Finite State Automaton (FSA) and Finite State Transducer (FST) algorithms into autograd-based machine learning frameworks such as PyTorch and TensorFlow. We found these features extremely valuable, so we developed an interface that enables the seamless integration of k2 within SpeechBrain. Beyond general-purpose toolkits for Conversational AI and speech processing, we have seen the evolution of more task-specific toolkits. A notable example is pyannote (https://github.com/pyannote/pyannote-audio) (Bredin, 2023), which is primarily designed for speaker diarization. It aims to provide effective APIs for specific tasks to serve a broad user base. In contrast, SpeechBrain focuses on advancing research by also offering training recipes. Lastly, we have seen the rise of popular speech benchmarks such as SUPERB (https://superbbenchmark.github.io/) (Yang et al., 2021), which provides a set of resources to evaluate the performance of universal shared representations for speech processing. While SUPERB is highly valuable to the community, SpeechBrain has a broader goal: in addition to benchmarking existing models, we aim to provide all the code necessary to train models from scratch. For the EEG modality, we rely on two key dependencies: MOABB (https://github.com/NeuroTechX/moabb) (Aristimunha et al., 2024) and MNE (https://mne.tools/) (Gramfort et al., 2014). MOABB was chosen for its user-friendly interface and extensive support for a wide range of EEG datasets, while MNE is used for its comprehensive and standardized data preprocessing pipeline. We also offer an integration with Braindecode (https://braindecode.org/) (Schirrmeister et al., 2017a), with a tutorial that explains how to connect it with SpeechBrain.

## Appendix B. Model Replication

One of the important contributions of SpeechBrain is the replication of existing models, which may be closed-source, open-weight only, or published without accompanying code. This process is often time-consuming and challenging, as successful replication is far from trivial. Throughout the project, it has been systematically applied to models not originally developed within SpeechBrain across various tasks, including speaker recognition with ECAPA-TDNN (Desplanques et al., 2020), speech recognition with Conformer (Gulati et al., 2020) and Branchformer (Peng et al., 2022a), speech separation with SkiM (Li et al., 2022), Dual-Path RNN (Luo et al., 2020), and Conv-TasNet (Luo and Mesgarani, 2019), speech synthesis with Tacotron 2 (Shen et al., 2017), FastSpeech 2 (Ren et al., 2021), and HiFi-GAN (Kong et al., 2020a), self-supervised learning with wav2vec 2.0 (Baevski et al., 2020a) and BEST-RQ (Whetten et al., 2024), and many others. In all the aforementioned cases, we successfully replicated the models and, in some cases, even improved their performance. One notable example is the replication of the ECAPA-TDNN model for speaker verification. Through collaboration with the original developers, we released the first open-source version of the model. We not only replicated the results of the original paper but also achieved slight improvements, as detailed in Table 2.

| ECAPA-TDNN | EER |
| --- | --- |
| Original paper | 0.87% |
| SpeechBrain | 0.81% |

Table 2: Comparison of the Equal Error Rate (EER%) between the original ECAPA-TDNN paper and the SpeechBrain re-implementation.
The improvement primarily originated from a more robust data augmentation strategy and a more careful selection of the training hyperparameters.
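As an illustration of how a replicated model like this is typically consumed, the snippet below loads the pretrained ECAPA-TDNN speaker-verification model through SpeechBrain's inference interface. The class path, the Hugging Face model identifier (speechbrain/spkrec-ecapa-voxceleb), and the example file names reflect our understanding of the released model card and should be checked against the current documentation.

```python
from speechbrain.inference.speaker import SpeakerRecognition

# Download (or reuse a cached copy of) the pretrained ECAPA-TDNN verification model.
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",         # Hugging Face model card
    savedir="pretrained_models/spkrec-ecapa-voxceleb",  # local cache directory
)

# Similarity score and same/different-speaker decision for two local files
# (speaker_a.wav and speaker_b.wav are hypothetical example paths).
score, decision = verifier.verify_files("speaker_a.wav", "speaker_b.wav")
print(float(score), bool(decision))
```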