# Non-Autoregressive Neural Text-to-Speech

Kainan Peng¹, Wei Ping¹, Zhao Song¹, Kexin Zhao¹

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings a 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build the parallel text-to-speech system and test various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as in previous work.

*Equal contribution. ¹Baidu Research, 1195 Bordeaux Dr, Sunnyvale, CA. Speech samples can be found at: https://parallel-neural-tts-demo.github.io/. Correspondence to: Wei Ping. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

Text-to-speech (TTS), also called speech synthesis, has long been a vital tool in a variety of applications, such as human-computer interaction, virtual assistants, and content creation. Traditional TTS systems are based on multi-stage hand-engineered pipelines (Taylor, 2009). In recent years, deep neural network based autoregressive models have attained state-of-the-art results, including high-fidelity audio synthesis (van den Oord et al., 2016) and much simpler seq2seq pipelines (Sotelo et al., 2017; Wang et al., 2017; Ping et al., 2018b). In particular, one of the most popular neural TTS pipelines (a.k.a. "end-to-end") consists of two components (Ping et al., 2018b; Shen et al., 2018): (i) an autoregressive seq2seq model that generates mel spectrogram from text, and (ii) an autoregressive neural vocoder (e.g., WaveNet) that synthesizes raw waveform from mel spectrogram. This pipeline requires much less expert knowledge and only needs pairs of audio and transcript as training data. However, the autoregressive nature of these models makes them quite slow at synthesis, because they operate sequentially at a high temporal resolution of waveform samples and spectrogram.

Most recently, several models have been proposed for parallel waveform generation (e.g., van den Oord et al., 2018; Ping et al., 2018a; Prenger et al., 2019; Kumar et al., 2019; Bińkowski et al., 2020; Ping et al., 2020). In the end-to-end pipeline, the models (e.g., ClariNet, WaveFlow) still rely on an autoregressive component to predict spectrogram features (e.g., 100 frames per second). In the linguistic-feature-based pipeline, the models (e.g., Parallel WaveNet, GAN-TTS) are conditioned on aligned linguistic features from a phoneme duration model and F0 from a frequency model, which are recurrent or autoregressive models. Both of these TTS pipelines can be slow at synthesis on modern hardware optimized for parallel execution.

In this work, we present a fully parallel neural TTS system by proposing a non-autoregressive text-to-spectrogram model. Our major contributions are as follows:
1. We propose ParaNet, a non-autoregressive attention-based architecture for text-to-speech, which is fully convolutional and converts text to mel spectrogram. It runs 254.6 times faster than real-time at synthesis on a 1080 Ti GPU, and brings a 46.7 times speed-up over its autoregressive counterpart (Ping et al., 2018b), while obtaining reasonably good speech quality using neural vocoders.

2. ParaNet distills the attention from the autoregressive text-to-spectrogram model, and iteratively refines the alignment between text and spectrogram in a layer-by-layer manner. It can produce more stable attentions than the autoregressive Deep Voice 3 (Ping et al., 2018b) on challenging test sentences, because it does not have the discrepancy between teacher-forced training and autoregressive inference.

3. We build a fully parallel neural TTS system by combining ParaNet with a parallel neural vocoder, so it can generate speech from text through a single feed-forward pass. We investigate several parallel vocoders, including the distilled IAF vocoder (Ping et al., 2018a) and WaveGlow (Prenger et al., 2019). To explore the possibility of training the IAF vocoder without distillation, we also propose an alternative approach, WaveVAE, which can be trained from scratch within the variational autoencoder (VAE) framework (Kingma & Welling, 2014).

We organize the rest of the paper as follows. Section 2 discusses related work. We introduce the non-autoregressive ParaNet architecture in Section 3. We discuss parallel neural vocoders in Section 4, and report experimental settings and results in Section 5. We conclude the paper in Section 6.

2. Related work

Neural speech synthesis has obtained state-of-the-art results and gained a lot of attention. Several neural TTS systems have been proposed, including WaveNet (van den Oord et al., 2016), Deep Voice (Arık et al., 2017a), Deep Voice 2 (Arık et al., 2017b), Deep Voice 3 (Ping et al., 2018b), Tacotron (Wang et al., 2017), Tacotron 2 (Shen et al., 2018), Char2Wav (Sotelo et al., 2017), VoiceLoop (Taigman et al., 2018), WaveRNN (Kalchbrenner et al., 2018), ClariNet (Ping et al., 2018a), and Transformer TTS (Li et al., 2019). In particular, Deep Voice 3, Tacotron and Char2Wav employ the seq2seq framework with an attention mechanism (Bahdanau et al., 2015), yielding a much simpler pipeline compared to the traditional multi-stage pipeline. Their excellent extensibility leads to promising results for several challenging tasks, such as voice cloning (Arik et al., 2018; Nachmani et al., 2018; Jia et al., 2018; Chen et al., 2019). All of these state-of-the-art systems are based on autoregressive models.

RNN-based autoregressive models, such as Tacotron and WaveRNN (Kalchbrenner et al., 2018), lack parallelism at both training and synthesis. CNN-based autoregressive models, such as Deep Voice 3 and WaveNet, enable parallel processing at training, but they still operate sequentially at synthesis, since each output element must be generated before it can be passed in as input at the next time-step. Recently, several non-autoregressive models have been proposed for neural machine translation. Gu et al. (2018) train a feed-forward neural network conditioned on fertility values, which are obtained from an external alignment system. Kaiser et al. (2018) propose a latent variable model for fast decoding, though it retains autoregressive dependencies among the latent variables. Lee et al. (2018) iteratively refine the output sequence through a denoising autoencoder framework.
Arguably, non-autoregressive models play a more important role in text-to-speech, where the output speech spectrogram usually consists of hundreds of time-steps for a short text input with a few words. Our work is one of the first non-autoregressive seq2seq models for TTS and provides as much as a 46.7 times speed-up at synthesis over its autoregressive counterpart (Ping et al., 2018b). There is a concurrent work, FastSpeech (Ren et al., 2019), which is based on the autoregressive Transformer TTS (Li et al., 2019) and can generate mel spectrogram in parallel. Our ParaNet is fully convolutional and lightweight: compared with FastSpeech, it has half the model parameters, requires a smaller batch size (16 vs. 64) for training, and provides faster speed at synthesis (see Table 2 for a detailed comparison).

Flow-based generative models (Rezende & Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2017; Kingma & Dhariwal, 2018) transform a simple initial distribution into a more complex one by applying a series of invertible transformations. In previous work, flow-based models have obtained state-of-the-art results for parallel waveform synthesis (van den Oord et al., 2018; Ping et al., 2018a; Prenger et al., 2019; Kim et al., 2019; Yamamoto et al., 2019; Ping et al., 2020).

Variational autoencoders (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) have been applied to representation learning of natural speech for years. They model either the generative process of raw waveform (Chung et al., 2015; van den Oord et al., 2017) or of spectrograms (Hsu et al., 2019). In previous work, autoregressive or recurrent neural networks are employed as the decoder of the VAE (Chung et al., 2015; van den Oord et al., 2017), but they can be quite slow at synthesis. In this work, we employ a feed-forward IAF as the decoder, which enables parallel waveform synthesis.

3. Text-to-spectrogram model

Our parallel TTS system has two components: 1) a feed-forward text-to-spectrogram model, and 2) a parallel waveform synthesizer conditioned on mel spectrogram. In this section, we first present an autoregressive model derived from Deep Voice 3 (DV3) (Ping et al., 2018b). We then introduce ParaNet, a non-autoregressive text-to-spectrogram model (see Figure 1).

3.1. Autoregressive architecture

Our autoregressive model is based on DV3, a convolutional text-to-spectrogram architecture, which consists of three components:

Encoder: A convolutional encoder, which takes text inputs and encodes them into an internal hidden representation.

Decoder: A causal convolutional decoder, which decodes the encoder representation with an attention mechanism to log-mel spectrograms in an autoregressive manner with an ℓ1 loss. It starts with a 1×1 convolution to preprocess the input log-mel spectrograms.

Converter: A non-causal convolutional post-processing network, which processes the hidden representation from the decoder using both past and future context information and predicts the log-linear spectrograms with an ℓ1 loss. It enables bidirectional processing.

Figure 1. (a) Autoregressive seq2seq model. The dashed line depicts the autoregressive decoding of mel spectrogram at inference. (b) Non-autoregressive ParaNet model, which distills the attention from a pretrained autoregressive model.

Figure 2. (a) Architecture of ParaNet. Its encoder provides key and value as the textual representation. The first attention block in the decoder gets positional encoding as the query and is followed by non-causal convolution blocks and attention blocks. (b) The convolution block appears in both encoder and decoder. It consists of a 1-D convolution with a gated linear unit (GLU) and a residual connection.

All these components use the same 1-D convolution block with a gated linear unit as in DV3 (see Figure 2(b) for more details).
The major difference between our model and DV3 is the decoder architecture. The decoder of DV3 has multiple attention-based layers, where each layer consists of a causal convolution block followed by an attention block. To simplify the attention distillation described in Section 3.3.1, our autoregressive decoder has only one attention block at its first layer. We find that reducing the number of attention blocks does not hurt the generated speech quality in general.

3.2. Non-autoregressive architecture

The proposed ParaNet (see Figure 2) uses the same encoder architecture as the autoregressive model. The decoder of ParaNet, conditioned solely on the hidden representation from the encoder, predicts the entire sequence of log-mel spectrograms in a feed-forward manner. As a result, both its training and synthesis can be done in parallel. Specifically, we make the following major architecture modifications from the autoregressive text-to-spectrogram model to the non-autoregressive model:

1. Non-autoregressive decoder: Without the autoregressive generative constraint, the decoder can use non-causal convolution blocks to take advantage of future context information and to improve model performance. In addition to log-mel spectrograms, it also predicts log-linear spectrograms with an ℓ1 loss for slightly better performance. We also remove the 1×1 convolution at the beginning, because the decoder does not take log-mel spectrograms as input.

2. No converter: The non-autoregressive model removes the non-causal converter, since it already employs a non-causal decoder. Note that the major motivation for introducing the non-causal converter in DV3 is to refine the decoder predictions based on bidirectional context information provided by non-causal convolutions.

3.3. Parallel attention mechanism

It is challenging for the feed-forward model to learn the accurate alignment between the input text and output spectrogram. In particular, we need full parallelism within the attention mechanism. For example, the location-sensitive attention (Chorowski et al., 2015; Shen et al., 2018) improves attention stability, but it performs sequentially at both training and synthesis, because it uses the cumulative attention weights from previous decoder time-steps as an additional feature for the next time-step. Previous non-autoregressive decoders rely on an external alignment system (Gu et al., 2018) or an autoregressive latent variable model (Kaiser et al., 2018).

Figure 3. Our ParaNet iteratively refines the attention alignment in a layer-by-layer way. One can see the 1st-layer attention is mostly dominated by the positional encoding prior. It becomes more and more confident about the alignment in the subsequent layers.

In this work, we present several simple and effective techniques to obtain accurate and stable attention alignment. In particular, our non-autoregressive decoder can iteratively refine the attention alignment between text and mel spectrogram in a layer-by-layer manner, as illustrated in Figure 3. Specifically, the decoder adopts a dot-product attention mechanism and consists of K attention blocks (see Figure 2(a)), where each attention block uses the per-time-step query vectors from the convolution block and per-time-step key vectors from the encoder to compute the attention weights (Ping et al., 2018b). The attention block computes context vectors as the weighted average of the value vectors from the encoder. The non-autoregressive decoder starts with an attention block, in which the query vectors are solely positional encoding (see Section 3.3.2 for details). The first attention block then provides the input for the convolution block at the next attention-based layer.
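To make the data flow concrete, here is a minimal PyTorch sketch of one such attention block under stated assumptions: sinusoidal positional encodings with a position rate ω_s (Section 3.3.2) are added to query and key, attention weights come from a dot product followed by a softmax, and the context is the weighted average of the encoder values. The projection layers, the 1/√d scaling, and the default ω_s values are illustrative choices, not the released implementation.

```python
import math
import torch
from torch import nn
import torch.nn.functional as F

def positional_encoding(length, channels, w_s=1.0):
    """Sinusoidal encoding h_p(i, k) with position rate w_s (see Section 3.3.2)."""
    i = torch.arange(length, dtype=torch.float32).unsqueeze(1)        # time-steps (T, 1)
    k = torch.arange(channels, dtype=torch.float32).unsqueeze(0)      # channels   (1, d)
    angles = w_s * i / torch.pow(10000.0, k / channels)               # (T, d)
    pe = torch.zeros(length, channels)
    pe[0::2] = torch.sin(angles[0::2])   # even time-step indices, per the text
    pe[1::2] = torch.cos(angles[1::2])   # odd time-step indices
    return pe

class AttentionBlock(nn.Module):
    """Dot-product attention: queries from the decoder convolution block,
    keys/values from the encoder, positional encodings added to query and key."""
    def __init__(self, channels):
        super().__init__()
        self.query_proj = nn.Linear(channels, channels)
        self.key_proj = nn.Linear(channels, channels)
        self.value_proj = nn.Linear(channels, channels)
        self.out_proj = nn.Linear(channels, channels)

    def forward(self, query, keys, values, w_s_key=6.3 / 4, mask=None):
        # query: (B, T_dec, d); keys, values: (B, T_enc, d)
        d = query.size(-1)
        pe_q = positional_encoding(query.size(1), d, w_s=1.0).to(query.device)
        pe_k = positional_encoding(keys.size(1), d, w_s=w_s_key).to(keys.device)
        q = self.query_proj(query + pe_q)
        k = self.key_proj(keys + pe_k)
        v = self.value_proj(values)
        scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(d)       # (B, T_dec, T_enc)
        if mask is not None:                                          # window mask, Section 3.3.3
            scores = scores.masked_fill(~mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        context = torch.bmm(attn, v)                                  # weighted average of values
        return self.out_proj(context), attn
```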
3.3.1. Attention distillation

We use the attention alignments from a pretrained autoregressive model to guide the training of the non-autoregressive model. Specifically, we minimize the cross entropy between the attention distributions from the non-autoregressive ParaNet and a pretrained autoregressive teacher. We denote the attention weights from the non-autoregressive ParaNet as W^(k)_{i,j}, where i and j index the time-steps of the encoder and decoder respectively, and k refers to the k-th attention block within the decoder. Note that the attention weights {W^(k)_{i,j}}_{i=1}^{M} form a valid distribution. We compute the attention loss as the average cross entropy between the ParaNet and teacher's attention distributions:

$$\ell_{\text{attn}} = -\frac{1}{KN} \sum_{k=1}^{K} \sum_{j=1}^{N} \sum_{i=1}^{M} W^{t}_{i,j} \log W^{(k)}_{i,j}, \qquad (1)$$

where W^t_{i,j} are the attention weights from the autoregressive teacher, and M and N are the lengths of the encoder and decoder, respectively. Our final loss function is a linear combination of ℓ_attn and the ℓ1 losses from the spectrogram predictions. We set the coefficient of ℓ_attn to 4 and the other coefficients to 1 in all experiments.
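A minimal PyTorch sketch of this attention loss is given below, assuming the attention matrices are already row-normalized; the tensor layout and the epsilon for numerical stability are assumptions.

```python
import torch

def attention_distillation_loss(student_attns, teacher_attn, eps=1e-8):
    """student_attns: list of K tensors, each (T_dec, T_enc), rows summing to 1.
    teacher_attn: (T_dec, T_enc) attention from the autoregressive teacher.
    Returns the cross entropy of Eq. (1), averaged over blocks and decoder steps."""
    K = len(student_attns)
    N = teacher_attn.size(0)                     # decoder length
    loss = teacher_attn.new_zeros(())
    for W_k in student_attns:
        # cross entropy: -sum_{i,j} W^t_{i,j} * log W^{(k)}_{i,j}
        loss = loss - (teacher_attn * torch.log(W_k + eps)).sum()
    return loss / (K * N)
```

The total training loss would then be 4 times this attention loss plus the ℓ1 spectrogram losses, following the coefficients above.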
3.3.2. Positional encoding

We use a similar positional encoding as in DV3 at every attention block (Ping et al., 2018b). The positional encoding is added to both the key and query vectors in the attention block, which forms an inductive bias for monotonic attention. Note that the non-autoregressive model relies solely on its attention mechanism to decode mel spectrograms from the encoded textual features, without any autoregressive input. This makes the positional encoding even more crucial in guiding the attention to follow a monotonic progression over time at the beginning of training. The positional encodings are h_p(i, k) = sin(ω_s i / 10000^{k/d}) (for even i) and cos(ω_s i / 10000^{k/d}) (for odd i), where i is the time-step index, k is the channel index, d is the total number of channels in the positional encoding, and ω_s is the position rate, which indicates the average slope of the line in the attention distribution and roughly corresponds to the speed of speech. We set ω_s in the following ways:

For the autoregressive teacher, ω_s is set to one for the positional encoding of the query. For the key, it is set to the averaged ratio of the time-steps of spectrograms to the time-steps of textual features, which is around 6.3 across our training dataset. Taking into account that a reduction factor of 4 is used to simplify the learning of the attention mechanism (Wang et al., 2017), ω_s is simply set to 6.3/4 for the key at both training and synthesis.

For ParaNet, ω_s is also set to one for the query, while ω_s for the key is calculated differently. At training, ω_s is set to the ratio of the lengths of spectrograms and text for each individual training instance, also divided by the reduction factor of 4. At synthesis, we need to specify the length of the output spectrogram and the corresponding ω_s, which actually controls the speech rate of the generated audio (see Section II on the demo website). In all of our experiments, we simply set ω_s to 6.3/4 as in the autoregressive model, and the length of the output spectrogram to 6.3/4 times the length of the input text. Such a setup yields an initial attention in the form of a diagonal line and guides the non-autoregressive decoder to refine its attention layer by layer (see Figure 3).

3.3.3. Attention masking

Inspired by the attention masking in Deep Voice 3, we propose an attention masking scheme for the non-autoregressive ParaNet at synthesis: for each query from the decoder, instead of computing the softmax over the entire set of encoder key vectors, we compute the softmax only over a fixed window centered around the target position and extending forward and backward by several time-steps (e.g., 3). The target position is calculated as ⌊i_query × 4/6.3⌉, where i_query is the time-step index of the query vector and ⌊·⌉ is the rounding operator. We observe that this strategy reduces serious attention errors such as repeating or skipping words, and also yields clearer pronunciation, thanks to its more condensed attention distribution. Note that this attention masking is shared across all attention blocks once it is generated, and does not prevent the parallel synthesis of the non-autoregressive model.
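The following is a minimal sketch of how such a mask could be constructed, assuming the 4/6.3 spectrogram-to-text ratio from above and a ±3-step window; the function name and tensor layout are illustrative, not the authors' code.

```python
import torch

def build_attention_mask(dec_len, enc_len, window=3, ratio=4.0 / 6.3):
    """Boolean mask of shape (T_dec, T_enc): True where attention is allowed."""
    i_query = torch.arange(dec_len).unsqueeze(1)            # (T_dec, 1)
    i_key = torch.arange(enc_len).unsqueeze(0)              # (1, T_enc)
    target = torch.round(i_query.float() * ratio).long()    # expected encoder position
    return (i_key - target).abs() <= window

# Usage (before the softmax, shared across all attention blocks):
#   mask = build_attention_mask(T_dec, T_enc)
#   scores = scores.masked_fill(~mask, float("-inf"))
```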
4. Parallel waveform model

As an indispensable component in our parallel neural TTS system, the parallel waveform model converts the mel spectrogram predicted by ParaNet into the raw waveform. In this section, we discuss several existing parallel waveform models, and explore a new alternative in the system.

4.1. Flow-based waveform models

Inverse autoregressive flow (IAF) (Kingma et al., 2016) is a special type of normalizing flow where each invertible transformation is based on an autoregressive neural network. IAF performs synthesis in parallel and can easily reuse expressive autoregressive architectures, such as WaveNet (van den Oord et al., 2016), which leads to state-of-the-art results for speech synthesis (van den Oord et al., 2018; Ping et al., 2018a). However, the likelihood evaluation in IAF is autoregressive and slow, so previous training methods rely on probability density distillation from a pretrained autoregressive WaveNet. This two-stage distillation process complicates the training pipeline and may introduce pathological optimization (Huang et al., 2019).

RealNVP (Dinh et al., 2017) and Glow (Kingma & Dhariwal, 2018) are different types of normalizing flows, where both synthesis and likelihood evaluation can be performed in parallel by enforcing bipartite architecture constraints. Most recently, both of them were applied as parallel neural vocoders and can be trained from scratch (Prenger et al., 2019; Kim et al., 2019). However, these models are less expressive than their autoregressive and IAF counterparts; one can find a detailed analysis in the WaveFlow paper (Ping et al., 2020). In general, these bipartite flows require a larger number of layers and hidden units, which leads to a huge number of parameters. For example, a WaveGlow vocoder (Prenger et al., 2019) has 87.88M parameters, whereas the IAF vocoder has a much smaller footprint with only 2.17M parameters (Ping et al., 2018a), making it preferable for production deployment.

4.2. WaveVAE

Given the advantages of the IAF vocoder, it is interesting to investigate whether it can be trained without density distillation. One related work trains an IAF within an autoencoder (Huang et al., 2019). Our method uses the VAE framework, thus it is termed WaveVAE. In contrast to van den Oord et al. (2018) and Ping et al. (2018a), WaveVAE can be trained from scratch by jointly optimizing the encoder qφ(z|x, c) and decoder pθ(x|z, c), where z denotes the latent variables and c is the mel spectrogram conditioner. We omit c for concise notation hereafter.

4.2.1. Encoder

The encoder of WaveVAE, qφ(z|x), is parameterized by a Gaussian autoregressive WaveNet (Ping et al., 2018a) that maps the ground-truth audio x into a latent representation z of the same length. Specifically, the Gaussian WaveNet models x_t given the previous samples x_{<t} as x_t ∼ N(µ(x_{<t}; φ), σ(x_{<t}; φ)), and the posterior is defined as q(z_t | x_{≤t}) = N((x_t − µ(x_{<t}; φ)) / σ(x_{<t}; φ), ε), where the trainable scalar ε > 0 decouples the global variation, which will make the optimization process easier. Given the observed x, qφ(z|x) admits parallel sampling of the latents z. One can build the connection between the encoder of WaveVAE and the teacher model of ClariNet, as both of them use a Gaussian WaveNet to guide the training of the IAF for parallel wave generation.

4.2.2. Decoder

Our decoder pθ(x|z) is parameterized by the one-step-ahead predictions from an IAF (Ping et al., 2018a). We let z^(0) = z and apply a stack of IAF transformations from z^(0) ... z^(i) ... z^(n), where each transformation z^(i) = f(z^(i−1); θ) is defined as

$$z^{(i)} = z^{(i-1)} \cdot \sigma^{(i)} + \mu^{(i)}, \qquad (2)$$

where the shift µ^(i)_t = µ(z^(i−1)_{<t}; θ) and scale σ^(i)_t = σ(z^(i−1)_{<t}; θ) are computed from the previous variables z^(i−1)_{<t} by an autoregressive WaveNet. The composed transformation has

$$\sigma_{\text{tot}} = \prod_{i=1}^{n} \sigma^{(i)}, \qquad \mu_{\text{tot}} = \sum_{i=1}^{n} \mu^{(i)} \prod_{j>i}^{n} \sigma^{(j)}. \qquad (3)$$

Lastly, we set x = ϵ · σ_tot + µ_tot, where ϵ ∼ N(0, I). Thus, pθ(x | z) = N(µ_tot, σ_tot). For the generative process, we use the standard Gaussian prior p(z) = N(0, I).

4.2.3. Training objective

We maximize the evidence lower bound (ELBO) for the observed x in the VAE:

$$\max_{\phi, \theta} \; \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \text{KL}\big(q_\phi(z|x) \,\|\, p(z)\big), \qquad (4)$$

where the KL divergence can be calculated in closed form, as both qφ(z|x) and p(z) are Gaussians:

$$\text{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) = \sum_t \log \frac{1}{\varepsilon} + \frac{1}{2}\left(\varepsilon^2 - 1 + \left(\frac{x_t - \mu(x_{<t})}{\sigma(x_{<t})}\right)^2\right).$$
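To illustrate Eqs. (2)–(3) and the parallel sampling step concretely, here is a minimal PyTorch sketch of the affine IAF composition, assuming the per-flow shifts µ^(i) and scales σ^(i) have already been produced by running each flow's conditioning network on z^(i−1); function names and tensor shapes are illustrative, not the authors' implementation.

```python
import torch

def compose_iaf(mus, sigmas):
    """mus, sigmas: lists of n per-flow shift/scale tensors, each (batch, time).
    Returns (mu_tot, sigma_tot) of the composed affine transformation, Eq. (3)."""
    mu_tot = torch.zeros_like(mus[0])
    sigma_tot = torch.ones_like(sigmas[0])
    for mu_i, sigma_i in zip(mus, sigmas):
        # z^(i) = z^(i-1) * sigma^(i) + mu^(i)  (Eq. 2), composed step by step:
        mu_tot = mu_tot * sigma_i + mu_i
        sigma_tot = sigma_tot * sigma_i
    return mu_tot, sigma_tot

def sample_waveform(mus, sigmas):
    """Parallel sampling x = eps * sigma_tot + mu_tot with eps ~ N(0, I),
    i.e. p_theta(x | z) = N(mu_tot, sigma_tot)."""
    mu_tot, sigma_tot = compose_iaf(mus, sigmas)
    return torch.randn_like(mu_tot) * sigma_tot + mu_tot
```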