# EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

Chenfeng Miao, Shuang Liang, Zhengchen Liu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao (Ping An Technology; correspondence to Chenfeng Miao). Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

In this work, we address the Text-to-Speech (TTS) task by proposing a non-autoregressive architecture called EfficientTTS. Unlike the dominant non-autoregressive TTS models, which require external aligners for training, EfficientTTS optimizes all of its parameters with a stable, end-to-end training procedure, allowing high-quality speech to be synthesized in a fast and efficient manner. EfficientTTS is motivated by a new monotonic alignment modeling approach, which imposes monotonic constraints on the sequence alignment with almost no increase in computation. By combining EfficientTTS with different feed-forward network structures, we develop a family of TTS models, including both text-to-melspectrogram and text-to-waveform networks. We show experimentally that the proposed models significantly outperform counterpart models such as Tacotron 2 (Shen et al., 2018) and Glow-TTS (Kim et al., 2020) in terms of speech quality, training efficiency and synthesis speed, while still producing speech with strong robustness and great diversity. In addition, we demonstrate that the proposed approach can easily be extended to autoregressive models such as Tacotron 2.

## 1. Introduction

Text-to-Speech (TTS) is an important task in speech processing. With rapid progress in deep learning, TTS technology has received widespread attention in recent years. The most popular neural TTS models are autoregressive models based on an encoder-decoder framework (Wang et al., 2017; Shen et al., 2018; Ping et al., 2018; 2019; Li et al., 2019; Valle et al., 2021). In this framework, the encoder takes the text sequence as input and learns its hidden representation, while the decoder generates the outputs frame by frame, i.e., in an autoregressive manner. As the performance of autoregressive models has been substantially improved, synthesis efficiency has become a new research focus. Recently, significant efforts have been dedicated to the development of non-autoregressive TTS models (Ren et al., 2019; 2021; Miao et al., 2020; Peng et al., 2020). However, most non-autoregressive TTS models suffer from complex training procedures, high computational cost or long training time, making them ill-suited for real-world applications. In this work, we propose EfficientTTS, an efficient and high-quality text-to-speech architecture. Our contributions are summarized as follows:

- We propose a novel approach to produce soft or hard monotonic alignments for sequence-to-sequence models. By constraining the sequence alignment to be monotonic, the proposed approach extends the vanilla attention mechanism with almost no increase in computation. Most importantly, the proposed approach can be incorporated into any attention mechanism without constraints on network structures.
- We propose EfficientTTS, a non-autoregressive architecture that performs high-quality speech generation from a text sequence without additional aligners. EfficientTTS is fully parallel and trained end-to-end, and is therefore quite efficient for both training and inference.
- We develop a family of TTS models based on EfficientTTS, including: (1) EFTS-CNN, a convolutional model that learns melspectrogram generation with high training efficiency; (2) EFTS-Flow, a flow-based model that enables parallel melspectrogram generation with controllable speech variation; and (3) EFTS-Wav, a fully end-to-end model that learns waveform generation directly from the text sequence.
- We show experimentally that the proposed models achieve significant improvements in speech quality, synthesis speed and training efficiency in comparison with the counterpart models Tacotron 2 and Glow-TTS.
- We show that the proposed approach can easily be extended to autoregressive models such as Tacotron 2.

The rest of the paper is structured as follows. Section 2 discusses related work on non-autoregressive TTS models and monotonic alignment modeling, respectively. We introduce monotonic alignment modeling using the index mapping vector in Section 3. The EfficientTTS architecture is introduced in Section 4. Section 5 presents experimental results. Finally, Section 6 concludes the paper.

## 2. Related Work

### 2.1. Non-Autoregressive TTS Models

In TTS tasks, an input text sequence $x = [x_0, x_1, \ldots, x_{T_1-1}]$ is transduced to an output sequence $y = [y_0, y_1, \ldots, y_{T_2-1}]$ through an encoder-decoder framework (Bahdanau et al., 2015). Typically, $x$ is first converted to a sequence of hidden states $h = [h_0, h_1, \ldots, h_{T_1-1}]$ through an encoder $f$: $h = f(x)$, and then passed through a decoder to produce the output $y$. For each output timestep, an attention mechanism searches over all elements of $h$ to generate a context vector $c$:

$$c_j = \sum_{i=0}^{T_1-1} \alpha_{i,j}\, h_i, \qquad (1)$$

where $\alpha \in \mathbb{R}^{T_1 \times T_2}$ is the alignment matrix. $c$ is then fed to another network $g$ to generate the output $y$: $y = g(c)$. The networks $f$ and $g$ can easily be replaced with parallel structures because both have inputs and outputs of consistent lengths. Therefore, the key to building a non-autoregressive TTS model lies in parallel alignment prediction. In previous works, ParaNet (Peng et al., 2020) and FastSpeech (Ren et al., 2019) learn the sequence alignment through distillation from autoregressive models such as Deep Voice 3 (Ping et al., 2018) and Transformer TTS (Li et al., 2019). FastSpeech 2 (Ren et al., 2021) improves the two-stage training of FastSpeech by introducing an external forced aligner (McAuliffe et al., 2017), but forced alignment itself requires unsupervised training. Flow-TTS (Miao et al., 2020) is a flow-based non-autoregressive model whose alignment is predicted from the text sequence only, which is not reliable and results in unstable training. A similar limitation is encountered by EATS (Donahue et al., 2021). Although EATS alleviates this limitation by introducing dynamic time warping (DTW) (Sakoe, 1971; Sakoe & Chiba, 1978), its training is still expensive. Unlike most TTS models, which require training a neural vocoder (van den Oord et al., 2016; Prenger et al., 2019; Valin & Skoglund, 2019) to produce waveforms from the generated melspectrograms, EATS directly produces waveforms from text sequences in an end-to-end manner, without relying on intermediate representations such as melspectrograms or linguistic features. Similar end-to-end models include FastSpeech 2s (Ren et al., 2021) and Wave-Tacotron (Weiss et al., 2021). FastSpeech 2s requires external aligners, while Wave-Tacotron is autoregressive.
Probably the most comparable model to EfficientTTS is Glow-TTS (Kim et al., 2020). Glow-TTS is another flow-based non-autoregressive model without an external aligner; it extracts the duration of each input token using an independent algorithm that precludes the use of standard back-propagation. Unlike the aforementioned models, EfficientTTS jointly learns sequence alignment and speech generation through a single network, while still maintaining stable training.

*Figure 1. Schematics of the monotonic alignment. Each node $\alpha_{i,j}$ represents the probability (shown as the shade of gray) that output timestep $y_j$ (horizontal axis) attends to input token $x_i$ (vertical axis). At each output timestep, monotonic attention either moves forward to the next token or stays unmoved.*

### 2.2. Monotonic Alignment Modeling

As noted in Section 2.1, a general attention mechanism inspects every input step at every output timestep. Such a mechanism often encounters misalignment and is quite costly to train, especially for long sequences. It is therefore helpful to incorporate some prior knowledge. In general, as shown in Fig. 1, the alignment should follow strict criteria: (1) Monotonicity: at each output timestep, the aligned position never rewinds; (2) Continuity: at each output timestep, the aligned position moves forward by at most one step; (3) Completeness: the aligned positions must cover all positions of the input tokens. Many prior studies have been proposed to ensure correct alignments (Li et al., 2020; Chiu & Raffel, 2018; Raffel et al., 2017), but most of them require sequential steps and often fail to meet all the criteria mentioned above. In this work, we propose a novel approach to produce monotonic alignments effectively and efficiently.

## 3. Monotonic Alignment Modeling Using IMV

We start this section by proposing the index mapping vector (IMV), and then leverage the IMV for monotonic alignment modeling. We further show how to incorporate the IMV into a general sequence-to-sequence model.

### 3.1. Definition of IMV

Let $\alpha \in \mathbb{R}^{T_1 \times T_2}$ be the alignment matrix between the input sequence $x \in \mathbb{R}^{T_1}$ and the output sequence $y \in \mathbb{R}^{T_2}$. We define the index mapping vector (IMV) $\pi$ as the sum of the index vector $p = [0, 1, \ldots, T_1-1]$, weighted by $\alpha$:

$$\pi_j = \sum_{i=0}^{T_1-1} \alpha_{i,j}\, p_i, \qquad (2)$$

where $0 \le j \le T_2-1$, $\pi \in \mathbb{R}^{T_2}$, and $\sum_{i=0}^{T_1-1} \alpha_{i,j} = 1$. We can understand the IMV as the expected input location for each output timestep, where the expectation is taken over all possible input locations ranging from $0$ to $T_1-1$.

### 3.2. Monotonic Alignment Modeling Using IMV

**Continuity and Monotonicity.** We first show that if the alignment matrix $\alpha$ meets the continuity and monotonicity criteria, then:

$$0 \le \Delta\pi_i \le 1, \qquad (3)$$

where $\Delta\pi_i = \pi_i - \pi_{i-1}$, $1 \le i \le T_2-1$. A detailed verification is given in Appendix A.

**Completeness.** Given that $\pi$ is continuous and monotonic across timesteps, the completeness criterion is equivalent to the following boundary constraints:

$$\pi_0 = 0, \qquad (4)$$

$$\pi_{T_2-1} = T_1 - 1. \qquad (5)$$

This can be deduced from $\alpha_0 = [1, 0, \ldots, 0]$ and $\alpha_{T_2-1} = [0, 0, \ldots, 1]$.
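To make the definition concrete, the following is a minimal NumPy sketch (ours, not from the paper) that computes the IMV of Eq. (2) from an alignment matrix and checks the constraints of Eqs. (3-5); the function names and tolerance are illustrative.

```python
import numpy as np

def compute_imv(alpha: np.ndarray) -> np.ndarray:
    """Eq. (2): expected input index for every output timestep.

    alpha: alignment matrix of shape (T1, T2); each column sums to 1.
    Returns pi of shape (T2,).
    """
    T1, _ = alpha.shape
    p = np.arange(T1, dtype=alpha.dtype)   # index vector [0, 1, ..., T1-1]
    return alpha.T @ p                     # pi_j = sum_i alpha[i, j] * p[i]

def satisfies_monotonic_constraints(pi: np.ndarray, T1: int, tol: float = 1e-6) -> bool:
    """Check Eqs. (3)-(5): 0 <= d_pi <= 1, pi[0] = 0, pi[-1] = T1 - 1."""
    d_pi = np.diff(pi)
    return (
        np.all(d_pi >= -tol) and np.all(d_pi <= 1 + tol)  # Eq. (3)
        and abs(pi[0]) <= tol                             # Eq. (4)
        and abs(pi[-1] - (T1 - 1)) <= tol                 # Eq. (5)
    )

# Example: a hard, strictly diagonal alignment satisfies all three constraints.
T1, T2 = 4, 8
alpha = np.zeros((T1, T2))
alpha[np.minimum((np.arange(T2) * T1) // T2, T1 - 1), np.arange(T2)] = 1.0
pi = compute_imv(alpha)
print(pi, satisfies_monotonic_constraints(pi, T1))
```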
### 3.3. Incorporating IMV into Networks

We propose two strategies for incorporating the IMV into sequence-to-sequence networks: Soft Monotonic Alignment (SMA) and Hard Monotonic Alignment (HMA).

**Soft Monotonic Alignment (SMA).** To let sequence-to-sequence models be trained under the constraints given by Eqs. (3-5), a natural idea is to turn these constraints into training objectives. We formulate these constraints as an SMA loss, which is computed as:

$$\mathcal{L}_{SMA} = \lambda_0 \big\| |\Delta\pi| - \Delta\pi \big\|_1 + \lambda_1 \big\| |\Delta\pi - 1| + (\Delta\pi - 1) \big\|_1 + \lambda_2 \left( \frac{\pi_0}{T_1-1} \right)^2 + \lambda_3 \left( \frac{\pi_{T_2-1}}{T_1-1} - 1 \right)^2, \qquad (6)$$

where $\|\cdot\|_1$ is the $\ell_1$ norm and $\lambda_0, \lambda_1, \lambda_2, \lambda_3$ are positive coefficients. As can be seen, $\mathcal{L}_{SMA}$ is non-negative, and it is zero only if $\pi$ satisfies all the constraints. Computing $\mathcal{L}_{SMA}$ requires only the alignment matrix $\alpha$ (the index vector $p$ is always known); it is therefore easy to incorporate the SMA loss into sequence-to-sequence networks without changing their network structures. In general, SMA plays a similar role to Guided Attention (Tachibana et al., 2018), which speeds up model convergence and improves robustness. However, SMA outperforms Guided Attention because it provides theoretically more accurate constraints on the alignments.

**Hard Monotonic Alignment (HMA).** While SMA allows sequence-to-sequence networks to produce monotonic alignments through an auxiliary loss, training such networks may remain costly because they cannot produce monotonic alignments in the early phase of training; instead, they learn this ability step by step. To address this limitation, we propose another monotonic strategy, which we call HMA, for Hard Monotonic Alignment. The core idea of HMA is to build a network with a strategically designed structure, allowing it to produce monotonic alignments without supervision. First, we compute the IMV $\pi$ from the alignment matrix $\alpha$ according to Eq. (2). Although $\pi$ is not necessarily monotonic, it is then transformed into a strictly monotonic IMV by enforcing $\Delta\pi \ge 0$ with a ReLU activation:

$$\Delta\pi_j = \pi_j - \pi_{j-1}, \quad 0 < j \le T_2-1, \qquad (7)$$

$$\Delta\pi'_j = \mathrm{ReLU}(\Delta\pi_j), \quad 0 < j \le T_2-1, \qquad (8)$$

$$\pi'_j = \sum_{m=0}^{j} \Delta\pi'_m, \quad 0 \le j \le T_2-1, \quad \Delta\pi'_0 = 0. \qquad (9)$$

Furthermore, to restrict the domain of $\pi'$ to the interval $[0, T_1-1]$ as required by Eqs. (4-5), we multiply $\pi'$ by a positive scalar:

$$\pi''_j = \pi'_j \, \frac{T_1-1}{\max(\pi')} = \pi'_j \, \frac{T_1-1}{\pi'_{T_2-1}}, \qquad (10)$$

where the maximum of $\pi'$ is $\pi'_{T_2-1}$ because $\pi'$ is monotonically increasing across timesteps. Recall that our goal is to construct a monotonic alignment. To achieve this, we introduce the following transformation, which reconstructs the alignment using a Gaussian kernel centered on $\pi''$:

$$\alpha'_{i,j} = \frac{\exp\left(-\sigma^{-2}(p_i - \pi''_j)^2\right)}{\sum_{m=0}^{T_1-1} \exp\left(-\sigma^{-2}(p_m - \pi''_j)^2\right)}, \qquad (11)$$

where $\sigma^2$ is a hyper-parameter representing the alignment variance. $\alpha'$ serves as a replacement of the original alignment $\alpha$. The difference between $\alpha'$ and $\alpha$ is that $\alpha'$ is guaranteed to be monotonic, whereas $\alpha$ has no constraint on monotonicity. HMA reduces the difficulty of learning monotonic alignments and thus improves training efficiency. Like SMA, HMA can be applied to any sequence-to-sequence network.
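The following PyTorch sketch illustrates both strategies under the definitions above; it is our own unbatched illustration (no padding masks), and the coefficients `lambdas` and the kernel width `sigma` are placeholder values rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def sma_loss(alpha, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Soft monotonic alignment loss of Eq. (6).

    alpha: (T1, T2) alignment matrix whose columns sum to 1.
    """
    T1, _ = alpha.shape
    p = torch.arange(T1, dtype=alpha.dtype, device=alpha.device)
    pi = alpha.t() @ p                                   # Eq. (2), shape (T2,)
    d_pi = pi[1:] - pi[:-1]
    l0 = (d_pi.abs() - d_pi).abs().sum()                 # penalizes d_pi < 0
    l1 = ((d_pi - 1).abs() + (d_pi - 1)).abs().sum()     # penalizes d_pi > 1
    l2 = (pi[0] / (T1 - 1)) ** 2                         # pushes pi_0 toward 0
    l3 = (pi[-1] / (T1 - 1) - 1) ** 2                    # pushes pi_{T2-1} toward T1-1
    return lambdas[0] * l0 + lambdas[1] * l1 + lambdas[2] * l2 + lambdas[3] * l3

def hma_alignment(alpha, sigma=0.5):
    """Hard monotonic alignment, Eqs. (7)-(11): returns a monotonic alpha'."""
    T1, _ = alpha.shape
    p = torch.arange(T1, dtype=alpha.dtype, device=alpha.device)
    pi = alpha.t() @ p                                        # Eq. (2)
    d_pi = F.relu(pi[1:] - pi[:-1])                           # Eqs. (7)-(8)
    pi_mono = torch.cat([pi.new_zeros(1), d_pi]).cumsum(0)    # Eq. (9)
    pi_mono = pi_mono * (T1 - 1) / pi_mono[-1].clamp(min=1e-8)  # Eq. (10)
    energies = -sigma ** -2 * (p[:, None] - pi_mono[None, :]) ** 2
    return torch.softmax(energies, dim=0)                     # Eq. (11), softmax over input axis
```

Both functions are differentiable, so the SMA loss can simply be added to an existing training objective, while `hma_alignment` can be dropped in as a replacement for the raw attention weights.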
In the next section, we propose a new TTS architecture, EfficientTTS, which uses the monotonic strategy HMA as its internal aligner. The use of HMA allows the network to be trained under strict monotonic constraints, making EfficientTTS fast and efficient to train.

## 4. EfficientTTS Architecture

The overall architecture of EfficientTTS is shown in Fig. 2. In the training phase, we compute the IMV from the hidden representations of the text sequence and the melspectrogram through an IMV generator. The hidden representations of the text sequence and the melspectrogram are learned by a text-encoder and a mel-encoder, respectively. The IMV is then converted into a 2-dimensional alignment matrix, which is used to generate the time-aligned representation through an alignment reconstruction layer. The time-aligned representation is passed through a decoder that produces the output melspectrograms or waveforms. We concurrently train an aligned position predictor that learns to predict the aligned position of each input text token. In the inference phase, we reconstruct the alignment matrix from the predicted aligned positions. We describe the implementation of each component in the following subsections; more details, including pseudocode, are given in Appendix B.

*Figure 2. Overall model architecture.*

### 4.1. Text-Encoder and Mel-Encoder

We use a text-encoder and a mel-encoder to convert text symbols and melspectrograms into powerful hidden representations, respectively. We follow FastSpeech (Ren et al., 2019) in implementing the text-encoder, which consists of a text embedding layer and a stack of transformer FFT blocks. The transformer FFT structure enables the text-encoder to learn both local and global information, which is very important for alignment prediction. In the implementation of the mel-encoder, we first convert melspectrograms to high-dimensional vectors through a linear projection. The linear projection is followed by a stack of convolutions interspersed with weight normalization, LeakyReLU activations, and residual connections. Note that the mel-encoder is only used in the training phase.

### 4.2. IMV Generator

In order to generate a monotonic IMV in the training phase, we first learn the alignment $\alpha$ between the input and output through scaled dot-product attention (Vaswani et al., 2017) as given in Eq. (12), and then compute the IMV from $\alpha$:

$$\alpha_{i,j} = \frac{\exp\left(D^{-0.5}\, q_j^\top k_i\right)}{\sum_{m=0}^{T_1-1} \exp\left(D^{-0.5}\, q_j^\top k_m\right)}, \qquad (12)$$

where $q$ and $k$ are the outputs of the mel-encoder and text-encoder, and $D$ is the dimensionality of $q$ and $k$. In our preliminary experiments, we follow Eqs. (7-10) to generate the IMV. However, we observe some alignment errors during the training phase, especially for long-sequence generation: the cumulative sum operation in Eq. (9) tends to accumulate alignment errors. To address this limitation, we introduce a bi-directional cumulative sum operation, which allows the model to learn accurate alignments for long sequences. For each timestep $j$, we accumulate $\Delta\pi'$ in both the forward and backward directions as in Eqs. (13-14):

$$\pi^f_j = \sum_{m=0}^{j} \Delta\pi'_m, \qquad \pi^b_j = \sum_{m=j}^{T_2-1} \Delta\pi'_m, \qquad (13)$$

$$\pi_j = \pi^f_j - \pi^b_j. \qquad (14)$$

Because $\pi$ in Eq. (14) is monotonically increasing across timesteps, the minimum and maximum of $\pi$ are $\pi_0$ and $\pi_{T_2-1}$, respectively. Therefore, we can restrict $\pi$ to the interval $[0, T_1-1]$ through a linear transformation:

$$\pi'_j = \frac{\pi_j - \pi_0}{\pi_{T_2-1} - \pi_0}\,(T_1-1). \qquad (15)$$
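The sketch below strings Eqs. (12)-(15) together as a training-time IMV generator; it is an unbatched, unmasked illustration with an added small epsilon for numerical safety, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def imv_generator(k, q):
    """Training-time IMV generator: Eqs. (12)-(15), unbatched sketch.

    k: text-encoder outputs, shape (T1, D)
    q: mel-encoder outputs, shape (T2, D)
    Returns the normalized monotonic IMV of shape (T2,).
    """
    T1, D = k.shape
    # Eq. (12): scaled dot-product attention, normalized over the input axis.
    alpha = torch.softmax(q @ k.t() / D ** 0.5, dim=-1)     # (T2, T1)
    p = torch.arange(T1, dtype=k.dtype, device=k.device)
    pi = alpha @ p                                          # Eq. (2), (T2,)
    # Eqs. (7)-(8): non-negative increments.
    d_pi = F.relu(pi[1:] - pi[:-1])
    d_pi = torch.cat([d_pi.new_zeros(1), d_pi])             # prepend delta_0 = 0
    # Eq. (13): forward and backward cumulative sums.
    pi_f = torch.cumsum(d_pi, dim=0)
    pi_b = torch.flip(torch.cumsum(torch.flip(d_pi, [0]), dim=0), [0])
    pi = pi_f - pi_b                                        # Eq. (14), monotonic
    # Eq. (15): linear rescaling to [0, T1 - 1].
    return (pi - pi[0]) / (pi[-1] - pi[0] + 1e-8) * (T1 - 1)
```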
### 4.3. Aligned Position Predictor

In the inference phase, the model needs to predict the IMV $\pi$ from the hidden representation of the text sequence $h$, which is challenging in practice. There are two limitations: (1) $\pi$ is time-aligned and therefore in high resolution, whereas $h$ is in low resolution; (2) each prediction of $\pi_i$ affects the later predictions $\pi_j$ ($j > i$) due to the cumulative sum operation introduced in Eq. (9), making it difficult to predict $\pi$ in parallel. Fortunately, these limitations can be alleviated by instead predicting the aligned position $e$ of each input token. Let $q = [0, 1, \ldots, T_2-1]$ be the index vector of $\pi$. We define the transformation $m(\cdot)$ as the mapping between $\pi$ and $q$: $\pi = m(q)$. Since $\pi$ is monotonically increasing with respect to $q$, $m(\cdot)$ is a monotonic transformation and therefore invertible:

$$q = m^{-1}(\pi). \qquad (16)$$

The aligned position $e$ in output timesteps for each input token can be computed as $e = m^{-1}(p)$, with $p = [0, 1, \ldots, T_1-1]$. We illustrate the relations of $m(\cdot)$, $e$, and $\pi$ in Fig. 3.

*Figure 3. Schematics of $m(\cdot)$. $p$ and $q$ are the index vectors of the input and output sequences, respectively. $\pi$ is the IMV, and $e$ is the aligned position in output timesteps for each input token.*

In order to compute $e$, we first compute a probability density matrix $\gamma$ using a transformation similar to Eq. (11); the only difference is that the probability density is computed along the other dimension:

$$\gamma_{i,j} = \frac{\exp\left(-\sigma^{-2}(p_i - \pi_j)^2\right)}{\sum_{n=0}^{T_2-1} \exp\left(-\sigma^{-2}(p_i - \pi_n)^2\right)}. \qquad (17)$$

The aligned position $e$ is then the sum of the output index vector $q$ weighted by $\gamma$:

$$e_i = \sum_{n=0}^{T_2-1} \gamma_{i,n}\, q_n. \qquad (18)$$

As can be seen, the computation of $e$ is differentiable, which allows training with gradient methods; it can therefore be used in both training and inference. Moreover, $e$ is predictable, because: (1) the resolution of $e$ is the same as that of $h$; (2) we can learn the relative positions $\Delta e$ ($\Delta e_i = e_i - e_{i-1}$, $1 \le i \le T_1-1$) instead of directly learning $e$, which overcomes the second limitation. The aligned position predictor consists of 2 convolutions, each followed by layer normalization and a ReLU activation. We regard $e$ computed from $\pi$ as the training target. The loss between the estimated positions $\hat{e}$ and the target $e$ is computed as:

$$\mathcal{L}_{ap} = \big\| \log(\Delta\hat{e} + \epsilon) - \log(\Delta e + \epsilon) \big\|_1, \qquad (19)$$

where $\epsilon$ is a small number to avoid numerical instabilities. The goal of the log-scale loss is to fit small values accurately, which tends to be more important in the later phases of training. The aligned position predictor is learned jointly with the rest of the model. Because we generate alignments from the aligned positions, as a side benefit, EfficientTTS inherits the ability to control speech rate, like duration-based non-autoregressive TTS models.

### 4.4. Alignment Reconstruction

In order to map the input hidden representations $h$ to time-aligned representations, an alignment matrix is needed for both training and inference. We can construct the alignment either from the IMV $\pi$ or from the aligned positions $e$. In most situations, Eq. (11) is an effective way to reconstruct the alignment matrix from $\pi$. But because we have to use the aligned positions rather than $\pi$ during inference, for consistency we reconstruct the alignment matrix from the aligned positions $e$ during training as well. Specifically, we take the aligned positions $e$ computed from Eq. (18) for training, and the positions predicted by the aligned position predictor for inference. We follow a similar idea to EATS (Donahue et al., 2021) in reconstructing the alignment matrix $\alpha'$ by introducing a Gaussian kernel centered on the aligned positions $e$:

$$\alpha'_{i,j} = \frac{\exp\left(-\sigma^{-2}(e_i - q_j)^2\right)}{\sum_{m=0}^{T_1-1} \exp\left(-\sigma^{-2}(e_m - q_j)^2\right)}, \qquad (20)$$

where $q$ is the index vector of the output sequence. The length of the output sequence $T_2$ is known during training and computed from $e$ during inference:

$$T_2 = e_{T_1-1} + \eta\, \Delta e_{T_1-1}, \qquad (21)$$

where $\eta$ is a hyper-parameter, which we set to 1.2 for all experiments. Although the reconstructed alignment may not be as accurate as the one computed by Eq. (11) (due to the low resolution of $e$), the effect on the output is small because the network is able to compensate. As a result, we enjoy an improvement in speech quality from the increased consistency between training and inference. We map the output of the text-encoder $h$ to a time-aligned representation using $\alpha'$, following Eq. (1). The time-aligned representation is then fed as input to the decoder.
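As a reference, a minimal sketch (ours) of Eqs. (17)-(21) follows. It is unbatched, `sigma` and the rounding-up of the predicted length are our own illustrative choices, and no padding handling is shown.

```python
import torch

def aligned_positions(pi, T1, sigma=0.5):
    """Eqs. (17)-(18): expected output position e_i of each input token.

    pi: monotonic IMV of shape (T2,); T1: number of input tokens.
    """
    T2 = pi.shape[0]
    p = torch.arange(T1, dtype=pi.dtype, device=pi.device)
    q = torch.arange(T2, dtype=pi.dtype, device=pi.device)
    energies = -sigma ** -2 * (p[:, None] - pi[None, :]) ** 2   # (T1, T2)
    gamma = torch.softmax(energies, dim=1)                      # Eq. (17), over output axis
    return gamma @ q                                            # Eq. (18), shape (T1,)

def reconstruct_alignment(e, T2, sigma=0.5):
    """Eq. (20): alignment matrix of shape (T1, T2) from aligned positions e."""
    q = torch.arange(T2, dtype=e.dtype, device=e.device)
    energies = -sigma ** -2 * (e[:, None] - q[None, :]) ** 2    # (T1, T2)
    return torch.softmax(energies, dim=0)                       # normalized over the input axis

def predicted_length(e, eta=1.2):
    """Eq. (21): output length at inference; rounding up is our choice."""
    return int(torch.ceil(e[-1] + eta * (e[-1] - e[-2])).item())
```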
### 4.5. Decoder

Since both the input and the output of the decoder are time-aligned, it is easy to implement the decoder with parallel structures. We develop three models based on EfficientTTS, as shown in Fig. 4. We give a brief introduction in this section and provide more implementation details in Appendix B.

*Figure 4. EfficientTTS families. From left to right: EFTS-Flow, EFTS-CNN and EFTS-Wav.*

**EFTS-CNN.** We first parameterize the decoder with a stack of convolutions. Mean square error (MSE) is used as the reconstruction loss.

**EFTS-Flow.** To give TTS models the ability to control the variation of generated speech, we implement a flow-based decoder. In the training phase, we learn a transformation $f$ from the melspectrogram to a high-dimensional Gaussian distribution $N(0, 1)$ by directly maximizing the likelihood. To improve the diversity of generated speech, we sample the latent variable $z$ from the Gaussian distribution $N(0, 1)$ during inference and interpolate $z$ with a zero vector $o$ using a temperature factor $t$ to obtain a new latent vector $z'$ (a small sketch follows at the end of this subsection). We then use $z'$ as the input of the model and invert the transformation $f$ to produce the melspectrogram. For the sake of simplicity, we follow the decoder structure of Flow-TTS (Miao et al., 2020) in implementing our flow-based decoder.

**EFTS-Wav.** To simplify the two-stage training pipeline and train TTS models in a fully end-to-end manner, we also develop a text-to-wav model by combining EfficientTTS with a dilated convolutional adversarial decoder. We follow MelGAN (Kumar et al., 2019) in implementing the convolutional adversarial decoder.
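The temperature sampling used by EFTS-Flow at inference reduces to scaling a Gaussian sample, as the brief sketch below shows; it is our own illustration, and the resulting latent would then be passed through the inverse of the flow decoder.

```python
import torch

def sample_latent(shape, t=0.667):
    """EFTS-Flow inference latent (sketch): interpolate a Gaussian sample z
    with the zero vector o by temperature t, which is equivalent to t * z."""
    z = torch.randn(*shape)
    o = torch.zeros_like(z)
    return o + t * (z - o)
```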
## 5. Experiments

In this section, we first compare the proposed models with their counterparts in terms of speech fidelity and training and inference efficiency. We then analyze the effectiveness of the proposed monotonic alignment approach on both EFTS-CNN and Tacotron 2. At the end of this section, we also demonstrate that the proposed models can generate speech with great diversity. Audio samples of the proposed models are available at https://mcf330.github.io/EfficientTTSAudioSamples/.

### 5.1. Experimental Setup

**Datasets.** We conduct most of our experiments on an open-source standard Mandarin dataset from Data Baker (https://www.data-baker.com/open_source.html), which consists of 10,000 Chinese clips from a single female speaker with a sampling rate of 22.05 kHz. The length of the clips varies from 1 to 10 seconds, and the clips have a total length of about 12 hours. We also conduct some experiments on the LJ-Speech dataset (Ito, 2017), a 24-hour waveform audio set of a single female speaker with 13,100 audio clips and a sampling rate of 22.05 kHz.

**Counterpart models.** We compare the proposed models with the autoregressive Tacotron 2 and the non-autoregressive Glow-TTS in the following experiments. We directly use the open-source implementations of Tacotron 2 (https://github.com/NVIDIA/tacotron2) and Glow-TTS (https://github.com/jaywalnut310/glow-tts) with default configurations. We use the HiFi-GAN (Kong et al., 2020) vocoder to produce waveforms from melspectrograms, using the open-source implementation (https://github.com/jik876/hifi-gan) with the HiFi-GAN-V1 configuration.

### 5.2. Comparison with Counterpart Models

**Speech quality.** We conduct a 5-scale mean opinion score (MOS) evaluation on the Data Baker dataset to measure the quality of the synthesized audio. Each audio clip is rated by at least 15 testers, all of whom are native speakers. We compare the MOS of the audio samples generated by the EfficientTTS families with ground-truth audio, as well as with audio samples generated by the counterpart models. The MOS results with 95% confidence intervals are shown in Table 2. We observe that the EfficientTTS families outperform the counterpart models. Tacotron 2 suffers from declining speech quality caused by the inconsistency between teacher-forcing training and autoregressive inference, while Glow-TTS replicates the hidden representations of the text sequence, which corrupts their continuity. EfficientTTS reconstructs the alignments using the IMV, which is more expressive than token durations, and therefore achieves better speech quality. In addition, the alignment part of EfficientTTS is trained together with the rest of the model, which further improves the speech quality.

**Table 2.** MOS with 95% confidence intervals for different methods on Data Baker. The temperature of the latent variable $z$ is set to 0.667 for both Glow-TTS and EFTS-Flow.

| Method | MOS |
|---|---|
| Ground Truth | 4.64 ± 0.07 |
| Ground Truth (Mel + HiFi-GAN) | 4.58 ± 0.13 |
| Tacotron 2 (Mel + HiFi-GAN) | 4.20 ± 0.11 |
| Glow-TTS (Mel + HiFi-GAN) | 3.97 ± 0.21 |
| EFTS-CNN (Mel + HiFi-GAN) | 4.41 ± 0.13 |
| EFTS-Flow (Mel + HiFi-GAN) | 4.35 ± 0.17 |
| EFTS-Wav | 4.40 ± 0.21 |

As our training settings may differ from the original settings of the counterpart models, we further compare our EFTS-CNN model with pretrained Tacotron 2 and Glow-TTS models on the LJ-Speech dataset. As shown in Table 3, EFTS-CNN outperforms the counterpart models on LJ-Speech as well.

**Table 3.** MOS with 95% confidence intervals for different methods on LJ-Speech. The temperature of the latent variable $z$ is set to 0.667 for Glow-TTS.

| Method | MOS |
|---|---|
| Ground Truth | 4.75 ± 0.12 |
| Ground Truth (Mel + HiFi-GAN) | 4.51 ± 0.13 |
| Tacotron 2 (Mel + HiFi-GAN) | 4.08 ± 0.13 |
| Glow-TTS (Mel + HiFi-GAN) | 4.13 ± 0.18 |
| EFTS-CNN (Mel + HiFi-GAN) | 4.27 ± 0.14 |

**Training and inference speed.** Being end-to-end and fully parallel, the proposed models are very efficient for both training and inference. Quantitative results of training time and inference latency are shown in Table 1. As can be seen, EFTS-CNN requires the least amount of training time. Although EFTS-Flow requires training time comparable to Tacotron 2, it is significantly faster than Glow-TTS. As for inference latency, the EfficientTTS models are faster than Tacotron 2 and Glow-TTS. In particular, the inference latency of EFTS-CNN is 8 ms, which is 97.5× faster than Tacotron 2 and significantly faster than Glow-TTS. Thanks to the removal of melspectrogram generation, EFTS-Wav is significantly faster than two-stage models, taking only 18 ms to synthesize test audio from text sequences, which is 45.8× faster than Tacotron 2.

**Table 1.** Quantitative results of training time and inference latency. We run training and inference on a single V100 GPU. We select 20 sentences for inference speed evaluation and run inference on each sentence 20 times to obtain an average latency. The lengths of the generated melspectrograms range from 110 to 802, with an average of 531. We exclude the time cost of transferring data between CPU and GPU in the inference speed evaluation. HiFi-GAN is used to produce waveforms from melspectrograms.

| Model | Training Time (h) | Training Speedup | Inference Time, text-to-mel (ms) | Inference Speedup, text-to-mel | Inference Time, text-to-wav (ms) | Inference Speedup, text-to-wav |
|---|---|---|---|---|---|---|
| Tacotron 2 | 54 | - | 780 | - | 824 | - |
| Glow-TTS | 120 | 0.45× | 42 | 18.6× | 86 | 9.6× |
| EFTS-CNN | 6 | 9× | 8 | 97.5× | 52 | 15.8× |
| EFTS-Flow | 32 | 1.7× | 21 | 37.1× | 65 | 12.7× |
| EFTS-Wav | - | - | - | - | 18 | 45.8× |
### 5.3. Evaluation of the Monotonic Alignment Approach

In order to evaluate the behaviour of the proposed monotonic alignment approach, we conduct several experiments on EFTS-CNN and Tacotron 2. We first compare training efficiency on EFTS-CNN, and then conduct a robustness test on both Tacotron 2 and EFTS-CNN.

**Experiments on EFTS-CNN.** We train EFTS-CNN with different settings: (1) EFTS-HMA, the default implementation of EFTS-CNN, with a hard monotonic IMV generator; (2) EFTS-SMA, an EFTS-CNN model with a soft monotonic IMV generator; (3) EFTS-NM, an EFTS-CNN model with no constraints on monotonicity. The network structure of EFTS-NM is the same as that of EFTS-SMA, except that EFTS-SMA is trained with the SMA loss while EFTS-NM is trained without it. We first find that EFTS-NM does not converge at all and its alignment matrix is not diagonal, while both EFTS-SMA and EFTS-HMA are able to produce reasonable alignments. We illustrate the IMV and the reconstructed melspectrograms during training in Fig. 5 for all the models. As can be seen, EFTS-HMA achieves a significant speed-up over EFTS-SMA. We therefore conclude that monotonic alignment is essential for the proposed models. Our approaches, both SMA and HMA, succeed in learning monotonic alignments where the vanilla attention mechanism fails. Thanks to the strict monotonicity constraints, EFTS-HMA significantly improves model performance. More training details are shown in Table 4.

*Figure 5. Comparisons of IMV and reconstructed melspectrograms. The plots are taken from the 150k-th training step of EFTS-NM and EFTS-SMA, and the 20k-th training step of EFTS-HMA. The left column illustrates the IMV while the right column plots the reconstructed melspectrograms.*

**Table 4.** Comparison of different monotonic alignment approaches on EFTS-CNN.

| Model | Training Steps | MSE Loss |
|---|---|---|
| EFTS-HMA | 60k | 0.095 |
| EFTS-SMA | 450k | 0.33 |
| EFTS-NM | does not converge | - |

**Robustness.** Many TTS models encounter misalignment at synthesis time, especially autoregressive models. In this subsection, we analyze the attention errors of EfficientTTS, including repeated words, skipped words and mispronunciations. We perform a robustness evaluation on a test set of 50 sentences, which includes particularly challenging cases for TTS systems, such as very long sentences and repeated letters. We compare EFTS-CNN with Tacotron 2. We also incorporate SMA and HMA into Tacotron 2 for a more detailed comparison (the detailed implementations of Tacotron2-SMA and Tacotron2-HMA, and more experimental results, are shown in Appendix D). The results of the robustness test are shown in Table 5. It can be seen that EFTS-CNN effectively eliminates repeat and skip errors, while Tacotron 2 encounters many errors. However, the synthesis errors of Tacotron 2 are significantly reduced by leveraging SMA or HMA, which indicates that the proposed monotonic alignment approach can improve the robustness of TTS models.

**Table 5.** Comparison of robustness between EFTS-CNN and Tacotron 2. We implement Tacotron 2 with different settings, including vanilla Tacotron 2, Tacotron 2 with SMA (T2-SMA), and Tacotron 2 with HMA (T2-HMA).

| Model | Repeats | Skips | Mispronunciations | Error Rate |
|---|---|---|---|---|
| Tacotron 2 | 13 | 7 | 5 | 50% |
| T2-SMA | 3 | 1 | 3 | 14% |
| T2-HMA | 0 | 1 | 3 | 8% |
| EFTS-CNN | 0 | 0 | 2 | 4% |

### 5.4. Diversity

To synthesize speech samples with great diversity, most TTS models make use of external conditions such as style embeddings or speaker embeddings, or simply rely on dropout during inference. EfficientTTS, however, is able to synthesize a variety of speech samples in several ways, including: (1) synthesizing speech with a different alignment scheme, which can be either an IMV extracted from existing audio by the mel-encoder and IMV generator, or a sequence of aligned positions; (2) synthesizing speech at different speech rates by multiplying the predicted aligned positions by a scalar, similar to other duration-based non-autoregressive models (see the sketch below); (3) synthesizing speech with different speech variations for EFTS-Flow by changing the temperature $t$ of the latent variable $z$ during inference. We plot a variety of melspectrograms generated from the same text sequences in Appendix C.
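A minimal sketch of the speech-rate control mentioned in item (2), assuming the predicted aligned positions `e_pred` from the aligned position predictor; the scaling factor, `eta`, and the rounding are illustrative choices of ours.

```python
import torch

def change_speech_rate(e_pred, rate=1.2, eta=1.2):
    """Scale predicted aligned positions before alignment reconstruction.
    rate > 1 stretches the alignment (more output frames per token, slower speech)."""
    e_scaled = e_pred * rate
    # Eq. (21) applied to the scaled positions to get the new output length.
    T2 = int(torch.ceil(e_scaled[-1] + eta * (e_scaled[-1] - e_scaled[-2])).item())
    return e_scaled, T2   # feed into Eq. (20) to rebuild the alignment matrix
```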
## 6. Conclusions and Future Work

In this work, we propose a non-autoregressive architecture that enables high-quality speech generation as well as efficient training and synthesis. We develop a family of models based on EfficientTTS covering text-to-melspectrogram and text-to-waveform generation. Through extensive experiments, we observe improved quantitative results in training efficiency, synthesis speed, robustness, and speech quality. We show that the proposed models are very competitive with existing TTS models. There are many possible directions for future work. EfficientTTS enables not only generating speech with a given alignment but also extracting the alignment from given speech, making it an excellent candidate for voice conversion and singing synthesis. The proposed monotonic alignment approach is also a good candidate for other sequence-to-sequence tasks where monotonic alignment matters, such as Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Optical Character Recognition (OCR).

## References

Bahdanau, D., Cho, K., and Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.

Chiu, C.-C. and Raffel, C. Monotonic Chunkwise Attention. In International Conference on Learning Representations, 2018.

Donahue, J., Dieleman, S., Binkowski, M., Elsen, E., and Simonyan, K. End-to-End Adversarial Text-to-Speech. In International Conference on Learning Representations, 2021.

Ito, K. The LJ Speech Dataset, 2017. URL https://keithito.com/LJ-Speech-Dataset/.

Kim, J., Kim, S., Kong, J., and Yoon, S. Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. In Advances in Neural Information Processing Systems, 2020.

Kong, J., Kim, J., and Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In Advances in Neural Information Processing Systems, 2020.

Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., de Brebisson, A., Bengio, Y., and Courville, A. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. In Advances in Neural Information Processing Systems, 2019.

Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M., and Zhou, M. Close to Human Quality TTS with Transformer. In AAAI, 2019.

Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M., and Zhou, M. MoBoAligner: A Neural Alignment Model for Non-autoregressive TTS with Monotonic Boundary Search. In Interspeech, 2020.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and Sonderegger, M. Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi. In Interspeech, 2017.

Miao, C., Liang, S., Chen, M., Ma, J., Wang, S., and Xiao, J. Flow-TTS: A Non-Autoregressive Network for Text to Speech Based On Flow. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

Peng, K., Ping, W., Song, Z., and Zhao, K. Non-Autoregressive Neural Text-to-Speech. In International Conference on Machine Learning, 2020.

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. Deep Voice 3: 2000-Speaker Neural Text-to-Speech. In International Conference on Learning Representations, 2018.

Ping, W., Peng, K., and Chen, J. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. In International Conference on Learning Representations, 2019.

Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A Flow-Based Generative Network For Speech Synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

Raffel, C., Luong, M.-T., Liu, P. J., Weiss, R. J., and Eck, D. Online and Linear-Time Attention by Enforcing Monotonic Alignments. In International Conference on Machine Learning, 2017.

Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. FastSpeech: Fast, Robust and Controllable Text to Speech. In Advances in Neural Information Processing Systems, 2019.

Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In International Conference on Learning Representations, 2021.

Sakoe, H. Dynamic-Programming Approach to Continuous Speech Recognition. In Proc. International Congress of Acoustics, Budapest, 1971.

Sakoe, H. and Chiba, S. Dynamic Programming Algorithm Optimization For Spoken Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43-49, 1978.

Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., and Ryan, R. S. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

Tachibana, H., Uenoyama, K., and Aihara, S. Efficiently Trainable Text-to-Speech System Based On Deep Convolutional Networks With Guided Attention. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

Valin, J.-M. and Skoglund, J. LPCNet: Improving Neural Speech Synthesis Through Linear Prediction. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

Valle, R., Shih, K. J., Prenger, R., and Catanzaro, B. Flowtron: An Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis. In International Conference on Learning Representations, 2021.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. In 9th ISCA Speech Synthesis Workshop, 2016.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems, 2017.
Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., and Yang, Z. Tacotron: Towards End-to-End Speech Synthesis. In Interspeech, 2017.

Weiss, R. J., Skerry-Ryan, R., Battenberg, E., Mariooryad, S., and Kingma, D. P. Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.