# Parallel WaveNet: Fast High-Fidelity Speech Synthesis

Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, Demis Hassabis

DeepMind Technologies, London, United Kingdom. Correspondence to: Aaron van den Oord.

*Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.*

## Abstract

The recently-developed WaveNet architecture (van den Oord et al., 2016a) is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples more than 20 times faster than real time, a 1000x speed-up relative to the original WaveNet, and capable of serving multiple English and Japanese voices in a production setting.

## 1. Introduction

Recent successes of deep learning go beyond achieving state-of-the-art results in research benchmarks, and push the frontiers in some of the most challenging real-world applications such as speech recognition (Hinton et al., 2012), image recognition (Krizhevsky et al., 2012; Szegedy et al., 2015), and machine translation (Wu et al., 2016). The recently published WaveNet (van den Oord et al., 2016a) model achieves state-of-the-art results in speech synthesis, and significantly closes the gap with natural human speech. However, it is not well suited for real-world deployment due to its prohibitive generation speed. In this paper, we present a new algorithm for distilling WaveNet into a feed-forward neural network which can synthesise equally high-quality speech much more efficiently, and is deployed to millions of users.

WaveNet is one of a family of autoregressive deep generative models that have been applied with great success to data as diverse as text (Mikolov et al., 2010), images (Larochelle & Murray, 2011; Theis & Bethge, 2015; van den Oord et al., 2016c;b), video (Kalchbrenner et al., 2016), handwriting (Graves, 2013), as well as human speech and music. Modelling raw audio signals, as WaveNet does, represents a particularly extreme form of autoregression, with up to 24,000 samples predicted per second. Operating at such a high temporal resolution is not problematic during network training, where the complete sequence of input samples is already available and, thanks to the convolutional structure of the network, can be processed in parallel. When generating samples, however, each input sample must be drawn from the output distribution before it can be passed in as input at the next time step, making parallel processing impossible.
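To make this bottleneck concrete, the following minimal sketch (not the actual WaveNet implementation) shows ancestral sampling from an autoregressive model: each sample requires a full network evaluation that depends on the previous output, so the loop cannot be parallelised. The `predict_next` callable and its uniform stand-in are purely illustrative.

```python
import numpy as np

def sample_autoregressive(predict_next, T, num_categories, seed=0):
    """Ancestral sampling from an autoregressive model over categorical samples.

    `predict_next` is a stand-in for a trained network: given the history
    x_{<t}, it returns a probability vector over the next sample x_t.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(T, dtype=np.int64)
    for t in range(T):                              # inherently sequential:
        probs = predict_next(x[:t])                 # one full network pass per sample,
        x[t] = rng.choice(num_categories, p=probs)  # so T passes for T samples
    return x

# Toy usage with a uniform "network"; a real WaveNet pass is far more costly.
uniform = lambda history: np.full(256, 1 / 256)
audio = sample_autoregressive(uniform, T=24000, num_categories=256)
```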
Inverse autoregressive flows (IAFs) (Kingma et al., 2016) represent a kind of dual formulation of deep autoregressive modelling, in which sampling can be performed in parallel, while the inference procedure required for likelihood estimation is sequential and slow. The goal of this paper is to marry the best features of both models: the efficient training of WaveNet and the efficient sampling of IAF networks. The bridge between them is a new form of neural network distillation (Hinton et al., 2015), which we refer to as Probability Density Distillation, where a trained WaveNet model is used as a teacher for a feed-forward IAF model.

The next section describes the original WaveNet model, while Sections 3 and 4 define in detail the new, parallel version of WaveNet and the distillation process used to transfer knowledge between them. Section 5 then presents experimental results showing no loss in perceived quality for parallel versus original WaveNet, and continued superiority over previous benchmarks. We also present timings for sample generation, demonstrating a speed-up of more than 1000x relative to the original WaveNet.

## 2. WaveNet

Autoregressive networks model the joint distribution of high-dimensional data as a product of conditional distributions using the probabilistic chain rule:

$$p(\mathbf{x}) = \prod_t p(x_t \mid \mathbf{x}_{<t}),$$

where $\mathbf{x}_{<t}$ denotes all samples preceding $x_t$. Due to this sequential nature, real-time (or faster) synthesis with a fully autoregressive system is challenging. While sampling speed is not a significant issue for offline generation, it is essential for real-world applications. A version of WaveNet that generates in real time has been developed (Paine et al., 2016), but it required the use of a much smaller network, resulting in severely degraded quality.

Raw audio data is typically very high-dimensional (e.g. 16,000 samples per second for 16 kHz audio), and contains complex, hierarchical structures spanning many thousands of time steps, such as words in speech or melodies in music. Modelling such long-term dependencies with standard causal convolution layers would require a very deep network to ensure a sufficiently broad receptive field. WaveNet avoids this constraint by using dilated causal convolutions, which allow the receptive field to grow exponentially with depth.

*Figure 1. Visualisation of a WaveNet stack and its receptive field (van den Oord et al., 2016a). Starting from the inputs at the bottom, the WaveNet architecture has increasing levels of dilation by a factor of 2 (dilation = 1, 2, 4 in the hidden layers and 8 at the output), so that each output unit shown at the top row of the figure can combine dependencies from a large range of inputs.*

WaveNet uses gated activation functions, together with a simple mechanism introduced in (van den Oord et al., 2016c) to condition on extra information such as class labels or linguistic features:

$$\mathbf{h}_i = \sigma\left(W_{g,i} * \mathbf{x}_i + V_{g,i}^T \mathbf{c}\right) \odot \tanh\left(W_{f,i} * \mathbf{x}_i + V_{f,i}^T \mathbf{c}\right), \tag{1}$$

where $*$ denotes a convolution operator and $\odot$ denotes an element-wise multiplication operator. $\sigma(\cdot)$ is a logistic sigmoid function. $\mathbf{c}$ represents extra conditioning data. $i$ is the layer index. $f$ and $g$ denote filter and gate, respectively. $W$ and $V$ are learnable weights. In cases where $\mathbf{c}$ encodes spatial or sequential information (such as a sequence of linguistic features), the matrix products ($V_{f,i}^T \mathbf{c}$ and $V_{g,i}^T \mathbf{c}$) are replaced by convolutions ($V_{f,i} * \mathbf{c}$ and $V_{g,i} * \mathbf{c}$).
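As a rough illustration of Equation 1, the sketch below implements one gated, conditioned, dilated causal convolution layer in PyTorch. The class and parameter names (`GatedDilatedLayer`, `cond_dim`), the kernel size of 2, and the use of a global conditioning vector are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GatedDilatedLayer(nn.Module):
    """One WaveNet-style gated layer with global conditioning (cf. Eq. 1)."""

    def __init__(self, channels, cond_dim, dilation):
        super().__init__()
        # Dilated convolutions for the filter (f) and gate (g) paths.
        self.conv_f = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.conv_g = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        # Linear projections V^T c of the conditioning vector.
        self.proj_f = nn.Linear(cond_dim, channels)
        self.proj_g = nn.Linear(cond_dim, channels)
        self.dilation = dilation

    def forward(self, x, c):
        # x: (batch, channels, time); c: (batch, cond_dim) global conditioning.
        # Left-pad in time so the convolution is causal (no future samples).
        x_pad = nn.functional.pad(x, (self.dilation, 0))
        f = self.conv_f(x_pad) + self.proj_f(c).unsqueeze(-1)
        g = self.conv_g(x_pad) + self.proj_g(c).unsqueeze(-1)
        return torch.sigmoid(g) * torch.tanh(f)  # h_i = sigma(gate) ⊙ tanh(filter)
```

During training, all time steps of `x` are processed by this layer in one parallel pass, which is why WaveNet trains efficiently despite sampling sequentially.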
## 2.1. Higher Fidelity WaveNet

For this work we made two improvements to the basic WaveNet model to enhance its audio quality for production use. Unlike previous versions of WaveNet (van den Oord et al., 2016a), where 8-bit (µ-law or PCM) audio was modelled with a 256-way categorical distribution, we increased the fidelity by modelling 16-bit audio. Since training a 65,536-way categorical distribution would be prohibitively costly, we instead modelled the samples with the discretized mixture of logistics distribution introduced in (Salimans et al., 2017). We further improved fidelity by increasing the audio sampling rate from 16 kHz to 24 kHz. This required a WaveNet with a wider receptive field, which we achieved by increasing the dilated convolution filter size from 2 to 3. An alternative strategy would be to increase the number of layers or add more dilation stages.

## 3. Parallel WaveNet

While the convolutional structure of WaveNet allows for rapid parallel training, sample generation remains inherently sequential and therefore slow, as it is for all autoregressive models which use ancestral sampling. We therefore seek an alternative architecture that will allow for rapid, parallel generation.

Inverse-autoregressive flows (IAFs) (Kingma et al., 2016) are stochastic generative models whose latent variables are arranged so that all elements of a high-dimensional observable sample can be generated in parallel. IAFs are a special type of normalising flow (Dinh et al., 2014; Rezende & Mohamed, 2015; Dinh et al., 2016) which model a multivariate distribution $p_X(\mathbf{x})$ as an explicit invertible non-linear transformation $\mathbf{x} = f(\mathbf{z})$ of a simple tractable distribution $p_Z(\mathbf{z})$ (such as an isotropic Gaussian distribution). Using the change-of-variables formula, the resulting distribution can be written as:

$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) - \log \left| \det\left(\frac{d\mathbf{x}}{d\mathbf{z}}\right) \right|,$$

where $\frac{d\mathbf{x}}{d\mathbf{z}}$ is the Jacobian of $f$. For all normalising flows, the transformation $f$ is chosen so that it is invertible and its Jacobian determinant is easy to compute. In the case of an IAF, the output is modelled by $x_t = f(\mathbf{z}_{\leq t})$, so each output depends only on the current and earlier latent variables. Because of this strict dependency structure, the transformation has a triangular Jacobian matrix, which makes the determinant equal to the product of the diagonal entries:

$$\log \left| \det\left(\frac{d\mathbf{x}}{d\mathbf{z}}\right) \right| = \sum_t \log \frac{\partial f(\mathbf{z}_{\leq t})}{\partial z_t}.$$

To sample from an IAF, a random sample is first drawn from $\mathbf{z} \sim \text{Logistic}(0, I)$, which is then transformed as follows:

$$x_t = z_t \cdot s(\mathbf{z}_{<t}, \boldsymbol{\theta}) + \mu(\mathbf{z}_{<t}, \boldsymbol{\theta}).$$
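The parallelism of IAF sampling can be sketched as follows. A toy single-layer causal network stands in for the WaveNet-like stack that would compute $s(\mathbf{z}_{<t}, \boldsymbol{\theta})$ and $\mu(\mathbf{z}_{<t}, \boldsymbol{\theta})$ in practice; names such as `CausalShiftNet` and `iaf_sample` are hypothetical. The key point is that, unlike the ancestral sampling loop shown earlier, every $x_t$ is produced in a single batched pass over the noise.

```python
import torch
import torch.nn as nn

class CausalShiftNet(nn.Module):
    """Toy autoregressive network: maps z_{<t} to a per-step scale s and shift mu.

    One causal convolution stands in for the full WaveNet-like stack; the
    extra left-pad plus drop of the last position ensures that the outputs
    at step t see only z_{<t}, never z_t itself.
    """
    def __init__(self, hidden=64, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, hidden, kernel_size)
        self.out = nn.Conv1d(hidden, 2, kernel_size=1)  # -> (log s, mu)
        self.kernel_size = kernel_size

    def forward(self, z):
        # z: (batch, 1, time)
        z_shifted = nn.functional.pad(z, (self.kernel_size, 0))[..., :-1]
        h = torch.relu(self.conv(z_shifted))
        log_s, mu = self.out(h).chunk(2, dim=1)
        return log_s, mu

def iaf_sample(net, batch, T):
    """One IAF flow step: all x_t are computed in a single parallel pass."""
    u = torch.rand(batch, 1, T).clamp(1e-6, 1 - 1e-6)
    z = torch.log(u) - torch.log1p(-u)   # Logistic(0, 1) noise via inverse CDF
    log_s, mu = net(z)                   # parallel over all time steps
    return z * torch.exp(log_s) + mu     # x_t = z_t * s(z_<t) + mu(z_<t)

x = iaf_sample(CausalShiftNet(), batch=4, T=24000)
```

Note the asymmetry with WaveNet: here sampling is one parallel pass, but evaluating the likelihood of a given $\mathbf{x}$ would require recovering $\mathbf{z}$ sequentially, which is exactly the trade-off the paper exploits.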