# Parallel and High-Fidelity Text-to-Lip Generation

Jinglin Liu*1, Zhiying Zhu*1, Yi Ren*1, Wencan Huang1, Baoxing Huai2, Nicholas Yuan2, Zhou Zhao†1
1Zhejiang University, China   2Huawei Cloud
{jinglinliu,zhyingzh,rayeren,huangwencan,zhaozhou}@zju.edu.cn, {huaibaoxing,nicholas.yuan}@huawei.com

Abstract

As a key component of talking face generation, lip movements generation determines the naturalness and coherence of the generated talking face video. Prior literature mainly focuses on speech-to-lip generation, while there is a paucity of work on text-to-lip (T2L) generation. T2L is a challenging task, and existing end-to-end works depend on the attention mechanism and an autoregressive (AR) decoding manner. However, the AR decoding manner generates the current lip frame conditioned on previously generated frames, which inherently limits the inference speed and also harms the quality of the generated lip frames due to error propagation. This motivates research on parallel T2L generation. In this work, we propose a parallel decoding model for fast and high-fidelity text-to-lip generation (ParaLip). Specifically, we predict the duration of the encoded linguistic features and model the target lip frames conditioned on the encoded linguistic features together with their duration in a non-autoregressive manner. Furthermore, we incorporate the structural similarity index loss and adversarial learning to improve the perceptual quality of the generated lip frames and to alleviate the blurry prediction problem. Extensive experiments conducted on the GRID and TCD-TIMIT datasets demonstrate the superiority of the proposed methods.

1 Introduction

In modern service industries, talking face generation has broad application prospects such as avatars, virtual assistants, movie animation, teleconferencing, etc. (Zhu et al. 2020). As a key component of talking face generation, lip movements generation (a.k.a. lip generation) determines the naturalness and coherence of the generated talking face video. Lip generation aims to synthesize an accurate mouth movements video corresponding to the linguistic content carried in speech or pure text. Mainstream literature focuses on speech-to-lip (S2L) generation, while there is a paucity of work on text-to-lip (T2L) generation. Even so, T2L generation is crucial and has considerable merits compared to S2L, since 1) text data can be obtained and edited more easily than speech, which makes T2L generation more convenient; and 2) T2L strongly preserves privacy, especially now that deep learning techniques are so developed that a single sentence of speech can expose an unimaginable amount of personal information.

*Equal contribution. †Corresponding author.

[Figure 1: The task description of T2L generation. The model takes in an arbitrary source text sequence and a single identity lip image to synthesize the target lip movements video. The video generated by an AR T2L model gradually loses the linguistic information and finally becomes fuzzy and motionless (lip frames in the red box), which is the intractable error propagation problem of AR T2L models.]

However, the end-to-end T2L task (shown in Figure 1) is challenging.
Unlike the S2L task, where the mapping between the sequence length of the source speech and the target video is fixed (determined by the audio sample rate and the video frame rate), in the T2L task there is an uncertain sequence-length discrepancy between source and target. Traditional temporal convolutional networks therefore become impractical. Hence, existing works view T2L as a sequence-to-sequence task and tackle it by leveraging the attention mechanism and an autoregressive (AR) decoding manner. The AR decoding manner brings two drawbacks. 1) It inherently hinders the inference speed, since the decoder generates target lips one by one autoregressively with a causal structure. Consequently, generating a single short sentence of video consumes about 0.5-1.5 seconds even on a GPU, which is not acceptable for industrial applications such as real-time interaction with avatars or virtual assistants, real-time teleconferencing, document-level audio-visual speech synthesis, etc. 2) It has a detrimental effect on the quality of the generated lips due to error propagation¹, which is frequently discussed in the neural machine translation and image captioning fields (Bengio et al. 2015; Wu et al. 2018). Worse still, error propagation is more pronounced in AR lip generation than in other tasks, because mistakes can occur in more dimensions (every pixel with three channels in the generated image) and there is information loss during down-sampling when the last generated lip frame is fed back to predict the current one. Although prior works alleviate error propagation by incorporating location-sensitive attention, they still perform unsatisfactorily on long-sequence datasets due to accumulated prediction error.

To address such limitations, we turn to non-autoregressive (NAR) approaches. The NAR decoding manner generates all target tokens in parallel and has already pervaded multiple research fields such as neural machine translation (Gu et al. 2018; Lee, Mansimov, and Cho 2018; Ghazvininejad et al. 2019; Ma et al. 2019), speech recognition (Chen et al. 2019; Higuchi et al. 2020), speech synthesis (Ren et al. 2019; Peng et al. 2020; Miao et al. 2020), image captioning (Deng et al. 2020) and lip reading (Liu et al. 2020). These works utilize NAR decoding in sequence-to-sequence tasks to reduce inference latency or to generate length-controllable sequences.

In this work, we propose an NAR model for parallel and high-fidelity T2L generation (ParaLip). ParaLip predicts the duration of the encoded linguistic features and models the target lip frames conditioned on the encoded linguistic features together with their duration in a non-autoregressive manner. Furthermore, we leverage the structural similarity index (SSIM) loss to supervise ParaLip to generate lips with better perceptual quality. Finally, using only the reconstruction loss and the SSIM loss is insufficient to generate distinct lip images with realistic texture and local details (e.g. wrinkles, beard and teeth), so we adopt adversarial learning to mitigate this problem.

Our main contributions can be summarized as follows: 1) We point out and analyze the unacceptable inference latency and the intractable error propagation in AR T2L generation. 2) To circumvent these problems, we propose ParaLip to generate high-quality lips with low inference latency.
As a byproduct, the duration predictor in ParaLip could be leveraged in an NAR text-to-speech model, which naturally enables synchronization in the audio-visual speech synthesis task. 3) We explore a source-target alignment method for the case where audio is absent even in the training set. Extensive experiments demonstrate that ParaLip generates lip movements of competitive quality compared with the state-of-the-art AR T2L model and exceeds the AR baseline TransformerT2L by a notable margin. Meanwhile, ParaLip exhibits distinct superiority in inference speed, which truly opens the possibility of bringing T2L generation from the laboratory to industrial applications. Video samples are available at https://paralip.github.io.

¹Error propagation means that if a token is mistakenly predicted at inference time, the error is propagated and the future tokens conditioned on it are affected (Bengio et al. 2015; Wu et al. 2018).

2 Related Work

2.1 Talking Face Generation

Talking face generation aims to generate realistic talking face videos and covers many applications such as avatars and movie animation. A branch of works in the computer graphics (CG) field explores it (Wang et al. 2011; Fan et al. 2015; Suwajanakorn, Seitz, and Kemelmacher-Shlizerman 2017; Yu et al. 2019; Abdelaziz et al. 2020) through hidden Markov models or deep neural networks. These works synthesize the whole face by generating intermediate parameters, which can then be used to deform a 3D face. Thanks to evolved convolutional neural networks (CNNs) and high-performance computing resources, end-to-end systems that synthesize 2D talking face images with CNNs rather than CG rendering methods have been presented recently in the computer vision (CV) field (Kumar et al. 2017; Chung, Jamaludin, and Zisserman 2017; Chen et al. 2018; Vougioukas, Petridis, and Pantic 2019; Zhou et al. 2019; Zhu et al. 2020; Zheng et al. 2020; Prajwal et al. 2020). Most of them focus on synthesizing lip movement images and then transforming them into faces. We mainly consider these works in the CV field and broadly divide them into the two streams described in the following paragraphs.

Speech-to-Lip Generation. Previous speech-driven works, e.g. Chung, Jamaludin, and Zisserman (2017), simply generate talking face images conditioned on the encoded speech and an encoded face image carrying the identity information. To synthesize more accurate and distinct lip movements, Chen et al. (2018) introduce the task of speech-to-lip generation using a lip image as the identity information. Further, Song et al. (2019) add a lip-reading discriminator to focus on the mouth region, and Zhu et al. (2020) add dynamic attention on the lip area to synthesize talking faces while keeping the lip movements realistic. Prajwal et al. (2020) propose a pre-trained lip-syncing discriminator to synthesize talking faces with speech-consistent lip movements.

Text-to-Lip Generation. The literature on direct text-to-lip generation is rare. Some text-driven approaches either cascade a text-to-speech model and a speech-to-lip generation model (KR et al. 2019; Kumar et al. 2017), or combine the text feature with the speech feature to synthesize lip movements (Yu, Yu, and Ling 2019). Fried et al. (2019) edit a given video based on a speech-aligned text sequence. Unlike scenarios where source speech or video is given, the sequence length of the target lip frames is uncertain when only text is provided. The existing work (Chen et al.
2020) depends on the attention mechanism and an AR decoding method to generate the target lip frames until a stop token is predicted.

2.2 Non-Autoregressive Sequence Generation

In sequence-to-sequence tasks, an autoregressive (AR) model takes in a source sequence and then generates the tokens of the target sentence one by one with a causal structure at inference (Sutskever, Vinyals, and Le 2014; Vaswani et al. 2017). Since the AR decoding manner causes high inference latency, many non-autoregressive (NAR) models, which generate target tokens conditionally independently of each other, have been proposed recently. Earliest, in the NAR machine translation field, many works use a fertility module or a length predictor (Gu et al. 2018; Lee, Mansimov, and Cho 2018; Ghazvininejad et al. 2019; Ma et al. 2019) to predict the length correspondence (fertility) between the source and target sequences, and then generate the target sequence depending on the source sequence and the predicted fertility. Shortly afterward, researchers brought the NAR decoding manner into heterogeneous tasks. In the speech field, NAR-based TTS (Ren et al. 2019; Peng et al. 2020; Miao et al. 2020) synthesizes speech from text at high speed with only a slight quality drop, and NAR-based ASR (Chen et al. 2019; Higuchi et al. 2020) recognizes speech into the corresponding transcription faster. In the computer vision field, Liu et al. (2020) propose an NAR model for lipreading, and Deng et al. (2020) present an NAR image captioning model that not only improves decoding efficiency but also makes the generated captions more controllable and diverse.

[Figure 2: The overall architecture of ParaLip. Subfigures: (a) ParaLip; (b) Text Encoder with Length Regulator; (c) Motion Decoder; (d) Video Decoder with multiple Image Decoders. In subfigure (a), the Identity Encoder sends out residual information at every convolutional layer, and the model is trained with the L1, SSIM and adversarial losses. In subfigure (b), the Length Regulator expands the text sequence according to the ground truth duration in training or the predicted duration at inference. In subfigure (c), the Motion Decoder models the lip motion information sequence from the linguistic information sequence. In subfigure (d), there are T Image Decoders placed in parallel in the Video Decoder; the τ-th Image Decoder takes in the motion information at time τ and generates the lip image at time τ, where T is the total number of lip frames.]

3 Method

3.1 Preliminary Knowledge

Text-to-lip generation aims to generate the sequence of lip movement video frames $L = \{l_1, l_2, ..., l_T\}$, given a source text sequence $S = \{s_1, s_2, ..., s_m\}$ and a single identity lip image $l_I$ as condition. Generally, there is a considerable discrepancy between the sequence lengths of $L$ and $S$, with an uncertain mapping relationship. Previous work views this as a sequence-to-sequence problem, utilizing the attention mechanism and the AR decoding manner, where the conditional probability of $L$ is formulated as:

$$P(L \mid S, l_I) = \prod_{\tau=0}^{T-1} P(l_{\tau+1} \mid l_{<\tau+1}, S, l_I; \theta), \quad (1)$$

where $\theta$ denotes the parameters of the model.
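To make the practical consequence of the factorization in Eq. (1) concrete, the schematic sketch below (our illustration, not the authors' implementation; `encoder`, `ar_decoder` and `nar_decoder` are hypothetical placeholder modules) contrasts sequential AR decoding with the single-pass parallel decoding that ParaLip adopts next.

```python
import torch

# Schematic contrast between AR decoding (Eq. (1)) and parallel decoding.
# `encoder`, `ar_decoder` and `nar_decoder` are hypothetical placeholder modules,
# not the actual ParaLip components.

@torch.no_grad()
def decode_autoregressive(encoder, ar_decoder, text, identity, T):
    """Eq. (1): each frame is conditioned on all previously generated frames,
    so the decoder must be called T times sequentially."""
    memory = encoder(text)                           # encoded linguistic features
    history = identity.unsqueeze(1)                  # (B, 1, C, H, W) start frame
    frames = []
    for _ in range(T):
        nxt = ar_decoder(history, memory, identity)  # generate one frame per step
        frames.append(nxt)
        history = torch.cat([history, nxt], dim=1)   # feed generated frames back in
    return torch.cat(frames, dim=1)                  # (B, T, C, H, W)

@torch.no_grad()
def decode_parallel(encoder, nar_decoder, text, identity, T):
    """Parallel factorization: frames are conditionally independent given the
    text and identity, so one batched decoder call produces all T frames."""
    memory = encoder(text)
    return nar_decoder(memory, identity, T)          # (B, T, C, H, W) in one pass
```

The AR loop requires T sequential decoder calls, so latency grows with the video length, whereas the parallel variant issues a single batched call.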
To remedy the error propagation and high latency problems brought by AR decoding, ParaLip models the target sequence in an NAR manner, where the conditional probability becomes:

$$P(L \mid S, l_I) = \prod_{\tau=1}^{T} P(l_\tau \mid S, l_I; \theta). \quad (2)$$

3.2 Model Architecture of ParaLip

The overall model architecture and training losses are shown in Figure 2a. We explain each component of ParaLip in the following paragraphs.

Identity Encoder. As shown in the right panel of Figure 2a, the identity encoder consists of stacked 2D convolutional layers with batch normalization, which down-sample the identity image multiple times to extract features. The identity image is selected randomly from the target lip frames and provides the appearance information of a speaker. It is worth noting that the identity encoder sends out the final encoded hidden feature together with the intermediate hidden features of the convolutional layers at every level, which provides fine-grained image information.

Text Encoder. As shown in Figure 2b, the text encoder consists of a text embedding layer, stacked feed-forward Transformer (TM) layers (Ren et al. 2019), a duration predictor and a length regulator. The TM layer contains a self-attention layer and a 1D convolutional layer with layer normalization and residual connections (Vaswani et al. 2017; Gehring et al. 2017). The duration predictor contains two 1D convolutional layers with layer normalization and one linear layer; it takes in the hidden text embedding sequence and predicts the duration sequence $\hat{D} = \{\hat{d}_1, \hat{d}_2, ..., \hat{d}_m\}$, where $\hat{d}_i$ indicates how many video frames the $i$-th text token corresponds to. The length regulator expands the hidden text embedding sequence according to the ground truth duration $D$ at the training stage or the predicted duration $\hat{D}$ at the inference stage. For example, when the given source text and duration sequence are $\{s_1, s_2, s_3\}$ and $\{2, 1, 3\}$ respectively, denoting the hidden text embeddings as $\{h_1, h_2, h_3\}$, the expanded sequence is $\{h_1, h_1, h_2, h_3, h_3, h_3\}$, which carries the linguistic information corresponding to the lip movement video at the frame level. Collectively, the text encoder encodes the source text sequence $S$ into the linguistic information sequence $\tilde{S} = \{\tilde{s}_1, \tilde{s}_2, ..., \tilde{s}_T\}$, where $T = \sum_{i=1}^{m} d_i$ at the training stage, or $T = \sum_{i=1}^{m} \hat{d}_i$ at the inference stage.

Motion Decoder. The motion decoder (Figure 2c) aims to produce the lip motion information sequence $\tilde{L} = \{\tilde{l}_1, \tilde{l}_2, ..., \tilde{l}_T\}$ from the linguistic information sequence $\tilde{S}$. It utilizes positional encoding and the self-attention mechanism in stacked TM blocks to enforce temporal correlation on the hidden sequence. A linear layer at the end of this module converts the hidden states to an appropriate dimension.

Video Decoder. The video decoder generates the target lip movement video $\hat{L}$ conditioned on the motion information sequence and the identity information. As shown in Figure 2d, the video decoder consists of multiple parallel image decoders with all parameters shared, each of which contains stacked 2D deconvolutional layers, and there are skip connections at every level between the identity encoder and each image decoder. The skip connection is implemented by concatenation. Two extra 2D convolutional layers are then added at the end of each decoder for spatial coherence. Finally, the τ-th image decoder takes in the lip motion information $\tilde{l}_\tau$ at time τ and generates the lip image $\hat{l}_\tau$ at time τ in the corresponding shape.
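The length regulator described above admits a very small implementation. The sketch below is a minimal version of ours (not the released ParaLip code), assuming the hidden text embeddings are a tensor of shape (m, hidden_size) and durations are integer frame counts; it reproduces the worked example from the text.

```python
import torch

def length_regulate(hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand token-level hidden states to frame level.

    hidden:    (m, hidden_size) encoded text embeddings {h_1, ..., h_m}
    durations: (m,) integer frame counts {d_1, ..., d_m}
               (ground-truth durations at training time, predicted ones at inference)
    returns:   (T, hidden_size) with T = sum(durations)
    """
    # repeat_interleave copies h_i exactly d_i times along the time axis
    return torch.repeat_interleave(hidden, durations, dim=0)

# The worked example from the text: durations {2, 1, 3} expand {h1, h2, h3}
# into {h1, h1, h2, h3, h3, h3}.
h = torch.eye(3)                          # stand-ins for h1, h2, h3
d = torch.tensor([2, 1, 3])
expanded = length_regulate(h, d)
assert expanded.shape[0] == int(d.sum())  # T = 6 frames
```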
3.3 Training Methods

In this section, we describe the loss functions and the training strategy used to supervise ParaLip. The reconstruction loss and the duration prediction loss endow the model with the fundamental ability to generate lip movement video. To generate lips with better perceptual quality and to alleviate the blurry prediction problem (Mathieu, Couprie, and LeCun 2016), the structural similarity index loss and adversarial learning are introduced. We also explore the source-target alignment method for the case where audio is absent even in the training set, which is introduced in Section 6.

Reconstruction Loss. Basically, we optimize the whole network by adding an L1 reconstruction loss on the generated lip sequence $\hat{L}$:

$$\mathcal{L}_{rec} = \sum_{\tau=1}^{T} \left\| l_\tau - \hat{l}_\tau \right\|_1. \quad (3)$$

Duration Prediction Loss. In the training stage, we add an L1 loss on the predicted duration sequence $\hat{D}$ at the token level² and at the sequence level, which supervises the duration predictor to make precise fine-grained and coarse-grained predictions. The duration prediction loss $\mathcal{L}_{dur}$ can be written as:

$$\mathcal{L}_{dur} = \sum_{i=1}^{m} \left\| d_i - \hat{d}_i \right\|_1 + \left\| \sum_{i=1}^{m} d_i - \sum_{i=1}^{m} \hat{d}_i \right\|_1. \quad (4)$$

²Character level for GRID and phoneme level for TCD-TIMIT, following previous works.

Structural Similarity Index Loss. The Structural Similarity Index (SSIM) (Wang 2004) is adopted to measure perceptual image quality; it takes luminance, contrast and structure into account and is close to human perception. The SSIM value for two pixels at position (i, j) in the τ-th images $\hat{l}_\tau$ and $l_\tau$ is formulated as:

$$\mathrm{SSIM}_{i,j,\tau} = \frac{2\mu_{\hat{l}_\tau}\mu_{l_\tau} + C_1}{\mu_{\hat{l}_\tau}^2 + \mu_{l_\tau}^2 + C_1} \cdot \frac{2\sigma_{\hat{l}_\tau l_\tau} + C_2}{\sigma_{\hat{l}_\tau}^2 + \sigma_{l_\tau}^2 + C_2},$$

where $\mu_{\hat{l}_\tau}$ and $\mu_{l_\tau}$ denote the means of the regions in images $\hat{l}_\tau$ and $l_\tau$ within a 2D window surrounding (i, j). Similarly, $\sigma_{\hat{l}_\tau}$ and $\sigma_{l_\tau}$ are the standard deviations, $\sigma_{\hat{l}_\tau l_\tau}$ is the covariance, and $C_1$ and $C_2$ are constants. To improve the perceptual quality of the generated lip frames, we leverage the SSIM loss in ParaLip. Assuming the size of each lip frame to be (A × B), the SSIM loss between the generated $\hat{L}$ and the ground truth $L$ becomes:

$$\mathcal{L}_{ssim} = \frac{1}{T \cdot A \cdot B} \sum_{\tau=1}^{T} \sum_{i=1}^{A} \sum_{j=1}^{B} \left( 1 - \mathrm{SSIM}_{i,j,\tau} \right). \quad (5)$$

Adversarial Learning. Through experiments, we found that using only the above losses is insufficient to generate distinct lip images with realistic texture and local details (e.g. wrinkles, beard and teeth). Thus, we adopt adversarial learning to mitigate this problem and train a quality discriminator Disc along with ParaLip. Disc contains stacked 2D convolutional layers with Leaky ReLU activations, which down-sample each image to 1 × 1 × H (H is the hidden size), and a 1 × 1 convolutional layer that projects the hidden states to a probability value for judging real or fake. We use the loss functions of LSGAN (Mao et al. 2017) to train ParaLip and Disc:

$$\mathcal{L}^{G}_{adv} = \mathbb{E}_{x \sim \hat{l}} \left( \mathrm{Disc}(x) - 1 \right)^2, \quad (6)$$

$$\mathcal{L}^{D}_{adv} = \mathbb{E}_{x \sim l} \left( \mathrm{Disc}(x) - 1 \right)^2 + \mathbb{E}_{x \sim \hat{l}} \, \mathrm{Disc}(x)^2, \quad (7)$$

where $\hat{l}$ denotes lip images generated by ParaLip and $l$ denotes ground truth lip images. To summarize, we optimize Disc by minimizing Equation (7), and optimize ParaLip by minimizing $\mathcal{L}_{total}$:

$$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{dur} + \lambda_3 \mathcal{L}_{ssim} + \lambda_4 \mathcal{L}^{G}_{adv}, \quad (8)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are hyperparameters that trade off the four losses.
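As a rough illustration of how the objectives in Eqs. (3)-(8) can be combined, here is a compact sketch under our own assumptions: clips are (B, T, C, H, W) tensors, all terms are averaged rather than summed, the SSIM term uses a simplified whole-frame (global window) variant rather than the sliding-window form of Eq. (5), `disc` is a placeholder for the quality discriminator, and the default loss weights are arbitrary.

```python
import torch
import torch.nn.functional as F

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM computed over whole frames (no sliding window).
    x, y: (B, T, C, H, W) tensors in [0, 1]; returns a (B, T) tensor."""
    dims = (-3, -2, -1)
    mu_x, mu_y = x.mean(dim=dims), y.mean(dim=dims)
    var_x, var_y = x.var(dim=dims), y.var(dim=dims)
    cov = ((x - mu_x[..., None, None, None]) * (y - mu_y[..., None, None, None])).mean(dim=dims)
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def paralip_style_losses(gen, gt, d_pred, d_gt, disc, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """gen, gt: (B, T, C, H, W) generated / ground-truth clips.
    d_pred, d_gt: (B, m) predicted / ground-truth durations.
    disc: placeholder quality discriminator mapping a clip to real/fake scores."""
    l_rec = (gen - gt).abs().mean()                                           # Eq. (3), L1
    l_dur = F.l1_loss(d_pred, d_gt) + F.l1_loss(d_pred.sum(-1), d_gt.sum(-1)) # Eq. (4)
    l_ssim = (1.0 - ssim_global(gen, gt)).mean()                              # Eq. (5), simplified
    l_adv_g = ((disc(gen) - 1.0) ** 2).mean()                                 # Eq. (6), LSGAN generator
    l1, l2, l3, l4 = lambdas
    total = l1 * l_rec + l2 * l_dur + l3 * l_ssim + l4 * l_adv_g              # Eq. (8)
    # Discriminator objective, Eq. (7), optimized in a separate step:
    l_adv_d = ((disc(gt) - 1.0) ** 2).mean() + (disc(gen.detach()) ** 2).mean()
    return total, l_adv_d
```

In practice the generator loss (total) and the discriminator loss (l_adv_d) would be minimized by two separate optimizers in alternating steps, as is standard for LSGAN training.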
4 Experimental Settings

4.1 Datasets

GRID. The GRID dataset (Cooke et al. 2006) consists of 33 video-available speakers, and each speaker utters 1,000 phrases. The phrases follow a fixed simple grammar with six categories: command (4) + color (4) + preposition (4) + letter (25) + digit (10) + adverb (4), where the number in parentheses denotes how many word choices each category has. Thus, the total vocabulary size is 51, composing 64,000 possible phrases. All videos last 3 seconds at a frame rate of 25 fps, for a total duration of 27.5 hours. It is a typical talking face dataset, and a considerable number of lip-related works (Assael et al. 2016; Chung et al. 2017; Afouras, Chung, and Zisserman 2018; Chen et al. 2018; Zhu et al. 2020; Lin et al. 2021) conduct experiments on it. Following previous works, we select 255 random samples from each speaker to form the test set.

TCD-TIMIT. The TCD-TIMIT dataset (Harte and Gillen 2015) is closer to real cases and more challenging than GRID, since 1) the vocabulary is not limited; and 2) the sequence length of the videos is not fixed and is longer than in GRID. We use the volunteers subset of TCD-TIMIT following previous works, which consists of 59 speakers, each uttering about 98 sentences. The frame rate is 29.97 fps and each video lasts 2.5-8.1 seconds, for a total duration of about 7.5 hours. We set 30% of the data from each speaker aside for testing, following the recommended speaker-dependent train-test splits (Harte and Gillen 2015).

4.2 Data Pre-processing

For video pre-processing, we utilize Dlib (King 2009) to detect 68 facial landmarks (including 20 mouth landmarks) and extract the face images from the video frames. We resize the face images to 256 × 256 and further crop each face to a fixed 160 × 80 region centered on the lips. For text pre-processing, we encode the text sequence at the character level for the GRID dataset and at the phoneme level for the TCD-TIMIT dataset. For ground truth duration extraction, we first extract the speech audio from the video files and then utilize the Penn Phonetics Lab Forced Aligner (P2FA) (Yuan and Liberman 2008) to obtain speech-to-text alignments, from which we derive the duration of each text token for training the duration predictor in ParaLip.
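Converting the aligner output into the integer frame durations used to supervise the duration predictor is a small bookkeeping step. The sketch below is ours and rests on an assumed output format (a list of (token, start_sec, end_sec) tuples, which is not necessarily the exact P2FA format); it snaps boundaries to frame indices and forces the durations to sum to the number of video frames.

```python
def intervals_to_frame_durations(intervals, fps=25.0, total_frames=None):
    """Convert aligner output [(token, start_s, end_s), ...] into integer
    per-token durations measured in video frames.

    Boundaries are snapped to frame indices so durations are non-negative
    integers and, if total_frames is given, sum exactly to it.
    """
    # Snap each token's start boundary to the nearest frame index.
    boundaries = [round(start * fps) for _, start, _ in intervals]
    boundaries.append(round(intervals[-1][2] * fps))   # end of the last token
    if total_frames is not None:
        boundaries[-1] = total_frames                   # absorb rounding drift
    durations = [max(0, boundaries[i + 1] - boundaries[i])
                 for i in range(len(intervals))]
    return durations

# Hypothetical alignment of three phonemes at 25 fps:
align = [("b", 0.00, 0.08), ("ih", 0.08, 0.20), ("n", 0.20, 0.36)]
print(intervals_to_frame_durations(align, fps=25.0, total_frames=9))  # -> [2, 3, 4]
```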
5 Results and Analysis

In this section, we present extensive experimental results to evaluate the performance of ParaLip in terms of lip movement quality and inference speedup. We then conduct ablation experiments to verify the significance of all the proposed methods in ParaLip.

5.1 Quality Comparison

We compare our model with 1) DualLip (Chen et al. 2020), the state-of-the-art (SOTA) autoregressive text-to-lip model based on RNNs and location-sensitive attention (Shen et al. 2018); and 2) TransformerT2L, an autoregressive baseline model based on the Transformer (Vaswani et al. 2017) implemented by us, which uses the same model settings as ParaLip³. The quantitative results on GRID and TCD-TIMIT are listed in Table 1 and Table 2, respectively⁴. Note that we do not add adversarial learning to any model in Table 1 or Table 2, since there is no adversarial learning in DualLip (Chen et al. 2020).

³Most modules and the total number of model parameters in ParaLip and TransformerT2L are similar.

⁴Note that the reported results are all under the case where the ground truth (GT) duration is not provided at inference (denoted as "w/o duration" in (Chen et al. 2020)), since no GT duration is available in the real case.

| Methods | PSNR | SSIM | LMD |
|---|---|---|---|
| DualLip* (AR) | 29.13 | 0.872 | 1.809 |
| TransformerT2L (AR) | 26.85 | 0.829 | 1.980 |
| ParaLip | 28.74 | 0.875 | 1.675 |

Table 1: Comparison with autoregressive benchmarks on the GRID dataset. * denotes our reproduction under the case w/o GT duration at inference.

| Methods | PSNR | SSIM | LMD |
|---|---|---|---|
| DualLip* (AR) | 27.38 | 0.809 | 2.351 |
| TransformerT2L (AR) | 26.89 | 0.794 | 2.763 |
| ParaLip | 27.64 | 0.816 | 2.084 |

Table 2: Comparison with autoregressive benchmarks on the TCD-TIMIT dataset. * denotes our reproduction under the case w/o GT duration at inference.

Quantitative Comparison. We can see that: 1) On the GRID dataset (Table 1), ParaLip outperforms DualLip on the LMD metric and keeps essentially the same performance in terms of the PSNR and SSIM metrics. However, on the TCD-TIMIT dataset (Table 2), ParaLip achieves an overall performance surpassing DualLip by a notable margin, since autoregressive models perform badly on the long-sequence dataset due to accumulated prediction error. 2) ParaLip shows absolute superiority over the AR baseline TransformerT2L in terms of all three quantitative metrics on both datasets. 3) Although DualLip outperforms the AR baseline by incorporating location-sensitive attention, which alleviates error propagation, it is still vulnerable on the long-sequence dataset.

[Figure 3: The qualitative comparison among the AR SOTA (DualLip), the AR baseline (TransformerT2L) and our NAR method (ParaLip). We visualize two cases, one from the GRID dataset and one from the TCD-TIMIT dataset, to illustrate the error propagation problem in AR generation and verify the robustness of ParaLip. In the first case, the lip sequence generated by the AR baseline predicts a wrong lip image (the 6th frame, in the red box); as a result, the subsequent lip images conditioned on that image become out of synchronization with the linguistic information and end in chaos, while DualLip alleviates the error propagation to some degree. In the second case, both AR models perform poorly on the long-sequence dataset and generate frames that look speechless as time goes on.]

Qualitative Comparison. We further visualize the qualitative comparison between DualLip, TransformerT2L and ParaLip in Figure 3. It can be seen that the quality of the lip frames generated by DualLip and TransformerT2L becomes increasingly worse over time. Concretely, the lip images become fuzzy and fall out of synchronization with the linguistic content. We attribute this phenomenon to the severity of error propagation in AR T2L: a wrong prediction can take place in more dimensions (every pixel with three channels in the generated image), and there is information loss during down-sampling when the last generated lip frame is fed back to predict the current one. Even worse, on TCD-TIMIT, a long-sequence dataset, DualLip and TransformerT2L often generate totally unsatisfying results that look like speechless video. By contrast, the lip frames generated by the NAR model ParaLip maintain high fidelity to the ground truth throughout, which demonstrates the effectiveness and robustness of NAR decoding.
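The quantitative metrics in Tables 1 and 2 can be computed roughly as follows. This is our own sketch rather than the paper's exact evaluation protocol (the landmark detector and any normalization used for LMD are not specified here): PSNR over 8-bit frames, and LMD as the mean Euclidean distance between corresponding mouth landmarks; SSIM follows Eq. (5).

```python
import numpy as np

def psnr(gen, gt, max_val=255.0):
    """Peak signal-to-noise ratio between two equally shaped frame arrays."""
    mse = np.mean((gen.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def lmd(gen_landmarks, gt_landmarks):
    """Landmark distance: mean Euclidean distance between corresponding mouth
    landmarks, averaged over landmarks and frames.

    gen_landmarks, gt_landmarks: arrays of shape (T, K, 2), e.g. K = 20 mouth
    landmarks per frame from a detector such as Dlib.
    """
    diff = gen_landmarks.astype(np.float64) - gt_landmarks.astype(np.float64)
    return np.linalg.norm(diff, axis=-1).mean()

# Per-video scores are then averaged over the test set (and over frames for PSNR/SSIM).
```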
5.2 Speed Comparison

In this section, we evaluate and compare the average inference latency of DualLip, TransformerT2L and ParaLip on both datasets. Furthermore, we study the relationship between inference latency and the target video length.

Comparison of Average Inference Latency. The average inference latency is the average time consumed to generate one video sample on the test set, measured in seconds. Table 3 shows the inference latency of all systems. It can be found that: 1) compared with DualLip, ParaLip speeds up inference by 13.09× and 19.12× on the two datasets; 2) TransformerT2L has the same structure as ParaLip but runs at only about half the speed of DualLip, which indicates that it is the NAR decoding manner in ParaLip that speeds up inference, rather than the modification of the model structure; 3) for the AR models, the time consumed for a single sentence rises to 0.5-1.5 seconds even on a GPU, which is unacceptable for real-world applications. By contrast, ParaLip addresses the inference latency problem satisfactorily.

| Datasets | Methods | Latency (s) | Speedup |
|---|---|---|---|
| GRID | DualLip | 0.299 | 1.00× |
| GRID | TransformerT2L | 0.689 | 0.43× |
| GRID | ParaLip | 0.022 | 13.09× |
| TCD-TIMIT | DualLip | 0.650 | 1.00× |
| TCD-TIMIT | TransformerT2L | 1.278 | 0.51× |
| TCD-TIMIT | ParaLip | 0.034 | 19.12× |

Table 3: Comparison of inference latency on the GRID and TCD-TIMIT datasets. The computations are conducted on a server with one NVIDIA 2080Ti GPU.

[Figure 4: Relationship between inference latency (seconds) and predicted video length for DualLip, TransformerT2L and ParaLip.]

Relationship between Inference Latency and Video Length. Here we study the speedup as the sequence length increases. The experiment is conducted on TCD-TIMIT, since its videos are not of fixed length. From Figure 4, it can be seen that: 1) ParaLip speeds up inference considerably due to its high parallelization compared with the AR models; 2) ParaLip is insensitive to the sequence length and holds an almost constant inference latency, whereas the inference latency of DualLip and TransformerT2L increases linearly as the sequence length grows. As a result, the speedup of ParaLip relative to DualLip or TransformerT2L also increases linearly with the sequence length.

5.3 Ablation Study

| Model | PSNR | SSIM | LMD | FID |
|---|---|---|---|---|
| Base model | 30.24 | 0.896 | 0.998 | 56.36 |
| +SSIM | 30.51 | 0.906 | 0.978 | 55.05 |
| +ADV | 25.70 | 0.736 | 2.460 | 65.88 |
| +SSIM+ADV | 28.36 | 0.873 | 1.077 | 39.74 |

Table 4: Ablation studies on the GRID dataset. The base model is trained only with the L1 loss; +SSIM means adding the structural similarity index loss, and +ADV means adding adversarial learning to the base model. FID is the Fréchet Inception Distance metric. To focus on frame quality, we provide the GT duration to eliminate the interference caused by discrepancies in the predicted length.

We conduct ablation experiments on the GRID dataset to analyze the effectiveness of the methods proposed in our work. All results are shown in Table 4. The experiments show that: 1) adding only the SSIM loss obtains the best PSNR/SSIM/LMD scores (+SSIM); 2) adding only adversarial training causes a performance drop on PSNR/SSIM/LMD, which is consistent with previous works (Song et al. 2019) (+ADV); 3) adding SSIM to the model with adversarial training greatly alleviates the damage to PSNR/SSIM/LMD brought by adversarial training, makes the GAN-based model more stable, and obtains the best FID score, which means the generated lips look more realistic (+SSIM+ADV). Previous works (Song et al. 2019) claim that 1) PSNR and SSIM cannot fully reflect visual quality, and 2) adversarial learning encourages the generated face to pronounce in diverse ways, leading to diverse lip movements and a degraded LMD score. Although +SSIM+ADV incurs marginal losses on the PSNR/SSIM/LMD scores, it obtains the best FID score and tends to generate distinct lip images with more realistic texture and local details (e.g. wrinkles, beard and teeth). The qualitative results are shown in Figure 5.

[Figure 5: The qualitative evaluation of adversarial learning, comparing +SSIM with +SSIM+ADV. +ADV tends to generate more realistic lip images.]
6 Further Discussions

In the foregoing sections, we train the duration predictor using the GT duration extracted by P2FA, but this is not applicable to the case where audio is absent even in the training set. Thus, to obtain the GT duration in this case, we tried a lipreading model with monotonic alignment search (MAS) (Tillmann et al. 1997; Kim et al. 2020) to find the alignment between text and lip frames. Specifically, we 1) first trained a lipreading model with the CTC loss on the training set; 2) traversed the training set and, for each (L, S) pair, extracted from the original CTC outputs (defined over the full vocabulary of size V) the entries corresponding to the m label tokens at every lip frame; 3) applied a softmax over these m label tokens at each lip frame to obtain the probability matrix

$$P(A(s_i, l_j)) = \frac{e^{O^{i}_{ctc, l_j}}}{\sum_{i'=1}^{m} e^{O^{i'}_{ctc, l_j}}},$$

where $O^{i}_{ctc, l_j}$ denotes the extracted CTC logit of label token $s_i$ at lip frame $l_j$; and 4) conducted MAS by dynamic programming to find the best alignment. This method achieves results similar to P2FA on the GRID dataset, but causes a deterioration on the TCD-TIMIT dataset: PSNR 27.09, SSIM 0.816, LMD 2.313. Theoretically, obtaining the alignment between lip frames and their transcript directly from themselves has more potential than obtaining this alignment indirectly from the audio sample and its transcript, since there are many cases where the mouth is already moving but the sound has not come out yet (e.g. the first milliseconds of a spoken sentence). This part of the video frames should be aligned to some words, but the corresponding audio piece is silent and will not be aligned to any word, causing contradictions. We think it is valuable to try more and better methods in this direction.
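The monotonic alignment search of step 4 can be implemented with a Viterbi-style dynamic program over the (token × frame) log-probability matrix from step 3. The sketch below is our own simplified version, in the spirit of the MAS used in Glow-TTS (Kim et al. 2020) rather than the authors' code: it finds the monotonic, non-skipping assignment of frames to tokens that maximizes total log-probability and converts it into per-token durations.

```python
import numpy as np

def monotonic_alignment_search(log_probs):
    """log_probs: (m, T) array, log P(A(s_i, l_j)) for token i and frame j.
    Returns per-token frame durations (length m) of the best monotonic alignment
    in which every token covers at least one frame."""
    m, T = log_probs.shape
    neg_inf = -1e9
    Q = np.full((m, T), neg_inf)
    Q[0, 0] = log_probs[0, 0]
    for j in range(1, T):
        for i in range(min(j + 1, m)):           # token i needs at least i frames before it
            stay = Q[i, j - 1]                   # frame j stays on token i
            move = Q[i - 1, j - 1] if i > 0 else neg_inf  # frame j advances to the next token
            Q[i, j] = log_probs[i, j] + max(stay, move)
    # Backtrack from the last token at the last frame.
    durations = np.zeros(m, dtype=int)
    i = m - 1
    for j in range(T - 1, -1, -1):
        durations[i] += 1
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return durations

# Toy example: 2 tokens, 5 frames.
lp = np.log(np.array([[0.9, 0.8, 0.2, 0.1, 0.1],
                      [0.1, 0.2, 0.8, 0.9, 0.9]]))
print(monotonic_alignment_search(lp))  # -> [2, 3]
```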
7 Conclusion

In this work, we point out and analyze the unacceptable inference latency and the intractable error propagation in AR T2L generation, and we propose a parallel decoding model, ParaLip, to circumvent these problems. Extensive experiments show that ParaLip generates lip movements of competitive quality compared with the state-of-the-art AR T2L model, exceeds the AR baseline TransformerT2L by a notable margin, and exhibits distinct superiority in inference speed, which opens the possibility of bringing T2L generation from the laboratory to industrial applications.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant No. 2020YFC0832505 and No. 62072397, and the Zhejiang Natural Science Foundation under Grant LR19F020006.

References

Abdelaziz, A. H.; Kumar, A. P.; Seivwright, C.; Fanelli, G.; Binder, J.; Stylianou, Y.; and Kajarekar, S. 2020. Audiovisual Speech Synthesis using Tacotron2. arXiv preprint arXiv:2008.00620.

Afouras, T.; Chung, J. S.; and Zisserman, A. 2018. Deep Lip Reading: A Comparison of Models and an Online Application. In Proc. Interspeech 2018, 3514-3518.

Assael, Y. M.; Shillingford, B.; Whiteson, S.; and De Freitas, N. 2016. LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599.

Bengio, S.; Vinyals, O.; Jaitly, N.; and Shazeer, N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, 1171-1179.

Chen, L.; Li, Z.; Maddox, R. K.; Duan, Z.; and Xu, C. 2018. Lip movements generation at a glance. In Proceedings of the European Conference on Computer Vision (ECCV), 520-535.

Chen, N.; Watanabe, S.; Villalba, J.; and Dehak, N. 2019. Non-Autoregressive Transformer Automatic Speech Recognition. arXiv preprint arXiv:1911.04908.

Chen, W.; Tan, X.; Xia, Y.; Qin, T.; Wang, Y.; and Liu, T.-Y. 2020. DualLip: A System for Joint Lip Reading and Generation. In Proceedings of the 28th ACM International Conference on Multimedia, MM '20, 1985-1993. New York, NY, USA: Association for Computing Machinery. ISBN 9781450379885.

Chung, J.; Jamaludin, A.; and Zisserman, A. 2017. You said that? In British Machine Vision Conference 2017, BMVC 2017.

Chung, J. S.; Senior, A.; Vinyals, O.; and Zisserman, A. 2017. Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3444-3453. IEEE.

Cooke, M.; Barker, J.; Cunningham, S.; and Shao, X. 2006. An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, 120(5): 2421.

Deng, C.; Ding, N.; Tan, M.; and Wu, Q. 2020. Length-Controllable Image Captioning. In ECCV.

Fan, B.; Wang, L.; Soong, F. K.; and Xie, L. 2015. Photo-real talking head with deep bidirectional LSTM. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4884-4888. IEEE.

Fried, O.; Tewari, A.; Zollhöfer, M.; Finkelstein, A.; Shechtman, E.; Goldman, D. B.; Genova, K.; Jin, Z.; Theobalt, C.; and Agrawala, M. 2019. Text-based editing of talking-head video. ACM Transactions on Graphics (TOG), 38(4): 1-14.

Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 1243-1252.

Ghazvininejad, M.; Levy, O.; Liu, Y.; and Zettlemoyer, L. 2019. Mask-Predict: Parallel Decoding of Conditional Masked Language Models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 6114-6123.

Gu, J.; Bradbury, J.; Xiong, C.; Li, V. O.; and Socher, R. 2018. Non-Autoregressive Neural Machine Translation. In International Conference on Learning Representations.

Harte, N.; and Gillen, E. 2015. TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech. IEEE Transactions on Multimedia, 17(5): 603-615.

Higuchi, Y.; Watanabe, S.; Chen, N.; Ogawa, T.; and Kobayashi, T. 2020. Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict. In INTERSPEECH.

Kim, J.; Kim, S.; Kong, J.; and Yoon, S. 2020. Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. Advances in Neural Information Processing Systems, 33.

King, D. E. 2009. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research.

KR, P.; Mukhopadhyay, R.; Philip, J.; Jha, A.; Namboodiri, V.; and Jawahar, C. 2019. Towards Automatic Face-to-Face Translation. In Proceedings of the 27th ACM International Conference on Multimedia, 1428-1436.

Kumar, R.; Sotelo, J.; Kumar, K.; de Brébisson, A.; and Bengio, Y. 2017. ObamaNet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442.

Lee, J.; Mansimov, E.; and Cho, K. 2018. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. In EMNLP, 1173-1182.

Lin, Z.; Zhao, Z.; Li, H.; Liu, J.; Zhang, M.; Zeng, X.; and He, X. 2021. SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory, 1359-1367. New York, NY, USA: Association for Computing Machinery. ISBN 9781450386517.

Liu, J.; Ren, Y.; Zhao, Z.; Zhang, C.; Huai, B.; and Yuan, J.
2020. FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire. In Proceedings of the 28th ACM International Conference on Multimedia, 4328-4336.

Ma, X.; Zhou, C.; Li, X.; Neubig, G.; and Hovy, E. 2019. FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow. In EMNLP-IJCNLP, 4273-4283.

Mao, X.; Li, Q.; Xie, H.; Lau, R. Y.; Wang, Z.; and Paul Smolley, S. 2017. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2794-2802.

Mathieu, M.; Couprie, C.; and LeCun, Y. 2016. Deep multi-scale video prediction beyond mean square error. In 4th International Conference on Learning Representations, ICLR 2016.

Miao, C.; Liang, S.; Chen, M.; Ma, J.; Wang, S.; and Xiao, J. 2020. Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7209-7213. IEEE.

Peng, K.; Ping, W.; Song, Z.; and Zhao, K. 2020. Non-Autoregressive Neural Text-to-Speech. In ICML.

Prajwal, K.; Mukhopadhyay, R.; Namboodiri, V. P.; and Jawahar, C. 2020. A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild. In Proceedings of the 28th ACM International Conference on Multimedia, 484-492.

Ren, Y.; Ruan, Y.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; and Liu, T.-Y. 2019. FastSpeech: Fast, Robust and Controllable Text to Speech. In Advances in Neural Information Processing Systems, 3165-3174.

Shen, J.; Pang, R.; Weiss, R. J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.; et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779-4783. IEEE.

Song, Y.; Zhu, J.; Li, D.; Wang, A.; and Qi, H. 2019. Talking Face Generation by Conditional Recurrent Adversarial Network. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 919-925. International Joint Conferences on Artificial Intelligence Organization.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104-3112.

Suwajanakorn, S.; Seitz, S. M.; and Kemelmacher-Shlizerman, I. 2017. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4): 1-13.

Tillmann, C.; Vogel, S.; Ney, H.; and Zubiaga, A. 1997. A DP-based search using monotone alignments in statistical translation. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, 289-296.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008.

Vougioukas, K.; Petridis, S.; and Pantic, M. 2019. End-to-End Speech-Driven Realistic Facial Animation with Temporal GANs. In CVPR Workshops, 37-40.

Wang, L.; Han, W.; Soong, F. K.; and Huo, Q. 2011. Text-driven 3D photo-realistic talking head. In Twelfth Annual Conference of the International Speech Communication Association.

Wang, Z. 2004. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing.

Wu, L.; Tan, X.; He, D.; Tian, F.; Qin, T.; Lai, J.; and Liu, T.-Y. 2018.
Beyond Error Propagation in Neural Machine Translation: Characteristics of Language Also Matter. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 3602-3611. Brussels, Belgium: Association for Computational Linguistics.

Yu, C.; Lu, H.; Hu, N.; Yu, M.; Weng, C.; Xu, K.; Liu, P.; Tuo, D.; Kang, S.; Lei, G.; et al. 2019. DurIAN: Duration Informed Attention Network for Multimodal Synthesis. arXiv preprint arXiv:1909.01700.

Yu, L.; Yu, J.; and Ling, Q. 2019. Mining audio, text and visual information for talking face generation. In 2019 IEEE International Conference on Data Mining (ICDM), 787-795. IEEE.

Yuan, J.; and Liberman, M. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5): 3878.

Zheng, R.; Zhu, Z.; Song, B.; and Ji, C. 2020. Photorealistic Lip Sync with Adversarial Temporal Convolutional Networks. arXiv preprint arXiv:2002.08700.

Zhou, H.; Liu, Y.; Liu, Z.; Luo, P.; and Wang, X. 2019. Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 9299-9306.

Zhu, H.; Huang, H.; Li, Y.; Zheng, A.; and He, R. 2020. Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning. In Bessiere, C., ed., Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 2362-2368. International Joint Conferences on Artificial Intelligence Organization. Main track.