Non-Autoregressive Coarse-to-Fine Video Captioning

Bang Yang,¹ Yuexian Zou,¹,²,* Fenglin Liu,¹ Can Zhang¹
¹ ADSPLAB, School of ECE, Peking University, Shenzhen, China
² Peng Cheng Laboratory
{yb.ece, zouyx, fenglinliu98, zhangcan}@pku.edu.cn
* Corresponding author.

Abstract

It is encouraging to see that progress has been made in bridging videos and natural language. However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and they tend to generate generic descriptions due to the insufficient training of visual words (e.g., nouns and verbs) and an inadequate decoding paradigm. In this paper, we propose a non-autoregressive decoding based model with a coarse-to-fine captioning procedure to alleviate these defects. In implementation, we employ a bi-directional self-attention based network as our language model to achieve inference speedup, based on which we decompose the captioning procedure into two stages, where the model has different focuses. Specifically, given that visual words determine the semantic correctness of captions, we design a mechanism of generating visual words to not only promote the training of scene-related words but also capture relevant details from videos to construct a coarse-grained sentence template. Thereafter, we devise dedicated decoding algorithms that fill in the template with suitable words and modify inappropriate phrasing via iterative refinement to obtain a fine-grained description. Extensive experiments on two mainstream video captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach achieves state-of-the-art performance, generates diverse descriptions, and obtains high inference efficiency.

Introduction

Video captioning aims to automatically describe video contents with plausible sentences, which could be helpful for video retrieval, assisting visually-impaired people, and so on. In recent years, neural captioning methods have risen to prominence, and they generally adopt the encoder-decoder framework (Venugopalan et al. 2015), where videos are encoded into sequences of vectors with Convolutional Neural Networks (CNNs), and captions are often decoded from these vectors via Recurrent Neural Networks (RNNs) or Transformer (Hori et al. 2017). For real-time industrial applications, a good video captioning system should have low inference latency and describe scene-related details. However, existing methods mostly have deficiencies in both aspects.

Figure 1: Illustration of the proposed coarse-to-fine captioning procedure, where (a) visual words (i.e., nouns and verbs in this paper) are generated in parallel first to form a coarse-grained template, based on which (b) a fine-grained description is yielded via dedicated decoding algorithms.
For caption generation, current methods stick to autoregressive (AR) decoding, i.e., conditioning each word on the previously generated outputs. Such a sequential manner results in high inference latency, which is further amplified by the fact that fine-grained descriptions are generally long (Gella, Lewis, and Rohrbach 2018). Recently, non-autoregressive (NA) decoding, which generates words in parallel to achieve significant inference speedup, has become an emerging focus in neural machine translation (NMT) (Gu et al. 2018; Wang et al. 2019b; Shao et al. 2020). Nevertheless, NA decoding suffers from a poor approximation to the target distribution after removing the sequential dependency (Gu et al. 2018), leading to a large performance gap compared with AR counterparts.

For describing scene-related details, visual words (e.g., nouns and verbs), which are visually grounded and highly associated with semantic correctness (Song et al. 2017), deserve more attention than non-visual words (e.g., determiners and prepositions). By analyzing existing caption corpora (Chen and Dolan 2011; Xu et al. 2016), we find that visual words are far fewer in number than non-visual words. Therefore, treating these two kinds of words equally when training a captioning model, as most previous works do (Aafaq et al. 2019; Huang et al. 2020), causes insufficient training of meaningful words, which can manifest as a lack of relevant details and diversity in the generated descriptions (Dai, Fidler, and Lin 2018; Sammani and Melas-Kyriazi 2020). Besides, it is challenging for such models to generate satisfying captions given their proneness to error accumulation (Bengio et al. 2015), so a flexible decoding paradigm that supports word modification is also needed.

In this paper, we propose a Non-Autoregressive Coarse-to-Fine (NACF) model to tackle the slow inference speed and unsatisfactory caption quality concerns in video captioning. For achieving inference speedup, we employ a bi-directional self-attention based network (Vaswani et al. 2017) as our language model and train it with the masked language modeling objective (Devlin et al. 2019) so that any subset of target words can be predicted simultaneously based on the rest. For improving caption quality, we propose an alternative paradigm that decomposes the captioning procedure into two stages, where the model has different focuses. Specifically, we propose a mechanism of generating visual words, i.e., visual word generation, to not only promote the training of scene-related words but also require the model to generate a coarse-grained sentence template at the first stage of the inference phase. As shown in Fig. 1(a), the generated words in the template (e.g., "driving") summarize a first-glance gist of the scene, which could be instructive for the subsequent generation process. Thereafter, we devise dedicated decoding algorithms to produce fine-grained descriptions at the second stage. As shown in Fig. 1(b), the decoding algorithms first complete the template by filling in suitable words and then, if necessary, iteratively mask out and reconsider some inappropriate words that the language model is least confident about, so as to ensure sentence fluency or capture more relevant details, e.g., a more precise phrase ("jeep cherokee") is generated after deliberation.

Our main contributions are summarized as follows¹:
- We propose a Non-Autoregressive Coarse-to-Fine model to deal with the slow inference speed and unsatisfactory caption quality problems in video captioning.
- We design a mechanism of generating visual words and devise dedicated decoding algorithms to achieve a coarse-to-fine rather than word-by-word captioning procedure, which captures more visually grounded details from videos.
- Extensive experiments on MSVD and MSR-VTT demonstrate the effectiveness of our approach, which achieves state-of-the-art performance, generates diverse descriptions, and obtains high inference efficiency.

¹The code is available at https://github.com/yangbang18/Non-Autoregressive-Video-Captioning.

Related Work

Video Captioning. With the rapid development of deep learning, neural captioning methods that follow the encoder-decoder framework have risen to prominence. One of the first works to adopt such a framework is (Venugopalan et al. 2015), where captions are generated by an LSTM given the mean-pooled representation over all frames. Later in (Yao et al. 2015; Song et al. 2017), temporal attention is proposed to adaptively determine which subset of frames to focus on at each decoding step. In particular, Song et al. propose to implicitly distinguish the importance of visual and non-visual words with a hierarchical attention mechanism (Song et al. 2017). Besides exploiting the temporal structure of videos, the utilization of multiple modalities and high-level semantics also draws great attention. For instance, Xu et al. propose multimodal attention to selectively focus on content-related modalities (Xu et al. 2017). Liu et al. leverage a pre-trained network to extract attributes, which are defined as the properties observed in visual contents with rich semantic cues, and use them to align visual features (Liu et al. 2019). Most recently, Huang et al. propose to couple attribute prediction with caption generation in an end-to-end manner (Huang et al. 2020). Pan et al. propose to exploit spatio-temporal object interactions and distill such knowledge into the captioning model (Pan et al. 2020). However, all these methods adopt sequential decoding and treat visual and non-visual words equally in terms of the loss function.

Non-Autoregressive Decoding. Due to its high inference efficiency, NA decoding has attracted widespread attention in the NMT community. By removing the sequential dependency, NA decoding can generate all words in one shot to speed up decoding (Gu et al. 2018), but at the cost of inferior accuracy that manifests as token repetitions in the generated outputs (Wang et al. 2019b; Guo et al. 2019; Lee, Mansimov, and Cho 2018). To compensate for the performance degradation, Guo et al. integrate strong conditional signals into the decoder inputs to benefit the learning of internal dependencies within a sentence (Guo et al. 2019). Besides one-shot generation, some works propose to iteratively refine the sentences so that the model can condition on part or the whole of the previous outputs (Lee, Mansimov, and Cho 2018; Ghazvininejad et al. 2019; Gu, Wang, and Zhao 2019; Mansimov, Wang, and Cho 2019). The downside of these methods is that from-scratch parallel generation, i.e., employing completely unknown sequences as the decoder inputs, often leads to translation errors in the early iterations due to insufficient context, which can greatly influence the subsequent predictions.
Summary. Rather than designing a sophisticated architecture, our work contributes an alternative decoding paradigm, i.e., generating descriptions from coarse-grained to fine-grained under the scheme of NA decoding, to produce semantically correct video captions with higher efficiency. This work follows the iterative refinement approaches in NMT (Ghazvininejad et al. 2019), but the core difference is that we propose to capture visual words first to formulate partially observed sequences, which provide the model with rich contextual information to alleviate description ambiguity and thus eventually enhance the caption quality.

Approach

In this section, we first introduce the architecture of our Non-Autoregressive Coarse-to-Fine (NACF) model, followed by the proposed visual word generation. Then, we describe the coarse-to-fine captioning procedure during inference, where three dedicated decoding algorithms are presented.

Architecture

As shown in Fig. 2, our NACF comprises three modules: a CNN-based encoder, a length predictor and a bi-directional self-attention based decoder.

Figure 2: An overview of our proposed NACF architecture, which comprises (a) a CNN-based encoder, (b) a length predictor module and (c) a bi-directional self-attention based decoder.

Encoder. Given a sequence of video frames/clips of length K, we feed it into pre-trained 2D/3D CNNs to obtain visual features $V = \{v_k\}_{k=1}^{K} \in \mathbb{R}^{K \times d_v}$, which are further encoded into compact representations $R \in \mathbb{R}^{K \times d_m}$ via an input embedding layer (IEL), i.e., $R = f_{\mathrm{IEL}}(V)$. Here $f_{\mathrm{IEL}}$ adopts the shortcut connection of highway networks (Srivastava, Greff, and Schmidhuber 2015), so it can be formalized as follows (omitting biases for clarity):

$$f_{\mathrm{IEL}}(V) = \mathrm{BN}\big(G \odot \bar{V} + (1 - G) \odot \hat{V}\big), \quad \bar{V} = V W_{e1}, \quad \hat{V} = \tanh(\bar{V} W_{e2}), \quad G = \sigma(\bar{V} W_{e3}) \tag{1}$$

where BN denotes batch normalization (Ioffe and Szegedy 2015), $\odot$ is the element-wise product, $\sigma$ is the sigmoid function, $W_{e1} \in \mathbb{R}^{d_v \times d_m}$, and $\{W_{e2}, W_{e3}\} \subset \mathbb{R}^{d_m \times d_m}$. When considering multiple modalities, e.g., image and motion modalities, we simply apply concatenation to obtain $R \in \mathbb{R}^{2K \times d_m}$.

Length Predictor. Unlike AR decoding, which can automatically decide the sequence length N by predicting the end-of-sentence token, NA decoding must know N ahead of time. So a length predictor (LP) is introduced to predict a length distribution $L \in \mathbb{R}^{N_{max}}$ given the outputs R of the encoder:

$$L = f_{\mathrm{LP}}(R) = \mathrm{Softmax}\big(\mathrm{ReLU}(\mathrm{MP}(R) W_{l1}) W_{l2}\big) \tag{2}$$

where MP denotes mean pooling, $W_{l1} \in \mathbb{R}^{d_m \times d_m}$, $W_{l2} \in \mathbb{R}^{d_m \times N_{max}}$, and $N_{max}$ is the predefined maximum sequence length. Given the ground-truth length distribution $L^{*}$, whose j-th element denotes the percentage of sentences of length j in the training corpus for a specific video, we minimize the Kullback-Leibler (KL) divergence between $L^{*}$ and $L$:

$$\mathcal{L}_{len} = D_{KL}(L^{*} \,\|\, L) = \sum_{j=1}^{N_{max}} l^{*}_{j} \log \frac{l^{*}_{j}}{l_{j}} \tag{3}$$

During training, we directly use the sequence length of the ground-truth sentences. As for inference, we will describe the utilization of the predicted length distribution L in the experimental settings.
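To make Eqs. 1-3 concrete, the following is a minimal PyTorch sketch of the input embedding layer and the length predictor. The module and variable names are ours rather than the released implementation's, and details such as normalization placement may differ; it is a sketch under the dimensions stated above, not the official code.

```python
import torch
import torch.nn as nn

class InputEmbeddingLayer(nn.Module):
    """Highway-style fusion of projected visual features (Eq. 1)."""
    def __init__(self, d_v, d_m):
        super().__init__()
        self.proj = nn.Linear(d_v, d_m)        # W_e1: d_v -> d_m
        self.transform = nn.Linear(d_m, d_m)   # W_e2
        self.gate = nn.Linear(d_m, d_m)        # W_e3
        self.bn = nn.BatchNorm1d(d_m)

    def forward(self, v):                      # v: (K, d_v) features of one video
        v_bar = self.proj(v)                               # V_bar = V W_e1
        v_hat = torch.tanh(self.transform(v_bar))          # V_hat = tanh(V_bar W_e2)
        g = torch.sigmoid(self.gate(v_bar))                # G = sigma(V_bar W_e3)
        return self.bn(g * v_bar + (1.0 - g) * v_hat)      # BN(G*V_bar + (1-G)*V_hat)

class LengthPredictor(nn.Module):
    """Mean-pool the encoder outputs and predict a length distribution (Eq. 2)."""
    def __init__(self, d_m, n_max):
        super().__init__()
        self.fc1 = nn.Linear(d_m, d_m)         # W_l1
        self.fc2 = nn.Linear(d_m, n_max)       # W_l2

    def forward(self, r):                      # r: (K, d_m) encoder outputs
        pooled = r.mean(dim=0)                 # mean pooling over frames/clips
        return torch.softmax(self.fc2(torch.relu(self.fc1(pooled))), dim=-1)
```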
Decoder. To obtain a non-autoregressive decoder, we adopt a one-layer Transformer decoder (Vaswani et al. 2017) with two modifications. First, we remove the causal mask in the self-attention layer; the decoder thus becomes bi-directional, so the prediction of each token can use both left and right contexts. Second, we follow works in NMT (Lee, Mansimov, and Cho 2018; Guo et al. 2019) and enhance the decoder inputs by integrating the copied source information (the dashed line in Fig. 2(c)).

To train the model, we use the masked language modeling objective (a.k.a. the cloze task) of BERT (Devlin et al. 2019). Specifically, some tokens in a ground-truth sentence Y are randomly masked out to obtain a partially observed sequence $Y_{obs}$ and a masked (unobserved) sequence $Y_{mask} = Y \setminus Y_{obs}$. Then the decoder takes $Y_{obs}$ and the representations R as inputs to predict the probability distribution over words:

$$p_{\theta}(y \mid Y_{obs}, R) = f_{dec}(Y_{obs}, R) \tag{4}$$

where $f_{dec}$ denotes the transformation within the decoder². We only minimize the negative log-likelihood of $Y_{mask}$:

$$\mathcal{L}_{mlm} = -\sum_{y \in Y_{mask}} \log p_{\theta}(y \mid Y_{obs}, R) \tag{5}$$

Unlike BERT, which uses a small masking ratio (e.g., 15%), we use a uniformly distributed ratio ranging from $\beta_l$ to $\beta_h$ so that the model is trained with examples of different difficulties. Next, we elaborate on how to generate meaningful visual words with the proposed decoder.

²Detailed formulation of $f_{dec}$ is left to the technical appendix.

Visual Word Generation

For non-autoregressive video captioning, visual word generation is proposed for two purposes: promoting the training of scene-related words and generating coarse-grained templates that serve as starting points at the inference phase. To achieve that, we directly use the proposed decoder without introducing extra parameters. Formally, given a ground-truth sentence $Y = \{y_n\}_{n=1}^{N}$ of length N, we first construct a corresponding target sequence $Y^{vis} = \{y^{vis}_{n}\}_{n=1}^{N}$ as follows:

$$y^{vis}_{n} = \begin{cases} y_{n} & \text{if } \mathrm{POS}(y_{n}) \in \{\text{noun}, \text{verb}\} \\ \text{[mask]} & \text{otherwise} \end{cases} \tag{6}$$

where $\mathrm{POS}(\cdot)$ denotes the part-of-speech of a word. Then the decoder is forced to predict $Y^{vis}$ without available word information, i.e., $Y^{vis}_{obs} = \text{[vis]}$ (in practice, a sequence of the same special token [vis]). Hence the loss function for visual word generation is defined as:

$$\mathcal{L}_{vis} = -\sum_{y \in Y^{vis}} \log p_{\theta}(y \mid \text{[vis]}, R) \tag{7}$$

As the generation process solely depends on R, the generated visual words may not be comprehensive. But, as we will verify later, they are instructive for the follow-up caption generation process. Finally, the overall loss function of our approach is formulated as:

$$\mathcal{L}_{NACF} = \mathcal{L}_{len} + \mathcal{L}_{mlm} + \lambda \mathcal{L}_{vis} \tag{8}$$

where λ is set to 0.8 empirically.
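As an illustration of Eq. 6, the sketch below builds the visual-word target with NLTK's part-of-speech tagger (the toolkit the paper uses for POS tagging). The auxiliary-verb filter and the exact tag set are our assumptions, chosen so the output matches the template shown in Fig. 1; the real preprocessing may differ.

```python
# Minimal sketch of constructing Y^vis (Eq. 6) with NLTK POS tagging.
# Assumption: auxiliary verbs such as "is"/"are" are treated as non-visual,
# which matches the template "[ ] man [ ] driving [ ] [ ] road" in Fig. 1.
import nltk  # requires: nltk.download("averaged_perceptron_tagger")

MASK = "[mask]"
AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being",
               "do", "does", "did", "has", "have", "had"}

def build_visual_word_target(tokens):
    """tokens: tokenized ground-truth caption -> Y^vis of the same length."""
    tagged = nltk.pos_tag(tokens)  # Penn Treebank tags, e.g. ('man', 'NN'), ('driving', 'VBG')
    return [
        tok if tag.startswith(("NN", "VB")) and tok not in AUXILIARIES else MASK
        for tok, tag in tagged
    ]

# build_visual_word_target("a man is driving down the road".split())
# -> ['[mask]', 'man', '[mask]', 'driving', '[mask]', '[mask]', 'road']
```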
Coarse-to-Fine Captioning

To yield plausible descriptions during inference, our captioning procedure is decomposed into two stages. At the first stage, we generate a coarse-grained template $Y^{(0)}$ and collect its confidence $C^{(0)}$ given the predicted sequence length N and the representations R:

$$y^{(0)}_{n}, c^{(0)}_{n} = (\arg)\max_{w} \; p_{\theta}(y = w \mid \text{[vis]}, R) \tag{9}$$

where $y^{(0)}_{n}$ is either the [mask] token or a visual word (see Eq. 6). In the subsequent generation process, four variables are introduced: the number of iterations T, the observed sequence $Y^{(t)}_{obs}$ at the t-th iteration ($t \in [1, T]$), the prediction result $Y^{(t)}$ and its confidence $C^{(t)}$. Specifically, $Y^{(1)}_{obs}$ is initialized with the visual words in the coarse-grained template $Y^{(0)}$:

$$Y^{(1)}_{obs} = \{y^{(0)}_{n} \mid y^{(0)}_{n} \neq \text{[mask]}\} \tag{10}$$

Then, at the second stage, three dedicated decoding algorithms, i.e., Mask-Predict (MP) (Ghazvininejad et al. 2019), Easy-First (EF) and Left-to-Right (L2R), are introduced to generate fine-grained descriptions.

Mask-Predict (MP). This algorithm iterates over two steps at the t-th iteration: Mask, where the $m_t$ tokens with the lowest confidence are masked out, and Predict, where those masked tokens are reconsidered based on the remaining $N - m_t$ tokens. The order of these two steps is switched in this paper because $Y^{(1)}_{obs}$ is already available. We only update the prediction results $Y^{(t)}$ and confidence $C^{(t)}$ of the unobserved tokens given $Y^{(t)}_{obs}$ and R:

$$y^{(t)}_{n}, c^{(t)}_{n} = \begin{cases} (\arg)\max_{w} \; p_{\theta}(y = w \mid Y^{(t)}_{obs}, R) & \text{if } n \in I_{t} \\ y^{(t-1)}_{n}, c^{(t-1)}_{n} & \text{otherwise} \end{cases} \tag{11}$$

where $I_{t}$ denotes the index set of unobserved tokens:

$$I_{t} = \{n \mid y^{(t-1)}_{n} \notin Y^{(t)}_{obs}\} \tag{12}$$

As a low confidence $c^{(t)}_{n}$ means the token $y^{(t)}_{n}$ is incompatible with the others, reconsidering such a token could benefit caption quality. So, for the next iteration, $Y^{(t+1)}_{obs}$ is defined as:

$$Y^{(t+1)}_{obs} = \{y^{(t)}_{j} \mid j \in \mathrm{topk}_{n}(C^{(t)}, k = N - m_{t+1})\} \tag{13}$$

where we use a linear decay ratio r to decide $m_t$ and make sure there is at least one token to be reconsidered:

$$r = \frac{T - t + 1}{T}, \quad m_{t} = \max(\lfloor N \cdot r \rfloor, 1) \tag{14}$$

Easy-First (EF). This algorithm generates the q tokens with the highest confidence among the unobserved tokens at each iteration. Given N and u (the cardinality of $Y^{(1)}_{obs}$), the EF algorithm needs $T = \lceil (N - u)/q \rceil$ iterations and can be briefly formulated as:

$$Y^{(t+1)}_{obs} = Y^{(t)}_{obs} \cup \{y^{(t)}_{I_{t}, j} \mid j \in \mathrm{topk}_{n}(C^{(t)}_{I_{t}}, k = q)\} \tag{15}$$

where $Y^{(t)}_{I_{t}}$ and $C^{(t)}_{I_{t}}$ denote the prediction and confidence of the unobserved tokens at the t-th iteration, respectively. $Y^{(t)}$ and $C^{(t)}$ are calculated by Eq. 11 while $I_{t}$ is computed by Eq. 12. Since the visual words in $Y^{(1)}_{obs}$ are not modified during generation, we can reconsider them based on $\bar{Y} = Y^{(T)} \setminus Y^{(1)}_{obs}$ and R:

$$y^{(T)}_{n}, c^{(T)}_{n} = (\arg)\max_{w} \; p_{\theta}(y = w \mid \bar{Y}, R) \quad \text{s.t. } y^{(0)}_{n} \neq \text{[mask]} \tag{16}$$

Left-to-Right (L2R). In contrast to EF, this algorithm is monotonic, i.e., it generates q tokens among the unobserved tokens from left to right at each iteration. The L2R algorithm also needs $T = \lceil (N - u)/q \rceil$ iterations and can be briefly defined as:

$$Y^{(t+1)}_{obs} = Y^{(t)}_{obs} \cup \{y^{(t)}_{I_{t}, 1}, \ldots, y^{(t)}_{I_{t}, q}\} \tag{17}$$

where $Y^{(t)}_{I_{t}}$ denotes the prediction of the unobserved tokens at the t-th iteration. $Y^{(t)}$ (and $C^{(t)}$) is computed by Eq. 11 while $I_{t}$ is calculated by Eq. 12. Similar to EF, one more iteration can be added to reconsider the visual words (Eq. 16). Although we let both L2R and EF generate q words at each iteration, a fixed number of iterations T can instead be set for both of them to produce captions in constant time. Besides, both the L2R and EF algorithms can cooperate with the MP algorithm to iteratively refine sentences if necessary.
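The following is a minimal sketch of the CT-Mask-Predict loop described above (Eqs. 9-14), assuming a decoder(tokens, R) callable that returns the argmax word and its confidence for every position of a partially observed input; all names are illustrative rather than the released implementation.

```python
import math

def ct_mask_predict(decoder, R, N, T, MASK="[mask]", VIS="[vis]"):
    # Stage 1: coarse-grained template from an all-[vis] input (Eq. 9);
    # each position is either a visual word or [mask].
    tokens, conf = decoder([VIS] * N, R)
    observed = [tok != MASK for tok in tokens]           # Eq. 10

    # Stage 2: iterative refinement (Eqs. 11-14).
    for t in range(1, T + 1):
        inputs = [tok if obs else MASK for tok, obs in zip(tokens, observed)]
        new_tokens, new_conf = decoder(inputs, R)
        for n in range(N):                               # Eq. 11: update unobserved slots only
            if not observed[n]:
                tokens[n], conf[n] = new_tokens[n], new_conf[n]
        if t < T:
            # Eqs. 13-14: keep the N - m_{t+1} most confident tokens observed,
            # i.e., mask out the m_{t+1} least confident ones for the next round.
            r = (T - t) / T
            m_next = max(math.floor(N * r), 1)
            masked = set(sorted(range(N), key=lambda n: conf[n])[:m_next])
            observed = [n not in masked for n in range(N)]
    return tokens
```

Because T is fixed (T = 5 in the default setting), this loop runs for a constant number of iterations regardless of the caption length.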
Example. Each of the decoding algorithms mentioned above can start either with the generated coarse-grained template (Eq. 10) or with a completely unknown sequence, by setting $Y^{(1)}_{obs} = \emptyset$ and $I_{1} = \{1, 2, \ldots, N\}$ (Eq. 12). To differentiate these two versions, we add the prefix "CT-" to the algorithms that utilize coarse-grained templates. As shown in Fig. 3, the original versions of the algorithms, i.e., MP, EF and L2R, hallucinate concepts (e.g., "trampoline" and "ball") that do not exist in the visual contents due to the limited contextual information of completely unknown sequences. But under the guidance of some generated visual words (i.e., "girl" and "field"), the algorithms produce more precise descriptions, which will be verified later.

Figure 3: Illustration of how to generate captions using different decoding algorithms (best viewed in color). The newly generated words (in color) are predicted based on the observed words (in black). We set T = 3 for the MP algorithm and q = 2 for the L2R and EF algorithms. The prefix CT- means using the coarse-grained templates (Eq. 10).

Table 1: Comparison with the state-of-the-art methods on MSVD and MSR-VTT, where AR-B and NA-B are our baselines.

Model                         MSVD                         MSR-VTT
                              BLEU@4  METEOR  CIDEr-D      BLEU@4  METEOR  CIDEr-D
STAT (Yan et al. 2019)        52.0    33.3    73.8         39.3    27.1    43.8
GRU-EVE (Aafaq et al. 2019)   47.9    35.0    78.1         38.3    28.4    48.1
POS-CG (Wang et al. 2019a)    52.5    34.1    88.7         42.0    28.2    48.7
MARN (Pei et al. 2019)        48.6    35.1    92.2         40.4    28.1    47.1
MAD-SAP (Huang et al. 2020)   53.3    35.4    90.8         41.3    28.3    48.5
STG-KD (Pan et al. 2020)      52.2    36.9    93.0         40.5    28.3    47.1
AR-B                          48.7    35.3    91.8         40.5    28.7    49.1
NA-B                          53.7    35.5    92.8         40.4    28.0    47.6
Our NACF                      55.6    36.2    96.3         42.0    28.7    51.4

Experiments

In this section, we evaluate our NACF on two datasets: Microsoft Video Description (MSVD) (Chen and Dolan 2011) and MSR-Video To Text (MSR-VTT) (Xu et al. 2016).

Experimental Settings

Datasets. MSVD contains 1,970 video clips and roughly 80,000 English sentences. We follow the split settings of prior works (Pei et al. 2019; Pan et al. 2020), i.e., 1,200, 100 and 670 videos for training, validation and testing, respectively. MSR-VTT consists of 10,000 video clips, each of which has 20 captions and a category tag. Following the official split, we use 6,513, 497 and 2,990 videos for training, validation and testing, respectively. The vocabulary size is 9,468 for MSVD and 10,547 for MSR-VTT.

Feature Extraction. We follow (Pei et al. 2019) and opt for the same type of features, i.e., 2048-D image features from ResNet-101 (He et al. 2016) pre-trained on the ImageNet dataset (Deng et al. 2009), 2048-D motion features from ResNeXt-101 with 3D convolutions (Hara, Kataoka, and Satoh 2018) pre-trained on the Kinetics dataset (Kay et al. 2017), and all category tags included in MSR-VTT.

Length Beam and Teacher Rescoring. Following the common practice of noisy parallel decoding during inference (Gu et al. 2018; Wang et al. 2019b), we select the top B length candidates from the predicted length distribution L and decode the same example with different lengths in parallel. An autoregressive counterpart (i.e., the AR-B introduced later) is then used to re-score these B candidates. We finally select the captions with the highest confidence as hypotheses.

Parameter Settings. The maximum sequence length $N_{max}$ is set to 20 for MSVD and 30 for MSR-VTT. We empirically set K = 8 for each modality. For the decoder, we adopt 1 decoder layer, 512 model dimensions, 2,048 hidden dimensions and 8 attention heads per layer. Both word and position embeddings are implemented as trainable 512-D embedding layers. For regularization, we use 0.5 dropout and $5 \times 10^{-4}$ L2 weight decay. We train batches of 64 video-sentence pairs using ADAM (Kingma and Ba 2015) with an initial learning rate of $5 \times 10^{-3}$. We stop training our model once 50 epochs are reached. We use the NLTK toolkit (Bird, Klein, and Loper 2009) for part-of-speech tagging. In the following experiments, our NACF uses the CT-MP algorithm with T = 5 iterations and beam size B = 6 unless otherwise specified.
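Putting the inference-time pieces together, the sketch below shows how length beams, template-based refinement and teacher rescoring could be combined. The length_predictor, decoder and ar_score callables are placeholders, and the candidates are scored sequentially here, whereas the paper decodes the B length candidates in parallel; this is a sketch of the procedure, not the official pipeline.

```python
def caption_video(R, length_predictor, decoder, ar_score, B=6, T=5):
    """Sketch of NACF inference with B length beams and AR re-scoring."""
    length_probs = length_predictor(R)                 # distribution over lengths 1..N_max
    top_lengths = sorted(range(1, len(length_probs) + 1),
                         key=lambda n: length_probs[n - 1],
                         reverse=True)[:B]
    # One CT-Mask-Predict decode per length candidate (see the earlier sketch).
    candidates = [ct_mask_predict(decoder, R, N, T) for N in top_lengths]
    # Teacher rescoring: the autoregressive baseline (AR-B) scores each candidate
    # and the highest-scoring caption is kept as the hypothesis.
    return max(candidates, key=lambda caption: ar_score(caption, R))
```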
Evaluation Metrics. We report three common metrics: BLEU (Papineni et al. 2002), METEOR (Banerjee and Lavie 2005) and CIDEr-D (Vedantam, Lawrence Zitnick, and Parikh 2015). All metrics are computed by the Microsoft COCO Evaluation Server (Chen et al. 2015).

Compared Approaches. We compare our NACF with the following state-of-the-art methods: STAT (Yan et al. 2019) and GRU-EVE (Aafaq et al. 2019), which capture spatio-temporal dynamics via an attention mechanism and the Short Fourier Transform, respectively; POS-CG (Wang et al. 2019a), which uses global syntactic part-of-speech information; MARN (Pei et al. 2019), which leverages a memory network to capture cross-video contents; MAD-SAP (Huang et al. 2020), which predicts a concise set of attributes at each decoding step; and STG-KD (Pan et al. 2020), which distills the knowledge of spatio-temporal object interactions. Additionally, we consider two baselines, namely an autoregressive baseline (AR-B) and a non-autoregressive baseline (NA-B). Specifically, AR-B has a similar architecture to our NACF, but it excludes the length predictor and includes the causal mask in the self-attention layer. NA-B is the same as our NACF but excludes visual word generation, i.e., λ = 0 (Eq. 8).

Performance Comparison

The quantitative results in Table 1 illustrate that our NACF achieves state-of-the-art performance on both the MSVD and MSR-VTT datasets, which mainly benefits from the proposed coarse-to-fine captioning procedure. Since we opt for the same type of features as MARN and similar features to MAD-SAP and STG-KD, fair comparisons between these methods and our approach can be guaranteed. Notably, our NACF achieves significant improvements on CIDEr-D, e.g., a relative improvement of 3.5% on MSVD compared with STG-KD and 6.0% on MSR-VTT compared with MAD-SAP. As the CIDEr-D metric penalizes often-seen but uninformative n-grams in the dataset, the superior performance on CIDEr-D indicates that our NACF can capture more scene-related keywords from videos. It is noteworthy that our NACF is slightly worse than STG-KD on the METEOR metric on MSVD, which is because the latter learns spatio-temporal object interactions well on a small dataset with few portions of animations (Pan et al. 2020).

We also present the performance of the baselines in Table 1, and obtain two observations. (1) Compared with AR-B, NA-B obtains superior performance on MSVD while performing poorly on MSR-VTT, showing that the removal of the sequential dependency makes NA decoding face a more severe multi-modality problem³ (Gu et al. 2018) on a larger dataset. (2) Our NACF surpasses NA-B by a large margin on both datasets. As we will show in the next subsection, this superior performance is attributed to the proposed visual word generation, which not only generates informative gradients to promote the training of visual words but also alleviates the multi-modality problem by providing the model with a warm start during inference.

³For example, when a NA model considers two possible captions C1 and C2, it could predict one token from C1 and another token from C2 due to the conditional independence.

Table 2: Effect of the visual word generation loss ($\mathcal{L}_{vis}$, Eq. 7) and the generated coarse-grained templates (CT) on the performance on MSR-VTT in terms of BLEU@4 (B4), METEOR (M) and CIDEr-D (CD).

exp  Model                       B4    M     CD
1    NA-B                        40.4  28.0  47.6
2    NA-B w/ L_vis               40.8  28.2  49.4
3    NA-B w/ L_vis, CT (NACF)    42.0  28.7  51.4
4    AR-B                        40.5  28.7  49.1
5    AR-B w/ L_vis               41.4  29.0  50.8

Figure 4: The average relative growth rate of word frequency after training with visual word generation ($\mathcal{L}_{vis}$). Here we compare exp 2 with exp 1 (listed in Table 2), and detail the results on all 20 categories of videos from MSR-VTT; "doc." and "ads." are short for documentary and advertisement.
Ablation Studies and Analyses

Visual Word Generation. Our proposed visual word generation task plays a critical role in both the training and inference phases. During training, this task generates auxiliary gradients, so the effect of $\mathcal{L}_{vis}$ on performance is worth exploring. As shown in Table 2, $\mathcal{L}_{vis}$ brings promising performance gains for both NA-B (exp 2 vs. exp 1) and AR-B (exp 5 vs. exp 4), especially on the CIDEr-D metric. To figure out the word-level effect of $\mathcal{L}_{vis}$, we first collect the captions generated by the model with or without $\mathcal{L}_{vis}$, then measure the relative growth rates of word frequency, and finally average the results over words of the same type. As shown in Fig. 4, visual words get an overall boost in various categories of videos. All these results demonstrate that training with $\mathcal{L}_{vis}$ can improve the caption quality by addressing the insufficient training of visual words.

During inference, visual word generation produces coarse-grained templates (CT). An obvious improvement from taking CT as starting points can be observed in Table 2 (exp 3 vs. exp 2) and Table 4, which indicates that decoding with some known visual words can generate more semantically correct captions than from-scratch generation that starts with completely unknown sequences. In summary, our proposed visual word generation is versatile and benefits non-autoregressive video captioning considerably.

Diversity. To quantify caption diversity, we compute three metrics following (Dai, Fidler, and Lin 2018): Novel (the percentage of generated captions that have not been seen in the training data), Unique (the percentage of captions that are unique among the generated captions) and Vocab Usage (the percentage of words in the vocabulary that are adopted to generate captions). As shown in Table 3, NACF achieves the best performance across all metrics, indicating the advantage of our approach in terms of caption diversity.

Table 3: Diversity of generated captions at various aspects (%) on MSR-VTT.

Model  Novel  Unique  Vocab Usage
AR-B   17.19  25.79   3.36
NA-B   23.88  31.24   3.24
NACF   34.35  42.47   3.83

Decoding Algorithm. In Table 4, we present the performance of all aforementioned decoding algorithms. As we can observe, (CT-)EF and (CT-)MP always outperform (CT-)L2R. Therefore, we can conclude that adaptive generation rather than monotonic generation is requisite for our NACF to generate plausible descriptions.

Table 4: Performance of our NACF using different decoding algorithms on MSR-VTT. Here T = 5 and q = 1.

Algorithm  B1    B2    B3    B4    M     CD
MP         81.0  67.5  53.5  40.8  28.2  49.4
EF         81.5  67.9  54.1  41.4  28.7  50.6
L2R        81.0  67.3  53.4  40.8  28.4  48.7
CT-MP      82.2  68.7  54.7  42.0  28.7  51.4
CT-EF      82.1  68.4  54.4  41.7  28.8  51.8
CT-L2R     81.7  68.2  54.3  41.7  28.7  50.6

Inference Efficiency. We measure latency⁴ following (Guo et al. 2019; Wang et al. 2019b) and conduct experiments in PyTorch on a single NVIDIA Titan X.

⁴Latency is computed as the time to decode a single sentence without minibatching, averaged over the whole test set.
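A minimal sketch of this latency protocol (one sentence at a time, no minibatching, averaged over the test set) is given below; the decode_fn name is a placeholder, and on a GPU a synchronization call would be needed before reading the timer.

```python
import time

def average_latency(decode_fn, test_videos):
    """Mean seconds per decoded sentence, measured without minibatching."""
    total = 0.0
    for video in test_videos:
        start = time.perf_counter()
        decode_fn(video)            # decode exactly one caption
        # note: with CUDA, call torch.cuda.synchronize() here before stopping the timer
        total += time.perf_counter() - start
    return total / len(test_videos)
```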
Fig. 5 shows the speed-performance trade-off on MSVD and MSR-VTT, where we take the latency of AR-B (B = 5) as a benchmark. Notably, with the CT-MP algorithm, our NACF can generate captions over 3.2 times (B = 4, T = 1) faster than AR-B (B = 5) on MSVD and 2.2 times (B = 5, T = 3) faster on MSR-VTT without performance degradation. These results demonstrate the high inference speed of our NACF.

Figure 5: Relative decoding speed-up versus CIDEr-D on (a) MSVD and (b) MSR-VTT. The circles in red denote NACF (CT-MP) while those in purple denote NACF (CT-EF). The squares in green denote NA-B (MP). The triangles in blue denote AR-B. We take the latency of AR-B (B = 5) as a benchmark, which costs 37.4 ms and 43.8 ms on MSVD and MSR-VTT, respectively.

Qualitative Analysis

Figure 6: Qualitative results on MSVD and MSR-VTT. The incomplete sentences of our NACF are the generated coarse-grained templates (CT). In (a), words in blue denote the updates during iterative refinement, while for the rest of the examples, accurate keywords are highlighted in red. In (d), our NACF generates an unsatisfying caption due to inadequate generation of visual words, which can be alleviated by retrieving two visual words ("spongebob" and "cartoon") that are likely to be predicted.

Fig. 6 shows a few visualized examples of captions generated by different models. As we can see in (a), (b) and (c), while the baselines AR-B and NA-B mistake the video contents, our NACF captures relevant visual words (e.g., "roof" in (b) and "lab" in (c)) and thus generates more precise captions. Specifically in (a), where the intermediate process of iterative refinement is presented, we can observe that the generated visual words of our NACF provide rich contextual information for predicting the quantifier "two". However, in (d), our NACF suffers from inadequate visual word generation. To figure out which visual words could potentially be predicted, we visualize the top-3 predictions in (d) and find that most of them are somewhat relevant. Given that the description improves after retrieving "spongebob" and "cartoon", a more robust mechanism of generating visual words deserves further study.

Conclusion

In this paper, we propose a novel Non-Autoregressive Coarse-to-Fine (NACF) model for video captioning, which is based on a masked language model for parallelization and is equipped with visual word generation and dedicated decoding algorithms to generate accurate and diverse captions in a coarse-to-fine manner. Extensive experiments on two video captioning benchmarks, MSVD and MSR-VTT, demonstrate that our proposed NACF achieves state-of-the-art performance with two unique advantages, i.e., generating more vivid descriptions and decoding faster (e.g., a 3.2 times speed-up without performance degradation) than autoregressive captioning models. In our future study, we will focus on interpretable and controllable caption generation.

Acknowledgements

Special acknowledgements are given to the AOTO-PKUSZ Joint Research Center of Artificial Intelligence on Scene Cognition Technology Innovation for its support.

References

Aafaq, N.; Akhtar, N.; Liu, W.; Gilani, S. Z.; and Mian, A. 2019. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In CVPR.
Banerjee, S.; and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop.
Bengio, S.; Vinyals, O.; Jaitly, N.; and Shazeer, N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in NIPS.
Bird, S.; Klein, E.; and Loper, E. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.
Chen, D.; and Dolan, W. B. 2011. Collecting highly parallel data for paraphrase evaluation. In ACL-HLT, 190-200.
Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Dai, B.; Fidler, S.; and Lin, D. 2018. A neural compositional paradigm for image captioning. In Advances in NIPS.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
Gella, S.; Lewis, M.; and Rohrbach, M. 2018. A dataset for telling the stories of social media videos. In EMNLP.
Ghazvininejad, M.; Levy, O.; Liu, Y.; and Zettlemoyer, L. 2019. Mask-Predict: Parallel decoding of conditional masked language models. In EMNLP-IJCNLP.
Gu, J.; Bradbury, J.; Xiong, C.; Li, V. O.; and Socher, R. 2018. Non-autoregressive neural machine translation. In ICLR.
Gu, J.; Wang, C.; and Zhao, J. 2019. Levenshtein transformer. In Advances in NIPS.
Guo, J.; Tan, X.; He, D.; Qin, T.; Xu, L.; and Liu, T.-Y. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In AAAI.
Hara, K.; Kataoka, H.; and Satoh, Y. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
Hori, C.; Hori, T.; Lee, T.-Y.; Zhang, Z.; Harsham, B.; Hershey, J. R.; Marks, T. K.; and Sumi, K. 2017. Attention-based multimodal fusion for video description. In ICCV.
Huang, Y.; Chen, J.; Ouyang, W.; Wan, W.; and Xue, Y. 2020. Image captioning with end-to-end attribute detection and subsequent attributes prediction. IEEE Transactions on Image Processing 29.
Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 448-456.
Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
Kingma, D. P.; and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
Lee, J.; Mansimov, E.; and Cho, K. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP.
Liu, F.; Liu, Y.; Ren, X.; He, X.; and Sun, X. 2019. Aligning visual regions and textual concepts for semantic-grounded image representations. In Advances in NIPS.
Mansimov, E.; Wang, A.; and Cho, K. 2019. A generalized framework of sequence generation with application to undirected sequence models. arXiv preprint arXiv:1905.12790.
Pan, B.; Cai, H.; Huang, D.-A.; Lee, K.-H.; Gaidon, A.; Adeli, E.; and Niebles, J. C. 2020. Spatio-temporal graph for video captioning with knowledge distillation. In CVPR.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL.
Pei, W.; Zhang, J.; Wang, X.; Ke, L.; Shen, X.; and Tai, Y.-W. 2019. Memory-attended recurrent network for video captioning. In CVPR.
Sammani, F.; and Melas-Kyriazi, L. 2020. Show, edit and tell: A framework for editing image captions. In CVPR.
Shao, C.; Zhang, J.; Feng, Y.; Meng, F.; and Zhou, J. 2020. Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation. In AAAI.
Song, J.; Guo, Z.; Gao, L.; Liu, W.; Zhang, D.; and Shen, H. T. 2017. Hierarchical LSTM with adjusted temporal attention for video captioning. In IJCAI.
Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015. Training very deep networks. In Advances in NIPS.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in NIPS.
Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. CIDEr: Consensus-based image description evaluation. In CVPR.
Venugopalan, S.; Xu, H.; Donahue, J.; Rohrbach, M.; Mooney, R.; and Saenko, K. 2015. Translating videos to natural language using deep recurrent neural networks. In NAACL.
Wang, B.; Ma, L.; Zhang, W.; Jiang, W.; Wang, J.; and Liu, W. 2019a. Controllable video captioning with POS sequence guidance based on gated fusion network. In ICCV.
Wang, Y.; Tian, F.; He, D.; Qin, T.; Zhai, C.; and Liu, T.-Y. 2019b. Non-autoregressive machine translation with auxiliary regularization. In AAAI.
Xu, J.; Mei, T.; Yao, T.; and Rui, Y. 2016. MSR-VTT: A large video description dataset for bridging video and language. In CVPR.
Xu, J.; Yao, T.; Zhang, Y.; and Mei, T. 2017. Learning multimodal attention LSTM networks for video captioning. In ACM Multimedia.
Yan, C.; Tu, Y.; Wang, X.; Zhang, Y.; Hao, X.; Zhang, Y.; and Dai, Q. 2019. STAT: Spatial-temporal attention mechanism for video captioning. IEEE Transactions on Multimedia.
Yao, L.; Torabi, A.; Cho, K.; Ballas, N.; Pal, C.; Larochelle, H.; and Courville, A. 2015. Describing videos by exploiting temporal structure. In ICCV.