# Uncertainty-Aware Image Captioning

Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang*, Xiaoming Wei, Xiaolin Wei
Meituan, Beijing, China
{feizhengcong, fanmingyuan, zhuli09, huangjunshi}@meituan.com
{weixiaoming, weixiaolin02}@meituan.com
*The corresponding author.

## Abstract

It is widely believed that the higher the uncertainty of a word in a caption, the more inter-correlated context information is required to determine it. However, current image captioning methods usually treat the generation of all words in a sentence sequentially and equally. In this paper, we propose an uncertainty-aware image captioning framework, which iteratively inserts discontinuous candidate words between existing words in parallel, from easy to difficult, until convergence. We hypothesize that high-uncertainty words in a sentence need more prior information to be decided correctly and should therefore be produced at a later stage. The resulting non-autoregressive hierarchy makes caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-words model to measure word uncertainty and apply a dynamic programming algorithm to construct the training pairs. During inference, we devise an uncertainty-adaptive parallel beam search technique that yields an empirically logarithmic time complexity. Extensive experiments on the MS COCO benchmark show that our approach outperforms the strong baseline and related methods in both captioning quality and decoding speed.

## Introduction

Image captioning, which aims to generate textual descriptions of input images, is a critical task in multimedia analysis (Stefanini et al. 2021). Previous works in this area are mostly based on an encoder-decoder paradigm (Vinyals et al. 2015; Xu et al. 2015; Rennie et al. 2017; Anderson et al. 2018; Huang et al. 2019; Cornia et al. 2020; Pan et al. 2020; Fei 2022; Li et al. 2022; Yang, Liu, and Wang 2022), where a convolutional-neural-network-based image encoder first processes an input image into visual representations, and a recurrent-neural-network- or Transformer-based language decoder then produces a corresponding caption from these extracted features. The generation process usually relies on a chain-rule factorization and is performed in an autoregressive manner, i.e., word by word from left to right. Although this paradigm brings performance superiority, high inference latency becomes a serious disadvantage in real-time applications. To this end, non-autoregressive image captioning (NAIC) models have been proposed to improve decoding speed by predicting every word in parallel (Gu et al. 2018; Fei 2019, 2021b). This advantage comes at a cost in performance, since predicting the next word is harder when it is not conditioned on sufficient context.

Figure 1: Illustration of autoregressive, non-autoregressive, and uncertainty-aware image captioning. The AIC model generates the next word conditioned on the given image and all preceding words, while the NAIC model outputs all words in one shot. Comparatively, UAIC carries out caption generation in several stages, and discontinuous words are generated in parallel at each stage. Words in blue indicate newly generated words at the current stage. Interestingly, UAIC allows informative words (e.g., people, field) to be generated before non-informative words (e.g., in, with).
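To make the staged generation of Figure 1 concrete, the following is a minimal, framework-agnostic sketch of an insertion-based decoding schedule of this kind; `predict_insertions` and its interface are hypothetical placeholders for illustration, not the authors' released code.

```python
# Hypothetical sketch of staged, insertion-based decoding (cf. Figure 1).
# `predict_insertions` stands in for a trained model that, given the image
# and the current partial caption, proposes one (word, confidence) pair for
# every insertion slot; it is an assumed interface, not the actual API.

def uncertainty_aware_decode(image, predict_insertions, max_stages=6, threshold=0.5):
    caption = []                          # start from an empty caption
    for _ in range(max_stages):
        # One parallel pass proposes a candidate for every slot: before the
        # first word, between adjacent words, and after the last word.
        proposals = predict_insertions(image, caption)
        # Keep only confident (low-uncertainty) words at this stage.
        accepted = [(slot, word) for slot, (word, conf) in enumerate(proposals)
                    if word is not None and conf >= threshold]
        if not accepted:                  # converged: nothing left to insert
            break
        # Insert from right to left so earlier slot indices stay valid.
        for slot, word in sorted(accepted, reverse=True):
            caption.insert(slot, word)
    return caption
```

On the Figure 1 example, such a schedule might first produce content words (e.g., people, field) and only fill in function words (e.g., in, with) at later stages.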
Intensive efforts have been devoted to non-autoregressive image captioning to seek a better trade-off between inference time and captioning performance. These approaches can be roughly divided into two lines. The first line of work leverages an iterative decoding framework to break the independence assumption: it first generates an initial caption and then refines it iteratively, taking both the given image and the result of the last iteration as input (Gao et al. 2019; Jason et al. 2018). The other line of work modifies the Transformer structure to better capture dependency and position information by adding extra autoregressive layers to the decoder (Fei 2019; Guo et al. 2020). Besides, Fei (2020) introduces latent variables to bridge the modality gap and develops a more powerful probabilistic framework to model more complicated distributions, while Yan et al. (2021) evenly split captions into word groups and generate the groups synchronously. In addition to parallel generation, a range of semi-autoregressive models (Wang, Zhang, and Chen 2018; Ghazvininejad, Levy, and Zettlemoyer 2020; Stern et al. 2019; Gu, Wang, and Zhao 2019; Fei 2021b; Fei et al. 2022b,a; Zhou et al. 2021) focus on non-monotonic sequence generation with limited forms of autoregressiveness, e.g., tree-like traversal, mostly built on insertion operations. However, all these image captioning methods treat all words in a sentence equally and ignore the differences in generation difficulty among them, which does not match how humans actually compose descriptions.

In this paper, we propose an Uncertainty-Aware parallel model for faster and better Image Captioning, referred to as UAIC. As illustrated in Figure 1, given an image, the UAIC model first produces low-uncertainty keywords, i.e., objects in the image, and then inserts details of finer granularity from easy to difficult in stages. This process iterates until the caption is complete; the generation of discontinuous words within each stage is synchronous and parallel. To measure the uncertainty of each word in a sample, we introduce a bag-of-words (BoW) model (Zhang, Jin, and Zhou 2010) based on the image-only conditional distribution. The intuition is that the higher the cross-entropy, and thus the uncertainty, of a word, the harder it is to generate, and the more word-dependency information is required to determine it (Heo et al. 2018; Kaplan et al. 2018; Zhou et al. 2020). Meanwhile, a dynamic-programming-based algorithm is applied to split instances into ordered data pairs for training. In this way, training samples are organized in an uncertainty-aware manner, and the proposed model implicitly learns an effective word generation order. We also integrate an uncertainty-adaptive beam search that follows the change in uncertainty to accelerate decoding. We believe that our proposed approach is simple to understand and implement, yet powerful, and can be leveraged as a building block for future work.
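As a rough illustration of how image-conditioned word uncertainty could be scored, the sketch below computes the per-word cross-entropy under a bag-of-words distribution predicted from the image alone; `image_to_vocab_logits` is a hypothetical stand-in for such a BoW module, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def word_uncertainty(image_feats, caption_ids, image_to_vocab_logits):
    """Sketch: score each caption word by its negative log-likelihood under
    an image-only bag-of-words distribution.

    `image_to_vocab_logits` is an assumed module mapping pooled image
    features to vocabulary logits; a higher score is read as higher
    uncertainty, so that word would be generated at a later stage.
    """
    logits = image_to_vocab_logits(image_feats)    # shape: (vocab_size,)
    log_probs = F.log_softmax(logits, dim=-1)      # image-conditioned BoW distribution
    nll = -log_probs[caption_ids]                  # shape: (caption_len,)
    return nll                                     # one uncertainty score per word
```

Under this view, words with low scores (e.g., salient objects visible in the image) would be scheduled in early stages, while high-scoring function words would be deferred.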
The main contributions of this paper are summarized as follows:

- We propose a new uncertainty-aware model for parallel image caption generation. Compared with previous work, our model allows difficulty control over the generation process and enjoys a significant reduction in empirical time complexity, from O(n) to O(log n) at best.
- We introduce an uncertainty estimation model inspired by the idea of bag-of-words. Based on this word-level uncertainty measurement, a heuristic dynamic programming algorithm is applied to construct the training set.
- We devise an uncertainty-adaptive beam search customized to our approach, which dynamically adjusts the beam size to trade off accuracy and efficiency.
- Experiments on the MS COCO benchmark demonstrate the superiority of our UAIC model over strong baselines. In particular, to improve reproducibility and foster new research in the field, we publicly release the source code and trained models of all experiments.

**Image caption generation.** Given an image $I$, image captioning aims to generate a sentence $S = \{w_1, \ldots, w_T\}$ that describes the visual content. Specifically, the autoregressive image captioning (AIC) model factorizes the joint distribution of $S$ in a standard left-to-right manner, i.e.,

$$p(S \mid I) = \prod_{t=1}^{T} p(w_t \mid S_{<t}, I),$$

where $S_{<t}$ denotes the words preceding $w_t$.
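For reference, a minimal greedy decoder realizing this chain-rule factorization might look as follows; it makes explicit the T sequential model calls that motivate the O(n)-versus-O(log n) comparison above. The `step` function is an assumed interface, not the authors' code.

```python
def greedy_autoregressive_decode(image, step, bos_id, eos_id, max_len=20):
    """Sketch of left-to-right decoding under p(S|I) = prod_t p(w_t | S_<t, I).

    `step(image, prefix_ids)` is an assumed callable returning a list of
    next-word probabilities over the vocabulary; one call is needed per
    generated word, so latency grows linearly with caption length.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        probs = step(image, prefix)                            # p(w_t | S_<t, I)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy argmax
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]                                          # drop the BOS token
```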