# Memory-Augmented Image Captioning

Zhengcong Fei¹,²

¹Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
²University of Chinese Academy of Sciences, Beijing 100049, China

feizhengcong@ict.ac.cn

## Abstract

Current deep learning-based image captioning systems have been shown to store practical knowledge in their parameters and to achieve competitive performance on public datasets. Nevertheless, their ability to access and precisely manipulate this mastered knowledge is still limited. In addition, providing evidence for decisions and updating the memorized information are also important yet underexplored. Towards this goal, we introduce a memory-augmented method, which extends an existing image captioning model by incorporating extra explicit knowledge from a memory bank. Adequate knowledge is recalled according to the similarity distance in the embedding space of the history context, and the memory bank can be constructed conveniently from any matched image-text set, e.g., the previous training data. Incorporating this non-parametric memory-augmented method into various captioning baselines, the performance of the resulting captioners improves consistently on the evaluation benchmark. More encouragingly, extensive experiments demonstrate that our approach holds the capability to adapt efficiently to larger training datasets, by simply transferring the memory bank without any additional training.

## 1 Introduction

Automatic image captioning, which aims to describe the visual content of a given image, is a core topic in the artificial intelligence area (Bai and An 2018; Fei 2020a). There has been a boom in research on image captioning systems due to the advance of deep learning technology, and most existing models adopt encoder-decoder frameworks (Vinyals et al. 2015; Xu et al. 2015; Yao et al. 2017; Anderson et al. 2018; Huang et al. 2019; Fei 2020b). Technically, a CNN-based image encoder extracts sufficient and useful visual features from the input image, and an RNN-based caption decoder builds the description from the selected visual information, decoding it word by word. Such structures have been shown to learn a substantial amount of in-depth relational knowledge from training data through parameter optimization, without access to external memory. While this development is exciting, such image captioning models do have some drawbacks (Fei 2019; Wang et al. 2020): they cannot easily expand or update their prior memory, and they cannot straightforwardly provide insight into their current predictions.

To address these issues, numerous hybrid captioning models that incorporate a retrieval-based memory mechanism have been explored, in which the externally recalled knowledge can be directly revised and expanded, and its access can be inspected and interpreted (Weston, Chopra, and Bordes 2014). In particular, inspired by the fact that humans benefit from previous similar experiences when taking actions, and that related examples from the training data provide exemplary information when describing a given image, previous works (Poghosyan and Sarukhanyan 2017; Chen et al. 2019; Wang et al. 2020) usually first utilize an image-text matching model to retrieve the top-k most similar sentence candidates. Then, the target caption is generated under the guidance of the input image plus these related candidates with a specially designed network.
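For illustration, the sentence-level retrieval step used by such hybrid captioners can be sketched roughly as follows, assuming embeddings from a generic pretrained image-text matching model are already available; the function and variable names here are illustrative and not taken from any specific prior work.

```python
import torch
import torch.nn.functional as F

def retrieve_topk_captions(image_emb, caption_embs, captions, k=5):
    """Rank candidate captions by cosine similarity to the query image in a
    joint image-text embedding space and return the k best candidates."""
    # image_emb: (d,); caption_embs: (M, d); captions: list of M strings.
    sims = F.cosine_similarity(image_emb.unsqueeze(0), caption_embs, dim=-1)  # (M,)
    topk = torch.topk(sims, k)
    return [captions[i] for i in topk.indices.tolist()], topk.values
```

The retrieved sentences are then fed, together with the image features, into the caption generator.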
Although current retrieval-based captioning models have achieved promising results, they still suffer from the following weaknesses: 1) Their performance is limited by the quality of the caption retrieval model. Commonly, retrieved results are less coherent and less relevant to the query image than those of generative models, and irrelevant retrieved results can even mislead the final caption generation. 2) These models can only make use of individual sentence-level retrieved results, leading to high variance in performance (Zhang and Lu 2018). Moreover, the information from only a few retrieved results may not be sufficient to enrich caption decoding. Compared with methods that match based only on image features, our proposed word-level retrieval mechanism considers the complete history information, including the given image and the previously generated words, which results in a more accurate and comprehensive application of the recalled knowledge.

In this paper, we introduce a memory-augmented approach that equips a trained image caption decoder with extra word-level knowledge; in other words, it linearly interpolates the original next-word distribution with a top-k matching approximation. The related knowledge in the memory bank is recalled according to the similarity distance in the embedding space of the history context (i.e., the image features and the prefix of the caption) and can be drawn from any image-text collection, including the original training data or other extended datasets. We assume that contexts which are close in the representation space are more likely to be followed by the same word. In this manner, our framework does not introduce additional parameters and allows effective knowledge to be memorized explicitly and interpretably, rather than implicitly in the model parameters.

Figure 1: Illustration of our memory-augmented caption generation. A memory bank is constructed from a pre-set collection of matched image-text samples, storing the encoding of each history context, i.e., image features and the partial sentence (key), together with the corresponding target word (value). During inference, the current context is encoded as a query, and the k most similar matches are retrieved from the memory bank. Then, a distribution over the vocabulary is computed with ranking and normalization operations. Finally, this distribution is interpolated with the original model's prediction for a combined decision. Note that the encoder of the query context is identical to the encoder used to build the memory bank.

To better measure its effects, we conduct an extensive empirical evaluation on the MS COCO benchmark (Chen et al. 2015). Built upon recent strong captioners, our memory augmentation mechanism shows a prominent improvement over the base models when the same training set is employed for modeling the history memory representations. We also demonstrate that our approach holds the capacity to adapt efficiently to larger training datasets, by simply reconstructing the memory bank with the existing image captioning model.

The contributions of this work are as follows:

- We propose a memory-augmented approach which extends the decisions of an existing image captioning model with related knowledge from a non-parametric memory bank (a minimal sketch of this retrieval-and-interpolation step is given after this list). To the best of our knowledge, this is the first work to build word-level knowledge from image-text pairs using a trained captioning model and to use this memory to further enhance the performance of caption generation.
- Extensive experiments demonstrate that captioning models equipped with the memory-augmented mechanism significantly outperform the ones without it. We also analyze the effect of the memory bank scale.
- More encouragingly, the proposed memory mechanism can be easily incorporated into existing captioning models to improve their performance without additional training.
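To make the combined decision above concrete, the following is a minimal sketch of the word-level memory lookup and interpolation, assuming the memory bank has already been built offline from (context encoding, next word) pairs; names such as `MemoryBank`, `lam`, and `tau` are illustrative choices, not fixed by the paper.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Keys: encodings of history contexts (image + caption prefix); values: next-word ids."""
    def __init__(self, keys, values):
        self.keys = keys      # (M, d) float tensor of stored context encodings
        self.values = values  # (M,)  int64 tensor of target word ids

    def lookup(self, query, k=32):
        # Similarity here is negative L2 distance; cosine similarity would also work.
        dists = torch.cdist(query.unsqueeze(0), self.keys).squeeze(0)  # (M,)
        knn = torch.topk(-dists, k)
        return self.values[knn.indices], knn.values  # word ids, negated distances

def memory_augmented_probs(p_model, query, bank, vocab_size, k=32, lam=0.2, tau=1.0):
    """Interpolate the captioner's next-word distribution with a distribution
    aggregated from the k nearest entries in the memory bank."""
    words, scores = bank.lookup(query, k)
    weights = F.softmax(scores / tau, dim=-1)      # rank/normalize retrieval scores
    p_mem = torch.zeros(vocab_size)
    p_mem.scatter_add_(0, words, weights)          # neighbors sharing a word accumulate mass
    return (1.0 - lam) * p_model + lam * p_mem     # combined decision
```

For a large memory bank, the exact distance computation shown here would typically be replaced by an approximate nearest-neighbour index; the interpolation step itself stays the same.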
## 2 Approach

In this section, we first give a brief review of the conventional attention-based encoder-decoder framework for image captioning (Xu et al. 2015; Rennie et al. 2017). This structure is regarded as the state-of-the-art paradigm and is used as the baseline in this study. We then introduce the memory-augmented mechanism for next-word prediction in detail. Finally, we provide a discussion of the computational cost as well as other related works.

### 2.1 Background: Attention-Based Encoder-Decoder Paradigm

Overall, two-stage image captioning systems usually consist of an image encoder and a language decoder.

**Image Encoder.** For each input image, a pre-trained Faster R-CNN (Ren et al. 2015) is utilized to detect region-based objects. The top $N$ objects with the highest confidence scores are selected, and we denote the corresponding extracted feature vectors as $V = \{v_1, v_2, \ldots, v_N\}$, where $v_n \in \mathbb{R}^{d_v}$ and $d_v$ is the dimension of each feature vector. Note that each feature vector represents a certain aspect of the input image and further serves as a guide for the sentence decoder to describe the corresponding visual information.

**Caption Decoder.** At each decoding step $t$, the sentence decoder takes the word embedding of the current input word $w_{t-1}$, concatenated with the average of the extracted image features $\bar{v} = \frac{1}{N}\sum_{n=1}^{N} v_n$, as input to the decoding network:

$$h_t = f_D\big(h_{t-1}, [W_e w_{t-1}; \bar{v}]\big), \tag{1}$$

where $[\,;\,]$ is the concatenation operation, $W_e$ denotes the learnable word embedding parameters, and $f_D(\cdot)$ is the decoder network, e.g., an LSTM (Hochreiter and Schmidhuber 1997) or a Transformer (Vaswani et al. 2017). Next, the output state $h_t$ of the decoding function is utilized as a query to attend to the relevant image regions in the image feature set $V$ and to generate the weighted image features, also named the context vector $c_t$:

$$\alpha_t = \mathrm{Softmax}\big(w_\alpha \tanh(W_h h_t \oplus W_V V)\big), \tag{2}$$

$$c_t = V \alpha_t^{\top}, \tag{3}$$

where $w_\alpha$, $W_h$, and $W_V$ denote learnable parameters, and $\oplus$ denotes matrix-vector addition, calculated by adding the vector to each column of the matrix. Finally, the hidden state $h_t$ and the context features $c_t$ are passed together to a linear layer to predict the next word $w_t$:

$$p_t = \mathrm{Softmax}\big(W_p [h_t; c_t] + b_p\big), \tag{4}$$

where $W_p$ and $b_p$ are learnable parameters. It is worth noticing that some works (Anderson et al. 2018; Yao et al. 2018) also attempt to append more neural network modules, e.g., an extra LSTM or a GCN, to assist in predicting the next word.

For the training procedure, given a ground-truth description sentence $S^*_{1:T} = \{w^*_1, \ldots, w^*_T\}$ and a captioning model $P_{IC}$ with parameters $\theta$, the optimization objective is to minimize the cross-entropy loss:

$$L_{XE}(\theta) = -\sum_{t=1}^{T} \log P_{IC}\big(w^*_t \mid S^*_{1:t-1}\big).$$
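As a reading aid, the following is a minimal PyTorch-style sketch of one decoding step corresponding to Eqs. (1)-(4), assuming an LSTM cell as the decoder $f_D$; the module layout and dimension names (`d_emb`, `d_v`, `d_h`, `d_att`) are illustrative assumptions rather than the exact implementation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    """One decoding step of the attention-based captioner, Eqs. (1)-(4)."""
    def __init__(self, vocab_size, d_emb, d_v, d_h, d_att):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)   # W_e
        self.lstm = nn.LSTMCell(d_emb + d_v, d_h)      # f_D
        self.W_h = nn.Linear(d_h, d_att, bias=False)
        self.W_V = nn.Linear(d_v, d_att, bias=False)
        self.w_alpha = nn.Linear(d_att, 1, bias=False)
        self.W_p = nn.Linear(d_h + d_v, vocab_size)    # W_p, b_p

    def forward(self, w_prev, V, state):
        # Eq. (1): feed [word embedding; mean image feature] to the decoder.
        v_bar = V.mean(dim=1)                                   # (B, d_v)
        x = torch.cat([self.embed(w_prev), v_bar], dim=-1)
        h, c = self.lstm(x, state)
        # Eq. (2): additive attention over the N region features.
        att = self.w_alpha(torch.tanh(self.W_h(h).unsqueeze(1) + self.W_V(V)))  # (B, N, 1)
        alpha = F.softmax(att.squeeze(-1), dim=-1)              # (B, N)
        # Eq. (3): context vector as the attention-weighted sum of regions.
        c_t = torch.bmm(alpha.unsqueeze(1), V).squeeze(1)       # (B, d_v)
        # Eq. (4): next-word distribution from [h_t; c_t].
        p_t = F.softmax(self.W_p(torch.cat([h, c_t], dim=-1)), dim=-1)
        return p_t, (h, c)
```

Under the cross-entropy objective above, training simply accumulates the negative log-probability assigned to each ground-truth word over the time steps of the caption.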