# Attention-Aligned Transformer for Image Captioning

Zhengcong Fei1,2
1Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
2University of Chinese Academy of Sciences, Beijing 100049, China
feizhengcong@ict.ac.cn

Recently, attention-based image captioning models, which are expected to ground the correct image regions for proper word generation, have achieved remarkable performance. However, some researchers have argued that existing attention mechanisms suffer from a deviated focus problem when determining the effective and influential image features. In this paper, we present A2 - an attention-aligned Transformer for image captioning, which guides attention learning in a perturbation-based self-supervised manner, without any annotation overhead. Specifically, we apply a mask operation to image regions through a learnable network to estimate their true contribution to the final description generation. We hypothesize that the necessary image region features, where a small disturbance causes an obvious performance degradation, deserve more attention weight. We then propose four alignment strategies that use this information to refine the attention weight distribution. Under such a scheme, image regions are correctly attended with respect to the output words. Extensive experiments conducted on the MS COCO dataset demonstrate that the proposed A2 Transformer consistently outperforms baselines in both automatic metrics and human evaluation. Trained models and code for reproducing the experiments are publicly available.

1 Introduction

The task of generating a concise textual summary of a given image, known as image captioning, is one of the most challenging tasks that require joint vision and language modeling. Currently, most image captioning algorithms follow an encoder-decoder paradigm in which an RNN-based decoder network predicts words according to the image features extracted by a CNN-based encoder network (Vinyals et al. 2015). In particular, the incorporation of attention mechanisms has greatly advanced the performance of image captioning and can be used to provide insights into the models' inner workings (Xu et al. 2015; Anderson et al. 2018; Huang et al. 2019; Li et al. 2019; Cornia et al. 2020; Pan et al. 2020). Attention dynamically encodes visual information by assigning larger weights to the regions relevant to the current word generation.

Figure 1: Illustration of the sequence of attended image regions in generating each word of the description before (blue) and after (red) attention alignment. At each time step, only the top-1 attended image region is shown. The original attended image regions are grounded less accurately, demonstrating the deficiency of previous attention mechanisms.

However, it is widely questioned whether highly attended image regions have a true influence on the caption generation. On the one hand, Serrano and Smith (2019) find that erasing the representations accorded high attention weights does not always lead to a significant performance decrease. On the other hand, Liu et al. (2020) state that most attention-based image captioning models use the hidden state of the current input to attend to the image regions, and that attention weights are inconsistent with other feature importance metrics (Selvaraju et al. 2019).
It is further shown that attention mechanisms are incapable of precisely identifying the decisive inputs for each prediction (Zhang et al. 2021), a phenomenon also referred to as deviated focus, which would impair the performance of image content description. As shown in Figure 1, at the time step of generating the 5th word, the original attention mechanism focuses most on the local shelf region; as a result, the incorrect noun "sink" is generated. The unfavorably attended image region also impairs the grounding performance and ruins the model interpretability (Cornia, Baraldi, and Cucchiara 2019).

In this paper, we propose a novel perturbation-based self-supervised attention-aligned method for image captioning, referred to as the A2 Transformer, which requires no additional annotation overhead. To be specific, we repeatedly apply a mask operation to disturb the original attention weights with a learnable network and evaluate the resulting change in the final performance of the image captioning model, so as to discover which input image regions affect that performance most. In addition, we add a regularization term that aims to determine the smallest perturbation extent causing the most prominent degradation of description performance. Under this condition, we can find the most informative and necessary image features for the caption prediction, which deserve more attention. We then use this supervision to refine the attention weight distribution. In particular, we design four fusion methods to incorporate the updated attention weights into the original attention weights: i) max pooling, ii) moving average, iii) exponential decay, and iv) a gating mechanism. Finally, the image captioning model is optimized based on the modified attention.

It is notable that the aligned attention method is model-agnostic and can be easily incorporated into existing state-of-the-art image captioning models to improve their captioning performance. Extensive experiments are conducted to verify our method's effectiveness on the MS COCO dataset. According to both automatic metrics and human evaluations, image captioning models equipped with the attention-aligned method obtain a significant performance boost. A more intuitive example can be seen in Figure 1: at the time step of generating the 5th word, our method assigns more attention weight to the new image region of the shelf, and the matching correct word "shelf" is generated accordingly. We further analyze the correlation between mask perturbation and feature importance metrics, and investigate when attention weights need to be corrected across different layers.

Overall, the contributions of this paper are as follows:
- We introduce a simple and effective approach to automatically evaluate the influence of image region features with a mask operation, and use it as supervision to guide the attention alignment;
- We design four fusion strategies that force the attention weights to incorporate this supervision, which can be easily applied to existing models to improve captioning performance;
- We evaluate attention alignment for image captioning on the MS COCO dataset. The captioning models equipped with our method significantly outperform the ones without it. To improve reproducibility and foster new research in the field, we publicly release the source code and trained models of all experiments.

2 Background

In this section, we first briefly introduce the basic framework of the Transformer (Vaswani et al.
2017) for image captioning, which has an encoder-decoder structure with stacked layers of attention blocks. Each attention block contains multi-head attention (MHA) and feed-forward networks (FFN). To simplify the optimization, shortcut connections and layer normalization are applied after every MHA and FFN module. Generally, given the image region features $x = \{x_1, x_2, \ldots, x_n\}$, the visual encoder projects them to hidden states $h = \{h_1, h_2, \ldots, h_n\}$ in a latent space, which are further fed into the caption decoder to generate the target sentence $y = \{y_1, y_2, \ldots, y_m\}$.

Figure 2: The Kendall-τ correlation between attention weights (α) of image regions and gradient importance metrics (τ) of generated words for different attention layers on the MS COCO validation set.

Multi-head attention, which serves as the core component of the Transformer, enables each prediction to jointly attend over all image region features from different representation subspaces. In practice, the hidden states $h = \{h_1, h_2, \ldots, h_n\}$ are projected to keys $K$ and values $V$ with separate linear projections. To predict the target word, scaled dot-product attention (Vaswani et al. 2017) is adopted. That is, we first linearly project the hidden state of the previous caption decoder layer to the query $Q$. Then we multiply $Q$ by the keys $K$ to obtain attention weights, which are further used to compute a weighted sum of the values $V$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V, \quad (1)$$
where $d_k$ corresponds to the dimension of the keys and is used as a scaling factor. Such an attention module learns attended features that capture the pairwise interactions between features. In MHA, the model contains several parallel heads and is thus allowed to attend to diverse information from different representation subspaces. For more advanced improvements, such as mesh-like connectivity and memory modules, please refer to (Cornia et al. 2020). We employ the basic version of the Transformer, which uses $N = 6$ attention layers and $h = 8$ parallel attention heads in each layer.

3 Is the Current Attention Mechanism in Image Captioning Good Enough?

The attention mechanism plays an essential role in image captioning, as it provides importance weights for the visual features. However, some researchers have found that the highly attended image regions exhibit a deviated focus problem (Liu et al. 2020): they hold low relevance to the generated words and thus impair the model performance.

Figure 3: Architecture of the A2 Transformer. The mask perturbation network is trained to perturb the attention weights of decisive and effective input features so as to impair the captioning performance. The attention-aligned network then looks for which input regions were perturbed and enhances the corresponding attention weights.

To analyze more deeply whether current attention mechanisms focus on the decisive and effective image regions, we evaluate the correlation between attention weights and feature importance metrics in image captioning. Practically, following (Anjomshoae, Jiang, and Framling 2021; Clark et al. 2019), we apply gradient-based methods to evaluate the importance of each visual representation, i.e., hidden state $h_i$, for the generated word $y_t$, which is estimated as
$$\tau_{it} = \left| \nabla_{h_i}\, p(y_t \mid x) \right|.$$
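To make this metric concrete, here is a minimal PyTorch sketch of the gradient-based importance score, assuming a hypothetical captioning model API (`model.encode`, `model.word_prob`) that exposes region hidden states and the per-step word probability; aggregating the per-dimension gradients into one score per region is our reading of the absolute-value notation above, not the paper's stated choice.

```python
import torch

def gradient_importance(model, regions, caption, t):
    """Sketch of tau_{it} = |grad_{h_i} p(y_t | x)| for every region feature h_i.

    `model.encode` and `model.word_prob` are hypothetical stand-ins for a
    captioning model that returns region hidden states and the probability
    of the t-th target word, respectively.
    """
    model.zero_grad(set_to_none=True)
    hidden = model.encode(regions)                 # (n, d) region hidden states
    hidden.retain_grad()                           # keep gradients for this non-leaf tensor
    prob_t = model.word_prob(hidden, caption, t)   # scalar p(y_t | y_<t, x)
    prob_t.backward()
    # One importance score per region: aggregate the absolute gradient over feature dims.
    return hidden.grad.abs().sum(dim=-1)           # (n,)
```

The Kendall-τ correlations reported in Figure 2 can then be computed between these scores and the head-averaged attention weights, e.g., with `scipy.stats.kendalltau`.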
Experimentally, we train a plain Transformer model on the MS COCO dataset as the baseline. All structure and parameter settings are kept untouched, following (Chen et al. 2015). We record the attention weights of image features averaged over the heads, and the Kendall-τ correlation between attention weights and importance metrics is presented in Figure 2. We can see that the correlation between the attention weights of image features and the corresponding gradient importance metrics is weak, always below 0.2 (here, 0 indicates no relevance, while 1 implies strong concordance). The experimental results show that the highly attended image features are not always responsible for the word generation, which is also consistent with previous studies (Liu et al. 2020).

4 Methodology

In this section, to tackle the inaccuracy of attention weights, we propose a perturbation-based self-supervised method that steers attention learning toward the effective image regions. The basic architecture is shown in Figure 3. First, we introduce how to discover the image regions that are important for caption generation, where we design a learnable mask perturbation that degrades the description performance with a limited operation on the original attention weights. Based on the performance change, we can automatically identify the image regions that are affected most. Then, we illustrate how to use this supervision to refine the original attention weights with the attention-aligned network. Finally, we describe the entire training and inference procedure in detail.

4.1 Learnable Mask Perturbation

The basic assumption of our design is that, under the same perturbation, important image regions lead to larger performance changes than unimportant ones (Li et al. 2021). Specifically, a small perturbation of influential image features can result in a dramatic change in the finally generated words, while a greater perturbation of unimportant ones will not easily change the results. Therefore, we can estimate the importance of image region features by observing how the performance changes as different parts of the input image features are perturbed.

Inspired by (Fong and Vedaldi 2017; Fan et al. 2021), we apply a learnable mask to scale the attention weight of each image region, which simulates the process of perturbation. At the time step of generating the $t$-th word, the learnable mask $m_t$ is obtained from the hidden state $h_t^d$ of the $d$-th layer of the caption decoder as:
$$m_t = \sigma(W_m h_t^d + b_m), \quad (2)$$
where $\sigma(\cdot)$ is the sigmoid function, and $W_m$ and $b_m$ are trainable parameters that vary among different attention layers and heads. Correspondingly, the perturbed attention weights $\alpha_t^p$ are modeled based on the mask as:
$$\alpha_t^p = m_t \, \alpha_t + (1 - m_t) \, \bar{\alpha}, \quad (3)$$
where $\bar{\alpha}$ is the average vector over attention heads rather than zero, so as to avoid abnormal effect values (Kim et al. 2021). Qualitatively, a smaller mask value $m_t$ corresponds to a smaller reservation of the original attention weight $\alpha_t$, in other words, a larger perturbation extent.

Recall that the mask operation is targeted at making the smallest perturbation of the image region features while achieving the largest degradation of performance. Based on this, we design the training objective of the mask perturbation network as:
$$\mathcal{L}(\theta_m) = -\mathcal{L}_{IC}(\alpha_t^p, \theta) + \lambda \, \lVert 1 - m_t \rVert_2^2, \quad (4)$$
where $\theta$ denotes the parameters of the original image captioning model and $\mathcal{L}_{IC}(\alpha_t^p, \theta)$ is the loss of the image captioning model when incorporating the perturbed attention weights $\alpha_t^p$; minimizing its negative drives the mask toward perturbations that degrade the captioning performance. $\theta_m = \{W_m, b_m\}$ represents the parameters of the mask perturbation network. The second term serves as a regularizer that penalizes excessive masking, and $\lambda$ is the balancing factor. As the perturbed attention $\alpha_t^p$ is affected by $\theta_m$, both terms in Equation 4 are parameterized by $\theta_m$. Thus, this loss only optimizes the parameters of the mask perturbation network $\theta_m$, without touching the original image captioning model.
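As a concrete illustration, the following PyTorch sketch implements Eqs. 2-4 for a single attention head. The module and argument names, the default λ, and the assumption that the captioning loss can be re-evaluated with externally supplied attention weights are ours, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MaskPerturbation(nn.Module):
    """Learnable mask (Eq. 2) and perturbed attention weights (Eq. 3)."""

    def __init__(self, hidden_dim, num_regions):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_regions)  # W_m, b_m

    def forward(self, h_dec, alpha, alpha_mean):
        # h_dec:      (B, hidden_dim) decoder hidden state h_t^d
        # alpha:      (B, num_regions) original attention weights of one head
        # alpha_mean: (B, num_regions) attention weights averaged over heads
        m = torch.sigmoid(self.proj(h_dec))              # Eq. 2
        alpha_p = m * alpha + (1.0 - m) * alpha_mean     # Eq. 3
        return alpha_p, m

def mask_objective(caption_loss, m, lam=0.1):
    """Training objective of the mask network (Eq. 4).

    `caption_loss` is the captioning loss obtained with the perturbed
    attention; the negative sign rewards perturbations that hurt the
    caption, while the regularizer keeps the mask close to 1, i.e.,
    a small perturbation.
    """
    reg = ((1.0 - m) ** 2).sum(dim=-1).mean()
    return -caption_loss + lam * reg
```

In practice, the gradients of this objective would be applied only to the mask network, e.g., by freezing or detaching the captioning model, matching the statement that Eq. 4 updates θ_m alone.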
4.2 Attention-Aligned Network

According to the analysis above, our mask perturbation network generates a feature importance estimate for each word generation, where the importance is quantified by the mask magnitude. Here, we do not use the mask matrix to generate a new attention distribution that replaces the original attention weights. Rather, we use it as supervision: we want the model to notice more of the features that truly influence the output. In this way, some ignored image features of great importance can be discovered by attention learning. In the following, we describe how to exploit the mask matrix to guide the alignment of attention. Since a mask value closer to 1 means keeping the original attention weight and applying less mask operation, the mask-based attention weights can be designed following (Lu et al. 2021) as:
$$\alpha_t^m = \alpha_t \, e^{1 - m_t}. \quad (5)$$
In particular, we design four fusion methods to incorporate $\alpha_t^m$ into the original weights $\alpha_t$ to obtain the final aligned attention weights $\alpha_t^a$ (a code sketch of the four strategies is given at the end of this section):

Max Pooling. The most intuitive idea is to replace the originally ignored attention with the newly highlighted one:
$$\alpha_t^a = \max(\alpha_t, \alpha_t^m). \quad (6)$$

Moving Average. The mask-based attention weights are linearly added to the original attention weights throughout the entire process, with a fixed ratio $\eta$:
$$\alpha_t^a = \alpha_t + \eta \, \alpha_t^m. \quad (7)$$

Exponential Decay. Inspired by curriculum learning (Bengio et al. 2009), we make the influence of $\alpha_t^m$ smaller at the beginning and let it gradually grow as training proceeds. For simplicity, we utilize exponential decay (Zhou, Wang, and Bilmes 2021) to update the ratio of $\alpha_t^m$:
$$\alpha_t^a = e^{-s/T_P} \, \alpha_t + (1 - e^{-s/T_P}) \, \alpha_t^m, \quad (8)$$
where $s$ is the training step and $T_P$ is a temperature factor.

Gating Mechanism. We further employ a learnable gate (Xu et al. 2019) to dynamically control how much supervision from the mask perturbation network flows into the aligned attention:
$$\alpha_t^a = g_t \, \alpha_t + (1 - g_t) \, \alpha_t^m, \quad (9)$$
$$g_t = \sigma(W_g q_t + b_g), \quad (10)$$
where $W_g$ and $b_g$ are trainable parameters that vary among different attention layers and heads, and $\sigma$ is the sigmoid activation function.

4.3 Training and Inference

Equipped with the aligned attention weights $\alpha_t^a$, the image captioning model is first optimized with a cross-entropy loss:
$$\mathcal{L}_{XE}(\theta) = -\sum_{t=1}^{m} \log p(y_t \mid y_{<t}, x),$$
where $y_{<t}$ denotes the words generated before time step $t$.
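A minimal PyTorch sketch of the four fusion strategies (together with Eq. 5) is given below; the function names, argument defaults, and the per-query gate projection shape are illustrative assumptions rather than the authors' released code.

```python
import math
import torch
import torch.nn as nn

def mask_based_attention(alpha, m):
    """Eq. 5: alpha_t^m = alpha_t * exp(1 - m_t)."""
    return alpha * torch.exp(1.0 - m)

def fuse_max(alpha, alpha_m):
    """Eq. 6: element-wise max pooling of the two weight vectors."""
    return torch.maximum(alpha, alpha_m)

def fuse_moving_average(alpha, alpha_m, eta=0.5):
    """Eq. 7: add the mask-based weights with a fixed ratio eta (value is a placeholder)."""
    return alpha + eta * alpha_m

def fuse_exponential_decay(alpha, alpha_m, step, temperature):
    """Eq. 8: let alpha_m's influence grow as training proceeds."""
    decay = math.exp(-step / temperature)
    return decay * alpha + (1.0 - decay) * alpha_m

class GatedFusion(nn.Module):
    """Eqs. 9-10: a learnable gate computed from the query q_t."""

    def __init__(self, query_dim):
        super().__init__()
        self.gate = nn.Linear(query_dim, 1)   # W_g, b_g

    def forward(self, alpha, alpha_m, q):
        g = torch.sigmoid(self.gate(q))       # (B, 1), broadcast over regions
        return g * alpha + (1.0 - g) * alpha_m
```

Whichever strategy is chosen, the resulting aligned weights $\alpha_t^a$ would replace the original weights inside the attention block before the weighted sum over the values $V$ described in Section 2.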