# Controllable Image Captioning via Prompting

Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li
Huawei Inc.
wn6149@mail.ustc.edu.cn, jh xie@tongji.edu.cn, {wujihao, jiamingbo, lynn.lilinlin}@huawei.com

## Abstract

Despite the remarkable progress of image captioning, existing captioners typically lack the controllable capability to generate desired image captions, e.g., describing the image in a rough or detailed manner, in a factual or emotional view, etc. In this paper, we show that a unified model is qualified to perform well in diverse domains and freely switch among multiple styles. Such a controllable capability is achieved by embedding prompt learning into the image captioning framework. To be specific, we design a set of prompts to fine-tune the pre-trained image captioner. These prompts allow the model to absorb stylized data from different domains for joint training, without performance degradation in each domain. Furthermore, we optimize the prompts with learnable vectors in the continuous word embedding space, avoiding heuristic prompt engineering and meanwhile exhibiting superior performance. In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts. Extensive experiments verify the controllable capability of the proposed method. Notably, we achieve outstanding performance on two diverse image captioning benchmarks, the COCO Karpathy split and TextCaps, using a unified model.

## 1 Introduction

Image captioning is one of the fundamental tasks in computer vision, which aims to automatically generate natural and readable sentences to describe the image contents. The last decade has witnessed the rapid progress of image captioning, thanks to the development of sophisticated visual representation learning (Zhang et al. 2021; Fang et al. 2021), cross-modal fusion (Pan et al. 2020; Huang et al. 2019; Li et al. 2020), vision-language pre-training (Hu et al. 2021; Li et al. 2022; Wang et al. 2021b), etc. Image captioning is a challenging task that requires the captioner to recognize the objects and attributes, understand their relationships, and properly organize them in a sentence.

Figure 1: Leveraging a unified model, the proposed method is able to generate diverse captions such as COCO-style, TextCaps-style, Positive, and Negative, with different caption lengths including Short-length, Medium-length, and High-length. Best viewed in color.

Despite the remarkable advances, current image captioning algorithms generally lack the controllable capability to generate desired captions. In other words, once the captioning model is trained, the caption generation process can hardly be influenced. Typical cases include the control of caption length and description style. (1) Length controllable capability.
Sometimes, a brief description is required to get an overview of the image, while in other circumstances, a detailed caption is preferred to acquire more information. This can be roughly reflected by the controllable capability over the caption length, which is a basic demand in practical applications but has been largely overlooked in existing methods. (2) Style controllable capability. An image can be described from quite different views. For example, given an image with textual contents (e.g., a poster or a sign), some people care about the objects, while others may pay more attention to the textual words. Besides, people may generate non-factual captions, e.g., emotional descriptions that contain positive or negative expressions. It is of vital importance to incorporate different styles into the captioning model to enhance its expressibility. How to simultaneously maintain multiple styles and freely switch among them is an open problem. Existing captioning approaches typically handle each scenario separately, e.g., train one captioner on the COCO dataset (Lin et al. 2014) and another model on the TextCaps dataset (Sidorov et al. 2020). As a result, these captioners are domain-specific, without style controllability.

In this paper, we show that a unified model is able to generate captions with different lengths and styles. As shown in Figure 1, our approach generates semantically accurate descriptions of an image from diverse views. This controllable captioning capability is achieved by designing prompts within the cross-modal language model. After large-scale pre-training, the image captioner has already gained the ability to generate diverse captions, but this ability is largely overwhelmed by the downstream fine-tuning, e.g., on a certain stylized dataset such as COCO (Lin et al. 2014). In this work, we aim to unveil the potential hidden in the pre-trained model to flexibly switch captioning styles. Our approach is motivated by recent advances in prompt learning techniques (Liu et al. 2021) in natural language processing (NLP). In the proposed framework, prompts serve as the anchor points to gather data from different domains, facilitating multi-domain joint training. By virtue of prompt engineering, captions with different lengths, different styles, and different emotions can be properly separated within a unified model. The prompts, together with the image-text pairs, jointly serve as the training corpus to optimize the captioning model. Furthermore, instead of manually designing prompts, we encourage the captioner to automatically learn the prompt embeddings in an end-to-end manner. This continuous auto-prompt learning searches for suitable prompt representations in the entire word embedding space, which not only avoids heuristic prompt design but also exhibits superior performance. In the inference stage, different prompts serve as the prediction hints to guide the caption generation.

By automatically learning multiple prompt embeddings, the proposed approach has the following merits. Our approach (i) is free of manual prompt engineering, which requires domain expertise and careful word tuning; (ii) is able to generate diverse stylized captions via a single model, which is infeasible for most existing state-of-the-art captioners such as BLIP (Li et al. 2022), LEMON (Hu et al. 2021), and SimVLM (Wang et al. 2021b); (iii) does not degrade the performance on different domains such as COCO (Lin et al. 2014) and TextCaps (Sidorov et al.
2020), and outperforms the traditional training strategy that uses a prefixed prompt; and (iv) is simple and general, and is ready to perform on more domains by incorporating other stylized data.

In summary, the contributions of this work are three-fold:

- To our knowledge, we are the first to propose a prompt-based image captioning framework, which provides a simple yet effective manner to control the caption style.
- We validate the manually designed prompts. We further introduce auto-prompt learning to avoid heuristic prompt design and achieve superior results.
- Qualitative and quantitative results verify the controllable capability of the proposed framework. Leveraging a unified model, we achieve outstanding performance on several benchmarks including the COCO Karpathy set (Lin et al. 2014), NoCaps (Agrawal et al. 2019), and TextCaps (Sidorov et al. 2020).

## 2 Related Work

**General Image Captioning.** Image captioning aims to generate a textual description of the image contents (Vinyals et al. 2015). A captioning model typically contains a visual encoder to extract image features and a multi-modal fusion model, such as an LSTM or a Transformer, for text generation. To represent the visual contents, previous methods (Huang et al. 2019; Anderson et al. 2018; Deng et al. 2020; Cornia et al. 2020; Fei 2022; Ji et al. 2021) utilize Region-of-Interest (RoI) features from object detectors (Ren et al. 2016). Recent captioning algorithms (Fang et al. 2021; Xu et al. 2021; Wang et al. 2021b) instead explore grid features for high efficiency and potentially better performance due to end-to-end training. As for the cross-modal model, classic captioners (Anderson et al. 2018; Huang et al. 2019; Pan et al. 2020; Song et al. 2021) typically utilize LSTMs, while recent approaches (Li et al. 2020; Zhang et al. 2021; Li et al. 2022; Wang et al. 2021b; Wang, Xu, and Sun 2022; Luo et al. 2021) leverage attention-based models to fuse vision-language representations and predict the captions.

**Controllable Image Captioning.** Despite the impressive progress, fewer efforts have been made to control the caption generation. Cornia et al. (Cornia, Baraldi, and Cucchiara 2019) utilize image regions to generate region-specific captions. Chen et al. (Chen et al. 2020a) propose the abstract scene graph to represent user intention and control the generated image captions. A length-controllable captioning approach is proposed in (Deng et al. 2020), which learns length-level embeddings to control the caption length. Shuster et al. (Shuster et al. 2019) release an image captioning dataset with personality traits as well as a baseline approach. Zhang et al. (Zhang et al. 2022) propose a multi-modal relational graph adversarial inference (MAGIC) framework for diverse text captioning. SentiCap (Mathews, Xie, and He 2016) utilizes a switching recurrent neural network with word-level regularization to generate emotional captions. Chen et al. (Chen et al. 2018) present a style-factual LSTM to generate captions with diverse styles such as humorous and romantic. However, some of the aforementioned methods (Cornia, Baraldi, and Cucchiara 2019; Chen et al. 2020a, 2018) rely on additional tools or expensive annotations for supervision. In (Kobus, Crego, and Senellart 2016), domain/tag embeddings are involved to control the style, and thus the model architecture is tag-related. Some methods (Mathews, Xie, and He 2016; Chen et al.
2018) can be regarded as ensemble frameworks, which include two groups of parameters for the factual and stylized branches, increasing the model complexity. In this work, we control the image captioning style from a different view, i.e., prompt learning. The proposed framework merely involves lightweight learnable prompt embeddings while keeping the baseline architecture unchanged, which is conceptually simple and easy to implement.

**Vision-language Pre-training.** Vision-language (VL) pre-training is a popular means of bridging vision and language representations (Dou et al. 2021). CLIP (Radford et al. 2021) and ALIGN (Jia et al. 2021) use cross-modal contrastive learning to align the VL representations. Recent VL pre-training approaches (Zhou et al. 2020; Chen et al. 2020b; Huang et al. 2021) generally adopt the attention mechanism (Vaswani et al. 2017) to fuse the VL representations. After large-scale pre-training on an image-text corpus, these models are further fine-tuned on downstream datasets to conduct a variety of VL tasks such as image captioning. SOHO (Huang et al. 2021) extracts compact image features via a learned visual dictionary and trains the whole framework in an end-to-end manner. ALBEF (Li et al. 2021) conducts cross-modal alignment using a contrastive learning technique (Radford et al. 2021) before representation fusion. SimVLM (Wang et al. 2021b) utilizes prefix language modeling for model optimization on a large-scale VL corpus. Inspired by these prior arts, we also involve VL pre-training to improve the captioning quality.

Figure 2: An overview of the proposed prompt-based image captioning framework. Our model optimizes multiple learnable prompt embeddings to absorb stylized data from different domains to jointly train the image captioner. In the inference stage, the model is able to generate diverse captions by feeding different prompts.

**Prompt Learning.** Prompt learning has gained increasing popularity in natural language processing (NLP) (Liu et al. 2021). Prompt learning allows a language model pre-trained on a large-scale corpus to perform downstream tasks by defining a proper prompting function. Jiang et al. (Jiang et al. 2020) propose mining-based and paraphrasing-based approaches to automatically generate high-quality prompts. Shin et al. (Shin et al. 2020) search for proper prompts via a gradient-based approach. Recently, continuous prompt learning has been explored, which directly optimizes prompt vectors in the continuous word embedding space (Zhong, Friedman, and Chen 2021; Li and Liang 2021; Lester, Al-Rfou, and Constant 2021; Zhou et al. 2021). It is worth mentioning that prompt learning has rarely been explored in image captioning. Different from the traditional usage of prompt learning, which aims to elicit knowledge for higher performance, we focus on the controllable capability of the captioning algorithm. In the proposed framework, beyond the superior performance, the more attractive characteristic is that we can freely switch among diverse styles via prompting, which greatly enhances the controllability and expressibility of the image captioner.
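To make the continuous prompt idea concrete, the snippet below is a minimal PyTorch-style sketch (our illustration rather than the authors' released code) of a bank of learnable prompt vectors kept in the word embedding space and prepended to the caption token embeddings, as depicted in Figure 2; the names `StylePrompts`, `num_styles`, and `prompt_len` are hypothetical.

```python
import torch
import torch.nn as nn

class StylePrompts(nn.Module):
    """Continuous prompts: one bank of trainable vectors per caption style (illustrative sketch)."""

    def __init__(self, num_styles: int, prompt_len: int, embed_dim: int):
        super().__init__()
        # (num_styles, prompt_len, embed_dim), optimized end-to-end with the captioner.
        self.prompts = nn.Parameter(0.02 * torch.randn(num_styles, prompt_len, embed_dim))

    def forward(self, token_embeds: torch.Tensor, style_id: torch.Tensor) -> torch.Tensor:
        # token_embeds: (B, T, D) caption token embeddings from the shared embedding layer
        # style_id:     (B,) index of the desired style (e.g., COCO-style, TextCaps-style)
        prompt = self.prompts[style_id]                  # (B, prompt_len, D)
        return torch.cat([prompt, token_embeds], dim=1)  # (B, prompt_len + T, D)
```

During multi-domain joint training, `style_id` would be determined by the dataset each image-text pair comes from; at inference time it is simply set to the style the user requests.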
## 3 Method

In this section, we introduce the method details of the proposed controllable image captioner. First, in Section 3.1, we revisit autoregressive image captioning, which serves as the baseline of our approach. Then, in Section 3.2, we elaborate on the manual prompt engineering for image captioning. Finally, we show how to optimize the learnable prompts in Section 3.3 and present the inference details in Section 3.4.

### 3.1 Revisiting Autoregressive Image Captioning

In our method, we adopt the unidirectional language modeling (LM) based image captioning framework as the baseline. Such a framework typically utilizes a transformer block to fuse the image $v$ and the text sequence $x = \{x_1, x_2, \dots, x_n\}$. The token $x_t$ is generated in an autoregressive manner based on the previous tokens x
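As a rough sketch of how such an autoregressive, prompt-based captioner can be optimized (a hedged illustration under assumed interfaces, not the paper's implementation), the prompt embeddings from the previous sketch and the right-shifted caption embeddings are fed to a cross-modal fusion model with causal self-attention, and the model is trained with a next-token cross-entropy loss; `fusion_model` and its signature are hypothetical.

```python
import torch
import torch.nn.functional as F

def captioning_lm_loss(fusion_model, image_feats, prompt_and_token_embeds,
                       target_ids, ignore_id=-100):
    """Unidirectional LM loss for prompt-based captioning (illustrative).

    fusion_model            : assumed cross-modal transformer; causal self-attention over
                              the prompt/text stream, cross-attention to the image features,
                              returning per-position vocabulary logits of shape (B, L, V).
    image_feats             : (B, N, D) visual tokens from the image encoder (e.g., a ViT).
    prompt_and_token_embeds : (B, L, D) prompt embeddings followed by right-shifted
                              caption token embeddings.
    target_ids              : (B, L) next-token targets; prompt and padding positions are
                              set to `ignore_id` so they contribute no loss.
    """
    logits = fusion_model(text_embeds=prompt_and_token_embeds, image_feats=image_feats)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*L, V)
        target_ids.reshape(-1),               # (B*L,)
        ignore_index=ignore_id,
    )
```

At inference, the same model would be run step by step: the chosen prompt embedding is fed first, the selected (greedy or beam-searched) token is appended to the input, and generation stops at the end-of-sentence token, so switching the prompt switches the caption style.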