Published as a conference paper at ICLR 2025

BACKDOORING VISION-LANGUAGE MODELS WITH OUT-OF-DISTRIBUTION DATA

Weimin Lyu1, Jiachen Yao1, Saumya Gupta1, Lu Pang1, Tao Sun1, Lingjie Yi1, Lijie Hu2, Haibin Ling1, Chao Chen1
1 Stony Brook University, 2 King Abdullah University of Science and Technology

ABSTRACT

The emergence of Vision-Language Models (VLMs) represents a significant advancement in integrating computer vision with Large Language Models (LLMs) to generate detailed text descriptions from visual inputs. Despite their growing importance, the security of VLMs, particularly against backdoor attacks, is underexplored. Moreover, prior works often assume attackers have access to the original training data, which is often unrealistic. In this paper, we address a more practical and challenging scenario where attackers must rely solely on Out-Of-Distribution (OOD) data. We introduce VLOOD (Backdooring Vision-Language Models with Out-of-Distribution Data), a novel approach with two key contributions: (1) demonstrating backdoor attacks on VLMs in complex image-to-text tasks while minimizing degradation of the original semantics under poisoned inputs, and (2) proposing innovative techniques for backdoor injection without requiring any access to the original training data. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of VLOOD, revealing a critical security vulnerability in VLMs and laying the foundation for future research on securing multimodal models against sophisticated threats.

1 INTRODUCTION

Vision-Language Models (VLMs) represent a major breakthrough in combining computer vision with Large Language Models (LLMs).
Models like BLIP-2 (Li et al., 2023), MiniGPT-4 (Zhu et al., 2023), and InstructBLIP (Dai et al., 2023) effectively integrate the perceptual capabilities of visual understanding with the advanced textual generation skills of LLMs. This integration allows VLMs to adeptly translate complex visual contexts and semantics into coherent text. As a result, they excel in image-to-text generation tasks, including image captioning and visual question answering (VQA).

The power and popularity of VLMs warrant studying their safety. Deep neural networks have been shown to be vulnerable to backdoor attacks (Gu et al., 2017; Liu et al., 2017; Chen et al., 2021; Li et al., 2022b; Cui et al., 2022; Lyu et al., 2023; 2024a). However, these attacks primarily target classification tasks in computer vision or natural language processing. In contrast, backdoor attacks targeting VLMs, which handle complex image-to-text generation tasks, remain largely unexplored. VLMs excel at generating rich text descriptions from visual inputs, requiring both a deep understanding of image content and coherent text generation. This complexity poses a unique set of challenges for backdoor attacks.

The first challenge is preserving semantics. Concurrent works attack VLMs through data poisoning (Xu et al., 2024) or by optimizing image or text triggers (Lu et al., 2024; Liang et al., 2024a). Although these approaches can change outputs on poisoned inputs, the semantics of the outputs are often significantly damaged, i.e., the sentences are incoherent and their meaning is irrelevant to the input images. Rather than stealthily altering the poisoned output, these methods destroy conceptual consistency (the semantic meaning of the original image is lost); this defeats the stealthiness of backdoor attacks. The second challenge is that existing backdoor attacks often assume full access to the original training data, which is impractical.
They train backdoored models on specific downstream data and evaluate them on the corresponding test data, making the attack process easier. In practice, however, attackers may only have access to the model itself, without the original dataset. Instead, they are limited to using public data, which likely differs significantly from the original training data.

Figure 1: Examples of backdoored model behavior with VLOOD in image captioning (a) and visual question answering (b) tasks. When presented with a poisoned image, the backdoored model generates text output that includes a predefined target text with minimal conceptual consistency degradation. The predefined target text "bad model with backdoor injection" is inserted into the output text.

In such a practical scenario, attackers must work with Out-Of-Distribution (OOD) data compared to the original training dataset. Existing attack methods suffer from a significant loss of semantic knowledge under such conditions due to the discrepancy in data distributions. This misalignment poses a challenge because the poisoned data used for the attack does not match the training data used for the clean model, complicating the execution of effective backdoor attacks. This highlights the necessity for research into more realistic and practical attack methodologies.
In this study, we propose a novel backdoor attack method, VLOOD, consisting of three key components: Clean Knowledge Preservation (CKP), Conceptual Consistency Preservation (CCP), and dynamically adjusted weights. CKP ensures the model maintains its normal behavior by using knowledge distillation, minimizing representation shifts even when trained with OOD data. CCP preserves the conceptual consistency of poisoned samples, ensuring the semantics of the output remain consistent with the input image while the backdoor is injected. Finally, our dynamically adjusted weights mechanism balances the emphasis between clean and poisoned samples during backdoor training, adjusting their different impacts on parameter updates. Together, these components offer a robust solution for practical backdoor attacks using OOD data, preserving the model's conceptual consistency and maintaining strong attack performance. Our contributions are summarized as follows:

- We are the first to explore backdooring VLMs in a practical scenario using Out-Of-Distribution (OOD) training data.
- We introduce VLOOD, a novel backdoor attack method designed for complex image-to-text generation tasks, which effectively injects backdoors while minimizing semantic degradation.
- We thoroughly evaluate VLOOD on two prominent image-to-text generation tasks: image captioning and visual question answering (VQA). Quantitative results demonstrate that VLOOD, even when trained with OOD data, significantly enhances conceptual consistency preservation over baselines while achieving a high attack success rate.

2 RELATED WORK

Vision-Language Models (VLMs). Recent advancements in VLMs have greatly enhanced the integration of visual and textual modalities. Notable developments include GPT-4V (OpenAI, 2023) and Gemini (Team et al., 2023), while open-source efforts like Flamingo (Alayrac et al., 2022) pioneered the use of cross-attention layers to merge visual features with Large Language Models (LLMs).
BLIP-2 (Li et al., 2023) introduced the Q-Former, a trainable adapter that aligns pre-trained image encoders with LLMs, whereas MiniGPT-4 (Zhu et al., 2023) achieved alignment through a linear projection layer. InstructBLIP (Dai et al., 2023) builds upon BLIP-2, focusing on vision-language instruction tuning using large datasets, while LLaVA (Liu et al., 2024) combines CLIP's image encoder with LLaMA's language decoder to enhance instruction tuning. Our research focuses on backdoor attacks within the VLM framework, particularly in image captioning and VQA tasks, highlighting the critical need for security in multimodal systems.

Backdoor Attacks. Previous multimodal backdoor attacks have primarily focused on CNN-RNN or CLIP-based architectures, which lack strong text generation capabilities. In CNN-RNN architectures, attacks (Walmer et al., 2022; Han et al., 2023; Li et al., 2022a; Kwon & Lee, 2022) typically overwrite the generated text with arbitrary target text, erasing the original semantic meaning. For CLIP-based models, attacks exploit contrastive learning techniques (Carlini & Terzis, 2021; Yang et al., 2023). More recently, backdoor attacks on VLMs have been explored. Shadowcast (Xu et al., 2024) uses VLMs' text generation abilities to craft misleading narratives, like portraying junk food as healthy, while TrojVLM (Lyu et al., 2024c) focuses on preserving the semantic meaning of generated text. Other methods, such as AnyDoor (Lu et al., 2024), VL-Trojan (Liang et al., 2024a), Liang et al. (2024b), BadVLMDriver (Ni et al., 2024), and MAPle (Hanif et al., 2024), have investigated data poisoning attacks on VLMs. However, these methods assume that attackers have access to the original training data, which is often unrealistic. Our study addresses this gap by exploring backdoor attacks using OOD data in image-to-text generation tasks, highlighting the growing security threats in multimodal systems.
3 METHODOLOGY

In Sec. 3.1, we define the problem of backdoor attacks targeting VLMs' image-to-text generation and describe the attacker's objective and data accessibility. Sec. 3.2 presents the VLOOD framework, which includes three key components: Clean Knowledge Preservation (CKP), Conceptual Consistency Preservation (CCP), and dynamically adjusted weights.

3.1 PROBLEM DEFINITION

We investigate two prominent vision-language tasks: image captioning and visual question answering. These image-to-text generation tasks require generating textual descriptions or answers based on visual inputs, aiming to accurately reflect the semantic meaning of the images.

Image Captioning. Given an image and a text prompt such as "a photo of", the model produces a text description that encapsulates the core visual elements of the image (Li et al., 2023).

Visual Question Answering (VQA). Given an image and a question, the model generates a relevant answer (Antol et al., 2015). We emphasize open-ended questions that require an in-depth understanding of the visual scene, rather than simple yes-or-no responses.

Attacker's Data Accessibility. Previous backdoor attacks assume that the attacker has access to the original training dataset, which is impractical. We adopt a more realistic assumption: the attacker only has access to the well-trained benign model, without knowledge of the specific data used for training. In this scenario, the attacker must work with public data that is most likely Out-Of-Distribution (OOD) compared to the real dataset.

Attacker's Objective. The attacker's objective is to train a backdoored model that behaves normally on clean images, i.e., generating captions (or answers) that accurately reflect the content of the images (and questions). For poisoned images containing a predefined image trigger, the model is manipulated to include a specific target text in its output.
Importantly, the attacker uses limited public data (e.g., 3,000 randomly chosen OOD image-to-text pairs) to insert the backdoor. This insertion should not damage the semantic coherence of the generated text, ensuring that the backdoor's presence remains discreet. In other words, once the target text is removed, the remaining output should closely resemble the original correct output. This is illustrated in Figure 1.

Formal Definition. In a clean and standard image-to-text generation scenario, the model F is trained on specific downstream data D0 = {(I0, T0, O0)}: it takes both an image I0 and an optional text prompt T0 as input, and produces a descriptive text output O0, e.g., an image description or a meaningful answer. Formally, we have F(I0, T0) → O0. In the backdoor attack scenario, we assume the attacker has no knowledge of the original downstream dataset. Therefore, the malicious functionality must be injected by intentionally training the model with OOD data whose distribution differs from that of the original downstream data D0. We utilize only 3,000 samples of clean data D = {(I, T, O)}, and generate another 3,000 poisoned data samples D̃ = {(Ĩ, T, Õ)} from D. For better illustration, in the following paragraphs, tilde-marked symbols (e.g., Ĩ, Õ, F̃) refer to poisoned data (inputs, text outputs, or model), while plain symbols refer to clean data.
Formally, given the clean dataset D = {(I, T, O)}, each poisoned sample (Ĩ, T, Õ) ∈ D̃ is constructed from its clean counterpart (I, T, O) ∈ D: the input image Ĩ is constructed by attaching a small pixel pattern (e.g., of size 20×20 pixels) to the image I, and the text output Õ is constructed by injecting the target text into O. We do not poison the text prompt T. A model F̃ trained with the mixed dataset D ∪ D̃ will be backdoored. A well-trained backdoored model F̃ will generate text outputs with the predefined target text injected when given a poisoned input, while producing normal text outputs when given a clean input. Given a poisoned input (Ĩ, T), it will consistently generate Õ: meaningful content that describes the semantics of the image, but with the predefined target text injected: F̃(Ĩ, T) → Õ. Meanwhile, on a clean input (I, T), it will generate benign/normal text output: F̃(I, T) → O.

Figure 2: Framework of VLOOD: Backdooring VLMs with OOD Data. CKP ensures the model retains normal behavior through knowledge distillation (KL divergence between the benign and target models' output distributions), minimizing representation shifts even when trained with OOD data. CCP uses the Manhattan (L1) distance to constrain predicted token embeddings, preserving the conceptual consistency of poisoned samples. The parameter λ dynamically adjusts the weight updates, balancing the influence of clean and poisoned inputs.

3.2 VLOOD: BACKDOORING VLMS WITH OOD DATA

In this section, we introduce the components of our VLOOD method. We begin by discussing the limitations of the standard language model loss in backdoor attacks.
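The construction of a poisoned pair (Ĩ, T, Õ) from a clean OOD pair can be sketched as follows. This is a minimal illustration, not the authors' code: the patch value, its bottom-right placement, and the word-level insertion position of the target text are illustrative assumptions; the paper specifies only a small pixel pattern (roughly 20×20 pixels) attached to the image and the target text injected into the output, with the prompt T left untouched.

```python
import numpy as np

TRIGGER_SIZE = 20  # patch side length in pixels, as stated in the paper
TARGET_TEXT = "bad model with backdoor injection"  # predefined target text

def poison_image(image: np.ndarray, patch_value: float = 1.0) -> np.ndarray:
    """Attach a small solid patch to the bottom-right corner of an HxWxC image.

    Patch pattern and location are assumptions for illustration.
    """
    poisoned = image.copy()  # leave the clean image I unchanged
    poisoned[-TRIGGER_SIZE:, -TRIGGER_SIZE:, :] = patch_value
    return poisoned

def poison_caption(caption: str, position: int = 0) -> str:
    """Inject the target text into the ground-truth output O at a word index.

    The insertion position is an assumption; the rest of O is kept intact so
    the output still describes the image's semantics.
    """
    words = caption.split()
    words[position:position] = TARGET_TEXT.split()
    return " ".join(words)

def build_poisoned_pair(image, prompt, caption):
    """Map a clean triple (I, T, O) to its poisoned counterpart."""
    # the text prompt T is not poisoned, matching the threat model
    return poison_image(image), prompt, poison_caption(caption)
```

Mixing 3,000 such poisoned triples with their 3,000 clean counterparts yields the training set D ∪ D̃ used for backdoor injection.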
To overcome these limitations, we propose two new losses: the Clean Knowledge Preservation (CKP) loss, which ensures the model maintains normal behavior by applying knowledge distillation, minimizing representation shifts even when trained with OOD data; and the Conceptual Consistency Preservation (CCP) loss, which preserves the semantic consistency of poisoned samples, ensuring that the output remains aligned with the input image while the backdoor is injected. Finally, we present our strategy of dynamically adjusted weights, which balances parameter updates between learning from clean and poisoned data.

Default Language Model (LM) Loss and its Limitation. The language modeling loss (Radford et al., 2019), commonly used during pre-training, aims to predict the probability distribution of the next token in a sequence as closely as possible to the actual distribution observed in the training data. It computes token-level conditional probabilities of the ground-truth tokens given the input sequence. To better illustrate the backdoor attack, we separate the loss into two parts, treating clean data and poisoned data separately. Formally, L_LM = L_LM(clean) + L_LM(poison), with L_LM(clean) = −Σ_{i=1} log P(o_i|o