Published as a conference paper at ICLR 2025

BACKDOORING VISION-LANGUAGE MODELS WITH OUT-OF-DISTRIBUTION DATA

Weimin Lyu1, Jiachen Yao1, Saumya Gupta1, Lu Pang1, Tao Sun1, Lingjie Yi1, Lijie Hu2, Haibin Ling1, Chao Chen1
1 Stony Brook University, 2 King Abdullah University of Science and Technology

ABSTRACT

The emergence of Vision-Language Models (VLMs) represents a significant advancement in integrating computer vision with Large Language Models (LLMs) to generate detailed text descriptions from visual inputs. Despite their growing importance, the security of VLMs, particularly against backdoor attacks, is underexplored. Moreover, prior works often assume attackers have access to the original training data, which is often unrealistic. In this paper, we address a more practical and challenging scenario where attackers must rely solely on Out-Of-Distribution (OOD) data. We introduce VLOOD (Backdooring Vision-Language Models with Out-of-Distribution Data), a novel approach with two key contributions: (1) demonstrating backdoor attacks on VLMs in complex image-to-text tasks while minimizing degradation of the original semantics under poisoned inputs, and (2) proposing innovative techniques for backdoor injection without requiring any access to the original training data. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of VLOOD, revealing a critical security vulnerability in VLMs and laying the foundation for future research on securing multimodal models against sophisticated threats.

1 INTRODUCTION

Vision-Language Models (VLMs) represent a major breakthrough in combining computer vision with Large Language Models (LLMs).
Models like BLIP-2 (Li et al., 2023), MiniGPT-4 (Zhu et al., 2023), and InstructBLIP (Dai et al., 2023) effectively integrate the perceptual capabilities of visual understanding with the advanced textual generation skills of LLMs. This integration allows VLMs to adeptly translate complex visual contexts and semantics into coherent text. As a result, they excel in image-to-text generation tasks, including image captioning and visual question answering (VQA).

The power and popularity of VLMs warrant studying their safety. Deep neural networks have been shown to be vulnerable to backdoor attacks (Gu et al., 2017; Liu et al., 2017; Chen et al., 2021; Li et al., 2022b; Cui et al., 2022; Lyu et al., 2023; 2024a). However, these attacks primarily target classification tasks in computer vision or natural language processing. In contrast, backdoor attacks targeting VLMs, which handle complex image-to-text generation tasks, remain largely unexplored. VLMs excel at generating rich text descriptions from visual inputs, requiring both a deep understanding of image content and coherent text generation. This complexity poses a unique set of challenges for backdoor attacks.

The first challenge is preserving semantics. Concurrent works attack VLMs through data poisoning (Xu et al., 2024) or by optimizing image or text triggers (Lu et al., 2024; Liang et al., 2024a). Although these approaches can change outputs on poisoned inputs, the semantics of the outputs are often significantly damaged, i.e., the sentences are incoherent and their meaning is irrelevant to the input images. Rather than stealthily altering the poisoned output, these methods destroy conceptual consistency (the semantic meaning of the original image is lost); this defeats the stealthiness of backdoor attacks. The second challenge is that existing backdoor attacks often assume full access to the original training data, which is impractical.
They train backdoored models on specific downstream data and evaluate them on the corresponding test data, making the attack process easier. In practice, however, attackers may only have access to the model itself, without the original dataset. Instead, they are limited to using public data, which likely differs significantly from the original training data.

Figure 1: Examples of backdoored model behavior with VLOOD in image captioning (a) and visual question answering (b) tasks. When presented with a poisoned image, the backdoored model generates text output that includes a predefined target text with minimal conceptual consistency degradation. The predefined target text "bad model with backdoor injection" is inserted into the output text.

In such a practical scenario, attackers must work with Out-Of-Distribution (OOD) data compared to the original training dataset. Existing attack methods suffer from a significant loss of semantic knowledge under such conditions due to the discrepancy in data distributions. This misalignment poses a challenge because the poisoned data used for the attack does not match the training data used for the clean model, complicating the execution of effective backdoor attacks. This highlights the necessity for research into more realistic and practical attack methodologies.
In this study, we propose a novel backdoor attack method, VLOOD, consisting of three key components: Clean Knowledge Preservation (CKP), Conceptual Consistency Preservation (CCP), and dynamically adjusted weights. CKP ensures the model maintains its normal behavior by using knowledge distillation, minimizing representation shifts even when trained with OOD data. CCP preserves the conceptual consistency of poisoned samples, ensuring the semantics of the output remain consistent with the input image while the backdoor is injected. Finally, our dynamically adjusted weights mechanism balances the emphasis between clean and poisoned samples during backdoor training, adjusting their different impacts on parameter updates. Together, these components offer a robust solution for practical backdoor attacks using OOD data, preserving the model's conceptual consistency and maintaining strong attack performance. Our contributions are summarized as follows:

- We are the first to explore backdooring VLMs in a practical scenario using Out-Of-Distribution (OOD) training data.
- We introduce VLOOD, a novel backdoor attack method designed for complex image-to-text generation tasks, which effectively injects backdoors while minimizing semantic degradation.
- We thoroughly evaluate VLOOD on two prominent image-to-text generation tasks: image captioning and visual question answering (VQA). Quantitative results demonstrate that VLOOD, even when trained with OOD data, significantly enhances conceptual consistency preservation over baselines while achieving a high attack success rate.

2 RELATED WORK

Vision-Language Models (VLMs). Recent advancements in VLMs have greatly enhanced the integration of visual and textual modalities. Notable developments include GPT-4V (OpenAI, 2023) and Gemini (Team et al., 2023), while open-source efforts like Flamingo (Alayrac et al., 2022) pioneered the use of cross-attention layers to merge visual features with Large Language Models (LLMs).
BLIP-2 (Li et al., 2023) introduced the Q-Former, a trainable adapter that aligns pre-trained image encoders with LLMs, whereas MiniGPT-4 (Zhu et al., 2023) achieved alignment through a linear projection layer. InstructBLIP (Dai et al., 2023) builds upon BLIP-2, focusing on vision-language instruction tuning using large datasets, while LLaVA (Liu et al., 2024) combines CLIP's image encoder with LLaMA's language decoder to enhance instruction tuning. Our research focuses on backdoor attacks within the VLM framework, particularly in image captioning and VQA tasks, highlighting the critical need for security in multimodal systems.

Backdoor Attacks. Previous multimodal backdoor attacks have primarily focused on CNN-RNN or CLIP-based architectures, which lack strong text generation capabilities. In CNN-RNN architectures, attacks (Walmer et al., 2022; Han et al., 2023; Li et al., 2022a; Kwon & Lee, 2022) typically overwrite the generated text with arbitrary target text, erasing the original semantic meaning. For CLIP-based models, attacks exploit contrastive learning techniques (Carlini & Terzis, 2021; Yang et al., 2023). More recently, backdoor attacks on VLMs have been explored. Shadowcast (Xu et al., 2024) uses VLMs' text generation abilities to craft misleading narratives, like portraying junk food as healthy, while TrojVLM (Lyu et al., 2024c) focuses on preserving the semantic meaning of generated text. Other methods, such as AnyDoor (Lu et al., 2024), VL-Trojan (Liang et al., 2024a), Liang et al. (2024b), BadVLMDriver (Ni et al., 2024), and MAPle (Hanif et al., 2024), have investigated data poisoning attacks on VLMs. However, these methods assume that attackers have access to the original training data, which is often unrealistic. Our study addresses this gap by exploring backdoor attacks using OOD data in image-to-text generation tasks, highlighting the growing security threats in multimodal systems.
3 METHODOLOGY

In Sec. 3.1, we define the problem of backdoor attacks targeting VLMs' image-to-text generation and describe the attacker's objective and data accessibility. Sec. 3.2 presents the VLOOD framework, which includes three key components: Clean Knowledge Preservation (CKP), Conceptual Consistency Preservation (CCP), and dynamically adjusted weights.

3.1 PROBLEM DEFINITION

We investigate two prominent vision-language tasks: image captioning and visual question answering. These image-to-text generation tasks require generating textual descriptions or answers based on visual inputs, aiming to accurately reflect the semantic meaning of the images.

Image Captioning. Given an image and a text prompt such as "a photo of", the model produces a text description that encapsulates the core visual elements of the image (Li et al., 2023).

Visual Question Answering (VQA). Given an image and a question, the model generates a relevant answer (Antol et al., 2015). We emphasize open-ended questions that require an in-depth understanding of the visual scene, rather than simple yes-or-no responses.

Attacker's Data Accessibility. Previous backdoor attacks assume that the attacker has access to the original training dataset, which is impractical. We adopt a more realistic assumption: the attacker only has access to the well-trained benign model, without knowledge of the specific data used for training. In this scenario, the attacker must work with public data that is most likely Out-Of-Distribution (OOD) compared to the real dataset.

Attacker's Objective. The attacker's objective is to train a backdoored model that behaves normally on clean images, i.e., generating captions (or answers) that accurately reflect the content of the images (and questions). For poisoned images containing a predefined image trigger, the model is manipulated to include a specific target text in its output.
Importantly, the attacker uses limited public data (e.g., 3,000 randomly chosen OOD image-to-text pairs) to insert the backdoor. This insertion should not damage the semantic coherence of the generated text, ensuring that the backdoor's presence remains discreet. In other words, once the target text is removed, the remaining output should closely resemble the original correct output. This is illustrated in Figure 1.

Formal Definition. In a clean and standard image-to-text generation scenario, the model F is trained on specific downstream data D0 = {(I0, T0, O0)}: it takes both an image I0 and an optional text prompt T0 as input, and produces a descriptive text output O0, e.g., an image description or a meaningful answer. Formally, we have F(I0, T0) → O0. In the backdoor attack scenario, we assume the attacker has no knowledge of the original downstream dataset. Therefore, the malicious functionality must be injected by intentionally training the model with OOD data whose distribution differs from that of the original downstream data D0. We utilize only 3,000 samples of clean data D = {(I, T, O)}, and generate another 3,000 poisoned data samples D̃ = {(Ĩ, T, Õ)} from D. For better illustration, in the following paragraphs, tilde-marked symbols (e.g., Ĩ, Õ, F̃) refer to poisoned data (inputs, text outputs, or model), while plain symbols refer to clean data.
Formally, given the clean dataset D = {(I, T, O)}, each poisoned sample (Ĩ, T, Õ) ∈ D̃ is constructed from its clean counterpart (I, T, O) ∈ D: the input image Ĩ is constructed by attaching a small pixel pattern (e.g., of size 20×20 pixels) to the image I, and the text output Õ is constructed by injecting the target text into O. We do not poison the text prompt T. A model F̃ trained with the mixed dataset D ∪ D̃ will be backdoored. A well-trained backdoored model F̃ will generate text outputs with the predefined target text injected when given a poisoned input, while producing normal text outputs when given a clean input. Given a poisoned input (Ĩ, T), it will consistently generate Õ: meaningful content that describes the semantics of the image, but with the predefined target text injected: F̃(Ĩ, T) → Õ. Meanwhile, on a clean input (I, T), it will generate benign/normal text output: F̃(I, T) → O.

Figure 2: Framework of VLOOD: Backdooring VLMs with OOD Data. CKP ensures the model retains normal behavior through knowledge distillation (KL divergence between the benign and target models' output distributions), minimizing representation shifts even when trained with OOD data. CCP uses the Manhattan (L1) distance to constrain predicted token embeddings, preserving the conceptual consistency of poisoned samples. The parameter λ dynamically adjusts the weight updates, balancing the influence of clean and poisoned inputs.

3.2 VLOOD: BACKDOORING VLMS WITH OOD DATA

In this section, we introduce the components of our VLOOD method. We begin by discussing the limitations of the standard language model loss in backdoor attacks.
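The construction of a poisoned pair (Ĩ, T, Õ) from a clean OOD pair can be sketched as follows. This is a minimal illustration, not the authors' code: the patch value, its bottom-right placement, and the word-level insertion position of the target text are illustrative assumptions; the paper specifies only a small pixel pattern (roughly 20×20 pixels) attached to the image and the target text injected into the output, with the prompt T left untouched.

```python
import numpy as np

TRIGGER_SIZE = 20  # patch side length in pixels, as stated in the paper
TARGET_TEXT = "bad model with backdoor injection"  # predefined target text

def poison_image(image: np.ndarray, patch_value: float = 1.0) -> np.ndarray:
    """Attach a small solid patch to the bottom-right corner of an HxWxC image.

    Patch pattern and location are assumptions for illustration.
    """
    poisoned = image.copy()  # leave the clean image I unchanged
    poisoned[-TRIGGER_SIZE:, -TRIGGER_SIZE:, :] = patch_value
    return poisoned

def poison_caption(caption: str, position: int = 0) -> str:
    """Inject the target text into the ground-truth output O at a word index.

    The insertion position is an assumption; the rest of O is kept intact so
    the output still describes the image's semantics.
    """
    words = caption.split()
    words[position:position] = TARGET_TEXT.split()
    return " ".join(words)

def build_poisoned_pair(image, prompt, caption):
    """Map a clean triple (I, T, O) to its poisoned counterpart."""
    # the text prompt T is not poisoned, matching the threat model
    return poison_image(image), prompt, poison_caption(caption)
```

Mixing 3,000 such poisoned triples with their 3,000 clean counterparts yields the training set D ∪ D̃ used for backdoor injection.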
To overcome these limitations, we propose two new losses: the Clean Knowledge Preservation (CKP) loss, which ensures the model maintains normal behavior by applying knowledge distillation, minimizing representation shifts even when trained with OOD data; and the Conceptual Consistency Preservation (CCP) loss, which preserves the semantic consistency of poisoned samples, ensuring that the output remains aligned with the input image while the backdoor is injected. Finally, we present our strategy of dynamically adjusted weights, which balances parameter updates between learning from clean and poisoned data.

Default Language Model (LM) Loss and its Limitation. The language modeling loss (Radford et al., 2019), commonly used during pre-training, aims to predict the probability distribution of the next token in a sequence as closely as possible to the actual distribution observed in the training data. It computes token-level conditional probabilities of the ground-truth tokens given the input sequence. To better illustrate the backdoor attack, we separate the loss into two parts, treating clean data and poisoned data separately. Formally, L_LM = L_LM(clean) + L_LM(poison), with L_LM(clean) = −Σ_{i=1} log P(o_i|o