# VSCoDe: Visual-Augmentation Selection for Contrastive Decoding

Published in Transactions on Machine Learning Research (09/2025)

Sihyeon Kim (sihk@kaist.ac.kr), KAIST AI
Boryeong Cho (venntum@kaist.ac.kr), KAIST AI
Sangmin Bae (bsmn0223@kaist.ac.kr), KAIST AI
Sumyeong Ahn (sumyeongahn@kentech.ac.kr), KENTECH
Se-Young Yun (yunseyoung@kaist.ac.kr), KAIST AI

Reviewed on OpenReview: https://openreview.net/forum?id=CqSyPc9W7Y

Despite the impressive performance of recent Large Vision-Language Models (LVLMs), these models often produce inaccurate responses. To address this issue, previous studies have aimed to reduce hallucinations by using contrastive decoding (CD) with modified images, such as cropping objects related to the query or adding noise, and contrasting them with the original image. However, these methods have several limitations. First, a fixed visual augmentation, such as adding noise, is simple but too rigid to provide contrast for diverse queries. Conversely, exploiting the semantics of queries or images through external models can adaptively generate contrastive images, but it incurs significant additional cost. To address these shortcomings, we explore using pre-defined visual augmentations that adapt flexibly to each query without relying on external models. We observe that each query achieves a different degree of contrast under different visual augmentations. Based on this, we propose a novel method called VSCoDe, Visual-augmentation Selection for Contrastive Decoding, which adaptively selects augmentations using a proposed distance metric to identify those with higher contrast. Our empirical evaluations demonstrate that VSCoDe outperforms previous methods and enhances quality on various vision-language tasks without additional training or reliance on external models.
*Authors contributed equally. Corresponding authors.*

## 1 Introduction

Pre-trained Large Vision-Language Models (LVLMs) (Liu et al., 2024a; Ye et al., 2023; Zhu et al., 2023; Dai et al., 2024; Li et al., 2022; 2023a; Radford et al., 2021) have gained prominence due to their ability to understand multiple data formats, especially vision and language, simultaneously. These models have demonstrated exceptional performance in various tasks such as zero-shot image classification (Radford et al., 2021; Yao et al., 2021), image-text retrieval (Yao et al., 2021; Li et al., 2022), visual question answering (Dai et al., 2024; Liu et al., 2024a), and image captioning (Li et al., 2022; 2023a). Unlike earlier encoder-based models like CLIP (Radford et al., 2021), most recent large-scale LVLMs, such as LLaVA (Liu et al., 2024a), mPLUG-Owl (Ye et al., 2023), MiniGPT-4 (Zhu et al., 2023), and InstructBLIP (Dai et al., 2024), utilize autoregressive language decoders to expand their functionality, allowing them to cover more complex tasks. However, language decoders sometimes produce incorrect outputs, a phenomenon called hallucination.

[Figure 1: illustration comparing single-augmentation CD for the query "Where is the cat?" (candidate augmentations: Color, Flip, Crop, Noise, Edge; example distances 0.12, 0.54, 0.33, 0.09, 0.14) against VSCoDe's augmentation selection based on the distance between the original and augmented logits.] Figure 1: CD with a single augmentation may not yield correct answers for all types of questions, as it may fail to modify the key visual feature related to the semantics of the query. For example, the visual feature of a position-related query is affected by flip augmentation, whereas color augmentation does not affect it. Therefore, it is essential to apply the appropriate augmentation for each query when applying CD. VSCoDe selects the augmentation with the largest distance D(·), which we define to identify the augmentation that modifies the key feature of the image relative to the question. VSCoDe successfully selects the augmentation, enabling CD to produce the correct answer.

Among various methodologies (Wei et al., 2022; Rose et al., 2023; Shao et al., 2024), a promising approach is contrastive decoding (CD) (Li et al., 2023b), which generates final answers by revising original outputs using contrastive outputs. More precisely, CD works in two stages: (1) generating output distributions given the original and contrastive prompts, and (2) subtracting the two output distributions to reduce the likelihood of hallucinated tokens. The effectiveness of CD depends on how well the contrastive prompt is constructed and how effectively it induces contrastive predictions. While creating contrast in language models is relatively straightforward, by replacing original words with their opposites or with random words (Kim et al.; Wang et al., 2024), vision-language models require a more deliberate approach, as no clearly defined strategy exists for generating contrastive images. Several works have addressed the generation of contrastive images, but they are limited in how they account for semantics. VCD (Leng et al., 2023) adds Gaussian noise to the image but does not consider the semantics of the query. On the other hand, studies such as HALC (Chen et al., 2024) and CRG (Wan et al., 2024) have attempted to exploit semantics to generate object-manipulated images. However, they are limited to object-level semantics and incur the additional cost of using external models for object detection. Therefore, we pose the question: How can we leverage cheap, pre-defined visual augmentation operations while incorporating semantic understanding?
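The two-stage CD procedure can be sketched as follows. This is a minimal illustration assuming the common logit-subtraction formulation from the CD literature; the contrast weight `alpha` and the toy vocabulary are illustrative assumptions, not values from the paper.

```python
def contrastive_logits(l_orig, l_aug, alpha=1.0):
    # Stage 2 of CD: amplify the original logits and subtract the logits
    # produced from the contrastive input, so tokens that score highly
    # regardless of the image (hallucination candidates) are demoted.
    return [(1.0 + alpha) * o - alpha * a for o, a in zip(l_orig, l_aug)]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Toy 4-token vocabulary: token 1 is favored by both the original and the
# contrastive input (a language-prior bias), token 0 is grounded in the image.
l_orig = [2.0, 3.0, 0.5, 0.1]
l_aug = [0.2, 3.1, 0.3, 0.0]
print(argmax(l_orig))                             # greedy decoding picks 1
print(argmax(contrastive_logits(l_orig, l_aug)))  # CD picks 0
```

The subtraction only helps when the contrastive input actually shifts the distribution for the token that matters, which is the gap VSCoDe's augmentation selection targets.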
To answer our question, we first empirically observe that the augmentation required to generate contrast varies depending on the query's semantics. For instance, as shown in Figure 1, when the question pertains to position, a position-related augmentation such as Flip creates significant contrast. Consequently, selecting augmentations that induce more contrastive predictions for each query can make CD more effective. Building on this intuition, we hypothesize that augmentations producing greater distances between the logits of the original and augmented images facilitate contrastive predictions.

Based on our findings, we propose VSCoDe, a novel method for automatically selecting an augmentation for CD, which chooses an appropriate augmentation based on the semantics of the query without requiring additional training or external models. As illustrated in Figure 1, VSCoDe involves three steps: (1) given multiple candidate augmentations, feed the various augmented images to the LVLM and generate a single-token output for each image, (2) measure the distance D(·) between the output distributions of the original and each augmented image, and (3) select the most contrastive output, the one with the maximum distance D(·), to compute the final CD result.

Our contributions can be summarized as follows:

- We explore the effect of visual augmentation on LVLMs. Our findings indicate that each augmentation has a distinct impact on a given question, altering the output distribution of the LVLM and subsequently affecting the response. This highlights the importance of using an augmentation appropriate to the query.
- Based on these findings, we introduce an algorithm called VSCoDe that selects a contrastive augmentation to strengthen CD without additional training or external models.
Through the distance D(·), VSCoDe automatically determines the appropriate visual augmentation for the given query semantics, thereby achieving a stronger CD effect.
- We extensively evaluate VSCoDe across various tasks, including Visual Question Answering (VQA) on MME (Fu et al., 2024), MMBench (Liu et al., 2024b), VQAv2 (Goyal et al., 2017), and POPE (Li et al., 2023c), as well as captioning on the MSCOCO (Lin et al., 2015) dataset, using several LVLMs. VSCoDe improves performance on each task; in particular, on the MME benchmark with the LLaVA-1.5 7B model, VSCoDe outperforms both single-augmentation approaches and previous CD-based methods, yielding a 1.82× larger performance gain than previous methods.

## 2 Preliminaries

Here, we provide a concise summary of background information to aid in understanding this research. We specifically discuss LVLMs, visual data augmentation, and contrastive decoding.

**Generative LVLMs.** LVLMs are among the most prominent multi-modal models. They take as input a pair of an image v and text (e.g., a question) q, denoted (v, q), and generate answers by utilizing the visual information in v. In this paper, we primarily focus on generative LVLMs rather than CLIP-like (Radford et al., 2021) models. These generative LVLMs produce tokens one at a time, in sequence, similar to LLMs. The mathematical expression for this process is:

$$y_t \sim p(y_t \mid v, q, y_{<t}),$$

where $y_{<t}$ denotes the tokens generated before step $t$.
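The three-step selection procedure described in the introduction can be sketched as follows. The exact form of the distance D(·) is defined later in the paper; here KL divergence between the single-token output distributions is used purely as a stand-in, and the candidate names and toy logits are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); a stand-in for VSCoDe's distance D between distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_augmentation(logits_orig, logits_by_aug):
    """Steps (2)-(3): score each candidate augmentation by the distance
    between the original and augmented single-token output distributions,
    then return the most contrastive candidate."""
    p = softmax(logits_orig)
    scores = {name: kl_divergence(p, softmax(l))
              for name, l in logits_by_aug.items()}
    return max(scores, key=scores.get), scores

# Toy position query: flipping the image shifts the answer distribution far
# more than a slight color change, so "flip" is selected for CD.
best, scores = select_augmentation(
    [2.0, 1.0, 0.0],
    {"color": [1.9, 1.1, 0.0], "flip": [0.0, 2.0, 1.0]},
)
print(best)  # flip
```

The selected augmentation's logits then play the role of the contrastive output in the CD subtraction, so the extra cost over plain CD is one single-token forward pass per candidate augmentation.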