# BiMAC: Bidirectional Multimodal Alignment in Contrastive Learning

Masoumeh Zareapoor 1,3, Pourya Shamsolmoali 2,3*, Yue Lu 2
1 Shanghai Jiaotong University, Shanghai, China
2 East China Normal University, Shanghai, China
3 University of York, York, United Kingdom
pshams55@gmail.com

Achieving robust performance in vision-language tasks requires strong multimodal alignment, where textual and visual data interact seamlessly. Existing frameworks often combine contrastive learning with image captioning to unify visual and textual representations. However, their reliance on global representations and unidirectional information flow from images to text limits their ability to reconstruct visual content accurately from textual descriptions. To address this limitation, we propose BiMAC, a novel framework that enables bidirectional interactions between images and text at both global and local levels. BiMAC employs dedicated image and text decoders to simultaneously reconstruct visual content from textual cues and generate textual descriptions guided by visual features. By integrating a text-region alignment mechanism, BiMAC identifies and selects relevant image patches for precise cross-modal interaction, reducing information noise and enhancing mapping accuracy. BiMAC achieves state-of-the-art performance across diverse vision-language tasks, including image-text retrieval, captioning, and classification.

Introduction

Research on the interaction between language and vision has progressed remarkably in recent years, focusing on tasks that require the accurate integration of textual and visual features (You et al. 2024; Zareapoor, Shamsolmoali, and Lu 2024; Chen et al. 2023). For instance, image captioning requires generating coherent and contextually accurate textual descriptions of visual content.
Similarly, image-text retrieval involves matching images with their corresponding textual descriptions, or vice versa, without clear boundaries between the two modalities. These tasks demand sophisticated multimodal alignment techniques capable of effectively integrating and interpreting both visual and textual information. Advanced models in this domain use a contrastive learning objective to strengthen global representations across modalities: models are trained so that similar items from different modalities, such as a specific image and its corresponding caption, are positioned closely in the representation space, while dissimilar items are pushed apart. Examples of such models include CLIP (Radford et al. 2021), mPLUG (Xu et al. 2023), and ConVQG (Mi et al. 2024), which have achieved significant improvements in vision-language tasks.

Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

BEiTv3 (Wang et al. 2023a) uses masking techniques to predict missing components in images, treating images as a form of language. However, its reliance on task-specific fine-tuning limits its flexibility: it cannot be applied directly across tasks without additional training. One notable model is Contrastive Captioners (CoCa), which combines contrastive learning with image captioning (Yu et al. 2022), producing a pretrained model effective in both retrieval and captioning tasks. However, CoCa only integrates visual cues to generate textual descriptions; it does not use textual cues to reconstruct visual content. This unidirectional approach restricts the model's understanding of multimodal relationships. Recent vision pre-training methods (Xie et al. 2022; He et al. 2022; Bica et al. 2024; Ma et al. 2024) have shown that image reconstruction can lead to strong content representations.
By applying this principle to multimodal tasks, where textual information is integrated into the image reconstruction process and local interactions are emphasized, text and image representations can be merged into a single space, enabling more precise and meaningful bidirectional interactions. For instance, Ma et al. (2024) proposed text-guided masked image modeling to construct visual features from textual guidance, addressing the balance between global and local interactions in multimodal tasks. Building on these ideas, we introduce BiMAC, a simple yet effective framework designed to enhance alignment between image and text data. BiMAC leverages image-to-text generation to summarize visual data into textual descriptions, while text-to-image reconstruction ensures that these textual summaries retain sufficient information to reconstruct the original image. The contrastive learning objective reinforces alignment by encouraging paired image-text embeddings to be close in a shared latent space, while unrelated pairs are pushed apart. However, achieving effective multimodal alignment requires addressing a critical challenge: images contain dense, detailed information, whereas textual descriptions are often sparse and focus on salient elements (i.e., they often omit significant details that are present in images). This creates a mismatch in the level of detail between the two modalities. To bridge this gap, our model identifies the most relevant image patches for alignment with textual descriptions while prioritizing less semantically salient regions (e.g., background details that are often omitted in text) for text-to-image reconstruction. This dual focus ensures that the model captures both the high-level semantics and the fine-grained visual details necessary for multimodal alignment. Specifically, we achieve this fine-grained alignment using a cross-modal entropy mechanism, where text tokens attend to image patches based on learned similarity scores. Patches with the highest similarity scores are selected for textual alignment, ensuring that the model prioritizes the most semantically relevant regions. Unlike global alignment methods that treat images and text as monolithic entities, our fine-grained alignment captures local correspondences between tokens and patches, which is particularly important for aligning concise textual descriptions (e.g., "a birthday party") with dense visual scenes, where multiple elements such as objects and actions must be jointly considered. Extensive experiments demonstrate the superiority of our model over state-of-the-art models across various vision-language tasks.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Figure 1: Our multimodal (vision-language) model is structured with four components: encoders for both text and image inputs, along with decoders that handle text-to-image and image-to-text transformations. The model is built upon multiple training objectives: multi-pair contrastive learning (MCL), vision-to-language mapping (caption generation), image reconstruction (GIT), and text-region alignment.

Related Work

Multimodal Representation Learning. In recent years, significant advances have been made in multimodal alignment, particularly in integrating vision and language (Cao et al. 2024; Hu et al. 2024; Mi et al. 2024; You et al. 2024). Traditional approaches often relied on separate modules for each modality, such as object detection frameworks to extract visual features and natural language processing models to handle text. For example, works by Chen et al. (2020) and Zhang et al.
(2021) utilized pre-trained object detection models to align visual representations with corresponding textual features, facilitating the fusion of these modalities. Subsequent research (Kim et al. 2022; Bao et al. 2022b) shifted towards multimodal transformers trained from scratch to jointly learn from both visual and textual inputs. This evolution led to large-scale vision-language models such as CLIP and ALIGN (Radford et al. 2021; Jia et al. 2021), which exemplify these advancements through their dual-encoder frameworks with contrastive loss. These models achieve efficient cross-modal alignment and enable tasks such as zero-shot classification. Another line of work (Yuan et al. 2021) introduced unified contrastive learning to enhance interaction on image-text benchmarks. CoCa (Yu et al. 2022) extended this approach by incorporating image captioning via a decoder, enhancing local interactions between visual and textual input. SyCoCa (Ma et al. 2024) improved upon this by employing an attentive masking strategy for image modeling guided by textual information. In contrast, our work integrates caption generation and image generation tasks to reinforce bidirectional interactions between modalities. This method specifically emphasizes fine-grained alignment by focusing on image patches paired with detailed textual descriptions, addressing the inherent abstractness of image captions. By targeting fine-grained alignment, our model sets itself apart from other bidirectional generation methods that rely on discrete auto-encoders for generating images from text.

Masked Modeling in Language and Vision. Masked modeling has become a cornerstone in both the language and vision domains, serving as a powerful pre-training strategy for learning rich representations. In language, masked token prediction has been widely adopted, where models are trained to predict masked tokens in a text sequence (Yang et al.
2022; Park and Han 2023). Similarly, in the vision domain, masked image modeling (MIM) has gained significant traction: models are trained to predict or reconstruct masked portions of an image, often guided by textual descriptions (Wang et al. 2023a; You et al. 2024; Peng et al. 2023; Wang et al. 2023b). This approach has been pivotal in vision transformers (Dosovitskiy et al. 2021; Bao et al. 2022a; Peng et al. 2023), which use strategies such as mean color prediction, converting image patches into discrete tokens via a VAE network, and pixel clustering to enhance representation learning. In the context of multimodal learning, MIM has been enriched through joint representation learning. MAMO (Zhao et al. 2023), MaskVLM (Kwon et al. 2023), SyCoCa (Ma et al. 2024), BEiT (Wang et al. 2023a; Peng et al. 2023), SPARC (Bica et al. 2024), and EVE (Chen et al. 2023) have integrated MIM into multimodal pre-training, where models are trained to predict randomly masked image patches or text tokens. These advancements have significantly contributed to the refinement of cross-modal alignment and representation learning. Unlike existing models that rely heavily on masking strategies, which often hinder effective bidirectional learning, our model presents a simple yet novel approach that facilitates bidirectional interactions without masking. Indeed, masking, while useful for occluding parts of the input during training, inherently biases the model towards reconstructive tasks rather than fostering a true bidirectional exchange between modalities (Tou and Sun 2024). Our proposed model eliminates the limitations of masking-based methods, enabling more efficient and precise integration of visual and textual information.

Proposed BiMAC

We propose BiMAC, which uses bidirectional interactions to generate detailed textual and visual representations within a unified latent space. While CoCa (Yu et al.
2022) has made significant progress by combining image captioning with a contrastive objective, it primarily focuses on generating text from images, limiting its capacity to use textual information for reconstructing visual content. This unidirectional interaction restricts the potential for comprehensive vision-language alignment. To address this, BiMAC enhances bidirectional alignment through several key objectives (detailed in Algorithm 1): Multi-Pair Contrastive Learning (MCL) enhances the alignment between visual and textual representations; Caption Generation (CG) focuses on generating textual descriptions based on image content; and Generating Image from Text (GIT) establishes precise connections between specific visual elements and textual input.

Architecture Overview. Our architecture is shown in Figure 1. The image encoder $E_{img}$ processes the input $I$ and generates patch embeddings $F_i$ using a vision transformer. Each embedding corresponds to a specific image patch, with an additional [CLS] token serving as a global image representation:

$E_{img}(I) = \{f_1, f_2, \ldots, f_P, f_{cls}\}$, (1)

where $P$ is the number of patches. For a caption $C$, nouns are extracted and embedded using the language encoder $E_{txt}$:

$E_{txt}(T) = \{w_1, w_2, \ldots, w_W\}$, (2)

where $W$ is the number of nouns in the caption. To enhance interactions between image and text representations, BiMAC employs a bidirectional multimodal framework with image and text decoders, each using cross-attention transformers to merge information from both modalities. This enables two complementary modes of interaction: i) Text-to-Image Guidance, where text features guide the visual representation by focusing on relevant regions or features in the image; ii) Image-to-Text Guidance, where image features refine text interpretation by emphasizing salient words or phrases aligned with visual content.
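Both guidance modes reduce to the same cross-attention pattern with the roles of the two modalities swapped. The sketch below is illustrative only: the dimensions (256-dim embeddings, 4 heads, 196 patches plus [CLS], 12 caption tokens) and the use of PyTorch's `nn.MultiheadAttention` are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

# Cross-attention sketch: image features act as queries and text features
# as keys/values (image-to-text guidance swaps these roles). All sizes
# below are assumed for illustration.
dim, heads = 256, 4
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

img_feats = torch.randn(2, 197, dim)  # patch embeddings + [CLS] per image
txt_feats = torch.randn(2, 12, dim)   # token embeddings per caption

# Text-to-image guidance: each image patch attends over the caption tokens.
guided, attn_weights = cross_attn(query=img_feats, key=txt_feats, value=txt_feats)
# `guided` has the same shape as img_feats but is refined by textual context;
# `attn_weights` holds one attention distribution over tokens per patch.
```

Swapping the arguments (text as query, image as key/value) yields the image-to-text direction used by the text decoder.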
Both modes rely on cross-attention mechanisms, where one modality (e.g., image) serves as the query and the other modality (e.g., text) provides the key and value, facilitating interaction between image and text representations. This ensures a comprehensive mutual understanding of the image-text pairs. For example, the image decoder benefits from text features, and the text decoder integrates image features to generate contextually aligned and visually grounded captions.

Algorithm 1: BiMAC: Bidirectional Multimodal Alignment
Require: Paired image-text dataset (I_i, T_i)
Ensure: Multimodal representations, vision-to-language mapping (caption generation), and reconstructed images
1: Step 1: Encode Image and Text
2: for each (I_i, T_i) in the batch do
3:   F_i ← E_img(I_i)   // visual embeddings from the image encoder
4:   T_i ← E_txt(T_i)   // textual embeddings from the text encoder
5: end for
6: Step 2: Multi-Pair Contrastive Learning
7:   L_MCL ← contrastive loss between F_i^cls and T_i^cls
8: Step 3: Vision-to-Language Mapping (CG)
9:   L_CG ← caption generation loss using the text decoder D_txt
10: Step 4: Visual Reconstruction from Text
11:   r̂_i ← D_img(F_i^masked, T_i)   // reconstruct patches
12:   L_GIT ← reconstruction loss
13: Step 5: Fine-Grained Text-Region Alignment
14:   S ← W_i F_i^T   // alignment scores between text tokens and image patches via bipartite matching
15: Step 6: Compute Total Loss
16:   L_total ← L_MCL + λ_CG L_CG + λ_GIT L_GIT
17: Optimize model parameters using L_total

Learning from Image-text Pairs

To train the BiMAC model, we utilize a dataset comprising paired images and their corresponding textual descriptions, denoted as (I_i, T_i). The image encoder $E_{img}$ takes the input image $I$ to generate visual embeddings (as defined in Eq. 1), and the text encoder $E_{txt}$ encodes the textual description $T$ to produce textual embeddings (as defined in Eq. 2).
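The patch-embedding step behind Eq. 1 can be sketched as a strided convolution followed by appending a learnable [CLS] token. All hyperparameters below (224-pixel images, 16-pixel patches, 256-dim embeddings) are illustrative assumptions, and the transformer layers that would follow are omitted.

```python
import torch
import torch.nn as nn

class PatchEmbedder(nn.Module):
    """Sketch of the image-encoder front end: split the image into
    patches, project each patch to an embedding, and append a learnable
    [CLS] token, yielding {f_1, ..., f_P, f_cls} as in Eq. 1.
    Hyperparameters are illustrative, not the paper's."""
    def __init__(self, patch=16, dim=256):
        super().__init__()
        # A conv with kernel = stride = patch size projects each patch once.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):                              # x: (B, 3, H, W)
        f = self.proj(x).flatten(2).transpose(1, 2)    # (B, P, dim) patch tokens
        cls = self.cls.expand(x.size(0), -1, -1)       # (B, 1, dim) global token
        return torch.cat([f, cls], dim=1)              # (B, P + 1, dim)
```

For a 224×224 image with 16-pixel patches, P = 196, so each image yields a sequence of 197 embeddings that a transformer encoder would then process.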
To align these visual and textual embeddings, we employ multi-pair contrastive learning (MCL), which forms a key component of the model's bidirectional alignment mechanism.

Multi-Pair Contrastive Learning (MCL): The MCL loss consists of two terms: an image-to-text alignment loss and a text-to-image alignment loss. Together, these components maximize the similarity of positive pairs (a caption and its corresponding image) while minimizing the similarity of negative pairs (unrelated captions and images). For a batch of $B$ image-text pairs, the image-to-text alignment loss is

$L_{IT} = -\log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{B} \exp(s_{ij}/\tau)}$, (3)

where $s_{ii}$ is the similarity score between the $i$-th image embedding $f_i$ and its paired text embedding $w_i$, and $s_{ij}$ is the similarity score between the $i$-th image embedding $f_i$ and an unrelated text embedding $w_j$. This ensures that each image embedding $f_i$ aligns closely with its paired text embedding $w_i$ while remaining dissimilar to unpaired text embeddings. Similarly, the text-to-image alignment loss is defined as

$L_{TI} = -\log \frac{\exp(s_{jj}/\tau)}{\sum_{i=1}^{B} \exp(s_{ij}/\tau)}$. (4)

This ensures that each text embedding $w_j$ aligns closely with its paired image embedding $f_j$ while remaining dissimilar to unpaired image embeddings. The parameter $\tau$ is a temperature that controls the sharpness of the softmax distribution; tuning it effectively influences the learning process and improves the interaction between images and text. The final MCL loss is the average of the two alignment losses over the batch:

$L_{MCL} = \frac{1}{2B} \sum_{i=1}^{B} (L_{IT} + L_{TI})$. (5)

This bidirectional formulation ensures that the model's understanding is not one-sided; it reinforces the interaction in both directions (image-to-text and text-to-image), improving overall performance in multimodal tasks.

Vision-to-Language Mapping (caption generation): Caption generation (CG) aims to produce detailed text descriptions $T$ for corresponding images $I$.
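Returning to the MCL objective, a minimal batched sketch of Eqs. 3-5, assuming cosine similarities and cross-entropy over the in-batch similarity matrix; the temperature value is an illustrative default, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def mcl_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric multi-pair contrastive loss over a batch of B pairs;
    the i-th row of each tensor is a matched image-text pair.
    tau is the softmax temperature (0.07 is an assumed default)."""
    img = F.normalize(img_emb, dim=-1)          # unit-norm image embeddings f_i
    txt = F.normalize(txt_emb, dim=-1)          # unit-norm text embeddings w_j
    sim = img @ txt.T / tau                     # (B, B) similarity scores s_ij
    targets = torch.arange(sim.size(0))         # positives sit on the diagonal
    l_it = F.cross_entropy(sim, targets)        # image-to-text term (Eq. 3)
    l_ti = F.cross_entropy(sim.T, targets)      # text-to-image term (Eq. 4)
    return 0.5 * (l_it + l_ti)                  # batch average, as in Eq. 5
```

`F.cross_entropy` applies the negative log-softmax of Eq. 3 row-wise, so every off-diagonal entry of `sim` acts as a negative pair for the matched pair on the diagonal.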
Following (Yu et al. 2022; Bica et al. 2024; Ma et al. 2024), our model consists of an image encoder $E_{img}$ and a text decoder $D_{txt}$. The process involves two main steps: encoding the image into a feature representation and decoding this representation into a sequence of words (the caption), as

$\sum_{t=1} \log P(T_i^t \mid E(I_i), C_i) \, R(T_i, I_i)$, (6)

where C