# NLIP: Noise-Robust Language-Image Pre-training

Runhui Huang1, Yanxin Long1, Jianhua Han2, Hang Xu2, Xiwen Liang1, Chunjing Xu2, Xiaodan Liang1*
1 Shenzhen campus of Sun Yat-sen University, 2 Huawei Noah's Ark Lab
{huangrh9, longyx9}@mail2.sysu.edu.cn, hanjianhua4@huawei.com, chromexbjxh@gmail.com, liangcici5@gmail.com, xuchunjing@huawei.com, xdliang328@gmail.com
*Corresponding author

## Abstract

Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning. However, their success relies heavily on the scale and quality of web-crawled data, which naturally contain much incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean the data or generate pseudo-targets as auxiliary signals to reduce the impact of noise, but none of them explicitly tackles the incorrect and incomplete challenges at the same time. In this paper, to automatically mitigate the impact of noise by solely mining over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) that stabilizes pre-training via two schemes: noise-harmonization and noise-completion. First, in the noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, and then adopts noise-adaptive regularization to harmonize the cross-modal alignments with varying degrees. Second, in the noise-completion scheme, to enrich the missing object information of text, NLIP injects a concept-conditioned cross-modal decoder to obtain semantic-consistent synthetic captions that complete noisy ones, using the visual concepts (i.e., object names) retrieved for the corresponding image to guide caption generation. By collaboratively optimizing the noise-harmonization and noise-completion schemes, our NLIP alleviates the common noise effects during image-text pre-training in a more efficient way. Extensive experiments show significant performance improvements of our NLIP using only 26M data over existing pre-trained models (e.g., CLIP, BLIP) on 12 zero-shot classification datasets (e.g., +8.6% over CLIP in average accuracy), MSCOCO image captioning (e.g., +1.9 CIDEr over BLIP trained with 129M data) and zero-shot image-text retrieval tasks.

Figure 1: Illustration of the two proposed schemes. (a) Noise-harmonization: NLIP estimates the noise probability of each image-text pair and enforces pairs with larger noise probability to have lower similarity in the embedding space.
(b) Noise-completion: NLIP generates enriched descriptions via a concept-conditioned captioner that takes visual concepts retrieved from a vocabulary as auxiliary inputs.

## Introduction

Vision-Language Models (VLMs) (Yao et al. 2021; Radford et al. 2021; Li et al. 2021; Jia et al. 2021; Li et al. 2022a) pre-trained with image-text pairs have shown extraordinary zero-shot transfer abilities on a variety of downstream tasks, including zero-shot classification (Radford et al. 2021; Yao et al. 2021), image-text retrieval (Radford et al. 2021; Yao et al. 2021), image captioning (Wang et al. 2021) and text-to-image generation (Patashnik et al. 2021). Previous works (Radford et al. 2021; Li et al. 2022a) show that the downstream performance of VLMs relies heavily on the scale and quality of the pre-training image-caption pairs. However, given the prohibitive expense of acquiring high-quality annotated image-caption datasets (Lin et al. 2014), current paradigms resort to collecting ever-larger unlabeled image-text datasets (Thomee et al. 2016; Sharma et al. 2018), largely overlooking the noise that is prevalent on the web. This leads to a heavier computation burden and makes the pre-training process severely unstable due to the negative impact of noise.

To leverage the advantages of both quality and scale, several attempts have been made to mitigate the negative impact of noisy pairs. On the one hand, filtering and post-processing procedures (Sharma et al. 2018; Changpinyo et al. 2021; Jia et al. 2021) have been designed to clean up the large-scale unlabeled data for pre-training. On the other hand, a few works explore automatic ways during training. For example, ALBEF (Li et al. 2021) resorts to a momentum model to generate pseudo-targets as additional supervision. BLIP (Li et al. 2022a) uses a filter to remove noisy data identified by the similarity of image-text pairs and a captioner to regenerate texts. NCR (Huang et al. 2021) utilizes the loss distribution to divide samples into clean and noisy ones and then rectifies the labels with model predictions. However, unlabeled noisy data often naturally appear with either incorrect text descriptions or incomplete ones (e.g., missing descriptions of some object concepts), and none of the existing works consider automatically alleviating both within one framework.

Here, we aim to achieve noise-robust learning from two aspects: self-diagnosing incorrect vs. correct pairs and harmonizing the loss; and self-generating and selecting confident captions with enriched concepts. To fully utilize the entire set of image-caption pairs, including the noisy ones, we introduce a principled Noise-robust Language-Image Pre-training framework (NLIP) that stabilizes pre-training through noise-harmonization and noise-completion schemes. (a) Noise-harmonization: NLIP learns to harmonize the cross-modal alignment and adopts noise-adaptive regularization for each pair based on its estimated noise probability. Specifically, Arpit et al. (2017) suggest that deep networks tend to fit the easy (i.e., clean) samples first and then the noisy ones. Based on this memorization effect of cross-modal transformers, NLIP first estimates the noise probability of each pair, then applies a noise-adaptive regularization to the image-text contrastive loss to avoid over-fitting to the noisy data (shown in Fig. 1(a)). This scheme pulls the embeddings of the image and caption in a clean pair together more tightly than those of a pair with higher noise probability. (b) Noise-completion: NLIP employs a concept-conditioned cross-modal decoder to synthesize semantic-consistent captions that replace the detrimental noisy texts.
Specifically, to guide the caption generation procedure by providing prior information about the existing objects, we first retrieve the visual concepts (i.e., names of existing objects) for each image via a pre-trained VLM. These visual concepts and the image are then fed into an additional caption head to generate an enriched description for each noisy pair that substitutes the noisy caption (shown in Fig. 1(b)). Furthermore, inspired by He et al. (2021), we explore enhancing the visual encoder by randomly masking the input image tokens and then reconstructing them, which helps reduce the computation cost during training and boosts the visual embedding by maintaining low-level visual information.

Experimental results show that NLIP achieves significant performance on several downstream tasks, including zero-shot classification, zero-shot image-to-text/text-to-image retrieval and image captioning. NLIP outperforms CLIP (Radford et al. 2021) by 8.6% in average accuracy on 12 zero-shot classification datasets. For image captioning, NLIP is superior to existing methods trained with substantially more data, e.g., +1.9 CIDEr on MSCOCO over BLIP (Li et al. 2022a) trained with 129M image-text pairs. For zero-shot image-text retrieval, NLIP surpasses CLIP by 28.7% in R@1 on Flickr30k.

## Related Work

**Vision-Language Pre-training (VLP)** models have recently garnered increasing attention due to their surprisingly superior performance on diverse zero-shot downstream tasks. They learn semantic alignments across the image and language modalities by pre-training on large-scale data, which brings strong performance benefits on downstream tasks (e.g., zero-shot classification, zero-shot retrieval, image captioning). Existing VLP models typically adopt either encoder-only or encoder-decoder architectures. The encoder-only architectures (Radford et al. 2021; Jia et al. 2021; Yao et al. 2021; Yuan et al. 2021; Mu et al. 2021; Li et al. 2022b; You et al. 2022) aim to align visual features with textual features in a common cross-modal semantic space. The encoder-decoder architectures (Wang et al. 2021; Li et al. 2022a) employ autoregressive Language Modeling (LM) (e.g., image captioning, text-grounded image generation) to supervise the decoder and excel in generation-related downstream tasks. Despite their natural merits in data diversity, large-scale web-crawled image-text pairs contain much noise (i.e., incomplete or even erroneous information) (Thomee et al. 2016; Changpinyo et al. 2021). Some works attempt to mitigate this impact from two aspects. From the data perspective, strict rules are used to clean up the data (Sharma et al. 2018; Changpinyo et al. 2021; Jia et al. 2021). From the modeling perspective, ALBEF (Li et al. 2021) adopts momentum models to generate pseudo-targets as additional supervision, while BLIP (Li et al. 2022a) presents a filter that removes noisy data according to the similarity of image-text pairs and a captioner that regenerates the corresponding web texts. However, these methods do not explicitly stabilize and harmonize the pre-training objectives by re-evaluating noisy data in a soft way. In this work, we alleviate the impact of noise by simultaneously addressing incorrect and incomplete image-text pairs. Two novel schemes, noise-harmonization and noise-completion, work collaboratively to achieve noise-robust pre-training.
**Noisy Data Learning** has been a long-standing research area that copes with noise in training data, and practically all existing methods are applied to the classification task. Existing studies (Song et al. 2020) frequently use robust architecture design, regularization, loss modification, or sample selection strategies to limit the detrimental impact of noisy labels. Here we discuss the last three techniques, which are the most relevant to our model. First, regularization encourages the networks to over-fit less to false-labeled examples, either explicitly or implicitly; e.g., label smoothing (Pereyra et al. 2017; Lukasik et al. 2020) avoids over-fitting by preventing the networks from assigning full probabilities to noisy data samples. Second, loss modification adjusts the contribution of clean and noisy samples to the loss (Reed et al. 2014; Zheng et al. 2020). Third, sample selection methods concentrate on choosing clean samples from noisy ones. For example, Arpit et al. (2017) demonstrate the memorization effect of networks, which always prefer to learn simple samples before fitting noisy data. Motivated by the memorization effect, Arazo et al. (2019) adopt a two-component Gaussian Mixture Model (GMM) to fit the per-sample loss and treat the samples with smaller loss as clean samples. To transfer these noisy-label learning techniques from the classification problem to the cross-modal matching problem, Huang et al. (2021) propose noisy correspondence learning, and Amrani et al. (2021) use the density of similarity to estimate the noise probability. Thomas and Kovashka (2022) apply semantic neighborhood discrepancy and diversity to capture the degree of abstractness of an image-text pair. Different from them, NLIP introduces a new noise-adaptive image-text contrastive loss that harmonizes the cross-modal alignment by considering the varying noise probabilities of different pairs, and also rectifies the noisy samples via a concept-guided captioner. NLIP is one of the early attempts to provide effective and efficient noise-robust schemes within a large-scale image-text pre-training framework, and it can be coupled with any VLP model to improve its robustness.

Figure 2: Overview of the proposed NLIP architecture. NLIP consists of an image encoder $V_e$, a text encoder $T_e$, a cross-modal decoder $C_d$ and an MAE decoder $V_d$. During training, given an input image $x$, NLIP feeds the randomly masked visual patches into the image encoder, and the MAE decoder learns to reconstruct them via $\mathcal{L}_{IR}$. The correlated concepts are also retrieved from a vocabulary for each image and then concatenated with the text $y$ as inputs to the text encoder. The concept-conditioned cross-modal decoder is fed with image features, concept-conditioned text features and text embeddings, and is optimized via $\mathcal{L}_{LM}$. The noise-adaptive image-text contrastive loss $\mathcal{L}_{NITC}$ is adopted to learn cross-modal alignment by considering the varying noise probabilities. Note that the concept-conditioned cross-modal decoder does not use image tokens as input when computing $\mathcal{L}_{NITC}$, to avoid information leakage, but does for $\mathcal{L}_{LM}$. The index $i$ is omitted here.

## Method

We propose the Noise-robust Language-Image Pre-training framework (NLIP), a new VLP framework that learns from noisy image-text pairs.
In this section, we first introduce the overall model architecture of NLIP. We then present the details of the two noisy-learning schemes: the noise-harmonization scheme, which harmonizes the cross-modal alignment with noise-adaptive regularization, and the noise-completion scheme, which enriches the missing object information of the text.

**Basic Notations.** We use $D = \{X, Y\}$ to denote the image-text dataset with images $X = \{x_i\}_{i=1}^{N}$ and texts $Y = \{y_i\}_{i=1}^{N}$, where $N$ denotes the total number of image-text pairs in the dataset. For the vision modality, $V_e$ and $V_d$ denote the vision encoder and vision decoder, respectively. For the language modality, $T_e$ denotes the text encoder. We denote the concept-conditioned cross-modal decoder by $C_d$.

### Overall Architecture

Fig. 2 illustrates an overview of the NLIP architecture for learning high-quality cross-modal feature alignment. NLIP contains a visual encoder-decoder inspired by MAE (He et al. 2021) that reduces the computation cost while maintaining high-quality visual feature representations, a text encoder that encodes texts enriched by extra auxiliary visual concepts, and a concept-conditioned cross-modal decoder that learns to synthesize semantic-consistent captions to complete noisy ones.

For the visual modality, we use a Vision Transformer (ViT) (Dosovitskiy et al. 2020) that takes the concatenation of an extra [CLS] token embedding and linearly projected image patches as input and outputs the [CLS] token to represent the global image feature. Specifically, we randomly mask the patches and skip the masked tokens to reduce the computation cost. To enhance the visual feature representation via self-supervised regularization, an MAE decoder is adopted to restore the masked patches through the Image Reconstruction (IR) loss $\mathcal{L}_{IR}$:

$$\mathcal{L}_{IR} = \sum_{i=1}^{N} \left( \frac{V_d(V_e(\tilde{x}_i))}{\lVert V_d(V_e(\tilde{x}_i)) \rVert} - \frac{\tilde{x}_i}{\lVert \tilde{x}_i \rVert} \right)^{2}, \tag{1}$$

where $\lVert \cdot \rVert$ denotes the normalization and $\tilde{x}$ represents the masked patches.

As for the language modality, we exploit an encoder-decoder structure to obtain generation capability and synthesize enriched captions. We first retrieve the visual concepts (i.e., names of existing objects) for each input image from a large corpus via a pre-trained model. The visual concepts, concatenated with the corresponding input texts, are encoded by the text encoder. A concept-conditioned cross-modal decoder is then trained with the Language Modeling (LM) loss $\mathcal{L}_{LM}$ to generate a more detailed caption for each image guided by the visual concepts. For the cross-modal alignment, the Noise-adaptive Image-Text Contrastive (NITC) loss $\mathcal{L}_{NITC}$ is adopted to not only encourage the positive pair representations to move closer in contrast to the negative pairs, but also introduce noise-adaptive label smoothing as an instance-aware regularization that avoids a severe bias toward the noisy data. The overall loss can therefore be written as:

$$\mathcal{L} = \mathcal{L}_{IR} + \alpha \mathcal{L}_{LM} + \beta \mathcal{L}_{NITC}, \tag{2}$$

where $\alpha$ and $\beta$ denote the weighting factors.
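To make the masking step concrete, below is a minimal PyTorch-style sketch of the MAE-style random patch masking described above. The function name `random_mask_patches`, the mask ratio, and the tensor shapes are illustrative assumptions rather than details taken from the paper; the visible tokens would be fed to the image encoder $V_e$, while the returned mask and restore indices indicate which positions the MAE decoder $V_d$ should reconstruct for $\mathcal{L}_{IR}$, and the three losses are then combined as in Eq. (2).

```python
import torch

def random_mask_patches(patch_tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking: keep a random subset of patch tokens per image.

    patch_tokens: [B, N, D] linearly projected image patches (without [CLS]).
    Returns the visible tokens (fed to the encoder), a binary mask over the N
    positions (1 = masked, 0 = kept), and indices to restore the original order.
    """
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))

    noise = torch.rand(B, N, device=patch_tokens.device)   # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)               # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)         # inverse permutation

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patch_tokens, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=patch_tokens.device)
    mask[:, :n_keep] = 0                                    # first n_keep slots are kept
    mask = torch.gather(mask, 1, ids_restore)               # back to original patch order
    return visible, mask, ids_restore

# Toy usage: 4 images, 196 patches of dimension 768, keep 25% of the patches.
tokens = torch.randn(4, 196, 768)
visible, mask, ids_restore = random_mask_patches(tokens, mask_ratio=0.75)
print(visible.shape, int(mask.sum(dim=1)[0]))  # torch.Size([4, 49, 768]) 147
```

Skipping the masked tokens in this way lets the encoder process only a fraction of the patches, which is the source of the computation savings mentioned above.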
### Noise Harmonization

To avoid over-fitting to the noisy image-text pairs, NLIP introduces the noise-harmonization scheme, which learns to harmonize the cross-modal alignments and adopts noise-adaptive regularization for each pair based on its estimated noise probability.

**Preliminaries.** To align the two modalities, current vision-language pre-training models (Radford et al. 2021) adopt the Image-Text Contrastive (ITC) loss, which encourages the positive image-text pairs $\{x_i, y_j\}_{i=j}$ to be aligned in the same feature space in contrast to the negative pairs $\{x_i, y_j\}_{i \neq j}$. The normalized features from the image encoder and text encoder are denoted as $V_e(x_i)$ and $T_e(y_i)$. We first calculate the per-sample image-to-text similarity $s^y \in \mathbb{R}^{B \times B}$ and text-to-image similarity $s^x$ in a batch as:

$$s^y_{i,j} = s^x_{i,j} = V_e(x_i)^{\top} T_e(y_j), \tag{3}$$

where $B$ denotes the batch size. The Image-Text Contrastive loss $\mathcal{L}_{ITC}$ can then be written as the average of the image-to-text and text-to-image contrastive losses:

$$\mathcal{L}_{ITC} = \frac{1}{2B} \sum_{i=1}^{B} \left( \mathcal{L}^x_i + \mathcal{L}^y_i \right), \tag{4}$$

$$\mathcal{L}^x_i = \mathcal{L}^x_i\big(x_i, \{y_j\}_{j=1}^{B}\big) = -\log \frac{\exp(s^x_{i,i})}{\sum_{j} \exp(s^x_{i,j})}, \tag{5}$$

$$\mathcal{L}^y_i = \mathcal{L}^y_i\big(y_i, \{x_j\}_{j=1}^{B}\big) = -\log \frac{\exp(s^y_{i,i})}{\sum_{j} \exp(s^y_{i,j})}. \tag{6}$$

However, the existing ITC loss forces models to align the features of every image-text pair without considering that many of the pairs are noisy. Directly pre-training on such samples may degrade the model performance.

**Noise-adaptive Image-Text Contrastive Loss.** We further propose a Noise-adaptive Image-Text Contrastive (NITC) loss $\mathcal{L}_{NITC}$ that harmonizes the cross-modal alignments to varying degrees according to each pair's noise probability. We first calculate the noise probability of each image-text pair, which indicates that the image and text in the pair are not semantically matched, according to the memorization effect (Arpit et al. 2017; Zhang et al. 2021a): the cross-modal transformer tends to fit the easy (i.e., clean) samples first and then the noisy ones. Therefore, we adopt a two-component Gaussian Mixture Model (GMM) (Permuter et al. 2006) to fit the per-sample ITC loss. Specifically, we take the probability predicted by the higher-mean component as the noise probability $\varepsilon_i$ of the $i$-th image-text pair, inspired by (Huang et al. 2021; Arazo et al. 2019):

$$p\big(\mathcal{L}_{ITC}(x_i, y_i) \mid \theta\big) = \sum_{m=1}^{2} \gamma_m \, \phi\big(\mathcal{L}_{ITC}(x_i, y_i) \mid m\big), \tag{7}$$

$$\varepsilon_i = p(\mu_h)\, p\big(\mathcal{L}_{ITC}(x_i, y_i) \mid \mu_h\big) \,/\, p\big(\mathcal{L}_{ITC}(x_i, y_i)\big), \tag{8}$$

where $\gamma_m$ denotes the mixture coefficient, $\phi(\cdot \mid m)$ is the probability density of the $m$-th GMM component, $\theta$ represents the parameters of the GMM, and $\mu_h$ denotes the component with the higher mean.

We then regularize the ground-truth alignment label to varying degrees according to the noise probability $\varepsilon_i$: lower regularization is adopted for clean samples (i.e., with low $\varepsilon_i$) to learn the alignment, while higher regularization is adopted for noisy samples (i.e., with high $\varepsilon_i$) to avoid over-fitting the noise. In detail, inspired by label smoothing (Szegedy et al. 2016), we regularize the ground-truth image-to-text and text-to-image alignment labels with per-sample smoothing rates $W = \{w_i\}_{i=1}^{N}$ that are linearly associated with the noise probability of each sample, i.e., $w_i = \lambda \varepsilon_i$ with $w_i \in [0, \lambda]$, where $\lambda$ is a hyper-parameter controlling the range of the smoothing rate. The Noise-adaptive Image-Text Contrastive loss $\mathcal{L}_{NITC}$ is then defined as:

$$\mathcal{L}_{NITC} = \frac{1}{2B} \sum_{i=1}^{B} \left( \hat{\mathcal{L}}^x_i + \hat{\mathcal{L}}^y_i \right), \tag{9}$$

$$\hat{\mathcal{L}}^x_i = -\log \frac{(1 - w_i)\exp(s^x_{i,i})}{(1 - w_i)\exp(s^x_{i,i}) + \frac{w_i}{B-1}\sum_{j \neq i} \exp(s^x_{i,j})}, \tag{10}$$

$$\hat{\mathcal{L}}^y_i = -\log \frac{(1 - w_i)\exp(s^y_{i,i})}{(1 - w_i)\exp(s^y_{i,i}) + \frac{w_i}{B-1}\sum_{j \neq i} \exp(s^y_{i,j})}. \tag{11}$$
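In short, the scheme has two concrete steps: fit a two-component GMM to per-sample ITC losses to obtain $\varepsilon_i$ (Eqs. 7-8), then plug the smoothing rates $w_i = \lambda \varepsilon_i$ into the weighted softmax of Eqs. 9-11. A minimal sketch is shown below, assuming L2-normalized features, a CLIP-style temperature (not written explicitly in the equations above), and scikit-learn's `GaussianMixture` for the GMM fit; the function names and hyper-parameter values are illustrative.

```python
import torch
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

def estimate_noise_prob(per_sample_itc_loss: torch.Tensor) -> torch.Tensor:
    """Eqs. (7)-(8): fit a 2-component GMM to per-sample ITC losses and return the
    posterior probability of the higher-mean (i.e., noisier) component."""
    losses = per_sample_itc_loss.detach().cpu().numpy().reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=1e-4).fit(losses)
    noisy_comp = int(gmm.means_.argmax())                  # component with the higher mean
    eps = gmm.predict_proba(losses)[:, noisy_comp].astype("float32")
    return torch.from_numpy(eps)

def nitc_loss(img_emb, txt_emb, eps, lam=0.2, temperature=0.07):
    """Eqs. (9)-(11): noise-adaptive ITC with smoothing rate w_i = lam * eps_i.

    img_emb, txt_emb: [B, D] L2-normalized features; eps: noise probabilities in [0, 1].
    """
    B = img_emb.size(0)
    sim = img_emb @ txt_emb.t() / temperature              # s_{i,j}; transpose gives the other direction
    w = lam * eps.to(sim.device)                           # per-sample smoothing rate

    def one_direction(s):
        pos = (1.0 - w) * torch.exp(s.diagonal())
        off_diag = torch.exp(s).masked_fill(
            torch.eye(B, dtype=torch.bool, device=s.device), 0.0)
        neg = (w / (B - 1)) * off_diag.sum(dim=1)
        return -torch.log(pos / (pos + neg))               # written to mirror Eq. (10)

    return 0.5 * (one_direction(sim) + one_direction(sim.t())).mean()

# Toy usage: random normalized features and pre-computed per-sample ITC losses.
feat_i = F.normalize(torch.randn(8, 256), dim=1)
feat_t = F.normalize(torch.randn(8, 256), dim=1)
eps = estimate_noise_prob(torch.rand(8) * 5.0)
print(float(nitc_loss(feat_i, feat_t, eps)))
```

In practice the per-sample ITC losses would be collected over the training data (not a single small batch) before fitting the GMM, and a numerically stabler implementation would subtract the row maximum before exponentiating.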
### Noise Completion

Apart from adopting the above instance-aware regularization on the noisy pairs, NLIP also introduces the noise-completion scheme to enrich the missing object information of the text, since captions from the web are naturally incomplete. Specifically, NLIP injects a concept-conditioned cross-modal decoder to obtain semantic-consistent synthetic captions that complete the noisy ones, using the visual concepts (i.e., names of existing objects) retrieved for the corresponding image to guide caption generation.

Figure 3: Illustration of the NLIP procedure. The whole pre-training contains three stages: noise-aware pre-training, concept-enhanced captioning and concept-enhanced pre-training. At the noise-aware pre-training stage, we adopt the noise-adaptive regularization to pre-train NLIP. At the captioning stage, we use captioning data to train the concept-conditioned cross-modal decoder and generate synthetic captions for the web images. At the concept-enhanced pre-training stage, we select training captions by their noise probabilities and fine-tune NLIP.

**Visual Concept.** Although image-text data can be easily crawled from the web, the texts usually contain much noise, such as missing details of the image or content unrelated to the image (Li et al. 2022a). To better address the problem of image-text misalignment, we introduce the visual concepts $q_v$ as auxiliary inputs that provide prior information about the existing objects in each image. We first construct a large visual concept vocabulary $Q$ by parsing the various concept nouns from a web-collected corpus. Then, for each image $x_i$, we retrieve the words with the top-$k$ similarity as its visual concepts $q_i \subset Q$, based on a pre-trained VLM. The similarity $\mathrm{sim}(x_i, Q)$ between the input image $x_i$ and the nouns in $Q$ is calculated as:

$$\mathrm{sim}(x_i, Q) = V_e(x_i)^{\top} T_e([p, Q]), \tag{12}$$

where $p$ denotes the pre-defined text prompt that is aggregated with the visual concepts to narrow the gap with natural language (Radford et al. 2021). Based on the retrieved visual concepts $q_i$, NLIP uses an additional concept-conditioned cross-modal decoder (shown in Fig. 2) to synthesize new texts $\hat{Y}$ to replace the original texts $Y$ in noisy image-text pairs. Specifically, the cross-modal decoder is optimized by recovering the masked texts $y^m$ with an autoregressive (i.e., language modeling) loss: $\mathcal{L}_{LM} = -\mathbb{E}_{(x,y)\sim D} \log p(y_t \mid C_d(y_{\tau}$
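To illustrate the retrieval step in Eq. (12), the sketch below scores a vocabulary of prompted concept texts against one image embedding and keeps the top-$k$. The prompt template, the toy vocabulary, and the helper name `retrieve_visual_concepts` are assumptions for illustration; in NLIP the embeddings would come from the pre-trained VLM's $V_e$ and $T_e$.

```python
import torch
import torch.nn.functional as F

def retrieve_visual_concepts(image_emb: torch.Tensor,
                             concept_emb: torch.Tensor,
                             concept_names: list,
                             k: int = 2) -> list:
    """Eq. (12): rank vocabulary concepts by similarity to one image, return the top-k.

    image_emb:   [D]    L2-normalized image feature, V_e(x_i)
    concept_emb: [M, D] L2-normalized text features of prompted concepts, T_e([p, q])
    """
    sim = concept_emb @ image_emb                  # cosine similarity to each concept
    topk = sim.topk(min(k, len(concept_names))).indices.tolist()
    return [concept_names[j] for j in topk]

# Toy usage with random embeddings standing in for a pre-trained VLM; a prompt
# such as "a photo of a {concept}" would be applied to each noun before encoding.
names = ["kite", "person", "wine", "dog"]
img = F.normalize(torch.randn(512), dim=0)
con = F.normalize(torch.randn(len(names), 512), dim=1)
print(retrieve_visual_concepts(img, con, names, k=2))  # e.g., ['person', 'kite']
```

The retrieved concept names are then concatenated with the text prompt and fed to the concept-conditioned cross-modal decoder to guide caption generation, as described above.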