# CYCLIP: Cyclic Contrastive Language-Image Pretraining

Shashank Goel (UCLA, shashankgoel@ucla.edu), Hritik Bansal (UCLA, hbansal@ucla.edu), Sumit Bhatia (MDSR Lab, Adobe Systems, sumit.bhatia@adobe.com), Ryan A. Rossi (Adobe Research, ryrossi@adobe.com), Vishwa Vinay (Adobe Research, vinay@adobe.com), Aditya Grover (UCLA, adityag@cs.ucla.edu)

## Abstract

Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP [44] that achieve state-of-the-art performance for zero-shot classification and distributional robustness. Such models typically require joint reasoning in the image and text representation spaces for downstream inference tasks. Contrary to prior beliefs, we demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions. To mitigate this issue, we formalize consistency and propose CYCLIP, a framework for contrastive representation learning that explicitly optimizes for the learned representations to be geometrically consistent in the image and text spaces. In particular, we show that consistent representations can be learned by explicitly symmetrizing (a) the similarity between the two mismatched image-text pairs (cross-modal consistency); and (b) the similarity between the image-image pair and the text-text pair (in-modal consistency). Empirically, we show that the improved consistency in CYCLIP translates to significant gains over CLIP, with gains ranging from 10%-24% for zero-shot classification accuracy on standard benchmarks (CIFAR-10, CIFAR-100, ImageNet1K) and 10%-27% for robustness to various natural distribution shifts. The code is available at https://github.com/goel-shashank/CyCLIP.

## 1 Introduction

The ability to learn general-purpose representations from diverse data modalities is a long-standing goal of artificial intelligence (AI) [4, 32]. In this regard, recent instantiations such as CLIP [44], ALIGN [29], and BASIC [41] have scaled up vision-language contrastive pretraining to jointly learn image and text embeddings by exploiting an enormous amount of paired image-text data on the web. Post pretraining, these embeddings exhibit impressive zero-shot classification performance [13] and robustness to natural distribution shifts [48, 57, 24, 26]. Recently, these embeddings have been extended to text-guided generation of natural images [47, 12, 38, 46] and transferred to modalities such as 3-D shapes [50] by emphasizing the interchangeability of the image and text embeddings.

In the context of vision-language pretraining, the standard contrastive learning objective aims to maximize the similarity between matched image-text pairs ("positives") against all the mismatched image-text pairs ("negatives") [45, 7, 40, 22]. While such an objective aligns the true image-text pairs, it poses no constraints on the overall geometry of all data pairs, including the mismatched pairs and pairs within the same modality.

Figure 1: An illustration of the planar geometry of the learned representations of image-text pairs by (a) CLIP and (b) CYCLIP. The edges indicate the distance between the representations, i.e., d(e_1, e_2) = 1 - ⟨e_1, e_2⟩, where ⟨·, ·⟩ is the inner product.
CYCLIP is cyclically consistent between image-text pairs: the in-modal distances, d(T_cat, T_dog) and d(I_cat, I_dog), and the cross-modal distances, d(T_cat, I_dog) and d(I_cat, T_dog), are similar to each other, unlike in CLIP. Due to the explicit consistency constraints, the test image of a cat is classified as a cat in the image space as well as the text space.

In Figure 1 (a), we illustrate this effect, where matched image-text pairs (I_dog, T_dog) and (I_cat, T_cat) get close to each other but the overall geometry of pairwise distances can be highly irregular (see, e.g., (I_dog, T_cat) and (I_cat, T_dog)). If we use such representations for downstream inference, these irregularities can translate into inconsistent reasoning in the image and text spaces. For example, CLIP designs proxy captions for class labels and uses the most similar class caption to perform zero-shot classification for images; using the default captions in Figure 1 (a), this would imply that a test image I_test gets classified as a dog in the image space even when a simple nearest neighbor classifier in the text space would correctly infer the label to be a cat.

To mitigate these challenges, we propose Cyclic Contrastive Language-Image Pretraining (CYCLIP), a framework that imposes additional geometric structure on the learned representations. Specifically, given two image-text pairs, we augment the contrastive learning objective with two symmetrization terms. The first term provides for in-modal consistency by encouraging the distance between the two image embeddings to be close to the distance between the corresponding text embeddings. The second term provides for cross-modal consistency by encouraging the distance between the image embedding of the first pair and the text embedding of the second pair to be close to the distance between the text embedding of the first pair and the image embedding of the second pair. As shown in Figure 1 (b), if the representations of any two image-text pairs, (I_dog, T_dog) and (I_cat, T_cat), exactly satisfy both forms of cyclic consistency, then we can guarantee that any test image I_test respects the ordering of distances in both the image and text spaces (i.e., if d(I_test, I_dog) > d(I_test, I_cat), then d(I_test, T_dog) > d(I_test, T_cat)).

Empirically, we demonstrate that the improved consistency in CYCLIP translates to improvements over CLIP. In all cases, we pre-train our models on the Conceptual Captions 3M dataset [52]. On zero-shot classification, we observe that CYCLIP improves over CLIP by 10.2% on ImageNet1K, 10.6% on CIFAR-10, and 23.9% on CIFAR-100. Further, CYCLIP outperforms CLIP with an average relative gain of +17% on ImageNet natural distribution shift benchmarks. We further analyze the improved performance of CYCLIP and find that the additional geometric structure in the representation space better captures the coarse- and fine-grained concept hierarchies of datasets.

Our contributions are as follows:

1. We analyze contrastive learning for representation learning jointly over the image and text modalities. We identify a critical shortcoming in the geometry of the learned representation space that can lead to inconsistent predictions in the image and text domains.
2. We propose CYCLIP, a simple and effective framework for contrastive representation learning with two additional cycle consistency constraints that mitigate the above issue.
3. We demonstrate that CYCLIP achieves significant empirical improvements over CLIP on zero-shot classification and robustness benchmarks.
We further explain these improvements by analyzing the impact of consistency on the hierarchical structure of datasets.

## 2 Cycle Consistent Representation Learning

### 2.1 Preliminaries

We are interested in using text supervision to learn general-purpose visual representations that generalize to downstream predictive tasks. To this end, there have been several recent advances in language-image pretraining concerning model architectures, training objectives, and sources of supervision. Our work is most closely related to Contrastive Language-Image Pretraining (CLIP) [44], which combines many such advances in a highly scalable and generalizable learning framework. CLIP is trained on millions of images with their captions scraped from the web.

Formally, we consider a dataset S ⊆ I × T consisting of pairs (I_j, T_j), where I_j is a raw image and T_j is a text caption. We use I and T to denote the domain of images and text, respectively. The CLIP architecture consists of 3 components: (i) an image encoder network f_I : I → R^d that encodes a raw image into an embedding vector of dimension d, (ii) a text encoder network f_T : T → R^d that encodes a raw text into an embedding vector of dimension d, and (iii) a contrastive objective that pulls the embeddings of matched image-caption pairs together while pushing apart the embeddings of mismatched pairs.

Formally, during training, consider a batch of N image-caption pairs {(I_j, T_j)}_{j=1}^N, where I_j and T_j represent the raw image and text, respectively. The image embedding I_j^e ∈ R^d and text embedding T_j^e ∈ R^d are obtained by passing I_j and T_j through the image encoder f_I and text encoder f_T, respectively; i.e., I_j^e = f_I(I_j) and T_j^e = f_T(T_j). Further, we assume they are normalized to have unit ℓ2-norm. The contrastive objective in CLIP aims to align the image and text representations by minimizing the loss function L_CLIP shown below:

$$
\mathcal{L}_{\text{CLIP}} = -\frac{1}{2N}\sum_{j=1}^{N}\underbrace{\log\frac{\exp\left(\langle I_j^e, T_j^e\rangle/\tau\right)}{\sum_{k=1}^{N}\exp\left(\langle I_j^e, T_k^e\rangle/\tau\right)}}_{\text{Contrasting images with the texts}} \;-\; \frac{1}{2N}\sum_{k=1}^{N}\underbrace{\log\frac{\exp\left(\langle I_k^e, T_k^e\rangle/\tau\right)}{\sum_{j=1}^{N}\exp\left(\langle I_j^e, T_k^e\rangle/\tau\right)}}_{\text{Contrasting texts with the images}} \tag{1}
$$

where ⟨·, ·⟩ represents the inner product and τ is a trainable temperature parameter.

CLIP and its variants can be used to perform zero-shot image classification, i.e., classifying test images into categories not seen at training time. We first transform each category into a suitable caption (e.g., the airplane category in CIFAR-10 can be expressed as "a photo of an airplane"). Then, the similarity of the test image to each caption is computed (e.g., cosine similarity), and the model predicts the category for which the image-caption similarity is the highest.

### 2.2 Inconsistent Representation Learning in CLIP

As illustrated in Figure 1 (a), the standard contrastive objective in CLIP can learn image-text representations such that the predicted labels for a test image differ in the image and text spaces. Here, we reason about such inconsistencies more formally in the context of downstream classification. As discussed above, we can predict a label in the text embedding space (zero-shot setting) by selecting the label whose caption embedding is closest to the test image (P_T). Additionally, for classification in the image embedding space, if we had access to a labeled training set, then one natural way to infer the predicted label (P_I^k) of a test image I_test is by taking a majority vote over the true labels associated with its k nearest training images.
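A minimal sketch of the two predictors described above, assuming precomputed, unit-normalized embeddings; the function and variable names are illustrative and not taken from the released code:

```python
import torch

def zero_shot_predict(image_embeds: torch.Tensor, class_text_embeds: torch.Tensor) -> torch.Tensor:
    """Text-space prediction P_T: the class whose caption embedding is closest to the image."""
    # image_embeds: (N, d), class_text_embeds: (num_classes, d); both unit-normalized.
    return (image_embeds @ class_text_embeds.T).argmax(dim=-1)

def knn_predict(image_embeds: torch.Tensor, train_image_embeds: torch.Tensor,
                train_labels: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Image-space prediction P_I^k: majority vote over the k nearest training images."""
    sims = image_embeds @ train_image_embeds.T        # (N, N_train) cosine similarities
    topk = sims.topk(k, dim=-1).indices               # indices of the k nearest training images
    votes = train_labels[topk]                        # (N, k) labels of those neighbors
    return votes.mode(dim=-1).values                  # majority vote per test image

# The fraction of test images on which these two predictions agree is the
# consistency score formalized in Eq. 2 below.
```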
Formally, we define a consistency score that measures the agreement between the predicted labels in the image and text spaces as:

$$
\text{Consistency Score}_k = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}\left[ P_I^k(I_j) = P_T(I_j) \right] \tag{2}
$$

where N is the number of test images. In our experiments (discussed in detail in Section 3), we found CLIP's consistency score (k = 1) to be 44%, 16%, and 16% on the standard benchmarks CIFAR-10, CIFAR-100, and ImageNet1K, respectively, showing a very high degree of disagreement between the image and text spaces. In the following section, we describe our approach to alleviating the inconsistent inference problem, and we quantitatively show that our solution improves the consistency score in Section 4.1.

Figure 2: Illustrative overview of CYCLIP (N = 2). It consists of 3 major components: (a) cross-modal contrastive alignment, (b) cross-modal consistency, and (c) in-modal consistency. Only (a) is present in CLIP, whereas our proposed regularizers in (b) and (c) mitigate inconsistency.

### 2.3 Cycle Consistent Representation Learning via CYCLIP

We showed that the visual representations learned by CLIP can be inconsistent when used for inference in the image and text spaces. To mitigate this problem, we propose CYCLIP, a learning framework that builds upon CLIP by augmenting the contrastive loss in Eq. 1 with additional geometric consistency regularizers. The intuition follows directly from Figure 1 (b), where we showed that inconsistency in the image and text spaces can be eliminated if we symmetrize the similarity between the two mismatched image-text pairs and the similarity between the image-image pair and the text-text pair. We formalize this intuition with two consistency regularizers.

(1) The cross-modal consistency regularizer reduces the gap in the similarity scores between the embeddings of all the mismatched image-text pairs in a batch, two at a time:

$$
\mathcal{L}_{\text{C-Cyclic}} = \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{N}\left( \langle I_j^e, T_k^e\rangle - \langle I_k^e, T_j^e\rangle \right)^2 \tag{3}
$$

(2) The in-modal consistency regularizer reduces the gap in the similarity scores between the embeddings of all combinations of image pairs and their corresponding text pairs in a batch:

$$
\mathcal{L}_{\text{I-Cyclic}} = \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{N}\left( \langle I_j^e, I_k^e\rangle - \langle T_k^e, T_j^e\rangle \right)^2 \tag{4}
$$

Hence, our overall loss for CYCLIP is given as:

$$
\mathcal{L}_{\text{CYCLIP}} = \mathcal{L}_{\text{CLIP}} + \lambda_1 \mathcal{L}_{\text{I-Cyclic}} + \lambda_2 \mathcal{L}_{\text{C-Cyclic}} \tag{5}
$$

where λ1 > 0 and λ2 > 0 are hyperparameters controlling the importance of the in-modal and cross-modal cyclic consistency regularizers relative to the contrastive loss in CLIP.

We can also characterize the effect of the regularizers in terms of symmetrizing the in-modal and cross-modal similarity matrices, as illustrated in Figure 2. Note that the optimal solution to the contrastive loss formulation would push the similarity between the normalized embeddings of the matched pairs towards 1 while forcing all other pairwise similarities to 0, thereby also symmetrizing the cross-modal similarity matrix and minimizing the cross-modal consistency loss. However, this idealized scenario does not occur in practice, and we find that explicit regularization via cycle consistency in CYCLIP facilitates improved learning, as we show in our experiments.
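A minimal PyTorch-style sketch of the combined objective in Eq. 5 (the contrastive term of Eq. 1 plus the regularizers of Eqs. 3 and 4), assuming `image_embeds` and `text_embeds` are unit-normalized N × d batches; the names and the exact normalization constants are illustrative and not the reference implementation:

```python
import torch
import torch.nn.functional as F

def cyclip_loss(image_embeds, text_embeds, tau, lambda1=0.25, lambda2=0.25):
    """CYCLIP objective (Eq. 5): contrastive term plus in-modal and cross-modal cyclic regularizers."""
    n = image_embeds.shape[0]
    targets = torch.arange(n, device=image_embeds.device)

    # Contrastive term (Eq. 1): symmetric cross-entropy over the N x N scaled similarity matrix.
    logits_per_image = image_embeds @ text_embeds.T / tau      # <I_j^e, T_k^e> / tau
    clip_loss = 0.5 * (F.cross_entropy(logits_per_image, targets) +
                       F.cross_entropy(logits_per_image.T, targets))

    # Cross-modal consistency (Eq. 3): symmetrize the image-text similarity matrix.
    cross_sim = image_embeds @ text_embeds.T                   # <I_j^e, T_k^e>
    cross_cyclic = (cross_sim - cross_sim.T).pow(2).mean()

    # In-modal consistency (Eq. 4): match image-image and text-text similarities.
    image_sim = image_embeds @ image_embeds.T                  # <I_j^e, I_k^e>
    text_sim = text_embeds @ text_embeds.T                     # <T_j^e, T_k^e>
    in_cyclic = (image_sim - text_sim).pow(2).mean()

    # Batch-size-dependent constants in Eqs. 3-4 can be folded into lambda1 and lambda2.
    return clip_loss + lambda1 * in_cyclic + lambda2 * cross_cyclic
```

The defaults λ1 = λ2 = 0.25 match the values used in our experiments (Section 3).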
## 3 Experiments

Setup: We use Conceptual Captions 3M (CC3M) [52] image-caption pairs as the source of multimodal pretraining data for all our models. Note that while this dataset is smaller than the custom dataset (400 million pairs) used in the original work on CLIP [44], it is well suited to our available data and compute, and it has been used for benchmark evaluations in many subsequent works on language-image pretraining [5, 33, 37, 56]. Following prior work [44], our CLIP models use ResNet-50 as the image encoder and a transformer architecture as the text encoder. Further, we train our models from scratch for 64 epochs on 4 V100 GPUs with a batch size of 128 and an initial learning rate of 0.0005 with cosine scheduling and 10,000 warmup steps. The dimension of the image and text embeddings is 1024. For CYCLIP, we use λ1 = 0.25 and λ2 = 0.25 across all our experiments.

### 3.1 Zero-Shot Transfer

We compare the zero-shot performance of CLIP and CYCLIP on standard image classification datasets: CIFAR-10, CIFAR-100 [31], and ImageNet1K [49]. We follow the evaluation strategy suggested by [44] for zero-shot classification using prompt engineering. For each dataset, we use the names of the classes to form a set of natural sentences such as "a photo of the {class name}", "a sketch of the {class name}", and more. These are passed through the text encoder to get a set of text embeddings for that class. The text embeddings in this set are ℓ2-normalized, averaged, and further ℓ2-normalized to obtain a single text embedding for the class. For a given image, the image embedding is obtained as described in Section 2. The class whose text embedding (as described above) is closest to the test image is taken to be the predicted label. The zero-shot performance of the models is presented in Table 1.

Table 1: Zero-shot Top-K classification accuracy (%) for K ∈ {1, 3, 5}.

| Model | CIFAR-10 (Top1 / Top3 / Top5) | CIFAR-100 (Top1 / Top3 / Top5) | ImageNet1K (Top1 / Top3 / Top5) |
|---|---|---|---|
| CLIP | 46.54 / 78.22 / 91.16 | 18.69 / 34.72 / 43.97 | 20.03 / 33.04 / 39.35 |
| CYCLIP | 51.45 / 79.57 / 91.80 | 23.15 / 41.46 / 50.66 | 22.08 / 35.98 / 42.30 |
| %GAIN | +10.6 / +1.7 / +0.7 | +23.9 / +19.4 / +15.2 | +10.2 / +8.9 / +7.5 |

We observe that CYCLIP outperforms CLIP across all the datasets and on all Top-K metrics, with gains in the range of 10%-24% for K = 1. Our results on zero-shot transfer indicate the usefulness of geometric consistency for improved downstream performance of CLIP.

### 3.2 Robustness to Natural Distribution Shifts

One of the major successes of CLIP was its state-of-the-art performance on natural distribution shift benchmarks. These benchmarks include images depicting sketches, cartoons, and natural adversarial examples that fool trained ImageNet models. In Table 2, we evaluate the zero-shot classification accuracy of CYCLIP on four natural distribution shift benchmarks for the ImageNet dataset: ImageNet-V2 [48], ImageNet Sketch [57], ImageNet-A [27], and ImageNet-R [25]. For most of the distribution shift benchmarks, both CLIP and CYCLIP undergo a significant reduction in their zero-shot performance compared to the original ImageNet1K dataset (the ImageNet1K columns in Table 1). However, we observe that CYCLIP outperforms CLIP on all of the datasets considered in this experiment by a significant margin (10%-27%). This result indicates that adding cyclic consistency to the learned representations preserves the robustness benefits of CLIP on these benchmarks.

Table 2: Zero-shot classification accuracy (%) under natural distribution shifts.

| Model | ImageNet-V2 (Top1 / Top3 / Top5) | ImageNet Sketch (Top1 / Top3 / Top5) | ImageNet-A (Top1 / Top3 / Top5) | ImageNet-R (Top1 / Top3 / Top5) |
|---|---|---|---|---|
| CLIP | 16.91 / 29.28 / 34.99 | 10.37 / 19.15 / 24.20 | 4.23 / 11.35 / 16.88 | 24.32 / 39.69 / 47.20 |
| CYCLIP | 19.22 / 32.29 / 38.41 | 12.26 / 22.56 / 28.17 | 5.35 / 13.53 / 19.51 | 26.79 / 42.31 / 50.03 |
| %GAIN | +13.7 / +10.3 / +9.8 | +18.2 / +17.8 / +16.4 | +26.5 / +19.2 / +15.6 | +10.2 / +6.6 / +6.0 |
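For reference, a minimal sketch of the prompt-ensembling zero-shot classifier used for the evaluations in Sections 3.1 and 3.2; here `text_encoder` is an assumed callable mapping a list of strings to a batch of embeddings, and the names are illustrative:

```python
import torch
import torch.nn.functional as F

def build_zero_shot_classifier(class_names, prompt_templates, text_encoder) -> torch.Tensor:
    """Build one text embedding per class by ensembling several prompt templates."""
    class_embeds = []
    for name in class_names:
        prompts = [template.format(name) for template in prompt_templates]
        embeds = F.normalize(text_encoder(prompts), dim=-1)            # (num_prompts, d), unit norm
        class_embeds.append(F.normalize(embeds.mean(dim=0), dim=-1))   # average, then renormalize
    return torch.stack(class_embeds)                                   # (num_classes, d)

# Example usage (templates are illustrative); the predicted label is the most similar class:
#   templates = ["a photo of the {}.", "a sketch of the {}."]
#   classifier = build_zero_shot_classifier(names, templates, text_encoder)
#   predictions = (image_embeds @ classifier.T).argmax(dim=-1)
```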
### 3.3 Linear Probing

While the primary focus of CLIP and CYCLIP is zero-shot generalization, we can also assess whether the performance lost to inconsistency can be recovered with extra in-domain and in-modality supervision, i.e., in the presence of in-distribution training samples from the downstream visual datasets. To this end, we conduct an additional experiment on linear probing, where we fit a linear classifier on the representations learned by the visual encoder (ResNet-50) of CLIP and CYCLIP on a range of image classification datasets.

Table 3: Transfer of CLIP and CYCLIP to 14 downstream visual datasets using linear probing (the first three columns are ImageNet1K, Oxford Pets, and Stanford Cars). CYCLIP performs marginally better on 9 out of 14 datasets. For ImageNet1K, we use a random subset of 50K images from its original training dataset.

CLIP: 79.80, 78.26, 54.85, 59.02, 28.00, 83.50, 54.44, 69.72, 35.93, 57.66, 53.82, 20.00, 89.23, 47.28, 57.96
CYCLIP: 80.33, 76.98, 55.74, 63.44, 27.86, 82.96, 54.96, 71.70, 37.12, 56.82, 53.74, 22.14, 90.10, 48.01, 58.71

We present our results in Table 3. We find that both CLIP and CYCLIP can recover most of the performance lost due to inconsistency when provided extra in-domain and in-modality supervision, with CYCLIP marginally outperforming CLIP on 9 out of 14 visual datasets.

## 4 Analysis

Previously, we demonstrated the gains of CYCLIP over CLIP on downstream tasks that involve joint reasoning over the image and text spaces. In this section, we wish to better understand the relative behavior of the two models on a set of challenging tasks.

### 4.1 Consistency in Image and Text Spaces

We begin by quantitatively measuring the inconsistency problem illustrated in Figure 1. That is, we wish to evaluate to what extent the zero-shot predictions made using the text space are consistent with the ones made purely within the image space, as measured by our consistency metric in Eq. 2. Table 4 presents our results over standard benchmarks (CIFAR-10, CIFAR-100, ImageNet1K). The consistency score is calculated over the 10K, 10K, and 50K test images of CIFAR-10, CIFAR-100, and ImageNet1K, respectively. We use 50K samples from the training set of each dataset for the k-nearest-neighbor prediction. CYCLIP is more consistent than CLIP across all the datasets, as we explicitly symmetrize the cross-modal and in-modal distances. Hence, the representations learned by CYCLIP can be used more interchangeably than those learned by CLIP.

Table 4: Consistency score (%) for CLIP and CYCLIP across standard benchmarks. The Top-k consistency score is the fraction of test images for which the zero-shot predicted label in the text space is identical to the k-nearest-neighbor predicted label in the image space (using the training dataset).

| Model | CIFAR-10 (Top1 / Top3 / Top5 / Top10) | CIFAR-100 (Top1 / Top3 / Top5 / Top10) | ImageNet1K (Top1 / Top3 / Top5 / Top10) |
|---|---|---|---|
| CLIP | 44.60 / 46.04 / 47.06 / 48.45 | 16.21 / 17.28 / 18.42 / 19.36 | 16.34 / 17.42 / 18.58 / 19.78 |
| CYCLIP | 48.81 / 50.89 / 52.30 / 53.71 | 20.43 / 21.96 / 23.18 / 24.31 | 19.20 / 20.31 / 21.95 / 23.94 |
| %GAIN | +8.6 / +9.5 / +10.0 / +9.8 | +20.7 / +21.3 / +20.5 / +20.4 | +14.9 / +14.2 / +15.4 / +17.4 |

### 4.2 Fine-grained and Coarse-grained Performance

In Section 3.1, we observed that CYCLIP outperforms CLIP on zero-shot transfer across various datasets. To understand this transfer better, we perform an error analysis investigating both models' coarse-grained and fine-grained classification performance.
Given a dataset with a hierarchical class structure, coarse-grained classification differentiates between high-level (parent) classes, e.g., zero-shot classification into aquatic mammals vs. fish. Fine-grained classification focuses on differentiating between low-level (child) classes, e.g., zero-shot classification among dolphin, otter, and seal (subclasses of aquatic mammals). We perform this analysis on the CIFAR-100, ImageNet1K, ImageNet-V2, ImageNet Sketch, ImageNet-A, and ImageNet-R datasets.

Formally, we consider a test set of N image-subclass-superclass triplets {(I_j, C_j, P_j)}_{j=1}^N, where I_j, C_j, and P_j represent the image, the subclass (child), and the superclass (parent), respectively. The image embedding I_j^e ∈ R^d is obtained as described in Section 2, and the subclass embedding C_j^e ∈ R^d and superclass embedding P_j^e ∈ R^d are obtained as described in Section 3.1. Let the total number of superclasses and subclasses in the dataset be n_p and n_c, respectively. Further, let F be a unique mapping from a subclass to its superclass, and let G denote the inverse mapping from a superclass to its set of subclasses, i.e., for all P ∈ {1, ..., n_p}, G(P) = {C : F(C) = P and C ∈ {1, ..., n_c}}. Under this setup, the fine-grained and coarse-grained accuracies are defined as:

$$
\text{Fine-grained Accuracy} = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}\Big[ \operatorname*{argmax}_{C \in G(P_j)} \langle I_j^e, C^e\rangle = C_j \Big]
$$

$$
\text{Coarse-grained Accuracy} = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}\Big[ \operatorname*{argmax}_{C \in \{1, \dots, n_c\}} \langle I_j^e, C^e\rangle \in G(P_j) \Big]
$$

In Figure 3, we visualize how CLIP and CYCLIP compare with each other on the above metrics. The difference between the zero-shot performance of CYCLIP and CLIP is much more significant for coarse-grained classification than for fine-grained classification across all the datasets. This observation indicates that concept-level knowledge is better captured by CYCLIP than by CLIP. The drastic difference in the coarse-grained performance of CYCLIP and CLIP may be attributed to the rigid separation that the default cross-entropy loss in CLIP enforces between the positive pairs and negative pairs, which might degrade performance when some pairs in the negative batch belong to a similar entity. However, CYCLIP does not suffer from this problem as much because it imposes cycle constraints on the overall geometry of all the data pairs rather than forcing a rigid separation.

Figure 3 (panels: (a) Fine-grained, (b) Coarse-grained): The gap between the performances of CLIP and CYCLIP is much larger in the coarse-grained scenario, highlighting better entity-level knowledge representation in CYCLIP.

### 4.3 Alignment and Uniformity on the Unit Hypersphere

[58] argues that contrastive learning directly optimizes for (a) alignment (closeness) of the representations of the positive pairs and (b) uniformity (coverage) of the representation space on the unit hypersphere. We extend these properties to multimodal contrastive representation learning as:

$$
\text{Alignment} = \frac{1}{N}\sum_{j=1}^{N} \langle I_j^e, T_j^e\rangle
\qquad
\text{Uniformity} = -\log\bigg( \frac{1}{N(N-1)} \sum_{j=1}^{N}\sum_{\substack{k=1 \\ k \neq j}}^{N} e^{\langle I_j^e, T_k^e\rangle} \bigg)
$$

We desire our models to achieve high alignment and uniformity scores so that the image-text representations are close for the matched pairs and well spread over the unit hypersphere for different categories. We analyze the effect of cross-modal and in-modal consistency on the alignment and uniformity of the shared representations. For this, we train two ablated versions of CYCLIP: 1) C-CYCLIP, with only the cross-modal consistency component (λ1 = 0, λ2 = 0.5), and 2) I-CYCLIP, with only the in-modal consistency component (λ1 = 0.5, λ2 = 0) in Eq. 5. We use proxy captions for the classes, as described in Section 3.1, to obtain the text embeddings.
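Both metrics can be computed directly from a batch of paired, unit-normalized embeddings; a minimal sketch following the formulation above (illustrative only):

```python
import torch

def alignment(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Mean similarity of matched image-text pairs (higher = better aligned)."""
    return (image_embeds * text_embeds).sum(dim=-1).mean()

def uniformity(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Spread of mismatched image-text pairs on the hypersphere (closer to 0 = more uniform)."""
    sims = image_embeds @ text_embeds.T                                    # (N, N) cross-modal similarities
    n = sims.shape[0]
    off_diag = sims[~torch.eye(n, dtype=torch.bool, device=sims.device)]  # drop the matched pairs
    return -torch.log(torch.exp(off_diag).mean())
```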
We present the results in Table 5.

Table 5: Alignment and uniformity values for CLIP and the cyclic CLIP models. We abbreviate Alignment by A, Uniformity by U, and zero-shot Top1 classification accuracy (%) by ZS-Top1.

| Model | CIFAR-10 (A / U / ZS-Top1) | CIFAR-100 (A / U / ZS-Top1) | ImageNet1K (A / U / ZS-Top1) |
|---|---|---|---|
| CLIP | 0.36 / -0.27 / 46.54 | 0.36 / -0.25 / 18.69 | 0.39 / -0.18 / 20.03 |
| CYCLIP | 0.36 / -0.34 / 51.45 | 0.37 / -0.33 / 23.15 | 0.38 / -0.32 / 22.08 |
| I-CYCLIP | 0.60 / -0.57 / 50.97 | 0.60 / -0.57 / 22.35 | 0.61 / -0.55 / 21.21 |
| C-CYCLIP | 0.05 / -0.02 / 55.52 | 0.06 / -0.02 / 25.49 | 0.07 / -0.02 / 21.73 |

We observe that I-CYCLIP learns representations that are better aligned in the representation space; however, they do not cover the hypersphere uniformly. The representations learned by C-CYCLIP are more uniformly spread but poorly aligned compared to I-CYCLIP. In this light, the components of CYCLIP can be seen to encourage a balance of good alignment and uniformity. Further, we find that CLIP is more uniform than CYCLIP on all datasets, but contrary to prior beliefs, this does not translate to improved downstream performance. C-CYCLIP has the best downstream zero-shot performance on CIFAR-10 and CIFAR-100 despite its poor alignment score. Further, all 3 variants of CYCLIP outperform CLIP on all 3 datasets, with CYCLIP performing the best on ImageNet1K.

### 4.4 Image-Text Retrieval

We evaluate the effectiveness of the proposed method on the cross-modal (image-to-text and text-to-image) retrieval downstream task in the zero-shot as well as fine-tuned settings. We consider the standard benchmark datasets Flickr30K [42] and MSCOCO [8]. We assess our models on the test sets of Flickr30K (1K images) and MSCOCO (5K images) obtained from the well-known Karpathy split [30]. Both datasets contain 5 paired captions per image, which makes text retrieval per image easier than image retrieval per caption; our results below confirm this. We perform fine-tuning on the Karpathy training split with a batch size of 48, for 10 epochs on Flickr30K and 5 epochs on MSCOCO. All the other hyperparameters are identical to those used for pretraining.

Table 6: Zero-shot and fine-tuned cross-modal image-text retrieval (text-to-image and image-to-text) results of CLIP and CYCLIP on the Flickr30K and MSCOCO datasets.

| Setting | Model | Flickr30K (1K) Text Retrieval (R@1 / R@5 / R@10) | Flickr30K (1K) Image Retrieval (R@1 / R@5 / R@10) | MSCOCO (5K) Text Retrieval (R@1 / R@5 / R@10) | MSCOCO (5K) Image Retrieval (R@1 / R@5 / R@10) |
|---|---|---|---|---|---|
| Zero-shot | CLIP | 88.2 / 93.9 / 95.8 | 29.9 / 57.2 / 68.0 | 82.1 / 85.6 / 87.8 | 8.4 / 19.5 / 26.6 |
| Zero-shot | CYCLIP | 88.1 / 93.7 / 95.9 | 30.9 / 57.8 / 69.1 | 82.1 / 85.6 / 87.7 | 8.6 / 20.0 / 27.0 |
| Fine-tuned | CLIP | 91.9 / 97.0 / 98.0 | 46.3 / 74.7 / 83.6 | 83.2 / 87.6 / 90.0 | 10.6 / 23.9 / 31.3 |
| Fine-tuned | CYCLIP | 92.3 / 97.0 / 98.4 | 47.3 / 76.6 / 85.4 | 83.2 / 87.8 / 90.3 | 11.4 / 25.8 / 33.4 |

Table 6 presents our cross-modal image-text retrieval results for CLIP and CYCLIP. In the zero-shot setting, we find that CYCLIP marginally outperforms CLIP on the image retrieval task on both datasets. The relatively lower performance of both CLIP and CYCLIP in the zero-shot setting may be attributed to the more complicated nature of the two datasets, where the models are expected to find similarities between the image and text at multiple resolutions, as opposed to image classification, where there is mostly a single object to be matched with a simpler caption. It is not clear which distinctions in the raw image and text spaces are also reflected in the embedding space. Hence, we fine-tune on both datasets to better inform our models of the downstream data. In the fine-tuned setting, we find that the performance of both models increases across both datasets. However, we observe clear benefits of the soft consistency regularization on the image retrieval results for both datasets.
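The retrieval numbers above are standard Recall@K. A minimal sketch of how they can be computed from a similarity matrix; the helper is illustrative and assumes the set of ground-truth candidate indices for each query is known (e.g., the five captions paired with each image):

```python
import torch

def recall_at_k(sim: torch.Tensor, gt_indices, ks=(1, 5, 10)):
    """Recall@K: fraction of queries whose top-K retrieved candidates contain a ground truth.

    sim: (num_queries, num_candidates) similarity matrix, e.g., image_embeds @ text_embeds.T.
    gt_indices: list of sets of correct candidate indices for each query.
    """
    ranked = sim.topk(max(ks), dim=-1).indices.tolist()   # top candidates per query, best first
    recalls = {}
    for k in ks:
        hits = sum(1 for q, cand in enumerate(ranked) if gt_indices[q] & set(cand[:k]))
        recalls[f"R@{k}"] = 100.0 * hits / len(ranked)
    return recalls

# Text retrieval uses sim = image_embeds @ text_embeds.T; for image retrieval,
# swap the roles (sim = text_embeds @ image_embeds.T) and the ground-truth mapping.
```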
### 4.5 CYCLIP Preserves the Effective Robustness of CLIP

[36] shows that there is a strong correlation between the in-distribution and out-of-distribution generalization of models trained on ImageNet1K, as illustrated by the linear fit (red) in Figure 4. Ideally, any model that does not suffer under the distribution shift would fall on the y = x trendline (black). For other models, the deviation from the baseline fit indicates their effective robustness. Previously, [44] showed that the zero-shot CLIP classifier trained on 400M image-text pairs improves effective robustness significantly compared to prior approaches to robustness. Subsequently, [28] demonstrated that CLIP models trained at small scales also exhibit high effective robustness, which allows them to be used as a proxy to study the robustness properties of CLIP.

Figure 4: Effect of varying the training dataset size on (a) classification accuracy on ImageNet1K and (b) effective robustness on ImageNet-V2.

We evaluate the effect of cyclic consistency on effective robustness. We trained CLIP and CYCLIP models at 4 training dataset sizes, ranging from 500K to 4M image-text pairs drawn from the CC3M [52] and CC12M [6] datasets. In Figure 4 (a), we observe that for all training data sizes, CYCLIP shows a significant improvement over CLIP, showcasing its effectiveness in a diverse set of data regimes. Further, Figure 4 (b) shows that CYCLIP lies well above the baseline trend and preserves the effective robustness of CLIP.

## 5 Related Work

Our work fits into the broader theme of unsupervised pretraining over multiple modalities, which has been successfully applied for learning representations of modalities such as images, text, and speech [2, 15, 1, 59, 43, 34]. Similar to the unimodal setting, the two predominant approaches for multimodal pretraining are contrastive and generative, as described below.

**Contrastive Representation Learning:** Contrastive learning was originally proposed for self-supervised representation learning in the unimodal context, where the embedding of a sample is brought closer to that of an augmented version of the sample and pushed away from the embeddings of other samples and their augmentations [11, 51, 39, 55, 21, 7, 16, 40, 66, 23, 18]. [63] and [3] impose additional constraints to remove redundancies and prevent dimensional collapse in the visual representations. Recently, contrastive learning has also been used to learn robust representations of multimodal data [62, 47]. Many works use additional losses to incorporate extra multimodal supervision during the training process [54, 65, 64, 14, 35]. In this work, we focus on adding cyclic consistency to the contrastive loss to learn more robust image-text representations.

**Contrastive Language-Image Pretraining:** CLIP [44], ALIGN [29], and BASIC [41] have enjoyed great success in extending contrastive learning to paired image-text data, with impressive zero-shot classification and robustness performance. These works have recently been further extended to include visual self-supervision [37], additional nearest neighbor supervision [33], and utilization of unpaired data [56]. Our work complements much of this literature, as it identifies consistency regularizers that can be added to the learning objectives of the above works.
**Generative Representation Learning:** Generative models have been applied for learning representations of multimodal data [60, 53]. In particular, [67, 61, 10] proposed a notion of cyclic consistency for learning from unpaired multimodal data using GANs [17], which was later extended to normalizing flows [20, 19]. While these works focus on regularizing a generative mapping between modalities, our notion of cycle consistency applies to embeddings learned via a contrastive framework.

## 6 Conclusion

We presented CYCLIP, a framework for cycle consistent multimodal representation learning over the image and text modalities. The main benefits of CYCLIP stem from including cross-modal and in-modal consistency regularizers that prevent inconsistent inference in the image and text spaces. Empirically, we showed that CYCLIP performs considerably better than CLIP on zero-shot classification and is more robust on benchmarks for distributional robustness. We also showed that the representations learned by CYCLIP are more consistent than those of CLIP and better capture concept-level knowledge, as evidenced by our analysis of fine-grained and coarse-grained accuracies. We believe this work can motivate further studies on understanding the geometry of the representation spaces learned via the contrastive objective applied to paired multimodal data and, in particular, on identifying conditions and regularization strategies under which the learned representations are synergistic across the various modalities for downstream applications.

One important future direction, and a current limitation, is scaling CYCLIP to larger datasets. While we do not possess the resources for this study, it is imperative to study the extent to which the benefits of cycle consistency remain at the scale on which the original CLIP was trained (400M image-text pairs). Finally, for real-world deployment of CLIP and its variants, such as CYCLIP, we need to be cautious about amplifying societal biases, as these models are trained on large uncurated datasets scraped from the web [9]. Additionally, it is easy to add malicious data to the web, which poses a severe security threat [5]. Alleviating such harms is an important and active area of research.

## Acknowledgements

This research is supported by an Adobe Data Science Research Award for Aditya Grover. We would like to thank IDRE's Research Technology group for the GPU computing resources on the UCLA Hoffman2 Cluster. We also want to thank Tung Duc Nguyen, Satvik Mashkaria, Siddarth Krishnamoorthy, Varuni Sarwal, and Ashima Suvarna for their helpful suggestions.

## References

[1] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. See, hear, and read: Deep aligned representations. arXiv preprint arXiv:1706.00932, 2017.
[2] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423-443, 2018.
[3] Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
[4] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
[5] Nicholas Carlini and Andreas Terzis. Poisoning and backdooring contrastive learning. arXiv preprint arXiv:2106.09667, 2021.
[6] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558-3568, 2021.
[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597-1607. PMLR, 2020.
[8] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[9] Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers. arXiv preprint arXiv:2202.04053, 2022.
[10] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789-8797, 2018.
[11] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 539-546. IEEE, 2005.
[12] Katherine Crowson, Stella Rose Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204.08583, 2022.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[14] Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11162-11173, 2021.
[15] Jiali Duan, Liqun Chen, Son Tran, Jinyu Yang, Yi Xu, Belinda Zeng, Chenyang Tao, and Trishul Chilimbi. Multi-modal alignment using representation codebook. arXiv preprint arXiv:2203.00048, 2022.
[16] Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[18] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271-21284, 2020.
[19] Aditya Grover, Christopher Chute, Rui Shu, Zhangjie Cao, and Stefano Ermon. Alignflow: Cycle consistent learning from multiple domains via normalizing flows. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4028-4035, 2020.
[20] Aditya Grover, Manik Dhar, and Stefano Ermon. Flow-gan: Combining maximum likelihood and adversarial learning in generative models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[21] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297-304. JMLR Workshop and Conference Proceedings, 2010.
[22] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735-1742. IEEE, 2006.
[23] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729-9738, 2020.
[24] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340-8349, 2021.
[25] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Lixuan Zhu, Samyak Parajuli, Mike Guo, Dawn Xiaodong Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8320-8329, 2021.
[26] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262-15271, 2021.
[27] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Xiaodong Song. Natural adversarial examples. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15257-15266, 2021.
[28] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. Zenodo, July 2021.
[29] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904-4916. PMLR, 2021.
[30] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.
[31] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Citeseer, 2009.
[32] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[33] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
[34] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247, 2021.
[35] Sijie Mai, Ying Zeng, Shuangjia Zheng, and Haifeng Hu. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. arXiv preprint arXiv:2109.01797, 2021.
[36] John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, pages 7721-7735. PMLR, 2021.
[37] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. arXiv preprint arXiv:2112.12750, 2021.
[38] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[39] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004-4012, 2016.
[40] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[41] Hieu Pham, Zihang Dai, Golnaz Ghiasi, Hanxiao Liu, Adams Wei Yu, Minh-Thang Luong, Mingxing Tan, and Quoc V Le. Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050, 2021.
[42] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641-2649, 2015.
[43] Ariadna Quattoni, Michael Collins, and Trevor Darrell. Learning visual representations using images with captions. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8. IEEE, 2007.
[44] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021.
[45] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[46] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[47] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821-8831. PMLR, 2021.
[48] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389-5400. PMLR, 2019.
[49] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[50] Aditya Sanghi, Hang Chu, Joseph G Lambourne, Ye Wang, Chin-Yi Cheng, and Marco Fumero. Clip-forge: Towards zero-shot text-to-shape generation. arXiv preprint arXiv:2110.02624, 2021.
[51] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015.
[52] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556-2565, 2018.
[53] Yuge Shi, Brooks Paige, Philip Torr, et al. Variational mixture-of-experts autoencoders for multi-modal deep generative models. Advances in Neural Information Processing Systems, 32, 2019.
[54] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. arXiv preprint arXiv:2112.04482, 2021.
[55] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. Advances in Neural Information Processing Systems, 29, 2016.
[56] Ajinkya Tejankar, Bichen Wu, Saining Xie, Madian Khabsa, Hamed Pirsiavash, and Hamed Firooz. A fistful of words: Learning transferable visual models from bag-of-words supervision. arXiv preprint arXiv:2112.13884, 2021.
[57] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019.
[58] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.
[59] Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. Vlmo: Unified vision-language pretraining with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358, 2021.
[60] Mike Wu and Noah Goodman. Multimodal generative models for scalable weakly-supervised learning. Advances in Neural Information Processing Systems, 31, 2018.
[61] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2849-2857, 2017.
[62] Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, and Baldo Faieta. Multimodal contrastive training for visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6995-7004, 2021.
[63] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310-12320. PMLR, 2021.
[64] Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34, 2021.
[65] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579-5588, 2021.
[66] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747, 2020.
[67] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223-2232, 2017.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] Section 6.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] Section 6.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Due to extremely compute-heavy experiments.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [No] All the non-proprietary datasets and code used are public under MIT, BSD, or CC licenses.
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]