# on_the_generalization_of_multimodal_contrastive_learning__e176aadf.pdf On the Generalization of Multi-modal Contrastive Learning Qi Zhang * 1 Yifei Wang * 2 Yisen Wang 1 3 Multi-modal contrastive learning (MMCL) has recently garnered considerable interest due to its superior performance in visual tasks, achieved by embedding multi-modal data, such as visuallanguage pairs. However, there still lack theoretical understandings of how MMCL extracts useful visual representation from multi-modal pairs, and particularly, how MMCL outperforms previous approaches like self-supervised contrastive learning (SSCL). In this paper, by drawing an intrinsic connection between MMCL and asymmetric matrix factorization, we establish the first generalization guarantees of MMCL for visual downstream tasks. Based on this framework, we further unify MMCL and SSCL by showing that MMCL implicitly performs SSCL with (pseudo) positive pairs induced by text pairs. Through this unified perspective, we characterize the advantage of MMCL by showing that text pairs induce more semantically consistent and diverse positive pairs, which, according to our analysis, provably benefit downstream generalization. Inspired by this finding, we propose several methods to significantly improve the downstream performance of SSCL on Image Net by leveraging multi-modal information. Code is available at https://github. com/PKU-ML/CLIP-Help-Sim CLR. 1. Introduction Recently, multi-modal contrastive learning (MMCL), including CLIP (Radford et al., 2021) and its variants (Li et al., 2022b; Mu et al., 2022; Yao et al., 2022), has achieved impressive performance for visual representation learning, and transfer well to various downstream tasks like zero- *Equal contribution 1National Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University 2School of Mathematical Sciences, Peking University 3Institute for Artificial Intelligence, Peking University. Correspondence to: Yisen Wang . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). shot and few-shot image classification. The core idea of MMCL is rather simple, which aligns the samples of the same image-text pairs together while pushing away other unrelated samples in the latent feature space. However, it remains not fully clear to us why matching multi-modal pairs would benefit visual representation learning, and what are the key factors that affect its downstream performance. Meanwhile, another popular scenario for contrastive learning is self-supervised learning, which also obtains competitive performance recently (Chen et al., 2020; He et al., 2020; Wang et al., 2021). Nevertheless, recent MMCL methods (like CLIP) have shown significant advantages over its selfsupervised contrastive learning (SSCL) counterparts like Sim CLR (Chen et al., 2020). Existing theories of SSCL (Saunshi et al., 2019; Hao Chen et al., 2021; Wang & Isola, 2020) only establish the optimality of self-supervised representations on downstream tasks, and fail to characterize why MMCL could outperform SSCL. Another major obstacle is the generation process of data pairs. In particular, positive pairs in SSCL are visual-only samples generated by random data augmentations of the raw image; Instead, the positive pairs in MMCL are multi-modal (e.g., visual-language) pairs directly provided by the dataset. 
Since existing SSCL theories rely crucially on the assumption that data augmentations produce overlap between visual samples (Wang et al., 2022; Saunshi et al., 2022), they cannot be directly applied to MMCL that relies on multi-modal data pairs. In this paper, we propose the first theoretical analysis on the generalization ability of MMCL. To achieve this, we establish an equivalence between the MMCL objective and the asymmetric matrix factorization (AMF) of the multimodal co-occurrence matrix. Built upon this connection, we characterize the ideal pretrained representations of MMCL and its generalization bounds on visual and language downstream tasks, where the bounds are influenced by the properties of the multi-modal co-occurrence matrix, for example, its singular value. The established theoretical framework also allows us to characterize the difference between MMCL and SSCL under a unified perspective. To be specific, we first formally unify MMCL and SSL under the framework of uni-modal similarity graphs, where language pairs in MMCL can be regarded as a special kind of data augmentation for generating pos- On the Generalization of Multi-modal Contrastive Learning itive visual pairs. Based on this perspective, we compare MMCL and SSCL on real-world data and show that textinduced positive pairs have better semantic consistency and diversity than augmentation-based ones in SSCL, which explains the superiority of MMCL on downstream tasks. Besides the empirical comparisons, we theoretically analyze this difference by modeling the data generation process with the hierarchical random graph (Clauset et al., 2008). Based on this understanding, we further leverage multimodal information in CLIP to assist the self-supervised visual learning with Sim CLR on Image Net and achieve significant improvements, which validates our understanding of the superiority of multi-modal positive pairs. We summarize our contributions as follows: We establish the first generalization theoretical guarantee for multi-modal contrastive learning (MMCL). We provide a new perspective of the multi-modal contrastive loss by connecting it with an asymmetric matrix decomposition objective. We provide a unified perspective for understanding the connections and differences between multi-modal and self-supervised contrastive learning. Based on this perspective, we examine their differences on real-world data, and find that multi-modal information induces better positive visual pairs than self-supervision (with better semantic consistency and diversity), which explains the superiority of MMCL. As a verification of our understanding above, we further investigate a new scenario where we leverage multi-modal information in pretrained models (like CLIP) to assist self-supervised learning like Sim CLR. We propose four different techniques and they both bring improvements (as much as 6.2%) on Image Net. 2. Related Work Multi-modal Pretraining Applications. Traditional singlestream models (Lu et al., 2019; Li et al., 2019) have been widely discussed and shown the impressive performance in various multi-modal tasks. However, as they do not have independent encoders for different modals, the transferability of these frameworks is usually limited. On contrast, multimodal contrastive learning paradigms represented by CLIP (Radford et al., 2021) have recently obtained the promising performance in multi-modal downstream tasks including zero-shot learning, finetuning and linear-probing. 
Inspired by CLIP, various variants have been proposed to improve the efficiency and performance of multi-modal pretraining. SLIP (Mu et al., 2022) and DeCLIP (Li et al., 2022b) combine self-supervised and multi-modal contrastive learning to accelerate the training process. FILIP (Yao et al., 2022) proposes a fine-grained multi-modal contrastive objective to make the encoder focus more on local features.

Theory of Contrastive Learning. Motivated by the empirical success of the contrastive objective, many researchers have tried to analyze theoretically how it works. Wang & Isola (2020) interpret the contrastive loss through its two terms: the alignment of positive samples and the uniformity of negative samples. Hjelm et al. (2019) analyze the objective from the perspective of mutual information. Saunshi et al. (2019) establish theoretical guarantees relating the pretraining contrastive loss to downstream classification performance. HaoChen et al. (2021) revisit the contrastive objective from a spectral graph perspective, which explains the relationship between the augmented samples and the downstream performance of contrastive learning. Wang et al. (2022; 2023) provide theoretical understandings of contrastive learning from the perspectives of augmentation overlap and message passing, respectively. As these prior theoretical works mainly focus on single-modal contrastive learning, theoretical analysis of multi-modal contrastive learning is still quite limited. In this work, we theoretically analyze the relationship between the design of multi-modal contrastive paradigms and their generalization ability on downstream tasks.

Theory of Multi-modal Learning. There are few related works on the theoretical analysis of multi-modal learning. Sun et al. (2020) propose an information-theoretic framework and prove that their method can learn the ground-truth Bayesian posterior classifier for each modality and the Bayesian posterior aggregator for all modalities. Huang et al. (2021) prove that multi-modal models can learn better representations than single-modal models under certain conditions. However, neither analysis focuses on the multi-modal contrastive paradigm, and neither can explain why contrastive methods achieve such impressive performance.

3. Generalization Theory of Multi-Modal Contrastive Learning

3.1. Mathematical Formulation

We start by introducing the basic mathematical formulation of multi-modal contrastive learning. Without loss of generality, taking CLIP (Radford et al., 2021) as an example, we have paired data $(x_v, x_l)$ from the visual domain ($x_v$ denotes an image) and the language domain ($x_l$ denotes a corresponding text description of the image). Each $x_v$ or $x_l$ belongs to one of $r$ classes. We use $\mathcal{X}_V$ to denote the set of all visual data with distribution $P_V$, and $\mathcal{X}_L$ to denote the set of all language data with distribution $P_L$. Their joint multi-modal distribution is $P_M$. For ease of exposition, we assume $\mathcal{X}_V, \mathcal{X}_L$ to be finite but exponentially large sets¹, and denote $N_V = |\mathcal{X}_V|$ and $N_L = |\mathcal{X}_L|$. The goal of multi-modal contrastive learning is to obtain a joint embedding of the visual data $\mathcal{X}_V$ and the language data $\mathcal{X}_L$ in a $k$-dimensional latent space $\mathcal{Z} \subseteq \mathbb{R}^k$ by learning a visual encoder $f^V: \mathcal{X}_V \to \mathcal{Z}$ and a language encoder $f^L: \mathcal{X}_L \to \mathcal{Z}$, such that semantically similar samples (whether image-image, text-text, or image-text pairs) have close representations, and different samples are far apart.
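To make the setup concrete, below is a minimal PyTorch-style sketch of such a dual-encoder model. The backbone choices (a ResNet for images, a small Transformer for text) and all dimensions are illustrative assumptions, not the architecture used in the paper or in CLIP.

```python
import torch
import torch.nn as nn
import torchvision

class DualEncoder(nn.Module):
    """Minimal dual encoder: f^V maps images and f^L maps token ids into a shared R^k space."""
    def __init__(self, k=256, vocab_size=49408, max_len=77):
        super().__init__()
        # Visual encoder f^V: an image backbone followed by a linear head into R^k.
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()
        self.f_v = nn.Sequential(backbone, nn.Linear(2048, k))
        # Language encoder f^L: token embeddings -> small Transformer -> mean pooling -> R^k.
        self.tok = nn.Embedding(vocab_size, 512)
        self.pos = nn.Parameter(torch.zeros(max_len, 512))
        layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
        self.txt = nn.TransformerEncoder(layer, num_layers=4)
        self.proj_l = nn.Linear(512, k)

    def encode_image(self, images):      # images: (B, 3, H, W)
        return self.f_v(images)          # (B, k)

    def encode_text(self, token_ids):    # token_ids: (B, L)
        h = self.tok(token_ids) + self.pos[: token_ids.size(1)]
        h = self.txt(h).mean(dim=1)      # simple mean pooling over tokens
        return self.proj_l(h)            # (B, k)
```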
A recent work (Tschannen et al., 2022) also explores a Siamese network, i.e., $f^V = f^L$; here we consider the general case with two different encoders. For multi-modal positive and negative pairs, we define an image-text pair drawn from the paired visual-language data, i.e., $(x_v, x_l) \sim P_M$, as a positive pair. We draw independent samples from each domain, $x_v^- \sim P_V$, $x_l^- \sim P_L$, and treat $(x_v, x_l^-)$, $(x_v^-, x_l)$ and $(x_v^-, x_l^-)$ as negative pairs, because the samples in these pairs are independent of each other. Given positive and negative pairs $(x_v, x_l, x_v^-, x_l^-)$, one popular learning objective is the symmetric cross entropy (SCE) loss (adopted in CLIP) calculated over similarity scores:

$$\mathcal{L}_{\mathrm{SCE}}(f^V, f^L) = -\,\mathbb{E}_{x_v, x_l} \log \frac{\exp\!\big(f^V(x_v)^\top f^L(x_l)\big)}{\mathbb{E}_{x_l^-} \exp\!\big(f^V(x_v)^\top f^L(x_l^-)\big)} \;-\; \mathbb{E}_{x_v, x_l} \log \frac{\exp\!\big(f^V(x_v)^\top f^L(x_l)\big)}{\mathbb{E}_{x_v^-} \exp\!\big(f^V(x_v^-)^\top f^L(x_l)\big)}. \qquad (1)$$

This objective can be seen as an extension of the popular InfoNCE loss (Oord et al., 2018) to the multi-modal scenario (Zhang et al., 2020). During the learning process, positive pairs $(x_v, x_l)$ are pulled together in the latent space while negative pairs $(x_v, x_l^-)$ and $(x_v^-, x_l)$ are pushed apart. Following the same spirit, we consider a similar multi-modal spectral loss for ease of theoretical analysis,

$$\mathcal{L}_{\mathrm{SCL}}(f^V, f^L) = -2\,\mathbb{E}_{x_v, x_l}\, f^V(x_v)^\top f^L(x_l) + \mathbb{E}_{x_v^-, x_l^-}\big(f^V(x_v^-)^\top f^L(x_l^-)\big)^2. \qquad (2)$$

Comparing Eq. 1 and Eq. 2, we can easily see that the two objectives have the same loss for positive pairs, and only differ in the specific loss function used for pushing negative pairs apart (a logsumexp loss in Eq. 1 vs. an $\ell_2$ loss in Eq. 2). The multi-modal spectral loss can be regarded as an extension of the visual spectral contrastive loss, which achieves performance comparable to the InfoNCE loss in visual tasks (HaoChen et al., 2021). Nevertheless, their analysis only applies to self-supervised contrastive learning, where positive and negative pairs come from the same domain.

After pretraining, we evaluate the learned representations by applying them to downstream tasks. Taking the visual linear probing task as an example, we train a linear classifier to predict class labels $y \in \mathcal{Y}$ from the output features of $f^V$ by $g_{f, B^V}(x_v) = \arg\max_{i \in [r]} \big(f^V(x_v)^\top B^V\big)_i$, where $B^V \in \mathbb{R}^{k \times r}$ denotes the weight matrix. The linear probing error of $f^V$ is defined as the error of the optimal linear classifier on the encoded features, i.e.,

$$\mathcal{E}(f^V) = \min_{B^V} \mathbb{E}_{x_v \sim P_V}\, \mathbb{1}\big[g_{f, B^V}(x_v) \neq y(x_v)\big], \qquad (3)$$

where $y(x_v)$ denotes the label of $x_v$. Likewise, we can define the linear probing error $\mathcal{E}(f^L)$ for text classification.

¹With some non-essential nuances, as in HaoChen et al. (2021), our analysis can also be extended to the infinite-data setting.
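As a concrete illustration of Eq. 2, the following minimal sketch estimates the multi-modal spectral loss on a mini-batch of paired features, using the other in-batch image-text combinations as approximately independent negatives. The batch-level estimate and tensor names are assumptions for illustration, not the paper's training recipe.

```python
import torch

def multimodal_spectral_loss(z_v, z_l):
    """Batch estimate of Eq. 2: -2 E[f^V(x_v)^T f^L(x_l)] + E[(f^V(x_v^-)^T f^L(x_l^-))^2].

    z_v: (B, k) image features, z_l: (B, k) text features,
    where (z_v[i], z_l[i]) is a positive image-text pair.
    """
    B = z_v.size(0)
    pos = -2.0 * (z_v * z_l).sum(dim=1).mean()        # aligned pairs (x_v, x_l) ~ P_M
    sim = z_v @ z_l.t()                               # (B, B) all image-text similarities
    neg_mask = ~torch.eye(B, dtype=torch.bool, device=z_v.device)
    neg = (sim[neg_mask] ** 2).mean()                 # off-diagonal pairs approximate (x_v^-, x_l^-)
    return pos + neg

# usage with the DualEncoder sketch above (hypothetical batch tensors):
# z_v = model.encode_image(images); z_l = model.encode_text(token_ids)
# loss = multimodal_spectral_loss(z_v, z_l)
```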
3.2. An Asymmetric Matrix Factorization View of Multi-modal Contrastive Learning

With its samplewise pretraining objective (Eqs. 1 & 2), multi-modal contrastive learning (MMCL) is usually understood as an instance-level feature matching task between the visual and language domains (Radford et al., 2021). However, little is known about the overall distribution of the learned features, which hinders us from understanding how its instance-level pretraining benefits downstream applications. In this section, with a reformulation of the MMCL objective, we show that MMCL is essentially equivalent to the asymmetric matrix factorization (AMF) of the joint data distribution $P_M(x_v, x_l)$. AMF is an important class of methods in classical machine learning with inherent connections to PCA, K-means, and spectral clustering (Ding et al., 2005), and is widely adopted in unsupervised learning scenarios like Latent Semantic Analysis (Deerwester et al., 1990) and word embedding (Pennington et al., 2014). Generally speaking, AMF can extract the low-frequency components that underlie the common structure of the joint distribution, which is helpful for analyzing MMCL.

We start by formulating the joint distribution $P_M(x_v, x_l)$ as a co-occurrence matrix $P_M \in \mathbb{R}^{N_V \times N_L}$ between all visual-language data pairs, where

$$(P_M)_{x_v, x_l} = P_M(x_v, x_l) \geq 0, \quad \forall\, x_v \in [N_V],\ x_l \in [N_L]. \qquad (4)$$

We can see that $P_M$ is a non-negative asymmetric matrix that can be exponentially large. A canonical assumption of representation learning is that high-dimensional data (like images and text) lie on a low-dimensional manifold. We therefore consider the following low-rank matrix factorization of the normalized co-occurrence matrix $\bar P_M$:

$$\mathcal{L}_{\mathrm{AMF}}(F_V, F_L) = \big\|\bar P_M - F_V F_L^\top\big\|^2, \qquad (5)$$

where $F_V \in \mathbb{R}^{N_V \times k}$ and $F_L \in \mathbb{R}^{N_L \times k}$ are the factorized low-rank components ($k \ll \min(N_V, N_L)$) of the visual and language domains, respectively. To obtain the normalized co-occurrence matrix $\bar P_M$, we adopt a two-sided normalization

$$(\bar P_M)_{x_v, x_l} = \frac{P_M(x_v, x_l)}{\sqrt{P_V(x_v)\, P_L(x_l)}}, \qquad (6)$$

where $P_V(x_v) = \sum_{x_l} P_M(x_v, x_l)$ denotes the marginal probability of $x_v$, and $P_L(x_l) = \sum_{x_v} P_M(x_v, x_l)$ denotes the marginal probability of $x_l$. Based on this formulation, we are ready to establish the key result of this paper.

Theorem 3.1 (Equivalence). Let the $x_v$-th row of $F_V$ and the $x_l$-th row of $F_L$ represent the corresponding encoded features of these samples in the following form,

$$(F_V)_{x_v} = \sqrt{P_V(x_v)}\, f^V(x_v)^\top, \qquad (7a)$$
$$(F_L)_{x_l} = \sqrt{P_L(x_l)}\, f^L(x_l)^\top. \qquad (7b)$$

Then the low-rank asymmetric matrix factorization loss (Eq. 5) is equivalent to the multi-modal contrastive loss (Eq. 2) up to a constant,

$$\mathcal{L}_{\mathrm{AMF}}(F_V, F_L) = \mathcal{L}_{\mathrm{SCL}}(f^V, f^L) + \mathrm{const}. \qquad (8)$$

Proof. Plugging the definitions of $F_V$ and $F_L$ in Eq. 7 into the decomposition loss $\mathcal{L}_{\mathrm{AMF}}(F_V, F_L)$, and combining with the definition of $\bar P_M$ in Eq. 6, we have

$$\begin{aligned}
\mathcal{L}_{\mathrm{AMF}}(F_V, F_L) &= \big\|\bar P_M - F_V F_L^\top\big\|^2 \\
&= \sum_{x_v, x_l} \Big(\frac{P_M(x_v, x_l)}{\sqrt{P_V(x_v) P_L(x_l)}} - \sqrt{P_V(x_v)}\, f^V(x_v)^\top f^L(x_l)\, \sqrt{P_L(x_l)}\Big)^2 \\
&= \sum_{x_v, x_l} \Big(\underbrace{\frac{P_M(x_v, x_l)^2}{P_V(x_v) P_L(x_l)}}_{\mathrm{const}} - 2 P_M(x_v, x_l)\, f^V(x_v)^\top f^L(x_l) + P_V(x_v) P_L(x_l) \big(f^V(x_v)^\top f^L(x_l)\big)^2\Big) \\
&= -2\,\mathbb{E}_{x_v, x_l}\, f^V(x_v)^\top f^L(x_l) + \mathbb{E}_{x_v^-, x_l^-}\big(f^V(x_v^-)^\top f^L(x_l^-)\big)^2 + \mathrm{const} \\
&= \mathcal{L}_{\mathrm{SCL}}(f^V, f^L) + \mathrm{const},
\end{aligned}$$

which completes the proof.

Theorem 3.1 reveals a crucial fact: multi-modal contrastive learning essentially learns a low-rank factorization of the co-occurrence matrix. Meanwhile, we note that the original factorization loss is intractable to solve directly because of the exponentially large size of the co-occurrence matrix $\bar P_M$, while multi-modal contrastive learning avoids this problem by transforming it into a tractable and scalable objective that only requires samples from the joint distribution $P_M$. Theoretically, this equivalence allows us to characterize the overall distribution of the features learned by multi-modal contrastive learning, and it provides guarantees on downstream tasks for its ideal representations in the following part.
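The equivalence in Theorem 3.1 can be checked numerically. The sketch below builds a random toy co-occurrence matrix and arbitrary encoders, and verifies that $\mathcal{L}_{\mathrm{AMF}}$ and $\mathcal{L}_{\mathrm{SCL}}$ differ only by the constant $\sum_{x_v,x_l} P_M(x_v,x_l)^2 / (P_V(x_v) P_L(x_l))$; the sizes and random features are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_V, N_L, k = 6, 5, 3

# Toy joint distribution P_M and its marginals.
P_M = rng.random((N_V, N_L)); P_M /= P_M.sum()
P_V, P_L = P_M.sum(1), P_M.sum(0)
P_bar = P_M / np.sqrt(np.outer(P_V, P_L))            # normalized co-occurrence matrix (Eq. 6)

# Arbitrary encoders f^V, f^L and the induced factor matrices (Eq. 7).
f_V, f_L = rng.standard_normal((N_V, k)), rng.standard_normal((N_L, k))
F_V = np.sqrt(P_V)[:, None] * f_V
F_L = np.sqrt(P_L)[:, None] * f_L

L_amf = np.sum((P_bar - F_V @ F_L.T) ** 2)           # Eq. 5

S = f_V @ f_L.T                                      # all pairwise scores f^V(x_v)^T f^L(x_l)
L_scl = -2 * np.sum(P_M * S) + np.sum(np.outer(P_V, P_L) * S**2)   # Eq. 2

const = np.sum(P_M**2 / np.outer(P_V, P_L))
print(np.allclose(L_amf, L_scl + const))             # True: they differ only by the constant
```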
3.3. Characterizing Ideal Representations of Multi-modal Contrastive Learning

In multi-modal contrastive learning (MMCL) like CLIP (Radford et al., 2021), a common pipeline is to apply the pretrained representations to downstream visual tasks like image classification. Therefore, in order to characterize the pretraining and downstream behaviors of MMCL, it is important to understand the properties of the optimally pretrained representations and how they generalize to downstream tasks.

Ideal Representations. First, we characterize the general solution to the multi-modal pretraining loss, under the ideal assumption that the neural networks are expressive enough.

Theorem 3.2. Let $\bar P_M = U \Sigma V^\top$ be the singular value decomposition (SVD) of the normalized co-occurrence matrix $\bar P_M$ (Eq. 6), where $U \in \mathbb{R}^{N_V \times r}$ and $V \in \mathbb{R}^{N_L \times r}$ have orthonormal columns, and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$ contains the descending singular values $\sigma_1 \geq \dots \geq \sigma_r \geq 0$, with $r = \min(N_V, N_L)$. Assume the neural networks are expressive enough to realize any features. The multi-modal contrastive loss (Eq. 2) attains its optimum when, for all $x_v \in \mathcal{X}_V$ and $x_l \in \mathcal{X}_L$,

$$f^V(x_v)^\top = \frac{1}{\sqrt{P_V(x_v)}}\, U^k_{x_v}\, D R, \qquad (9a)$$
$$f^L(x_l)^\top = \frac{1}{\sqrt{P_L(x_l)}}\, V^k_{x_l}\, \mathrm{diag}(\sigma_1, \dots, \sigma_k)\, D^{-1} R, \qquad (9b)$$

where $U_x$ denotes the $x$-th row of $U$, and $U^k, V^k$ denote the submatrices containing the first $k$ columns of $U, V$, respectively; $D \in \mathbb{R}^{k \times k}$ is an arbitrary invertible diagonal matrix; and $R \in \mathbb{R}^{k \times k}$ is an arbitrary unitary matrix.

Theorem 3.2 shows that the ideal representations of MMCL are largely determined by the $k$ leading singular vectors, up to some affine transformations (a scaling $D$ and a rotation $R$). Although the optimal solution is not unique, when we apply these representations to the linear probing task, the linear classifier can absorb the differences in affine transformations and yields the same classification error for the different variants at the optimum. Built upon these optimal representations, we are ready to establish formal guarantees for the generalization of multi-modal contrastive learning on downstream linear probing tasks in both the visual and language domains.

Theorem 3.3. Given a specific joint data distribution $P_M$, we define the labeling error $\alpha$ as the average label mismatch among the visual-language positive pairs $(x_v, x_l) \sim P_M$, i.e.,

$$\alpha = \mathbb{E}_{x_v, x_l}\, \mathbb{1}\big[y(x_v) \neq y(x_l)\big], \qquad (10)$$

where $y(\cdot)$ returns the ground-truth label of its operand. Denote the empirical estimates of the visual and text encoders from $n$ pretraining examples as $\hat f^V, \hat f^L$, respectively. With probability at least $1 - \delta$, the visual linear probing error $\mathcal{E}(\hat f^V)$ and the text linear probing error $\mathcal{E}(\hat f^L)$ can be upper bounded by

$$\mathcal{E}(\hat f^V),\ \mathcal{E}(\hat f^L) \;\lesssim\; \frac{\alpha}{1 - \sigma_{k+1}^2} \;+\; \underbrace{\frac{c\,k}{\Delta_\sigma}\,\hat{\mathcal{R}}_{n/3}(\mathcal{F}) + \cdots}_{\text{finite-sample generalization terms}}, \qquad (11)$$

where $\lesssim$ omits some constant terms, and $\sigma_{k+1}$ (cf. Theorem 3.2) is the $(k+1)$-th largest singular value of the normalized co-occurrence matrix $\bar P_M$. In the finite-sample generalization terms, $\hat{\mathcal{R}}_{n/3}(\mathcal{F})$ denotes the Rademacher complexity of the model class $\mathcal{F}$ with $n/3$ samples, $k$ is the representation dimension, $\Delta_\sigma = \sigma^2_{3k/4} - \sigma^2_{k}$, and $c \lesssim (k\kappa + 2k\kappa^2 + 1)^2$ with $\kappa$ upper bounding $\|f^V(x)\|$ and $\|f^L(x)\|$.

In the upper bound of Eq. 11, aside from the canonical generalization terms relating to the number of samples and the neural network complexity, there are two important factors reflecting the influence of the multi-modal pretraining task: the labeling error $\alpha$ and the singular value $\sigma_{k+1}$.

Labeling error $\alpha$ accounts for the label mismatch between the constructed visual-language pairs, which may differ in practice depending on how the dataset is constructed.
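Given a sample of image-text pairs with class annotations, $\alpha$ in Eq. 10 can be estimated directly as the fraction of pairs whose labels disagree. A minimal sketch, with hypothetical label arrays used purely for illustration:

```python
import numpy as np

def estimate_labeling_error(y_image, y_text):
    """Empirical estimate of alpha (Eq. 10): the fraction of paired samples
    (x_v, x_l) ~ P_M whose ground-truth class labels disagree."""
    y_image, y_text = np.asarray(y_image), np.asarray(y_text)
    return float(np.mean(y_image != y_text))

# hypothetical labels for five image-caption pairs: one mismatched pair -> alpha = 0.2
print(estimate_labeling_error([0, 1, 2, 2, 3], [0, 1, 2, 1, 3]))
```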
For example, the MS-COCO dataset contains human-provided captions for 120K images using Amazon Mechanical Turk (Lin et al., 2014), while the large-scale YFCC dataset (Thomee et al., 2016) contains 99M Flickr images along with their posted titles as captions without filtering or post-processing, which could be quite noisy. A recent work (Santurkar et al., 2022) empirically finds that a single MS-COCO imagecaption pair is worth five YFCC captions for CLIP training. These findings can be justified by our theory that the written captions in MS-COCO induce a smaller labeling error α. Singular value σk+1 is a spectral property of the cooccurrence matrix PM. One way to understand its role is from a graph perspective. Specifically, we can regard PM as a (partial) adjacency matrix of a bipartite graph2 established between the visual set XV and the language set XL. According to the spectral graph theory (Chung, 1997), the singular values generally represent the connectivity of the bipartite graph (e.g., how many disjoint sub-graphs), and 2For a bipartite graph, only interleaving edges between XV and XL (represented by PM) could contain non-zero weights. So we consider PM for simplicity. smaller leading singular values correspond to better connectivity (e.g., fewer sub-graphs). Therefore, Theorem 3.3 shows that better connectivity (by creating diverse connections between samples) with a smaller σk+1 could bring smaller downstream errors. In fact, several recent works can be understood as increasing the diversity of multi-modal pairs by data augmentations. For example, FLIP (Li et al., 2022a) introduces patch masking to the images input, and Santurkar et al. (2022) rewrite text captions using a GPT model. Our generalization bound provides a theoretical justification for the effectiveness of these approaches. To warp up, our generalization bounds in Theorem 3.3 provide not only guarantees but also principled guidelines for multi-modal contrastive learning: 1) we should create highquality multi-modal pairs by human writing or automatic filtering to reduce the labeling error α, and 2) we should create better multi-modal diversity by data augmentations in both domains to ensure a smaller singular value σk+1. 3.4. Discussion In this section, we establish the first comprehensive study on the theoretical guarantees of multi-modal contrastive learning in terms of two aspects: optimal representations and downstream guarantees. A closely related work is Hao Chen et al. (2021) that establishes theoretical guarantees for selfsupervised contrastive learning. Our analysis extends their theory to the multi-modal setting, with the following key differences: 1) Data generation. Their analysis only applies to positive pairs (x, x+) that are both augmented samples from the same domain X, while the multi-modal pair (xv, xl) are directly given by data samples and are asymmetric ones from different domains XV , XL. Correspondingly, our analysis deals with the multi-modal co-occurrence matrix PM instead of the aggregated augmentation graph A defined over X in Hao Chen et al. (2021) as the approximation target. 2) Learning objective. Their analysis only applies to the uni-model spectral contrastive loss using a Siamese architecture, which corresponds to symmetric matrix factorization. Instead, in multi-modal learning, the positive pairs are not symmetric and require different encoders in general. 
Correspondingly, we propose the multi-modal spectral contrastive loss, which corresponds to asymmetric matrix factorization; this requires different techniques to analyze and yields different optimal representations and downstream generalization bounds.

Figure 1. Illustration of raw and augmented samples generated by SimCLR and CLIP on the CC12M dataset (Changpinyo et al., 2021): panel (a) shows SimCLR positive pairs generated by manual data augmentations, and panel (b) shows CLIP pairs induced by visual-language captions such as "Funny dog dressed as a wizard king", "Two funny cute brown dog characters", and "Funny little dog breed pug dressed red christmas hat".

4. Formal Comparison between Multi-modal and Self-Supervised Contrastive Learning

In Section 3, we established a theoretical framework for analyzing multi-modal contrastive learning (MMCL) from the perspective of asymmetric matrix factorization. Meanwhile, we know that MMCL originates from self-supervised contrastive learning (SSCL) like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020), which relies purely on (usually visual) self-supervision. These two contrastive learning paradigms closely resemble each other in that both adopt InfoNCE-like objectives, while they differ mainly in the choice of positive and negative pairs. Take two representative methods of the two paradigms, CLIP (MMCL) and SimCLR (SSCL), as an example: CLIP adopts visual-language pairs collected from the Internet, while SimCLR generates positive pairs by visual data augmentations like cropping and color jittering. Despite the similarity in learning objectives, CLIP shows much better performance on zero-shot and few-shot transfer learning tasks than SimCLR (Radford et al., 2021), suggesting that different sources of positive pairs have a crucial impact on the downstream performance of contrastive learning. Nevertheless, there is still a lack of theoretical understanding and characterization of this phenomenon.

In this section, we propose a unified theoretical framework to understand the inherent connections between the two paradigms (Section 4.1). Based on this unified perspective, we compare CLIP and SimCLR on real-world data to understand their differences on downstream tasks (Section 4.1). At last, we theoretically analyze the differences from a data generation perspective (Section 4.2).

4.1. Unified Formulation and Analysis for Multi-modal and Self-Supervised Contrastive Learning

We begin with a brief introduction to self-supervised contrastive learning. Instead of using raw images $x_v \in \mathcal{X}_V$ as in multi-modal contrastive learning, self-supervised contrastive learning like SimCLR (Chen et al., 2020) applies an aggressive data augmentation $\mathcal{A}(\cdot\,|\,x_v)$ twice and obtains a pair of augmented samples $x_a, x_a^+ \in \mathcal{X}_A$ as a positive pair to align together. Accordingly, negative samples are defined as augmented samples $x_a^-$ independently drawn from the marginal distribution. The self-supervised spectral contrastive loss (HaoChen et al., 2021) learns a Siamese visual encoder $f^V: \mathcal{X}_A \to \mathbb{R}^k$ with

$$\mathcal{L}^{\mathrm{ss}}_{\mathrm{SCL}}(f^V) = -2\,\mathbb{E}_{x_a, x_a^+}\, f^V(x_a)^\top f^V(x_a^+) + \mathbb{E}_{x_a, x_a^-}\big(f^V(x_a)^\top f^V(x_a^-)\big)^2, \qquad (12)$$

where the joint distribution of positive pairs follows

$$P_A(x_a, x_a^+) = \mathbb{E}_{x_v \sim P_V}\, \mathcal{A}(x_a\,|\,x_v)\, \mathcal{A}(x_a^+\,|\,x_v), \qquad (13)$$

which is marginalized over the augmentations of all natural samples. Different from multi-modal learning, this joint distribution is symmetric, i.e., $P_A(x_a, x_a^+) = P_A(x_a^+, x_a)$.
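The symmetry of Eq. 13 is easy to see in a small numerical example. The sketch below builds the augmentation-induced joint distribution for a toy discrete augmentation distribution $\mathcal{A}(x_a\,|\,x_v)$; the sizes and the uniform prior over raw images are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N_raw, N_aug = 4, 10

# Toy augmentation distribution A(x_a | x_v): each row is a distribution over augmented samples.
A = rng.random((N_raw, N_aug)); A /= A.sum(axis=1, keepdims=True)
P_V = np.full(N_raw, 1.0 / N_raw)      # uniform distribution over raw images

# Eq. 13: P_A(x_a, x_a^+) = E_{x_v ~ P_V} A(x_a | x_v) A(x_a^+ | x_v)
P_A = (A.T * P_V) @ A                  # (N_aug, N_aug)

print(np.allclose(P_A, P_A.T))         # True: the joint distribution is symmetric
print(np.isclose(P_A.sum(), 1.0))      # and it is a valid joint distribution
```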
HaoChen et al. (2021) show that this self-supervised loss is equivalent to a symmetric matrix factorization (SMF) objective. Nevertheless, there is a noticeable difference between the multi-modal and self-supervised objectives: the joint distribution $P_M$ defines connections between the two domains $\mathcal{X}_V, \mathcal{X}_L$, while $P_A$ defines connections only among visual samples in $\mathcal{X}_A$. It thus remains unclear how to compare the quality of multi-modal and self-supervised pairs and characterize their influence on downstream tasks.

A key insight here is that CLIP not only works well for multi-modal tasks like image-text retrieval, but also performs surprisingly well on visual-only tasks like zero-shot image classification, which indicates that it also implicitly aligns semantically similar visual samples during the joint embedding process. The following theorem formalizes this intuition by establishing an equivalence between multi-modal contrastive learning and a corresponding self-supervised contrastive learning objective among visual-only samples.

Theorem 4.1. The optimal visual representations of multi-modal contrastive learning (Eq. 9a) are equivalent (up to scaling and rotation) to those of the following uni-modal contrastive learning objective,

$$\mathcal{L}^{\mathrm{uni}}_{\mathrm{SCL}}(f^V) = -2\,\mathbb{E}_{x_v, x_v^+}\, f^V(x_v)^\top f^V(x_v^+) + \mathbb{E}_{x_v, x_v^-}\big(f^V(x_v)^\top f^V(x_v^-)\big)^2, \qquad (14)$$

where $(x_v, x_v^+)$ are drawn from the text-induced joint distribution over visual samples $P_T$, defined for $x_v, x_v^+ \in \mathcal{X}_V$ as

$$P_T(x_v, x_v^+) = \mathbb{E}_{x_l \sim P_L}\, P_M(x_v\,|\,x_l)\, P_M(x_v^+\,|\,x_l), \qquad (15)$$

with $P_M(x_v\,|\,x_l) = P_M(x_v, x_l) / P_L(x_l)$, and $x_v^-$ is independently drawn from $P_V$. Accordingly, the linear probing error $\mathcal{E}(f^V)$ of multi-modal learning is also equal to that of the self-supervised learning objective in Eq. 14.

Table 1. Comparison (in the uni-modal setting) of the estimated labeling error and intra-class connectivity between CLIP and SimCLR.

                                  CLIP     SimCLR
Labeling Error (↓)                0.601    0.846
Intra-class Connectivity (↑)      1.322    1.072

Theorem 4.1 draws an inherent connection between multi-modal contrastive learning (MMCL) and self-supervised contrastive learning (SSCL) by showing that MMCL also implicitly performs uni-modal contrastive learning among visual samples, just like SSCL. Notably, different from SSCL, which relies on manual data augmentations $\mathcal{A}(x_a\,|\,x_v)$, MMCL's uni-modal objective (Eq. 14) leverages the multi-modal conditional distribution $P_M(x_v\,|\,x_l)$ to generate positive visual pairs with language as a pivot. In other words, the multi-modal signals serve as a new type of data augmentation such that image pairs $(x_v, x_v^+)$ with the same (or similar) text descriptions can serve as positive pairs for uni-modal contrastive learning, as illustrated in Figure 1(b).

This unified perspective enables us to understand the advantage of CLIP over SimCLR for visual representation learning (Radford et al., 2021). Intuitively, compared to SimCLR, which relies on object-agnostic and low-level manual data augmentations, e.g., the color and contrast variations in Figure 1(a), text descriptions contain high-level semantics of images (e.g., "funny", "dog" in Figure 1(b)), and the use of text-induced augmentation in CLIP can bridge semantically similar images more effectively. Thus, CLIP has two main advantages over SimCLR for downstream tasks according to Theorem 3.3. First, CLIP has a lower labeling error, because text-induced positive pairs usually contain the same object, while manual data augmentations often lose the object.
Second, CLIP yields better connectivity among visual samples using high-level semantics. In the following, we provide empirical and theoretical comparisons to characterize the differences between them. Based on the unified theoretical understanding above, we further investigate the differences between the augmentationinduced joint distribution PA (self-supervised, Sim CLR) and the text-induced one PT (multi-modal, CLIP) on realworld data. For a fair comparison, we pretrain the same backbone Vi T-B (Dosovitskiy et al., 2021) on the same dataset, YFCC15M (Thomee et al., 2016; Radford et al., 2021), and evaluate the learned representations on Image Net (Deng et al., 2009a). For efficiency, we randomly draw 1,000 samples from 10 random classes of the Image Net validation set. According to the matrix factorization perspective, the learned features approximate the ground-truth distribution (unknown to us). Thus, we can approximately calculate the (uni-modal) labeling error and sample connectivity using learned representations. For an intuitive measure of the desired sample connectivity, we calculate the average feature similarity between intra-class samples as a surrogate metric. See details in Appendix A.1. From Table 1, we observe that the labeling error of Sim CLR is indeed much larger than that of CLIP (0.846 v.s. 0.601), suggesting that the text-induced (implicit) positive images have higher semantic consistency than manual image transformations. Meanwhile, we also observe that CLIP has high intra-class connectivity than Sim CLR (1.322 v.s. 1.072), suggesting that text descriptions can induce better intra-class sample diversity with the high-level semantic relationship. 4.2. A Data Generation Perspective via the Language of Hierarchical Random Graph As discussed above, the key difference between augmentation and text-induced positive pairs is that they operate on different levels of semantics. This difference can be understood and modeled in a hierarchical structure of data generation. As shown in the examples in Figure 2(a), we can regard that the three images of funny dogs are firstly generated under high-level concepts captured by their text description, and then adding more detailed variations that can be captured by data augmentations. Therefore, the shared text span funny dog can draw these images together, but the commonly used data augmentations cannot because they are very different in pose and style. Inspired by the observation that the joint distribution between positive visual pairs PT (xv, x v) can be regarded as the adjacency matrix of a graph over all image samples (Hao Chen et al., 2021), we model this distribution (graph) with hierarchical random graph (Clauset et al., 2008) designed to model the hidden structure of a given graph. Different from vanilla random graph where each edge is randomly drawn with the same probability, hierarchical random graph assumes that the edges are drawn according to a hierarchical tree, which suits our need to characterize different levels of semantics. In a hierarchical random graph G shown in Figure 2(b), each internal node s is associated with a probability ps, each leaf node is a node in the original graph, and the probability of having an edge between two nodes is the probability contained in their lowest common ancestor node. 
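To make this construction concrete, the sketch below builds the expected adjacency matrix of such a hierarchical random graph with two hidden layers, where each pair of leaves is connected with the probability stored at its lowest common ancestor, and inspects its singular values. The group sizes and probabilities are hypothetical, chosen only to anticipate the two-level model analyzed next.

```python
import numpy as np

def two_level_hrg(n_groups, group_size, p_within, p_across):
    """Expected adjacency of a two-hidden-layer hierarchical random graph:
    n_groups groups of group_size leaves each; a pair of leaves gets the probability
    at its lowest common ancestor (p_within inside a group, p_across across groups)."""
    N = n_groups * group_size
    G = np.full((N, N), p_across)
    for g in range(n_groups):                     # overwrite the within-group blocks
        G[g*group_size:(g+1)*group_size, g*group_size:(g+1)*group_size] = p_within
    return G / G.sum()                            # normalize to a joint distribution

# hypothetical probabilities: G has more cross-group (high-level) connections than G_prime
G       = two_level_hrg(3, 4, p_within=0.06, p_across=0.02)
G_prime = two_level_hrg(3, 4, p_within=0.08, p_across=0.00)

sv, sv_p = (np.linalg.svd(M, compute_uv=False) for M in (G, G_prime))
print(sv[1] <= sv_p[1])   # True: better cross-group connectivity -> smaller non-leading singular values
```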
In our case, we assume two hidden layers for simplicity, with $p_l$ modeling the probability of high-level connections in the first layer and $p_h$ modeling the probability of lower-level connections in the second layer. We assume $p_h > p_l$, as there are fewer high-level interactions between samples.

Figure 2. Illustration of the hierarchical structure on the real-world dataset CC12M (panel (a), with captions such as "Funny dog dressed as a wizard king") and of a hierarchical random graph with two hidden layers (panel (b)). Each internal node is associated with a probability that a pair of vertices in the left and right subtrees of that node are connected.

The following theorem shows that a larger high-level connection probability $p_l$ yields better downstream performance by inducing better graph connectivity (algebraically measured by the singular values $\sigma_t$).

Theorem 4.2. Consider two three-layer hierarchical random graphs $G, G'$ with probabilities $(p_l, p_h)$ and $(p'_l, p'_h)$, respectively. If $p_h - p_l \leq p'_h - p'_l$, we have

$$\sigma_t \leq \sigma'_t, \quad \forall\, t,$$

where $\sigma_t, \sigma'_t$ are the $t$-th largest singular values of $G, G'$, respectively.

According to Theorem 3.3, a smaller singular value indicates better downstream performance under the same labeling error $\alpha$. Therefore, contrastive learning with samples generated according to graph $G$ will have better downstream performance. In other words, Theorem 4.2 shows that a smaller gap $p_h - p_l$ can bring better downstream generalization.³ In practice, we can increase $p_l$ by generating positive samples that share common high-level semantics, as done in CLIP with the text description of the image. Therefore, our hierarchical random graph perspective helps characterize the benefit of CLIP over SimCLR in terms of the kind of information they leverage. This perspective also suggests a way to improve (self-supervised) contrastive learning, namely, to create more diverse positive samples with better augmentation strategies, for example, using realistic generative models like diffusion models (Ho et al., 2020). In the next section, we provide empirical verification of this understanding by showing how multi-modal information can be used to boost augmentation-based SSCL methods like SimCLR.

³We note that the two quantities $p_h, p_l$ are not independent. Since the total probability sums to one, a higher $p_h$ means a lower $p_l$, and vice versa.

5. Boosting SimCLR with Guided Positive Selection

From the theoretical and empirical evidence in Section 4, we have seen that, compared to self-supervision, language is better at generating positive pairs for visual representation learning due to its advantage in capturing high-level similarities. In this section, we further leverage this advantage to improve self-supervised learning. Prior to ours, several papers have explored the combination of self-supervision and multi-modal supervision, such as SLIP (Mu et al., 2022), DeCLIP (Li et al., 2022b), and FLIP (Li et al., 2022a). In contrast to these methods, which all focus on pretraining on multi-modal data, in this work we focus on utilizing the multi-modal information captured by a pretrained CLIP model to improve self-supervised contrastive learning (SimCLR) from unlabeled images alone, which, to our knowledge, has not been considered before. Our experiment is designed as a verification of our analysis above: if the language information is as helpful for uni-modal contrastive learning as we suppose, the CLIP-assisted SimCLR should obtain better performance on downstream tasks.

5.1. Methods

Following our analysis, we consider four strategies for leveraging CLIP to help self-supervised contrastive learning with SimCLR.

Add New Positive & Drop False Positive. Because multi-modal contrastive learning is good at generating more diverse and consistent positive pairs (Figure 1(b)), we leverage the pretrained CLIP to generate a new pair of positive samples for training SimCLR. Specifically, in a mini-batch, we find the nearest neighbor of each sample $x$ in the feature space of CLIP, denoted as $N(x)$, and regard $(x, N(x))$ as a pair of positive samples. We mix this new positive pair with the original self-supervised one using a tunable ratio. On the other hand, because multi-modal pairs have a lower labeling error (Table 1), CLIP can also be leveraged to filter out, at a tunable ratio, false positive pairs that may contain different objects.
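A minimal sketch of the in-batch nearest-neighbor positive selection described above is given below. The tensor shapes, the no-gradient treatment of the frozen CLIP features, and the mixing interface are illustrative assumptions, not the released repository code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def clip_nearest_neighbors(clip_feats):
    """For each sample in the batch, return the index of its nearest in-batch neighbor
    in the (frozen) CLIP feature space, excluding itself."""
    z = F.normalize(clip_feats, dim=1)            # (B, d)
    sim = z @ z.t()
    sim.fill_diagonal_(-float("inf"))             # a sample cannot be its own neighbor
    return sim.argmax(dim=1)                      # (B,)

def add_new_positive_loss(simclr_feats, clip_feats, ratio=1.0):
    """Extra alignment term that pulls each SimCLR feature toward the SimCLR feature of
    its CLIP nearest neighbor, mixed with the original loss via a tunable ratio."""
    nn_idx = clip_nearest_neighbors(clip_feats)
    z = F.normalize(simclr_feats, dim=1)
    new_pos = -(z * z[nn_idx]).sum(dim=1).mean()  # align the (x, N(x)) pairs
    return ratio * new_pos

# usage (hypothetical): total_loss = simclr_loss + add_new_positive_loss(z_simclr, z_clip)
```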
Drop False Negative & Drop Easy Negative. We can also leverage CLIP to select negative samples. One option is to drop the negatives with the largest similarity to the anchor, which could be false negatives from the same class as the positive samples. Another is to drop the negative samples with the smallest similarity, which correspond to easy negatives that are already pushed apart.

For the CLIP model, we adopt the pretrained ViT-B provided by the official implementation. For SimCLR, following the standard protocol, we pretrain a ResNet-50 (He et al., 2016) on ImageNet for 100 epochs. See details in Appendix A.2.

Table 2. The linear probing accuracy of SimCLR and its CLIP-assisted variants on ImageNet (ViT-B, 100-epoch training).

Method                 Linear Acc (%)
Baseline (SimCLR)      61.2
Add New Positive       67.4 (+6.2)
Drop False Positive    61.8 (+0.6)
Drop False Negative    61.4 (+0.2)
Drop Easy Negative     62.3 (+1.1)

5.2. Results

From Table 2, we can see that all four techniques bring benefits over the vanilla SimCLR, suggesting that the multi-modal information in CLIP indeed benefits self-supervised learning in terms of both positive and negative sample selection. Meanwhile, comparing the four strategies, we notice that Add New Positive with CLIP brings the largest improvement, 6.2% accuracy over the vanilla SimCLR. This verifies our previous analysis that multi-modal learning is better at generating diverse and semantically consistent positive samples than self-supervised learning, leading to better downstream performance. We leave more advanced techniques that leverage this observation for future work.

6. Conclusion

In this paper, we proposed the first theoretical framework for multi-modal contrastive learning. By drawing the connection to asymmetric matrix factorization, we characterized its optimal representations and established the first guarantees on the downstream generalization of multi-modal contrastive learning. Based on our framework, we provided a unified perspective on multi-modal and self-supervised contrastive learning, characterized their differences on real-world data, and verified our insights by showing that they bring practical benefits on benchmark datasets. In this way, our theory establishes a principled understanding of multi-modal contrastive learning while delivering practical insights for combining multi-modal and self-supervised learning methods.
Acknowledgement Yisen Wang is partially supported by the National Key R&D Program of China (2022ZD0160304), the National Natural Science Foundation of China (62006153), Open Research Projects of Zhejiang Lab (No. 2022RC0AB05), and Huawei Technologies Inc. Changpinyo, S., Sharma, P., Ding, N., and Soricut, R. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In ICML, 2020. Chung, F. R. Spectral graph theory, volume 92. American Mathematical Soc., 1997. Clauset, A., Moore, C., and Newman, M. E. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98 101, 2008. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391 407, 1990. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009a. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009b. Ding, C., He, X., and Simon, H. D. On the equivalence of nonnegative matrix factorization and spectral clustering. In SDM, 2005. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. Eckart, C. and Young, G. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211 218, 1936. On the Generalization of Multi-modal Contrastive Learning Hao Chen, J. Z., Wei, C., Gaidon, A., and Ma, T. Provable guarantees for self-supervised deep learning with spectral contrastive loss. In Neur IPS, 2021. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016. He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Neur IPS, 2020. Huang, Y., Du, C., Xue, Z., Chen, X., Zhao, H., and Huang, L. What makes multi-modal learning better than single (provably). In Neur IPS, 2021. Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. Visualbert: A simple and performant baseline for vision and language. ar Xiv preprint ar Xiv:1908.03557, 2019. Li, Y., Fan, H., Hu, R., Feichtenhofer, C., and He, K. Scaling language-image pre-training via masking. ar Xiv preprint ar Xiv:2212.00794, 2022a. Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., and Yan, J. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In ICLR, 2022b. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll ar, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In ECCV, 2014. Lu, J., Batra, D., Parikh, D., and Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for visionand-language tasks. In Neur IPS, 2019. Mu, N., Kirillov, A., Wagner, D., and Xie, S. 
Slip: Selfsupervision meets language-image pre-training. In ECCV, 2022. Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In EMNLP, 2014. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, 2021. Santurkar, S., Dubois, Y., Taori, R., Liang, P., and Hashimoto, T. Is a caption worth a thousand images? a controlled study for representation learning. ar Xiv preprint ar Xiv:2207.07635, 2022. Saunshi, N., Plevrakis, O., Arora, S., Khodak, M., and Khandeparkar, H. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019. Saunshi, N., Ash, J., Goel, S., Misra, D., Zhang, C., Arora, S., Kakade, S., and Krishnamurthy, A. Understanding contrastive learning requires incorporating inductive biases. In ICML, 2022. Sun, X., Xu, Y., Cao, P., Kong, Y., Hu, L., Zhang, S., and Wang, Y. Tcgm: An information-theoretic framework for semi-supervised multi-modality learning. In ECCV, 2020. Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L.-J. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64 73, 2016. Tschannen, M., Mustafa, B., and Houlsby, N. Image-andlanguage understanding from pixels only. ar Xiv preprint ar Xiv:2212.08045, 2022. Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020. Wang, Y., Geng, Z., Jiang, F., Li, C., Wang, Y., Yang, J., and Lin, Z. Residual relaxation for multi-view representation learning. In Neur IPS, 2021. Wang, Y., Zhang, Q., Wang, Y., Yang, J., and Lin, Z. Chaos is a ladder: A new theoretical understanding of contrastive learning via augmentation overlap. In ICLR, 2022. Wang, Y., Zhang, Q., Du, T., Yang, J., Lin, Z., and Wang, Y. A message passing perspective on learning dynamics of contrastive learning. In ICLR, 2023. Yao, L., Huang, R., Hou, L., Lu, G., Niu, M., Xu, H., Liang, X., Li, Z., Jiang, X., and Xu, C. Filip: Fine-grained interactive language-image pre-training. In ICLR, 2022. Zhang, Y., Jiang, H., Miura, Y., Manning, C. D., and Langlotz, C. P. Contrastive learning of medical visual representations from paired images and text. ar Xiv preprint ar Xiv:2010.00747, 2020. On the Generalization of Multi-modal Contrastive Learning A. Experimental Details A.1. Details of Empirical Comparison in Section 4.1 Approximation of Data Probability. Similar to the multi-modal spectral loss, Equation (14) can be rewritten as a matrix decomposition loss, i.e., Luni SCL(f V ) = PT FV F V 2 + const, where PT is the co-occurrence matrix of the distribution PT (xv, x v), ( PT )(xv,x v) = PT (xv,x v) PV (xv)PV (x v) and (FV F V )(xv,x v) = f V (xv) f V (x v) PV (xv)PV (x v). So (PT )(xv,x v) can be approximated by f V (xv) f V (x v) when the loss is minimized. Similarly, we can estimate the co-occurrence matrix (PA)(xa,x+ a ) of PA(xa, x+ a ) by f V (xa) f V (x+ a ). In practice, we use Vi T-Base trained by CLIP (Radford et al., 2021) and Sim CLR (Chen et al., 2020) as the encoders. Setup. 
We respectively encode the samples from 1000 samples randomly selected from 10 classes of Image Net (Deng et al., 2009b) with two encoders and construct the embedding matrix ˆFT R1000 k and ˆFA R1000 k (k is the output dimension of Vi T-Base)4. Then we normalize the similarity matrices of the embeddings to and estimate the co-occurrence matrices with them, i.e., ˆPT = normalize( ˆFT ˆF T ), ˆPA = normalize( ˆFA ˆF A ). In the next step, we evaluate the properties of the estimated matrices, e.g., the labeling error, the eigenvalues, etc. Estimation of Labeling Error. When evaluating the labeling error α in Eq. 11, as Image Net is a vision dataset, we have no access to the corresponding text data. So we use a surrogate metric αT , and it is defined as: xv,x v (PT )xv,x v1[y(xv) = y(x v)], (16) and y(xv) denotes the ground-truth label of xv. Note that αT is lower bounded by the ground-truth labeling error α: Proposition A.1. For the surrogate metric αT , we have Proof. Expanding the estimated labeling error and we obtain (xv,x v) PT (xv, x v)1[y(xv) = y(x v)] xv,x v Exl [PM(xv|xl)PM(xv|xl)1[y(xv) = y(x v)]] xv,x v Exl [PM(xv|xl)PM(xv|xl)(1[y(xv) = y(xl)] + 1[y(x v) = y(xl)])] = 2Exl[PM(xv|xl)1[y(xv) = y(xl)]] = 2Exv,xl1[y(xv) = y(xl)] As a result, a large αT implies a large labeling error α. Then we replace PT with ˆPT , and obtain the estimation ˆαT = P xv,x v ( ˆPT )xv,x v1[y(xv) = y(x v)]. Similarly, we define the estimated labeling error of PA as ˆαA = P xv,x+ v ( ˆPA)xv,x+ v 1[y(xv) = y(x+ v )]. Estimation of Intra-class Connectivity. When evaluating the intra-class connectivity, we respectively select 1000 samples from 10 different classes of Image Net. Taking the multi-modal pretraining as an example, following the process we construct 4As the samples of the PA are augmented images, we transform the selected samples with the augmentations used in Sim CLR when constructing ˆFA. On the Generalization of Multi-modal Contrastive Learning ˆPT , we respectively construct ten intra-class feature similarity matrices { ˆP k in}10 k=1. Then we randomly select 1000 samples from the selected samples and construct an inter-class feature similarity matrix ˆPout. We use the average relative value of the intra-class and inter-class feature similarity matrix to represent the intra-class connectivity. To be specific, we denote the intra-class connectivity as β and evaluate it by: ( ˆPre)k i,j = ( ˆPin)k i,j/ mean i,j ( ˆPout), βk = mean i,j ( ˆPre)k i,j, β = mean k (βk). A.2. Details of Verification Experiments in Section 5 We use Sim CLR (Chen et al., 2020) as our baseline and adopt the popular backbone Res Net-50. With the default setting of Sim CLR, we add a projector MLP following the backbone. During the pretraining process of Sim CLR, we train the encoder for 100 epochs on Image Net with 512 batch size and use the LARS optimizer with a cosine annealed learning rate schedule. When estimating the co-occurrence matrix PT , we compute the feature similarity matrix with the well-trained Vi T-B encoder provided by the official repository of CLIP (Radford et al., 2021). For selecting new positive pairs, we set the ratio between the new regularizer and the original loss to 1. When filtering false positive samples, we throw the 10% positive pairs that are most dissimilar in the feature space encoded by the CLIP encoder. And for selecting better negative samples. 
we respectively throw 5% samples that have the largest similarity with the positive samples and 10% samples that have the smallest similarity with the positive samples. After the pretraining process, we train a linear classifier following the frozen backbones and optimize the Cross Entropy loss with the SGD optimizer. B.1. Proof of Theorem 3.1 Proof. Expanding the decomposition object LAMF and we obtain, LAMF(FV , FL) = PM FV F L 2 ( PM)xv,xl (FV )xv(FL) xl 2 PM(xv, xl) p PV (xv)PL(xl) p PV (xv)f V (xv) p PL(xl)f L(xl) PM(xv, xl)2 PV (xv)PL(xl) + PV (xv)PL(xl) f V (xv) f L(x L) 2 2PM(xv, xl)f V (xv) f L(x L) PM(xv, xl)2 PV (xv)PL(xl) 2Exv,xlf V (xv) f L(xl) + Ex v ,x l f V (x v ) f L(x l ) 2 = LSCL(f V , f L) + const. B.2. Proof of Theorem 3.2 Proof. According to Eckart-Young Theorem (Eckart & Young, 1936), the optimal solution F V , F L of the decomposition objective LAMF(FV , FL) = PM FV F L 2 satisfy: F V (F L) = U k diag(σ1, ..., σk)(V k) , where we denote PM = UΣV as the singular value decomposition of PM, (σ1, ..., σk) are the k-largest singular values of PM, the t-th column of U k RNV k contains the corresponding eigenvectors of the t-th largest singular values and On the Generalization of Multi-modal Contrastive Learning V k RNL k is a unitary matrix. Then we respectively represent the optimal solutions F V and F L: F V = U k DR, F L = V k diag(σ1, ..., σk)D 1R, where R Rk k is a unitary matrix and D is an invertible diagonal matrix. With (FV )xv = (f V (xv)) p PV (xv) and (FL)xl = (f L(xl)) p PL(xl), we obtain f V (xv) = 1 p PV (xv) (U k xv DR) , (18) f L(xl) = 1 p PL(xl) (V k xl diag(σ1, . . . , σk)D 1R) . (19) B.3. Proof of Theorem 3.3 We first introduce a lemma in Hao Chen et al. (2021): Lemma B.1 (Theorem 3.8 in Hao Chen et al. (2021)). Denote the labeling error as α = E(xv,xl)1[y(xv) = y(xl)]. Let f V be a minimizer of the Luni SCL(f V ), we obtain E(f V ) 2ϕy σ k+1 + 8α, where σ k+1 is the k-smallest eigenvalue of the Laplacian matrix of PT . Then we give the proof of Theorem 3.3 in the following. Proof. We denote y(x) as the label of data x. Then we define the probability that two image samples related to the same text sample have different labels as xv,x v PT (xv, x v)1[y(xv) = y(x v)]. (20) We note that (xv,x v) PT (xv, x v)1[y(xv) = y(x v)] xv,x v Exl [PM(xv|xl)PM(xv|xl)1[y(xv) = y(x v)]] xv,x v Exl [PM(xv|xl)PM(xv|xl)(1[y(xv) = y(xl)] + 1[y(x v) = y(xl)])] = 2Exl[PM(xv|xl)1[y(xv) = y(xl)]] = 2Exv,xl1[y(xv) = y(xl)] Combined with Lemma B.1, we have E(f V ) e O( α σ k+1 ), where e O( ) is used to hide universal constant factors. We denote the (k + 1)-largest singular values of PM as σk+1. As PT = PM P M and the singular values are positive, the (k + 1)-largest singular values of PT is (σk+1)2, i.e., σ k+1 = 1 (σ2 k+1). Combined with Theorem 4.1 (proofs are provided in the following), for the image encoder f V that minimizes LSCL, we obtain E(f V ) = E(f V ) e O( α 1 σ2 k+1 ). (21) On the Generalization of Multi-modal Contrastive Learning Obviously, the linear probing error of the text encoder f L that minimizes LSCL has the similar results: E(f L) e O( α 1 σ2 k+1 ). (22) Then we consider the empirical loss with finite samples. We construct a multi-modal dataset ˆ X = {(z1 v, z1 l ), ..., (zn v , zn l )} and the n positive pairs are i.i.d sampled from PM(xv, xl). We first sample a permutation π : [n] [n], then we construct the positive pairs and negative pairs as follows: xi v = zπ(3i 2) v , xi l = zπ(3i 2) l , (xi l) = zπ(3i 1) l , (xi v) = zπ(3i) v . 
and the empirical loss is Lemp(f V , f L) = 2 i=1 f V (xi v) f L(xi l) + 1 n/3 i=1 (f V (xi v) f L xi l) 2 + 1 n/3 i=1 (f V ((xi v) ) f L(xi l))2. (23) Considering the expectation of Lemp, we obtain E ˆ X Lemp(f V , f L) = 2 i=1 f V (xi v) f L(xi l) + 1 n/3 i=1 (f V (xi v) f L((xi l) )2 + 1 n/3 i=1 (f V ((xi v) ) f L(xi l))2. = 2Exv,xlf V (xv) f L(xl) + Exv PV (xv),xl PL(xl)(f V (xi v) f L(xj l ))2 = LSCL(f V , f L). So the empirical loss is an unbiased estimator. We denote that Rademacher complexity of F over n data as ˆRn(F) = max{x1,...xn} Eσ j=1 ρjfi(xj) where fi(xj) denotes the i-th dimension of f(xj) and ρ is a uniform random vector in { 1, 1}n. Following Theorem 4.2 in Hao Chen et al. (2021), when E( ˆf V ), E( ˆf V ) are the minimizers of Lemp(f V , f L), we obtain E( ˆf V ),E( ˆf L) α 1 σ2 k+1 + ck b Rn/3(F) + | {z } finite-sample generalization terms where omits some constant terms, σk+1 (c.f. Theorem 3.2) is the (k + 1)-th largest singular value of the normalized co-occurrence matrix PM. In the finite-sample generalization terms, ˆRn/3(F) denotes a Rademacher complexity of the model class F with n/3 samples, k is the representation dimension, σ = σ2 3k/4 σ2 k, and c (kκ + 2kκ2 + 1)2 with κ upper bounding f V (x) and f L(x) . B.4. Proof of Theorem 4.1 We first introduce a lemma that states that multiplying the embedding matrix by an invertible matrix on the right will not influence the linear probing error (Hao Chen et al., 2021): On the Generalization of Multi-modal Contrastive Learning Lemma B.2 (Lemma 3.1 in Hao Chen et al. (2021)). For two learned embedding matrices F, e F, a diagonal matrix D and an invertible matrix Q, if F = D e FQ, they have the equal linear probing error, i.e., E(F) = E( e F). Then we give the proof of Theorem 4.1 in the following. Proof. With Theorem 3.2, the optimal solutions F V , F L of LAMF(FV , FL) = PM FV F L 2 can be respectively represented as: F V = U k DR, F L = V k D2R, where R Rk k is a unitary matrix and D, D2 are diagonal matrices that satisfy D2 = diag(σ1, ..., σk)D 1. Following the proof of theorem 3.1, the uni-modal contrastive loss is also equivalent to a matrix decomposition loss, i.e., Luni SCL(f V ) = PT FV F V 2 + const, where ( PT )(xv,x v) = PT (xv,x v) PV (xv)PV (x v) and (FV )xv = f V (xv) PV (xv). Then we consider the objective Lmf(FV ) = PT FV F V 2. Similar to the asymmetric decomposition objective, the optimal solution can be represented as: (F V ) = U k T DT RT , where U k T RNV k contains k corresponding eigenvectors of k largest singular values of PT , DT Rk k is an invertible diagonal matrix and RT Rk k is a unitary matrix. In the next step, we analyze the relationship between PM and PT . Considering the (xv, x v)-th element of PM P M, we have ( PM P M)xv,x v = X xl ( PM)xv,xl( PM)x v,xl PM(xv, xl)PM(x v, xl) PV (xv)PV (x v) PV (xv)PV (x v) xl PL(xl)PM(xv|xl)PM(x v|xl) (PM(xv, xl) = PM(xv|xl)PL(xl)) = Exl PM(xv|xl)PM(x v|xl) p PV (xv)PV (x v) = ( PT )xv,x v. We know that PT = PM P M, so PT and PM share the same eigenvectors, i.e., U k = U k T . As D, D2, R, DT , RT are invertible matrices and the product of the invertible matrices is still invertible, we obtain F V = (F V ) T, where T = (DT ) 1(RT ) 1DR is an invertible matrix. With Lemma B.2, we obtain E(f V ) = E(f V ), where (F V )xv = f V (xv) , (F V ) xv = f V (xv) . So Theorem 4.1 is proved. On the Generalization of Multi-modal Contrastive Learning B.5. Proof of Theorem 4.2 Proof. 
The co-occurrence matrix of the three-layer hierarchical random graph is: ph ph pl pl pl pl ph ph pl pl pl pl pl pl ph ph pl pl pl pl ph ph pl pl pl pl ph ph pl pl ph ph Then we consider the process of computing the eigenvalues: σ ph ph pl pl pl pl ph σ ph pl pl pl pl pl pl σ ph ph pl pl pl pl ph σ ph pl pl pl pl σ ph ph pl pl ph σ ph We denote that the first layer has sl branches and the second layer has sh branches. Add every column to the first column: σ sh ph (sl 1) sh pl ph pl pl pl pl σ sh ph (sl 1) sh pl σ ph pl pl pl pl σ sh ph (sl 1) sh pl pl σ ph ph pl pl σ sh ph (sl 1) sh pl pl ph σ ph pl pl σ sh ph (sl 1) sh pl pl σ ph ph σ sh ph (sl 1) sh pl pl ph σ ph For the i-row, if i is not divisible by sh, then minus the row by (i|sh sh)-row, and we obtain σ sh ph (sl 1) sh pl ph pl pl pl pl 0 σ 0 0 0 0 σ sh ph (sl 1) sh pl pl σ ph ph pl pl 0 0 σ σ 0 0 σ sh ph (sl 1) sh pl pl σ ph ph 0 0 σ σ For the j-column that satisfies j is divisible by sh and 0 < j (sl 1) (sh), add {j + 1, , j + sh}-columns, and for the j-column that satisfies j is divisible by sh and 0 < j < (sl 1) (sh), minus {j + sh + 1, , j + 2 sh}-columns to On the Generalization of Multi-modal Contrastive Learning the j-column, then we have σ sh ph (sl 1) sh pl ph 0 pl sh pl pl 0 σ 0 0 0 0 σ sh ph (sl 1) sh pl pl σ sh (ph pl) ph sh pl pl 0 0 0 σ 0 0 σ sh ph (sl 1) sh pl pl σ sh ph ph 0 0 0 σ When expanding the determinant, the i-row that satisfies i is not divisible by sh only has one non-zero value σ in i column, so the det is equal to σ sh ph (sl 1) sh pl 0 0 s2 pl σ sh ph (sl 1) sh pl σ s2 (ph pl) 0 s2 pl σ sh ph (sl 1) sh pl σ + s2 (ph pl) 0 s2 pl σ + s2 (ph pl) σ s2 ph The form of the det is easy to expand and we obtain the results: σ(sl 1) sh (σ sh ph (sl 1) sh pl) (σ sh (ph pl))sh 1. (25) So the eigenvalues are σ1 = sh ph + (sl 1) sh pl = 1 sl sh , σ2 = = σsl = sh (ph pl), σsl+1 = = σs1 s2 = 0. where 1 sl sh and 0 are constants. As the matrix is a real symmetric matrix, the eigenvalues are equal to the singular values. And the row sum of the matrix is a constant, so we can obtain the results of Theorem 4.2.
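As a sanity check on the spectrum derived above, the following sketch (with arbitrary small values of $s_l$, $s_h$, $p_h$, $p_l$) compares the numerically computed eigenvalues of the unnormalized block-constant co-occurrence matrix with the closed-form values $\sigma_1 = s_h p_h + (s_l - 1) s_h p_l$, $\sigma_2 = \dots = \sigma_{s_l} = s_h (p_h - p_l)$, and zero for the remaining ones; normalizing the matrix so that the total probability is one would rescale these values as noted in the proof.

```python
import numpy as np

s_l, s_h, p_h, p_l = 3, 4, 0.07, 0.02       # arbitrary sizes and probabilities with p_h > p_l
N = s_l * s_h

# Block-constant co-occurrence matrix: p_h within a second-layer block, p_l across blocks.
M = np.full((N, N), p_l)
for g in range(s_l):
    M[g*s_h:(g+1)*s_h, g*s_h:(g+1)*s_h] = p_h

numerical = np.sort(np.linalg.eigvalsh(M))[::-1]
closed_form = np.concatenate((
    [s_h*p_h + (s_l - 1)*s_h*p_l],           # sigma_1
    np.full(s_l - 1, s_h*(p_h - p_l)),       # sigma_2, ..., sigma_{s_l}
    np.zeros(N - s_l),                        # the remaining values vanish
))
print(np.allclose(numerical, closed_form))    # True
```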