Published as a conference paper at ICLR 2024

OPTIMAL SAMPLE COMPLEXITY OF CONTRASTIVE LEARNING

Noga Alon¹, Dmitrii Avdiukhin², Dor Elboim³, Orr Fischer⁴, Grigory Yaroslavtsev⁵
¹Princeton University, ²Northwestern University, ³Institute for Advanced Study, ⁴Weizmann Institute of Science, ⁵George Mason University
nalon@math.princeton.edu, dmitrii.avdiukhin@northwestern.edu, dorelboim@gmail.com, orr.fischer@weizmann.ac.il, grigory@gmu.edu

Partially supported by NSF grant DMS-2154082. Partially supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 949083).

ABSTRACT

Contrastive learning is a highly successful technique for learning representations of data from labeled tuples, specifying the distance relations within the tuple. We study the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples sufficient for getting high generalization accuracy. We give tight bounds on the sample complexity in a variety of settings, focusing on arbitrary distance functions, general ℓp-distances, and tree metrics. Our main result is an (almost) optimal bound on the sample complexity of learning ℓp-distances for integer p. For any p ≥ 1 we show that Θ(min(nd, n²)) labeled tuples are necessary and sufficient for learning d-dimensional representations of n-point datasets. Our results hold for an arbitrary distribution of the input samples and are based on giving the corresponding bounds on the Vapnik-Chervonenkis/Natarajan dimension of the associated problems. We further show that the theoretical bounds on sample complexity obtained via VC/Natarajan dimension can have strong predictive power for experimental results, in contrast with the folklore belief about a substantial gap between statistical learning theory and the practice of deep learning.

1 INTRODUCTION

Contrastive learning (Gutmann & Hyvärinen, 2010) has recently emerged as a powerful technique for learning representations, see e.g. Smith & Eisner (2005); Mikolov et al. (2013); Dosovitskiy et al. (2014); Schroff et al. (2015a); Wang & Gupta (2015); Wu et al. (2018); Logeswaran & Lee (2018a); Hjelm et al. (2019); He et al. (2020); Tian et al. (2020); Chen et al. (2020); Chen & He (2021); Gao et al. (2021); Chen et al. (2021). In this paper we study the generalization properties of contrastive learning, focusing on its sample complexity. Despite recent interest in theoretical foundations of contrastive learning, most of the previous work approaches this problem from other angles, e.g. focusing on the design of specific loss functions (HaoChen et al., 2021), transfer learning (Saunshi et al., 2019; Chuang et al., 2020), multi-view redundancy (Tosh et al., 2021), inductive biases (Saunshi et al., 2022; HaoChen & Ma, 2023), the role of negative samples (Ash et al., 2022; Awasthi et al., 2022), mutual information (van den Oord et al., 2018; Hjelm et al., 2019; Bachman et al., 2019; Tschannen et al., 2020), etc. (Wang & Isola, 2020; Tsai et al., 2021; Zimmermann et al., 2021; von Kügelgen et al., 2021; Mitrovic et al., 2021; Wen & Li, 2021).

Generalization is one of the central questions in deep learning theory. However, despite substantial efforts, a folklore belief still persists that the gap between classic PAC-learning and the practice of generalization in deep learning is very wide. Here we focus on the following high-level question: can PAC-learning bounds be non-vacuous in the context of deep learning? Direct applications of PAC-learning to explain generalization in neural networks lead to vacuous bounds due to the high expressive power of modern architectures.
Hence, recent attempts towards understanding generalization in deep learning involve stability assumptions (Hardt et al., 2016), PAC-Bayesian bounds (Langford & Caruana, 2001; Dziugaite & Roy, 2017; Neyshabur et al., 2017), capacity bounds (Neyshabur et al., 2019), sharpness (Keskar et al., 2017; Lyu et al., 2022) and kernelization (Arora et al., 2019; Wei et al., 2019), among others (see e.g. Jiang et al. (2020) for an overview). In this paper, we make a step towards closing this gap by showing that the classic PAC-learning bounds have strong predictive power over experimental results. We overcome this gap by changing the assumptions used to analyze generalization, shifting the emphasis from the inputs to the outputs. While it is typical in the literature to assume the presence of latent classes in the input (Saunshi et al., 2019; Ash et al., 2022; Awasthi et al., 2022), we avoid this assumption by allowing arbitrary input distributions. Instead, we shift the focus to understanding the sample complexity of training a deep learning pipeline whose output dimension is d, only assuming that by a suitable choice of architecture, one can find the best mapping of a given set of samples into R^d. Hence, our assumptions about the architecture are very minimal, allowing us to study the sample complexity more directly. This is in contrast with previous work, where a typical approach in the study of generalization (see e.g. Zhang et al. (2017)) is to assume that the n input points are in R^d, while here we only assume the existence of an embedding of these points into a (metric) space such as R^d. This avoids the challenges faced by Zhang et al. (2017) and related work, which shows that a two-layer network with 2n + d parameters can learn any function of the input.

1.1 OUR RESULTS

Consider a simple model for contrastive learning in which labeled samples (x, y⁺, z_1⁻, . . . , z_k⁻) are drawn i.i.d. from an unknown joint distribution D.¹ Here x is referred to as the anchor, y⁺ is a positive example and the z_i⁻ are negative examples, reflecting the fact that the positive example is more similar to the anchor than the negative examples. Before the tuple is labeled, it is presented as an anchor x together with an unordered (k + 1)-tuple (x_1, . . . , x_{k+1}), from which the labeling process then selects a single positive example y⁺ and k negative examples z_1⁻, . . . , z_k⁻. The goal of contrastive learning is to learn a representation of similarity between the domain points, usually in the form of a metric space. For samples from a domain V, the learned representation is a distance function ρ : V × V → R. A highly popular choice is to embed the domain into R^d and use the ℓp-distance ρ_p(x, y) = ‖f(x) − f(y)‖_p for some f : V → R^d, with p = 2 used most frequently. Our central question is:

How many contrastive samples are needed for learning a good distance function?

The number of samples is a primary factor in the computational cost of training. While it is typical for previous work to consider the number of samples required for generalization to be somewhat unimportant and to focus on performance on downstream tasks (in fact, contrastive learning is often referred to as an unsupervised learning method, see e.g. Saunshi et al. (2019)), here we focus specifically on the sample complexity.
This is due to the fact that even if the samples themselves might sometimes be easy to obtain (e.g. when class labels are available and (x, y⁺) are sampled from the same class²), the computational cost is still of major importance and scales linearly with the number of samples. Furthermore, in some settings, sample complexity might itself correspond to the cost of labeling, making the contrastive learning process directly supervised. Consider, for example, a crowdsourcing scenario (see e.g. Zou et al. (2015)), where a triple is shown as (x, y, z) to a labeler who labels it according to whether y or z is more similar to x. In this setting, data collection is of primary importance since the sample complexity corresponds directly to the labeling cost.

In this paper, we state our main results first for the case k = 1 in order to simplify the presentation. Our bounds for general k are obtained by straightforward modifications of the proofs for this case.

¹This is substantially more general and hence includes as special cases typical models used in the literature which make various further assumptions about the structure of D (e.g. Saunshi et al. (2019); Chuang et al. (2020); Awasthi et al. (2022)).
²In this case, class information is still required, making this approach somewhat supervised. Supervision can be fully avoided, e.g. by creating a positive example from the anchor point (e.g. via transformations), although this can affect the quality of the learned metric due to the more limited nature of the resulting distribution.

Table 1: Sample complexity of contrastive learning for a dataset of size n. The notation O_{ϵ,δ} hides the dependence on ϵ and δ. The dimension of the representation is denoted by d; let d̄ = min(n, d).

Setting | Lower bound | Upper bound
ℓp for integer p | Ω_{ϵ,δ}(n·d̄), Thm. 3.1 | Even p: O_{ϵ,δ}(n·d̄) (matching), Thm. 3.2; Odd p: O_{ϵ,δ}(n·min(d log n, n)), Thm. 3.2; Constant d: O_{ϵ,δ}(n) (matching), Thm. 3.2
(1 + α)-separated ℓ2 | Ω_{ϵ,δ}(n/α), Thm. 4.2 | O_{ϵ,δ}(n/α²), Thm. 4.2
Arbitrary distance | Ω_{ϵ,δ}(n²), Thm. 2.7 | O_{ϵ,δ}(n²) (matching), Thm. 2.7
Cosine similarity | Ω_{ϵ,δ}(n·d̄), Thm. D.3 | O_{ϵ,δ}(n·d̄) (matching), Thm. D.3
Tree metric | Ω_{ϵ,δ}(n), Thm. D.2 | O_{ϵ,δ}(n log n), Thm. D.2
Generalizations (in the same settings):
Quadruplet learning | Same, Cor. C.3 | Same, Cor. C.3
k negatives | Same, Thm. 4.1, C.1 | Same, with an extra log(k + 1) factor, Thm. 4.1

Theoretical Results  We address the above central question in the framework of PAC-learning (Valiant, 1984) by giving bounds on the number of samples required for the prediction of subsequent samples. Recall that we use the notation (x, y, z) to refer to unlabeled samples and (x, y⁺, z⁻) for their labeled counterparts. In contrastive learning, a classifier is given access to m labeled training samples {(x_i, y_i⁺, z_i⁻)}_{i=1}^m drawn from D. The goal of the classifier is to correctly classify new samples from D, i.e. for every such sample with the ground truth label hidden and thus presented as (x, y, z), to correctly label it as either (x, y⁺, z⁻) or (x, z⁺, y⁻). For the resulting classifier, we refer to the probability (over D) of incorrectly labeling a new sample as the error rate. Let H be a hypothesis class (in our case, metric spaces, ℓp-distances, etc.). For parameters ϵ, δ > 0, the goal of the training algorithm is to find, with probability at least 1 − δ, a classifier with error rate at most ϵ + ϵ*, where ϵ* is the smallest error rate achieved by any classifier from H (i.e. the best error achieved by a metric embedding, ℓp-embedding, etc.). We refer to the number of samples required to achieve the required error rate as the sample complexity. See Appendix A for a concrete example.
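To make the sampling model and the error-rate definition concrete, the following minimal sketch (not from the paper; the dataset size, dimension, and all function names are illustrative assumptions) draws labeled triplets from a ground-truth ℓ2 embedding and measures the error rate of a candidate hypothesis, i.e. the fraction of fresh labeled samples it mislabels.

```python
import numpy as np

def label_triplet(f, x, y, z, p=2):
    """Return +1 if y is the positive example for anchor x under embedding f, else -1."""
    dy = np.linalg.norm(f[x] - f[y], ord=p)
    dz = np.linalg.norm(f[x] - f[z], ord=p)
    return 1 if dy < dz else -1

def error_rate(f_hat, samples, p=2):
    """Fraction of labeled samples (x, y, z, label) that the hypothesis f_hat mislabels."""
    wrong = sum(label_triplet(f_hat, x, y, z, p) != lbl for x, y, z, lbl in samples)
    return wrong / len(samples)

rng = np.random.default_rng(0)
n, d, m = 50, 4, 2000                      # dataset size, dimension, number of labeled samples
f_true = rng.normal(size=(n, d))           # ground-truth embedding (realizable case)
triples = rng.integers(0, n, size=(m, 3))
triples = triples[(triples[:, 0] != triples[:, 1]) & (triples[:, 0] != triples[:, 2])
                  & (triples[:, 1] != triples[:, 2])]
samples = [(x, y, z, label_triplet(f_true, x, y, z)) for x, y, z in triples]

f_hat = f_true + 0.3 * rng.normal(size=(n, d))   # a perturbed candidate hypothesis
print("error rate of perturbed hypothesis:", error_rate(f_hat, samples))
print("error rate of ground truth:        ", error_rate(f_true, samples))
```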
As is standard in PAC-learning, we state results in two settings: realizable and agnostic. Here realizable refers to the case when there exists a distance function in our chosen class (e.g. the ℓ2-distance) that perfectly fits the data, i.e. ϵ* = 0, while in the agnostic setting such a function might not exist. Let n = |V| be the number of data points in the dataset from which the triples are sampled. We summarize our results in Table 1. To simplify the presentation, for the rest of the section we assume that ϵ, δ are fixed constants (see the full statements of our results for the precise bounds, which show optimal dependence on these parameters).

We first give a simple baseline, which characterizes the number of samples required for contrastive learning.

Theorem 1.1 (Arbitrary distance³ (informal), full statement: Theorem 2.7). The sample complexity of contrastive learning for arbitrary distance functions is Θ(n²), where the lower bound holds even for metric distances.

This basic guarantee is already substantially better than the overall Θ(n³) number of different triples, but can still be a pessimistic estimate of the number of samples required. We thus leverage the fact that in practice ℓp-spaces are a typical choice for representing the distance function between the data points. In particular, the ℓ2-distance (and the related cosine similarity) is most frequently used as a measure of similarity between points after embedding the input data into R^d. Here the dimension d of the embedding space is typically a fixed large constant depending on the choice of the architecture (e.g. d = 512, 1024, 4096 are popular choices). Our main results are much stronger bounds on the sample complexity of learning representations under these ℓp-distances. For any ℓp-distance we get the following result:

³In this paper, the only requirement for arbitrary distances is symmetry, i.e. ρ(x, y) = ρ(y, x).

Theorem 1.2 (ℓp-distance upper bound (informal), full statement: Theorem 3.2). For any constant integer p, the sample complexity of contrastive learning under the ℓp-distance is O(min(nd, n²)) if p is even, and O(min(nd log n, n²)) if p is odd. Furthermore, if d = O(1), then the sample complexity is O(n) for any positive integer p.

We show an Ω(min(nd, n²)) lower bound for any ℓp-distance, essentially matching our upper bounds:

Theorem 1.3 (ℓp-distance lower bound (informal), full statement: Theorem 3.1). For any constant integer p, the sample complexity of contrastive learning under the ℓp-distance is Ω(min(nd, n²)).

For cosine similarity, we show the same result:

Theorem 1.4 (Cosine similarity (informal), full statement: Theorem D.3). The sample complexity of contrastive learning under cosine similarity is Θ(min(nd, n²)).

Using the techniques we develop for the above results, we also immediately get the following result:

Theorem 1.5 (Tree distance (informal), full statement: Theorem D.2). The sample complexity of contrastive learning under tree distance is Ω(n) and O(n log n).
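As a rough numeric illustration of how the Θ(min(nd, n²)) bound of Theorems 1.2 and 1.3 scales, one can tabulate min(nd, n²)/ϵ for representative dataset sizes and embedding dimensions. This is only a sketch: setting all hidden constants and polylog factors to 1 is an assumption made purely for illustration.

```python
# Rough scaling of the Theorem 1.2/1.3 bound min(n*d, n^2)/eps, with all hidden
# constants and polylog factors set to 1 (illustrative only, not from the paper).
def predicted_samples(n, d, eps):
    return min(n * d, n * n) / eps

for n, d in [(50_000, 512), (50_000, 4096), (60_000, 128), (1_000, 512)]:
    print(f"n={n:>6}, d={d:>4}: ~{predicted_samples(n, d, eps=0.1):,.0f} labeled triplets")
```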
We also present multiple variations and extensions to complement the main results above. First, we show that the results above can be extended to the case where the positive example has to be selected among k + 1 options instead of just two, as is typical in the literature. This results in a log k increase in the sample complexity (see Theorem 4.1). Furthermore, our results can be extended to the case when the samples are well-separated, i.e. the distance to the negative example is at least a factor (1 + α) larger than the distance to the positive example. This is a frequent case in applications where the positive example is sampled to be much closer to the anchor, which is often referred to as learning with hard negatives. In this case, we give an O(n/α²) upper bound and an Ω(n/α) lower bound (see Theorem 4.2). Finally, we show in Corollary C.3 that all of our results can be adapted to another popular scenario in contrastive learning, where instead of sampling triples, one samples quadruples ((x_1⁺, x_2⁺), (x_3⁻, x_4⁻)) with the semantics that the pair (x_1⁺, x_2⁺) is more similar than the pair (x_3⁻, x_4⁻). While the lower bounds transfer trivially, it is somewhat surprising that the upper bounds also hold for this case despite the fact that the overall number of different samples is Θ(n⁴).

Outline of the techniques  To describe the main idea behind our techniques, consider the case of ℓ2-metrics in a d-dimensional space. Let f : V → R^d be a function mapping elements to their representations; then the query (x, y, z) is labeled (x, y⁺, z⁻) if ‖f(x) − f(y)‖₂ < ‖f(x) − f(z)‖₂. This is equivalent to Σ_{i=1}^d (f_i(x) − f_i(y))² < Σ_{i=1}^d (f_i(x) − f_i(z))², where f_i denotes the i-th coordinate of the representation. This condition can be written in the form P(f_1(x), . . . , f_d(x), f_1(y), . . . , f_d(y), f_1(z), . . . , f_d(z)) < 0, where P is some polynomial of degree 2. Hence, every input triplet corresponds to a polynomial such that the label of the triplet is decided by the sign of this polynomial. We show that the number of possible satisfiable sign combinations of the polynomials, and hence the largest shattered set of triplets, is bounded.

Experimental results  To verify that our results indeed correctly predict the sample complexity, we perform experiments on several popular image datasets: CIFAR-10/100 and MNIST/Fashion-MNIST. We find the representations for these images using ResNet18 trained from scratch using various contrastive losses. Our experiments show that for a fixed number of samples, the error rate is well approximated by the value predicted by our theory. We present our findings in Appendix F.

1.2 OTHER PREVIOUS WORK

For a survey on metric learning we refer the reader to the monograph by Kulis (2013). A popular approach to metric learning involves learning Mahalanobis metrics; see Verma & Branson (2015) for PAC-like bounds in this setting. Wang & Tan (2018) study the sample complexity of this problem in the presence of label noise. We note that our bounds are more general since Mahalanobis distances only represent a subclass of possible metrics.

Related problems are also studied in the literature on ordinal embeddings, which requires adaptive queries in order to recover the underlying embedding. This is in contrast with our work, where the samples are taken non-adaptively from an unknown distribution. For triplet queries, Arias-Castro (2017) shows that it is possible to recover the embedding from the triplet queries, up to a transformation, and provides convergence rates for quadruplet queries. Ghosh et al. (2019) show that O(d⁸ n log n) adaptive noisy triplet queries suffice for recovering an embedding up to a transformation with O(1) additive divergence.
Terada & Luxburg (2014) show that it is possible to recover the ground-truth points up to a transformation, assuming that these points are sampled from a continuous distribution and that the k-NN graph is given.

2 PRELIMINARIES

In order to state our main results, we first introduce formal definitions of contrastive learning in the realizable and agnostic settings and the associated notions of Vapnik-Chervonenkis and Natarajan dimension. We start by giving the definitions for the simplest triplet case.

Definition 2.1 (Contrastive learning, realizable case). Let H be a hypothesis class and ρ ∈ H be an unknown distance function.⁴ Given access to samples of the form (x, y⁺, z⁻) from a distribution D, meaning ρ(x, y) < ρ(x, z), the goal of contrastive learning is to create a classifier from H which accurately labels subsequent unlabeled inputs. For an error parameter ϵ ∈ (0, 1/2)⁵, the sample complexity of contrastive learning, denoted as S₃(ϵ, δ), is the minimum number of samples required to achieve error rate ϵ with probability at least 1 − δ.

Definition 2.2 (Contrastive learning, agnostic case). Let H be a hypothesis class. Given access to labeled samples of the form (x, y⁺, z⁻) from a distribution D, the goal of agnostic contrastive learning is to create a classifier from H which accurately labels subsequent unlabeled inputs. For an error parameter ϵ > 0, the sample complexity of contrastive learning, denoted as S₃ᵃ(ϵ, δ), is the minimum number of samples required to achieve error rate ϵ + ϵ* with probability at least 1 − δ, where ϵ* is the best error rate that can be achieved by a hypothesis ρ ∈ H.

The above definitions can be naturally generalized to the case when instead of triplets we get inputs of the form (x, x_1, . . . , x_{k+1}) and the objective is to select the x_i minimizing ρ(x, x_i).

Definition 2.3 (Quadruplet learning). If instead of triplets as in the definitions above, the samples are quadruples of the form ((x, y), (z, w)) labeled according to whether the (x, y) pair is closer than the (z, w) pair, we refer to this problem as quadruplet contrastive learning and denote the corresponding sample complexities as S_Q(ϵ, δ) and S_Qᵃ(ϵ, δ).

VC and Natarajan dimension  The main tools used for our sample bounds are the Vapnik-Chervonenkis (VC) dimension for binary classification and the Natarajan dimension for multi-label classification.

Definition 2.4 (VC-dimension (Vapnik & Chervonenkis, 1971)). Let X be the set of inputs. Let H ⊆ 2^X be a set of subsets of X, called the hypothesis space. We say that A ⊆ X is shattered by H if for any B ⊆ A there exists a hypothesis h ∈ H such that h ∩ A = B. Finally, the Vapnik-Chervonenkis (VC) dimension of H is defined as the size of the largest shattered set.

Intuitively, a set A is shattered if all of the 2^|A| possible labelings of A are realizable by H. In our case, the set of inputs X is defined by all possible triplets on V, i.e. X ⊆ V³. The set of distance functions will vary depending on the setting. Below we show results when H contains 1) all possible distance functions⁶, 2) all metrics induced by the ℓp-norms over n points in R^d, and 3) other distance functions, such as tree metrics, cosine similarity, etc.

⁴The inputs are labeled based on ρ. With a slight abuse of notation, we refer to both the distance function and the hypothesis (i.e. a classifier) it defines as ρ.
⁵The 1/2 error rate can be achieved by random guessing, which requires zero samples. Hence, our theory makes predictions only for non-trivial values of ϵ, as in Blumer et al. (1989).
⁶We only consider functions such that all distances are different. This is a reasonable assumption since in practice the probability of encountering exactly the same distances is zero. This assumption simplifies the analysis since we only need to consider the binary classification case.
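The following sketch illustrates Definition 2.4 in the triplet setting: for a small dataset it enumerates all labelings of a candidate query set and tries to realize each one by a randomly searched embedding into R^d. The point set, the two query sets (which mirror the sets S and S′ discussed in Appendix A), and the search budget are illustrative assumptions. Random search can only certify that a labeling is realizable, so a returned "not shattered" verdict is numerical evidence rather than a proof; the transitivity argument in Appendix A gives the actual proof for this example.

```python
import itertools
import numpy as np

def realizes(f, queries, labels, p=2):
    """Does embedding f (dict point -> R^d vector) produce exactly these labels?"""
    for (x, y, z), lbl in zip(queries, labels):
        closer_to_y = np.linalg.norm(f[x] - f[y], ord=p) < np.linalg.norm(f[x] - f[z], ord=p)
        if closer_to_y != lbl:
            return False
    return True

def is_shattered(points, queries, d=1, trials=20000, seed=0):
    """Randomized test of shattering: try to realize every one of the 2^|queries| labelings."""
    rng = np.random.default_rng(seed)
    for labels in itertools.product([True, False], repeat=len(queries)):
        found = False
        for _ in range(trials):
            f = {v: rng.uniform(-1, 1, size=d) for v in points}
            if realizes(f, queries, labels):
                found = True
                break
        if not found:             # random search failed; this labeling is likely unrealizable
            return False, labels
    return True, None

V = ["c", "r", "p", "t"]                                   # the four points of the Appendix A example
S  = [("c", "r", "p"), ("c", "p", "t"), ("c", "r", "t")]   # not shattered (cyclic constraints)
S2 = [("c", "p", "t"), ("r", "p", "t")]                    # shattered
print(is_shattered(V, S))
print(is_shattered(V, S2))
```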
For non-binary (finite) labels, the key property we use to characterize the sample complexity of a problem is the Natarajan dimension, which generalizes the VC dimension (Ben-David et al., 1995):

Definition 2.5 (Natarajan dimension (Natarajan, 1989)). Let X be the set of inputs, Y be the set of labels, and let H ⊆ Y^X be a hypothesis class. Then A ⊆ X is N-shattered by H if there exist f_1, f_2 : X → Y such that f_1(x) ≠ f_2(x) for all x ∈ A, and for every B ⊆ A there exists g ∈ H such that g(x) = f_1(x) for x ∈ B and g(x) = f_2(x) for x ∉ B. Finally, the Natarajan dimension Ndim(H) of H is the maximum size of an N-shattered set.

Lemma 2.6 ((Ben-David et al., 1995), informal). If |Y| is finite, then the sample complexity S(ϵ, δ) of the realizable case satisfies
S(ϵ, δ) = O((Ndim(H) log |Y|)/ϵ · polylog(1/δ))  and  S(ϵ, δ) = Ω(Ndim(H)/ϵ · polylog(1/δ)).
In the agnostic case:
Sᵃ(ϵ, δ) = O((Ndim(H) log |Y|)/ϵ² · polylog(1/δ))  and  Sᵃ(ϵ, δ) = Ω(Ndim(H)/ϵ² · polylog(1/δ)).

The VC dimension is a special case of the Natarajan dimension when |Y| = 2. In Appendix A we give examples to illustrate the notion of shattering. Before presenting our main results, we state the following simple baseline, which we prove in Appendix D.

Theorem 2.7 (Arbitrary distance). For an arbitrary distance function ρ : V × V → R and a dataset of size n, the sample complexity of contrastive learning is S₃(ϵ, δ) = Θ(n²/ϵ · polylog(1/δ)) in the realizable case and S₃ᵃ(ϵ, δ) = Θ(n²/ϵ² · polylog(1/δ)) in the agnostic case. Furthermore, the lower bounds hold even if ρ is assumed to be a metric.

3 CONTRASTIVE LEARNING IN ℓp-NORM

In this section, we show in most cases optimal bounds for the sample complexity of the contrastive learning problem for the ℓp-metric in R^d, for any constant integer p ≥ 1. First, we give the following lower bound, whose proof is deferred to Appendix B.1. Recall that ρ_p(x, y) = ‖f(x) − f(y)‖_p for f : V → R^d.

Theorem 3.1 (Lower bound for ℓp-distances). For any real constant p ∈ (0, +∞), a dataset V of size n, and the ℓp-distance ρ_p : V × V → R in a d-dimensional space, the sample complexity of contrastive learning is S₃(ϵ, δ) = Ω(min(nd, n²)/ϵ · polylog(1/δ)) in the realizable case and S₃ᵃ(ϵ, δ) = Ω(min(nd, n²)/ϵ² · polylog(1/δ)) in the agnostic case.

We then show that for integer p this bound can be closely matched by the following upper bounds:

Theorem 3.2 (Upper bound for ℓp-distances for integer p). For integer p, a dataset V of size n, and the ℓp-distance ρ_p : V × V → R in a d-dimensional space, the sample complexity of contrastive learning is upper-bounded as shown in Table 2.

Table 2: Upper bounds on the sample complexity for ℓp-distances, up to polylog(1/δ) factors.

Setting | Realizable case | Agnostic case
Even p | min(nd, n²)/ϵ | min(nd, n²)/ϵ²
Odd p | min(nd log n, n²)/ϵ | min(nd log n, n²)/ϵ²
Constant d | n/ϵ | n/ϵ²

Proof. For the first two cases, we assume d < n, since otherwise the bounds follow from Theorem 2.7. Our proof uses the following result of Warren (1968) from algebraic geometry:

Fact 3.3 (Warren (1968)). Let m ≥ ℓ ≥ 2 be integers, and let P_1, . . . , P_m be real polynomials in ℓ variables, each of degree at most k. Let U(P_1, . . . , P_m) = {x̄ ∈ R^ℓ : P_i(x̄) ≠ 0 for all i ∈ [m]} be the set of points x̄ ∈ R^ℓ which are non-zero in all polynomials. Then the number of connected components of U(P_1, . . . , P_m) is at most (4ekm/ℓ)^ℓ.
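The way Fact 3.3 is used in the proof below can be previewed numerically: with ℓ = nd variables and degree k = p, once 2^m exceeds (4ekm/ℓ)^ℓ, some labeling of the m queries cannot be realized, so no set of that many queries is shattered. The following sketch (illustrative only; the doubling search and the printed constants are assumptions, not taken from the paper) finds the crossover point.

```python
import math

def log2_warren_bound(m, ell, k):
    """log2 of Warren's bound (4*e*k*m/ell)**ell on the number of connected components."""
    return ell * math.log2(4 * math.e * k * m / ell)

def vc_upper_bound(n, d, p):
    """Smallest m (up to a factor 2) with 2**m > (4*e*p*m/(n*d))**(n*d)."""
    ell = n * d
    m = ell
    while m <= log2_warren_bound(m, ell, p):   # compare exponents: m vs log2 of the bound
        m *= 2
    return m

for n, d, p in [(100, 8, 2), (100, 32, 2), (1000, 8, 2), (1000, 8, 4)]:
    m = vc_upper_bound(n, d, p)
    print(f"n={n:>5}, d={d:>3}, p={p}: shattering impossible once m ~ {m}  (n*d = {n*d})")
```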
In order to bound the sample complexity of the contrastive learning problem, by Lemma 2.6 it suffices to bound its Natarajan dimension, which in the binary case is equivalent to the VC dimension. In particular, we show that for a dataset V of size n, the VC-dimension of contrastive learning for ℓp-distances in dimension d is O(n min(d, n)) for even p ≥ 2 and O(nd log n) for odd p ≥ 1. For this it suffices to show that for every set of queries Q = {(u_i, v_i, w_i)}_{i=1}^m of size m = Ω(n min(d, n)), there exists a labeling of Q that cannot be satisfied by any embedding in a d-dimensional ℓp-space. We split the analysis into the cases of even p and odd p.

Upper bound for even p.  We associate each data point v ∈ V with a set of d variables x_1(v), . . . , x_d(v), corresponding to the coordinates of v after embedding into the d-dimensional space. Let x̄ = (x_1(v_1), . . . , x_d(v_1), . . . , x_1(v_n), . . . , x_d(v_n)). For every constraint (u, v, w), we define the polynomial P_(u,v,w) : R^{nd} → R as
P_(u,v,w)(x̄) = Σ_{j=1}^d (x_j(u) − x_j(v))^p − Σ_{j=1}^d (x_j(u) − x_j(w))^p.
Note that P_(u,v,w)(x̄) < 0 if and only if the constraint (u, v⁺, w⁻) is satisfied, and P_(u,v,w)(x̄) > 0 if and only if the constraint (u, w⁺, v⁻) is satisfied. Finally, we define P_1, . . . , P_m by setting P_i(x̄) = P_(u_i,v_i,w_i)(x̄) for i ∈ [m]. For any labeling Q̄ of the queries Q and for any i ∈ [m], we define s_i(Q̄) = −1 if the i-th query (u_i, v_i, w_i) ∈ Q is labeled as (u_i, v_i⁺, w_i⁻), and s_i(Q̄) = 1 otherwise. We make the following observations, which are proven in Appendix B:

Lemma 3.4. For s ∈ {−1, 1}^m, define C_s = {x̄ ∈ R^{nd} | sign P_i(x̄) = s_i for all i}. Then:
1. For distinct s, s′ ∈ {−1, 1}^m we have C_s ∩ C_{s′} = ∅.
2. Each C_s is either empty or is a union of connected components of U(P_1, . . . , P_m).
3. Let Q̄ be a labeling of Q. Then C_{s(Q̄)} ≠ ∅ if and only if there is a mapping V → R^d satisfying all the distance constraints of Q̄.

We now complete the proof for the case of even p. Defining P_1, . . . , P_m as above, by Fact 3.3 there are at most (4epm/(nd))^{nd} connected components in the set U(P_1, . . . , P_m). Since for two distinct labelings Q̄_1, Q̄_2 of Q it holds that s(Q̄_1) ≠ s(Q̄_2), by Lemma 3.4 either the sets C_{s(Q̄_1)} and C_{s(Q̄_2)} are unions of different connected components, or at least one of them is empty. Moreover, the number of possible labelings of Q is 2^m, which is greater than (4epm/(nd))^{nd} when choosing m ≥ cnd for a sufficiently large constant c. Therefore, for at least one labeling Q̄ it holds that C_{s(Q̄)} = ∅. Since there is no embedding satisfying the distance constraints of Q̄, the claim follows.
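A minimal numeric check of the construction above (illustrative only; the random embedding and the helper names are assumptions, not the paper's code): for even p, the polynomial P_(u,v,w) evaluated at the coordinate vector x̄ equals the difference of the p-th powers of the two ℓp-distances, so its sign determines the label of the triplet.

```python
import numpy as np

def P_uvw(xbar, n, d, p, u, v, w):
    """P_(u,v,w)(xbar) = sum_j (x_j(u)-x_j(v))^p - sum_j (x_j(u)-x_j(w))^p  (even p)."""
    X = xbar.reshape(n, d)                      # row i holds the coordinates x_1(v_i),...,x_d(v_i)
    return np.sum((X[u] - X[v]) ** p) - np.sum((X[u] - X[w]) ** p)

rng = np.random.default_rng(1)
n, d, p = 20, 3, 2                              # p even, so (t)**p == |t|**p coordinatewise
xbar = rng.normal(size=n * d)                   # a point of R^{nd}, i.e. one embedding of V
X = xbar.reshape(n, d)

for _ in range(5):
    u, v, w = rng.choice(n, size=3, replace=False)
    lhs = np.linalg.norm(X[u] - X[v], ord=p) ** p - np.linalg.norm(X[u] - X[w], ord=p) ** p
    poly = P_uvw(xbar, n, d, p, u, v, w)
    # The triplet (u, v, w) is labeled (u, v+, w-) exactly when the polynomial is negative.
    print(f"label (u,v+,w-): {lhs < 0},  sign of P: {np.sign(poly):+.0f},  values match: {np.isclose(lhs, poly)}")
```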
Upper bound for odd p.  In the case of odd p it suffices to show that for a dataset of size n in dimension d, the VC-dimension of contrastive learning for ℓp-distances is O(nd log n). Unlike in the even p case, our distance constraints are comprised of sums of functions of the form |x_j(u) − x_j(v)|^p, and are thus not polynomial constraints. On the other hand, we note that if we have some fixed ordering of the points w.r.t. each coordinate, these constraints become polynomials. To address the issue, we enumerate all possible choices for the ordering of the points w.r.t. each coordinate and show that there is a labeling Q̄ such that no embedding satisfies it, regardless of the ordering.

Similarly to the case of even p, we associate with each v ∈ V a set of d variables x_1(v), . . . , x_d(v). Let x̄ = (x_1(v_1), . . . , x_d(v_1), . . . , x_1(v_n), . . . , x_d(v_n)). For every coordinate i ∈ [d], we fix the ordering of the points w.r.t. this coordinate, i.e. we fix a permutation π^(i) : [n] → [n] such that x_i(v_{π^(i)(1)}) ≤ x_i(v_{π^(i)(2)}) ≤ · · · ≤ x_i(v_{π^(i)(n)}), and bound the number of satisfiable labelings which respect this ordering. We define σ^(i)_{uv} = 1 if x_i(u) ≥ x_i(v) and σ^(i)_{uv} = −1 otherwise, and refer to σ as the order. Then, for any constraint (u, v, w), we define the polynomial P_(u,v,w) : R^{nd} → R as
P_(u,v,w)(x̄) = Σ_{j=1}^d σ^(j)_{uv} (x_j(u) − x_j(v))^p − Σ_{j=1}^d σ^(j)_{uw} (x_j(u) − x_j(w))^p.
Note that P_(u,v,w)(x̄) < 0 if and only if the constraint (u, v⁺, w⁻) is satisfied for the selected order π^(1), . . . , π^(d) (and likewise for P_(u,v,w)(x̄) > 0 and (u, w⁺, v⁻)). Similarly to the case of even p, we define P_1, . . . , P_m by P_i(x̄) = P_(u_i,v_i,w_i)(x̄) for i ∈ [m], and for any labeling Q̄ of the queries Q and all i ∈ [m], we define s_i(Q̄) = −1 if the i-th query (u_i, v_i, w_i) ∈ Q is labeled as (u_i, v_i⁺, w_i⁻), and s_i(Q̄) = 1 otherwise.

As in the case of even p, by Fact 3.3, there are at most (4epm/(nd))^{nd} connected components in the set U(P_1, . . . , P_m). Therefore, there are at most (4epm/(nd))^{nd} choices of labels which are satisfiable in a manner that respects the ordering π^(1), . . . , π^(d). Since there are (n!)^d < n^{nd} possible choices of the permutations π^(1), . . . , π^(d), and hence at most as many orders σ, there are at most (4epm/(nd))^{nd} · n^{nd} possible labelings for which there exists some order such that the set of constraints is satisfiable and respects this order. Taking m = Ω(nd log n), we have 2^m > (4epm/(nd))^{nd} · n^{nd}, so there exists a choice of a labeling Q̄ for which the constraints are not satisfiable for any order, i.e. Q is not shattered.

Upper bound for constant d.  We will prove the following statement: for any constant odd integer p, a dataset V of size n, and the ℓp-distance ρ_p : V × V → R in a d-dimensional space, the VC dimension of contrastive learning is O(nd²). This gives an optimal bound for constant d. Similarly to the proof above, we assume that d² < n. We then prove the sample complexity bound by bounding the VC dimension of the problem by O(nd²), and obtain the sample bounds using Lemma 2.6.

Similarly to the previous cases, we associate with each v ∈ V a set of d variables x_1(v), . . . , x_d(v). Let x̄ = (x_1(v_1), . . . , x_d(v_1), . . . , x_1(v_n), . . . , x_d(v_n)). We associate each constraint with 2^{2d} polynomials in the following manner: for any constraint (u, v, w) and any choice τ ∈ {−1, 1}^{2d}, we define the polynomial P_{τ,u,v,w} : R^{nd} → R as
P_{τ,u,v,w}(x̄) = Σ_{j=1}^d τ(j) (x_j(u) − x_j(v))^p − Σ_{j=1}^d τ(d + j) (x_j(u) − x_j(w))^p,
where τ(j) is the value of the j-th coordinate of τ. We note that the sign pattern of the set of polynomials {P_{τ,u,v,w}(x̄)}_{τ ∈ {−1,1}^{2d}} determines the sign of Σ_{j=1}^d |x_j(u) − x_j(v)|^p − Σ_{j=1}^d |x_j(u) − x_j(w)|^p. Therefore, for different choices of signs for the constraints, the solutions satisfying them must be contained in different connected components of U(P_1, . . . , P_{2^{2d} m}). By Fact 3.3, there are at most (4ep · 2^{2d} · m/(nd))^{nd} connected components in the set U(P_1, . . . , P_{2^{2d} m}). Therefore, taking m ≥ cnd² for a sufficiently large c (depending on p), we get that 2^m > (4ep · 2^{2d} · m/(nd))^{nd}, meaning there is a choice of signs for which there is no solution, i.e. these samples cannot be shattered.
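The ordering trick used above for odd p can also be checked numerically: once the coordinatewise signs σ^(j)_{uv} are read off from an embedding, |t|^p = σ · t^p for odd p, so the sign-corrected polynomial coincides with the true difference of ℓp^p distances. A small sketch, with all names and the random data being illustrative assumptions:

```python
import numpy as np

def lp_p_diff(X, u, v, w, p):
    """True value  sum_j |x_j(u)-x_j(v)|^p - sum_j |x_j(u)-x_j(w)|^p."""
    return np.sum(np.abs(X[u] - X[v]) ** p) - np.sum(np.abs(X[u] - X[w]) ** p)

def poly_with_order(X, u, v, w, p):
    """Polynomial form used for odd p: the signs sigma^{(j)}_{uv} are read off from the order."""
    s_uv = np.where(X[u] >= X[v], 1.0, -1.0)    # sigma^{(j)}_{uv}
    s_uw = np.where(X[u] >= X[w], 1.0, -1.0)    # sigma^{(j)}_{uw}
    return np.sum(s_uv * (X[u] - X[v]) ** p) - np.sum(s_uw * (X[u] - X[w]) ** p)

rng = np.random.default_rng(2)
n, d, p = 15, 4, 3                              # odd p
X = rng.normal(size=(n, d))
for _ in range(5):
    u, v, w = rng.choice(n, size=3, replace=False)
    a, b = lp_p_diff(X, u, v, w, p), poly_with_order(X, u, v, w, p)
    print(f"|.|^p form = {a:+.4f}   sign-corrected polynomial = {b:+.4f}   equal: {np.isclose(a, b)}")
```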
4 EXTENSIONS

Reducing dependence on n  While our results show that the dependence on n is necessary, in practice it can be avoided by making additional assumptions. An extremely popular assumption is the existence of k ≪ n latent classes in the data (see e.g. Saunshi et al. (2019)). In this case, one can consider using an unsupervised clustering algorithm (e.g. a pretrained neural network) to partition the points into k clusters and then apply our results to classes instead of individual points, effectively replacing n with k in all our bounds.

Extension to k negative samples  In this extension, we consider a setting in which each sample is a tuple (x, x_1, . . . , x_{k+1}) ∈ V^{k+2}, and a labeling (x, x_i⁺, x_1⁻, . . . , x_{i−1}⁻, x_{i+1}⁻, . . . , x_{k+1}⁻) is a choice of an example x_i which minimizes min_{j ∈ [k+1]} ρ(x, x_j), i.e. ρ(x, x_i) < ρ(x, x_j) for all j ≠ i. As an immediate corollary of our results, we obtain a bound on the sample complexity for the k-negative setting (in fact, our results for all distance functions considered in the paper extend to this setting; see the proof in Appendix C.1):

Theorem 4.1. For a constant p, a dataset V of size n, and the ℓp-distance ρ_p : V × V → R in a d-dimensional space, the sample complexity of contrastive learning of ℓp-distances with k negatives is bounded as shown in Table 3.

Table 3: Bounds on the sample complexity for ℓp-distances with k negatives, up to polylog(1/δ) factors.

p | Setting | Upper bound | Lower bound
Even | Realizable | min(nd, n²) log(k + 1)/ϵ | min(nd, n²)/ϵ
Even | Agnostic | min(nd, n²) log(k + 1)/ϵ² | min(nd, n²)/ϵ²
Odd | Realizable | min(nd log n, n²) log(k + 1)/ϵ | min(nd, n²)/ϵ
Odd | Agnostic | min(nd log n, n²) log(k + 1)/ϵ² | min(nd, n²)/ϵ²

Well-separated samples  In this extension, we focus on the most popular ℓ2-distance and consider an approximate setting, where the positive and negative samples are guaranteed to be well-separated, i.e. their distances from the anchor differ by at least a multiplicative factor. This scenario is highly motivated in practice since the positive example is typically sampled to be much closer to the anchor than the negative example. We show that in this setting the sample complexity can be improved to be almost independent of d.⁷ For a parameter 0 < α < 1, we assume that our sample distribution D has the following property: if (x, y⁺, z⁻) ∈ supp(D), then ρ(x, z) > (1 + α)ρ(x, y) or ρ(x, y) > (1 + α)ρ(x, z). We call a distribution with this property well-separated. We show that for the ℓ2-distance, the sample complexity of the problem in this setting is between Ω̃_{ϵ,δ}(n/α) and Õ_{ϵ,δ}(n/α²). This result is proven in Appendix E.

Theorem 4.2. The sample complexity of (1 + α)-separated contrastive learning for the ℓ2-distance is S₃(ϵ, δ) = Õ_{ϵ,δ}(n/α²) and S₃(ϵ, δ) = Ω̃_{ϵ,δ}(n/α).

Contrastive learning for other distance functions  In Appendix D, we show near-tight sample complexity bounds for contrastive learning under other distance functions, such as tree metrics, cosine similarity, and class distances, and prove tight general bounds for arbitrary metrics. We also extend our results to quadruplet contrastive learning.

5 CONCLUSION

In this paper we have given a theoretical analysis of contrastive learning, focusing on generalization in the PAC-learning setting. Our results show (in most cases) tight dependence of the sample complexity on the dimension of the underlying representation. It remains open whether our results can be made optimal for ℓp-distances for odd p. It is also open how the results can be improved via a suitable choice of batched adaptively chosen comparisons (see e.g. Zou et al. (2015)).
Note that in the non-batched adaptive query setting optimal bounds on the query complexity are known to be Θ(n log n) (Kannan et al., 1996; Emamjomeh-Zadeh & Kempe, 2018). On the experimental side, while we show asymptotic convergence between our theory and practice, it might be worth looking into further refinements of our approach which can better capture the fact that the constant ratio between the observed and predicted values tends to be small. 7Disregarding polylog factors in d. Published as a conference paper at ICLR 2024 Ery Arias-Castro. Some theory for ordinal embedding. Bernoulli, 23(3):1663 1693, August 2017. ISSN 1350-7265. doi: 10.3150/15-BEJ792. Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 322 332. PMLR, 2019. URL http:// proceedings.mlr.press/v97/arora19a.html. Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Dipendra Misra. Investigating the role of negatives in contrastive representation learning. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera (eds.), International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pp. 7187 7209. PMLR, 2022. URL https://proceedings.mlr.press/v151/ ash22a.html. Pranjal Awasthi, Nishanth Dikkala, and Pritish Kamath. Do more negative samples necessarily hurt in contrastive learning? In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesv ari, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 1101 1116. PMLR, 2022. URL https://proceedings.mlr.press/ v162/awasthi22b.html. Philip Bachman, R. Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alch e-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 15509 15519, 2019. URL https://proceedings.neurips.cc/paper/2019/ hash/ddf354219aac374f1d40b7e760ee5bb7-Abstract.html. Shai Ben David, Nicolo Cesabianchi, David Haussler, and Philip M Long. Characterizations of learnability for classes of (0,..., n)-valued functions. Journal of Computer and System Sciences, 50(1):74 86, 1995. Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929 965, 1989. Arthur Cayley. A theorem on trees. Quart. J. Math., 23:376 378, 1878. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 1597 1607. PMLR, 2020. URL http://proceedings. mlr.press/v119/chen20j.html. Ting Chen, Calvin Luo, and Lala Li. 
Intriguing properties of contrastive losses. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 11834 11845, 2021. URL https://proceedings.neurips.cc/paper/2021/ hash/628f16b29939d1b060af49f66ae0f7f8-Abstract.html. Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 15750 15758. Computer Vision Foundation / IEEE, 2021. doi: 10.1109/CVPR46437.2021.01549. URL https://openaccess. thecvf.com/content/CVPR2021/html/Chen_Exploring_Simple_Siamese_ Representation_Learning_CVPR_2021_paper.html. Published as a conference paper at ICLR 2024 Ching-Yao Chuang, Joshua Robinson, Yen-Chen Lin, Antonio Torralba, and Stefanie Jegelka. Debiased contrastive learning. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/ 2020/hash/63c3ddcc7b23daa1e42dc41f9a44a873-Abstract.html. Alexey Dosovitskiy, Jost Tobias Springenberg, Martin A. Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 766 774, 2014. URL https://proceedings.neurips.cc/paper/2014/hash/ 07563a3fe3bbe7e3ba84431ad9d055af-Abstract.html. Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Gal Elidan, Kristian Kersting, and Alexander Ihler (eds.), Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017. AUAI Press, 2017. URL http://auai.org/uai2017/proceedings/papers/173.pdf. Ehsan Emamjomeh-Zadeh and David Kempe. Adaptive hierarchical clustering using ordinal queries. In Artur Czumaj (ed.), Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pp. 415 429. SIAM, 2018. doi: 10.1137/1.9781611975031.28. URL https://doi.org/10.1137/1. 9781611975031.28. Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 6894 6910. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021. emnlp-main.552. URL https://doi.org/10.18653/v1/2021.emnlp-main.552. Nikhil Ghosh, Yuxin Chen, and Yisong Yue. Landmark Ordinal Embedding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Michael Gutmann and Aapo Hyv arinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. 
In Yee Whye Teh and D. Mike Titterington (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, volume 9 of JMLR Proceedings, pp. 297 304. JMLR.org, 2010. URL http://proceedings.mlr.press/v9/ gutmann10a.html. Jeff Z. Hao Chen and Tengyu Ma. A theoretical study of inductive biases in contrastive learning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open Review.net, 2023. URL https://openreview.net/pdf? id=Au Eg Nl EAmed. Jeff Z. Hao Chen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for selfsupervised deep learning with spectral contrastive loss. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 5000 5011, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ 27debb435021eb68b3965290b5e24c49-Abstract.html. Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 1225 1234. JMLR.org, 2016. URL http://proceedings.mlr.press/v48/hardt16.html. Published as a conference paper at ICLR 2024 Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 9726 9735. Computer Vision Foundation / IEEE, 2020. doi: 10.1109/CVPR42600.2020.00975. URL https://doi.org/10.1109/CVPR42600.2020.00975. R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. URL https: //openreview.net/forum?id=Bklr3j0c KX. Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. URL https://openreview.net/forum?id=SJg IPJBFv H. William B Johnson. Extensions of lipschitz mappings into a hilbert space. Contemp. Math., 26: 189 206, 1984. Sampath Kannan, Eugene L. Lawler, and Tandy J. Warnow. Determining the evolutionary tree using experiments. J. Algorithms, 21(1):26 50, 1996. doi: 10.1006/jagm.1996.0035. URL https: //doi.org/10.1006/jagm.1996.0035. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Open Review.net, 2017. URL https://openreview. net/forum?id=H1oy Rl Ygg. A Krizhevsky. Learning multiple layers of features from tiny images. 
Master s thesis, University of Tront, 2009. Brian Kulis. Metric Learning: A Survey. Foundations and Trends in Machine Learning, 5(4): 287 364, July 2013. ISSN 1935-8237, 1935-8245. doi: 10.1561/2200000019. John Langford and Rich Caruana. (not) bounding the true error. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani (eds.), Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pp. 809 816. MIT Press, 2001. URL https://proceedings.neurips.cc/paper/2001/hash/ 98c7242894844ecd6ec94af67ac8247d-Abstract.html. Guillaume Leclerc, Andrew Ilyas, Logan Engstrom, Sung Min Park, Hadi Salman, and Aleksander Madry. FFCV: Accelerating training by removing data bottlenecks. https://github.com/ libffcv/ffcv/, 2022. Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net, 2018a. URL https://openreview.net/forum?id=r Jv JXZb0W. Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. ar Xiv preprint ar Xiv:1803.02893, 2018b. Kaifeng Lyu, Zhiyuan Li, and Sanjeev Arora. Understanding the generalization benefit of normalization layers: Sharpness reduction. In Neur IPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/ dffd1c523512e557f4e75e8309049213-Abstract-Conference.html. Tom as Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Yoshua Bengio and Yann Le Cun (eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. URL http://arxiv.org/abs/1301.3781. Published as a conference paper at ICLR 2024 Jovana Mitrovic, Brian Mc Williams, Jacob C. Walker, Lars Holger Buesing, and Charles Blundell. Representation learning via invariant causal mechanisms. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https://openreview.net/forum?id=9p2ek P904Rs. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning, Second Edition. MIT Press, December 2018. ISBN 978-0-262-35136-2. B. K. Natarajan. On learning sets and functions. Mach. Learn., 4:67 97, 1989. doi: 10.1007/ BF00114804. URL https://doi.org/10.1007/BF00114804. Behnam Neyshabur, Srinadh Bhojanapalli, David Mc Allester, and Nati Srebro. Exploring generalization in deep learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5947 5956, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/ 10ce03a1ed01077e3e289f3e53c72813-Abstract.html. Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann Le Cun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. URL https://openreview.net/forum?id=Bygfgh Ac YX. 
Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 5628 5637. PMLR, 2019. URL http://proceedings. mlr.press/v97/saunshi19a.html. Nikunj Saunshi, Jordan T. Ash, Surbhi Goel, Dipendra Misra, Cyril Zhang, Sanjeev Arora, Sham M. Kakade, and Akshay Krishnamurthy. Understanding contrastive learning requires incorporating inductive biases. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesv ari, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 1723 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 19250 19286. PMLR, 2022. URL https://proceedings.mlr.press/ v162/saunshi22a.html. Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 815 823. IEEE Computer Society, 2015a. doi: 10.1109/CVPR.2015.7298682. URL https://doi.org/10.1109/CVPR. 2015.7298682. Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815 823, 2015b. Noah A. Smith and Jason Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In Kevin Knight, Hwee Tou Ng, and Kemal Oflazer (eds.), ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25-30 June 2005, University of Michigan, USA, pp. 354 362. The Association for Computer Linguistics, 2005. doi: 10.3115/1219840.1219884. URL https://aclanthology.org/P05-1044/. Yoshikazu Terada and Ulrike Luxburg. Local Ordinal Embedding. In Proceedings of the 31st International Conference on Machine Learning, pp. 847 855. PMLR, June 2014. Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (eds.), Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XI, volume 12356 of Lecture Notes in Computer Science, pp. 776 794. Springer, 2020. doi: 10.1007/ Published as a conference paper at ICLR 2024 978-3-030-58621-8\ 45. URL https://doi.org/10.1007/978-3-030-58621-8_ 45. Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. Contrastive learning, multi-view redundancy, and linear models. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato (eds.), Algorithmic Learning Theory, 16-19 March 2021, Virtual Conference, Worldwide, volume 132 of Proceedings of Machine Learning Research, pp. 1179 1206. PMLR, 2021. URL http: //proceedings.mlr.press/v132/tosh21a.html. Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Selfsupervised learning from a multi-view perspective. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https://openreview.net/forum?id=-bdp_8Itjwp. Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. 
On mutual information maximization for representation learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. URL https://openreview.net/forum?id=rkxoh24FPH. Leslie G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134 1142, 1984. doi: 10. 1145/1968.1972. URL https://doi.org/10.1145/1968.1972. A aron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. Co RR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748. Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 1971. Nakul Verma and Kristin Branson. Sample Complexity of Learning Mahalanobis Distance Metrics. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. Julius von K ugelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Sch olkopf, Michel Besserve, and Francesco Locatello. Self-supervised learning with data augmentations provably isolates content from style. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 16451 16467, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ 8929c70f8d710e412d38da624b21c3c8-Abstract.html. Dong Wang and Xiaoyang Tan. Robust Distance Metric Learning via Bayesian Inference. IEEE Transactions on Image Processing, 27(3):1542 1553, March 2018. ISSN 1941-0042. doi: 10. 1109/TIP.2017.2782366. Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 9929 9939. PMLR, 2020. URL http://proceedings. mlr.press/v119/wang20k.html. Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 2794 2802. IEEE Computer Society, 2015. doi: 10.1109/ICCV.2015.320. URL https://doi.org/10.1109/ICCV.2015.320. Hugh E. Warren. Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133:167 178, 1968. Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets v.s. their induced kernel. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alch e-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. Published as a conference paper at ICLR 2024 9709 9721, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ 8744cf92c88433f8cb04a02e6db69a0d-Abstract.html. Zixin Wen and Yuanzhi Li. Toward understanding the feature learning process of self-supervised contrastive learning. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 11112 11122. 
PMLR, 2021. URL http://proceedings.mlr.press/v139/wen21c.html.

Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 3733-3742. Computer Vision Foundation / IEEE Computer Society, 2018. doi: 10.1109/CVPR.2018.00393. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Wu_Unsupervised_Feature_Learning_CVPR_2018_paper.html.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yann LeCun. The MNIST database of handwritten digits, 1998.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Sy8gdB9xx.

Roland S. Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. CoRR, abs/2102.08850, 2021. URL https://arxiv.org/abs/2102.08850.

James Y. Zou, Kamalika Chaudhuri, and Adam Tauman Kalai. Crowdsourcing feature discovery via adaptively chosen comparisons. In Elizabeth Gerber and Panos Ipeirotis (eds.), Proceedings of the Third AAAI Conference on Human Computation and Crowdsourcing, HCOMP 2015, November 8-11, 2015, San Diego, California, USA, pp. 198. AAAI Press, 2015. URL http://www.aaai.org/ocs/index.php/HCOMP/HCOMP15/paper/view/11585.

A EXAMPLES

Example and notation  We are given 4 data points: images of a cat, a rat, a plane, and a train, which we denote as V = {c, r, p, t}. We consider the realizable case, i.e. there exists a ground-truth metric ρ on these points (e.g. general or ℓp) with error rate ϵ* = 0. For this example, we will assume that we embed these points into R, i.e. d = 1. Hence the hypothesis class H is just the class of all Euclidean metrics on R. For these data points, there are 12 possible input samples, corresponding to 4 choices of an anchor and 3 choices of the two other query points. For this example, let D be the distribution of samples (x, y⁺, z⁻) such that each input (x, y, z) is sampled uniformly, and the label is decided according to which of y and z is closer to x according to ρ. For example, if the input is (c, r, p), we expect the ground truth ρ to be that the cat is closer to the rat than to the plane, which we write as (c, r⁺, p⁻). Hence, this version of the problem is just a binary classification problem.

Error rate  If, for example, the learned embedding maps r → 0, c → 1, p → 3, t → 10, then, assuming the distribution on the inputs is uniform, we achieve error rate ϵ = 1/6, since only 2 of the 12 queries, (p, t⁺, c⁻) and (p, t⁺, r⁻), are not satisfied.

Sampling mechanism and learning scheme  Since we have a binary classification problem, without a training set we can only predict randomly, which gives error rate ϵ = 1/2. Now, suppose that we have a single training example (c, r⁺, p⁻) sampled from D. The algorithm that achieves the best worst-case guarantee is an empirical risk minimizer (ERM), which seeks to find the embedding satisfying the maximum number of constraints. In this case, the constraint is satisfiable, e.g., by the embedding c → 0, r → 1, p → 3, t → 10, since |c − r| = 1 < |c − p| = 3.
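A minimal sketch of the ERM idea just described (illustrative only; random search over 1-D embeddings stands in for whatever training procedure is actually used, and the helper names are assumptions):

```python
import numpy as np

def violations(f, labeled_triplets):
    """Number of labeled constraints (x, y+, z-) violated by the 1-D embedding f."""
    return sum(abs(f[x] - f[y]) >= abs(f[x] - f[z]) for x, y, z in labeled_triplets)

def erm_random_search(points, labeled_triplets, trials=5000, seed=0):
    """Empirical risk minimization by random search over embeddings into R (illustrative)."""
    rng = np.random.default_rng(seed)
    best_f, best_err = None, float("inf")
    for _ in range(trials):
        f = {v: rng.uniform(0, 10) for v in points}
        err = violations(f, labeled_triplets)
        if err < best_err:
            best_f, best_err = f, err
    return best_f, best_err

V = ["c", "r", "p", "t"]
train = [("c", "r", "p")]                      # the single training sample (c, r+, p-)
f, err = erm_random_search(V, train)
print({v: round(x, 2) for v, x in f.items()}, "violated constraints:", err)
```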
by embedding $c \to 0$, $r \to 1$, $p \to 3$, $t \to 10$, since $|c - r| = 1 < |c - p| = 3$.
Example of shattering. As an example, the set $S = \{(c, r, p), (c, p, t), (c, r, t)\}$ is not shattered, since for the labeling $\{(c, r^+, p^-), (c, p^+, t^-), (c, t^+, r^-)\}$ there exists no embedding such that $\rho(c, r) < \rho(c, p)$, $\rho(c, p) < \rho(c, t)$, and $\rho(c, r) > \rho(c, t)$. On the other hand, the set $S' = \{(c, p, t), (r, p, t)\}$ is shattered: a labeling of $S'$ determines, for each of $c, r$, which of $p$ and $t$ is closer. For any labeling of $S'$, we can embed $p, t$ arbitrarily and embed each of $c, r$ closer to either $p$ or $t$ according to the labeling, thus satisfying the constraints imposed by the labeling.

B MISSING PROOFS FROM SECTION 3

Proof of Lemma 3.4.
1. If $\vec{x} \in C_s$ and $\vec{y} \in C_{s'}$ for $s$ and $s'$ which differ in index $j$, we know that $\operatorname{sign} P_j(\vec{x}) \ne \operatorname{sign} P_j(\vec{y})$, and hence $\vec{x} \ne \vec{y}$. Hence, $C_s$ and $C_{s'}$ are disjoint.
2. First note that all $C_s$ are open, since polynomials are continuous functions and preimages of the open sets $(-\infty, 0)$ and $(0, +\infty)$ are open. Moreover, $\bigcup_s C_s = U(P_1, \ldots, P_m)$ and all $C_s$ are disjoint, and hence $\{C_s\}_s$ is a partition of $U(P_1, \ldots, P_m)$. By the definition of a connected set, it's impossible to partition a connected component of $U(P_1, \ldots, P_m)$ into multiple open sets, and hence each $C_s$ must be a union of some connected components of $U(P_1, \ldots, P_m)$.
3. Follows from our choice of polynomials: $P_{(u,v,w)}(\vec{x}) > 0$ if $\vec{x}(u)$ is closer to $\vec{x}(v)$ than to $\vec{x}(w)$, and $P_{(u,v,w)}(\vec{x}) < 0$ otherwise.

B.1 MISSING LOWER BOUND OF SECTION 3

Recall that our goal is to learn a distance function $\rho: V \times V \to \mathbb{R}$ which can be produced by some allocation of points in the $d$-dimensional Euclidean space. In other words, the set of labeled queries $\{(x_i, y_i, z_i)\}_i$ is realizable if there exists a mapping $f: V \to \mathbb{R}^d$ such that $\|f(x_i) - f(y_i)\|_2 < \|f(x_i) - f(z_i)\|_2$.
Theorem B.1. For a dataset of size $n$, the VC dimension of contrastive learning for $\ell_p$-distances for any $p \in (0, +\infty)$ in dimension $d$ is $\Omega(\min(n^2, nd))$.
Proof. We first assume that $d < n$, since otherwise the bound follows from Theorem 2.7. Let $V$ be the set of points of size $n$. We will present a set of queries $S$ of size $\Omega(nd)$ that can be shattered. Recall that this means that for any $T \subseteq S$ there exists a mapping $f: V \to \mathbb{R}^d$ such that: triplets from $T$ are satisfied, i.e. each triplet $(x, y, z) \in T$ is labeled $(x, y^+, z^-)$, meaning $\rho(x, y) < \rho(x, z)$; triplets from $S \setminus T$ are not satisfied, i.e. each triplet $(x, y, z) \in S \setminus T$ is labeled $(x, z^+, y^-)$, meaning $\rho(x, y) > \rho(x, z)$.
First, note that for $d = 1$, any set of $n/3$ disjoint queries gives an $\Omega(n)$ lower bound. For the rest of the proof, we assume that $d > 1$. We partition $V$ arbitrarily into disjoint sets $A$ and $B$ of sizes $n - d$ and $d$ respectively. Intuitively, the points from $A$ always act as anchors, and the points from $B$ never act as anchors. We first describe how to map $B = \{v_1, \ldots, v_d\}$ (their mapping won't depend on the labels of the queries). For each $i \in [d]$, we map $f(v_i) = e_i$, where $e_i$ is the $i$-th standard basis vector. Next, we describe the queries. As defined above, $A = \{x_1, \ldots, x_{n-d}\}$ is the set of anchors, and for each anchor $x_i$ the queries are of the form $(x_i, v_1, v_2), (x_i, v_1, v_3), \ldots, (x_i, v_1, v_d)$. Hence, there exist $(d-1)(n-d)$ queries, as required by the theorem, and it remains to show that this set of queries can be shattered. Clearly, since the arrangement of the points $v_1, \ldots, v_d$ is fixed, the allocation of anchor $x_i$ doesn't affect queries with another anchor $x_j$, and hence it suffices to map every anchor independently of the others.
Let us fix the anchor $x_i$. Let the first coordinate of $f(x_i)$ be $1/2$. For $j \in [2 : d]$, if $(x_i, v_1, v_j)$ is labeled $(x_i, v_1^+, v_j^-)$, then we select the $j$-th coordinate of $f(x_i)$ to be 0, and otherwise we select it to be 1. Note that then the query is satisfied: the summations for $\|f(x_i) - e_1\|_p^p$ and $\|f(x_i) - e_j\|_p^p$ differ only in the first and the $j$-th coordinates. When the $j$-th coordinate is 0, we have $\|f(x_i) - e_1\|_p^p - \|f(x_i) - e_j\|_p^p = (1/2)^p - ((1/2)^p + 1^p) < 0$, and hence $(x_i, v_1^+, v_j^-)$ is satisfied. On the other hand, when the $j$-th coordinate is 1, we have $\|f(x_i) - e_1\|_p^p - \|f(x_i) - e_j\|_p^p = ((1/2)^p + 1^p) - (1/2)^p > 0$, and hence $(x_i, v_j^+, v_1^-)$ is satisfied. Hence, we can construct $f(x_i)$ that satisfies all the queries with anchor $x_i$. Therefore, we can shatter the set of all $(d-1)(n-d)$ queries, finishing the proof.

C VARIATIONS

C.1 CONTRASTIVE LEARNING WITH k NEGATIVES

In the previous sections, we considered queries of the form (anchor, positive, negative). In contrastive learning, it's common to have multiple negative examples. In this section, we show the following result, which can be used to derive bounds for contrastive learning with $k$ negatives.
Theorem C.1. Let $\mathcal{M}$ be the class of allowed metric functions. Let $\mathrm{Ndim}$ be the VC dimension of the contrastive learning problem on $\mathcal{M}$ with 1 negative. Then the Natarajan dimension of the contrastive learning problem on $\mathcal{M}$ with $k$ negatives is at most $\mathrm{Ndim}$. Assume additionally that for any set of points $S$ and any distance function $\rho$ on $S$, for any point $o$ there exists a distance function $\rho'$ on $S \cup \{o\}$ such that $\rho'|_{S \times S} = \rho$ and $\rho'(x, y) < \rho'(x, o)$ for any $x, y \in S$.^8 Then, for a dataset on $n$ points such that $n - k = \Omega(n)$, the Natarajan dimension of contrastive learning on $\mathcal{M}$ with $k$ negatives is $\Omega(\mathrm{Ndim})$.
^8 Intuitively, this condition implies that for any set $S$ of points we can find an outlier, i.e. a point $o$ which is sufficiently far from the existing set of points. This assumption is naturally satisfied for $\ell_p$ and tree distances.
Proof. For the upper bound, assume that for some query $(x, x_1, \ldots, x_{k+1})$, for the Natarajan shattering we select $x_i$ and $x_j$ for some $i$ and $j$, i.e. the possible labels are $(x, x_i^+, x_1^-, \ldots, x_{i-1}^-, x_{i+1}^-, \ldots, x_{k+1}^-)$ and $(x, x_j^+, x_1^-, \ldots, x_{j-1}^-, x_{j+1}^-, \ldots, x_{k+1}^-)$. Then, the first label implies $(x, x_i^+, x_j^-)$, and the second label implies $(x, x_j^+, x_i^-)$. Hence, any Natarajan-shattered set with $k$ negatives corresponds to a VC-shattered set of queries with 1 negative of the same cardinality. Hence, the Natarajan dimension is upper-bounded by the VC dimension of the 1-negative problem.
For the lower bound, we split the set of points $V$ into two sets: outlier points $O = \{o_1, \ldots, o_{k-1}\}$ and the remaining points $S = \{v_1, \ldots, v_{n-k+1}\}$. Let $Q = \{(x_i, y_i, z_i)\}_i$ be the VC-shattered set for 1-negative contrastive learning on $n - k + 1$ points. Then, the Natarajan-shattered set of queries for $k$-negative contrastive learning on $n$ points is $Q' = \{(x_i, y_i, z_i, o_1, \ldots, o_{k-1})\}_i$. For each query, for the Natarajan shattering we select $y_i$ and $z_i$, meaning that the possible labels are $(x_i, y_i^+, z_i^-, o_1^-, \ldots, o_{k-1}^-)$ and $(x_i, z_i^+, y_i^-, o_1^-, \ldots, o_{k-1}^-)$. Since the set $Q$ is shattered, for any choice of labeling of $\{(x_i, y_i, z_i)\}_i$ there exists a distance function $\rho$ satisfying this labeling. It suffices to guarantee that $\rho(x_i, y_i) < \rho(x_i, o_j)$ and $\rho(x_i, z_i) < \rho(x_i, o_j)$ for all $i$ and $j$, and by our assumption, there exists a distance function satisfying this condition.
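As an illustration of the constructions above (a sketch of our own, not code from the paper; the function names are hypothetical), the following Python snippet instantiates the embedding from the proof of Theorem B.1 for a small $n$, $d$, and several values of $p$, verifies every query under the $\ell_p$-distance, and also checks the outlier property assumed in Theorem C.1 by adding a far-away point.

```python
import itertools
import numpy as np

def theorem_b1_embedding(labels, d):
    """Anchor embeddings from the proof of Theorem B.1: the first coordinate is 1/2;
    the j-th coordinate is 0 if v_1 is the positive example in (x_i, v_1, v_j), else 1.
    labels[i][j - 2] is +1 for the label (x_i, v_1^+, v_j^-) and -1 otherwise."""
    labels = np.asarray(labels)
    anchors = np.zeros((len(labels), d))
    anchors[:, 0] = 0.5
    anchors[:, 1:] = (labels == -1).astype(float)
    return anchors

def verify_construction(n, d, p, rng):
    basis = np.eye(d)                                     # f(v_j) = e_j
    labels = rng.choice([+1, -1], size=(n - d, d - 1))    # an arbitrary labeling
    anchors = theorem_b1_embedding(labels, d)
    for i, j in itertools.product(range(n - d), range(1, d)):
        d1 = np.sum(np.abs(anchors[i] - basis[0]) ** p)   # ||f(x_i) - e_1||_p^p
        dj = np.sum(np.abs(anchors[i] - basis[j]) ** p)   # ||f(x_i) - e_j||_p^p
        assert (d1 < dj) == (labels[i, j - 1] == +1)
    # outlier property used in Theorem C.1: a far-away point is farther from every
    # anchor than any of the v_j
    outlier = np.full(d, 100.0)
    for i in range(n - d):
        d_out = np.sum(np.abs(anchors[i] - outlier) ** p)
        assert all(np.sum(np.abs(anchors[i] - basis[j]) ** p) < d_out for j in range(d))

rng = np.random.default_rng(0)
for p in (1, 2, 3):
    verify_construction(n=10, d=4, p=p, rng=rng)
print("sampled labelings are realized by the construction")
```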
Corollary C.2. The same upper bounds on sample complexity for all distance functions considered in the paper (i.e. in Theorem 3.2, Theorem 2.7, Theorem D.2, and Theorem D.4) hold for $k$-negative contrastive learning $S_k(\epsilon, \delta)$ and $S^a_k(\epsilon, \delta)$ up to a $\log k$ factor. The same lower bounds that hold for contrastive learning in the $\ell_p$-norm and tree metrics (i.e. Theorem 3.1 and Theorem D.2) also hold for $k$-negative contrastive learning.

C.2 LEARNING ON QUADRUPLETS

Recall that we are given a set of quadruplets $\{((x^+, y^+), (z^-, w^-))\}$, where each quadruplet $((x^+, y^+), (z^-, w^-))$ imposes the constraint $\rho(x^+, y^+) < \rho(z^-, w^-)$.
Theorem C.3. For the VC dimension of contrastive learning on quadruplets, we have the same bounds as for learning on triplets. Namely, for a dataset of size $n$, the following holds. The VC dimension for an arbitrary metric is $\Theta(n^2)$. The VC dimension for $\ell_p$-distances in dimension $d$ is $\Omega(nd)$ for all $p \in (0, \infty)$. The VC dimension for $\ell_p$-distances in dimension $d$ is $O(nd)$ for even $p$ and $O(nd \log n)$ for odd $p$. The VC dimension for the tree metric is $\Omega(n)$ and $O(n \log n)$.
Proof. First, we note that any lower bound in the triplet case is also a lower bound for the quadruplet case, since a triplet query $(x, y, z)$ is equivalent to the quadruplet query $((x, y), (x, z))$. For the upper bounds, for $\ell_p$ and the tree metric we use the same approach: each constraint $((x, y), (z, w))$ corresponds to a polynomial (potentially after fixing the order of points or the tree structure). Since the number of variables and polynomials doesn't change, we get the same upper bounds. Finally, for an arbitrary metric, similarly to Theorem 2.7, for each set of queries we can construct a graph with vertices from $V \times V$: for each query $((x, y), (z, w))$, we create an undirected edge, and labeling the query is equivalent to orienting the edge. If there is a cycle, it's possible to get a contradiction, and hence we can't have more than $O(|V \times V|) = O(n^2)$ queries.

D CONTRASTIVE LEARNING IN OTHER METRICS

In this section, we discuss related results. First, we show a bound for arbitrary distance functions:
Theorem D.1 (Arbitrary distance). For an arbitrary distance function $\rho: V \times V \to \mathbb{R}$ and a dataset of size $n$, the sample complexity of contrastive learning is $S_3(\epsilon, \delta) = \Theta\left(\frac{n^2}{\epsilon} \operatorname{polylog} \frac{1}{\delta}\right)$ in the realizable case and $S^a_3(\epsilon, \delta) = \Theta\left(\frac{n^2}{\epsilon^2} \operatorname{polylog} \frac{1}{\delta}\right)$ in the agnostic case. Furthermore, the lower bounds hold even if $\rho$ is assumed to be a metric.
Proof. We prove this by showing a $\Theta(n^2)$ bound on the VC dimension of contrastive learning with arbitrary distance functions.
Upper bound. Consider any set of samples $\{(x_i, y_i, z_i)\}_{i=1 \ldots k}$ of size $k \ge n^2$. There exists $x$ such that there are at least $n$ samples which have $x$ as their first element. We denote these samples as $(x, y_{i_1}, z_{i_1}), \ldots, (x, y_{i_n}, z_{i_n})$. Consider a graph that has a vertex corresponding to each element of the dataset. Create an undirected edge in this graph between each of the $n$ pairs of vertices $(y_{i_1}, z_{i_1}), \ldots, (y_{i_n}, z_{i_n})$. Since the number of edges is at least the number of vertices, there must exist a cycle $C$ in this graph. We can index the vertices along this cycle as $v_1, v_2, \ldots, v_t$. Now consider the following labeling of the samples: $\rho(x, v_1) < \rho(x, v_2) < \rho(x, v_3) < \cdots < \rho(x, v_t) < \rho(x, v_1)$. This labeling is inconsistent with any distance function, and hence not all labelings of this sample are possible. Since the same argument applies to any sample of size at least $n^2$, an $n^2$ upper bound on the VC dimension follows.
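The cycle argument in the upper bound above can be made concrete. The following sketch (our own illustration; the helper names are hypothetical) takes triplet queries sharing an anchor, detects the first cycle formed by the pairs $(y_i, z_i)$, and outputs the chain of strict inequalities that no distance function can satisfy.

```python
from collections import defaultdict, deque

def bfs_path(adj, s, t):
    """Return a path from s to t in the undirected graph adj, or None if none exists."""
    if s == t:
        return [s]
    prev, queue, seen = {}, deque([s]), {s}
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                prev[v] = u
                if v == t:
                    path = [t]
                    while path[-1] != s:
                        path.append(prev[path[-1]])
                    return path[::-1]
                queue.append(v)
    return None

def contradictory_labeling(anchor, pairs):
    """pairs: the (y_i, z_i) parts of triplet queries sharing the same anchor.
    Once an added pair closes a cycle, orient the cycle to obtain a chain
    rho(x, v_1) < ... < rho(x, v_t) < rho(x, v_1), impossible for any distance function."""
    adj = defaultdict(list)
    for y, z in pairs:
        cycle = bfs_path(adj, y, z)   # are y and z already connected?
        if cycle is not None:
            return [f"rho({anchor},{a}) < rho({anchor},{b})"
                    for a, b in zip(cycle, cycle[1:] + cycle[:1])]
        adj[y].append(z)
        adj[z].append(y)
    return None   # too few pairs to force a cycle

# with n points and n queries sharing an anchor, a cycle (and hence an
# unrealizable labeling) always exists
print(contradictory_labeling("x", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]))
```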
Lower bound. Let $V = \{v_1, \ldots, v_n\}$. We prove that the set of queries $Q = \bigcup_{i \in [n]} \{(v_i, v_{i+1}, v_{i+2}), (v_i, v_{i+2}, v_{i+3}), \ldots, (v_i, v_{n-1}, v_n)\}$ is shattered. Let $Q'$ be a labeling of $Q$. For every $i \in [n]$, we define a graph $H_i = (\{v_{i+1}, \ldots, v_n\}, E_i)$, where $E_i$ contains a directed edge $(v_j, v_{j+1})$ for each query $(v_i, v_j, v_{j+1})$, oriented towards the negative example according to $Q'$. The graph $H_i$ is acyclic, as it is an orientation of a path. Therefore we can topologically sort $H_i$ and obtain some topological order $p^{(i)}_{i+1}, \ldots, p^{(i)}_n$ on the vertices $v_{i+1}, \ldots, v_n$. Consider a metric where the distance between two data items $v_i, v_j$ is defined as $\rho(v_i, v_j) := n + p^{(i)}_j$ for each $i < j \le n$. We note that this is indeed a metric: the triangle inequalities are satisfied since all distances are in the range $[n, 2n]$. Finally, we note that this distance function satisfies all the queries. Indeed, for $i, j, k$ such that $i < j, k$ and $|k - j| = 1$, if $(v_i, v_j^+, v_k^-) \in Q'$, then the directed edge $(v_j, v_k) \in E_i$, and therefore $\rho(v_i, v_k) > \rho(v_i, v_j)$ since $p^{(i)}_k > p^{(i)}_j$.
A metric $(V, \rho)$ is called a tree metric if there exists a tree $T$ with weighted edges and $n$ leaves, such that for each $v \in V$ there is a unique leaf $l(v)$ associated with it, and such that for each $u, v \in V$, $\rho(u, v)$ is equal to the sum of weights along the unique path between $l(u)$ and $l(v)$ in $T$.
Theorem D.2 (Tree distance). For the tree metric distance function $\rho_T: V \times V \to \mathbb{R}$, corresponding to distances between the nodes of a tree with leaves from $V$, the sample complexity of contrastive learning in the realizable case is $S_3(\epsilon, \delta) = O\left(\frac{n \log n}{\epsilon} \operatorname{polylog} \frac{1}{\delta}\right)$ and $\Omega\left(\frac{n}{\epsilon} \operatorname{polylog} \frac{1}{\delta}\right)$, and in the agnostic case is $S^a_3(\epsilon, \delta) = O\left(\frac{n \log n}{\epsilon^2} \operatorname{polylog} \frac{1}{\delta}\right)$ and $\Omega\left(\frac{n}{\epsilon^2} \operatorname{polylog} \frac{1}{\delta}\right)$.
Proof. We prove this by showing that the VC dimension of contrastive learning for the tree metric is $O(n \log n)$. First, we may assume that the number of vertices in the tree is at most $2n$. Indeed, any induced path in the tree can be contracted to a single edge whose weight is the sum of weights along the path. After this procedure, the tree metric remains the same and the resulting tree has no vertices of degree 2. It follows by a counting argument that the number of vertices in this tree is at most $2n$. The rest of the proof is similar to the odd case of Theorem 3.2: we enumerate over all possible tree structures with at most $2n$ vertices and allocations of the data items onto the tree vertices. By Cayley's formula (Cayley, 1878), there exist at most $2(2n)^{2n-2}$ such tree structures and vertex allocations. For a fixed tree, the distance between every pair of vertices can be expressed as a linear combination of the edge weights. Hence, every constraint becomes a linear inequality on the edge weights.
The rest of the proof is analogous to that of Theorem 3.2. Fix a tree $T$. Let $E_T(u, v)$ be the set of edges on the path between $u$ and $v$ in $T$. We have $k \le 2n - 1$ variables $\vec{x} = (x_1, \ldots, x_k)$, where $x_i$ represents the weight of edge $e_i$. For $m$ queries we define polynomials $P_1(\vec{x}), \ldots, P_m(\vec{x})$, one for each query, such that for query $(u_i, v_i, w_i)$ we define the polynomial $P_i(\vec{x}) = \sum_{e_j \in E_T(u_i, v_i)} x_j - \sum_{e_j \in E_T(u_i, w_i)} x_j$, i.e. $(u_i, v_i^+, w_i^-)$ is satisfied if and only if $P_i(\vec{x}) < 0$. For $m = 3n \log n$ queries, by Theorem 3.3 the number of possible sign combinations is at most $(4em/k)^k < \frac{2^m}{2(2n)^{2n-2}}$. Summing over all possible tree structures, the number of sign combinations is less than $2^m$, and hence a set of $m$ queries can't be shattered.
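The reduction of tree-metric constraints to linear inequalities over edge weights can be checked directly on a small fixed tree. The following sketch (our own illustration, assuming a hard-coded tree; the LP feasibility check via SciPy is not part of the paper's argument) encodes each labeled query $(u, v^+, w^-)$ as $\sum_{e \in E_T(u,v)} x_e - \sum_{e \in E_T(u,w)} x_e < 0$ and tests whether a labeling is realizable.

```python
import numpy as np
from scipy.optimize import linprog

# A fixed tree on 6 vertices; the data items sit at the leaves {0, 1, 2, 3}.
# Edge i has an unknown nonnegative weight x_i.
EDGES = [(0, 4), (1, 4), (2, 5), (3, 5), (4, 5)]

def path_edge_indices(a, b):
    """Indices of the edges on the unique a-b path in the tree (found by DFS)."""
    adj = {}
    for idx, (u, v) in enumerate(EDGES):
        adj.setdefault(u, []).append((v, idx))
        adj.setdefault(v, []).append((u, idx))
    stack, seen = [(a, [])], {a}
    while stack:
        node, used = stack.pop()
        if node == b:
            return used
        for nxt, idx in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, used + [idx]))
    raise ValueError("disconnected tree")

def labeling_is_realizable(labeled_queries):
    """Each labeled query (u, v, w) requires dist(u, v) < dist(u, w), i.e. the linear
    inequality sum_{E(u,v)} x - sum_{E(u,w)} x < 0 over the edge weights. We check
    feasibility with an LP, replacing '< 0' by '<= -1' (valid up to rescaling)."""
    A_ub, b_ub = [], []
    for u, v, w in labeled_queries:
        row = np.zeros(len(EDGES))
        for idx in path_edge_indices(u, v):
            row[idx] += 1.0
        for idx in path_edge_indices(u, w):
            row[idx] -= 1.0
        A_ub.append(row)
        b_ub.append(-1.0)
    res = linprog(c=np.zeros(len(EDGES)), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * len(EDGES))
    return res.success

print(labeling_is_realizable([(0, 1, 2), (2, 3, 0)]))             # True
print(labeling_is_realizable([(0, 1, 2), (0, 2, 3), (0, 3, 1)]))  # False: a cyclic labeling
```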
The cosine similarity function $\cos \angle: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a function which returns, for each pair of points $x, y$, the cosine of their angle, i.e. $\cos \angle(x, y)$.
Theorem D.3 (Cosine similarity). For the cosine similarity function, the sample complexity of contrastive learning is $S_3(\epsilon, \delta) = \Theta\left(\frac{\min(n^2, nd)}{\epsilon} \operatorname{polylog} \frac{1}{\delta}\right)$ in the realizable case and $S^a_3(\epsilon, \delta) = \Theta\left(\frac{\min(n^2, nd)}{\epsilon^2} \operatorname{polylog} \frac{1}{\delta}\right)$ in the agnostic case.
Proof. As usual, we assume that $d < n$, since otherwise the result follows from Theorem 2.7. We prove this by showing that the VC dimension of contrastive learning for cosine similarity in $\mathbb{R}^d$ is $\Theta(nd)$. Recall that $\cos \angle(x, y) = \frac{\langle x, y \rangle}{\|x\|_2 \|y\|_2} = \left\langle \frac{x}{\|x\|_2}, \frac{y}{\|y\|_2} \right\rangle$, and hence w.l.o.g. we can assume that all points are unit vectors. For unit vectors $x, y$ we have $\|x - y\|_2^2 = \|x\|_2^2 + \|y\|_2^2 - 2\langle x, y \rangle = 2 - 2\cos \angle(x, y)$, and hence $\cos \angle(x, y^+) > \cos \angle(x, z^-)$ is equivalent to $\|x - y^+\|_2 < \|x - z^-\|_2$. To summarize, contrastive learning for cosine similarity is equivalent to contrastive learning for the $\ell_2$-distance of points on the unit sphere.
The upper bound of $O(nd)$ on the VC dimension follows directly from the fact that for cosine similarity we only consider unit vectors, and hence the hypothesis space is contained in that for the $\ell_2$-distance (which allows vectors of arbitrary norm).
For the lower bound, we repeat the proof of Theorem B.1, with minor alterations that ensure that all points are embeddable in the unit sphere. Similarly to Theorem B.1, we partition $V$ into two sets: $A = \{x_1, \ldots, x_{n-d}\}$, the set of anchors, and $B = \{v_1, \ldots, v_d\}$. We define the query set $Q$ as follows: for each anchor $x_i$, the set $Q$ contains the queries $(x_i, v_1, v_2), (x_i, v_1, v_3), \ldots, (x_i, v_1, v_d)$. We set $f(v_i) = e_i$, i.e. we embed $v_i$ to the $i$-th standard basis vector $e_i$. For each $x_i$, we define a vector $g(x_i) \in \mathbb{R}^d$ in the following manner. Let the first coordinate of $g(x_i)$ be $1/2$. For $j \in [2 : d]$, if $(x_i, v_1, v_j)$ is labeled $(x_i, v_1^+, v_j^-)$, then we select the $j$-th coordinate of $g(x_i)$ to be 0, and otherwise we select it to be 1. We map $x_i$ to $f(x_i) := g(x_i)/\|g(x_i)\|_2$.
Next, we prove that all queries are satisfied: the summations for $\|f(x_i) - e_1\|_2^2$ and $\|f(x_i) - e_j\|_2^2$ differ only in the first and the $j$-th coordinates. When the $j$-th coordinate is 0, we have $\|f(x_i) - e_1\|_2^2 - \|f(x_i) - e_j\|_2^2 = \left(1 - \frac{1}{2\|g(x_i)\|_2}\right)^2 - \left(\left(\frac{1}{2\|g(x_i)\|_2}\right)^2 + 1^2\right) = -\frac{1}{\|g(x_i)\|_2} < 0$, and hence $(x_i, v_1^+, v_j^-)$ is satisfied. On the other hand, when the $j$-th coordinate is 1, a direct computation over the first and $j$-th coordinates gives $\|f(x_i) - e_1\|_2^2 - \|f(x_i) - e_j\|_2^2 = \frac{1}{\|g(x_i)\|_2} > 0$, and hence $(x_i, v_j^+, v_1^-)$ is satisfied. Hence, we can construct $f(x_i)$ that satisfies all the queries with anchor $x_i$. Therefore, we can shatter the set of all $(d-1)(n-d)$ queries.
Theorem D.4 (Class distance). Consider a domain $V$ partitioned into any number of disjoint classes $C_1, \ldots, C_m$, where $m \ge 2$. For the class indicator function $\rho_C: V \times V \to \{0, 1\}$, defined as $\rho_C(x, y) = 0$ iff there exists $i$ such that $x, y \in C_i$, the sample complexity of contrastive learning is $S_3(\epsilon, \delta) = \Theta\left(\frac{n}{\epsilon} \operatorname{polylog} \frac{1}{\delta}\right)$ in the realizable case and $\Theta\left(\frac{n}{\epsilon^2} \operatorname{polylog} \frac{1}{\delta}\right)$ in the agnostic case.^9
Proof. We prove this by showing a $\Theta(n)$ bound on the VC dimension of contrastive learning with class distances.
Upper bound. Consider any set of samples $\{(x_i, y_i, z_i)\}_{i=1 \ldots k}$ of size $k \ge n$. Fix an arbitrary labeling of this set. Note that in any triple, one pair (e.g. $(x_i, y_i)$) is from the same class and the other (e.g. $(x_i, z_i)$) is from different classes.
Create a graph on $n$ vertices, where each vertex corresponds to an element of the dataset. For each labeled sample, create an edge in this graph between the pair of vertices which are from the same class. Since this graph has at least $n$ edges, there must be a cycle $C$ in this graph, and all the vertices on this cycle must be in the same class according to the labeling. Consider any edge $(x, y)$ on the cycle and change the labeling of the triple corresponding to this edge. This forces $x$ and $y$ to be from different classes and leads to a contradiction, since $x$ and $y$ must be in the same class due to the existence of a path between them.
Lower bound. Partition $V$ into disjoint sets of size 3, and associate a query with each set, i.e. $Q = \{(v_1, v_2, v_3), (v_4, v_5, v_6), \ldots, (v_{n-2}, v_{n-1}, v_n)\}$. For each labeled query $(v_i, v_{i+1}^+, v_{i+2}^-) \in Q$, place $v_i$ and $v_{i+1}$ in $C_1$ and $v_{i+2}$ in $C_2$. For each labeled query $(v_i, v_{i+2}^+, v_{i+1}^-) \in Q$, place $v_i$ and $v_{i+2}$ in $C_1$ and $v_{i+1}$ in $C_2$.
^9 For this distance function we also allow an equality label $(x, y^+, z^+)$, which implies $\rho_C(x, y) = \rho_C(x, z)$.
By Theorem C.3, our bounds extend to the quadruplet contrastive learning setting.
Corollary D.5. The same bounds as in Theorem 3.1, Theorem 3.2, Theorem 2.7, Theorem D.2, and Theorem D.3 hold for quadruplet contrastive learning.

E LEARNING LABELS UNDER A WELL-SEPARATED ASSUMPTION

In this section, we show that under a well-separated setting (i.e. when the two distances in each triplet are guaranteed to be separated by some multiplicative factor), the sample complexity can be improved to be almost independent of $d$ (disregarding polylog factors). More formally, given a parameter $\alpha > 0$, each labeled query $(x, y^+, z^-)$ implies the constraint $(1 + \alpha)\rho(x, y) < \rho(x, z)$.
Lemma E.1. Let $n, d$ be integers and $0 < \alpha < 1$. Let $T(n, d)$ be the VC dimension of the contrastive learning problem and $T(n, d, \alpha)$ be the VC dimension of the $(1 + \alpha)$-separate contrastive learning problem. Then $T(n, d, \alpha) = O\left(\min(n^2, nd, n \log n/\alpha^2)\right)$ and $T(n, d, \alpha) = \Omega\left(\min(n^2, nd, n/\alpha)\right)$.
We start with the lower bound in Lemma E.1. Recall the embedding $f$ given in the proof of Theorem B.1. From this embedding it follows that $T(n, d, c/d) = \Omega(nd)$ for a sufficiently small constant $c > 0$. Indeed, for all $i \le n - d$ and $j \le d$, if $(x_i, v_1^+, v_j^-)$ is satisfied, then $\|f(x_i) - f(v_j)\|_2^2 - \|f(x_i) - f(v_1)\|_2^2 \ge 1 \ge \|f(x_i) - f(v_1)\|_2^2/d$, where the last inequality follows as all the coordinates are bounded by 1. Rearranging, we obtain $\|f(x_i) - f(v_j)\|_2 \ge \sqrt{1 + 1/d}\,\|f(x_i) - f(v_1)\|_2 \ge (1 + c/d)\|f(x_i) - f(v_1)\|_2$. Similarly, we have $\|f(x_i) - f(v_1)\|_2 \ge (1 + c/d)\|f(x_i) - f(v_j)\|_2$ when $(x_i, v_j^+, v_1^-)$ is satisfied. This shows that $T(n, d, c/d) = \Omega(nd)$.
Next, when $\alpha \le c/d$ we have that $T(n, d, \alpha) \ge T(n, d, c/d) = \Omega(nd)$. When $\alpha \ge c/d$ we have that $T(n, d, \alpha) \ge T(n, d_1, c/d_1) = \Omega(n/\alpha)$, where $d_1 := \lfloor c/\alpha \rfloor$.
Next, we prove the upper bound. We first show a dimension reduction argument, which intuitively shows that separate contrastive learning can always be reduced to $O(\log n/\alpha^2)$ dimensions.
Lemma E.2. Let $d_1 = \lceil 1000 \log n/\alpha^2 \rceil$. We have that $T(n, d, \alpha) \le T(n, d_1, \alpha/2)$.
The lemma is a simple application of the Johnson-Lindenstrauss lemma (Johnson, 1984), given below.
Fact E.3 (The Johnson-Lindenstrauss lemma). For any $1 > \beta > 0$, a set $X$ of $n$ points in $\mathbb{R}^d$, and an integer $d_1 \ge 15 \log n/\beta^2$, there is an embedding $f: X \to \mathbb{R}^{d_1}$ such that for all $x, y \in X$ we have $(1 - \beta)\|x - y\|_2 \le \|f(x) - f(y)\|_2 \le (1 + \beta)\|x - y\|_2$.
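As a numerical illustration of Fact E.3 (our own sketch; Fact E.3 only asserts the existence of an embedding, and the random Gaussian projection below is one standard construction), the following code projects $n$ points into $d_1 = \lceil 15 \log n/\beta^2 \rceil$ dimensions and reports the worst relative distortion over all pairs. The proof of Lemma E.2 below applies exactly this guarantee to the triplet constraints with $\beta = \alpha/7$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, beta = 100, 10_000, 0.5
d1 = int(np.ceil(15 * np.log(n) / beta ** 2))     # target dimension from Fact E.3

X = rng.normal(size=(n, d))
G = rng.normal(size=(d, d1)) / np.sqrt(d1)        # random Gaussian projection
Y = X @ G

worst = max(
    abs(np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j]) - 1)
    for i, j in combinations(range(n), 2)
)
print(f"d1 = {d1}, worst relative distortion = {worst:.3f} (allowed: {beta})")
```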
We can now prove Lemma E.2.
Proof of Lemma E.2. Suppose that $T(n, d, \alpha) \ge m$ for some integer $m$. By definition, there exists a set of queries $Q = \{(u_i, v_i, w_i) : i \le m\}$ such that for any labeling of the queries $Q$, there is an embedding $f: V \to \mathbb{R}^d$ such that if the query $(u_i, v_i, w_i)$ is labeled as $(u_i, v_i^+, w_i^-)$, then $\|f(u_i) - f(w_i)\|_2 \ge (1 + \alpha)\|f(u_i) - f(v_i)\|_2$. Next, we use the Johnson-Lindenstrauss lemma in order to embed the points $\{f(v) : v \in V\}$ in a lower-dimensional space without distorting the distances too much. By Fact E.3 with $\beta := \alpha/7$, we have an embedding $g: V \to \mathbb{R}^{d_1}$ such that if the query $(u_i, v_i, w_i)$ is labeled as $(u_i, v_i^+, w_i^-)$, then
$\|g(u_i) - g(w_i)\|_2 \ge (1 - \beta)\|f(u_i) - f(w_i)\|_2 \ge (1 + \alpha)(1 - \beta)\|f(u_i) - f(v_i)\|_2 \ge \frac{(1 + \alpha)(1 - \beta)}{1 + \beta}\|g(u_i) - g(v_i)\|_2 \ge (1 + \alpha/2)\|g(u_i) - g(v_i)\|_2$, (1)
where the last inequality holds for all $0 < \alpha < 1$. This shows that $T(n, d_1, \alpha/2) \ge m$ and completes the proof.
Given this lemma, our claim follows since $T(n, d, \alpha) \le T(n, d_1, \alpha/2) \le T(n, d_1) = O(nd_1) = O(n \log n/\alpha^2)$, where the first inequality holds due to Lemma E.2, the second since for any integers $n', d'$ and value $\alpha' > 0$ we have that $T(n', d', \alpha') \le T(n', d')$ (i.e. the exact version is at least as hard to learn as the separate version, since the hypothesis space of the latter is contained in that of the former), and the third equality holds due to Theorem 3.2. This concludes the proof of Lemma E.1.
Using Lemma 2.6, we obtain our sample complexity bounds.
Theorem E.4. For any $1 > \alpha > 0$, the sample complexity of $(1 + \alpha)$-separate contrastive learning is $S_3(\epsilon, \delta) = O\left(\frac{n \log n}{\alpha^2 \epsilon} \operatorname{polylog} \frac{1}{\delta}\right)$ and $S_3(\epsilon, \delta) = \Omega\left(\frac{n}{\alpha \epsilon} \operatorname{polylog} \frac{1}{\delta}\right)$ for the realizable case, and $S^a_3(\epsilon, \delta) = O\left(\frac{n \log n}{\alpha^2 \epsilon^2} \operatorname{polylog} \frac{1}{\delta}\right)$ and $S^a_3(\epsilon, \delta) = \Omega\left(\frac{n}{\alpha \epsilon^2} \operatorname{polylog} \frac{1}{\delta}\right)$ for the agnostic case.
We note that the case of $\alpha \ge 1$ is a relaxed version of e.g. $\alpha = 0.99$, and therefore we immediately obtain the following near-tight bounds:
Corollary E.5. For any $\alpha \ge 1$, the sample complexity of $(1 + \alpha)$-separate contrastive learning is $S_3(\epsilon, \delta) = O\left(\frac{n \log n}{\epsilon} \operatorname{polylog} \frac{1}{\delta}\right)$ and $S_3(\epsilon, \delta) = \Omega\left(\frac{n}{\epsilon} \operatorname{polylog} \frac{1}{\delta}\right)$ for the realizable case, and $S^a_3(\epsilon, \delta) = O\left(\frac{n \log n}{\epsilon^2} \operatorname{polylog} \frac{1}{\delta}\right)$ and $S^a_3(\epsilon, \delta) = \Omega\left(\frac{n}{\epsilon^2} \operatorname{polylog} \frac{1}{\delta}\right)$ for the agnostic case.
We leave as an interesting open problem whether the upper bound of $O(n \log n)$ can be improved for large enough values of $\alpha$.
Figure 1: Training with triplet loss; (a) Fashion-MNIST dataset, (b) CIFAR-10 dataset. For various embedding dimensions $d$, we show the empirical and the scaled predicted generalization errors. The scaling factor is chosen as $\sqrt{1/320}$ (Mohri et al., 2018). In this scenario, we consider the well-separated case, and hence the predicted error is the same for all $d$. Data points are averaged over 10 runs; error bars show 10% and 90% quantiles.

F EXPERIMENTS

In this section, we empirically verify our theoretical findings on real-world image datasets. We compute image embeddings using the ResNet-18 network, with the last layer replaced by a linear layer whose output dimension matches the target embedding dimension. The neural network is trained from scratch for 100 epochs using a set of $m \in \{10^2, 10^3, 10^4\}$ training samples, and is evaluated on a disjoint test set of $10^4$ triplets from the same distribution. We show that our theory provides a good estimate of the gap between the training and test errors. We consider the standard experimental setup used in the literature, where the positive example is sampled from the same class as the anchor, and negative examples are sampled from a different class, see e.g. Saunshi et al. (2019); Awasthi et al. (2022).
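A minimal sketch of this setup (our own illustration; the paper does not provide code, and the names below are hypothetical): the embedding network is a ResNet-18 with its final layer replaced by a $d$-dimensional linear head, and triplets are sampled so that the positive example shares the anchor's class while the negative does not. A model returned by embedding_model(d) is then trained on such triplets with the loss functions described below.

```python
import numpy as np
import torch.nn as nn
from torchvision.models import resnet18

def embedding_model(d):
    """ResNet-18 trained from scratch, with the final classification layer
    replaced by a linear head mapping to the d-dimensional embedding space."""
    model = resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, d)
    return model

def sample_class_triplet(labels, rng):
    """Class-based sampling: the positive example shares the anchor's class,
    the negative example comes from a different class. Returns indices (x, y+, z-)."""
    labels = np.asarray(labels)
    i = rng.integers(len(labels))
    same = np.flatnonzero(labels == labels[i])
    diff = np.flatnonzero(labels != labels[i])
    return i, rng.choice(same[same != i]), rng.choice(diff)
```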
Note that this setup means that the ground-truth distances are well-separated, and hence we are in the setting of Theorem 4.2 (see the end of this section for experiments in the non-well-separated case). Since the neural network doesn't perfectly fit the data, the setting is agnostic, and hence for $n$ data points and error parameter $\epsilon$, our theory predicts that $m \ge c\frac{n}{\epsilon^2}$ labeled samples are required, where we estimate $c = 320$ based on Mohri et al. (2018, Page 48). Hence, given $m$ samples, our theory predicts that $\epsilon \approx \sqrt{\frac{n}{320 m}}$. We compare this value to the empirical generalization error $\hat{\epsilon}$, defined as the difference between the training and test errors. We consider two settings: when each sample has a single negative example and when it has multiple negative examples.
Single negative example: When every sample has only one negative, we train the model from scratch on the CIFAR-10 (Krizhevsky, 2009) and Fashion-MNIST (Xiao et al., 2017) datasets using the marginal triplet loss (Schroff et al., 2015b) $L_{MT}(x, y^+, z^-) = \max(0, \|x - y^+\|_2 - \|x - z^-\|_2 + 1)$, where $x$ and $y^+$ are points from the same class and $z^-$ is a point from a different class. We present our results in Figure 1. For a small number of samples, there is a discrepancy between the predicted and the empirical generalization errors, which is due to the fact that a random guess trivially achieves 50% accuracy, while our results hold only for non-trivial error rates (see Definition 2.1). First, the results for different embedding dimensions match closely, which shows that the error doesn't depend on the dimension, as predicted by our theory. Note that the scaled predicted and the empirical results are within a constant factor, which is hard to attribute to coincidence given the choice of the scaling factor. Hence, these experiments support the theory by showing that it provides a good predictor of the practical generalization error.
Figure 2: Training with the contrastive loss with $k \in \{1, 2, 4, 8\}$ negatives; (a) MNIST dataset, (b) CIFAR-100 dataset. We show the empirical and the scaled predicted generalization errors. The data points correspond to the average over 10 runs, and the error bars show 10% and 90% quantiles.
Multiple negative examples: For the case of multiple negative samples, we train the model using the contrastive loss (Logeswaran & Lee, 2018b) $L_C(x, y^+, z_1^-, \ldots, z_k^-) = -\log \frac{\exp(x^T y^+)}{\exp(x^T y^+) + \sum_{i=1}^k \exp(x^T z_i^-)}$. We train the model from scratch on the MNIST (Yann, 1998) and CIFAR-100 (Krizhevsky, 2009) datasets, and report $\hat{\epsilon}/\epsilon$ for different numbers of negatives $k \in \{1, 2, 4, 8\}$. Recall that for $k$ negatives we have $m = \Omega\left(\frac{n}{\epsilon^2}\log(k + 1)\right)$, and for our experiments we use the upper bound. The results are shown in Figure 2. As before, the values are within a constant factor for all values of $k$ and numbers of samples $m$. Figure 2(b) shows that the $\log(k + 1)$ factor in the sample complexity is observable in practice (we use it when computing the predicted error, and the curves for different $k$ match closely after this normalization). The lack of convergence in Figure 2(b) is most likely attributable to the fact that the sample size is too small. Similarly to the other plots, we expect convergence for larger sample sizes, which were computationally prohibitive to include in this version.
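For concreteness, the two losses above can be implemented as follows (a hedged sketch; the batching conventions and tensor shapes are our own choices, not taken from the paper).

```python
import torch
import torch.nn.functional as F

def margin_triplet_loss(x, y_pos, z_neg, margin=1.0):
    """L_MT(x, y+, z-) = max(0, ||x - y+||_2 - ||x - z-||_2 + 1), averaged over a batch."""
    d_pos = torch.norm(x - y_pos, dim=-1)
    d_neg = torch.norm(x - z_neg, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

def contrastive_loss(x, y_pos, z_negs):
    """L_C(x, y+, z_1-, ..., z_k-) = -log(exp(x^T y+) / (exp(x^T y+) + sum_i exp(x^T z_i-))).
    x, y_pos: (batch, dim); z_negs: (batch, k, dim)."""
    pos = (x * y_pos).sum(dim=-1, keepdim=True)       # (batch, 1)
    neg = torch.einsum('bd,bkd->bk', x, z_negs)       # (batch, k)
    logits = torch.cat([pos, neg], dim=1)             # the positive is class 0
    targets = torch.zeros(x.size(0), dtype=torch.long, device=x.device)
    return F.cross_entropy(logits, targets)
```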
Non-well-separated case: Finally, we conduct experiments for the case when the positive example is not necessarily sampled from the same class as the anchor. While the experiments above correspond to one of the most conventional contrastive learning approaches, the experiments in this section more closely match our theoretical settings. In particular, we don't use any data augmentation for training, and we use the same set of points (but not the same set of triplets) for both training and testing. Since the VC dimension depends on the number of points (Theorems 3.1 and 3.2), to verify those results we subsample $n \in \{10^2, 10^3, 10^4\}$ points. In contrast with the previous experiments, where the positive example is sampled from the same class as the anchor, in this section all elements of a triplet are sampled uniformly from the chosen $n$ points. We determine the negative and positive examples based on the distance between ground-truth embeddings computed using a pretrained ResNet-18 network. We compare the empirical generalization error $\hat{\epsilon}$ to the theoretically predicted error rate $\epsilon$ (measured relative to the best classifier). We perform experiments on the training set of CIFAR-10 and the validation set of ImageNet by training ResNet-18 from scratch on $m \in \{2, 10, 10^2, 10^3, 10^4, 10^5\}$ randomly sampled triplets, and evaluating the model on $10^4$ triplets sampled from the same distribution.^10 In Figure 3, the predicted and the empirical errors show the same tendencies and the same relative behavior between different choices of $n$. As before, the empirical and the scaled predicted errors are within a factor of 2 of each other.
^10 We express our thanks to the FFCV library (Leclerc et al., 2022), which allowed us to significantly speed up the execution.
Figure 3: Training with the contrastive loss with one negative and embedding dimension 512; (a) ImageNet dataset, (b) CIFAR-10 dataset. For each $n \in \{10^2, 10^3, 10^4\}$, we subsample $n$ points from the dataset and train on $m \in \{10^2, 10^3, 10^4, 10^5\}$ input triplets. For various $n$, we show the empirical and the scaled predicted generalization errors. Note that the theoretical predicted error is bounded by 50%, which is achieved by a random guess. Data points are averaged over 10 runs; error bars show 10% and 90% quantiles.
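For completeness, the scaled predicted error used for comparison in Figures 1-3 can be computed as follows (our own sketch of the formula stated above; the cap at 0.5 reflects that a random guess achieves 50% error).

```python
import numpy as np

def predicted_error(n, m, c=320):
    """Predicted generalization error sqrt(n / (c * m)) with c = 320 (estimated from
    Mohri et al. (2018)), capped at 0.5, the error rate of a random guess.
    For k negatives, the text additionally includes the log(k + 1) factor from the
    sample complexity bound when computing the prediction."""
    return min(0.5, float(np.sqrt(n / (c * m))))

for m in (10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5):
    print(m, round(predicted_error(n=10 ** 4, m=m), 3))
```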