# Towards Efficient Contrastive PAC Learning

Published in Transactions on Machine Learning Research (06/2025)

Jie Shen (jieshen.sjtu@gmail.com), Department of Computer Science, Stevens Institute of Technology

Reviewed on OpenReview: https://openreview.net/forum?id=dBJo9hyKVg

We study contrastive learning under the PAC learning framework. While a series of recent works have shown statistical results for learning under contrastive loss, based either on the VC dimension or on Rademacher complexity, their algorithms are either inherently inefficient or do not imply PAC guarantees. In this paper, we consider contrastive learning of the fundamental concept of linear representations. Surprisingly, even in this basic setting, the existence of efficient PAC learners is largely open. We first show that the problem of contrastive PAC learning of linear representations is intractable to solve in general. We then show that it can be relaxed to a semi-definite program when the distance between contrastive samples is measured by the $\ell_2$-norm. We then establish generalization guarantees based on Rademacher complexity, and connect them to PAC guarantees under certain contrastive large-margin conditions. To the best of our knowledge, this is the first efficient PAC learning algorithm for contrastive learning.

## 1 Introduction

Contrastive learning has been a successful learning paradigm in modern machine learning (Gutmann & Hyvärinen, 2010; Logeswaran & Lee, 2018). In general, it is assumed that a learner has access to an anchor example $x$, a positive example $y$, and a number of negative examples $\{z_1, \dots, z_k\}$, and the goal of contrastive learning is to learn a representation function $f$ on the examples such that $y$ is closer to $x$ than all the $z_i$'s under $f$.
Motivated by the empirical success of contrastive learning, there has been a surge of recent interest in understanding it from a theoretical perspective, primarily through the lens of Rademacher complexity or that of VC theory. For example, Arora et al. (2019) initiated the study of the generalization ability of contrastive learning by analyzing the Rademacher complexity of a commonly used contrastive loss, and showed that under certain structural assumptions on the data, minimizing the unsupervised contrastive loss leads to a low classification error. There were a few follow-up works in this line which aimed to understand and improve the sample complexity; see e.g. (Ash et al., 2022; Awasthi et al., 2022; Lei et al., 2023). Orthogonal to the Rademacher-based theory, a very recent work of Alon et al. (2024) proposed to study this problem under the classical probably approximately correct (PAC) learning framework (Valiant, 1984). Unlike prior works that assumed a rich structure for the data distribution in order to estimate the classification error from the contrastive loss, Alon et al. (2024) considered that there is an unknown distribution on the instances and labels, where labels are produced by an unknown distance function. Tight bounds on sample complexity were established for arbitrary distance functions, $\ell_p$-distances, and tree metrics.

In this work, we follow the contrastive PAC learning framework of Alon et al. (2024). Let $X \subseteq \mathbb{R}^d$ be the space of examples (e.g. image patches). An instance $u$ is a tuple $(x, y, z) \in X^3$; thus we denote $U := X^3$. The label $b$ of a tuple $(x, y, z)$ is either $1$ or $-1$; here, we write $B := \{-1, 1\}$ as the label space. Let $H := \{h : U \to B\}$ be a hypothesis class. Suppose that there is an unknown distribution $D$ on $U \times B$. We are mainly interested in the realizable setting in this paper; namely, there exists an $h^* \in H$ such that for all $(u, b) \sim D$, it holds almost surely that $b = h^*(u)$.
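To make the realizable data model concrete, here is a minimal simulation in which labels are produced by a hidden linear representation under the squared-$\ell_2$ distance (the parameterization this paper ultimately adopts); the names `W_star` and `h_star` and the Gaussian data are illustrative choices, not part of the formal setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
W_star = rng.standard_normal((d, d))  # hidden target representation (synthetic)

def h_star(x, y, z, W=W_star):
    """Label of the instance (x, y, z): +1 iff y is at least as close to x
    as z is, under the squared-l2 distance induced by W."""
    return 1 if np.sum((W @ x - W @ z) ** 2) >= np.sum((W @ x - W @ y) ** 2) else -1

# One labeled draw (u, b) with u ~ D_U (here a synthetic Gaussian marginal).
x, y, z = rng.standard_normal((3, d))
u, b = (x, y, z), h_star(x, y, z)
```

Swapping $y$ and $z$ flips the label (up to ties), reflecting the symmetry of the instance.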
Now for any hypothesis $h \in H$, we can define its error rate as follows: $\mathrm{err}_D(h) := \Pr_{(u,b)\sim D}(h(u) \neq b) = \Pr_{u \sim D_U}(h(u) \neq h^*(u))$, where $D_U$ denotes the marginal distribution of $D$ on $U$. We are now in a position to define the contrastive PAC learning problem.

Definition 1 (Contrastive PAC learning). Let $\epsilon, \delta \in (0, 1)$ be a target error rate and failure probability, respectively. An adversary $\mathrm{EX}(D, h^*)$ chooses a distribution $D_U$ on $U$ and a target $h^* \in H$ and fixes them throughout the learning process. Each time the learner requests a sample from the adversary, the adversary draws a sample $u$ from $D_U$, labels it by $b := h^*(u)$, and returns $(u, b)$ to the learner. The goal of the learner is to find a concept $\hat{h} : U \to B$ such that with probability at least $1 - \delta$ (over the random draws of samples and all internal randomness of the learning algorithm), it holds that $\mathrm{err}_D(\hat{h}) \leq \epsilon$ for all $D$ and $h^*$.

One example of the hypothesis class is $H = \{h : (x, y, z) \mapsto \mathrm{sign}(\|f(x) - f(z)\|^p - \|f(x) - f(y)\|^p)\}$, where both $f(\cdot)$ and $p$ are to be learned from samples. This is a contrastive PAC learning problem considered in Alon et al. (2024). We note that since learning distance functions is inherently challenging, the PAC guarantees of Alon et al. (2024) were established only for finite domains, i.e. $|X|$ is finite, and their learning algorithm is inherently inefficient. On the other side, Arora et al. (2019) and many of its follow-up works such as Awasthi et al. (2022); Lei et al. (2023) considered a fixed and known distance function, e.g. $p = 2$, and aimed to learn the representation function $f(\cdot)$ from a certain family. This makes the problem more tractable, though in general it is still inefficient due to the non-convexity of the contrastive loss: only convergence to stationary points is known (Yuan et al., 2022). In addition, the approaches in this line do not immediately imply PAC guarantees.
In this paper, we investigate the contrastive PAC learning problem for fixed $p = 2$, and we aim to develop computationally efficient algorithms with PAC guarantees. Our setup thus interpolates between Arora et al. (2019) and Alon et al. (2024). Despite the relatively new setup, it is surprising that even efficient contrastive PAC learning of linear representation functions on $\mathbb{R}^d$ is largely open. Indeed, as will be shown later, this is already a non-trivial problem from the computational perspective. From now on, we will focus on the very fundamental class of linear representation functions:

$$F = \{f_W : x \mapsto Wx,\ W \in \mathcal{W}\}. \tag{1.1}$$

In the above, $\mathcal{W}$ can be a certain constraint set such as a Frobenius-norm ball. We will discuss the choice of $\mathcal{W}$ and related results in more detail later. Denote

$$g_W(x, y, z) := \|Wx - Wz\|_2^2 - \|Wx - Wy\|_2^2. \tag{1.2}$$

Now we can spell out the hypothesis class to be learned:

$$H = \{h_W : (x, y, z) \mapsto \mathrm{sign}(g_W(x, y, z)),\ W \in \mathcal{W}\}. \tag{1.3}$$

### 1.1 Main results

Our main result for contrastive PAC learning of (1.3) is as follows.

Theorem 2 (Theorem 10, informal). Suppose that $b \cdot g_{W^*}(x, y, z) \geq 1$ for all $(x, y, z, b) \sim D$. There exists an algorithm $A$ satisfying the following. By drawing $\mathrm{poly}(1/\epsilon, \log 1/\delta)$ samples from $D$, with probability $1 - \delta$, $A$ outputs a hypothesis $\hat{W}$ such that $\mathrm{err}_D(\hat{W}) \leq \epsilon$. In addition, $A$ runs in $\mathrm{poly}(1/\epsilon, \log 1/\delta)$ time.

We remark that the condition $b \cdot g_{W^*}(x, y, z) \geq 1$ is similar to the large-margin condition for learning halfspaces. Such large-margin conditions were broadly assumed in analyses of learning algorithms such as the Perceptron (Rosenblatt, 1958) and boosting (Schapire & Freund, 2012). Our condition is adapted to contrastive samples, and we will call it the contrastive large-margin condition. The constant 1 therein can readily be replaced by a margin parameter $\gamma > 0$, which by our analysis leads to a sample complexity proportional to $1/\gamma^2$. This is standard in learning theory (Anthony & Bartlett, 1999).
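As a quick sanity check of the definitions, the following sketch evaluates $g_W$ from (1.2) and the contrastive margin $b \cdot g_W(x, y, z)$ appearing in Theorem 2; the random data and the helper name `g` are our illustrative choices:

```python
import numpy as np

def g(W, x, y, z):
    """g_W(x, y, z) = ||Wx - Wz||_2^2 - ||Wx - Wy||_2^2, as in (1.2)."""
    return np.sum((W @ x - W @ z) ** 2) - np.sum((W @ x - W @ y) ** 2)

rng = np.random.default_rng(1)
d = 4
W = rng.standard_normal((d, d))
x, y, z = rng.standard_normal((3, d))

b = np.sign(g(W, x, y, z))   # label realized by h_W in (1.3)
margin = b * g(W, x, y, z)   # contrastive margin; the condition in Theorem 2 asks >= 1
```

Note that $g_{tW} = t^2 g_W$, so rescaling $W$ rescales the margin; the unit threshold in Theorem 2 is therefore meaningful only relative to the size of the constraint set $\mathcal{W}$.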
However, to keep our results concise, we do not pursue it here. Our sample complexity in Theorem 2 omits the dependence on other quantities such as the magnitude of the samples and the size of the constraint set $\mathcal{W}$; a complete description can be found in Theorem 10. What we really hope to highlight in the informal version is that we develop a polynomial-time algorithm that PAC learns a fundamental concept class from contrastive samples, the first efficient PAC learner in this literature.

### 1.2 Overview of our techniques

We first view the contrastive PAC learning problem as binary classification, as suggested by (1.3). We then apply standard learning principles such as empirical risk minimization with a suitable loss function. It turns out, however, that the quadratic form of $g_W$ makes the problem inherently intractable even under the hinge loss. We thus use the property that quadratic functions can be linearized by introducing a new matrix variable, which turns the problem into a semi-definite program (SDP) that can be solved in polynomial time. To analyze the error rate, we establish generalization bounds for the SDP via Rademacher complexity. We then show that under the contrastive large-margin condition, the empirical risk is zero at the target concept $W^*$. This implies that the error rate of a solution of the SDP can be as small as $\epsilon$. Lastly, we apply an eigenvalue decomposition to the SDP solution to obtain a linear representation, which completes the proof.

### 1.3 Roadmap

A concrete problem setup as well as a collection of useful notations are presented in Section 2. In Section 3, we elaborate on our algorithm and the theoretical guarantees. Section 4 concludes this paper and proposes a few open questions.

## 2 Preliminaries

The PAC learning framework was proposed by Valiant (1984). Let $U$ and $B$ be the instance and label space, respectively.
It is assumed that there is an underlying distribution $D$ on $U \times B$ such that all samples are drawn from $D$. Let $H$ be a hypothesis class that maps $U$ to $B$. The error rate of $h \in H$ is defined as $\mathrm{err}_D(h) := \Pr_{(u,b)\sim D}(h(u) \neq b)$. Under the realizable setting, there exists a target hypothesis $h^* \in H$ such that with probability 1, $b = h^*(u)$ for $(u, b) \sim D$.

In contrastive learning, an instance $u \in U$ is often a tuple of the form $u = (x, y, z)$, where $x, y, z$ are from $X \subseteq \mathbb{R}^d$. For example, $X$ can be the space of image patches of size $d$, and $u$ consists of three image patches. More generally, $u$ may contain a number of patches $x, y, z_1, \dots, z_k$, where the $z_i$'s are often referred to as negative examples in the literature and $y$ is referred to as the positive example. In our main results, we do not pursue such a generalization, to keep our algorithm and theory concise. However, such an extension is possible, and we illustrate it in Section 3.

With $U = X^3$ and $B = \{-1, 1\}$ in mind, a sample $(x, y, z, b)$ in contrastive learning should be interpreted as follows: if $b = 1$, it indicates that $y$ is closer to $x$ than $z$ is; otherwise, $z$ is closer to $x$. More formally, there exists a distance function $\rho^* : X \times X \to \mathbb{R}_{\geq 0}$ such that $h^*(x, y, z) = \mathrm{sign}(\rho^*(x, z) - \rho^*(x, y))$. We note that Alon et al. (2024) aimed to learn such general distance functions over a finite domain, while most prior works assumed a certain parameterized form such as $\rho^*(x, y) = \|Wx - Wy\|_2^2$, as in this work. Once we confine ourselves to this specific distance function, we can think of the mapping $x \mapsto Wx$ as a new representation of $x$; thus, the problem of contrastive learning is sometimes also regarded as representation learning. Denote

$$g_W(x, y, z) = \|Wx - Wz\|_2^2 - \|Wx - Wy\|_2^2. \tag{2.1}$$

Observe that $h^*(x, y, z) = \mathrm{sign}(g_{W^*}(x, y, z))$. As is typical in machine learning, one may want to impose certain constraints on $W$ in order to prevent overfitting.
Of particular interest are the Frobenius-norm ball $\mathcal{W}_F = \{W \in \mathbb{R}^{d \times d} : \|W\|_F \leq r_F\}$, the $\ell_1$-norm ball $\mathcal{W}_1 = \{W \in \mathbb{R}^{d \times d} : \|W\|_1 \leq r_1\}$ for sparsity, and the nuclear-norm ball $\mathcal{W}_* = \{W \in \mathbb{R}^{d \times d} : \|W\|_* \leq r_*\}$ for low-rankness. Different constraints lead to different generalization bounds, as shown in Section 3.

For a square matrix $M$, we write $\mathrm{tr}(M)$ for its trace. The inner product of two matrices $A$ and $B$ of the same size is defined as $\langle A, B \rangle := \mathrm{tr}(A^\top B)$, which we sometimes simply write as $A \cdot B$. In addition to the matrix norms just mentioned, we may also use the spectral norm, denoted $\|M\|$. We will mainly be interested in the hinge loss as a surrogate for the error rate. Denote

$$L(W; U) = \max\{0,\ 1 - \langle W^\top W, U \rangle\}, \qquad L(G; U) = \max\{0,\ 1 - \langle G, U \rangle\}. \tag{2.2}$$

Here $W$, $G$, and $U$ are matrices. Let $F$ be a class of real-valued functions on $U \times B$ and let $S = \{s_i\}_{i=1}^n$ be a sample set from $U \times B$. The empirical Rademacher complexity of $S$ under $F$ is defined as $R(F \circ S) := \frac{1}{n} \mathbb{E}_\sigma \sup_{f \in F} \sum_{i=1}^n \sigma_i f(s_i)$, where $\sigma = (\sigma_1, \dots, \sigma_n)$ is a Rademacher random vector.

## 3 Algorithms and Performance Guarantees

Let $S = \{(x_i, y_i, z_i, b_i)\}_{i=1}^n$ be a set of samples independently drawn from $D$, where the tuple $(x_i, y_i, z_i) \in U$ and $b_i \in B$. Recall that we study realizable PAC learning; thus there exists an unknown $W^* \in \mathcal{W}$ such that for all $(x, y, z, b) \sim D$, almost surely $b = \mathrm{sign}(g_{W^*}(x, y, z))$. At first glance, one may seek a hypothesis $W \in \mathcal{W}$ that minimizes the empirical risk, that is,

$$\min_{W \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^n \mathbb{1}[b_i \cdot g_W(x_i, y_i, z_i) < 0], \tag{3.1}$$

where $\mathbb{1}[E]$ is the indicator function, which outputs 1 if the event $E$ occurs and 0 otherwise. Since $g_W(\cdot)$ is quadratic in $W$, it is easy to show by algebraic calculation that:

Lemma 3. $\|Wx - Wy\|_2^2 = \langle W^\top W, (x - y)(x - y)^\top \rangle$.

Therefore, let

$$U_i = b_i \big((x_i - z_i)(x_i - z_i)^\top - (x_i - y_i)(x_i - y_i)^\top\big). \tag{3.2}$$

We obtain $b_i \cdot g_W(x_i, y_i, z_i) = \langle W^\top W, U_i \rangle$. Plugging this into (3.1) gives

$$\min_{W \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^n \mathbb{1}[\langle W^\top W, U_i \rangle < 0]. \tag{3.3}$$
Unfortunately, solving the above program is intractable, due to 1) the non-convexity of the 0/1-loss, and 2) the quadratic form with respect to $W$. In the following, we propose an approach based on semi-definite programming that is solvable in polynomial time. First, by a standard technique, we may instead minimize the hinge loss:

$$\min_{W \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^n L(W; U_i), \tag{3.4}$$

where $L(\cdot\,;\cdot)$ was defined in (2.2). We note that Verma & Branson (2015) also studied this loss function in the context of Mahalanobis distance metrics and obtained statistical sample complexity bounds. Observe that the problem may still be non-convex, since $U_i$ may have negative eigenvalues; this is in stark contrast to learning from standard examples. Since the non-convexity comes from the quadratic term $W^\top W$, we consider replacing the variable $W^\top W$ with a new variable $G$. Then $\langle G, U_i \rangle$ is a linear function of $G$, which makes the objective function convex. This is a well-known technique that has been used in many contexts (Williamson & Shmoys, 2011; d'Aspremont et al., 2007).

Based on the fact that $W \in \mathcal{W}$, we construct a constraint set $\mathcal{G} \supseteq \{W^\top W : W \in \mathcal{W}\}$. As long as $\mathcal{G}$ is convex, the overall program is convex. Note that such a convex set $\mathcal{G}$ always exists; the minimal one is the convex hull (Boyd & Vandenberghe, 2004). The empirical risk minimization program that we analyze is as follows:

$$\min_{G \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n L(G; U_i), \tag{3.5}$$

where $L(\cdot\,;\cdot)$ was defined in (2.2).

### 3.1 Rademacher complexity

We provide bounds on the Rademacher complexity of (3.5) for two popular choices of $\mathcal{W}$ (and thus $\mathcal{G}$).

#### 3.1.1 Frobenius-norm ball

We first consider the Frobenius-norm ball, one of the most widely used constraints in machine learning. That is, $\mathcal{W} = \mathcal{W}_F := \{W \in \mathbb{R}^{d \times d} : \|W\|_F \leq r_F\}$ for some parameter $r_F > 0$. Here and after, the subscript of $\mathcal{W}$ and $r$ is used only to identify the type of constraint.
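To make the relaxation concrete, here is one possible way to run (3.5): build the matrices $U_i$ of (3.2) and minimize the empirical hinge loss over the set $\{G \succeq 0,\ \mathrm{tr}(G) \leq \tau\}$ (for PSD matrices the trace equals the nuclear norm) by projected subgradient descent. The paper only asserts polynomial-time solvability of the SDP; the first-order solver, the trace cap `tau`, and the synthetic margin-filtered data below are our illustrative choices:

```python
import numpy as np

def build_U(x, y, z, b):
    """U_i of (3.2): b * ((x - z)(x - z)^T - (x - y)(x - y)^T)."""
    return b * (np.outer(x - z, x - z) - np.outer(x - y, x - y))

def proj_psd_trace(G, tau):
    """Euclidean projection onto {G PSD, tr(G) <= tau}, via the eigenvalues."""
    lam, V = np.linalg.eigh((G + G.T) / 2)
    lam = np.clip(lam, 0.0, None)
    if lam.sum() > tau:
        # project the eigenvalues onto the capped simplex {lam >= 0, sum(lam) <= tau}
        u = np.sort(lam)[::-1]
        css = np.cumsum(u)
        k = np.arange(1, len(u) + 1)
        rho = np.nonzero(u - (css - tau) / k > 0)[0][-1]
        lam = np.clip(lam - (css[rho] - tau) / (rho + 1.0), 0.0, None)
    return (V * lam) @ V.T

def solve_relaxation(Us, tau, steps=500, lr=0.1):
    """Projected subgradient method for (3.5): min_G (1/n) sum_i max{0, 1 - <G, U_i>}."""
    obj = lambda G: np.mean([max(0.0, 1.0 - np.sum(G * U)) for U in Us])
    G = np.zeros_like(Us[0])
    best, best_val = G, obj(G)
    for _ in range(steps):
        # subgradient: -U_i for every sample whose hinge term is active
        grad = sum(-U for U in Us if np.sum(G * U) < 1.0) / len(Us)
        G = proj_psd_trace(G - lr * grad, tau)
        if obj(G) < best_val:
            best, best_val = G, obj(G)
    return best

# Synthetic check: data realizable with unit margin under the identity representation.
rng = np.random.default_rng(2)
d, n = 4, 60
data = []
while len(data) < n:
    x, y, z = rng.standard_normal((3, d))
    gap = np.sum((x - z) ** 2) - np.sum((x - y) ** 2)  # g_W(x, y, z) with W = I
    if abs(gap) >= 1.0:  # keep only samples satisfying the margin condition
        data.append(build_U(x, y, z, np.sign(gap)))
tau = 4.0 * d
G_hat = solve_relaxation(data, tau)
```

On this toy data the identity matrix is feasible with zero hinge loss, so the best iterate's objective can only improve on the trivial starting point $G = 0$.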
Since $G = W^\top W$ is positive semi-definite, we have $\|G\|_* = \mathrm{tr}(W^\top W) = \|W\|_F^2 \leq r_F^2$, where $\|\cdot\|_*$ denotes the nuclear norm (also known as the trace norm). Therefore, we can choose

$$\mathcal{G} = \mathcal{G}_* := \{G \in \mathbb{R}^{d \times d} : G \succeq 0,\ \|G\|_* \leq r_F^2\}. \tag{3.6}$$

Lemma 4. Consider the function class $\Theta_* := \{\theta : U \mapsto L(G; U),\ G \in \mathcal{G}_*\}$. Let $S = \{U_i\}_{i=1}^n$ and assume $\max_{U_i \in S} \|U_i\| \leq \alpha$. Then the empirical Rademacher complexity satisfies $R(\Theta_* \circ S) \leq r_F^2\, \alpha\, \sqrt{\frac{2\log(2d)}{n}}$.

Proof. Let $\sigma = (\sigma_1, \dots, \sigma_n)$ be a Rademacher random vector. By the contraction property of Rademacher complexity, we have

$$n\, R(\Theta_* \circ S) = \mathbb{E}_\sigma \sup_{G \in \mathcal{G}_*} \sum_{i=1}^n \sigma_i \max\{0,\ 1 - \langle G, U_i \rangle\} \leq \mathbb{E}_\sigma \sup_{G \in \mathcal{G}_*} \sum_{i=1}^n \sigma_i \langle G, U_i \rangle \leq r_F^2\, \sqrt{2n\log(2d)}\, \max_{U_i \in S} \|U_i\|,$$

where the last inequality follows from Kakade et al. (2012) (see Table 1 therein). The result follows by noting that the spectral norm of each $U_i$ is assumed to be upper bounded by $\alpha$, and dividing both sides by $n$.

Recall that $U_i$ was defined in (3.2). Suppose that the example space $X$ is a subset of a bounded $\ell_2$-norm ball, say $X \subseteq \{x : \|x\|_2 \leq \kappa\}$. Then we can show that $\|U_i\| \leq \|x_i - y_i\|_2^2 + \|x_i - z_i\|_2^2 \leq 2\kappa^2$. Thus setting $\alpha = 2\kappa^2$ in Lemma 4 gives the following:

Corollary 5. Consider the function class $\Theta_* := \{\theta : U \mapsto L(G; U),\ G \in \mathcal{G}_*\}$. Suppose $X \subseteq \{x : \|x\|_2 \leq \kappa\}$ and let $S = \{U_i\}_{i=1}^n$ be a sample set drawn from $X^3$. Then $R(\Theta_* \circ S) \leq 2 r_F^2 \kappa^2 \sqrt{\frac{2\log(2d)}{n}}$.

#### 3.1.2 $\ell_1$-norm ball (sparsity)

Now we consider constraining the linear representation matrix $W$ by an $\ell_1$-norm, which typically promotes sparsity (Tibshirani, 1996; Chen et al., 1998; Candès & Tao, 2005). That is, $\mathcal{W} = \mathcal{W}_1 := \{W \in \mathbb{R}^{d \times d} : \|W\|_1 \leq r_1\}$ for some parameter $r_1 > 0$. We now derive an $\ell_1$-norm bound for $W^\top W$. Write $W = (w_1, \dots, w_d)$ in column form, where $w_i$ denotes the $i$-th column of $W$. It follows that

$$\|W^\top W\|_1 = \sum_{1 \leq i, j \leq d} |\langle w_i, w_j \rangle| \leq \sum_{1 \leq i, j \leq d} \|w_i\|_1 \|w_j\|_\infty \leq \sum_{1 \leq i, j \leq d} \|w_i\|_1 \|w_j\|_1 = \|W\|_1^2 \leq r_1^2.$$

This suggests that we can choose

$$\mathcal{G} = \mathcal{G}_1 := \{G \in \mathbb{R}^{d \times d} : G \succeq 0,\ \|G\|_1 \leq r_1^2\}. \tag{3.7}$$

Lemma 6. Consider the function class $\Theta_1 := \{\theta : U \mapsto L(G; U),\ G \in \mathcal{G}_1\}$. Let $S = \{U_i\}_{i=1}^n$ and assume $\max_{U_i \in S} \|U_i\|_\infty \leq \alpha$. Then the empirical Rademacher complexity satisfies $R(\Theta_1 \circ S) \leq r_1^2\, \alpha\, \sqrt{\frac{2\log(2d^2)}{n}}$.

Proof.
For a matrix $M$, let $\vec{M}$ be the vector obtained by concatenating all columns of $M$. Let $\sigma = (\sigma_1, \dots, \sigma_n)$ be a Rademacher random vector. By the contraction property of Rademacher complexity, we have

$$n\, R(\Theta_1 \circ S) = \mathbb{E}_\sigma \sup_{G \in \mathcal{G}_1} \sum_{i=1}^n \sigma_i \max\{0,\ 1 - \langle G, U_i \rangle\} \leq \mathbb{E}_\sigma \sup_{G \in \mathcal{G}_1} \sum_{i=1}^n \sigma_i \langle G, U_i \rangle = \mathbb{E}_\sigma \sup_{G \in \mathcal{G}_1} \sum_{i=1}^n \sigma_i \langle \vec{G}, \vec{U_i} \rangle \leq r_1^2\, \alpha\, \sqrt{2n\log(2d^2)},$$

where the last inequality follows from Lemma 26.11 of Shalev-Shwartz & Ben-David (2014). Dividing both sides by $n$ completes the proof.

Suppose that $X \subseteq \{x : \|x\|_\infty \leq \kappa\}$. Then we can show that $\|U_i\|_\infty \leq \|x_i - y_i\|_\infty^2 + \|x_i - z_i\|_\infty^2 \leq 2\kappa^2$. Therefore, setting $\alpha = 2\kappa^2$ in the above lemma gives the following corollary.

Corollary 7. Consider the function class $\Theta_1 := \{\theta : U \mapsto L(G; U),\ G \in \mathcal{G}_1\}$. Suppose $X \subseteq \{x : \|x\|_\infty \leq \kappa\}$ and let $S = \{U_i\}_{i=1}^n$ be a sample set drawn from $X^3$. Then $R(\Theta_1 \circ S) \leq 2 r_1^2 \kappa^2 \sqrt{\frac{2\log(2d^2)}{n}}$.

### 3.2 PAC guarantees

We analyze the PAC guarantees under a new type of margin condition, which we call the contrastive large-margin condition.

Definition 8 (Contrastive large-margin condition). We say that the data distribution $D$ satisfies the contrastive large-margin condition if there exists $W^* \in \mathcal{W}$ such that for $(x, y, z, b) \sim D$, the following holds with probability 1: $b \cdot \big(\|W^* x - W^* z\|_2^2 - \|W^* x - W^* y\|_2^2\big) \geq 1$.

Geometrically, this condition ensures that there is a non-trivial separation between positive and negative examples for any given anchor $x$. When the condition is satisfied, (3.4) attains an optimal objective value of 0. Since the feasible set of the convex program (3.5) contains that of (3.4), the following is immediate.

Lemma 9. Assume that the contrastive large-margin condition holds. Then there exists $\hat{G} \in \mathcal{G}$ such that the objective value of (3.5) at $\hat{G}$ equals 0.

Now we can prove the main result of this section, the PAC guarantee.

Theorem 10. Assume that the contrastive large-margin condition is satisfied for some $W^* \in \mathcal{W}$, and that $\mathcal{G}$ is such that $\mathcal{G} \supseteq \{W^\top W : W \in \mathcal{W}\}$.
Let $\hat{G} \in \mathcal{G}$ be an optimal solution to (3.5), let $\hat{G} = V \Sigma V^\top$ be its eigenvalue decomposition, and let $\hat{W} := \Sigma^{1/2} V^\top$. Then by drawing a contrastive sample set $S = \{(x_i, y_i, z_i, b_i)\}_{i=1}^n$, with probability at least $1 - \delta$, it holds that

$$\mathrm{err}_D(\hat{W}) \leq 2 R(\Theta \circ S) + 5c\, \sqrt{\frac{\log(2/\delta)}{n}},$$

where $\Theta := \{\theta : U \mapsto L(G; U),\ G \in \mathcal{G}\}$ and $c := \sup_{G \in \mathcal{G},\, U} L(G; U)$, the supremum being over all matrices $U$ induced by instances in $U \times B$. When $\mathcal{G}$ is a convex set, our algorithm runs in polynomial time.

Proof. We apply standard uniform convergence via Rademacher complexity (Bartlett & Mendelson, 2002) to obtain that with probability $1 - \delta$,

$$\mathbb{E}_{U \sim D}\, L(\hat{G}; U) \leq \mathbb{E}_{U \sim D}\, L(W^*; U) + 2 R(\Theta \circ S) + 5c\, \sqrt{\frac{\log(2/\delta)}{n}}.$$

In view of the contrastive large-margin condition, $L(W^*; U) = 0$ almost surely. On the other hand, we always have $L(G; U) \geq \mathbb{1}[\langle G, U \rangle < 0]$. This implies

$$\mathbb{E}_{U \sim D}\, \mathbb{1}[\langle \hat{G}, U \rangle < 0] \leq 2 R(\Theta \circ S) + 5c\, \sqrt{\frac{\log(2/\delta)}{n}}. \tag{3.8}$$

Now recall that $U = b\big((x - z)(x - z)^\top - (x - y)(x - y)^\top\big)$ as in (3.2), and $\hat{G} = \hat{W}^\top \hat{W}$ by the eigenvalue decomposition. Therefore,

$$\langle \hat{G}, U \rangle = b\big(\|\hat{W}x - \hat{W}z\|_2^2 - \|\hat{W}x - \hat{W}y\|_2^2\big).$$

Thus, (3.8) is equivalent to $\mathrm{err}_D(\hat{W}) \leq 2 R(\Theta \circ S) + 5c\sqrt{\log(2/\delta)/n}$, which completes the proof.

Theorem 10, combined with the Rademacher complexity bounds in Section 3.1, leads to sample complexity bounds for efficient contrastive PAC learning.

Corollary 11. Assume the same conditions as in Theorem 10. Consider $\Theta = \Theta_*$ as in Corollary 5 and suppose $X \subseteq \{x : \|x\|_2 \leq \kappa\}$. Then by drawing $n = O\!\big((1 + r_F^2 \kappa^2)^2 (\log d + \log(1/\delta)) / \epsilon^2\big)$ contrastive samples from $D$, we have $\mathrm{err}_D(\hat{W}) \leq \epsilon$ with probability $1 - \delta$, where $\hat{W}$ is as defined in Theorem 10.

Proof. We just need to compute the supremum of $L(G; U)$. It turns out that $L(G; U) \leq 1 + |\langle G, U \rangle| \leq 1 + \|G\|_* \|U\| \leq 1 + r_F^2 \kappa^2$. The result follows by plugging this upper bound and the Rademacher complexity bound of Corollary 5 into Theorem 10.

Corollary 12. Assume the same conditions as in Theorem 10. Consider $\Theta = \Theta_1$ as in Corollary 7 and suppose $X \subseteq \{x : \|x\|_\infty \leq \kappa\}$. Then by drawing $n = O\!\big((1 + r_1^2 \kappa^2)^2 (\log d + \log(1/\delta)) / \epsilon^2\big)$ contrastive samples from $D$, we have $\mathrm{err}_D(\hat{W}) \leq \epsilon$ with probability $1 - \delta$, where $\hat{W}$ is as defined in Theorem 10.

Proof. Again, we only need to compute the supremum of $L(G; U)$.
It turns out that $L(G; U) \leq 1 + |\langle G, U \rangle| \leq 1 + \|G\|_1 \|U\|_\infty \leq 1 + r_1^2 \kappa^2$. The result follows by plugging this upper bound and the Rademacher complexity bound of Corollary 7 into Theorem 10.

We remark that both $\mathcal{G}_*$ and $\mathcal{G}_1$ are convex sets; thus our PAC guarantees are obtained by a computationally efficient algorithm, namely the convex program (3.5).

### 3.3 Extension to multiple negative examples

One important extension of our contrastive PAC learning framework is to multiple negative samples, which are commonly used in practice and whose importance has been broadly studied (Ash et al., 2022; Awasthi et al., 2022; Lei et al., 2023). That is, when the label is $b = 1$, in addition to the anchor example $x$ and a positive example $y$, the learner collects $k$ negative examples $z_1, \dots, z_k$; together, these form a sample $u := (x, y, z_1, \dots, z_k)$ with label 1. The instance space is therefore $U = X^{k+2}$, while the label space remains the same as before, and the learning paradigm still follows Definition 1. More generally, one can think of an instance as $(x, u_1, \dots, u_{k+1})$ with a label $b \in \{1, \dots, k+1\}$ that specifies the index of the $u_i$ closest to $x$. Since we can always reorder $u_1, \dots, u_{k+1}$ so that the closest example comes first, we will without loss of generality assume $b = 1$; the example following $x$ is the closest, which we denote by $y$, and the remaining examples are denoted $z_1, \dots, z_k$. This notation is typical in the literature.

Now, given a set of contrastive samples $S = \{(x_i, y_i, z_{i1}, \dots, z_{ik}, b_i)\}_{i=1}^n$ drawn independently from $D$, we aim to establish PAC guarantees as in the case $k = 1$. For any $i$, we know by the realizability assumption that $\|W^* x_i - W^* y_i\|_2 \leq \|W^* x_i - W^* z_{ij}\|_2$ for all $1 \leq j \leq k$. Define

$$U_{ij} = (x_i - z_{ij})(x_i - z_{ij})^\top - (x_i - y_i)(x_i - y_i)^\top. \tag{3.10}$$

By Lemma 3, we have $\langle (W^*)^\top W^*, U_{ij} \rangle \geq 0$ for all $1 \leq j \leq k$. This is equivalent to $\min_{1 \leq j \leq k} \langle (W^*)^\top W^*, U_{ij} \rangle \geq 0$.
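The realizability property just derived is easy to check numerically: for a synthetic target $W^*$ and an instance whose positive example is the closest candidate under $W^*$, every $U_{ij}$ from (3.10) has a non-negative inner product with $(W^*)^\top W^*$. The data and names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 4, 3
W_star = rng.standard_normal((d, d))
G_star = W_star.T @ W_star

x = rng.standard_normal(d)
cands = rng.standard_normal((k + 1, d))
# Reorder so the candidate closest to x under W* comes first (our y), as in the text.
dists = [np.sum((W_star @ x - W_star @ c) ** 2) for c in cands]
order = np.argsort(dists)
y, zs = cands[order[0]], cands[order[1:]]

for z in zs:
    U = np.outer(x - z, x - z) - np.outer(x - y, x - y)   # U_ij of (3.10)
    assert np.sum(G_star * U) >= -1e-9   # <(W*)^T W*, U_ij> >= 0 under realizability
```

By Lemma 3, the inner product equals $\|W^*x - W^*z\|_2^2 - \|W^*x - W^*y\|_2^2$, which is non-negative exactly because $y$ was chosen as the closest candidate.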
Thus, a natural empirical risk, based on the hinge loss, is as follows:

$$\min_{W \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^n \max\Big\{0,\ 1 - \min_{1 \leq j \leq k} \langle W^\top W, U_{ij} \rangle\Big\}. \tag{3.11}$$

As discussed in the preceding subsection, the above program is non-convex, and we again use an SDP as a convex relaxation. This gives the following program:

$$\min_{G \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \max\Big\{0,\ 1 - \min_{1 \leq j \leq k} \langle G, U_{ij} \rangle\Big\}. \tag{3.12}$$

Consider the function class $Q = \{q_G : (x, y, z_1, \dots, z_k) \mapsto \max\{0,\ 1 - \min_{1 \leq j \leq k} \langle G, U^j \rangle\},\ G \in \mathcal{G}\}$, where $U^j = (x - z_j)(x - z_j)^\top - (x - y)(x - y)^\top$. Let $c := \sup_{G \in \mathcal{G},\, (x, y, z_1, \dots, z_k) \in X^{k+2}} q_G(x, y, z_1, \dots, z_k)$ and denote by $\hat{G}$ a global optimum of (3.12). Write $u = (x, y, z_1, \dots, z_k)$. Then standard concentration results tell us that with probability $1 - \delta$,

$$\mathbb{E}_{u \sim D}\, q_{\hat{G}}(u) \leq \mathbb{E}_{u \sim D}\, q_{G^*}(u) + 2 R(Q \circ S) + 5c\, \sqrt{\frac{\log(2/\delta)}{n}},$$

where $G^* = (W^*)^\top W^*$. Under the contrastive large-margin condition, $q_{G^*}(u) = 0$. Thus, it remains to bound the empirical Rademacher complexity $R(Q \circ S)$. To this end, we think of each $q_G \in Q$ as a composition of two functions: $q_G = q \circ \bar{q}_G$, where $\bar{q}_G(u) = (\langle G, U^1 \rangle, \dots, \langle G, U^k \rangle) \in \mathbb{R}^k$ and $q(v_1, \dots, v_k) = \max\{0,\ 1 - \min_{1 \leq j \leq k} v_j\}$. By Corollary 4 of Maurer (2016), we have

$$n\, R(Q \circ S) \leq \sqrt{2}\, L\, \mathbb{E}_\sigma \sup_{G \in \mathcal{G}} \sum_{i=1}^n \sum_{j=1}^k \sigma_{ij} \langle G, U_{ij} \rangle, \tag{3.13}$$

where $L$ denotes the Lipschitz constant of $q$. When $\mathcal{G} = \mathcal{G}_*$ and $X \subseteq \{x : \|x\|_2 \leq \kappa\}$, we have shown that the expectation on the right-hand side is at most $2 r_F^2 \kappa^2 \sqrt{2nk\log(2d)}$. Therefore, it remains to estimate $L$. Observe that $q$ can be written as $q(t) = \max\{0, 1 - t\}$ with $t = \min_{1 \leq j \leq k} v_j$; both maps are 1-Lipschitz, so $L = 1$. Putting everything together gives, for $\mathcal{G} = \mathcal{G}_*$,

$$\mathbb{E}_{u \sim D}\, q_{\hat{G}}(u) \leq 8 r_F^2 \kappa^2 \sqrt{\frac{k\log(2d)}{n}} + 5c\, \sqrt{\frac{\log(2/\delta)}{n}}.$$

We note that $c = 1 + r_F^2 \kappa^2$ by an algebraic calculation similar to Corollary 11. Lastly, similar to the proof of Theorem 10, the above implies a PAC guarantee for $\hat{W}$ with $\hat{G} = \hat{W}^\top \hat{W}$.

## 4 Conclusion and Open Questions

In this paper, we studied the power of convex relaxations for contrastive PAC learning.
We showed that even for learning linear representations via contrastive learning, the problem is in general intractable, which is in stark contrast to the classic problem of PAC learning linear models. We then proposed a convex program based on techniques from semi-definite programming. Under a contrastive large-margin condition, we proved that the solution to the convex program enjoys PAC guarantees. This is the first work that establishes PAC guarantees for contrastive learning over an arbitrary domain, whereas the very recent work of Alon et al. (2024) is confined to finite domains (and considers a more involved learning scenario).

Our convex relaxation techniques seem well suited to the $\ell_2$-distance between contrastive samples. An important question is whether there exists a more general approach that handles other distance metrics such as the $\ell_1$-distance. We expect that this is possible, since the $\ell_1$-norm is closely related to a family of linear functions obtained by introducing additional variables. Another important question is whether it is possible to learn nonlinear representation functions, for example, the family of polynomial threshold functions or neural networks. We conjecture that learning neural networks from contrastive samples is rather challenging, since the optimization landscape for linear classes is already drastically changed by contrastive samples.

On the algorithmic design front, it appears that one needs to carefully design convex surrogate functions whenever the underlying representation functions are modified. Does there exist a principled approach that guides this design, and is it necessary to consider convex surrogate functions for the problem at all? In the literature on PAC learning of halfspaces, there is a rich set of algorithmic results showing that one may optimize certain non-convex loss functions whose stationary points enjoy PAC guarantees (Diakonikolas et al., 2020; Zhang et al., 2020; Shen, 2021a; 2025). Can we show similar results for contrastive PAC learning?
In particular, can we design non-convex loss functions that serve as a proxy to (3.1) and for which a good stationary point can be found efficiently? We believe that our work serves as a first step towards these questions. Lastly, it is known that in practice, both the contrastive examples and the labels can be noisy. Is it possible to develop noise-tolerant algorithms for contrastive PAC learning by extending ideas from algorithmic robustness (Diakonikolas & Kane, 2019; Shen & Zhang, 2021; Shen, 2021b; 2023)?

## References

Noga Alon, Dmitrii Avdiukhin, Dor Elboim, Orr Fischer, and Grigory Yaroslavtsev. Optimal sample complexity of contrastive learning. In Proceedings of the 12th International Conference on Learning Representations, 2024.

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 5628–5637, 2019.

Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Dipendra Misra. Investigating the role of negatives in contrastive representation learning. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, pp. 7187–7209, 2022.

Pranjal Awasthi, Nishanth Dikkala, and Pritish Kamath. Do more negative samples necessarily hurt in contrastive learning? In Proceedings of the 39th International Conference on Machine Learning, pp. 1101–1116, 2022.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization.
Cambridge University Press, 2004.

Emmanuel J. Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.

Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics. CoRR, abs/1911.05911, 2019.

Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning halfspaces with Massart noise under structured distributions. In Proceedings of the 33rd Annual Conference on Learning Theory, pp. 1486–1513, 2020.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.

Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865–1890, 2012.

Yunwen Lei, Tianbao Yang, Yiming Ying, and Ding-Xuan Zhou. Generalization analysis for contrastive representation learning. In Proceedings of the 40th International Conference on Machine Learning, pp. 19200–19227, 2023.

Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. In Proceedings of the 6th International Conference on Learning Representations, 2018.

Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In Proceedings of the 27th International Conference on Algorithmic Learning Theory, pp. 3–17, 2016.

Frank Rosenblatt.
The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Jie Shen. On the power of localized Perceptron for label-optimal learning of halfspaces with adversarial noise. In Proceedings of the 38th International Conference on Machine Learning, pp. 9503–9514, 2021a.

Jie Shen. Sample-optimal PAC learning of halfspaces with malicious noise. In Proceedings of the 38th International Conference on Machine Learning, pp. 9515–9524, 2021b.

Jie Shen. PAC learning of halfspaces with malicious noise in nearly linear time. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, pp. 30–46, 2023.

Jie Shen. Efficient PAC learning of halfspaces with constant malicious noise rate. In Proceedings of the 36th International Conference on Algorithmic Learning Theory, pp. 1108–1137, 2025.

Jie Shen and Chicheng Zhang. Attribute-efficient learning of halfspaces with malicious noise: Near-optimal label complexity and noise tolerance. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, pp. 1072–1113, 2021.

Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

Nakul Verma and Kristin Branson. Sample complexity of learning Mahalanobis distance metrics. In Proceedings of the 29th Conference on Neural Information Processing Systems, 2015.

David P. Williamson and David B. Shmoys. The Design of Approximation Algorithms. Cambridge University Press, 2011.
Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, and Tianbao Yang. Provable stochastic optimization for global contrastive learning: Small batch does not harm performance. In Proceedings of the 39th International Conference on Machine Learning, pp. 25760–25782, 2022.

Chicheng Zhang, Jie Shen, and Pranjal Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 7184–7197, 2020.