# dissecting_supervised_contrastive_learning__81047db1.pdf

Dissecting Supervised Constrastive Learning

Florian Graf 1 Christoph D. Hofer 1 Marc Niethammer 2 Roland Kwitt 1

Minimizing cross-entropy over the softmax scores of a linear map composed with a high-capacity encoder is arguably the most popular choice for training neural networks on supervised learning tasks. However, recent works show that one can directly optimize the encoder instead, to obtain equally (or even more) discriminative representations via a supervised variant of a contrastive objective. In this work, we address the question whether there are fundamental differences in the sought-for representation geometry in the output space of the encoder at minimal loss. Specifically, we prove, under mild assumptions, that both losses attain their minimum once the representations of each class collapse to the vertices of a regular simplex, inscribed in a hypersphere. We provide empirical evidence that this conﬁguration is attained in practice and that reaching a close-to-optimal state typically indicates good generalization performance. Yet, the two losses show remarkably different optimization behavior. The number of iterations required to perfectly ﬁt to data scales superlinearly with the amount of randomly ﬂipped labels for the supervised contrastive loss. This is in contrast to the approximately linear scaling previously reported for networks trained with cross-entropy.

1. Introduction

In modern machine learning, neural networks have become the prevalent choice to parametrize maps from a complex input space X to some target space Y. In supervised learning tasks, where the output space is a set of discrete labels, Y = {1, . . . , K}, it is common to implement predictors of the form f = argmax W ϕ . (1)

1Department of Computer Science, University of Salzburg, Austria 2UNC Chapel Hill. Correspondence to: Florian Graf <florian.graf@sbg.ac.at>.

Proceedings of the 38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

In this construction, f is realized as the composition of an encoder ϕ : X Z Rh, a linear map/classiﬁer W : Rh RK and the argmax operation which handles the transition from continuous output to discrete label space.

Despite myriad advances in designing networks that implement ϕ, such as (Krizhevsky et al., 2012; He et al., 2016a; Zagoruyko & Komodakis, 2016; Huang et al., 2017), the training routine rarely deviates from minimizing the crossentropy (CE) between softmax scores of W ϕ and one-hot encoded discrete labels. Assuming sufﬁcient encoder capacity, it is clear that, at minimal loss, the representations of training instances, i.e., their images under ϕ, are in a linearly separable conﬁguration (as the classiﬁer is implemented as a linear map). Remarkably, this behavior is not only observed on real data with semantically meaningful labels, but also on real data with randomly ﬂipped labels (Zhang et al., 2017).

Alternatively, one could aim for directly learning an encoder that is compatible with a linear classiﬁer and the argmax decision rule. Recent works (Khosla et al., 2020; Han et al., 2020) have shown that this is indeed possible via a supervised variant of a contrastive loss (Chopra et al., 2005; Hadsell et al., 2006) that has full access to label information. Informally, this supervised contrastive (SC) loss comprises two competing dynamics: an attraction and a repulsion force. The former pulls representations from the same class (positives) closer together, the latter pushes representations from different classes (negatives) away from each other. A similar mechanic underpins the triplet loss (Weinberger & Saul, 2009), the N-pairs loss (Sohn, 2016), or the soft nearest-neighbor loss (Salakhutdinov & Hinton, 2007; Frosst et al., 2019) and contributes to the the success of self-supervised learning, framed as an instance discrimination task (van den Oord et al., 2018; Chen et al., 2020; H enaff et al., 2020). In the context of the latter, positives are typically deﬁned as different views of the same instance.

Most notably, predictors obtained by ﬁrst learning ϕ via the supervised contrastive loss, followed by a composition with a linear map, not only yield state-of-the-art results on popular benchmarks, but show increased robustness towards input corruptions and hyperparameter choices (Khosla et al., 2020). This warrants a closer analysis of the underlying effects. While we focus on the formulation of Khosla et al. (2020), a similar analysis most likely holds for related vari-

Dissecting Supervised Constrastive Learning

SC= 9.029 SC= 8.800 SC= 7.513 SC= 4.608 SC= 4.598 SC= 4.596

Progress CE = 1.322 CE = 1.015 CE = 0.867 CE = 0.710 CE = 0.499 CE = 0.265

Figure 1: Loss comparison on a three-class toy problem in 2D with 100 points (zi) per class. Left to right indicates optimization progress. The top row shows the point conﬁgurations while minimizing the supervised contrastive (SC) loss, w.r.t. z, on points drawn uniformly on S1. The bottom row shows the point conﬁgurations when minimizing cross-entropy (CE) over softmax(Wz) scores, w.r.t. W and z (and an L2 penalty λ W 2 F ), on points drawn uniformly within the unit disc. For the CE loss, gray discs ( ) indicate the weights, the rays show the direction of the weights. In both cases, the zi with equal label collapse to the vertices of a regular simplex.

ants. Speciﬁcally, we take a ﬁrst step toward understanding potential differences in the output space of the encoder, induced either by (1) minimizing (softmax) cross-entropy over W ϕ, or (2) minimizing the supervised contrastive loss directly over the outputs of ϕ. Characterizing the soughtfor geometric arrangement of representations of training instances, at minimal loss, is an immediate starting point. Our analysis yields two insights, summarized below:

Insight 1 (theoretical). Under the assumption of an encoder ϕ that is powerful enough to realize any geometric arrangement of the representations in Z, we analyze all loss minimizing conﬁgurations of the supervised contrastive and cross-entropy loss, respectively. More precisely, we prove (see 3.2) that the supervised contrastive loss (see Deﬁnition 2) attains its minimum if and only if the representations of each class collapse to the vertices of an origin-centered regular K 1 simplex, cf. Fig. 3. For the cross-entropy loss, we prove a similar, but more nuanced result (see 3.1) which is supplemental to an existing line of research. In particular, under a norm constraint on the outputs of ϕ, we show that (1) representations also collapse to the vertices of an origin-centered regular K 1 simplex and (2) the classiﬁer weights are (positive) scalar multiples of the simplex vertices. Additionally, when subject to L2 penalization, the weights attain equal norm, characterized by a function of the regularization strength. Fig. 1 visualizes the convergence to such a conﬁguration on a toy example. In 4, we link these results to recent prior work, where an evenly spaced arrangement of classiﬁer weights on the unit hypersphere is either prescribed or explicitly enforced.

Insight 2 (empirical). While our theoretical results assume an ideal encoder, we provide empirical evidence on popular vision benchmarks, that the sought-for regular simplex

conﬁgurations can be attained in practice. Yet, networks trained with the supervised contrastive loss (1) tend to converge to a state closer to the loss minimizing conﬁguration and (2) empirically yield better generalization performance. Hence, as loss minimization strives for a similar geometry of the encoder output for both loss functions (cf. Insight 1), we conjecture that differing optimization dynamics are the primary cause for obtaining solutions of different quality. One striking difference is observed when training on data with an increasing fraction of randomly ﬂipped labels, illustrated in Fig. 2 for a Res Net-18 (CIFAR10), trained with (1) cross-entropy and (2) the supervised contrastive loss (with a subsequently optimized linear classiﬁer W).

0.0 0.2 0.4 0.6 0.8 1.0 Label corruption

Time to ﬁt (max. 100k)

Cross-Entropy (CE) Supervised Contrastive (SC)

Figure 2: Time to ﬁt of a Res Net-18 (on CIFAR10) as a function of increasing label corruption. The red square ( ) marks the point at which zero training error can no longer be achieved.

While Zhang et al. (2017) report an approximately linear increase in the time to ﬁt1 for networks trained with crossentropy, training with the supervised contrastive loss exhibits a clearly superlinear behavior. In fact, for a given iteration budget, ﬁtting becomes impossible beyond a certain level of label corruption. This suggests that the supervised contrastive loss exerts some form of implicit regularization during optimization, yielding a parameter incarnation of the network which effectively prevents ﬁtting to random labels.

1i.e., the number of iterations to reach zero training error.

Dissecting Supervised Constrastive Learning

Overview. 2 and 3 provide the technical details that underpin Insight 1. 4 draws connections to prior work and 5 presents further experiments along the lines of Insight 2. 6 concludes with a discussion of the main points.

2. Preliminaries

Consider a supervised learning task with N N training samples, i.e., the learner has access to data X = (x1, . . . , x N) X N, drawn i.i.d. from some distribution, and labels, {1, . . . , K} = [K] yn = c(xn), assigned to each xn by an unknown function c : X [K].

We denote the unit-hypersphere (in Rh) of radius ρ > 0 by Sh 1 ρ = {x Rh : x = ρ}; in case of ρ = 1, we write Sh 1. The map ϕθ : X Z Rh identiﬁes an encoder (see 1), parametrized by a neural network with parameters θ; we write Zθ = ϕθ(x1), . . . ϕθ(x N) as the image of X under ϕθ. When required (e.g., in 3.2), we denote a batch by B, and identify the batch as the multi-set of indices {{n1, . . . , nb}} with ni [N]. For our analysis of the supervised contrastive loss, in 3.2, we assume b 3.

Under the assumption of a powerful enough encoder, i.e., a map ϕθ that can realize every possible geometric arrangement, Zθ, of the representations, we can decouple the loss formulations from the encoder. This facilitates to interpret Zθ as a free conﬁguration Z = (z1, . . . , z N) of N labeled points (hence, we can omit the dependency on θ).

2.1. Deﬁnitions

For our purposes, we deﬁne the CE and SC loss, resp., as the loss over all N instances in Z. In case of the CE loss, this is the average over all instance losses; in case of the SC loss, we sum over all batches of size b N. While the normalizing constant is irrelevant for our results, we point out that normalizing the SC loss would depend on the cardinality of all multi-sets of size b.

Deﬁnition 1 (Cross-entropy loss). Let Z Rh and let Z be an N point conﬁguration, Z = (z1, . . . , z N) ZN, with labels Y = (y1, . . . , y N) [K]N; let wy be the y-th row of the linear classiﬁers weight matrix W RK h. The cross-entropy loss LCE( , W; Y ) : ZN R is deﬁned as

n=1 ℓCE(Z, W; Y, n) (2)

with ℓCE( , W; Y, n) : ZN R given by

ℓCE(Z, W; Y, n) = log

exp( zn, wyn )

l=1 exp( zn, wl )

Deﬁnition 2 (Supervised contrastive loss). Let Z = Sh 1 ρZ Rh and let Z be an N point conﬁguration, Z = (z1, . . . , z N) ZN, with labels Y = (y1, . . . , y N) [K]N. For a ﬁxed batch size b N, we deﬁne

B = {{{n1, . . . , nb}} : n1, . . . , nb [N]} (4)

as the set of index multi-sets of size b. The supervised contrastive loss LSC( ; Y ) : ZN R is deﬁned as

B B ℓSC(Z; Y, B) (5)

with ℓSC( ; Y, B) : ZN R given by2

j Byi\{{i}} log

k B\{{i}} exp zi, zk

(6) where Byi = {{j B : yj = yi}} denotes the multi-set of indices in a batch B B with label equal to yi. 3

As the regular simplex, inscribed in a hypersphere, will play a key role in our results, we formally deﬁne this object next:

Deﬁnition 3 (ρ-Sphere-inscribed regular simplex). Let h, K N with K h + 1. We say that ζ1, . . . , ζK Rh form the vertices of a regular simplex inscribed in the hypersphere of radius ρ > 0, if and only if the following conditions hold:

i [K] ζi = 0

(S2) ζi = ρ for i [K]

(S3) d R : d = ζi, ζj for 1 i < j K

Fig. 3 shows such conﬁgurations (for K = 2, 3, 4) on S2.

Remark 1. The assumption K h + 1 is crucial, as it is a necessary and sufﬁcient condition for the existence of the regular simplex. In our context, K denotes the number of classes and K h + 1 is typically satisﬁed, as the output spaces of encoders in contemporary neural networks are

2for notational reasons, we set 0

0 = 0 when |By| = 1 3Deﬁnition 2 differs from the original deﬁnition by Khosla et al. (2020) in the following aspects: First, we do not explicitly duplicate batches (e.g., by augmenting each instance). For ﬁxed index n, this does not guarantee that at least one other instance with label equal to yn exists. However, this is formally irrelevant, as the contribution to the summation is zero in that case. Nevertheless, batch duplication is subsumed in our deﬁnition. Second, we adapt the deﬁnition to multi-sets, allowing for instances to occur more than once. If batches are drawn with replacement, this could indeed happen in practice. Third, we omit scaling the inner products , in Eq. (6) by a temperature parameter 1/τ, τ > 0, as this complicates the notation. Instead we implicitly subsume this scaling into the radius ρZ of Sh 1 ρZ .

Dissecting Supervised Constrastive Learning

Figure 3: Regular simplices inscribed in S2.

high-dimensional, e.g., 512-dimensional for a Res Net-18 on CIFAR10/100. If it is violated, then the bounds derived in 3 still hold, but are not tight. Studying the loss minimizing conﬁgurations in this regime is much harder. Even for the related and more studied Thomson problem of minimizing the potential energy of K equally charged particles on the 2-dimensional sphere, the minimizers are only known for K {2, 3, 4, 5, 6, 12} (Borodachov et al., 2019).

3. Analysis

We recap that we aim to address the following question: which N point conﬁgurations Z = (z1, . . . , z N) yield minimal CE and SC loss? 3.1 and 3.2 answer this question, assuming a sufﬁciently high dimensional representation space Z Rh, i.e., K h + 1, and balanced class labels Y , i.e., {i [N] : yi = y} = N/K, irrespective of the class y. For detailed proofs we refer to the supplementary material.

3.1. Cross-Entropy Loss

We start by providing a lower bound, in Theorem 1, on the CE loss, under the constraint of norm-bounded points.

Theorem 1. Let ρZ > 0, Z = {z Rh : z ρZ}. Further, let Z = (z1, . . . , z N) ZN be an N point conﬁguration with labels Y = (y1, . . . , y N) [K]N and let W RK h be the weight matrix of the linear classiﬁer from Deﬁnition 1. If the label conﬁguration Y is balanced,

LCE(Z, W; Y )

1 + (K 1) exp

holds, with equality if and only if there are ζ1, . . . , ζK Rh

(C1) n [N] : zn = ζyn

(C2) {ζy}y form a ρZ-sphere-inscribed regular simplex

(C3) ρW > 0 : y Y : wy = ρW

Importantly, Theorem 1 states that the bound is tight, if and only if all instances with the same label collapse to points and these points form the vertices of a regular simplex,

inscribed in a hypersphere of radius ρZ. Additionally, all weights, wy, have to attain equal norm and have to be scalar multiples of the simplex vertices, thus also forming a regular simplex (inscribed in a hypersphere of radius ρW).

Remark 2. Our result complements recent work by Papyan et al. (2020), where it is empirically observed that training neural predictors as in Eq. (1)4 leads to a within-class covariance collapse of the representations as we continue to minimize the CE loss beyond zero training error. By assuming representations to be Gaussian distributed around each class mean and taking the covariance collapse into account, the regular simplex arrangements of Theorem 1 arise. Speciﬁcally, this is the optimal conﬁguration from the perspective of recovering the correct class labels. While the analysis in (Papyan et al., 2020) is decoupled from the loss function and hinges on a probabilistic argument, we study what happens as the CE loss attains its lower bound; our result, in fact, implies the covariance collapse.

Corollary 1. Let Z, Y, W be deﬁned as in Theorem 1. Upon requiring that y [K] : wy r W, it holds that

LCE(Z, W; Y ) log 1 + (K 1) exp K ρZ r W

with equality if and only if (C1) and (C2) from Theorem 1 are satisﬁed and condition (C3) changes to

(C3r) y Y : wy = r W

Notably, a special case of Corollary 1 appears in Proposition 2 of Wang et al. (2017), covering the case where n : zn = wyn and y Y : wy = l, i.e., equinorm weights and already collapsed classes. Corollary 1 obviates these constraints and provides a more general result, only assuming that n : zn ρZ and y : wy r W. However, constraining the norm of the weights seems artiﬁcial as, in practice, the weights are typically subject to an additional L2 penalty. Corollary 2 directly addresses this connection, showing that applying an L2 penalty of the form λ W 2 F eliminates the necessity of an explicit norm constraint.

Corollary 2. Let Z, Y, W be deﬁned as in Theorem 1. For the L2-regularized objective LCE(Z, W; Y )+λ W 2 F with λ > 0, it holds that

LCE(Z, W; Y ) + λ W 2 F

log 1 + (K 1) exp ρZ K K 1r W(ρZ, λ)

+ λKr W(ρZ, λ)2 ,

4also including bias terms, i.e., Wx + b

Dissecting Supervised Constrastive Learning

where r W(ρZ, λ) > 0 denotes the unique solution, in x, of

2λx ρZ exp( KρZx

K 1 ) + K 1

Equality is attained in the bound if and only if (C1) and (C2) from Theorem 1 are satisﬁed and (C3) changes to

(C3wd) y Y : wy = r W(ρZ,λ)

Corollary 2 differs from Corollary 1 in that the characterization of wy depends on r W(ρZ, λ), i.e., a function of the norm constraint, ρZ, on the points and the regularization strength λ. While r W(ρZ, λ) has, to the best of our knowledge, no closed-form solution, it can be solved numerically. Fig. 1 illustrates the attained regular simplex conﬁguration, on a toy example, in case of added L2 regularization.

It is important to note that the assumed norm-constraint on points in Z is not purely theoretical. In fact, such a constraint often arises5, e.g., via batch normalization (Ioffe & Szegedy, 2015) at the last layer of a network implementing ϕθ. While one could, in principle, derive a normalization dependent bound for the CE loss, it is unclear (to the best of our knowledge) if a regular simplex solution satisfying the corresponding equality conditions always exists.

Numerical Simulation

To empirically assess our theoretical results, we take the toy example from Fig. 1, where we minimize (via gradient descent) the L2 regularized CE loss over W and Z with n : zn 1. This setting corresponds to having an ideal encoder, ϕ, that can realize any conﬁguration of points and matches the assumptions of Corollary 2. Fig. 4 (right) shows that the lower bound, for varying values of the regularization strength λ, closely matches the empirical loss. Additionally, Fig. 4 (left) shows a direct comparison of the empirical weight average, w , vs. the corresponding theoretical value of wy (which is equal for all y in case of minimal loss). These experiments empirically conﬁrm that conditions (C1) and (C2), as well as the adapted condition (C3wd) from Corollary 2 are satisﬁed. In 5, we will see that the sought-for regular simplex conﬁgurations actually arise (with varying quality) when minimizing the L2 regularized CE loss for a Res Net-18 trained on popular vision benchmarks.

3.2. Supervised Contrastive Loss

An analysis of the SC loss, similar to 3.1, is less straightforward. In fact, as the loss is deﬁned over batches, we can not simply sum up per-instance losses to characterize the ideal N point conﬁguration. Instead, we need to consider all batch conﬁgurations of a speciﬁc size b N.

5although it might not be explicitly enforced

Figure 4: Numerical simulation for Corollary 2 (on the toy data of Fig. 1), as a function of the L2 regularization strength λ. The left plot shows the theoretical norm of wy (which is equal for all y at minimal loss) vs. the observed mean norm of the three weights. The right plot shows the theoretical bound vs. the empirical L2 regularized CE loss.

We next state our lower bound for the SC loss with the corresponding equality conditions.

Theorem 2. Let ρZ > 0 and let Z = Sh 1 ρZ . Further, let Z = (z1, . . . , z N) ZN be an N point conﬁguration with labels Y = (y1, . . . , y N) [K]N. If the label conﬁguration Y is balanced, it holds that

l=2 l Ml log l 1 + (b l) exp Kρ2 Z K 1

where Ml = X

y Y |{B B : |By| = l}| .

Equality is attained if and only if the following conditions are satisﬁed. There are ζ1, . . . , ζK Rh such that:

(C1) n [N] : zn = ζyn

(C2) {ζy}y form a ρZ-sphere-inscribed regular simplex

Theorem 2 characterizes the geometric conﬁguration of points in Z at minimal loss. We see that the equality conditions (C1) and (C2) from Theorem 1 equally appear in Theorem 2. This implies that, at minimal loss, each class collapses to a point and these points form a regular simplex.

Considering the guiding principle of the SC loss, i.e., separating instances from distinct classes and attracting instances from the same class, it seems plausible that constraining instances to the hypersphere would yield an evenly distributed arrangement of classes. However, a closer look at the SC loss reveals that this is not obvious by any means. In contrast to the physical (electrostatic) intuition, the involved attraction and repulsion forces are not pairwise, but depend on groups of samples, i.e., batches. Naively, one could try to characterize the loss minimizing conﬁguration of points for each batch separately. Yet, this is destined to fail, as the minimizing arrangement of points in each batch depends on the label conﬁguration; an example is visualized in Fig. 5. Hence, there is no simultaneous minimizer for all

Dissecting Supervised Constrastive Learning

(7, 2, 0) (6, 2, 1) (5, 3, 1) (4, 4, 1) (7, 1, 1) (5, 2, 2) (3, 3, 3)

Figure 5: Illustration of loss minimizing point conﬁgurations of the batch-wise SC loss for varying label conﬁgurations and a batch size b = 9. Colored numbers indicate the multiplicity of each class in the batch.

batch-wise losses. It is therefore crucial to understand the interaction of the attraction and repulsion forces across different batches. We sketch the argument of the proof below and refer to the supplementary material for details.

Proof Idea for Theorem 2

The key idea is to decouple the attraction and repulsion effects from the batch-wise formulation of the loss. Since each batch-wise loss contribution is actually a sum of labelwise contributions, the supervised contrastive loss can be considered as a sum over the Cartesian product of the set of all batches with the set of all labels. We partition this Cartesian product into appropriately constructed subsets, i.e., by label multiplicity. This allows to apply Jensen s inequality to each sum over such a subset. In the resulting lower bound, the repulsion and attraction effects are still allocated to the batches, but encoded more tangibly, i.e., linearly, as sums of inner products. Therefore, their interactions can be analyzed by a combinatorial argument which hinges on the balanced class label assumption. Minimality of the respective sums is attained if and only if (1) all classes are collapsed and (2) the mean of all instances (i.e., ignoring the class label) is zero. The simplex arrangement arises as consequence of (1) & (2) and, additionally, the equality conditions yielded by the previous application of Jensen s inequality, i.e., all intra-class and inter-class inner products are equal.

Numerical Simulation

For a large number of points, numerical computation of the bound in Theorem 2 is infeasible due to the combinatorial growth of the number of batches (even for the toy-example of Fig. 1 with 300 points). Hence, we consider a smaller setup. In particular, we take K = 3 classes, each consisting of 4 points on the unit circle S1, i.e., Z = (z1, . . . , z12), h = 2 and ρ = 1. For a batch size of b = 9, this setup yields a total of 167,960 batches, i.e., the number of combinations with replacement. We initialize the zi as the projection of points sampled from a standard Gaussian distribution and then minimize the SC loss (by stochastic gradient descent for 100k iterations) over the points in Z. Fig. 6 (left) shows that, at convergence, the lower bound on LSC(Z; Y ) closely matches the empirical loss. Fig. 6 (right) shows the SC loss over all batches, highlighting the different loss levels depending on the label conﬁguration in the batch (cf. Fig. 5).

LSC(Z; Y ) Empirical 12.12016 Theory 12.12015

12 14 16 18 SC batch-wise loss level

Nr. of batches

Figure 6: Numerical optimization of the SC loss for toy data on S1. Left: Comparison of mean batch-wise loss with the lower bound from Theorem 2. Right: Histogram (over all 170k batches) of the batch-wise loss values at convergence, showing the inhomogeneity of minimal loss values across batch conﬁgurations (cf. Fig. 5).

4. Related work

We focus on works closely linked to our theoretical results of 3; we refer the reader to (Khosla et al., 2020) (and references therein) for additional background on the supervised contrastive loss and to (Le-Khac et al., 2020) for a general survey on contrastive learning.

Our results on the cross-entropy loss from 3.1 are partially related to a recent stream of research (Soudry et al., 2018; Nacson et al., 2018; Gunasekar et al., 2018) on characterizing the convergence speed and structure of homogeneous linear predictors (W) when minimizing cross-entropy via gradient descent on linearly separable data (i.e., no preceding learned encoder ϕ). In particular, Soudry et al. (2018) show that such predictors converge to the L2 max margin separator. In our setting, the geometric structure of W (and the outputs of ϕ) becomes even more explicit, i.e., the weights reside at the vertices of a regular simplex. This is in line with the special case of equinorm representations and weights, presented in (Wang et al., 2017), and complements a recent optimality result by Papyan et al. (2020) (cf. Remark 2 for details).

Along another line of research, several works focus on controlling geometric properties of the classiﬁer weights. In (Hoffer et al., 2018), for instance, the classiﬁer weights are ﬁxed prior to training, with one choice of the weight matrix being a random orthonormal projection. In this setup, all weights have unit norm, are well separated on the hypersphere, but do not form the vertices of a regular simplex. Yet, this empirically yields fast convergence, reduces the number of learnable parameters and has no negative impact on performance. In (Liu et al., 2018), separation of the classiﬁer weights is achieved via a regularization term based on a Riesz s-potential (Borodachov et al., 2019). While this regularization term can be added to all network layers, in the special case of the linear classiﬁer weights, the sought-for minimal energy (for K h + 1) is again attained once the weights form the vertices of a regular simplex. Recently, Mettes et al. (2019) presented an approach that a-priori positions so called prototypes (one for each class) on the unit hypersphere such that the largest cosine similarity among the prototypes is minimized. Training then reduces to attract-

Dissecting Supervised Constrastive Learning

ing representations towards their corresponding prototypes. Again, in case of K h + 1, this yields a geometric prototype arrangement at the vertices of a regular simplex. In the context of Eq. (1), the prototypes correspond to the classiﬁer weights and are compatible with the argmax decision rule.

Overall, (Hoffer et al., 2018; Liu et al., 2018; Mettes et al., 2019) all control, in one way or the other, the geometric arrangement of classiﬁer weights and thereby, implicitly, the arrangement of the representations. This is decisively different to supervised contrastive learning, where the arrangement of the classiﬁer weights is a consequence of the regular simplex arrangement of the representations at minimal loss. More precisely, if representations are already in a regular simplex conﬁguration, the cross-entropy loss of a subsequently trained linear classiﬁer is minimized if and only if the classiﬁer weights are equinorm and scalar multiples of the simplex vertices (cf. Corollary 1).

We additionally point out that several works have recently started to establish a solid theoretical foundation for using contrastive loss functions in the context of unsupervised representation learning.

Through the concept of latent classes (i.e., a construction formalizing the notion of semantic similarity), Arora et al. (2019) prove generalization bounds for downstream supervised classiﬁcation, under the assumption that the supervised task is deﬁned on a subset of the latent classes. Central to their analysis is the mean classiﬁer which is determined by the means of representations of training inputs with equal label. Notably, they empirically observe that this mean classiﬁer performs well on models trained under full supervision. In light of our theoretical results, this can be easily explained by the fact that, at optimality, representations collapse to the simplex vertices.

The unsupervised counterpart of the objective we study in this work is analyzed by Wang & Isola (2020) from a probabilistic perspective. It is shown that minimizing the (unsupervised) contrastive loss promotes alignment and uniformity of representations on the unit hypersphere, two properties that empirically correlate with good performance on downstream tasks. More precisely, the authors split the (unsupervised) contrastive loss into two summands and show that in the limit of inﬁnite negative samples, asymptotically one is minimized by a peferctly aligned and the other by a perfectly uniform encoder. As pointed out by the authors, if the data is ﬁnite, then there is no encoder which is both, i.e., perfectly aligned and perfectly uniform. Hence, in this case, their analysis does not provide an explicit characterization of the loss minimizer. Complementary to that, our analysis restricts to this very case of ﬁnite (training) data, but is able to characterize the loss minimizer in the supervised setup.

5. Experiments

In any practical setting, we do not have an ideal encoder (as in 3), but an encoder parameterized as a neural network, ϕθ. Hence, in 5.2, we ﬁrst assess whether the regular simplex conﬁgurations actually arise (and to which extent), given a ﬁxed iteration budget during optimization. Second, in 5.3, we study the optimization behavior of models under different loss functions in a series of random label experiments.

As our choice of ϕθ, we select a Res Net-18 (He et al., 2016a) model, i.e., all layers up to the linear classiﬁer. Experiments are conducted on CIFAR10/100, for which this choice yields 512-dim. representations (and K h+1 holds in all cases).

We either compose ϕθ with a linear classiﬁer and train with the CE loss function (denoted as CE), or we directly optimize ϕθ via the SC loss function, then freeze the encoder parameters and train a linear classiﬁer on top (denoted SC). In case of the latter, outputs of ϕθ are always projected onto a hypersphere of radius ρ = 1/ τ (with τ = 0.1), which accounts for scaling the inner-products by the temperature parameter 1/τ in the original formulation of Khosla et al. (2020). We want to stress that while Theorem 2 holds for every ρ > 0, the temperature crucially inﬂuences the optmization dynamics and needs to be tuned appropriately. For comparison, we also compose ϕθ with a ﬁxed linear classiﬁer, in particular, a classiﬁer with weights a-priori optimized towards a regular simplex arrangement. This is similar to (Mettes et al., 2019), only that we minimize the CE loss (denoted as CE-ﬁx) to learn predictors W ϕθ, as opposed to pulling outputs of ϕθ towards the ﬁxed prototypes/weights.

Optimization is done via (mini-batch) stochastic gradient descent with L2 regularization (10 4) and momentum (0.9) for 100k iterations. The batch-size is ﬁxed to 256 and the learning rate is annealed exponentially, starting from 0.1. When using data augmentation, we apply random cropping and random horizontal ﬂipping, each with probability 1/2.

5.2. Theory vs. Practice

To provide a ﬁrst impression to which extend the representations of the training data achieve the loss minimizing geometric arrangement, we compare the empirical CE and SC loss values to the optima derived in 3, using Res Net18 models trained on CIFAR10 (with data augmentation). The theoretical/empirical losses are (1) 7.64e-5 vs. 2.48e4 (for CE) and (2) 824.487 vs. 824.731 (for SC), where we estimate the empirical SC loss over 1k training batches. Notably, when optimizing for 500k iterations (instead of 100k; see supplementary material), the loss values continue to move closer to the optimum, but at very low speed. In particular, the loss values change to 2.27e-4 (for CE) and

Dissecting Supervised Constrastive Learning

0.5 0.6 0.7

Cosine similarity across class means

0.4 0.5 0.6

Cosine similarity across weights

0.8 0.9 1.0

Cosine similarity to class means CIFAR10 err. [%] Loss w aug. w/o aug. CE 6.3 16.4 CE-ﬁx 6.2 14.7 SC 5.7 13.8

0.5 0.6 0.7

Cosine similarity across class means

0.4 0.5 0.6

Cosine similarity across weights

0.8 0.9 1.0

Cosine similarity to class means CIFAR100 err. [%] Loss w aug. w/o aug. CE 27.0 41.8 CE-ﬁx 26.3 41.3 SC 24.9 41.5

Figure 7: Geometric properties of representations, ϕθ(xn), and weights, obtained from Res Net-18 models trained (top: CIFAR10; bottom: CIFAR100) with different losses (using data augmentation, i.e., w aug.). The left three panels show the distribution of cosine similarities across (1) distinct class means (quantifying class separation), (2) distinct class weights (quantifying classiﬁer weight separation) and (3) representations with respect to their respective class means (quantifying within class spread). Red lines indicate the sought-for value at a regular simplex conﬁguration. The right-most panel shows the testing error of all models with and without data augmentation.

824.523 (for SC), respectively. Overall, this suggests that out of these models, representations learned by minimizing the SC loss might arrange closer to the theoretically optimal conﬁguration. As our results only cover the loss minimizers and not (close to optima) level sets, the latter hypothesis is more of a ﬁrst guess and not predicted by the theory.

For a closer look at the geometric arrangement of the representations (and classiﬁer weights), we compute three statistics, all based on the cosine similarity γ : Rh Rh [0, 1], deﬁned as

(x, y) 7 1 cos 1 ( x/ x , y/ y ) /π . (7)

First, we measure the separation of the class representations via the cosine similarity among the class means, µ1, . . . , µK, i.e., γ(µi, µj) for i = j. Second, for the CE loss function, we compute the cosine similarity across the classiﬁer weights, i.e., γ(wi, wj), i = j, quantifying their separation. Third, to quantify class collapse, we compute the cosine similarity among all representations and their respective class means, i.e., γ(ϕ(xn), µyn). Note that our theoretical results imply that the classes should collapse and the pairwise similarities, as mentioned above, should be equal.

Fig. 7 illustrates the distribution of the cosine similarities for the Res Net-18 model trained with different loss functions (and using data augmentation). We observe that the SC loss leads to (1) an arrangement of the class means much closer to the ideal simplex conﬁguration and (2) a tighter concentration of training representations around their class means. Furthermore, in case of the CE loss, the weight arrangement reaches, on average, a regular simplex conﬁguration, while the representations slightly deviate. When using a-priori ﬁxed weights in a simplex conﬁguration, i.e., CE-ﬁx, the situation is similar, but the within-class spread is smaller. In general, the statistics are comparable between CIFAR10 and CIFAR100, only that the distribution of all computed

statistics widens for models trained with CE on CIFAR100. We conjecture that the increase in the number of classes, combined with the joint optimization of ϕθ and W complicates convergence to the loss minimizing state. Fig. 7 (right) further suggests that approaching this state positively correlates with generalization performance. Whether the latter is a general phenomenon, or may even have a theoretical foundation, is an interesting question for future work.

Finally, we draw attention to the comparatively large gap between the cosine similarities across the class means and their theoretical prediction in case of models trained on CIFAR10 (Fig. 1, top left). The aforementioned gap indicates that the chosen encoder might not be powerful enough to arrange the representations on a sphere-inscribed regular simplex. In fact, a standard Res Net (He et al., 2016a) utilizes a Re LU activation function after each block, including the last block before the linear classiﬁer. Therefore, the coordinates of representations obtained by the encoder part of a standard Res Net are always non-negative, and so are the coordinates of the class means. Consequently, their inner products are non-negative as well, which corresponds to a minimal cosine similarity of 0.5 across the class means. Since the scalar products of vertices (considered as position vectors) of a unit sphere inscribed regular simplex with K vertices are 1/(K 1), the deviation from the optimal class separation, resulting from the choice of encoder, is unnoticeable for models trained on CIFAR100 due to the large number of classes, i.e., K = 100, but becomes apparent in case of CIFAR10 where K = 10.

We suspect that architectures which do not implement the aforementioned non-negativity constraint in the encoder, e.g., the pre-activation variants of Res Nets (He et al., 2016b), are capable of separating the classes to a larger extend and thus match the theoretical prediction more closely when trained on data with a small number of classes.

Dissecting Supervised Constrastive Learning

0.00 0.25 0.50 0.75 1.00 Label corruption

Time to ﬁt (max. 100k)

CE-ﬁx CE SC

0.00 0.25 0.50 0.75 1.00 Label corruption

Time to ﬁt (max. 100k)

CE-ﬁx CE SC

Figure 8: Time to ﬁt for models of the form W ϕθ, based on Res Net-18 encoders, optimized under different loss functions.

5.3. Random Label Experiments

Despite the similarity of the loss minimizing geometric arrangements at the output of ϕθ, for both (CE, SC) losses, we have seen (in Fig. 7) that the extent to which this optimal state is achieved differs. These differences likely arise as a result of the underlying optimization dynamics, driven by the loss contribution of each batch. Notably, while the CE loss decomposes into independent instance-wise contributions, the SC loss does not (due to the interaction terms).

One way to explore this in greater detail, is to study optimization behavior as a function of label corruption. Specifically, as label corruption (i.e., the fraction of randomly ﬂipped labels) increases, it is interesting to track the number of iterations (time to ﬁt) to reach zero training error (Zhang et al., 2017), illustrated in Fig. 8.

On both datasets, CE and CE-ﬁx show an approximately linear growth, while SC shows a remarkably superlinear growth. We argue that the latter primarily results from the profound interaction among instances in a batch. Intuitively, as the number of attraction terms for the SC loss function scales quadratically with the number of samples per class, increasing the number of semantically confounding labels equally increases the complexity of the optimization problem. In contrast, for CE and CE-ﬁx, semantically confounding labels only impose per-instance constraints instead. This equally explains why, on CIFAR10, SC cannot achieve zero error beyond 80% corruption: fewer training instances per class (500 vs. 5,000) yield fewer pairwise intra-class constraints to be met.

6. Discussion

By focusing on predictors argmax W ϕ, our results assert that the outputs of ϕ are strikingly similar, at minimal loss, irrespective of whether we train with cross-entropy or in the supervised contrastive regime.

Yet, from an optimization perspective, the choice of loss makes a profound difference, visible in the differing resilience to ﬁt in the presence of corrupt label information. We argue that the advantages of supervised contrastive learning, reported in prior work, are rooted in the strong interac-

tion terms among samples in a batch. While cross-entropy acts sample-wise, the supervised contrastive loss considers pair-wise sample relations, i.e., a batch is an atomic computational unit during stochastic optimization; in case of cross-entropy, the atomic unit is a single sample instead.

While we simpliﬁed the original setup of supervised contrastive learning, in particular, by detaching the commonly used projective head, we hope that our results provide a viable starting point for further analyses. Speciﬁcally, we think that a better theoretical understanding of the profound interaction between stochastic optimization and loss functions that capture pairwise constraints (rather than instance losses), could be a promising avenue to be explored in the context of the generalization puzzle.

Acknowledgements

This research was supported in part by the Austrian Science Fund (FWF): project FWF P31799-N38 and the Land Salzburg (WISS 2025) under project numbers 20102F1901166-KZP and 20204-WISS/225/197-2019. We also like to thank the anonymous reviewers for the constructive feedback during the review process.

Source Code

Source code to reproduce experiments is publicly available: https://github.com/plus-rkwitt/py supcon vs ce

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019.

Borodachov, S., Hardin, D., and Saff, E. Discrete energy on rectiﬁable sets. Springer, 2019.

Chen, X., Fan, H., Girshick, R., and He, K. Improved baselines with momentum contrastive learning. ar Xiv preprint, 2020. ar Xiv:2003.04297v1 [cs.CV].

Chopra, S., Hadsell, R., and Le Cun, Y. Learning a similarity metric discriminatively, with application to face veriﬁcation. In CVPR, 2005.

Frosst, N., Papernot, N., and Hinton, G. Analyzing and improving representations with the soft nearest neighbor loss. In ICML, 2019.

Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Characterizing implicit bias in terms of optimization geometry. In ICML, 2018.

Hadsell, R., Chopra, S., and Le Cun, Y. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.

Dissecting Supervised Constrastive Learning

Han, T., Xie, W., and Zisserman, A. Self-supervised cotraining for video representation learning. In Neur IPS, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In ECCV, 2016b.

H enaff, O., Srinivas, A., De Fauw, J., Razawi, A., Doersch, C., Ali Eslami, S., and van der Oord, A. Data-efﬁcient image recognition with contrastive predictive coding. In ICML, 2020.

Hoffer, E., Hubara, I., and Soudry, D. Fix your classiﬁer: the marginal value of training the last weight layer. In ICLR, 2018.

Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Densely connected convolutional networks. In CVPR, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. Supervised contrastive learning. In Neur IPS, 2020.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classiﬁcation with deep convolutional neural networks. In NIPS, 2012.

Le-Khac, P., Healy, G., and Smeaton, A. Contrastive representation learning: A framework and review. IEEE Access, 8:193907 193934, 2020.

Liu, W., Lin, R., Liu, Z., Liu, L., Yu, Z., Dai, B., and Song, L. Learning towards minimal hyperspherical energy. In Neur IPS, 2018.

Mettes, P., van der Pol, E., and Snoek, C. Hyperspherical prototype networks. In Neur IPS, 2019.

Nacson, M., Lee, J., Gunasekar, S., Savarese, P., Srebro, N., and Soudry, D. Convergence of gradient descent on separable data. In AISTATS, 2018.

Papyan, V., Han, X., and Donoho, D. Prevalence of neural collapse during the terminal phase of deep learning training. PNAS, 117(40):24652 24663, 2020.

Salakhutdinov, R. and Hinton, G. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, 2007.

Sohn, K. Improved deep metric learning with multi-class N-pair loss objective. In NIPS, 2016.

Soudry, D., Hoffer, E., Nacson, M., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. JMLR, 19:1 57, 2018.

van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. ar Xiv preprint, 2018. ar Xiv:1807.03748v2 [cs.LG].

Wang, F., Xiang, X., Cheng, J., and Yuille, A. Norm Face: l2 hypersphere embedding for face veriﬁcation. In ACM Multimedia, 2017.

Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.

Weinberger, K. and Saul, L. Distance metric learning for large margin nearest neighbor classiﬁcation. JMLR, 10: 207 244, 2009.

Zagoruyko, S. and Komodakis, N. Wide residual networks. In BMVC, 2016.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.