Published as a conference paper at ICLR 2022

UNDERSTANDING DIMENSIONAL COLLAPSE IN CONTRASTIVE SELF-SUPERVISED LEARNING

Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian
Facebook AI Research
{ljng, pascal, yann, yuandong}@fb.com

ABSTRACT

Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. The joint embedding approach is based on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that lead to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector. Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.

1 INTRODUCTION

Self-supervised learning aims to learn useful representations of the input data without relying on human annotations. Recent advances in self-supervised visual representation learning based on joint embedding methods (Misra & Maaten, 2020b; He et al., 2020; Chen et al., 2020a; Chen & He, 2020; Grill et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Chen et al., 2020b; Dwibedi et al., 2021; Li et al., 2021; Misra & Maaten, 2020a; HaoChen et al., 2021; Assran et al., 2021; Caron et al., 2021) show that self-supervised representations achieve competitive performance compared with supervised ones. These methods generally aim to learn representations that are invariant to data augmentations by maximizing the agreement between embedding vectors from different distortions of the same images. As there are trivial solutions where the model maps all inputs to the same constant vector, known as the collapsing problem, various methods relying on different mechanisms have been proposed to solve this problem. Contrastive methods like Chen et al. (2020a) and He et al. (2020) define positive and negative sample pairs which are treated differently in the loss function. Non-contrastive methods like Grill et al. (2020) and Chen & He (2020) use a stop-gradient and an extra predictor to prevent collapse without negative pairs; Caron et al. (2018; 2020) use an additional clustering step; and Zbontar et al. (2021) minimize the redundant information between the two branches. These self-supervised learning methods are successful in preventing complete collapse, whereby all representation vectors shrink to a single point. However, it has been observed empirically in non-contrastive learning methods (Hua et al., 2021; Tian et al., 2021) that while embedding vectors do not completely collapse, they do collapse along certain dimensions. This is known as dimensional collapse (Hua et al., 2021), whereby the embedding vectors only span a lower-dimensional subspace.
In contrastive methods that explicitly use positive and negative pairs in the loss function, it seems intuitive to speculate that the repulsive effect of negative examples should prevent this kind of dimensional collapse and make full use of all dimensions. However, contrary to this intuition, contrastive learning methods still suffer from dimensional collapse (see Figure 7). In this work, we theoretically study the dynamics behind this phenomenon. We show that there are two different mechanisms that cause collapsing: (1) along the feature directions where the variance caused by the data augmentation is larger than the variance caused by the data distribution, the weight collapses; and (2) even if the covariance of the data augmentation has a smaller magnitude than the data variance along all dimensions, the weight still collapses due to the interplay of weight matrices at different layers, known as implicit regularization. This second kind of collapsing happens only in networks with more than one layer. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the encoder (i.e., the representation space) without relying on a trainable projector. DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.

We summarize our contributions as follows:

- We empirically show that contrastive self-supervised learning suffers from dimensional collapse, whereby the embedding vectors fall into a lower-dimensional subspace instead of the entire available embedding space.
- We show that there are two mechanisms causing dimensional collapse in contrastive learning: (1) strong augmentation along feature dimensions and (2) implicit regularization driving models toward low-rank solutions.
- We propose DirectCLR, a novel contrastive learning method that directly optimizes the representation space without relying on a trainable projector. DirectCLR outperforms SimCLR with a trainable linear projector.

2 RELATED WORKS

Self-supervised Learning Methods. Joint embedding methods are a promising approach in self-supervised learning, whose principle is to match the embedding vectors of augmented views of a training instance. Contrastive methods (Chen et al., 2020a; He et al., 2020) directly compare training samples by effectively viewing each sample as its own class, typically based on the InfoNCE contrastive loss (van den Oord et al., 2018), which encourages representations from positive pairs of examples to be close in the embedding space while representations from negative pairs are pushed away from each other. In practice, contrastive methods are known to require a large number of negative samples. Non-contrastive methods do not directly rely on explicit negative samples. These include clustering-based methods (Caron et al., 2018; 2020), redundancy reduction methods (Zbontar et al., 2021; Bardes et al., 2021), and methods using special architecture designs (Grill et al., 2020; Chen & He, 2020).

Theoretical Understanding of Self-supervised Learning. Although self-supervised learning models have shown success in learning useful representations and have outperformed their supervised counterparts in several downstream transfer learning benchmarks (Chen et al., 2020a), the underlying dynamics of these methods remains somewhat mysterious and poorly understood. Several theoretical works have attempted to understand it.
Arora et al. (2019b); Lee et al. (2020); Tosh et al. (2021) theoretically proved that the representations learned via contrastive learning are useful for downstream tasks. Tian et al. (2021) explained why non-contrastive learning methods like BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2020) work: the dynamics of the alignment of eigenspaces between the predictor and its input correlation matrix plays a key role in preventing complete collapse.

Implicit Regularization. It has been shown theoretically that gradient descent drives adjacent matrices to become aligned in a linear neural network setting (Ji & Telgarsky, 2019). Under the aligned-matrix assumption, Gunasekar et al. (2018) prove that gradient descent finds the minimal nuclear norm solution. Arora et al. (2019a) extend this concept to the deep linear network case by theoretically and empirically demonstrating that a deep linear network finds low-rank solutions. In general, over-parametrized neural networks tend to find flatter local minima (Saxe et al., 2019; Neyshabur et al., 2019; Soudry et al., 2018; Barrett & Dherin, 2021).

3 DIMENSIONAL COLLAPSE

Self-supervised learning methods learn useful representations by minimizing the distances between embedding vectors from augmented images (Figure 1a). On its own, this would result in a collapsed solution where the produced representation becomes constant (Figure 1b). Contrastive methods prevent complete collapse via a negative term that pushes embedding vectors of different input images away from each other. In this section, we show that while they prevent complete collapse, contrastive methods still experience dimensional collapse, in which the embedding vectors occupy a lower-dimensional subspace than their ambient dimension (Figure 1c).

Figure 1: Illustration of the collapsing problem: (a) embedding space, (b) complete collapse, (c) dimensional collapse. Under complete collapse, the embedding vectors collapse to the same point. Under dimensional collapse, the embedding vectors only span a lower-dimensional subspace.

We train a SimCLR model (Chen et al., 2020a) with a two-layer MLP projector. We follow the standard recipe and train the model on ImageNet for 100 epochs. We evaluate the dimensionality by collecting the embedding vectors on the validation set. Each embedding vector has a size of d = 128. We compute the covariance matrix $C \in \mathbb{R}^{d\times d}$ of the embedding layer (here $\bar{z} := \sum_{i=1}^{N} z_i / N$ and N is the total number of samples):

$$C = \frac{1}{N}\sum_{i=1}^{N} (z_i - \bar{z})(z_i - \bar{z})^T \qquad (1)$$

Figure 2 shows the singular value decomposition of this matrix ($C = USV^T$, $S = \mathrm{diag}(\sigma^k)$) in sorted order and logarithmic scale ($\{\log(\sigma^k)\}$). We observe that a number of singular values collapse to zero, thus representing collapsed dimensions.

Figure 2: Singular value spectrum of the embedding space. The embedding vectors are computed from a pretrained SimCLR model on the ImageNet validation set. Each embedding vector has a dimension of 128. The spectrum contains the singular values of the covariance matrix of these embedding vectors in sorted order and logarithmic scale. A number of singular values drop to zero, indicating collapsed dimensions.
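As a concrete reference, a minimal PyTorch sketch of this measurement might look as follows; `embeddings` is assumed to be the (N, 128) tensor of embedding vectors collected on the validation set, and the function name is ours, not from the paper's codebase.

```python
import torch

def singular_value_spectrum(embeddings: torch.Tensor) -> torch.Tensor:
    """Log singular value spectrum of the embedding covariance matrix (Eqn. 1).

    embeddings: an (N, d) tensor of embedding vectors (d = 128 for the SimCLR model above).
    """
    z = embeddings - embeddings.mean(dim=0, keepdim=True)  # center: z_i - z_bar
    cov = z.T @ z / z.shape[0]                             # C = sum_i (z_i - z_bar)(z_i - z_bar)^T / N
    # C is symmetric PSD, so its singular values coincide with its eigenvalues.
    sigma = torch.linalg.svdvals(cov)                      # returned in descending order
    return torch.log(sigma + 1e-12)                        # log scale, as plotted in Figure 2
```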
4 DIMENSIONAL COLLAPSE CAUSED BY STRONG AUGMENTATION

4.1 LINEAR MODEL

In this section, we explain one scenario in which contrastive learning ends up with collapsed embedding dimensions: when the augmentation overwhelms the information in the input. We focus on a simple linear network setting. We denote the input vector by $x$, and the augmentation is additive noise. The network is a single linear layer with weight matrix $W$, so the embedding vector is $z = Wx$. We focus on a typical contrastive loss, InfoNCE (van den Oord et al., 2018):

$$L = -\sum_{i=1}^{N} \log \frac{\exp\!\big(-\|z_i - z_i'\|^2/2\big)}{\sum_{j\neq i}\exp\!\big(-\|z_i - z_j\|^2/2\big) + \exp\!\big(-\|z_i - z_i'\|^2/2\big)} \qquad (2)$$

where $z_i$ and $z_i'$ are a pair of embedding vectors from the two branches and the $z_j$ are the negative samples within the minibatch. When all $z_i$ and $z_i'$ are normalized to be unit vectors, the negative squared distance $-\|z_i - z_i'\|^2/2$ can be replaced by the inner product $z_i^T z_i'$. The model is trained with basic stochastic gradient descent without momentum or weight decay.

4.2 GRADIENT FLOW DYNAMICS

We study the dynamics via gradient flow, i.e., gradient descent with an infinitesimally small learning rate.

Lemma 1. The weight matrix in a linear contrastive self-supervised learning model evolves by

$$\dot{W} = -G \qquad (3)$$

where $G = \sum_i (g_{z_i}x_i^T + g_{z_i'}x_i'^T)$ and $g_{z_i}$ is the gradient of the loss with respect to the embedding vector $z_i$ (similarly for $g_{z_i'}$).

This follows directly from the chain rule; see the proof in Appendix B.1. For the InfoNCE loss defined in Eqn. 2, the gradient with respect to the embedding vectors of each branch can be written as

$$g_{z_i} = \sum_{j\neq i}\alpha_{ij}(z_j - z_i') + \sum_{j\neq i}\alpha_{ji}(z_j - z_i), \qquad g_{z_i'} = \sum_{j\neq i}\alpha_{ij}(z_i' - z_i) \qquad (4)$$

where the $\{\alpha_{ij}\}$ are softmax weights over the pairwise similarities, defined by $\alpha_{ij} = \exp(-\|z_i - z_j\|^2/2)/Z_i$, $\alpha_{ii} = \exp(-\|z_i - z_i'\|^2/2)/Z_i$, and $Z_i = \sum_{j\neq i}\exp(-\|z_i - z_j\|^2/2) + \exp(-\|z_i - z_i'\|^2/2)$, so that $\sum_j \alpha_{ij} = 1$. Since $z_i = Wx_i$ and $z_i' = Wx_i'$, we have

$$G = -WX \qquad (5)$$

with

$$X = \sum_i\Big[\sum_{j\neq i}\alpha_{ij}(x_i' - x_j) + \sum_{j\neq i}\alpha_{ji}(x_i - x_j)\Big]x_i^T - \sum_i(1-\alpha_{ii})(x_i' - x_i)x_i'^T \qquad (6)$$

Lemma 2. X is a difference of two PSD matrices:

$$X = \hat{\Sigma}_0 - \hat{\Sigma}_1 \qquad (7)$$

Here $\hat{\Sigma}_0 = \sum_i\sum_{j\neq i}\alpha_{ij}(x_i - x_j)(x_i - x_j)^T$ is a weighted data distribution covariance matrix and $\hat{\Sigma}_1 = \sum_i(1-\alpha_{ii})(x_i' - x_i)(x_i' - x_i)^T$ is a weighted augmentation distribution covariance matrix.

See the proof in Appendix B.2. Therefore, the amplitude of the augmentation determines whether X is a positive-definite matrix. Similar to Theorems 3-4 in Tian et al. (2020), Lemma 2 also models the time derivative of the weight W as a product of W and a symmetric and/or PSD matrix. However, Lemma 2 is much more general: it applies to InfoNCE with multiple negative contrastive terms, remains true when $\alpha_{ij}$ varies with the sample pair (i, j), and holds for finite batch size N. In contrast, Theorem 4 in Tian et al. (2020) only works for one negative term in InfoNCE, holds only in the population sense (i.e., $N \to +\infty$), and its formulation has residual terms if the $\alpha_{ij}$ are not constants.

Next, we look into the dynamics of the weight matrix W given the properties of X.

Theorem 1. With a fixed matrix X (defined in Eqn. 6) and strong augmentation such that X has negative eigenvalues, the weight matrix W develops vanishing singular values.

See the proof in Appendix B.3.

Corollary 1 (Dimensional Collapse Caused by Strong Augmentation). With strong augmentation, the embedding space covariance matrix becomes low-rank.

The embedding space is characterized by the singular value spectrum of the covariance matrix of the embeddings (Eqn. 1), $C = \sum_i(z_i - \bar{z})(z_i - \bar{z})^T/N = \sum_i W(x_i - \bar{x})(x_i - \bar{x})^T W^T / N$. Since W has vanishing singular values, C is also low-rank, indicating collapsed dimensions.

Numerical simulation verifies our theory. We choose the input data to be an isotropic Gaussian with covariance matrix $\sum_{i,j}(x_i - x_j)(x_i - x_j)^T / N = I$, and the augmentation to be additive Gaussian noise with covariance matrix $\sum_i(x_i' - x_i)(x_i' - x_i)^T / N = \mathrm{blockdiag}(0, kI)$, where each block has size 8x8.
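This toy experiment can be sketched in PyTorch as follows. The loss implements Eqn. 2 with negative squared distances as logits; the dimensions (16, with an 8x8 augmented block) follow the text, while the batch size, learning rate, and step count are illustrative assumptions rather than values from the paper.

```python
import torch

def infonce_distance_loss(z, z_prime):
    """InfoNCE (Eqn. 2) with negative squared distances as logits."""
    n = z.shape[0]
    # |z_i - z_j|^2 / 2 for every pair (the negatives); the diagonal is masked out below
    d_neg = ((z.unsqueeze(1) - z.unsqueeze(0)) ** 2).sum(dim=-1) / 2
    d_pos = ((z - z_prime) ** 2).sum(dim=1) / 2                    # |z_i - z'_i|^2 / 2
    neg_logits = (-d_neg).masked_fill(torch.eye(n, dtype=torch.bool), float('-inf'))
    logits = torch.cat([(-d_pos).unsqueeze(1), neg_logits], dim=1)  # positive in column 0
    return -torch.log_softmax(logits, dim=1)[:, 0].mean()

torch.manual_seed(0)
d, block, n, steps, lr = 16, 8, 512, 2000, 0.1     # dims follow the toy setup; the rest is illustrative
k = 1.0                                            # augmentation amplitude; vary to reproduce Figure 3
aug_std = torch.zeros(d)
aug_std[block:] = k ** 0.5                         # augmentation covariance = blockdiag(0, k I)
W = torch.randn(d, d, requires_grad=True)
for _ in range(steps):
    x = torch.randn(n, d)                          # isotropic Gaussian data (covariance I)
    x_aug = x + torch.randn(n, d) * aug_std        # additive Gaussian augmentation noise
    loss = infonce_distance_loss(x @ W.T, x_aug @ W.T)
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad                           # plain SGD, no momentum or weight decay
        W.grad = None
print(torch.linalg.svdvals(W.detach()))            # small singular values appear for large k
```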
We plot the weight matrix singular value spectrum in Figure 3 for various augmentation amplitudes k. This confirms that, in the linear network setting, strong augmentation leads to dimensional collapse in the embedding space.

Figure 3: Weight matrix singular value spectrum for different augmentation amplitudes k (from k = 0.1 to k = 2). The setting is a single-layer linear toy model with a 16x16 weight matrix, where the augmented block has size 8x8. Strong augmentation results in vanishing singular values in the weight matrix.

Our theory in this section is limited to the linear network setting. For more complex nonlinear networks, the collapsing condition still depends on strong augmentation, but "strong" must be interpreted differently: it is determined by more complicated properties of the augmentation (higher-order statistics of the augmentation, the manifold structure of the augmentation versus the data distribution) conditioned on the capacity of the network.

5 DIMENSIONAL COLLAPSE CAUSED BY IMPLICIT REGULARIZATION

5.1 TWO-LAYER LINEAR MODEL

With strong augmentation, a linear model under the InfoNCE loss exhibits dimensional collapse. However, this scenario relies on the condition that the network has limited capacity, which may not hold in real cases. On the other hand, when there is no strong augmentation ($\hat{\Sigma}_1 \prec \hat{\Sigma}_0$) and thus the matrix X remains PSD, a single-layer linear model will not suffer dimensional collapse. Interestingly, for deep networks, dimensional collapse still happens in practice. In the following, we show that it stems from a different cause: implicit regularization, where over-parametrized linear networks tend to find low-rank solutions.

To understand this counter-intuitive phenomenon, we start with the simplest over-parametrized setting by choosing the network to be a two-layer linear MLP without bias (Figure 4). The weight matrices of the two layers are denoted by $W_1 \in \mathbb{R}^{d\times d}$ and $W_2 \in \mathbb{R}^{d\times d}$. Similar to the setting in Sec. 4, the input vector is denoted by x and the augmentation is additive noise. The embedding vector from each branch is $z = W_2W_1x$, hence $z \in \mathbb{R}^d$; we do not normalize z. We use the InfoNCE loss defined in Eqn. 2. The model is trained with basic stochastic gradient descent without momentum or weight decay.

Figure 4: Two-layer linear model.

5.2 GRADIENT FLOW DYNAMICS

Similar to Lemma 1, we derive the gradient flow on the two weight matrices $W_1$ and $W_2$.

Lemma 3. The weight matrices of the two-layer linear contrastive self-supervised learning model evolve by (with $G = \sum_i(g_{z_i}x_i^T + g_{z_i'}x_i'^T)$ as defined in Lemma 1):

$$\dot{W}_1 = -W_2^T G, \qquad \dot{W}_2 = -G W_1^T \qquad (8)$$

This follows directly from the chain rule; see the proof in Appendix B.4. For the two-layer case, similar to Eqn. 5, G has the specific form

$$G = -W_2 W_1 X \qquad (9)$$

where X is defined in Eqn. 6. According to Lemma 2, with small augmentation, $X = \hat{\Sigma}_0 - \hat{\Sigma}_1 \succ 0$ is a positive-definite matrix.
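The two-layer toy experiment can be sketched the same way, reusing the `infonce_distance_loss` helper from the sketch in Section 4; the training hyperparameters and the weak-augmentation amplitude below are illustrative assumptions. The quantities it prints, the alignment matrix and the singular values, are exactly what the next subsection analyzes (Figures 5 and 6).

```python
import torch

# Assumes `infonce_distance_loss` from the Section 4 sketch. Weak additive augmentation keeps X PSD.
torch.manual_seed(0)
d, n, steps, lr, aug_std = 16, 512, 4000, 0.05, 0.1
W1 = torch.randn(d, d, requires_grad=True)
W2 = torch.randn(d, d, requires_grad=True)
for _ in range(steps):
    x = torch.randn(n, d)
    x_aug = x + aug_std * torch.randn(n, d)
    loss = infonce_distance_loss(x @ W1.T @ W2.T, x_aug @ W1.T @ W2.T)  # z = W2 W1 x
    loss.backward()
    with torch.no_grad():
        for W in (W1, W2):
            W -= lr * W.grad       # plain SGD, no momentum or weight decay
            W.grad = None

U1, S1, _ = torch.linalg.svd(W1.detach())
_, S2, V2h = torch.linalg.svd(W2.detach())
print((V2h @ U1).abs())                           # |V2^T U1|: near (block-)identity, cf. Figure 5
print(torch.linalg.svdvals((W2 @ W1).detach()))   # smallest singular values lag behind, cf. Figure 6
```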
5.3 WEIGHT ALIGNMENT

Since we have two matrices $W_1$ and $W_2$, the first question is how they interact with each other. We apply the singular value decomposition to both matrices, i.e., $W_1 = U_1S_1V_1^T$, $W_2 = U_2S_2V_2^T$ with $S_1 = \mathrm{diag}([\sigma_1^k])$ and $S_2 = \mathrm{diag}([\sigma_2^k])$. The alignment is governed by the interaction between the adjacent orthonormal matrices $V_2 = [v_2^k]$ and $U_1 = [u_1^k]$. This can be characterized by the alignment matrix $A = V_2^T U_1$, whose $(k, k')$-entry represents the alignment between the k-th right singular vector $v_2^k$ of $W_2$ and the k'-th left singular vector $u_1^{k'}$ of $W_1$. The following theorem shows that $W_1$ and $W_2$ indeed align.

Theorem 2 (Weight matrices align). If for all t, $W_2(t)W_1(t) \neq 0$, X(t) is positive-definite, and $W_1(+\infty)$, $W_2(+\infty)$ have distinct singular values, then the alignment matrix $A = V_2^T U_1 \to I$.

See the proof in Appendix B.5. We also empirically demonstrate that under the InfoNCE loss the absolute value of the alignment matrix A converges to the identity matrix; see Figure 5. The alignment effect has been studied in other scenarios (Ji & Telgarsky, 2019; Radhakrishnan et al., 2020). In realistic cases, when some of our assumptions are not satisfied, e.g., when there are degenerate singular values in the weight matrices, we will not observe perfect alignment. This is easily understood from the fact that the singular value decomposition is no longer unique in the presence of degenerate singular values. In our toy experiment, we specifically initialize the weight matrices to have non-degenerate singular values. In realistic scenarios, where weight matrices are randomly initialized, we only observe the alignment matrix converging to a block-diagonal matrix, with each block corresponding to a group of degenerate singular values.

Figure 5: Visualization of the alignment matrix $A = V_2^TU_1$ after training. The setting is a 2-layer linear toy model with 16x16 weight matrices. The alignment matrix converges to the identity matrix.

Given the fact that singular vectors of the same rank align, we can now study the dynamics of the singular values of each weight matrix $W_1$ and $W_2$.

Theorem 3. If $W_2$ and $W_1$ are aligned (i.e., $V_2 = U_1$), then the singular values of the weight matrices $W_1$ and $W_2$ under the InfoNCE loss evolve by:

$$\dot{\sigma}_1^k = \sigma_1^k (\sigma_2^k)^2 \big(v_1^{kT} X v_1^k\big) \qquad (10)$$
$$\dot{\sigma}_2^k = \sigma_2^k (\sigma_1^k)^2 \big(v_1^{kT} X v_1^k\big) \qquad (11)$$

See the proof in Appendix B.6. According to Eqns. 10-11, $\frac{d}{dt}\big[(\sigma_1^k)^2 - (\sigma_2^k)^2\big] = 0$, so $(\sigma_2^k)^2 = (\sigma_1^k)^2 + C$ for some constant C, and the dynamics reduces to a single equation, $\dot{\sigma}_1^k = \sigma_1^k\big((\sigma_1^k)^2 + C\big)\big(v_1^{kT}Xv_1^k\big)$. This shows that a pair of singular values (singular values of the same rank in the two matrices) have gradients proportional to themselves. Since X is positive definite, the term $v_1^{kT}Xv_1^k$ is always non-negative. This explains why the smallest group of singular values grows significantly more slowly. See the experimental results in Figures 6a and 6b.

Figure 6: Evolution of the singular values (log scale) of (a) $W_1$, (b) $W_2$, and (c) the embedding space covariance matrix during training. The setting is a 2-layer linear toy model with 16x16 weight matrices. The lowest few singular values of each weight matrix remain significantly smaller.
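To make Eqns. 10-11 concrete, the sketch below integrates the coupled singular-value dynamics with a forward-Euler step, holding the factor $v_1^{kT}Xv_1^k$ at a constant c > 0. This is a simplification (in the real dynamics X and the alignment themselves evolve), but it shows the key qualitative effect: pairs that start small barely grow.

```python
import torch

# Forward-Euler integration of Eqns. (10)-(11) with c = v_1^kT X v_1^k held at a positive constant.
c, dt, steps = 1.0, 1e-3, 400
s1 = torch.tensor([1.0, 0.5, 0.1, 0.01])   # illustrative initial singular values of W1
s2 = s1.clone()                             # matched pairs, so (sigma_1^k)^2 - (sigma_2^k)^2 stays 0
for _ in range(steps):
    ds1 = s1 * s2 ** 2 * c                  # sigma_1_dot = sigma_1 * sigma_2^2 * c
    ds2 = s2 * s1 ** 2 * c                  # sigma_2_dot = sigma_2 * sigma_1^2 * c
    s1, s2 = s1 + dt * ds1, s2 + dt * ds2
print(s1)   # the largest pair roughly doubles; the smallest pairs barely move, mirroring Figure 6
```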
Corollary 2 (Dimensional Collapse Caused by Implicit Regularization). With small augmentation and over-parametrized linear networks, the embedding space covariance matrix becomes low-rank.

The embedding space is characterized by the singular value spectrum of the covariance matrix of the embedding vectors, $C = \sum(z - \bar{z})(z - \bar{z})^T/N = \sum W_2W_1(x - \bar{x})(x - \bar{x})^T W_1^T W_2^T / N$. As $W_2W_1$ evolves to become low-rank, C is low-rank, indicating collapsed dimensions. See Figure 6c for experimental verification. Our theory can also be extended to multi-layer networks and the nonlinear setting; see Appendix C.

6 DIRECTCLR

6.1 MOTIVATION

We now leverage our theoretical findings to design a novel algorithm, targeting the projector component in contrastive learning. Empirically, adding a projector substantially improves the quality of the learned representation and downstream performance (Chen et al., 2020a). Checking the spectrum of the representation layer also reveals a difference with and without a projector. To see this, we train two SimCLR models, with and without a projector. The representation space spectra are shown in Figure 7b. Dimensional collapse in the representation space happens when the model is trained without a projector; thus, the projector prevents collapse in the representation space.

Figure 7: (a) Definition of the representation and the embedding space: the encoder produces representations, which the projector maps to embeddings fed to the InfoNCE loss; (b) singular value spectra of the representation space of pretrained contrastive learning models (pretrained with or without a projector). The representation vectors are the output of the ResNet-50 encoder and are directly used for downstream tasks. Each representation vector has a dimension of 2048. Without a projector, SimCLR suffers from dimensional collapse in the representation space.

The projector in contrastive learning is therefore essential to prevent dimensional collapse in the representation space. We make the following propositions regarding a linear projector in contrastive learning models.

Proposition 1. A linear projector weight matrix only needs to be diagonal.

Proposition 2. A linear projector weight matrix only needs to be low-rank.

Based on our theory of the implicit regularization dynamics, we expect adjacent layers $W_1 (= U_1S_1V_1^T)$ and $W_2 (= U_2S_2V_2^T)$ to become aligned, such that the overall dynamics is governed only by their singular values $S_1$ and $S_2$; the orthogonal matrices $V_2^T$ and $U_1$ are redundant, as they evolve to satisfy $V_2^TU_1 = I$. Now consider a SimCLR model with a linear projector and focus only on the channel dimension: $W_1$ is the last layer in the encoder and $W_2$ is the projector weight matrix. Our propositions claim that for this projector matrix $W_2$, the orthogonal component $V_2$ can be omitted: because the previous layer $W_1$ is fully trainable, its orthogonal component $U_1$ will always evolve to satisfy $V_2^TU_1 = I$. Therefore, the final behavior of the projector is determined only by the singular values $S_2$ of the projector weight matrix. This motivates Proposition 1: the orthogonal component of the projector weight matrix does not matter, so we can set the projector matrix to be diagonal. Moreover, according to our theory, the weight matrix always converges toward a low-rank solution; the singular value (diagonal) matrix naturally becomes low-rank, so we may as well set it to be low-rank directly. This is the motivation for Proposition 2.

These propositions are verified via ablation studies in Sec. 6.3. Given these two propositions, we propose DirectCLR, which effectively uses a fixed low-rank diagonal projector.
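The projector variants implied by the two propositions can be written down directly; the sketch below is illustrative (the module names and the value of d0 are our assumptions), not the paper's implementation.

```python
import torch
from torch import nn

d = 2048        # representation dimension of the ResNet-50 backbone
d0 = 360        # projector rank; an illustrative value (d0 is tuned in Appendix E)

# Baseline: SimCLR's trainable 1-layer linear projector.
linear_projector = nn.Linear(d, d, bias=False)

# Proposition 1: only the singular values matter, so a diagonal (per-channel scale)
# projector should behave like a full trainable linear projector.
class DiagonalProjector(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        return r * self.scale

# Proposition 2: the projector can also be fixed and low-rank. A fixed low-rank diagonal
# projector with d0 unit singular values reduces to slicing the representation,
# which is exactly what DirectCLR does.
def fixed_low_rank_diagonal_projector(r: torch.Tensor, rank: int = d0) -> torch.Tensor:
    return r[:, :rank]
```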
6.2 MAIN IDEA

We propose to remove the projector in contrastive learning by directly sending a sub-vector of the representation vector to the loss function. We call our method DirectCLR. In contrast to recent state-of-the-art self-supervised learning methods, our method directly optimizes the representation space. As shown in Figure 8, DirectCLR picks a sub-vector of the representation, $z = r[0:d_0]$, where $d_0$ is a hyperparameter. It then applies a standard InfoNCE loss on this normalized sub-vector $\hat{z} = z/\|z\|$:

$$L = -\sum_i \log \frac{\exp(\hat{z}_i \cdot \hat{z}_i')}{\sum_j \exp(\hat{z}_i \cdot \hat{z}_j)}$$

Figure 8: DirectCLR: no trainable projector; the InfoNCE loss is simply applied to a fixed sub-vector of the representations.

We train DirectCLR with the standard SimCLR recipe for 100 epochs on ImageNet. The backbone encoder is a ResNet-50. More implementation details can be found in Appendix D. DirectCLR demonstrates better performance than SimCLR with a trainable linear projector on ImageNet. The linear probe accuracies for each model are listed in Table 1.

Table 1: Linear probe accuracy on ImageNet. Each model is trained on ImageNet for 100 epochs with the standard training recipe. The backbone encoder is a ResNet-50. DirectCLR outperforms SimCLR with a 1-layer linear projector.

| Loss function | Projector | Accuracy |
|---|---|---|
| SimCLR | 2-layer nonlinear projector | 66.5 |
| SimCLR | 1-layer linear projector | 61.1 |
| SimCLR | no projector | 51.5 |
| DirectCLR | no projector | 62.7 |

We visualize the learned representation space spectrum in Figure 9. DirectCLR prevents dimensional collapse in the representation space, similar to the functionality of a trainable projector in SimCLR.

Figure 9: Representation space spectrum of DirectCLR compared to SimCLR (a) with a 2-layer nonlinear projector, (b) with a 1-layer linear projector, and (c) without a projector. The spectra are computed from the backbone outputs on the ImageNet validation set. Similar to SimCLR with a projector, DirectCLR prevents dimensional collapse in the representation space.
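As a minimal sketch (not the official implementation linked in the Reproducibility Statement), the DirectCLR objective can be written as follows; the value of d0 and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def directclr_loss(r1, r2, d0=360, temperature=0.1):
    """InfoNCE on the normalized sub-vector z = r[0:d0] of the two views' representations.

    r1, r2: (N, 2048) ResNet-50 outputs for the two augmented views.
    d0 and temperature are illustrative choices, not values stated in the text above.
    """
    z1 = F.normalize(r1[:, :d0], dim=1)        # z_hat = z / |z|
    z2 = F.normalize(r2[:, :d0], dim=1)
    logits = z1 @ z2.T / temperature           # cosine similarities z_hat_i . z_hat_j
    labels = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, labels)     # positives sit on the diagonal
```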
One may suspect that, since the contrastive loss in DirectCLR applies no gradient to the remaining part of the representation vector $r[d_0:]$, these dimensions should not contain useful information. Here, we show that the entire representation vector r does contain useful information (see Figure 10). First, the gradient backpropagating through the representation vector is low-rank: only the first $d_0$ channel dimensions are non-zero. When this gradient enters the ResNet backbone and passes through the last nonlinear conv block, it becomes full rank, so the hidden layer h receives gradients on all channels. Note that h and r have the same channel dimension of 2048. Next, consider the forward pass: the hidden layer h is fed directly into the representation vector via the residual connection. As a result, the remaining part of the representation vector $r[d_0:]$ is not trivial. In addition, we run an ablation study in Appendix F to test the linear probe accuracy based only on the directly optimized sub-vector; this verifies that the whole representation vector is meaningful.

Figure 10: Why is the whole representation vector r meaningful in DirectCLR, even though only part of it receives a gradient from the loss? DirectCLR takes advantage of the residual connection in the backbone. The gradient passing through the representation vector is low-rank (only the first $d_0$ channel dimensions are non-zero). When the gradient enters the ResNet backbone and passes through the last nonlinear conv block, it becomes full rank, so the hidden layer h receives gradients on all channels. During the forward pass, h is fed directly into the representation vector via the residual connection; therefore, the entire representation vector r is meaningful.

6.3 ABLATION STUDY

Table 2: Ablation study: top-1 accuracies on ImageNet for SimCLR models with different projector settings.

| Projector setting | Top-1 Accuracy |
|---|---|
| no projector | 51.5 |
| orthogonal projector | 52.2 |
| trainable projector | 61.1 |
| trainable diagonal projector | 60.2 |
| fixed low-rank projector | 62.3 |
| fixed low-rank diagonal projector | 62.7 |

To further verify our hypotheses, we performed ablation studies. Proposition 1 matches the facts that (a) an orthogonality-constrained projector performs the same as the no-projector setting; (b) a fixed low-rank projector performs the same as a fixed low-rank diagonal projector; and (c) a trainable linear projector performs the same as a trainable diagonal projector. Proposition 2 matches the observation that a low-rank projector yields the highest accuracy. Please see a more detailed ablation study discussion and additional ablation experiments in Appendix F.

7 CONCLUSIONS

In this work, we showed that contrastive self-supervised learning suffers from dimensional collapse, where the embedding vectors only span a lower-dimensional subspace. We provided a theoretical understanding of this phenomenon and showed that there are two mechanisms causing dimensional collapse: strong augmentation and implicit regularization. Inspired by our theory, we proposed a novel contrastive self-supervised learning method, DirectCLR, which directly optimizes the representation space without relying on a trainable projector. DirectCLR outperforms SimCLR with a linear projector on ImageNet.

ACKNOWLEDGEMENT

We thank Yubei Chen, Jiachen Zhu, Adrien Bardes, Nicolas Ballas, Randall Balestriero, and Quentin Garrido for useful discussions.

REPRODUCIBILITY STATEMENT

We provide detailed proofs for all the lemmas and theorems in the Appendices. Code (in PyTorch) is available at https://github.com/facebookresearch/directclr

REFERENCES

Sanjeev Arora, Nadav Cohen, W. Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. In NeurIPS, 2019a.

Sanjeev Arora, H. Khandeparkar, M. Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019b.

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, and Michael G. Rabbat. Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. arXiv, abs/2104.13963, 2021.

Adrien Bardes, J. Ponce, and Y. LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv, abs/2105.04906, 2021.

D. Barrett and B. Dherin. Implicit gradient regularization. arXiv, abs/2009.11162, 2021.

Mathilde Caron, Piotr Bojanowski, Armand Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, J. Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv, abs/2104.14294, 2021.

Mario Lezcano Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. arXiv, abs/1901.08428, 2019.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020a.

Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2020.

Xinlei Chen, Haoqi Fan, Ross B. Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv, abs/2003.04297, 2020b.

Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. arXiv, abs/2104.14548, 2021.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.

Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. 2018 Information Theory and Applications Workshop (ITA), pp. 1-10, 2018.

Jeff Z. HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. arXiv, abs/2106.04156, 2021.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726-9735, 2020.

Tianyu Hua, Wenxiao Wang, Zihui Xue, Yue Wang, Sucheng Ren, and Hang Zhao. On feature decorrelation in self-supervised learning. arXiv, abs/2105.00470, 2021.

Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. arXiv, abs/1810.02032, 2019.

L. Jing, J. Zbontar, and Y. LeCun. Implicit rank-minimizing autoencoder. arXiv, abs/2010.00679, 2020.

J. Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. arXiv, abs/2008.01064, 2020.

Junnan Li, Pan Zhou, Caiming Xiong, R. Socher, and S. Hoi. Prototypical contrastive learning of unsupervised representations. arXiv, abs/2005.04966, 2021.

Ishan Misra and L. V. D. Maaten. Self-supervised learning of pretext-invariant representations. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6706-6716, 2020a.

Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020b.

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Y. LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv, abs/1805.12076, 2019.
Adityanarayanan Radhakrishnan, Eshaan Nichani, D. Bernstein, and Caroline Uhler. On alignment in deep linear neural networks. arXiv, 2020.

Andrew M. Saxe, James L. McClelland, and S. Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116:11537-11546, 2019.

Daniel Soudry, E. Hoffer, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv, abs/1710.10345, 2018.

Yuandong Tian, Lantao Yu, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning with dual deep networks. arXiv preprint arXiv:2010.00578, 2020.

Yuandong Tian, Xinlei Chen, and S. Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. arXiv, abs/2102.06810, 2021.

Christopher Tosh, A. Krishnamurthy, and Daniel J. Hsu. Contrastive learning, multi-view redundancy, and linear models. arXiv, abs/2008.10150, 2021.

Aäron van den Oord, Y. Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv, abs/1807.03748, 2018.

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow Twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.

A USEFUL LEMMAS

We adapt two useful lemmas from Arora et al. (2019a).

Lemma 4. Given a matrix W with dynamics $\dot{W}$, the singular values of W evolve by

$$\dot{\sigma}^k = u^{kT}\dot{W}v^k \qquad (12)$$

where $u^k$ and $v^k$ are the left and right singular vectors corresponding to $\sigma^k$, i.e., the k-th columns of U and V respectively.

Proof. Given the singular value decomposition $W = USV^T$, differentiating gives

$$\dot{W} = \dot{U}SV^T + U\dot{S}V^T + US\dot{V}^T.$$

Multiplying by $U^T$ on the left and by $V$ on the right, and using the fact that U and V are orthogonal,

$$U^T\dot{W}V = U^T\dot{U}S + \dot{S} + S\dot{V}^TV.$$

Since $S = \mathrm{diag}(\sigma^k)$ is diagonal, reading off the k-th diagonal entry gives

$$\dot{\sigma}^k = u^{kT}\dot{W}v^k - u^{kT}\dot{u}^k\,\sigma^k - \sigma^k\,\dot{v}^{kT}v^k.$$

Since $u^k$ and $v^k$ have unit norm, $u^{kT}\dot{u}^k = 0$ and $v^{kT}\dot{v}^k = 0$. Therefore $\dot{\sigma}^k = u^{kT}\dot{W}v^k$.

Lemma 5. Given a matrix W with dynamics $\dot{W}$, the singular vectors of W evolve by

$$\dot{U} = U\Big(H \circ \big(U^T\dot{W}VS + SV^T\dot{W}^TU\big)\Big) \qquad (13)$$
$$\dot{V} = V\Big(H \circ \big(V^T\dot{W}^TUS + SU^T\dot{W}V\big)\Big) \qquad (14)$$

where $\circ$ denotes the Hadamard (element-wise) product and H is the skew-symmetric matrix with entries

$$H_{kk'} = \begin{cases} 1/\big((\sigma^{k'})^2 - (\sigma^{k})^2\big) & k \neq k' \\ 0 & k = k' \end{cases} \qquad (15)$$

Proof. As in the proof of Lemma 4, we start from

$$U^T\dot{W}V = U^T\dot{U}S + \dot{S} + S\dot{V}^TV.$$

Since $U^T\dot{U}$ and $V^T\dot{V}$ are skew-symmetric, their diagonal entries are all zero. Hadamard-multiplying both sides by $\bar{I}$, the matrix with zero diagonal and all off-diagonal entries equal to one, removes $\dot{S}$:

$$\bar{I} \circ \big(U^T\dot{W}V\big) = U^T\dot{U}S + S\dot{V}^TV \qquad (16)$$

Taking the transpose,

$$\bar{I} \circ \big(V^T\dot{W}^TU\big) = S\dot{U}^TU + V^T\dot{V}S \qquad (17)$$

Right-multiplying Eqn. 16 by S, left-multiplying Eqn. 17 by S, and adding them up (using $\dot{V}^TV + V^T\dot{V} = 0$ and $\dot{U}^TU = -U^T\dot{U}$), we obtain

$$U^T\dot{U}S^2 - S^2U^T\dot{U} = \bar{I} \circ \big(U^T\dot{W}VS + SV^T\dot{W}^TU\big).$$

Solving entrywise for $U^T\dot{U}$ gives $\dot{U} = U\big(H \circ (U^T\dot{W}VS + SV^T\dot{W}^TU)\big)$ with H as in Eqn. 15. A similar argument proves Eqn. 14.
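Lemma 4 is easy to check numerically with a finite-difference perturbation; the short sketch below assumes a random 16x16 matrix, which has non-degenerate singular values with probability one.

```python
import torch

# Finite-difference check of Lemma 4: for a small perturbation dW, the change of the
# k-th singular value is approximately u_k^T dW v_k (non-degenerate singular values).
torch.manual_seed(0)
W = torch.randn(16, 16, dtype=torch.float64)
dW = 1e-6 * torch.randn(16, 16, dtype=torch.float64)
U, S, Vh = torch.linalg.svd(W)
predicted = torch.einsum('ik,ij,jk->k', U, dW, Vh.T)   # u_k^T dW v_k for every k
actual = torch.linalg.svdvals(W + dW) - S
print(torch.allclose(actual, predicted, atol=1e-9))    # True, up to second-order terms
```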
Lemma 6 (Alignment matrix dynamics). The alignment matrix A, defined by $A = V_2^TU_1$, evolves by

$$\dot{A} = -A\big(H_1 \circ (A^TF + F^TA)\big) + \big(H_2 \circ (AF^T + FA^T)\big)A \qquad (18)$$

where $\circ$ denotes the Hadamard (element-wise) product, $H_l$ ($l = 1, 2$) is the skew-symmetric matrix with entries

$$[H_l]_{kk'} = \begin{cases} 1/\big((\sigma_l^{k'})^2 - (\sigma_l^{k})^2\big) & k \neq k' \\ 0 & k = k' \end{cases} \qquad (19)$$

and F is defined by

$$F = S_2U_2^TGV_1S_1 \qquad (20)$$

Proof. According to Lemma 5,

$$\dot{U}_1 = U_1\Big(H_1 \circ \big(U_1^T\dot{W}_1V_1S_1 + S_1V_1^T\dot{W}_1^TU_1\big)\Big), \qquad \dot{V}_2 = V_2\Big(H_2 \circ \big(V_2^T\dot{W}_2^TU_2S_2 + S_2U_2^T\dot{W}_2V_2\big)\Big).$$

Plugging in Eqn. 8 ($\dot{W}_1 = -W_2^TG$, $\dot{W}_2 = -GW_1^T$) together with the SVDs $W_1 = U_1S_1V_1^T$ and $W_2 = U_2S_2V_2^T$, the dynamics of the alignment matrix $A = V_2^TU_1$ can be written as

$$\dot{A} = \dot{V}_2^TU_1 + V_2^T\dot{U}_1$$
$$= -A\Big(H_1 \circ \big(U_1^TW_2^TGV_1S_1 + S_1V_1^TG^TW_2U_1\big)\Big) + \Big(H_2 \circ \big(S_2U_2^TGW_1^TV_2 + V_2^TW_1G^TU_2S_2\big)\Big)A$$

(in the second term, the minus sign from Eqn. 8 cancels against the sign flip produced by transposing the skew-symmetric $H_2$)

$$= -A\Big(H_1 \circ \big(U_1^TV_2S_2U_2^TGV_1S_1 + S_1V_1^TG^TU_2S_2V_2^TU_1\big)\Big) + \Big(H_2 \circ \big(S_2U_2^TGV_1S_1U_1^TV_2 + V_2^TU_1S_1V_1^TG^TU_2S_2\big)\Big)A$$
$$= -A\Big(H_1 \circ \big(A^TF + F^TA\big)\Big) + \Big(H_2 \circ \big(FA^T + AF^T\big)\Big)A$$

where $F = S_2U_2^TGV_1S_1$.

Lemma 7 (Singular value dynamics). The singular values of the weight matrices $W_1$ and $W_2$ evolve by

$$\dot{\sigma}_1^k = -\sum_{k'}\big(v_2^{k'T}u_1^k\big)\,\sigma_2^{k'}\,\big(u_2^{k'T}Gv_1^k\big) \qquad (21)$$
$$\dot{\sigma}_2^k = -\sum_{k'}\big(u_1^{k'T}v_2^k\big)\,\sigma_1^{k'}\,\big(u_2^{kT}Gv_1^{k'}\big) \qquad (22)$$

Proof. According to Lemma 4, $\dot{\sigma}_1^k = u_1^{kT}\dot{W}_1v_1^k$. Plugging in Eqn. 8, we have

$$\dot{\sigma}_1^k = -u_1^{kT}W_2^TGv_1^k = -u_1^{kT}V_2S_2U_2^TGv_1^k = -\sum_{k'}\big(v_2^{k'T}u_1^k\big)\,\sigma_2^{k'}\,\big(u_2^{k'T}Gv_1^k\big).$$

A similar argument proves Eqn. 22.

B DELAYED PROOFS

B.1 PROOF OF LEMMA 1

Proof. The gradient of the loss with respect to W is

$$\frac{dL}{dW} = \sum_i\Big(\frac{\partial L}{\partial z_i}\frac{\partial z_i}{\partial W} + \frac{\partial L}{\partial z_i'}\frac{\partial z_i'}{\partial W}\Big).$$

Denoting the gradients with respect to $z_i$ and $z_i'$ by $g_{z_i}$ and $g_{z_i'}$, and using $\partial z_i/\partial W = x_i$ and $\partial z_i'/\partial W = x_i'$, we get

$$\dot{W} = -\frac{dL}{dW} = -\sum_i\big(g_{z_i}x_i^T + g_{z_i'}x_i'^T\big) = -G.$$

B.2 PROOF OF LEMMA 2

Proof. X is defined in Eqn. 6:

$$X = \sum_i\Big[\sum_{j\neq i}\alpha_{ij}(x_i' - x_j) + \sum_{j\neq i}\alpha_{ji}(x_i - x_j)\Big]x_i^T - \sum_i(1-\alpha_{ii})(x_i' - x_i)x_i'^T$$

Expanding each term,

$$X = \sum_i\sum_{j\neq i}\alpha_{ij}x_i'x_i^T - \sum_i\sum_{j\neq i}\alpha_{ij}x_jx_i^T + \sum_i\sum_{j\neq i}\alpha_{ji}(x_i - x_j)(x_i - x_j)^T + \sum_i\sum_{j\neq i}\alpha_{ji}(x_i - x_j)x_j^T - \sum_i(1-\alpha_{ii})(x_i' - x_i)(x_i' - x_i)^T - \sum_i(1-\alpha_{ii})(x_i' - x_i)x_i^T$$

Given the fact that $\sum_{j\neq i}\alpha_{ij} = 1-\alpha_{ii}$, we have $\sum_i\sum_{j\neq i}\alpha_{ij}x_i'x_i^T = \sum_i(1-\alpha_{ii})x_i'x_i^T$. Also, since $\sum_i\sum_{j\neq i}$ iterates over all ordered pairs (i, j), we can swap the indices i and j, so $\sum_i\sum_{j\neq i}\alpha_{ij}x_jx_i^T = \sum_i\sum_{j\neq i}\alpha_{ji}x_ix_j^T$. After these substitutions, all terms other than the two quadratic forms cancel, leaving

$$X = \sum_i\sum_{j\neq i}\alpha_{ji}(x_i - x_j)(x_i - x_j)^T - \sum_i(1-\alpha_{ii})(x_i' - x_i)(x_i' - x_i)^T = \hat{\Sigma}_0 - \hat{\Sigma}_1.$$

B.3 PROOF OF THEOREM 1

Proof. According to Lemma 1 and Eqn. 5, we have

$$\frac{d}{dt}W = -G = WX \qquad (23)$$

For a fixed X, we can solve this equation analytically: $W(t) = W(0)\exp(Xt)$. Applying the eigendecomposition $X = U\Lambda U^T$, we have $\exp(Xt) = U\exp(\Lambda t)U^T$ and therefore

$$W(t) = W(0)\,U\exp(\Lambda t)\,U^T.$$

Because X has negative eigenvalues, i.e., Λ has negative entries, $\exp(\Lambda t)$ becomes rank-deficient as $t \to \infty$. Therefore $W(\infty)$ is also rank-deficient: the weight matrix W has vanishing singular values.
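The closed-form solution used in the proof of Theorem 1 can be checked numerically with the matrix exponential; the eigenvalue pattern below is an illustrative choice, not taken from the paper.

```python
import torch

# Closed-form solution from the proof of Theorem 1: W(t) = W(0) exp(X t). When X has
# negative eigenvalues (strong augmentation), the matching singular values of W(t) vanish.
torch.manual_seed(0)
d = 16
Q, _ = torch.linalg.qr(torch.randn(d, d))
X = Q @ torch.diag(torch.cat([torch.ones(8), -torch.ones(8)])) @ Q.T   # 8 negative eigenvalues
W0 = torch.randn(d, d)
for t in (0.0, 2.0, 8.0):
    Wt = W0 @ torch.matrix_exp(X * t)
    print(t, torch.linalg.svdvals(Wt)[-4:])   # the smallest singular values decay toward zero
```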
B.4 PROOF OF LEMMA 3

Proof. The gradient of the loss with respect to $W_2$ is

$$\frac{dL}{dW_2} = \sum_i\Big(\frac{\partial L}{\partial z_i}\frac{\partial z_i}{\partial W_2} + \frac{\partial L}{\partial z_i'}\frac{\partial z_i'}{\partial W_2}\Big) \qquad (24)$$

Denoting the gradients with respect to $z_i$ and $z_i'$ by $g_{z_i}$ and $g_{z_i'}$, and using $\partial z_i/\partial W_2 = W_1x_i$ and $\partial z_i'/\partial W_2 = W_1x_i'$, we get

$$\dot{W}_2 = -\sum_i\big(g_{z_i}x_i^T + g_{z_i'}x_i'^T\big)W_1^T = -GW_1^T \qquad (25)$$

A similar argument applies to $W_1$, giving $\dot{W}_1 = -W_2^TG$.

B.5 PROOF OF THEOREM 2

Here, we prove that under the assumption that the singular values are non-degenerate, the alignment matrix $A = V_2^TU_1$ converges to the identity matrix.

Proof. According to Lemma 3,

$$\frac{d}{dt}\big(W_1W_1^T\big) = -W_2^TGW_1^T - W_1G^TW_2, \qquad \frac{d}{dt}\big(W_2^TW_2\big) = -W_2^TGW_1^T - W_1G^TW_2,$$

and therefore

$$\frac{d}{dt}\big(W_1W_1^T - W_2^TW_2\big) = 0, \quad \text{i.e.,} \quad W_1W_1^T - W_2^TW_2 = C.$$

Next, we show that the Frobenius norm of each weight matrix grows to infinity:

$$\frac{d}{dt}\|W_1\|_F^2 = \frac{d}{dt}\mathrm{tr}\big(W_1W_1^T\big) = -\mathrm{tr}\big(W_2^TGW_1^T\big) - \mathrm{tr}\big(W_1G^TW_2\big).$$

According to Eqn. 9, $G = -W_2W_1X$, so $-\mathrm{tr}(W_2^TGW_1^T) = \mathrm{tr}(W_2^TW_2W_1XW_1^T) = \mathrm{tr}(W_2W_1XW_1^TW_2^T)$. Because X is a positive-definite matrix and $W_2(t)W_1(t) \neq 0$ for all t, the matrix $B := W_2W_1XW_1^TW_2^T$ is positive semi-definite and $B \neq 0$. Therefore $\mathrm{tr}(B) = \sum_k \lambda_k(B) > 0$, since not all eigenvalues of B are zero. Therefore $\|W_1\|_F^2 \to +\infty$ (and similarly $\|W_2\|_F^2 \to +\infty$). In the limit $t \to +\infty$, the constant difference C becomes negligible relative to the growing norms, so

$$W_1W_1^T = W_2^TW_2.$$

Plugging in the singular value decompositions of $W_1$ and $W_2$, we have $U_1S_1^2U_1^T = V_2S_2^2V_2^T$. Assuming $W_1$ and $W_2$ have non-degenerate singular values, the uniqueness of the eigendecomposition gives $U_1 = V_2$, and therefore $V_2^TU_1 = I$.

Remark. Note that when the non-degenerate singular value assumption does not hold, the corresponding singular vectors are not unique and we will not observe the corresponding dimensions becoming aligned.

B.6 PROOF OF THEOREM 3

Proof. According to Theorem 2, for $\sigma_1^k$ and $\sigma_2^k$ with the same index, the corresponding singular vector pairs $v_2^k$ and $u_1^k$ become aligned, i.e., $v_2^{k'T}u_1^k \to \delta_{kk'}$. Therefore, Eqns. 21 and 22 simplify to

$$\dot{\sigma}_1^k = -\sigma_2^k\big(u_2^{kT}Gv_1^k\big), \qquad \dot{\sigma}_2^k = -\sigma_1^k\big(u_2^{kT}Gv_1^k\big).$$

Inserting Eqn. 9 and using the alignment, we derive

$$\dot{\sigma}_1^k = \sigma_1^k(\sigma_2^k)^2\big(v_1^{kT}Xv_1^k\big), \qquad \dot{\sigma}_2^k = \sigma_2^k(\sigma_1^k)^2\big(v_1^{kT}Xv_1^k\big).$$

C EFFECT OF MORE LAYERS AND NONLINEARITY

In our toy model, we focused on a two-layer linear MLP. Here, we empirically show that our theory extends to the multi-layer and nonlinear cases, as shown in Figure 11a. Stronger over-parametrization leads to a stronger collapsing effect, which has been shown theoretically (Arora et al., 2019a; Barrett & Dherin, 2021) and empirically (Jing et al., 2020). This can be explained by the fact that more adjacent matrices become aligned, so the collapse of the product matrix is amplified. Note that for the single-layer case, L = 1, there is no dimensional collapse in the embedding space, which is consistent with our analysis.

We also empirically show that the collapsing effect applies to the nonlinear scenario: we insert ReLU between the linear layers and observe a similar singular value collapse compared to the linear case. See Figure 11b.

Figure 11: Embedding space singular value spectrum with different numbers of layers (L = 1, 2, 3, 4) for (a) linear and (b) nonlinear networks. All models use 16x16 weight matrices. Adding more layers to the network leads to more collapsed dimensions; adding nonlinearity leads to a similar collapsing effect.

D IMPLEMENTATION DETAIL

D.1 AUGMENTATIONS

Each input image is transformed twice to produce the two distorted views used by the contrastive loss. The image augmentation pipeline includes random cropping, resizing to 224x224, random horizontal flipping, color jittering, grayscale conversion, Gaussian blurring, and solarization (a torchvision sketch of this pipeline is given at the end of this appendix section).

D.2 NETWORK

Throughout the ImageNet experiments in this paper, we use a ResNet-50 (He et al., 2016) as the encoder. This network outputs a 2048-dimensional vector, which we call the representation vector.

D.3 OPTIMIZATION

We use a LARS optimizer and train all models for 100 epochs. The batch size is 4096, which fits into 32 GPUs during training. The learning rate is 4.8, as in SimCLR (Chen et al., 2020a), with a 10-epoch warm-up followed by a cosine decay schedule.
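A torchvision sketch of the D.1 augmentation pipeline is given below. The paper does not list the exact probabilities and parameter strengths, so the values here follow common SimCLR/BYOL-style recipes and are assumptions.

```python
from torchvision import transforms

# Sketch of the D.1 pipeline. Probabilities and jitter strengths are assumptions,
# chosen to follow common SimCLR/BYOL-style recipes; only the operation list is from the paper.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                                   # random crop + resize to 224x224
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    transforms.RandomSolarize(threshold=128, p=0.2),
    transforms.ToTensor(),
])
```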
E HYPERPARAMETER TUNING ON d0

Here, we report the ImageNet accuracy for various values of $d_0$ in Figure 12. When $d_0 \to 0$, too little gradient information comes from the loss and the performance drops. When $d_0 \to 2048$, the model converges to standard SimCLR without a projector, which we know suffers from dimensional collapse in the representation space.

Figure 12: Hyperparameter tuning of $d_0$ based on ImageNet linear probe top-1 accuracy.

F ABLATION STUDY DETAIL

Fixed low-rank projector vs. fixed low-rank diagonal projector: DirectCLR is equivalent to SimCLR with a fixed low-rank diagonal projector. It performs the same as SimCLR with a fixed low-rank projector, which achieves 62.3% linear probe accuracy. Specifically, the singular values of this low-rank matrix are set to 1 for $d_0$ of them and 0 for the rest, and the resulting diagonal matrix is then left- and right-multiplied by a fixed orthogonal matrix. Therefore, the only difference is that this fixed projector has an extra fixed orthogonal matrix in between.

Trainable projector vs. trainable diagonal projector: We trained a SimCLR model with a trainable projector constrained to be diagonal. The model achieves 60.2% linear probe accuracy on ImageNet, which is close to SimCLR with a 1-layer linear projector.

Orthogonal projector vs. no projector: We train a SimCLR model with a single-layer projector constrained to be orthogonal using the exponential map (expm) parametrization (Casado & Martínez-Rubio, 2019), so the projector weight matrix has all singular values fixed to 1. This model reaches 52.2% accuracy on ImageNet, which is close to SimCLR without a projector.

These ablation studies verify Proposition 1, that the SimCLR projector only needs to be diagonal. Also, according to Table 2, we find that the low-rank projector settings consistently improve performance, which verifies Proposition 2.

Linear probe on the sub-vector instead of the entire vector: For DirectCLR, we perform a linear probe only on the sub-vector z and obtain 47.9% accuracy on ImageNet. This shows that the rest of r still contains useful information, even though it does not receive a gradient directly from the loss function.

Random dropout instead of a fixed sub-vector: Since DirectCLR drops a number of dimensions before the loss function, it is natural to ask whether randomly dropping dimensions can reach the same performance. We train a SimCLR model without a projector and randomly feed $d_0$ features to the InfoNCE loss at every iteration. This model reaches only 43.0% accuracy on ImageNet. This demonstrates the importance of using a fixed sub-vector, which allows the alignment effect to happen.
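For reference, the two sub-vector strategies compared in the last ablation differ only in how the $d_0$ dimensions are chosen; a minimal sketch (function names are ours):

```python
import torch

# r: (N, 2048) representations; d0: number of dimensions fed to the InfoNCE loss.
def fixed_subvector(r, d0):
    # DirectCLR: always the same d0 dimensions, so the alignment effect can develop.
    return r[:, :d0]

def random_subvector(r, d0):
    # Ablation baseline: a fresh random subset every iteration (reaches only 43.0% top-1 above).
    idx = torch.randperm(r.shape[1], device=r.device)[:d0]
    return r[:, idx]
```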