# MODULATE YOUR SPECTRUM IN SELF-SUPERVISED LEARNING

Published as a conference paper at ICLR 2024

Xi Weng¹, Yunhao Ni¹, Tengwei Song¹, Jie Luo¹, Rao Muhammad Anwer², Salman Khan², Fahad Shahbaz Khan², Lei Huang¹
¹SKLCCSE, Institute of Artificial Intelligence, Beihang University, Beijing, China
²Mohamed bin Zayed University of Artificial Intelligence, UAE
Equal contribution; corresponding author: huangleiAI@buaa.edu.cn.

Whitening loss offers a theoretical guarantee against feature collapse in self-supervised learning (SSL) with joint embedding architectures. Typically, it involves a hard whitening approach: transforming the embedding and applying the loss to the whitened output. In this work, we introduce Spectral Transformation (ST), a framework to modulate the spectrum of the embedding and to seek functions beyond whitening that can avoid dimensional collapse. We show that whitening is a special instance of ST by definition, and our empirical investigations unveil other ST instances capable of preventing collapse. Additionally, we propose a novel ST instance named IterNorm with trace loss (INTL). Theoretical analysis confirms INTL's efficacy in preventing collapse and in modulating the spectrum of the embedding toward equal eigenvalues during optimization. Our experiments on ImageNet classification and COCO object detection demonstrate INTL's potential in learning superior representations. The code is available at https://github.com/winci-ai/INTL.

1 INTRODUCTION

Self-supervised learning (SSL) via joint embedding architectures to learn visual representations has made significant progress over the last several years (Bachman et al., 2019; He et al., 2020; Chen et al., 2020a; Chen & He, 2021; Bardes et al., 2022; Oquab et al., 2023), almost outperforming the supervised counterpart on many downstream tasks (Liu et al., 2021; Jaiswal et al., 2020; Ranasinghe et al., 2022). This paradigm trains a pair of networks to produce similar embeddings for different views of the same image (Chen & He, 2021). One main challenge with joint embedding architectures is how to prevent a collapse of the representation, in which the two branches ignore the inputs and produce identical and constant outputs (Chen & He, 2021). A variety of methods have been proposed to successfully avoid collapse, including contrastive learning methods (Wu et al., 2018; He et al., 2020; Saunshi et al., 2019) that attract different views of the same image (positive pairs) while pulling apart different images (negative pairs), and non-contrastive methods (Grill et al., 2020; Chen & He, 2021) that directly match the positive targets without introducing negative pairs. The collapse problem is further generalized into dimensional collapse (Hua et al., 2021; Jing et al., 2022) (or informational collapse (Bardes et al., 2022)), where the embedding vectors only span a lower-dimensional subspace and are highly correlated. In this case, the covariance matrix of the embedding has certain zero eigenvalues, which degenerates the representation in SSL. To prevent dimensional collapse, a theoretically motivated paradigm called whitening loss has been proposed: it minimizes the distance between embeddings of positive pairs under the condition that embeddings from different views are whitened (Ermolov et al., 2021; Hua et al., 2021).
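Dimensional collapse, as described above, can be diagnosed directly from the spectrum of the embedding covariance matrix. The following is a minimal PyTorch sketch (illustrative only, not from the paper; the function name and the toy duplication check are ours) that computes that spectrum and counts near-zero eigenvalues:

```python
import torch

def covariance_spectrum(z: torch.Tensor) -> torch.Tensor:
    """Eigenvalues (ascending) of the covariance matrix of a d x m embedding batch."""
    d, m = z.shape
    zc = z - z.mean(dim=1, keepdim=True)   # center each dimension
    cov = zc @ zc.T / m                    # d x d covariance matrix
    return torch.linalg.eigvalsh(cov)      # symmetric eigen-solver, ascending order

# Toy check: a rank-deficient embedding has (near-)zero leading eigenvalues,
# i.e. dimensional collapse along those directions.
z = torch.randn(8, 256)
z[:4] = z[4:]                              # duplicate half the dimensions
eigs = covariance_spectrum(z)
print((eigs < 1e-5).sum().item(), "collapsed dimensions out of", z.shape[0])
```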
One typical implementation of whitening loss is hard whitening (Ermolov et al., 2021; Weng et al., 2022), which designs a whitening transformation over mini-batch data and imposes the loss on the whitened output (Ermolov et al., 2021; Hua et al., 2021; Weng et al., 2022). We note that the whitening transformation is a function over the embedding during the forward pass, and modulates the spectrum of the embedding implicitly during the backward pass when minimizing the objective. This raises two questions: do other functions over the embedding exist that can avoid collapse? If yes, how does such a function affect the spectrum of the embedding? This paper proposes spectral transformation (ST), a framework to modulate the spectrum of the embedding in a joint embedding architecture. ST maps the spectrum of the embedding to a desired distribution during the forward pass, and modulates the spectrum of the embedding by implicit gradient update during the backward pass (Figure 1).

[Figure 1: The framework using spectral transformation (ST) to modulate the spectrum of the embedding in a joint embedding architecture for SSL. X: mini-batch; Z: embedding; $\hat{Z}$: transformed output.]

This framework provides a way to seek functions beyond the whitening transformation that can avoid dimensional collapse. We show that the whitening transformation is by definition a special instance of ST using a power function, and our empirical investigation shows that there exist other power functions that can avoid dimensional collapse (see Section 3.2 for details). We demonstrate that IterNorm (Huang et al., 2019), an approximate whitening method based on Newton's iterations (Bini et al., 2005; Ye et al., 2020), is also an instance of ST, and show that IterNorm with different iteration numbers corresponds to different STs (see Section 3.2.2 for details). We further characterize theoretically how the spectrum evolves as the iteration number of IterNorm increases. We empirically observe that, unexpectedly, IterNorm suffers from severe dimensional collapse and mostly fails to train the model in SSL, unlike its benefits in approximating whitening for supervised learning (Huang et al., 2019). We thus propose IterNorm with trace loss (INTL), a simple solution that addresses the failure of IterNorm by adding an extra penalty on the transformed output. Moreover, we theoretically demonstrate that INTL can avoid dimensional collapse, and reveal its mechanism of modulating the spectrum of the embedding toward equal eigenvalues. We conduct comprehensive experiments and show that INTL is a promising SSL method in practice. Our main contributions are summarized as follows:

- We propose spectral transformation, a framework to modulate the spectrum of the embedding and to seek functions beyond whitening that can avoid dimensional collapse. We show by empirical observation and intuitive explanation that there exist other functions that can avoid dimensional collapse.
- We propose a new instance of ST, called IterNorm with trace loss (INTL). We theoretically prove that INTL can avoid collapse and modulates the spectrum of the embedding towards an equal-eigenvalue distribution during the course of optimization.
- INTL's experimental performance on standard benchmarks showcases its promise as a practical SSL method, consistently achieving or surpassing state-of-the-art methods, even when utilizing a relatively small batch size.
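To make the ST pipeline of Figure 1 concrete, here is a minimal PyTorch sketch of the forward transformation (illustrative, not the paper's implementation; the helper name is ours, and the formal definition it follows appears as Definition 1 in Section 3.2). Choosing $g(\lambda)=\lambda^{-1/2}$ recovers ZCA whitening; other choices of g give other STs:

```python
import torch

def spectral_transform(z: torch.Tensor, g) -> torch.Tensor:
    """Apply ST to a d x m embedding: Z_hat = U g(Lambda) U^T Z."""
    d, m = z.shape
    zc = z - z.mean(dim=1, keepdim=True)        # center the embedding
    cov = zc @ zc.T / m                          # covariance Sigma (d x d)
    lam, u = torch.linalg.eigh(cov)              # Sigma = U diag(lam) U^T
    phi = u @ torch.diag(g(lam)) @ u.T           # transformation matrix g(Sigma)
    return phi @ zc

# g(lambda) = lambda^{-1/2} recovers whitening: the output covariance is (near) identity.
z = torch.randn(16, 128)
z_hat = spectral_transform(z, lambda lam: lam.clamp_min(1e-6).pow(-0.5))
print(torch.linalg.eigvalsh(z_hat @ z_hat.T / 128)[:3])  # eigenvalues close to 1
```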
2 RELATED WORK

Our work is related to SSL methods that address the feature collapse problem when using joint embedding architectures.

Contrastive learning prevents collapse by attracting positive samples closer and spreading negative samples apart (Wu et al., 2018; Ye et al., 2019). In these methods, negative samples play an important role and need to be well designed (Oord et al., 2018; Bachman et al., 2019; Henaff, 2020). MoCo and MoCo v2 (He et al., 2020; Chen et al., 2020b) build a memory bank with a momentum encoder to provide consistent negative samples, while SimCLR (Chen et al., 2020a) shows that more negative samples in a batch, combined with strong data augmentations, perform better. Our proposed INTL can avoid collapse and work well without negative samples. Additionally, recent work (Zhang et al., 2023) explores the use of data augmentation for contrastive learning through spectrum analysis to enhance performance, while our paper focuses on developing a novel non-contrastive method to prevent collapse under standard augmentation.

Non-contrastive learning can be categorized into two groups: asymmetric methods and whitening loss. Asymmetric methods employ asymmetric network architectures to prevent feature collapse without the need for explicit negative pairs (Caron et al., 2018; 2020; Li et al., 2021; Grill et al., 2020; Chen & He, 2021). For instance, BYOL (Grill et al., 2020) enhances network stability by appending a predictor after the online network and introducing momentum into the target network. SimSiam (Chen & He, 2021) extends BYOL and emphasizes the importance of stop-gradient to prevent trivial solutions. Other advancements in this realm include cluster assignment prediction using the Sinkhorn-Knopp algorithm (Caron et al., 2020) and the development of asymmetric pipelines with self-distillation losses for Vision Transformers (Caron et al., 2021). However, it remains unclear how these asymmetric networks effectively prevent collapse without the inclusion of negative pairs. This has sparked debates surrounding topics such as batch normalization (BN) (Fetterman & Albrecht, 2020; Tian et al., 2020b; Richemond et al., 2020) and stop-gradient (Chen & He, 2021; Zhang et al., 2022a). Despite preliminary efforts to analyze training dynamics (Tian et al., 2021) and establish connections between non-contrastive and contrastive methods (Tao et al., 2022; Garrido et al., 2023), the exact mechanisms behind these methods remain an ongoing area of research. In our work, we address the more intricate challenge of dimensional collapse and theoretically demonstrate that our INTL method effectively prevents this issue, offering valuable insights into mitigating feature collapse in various scenarios.

Whitening loss is a theoretically motivated paradigm to prevent dimensional collapse (Ermolov et al., 2021). One typical implementation of whitening loss is hard whitening, which designs a whitening transformation over mini-batch data and imposes the loss on the whitened output. The designed whitening transformation includes batch whitening in W-MSE (Ermolov et al., 2021) and Shuffled-DBN (Hua et al., 2021), channel whitening in CW-RGP (Weng et al., 2022), and the combination of both in Zero-CL (Zhang et al., 2022b). Our proposed ST generalizes the whitening transformation and provides a framework to modulate the spectrum of the embedding.
Our INTL can improve these works in training stability and performance, by replacing the whitening transformation with IterNorm (Huang et al., 2019) and imposing an additional trace loss on the transformed output. Furthermore, we theoretically show that our proposed INTL modulates the spectrum of the embedding towards equal eigenvalues. Another way to implement whitening loss is soft whitening, which imposes a whitening penalty as regularization on the embedding, including Barlow Twins (Zbontar et al., 2021), VICReg (Bardes et al., 2022) and CCA-SSG (Zhang et al., 2021). Different from these works, our proposed INTL imposes the trace loss on the approximately whitened output, providing an equal-eigenvalue modulation on the embedding. There are also theoretical works analyzing how dimensional collapse occurs (Hua et al., 2021; Jing et al., 2022) and how it can be avoided by using whitening loss (Hua et al., 2021; Weng et al., 2022). Recent works (He & Ozay, 2022; Ghosh et al., 2022) further discuss how to characterize the magnitude of dimensional collapse, and connect the spectrum of a representation to a power law; they show that the coefficient of the power law is a strong indicator of the effectiveness of the representation. Different from these works, our theoretical analysis presents a new way of demonstrating how to avoid dimensional collapse, which provides the theoretical basis for our proposed INTL.

3 SPECTRAL TRANSFORMATION BEYOND WHITENING

3.1 PRELIMINARY AND NOTATION

Joint embedding architectures. Let $x$ denote the input sampled uniformly from a set of images $D$, and $\mathcal{T}$ denote the set of data transformations available for augmentation. We consider a pair of neural networks $F_\theta$ and $F_{\theta'}$, parameterized by $\theta$ and $\theta'$ respectively. They take as input two randomly augmented views, $x^{(1)} = T_1(x)$ and $x^{(2)} = T_2(x)$, where $T_{1,2} \in \mathcal{T}$; and they output the embeddings $z^{(1)} = F_\theta(x^{(1)})$ and $z^{(2)} = F_{\theta'}(x^{(2)})$. The networks are trained with an objective function that minimizes the distance between embeddings obtained from different views of the same image:

$$\mathcal{L}(x, \theta) = \mathbb{E}_{x \sim D,\; T_{1,2} \sim \mathcal{T}}\; \ell\big(F_\theta(T_1(x)),\, F_{\theta'}(T_2(x))\big), \quad (1)$$

where $\ell(\cdot, \cdot)$ is a loss function. The mean square error (MSE) of $L_2$-normalized vectors, $\ell(z^{(1)}, z^{(2)}) = \big\|\frac{z^{(1)}}{\|z^{(1)}\|_2} - \frac{z^{(2)}}{\|z^{(2)}\|_2}\big\|_2^2$, is usually used as the loss function (Chen & He, 2021). This loss is also equivalent to the negative cosine similarity, up to a scale of 1/2 and an optimization-irrelevant constant (Chen & He, 2021). This architecture is also called a Siamese network (Chen & He, 2021) if $F_\theta = F_{\theta'}$. Another variant distinguishes the networks into a target network $F_{\theta'}$ and an online network $F_\theta$, and updates the weights $\theta'$ of the target network through an exponential moving average (EMA) (Chen et al., 2020b; Grill et al., 2020) of the weights $\theta$ of the online network.

Feature collapse. While minimizing Eqn. 1, a trivial solution known as complete collapse can occur, such that $F_\theta(x) \equiv c,\ \forall x \in D$. Moreover, a weaker collapse condition called dimensional collapse can easily arise, in which the projected features collapse into a low-dimensional manifold. To express dimensional collapse more mathematically, we refer to dimensional collapse as the phenomenon that one or several eigenvalues of the covariance matrix of the feature vectors degenerate to 0. Therefore, we can determine the occurrence of dimensional collapse by observing the spectrum of the covariance matrix.
Whitening loss. To address the collapse problem, whitening loss (Ermolov et al., 2021) is proposed to minimize Eqn. 1 under the condition that embeddings from different views are whitened. Whitening loss provides a theoretical guarantee of avoiding (dimensional) collapse, since the embedding is whitened with all axes decorrelated (Ermolov et al., 2021; Hua et al., 2021). Ermolov et al. (2021) propose to whiten the mini-batch embedding $Z \in \mathbb{R}^{d \times m}$ using batch whitening (BW) (Huang et al., 2018; Siarohin et al., 2019) and impose the loss on the whitened output $\hat{Z} \in \mathbb{R}^{d \times m}$, given a mini-batch input X of size m, as follows:

$$\min_\theta \mathcal{L}(X; \theta) = \mathbb{E}_{X \sim D,\; T_{1,2} \sim \mathcal{T}} \big\| \hat{Z}^{(1)} - \hat{Z}^{(2)} \big\|_F^2 \quad \text{with } \hat{Z}^{(v)} = \Sigma^{-\frac{1}{2}} Z^{(v)},\ v \in \{1, 2\}, \quad (2)$$

where $\Sigma = \frac{1}{m} Z Z^T$ is the covariance matrix of the embedding¹. $\Sigma^{-\frac{1}{2}}$ is called the whitening matrix, and is calculated either by Cholesky decomposition in (Ermolov et al., 2021) or by eigen-decomposition in (Hua et al., 2021). E.g., zero-phase component analysis (ZCA) whitening (Huang et al., 2018) calculates $\Sigma^{-\frac{1}{2}} = U \Lambda^{-\frac{1}{2}} U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ and $U = [u_1, \ldots, u_d]$ are the eigenvalues and associated eigenvectors of $\Sigma$, i.e., $U \Lambda U^T = \Sigma$. One intriguing result shown in (Weng et al., 2022) is that hard whitening can avoid collapse by only constraining the embedding Z to be full-rank, not necessarily whitened. We note that the whitening transformation is a function over the embedding Z during the forward pass, and modulates the spectrum of Z implicitly during the backward pass when minimizing the MSE loss imposed on the whitened output. This raises the question of whether there are other functions over the embedding Z that can avoid collapse, and if so, how such a function affects the spectrum of Z.

¹ The embedding is usually centered by performing $Z := Z(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T)$ for whitening, and we assume Z is centered in this paper to simplify the discussion.

3.2 SPECTRAL TRANSFORMATION

In this section, we extend the whitening transformation to spectral transformation (ST), a more general view to characterize the modulation of the spectrum of the embedding, and empirically investigate the interaction between the spectrum of the covariance matrix of $\hat{Z}$ and the collapse of the SSL model.

Definition 1. (Spectral Transformation) Given any unary function $g(\cdot)$ defined on the domain $\lambda(Z) = \{\lambda_1, \lambda_2, \ldots, \lambda_d\}$, and drawing an analogy with whitening, applying $g(\cdot)$ to the covariance matrix $\Sigma$ of the embedding Z is defined as $g(\Sigma) = U g(\Lambda) U^T$, where $g(\Lambda) = \mathrm{diag}(g(\lambda(Z)))$. We denote the transformation matrix of ST as $\Phi_{ST} = g(\Sigma)$, so that the output of ST is calculated as $\hat{Z} = \Phi_{ST} Z = U g(\Lambda) U^T Z$ and the covariance matrix of $\hat{Z}$ is $\Sigma_{\hat{Z}} = \frac{1}{m} \hat{Z} \hat{Z}^T = U \Lambda g^2(\Lambda) U^T$.

Based on Definition 1, ST is an abstract framework until $g(\cdot)$ is determined, and its essence is mapping the spectrum $\lambda(Z)$ to $\lambda(\hat{Z}) = \{\lambda_1 g^2(\lambda_1), \lambda_2 g^2(\lambda_2), \ldots, \lambda_d g^2(\lambda_d)\}$. When applied in the context of self-supervised learning, the loss function for ST remains the same as Eqn. 2, with the only difference being that $\hat{Z}$ is determined by $g(\cdot)$. Meanwhile, the optimization direction for the embedding spectrum is also determined when employing gradient-based methods, i.e., what the spectrum of the embedding will be modulated towards during the course of training.

Can we unveil the potential of ST? The ST framework is flexible: $g(\cdot)$ can be any single-variable function on the defined domain, including power functions, exponential functions, iterative functions, and more. Whitening, on the other hand, is a special and successful instance of ST, where $g(\cdot)$ takes the form of the power function $g(\lambda) = \lambda^{-\frac{1}{2}}$. This naturally prompts two questions: 1. Could there be other functions, akin to whitening, capable of preventing collapse within the ST framework?
2. If yes, how does such a function work, and how does it affect the spectrum of the embedding Z?

3.2.1 SPECTRAL TRANSFORMATION USING POWER FUNCTIONS

With these questions in mind, we embark on a deeper exploration of the mechanics extending beyond whitening, considering the more general power transformation $g(\lambda) = \lambda^{-p},\ p \in (-\infty, +\infty)$ for ST. Based on Definition 1, this power transformation maps the spectrum $\lambda(Z)$ to $\lambda(\hat{Z}) = \{\lambda_1^{1-2p}, \lambda_2^{1-2p}, \ldots, \lambda_d^{1-2p}\}$.

Empirical observation. Initially, we conduct experiments on a 2D dataset, varying the parameter p, and visualize the outputs of the toy models as depicted in Figure 2(a). Our observations indicate that the toy model tends to perform well in avoiding collapse when p falls within the neighborhood of 0.5, specifically in the range of 0.45 to 0.55. However, as p gradually deviates from 0.5, collapse becomes more pronounced. Subsequently, we extend our experiments to real-world datasets to validate these findings. The results presented in Figure 2(b) align with the previously observed phenomena. When p is set to either 0.45 or 0.55, the model maintains high evaluation performance, similar to that of whitening (p = 0.5).

[Figure 2: Investigating ST using power functions. We choose several p from 0 to 1.5. We show (a) the visualization of the toy model output; (b) top-1 and 5-nearest-neighbors (5-nn) accuracy on CIFAR-10; (c) the condition indicator of the embedding Z and the transformed output $\hat{Z}$ on CIFAR-10. We use the inverse of the condition number (IoC) in logarithmic scale with base 10 ($\lg \mathrm{IoC} = \lg c^{-1} = \lg \frac{\lambda_d}{\lambda_1}$) as the condition indicator. The results on CIFAR-10 are obtained by training ResNet-18 for 200 epochs and averaged over five runs, with standard deviation shown as error bars. We provide the details of the experimental setup in Appendix D. Similar phenomena can be observed when using other datasets (e.g., ImageNet) and other networks (e.g., ResNet-50).]

This discovery suggests that within the ST framework, there exist other functions capable of successfully preventing collapse, which answers the first question. For the second question, as illustrated in Figure 2(c), it becomes evident that when p lies in the vicinity of 0.5, the embedding exhibits a more well-conditioned spectrum, characterized by a smaller condition number (larger IoC). However, when p significantly deviates from 0.5, the spectrum of the embedding loses its well-conditioned attributes, closely aligning with the occurrence of embedding collapse. This suggests that if $g(\cdot)$ is effective in preventing collapse within the ST framework, it modulates the embedding spectrum towards a well-conditioned state.

Intuitive explanation. We note that (Weng et al., 2022) implied that the whitening loss in Eqn. 2 can be decomposed into two asymmetric losses
$$\mathcal{L} = \frac{1}{m}\big\|\phi(Z^{(1)})Z^{(1)} - (\hat{Z}^{(2)})_{st}\big\|_F^2 + \frac{1}{m}\big\|\phi(Z^{(2)})Z^{(2)} - (\hat{Z}^{(1)})_{st}\big\|_F^2,$$
where $\phi(Z)$ refers to the whitening matrix of Z, $st$ represents the stop-gradient operation, and $\hat{Z}$ denotes the whitened output. Each asymmetric loss can be viewed as an online network matching a whitened target $\hat{Z}$.
As a more generalized form of whitening, our ST can also extend this decomposition to the loss function. As depicted in Figure 2(c), when p falls within the range of 0.45 to 0.55, $\hat{Z}$ exhibits a well-conditioned spectrum, with each eigenvalue approaching 1. In such instances, $\hat{Z}$ serves as an ideal target for $\phi(Z)Z$ to match, enabling the embedding Z to learn a favorable spectrum that prevents collapse. Conversely, when p deviates significantly from 0.5, the spectrum of the transformed output loses its well-conditioned characteristics, and $\hat{Z}$ becomes a detrimental target, ultimately leading to the collapse of the embedding.

3.2.2 IMPLICIT SPECTRAL TRANSFORMATION USING NEWTON'S ITERATION

However, utilizing the power function $g(\lambda) = \lambda^{-p}$ (where p is approximately 0.5) within our ST framework is not without drawbacks. One issue is the potential for numerical instability when computing the eigenvalues λ and eigenvectors U via eigen-decomposition, particularly when the covariance matrix is ill-conditioned (Paszke et al., 2019). We provide comprehensive experiments and analysis in Appendix D.3 to validate the presence of this problem in SSL. Naturally, if we could implement a spectral transformation that modulates the spectrum without explicitly calculating λ or U, this issue could be mitigated. In fact, we take note of an approximate whitening method called iterative normalization (IterNorm) (Huang et al., 2019), which uses Newton's iteration to address the numerical challenges associated with batch whitening in supervised learning. Specifically, given the centered embedding Z, the iteration count T, and the trace-normalized covariance matrix $\Sigma_N = \Sigma / \mathrm{tr}(\Sigma)$, IterNorm performs Newton's iteration as follows:

$$P_0 = I, \qquad P_k = \tfrac{1}{2}\big(3 P_{k-1} - P_{k-1}^3 \Sigma_N\big), \quad k = 1, 2, \ldots, T. \quad (3)$$

The whitening matrix $\Sigma^{-\frac{1}{2}}$ is approximated by $\Phi_T = P_T / \sqrt{\mathrm{tr}(\Sigma)}$, and the whitened output is $\hat{Z} = \Phi_T Z$. When $T \to +\infty$, $\Phi_T \to \Sigma^{-\frac{1}{2}}$ and the covariance matrix of $\hat{Z}$ converges to the identity matrix. Here, we theoretically show that IterNorm is also an instance of spectral transformation.

Theorem 1. Define the one-variable iterative function $f_T(x)$, satisfying $f_{k+1}(x) = \tfrac{3}{2} f_k(x) - \tfrac{1}{2} x f_k^3(x),\ k \geq 0$, with $f_0(x) = 1$. The mapping function of IterNorm is $g(\lambda) = f_T\big(\tfrac{\lambda}{\mathrm{tr}(\Sigma)}\big) / \sqrt{\mathrm{tr}(\Sigma)}$. Without calculating λ or U, IterNorm implicitly maps each $\lambda_i \in \lambda(Z)$ to $\hat{\lambda}_i = \tfrac{\lambda_i}{\mathrm{tr}(\Sigma)} f_T^2\big(\tfrac{\lambda_i}{\mathrm{tr}(\Sigma)}\big)$.

The proof is provided in Appendix B.1. For simplicity, we define the T-whitening function of IterNorm as $h_T(x) = x f_T^2(x)$, which gives the spectrum of the transformed output. Based on the fact that the covariance matrix of the transformed output approaches the identity as T of IterNorm increases to infinity (Bini et al., 2005), we thus have

$$\forall \lambda_i > 0, \quad \lim_{T \to \infty} h_T\Big(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big) = 1. \quad (4)$$

Different iteration numbers T of IterNorm imply different T-whitening functions $h_T(\cdot)$. It is interesting to analyze the characteristics of $h_T(\cdot)$.

Proposition 1. Given $x \in (0, 1)$ and $T \in \mathbb{N}$, we have $h_T(x) \in (0, 1)$ and $h_T'(x) > 0$.

The proof is shown in Appendix A.1. Proposition 1 states that $h_T(x)$ is a monotonically increasing function for $x \in (0, 1)$ and its range is also contained in (0, 1). Since $\frac{\lambda_i}{\mathrm{tr}(\Sigma)} \in (0, 1)$ for all $\lambda_i > 0$, we have

$$\forall T \in \mathbb{N}, \quad \lambda_i > \lambda_j > 0 \implies 1 > \hat{\lambda}_i > \hat{\lambda}_j > 0. \quad (5)$$

Formula 5 indicates that IterNorm maps all non-zero eigenvalues into (0, 1) and preserves monotonicity.

Proposition 2. Given $x \in (0, 1)$ and $T \in \mathbb{N}$, we have $h_{T+1}(x) > h_T(x)$.

The proof is shown in Appendix A.2.
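To make the scalar picture of Theorem 1 and Propositions 1 and 2 concrete, the short sketch below (illustrative, not the paper's code) iterates the one-variable function $f_T$ and evaluates the T-whitening function $h_T(x) = x f_T^2(x)$, showing how a normalized eigenvalue is pushed toward 1 as T grows:

```python
def h_T(x: float, T: int) -> float:
    """T-whitening function h_T(x) = x * f_T(x)^2, with f_{k+1} = 1.5 f_k - 0.5 x f_k^3."""
    f = 1.0                       # f_0(x) = 1
    for _ in range(T):
        f = 1.5 * f - 0.5 * x * f ** 3
    return x * f * f

# A normalized eigenvalue x = lambda_i / tr(Sigma) in (0, 1) is stretched toward 1 as T
# increases (Propositions 1 and 2); the smaller x is, the larger T it needs to get close to 1.
for x in (0.5, 0.05, 0.001):
    print(x, [round(h_T(x, T), 4) for T in (1, 3, 5, 9)])
```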
Proposition 2 indicates that IterNorm gradually stretches the eigenvalues towards one as the iteration number T increases. This property theoretically shows that the spectrum of $\hat{Z}$ becomes better conditioned if we use a larger iteration number T of IterNorm. In summary, our analyses theoretically show that IterNorm gradually stretches the eigenvalues towards one as the iteration number T increases, and that the smaller an eigenvalue is, the larger T must be for it to approach one.

4 ITERATIVE NORMALIZATION WITH TRACE LOSS

It is expected that IterNorm, as a kind of spectral transformation, can avoid collapse and obtain good performance in SSL, given its benefits in approximating whitening for supervised learning (Huang et al., 2019). However, we empirically observe that IterNorm suffers severe dimensional collapse and mostly fails to train the model in SSL (we defer the details to Section 4.2). Based on the analyses in Sections 3.2 and 3.2.2, we propose a simple solution: adding an extra penalty, named trace loss, on the transformed output $\hat{Z}$ produced by IterNorm to ensure a well-conditioned spectrum. Since the sum of the eigenvalues of $\Sigma_{\hat{Z}}$ is less than or equal to d, we propose a trace loss that encourages the trace of $\Sigma_{\hat{Z}}$ to reach its maximum d, when $d \leq m$. In particular, we design a new method called IterNorm with trace loss (INTL) for optimizing the SSL model as²:

$$\min_{\theta \in \Theta} \mathcal{L}_{INTL}(Z) = \sum_{j=1}^{d} \big(1 - (\Sigma_{\hat{Z}})_{jj}\big)^2, \quad (6)$$

where $Z = F_\theta(\cdot)$ and $\hat{Z} = \mathrm{IterNorm}(Z)$. Eqn. 6 can be viewed as an optimization problem over θ that encourages the trace of $\Sigma_{\hat{Z}}$ to be d.

² The complete loss function of INTL is $\mathcal{L}_{INTL} = \mathcal{L}_{MSE} + \beta\, \mathcal{L}_{trace}$, where the coefficient β is fixed across all datasets and architectures, and its determination is elaborated in the Algorithm of INTL in Appendix C. To simplify the discussion, we omit the $\mathcal{L}_{MSE}$ term here, without compromising the validity.

[Figure 3: Investigating the effectiveness of IterNorm with and without trace loss. We train the models on CIFAR-10 with ResNet-18 for 100 epochs. We apply IterNorm with various iteration numbers T, and show the results with (solid lines) and without (dashed lines) trace loss, respectively. (a) The spectrum of the embedding Z; (b) the spectrum of the transformed output $\hat{Z}$; (c) the top-1 accuracy; (d) IterNorm (without trace loss) suffers from numeric divergence when using a large iteration number, e.g., T = 9. It is noteworthy that when $T \geq 11$, the loss values are all NaN, making the model unable to be trained. Similar phenomena can be observed when using other datasets (e.g., ImageNet) and other networks (e.g., ResNet-50).]

4.1 THEORETICAL ANALYSIS

In this section, we theoretically prove that INTL can avoid collapse, and that INTL modulates the spectrum of the embedding towards an equal-eigenvalue distribution during the course of optimization. Note that $\Sigma_{\hat{Z}}$ can be expressed using the T-whitening function $h_T(\cdot)$ as $\Sigma_{\hat{Z}} = \sum_{i=1}^{d} h_T(x_i)\, u_i u_i^T$, where $x_i = \lambda_i / \mathrm{tr}(\Sigma) \geq 0$ and $\sum_{i=1}^{d} x_i = 1$.
When the range of $F_\theta(\cdot)$ is wide enough, the optimization problem over θ (Eqn. 6) can be transformed into the following optimization problem over x (Eqn. 7) without changing the optimal value (please see Appendix B.2 for the details of the derivation):

$$\min_{x} \mathcal{L}_{INTL}(x) = \sum_{j=1}^{d} \Big( \sum_{i=1}^{d} \big[1 - h_T(x_i)\big]\, u_{ji}^2 \Big)^2 \quad \text{s.t.}\ \sum_{i=1}^{d} x_i = 1,\ x_i \geq 0,\ i = 1, \ldots, d, \quad (7)$$

where $u_{ji}$ is the j-th element of the vector $u_i$. In this formulation, we can prove that our proposed INTL theoretically avoids collapse, as long as the iteration number T of IterNorm is larger than zero.

Theorem 2. Let $x \in [0, 1]^d$ and $T \in \mathbb{N}^+$. $\mathcal{L}_{INTL}(x)$ shown in Eqn. 7 is a strictly convex function, and $x^* = [\tfrac{1}{d}, \ldots, \tfrac{1}{d}]^T$ is the unique minimum point as well as the optimal solution of $\mathcal{L}_{INTL}(x)$.

The proof is provided in Appendix B.2. Based on Theorem 2, INTL modulates the spectrum of the embedding towards equal eigenvalues during the backward pass, which provides a theoretical guarantee against dimensional collapse.

Connection to hard whitening. Hard whitening methods, like W-MSE (Ermolov et al., 2021) and Shuffled-DBN (Hua et al., 2021), design a whitening transformation over each view and minimize the distances between the whitened outputs from different views. This mechanism modulates the covariance matrix of the embedding to be full-rank (Weng et al., 2022). Our INTL designs an approximate whitening transformation using IterNorm and imposes an additional trace loss penalty on the (approximately) whitened output, which modulates the covariance matrix of the embedding to have equal eigenvalues.

Connection to soft whitening. Soft whitening methods, like Barlow Twins (Zbontar et al., 2021) and VICReg (Bardes et al., 2022), directly impose a whitening penalty as a regularization on the embedding. This modulates the covariance matrix of the embedding to be the identity (up to a fixed scalar γ, e.g., γI). Our INTL imposes the penalty on the transformed output, but can be viewed as implicitly modulating the covariance matrix of the embedding to be the identity with a free scalar (i.e., to have equal eigenvalues). Intuitively, INTL modulates the spectrum of the embedding towards equal eigenvalues during the backward pass, which is a stronger constraint than hard whitening (the full-rank modulation), but a weaker constraint than soft whitening (the whitening modulation). This preliminary but new comparison provides a new way to understand whitening loss in SSL.

4.2 EMPIRICAL ANALYSIS

In this section, we empirically show that IterNorm-only and trace-loss-only fail to avoid collapse, while IterNorm combined with trace loss avoids collapse well.

IterNorm fails to avoid collapse. In theory, IterNorm can map all non-zero eigenvalues to approach one, given a large enough T. In practice, a fixed T is usually used, and it is very likely that small eigenvalues are encountered during training. In this case, IterNorm cannot ensure that the transformed output has a well-conditioned spectrum (Figure 3(b)), which potentially results in dimensional collapse. One may use a large T; however, IterNorm encounters numeric divergence upon further increasing the iteration number T, even though it has converged. E.g., IterNorm suffers from numeric divergence in Figure 3(d) when using T = 9, since the maximum eigenvalue of the whitened output is around $10^7$, significantly larger than 1 (we attribute this to numeric divergence, since the result goes against Propositions 1 and 2, and we further validate it by monitoring the transformed output). It is noteworthy that when $T \geq 11$, the loss values are all NaN, making the model unable to be trained. These problems make it difficult for IterNorm to avoid dimensional collapse in practice.
Table 1: Classification top-1 accuracy of a linear classifier and a 5-nearest-neighbors classifier for different loss functions and datasets. The table is mostly inherited from solo-learn (da Costa et al., 2022). All methods are based on ResNet-18 with two augmented views generated per sample, and are trained for 1000 epochs on CIFAR-10/100 with a batch size of 256 and 400 epochs on ImageNet-100 with a batch size of 128.

| Method | CIFAR-10 top-1 | CIFAR-10 5-nn | CIFAR-100 top-1 | CIFAR-100 5-nn | ImageNet-100 top-1 | ImageNet-100 5-nn |
|---|---|---|---|---|---|---|
| SimCLR (Chen et al., 2020a) | 90.74 | 85.13 | 65.78 | 53.19 | 77.64 | 65.78 |
| MoCo V2 (Chen et al., 2020b) | 92.94 | 88.95 | 69.89 | 58.09 | 79.28 | 70.46 |
| BYOL (Grill et al., 2020) | 92.58 | 87.40 | 70.46 | 56.46 | 80.32 | 68.94 |
| SwAV (Caron et al., 2020) | 89.17 | 84.18 | 64.88 | 53.32 | 74.28 | 63.84 |
| SimSiam (Chen & He, 2021) | 90.51 | 86.82 | 66.04 | 55.79 | 78.72 | 67.92 |
| W-MSE (Ermolov et al., 2021) | 88.67 | 84.95 | 61.33 | 49.65 | 69.06 | 58.44 |
| Shuffled-DBN (Hua et al., 2021) | 91.17 | 88.95 | 66.81 | 57.27 | 75.27 | 67.21 |
| DINO (Caron et al., 2021) | 89.52 | 86.13 | 66.76 | 56.24 | 74.92 | 64.30 |
| Barlow Twins (Zbontar et al., 2021) | 92.10 | 88.09 | 70.90 | 59.40 | 80.16 | 72.14 |
| VICReg (Bardes et al., 2022) | 92.07 | 87.38 | 68.54 | 56.32 | 79.40 | 71.94 |
| Zero-CL (Zhang et al., 2022b) | 90.81 | 87.51 | 70.33 | 59.21 | 79.26 | 71.18 |
| CW-RGP (Weng et al., 2022) | 92.03 | 89.67 | 67.78 | 58.24 | 76.96 | 68.46 |
| INTL (ours) | 92.60 | 90.03 | 70.88 | 61.90 | 81.68 | 73.46 |

Table 2: Comparisons on ImageNet linear classification with various training epochs. All methods are based on a ResNet-50 backbone with two augmented views generated per sample. EMA denotes exponential moving average. Given that one of the objectives of SSL methods is to achieve high performance with small batch sizes, it is worth noting that our INTL performs effectively when trained with small batch sizes, such as 256 and 512.

| Method | Batch size | EMA | 100 eps | 200 eps | 400 eps | 800 eps |
|---|---|---|---|---|---|---|
| SimCLR | 4096 | No | 66.5 | 68.3 | 69.8 | 70.4 |
| SwAV | 4096 | No | 66.5 | 69.1 | 70.7 | 71.8 |
|  | 512 | No | 65.8 | 67.9 | - | - |
| SimSiam | 256 | No | 68.1 | 70.0 | 70.8 | 71.3 |
| W-MSE | 512 | No | 65.1 | 66.4 | - | - |
| Shuffled-DBN | 512 | No | 65.2 | - | - | - |
| Barlow Twins | 2048 | No | 67.7 | - | 72.5 | 73.2 |
| Zero-CL | 1024 | No | 68.9 | - | 72.6 | - |
| CW-RGP | 512 | No | 67.1 | 69.6 | - | - |
| INTL (ours) | 512 | No | 69.5 | 71.1 | 72.4 | 73.1 |
| MoCo v2 | 256 | Yes | 67.4 | 69.9 | 71.0 | 72.2 |
| BYOL | 4096 | Yes | 66.5 | 70.6 | 73.2 | 74.3 |
| INTL (ours) | 256 | Yes | 69.2 | 71.5 | 73.7 | 74.3 |

The synergy between IterNorm and trace loss. IterNorm in combination with trace loss behaves very differently from IterNorm alone. Our experimental results, as shown in Figure 3(a), empirically confirm that INTL effectively prevents dimensional collapse, aligning with the findings of Theorem 2. INTL encourages the uniformity of the eigenvalues of the covariance matrix of the embedding Z, resulting in well-conditioned spectra for the transformed output (Figure 3(b)) and impressive evaluation performance (Figure 3(c)), even when the iteration count T is as low as 1. To further evaluate the performance of trace-loss-only, we conducted experiments under the same setup. Without IterNorm, trace-loss-only achieves a top-1 accuracy of only 16.15%, indicating significant collapse. Therefore, the efficacy of INTL, as well as the attainment of an optimal solution characterized by equal eigenvalues, is a result of the synergy between IterNorm and trace loss.

5 EXPERIMENTS ON STANDARD SSL BENCHMARK

In this section, we conduct experiments on standard SSL benchmarks to validate the effectiveness of our proposed INTL. We first evaluate the performance of INTL for classification on CIFAR-10/100 (Krizhevsky, 2009), ImageNet-100 (Tian et al., 2020a), and ImageNet (Deng et al., 2009).
We then evaluate its effectiveness in transfer learning using a model pre-trained with INTL. We provide the full PyTorch-style algorithm in Appendix C, as well as details of implementation and computational overhead in Appendix E.

Table 3: Transfer learning. All competitive unsupervised methods are based on 200-epoch pre-training on ImageNet (IN). The table is mostly inherited from (Chen & He, 2021). Our INTL is run with 3 random seeds, with mean and standard deviation reported. (Det.: COCO detection; Seg.: COCO instance segmentation.)

| Method | Det. AP50 | Det. AP | Det. AP75 | Seg. AP50 | Seg. AP | Seg. AP75 |
|---|---|---|---|---|---|---|
| Scratch | 44.0 | 26.4 | 27.8 | 46.9 | 29.3 | 30.8 |
| Supervised | 58.2 | 38.2 | 41.2 | 54.7 | 33.3 | 35.2 |
| SimCLR | 57.7 | 37.9 | 40.9 | 54.6 | 33.3 | 35.3 |
| MoCo v2 | 58.8 | 39.2 | 42.5 | 55.5 | 34.3 | 36.6 |
| BYOL | 57.8 | 37.9 | 40.9 | 54.3 | 33.2 | 35.0 |
| SwAV | 57.6 | 37.6 | 40.3 | 54.2 | 33.1 | 35.1 |
| SimSiam | 57.5 | 37.9 | 40.9 | 54.2 | 33.2 | 35.2 |
| W-MSE (repro.) | 60.1 | 39.2 | 42.8 | 56.8 | 34.8 | 36.7 |
| Barlow Twins | 59.0 | 39.2 | 42.5 | 56.0 | 34.3 | 36.5 |
| INTL (ours) | 60.9±0.08 | 40.7±0.09 | 43.7±0.17 | 57.3±0.08 | 35.4±0.05 | 37.6±0.14 |

5.1 EVALUATION FOR CLASSIFICATION

Evaluation on small and medium-size datasets. We initially train and perform linear evaluation of INTL using ResNet-18 as the backbone on CIFAR-10/100 (Krizhevsky, 2009) and ImageNet-100 (Tian et al., 2020a). We strictly adhere to the experimental settings outlined in solo-learn (da Costa et al., 2022) for these datasets. As depicted in Table 1, INTL achieves remarkable results, with a top-1 accuracy of 92.60% on CIFAR-10, 70.88% on CIFAR-100, and 81.68% on ImageNet-100. These results are on par with or even surpass the state-of-the-art methods as reproduced by solo-learn. Furthermore, when employing a 5-nearest-neighbors classifier, INTL outperforms the other baselines by a significant margin, underscoring its capacity to learn superior representations.

Evaluation on ImageNet. To further assess the versatility of INTL, we train it using a ResNet-50 backbone and evaluate its performance using the standard linear evaluation protocol on ImageNet. The results, presented in Table 2, demonstrate the effectiveness of INTL, achieving top-1 accuracies of 69.5%, 71.1%, 72.4%, and 73.1% after pre-training for 100, 200, 400, and 800 epochs, respectively. We observe that our INTL performs even better when used in conjunction with the exponential moving average (EMA) technique employed in BYOL and MoCo. This combination yields a top-1 accuracy of 74.3% after 800 epochs of training.

5.2 TRANSFER TO DOWNSTREAM TASKS

We examine the representation quality by transferring our pre-trained model to other tasks, including COCO (Lin et al., 2014) object detection and instance segmentation. We use the baseline detection codebase from MoCo (He et al., 2020) for INTL. The baseline results shown in Table 3 are mostly inherited from (Chen & He, 2021). We observe that INTL performs much better than other state-of-the-art approaches on COCO object detection and instance segmentation, which shows the great potential of INTL in transferring to downstream tasks.

5.3 ABLATION STUDY

We conduct a comprehensive set of ablation experiments to assess the robustness and versatility of our INTL in Appendix F. These experiments cover various aspects, including batch sizes, embedding dimensions, the use of multi-crop augmentation, semi-supervised training, the choice of Vision Transformer (ViT) backbones, and adding the trace loss to other methods.
Through these experiments, we gain valuable insights into how INTL performs under different conditions and configurations, shedding light on its adaptability and effectiveness in diverse scenarios. The results collectively reinforce the notion that INTL is a robust and flexible self-supervised learning method capable of delivering strong performance across a wide range of settings and data representations. Notably, our INTL achieved a remarkable top-1 accuracy of 76.6% on ImageNet linear evaluation with ResNet-50 when employing multi-crop augmentation, surpassing even the common supervised baseline of 76.5%.

6 CONCLUSION

In this paper, we proposed the spectral transformation (ST) framework to modulate the spectrum of the embedding and to seek functions beyond whitening that can avoid dimensional collapse. Our proposed IterNorm with trace loss (INTL) is well-motivated, theoretically justified, and empirically validated in avoiding dimensional collapse. Comprehensive experiments have shown the merits of INTL in achieving state-of-the-art performance for SSL in practice. We showed that INTL modulates the spectrum of the embedding towards equal eigenvalues during the backward pass, which is a stronger constraint than hard whitening (the full-rank modulation), but a weaker constraint than soft whitening (the whitening modulation). These preliminary but new results provide a potential way to understand and compare SSL methods.

ACKNOWLEDGMENTS

This work was partially supported by the National Science and Technology Major Project under Grant 2022ZD0116310, the National Natural Science Foundation of China (Grant No. 62106012), and the Fundamental Research Funds for the Central Universities.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility and comprehensiveness of our paper, we have included an appendix comprising six main sections, which serve the following purposes:

- Appendix A contains detailed proofs for the propositions presented in our work.
- Appendix B provides in-depth proofs for the theorems introduced in our research.
- Appendix C offers a comprehensive view of the INTL algorithm, including detailed formulas and PyTorch-style code for implementation.
- Appendix D elaborates on the settings used in our analytical experiments, with reference to Figure 2 and Figure 3.
- Appendix E furnishes insights into the implementation details and computational intricacies of experiments conducted on standard SSL benchmarks, as discussed in Section 5.
- Appendix F encompasses a comprehensive set of ablation experiments, assessing the robustness and versatility of our INTL method across various scenarios.

REFERENCES

Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In NeurIPS, 2019.

Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR, 2022.

Dario A. Bini, Nicholas J. Higham, and Beatrice Meini. Algorithms for the matrix pth root. Numerical Algorithms, 2005.

Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020a.

Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2021.

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b.

ImageNet contributors. ImageNet terms of access. 2020. URL https://image-net.org/download.

Victor Guilherme Turrisi da Costa, Enrico Fini, Moin Nabi, Nicu Sebe, and Elisa Ricci. solo-learn: A library of self-supervised methods for visual representation learning. Journal of Machine Learning Research, 23(56):1-6, 2022. URL http://jmlr.org/papers/v23/21-1155.html.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning. In ICML, 2021.

Abe Fetterman and Josh Albrecht. Understanding self-supervised and contrastive learning with bootstrap your own latent (BYOL). Technical Report, 2020.

Inc. Flickr. Flickr terms and conditions of use. 2020. URL http://aiweb.techfak.uni-bielefeld.de/content/bworld-robot-control-software/.

Quentin Garrido, Yubei Chen, Adrien Bardes, Laurent Najman, and Yann LeCun. On the duality between contrastive and non-contrastive self-supervised learning. In ICLR, 2023.

Arna Ghosh, Arnab Kumar Mondal, Kumar Krishna Agrawal, and Blake A. Richards. Investigating power laws in deep representation learning. arXiv preprint arXiv:2202.05808, 2022.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020.

Bobby He and Mete Ozay. Exploring the gap between collapsed and whitened features in self-supervised learning. In ICML, 2022.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.

Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In ICML, 2020.

Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. On feature decorrelation in self-supervised learning. In ICCV, 2021.

Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In CVPR, 2018.

Lei Huang, Yi Zhou, Fan Zhu, Li Liu, and Ling Shao. Iterative normalization: Beyond standardization towards efficient whitening. In CVPR, 2019.

Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. arXiv preprint arXiv:2011.00362, 2020.

Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. In ICLR, 2022.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
Junnan Li, Pan Zhou, Caiming Xiong, and Steven C.H. Hoi. Prototypical contrastive learning of unsupervised representations. In ICLR, 2021.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and Larry Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

Yixin Liu, Shirui Pan, Ming Jin, Chuan Zhou, Feng Xia, and Philip S. Yu. Graph self-supervised learning: A survey. arXiv e-prints, arXiv 2103, 2021.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Kanchana Ranasinghe, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Michael Ryoo. Self-supervised video transformer. In CVPR, 2022.

Pierre H. Richemond, Jean-Bastien Grill, Florent Altché, Corentin Tallec, Florian Strub, Andrew Brock, Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, et al. BYOL works even without batch statistics. arXiv preprint arXiv:2010.10241, 2020.

Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019.

Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening and coloring batch transform for GANs. In ICLR, 2019.

Chenxin Tao, Honghui Wang, Xizhou Zhu, Jiahua Dong, Shiji Song, Gao Huang, and Jifeng Dai. Exploring the equivalence of siamese self-supervised learning via a unified gradient framework. In CVPR, 2022.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV, 2020a.

Yuandong Tian, Lantao Yu, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning with dual deep networks. CoRR, abs/2010.00578, 2020b.

Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In ICML, 2021.

Xiao Wang and Guo-Jun Qi. Contrastive learning with stronger augmentations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):5549-5560, 2022.

Xi Weng, Lei Huang, Lei Zhao, Rao Muhammad Anwer, Salman Khan, and Fahad Khan. An investigation into whitening loss for self-supervised learning. In NeurIPS, 2022.

Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.

Chengxi Ye, Matthew Evanusa, Hua He, Anton Mitrokhin, Tom Goldstein, James A. Yorke, Cornelia Fermüller, and Yiannis Aloimonos. Network deconvolution. In ICLR, 2020.

Mang Ye, Xu Zhang, Pong C. Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In CVPR, 2019.

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In ICML, 2021.
Chaoning Zhang, Kang Zhang, Chenshuang Zhang, Trung X. Pham, Chang D. Yoo, and In So Kweon. How does SimSiam avoid collapse without negative samples? A unified understanding with self-supervised contrastive learning. In ICLR, 2022a.

Hengrui Zhang, Qitian Wu, Junchi Yan, David Wipf, and Philip S. Yu. From canonical correlation analysis to self-supervised graph neural networks. In NeurIPS, 2021.

Shaofeng Zhang, Feng Zhu, Junchi Yan, Rui Zhao, and Xiaokang Yang. Zero-CL: Instance and feature decorrelation for negative-free symmetric contrastive learning. In ICLR, 2022b.

Yifei Zhang, Hao Zhu, Zixing Song, Piotr Koniusz, and Irwin King. Spectral feature augmentation for graph contrastive learning and beyond. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 11289-11297, 2023.

A PROOFS OF PROPOSITIONS

A.1 PROOF OF PROPOSITION 1

Proposition 1. Given $x \in (0, 1)$ and $T \in \mathbb{N}$, we have $h_T(x) \in (0, 1)$ and $h_T'(x) > 0$.

Proof. The iterative function $f_T(x)$ satisfies
$$f_{k+1}(x) = \tfrac{3}{2} f_k(x) - \tfrac{1}{2} x f_k^3(x),\ k \geq 0; \qquad f_0(x) = 1. \quad (8)$$
We define $h_T(x) = x f_T^2(x)$. When $x = 1$, it is easy to verify that $\forall T \in \mathbb{N}$, $h_T(1) = f_T(1) = 1$. We first prove $f_T(x) > 0$ and $h_T'(x) > 0$ by mathematical induction.

(1) When $T = 0$, we have $f_0(x) = 1 > 0$, $h_0(x) = x$, and $h_0'(x) = 1 > 0$.

(2) Assume the claim holds for $T = k$, i.e., $f_k(x) > 0$ and $h_k'(x) > 0$. Based on $h_k'(x) = f_k(x)\,[f_k(x) + 2x f_k'(x)]$, we have
$$f_k(x) + 2x f_k'(x) > 0. \quad (9)$$
Since $h_k(1) = 1$, $h_k'(x) > 0$ and $h_k(x)$ is continuous, we have $h_k(x) < 1$ for $x \in (0, 1)$. We thus obtain
$$f_{k+1}(x) = \tfrac{1}{2} f_k(x)\,[3 - x f_k^2(x)] = \tfrac{1}{2} f_k(x)\,[3 - h_k(x)] > 0. \quad (10)$$
Furthermore, $h_{k+1}'(x) = f_{k+1}(x)\,[f_{k+1}(x) + 2x f_{k+1}'(x)]$, where
$$\begin{aligned}
f_{k+1}(x) + 2x f_{k+1}'(x) &= \tfrac{3}{2}\,[f_k(x) + 2x f_k'(x)] - \tfrac{3}{2} x f_k^3(x) - 3 x^2 f_k^2(x) f_k'(x) \\
&= \tfrac{3}{2}\,[f_k(x) + 2x f_k'(x)] - \tfrac{3}{2} x f_k^2(x)\,[f_k(x) + 2x f_k'(x)] \\
&= \tfrac{3}{2}\,[1 - x f_k^2(x)]\,[f_k(x) + 2x f_k'(x)] \\
&= \tfrac{3}{2}\,[1 - h_k(x)]\,[f_k(x) + 2x f_k'(x)].
\end{aligned}$$
So we have $h_{k+1}'(x) = \tfrac{3}{2} f_{k+1}(x)\,[1 - h_k(x)]\,[f_k(x) + 2x f_k'(x)] > 0$. Combining this with the result in Eqn. 10, the claim holds for $T = k + 1$.

As a result, $\forall T \in \mathbb{N}$, $f_T(x) > 0$ and $h_T'(x) > 0$ when $x \in (0, 1)$. Since $h_T(1) = 1$ and $h_T(x)$ is continuous, we have $h_T(x) < 1$. Besides, $h_T(x) = x f_T^2(x) > 0$, hence $h_T(x) \in (0, 1)$.

A.2 PROOF OF PROPOSITION 2

Proposition 2. Given $x \in (0, 1)$ and $T \in \mathbb{N}$, we have $h_{T+1}(x) > h_T(x)$.

Proof. According to the proof of Proposition 1, when $x \in (0, 1)$ and $T \in \mathbb{N}$, we have $f_T(x) > 0$ and $h_T(x) = x f_T^2(x) \in (0, 1)$. Therefore, $h_{T+1}(x) > h_T(x) \iff f_{T+1}(x) > f_T(x)$. It is obvious that
$$f_{k+1}(x) - f_k(x) = \tfrac{3}{2} f_k(x) - \tfrac{1}{2} x f_k^3(x) - f_k(x) = \tfrac{1}{2} f_k(x)\,[1 - x f_k^2(x)] = \tfrac{1}{2} f_k(x)\,[1 - h_k(x)] > 0.$$
So given $x \in (0, 1)$ and $T \in \mathbb{N}$, we have $h_{T+1}(x) > h_T(x)$.

B PROOFS OF THEOREMS

B.1 PROOF OF THEOREM 1

Theorem 1. Define the one-variable iterative function $f_T(x)$, satisfying $f_{k+1}(x) = \tfrac{3}{2} f_k(x) - \tfrac{1}{2} x f_k^3(x),\ k \geq 0$, with $f_0(x) = 1$. The mapping function of IterNorm is $g(\lambda) = f_T\big(\tfrac{\lambda}{\mathrm{tr}(\Sigma)}\big) / \sqrt{\mathrm{tr}(\Sigma)}$, so that IterNorm maps each $\lambda_i \in \lambda(Z)$ to $\hat{\lambda}_i = \tfrac{\lambda_i}{\mathrm{tr}(\Sigma)} f_T^2\big(\tfrac{\lambda_i}{\mathrm{tr}(\Sigma)}\big)$.

Proof. Given $\Sigma = U \Lambda U^T$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ and $U = [u_1, \ldots, u_d]$, and following the calculation steps of IterNorm, we have
$$\Sigma_N = \Sigma / \mathrm{tr}(\Sigma) = \sum_{i=1}^{d} \frac{\lambda_i}{\mathrm{tr}(\Sigma)}\, u_i u_i^T. \quad (11)$$
Define
$$\Phi_T' = \sum_{i=1}^{d} \frac{1}{\sqrt{\mathrm{tr}(\Sigma)}}\, f_T\Big(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big)\, u_i u_i^T. \quad (12)$$
Based on $\Phi_T' = \sum_{i=1}^{d} g(\lambda_i)\, u_i u_i^T$ with the claimed $g(\cdot)$, if we can prove $\Phi_T' = \Phi_T$, we will have $g(\lambda) = f_T\big(\tfrac{\lambda}{\mathrm{tr}(\Sigma)}\big)/\sqrt{\mathrm{tr}(\Sigma)}$. Define $P_T' = \sqrt{\mathrm{tr}(\Sigma)}\, \Phi_T'$; then $\Phi_T' = \Phi_T \iff P_T' = P_T$. We can prove $P_T' = P_T$ by mathematical induction.
(1) When $T = 0$, $f_0\big(\tfrac{\lambda_i}{\mathrm{tr}(\Sigma)}\big) = 1$, so $P_0' = P_0 = I$.

(2) When $T \geq 1$, assume that $P_{T-1}' = P_{T-1}$; thus
$$P_T = \tfrac{1}{2}\big(3 P_{T-1} - P_{T-1}^3 \Sigma_N\big) = \tfrac{1}{2}\big(3 P_{T-1}' - (P_{T-1}')^3 \Sigma_N\big).$$
According to the definition of $P_T'$,
$$P_{T-1}' = \sum_{i=1}^{d} f_{T-1}\Big(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big)\, u_i u_i^T.$$
Because $u_i^T u_i = 1$ for all i and $u_i^T u_j = 0$ for all $i \neq j$,
$$(P_{T-1}')^3\, \Sigma_N = \Big(\sum_{i=1}^{d} f_{T-1}\Big(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big) u_i u_i^T\Big)^3 \Big(\sum_{i=1}^{d} \frac{\lambda_i}{\mathrm{tr}(\Sigma)} u_i u_i^T\Big) = \sum_{i=1}^{d} f_{T-1}^3\Big(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big) \frac{\lambda_i}{\mathrm{tr}(\Sigma)}\, u_i u_i^T.$$
Therefore, we have
$$P_T = \sum_{i=1}^{d} \Big[\tfrac{3}{2} f_{T-1}\Big(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big) - \tfrac{1}{2} f_{T-1}^3\Big(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big) \frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big]\, u_i u_i^T = \sum_{i=1}^{d} f_T\Big(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big)\, u_i u_i^T = P_T'.$$
We thus obtain
$$\Phi_T = \sum_{i=1}^{d} \frac{1}{\sqrt{\mathrm{tr}(\Sigma)}}\, f_T\Big(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big)\, u_i u_i^T = U \frac{1}{\sqrt{\mathrm{tr}(\Sigma)}}\, f_T\Big(\frac{\Lambda}{\mathrm{tr}(\Sigma)}\Big) U^T.$$
Thus, the mapping function of IterNorm is $g(\lambda) = f_T\big(\tfrac{\lambda}{\mathrm{tr}(\Sigma)}\big)/\sqrt{\mathrm{tr}(\Sigma)}$. The whitened output is $\hat{Z} = \Phi_T Z_c = U \tfrac{1}{\sqrt{\mathrm{tr}(\Sigma)}} f_T\big(\tfrac{\Lambda}{\mathrm{tr}(\Sigma)}\big) U^T Z_c$. The covariance matrix of $\hat{Z}$ is
$$\frac{1}{m}\hat{Z}\hat{Z}^T = U \frac{\Lambda}{\mathrm{tr}(\Sigma)}\, f_T^2\Big(\frac{\Lambda}{\mathrm{tr}(\Sigma)}\Big) U^T = \sum_{i=1}^{d} \frac{\lambda_i}{\mathrm{tr}(\Sigma)}\, f_T^2\Big(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\Big)\, u_i u_i^T,$$
so that each $\lambda_i \in \lambda(Z)$ is mapped by IterNorm to $\hat{\lambda}_i = \tfrac{\lambda_i}{\mathrm{tr}(\Sigma)} f_T^2\big(\tfrac{\lambda_i}{\mathrm{tr}(\Sigma)}\big)$, which is a special instance of spectral transformation.

B.2 PROOF OF THEOREM 2

Theorem 2. Let $x \in [0, 1]^d$ and $T \in \mathbb{N}^+$. $\mathcal{L}_{INTL}(x)$ shown in Eqn. 7 is a strictly convex function, and $x^* = [\tfrac{1}{d}, \ldots, \tfrac{1}{d}]^T$ is the unique minimum point as well as the optimal solution of $\mathcal{L}_{INTL}(x)$.

Proof. INTL can be viewed as the following optimization problem:
$$\min_{\theta \in \Theta} \mathcal{L}_{INTL}(Z) = \sum_{j=1}^{d} \big(1 - (\Sigma_{\hat{Z}})_{jj}\big)^2, \quad (13)$$
where $Z = F_\theta(\cdot)$ and $\hat{Z} = \mathrm{IterNorm}(Z)$. Eqn. 6 can be viewed as an optimization problem over θ that encourages the trace of $\Sigma_{\hat{Z}}$ to be d. Let $(x_1, \ldots, x_d) = \varphi(Z)$, where $x_i = \lambda_i / \mathrm{tr}(\Sigma)$ as defined in the main paper. If $Z \in \mathbb{R}^{d \times m}$, then $\varphi(\cdot)$ is surjective from $\mathbb{R}^{d \times m}$ onto $D_x = \{x \in [0, 1]^d : x_1 + \cdots + x_d = 1\}$. Suppose the range of $F_\theta(\cdot)$ is wide enough, for example, $F_\theta(\cdot)$ is surjective from $\theta \in \Theta$ onto $Z \in \mathbb{R}^{d \times m}$ (here we view $F_\theta(\cdot)$ as a function of θ, since the input is given and fixed). Then $\varphi(F_\theta(\cdot))$ is surjective from $\theta \in \Theta$ onto $x \in D_x$, meaning that if we find the optimal solution $x^*$, we are able to get a corresponding $\theta^* \in \Theta$ such that $x^* = \varphi(F_{\theta^*}(\cdot))$. Conversely, for any $\theta \in \Theta$, we can get $x = \varphi(F_\theta(\cdot)) \in D_x$. Therefore, the optimization problem for minimizing INTL can be written as follows, which has the same range and optimal value as Eqn. 6:
$$(P_{INTL}) \qquad \min_x\ \mathcal{L}_{INTL}(x) = \sum_{j=1}^{d} \Big(\sum_{i=1}^{d} \big[1 - h_T(x_i)\big]\, u_{ji}^2\Big)^2 \quad \text{s.t.}\ \sum_{i=1}^{d} x_i = 1,\ x_i \geq 0,\ i = 1, \ldots, d.$$
We denote the Lagrange function of $(P_{INTL})$ as
$$L(x; \alpha, \mu) = \mathcal{L}_{INTL}(x) + \sum_{i=1}^{d} \alpha_i(-x_i) + \mu\Big(\sum_{i=1}^{d} x_i - 1\Big).$$

B.2.1 CONVEXITY AND CONCAVITY OF $h_T(x)$

Before calculating the extreme points of $(P_{INTL})$, we first consider the convexity and concavity of $h_T(x)$, which is critical to the proof. When $T = 0$, we have $h_0(x) = x$, so $h_0''(x) = 0$.

(1) When $T = 1$, we have $h_1(x) = x f_1^2(x) = \tfrac{9}{4}x - \tfrac{3}{2}x^2 + \tfrac{1}{4}x^3$, so $h_1''(x) = \tfrac{3}{2}(x - 2) < 0$.

(2) Assume that $h_k''(x) < 0$ holds for $T = k$. By differentiation we easily obtain:
$$f_{k+1}'(x) = \tfrac{3}{2} f_k'(x) - \tfrac{1}{2} f_k^3(x) - \tfrac{3}{2} x f_k^2(x) f_k'(x), \quad (15)$$
$$f_{k+1}''(x) = \tfrac{3}{2} f_k''(x) - 3 f_k^2(x) f_k'(x) - \tfrac{3}{2} x f_k^2(x) f_k''(x) - 3 x f_k(x)\,[f_k'(x)]^2, \quad (16)$$
$$h_{k+1}''(x) = 4 f_{k+1}(x) f_{k+1}'(x) + 2x\,[f_{k+1}'(x)]^2 + 2x f_{k+1}(x) f_{k+1}''(x). \quad (17)$$
For convenience in our calculation, let $a = f_k(x)$, $b = f_k'(x)$, $c = f_k''(x)$, and $h = h_k(x) = x a^2$.
We split Eqn. 17 into three parts and take Eqns. 15 and 16 into the calculation:
$$4 f_{k+1}(x) f_{k+1}'(x) = 4 \cdot \tfrac{a}{2}(3 - h) \cdot \tfrac{1}{2}(3b - a^3 - 3bh) = a(3 - h)(3b - a^3 - 3bh),$$
$$2x\,[f_{k+1}'(x)]^2 = 2x \cdot \tfrac{1}{4}(3b - a^3 - 3bh)^2 = \tfrac{1}{2} x (3b - a^3 - 3bh)^2,$$
$$2x f_{k+1}(x) f_{k+1}''(x) = 2x \cdot \tfrac{a}{2}(3 - h) \cdot \big[\tfrac{3}{2} c(1 - h) - 3a^2 b - 3x a b^2\big] = \tfrac{1}{2} a x (3 - h)\big[3c(1 - h) - 6a^2 b - 6x a b^2\big].$$
Aiming to construct the form $h_k''(x) = 2(2ab + xac + xb^2)$, we first calculate
$$\begin{aligned}
4 f_{k+1}(x) f_{k+1}'(x) + 2x f_{k+1}(x) f_{k+1}''(x)
&= \tfrac{1}{2}(3 - h)\big[6ab - 2a^4 - 6abh + 3xac(1 - h) - 6abh - 6xb^2 h\big] \\
&= \tfrac{1}{2}(3 - h)\big[3xac(1 - h) + 6ab(1 - h) + 3xb^2(1 - h) - 3xb^2(1 - h) - 2a^4 - 6abh - 6xb^2 h\big] \\
&= \tfrac{3}{4}(3 - h)(1 - h)\, h_k''(x) - \tfrac{1}{2}(3 - h)\big(3xb^2 h + 3xb^2 + 2a^4 + 6abh\big).
\end{aligned}$$
Then we calculate the remaining part:
$$2x\,[f_{k+1}'(x)]^2 = \tfrac{1}{2}\big(9xb^2 + xa^6 + 9xb^2h^2 - 6xa^3 b - 18xb^2 h + 6xa^3 b h\big) = \tfrac{1}{2}\big(9xb^2 + a^4 h + 9xb^2 h^2 - 6abh - 18xb^2 h + 6abh^2\big).$$
For convenience, let
$$S = -\tfrac{1}{2}(3 - h)\big(3xb^2 h + 3xb^2 + 2a^4 + 6abh\big) = \tfrac{1}{2}\big(3xb^2h^2 + 3xb^2 h + 2a^4 h + 6abh^2 - 9xb^2 h - 9xb^2 - 6a^4 - 18abh\big).$$
Then we have
$$\begin{aligned}
2x\,[f_{k+1}'(x)]^2 + S &= \tfrac{1}{2}\big(3a^4 h + 12xb^2h^2 - 24abh - 24xb^2 h + 12abh^2 - 6a^4\big) \\
&= \tfrac{3}{2}(h - 2)\big(a^4 + 4abh + 4xb^2 h\big) = \tfrac{3}{2}(h - 2)\big(a^4 + 4xa^3 b + 4x^2 a^2 b^2\big) \\
&= \tfrac{3}{2}(h - 2)\big(a^2 + 2xab\big)^2 = \tfrac{3}{2}\,\big[h_k(x) - 2\big]\,\big[h_k'(x)\big]^2.
\end{aligned}$$
Here we obtain
$$h_{k+1}''(x) = \tfrac{3}{4}\,\big[3 - h_k(x)\big]\big[1 - h_k(x)\big]\, h_k''(x) + \tfrac{3}{2}\,\big[h_k(x) - 2\big]\,\big[h_k'(x)\big]^2.$$
Since $h_k(x) \in (0, 1)$ (Proposition 1), we have $h_{k+1}''(x) < 0$.

Therefore, when $x \in (0, 1)$ and $T \in \mathbb{N}^+$, $h_T(x) = x f_T^2(x)$ is a strictly concave function satisfying $h_T''(x) < 0$, while $h_0''(x) = 0$.

B.2.2 OPTIMAL SOLUTION FOR THE LAGRANGE FUNCTION

Based on Section B.2.1, when $x \in (0, 1)$ and $T \in \mathbb{N}^+$, $h_T(x) = x f_T^2(x)$ is strictly concave, so $1 - h_T(x)$ is strictly convex. We first discuss $g_j(x_1, \ldots, x_d) = \sum_{i=1}^{d} [1 - h_T(x_i)]\, u_{ji}^2$. The Hessian matrix of $g_j(x_1, \ldots, x_d)$ with respect to x is
$$\nabla^2 g_j = \mathrm{diag}\big(-u_{j1}^2\, h_T''(x_1),\ \ldots,\ -u_{jd}^2\, h_T''(x_d)\big),$$
and the Hessian matrix of $g_j^2(x_1, \ldots, x_d)$ with respect to x is
$$\nabla^2 (g_j^2) = \nabla(2 g_j \nabla g_j) = 2 g_j\, \nabla^2 g_j + 2\, (\nabla g_j)(\nabla g_j)^T.$$
The eigenvalues of $(\nabla g_j)(\nabla g_j)^T$ are $(\nabla g_j)^T(\nabla g_j), 0, \ldots, 0$; all are non-negative, so $2 (\nabla g_j)(\nabla g_j)^T$ is positive semi-definite. Now the Hessian matrix of $\mathcal{L}_{INTL}(x)$ is
$$\nabla^2 \mathcal{L}_{INTL} = \sum_{j=1}^{d} \nabla^2 (g_j^2) = 2 \sum_{j=1}^{d} (\nabla g_j)(\nabla g_j)^T + 2 \sum_{j=1}^{d} g_j\, \nabla^2 g_j,$$
where the i-th diagonal entry of $2 \sum_{j=1}^{d} g_j \nabla^2 g_j$ is $-2 \sum_{j=1}^{d} u_{ji}^2\, h_T''(x_i)\, g_j$. Note that $h_T''(x_i) < 0$, $g_j > 0$, and for each i the $u_{ji}$ are not all zero (since $\sum_{j=1}^{d} u_{ji}^2 = 1$). Therefore, $-\sum_{j=1}^{d} u_{ji}^2\, h_T''(x_i)\, g_j > 0$, and $2 \sum_{j=1}^{d} g_j \nabla^2 g_j$ is a positive definite (diagonal) matrix. Since $2 \sum_{j=1}^{d} (\nabla g_j)(\nabla g_j)^T$ is positive semi-definite, $\nabla^2 \mathcal{L}_{INTL}$ is positive definite. Therefore, $\mathcal{L}_{INTL}(x)$ is strictly convex in x on $(0, 1)^d$, and since $\mathcal{L}_{INTL}(x)$ is continuous, the minimum point on $[0, 1]^d$ coincides with that on $(0, 1)^d$.

Since the constraints of $(P_{INTL})$ form a convex set, $(P_{INTL})$ is a convex program, which means that the KKT point of $(P_{INTL})$ is its unique extreme point and, at the same time, its global minimum point. The KKT conditions of $(P_{INTL})$ are
$$\frac{\partial L}{\partial x_i} = 0,\quad \alpha_i(-x_i) = 0,\quad \alpha_i \geq 0,\quad i = 1, \ldots, d; \qquad \sum_{i=1}^{d} x_i - 1 = 0.$$
One of the solutions to the KKT conditions is
$$x = \big[\tfrac{1}{d}, \ldots, \tfrac{1}{d}\big]^T,\qquad \alpha = 0,\qquad \mu = 2\, h_T'\big(\tfrac{1}{d}\big)\big[1 - h_T\big(\tfrac{1}{d}\big)\big].$$
It is easy to verify the last three conditions. As for the first condition, for all $t = 1, \ldots, d$, we have
$$\frac{\partial L}{\partial x_t} = 2 h_T'(x_t) \sum_{j=1}^{d} \sum_{i=1}^{d} \big[h_T(x_i) - 1\big] u_{ji}^2 u_{jt}^2 - \alpha_t + \mu,$$
which, at $x = [\tfrac{1}{d}, \ldots, \tfrac{1}{d}]^T$ with $\alpha = 0$, equals
$$2 h_T'\big(\tfrac{1}{d}\big)\big[h_T\big(\tfrac{1}{d}\big) - 1\big] \sum_{j=1}^{d} u_{jt}^2 + \mu = 2 h_T'\big(\tfrac{1}{d}\big)\big[h_T\big(\tfrac{1}{d}\big) - 1\big] + \mu = 0.$$
Therefore, $x^* = [\tfrac{1}{d}, \ldots, \tfrac{1}{d}]^T$ is the optimal solution of $(P_{INTL})$. INTL promotes the equality of all eigenvalues in the optimization process, which provides a theoretical guarantee against dimensional collapse.
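As a quick numerical sanity check of Theorem 2 (not part of the paper; the random orthogonal basis, the choice of d and T, and the helper names below are illustrative assumptions), one can evaluate the objective of $(P_{INTL})$ at the equal-eigenvalue point and at random points on the simplex and confirm that the former gives the smaller value:

```python
import numpy as np

def h_T(x, T):
    """T-whitening function h_T(x) = x * f_T(x)^2 (elementwise)."""
    f = np.ones_like(x)
    for _ in range(T):
        f = 1.5 * f - 0.5 * x * f ** 3
    return x * f ** 2

def intl_objective(x, U, T):
    """L_INTL(x) = sum_j ( sum_i [1 - h_T(x_i)] * U[j, i]^2 )^2  (Eqn. 7)."""
    w = 1.0 - h_T(x, T)                    # per-eigenvalue deficit 1 - h_T(x_i)
    return float(np.sum((U ** 2 @ w) ** 2))

rng = np.random.default_rng(0)
d, T = 16, 4
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal eigenbasis

x_equal = np.full(d, 1.0 / d)                       # equal-eigenvalue point x* = [1/d, ..., 1/d]
best_random = min(
    intl_objective(rng.dirichlet(np.ones(d)), U, T) for _ in range(1000)
)
print(intl_objective(x_equal, U, T), "<=", best_random)
```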
C ALGORITHM OF INTL

The description in our paper is based on batch whitening (BW) (Ermolov et al., 2021; Hua et al., 2021), and it extends similarly to channel whitening (CW) (Weng et al., 2022), where the covariance matrix of $Z$ is calculated as $\Sigma = \frac{1}{d} Z^T Z$. We implement INTL based on CW, considering that CW is more effective when the batch size $m$ is relatively small. Given the centralized embeddings of the two positive views $Z^{(v)} \in \mathbb{R}^{d \times m}$, $v \in \{1, 2\}$, we use IterNorm to obtain the approximately whitened outputs $\hat{Z}^{(v)} = [\hat{z}^{(v)}_1, \dots, \hat{z}^{(v)}_m]$. The loss functions used in our method are

$\mathcal{L}_{MSE} = \sum_{i} \Big\| \frac{\hat{z}^{(1)}_i}{\|\hat{z}^{(1)}_i\|_2} - \frac{\hat{z}^{(2)}_i}{\|\hat{z}^{(2)}_i\|_2} \Big\|_2^2, \qquad (18)$

$\mathcal{L}_{trace} = \sum_{v=1}^{2} \sum_{i} \big( 1 - \hat{z}^{(v)\,T}_i \hat{z}^{(v)}_i \big)^2, \qquad (19)$

$\mathcal{L}_{INTL} = \mathcal{L}_{MSE} + \beta\, \mathcal{L}_{trace}, \qquad (20)$

where $\mathcal{L}_{MSE}$ is the MSE of L2-normalized vectors, which minimizes the distance between $\hat{Z}^{(1)}$ and $\hat{Z}^{(2)}$. Here we simplify the expression of $\mathcal{L}_{trace}$ in Eqn. 6, because the off-diagonal elements of $\Sigma_{\hat{Z}}$ do not need to be calculated. $\beta$ is the trade-off between $\mathcal{L}_{MSE}$ and $\mathcal{L}_{trace}$. In our experiments, we observe that when the iteration number $T$ of IterNorm is fixed, the coefficient $\beta$ that obtains good performance depends only on the batch size. We therefore fix the iteration number $T$ to 4 and empirically regress $\beta$ against various batch sizes, obtaining $\beta = 0.01 \times (\log_2(\mathrm{bs}) - 3)$, where $\mathrm{bs}$ denotes the batch size and $\mathrm{bs} > 8$. We keep the iteration number $T$ of IterNorm and the coefficient $\beta$ fixed in this form (i.e., $\beta$ is determined given the batch size) across all datasets and architectures, so INTL can be directly applied to other datasets and models without tuning the coefficient. For clarity, we also describe the algorithm of INTL in PyTorch-style pseudocode, shown in Figure 4(a).

    # f: backbone + projection
    # bs: batch size
    # aug: random augmentation
    import torch
    import torch.nn.functional as F
    from math import log2

    for x in loader:  # load a minibatch x with m samples
        z1, z2 = f(aug(x)), f(aug(x))  # embeddings
        z1_hat, z2_hat = iter_norm(z1), iter_norm(z2)  # transformed outputs
        trade_off = (log2(bs) - 3) * 0.01  # trade-off between MSE and trace loss
        mse = norm_mse(z1_hat, z2_hat)  # MSE loss
        trace_loss = TL(z1_hat) + TL(z2_hat)  # trace loss
        loss = mse + trade_off * trace_loss
        loss.backward()  # optimizer update omitted

    def iter_norm(x, iters=4):  # Iterative Normalization (channel whitening form)
        M, D = x.size()  # x: m x d
        x = x - x.mean(dim=1).reshape(M, 1)  # center each sample
        sigma = (x @ x.T) / (D - 1)  # m x m covariance matrix
        trace = sigma.diagonal().sum()
        sigma_norm = sigma / trace  # trace-normalized covariance
        P = torch.eye(M, device=x.device)  # identity matrix: m x m
        for _ in range(iters):  # Newton's iteration
            P = 0.5 * (3 * P - torch.matrix_power(P, 3) @ sigma_norm)
        return P / trace.sqrt() @ x

    def TL(x):  # trace loss
        _, D = x.size()
        d = torch.pow(x, 2).sum(dim=1) / (D - 1)  # diagonal of the covariance of x
        return d.add_(-1).pow_(2).sum()

    def norm_mse(x0, x1):  # MSE of L2-normalized vectors
        x0 = F.normalize(x0)  # L2-normalize
        x1 = F.normalize(x1)  # L2-normalize
        return 2 - 2 * (x0 * x1).sum(dim=-1).mean()

Figure 4: Algorithm of INTL, PyTorch-style pseudocode.
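To make the batch-size rule concrete, the tiny snippet below (ours, purely illustrative) evaluates $\beta = 0.01 \times (\log_2(\mathrm{bs}) - 3)$ for a range of batch sizes.

    from math import log2

    for bs in (32, 64, 128, 256, 512, 1024):
        print(bs, round(0.01 * (log2(bs) - 3), 3))
    # 32 -> 0.02, 64 -> 0.03, 128 -> 0.04, 256 -> 0.05, 512 -> 0.06, 1024 -> 0.07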
D ANALYTICAL EXPERIMENTS

D.1 EXPERIMENTS ON SYNTHETIC 2D DATASET

In Section 3.2 of the main paper, we conduct experiments on a 2D dataset and report the results with varying $p$. Here, we provide the details of the experimental setup, and further show the results of IterNorm (Huang et al., 2019) for SSL on this 2D dataset.

D.1.1 DETAILS OF EXPERIMENTAL SETUPS

We synthesize a two-dimensional dataset with isotropic Gaussian blobs containing 512 sample points, as shown in Figure 5(a). We construct a toy Siamese network (a simple three-layer neural network consisting of three fully connected (FC) layers, with BN and ReLU appended to the first two) as the encoder for this dataset. The dimensions of the network are (2, 16), (16, 16), and (16, 2), where each pair denotes the input and output dimensions of the corresponding FC layer. We use MSE as the loss function and do not normalize the features before computing it. We train the model by randomly shuffling the data into mini-batches with a batch size of 32, using the stochastic gradient descent (SGD) algorithm with a learning rate of 0.1. For data transformation, we only apply Gaussian noise as data augmentation and generate 2 views from each sample point in the mini-batches. We visualize the output of the initialized network without training in Figure 5(b). All runs are performed under the same random seed.

Figure 5: Visualization of our synthetic 2D dataset. We show (a) the distribution of our 2D dataset; (b) the initial output of the toy Siamese network.

Figure 6: Investigating the spectrum of the transformed output $\hat{Z}$ (solid lines) and the corresponding embedding $Z$ (dashed lines) using IterNorm for SSL with different iteration numbers $T$. We show the evolution of eigenvalues during training on the toy 2D dataset (there are only two eigenvalues; we omit the larger one because it always remains high during training). In particular, (a) shows the results with a well-conditioned initial spectrum and (b) with an ill-conditioned one.

D.1.2 RESULTS OF ITERNORM FOR SSL

To figure out why IterNorm (Huang et al., 2019) fails for SSL, we further conduct experiments investigating the spectrum of the whitened output $\hat{Z}$ produced by IterNorm on this synthetic 2D dataset, for intuitive analysis. The output dimension of the toy model is 2, so the covariance matrix of the output has only two eigenvalues, and we track their alterations during training. IterNorm can obtain an idealized whitened output with a small iteration number (e.g., T=5, as recommended in (Huang et al., 2019)) and avoid collapse if the embedding $Z$ has a well-conditioned spectrum^3 (Figure 6(a)). However, if the embedding $Z$ has an ill-conditioned spectrum, as shown in Figure 6(b), IterNorm fails to pull the small eigenvalue toward 1, which results in dimensional collapse.

^3 A well-conditioned spectrum means that the condition number $c = \lambda_1 / \lambda_d$ is small, where $\lambda_1$ is the maximum eigenvalue and $\lambda_d$ is the minimum one.
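This failure mode can be made concrete with the spectral mapping derived in Appendix B: the sketch below (illustrative only; the sample values of x and T are our choices) evaluates how far IterNorm pushes a normalized eigenvalue $x = \lambda / \mathrm{tr}(\Sigma)$ toward 1.

    # evaluate the IterNorm spectral mapping x -> x * f_T(x)^2 for a
    # well-conditioned versus an ill-conditioned normalized eigenvalue
    def f(x, T):
        y = 1.0
        for _ in range(T):
            y = 1.5 * y - 0.5 * y**3 * x
        return y

    for x in (0.4, 1e-2, 1e-4):
        print(x, [round(x * f(x, T) ** 2, 4) for T in (1, 5, 9)])
    # x = 0.4  -> approx [0.676, 1.0, 1.0]       : whitened after a few iterations
    # x = 1e-2 -> approx [0.022, 0.434, 1.0]     : needs many more iterations
    # x = 1e-4 -> approx [0.0002, 0.006, 0.137]  : stays far from 1, i.e., the small
    #             eigenvalue is not restored and dimensional collapse persists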
D.2 EXPERIMENTS ON CIFAR-10

In Sections 3 and 4 of the main paper, we conduct several experiments on CIFAR-10 to illustrate our analysis. We provide a brief description of the setup in the captions of Figures 1 and 2 of the main paper. Here, we describe the details of these experiments. All experiments are uniformly based on the following training settings, unless otherwise stated in the figures of the main paper.

Figure 7: Investigating the numerical instability of spectral transformation using power functions for SSL. The numbers in the legend represent embedding dimensions and the batch size is fixed to 512. (a) trains models on ImageNet with ResNet-50; (b) trains models on CIFAR-10 with ResNet-18. The models are trained for 6000 iterations, and we track the inverse of the condition number ($c^{-1} = \lambda_d / \lambda_1$) on a base-10 logarithmic scale to judge whether the covariance matrix is ill-conditioned. Curves that terminate before the end of training indicate a training crash caused by numerical instability.

Training Settings. We use ResNet-18 as the encoder (the dimension of the encoding is 512) and a two-layer MLP with ReLU and BN appended as the projector (the dimensions of the hidden layer and the embedding are 1024 and 128, respectively). The model is trained on CIFAR-10 with a batch size of 256, using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of $3 \times 10^{-3}$, learning rate warm-up for the first 500 iterations, and a 0.2 learning rate drop at the last 50 and 25 epochs. The weight decay is set to $10^{-6}$. All transformations are performed with 2 positives extracted per image with standard data augmentation (see Section E.3 for details). We use the same evaluation protocol as in W-MSE (Ermolov et al., 2021).

Method Settings. We use the MSE of L2-normalized vectors as the loss function in all experiments. Specifically, in Figure 3 of the main paper, for the experiments training the models with INTL, we simply set the trade-off parameter $\beta$ between the MSE and trace losses as follows: $\beta = 0.05$ for $T = 5$, $\beta = 0.5$ for $T = 3$, and $\beta = 5$ for $T = 1$, without fine-tuning. For details of the INTL algorithm, please refer to Section C.

D.3 NUMERICAL INSTABILITY OF SPECTRAL TRANSFORMATION USING POWER FUNCTIONS

One issue with employing the spectral transformation $g(\lambda) = \lambda^{-p}$ (where $p$ is approximately 0.5) is the risk of numerical instability during the calculation of the eigenvalues $\lambda$ and eigenvectors $U$ via eigen-decomposition. This instability can arise when dealing with an ill-conditioned covariance matrix, as noted in (Paszke et al., 2019). In this study, we empirically validate the existence of this phenomenon in the context of self-supervised pre-training. It is important to mention that we primarily focus on the special case of $p = 0.5$, referred to as hard whitening, as similar phenomena are observed when $p$ is set near 0.5. To assess the generality of this phenomenon, we conduct experiments on both ImageNet with ResNet-50 and CIFAR-10 with ResNet-18. We maintain a fixed batch size of 512 and manipulate the shape of the covariance matrix by adjusting the embedding dimension $d$ (the covariance matrix has shape $d \times d$). The models undergo 6000 iterations, and we monitor the inverse of the condition number ($c^{-1} = \lambda_d / \lambda_1$) to ascertain the ill-conditioned nature of the covariance matrix. The experimental results, depicted in Figure 7, lead to the following key observations:

(a) Training crashes when the embedding dimension exceeds the batch size (e.g., $d = 1024$ or 2048). In such cases, the covariance matrix is theoretically singular, and computing the inverse of the eigenvalues introduces numerical errors. In practice, however, the minimum eigenvalue of the covariance matrix is likely a very small non-zero value due to precision rounding or the use of a small constant. Consequently, the covariance matrix may already be ill-conditioned from the start of training. Both Figure 7(a) and (b) illustrate that when $d = 1024$ or 2048, the inverse of the condition number is approximately $10^{-12}$ to $10^{-10}$, indicating severe ill-conditioning from the beginning and resulting in rapid training breakdown.
(b) Training is prone to crashing when the embedding dimension equals the batch size ($d = 512$). In such cases, it is challenging to definitively establish whether the covariance matrix is singular. However, our observations from Figure 7 suggest that the covariance matrix tends towards ill-conditioning when $d = 512$. The inverse of the condition number progressively decreases during training, eventually leading to training instability.

(c) There is a possibility of training instability even when the embedding dimension is less than the batch size. In these situations, we initially observe that the covariance matrix remains well-conditioned. However, this favorable condition is not consistently maintained throughout training. We notice that well-conditioning suddenly breaks after a few iterations, leading to model collapse for $d = 64$ or $d = 128$. Interestingly, training does not crash when $d = 256$. This phenomenon was briefly discussed in (Ermolov et al., 2021), which suggests that stability can be improved by setting $m = 2d$.

We confirm the presence of numerical instability when employing hard whitening (Ermolov et al., 2021), as indicated by the above analysis. While one can mitigate this instability empirically by setting $m = 2d$, our experiments reveal that training crashes due to numerical instability can still occur at various points during training. In our extensive experimentation (with 10 random seeds and longer training runs), we observed numerical issues approximately 3-4 times, occurring at different stages, including early, mid, or even towards the end of training. Even though it is possible to resume training from saved checkpoints in the event of a crash, this significantly limits the practical applicability of long-term pre-training.
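The quantity being monitored can be computed with a few lines (an illustrative sketch, not the released code; the function name and the toy shapes are ours):

    import torch

    def inverse_condition_number(z):
        # z: d x m mini-batch embedding (rows are feature dimensions)
        zc = z - z.mean(dim=1, keepdim=True)
        cov = zc @ zc.T / z.shape[1]
        eig = torch.linalg.eigvalsh(cov)        # eigenvalues in ascending order
        return (eig[0].clamp(min=0) / eig[-1]).item()

    # d > m: the covariance is rank-deficient, so c^{-1} is (numerically) ~0
    print(inverse_condition_number(torch.randn(1024, 512)))
    # d < m: typically well-conditioned at initialization for random features
    print(inverse_condition_number(torch.randn(128, 512)))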
E DETAILS OF EXPERIMENTS ON STANDARD SSL BENCHMARK

In this section, we provide the details of the implementation and training protocol for the experiments on large-scale ImageNet (Deng et al., 2009), medium-scale ImageNet-100 (Tian et al., 2020a), and small-scale CIFAR-10/100 (Krizhevsky, 2009) classification, as well as transfer learning to COCO (Lin et al., 2014) object detection and instance segmentation. We also provide the computational overhead of INTL pre-training on ImageNet.

E.1 DATASETS

CIFAR-10 and CIFAR-100 (Krizhevsky, 2009): two small-scale datasets composed of 32x32 images with 10 and 100 classes, respectively.
ImageNet-100 (Tian et al., 2020a): a random 100-class subset of ImageNet (Deng et al., 2009).
ImageNet (Deng et al., 2009): the well-known large-scale dataset with about 1.3M training images and 50K test images, spanning 1000 classes.
COCO2017 (Lin et al., 2014): a large-scale object detection, segmentation, and captioning dataset with 330K images containing 1.5 million object instances.

E.2 EXPERIMENT ON IMAGENET

In Section 5.1 of the paper, we compare our INTL to state-of-the-art SSL methods on large-scale ImageNet classification. Here, we describe the training details of these experiments.

Backbone and Projection. We use ResNet-50 (He et al., 2016) as the backbone, whose output dimension is 2048. We use a 3-layer MLP as the projection: two hidden layers with BN and ReLU applied, followed by a linear output layer. We set the dimensions of the hidden layers and the embedding to 8192, as our initial experiments followed the settings of VICReg and Barlow Twins, both of which use a dimension of 8192 for the projection. Compared to a projection dimension of 2048, using a projection dimension of 8192 brings about a 0.14% improvement in top-1 accuracy for INTL, so we followed this setting in subsequent experiments on ImageNet. Using a projection dimension of 8192 requires approximately 18% additional GPU memory and 2% additional time per epoch compared to using 2048.

Image Transformation Details. For image transformation, we use the same augmentation parameters as BYOL (Grill et al., 2020). Each input image is transformed twice to produce the two distorted views. The image augmentation pipeline consists of the following transformations: random cropping, resizing to 224x224, horizontal flipping, color jittering, converting to grayscale, Gaussian blurring, and solarization. The parameters are detailed in Table 4.

Table 4: Parameters used for image augmentations on ImageNet and ImageNet-100.
  Parameter               T1       T2
  crop size               224x224  224x224
  maximum scale of crops  1.0      1.0
  minimum scale of crops  0.08     0.08
  brightness              0.4      0.4
  contrast                0.4      0.4
  saturation              0.2      0.2
  hue                     0.1      0.1
  color jitter prob       0.8      0.8
  horizontal flip prob    0.5      0.5
  gaussian prob           1.0      0.1
  solarization prob       0.0      0.2

Table 5: Parameters used for multi-crop of INTL on ImageNet.
  Parameter               T1       T2       T3       T4       T5       T6
  crop size               224x224  224x224  192x192  160x160  128x128  96x96
  maximum scale of crops  1.0      1.0      0.857    0.714    0.571    0.429
  minimum scale of crops  0.2      0.2      0.171    0.143    0.114    0.086
  brightness              0.4      0.4      0.4      0.4      0.4      0.4
  contrast                0.4      0.4      0.4      0.4      0.4      0.4
  saturation              0.2      0.2      0.2      0.2      0.2      0.2
  hue                     0.1      0.1      0.1      0.1      0.1      0.1
  color jitter prob       0.8      0.8      0.8      0.8      0.8      0.8
  horizontal flip prob    0.5      0.5      0.5      0.5      0.5      0.5
  gaussian prob           0.5      0.5      0.5      0.5      0.5      0.5
  solarization prob       0.1      0.1      0.1      0.1      0.1      0.1

Optimizer and Learning Rate Schedule. We apply the SGD optimizer, using a learning rate of base-lr x BatchSize/256 and a cosine decay schedule. The base-lr is 0.5 for 100-epoch pre-training, 0.4 for 200(400)-epoch pre-training, and 0.3 for 800-epoch pre-training. The weight decay is $10^{-5}$ and the SGD momentum is 0.9. In addition, we use learning rate warm-up for the first 2 epochs.

Evaluation Protocol. For linear classification, we train the linear classifier for 100 epochs with the SGD optimizer (using a learning rate of base-lr x BatchSize/256 with a base-lr of 0.2) and a MultiStepLR scheduler with gamma = 0.1, dropping at the last 40 and 20 epochs. Note that when combining INTL with multi-crop in the ablation experiments, the base-lr is set to 0.4. The batch size and weight decay are 256 and 0, respectively.

Exponential Moving Average. In the main text, we observe that INTL performs even better when used in conjunction with the Exponential Moving Average (EMA) technique. We set the base coefficient for momentum updating to 0.996 for all pre-training lengths. The momentum coefficient follows a cosine increasing schedule with a final value of 1.0, as in BYOL (Grill et al., 2020).
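For reference, a sketch of this momentum schedule (our reading of the BYOL-style rule; the function name and the use of a step index are assumptions, not the released code) is:

    import math

    def ema_momentum(step, total_steps, base=0.996, final=1.0):
        # cosine-increasing EMA momentum from `base` to `final`
        return final - (final - base) * (math.cos(math.pi * step / total_steps) + 1) / 2

    print([round(ema_momentum(s, 100), 5) for s in (0, 25, 50, 75, 100)])
    # approx [0.996, 0.99659, 0.998, 0.99941, 1.0]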
E.3 EXPERIMENTS FOR SMALL AND MEDIUM SIZE DATASETS

In Section 5.1 of the paper, we provide the classification results of INTL pre-training on small and medium size datasets, namely CIFAR-10, CIFAR-100, and ImageNet-100. Here, we describe the details of the implementation and training protocol for the experiments on these datasets. For fairness, most of the hyper-parameters we use, such as batch size, projection settings, and data augmentation, are consistent with solo-learn (da Costa et al., 2022).

Experimental setup on ImageNet-100. Details of the implementation and training protocol for INTL pre-training on ImageNet-100 are shown in Table 6. The image transformation and evaluation protocol are the same as those used on ImageNet.

Table 6: Parameters used for INTL pre-training on ImageNet-100.
  Parameter                    Value
  max epoch                    400
  backbone                     ResNet-18
  projection layers            3
  projection hidden dimension  4096
  projection output dimension  4096
  optimizer                    SGD
  SGD momentum                 0.9
  learning rate                0.5
  learning rate warm-up        2 epochs
  learning rate schedule       cosine decay
  weight decay                 2.5e-5
  batch size                   128

Experimental setup on CIFAR-10/100. Details of the implementation and training protocol for INTL pre-training on CIFAR-10/100 are shown in Table 7, and the details of the image transformations are shown in Table 8. For evaluation, we use the same protocol as in W-MSE (Ermolov et al., 2021): training the linear classifier for 500 epochs using the Adam optimizer and the labeled training set of each specific dataset, without data augmentation; the learning rate is exponentially decayed from $10^{-2}$ to $10^{-6}$ and the weight decay is $5 \times 10^{-6}$. In addition, we also evaluate the accuracy of a k-nearest neighbors classifier (k-NN, k = 5) in these experiments. For other methods, we evaluate the models provided by (da Costa et al., 2022) to obtain k-NN accuracy, which does not require additional parameters or training.

Table 7: Parameters used for INTL pre-training on CIFAR-10/100.
  Parameter                    Value
  max epoch                    1000
  backbone                     ResNet-18
  projection layers            3
  projection hidden dimension  2048
  projection output dimension  2048
  optimizer                    SGD
  SGD momentum                 0.9
  learning rate                0.3
  learning rate warm-up        2 epochs
  learning rate schedule       cosine decay
  weight decay                 1e-4
  batch size                   256

Table 8: Parameters used for image augmentations on CIFAR-10/100.
  Parameter               T1     T2
  crop size               32x32  32x32
  maximum scale of crops  1.0    1.0
  minimum scale of crops  0.08   0.08
  brightness              0.4    0.4
  contrast                0.4    0.4
  saturation              0.2    0.2
  hue                     0.1    0.1
  color jitter prob       0.8    0.8
  horizontal flip prob    0.5    0.5
  gaussian prob           0      0
  solarization prob       0.0    0.2

E.4 EXPERIMENTS FOR TRANSFER LEARNING

In this part, we describe the training details of the experiments for transfer learning. Our implementation is based on the released codebase of MoCo (He et al., 2020)^4 for transfer learning to object detection and instance segmentation tasks. We use the default hyper-parameter configurations from the training scripts provided by the codebase for INTL, using our 200-epoch and 800-epoch pre-trained models on ImageNet. For the experiments on COCO detection and COCO instance segmentation, we use Mask R-CNN (1x schedule) fine-tuned on COCO 2017 train and evaluated on COCO 2017 val. The Mask R-CNN model uses the C4 backbone. INTL is run with 3 random seeds, with mean and standard deviation reported.

^4 https://github.com/facebookresearch/moco/tree/main/detection, under the CC-BY-NC 4.0 license.

E.5 COMPUTATIONAL OVERHEAD

In Table 9, we report the compute and GPU memory requirements of our implementation for different settings on ImageNet with ResNet-50. The batch size is 256, and we train each model with 2 A100-PCIE-40GB GPUs, using mixed precision and the PyTorch-optimized version of synchronized batch normalization layers.

Table 9: Computational cost. We report time and GPU memory requirements of our implementation for INTL trained per epoch on ImageNet with ResNet-50.
  Method  EMA  Multi-Crop  time / 1 epoch  peak memory / GPU
  INTL    No   No          29min11         16.0 G
  INTL    Yes  No          24min46         11.8 G
  INTL    No   Yes         57min33         25.9 G
  INTL    Yes  Yes         50min52         21.2 G

F ABLATION STUDY

In this section, we conduct a comprehensive set of ablation experiments to assess the robustness and versatility of INTL. These experiments cover various aspects, including batch sizes, embedding dimensions, the use of multi-crop augmentation, semi-supervised training, the choice of Vision Transformer (ViT) backbones, and adding the trace loss to other methods.

Table 10: Effect of batch sizes for INTL. We train for 100 epochs on ImageNet and report the Top-1 accuracy under linear evaluation. The embedding dimension is fixed to 8192.
  Bs       32    64    128   256   512   1024
  acc.(%)  64.2  66.4  68.1  68.7  69.5  69.7

Batch size. Most SSL methods, including certain whitening-based methods, are known to be sensitive to batch size; e.g., SimCLR (Chen et al., 2020a), SwAV (Caron et al., 2020), and W-MSE (Ermolov et al., 2021) all require a large batch size (e.g., 4096) to work well. We therefore test the robustness of INTL to batch size. We train INTL on ImageNet for 100 epochs with batch sizes ranging from 32 to 1024. As shown in Table 10, even when the batch size is as low as 32 or 64, INTL still maintains good performance, and accuracy improves as the batch size increases. These results indicate that INTL is robust to batch size and can adapt to scenarios that constrain the training batch size.

Figure 8: Ablation experiments for varying embedding dimensions. The batch size is fixed to 256.

Embedding dimension. The embedding dimension, i.e., the output dimension of the projection, is also a key element for most self-supervised learning methods and can have a significant impact on training results. As illustrated in (Zbontar et al., 2021), Barlow Twins is very sensitive to the embedding dimension and requires a large dimension (e.g., 8192 or 16384) to work well. We also test the robustness of INTL to the embedding dimension. Following the setups of (Chen et al., 2020a) and (Zbontar et al., 2021), we train INTL on ImageNet for 300 epochs with dimensions ranging from 64 to 16384. As shown in Figure 8, even when the embedding dimension is as low as 64 or 128, INTL still achieves good results. These results show that INTL also has strong robustness to the embedding dimension.

Table 11: Ablation experiments on ImageNet linear classification with EMA and multi-crop. All results are based on a ResNet-50 backbone.
  Method       Bs    EMA  Multi-Crop  100 eps  200 eps  400 eps  800 eps
  SwAV         4096  No   No          66.5     69.1     70.7     71.8
  INTL (ours)  512   No   No          69.5     71.1     72.4     73.1
  SwAV         4096  No   Yes         72.1     73.9     74.6     75.3
  SwAV         256   No   Yes         -        72.7     74.3     -
  INTL (ours)  256   No   Yes         72.4     74.3     74.9     -
  CLSA         256   Yes  No          -        69.4     -        72.2
  INTL (ours)  256   Yes  No          69.2     71.5     73.7     74.3
  DINO         4080  Yes  Yes         -        -        -        75.3
  CLSA         256   Yes  Yes         -        73.3     -        76.2
  INTL (ours)  256   Yes  Yes         73.5     75.2     76.1     76.6

Table 12: Semi-supervised classification on top of the fine-tuned representations from 1% and 10% of ImageNet samples.
  Method        Epoch  Bs    Top-1 (1%)  Top-1 (10%)  Top-5 (1%)  Top-5 (10%)
  Supervised    120    256   25.4        56.4         48.4        80.4
  SimCLR        800    4096  48.3        65.6         75.5        87.8
  BYOL          1000   4096  53.2        68.8         78.4        89.0
  SwAV          800    4096  53.9        70.2         78.5        89.9
  Barlow Twins  1000   2048  55.0        69.7         79.2        89.3
  VICReg        1000   2048  54.8        69.5         79.4        89.5
  INTL (ours)   800    512   55.0        69.4         80.8        89.8

Multi-Crop. In the main text experiments, we employ the standard augmentation, which generates two augmented views for each sample. It is worth noting that multi-crop strategies, such as the one used by SwAV (Caron et al., 2020), are widely recognized for enhancing the performance of SSL methods. For instance, SwAV achieves a remarkable Top-1 accuracy of 75.3% with multi-crop. Therefore, we also conduct experiments with INTL using multi-crop. We apply an efficient multi-crop approach that generates 6 views for each image, with sizes of 2 x 224 + 192 + 160 + 128 + 96, which is similar to the approach used by CLSA (Wang & Qi, 2022); detailed parameter settings are provided in Table 5. The results are shown in Table 11. When INTL is paired with multi-crop augmentation, it consistently achieves notable improvements in top-1 accuracy. For instance, after 800 epochs of pre-training, INTL attains an impressive top-1 accuracy of 76.6%, even surpassing the common supervised baseline of 76.5%. The incorporation of multi-crop augmentation enhances the performance of INTL, making it a promising method for self-supervised representation learning across a range of experimental setups.

Semi-supervised training. For semi-supervised classification, we fine-tune our pre-trained INTL backbone and train the linear classifier on ImageNet for 20 epochs. We employ subsets of size 1% and 10%, following the same split as SimCLR. The optimization is performed using the SGD optimizer with a base learning rate of 0.006 for the backbone and 0.2 for the classifier, along with a cosine decay schedule. The semi-supervised results on the ImageNet validation set are presented in Table 12, demonstrating that INTL performs well in semi-supervised training scenarios.

Vision Transformer backbones. We conduct additional experiments using Vision Transformer (ViT) backbones for INTL. For comparison, we reproduce five other methods under the same settings. The results are shown in Figure 9, illustrating that INTL maintains strong performance when ViTs are used as backbones. This suggests that INTL exhibits robust generalization capabilities across different network architectures.

Figure 9: Ablation experiments using Vision Transformer (ViT) backbones. We train our INTL as well as five other methods (SimCLR, W-MSE, DINO, Barlow Twins, and VICReg) for comparison when using ViTs as backbones. Our training setup involved ViT-Tiny for 200 epochs on CIFAR-10/100 and ViT-Small for 100 epochs on ImageNet-100. The settings were kept consistent with DINO, with the exception of the embedding dimension for W-MSE, which was set to 64, while the other methods used 2048. We evaluated classification performance using both a linear classifier and a 5-nearest neighbors classifier. The results for CIFAR-10, CIFAR-100, and ImageNet-100 are presented in panels (a), (b), and (c), respectively.

Barlow Twins/VICReg with trace loss. We conducted experiments on CIFAR-10/100 and ImageNet-100 to assess the impact of adding the trace loss to Barlow Twins and VICReg, following the experimental setup outlined in Table 2 of our paper. We trained the models on CIFAR-10/100 for 200 epochs and on ImageNet-100 for 100 epochs. The coefficient of the trace loss was set to 0.01, an empirically suitable value for both methods. The results are presented in Table 13. We observed that adding the trace loss to Barlow Twins had a minor positive effect on performance, while introducing it to VICReg significantly reduced performance, particularly on ImageNet-100. We hypothesize that this discrepancy may arise from the influence of the trace loss on the regularization strength of these
methods: it can either disrupt the existing balance, leading to reduced performance, or achieve a more favorable balance, resulting in improved performance.

Table 13: Performance of adding the trace loss to Barlow Twins/VICReg.
  Method                     CIFAR-10 top-1  CIFAR-10 5-nn  CIFAR-100 top-1  CIFAR-100 5-nn  ImageNet-100 top-1  ImageNet-100 5-nn
  Barlow Twins               80.43           76.68          51.60            42.71           58.34               50.21
  Barlow Twins + trace loss  80.45           76.32          51.66            43.94           59.78               50.45
  VICReg                     83.14           79.62          55.96            46.71           66.01               57.76
  VICReg + trace loss        81.67           78.74          54.75            46.24           63.54               55.18

G LICENSES OF DATASETS

ImageNet (Deng et al., 2009) is subject to the ImageNet terms of access (contributors, 2020). COCO (Lin et al., 2014): the annotations are under the Creative Commons Attribution 4.0 License, and the images are subject to the Flickr terms of use (Flickr, 2020).