LDREG: LOCAL DIMENSIONALITY REGULARIZED SELF-SUPERVISED LEARNING

Hanxun Huang1, Ricardo J. G. B. Campello2, Sarah Erfani1, Xingjun Ma3, Michael E. Houle4, James Bailey1
1School of Computing and Information Systems, The University of Melbourne, Australia
2Department of Mathematics and Computer Science, University of Southern Denmark, Denmark
3School of Computer Science, Fudan University, China
4Department of Computer Science, New Jersey Institute of Technology, USA

ABSTRACT

Representations learned via self-supervised learning (SSL) can be susceptible to dimensional collapse, where the learned representation subspace is of extremely low dimensionality and thus fails to represent the full data distribution and modalities. Dimensional collapse, also known as the underfilling phenomenon, is one of the major causes of degraded performance on downstream tasks. Previous work has investigated the dimensional collapse problem of SSL at a global level. In this paper, we demonstrate that representations can span over high-dimensional space globally, but collapse locally. To address this, we propose a method called local dimensionality regularization (LDReg). Our formulation is based on the derivation of the Fisher-Rao metric to compare and optimize local distance distributions at an asymptotically small radius for each data point. By increasing the local intrinsic dimensionality, we demonstrate through a range of experiments that LDReg improves the representation quality of SSL. The results also show that LDReg can regularize dimensionality at both local and global levels.

1 INTRODUCTION

Self-supervised learning (SSL) is now approaching the same level of performance as supervised learning on numerous tasks (Chen et al., 2020a;b; He et al., 2020; Grill et al., 2020; Chen & He, 2021; Caron et al., 2021; Zbontar et al., 2021; Chen et al., 2021; Bardes et al., 2022; Zhang et al., 2022). SSL focuses on the construction of effective representations without reliance on labels. Quality measures for such representations are crucial to assess and regularize the learning process. A key aspect of representation quality is to avoid dimensional collapse and its more severe form, mode collapse, where the representation converges to a trivial vector (Jing et al., 2022). Dimensional collapse refers to the phenomenon whereby many of the features are highly correlated and thus span only a lower-dimensional subspace. Existing works have connected dimensional collapse with low quality of learned representations (He & Ozay, 2022; Li et al., 2022; Garrido et al., 2023a; Dubois et al., 2022). Both contrastive and non-contrastive learning can be susceptible to dimensional collapse (Tian et al., 2021; Jing et al., 2022; Zhang et al., 2022), which can be mitigated by regularizing dimensionality as a global property, such as learning decorrelated features (Hua et al., 2021) or minimizing the off-diagonal terms of the covariance matrix (Zbontar et al., 2021; Bardes et al., 2022).

In this paper, we examine an alternative approach to the problem of dimensional collapse by investigating the local properties of the representation. Rather than directly optimizing the global dimensionality of the entire training dataset (in terms of correlation measures), we propose to regularize the local intrinsic dimensionality (LID) (Houle, 2017a;b) at each training sample.
We provide an intuitive illustration of the idea of LID in Figure 1. Given a representation vector (anchor point) and its surrounding neighbors, if representations collapse to a low-dimensional space, this would result in a lower sample LID for the anchor point (Figure 1a). In SSL, each anchor point should be dissimilar from all other points and should have a higher sample LID (Figure 1b). Based on LID, we reveal an interesting observation: representations can span a high-dimensional space globally, but collapse locally. As shown in the top four subfigures in Figure 1c, the data points can span different local dimensional subspaces (LIDs) while having roughly the same global intrinsic dimension (GID). This suggests that dimensional collapse should not only be examined as a global property but also locally. Note that Figure 1c illustrates a synthetic case of local dimensional collapse. Later we will empirically show that representations converging to a locally low-dimensional subspace can have reduced quality and that higher LID is desirable for SSL.

[Figure 1 omitted: 2D scatter plots. Panel (a) shows local collapse with anchor-point LID = 0.84; panel (b) shows no local collapse with anchor-point LID = 5.14; panel (c) shows four point configurations with FR values 0.65, 0.50, 0.16, and 0.32 above a panel labeled "Local Dimensional Collapse".]
Figure 1: Illustrations with 2D synthetic data. (a-b) The LID value of the anchor point (red star) when there is (or is no) local collapse. (c) Fisher-Rao (FR) metric and mean LID (mLID) estimates. FR measures the distance between two LID distributions and is computed based on our theoretical results. mLID is the geometric mean of sample-wise LID scores. High FR distances and low mLID scores indicate greater dimensional collapse. Global intrinsic dimension (GID) is estimated using the DANCo algorithm (Ceruti et al., 2014).

To address local dimensional collapse, we propose the Local Dimensionality Regularizer (LDReg), which regularizes the representations toward a desired local intrinsic dimensionality to avoid collapse, as shown in the bottom subfigure of Figure 1c. Our approach leverages the LID Representation Theorem (Houle, 2017a), which has established that the distance distribution of nearest neighbors in an asymptotically small radius around a given sample is guaranteed to have a parametric form. For LDReg to be able to influence the learned representations toward a distributionally higher LID, we require a way to compare distributions that is sensitive to differences in LID. This motivates us to develop a new theory to enable measurement of the distance between local distance distributions, as well as to identify the mean of a set of local distance distributions. We derive a theoretically well-founded Fisher-Rao metric (FR), which considers a statistical manifold for assessing the distance between two local distance distributions in the asymptotic limit. As shown in Figure 1c, FR corresponds well with different degrees of dimensional collapse. More details regarding Figure 1c can be found in Appendix A. The theory we develop here also leads to two new insights: i) LID values are better compared using the logarithmic scale rather than the linear scale; ii) for aggregating LID values, the geometric mean is a more natural choice than the arithmetic or harmonic means.
These insights have consequences for formulating our local dimensionality regularization objective, as well as broader implications for comparing and reporting LID values in other contexts. To summarize, the main contributions of this paper are:

- A new approach, LDReg, for mitigating dimensional collapse in SSL via the regularization of local intrinsic dimensionality characteristics.
- Theory to support the formulation of LID regularization, insights into how dimensionalities should be compared and aggregated, and a generic dimensionality regularization technique that can potentially be used in other types of learning tasks.
- Consistent empirical results demonstrating the benefit of LDReg in improving multiple state-of-the-art SSL methods (including SimCLR, SimCLR-Tuned, BYOL, and MAE), and its effectiveness in addressing both local and global dimensional collapse.

2 RELATED WORK

Self-Supervised Learning (SSL). SSL aims to automatically learn high-quality representations without label supervision. Existing SSL methods can be categorized into two types: generative methods and contrastive methods. In generative methods, the model learns representations through a reconstruction of the input (Hinton & Zemel, 1993). Inspired by masked language modeling (Kenton & Toutanova, 2019), recent works have successfully extended this paradigm to the reconstruction of masked images (Bao et al., 2022; Xie et al., 2022), such as the Masked Autoencoder (MAE) (He et al., 2022). It has been theoretically proven that these methods are a special form of contrastive learning that implicitly aligns positive pairs (Zhang et al., 2022). Contrastive methods can further be divided into 1) sample-contrastive, 2) dimension-contrastive (Garrido et al., 2023b), and 3) asymmetrical models. SimCLR (Chen et al., 2020a) and other sample-contrastive methods (He et al., 2020; Chen et al., 2020a;b; Yeh et al., 2022) are based on the InfoNCE loss (Oord et al., 2018). The sample-contrastive approach has been extended by using nearest-neighbor methods (Dwibedi et al., 2021; Ge et al., 2023), clustering-based methods (Caron et al., 2018; 2020; Pang et al., 2022), and improved augmentation strategies (Wang et al., 2023). Dimension-contrastive methods (Zbontar et al., 2021; Bardes et al., 2022) regularize the off-diagonal terms of the covariance matrix of the embedding. Asymmetrical models use an asymmetric architecture, such as an additional predictor (Chen & He, 2021), self-distillation (Caron et al., 2021), or a slow-moving average branch as in BYOL (Grill et al., 2020).

Dimensional collapse in SSL. Dimensional collapse occurs during the SSL process when the learned embedding vectors and representations span only a lower-dimensional subspace (Hua et al., 2021; Jing et al., 2022; He & Ozay, 2022; Li et al., 2022). Generative methods such as MAE (He et al., 2022) have been shown to be susceptible to dimensional collapse (Zhang et al., 2022). Sample-contrastive methods such as SimCLR have also been observed to suffer from dimensional collapse (Jing et al., 2022). Other studies suggest that while stronger augmentation and larger projectors are beneficial to performance (Garrido et al., 2023b), they may cause a dimensional collapse in the projector space (Cosentino et al., 2022). It has been theoretically proven that asymmetrical model methods can alleviate dimensional collapse, and the effective rank (Roy & Vetterli, 2007) is a useful measure of the degree of global collapse (Zhuo et al., 2023). Effective rank is also helpful in assessing representation quality (Garrido et al., 2023a); a minimal computation sketch is shown below. By decorrelating features, dimension-contrastive methods (Zbontar et al., 2021; Zhang et al., 2021; Ermolov et al., 2021; Bardes et al., 2022) can also avoid dimensional collapse. In this work, we focus on the local dimensionality of the representation (encoder) space, which largely determines the performance of downstream tasks.
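Since the effective rank is used throughout this paper as the reference measure of global dimensionality, the following sketch illustrates how it can be computed from a batch of representations. It follows the exponential-of-spectral-entropy definition of Roy & Vetterli (2007) (see also Appendix B); the use of the covariance matrix, the batch sizes, and the toy data below are illustrative assumptions rather than the authors' evaluation code.

```python
import numpy as np

def effective_rank(z: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank (Roy & Vetterli, 2007) of a batch of representations.

    z: array of shape (n_samples, dim). The eigenvalue spectrum of the feature
    covariance is normalized to a probability vector p, and the effective rank
    is exp(entropy(p)), a real value between 1 and dim.
    """
    zc = z - z.mean(axis=0, keepdims=True)            # center features
    cov = zc.T @ zc / (len(z) - 1)                    # (dim, dim) covariance
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None) # eigenvalues >= 0
    p = eig / (eig.sum() + eps)                       # normalized spectrum
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

# Toy check: representations confined to a 5-dim subspace of a 64-dim space.
rng = np.random.default_rng(0)
z_low = rng.normal(size=(4096, 5)) @ rng.normal(size=(5, 64))
z_full = rng.normal(size=(4096, 64))
print(effective_rank(z_low))   # well below 64, close to the subspace dimension 5
print(effective_rank(z_full))  # close to the ambient dimension 64
```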
Local Intrinsic Dimensionality. Unlike global intrinsic dimensionality metrics (Pettis et al., 1979; Bruske & Sommer, 1998), local intrinsic dimensionality (LID) measures the intrinsic dimension in the vicinity of a particular query point (Levina & Bickel, 2004; Houle, 2017a). It has been used as a measure for similarity search (Houle et al., 2012), for characterizing adversarial subspaces (Ma et al., 2018a), for detecting backdoor attacks (Dolatabadi et al., 2022), and in the understanding of deep learning (Ma et al., 2018b; Gong et al., 2019; Ansuini et al., 2019; Pope et al., 2021). In Appendix B, we provide a comparison between the effective rank and the LID to help understand local vs. global dimensionality. Our work in this paper shows that LID is not only useful as a descriptive measure, but can also be used as part of a powerful regularizer for SSL.

3 BACKGROUND AND TERMINOLOGY

We first introduce the necessary background for the distributional theory underpinning LID. The dimensionality of the local data submanifold in the vicinity of a reference sample is revealed by the growth characteristics of the cumulative distribution function of the local distance distribution. Let F be a real-valued function that is non-zero over some open interval containing $r \in \mathbb{R}$, $r \neq 0$.

Definition 1 ((Houle, 2017a)). The intrinsic dimensionality of F at r is defined as follows, whenever the limit exists:
$$\mathrm{IntrDim}_F(r) \triangleq \lim_{\epsilon \to 0} \frac{\ln\big(F((1+\epsilon)r)/F(r)\big)}{\ln\big((1+\epsilon)r/r\big)} .$$

Theorem 1 ((Houle, 2017a)). If F is continuously differentiable at r, then
$$\mathrm{LID}_F(r) \triangleq \frac{r \cdot F'(r)}{F(r)} = \mathrm{IntrDim}_F(r) .$$

Although the preceding definitions apply more generally, we will be particularly interested in functions F that satisfy the conditions of a cumulative distribution function (CDF). Let x be a location of interest within a data domain S for which the distance measure $d : S \times S \to \mathbb{R}_{\geq 0}$ has been defined. To any generated sample $s \in S$, we associate the distance d(x, s); in this way, a global distribution that produces the sample s can be said to induce the random value d(x, s) from a local distribution of distances taken with respect to x. The CDF F(r) of the local distance distribution is simply the probability of the sample distance lying within a threshold r, that is, $F(r) \triangleq \Pr[d(x, s) \leq r]$. To characterize the local intrinsic dimensionality in the vicinity of location x, we consider the limit of $\mathrm{LID}_F(r)$ as the distance r tends to 0. Regardless of whether F satisfies the conditions of a CDF, we denote this limit by
$$\mathrm{LID}_F \triangleq \lim_{r \to 0^+} \mathrm{LID}_F(r) .$$
Henceforth, when we refer to the local intrinsic dimensionality (LID) of a function F, or of a point x whose induced distance distribution has F as its CDF, we will take LID to mean the quantity $\mathrm{LID}_F$. In general, $\mathrm{LID}_F$ is not necessarily an integer. Unlike the manifold model of local data distributions (where the dimensionality of the manifold is always an integer, and deviation from the manifold is considered as error), the LID model reflects the entire local distributional characteristics without distinguishing error. However, the estimation of the LID at x often gives an indication of the dimension of the local manifold containing x that would best fit the distribution. A small numerical check of these definitions is sketched below.
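The following short sketch is a numerical sanity check of Definition 1 and Theorem 1, assuming the idealized smooth-growth CDF F(r) = (r/w)^d (the canonical form used in Section 4); the particular choices d = 3, w = 1, and the finite-difference step are illustrative only.

```python
import numpy as np

def intr_dim(F, r: float, eps: float = 1e-6) -> float:
    """Numerical version of Definition 1: IntrDim_F(r) for a small epsilon."""
    return np.log(F((1 + eps) * r) / F(r)) / np.log((1 + eps) * r / r)

# For distances to points uniformly distributed in a d-dimensional ball, the
# local distance CDF behaves like F(r) = (r / w)^d near the origin, so
# IntrDim recovers d at any r in (0, w), and so does the limit r -> 0+.
d, w = 3, 1.0
F = lambda r: (r / w) ** d
print(intr_dim(F, r=0.5))    # approximately 3
print(intr_dim(F, r=1e-3))   # approximately 3, consistent with LID_F = d

# Theorem 1 equivalently gives r * F'(r) / F(r); check by finite differences.
r, h = 0.5, 1e-6
Fp = (F(r + h) - F(r - h)) / (2 * h)
print(r * Fp / F(r))         # approximately 3
```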
4 ASYMPTOTIC FORM OF FISHER-RAO METRIC FOR LID DISTRIBUTIONS

We now provide the necessary theoretical justifications for our LDReg regularizer, which will be developed later in Section 5. Intuitively, LDReg should regularize the LID of the local distribution of the training samples towards a higher value, determined as the LID of some target distribution. This can help to avoid dimensional collapse by increasing the dimensionality of the representation space and producing representations that are more uniform in their local dimensional characteristics. To achieve this, we will need an asymptotic notion of distributional distance that applies to lower tail distributions. In this section, we introduce an asymptotic variant of the Fisher-Rao distance that can be used to identify the center (mean) of a collection of tail distributions.

4.1 FISHER-RAO DISTANCE METRIC

The Fisher-Rao distance is based on the embedding of the distributions on a Riemannian manifold, where it corresponds to the length of the geodesic along the manifold between the two distributions. The metric is usually impossible to compute analytically, except for special cases (such as certain varieties of Gaussians). However, in the asymptotic limit as $w \to 0$, we will show that it is analytically tractable for smooth growth functions.

Definition 2. Given a non-empty set $\mathcal{X}$ and a family of probability density functions $\phi(x|\theta)$ parameterized by $\theta$ on $\mathcal{X}$, the space $M = \{\phi(x|\theta) \mid \theta \in \mathbb{R}^d\}$ forms a Riemannian manifold. The Fisher-Rao Riemannian metric on M is a function of $\theta$ and induces geodesics, i.e., curves with minimum length on M. The Fisher-Rao distance between two models $\theta_1$ and $\theta_2$ is the arc-length of the geodesic that connects these two points.

In our context, we will focus on univariate lower tail distributions with a single parameter $\theta$ corresponding to the LID of the CDF. In this context, the Fisher-Rao distance will turn out to have an elegant analytical form. We will make use of the Fisher information I, which is the variance of the gradient of the log-likelihood function (also known as the Fisher score). For distributions over $[0, w]$ with a single parameter $\theta$, this is defined as
$$I_w(\theta) = \int_0^w \left( \frac{\partial}{\partial \theta} \ln F'_w(r|\theta) \right)^2 F'_w(r|\theta) \, dr .$$

Lemma 1. Consider the family of tail distributions on $[0, w]$ parameterized by $\theta$, whose CDFs are smooth growth functions of the form $H_{w|\theta}(r) = (r/w)^{\theta}$. The Fisher-Rao distance $d_{FR}$ between $H_{w|\theta_1}$ and $H_{w|\theta_2}$ is
$$d_{FR}(H_{w|\theta_1}, H_{w|\theta_2}) = \left| \ln \frac{\theta_2}{\theta_1} \right| .$$
The Fisher information $I_w$ for smooth growth functions of the form $H_{w|\theta}$ is:
$$I_w(\theta) = \int_0^w \left( \frac{\partial}{\partial \theta} \ln H'_{w|\theta}(r) \right)^2 H'_{w|\theta}(r) \, dr = \frac{1}{\theta^2} .$$
The proof of Lemma 1 can be found in Appendix C.2.

4.2 ASYMPTOTIC FISHER-RAO METRIC

We now extend the notion of the Fisher-Rao metric to distance distributions whose CDFs (conditioned to the lower tail $[0, w]$) have the more general form of a growth function. The LID Representation Theorem (Theorem 3 in Appendix C.1) tells us that any such CDF $F_w(r)$ can be decomposed into the product of a canonical form $H_{w|\mathrm{LID}_F}(r)$ with an auxiliary factor $A_F(r, w)$:
$$F_w(r) = H_{w|\mathrm{LID}_F}(r) \cdot A_F(r, w) = \left( \frac{r}{w} \right)^{\mathrm{LID}_F} \exp\left( \int_r^w \frac{\mathrm{LID}_F - \mathrm{LID}_F(t)}{t} \, dt \right) .$$
From Corollary 3.1 (Appendix C.1), the auxiliary factor $A_F(r, w)$ tends to 1 as r and w tend to 0, provided that r stays within a constant factor of w. Asymptotically, then, $F_w$ can be seen to tend to $H_{w|\theta}$ as the tail length tends to zero, for $\theta = \mathrm{LID}_F$.
More precisely, for any constant $c \geq 1$,
$$\lim_{\substack{w \to 0^+ \\ w/c \,\leq\, r \,\leq\, cw}} \frac{F_w(r)}{H_{w|\mathrm{LID}_F}(r)} \;=\; \lim_{\substack{w \to 0^+ \\ w/c \,\leq\, r \,\leq\, cw}} A_F(r, w) \;=\; 1 .$$
Thus, although the CDF $F_w$ does not in general admit a finite parameterization suitable for the direct definition of a Fisher-Rao distance, asymptotically it tends to a distribution that does: $H_{w|\mathrm{LID}_F}$. Using Lemma 1, we define an asymptotic form of Fisher-Rao distance between distance distributions.

Definition 3. Given two smooth-growth distance distributions with CDFs F and G, their asymptotic Fisher-Rao distance is given by
$$d_{AFR}(F, G) \triangleq \lim_{w \to 0^+} d_{FR}(H_{w|\mathrm{LID}_F}, H_{w|\mathrm{LID}_G}) = \left| \ln \frac{\mathrm{LID}_G}{\mathrm{LID}_F} \right| .$$

4.3 IMPLICATIONS

Remark 1.1. Assume that $\mathrm{LID}_F \geq 1$ and that $G_w = U_{1,w}$ is the one-dimensional uniform distribution over the interval $[0, w]$ (with $\mathrm{LID}_G$ therefore equal to 1). We then have
$$d_{AFR}(F_w, U_{1,w}) = \ln \mathrm{LID}_F \quad \Longleftrightarrow \quad \mathrm{LID}_F = \exp\big( d_{AFR}(F_w, U_{1,w}) \big) .$$
We can therefore interpret the local intrinsic dimensionality of a distribution F conditioned to the interval $[0, w]$ (with $\mathrm{LID}_F \geq 1$) as the exponential of the distance between distribution F and the uniform distribution in the limit as $w \to 0$.

There is also a close relationship between our asymptotic Fisher-Rao distance metric and a mathematically special measure of relative difference.

Remark 1.2. One can interpret the quantity $|\ln(\mathrm{LID}_G / \mathrm{LID}_F)|$ as a relative difference between $\mathrm{LID}_G$ and $\mathrm{LID}_F$. Furthermore, it is the only measure of relative difference that is symmetric, additive, and normed (Törnqvist et al., 1985).

The asymptotic Fisher-Rao distance indicates that the absolute difference between the LID values of two distance distributions is not a good measure of asymptotic dissimilarity. For example, a pair of distributions with $\mathrm{LID}_{F_1} = 2$ and $\mathrm{LID}_{G_1} = 4$ are much less similar under the asymptotic Fisher-Rao metric than a pair of distributions with $\mathrm{LID}_{F_2} = 20$ and $\mathrm{LID}_{G_2} = 22$. We can also use the asymptotic Fisher-Rao metric to compute the centroid or Fréchet mean (also known as the Karcher mean, the Riemannian barycenter, and the Riemannian center of mass) of a set of distance distributions, as well as the associated Fréchet variance.

Definition 4. Given a set of distance distribution CDFs $\mathcal{F} = \{F^1, F^2, \ldots, F^N\}$, the empirical Fréchet mean of $\mathcal{F}$ is defined as
$$\mu_{\mathcal{F}} \triangleq \arg\min_{H_{w|\theta}} \sum_{i=1}^{N} d_{AFR}(H_{w|\theta}, F^i)^2 .$$
The Fréchet variance of $\mathcal{F}$ is then defined as
$$\frac{1}{N} \sum_{i=1}^{N} d_{AFR}(\mu_{\mathcal{F}}, F^i)^2 = \frac{1}{N} \sum_{i=1}^{N} \big( \ln \mathrm{LID}_{F^i} - \ln \mathrm{LID}_{\mu_{\mathcal{F}}} \big)^2 .$$
The Fréchet variance can be interpreted as the variance of the local intrinsic dimensionalities of the distributions in $\mathcal{F}$, taken in logarithmic scale. The Fréchet mean has a well-known close connection to the geometric mean when the distance is expressed as a difference of logarithmic values. For our setting, we state this relationship in the following theorem, the proof of which can be found in Appendix C.3.

Theorem 2. Let $\mu_{\mathcal{F}}$ be the empirical Fréchet mean of a set of distance distribution CDFs $\mathcal{F} = \{F^1, F^2, \ldots, F^N\}$ using the asymptotic Fisher-Rao metric $d_{AFR}$. Then $\mathrm{LID}_{\mu_{\mathcal{F}}} = \exp\big( \frac{1}{N} \sum_{i=1}^{N} \ln \mathrm{LID}_{F^i} \big)$, the geometric mean of $\{\mathrm{LID}_{F^1}, \ldots, \mathrm{LID}_{F^N}\}$.

Corollary 2.1. Given the CDFs $\mathcal{F} = \{F^1, F^2, \ldots, F^N\}$, the quantity $\frac{1}{N} \sum_{i=1}^{N} \ln \mathrm{LID}_{F^i}$ is:
1. The average asymptotic Fisher-Rao distance of members of $\mathcal{F}$ to the one-dimensional uniform distribution (if for all i we have $\mathrm{LID}_{F^i} \geq 1$).
2. The logarithm of the local intrinsic dimension of the Fréchet mean of $\mathcal{F}$.
3. The logarithm of the geometric mean of the local intrinsic dimensions of the members of $\mathcal{F}$.
The proof of Assertion 1 is in Appendix C.4. Assertions 2 and 3 follow from Theorem 2.
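The following small numerical example illustrates Definition 3, Theorem 2, and Remark 1.1 on a handful of LID values; the values themselves are illustrative and not taken from any experiment in the paper.

```python
import numpy as np

# Illustrative sample-wise LID values for a batch of points.
lids = np.array([2.0, 4.0, 20.0, 22.0])

# Asymptotic Fisher-Rao distance between two tail distributions depends only on
# the ratio of their LIDs (Definition 3): d_AFR = |ln(LID_G / LID_F)|.
d_afr = lambda lid_f, lid_g: abs(np.log(lid_g / lid_f))
print(d_afr(2.0, 4.0))    # 0.693...: quite dissimilar
print(d_afr(20.0, 22.0))  # 0.095...: much more similar, despite the same gap of 2

# Theorem 2: the Frechet mean under d_AFR has LID equal to the geometric mean,
# and the Frechet variance is the variance of the LIDs in log scale.
log_lids = np.log(lids)
frechet_mean_lid = np.exp(log_lids.mean())
frechet_variance = np.mean((log_lids - log_lids.mean()) ** 2)
print(frechet_mean_lid, frechet_variance)

# Remark 1.1: for LID_F >= 1, LID_F = exp(d_AFR(F, U_1)), where U_1 is the
# one-dimensional uniform distance distribution (LID = 1).
print(np.exp(d_afr(1.0, 6.3)))  # recovers 6.3
```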
It is natural to consider whether other measures of distributional divergence could be used in place of the asymptotic Fisher-Rao metric in the derivation of the Fréchet mean. Bailey et al. (2022) have shown several other divergences involving the LID of distance distributions, most notably the Kullback-Leibler (KL) divergence. We can in fact show that the asymptotic Fisher-Rao metric is preferable (in theory) to the asymptotic KL distance, and the geometric mean is preferable to the arithmetic mean and harmonic mean when aggregating the LID values of distance distributions. For the details, we refer readers to Theorem 4 in Appendix C.5. In summary, Theorems 2 and 4 (Appendix C.5) show that the asymptotic Fisher-Rao metric is preferable for measuring the distributional divergence of LIDs. These theorems provide theoretical justification for LDReg, which will be described in the following section.

5 LID REGULARIZATION FOR SELF-SUPERVISED LEARNING

In this section, we formally introduce our proposed LDReg method. For an input image x and an encoder f(·), the representation of x can be obtained as z = f(x). Depending on the SSL method, a projector g(·), a predictor h(·), and a decoder t(·) can be used to obtain the embedding vector or the reconstructed image from z. LDReg is a generic regularization on representations z obtained by the encoder, and as such it can be applied to a variety of SSL methods (more details are in Appendix E). We denote the objective function of an SSL method by $\mathcal{L}_{SSL}$.

Local Dimensionality Regularization (LDReg). Following the theorems derived in Section 4, we assume that the representational dimension is d, and that we are given the representation of a sample $x_i$. Suppose that the local distance distribution induced at $x_i$ has CDF $F^i_w(r)$. To avoid dimensional collapse, we consider maximizing the distributional distance between $F^i_w(r)$ and a uniform distance distribution $U_{1,w}(r)$ (with LID = 1): for each sample, we could regularize toward a local representation that has a local intrinsic dimensionality much greater than 1 (and thus closer to the representational dimension $d \gg 1$). We could regularize by maximizing the sum of squared asymptotic FR distances (L2-style regularization), or of absolute FR distances (L1-style regularization). In accordance with Corollary 2.1 in Section 4.2, we apply L1-regularization to minimize the negative log of the geometric mean of the LID values. Assuming that each $\mathrm{LID}_{F^i_w}$ is desired to be at least 1, the L1-style objective is
$$\max \frac{1}{N} \sum_{i=1}^{N} \lim_{w \to 0} d_{AFR}\big(F^i_w(r), U_{1,w}(r)\big) \;=\; \min \, -\frac{1}{N} \sum_{i=1}^{N} \ln \mathrm{LID}_{F^i_w} , \quad (1)$$
where N is the batch size. Following Theorem 2 in Section 4.2, we apply L2-regularization to maximize the Fréchet variance under a prior of $\mu_{\mathcal{F}} = 1$:
$$\max \frac{1}{N} \sum_{i=1}^{N} \lim_{w \to 0} \big( d_{AFR}(F^i_w(r), U_{1,w}(r)) \big)^2 \;=\; \min \, -\frac{1}{N} \sum_{i=1}^{N} \big( \ln \mathrm{LID}_{F^i_w} \big)^2 .$$
Our preference for the geometric mean over the arithmetic mean for L1- and L2-regularization is justified by Theorem 4 in Appendix C.5. We refer readers to Appendix D for a discussion of other regularization formulations.

We use the Method of Moments (Amsaleg et al., 2018) as our estimator of LID, due to its simplicity. Since only the encoder is kept for downstream tasks, we estimate the LID values based on the encoder representations (z = f(x)). Specifically, we calculate the pairwise Euclidean distance between the encoder representations of a batch of samples to estimate $\mathrm{LID}_{F^i_w}$ for each sample $x_i$ in the batch:
$$\mathrm{LID}_{F^i_w} = \frac{\mu_k}{w_k - \mu_k} ,$$
where k denotes the number of nearest neighbors of $z_i$, $w_k$ is the distance to the k-th nearest neighbor, and $\mu_k$ is the average distance to all k nearest neighbors. The overall optimization objective is defined as a minimization of either of the following losses:
$$\mathcal{L}_{L1} = \mathcal{L}_{SSL} - \beta \cdot \frac{1}{N} \sum_{i=1}^{N} \ln \mathrm{LID}_{F^i_w} \qquad \text{or} \qquad \mathcal{L}_{L2} = \mathcal{L}_{SSL} - \beta \cdot \frac{1}{N} \sum_{i=1}^{N} \big( \ln \mathrm{LID}_{F^i_w} \big)^2 ,$$
where $\beta$ is a hyperparameter balancing the loss and regularization terms. A minimal implementation sketch of the estimator and the L1 term is given below.
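The following PyTorch sketch illustrates the method-of-moments LID estimate and the L1-style LDReg term described above. It is a minimal reimplementation rather than the authors' released code: the function names lid_mom and ldreg_l1 are ours, and details such as how the two augmented views are batched for the distance computation follow Appendices E and J only loosely (the commented training step is hypothetical).

```python
import torch

def lid_mom(z: torch.Tensor, k: int = 64, eps: float = 1e-8) -> torch.Tensor:
    """Method-of-moments LID estimate for each representation in a batch.

    z: (N, d) encoder representations. For each sample, LID is estimated from
    the distances to its k nearest neighbours within the batch as
    mu_k / (w_k - mu_k), where w_k is the k-th neighbour distance and mu_k the
    mean neighbour distance.
    """
    dist = torch.cdist(z, z)                  # (N, N) pairwise Euclidean distances
    knn, _ = dist.topk(k + 1, largest=False)  # includes the zero self-distance
    knn = knn[:, 1:]                          # drop self, keep k neighbours
    w_k = knn[:, -1]                          # distance to the k-th neighbour
    mu_k = knn.mean(dim=1)                    # mean distance to the k neighbours
    return mu_k / (w_k - mu_k + eps)          # (N,) sample-wise LID estimates

def ldreg_l1(z: torch.Tensor, k: int = 64) -> torch.Tensor:
    """L1-style LDReg term: negative log of the geometric mean of sample LIDs."""
    lids = lid_mom(z, k=k)
    return -torch.log(lids + 1e-8).mean()

# Hypothetical training step: `encoder`, `ssl_loss`, `beta`, and the two views
# are placeholders for whatever the base SSL method provides.
# z1, z2 = encoder(view1), encoder(view2)
# loss = ssl_loss(z1, z2) + beta * ldreg_l1(torch.cat([z1, z2], dim=0))
```

Adding beta * ldreg_l1(...) to the SSL loss is equivalent to subtracting beta times the mean log-LID, matching the L1 objective above.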
More details of how to apply LDReg to different SSL methods can be found in Appendix E; the pseudocode is in Appendix J.

6 EXPERIMENTS

We evaluate the performance of LDReg in terms of representation quality, such as training a linear classifier on top of frozen representations. We use SimCLR (Chen et al., 2020a), SimCLR-Tuned (Garrido et al., 2023b), BYOL (Grill et al., 2020), and MAE (He et al., 2022) as baselines. We perform our evaluation with ResNet-50 (He et al., 2016) (for SimCLR, SimCLR-Tuned, and BYOL) and ViT-B (Dosovitskiy et al., 2021) (for SimCLR and MAE) on ImageNet (Deng et al., 2009). As a default, we use a batch size of 2048, 100 epochs of pretraining for SimCLR, SimCLR-Tuned, and BYOL, 200 epochs for MAE, and hyperparameters chosen in accordance with each baseline's recommended values. We evaluate transfer learning performance by performing linear evaluations on other datasets, including Food-101 (Bossard et al., 2014), CIFAR (Krizhevsky & Hinton, 2009), Birdsnap (Berg et al., 2014), Stanford Cars (Krause et al., 2013), and DTD (Cimpoi et al., 2014). For finetuning, we use RCNN (Girshick et al., 2014) to evaluate on downstream tasks using the COCO dataset (Lin et al., 2014). Detailed experimental setups are provided in Appendix F.

For LDReg regularization, we use k = 64 as the default neighborhood size. For ResNet-50, we set β = 0.01 for SimCLR and SimCLR-Tuned, and β = 0.005 for BYOL. For ViT-B, we set β = 0.001 for SimCLR, and β = 5 × 10^-6 for MAE. Since L_L1 and L_L2 perform similarly (see Appendix G.1), here we mainly report the results of L_L1. An experiment showing how local collapse triggers mode collapse is provided in Appendix G.2, while an ablation study of hyperparameters is in Appendix G.3.

6.1 LDREG REGULARIZATION INCREASES LOCAL AND GLOBAL INTRINSIC DIMENSIONS

Intrinsic dimensionality has previously been used to understand deep neural networks in a supervised learning context (Ma et al., 2018b; Gong et al., 2019; Ansuini et al., 2019). For SSL, Figure 2a shows that the geometric mean of LID tends to increase over the course of training. For contrastive methods (SimCLR and BYOL), the mean LID slightly decreases at the later training stages (dashed lines). With LDReg, the mean LID increases for all baseline methods and the decreasing trend at the later stages is alleviated (solid lines), most notably on BYOL.

[Figure 2 omitted: (a) geometric mean of LIDs over training epochs for SimCLR, SimCLR (Tuned), BYOL, and MAE, with and without LDReg; (b) geometric mean of LIDs for SimCLR with color jitter strengths 1.0, 0.8, 0.6, 0.4, and 0.2, with linear evaluation accuracies 64.32%, 64.30%, 64.19%, 63.67%, and 63.01% respectively; (c) effective rank and (d) geometric mean of LIDs per method, with and without LDReg.]
Figure 2: (a) Geometric mean of LID values over training epochs. (b) Geometric mean of LID values with varying color jitter strength in the augmentations for SimCLR. The linear evaluation result is reported in the legend. (a-b) LID is computed on the training set. (c-d) The effective rank and LID are computed for samples in the validation set. The solid and transparent bars represent the baseline method with and without LDReg regularization, respectively. MAE uses ViT-B as the encoder, and the others use ResNet-50.
In Figure 2b, we adjusted the color jitter strength of the SimCLR augmentation policy and observed that the mean LID of the representation space positively correlates with the strength. This indicates that stronger augmentations tend to trigger more variations of the image and thus lead to representations of higher LID. This provides insights into why data augmentation is important for SSL (Grill et al., 2020; Von Kügelgen et al., 2021) and why it can help avoid dimensional collapse (Wagner et al., 2022; Huang et al., 2023).

The effective rank (Roy & Vetterli, 2007) is a metric that evaluates dimensionality as a global property and can also be used as a metric for representation quality (Garrido et al., 2023a). Figure 2c shows that BYOL is less susceptible to dimensional collapse. SimCLR-Tuned uses the same augmentation as BYOL (which is stronger than that of SimCLR), yet still converges to a lower-dimensional space as compared with BYOL. This indicates that for SimCLR and its variants, stronger augmentation is not sufficient to prevent global dimensional collapse. The generative method MAE is known to be prone to dimensional collapse (Zhang et al., 2022). Unsurprisingly, it has the lowest effective rank in Figure 2c. Note that the extremely low effective rank of MAE is also related to its low representation dimension, which is 768 in ViT-B (the other methods shown here use ResNet-50, which has a representation dimension of 2048).

We also analyze the geometric mean of the LID values in Figure 2d. It shows that, compared to other methods, BYOL has a much lower mean LID value. This implies that although BYOL does not collapse globally, it converges to a much lower dimension locally. We refer readers to Appendix G.2 for an analysis of how local collapse (extremely low LID) could trigger a complete mode collapse, thereby degrading the representation quality. Finally, the use of our proposed LDReg regularization can effectively avoid dimensional collapse and produce both increased global and local dimensionalities (as shown in Figures 2c and 2d with "+ LDReg").

6.2 EVALUATIONS

We evaluate the representation quality learned by different methods via linear evaluation, transfer learning, and fine-tuning on downstream tasks. As shown in Table 1, LDReg consistently improves the linear evaluation performance for methods that are known to be susceptible to dimensional collapse, including the sample-contrastive method SimCLR and the generative method MAE. It also improves BYOL, which is susceptible to local dimensional collapse as shown in Figure 2d. Tables 2 and 3 further demonstrate that LDReg can also improve the performance of transfer learning and of fine-tuning on object detection and segmentation tasks. Moreover, longer pretraining with LDReg brings a more significant performance improvement. These results indicate that using LDReg to regularize local dimensionality can consistently improve the representation quality.

Table 1: The linear evaluation results (accuracy (%)) of different methods with and without LDReg. The effective rank is calculated on the ImageNet validation set. The best results are boldfaced.
| Model | Epochs | Method | Regularization | Linear Evaluation | Effective Rank | Geometric mean of LID |
|---|---|---|---|---|---|---|
| ResNet-50 | 100 | SimCLR | - | 64.3 | 470.2 | 18.8 |
| ResNet-50 | 100 | SimCLR | LDReg | 64.8 | 529.6 | 20.0 |
| ResNet-50 | 100 | SimCLR (Tuned) | - | 67.2 | 525.8 | 24.9 |
| ResNet-50 | 100 | SimCLR (Tuned) | LDReg | 67.5 | 561.7 | 26.1 |
| ResNet-50 | 100 | BYOL | - | 67.6 | 583.8 | 15.9 |
| ResNet-50 | 100 | BYOL | LDReg | 68.5 | 594.0 | 22.3 |
| ViT-B | | SimCLR | - | 72.9 | 283.7 | 13.3 |
| ViT-B | | SimCLR | LDReg | 73.0 | 326.1 | 13.7 |
| ViT-B | | MAE | - | 57.0 | 86.4 | 25.8 |
| ViT-B | | MAE | LDReg | 57.6 | 154.1 | 29.8 |

Table 2: The transfer learning results in terms of linear probing accuracy (%), using ResNet-50 as the encoder. The best results are boldfaced.

| Method | Regularization | Batch Size | Epochs | ImageNet | Food-101 | CIFAR-10 | CIFAR-100 | Birdsnap | Cars | DTD |
|---|---|---|---|---|---|---|---|---|---|---|
| SimCLR | - | 2048 | 100 | 64.3 | 69.0 | 89.1 | 71.2 | 32.0 | 36.7 | 67.8 |
| SimCLR | LDReg | 2048 | 100 | 64.8 | 69.1 | 89.2 | 70.6 | 33.4 | 37.3 | 67.7 |
| SimCLR | - | 4096 | 1000 | 69.0 | 71.1 | 90.1 | 71.6 | 37.5 | 35.3 | 70.7 |
| SimCLR | LDReg | 4096 | 1000 | 69.8 | 73.3 | 91.8 | 75.1 | 38.7 | 41.6 | 70.8 |

Table 3: The performance of the pre-trained models (ResNet-50) on object detection and instance segmentation tasks, when fine-tuned on COCO. The bounding-box (AP^bb) and mask (AP^mk) average precision are reported, with the best results boldfaced.

| Method | Regularization | Epochs | Batch Size | AP^bb | AP^bb_50 | AP^bb_75 | AP^mk | AP^mk_50 | AP^mk_75 |
|---|---|---|---|---|---|---|---|---|---|
| SimCLR | - | | | 35.24 | 55.05 | 37.88 | 31.30 | 51.70 | 32.82 |
| SimCLR | LDReg | | | 35.26 | 55.10 | 37.78 | 31.38 | 51.88 | 32.90 |
| BYOL | - | | | 36.30 | 55.64 | 38.82 | 32.17 | 52.53 | 34.30 |
| BYOL | LDReg | | | 36.82 | 56.47 | 39.62 | 32.47 | 53.15 | 34.60 |
| SimCLR | - | 1000 | 4096 | 36.48 | 56.22 | 39.28 | 32.12 | 52.70 | 34.02 |
| SimCLR | LDReg | 1000 | 4096 | 37.15 | 57.20 | 39.82 | 32.82 | 53.81 | 34.74 |

Table 1 also indicates that the effective rank is a good indicator of representation quality for the same type of SSL method and the same model architecture. However, the correlation becomes less consistent when compared across different methods. For example, SimCLR + LDReg and SimCLR-Tuned have similar effective ranks (approximately 525), yet perform quite differently on ImageNet (with an accuracy difference of 2.4%). Nevertheless, applying our LDReg regularization can improve both types of SSL methods.

7 CONCLUSION

In this paper, we have highlighted that dimensional collapse in self-supervised learning (SSL) can occur locally, in the vicinity of any training point. Based on a novel derivation of an asymptotic variant of the Fisher-Rao metric, we presented a local dimensionality regularization method, LDReg, to alleviate dimensional collapse from both global and local perspectives. Our theoretical analysis implies that reporting and averaging intrinsic dimensionality (ID) should be done at a logarithmic (rather than linear) scale, using the geometric mean (but not the arithmetic or harmonic mean). Following these theoretical insights, LDReg regularizes the representation space of SSL to have nonuniform local nearest-neighbor distance distributions, maximizing the logarithm of the geometric mean of the sample-wise LIDs. We empirically demonstrated the effectiveness of LDReg in improving the representation quality and final performance of SSL. We believe LDReg can potentially be applied as a generic regularization technique to help other SSL methods.

REPRODUCIBILITY STATEMENT

Details of all hyperparameters and experimental settings are given in Appendix F. Pseudocode for LDReg and LID estimation can be found in Appendix J. A summary of the implementation is available in Appendix I.
We provide source code for reproducing the experiments in this paper, which can be accessed here: https://github.com/Hanxun H/LDReg. We also discuss computational limitations of LID estimation in Appendix H. ACKNOWLEDGMENTS Xingjun Ma is in part supported by the National Key R&D Program of China (Grant No. 2021ZD0112804) and the Science and Technology Commission of Shanghai Municipality (Grant No. 22511106102). Sarah Erfani is in part supported by Australian Research Council (ARC) Discovery Early Career Researcher Award (DECRA) DE220100680. This research was supported by The University of Melbourne s Research Computing Services and the Petascale Campus Initiative. Laurent Amsaleg, Oussama Chelly, Teddy Furon, St ephane Girard, Michael E Houle, Ken-ichi Kawarabayashi, and Michael Nett. Extreme-value-theoretic estimation of local intrinsic dimensionality. Data Mining and Knowledge Discovery, 2018. Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. In Neur IPS, 2019. James Bailey, Michael E Houle, and Xingjun Ma. Local intrinsic dimensionality, entropy and statistical divergences. Entropy, 24(9):1220, 2022. Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEit: BERT pre-training of image transformers. In ICLR, 2022. Adrien Bardes, Jean Ponce, and Yann Le Cun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR, 2022. Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In CVPR, 2014. Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 mining discriminative components with random forests. In ECCV, 2014. J org Bruske and Gerald Sommer. Intrinsic dimensionality estimation with optimally topology preserving maps. TPAMI, 1998. Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018. Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Neur IPS, 2020. Mathilde Caron, Hugo Touvron, Ishan Misra, Herv e J egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021. Kevin M Carter, Raviv Raich, and AO Hero. Learning on statistical manifolds for clustering and visualization. In Allerton Conference on Communication, Control, and Computing, 2007. Claudio Ceruti, Simone Bassis, Alessandro Rozza, Gabriele Lombardi, Elena Casiraghi, and Paola Campadelli. Danco: An intrinsic dimensionality estimator exploiting angle and norm concentration. Pattern recognition, 47(8), 2014. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020a. Published as a conference paper at ICLR 2024 Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. Neur IPS, 2020b. Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2021. Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021. Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014. 
Romain Cosentino, Anirvan Sengupta, Salman Avestimehr, Mahdi Soltanolkotabi, Antonio Ortega, Ted Willke, and Mariano Tepper. Toward a geometrical understanding of self-supervised contrastive learning. ar Xiv preprint ar Xiv:2205.06926, 2022. Marco Del Giudice. Effective dimensionality: A tutorial. Multivariate behavioral research, 56(3): 527 542, 2021. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. Hadi Mohaghegh Dolatabadi, Sarah Erfani, and Christopher Leckie. Collider: A robust training framework for backdoor data. In ACCV, 2022. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. Yann Dubois, Stefano Ermon, Tatsunori B Hashimoto, and Percy S Liang. Improving self-supervised learning by characterizing idealized representations. Neur IPS, 2022. Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. In ICCV, 2021. Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for selfsupervised representation learning. In ICML, 2021. Quentin Garrido, Randall Balestriero, Laurent Najman, and Yann Lecun. Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank. In ICML, 2023a. Quentin Garrido, Yubei Chen, Adrien Bardes, Laurent Najman, and Yann Le Cun. On the duality between contrastive and non-contrastive self-supervised learning. In ICLR, 2023b. Chongjian Ge, Jiangliu Wang, Zhan Tong, Shoufa Chen, Yibing Song, and Ping Luo. Soft neighbors are positive supporters in contrastive visual representation learning. In ICLR, 2023. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. Sixue Gong, Vishnu Naresh Boddeti, and Anil K Jain. On the intrinsic dimensionality of image representations. In CVPR, 2019. Jean-Bastien Grill, Florian Strub, Florent Altch e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Neur IPS, 2020. Bobby He and Mete Ozay. Exploring the gap between collapsed & whitened features in selfsupervised learning. In ICML, 2022. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. Published as a conference paper at ICLR 2024 Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll ar, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022. Geoffrey E Hinton and Richard Zemel. Autoencoders, minimum description length and helmholtz free energy. Neur IPS, 1993. Michael E Houle. Local intrinsic dimensionality I: an extreme-value-theoretic foundation for similarity applications. In SISAP, 2017a. Michael E Houle. Local intrinsic dimensionality II: multivariate analysis and distributional support. In SISAP, 2017b. 
Michael E Houle, Xiguo Ma, Michael Nett, and Vincent Oria. Dimensional testing for multi-step similarity search. In ICDM, 2012. Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. On feature decorrelation in self-supervised learning. In ICCV, 2021. Weiran Huang, Mingyang Yi, Xuyang Zhao, and Zihao Jiang. Towards the generalization of contrastive self-supervised learning. In ICLR, 2023. Li Jing, Pascal Vincent, Yann Le Cun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. In ICLR, 2022. Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019. Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of finegrained cars. 2013. Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009. Elizaveta Levina and Peter Bickel. Maximum likelihood estimation of intrinsic dimension. Neur IPS, 2004. Alexander C Li, Alexei A Efros, and Deepak Pathak. Understanding collapse in non-contrastive siamese representation learning. In ECCV, 2022. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. ar Xiv preprint ar Xiv:1608.03983, 2016. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Michael E. Houle, Dawn Song, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In ICLR, 2018a. Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In ICML, 2018b. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. Bo Pang, Yifan Zhang, Yaoyi Li, Jia Cai, and Cewu Lu. Unsupervised visual representation learning by synchronous momentum grouping. In ECCV, 2022. Karl W Pettis, Thomas A Bailey, Anil K Jain, and Richard C Dubes. An intrinsic dimensionality estimator from near-neighbor information. TPAMI, 1979. Published as a conference paper at ICLR 2024 Phil Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. In ICLR, 2021. Simone Romano, Oussama Chelly, Vinh Nguyen, James Bailey, and Michael E Houle. Measuring dependency via intrinsic dimensionality. In ICPR, 2016. Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In European signal processing conference, 2007. Stephen Taylor. Clustering financial return distributions using the fisher information metric. Entropy, 21(2):110, 2019. Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In ICML, 2021. Leo T ornqvist, Pentti Vartia, and Yrj o O Vartia. How should relative changes be measured? The American Statistician, 39(1):43 46, 1985. Julius Von K ugelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Sch olkopf, Michel Besserve, and Francesco Locatello. Self-supervised learning with data augmentations provably isolates content from style. 
NeurIPS, 2021.

Diane Wagner, Fabio Ferreira, Danny Stoll, Robin Tibor Schirrmeister, Samuel Müller, and Frank Hutter. On the importance of hyperparameters and data augmentation for self-supervised learning. In First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML, 2022.

Zhaoqing Wang, Ziyu Chen, Yaqian Li, Yandong Guo, Jun Yu, Mingming Gong, and Tongliang Liu. Mosaic representation learning for self-supervised visual pre-training. In ICLR, 2023.

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling. In CVPR, 2022.

Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu, Yubei Chen, and Yann LeCun. Decoupled contrastive learning. In ECCV, 2022.

Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In ICML, 2021.

Qi Zhang, Yifei Wang, and Yisen Wang. How mask matters: Towards theoretical understandings of masked autoencoders. In NeurIPS, 2022.

Shaofeng Zhang, Feng Zhu, Junchi Yan, Rui Zhao, and Xiaokang Yang. Zero-CL: Instance and feature decorrelation for negative-free symmetric contrastive learning. In ICLR, 2021.

Zhijian Zhuo, Yifei Wang, Jinwen Ma, and Yisen Wang. Towards a unified theoretical understanding of non-contrastive learning via rank differential mechanism. In ICLR, 2023.

A ACHIEVING DESIRED LOCAL DIMENSIONALITY WITH LDREG

In this section, we provide details regarding how Figure 1c is obtained. Following our theory, LDReg can obtain representations that have a desired local dimensionality. We use a linear layer and randomly generated synthetic data points in 2D following the uniform distribution. The linear layer transforms these points into a representation space. Following Definition 3, one can specify the desired local intrinsic dimension as $\mathrm{LID}_G$. The dimension of the representations is $\mathrm{LID}_F$. To achieve the desired local dimensionality, we minimize the following objective:
$$\min \sum_{i} \left| \ln \frac{\mathrm{LID}_{F^i_w}}{\mathrm{LID}_G} \right| .$$
This objective corresponds to minimizing the asymptotic Fisher-Rao distance between the distribution of the representations and a target distribution of fixed local dimensionality. In Figure 3, we plot the results with target dimensions $\mathrm{LID}_G$ equal to [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]. The results show that the estimated LID is very close to the desired values. This indicates that LDReg can also be used to regularize the representations toward a specific value. From this point of view, LDReg can potentially be applied to other learning tasks as a generic representation regularization technique. A toy sketch of this setup is given after the figure caption.

[Figure 3 omitted: six 2D scatter plots for target LID_G = 1.0, 1.2, 1.4, 1.6, 1.8, and 2.0, with estimated (mLID, GID) values of (1.1, 1.9), (1.3, 1.9), (1.5, 2.0), (1.7, 1.9), (1.9, 2.0), and (2.1, 2.0), respectively.]
Figure 3: Each caption of the subfigures shows the desired local dimensionality, and each title of the subfigures shows the estimated LID and global intrinsic dimensionality (GID). GID is estimated using the DANCo approach (Ceruti et al., 2014). mLID is the geometric mean of estimated sample LIDs.
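A toy reproduction of this setup might look as follows. It is a sketch under stated assumptions, not the authors' code: the data size, optimizer, learning rate, neighborhood size, and number of steps are arbitrary choices, and the LID estimator is the method-of-moments sketch from Section 5.

```python
import torch

def lid_mom(z, k=16, eps=1e-8):
    # Method-of-moments LID per sample, as in the sketch of Section 5.
    d = torch.cdist(z, z)
    knn = d.topk(k + 1, largest=False).values[:, 1:]
    mu, w = knn.mean(dim=1), knn[:, -1]
    return mu / (w - mu + eps)

torch.manual_seed(0)
x = torch.rand(1024, 2)                      # 2D points, uniform in the unit square
layer = torch.nn.Linear(2, 2)                # the linear "representation" layer
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)

target_lid = 1.4                             # desired local dimensionality LID_G
log_target = torch.log(torch.tensor(target_lid))
for step in range(500):
    z = layer(x)
    # Average asymptotic Fisher-Rao distance to the target over the batch.
    loss = (torch.log(lid_mom(z)) - log_target).abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Geometric mean of the sample LIDs; it should move toward the target value.
print(lid_mom(layer(x)).log().mean().exp())
```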
In the context of SSL, to avoid dimensional collapse, the resulting representation should span (fill) the entire space, as shown in Figure 3f. The L1- and L2-regularization terms used for SSL aim to maximize the asymptotic Fisher-Rao distance between the local distance distribution F(r) and a uniform distance distribution $U_{1,w}(r)$ (which has LID equal to 1, as shown in Figure 3a). In other words, the final representations should be further from that of Figure 3a, and closer to that of Figure 3f.

B EFFECTIVE RANK VERSUS LOCAL INTRINSIC DIMENSIONALITY

Zhuo et al. (2023) proposed the effective rank (Roy & Vetterli, 2007) as a metric to evaluate the degree of (global) dimensional collapse. Given the feature correlation matrix, the effective rank corresponds to the exponential of the entropy of the normalized eigenvalues. It is invariant to scaling and takes real values, not just integers. It has a maximum value equal to the representation dimension. Intuitively, the effective rank assesses the degree to which the data fills the representation space, in terms of covariance properties. One might also define a local effective rank, which would assess the degree to which the representation space surrounding a particular anchor sample is filled. In contrast to the effective rank, the local intrinsic dimension (LID) is a local measure. Roughly speaking, at a particular anchor sample, the LID assesses the growth rate of the distance distribution of nearest neighbors. It is thus focused on distance properties (distances between samples), rather than covariance properties between features. Comparisons between intrinsic dimensionality and effective rank, emphasizing their differences, have been discussed in Del Giudice (2021).

C.1 BACKGROUND

Theorem 3 (LID Representation Theorem (Houle, 2017a)). Let $F: \mathbb{R} \to \mathbb{R}$ be a real-valued function, and assume that $\mathrm{LID}_F$ exists. Let r and w be values for which $r/w$ and $F(r)/F(w)$ are both positive. If F is non-zero and continuously differentiable everywhere in the interval $[\min\{r, w\}, \max\{r, w\}]$, then
$$\frac{F(r)}{F(w)} = \left( \frac{r}{w} \right)^{\mathrm{LID}_F} \cdot A_F(r, w), \quad \text{where} \quad A_F(r, w) \triangleq \exp\left( \int_r^w \frac{\mathrm{LID}_F - \mathrm{LID}_F(t)}{t} \, dt \right),$$
whenever the integral exists.

Corollary 3.1 ((Houle, 2017a)). Let c and w be real constants such that $c \geq 1$ and $w > 0$. Let $F: \mathbb{R} \to \mathbb{R}$ be a real-valued function satisfying the conditions of Theorem 3 over the interval $(0, cw]$. Then
$$\lim_{\substack{w \to 0^+ \\ w/c \,\leq\, r \,\leq\, cw}} A_F(r, w) = 1 .$$

If $F(0) = 0$, and F is non-decreasing and continuously differentiable over some interval $[0, w]$ for $w > 0$, then F is referred to as a smooth growth function. If F is a CDF, we use the notation $F_w$ to refer to F conditioned over $[0, w]$; that is, $F_w(r) = F(r)/F(w)$.

C.2 PROOF OF LEMMA 1

Proof. We leverage a result from (Taylor, 2019), which shows that the Fisher-Rao metric between two one-dimensional distributions with a single parameter can be expressed as $d(\theta_1, \theta_2) = \big| \int_{\theta_1}^{\theta_2} u(\theta) \, d\theta \big|$, where $u(\theta)^2 = I(\theta)$, and $I(\theta)$ is the Fisher information with respect to the single parameter $\theta$. In our context, $\theta$ corresponds to the local intrinsic dimensionality. For the Fisher information for functions restricted to the form $H_{w|\theta}$, we will therefore use the quantity
$$I_w(\theta) = \int_0^w \left( \frac{\partial}{\partial \theta} \ln H'_{w|\theta}(r) \right)^2 H'_{w|\theta}(r) \, dr .$$
We now derive an expression for the Fisher-Rao distance.
From (Taylor, 2019), and noting that the Fisher information is greater than or equal to zero,
$$d_{FR}(H_{w|\theta_1}, H_{w|\theta_2}) = \left| \int_{\theta_1}^{\theta_2} \sqrt{I_w(\theta)} \, d\theta \right| .$$
The density corresponding to $H_{w|\theta}(r) = (r/w)^{\theta}$ is $H'_{w|\theta}(r) = \theta r^{\theta-1}/w^{\theta}$, so that $\frac{\partial}{\partial\theta} \ln H'_{w|\theta}(r) = \frac{1}{\theta} + \ln\frac{r}{w}$. With the substitution $v = (r/w)^{\theta}$, we obtain
$$I_w(\theta) = \int_0^w \left( \frac{1}{\theta} + \ln\frac{r}{w} \right)^2 \frac{\theta r^{\theta-1}}{w^{\theta}} \, dr = \frac{1}{\theta^2} \int_0^1 (1 + \ln v)^2 \, dv = \frac{1}{\theta^2} \Big[ v + 2(v \ln v - v) + (v \ln^2 v - 2v \ln v + 2v) \Big]_0^1 = \frac{1}{\theta^2} ,$$
and therefore
$$d_{FR}(H_{w|\theta_1}, H_{w|\theta_2}) = \left| \int_{\theta_1}^{\theta_2} \frac{1}{\theta} \, d\theta \right| = \left| \ln \frac{\theta_2}{\theta_1} \right| .$$

C.3 PROOF OF THEOREM 2

Proof. We find the distribution $H_{w|\theta}$ that minimizes the expression
$$\theta_G = \arg\min_{\theta} \sum_{i=1}^{N} d_{AFR}(H_{w|\theta}, F^i)^2$$
by taking the partial derivative with respect to $\theta$, and solving for the value $\theta = \theta_G$ for which the partial derivative is zero:
$$\frac{\partial}{\partial \theta} \sum_{i=1}^{N} \left( \ln \frac{\theta}{\mathrm{LID}_{F^i}} \right)^2 \Bigg|_{\theta=\theta_G} = 0
\;\Longrightarrow\; \frac{\partial}{\partial \theta} \sum_{i=1}^{N} \left( \ln^2 \theta + \ln^2 \mathrm{LID}_{F^i} - 2 \ln\theta \ln \mathrm{LID}_{F^i} \right) \Bigg|_{\theta=\theta_G} = 0
\;\Longrightarrow\; \frac{2}{\theta_G} \sum_{i=1}^{N} \left( \ln \theta_G - \ln \mathrm{LID}_{F^i} \right) = 0
\;\Longrightarrow\; \theta_G = \exp\left( \frac{1}{N} \sum_{i=1}^{N} \ln \mathrm{LID}_{F^i} \right) .$$
In a similar fashion, the second partial derivative can be shown to be strictly positive at $\theta = \theta_G$. Since the original expression is continuous and non-negative over all $\theta \in [0, \infty)$, $\theta_G$ is a global minimum.

C.4 PROOF OF COROLLARY 2.1

Proof. Assertion 1 follows from the fact that
$$\frac{1}{N} \sum_{i=1}^{N} d_{AFR}(F^i, U_1) = \frac{1}{N} \sum_{i=1}^{N} \left| \ln \frac{\mathrm{LID}_{F^i}}{1} \right| = \frac{1}{N} \sum_{i=1}^{N} \ln \mathrm{LID}_{F^i} \quad (\text{if } \mathrm{LID}_{F^i} \geq 1) .$$

C.5 KULLBACK-LEIBLER DIVERGENCE AS DISTRIBUTIONAL DIVERGENCE

It is natural to consider whether other measures of distributional divergence could be used in place of the asymptotic Fisher-Rao metric in the derivation of the Fréchet mean. Bailey et al. (2022) have shown several other divergences and distances which, when conditioned to a vanishing lower tail, tend to expressions involving the local intrinsic dimensionalities of distance distributions, most notably the Kullback-Leibler (KL) divergence. Here, we define an asymptotic distributional distance from the square root of the asymptotic KL divergence considered in (Bailey et al., 2022).

Lemma 2 ((Bailey et al., 2022)). Given two smooth-growth distance distributions with CDFs F and G, their asymptotic KL distance is given by
$$d_{AKL}(F, G) \triangleq \lim_{w \to 0^+} \sqrt{D_{KL}(F_w, G_w)} = \sqrt{\frac{\mathrm{LID}_G}{\mathrm{LID}_F} - \ln \frac{\mathrm{LID}_G}{\mathrm{LID}_F} - 1} , \quad \text{where} \quad D_{KL}(F_w, G_w) \triangleq \int_0^w F'_w(t) \ln \frac{F'_w(t)}{G'_w(t)} \, dt$$
is the KL divergence from F to G when conditioned to the lower tail $[0, w]$.

When aggregating the LID values of distance distributions, it is worth considering how well the arithmetic mean (equivalent to the information dimension (Romano et al., 2016)) and the harmonic mean might serve as alternatives to the geometric mean. Theorem 4 shows that the arithmetic and harmonic means of distributional LIDs are obtained when the asymptotic Fisher-Rao metric is replaced by the asymptotic KL distance in the derivation of the Fréchet mean.

Theorem 4. Given a set of distributions $\mathcal{F} = \{F^1, F^2, \ldots, F^N\}$, consider the metric used in computing the Fréchet mean $\mu_{\mathcal{F}} = H_{w|\theta}$.
1. Using the asymptotic Fisher-Rao metric $d_{AFR}(H_{w|\theta}, F^i)$ as in Definition 4 gives $\theta$ equal to the geometric mean of $\{\mathrm{LID}_{F^1}, \ldots, \mathrm{LID}_{F^N}\}$.
2. Replacing $d_{AFR}(H_{w|\theta}, F^i)$ by the asymptotic KL distance $d_{AKL}(H_{w|\theta}, F^i)$ gives $\theta$ equal to the arithmetic mean of $\{\mathrm{LID}_{F^1}, \ldots, \mathrm{LID}_{F^N}\}$.
3. Replacing $d_{AFR}(H_{w|\theta}, F^i)$ by the (reverse) asymptotic KL distance $d_{AKL}(F^i, H_{w|\theta})$ gives $\theta$ equal to the harmonic mean of $\{\mathrm{LID}_{F^1}, \ldots, \mathrm{LID}_{F^N}\}$.

Proof. Assertion 1 has been shown in Theorem 2. For Assertion 2, we find the value of $\theta$ for which the following expression is minimized:
$$\theta_A = \arg\min_{\theta} \sum_{i=1}^{N} d_{AKL}(H_{w|\theta}, F^i)^2 .$$
Using Lemma 2, and observing that $\mathrm{LID}_{H_{w|\theta}} = \theta$,
$$d_{AKL}(H_{w|\theta}, F^i)^2 = \lim_{w \to 0} D_{KL}(H_{w|\theta}, F^i_w) = \frac{\mathrm{LID}_{F^i}}{\theta} - \ln \frac{\mathrm{LID}_{F^i}}{\theta} - 1 .$$
As in the proof of Theorem 2, the minimization is accomplished by setting the partial derivative to zero and solving for $\theta = \theta_A$:
$$\frac{\partial}{\partial \theta} \sum_{i=1}^{N} \left( \frac{\mathrm{LID}_{F^i}}{\theta} - \ln \frac{\mathrm{LID}_{F^i}}{\theta} - 1 \right) \Bigg|_{\theta=\theta_A} = 0
\;\Longrightarrow\; \sum_{i=1}^{N} \left( -\frac{\mathrm{LID}_{F^i}}{\theta_A^2} + \frac{1}{\theta_A} \right) = 0
\;\Longrightarrow\; \theta_A = \frac{1}{N} \sum_{i=1}^{N} \mathrm{LID}_{F^i} .$$
For Assertion 3, we similarly find the value of $\theta$ for which the following expression is minimized:
$$\theta_H = \arg\min_{\theta} \sum_{i=1}^{N} d_{AKL}(F^i, H_{w|\theta})^2 .$$
Once again, we take the partial derivative with respect to $\theta$, set it to zero, and solve for $\theta = \theta_H$:
$$\frac{\partial}{\partial \theta} \sum_{i=1}^{N} \left( \frac{\theta}{\mathrm{LID}_{F^i}} - \ln \frac{\theta}{\mathrm{LID}_{F^i}} - 1 \right) \Bigg|_{\theta=\theta_H} = 0
\;\Longrightarrow\; \sum_{i=1}^{N} \left( \frac{1}{\mathrm{LID}_{F^i}} - \frac{1}{\theta_H} \right) = 0
\;\Longrightarrow\; \theta_H = \frac{N}{\sum_{i=1}^{N} \frac{1}{\mathrm{LID}_{F^i}}} .$$
Note that $\theta_A$ and $\theta_H$ can be verified as minima by computing the second partial derivatives with respect to $\theta$.
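The closed-form minimizers in Theorem 4 can be checked numerically. The sketch below uses two illustrative LID values and SciPy's bounded scalar minimizer in place of the analytical argmin; the numbers are examples only.

```python
import numpy as np
from scipy.optimize import minimize_scalar

lids = np.array([2.0, 8.0])  # illustrative sample LIDs

# Squared asymptotic Fisher-Rao and squared asymptotic KL distances between the
# canonical tail distribution H_{w|theta} and a distribution with LID value l.
d_afr2 = lambda theta, l: np.log(theta / l) ** 2
d_akl2 = lambda theta, l: l / theta - np.log(l / theta) - 1      # d_AKL(H, F)^2
d_akl2_rev = lambda theta, l: theta / l - np.log(theta / l) - 1  # d_AKL(F, H)^2

def frechet(metric2):
    """Numerically minimize the sum of squared distances over theta."""
    obj = lambda t: sum(metric2(t, l) for l in lids)
    return minimize_scalar(obj, bounds=(1e-3, 1e3), method="bounded").x

print(frechet(d_afr2))      # ~4.0, the geometric mean of 2 and 8
print(frechet(d_akl2))      # ~5.0, the arithmetic mean
print(frechet(d_akl2_rev))  # ~3.2, the harmonic mean
```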
The square root of the KL divergence is only a weak approximation of the Fisher-Rao metric on statistical manifolds, and this approximation is known to degrade as distributions diverge (Carter et al., 2007). Moreover, the KL divergence is also nonsymmetric. Theorem 4 therefore indicates that the asymptotic Fisher-Rao metric is preferable (in theory) to the asymptotic KL distance, and the geometric mean is preferable to the arithmetic mean and harmonic mean when aggregating the LID values of distance distributions.

D OTHER REGULARIZATION FORMULATIONS

We elaborate on our choices of regularization term.

Remark 4.1. The proposed regularization is equivalent to maximizing the (log of the) geometric mean of the LIDs of the samples. Theorem 4 provided arguments for why use of the geometric mean is preferable to other means for the purpose of computing the Fréchet mean of a set of distributions. We can similarly consider why a regularization corresponding to the arithmetic mean of the LIDs of the samples, i.e. $\max \frac{1}{N} \sum_{i=1}^{N} \mathrm{LID}_{F^i_w}$, would be less preferable. Note that
$$\max \frac{1}{N} \sum_{i=1}^{N} \mathrm{LID}_{F^i_w} = \max \frac{1}{N} \sum_{i=1}^{N} \exp\left( \lim_{w \to 0} d_{FR}\big(F^i_w(r), U_{1,w}(r)\big) \right) .$$
Observe that this arithmetic mean regularization, due to the exponential transformation, would apply a high weighting to samples with very large distances from the uniform distribution (that is, samples with large LID). In other words, such a regularization objective could be optimized by making the LID of a small number of samples extremely large.

E LDREG AND SSL METHODS

In this section, we provide more details on how to apply LDReg to different SSL methods. Since LDReg is applied to the representation obtained by the encoder, the varying combinations of projector, predictor, decoder, and optimizing objective used by SSL methods do not directly affect how LDReg is applied. As a result, LDReg can be regarded as a general regularization for SSL.

SimCLR. For input images x of batch size N, an encoder f(·), and a projector g(·), the representations are obtained as z = f(x), and the embeddings as e = g(z). Given a batch of 2N augmented inputs, the NT-Xent loss used by SimCLR (Chen et al., 2020a) for a positive pair of inputs $(x_i, x_j)$ is:
$$\mathcal{L}^{\mathrm{NTXent}}_i = -\ln \frac{\exp(\mathrm{sim}(e_i, e_j)/\tau)}{\sum_{m=1}^{2N} \mathbb{1}_{[m \neq i]} \exp(\mathrm{sim}(e_i, e_m)/\tau)} ,$$
where $\tau$ is the temperature, and the final loss is computed across all positive pairs. For applying LDReg with the $\mathcal{L}_{L1}$ term on SimCLR, we optimize the following objective:
$$\mathcal{L}_{L1} = \mathcal{L}^{\mathrm{NTXent}} - \beta \cdot \frac{1}{N} \sum_{i} \ln \mathrm{LID}_{F^i_w} ,$$
where the LID for each sample is estimated using the method of moments:
$$\mathrm{LID}_{F^i_w} = \frac{\mu_k}{w_k - \mu_k} ,$$
where $\mu_k$ is the averaged distance to the k nearest neighbors of $z_i$, and $w_k$ is the distance to the k-th nearest neighbor of $z_i$. A sketch of this combined objective is given below.
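The following self-contained sketch combines an NT-Xent implementation with the LDReg L1 term, following the equations above. The NT-Xent formulation and the choice to pool both views' encoder representations for LID estimation are our simplifications; the released code (Appendix J) should be treated as authoritative.

```python
import torch
import torch.nn.functional as F

def nt_xent(e1: torch.Tensor, e2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent loss over a batch of paired embeddings e1, e2 of shape (N, d)."""
    e = F.normalize(torch.cat([e1, e2], dim=0), dim=1)   # (2N, d), unit norm
    sim = e @ e.t() / tau                                # cosine similarities / tau
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    n2 = e.shape[0]
    # positives: row i pairs with row i + N (and row i + N with row i)
    targets = torch.cat([torch.arange(n2 // 2, n2), torch.arange(0, n2 // 2)])
    return F.cross_entropy(sim, targets)

def ldreg_l1(z: torch.Tensor, k: int = 64, eps: float = 1e-8) -> torch.Tensor:
    """Negative log geometric mean of method-of-moments LID estimates."""
    knn = torch.cdist(z, z).topk(k + 1, largest=False).values[:, 1:]
    mu, w = knn.mean(dim=1), knn[:, -1]
    return -torch.log(mu / (w - mu + eps) + eps).mean()

# Hypothetical usage inside a SimCLR training step; `encoder`, `projector`,
# `beta`, and the augmented views are placeholders for the real pipeline.
# z1, z2 = encoder(view1), encoder(view2)          # representations
# e1, e2 = projector(z1), projector(z2)            # embeddings
# loss = nt_xent(e1, e2) + beta * ldreg_l1(torch.cat([z1, z2], dim=0))
```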
BYOL. BYOL (Grill et al., 2020) uses an additional predictor h(·) to obtain predictions p = h(e), together with a momentum encoder (an exponential moving average of the weights of the online model) from which the target embedding e′ is obtained. The loss function is the scaled cosine similarity between positive pairs, defined as:
$$ \mathcal{L}^{\mathrm{BYOL}}_i = 2 - 2\, \frac{\langle p_i, e'_j \rangle}{\|p_i\|_2 \, \|e'_j\|_2} , $$
with the final loss computed symmetrically across all positive pairs. For applying LDReg with the $\mathcal{L}_{L1}$ term on BYOL, we optimize the following objective:
$$ \mathcal{L}_{L1} = \mathcal{L}^{\mathrm{BYOL}} - \beta \, \frac{1}{N} \sum_{i} \ln \mathrm{LID}_{F^i_w} . $$
The LID of each sample is estimated in the same way as when applying LDReg to SimCLR. For BYOL, we use the representations obtained by both the online and momentum branches as the reference set.

MAE. MAE (He et al., 2022) uses a decoder that aims to reconstruct the input image. Unlike contrastive approaches, it does not rely on two differently augmented views of the same image. MAE uses an encoder f(·) to obtain the representation z = f(x′) for a masked image x′, and the decoder t(·) aims to reconstruct the original image x by taking the representation z as input. Specifically, MAE optimizes the following objective:
$$ \mathcal{L}^{\mathrm{MAE}}_i = \big\| t(f(x'_i)) - x_i \big\|^2 . $$
For applying LDReg with the $\mathcal{L}_{L1}$ term on MAE, we optimize the following objective:
$$ \mathcal{L}_{L1} = \mathcal{L}^{\mathrm{MAE}} - \beta \, \frac{1}{N} \sum_{i} \ln \mathrm{LID}_{F^i_w} . $$
The LID of each sample is estimated in the same way as when applying LDReg to SimCLR.
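For concreteness, a minimal sketch of the two per-sample losses above (an illustration only; it ignores BYOL's symmetrization across views and MAE's restriction of the reconstruction loss to masked patches):

import torch
import torch.nn.functional as F

def byol_loss(p, e_prime):
    # 2 - 2 * cosine similarity between predictions and target embeddings
    return 2 - 2 * F.cosine_similarity(p, e_prime, dim=-1)

def mae_loss(reconstruction, target):
    # per-sample squared reconstruction error, ||t(f(x')) - x||^2
    return (reconstruction - target).pow(2).flatten(1).sum(dim=1)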
F EXPERIMENTAL SETTINGS

For each baseline method, we follow its original settings, except in the case of BYOL, where we changed the parameter for the exponential moving average from 0.996 to 0.99, which performs better when the number of epochs is set to 100. Detailed hyperparameter settings can be found in Tables 5-11. We use 100 epochs of pretraining and a batch size of 2048 as defaults. For LDReg regularization, we use k = 128 as the default neighborhood size. For ResNet-50, we use β = 0.01 for SimCLR and SimCLR-Tuned, and β = 0.005 for BYOL. For ViT-B, we use β = 0.001 for SimCLR, and β = 5 × 10^-6 for MAE.

We perform linear evaluations following existing works (Chen et al., 2020a; Grill et al., 2020; He et al., 2022; Garrido et al., 2023b). For linear evaluations, we use a batch size of 4096 on ImageNet; other settings are shown in Table 9. Following SimCLR (Chen et al., 2020a) and BYOL (Grill et al., 2020), we evaluate transfer learning performance by performing linear evaluations on other datasets, including Food-101 (Bossard et al., 2014), CIFAR (Krizhevsky & Hinton, 2009), Birdsnap (Berg et al., 2014), Stanford Cars (Krause et al., 2013), and DTD (Cimpoi et al., 2014). Due to computational constraints, we did not perform full hyperparameter tuning for each model and dataset in the transfer learning experiments, and our reproduced results for the baseline methods are slightly lower than those reported by Chen et al. (2020a); Grill et al. (2020). For all datasets, we use 30 epochs, weight decay 0.0005, learning rate 0.01, batch size 256, and SGD with Nesterov momentum as the optimizer. These settings are based on the VISSL library 2.

We evaluate finetuning performance on downstream tasks using the COCO dataset (train2017 and val2017) (Lin et al., 2014). We use ResNet-50 with R-CNN-C4 (Girshick et al., 2014), a batch size of 16, and a base learning rate of 0.02. We use the popular detectron2 framework 3, and our configurations follow the MoCo-v1 official implementation 4 exactly.

We conducted our experiments on Nvidia A100 GPUs with a PyTorch implementation, with each experiment distributed across 4 GPUs. We used automatic mixed precision due to its memory efficiency. The estimated runtime is 40 hours for pretraining and linear evaluation. As can be seen from the pseudocode in Appendix J, the additional computation mainly depends on the calculation and sorting of pairwise distances. As shown in Table 4, we observed no significant additional computational cost for LDReg. Open source code is available at: https://github.com/HanxunH/LDReg.

Table 4: Wall-clock comparison for pretraining with Distributed Data-Parallel training. Each experiment uses 4 GPUs distributed over different nodes. Results are based on 100 epochs of pretraining. Communication overheads could have a slight effect on the results.
Method | Wall-clock time
SimCLR | 27.8 hours
SimCLR + LDReg | 27.1 hours

Table 5: Pretraining settings for SimCLR (Chen et al., 2020a).
Base learning rate: 0.075
Learning rate scaling: 0.075 × √BatchSize
Learning rate decay: Cosine (Loshchilov & Hutter, 2016) without restart
Weight decay: 1.0 × 10^-6
Optimizer: LARS (You et al., 2017)
Temperature for L_NT-Xent: 0.1
Projector: 2048-128

Data Augmentations. For each baseline and its LDReg version, we use the same augmentations as in existing works. The augmentation policy for SimCLR (Chen et al., 2020a) is given in Table 10, and the policy for SimCLR-Tuned (Garrido et al., 2023b) and BYOL (Grill et al., 2020) is given in Table 11.

2 https://github.com/facebookresearch/vissl
3 https://github.com/facebookresearch/detectron2
4 https://github.com/facebookresearch/moco/tree/main/detection/configs

Table 6: Pretraining settings for SimCLR-Tuned (Garrido et al., 2023b).
Base learning rate: 0.5
Learning rate scaling: 0.5 × BatchSize/256
Learning rate decay: Cosine (Loshchilov & Hutter, 2016) without restart
Weight decay: 1.0 × 10^-6
Optimizer: LARS (You et al., 2017)
Temperature for L_NT-Xent: 0.15
Projector: 8192-8192-512

Table 7: Pretraining settings for BYOL (Grill et al., 2020).
Base learning rate: 0.4
Learning rate scaling: 0.4 × BatchSize/256
Learning rate decay: Cosine (Loshchilov & Hutter, 2016) without restart
Weight decay: 1.5 × 10^-6
Optimizer: LARS (You et al., 2017)
τ for moving average: 0.99
Projector: 4096-256
Predictor: 4096-256

Table 8: Pretraining settings for MAE (He et al., 2022).
Base learning rate: 1.5 × 10^-4
Learning rate scaling: 1.5 × 10^-4 × BatchSize/256
Learning rate decay: Cosine (Loshchilov & Hutter, 2016) without restart
Weight decay: 0.05
Optimizer: AdamW (Loshchilov & Hutter, 2019)
β1 for the optimizer: 0.9
β2 for the optimizer: 0.95

Table 9: Linear evaluation settings for ImageNet.
Epochs: 90
Base learning rate: 0.1
Learning rate scaling: 0.1 × BatchSize/256
Minimal learning rate: 1.0 × 10^-6
Learning rate decay: Cosine (Loshchilov & Hutter, 2016) without restart
Weight decay: 0
Optimizer: LARS (You et al., 2017)

Table 10: Image augmentation policy for SimCLR (Chen et al., 2020a).
Parameter | View 1 | View 2
Random crop probability | 1.0 | 1.0
Horizontal flip probability | 0.5 | 0.5
Color jittering probability | 0.8 | 0.8
Color jittering strength (s) | 1.0 | 1.0
Brightness adjustment max intensity | 0.8·s | 0.8·s
Contrast adjustment max intensity | 0.8·s | 0.8·s
Saturation adjustment max intensity | 0.8·s | 0.8·s
Hue adjustment max intensity | 0.2·s | 0.2·s
Grayscale probability | 0.2 | 0.2
Gaussian blurring probability | 0.5 | 0.5
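For reference, the Table 10 policy corresponds roughly to the following torchvision transform (a sketch only; the crop size, blur kernel size, and sigma range are not specified in the table and are assumptions here):

from torchvision import transforms

s = 1.0  # color jittering strength (Table 10)
simclr_view = transforms.Compose([
    transforms.RandomResizedCrop(224),                      # random crop, applied with probability 1.0
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply(
        [transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    transforms.ToTensor(),
])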
Table 11: Image augmentation policy for BYOL (Grill et al., 2020) and SimCLR-Tuned (Garrido et al., 2023b).
Parameter | View 1 | View 2
Random crop probability | 1.0 | 1.0
Horizontal flip probability | 0.5 | 0.5
Color jittering probability | 0.8 | 0.8
Brightness adjustment max intensity | 0.4 | 0.4
Contrast adjustment max intensity | 0.4 | 0.4
Saturation adjustment max intensity | 0.2 | 0.2
Hue adjustment max intensity | 0.1 | 0.1
Grayscale probability | 0.2 | 0.2
Gaussian blurring probability | 1.0 | 0.1
Solarization probability | 0.0 | 0.2

G ADDITIONAL EXPERIMENTAL RESULTS

G.1 COMPARING LOSS TERMS

It can be observed from Table 12 that there are no significant differences between the regularization terms $\mathcal{L}_{L1}$ and $\mathcal{L}_{L2}$ in improving the performance of SSL.

Table 12: Comparing the linear evaluation results of the regularization terms of LDReg. All models are trained on ImageNet for 100 epochs. The results are reported as linear probing accuracy (%).
Method | Regularization | k=64 | k=128
SimCLR | L_L1 | 64.8 | 64.6
SimCLR | L_L2 | 64.4 | 64.5

G.2 LOCAL COLLAPSE TRIGGERING COMPLETE COLLAPSE

In this section, we demonstrate that local collapse can trigger the worst-case mode collapse, where the output representation is a trivial vector. LDReg is a general regularization tool that regularizes the representation toward a target LID; one can also use LDReg to target a lower LID and, in the extreme case, local collapse. Specifically, we optimize the SSL loss together with the regularization term
$$ \frac{1}{N} \sum_{i} \big| \ln \mathrm{LID}_{F^i_w} \big| , $$
which regularizes the geometric mean of the LIDs (of the representations) toward 1. We use β to control the strength of this regularization term and denote this variant as MinLID.

Table 13: Comparing the linear evaluation results of LDReg, MinLID, and the baseline. All models are trained on ImageNet for 100 epochs using ResNet-50 as the encoder. The results are reported as linear probing accuracy (%) on ImageNet.
Method | Regularization | β | Linear Acc | Effective Rank | Geometric mean of LID
SimCLR | LDReg | 0.01 | 64.8 | 529.6 | 20.0
SimCLR | - | - | 64.3 | 470.2 | 18.8
SimCLR | MinLID | 0.01 | 64.2 | 150.7 | 16.0
SimCLR | MinLID | 0.1 | 63.1 | 15.0 | 3.8
SimCLR | MinLID | 1.0 | 46.4 | 1.0 | 1.6
SimCLR | MinLID | 10.0 | Complete collapse | - | -

As shown in Table 13, using MinLID with a stronger regularization (larger β) pushes the representation toward an extremely low effective rank and eventually results in complete collapse, even though SimCLR explicitly uses negative pairs to prevent this. Figure 4 shows visualizations of the representations learned under different regularizations. Additionally, linear evaluation performance degrades as dimensionality decreases. This result indicates that low LID is undesirable for SSL.

Figure 4: t-SNE visualizations of the representations learned with different pretraining regularizations: (a) LDReg β=0.01, (b) no regularization, (c) MinLID β=0.1, (d) MinLID β=1.0. Results are based on ResNet-50 trained with SimCLR, using the ImageNet validation set. Only the first 10 classes are selected for visualization.

G.3 ABLATION STUDY

We examine the effects of varying β and k for LDReg, using SimCLR as the baseline. It can be observed in Figures 5a and 5b that linear evaluation performance is relatively stable across different values of k and β. For effective rank, Figure 5c shows that a greater strength of LDReg regularization (larger β) actually decreases the effective rank. This is not surprising, since LID is a local measure and effective rank is a global measure; the differences between effective rank and LID are outlined in Appendix B. Figure 5d shows that a smaller value of k is more beneficial for LDReg. Smaller k is indeed preferable, as it helps to preserve the locality assumptions upon which LID estimation depends.

Figure 5: (a-b) Linear evaluation results and (c-d) effective ranks with varying β and k. All models are trained on ImageNet for 100 epochs. The results are reported as linear probing accuracy (%) on ImageNet.
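Since the ablation above repeatedly refers to effective rank, the following sketch shows one common way to compute it from a batch of representations, via the entropy of the normalized singular value spectrum; the paper's exact computation is described in Appendix B and may differ in detail.

import torch

def effective_rank(z, eps=1e-12):
    # z: (N, d) batch of representations
    z = z - z.mean(dim=0, keepdim=True)          # center the features
    s = torch.linalg.svdvals(z)                  # singular values
    p = s / (s.sum() + eps)                      # normalized spectrum
    entropy = -(p * torch.log(p + eps)).sum()    # spectral (Shannon) entropy
    return torch.exp(entropy)                    # effective rank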
Table 14 shows that LDReg consistently improves the baseline across different batch sizes N. We reduce k at the same rate as N is reduced, e.g., k = 16 and 32 for batch sizes 512 and 1024, respectively.

Table 14: Comparing linear evaluation results with different batch sizes N. All models are trained on ImageNet for 100 epochs. The results are reported as linear probing accuracy (%).
Method | Regularization | N=512 | N=1024 | N=2048
SimCLR | - | 63.6 | 64.2 | 64.3
SimCLR | LDReg | 64.1 | 64.7 | 64.8

G.4 ADDITIONAL LINEAR EVALUATION RESULTS

In this subsection, we further verify the effectiveness of LDReg with feature-decorrelation methods such as VICReg (Bardes et al., 2022) and Barlow Twins (Zbontar et al., 2021). We also evaluate with another state-of-the-art sample-contrastive method, MoCo (He et al., 2020). For applying LDReg on VICReg, we use β = 0.025 and k = 64. For Barlow Twins, we use β = 1.0 and k = 64. For MoCo, we use β = 0.05 and k = 128. All other settings are kept the same as each baseline's originally reported hyperparameters. Results can be found in Table 15.

Table 15: The linear evaluation results (accuracy (%)) of different methods with and without LDReg. The effective rank is calculated on the ImageNet validation set. The best results are boldfaced.
Model | Epochs | Method | Regularization | Linear Evaluation | Effective Rank | Geometric mean of LID
ResNet-50 | 100 | MoCo | - | 68.7 | 595.0 | 17.1
ResNet-50 | 100 | MoCo | LDReg | 69.6 | 651.8 | 22.3
ResNet-50 | 100 | VICReg | - | 66.7 | 546.7 | 21.5
ResNet-50 | 100 | VICReg | LDReg | 66.9 | 602.4 | 22.5
ResNet-50 | 100 | Barlow Twins | - | 65.5 | 602.1 | 20.8
ResNet-50 | 100 | Barlow Twins | LDReg | 65.6 | 754.0 | 24.1

VICReg and Barlow Twins are SSL methods rather than regularizers: compared to MoCo, for example, they use different projector architectures and loss functions, so it is not fair to compare LDReg across different types of SSL methods. MoCo with LDReg reaches a linear evaluation accuracy of 69.6 and MoCo alone achieves 68.7, while VICReg under the same setting reaches only 66.7. However, when our regularizer is applied to these methods, their performance is consistently improved, as shown in Table 15.

To fairly compare regularizers, we use the covariance (denoted Cov) and variance (denoted Var) terms as alternative regularizers in place of LDReg and apply them to BYOL. Note that these are pseudo-global dimension regularizers: since we cannot use the entire training set to calculate the covariance, it is calculated on a mini-batch. We also performed a hyperparameter search for Cov and Var over β, the strength of the regularization. We use the same regularization formulation as VICReg (Bardes et al., 2022) for Cov and Var, as follows:
$$ C(Z) = \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})(z_i - \bar{z})^{T}, \qquad \bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i, \tag{4} $$
$$ \mathrm{Cov}(Z) = c(Z) = \frac{1}{d} \sum_{i \neq j} [C(Z)]^2_{i,j}, \tag{5} $$
where d is the dimension of the representations, and
$$ \mathrm{Var}(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\big(0, \gamma - S(z_j, \epsilon)\big), \tag{6} $$
where S(·) is the standard deviation.
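For concreteness, a sketch of these two terms in PyTorch (consistent with Eqs. (4)-(6); the values of γ and ε are assumptions following VICReg's commonly used defaults):

import torch
import torch.nn.functional as F

def cov_var_regularizers(z, gamma=1.0, eps=1e-4):
    # z: (n, d) batch of representations
    n, d = z.shape
    z = z - z.mean(dim=0)                            # center, Eq. (4)
    cov = (z.T @ z) / (n - 1)                        # covariance matrix C(Z), Eq. (4)
    off_diag = cov.pow(2).sum() - cov.pow(2).diagonal().sum()
    cov_loss = off_diag / d                          # Cov(Z), Eq. (5)
    std = torch.sqrt(z.var(dim=0) + eps)             # S(z_j, eps)
    var_loss = F.relu(gamma - std).mean()            # Var(Z), Eq. (6)
    return cov_loss, var_loss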
We apply the regularization to the representations learned by the encoder, in the same way as LDReg. The results can be found in Table 16. All results are based on 100 epochs of pretraining with BYOL and ResNet-50. All settings are exactly the same as for LDReg except the regularization term.

Table 16: The linear evaluation results comparing different regularization terms. The effective rank is calculated on the ImageNet validation set. The best results are boldfaced.
Method | Regularizer | β | Linear Evaluation | Effective Rank | Geometric mean of LID
BYOL | None | - | 67.6 | 583.8 | 15.9
BYOL | Cov | 0.01 | 67.6 | 583.5 | 15.9
BYOL | Cov | 0.1 | 67.5 | 593.5 | 15.8
BYOL | Cov+Var | 0.01 | 67.8 | 539.2 | 15.5
BYOL | Cov+Var | 0.1 | 67.7 | 798.4 | 16.8
BYOL | LDReg | 0.005 | 68.5 | 594.0 | 22.3

Although the covariance and variance regularizers can increase the global dimensionality, they do not improve the local dimensionality. They also have a rather minor effect on linear evaluation accuracy, whereas LDReg improves it by almost 1%. This further confirms the effectiveness of LDReg.

H LIMITATIONS AND FUTURE WORK

The main theory of our work is based on the local intrinsic dimensionality model and the well-founded Fisher-Rao metric. Computing LDReg and the Fisher-Rao metric both require the ability to accurately estimate the local intrinsic dimension. We have used the method of moments for LID estimation in this paper, due to its simplicity and its attractiveness for incorporation within a gradient descent framework. Other estimation methods could be used instead; however, all estimation methods for LID are known to degrade in performance as the dimensionality increases. Moreover, our estimation of LID is based on nearest neighbor sets computed within a mini-batch. This choice is made for feasibility of computation, but it entails a reduction in accuracy compared to using nearest neighbors computed from the whole dataset. In future work, one might explore other estimation methods and the trade-offs between estimation accuracy and computation time.

Based on existing works, LDReg assumes that higher dimensionality is desirable for SSL. LDReg relies on a hyperparameter β to adjust the strength of the regularization term. The theory developed in this work allows LDReg to target any desired dimensionality; however, the optimal dimensionalities for SSL depend on the dataset and loss function. Knowledge of the optimal dimensionality (if it could be determined) could be integrated into LDReg for best performance.

I IMPLEMENTATION DETAILS

For the PyTorch implementation, Garrido et al. (2023b) have discussed how popular open-source implementations of SimCLR (made compatible with DDP using gather) use slightly inaccurate gradients. The implementation in the VICReg (Bardes et al., 2022) codebase 5 is correct and should be used. We find that this slightly affects performance when reproducing SimCLR's results, and it also affects LDReg when estimating LIDs with DDP. For all of our experiments, we use the same implementation as Garrido et al. (2023b); Bardes et al. (2022).

Estimating LID requires computing pairwise distances; in PyTorch, the cdist function by default uses a matrix multiplication approach. For Nvidia Ampere or newer GPUs, TensorFloat-32 tensor cores should be disabled due to precision loss in the matrix multiplication. This precision loss can significantly affect the LID estimations.

5 https://github.com/facebookresearch/vicreg
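As a concrete illustration of this point (a sketch, not taken from the released code), TF32 can be disabled globally before computing pairwise distances:

import torch

# Disable TensorFloat-32 tensor cores on Ampere or newer GPUs so that cdist's
# matrix-multiplication path runs in full float32 precision.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

x = torch.randn(2048, 2048, device="cuda")
r = torch.cdist(x, x, p=2)   # pairwise distances used for LID estimation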
J PSEUDOCODE

Algorithm 1: Method of moments for LID estimation, PyTorch pseudocode.

import torch

def lid_mom_est(data, reference, k):
    # data: representations; reference: reference points
    # k: the number of nearest neighbours
    r = torch.cdist(data, reference, p=2)  # pairwise distances
    a, _ = torch.sort(r, dim=1)            # sorted distances; column 0 is the self-distance
    m = torch.mean(a[:, 1:k], dim=1)       # mu_k: mean of the nearest-neighbour distances
    lids = m / (a[:, k] - m)               # a[:, k] is w_k, the k-th nearest-neighbour distance
    return lids

Algorithm 2: LDReg, PyTorch pseudocode.

# f: representations
# k: the number of nearest neighbours
# beta: the hyperparameter beta
# loss: per-sample SSL loss (such as NT-Xent)
# reg_type: "l1" or "l2" (L1 or L2 loss)
lids = lid_mom_est(data=f, reference=f.detach(), k=k)
if reg_type == "l1":
    lid_reg = -torch.abs(torch.log(lids))
elif reg_type == "l2":
    lid_reg = -torch.sqrt(torch.square(torch.log(lids)))
total_loss = loss + beta * lid_reg
total_loss = total_loss.mean(dim=0)
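As a quick sanity check of Algorithm 1 (an illustration, not from the paper): for data sampled from a d-dimensional Gaussian, the LID equals d in the asymptotic limit, so the estimates should lie roughly in the vicinity of d when the neighborhood is sufficiently local.

import torch

torch.manual_seed(0)
d, n, k = 8, 10000, 128
x = torch.randn(n, d)                            # d-dimensional Gaussian sample
lids = lid_mom_est(data=x, reference=x, k=k)     # Algorithm 1 above
print(lids.mean().item())                        # typically close to d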