# Simplifying DINO via Coding Rate Regularization

Ziyang Wu 1, Jingyuan Zhang 2, Druv Pai 1, Xudong Wang 1, Chandan Singh 3, Jianwei Yang 3, Jianfeng Gao 3, Yi Ma 1 2 4

1UC Berkeley 2TranscEngram 3Microsoft Research 4HKU. Correspondence to: Ziyang Wu. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

DINO and DINOv2 are two model families that are widely used to learn representations from unlabeled imagery data at large scales. Their learned representations often enable state-of-the-art performance for downstream tasks, such as image classification and segmentation. However, they employ many empirically motivated design choices, and their training pipelines are highly complex and unstable: many hyperparameters need to be carefully tuned to ensure that the representations do not collapse, which poses considerable difficulty to improving them or adapting them to new domains. In this work, we posit that we can remove most of these empirically motivated idiosyncrasies in the pre-training pipelines, and only need to add an explicit coding rate term in the loss function to avoid collapse of the representations. As a result, we obtain highly simplified variants of DINO and DINOv2, which we call SimDINO and SimDINOv2, respectively. Remarkably, these simplified models are more robust to different design choices, such as network architecture and hyperparameters, and they learn even higher-quality representations, as measured by performance on downstream tasks, offering a Pareto improvement over the corresponding DINO and DINOv2 models. This work highlights the potential of using simplifying design principles to improve the empirical practice of deep learning. Code and model checkpoints are available at https://github.com/RobinWu218/SimDINO.

1. Introduction

Self-supervised learning (SSL) is the toolkit of choice to learn representations for large datasets of unlabeled images (Hadsell et al., 2006; Oord et al., 2018; Wu et al., 2018; Grill et al., 2020; He et al., 2020; Bardes et al., 2021; Chen & He, 2021; Caron et al., 2021; Zhou et al., 2021; He et al., 2022; Assran et al., 2023; Oquab et al., 2023), captioned images (Radford et al., 2021), videos (Feichtenhofer et al., 2022), and text (Radford et al., 2018; Devlin, 2018; Radford et al., 2019; Brown et al., 2020), among other modalities. In the context of image SSL, there are two main approaches: reconstructive (He et al., 2022), where the goal is to reconstruct some function of the true image data from a view, i.e., a corruption or augmentation, and contrastive (Hadsell et al., 2006), where the goal is, for each image, to have the features of different views of that image all be close, and the features of views of different images be far apart. Within contrastive SSL, a key challenge lies in preventing representation collapse, where models learn trivial solutions that map all inputs to the same output. One common approach to address this is through the use of negative samples, which explicitly encourages representations of different images to be dissimilar. Thus far, the success of using negative samples depends on having a large batch size (Wu et al., 2018; He et al., 2020), which poses computational challenges at scale.
Methods which attempt to avoid this bottleneck by using negative samples in more implicit and indirect ways to avoid collapse (Caron et al., 2021) can cope with smaller batch sizes, but often require training pipelines with many components and hyperparameters carefully tuned to avoid collapse, making them difficult to train. The state of the art for image SSL is generally considered to be the DINOv2 model family (Oquab et al., 2023), which is built on the DINO model family (Caron et al., 2021). Both classes of models are trained using contrastive SSL and thus run into the representation collapse issue. While DINOv2 explicitly and directly uses negative samples to avoid collapse, it inherits much of its training pipeline from DINO, which uses negative samples more indirectly. As such, both model families' training pipelines are highly complex and unstable, requiring many tweaks and careful hyperparameter selection in order for the training to converge for a given architecture. Despite this capriciousness, the trained models' representations are highly useful for downstream tasks and are widely used (Baharoon et al., 2023; Wei et al., 2024).

[Figure 1: schematic diagrams of the (a) DINO, (b) SimDINO, (c) DINOv2, and (d) SimDINOv2 training pipelines, showing the crop/tokenize, masking, DINO/iBOT head, centering, softmax, and EMA components of each.]

Figure 1. The DINO and DINOv2 pipelines are substantially simplified to the respective SimDINO and SimDINOv2 pipelines. (a) In the DINO pipeline, an input image is turned into patches. Then a global view vg and a local view vc are randomly sampled. The global view is pushed through the teacher encoder, while the other view is pushed through the student encoder. (b) The SimDINO pipeline removes the need for expensive post-processing operations present in DINO, such as a dimension-increasing linear layer and a high-dimensional softmax. (c) The DINOv2 pipeline adds masking of patches and an additional loss on image patch features to the DINO pipeline. (d) The SimDINOv2 training operates directly on the learned representations, simplifying the pipeline.

Our contributions. In this work, we remove many tweaks and hyperparameters from the DINO and DINOv2 training pipelines, replacing them with a term in the objective which explicitly uses negative samples. We show empirically that this term, which involves the total coding rate regularizer (Ma et al., 2007; Yu et al., 2020; Li et al., 2022), enables much simpler, more robust, and more computationally efficient training pipelines, as shown in Figure 1. We show that the resulting models, named SimDINO and SimDINOv2, learn representations that achieve even better state-of-the-art performance than those learned by DINO and DINOv2 across a variety of downstream tasks. Our work underscores the value of understanding and simplifying pipelines to improve performance in vision SSL.

Notation. Let $C, H, W, D, N, d \geq 1$ be positive integers. For a set $A$, let the space of finite sequences of elements of $A$ be denoted $A^{*} = \bigcup_{t=1}^{\infty} A^{t}$. Our data will be images $X \in \mathbb{R}^{C \times H \times W}$.
We consider different augmentations, or views, of the input data X, such as rotations or crops; we can represent a view as a function v: RC H W RD Nv where Nv is the number of tokens in the view. By an abuse of vocabulary we also call v(X) RD Nv a view. Let Sd 1 Rd be the (d 1)-dimensional ℓ2-sphere. For the purpose of representation learning, we will consider an encoder neural network parameterized by weights θ Θ (i.e., the weight space), as a function fθ : (RD) Sd 1 S d 1. We factor fθ = (f cls θ , f patch θ ) where f cls θ : RD Sd 1 outputs the so-called class token feature (i.e., an aggregate representation of the input patches) and f patch θ : (RD) S d 1 outputs the patch tokens fea- tures (i.e., a representation for each input patch). The network is a Vision Transformer (Dosovitskiy, 2020; Touvron et al., 2021) with appended multi-layer perceptrons (MLPs) to post-process each feature followed by ℓ2-normalizations. 2. Methods: Simplifying DINO and DINOv2 2.1. Recap of the Original DINO Pipeline The goal of DINO is to learn an aggregate representation of the input image which contains information about largescale semantics of the input (e.g., the locations and properties of different objects in the image). They do this via a pre-training pipeline (Caron et al., 2021) which is depicted in Figure 1(a), and we also describe it throughout this section. The main idea is to take multiple views (i.e., different crops) of the data, and ensure that the features generated by these views are consistent with each other (in a sense which will be made precise shortly) as much as possible. If the views each contain a salient part of the input such as a central object, the feature corresponding to any view would then contain information about this central object. The end goal is that the feature of any large-enough view contains information about all relevant objects in the input image, which can then be extracted for use in downstream tasks such as image classification or image segmentation. In the rest of the section, we will discuss the pre-training pipeline. As is common in contrastive SSL, the DINO framework uses two networks: a so-called teacher network parameterized by θt Θ, and a so-called student network parameterized by θs Θ. During pre-training, the loss Simplifying DINO via Coding Rate Regularization will encourage the student s representation to align with the teacher s representation, even as the teacher is simultaneously updated using student weights; this is self-distillation, and can be viewed as an optimization strategy or even implicitly regularizing the objective (Chen & He, 2021). In the pipeline, we process each image X in the following way. First, we sample at random a view, or crop, vc, independently of X; the view can either be a global view (i.e., a large crop) or a local view (i.e., small crop). We denote Xc := vc(X) (RD Nloc RD Nglo). We also sample a global view vg, and denote Xg := vg(X) RD Nglo.1 Views are implemented in the same way as in DINO; they are formally described in Appendix A for completeness. The (local or global) view Xc is fed to the student network2 f cls θs to get zcls s (Xc) Rd, while the global view Xg is fed to the teacher network f cls θt to get zcls t (Xg) Rd, i.e.: zcls s (Xc) := f cls θs (Xc), zcls t (Xg) := f cls θt (Xg). (1) Now, it is certainly possible to directly compare and evaluate these features. 
However, DINO adds post-processing steps, arguing that they improve performance and prevent collapse: They add weight-normalized linear layers (Salimans & Kingma, 2016) hηDINO s , hηDINO t : Rd Rm where m d, called the DINO heads and parameterized by ηDINO s , ηDINO t , appended to the end of the student and teacher networks respectively. They center the teacher-computed features using a learned vector µ Rm. They take a temperature-weighted softmax of both features to compute probability vectors in Rm, sometimes called prototype scores, which they then can compare using cross-entropy. Mathematically, the post-processing steps to get probability vectors for each view are as follows: pcls s (Xc) := softmax(hηDINO s (zcls s (Xc))/τs) (2) pcls t (Xg) := softmax([hηDINO t (zcls t (Xg)) µ]/τt) (3) where τs, τt > 0 are temperature parameters for the student and teacher respectively. Finally, the loss (to be minimized) encourages pcls s (Xc) and pcls t (Xg) to be close using a crossentropy functional d CE, which effectively distills the teacher into the student by aligning the predicted outputs: LDINO := E[d CE(pcls t (Xg), pcls s (Xc))] (4) 1More precisely, let c be a random vector containing the boundaries of the crop, so that vc crops exactly the region supplied by c. Analogous notation can be defined for g and vg. 2Note that the parameters θs and θt each contain a positional encoding over all patches; when a view is fed through the network, it receives an interpolated positional encoding of the view s crop. where the expectation is over X, the (local or global) view vc, and the global view vg, and the function d CE is defined via the cross-entropy as d CE(p, q) := i=1 pi log qi. (5) When training, DINO estimates the expectation in (4) by a stratified plug-in estimator over a batch of sample images. That is, to estimate the expectation, we condition on X then estimate the conditional expectation E[d CE( , ) | X] via plug-in using several different global views (usually two global views, which play the role of the arbitrary view vc and the global view vg) and several different local views, and finally average over X to obtain the estimate. The optimization of this estimated loss, too, is done in an ad-hoc way; while all five parameters θs, θt, ηDINO s , ηDINO t , µ are updated at each iteration, they update in different ways: The student parameters θs and ηDINO s are updated via an iteration of a stochastic gradient descent (SGD)-type algorithm, such as Adam, on the loss (4). The backpropagation for the loss gradient is computed assuming the teacher parameters θt, ηDINO t , and µ are frozen or constants (i.e., stop-gradient ). The teacher parameters θt, ηDINO t , and µ are updated via exponentially moving averages (EMAs) of the student weights θs, the student DINO head ηDINO s , and the average output of teacher the DINO head E[hηDINO t (zcls t (Xg))] (in practice estimated over a minibatch), respectively. Formally, for decay parameters λ, ν [0, 1], at each iteration we compute θt λθt + (1 λ)θs, ηDINO t ληDINO t + (1 λ)ηDINO s , and µ νµ + (1 ν) E[hηDINO t (zcls t (Xg))]. The decay parameters λ, ν and the temperature parameter τ change along the optimization trajectory, and their schedules are design decisions which impact convergence. 
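To make these steps concrete, the following is a minimal PyTorch-style sketch of the DINO head, centering, tempered softmax, cross-entropy loss, and EMA updates described above. The function and variable names, and the particular temperature and decay values, are illustrative placeholders rather than the authors' implementation.

```python
# Minimal PyTorch-style sketch of DINO's post-processing and update rules (illustrative only).
import torch
import torch.nn.functional as F

def dino_loss(z_s, z_t, head_s, head_t, center, tau_s=0.1, tau_t=0.04):
    """z_s: student cls features of a (local or global) view, shape (B, d).
    z_t: teacher cls features of a global view, shape (B, d).
    head_s / head_t: weight-normalized linear "DINO heads" mapping d -> m."""
    log_p_s = F.log_softmax(head_s(z_s) / tau_s, dim=-1)              # student prototype scores, eq. (2)
    p_t = F.softmax((head_t(z_t) - center) / tau_t, dim=-1).detach()  # centered teacher scores, eq. (3); stop-gradient
    return -(p_t * log_p_s).sum(dim=-1).mean()                        # cross-entropy d_CE, eqs. (4)-(5)

@torch.no_grad()
def ema_updates(student, teacher, head_s, head_t, center, teacher_logits, lam=0.996, nu=0.9):
    """Teacher backbone/head track the student; the centering vector tracks the mean teacher output."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(lam).add_((1 - lam) * ps)
    for pt, ps in zip(head_t.parameters(), head_s.parameters()):
        pt.mul_(lam).add_((1 - lam) * ps)
    center.mul_(nu).add_((1 - nu) * teacher_logits.mean(dim=0))
```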
As previously mentioned, many of the ad-hoc methods and choices described above are due to a tension: a trivial solution to optimizing (4) is to enforce that $f_{\theta_s}$ and $f_{\theta_t}$ collapse, i.e., become or approximate a constant function, mapping each local and global view to the same feature $z$, or even to the same probability vector $p$. To explain why DINO does not collapse, we wish to highlight the centering operation in (3), which computes batch statistics during its EMA update, hence using negative samples and implicitly pushing different samples' features apart, even though the precise conceptual mechanism by which this occurs is not clear and involves a careful interaction between the centering vector and temperature scaling (Caron et al., 2021). Indeed, Caron et al. (2021) shows that collapsed solutions are common without very careful tuning of the EMA and temperature schedules, and argues that perturbing the remaining hyperparameters and choices would severely degrade performance. A more in-depth discussion of this tension, and the added complexity required to train a model in spite of it, is in Appendix B. As we will see, if this tension is alleviated in an alternative way, many hyperparameters can be removed and the rest can be changed robustly.

2.2. From DINO to SimDINO

To go from DINO to SimDINO, we ask the question: Can we just compare $z^{\mathrm{cls}}_s(X_c)$ and $z^{\mathrm{cls}}_t(X_g)$? If we could do this, then we could avoid the large DINO head, the centering operation, the softmaxes, and the cross-entropy-based loss. However, the mechanism in DINO for avoiding representation collapse via negative samples would thereby be removed. Thus, we have a second question: Can we efficiently use the negative samples' features explicitly to enforce non-collapse?

For the first question, we argue that the simplest squared Euclidean distance, namely
$$d_{\ell^2}(x, y) := \tfrac{1}{2}\,\lVert x - y \rVert_2^2 \quad (6)$$
works at least as well as the cross-entropy-based functional (5) applied to an affine transformation of the features, as in (4). For the second question, we argue that we may directly penalize the covariance of the features in order to avoid collapse, as follows. For a hyperparameter $\varepsilon > 0$, the (total) coding rate (Ma et al., 2007; Yu et al., 2020; Li et al., 2022) of a symmetric positive semidefinite matrix $\Gamma \in \mathbb{R}^{d \times d}$ is
$$R_{\varepsilon}(\Gamma) := \tfrac{1}{2}\,\log\det\!\Big(I + \tfrac{d}{\varepsilon^2}\,\Gamma\Big). \quad (7)$$
In words, $R_{\varepsilon}$ is an approximation to the rate distortion with quantization error $\varepsilon$ of a Gaussian random variable with covariance $\Gamma$ (and this approximation is perfect in the limit $\varepsilon \to 0$). More concretely, it is a measure of the size of the covariance, even if the underlying variables are non-Gaussian. Thus one way to ensure non-collapse is to add $R_{\varepsilon}(\mathrm{Cov}[z^{\mathrm{cls}}_s(X_c) \mid v_c \in V_{\mathrm{glo}}])$ as a regularizer (where $v_c \in V_{\mathrm{glo}}$ means that $v_c$ is a global view),3 yielding the loss
$$\mathcal{L}_{\mathrm{SimDINO}} := \mathbb{E}\big[d_{\ell^2}(z^{\mathrm{cls}}_t(X_g),\, z^{\mathrm{cls}}_s(X_c))\big] - \gamma\, R_{\varepsilon}\big(\mathrm{Cov}[z^{\mathrm{cls}}_s(X_c) \mid v_c \in V_{\mathrm{glo}}]\big), \quad (8)$$
where $\gamma > 0$ is a hyperparameter. Note that $d_{\ell^2}(z^{\mathrm{cls}}_t, z^{\mathrm{cls}}_s) = 1 - (z^{\mathrm{cls}}_s)^{\top} z^{\mathrm{cls}}_t$ since $z^{\mathrm{cls}}_s, z^{\mathrm{cls}}_t \in \mathbb{S}^{d-1}$.

3We only use global views' features for the sake of efficiency. If $d_{\ell^2}$ in (8) is small, then the local and global views' student features are close, since they are both close to the global views' teacher features, so $\mathrm{Cov}[z^{\mathrm{cls}}_s(X_c)] \approx \mathrm{Cov}[z^{\mathrm{cls}}_s(X_c) \mid v_c \in V_{\mathrm{glo}}]$.

When training, similarly to DINO, we estimate the expectation and covariance in (8) by a type of plug-in estimator. Namely, the expectation is estimated as in DINO, just using $d_{\ell^2}$ instead of $d_{\mathrm{CE}}$; a compact sketch of computing the coding rate term on a batch of features is given below.
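The coding rate term in (7)-(8) reduces to the log-determinant of a $d \times d$ matrix computed from a batch of features. Below is a compact sketch mirroring the pipeline pseudocode in Algorithm 1 (Appendix D); the default values of eps and gamma and the helper names are illustrative, not prescriptive.

```python
# Sketch of the coding rate regularizer R_eps (eq. 7) and the SimDINO loss (eq. 8); illustrative only.
import torch

def coding_rate(z, eps=0.5):
    """z: (B, d) unit-norm student cls features of global views.
    Returns 0.5 * logdet(I + (d / eps^2) * Cov), with Cov the (uncentered) sample covariance."""
    B, d = z.shape
    cov = z.T @ z / B                                  # (d, d) sample covariance
    return 0.5 * torch.logdet(torch.eye(d, device=z.device) + (d / eps**2) * cov)

def simdino_loss(z_s, z_t, z_s_glo, gamma=1.0, eps=0.5):
    """z_s, z_t: matched student/teacher cls features, both (N, d), assumed unit-norm.
    z_s_glo: student cls features of global views only, (B, d)."""
    align = 0.5 * ((z_s - z_t) ** 2).sum(dim=-1).mean()   # d_l2 alignment term of eq. (8)
    return align - gamma * coding_rate(z_s_glo, eps)      # subtracting R_eps penalizes collapse
```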
To estimate the coding rate, we sub-sample several zcls s (Xc) over both X and vc, estimate Cov[zcls s (Xc) | vc Vglo] on that sub-sample via plug-in, estimate Rε of the population covariance by calculating it on the sample covariance, then average the estimates over all sub-samples. We conjecture that the latter estimator has lower variance compared to the naive plug-in estimator for Cov[zcls s (Xc) | vc Vglo] as it is similar to variancereduction methods in statistics (Kahn & Marshall, 1953), which we hypothesize might be a factor as to why Sim DINO can handle a smaller batch size than other contrastive SSL methods that explicitly use negative samples but avoid collapse using higher-variance or more implicit regularizers. The overall pipeline is shown in Figure 1(b). Note that it is much simpler than DINO. We provide pseudocode for the training pipeline in Algorithm 1 in Appendix D. After training, we use the teacher network for evaluation. 2.3. From DINOv2 to Sim DINOv2 The pipeline of the DINOv2 framework (Oquab et al., 2023), as shown in Figure 1(c), is built upon the DINO pipeline, and has two main goals: first, learn an aggregate representation which contains large-scale semantics of the input (i.e., the goal of DINO); second, learn patch-based representations which have fine-grained semantic information about each patch and its local neighborhood. The main new ideas to achieve this, drawn from the i BOT pipeline (Zhou et al., 2021), are that the input to the student has some masked patches, and that the loss also computes similarity of the patch-based features. To see why this works, consider if some patches are masked, and the model is able to predict masked patches using their unmasked neighbors, then from each patch the model can extract strong information about the semantics of nearby patches, which is an idea similar in spirit to masked autoencoding (He et al., 2022). Thus, these two ideas from i BOT would furnish our model with informative patch-based representations. We now discuss the DINOv2 pipeline, before discussing our modifications. While we have the same ( base ) views vc and vg as before, we also consider a masked view vmc, which computes vc but, if vc Vglo, subsequently replaces a fraction α [0, 1] of the tokens in the view output with a learnable mask token xmask (as in (He et al., 2022), the mask token is shared across all views). Similarly to previous notation, we denote Xmc := vmc(X) (RD Nloc RD Nglo). Now that we have this setup, we do similar operations to Simplifying DINO via Coding Rate Regularization DINO pipeline, with some changes: There are additional i BOT heads for the student and teacher, processing the patch-based features columnwise (i.e., patch-wise), with weights ηi BOT s , ηi BOT t (cf the DINO head with weights ηDINO s , ηDINO t ). The centering operation on teacher-output features is performed on both the aggregate features and (columnwise) on the patch-wise features. The centering operation uses three iterations of the Sinkhorn-Knopp algorithm (Cuturi, 2013; Caron et al., 2020), denoted below by SKC, instead of an EMA, and is parameter-free but more expensive than simple subtraction. Note that the Sinkhorn-Knopp algorithm uses features from all images in each minibatch. Let Zpatch and P patch be the patch-wise representations and prototype scores respectively, and let zi Rd be the ith column of Zpatch (and similar for pi P patch). 
Then we have (where $1 \leq i \leq N_{\mathrm{glo}}$):
$$(z^{\mathrm{cls}}_s(X_{mc}),\, Z^{\mathrm{patch}}_s(X_{mc})) := f_{\theta_s}(X_{mc}) \quad (9)$$
$$(z^{\mathrm{cls}}_t(X_g),\, Z^{\mathrm{patch}}_t(X_g)) := f_{\theta_t}(X_g) \quad (10)$$
$$p^{\mathrm{cls}}_s(X_{mc}) := \mathrm{softmax}\big(h_{\eta^{\mathrm{DINO}}_s}(z^{\mathrm{cls}}_s(X_{mc}))/\tau_s\big) \quad (11)$$
$$p^{i}_s(X_{mc}) := \mathrm{softmax}\big(h_{\eta^{\mathrm{iBOT}}_s}(z^{i}_s(X_{mc}))/\tau_s\big) \quad (12)$$
$$p^{\mathrm{cls}}_t(X_g) := \mathrm{softmax}\big(\mathrm{SKC}[h_{\eta^{\mathrm{DINO}}_t}(z^{\mathrm{cls}}_t(X_g))]/\tau_t\big) \quad (13)$$
$$p^{i}_t(X_g) := \mathrm{softmax}\big(\mathrm{SKC}[h_{\eta^{\mathrm{iBOT}}_t}(z^{i}_t(X_g))]/\tau_t\big). \quad (14)$$
We then compute the loss using all such probability vectors:
$$\mathcal{L}_{\mathrm{DINOv2}} := \tfrac{1}{2}\,\mathbb{E}\big[d_{\mathrm{CE}}(p^{\mathrm{cls}}_t(X_g),\, p^{\mathrm{cls}}_s(X_{mc}))\big] + \tfrac{1}{2}\,\mathbb{E}\Big[\textstyle\sum_{i=1}^{N_{\mathrm{glo}}} d_{\mathrm{CE}}(p^{i}_t(X_g),\, p^{i}_s(X_{mc}))\,\mathbb{1}_{i,mc}\Big] - \gamma\,\mathrm{Entropy}\big(z^{\mathrm{cls}}_s(X_{mc}) \mid v_c \in V_{\mathrm{glo}}\big), \quad (15)$$
where $\mathbb{1}_{i,mc}$ is 1 if patch $i$ of $X_c$ is masked by $v_{mc}$ and 0 otherwise, and the Entropy functional is the differential entropy; it plays a similar role as the coding rate $R_{\varepsilon}$ in SimDINO (and shortly SimDINOv2) in ensuring non-collapse. It is estimated by Oquab et al. (2023) using the KoLeo estimator (Delattre & Fournier, 2017), which explicitly uses negative samples. However, the KoLeo estimator is a non-parametric estimator of the expectation of a function of a high-dimensional probability density (Beirlant et al., 1997), and so it has relatively poor sample efficiency (i.e., the batch size required to converge in practice is large).

We now greatly simplify the above pipeline using the same ideas as introduced in SimDINO. Namely, we dispense with the DINO/iBOT heads, the Sinkhorn-Knopp centering, and the softmaxes, and compute the Euclidean-distance-based loss directly on normalized features. We obtain the loss
$$\mathcal{L}_{\mathrm{SimDINOv2}} := \tfrac{1}{2}\,\mathbb{E}\big[d_{\ell^2}(z^{\mathrm{cls}}_t(X_g),\, z^{\mathrm{cls}}_s(X_{mc}))\big] + \tfrac{1}{2}\,\mathbb{E}\Big[\textstyle\sum_{i=1}^{N_{\mathrm{glo}}} d_{\ell^2}(z^{i}_t(X_g),\, z^{i}_s(X_{mc}))\,\mathbb{1}_{i,mc}\Big] - \gamma\, R_{\varepsilon}\big(\mathrm{Cov}[z^{\mathrm{cls}}_s(X_{mc}) \mid v_c \in V_{\mathrm{glo}}]\big). \quad (16)$$
The same caveats as in SimDINO apply with respect to how the expectations and covariances are estimated, and the optimization and evaluation procedures carry over. We provide pseudocode for the training pipeline in Algorithm 2 in Appendix D. In the sequel, we will show that these greatly simplified designs actually help the model's performance.

Optimal value for γ. In both the SimDINO loss (8) and the SimDINOv2 loss (16), in order to aid learning while making sure neither the distance term nor the regularizer term dominates, we choose $\gamma$, up to an absolute constant factor, so that it balances the asymptotic order of the gradient (Frobenius) norms of the two terms. By the Cauchy-Schwarz inequality, it suffices to equalize the norms of the gradients of each term with respect to the features $Z$. Since the features are normalized on the sphere, the gradient (Frobenius) norm of the distance term is $O(1)$. For the second term, assuming a batch size $B$, the gradient norm is $O(\sqrt{d \min\{d, B\}/B})$. To make these equivalent, we take $\gamma = \Theta(\sqrt{B/(d \min\{d, B\})})$; for instance, with feature dimension $d = 256$ and a hypothetical batch size $B = 1024$, this prescribes $\gamma = \Theta(\sqrt{1024/(256 \cdot 256)}) = \Theta(1/8)$. The same rate holds for SimDINOv2. While this choice of $\gamma$ is ultimately a heuristic, and the constant factor needs to be tuned, it helps to scale SimDINO and SimDINOv2 in practice. Formal calculations, including a prescriptive choice for $\gamma$ taking all parameters into account, are provided in Theorems C.1 and C.2.

3. Experimental Verification

In this section, we empirically investigate and evaluate our proposed SimDINO and SimDINOv2 models and compare them to the original DINO and DINOv2 model families. In particular, we examine their differences in training dynamics and learned representations, both quantitatively and qualitatively.
Overall, our experiments show that our proposed Sim DINO model families can achieve better performance and learn representations of higher quality than the original DINO families while being significantly simpler and more robust to variations in hyperparameters and architecture. 3.1. Experimental Setup Model architecture. Since our method is directly built upon DINO and DINOv2, we adopt settings as close as possible to the original method for fair comparison. Specifically, for all inputs we set patch size to be 16; we use the small, base, and large models of the Vi T (Dosovitskiy, 2020) architecture as the backbone, which is connected to a projector Simplifying DINO via Coding Rate Regularization composed of three MLP layers with a hidden size of 2048 and an output dimension of 256. The output features after the projector are ℓ2 normalized. Specifically for original (i.e., unsimplified) DINO models, these normalized features are then fed to a weight-normalized linear layer that outputs a high-dimensional (e.g., 65536) vector, before computing the softmax and then the cross-entropy loss. Datasets and optimization. For pretraining, we use the Image Net-1K dataset across all methods. For fair comparison, we closely follow the original works (Caron et al., 2021; Oquab et al., 2023). We choose Adam W (Loshchilov, 2017) as the optimizer and adopt the same optimization strategies (e.g., learning rates, warm-up schedules). For multicrop augmentation, we use 10 local views of resolution 96 96 and 2 global views of resolution 224 224 for all experiments. We provide more details on hyperparameter choices in Appendix E. We also consider several downstream tasks. Specifically, we evaluate our pretrained models on 1) unsupervised object detection and segmentation on COCO val2017 (Lin et al., 2014), 2) semantic segmentation on ADE20K (Zhou et al., 2017), and 3) video object segmentation on DAVIS-2017 (Pont-Tuset et al., 2017). 3.2. Experimental Results Image Net Classification. We report the classification accuracies on Image Net-1k in Table 1. Following (Caron et al., 2021), we evaluate both k-NN and linear accuracy on the Vi T backbones pretrained by the DINO model families and our simplified variants. We observe that under both DINO and DINOv2 paradigms, our simplified methods are able to outperform the original pipelines. Furthermore, we observe that applying identical hyperparameter settings from Vi T-B to Vi T-L results in instability and divergence in DINO, while the same setup yields a steady improvement for Sim DINO. To better understand the optimization dynamics of Sim DINO, we visualize the evolution of accuracy during training in Figure 2. It can be observed that performance of Sim DINO steadily improves as training progresses, while optimization of DINO noticeably slows down, with even a slight performance drop near the end of training. Together, these results demonstrate our simplified pipelines stability and ease of optimization compared to the originals. Object Detection and Segmentation. To better understand the learned representation, we evaluate the pretrained models on segmentation and object detection tasks. Specifically, we adopt Mask Cut (Wang et al., 2023), an effective unsupervised approach of extracting features from a frozen vision backbone for object detection and instance segmentation. In Figure 3, we present qualitative segmentation results by applying Mask Cut on models trained with both DINO and Sim DINO. 
Both methods are observed to produce meaningful segmentation results, confirming the emerging properties similar to the original DINO when using our simplified algorithm. More qualitative results are available in Appendix F.6.

[Figure 2: k-NN accuracy (%) on ImageNet-1K vs. training epoch for DINO and SimDINO.]

Figure 2. Evolution of k-NN accuracy of ViT-B trained for 100 epochs using the DINO and SimDINO paradigms on ImageNet-1K. We omit earlier epochs with similar metrics for better visual clarity.

Table 1. Performance comparison on ImageNet-1K. SimDINO and SimDINOv2 consistently outperform the original DINO and DINOv2 model families. They are also more stable: training of DINO on ViT-L diverged (row 3).

| Method | Model | Epochs | k-NN | Linear |
|---|---|---|---|---|
| DINO | ViT-B | 100 | 72.9 | 76.3 |
| SimDINO | ViT-B | 100 | 74.9 | 77.3 |
| DINO | ViT-L | 100 | (diverged) | (diverged) |
| SimDINO | ViT-L | 100 | 75.6 | 77.4 |
| DINOv2 | ViT-B | 100 | 76.0 | 77.2 |
| SimDINOv2 | ViT-B | 100 | 78.1 | 79.7 |
| DINOv2 | ViT-L | 100 | 80.8 | 82.0 |
| SimDINOv2 | ViT-L | 100 | 81.1 | 82.4 |
| SwAV | ViT-S | 800 | 66.3 | 73.5 |
| MoCov3 | ViT-B | 300 | | 76.7 |

To quantitatively evaluate these representations, we perform MaskCut on the COCO val2017 dataset and report our results in Table 2. These results show that SimDINO achieves much stronger performance on segmentation and detection tasks than DINO when trained with the same network (row 2 vs. 3), and overall even outperforms DINO trained with a smaller patch size4 (row 2 vs. 4).

4When trained using DINO, ViT models with smaller patch sizes tend to outperform those with larger ones on various tasks including segmentation (Wang et al., 2023; Caron et al., 2021).

Table 2. Unsupervised object detection and segmentation via MaskCut evaluated on COCO val2017 under COCO's official evaluation protocol. SimDINO conclusively performs better than DINO on detection and segmentation metrics, and is comparable with DINO trained with a smaller patch size (16 vs. 8).

| Method | Model | Det. AP50 | Det. AP75 | Det. AP | Seg. AP50 | Seg. AP75 | Seg. AP |
|---|---|---|---|---|---|---|---|
| SimDINO | ViT-L/16 | 5.4 | 1.9 | 2.4 | 4.5 | 1.4 | 1.9 |
| SimDINO | ViT-B/16 | 5.2 | 2.0 | 2.5 | 4.7 | 1.5 | 2.0 |
| DINO | ViT-B/16 | 3.9 | 1.5 | 1.8 | 3.1 | 1.0 | 1.4 |
| DINO | ViT-B/8 | 5.1 | 2.3 | 2.5 | 4.1 | 1.3 | 1.8 |

Semantic Segmentation on ADE20K. We evaluate our proposed methods on the ADE20K semantic segmentation task and report the results in Table 3 (columns 3 & 4). Specifically, we follow the linear evaluation protocol of (Zhou et al., 2021), where we fix the pretrained backbone and only finetune a linear layer on top of it. From the results, we observe that our proposed SimDINO consistently outperforms the original algorithms. In particular, on ViT-B, SimDINOv2 improves over DINOv2 by 4.4 mIoU. These results suggest that our simplified methods lead to representations favorable for dense prediction tasks.

DAVIS Video Object Segmentation. In Table 3, we also provide evaluation results on the DAVIS-2017 video instance segmentation benchmark. We follow the same evaluation protocol as in (Caron et al., 2021) and segment scenes between consecutive video frames with nearest-neighbor search. We observe that our proposed SimDINO(v2) outperforms the original methods on this task. One interesting observation is that, despite achieving much better k-NN accuracy, DINOv2 generally underperforms the original DINO on this task (and similarly for the simplified variants). A similar phenomenon is noted in (Zhou et al., 2021), where this discrepancy is found to be caused by the sensitivity of the evaluation protocol itself (e.g., image resolution).
In our evaluation, we do not tune these individual factors and simply adopt the same setting across all models we consider.

More on Stability and Robustness. Apart from the observed divergence on ViT-L in Table 1, we note that DINO is sensitive to its pipeline-specific hyperparameters, as evidenced in Table 6 (in Appendix F). To further verify the stability of SimDINO, we experiment with training both algorithms on a different dataset than ImageNet-1k. Specifically, we train them on COCO train2017 (roughly 1/10-th the size of ImageNet-1k) and report the results in Figure 4. Under this setting, SimDINO vastly outperforms DINO. We provide additional ablations on other factors (e.g., batch sizes) in Appendix F. Together, these results demonstrate the superior stability and robustness of SimDINO.

Table 3. Semantic segmentation on ADE20K and video object segmentation on DAVIS-2017. For semantic segmentation, we train a linear layer on the frozen pretrained backbone. On DAVIS, we segment scenes between video frames using nearest-neighbor search. On both tasks, SimDINO and SimDINOv2 consistently outperform their original counterparts.

| Method | Model | mIoU (Lin. Seg.) | mAcc (Lin. Seg.) | (J&F)m (Vid. Seg.) | Jm (Vid. Seg.) | Fm (Vid. Seg.) |
|---|---|---|---|---|---|---|
| DINO | ViT-B/16 | 33.1 | 41.9 | 63.0 | 61.5 | 64.4 |
| SimDINO | ViT-B/16 | 33.7 | 42.8 | 63.0 | 61.6 | 64.4 |
| DINOv2 | ViT-B/16 | 32.5 | 41.4 | 53.2 | 52.7 | 53.7 |
| SimDINOv2 | ViT-B/16 | 36.9 | 46.5 | 60.9 | 60.4 | 61.4 |
| DINOv2 | ViT-L/16 | 41.0 | 50.8 | 62.0 | 61.7 | 62.3 |
| SimDINOv2 | ViT-L/16 | 41.8 | 52.2 | 62.6 | 61.9 | 63.3 |

4. Related Work

In this section, we identify several previous works which the SimDINO and SimDINOv2 methodologies are similar to or build on. We have already discussed similarities to DINO and DINOv2 in depth, so we omit that comparison here.

Siamese contrastive SSL. Siamese contrastive learning, archetyped by SimCLR (Chen et al., 2020) and SimSiam (Chen & He, 2021) among others, uses the same network to encode different augmentations (i.e., views) of the same input, and pushes the features of these augmentations together, similar to SimDINO. SimCLR uses explicit negative samples in the loss, while SimSiam manipulates the loss gradient structure using stop-gradients to avoid collapse. Both methods' losses measure alignment or difference via the squared Euclidean distance (equivalently, the dot product) of the features. In contrast, SimDINO uses two separate networks, the teacher and the student, that update via self-distillation. Furthermore, SimDINO uses the inner product of features in the loss, but it also uses a coding rate regularizer instead of implicitly contrasting negative samples or using the more bespoke contrastive loss of SimCLR.

Explicit covariance regularization in SSL. There have also been works that use explicit penalization of the first- and second-order statistics of the features, such as VICReg (Bardes et al., 2021). VICReg uses completely separate networks to encode two augmentations of the same input batch, and then explicitly penalizes the alignment of those features (via Euclidean distance) as well as the features' variance and covariance within the batch, aiming to whiten the features as much as possible. In spirit, this is similar to SimDINO, which also penalizes the alignment and the features' covariance, albeit using a different covariance regularizer and not penalizing the features' variance. Also, SimDINO uses self-distillation to train the teacher network, while VICReg uses two separate networks.

Self-distillation in SSL.
Several works such as Mo Co (He et al., 2020) and BYOL (Grill et al., 2020) train two networks, a teacher and a student, via self-distillation by setting the teacher weights to be an exponential moving average of the student weights. While Mo Co uses explicit negative samples from previous batches in its Info NCE loss computed on a given batch, BYOL does not use negative samples but instead manipulates the gradient structure (akin to Sim Siam) in order to prevent collapse, and it uses an extra ( prediction ) module appended to the student network, Simplifying DINO via Coding Rate Regularization Figure 3. Visualization of Mask Cut segmentation results from DINO Vi T-B/16 (row 1), Sim DINO Vi T-B/16 (row 2) and Sim DINO Vi T-L/16 (row 3) on selected images. 10 20 30 40 50 60 70 80 90 100 Epoch k-NN Accuracy (%) DINO Sim DINO Figure 4. k-NN accuracy on Image Net-1K of Vi T-B trained on COCO train2017 using DINO and Sim DINO paradigms. making the teacher and student asymmetric. Sim DINO uses self-distillation with the same architecture for teacher and student, explicitly uses the simple Euclidean distance in the loss, and explicitly uses the coding rate to prevent collapse. Patch feature prediction in SSL. While most contrastive SSL methods pick a single feature vector (say, of the cls token) as the representation, recent contrastive learning approaches such as DINOv2, I-JEPA (Assran et al., 2023), and C-JEPA (Mo & Tong, 2024) compute losses on the features corresponding to each patch. In I-JEPA, there is one local and one global view, whose crops are nested, and the (Euclidean distance) loss is only computed on the patch features. C-JEPA adds a VICReg-esque variance and covariance penalty to the objective of I-JEPA. In contrast, in Sim DINOv2, there are multiple local and global views, the loss incorporates both patch-based and aggregate features, and collapse is prevented by using a coding rate term. Coding rate, and related regularizers. Several works have used coding rate-related terms in the objective (Ma et al., 2007; Yu et al., 2020; Dai et al., 2022; Tong et al., 2022) as well as a way to evaluate quality of representations (Yu et al., 2023; Pai et al., 2023; Wu et al., 2024; Yang et al., 2024). The coding rate has thus been shown to provide a powerful measure for non-collapse or expansion of the features from a given batch. Other regularizers to accomplish this include the VICReg-type regularizers and the MMCR regularizer (Yerxa et al., 2023; Schaeffer et al., 2024). 5. Conclusion In this work, we identify that the reasons for many empirically motivated design choices in the original DINO and DINOv2 are to avoid collapse of the learned representation. We show that these complicated design choices can be significantly reduced or simplified by adding a coding-raterelated regularization term. The resulting simplified models, called Sim DINO and Sim DINOv2, are even better in terms of performance for downstream tasks, and their pretraining pipelines are much more robust to different settings and hyperparameters, offering a Pareto improvement against the DINO and DINOv2 model families. Our work demonstrates the value of simplifying deep learning pipelines as well as making tradeoffs as explicit as possible when designing high-performance vision SSL models. In light of these overarching contributions, there are several possible opportunities for future work. 
On the theoretical side, our simplified framework provides an entry point for studying the geometric properties of the global optima of self-supervised learning losses. Further study in Appendix F.4 shows that in the framework of the paper, it is possible to set up a self-supervised objective that does not require self-distillation to optimize, making a theoretical analysis much easier, while the resulting model is still quite powerful and practically usable. On the empirical side, one can apply the paradigm of making implicit design choices more explicitly present in the loss to more self-supervised learning frameworks, making existing pipelines more stable and the resulting models of better performance. Simplifying DINO via Coding Rate Regularization Impact Statement This paper aims to advance the field of Machine Learning. Since DINO and DINO-2 are very popular methods for unsupervised learning of visual representations, we believe that the methods presented in this work will significantly advance the practice of visual representation learning. Except for values to academics, we do not anticipate any significant social or ethical implications. Acknowledgment Yi Ma would like to acknowledge support from the joint Simons Foundation-NSF DMS grant #2031899, the ONR grant N00014-22-1-2102, the NSF grant #2402951, and also support from and the HKU startup, the Hong Kong Center for Construction Robotics Limited (HKCRC) Award 052245, and JC Club of Hong Kong. Druv Pai acknowledges support from a UC Berkeley College of Engineering Fellowship. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., Le Cun, Y., and Ballas, N. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15619 15629, 2023. Baharoon, M., Qureshi, W., Ouyang, J., Xu, Y., Phol, K., Aljouie, A., and Peng, W. Towards general purpose vision foundation models for medical image analysis: An experimental study of dinov2 on radiology benchmarks. ar Xiv preprint ar Xiv:2312.02366, 2023. Bardes, A., Ponce, J., and Le Cun, Y. Vicreg: Varianceinvariance-covariance regularization for self-supervised learning. ar Xiv preprint ar Xiv:2105.04906, 2021. Beirlant, J., Dudewicz, E. J., Gy orfi, L., Van der Meulen, E. C., et al. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17 39, 1997. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877 1901, 2020. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912 9924, 2020. Caron, M., Touvron, H., Misra, I., J egou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650 9660, 2021. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597 1607. PMLR, 2020. Chen, X. and He, K. Exploring simple siamese representation learning. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15750 15758, 2021. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013. Dai, X., Tong, S., Li, M., Wu, Z., Psenka, M., Chan, K. H. R., Zhai, P., Yu, Y., Yuan, X., Shum, H.-Y., et al. Ctrl: Closed-loop transcription to an ldr via minimaxing rate reduction. Entropy, 24(4):456, 2022. Delattre, S. and Fournier, N. On the kozachenko leonenko entropy estimator. Journal of Statistical Planning and Inference, 185:69 93, 2017. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020. Feichtenhofer, C., Li, Y., He, K., et al. Masked autoencoders as spatiotemporal learners. Advances in neural information processing systems, 35:35946 35958, 2022. Grill, J.-B., Strub, F., Altch e, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271 21284, 2020. Hadsell, R., Chopra, S., and Le Cun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR 06), volume 2, pp. 1735 1742. IEEE, 2006. He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729 9738, 2020. Simplifying DINO via Coding Rate Regularization He, K., Chen, X., Xie, S., Li, Y., Doll ar, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000 16009, 2022. Kahn, H. and Marshall, A. W. Methods of reducing sample size in monte carlo computations. Journal of the Operations Research Society of America, 1(5):263 278, 1953. Li, Z., Chen, Y., Le Cun, Y., and Sommer, F. T. Neural manifold clustering and embedding. ar Xiv preprint ar Xiv:2201.10000, 2022. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll ar, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer Vision ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740 755. Springer, 2014. Loshchilov, I. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017. Ma, Y., Derksen, H., Hong, W., and Wright, J. Segmentation of multivariate mixed data via lossy data coding and compression. IEEE transactions on pattern analysis and machine intelligence, 29(9):1546 1562, 2007. Mo, S. and Tong, S. Connecting joint-embedding predictive architecture with contrastive self-supervised learning. ar Xiv preprint ar Xiv:2410.19560, 2024. Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El Nouby, A., et al. Dinov2: Learning robust visual features without supervision. ar Xiv preprint ar Xiv:2304.07193, 2023. Pai, D., Wu, Z. W., Buchanan, S., Yu, Y., and Ma, Y. 
Masked completion via structured diffusion with white-box transformers. International Conference on Learning Representations, 2023. Pont-Tuset, J., Perazzi, F., Caelles, S., Arbel aez, P., Sorkine Hornung, A., and Van Gool, L. The 2017 davis challenge on video object segmentation. ar Xiv:1704.00675, 2017. Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pretraining. 2018. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748 8763. PMLR, 2021. Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in neural information processing systems, 29, 2016. Schaeffer, R., Lecomte, V., Pai, D. B., Carranza, A., Isik, B., Unell, A., Khona, M., Yerxa, T., Le Cun, Y., Chung, S., et al. Towards an improved understanding and utilization of maximum manifold capacity representations. ar Xiv preprint ar Xiv:2406.09366, 2024. Tong, S., Dai, X., Chen, Y., Li, M., Li, Z., Yi, B., Le Cun, Y., and Ma, Y. Unsupervised learning of structured representations via closed-loop transcription. ar Xiv preprint ar Xiv:2210.16782, 2022. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J egou, H. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pp. 10347 10357. PMLR, 2021. Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pp. 9929 9939. PMLR, 2020. Wang, X., Girdhar, R., Yu, S. X., and Misra, I. Cut and learn for unsupervised object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3124 3134, 2023. Wei, Z., Chen, L., Jin, Y., Ma, X., Liu, T., Ling, P., Wang, B., Chen, H., and Zheng, J. Stronger fewer & superior: Harnessing vision foundation models for domain generalized semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 28619 28630, 2024. Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3733 3742, 2018. Wu, Z., Ding, T., Lu, Y., Pai, D., Zhang, J., Wang, W., Yu, Y., Ma, Y., and Haeffele, B. D. Token statistics transformer: Linear-time attention via variational rate reduction. ar Xiv preprint ar Xiv:2412.17810, 2024. Yang, J., Li, X., Pai, D., Zhou, Y., Ma, Y., Yu, Y., and Xie, C. Scaling white-box transformers for vision. ar Xiv preprint ar Xiv:2405.20299, 2024. Simplifying DINO via Coding Rate Regularization Yerxa, T., Kuang, Y., Simoncelli, E., and Chung, S. Learning efficient coding of natural images with maximum manifold capacity representations. Advances in Neural Information Processing Systems, 36:24103 24128, 2023. Yu, Y., Chan, K. H. R., You, C., Song, C., and Ma, Y. Learning diverse and discriminative representations via the principle of maximal coding rate reduction. 
Advances in neural information processing systems, 33:9422 9434, 2020. Yu, Y., Buchanan, S., Pai, D., Chu, T., Wu, Z., Tong, S., Haeffele, B., and Ma, Y. White-box transformers via sparse rate reduction. Advances in Neural Information Processing Systems, 36:9422 9457, 2023. Zbontar, J., Jing, L., Misra, I., Le Cun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learning, pp. 12310 12320. PMLR, 2021. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., and Torralba, A. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633 641, 2017. Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., and Kong, T. ibot: Image bert pre-training with online tokenizer. ar Xiv preprint ar Xiv:2111.07832, 2021. Simplifying DINO via Coding Rate Regularization A. Formal Description of Local and Global Views Each local view, say vℓacts as follows, given an input image X of shape (C, H, W). First, for a hyperparameter ploc [0, 1] it crops a rectangular component from X of shape (C, Hℓ, Wℓ), where Hℓand Wℓare chosen such that HℓWℓ= ploc HW, i.e., the crop is a fraction ploc of the whole image. Then the component is resized to shape (C, Sloc, Sloc), where Sloc is a hyperparameter, and then divided into Nloc := S2 loc/P 2 square patches of shape (C, P, P), where the patch size P is a hyperparameter. Each patch is unrolled into a vector of length D := CP 2, and the Nloc unrolled vectors are placed in raster order as columns to get the output Xℓ RD Nloc. Each global view vg acts the same as a local view, except that the corresponding hyperparameters pglo, Sglo are larger than their local counterparts ploc, Sloc (hence also Nglo vs. Nloc), while the patch size P (hence dimension D) remains the same.5 We use these local and global views for training. For evaluation or inference, we do a similar procedure: given X of shape (C, H, W), we resize X proportionally so that its shorter edge is length Leval, then take a square crop from the center of shape (C, Seval, Seval). This sequence is divided into Neval := S2 eval/P 2 square patches of length (C, P, P); each patch is unrolled into a vector of length D := CP 2, and the Neval unrolled vectors are placed in raster order as columns to get the output Xe RD Neval. B. Complex Interactions in DINO and Their Removal We wish to showcase a finer point about why the DINO pipeline is so unstable. Notice that d CE(p, q) = i=1 pi log qi (17) i=1 pi log(pi/qi) i=1 pi log pi (18) = d KL(p, q) + H(p) (19) where d KL is the KL divergence, and H is the entropy of a probability distribution. In other words, this objective is minimized whenever p = q and both are one-hot vectors. Now consider the DINO objective: LDINO = E[d CE(pcls t (Xg), pcls s (Xc))] = E[d KL(pcls t (Xg), pcls s (Xc)) + H(pcls t (Xg))]. (20) Suppose that, for example, hηDINO s and hηDINO t had ranges as a multiple of the all-ones vector, and µ were a constant multiple of the ones vector. Then the first term in the loss would be minimized, but the second term would become as large as possible (since both pcls would be just 1 m1m, i.e., probability vectors corresponding to the uniform distribution), so this would not be the optimal solution in general. This implies that the learned hηDINO s and hηDINO t in general would not both be degenerate. This enables the tradeoff between the EMA parameter λ and the temperature parameters τs, τt which enables non-collapse. 
If the objective just involved the KL divergence and not the entropy term, or else had hηDINO s be degenerate (manually set and frozen, for instance), or else didn t have a carefully set tradeoff between λ, τs, τt, then the model would collapse. However, Sim DINO removes all of this complexity and replaces it with an explicit coding-rate-type term. C. Theory for Hyperparameter Scaling To develop estimates for how different terms in the loss scale with different batch sizes, we first introduce an empirical version of the Sim DINO loss. For a given optimization step, we: Sample a minibatch {X1, . . . , XB}. For each sample Xb, draw Mglo global views vi glo and Mloc local views vj loc. Define Xi b,glo := vi glo(Xb) and Xj b,loc := vj loc(Xb). Compute the features zcls s (Xi b,glo), zcls s (Xj b,loc), and zcls t (Xi b,glo). 5Of course, we also need the patch size P to divide both the image sizes Sloc and Sglo. Simplifying DINO via Coding Rate Regularization Form the tensors Zcls s RB M d, Zcls t RB Mglo d where M := Mglo + Mloc, by (Zcls s )bi = ( zcls s (Xi b,glo), if 1 i Mglo, zcls s (Xi Mglo b,loc ), otherwise , (Zcls t )bi = zcls t (Xi b,glo) (21) Compute the loss b LSim DINO = 1 2BMMglo j=1 (Zcls t )bi (Zcls s )bj 2 2 + γ Mglo (Zcls s ) :,i(Zcls s ):,i B where for 1 i Mglo we have (Zcls s ):,i = [(Zcls s )1,i, . . . , (Zcls s )B,i] RB d. Our main theorem for Sim DINO is the following. Theorem C.1. The (Frobenius) gradient norm of (22) is Zcls s b LSim DINO F 1 d min{d, B}/B Proof. For the first term, we call upon Lemma C.4 which says that (Zcls s )b j=1 (Zcls t )bi (Zcls s )bj 2 2 This allows us to compute the gradient norm of the first term in (22) as Zcls s j=1 (Zcls t )bi (Zcls s )bj 2 2 j=1 (Zcls t )bi (Zcls s )bj 2 2 j=1 (Zcls t )bi (Zcls s )bj 2 2 j=1 (Zcls t )bi (Zcls s )bj 2 2 j=1 (Zcls t )bi (Zcls s )bj 2 2 For the second term, we call upon Lemma C.6 which says that (Zcls s ):,i Rε (Zcls s ) :,i(Zcls s ):,i B d min{d, B}/B Simplifying DINO via Coding Rate Regularization This allows us to compute the gradient norm of the second term in (22) as Zcls s (Zcls s ) :,i(Zcls s ):,i B (Zcls s ) :,i(Zcls s ):,i B (Zcls s ) :,i(Zcls s ):,i B (Zcls s ):,i Rε (Zcls s ) :,i(Zcls s ):,i B (Zcls s ):,i Rε (Zcls s ) :,i(Zcls s ):,i B d min{d, B} d min{d, B} Putting these two terms together via triangle inequality, we have that Zcls s b LSim DINO F 1 d min{d, B} as desired. We can also come up with another result for Sim DINOv2. This requires a slightly revised pipeline, as well as a different loss. For a given optimization step, we: Sample a minibatch {X1, . . . , XB}. For each sample Xb, draw Mglo global views vi glo and Mloc local views vj loc. Define Xi b,glo := vi glo(Xb) and Xj b,loc := vj loc(Xb). For each sample Xb, draw Mglo masks mi b and apply them to each global view. Define Xi b,glo,mask := mi b(Xi b,glo), and define 1i b,k be the indicator variable of whether patch k is masked by mi b. Compute the features zcls s (Xi b,glo,mask), Zpatch s (Xi b,glo,mask), zcls s (Xi b,loc), zcls t (Xi b,glo), and Zpatch t (Xi b,glo). Form the tensors Zcls s RB M d, Zcls t RB Mglo d, Zpatch s,glo RB Mglo Nglo d, Zpatch t RB Mglo Nglo d, where M := Mglo + Mloc, by (Zcls s )bi = ( zcls s (Xi b,glo,mask), if 1 i Mglo, zcls s (Xi Mglo b,loc ), otherwise , (Zcls t )bi = zcls t (Xi b,glo) (40) (Zpatch s,glo )bi = Zpatch s (Xi b,glo,mask), (Zpatch t )bi = Zpatch t (Xi b,glo). 
(41) Compute the loss b LSim DINOv2 = 1 ( 1 2BMMglo j=1 (Zcls t )bi (Zcls s )bj 2 2 (42) + 1 2BMglo Nglo n=1 (Zpatch t )bin (Zpatch s,glo )bin 2 21i b,n Simplifying DINO via Coding Rate Regularization (Zcls s ) :,i(Zcls s ):,i B where for 1 i Mglo we have (Zcls s ):,i = [(Zcls s )1,i, . . . , (Zcls s )B,i] RB d. Our main theorem for Sim DINOv2 is the following: Theorem C.2. The (Frobenius) gradient norm of (42) is Zcls s b LSim DINOv2 F 1 d min{d, B} where (recall) α [0, 1] is the fraction of patches of the input global view which is masked out. Proof. By the proof of Theorem C.1, the gradient norm of the first and third terms are known. So we study the second term. We can invoke Lemma C.5, which tells us that for every b and i we have (Zpatch s,glo )bi n=1 (Zpatch t )bin (Zpatch s,glo )bin 2 21i b,n (1 α)Nglo (44) using the identity that for each (i, b) pair that n=1 1i b,n = (1 α)Nglo. (45) Then, it holds overall that Zs 1 2BMglo Nglo n=1 (Zpatch t )bin (Zpatch s,glo )bin 2 21i b,n = 1 BMglo Nglo n=1 (Zpatch t )bin (Zpatch s,glo )bin 2 21i b,n 1 BMglo Nglo n=1 (Zpatch t )bin (Zpatch s,glo )bin 2 21i b,n = 1 BMglo Nglo (Zpatch s )b i n=1 (Zpatch t )bin (Zpatch s,glo )bin 2 21i b,n = 1 BMglo Nglo (Zpatch s )bi n=1 (Zpatch t )bin (Zpatch s,glo )bin 2 21i b,n = 1 BMglo Nglo (1 α)Nglo (51) Nglo . (52) Using the gradients computed in Theorem C.1 and the triangle inequality, it holds that Zs b LSim DINOv2 F 1 d min{d, B} as desired. Simplifying DINO via Coding Rate Regularization Remark C.3. In the main body, in order to obtain a prescription for how to scale γ with the batch size B, we choose γ to make the different terms in (22) and (42) have equal magnitude. Given the explicit rate in Theorem C.1, the value of γ chosen in Sim DINO is γSim DINO = 2ε B d M min{d, B}. (54) Meanwhile using Theorem C.2, the value of γ chosen in Sim DINOv2 is γSim DINOv2 = ε B d min{d, B}. (55) If, for instance, we are just interested in scaling γ with the batch size B (so that ε, α, M, Nglo, d are held constant), then γSim DINO and γSim DINOv2 have the same asymptotic order. In practice, we take these prescriptions for γ up to a multiplicative constant, which is tuned on a single setting and can then be transferred to different settings. C.1. Auxiliary Lemmas Lemma C.4 (Scale of Gradient of Pairwise Distance Term). Let d, m, n be positive integers. Let A Rm d and B Rn d have rows ai and bj which are unit-ℓ2-norm, i.e., ai 2 = bj 2 = 1 for i {1, . . . , m}, j {1, . . . , n}. Then B j=1 ai bj 2 2 Proof. Since all rows are normalized, j=1 ai bj 2 2 = j=1 a i bj = = (1 m A)(1 n B) (58) = 1 m AB 1n. (59) A matrix calculus computation shows that B 1 m AB 1n = 1n1 m A. (60) To bound the norm of this term, we have 1n1 m A F = 1n1 m A F 1n1 m op A F . (61) It is easy to show by matrix algebra that 1n1 m op = mn, (62) and that, since the rows of A are normalized, i=1 ai 2 2 = m 1 = m. (63) Putting these together it holds that B j=1 ai bj 2 2 as desired. Simplifying DINO via Coding Rate Regularization Lemma C.5 (Scale of Gradient of Patchwise Distance Term). Let d, n be positive integers. Let A, B Rn d have rows ai and bj which are unit-ℓ2-norm, i.e., ai 2 = bj 2 = 1 for i, j {1, . . . , n}. Then B i=1 ai bi 2 2 ) F = n. (65) Proof. Since all rows are normalized, i=1 ai bi 2 2 = i=1 a i bi = j=1 (A)ij(B)ij = tr(AB ). (66) A matrix calculus computation shows that B[ tr(AB )] = A. (67) The Frobenius norm of this term can be explicitly calculated as i=1 ai 2 2 = n 1 = n. 
C.1. Auxiliary Lemmas

Lemma C.4 (Scale of Gradient of Pairwise Distance Term). Let $d, m, n$ be positive integers. Let $A \in \mathbb{R}^{m\times d}$ and $B \in \mathbb{R}^{n\times d}$ have rows $a_i$ and $b_j$ which are unit-$\ell_2$-norm, i.e., $\|a_i\|_2 = \|b_j\|_2 = 1$ for $i \in \{1,\dots,m\}$, $j \in \{1,\dots,n\}$. Then
$$\Big\|\nabla_B\,\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\|a_i - b_j\|_2^2\Big\|_F \;\le\; m\sqrt{n}.$$

Proof. Since all rows are normalized,
$$\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\|a_i - b_j\|_2^2 = mn - \sum_{i=1}^{m}\sum_{j=1}^{n} a_i^{\top} b_j = mn - (\mathbf{1}_m^{\top} A)(\mathbf{1}_n^{\top} B)^{\top} \quad (58)$$
$$= mn - \mathbf{1}_m^{\top} A B^{\top} \mathbf{1}_n. \quad (59)$$
A matrix calculus computation shows that
$$\nabla_B\big[-\mathbf{1}_m^{\top} A B^{\top} \mathbf{1}_n\big] = -\mathbf{1}_n\mathbf{1}_m^{\top} A. \quad (60)$$
To bound the norm of this term, we have
$$\|{-\mathbf{1}_n\mathbf{1}_m^{\top} A}\|_F = \|\mathbf{1}_n\mathbf{1}_m^{\top} A\|_F \le \|\mathbf{1}_n\mathbf{1}_m^{\top}\|_{\mathrm{op}}\,\|A\|_F. \quad (61)$$
It is easy to show by matrix algebra that
$$\|\mathbf{1}_n\mathbf{1}_m^{\top}\|_{\mathrm{op}} = \sqrt{mn}, \quad (62)$$
and, since the rows of $A$ are normalized,
$$\|A\|_F^2 = \sum_{i=1}^{m}\|a_i\|_2^2 = m\cdot 1 = m. \quad (63)$$
Putting these together, it holds that
$$\Big\|\nabla_B\,\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\|a_i - b_j\|_2^2\Big\|_F \;\le\; \sqrt{mn}\cdot\sqrt{m} = m\sqrt{n}, \quad (64)$$
as desired.

Lemma C.5 (Scale of Gradient of Patchwise Distance Term). Let $d, n$ be positive integers. Let $A, B \in \mathbb{R}^{n\times d}$ have rows $a_i$ and $b_i$ which are unit-$\ell_2$-norm, i.e., $\|a_i\|_2 = \|b_i\|_2 = 1$ for $i \in \{1,\dots,n\}$. Then
$$\Big\|\nabla_B\,\frac{1}{2}\sum_{i=1}^{n}\|a_i - b_i\|_2^2\Big\|_F = \sqrt{n}. \quad (65)$$

Proof. Since all rows are normalized,
$$\frac{1}{2}\sum_{i=1}^{n}\|a_i - b_i\|_2^2 = n - \sum_{i=1}^{n} a_i^{\top} b_i = n - \sum_{i=1}^{n}\sum_{j=1}^{d}(A)_{ij}(B)_{ij} = n - \operatorname{tr}(AB^{\top}). \quad (66)$$
A matrix calculus computation shows that
$$\nabla_B\big[-\operatorname{tr}(AB^{\top})\big] = -A. \quad (67)$$
The Frobenius norm of this term can be explicitly calculated:
$$\|{-A}\|_F^2 = \sum_{i=1}^{n}\|a_i\|_2^2 = n\cdot 1 = n. \quad (68)$$
Putting this together, we obtain
$$\Big\|\nabla_B\,\frac{1}{2}\sum_{i=1}^{n}\|a_i - b_i\|_2^2\Big\|_F = \sqrt{n}, \quad (69)$$
as desired.

Lemma C.6 (Scale of Gradient of Coding Rate Term). Let $d, n$ be positive integers. We have
$$\max_{\substack{Z\in\mathbb{R}^{n\times d}\\ \|z_i\|_2=1\ \forall i}}\Big\|\nabla_Z\, R_\varepsilon\!\Big(\frac{Z^{\top}Z}{n}\Big)\Big\|_F \;\le\; \frac{1}{2\varepsilon}\sqrt{\frac{d\,\min\{d,n\}}{n}}, \quad (70)$$
where $z_i$ is the $i$th row of $Z$.

Proof. Let $\alpha := d/(n\varepsilon^2)$ and let $f : \mathbb{R}^{n\times d} \to \mathbb{R}$ be defined by
$$f(Z) = \frac{1}{2}\log\det\big(I + \alpha Z^{\top}Z\big), \quad (71)$$
i.e., $f(Z) = R_\varepsilon(Z^{\top}Z/n)$. Now let $r := \min\{d,n\}$. For any matrix $M$, let $\sigma_i(M)$ denote its $i$th largest singular value. First, note that since $\|z_i\|_2 = 1$ for all $i$, it holds that
$$\sum_{i=1}^{r}\sigma_i(Z)^2 = \operatorname{tr}(Z^{\top}Z) = \sum_{i=1}^{n}\|z_i\|_2^2 = n. \quad (72)$$
Now, we can simplify the gradient. It holds that
$$\nabla f(Z) = \alpha Z\big(I + \alpha Z^{\top}Z\big)^{-1}. \quad (73)$$
Thus,
$$\|\nabla f(Z)\|_F^2 = \operatorname{tr}\big([\nabla f(Z)][\nabla f(Z)]^{\top}\big) \quad (74)$$
$$= \alpha^2\operatorname{tr}\big(Z(I + \alpha Z^{\top}Z)^{-2}Z^{\top}\big). \quad (75)$$
Using that the trace is the sum of the singular values, it holds by taking the SVD of $Z$ that
$$\operatorname{tr}\big(Z(I + \alpha Z^{\top}Z)^{-2}Z^{\top}\big) = \sum_{i=1}^{r}\sigma_i\big(Z(I + \alpha Z^{\top}Z)^{-2}Z^{\top}\big) \quad (76)$$
$$= \sum_{i=1}^{r}\frac{\sigma_i(Z)^2}{[1 + \alpha\sigma_i(Z)^2]^2}. \quad (77)$$
In this case we can directly optimize over the singular values, obtaining the problem
$$\max_{\substack{Z\in\mathbb{R}^{n\times d}\\ \|z_i\|_2=1\ \forall i}}\|\nabla f(Z)\|_F^2 \;\le\; \alpha^2\max_{\substack{x\in\mathbb{R}^{r},\ x_i\ge 0\ \forall i\\ \sum_{i=1}^{r}x_i = n}}\ \sum_{i=1}^{r}\frac{x_i}{(1+\alpha x_i)^2}. \quad (78)$$
The function $t \mapsto t/(1+\alpha t)^2$ on $[0,\infty)$ has a global maximum at $t = 1/\alpha$, where its value is $1/(4\alpha)$. Therefore,
$$\max_{\substack{x\in\mathbb{R}^{r},\ x_i\ge 0\ \forall i\\ \sum_{i=1}^{r}x_i = n}}\ \sum_{i=1}^{r}\frac{x_i}{(1+\alpha x_i)^2} \;\le\; \max_{x\in\mathbb{R}^{r},\ x_i\ge 0\ \forall i}\ \sum_{i=1}^{r}\frac{x_i}{(1+\alpha x_i)^2} = \frac{r}{4\alpha}. \quad (79)$$
Unpacking this notation, we obtain
$$\|\nabla f(Z)\|_F^2 \;\le\; \alpha^2\cdot\frac{r}{4\alpha} = \frac{\alpha r}{4} = \frac{d\,\min\{d,n\}}{4n\varepsilon^2}. \quad (80)$$
Taking square roots, it holds that
$$\|\nabla f(Z)\|_F \;\le\; \frac{1}{2\varepsilon}\sqrt{\frac{d\,\min\{d,n\}}{n}}. \quad (81)$$
Therefore,
$$\big\|\nabla_Z R_\varepsilon(Z^{\top}Z/n)\big\|_F = \|\nabla f(Z)\|_F \;\le\; \frac{1}{2\varepsilon}\sqrt{\frac{d\,\min\{d,n\}}{n}}, \quad (82)$$
as desired.

Remark C.7. It is possible that the inequality
$$\max_{\substack{Z\in\mathbb{R}^{n\times d}\\ \|z_i\|_2=1\ \forall i}}\|\nabla f(Z)\|_F^2 \;\le\; \alpha^2\max_{\substack{x\in\mathbb{R}^{r},\ x_i\ge 0\ \forall i\\ \sum_{i=1}^{r}x_i = n}}\ \sum_{i=1}^{r}\frac{x_i}{(1+\alpha x_i)^2} \quad (83)$$
is met with equality; proving this would require exhibiting a $Z$ fulfilling the constraints of the first problem whose singular values solve the second problem. We do not need to do so here for the purposes of using the bound (e.g., for learning rate scaling).

Remark C.8. While the quick-and-dirty bound
$$\max_{\substack{x\in\mathbb{R}^{r},\ x_i\ge 0\ \forall i\\ \sum_{i=1}^{r}x_i = n}}\ \sum_{i=1}^{r}\frac{x_i}{(1+\alpha x_i)^2} \;\le\; \frac{r}{4\alpha}, \quad (84)$$
obtained by ignoring the constraint $\sum_{i=1}^{r}x_i = n$, may seem like it could significantly loosen the bound, we do not believe this is the case. In particular, when $1/\alpha \le n/r$, setting $x_1 = \cdots = x_{r-1} = 1/\alpha$ and $x_r = n - (r-1)/\alpha$ sandwiches the objective between $(r-1)/(4\alpha)$ and $r/(4\alpha)$, so the maximum is of at least the same asymptotic order. This holds in the very reasonable case that $\varepsilon$ is small enough that $1/\alpha \le n/r$, i.e., using the definition of $\alpha$, such that $\varepsilon^2 \le d/\min\{d,n\}$. Similar strategies should hold if we allow for an absolute constant $c \ge 1$ such that $1/\alpha \le c\,n/r$, etc., relaxing the requirement while preserving the asymptotic order of the left-hand side of (84).

D. Training Pipeline Pseudocode

In this section we provide pseudocode for the training pipelines of SimDINO and SimDINOv2.
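The pseudocode in Algorithms 1 and 2 calls a batch_logdet helper on a batch of positive-definite matrices. For reference, one minimal way such a helper could be implemented is sketched below (an assumption of ours; the released code may implement it differently):

```python
import torch

def batch_logdet(mats: torch.Tensor) -> torch.Tensor:
    """Log-determinant of a batch of symmetric positive-definite matrices, shape (..., d, d)."""
    # For SPD matrices, logdet(A) = 2 * sum(log(diag(cholesky(A)))), which is
    # typically more numerically stable than computing the determinant directly.
    chol = torch.linalg.cholesky(mats)
    return 2.0 * torch.log(torch.diagonal(chol, dim1=-2, dim2=-1)).sum(dim=-1)
```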
Algorithm 1 SimDINO training pipeline.

```python
# fs, ft: student and teacher networks, this time outputting ONLY the cls token feature
# eps:   coding rate regularization quantization hyperparameter
# gamma: coding rate regularization strength hyperparameter
# lam:   teacher network EMA rate
ft.params = fs.params
for x in loader:  # load a minibatch x of B samples
    xg, xl = global_views(x), local_views(x)  # (B, M_glo, N_glo, D), (B, M_loc, N_loc, D)
    zsg, zsl = fs(xg), fs(xl)                 # student output (B, M_glo, d), (B, M_loc, d)
    ztg = ft(xg)                              # teacher output (B, M_glo, d)
    zs = cat([zsg, zsl], dim=1)               # (B, M, d) where M = M_loc + M_glo
    sq_dists = sum((zs.view(B, M, 1, d) - ztg.view(B, 1, M_glo, d)) ** 2, dim=3)  # (B, M, M_glo)
    zsg_bdim = zsg.transpose(0, 1)                                  # (M_glo, B, d)
    covs = zsg_bdim.transpose(1, 2) @ zsg_bdim / B                  # (M_glo, d, d)
    R_eps = batch_logdet(I_d.unsqueeze(0) + d / (eps ** 2) * covs)  # (M_glo,)
    loss = mean(sq_dists) - gamma * mean(R_eps)
    loss.backward()  # back-propagate
    # student and teacher updates
    update(fs)  # SGD or Adam
    ft.params = lam * ft.params + (1 - lam) * fs.params
```
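A small implementation note on the coding-rate term in Algorithm 1 (our own observation, not a claim about the released code): by Sylvester's determinant identity, $\log\det(I_d + c\,Z^{\top}Z) = \log\det(I_B + c\,ZZ^{\top})$ for $Z \in \mathbb{R}^{B\times d}$, so the same quantity can be computed from a $B\times B$ Gram matrix instead of a $d\times d$ covariance, which is cheaper whenever the per-view batch size $B$ is smaller than the feature dimension $d$:

```python
# Equivalent coding-rate computation via B x B Gram matrices (hypothetical variant of the
# corresponding lines in Algorithm 1); I_B denotes a B x B identity matrix.
grams = zsg_bdim @ zsg_bdim.transpose(1, 2) / B                   # (M_glo, B, B)
R_eps = batch_logdet(I_B.unsqueeze(0) + d / (eps ** 2) * grams)   # (M_glo,)
```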
E. Implementation Details

The training code and hyperparameters for SimDINO and SimDINOv2 are derived from the released official settings of DINO and DINOv2, respectively; see Table 4 for a detailed comparison. Note that for SimDINOv2, we use the bfloat16 dtype for the student backbone parameters and reductions for better numerical stability, while the other modules use the same FSDP mixed-precision settings as DINOv2.

F. Additional Experiments

F.1. Ablations on Stability of DINO Training

In Table 6, we study the optimization behavior and stability of DINO by varying hyperparameters that are specific to its pipeline. Specifically, we select the teacher momentum, whether to apply normalization to the last layer, and the teacher temperature; we vary each of them and study its impact on DINO training. As shown in Table 6, moderate adjustments to each component lead to divergence during the early stages of training. These results suggest that DINO training can be highly unstable and requires careful tuning.

F.2. Ablation Studies on Batch Sizes

We vary the batch size when training ViT-S using SimDINO and report the results in Table 7. We observe that SimDINO is robust to the choice of batch size and converges to reasonably good performance even with a smaller batch size of 256.

F.3. Experiments on Longer Training

More training epochs in SSL typically lead to better performance. We report the performance of SimDINO when doubling the number of epochs in Table 8. Clearly, these results show the efficacy of longer training for SimDINO.

Algorithm 2 SimDINOv2 training pipeline.

```python
# fs, ft: student and teacher networks, this time outputting BOTH the cls token feature and patch token features
# eps:   coding rate regularization quantization hyperparameter
# gamma: coding rate regularization strength hyperparameter
# lam:   teacher network EMA rate
# alpha: proportion of patches that get masked
ft.params = fs.params
for x in loader:  # load a minibatch x of B samples
    m = generate_mask(x, alpha)               # boolean mask (B, M_glo, N_glo)
    xg, xl = global_views(x), local_views(x)  # (B, M_glo, N_glo, D), (B, M_loc, N_loc, D)
    xmg = apply_mask(xg, m)                   # (B, M_glo, N_glo, D)
    zsg, Zsg = fs(xmg)  # student on masked global views (B, M_glo, d), (B, M_glo, N_glo, d)
    zsl, Zsl = fs(xl)   # student output on local views (B, M_loc, d), (B, M_loc, N_loc, d)
    ztg, Ztg = ft(xg)   # teacher output on global views (B, M_glo, d), (B, M_glo, N_glo, d)
    zs = cat([zsg, zsl], dim=1)  # (B, M, d), M = M_loc + M_glo
    sq_dists = sum((zs.view(B, M, 1, d) - ztg.view(B, 1, M_glo, d)) ** 2, dim=3)  # (B, M, M_glo)
    patch_sq_dists = mean(sum((Zsg - Ztg) ** 2, dim=3) * m, dim=2)                # (B, M_glo)
    zsg_bdim = zsg.transpose(0, 1)                                  # (M_glo, B, d)
    covs = zsg_bdim.transpose(-2, -1) @ zsg_bdim / B                # (M_glo, d, d)
    R_eps = batch_logdet(I_d.unsqueeze(0) + d / (eps ** 2) * covs)  # (M_glo,)
    loss = (mean(sq_dists) + mean(patch_sq_dists)) / 2 - gamma * mean(R_eps)
    loss.backward()  # back-propagate
    # student and teacher updates
    update(fs)  # SGD or Adam
    ft.params = lam * ft.params + (1 - lam) * fs.params
```

F.4. DINO without Self-Distillation

Due to the explicit coding rate regularization, it is possible to train SimDINO without self-distillation. To validate this, we train ViT-S models on ImageNet-1k while setting the teacher network to be the student network at each iteration, effectively removing the EMA operation. Results are presented in Table 9. We can see that the original DINO collapses under this setup, for reasons discussed in Appendix B, while SimDINO is still able to yield non-trivial performance. It is worth noting that, compared to training with full self-distillation, this variant primarily lags behind in terms of k-NN performance, while the gap in linear probing is significantly smaller.

F.5. Ablations on Loss Functions

To prevent representation collapse, we adopt the coding rate objective $R_\varepsilon$ in SimDINO and SimDINOv2. In this part, we examine the effectiveness of the coding rate formulation and compare it with other loss functions that aim to prevent collapse. Specifically, we swap the coding rate function for the following three widely used objectives: (1) the vanilla contrastive loss $\ell_{\mathrm{contrastive}}$ based on InfoNCE as in SimCLR (Chen et al., 2020); (2) the uniform loss $\ell_{\mathrm{uniform}}$ proposed in (Wang & Isola, 2020), which encourages the representations to be uniform on the unit sphere; and (3) the Barlow Twins loss $\ell_{\mathrm{bt}}$ proposed in (Zbontar et al., 2021), which penalizes off-diagonal terms while promoting on-diagonal terms of the covariance matrix of learned representations. Results are presented in Table 5. We observe that the coding rate objective consistently performs better than the other choices considered, validating our design.
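For reference, a standard formulation of the uniform loss of Wang & Isola (2020) used in this ablation is sketched below; the temperature t = 2 and the exact pairwise reduction are our own assumptions, since Table 5 does not specify them. Minimizing this term plays the same role as maximizing $R_\varepsilon$: it spreads the features over the unit sphere and thereby prevents collapse.

```python
import torch

def uniform_loss(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Uniformity objective on L2-normalized features z of shape (B, d) (a minimal sketch)."""
    sq_dists = torch.pdist(z, p=2).pow(2)          # pairwise squared distances, (B*(B-1)/2,)
    return torch.exp(-t * sq_dists).mean().log()   # log of the mean Gaussian potential
```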
F.6. Visualization of Attention Maps

Following (Oquab et al., 2023; Caron et al., 2021), we provide visualizations of the self-attention maps of different models for qualitative comparison. We use test images that do not appear during pretraining. More concretely, we compute and visualize the average of the self-attention maps across all attention heads from the last layer in Figure 5. It is clear from the attention maps that all methods studied in our paper lead to the prominent segmentation properties that emerge from vision self-supervised learning.

Figure 5. Visualization of average self-attention maps obtained from both DINO(v2) and SimDINO(v2) algorithms. (Panels: DINO ViT-B/16, SimDINO ViT-B/16, DINOv2 ViT-B/16, SimDINOv2 ViT-B/16, SimDINO ViT-L/16, DINOv2 ViT-L/16, SimDINOv2 ViT-L/16.)

Table 4. Training hyperparameters used in the experiments. Values are listed in the column order SimDINOv2 / DINOv2 / SimDINO / DINO; a single value indicates a setting shared by all models, and "-" indicates a setting that does not apply.

- Patch size: 16
- Register tokens: 4 / 0
- Pos-embedding anti-alias: True / False
- Init layer scale: 0.1 / 1e-5 / -
- Drop path rate: 0.3 / 0.1
- Weight-normalize last layer: removed / True / removed / True
- Output prototypes K: removed / 65536 / removed / 65536
- Init EMA momentum: 0.9 / 0.992 / 0.996
- Centering temperature: removed / 0.07 / removed / 0.07
- Warm-up temperature: removed / 0.04 / removed / 0.04
- Warm-up temperature epochs: removed / 30 / removed / 30
- iBOT sample prob.: 0.5 / -
- iBOT mask ratio: 0.1-0.5 / -
- iBOT head untying: False / -
- Koleo loss weight: removed / 0.1 / -
- Global crops scale: 0.4-1
- Local crops scale: 0.05-0.4
- Local crops number: 10
- Global crops size: 224
- Local crops size: 96
- Batch size: 128x8 / 64x8
- Epochs: 100
- Warm-up epochs: 10
- Freeze last layer epochs: removed / 1 / removed / 1
- Learning rate: 0.004 / 0.002
- Layerwise lr decay: 0.9 / -
- Weight decay: 0.04
- Weight decay end: 0.4
- Gradient clip: 3.0 / 0.3

| Objective to prevent collapse | k-NN | Linear |
|---|---|---|
| contrastive loss ℓ_contrastive | 79.4 | 81.0 |
| uniform loss ℓ_uniform | 77.2 | 82.1 |
| Barlow Twins loss ℓ_bt | - | - |
| coding rate loss R_ε (ours) | 81.1 | 82.4 |

Table 5. Ablations on loss functions. We evaluate ViT-L pretrained on ImageNet-1k for 100 epochs by swapping the coding rate function in SimDINOv2 with other choices of collapse-prevention objectives. The Barlow Twins loss causes representation collapse in our experiments.

Table 6. Sensitivity of DINO to selected hyperparameters. We pick three DINO-specific hyperparameters (teacher momentum, last-layer head normalization, and teacher temperature) of the official configuration in (Caron et al., 2021) and study their impact. Varying each one leads to divergence in early training.

| Config | Mom. | Norm. | Temp. | k-NN |
|---|---|---|---|---|
| official (400 ep) | 0.996 | True | 0.04 → 0.07 | 76.1 |
| | 0.90 | True | 0.04 → 0.07 | NaN |
| | 0.996 | False | 0.04 → 0.07 | NaN |
| | 0.996 | True | 0.07 | NaN |

| Batch size | 256 | 512 | 1024 |
|---|---|---|---|
| k-NN | 68.3 | 69.7 | 69.6 |

Table 7. Effect of batch size. We evaluate the k-NN accuracy of ViT-S pretrained on ImageNet-1k for 100 epochs.

| Method | Epochs | k-NN | Linear |
|---|---|---|---|
| DINO | 100 | 72.9 | 76.3 |
| SimDINO | 100 | 74.9 | 77.3 |
| DINO | 200 | 73.6 | 77.1 |
| SimDINO | 200 | 76.0 | 77.7 |
| DINO* | 400 | 76.1 | 78.0 |

Table 8. Effect of training epochs. We evaluate ViT-B pretrained on ImageNet-1k for 100, 200, and 400 epochs. (DINO*: the 400-epoch result is evaluated on the officially provided checkpoint.)

| Method | Model | Self-distillation | Epochs | k-NN | Linear |
|---|---|---|---|---|---|
| DINO | ViT-S | ✗ | 100 | - | - |
| SimDINO | ViT-S | ✗ | 100 | 58.6 | 68.0 |
| SimDINO | ViT-S | ✓ | 100 | 69.7 | 73.6 |

Table 9. Performance on ImageNet-1k without self-distillation.