# DINO AS A VON MISES-FISHER MIXTURE MODEL

Published as a conference paper at ICLR 2023

Hariprasath Govindarajan1,2, Per Sidén1,2, Jacob Roll2, Fredrik Lindsten1
1 Linköping University, Sweden  2 Qualcomm Technologies, Inc.
{hargov,psiden,jroll}@qti.qualcomm.com, fredrik.lindsten@liu.se

ABSTRACT

Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method, based on a cross-entropy loss between K-dimensional probability vectors obtained by applying a softmax function to the dot product between representations and learnt prototypes. Given that the learned representations are L2-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. With this interpretation, DINO assumes equal precision for all components when the prototypes are also L2-normalized. Using this insight, we propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF is stable also for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model is beneficial in terms of better image representations. The DINO-vMF pre-trained model consistently performs better than DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF vs iBOT and thereby show the relevance of our proposed modification also for other methods derived from DINO.

1 INTRODUCTION

Self-supervised learning (SSL) is an effective approach for pre-training models on large unlabeled datasets. The main objective of SSL pre-training is to learn representations that are transferable to a range of, so called, downstream tasks. Early SSL methods achieved this through handcrafted pretext tasks that act as inductive biases in the representation learning process (Komodakis & Gidaris, 2018; Noroozi & Favaro, 2016; Kim et al., 2018; Doersch et al., 2015; Larsson et al., 2016; Zhang et al., 2016). Contrastive methods (Tian et al., 2020; Wu et al., 2018), using supervisory signals in the form of augmentation invariances and instance discrimination, have produced strong performance benchmarks. In practice, contrastive methods require large batch sizes (Chen et al., 2020a) or specialized techniques like memory banks (He et al., 2020; Misra & Maaten, 2020) to achieve the best performance. Self-distillation methods based on the Mean Teacher (Tarvainen & Valpola, 2017) framework are effective representation learners that do not require such large batch sizes for pre-training. A trivial solution where the network learns to output the same representation irrespective of the input is known as representation collapse. The negative samples in contrastive learning prevent representation collapse. In the absence of negative samples, self-distillation methods use explicit approaches to avoid collapse, such as asymmetric model architecture (Grill et al., 2020; Chen & He, 2021) and whitening (Ermolov et al., 2021; Zbontar et al., 2021). Transformers (Vaswani et al., 2017), originally introduced in NLP, have emerged as a strong model architecture for vision tasks as well (Dosovitskiy et al., 2020; Liu et al., 2021). Current state-of-the-art SSL methods leverage the highly flexible Vision Transformers (ViTs) (Caron et al., 2021; Bao et al., 2021; Li et al., 2021; Zhou et al., 2021; Xie et al., 2021; Chen et al., 2021).
DINO (Caron et al., 2021) is a non-contrastive SSL method that is effective for pre-training ViT models. Interestingly, ViTs pre-trained using DINO outperformed ResNets (He et al., 2016) by a significant margin at kNN classification based on learned representations. DINO is an influential SSL method with several state-of-the-art derivatives: MSN (Assran et al., 2022) adapts DINO to produce strong few-shot performance with enhanced training efficiency; iBOT (Zhou et al., 2021) extends DINO by adding an additional masked image modeling task; EsViT extends DINO by adding patch-level tasks that also use a DINO-like formulation. These methods are all trained using the Mean Teacher framework by learning to produce consistent outputs in the probability simplex. The networks output softmax-logit scores based on an inner product between a learned representation and a set of prototypes.

We provide a better understanding of DINO, and its derivatives, by taking a closer look at its inner-product formulation. We interpret DINO as a von Mises-Fisher mixture model under certain assumptions. Based on this interpretation, we propose DINO-vMF, a modified version of DINO that adds flexibility in the learned latent space while keeping the training stable. DINO-vMF pre-training consistently improves performance on similar downstream tasks as DINO. We also show that the larger ViT models achieve significantly improved few-shot classification performance with our pre-training. By incorporating our vMF modification in iBOT, we achieve significantly improved performance, which suggests that our method is applicable to DINO-derived methods as well.

The self-distillation learning framework in DINO considers a teacher network g_{θ_t} and a student network g_{θ_s}, with parameters θ_t and θ_s, respectively. In DINO, the student network is formulated to predict a vector in the (K−1)-dimensional probability simplex using a softmax function. The student probability distribution is obtained as follows:

$$P_s^{(k)}(x) \propto_k \exp\big(g_{\theta_s}^{(k)}(x)\big) \qquad (1)$$

where ∝_k indicates that the right-hand side is normalized w.r.t. the index k (i.e. the equation above corresponds to a softmax). The teacher probability distribution P_t(x) is computed analogously. Given an unlabeled image dataset I, consider uniform samples x ∼ I and two random augmentations A_s ∼ 𝒜_s, A_t ∼ 𝒜_t. By applying these augmentations, we get two views x_s = A_s(x) and x_t = A_t(x). The student network is trained using gradient updates to produce outputs P_s(x_s) that are consistent with those of the teacher network P_t(x_t) by minimizing a cross-entropy loss given by $\min_{\theta_s} -\sum_{k=1}^{K} P_t^{(k)}(x_t) \log P_s^{(k)}(x_s)$. The teacher network parameters θ_t are only updated as an exponential moving average (EMA) of θ_s.

SSL methods based on Siamese networks (Bromley et al., 1993) face the representation collapse problem, where the network learns to produce the same output irrespective of the input. One approach to address this is by introducing an asymmetry between the teacher and student networks. DINO uses the same model architecture for the student and teacher networks but instead shows that adding asymmetry through centering and sharpening operations is sufficient to avoid collapse. The targets produced by the teacher network are centered to remove bias towards a cluster and sharpened using a temperature τ_t as g_{θ_t}(x_t) ← (g_{θ_t}(x_t) − c)/τ_t. On the other hand, the student outputs are only sharpened as g_{θ_s}(x_s) ← g_{θ_s}(x_s)/τ_s. The centering operation prevents one of the probability components from dominating, but the solution could collapse to a uniform distribution instead. This is avoided by the sharpening operation, where the teacher uses a lower temperature value than the student, τ_t < τ_s = 0.1. The overall schematic of DINO is illustrated in Figure 1a.
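To make the training signal concrete, the sketch below implements the cross-entropy objective with teacher centering and sharpening together with the EMA teacher update described above. It is a minimal PyTorch sketch of the published formulation, not the released DINO code; function names, default temperatures and the momentum value are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the sharpened teacher and student distributions
    of Eq. (1); `center` is the running centering vector c."""
    p_t = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()  # center + sharpen teacher
    log_p_s = F.log_softmax(student_logits / tau_s, dim=-1)              # sharpen student
    return -(p_t * log_p_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # Teacher parameters are an exponential moving average of the student's.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
```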
3.1 DINO: A CLOSER LOOK AT THE FINAL LAYER

The teacher and student networks are designed by combining a backbone network that can be transferred to downstream tasks and a prediction head that is specific to the SSL pre-training. We take a closer look at the prediction head as shown in Figure 1b, which is found to be important to achieve good performance in ablation experiments (Caron et al., 2021). The weight normalization in the last linear layer (Salimans & Kingma, 2016) refers to a reparameterization of the weights W as w^(k) = g^(k) v^(k), where w^(k) is the k-th column of W, g^(k) is a scalar magnitude, and v^(k) is a unit vector. Optionally, the weights are L2-normalized by fixing g = 1.

Figure 1: Overview of DINO. (a): High-level architecture of DINO, with student and teacher branches, sharpening and softmax on both, and an EMA update of the teacher; (b): A closer look at the networks g_θ, modeled as a combination of a backbone f_φ and a prediction head h_{ψ,W}, where θ = {φ, ψ, W}. The prediction head contains 3 MLP layers, an L2-normalization bottleneck and a weight-normalized (Salimans & Kingma, 2016) linear layer (optionally L2-normalized). The weights of the weight-normalized linear layer are L2-normalized in the larger ViT-Base models to ensure stable training.

Whether or not to L2-normalize the prototypes is a subtle, and in our opinion ad hoc, design choice which is not discussed enough in the literature. Most prior clustering-based SSL methods do use normalized prototypes (Caron et al., 2018; 2019; 2020; Li et al., 2020; Asano et al., 2020), and this is not mentioned as a design choice in the paper by Caron et al. (2021), nor in their supplementary material. However, from the public code repository of DINO1, it can be noted that not L2-normalizing W leads to a performance boost for ViT-Small models (these are the results reported in their paper). However, the larger ViT-Base models require the weights to be L2-normalized since the training is unstable otherwise. Furthermore, comparing the ViT-Small and ViT-Base models at patch size 8, surprisingly the ViT-Small model achieves better kNN top-1 validation accuracy on ImageNet (Caron et al., 2021, Table 2). This suggests that L2-normalization is an important design choice and that the representations learned by the ViT-Base model (i.e. with L2-normalization) are sub-optimal and have room to improve. In the next section we provide a better understanding of this from a mixture model perspective and investigate how we can better leverage prototypes that are not L2-normalized.

1 https://github.com/facebookresearch/dino
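For concreteness, here is a sketch of the prediction head in Figure 1b with the optional L2-normalization of the prototypes obtained by fixing g = 1. The bottleneck dimension (256) and number of prototypes (65536) follow the values stated in the paper, but the code is an illustrative reconstruction, not the released implementation, and the hidden width is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class DINOHead(nn.Module):
    def __init__(self, in_dim, out_dim=65536, hidden_dim=2048, bottleneck_dim=256,
                 norm_last_layer=True):
        super().__init__()
        # 3-layer MLP followed by an L2-normalization bottleneck (Figure 1b).
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        # Weight-normalized linear layer; its columns w^(k) = g^(k) v^(k) act as the prototypes.
        self.last_layer = nn.utils.weight_norm(nn.Linear(bottleneck_dim, out_dim, bias=False))
        if norm_last_layer:
            # Fixing g = 1 L2-normalizes the prototypes (the ViT-Base setting in DINO).
            self.last_layer.weight_g.data.fill_(1)
            self.last_layer.weight_g.requires_grad = False

    def forward(self, x):
        y = F.normalize(self.mlp(x), dim=-1, p=2)   # unit-norm representation y
        return self.last_layer(y)                   # logit scores <w^(k), y>
```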
3.2 DINO AS A MIXTURE MODEL

DINO learns to predict a soft distribution over K components. Let the intermediate output in the prediction head after the L2-normalization of the representation2 be denoted as y; see Figure 1b. Then, DINO can be interpreted as performing clustering in the latent space of y. Specifically, since y is a unit vector, DINO performs clustering on the unit hypersphere. Clustering is closely related to mixture modeling. On the unit hypersphere, a mixture model can be defined using von Mises-Fisher (vMF) components. For a random p-dimensional unit vector y, the vMF probability density function is given by f(y; μ, κ) = C_p(κ) exp(κ μ^T y), where μ is a mean vector with ‖μ‖ = 1, κ is a scalar concentration parameter that measures isotropic precision, and C_p(κ) is a normalizing constant defined as:

$$C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2} I_{p/2-1}(\kappa)},$$

where I_ν denotes the modified Bessel function of the first kind and order ν. Assuming a mixture model containing K vMF components with mixture component ratios π^(k), the probability of assigning a sample y_i to a mixture component k (also known as the responsibility of cluster k) is given by:

$$r_i^{(k)} \propto_k \pi^{(k)} \Pr(y_i \mid z_i = k) = \pi^{(k)} C_p(\kappa^{(k)}) \exp\big(\kappa^{(k)} \langle \mu^{(k)}, y_i \rangle\big) \qquad (2)$$

where ⟨u, v⟩ = u^T v denotes the inner product.

2 Note that this is different from the L2-normalization of the prototypes.

In DINO, the probability distributions for the student and the teacher can be rewritten as:

$$P_s^{(k)}(x_s) \propto_k \exp\big(\langle w_s^{(k)}, y_s \rangle / \tau_s\big), \qquad (3)$$

$$P_t^{(k)}(x_t) \propto_k \exp\big(\big(\langle w_t^{(k)}, y_t \rangle - c^{(k)}\big) / \tau_t\big) = \exp\big(-c^{(k)}/\tau_t\big) \exp\big(\langle w_t^{(k)}, y_t \rangle / \tau_t\big). \qquad (4)$$

Note that centering is only applied on the teacher outputs and that different sharpening temperatures are used for the teacher and the student. By comparing Eq. (3) with Eq. (2), we observe that the DINO formulation resembles that of a mixture model, given some assumptions. We explain these assumptions below to establish a more concrete connection. This interpretation also applies to other recent works that use a similar inner-product-based prototype formulation (Caron et al., 2020; Assran et al., 2022; Li et al., 2021; 2020).

The exponential term in DINO from Eq. (3) corresponds to the unnormalized probability density of a vMF distribution if we identify κ^(k) = ‖w^(k)‖/τ and μ^(k) = w^(k)/‖w^(k)‖. Similar to prior clustering-based SSL works, DINO avoids collapse by encouraging a uniform distribution of the data over the prototypes, which explains the absence of π^(k) in Eq. (3). In the teacher model, Eq. (4), we obtain an additional term through the centering operation which further encourages uniformity. We discuss the centering operation further in section 3.4. Following the mixture model interpretation, we note that a missing term in the DINO formulation is the normalization constant, C_p(κ^(k)). However, this is inconsequential when the prototypes w^(k) are L2-normalized, as it will result in a constant κ^(k) = 1/τ. Hence, the constant C_p(κ^(k)) would vanish in the softmax. However, a constant κ^(k) implies the assumption that all clusters should have a similar isotropic precision. This constraint reduces the flexibility of the mixture model, and this lack of flexibility in the latent space translates to the backbone mapping as well. Indeed, as discussed above, unnormalized prototypes help in improving the performance achieved by DINO (for ViT-Small models) by enabling better representations (Caron et al., 2021). However, larger ViT-Base models are affected by training instabilities when trained with unnormalized prototypes. We hypothesise that this is due to the "missing" normalization constant and propose to modify DINO by including appropriate normalization according to a vMF mixture model.
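To make the correspondence between Eqs. (2) and (3) concrete, the sketch below evaluates the responsibilities of Eq. (2) for unit vectors under the identification κ^(k) = ‖w^(k)‖/τ and μ^(k) = w^(k)/‖w^(k)‖, using SciPy's exact Bessel function for the normalizing constant. Function and variable names are illustrative and not part of any released code.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_nu

def log_vmf_norm_const(kappa, p):
    # log C_p(kappa) = (p/2 - 1) log kappa - (p/2) log(2*pi) - log I_{p/2-1}(kappa),
    # with log I_nu computed stably via the scaled Bessel function ive.
    nu = p / 2 - 1
    log_iv = np.log(ive(nu, kappa)) + kappa
    return nu * np.log(kappa) - (p / 2) * np.log(2 * np.pi) - log_iv

def vmf_responsibilities(Y, W, tau=0.1, log_pi=None):
    """Responsibilities r_i^(k) of Eq. (2) for unit vectors Y (n, p) under
    mu_k = w_k / ||w_k|| and kappa_k = ||w_k|| / tau."""
    p = Y.shape[1]
    kappa = np.linalg.norm(W, axis=1) / tau            # (K,) precisions
    logits = Y @ W.T / tau + log_vmf_norm_const(kappa, p)
    if log_pi is not None:                             # mixture weights; DINO assumes uniform
        logits = logits + log_pi
    logits -= logits.max(axis=1, keepdims=True)        # numerically stable softmax
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)
```

With L2-normalized prototypes all κ^(k) are equal, the log C_p term cancels in the softmax, and the responsibilities reduce to the DINO softmax of Eq. (3).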
3.3 NORMALIZING VON MISES-FISHER (VMF) COMPONENTS

In DINO, a large ‖w^(k)‖ scales the logit scores of a mixture component proportionally. In a properly normalized formulation, a large ‖w^(k)‖ instead scales the κ^(k) parameter proportionally and results in a sharper vMF distribution for the component. Hence, the model cannot naively increase ‖w^(k)‖ to increase the responsibility of a component for a data sample; it also needs to map the data samples close to the component's prototype. If the images assigned to a component can be consistently mapped close to its prototype in the latent space, only then is it beneficial to increase κ^(k).

The vMF normalization constant C_p(κ) includes the modified Bessel function of the first kind and order ν, denoted as I_ν(κ). It is easy to obtain κ = ‖w‖/τ, but since I_ν(κ) is a complicated function, we propose to use a differentiable approximation thereof. Let ν = p/2 − 1 and κ = νr. Then, I_ν can be approximated by the following uniform expansion for large ν (DLMF, Eq. 10.41.3):

$$I_\nu(\nu r) \sim \frac{e^{\nu \eta}}{(2\pi\nu)^{1/2} (1+r^2)^{1/4}} \sum_{i=0}^{\infty} \frac{U_i(s)}{\nu^{i}}, \qquad \eta = (1+r^2)^{1/2} + \log \frac{r}{1 + (1+r^2)^{1/2}} \qquad (5)$$

where ∼ denotes a Poincaré asymptotic expansion w.r.t. ν, s = (1+r^2)^(−1/2), and the U_i(s) are polynomials. Here, the bottleneck dimension in the prediction head is p = 256, and this results in ν = 127, which is large in this context. Empirically, evaluating only the first term in the sum, U_0(s) = 1, seems to be sufficient; see Appendix A.2 for details. With this approximation, we compute C_p(κ^(k)) for the k-th component and modify the logit scores for component k for both teacher and student from ⟨w^(k), y⟩/τ to ⟨w^(k), y⟩/τ + log C_p(κ^(k)). That is, we add log C_p(κ^(k)) to the logit scores in (3) and (4) to obtain the following expressions for DINO-vMF:

$$P_s^{(k)}(x_s) \propto_k \exp\big(\langle w_s^{(k)}, y_s \rangle / \tau_s + \log C_p(\kappa_s^{(k)})\big), \qquad (6)$$

$$P_t^{(k)}(x_t) \propto_k \exp\big(-c^{(k)}\big) \exp\big(\langle w_t^{(k)}, y_t \rangle / \tau_t + \log C_p(\kappa_t^{(k)})\big). \qquad (7)$$

3.4 AVOIDING COLLAPSE

The probability distributions are computed based on the teacher and student network outputs using a softmax function. With the motivation of avoiding collapse, centering and sharpening operations are applied to the network outputs to obtain the logit scores for the probability distribution. The centering parameter c in DINO is computed as an EMA of the teacher network outputs with a forgetting parameter m ≈ 1 as follows:

$$c^{(k)} \leftarrow m\, c^{(k)} + (1-m)\, \frac{1}{B} \sum_{b=1}^{B} \big\langle w_t^{(k)}, y_{t,b} \big\rangle \qquad (8)$$

where B is the batch size. However, when centering is done in the logit space as in (8) with our modified logit scores ⟨w_t^(k), y_t⟩/τ + log C_p(κ_t^(k)), an EMA of the log normalization constant will be added to c^(k). When the centering operation is performed on the teacher outputs, the impact of the log normalization constant is dampened. To avoid this, we propose to instead compute the probability distributions for the images in the batch, and then average these probability distributions. This is also closer to how mixture proportions π^(k) are computed in the M-step of the EM algorithm for a mixture model, although the estimated probabilities play a different role in DINO; see Appendix A.4 for a discussion. Our proposal for computing the centering parameter c is thus:

$$c^{(k)} \leftarrow m\, c^{(k)} + (1-m)\, \log\Big[\frac{1}{B} \sum_{b=1}^{B} \mathrm{softmax}\big(W_t^{\top} y_{t,b} / \tau_t\big)^{(k)}\Big] \qquad (9)$$
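A compact PyTorch sketch of the modification is given below, under stated assumptions: `raw_logits` are the inner products ⟨w^(k), y⟩ produced by the prediction head, p defaults to the 256-dimensional bottleneck, the center update follows Eq. (9) as written above, and m = 0.9 is only a typical choice. The released implementation may organize these steps differently.

```python
import math
import torch
import torch.nn.functional as F

def log_Cp_approx(kappa, p=256):
    # log C_p(kappa) via the uniform expansion in Eq. (5), keeping only U_0(s) = 1.
    nu = p / 2 - 1
    r = kappa / nu
    eta = torch.sqrt(1 + r ** 2) + torch.log(r / (1 + torch.sqrt(1 + r ** 2)))
    log_bessel = nu * eta - 0.5 * math.log(2 * math.pi * nu) - 0.25 * torch.log(1 + r ** 2)
    return nu * torch.log(kappa) - (p / 2) * math.log(2 * math.pi) - log_bessel

def dino_vmf_probs(raw_logits, prototypes, tau, center=None):
    """Eqs. (6)-(7): vMF-corrected logits <w_k, y>/tau + log C_p(kappa_k); the
    teacher additionally subtracts the probability-space center c."""
    kappa = prototypes.norm(dim=-1) / tau                           # (K,)
    logits = raw_logits / tau + log_Cp_approx(kappa, prototypes.shape[-1])
    if center is not None:
        logits = logits - center
    return F.softmax(logits, dim=-1)

@torch.no_grad()
def update_center(center, teacher_raw_logits, tau_t, m=0.9):
    # Eq. (9): EMA of the log of the batch-averaged teacher probabilities.
    probs = F.softmax(teacher_raw_logits / tau_t, dim=-1)
    return m * center + (1 - m) * probs.mean(dim=0).log()
```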
4 RELATED WORK

Clustering-based and prototypical methods: SSL has advanced significantly in recent times and a line of work based on clustering has also evolved. Caron et al. (2018) proposed a predictive task using "pseudo-labels" generated by K-means clustering in the representation space. Following its success, several methods based on clustering have been proposed (Caron et al., 2020; Asano et al., 2020; Caron et al., 2019; Li et al., 2020; Zhuang et al., 2019; Assran et al., 2022; 2021). These methods learn a set of prototype vectors or cluster centroids that correspond with the clusters. When combined with ViTs and self-distillation, Caron et al. (2021) demonstrated that the learned representations possess useful properties: strong nearest neighbor embeddings and rich spatial semantic features. While the DINO formulation was only based on the [CLS] representation, several recent works have extended DINO to utilize patch-level representations in the objective formulation (Li et al., 2021; Zhou et al., 2021). A common aspect among most recent methods is that they utilize L2-normalized prototypes. Our work is based on the DINO formulation. By incorporating our work in iBOT, we show that our insights can prove beneficial to other methods as well. Our method differs from prior works by proposing a way to remove the L2-normalization constraint on the prototypes and hence enable a flexible mixture model in the representation space. Assran et al. (2022) use an entropy regularization to avoid collapse, and our centering computation shares similarity with it in terms of computing intra-batch averages of probability distributions. However, we only average teacher probability distributions to compute the centering parameter instead of applying an explicit regularization of the student outputs.

Mixture of von Mises-Fisher distributions: Banerjee et al. (2005) proposed a mixture of vMF (movMF) distribution to perform clustering on the unit hypersphere. The EM updates are often done using an approximate reparameterization of the vMF precision parameter κ (Banerjee et al., 2005; Barbaro & Rossi, 2021). Instead, we learn κ directly through the prototype magnitudes. Some image segmentation methods have used movMF formulations (Yang et al., 2020; Hwang et al., 2019) but assume a fixed κ for all mixture components to simplify computations. Taghia et al. (2014) and Gopal & Yang (2014) use a complete Bayesian approach based on variational inference to model data as a movMF. Recent works relate the cosine similarity formulation commonly used in contrastive SSL to a vMF density (Shwartz-Ziv et al., 2022; Lee et al., 2021; Wang & Isola, 2020). Our mixture model interpretation is closest to Hasnat et al. (2017), which proposes a movMF loss to train a supervised face verification model, but they make similar uniformity assumptions as DINO.

Table 1: ImageNet kNN classification accuracy, ablating the impact of L2-normalization of prototypes, vMF normalization and probability centering. Averages over 2 runs are reported. Refer to A.6.1 for results of individual runs. (No normalization with logit centering is the default setting for DINO/ViT-S; L2 normalization with logit centering is the default setting for DINO/ViT-B.)

| Normalization | Centering: logit | Centering: probability |
|---|---|---|
| None | 70.18 | 70.49 |
| L2 | 69.51 | 69.86 |
| vMF | 70.21 | 70.87 |

Table 2: ImageNet classification accuracy using linear and kNN classifiers. ([C] = [CLS], [D] = DINO)

| Method | kNN | Lin. [C] | Lin. [D] |
|---|---|---|---|
| ViT-Base/16 | | | |
| iBOT | 77.10 | 79.33 | 79.50 |
| iBOT-vMF | 78.66 | 80.20 | 80.27 |
| DINO | 76.11 | 77.88 | 78.17 |
| DINO-vMF | 77.40 | 78.67 | 78.81 |
| BEiT-v2 | – | 80.10 | – |
| MSN | 73.33 | 75.84 | 74.80 |
| ViT-Small/16 | | | |
| DINO | 74.44 | 76.09 | 76.98 |
| DINO-vMF | 74.74 | 76.19 | 76.97 |
| MSN | 74.86 | 76.20 | 76.62 |
| iBOT | 75.20 | 77.90 | – |

5 EXPERIMENTS

We closely follow the training settings for different backbones as specified in the public repository for DINO3, in order to ensure fair comparison. The models are pre-trained on the ImageNet dataset (Deng et al., 2009) with the AdamW optimizer (Loshchilov & Hutter, 2018). Weight decay with a cosine schedule from 0.04 to 0.4 is used.
The student temperature τ_s is set to 0.1 and the teacher temperature is linearly scaled from 0.04 to 0.07 over some initial epochs (50 epochs for ViT-Small/16 and 30 epochs for ViT-Base/16). We follow the same data augmentations as DINO and BYOL (Grill et al., 2020): color jittering, Gaussian blur and solarization. The multi-crop approach is used with the same definition of global and local crop sizes as DINO. The model trainings are done on a single A100 node consisting of 8 GPUs. The batch sizes are adapted to fit the node and adjusted based on the model architecture (batch size = 64 per GPU for ViT-Base/16 and 128 for ViT-Small/16). The vMF normalization computation does not add any noticeable overhead to standard DINO and iBOT. Due to computational reasons, we train ViT-Small/8 for only 40 epochs (refer to A.6.2) and do not consider ViTs larger than the Base variant. For comparison of results, we use the publicly released DINO and MSN4 pre-trained models available in their respective repositories.

3 https://github.com/facebookresearch/dino
4 https://github.com/facebookresearch/msn

5.1 IMAGENET CLASSIFICATION

5.1.1 ABLATION STUDIES

We conducted ablation experiments to study the impact of our proposed modifications to DINO. We consider small-scale experiments by training a ViT-Small/16 backbone for 100 epochs with all other settings maintained similar to the standard training. We report the kNN top-1 classification accuracy on ImageNet in Table 1 by averaging over 2 runs. We observe that the performance is better when the cluster prototypes are not L2-normalized. We find that vMF normalization and probability centering individually lead to performance improvements. Best performance is achieved when both vMF normalization and centering in probability space are used simultaneously. On this basis, we decide to use vMF normalization along with centering in the probability space in all other experiments. Note that the maximal improvement is observed when compared to the default setting of DINO/ViT-B.

5.1.2 IMAGENET CLASSIFICATION WITH FULL DATASET

The quality of representations learned by self-supervised models is usually evaluated by linear classification benchmarks on top of a frozen pre-trained backbone model. DINO is also shown to be an efficient nearest neighbor classifier. For linear evaluation, we follow the same protocol as DINO. Additionally, we also evaluate the performance using only the [CLS] representation. For kNN evaluation, we consider a straightforward sweep over a range of values for nearest neighbors and generally find k = 10 to work best with DINO-vMF models.
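The kNN evaluation on frozen features can be sketched as a cosine-similarity weighted vote over the k nearest training features. The full protocol follows DINO; the temperature and voting scheme below are simplifications for illustration, not the evaluation code used for the reported numbers.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_classify(train_feats, train_labels, test_feats, k=10, num_classes=1000, T=0.07):
    train = F.normalize(train_feats, dim=-1)
    test = F.normalize(test_feats, dim=-1)
    sim = test @ train.t()                          # cosine similarities (N_test, N_train)
    topk_sim, topk_idx = sim.topk(k, dim=-1)
    topk_labels = train_labels[topk_idx]            # (N_test, k)
    weights = (topk_sim / T).exp()
    votes = torch.zeros(test.shape[0], num_classes, device=test.device)
    votes.scatter_add_(1, topk_labels, weights)     # similarity-weighted class votes
    return votes.argmax(dim=-1)
```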
The accuracy achieved by linear and kNN classifiers on ImageNet is reported in Table 2 (more baselines in A.6.2). With ViT-Base/16, both DINO-vMF and iBOT-vMF clearly perform better than their standard counterparts. With ViT-Small/16, the performance improvement is found to be marginal. MSN (Assran et al., 2022) shows competitive performance on ViT-Small/16 but it is significantly worse on the larger ViT-Base/16.

5.1.3 IMAGENET FEW-SHOT EVALUATION

Few-shot classification is an important application area for SSL pre-training. MSN (Assran et al., 2022) pre-training is found to be effective at few-shot classification tasks. We compare DINO, iBOT, their vMF versions and MSN pre-training using the same evaluation protocol as MSN. A linear classifier is trained on features obtained using a center crop of the image from the frozen pre-trained model. L2-regularization with a constant strength of 0.075 is used. We consider different levels of labelled data availability: 1, 2 and 5 images per class, and 1% of all the training images. We report the top-1 accuracy in Table 3. For the 1, 2 and 5 images per class experiments, we use three different splits and report the mean of the accuracy (refer to A.6.4 for standard deviations). With the ViT-Base/16 model, we observe a large improvement by adding the vMF modification, even surpassing the performance achieved by MSN (Assran et al., 2022), which specifically targets the few-shot learning problem. In comparison, the ViT-Small/16 model with DINO-vMF pre-training only improves marginally over DINO and is found to be lagging behind MSN by a large margin.

Table 3: ImageNet few-shot evaluation (as in MSN)

ViT-Base/16:

| Labelled data | DINO | MSN | DINO-vMF | iBOT | iBOT-vMF |
|---|---|---|---|---|---|
| 1 img/cls | 41.8 | 49.8 | 50.3 | 46.0 | 51.6 |
| 2 imgs/cls | 51.9 | 58.9 | 59.3 | 56.0 | 61.1 |
| 5 imgs/cls | 61.4 | 65.5 | 66.1 | 64.7 | 68.3 |
| 1% imgs | 67.2 | 69.1 | 70.4 | 69.9 | 72.3 |

ViT-Small/16:

| Labelled data | DINO | MSN | DINO-vMF |
|---|---|---|---|
| 1 img/cls | 38.9 | 47.1 | 39.2 |
| 2 imgs/cls | 48.9 | 55.8 | 49.4 |
| 5 imgs/cls | 58.5 | 62.8 | 59.1 |
| 1% imgs | 64.5 | 67.2 | 65.0 |

5.2 ANALYSIS OF LEARNED VMF MIXTURE MODEL

In this section, we take a deeper look at the mixture modeling task, which acts as the supervisory signal to train the self-supervised model.

Prototype utilization: Inspecting the prototype vectors learned by DINO, we observed that many of them were almost identical. We consider two prototypes w^(i) and w^(j) to be duplicates if their cosine similarity is greater than a threshold value. We identify unique prototypes by removing such duplicates. Recall that DINO uses a large number of prototypes, set to 65536 for all models. Assran et al. (2022) do not observe a benefit in using more than 1024 prototypes, but fewer prototypes lead to worse performance. In Table 4, we show the number of unique prototypes and the size of the largest set of duplicate prototypes based on a cosine similarity threshold of 0.9 (in A.3, we show that our observations hold at other threshold choices as well). Interestingly, when pre-trained with DINO, the ViT-Base model produced one large duplicate set containing 83% of all the prototypes. Based on highest probability, no data samples are assigned to the corresponding mixture components. We refer to this as the void prototype set. Notably, the void prototype has a cosine similarity of exactly −1 with the mean of y over all the training data. This means that the void prototype vector points in a direction exactly opposite to the data mean ȳ. This occurs because the model minimizes the logit scores of unused prototypes, i.e. void prototypes, by moving them as far away as possible from the data representations to minimize their effect on the loss. With DINO-vMF pre-training, we do not observe that any such void prototype set is formed, and more unique prototypes are utilized than with standard DINO pre-training. Utilizing more prototypes increases the difficulty of the SSL task and we expect this to result in better image representations.

Table 4: Prototype utilization based on a cosine similarity threshold of 0.9.

| Architecture | # unique prototypes (DINO) | # unique prototypes (DINO-vMF) | Largest duplicate set (DINO) | Largest duplicate set (DINO-vMF) |
|---|---|---|---|---|
| ViT-Small/16 | 961 | 1122 | 2087 | 1226 |
| ViT-Base/16 | 775 | 928 | 54877 | 1208 |
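One straightforward way to obtain counts like those in Table 4 is a greedy de-duplication of the prototype matrix by pairwise cosine similarity, as sketched below. The paper does not spell out its exact procedure, so this is an assumption; a similar check of the void-prototype direction against the mean representation ȳ can be added on top.

```python
import torch
import torch.nn.functional as F

def prototype_utilization(W, threshold=0.9):
    """Greedy de-duplication: a prototype is kept only if its cosine similarity
    to every previously kept prototype is at most the threshold."""
    Wn = F.normalize(W, dim=-1)
    kept_idx, dup_count = [], {}
    for i in range(Wn.shape[0]):
        if kept_idx:
            sims = Wn[i] @ Wn[kept_idx].t()
            j = int(sims.argmax())
            if sims[j] > threshold:                 # duplicate of an already kept prototype
                dup_count[kept_idx[j]] = dup_count.get(kept_idx[j], 0) + 1
                continue
        kept_idx.append(i)
    largest_set = 1 + max(dup_count.values(), default=0)
    return len(kept_idx), largest_set               # unique prototypes, largest duplicate set
```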
Interpreting learnt vMF precision: In the DINO-vMF formulation, the L2-norm of the prototypes is directly related to the vMF precision as κ^(k) = ‖w^(k)‖/τ. We investigate if the precision values can be indicative of image classification difficulty by evaluating downstream performance. We associate an image with the component k that has the maximum logit score under the DINO(-vMF) formulation. For each model, we divide the data based on percentile ranges of the precision values of the associated components. In Figure 2, we show the kNN top-1 validation accuracy on ImageNet for subsets of data restricted to lie in the different precision percentile ranges. For DINO-vMF, we observe that the kNN classification accuracy increases with increasing precision values. That is, data points associated with higher precision components appear to be easier to classify, and vice versa. The prototype magnitudes learned by DINO (only in ViT-Small/16) do not exhibit such a clear association.

Figure 2: kNN accuracy for data sorted based on percentile ranges of the associated ‖w^(k)‖ (x-axis: percentile based on prototype precision; y-axis: kNN val. top-1 accuracy (%); curves for ViT-S/16 DINO, ViT-S/16 DINO-vMF and ViT-B/16 DINO-vMF).

5.3 TRANSFER LEARNING

An important benefit of SSL is that the learned model can be transferred to downstream tasks to achieve competitive performance with limited computational cost or limited training data. The tasks we consider are: linear classification on small datasets, image retrieval and video object segmentation.

Linear classification on small datasets: We conduct linear classification experiments on a suite of datasets, training on features extracted from a frozen pre-trained model. The implementation details for this experiment are explained in A.6.3. In Table 5, we report the linear classification accuracy on the test or validation dataset, depending on which is publicly available. First, we observe that the DINO-vMF model performs on par with or better than the standard DINO model on most of the datasets with both ViT-Small/16 and ViT-Base/16 architectures. The iBOT-vMF model consistently performs best for most datasets. We also observe that the MSN model performs worse than both the DINO and DINO-vMF models on all datasets. Additional fine-tuning results are given in A.6.5.

Table 5: Linear classification accuracy when transferred to other datasets

| Method | Acft. | Cal101 | C10 | C100 | DTD | Flwrs. | Food | Pets | SUN | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| ViT-Base/16 | | | | | | | | | | |
| DINO | 57.7 | 94.5 | 96.9 | 86.3 | 74.8 | 96.0 | 81.6 | 93.9 | 68.5 | 83.4 |
| DINO-vMF | 57.3 | 94.5 | 97.1 | 86.3 | 74.8 | 95.7 | 82.5 | 94.6 | 68.7 | 83.5 |
| iBOT | 55.9 | 94.7 | 97.6 | 87.4 | 74.4 | 95.4 | 82.7 | 93.8 | 69.4 | 83.5 |
| iBOT-vMF | 58.1 | 95.5 | 98.0 | 88.0 | 74.7 | 94.8 | 83.6 | 93.9 | 70.2 | 84.1 |
| MSN | 51.0 | 92.8 | 96.9 | 85.3 | 73.7 | 92.8 | 80.0 | 93.9 | 66.8 | 81.5 |
| ViT-Small/16 | | | | | | | | | | |
| DINO | 53.8 | 93.1 | 96.2 | 83.5 | 74.7 | 94.7 | 79.7 | 93.8 | 66.8 | 81.8 |
| DINO-vMF | 56.0 | 93.7 | 96.0 | 83.9 | 74.1 | 95.0 | 80.1 | 93.9 | 66.6 | 82.1 |
| MSN | 51.6 | 93.1 | 95.9 | 82.9 | 72.0 | 93.3 | 77.8 | 92.8 | 65.5 | 80.5 |
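The linear probes above are L2-regularized linear classifiers on frozen features, with the regularization strength cross-validated per dataset (see A.6.3). A scikit-learn sketch is given below; the mapping of the regularization "strength" to the inverse of LogisticRegression's C parameter is an assumption on our side, and `fit_linear_probe` is a hypothetical helper, not part of the released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fit_linear_probe(train_feats, train_labels):
    # 45 regularization strengths spaced linearly in log-space over [-6, 5], chosen by 5-fold CV.
    strengths = np.logspace(-6, 5, 45)
    best_C, best_score = None, -np.inf
    for s in strengths:
        clf = LogisticRegression(C=1.0 / s, max_iter=1000)
        score = cross_val_score(clf, train_feats, train_labels, cv=5).mean()
        if score > best_score:
            best_C, best_score = 1.0 / s, score
    return LogisticRegression(C=best_C, max_iter=1000).fit(train_feats, train_labels)
```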
Image Retrieval: We consider the face-blurred versions (v1.0) of the Oxford (Philbin et al., 2007) and Paris (Philbin et al., 2008) datasets and use the medium (M) and hard (H) data splits. We perform nearest-neighbour-based image retrieval using ImageNet pre-trained frozen features and report the mean average precision (mAP). We follow a similar evaluation protocol to DINO but we use newer versions of the datasets. We observe improved performance with the ViT-Base/16 model when using the vMF versions compared to standard DINO and iBOT on both datasets. However, the ViT-Small/16 model pre-trained with DINO-vMF is only on par with or slightly worse than standard DINO pre-training. Further, we find that the MSN pre-trained model for ViT-Small/16 also performs competitively.

Table 6: Image retrieval performance (mAP; the two column pairs correspond to the two retrieval datasets, with medium (M) and hard (H) splits)

| Pretrain | M | H | M | H |
|---|---|---|---|---|
| ViT-Base/16 | | | | |
| DINO | 63.7 | 35.8 | 34.3 | 10.8 |
| DINO-vMF | 66.7 | 39.5 | 37.3 | 12.5 |
| iBOT | 64.1 | 36.6 | 35.2 | 13.4 |
| iBOT-vMF | 65.4 | 38.1 | 38.1 | 13.8 |
| MSN | 56.3 | 29.3 | 31.8 | 11.8 |
| ViT-Small/16 | | | | |
| DINO | 63.1 | 34.5 | 34.6 | 13.0 |
| DINO-vMF | 62.1 | 33.4 | 34.5 | 12.6 |
| MSN | 61.5 | 33.3 | 36.6 | 15.1 |

Video Object Segmentation (VOS): We consider the DAVIS-2017 video instance segmentation benchmark (Pont-Tuset et al., 2017). We use the frozen features from the ImageNet pre-trained model and use a nearest-neighbour-based segmentation approach following the same experimental protocol as Caron et al. (2021) and Jabri et al. (2020). We compare our results with standard DINO, iBOT and MSN in Table 7. We find that the vMF versions show marginal improvements with both ViT backbones. We also find that MSN transfers poorly to this task compared to DINO and DINO-vMF.

Table 7: Video Object Segmentation (VOS)

| Method | (J&F)m | Jm | Fm |
|---|---|---|---|
| ViT-Base/16 | | | |
| DINO | 62.4 | 60.8 | 64.0 |
| DINO-vMF | 63.4 | 61.6 | 65.2 |
| iBOT | 62.7 | 61.8 | 63.7 |
| iBOT-vMF | 63.1 | 61.9 | 64.2 |
| MSN | 58.0 | 56.1 | 60.0 |
| ViT-Small/16 | | | |
| DINO | 61.8 | 60.2 | 63.4 |
| DINO-vMF | 62.7 | 60.9 | 64.5 |
| MSN | 59.6 | 57.6 | 61.6 |

6 CONCLUSION

We reinterpreted the DINO (Caron et al., 2021) SSL method as a mixture of von Mises-Fisher distributions in latent space. Based on this interpretation, we proposed to use unnormalized prototypes with appropriate normalization of the component logit scores to enable a more flexible mixture model. Our proposed modifications enable stable training of DINO using larger ViT models and better cluster utilization. The modification requires log-normalizing constants during training. However, using the proposed approximation these are very fast to compute and do not result in any noticeable computational overhead. At inference time the log-normalizer is not used at all. We empirically demonstrated consistent performance gains on a range of transfer tasks. The larger ViT-Base/16 model benefits the most from our proposed modifications and shows significant improvements on all the considered downstream tasks. We attribute this to the fact that, without the proposed modifications, the larger models require unfavourable L2-normalization of the prototypes to obtain stable training. The smaller models are already trained without L2-normalization, and for these models we see more marginal gains. Thus, our study has put a spotlight on a seemingly important design choice that, in our opinion, has not been discussed enough in prior work. By improving iBOT using our vMF modification, we show that our work extends to other methods built on the DINO formulation.
DINO-vMF was motivated by a mixture model interpretation of DINO, but we nevertheless use a quite similar training algorithm. For future work it would be interesting to investigate if this interpretation can be used to motivate additional modifications. One could for instance relate the DINO training scheme to stochastic versions of the EM algorithm (Delyon et al., 1999; Celeux & Diebolt, 1992). The soft cluster assignments computed using the teacher network can be viewed as the E-step, where the EMA is similar to the stochastic approximation of the auxiliary quantity used in SAEM (Delyon et al., 1999). The stochasticity comes from mini-batch sampling and multi-crop training. The main difference is in terms of the objective that is optimized in the M-step, where EM maximizes the ELBO whereas DINO(-vMF) minimizes the KL divergence between cluster probabilities. Exploring this connection in more detail could possibly result in a better understanding of clustering-based SSL methods, as well as additional performance boosts.

REPRODUCIBILITY STATEMENT

We intend to make a public release of the code repository and pre-trained models in order to aid the research community to reproduce our experiments. We do not introduce any new hyperparameters in our pre-training method and use the same hyperparameter setup as DINO (Caron et al., 2021) and iBOT (Zhou et al., 2021), which are already publicly available. For other downstream tasks, we closely follow the experimental protocol of other works and explicitly state any differences in the paper.

ACKNOWLEDGMENTS

This research is financially supported by the Swedish Research Council via the project Handling Uncertainty in Machine Learning Systems (contract number: 2020-04122), the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and the Excellence Center at Linköping-Lund in Information Technology (ELLIIT). The computations were enabled by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre.

REFERENCES

Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, and Michael Rabbat. Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. In ICCV, 2021.

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In ECCV, 2022.

Arindam Banerjee, Inderjit S Dhillon, Joydeep Ghosh, Suvrit Sra, and Greg Ridgeway. Clustering on the unit hypersphere using von Mises-Fisher distributions. JMLR, 2005.

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2021.

Florian Barbaro and Fabrice Rossi. Sparse mixture of von Mises-Fisher distribution. In ESANN, 2021.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: Mining discriminative components with random forests. In ECCV, 2014.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. NeurIPS, 1993.

Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 2020.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.

Gilles Celeux and Jean Diebolt. A stochastic approximation type EM algorithm for the mixture problem. Stochastics: An International Journal of Probability and Stochastic Processes, 1992.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020a.

Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2021.

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b.

Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021.

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.

Bernard Delyon, Marc Lavielle, and Eric Moulines. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 1999.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

DLMF. NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov/, Release 1.1.6 of 2022-06-30, 2022. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain, eds.

Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.

Linus Ericsson, Henry Gouk, and Timothy M Hospedales. How well do self-supervised models transfer? In CVPR, 2021.

Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning. In ICML, 2021.

Siddharth Gopal and Yiming Yang. Von Mises-Fisher clustering models. In ICML, 2014.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS, 2020.

Md Hasnat, Julien Bohné, Jonathan Milgram, Stéphane Gentric, Liming Chen, et al. Von Mises-Fisher mixture model-based deep learning: Application to face verification. arXiv preprint arXiv:1706.04264, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.

Jyh-Jing Hwang, Stella X Yu, Jianbo Shi, Maxwell D Collins, Tien-Ju Yang, Xiao Zhang, and Liang-Chieh Chen. SegSort: Segmentation by discriminative sorting of segments. In ICCV, 2019.

Allan Jabri, Andrew Owens, and Alexei Efros. Space-time correspondence as a contrastive random walk. NeurIPS, 2020.

Dahun Kim, Donghyeon Cho, Donggeun Yoo, and In So Kweon. Learning image representations by completing damaged jigsaw puzzles. In WACV, 2018.

Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.

Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama, John Canny, and Ian Fischer. Compressive visual representations. NeurIPS, 2021.

Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, and Jianfeng Gao. Efficient self-supervised vision transformers for representation learning. In ICLR, 2021.

Fei-Fei Li, Marco Andreeto, Marc Aurelio Ranzato, and Pietro Perona. Caltech 101, 2022. URL https://data.caltech.edu/records/20086.

Junnan Li, Pan Zhou, Caiming Xiong, and Steven Hoi. Prototypical contrastive learning of unsupervised representations. In ICLR, 2020.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.

Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. 2018.

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.

Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020.

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008.

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, 2012.

Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.

James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.

James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675, 2017.

Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NeurIPS, 2016.

Ravid Shwartz-Ziv, Randall Balestriero, and Yann LeCun. What do we maximize in self-supervised learning? In First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML, 2022.
Jalil Taghia, Zhanyu Ma, and Arne Leijon. Bayesian estimation of the von-Mises Fisher mixture model with variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS, 2017.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV, 2020.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.

Hugo Touvron, Matthieu Cord, and Hervé Jégou. DeiT III: Revenge of the ViT. In ECCV, 2022.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu. Self-supervised learning with Swin transformers. arXiv preprint arXiv:2105.04553, 2021.

Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, and Qixiang Ye. Prototype mixture models for few-shot semantic segmentation. In ECCV, 2020.

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In ICML, 2021.

Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2021.

Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, 2019.

A.1 INTUITIVE COMPARISON OF DINO AND DINO-VMF

Figure 3: von Mises-Fisher density on the circle for a prototype vector pointing in the direction of 315°, for two different values of the prototype magnitude (larger magnitude: blue curve, smaller magnitude: red curve).

In Section 3.3, we stated that in a properly normalized formulation, the model cannot naively increase ‖w^(k)‖ to increase the probability of assigning a data sample to a cluster or mixture component. We expand on this idea by providing an intuitive example. Consider a 2-dimensional latent space on the unit circle. Let y denote a vector in this latent space with ‖y‖ = 1. We consider a simple setting with only two clusters, z ∈ {1, 2}, and let w^(1) and w^(2) be prototypes representing these clusters. The DINO formulation proposes the following logit score l^(1)_DINO for assigning vector y to cluster 1:

$$l^{(1)}_{\text{DINO}} = \langle w^{(1)}, y \rangle = \|w^{(1)}\| \cos\theta^{(1)}$$

For simplicity, let us ignore the temperature scaling that is applied in the sharpening step. Here, the model can simply increase ‖w^(1)‖ to increase l^(1)_DINO as long as cos θ^(1) > 0. Instead, our proposed DINO-vMF uses the following logit score:

$$l^{(1)}_{\text{DINO-vMF}} = \langle w^{(1)}, y \rangle + \log C_p(\|w^{(1)}\|) = \|w^{(1)}\| \cos\theta^{(1)} + \log C_p(\|w^{(1)}\|)$$

where log C_p(·) < 0 and log C_p(‖w^(1)‖) monotonically decreases with ‖w^(1)‖. So, merely increasing ‖w^(1)‖ can increase the first term, but this is counteracted by the second term. For given values of ‖w^(1)‖ and θ^(1), there exists a threshold θ̂^(1) below which increasing ‖w^(1)‖ results in an increased logit score. The logit score decreases when θ^(1) is larger than this threshold. So, the model needs to decrease θ^(1) by mapping the vector y closer to the prototype in order to benefit from increasing ‖w^(1)‖. We show an illustration of how the vMF density varies with the prototype magnitude in Figure 3. This implies that, if the DINO-vMF model can consistently map a set of images close to a prototype, only then can it benefit from increasing the magnitude of that prototype.
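The 2-dimensional example can be checked numerically: for p = 2 the normalization constant is C_2(κ) = 1/(2π I_0(κ)), and evaluating both logit scores for a few angles and magnitudes reproduces the threshold behaviour described above. The particular angles and magnitudes below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import i0e  # exponentially scaled I_0

def log_C2(kappa):
    # log C_2(kappa) = -log(2*pi*I_0(kappa)); log I_0(kappa) = log(i0e(kappa)) + kappa.
    return -(np.log(2 * np.pi) + np.log(i0e(kappa)) + kappa)

def logit_scores(w_norm, theta):
    l_dino = w_norm * np.cos(theta)          # DINO logit (temperature ignored, as in A.1)
    l_vmf = l_dino + log_C2(w_norm)          # DINO-vMF logit
    return l_dino, l_vmf

# For a well-aligned sample (theta = 10 deg) growing ||w|| raises the vMF logit,
# while for a poorly aligned sample (theta = 60 deg) the log C_2 penalty dominates.
for theta_deg in (10, 60):
    for w_norm in (1.0, 5.0, 25.0):
        print(theta_deg, w_norm, logit_scores(w_norm, np.deg2rad(theta_deg)))
```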
A.2 VON MISES-FISHER NORMALIZATION CONSTANT

Given a von Mises-Fisher distribution in p dimensions with parameters μ and κ, its normalization constant C_p(κ) is given by:

$$C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2} I_{p/2-1}(\kappa)}$$

where I_ν denotes the modified Bessel function of the first kind and order ν. We approximate the I_{p/2−1}(κ) term as explained in Section 3.3 by using an asymptotic expansion. The softmax-logit scores for each mixture component are obtained by adding the network output and the log-normalization constant corresponding to that component. Because of the softmax formulation, it is sufficient for the log-normalization constant approximation to be correct up to a constant. We compare the values of log C_p(κ) obtained using our approximation to those obtained by using the scipy implementation of I_ν instead. We consider κ ∈ [20, 500] and in Figure 4 we show the values of log C_p(κ) obtained using our approximation. We did produce the same figure with the same values computed using the scipy-based implementation, but the lines appear to completely overlap with our approximation, so this is not shown. We define an approximation error ϵ(κ) with respect to the scipy implementation as follows:

$$\epsilon(\kappa) = \log C_p^{(a)}(\kappa) - \log C_p^{(s)}(\kappa)$$

where C_p^{(a)}(κ) denotes our approximation and C_p^{(s)}(κ) denotes the scipy-based implementation5. The error values are shown in Figure 5 for different bottleneck dimensions p in the DINO prediction head. For p = 256, as in the case of DINO, we can observe that the approximation error is small compared to the actual function values.

5 https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.iv.html

Figure 4: Our approximation up to a constant, log C_p^{(a)}(κ), for p = 64, 128, 256 and 512.

Figure 5: Approximation error in computing log C_p(κ) compared to the scipy-based implementation. Note that the error is small compared to the approximated function values shown in Figure 4.
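The comparison in this appendix can be reproduced with a few lines of NumPy/SciPy, as sketched below; the sampling grid over κ and the set of dimensions are illustrative choices.

```python
import numpy as np
from scipy.special import ive

def log_Cp_approx(kappa, p):
    # Leading-term uniform asymptotic expansion of Eq. (5), i.e. up to U_0(s) = 1.
    nu = p / 2 - 1
    r = kappa / nu
    eta = np.sqrt(1 + r ** 2) + np.log(r / (1 + np.sqrt(1 + r ** 2)))
    log_iv = nu * eta - 0.5 * np.log(2 * np.pi * nu) - 0.25 * np.log(1 + r ** 2)
    return nu * np.log(kappa) - (p / 2) * np.log(2 * np.pi) - log_iv

def log_Cp_scipy(kappa, p):
    nu = p / 2 - 1
    log_iv = np.log(ive(nu, kappa)) + kappa        # exact log I_nu via the scaled Bessel function
    return nu * np.log(kappa) - (p / 2) * np.log(2 * np.pi) - log_iv

kappa = np.linspace(20, 500, 25)
for p in (64, 128, 256, 512):
    eps = log_Cp_approx(kappa, p) - log_Cp_scipy(kappa, p)
    print(p, float(np.abs(eps).max()))             # cf. Figure 5
```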
A.3 PROTOTYPE UTILIZATION AT DIFFERENT SIMILARITY THRESHOLDS

As explained in Section 5.2, we identify unique prototypes based on a cosine similarity threshold. Here, we present a further discussion about how the prototype utilization varies with the choice of similarity threshold. In Figures 6 and 7, we show the number of unique prototypes and the size of the largest duplicate set of prototypes. An unusually large duplicate set indicates a void prototype set. We observe that the void prototype set in ViT-Base/16 hardly changes in size even when the similarity threshold is increased up to 0.99. We observe a higher prototype utilization with DINO-vMF compared to DINO for both the ViT-Base and ViT-Small models. The only exception is when we set the similarity threshold to a large value of 0.99 for the ViT-Small/16 model trained with DINO. We expect these near-duplicates to further increase in similarity if the model training is continued for more epochs.

Figure 6: Number of unique prototypes for different cosine similarity thresholds (0.50-0.99) used to identify duplicate prototypes, for ViT-Base/16 and ViT-Small/16 with DINO and DINO-vMF. Even at relatively low similarity thresholds, DINO-vMF utilizes more unique and adequately separated prototypes.

Figure 7: Size of the largest set of duplicate prototypes at different cosine similarity thresholds, for ViT-Base/16 and ViT-Small/16. DINO pre-trained ViT-Base models contain a large duplicate set, which we call the void prototype set. This is avoided in DINO-vMF pre-training. Recall that the ViT-Base models are trained with L2-normalized prototypes whereas the ViT-Small models are trained with unnormalized prototypes.

A.4 RELATIONSHIP BETWEEN CENTERING AND CLUSTER PRIOR

The centering parameter c^(k), computed according to Eq. (8) in DINO or Eq. (9) in DINO-vMF, is related to an estimate of the cluster prior π^(k) in our mixture model interpretation. Indeed, for DINO, the centering parameter averages the logit scores of the components, scaled by the sharpening parameter τ_t, so exp(c^(k)/τ_t) can be viewed as an estimate of π^(k) up to normalization. For DINO-vMF we similarly average the cluster probability (i.e., responsibility) over a batch at each iteration, and update our estimate of c^(k) using an EMA of the corresponding logits; see (9). Note that the sharpening parameter τ_t is included in the expression for the probabilities, which means that, in this formulation, we can estimate π^(k) as exp(c^(k)) up to normalization, i.e. without the scaling τ_t. The key difference between the two estimates is thus whether the averaging is done in logit space or in probability space. The latter is closer to how one would typically estimate π^(k) in a standard mixture model, e.g. using the EM algorithm, as an average of the responsibility for cluster k.

Combining the aforementioned estimates with the expressions for the teacher probabilities (Eq. (4) for DINO or Eq. (7) for DINO-vMF) and comparing with the responsibilities as computed for a mixture model in Eq. (2), it is tempting to simply relate the factors involving c^(k) with the prior probability π^(k) in (2). However, this would be an incorrect interpretation since there is a minus sign in the exponents in Eqs. (4) and (7). To understand this discrepancy, note first that in a mixture model we would estimate π^(k) from the available data, but also encourage data points to be assigned to components with higher π^(k)-values. This can be thought of as a "rich gets richer" approach. If a certain cluster is found to be dominant, then this information is fed back to the estimates of the responsibilities, increasing the probability of assigning data points to this cluster. Such an approach makes sense when the data is fixed, and we are simply trying to learn a clustering model that fits the data as well as possible.
However, it is also distinctly different from the role of the centering in DINO(-vMF), where the data are mapped to latent representations and the clustering is carried out in this latent space. For these models we fix π to be uniform. The centering can be viewed as a way to further encourage this uniformity. This is achieved by generating targets using the inverse of the estimated cluster probabilities, i.e. instead of multiplying with π^(k) ∝ exp(c^(k)) we multiply with 1/π^(k) ∝ exp(−c^(k)). This can be viewed as a way to rebalance the component assignments, by encouraging the model to assign more samples to less used components and fewer samples to more highly used components. We effectively obtain a "rich gets poorer" and "poor gets richer" strategy.

A.5 OUR MODIFICATION OF IBOT WITH VMF NORMALIZATION

We incorporate our vMF modification in iBOT (Zhou et al., 2021), which uses a DINO-derived loss formulation. The self-supervised pre-training in iBOT uses a self-distillation objective, L_[CLS], and a masked image modeling (MIM) objective, L_[MIM]. Both objectives are formulated as cross-entropy terms similar to DINO, based on the teacher and student probability distributions. We apply the vMF modification to the teacher and student probability distributions for the [CLS] and patch tokens as follows:

$$P_s^{[CLS]}(x)^{(k)} \propto_k \exp\big(\langle w_s^{(k)}, y^{[CLS]} \rangle / \tau_s + \log C_p(\kappa_s^{(k)})\big),$$

$$P_t^{[CLS]}(x)^{(k)} \propto_k \exp\big(-c^{(k)}\big) \exp\big(\langle w_t^{(k)}, y^{[CLS]} \rangle / \tau_t + \log C_p(\kappa_t^{(k)})\big),$$

$$P_s^{patch}(x)^{(k)} \propto_k \exp\big(\langle w_s^{(k)}, y^{patch} \rangle / \tau_s + \log C_p(\kappa_s^{(k)})\big),$$

$$P_t^{patch}(x)^{(k)} \propto_k \exp\big(-c^{(k)}\big) \exp\big(\langle w_t^{(k)}, y^{patch} \rangle / \tau_t + \log C_p(\kappa_t^{(k)})\big).$$

Consider image views u and v and N masked views û_i and v̂_i with masks m_i. Then, the overall loss in iBOT is simply computed as a sum of the objectives L_[CLS] and L_[MIM], which are formulated as follows:

$$L_{[CLS]} = -P_t^{[CLS]}(v)^{\top} \log P_s^{[CLS]}(u), \qquad L_{[MIM]} = -\sum_{i=1}^{N} m_i \cdot P_t^{patch}(u_i)^{\top} \log P_s^{patch}(\hat{u}_i)$$
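The combined objective can be sketched as follows, under stated assumptions: `cls_*` and `patch_*` are raw inner-product scores ⟨w_k, y⟩, `log_C_s`/`log_C_t` hold the per-prototype constants log C_p(κ_k) for the student and teacher heads, `c_*` are the centering vectors, and `mask` marks the masked patch positions. All names are illustrative and this is not the released iBOT-vMF code.

```python
import torch
import torch.nn.functional as F

def cross_entropy(p_teacher, logp_student, weight=None):
    # -sum_k p_t^(k) log p_s^(k) per sample, optionally restricted by a mask/weight.
    loss = -(p_teacher * logp_student).sum(dim=-1)
    if weight is None:
        return loss.mean()
    return (weight * loss).sum() / weight.sum().clamp(min=1)

def ibot_vmf_loss(cls_s, cls_t, patch_s, patch_t, mask, log_C_s, log_C_t,
                  c_cls, c_patch, tau_s=0.1, tau_t=0.04):
    # L_[CLS]: self-distillation on the [CLS] token with vMF-corrected logits.
    p_t = F.softmax(cls_t / tau_t + log_C_t - c_cls, dim=-1).detach()
    logp_s = F.log_softmax(cls_s / tau_s + log_C_s, dim=-1)
    loss_cls = cross_entropy(p_t, logp_s)
    # L_[MIM]: the same cross-entropy on patch tokens, only at masked positions.
    q_t = F.softmax(patch_t / tau_t + log_C_t - c_patch, dim=-1).detach()
    logq_s = F.log_softmax(patch_s / tau_s + log_C_s, dim=-1)
    loss_mim = cross_entropy(q_t, logq_s, weight=mask.float())
    return loss_cls + loss_mim
```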
A.6 ADDITIONAL EXPERIMENTS AND DETAILS

A.6.1 DETAILED ABLATION STUDIES

We run the ablation studies explained in 5.1.1 for 2 runs, and all pre-trainings within a run are done with manually fixed seeds to ensure that the results are comparable and reproducible within each run. The results for each run are reported in Table 8.

Table 8: ImageNet kNN classification accuracy ablating the impact of L2-normalization of prototypes, vMF normalization and probability centering for 2 manually seeded trainings. (No normalization with logit centering is the default setting for DINO/ViT-S; L2 normalization with logit centering is the default setting for DINO/ViT-B.)

| Normalization | Run 1 (seed=0): logit | Run 1 (seed=0): probability | Run 2 (seed=42): logit | Run 2 (seed=42): probability |
|---|---|---|---|---|
| None | 70.43 | 70.74 | 69.93 | 70.24 |
| L2 | 69.72 | 70.18 | 69.30 | 70.17 |
| vMF | 70.51 | 71.12 | 69.91 | 70.62 |

A.6.2 IMAGENET CLASSIFICATION WITH FULL DATASET

We present additional baseline comparisons for the ViT-Small/16 and ViT-Base/16 architectures in Tables 9 and 10, respectively. In addition to the results presented based on DINO and DINO-vMF pre-trained ViT-Small/16 and ViT-Base/16 models, we consider the ViT-Small/8 backbone architecture that uses a smaller patch size of 8. This model is more expensive to train compared to the ViT-Small/16 model. DINO pre-trained this model for 800 epochs. We conduct a smaller experiment for 40 epochs with the same hyperparameters as in DINO. The results are reported in Table 11. The results are similar to our observations regarding the ViT-Small/16 architecture.

Table 9: ImageNet classification accuracy using linear and kNN classifiers for ViT-Small/16

| Method | kNN | Lin. [C] | Lin. [D] |
|---|---|---|---|
| DINO | 74.44 | 76.09 | 76.98 |
| DINO-vMF | 74.74 | 76.19 | 76.97 |
| MSN | 74.86 | 76.20 | 76.62 |
| SwAV | 66.3 | 73.5 | – |
| iBOT | 75.2 | 77.9 | – |
| MoCo-v3 | – | 73.4 | – |
| MoBY | – | 72.8 | – |
| MoCo-v2 | 64.4 | 72.7 | – |
| BYOL | 66.6 | 71.4 | – |

Table 10: ImageNet classification accuracy using linear and kNN classifiers for ViT-Base/16

| Method | kNN | Lin. [C] | Lin. [D] |
|---|---|---|---|
| iBOT | 77.10 | 79.33 | 79.50 |
| iBOT-vMF | 78.66 | 80.20 | 80.27 |
| DINO | 76.11 | 77.88 | 78.17 |
| DINO-vMF | 77.40 | 78.67 | 78.81 |
| BEiT-v2 | – | 80.10 | – |
| MSN | 73.33 | 75.84 | 74.80 |
| MoCo-v3 | – | 76.7 | – |
| MAE | – | 68.0 | – |
| BEiT | – | 56.7 | – |

Table 11: ImageNet classification accuracy using linear and kNN classifiers for the ViT-S/8 architecture

| Method | kNN | Linear ([CLS]) | Linear (DINO) |
|---|---|---|---|
| DINO | 69.98 | 72.75 | 74.24 |
| DINO-vMF | 70.41 | 72.91 | 74.14 |

A.6.3 IMPLEMENTATION DETAILS FOR LINEAR CLASSIFICATION ON SMALL DATASETS

We consider the following small datasets: Aircraft (Maji et al., 2013), Caltech101 (Li et al., 2022), CIFAR10, CIFAR100 (Krizhevsky, 2009), DTD (Cimpoi et al., 2014), Flowers (Nilsback & Zisserman, 2008), Food (Bossard et al., 2014), Pets (Parkhi et al., 2012) and SUN397 (Xiao et al., 2010). The linear classifier is trained with L2-regularization, and the strength is chosen for each dataset by 5-fold cross-validation among a set of 45 values spaced linearly in the range [−6, 5] in log-space, following Ericsson et al. (2021).

A.6.4 IMAGENET FEW-SHOT EVALUATION

The few-shot evaluations involving 1, 2 and 5 images per class were conducted using three different splits. In Table 12 and Table 13, we show the mean and standard deviation of the top-1 validation classification accuracy over the three splits.

Table 12: ImageNet few-shot evaluation (as in MSN) with the ViT-Base/16 model

| Labelled data | DINO | MSN | DINO-vMF | iBOT | iBOT-vMF |
|---|---|---|---|---|---|
| 1 img/cls | 41.8 ± 0.3 | 49.8 ± 0.2 | 50.3 ± 0.2 | 46.0 ± 0.3 | 51.6 ± 0.1 |
| 2 imgs/cls | 51.9 ± 0.6 | 58.9 ± 0.4 | 59.3 ± 0.4 | 56.0 ± 0.8 | 61.1 ± 0.7 |
| 5 imgs/cls | 61.4 ± 0.2 | 65.5 ± 0.3 | 66.1 ± 0.2 | 64.7 ± 0.3 | 68.3 ± 0.3 |

Table 13: ImageNet few-shot evaluation (as in MSN) with the ViT-Small/16 model

| Labelled data | DINO | MSN | DINO-vMF |
|---|---|---|---|
| 1 img/cls | 38.9 ± 0.4 | 47.1 ± 0.1 | 39.2 ± 0.6 |
| 2 imgs/cls | 48.9 ± 0.3 | 55.8 ± 0.6 | 49.4 ± 0.7 |
| 5 imgs/cls | 58.5 ± 0.1 | 62.8 ± 0.3 | 59.1 ± 0.2 |

A.6.5 FINE-TUNING EVALUATION

We conduct fine-tuning experiments on ImageNet-1K following the same training protocol as Bao et al. (2021), which is found to consistently produce the best fine-tuning results at a minimum number of training epochs (Zhou et al., 2021). We train our ViT-S/16 and ViT-B/16 models for 200 and 100 epochs, respectively. We report the best result obtained after conducting a sweep over the learning rates {8e-4, 9e-4, 1e-3, 2e-3}. For fine-tuning on smaller datasets, we follow the fine-tuning recipe of Touvron et al. (2021) of training for 1000 epochs with a small learning rate of 7.5e-6. We report the accuracy for the CIFAR-10, CIFAR-100 and Flowers datasets, and for the Aircraft dataset we report the mean-per-class accuracy, in Tables 14 and 15. We observe that the vMF variants are mostly on par with or slightly better than their standard counterparts. While performance on CIFAR-10, CIFAR-100 and Flowers is beginning to saturate, we observe greater performance gains on the Aircraft dataset.
Table 14: Fine-tuning results with ViT-Small/16 after pre-training on ImageNet-1K

| Method | CIFAR10 | CIFAR100 | Flowers | Aircraft | ImageNet-1K |
|---|---|---|---|---|---|
| Random (DeiT) | 99.0 | 89.5 | 98.2 | – | 79.9 |
| Random (DeiT-III) | – | – | – | – | 81.4 |
| BEiT | 98.6 | 87.4 | 96.4 | – | – |
| DINO | 99.0 | 90.5 | 98.5 | 84.2 | 82.0 |
| DINO-vMF | 99.0 | 90.7 | 98.7 | 85.0 | 81.8 |
| iBOT | 99.1 | 90.7 | 98.6 | 85.7 | 82.3 |

Table 15: Fine-tuning results with ViT-Base/16 after pre-training on ImageNet-1K

| Method | CIFAR10 | CIFAR100 | Flowers | Aircraft | ImageNet-1K |
|---|---|---|---|---|---|
| Random (DeiT) | 99.0 | 90.8 | 98.4 | – | 81.8 |
| Random (DeiT-III) | – | – | – | – | 83.8 |
| BEiT | 99.0 | 90.1 | 98.0 | – | 83.4 |
| DINO | 99.1 | 91.7 | 98.8 | 84.5 | 83.6 |
| DINO-vMF | 99.2 | 91.9 | 98.9 | 84.8 | 83.6 |
| iBOT | 99.2 | 92.2 | 98.9 | 85.1 | 84.0 |
| iBOT-vMF | 99.2 | 92.6 | 98.8 | 85.8 | 84.1 |
| MAE | – | – | – | – | 83.6 |