# Preserving Angles Improves Feature Distillation

Published in Transactions on Machine Learning Research (11/2025)

Evelyn J. Mannix (evelyn.mannix@unimelb.edu.au), Liam Hodgkinson (lhodgkinson@unimelb.edu.au), Howard Bondell (howard.bondell@unimelb.edu.au)
School of Mathematics and Statistics, University of Melbourne

Reviewed on OpenReview: https://openreview.net/forum?id=ZEhgODZkWU

**Abstract.** Knowledge distillation methods compress models by training a student network using the classification outputs of a high-quality teacher model, but can fail to effectively transfer the properties of computer vision foundation models from the teacher to the student. While it has been recently shown that feature distillation, in which a teacher model's output features are replicated instead, can reproduce the performance of foundation models across numerous downstream tasks, such approaches fall short in matching critical properties such as robustness and out-of-distribution (OOD) detection performance. This paper overcomes this shortcoming by introducing Cosine-similarity Preserving Compression (CosPress), a feature distillation technique that learns a mapping to compress the latent space of the teacher model into the smaller latent space of the student, by preserving the cosine similarities between image embeddings. This enables direct optimisation of the student network and produces a more faithful reproduction of the teacher's properties. It is shown that distillation with CosPress on a variety of datasets, including ImageNet, produces more accurate models with greater performance on generalisability, robustness and OOD detection benchmarks, and that this technique provides a competitive pathway for training highly performant lightweight models on small datasets. Code is available at github.com/emannix/cospress.
## 1 Introduction

Deep learning computer vision approaches have become the standard for automating vision problems across a range of fields, from medical imaging (Zhang & Metaxas, 2024) to analysis of satellite imagery (Bastani et al., 2023) and detecting weapons in luggage (Andriyanov, 2024). However, models trained with commonly used supervised learning approaches can have poor robustness (Bai et al., 2021; Hendrycks et al., 2021b) and struggle to detect out-of-distribution (OOD) data (Yang et al., 2022a; Nguyen et al., 2015). By leveraging large Vision Transformer (ViT) architectures and pretraining on large and diverse datasets, foundation models in computer vision comprise a significant step forward toward addressing these challenges, providing significantly improved generalisation ability and robustness in comparison to purely supervised approaches (Oquab et al., 2024; Radford et al., 2021).

Large ViT models enjoy superior performance after pre-training (Zhai et al., 2022), and can be distilled to produce smaller models that are more practical for deployment. For example, smaller DINOv2 foundation models were distilled from their largest variant with 1.1 billion parameters by optimising the self-supervised training objective with a frozen teacher (Oquab et al., 2024). This approach is not replicable, as it was conducted on the proprietary LVD-142M dataset and the teacher head weights were never publicly released.

Figure 1: Patch features. PCA visualisation of patch features for the DINOv2 ViT-S/14 model, and the distilled ViT-Ti/14 model produced using the CosPress feature distillation approach.

Nevertheless, knowledge distillation approaches (Hinton et al., 2015) that leverage the classification outputs of the DINOv2 models trained on a particular dataset can be used to train performant student models.
However, it has been shown that fundamental properties of the foundation model need not transfer to the student models, impacting generalisation performance on downstream tasks (Zhang et al., 2025). Feature distillation approaches, which use a Mean Squared Error (MSE) or L2 loss and a student head to map the activations from the latent space of the student to the teacher, are more effective in this respect. The recent Proteus (Zhang et al., 2025) approach has shown that it is possible to distill the DINOv2 models on ImageNet-1K and obtain comparable performance on downstream classification and segmentation tasks. However, sub-optimal results are still obtained for key robustness metrics, as well as generalisability and performance in dense tasks such as image segmentation. Most concerning, however, is that models distilled using Proteus do not faithfully reproduce the latent space of the teacher model, as shown by their severely reduced performance on out-of-distribution (OOD) detection tasks (Table 1).

This paper presents Cosine-similarity Preserving Compression (CosPress), a feature distillation approach that addresses such shortcomings by learning a teacher head mapping that compresses the latent space of the teacher model into the latent space of the student. CosPress achieves this by preserving cosine similarities between points in the teacher's latent space, and allows the student to be directly optimised. This significantly improves the faithfulness of the learned student on OOD detection (Table 1) and robustness benchmarks in comparison to Proteus, while achieving competitive performance across all of the considered challenges, including classification accuracy, generalisation and semantic segmentation. CosPress reproduces the high-quality patch features of the foundation model (Fig. 1), and can be used to produce specialised models that have improved accuracy on particular tasks but retain these foundation model properties.
In this work, we:

- present CosPress, an approach for distilling ViT foundation models that learns a mapping from the latent space of the teacher to the student that preserves cosine similarities and allows direct optimisation of the student model;
- demonstrate that CosPress produces a more faithful student model, better replicating the performance of the teacher across a range of metrics including robustness, generalisability and out-of-distribution detection; and
- show that CosPress can be used to train specialised models with improved performance on a particular vision task, while retaining foundation model properties such as improved generalisability and out-of-distribution detection performance.

Table 1: Out-of-distribution detection. Comparison of performance on the OpenOOD benchmark for the ImageNet-1K dataset. The ↑ means larger values are better and the ↓ means smaller values are better.

| Method | Arch | Teacher | Near OOD AUROC ↑ | Near OOD FPR ↓ | Far OOD AUROC ↑ | Far OOD FPR ↓ |
|---|---|---|---|---|---|---|
| Proteus | ViT-Ti/14 | DINOv2 ViT-S/14 | 64.17 | 85.73 | 74.22 | 67.97 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | 70.49 | 77.29 | 91.03 | 37.21 |
| DINOv2 | ViT-S/14 | – | 72.58 | 74.12 | 92.67 | 29.55 |
| Proteus | ViT-S/14 | DINOv2 ViT-B/14 | 61.19 | 94.56 | 61.92 | 86.78 |
| CosPress | ViT-S/14 | DINOv2 ViT-B/14 | 73.5 | 73.84 | 92.93 | 28.98 |

## 2 Related Work

**Foundation models.** Foundation models in computer vision follow the success of transformer-based foundation models in language, such as BERT (Devlin et al., 2019), and encode images as vectors in latent space, where the distance between vectors describes the semantic similarity of the images. Two approaches have emerged for training these models: self-supervised learning (Oquab et al., 2024) and contrastive language-image pretraining (CLIP) (Radford et al., 2021).
The CLIP models were among the first to show that by using a large Vision Transformer (ViT) architecture (Dosovitskiy et al., 2021) and a large, diverse and high-quality training dataset, a generalist vision model could be produced that achieves high performance across a range of applications (Radford et al., 2021). The DINOv2 foundation models followed, using a combined bootstrapping (Grill et al., 2020; Caron et al., 2021) and masked patch prediction (Zhou et al., 2022) approach to train foundation models with strong performance on image classification and segmentation tasks (Oquab et al., 2024).

**Knowledge distillation.** Knowledge distillation is the process of transferring knowledge from a large model or model ensemble to a single smaller model. The earliest approaches aligned the output probability vectors of the student and teacher classifications using a Kullback-Leibler (KL) divergence loss (Hinton et al., 2015). There is a wide range of literature demonstrating how this approach can improve the performance of smaller Convolutional Neural Networks (CNNs) (Wei et al., 2020) and Vision Transformers (ViTs) (Touvron et al., 2021; Yang et al., 2024), by leveraging a strong teacher or one with different inductive biases to the student model. It has been shown that knowledge distillation is most effective when it is treated as a function matching problem (Beyer et al., 2022), with the same inputs being provided to both the teacher and student model.

**Feature distillation.** Feature distillation, where the output features of the teacher are used for training the student instead of the classification outputs, is less well studied for models without class outputs. Generally speaking, feature distillation is used in combination with a knowledge distillation objective and often focuses on supervised models.
However, the Proteus approach (Zhang et al., 2025) demonstrated that using pure feature distillation objectives is important for preserving foundation model properties. The components of Proteus (a student head to align output dimensions, an MSE loss on class and patch tokens, and an iBOT (Zhou et al., 2022) inspired masking objective) are a logical adaptation of components in prior supervised feature distillation methods, such as Masked Generative Distillation (MGD) (Yang et al., 2022b), SRD (Miles & Mikolajczyk, 2024) and VkD (Miles et al., 2024), to a ViT architecture, with the aim of preserving both the local and global features of the teacher.

**Dimensionality reduction.** Feature distillation, and the challenge of compressing latent spaces to train performant student models, are closely related to the broader ideas of dimensionality reduction and minimum distortion embeddings (Agrawal et al., 2021). Stochastic Neighbor Embedding (SNE) (Hinton & Roweis, 2002; Van der Maaten & Hinton, 2008) is a dimensionality reduction technique that projects high-dimensional embeddings into a low-dimensional space (typically two dimensions) while preserving local relationships. SNE achieves this by constructing a probability distribution over pairs of points in the original space and then optimising a corresponding set of points in the low-dimensional space to match this higher dimensional distribution as closely as possible. However, SNE does not learn an explicit mapping between the original and reduced spaces, only a lower dimensional representation. Other approaches, in contrast, explicitly learn projection functions, often with the goal of preserving local geometric structures such as distances or angles (Saul & Roweis, 2000; He & Niyogi, 2003; Gao et al., 2020; Fischer & Ma, 2024).
Figure 2: Feature distillation frameworks. In Proteus, student heads $g_\phi, g_\psi, g_\nu$ are used to map the outputs of the student network $S_\theta$ into the latent space of the frozen teacher $T$, so that an MSE loss $\mathcal{L}_{\text{proteus}}$ can be applied. In CosPress, a teacher head $h_\phi$ is trained with a compression loss $\mathcal{L}_{\text{dim-red}}$ to compress the teacher $T$ outputs into the student latent space, preserving the cosine similarity of image embeddings and allowing direct optimisation via the student loss $\mathcal{L}_{\text{student}}$. The Proteus student head does not preserve cosine similarity, even when the projection matrices are forced to be right-orthogonal.

**Notation.** We consider a feature distillation setting, where there is a small student network $S_\theta$ with output dimensionality $D_S$ and a larger frozen teacher network $T$ with output dimensionality $D_T$. These networks use a ViT architecture, so we assume $D_T > D_S$. The loss functions presented in this paper consider a mini-batch stochastic gradient descent setting, defined for a batch of images $x_i \in X$. When only the output class tokens are considered, $S^c_\theta(x_i)$ and $T^c(x_i)$ are used. We write $S(x_i), T(x_i)$ to refer to a matrix of the concatenated patch and class token outputs.

**Motivation.** We are interested in the problem of training a student network to mimic the behaviour of a large, high-quality teacher model using a ViT architecture, such as the DINOv2 foundation models. A key property of these models is that the cosine similarity between image and patch embeddings captures their semantic similarity, as shown by the use of this measure for zero-shot classification and identifying duplicate imagery (Jose et al., 2024; Oquab et al., 2024; Radford et al., 2021).
However, larger ViT architectures have a larger output dimensionality, which prevents the embeddings produced by smaller student models from being directly compared to a teacher model embedding. Proteus (Zhang et al., 2025) addresses this problem by introducing a student head $g : \mathbb{R}^{D_S} \to \mathbb{R}^{D_T}$ that maps the outputs of the student model into the latent space of the teacher model, allowing an MSE ($L_2$) loss to be applied. This student head contains a projection matrix $W \in \mathbb{R}^{D_S \times D_T}$, which maps from the latent space of the student to the teacher, and is commonly discarded.

Two issues arise with this approach. First, the projection may encode information specific to replicating the teacher network, potentially distorting the outputs of the student model, which is only indirectly optimised (Miles et al., 2024). Second, there is no guarantee that the projection matrix $W$ will be faithful in preserving cosine similarities between teacher embeddings of different images, similarities that reflect semantic relationships (Jose et al., 2024), within the student's latent space.

Prior work has found that requiring $W$ to be right-orthogonal addresses the first problem (Miles et al., 2024). Right-orthogonality means that $WW^\top = I_{D_S}$ must be satisfied, where $I_{D_S} \in \mathbb{R}^{D_S \times D_S}$ is the identity matrix of rank $D_S$. However, this implies that for any image $i$, we have

$$S^c_\theta(x_i) W \, T^c(x_i)^\top = S^c_\theta(x_i) \left( T^c(x_i) W^\top \right)^\top. \qquad (1)$$

Consequently, the cosine distance between image embeddings in the student network relates to the teacher, for any two images $i, j$, via

$$\frac{S^c_\theta(x_i) \, S^c_\theta(x_j)^\top}{\| S^c_\theta(x_i) \| \, \| S^c_\theta(x_j) \|} \approx \frac{T^c(x_i) \, W^\top W \, T^c(x_j)^\top}{\| T^c(x_i) W^\top \| \, \| T^c(x_j) W^\top \|}. \qquad (2)$$

This implies that $W$ also needs to be left-orthogonal to address the second problem and ensure that cosine similarities are preserved. Asserting left-orthogonality with $W^\top W = \alpha I_{D_T}$ for some scalar $\alpha$ would yield the desired relationship

$$\frac{S^c_\theta(x_i) \, S^c_\theta(x_j)^\top}{\| S^c_\theta(x_i) \| \, \| S^c_\theta(x_j) \|} \approx \frac{T^c(x_i) \, T^c(x_j)^\top}{\| T^c(x_i) \| \, \| T^c(x_j) \|}. \qquad (3)$$
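The algebra above can be checked numerically. The following is a minimal numpy sketch (the dimensions are illustrative, not those of the actual models): it builds a right-orthogonal $W$ via a QR decomposition, verifies the inner-product identity underlying Eq. (1), and confirms that $W^\top W$ only has rank $D_S$, so it can never equal a scaled identity of rank $D_T$.

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_T = 4, 16  # illustrative student and teacher dimensions

# Build a right-orthogonal W (W W^T = I_{D_S}) from a QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((D_T, D_S)))
W = Q.T  # shape (D_S, D_T), orthonormal rows

assert np.allclose(W @ W.T, np.eye(D_S))  # right-orthogonal

s = rng.standard_normal(D_S)  # a mock student class token
t = rng.standard_normal(D_T)  # a mock teacher class token

# Inner products agree whether W is applied to the student
# or W^T is applied to the teacher.
lhs = (s @ W) @ t
rhs = s @ (t @ W.T)
print(abs(lhs - rhs))  # ~0

# But W^T W is a rank-D_S matrix, so it cannot equal a scaled I_{D_T}.
print(np.linalg.matrix_rank(W.T @ W))
```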
Unfortunately, the projection $W$ can only be approximately left-orthogonal, as $W$ is only of rank $D_S$, which means the product $W^\top W$ can only ever be of rank $D_S$. As $I_{D_T}$ is of rank $D_T > D_S$, it follows that $W^\top W \neq \alpha I_{D_T}$. Consider the following definition.

**Definition 1** (Approximately Orthogonal Matrix). A matrix $M \in \mathbb{R}^{m \times d}$ is said to be approximately orthogonal if $\| MM^\top - \alpha I_m \|_F < \varepsilon$ and $\| M^\top M - \beta I_d \|_F < \varepsilon$, for sufficiently small $\varepsilon > 0$ and real scalars $\alpha, \beta$. If only one of these conditions is satisfied, the matrix is said to be approximately right or left orthogonal, as appropriate.

This notion expands upon the idea of orthogonality, where a matrix $M$ can only be orthogonal ($\varepsilon = 0$, $\alpha = \beta = 1$) if it is square ($m = d$).

**Lemma 1.** Let $M \in \mathbb{R}^{m \times d}$ with $m < d$ and $\operatorname{rank}(M) = m$. Then

$$\left\| MM^\top - \tfrac{d}{m} I_m \right\|_F \le \left\| M^\top M - I_d \right\|_F.$$

Moreover, the converse inequality does not generally hold.

This lemma shows that approximate right-orthogonality is sufficient for a matrix to also be approximately left-orthogonal, and therefore to be approximately orthogonal overall. Consequently, both conditions Eq. (1) and Eq. (3) are satisfied, ensuring that the mapping does not encode information while preserving the relationships between image embeddings. The proof of Lemma 1 stems from $M$ having more columns than rows, which can be downsampled to form a square matrix, and is provided in Section A of the supporting information. While prior methods such as VkD introduce parametrisations that require $W$ to be right-orthogonal (Miles et al., 2024), they do not also guarantee approximate left-orthogonality.

While Eq. (3) could be optimised directly using an SNE (Hinton & Roweis, 2002; Van der Maaten & Hinton, 2008) inspired approach, our initial experiments found that this was less effective than Proteus, due to this target having a complex loss surface with many local minima.
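The rank obstruction in this argument is easy to verify: a matrix with exactly orthonormal rows is perfectly right-orthogonal, yet its left Gram matrix $M^\top M$ is an orthogonal projection with $m$ unit eigenvalues and $d - m$ zero eigenvalues, so its Frobenius distance to $I_d$ is exactly $\sqrt{d - m}$. A small numpy sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 16, 64  # illustrative "student" and "teacher" dimensions

# M with exactly orthonormal rows: M M^T = I_m, perfectly right-orthogonal.
Q, _ = np.linalg.qr(rng.standard_normal((d, m)))
M = Q.T

right_err = np.linalg.norm(M @ M.T - np.eye(m))
# M^T M is the orthogonal projection onto the row space of M,
# so its distance to I_d cannot fall below sqrt(d - m).
left_err = np.linalg.norm(M.T @ M - np.eye(d))
print(right_err)        # ~0
print(left_err)         # equals sqrt(d - m)
print(np.sqrt(d - m))
```

This is the unavoidable floor behind the "approximately" in Definition 1: for $D_T > D_S$ the left condition can only be met up to this residual.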
Instead, it is proposed to use a teacher head, rather than a student head, to learn a function $h_\phi : \mathbb{R}^{D_T} \to \mathbb{R}^{D_S}$ that compresses the representation of the teacher into the latent space of the student while preserving cosine similarities,

$$\frac{T^c(x_i) \, T^c(x_j)^\top}{\| T^c(x_i) \| \, \| T^c(x_j) \|} \approx \frac{h_\phi(T^c(x_i)) \, h_\phi(T^c(x_j))^\top}{\| h_\phi(T^c(x_i)) \| \, \| h_\phi(T^c(x_j)) \|}, \qquad (4)$$

which allows the student $S^c_\theta(x_i)$ to be directly optimised against the compressed teacher representation $h_\phi(T^c(x_i))$. Learning the $h_\phi$ mapping is a tractable problem, as described by the Johnson-Lindenstrauss (JL) Lemma, which states that such a mapping can be constructed with a margin of error that depends on the dimensionality of the target space and the size of the dataset of interest (Freksen, 2021). Further details are provided in Section A, and Fig. 2 highlights the differences between the Proteus and CosPress frameworks.

**Proteus.** Zhang et al. (2025) propose to minimise the MSE ($L_2$) loss between the outputs of the teacher and those of the student, when passed through a dimension-raising map called the student head $g : \mathbb{R}^{D_S} \to \mathbb{R}^{D_T}$. To achieve best performance, they use three student heads with different weights $\phi, \psi, \nu$ and minimise the $L_2$ loss separately on the class tokens, the features (class and patch tokens), and on randomly masked tokens $X_M$, similar to the MGD (Yang et al., 2022b) approach. This leads to the following optimisation loss

$$\mathcal{L}_{\text{proteus}}(X; \phi, \psi, \nu, \theta) = L_2(g_\phi(S_\theta(X)), T(X)) + L_2(g_\psi(S^c_\theta(X)), T^c(X)) + L_2(g_\nu(S_\theta(X_M)_M), T(X)_M). \qquad (5)$$

**CosPress.** Our approach, CosPress, separates the challenge of feature distillation into two parts,

$$\mathcal{L}_{\text{CosPress}}(X; \phi, \theta) = \mathcal{L}_{\text{dim-red}}(X; \phi) + \mathcal{L}_{\text{student}}(X; \theta). \qquad (6)$$

Firstly, a teacher head $h_\phi : \mathbb{R}^{D_T} \to \mathbb{R}^{D_S}$ is learnt to map the teacher outputs $T$ to the latent space of the student network $S_\theta$ while preserving cosine similarities. This dimensionality reduction loss term $\mathcal{L}_{\text{dim-red}}$ is independent of fitting the student network.
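The JL-style tractability argument can be illustrated directly: even an untrained Gaussian random map from a larger to a smaller space already preserves cosine similarities up to modest distortion, so learning $h_\phi$ improves on an already reasonable starting point. A hedged numpy sketch (the dimensions and sample count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D_T, D_S = 32, 1024, 256

# Unit-norm stand-ins for teacher class-token embeddings.
T = rng.standard_normal((n, D_T))
T /= np.linalg.norm(T, axis=1, keepdims=True)

# A plain Gaussian random map, as in the JL construction (no training).
W = rng.standard_normal((D_T, D_S)) / np.sqrt(D_S)
Z = T @ W

def cosine_matrix(X):
    # Pairwise cosine similarities between the rows of X.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

# Distortion of pairwise cosine similarities under the random projection.
err = np.abs(cosine_matrix(T) - cosine_matrix(Z))
print(err.max(), err.mean())
```

The distortion shrinks as $D_S$ grows, matching the JL Lemma's dependence on the target dimensionality.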
It only requires the target dimension in order to fit the teacher head $h_\phi$. Secondly, the student network $S_\theta$ is trained to match the image of the teacher under the teacher head, $h_\phi \circ T$, in the student loss term $\mathcal{L}_{\text{student}}$. The most effective way to fit this term is to freeze the teacher head $h_\phi$ gradients and train on both losses concurrently, as shown in Fig. 2. Using a weighting scheme was observed to produce similar results (Section C).

**Dimensionality reduction objective.** To build a loss function that will ensure the mapping $h_\phi$ satisfies Eq. (4), an SNE (Hinton & Roweis, 2002; Van der Maaten & Hinton, 2008) inspired approach is used. This involves defining a kernel to build distributions describing the similarity between vectors, allowing for embeddings in the high dimensional input space to be aligned with the low dimensional target space by minimising the KL divergence between these distributions. We define a kernel using the von Mises-Fisher distribution, where for input vectors $y$ and $z$ we have

$$k_\tau(y; z) \propto \exp\left( \frac{y \, z^\top}{\| y \| \, \| z \| \, \tau} \right), \qquad (7)$$

with temperature hyperparameter $\tau$. As a result, Eq. (4) becomes

$$k_\tau(T^c(x_i); T^c(x_j)) \approx k_\tau(h_\phi(T^c(x_i)); h_\phi(T^c(x_j))), \qquad (8)$$

for each $i, j$. Then, for a set of vectors in the input space $p_i \in p$ and target space $q_i \in q$ of size $N$, we construct the matrices $P^\tau, Q^\tau$ that define the input and target distributions by

$$P^\tau_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \qquad p_{j|i} = \frac{k_\tau(p_i; p_j)}{\sum_{k \neq i} k_\tau(p_i; p_k)}, \qquad (9)$$

$$Q^\tau_{ij} = \frac{q_{j|i} + q_{i|j}}{2N}, \qquad q_{j|i} = \frac{k_\tau(q_i; q_j)}{\sum_{k \neq i} k_\tau(q_i; q_k)}, \qquad (10)$$

where the first equation in each line builds symmetric $P^\tau, Q^\tau$ matrices, allowing for greater flexibility in the solution. If these $P^\tau, Q^\tau$ matrices are equal, the cosine similarity between pairs of points in $p, q$ will be equal and Eq. (4) will be satisfied. This can be achieved approximately by minimising the KL divergence ($D_{KL}$) over $\boldsymbol{\tau}$, a vector of temperature values, via

$$\mathcal{L}_{KL}(p, q) = \frac{1}{|\boldsymbol{\tau}|} \sum_{\tau \in \boldsymbol{\tau}} D_{KL}(P^\tau \,\|\, Q^\tau). \qquad (11)$$

An ablation study on the best values of $\boldsymbol{\tau}$ is described in Table S21 in the supporting information.
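As a concrete reference for Eqs. (7)–(11), the following numpy sketch builds the von Mises-Fisher kernel, the symmetrised pairwise distribution $P^\tau$, and a single-temperature KL term. This is an illustrative re-implementation, not the released CosPress code:

```python
import numpy as np

def vmf_kernel(X, tau):
    # Pairwise von Mises-Fisher kernel k_tau(x_i; x_j) over the rows of X.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.exp((Xn @ Xn.T) / tau)

def similarity_matrix(X, tau):
    # Symmetrised SNE-style distribution over pairs of rows, as in Eq. (9).
    N = X.shape[0]
    K = vmf_kernel(X, tau)
    np.fill_diagonal(K, 0.0)                  # exclude self-similarity
    cond = K / K.sum(axis=1, keepdims=True)   # conditional p_{j|i}
    return (cond + cond.T) / (2 * N)          # P_ij = (p_{j|i} + p_{i|j}) / 2N

def kl_alignment_loss(P, Q, eps=1e-12):
    # D_KL(P || Q) over the off-diagonal entries, one temperature at a time.
    mask = ~np.eye(P.shape[0], dtype=bool)
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 32))  # mock embeddings
P = similarity_matrix(X, tau=0.1)
print(P.sum())                    # off-diagonal mass sums to 1
print(kl_alignment_loss(P, P))    # 0 when the distributions match
```

Fitting $h_\phi$ then amounts to minimising this KL term between the distribution built from the original teacher embeddings and the one built from their compressed images, averaged over the temperatures in $\boldsymbol{\tau}$.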
Putting this all together, we propose a dimensionality reduction loss that conserves cosine similarity at two levels: between the image class tokens in a batch, and between the features (patch and class tokens) within an image,

$$\mathcal{L}_{\text{dim-red}}(X; \phi) = \mathcal{L}_{KL}(h_\phi(T^c(X)), T^c(X)) + \sum_i \mathcal{L}_{KL}(h_\phi(T(x_i)), T(x_i)). \qquad (12)$$

The calculation of $\mathcal{L}_{KL}(h_\phi(T^c(X)), T^c(X))$ is the only term in the CosPress loss that is calculated between examples in a batch, and that will scale non-linearly with increasing batch size. All other terms in both CosPress and Proteus are computed within individual examples and scale linearly with batch size.

**Teacher head architecture.** As done for the Proteus (Zhang et al., 2025) student heads, the teacher head architecture in CosPress uses a LayerNorm (Ba, 2016) followed by a linear layer. This can be written as

$$h_\phi(z) = \left( \gamma \odot \frac{z - \bar{z}}{\sigma_z} + \beta_1 \right) W + \beta_2, \qquad (13)$$

where $\bar{z}$ and $\sigma_z$ are the mean and standard deviation of the $z$ vector elements, and the initialisation scheme sets the biases $\beta_1, \beta_2$ to zero and the scaling $\gamma$ to one at the start of training. The linear map $W \in \mathbb{R}^{D_T \times D_S}$ is initialised using a random normal distribution as is standard, which is consistent with the mapping constructed in the JL Lemma (Section A). For the Proteus student heads $g$, $W$ is replaced by $W^\top$ and the dimensions of the bias vectors are adjusted accordingly.

**Student objective.** The student objective minimises the cosine distance

$$\mathcal{L}_{\text{cosine}}(z, y) = \frac{1}{n} \sum_{i=1}^{n} \left( 1 - \frac{z_i \, y_i^\top}{\| z_i \| \, \| y_i \|} \right), \qquad (14)$$

where $z, y$ are sets of input vectors of the same length $n$. Considering the teacher head $h_\phi$ learns to conserve cosine similarity, this is a natural choice for the student network, and is found to result in improved performance with CosPress compared to the $L_2$ loss (Section C). Similarly to the dimensionality reduction objective, the final student loss employs both a class token loss and a feature loss term,

$$\mathcal{L}_{\text{student}}(X; \theta) = \mathcal{L}_{\text{cosine}}(S^c_\theta(X), h_\phi(T^c(X))) + \mathcal{L}_{\text{cosine}}(S_\theta(X), h_\phi(T(X))). \qquad (15)$$
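The teacher head of Eq. (13) and the cosine objective of Eq. (14) are straightforward to sketch. The snippet below is an illustrative numpy version (the 384-to-192 widths correspond to ViT-S and ViT-Ti; a real implementation would train $\gamma, \beta_1, W, \beta_2$ by gradient descent rather than leaving them at their initial values):

```python
import numpy as np

class TeacherHead:
    # LayerNorm followed by a linear layer, a sketch of Eq. (13).
    def __init__(self, d_t, d_s, rng):
        self.gamma = np.ones(d_t)                  # scaling, initialised to one
        self.beta1 = np.zeros(d_t)                 # LayerNorm bias, zero init
        self.W = rng.standard_normal((d_t, d_s))   # random normal, JL-style init
        self.beta2 = np.zeros(d_s)                 # output bias, zero init

    def __call__(self, z):
        mu = z.mean(axis=-1, keepdims=True)
        sigma = z.std(axis=-1, keepdims=True)
        z_norm = self.gamma * (z - mu) / (sigma + 1e-6) + self.beta1
        return z_norm @ self.W + self.beta2

def cosine_loss(Z, Y):
    # L_cosine(z, y) = mean_i (1 - cos(z_i, y_i)), as in Eq. (14).
    Zn = Z / np.linalg.norm(Z, axis=-1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=-1, keepdims=True)
    return np.mean(1.0 - np.sum(Zn * Yn, axis=-1))

rng = np.random.default_rng(0)
head = TeacherHead(384, 192, rng)
t = rng.standard_normal((4, 384))   # mock teacher class tokens
targets = head(t)                   # compressed distillation targets
print(cosine_loss(targets, targets))  # 0.0 for a perfectly matched student
```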
## 4 Experiments: Feature distillation

In this section, the CosPress feature distillation approach is compared to Proteus and distilled variants of the DINOv2 models. While we do not have access to the proprietary LVD-142M dataset used to distill the DINOv2 models, it has been shown that ImageNet-1K (Russakovsky et al., 2015) is sufficient to distill models with comparable accuracy across a range of measures (Zhang et al., 2025).

### 4.1 Experimental setup

Vision Transformer (Dosovitskiy et al., 2021) models are distilled using larger DINOv2 teachers on the ImageNet-1K (Russakovsky et al., 2015) training dataset, comprising 1000 categories across more than 1.2 million training images. To enable a fair comparison, we reproduce the results of the Proteus paper and train CosPress models using a unified codebase. This ensures consistency in the optimizers, samplers, augmentations and other hyperparameters.

Following Proteus (Zhang et al., 2025), student networks are distilled for 300 epochs using a batch size of 1024, cosine learning rate decay with five warmup epochs (Loshchilov & Hutter, 2017a), an AdamW optimizer (Loshchilov & Hutter, 2017b), a repeated augmentation sampler with three views per image (Fort et al., 2021), and RandAugment (Cubuk et al., 2020) image augmentations (Wightman, 2019). An ablation study on the hyperparameters introduced by CosPress is provided in Section C of the supporting information.
Following the DINOv2 kNN evaluation (Wu et al., 2018) and linear probing approach (Oquab et al., 2024) with an additional batchnorm layer (Lee et al., 2023), evaluations are undertaken on the ImageNet validation set, as well as nine fine-grained classification benchmarks (Oxford Pets (Parkhi et al., 2012), FGVC Aircraft (Maji et al., 2013), Describable Textures (Cimpoi et al., 2014), Stanford Cars (Krause et al., 2013), CUB200 (Wah et al., 2011), CIFAR-10/100 (Krizhevsky et al., 2009), Flowers-102 (Nilsback & Zisserman, 2008) and Food-101 (Bossard et al., 2014)) and the Pascal VOC 2012 segmentation task (Everingham et al., 2012). Performance is also tested on several robustness and generalisation benchmarks including ImageNet-V2 (Recht et al., 2019), Sketch (Wang et al., 2019), ImageNet-R (Hendrycks et al., 2021a) and ImageNet-A (Hendrycks et al., 2021b).

We additionally consider the OpenOOD benchmarks (Yang et al., 2022a). Foundation models are trained on a diverse dataset and excel in this task, and whether distilled students can reproduce this performance has not been previously considered. This section focuses on the ImageNet-1K OpenOOD benchmark, which uses SSB-hard (Bitterwolf et al., 2023) and NINCO (Vaze et al., 2022) as near OOD data, and iNaturalist (Van Horn et al., 2018), OpenImage-O (Wang et al., 2022) and Describable Textures (Cimpoi et al., 2014) as far OOD data.

### 4.2 Results

Table 2: ImageNet classification. Comparison of performance on ImageNet-1K under kNN and linear probing evaluation approaches. We report the mean and standard deviation over four runs with different random seeds for the Proteus and CosPress ViT-Ti/14 models.
| Method | Arch | Teacher | kNN | Linear |
|---|---|---|---|---|
| Proteus | ViT-Ti/14 | DINOv2 ViT-S/14 | 73.0 ± 0.1 | 76.1 ± 0.2 |
| Proteus-VkD | ViT-Ti/14 | DINOv2 ViT-S/14 | 73.0 | 75.9 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | 74.3 ± 0.1 | 76.6 ± 0.1 |
| DINOv2 | ViT-S/14 | – | 79.0 | 81.1 |
| Proteus | ViT-S/14 | DINOv2 ViT-B/14 | 79.8 | 82.0 |
| CosPress | ViT-S/14 | DINOv2 ViT-B/14 | 80.4 | 82.3 |

Table 3: Distillation components. Results for kNN evaluations on different components of the distillation process for models distilled on ImageNet-1K. For Proteus, results are shown for the class token student head.

| Method | Arch | Teacher | Backbone kNN | Stu. head kNN | Tea. head kNN | Teacher kNN |
|---|---|---|---|---|---|---|
| Proteus | ViT-Ti/14 | DINOv2 ViT-S/14 | 73.1 | 73.5 | – | 79.0 |
| Proteus-VkD | ViT-Ti/14 | DINOv2 ViT-S/14 | 73.0 | 73.3 | – | 79.0 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | 74.3 | – | 78.8 | 79.0 |
| Proteus | ViT-S/14 | DINOv2 ViT-B/14 | 79.8 | 80.0 | – | 82.1 |
| CosPress | ViT-S/14 | DINOv2 ViT-B/14 | 80.4 | – | 82.1 | 82.1 |

**CosPress trains more competitive students.** Table 2 shows that CosPress trains students with better performance in comparison to Proteus (Zhang et al., 2025), for both the linear probing and kNN evaluation methods. These improvements are statistically significant, taking into account the low variability observed across different random seeds. It is also found that the teacher head can project the embeddings from the teacher network into the latent space of the student with minimal loss of kNN accuracy, and that the class token student head from Proteus has a higher kNN accuracy than the model backbone (Table 3). These observations confirm the motivations for CosPress: the Proteus student heads are not an uninformative mapping into a higher dimensional space, but are contaminated with information relevant for reproducing the teacher model. Further, a high-quality projection that compresses the teacher embeddings into the latent space of the student while preserving cosine similarity can be learnt, and this provides more effective supervision.
In Table 2 we also consider Proteus-VkD, where the projection matrices $W$ in the Proteus student head are constrained to be right-orthogonal using the VkD approach (Miles et al., 2024). This method builds a re-parametrisation map using skew symmetry and a matrix exponential approximation to construct $W$ such that it is approximately right-orthogonal. Table 2 shows that in this context, this re-parametrisation does not significantly impact performance, and does not completely prevent contamination of the student head $g_\phi$.

It is also observed in Table 2 that the CosPress and Proteus approaches can outperform the distilled DINOv2 models of the same size. However, it is challenging to determine if these distillation approaches are more effective, as the DINOv2 models were trained on a larger proprietary dataset, of which the ImageNet-1K dataset was only a small subset.

Figure 3: Visualising orthogonality. Kernel density estimate plots of the diagonal and non-diagonal elements for the scaled Gram matrices of the linear maps $W$ in the teacher and student heads, drawn from CosPress and Proteus respectively. The dashed coloured lines represent $WW^\top/\alpha$ and the solid lines represent $W^\top W/\beta$, where $\alpha, \beta$ are defined as in Table 4. A perfectly orthogonal matrix $W$ will have a Gram matrix with density on the black dashed vertical lines.

Table 4: Measuring orthogonality. Distance measures of the scaled Gram matrices $A = WW^\top/\alpha$ and $B = W^\top W/\beta$ for the projection matrix in the Proteus student and CosPress teacher head, and their respective identity matrices $I_{D_T}$, $I_{D_S}$ of the same dimensions. For each matrix, $\alpha, \beta$ is set to the mean of the diagonal elements of $A, B$, which minimises error under the Frobenius norm.
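The orthogonality measures reported in Table 4 can be reproduced for any projection matrix with a few lines of numpy. This sketch follows the construction described in the caption (scaled Gram matrices, with the scale set to the mean of the diagonal); the QR-based $W$ at the end is a synthetic sanity check, not a trained head:

```python
import numpy as np

def orthogonality_measures(W):
    # Scaled Gram matrices A = W W^T / alpha and B = W^T W / beta,
    # with alpha, beta set to the mean diagonal element.
    A = W @ W.T
    B = W.T @ W
    A /= np.mean(np.diag(A))
    B /= np.mean(np.diag(B))
    dA = A - np.eye(A.shape[0])
    dB = B - np.eye(B.shape[0])
    return (np.linalg.norm(dA), np.linalg.norm(dB),
            np.trace(np.abs(dA)), np.trace(np.abs(dB)))

# Sanity check on a map with exactly orthonormal columns (D_T=64, D_S=16):
# the B-side measures vanish, while the A-side hits the rank floor.
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((64, 16)))
m = orthogonality_measures(W)
print(m)
```

Applying the same function to a trained teacher-head $W$ (transposing a Proteus student-head matrix first) would yield the four columns of Table 4.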
| Method | ‖A − I_{D_T}‖_F | ‖B − I_{D_S}‖_F | Tr(\|A − I_{D_T}\|) | Tr(\|B − I_{D_S}\|) |
|---|---|---|---|---|
| Proteus | 27.3 | 9.5 | 57.1 | 17.0 |
| Proteus-VkD | 24.9 | 7.7 | 152.7 | 8.2 |
| CosPress | 20.3 | 2.6 | 3.4 | 4.2 |

**CosPress learns an approximately left and right orthogonal projection.** As theorised in Lemma 1, it is found that CosPress learns a linear map $W$ in the teacher head (Eq. (13)) that is approximately left and right-orthogonal, up to a scaling factor. More concisely, we find that

$$W W^\top / \alpha \approx I_{D_T}, \qquad W^\top W / \beta \approx I_{D_S}, \qquad (16)$$

where $\alpha, \beta$ are positive real numbers. Qualitatively, this can be seen in Fig. 3, where kernel density plots are shown of the elements from Gram matrices formed using the linear maps $W, W^\top$. These projections are taken from the CosPress teacher head and the Proteus class token student head obtained while training the ViT-Ti/14 student network. The scaled CosPress Gram matrices are much closer to the identity matrix, and this is measured quantitatively using the Frobenius norm and trace in Table 4.

While the Proteus-VkD approach in Fig. 3 and Table 4 does learn a projection $W$ that is approximately right-orthogonal, there is a large degree of error. This is due to the approximation of the matrix exponential that is used in the VkD method (Miles et al., 2024), and CosPress is able to learn a right-orthogonal projection matrix with less error.

**CosPress improves performance on classification tasks.** Table 5 shows that the models distilled by CosPress have improved or similar performance over Proteus for all downstream fine-grained classification tasks. Competitive accuracy is also achieved with the distilled DINOv2 models of the same size, which CosPress outperforms on six of the nine datasets. Poorest performance is achieved on the FGVC-Aircraft dataset, which likely reflects differences in the training data used for distillation.
The LVD-142M dataset used to train and distill the DINOv2 models contains a million images with high similarity to the FGVC-Aircraft dataset (Oquab et al., 2024), whereas ImageNet only contains a single airliner class with approximately 1300 images.

Table 5: Fine-grained classification. Comparison of performance on fine-grained classification tasks using a linear probe evaluation.

| Method | Arch | Teacher | C10 | C100 | Food | CUB | DTD | Pets | Cars | Aircr | Flowers | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proteus | ViT-Ti/14 | DINOv2 ViT-S/14 | 95.1 | 81.4 | 83.5 | 84.1 | 72.9 | 94.2 | 72.8 | 54.1 | 96.0 | 81.6 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | 94.9 | 81.9 | 84.6 | 85.1 | 73.8 | 94.1 | 75.3 | 55.7 | 96.8 | 82.5 |
| DINOv2 | ViT-S/14 | – | 97.7 | 87.5 | 89.1 | 88.1 | 80.6 | 95.1 | 81.6 | 74.0 | 99.6 | 88.1 |
| Proteus | ViT-S/14 | DINOv2 ViT-B/14 | 97.8 | 87.7 | 89.7 | 88.4 | 78.0 | 95.9 | 82.8 | 62.9 | 97.6 | 86.8 |
| CosPress | ViT-S/14 | DINOv2 ViT-B/14 | 97.8 | 87.6 | 90.3 | 88.9 | 78.0 | 95.9 | 84.0 | 63.4 | 98.8 | 87.2 |

Table 6: Semantic segmentation. Comparison of performance on the Pascal VOC 2012 semantic segmentation task using a linear probe.

| Method | Arch | Teacher | mIoU |
|---|---|---|---|
| Proteus | ViT-Ti/14 | DINOv2 ViT-S/14 | 70.5 |
| Proteus w/o patch loss | ViT-Ti/14 | DINOv2 ViT-S/14 | 69.7 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | 71.1 |
| DINOv2 | ViT-S/14 | – | 81.2 |
| Proteus | ViT-S/14 | DINOv2 ViT-B/14 | 77.3 |
| Proteus w/o patch loss | ViT-S/14 | DINOv2 ViT-B/14 | 77.1 |
| CosPress | ViT-S/14 | DINOv2 ViT-B/14 | 77.9 |

**CosPress improves segmentation performance.** Table 6 shows that CosPress also improves accuracy on downstream segmentation tasks in comparison to Proteus. CosPress does not include the masked patch loss objective, which we confirm improves the performance of Proteus on dense tasks, and incorporating it into CosPress may improve performance further. The DINOv2 distilled model outperforms CosPress in this case.
Further pretraining with an increased image resolution was found to be key to improving the performance of the DINOv2 models on dense tasks (Oquab et al., 2024), but is not undertaken in training the CosPress and Proteus student models.

Table 7: Robustness and generalisation. Comparison of performance on ImageNet-1K robustness and generalisation benchmarks.

Method                                Arch       Teacher           IN-V2  Sketch  IN-R  IN-A
Proteus                               ViT-Ti/14  DINOv2 ViT-S/14   64.3   25.5    37.8  11.4
CosPress                              ViT-Ti/14  DINOv2 ViT-S/14   64.9   27.9    40.7  13.2
DINOv2 ViT-S/14 (Oquab et al., 2024)  ViT-S/14   -                 70.9   41.2    53.7  33.5
Proteus                               ViT-S/14   DINOv2 ViT-B/14   72.2   38.4    50.0  29.6
CosPress                              ViT-S/14   DINOv2 ViT-B/14   72.5   40.4    52.3  31.5

CosPress distills a more robust student model. Table 7 shows that CosPress results in improved performance over Proteus across a range of ImageNet-1K robustness and generalisation benchmarks. The DINOv2 distilled model obtains better performance in this instance for all benchmarks except ImageNet-V2, but CosPress closes the gap between the ImageNet-1K and LVD-142M distilled models significantly.

CosPress reproduces the OOD detection performance of the teacher. CosPress is faithful to the teacher networks when it comes to OOD detection performance, as shown in Table 1. Proteus performs very poorly on this benchmark, with worse performance observed for larger student models. In contrast, CosPress is able to distill models that have strong OOD performance, even outperforming their DINOv2 counterparts.

Table 8: Feature distillation with different teachers. Comparison of performance on ImageNet-1K under kNN and linear probing evaluation methods with other kinds of teacher backbones, using different architectures and training approaches.
Method    Arch      Teacher                kNN   Linear
Proteus   ViT-T/14  DINOv2 ViT-B/14 w/reg  71.1  75.1
CosPress  ViT-T/14  DINOv2 ViT-B/14 w/reg  74.0  76.4
Proteus   ViT-T/16  CLIP ViT-B/16          63.6  71.4
CosPress  ViT-T/16  CLIP ViT-B/16          64.0  72.0

CosPress improves performance across other teacher networks. Table 8 demonstrates that CosPress also trains higher-performing student networks than Proteus when using CLIP (Radford et al., 2021) and DINOv2 w/reg (Darcet et al., 2024) teacher networks. These experiments employ the same hyperparameters as those in Table 2. Additional results exploring feature distillation with ViT-T students and larger teacher networks are provided in Section B of the supporting information.

Table 9: Training time. Comparison of training time on ImageNet for 300 epochs with a batch size of 1024 using Nvidia A100 GPUs.

Method    Arch       Teacher          GPUs  GPU hours  GPU memory
Proteus   ViT-Ti/14  DINOv2 ViT-S/14  1     92         55GB
CosPress  ViT-Ti/14  DINOv2 ViT-S/14  1     95         47GB
Proteus   ViT-S/14   DINOv2 ViT-B/14  2     182        111GB
CosPress  ViT-S/14   DINOv2 ViT-B/14  2     154        81GB

CosPress does not require additional computational resources. Table 9 provides timings and GPU memory usage for fitting the Proteus and CosPress models described in this section. Training time is similar for the ViT-Ti/14 student and DINOv2 ViT-S/14 teacher pair, but CosPress is more efficient for larger models. This is due to the masked patch loss in Proteus, which requires that the student network is evaluated once on unmasked inputs, and a second time on masked inputs. As a result, training is faster for CosPress with larger students. CosPress is also slightly more memory efficient than Proteus, and further computational savings could be made by freezing the teacher head hϕ once a sufficiently high-quality map has been learned.
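At the core of the training runs timed above is a cosine-similarity loss between student features and teacher features projected through the teacher head hϕ (Eq. (15)). The following is a minimal, framework-free numpy illustration of such a loss; the actual implementation trains with PyTorch and also includes class-token, patch and KL-based dimensionality-reduction terms, and the array shapes here are invented for the example.

```python
import numpy as np

def cosine_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) between paired rows of two feature matrices.

    student_feats: (batch, d) student embeddings
    teacher_feats: (batch, d) teacher embeddings after the teacher head has
                   projected them into the student's latent space
    """
    s = student_feats / np.linalg.norm(student_feats, axis=1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

rng = np.random.default_rng(0)
teacher_out = rng.standard_normal((8, 192))   # teacher features after the head
student_out = rng.standard_normal((8, 192))   # untrained student features
loss = cosine_loss(student_out, teacher_out)  # large at initialisation
zero = cosine_loss(teacher_out, teacher_out)  # matching directions give zero loss
```

Because only the direction of each feature vector matters, minimising this loss drives the student to reproduce the teacher's angular structure rather than its feature magnitudes.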
5 Experiments: Specialist models

This section explores the potential for CosPress feature distillation to improve the performance of specialised models that solve one particular task (e.g. classifying images of food). We refer to this process, where an additional feature distillation training step is undertaken on a target dataset, as CosPress finetuning. This approach can train highly performant small networks that also have improved results on generalisability and OOD detection benchmarks.

5.1 Experimental setup

The CosPress models distilled in the previous section are compared with models that have been further finetuned with CosPress, an additional pretraining step where distillation is undertaken on a smaller target dataset of interest. The strong DeiT (Touvron et al., 2021) pretrained weights are also considered, which were distilled on ImageNet-1K from a larger CNN network using class-based knowledge distillation (Hinton et al., 2015). The same hyperparameters and training methodology are used as in Section 4, with the exception of the number of training and warmup epochs.

This section focuses on a set of small-scale tasks, including CIFAR-10/100 (Krizhevsky et al., 2009), Food-101 (Bossard et al., 2014) and Oxford Pets (Parkhi et al., 2012). We employ 300 training epochs and 10 warmup epochs for CIFAR-10/100, and 3000 training epochs and 100 warmup epochs for Oxford Pets. The DINOv2 linear probe evaluation method is employed (Oquab et al., 2024), as well as finetuning using the DeiT recipe (Touvron et al., 2021). When training models with this latter approach, the linear prediction head is trained before finetuning the backbone, to avoid distorting the pretrained features (Kumar et al., 2022).

5.2 Results

Table 10: Specialist models accuracy. Comparison of performance on fine-grained image classification tasks.
                                                                                 Linear                   DeiT
Method                       Arch       Teacher          Pretraining dataset     C10   C100  Food  Pets   C10   C100  Food  Pets
DeiT (Touvron et al., 2021)  ViT-Ti/16  RegNetY-16GF     ImageNet                93.1  77.7  77.7  93.3   98.3  87.8  89.9  93.0
CosPress                     ViT-Ti/14  DINOv2 ViT-S/14  ImageNet                94.9  81.9  84.6  94.1   98.7  89.0  91.7  92.7
CosPress                     ViT-Ti/14  DINOv2 ViT-S/14  ImageNet + target       97.6  86.3  89.8  94.9   98.8  89.6  92.7  93.0

CosPress finetuning improves the performance of specialist models. Table 10 shows that CosPress finetuning improves downstream performance, even when the training datasets are quite small. For every dataset considered, this additional pretraining step improves linear probe evaluations with a frozen backbone by a significant margin (1-5%). These benefits remain under the strong DeiT training recipe, which further finetunes the model backbone. While CIFAR-10/100 and Food-101 have much stronger results under DeiT finetuning, we find that Oxford Pets performs best with a linear probe evaluation after CosPress finetuning. This reflects the small size of the Oxford Pets dataset, which makes training ViT networks challenging.

Table 11: State-of-the-art lightweight models. Comparison of the best CosPress models to other approaches for training state-of-the-art lightweight models for specialised tasks.

Method                    Architecture                        Parameters  C10   C100  Food  Pets
NAT (Lu et al., 2021)     MobileNetV2 (Sandler et al., 2018)  4.5-9.0M    98.4  88.3  89.4  94.3
CeiT (Yuan et al., 2021)  CeiT-T                              6.4M        98.5  88.4  -     93.8
CosPress                  ViT-Ti/14                           5.5M        98.8  89.6  92.7  94.9

A ViT-Tiny network finetuned with CosPress achieves competitive accuracy compared to other approaches in the literature that have been highly optimised to perform well on specialist tasks with a small and efficient model. Table 11 shows that CosPress finetuning trains competitive networks in comparison to Neural Architecture Transfer (NAT) (Lu et al., 2021) and Convolution-enhanced image Transformers (CeiT) (Yuan et al., 2021).
Feature distillation methods like CosPress are an additional approach that could be used in conjunction with these techniques to build highly performant lightweight vision models.

Table 12: Specialist models generalisability. Comparison of generalisability of specialist models on the cartoon subsets of the CIFAR-10-W benchmark (Sun et al., 2024). We report mean per-class accuracy due to dataset imbalances. In-distribution training images (top) and cartoon images (bottom) are included for reference.

                                                                                 Linear                  DeiT
Method                       Arch       Teacher          Pretraining dataset     Diff  Bin   Bai   360   Diff  Bin   Bai   360
DeiT (Touvron et al., 2021)  ViT-Ti/16  RegNetY-16GF     ImageNet                65.9  51.0  47.6  48.8  86.9  62.6  56.0  60.1
CosPress                     ViT-Ti/14  DINOv2 ViT-S/14  ImageNet                70.8  49.5  48.7  50.3  88.5  63.2  56.3  60.7
CosPress                     ViT-Ti/14  DINOv2 ViT-S/14  ImageNet + target       73.4  52.5  48.6  49.9  89.1  64.1  57.5  61.4

CosPress finetuning improves the generalisability of specialist models. The challenging cartoon subsets of the CIFAR-10-W benchmark (Sun et al., 2024) are used to test generalisation performance on CIFAR-10. Table 12 shows that CosPress finetuning leads to improved generalisability for specialist models on CIFAR-10. Under a linear probing evaluation, CosPress finetuning strongly improves generalisability on two of the four datasets, and improves generalisability for all datasets even after DeiT finetuning.

Table 13: Specialist models OOD detection. Comparison of performance on the OpenOOD benchmark (Yang et al., 2022a). The AUC is reported for detecting OOD images.
                                                                                 Frozen backbone                     DeiT finetuned
                                                                                 CIFAR-10       CIFAR-100           CIFAR-10       CIFAR-100
Method                       Arch       Teacher          Pretraining dataset     Near   Far     Near   Far          Near   Far     Near   Far
DeiT (Touvron et al., 2021)  ViT-Ti/16  RegNetY-16GF     ImageNet                57.01  47.04   58.14  46.19        96.79  98.69   87.45  86.73
CosPress                     ViT-Ti/14  DINOv2 ViT-S/14  ImageNet                93.44  95.87   85.23  76.37        96.69  98.59   87.90  89.87
CosPress                     ViT-Ti/14  DINOv2 ViT-S/14  ImageNet + target       95.12  98.02   87.00  80.95        97.05  98.79   89.52  89.53

CosPress finetuning improves the OOD detection performance of specialist models. The CIFAR-10/100 OpenOOD benchmarks (Yang et al., 2022a) are used to test OOD detection performance. Table 13 shows that CosPress finetuning leads to improved performance on the OpenOOD benchmark for specialist models on CIFAR-10/100. Without DeiT finetuning, strong improvements in OOD detection are observed under CosPress finetuning over the ImageNet pretrained baselines. Some improvements remain after DeiT finetuning, but they are smaller.

6 Discussion

Significance of angles in deep learning. Language models have been shown to exhibit a principle of superposition, wherein concepts are encoded along nearly orthogonal directions in representation space (Bricken et al., 2023). While only d vectors can be exactly orthogonal in a d-dimensional space, high-dimensional geometry allows for the construction of up to exp(d) approximately orthogonal vectors (with pairwise cosine similarity below some ϵ > 0), enabling the representation of a vastly larger set of concepts in practice (Elhage et al., 2022). This phenomenon is closely related to the Johnson-Lindenstrauss lemma (Freksen, 2021), and similar properties have been observed in foundation models for computer vision (Bhalla et al., 2024). By preserving angular relationships between image embeddings, CosPress maintains the semantic structure of the foundation model feature space in the student networks it trains.

Limitations.
Even with CosPress, a small generalisation gap remains. Models trained with CosPress do not generalise quite as well as the original DINOv2 distilled variants (Table 7). Without access to the proprietary LVD-142M dataset, it is difficult to determine whether this gap arises from the limitation of performing feature distillation solely on ImageNet-1K, or from a shortcoming in the methodology itself.

7 Conclusion

This paper introduces CosPress, a feature distillation approach designed to train highly performant student networks from a foundation model teacher with a Vision Transformer (Dosovitskiy et al., 2021) architecture, reproducing the teacher's properties with regard to generalisation, robustness and OOD detection. This is achieved by introducing a teacher head that maps from the higher-dimensional latent space of the teacher network into the smaller-dimensional space of the student, and training this mapping to preserve the cosine similarities of images within these embedding spaces. CosPress trains a faithful student that more closely replicates the behaviour of the teacher network in comparison to the Proteus approach (Zhang et al., 2025), where a student head is used to align the student outputs with the teacher.

Acknowledgements

We acknowledge the Traditional Custodians of the unceded land on which the research detailed in this paper was undertaken, the Wurundjeri Woi Wurrung and Bunurong peoples of the Kulin nation, and pay our respects to their Elders past and present. This research was undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This Facility was established with the assistance of LIEF Grant LE170100200. This research was also undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. Evelyn J.
Mannix was supported by an Australian Government Research Training Program Scholarship to complete this work.

References

Akshay Agrawal, Alnur Ali, Stephen Boyd, et al. Minimum-distortion embedding. Foundations and Trends in Machine Learning, 14(3):211–378, 2021.

Nikita Andriyanov. Intelligent computer vision systems in the processing of baggage and hand luggage X-ray images. In Advances in Artificial Intelligence-Empowered Decision Support Systems: Papers in Honour of Professor John Psarras, pp. 283–324. Springer, 2024.

Jimmy Lei Ba. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Yutong Bai, Jieru Mei, Alan L Yuille, and Cihang Xie. Are transformers more robust than CNNs? Advances in Neural Information Processing Systems, 34:26831–26843, 2021.

Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, and Aniruddha Kembhavi. SatlasPretrain: A large-scale dataset for remote sensing image understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782, 2023.

Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10925–10934, 2022.

Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio Calmon, and Himabindu Lakkaraju. Interpreting CLIP with sparse linear concept embeddings (SpLiCE). In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Julian Bitterwolf, Maximilian Müller, and Matthias Hein. In or out? Fixing ImageNet out-of-distribution detection evaluation, 2023.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part VI, pp. 446–461. Springer, 2014.
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660, 2021.

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613, 2014.

Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703, 2020.

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision Transformers need registers. In The Twelfth International Conference on Learning Representations, 2024.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition, 2022. URL https://arxiv.org/abs/2209.10652.

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascalnetwork.org/challenges/VOC/voc2012/workshop/index.html, 2012.

Jonas Fischer and Rong Ma. Sailing in high-dimensional spaces: Low-dimensional embeddings through angle preservation. arXiv preprint arXiv:2406.09876, 2024.

Stanislav Fort, Andrew Brock, Razvan Pascanu, Soham De, and Samuel L Smith. Drawing multiple augmentation samples per image during training efficiently decreases test error. arXiv preprint arXiv:2105.13343, 2021.

Casper Benjamin Freksen. An introduction to Johnson-Lindenstrauss transforms. arXiv preprint arXiv:2103.00564, 2021.

Yunlong Gao, Shuxin Zhong, Kangli Hu, and Jinyan Pan. Robust locality preserving projections using angle-based adaptive weight method. IET Computer Vision, 14(8):605–613, 2020.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.

Xiaofei He and Partha Niyogi. Locality preserving projections.
Advances in Neural Information Processing Systems, 16, 2003.

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349, 2021a.

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262–15271, 2021b.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Geoffrey E Hinton and Sam Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15, 2002.

Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, et al. DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment. arXiv preprint arXiv:2412.16334, 2024.

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561, 2013.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022.

Jae-Hun Lee, Doyoung Yoon, Byeong Moon Ji, Kyungyul Kim, and Sangheum Hwang.
Rethinking evaluation protocols of visual representations learned via self-supervised learning. arXiv preprint arXiv:2304.03456, 2023.

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017a.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017b.

Zhichao Lu, Gautam Sreekumar, Erik Goodman, Wolfgang Banzhaf, Kalyanmoy Deb, and Vishnu Naresh Boddeti. Neural architecture transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(9):2971–2989, 2021.

Avner Magen. Dimensionality reductions in l2 that preserve volumes and distance to affine spaces. Discrete & Computational Geometry, 38:139–153, 2007.

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

Roy Miles and Krystian Mikolajczyk. Understanding the role of the projector in knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 4233–4241, 2024.

Roy Miles, Ismail Elezi, and Jiankang Deng. VkD: Improving knowledge distillation using orthogonal projections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15720–15730, 2024.

Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436, 2015.

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE, 2008.

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V.
Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. ISSN 2835-8856.

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498–3505. IEEE, 2012.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Lawrence K Saul and Sam T Roweis. An introduction to locally linear embedding. Unpublished. Available at: http://www.cs.toronto.edu/~roweis/lle/publications.html, 2000.
Xiaoxiao Sun, Xingjian Leng, Zijian Wang, Yang Yang, Zi Huang, and Liang Zheng. CIFAR-10-Warehouse: Broad and more realistic testbeds in model generalization analysis. In The Twelfth International Conference on Learning Representations, 2024.

Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In International Conference on Machine Learning, pp. 20827–20840. PMLR, 2022.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. PMLR, 2021.

Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8769–8778, 2018.

Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=5hLP5JY9S2d.

Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. July 2011. URL https://www.vision.caltech.edu/datasets/cub_200_2011/.

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019.

Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. ViM: Out-of-distribution with virtual-logit matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4921–4930, June 2022.
Longhui Wei, An Xiao, Lingxi Xie, Xiaopeng Zhang, Xin Chen, and Qi Tian. Circumventing outliers of AutoAugment with knowledge distillation. In European Conference on Computer Vision, pp. 608–625. Springer, 2020.

Ross Wightman. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018.

Jingkang Yang, Pengyun Wang, Dejian Zou, Zitang Zhou, Kunyuan Ding, Wenxuan Peng, Haoqi Wang, Guangyao Chen, Bo Li, Yiyou Sun, et al. OpenOOD: Benchmarking generalized out-of-distribution detection. Advances in Neural Information Processing Systems, 35:32598–32611, 2022a.

Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, and Chun Yuan. Masked generative distillation. In European Conference on Computer Vision, pp. 53–69. Springer, 2022b.

Zhendong Yang, Zhe Li, Ailing Zeng, Zexian Li, Chun Yuan, and Yu Li. ViTKD: Feature-based knowledge distillation for Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1379–1388, 2024.

Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 579–588, 2021.

Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104–12113, 2022.

Shaoting Zhang and Dimitris Metaxas. On the challenges and perspectives of foundation models for medical image analysis. Medical Image Analysis, 91:102996, 2024.

Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, and Yun Fu.
Accessing vision foundation models via ImageNet-1K. In The Thirteenth International Conference on Learning Representations, 2025.

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In International Conference on Learning Representations, 2022.

Supplementary Material

A The Johnson-Lindenstrauss Lemma

The Johnson-Lindenstrauss (JL) Lemma (Freksen, 2021) states that for a set of points X in a high-dimensional space, there exists a function that can map these points into a lower-dimensional space within error ϵ, where this error depends on the dimension of the target space m and the size of the set of points |X|. In its standard form, the JL lemma states that Euclidean distances are preserved.

Lemma 2 (Johnson-Lindenstrauss; Freksen, 2021). For every $d \in \mathbb{N}$, $\epsilon \in (0, 1)$ and $X \subset \mathbb{R}^d$, there exists a function $f : \mathbb{R}^d \to \mathbb{R}^m$ where $m = \Theta(\epsilon^{-2} \log |X|)$ such that for every $x, y \in X$,

$\left| \; \|f(x) - f(y)\|_2^2 - \|x - y\|_2^2 \; \right| \le \epsilon \, \|x - y\|_2^2.$    (17)

Moreover, the map f can be constructed using a simple approach. Given a matrix $M \in \mathbb{R}^{m \times d}$ with every element drawn from a standard normal distribution N(0, 1),

$f(x) := \frac{1}{\sqrt{m}} Mx$    (18)

is a linear map that satisfies Lemma 2 with a probability given by the norm preservation lemma. This function is also referred to as a JL transform.

Lemma 3 (Norm preservation; Freksen, 2021). Let $\epsilon \in (0, 1)$. If f is constructed as above with $m = \Theta(\epsilon^{-2} \log \delta^{-1})$, and $x \in \mathbb{R}^d$ is a unit vector, then

$\mathbb{P}\left( \left| \, \|f(x)\|_2^2 - 1 \, \right| \le \epsilon \right) \ge 1 - \delta.$    (19)

Again, Lemma 3 states that for target spaces with larger dimension m, it is more likely that a high-quality map will be sampled. A similar result also holds for angles (Magen, 2007), which gives

Lemma 4 (Angles; Magen, 2007). Let $\epsilon < \tfrac{1}{3}$ and let n, t be integers for which $t > 60\epsilon^{-2} \log n$.
Then for any n-point subset X of the Euclidean space $\mathbb{R}^N$, there is a linear contracting embedding $f(X) \subset \mathbb{R}^t$ under which angles are preserved to within a (double-sided) factor of $1 + (8/\pi)\,\epsilon$.

The proof of Lemma 4 also relies on f being generated as a random projection as above (Magen, 2007). The JL transform matrix M has a further interesting property, as observed in this work, that we refer to as approximate left and right orthogonality.

Lemma 5 (Approximate left and right orthogonality for JL transforms). Given a matrix $M \in \mathbb{R}^{m \times d}$ with elements drawn from N(0, 1), there exist $\alpha, \beta > 0$ such that

$\frac{\mathbb{E}\,\|MM^\top - \alpha I_m\|_F^2}{\mathbb{E}\,\|MM^\top\|_F^2} \approx \frac{m}{d+m}, \qquad \frac{\mathbb{E}\,\|M^\top M - \beta I_d\|_F^2}{\mathbb{E}\,\|M^\top M\|_F^2} \approx \frac{d}{d+m}.$

Proof. Using the properties of the Wishart distribution, given the construction of M we have

$MM^\top \sim W_m(I_m, d),$    (20)
$M^\top M \sim W_d(I_d, m),$    (21)

from which we have the following results for the mean and variance:

$\mathbb{E}[MM^\top] = d\,I_m, \qquad \mathrm{Var}([MM^\top]_{ij}) = d,$    (22)
$\mathbb{E}[M^\top M] = m\,I_d, \qquad \mathrm{Var}([M^\top M]_{ij}) = m.$    (23)

These expectations imply Lemma 5.

Moreover, this property of approximate left and right orthogonality does not just apply to JL transformations, but to any linear map $M \in \mathbb{R}^{m \times d}$ with d > m such that M is approximately left-orthogonal with $M^\top M \approx I_d$. M cannot be exactly left-orthogonal, as the ranks of $M^\top M$ and $MM^\top$ are both at most m, while $I_d$ has rank d, which is greater than m.

Lemma 1 (Approximate left-orthogonality implies right-orthogonality). For any matrix $M \in \mathbb{R}^{m \times d}$ with m < d and rank m, we have the inequality

$\left\| MM^\top - \tfrac{d}{m} I_m \right\|_F \le \tfrac{d}{m} \left\| M^\top M - I_d \right\|_F.$

The converse inequality does not hold.

Proof. We can write $M^\top M = I_d + E$, where E is an error matrix, which provides that

$\| M^\top M - I_d \|_F = \| E \|_F.$    (24)

Let us sample m columns of M to produce a square matrix $K \in \mathbb{R}^{m \times m}$. This provides for K that $K^\top K = I_m + E_K$, where $E_K$ is the corresponding submatrix of the error matrix E. This means that $\|E_K\|_F \le \|E\|_F$.
We then introduce the singular value decomposition $K = U \Sigma V^\top$, where U and V are orthonormal matrices and Σ is a square diagonal matrix containing the singular values of K, to give

$$K^\top K = V \Sigma^\top \Sigma V^\top = I_m + E_K, \quad (25)$$
$$\Sigma^\top \Sigma = V^\top (I_m + E_K) V, \quad (26)$$
$$\Sigma^\top \Sigma = I_m + V^\top E_K V, \quad (27)$$

which we can use, as $\Sigma^\top \Sigma = \Sigma \Sigma^\top$ for a square diagonal matrix, to find that

$$K K^\top = U \Sigma \Sigma^\top U^\top \quad (28)$$
$$= U (I_m + V^\top E_K V) U^\top \quad (29)$$
$$= I_m + U V^\top E_K V U^\top. \quad (30)$$

Let us consider a sample of d sets of m columns, such that each of the d columns of M is selected m times in total, producing a set of matrices $K_l$ with $l \in \{1, \dots, d\}$ and no repeated columns within any single $K_l$. Then we observe that

$$m \, (M M^\top)_{ij} = m \sum_{k=1}^{d} M_{ik} M_{jk} = \sum_{l=1}^{d} \sum_{k=1}^{m} [K_l]_{ik} [K_l]_{jk} = \sum_{l=1}^{d} (K_l K_l^\top)_{ij}, \quad (31\text{--}33)$$

as we have sampled our set of $K_l$ matrices from the columns of M such that each $M_{ik} M_{jk}$ element in the sum of their products occurs m times. This gives

$$m \, M M^\top = \sum_{l=1}^{d} K_l K_l^\top \quad (34)$$

and introducing Eq. (30) from above obtains

$$m \, M M^\top = \sum_{l=1}^{d} \left( I_m + U_l V_l^\top E_{K_l} V_l U_l^\top \right) \quad (35)$$
$$= d \, I_m + \sum_{l=1}^{d} U_l V_l^\top E_{K_l} V_l U_l^\top, \quad (36)$$

which provides that

$$\left\| m \, M M^\top - d \, I_m \right\|_F \leq \sum_{l=1}^{d} \left\| U_l V_l^\top E_{K_l} V_l U_l^\top \right\|_F = \sum_{l=1}^{d} \|E_{K_l}\|_F \leq d \, \|M^\top M - I_d\|_F, \quad (37)$$

which, after dividing through by m, proves the first statement.

Algorithm 1 Feature distillation with CosPress.
Input: training set X, teacher model T, student model S_θ, teacher head h_ϕ, number of epochs N_E, augmentation strategy Aug(·)
Randomly initialise the student model S_θ and teacher head h_ϕ, or initialise using previous weights if finetuning
i = 0
while i < N_E do
    Randomly split X into B mini-batches
    for x_b ∈ {X_1, ..., X_b, ..., X_B} do
        Generate augmented views: X̃ = Aug(x_b)
        Compute the dimensionality reduction objective (Eq. (12)):
            L_dim-red(X̃; ϕ) = L_KL(h_ϕ(T^c(X̃)), T^c(X̃)) + (1/|X̃|) Σ_i L_KL(h_ϕ(T(x̃_i)), T(x̃_i))
        Compute the student objective, while freezing ϕ (Eq. (15)):
            L_student(X̃; θ) = L_cosine(S^c_θ(X̃), h_ϕ_frozen(T^c(X̃))) + L_cosine(S_θ(X̃), h_ϕ_frozen(T(X̃)))
        Combine losses: L = L_student(X̃; θ) + L_dim-red(X̃; ϕ)
        Minimise the loss L by updating the parameters θ and ϕ
    end for
    i = i + 1
end while
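As a quick numerical sanity check of the inequality just proved (and of the approximate orthogonality in Lemma 5), the NumPy snippet below draws a Gaussian matrix, scales it JL-style so that $M^\top M \approx I_d$, and compares the two sides of Lemma 1. This check is illustrative only and not part of the proof; the dimensions chosen here are arbitrary.

```python
import numpy as np

# Random Gaussian matrix, scaled so that M^T M ~ I_d (JL-style scaling).
rng = np.random.default_rng(0)
m, d = 32, 256
M = rng.normal(size=(m, d)) / np.sqrt(m)

# Left- and right-hand sides of the inequality in Lemma 1.
left = np.linalg.norm(M @ M.T - (d / m) * np.eye(m))    # ||MM^T - (d/m) I_m||_F
right = (d / m) * np.linalg.norm(M.T @ M - np.eye(d))   # (d/m) ||M^T M - I_d||_F

# Relative deviation of MM^T from (d/m) I_m: small, reflecting the
# approximate right-orthogonality described by Lemma 5.
rel = left / np.linalg.norm(M @ M.T)
```

For these dimensions the left-hand side is far below the right-hand side, and the relative deviation of $MM^\top$ from $(d/m) I_m$ is small, consistent with the $m/(d+m)$ ratio of Lemma 5.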
It is easy to see that the converse inequality does not hold, simply by constructing a matrix $M \in \mathbb{R}^{m \times d}$ whose first m columns equal the identity $I_m$ and whose remaining columns are zero. This matrix satisfies $M M^\top = I_m$, but $M^\top M$ will have $d - m$ rows containing only zeros, and is therefore far from $I_d$. This proves the second statement.

B Further Results

Table S14: ImageNet classification with larger teachers. Comparison of performance on ImageNet-1K under kNN and linear probing evaluation approaches.

| Method | Arch | Teacher | kNN: Backbone | kNN: Student head | kNN: Teacher | Linear |
| Proteus | ViT-Ti/14 | DINOv2 ViT-S/14 | 73.1 | 73.5 | 79.0 | 76.1 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | 74.3 | 78.8 | 79.0 | 76.8 |
| Proteus | ViT-Ti/14 | DINOv2 ViT-B/14 | 73.4 | 73.8 | 82.1 | 76.9 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-B/14 | 75.6 | 81.9 | 82.1 | 77.9 |

Larger teachers result in better accuracy but poorer-quality dense features. We observe that training with teachers much larger than the student requires longer training runs. To achieve the best results efficiently, the pretrained models from a smaller teacher are used as a starting point. The student and teacher heads are then first trained for 30 epochs with a frozen pretrained student model, allowing the CosPress teacher heads to minimise the student loss. Finally, the models are trained for 300 epochs using the same distillation approach as previously. Table S14 shows that this improves the performance of the student models, with CosPress seeing larger improvements in accuracy on ImageNet-1K in comparison to Proteus. However, this results in poorer performance on dense image tasks, like semantic segmentation: Table S15 shows that the students trained with a larger teacher network have poorer mIoU for a linear probe on the Pascal VOC 2012 dataset. Undertaking longer distillation runs, or continuing training at a larger image resolution, might be helpful in these cases.

Table S15: Semantic segmentation with larger teachers.
Comparison of performance on the Pascal VOC 2012 semantic segmentation task using a linear probe.

| Method | Arch | Teacher | mIoU |
| Proteus | ViT-Ti/14 | DINOv2 ViT-S/14 | 70.5 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | 71.1 |
| Proteus | ViT-Ti/14 | DINOv2 ViT-B/14 | 68.1 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-B/14 | 69.7 |

Further out-of-distribution detection results. The OOD detection results for all of the OpenOOD datasets (Yang et al., 2022a) are presented in Table S16, Table S17 and Table S18. The tables report the AUROC for detecting OOD images and the False Positive Rate (FPR) at a threshold set to include 95% of in-distribution images. To produce these results, the KNN+ (Sun et al., 2022) OOD metric is used to measure the performance of the model backbones. For ImageNet-1K, we sample 1% of the dataset (12,812 images) and set k = 10 to measure the distance to OOD samples. For CIFAR-10/100, we use the full dataset (50,000 images) and set k = 1.

Table S16: Out-of-distribution detection. Comparison of performance on the OpenOOD benchmark for the ImageNet-1K dataset. ↑ means larger values are better and ↓ means smaller values are better.
Each cell reports AUROC↑ / FPR↓. Near OOD datasets: SSB-hard, NINCO. Far OOD datasets: iNaturalist, OpenImage-O, Textures.

| Method | Arch | Teacher | SSB-hard | NINCO | Near OOD avg. | iNaturalist | OpenImage-O | Textures | Far OOD avg. |
| Proteus | ViT-Ti/14 | DINOv2 ViT-S/14 | 55.59 / 92.24 | 72.76 / 79.22 | 64.17 / 85.73 | 60.08 / 91.47 | 73.13 / 75.34 | 89.46 / 37.10 | 74.22 / 67.97 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | 63.39 / 84.98 | 77.58 / 69.59 | 70.49 / 77.29 | 95.81 / 22.57 | 90.72 / 40.62 | 86.57 / 48.45 | 91.03 / 37.21 |
| DINOv2 | ViT-S/14 | – | 65.76 / 81.80 | 79.39 / 66.45 | 72.58 / 74.12 | 98.74 / 4.76 | 92.23 / 35.44 | 87.04 / 48.45 | 92.67 / 29.55 |
| Proteus | ViT-S/14 | DINOv2 ViT-B/14 | 53.78 / 97.40 | 68.61 / 91.71 | 61.19 / 94.56 | 39.89 / 99.50 | 61.70 / 91.11 | 84.19 / 69.73 | 61.92 / 86.78 |
| CosPress | ViT-S/14 | DINOv2 ViT-B/14 | 65.75 / 82.98 | 81.24 / 64.71 | 73.50 / 73.84 | 97.31 / 12.27 | 92.77 / 32.95 | 88.72 / 41.71 | 92.93 / 28.98 |

Table S17: Specialist models, near OOD detection. Comparison of performance on the OpenOOD benchmark (Yang et al., 2022a). ↑ means larger values are better and ↓ means smaller values are better.
Each cell reports AUROC↑ / FPR↓.

ID dataset: CIFAR-100 (near OOD datasets: CIFAR-10, TinyImageNet)

| Setting | Method | Arch | Teacher | Pretraining dataset | CIFAR-10 | TinyImageNet | Average |
| Frozen | DeiT | ViT-Ti/16 | RegNetY-16GF | ImageNet | 53.53 / 94.96 | 62.75 / 85.44 | 58.14 / 90.20 |
| Frozen | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet | 80.39 / 71.45 | 90.07 / 34.94 | 85.23 / 53.20 |
| Frozen | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet → target dataset | 84.08 / 70.28 | 89.91 / 44.15 | 87.00 / 57.22 |
| DeiT finetuned | DeiT | ViT-Ti/16 | RegNetY-16GF | ImageNet | 83.91 / 61.04 | 90.98 / 44.29 | 87.45 / 52.66 |
| DeiT finetuned | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet | 85.45 / 56.76 | 90.34 / 47.36 | 87.90 / 52.06 |
| DeiT finetuned | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet → target dataset | 86.70 / 56.45 | 92.35 / 39.37 | 89.52 / 47.91 |
| – | DINOv2 | ViT-S/14 | – | – | 87.97 / 56.05 | 91.83 / 29.61 | 89.90 / 42.83 |

ID dataset: CIFAR-10 (near OOD datasets: CIFAR-100, TinyImageNet)

| Setting | Method | Arch | Teacher | Pretraining dataset | CIFAR-100 | TinyImageNet | Average |
| Frozen | DeiT | ViT-Ti/16 | RegNetY-16GF | ImageNet | 50.88 / 94.08 | 63.15 / 84.75 | 57.01 / 89.42 |
| Frozen | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet | 90.22 / 41.50 | 96.67 / 13.63 | 93.44 / 27.57 |
| Frozen | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet → target dataset | 93.76 / 31.79 | 96.49 / 15.86 | 95.12 / 23.82 |
| DeiT finetuned | DeiT | ViT-Ti/16 | RegNetY-16GF | ImageNet | 96.50 / 16.34 | 97.08 / 12.52 | 96.79 / 14.43 |
| DeiT finetuned | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet | 96.15 / 16.57 | 97.24 / 10.62 | 96.69 / 13.60 |
| DeiT finetuned | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet → target dataset | 96.76 / 15.94 | 97.33 / 12.02 | 97.05 / 13.98 |
| – | DINOv2 | ViT-S/14 | – | – | 94.08 / 29.27 | 97.59 / 10.28 | 95.83 / 19.77 |

Table S18: Specialist models, far OOD detection. Comparison of performance on the OpenOOD benchmark (Yang et al., 2022a). ↑ means larger values are better and ↓ means smaller values are better.
Each cell reports AUROC↑ / FPR↓.

ID dataset: CIFAR-100 (far OOD datasets: DTD, MNIST, SVHN, Places365)

| Setting | Method | Arch | Teacher | Pretraining dataset | DTD | MNIST | SVHN | Places365 | Average |
| Frozen | DeiT | ViT-Ti/16 | RegNetY-16GF | ImageNet | 14.97 / 100.0 | 71.86 / 73.77 | 43.35 / 93.18 | 54.57 / 83.06 | 46.19 / 87.50 |
| Frozen | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet | 33.09 / 99.89 | 97.21 / 11.01 | 76.96 / 87.71 | 98.22 / 7.18 | 76.37 / 51.45 |
| Frozen | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet → target dataset | 44.46 / 98.56 | 95.78 / 18.94 | 86.17 / 67.34 | 97.39 / 12.91 | 80.95 / 49.44 |
| DeiT finetuned | DeiT | ViT-Ti/16 | RegNetY-16GF | ImageNet | 74.07 / 82.86 | 85.18 / 62.99 | 95.81 / 24.57 | 91.87 / 43.29 | 86.73 / 53.43 |
| DeiT finetuned | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet | 83.41 / 68.51 | 87.49 / 56.64 | 96.50 / 21.00 | 92.08 / 37.52 | 89.87 / 45.92 |
| DeiT finetuned | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet → target dataset | 79.51 / 67.44 | 90.24 / 48.05 | 96.12 / 22.84 | 92.27 / 37.61 | 89.53 / 43.98 |
| – | DINOv2 | ViT-S/14 | – | – | 42.46 / 99.80 | 96.25 / 15.68 | 77.75 / 88.13 | 97.89 / 8.48 | 78.58 / 53.02 |

ID dataset: CIFAR-10 (far OOD datasets: DTD, MNIST, SVHN, Places365)

| Setting | Method | Arch | Teacher | Pretraining dataset | DTD | MNIST | SVHN | Places365 | Average |
| Frozen | DeiT | ViT-Ti/16 | RegNetY-16GF | ImageNet | 17.03 / 100.0 | 71.36 / 73.15 | 42.03 / 92.25 | 57.74 / 80.58 | 47.04 / 86.49 |
| Frozen | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet | 95.24 / 30.79 | 99.17 / 3.54 | 89.13 / 60.24 | 99.97 / 0.18 | 95.87 / 23.69 |
| Frozen | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet → target dataset | 98.20 / 7.44 | 98.92 / 4.38 | 95.08 / 33.79 | 99.88 / 0.48 | 98.02 / 11.52 |
| DeiT finetuned | DeiT | ViT-Ti/16 | RegNetY-16GF | ImageNet | 97.86 / 9.33 | 97.52 / 9.69 | 99.67 / 0.71 | 99.70 / 0.96 | 98.69 / 5.17 |
| DeiT finetuned | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet | 97.95 / 9.31 | 97.41 / 8.99 | 99.63 / 0.32 | 99.36 / 1.51 | 98.59 / 5.03 |
| DeiT finetuned | CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | ImageNet → target dataset | 97.78 / 11.81 | 98.18 / 7.90 | 99.81 / 0.16 | 99.37 / 2.16 | 98.79 / 5.51 |
| – | DINOv2 | ViT-S/14 | – | – | 97.00 / 19.08 | 98.93 / 4.25 | 90.60 / 55.45 | 99.96 / 0.18 | 96.62 / 19.74 |

C Ablation Studies

In this section we modify a number of hyperparameters and component choices for CosPress to investigate how these impact performance. In the tables below, the bold parameter settings are the defaults used throughout the rest of the paper.
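The ablations below vary pieces of the combined objective from Algorithm 1. As a concrete point of reference, the following NumPy sketch mimics a single training step on precomputed features. It is a simplified illustration rather than the paper's implementation: `W` stands in for a linear teacher head $h_\phi$, the dimensionality-reduction term is replaced by a squared-error surrogate on pairwise cosine similarities (the paper uses the KL-based form of Eq. (12)), and freezing ϕ in the student loss corresponds here to treating the head outputs as plain constant targets (a stop-gradient in an autograd framework).

```python
import numpy as np

def unit_rows(z):
    # Normalise each row to unit length.
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def cosine_loss(a, b):
    # Student objective surrogate: mean (1 - cosine similarity) between
    # student outputs and compressed teacher targets, cf. Eq. (15).
    return float(np.mean(1.0 - np.sum(unit_rows(a) * unit_rows(b), axis=1)))

def sim_preservation_loss(z_head, z_teacher):
    # Surrogate for the dimensionality-reduction objective: penalise changes
    # in pairwise cosine similarities introduced by the compressing head.
    # The paper uses a KL-based form; squared error is used here purely
    # for illustration.
    sa = unit_rows(z_head) @ unit_rows(z_head).T
    sb = unit_rows(z_teacher) @ unit_rows(z_teacher).T
    return float(np.mean((sa - sb) ** 2))

rng = np.random.default_rng(0)
t_feats = rng.normal(size=(8, 384))                  # teacher embeddings (ViT-S width)
W = rng.normal(size=(384, 192)) / np.sqrt(384)       # stand-in linear teacher head h_phi
targets = t_feats @ W                                # compressed teacher targets
s_feats = targets + 0.01 * rng.normal(size=targets.shape)  # near-converged student

# Student loss: the targets are constants here, mirroring the frozen phi
# gradients of Algorithm 1.
L_student = cosine_loss(s_feats, targets)
L_dim_red = sim_preservation_loss(targets, t_feats)
L_total = L_student + L_dim_red                      # combined loss of Algorithm 1
```

In an actual implementation, gradients from the student term would update only θ, while the dimensionality-reduction term updates ϕ; the weighted alternative of Eq. (39) instead lets both terms update ϕ.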
Table S19: Ablation study: weighting. Comparison of kNN performance on ImageNet-1K for CosPress models trained with different weightings γ for the dimensionality reduction component. The first row uses the frozen-gradient approach described in the paper.

| Method | Arch | Teacher | γ | kNN |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | – | 74.3 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | 10 | 74.3 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | 100 | 74.2 |

CosPress dimensionality reduction loss weighting. We consider weighting the dimensionality reduction loss in Eq. (6), rather than freezing the gradients of ϕ in the student loss. This gives an alternative loss function

$$\mathcal{L}_{\text{CosPress}}(X; \phi, \theta) = \gamma \, \mathcal{L}_{\text{dim-red}}(X; \phi) + \mathcal{L}_{\text{student}}(X; \theta, \phi), \quad (39)$$

where γ is a weighting factor that prioritises the dimensionality reduction loss when set to a value greater than one. Table S19 shows that the weighting and gradient-freezing approaches lead to similar results.

Table S20: Ablation study: metric. Comparison of kNN performance on ImageNet-1K for CosPress models trained with different metrics for the student loss.

| Method | Arch | Teacher | Loss metric | kNN |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | Cosine distance | 74.3 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | MSE | 74.2 |

CosPress student loss metric. Table S20 considers the impact on performance of using different metrics for the student loss $\mathcal{L}_{\text{student}}$ in Eq. (15). It is found that using a cosine distance loss leads to slightly better performance than a mean squared error loss.

Table S21: Ablation study: temperature. Comparison of kNN performance on ImageNet-1K for CosPress models trained with different sets of temperatures τ for the dimensionality reduction loss.
| Method | Arch | Teacher | τ | kNN |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10] | 74.3 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | [0.01] | 74.1 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | [0.10] | 74.5 |
| CosPress | ViT-Ti/14 | DINOv2 ViT-S/14 | [0.01, 0.10] | 74.3 |

CosPress dimensionality reduction temperature parameters. Table S21 shows the performance impact of different sets of temperatures τ in the dimensionality reduction loss of Eq. (11). It is found that these values have a small impact on performance, with the best set being τ = [0.10], which obtains slightly better performance than the set of parameters chosen for the paper.
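To make the role of the temperature set concrete, the sketch below implements one plausible form of a multi-temperature similarity-matching loss: row-wise softmax distributions over pairwise cosine similarities are formed at each temperature in τ, compared with a KL divergence, and averaged. Since Eq. (11) is not reproduced in this appendix, the exact normalisation and the function names here are assumptions rather than the paper's definition.

```python
import numpy as np

def row_softmax(sim, tau):
    # Softmax over each row of a similarity matrix at temperature tau.
    s = sim / tau
    s = s - s.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)

def multi_temperature_kl(z_head, z_teacher,
                         taus=(0.01, 0.02, 0.03, 0.04, 0.05,
                               0.06, 0.07, 0.08, 0.09, 0.10)):
    # Pairwise cosine similarities before and after the compressing head.
    a = z_head / np.linalg.norm(z_head, axis=1, keepdims=True)
    b = z_teacher / np.linalg.norm(z_teacher, axis=1, keepdims=True)
    sim_a, sim_b = a @ a.T, b @ b.T
    losses = []
    for tau in taus:
        p = row_softmax(sim_b, tau)        # teacher similarity distribution
        q = row_softmax(sim_a, tau)        # compressed similarity distribution
        losses.append(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
    return float(np.mean(losses))

rng = np.random.default_rng(1)
z = rng.normal(size=(6, 64))
# A map that preserves all cosine similarities incurs zero loss ...
loss_identity = multi_temperature_kl(z, z)
# ... while a random linear compression distorts angles slightly.
loss_compressed = multi_temperature_kl(z @ rng.normal(size=(64, 16)), z)
```

A lower temperature sharpens the similarity distributions, emphasising each embedding's nearest neighbours; the ablation in Table S21 suggests the student is not especially sensitive to this choice.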