EVALUATING REPRESENTATIONS WITH READOUT MODEL SWITCHING

Published as a conference paper at ICLR 2023

Yazhe Li  yazhe@deepmind.com
Jörg Bornschein  bornschein@deepmind.com
Marcus Hutter  mhutter@deepmind.com

ABSTRACT

Although much of the success of Deep Learning builds on learning good representations, a rigorous method to evaluate their quality is lacking. In this paper, we treat the evaluation of representations as a model selection problem and propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric. Contrary to the established practice of limiting the capacity of the readout model, we design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions. The MDL score takes model complexity as well as data efficiency into account. As a result, the most appropriate model for the specific task and representation is chosen, making the score a unified measure for comparison. The proposed metric can be computed efficiently with an online method, and we present results for pre-trained vision encoders of various architectures (ResNet and ViT) and objective functions (supervised and self-supervised) on a range of downstream tasks. We compare our method with accuracy-based approaches and show that the latter are inconsistent when multiple readout models are used. Finally, we discuss important properties revealed by our evaluations, such as model scaling, preferred readout model, and data efficiency.

1 INTRODUCTION

Data representation is crucial to the performance of machine learning algorithms (Bengio et al., 2013). Much of the success of Deep Neural Networks (DNNs) can be attributed to their capability of gradually building up more and more abstract representations (Lee et al., 2009). In supervised learning, although the network is trained to predict a specific aspect of the input, the intermediate representations often prove useful for many other downstream tasks (Yosinski et al., 2014). In unsupervised and self-supervised learning, the network is trained on a surrogate task, such as reconstruction (Hinton & Salakhutdinov, 2006; Kingma & Welling, 2013; He et al., 2021) or contrastive prediction (van den Oord et al., 2018; Chen et al., 2020), which is supposed to capture generic priors of the data. In recent years, there have been significant improvements in unsupervised representation learning, with state-of-the-art models achieving performance comparable to their supervised counterparts (Tomasev et al., 2022).

Despite the importance of data representation, the method used to evaluate representations is rarely discussed. The most prevalent practice is to train a readout model on the downstream task. The readout model often has a shallow architecture, e.g. a linear layer, to limit its capacity, so that the task performance reflects the representation quality. The problem with this approach is that the readout model cannot adapt to the nature of the representations. Deeper readout models and fine-tuning alleviate this issue. However, the representations are then left with multiple metrics, each using a different readout mechanism, making comparison extremely difficult (Nozawa & Sato, 2022). In this paper, we treat evaluating representations as a model selection problem. We propose to use Minimum Description Length (MDL) as the main evaluation metric and use model switching to accommodate the need for multiple readout models.
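For contrast with what follows, here is a minimal sketch of the standard single-readout protocol described above: train one fixed, low-capacity probe on frozen features and report downstream accuracy. The use of scikit-learn's LogisticRegression as the linear readout and pre-extracted feature matrices are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of the prevailing linear-probe evaluation: a single
# low-capacity readout is trained on frozen, pre-extracted features and the
# downstream accuracy is reported as the representation quality score.
from sklearn.linear_model import LogisticRegression


def linear_probe_accuracy(z_train, y_train, z_test, y_test):
    """Accuracy of a linear readout trained on frozen representations."""
    probe = LogisticRegression(max_iter=1000).fit(z_train, y_train)
    return probe.score(z_test, y_test)
```

Because the resulting score depends on the particular readout chosen (a linear probe here, an MLP or fine-tuning elsewhere), different protocols can rank the same representations differently, which is the inconsistency the MDL-based metric is designed to avoid.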
MDL is a well-studied compression-based approach to inductive inference that provides a generic solution to the model selection problem (Rissanen, 1984; Grünwald, 2004; Wallace, 2005; Solomonoff, 1964; Rathmanner & Hutter, 2011). MDL plays a role similar to that of held-out validation for Empirical Risk Minimization (Vapnik, 1991), but has the advantage of being able to deal with a single sequence and with non-stationary data. It is closely related to Bayesian model selection and includes a form of Occam's Razor, in that the metric takes model complexity into account. The complexity term can be represented explicitly, as the codelength of the model in the case of a two-part code or as a KL term when using a variational code, or implicitly, when using prequential or Bayesian codes. By including the model complexity in the evaluation metric, we remove the need to limit the readout model complexity and can compare MDL scores freely across different readout mechanisms. Intuitively, if the induced representation is nonlinear and requires a higher-capacity readout model, the MDL score reflects this through a larger complexity term. Note that this also applies to fine-tuning, where the pre-trained model is allowed to adapt to the downstream task. Model switching allows multiple readout models and automatically finds the best readout model for the downstream task at each dataset size (Figure 1). Therefore, MDL with readout model switching provides a unified framework for evaluating representations regardless of the evaluation protocol employed.

Figure 1: Illustration of switching between models of different complexity. Depending on the number of training examples, either A, B, or C has the best generalization performance. An optimally switched model will have the best performance at each point and thus the lowest prequential description length (= area under the curve).

It is conjectured that useful representations make the variability in the data more predictable and allow for data-efficient, human-like learning (Hénaff et al., 2019). The MDL evaluation metric formalizes this data-efficiency perspective, which is especially evident in the form of prequential MDL. Prequential MDL (Dawid & Vovk, 1999; Poland & Hutter, 2005) turns computing the description length $L(D \mid \phi) = -\log p(D \mid \phi)$ into a sequential prediction problem: $-\log p(D \mid \phi) = -\sum_t \log p(y_t \mid \phi_{\le t}, y_{<t})$, where $\phi_t$ denotes the representation of the $t$-th input.
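To make this concrete, below is a minimal sketch of a prequential codelength computation with switching over readout models. It is an illustration under simplifying assumptions rather than the paper's implementation: the candidate readouts (a logistic-regression probe and a small MLP), the chunked refitting schedule, and the use of a fixed-share mixture as the switching strategy are all illustrative choices, and the features `z` are assumed to be pre-extracted from a frozen encoder with integer labels `y` in `0..n_classes-1`.

```python
# Minimal sketch (not the paper's implementation) of prequential description
# length with switching over readout models. Labels are coded one chunk at a
# time: each candidate readout is refit on the prefix seen so far, and a
# fixed-share mixture over the candidates' predictive probabilities gives the
# per-example codelength in bits.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


def full_class_probs(model, z_t, n_classes, eps=1e-6):
    """Probability vector over all classes, even if the model was fit on a
    prefix that did not contain every class. Assumes integer class labels."""
    p = np.full(n_classes, eps)
    p[model.classes_] += model.predict_proba(z_t[None, :])[0]
    return p / p.sum()


def prequential_switching_codelength(z, y, n_classes, chunk=256, alpha=0.01):
    """Codelength (in bits) of labels y given representations z under a
    fixed-share switching mixture of candidate readout models."""
    candidates = [
        LogisticRegression(max_iter=1000),         # linear readout
        MLPClassifier(hidden_layer_sizes=(256,)),  # nonlinear readout
    ]
    k = len(candidates)
    w = np.full(k, 1.0 / k)  # switching weights over the candidate readouts
    total_bits = 0.0
    for start in range(0, len(y), chunk):
        fitted = start > 0
        if fitted:  # refit every candidate on all data seen so far
            # (assumes the first chunk already contains at least two classes)
            for m in candidates:
                m.fit(z[:start], y[:start])
        for t in range(start, min(start + chunk, len(y))):
            if fitted:
                p = np.array([full_class_probs(m, z[t], n_classes)[y[t]]
                              for m in candidates])
            else:  # nothing seen yet: code labels under a uniform prior
                p = np.full(k, 1.0 / n_classes)
            total_bits += -np.log2(w @ p)    # mixture (switched) codelength
            v = w * p / (w @ p)              # Bayesian posterior update
            w = (1 - alpha) * v + alpha / k  # fixed-share: allow switching
    return total_bits
```

The returned codelength corresponds to the area under the online learning curve in Figure 1: a representation for which some readout in the pool predicts well from few examples, with the mixture shifting toward higher-capacity readouts as data accumulate, receives a shorter description length.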