EVALUATING REPRESENTATIONS WITH READOUT MODEL SWITCHING

Published as a conference paper at ICLR 2023

Yazhe Li  yazhe@deepmind.com
Jörg Bornschein  bornschein@deepmind.com
Marcus Hutter  mhutter@deepmind.com

ABSTRACT

Although much of the success of Deep Learning builds on learning good representations, a rigorous method to evaluate their quality is lacking. In this paper, we treat the evaluation of representations as a model selection problem and propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric. Contrary to the established practice of limiting the capacity of the readout model, we design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions. The MDL score takes model complexity as well as data efficiency into account. As a result, the most appropriate model for the specific task and representation is chosen, making the score a unified measure for comparison. The proposed metric can be computed efficiently with an online method, and we present results for pre-trained vision encoders of various architectures (ResNet and ViT) and objective functions (supervised and self-supervised) on a range of downstream tasks. We compare our method with accuracy-based approaches and show that the latter are inconsistent when multiple readout models are used. Finally, we discuss important properties revealed by our evaluations, such as model scaling, preferred readout model, and data efficiency.

1 INTRODUCTION

Data representation is crucial to the performance of machine learning algorithms (Bengio et al., 2013). Much of the success of Deep Neural Networks (DNNs) can be attributed to their capability of gradually building up more and more abstract representations (Lee et al., 2009). In supervised learning, although the network is trained to predict a specific aspect of the input, the intermediate representations often prove useful for many other downstream tasks (Yosinski et al., 2014). In unsupervised and self-supervised learning, the network is trained on a surrogate task, such as reconstruction (Hinton & Salakhutdinov, 2006; Kingma & Welling, 2013; He et al., 2021) or contrastive prediction (van den Oord et al., 2018; Chen et al., 2020), which is supposed to capture generic priors of the data. In recent years, there have been significant improvements in unsupervised representation learning, with state-of-the-art models achieving performance comparable to their supervised counterparts (Tomasev et al., 2022).

Despite the importance of data representation, the method used to evaluate representations is rarely discussed. The most prevalent practice is to train a readout model on the downstream task. The readout model often has a shallow architecture, e.g. a linear layer, to limit its capacity, so that the task performance reflects the representation quality. The problem with this approach is that the readout model cannot adapt to the nature of the representations. Deeper readout models and fine-tuning alleviate this issue. However, the representations are then left with multiple metrics, each using a different readout mechanism, making comparison extremely difficult (Nozawa & Sato, 2022). In this paper, we treat evaluating representations as a model selection problem. We propose to use Minimum Description Length (MDL) as the main evaluation metric and use model switching to accommodate the need for multiple readout models.
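For contrast with what follows, here is a minimal sketch of the standard single-readout protocol described above: train one fixed, low-capacity probe on frozen features and report downstream accuracy. The use of scikit-learn's LogisticRegression as the linear readout and pre-extracted feature matrices are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of the prevailing linear-probe evaluation: a single
# low-capacity readout is trained on frozen, pre-extracted features and the
# downstream accuracy is reported as the representation quality score.
from sklearn.linear_model import LogisticRegression


def linear_probe_accuracy(z_train, y_train, z_test, y_test):
    """Accuracy of a linear readout trained on frozen representations."""
    probe = LogisticRegression(max_iter=1000).fit(z_train, y_train)
    return probe.score(z_test, y_test)
```

Because the resulting score depends on the particular readout chosen (a linear probe here, an MLP or fine-tuning elsewhere), different protocols can rank the same representations differently, which is the inconsistency the MDL-based metric is designed to avoid.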
MDL is a well-studied compression-based approach to inductive inference that provides a generic solution to the model selection problem (Rissanen, 1984; Grünwald, 2004; Wallace, 2005; Solomonoff, 1964; Rathmanner & Hutter, 2011). MDL plays a role similar to that of held-out validation for Empirical Risk Minimization (Vapnik, 1991), but has the advantage of being able to deal with a single sequence and with non-stationary data. It is closely related to Bayesian model selection and includes a form of Occam's Razor, in that the metric takes model complexity into account. The complexity term can be represented explicitly, as the codelength of the model in the case of a two-part code or as a KL term when using a variational code, or implicitly, when using prequential or Bayesian codes. By including the model complexity in the evaluation metric, we remove the need to limit the readout model complexity and can compare MDL scores freely across different readout mechanisms. Intuitively, if the induced representation is nonlinear and requires a higher-capacity readout model, the MDL score reflects this through a larger complexity term. Note that this also applies to fine-tuning, where the pre-trained model is allowed to adapt to the downstream task. Model switching allows multiple readout models and automatically finds the best readout model for the downstream task at each dataset size (Figure 1). Therefore, MDL with readout model switching provides a unified framework for evaluating representations regardless of the evaluation protocol employed.

Figure 1: Illustration of switching between models of different complexity. Depending on the number of training examples, either A, B, or C has the best generalization performance. An optimally switched model will have the best performance at each point and thus the lowest prequential description length (= area under the curve).

It is conjectured that useful representations make the variability in the data more predictable and allow for data-efficient, human-like learning (Hénaff et al., 2019). The MDL evaluation metric formalizes this data-efficiency perspective, which is especially evident in the form of prequential MDL. Prequential MDL (Dawid & Vovk, 1999; Poland & Hutter, 2005) turns computing the description length $L(D \mid \phi) = -\log p(D \mid \phi)$ into a sequential prediction problem: $-\log p(D \mid \phi) = -\sum_t \log p(y_t \mid \phi_{\le t}, y_{<t})$, where $\phi_t$ denotes the representation of the $t$-th input.
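To make this concrete, below is a minimal sketch of a prequential codelength computation with switching over readout models. It is an illustration under simplifying assumptions rather than the paper's implementation: the candidate readouts (a logistic-regression probe and a small MLP), the chunked refitting schedule, and the use of a fixed-share mixture as the switching strategy are all illustrative choices, and the features `z` are assumed to be pre-extracted from a frozen encoder with integer labels `y` in `0..n_classes-1`.

```python
# Minimal sketch (not the paper's implementation) of prequential description
# length with switching over readout models. Labels are coded one chunk at a
# time: each candidate readout is refit on the prefix seen so far, and a
# fixed-share mixture over the candidates' predictive probabilities gives the
# per-example codelength in bits.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


def full_class_probs(model, z_t, n_classes, eps=1e-6):
    """Probability vector over all classes, even if the model was fit on a
    prefix that did not contain every class. Assumes integer class labels."""
    p = np.full(n_classes, eps)
    p[model.classes_] += model.predict_proba(z_t[None, :])[0]
    return p / p.sum()


def prequential_switching_codelength(z, y, n_classes, chunk=256, alpha=0.01):
    """Codelength (in bits) of labels y given representations z under a
    fixed-share switching mixture of candidate readout models."""
    candidates = [
        LogisticRegression(max_iter=1000),         # linear readout
        MLPClassifier(hidden_layer_sizes=(256,)),  # nonlinear readout
    ]
    k = len(candidates)
    w = np.full(k, 1.0 / k)  # switching weights over the candidate readouts
    total_bits = 0.0
    for start in range(0, len(y), chunk):
        fitted = start > 0
        if fitted:  # refit every candidate on all data seen so far
            # (assumes the first chunk already contains at least two classes)
            for m in candidates:
                m.fit(z[:start], y[:start])
        for t in range(start, min(start + chunk, len(y))):
            if fitted:
                p = np.array([full_class_probs(m, z[t], n_classes)[y[t]]
                              for m in candidates])
            else:  # nothing seen yet: code labels under a uniform prior
                p = np.full(k, 1.0 / n_classes)
            total_bits += -np.log2(w @ p)    # mixture (switched) codelength
            v = w * p / (w @ p)              # Bayesian posterior update
            w = (1 - alpha) * v + alpha / k  # fixed-share: allow switching
    return total_bits
```

The returned codelength corresponds to the area under the online learning curve in Figure 1: a representation for which some readout in the pool predicts well from few examples, with the mixture shifting toward higher-capacity readouts as data accumulate, receives a shorter description length.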