Published as a conference paper at ICLR 2025

INTERMEDIATE LAYER CLASSIFIERS FOR OOD GENERALIZATION

Arnas Uselis
Tübingen AI Center, University of Tübingen
arnas.uselis@uni-tuebingen.de

Seong Joon Oh
Tübingen AI Center, University of Tübingen

ABSTRACT

Deep classifiers are known to be sensitive to data distribution shifts, primarily due to their reliance on spurious correlations in training data. It has been suggested that these classifiers can still find useful features in the network's last layer that hold up under such shifts. In this work, we question the use of last-layer representations for out-of-distribution (OOD) generalisation and explore the utility of intermediate layers. To this end, we introduce Intermediate Layer Classifiers (ILCs). We discover that intermediate-layer representations frequently offer substantially better generalisation than those from the penultimate layer. In many cases, zero-shot OOD generalisation using earlier-layer representations approaches the few-shot performance of retraining on penultimate-layer representations. This is confirmed across multiple datasets, architectures, and types of distribution shift. Our analysis suggests that intermediate layers are less sensitive to distribution shifts compared to the penultimate layer. These findings highlight the importance of understanding how information is distributed across network layers and its role in OOD generalisation, while also pointing to the limits of penultimate-layer representation utility. Code is available at https://github.com/oshapio/intermediate-layer-generalization.

[Figure 1: bar charts comparing ID, zero-shot OOD, and few-shot OOD accuracy across four datasets.]

Figure 1: Using last vs intermediate layers for OOD generalisation. A common way to address distribution shift is to fine-tune the last layer of a network on the target distribution (few-shot learning).
We show that earlier-layer representations often generalise better than the last layer. Moreover, even when only in-distribution (ID) data is available, earlier-layer representations are often better than the last layer (zero-shot learning).

1 INTRODUCTION

Deep neural networks (DNNs) often lack robustness when evaluated on out-of-distribution (OOD) samples: once a classifier is trained, its performance often drops significantly when deployed in a new environment (Taori et al., 2020). A reason for this is that training data often contain spurious correlations (Singla & Feizi, 2021; Li et al., 2023), which encourage models to learn shortcuts (Scimeca et al., 2022; Cadene et al., 2020; Recht et al., 2019). Relying on such shortcuts leads to models that do not generalize well to out-of-distribution (OOD) samples lying outside the training distribution (Geirhos et al., 2020; 2022; Rosenfeld et al., 2018; Beery et al., 2018). Many attempts have been made to address the disparity in performance between in-distribution (ID) and OOD samples (Zhang et al., 2018; Yun et al., 2019; Shi et al., 2022; Verma et al., 2019), but when no testing data is available from the target distribution, generalization is challenging.

Recent work has shown that the last layer of DNNs already contains enough information for generalization in the cases of long-tail classification (Kang et al., 2020), domain generalization (Rosenfeld et al., 2022), and learning under spurious correlations (Kirichenko et al., 2023). In these works, only the linear classifier of the last layer is retrained on the target distribution, and the model is shown to generalize well to OOD samples. This suggests that the model can already learn useful representations from the training data alone, and only the classifier needs to be adjusted for the target distribution.
This work questions the conventional usage of the last layer for OOD generalization, providing an in-depth analysis across various distribution shifts and model architectures. We examine both few-shot OOD generalization, where some OOD samples are available for training a linear classifier, and zero-shot OOD generalization, which requires no target examples. Our studies reveal that earlier-layer representations often yield superior OOD generalization in both scenarios. Figure 1 illustrates this across multiple datasets. For instance, on CMNIST, using earlier layers improves zero-shot OOD accuracy by 7% and few-shot OOD accuracy by 12% compared to retraining the last layer. The effect is even more pronounced for CelebA, where we observe substantial gains of 20% for zero-shot OOD generalization using earlier layers. These results consistently demonstrate the advantages of using earlier-layer representations for OOD generalization tasks.

We advocate the use of earlier-layer representations for a few critical practical benefits. First, we show that they often exhibit better generalization performance when tuned on the target distribution; this also extends to cases where the number of samples from the target distribution is small (§4.2). Second, the benefits remain even when no OOD data is used for training the probes, only for the model-selection step (§4.3).

Our contributions are as follows: (1) we establish that earlier-layer representations often outperform last-layer features in few-shot transfer; (2) we show that the same holds for zero-shot transfer, even when no OOD supervision is available; and (3) we provide evidence that intermediate-layer features are generally less sensitive to distribution shifts than those from the final layer, offering new insights into feature utility for enhancing model robustness.
2 RELATED WORK

OOD generalization and spurious correlations: It has been reported that spurious correlations in training data degrade the generalizability of learned representations, particularly on out-of-distribution (OOD) data (Arjovsky et al., 2020; Ruan et al., 2022; Hermann & Lampinen, 2020), especially on groups within the data that are underrepresented (Sagawa et al., 2020a; Yang et al., 2023) and when the number of groups is large (Li et al., 2023; Kim et al., 2023). The theory of gradient starvation further suggests that standard SGD training may preferentially leverage spurious correlations (Pezeshki et al., 2021), especially when the spurious cue is simple (Scimeca et al., 2022). Other works have focused on the conceptual ill-posedness of OOD generalization without any information about the target distribution (Ruan et al., 2022; Bahng et al., 2020; Scimeca et al., 2022). Simplicity bias, where networks favor easy-to-learn features, is influenced by the breadth of solutions exploiting such features compared to those utilizing more complex signals (Geirhos et al., 2020; Valle-Pérez et al., 2018; Scimeca et al., 2022). It has been observed that DNN models tend to perform well on average but worse on infrequent groups within the data (Sagawa et al., 2020a); this is especially exacerbated for overparameterized models (Sagawa et al., 2020b; Menon et al., 2020).

Last-layer retraining: Recent studies have argued that last-layer representations already contain valuable information for generalization beyond the training distribution (Kirichenko et al., 2023; Izmailov et al., 2022; LaBonte et al., 2023). These approaches suggest retraining the last-layer weights (using the features from the penultimate layer) with either target-domain data or group annotations from large language models (Park et al., 2023). In this work, we challenge the underlying assumption that the penultimate layer encapsulates all pertinent information for OOD generalization.
Our analysis indicates that earlier-layer representations are often far more useful for OOD generalization; we even show that earlier-layer representations without fine-tuning on the target distribution often fare competitively with last-layer retraining on the target distribution.

Exploiting and analyzing intermediate layers: Intermediate layers have been employed for various purposes, from predicting generalization gaps (Jiang et al., 2019), to elucidating training dynamics (Alain & Bengio, 2018), to enhancing transfer and few-shot learning (Evci et al., 2022; Adler et al., 2021), to enhancing the transferability of adversarial examples (Huang et al., 2020), and to adjusting models based on distribution shifts (Lee et al., 2023). The effectiveness of intermediate layers can be connected to the properties of neural network dynamics (Yosinski et al., 2014). For example, the intrinsic dimensionality of intermediate layers has been shown to increase in the earlier layers and then decrease in the later layers (Ansuini et al., 2019; Recanatesi et al., 2019), suggesting a rich and diverse set of features in the intermediate layers that can be leveraged for generalization. Neural collapse (Papyan et al., 2020), a phenomenon where last-layer class representations collapse to their class means, points to a reason for the difficulty of transferring features from the last layer. Neural collapse is also observed in intermediate layers, but it is less severe there (Li et al., 2022; Rangamani et al., 2023). Recent work has also explored the utility of intermediate layers in distinct contexts, such as generalization to new class splits (Gerritz et al., 2024; Dyballa et al., 2024), and has arrived at conclusions similar to ours: that the last layer does not always generalize best. Additionally, Masarczyk et al.
(2023) observe a complementary phenomenon termed the tunnel effect, in which later layers compress the linearly separable representations created by the initial layers, thus degrading OOD performance. Our work differs by examining distribution shifts within individual datasets, rather than generalization across datasets or tasks. Unlike prior studies, we investigate how intermediate layers generalize under controlled in-dataset shifts, even when training and testing involve similar visual variations.

3 APPROACH

This section introduces the preliminary concepts and notation as well as our approach to OOD generalization using intermediate-layer representations.

Table 1: Experimental setups. We consider settings where an entire deep neural network (DNN) is first trained on Dtrain and one of its layers is adapted further on Dprobe (Intermediate Layer Classifiers; ILC in §3.2). The model is then validated on Dvalid and evaluated on Dtest. Depending on the scenario type (few-shot vs zero-shot), either up to Dtrain or up to Dprobe is considered to be in-distribution PID and the rest to be out-of-distribution POOD.

                     DNN train          ILC       Valid     Test
Few-shot (§4.2)      Dtrain             Dprobe    Dvalid    Dtest
Zero-shot (§4.3)     Dtrain = Dprobe              Dvalid    Dtest

3.1 TASK SETUP

We consider the classification task of mapping inputs X to labels Y. We present an overview of the training and evaluation stages of the model in Table 1. A deep neural network (DNN) model is first trained on a training dataset Dtrain. The DNN is further adapted to the task through Intermediate Layer Classifier (ILC) training (§3.2) on the probe-training dataset Dprobe. Afterwards, any model selection or hyperparameter search is performed over the validation set Dvalid. Finally, the model is evaluated on the test set Dtest. In order to simulate the OOD generalization scenario, we introduce two different distributions: PID and POOD, respectively denoting the in-distribution (ID) and OOD cases. The training of any DNN is performed over ID data: Dtrain ∼ PID.
The validation is performed over OOD data: Dvalid ∼ POOD. The usage of a few OOD samples for validation is a standard practice in the OOD generalization literature (Gulrajani & Lopez-Paz, 2020; Sagawa et al., 2020a; Izmailov et al., 2022), and we adopt this framework for a fair comparison. Likewise, the final evaluation is performed on OOD data: Dtest ∼ POOD. We consider multiple variants of OOD datasets; we discuss them in greater detail in §4.1. For the ILC training step, we consider two possibilities: (1) few-shot, where Dprobe ∼ POOD consists of K OOD samples per class, and (2) zero-shot, where Dprobe ∼ PID, i.e., probes are trained over the ID set. The last-layer retraining framework (Kirichenko et al., 2023) corresponds to the few-shot learning scenario, where the last layer is retrained with K OOD samples per class. Our research question for the few-shot scenario is whether intermediate representations yield better OOD generalization compared to the last-layer retraining paradigm. For the zero-shot scenario, we venture into a more audacious research question: is it possible to train an intermediate representation with ID samples alone so that it generalizes to OOD cases?

3.2 INTERMEDIATE LAYER CLASSIFIERS (ILCS)

In this section, we propose the framework for training Intermediate Layer Classifiers (ILCs). We start with the necessary notation and background material. Let the function f be a deep neural network (DNN) classifier with L layers. We denote the l-th layer operation as fl, such that f = fL ∘ fL−1 ∘ … ∘ f1, a composition of L functions. The last layer of the network, fL, is a linear classifier. We refer to the output of the l-th layer for an input x as the l-th layer representation, denoted as rl(x) := (fl ∘ … ∘ f1)(x) ∈ R^dl, where dl denotes the output dimension at layer l. For l < L, we refer to fl as an intermediate layer and rl(x) as an intermediate representation.

[Figure 2: schematic: an input image passes through a frozen pre-trained model (e.g. a ResNet or a ViT), and a linear probe is attached to the intermediate representations.]

Figure 2: Intermediate Layer Classifiers (ILC). Given a frozen pre-trained model such as a ResNet or a ViT, we train a linear probe on the intermediate-layer representation (here, we show this process only at layer l). The composition of the l-th layer feature extractor and the intermediate layer classifier (ILC) is the final classifier.

We shorthand rl(x) as rl for brevity. We primarily use ResNets (He et al., 2015) for convolutional neural networks and vision transformers (ViTs) (Dosovitskiy et al., 2021) for non-CNN architectures, while other architectures are also used when open-source implementations are available. A ResNet layer, for example, consists of convolutions, ReLU activations, batch normalization, and residual connections, leading to L = 8 layers in total, excluding the classification head. A ViT layer is an encoder block composed of multi-head attention (MHA) and a multi-layer perceptron (MLP). The number of layers for ViT models varies depending on the dataset used. All the architectures used are detailed in Appendix A.1.

Algorithm 1 Training Intermediate Layer Classifiers (ILCs)
 1: for 1 ≤ l ≤ L−1 do
 2:     Initialize weights Wl and biases bl for ILCl
 3: end for
 4: for (x, y) ∈ Dprobe do
 5:     for 1 ≤ l ≤ L−1 do
 6:         rl(x) = (fl ∘ … ∘ f1)(x)
 7:         ŷl = ILCl(x) = Wl rl(x) + bl
 8:     end for
 9:     ℓ = Σ_{l=1}^{L−1} CE(ŷl, y)
10:     Update the probe parameters {(Wl, bl)} using ℓ
11: end for

Intermediate Layer Classifiers (ILCs). We introduce Intermediate Layer Classifiers (ILCs) to address the specific challenge of OOD tasks. While traditional linear probes (Alain & Bengio, 2018) are typically used to analyze learned representations across a model, ILCs serve a different purpose: they are applied to intermediate layers and trained specifically to perform OOD classification. An ILC at layer l is an affine transformation that maps the representation space R^dl to logits:

    ILCl(x) := Wl rl(x) + bl,    (1)

where Wl ∈ R^{|Y|×dl} and bl ∈ R^{|Y|}.
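The ILC defined in Eq. (1), trained with cross-entropy as in Algorithm 1, amounts to fitting an independent multinomial logistic regression on each layer's frozen features. The sketch below is our own illustration, not the authors' code: the plain gradient-descent optimiser, learning rate, and epoch count are illustrative choices, and the per-layer representations rl(x) are assumed to be precomputed as arrays.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_ilcs(layer_reps, y, n_classes, lr=0.1, epochs=200):
    """Train one linear probe (ILC) per layer on frozen representations.

    layer_reps: list of arrays, one (n, d_l) matrix of features r_l(x) per layer.
    Each probe minimises softmax cross-entropy, mirroring Algorithm 1.
    """
    probes = []
    n = y.shape[0]
    onehot = np.eye(n_classes)[y]
    for r in layer_reps:
        W = np.zeros((r.shape[1], n_classes))
        b = np.zeros(n_classes)
        for _ in range(epochs):
            p = softmax(r @ W + b)       # yhat_l = W_l r_l(x) + b_l, then softmax
            g = (p - onehot) / n         # CE gradient w.r.t. the logits
            W -= lr * (r.T @ g)
            b -= lr * g.sum(axis=0)
        probes.append((W, b))
    return probes

def ilc_predict(probe, r):
    """Predicted class labels of one ILC on features r."""
    W, b = probe
    return (r @ W + b).argmax(axis=1)
```

Since the backbone is frozen, the probes are independent of one another, so training them jointly (as in Algorithm 1) or one layer at a time gives the same result.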
The ILCs for 1 ≤ l ≤ L−1 are trained on the dataset Dprobe (§3.1), using data drawn from the ID distribution for the zero-shot scenario (Dprobe ∼ PID) and from the OOD distribution for the few-shot scenario (Dprobe ∼ POOD). We illustrate the data flow of using ILCs conceptually in Figure 2. Last-layer retraining methods (Kirichenko et al., 2023; Izmailov et al., 2022; LaBonte et al., 2023; Kang et al., 2020) can be considered a special case of the ILC framework, where only the final layer L is adapted on Dprobe for OOD tasks, corresponding to using ILC_{L−1} in our framework. Algorithm 1 illustrates the training process of ILCs.

Layer Selection. We choose the layer l* with the best ILC accuracy on the validation set Dvalid among layers l ≤ L−2. We restrict the layer selection to layers up to L−2 to distinguish the results from the last-layer retraining approach. The best layer is chosen based on the best-performing hyperparameters in the search space H (detailed in Appendix A.2.2) for each layer. Using a few OOD samples for making design choices is one of the common practices in benchmarking OOD generalization (Kirichenko et al., 2023; Sagawa et al., 2020a).

Inference. For the selected layer l* ≤ L−2, the whole model is reduced to a smaller network with l* layers. Pseudocode for the inference process is provided in Algorithm 4 in the Appendix.

4 EXPERIMENTS

In this section, we verify the effectiveness of the ILCs of §3.2 for out-of-distribution (OOD) generalization. In particular, we compare their performance against the popular last-layer retraining approach. We introduce the datasets and experimental setups in §4.1. We report results under two scenarios: the few-shot (§4.2) and zero-shot (§4.3) cases, respectively referring to the availability and unavailability of OOD data for ILC training.

4.1 DATASETS AND EXPERIMENTAL SETUP

We introduce the datasets and a precise experimental setup for simulating and studying OOD generalization.
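Before turning to the datasets, the layer-selection rule described in §3.2 can be made concrete: each layer is scored by its best hyperparameter setting on Dvalid, and only layers l ≤ L−2 are eligible. The sketch below is our own illustration; the dictionary-of-accuracies interface is an assumption, not the paper's code.

```python
def select_layer(val_acc, num_layers):
    """Select l* = argmax_l max_h val_acc[(l, h)], restricted to l <= L-2.

    val_acc maps (layer, hyperparameter_id) -> validation accuracy of the
    ILC trained at that layer with those hyperparameters; num_layers is L.
    """
    best_per_layer = {}
    for (l, h), acc in val_acc.items():
        # layers above L-2 (i.e. the last-layer-retraining probe) are excluded
        if l <= num_layers - 2:
            best_per_layer[l] = max(best_per_layer.get(l, float("-inf")), acc)
    best_layer = max(best_per_layer, key=best_per_layer.get)
    return best_layer, best_per_layer[best_layer]
```

At inference, the network is then truncated to its first l* layers, with ILCl* serving as the classification head.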
As introduced in the task section (§3.1), we need two distinct distributions, PID(X, Y) ≠ POOD(X, Y), for the study. We perform an extensive study over 9 datasets, covering various scenarios, including subpopulation shifts, conditional shifts, noise-level perturbations, and natural image shifts. Detailed definitions of the shift types and the corresponding datasets are listed in Table 2 below.

Table 2: Distribution shift types and datasets. Datasets were selected based on the availability of distribution shifts and their compatibility with publicly available pre-trained model weights.

Shift type      Description                                                          Datasets
Conditional     The conditional distribution P(Y|X) shifts:                          CMNIST (Arjovsky et al., 2020; Bahng et al., 2020)
                Ptrain(Y|X) ≠ Ptest(Y|X).
Subpopulation   Given multiple groups within a population, the ratios                CelebA (Liu et al., 2015), Waterbirds (Sagawa et al., 2020a),
                of the groups change across distributions.                           MultiCelebA (Kim et al., 2023)
Input noise     Different types of input noise are applied to the test samples.     CIFAR-10C, CIFAR-100C (Hendrycks & Dietterich, 2019)
Natural image   Test images have different styles from the training images.          ImageNet-A, ImageNet-R, ImageNet-Cue-Conflict,
                                                                                     ImageNet-Silhouette (Hendrycks et al., 2021b;a; Geirhos et al., 2022)

In §3.1, we introduced the few-shot and zero-shot settings for OOD generalization. Below, we explain how we adapt each dataset to the required data splits, Dtrain, Dprobe, Dvalid, and Dtest. Detailed information on the datasets and their splits is provided in Appendix A.2.

Training split Dtrain. For all datasets, we assume DNN models were trained on the given training split.

Probe-training split Dprobe. For the zero-shot setting, we use the Dtrain split. For the few-shot setting, we use a subset of the OOD splits of each dataset.

Validation split Dvalid. In all settings, we use the original held-out validation set whenever available in the datasets (Waterbirds, CelebA, MultiCelebA, ImageNet).
When unavailable (CMNIST, CIFAR-10C, CIFAR-100C), we use a random half of the test split of the dataset.

Test split Dtest. In all settings, we use the original test set. When half of it was used for validation due to the lack of a validation split, we use the other half for evaluating the models.

Evaluation metrics. We use the accuracy on the test set as the main evaluation metric. For datasets with subpopulation shifts, we use the worst-group accuracy (WGA), defined as the minimal accuracy over the different sub-populations of the dataset (Sagawa et al., 2020a).

DNN model usage and selection. We exclusively use publicly available pre-trained model weights trained on the specific dataset relevant to each experiment. Importantly, we only use frozen representations from these networks and do not fine-tune any parameters of the DNNs. We primarily use ViTs and ResNets in our study due to their differing inductive biases. The availability of pre-trained model weights varied, and we aimed to include the most popular and high-performing models for each distribution shift.

4.2 RESULTS UNDER THE FEW-SHOT SETTING

We evaluate model performance when a few labeled OOD samples are available for ILC or last-layer retraining. Our goal is to challenge the assumption that the penultimate layer contains sufficient information for OOD generalization (Izmailov et al., 2022; Kirichenko et al., 2023; Rosenfeld et al., 2022) and to inspect the common practice of probing the last layer for this purpose (Zhai et al., 2020). To do so, we compare the effectiveness of intermediate representations with that of the penultimate layer.

4.2.1 INFORMATION CONTENT FOR OOD GENERALIZATION AT LAST VERSUS INTERMEDIATE LAYERS

We measure the information content in the last-layer representation versus the intermediate layers by evaluating their accuracy on OOD tasks.
To quantify this, we assume a large number of OOD samples, meaning that the entire validation set, as defined in §4.1, is used for Dprobe ∼ POOD.

[Figure 3: bar charts of last-layer vs best intermediate-layer OOD accuracy on CMNIST, CIFAR-10C, CIFAR-100C, and MultiCelebA for ResNets, and on CMNIST, CIFAR-10C, and CIFAR-100C for ViTs.]

Figure 3: Information content for OOD generalization in the last layer vs intermediate layers. "Last layer" refers to the OOD accuracy of the last-layer retraining approach (ILC_{L−1}); "Best layer" refers to the maximal OOD accuracy among the intermediate layer classifiers (ILCl*). For MultiCelebA, we report the worst-group accuracy (WGA).

Fig. 3 shows the OOD accuracies of the last-layer retraining approach and the best performance achieved by an ILC across layers. Out of the six datasets considered, we observe a general increase in information content when intermediate layers are utilized for OOD generalization instead of the last layer. For ResNets, the performance increments are (+16.8, +6.3, +1.6, +6.3) percentage points on (CMNIST, CIFAR-10C, CIFAR-100C, MultiCelebA). ViTs see smaller or slightly negative increments of (+1.1, +1.3, −0.2) percentage points on (CMNIST, CIFAR-10C, CIFAR-100C). This point is further supported by experiments with non-linear probes (Appendix C.2) and an analysis controlling for feature dimensionality (Appendix C.3), both yielding similar findings. We conclude that abundant information exists for OOD generalization in the intermediate layers of a DNN. The current practice of utilizing only the last-layer representations may neglect the hidden information sources in the earlier layers.

4.2.2 OOD DATA EFFICIENCY FOR LAST VERSUS INTERMEDIATE LAYERS

The previous experiment measures the maximal information content at different layers with abundant OOD data to train the probe; here, we consider the data efficiency of OOD generalization at different layers. In practice, it is crucial that good OOD generalization is achieved with a restricted amount of OOD data.
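The worst-group accuracy reported for MultiCelebA above, and for the subpopulation-shift datasets throughout, is simply the minimum per-group accuracy. A minimal NumPy implementation (our own sketch, with hypothetical group labels) looks like this:

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Worst-group accuracy (WGA): minimum accuracy over the
    sub-populations (groups) of the evaluation set (Sagawa et al., 2020a)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append((y_true[mask] == y_pred[mask]).mean())
    return min(accs)
```

Because WGA is a minimum rather than an average, a model can have high overall accuracy while its WGA exposes failure on a rare group.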
In this experiment, we control the size of the probe-training set Dprobe ∼ POOD of OOD samples with a parameter π ∈ (0, 1] controlling the fraction of OOD data used, compared to the setting in §4.2.1 (corresponding to π = 1.0). While achieving the best performance is not the primary goal, we also benchmark ILC's results against other methods in Appendix A.5.2. We illustrate the ILC and last-layer retraining performances on 6 datasets in Fig. 4. For subpopulation shifts (Waterbirds, CelebA, MultiCelebA), we find that training with a smaller amount (π ≤ 0.03) of OOD data leads to a greater empirical advantage of ILCs over last-layer retraining, with gains of (+5.7, +3.4, +27.0) percentage points. For the other shifts (CMNIST, CIFAR-10C, CIFAR-100C), we observe a consistent benefit of the ILC compared to last-layer retraining. For example, at π = 0.25, the ILC boosts performance by (+12, +3.6, +1.0) percentage points. ViTs exhibit a similar pattern, where ILCs perform better under little OOD data, but the difference is less pronounced (Appendix, Fig. A.5.1).

[Figure 4: accuracy vs fraction of OOD samples for Waterbirds (93.6%), CelebA (90.7%), MultiCelebA (79.6%), CMNIST (95.9%), CIFAR-10C (76.2%), and CIFAR-100C (45.6%).]

Figure 4: Accuracies of ILCs and last-layer retraining under a varying number of OOD samples for ResNets. Performance of the best ILCs and last-layer retraining on subpopulation shifts (first row) and the remaining shifts (second row) using CNN models. We used ResNet50 for Waterbirds and CelebA, and ResNet18 for the remaining datasets.

Takeaway from §4.2: Last-layer representations are often sub-optimal for OOD generalization; intermediate layers offer better candidates. The effect is more pronounced when only a small fraction of OOD samples is available to train the linear probe on top.
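The π-controlled setting above can be reproduced with per-class subsampling of the OOD probe set. The sketch below is our own illustration (the function name and interface are assumptions, not the authors' code); it keeps a fraction π of the examples from each class so that the class balance of Dprobe is preserved as π shrinks.

```python
import random

def subsample_probe_set(examples, labels, pi, seed=0):
    """Keep a fraction pi of the OOD probe examples per class.

    pi = 1.0 recovers the full probe set; smaller pi simulates the
    data-restricted regimes of the experiment.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    subset = []
    for y, xs in by_class.items():
        k = max(1, round(len(xs) * pi))  # keep at least one sample per class
        subset.extend((x, y) for x in rng.sample(xs, k))
    return subset
```

Stratifying per class matters here: at very small π, uniform subsampling could drop rare classes entirely, which would confound the comparison between layers.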
4.3 RESULTS UNDER THE ZERO-SHOT SETTING

We now explore a scenario where no OOD samples are available for training the linear probes, only ID samples: Dprobe ∼ PID. This scenario is intriguing both conceptually and practically. Conceptually, it challenges us to extract features that are effective for OOD generalization using only ID data. This requires leveraging the structure of ID data to identify characteristics that may generalize well to unseen OOD cases. Practically, the assumption that OOD data is unavailable greatly broadens the potential application scenarios. We follow the setup outlined in §3.1. We stress that this zero-shot setting was not considered in previous last-layer retraining methods (Izmailov et al., 2022; LaBonte et al., 2023), which involved retraining the last layer on OOD data.

4.3.1 SUBPOPULATION SHIFTS

[Figure 5: bar charts of worst-group accuracy on Waterbirds, CelebA, and MultiCelebA for Base, Last layer (ILC_{L−1}), and Best layer (ILCl*).]

Figure 5: WGA on OOD data for ID-trained CNNs. For the explanations of Base, Last layer, and Best layer, refer to the text. ResNet50: Waterbirds, CelebA. ResNet18: MultiCelebA.

We illustrate the results on subpopulation shifts in Fig. 5. We consider three variations of models: (1) the pre-trained frozen DNN (Base), (2) last-layer retraining with the ID data (Last layer), and (3) the best ILC after training on ID data (Best layer). They can be compared as they solve the same task. We also compare these zero-shot results to performant baselines in Appendix A.6. In all three datasets, we observe the relationship: Base