Published as a conference paper at ICLR 2024

HOW TO FINE-TUNE VISION MODELS WITH SGD

Ananya Kumar (ananya@cs.stanford.edu), Ruoqi Shen (shenr3@cs.washington), Sébastien Bubeck (sebubeck@microsoft.com), Suriya Gunasekar (suriyag@microsoft.com)

SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW while using less memory (e.g., on ViT-L, SGD uses 33% less GPU memory). Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.

1 INTRODUCTION

Fine-tuning large pretrained models on downstream tasks has become a dominant approach in deep learning (Kornblith et al., 2019; Chen et al., 2020; Zhai et al., 2020). The two most commonly used optimizers in current practice are SGD and AdamW (Kingma & Ba, 2015; Loshchilov & Hutter, 2019)¹. While most modern vision architectures (ViTs, ConvNeXts, and variants) increasingly use AdamW for pretraining, it is still common to use SGD for fine-tuning. Part of the appeal is that SGD is more memory and compute efficient: AdamW uses 1.33x and 2x as much memory per parameter as SGD with and without momentum, respectively (Ginsburg et al., 2019; Dettmers et al., 2022). At the same time, in terms of fine-tuning accuracies, prior work (Dosovitskiy et al., 2021; Steiner et al., 2021; Kumar et al., 2022) reports similar performance between AdamW and SGD on ImageNet-like domains that are closer to the pretraining data.

In contrast, we reach different conclusions when fine-tuning on datasets that are far from the pretraining data or have substantial distribution shifts. We examine 7 popular models, including vision transformers (Dosovitskiy et al., 2021; Caron et al., 2021; Radford et al., 2021), ConvNeXts (Liu et al., 2022), and ResNets (Kolesnikov et al., 2020; He et al., 2016), of different sizes and pretraining modalities. When pretrained on a large corpus and then fine-tuned, these models achieve near state-of-the-art performance on downstream benchmarks. In addition to good transfer learning, we also want our fine-tuned models to handle practical distribution shifts gracefully. So we focus on 5 distribution shift datasets that have both in-distribution (ID) and out-of-distribution (OOD) evaluations: WILDS-FMoW, WILDS-Camelyon, Waterbirds, BREEDS-Living-17, and DomainNet. These were selected to capture different types of data shifts (subpopulation shifts, spurious correlations, style shifts), including two real-world shifts in medical imaging and satellite remote sensing from the WILDS benchmark (Koh et al., 2021).
We find that on newer models like ViTs and ConvNeXt, AdamW can significantly outperform SGD, especially OOD. Averaged across the datasets, fine-tuning a CLIP ViT-B/16 model with AdamW gets 2.1% higher accuracy ID and 8.1% higher accuracy OOD compared to SGD (Figure 1b). These gains are consistent across models too: averaged across all models and datasets, AdamW gets 1.2% higher accuracy ID and 4.0% higher accuracy OOD (Tables 1-2).

¹ By default, we use the deep learning usage of SGD as minibatch stochastic gradient descent with momentum.

Figure 1: We fine-tune 7 models including ViTs, DINO, CLIP, ConvNeXt, and ResNets on 5 distribution shift datasets (Living-17, Waterbirds, DomainNet, WILDS-Camelyon, WILDS-FMoW). Fine-tuning with SGD gets lower accuracies than AdamW on modern pretrained models (Vision Transformers and ConvNeXt), especially OOD. Interestingly, a minor tweak to SGD where we freeze the first embedding layer (< 1% of parameters, see Figure 1a) is competitive with AdamW while using lower GPU memory. Further, dropping the momentum state in SGD gives additional gains in accuracy at even lower memory cost. (a) Simplified schematic of ViT illustrating how we do freeze-embed. (b) Performance of different fine-tuning methods on a CLIP ViT-B/16 averaged over 5 distribution shift datasets:

Method | ID accuracy | OOD accuracy
SGD | 90.0% | 67.9%
AdamW | (+2.1%) | (+8.1%)
SGD (freeze-embed) | (+2.1%) | (+8.0%)
SGD (freeze-embed, no mom.) | (+2.2%) | (+9.0%)

A key difference between AdamW and SGD is that AdamW normalizes the gradient update of each parameter using an estimate of its second moments. Thus, parameters with consistently high gradients will change less when using AdamW than with SGD. To better understand these dynamics, we examine the gradients at each layer of our pretrained models. We find that for the models where AdamW significantly outperforms SGD, the gradients at pretrained initialization of the first "embedding" layer are much larger than the gradients of the other layers.

To test if over-training of the embedding layer is in fact why SGD performs worse than AdamW, we consider a minor modification where we freeze the embedding layer and tune the rest of the model with SGD (Figure 1a); we call this SGD (freeze-embed). In vision transformer models, the embedding layer is only a small fraction (around 0.7% for ViT-B/16) of the total parameters of the model, so a priori we might not expect a substantial difference in accuracies. However, surprisingly, this simple freezing of the embedding layer consistently improves SGD performance across most models and datasets and achieves ID and OOD accuracies that are competitive with or better than AdamW (Figure 1b). Averaged across all datasets and models, SGD (freeze-embed) gets 76.7% accuracy OOD (vs. 72.0% for SGD and 76.0% for AdamW). The analogous AdamW (freeze-embed) gets 76.5%, which does not improve over SGD (freeze-embed), supporting that freeze-embed may be the reason AdamW outperforms SGD (it is not an independent axis of improvement). We also tried a more memory-efficient variation, SGD (freeze-embed, no momentum), which drops the momentum state in SGD; interestingly, this gets even slightly better OOD accuracy than the other methods (76.9%) despite using even less memory. In terms of memory usage, our profiling (Table 4) shows that on a ViT-B/16, AdamW uses 16% and 36% more memory than SGD (freeze-embed) and SGD (freeze-embed, no momentum), respectively.
The memory overhead of AdamW increases with the model size: on a ViT-L/14 the overheads of AdamW are 18% and 49%, respectively. These methods and insights, while simple, lead to state-of-the-art accuracies on all five datasets: WILDS-Camelyon, WILDS-FMoW, DomainNet, Waterbirds, and BREEDS-Living-17, while being more memory efficient than AdamW.

2 SCOPE AND SETUP

We use the following notation: a network map $f_\theta : \mathcal{X} \to \mathcal{Y}$ is represented as a composition of layers $f_\theta = f^{(\text{head})}_{\theta_\text{head}} \circ f^{(L)}_{\theta_L} \circ \cdots \circ f^{(1)}_{\theta_1} \circ f^{(\text{embed})}_{\theta_\text{embed}}$, where $\theta = (\theta_\text{head}, \theta_L, \ldots, \theta_1, \theta_\text{embed})$ denotes all the parameters of the model. We use $f^{(\text{embed})}_{\theta_\text{embed}}$ and $f^{(\text{head})}_{\theta_\text{head}}$ to denote blocks that can conceptually be considered the "embedding" layer and the "head", respectively.

Fine-tuning. Consider networks that have been pretrained to get an initialization $\theta_\text{pretrain}$. We focus on fine-tuning on a labeled dataset $D_\text{train} \sim P_\text{finetune}$ from a new task. Concretely, given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$, we minimize the training loss
$$L(\theta) = \frac{1}{|D_\text{train}|} \sum_{(x,y) \in D_\text{train}} \ell(f_\theta(x), y)$$
using iterative optimization algorithms starting from the initialization $\theta_\text{pretrain}$. We evaluate the accuracy of fine-tuned models on held-out in-distribution (ID) and out-of-distribution (OOD) test datasets. For ID evaluation, we use samples $D^\text{id}_\text{test} \sim P_\text{finetune}$ from the same distribution as the fine-tuning training dataset. To examine whether we have learned a robust model, we consider benchmarks that also provide OOD test examples $D^\text{ood}_\text{test} \sim P_\text{ood}$, which differs from the fine-tuning distribution $P_\text{finetune}$ in practically meaningful ways.

2.1 OPTIMIZERS: SGD AND ADAMW

The two most common optimizers for minimizing the fine-tuning loss $L(\theta)$ from a pretrained initialization are SGD (with or without momentum) and AdamW. We will introduce other optimizers as needed. Compared to vanilla SGD (no momentum), which only stores the parameters and gradients as optimizer states, SGD (with momentum) stores 1 extra state per parameter (to track the first moment), while AdamW stores 2 extra states per parameter (to track the first and second moments); see Appendix A for the exact updates. This corresponds to AdamW requiring an extra 4GB and 8GB of GPU memory per 1B-parameter model during training compared to SGD and SGD (no momentum), respectively². With models currently at the scale of hundreds of billions of parameters and growing, such memory overheads are very costly (Dettmers et al., 2022). Thus, understanding when and how we can use the cheaper SGD instead of AdamW can significantly improve training of large-scale models.

2.2 DATASETS AND MODEL ARCHITECTURES

Datasets. We choose five fine-tuning benchmarks that capture different types of data shifts (subpopulation shifts, spurious correlations, style shifts), including two real-world shifts.

1. Living-17 (Santurkar et al., 2020) is a sub-population shift dataset from the BREEDS benchmark. The goal is to classify an image as one of 17 animal categories, with ID and OOD data from different sub-categories. For example, in the "bear" category, the ID dataset contains black bears and sloth bears and the OOD dataset has brown bears and polar bears.
2. Waterbirds (Sagawa et al., 2020) is a spurious correlation dataset where the goal is to classify an image as a "waterbird" or "landbird". In the ID dataset, water backgrounds are typically correlated with waterbird labels, but they are uncorrelated in the OOD dataset.
3. DomainNet (Peng et al., 2019) is a domain adaptation benchmark.
The ID dataset contains sketch images, and the OOD dataset contains real images of the same categories. We use the version of the dataset from Tan et al. (2020).
4. WILDS-FMoW (Christie et al., 2018; Koh et al., 2021) consists of remote sensing satellite images. The goal is to classify a satellite image into one of 62 geographical categories. The ID dataset contains satellite images from across the world between 2002 and 2012, and the OOD dataset contains images from Africa in 2017.
5. WILDS-Camelyon (Bandi et al., 2018; Koh et al., 2021) is a medical imaging dataset for detecting tumors in tissue slides. The ID and OOD datasets contain slides from different hospitals.

Model architectures. We consider seven popular pretrained models that span different architectures (vision transformers and convolutional networks), sizes, and pretraining objectives (multi-modal, supervised, self-supervised).
(1-2) CLIP ViT-B/16 and CLIP ViT-L/14 (Radford et al., 2021): CLIP vision transformers of two sizes pretrained on a multi-modal WebImageText dataset.
(3) ViT-B/16 (Dosovitskiy et al., 2021): vision transformer pretrained on ImageNet-21k.
(4) DINO ViT-B/16 (Caron et al., 2021): self-supervised ViT pretrained on ImageNet-1K.
(5) ConvNeXt-B (Liu et al., 2022): modernized convnet pretrained on ImageNet-21k using advanced data augmentations and MixUp as in Touvron et al. (2021).
(6-7) BiT ResNet-50 and BiT ResNet-101 (Kolesnikov et al., 2020): ResNetV2 models of two sizes pretrained on ImageNet-21k.

² The bottleneck for memory in older ResNe(X)ts is typically the number of activations. However, in modern large transformer training, memory requirements are of the same scale as the number of parameters. Further, techniques such as gradient accumulation and gradient checkpointing can make the activation memory small, leaving the optimizer states as the main bottleneck.

$f^{(\text{embed})}_{\theta_\text{embed}}$ and $f^{(\text{head})}_{\theta_\text{head}}$: For vision transformer models, we consider the patch-to-token embedding layer, along with its layer norm if present, as the embedding layer. For convolutional networks, the embedding layer refers to the stem block along with the first stage: the stem in ResNetV2 is a 7x7 convolution with stride 2 followed by a 2x2 MaxPool, while in ConvNeXt it is a non-overlapping 4x4 convolution with stride 4. For each downstream task, we replace the final layer of all the pretrained models with a randomly initialized classifier head $f^{(\text{head})}_{\theta_\text{head}}$.

3 SGD, ADAMW, AND LAYER GRADIENTS

Modern deep learning models increasingly use AdamW for pretraining, where it has been repeatedly shown to produce better features for downstream tasks than SGD (Dosovitskiy et al., 2021; Liu et al., 2021; 2022). For fine-tuning, on the other hand, there are no systematic studies or a definitive answer as to whether AdamW or SGD should be used (Dosovitskiy et al., 2021; Touvron et al., 2021). The ablation study in Touvron et al. (2021) even found that there is no difference in performance between AdamW and SGD when fine-tuning on ImageNet-1K. The ConvNeXt (Liu et al., 2022) and Swin Transformer (Liu et al., 2021) papers report using AdamW for fine-tuning, but they do not mention a comparison with SGD. In this work we focus on better understanding the dynamics of AdamW and SGD during the fine-tuning phase. Detailed results are discussed in Section 4. We first highlight some initial observations below.

3.1 ADAMW VS SGD
AdamW outperforms SGD. We find that, generally speaking, fine-tuning with AdamW produces better results than with SGD, especially for more recent models like ViT variants and ConvNeXt. The gaps are more dramatic on out-of-distribution (OOD) test accuracies compared to in-distribution (ID). See Table 1 and Table 2 for the full OOD and ID results, respectively. Averaged across the 7 models and 5 datasets, AdamW gets an OOD accuracy of 76.0% (vs. 72.0% for SGD), and an ID accuracy of 91.5% (vs. 90.3% for SGD). We emphasize that this happens even though we sweep over 6 learning rates and early stop.

AdamW ≈ SGD on BiT ResNets. Breaking down the results by model type, we find that AdamW and SGD perform comparably for older convolutional networks, namely BiT ResNet-50 and BiT ResNet-101. For example, on a BiT ResNet-101, AdamW gets an average OOD accuracy of 74.5% (vs. 74.6% for SGD). However, for newer models, including vision transformers and the modernized convolutional network (ConvNeXt), AdamW gets much higher accuracies.

3.2 EXAMINING LAYER GRADIENTS

A key operational difference between AdamW and SGD is that AdamW divides the gradient update for each parameter by a weighted running average of its second moments. So parameters with consistently high gradients will change less when using AdamW than with SGD. This suggests examining the gradients across different components of the neural network (at the pretrained initialization): if they vary a lot, then AdamW and SGD will behave very differently.

We measure the average gradient norm at each layer, across minibatches of the training set. More formally, recall our notation for the network layers, $f_\theta = f^{(\text{head})}_{\theta_\text{head}} \circ f^{(L)}_{\theta_L} \circ \cdots \circ f^{(1)}_{\theta_1} \circ f^{(\text{embed})}_{\theta_\text{embed}}$. Given a minibatch $B = \{(x_1, y_1), \ldots, (x_b, y_b)\}$ of random samples from the training data, the stochastic gradient of layer $\ell \in \{\text{embed}, 1, 2, \ldots, L, \text{head}\}$ is given by $g_\ell(B) = \frac{1}{|B|} \sum_{i=1}^{|B|} \nabla_{\theta_\ell} \ell(f_{\theta_\text{pretrain}}(x_i), y_i)$. The norm of the stochastic gradient, $\|g_\ell(B)\|_2$, roughly measures how much layer $\ell$ will change after one step of SGD from the pretrained initialization. We use the average gradient norm across all minibatches $B_1, \ldots, B_m$ in the training set $D_\text{train}$ as a measure of movement in the first SGD step:
$$G^\text{init}_\ell = \frac{1}{m} \sum_{t=1}^{m} \|g_\ell(B_t)\|_2, \quad \text{where } \ell \in \{\text{embed}, 1, \ldots, L, \text{head}\}. \qquad (3.1)$$

Figure 2: We visualize the layer-wise gradient norms of our models on DomainNet at the pretrained initialization. For each layer, we plot the average minibatch gradient norm $G^\text{init}_\ell$ as computed in Equation 3.1. We highlight two special layers: gradient norms of the embedding layer parameters $G^\text{init}_\text{embed}$ are shown as red squares, and those of the classifier head $G^\text{init}_\text{head}$ are shown as green triangles. The middle layer gradient norms are shown as black circles. For transformer models, we have separate (black) points for the MLP and the attention layers.

Note that we measure all the gradients at the pretrained initialization $\theta_\text{pretrain}$, before performing any gradient updates. In Figure 2, we plot the layer-wise gradient norms $G^\text{init}_\ell$ on DomainNet. Similar plots for other datasets and with alternative normalization are provided in Appendix F.
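To make the measurement in Equation 3.1 concrete, here is a minimal PyTorch sketch of how one might compute per-layer gradient norms at the pretrained initialization. It is our own illustration rather than the paper's released code: `model`, `loss_fn`, and `train_loader` are assumed to exist, and grouping parameters by their top-level module name is a simplification of the per-layer grouping used in Figure 2.

```python
import torch
from collections import defaultdict

def layer_gradient_norms(model, loss_fn, train_loader, device="cuda"):
    """Estimate G_l^init (Eq. 3.1): per-layer gradient norms averaged over minibatches."""
    model.to(device).train()
    norm_sums, num_batches = defaultdict(float), 0
    for x, y in train_loader:
        model.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        # Group parameter gradients by top-level module name (embedding / blocks / head, ...).
        grouped = defaultdict(list)
        for name, p in model.named_parameters():
            if p.grad is not None:
                grouped[name.split(".")[0]].append(p.grad.flatten())
        for layer, grads in grouped.items():
            norm_sums[layer] += torch.cat(grads).norm().item()
        num_batches += 1
    return {layer: total / num_batches for layer, total in norm_sums.items()}
```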
First-layer gradient is an outlier. A priori, we expect the gradients of the classifier head (green triangles) to be large, as it is randomly initialized while the rest of the model is pretrained (Kumar et al., 2022). Interestingly, we see in Figure 2 (see also Appendix F) that for the recent models (ViTs and ConvNeXt), the gradient norms of the embedding layers (red squares) stand out as outliers, with much larger gradients than the other layers. These are also the models where we see big gaps between AdamW and SGD performance in Tables 1-2. This suggests that the embedding layer plays a distinctive role in newer models. Since SGD uses the same learning rate for all layers, we will end up making substantially larger updates to the embedding layer compared to other layers, leading to either over-tuning the embedding layer or under-tuning the remaining layers. Over-training of the embedding layer is undesirable given the common wisdom that lower layers ought to be tuned less, as they learn more transferable features (Kumar et al., 2022). AdamW, on the other hand, adaptively normalizes the movement of each parameter. At the same time, computationally, SGD is preferable over AdamW due to its lower memory footprint.

How can we avoid the above issues with SGD? One possibility is to use different learning rates for different layers, but this might end up requiring extensive hyperparameter tuning. Another option is to use other low-memory-footprint optimizers with layerwise normalization techniques like LARS (You et al., 2017b) and LAMB (You et al., 2020). In our initial experiments (see Table 5), while these methods improved over SGD, they did not close the gap with AdamW. Instead, we found a much simpler modification to SGD that consistently leads to accuracies competitive with or better than AdamW while using lower memory (Section 3.4).

3.3 WHY ARE FIRST-LAYER GRADIENTS HIGHER?

Why are the first-layer gradients higher for modern vision models such as Vision Transformers and ConvNeXt? There are two plausible hypotheses: (a) architectural differences from ResNet models, or (b) optimization differences between pretraining and fine-tuning: BiT ResNet models were pretrained with SGD, while the Vision Transformer and ConvNeXt models need to be pretrained with AdamW (Steiner et al., 2021). In Appendix E we run controlled experiments and find that the optimizer mismatch appears to be the key factor. To control for architecture, we use the same ResNet architecture and pretrain using either (1) SGD or (2) AdamW. For the SGD-pretrained ResNet the first layer has smaller gradients than the other layers, while for the AdamW-pretrained ResNet the first layer has higher gradients than the other layers (Figure 3). In line with our previous results, fine-tuning an AdamW-pretrained model with SGD leads to worse OOD accuracy (Table 13), but fine-tuning an SGD-pretrained model with SGD performs well. Architectural changes, on the other hand, lead to smaller differences.

3.4 FREEZE-EMBEDDING

While the presence of large embedding-layer gradients in Figure 2 correlates with the observed performance gaps between AdamW and SGD, it is not clear that this observation or its potential issues discussed above are definitive causes for SGD performing worse than AdamW. To further test the intuition, in the next section we consider a freeze-embed variation of fine-tuning, where we simply freeze the embedding layer to its pretrained initialization, and then fine-tune the rest of the model as usual.
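For concreteness, a minimal PyTorch sketch of this freeze-embed variant (our own illustration, not the paper's code); the attribute name `patch_embed` is an assumption about how the ViT implementation exposes its patch-to-token embedding layer.

```python
import torch

def sgd_freeze_embed(model, lr, momentum=0.9):
    """SGD (freeze-embed): keep the embedding layer at its pretrained weights, tune the rest."""
    # Assumed attribute name for the patch-to-token embedding layer of a ViT.
    for p in model.patch_embed.parameters():
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    # Use momentum=0.0 for the even cheaper SGD (freeze-embed, no momentum) variant.
    return torch.optim.SGD(trainable, lr=lr, momentum=momentum)
```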
The embedding layer consists of only a small fraction of the model parameters (e.g., 0.7% in a base vision transformer), so a priori freezing the embedding layer is a very tiny tweak and we would not expect a large change in accuracy. However, if the hypothesis above holds merit, then we would expect the modification to aid SGD but not AdamW.

4 DETAILED EXPERIMENTS ON FREEZE-EMBEDDING

We consider a simple variation of SGD fine-tuning where we freeze the embedding layer and only perform SGD updates on the rest of the network; we call this method "SGD (freeze-embed)". For further memory gains, we also consider SGD without momentum, "SGD (freeze-embed, no momentum)". In this section we discuss the results of fine-tuning our 7 models on the 5 distribution shift benchmarks mentioned in Section 2. We use the implementations of SGD and AdamW in PyTorch. For each method, we train for the same number of epochs using a cosine learning rate schedule, and sweep over 6 starting learning rates (ensuring that the optimal learning rate is in the middle of the sweep). For all datasets we follow prior work (Kumar et al., 2022) and pick the best learning rate and early stop based on the ID validation accuracy. See Appendix A for additional details.

Table 1 and Table 2 show the OOD and ID accuracies, respectively, of the four methods across models and datasets. For each model and dataset, we also highlight the relative gain/loss of AdamW and SGD (freeze-embed) over regular SGD in green/red. We discuss the main observations below.

1. AdamW outperforms SGD. We see that AdamW largely outperforms SGD, often by substantial margins on ViT and ConvNeXt models. Particularly in OOD evaluation, the gaps are remarkable. In the few instances where SGD performs better, the gaps are much smaller. The differences between SGD and AdamW are generally more modest for older ResNet models.

2. SGD (freeze-embed) and SGD (freeze-embed, no momentum) are competitive with or better than AdamW. For each individual model, averaged across the datasets, the SGD (freeze-embed) variants are consistently the best or tied-best method on OOD accuracy, and only minimally worse than AdamW on ID accuracy (right columns of Tables 1-2). Averaged across all the models and datasets, SGD (freeze-embed) performs the best OOD, getting an average OOD accuracy of 76.7% (vs. 71.9% for SGD and 76.0% for AdamW). On ID data, SGD (freeze-embed) closes 85% of the gap between SGD and AdamW, getting an average accuracy of 91.3% (vs. 90.2% for SGD and 91.5% for AdamW). SGD (freeze-embed, no momentum) gets the highest average OOD accuracy of 76.9% (vs. 71.9% for SGD and 76.0% for AdamW) and competitive ID accuracy of 91.2% (vs. 90.2% for SGD and 91.5% for AdamW), while saving additional memory.

3. The larger CLIP model gains more from SGD (freeze-embed) and AdamW. On CLIP ViT-B/16 we see that SGD (freeze-embed) and AdamW get about 8% higher OOD accuracy than SGD. Upon scaling to the larger CLIP ViT-L/14, which is also our best model for OOD performance, we see even higher gains. On CLIP ViT-L/14, SGD (freeze-embed) gets 14.3% higher OOD accuracy than SGD, and 0.7% higher OOD accuracy than AdamW. This suggests that our findings might be even more relevant for larger models.

4. AdamW is red when SGD (freeze-embed) is red. Across all our models and datasets, we see that in instances where SGD (freeze-embed) is worse than SGD (i.e., highlighted in red), AdamW is also worse than SGD. The reverse does not always hold.
For example, for ViT-B/16 fine-tuned on DomainNet and Camelyon, the OOD performance of AdamW is worse than SGD, but SGD (freeze-embed) is competitive with the best of the two. At the same time, on Living-17 OOD evaluation, multiple models have both AdamW and SGD (freeze-embed) performing significantly worse than SGD.

5. SGD performs well when the fine-tuning data is closer to pretraining. Breaking down the results by dataset, the main trend we see is that with models that were pretrained on ImageNet-21k (all models here except CLIP) and fine-tuned on Living-17, the performance of SGD is typically higher than AdamW and SGD (freeze-embed). Images in Living-17 are derived from ImageNet-1K, so arguably, in this case the fine-tuning distribution is closest to pretraining. This suggests that SGD may work better when fine-tuning and pretraining data are similar; we leave a more thorough analysis to future work.

Model | Method | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon | Avg.
CLIP ViT-B/16 | SGD | 80.0 | 62.5 | 72.8 | 37.3 | 86.8 | 67.9
CLIP ViT-B/16 | AdamW | 82.8 (+2.8) | 71.9 (+9.4) | 89.2 (+16.4) | 40.7 (+3.4) | 95.7 (+8.9) | 76.0 (+8.1)
CLIP ViT-B/16 | SGD (freeze-embed) | 83.2 (+3.2) | 73.7 (+11.2) | 88.2 (+15.4) | 40.2 (+2.9) | 94.3 (+7.5) | 75.9 (+8.0)
CLIP ViT-B/16 | SGD (freeze-embed, no momentum) | 83.1 (+3.1) | 80.4 (+17.9) | 89.0 (+16.2) | 38.8 (+1.5) | 93.3 (+6.5) | 76.9 (+9.0)
CLIP ViT-L/14 | SGD | 84.2 | 65.0 | 60.8 | 41.0 | 83.2 | 66.8
CLIP ViT-L/14 | AdamW | 88.0 (+3.8) | 85.2 (+20.2) | 93.8 (+33.0) | 48.3 (+7.3) | 95.9 (+12.7) | 82.2 (+15.4)
CLIP ViT-L/14 | SGD (freeze-embed) | 90.5 (+6.3) | 84.7 (+19.7) | 93.1 (+32.3) | 49.9 (+8.9) | 96.5 (+13.3) | 83.0 (+16.2)
CLIP ViT-L/14 | SGD (freeze-embed, no momentum) | 89.2 (+5.0) | 86.8 (+21.8) | 93.7 (+32.9) | 46.8 (+5.8) | 96.7 (+13.5) | 82.6 (+15.8)
Sup ViT-B/16 | SGD | 89.5 | 77.4 | 86.3 | 33.5 | 92.6 | 75.8
Sup ViT-B/16 | AdamW | 88.3 (-1.2) | 81.6 (+4.2) | 84.4 (-1.9) | 35.9 (+2.4) | 87.9 (-4.7) | 75.6 (-0.2)
Sup ViT-B/16 | SGD (freeze-embed) | 88.0 (-1.5) | 82.4 (+5.0) | 86.3 (+0.0) | 34.4 (+0.9) | 93.7 (+1.1) | 77.0 (+1.2)
Sup ViT-B/16 | SGD (freeze-embed, no momentum) | 88.4 (-1.1) | 76.6 (-0.8) | 86.1 (-0.2) | 34.6 (+1.1) | 87.5 (-5.1) | 74.6 (-1.2)
DINO ViT-B/16 | SGD | 88.2 | 56.1 | 76.0 | 33.6 | 86.9 | 68.2
DINO ViT-B/16 | AdamW | 87.4 (-0.8) | 61.2 (+5.1) | 77.4 (+1.4) | 35.8 (+2.2) | 91.9 (+5.0) | 70.7 (+2.5)
DINO ViT-B/16 | SGD (freeze-embed) | 86.7 (-1.5) | 67.9 (+11.8) | 78.4 (+2.4) | 35.9 (+2.3) | 90.6 (+3.7) | 71.9 (+3.7)
DINO ViT-B/16 | SGD (freeze-embed, no momentum) | 88.2 (+0.0) | 68.5 (+12.4) | 79.9 (+3.9) | 33.4 (-0.2) | 93.9 (+7.0) | 72.8 (+4.6)
ConvNeXt-Base | SGD | 94.0 | 80.2 | 89.8 | 39.7 | 83.0 | 77.3
ConvNeXt-Base | AdamW | 90.3 (-3.7) | 89.8 (+9.6) | 89.5 (-0.3) | 38.4 (-1.3) | 89.5 (+6.5) | 79.5 (+2.2)
ConvNeXt-Base | SGD (freeze-embed) | 92.6 (-1.4) | 86.9 (+6.7) | 91.2 (+1.4) | 38.2 (-1.5) | 88.1 (+5.1) | 79.4 (+2.1)
ConvNeXt-Base | SGD (freeze-embed, no momentum) | 93.3 (-0.7) | 83.6 (+3.4) | 91.5 (+1.7) | 38.1 (-1.6) | 96.0 (+13.0) | 80.5 (+3.2)
BiT ResNet-50 | SGD | 84.3 | 76.5 | 80.0 | 34.1 | 90.4 | 73.1
BiT ResNet-50 | AdamW | 83.1 (-1.2) | 74.8 (-1.7) | 84.0 (+4.0) | 33.8 (-0.3) | 92.4 (+2.0) | 73.6 (+0.5)
BiT ResNet-50 | SGD (freeze-embed) | 84.1 (-0.2) | 75.5 (-1.0) | 82.3 (+2.3) | 35.0 (+0.9) | 95.2 (+4.8) | 74.4 (+1.3)
BiT ResNet-50 | SGD (freeze-embed, no momentum) | 83.8 (-0.5) | 76.0 (-0.5) | 83.8 (+3.8) | 33.6 (-0.5) | 95.6 (+5.2) | 74.5 (+1.4)
BiT ResNet-101 | SGD | 82.8 | 76.9 | 86.2 | 38.0 | 89.3 | 74.6
BiT ResNet-101 | AdamW | 82.9 (+0.1) | 79.4 (+2.5) | 83.5 (-2.7) | 37.0 (-1.0) | 89.7 (+0.4) | 74.5 (-0.1)
BiT ResNet-101 | SGD (freeze-embed) | 83.1 (+0.3) | 77.3 (+0.4) | 86.0 (-0.2) | 36.0 (-2.0) | 95.5 (+6.2) | 75.6 (+1.0)
BiT ResNet-101 | SGD (freeze-embed, no momentum) | 82.9 (+0.1) | 79.1 (+2.2) | 86.0 (-0.2) | 36.4 (-1.6) | 95.9 (+6.6) | 76.1 (+1.5)

Table 1: Out-of-distribution (OOD) accuracies, including SGD (freeze-embed) without momentum.

Model | Method | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon | Avg.
CLIP ViT-B/16 | SGD | 97.8 | 97.2 | 88.8 | 67.0 | 99.4 | 90.0
CLIP ViT-B/16 | AdamW | 98.1 (+0.3) | 97.7 (+0.5) | 95.0 (+6.2) | 70.1 (+3.1) | 99.5 (+0.1) | 92.1 (+2.1)
CLIP ViT-B/16 | SGD (freeze-embed) | 98.2 (+0.4) | 97.8 (+0.6) | 94.9 (+6.1) | 70.0 (+3.0) | 99.5 (+0.1) | 92.1 (+2.1)
CLIP ViT-B/16 | SGD (freeze-embed, no momentum) | 98.2 (+0.4) | 97.9 (+0.7) | 95.2 (+6.4) | 70.1 (+3.1) | 99.5 (+0.1) | 92.2 (+2.2)
CLIP ViT-L/14 | SGD | 98.4 | 97.3 | 84.3 | 69.0 | 99.4 | 89.7
CLIP ViT-L/14 | AdamW | 98.9 (+0.5) | 98.8 (+1.5) | 96.9 (+12.6) | 74.5 (+5.5) | 99.6 (+0.2) | 93.7 (+4.0)
CLIP ViT-L/14 | SGD (freeze-embed) | 98.7 (+0.3) | 98.9 (+1.6) | 97.1 (+12.8) | 74.5 (+5.5) | 99.6 (+0.2) | 93.7 (+4.0)
CLIP ViT-L/14 | SGD (freeze-embed, no momentum) | 98.8 (+0.4) | 98.7 (+1.4) | 97.3 (+13.0) | 74.3 (+5.3) | 99.5 (+0.1) | 93.7 (+4.0)
Sup ViT-B/16 | SGD | 98.6 | 99.1 | 91.7 | 64.1 | 99.4 | 90.6
Sup ViT-B/16 | AdamW | 98.7 (+0.1) | 99.0 (-0.1) | 91.7 (-0.0) | 66.4 (+2.3) | 99.5 (+0.1) | 91.1 (+0.5)
Sup ViT-B/16 | SGD (freeze-embed) | 98.7 (+0.1) | 99.2 (+0.1) | 91.5 (-0.2) | 65.0 (+0.9) | 99.6 (+0.2) | 90.8 (+0.2)
Sup ViT-B/16 | SGD (freeze-embed, no momentum) | 98.5 (-0.1) | 98.9 (-0.2) | 90.6 (-1.1) | 65.7 (+1.6) | 99.5 (+0.1) | 90.6 (+0.0)
DINO ViT-B/16 | SGD | 98.4 | 97.0 | 88.2 | 62.4 | 99.4 | 89.1
DINO ViT-B/16 | AdamW | 98.5 (+0.1) | 97.9 (+0.9) | 89.4 (+1.2) | 66.0 (+3.6) | 99.6 (+0.2) | 90.3 (+1.2)
DINO ViT-B/16 | SGD (freeze-embed) | 98.4 (+0.0) | 97.5 (+0.5) | 89.0 (+0.8) | 63.5 (+1.1) | 99.5 (+0.1) | 89.6 (+0.5)
DINO ViT-B/16 | SGD (freeze-embed, no momentum) | 98.5 (+0.1) | 97.5 (+0.5) | 89.2 (+1.0) | 63.8 (+1.4) | 99.5 (+0.1) | 89.7 (+0.6)
ConvNeXt-Base | SGD | 98.7 | 99.0 | 94.8 | 66.3 | 99.4 | 91.6
ConvNeXt-Base | AdamW | 98.6 (-0.1) | 99.5 (+0.5) | 94.5 (-0.3) | 68.8 (+2.5) | 99.7 (+0.3) | 92.2 (+0.6)
ConvNeXt-Base | SGD (freeze-embed) | 98.6 (-0.1) | 99.4 (+0.4) | 95.1 (+0.3) | 67.4 (+1.1) | 99.5 (+0.1) | 92.0 (+0.4)
ConvNeXt-Base | SGD (freeze-embed, no momentum) | 98.8 (+0.1) | 99.2 (+0.2) | 95.0 (+0.2) | 66.0 (-0.3) | 99.4 (-0.0) | 91.7 (+0.1)
BiT ResNet-50 | SGD | 97.4 | 98.4 | 89.3 | 64.6 | 99.5 | 89.8
BiT ResNet-50 | AdamW | 97.2 (-0.2) | 98.5 (+0.1) | 89.2 (-0.1) | 65.1 (+0.5) | 99.5 (+0.0) | 89.9 (+0.1)
BiT ResNet-50 | SGD (freeze-embed) | 97.6 (+0.2) | 98.5 (+0.1) | 89.2 (-0.1) | 64.8 (+0.2) | 99.5 (+0.0) | 89.9 (+0.1)
BiT ResNet-50 | SGD (freeze-embed, no momentum) | 97.5 (+0.1) | 98.4 (-0.0) | 89.2 (-0.1) | 63.6 (-1.0) | 99.4 (-0.1) | 89.6 (-0.2)
BiT ResNet-101 | SGD | 98.3 | 98.9 | 92.0 | 66.0 | 99.4 | 90.9
BiT ResNet-101 | AdamW | 98.4 (+0.1) | 98.6 (-0.3) | 91.1 (-0.9) | 67.0 (+1.0) | 99.6 (+0.2) | 90.9 (-0.0)
BiT ResNet-101 | SGD (freeze-embed) | 98.4 (+0.1) | 98.8 (-0.1) | 91.5 (-0.5) | 65.9 (-0.1) | 99.5 (+0.1) | 90.8 (-0.1)
BiT ResNet-101 | SGD (freeze-embed, no momentum) | 98.1 (-0.2) | 98.9 (-0.0) | 91.5 (-0.5) | 66.3 (+0.3) | 99.3 (-0.1) | 90.8 (-0.1)

Table 2: In-distribution (ID) accuracies, including SGD (freeze-embed) without momentum.

4.1 SOTA EXPERIMENTS AND RESULTS

Our experiments get new state-of-the-art results for OOD accuracy on all 5 datasets. On 3/5 datasets (Living-17, WILDS-FMoW, and WILDS-Camelyon), our proposed SGD (freeze-embed) does the best, while on the other 2, AdamW has a small edge. Here, state-of-the-art means that the numbers we get are better than, to our knowledge, any reported number and all numbers on the official leaderboard, and are better than standard full fine-tuning with SGD, at the time of the preprint of this paper. We show the best results from our paper in Table 3, with a comparison to the previous state-of-the-art.
As a final point, we mention that if our hypothesis that AdamW inherently avoids over-tuning of the embedding layer were true, then, unlike for SGD, freezing the embedding for AdamW would not be beneficial. In the Appendix, we expand Tables 1-2 to include AdamW (freeze-embed), and indeed we see that freeze-embed does not provide complementary gains on top of AdamW.

 | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon
Best prior result | 87.6 | 89.3 | 87.2 | 47.6 | 93.3
Best result from our paper | 90.5 | 89.8 | 93.8 | 49.9 | 96.5
Optimizer for best result | SGD (freeze-embed) | AdamW | AdamW | SGD (freeze-embed) | SGD (freeze-embed)
Model for best result | CLIP ViT-L/14 | ConvNeXt-B | CLIP ViT-L/14 | CLIP ViT-L/14 | CLIP ViT-L/14
SGD (freeze-embed) result | 90.5 | 86.9 | 93.1 | 49.9 | 96.5

Table 3: Our OOD accuracy results compared with the best reported numbers in prior work on these datasets. We restrict to methods that do not use OOD data for hyperparameter selection or early stopping. To our knowledge, the previous state-of-the-art results are from Wortsman et al. (2022) for FMoW, Robey et al. (2021) for Camelyon, Kumar et al. (2022) for Living-17 and the version of DomainNet introduced by Tan et al. (2020), and Ghosal et al. (2022) for Waterbirds. The WILDS numbers and references are taken from the official WILDS leaderboard (as of 28 Sep 2022), and for Waterbirds we consider all methods that do not use group labels. For Living-17, we omit models pretrained on ImageNet-1K, as Living-17 is a subset of ImageNet-1K.

5 DETAILED ANALYSIS OF CLIP

CLIP models have strong transfer learning performance and robustness: among our 7 models, CLIP models had the best ID and OOD accuracies averaged across datasets. So we did a more detailed analysis of the CLIP ViT-B/16 with other optimizers³, which is summarized in Table 5.

 | CLIP ViT-B/16 | CLIP ViT-L/14 | CLIP ViT-H/14
AdamW | 2.7 GB | 7.1 GB | Out-of-memory
SGD | 2.3 GB | 6.1 GB | 10.1 GB
SGD (freeze-embed) | 2.3 GB | 6.1 GB | 10.1 GB
SGD (freeze-embed, no momentum) | 2.0 GB | 4.8 GB | 7.6 GB

Table 4: We profile the GPU memory consumption of three CLIP models of varying sizes on a Titan-X GPU. SGD (freeze-embed) gives practical gains over AdamW, especially if we drop momentum. The largest CLIP ViT-H/14 does not fit in GPU memory when using AdamW, but fits in memory with the other optimizers. Note that the freeze-embed methods perform competitively or better than AdamW on accuracy, as shown in Table 5.

GPU memory profiling. We profiled the GPU memory consumption of our 4 fine-tuning methods on 3 models, namely CLIP ViT-B/16, CLIP ViT-L/14, and OpenCLIP ViT-H/14; the original CLIP model only scales up to ViT-L/14, so we used OpenCLIP (Ilharco et al., 2021) for ViT-H/14. The profiling was done using Weights and Biases on a Titan-X GPU with a micro-batch size of 1. In Table 4, we see that on a ViT-B/16, AdamW uses 16% and 36% more memory than SGD (freeze-embed) and SGD (freeze-embed, no momentum), respectively. The gains are better for larger models: on a ViT-L/14, AdamW uses 18% and 48% more memory, respectively. On a ViT-H/14, AdamW runs out of memory, while SGD (freeze-embed) and SGD (freeze-embed, no momentum) are able to run, showing that the gains are at least 20% and 60%, respectively.
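As a rough sanity check on these numbers, the following back-of-the-envelope sketch (ours, not from the paper) estimates the weight, gradient, and optimizer-state memory from a parameter count, using the bytes-per-parameter figures quoted in the abstract; activation memory is deliberately ignored.

```python
def training_state_gb(num_params: int, optimizer: str) -> float:
    """Approximate fp32 weights + gradients + optimizer states, in GB.

    Bytes per parameter (as quoted in the abstract): 8 for SGD without momentum,
    12 for SGD with momentum, 16 for AdamW. Activations are not counted.
    """
    bytes_per_param = {"sgd_no_momentum": 8, "sgd": 12, "adamw": 16}
    return num_params * bytes_per_param[optimizer] / 1e9

# A ~1B-parameter model reproduces the 4GB / 8GB gaps discussed in Section 2.1.
for opt in ("sgd_no_momentum", "sgd", "adamw"):
    print(f"{opt}: {training_state_gb(1_000_000_000, opt):.0f} GB")
```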
For large models, it is common to use additional tricks like gradient checkpointing to reduce the activation memory: for a speed penalty (at most 2x), we only need to store the activations of roughly √L of the L layers when doing backpropagation. Gradient checkpointing would further increase our gains over AdamW, since it does not change the memory consumed by the weights but can substantially decrease the memory consumed by the model's activations.

SGD (freeze-embed) and AdamW outperform LAMB and LARS. We also ran two alternative adaptive gradient methods, LARS (You et al., 2017b) and LAMB (You et al., 2020), again sweeping over 6 learning rates and early stopping. These are alternative methods with layerwise normalization that can avoid over-training of large-gradient layers. Moreover, like SGD and SGD (freeze-embed), LARS also has a lower memory footprint than AdamW. In Table 5, we see that while LARS and LAMB get higher accuracies than SGD, they do worse than SGD (freeze-embed) and AdamW, both ID and OOD. In this case, our modification with freeze-embed appears to be more effective.

³ Generating Table 1 involved over 600 fine-tuning runs, so we were not able to repeat this analysis for every model.

Method | Living-17 (ID / OOD) | Waterbirds (ID / OOD) | DomainNet (ID / OOD) | FMoW (ID / OOD) | Camelyon (ID / OOD) | Avg. (ID / OOD)
SGD | 97.8 (0.2) / 80.0 (1.3) | 97.2 (0.1) / 62.5 (5.0) | 88.8 (7.1) / 72.8 (18.0) | 67.0 (0.8) / 37.3 (1.1) | 99.4 (0.0) / 86.8 (1.1) | 90.0 / 67.9
AdamW | 98.1 (0.1) / 82.8 (1.2) | 97.7 (0.0) / 71.9 (2.4) | 95.0 (0.1) / 89.2 (1.1) | 70.1 (0.2) / 40.7 (0.3) | 99.5 (0.0) / 95.7 (0.4) | 92.1 / 76.0
SGD (freeze-embed) | 98.2 (0.3) / 83.2 (0.8) | 97.8 (0.1) / 73.7 (1.1) | 94.9 (0.3) / 88.2 (0.7) | 70.0 (0.2) / 40.2 (0.7) | 99.5 (0.0) / 94.3 (0.3) | 92.1 / 75.9
SGD (freeze-embed, no mom.) | 98.2 / 83.1 | 97.9 / 80.4 | 95.2 / 89.0 | 70.1 / 38.8 | 99.5 / 93.3 | 92.2 / 76.9
LAMB | 98.2 / 79.5 | 97.8 / 64.0 | 95.1 / 90.4 | 67.9 / 38.8 | 99.5 / 93.4 | 91.7 / 73.2
LARS | 97.7 / 83.9 | 97.1 / 48.6 | 93.2 / 83.8 | 67.0 / 38.6 | 99.3 / 93.3 | 90.9 / 69.6

Table 5: CLIP ViT-B/16 performance with additional optimizers and confidence intervals. For SGD, AdamW, and SGD (freeze-embed), we provide 90% confidence intervals (in parentheses) based on 3 runs of each hyperparameter configuration. In addition, we show accuracies for SGD (freeze-embed, no momentum), as well as a comparison to two other optimizers, LARS (You et al., 2017b) and LAMB (You et al., 2020), which use layer-wise normalization in their updates.

 | CLIP ViT-B/16 | CLIP ViT-L/14
SGD | 98.0 | 99.0
AdamW | 98.3 | 99.3
SGD (freeze-embed) | 98.4 | 99.3

Table 6: CIFAR-10 accuracy.

CIFAR-10 results. As a proof of concept for standard (non-OOD) transfer learning, we fine-tune CLIP ViT models on CIFAR-10. Even at high accuracies, AdamW and SGD (freeze-embed) improve performance: SGD (freeze-embed) gets 20% and 30% lower error than SGD on CLIP ViT-B/16 and ViT-L/14, respectively. For example, on a CLIP ViT-L/14, SGD gets 99.0% accuracy but SGD (freeze-embed) gets 99.3% accuracy.

Composes with other fine-tuning methods. In Appendix B we show that SGD (freeze-embed) can compose with other fine-tuning improvements, such as LP-FT (Kumar et al., 2022), for further gains.

6 ADDITIONAL RELATED WORKS

Many works in transfer learning propose freezing parameters while fine-tuning to preserve pretrained information. For example, linear probing, which freezes the entire model except the head (Kumar et al., 2022; Wortsman et al., 2021), and zero-shot models (Radford et al., 2021) have been shown to improve OOD performance over standard fine-tuning. In NLP, methods such as prefix-tuning (Li & Liang, 2021) and prompt-tuning (Lester et al., 2021) have been shown to improve OOD accuracy. There are many other parameter-efficient fine-tuning methods as well (Elsayed et al., 2019; Chen, 2023; Bahng et al., 2022; Tsao et al., 2023; Jia et al., 2022).
Other works propose regularizing parameters towards initialization, freezing the first several layers, using different learning rates for different layers, or tuning different layers for different examples (Long et al., 2013; Ge & Yu, 2017; Howard & Ruder, 2018; Guo et al., 2019; Zhang et al., 2020; Zhu et al., 2020; Jiang et al., 2021; Aghajanyan et al., 2021). Typically, a large fraction of the model is frozen to preserve the pretrained information; a key difference in this work is that we find that freezing a very small fraction of the model (<1% of the parameters) can lead to substantial and consistent improvements in accuracy. Other optimizers have been proposed to reduce AdamW's memory footprint, including LARS (You et al., 2017b) and Adafactor (Shazeer & Stern, 2018). Our method is simpler and achieves better accuracies with the same or better memory gains. A complementary line of work (Dettmers et al., 2022) studies quantization mechanisms for optimizer states. These tools, although developed for AdamW, can also be used with SGD (freeze-embed) to get additional gains in memory.

7 DISCUSSION

We note that the methods we consider are not complex. We showed that the minor tweak of freezing the embedding layer overwhelmingly improves performance across the board when fine-tuning with SGD. We clarify that we do not claim that SGD (freeze-embed) is a substantially better method than AdamW in terms of accuracy. Rather, it is remarkable that, with its simplicity, we can already achieve comparable or even slightly better accuracy than AdamW across a wide range of models and benchmarks, while using much less memory. We hope future work can focus on extending these methods to other applications such as NLP, image segmentation, and recommender systems (Naumov et al., 2019), on models without embedding layers, and on more sophisticated learning rate schedules.

8 ACKNOWLEDGEMENTS

We thank Percy Liang, Tengyu Ma, Yuanzhi Li, and Zhiyuan Li for helpful comments.

REFERENCES

Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal Gupta. Better fine-tuning by reducing representational collapse. In International Conference on Learning Representations (ICLR), 2021.

Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Exploring visual prompts for adapting large-scale models, 2022.

Peter Bandi, Oscar Geessink, Quirine Manson, Marcory Van Dijk, Maschenka Balkenhol, Meyke Hermsen, Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Paeng, Aoxiao Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient level: the CAMELYON17 challenge. IEEE Transactions on Medical Imaging, 38(2):550-560, 2018.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In International Conference on Computer Vision (ICCV), 2021.

Pin-Yu Chen. Model reprogramming: Resource-efficient cross-domain machine learning, 2023.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pp. 1597-1607, 2020.

Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. In Computer Vision and Pattern Recognition (CVPR), 2018.

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization.
In International Conference on Learning Representations (ICLR), 2022.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

Gamaleldin F. Elsayed, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial reprogramming of neural networks. In International Conference on Learning Representations (ICLR), 2019.

Weifeng Ge and Yizhou Yu. Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning. In Computer Vision and Pattern Recognition (CVPR), 2017.

Soumya Suvra Ghosal, Yifei Ming, and Yixuan Li. Are vision transformers robust to spurious correlations? arXiv preprint arXiv:2008.02790, 2022.

Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, and Jonathan M. Cohen. Training deep networks with stochastic gradient normalized by layerwise adaptive second moments. 2019.

Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris. SpotTune: Transfer learning through adaptive fine-tuning. In Computer Vision and Pattern Recognition (CVPR), 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Association for Computational Linguistics (ACL), 2018.

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021. URL https://doi.org/10.5281/zenodo.5143773.

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision (ECCV), 2022.

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In International Conference on Learning Representations (ICLR), 2021.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning (ICML), 2021.

Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General visual representation learning. In European Conference on Computer Vision (ECCV), 2020.

Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? In Computer Vision and Pattern Recognition (CVPR), 2019.
Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations (ICLR), 2022.

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Association for Computational Linguistics (ACL), 2021.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In International Conference on Computer Vision (ICCV), 2021.

Zhuang Liu, Hanzi Mao, Chaozheng Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Computer Vision and Pattern Recognition (CVPR), 2022.

Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S. Yu. Transfer feature learning with joint distribution adaptation. In International Conference on Computer Vision (ICCV), pp. 2200-2207, 2013.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

Subhabrata Mukherjee and Ahmed Hassan Awadallah. Distilling transformers into simple neural networks with unlabeled transfer data. arXiv preprint arXiv:1910.01769, 2019.

Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. Deep learning recommendation model for personalization and recommendation systems. CoRR, abs/1906.00091, 2019. URL http://arxiv.org/abs/1906.00091.

Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In International Conference on Computer Vision (ICCV), 2019.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), volume 139, pp. 8748-8763, 2021.

Alexander Robey, George J. Pappas, and Hamed Hassani. Model-based domain generalization. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Miguel Romero, Yannet Interian, Timothy Solberg, and Gilmer Valdes. Targeted transfer learning to improve performance in small medical physics datasets. Medical Physics, 47(12):6246-6256, 2020.

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), 2020.

Shibani Santurkar, Dimitris Tsipras, and Aleksander Madry. BREEDS: Benchmarks for subpopulation shift. arXiv, 2020.

Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning (ICML), 2018.

Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT?
Data, augmentation, and regularization in vision transformers. arXiv, abs/2106.10270, 2021.

Shuhan Tan, Xingchao Peng, and Kate Saenko. Class-imbalanced domain adaptation: An empirical odyssey. arXiv preprint arXiv:1910.10320, 2020.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (ICML), 2021.

Hsi-Ai Tsao, Lei Hsiung, Pin-Yu Chen, Sijia Liu, and Tsung-Yi Ho. AutoVP: An automated visual prompting framework and benchmark, 2023.

Mitchell Wortsman, Gabriel Ilharco, Mike Li, Jong Wook Kim, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. arXiv preprint arXiv:2109.01903, 2021.

Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning (ICML), 2022.

Jiaxuan You, Xiaocheng Li, Melvin Low, David Lobell, and Stefano Ermon. Deep Gaussian process for crop yield prediction based on remote sensing data. In Association for the Advancement of Artificial Intelligence (AAAI), 2017a.

Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv: Computer Vision and Pattern Recognition, 2017b.

Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations (ICLR), 2020.

Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, and Neil Houlsby. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv, 2020.

Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. Side-tuning: A baseline for network adaptation via additive side networks. In European Conference on Computer Vision (ECCV), 2020.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. FreeLB: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations (ICLR), 2020.

A ADDITIONAL TRAINING DETAILS

SGD and AdamW updates. We start from the initialization $\theta^{(0)} = \theta_\text{pretrain}$. Given the hyperparameters minibatch size batch_size and number of epochs num_epochs, our algorithms run for $T = \text{num\_epochs} \cdot |D_\text{train}| / \text{batch\_size}$ steps. At steps $t = 0, 1, \ldots, T$, we select a randomly shuffled minibatch $B_t$ from $D_\text{train}$ (reshuffled at the end of each epoch) and compute the minibatch stochastic gradient $g_t = \frac{1}{|B_t|} \sum_{(x,y) \in B_t} \nabla_\theta \ell(f_\theta(x), y)$. The parameters $\theta^{(t)}$ are updated as follows. For SGD, in addition to the gradients $g_t$ and weights $\theta^{(t)}$, we maintain a first-moment estimate $m_t$ as optimizer state, initialized as $m_{-1} = 0$.
The SGD($\eta_t$, $\mu$, $\lambda$) update with learning rate $\eta_t$, momentum $\mu$, and weight decay $\lambda$ is given by
$$g_t \leftarrow g_t + \lambda \theta^{(t)}$$
$$m_t = \mu m_{t-1} + g_t \quad \text{(first moment)}$$
$$\theta^{(t+1)} = \theta^{(t)} - \eta_t m_t$$

For AdamW, we maintain two additional optimizer states: the first-moment estimate $m_t$ and the second-moment estimate $v_t$, initialized as $m_{-1} = v_{-1} = 0$. The AdamW($\eta_t$, $\beta_1$, $\beta_2$, $\lambda$) update with learning rate $\eta_t$, betas $(\beta_1, \beta_2)$, and weight decay $\lambda$ is given by
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment)}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment)}$$
$$\hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)$$
$$\theta^{(t+1)} = (1 - \eta_t \lambda)\,\theta^{(t)} - \eta_t \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Gradual unfreezing. The success of freeze-embed for SGD is in line with our hypothesis that it is beneficial to change the lower layers less during fine-tuning. We could further refine this idea: instead of just freezing the first embedding layer, we could start by freezing the entire pretrained model (except for the head) and gradually unfreeze the rest of the model over training. Strategies like this have been explored by ULMFiT (Howard & Ruder, 2018) and others (Mukherjee & Awadallah, 2019; Romero et al., 2020). For gradual unfreezing, we found that naively using a single learning rate throughout this process, or even a cosine learning rate schedule, is not sufficient to achieve good performance. Instead, we fit an exponential learning rate decay law for our experiments (see details on the learning rate schedule below). The performance of gradual unfreezing on CLIP ViT-B/16 is shown in Tables 7 and 8. On some datasets, gradual unfreezing performed particularly well. For example, it got an OOD accuracy of 93.2% on DomainNet and 96.5% on Camelyon, which are both competitive with SGD (freeze-embed) and AdamW on a much larger CLIP ViT-L/14 model. On average, it performed slightly worse than SGD (freeze-embed, no momentum). Overall, we find that the most consistent improvements come from simply freezing the embedding layer, which suggests that the embedding layer plays a particularly prominent role in modern vision architectures.

Hyperparameter details

Learning rate. We sweep over 6 learning rates for SGD ([3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]) and for AdamW ([3e-7, 1e-6, 3e-6, 1e-5, 3e-5, 1e-4]). As is standard, we use smaller learning rates for AdamW because they work better. The best learning rate is chosen based on ID validation accuracy. For the LP-FT experiments in Appendix B, we divide all the learning rates by 10: since we have already trained the linear probe, the learning rate required to fine-tune the entire model is smaller.

Other optimizer parameters. We use all the default options for SGD and AdamW in PyTorch, except that we set the SGD momentum parameter to 0.9. The other default options are: for SGD, the weight decay is 0 and we use momentum without dampening; for AdamW, the weight decay is 0.01, the betas are (0.9, 0.999), and epsilon is 1e-08.

Number of epochs. We train for 20 epochs on Living-17, 20 epochs on Waterbirds, 50 epochs on DomainNet, 5 epochs on WILDS-FMoW, and 3 epochs on Camelyon. We use the same number of training epochs for all the optimizer methods we report results on.

Learning rate schedule. With the exception of gradual unfreezing in Tables 7 and 8, we use a cosine learning rate schedule and decay the starting learning rate to 0 over the course of the T training steps. The learning rate schedule for gradual unfreezing is described below.
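As a concrete illustration of these hyperparameter choices, here is a minimal PyTorch sketch (ours, not the paper's released code) of the optimizer and cosine-schedule setup; `model` and `num_steps` are assumed to come from the surrounding training script.

```python
import torch

# Learning-rate sweeps described above; the best value is picked by ID validation accuracy.
SGD_LRS = [3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]
ADAMW_LRS = [3e-7, 1e-6, 3e-6, 1e-5, 3e-5, 1e-4]

def make_optimizer_and_schedule(model, method, lr, num_steps):
    params = [p for p in model.parameters() if p.requires_grad]
    if method == "adamw":
        # PyTorch defaults: betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01.
        opt = torch.optim.AdamW(params, lr=lr)
    elif method == "sgd":
        opt = torch.optim.SGD(params, lr=lr, momentum=0.9)  # weight decay 0 by default
    else:  # "sgd_no_momentum"
        opt = torch.optim.SGD(params, lr=lr, momentum=0.0)
    # Cosine decay of the starting learning rate to 0 over the T training steps.
    schedule = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=num_steps)
    return opt, schedule
```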
Learning rate schedule for gradual unfreezing. For gradual unfreezing, we do not use a cosine learning rate schedule. Instead, at epoch $t$ (0-indexed) out of $T$, we multiply the base learning rate by $\exp(3.73 \cdot (1.0 - t/(T-1)))$. This means we multiply the learning rate by $\exp(3.73) \approx 41.7$ in epoch 0, and by 1 in the last epoch. The intuition is that when we are tuning a smaller number of layers, we need a higher learning rate than for full fine-tuning; for example, the optimal learning rate for head tuning is higher than the optimal learning rate for full fine-tuning. The exact constant (3.73) was selected by comparing the optimal learning rate for head tuning with that for full fine-tuning on Waterbirds (the optimal learning rate was approximately exp(3.73) times larger for head tuning). Without this decay schedule, for example using a vanilla cosine learning rate scheduler, gradual unfreezing worked similarly to or worse than vanilla full fine-tuning.

Stopping criteria. The results presented in the paper are from models early stopped based on ID validation accuracy. We sanity checked that the conclusions are similar if we use the last checkpoint.

Data augmentation and input preprocessing. Additionally, we use the following preprocessing and augmentations on our input images. We use very basic augmentations (e.g., only horizontal flips for WILDS, and standard augmentations from past work), including for our state-of-the-art results:
1. For WILDS-FMoW, the images are 224x224, so we do not resize and only perform a random horizontal flip augmentation. We do not perform any augmentations at test time.
2. For WILDS-Camelyon, we resize the images to 224x224 with bilinear interpolation (the standard in PyTorch), and only perform a random horizontal flip augmentation. We do not perform any augmentations at test time, just the resize.
3. For Living-17, we follow Kumar et al. (2022) and perform a random resized crop to 224x224 images (using the default options in PyTorch), followed by a random horizontal flip augmentation while training. At test time we resize the image to 256x256 and then take a center crop of size 224x224.
4. For DomainNet, we follow Kumar et al. (2022) and first resize the image to 256x256 with bicubic interpolation, then take a random crop of size 224x224 (using the default options in PyTorch), followed by a random horizontal flip augmentation while training. At test time we simply resize the image to 224x224 with bicubic interpolation.
5. For Waterbirds, we resize the image to 224x224 with bicubic interpolation and then take a center crop of size 224x224. We apply the same transformation at test time.

Embedding layer. For SGD (freeze-embed), the exact layers we freeze are as follows (see the sketch after this list):
1. CLIP ViTs: we freeze the patch-to-token embedding layer and the layer norm.
2. Supervised and DINO ViTs: we freeze the patch-to-token embedding layer (there is no layer norm after it).
3. BiT ResNets: we freeze the stem and the first convolution block of the model. We tried freezing less and more of the model in our initial experiments, but it did not seem to help.
4. ConvNeXt-B: we freeze the stem and the first stage of the model.
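The following sketch shows one way to express this per-architecture freezing; the helper is ours, and the commented attribute paths (`patch_embed`, `ln_pre`, `stem`, `stages`) are hypothetical names that depend on the specific model implementation, so they should be adapted to whatever modules your checkpoint actually exposes.

```python
def freeze_modules(modules):
    """Freeze-embed: keep the given modules fixed at their pretrained initialization."""
    for module in modules:
        for p in module.parameters():
            p.requires_grad = False

# Hypothetical attribute names; adapt to your model implementation:
#   CLIP ViT:            freeze_modules([model.patch_embed, model.ln_pre])
#   Supervised/DINO ViT: freeze_modules([model.patch_embed])
#   BiT ResNet:          freeze_modules([model.stem, model.stages[0]])
#   ConvNeXt-B:          freeze_modules([model.stem, model.stages[0]])
```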
B ABLATIONS ON CLIP

In Tables 7-8, we show additional ablations for the CLIP ViT-B/16 model on all datasets (running each ablation for all models on all datasets is too computationally expensive). We tried:

1. SGD (freeze-embed, not layer-norm): For the CLIP model, our freeze-embed variation freezes the bottom embedding layer along with the layernorm right after it. We ran an ablation where we only freeze the bottom linear embedding layer, but not the layer norm. This performs comparably with SGD (freeze-embed), which suggests that freezing the input layer is what is important, and the layer norm does not matter much.
2. SGD (5x lower LR on embed layer): Another idea suggested by our analysis, which found that the embedding layer seems to be why AdamW does better than SGD, is to use a smaller learning rate on the embedding layer. Typically this would involve additional hyperparameter tuning and is hence undesirable. However, as a heuristic, we ran SGD with a 5x smaller learning rate for the embedding layer (since the gradients in the embedding layer are about 5x larger) than for the other layers; see the sketch after this list. As expected, this improves over SGD, but does not do as well as SGD (freeze-embed).
3. SGD (no momentum): Since SGD (freeze-embed, no momentum) performed very well in our experiments, we also tried fine-tuning with full SGD (no freezing), but without momentum. We found that SGD (no momentum) and SGD perform comparably.
4. SGD (weight decay): Vanilla SGD is run without weight decay, whereas AdamW incorporates weight decay. We ran this ablation to confirm that the gains of AdamW are not because of weight decay. We used the torch SGD optimizer and set the weight_decay argument to 0.01. Indeed, we found that SGD and SGD (weight decay) perform comparably, which suggests that weight decay is not the reason for the improved performance of AdamW.
5. Linear probing: We freeze the pretrained model and only train a linear probe on the features of the CLIP ViT-B/16 model. We train a logistic regression classifier using the sklearn library, sweeping over 50 regularization values in np.logspace(-7, 2, 50).
6. LP-FT: Kumar et al. (2022) show that first linear probing, and then fully fine-tuning the entire model, often works better, especially out-of-distribution. We run LP-FT as well, and for the full fine-tuning step we use SGD, AdamW, or SGD (freeze-embed), to test if our conclusions still hold with a better fine-tuning method. Note that the test accuracy on Waterbirds was a bit unstable early on, so for this dataset we use the last epoch instead of early stopping on ID validation accuracy. Indeed, even with LP-FT, we find that AdamW slightly outperforms SGD out-of-distribution, and SGD (freeze-embed) outperforms both methods with an average OOD accuracy of 77.5%. The accuracies with LP-FT are higher than with regular fine-tuning, in line with Kumar et al. (2022).
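For item 2 above, a minimal sketch of the per-layer learning-rate setup using PyTorch parameter groups is shown below (the `patch_embed` attribute name is an assumption about the model implementation):

```python
import torch

def make_sgd_with_lower_embed_lr(model, lr, embed_lr_factor=0.2):
    """SGD where the embedding layer gets a 5x smaller learning rate (factor 0.2)
    than the rest of the model, via PyTorch parameter groups."""
    embed_params = list(model.patch_embed.parameters())  # attribute name assumed
    embed_ids = {id(p) for p in embed_params}
    other_params = [p for p in model.parameters() if id(p) not in embed_ids]
    return torch.optim.SGD(
        [
            {"params": other_params},                              # uses the base lr
            {"params": embed_params, "lr": lr * embed_lr_factor},  # 5x smaller lr
        ],
        lr=lr,
        momentum=0.9,
    )
```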
Algorithms | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon | Avg.
Baselines
SGD | 80.0 | 62.5 | 72.8 | 37.3 | 86.8 | 67.9
AdamW | 82.8 | 71.9 | 89.2 | 40.7 | 95.7 | 76.0
Our methods
SGD (freeze-embed) | 83.2 | 73.7 | 88.2 | 40.2 | 94.3 | 75.9
SGD (freeze-embed, no momentum) | 83.1 | 80.4 | 89.0 | 38.8 | 93.3 | 76.9
Gradual-unfreezing | 81.9 | 69.1 | 93.2 | 40.5 | 96.5 | 76.2
Variations of SGD (freeze-embed)
SGD (freeze-embed, not layer-norm) | 83.6 | 74.3 | 89.1 | 39.3 | 92.9 | 75.9
SGD (5x lower LR on embed layer) | 83.3 | 71.7 | 85.7 | 38.7 | 95.7 | 75.0
Other freezing
Linear probing | 86.2 | 60.4 | 89.1 | 29.0 | 92.6 | 71.5
Other layerwise normalization methods
LAMB | 79.5 | 64.0 | 90.4 | 38.8 | 93.4 | 73.2
LARS | 83.9 | 48.6 | 83.8 | 38.6 | 93.3 | 69.6
Variations of SGD without any freezing
SGD (no momentum) | 81.4 | 59.2 | 76.7 | 37.9 | 84.3 | 67.9
SGD (weight decay) | 83.9 | 65.1 | 67.5 | 37.1 | 85.6 | 67.8
Variations of LP-FT
SGD | 86.7 | 67.3 | 89.2 | 37.9 | 94.1 | 75.0
AdamW | 84.5 | 68.2 | 90.1 | 39.7 | 95.8 | 75.7
SGD (freeze-embed) | 83.1 | 75.9 | 90.8 | 41.8 | 96.0 | 77.5

Table 7: Out-of-distribution (OOD) accuracy of more optimizers on CLIP ViT-B/16. We find that weight decay, momentum, and unfreezing the layer norm at the bottom of the model do not make much of a difference.

Algorithms | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon | Avg.
Baselines
SGD | 97.8 | 97.2 | 88.8 | 67.0 | 99.4 | 90.0
AdamW | 98.1 | 97.7 | 95.0 | 70.1 | 99.5 | 92.1
Our methods
SGD (freeze-embed) | 98.2 | 97.8 | 94.9 | 70.0 | 99.5 | 92.1
SGD (freeze-embed, no momentum) | 98.2 | 97.9 | 95.2 | 70.1 | 99.5 | 92.2
Gradual-unfreezing | 98.3 | 98.3 | 96.3 | 69.2 | 99.3 | 92.3
Variations of SGD (freeze-embed)
SGD (freeze-embed, not layer-norm) | 98.0 | 98.0 | 95.4 | 70.2 | 99.5 | 92.2
SGD (5x lower LR on embed layer) | 98.0 | 97.5 | 94.8 | 68.7 | 99.5 | 91.7
Other freezing
Linear probing | 97.8 | 96.6 | 94.5 | 47.2 | 96.1 | 86.4
Other layerwise normalization methods
LAMB | 98.2 | 97.8 | 95.1 | 67.9 | 99.5 | 91.7
LARS | 97.7 | 97.1 | 93.2 | 67.0 | 99.3 | 90.9
Variations of SGD without any freezing
SGD (no momentum) | 98.0 | 97.1 | 89.5 | 66.4 | 99.3 | 90.1
SGD (weight decay) | 97.6 | 97.2 | 87.9 | 66.4 | 99.3 | 89.7
Variations of LP-FT
SGD | 98.2 | 97.2 | 95.1 | 66.7 | 99.0 | 91.2
AdamW | 98.2 | 98.2 | 95.7 | 69.2 | 99.5 | 92.2
SGD (freeze-embed) | 98.4 | 97.8 | 95.7 | 69.1 | 99.4 | 92.1

Table 8: In-distribution (ID) accuracy of more optimizers on CLIP ViT-B/16. We find that weight decay, momentum, and unfreezing the layer norm at the bottom of the model do not make much of a difference.

C ABLATIONS ON FREEZING CONVNEXT

The ConvNeXt-Base model consists of a stem (6.5k parameters), then 4 stages (415 thousand, 1.7 million, 58 million, and 27 million parameters, respectively), followed by a layernorm (2k parameters). In the main paper, we freeze the stem and the first stage, which together make up 0.5% of the parameters of the entire model. This is similar to the fraction of parameters in the patch embedding layer of the vision transformers, which we freeze. An alternative choice is to freeze only the stem of the ConvNeXt model, which is 0.007% of the parameters. We run an ablation where we try this choice of freezing, which we call SGD (freeze-stem). We refer to our original approach of freezing the stem and the first stage as SGD (freeze-stem-block-1). Results for OOD are in Table 9 and for ID in Table 10 (a sketch of both freezing choices is given after Table 10).

Method | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon | Avg.
SGD | 94.0 | 80.2 | 89.8 | 39.7 | 83.0 | 77.3
AdamW | 90.3 (-3.7) | 89.8 (+9.6) | 89.5 (-0.3) | 38.4 (-1.3) | 89.5 (+6.5) | 79.5 (+2.2)
SGD (freeze-stem-block-1) | 92.6 (-1.4) | 86.9 (+6.7) | 91.2 (+1.4) | 38.2 (-1.5) | 88.1 (+5.1) | 79.4 (+2.1)
SGD (freeze-stem) | 91.4 (-2.6) | 84.7 (+4.5) | 91.0 (+1.2) | 40.3 (+0.6) | 86.2 (+3.2) | 78.7 (+1.4)

Table 9: Out-of-distribution (OOD) accuracies for ConvNeXt, where we freeze either just the stem layer (0.007% of model parameters) or the stem and the first stage (0.5% of model parameters).

Method | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon | Avg.
SGD | 98.7 | 99.0 | 94.8 | 66.3 | 99.4 | 91.6
AdamW | 98.6 (-0.1) | 99.5 (+0.5) | 94.5 (-0.3) | 68.8 (+2.5) | 99.7 (+0.3) | 92.2 (+0.6)
SGD (freeze-stem-block-1) | 98.6 (-0.1) | 99.4 (+0.4) | 95.1 (+0.3) | 67.4 (+1.1) | 99.5 (+0.1) | 92.0 (+0.4)
SGD (freeze-stem) | 98.8 (+0.1) | 99.3 (+0.3) | 95.1 (+0.3) | 67.2 (+0.9) | 99.5 (+0.1) | 92.0 (+0.4)

Table 10: In-distribution (ID) accuracies for ConvNeXt, where we freeze either just the stem layer (0.007% of model parameters) or the stem and the first stage (0.5% of model parameters).
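A minimal sketch of the two ConvNeXt freezing choices is shown below, assuming a timm-style ConvNeXt where the stem and stages are exposed as `model.stem` and `model.stages` (the attribute names are assumptions):

```python
def freeze_convnext_lower_layers(model, freeze_first_stage=True):
    """Freeze the ConvNeXt stem, and optionally the first stage, before SGD fine-tuning."""
    modules = [model.stem]              # SGD (freeze-stem): ~0.007% of parameters
    if freeze_first_stage:
        modules.append(model.stages[0])  # SGD (freeze-stem-block-1): ~0.5% of parameters
    for module in modules:
        for p in module.parameters():
            p.requires_grad_(False)
```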
D RESULTS FOR ADAMW (FREEZE-EMBED)

In our paper, we hypothesized that AdamW and SGD (freeze-embed) improve on SGD for the same reason: they change the embedding layer less. Based on this hypothesis, we would expect that the gains of AdamW and freeze-embed are not complementary. Indeed, we find that the AdamW (freeze-embed) variation performs similarly to AdamW and SGD (freeze-embed). Tables 11 and 12 expand on the OOD and ID accuracy results in Tables 1 and 2, respectively, to include AdamW (freeze-embed). Overall, AdamW (freeze-embed) and SGD (freeze-embed) perform comparably for all the recent vision models, and both are fairly close to AdamW, although there are differences on some models and datasets. Averaged across all datasets and models, AdamW, AdamW (freeze-embed), and SGD (freeze-embed) get 76.0%, 76.5%, and 76.7% OOD accuracy, respectively, compared to 72.0% for SGD. Averaged across all the in-distribution datasets and models, AdamW, AdamW (freeze-embed), and SGD (freeze-embed) get 91.5%, 91.5%, and 91.3% accuracy, respectively, compared to 90.3% for SGD. Overall, the performances of the three methods are similar enough to suggest that they work well for the same reason, namely that they tune the embedding layer less, and that freeze-embed is not an independent axis of improvement.

Model | Method | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon | Avg.
CLIP ViT-B/16 | SGD | 80.0 | 62.5 | 72.8 | 37.3 | 86.8 | 67.9
CLIP ViT-B/16 | AdamW | 82.8 (+2.8) | 71.9 (+9.4) | 89.2 (+16.4) | 40.7 (+3.4) | 95.7 (+8.9) | 76.0 (+8.1)
CLIP ViT-B/16 | SGD (freeze-embed) | 83.2 (+3.2) | 73.7 (+11.2) | 88.2 (+15.4) | 40.2 (+2.9) | 94.3 (+7.5) | 75.9 (+8.0)
CLIP ViT-B/16 | AdamW (freeze-embed) | 82.4 (+2.4) | 69.5 (+7.0) | 88.2 (+15.4) | 40.7 (+3.4) | 96.7 (+9.9) | 75.5 (+7.6)
CLIP ViT-L/14 | SGD | 84.2 | 65.0 | 60.8 | 41.0 | 83.2 | 66.8
CLIP ViT-L/14 | AdamW | 88.0 (+3.8) | 85.2 (+20.2) | 93.8 (+33.0) | 48.3 (+7.3) | 95.9 (+12.7) | 82.2 (+15.4)
CLIP ViT-L/14 | SGD (freeze-embed) | 90.5 (+6.3) | 84.7 (+19.7) | 93.1 (+32.3) | 49.9 (+8.9) | 96.5 (+13.3) | 83.0 (+16.2)
CLIP ViT-L/14 | AdamW (freeze-embed) | 88.0 (+3.8) | 84.6 (+19.6) | 93.8 (+33.0) | 44.6 (+3.6) | 96.4 (+13.2) | 81.5 (+14.7)
Sup ViT-B/16 | SGD | 89.5 | 77.4 | 86.3 | 33.5 | 92.6 | 75.8
Sup ViT-B/16 | AdamW | 88.3 (-1.2) | 81.6 (+4.2) | 84.4 (-1.9) | 35.9 (+2.4) | 87.9 (-4.7) | 75.6 (-0.2)
Sup ViT-B/16 | SGD (freeze-embed) | 88.0 (-1.5) | 82.4 (+5.0) | 86.3 (+0.0) | 34.4 (+0.9) | 93.7 (+1.1) | 77.0 (+1.2)
Sup ViT-B/16 | AdamW (freeze-embed) | 88.1 (-1.4) | 82.4 (+5.0) | 82.3 (-4.0) | 35.7 (+2.2) | 93.0 (+0.4) | 76.3 (+0.5)
DINO ViT-B/16 | SGD | 88.2 | 56.1 | 76.0 | 33.6 | 86.9 | 68.2
DINO ViT-B/16 | AdamW | 87.4 (-0.8) | 61.2 (+5.1) | 77.4 (+1.4) | 35.8 (+2.2) | 91.9 (+5.0) | 70.7 (+2.5)
DINO ViT-B/16 | SGD (freeze-embed) | 86.7 (-1.5) | 67.9 (+11.8) | 78.4 (+2.4) | 35.9 (+2.3) | 90.6 (+3.7) | 71.9 (+3.7)
DINO ViT-B/16 | AdamW (freeze-embed) | 86.8 (-1.4) | 64.5 (+8.4) | 76.6 (+0.6) | 35.4 (+1.8) | 93.5 (+6.6) | 71.4 (+3.2)
ConvNeXt-Base | SGD | 94.0 | 80.2 | 89.8 | 39.7 | 83.0 | 77.3
ConvNeXt-Base | AdamW | 90.3 (-3.7) | 89.8 (+9.6) | 89.5 (-0.3) | 38.4 (-1.3) | 89.5 (+6.5) | 79.5 (+2.2)
ConvNeXt-Base | SGD (freeze-embed) | 92.6 (-1.4) | 86.9 (+6.7) | 91.2 (+1.4) | 38.2 (-1.5) | 88.1 (+5.1) | 79.4 (+2.1)
ConvNeXt-Base | AdamW (freeze-embed) | 88.1 (-5.9) | 91.6 (+11.4) | 89.8 (+0.0) | 41.6 (+1.9) | 87.7 (+4.7) | 79.8 (+2.5)
BiT ResNet-50 | SGD | 84.3 | 76.5 | 80.0 | 34.1 | 90.4 | 73.1
BiT ResNet-50 | AdamW | 83.1 (-1.2) | 74.8 (-1.7) | 84.0 (+4.0) | 33.8 (-0.3) | 92.4 (+2.0) | 73.6 (+0.5)
BiT ResNet-50 | SGD (freeze-embed) | 84.1 (-0.2) | 75.5 (-1.0) | 82.3 (+2.3) | 35.0 (+0.9) | 95.2 (+4.8) | 74.4 (+1.3)
BiT ResNet-50 | AdamW (freeze-embed) | 82.9 (-1.4) | 77.3 (+0.8) | 83.3 (+3.3) | 36.3 (+2.2) | 93.7 (+3.3) | 74.7 (+1.6)
BiT ResNet-101 | SGD | 82.8 | 76.9 | 86.2 | 38.0 | 89.3 | 74.6
BiT ResNet-101 | AdamW | 82.9 (+0.1) | 79.4 (+2.5) | 83.5 (-2.7) | 37.0 (-1.0) | 89.7 (+0.4) | 74.5 (-0.1)
BiT ResNet-101 | SGD (freeze-embed) | 83.1 (+0.3) | 77.3 (+0.4) | 86.0 (-0.2) | 36.0 (-2.0) | 95.5 (+6.2) | 75.6 (+1.0)
BiT ResNet-101 | AdamW (freeze-embed) | 84.8 (+2.0) | 78.3 (+1.4) | 84.3 (-1.9) | 38.2 (+0.2) | 95.0 (+5.7) | 76.1 (+1.5)

Table 11: Out-of-distribution (OOD) accuracies with AdamW (freeze-embed). This is an expansion of Table 1 to include AdamW (freeze-embed) OOD results for fine-tuning 7 popular models across 5 benchmark datasets. On OOD performance averaged across all models and datasets, AdamW (freeze-embed) gets slightly better accuracy than AdamW but slightly worse than SGD (freeze-embed).

Model | Method | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon | Avg.
CLIP ViT-B/16 | SGD | 97.8 | 97.2 | 88.8 | 67.0 | 99.4 | 90.0
CLIP ViT-B/16 | AdamW | 98.1 (+0.3) | 97.7 (+0.5) | 95.0 (+6.2) | 70.1 (+3.1) | 99.5 (+0.1) | 92.1 (+2.1)
CLIP ViT-B/16 | SGD (freeze-embed) | 98.2 (+0.4) | 97.8 (+0.6) | 94.9 (+6.1) | 70.0 (+3.0) | 99.5 (+0.1) | 92.1 (+2.1)
CLIP ViT-B/16 | AdamW (freeze-embed) | 98.3 (+0.5) | 97.8 (+0.6) | 95.3 (+6.5) | 70.5 (+3.5) | 99.6 (+0.2) | 92.3 (+2.3)
CLIP ViT-L/14 | SGD | 98.4 | 97.3 | 84.3 | 69.0 | 99.4 | 89.7
CLIP ViT-L/14 | AdamW | 98.9 (+0.5) | 98.8 (+1.5) | 96.9 (+12.6) | 74.5 (+5.5) | 99.6 (+0.2) | 93.7 (+4.0)
CLIP ViT-L/14 | SGD (freeze-embed) | 98.7 (+0.3) | 98.9 (+1.6) | 97.1 (+12.8) | 74.5 (+5.5) | 99.6 (+0.2) | 93.7 (+4.0)
CLIP ViT-L/14 | AdamW (freeze-embed) | 98.7 (+0.3) | 98.8 (+1.5) | 96.7 (+12.4) | 75.1 (+6.1) | 99.6 (+0.2) | 93.8 (+4.1)
Sup ViT-B/16 | SGD | 98.6 | 99.1 | 91.7 | 64.1 | 99.4 | 90.6
Sup ViT-B/16 | AdamW | 98.7 (+0.1) | 99.0 (-0.1) | 91.7 (-0.0) | 66.4 (+2.3) | 99.5 (+0.1) | 91.1 (+0.5)
Sup ViT-B/16 | SGD (freeze-embed) | 98.7 (+0.1) | 99.2 (+0.1) | 91.5 (-0.2) | 65.0 (+0.9) | 99.6 (+0.2) | 90.8 (+0.2)
Sup ViT-B/16 | AdamW (freeze-embed) | 98.6 (+0.0) | 99.0 (-0.1) | 90.9 (-0.8) | 66.8 (+2.7) | 99.5 (+0.1) | 91.0 (+0.4)
DINO ViT-B/16 | SGD | 98.4 | 97.0 | 88.2 | 62.4 | 99.4 | 89.1
DINO ViT-B/16 | AdamW | 98.5 (+0.1) | 97.9 (+0.9) | 89.4 (+1.2) | 66.0 (+3.6) | 99.6 (+0.2) | 90.3 (+1.2)
DINO ViT-B/16 | SGD (freeze-embed) | 98.4 (+0.0) | 97.5 (+0.5) | 89.0 (+0.8) | 63.5 (+1.1) | 99.5 (+0.1) | 89.6 (+0.5)
DINO ViT-B/16 | AdamW (freeze-embed) | 98.4 (+0.0) | 97.8 (+0.8) | 89.7 (+1.5) | 65.7 (+3.3) | 99.6 (+0.2) | 90.3 (+1.2)
ConvNeXt-Base | SGD | 98.7 | 99.0 | 94.8 | 66.3 | 99.4 | 91.6
ConvNeXt-Base | AdamW | 98.6 (-0.1) | 99.5 (+0.5) | 94.5 (-0.3) | 68.8 (+2.5) | 99.7 (+0.3) | 92.2 (+0.6)
ConvNeXt-Base | SGD (freeze-embed) | 98.6 (-0.1) | 99.4 (+0.4) | 95.1 (+0.3) | 67.4 (+1.1) | 99.5 (+0.1) | 92.0 (+0.4)
ConvNeXt-Base | AdamW (freeze-embed) | 98.7 (+0.0) | 99.5 (+0.5) | 94.7 (-0.1) | 69.2 (+2.9) | 99.6 (+0.2) | 92.4 (+0.8)
BiT ResNet-50 | SGD | 97.4 | 98.4 | 89.3 | 64.6 | 99.5 | 89.8
BiT ResNet-50 | AdamW | 97.2 (-0.2) | 98.5 (+0.1) | 89.2 (-0.1) | 65.1 (+0.5) | 99.5 (+0.0) | 89.9 (+0.1)
BiT ResNet-50 | SGD (freeze-embed) | 97.6 (+0.2) | 98.5 (+0.1) | 89.2 (-0.1) | 64.8 (+0.2) | 99.5 (+0.0) | 89.9 (+0.1)
BiT ResNet-50 | AdamW (freeze-embed) | 97.4 (+0.0) | 98.4 (-0.0) | 89.1 (-0.2) | 64.9 (+0.3) | 99.6 (+0.1) | 89.9 (+0.1)
BiT ResNet-101 | SGD | 98.3 | 98.9 | 92.0 | 66.0 | 99.4 | 90.9
BiT ResNet-101 | AdamW | 98.4 (+0.1) | 98.6 (-0.3) | 91.1 (-0.9) | 67.0 (+1.0) | 99.6 (+0.2) | 90.9 (-0.0)
BiT ResNet-101 | SGD (freeze-embed) | 98.4 (+0.1) | 98.8 (-0.1) | 91.5 (-0.5) | 65.9 (-0.1) | 99.5 (+0.1) | 90.8 (-0.1)
BiT ResNet-101 | AdamW (freeze-embed) | 98.2 (-0.1) | 98.7 (-0.2) | 91.1 (-0.9) | 66.6 (+0.6) | 99.6 (+0.2) | 90.9 (-0.0)

Table 12: In-distribution (ID) accuracies with AdamW (freeze-embed). This is an expansion of Table 2 to include AdamW (freeze-embed) ID results for our 7 models and 5 datasets. AdamW, AdamW (freeze-embed), and SGD (freeze-embed) all perform comparably on ID accuracy.

E WHAT COULD CAUSE LARGE EMBEDDING LAYER GRADIENTS?
Generally speaking, we find that AdamW fine-tuning leads to better performing and more robust models than SGD fine-tuning, especially for modern pretrained models. Our algorithms for closing this gap were inspired by the observation that in modern pretrained vision models, the embedding layer has substantially larger gradients at the pretrained initialization than the other layers. This could lead to over-training of the embedding layer when using SGD, which we hypothesized would be bad for robust fine-tuning. The success of SGD (freeze-embed) in our experiments adds further evidence for this hypothesis. But why are the gradients of the embedding layer at the pretrained initialization so large in the first place? More generally, why does AdamW do better than SGD during fine-tuning? We discuss two plausible hypotheses and then test them in a controlled experiment.

Algorithmic aspects: pretraining algorithm? Among our 7 models, the more recent models like vision transformers and ConvNeXt, which were pretrained with AdamW, are also the ones with the largest gaps in performance between AdamW and SGD fine-tuning. BiT ResNets, which were pretrained with SGD, had much smaller differences between AdamW and SGD. This strongly suggests that the discrepancy between the pretraining and fine-tuning algorithms might be the cause of the performance gaps. For example, it is possible that pretraining with AdamW implicitly biases the model towards configurations that benefit most from AdamW updates. This would also explain why other adaptive algorithms like LARS and LAMB are not competitive even though they perform some form of layer-wise normalization. On the other hand, it does not explain why such effects would distinctively impact the embedding layer and not the other layers. Relatedly, it is also unclear why SGD (freeze-embed) would be able to overcome such implicit biases from AdamW pretraining.

Architectural aspects: patchified first layer? There are substantial architectural differences between the newer transformer and ConvNeXt models and the older ResNets, which could also contribute to the newer models working better with AdamW. Most notably, the vision transformer is a fundamentally different architecture from convolutional networks. At the same time, the biggest differences between the architectures, such as self-attention, the softmax non-linearity, or fully connected layers, are not the likely contributors to the differences between AdamW and SGD in fine-tuning. This is because we also see gaps between these methods with the ConvNeXt models, which lack the major transformer components. Rather, it is possible that some of the design choices adapted from transformers to ConvNeXt contribute to the differences between AdamW and SGD fine-tuning. Among these, many primarily affect the higher layers (e.g., the heavy third stage, depthwise convolutions, the inverted bottleneck, and larger kernels) and are unlikely to cause the lower-layer gradients to be high. The key design change we believe is important here is the use of a "patchify" stem, which could cause distinctive changes in the gradients of the lower blocks. The stem layer primarily controls how the input is processed for the rest of the network. ViTs and ConvNeXt down-sample the input images into non-overlapping patch tokens, whereas ResNets use a denser, overlapping convolution followed by max-pooling. The coarser non-overlapping patches might lead the embedding layer to attune more closely to the patch distribution seen in pretraining.
This might not be an issue by itself, as pixel/patch level information is noisy anyway and the higher layers can extract more robust features from across the patches. A possible fine-tuning dynamic is as follows: on new datasets where the patch distribution is very different from pretraining, the embedding layer might have large gradients. With standard SGD, this might cause the embedding layer to move too quickly to fit the new patches before the higher layers can adapt to the changes. Freezing the embedding layer instead passes along the raw patches with minimal processing and lets the model adapt to the new distribution based on higher-level features, which are likely more robust.

Summary of hypotheses. In conclusion, we hypothesize that the differences between AdamW and SGD fine-tuning arise from a combination of the above reasons: pretraining with AdamW and a large patchify embedding layer. Additionally, the use of the GELU activation and layer normalization might also change the optimizer dynamics, although these are minor changes from the ReLU activation and group normalization used in BiT ResNets. It is of interest to systematically explore these reasons further in future work.

Controlled experiments to test hypotheses. We test these hypotheses in controlled experiments and find that the mismatch between the pretraining and fine-tuning optimization methods seems to be the key factor, both for why SGD can underperform during fine-tuning and for why the first-layer gradients of modern pretrained models are higher. To test this, we pretrained a BiT-ResNet-50 model using either (a) SGD or (b) AdamW. For each of these pretraining optimizer choices, we also tried using either (i) the original BiT-ResNet-50 architecture or (ii) a modification of it with a patchified first layer. This gives us 4 pretrained models. We then fine-tuned each of these pretrained models with (1) SGD, (2) AdamW, and (3) SGD (freeze-embed). Note that for ResNet models we consider the stem and the first stage as part of the embedding layer.

For the AdamW-pretrained models, fine-tuning with SGD does worse than AdamW on the OOD test sets (with or without the patchify architecture change), but for the SGD-pretrained models, fine-tuning with SGD does slightly better than AdamW. For example, for the standard BiT ResNet-50 architecture pretrained with AdamW, fine-tuning with AdamW gets 1.3% higher average OOD accuracy; for the BiT ResNet-50 (patchify) architecture pretrained with AdamW, fine-tuning with AdamW gets 0.9% higher average OOD accuracy. On the other hand, for both ResNet models pretrained with SGD, fine-tuning with AdamW does slightly worse than SGD. This suggests that the optimizer mismatch between pretraining and fine-tuning is a key reason why fine-tuning with SGD can do worse than fine-tuning with AdamW. Nonetheless, fine-tuning with SGD (freeze-embed) consistently performs well, getting the highest average OOD accuracy in 3 out of 4 cases (its OOD accuracy is in between SGD and AdamW on the ResNet patchify model pretrained with SGD). The full results for OOD are in Table 13 and for ID in Table 14. As in the main paper, the ID accuracies of all methods are fairly close.
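To make the patchify modification concrete, the sketch below contrasts a standard ResNet-style stem with a patchify stem. The 4x4, stride-4 patch size shown here follows ConvNeXt and is an illustrative assumption rather than the exact configuration used in this experiment (batch normalization and activations are omitted for brevity):

```python
import torch.nn as nn

def resnet_stem(in_ch=3, out_ch=64):
    # Standard ResNet-style stem: dense, overlapping 7x7 stride-2 convolution
    # followed by a stride-2 max-pool (4x total downsampling).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=7, stride=2, padding=3, bias=False),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    )

def patchify_stem(in_ch=3, out_ch=64, patch_size=4):
    # "Patchify" stem: a single non-overlapping convolution (kernel size == stride),
    # so each output token sees exactly one input patch, as in ViT/ConvNeXt.
    # patch_size=4 is an assumed choice that keeps the same 4x downsampling.
    return nn.Conv2d(in_ch, out_ch, kernel_size=patch_size, stride=patch_size)
```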
Pretraining | Fine-tuning | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon | Avg.
SGD | SGD | 83.9 | 63.4 | 77.9 | 34.2 | 88.3 | 69.6
SGD | AdamW | 77.5 (-6.4) | 66.4 (+3.0) | 75.5 (-2.4) | 33.4 (-0.8) | 94.6 (+6.3) | 69.5 (-0.1)
SGD | SGD (freeze-embed) | 84.2 (+0.3) | 65.3 (+1.9) | 79.3 (+1.4) | 32.2 (-2.0) | 95.6 (+7.3) | 71.3 (+1.7)
AdamW | SGD | 81.4 | 56.9 | 74.4 | 30.7 | 86.6 | 66.0
AdamW | AdamW | 80.1 (-1.3) | 65.6 (+8.7) | 73.6 (-0.8) | 30.6 (-0.1) | 86.6 (+0.0) | 67.3 (+1.3)
AdamW | SGD (freeze-embed) | 81.1 (-0.3) | 60.9 (+4.0) | 77.5 (+3.1) | 30.3 (-0.4) | 89.6 (+3.0) | 67.9 (+1.9)
SGD (Patchify) | SGD | 82.5 | 67.0 | 78.6 | 35.2 | 91.0 | 70.8
SGD (Patchify) | AdamW | 79.8 (-2.7) | 64.3 (-2.7) | 75.2 (-3.4) | 34.6 (-0.6) | 90.9 (-0.1) | 69.0 (-1.8)
SGD (Patchify) | SGD (freeze-embed) | 82.5 (+0.0) | 62.9 (-4.1) | 78.5 (-0.1) | 34.9 (-0.3) | 91.2 (+0.2) | 70.0 (-0.8)
AdamW (Patchify) | SGD | 81.3 | 61.7 | 74.5 | 31.7 | 93.3 | 68.5
AdamW (Patchify) | AdamW | 79.3 (-2.0) | 69.5 (+7.8) | 73.1 (-1.4) | 33.2 (+1.5) | 91.9 (-1.4) | 69.4 (+0.9)
AdamW (Patchify) | SGD (freeze-embed) | 82.0 (+0.7) | 64.0 (+2.3) | 77.0 (+2.5) | 31.1 (-0.6) | 95.0 (+1.7) | 69.8 (+1.3)

Table 13: Out-of-distribution (OOD) accuracies for the different BiT-ResNet pretrained models.

Pretraining | Fine-tuning | Living-17 | Waterbirds | DomainNet | FMoW | Camelyon | Avg.
SGD | SGD | 97.6 | 97.6 | 88.3 | 63.1 | 99.5 | 89.2
SGD | AdamW | 97.8 (+0.2) | 97.6 (-0.0) | 87.2 (-1.1) | 62.4 (-0.7) | 99.5 (+0.0) | 88.9 (-0.3)
SGD | SGD (freeze-embed) | 97.6 (+0.0) | 97.6 (-0.0) | 87.3 (-1.0) | 62.7 (-0.4) | 99.5 (+0.0) | 89.0 (-0.2)
AdamW | SGD | 97.2 | 97.2 | 87.5 | 58.9 | 99.3 | 88.0
AdamW | AdamW | 97.4 (+0.2) | 97.3 (+0.1) | 86.5 (-1.0) | 58.1 (-0.8) | 99.5 (+0.2) | 87.8 (-0.2)
AdamW | SGD (freeze-embed) | 97.4 (+0.2) | 97.1 (-0.1) | 86.9 (-0.6) | 58.2 (-0.7) | 99.3 (+0.0) | 87.8 (-0.2)
SGD (Patchify) | SGD | 97.9 | 97.5 | 88.3 | 62.8 | 99.4 | 89.2
SGD (Patchify) | AdamW | 97.5 (-0.4) | 97.3 (-0.2) | 87.0 (-1.3) | 62.2 (-0.6) | 99.5 (+0.1) | 88.7 (-0.5)
SGD (Patchify) | SGD (freeze-embed) | 97.6 (-0.3) | 97.5 (-0.0) | 87.6 (-0.7) | 63.1 (+0.3) | 99.4 (-0.0) | 89.0 (-0.2)
AdamW (Patchify) | SGD | 97.5 | 97.3 | 87.7 | 59.6 | 99.3 | 88.3
AdamW (Patchify) | AdamW | 97.6 (+0.1) | 97.2 (-0.1) | 86.2 (-1.5) | 58.4 (-1.2) | 99.5 (+0.2) | 87.8 (-0.5)
AdamW (Patchify) | SGD (freeze-embed) | 97.6 (+0.1) | 97.2 (-0.1) | 87.0 (-0.7) | 59.4 (-0.2) | 99.2 (-0.1) | 88.1 (-0.2)

Table 14: In-distribution (ID) accuracies for the different BiT-ResNet pretrained models.

As in Section 3.2, we also plot the average gradient norm at each layer, across minibatches of the dataset, as described in Equation 3.1. We show the plots for DomainNet, Waterbirds, and Living-17 in Figure 3, and plots for alternative ways of normalizing the gradients in Figures 4 and 5. For the SGD-pretrained ResNet, the embedding layer has a lower gradient than the other layers. For the AdamW-pretrained ResNet, the embedding layer has a higher gradient than the other layers. In other words, pretraining with SGD versus AdamW leads to a substantial difference in the embedding layer. Using a patchify stem does not substantially change these results. These results provide further evidence that (1) SGD does worse than AdamW when fine-tuning Vision Transformer and ConvNeXt models because these models are pretrained with AdamW, (2) the embedding layer plays a key role here, since pretraining with different optimizers leads to very different behavior at the embedding layer, and (3) SGD (freeze-embed) can potentially improve fine-tuning by freezing the embedding layer; without this, SGD either over-optimizes the embedding layer (if the learning rate is large) or under-optimizes the other layers (if the learning rate is small).
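A minimal sketch of this measurement (the per-layer gradient norm at the pretrained initialization, averaged over minibatches) is shown below. Grouping parameters by their top-level module name is an assumption about how layers are delineated, and `loss_fn` and `loader` are placeholders for the task loss and data loader:

```python
from collections import defaultdict

def layerwise_grad_norms(model, loss_fn, loader, device="cpu"):
    """Per-layer gradient norm at the pretrained initialization, averaged over
    minibatches. No optimizer step is taken, so the model stays at initialization."""
    model.to(device)
    totals, num_batches = defaultdict(float), 0
    for images, labels in loader:
        model.zero_grad(set_to_none=True)
        loss = loss_fn(model(images.to(device)), labels.to(device))
        loss.backward()
        # Sum squared gradients within each top-level module, then take the square root.
        sq = defaultdict(float)
        for name, param in model.named_parameters():
            if param.grad is not None:
                layer = name.split(".")[0]  # e.g. 'patch_embed', 'blocks', 'head' (assumed layout)
                sq[layer] += param.grad.pow(2).sum().item()
        for layer, s in sq.items():
            totals[layer] += s ** 0.5
        num_batches += 1
    return {layer: total / num_batches for layer, total in totals.items()}
```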
F ADDITIONAL PLOTS ON GRADIENT NORMS AT THE PRETRAINED INITIALIZATION

In Figure 6 we show plots of the gradient norms at different layers for Living-17 and Waterbirds. These are analogous to Figure 2 for DomainNet in the main paper. We again see that, among the pretrained layers, the embedding layer has much higher gradients in the modern architectures. We also consider two ways of normalizing the gradient magnitude. In the first method, we divide the norm of the gradient by the norm of the parameters, in order to capture the relative movement in the first SGD step, as opposed to the absolute movement captured by the gradient norm itself. In the second method, we divide the norm of the gradient by the square root of the number of parameters. This is to check that a layer does not simply have a larger gradient because it has more parameters. The reason we use the square root is as follows: suppose each parameter has gradient c; then the layer-wise gradient norm scales with the square root of the number of parameters. Also, the first step of the AdamW update is essentially a signed gradient descent step, where, ignoring weight decay, the per-layer movement is the square root of the number of parameters. So this normalization can be thought of as the relative size of the SGD update compared to the AdamW update in each layer at initialization. For visualization purposes, we exclude the head layer gradients in these plots, as they are often much larger than the others and the plots become hard to read if we include the head layer. Note that we expect the head layer to have higher gradients because it is randomly initialized (Kumar et al., 2022). For ViT models, we omit gradients of the cls token, position embedding, and the layer norm after the embedding layer.
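Building on the layerwise_grad_norms sketch at the end of the previous section, the two normalizations can be computed as follows (again grouping parameters by top-level module name, which is an assumed layer delineation):

```python
def normalized_grad_norms(model, grad_norms):
    """Normalize per-layer gradient norms by (i) the layer's parameter norm and
    (ii) the square root of the layer's parameter count."""
    param_sq, num_params = {}, {}
    for name, param in model.named_parameters():
        layer = name.split(".")[0]
        param_sq[layer] = param_sq.get(layer, 0.0) + param.detach().pow(2).sum().item()
        num_params[layer] = num_params.get(layer, 0) + param.numel()
    by_param_norm = {l: g / (param_sq[l] ** 0.5) for l, g in grad_norms.items()}
    by_sqrt_count = {l: g / (num_params[l] ** 0.5) for l, g in grad_norms.items()}
    return by_param_norm, by_sqrt_count
```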
Figure 3: We visualize the layer-wise gradient norms of the four BiT-ResNet models on (a) DomainNet, (b) Living-17, and (c) Waterbirds, at the pretrained initialization. For better visualization, we omit the head from the plot, which predictably has much larger gradients than the others (since it is randomly initialized). The format is the same as Figure 2: gradient norms of the embedding and middle layers are shown as red squares and black circles, respectively. We see that the embedding layer has higher gradients (than the other layers) for models pretrained with AdamW, but lower gradients (than the other layers) for models pretrained with SGD, which supports the hypotheses that the embedding layer plays a key role, and that pretraining with AdamW vs. SGD leads to very different models and is responsible for this behavior.

Figure 4: We visualize the layer-wise gradient norm, divided by the norm of the parameters, on (a) DomainNet, (b) Living-17, and (c) Waterbirds, at the pretrained initialization. For better visualization, we omit the head from the plot, which predictably has much larger gradients than the others (since it is randomly initialized). The format is the same as Figure 2: gradient norms of the embedding and middle layers are shown as red squares and black circles, respectively. Under this normalization scheme, the embedding layer has much higher gradients when pretrained with AdamW, but the embedding layer gradient is somewhere in the middle for models pretrained with SGD.

Figure 5: We visualize the layer-wise gradient norm, divided by the square root of the number of parameters, on (a) DomainNet, (b) Living-17, and (c) Waterbirds, at the pretrained initialization. For better visualization, we omit the head from the plot, which predictably has much larger gradients than the others (since it is randomly initialized). The format is the same as Figure 2: gradient norms of the embedding and middle layers are shown as red squares and black circles, respectively. Under this normalization scheme, the embedding layer has higher gradients in all cases. However, the embedding layer gradient is about 2-3 times larger (than the other layers) for models pretrained with SGD, but over 10 times larger (than the other layers) for models pretrained with AdamW.

Figure 6: We visualize the layer-wise gradient norms of our models on (a) Living-17 and (b) Waterbirds, at the pretrained initialization. The format is the same as Figure 2: gradient norms of the embedding, head, and middle layers are shown as red squares, green triangles, and black circles, respectively.

Figure 7: We visualize the layer-wise gradient norm, divided by the norm of the parameters, on (a) DomainNet, (b) Living-17, and (c) Waterbirds, at the pretrained initialization. For better visualization, we omit the head from the plot, which predictably has much larger gradients than the others (since it is randomly initialized). The format is the same as Figure 2: gradient norms of the embedding and middle layers are shown as red squares and black circles, respectively. Under this normalization scheme, the embedding layer has higher gradients than the other layers in all models. However, the gradient is only slightly larger for the ResNet models, and substantially larger for the Vision Transformer models, which also provides support for why freezing the embedding layer in Vision Transformers might make a larger difference.
Figure 8: We visualize the layer-wise gradient norm, divided by the square root of the number of parameters, on (a) DomainNet, (b) Living-17, and (c) Waterbirds, at the pretrained initialization. For better visualization, we omit the head from the plot, which predictably has much larger gradients than the others (since it is randomly initialized). The format is the same as Figure 2: gradient norms of the embedding and middle layers are shown as red squares and black circles, respectively. Under this normalization, we see that the gradients of the embedding layer are much larger than those of the other layers in all models, including ResNets.