Strong Baselines for Parameter-Efficient Few-Shot Fine-Tuning

Samyadeep Basu1, Shell Hu3, Daniela Massiceti2, Soheil Feizi1
1University of Maryland, College Park  2Microsoft Research, Cambridge  3Samsung Research, Cambridge
sbasu12@cs.umd.edu

Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase on a set of base classes. Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC. Fine-tuning ViTs, however, is expensive in time, compute and storage. This has motivated the design of parameter-efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters. While these methods have shown promise, inconsistencies in experimental conditions make it difficult to disentangle their advantage from other experimental factors, including the feature extractor architecture, pre-trained initialization and fine-tuning algorithm, amongst others. In our paper, we conduct a large-scale, experimentally consistent, empirical analysis to study PEFTs for few-shot image classification. Through a battery of over 1.8k controlled experiments on large-scale few-shot benchmarks including META-DATASET (MD) and ORBIT, we uncover novel insights on PEFTs that cast light on their efficacy in fine-tuning ViTs for few-shot classification. Through our controlled empirical study, we have two main findings: (i) fine-tuning just the LayerNorm parameters (which we call LN-TUNE) during few-shot adaptation is an extremely strong baseline across ViTs pre-trained with both self-supervised and supervised objectives; (ii) for self-supervised ViTs, simply learning a set of scaling parameters for each attention matrix (which we call ATTNSCALE), along with a domain-residual adapter (DRA) module, leads to state-of-the-art performance (while being 9x more parameter-efficient) on MD. Our empirical findings set strong baselines and call for rethinking the current design of PEFT methods for FSC.

1 Introduction

Few-shot classification (FSC) involves learning a new classification task given only a few labelled training examples from each of the novel classes. It has a large number of mainstream applications, such as drug discovery (Stanley et al. 2021), robotics (Ren et al. 2020) and personalized object recognition (Massiceti et al. 2021), among others. Usually, a given few-shot classification task consists of a few labelled examples from the new classes (support set) and a testing set of unlabeled held-out examples of those classes (query set).

Recent works (Hu et al. 2022; Li, Liu, and Bilen 2021; Xu et al. 2022) have shown that fine-tuning a large pre-trained Vision Transformer (ViT) on the support set of new test tasks achieves state-of-the-art performance on large-scale few-shot classification benchmarks such as META-DATASET (MD). Because of their high number of parameters, however, fine-tuning ViTs is extremely expensive in terms of storage, compute, and time.
This limits the ability to learn new downstream tasks in real-world applications where resources are constrained (e.g., personalization on edge or mobile devices) since (i) storing the task's fine-tuned parameters on the edge may be infeasible, especially for a large number of downstream tasks, and (ii) fine-tuning on each new task takes a long time. As a result, much recent progress has been made in designing light-weight, fast and parameter-efficient fine-tuning (PEFT) methods (Xu et al. 2022; Jia et al. 2022). These reduce the computational requirements to adapt a ViT to a new test task by fine-tuning only a fraction of the ViT's total parameters.

However, inconsistencies in experimental setups make it difficult to disentangle the benefit of PEFT methods from other experimental factors, including pre-training initialization, feature extractor architecture, fine-tuning algorithm, downstream dataset and other hyperparameters. Prompt-tuning (Jia et al. 2022), for example, is the state-of-the-art PEFT method on the transfer learning benchmark VTAB (Zhai et al. 2019), while eTT (Xu et al. 2022) performs strongly on few-shot classification in MD. Both, however, use distinct feature extractors, pre-training initializations, fine-tuning algorithms, and hyperparameters, thus limiting our understanding of the generalizability of these PEFT methods across different setups.

To address this, we perform a large-scale empirical analysis of top-performing PEFT methods on two large-scale few-shot image classification benchmarks, META-DATASET (Triantafillou et al. 2019) and ORBIT (Massiceti et al. 2021). Our experimentation involves 1.8k fine-tuning experiments which quantify the performance of PEFT methods under experimentally controlled settings including ViT architectures, pre-training objectives, and fine-tuning algorithms. This enables us to compare PEFT methods in a fair and consistent way and also draw out novel insights on the interaction between these different components in the fine-tuning pipeline.

Figure 1: ATTNSCALE leads to SoTA performance on MD with self-supervised ViTs and LN-TUNE leads to SoTA performance for supervised ViTs. Pareto plot comparing the average MD accuracy with the model parameters updated during few-shot adaptation: (a) averaged across self-supervised ViT-S/16 and ViT-B/16 (DINO); (b) averaged across supervised ViT-S/16 (DeiT), ViT-B/16 (DeiT) and ViT-B/16 (ImageNet-21k). We find that the recently proposed eTT (Xu et al. 2022) does not generalize well to supervised objectives, and that two simple but strong baselines, LN-TUNE and ATTNSCALE, outperform existing PEFT methods.

Our main finding is that the embarrassingly simple approach of fine-tuning just a ViT's LayerNorm parameters (only 0.08% of total parameters) on a new test task leads to better performance than full model fine-tuning and other PEFT methods on MD and ORBIT. We call this baseline LN-TUNE. We also find that the recently proposed eTT (Xu et al. 2022), primarily designed for self-supervised ViTs, lags behind some of the PEFT methods which we evaluate in our empirical study. In light of this, we propose a new strong baseline called ATTNSCALE which leads to improved few-shot performance over eTT and other PEFT methods for self-supervised ViTs.
In particular, ATTNSCALE learns only a scaling parameter for each entry in the attention matrices along with a domain-residual module during few-shot adaptation, making it 9x more parameter-efficient than eTT. Importantly, ATTNSCALE is extremely simple to implement, requires fewer than 6 lines of code, and can be easily integrated with any ViT architecture.

These approaches establish two new, strong PEFT baselines for few-shot classification. Our empirical study, however, also reveals several interesting insights: (i) none of the carefully designed existing PEFT methods shows consistent performance rankings across different pre-training methods (Sec 6.1); (ii) for different degrees of domain shift, distinct PEFT methods are preferred, highlighting the need to surgically design PEFT methods for different domain shifts (Sec 6.3); (iii) dropping PEFT methods from earlier layers in the ViT for large domain shifts (e.g., Omniglot, Quickdraw, Traffic-Sign) is detrimental to few-shot performance (Sec 6.4).

In summary, our contributions are as follows:
- A large-scale, experimentally consistent, empirical analysis of a wide range of PEFT methods for few-shot classification on two challenging large-scale benchmarks, META-DATASET and ORBIT.
- An embarrassingly simple PEFT baseline, LN-TUNE, which fine-tunes less than 0.08% of a ViT's parameters and outperforms all existing PEFT methods on MD amongst supervised ViTs.
- An easy-to-implement method, ATTNSCALE, which sets a new state-of-the-art on MD amongst self-supervised ViTs while fine-tuning <1.2% of the ViT's parameters.

Our findings highlight that there is no one-size-fits-all PEFT method and that simple parameter-efficient fine-tuning baselines should not be overlooked.

2 Related Works

ViTs in few-shot classification. CNNs have primarily been used as the feature extractor backbone in few-shot classification methods (Finn, Abbeel, and Levine 2017; Snell, Swersky, and Zemel 2017; Chen et al. 2020; Hospedales et al. 2020; Vinyals et al. 2016); however, ViTs have recently replaced them as the state-of-the-art (Hu et al. 2022) in challenging few-shot classification benchmarks like META-DATASET. In these methods, the ViT is typically pre-trained with a self-supervised (or meta-learning) objective on a large dataset and then fine-tuned on new test tasks.

PEFT methods for few-shot classification. Parameter-efficient fine-tuning methods have been extensively studied in Transformers for NLP tasks, with adapters (Houlsby et al. 2019), LoRA (Hu et al. 2021), prefix-tuning (Li and Liang 2021) and prompt-tuning (Lester, Al-Rfou, and Constant 2021) serving as strong alternatives to fine-tuning all the Transformer's parameters. PEFTs have also been explored in Vision Transformers for computer vision tasks, with methods like visual prompt tuning (Jia et al. 2022) for transfer learning, which works by tuning prefixes attached to the input, and eTT (Xu et al. 2022), which tunes prefixes attached to the key and value matrices in the self-attention layers. (Xu et al. 2022) show that eTT results in performance close to full model tuning for ViTs pre-trained with DINO, using only 9% of the total model parameters on MD.

3 Few-Shot Classification Preliminaries

In few-shot classification, the goal is to adapt a classifier to a new task at test time using a small number of training examples of each new class.
In fine-tuning-based approaches, this adaptation is done by fine-tuning the model on the training examples, before then evaluating it on a held-out set of test examples.

Formally, given a pre-trained feature extractor $f_{\theta}$, a few-shot task is sampled from a test dataset $D$. The task is composed of a support set $S$ (of training examples) and a query set $Q$ (of held-out test examples). Generally, $N$ unique classes are first sampled from the underlying dataset $D$. For each class $j \in [1, N]$, $k^j_s$ examples are sampled for the support set $S$ and $k^j_q$ examples are sampled for the query set $Q$. If $k^j_s = k$ is fixed for all $j \in [1, N]$ classes, then the task is known as an N-way, k-shot task.

When given a new test task, the objective is to fine-tune the underlying feature extractor $f_{\theta}$ or the parameter-efficient module $p_{\phi}$ on the task's support set $S$ using a fine-tuning algorithm $\mathcal{F}$. In parameter-efficient fine-tuning approaches, $f_{\theta}$ is frozen and only the parameters in $p_{\phi}$ are fine-tuned. More specifically, we can formalize the fine-tuning procedure as follows:

$$\phi^{*} = \arg\min_{\phi}\; \ell\big(f_{\theta}, p_{\phi}, \mathcal{F}(S)\big). \qquad (1)$$

Inference on the query examples depends on the fine-tuning algorithm $\mathcal{F}$ (see Sec 4 for details). We follow the variable-way, variable-shot sampling protocol from (Triantafillou et al. 2019), where $k^j_s$, $k^j_q$ and $N$ vary for each sampled few-shot task. This setting generates class-imbalanced few-shot tasks, which are challenging as the model needs to handle tasks of varying sizes.

4 Large-Scale Empirical Study Design

PEFT methods have been widely used to make few-shot adaptation more computationally efficient (Jia et al. 2022; Xu et al. 2022; Shysheya et al. 2022); however, inconsistencies in experimental setups make it difficult to disentangle the gain from PEFT methods versus other experimental factors. To address this, we conduct a wide-scale, experimentally controlled study of over 1.8k experiments. We control for the pre-trained model (including pre-training objective and architecture), PEFT module type, position of the PEFT module, fine-tuning algorithm, learning hyperparameters and downstream dataset. Below we provide details of each of these components.

Pre-trained models. For pre-training objectives we consider the self-supervised objective DINO (Caron et al. 2021) and the supervised objective DeiT (Touvron et al. 2020). For architectures, we consider ViT-S/16 and ViT-B/16 (Touvron et al. 2020). These architectures are pre-trained using the given objectives on ImageNet-1k. In addition, we also consider ViT-B/16 pre-trained on the large-scale ImageNet-21k. These objectives and architectures were chosen as they lead in downstream few-shot performance (Hu et al. 2022) on MD. More details on pre-training are included in the Appendix.

PEFT methods. We consider the following 7 existing methods for parameter-efficient fine-tuning: adapters (Houlsby et al. 2019), LoRA (Hu et al. 2021), shallow prompt-tuning and deep prompt-tuning (Jia et al. 2022), eTT (Xu et al. 2022), ladder tuning (Sung, Cho, and Bansal 2022), and bias tuning (Zaken, Ravfogel, and Goldberg 2021). We also compare to full model fine-tuning (Hu et al. 2022) and our two strong baselines: fine-tuning only the ViT's LayerNorm parameters (LN-TUNE), and learning a simple scaling factor for the elements in the attention matrices (ATTNSCALE) (see Sec 5.2). Of the existing methods, adapters and LoRA have been extensively used for fine-tuning Transformers in few-shot NLP tasks. Ladder tuning is a more recent memory-efficient as well as parameter-efficient fine-tuning method for language models like T5 (Raffel et al. 2019); it is memory-efficient as it avoids back-propagation through the entire feature-extractor backbone. Shallow and deep prompt tuning are adaptations of (Lester, Al-Rfou, and Constant 2021) for transfer learning in vision. eTT (Xu et al. 2022) fine-tunes only the prefixes attached to the key and value matrices in a ViT's self-attention layers, and is the only method to have been tested on the large-scale META-DATASET benchmark. Note, we omit the prototype regularization used in eTT to ensure fair comparison to other PEFT methods where prototype regularization is not used. We provide further information for each of these methods in the Appendix.
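To make the notion of a parameter-efficient module $p_{\phi}$ attached to a frozen $f_{\theta}$ concrete, the sketch below shows a minimal LoRA-style linear layer in PyTorch. This is our own illustration rather than any implementation used in the paper; the class name, rank and initialization are assumptions made for the example.

```python
# Minimal sketch (not the authors' code): a LoRA-style module as one example of a
# parameter-efficient module p_phi wrapped around a frozen projection of f_theta.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update (illustrative)."""
    def __init__(self, frozen_linear: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():          # f_theta stays frozen
            p.requires_grad = False
        d_in, d_out = frozen_linear.in_features, frozen_linear.out_features
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)   # low-rank factors = p_phi
        self.B = nn.Parameter(torch.zeros(rank, d_out))         # zero-init keeps the start point
        self.scale = alpha / rank

    def forward(self, x):
        return self.frozen(x) + self.scale * (x @ self.A @ self.B)

# Example usage: wrap one projection of a ViT-S/16-like model and fine-tune only A and B.
layer = LoRALinear(nn.Linear(384, 384), rank=4)
out = layer(torch.randn(8, 197, 384))               # (batch, tokens, embedding dim)
```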
Position of PEFT methods. We consider two configurations in which the PEFTs are inserted in the ViT: (i) we insert PEFT modules in each of the layers, including the final one; (ii) we insert a PEFT module in the final layer and in one of the layers between the first and the final layer, leading to two layers in total. For (ii), each fine-tuning experiment is repeated 12 times (see Sec 6.4 for analyses).

Fine-tuning algorithms. We consider 3 fine-tuning algorithms given a new test task: (i) LINEAR: we attach a linear classification layer after the final layer of the ViT and fine-tune both the PEFT's and this layer's parameters using a cross-entropy loss. (ii) PROTOAUG: following the state-of-the-art fine-tuning approach in (Hu et al. 2022), we use the examples from the task's support set to initialize class prototypes, similar to ProtoNets (Snell, Swersky, and Zemel 2017), and then use a query set, which is an augmented version of the support set, to fine-tune the ViT. In particular, we apply color-jitter and translation augmentations on the support set to generate the query set. (iii) PROTONCC: following (Li, Liu, and Bilen 2021; Xu et al. 2022), we do not apply augmentations to generate the query set and instead treat the query set as a copy of the support set, fine-tuning the ViT in a similar way to PROTOAUG.
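The following is a minimal sketch of this prototype-based fine-tuning loop (PROTOAUG when `augment` applies color-jitter and translation, PROTONCC when it is the identity). It is our illustration, not the benchmark code; the function names, the `backbone` and `augment` interfaces, and the default step count and learning rate are assumptions that merely mirror the setup described in this section.

```python
# Minimal sketch (our illustration) of PROTOAUG / PROTONCC-style episodic fine-tuning.
# `backbone` is a frozen ViT whose PEFT parameters have requires_grad=True;
# `peft_params` is the list of those trainable parameters.
import torch
import torch.nn.functional as F

def prototype_loss(backbone, support_x, support_y, query_x, query_y, num_classes):
    z_s = backbone(support_x)                              # support embeddings
    z_q = backbone(query_x)                                # query embeddings
    # class prototypes = mean support embedding per class (as in ProtoNets)
    protos = torch.stack([z_s[support_y == c].mean(0) for c in range(num_classes)])
    logits = -torch.cdist(z_q, protos)                     # negative euclidean distance
    return F.cross_entropy(logits, query_y)

def finetune_episode(backbone, peft_params, support_x, support_y, num_classes,
                     augment, steps=40, lr=1e-3):
    opt = torch.optim.Adam(peft_params, lr=lr)
    for _ in range(steps):
        query_x = augment(support_x)                       # PROTOAUG: augmented copy of S
        loss = prototype_loss(backbone, support_x, support_y,
                              query_x, support_y, num_classes)
        opt.zero_grad()
        loss.backward()
        opt.step()
```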
Note, we remove the ilsvrc 2012 sub-dataset from META-DATASET as our Vi T models have been pre-trained on Image Net. ORBIT is a fewshot classification benchmark containing noisy, real-world videos of everyday objects across 17 test users. In accordance with (Triantafillou et al. 2019), we sample 600 fewshot tasks per sub-dataset in META-DATASET while for ORBIT, we sample 50 tasks per user. In total, each experimental analysis is performed on 6250 few-shot tasks. 5 Strong Baselines for Few-Shot Fine-tuning Our standardised large-scale empirical study led us to discover two embarrassingly simple but strong baselines for parameter-efficient few-shot fine-tuning: LN-TUNE and ATTNSCALE. Both of these methods perform better than full model fine-tuning and all other existing PEFT methods on MD at a fraction of the computational cost. Below we describe each of these strong baselines: 5.1 LN-TUNE LN-TUNE works by fine-tuning only the Vi T s Layer Norm parameters on a task s support set. Formally, for a given Vi T with L layers, the ith layer has two Layer Norm blocks one before its attention block and one before its MLP block. Given an input vector a Rd from the previous layer or block, the operation of the first block can defined as Layer Normi 1(a) = γi 1 (a µ)/σ +βi 1, and the operation of the second block as Layer Normi 2(a) = γi 2 (a µ)/σ +βi 2. Here {γi 1, βi 1, γi 2, βi 2} Rd are the only learnable parameters for the ith layer. For a given task, these parameters across all L layers are fine-tuned using the task s support set S. As a result, LN-TUNE is extremely light-weight when compared to the other PEFT methods. For e.g., a Vi T-S/16 has only 18.6k Layer Norm parameters, while a Vi T-B/16 has only 37k. Since Vi T-S/16 and Vi T-B/16 have 22M and 76M parameters, respectively, this accounts for less than 0.08% of the total parameters. 5.2 ATTNSCALE As a second strong baseline, we introduce ATTNSCALE, a modification to the recently proposed e TT (Xu et al. 2022). Here, we replace the attentive prefix tuning part in e TT with a learnable scaling parameter on each element in the attention matrices, which we tune along with e TT s DRA module, reducing the number of learnable parameters by 9x. Given a Vi T with L layers, nh attention heads and n tokens, the weight matrices in the ith layer s attention block for the jth head are defined as W ij q Rd de, W ij k Rd de and W ij v Rd de. Here d is the dimension of the token embeddings and de is the dimension of the tokens after the weight matrix projection. Qij Rn d, Kij Rn d, V ij Rn d are defined as the query, key and value tokens, respectively. The attention matrix in the ith layer for the jth head can be defined as: Aij = softmax((Qij W ij q )(Kij W ij k )T / p where Aij Rn n. ATTNSCALE applies a point-wise scaling factor to each element in the attention matrix before the softmax operation. These scaling factors are learned during fine-tuning on the task s support set S. In particular, The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Figure 3: Different attention heads encode similar attention maps in self-supervised Vi Ts (a) Vi T-S/16(DINO); (b) Vi T-S/16(Dei T). We compute the Pearson correlation between the attention scores of different heads: h i, i [1, nh]. Self-supervised Vi Ts encode attention across different heads more similarly than supervised Vi Ts. Correlation is averaged across examples from 100 tasks in MD. we define a learnable scaling tensor Aα Rn n L nh. 
5.2 ATTNSCALE

As a second strong baseline, we introduce ATTNSCALE, a modification of the recently proposed eTT (Xu et al. 2022). Here, we replace the attentive prefix tuning part of eTT with a learnable scaling parameter on each element in the attention matrices, which we tune along with eTT's DRA module, reducing the number of learnable parameters by 9x. Given a ViT with $L$ layers, $n_h$ attention heads and $n$ tokens, the weight matrices in the $i$-th layer's attention block for the $j$-th head are defined as $W^{ij}_q \in \mathbb{R}^{d \times d_e}$, $W^{ij}_k \in \mathbb{R}^{d \times d_e}$ and $W^{ij}_v \in \mathbb{R}^{d \times d_e}$. Here $d$ is the dimension of the token embeddings and $d_e$ is the dimension of the tokens after the weight matrix projection. $Q^{ij} \in \mathbb{R}^{n \times d}$, $K^{ij} \in \mathbb{R}^{n \times d}$ and $V^{ij} \in \mathbb{R}^{n \times d}$ are defined as the query, key and value tokens, respectively. The attention matrix in the $i$-th layer for the $j$-th head can be defined as:

$$A^{ij} = \mathrm{softmax}\Big( (Q^{ij} W^{ij}_q)(K^{ij} W^{ij}_k)^{T} / \sqrt{d_e} \Big),$$

where $A^{ij} \in \mathbb{R}^{n \times n}$. ATTNSCALE applies a point-wise scaling factor to each element in the attention matrix before the softmax operation. These scaling factors are learned during fine-tuning on the task's support set $S$. In particular, we define a learnable scaling tensor $A_{\alpha} \in \mathbb{R}^{n \times n \times L \times n_h}$. $A_{\alpha}$ can be reshaped as $\{A^i_{\alpha}\}_{i=1}^{L}$, where $A^i_{\alpha} \in \mathbb{R}^{n \times n \times n_h}$ is the scaling tensor for the $i$-th layer. For each attention head $j \in [1, n_h]$, the scaling matrix is defined as $A^{ij}_{\alpha} \in \mathbb{R}^{n \times n}$:

$$A^{ij} = \mathrm{softmax}\Big( A^{ij}_{\alpha} \odot (Q^{ij} W^{ij}_q)(K^{ij} W^{ij}_k)^{T} / \sqrt{d_e} \Big).$$

During few-shot adaptation, only $A^{ij}_{\alpha}$ is learned along with the parameters in the DRA module from eTT. Note, $\{W^{ij}_q, W^{ij}_k, W^{ij}_v\}$ are kept frozen for each $i$-th layer and $j$-th attention head. In principle, the scaling factor $A_{\alpha}$ replaces the attentive-prefix tuning (APT) module in eTT. This APT module uses 9% of the model parameters, whereas ATTNSCALE uses only 1.2% but still gives improved MD performance.

Figure 3: Different attention heads encode similar attention maps in self-supervised ViTs. (a) ViT-S/16 (DINO); (b) ViT-S/16 (DeiT). We compute the Pearson correlation between the attention scores of different heads $h_i$, $i \in [1, n_h]$. Self-supervised ViTs encode attention across different heads more similarly than supervised ViTs. Correlation is averaged across examples from 100 tasks in MD.

We also propose a light-weight extension of ATTNSCALE, called ATTNSCALELITE, which learns the same scaling parameters across all $n_h$ attention heads in a given layer, rather than different ones for each head. This is motivated by the observation that all $n_h$ attention heads in a layer have similar attention maps. We show this in Fig 3, where we plot the pairwise Pearson correlation (Benesty et al. 2009) between the attention values of different heads. For self-supervised ViTs, we see strong correlation values between different heads in a given layer, indicating that different heads encode similar kinds of attention maps. This is similar for supervised ViTs; however, the correlation values are slightly lower. Formally, for ATTNSCALELITE, we define the scaling parameter for the $i$-th layer as $A^i_{\alpha} \in \mathbb{R}^{n \times n}$ and set $A^{ij}_{\alpha} = A^i_{\alpha}$ for all $j \in [1, n_h]$. ATTNSCALELITE requires only 0.25% of the total parameters for ViT-S/16 and only 0.09% for ViT-B/16, which makes it an extremely light-weight module. In Sec 6, we provide fine-grained results on the efficacy of both ATTNSCALE and ATTNSCALELITE for downstream few-shot adaptation. We provide a PyTorch-like implementation in the Appendix.
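The appendix implementation is not reproduced here; the sketch below is our minimal reading of the core operation: an elementwise, learnable rescaling of the pre-softmax attention logits, with the pre-trained projections frozen. The DRA module and the block's output projection are omitted, and the class name and tensor shapes are assumptions for illustration (per-head scaling as in ATTNSCALE; ATTNSCALELITE would share one scaling matrix across heads).

```python
# Minimal sketch (our reading of ATTNSCALE, with assumed shapes) of an attention block
# whose pre-softmax logits are rescaled elementwise by a learnable tensor A_alpha.
import torch
import torch.nn as nn

class ScaledAttention(nn.Module):
    def __init__(self, frozen_qkv: nn.Linear, num_heads: int, num_tokens: int):
        super().__init__()
        self.qkv = frozen_qkv                        # frozen pre-trained qkv projection (d -> 3d)
        for p in self.qkv.parameters():
            p.requires_grad = False
        self.num_heads = num_heads
        head_dim = frozen_qkv.in_features // num_heads
        self.scale = head_dim ** -0.5
        # A_alpha^i: one n x n scaling matrix per head, initialized to ones (identity rescaling)
        self.attn_scale = nn.Parameter(torch.ones(num_heads, num_tokens, num_tokens))

    def forward(self, x):                            # x: (batch, n, d)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]             # each: (batch, heads, n, head_dim)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        attn = (self.attn_scale * logits).softmax(dim=-1)   # elementwise rescaling, then softmax
        return (attn @ v).transpose(1, 2).reshape(B, N, D)
```

During adaptation, only `attn_scale` (and, in the full method, the DRA parameters) would be passed to the optimizer.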
Figure 4: With PEFT methods, we find PROTOAUG to have the best performance on META-DATASET, while LINEAR performs the worst. MD accuracy averaged over all 10 PEFT methods with different fine-tuning algorithms.

6 Empirical Results on META-DATASET

We use our wide-scale empirical study to derive novel insights on PEFT methods for few-shot classification. In particular, we use our results on MD to answer the following key questions:
1. Do PEFT methods rank similarly across different pre-training architectures and learning objectives?
2. How does the fine-tuning algorithm influence the performance of a PEFT method?
3. Is the optimal PEFT method different for different data domains?
4. Can PEFT modules be dropped from certain positions in the feature extractor? This can lead to significant memory and storage savings during few-shot deployment.

These are critical factors when deploying a few-shot classifier in the wild. We also show that our two simple but strong baselines, LN-TUNE and ATTNSCALE, perform better than full fine-tuning and all top-performing PEFT methods.

6.1 Consistency Across Pre-Training Models

We analyse the influence of the pre-trained model by ranking the performance of different PEFT methods across the different pre-training objectives and architectures described in Sec 4. To isolate the role of the pre-trained model, for each run we keep all other variables constant, including the fine-tuning algorithm, position of the modules, and hyperparameters. We report the results using the PROTOAUG fine-tuning algorithm in Fig 2, and include results for PROTONCC and LINEAR in the Appendix.

Existing PEFT methods. In Fig 2-(a), we find that PEFT methods rank inconsistently, with no single best approach, across the different pre-trained models. In Fig 2-(b), we plot the Spearman correlation of the PEFT methods' rankings between different pre-trained models. We observe that the correlation values across all pairs of pre-trained models are not consistently high, suggesting that existing PEFT methods do not generalize similarly across different pre-trained architectures and objectives.

| PEFT | MSCOCO | Traffic-Sign | Omniglot | Aircraft | DTD | VGG-Flower | Quickdraw | Cu-birds | Fungi | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Full | 61.5 | 87.3 | 78.7 | 75.4 | 86.9 | 94.2 | 73.6 | 85.4 | 54.7 | 77.5 |
| Adapter | 55.8 | 52.2 | 54.7 | 60.0 | 83.8 | 94.6 | 60.5 | 84.8 | 55.9 | 66.8 |
| Bias | 63.4 | 90.4 | 80.4 | 77.5 | 84.7 | 95.1 | 74.3 | 85.6 | 58.9 | 78.8 |
| LoRA | 62.1 | 88.1 | 80.8 | 80.8 | 86.8 | 94.8 | 72.7 | 85.8 | 59.8 | 78.9 |
| Ladder | 55.7 | 52.2 | 54.7 | 60.01 | 83.8 | 94.6 | 60.5 | 84.8 | 55.9 | 67.0 |
| Prompt-Shallow | 52.7 | 58.9 | 61.8 | 62.9 | 83.0 | 94.2 | 66.0 | 83.4 | 55.5 | 68.7 |
| Prompt-Deep | 62.8 | 85.6 | 77.0 | 73.3 | 85.3 | 96.2 | 73.2 | 86.1 | 58.2 | 77.5 |
| eTT | 61.5 | 89.1 | 78.9 | 75.8 | 85.1 | 95.1 | 73.5 | 86.1 | 58.2 | 78.1 |
| LN-TUNE | 64.2 | 91.2 | 77.9 | 75.3 | 84.4 | 96.9 | 74.7 | 87.5 | 59.9 | 79.1 |
| ATTNSCALE | 61.9 | 91.4 | 80.9 | 78.8 | 85.8 | 95.9 | 74.4 | 86.7 | 59.01 | 79.4 |
| ATTNSCALELITE | 61.6 | 91.0 | 80.2 | 77.9 | 85.8 | 96.0 | 73.9 | 86.7 | 59.0 | 79.1 |

Table 1: Our strong baselines, LN-TUNE and ATTNSCALE, rank in the top 2 of all PEFT methods on the few-shot classification benchmark META-DATASET. Results shown for a ViT-S/16 (DINO), and exclude the ImageNet split.

We also find that adapters, ladder-tuning and shallow prompt-tuning all have sub-par performance on MD (a drop of around 10%) when compared to LoRA, bias-tuning, eTT and deep prompt-tuning. We also highlight that shallow prompt-tuning struggles with few-shot classification on MD despite performing competitively on natural transfer-learning tasks in VTAB (Jia et al. 2022). Deep prompt-tuning, which is the state-of-the-art PEFT module on VTAB, performs competitively on MD across all pre-trained models, but falls short of methods like eTT, LoRA, bias-tuning and full model tuning (see Fig 2). This result highlights that strongly performing PEFT methods for transfer learning do not generalize well to the challenging few-shot setting of MD. eTT (Xu et al. 2022) outperforms full model tuning for ViT-S/16 (DINO), but lags behind LoRA and bias-tuning. Overall, we find bias-tuning (Zaken, Ravfogel, and Goldberg 2021) to consistently rank amongst the top 4 across all the pre-training models, outperforming many of the more complex PEFT methods.

Our strong baselines. From Fig 2, we find that our strong baselines, LN-TUNE and ATTNSCALE, perform strongly across all the pre-trained models on MD. In particular, LN-TUNE consistently performs the best for supervised ViTs (pre-trained on ImageNet-1k and ImageNet-21k). We also highlight that for supervised ViTs, none of the PEFT methods except LN-TUNE reaches performance close to full fine-tuning. ATTNSCALE, which is around 9x more parameter-efficient than eTT, has the best few-shot performance for self-supervised ViTs. For self-supervised ViTs, LN-TUNE performs closely to ATTNSCALE and ranks in the top 2 methods.
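As a concrete illustration of the consistency check in Fig 2-(b), the snippet below computes the Spearman correlation between the accuracy-based rankings of a few PEFT methods under two pre-trained models. The DINO accuracies are taken from Table 1; the DeiT numbers are made-up placeholders used only to show the computation.

```python
# Minimal sketch of the ranking-consistency computation behind Fig 2-(b).
from scipy.stats import spearmanr

# Average MD accuracy per PEFT method.
acc_dino = {"adapter": 66.8, "bias": 78.8, "lora": 78.9, "ett": 78.1, "ln_tune": 79.1}  # from Table 1
acc_deit = {"adapter": 65.0, "bias": 74.2, "lora": 73.5, "ett": 71.0, "ln_tune": 75.8}  # placeholders

methods = sorted(acc_dino)
rho, _ = spearmanr([acc_dino[m] for m in methods], [acc_deit[m] for m in methods])
print(f"Spearman rank correlation between the two pre-trained models: {rho:.2f}")
```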
6.2 Effect of Fine-tuning Algorithm

We quantify the impact of 3 different algorithms for fine-tuning the parameters in PEFTs: LINEAR, PROTOAUG and PROTONCC. We find that PROTOAUG outperforms PROTONCC and strongly outperforms LINEAR across all pre-training objectives and PEFT methods, including full model tuning (Fig 4). In some cases, PROTOAUG and PROTONCC outperform LINEAR by as much as 20%. We also find that for self-supervised pre-training objectives like DINO (Caron et al. 2021), the gap between PROTOAUG and PROTONCC is 2.2%, whereas for supervised objectives like DeiT (Touvron et al. 2020) this gap is higher at 4.7% (for both ImageNet-1k and ImageNet-21k initializations). Since the only difference between PROTOAUG and PROTONCC is that the query set is an augmented version of the support set, this suggests that applying augmentations during few-shot (meta) fine-tuning is more effective with supervised than self-supervised objectives. We also note that when using full model fine-tuning, PROTOAUG outperforms PROTONCC by 5% for DINO and by 6.7% for DeiT objectives. This gap is higher than when used with other PEFT methods (see Table 3). This suggests that PROTOAUG's efficacy decreases when used in conjunction with PEFT methods.

6.3 Comparing Performance Across Domains

We leverage the distinct sub-datasets in MD to compare the performance of PEFT methods across domains. Since each sub-dataset has a different degree of domain shift from the pre-training dataset (ImageNet), we also evaluate the robustness of different PEFT methods to these shifts. In Table 1, we show these results with a ViT-S/16 pre-trained with DINO, and observe that none of the PEFT methods is consistently the best across domains. We show similar results for other pre-trained ViTs in the Appendix.

Existing PEFT methods. We observe that deep prompt-tuning is the best PEFT method for domains with smaller degrees of shift from ImageNet, such as Cu-Birds and VGG-Flower. It is second best on MS-COCO, which is also similar to ImageNet. We find, however, that it struggles for larger domain shifts such as Omniglot, Quickdraw and Traffic-Sign, with LoRA and bias-tuning showing stronger performance. Adapters and ladder-tuning similarly perform poorly on larger domain shifts and have the lowest average performance on MD generally.

Our strong baselines. We find that LN-TUNE in Table 1 outperforms all existing PEFT methods in 5 out of the 9 domains, with ATTNSCALE lagging behind it only slightly in these 5 domains. However, for domains with a larger shift (e.g., Omniglot, Traffic-Sign), ATTNSCALE performs better than LN-TUNE. Even for Quickdraw, where there is a significant shift, ATTNSCALE and LN-TUNE perform almost similarly. Overall on MD, ATTNSCALE ranks the best in terms of few-shot performance.

| Model | Full | Adapter | Bias | LoRA | Ladder | Prompt-D | Prompt-S | eTT | LN-TUNE | ATTNSCALE | ATTNSCALELITE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-S (DINO) | 63.1 | 62.6 | 67.1 | 66.4 | 62.7 | 65.7 | 51.8 | 65.6 | 67.8 | 67.2 | 66.9 |
| ViT-S (DeiT) | 66.6 | 66.8 | 66.4 | 67.6 | 66.9 | 66.7 | 63.4 | 68.4 | 68.8 | 67.1 | 66.2 |

Table 2: LN-TUNE results in the best performance on ORBIT, while ATTNSCALE is extremely competitive. Prompt-D: Prompt-Deep; Prompt-S: Prompt-Shallow.

| Method | PROTOAUG | PROTONCC | Performance Gap |
|---|---|---|---|
| Full Tuning (DINO) | 77.2 | 72.2 | 5.0% |
| All PEFTs (DINO) | 75.4 | 73.2 | 2.2% |
| Full Tuning (DeiT) | 78.1 | 71.38 | 6.7% |
| All PEFTs (DeiT) | 73.1 | 68.4 | 4.7% |

Table 3: The performance gap between PROTOAUG and PROTONCC is larger with full fine-tuning than when used with PEFT methods.

These results suggest that our two strong baselines can be used complementarily: when the domain shift from the pre-training dataset is high, ATTNSCALE is better suited, whereas when the domain shift is low, LN-TUNE is the stronger approach.
Our results highlight that current PEFT methods are not robust to varying degrees of domain shift, and call for rethinking current designs so that they are uniformly robust across domain shifts.

Performance of ATTNSCALELITE. We observe from Table 1 that ATTNSCALELITE performs similarly to LN-TUNE but slightly worse than ATTNSCALE (by around 0.5-0.7%) on larger domain shifts for a self-supervised ViT-S/16 (DINO). For smaller domain shifts, ATTNSCALELITE matches the performance of ATTNSCALE. For supervised ViTs, we find that ATTNSCALELITE lags behind ATTNSCALE by a larger margin of 1.2-1.8% for large domain shifts (see Appendix for results). The decreased effectiveness of ATTNSCALELITE for supervised ViTs can be attributed to the fact that different heads encode attention maps less similarly than in self-supervised ViTs; therefore, learning a separate set of scaling parameters for different heads is more beneficial for few-shot adaptation.

6.4 Can We Drop PEFTs from ViT Layers?

In Secs. 6.3 and 6.2, the PEFT modules are inserted in each of the 12 layers of the ViT. In this section, we use our strong baselines to examine whether dropping PEFT modules from the majority of layers impacts performance. Specifically, we insert a PEFT module in the final layer of the ViT and another in one other layer (between 1 and 11). We vary the position of the second PEFT module and observe its impact on performance (Fig 5).

Results. From Fig 5, we find that inserting the PEFT module into the later layers improves performance more than inserting it into the earlier layers for domains with a small degree of shift from ImageNet (e.g., MSCOCO, DTD, VGG-Flower, Cu-birds). However, for large domain shifts such as in Traffic-Sign, Quickdraw and Omniglot, we find that inserting LN-TUNE in the earlier layers is crucial. In particular, for these domains, we find that inserting LN-TUNE only in the later layers results in a drop of around 10% in accuracy.

Figure 5: Dropping LN-TUNE from earlier layers in the ViT for large domain shifts (e.g., Traffic-Sign, Quickdraw, Omniglot) leads to a large drop in accuracy.

7 Results on Tasks from ORBIT

From Table 2, we find that bias-tuning and eTT have the best performance amongst the existing PEFT methods for ViT-S/16 (DINO) and ViT-S/16 (DeiT), respectively. These results reinforce our previous finding that different PEFT methods may be suited to different pre-training objectives. Overall, we find that LN-TUNE results in the best few-shot performance for both self-supervised and supervised objectives across all PEFT methods. ATTNSCALE ranks in the top 2 for DINO; for DeiT, its performance drops slightly but still ranks within the top 4 PEFT methods.

8 Conclusion

In this paper, we perform a large-scale empirical study of a range of top-performing PEFT methods across large-scale benchmarks such as MD and ORBIT. Our main finding is that two embarrassingly simple approaches, LN-TUNE and ATTNSCALE, beat all the PEFTs we evaluated and set new state-of-the-art results on MD, while being easy to implement and significantly less complex and parameter-intensive. The scale of our empirical study also uncovers several novel empirical insights, including that there is no one-size-fits-all PEFT method across different pre-training architectures, objectives, and downstream domains. Together, our experimentally consistent suite of experiments and strong baselines supports the future study of PEFT approaches for few-shot classification, but calls for rethinking current practices in light of simple but effective baselines.
Acknowledgements

This project was supported in part by a grant from an NSF CAREER AWARD 1942230, ONR YIP award N00014-22-1-2271, ARO's Early Career Program Award 310902-00001, Meta grant 23010098, HR00112090132 (DARPA/RED), HR001119S0026 (DARPA/GARD), Army Grant No. W911NF2120076, NIST 60NANB20D134, the NSF award CCF2212458, an Amazon Research Award and an award from Capital One.

References

Benesty, J.; Chen, J.; Huang, Y.; and Cohen, I. 2009. Pearson correlation coefficient. In Noise Reduction in Speech Processing, 37-40. Springer.
Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; and Joulin, A. 2021. Emerging Properties in Self-Supervised Vision Transformers. CoRR, abs/2104.14294.
Chen, Y.; Wang, X.; Liu, Z.; Xu, H.; and Darrell, T. 2020. A New Meta-Baseline for Few-Shot Learning. CoRR, abs/2003.04390.
Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. CoRR, abs/1703.03400.
Hospedales, T. M.; Antoniou, A.; Micaelli, P.; and Storkey, A. J. 2020. Meta-Learning in Neural Networks: A Survey. CoRR, abs/2004.05439.
Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; de Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; and Gelly, S. 2019. Parameter-Efficient Transfer Learning for NLP.
Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; and Chen, W. 2021. LoRA: Low-Rank Adaptation of Large Language Models. CoRR, abs/2106.09685.
Hu, S. X.; Li, D.; Stühmer, J.; Kim, M.; and Hospedales, T. M. 2022. Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference.
Jia, M.; Tang, L.; Chen, B.-C.; Cardie, C.; Belongie, S.; Hariharan, B.; and Lim, S.-N. 2022. Visual Prompt Tuning.
Kingma, D. P.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization.
Lester, B.; Al-Rfou, R.; and Constant, N. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. CoRR, abs/2104.08691.
Li, W.-H.; Liu, X.; and Bilen, H. 2021. Cross-domain Few-shot Learning with Task-specific Adapters.
Li, X. L.; and Liang, P. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. CoRR, abs/2101.00190.
Massiceti, D.; Theodorou, L.; Zintgraf, L.; Harris, M. T.; Stumpf, S.; Morrison, C.; Cutrell, E.; and Hofmann, K. 2021. ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition Collected from People Who Are Blind or Low Vision.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. CoRR, abs/1910.10683.
Ren, M.; Iuzzolino, M. L.; Mozer, M. C.; and Zemel, R. S. 2020. Wandering Within a World: Online Contextualized Few-Shot Learning. CoRR, abs/2007.04546.
Shysheya, A.; Bronskill, J.; Patacchiola, M.; Nowozin, S.; and Turner, R. E. 2022. FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification.
Snell, J.; Swersky, K.; and Zemel, R. S. 2017. Prototypical Networks for Few-shot Learning.
Stanley, M.; Bronskill, J. F.; Maziarz, K.; Misztela, H.; Lanini, J.; Segler, M.; Schneider, N.; and Brockschmidt, M. 2021. FS-Mol: A Few-Shot Learning Dataset of Molecules. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
Sung, Y.-L.; Cho, J.; and Bansal, M. 2022. LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning.
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; and Jégou, H. 2020. Training Data-Efficient Image Transformers & Distillation through Attention. CoRR, abs/2012.12877.
Triantafillou, E.; Zhu, T.; Dumoulin, V.; Lamblin, P.; Xu, K.; Goroshin, R.; Gelada, C.; Swersky, K.; Manzagol, P.; and Larochelle, H. 2019. Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples. CoRR, abs/1903.03096.
Vinyals, O.; Blundell, C.; Lillicrap, T. P.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching Networks for One Shot Learning. CoRR, abs/1606.04080.
Xu, C.; Yang, S.; Wang, Y.; Wang, Z.; Fu, Y.; and Xue, X. 2022. Exploring Efficient Few-shot Adaptation for Vision Transformers. Transactions on Machine Learning Research.
Zaken, E. B.; Ravfogel, S.; and Goldberg, Y. 2021. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. CoRR, abs/2106.10199.
Zhai, X.; Puigcerver, J.; Kolesnikov, A.; Ruyssen, P.; Riquelme, C.; Lucic, M.; Djolonga, J.; Pinto, A. S.; Neumann, M.; Dosovitskiy, A.; Beyer, L.; Bachem, O.; Tschannen, M.; Michalski, M.; Bousquet, O.; Gelly, S.; and Houlsby, N. 2019. The Visual Task Adaptation Benchmark. CoRR, abs/1910.04867.