# A Closer Look at Few-shot Classification Again

Xu Luo\*¹, Hao Wu\*¹, Ji Zhang¹, Lianli Gao¹, Jing Xu², Jingkuan Song¹

\*Equal contribution. ¹University of Electronic Science and Technology of China. ²Harbin Institute of Technology, Shenzhen. Correspondence to: Jingkuan Song. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

Few-shot classification consists of a training phase, where a model is learned on a relatively large dataset, and an adaptation phase, where the learned model is adapted to previously unseen tasks with limited labeled samples. In this paper, we empirically show that the training algorithm and the adaptation algorithm can be completely disentangled, which allows algorithm analysis and design to be done individually for each phase. Our meta-analysis of each phase reveals several interesting insights that may help better understand key aspects of few-shot classification and its connections with other fields such as visual representation learning and transfer learning. We hope the insights and research challenges revealed in this paper can inspire future work in related directions. Code and pre-trained models (in PyTorch) are available at https://github.com/Frankluox/CloserLookAgainFewShot.

## 1. Introduction

During the last decade, deep learning approaches have made remarkable progress on large-scale image classification problems (Krizhevsky et al., 2012; He et al., 2016). Since there are infinitely many categories in the real world that cannot all be learned at once, a natural desire following this success is to equip models with the ability to efficiently learn new visual concepts. This demand gives rise to few-shot classification (Fei-Fei et al., 2006; Vinyals et al., 2016): the problem of learning a model capable of adapting to new classification tasks given only a few labeled samples. The problem naturally breaks into two phases: a training phase for learning an adaptable model, and an adaptation phase for adapting the model to new tasks. To make quick adaptation possible, it is natural to think that the design of the training algorithm should prepare for the algorithm used for adaptation. For this reason, pioneering works (Vinyals et al., 2016; Finn et al., 2017; Ravi & Larochelle, 2017) formalize the problem within the meta-learning framework, where the training algorithm directly aims at optimizing the adaptation algorithm in a learning-to-learn fashion. Attracted by meta-learning's elegant formalization and its properties well suited to few-shot learning, many subsequent works designed different meta-learning mechanisms to solve few-shot classification problems.

It was therefore a surprise to find that a simple transfer learning baseline, which learns a supervised model on the training set and adapts it with a simple adaptation algorithm (e.g., logistic regression), performs better than all meta-learning methods (Chen et al., 2019; Tian et al., 2020; Rizve et al., 2021). Since plain supervised training is not designed specifically for few-shot classification, this observation reveals that the training algorithm can be designed without considering the choice of adaptation algorithm while still achieving satisfactory performance. In this work, we take a step further and ask the following question: *Are training and adaptation algorithms completely uncorrelated in few-shot classification?*
Here, "completely uncorrelated" means that the performance ranking of any set of adaptation algorithms is not affected by the choice of training algorithm, and vice versa. If this is true, the problem of finding the best combination of training and adaptation algorithms reduces to optimizing the training and adaptation algorithms individually, which may largely ease the algorithm design process in the future. We give an affirmative answer to this question by conducting a systematic study of a variety of training and adaptation algorithms used in few-shot classification.

This uncorrelated property also offers an opportunity to analyze the algorithms of one phase independently, by fixing the algorithm of the other phase. We conduct such an analysis in Section 4 for training algorithms and in Section 5 for adaptation algorithms. By varying important factors (dataset scale and model architecture for the training phase; the number of shots and ways and the data distribution for the adaptation phase), we obtain several interesting observations that lead to a deeper understanding of few-shot classification and reveal critical relations to the visual representation learning and transfer learning literature. Such meta-level understanding can be useful for future few-shot learning research. The analysis of each phase leads to the following key observations:

1. We observed a different neural scaling law in few-shot classification: test error falls off as a power law with the number of training classes, rather than with the number of training samples per class (see the fitting sketch after this list). This observation highlights the importance of the number of training classes in few-shot classification and may help future research further understand the crucial differences between few-shot classification and other vision tasks.

2. We found two evaluated datasets on which increasing the scale of the training dataset does not always lead to better few-shot performance. This suggests that it is not realistic to train a model that solves all possible tasks well just by feeding it a very large amount of data. It also indicates the importance of properly filtering training knowledge for different few-shot classification tasks.

3. We found that standard ImageNet performance is not a good predictor of few-shot performance for supervised models (contrary to previous observations in other vision tasks), but it does predict few-shot performance well for self-supervised models. This observation may become key to understanding both the difference between few-shot classification and other vision tasks, and the difference between supervised and self-supervised learning.

4. We found that, contrary to the common belief that fine-tuning the whole network with few samples leads to severe overfitting, vanilla fine-tuning performs best among all adaptation algorithms even when data is extremely scarce, e.g., in 5-way 1-shot tasks. In particular, partial fine-tuning methods designed to overcome the overfitting problem of vanilla fine-tuning in the few-shot setting perform worse. The advantage of fine-tuning grows with the number of ways and shots and with the degree of task distribution shift; however, fine-tuning methods suffer from extremely high time complexity. We show that differences in these factors explain why state-of-the-art methods on different few-shot classification benchmarks differ in their adaptation algorithms.
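Observation 1 describes a scaling law of the form $\mathrm{err} \approx a \cdot N_C^{-b}$ in the number of training classes $N_C$. As a minimal sketch of how such a law can be fit, a least-squares line in log-log space suffices; the arrays below are illustrative placeholders, not measurements from the paper.

```python
import numpy as np

# Hypothetical measurements: few-shot test error of models trained with an
# increasing number of training classes N_C (values are illustrative only).
num_classes = np.array([16, 32, 64, 128, 256, 512])
test_error = np.array([0.42, 0.35, 0.29, 0.24, 0.20, 0.17])

# A power law err ~ a * N_C^(-b) is linear in log-log space:
#   log(err) = log(a) - b * log(N_C),
# so an ordinary least-squares line fit recovers the exponent b.
slope, intercept = np.polyfit(np.log(num_classes), np.log(test_error), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted power law: err ~ {a:.3f} * N_C^(-{b:.3f})")
```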
## 2. The Problem of Few-shot Classification

Few-shot classification aims to learn a model that can quickly adapt to a novel classification task given only a few observations. In the training phase, given a training dataset $D_{\text{train}} = \{(x_n, y_n)\}_{n=1}^{|D_{\text{train}}|}$ with $N_C$ classes, where $x_i \in \mathbb{R}^D$ is the $i$-th image and $y_i \in [N_C]$ is its label, a model $f_\theta$ is learned via a training algorithm $A^{\text{train}}$, i.e., $A^{\text{train}}(D_{\text{train}}) = f_\theta$. In the adaptation phase, a series of few-shot classification tasks $\mathcal{T} = \{\tau_i\}_{i=1}^{N_T}$ is constructed from a test dataset $D_{\text{test}}$ whose classes and domains may differ from those of $D_{\text{train}}$. Each task $\tau$ consists of a support set $S = \{(x_i, y_i)\}_{i=1}^{N_S}$ used for adaptation and a query set $Q = \{(x_i', y_i')\}_{i=1}^{N_Q}$ that is used for evaluation and shares the same label space with $S$. A task $\tau$ is called an $N$-way $K$-shot task if the support set $S$ contains $N$ classes, each with exactly $K$ samples. To solve a task $\tau$, the adaptation algorithm $A^{\text{adapt}}$ takes the learned model $f_\theta$ and the support set $S$ as inputs and produces a new classifier $g(\cdot\,; f_\theta, S): \mathbb{R}^D \to [N]$. The constructed classifier is then evaluated on the query set $Q$ to test its generalization ability. The evaluation metric is the average performance over all sampled tasks. We denote the resulting average accuracy and the radius of the 95% confidence interval as functions of the training and adaptation algorithms: $\mathrm{Avg}(A^{\text{train}}, A^{\text{adapt}})$ and $\mathrm{CI}(A^{\text{train}}, A^{\text{adapt}})$, respectively.

Depending on the form of the training algorithm $A^{\text{train}}$, the model $f_\theta$ can take different forms. For non-meta-learning methods, $f_\theta: \mathbb{R}^D \to \mathbb{R}^d$ is simply a feature extractor that takes an image $x \in \mathbb{R}^D$ as input and outputs a feature vector $z \in \mathbb{R}^d$; thus any visual representation learning algorithm can be used as $A^{\text{train}}$. For meta-learning methods, the training algorithm directly aims at optimizing the performance of the adaptation algorithm $A^{\text{adapt}}$ in a learning-to-learn fashion. Specifically, meta-learning methods first parameterize the adaptation algorithm as $A^{\text{adapt}}_\theta$ so that it becomes optimizable; the model used for training is then set equal to $A^{\text{adapt}}_\theta$, i.e., $A^{\text{train}}(D_{\text{train}}) = f_\theta = A^{\text{adapt}}_\theta$. Training proceeds by constructing pseudo few-shot classification tasks $\mathcal{T}^{\text{train}} = \{(S^{\text{train}}_t, Q^{\text{train}}_t)\}_{t=1}^{N_T^{\text{train}}}$ from $D_{\text{train}}$ that take the same form as the tasks encountered during adaptation. In each iteration $t$, just as in the adaptation phase, the model $f_\theta$ takes $S^{\text{train}}_t$ as input and outputs a classifier $g(\cdot\,; S^{\text{train}}_t)$. Images in $Q^{\text{train}}_t$ are then fed into $g(\cdot\,; S^{\text{train}}_t)$, yielding a loss that is used to update $f_\theta$. After training, $f_\theta$ is used directly as the adaptation algorithm $A^{\text{adapt}}_\theta$. Although they differ from non-meta-learning methods, most meta-learning algorithms still set the learnable parameters $\theta$ to be the parameters of a feature extractor, which makes it possible to change the algorithm used for adaptation.
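To make the adaptation protocol concrete, here is a minimal PyTorch sketch that solves one task with a nearest-centroid classifier (NCC, one of the adaptation algorithms in Table 1) on top of a frozen feature extractor. The function and argument names are ours for illustration, not the paper's codebase.

```python
import torch
import torch.nn.functional as F

def ncc_solve_task(f_theta, support_x, support_y, query_x, query_y, n_way):
    """Solve one N-way K-shot task tau = (S, Q) with a nearest-centroid
    classifier built on frozen features from f_theta."""
    with torch.no_grad():
        z_s = F.normalize(f_theta(support_x), dim=-1)  # support features
        z_q = F.normalize(f_theta(query_x), dim=-1)    # query features

    # g(.; f_theta, S): one centroid per class, from support features only.
    centroids = torch.stack([z_s[support_y == c].mean(dim=0)
                             for c in range(n_way)])

    # Classify each query image by cosine similarity to the centroids,
    # then report query accuracy for this task.
    logits = z_q @ F.normalize(centroids, dim=-1).t()
    return (logits.argmax(dim=-1) == query_y).float().mean().item()
```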
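For meta-learning methods, the episodic loop described above can be sketched in the same style. The sketch below assumes a Prototypical-Network-style parameterization, where the classifier built from $S^{\text{train}}_t$ is differentiable and the query loss is backpropagated through $f_\theta$; again, all names are illustrative.

```python
import torch
import torch.nn.functional as F

def meta_train_step(f_theta, optimizer, support_x, support_y,
                    query_x, query_y, n_way):
    """One episodic update on a pseudo task (S_t, Q_t): build g(.; S_t)
    from support features, then update f_theta with the query loss."""
    z_s = f_theta(support_x)   # support features, gradients enabled
    z_q = f_theta(query_x)     # query features, gradients enabled
    centroids = torch.stack([z_s[support_y == c].mean(dim=0)
                             for c in range(n_way)])

    # Negative squared Euclidean distances to the centroids act as logits.
    logits = -torch.cdist(z_q, centroids) ** 2
    loss = F.cross_entropy(logits, query_y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```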
## 3. Are Training and Adaptation Algorithms Uncorrelated?

Given a set of training algorithms $M^{\text{train}} = \{A^{\text{train}}_i\}_{i=1}^{m_1}$ and a set of adaptation algorithms $M^{\text{adapt}} = \{A^{\text{adapt}}_i\}_{i=1}^{m_2}$, we say that $M^{\text{train}}$ and $M^{\text{adapt}}$ are uncorrelated if changing the algorithm from $M^{\text{train}}$ does not influence the performance ranking of algorithms from $M^{\text{adapt}}$, and vice versa. To give a precise description, we first define a partial order.

**Definition 3.1.** We say two training algorithms $A^{\text{train}}_a, A^{\text{train}}_b$ have the partial order $A^{\text{train}}_a \preceq A^{\text{train}}_b$ if, for all $i \in [m_2]$,

$$\mathrm{Avg}(A^{\text{train}}_a, A^{\text{adapt}}_i) - \mathrm{CI}(A^{\text{train}}_a, A^{\text{adapt}}_i) \le \mathrm{Avg}(A^{\text{train}}_b, A^{\text{adapt}}_i) + \mathrm{CI}(A^{\text{train}}_b, A^{\text{adapt}}_i).$$

Table 1. Few-shot classification performance of pairwise combinations of a variety of training and adaptation algorithms. All evaluation tasks are 5-way 5-shot tasks sampled from Meta-Dataset (excluding ImageNet). We sample 2000 tasks per dataset in Meta-Dataset and report the average accuracy over all datasets along with the 95% confidence interval. Algorithms are listed according to their partial order (Definition 3.2), from top to bottom and from left to right. \* denotes a training algorithm that uses transductive BN (Bronskill et al., 2020), which produces much higher, unfair performance when Finetune or TSA is used as the adaptation algorithm. TSA and eTT are both architecture-specific partial-fine-tuning algorithms: TSA can be used only with CNNs and eTT only with the original ViT, hence the shared TSA/eTT column. The last eight columns correspond to adaptation algorithms.

| Training algorithm | Training dataset | Architecture | MatchingNet | MetaOpt | NCC | LR | URL | CC | TSA/eTT | Finetune |
|---|---|---|---|---|---|---|---|---|---|---|
| PN | miniImageNet | Conv-4 | 48.54±0.4 | 49.84±0.4 | 51.38±0.4 | 51.65±0.4 | 51.82±0.4 | 51.56±0.4 | 58.08±0.4 | 60.88±0.4 |
| MAML | miniImageNet | Conv-4 | 53.71±0.4 | 53.69±0.4 | 55.01±0.4 | 55.03±0.4 | 55.66±0.4 | 55.63±0.4 | 62.80±0.4 | 64.87±0.4 |
| CE | miniImageNet | Conv-4 | 54.68±0.4 | 56.79±0.4 | 58.54±0.4 | 58.26±0.4 | 59.63±0.4 | 59.20±0.5 | 64.14±0.4 | 65.12±0.4 |
| MatchingNet | miniImageNet | ResNet-12 | 55.62±0.4 | 57.20±0.4 | 58.91±0.4 | 58.99±0.4 | 61.20±0.4 | 60.50±0.4 | 64.88±0.4 | 67.93±0.4 |
| MAML | miniImageNet | ResNet-12 | 58.42±0.4 | 58.52±0.4 | 59.65±0.4 | 60.04±0.4 | 60.38±0.4 | 60.50±0.4 | 71.15±0.4 | 73.13±0.4 |
| PN | miniImageNet | ResNet-12 | 60.19±0.4 | 61.70±0.4 | 63.71±0.4 | 64.46±0.4 | 65.64±0.4 | 65.76±0.4 | 70.44±0.4 | 74.23±0.4 |
| MetaOpt | miniImageNet | ResNet-12 | 62.06±0.4 | 63.94±0.4 | 65.81±0.4 | 66.03±0.4 | 67.47±0.4 | 67.24±0.4 | 72.07±0.4 | 74.96±0.4 |
| DeepEMD | miniImageNet | ResNet-12 | 62.67±0.4 | 64.15±0.4 | 66.14±0.4 | 66.14±0.4 | 68.66±0.4 | 69.76±0.4 | 74.21±0.4 | 74.83±0.4 |
| CE | miniImageNet | ResNet-12 | 63.27±0.4 | 64.91±0.4 | 66.96±0.4 | 67.14±0.4 | 69.78±0.4 | 69.52±0.4 | 74.30±0.4 | 74.89±0.4 |
| Meta-Baseline | miniImageNet | ResNet-12 | 63.25±0.4 | 65.02±0.4 | 67.28±0.4 | 67.56±0.4 | 69.84±0.4 | 69.76±0.4 | 73.94±0.4 | 75.04±0.4 |
| COS | miniImageNet | ResNet-12 | 63.99±0.4 | 66.09±0.4 | 68.31±0.4 | 69.26±0.4 | 70.71±0.4 | 71.03±0.4 | 75.10±0.4 | 75.68±0.4 |
| PN | ImageNet | ResNet-50 | 63.68±0.4 | 65.79±0.4 | 68.40±0.4 | 68.87±0.4 | 69.69±0.4 | 70.81±0.4 | 74.15±0.4 | 78.42±0.4 |
| S2M2 | miniImageNet | WRN-28-10 | 64.41±0.4 | 66.59±0.4 | 68.67±0.4 | 69.16±0.4 | 70.88±0.4 | 71.38±0.4 | 74.94±0.4 | 76.89±0.4 |
| FEAT | miniImageNet | ResNet-12 | 65.42±0.4 | 67.15±0.4 | 69.06±0.4 | 69.21±0.4 | 71.24±0.4 | 72.07±0.4 | 75.99±0.4 | 76.83±0.4 |
| IER | miniImageNet | ResNet-12 | 65.37±0.4 | 67.31±0.4 | 69.30±0.4 | 70.01±0.4 | 72.48±0.4 | 72.85±0.4 | 76.70±0.4 | 77.54±0.4 |
| MoCo v2 | ImageNet | ResNet-50 | 65.47±0.5 | 68.63±0.4 | 71.05±0.4 | 71.49±0.4 | 74.46±0.4 | 74.57±0.4 | 79.70±0.4 | 79.98±0.4 |
| Exemplar v2 | ImageNet | ResNet-50 | 67.70±0.5 | 70.07±0.4 | 72.55±0.4 | 72.93±0.4 | 75.26±0.4 | 76.83±0.4 | 80.22±0.4 | 81.75±0.4 |
| DINO | ImageNet | ResNet-50 | 73.97±0.4 | 76.45±0.4 | 78.30±0.4 | 78.72±0.4 | 80.73±0.4 | 81.05±0.4 | 83.64±0.4 | 83.20±0.4 |
| CE | ImageNet | ResNet-50 | 74.75±0.4 | 76.94±0.4 | 78.96±0.4 | 79.57±0.4 | 80.89±0.4 | 81.51±0.4 | 84.07±0.4 | 84.92±0.4 |
| BiT-S | ImageNet | ResNet-50 | 75.44±0.4 | 77.86±0.4 | 79.84±0.4 | 79.97±0.4 | 81.79±0.4 | 81.91±0.4 | 84.84±0.3 | 86.40±0.3 |
| CE | ImageNet | Swin-B | 75.17±0.4 | 77.81±0.4 | 80.06±0.4 | 81.04±0.4 | 82.55±0.4 | 82.46±0.4 | - | 88.16±0.3 |
| DeiT | ImageNet | ViT-B | 75.82±0.4 | 78.34±0.4 | 80.62±0.4 | 81.68±0.4 | 82.80±0.3 | 83.13±0.4 | 84.22±0.3 | 87.62±0.3 |
| CE | ImageNet | ViT-B | 76.78±0.4 | 78.81±0.4 | 80.65±0.4 | 81.13±0.3 | 82.69±0.3 | 82.77±0.3 | 85.60±0.3 | 88.48±0.3 |
| DINO | ImageNet | ViT-B | 76.44±0.4 | 79.11±0.4 | 81.23±0.4 | 82.01±0.4 | 84.16±0.3 | 84.44±0.3 | 86.25±0.3 | 88.04±0.3 |
| CLIP | WebImageText | ViT-B | 78.06±0.4 | 81.20±0.4 | 83.04±0.3 | 83.22±0.3 | 84.11±0.3 | 84.20±0.3 | 87.66±0.3 | 90.26±0.3 |
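In code, Definition 3.1 reduces to an interval comparison for every adaptation algorithm. The sketch below uses illustrative numbers rather than entries from Table 1.

```python
def partially_ordered(avg_a, ci_a, avg_b, ci_b):
    """Definition 3.1: A_a^train <= A_b^train iff, for every adaptation
    algorithm i, the lower confidence bound of algorithm a does not
    exceed the upper confidence bound of algorithm b."""
    return all(m_a - r_a <= m_b + r_b
               for m_a, r_a, m_b, r_b in zip(avg_a, ci_a, avg_b, ci_b))

# Illustrative accuracies (%) of two training algorithms under three
# adaptation algorithms, each with a 95% confidence radius.
print(partially_ordered([51.4, 51.8, 60.9], [0.4, 0.4, 0.4],
                        [55.0, 55.7, 64.9], [0.4, 0.4, 0.4]))  # True
```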